Distributed Storage

Distributed Hash Tables (DHTs)

Consistent Hashing

Maps both keys and nodes to positions on a hash ring. Each node is responsible for keys between it and its predecessor.

Hash Ring (0 to 2^m - 1):

          Node A (hash=10)
         /                 \
    Key k1(5)          Node B (hash=45)
       │                    │
       ▼                    │
  Assigned to A         Key k2(50)
  (first node CW)           │
       │                    ▼
  Node D (hash=80)     Assigned to C
         \                 /
          Node C (hash=70)

k1 → A (5 ≤ 10, first node clockwise)
k2 → C (45 < 50 ≤ 70, first node clockwise; B at 45 is already behind it)

Virtual nodes: each physical node maps to multiple positions on the ring, improving load balance.

Physical nodes: A, B, C
Virtual nodes:  A1, A2, A3, B1, B2, B3, C1, C2, C3

Ring: ... A2 ... B1 ... C3 ... A1 ... C2 ... B3 ...

Node removal: only keys from removed virtual nodes need reassignment
Node addition: only keys moving to new virtual nodes need migration
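The ring with virtual nodes can be sketched in Python (a minimal illustration, not any particular library's API; the `HashRing` class and its 4-byte MD5-derived positions are assumptions for the example):

```python
import bisect
import hashlib

def ring_position(label: str) -> int:
    # Stable position in [0, 2^32), derived from MD5 (illustrative choice)
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:4], "big")

class HashRing:
    def __init__(self, nodes, vnodes=8):
        self.vnodes = vnodes
        self._ring = []                      # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node owns `vnodes` positions on the ring
        for i in range(self.vnodes):
            bisect.insort(self._ring, (ring_position(f"{node}#{i}"), node))

    def remove(self, node):
        # Only the removed node's virtual positions disappear;
        # every other key keeps its current owner
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key):
        # Walk clockwise: first virtual node at or after the key's position
        pos = ring_position(key)
        i = bisect.bisect_left(self._ring, (pos, ""))
        return self._ring[i % len(self._ring)][1]
```

Note how `remove` demonstrates the minimal-movement property: keys owned by surviving nodes are untouched, only keys from the removed node's virtual positions get new owners.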

Chord

Structured DHT using finger tables for O(log n) lookups.

Each node maintains a finger table:
  finger[i] = successor(n + 2^i) for i = 0..m-1

Node 8's finger table (m=6, ring size 64):
  i=0: successor(9)  = Node 14
  i=1: successor(10) = Node 14
  i=2: successor(12) = Node 14
  i=3: successor(16) = Node 21
  i=4: successor(24) = Node 32
  i=5: successor(40) = Node 42

Lookup(key=30) at Node 8:
  8 → finger[4]=32 is past 30, try finger[3]=21
  21 → forward to closest preceding finger
  → ... → Node 32 (successor of 30)

  O(log n) hops to find any key
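The finger-table rule above is mechanical enough to compute directly. A sketch, assuming the five-node ring (8, 14, 21, 32, 42) implied by the example:

```python
def successor(node_ids, target, ring_size):
    # First node at or after target, wrapping around the ring
    ids = sorted(node_ids)
    t = target % ring_size
    for n in ids:
        if n >= t:
            return n
    return ids[0]          # wrapped past the highest node id

def finger_table(n, node_ids, m):
    # finger[i] = successor(n + 2^i) for i = 0..m-1
    ring_size = 2 ** m
    return [successor(node_ids, n + 2 ** i, ring_size) for i in range(m)]
```

Running `finger_table(8, [8, 14, 21, 32, 42], 6)` reproduces the table in the example: `[14, 14, 14, 21, 32, 42]`.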

Kademlia

Used in the BitTorrent DHT and Ethereum's node discovery. Uses an XOR distance metric.

Distance: d(x, y) = x XOR y

Properties:
  d(x, x) = 0          (identity)
  d(x, y) = d(y, x)    (symmetry)
  d(x, z) ≤ d(x, y) + d(y, z)  (triangle inequality)

Each node maintains k-buckets:
  Bucket i: nodes with distance 2^i ≤ d < 2^(i+1)

Lookup: iteratively query α closest-known nodes,
  converges to target in O(log n) steps.

Advantage: because XOR is symmetric, a node receives lookups from exactly the population of nodes it would itself route through, so incoming queries double as routing-table maintenance.
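The metric and the bucket rule translate directly to a few lines of Python (treating node IDs as plain integers; function names are illustrative):

```python
def xor_distance(x: int, y: int) -> int:
    # d(x, y) = x XOR y
    return x ^ y

def bucket_index(x: int, y: int) -> int:
    # Bucket i holds nodes with 2^i <= d < 2^(i+1), so i is the
    # position of the highest bit in which x and y differ
    d = x ^ y
    assert d > 0, "a node does not bucket itself"
    return d.bit_length() - 1
```

The metric properties from the text are easy to spot-check: identity, symmetry, and the triangle inequality all hold for XOR.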

Amazon Dynamo

Pioneering eventually consistent key-value store. Influenced Cassandra, Riak, DynamoDB.

Key Design Decisions:
┌─────────────────────┬────────────────────────────────┐
│ Problem             │ Dynamo Solution                │
├─────────────────────┼────────────────────────────────┤
│ Partitioning        │ Consistent hashing + vnodes    │
│ Replication         │ N replicas on preference list  │
│ Versioning          │ Vector clocks                  │
│ Conflict resolution │ Client-side reconciliation     │
│ Membership          │ Gossip protocol                │
│ Failure handling    │ Sloppy quorum + hinted handoff │
│ Anti-entropy        │ Merkle trees                   │
└─────────────────────┴────────────────────────────────┘

Merkle Trees for Anti-Entropy

Each node maintains a Merkle tree of its key range. Comparing trees identifies divergent ranges efficiently.

           Root Hash
          /          \
     H(0-50)      H(51-100)      ← compare root hashes first
     /     \       /       \
  H(0-25) H(26-50) H(51-75) H(76-100)  ← drill into differences
    ...     ...      ...       ...
  Keys     Keys    Keys      Keys

Sync: compare root → if different, compare children → recurse
Only transfer keys in differing leaf ranges.
O(log n) comparisons instead of O(n).

Apache Cassandra

Wide-column store inspired by Dynamo (partitioning) and BigTable (data model).

Architecture:
  ┌──────────────────────────────────────┐
  │          Cassandra Ring              │
  │   N1 ──── N2 ──── N3 ──── N4       │
  │    \                      /         │
  │     ──────── N5 ─────────           │
  │                                      │
  │  Partitioner: Murmur3Partitioner     │
  │  Replication: NetworkTopologyStrategy│
  │  Consistency: Tunable (ONE..ALL)    │
  └──────────────────────────────────────┘

Data Model:
  Keyspace → Tables → Rows → Columns

  Primary Key = (Partition Key, Clustering Columns)
  Partition Key → determines node placement
  Clustering Columns → sort order within partition

Write path: Commit log (append) -> Memtable (in-memory) -> SSTable (disk flush)

Read path: Row cache -> Memtable -> Bloom filter (skip irrelevant SSTables) -> Partition index -> SSTable (results merged, newest timestamp wins)
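The write path is a textbook LSM tree, which a few lines of Python can sketch (a toy model, not Cassandra's implementation; the `LSMStore` class and its tiny flush threshold are assumptions for the example):

```python
class LSMStore:
    """Toy LSM write path: append to a commit log for durability,
    buffer in a memtable, flush to immutable sorted SSTables."""
    def __init__(self, memtable_limit=2):
        self.commit_log = []      # sequential appends; replayed on restart
        self.memtable = {}        # in-memory, mutable, holds newest values
        self.sstables = []        # immutable sorted runs; newest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. commit log (append-only)
        self.memtable[key] = value             # 2. memtable update
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                      # 3. flush to an SSTable

    def _flush(self):
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:               # newest data first
            return self.memtable[key]
        for table in self.sstables:            # then SSTables, newest first
            for k, v in table:
                if k == key:
                    return v
        return None
```

The read side shows why newest-first matters: a re-written key in the memtable shadows the stale copy sitting in an older SSTable.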

BigTable / HBase

Sorted, distributed, multi-dimensional map. Foundation of Google's storage infrastructure.

Data Model: (row_key, column_family:qualifier, timestamp) → value

  Row Key    Column Family "info"     Column Family "metrics"
             name      email          views    clicks
  ─────────────────────────────────────────────────────────
  com.google "Google"  "g@g.com"     1M       500K
  com.github "GitHub"  "gh@gh.com"   2M       800K
  org.wiki   "Wiki"    "w@w.org"     5M       200K

  Rows sorted lexicographically by row key
  Column families stored together (locality group)

Architecture:
  ┌─────────────────────────────────────────────┐
  │              Master (HMaster)               │
  │  - Region assignment                        │
  │  - Load balancing                           │
  │  - Schema changes                           │
  └──────────────┬──────────────────────────────┘
                 │
    ┌────────────┼────────────┐
    ▼            ▼            ▼
  RegionServer RegionServer RegionServer
  [Region A]   [Region B]   [Region C]
  [Region D]   [Region E]   [Region F]

  Region = contiguous range of row keys
  Auto-splits when region grows too large
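Because regions are contiguous, sorted row-key ranges, locating the region for a row key is a binary search over region start keys. A sketch (the start keys are hypothetical; real HBase resolves this via the `hbase:meta` table):

```python
import bisect

def find_region(row_key, region_start_keys):
    """Each region serves keys from its start key (inclusive) up to
    the next region's start key; '' marks the first region."""
    starts = sorted(region_start_keys)
    i = bisect.bisect_right(starts, row_key) - 1
    assert i >= 0, "row key precedes the first region start"
    return starts[i]
```

Reversed-domain row keys (com.google, com.github, ...) keep related rows in the same or adjacent regions, which this lookup makes concrete.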

Distributed File Systems

GFS / HDFS

GFS Architecture:
  ┌──────────────┐
  │    Master    │  Single master (metadata only)
  │  (NameNode)  │  - File namespace
  │              │  - Chunk → chunkserver mapping
  └──────┬───────┘
         │
    ┌────┼────┐
    ▼    ▼    ▼
  CS1   CS2   CS3    ChunkServers (DataNodes)
  [A]   [A]   [B]   - Store 64MB chunks (128MB in HDFS)
  [B]   [C]   [C]   - 3x replication (default)
  [D]   [D]   [E]   - Clients read/write directly to chunkservers

Write Pipeline:
  Client ──> CS1 ──> CS2 ──> CS3 (chained replication)
  All receive data, primary orders mutations
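Since chunks are fixed-size, a client can translate any file byte offset into a chunk index locally; only the chunk's locations come from the master. A sketch of that translation:

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB GFS chunks (HDFS blocks default to 128 MB)

def chunk_for_offset(byte_offset):
    """Return (chunk_index, offset_within_chunk) for a file byte offset.
    The client asks the master only for chunk_index's replica locations,
    then reads/writes directly against a chunkserver at that offset."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE
```

This split is what keeps the single master off the data path: metadata traffic is one small lookup per chunk, while the bulk bytes flow client-to-chunkserver.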

Ceph

No single point of failure. Uses CRUSH algorithm for deterministic data placement (no master lookup needed).

Architecture:
  ┌────────────────────────────────────────────┐
  │  Client: CRUSH(object, cluster_map) → OSD  │
  │  No metadata server for data I/O!          │
  └────────────────────────────────────────────┘

  CRUSH Algorithm:
    Input:  object_id, cluster_map, placement_rules
    Output: ordered list of OSDs for this object

  Components:
    MON (Monitor): cluster membership, maps (Paxos)
    OSD (Object Storage Daemon): stores objects, replication
    MDS (Metadata Server): POSIX namespace (CephFS only)
    MGR (Manager): monitoring, orchestration

  Storage Layers:
    RADOS ──> RBD (block)
          ──> CephFS (file)
          ──> RGW (object / S3-compatible)
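The key idea of CRUSH — every client computes placement deterministically from the cluster map, with no lookup service — can be illustrated with rendezvous (highest-random-weight) hashing. This is a simplified stand-in, not the real CRUSH algorithm, which adds hierarchical buckets and placement rules:

```python
import hashlib

def _score(object_id, osd):
    # Deterministic per-(object, OSD) weight
    return hashlib.sha256(f"{object_id}:{osd}".encode()).digest()

def place(object_id, osds, replicas=3):
    """Return an ordered list of OSDs for this object. Any client with
    the same OSD list (cluster map) computes the same answer."""
    ranked = sorted(osds, key=lambda osd: _score(object_id, osd), reverse=True)
    return ranked[:replicas]
```

Like CRUSH, this placement is stable under unrelated changes: removing an OSD that an object does not use leaves that object's placement untouched.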

Erasure Coding

Alternative to replication for storage efficiency. Encodes data into n fragments where any k are sufficient to reconstruct. Tolerates n-k failures.

Reed-Solomon (k=4, m=2):

  Original data: [D1] [D2] [D3] [D4]

  Encode:
    [D1] [D2] [D3] [D4] [P1] [P2]
      │    │    │    │    │    │
    Node1 Node2 Node3 Node4 Node5 Node6

  Any 4 of 6 fragments → reconstruct full data

  Storage overhead: 6/4 = 1.5x  (vs 3x for 3-way replication)
  Fault tolerance:  2 node failures (same as 3-way replication)

// Simplified erasure coding concept
STRUCTURE ErasureConfig
    data_shards: integer    // k: minimum fragments to reconstruct
    parity_shards: integer  // m: additional parity fragments

PROCEDURE TOTAL_SHARDS(config) → integer
    RETURN config.data_shards + config.parity_shards

PROCEDURE STORAGE_OVERHEAD(config) → real
    RETURN TOTAL_SHARDS(config) / config.data_shards

PROCEDURE MAX_FAILURES(config) → integer
    RETURN config.parity_shards

// Common configurations (k data, m parity):
// (6,3):  1.5x overhead, tolerates 3 failures (HDFS default EC policy, RS-6-3)
// (10,4): 1.4x overhead, tolerates 4 failures
// (16,4): 1.25x overhead, tolerates 4 failures

Trade-off: Erasure coding uses less storage but requires more CPU (encoding/decoding) and higher network bandwidth for recovery (must read k fragments to reconstruct).
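Real Reed-Solomon coding needs Galois-field arithmetic, but the any-k-of-n reconstruction idea can be shown with the simplest erasure code: k data shards plus one XOR parity shard (k=4, m=1 here), tolerating a single loss. A toy sketch:

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(shards):
    """k equal-length data shards -> k data shards + 1 XOR parity shard."""
    parity = shards[0]
    for s in shards[1:]:
        parity = xor_bytes(parity, s)
    return list(shards) + [parity]

def reconstruct(shards, lost_index):
    """Rebuild the shard at lost_index from the k survivors. Since parity
    is the XOR of all data shards, XOR-ing any k shards yields the missing
    one -- note recovery must read k fragments, the bandwidth cost
    mentioned above."""
    survivors = [s for i, s in enumerate(shards) if i != lost_index]
    rebuilt = survivors[0]
    for s in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, s)
    return rebuilt
```

With m parity shards instead of one, Reed-Solomon generalizes this to tolerate m simultaneous failures at (k+m)/k storage overhead.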

Comparison of Storage Systems

| System      | Data Model        | Consistency  | Partition          | Use Case                 |
|---|---|---|---|---|
| DynamoDB    | Key-value         | Tunable      | Hash               | Low-latency apps         |
| Cassandra   | Wide-column       | Tunable      | Hash ring          | Time-series, IoT         |
| HBase       | Wide-column       | Strong (row) | Range              | Analytics, random access |
| GFS/HDFS    | File (append)     | Strong       | Fixed-size chunks  | Batch processing         |
| Ceph        | Object/Block/File | Strong       | CRUSH (hash)       | General purpose          |
| CockroachDB | Relational        | Serializable | Range (auto-split) | OLTP                     |

Key Takeaways

  • Consistent hashing (with virtual nodes) is the foundation of most distributed key-value stores, enabling minimal data movement during scaling.
  • DHTs like Chord and Kademlia provide decentralized lookup but are mainly used in peer-to-peer systems; production databases typically use simpler partitioning schemes.
  • The Dynamo design (consistent hashing, vector clocks, sloppy quorum, gossip, Merkle trees) influenced an entire generation of distributed databases.
  • Erasure coding is replacing replication in large-scale storage systems where the 3x overhead of replication is prohibitive.
  • CRUSH (Ceph) eliminates metadata lookup for data placement, enabling fully decentralized I/O paths.