Distributed Storage
Distributed Hash Tables (DHTs)
Consistent Hashing
Maps both keys and nodes to positions on a hash ring. Each node is responsible for keys between it and its predecessor.
Hash Ring (0 to 2^m - 1):

               Node A (hash=10)
              /               \
       Key k1 (5)          Node B (hash=45)
           │                   │
           ▼                   │
     Assigned to A         Key k2 (50)
     (nearest CW)              │
           │                   ▼
     Node D (hash=80)     Assigned to C
              \               /
               Node C (hash=70)

k1 → A (5 ≤ 10: A is the nearest node clockwise)
k2 → C (50 passes B at 45; C at 70 is the nearest node clockwise)
Virtual nodes: each physical node maps to multiple positions on the ring, improving load balance.
Physical nodes: A, B, C
Virtual nodes: A1, A2, A3, B1, B2, B3, C1, C2, C3
Ring: ... A2 ... B1 ... C3 ... A1 ... C2 ... B3 ...
Node removal: only keys from removed virtual nodes need reassignment
Node addition: only keys moving to new virtual nodes need migration
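The ring above fits in a few lines of code. A minimal sketch (SHA-1 truncated to 32 bits as the ring hash, a sorted list of (position, node) pairs as the ring; not a production implementation):

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a ring position (SHA-1 truncated to 32 bits)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=3):
        self.vnodes = vnodes
        self.ring = []  # sorted (position, physical node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node appears at vnodes positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (ring_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        # Only this node's virtual positions disappear; all other
        # key-to-node assignments are untouched.
        self.ring = [(p, n) for (p, n) in self.ring if n != node]

    def lookup(self, key):
        # The key belongs to the nearest virtual node clockwise.
        pos = ring_hash(key)
        idx = bisect.bisect_right(self.ring, (pos,)) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node only remaps the keys that node owned, which is the property that makes scaling cheap.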
Chord
Structured DHT using finger tables for O(log n) lookups.
Each node maintains a finger table:
finger[i] = successor(n + 2^i) for i = 0..m-1
Node 8's finger table (m=6, ring size 64):
i=0: successor(9) = Node 14
i=1: successor(10) = Node 14
i=2: successor(12) = Node 14
i=3: successor(16) = Node 21
i=4: successor(24) = Node 32
i=5: successor(40) = Node 42
Lookup(key=30) at Node 8:
8 → finger[4]=32 overshoots 30, so use finger[3]=21 (closest finger preceding the key)
21 → forwards to its own closest preceding finger
→ ... → Node 32 (successor of 30) holds the key
O(log n) hops to find any key
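The finger-table rule can be checked mechanically. A small sketch, assuming a node set consistent with the table above (the example ring from the original Chord paper):

```python
import bisect

M = 6                                            # identifier bits; ring size 2^M = 64
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # assumed sorted node IDs

def successor(ident: int) -> int:
    """First node at or after ident, wrapping around the ring."""
    idx = bisect.bisect_left(NODES, ident % (2 ** M))
    return NODES[idx % len(NODES)]

def finger_table(n: int) -> list:
    """finger[i] = successor(n + 2^i mod 2^M) for i = 0..M-1."""
    return [successor((n + 2 ** i) % (2 ** M)) for i in range(M)]
```

Running `finger_table(8)` reproduces Node 8's table above: [14, 14, 14, 21, 32, 42].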
Kademlia
Used in the BitTorrent Mainline DHT and Ethereum's node discovery. Uses an XOR distance metric.
Distance: d(x, y) = x XOR y
Properties:
d(x, x) = 0 (identity)
d(x, y) = d(y, x) (symmetry)
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Each node maintains k-buckets:
Bucket i: nodes with distance 2^i ≤ d < 2^(i+1)
Lookup: iteratively query α closest-known nodes,
converges to target in O(log n) steps.
Advantage: XOR symmetry means lookup paths are bidirectional, enabling mutual routing table maintenance.
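The metric and the bucket rule translate directly to code; a minimal sketch:

```python
def xor_distance(x: int, y: int) -> int:
    """Kademlia's metric: bitwise XOR of the two node/key IDs."""
    return x ^ y

def bucket_index(x: int, y: int) -> int:
    """k-bucket i holds peers with 2^i <= d(x, y) < 2^(i+1),
    i.e. i is the position of the highest differing bit."""
    d = x ^ y
    if d == 0:
        raise ValueError("a node does not appear in its own buckets")
    return d.bit_length() - 1
```

Because the bucket index is just the highest differing bit, each bucket covers the half of the ID space that disagrees with the local ID at that bit.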
Amazon Dynamo
Pioneering eventually consistent key-value store. Influenced Cassandra, Riak, DynamoDB.
Key Design Decisions:
┌─────────────────────┬────────────────────────────────┐
│ Problem             │ Dynamo Solution                │
├─────────────────────┼────────────────────────────────┤
│ Partitioning        │ Consistent hashing + vnodes    │
│ Replication         │ N replicas on preference list  │
│ Versioning          │ Vector clocks                  │
│ Conflict resolution │ Client-side (read repair)      │
│ Membership          │ Gossip protocol                │
│ Failure handling    │ Sloppy quorum + hinted handoff │
│ Anti-entropy        │ Merkle trees                   │
└─────────────────────┴────────────────────────────────┘
Merkle Trees for Anti-Entropy
Each node maintains a Merkle tree of its key range. Comparing trees identifies divergent ranges efficiently.
                     Root Hash
                    /         \
             H(0-50)           H(51-100)       ← compare root hashes first
             /     \            /      \
       H(0-25)  H(26-50)  H(51-75)  H(76-100)  ← drill into differences
         ...       ...       ...       ...
        Keys      Keys      Keys      Keys
Sync: compare root → if different, compare children → recurse
Only transfer keys in differing leaf ranges.
O(log n) comparisons instead of O(n).
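The sync walk above can be sketched as a top-down comparison of two hash trees; a toy version, assuming a power-of-two number of leaf ranges:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaf_hashes):
    """Bottom-up levels of a binary Merkle tree; leaf count must be a
    power of two. levels[0] = leaves, levels[-1] = [root]."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(ta, tb, level=None, idx=0):
    """Indices of leaf ranges whose hashes differ, descending only into
    subtrees whose hashes disagree (the anti-entropy walk)."""
    if level is None:
        level = len(ta) - 1                 # start at the root
    if ta[level][idx] == tb[level][idx]:
        return []                           # identical subtree: skip it
    if level == 0:
        return [idx]                        # a divergent leaf range
    return (diff_leaves(ta, tb, level - 1, 2 * idx)
            + diff_leaves(ta, tb, level - 1, 2 * idx + 1))
```

Two replicas that differ in a single key range compare O(log n) hashes and transfer only that range.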
Apache Cassandra
Wide-column store inspired by Dynamo (partitioning) and BigTable (data model).
Architecture:
┌──────────────────────────────────────┐
│ Cassandra Ring │
│ N1 ──── N2 ──── N3 ──── N4 │
│ \ / │
│ ──────── N5 ───────── │
│ │
│ Partitioner: Murmur3Hash │
│ Replication: NetworkTopologyStrategy│
│ Consistency: Tunable (ONE..ALL) │
└──────────────────────────────────────┘
Data Model:
Keyspace → Tables → Rows → Columns
Primary Key = (Partition Key, Clustering Columns)
Partition Key → determines node placement
Clustering Columns → sort order within partition
Write path: Commit log (append) -> Memtable (in-memory) -> SSTable (disk flush)
Read path: Row cache (if enabled) -> Memtable + SSTables, merged by timestamp
(per SSTable: Bloom filter -> partition key cache -> partition index -> data)
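The write and read paths follow the classic LSM pattern. A toy single-node sketch (no compaction; the commit log is a list standing in for a durable append-only file):

```python
class LSMStore:
    """Toy LSM write path: append to a commit log, buffer writes in a
    memtable, flush to immutable sorted runs (stand-ins for SSTables)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # durable append-only log (stand-in)
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. append to commit log
        self.memtable[key] = value             # 2. update memtable
        if len(self.memtable) >= self.limit:   # 3. flush when full
            self.flush()

    def flush(self):
        # An SSTable is written sorted by key; here, a sorted dict.
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # memtable holds newest data
            return self.memtable[key]
        for sst in self.sstables:              # then SSTables, newest first
            if key in sst:
                return sst[key]
        return None
```

Newest-first lookup is what makes an overwrite in the memtable shadow an older value already flushed to disk.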
BigTable / HBase
Sorted, distributed, multi-dimensional map. Foundation of Google's storage infrastructure.
Data Model: (row_key, column_family:qualifier, timestamp) → value
Row Key       Column Family "info"       Column Family "metrics"
              name       email           views     clicks
──────────────────────────────────────────────────────────────
com.google    "Google"   "g@g.com"       1M        500K
com.github    "GitHub"   "gh@gh.com"     2M        800K
org.wiki      "Wiki"     "w@w.org"       5M        200K
Rows sorted lexicographically by row key
Column families stored together (locality group)
Architecture:
┌─────────────────────────────────────────────┐
│ Master (HMaster) │
│ - Region assignment │
│ - Load balancing │
│ - Schema changes │
└──────────────┬──────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
RegionServer RegionServer RegionServer
[Region A] [Region B] [Region C]
[Region D] [Region E] [Region F]
Region = contiguous range of row keys
Auto-splits when region grows too large
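Locating the region for a row key is a range search over the sorted region start keys; a minimal sketch with hypothetical region boundaries and server names:

```python
import bisect

# Hypothetical regions: (start_row_key, serving RegionServer). A region
# owns row keys from its start (inclusive) up to the next region's start
# (exclusive); the first region starts at the empty key.
REGIONS = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]

def locate(row_key: str) -> str:
    """Find the region whose [start, next_start) range covers row_key."""
    starts = [start for start, _ in REGIONS]
    return REGIONS[bisect.bisect_right(starts, row_key) - 1][1]
```

A region split just inserts a new (start_key, server) entry; lookups are unaffected for keys outside the split range.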
Distributed File Systems
GFS / HDFS
GFS Architecture:
┌──────────────┐
│ Master │ Single master (metadata only)
│ (NameNode) │ - File namespace
│ │ - Chunk → chunkserver mapping
└──────┬───────┘
│
┌────┼────┐
▼ ▼ ▼
CS1 CS2 CS3 ChunkServers (DataNodes)
[A] [A] [B] - Store 64MB chunks (128MB in HDFS)
[B] [C] [C] - 3x replication (default)
[D] [D] [E] - Clients read/write directly to chunks
Write Pipeline:
Client ──> CS1 ──> CS2 ──> CS3 (chained replication)
All receive data, primary orders mutations
Ceph
No single point of failure. Uses CRUSH algorithm for deterministic data placement (no master lookup needed).
Architecture:
┌────────────────────────────────────────────┐
│ Client: CRUSH(object, cluster_map) → OSD │
│ No metadata server for data I/O! │
└────────────────────────────────────────────┘
CRUSH Algorithm:
Input: object_id, cluster_map, placement_rules
Output: ordered list of OSDs for this object
Components:
MON (Monitor): cluster membership, maps (Paxos)
OSD (Object Storage Daemon): stores objects, replication
MDS (Metadata Server): POSIX namespace (CephFS only)
MGR (Manager): monitoring, orchestration
Storage Layers:
RADOS ──> RBD (block)
──> CephFS (file)
──> RGW (object / S3-compatible)
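CRUSH itself is involved (hierarchical buckets, placement rules), but the core idea, every client computing placement deterministically from the cluster map with no lookup service, can be illustrated with rendezvous (HRW) hashing as a stand-in for the real algorithm:

```python
import hashlib

def _score(object_id: str, osd: str) -> int:
    """Deterministic per-(object, OSD) weight."""
    digest = hashlib.sha256(f"{object_id}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(object_id: str, osds, replicas=3):
    """Every client derives the same ordered OSD list from the cluster
    map alone, so the data path needs no metadata lookup.
    (HRW hashing here, not the real CRUSH bucket walk.)"""
    return sorted(osds, key=lambda o: _score(object_id, o), reverse=True)[:replicas]
```

Like CRUSH, this is stable under cluster changes: removing an OSD that was not serving an object leaves that object's placement unchanged.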
Erasure Coding
Alternative to replication for storage efficiency. Encodes data into n = k + m fragments (k data, m parity); any k fragments are sufficient to reconstruct the original, so the system tolerates m = n - k fragment losses.
Reed-Solomon (k=4, m=2):
Original data: [D1] [D2] [D3] [D4]
Encode:
[D1] [D2] [D3] [D4] [P1] [P2]
│ │ │ │ │ │
Node1 Node2 Node3 Node4 Node5 Node6
Any 4 of 6 fragments → reconstruct full data
Storage overhead: 6/4 = 1.5x (vs 3x for 3-way replication)
Fault tolerance: 2 node failures (same as 3-way replication)
// Simplified erasure coding concept
STRUCTURE ErasureConfig
data_shards: integer // k: minimum fragments to reconstruct
parity_shards: integer // m: additional parity fragments
PROCEDURE TOTAL_SHARDS(config) → integer
RETURN config.data_shards + config.parity_shards
PROCEDURE STORAGE_OVERHEAD(config) → real
RETURN TOTAL_SHARDS(config) / config.data_shards
PROCEDURE MAX_FAILURES(config) → integer
RETURN config.parity_shards
// Common configurations (k, m):
// (6, 3):  1.5x overhead, tolerates 3 failures (HDFS default EC policy)
// (10, 4): 1.4x overhead, tolerates 4 failures
// (16, 4): 1.25x overhead, tolerates 4 failures
Trade-off: Erasure coding uses less storage but requires more CPU (encoding/decoding) and higher network bandwidth for recovery (must read k fragments to reconstruct).
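For the m=1 special case the parity math is plain XOR, which makes reconstruction easy to demonstrate (real systems use Reed-Solomon codes for m > 1):

```python
def xor_parity(shards):
    """Parity shard = XOR of all input shards (the m=1 special case)."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return bytes(parity)

def reconstruct_missing(surviving):
    """XOR of the k surviving shards (data or parity) yields the one
    missing shard, because x ^ x = 0 cancels everything else."""
    return xor_parity(surviving)
```

This also shows the recovery-bandwidth cost mentioned above: rebuilding one lost shard requires reading k surviving shards, versus one replica under plain replication.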
Comparison of Storage Systems
| System      | Data Model        | Consistency  | Partitioning       | Use Case                 |
|-------------|-------------------|--------------|--------------------|--------------------------|
| DynamoDB    | Key-value         | Tunable      | Hash               | Low-latency apps         |
| Cassandra   | Wide-column       | Tunable      | Hash ring          | Time-series, IoT         |
| HBase       | Wide-column       | Strong (row) | Range              | Analytics, random access |
| GFS/HDFS    | File (append)     | Strong       | Fixed-size chunks  | Batch processing         |
| Ceph        | Object/Block/File | Strong       | CRUSH (hash)       | General purpose          |
| CockroachDB | Relational        | Serializable | Range (auto-split) | OLTP                     |
Key Takeaways
- Consistent hashing (with virtual nodes) is the foundation of most distributed key-value stores, enabling minimal data movement during scaling.
- DHTs like Chord and Kademlia provide decentralized lookup but are mainly used in peer-to-peer systems; production databases typically use simpler partitioning schemes.
- The Dynamo design (consistent hashing, vector clocks, sloppy quorum, gossip, Merkle trees) influenced an entire generation of distributed databases.
- Erasure coding is replacing replication in large-scale storage systems where the 3x overhead of replication is prohibitive.
- CRUSH (Ceph) eliminates metadata lookup for data placement, enabling fully decentralized I/O paths.