Distributed Storage
Distributed Hash Tables (DHTs)
Consistent Hashing
Maps both keys and nodes to positions on a hash ring. Each node is responsible for keys between it and its predecessor.
Hash Ring (0 to 2^m - 1):

               Node A (hash=10)
              /               \
       Key k1 (5)          Node B (hash=45)
           │                   │
           ▼                   │
     Assigned to A         Key k2 (50)
     (nearest CW)              │
           │                   ▼
     Node D (hash=80)     Assigned to C
              \               /
               Node C (hash=70)

k1 → A (5 ≤ 10: A is the nearest node clockwise)
k2 → C (50 passes B at 45; C at 70 is the nearest node clockwise)
Virtual nodes: each physical node maps to multiple positions on the ring, improving load balance.
Physical nodes: A, B, C
Virtual nodes: A1, A2, A3, B1, B2, B3, C1, C2, C3
Ring: ... A2 ... B1 ... C3 ... A1 ... C2 ... B3 ...
Node removal: only keys from removed virtual nodes need reassignment
Node addition: only keys moving to new virtual nodes need migration
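The ring above fits in a few lines of code. A minimal sketch (SHA-1 truncated to 32 bits as the ring hash, a sorted list of (position, node) pairs as the ring; not a production implementation):

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    """Map a string to a ring position (SHA-1 truncated to 32 bits)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:4], "big")

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=3):
        self.vnodes = vnodes
        self.ring = []  # sorted (position, physical node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node appears at vnodes positions on the ring.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (ring_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        # Only this node's virtual positions disappear; all other
        # key-to-node assignments are untouched.
        self.ring = [(p, n) for (p, n) in self.ring if n != node]

    def lookup(self, key):
        # The key belongs to the nearest virtual node clockwise.
        pos = ring_hash(key)
        idx = bisect.bisect_right(self.ring, (pos,)) % len(self.ring)
        return self.ring[idx][1]
```

Removing a node only remaps the keys that node owned, which is the property that makes scaling cheap.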
Chord
Structured DHT using finger tables for O(log n) lookups.
Each node maintains a finger table:
finger[i] = successor(n + 2^i) for i = 0..m-1
Node 8's finger table (m=6, ring size 64):
i=0: successor(9) = Node 14
i=1: successor(10) = Node 14
i=2: successor(12) = Node 14
i=3: successor(16) = Node 21
i=4: successor(24) = Node 32
i=5: successor(40) = Node 42
Lookup(key=30) at Node 8:
8 → finger[4]=32 overshoots 30, so use finger[3]=21 (closest finger preceding the key)
21 → forwards to its own closest preceding finger
→ ... → Node 32 (successor of 30) holds the key
O(log n) hops to find any key
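The finger-table rule can be checked mechanically. A small sketch, assuming a node set consistent with the table above (the example ring from the original Chord paper):

```python
import bisect

M = 6                                            # identifier bits; ring size 2^M = 64
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # assumed sorted node IDs

def successor(ident: int) -> int:
    """First node at or after ident, wrapping around the ring."""
    idx = bisect.bisect_left(NODES, ident % (2 ** M))
    return NODES[idx % len(NODES)]

def finger_table(n: int) -> list:
    """finger[i] = successor(n + 2^i mod 2^M) for i = 0..M-1."""
    return [successor((n + 2 ** i) % (2 ** M)) for i in range(M)]
```

Running `finger_table(8)` reproduces Node 8's table above: [14, 14, 14, 21, 32, 42].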
Kademlia
Used in the BitTorrent Mainline DHT and Ethereum's node discovery. Uses an XOR distance metric.
Distance: d(x, y) = x XOR y
Properties:
d(x, x) = 0 (identity)
d(x, y) = d(y, x) (symmetry)
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Each node maintains k-buckets:
Bucket i: nodes with distance 2^i ≤ d < 2^(i+1)
Lookup: iteratively query α closest-known nodes,
converges to target in O(log n) steps.
Advantage: XOR symmetry means lookup paths are bidirectional, enabling mutual routing table maintenance.
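The metric and the bucket rule translate directly to code; a minimal sketch:

```python
def xor_distance(x: int, y: int) -> int:
    """Kademlia's metric: bitwise XOR of the two node/key IDs."""
    return x ^ y

def bucket_index(x: int, y: int) -> int:
    """k-bucket i holds peers with 2^i <= d(x, y) < 2^(i+1),
    i.e. i is the position of the highest differing bit."""
    d = x ^ y
    if d == 0:
        raise ValueError("a node does not appear in its own buckets")
    return d.bit_length() - 1
```

Because the bucket index is just the highest differing bit, each bucket covers the half of the ID space that disagrees with the local ID at that bit.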
Amazon Dynamo
Pioneering eventually consistent key-value store. Influenced Cassandra, Riak, DynamoDB.
Key Design Decisions:
┌─────────────────────┬────────────────────────────────┐
│ Problem             │ Dynamo Solution                │
├─────────────────────┼────────────────────────────────┤
│ Partitioning        │ Consistent hashing + vnodes    │
│ Replication         │ N replicas on preference list  │
│ Versioning          │ Vector clocks                  │
│ Conflict resolution │ Client-side (read repair)      │
│ Membership          │ Gossip protocol                │
│ Failure handling    │ Sloppy quorum + hinted handoff │
│ Anti-entropy        │ Merkle trees                   │
└─────────────────────┴────────────────────────────────┘
Merkle Trees for Anti-Entropy
Each node maintains a Merkle tree of its key range. Comparing trees identifies divergent ranges efficiently.
                     Root Hash
                    /         \
             H(0-50)           H(51-100)       ← compare root hashes first
             /     \            /      \
       H(0-25)  H(26-50)  H(51-75)  H(76-100)  ← drill into differences
         ...       ...       ...       ...
        Keys      Keys      Keys      Keys
Sync: compare root → if different, compare children → recurse
Only transfer keys in differing leaf ranges.
O(log n) comparisons instead of O(n).
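The sync walk above can be sketched as a top-down comparison of two hash trees; a toy version, assuming a power-of-two number of leaf ranges:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaf_hashes):
    """Bottom-up levels of a binary Merkle tree; leaf count must be a
    power of two. levels[0] = leaves, levels[-1] = [root]."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(ta, tb, level=None, idx=0):
    """Indices of leaf ranges whose hashes differ, descending only into
    subtrees whose hashes disagree (the anti-entropy walk)."""
    if level is None:
        level = len(ta) - 1                 # start at the root
    if ta[level][idx] == tb[level][idx]:
        return []                           # identical subtree: skip it
    if level == 0:
        return [idx]                        # a divergent leaf range
    return (diff_leaves(ta, tb, level - 1, 2 * idx)
            + diff_leaves(ta, tb, level - 1, 2 * idx + 1))
```

Two replicas that differ in a single key range compare O(log n) hashes and transfer only that range.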
Apache Cassandra
Wide-column store inspired by Dynamo (partitioning) and BigTable (data model).
Architecture:
┌──────────────────────────────────────┐
│ Cassandra Ring │
│ N1 ──── N2 ──── N3 ──── N4 │
│ \ / │
│ ──────── N5 ───────── │
│ │
│ Partitioner: Murmur3Hash │
│ Replication: NetworkTopologyStrategy│
│ Consistency: Tunable (ONE..ALL) │
└──────────────────────────────────────┘
Data Model:
Keyspace → Tables → Rows → Columns
Primary Key = (Partition Key, Clustering Columns)
Partition Key → determines node placement
Clustering Columns → sort order within partition
Write path: Commit log (append) -> Memtable (in-memory) -> SSTable (disk flush)
Read path: Row cache (if enabled) -> Memtable + SSTables, merged by timestamp
(per SSTable: Bloom filter -> partition key cache -> partition index -> data)
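The write and read paths follow the classic LSM pattern. A toy single-node sketch (no compaction; the commit log is a list standing in for a durable append-only file):

```python
class LSMStore:
    """Toy LSM write path: append to a commit log, buffer writes in a
    memtable, flush to immutable sorted runs (stand-ins for SSTables)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # durable append-only log (stand-in)
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # immutable sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. append to commit log
        self.memtable[key] = value             # 2. update memtable
        if len(self.memtable) >= self.limit:   # 3. flush when full
            self.flush()

    def flush(self):
        # An SSTable is written sorted by key; here, a sorted dict.
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:               # memtable holds newest data
            return self.memtable[key]
        for sst in self.sstables:              # then SSTables, newest first
            if key in sst:
                return sst[key]
        return None
```

Newest-first lookup is what makes an overwrite in the memtable shadow an older value already flushed to disk.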
BigTable / HBase
Sorted, distributed, multi-dimensional map. Foundation of Google's storage infrastructure.
Data Model: (row_key, column_family:qualifier, timestamp) → value
Row Key       Column Family "info"       Column Family "metrics"
              name       email           views     clicks
──────────────────────────────────────────────────────────────
com.google    "Google"   "g@g.com"       1M        500K
com.github    "GitHub"   "gh@gh.com"     2M        800K
org.wiki      "Wiki"     "w@w.org"       5M        200K
Rows sorted lexicographically by row key
Column families stored together (locality group)
Architecture:
┌─────────────────────────────────────────────┐
│ Master (HMaster) │
│ - Region assignment │
│ - Load balancing │
│ - Schema changes │
└──────────────┬──────────────────────────────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
RegionServer RegionServer RegionServer
[Region A] [Region B] [Region C]
[Region D] [Region E] [Region F]
Region = contiguous range of row keys
Auto-splits when region grows too large
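Locating the region for a row key is a range search over the sorted region start keys; a minimal sketch with hypothetical region boundaries and server names:

```python
import bisect

# Hypothetical regions: (start_row_key, serving RegionServer). A region
# owns row keys from its start (inclusive) up to the next region's start
# (exclusive); the first region starts at the empty key.
REGIONS = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]

def locate(row_key: str) -> str:
    """Find the region whose [start, next_start) range covers row_key."""
    starts = [start for start, _ in REGIONS]
    return REGIONS[bisect.bisect_right(starts, row_key) - 1][1]
```

A region split just inserts a new (start_key, server) entry; lookups are unaffected for keys outside the split range.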
Distributed File Systems
GFS / HDFS
GFS Architecture:
┌──────────────┐
│ Master │ Single master (metadata only)
│ (NameNode) │ - File namespace
│ │ - Chunk → chunkserver mapping
└──────┬───────┘
│
┌────┼────┐
▼ ▼ ▼
CS1 CS2 CS3 ChunkServers (DataNodes)
[A] [A] [B] - Store 64MB chunks (128MB in HDFS)
[B] [C] [C] - 3x replication (default)
[D] [D] [E] - Clients read/write directly to chunks
Write Pipeline:
Client ──> CS1 ──> CS2 ──> CS3 (chained replication)
All receive data, primary orders mutations
Ceph
No single point of failure. Uses CRUSH algorithm for deterministic data placement (no master lookup needed).
Architecture:
┌────────────────────────────────────────────┐
│ Client: CRUSH(object, cluster_map) → OSD │
│ No metadata server for data I/O! │
└────────────────────────────────────────────┘
CRUSH Algorithm:
Input: object_id, cluster_map, placement_rules
Output: ordered list of OSDs for this object
Components:
MON (Monitor): cluster membership, maps (Paxos)
OSD (Object Storage Daemon): stores objects, replication
MDS (Metadata Server): POSIX namespace (CephFS only)
MGR (Manager): monitoring, orchestration
Storage Layers:
RADOS ──> RBD (block)
──> CephFS (file)
──> RGW (object / S3-compatible)
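CRUSH itself is involved (hierarchical buckets, placement rules), but the core idea, every client computing placement deterministically from the cluster map with no lookup service, can be illustrated with rendezvous (HRW) hashing as a stand-in for the real algorithm:

```python
import hashlib

def _score(object_id: str, osd: str) -> int:
    """Deterministic per-(object, OSD) weight."""
    digest = hashlib.sha256(f"{object_id}:{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(object_id: str, osds, replicas=3):
    """Every client derives the same ordered OSD list from the cluster
    map alone, so the data path needs no metadata lookup.
    (HRW hashing here, not the real CRUSH bucket walk.)"""
    return sorted(osds, key=lambda o: _score(object_id, o), reverse=True)[:replicas]
```

Like CRUSH, this is stable under cluster changes: removing an OSD that was not serving an object leaves that object's placement unchanged.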
Erasure Coding
Alternative to replication for storage efficiency. Encodes data into n = k + m fragments (k data, m parity); any k fragments are sufficient to reconstruct the original, so the system tolerates m = n - k fragment losses.
Reed-Solomon (k=4, m=2):
Original data: [D1] [D2] [D3] [D4]
Encode:
[D1] [D2] [D3] [D4] [P1] [P2]
│ │ │ │ │ │
Node1 Node2 Node3 Node4 Node5 Node6
Any 4 of 6 fragments → reconstruct full data
Storage overhead: 6/4 = 1.5x (vs 3x for 3-way replication)
Fault tolerance: 2 node failures (same as 3-way replication)
// Simplified erasure coding concept
STRUCTURE ErasureConfig
data_shards: integer // k: minimum fragments to reconstruct
parity_shards: integer // m: additional parity fragments
PROCEDURE TOTAL_SHARDS(config) → integer
RETURN config.data_shards + config.parity_shards
PROCEDURE STORAGE_OVERHEAD(config) → real
RETURN TOTAL_SHARDS(config) / config.data_shards
PROCEDURE MAX_FAILURES(config) → integer
RETURN config.parity_shards
// Common configurations (k, m):
// (6, 3):  1.5x overhead, tolerates 3 failures (HDFS default EC policy)
// (10, 4): 1.4x overhead, tolerates 4 failures
// (16, 4): 1.25x overhead, tolerates 4 failures
Trade-off: Erasure coding uses less storage but requires more CPU (encoding/decoding) and higher network bandwidth for recovery (must read k fragments to reconstruct).
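For the m=1 special case the parity math is plain XOR, which makes reconstruction easy to demonstrate (real systems use Reed-Solomon codes for m > 1):

```python
def xor_parity(shards):
    """Parity shard = XOR of all input shards (the m=1 special case)."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return bytes(parity)

def reconstruct_missing(surviving):
    """XOR of the k surviving shards (data or parity) yields the one
    missing shard, because x ^ x = 0 cancels everything else."""
    return xor_parity(surviving)
```

This also shows the recovery-bandwidth cost mentioned above: rebuilding one lost shard requires reading k surviving shards, versus one replica under plain replication.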
Comparison of Storage Systems
| System      | Data Model        | Consistency  | Partitioning       | Use Case                 |
|-------------|-------------------|--------------|--------------------|--------------------------|
| DynamoDB    | Key-value         | Tunable      | Hash               | Low-latency apps         |
| Cassandra   | Wide-column       | Tunable      | Hash ring          | Time-series, IoT         |
| HBase       | Wide-column       | Strong (row) | Range              | Analytics, random access |
| GFS/HDFS    | File (append)     | Strong       | Fixed-size chunks  | Batch processing         |
| Ceph        | Object/Block/File | Strong       | CRUSH (hash)       | General purpose          |
| CockroachDB | Relational        | Serializable | Range (auto-split) | OLTP                     |
Key Takeaways
- Consistent hashing (with virtual nodes) is the foundation of most distributed key-value stores, enabling minimal data movement during scaling.
- DHTs like Chord and Kademlia provide decentralized lookup but are mainly used in peer-to-peer systems; production databases typically use simpler partitioning schemes.
- The Dynamo design (consistent hashing, vector clocks, sloppy quorum, gossip, Merkle trees) influenced an entire generation of distributed databases.
- Erasure coding is replacing replication in large-scale storage systems where the 3x overhead of replication is prohibitive.
- CRUSH (Ceph) eliminates metadata lookup for data placement, enabling fully decentralized I/O paths.