Distributed Databases Introduction
A distributed database spreads data across multiple machines. This introduces new challenges: network partitions, consistency across nodes, and distributed transactions.
Data Distribution
Horizontal Fragmentation (Sharding)
Split rows across nodes based on a partitioning key.
Users table, shard by user_id:
Shard 1 (Node A): user_id 1-1000
Shard 2 (Node B): user_id 1001-2000
Shard 3 (Node C): user_id 2001-3000
Partitioning strategies:
- Range: Partition by key range (1-1000, 1001-2000). Good for range queries. Risk of hotspots (recent data on one shard).
- Hash: Partition by hash(key) mod N. Even distribution, but range queries span all shards, and changing N forces most keys to move (consistent hashing mitigates this).
- Directory: A lookup table maps each key to a shard. Flexible but adds indirection.
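The range and hash strategies above can be sketched in a few lines of Python. The shard count, boundary values, and function names are illustrative, not from any particular system:

```python
import hashlib

NUM_SHARDS = 3  # assumed cluster size for illustration

def hash_shard(key: str) -> int:
    # Hash partitioning: even spread, but hash(key) mod N reshuffles
    # most keys whenever N changes.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Range partitioning: each entry is the exclusive upper bound of a shard,
# matching the users-table example above (1-1000, 1001-2000, 2001-3000).
RANGE_BOUNDS = [1001, 2001, 3001]

def range_shard(user_id: int) -> int:
    for shard, upper in enumerate(RANGE_BOUNDS):
        if user_id < upper:
            return shard
    raise ValueError(f"user_id {user_id} outside partition map")
```

A directory scheme would replace both functions with a lookup in a (replicated) key-to-shard table.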
Vertical Fragmentation
Split columns across nodes: frequently accessed columns on fast storage, rarely used columns on cheaper storage.
Node A: users(id, name, email) — hot data
Node B: users(id, bio, preferences) — cold data
Hybrid Fragmentation
Combine horizontal and vertical fragmentation.
Data Replication
Why Replicate?
- Availability: If one node fails, others can serve requests.
- Read scalability: Spread reads across replicas.
- Latency: Place replicas near users (geo-replication).
Synchronous Replication
Write is acknowledged only after all replicas confirm. Strong consistency but slow (wait for the slowest replica).
Client → Primary → [Write to Replica 1] + [Write to Replica 2]
← ACK from both replicas
← ACK to client
Asynchronous Replication
Write is acknowledged as soon as the primary has written locally; replicas are updated later. Fast, but risks data loss if the primary fails before replication completes.
Client → Primary → ACK to client (immediately)
→ Background replication to Replica 1, Replica 2
Replication lag: Replicas may be seconds (or more) behind the primary. Reading from a replica may return stale data.
Semi-Synchronous
Primary waits for at least one replica to confirm. Balance of speed and safety.
Distributed Query Processing
Query Routing
Client-side routing: Client knows the partition map and sends queries directly to the right shard.
Proxy/coordinator: A middle layer routes queries (e.g., Vitess for MySQL, Citus for PostgreSQL). Note that a plain connection pooler like pgBouncer does not route by shard.
Cross-Shard Queries
Queries spanning multiple shards require scatter-gather:
SELECT COUNT(*) FROM users WHERE age > 25:
→ Send to Shard 1: COUNT(*) WHERE age > 25 → 342
→ Send to Shard 2: COUNT(*) WHERE age > 25 → 521
→ Send to Shard 3: COUNT(*) WHERE age > 25 → 187
→ Coordinator: SUM = 1050
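The scatter-gather flow above can be sketched with in-memory lists standing in for shards (the data and counts here are illustrative, not the 342/521/187 from the diagram):

```python
# Hypothetical shards: each list plays the role of one node's local data.
shards = [
    [{"id": 1, "age": 30}, {"id": 2, "age": 22}],
    [{"id": 1001, "age": 41}],
    [{"id": 2001, "age": 19}, {"id": 2002, "age": 27}],
]

def local_count(shard, predicate):
    # Scatter phase: each shard evaluates the predicate on its own rows.
    return sum(1 for row in shard if predicate(row))

def scatter_gather_count(predicate):
    # Gather phase: the coordinator sums the partial counts.
    partials = [local_count(s, predicate) for s in shards]
    return sum(partials)
```

Note that COUNT and SUM compose this way; an aggregate like AVG must be gathered as (SUM, COUNT) pairs rather than averaged per shard.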
Distributed joins: Cross-shard joins are expensive (data must be shuffled). Design schemas to minimize cross-shard queries (co-locate related data on the same shard).
Distributed Transactions
Two-Phase Commit (2PC)
Coordinator orchestrates the commit across participants.
Phase 1 (Prepare/Vote):
Coordinator → all participants: "Prepare to commit"
Participant 1: writes WAL, locks resources → votes YES
Participant 2: writes WAL, locks resources → votes YES
Phase 2 (Commit/Abort):
If all voted YES:
Coordinator → all: "COMMIT" → participants commit and release locks
If any voted NO:
Coordinator → all: "ABORT" → participants rollback and release locks
Problem: Blocking. If the coordinator crashes after Phase 1, participants are stuck holding locks (waiting for the coordinator's decision). Recovery requires logging the coordinator's decision.
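The two phases can be sketched as follows. This is a single-process toy (class and method names are made up); a real participant would write a WAL record and hold locks between prepare and the coordinator's decision:

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether this node will vote YES
        self.state = "init"

    def prepare(self):
        # Real systems: force a WAL record to disk and lock resources here.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit  # the participant's vote

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect every vote (no short-circuiting, so all
    # participants know a decision is pending).
    votes = [p.prepare() for p in participants]
    # The coordinator must durably log its decision before Phase 2;
    # otherwise a coordinator crash leaves prepared participants blocked.
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.abort()
    return "ABORT"
```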
Three-Phase Commit (3PC)
Adds a pre-commit phase between vote and commit. Reduces blocking, but it assumes bounded network delays, still fails in edge cases (notably network partitions), and is rarely used in practice.
Saga Pattern
For long-running distributed transactions in microservices:
Instead of one big transaction, execute a sequence of local transactions. If one fails, execute compensating transactions to undo the earlier steps.
1. Create order (local tx)
2. Reserve inventory (local tx)
3. Charge payment (local tx)
→ If payment fails at step 3:
2a. Release inventory (compensating tx for step 2)
1a. Cancel order (compensating tx for step 1)
Choreography: Each service publishes events, and the next service reacts.
Orchestration: A central coordinator directs the saga steps.
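An orchestrated saga can be sketched as a loop over (action, compensation) pairs; on failure, the compensations for completed steps run in reverse order. The step functions and log are illustrative:

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo every completed step, newest first.
            for comp in reversed(completed):
                comp()
            return "rolled back"
        completed.append(compensate)
    return "completed"

log = []

def charge_payment():
    raise RuntimeError("payment declined")  # simulated failure at step 3

steps = [
    (lambda: log.append("order created"),  lambda: log.append("order cancelled")),
    (lambda: log.append("inventory held"), lambda: log.append("inventory released")),
    (charge_payment,                       lambda: log.append("payment refunded")),
]
```

Running `run_saga(steps)` rolls back: the inventory is released, then the order is cancelled. In a choreographed saga the same compensations would be triggered by failure events instead of a central loop.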
CAP Theorem
Brewer's conjecture (2000), later proven by Gilbert and Lynch (2002): A distributed system can guarantee at most two of three properties:
- Consistency (C): All nodes see the same data at the same time (linearizability).
- Availability (A): Every request receives a response (not guaranteed to be the latest).
- Partition tolerance (P): System continues operating despite network partitions.
In practice: Network partitions WILL happen. So the real choice is between CP (consistent but may reject requests during partitions) and AP (available but may serve stale data during partitions).
| Choice | Properties | Examples |
|---|---|---|
| CP | Consistent + Partition-tolerant | HBase, ZooKeeper, etcd, CockroachDB (reject writes during a partition) |
| AP | Available + Partition-tolerant | Cassandra, DynamoDB, Riak (serve possibly stale data during a partition) |
PACELC (Abadi, 2010): Extension of CAP. "If there's a Partition, choose A or C. Else (normally), choose Latency or Consistency." Most systems make different tradeoffs for normal vs partitioned operation.
Consistency Models
Strong Consistency (Linearizability)
Every read returns the most recent write. All operations appear to occur in a single global order. The gold standard but expensive.
Eventual Consistency
If no new updates occur, all replicas will eventually converge to the same value. No guarantee about how long "eventually" is. Reads may return stale data.
Used by: DNS, CDN caches, Cassandra (tunable), DynamoDB (eventual by default).
Causal Consistency
If operation A causally precedes operation B (A happened before B and could have influenced it), then all nodes see A before B. Operations without causal relationship can be seen in any order.
Stronger than eventual, weaker than strong. Good balance for many applications.
Read-Your-Writes
After a write, subsequent reads by the same client see the write (or a later value). Writes by other clients may not be visible yet.
Implementation: Route reads to the same replica that accepted the write, or use a version/timestamp to ensure freshness.
Monotonic Reads
Once a client reads value v, it will never see an older value in subsequent reads. No "going back in time."
Implementation: Track the latest version seen by each client. Route reads accordingly.
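Both session guarantees can be sketched with version tracking. This toy (class names and the version scheme are assumptions, not a real client library) gives read-your-writes and monotonic reads by refusing to read from replicas that lag behind what the session has already seen:

```python
class Replica:
    def __init__(self):
        self.version = 0   # highest write this replica has applied
        self.data = {}

    def apply(self, key, value, version):
        # Simulates asynchronous replication catching up.
        self.data[key] = value
        self.version = version

class Primary(Replica):
    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version

class Session:
    """Tracks the highest version this client has written or read and
    only accepts reads from nodes at or past that version."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.min_version = 0

    def write(self, key, value):
        self.min_version = self.primary.write(key, value)  # read-your-writes

    def read(self, key):
        # Prefer replicas; fall back to the primary if all are stale.
        for node in self.replicas + [self.primary]:
            if node.version >= self.min_version:
                self.min_version = max(self.min_version, node.version)  # monotonic reads
                return node.data.get(key)
        raise RuntimeError("no sufficiently fresh node; retry in practice")
```

A write bumps the session's minimum version, so a subsequent read skips any replica that has not yet replayed that write.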
Distributed Concurrency Control
Extend single-node concurrency control to multiple nodes:
- Distributed 2PL: Locks held on the node storing the data. Coordinator manages lock acquisition across nodes. Risk of distributed deadlock.
- Distributed deadlock detection: Build a global waits-for graph. Or use timeouts (simpler, less precise).
- Spanner (Google): Uses TrueTime (synchronized clocks with bounded uncertainty) for globally ordered transactions. Externally consistent without centralized coordination.
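The global waits-for check mentioned above is cycle detection on a directed graph. A minimal sketch (graph shape assumed: transaction ID mapped to the transactions it waits on), using depth-first search:

```python
def has_deadlock(waits_for):
    """waits_for: dict mapping each transaction to the transactions it
    waits on. Returns True iff the graph contains a cycle (deadlock)."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on DFS stack / done
    color = {t: WHITE for t in waits_for}

    def visit(t):
        color[t] = GRAY
        for u in waits_for.get(t, []):
            if color.get(u, WHITE) == GRAY:
                return True        # back edge: t waits on an ancestor
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(waits_for))
```

In a distributed setting the hard part is assembling `waits_for` consistently from per-node lock tables, which is why many systems fall back to timeouts.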
Applications in CS
- Web-scale applications: Facebook, Google, Amazon — data too large for one machine. Sharding + replication essential.
- Microservices: Each service may have its own database (polyglot persistence). Cross-service data consistency via sagas or event sourcing.
- Global applications: Geo-replicated databases (Spanner, CockroachDB, YugabyteDB) serve users from the nearest replica.
- Analytics: Distributed query engines (Spark SQL, Presto/Trino) query data across multiple sources.
- Caching: Distributed caches (Redis Cluster, Memcached) shard hot data across nodes.
- Streaming: Kafka stores streams across partitions (a form of distributed log/database).