Distributed Databases Introduction
A distributed database spreads data across multiple machines. This introduces new challenges: network partitions, consistency across nodes, and distributed transactions.
Data Distribution
Horizontal Fragmentation (Sharding)
Split rows across nodes based on a partitioning key.
Users table, shard by user_id:
Shard 1 (Node A): user_id 1-1000
Shard 2 (Node B): user_id 1001-2000
Shard 3 (Node C): user_id 2001-3000
Partitioning strategies:
- Range: Partition by key range (1-1000, 1001-2000). Good for range queries. Risk of hotspots (recent data on one shard).
- Hash: Partition by hash(key) mod N. Even distribution, but range queries span all shards, and changing N forces most keys to move (consistent hashing mitigates this).
- Directory: A lookup table maps each key to a shard. Flexible but adds indirection.
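The range and hash strategies above can be sketched in a few lines of Python. The shard count, boundary values, and function names are illustrative, not from any particular system:

```python
import hashlib

NUM_SHARDS = 3  # assumed cluster size for illustration

def hash_shard(key: str) -> int:
    # Hash partitioning: even spread, but hash(key) mod N reshuffles
    # most keys whenever N changes.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Range partitioning: each entry is the exclusive upper bound of a shard,
# matching the users-table example above (1-1000, 1001-2000, 2001-3000).
RANGE_BOUNDS = [1001, 2001, 3001]

def range_shard(user_id: int) -> int:
    for shard, upper in enumerate(RANGE_BOUNDS):
        if user_id < upper:
            return shard
    raise ValueError(f"user_id {user_id} outside partition map")
```

A directory scheme would replace both functions with a lookup in a (replicated) key-to-shard table.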
Vertical Fragmentation
Split columns across nodes: frequently accessed columns on fast storage, rarely used columns on cheaper storage.
Node A: users(id, name, email) — hot data
Node B: users(id, bio, preferences) — cold data
Hybrid Fragmentation
Combine horizontal and vertical fragmentation.
Data Replication
Why Replicate?
- Availability: If one node fails, others can serve requests.
- Read scalability: Spread reads across replicas.
- Latency: Place replicas near users (geo-replication).
Synchronous Replication
Write is acknowledged only after all replicas confirm. Strong consistency but slow (wait for the slowest replica).
Client → Primary → [Write to Replica 1] + [Write to Replica 2]
← ACK from both replicas
← ACK to client
Asynchronous Replication
Write is acknowledged as soon as the primary has written locally; replicas are updated later. Fast, but risks data loss if the primary fails before replication completes.
Client → Primary → ACK to client (immediately)
→ Background replication to Replica 1, Replica 2
Replication lag: Replicas may be seconds (or more) behind the primary. Reading from a replica may return stale data.
Semi-Synchronous
Primary waits for at least one replica to confirm. Balance of speed and safety.
Distributed Query Processing
Query Routing
Client-side routing: Client knows the partition map and sends queries directly to the right shard.
Proxy/coordinator: A middle layer routes queries (e.g., Vitess for MySQL, Citus for PostgreSQL). Note that a plain connection pooler like pgBouncer does not route by shard.
Cross-Shard Queries
Queries spanning multiple shards require scatter-gather:
SELECT COUNT(*) FROM users WHERE age > 25:
→ Send to Shard 1: COUNT(*) WHERE age > 25 → 342
→ Send to Shard 2: COUNT(*) WHERE age > 25 → 521
→ Send to Shard 3: COUNT(*) WHERE age > 25 → 187
→ Coordinator: SUM = 1050
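The scatter-gather flow above can be sketched with in-memory lists standing in for shards (the data and counts here are illustrative, not the 342/521/187 from the diagram):

```python
# Hypothetical shards: each list plays the role of one node's local data.
shards = [
    [{"id": 1, "age": 30}, {"id": 2, "age": 22}],
    [{"id": 1001, "age": 41}],
    [{"id": 2001, "age": 19}, {"id": 2002, "age": 27}],
]

def local_count(shard, predicate):
    # Scatter phase: each shard evaluates the predicate on its own rows.
    return sum(1 for row in shard if predicate(row))

def scatter_gather_count(predicate):
    # Gather phase: the coordinator sums the partial counts.
    partials = [local_count(s, predicate) for s in shards]
    return sum(partials)
```

Note that COUNT and SUM compose this way; an aggregate like AVG must be gathered as (SUM, COUNT) pairs rather than averaged per shard.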
Distributed joins: Cross-shard joins are expensive (data must be shuffled). Design schemas to minimize cross-shard queries (co-locate related data on the same shard).
Distributed Transactions
Two-Phase Commit (2PC)
Coordinator orchestrates the commit across participants.
Phase 1 (Prepare/Vote):
Coordinator → all participants: "Prepare to commit"
Participant 1: writes WAL, locks resources → votes YES
Participant 2: writes WAL, locks resources → votes YES
Phase 2 (Commit/Abort):
If all voted YES:
Coordinator → all: "COMMIT" → participants commit and release locks
If any voted NO:
Coordinator → all: "ABORT" → participants rollback and release locks
Problem: Blocking. If the coordinator crashes after Phase 1, participants are stuck holding locks (waiting for the coordinator's decision). Recovery requires logging the coordinator's decision.
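The two phases can be sketched as follows. This is a single-process toy (class and method names are made up); a real participant would write a WAL record and hold locks between prepare and the coordinator's decision:

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # whether this node will vote YES
        self.state = "init"

    def prepare(self):
        # Real systems: force a WAL record to disk and lock resources here.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit  # the participant's vote

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect every vote (no short-circuiting, so all
    # participants know a decision is pending).
    votes = [p.prepare() for p in participants]
    # The coordinator must durably log its decision before Phase 2;
    # otherwise a coordinator crash leaves prepared participants blocked.
    if all(votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.abort()
    return "ABORT"
```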
Three-Phase Commit (3PC)
Adds a pre-commit phase between vote and commit. Reduces blocking, but it assumes bounded network delays, still fails in edge cases (notably network partitions), and is rarely used in practice.
Saga Pattern
For long-running distributed transactions in microservices:
Instead of one big transaction, execute a sequence of local transactions. If one fails, execute compensating transactions to undo the earlier steps.
1. Create order (local tx)
2. Reserve inventory (local tx)
3. Charge payment (local tx)
→ If payment fails at step 3:
2a. Release inventory (compensating tx for step 2)
1a. Cancel order (compensating tx for step 1)
Choreography: Each service publishes events, and the next service reacts.
Orchestration: A central coordinator directs the saga steps.
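An orchestrated saga can be sketched as a loop over (action, compensation) pairs; on failure, the compensations for completed steps run in reverse order. The step functions and log are illustrative:

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs of callables."""
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo every completed step, newest first.
            for comp in reversed(completed):
                comp()
            return "rolled back"
        completed.append(compensate)
    return "completed"

log = []

def charge_payment():
    raise RuntimeError("payment declined")  # simulated failure at step 3

steps = [
    (lambda: log.append("order created"),  lambda: log.append("order cancelled")),
    (lambda: log.append("inventory held"), lambda: log.append("inventory released")),
    (charge_payment,                       lambda: log.append("payment refunded")),
]
```

Running `run_saga(steps)` rolls back: the inventory is released, then the order is cancelled. In a choreographed saga the same compensations would be triggered by failure events instead of a central loop.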
CAP Theorem
Brewer's conjecture (2000), later proven by Gilbert and Lynch (2002): A distributed system can guarantee at most two of three properties:
- Consistency (C): All nodes see the same data at the same time (linearizability).
- Availability (A): Every request receives a response (not guaranteed to be the latest).
- Partition tolerance (P): System continues operating despite network partitions.
In practice: Network partitions WILL happen. So the real choice is between CP (consistent but may reject requests during partitions) and AP (available but may serve stale data during partitions).
| Choice | Properties | Examples |
|---|---|---|
| CP | Consistent + Partition-tolerant | HBase, ZooKeeper, etcd, CockroachDB (reject writes during a partition) |
| AP | Available + Partition-tolerant | Cassandra, DynamoDB, Riak (serve possibly stale data during a partition) |
PACELC (Abadi, 2010): Extension of CAP. "If there's a Partition, choose A or C. Else (normally), choose Latency or Consistency." Most systems make different tradeoffs for normal vs partitioned operation.
Consistency Models
Strong Consistency (Linearizability)
Every read returns the most recent write. All operations appear to occur in a single global order. The gold standard but expensive.
Eventual Consistency
If no new updates occur, all replicas will eventually converge to the same value. No guarantee about how long "eventually" is. Reads may return stale data.
Used by: DNS, CDN caches, Cassandra (tunable), DynamoDB (eventual by default).
Causal Consistency
If operation A causally precedes operation B (A happened before B and could have influenced it), then all nodes see A before B. Operations without causal relationship can be seen in any order.
Stronger than eventual, weaker than strong. Good balance for many applications.
Read-Your-Writes
After a write, subsequent reads by the same client see the write (or a later value). Writes by other clients may not be visible yet.
Implementation: Route reads to the same replica that accepted the write, or use a version/timestamp to ensure freshness.
Monotonic Reads
Once a client reads value v, it will never see an older value in subsequent reads. No "going back in time."
Implementation: Track the latest version seen by each client. Route reads accordingly.
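Both session guarantees can be sketched with version tracking. This toy (class names and the version scheme are assumptions, not a real client library) gives read-your-writes and monotonic reads by refusing to read from replicas that lag behind what the session has already seen:

```python
class Replica:
    def __init__(self):
        self.version = 0   # highest write this replica has applied
        self.data = {}

    def apply(self, key, value, version):
        # Simulates asynchronous replication catching up.
        self.data[key] = value
        self.version = version

class Primary(Replica):
    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version

class Session:
    """Tracks the highest version this client has written or read and
    only accepts reads from nodes at or past that version."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.min_version = 0

    def write(self, key, value):
        self.min_version = self.primary.write(key, value)  # read-your-writes

    def read(self, key):
        # Prefer replicas; fall back to the primary if all are stale.
        for node in self.replicas + [self.primary]:
            if node.version >= self.min_version:
                self.min_version = max(self.min_version, node.version)  # monotonic reads
                return node.data.get(key)
        raise RuntimeError("no sufficiently fresh node; retry in practice")
```

A write bumps the session's minimum version, so a subsequent read skips any replica that has not yet replayed that write.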
Distributed Concurrency Control
Extend single-node concurrency control to multiple nodes:
- Distributed 2PL: Locks held on the node storing the data. Coordinator manages lock acquisition across nodes. Risk of distributed deadlock.
- Distributed deadlock detection: Build a global waits-for graph. Or use timeouts (simpler, less precise).
- Spanner (Google): Uses TrueTime (synchronized clocks with bounded uncertainty) for globally ordered transactions. Externally consistent without centralized coordination.
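The global waits-for check mentioned above is cycle detection on a directed graph. A minimal sketch (graph shape assumed: transaction ID mapped to the transactions it waits on), using depth-first search:

```python
def has_deadlock(waits_for):
    """waits_for: dict mapping each transaction to the transactions it
    waits on. Returns True iff the graph contains a cycle (deadlock)."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / on DFS stack / done
    color = {t: WHITE for t in waits_for}

    def visit(t):
        color[t] = GRAY
        for u in waits_for.get(t, []):
            if color.get(u, WHITE) == GRAY:
                return True        # back edge: t waits on an ancestor
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and visit(t) for t in list(waits_for))
```

In a distributed setting the hard part is assembling `waits_for` consistently from per-node lock tables, which is why many systems fall back to timeouts.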
Applications in CS
- Web-scale applications: Facebook, Google, Amazon — data too large for one machine. Sharding + replication essential.
- Microservices: Each service may have its own database (polyglot persistence). Cross-service data consistency via sagas or event sourcing.
- Global applications: Geo-replicated databases (Spanner, CockroachDB, YugabyteDB) serve users from the nearest replica.
- Analytics: Distributed query engines (Spark SQL, Presto/Trino) query data across multiple sources.
- Caching: Distributed caches (Redis Cluster, Memcached) shard hot data across nodes.
- Streaming: Kafka stores streams across partitions (a form of distributed log/database).