
Distributed Databases Introduction

A distributed database spreads data across multiple machines. This introduces new challenges: network partitions, consistency across nodes, and distributed transactions.

Data Distribution

Horizontal Fragmentation (Sharding)

Split rows across nodes based on a partitioning key.

Users table, shard by user_id:
  Shard 1 (Node A): user_id 1-1000
  Shard 2 (Node B): user_id 1001-2000
  Shard 3 (Node C): user_id 2001-3000

Partitioning strategies:

  • Range: Partition by key range (1-1000, 1001-2000). Good for range queries. Risk of hotspots (recent data on one shard).
  • Hash: Partition by hash(key) mod N. Even distribution. Range queries span all shards.
  • Directory: A lookup table maps each key to a shard. Flexible but adds indirection.
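The three strategies can be sketched as shard-selection functions. This is a minimal illustration, assuming three shards; the shard counts, ranges, and the directory entries are made up for the example.

```python
NUM_SHARDS = 3

def range_shard(user_id: int) -> int:
    """Range partitioning: contiguous key ranges map to shards (good for range scans)."""
    if user_id <= 1000:
        return 0
    elif user_id <= 2000:
        return 1
    return 2

def hash_shard(user_id: int) -> int:
    """Hash partitioning: even spread, but a range query must hit every shard."""
    return hash(user_id) % NUM_SHARDS

# Directory partitioning: an explicit lookup table — flexible, but one extra hop.
directory = {42: 0, 43: 2, 99: 1}

def directory_shard(user_id: int) -> int:
    return directory[user_id]

print(range_shard(1500))   # → 1
print(directory_shard(43)) # → 2
```

Note that `hash(key) mod N` reassigns most keys when N changes; real systems use consistent hashing or pre-split ranges to avoid mass data movement on resharding.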

Vertical Fragmentation

Split columns across nodes. Frequently-accessed columns on fast storage, rarely-used columns on cheaper storage.

Node A: users(id, name, email)        — hot data
Node B: users(id, bio, preferences)   — cold data

Hybrid Fragmentation

Combine horizontal and vertical fragmentation.

Data Replication

Why Replicate?

  • Availability: If one node fails, others can serve requests.
  • Read scalability: Spread reads across replicas.
  • Latency: Place replicas near users (geo-replication).

Synchronous Replication

Write is acknowledged only after all replicas confirm. Strong consistency but slow (wait for the slowest replica).

Client → Primary → [Write to Replica 1] + [Write to Replica 2]
                   ← ACK from both replicas
       ← ACK to client

Asynchronous Replication

Write is acknowledged after primary writes. Replicas updated later. Fast but risk of data loss (if primary fails before replication).

Client → Primary → ACK to client (immediately)
                → Background replication to Replica 1, Replica 2

Replication lag: Replicas may be seconds (or more) behind the primary. Reading from a replica may return stale data.

Semi-Synchronous

Primary waits for at least one replica to confirm. Balance of speed and safety.
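The semi-synchronous handshake can be sketched with threads: the primary fans out the write and acknowledges as soon as the first replica confirms. This is a toy simulation (in-memory dicts stand in for replicas, `time.sleep` for network delay), not a real replication protocol.

```python
import random
import threading
import time

def replicate(replica_id, ack_event, stores, key, value, delay):
    """Apply the write on one replica after a simulated network delay."""
    time.sleep(delay)
    stores[replica_id][key] = value
    ack_event.set()  # the first replica to finish unblocks the primary

def semi_sync_write(stores, key, value):
    """Primary acks once ANY one replica confirms (semi-synchronous)."""
    ack = threading.Event()
    for rid in stores:
        delay = random.uniform(0.01, 0.05)
        threading.Thread(
            target=replicate, args=(rid, ack, stores, key, value, delay)
        ).start()
    ack.wait()  # block until at least one replica has the write
    return "ACK"

stores = {"replica1": {}, "replica2": {}}
print(semi_sync_write(stores, "x", 1))  # → ACK (after the fastest replica)
```

With this scheme a committed write survives the loss of the primary as long as the one confirming replica survives; the slower replica may still lag.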

Distributed Query Processing

Query Routing

Client-side routing: Client knows the partition map and sends queries directly to the right shard.

Proxy/coordinator: A middle layer routes queries to the right shard (e.g., Vitess for MySQL, Citus for PostgreSQL).

Cross-Shard Queries

Queries spanning multiple shards require scatter-gather:

SELECT COUNT(*) FROM users WHERE age > 25:
  → Send to Shard 1: COUNT(*) WHERE age > 25 → 342
  → Send to Shard 2: COUNT(*) WHERE age > 25 → 521
  → Send to Shard 3: COUNT(*) WHERE age > 25 → 187
  → Coordinator: SUM = 1050
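The scatter-gather flow above can be sketched in a few lines: each shard evaluates the predicate locally and the coordinator combines the partial counts. The shard contents here are made-up sample rows.

```python
# Each inner list stands in for the rows stored on one shard.
shards = [
    [{"age": 30}, {"age": 22}, {"age": 41}],  # shard 1
    [{"age": 26}, {"age": 19}],               # shard 2
    [{"age": 55}],                            # shard 3
]

def shard_count(rows, predicate):
    """Runs locally on one shard — the 'scatter' half of the query."""
    return sum(1 for row in rows if predicate(row))

def scatter_gather_count(shards, predicate):
    partials = [shard_count(rows, predicate) for rows in shards]  # scatter
    return sum(partials)                                          # gather

print(scatter_gather_count(shards, lambda r: r["age"] > 25))  # → 4
```

COUNT and SUM combine trivially; aggregates like AVG or percentiles need the coordinator to merge richer partial state (e.g., sum and count per shard), not just add results.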

Distributed joins: Cross-shard joins are expensive (data must be shuffled). Design schemas to minimize cross-shard queries (co-locate related data on the same shard).

Distributed Transactions

Two-Phase Commit (2PC)

Coordinator orchestrates the commit across participants.

Phase 1 (Prepare/Vote):

Coordinator → all participants: "Prepare to commit"
Participant 1: writes WAL, locks resources → votes YES
Participant 2: writes WAL, locks resources → votes YES

Phase 2 (Commit/Abort):

If all voted YES:
  Coordinator → all: "COMMIT" → participants commit and release locks
If any voted NO:
  Coordinator → all: "ABORT" → participants rollback and release locks

Problem: Blocking. If the coordinator crashes after Phase 1, participants are stuck holding locks (waiting for the coordinator's decision). Recovery requires logging the coordinator's decision.
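The two phases can be sketched with toy participants. Real systems write WAL records at every step and hold locks between prepare and the outcome; here that is reduced to a state string, and `will_commit` simulates a participant's vote.

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "INIT"

    def prepare(self):
        """Phase 1: (in a real system) write WAL, lock resources, then vote."""
        self.state = "PREPARED" if self.will_commit else "ABORTED"
        return self.will_commit  # YES/NO vote

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]  # Phase 1: collect votes
    if all(votes):
        for p in participants:                   # Phase 2: all voted YES
            p.commit()
        return "COMMITTED"
    for p in participants:                       # Phase 2: someone voted NO
        p.abort()
    return "ABORTED"

ok = two_phase_commit([Participant("A"), Participant("B")])
bad = two_phase_commit([Participant("A"), Participant("B", will_commit=False)])
print(ok, bad)  # → COMMITTED ABORTED
```

The blocking problem is visible in the structure: between `prepare()` and the Phase 2 call, a participant cannot unilaterally decide — it must wait for the coordinator.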

Three-Phase Commit (3PC)

Adds a pre-commit phase between vote and commit. Reduces blocking but still has edge cases and is rarely used in practice.

Saga Pattern

For long-running distributed transactions in microservices:

Instead of one big transaction, execute a sequence of local transactions. If one fails, execute compensating transactions to undo the earlier steps.

1. Create order (local tx)
2. Reserve inventory (local tx)
3. Charge payment (local tx)
   → If payment fails:
     2a. Release inventory (compensating tx)
     1a. Cancel order (compensating tx)

Choreography: Each service publishes events, and the next service reacts.

Orchestration: A central coordinator directs the saga steps.
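An orchestrated saga can be sketched as a list of (action, compensation) pairs run in order, with compensations replayed in reverse on failure. The step and compensation actions here are illustrative stand-ins for real service calls.

```python
def run_saga(steps):
    """Run local transactions in order; on failure, compensate in reverse."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for comp in reversed(done):  # undo the earlier steps
                comp()
            return "ROLLED BACK"
    return "COMPLETED"

log = []

def charge_payment():
    raise RuntimeError("payment declined")  # simulated failure at step 3

steps = [
    (lambda: log.append("order created"),  lambda: log.append("order cancelled")),
    (lambda: log.append("inventory held"), lambda: log.append("inventory released")),
    (charge_payment,                       lambda: None),
]

result = run_saga(steps)
print(result, log)
# → ROLLED BACK ['order created', 'inventory held', 'inventory released', 'order cancelled']
```

Unlike 2PC, no locks are held across steps; the tradeoff is that intermediate states (order created, inventory held) are visible to other transactions before the saga completes.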

CAP Theorem

Brewer's theorem (2000): A distributed system can guarantee at most two of three properties:

  • Consistency (C): All nodes see the same data at the same time (linearizability).
  • Availability (A): Every request receives a response (not guaranteed to be the latest).
  • Partition tolerance (P): System continues operating despite network partitions.

In practice: Network partitions WILL happen. So the real choice is between CP (consistent but may reject requests during partitions) and AP (available but may serve stale data during partitions).

| Choice | Guarantees | Examples |
|---|---|---|
| CP | Consistent + Partition-tolerant | HBase, ZooKeeper, etcd, CockroachDB (reject writes during a partition) |
| AP | Available + Partition-tolerant | Cassandra, DynamoDB, Riak (serve possibly stale data during a partition) |

PACELC (Abadi, 2010): Extension of CAP. "If there's a Partition, choose A or C. Else (normally), choose Latency or Consistency." Most systems make different tradeoffs for normal vs partitioned operation.

Consistency Models

Strong Consistency (Linearizability)

Every read returns the most recent write. All operations appear to occur in a single global order. The gold standard but expensive.

Eventual Consistency

If no new updates occur, all replicas will eventually converge to the same value. No guarantee about how long "eventually" is. Reads may return stale data.

Used by: DNS, CDN caches, Cassandra (tunable), DynamoDB (eventual by default).

Causal Consistency

If operation A causally precedes operation B (A happened before B and could have influenced it), then all nodes see A before B. Operations without causal relationship can be seen in any order.

Stronger than eventual, weaker than strong. Good balance for many applications.

Read-Your-Writes

After a write, subsequent reads by the same client see the write (or a later value). Writes by other clients may not be visible yet.

Implementation: Route reads to the same replica that accepted the write, or use a version/timestamp to ensure freshness.

Monotonic Reads

Once a client reads value v, it will never see an older value in subsequent reads. No "going back in time."

Implementation: Track the latest version seen by each client. Route reads accordingly.
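Both session guarantees can be implemented with version tracking, as the two implementation notes above describe. This sketch assumes each replica exposes its last-applied version number; the client remembers the highest version it has observed (from its own writes and reads) and refuses to read from a replica that is behind it.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.version = 0  # last-applied version number

    def apply(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version

class Client:
    def __init__(self, replicas):
        self.replicas = replicas
        self.min_version = 0  # highest version this client has observed

    def write(self, primary, key, value):
        # Read-your-writes: remember the version of our own write.
        self.min_version = primary.apply(key, value)

    def read(self, key):
        # Monotonic reads: only accept replicas at least as fresh as
        # anything we've already seen.
        for r in self.replicas:
            if r.version >= self.min_version:
                self.min_version = max(self.min_version, r.version)
                return r.data.get(key)
        raise RuntimeError("no sufficiently fresh replica")

primary = Replica()
stale = Replica()                  # lagging replica, still at version 0
client = Client([stale, primary])  # would otherwise prefer the stale one
client.write(primary, "x", 1)
print(client.read("x"))  # → 1 (stale replica skipped: version too low)
```

Note this is per-client state only; other clients with no observed version may still read the stale replica, which is exactly what these session guarantees permit.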

Distributed Concurrency Control

Extend single-node concurrency control to multiple nodes:

  • Distributed 2PL: Locks held on the node storing the data. Coordinator manages lock acquisition across nodes. Risk of distributed deadlock.
  • Distributed deadlock detection: Build a global waits-for graph. Or use timeouts (simpler, less precise).
  • Spanner (Google): Uses TrueTime (synchronized clocks with bounded uncertainty) for globally ordered transactions. Externally consistent without centralized coordination.
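Global deadlock detection from the list above reduces to cycle detection: each node ships its local waits-for edges to a detector, which searches the merged graph. A minimal DFS sketch over a `{txn: [txns it waits on]}` dict (transaction names are illustrative):

```python
def has_cycle(waits_for):
    """DFS cycle detection over a global waits-for graph."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / finished
    color = {t: WHITE for t in waits_for}

    def dfs(t):
        color[t] = GRAY
        for u in waits_for.get(t, []):
            if color.get(u, WHITE) == GRAY:
                return True  # back edge to the current path → deadlock
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK
        return False

    return any(color[t] == WHITE and dfs(t) for t in waits_for)

# Edges merged from two nodes: T1 waits on T2 (seen on node A),
# T2 waits on T1 (seen on node B) — a distributed deadlock.
print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))  # → True
print(has_cycle({"T1": ["T2"], "T2": []}))      # → False
```

The timeout alternative mentioned above avoids building this graph at all, at the cost of occasionally aborting transactions that were merely slow, not deadlocked.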

Applications in CS

  • Web-scale applications: Facebook, Google, Amazon — data too large for one machine. Sharding + replication essential.
  • Microservices: Each service may have its own database (polyglot persistence). Cross-service data consistency via sagas or event sourcing.
  • Global applications: Geo-replicated databases (Spanner, CockroachDB, YugabyteDB) serve users from the nearest replica.
  • Analytics: Distributed query engines (Spark SQL, Presto/Trino) query data across multiple sources.
  • Caching: Distributed caches (Redis Cluster, Memcached) shard hot data across nodes.
  • Streaming: Kafka stores streams across partitions (a form of distributed log/database).