Recovery

Database recovery ensures that committed data survives crashes and uncommitted data is rolled back. The Write-Ahead Log (WAL) is the cornerstone of recovery in virtually all modern databases.

Failure Types

Transaction Failure

A single transaction aborts (constraint violation, deadlock, application error). The DBMS rolls back that transaction's changes. No system-wide impact.

System Failure (Crash)

Software crash, OS crash, or power loss. Memory contents lost. Disk contents survive (assumed). On restart: recover using the WAL.

Media Failure

Disk corruption or failure. Requires backups + WAL archives for recovery.

Write-Ahead Logging (WAL)

WAL Rule: Before any change is written to the database (data pages on disk), the corresponding log record must be written to the log on stable storage (disk, preferably fsync'd).

Transaction:  UPDATE balance SET amount = 200 WHERE id = 1

Log record:  <T1, UPDATE, accounts, row_id=1, old=100, new=200, LSN=42>

Sequence:
1. Write log record to WAL (on disk)
2. Modify the page in the buffer pool (in memory)
3. Eventually flush the dirty page to disk (maybe much later)

Why WAL? If the system crashes after step 2 but before step 3, the WAL has the record needed to redo the change. If the transaction didn't commit, the WAL has the old value needed to undo.

Log Records

Each log record contains:

Field	Content
LSN	Log Sequence Number (monotonically increasing)
Transaction ID	Which transaction
Operation	INSERT, UPDATE, DELETE
Table/Page/Row	What was modified
Before image	Old value (for undo)
After image	New value (for redo)
Prev LSN	Previous log record for this transaction (linked list)

Special records: BEGIN, COMMIT, ABORT, CHECKPOINT.

Undo vs Redo Logging

Undo logging: Log the old value before modifying. On crash: undo uncommitted changes (restore old values).

Constraint: Dirty pages must be flushed before commit. (If not flushed and crash occurs, undo would restore values that were never written.)

Redo logging: Log the new value before modifying. On crash: redo committed changes (apply new values).

Constraint: Dirty pages must NOT be flushed until commit. (If flushed before commit and crash occurs, no redo is needed but the transaction didn't commit — can't undo because old values aren't logged.)

Undo/Redo logging (ARIES approach): Log both old and new values. No constraints on when dirty pages are flushed. Most flexible. Used by all modern databases.

Buffer Management Policies

Steal vs No-Steal

Steal: Dirty pages can be flushed to disk before the transaction commits (to make room in the buffer pool). Requires undo capability (WAL has old values).

No-steal: Dirty pages are NOT flushed until the transaction commits. No undo needed. But limits buffer pool utilization (can't evict dirty pages of active transactions).

Force vs No-Force

Force: All dirty pages are flushed to disk at commit time. No redo needed (all committed data is on disk). But commit is slow (must wait for I/O).

No-force: Dirty pages are NOT necessarily flushed at commit. Only the WAL is forced to disk. Requires redo capability. Fast commits.

Practical Choice

Steal/No-force (ARIES approach): Most flexible. Fast commits (only WAL flushed). Buffer pool fully utilized. Requires both undo and redo during recovery.

All major databases use this: PostgreSQL, MySQL/InnoDB, Oracle, SQL Server.

ARIES (Algorithms for Recovery and Isolation Exploiting Semantics)

The most influential recovery algorithm (IBM, Mohan et al., 1992). Used (with variations) by virtually all modern RDBMS.

Three Principles

Write-ahead logging: Log before modifying.
Repeating history during redo: On recovery, replay the WAL to bring the database to the exact state at the time of crash.
Logging changes during undo: When undoing a transaction during recovery, log the undo operations (CLRs — Compensation Log Records) to avoid re-undoing on another crash.

Recovery Process (Three Phases)

1. Analysis Phase

Scan the WAL from the last checkpoint to the end:

Determine which transactions were active at crash (loser transactions — must be undone).
Determine which pages might be dirty (dirty page table).
Determine the starting point for the redo phase.

2. Redo Phase

Replay the WAL forward from the determined starting point:

For each log record: if the page's on-disk LSN < log record's LSN, redo the operation (the change might not have reached disk).
Redo everything — committed and uncommitted transactions alike. This restores the database to the exact crash state.

"Repeating history": Even uncommitted transactions are redone. Their undo will happen in the next phase.

3. Undo Phase

Roll back all loser transactions (active at crash, never committed):

Process undo records in reverse LSN order (newest first).
For each undo: write a CLR (Compensation Log Record) to the WAL. The CLR ensures that if the system crashes during recovery, we don't redo already-undone work.
Continue until all loser transactions are fully rolled back.

Why ARIES Works

Redo ensures all committed data is on disk.
Undo ensures all uncommitted data is removed.
CLRs ensure recovery is idempotent (can crash and restart recovery safely).
No constraints on when dirty pages are flushed (steal/no-force).

Checkpointing

Periodic checkpoints reduce recovery time by recording the current database state.

Simple Checkpoint

Stop accepting new transactions.
Wait for all active transactions to finish.
Flush all dirty pages to disk.
Write a checkpoint record to the WAL.

Problem: System is unavailable during checkpoint. Unacceptable for production.

Fuzzy Checkpoint

Write a BEGIN_CHECKPOINT record.
Record the dirty page table and active transaction table.
Write an END_CHECKPOINT record.
Don't wait for transactions to finish. Don't flush dirty pages (they'll be flushed lazily).

Recovery starts from the BEGIN_CHECKPOINT position (not the end of the log). Dirty page table tells which pages might need redo.

Fast and non-blocking. Used by all modern databases.

Checkpoint Frequency

More frequent checkpoints → faster recovery (less WAL to replay) but more overhead during normal operation.

Typical: every few minutes, or based on WAL size (e.g., every 1 GB of WAL).

Physiological Logging

ARIES uses physiological logging: log records identify the page (physical) and the operation within the page (logical).

<T1, page_42, INSERT, slot_5, row_data>

Why not physical (byte-level)? Too verbose. Every byte change requires a log record.

Why not logical (SQL-level)? Redo might not be deterministic (e.g., INSERT might go to a different page on replay if the table has grown).

Physiological: The page is fixed (physical), but the operation within the page is described logically. Good balance of space and determinism.

Shadow Paging

An alternative to WAL-based recovery (rarely used in practice).

Maintain two copies of each page: the current version and a shadow (clean) version. Modifications go to the current version. On commit: update the page table to point to the new versions. On abort: discard current versions, revert to shadow.

Advantages: No WAL needed for recovery. Atomic commit via page table swap.

Disadvantages: Expensive (lots of I/O). Fragmentation. Poor for concurrent transactions. Mostly of historical interest (System R used it initially before WAL).

Modern use: SQLite uses a variant (rollback journal or WAL mode).

Point-in-Time Recovery (PITR)

Restore the database to any specific point in time.

Mechanism:

Start from a base backup (full database dump).
Replay WAL records up to the desired timestamp.

Base backup (Monday 00:00)
    + WAL records (Monday 00:00 → Wednesday 14:30)
    = Database state at Wednesday 14:30

PostgreSQL: Continuous archiving + WAL shipping. pg_basebackup for base backup. restore_command + recovery_target_time for PITR.

Use cases: Recover from accidental data deletion ("Oh no, I ran DELETE without WHERE"). Disaster recovery. Database migration.

Applications in CS

Database administration: Understanding recovery helps configure WAL settings, checkpoint frequency, backup strategies.
System design: Durability requirements drive architecture (synchronous replication, WAL shipping, PITR).
Performance tuning: WAL flush on commit is often the bottleneck. Group commit, async commit, and WAL compression help.
Distributed databases: Each node has its own WAL. Distributed recovery coordinates across nodes.
Storage engines: Building a storage engine requires implementing WAL, buffer management, and recovery. RocksDB, WiredTiger, InnoDB all implement ARIES variants.