7 min read
On this page

Process Architecture

The Postmaster

PostgreSQL uses a multi-process architecture, not a multi-threaded one. At the top sits the postmaster, the supervisor process that starts when you launch the server. The postmaster has three jobs:

  1. Listen for incoming client connections
  2. Fork a new backend process for each connection
  3. Manage background worker processes

Every other Postgres process is a child of the postmaster. If the postmaster dies, everything dies. This is by design: it prevents orphaned processes from corrupting shared memory.

$ ps aux | grep postgres
postgres  1234  postmaster -D /var/lib/postgresql/16/main
postgres  1240  postgres: checkpointer
postgres  1241  postgres: background writer
postgres  1242  postgres: walwriter
postgres  1243  postgres: autovacuum launcher
postgres  1244  postgres: stats collector
postgres  1245  postgres: logical replication launcher
postgres  1300  postgres: myuser mydb 192.168.1.10(54321) idle
postgres  1301  postgres: myuser mydb 192.168.1.11(54322) active

Each line is a separate OS process with its own memory space.

One Process Per Connection

When a client connects, the postmaster forks a new backend process dedicated to that connection. This backend handles all queries for that client: parsing, planning, and executing SQL. When the client disconnects, the backend exits.

This model has important implications:

  • Isolation: A crash in one backend does not affect other connections. The postmaster detects the crash and restarts shared memory if needed.
  • Memory overhead: Each backend consumes 5-10MB of memory at minimum, more when running complex queries. 500 connections means 2.5-5GB just for backend overhead.
  • Context switching: Hundreds of active backends competing for CPU cores causes context-switch overhead. The kernel scheduler thrashes.
  • Connection pooling is essential: Applications that open/close connections frequently pay a fork cost each time. PgBouncer or a similar pooler reuses backends across client requests.
-- See active connections
SELECT pid, usename, application_name, client_addr, state, query
FROM pg_stat_activity
WHERE backend_type = 'client backend';

Shared Memory

All backend processes communicate through shared memory. The postmaster allocates a shared memory segment at startup, and every backend maps it into its address space.

Shared Buffers

The largest shared memory component. This is Postgres's buffer cache: a pool of 8KB pages that mirror data file pages on disk. When a backend needs to read a row, it first checks shared buffers. If the page is there (a cache hit), no disk I/O occurs. If not, the backend reads the page from disk into a free buffer slot.

-- Check buffer cache hit ratio
SELECT
    sum(blks_hit) AS cache_hits,
    sum(blks_read) AS disk_reads,
    round(sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0) * 100, 2) AS hit_ratio
FROM pg_stat_database;

A healthy production system should show a hit ratio above 99%. If you are below 95%, shared_buffers is likely too small or your working set exceeds available memory.

WAL Buffers

A smaller shared memory region where WAL records accumulate before flushing to disk. Backends write WAL records here during transactions. The WAL writer process (or the committing backend itself) flushes these buffers to the WAL segment files on disk.

Lock Tables

Shared memory also holds the lock manager's data structures. Every row-level lock, table-level lock, and advisory lock is tracked here. The lock table is finite: if you acquire millions of row locks in a single transaction, you can exhaust it.

CLOG (Commit Log)

The commit log tracks the status of every transaction ID: in-progress, committed, or aborted. Backends consult the CLOG to determine transaction visibility (whether a row version is visible to the current snapshot).

Background Processes

The postmaster starts several background worker processes at boot. Each handles a specific maintenance task.

Checkpointer

Periodically writes all dirty pages from shared buffers to the data files on disk. A checkpoint ensures that everything up to a certain WAL position is safely on disk. This allows old WAL segments to be recycled.

Checkpoints are triggered by:

  • Time interval (checkpoint_timeout, default 5 minutes)
  • WAL volume (max_wal_size, default 1GB)
  • Manual CHECKPOINT command

Frequent checkpoints reduce crash recovery time but increase I/O. The checkpoint_completion_target setting (default 0.9) spreads the writes over 90% of the checkpoint interval to avoid I/O spikes.

Background Writer

Proactively writes dirty buffers to disk between checkpoints. This frees up buffer slots so backends do not have to do synchronous writes when they need a clean buffer. Without the background writer, backends stall when all buffer slots are dirty.

WAL Writer

Flushes WAL buffers to disk at regular intervals (default 200ms). This batches WAL writes for better I/O efficiency. Backends that commit with synchronous_commit = on may flush WAL themselves if the WAL writer has not yet flushed their records.

Autovacuum Launcher & Workers

The launcher periodically checks tables for dead tuple accumulation and spawns autovacuum worker processes to clean them up. Dead tuples are the byproduct of MVCC: updated or deleted rows leave behind old versions that must eventually be removed.

-- Check autovacuum activity
SELECT relname, last_autovacuum, last_autoanalyze,
       n_dead_tup, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

Stats Collector

Gathers statistics about table access, index usage, and I/O. These statistics populate the pg_stat_* views that are essential for monitoring and tuning. The stats collector runs in its own process to avoid adding overhead to backend query execution.

Logical Replication Launcher

Manages logical replication workers that send or receive data changes to/from other Postgres instances. Only active when logical replication is configured.

The Catalog Tables

Postgres stores its own metadata in regular tables called the system catalog. These live in the pg_catalog schema and are queryable with standard SQL.

-- All tables in the current database
SELECT relname, relkind, reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relkind = 'r' AND relnamespace = 'public'::regnamespace;

-- All columns of a table
SELECT attname, format_type(atttypid, atttypmod) AS data_type, attnotnull
FROM pg_attribute
WHERE attrelid = 'users'::regclass AND attnum > 0 AND NOT attisdropped;

-- All indexes on a table
SELECT indexrelid::regclass AS index_name,
       pg_get_indexdef(indexrelid) AS definition
FROM pg_index
WHERE indrelid = 'users'::regclass;

Key catalog tables:

  • pg_class: Tables, indexes, sequences, views (all "relations")
  • pg_attribute: Columns of all relations
  • pg_index: Index definitions
  • pg_namespace: Schemas
  • pg_proc: Functions and procedures
  • pg_type: Data types
  • pg_constraint: Constraints (primary key, foreign key, check, unique)

The \d commands in psql are just queries against these catalog tables. Understanding the catalog lets you write custom introspection queries that go beyond what psql offers.

How a Query Flows

When a backend receives a SQL statement, it passes through four stages:

1. Parser

The parser converts the SQL text into a parse tree. This is pure syntax validation. It checks that the SQL is grammatically correct but does not verify that referenced tables or columns exist.

SELECT naem FROM users;  -- passes parsing (naem could be a column)
SELCT name FROM users;   -- fails parsing (SELCT is not valid SQL)

2. Rewriter

The rewrite system applies rules. The most common rewrite expands views: when you query a view, the rewriter substitutes the view's definition into the parse tree. Rules can also redirect INSERT/UPDATE/DELETE to different tables.

3. Planner (Optimizer)

The planner takes the rewritten parse tree and generates an execution plan. This is where Postgres decides:

  • Which indexes to use (or whether to do a sequential scan)
  • Join order and join strategy (nested loop, hash join, merge join)
  • Whether to sort or use an index for ORDER BY
  • Aggregate strategy (plain, sorted, hashed)

The planner uses table statistics (row counts, value distributions, correlation) collected by ANALYZE to estimate costs. Bad statistics lead to bad plans.

EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, count(o.id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC;
 Sort  (cost=1234.56..1234.78 rows=100 width=40) (actual time=12.3..12.4 rows=87 loops=1)
   Sort Key: (count(o.id)) DESC
   Sort Method: quicksort  Memory: 32kB
   ->  HashAggregate  (cost=1200.00..1230.00 rows=100 width=40) (actual time=12.1..12.2 rows=87 loops=1)
         ->  Hash Join  (cost=100.00..1100.00 rows=5000 width=36) (actual time=1.2..10.5 rows=4823 loops=1)
               Hash Cond: (o.user_id = u.id)
               ->  Seq Scan on orders o  (...)
               ->  Hash  (...)
                     ->  Index Scan using idx_users_created on users u  (...)
                           Filter: (created_at > '2024-01-01')
   Buffers: shared hit=892
 Planning Time: 0.4 ms
 Execution Time: 12.6 ms

4. Executor

The executor runs the plan node by node. It pulls tuples through a pipeline: each node requests tuples from its child nodes, applies its operation, and passes results up. This is called the Volcano/iterator model.

The executor is also responsible for enforcing constraints, firing triggers, and managing transactions.

Understanding Architecture Helps You Tune

Knowing the architecture explains why tuning works:

  • shared_buffers matters because every backend reads from the same buffer pool. More shared buffers means more cache hits and fewer disk reads.
  • max_connections has a cost because each connection is a separate process with its own memory allocation.
  • work_mem is per-operation because each backend allocates it independently for sorts and hashes within its own memory space.
  • autovacuum is critical because MVCC creates dead tuples that only vacuum can reclaim.
  • EXPLAIN output is a tree because the executor processes plans as a tree of pipeline nodes.

Common Pitfalls

  • Too many connections without pooling. 500 backends consume gigabytes of memory and thrash the CPU scheduler. Use PgBouncer to multiplex hundreds of application connections onto tens of Postgres backends.
  • Ignoring the buffer cache hit ratio. A hit ratio below 99% means backends are waiting on disk I/O. Either increase shared_buffers or reduce the working set.
  • Not running ANALYZE after bulk loads. The planner relies on statistics. If table statistics are stale after a large data load, the planner will choose bad plans.
  • Killing backend processes with kill -9. Use pg_terminate_backend(pid) instead. Sending SIGKILL to a backend forces the postmaster to restart all of shared memory, disconnecting every client.
  • Assuming more CPU cores always helps. Postgres parallelism is limited. A single query uses at most max_parallel_workers_per_gather workers. Throughput scales with cores, but individual query latency may not.

Key Takeaways

  • Postgres uses one OS process per connection, supervised by the postmaster. Connection pooling is not optional at scale.
  • Shared memory (shared buffers, WAL buffers, lock tables) is the communication layer between all backend processes.
  • Background processes handle checkpointing, WAL writing, autovacuum, and statistics collection.
  • Every query flows through four stages: parse, rewrite, plan, execute. The planner is where performance is won or lost.
  • The system catalog stores all metadata as regular tables. Querying the catalog is a fundamental DBA skill.
  • Understanding the architecture is not academic. It directly explains why configuration settings have the effects they do.