Process Architecture
The Postmaster
PostgreSQL uses a multi-process architecture, not a multi-threaded one. At the top sits the postmaster, the supervisor process that starts when you launch the server. The postmaster has three jobs:
- Listen for incoming client connections
- Fork a new backend process for each connection
- Manage background worker processes
Every other Postgres process is a child of the postmaster. If the postmaster dies, everything dies. This is by design: it prevents orphaned processes from corrupting shared memory.
$ ps aux | grep postgres
postgres 1234 postmaster -D /var/lib/postgresql/16/main
postgres 1240 postgres: checkpointer
postgres 1241 postgres: background writer
postgres 1242 postgres: walwriter
postgres 1243 postgres: autovacuum launcher
postgres 1244 postgres: stats collector
postgres 1245 postgres: logical replication launcher
postgres 1300 postgres: myuser mydb 192.168.1.10(54321) idle
postgres 1301 postgres: myuser mydb 192.168.1.11(54322) active
Each line is a separate OS process with its own memory space.
One Process Per Connection
When a client connects, the postmaster forks a new backend process dedicated to that connection. This backend handles all queries for that client: parsing, planning, and executing SQL. When the client disconnects, the backend exits.
This model has important implications:
- Isolation: A crash in one backend does not affect other connections. The postmaster detects the crash and restarts shared memory if needed.
- Memory overhead: Each backend consumes 5-10MB of memory at minimum, more when running complex queries. 500 connections means 2.5-5GB just for backend overhead.
- Context switching: Hundreds of active backends competing for CPU cores causes context-switch overhead. The kernel scheduler thrashes.
- Connection pooling is essential: Applications that open/close connections frequently pay a fork cost each time. PgBouncer or a similar pooler reuses backends across client requests.
-- See active connections
SELECT pid, usename, application_name, client_addr, state, query
FROM pg_stat_activity
WHERE backend_type = 'client backend';
Shared Memory
All backend processes communicate through shared memory. The postmaster allocates a shared memory segment at startup, and every backend maps it into its address space.
Shared Buffers
The largest shared memory component. This is Postgres's buffer cache: a pool of 8KB pages that mirror data file pages on disk. When a backend needs to read a row, it first checks shared buffers. If the page is there (a cache hit), no disk I/O occurs. If not, the backend reads the page from disk into a free buffer slot.
-- Check buffer cache hit ratio
SELECT
sum(blks_hit) AS cache_hits,
sum(blks_read) AS disk_reads,
round(sum(blks_hit) / nullif(sum(blks_hit) + sum(blks_read), 0) * 100, 2) AS hit_ratio
FROM pg_stat_database;
A healthy production system should show a hit ratio above 99%. If you are below 95%, shared_buffers is likely too small or your working set exceeds available memory.
WAL Buffers
A smaller shared memory region where WAL records accumulate before flushing to disk. Backends write WAL records here during transactions. The WAL writer process (or the committing backend itself) flushes these buffers to the WAL segment files on disk.
Lock Tables
Shared memory also holds the lock manager's data structures. Every row-level lock, table-level lock, and advisory lock is tracked here. The lock table is finite: if you acquire millions of row locks in a single transaction, you can exhaust it.
CLOG (Commit Log)
The commit log tracks the status of every transaction ID: in-progress, committed, or aborted. Backends consult the CLOG to determine transaction visibility (whether a row version is visible to the current snapshot).
Background Processes
The postmaster starts several background worker processes at boot. Each handles a specific maintenance task.
Checkpointer
Periodically writes all dirty pages from shared buffers to the data files on disk. A checkpoint ensures that everything up to a certain WAL position is safely on disk. This allows old WAL segments to be recycled.
Checkpoints are triggered by:
- Time interval (
checkpoint_timeout, default 5 minutes) - WAL volume (
max_wal_size, default 1GB) - Manual
CHECKPOINTcommand
Frequent checkpoints reduce crash recovery time but increase I/O. The checkpoint_completion_target setting (default 0.9) spreads the writes over 90% of the checkpoint interval to avoid I/O spikes.
Background Writer
Proactively writes dirty buffers to disk between checkpoints. This frees up buffer slots so backends do not have to do synchronous writes when they need a clean buffer. Without the background writer, backends stall when all buffer slots are dirty.
WAL Writer
Flushes WAL buffers to disk at regular intervals (default 200ms). This batches WAL writes for better I/O efficiency. Backends that commit with synchronous_commit = on may flush WAL themselves if the WAL writer has not yet flushed their records.
Autovacuum Launcher & Workers
The launcher periodically checks tables for dead tuple accumulation and spawns autovacuum worker processes to clean them up. Dead tuples are the byproduct of MVCC: updated or deleted rows leave behind old versions that must eventually be removed.
-- Check autovacuum activity
SELECT relname, last_autovacuum, last_autoanalyze,
n_dead_tup, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
Stats Collector
Gathers statistics about table access, index usage, and I/O. These statistics populate the pg_stat_* views that are essential for monitoring and tuning. The stats collector runs in its own process to avoid adding overhead to backend query execution.
Logical Replication Launcher
Manages logical replication workers that send or receive data changes to/from other Postgres instances. Only active when logical replication is configured.
The Catalog Tables
Postgres stores its own metadata in regular tables called the system catalog. These live in the pg_catalog schema and are queryable with standard SQL.
-- All tables in the current database
SELECT relname, relkind, reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relkind = 'r' AND relnamespace = 'public'::regnamespace;
-- All columns of a table
SELECT attname, format_type(atttypid, atttypmod) AS data_type, attnotnull
FROM pg_attribute
WHERE attrelid = 'users'::regclass AND attnum > 0 AND NOT attisdropped;
-- All indexes on a table
SELECT indexrelid::regclass AS index_name,
pg_get_indexdef(indexrelid) AS definition
FROM pg_index
WHERE indrelid = 'users'::regclass;
Key catalog tables:
- pg_class: Tables, indexes, sequences, views (all "relations")
- pg_attribute: Columns of all relations
- pg_index: Index definitions
- pg_namespace: Schemas
- pg_proc: Functions and procedures
- pg_type: Data types
- pg_constraint: Constraints (primary key, foreign key, check, unique)
The \d commands in psql are just queries against these catalog tables. Understanding the catalog lets you write custom introspection queries that go beyond what psql offers.
How a Query Flows
When a backend receives a SQL statement, it passes through four stages:
1. Parser
The parser converts the SQL text into a parse tree. This is pure syntax validation. It checks that the SQL is grammatically correct but does not verify that referenced tables or columns exist.
SELECT naem FROM users; -- passes parsing (naem could be a column)
SELCT name FROM users; -- fails parsing (SELCT is not valid SQL)
2. Rewriter
The rewrite system applies rules. The most common rewrite expands views: when you query a view, the rewriter substitutes the view's definition into the parse tree. Rules can also redirect INSERT/UPDATE/DELETE to different tables.
3. Planner (Optimizer)
The planner takes the rewritten parse tree and generates an execution plan. This is where Postgres decides:
- Which indexes to use (or whether to do a sequential scan)
- Join order and join strategy (nested loop, hash join, merge join)
- Whether to sort or use an index for ORDER BY
- Aggregate strategy (plain, sorted, hashed)
The planner uses table statistics (row counts, value distributions, correlation) collected by ANALYZE to estimate costs. Bad statistics lead to bad plans.
EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, count(o.id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.created_at > '2024-01-01'
GROUP BY u.name
ORDER BY order_count DESC;
Sort (cost=1234.56..1234.78 rows=100 width=40) (actual time=12.3..12.4 rows=87 loops=1)
Sort Key: (count(o.id)) DESC
Sort Method: quicksort Memory: 32kB
-> HashAggregate (cost=1200.00..1230.00 rows=100 width=40) (actual time=12.1..12.2 rows=87 loops=1)
-> Hash Join (cost=100.00..1100.00 rows=5000 width=36) (actual time=1.2..10.5 rows=4823 loops=1)
Hash Cond: (o.user_id = u.id)
-> Seq Scan on orders o (...)
-> Hash (...)
-> Index Scan using idx_users_created on users u (...)
Filter: (created_at > '2024-01-01')
Buffers: shared hit=892
Planning Time: 0.4 ms
Execution Time: 12.6 ms
4. Executor
The executor runs the plan node by node. It pulls tuples through a pipeline: each node requests tuples from its child nodes, applies its operation, and passes results up. This is called the Volcano/iterator model.
The executor is also responsible for enforcing constraints, firing triggers, and managing transactions.
Understanding Architecture Helps You Tune
Knowing the architecture explains why tuning works:
- shared_buffers matters because every backend reads from the same buffer pool. More shared buffers means more cache hits and fewer disk reads.
- max_connections has a cost because each connection is a separate process with its own memory allocation.
- work_mem is per-operation because each backend allocates it independently for sorts and hashes within its own memory space.
- autovacuum is critical because MVCC creates dead tuples that only vacuum can reclaim.
- EXPLAIN output is a tree because the executor processes plans as a tree of pipeline nodes.
Common Pitfalls
- Too many connections without pooling. 500 backends consume gigabytes of memory and thrash the CPU scheduler. Use PgBouncer to multiplex hundreds of application connections onto tens of Postgres backends.
- Ignoring the buffer cache hit ratio. A hit ratio below 99% means backends are waiting on disk I/O. Either increase shared_buffers or reduce the working set.
- Not running ANALYZE after bulk loads. The planner relies on statistics. If table statistics are stale after a large data load, the planner will choose bad plans.
- Killing backend processes with
kill -9. Usepg_terminate_backend(pid)instead. Sending SIGKILL to a backend forces the postmaster to restart all of shared memory, disconnecting every client. - Assuming more CPU cores always helps. Postgres parallelism is limited. A single query uses at most
max_parallel_workers_per_gatherworkers. Throughput scales with cores, but individual query latency may not.
Key Takeaways
- Postgres uses one OS process per connection, supervised by the postmaster. Connection pooling is not optional at scale.
- Shared memory (shared buffers, WAL buffers, lock tables) is the communication layer between all backend processes.
- Background processes handle checkpointing, WAL writing, autovacuum, and statistics collection.
- Every query flows through four stages: parse, rewrite, plan, execute. The planner is where performance is won or lost.
- The system catalog stores all metadata as regular tables. Querying the catalog is a fundamental DBA skill.
- Understanding the architecture is not academic. It directly explains why configuration settings have the effects they do.