5 min read
On this page

Key Metrics to Monitor

The Dashboard Every Postgres DBA Needs

A Postgres database that is not monitored is a ticking time bomb. It might be running fine right now, but without visibility into its internals, you will not know it is in trouble until users start complaining. By then, the problem has been building for hours or days.

Postgres exposes a rich set of statistics views that tell you exactly what is happening inside the database. The challenge is not finding data. It is knowing which metrics matter, what their healthy ranges look like, and when to raise an alarm.

pg_stat_activity: What Is Happening Right Now

pg_stat_activity shows every active connection to the database: what it is doing, how long it has been doing it, and whether it is waiting for something.

-- Active queries (not idle)
SELECT pid, usename, datname, state,
       now() - query_start AS query_duration,
       wait_event_type, wait_event,
       left(query, 80) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start;
  pid  | usename | datname |        state        | query_duration | wait_event_type |  wait_event
-------+---------+---------+---------------------+----------------+-----------------+--------------
 12345 | app     | mydb    | active              | 00:00:00.003   |                 |
 12346 | app     | mydb    | active              | 00:05:23.100   | Lock            | relation
 12347 | app     | mydb    | idle in transaction | 00:12:45.000   |                 |

What to watch for

Long-running queries: Any query running longer than a few seconds deserves investigation. Set a statement_timeout as a safety net.

-- Find queries running longer than 1 minute
SELECT pid, usename, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '1 minute';

Idle in transaction: A connection that started a transaction and then stopped doing anything. It holds locks and prevents VACUUM from cleaning up dead rows. This is one of the most common causes of table bloat.

-- Find idle-in-transaction connections
SELECT pid, usename, now() - xact_start AS transaction_duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;

Set idle_in_transaction_session_timeout to kill these automatically:

# postgresql.conf
idle_in_transaction_session_timeout = '5min'

Waiting connections: Connections blocked on locks indicate contention. If many connections are waiting, you have a locking problem.

-- Find blocked and blocking connections
SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_locks AS bl ON bl.pid = blocked.pid
JOIN pg_locks AS kl ON kl.locktype = bl.locktype
    AND kl.database IS NOT DISTINCT FROM bl.database
    AND kl.relation IS NOT DISTINCT FROM bl.relation
    AND kl.page IS NOT DISTINCT FROM bl.page
    AND kl.tuple IS NOT DISTINCT FROM bl.tuple
    AND kl.transactionid IS NOT DISTINCT FROM bl.transactionid
    AND kl.classid IS NOT DISTINCT FROM bl.classid
    AND kl.objid IS NOT DISTINCT FROM bl.objid
    AND kl.objsubid IS NOT DISTINCT FROM bl.objsubid
    AND kl.pid != bl.pid
JOIN pg_stat_activity AS blocking ON blocking.pid = kl.pid
WHERE NOT bl.granted;

pg_stat_user_tables: Table Health

This view shows access patterns and maintenance status for every table.

SELECT schemaname, relname,
       seq_scan, idx_scan,
       n_tup_ins, n_tup_upd, n_tup_del,
       n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

Critical metrics

Sequential scans vs index scans: If a large table has high seq_scan and low idx_scan, queries are scanning the entire table instead of using indexes. This is the most common performance problem.

-- Tables with high sequential scan counts (potential missing indexes)
SELECT relname, seq_scan, idx_scan,
       pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
WHERE seq_scan > 1000 AND idx_scan < seq_scan
ORDER BY seq_scan DESC;

Dead tuples: When rows are updated or deleted, the old version becomes a dead tuple. Dead tuples waste space and slow down scans. VACUUM reclaims them.

-- Tables with the most dead tuples (need vacuuming)
SELECT relname, n_live_tup, n_dead_tup,
       round(n_dead_tup::numeric / greatest(n_live_tup, 1) * 100, 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;

Alert when dead tuple ratio exceeds 10-20% of live tuples. This means autovacuum is not keeping up.

pg_stat_user_indexes: Index Usage

Not every index is used. Unused indexes waste disk space and slow down writes.

-- Unused indexes (candidates for removal)
SELECT schemaname, relname, indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE '%pkey%'
  AND indexrelname NOT LIKE '%unique%'
ORDER BY pg_relation_size(indexrelid) DESC;

An index with zero scans over weeks is probably safe to drop. But verify first: some indexes exist for uniqueness constraints, and some are only used during monthly reports.

pg_stat_bgwriter: Checkpoint Health

The background writer and checkpointer flush dirty pages to disk. Checkpoint frequency and timing affect both performance and recovery time.

SELECT checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,
       buffers_clean,
       buffers_backend,
       maxwritten_clean
FROM pg_stat_bgwriter;

checkpoints_timed vs checkpoints_req: Timed checkpoints happen on schedule (every checkpoint_timeout seconds). Requested checkpoints happen when WAL fills up. If checkpoints_req is high relative to checkpoints_timed, your WAL volume is too aggressive for your checkpoint settings.

buffers_backend: Pages written directly by backend processes because the background writer could not keep up. If this is high, the background writer needs tuning.

Connection Count

Every Postgres connection is a separate operating system process. Each process consumes memory (typically 5-10 MB per connection, more for complex queries).

-- Current connections by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;

-- Connection usage vs limit
SELECT count(*) AS current_connections,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections;

Alert when connections exceed 80% of max_connections. Running out of connections means new requests are rejected.

The solution is almost always connection pooling (PgBouncer, Supavisor), not increasing max_connections. Postgres performs poorly with hundreds of active connections.

Replication Lag

For systems with replicas, replication lag determines your RPO and affects read consistency.

-- On the primary: check replication status
SELECT client_addr, application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
-- On the replica: check how far behind
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;

Alert thresholds depend on your use case. For most systems, alert at 30 seconds of lag and page at 5 minutes.

Cache Hit Ratio

Postgres uses shared_buffers as its buffer cache. The cache hit ratio tells you how often data is found in memory versus read from disk.

-- Overall cache hit ratio
SELECT
    sum(heap_blks_hit) AS hits,
    sum(heap_blks_read) AS reads,
    round(sum(heap_blks_hit)::numeric /
          greatest(sum(heap_blks_hit) + sum(heap_blks_read), 1) * 100, 2) AS hit_ratio
FROM pg_statio_user_tables;
   hits    |  reads  | hit_ratio
-----------+---------+-----------
 892345678 |  234567 |     99.97

A healthy cache hit ratio is 99% or above. If it drops below 95%, your working set does not fit in memory. Either increase shared_buffers (and RAM) or optimize your queries to touch less data.

-- Cache hit ratio per table (find the cache-busters)
SELECT relname,
       heap_blks_hit, heap_blks_read,
       round(heap_blks_hit::numeric /
             greatest(heap_blks_hit + heap_blks_read, 1) * 100, 2) AS hit_ratio
FROM pg_statio_user_tables
WHERE heap_blks_hit + heap_blks_read > 1000
ORDER BY hit_ratio ASC
LIMIT 10;

Putting It All Together

The essential monitoring dashboard:

Metric Healthy Warning Critical
Active connections Under 50% of max 50-80% of max Over 80% of max
Cache hit ratio Above 99% 95-99% Below 95%
Replication lag Under 1 second 1-30 seconds Over 30 seconds
Dead tuple ratio Under 5% 5-20% Over 20%
Long-running queries Under 30 seconds 30s-5 minutes Over 5 minutes
Idle in transaction None Under 5 minutes Over 5 minutes
Checkpoint frequency Mostly timed Some requested Mostly requested

Common Pitfalls

  • Not monitoring at all. Postgres does not alert you when things go wrong. If you do not set up monitoring, problems accumulate silently until they become emergencies.
  • Monitoring only uptime. "The database is up" tells you nothing about performance. You need query-level, table-level, and system-level metrics.
  • Ignoring idle-in-transaction connections. They hold locks, prevent vacuum, and cause table bloat. Set idle_in_transaction_session_timeout.
  • Alerting on too many things. If every metric fires an alert, alert fatigue sets in and real problems get ignored. Focus on the metrics that indicate user-facing impact.
  • Not tracking trends. A cache hit ratio of 98% is fine. A cache hit ratio that dropped from 99.9% to 98% in a week is a trend that needs investigation.
  • Forgetting to reset statistics. Statistics accumulate from server start. After major changes, reset them to get a clean baseline: SELECT pg_stat_reset();

Key Takeaways

  • pg_stat_activity reveals what is happening right now: active queries, lock waits, and idle-in-transaction connections.
  • pg_stat_user_tables shows table health: sequential scans vs index scans and dead tuple accumulation.
  • Cache hit ratio should be 99% or above. Below 95% means your working set exceeds available memory.
  • Replication lag determines your effective RPO and read consistency. Monitor it continuously.
  • Connection count is a finite resource. Use connection pooling rather than increasing max_connections.
  • Monitor trends, not just point-in-time values. A slowly degrading metric is often more dangerous than a sudden spike.