Database as the Bottleneck

The database is almost always the first thing that breaks. Not your web server. Not your frontend. Not your API layer. The database.

This makes sense when you think about it. Your web server is stateless — you can run more copies. Your frontend is static files — a CDN handles it. But your database is the one place where all your data lives, where every read and write converges, where consistency actually matters.

Understanding how to optimize your database, and knowing when to add layers around it, is the most impactful scaling skill for a startup engineer.

Why the Database Breaks First

A typical web application request touches the database multiple times. Load a dashboard page and you might execute 10-50 queries. Each query competes for database resources: CPU for computation, memory for caching, I/O for disk reads, and connections for concurrency.

What a single page load might do:
- Authenticate the user (1 query)
- Load user preferences (1 query)
- Fetch dashboard data (3-5 queries)
- Load notifications (1 query)
- Check permissions (1-2 queries)
- Load sidebar data (2-3 queries)
Total: 10-15 queries per page load

At 100 concurrent users: 1,000-1,500 queries per second
At 1,000 concurrent users: 10,000-15,000 queries per second

Your web server can handle 1,000 concurrent connections easily. Your database might struggle at 100 concurrent queries if they are not optimized.

Connection Pooling

The first database bottleneck most startups hit is connection limits. PostgreSQL defaults to 100 maximum connections. Each connection consumes memory on the database server. Opening and closing connections has overhead.

Without connection pooling:
- Each request opens a new database connection
- Connection setup takes 5-10ms (TCP handshake, auth)
- 100 concurrent requests = 100 connections = PostgreSQL limit
- Request 101 gets a connection error

With connection pooling:
- A pool of 20-30 connections is maintained
- Requests borrow a connection from the pool
- When done, the connection returns to the pool
- 1,000 concurrent requests share 30 connections
- Each connection handles multiple requests sequentially

PgBouncer is the standard external connection pooler for PostgreSQL. It sits between your application and the database, maintaining a pool of connections and multiplexing requests.

PgBouncer configuration basics:
- Pool mode: transaction (recommended for most apps)
- Pool size: 20-50 connections (depends on database server size)
- Max client connections: 1,000+ (limited by PgBouncer memory, not DB)
- Deploy: as a sidecar on your app server or as a separate service

Most managed PostgreSQL services include connection pooling. Supabase has a built-in connection pooler. Neon has connection pooling by default. AWS RDS Proxy provides the same functionality.

If you are using a connection pooler, be aware of session-level features that do not work in transaction pooling mode: prepared statements, advisory locks, LISTEN/NOTIFY, and SET commands. Most ORMs handle this fine, but custom SQL might need adjustment.

Query Optimization

Before adding infrastructure, optimize the queries you already have. The fastest query is one you do not execute at all. The second fastest is one that uses an index.

Finding Slow Queries

PostgreSQL has built-in tools for finding slow queries.

Enable slow query logging in postgresql.conf:
log_min_duration_statement = 100  (log queries over 100ms)

Or use pg_stat_statements extension:
- Tracks execution statistics for all queries
- Shows total time, mean time, and call count
- Identifies the queries consuming the most resources

Most ORMs can log query execution times. Enable this in development and you will catch most problems before they reach production.

The EXPLAIN Command

EXPLAIN shows how PostgreSQL plans to execute a query. EXPLAIN ANALYZE actually runs it and shows the real execution time.

Key things to look for in EXPLAIN output:
- Seq Scan: full table scan, usually bad on large tables
- Index Scan: using an index, usually good
- Nested Loop: fine for small result sets, bad for large ones
- Hash Join: good for joining large tables
- Sort: check if it can use an index instead of sorting in memory

Example:
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 123;

Without index:
Seq Scan on orders  (cost=0.00..1234.00 rows=50 width=120)
  Filter: (user_id = 123)
  Rows Removed by Filter: 49950
  Execution Time: 45.123 ms

With index on user_id:
Index Scan using idx_orders_user_id on orders  (cost=0.42..12.50 rows=50 width=120)
  Index Cond: (user_id = 123)
  Execution Time: 0.234 ms

The difference between a sequential scan and an index scan can be three orders of magnitude. On a table with a million rows, this is the difference between a query taking 500ms and taking 0.5ms.

Indexing Strategy

Indexes are the single highest-impact optimization for most applications. They trade disk space and write speed for dramatically faster reads.

When to add an index:
- Column appears in WHERE clauses frequently
- Column is used in JOIN conditions
- Column is used in ORDER BY
- Column has a UNIQUE constraint
- You have identified a slow query that scans the table

When NOT to add an index:
- Table has fewer than 1,000 rows (sequential scan is fine)
- Column has very low cardinality (boolean, status with 3 values)
- Table is write-heavy and rarely read
- You already have too many indexes (slows down writes)

Composite Indexes

For queries that filter on multiple columns, composite indexes are more effective than multiple single-column indexes.

Query: SELECT * FROM orders WHERE user_id = 123 AND status = 'active'

Single column indexes:
- PostgreSQL picks one index, then filters the rest
- Might use user_id index, then filter by status

Composite index on (user_id, status):
- One index lookup satisfies both conditions
- Significantly faster for this specific query pattern
- Column order matters: put the most selective column first

Index Monitoring

Do not just add indexes blindly. Monitor which indexes are actually being used.

Find unused indexes:
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

Unused indexes waste disk space and slow down writes.
Remove them unless they enforce a constraint.

Read Replicas

When your database is bottlenecked on reads (most web applications), a read replica offloads SELECT queries from the primary database.

Read replica architecture:
- Primary: handles all writes (INSERT, UPDATE, DELETE)
- Replica: receives changes via replication, handles reads
- Application: routes writes to primary, reads to replica

Replication lag:
- Typically under 100ms for managed databases
- Means reads might be slightly behind writes
- Acceptable for most features (profile page, dashboard, lists)
- Not acceptable for: payment confirmation, form submission results

The application needs to know when to read from the primary versus the replica. After a write, you typically read from the primary for that user's next request to avoid showing stale data.

Read routing strategy:
- After a write: read from primary for 5-10 seconds
- All other reads: use replica
- Implementation: sticky connection flag per user session
- Most ORMs support this with minimal configuration

Rails has built-in support for automatic read/write splitting. Django has database routers. For other frameworks, you can implement this at the connection level.

Caching Hot Queries

Some queries are executed frequently with the same parameters. Caching their results in Redis or Memcached avoids hitting the database entirely.

Good candidates for caching:
- User profile data (read often, written rarely)
- Configuration and settings
- Dashboard aggregations (expensive, change slowly)
- API responses from external services
- Permission checks (same user, same check, many times per session)
- Navigation and menu data
- Popular content (trending posts, featured items)

Bad candidates for caching:
- Data that changes every request
- User-specific data that varies per request
- Data where staleness causes business problems
- Queries that are already fast (under 5ms)

Cache-Aside Pattern

The most common caching pattern. Check the cache first. If the data is not there, query the database and store the result in the cache.

Cache-aside flow:
1. Application checks Redis for key "user:123:profile"
2. Cache hit: return cached data (sub-millisecond)
3. Cache miss: query PostgreSQL for user profile
4. Store result in Redis with TTL (time-to-live)
5. Return data to application

TTL strategy:
- User profiles: 5-15 minutes
- Configuration: 1 hour
- Dashboard aggregations: 5-30 minutes
- Navigation data: 1 hour
- Adjust based on how stale the data can be

Cache Invalidation

When data changes, the cache needs to be updated or cleared. This is famously one of the hard problems in computer science, but for most startup use cases, TTL-based expiry is sufficient.

Cache invalidation strategies:
1. TTL expiry: set a timeout, data refreshes automatically
   - Simplest, good enough for 90% of cases
   - Trade-off: data can be stale up to TTL duration

2. Write-through: update cache when writing to database
   - More complex, data is always fresh
   - Risk: cache and database can get out of sync

3. Event-based: publish events on data change, subscribers clear cache
   - Most complex, best consistency
   - Overkill for most startups

Instagram used Memcached extensively from early on. Their approach was simple: cache aggressively, set short TTLs, and let stale data exist for a few seconds. For a photo-sharing app, showing a like count that is 5 seconds behind is completely acceptable.

When to Shard

Sharding splits your database across multiple servers based on some key (usually tenant ID or user ID). Each shard holds a subset of the data.

The answer to "when should I shard?" is almost always "later than you think."

PostgreSQL can handle:
- Hundreds of millions of rows per table
- Terabytes of data per instance
- Thousands of queries per second with proper indexing
- Tens of thousands of queries per second with read replicas and caching

You need sharding when:
- Single database server cannot handle write volume
- Table sizes exceed what fits in memory (affects index performance)
- You need to distribute data geographically for latency
- Connection pooling and read replicas are not enough

You probably do not need sharding if:
- You have fewer than 100 million rows in your largest table
- Your write volume is under 1,000 writes per second
- You have not tried connection pooling, indexing, read replicas, and caching

Sharding adds enormous complexity. Cross-shard queries are slow or impossible. Transactions across shards require distributed consensus. Rebalancing shards when they grow unevenly is painful.

Alternatives to sharding (try these first):
1. Better indexes and query optimization
2. Connection pooling (PgBouncer)
3. Read replicas for read-heavy workloads
4. Caching hot data in Redis
5. Archiving old data to separate tables or cold storage
6. Partitioning (table-level, not server-level)
7. Vertical scaling (bigger database server)

Vertical scaling is underrated. A large database server with 64 cores, 256GB RAM, and NVMe SSDs can handle workloads that most startups will never outgrow. It costs $1,000-$ 2,000/month. That is cheaper than the engineering time to implement and maintain sharding.

Pinterest sharded their database early and has written extensively about the complexity it introduced. They would not recommend it for companies that have not exhausted simpler alternatives.

PostgreSQL Can Handle More Than You Think

PostgreSQL is one of the most capable relational databases in existence. Before reaching for specialized databases or complex architectures, understand what PostgreSQL can do.

PostgreSQL capabilities often overlooked:
- Full-text search (tsvector and tsquery) -- may replace Elasticsearch
- JSON storage and querying (JSONB) -- may replace MongoDB
- Geospatial queries (PostGIS extension) -- may replace specialized geo DBs
- Time-series data (with partitioning) -- may work before TimescaleDB
- Pub/Sub (LISTEN/NOTIFY) -- may replace simple message queues
- Materialized views -- may replace caching for complex aggregations

Do not add a new database to your stack until PostgreSQL provably cannot handle the workload. Each additional database is another thing to operate, monitor, back up, and keep consistent.

Common Pitfalls

Adding indexes without checking if they are used. Unused indexes slow down writes and waste disk space. Monitor index usage and remove indexes that are not being scanned.

Caching without measuring. If you cache a query that takes 2ms, you saved almost nothing. Cache the query that takes 200ms and runs 1,000 times per hour.

Ignoring connection pooling. Many startup outages are caused by exhausting database connections, not by actual load. PgBouncer or a managed pooler solves this with minimal effort.

Sharding too early. Sharding is a one-way door that adds permanent complexity. Exhaust every other option first. Vertical scaling, indexing, caching, and read replicas can take you very far.

Using multiple databases prematurely. Running PostgreSQL, Redis, MongoDB, and Elasticsearch when PostgreSQL alone could handle your workload. Each database is an operational burden. Minimize the number you run.

Not testing backups. A backup you cannot restore is not a backup. Regularly test that your backup and restore process actually works. This is more important than any optimization.

Key Takeaways

The database is almost always the first bottleneck. Expect it and prepare for it.
Connection pooling is the first fix. Most database connection errors are pool exhaustion, not actual load.
Indexing is the highest-impact optimization. A single index can turn a 500ms query into a 0.5ms query.
Read replicas handle read-heavy workloads. Most web applications are 80-90% reads.
Cache hot queries in Redis with TTL-based expiry. Simple, effective, and good enough for most startups.
Shard later than you think. PostgreSQL on a large server with proper indexing, pooling, replicas, and caching can handle more load than most startups will ever generate.
PostgreSQL can do more than you expect. Full-text search, JSON, geospatial, and more. Try it before adding specialized databases.