Caching Fundamentals

[Figure: Cache-aside pattern request flow]

Overview

Caching stores copies of frequently accessed data in a faster storage layer to reduce latency, lower load on origin systems, and improve throughput. Caching is one of the highest-leverage performance optimizations in system design, but it introduces complexity around data freshness and consistency.

Why Cache

Without caching:
  Every request hits the database.
  Database query: 10-50ms
  At 10,000 requests/sec: database is overwhelmed.

With caching:
  First request hits the database and stores the result in cache.
  Subsequent requests read from cache.
  Cache lookup: 0.1-1ms
  Database load reduced by 90%+ for read-heavy workloads.

Real impact:
  - Reduce average response latency from 50ms to 2ms
  - Handle 10x more traffic with the same database
  - Survive traffic spikes without scaling backend infrastructure
  - Reduce cloud costs by avoiding database over-provisioning

What to Cache

Good candidates for caching:
  - Data that is read far more often than written
  - Expensive computations (aggregations, recommendations)
  - Data that does not change frequently
  - API responses from external services
  - Session data and authentication tokens
  - Static content (HTML fragments, configuration)

Poor candidates for caching:
  - Data that changes on every access (real-time counters)
  - Data where staleness is unacceptable (bank balances)
  - Large datasets that exceed cache memory
  - Data unique to each request (one-time tokens)
  - Write-heavy data with low read frequency

Cache Hit & Cache Miss

Cache hit: Requested data is found in the cache.
  Client -> Cache (found!) -> Return cached data
  Fast path. This is what you optimize for.

Cache miss: Requested data is not in the cache.
  Client -> Cache (not found) -> Database -> Store in cache -> Return data
  Slow path. Adds cache lookup overhead on top of database query.
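
The hit and miss paths above can be sketched as a small cache-aside helper. This is a minimal illustration, not a production client: the in-memory dict stands in for a real cache such as Redis or Memcached, and the names CacheAside and load_from_db are hypothetical.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Cache-aside lookup: check the cache first, fall back to the database on a miss."""

    def __init__(self, load_from_db: Callable[[str], Optional[dict]]):
        self._cache: Dict[str, dict] = {}   # stand-in for a real cache server
        self._load_from_db = load_from_db   # hypothetical database loader
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[dict]:
        if key in self._cache:              # cache hit: fast path
            self.hits += 1
            return self._cache[key]
        self.misses += 1                    # cache miss: slow path
        value = self._load_from_db(key)
        if value is not None:
            self._cache[key] = value        # store for subsequent requests
        return value


# Usage: a plain dict plays the role of the database.
db = {"user:123": {"name": "Ada"}}
cache = CacheAside(load_from_db=db.get)
first = cache.get("user:123")    # miss: loaded from the database, then cached
second = cache.get("user:123")   # hit: served from the cache
```

The helper counts hits and misses so that the hit ratio can be computed from the same object.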

Cache hit ratio = hits / (hits + misses)
  Target: 90-99% for most workloads
  Below 80%: Cache is not effective, investigate access patterns
  Above 99%: Excellent, but verify you are not over-caching stale data
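
To see why the hit ratio matters so much, the arithmetic can be written out. The sketch below assumes a simple model in which a hit costs one cache lookup and a miss costs the cache lookup plus a database query; the function names and the example figures (1 ms cache, 50 ms database, 10,000 requests/sec) are illustrative assumptions drawn from the ranges quoted earlier.

```python
def expected_latency_ms(hit_ratio: float, cache_ms: float, db_ms: float) -> float:
    """Average read latency: hits pay the cache lookup, misses pay lookup + database query."""
    return hit_ratio * cache_ms + (1 - hit_ratio) * (cache_ms + db_ms)


def db_requests_per_sec(total_rps: float, hit_ratio: float) -> float:
    """Read traffic that still reaches the database (only the misses)."""
    return total_rps * (1 - hit_ratio)


# Example: 95% hit ratio, 1 ms cache lookup, 50 ms database query.
latency = expected_latency_ms(0.95, cache_ms=1.0, db_ms=50.0)   # about 3.5 ms
db_load = db_requests_per_sec(10_000, 0.95)                     # about 500 requests/sec
```

Under this model, moving from a 90% to a 99% hit ratio cuts database read traffic by a factor of ten, which is why small changes near the top of the range are worth measuring.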

Cold cache: Empty cache after restart or deployment.
  All requests are misses until the cache warms up.
  Can cause a spike in database load.

TTL (Time to Live)

TTL defines how long cached data remains valid before it expires and must be refreshed.

Setting TTL:
  cache.set("user:123", user_data, ttl=300)  # Expires in 300 seconds (5 minutes)
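
One way a set call with a ttl argument might behave internally is sketched below. It is an assumption-laden illustration: TTLCache is a hypothetical class, expiry is checked lazily on read, and the clock is injectable so the example can advance time without sleeping.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Minimal TTL cache: each entry stores its value and an optional expiry time."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._items: Dict[str, Tuple[Any, Optional[float]]] = {}

    def set(self, key: str, value: Any, ttl: Optional[float] = None) -> None:
        # ttl is in seconds; None means the entry never expires on its own.
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)

    def get(self, key: str, default: Any = None) -> Any:
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]            # lazy expiration on read
            return default
        return value


# Usage with a fake clock so time can be advanced manually.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.set("user:123", {"name": "Ada"}, ttl=300)
now[0] = 299.0
fresh = cache.get("user:123")     # still valid
now[0] = 300.0
expired = cache.get("user:123")   # expired, returns None
```

Real cache servers typically combine lazy expiration like this with a background sweep, so expired keys that are never read again still get removed.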

TTL trade-offs:
  Short TTL (seconds to minutes):
    - Data is fresher
    - More cache misses
    - Higher database load
    - Good for: frequently changing data, session tokens

  Long TTL (hours to days):
    - More cache hits
    - Data may be stale
    - Lower database load
    - Good for: configuration, reference data, product catalogs

  No TTL (cache forever):
    - Maximum cache hits
    - Data is never automatically refreshed
    - Must invalidate manually on changes
    - Good for: immutable data (content-addressed, versioned resources)

TTL guidelines by data type:
  User session:          15-30 minutes
  Product listing:       5-15 minutes
  Configuration:         1-5 minutes
  User profile:          5-30 minutes
  Search results:        1-5 minutes
  Static content:        24 hours to indefinite
  API rate limit counter: 1 minute (one fixed rate-limit window)

Eviction Policies

When the cache is full, the eviction policy determines which items to remove to make room for new ones.

LRU (Least Recently Used)

Evicts the item that has not been accessed for the longest time.

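Access sequence: A, B, C, D, A, E (cache size: 4)
Lists are ordered from least recently used (left) to most recently used (right).
  [A]
  [A, B]
  [A, B, C]
  [A, B, C, D]        Cache full
  [B, C, D, A]        A accessed, moves to the most recently used position
  [C, D, A, E]        E added, B evicted (least recently used)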
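
The trace above can be reproduced with a short LRU sketch built on Python's collections.OrderedDict. The class name LRUCache and its put/get/keys methods are choices made for this example; production caches such as Redis approximate LRU by sampling rather than keeping an exact ordering.

```python
from collections import OrderedDict
from typing import Any, List, Optional


class LRUCache:
    """Exact LRU: the OrderedDict keeps keys from least to most recently used."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, key: str, default: Any = None) -> Any:
        if key not in self._items:
            return default
        self._items.move_to_end(key)        # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: Any) -> Optional[str]:
        """Insert or update a key; return the evicted key, if any."""
        evicted = None
        if key in self._items:
            self._items.move_to_end(key)
        elif len(self._items) >= self.capacity:
            evicted, _ = self._items.popitem(last=False)   # drop least recently used
        self._items[key] = value
        return evicted

    def keys(self) -> List[str]:
        return list(self._items)


# Replay the access sequence A, B, C, D, A, E with capacity 4.
lru = LRUCache(capacity=4)
for name in ["A", "B", "C", "D"]:
    lru.put(name, name.lower())
lru.get("A")                      # A becomes most recently used
evicted = lru.put("E", "e")       # cache is full, so B is evicted
final_order = lru.keys()          # ['C', 'D', 'A', 'E']
```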

Pros:
  - Simple to implement and understand
  - Works well for most access patterns
  - Good general-purpose default

Cons:
  - A single large scan (for example, a batch job that reads every key once) can evict all useful cached items
  - Does not consider frequency of access
  - One-time accesses pollute the cache

Used by: Redis (approximated LRU), Memcached, most frameworks
Recommendation: Use LRU as the default unless you have a specific reason not to.

LFU (Least Frequently Used)

Evicts the item accessed the fewest times.

Access counts: A(10), B(3), C(7), D(1)
  Cache full, new item E arrives.
  D evicted (accessed only 1 time).
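
The example above can be sketched with a deliberately simple LFU cache that keeps one access counter per key and scans for the minimum on eviction. This is an illustrative assumption rather than how Redis implements its LFU mode: the scan is O(n), ties are broken by insertion order, and LFUCache is a hypothetical name.

```python
from typing import Any, Dict, Optional


class LFUCache:
    """Naive LFU: evict the key with the smallest access count."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._values: Dict[str, Any] = {}
        self._counts: Dict[str, int] = {}

    def get(self, key: str, default: Any = None) -> Any:
        if key not in self._values:
            return default
        self._counts[key] += 1              # every read counts as an access
        return self._values[key]

    def put(self, key: str, value: Any) -> Optional[str]:
        """Insert or update a key; return the evicted key, if any."""
        evicted = None
        if key not in self._values and len(self._values) >= self.capacity:
            evicted = min(self._counts, key=self._counts.get)   # least frequently used
            del self._values[evicted]
            del self._counts[evicted]
        self._values[key] = value
        self._counts[key] = self._counts.get(key, 0) + 1        # a write also counts
        return evicted


# Build the access counts A(10), B(3), C(7), D(1), then insert E.
lfu = LFUCache(capacity=4)
for name, extra_reads in [("A", 9), ("B", 2), ("C", 6), ("D", 0)]:
    lfu.put(name, name.lower())             # the insert is the first access
    for _ in range(extra_reads):
        lfu.get(name)
evicted = lfu.put("E", "e")                 # D has the lowest count and is evicted
```

Note that the new item E enters with a count of 1, so it would be the next eviction candidate; this is the cold start problem that practical LFU variants address with counter decay or an admission window.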

Pros:
  - Keeps frequently accessed items even if not recently used
  - Better than LRU for workloads with stable hot items

Cons:
  - Items that were popular in the past but are no longer relevant
    keep their high counts and stick around (frequency pollution)
  - Cold start problem: new items always have low frequency
  - More complex implementation

Used by: Redis (LFU option since Redis 4.0)
Best for: Workloads with stable, known hot keys.

Other Eviction Policies

FIFO (First In, First Out):
  Evicts the oldest item regardless of access pattern.
  Simple but ignores access frequency and recency.

Random:
  Evicts a random item. Surprisingly effective in some workloads.
  Zero overhead for tracking access patterns.

TTL-based:
  Evict expired items first, then fall back to LRU/LFU.
  Combines time-based and access-based eviction.

W-TinyLFU (used by Caffeine cache in Java):
  Combines recency and frequency with a compact frequency sketch.
  Reported to achieve higher hit rates than plain LRU or LFU in many benchmarks.
  Used by: Caffeine-based caches in Apache Cassandra and many other JVM applications.

Cache Warming

Cache warming is the process of pre-populating the cache before it receives live traffic.

Why Warm the Cache

Cold cache problem:
  After deployment, restart, or cache flush:
  - All requests are cache misses
  - Database receives full query load simultaneously
  - Response latency spikes
  - Possible cascading failure if database cannot handle the spike

Without warming:                 With warming:
  Time 0: 0% hit rate            Time 0: 85% hit rate (pre-loaded)
  Time 1: 20% hit rate           Time 1: 90% hit rate
  Time 5: 60% hit rate           Time 5: 95% hit rate
  Time 15: 90% hit rate          Time 15: 98% hit rate

Warming Strategies

Strategy 1: Pre-load from database
  On startup, query the database for the most frequently
  accessed items and load them into cache.
  Works well when you know your hot keys.

Strategy 2: Replay access logs
  Analyze recent access logs to identify popular items.
  Load those items into cache before switching traffic.
  More accurate than guessing hot keys.

Strategy 3: Shadow traffic
  Route a copy of production traffic to the new cache
  without serving responses from it. The cache warms
  up from real access patterns.

Strategy 4: Gradual traffic shift
  Slowly ramp traffic from 1% to 100% over minutes.
  Cache warms naturally as traffic increases.
  Load balancer or feature flag controls the ramp.

Strategy 5: Cache replication
  When replacing a cache node, copy data from existing
  nodes before the new node receives traffic.
  Redis supports this natively with replication.
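
Strategies 1 and 2 can be combined in a few lines: count keys in a recent access log, then pre-load the most popular ones. The sketch below assumes the log has already been reduced to a sequence of cache keys; hottest_keys, warm_cache, and load_from_db are hypothetical names, and a dict stands in for both the cache and the database.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional


def hottest_keys(access_log: Iterable[str], top_n: int) -> List[str]:
    """Return the top_n most frequently requested keys from an access log."""
    return [key for key, _ in Counter(access_log).most_common(top_n)]


def warm_cache(
    cache: Dict[str, object],
    keys: Iterable[str],
    load_from_db: Callable[[str], Optional[object]],
) -> int:
    """Load the given keys into the cache before live traffic arrives; return how many were loaded."""
    loaded = 0
    for key in keys:
        value = load_from_db(key)
        if value is not None:               # skip keys that no longer exist
            cache[key] = value
            loaded += 1
    return loaded


# Usage: warm the two hottest products from a small sample log.
log = ["product:1", "product:2", "product:1", "product:3", "product:1", "product:2"]
db = {"product:1": "Laptop", "product:2": "Phone", "product:3": "Cable"}
cache: Dict[str, object] = {}
hot = hottest_keys(log, top_n=2)
loaded = warm_cache(cache, hot, db.get)
```

In a real deployment the load step would usually be rate-limited so that warming itself does not overload the database.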

Cache Levels

L1: In-process cache (application memory)
  Latency: microseconds
  Size: megabytes (limited by application heap)
  Examples: HashMap, Guava Cache, Caffeine
  Best for: Configuration, small reference data, hot objects

L2: Local machine cache (separate process, same host)
  Latency: sub-millisecond
  Size: gigabytes (limited by machine memory)
  Examples: Redis on localhost, local Memcached
  Best for: Session data, computed results

L3: Distributed cache (remote servers)
  Latency: 1-5 milliseconds (network hop)
  Size: terabytes (across cluster)
  Examples: Redis Cluster, Memcached fleet
  Best for: Shared state across application instances

L4: CDN / Edge cache
  Latency: depends on geographic proximity
  Size: massive (distributed globally)
  Examples: CloudFront, Cloudflare, Fastly
  Best for: Static assets, public API responses

Multi-level caching:
  Check L1 -> Check L2 -> Check L3 -> Database
  Each miss falls through to the next level.
  Populate each level on the way back.
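
The fall-through lookup and populate-on-the-way-back behavior can be sketched as follows. Each level is modeled as a plain dict ordered from fastest to slowest, which is an assumption for illustration; multi_level_get and load_from_db are hypothetical names.

```python
from typing import Callable, Dict, List, Optional


def multi_level_get(
    key: str,
    levels: List[Dict[str, object]],
    load_from_db: Callable[[str], Optional[object]],
) -> Optional[object]:
    """Check each cache level in order; on a hit, back-fill the faster levels that missed."""
    for index, level in enumerate(levels):
        if key in level:
            value = level[key]
            for faster in levels[:index]:   # populate levels checked before this one
                faster[key] = value
            return value
    value = load_from_db(key)               # every level missed
    if value is not None:
        for level in levels:                # populate all levels on the way back
            level[key] = value
    return value


# Usage: the key starts only in L3, then a second key comes from the database.
l1: Dict[str, object] = {}
l2: Dict[str, object] = {}
l3: Dict[str, object] = {"user:123": "Ada"}
db = {"user:456": "Grace"}
from_l3 = multi_level_get("user:123", [l1, l2, l3], db.get)
from_db = multi_level_get("user:456", [l1, l2, l3], db.get)
```

A real implementation would normally give each level its own TTL, with shorter TTLs on the faster, smaller levels to limit how long stale copies survive.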

Cache Metrics

Essential metrics to monitor:
  Hit ratio:        hits / (hits + misses)
  Miss ratio:       1 - hit ratio
  Eviction rate:    evictions per second (high = cache too small)
  Memory usage:     percentage of allocated cache memory used
  Latency (p50/p99): cache operation latency at different percentiles
  Key count:        number of items in cache
  TTL distribution: how long until items expire

Warning signs:
  Hit ratio < 80%:         Access patterns may not be cache-friendly
  Eviction rate increasing: Cache size may need to increase
  p99 latency spiking:     Possible hot key or resource contention
  Memory usage at 100%:    Constant evictions, consider larger cache
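
A small stats object is enough to turn some of the warning signs above into checks. The sketch below is a hypothetical CacheStats class; the 80% threshold comes from the list above, while the field names and warning strings are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    evictions: int = 0
    memory_used_bytes: int = 0
    memory_limit_bytes: int = 1             # assumed to be non-zero

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def memory_usage(self) -> float:
        return self.memory_used_bytes / self.memory_limit_bytes

    def warnings(self, min_hit_ratio: float = 0.80) -> List[str]:
        """Return the warning signs that currently apply."""
        found = []
        if self.hits + self.misses > 0 and self.hit_ratio() < min_hit_ratio:
            found.append("low hit ratio")
        if self.memory_usage() >= 1.0 and self.evictions > 0:
            found.append("cache full and evicting")
        return found


# Usage: one struggling cache and one healthy cache.
struggling = CacheStats(hits=70, misses=30, evictions=5,
                        memory_used_bytes=100, memory_limit_bytes=100)
healthy = CacheStats(hits=95, misses=5)
struggling_warnings = struggling.warnings()
healthy_warnings = healthy.warnings()
```

Eviction rate and p99 latency need a time dimension, so in practice these counters would be exported to a metrics system and alerted on there.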

Real-World Examples

Facebook uses Memcached as a massive look-aside cache in front of MySQL. They cache billions of objects and achieve hit rates above 99%. TAO, their graph data cache, handles trillions of reads per day.

Netflix uses EVCache (built on Memcached) to cache user preferences, viewing history, and personalization data. Their cache layer handles millions of requests per second.

Stack Overflow serves 1.3 billion page views per month with a relatively small infrastructure, heavily relying on multi-level caching: in-process caching with Redis as L2, all running on just a handful of servers.

Common Pitfalls

Caching without measuring: Always instrument hit rates, latency, and eviction rates. Without metrics, you are guessing.
TTL too long: Stale data causes user-visible bugs that are hard to debug because they appear intermittently.
TTL too short: Frequent cache misses negate the benefit of caching. Measure and tune based on data change frequency.
Not planning for cold cache: Every deployment and restart creates a cold cache. Without warming, you get a load spike on your database.
Caching errors: If a database query fails and you cache the error response, every subsequent request gets the cached error until TTL expires. Never cache failure responses.
Treating cache as a primary data store: Caches are ephemeral. Data in cache can be evicted or lost at any time. Always have a fallback to the source of truth.
Ignoring cache size: An unbounded cache will consume all available memory and crash the process.
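
The "caching errors" pitfall is easiest to avoid by writing to the cache only after the load succeeds. The sketch below assumes the loader signals failure by raising an exception; get_or_load and flaky_loader are hypothetical names, and a dict stands in for the cache.

```python
from typing import Callable, Dict


def get_or_load(cache: Dict[str, str], key: str, loader: Callable[[str], str]) -> str:
    """Return the cached value, or load it; a failed load is never written to the cache."""
    if key in cache:
        return cache[key]
    value = loader(key)     # if this raises, the exception propagates and nothing is cached
    cache[key] = value
    return value


# Usage: the loader fails once, then recovers.
calls = {"count": 0}


def flaky_loader(key: str) -> str:
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("database unavailable")
    return f"value-for-{key}"


cache: Dict[str, str] = {}
try:
    get_or_load(cache, "user:123", flaky_loader)
except ConnectionError:
    pass                    # the caller decides how to handle the failure
cached_after_failure = "user:123" in cache
result = get_or_load(cache, "user:123", flaky_loader)   # retry succeeds and is cached
```

Because the failed attempt left nothing behind, the next request retries the database instead of serving a cached error until the TTL expires.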

Key Takeaways

Caching is one of the highest-leverage optimizations for read-heavy systems. A 95% hit rate means your database handles only 5% of read traffic.
TTL is a trade-off between freshness and performance. Start with shorter TTLs and extend them based on data change frequency and tolerance for staleness.
LRU is the right default eviction policy for most workloads. Consider LFU only for workloads with stable hot keys.
Multi-level caching (in-process, distributed, CDN) gives the best performance. Each level trades capacity for speed.
Always plan for cold cache scenarios. Pre-warming or gradual traffic ramp prevents database overload after restarts.
Monitor cache metrics continuously. Hit ratio, eviction rate, and latency tell you whether your caching strategy is working.