Rate Limiting Algorithms

Why Rate Limiting Matters

Every API has finite capacity. Without rate limiting, a single misbehaving client — or an intentional attack — can consume all available resources, degrading service for everyone. Rate limiting algorithms decide how to measure and enforce request quotas fairly and efficiently.

Fixed Window

The simplest approach: count requests in fixed time intervals.

How It Works

Divide time into fixed windows (e.g., 1-minute intervals). Each client gets a counter that resets at the start of each window. If the counter exceeds the limit, reject the request.

Window: 12:00:00 - 12:00:59  |  Limit: 100 requests/minute

12:00:05  Request 1   -> allowed (count: 1)
12:00:15  Request 50  -> allowed (count: 50)
12:00:45  Request 100 -> allowed (count: 100)
12:00:46  Request 101 -> rejected (429)
12:01:00  Counter resets to 0
12:01:01  Request 1   -> allowed (count: 1)

The Boundary Problem

Fixed windows have a well-known burst problem at window boundaries. A client can send 100 requests at the end of one window and 100 at the start of the next, producing 200 requests in a 1-second span while technically respecting the 100/minute limit.

Window 1: 11:59:30 - 11:59:59  -> 100 requests (allowed)
Window 2: 12:00:00 - 12:00:30  -> 100 requests (allowed)

Result: 200 requests in 60 seconds, despite a 100/minute limit

Tradeoffs

Extremely simple to implement: one counter per client per window
Low memory: just a counter and a timestamp
Boundary burst problem allows 2x the intended rate
Works well when burst tolerance is acceptable

Sliding Window Log

Track every request timestamp and count requests within a sliding time range.

How It Works

For each request, record its timestamp. To check the rate, count all timestamps within the last N seconds. Remove timestamps older than the window.

Limit: 100 requests per 60 seconds

12:00:30  Check: count timestamps from 11:59:30 to 12:00:30
          Found: 73 requests -> allowed, add timestamp
12:00:45  Check: count timestamps from 11:59:45 to 12:00:45
          Found: 99 requests -> allowed, add timestamp
12:00:46  Check: count timestamps from 11:59:46 to 12:00:46
          Found: 100 requests -> rejected (429)

Tradeoffs

Perfectly accurate: no boundary burst problem
High memory usage: stores every request timestamp
Expensive computation: counting entries on every request
Impractical for high-volume APIs without optimization

Sliding Window Counter

A hybrid approach that approximates the sliding window using two fixed-window counters.

How It Works

Maintain counters for the current window and the previous window. Estimate the request count by weighting the previous window's count based on how far into the current window you are.

Previous window (11:59:00 - 11:59:59): 84 requests
Current window  (12:00:00 - 12:00:59): 36 requests
Current time: 12:00:15 (25% into current window)

Estimated count = (84 * 0.75) + 36 = 63 + 36 = 99
Limit: 100
Result: allowed (1 remaining)

Tradeoffs

Smooths out the boundary burst problem (not perfectly, but significantly)
Low memory: two counters per client
Minimal computation
Good balance of accuracy and efficiency for most APIs

Cloudflare uses a variation of this approach for their rate limiting product.

Token Bucket

Allow bursts up to a defined limit while enforcing an average rate over time.

How It Works

Each client has a bucket that holds tokens. The bucket has a maximum capacity. Tokens are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected.

Bucket capacity: 10 tokens
Refill rate: 1 token per second

Time 0:    Bucket has 10 tokens
           5 requests arrive -> 5 tokens consumed -> 5 remaining
Time 1s:   1 token added -> 6 remaining
Time 2s:   1 token added -> 7 remaining
           10 requests arrive -> 7 allowed, 3 rejected -> 0 remaining
Time 3s:   1 token added -> 1 remaining

Configuration

A token bucket has two parameters:

{
  "bucket_capacity": 10,
  "refill_rate_per_second": 1
}

The capacity controls the maximum burst size. The refill rate controls the sustained average rate. A bucket with capacity 100 and refill rate 10/second allows a burst of 100 requests immediately, then sustains 10 requests/second.

Why Token Bucket Works for Most APIs

Real traffic is bursty. A mobile app might fetch 10 resources at once when the user opens a screen, then go quiet for 30 seconds. Token bucket accommodates this natural pattern: accumulate tokens during idle periods, spend them during bursts.

Amazon API Gateway and many AWS services use token bucket rate limiting.

Tradeoffs

Allows natural traffic bursts within defined limits
Simple to implement and reason about
Two intuitive parameters: burst size and sustained rate
Requires tracking the bucket state (token count and last refill time)

Leaky Bucket

Process requests at a constant rate, regardless of arrival pattern.

How It Works

Requests enter a queue (the bucket). The queue processes requests at a fixed rate. If the queue is full, new requests are rejected.

Queue capacity: 10
Processing rate: 1 request per second

Time 0:    8 requests arrive -> all queued (queue: 8)
Time 0-1s: 1 request processed (queue: 7)
Time 1s:   5 more requests arrive -> all queued (queue: 11) -> 1 rejected (queue: 10)
Time 1-2s: 1 request processed (queue: 9)

How It Differs from Token Bucket

The key difference: leaky bucket smooths output to a constant rate. Token bucket allows burst output up to the capacity.

Token bucket:  10 requests arrive -> all 10 processed immediately (if tokens available)
Leaky bucket:  10 requests arrive -> processed one at a time at the fixed rate

Tradeoffs

Produces a perfectly smooth output rate
Prevents any bursts from reaching the backend
Requests may be delayed (queued), not just rejected
Less suitable for APIs where clients expect immediate responses
NGINX uses leaky bucket for request rate limiting with the limit_req directive

Choosing the Right Algorithm

Algorithm	Burst Handling	Memory	Accuracy	Best For
Fixed window	Allows 2x burst at boundary	Very low	Low	Simple internal APIs
Sliding window log	No bursts	High	Perfect	Low-volume, strict limits
Sliding window counter	Minimal burst	Low	Good	General purpose
Token bucket	Controlled bursts	Low	Good	Most public APIs
Leaky bucket	No bursts (smoothing)	Low	Good	Constant-rate backends

For most APIs, use token bucket. It handles real-world bursty traffic gracefully while enforcing a sustained rate. Use sliding window counter when you need stricter limits without burst tolerance. Use leaky bucket when your backend requires a constant processing rate.

Distributed Rate Limiting with Redis

In a multi-server deployment, rate limit state must be shared. Redis is the standard solution.

Token Bucket in Redis

Store the token count and last refill timestamp per client:

Key: ratelimit:user_123
Value: { tokens: 7, last_refill: 1711033200 }
TTL: 3600 (auto-cleanup for inactive clients)

The check-and-decrement operation must be atomic. Use a Redis Lua script to avoid race conditions:

1. GET current tokens and last refill time
2. Calculate tokens to add based on elapsed time
3. Cap at bucket capacity
4. If tokens >= 1, decrement and RETURN allowed
5. Else RETURN rejected with retry-after time

Running this as a single Lua script ensures atomicity without distributed locks.

Redis Considerations

Use MULTI/EXEC or Lua scripts for atomic operations
Set TTLs on rate limit keys to avoid unbounded memory growth
Consider Redis Cluster for high-throughput APIs
Handle Redis failures gracefully: if Redis is down, either allow all requests (fail open) or reject all requests (fail closed), depending on your security requirements

Stripe and GitHub both use Redis-backed rate limiting in production.

Common Pitfalls

Using fixed window in production without understanding the boundary problem. A client can legitimately double your intended rate at every window boundary. If your backend cannot handle 2x the advertised rate, use sliding window or token bucket.

Setting bucket capacity too high. A token bucket with capacity 10,000 and refill rate 10/second allows a client to send 10,000 requests in a single burst after a period of inactivity. Set capacity relative to what your backend can absorb in a burst.

Not handling Redis failures. If your rate limiter depends on Redis and Redis goes down, what happens? Decide in advance: fail open (allow requests, risk overload) or fail closed (reject requests, guaranteed downtime). Most APIs fail open with monitoring alerts.

Rate limiting only by API key. A compromised or shared API key bypasses per-user limits. Layer rate limits: per API key, per IP address, and per user ID.

Forgetting to rate limit unauthenticated endpoints. Login, registration, and password reset endpoints are prime targets for brute force. Rate limit these by IP address.

Key Takeaways

Fixed window is simple but allows 2x bursts at boundaries; use it only when burst tolerance is acceptable.
Token bucket is the best default for most APIs because it accommodates natural traffic bursts while enforcing a sustained rate.
Leaky bucket produces a constant output rate; use it when your backend requires smooth, predictable load.
Sliding window counter offers a good balance of accuracy and efficiency without the memory cost of logging every request.
Use Redis with Lua scripts for distributed rate limiting; always plan for Redis failures with an explicit fail-open or fail-closed policy.