Load Balancing

A load balancer sits between clients and servers, distributing requests so that no single backend is overwhelmed. It is one of the first components you add when scaling beyond a single machine and one of the last you stop thinking about.

Why Load Balancing Matters

Availability. If one server dies, the balancer routes around it.
Throughput. Spread work across N servers to handle up to roughly N times the load, as long as a shared dependency such as a database does not become the bottleneck.
Latency. Direct users to the least-loaded or geographically nearest server.

Without a load balancer, a single server is both a performance ceiling and a single point of failure.

Layer 4 vs Layer 7

Load balancers operate at different layers of the network stack, and the layer determines what information is available for routing decisions.

Layer 4 (Transport)

Operates on TCP/UDP packets. Sees source IP, destination IP, and port numbers. Does not inspect the payload.

Very fast — minimal processing per packet
Works for any TCP/UDP protocol, not just HTTP
Cannot route based on URL path, headers, or cookies
Example: AWS Network Load Balancer, HAProxy in TCP mode

Layer 7 (Application)

Operates on HTTP requests. Sees the full request: URL, headers, cookies, body.

Can route /api/* to one pool and /static/* to another
Can inspect cookies for sticky sessions
Can rewrite URLs, add headers, terminate TLS
Slightly higher latency than L4 due to payload inspection
Example: AWS Application Load Balancer, Nginx, Envoy
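
As a rough illustration of content-based routing, the sketch below picks a backend pool by longest matching URL prefix. The pool addresses, the prefix table, and the select_pool name are hypothetical and not taken from any particular product.

```python
# Hypothetical prefix-to-pool table; addresses are placeholders.
POOLS = {
    "/api/": ["10.0.1.10:8080", "10.0.1.11:8080"],
    "/static/": ["10.0.2.10:8080"],
}
DEFAULT_POOL = ["10.0.3.10:8080"]


def select_pool(path: str) -> list:
    """Return the pool whose prefix is the longest match for the request path."""
    best = ""
    for prefix in POOLS:
        if path.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return POOLS[best] if best else DEFAULT_POOL
```

An L4 balancer cannot make this decision because it never sees the request path.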

When to Use Which

Use L4 when you need raw throughput and protocol-agnostic balancing (database connections, game servers, non-HTTP protocols). Use L7 when you need content-based routing, TLS termination, or request-level features.

Load Balancing Algorithms

Round Robin

Requests are distributed to servers in order: A, B, C, A, B, C. Simple and fair when servers are identical.

Request 1 -> Server A
Request 2 -> Server B
Request 3 -> Server C
Request 4 -> Server A

Weakness: Ignores server load. If one server is handling a slow query, it still gets the next request.
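
A minimal round-robin selector might look like the following sketch. The class name is illustrative, and a real balancer would also skip unhealthy servers.

```python
import itertools


class RoundRobin:
    """Hand out servers in a fixed repeating order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


rr = RoundRobin(["A", "B", "C"])
order = [rr.pick() for _ in range(4)]  # A, B, C, A
```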

Weighted Round Robin

Each server gets a weight proportional to its capacity. A server with weight 3 gets three requests for every one that a weight-1 server gets.

Useful when servers have different hardware specs.
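
One way to implement weighting is the "smooth" variant sketched below, which interleaves picks instead of sending a burst to the heaviest server. This is one possible approach, not the only one; the class name and the example weights are assumptions.

```python
class SmoothWeightedRoundRobin:
    """Each pick: add every server's weight to its running score,
    choose the highest score, then subtract the total weight from it."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.total = sum(self.weights.values())
        self.current = {server: 0 for server in self.weights}

    def pick(self):
        for server, weight in self.weights.items():
            self.current[server] += weight
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen


wrr = SmoothWeightedRoundRobin({"A": 3, "B": 1})
picks = [wrr.pick() for _ in range(8)]  # A is chosen 6 times, B twice
```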

Least Connections

Routes to the server with the fewest active connections. Naturally adapts to slow requests because busy servers accumulate connections.

Server A: 12 connections
Server B: 3 connections   <-- next request goes here
Server C: 8 connections

Best for: Workloads with variable request duration (long-polling, file uploads).
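
The bookkeeping behind least connections can be sketched as a counter per server that is incremented when a connection is opened and decremented when it closes. The class and method names here are illustrative.

```python
class LeastConnections:
    """Track active connections and route to the least busy server."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Pick the server with the fewest active connections and count the new one.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the connection to this server closes.
        self.active[server] -= 1


lb = LeastConnections(["A", "B", "C"])
lb.active.update({"A": 12, "B": 3, "C": 8})  # the counts from the example above
chosen = lb.acquire()  # "B"
```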

Least Response Time

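Routes to the server with the lowest average response time. It reacts to slow backends more directly than least-connections, because it measures latency instead of inferring it from connection counts, but it requires the balancer to track latency per server.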
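
Tracking latency could be done with an exponentially weighted moving average, as in this sketch. The smoothing factor of 0.2 and the choice to try unsampled servers first are assumptions made for illustration.

```python
class LeastResponseTime:
    """Route to the server with the lowest moving-average response time."""

    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha  # assumed smoothing factor: higher reacts faster
        self.avg = {server: None for server in servers}

    def record(self, server, seconds):
        # Fold a new latency sample into the moving average.
        prev = self.avg[server]
        if prev is None:
            self.avg[server] = seconds
        else:
            self.avg[server] = (1 - self.alpha) * prev + self.alpha * seconds

    def pick(self):
        # Servers with no samples yet sort first so they get measured.
        return min(
            self.avg,
            key=lambda s: -1.0 if self.avg[s] is None else self.avg[s],
        )
```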

IP Hash

Hashes the client IP to pick a server. The same client always hits the same server (as long as the server pool is stable).

Weakness: With a simple hash-modulo-N scheme, any pool change (adding or removing a server) remaps most clients at once.
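
The sketch below shows a hash-modulo scheme and a helper that counts how many clients move when the pool changes. SHA-256 is used only because it is stable across runs; the function names and server labels are hypothetical.

```python
import hashlib


def ip_hash(client_ip: str, servers: list) -> str:
    """Map a client IP to a server with hash modulo pool size."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


def count_remapped(ips, old_pool, new_pool) -> int:
    """Count clients whose server changes between two pools."""
    return sum(ip_hash(ip, old_pool) != ip_hash(ip, new_pool) for ip in ips)


ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
moved = count_remapped(ips, ["A", "B", "C"], ["A", "B", "C", "D"])
```

Growing the pool from three to four servers is expected to move about three quarters of the clients, since a client keeps its server only when the hash gives the same index under both moduli.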

Consistent Hashing

Maps both servers and request keys onto a hash ring. Each request goes to the nearest server clockwise on the ring.

Hash ring:  0 ---- S1 ---- S2 ---- S3 ---- 0

Key "user:42" hashes to a point between S1 and S2 -> routed to S2

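When a server is added or removed, only the keys in the arc between that server and its predecessor on the ring are remapped (roughly 1/N of all keys for N servers); every other key stays where it was. This makes consistent hashing well suited to caching layers and stateful workloads.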

Virtual nodes (multiple points per server on the ring) improve distribution evenness.
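
A hash ring with virtual nodes can be sketched with a sorted list and binary search. The SHA-256 hash, the default of 100 virtual nodes per server, and the "server#i" labeling scheme are assumptions chosen for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, servers=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> server
        for server in servers:
            self.add(server)

    def add(self, server):
        # Place several virtual nodes per server to even out the arcs.
        for i in range(self.vnodes):
            point = _hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key: str) -> str:
        # First ring position at or after the key's hash, wrapping past the end.
        i = bisect.bisect_left(self._points, _hash(key))
        return self._owner[self._points[i % len(self._points)]]
```

Adding a server only claims keys for the new server; no key moves between the existing servers.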

Health Checks

A load balancer must know which backends are healthy. Sending traffic to a dead server wastes time and causes errors.

Passive Health Checks

The balancer watches real traffic. If a server returns 5xx errors or drops connections, it is marked unhealthy.

No extra traffic overhead
Slower to detect failure — requires real requests to fail first

Active Health Checks

The balancer periodically sends a probe (HTTP GET /health, TCP connect, or custom script) to each server.

Detects failure before any user traffic is affected
Adds slight network overhead
Configure interval (e.g., every 5 seconds) and threshold (e.g., 3 consecutive failures to mark unhealthy)
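
The threshold logic can be sketched as a small state tracker fed with probe results. The fall threshold of 3 matches the example above; the rise threshold of 2 (consecutive successes needed to restore a server) is an assumption, as are the names.

```python
class HealthTracker:
    """Flip a backend's state only after enough consecutive contrary probes."""

    def __init__(self, fall=3, rise=2):
        self.fall = fall  # consecutive failures before marking unhealthy
        self.rise = rise  # assumed: consecutive successes before restoring
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            limit = self.rise if probe_ok else self.fall
            if self._streak >= limit:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy
```

Requiring a streak in both directions keeps a single dropped probe from removing a server and a single lucky probe from restoring one.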

Health Check Depth

A shallow check (/health returns 200) only proves the process is running and accepting requests. A deep check verifies the server can reach its database and other dependencies. Deep checks catch more problems but risk cascading failures: if a shared dependency is slow, every backend fails its check at once and the balancer removes the whole pool.

Best practice: Use a shallow check for the load balancer and a separate deep check for monitoring/alerting.
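
The two depths could be separated along these lines. The function names are hypothetical, and the dependency checks are passed in as plain callables so that the sketch does not assume any particular database or cache client.

```python
from typing import Callable, Dict, Tuple


def shallow_health() -> int:
    """For the load balancer: answers 200 whenever the process can respond."""
    return 200


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """For monitoring: run every dependency check and report each result."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results
```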

Sticky Sessions

Sticky sessions (session affinity) ensure that all requests from one client go to the same backend. This is sometimes needed when session state is stored in-process.

How It Works

Cookie-based: The load balancer sets a cookie containing the server ID. Subsequent requests include the cookie.
IP-based: Uses the client IP to consistently route to the same server (similar to IP hash).
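
Cookie-based affinity might be sketched as follows. The cookie name, server names, and the round-robin fallback are assumptions; real balancers usually encode or sign the cookie value instead of exposing a raw server ID.

```python
import itertools

SERVERS = ["app-1", "app-2", "app-3"]  # hypothetical backend names
COOKIE = "lb_server"                   # hypothetical cookie name
_fallback = itertools.cycle(SERVERS)


def route(cookies: dict, healthy: set):
    """Return (server, cookies_to_set) for one request."""
    pinned = cookies.get(COOKIE)
    if pinned in healthy:
        return pinned, {}
    # No cookie yet, or the pinned server is gone: choose a new one and re-pin.
    for _ in range(len(SERVERS)):
        candidate = next(_fallback)
        if candidate in healthy:
            return candidate, {COOKIE: candidate}
    raise RuntimeError("no healthy backends")
```

The re-pin branch is where in-process session state is lost: the client reaches a healthy server, but not the one that held its session.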

Why to Avoid Them

Sticky sessions undermine the core benefit of load balancing. If the pinned server goes down, the session is lost. They also cause uneven load distribution. The better solution is to externalize session state (Redis, database) so any server can handle any request.

Global vs Local Load Balancing

Local (Regional) Load Balancing

Distributes traffic across servers within a single data center or region.

Clients in US-East -> Regional LB -> [Server A, B, C]

Global (Multi-Region) Load Balancing

Distributes traffic across multiple regions, typically using DNS-based routing or anycast.

User in Tokyo -> DNS returns Tokyo LB IP -> Tokyo servers
User in London -> DNS returns London LB IP -> London servers

Strategies

GeoDNS: Returns different IPs based on the client's geographic location.
Anycast: Multiple locations advertise the same IP; the network routes to the nearest one.
Latency-based: Routes to the region with the lowest measured latency from the client.
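
A latency-based resolver could be sketched like this, assuming the DNS layer already has a latency estimate per region for the client and knows which regions are healthy. The region names, the addresses (taken from the reserved documentation ranges), and the function names are placeholders.

```python
# Placeholder regional load balancer addresses.
REGION_LB = {"tokyo": "203.0.113.10", "london": "198.51.100.10"}


def pick_region(latency_ms: dict, healthy: set) -> str:
    """Choose the healthy region with the lowest measured latency."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise LookupError("no healthy region")
    return min(candidates, key=candidates.get)


def resolve(latency_ms: dict, healthy: set) -> str:
    """Return the load balancer IP the DNS answer would carry."""
    return REGION_LB[pick_region(latency_ms, healthy)]
```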

Multi-Tier Example

Tier 1: Global DNS routes user to nearest region
Tier 2: Regional L4 LB spreads TCP connections across a fleet of L7 LBs
Tier 3: L7 LB routes /api to app servers, /static to CDN origin

Real-World Examples

Netflix

Netflix uses a combination of AWS Elastic Load Balancers and its own Zuul gateway (L7). Zuul handles routing, canary deployments, and load shedding. For streaming, Netflix's control plane steers each client to a nearby Open Connect CDN appliance.

Google

Google Front End (GFE) is a global L7 load balancer that terminates TLS, applies DDoS protections, and routes to the correct backend service. Google's Maglev is a custom L4 balancer built for massive packet-per-second throughput using consistent hashing.

Shopify

Shopify routes millions of storefronts through Nginx and OpenResty at the edge. Lua scripts in OpenResty make L7 routing decisions (which shop maps to which backend pod) without a separate application layer.

Common Pitfalls

Single load balancer instance. The balancer itself must be redundant — use active-passive or active-active pairs.
No health checks. Without them, users get routed to dead servers. Always configure both active and passive checks.
Sticky sessions as a crutch. Fix the root cause (in-process state) rather than working around it with affinity.
Ignoring connection limits. Each backend has a max connection count. If the LB sends more, the server will degrade. Configure max-connections per backend.
TLS everywhere, terminated nowhere. Decide where TLS terminates. Terminating at the LB is common, but ensure internal traffic is also encrypted if the network is untrusted.
Using L7 when L4 suffices. L7 inspection has CPU cost. For non-HTTP protocols or very high packet rates, L4 is more efficient.
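
The connection-limit pitfall can be illustrated with a per-backend cap layered on least-connections selection. The Backend type, the pick function, and the choice to return None when every backend is full are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    name: str
    max_connections: int
    active: int = 0


def pick(backends: List[Backend]) -> Optional[Backend]:
    """Pick the least busy backend that still has headroom, or None if all are full."""
    available = [b for b in backends if b.active < b.max_connections]
    if not available:
        return None  # caller should queue the request or answer 503
    chosen = min(available, key=lambda b: b.active)
    chosen.active += 1
    return chosen
```

Rejecting or queueing at the balancer keeps an overloaded backend from degrading further.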

Load Balancer Selection Guide

Requirement                         Recommended approach
Raw TCP/UDP throughput              Layer 4 (NLB, HAProxy TCP mode)
HTTP routing by path/header         Layer 7 (ALB, Nginx, Envoy)
WebSocket support                   Layer 7 with connection upgrade support
gRPC traffic                        Layer 7 with HTTP/2 support (Envoy)
Global traffic distribution         DNS-based (Route 53, Cloudflare) + regional LBs
Service mesh (east-west traffic)    Sidecar proxy (Envoy via Istio, or Linkerd's own linkerd2-proxy)
Simple setup, few backends          Software LB (Nginx, HAProxy)
Cloud-native, managed               Cloud LB (ALB/NLB, GCP LB, Azure LB)

Key Takeaways

Layer 4 balances connections; Layer 7 balances requests. Choose based on the routing intelligence you need.
Least-connections and consistent hashing are the most broadly useful algorithms.
Health checks are non-negotiable. Prefer active checks with reasonable intervals.
Avoid sticky sessions; externalize state instead.
Global load balancing (GeoDNS, anycast) routes users to the nearest region; local balancing distributes within a region.
The load balancer itself must be redundant — never let it become the single point of failure it was meant to eliminate.