Congestion Control
Fundamentals
Congestion occurs when aggregate demand exceeds link capacity, causing queue buildup, increased delay, and packet loss. Congestion control adapts sending rates to match available network capacity.
Key Concepts
| Term | Definition |
|------|-----------|
| BDP (Bandwidth-Delay Product) | bandwidth * RTT; optimal amount of data in flight |
| cwnd (congestion window) | Sender-side limit on unacknowledged data |
| rwnd (receive window) | Receiver-advertised flow control limit |
| Effective window | min(cwnd, rwnd) |
| Throughput | Approximately cwnd / RTT in steady state |
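A quick worked example ties these quantities together (the 100 Mbit/s link and 40 ms RTT below are illustrative figures, not from the text):

```python
# Bandwidth-delay product: the amount of data that must be in flight
# to keep the pipe full. Figures below are illustrative.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """BDP = bandwidth * RTT, converted from bits to bytes."""
    return bandwidth_bps * rtt_s / 8

bdp = bdp_bytes(100e6, 0.040)       # 100 Mbit/s path, 40 ms RTT
print(f"BDP: {bdp:.0f} bytes ({bdp / 1460:.0f} segments of 1460 B)")

# Steady-state throughput is approximately cwnd / RTT: a cwnd of one
# BDP saturates the link; more only fills queues.
print(f"Throughput at cwnd = 1 BDP: {bdp * 8 / 0.040 / 1e6:.0f} Mbit/s")
```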
TCP Congestion Control Evolution
TCP Tahoe (1988)
Van Jacobson's original congestion control scheme, introduced after the Internet congestion collapses of 1986.
- Slow start: cwnd starts at 1 MSS, doubles each RTT (exponential growth).
- Congestion avoidance: After reaching ssthresh, cwnd increases by 1 MSS per RTT (linear growth; the additive-increase half of AIMD).
- On loss (timeout): Set ssthresh = cwnd/2, reset cwnd = 1 MSS, re-enter slow start.
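The three rules above can be condensed into a per-RTT sketch (units of MSS; a toy model, not a real TCP implementation):

```python
# Per-RTT toy model of Tahoe's window dynamics (units: MSS).

def tahoe_step(cwnd: float, ssthresh: float, loss: bool):
    """Advance one RTT; return the new (cwnd, ssthresh)."""
    if loss:                              # timeout: halve ssthresh, restart
        return 1.0, max(cwnd / 2, 2.0)
    if cwnd < ssthresh:                   # slow start: double each RTT
        return min(cwnd * 2, ssthresh), ssthresh
    return cwnd + 1, ssthresh             # congestion avoidance: +1 MSS/RTT

cwnd, ssthresh = 1.0, 64.0
trace = []
for rtt in range(12):
    cwnd, ssthresh = tahoe_step(cwnd, ssthresh, loss=(rtt == 8))
    trace.append(cwnd)
print(trace)   # exponential ramp, linear growth, then collapse to 1 on loss
```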
TCP Reno (1990)
Adds fast recovery to avoid returning to slow start on every loss.
- Fast retransmit: Retransmit on 3 duplicate ACKs (don't wait for timeout).
- Fast recovery: On 3 dup ACKs, ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS. Exit fast recovery on new ACK, set cwnd = ssthresh.
- On timeout: Same as Tahoe (cwnd = 1).
Problem: performs poorly with multiple losses in a single window.
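Reno's two loss reactions can be summarized in a small sketch (units of MSS; illustrative, and it omits the 1 MSS of cwnd inflation per additional dup ACK during recovery):

```python
# Reno's two loss reactions (units: MSS). Illustrative; real stacks also
# inflate cwnd by 1 MSS per additional dup ACK while in fast recovery.

def reno_on_loss(cwnd: float, signal: str):
    """Return (cwnd, ssthresh) after a loss signal."""
    ssthresh = max(cwnd / 2, 2.0)
    if signal == "dupack3":               # fast retransmit + fast recovery
        return ssthresh + 3, ssthresh     # +3 for the three dup ACKs
    if signal == "timeout":               # fall back to slow start (as Tahoe)
        return 1.0, ssthresh
    raise ValueError(signal)

print(reno_on_loss(32.0, "dupack3"))      # mild cut, stays near ssthresh
print(reno_on_loss(32.0, "timeout"))      # drastic cut back to 1 MSS
```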
TCP NewReno (RFC 6582)
Improves Reno's fast recovery by tracking the highest sequence number sent when entering recovery. Does not exit fast recovery until all data outstanding at the time of loss is acknowledged (handles multiple losses in one window).
TCP SACK (RFC 2018)
Selective Acknowledgment allows the receiver to report non-contiguous blocks of received data, enabling the sender to retransmit only truly lost segments rather than everything after the first loss.
TCP CUBIC (Linux default since 2.6.19)
CUBIC uses a cubic function of time since the last congestion event rather than relying on ACK clocking.
W(t) = C * (t - K)^3 + W_max
Where:
- C = scaling constant (0.4)
- K = time to return to W_max = cubic_root(W_max * (1 - beta) / C)
- W_max = window size at last loss
- beta = multiplicative decrease factor (0.7; the window is cut to beta * W_max on loss)
- Concave growth below W_max: rapid recovery toward the previous operating point.
- Convex probing above W_max: cautious exploration of new capacity.
- Window growth is independent of RTT, improving fairness for high-RTT flows.
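Plugging the RFC 8312 constants into W(t) shows the concave/convex shape (W_max = 100 segments is a hypothetical starting point):

```python
# CUBIC window vs. time since the last loss, using the RFC 8312 constants.
# W_max = 100 segments is a hypothetical value.

C = 0.4      # scaling constant
BETA = 0.7   # window is cut to BETA * W_max on loss

def cubic_window(t: float, w_max: float) -> float:
    """W(t) = C * (t - K)^3 + W_max."""
    k = (w_max * (1 - BETA) / C) ** (1 / 3)   # time to climb back to W_max
    return C * (t - k) ** 3 + w_max

w_max = 100.0
k = (w_max * (1 - BETA) / C) ** (1 / 3)
print(f"K = {k:.2f} s")
print(f"W(0)  = {cubic_window(0, w_max):.1f}   (just after loss: BETA * W_max)")
print(f"W(K)  = {cubic_window(k, w_max):.1f}  (plateau at the old W_max)")
print(f"W(2K) = {cubic_window(2 * k, w_max):.1f}  (convex probing beyond W_max)")
```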
TCP BBR (Bottleneck Bandwidth and Round-trip propagation time)
Developed by Google. A model-based approach that estimates the delivery rate and propagation delay rather than relying on loss as a congestion signal.
BBRv1 states:
- Startup: Exponential probing to find bottleneck bandwidth (like slow start but measures delivery rate).
- Drain: Reduce inflight to match BDP after startup overshoot.
- ProbeBW: Steady state; cycles through pacing gain factors [1.25, 0.75, 1, 1, 1, 1, 1, 1] to probe for more bandwidth.
- ProbeRTT: Periodically (roughly every 10 seconds) reduces cwnd to 4 packets for at least 200ms to re-measure the true minimum RTT.
Sending rate = BtlBw (estimated bandwidth)
cwnd = 2 * BDP (in ProbeBW)
Pacing enforced to spread packets evenly
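A sketch of ProbeBW pacing under assumed estimates (the 50 Mbit/s bottleneck and 20 ms min RTT are illustrative):

```python
# BBRv1 ProbeBW sketch: the pacing rate cycles through the 8 gain phases
# around the estimated bottleneck bandwidth. Estimates are illustrative.

PROBE_BW_GAINS = [1.25, 0.75, 1, 1, 1, 1, 1, 1]

def pacing_rate(btl_bw_bps: float, phase: int) -> float:
    """Pacing rate = gain * estimated bottleneck bandwidth."""
    return PROBE_BW_GAINS[phase % len(PROBE_BW_GAINS)] * btl_bw_bps

def bbr_cwnd_bytes(btl_bw_bps: float, min_rtt_s: float) -> float:
    """In ProbeBW the cwnd cap is 2 * BDP."""
    return 2 * btl_bw_bps * min_rtt_s / 8

btl_bw, min_rtt = 50e6, 0.020      # assumed estimates: 50 Mbit/s, 20 ms
for phase in range(8):
    print(f"phase {phase}: pace at {pacing_rate(btl_bw, phase) / 1e6:.2f} Mbit/s")
print(f"cwnd cap: {bbr_cwnd_bytes(btl_bw, min_rtt):.0f} bytes")
```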
BBRv2
Addresses fairness issues of BBRv1 (excessive loss at bottleneck, unfairness to CUBIC flows).
- Responds to ECN marks and packet loss (not loss-agnostic like v1).
- Improved ProbeRTT (less aggressive, shorter duration).
- Better coexistence with loss-based congestion control.
- Uses an explicit inflight cap based on loss and ECN signals.
ECN (Explicit Congestion Notification)
ECN (RFC 3168) allows routers to signal congestion before dropping packets by marking the IP header's ECN field.
ECN Mechanism
Sender sets ECT(0) or ECT(1) in IP header
→ Router experiencing congestion sets CE (Congestion Experienced)
→ Receiver sets ECE flag in TCP ACK
→ Sender reduces cwnd and sets CWR flag
| ECN Field (2 bits) | Codepoint | Meaning |
|--------------------|-----------|---------|
| 00 | Not-ECT | Not ECN-capable |
| 01 | ECT(1) | ECN-capable |
| 10 | ECT(0) | ECN-capable |
| 11 | CE | Congestion Experienced |
Benefits: avoids retransmission delay, especially important for short flows and real-time traffic.
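The codepoints occupy the two low-order bits of the IP TOS / Traffic Class byte, so decoding them is a bit-mask (a minimal sketch):

```python
# Decoding the 2-bit ECN field: the two least-significant bits of the
# IP TOS / Traffic Class byte (RFC 3168).

ECN_CODEPOINTS = {
    0b00: "Not-ECT",   # not ECN-capable
    0b01: "ECT(1)",    # ECN-capable transport
    0b10: "ECT(0)",    # ECN-capable transport
    0b11: "CE",        # congestion experienced
}

def ecn_codepoint(tos_byte: int) -> str:
    return ECN_CODEPOINTS[tos_byte & 0b11]

print(ecn_codepoint(0x02))   # ECT(0)
print(ecn_codepoint(0x03))   # CE
```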
Active Queue Management (AQM)
AQM algorithms in routers proactively manage queues rather than relying on tail-drop.
RED (Random Early Detection)
- Drops/marks packets probabilistically based on average queue length.
- Drop probability increases linearly between min_th and max_th.
- Problem: sensitive to parameter tuning; often misconfigured in practice.
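The linear ramp can be sketched as follows (the thresholds and max_p here are illustrative, not recommended settings):

```python
# RED's linear drop-probability ramp between min_th and max_th.
# Parameters are illustrative, not a tuning recommendation.

def red_drop_prob(avg_q: float, min_th: float = 5.0,
                  max_th: float = 15.0, max_p: float = 0.1) -> float:
    """Probability of dropping/marking a packet, given avg queue length."""
    if avg_q < min_th:
        return 0.0                       # below min_th: never drop
    if avg_q >= max_th:
        return 1.0                       # beyond max_th: force drop
    return max_p * (avg_q - min_th) / (max_th - min_th)

for q in (2, 5, 10, 14, 20):
    print(f"avg queue {q:2d} pkts -> drop prob {red_drop_prob(q):.3f}")
```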
CoDel (Controlled Delay)
Targets the problem of bufferbloat by controlling sojourn time (time packets spend in queue).
Algorithm:
If packet sojourn time > TARGET (5ms) for at least INTERVAL (100ms):
Enter dropping state
Drop packets at shrinking intervals: next drop after INTERVAL / sqrt(count)
If sojourn time drops below TARGET:
Exit dropping state
- No tuning required (only TARGET and INTERVAL, which work well at defaults).
- Distinguishes between good queues (transient bursts) and bad queues (standing queues).
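The control law in the dropping state schedules each successive drop after INTERVAL / sqrt(count), so drops become progressively denser while delay stays high; a minimal sketch:

```python
# CoDel's control law: in the dropping state, the gap to the next drop
# shrinks as INTERVAL / sqrt(count).

import math

TARGET = 0.005     # 5 ms acceptable standing-queue delay
INTERVAL = 0.100   # 100 ms sliding window

def next_drop_time(now: float, count: int) -> float:
    """Schedule the next drop while sojourn time stays above TARGET."""
    return now + INTERVAL / math.sqrt(count)

t, count = 0.0, 1
for _ in range(5):
    t = next_drop_time(t, count)
    print(f"drop #{count} at t = {t * 1000:.1f} ms")
    count += 1
```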
FQ-CoDel (Fair Queuing + CoDel)
Combines stochastic fair queuing with CoDel:
- Packets classified into 1024 queues via hashing on flow 5-tuple.
- Deficit Round Robin (DRR) services queues, prioritizing queues with new packets.
- CoDel applied independently to each queue.
- Merged into Linux in 2012 (kernel 3.5) and now the default qdisc in most distributions. Highly effective against bufferbloat.
DCTCP (Data Center TCP)
Designed for low-latency datacenter environments. Uses ECN marks to estimate the extent of congestion, not just its presence.
alpha = (1 - g) * alpha + g * F
(F = fraction of ECN-marked packets in last window)
cwnd = cwnd * (1 - alpha/2)
(Proportional reduction rather than halving)
- Achieves high throughput with very low queue occupancy (< 1 BDP).
- Requires ECN-capable switches with a low marking threshold.
- Not safe for deployment on the public Internet (too aggressive against CUBIC/Reno flows).
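The two update rules above, using the EWMA gain g = 1/16 from the DCTCP paper, look like this in a sketch:

```python
# DCTCP congestion estimate: an EWMA of the fraction of ECN-marked
# packets, then a proportional (not halving) window cut.

G = 1 / 16   # EWMA gain g, per the DCTCP paper

def dctcp_update(alpha: float, marked_fraction: float) -> float:
    """alpha <- (1 - g) * alpha + g * F"""
    return (1 - G) * alpha + G * marked_fraction

def dctcp_cwnd(cwnd: float, alpha: float) -> float:
    """cwnd <- cwnd * (1 - alpha / 2): mild cut when few packets marked."""
    return cwnd * (1 - alpha / 2)

alpha = 0.0
for f in (0.0, 0.5, 1.0, 1.0):     # rising congestion across windows
    alpha = dctcp_update(alpha, f)
print(f"alpha = {alpha:.3f}")
print(f"cwnd 100 -> {dctcp_cwnd(100.0, alpha):.1f}")
```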
Congestion Control for Real-Time
Google Congestion Control (GCC)
Used in WebRTC for video conferencing:
- Delay-based: monitors one-way delay gradient to detect congestion.
- Adjusts sending rate (not window) to match available bandwidth.
- Targets low latency over high throughput.
SCReAM (Self-Clocked Rate Adaptation for Multimedia)
- IETF RFC 8298 for real-time media.
- Self-clocked: uses ACK timing to pace sending.
- Integrates with codec rate control to match encoding rate to network capacity.
NADA (Network-Assisted Dynamic Adaptation)
- Uses ECN or explicit rate signals from the network.
- Designed for interactive multimedia in managed networks.
Fairness Metrics
Jain's Fairness Index
Quantifies how fairly bandwidth is shared among N flows.
J(x1, x2, ..., xn) = (sum(xi))^2 / (n * sum(xi^2))
Range: 1/n (maximally unfair) to 1 (perfectly fair)
- J = 1: all flows receive equal throughput.
- J = 1/n: one flow gets everything.
- Useful for comparing congestion control algorithms in shared bottleneck scenarios.
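A direct transcription of the formula:

```python
# Jain's fairness index over a set of flow throughputs.

def jain_index(rates):
    """J = (sum x)^2 / (n * sum x^2); ranges from 1/n to 1."""
    n = len(rates)
    return sum(rates) ** 2 / (n * sum(x * x for x in rates))

print(jain_index([10, 10, 10, 10]))          # 1.0  (perfectly fair)
print(jain_index([40, 0, 0, 0]))             # 0.25 (one flow gets everything)
print(round(jain_index([10, 20, 30, 40]), 3))   # 0.833
```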
Max-Min Fairness
An allocation is max-min fair if no flow's rate can be increased without decreasing the rate of a flow that is already receiving less.
Progressive filling algorithm:
- Start all flows at rate 0.
- Increase all unsatisfied flows equally.
- When a link saturates, freeze flows bottlenecked at that link.
- Continue increasing remaining flows until all are frozen.
Properties:
- Favors small flows (guarantees minimum rate).
- May waste capacity if not all demands are equal.
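For the special case of a single shared link, progressive filling reduces to water-filling over the flows' demands (the demands and capacity below are illustrative):

```python
# Progressive filling on one shared link: rates rise together until the
# link saturates or a flow's demand is met. Figures are illustrative.

def max_min_single_link(demands, capacity):
    """Return the max-min fair allocation for flows sharing one link."""
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > 1e-12:
        share = remaining / len(active)
        satisfied = {i for i in active if demands[i] - alloc[i] <= share}
        if not satisfied:
            for i in active:             # link saturates: split equally
                alloc[i] += share
            break
        for i in satisfied:              # freeze flows whose demand is met
            remaining -= demands[i] - alloc[i]
            alloc[i] = demands[i]
        active -= satisfied
    return alloc

# Flow 0 is capped at its small demand; the rest split the remainder.
print(max_min_single_link([2, 4, 10, 10], 10))
```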
Proportional Fairness
Maximizes the sum of log(throughput) across all flows.
Maximize: sum(log(xi)) for all flows i
Subject to: capacity constraints
- Allows high-demand flows to get more bandwidth, but with diminishing returns.
- TCP AIMD approximates proportional fairness in steady state.
- Balances efficiency (total throughput) and fairness.
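The classic two-link ("parking lot") example illustrates the trade-off: one long flow crosses two unit-capacity links, each of which also carries one short flow. A crude one-dimensional search over the long flow's rate recovers the proportional-fair optimum (this scenario is an illustration, not from the text above):

```python
import math

# Each link has capacity 1 and carries the long flow plus one short flow,
# so each short flow gets 1 - x_long.
def total_log_utility(x_long: float) -> float:
    """sum(log xi) over the long flow and the two short flows."""
    x_short = 1 - x_long
    return math.log(x_long) + 2 * math.log(x_short)

# Brute-force search over the long flow's rate.
best = max((x / 1000 for x in range(1, 1000)), key=total_log_utility)
print(f"long flow ~ {best:.3f}, each short flow ~ {1 - best:.3f}")
# Analytic optimum: x_long = 1/3. Max-min would give 1/2 to every flow,
# for a lower total throughput (1.5 vs ~1.67).
```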
Comparison
| Criterion | Max-Min | Proportional | TCP (AIMD) |
|-----------|---------|--------------|------------|
| Small flow protection | Strong | Moderate | Moderate |
| Total throughput | Lower | Higher | Moderate |
| RTT bias | None (ideal) | None (ideal) | Biased against long RTT |
| Practical use | Theoretical benchmark | Utility maximization | Deployed standard |
Summary of TCP Variants
| Variant | Signal | Strengths | Weaknesses |
|---------|--------|-----------|------------|
| Reno | Loss | Simple, well-understood | Poor multi-loss recovery |
| CUBIC | Loss | RTT-fair, aggressive probing | Fills buffers |
| Vegas | Delay | Low queuing delay | Unfair with loss-based flows |
| BBRv1 | Model | High throughput, low delay | Unfair, high retransmit rate |
| BBRv2 | Model+ECN+Loss | Improved fairness | Still maturing |
| DCTCP | ECN extent | Ultra-low datacenter latency | Not Internet-safe |