Congestion Control
Fundamentals
Congestion occurs when aggregate demand exceeds link capacity, causing queue buildup, increased delay, and packet loss. Congestion control adapts sending rates to match available network capacity.
Key Concepts
| Term | Definition |
|------|-----------|
| BDP (Bandwidth-Delay Product) | bandwidth * RTT; optimal amount of data in flight |
| cwnd (congestion window) | Sender-side limit on unacknowledged data |
| rwnd (receive window) | Receiver-advertised flow control limit |
| Effective window | min(cwnd, rwnd) |
| Throughput | Approximately cwnd / RTT in steady state |
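A quick worked example ties these quantities together (the 100 Mbit/s link and 40 ms RTT below are illustrative figures, not from the text):

```python
# Bandwidth-delay product: the amount of data that must be in flight
# to keep the pipe full. Figures below are illustrative.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """BDP = bandwidth * RTT, converted from bits to bytes."""
    return bandwidth_bps * rtt_s / 8

bdp = bdp_bytes(100e6, 0.040)       # 100 Mbit/s path, 40 ms RTT
print(f"BDP: {bdp:.0f} bytes ({bdp / 1460:.0f} segments of 1460 B)")

# Steady-state throughput is approximately cwnd / RTT: a cwnd of one
# BDP saturates the link; more only fills queues.
print(f"Throughput at cwnd = 1 BDP: {bdp * 8 / 0.040 / 1e6:.0f} Mbit/s")
```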
TCP Congestion Control Evolution
TCP Tahoe (1988)
Van Jacobson's original congestion control scheme, introduced after the Internet congestion collapses of 1986.
- Slow start: cwnd starts at 1 MSS, doubles each RTT (exponential growth).
- Congestion avoidance: After reaching ssthresh, cwnd increases by 1 MSS per RTT (linear growth; the additive-increase half of AIMD).
- On loss (timeout): Set ssthresh = cwnd/2, reset cwnd = 1 MSS, re-enter slow start.
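The three rules above can be condensed into a per-RTT sketch (units of MSS; a toy model, not a real TCP implementation):

```python
# Per-RTT toy model of Tahoe's window dynamics (units: MSS).

def tahoe_step(cwnd: float, ssthresh: float, loss: bool):
    """Advance one RTT; return the new (cwnd, ssthresh)."""
    if loss:                              # timeout: halve ssthresh, restart
        return 1.0, max(cwnd / 2, 2.0)
    if cwnd < ssthresh:                   # slow start: double each RTT
        return min(cwnd * 2, ssthresh), ssthresh
    return cwnd + 1, ssthresh             # congestion avoidance: +1 MSS/RTT

cwnd, ssthresh = 1.0, 64.0
trace = []
for rtt in range(12):
    cwnd, ssthresh = tahoe_step(cwnd, ssthresh, loss=(rtt == 8))
    trace.append(cwnd)
print(trace)   # exponential ramp, linear growth, then collapse to 1 on loss
```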
TCP Reno (1990)
Adds fast recovery to avoid returning to slow start on every loss.
- Fast retransmit: Retransmit on 3 duplicate ACKs (don't wait for timeout).
- Fast recovery: On 3 dup ACKs, ssthresh = cwnd/2, cwnd = ssthresh + 3 MSS. Exit fast recovery on new ACK, set cwnd = ssthresh.
- On timeout: Same as Tahoe (cwnd = 1).
Problem: performs poorly with multiple losses in a single window.
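Reno's two loss reactions can be summarized in a small sketch (units of MSS; illustrative, and it omits the 1 MSS of cwnd inflation per additional dup ACK during recovery):

```python
# Reno's two loss reactions (units: MSS). Illustrative; real stacks also
# inflate cwnd by 1 MSS per additional dup ACK while in fast recovery.

def reno_on_loss(cwnd: float, signal: str):
    """Return (cwnd, ssthresh) after a loss signal."""
    ssthresh = max(cwnd / 2, 2.0)
    if signal == "dupack3":               # fast retransmit + fast recovery
        return ssthresh + 3, ssthresh     # +3 for the three dup ACKs
    if signal == "timeout":               # fall back to slow start (as Tahoe)
        return 1.0, ssthresh
    raise ValueError(signal)

print(reno_on_loss(32.0, "dupack3"))      # mild cut, stays near ssthresh
print(reno_on_loss(32.0, "timeout"))      # drastic cut back to 1 MSS
```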
TCP NewReno (RFC 6582)
Improves Reno's fast recovery by tracking the highest sequence number sent when entering recovery. Does not exit fast recovery until all data outstanding at the time of loss is acknowledged (handles multiple losses in one window).
TCP SACK (RFC 2018)
Selective Acknowledgment allows the receiver to report non-contiguous blocks of received data, enabling the sender to retransmit only truly lost segments rather than everything after the first loss.
TCP CUBIC (Linux default since 2.6.19)
CUBIC uses a cubic function of time since the last congestion event rather than relying on ACK clocking.
W(t) = C * (t - K)^3 + W_max
Where:
- C = scaling constant (0.4)
- K = time to return to W_max = cubic_root(W_max * (1 - beta) / C)
- W_max = window size at last loss
- beta = multiplicative decrease factor (0.7; the window is cut to beta * W_max on loss)
- Concave growth below W_max: rapid recovery toward the previous operating point.
- Convex probing above W_max: cautious exploration of new capacity.
- Window growth is independent of RTT, improving fairness for high-RTT flows.
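Plugging the RFC 8312 constants into W(t) shows the concave/convex shape (W_max = 100 segments is a hypothetical starting point):

```python
# CUBIC window vs. time since the last loss, using the RFC 8312 constants.
# W_max = 100 segments is a hypothetical value.

C = 0.4      # scaling constant
BETA = 0.7   # window is cut to BETA * W_max on loss

def cubic_window(t: float, w_max: float) -> float:
    """W(t) = C * (t - K)^3 + W_max."""
    k = (w_max * (1 - BETA) / C) ** (1 / 3)   # time to climb back to W_max
    return C * (t - k) ** 3 + w_max

w_max = 100.0
k = (w_max * (1 - BETA) / C) ** (1 / 3)
print(f"K = {k:.2f} s")
print(f"W(0)  = {cubic_window(0, w_max):.1f}   (just after loss: BETA * W_max)")
print(f"W(K)  = {cubic_window(k, w_max):.1f}  (plateau at the old W_max)")
print(f"W(2K) = {cubic_window(2 * k, w_max):.1f}  (convex probing beyond W_max)")
```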
TCP BBR (Bottleneck Bandwidth and Round-trip propagation time)
Developed by Google. A model-based approach that estimates the delivery rate and propagation delay rather than relying on loss as a congestion signal.
BBRv1 states:
- Startup: Exponential probing to find bottleneck bandwidth (like slow start but measures delivery rate).
- Drain: Reduce inflight to match BDP after startup overshoot.
- ProbeBW: Steady state; cycles through pacing gain factors [1.25, 0.75, 1, 1, 1, 1, 1, 1] to probe for more bandwidth.
- ProbeRTT: Periodically (roughly every 10 seconds) reduces cwnd to 4 packets for at least 200ms to re-measure the true minimum RTT.
Sending rate = BtlBw (estimated bandwidth)
cwnd = 2 * BDP (in ProbeBW)
Pacing enforced to spread packets evenly
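A sketch of ProbeBW pacing under assumed estimates (the 50 Mbit/s bottleneck and 20 ms min RTT are illustrative):

```python
# BBRv1 ProbeBW sketch: the pacing rate cycles through the 8 gain phases
# around the estimated bottleneck bandwidth. Estimates are illustrative.

PROBE_BW_GAINS = [1.25, 0.75, 1, 1, 1, 1, 1, 1]

def pacing_rate(btl_bw_bps: float, phase: int) -> float:
    """Pacing rate = gain * estimated bottleneck bandwidth."""
    return PROBE_BW_GAINS[phase % len(PROBE_BW_GAINS)] * btl_bw_bps

def bbr_cwnd_bytes(btl_bw_bps: float, min_rtt_s: float) -> float:
    """In ProbeBW the cwnd cap is 2 * BDP."""
    return 2 * btl_bw_bps * min_rtt_s / 8

btl_bw, min_rtt = 50e6, 0.020      # assumed estimates: 50 Mbit/s, 20 ms
for phase in range(8):
    print(f"phase {phase}: pace at {pacing_rate(btl_bw, phase) / 1e6:.2f} Mbit/s")
print(f"cwnd cap: {bbr_cwnd_bytes(btl_bw, min_rtt):.0f} bytes")
```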
BBRv2
Addresses fairness issues of BBRv1 (excessive loss at bottleneck, unfairness to CUBIC flows).
- Responds to ECN marks and packet loss (not loss-agnostic like v1).
- Improved ProbeRTT (less aggressive, shorter duration).
- Better coexistence with loss-based congestion control.
- Uses an explicit inflight cap based on loss and ECN signals.
ECN (Explicit Congestion Notification)
ECN (RFC 3168) allows routers to signal congestion before dropping packets by marking the IP header's ECN field.
ECN Mechanism
Sender sets ECT(0) or ECT(1) in IP header
→ Router experiencing congestion sets CE (Congestion Experienced)
→ Receiver sets ECE flag in TCP ACK
→ Sender reduces cwnd and sets CWR flag
| ECN Field (2 bits) | Codepoint | Meaning |
|--------------------|-----------|---------|
| 00 | Not-ECT | Not ECN-capable |
| 01 | ECT(1) | ECN-capable |
| 10 | ECT(0) | ECN-capable |
| 11 | CE | Congestion Experienced |
Benefits: avoids retransmission delay, especially important for short flows and real-time traffic.
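The codepoints occupy the two low-order bits of the IP TOS / Traffic Class byte, so decoding them is a bit-mask (a minimal sketch):

```python
# Decoding the 2-bit ECN field: the two least-significant bits of the
# IP TOS / Traffic Class byte (RFC 3168).

ECN_CODEPOINTS = {
    0b00: "Not-ECT",   # not ECN-capable
    0b01: "ECT(1)",    # ECN-capable transport
    0b10: "ECT(0)",    # ECN-capable transport
    0b11: "CE",        # congestion experienced
}

def ecn_codepoint(tos_byte: int) -> str:
    return ECN_CODEPOINTS[tos_byte & 0b11]

print(ecn_codepoint(0x02))   # ECT(0)
print(ecn_codepoint(0x03))   # CE
```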
Active Queue Management (AQM)
AQM algorithms in routers proactively manage queues rather than relying on tail-drop.
RED (Random Early Detection)
- Drops/marks packets probabilistically based on average queue length.
- Drop probability increases linearly between min_th and max_th.
- Problem: sensitive to parameter tuning; often misconfigured in practice.
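The linear ramp can be sketched as follows (the thresholds and max_p here are illustrative, not recommended settings):

```python
# RED's linear drop-probability ramp between min_th and max_th.
# Parameters are illustrative, not a tuning recommendation.

def red_drop_prob(avg_q: float, min_th: float = 5.0,
                  max_th: float = 15.0, max_p: float = 0.1) -> float:
    """Probability of dropping/marking a packet, given avg queue length."""
    if avg_q < min_th:
        return 0.0                       # below min_th: never drop
    if avg_q >= max_th:
        return 1.0                       # beyond max_th: force drop
    return max_p * (avg_q - min_th) / (max_th - min_th)

for q in (2, 5, 10, 14, 20):
    print(f"avg queue {q:2d} pkts -> drop prob {red_drop_prob(q):.3f}")
```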
CoDel (Controlled Delay)
Targets the problem of bufferbloat by controlling sojourn time (time packets spend in queue).
Algorithm:
If packet sojourn time > TARGET (5ms) for at least INTERVAL (100ms):
Enter dropping state
Drop packets at shrinking intervals: next drop after INTERVAL / sqrt(count)
If sojourn time drops below TARGET:
Exit dropping state
- No tuning required (only TARGET and INTERVAL, which work well at defaults).
- Distinguishes between good queues (transient bursts) and bad queues (standing queues).
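The control law in the dropping state schedules each successive drop after INTERVAL / sqrt(count), so drops become progressively denser while delay stays high; a minimal sketch:

```python
# CoDel's control law: in the dropping state, the gap to the next drop
# shrinks as INTERVAL / sqrt(count).

import math

TARGET = 0.005     # 5 ms acceptable standing-queue delay
INTERVAL = 0.100   # 100 ms sliding window

def next_drop_time(now: float, count: int) -> float:
    """Schedule the next drop while sojourn time stays above TARGET."""
    return now + INTERVAL / math.sqrt(count)

t, count = 0.0, 1
for _ in range(5):
    t = next_drop_time(t, count)
    print(f"drop #{count} at t = {t * 1000:.1f} ms")
    count += 1
```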
FQ-CoDel (Fair Queuing + CoDel)
Combines stochastic fair queuing with CoDel:
- Packets classified into 1024 queues via hashing on flow 5-tuple.
- Deficit Round Robin (DRR) services queues, prioritizing queues with new packets.
- CoDel applied independently to each queue.
- Merged into Linux in 2012 (kernel 3.5) and now the default qdisc in most distributions. Highly effective against bufferbloat.
DCTCP (Data Center TCP)
Designed for low-latency datacenter environments. Uses ECN marks to estimate the extent of congestion, not just its presence.
alpha = (1 - g) * alpha + g * F
(F = fraction of ECN-marked packets in last window)
cwnd = cwnd * (1 - alpha/2)
(Proportional reduction rather than halving)
- Achieves high throughput with very low queue occupancy (< 1 BDP).
- Requires ECN-capable switches with a low marking threshold.
- Not safe for deployment on the public Internet (too aggressive against CUBIC/Reno flows).
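The two update rules above, using the EWMA gain g = 1/16 from the DCTCP paper, look like this in a sketch:

```python
# DCTCP congestion estimate: an EWMA of the fraction of ECN-marked
# packets, then a proportional (not halving) window cut.

G = 1 / 16   # EWMA gain g, per the DCTCP paper

def dctcp_update(alpha: float, marked_fraction: float) -> float:
    """alpha <- (1 - g) * alpha + g * F"""
    return (1 - G) * alpha + G * marked_fraction

def dctcp_cwnd(cwnd: float, alpha: float) -> float:
    """cwnd <- cwnd * (1 - alpha / 2): mild cut when few packets marked."""
    return cwnd * (1 - alpha / 2)

alpha = 0.0
for f in (0.0, 0.5, 1.0, 1.0):     # rising congestion across windows
    alpha = dctcp_update(alpha, f)
print(f"alpha = {alpha:.3f}")
print(f"cwnd 100 -> {dctcp_cwnd(100.0, alpha):.1f}")
```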
Congestion Control for Real-Time
Google Congestion Control (GCC)
Used in WebRTC for video conferencing:
- Delay-based: monitors one-way delay gradient to detect congestion.
- Adjusts sending rate (not window) to match available bandwidth.
- Targets low latency over high throughput.
SCReAM (Self-Clocked Rate Adaptation for Multimedia)
- IETF RFC 8298 for real-time media.
- Self-clocked: uses ACK timing to pace sending.
- Integrates with codec rate control to match encoding rate to network capacity.
NADA (Network-Assisted Dynamic Adaptation)
- Uses ECN or explicit rate signals from the network.
- Designed for interactive multimedia in managed networks.
Fairness Metrics
Jain's Fairness Index
Quantifies how fairly bandwidth is shared among N flows.
J(x1, x2, ..., xn) = (sum(xi))^2 / (n * sum(xi^2))
Range: 1/n (maximally unfair) to 1 (perfectly fair)
- J = 1: all flows receive equal throughput.
- J = 1/n: one flow gets everything.
- Useful for comparing congestion control algorithms in shared bottleneck scenarios.
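A direct transcription of the formula:

```python
# Jain's fairness index over a set of flow throughputs.

def jain_index(rates):
    """J = (sum x)^2 / (n * sum x^2); ranges from 1/n to 1."""
    n = len(rates)
    return sum(rates) ** 2 / (n * sum(x * x for x in rates))

print(jain_index([10, 10, 10, 10]))          # 1.0  (perfectly fair)
print(jain_index([40, 0, 0, 0]))             # 0.25 (one flow gets everything)
print(round(jain_index([10, 20, 30, 40]), 3))   # 0.833
```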
Max-Min Fairness
An allocation is max-min fair if no flow's rate can be increased without decreasing the rate of a flow that is already receiving less.
Progressive filling algorithm:
- Start all flows at rate 0.
- Increase all unsatisfied flows equally.
- When a link saturates, freeze flows bottlenecked at that link.
- Continue increasing remaining flows until all are frozen.
Properties:
- Favors small flows (guarantees minimum rate).
- May waste capacity if not all demands are equal.
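For the special case of a single shared link, progressive filling reduces to water-filling over the flows' demands (the demands and capacity below are illustrative):

```python
# Progressive filling on one shared link: rates rise together until the
# link saturates or a flow's demand is met. Figures are illustrative.

def max_min_single_link(demands, capacity):
    """Return the max-min fair allocation for flows sharing one link."""
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > 1e-12:
        share = remaining / len(active)
        satisfied = {i for i in active if demands[i] - alloc[i] <= share}
        if not satisfied:
            for i in active:             # link saturates: split equally
                alloc[i] += share
            break
        for i in satisfied:              # freeze flows whose demand is met
            remaining -= demands[i] - alloc[i]
            alloc[i] = demands[i]
        active -= satisfied
    return alloc

# Flow 0 is capped at its small demand; the rest split the remainder.
print(max_min_single_link([2, 4, 10, 10], 10))
```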
Proportional Fairness
Maximizes the sum of log(throughput) across all flows.
Maximize: sum(log(xi)) for all flows i
Subject to: capacity constraints
- Allows high-demand flows to get more bandwidth, but with diminishing returns.
- TCP AIMD approximates proportional fairness in steady state.
- Balances efficiency (total throughput) and fairness.
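The classic two-link ("parking lot") example illustrates the trade-off: one long flow crosses two unit-capacity links, each of which also carries one short flow. A crude one-dimensional search over the long flow's rate recovers the proportional-fair optimum (this scenario is an illustration, not from the text above):

```python
import math

# Each link has capacity 1 and carries the long flow plus one short flow,
# so each short flow gets 1 - x_long.
def total_log_utility(x_long: float) -> float:
    """sum(log xi) over the long flow and the two short flows."""
    x_short = 1 - x_long
    return math.log(x_long) + 2 * math.log(x_short)

# Brute-force search over the long flow's rate.
best = max((x / 1000 for x in range(1, 1000)), key=total_log_utility)
print(f"long flow ~ {best:.3f}, each short flow ~ {1 - best:.3f}")
# Analytic optimum: x_long = 1/3. Max-min would give 1/2 to every flow,
# for a lower total throughput (1.5 vs ~1.67).
```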
Comparison
| Criterion | Max-Min | Proportional | TCP (AIMD) |
|-----------|---------|--------------|------------|
| Small flow protection | Strong | Moderate | Moderate |
| Total throughput | Lower | Higher | Moderate |
| RTT bias | None (ideal) | None (ideal) | Biased against long RTT |
| Practical use | Theoretical benchmark | Utility maximization | Deployed standard |
Summary of TCP Variants
| Variant | Signal | Strengths | Weaknesses |
|---------|--------|-----------|------------|
| Reno | Loss | Simple, well-understood | Poor multi-loss recovery |
| CUBIC | Loss | RTT-fair, aggressive probing | Fills buffers |
| Vegas | Delay | Low queuing delay | Unfair with loss-based flows |
| BBRv1 | Model | High throughput, low delay | Unfair, high retransmit rate |
| BBRv2 | Model+ECN+Loss | Improved fairness | Still maturing |
| DCTCP | ECN extent | Ultra-low datacenter latency | Not Internet-safe |