
Datacenter Networking

Datacenter Network Requirements

Requirement | Description
High bisection bandwidth | Full bandwidth between any two server groups
Low latency | Microsecond-scale for storage and RPC traffic
Scalability | Support 100K+ servers
Fault tolerance | No single point of failure; fast failover
Cost efficiency | Use commodity hardware where possible

Datacenter Topologies

Fat Tree / Clos Network

The most widely deployed datacenter topology, based on Clos network theory (Charles Clos, 1953).

       Core Switches (k/2)^2
       /    |    |    \
    Agg    Agg  Agg   Agg     (k pods, k/2 agg switches per pod)
    / \    / \  / \   / \
  ToR ToR ToR ToR ToR ToR     (k/2 ToR switches per pod)
  ||| ||| ||| ||| ||| |||
  servers (k/2 per ToR)

For a k-ary fat tree:

  • k pods, each with k/2 aggregation and k/2 ToR (edge) switches.
  • (k/2)^2 core switches.
  • Supports k^3/4 hosts.
  • Full bisection bandwidth using commodity switches.
  • k=48 example: 27,648 hosts, 2,880 switches.
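
The counts above follow directly from k. As a quick check, here is a minimal Python sketch that derives them; the function name `fat_tree_size` and the returned keys are illustrative, not from any library.

```python
def fat_tree_size(k: int) -> dict:
    """Return element counts for a k-ary fat tree (k must be even)."""
    if k < 2 or k % 2:
        raise ValueError("k must be a positive even number")
    half = k // 2
    edge = k * half      # k pods, k/2 ToR (edge) switches each
    agg = k * half       # k pods, k/2 aggregation switches each
    core = half * half   # (k/2)^2 core switches
    hosts = edge * half  # k/2 servers per ToR, so k^3/4 in total
    return {
        "pods": k,
        "edge": edge,
        "aggregation": agg,
        "core": core,
        "switches": edge + agg + core,
        "hosts": hosts,
    }


size = fat_tree_size(48)
print(size["hosts"], size["switches"])  # 27648 2880
```

The total switch count simplifies to 5k^2/4, which is where the 2,880 figure for k=48 comes from.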

Leaf-Spine (2-Tier Clos)


Simplified Clos variant common in modern datacenters.

    Spine 1   Spine 2   Spine 3   Spine 4
      |  \   / |  \   / |  \   / |
    Leaf1  Leaf2  Leaf3  Leaf4  Leaf5
    ||||   ||||   ||||   ||||   ||||
    servers
  • Every leaf connects to every spine (full mesh between tiers).
  • Consistent hop count (2 hops between any two leaves).
  • Easy to scale: add more spines for bandwidth, more leaves for ports.
  • No oversubscription when each leaf's total uplink capacity matches its total server-facing capacity.
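
Whether a leaf is oversubscribed comes down to comparing its server-facing and spine-facing capacity. The sketch below shows that arithmetic; the port counts and speeds in the example are assumed values chosen for illustration, and it assumes one uplink from each leaf to each spine.

```python
import math


def oversubscription(server_ports: int, server_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Ratio of downlink to uplink capacity on one leaf (1.0 means non-blocking)."""
    return (server_ports * server_gbps) / (uplinks * uplink_gbps)


def spines_for_nonblocking(server_ports: int, server_gbps: float,
                           uplink_gbps: float) -> int:
    """Spines needed for a 1:1 ratio, assuming one uplink per spine per leaf."""
    return math.ceil(server_ports * server_gbps / uplink_gbps)


# Hypothetical leaf: 48 x 25G server ports, 4 x 100G uplinks.
print(oversubscription(48, 25, 4, 100))     # 3.0, i.e. 3:1 oversubscribed
print(spines_for_nonblocking(48, 25, 100))  # 12
```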

Dragonfly

Designed for high-radix switches, reducing cabling cost.

Group structure:
  Each group: fully connected switches (intra-group links)
  Between groups: each switch has inter-group links to other groups
  • Three levels of links: terminal (switch to servers), intra-group (between switches in a group), inter-group (between groups).
  • High bandwidth with fewer cables than fat tree at large scale.
  • Requires adaptive routing to avoid congestion on inter-group links.
  • Used in HPC systems (Cray Slingshot) more than cloud datacenters.
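
To get a feel for how the group structure scales, the sketch below sizes a Dragonfly using the parameterization common in the Dragonfly literature: p servers per switch, a switches per group, and h inter-group links per switch. These symbols are not defined in the text above and are introduced here as assumptions. It also assumes the maximum-size configuration, with exactly one inter-group link between every pair of groups.

```python
def dragonfly_size(p: int, a: int, h: int) -> dict:
    """Size a maximum Dragonfly with one link between each pair of groups.

    p: servers (terminals) per switch
    a: switches per group (fully connected within the group)
    h: inter-group links per switch
    """
    groups = a * h + 1        # each group has a*h inter-group links to spend
    switches = a * groups
    hosts = p * switches
    radix = p + (a - 1) + h   # ports needed on every switch
    return {"groups": groups, "switches": switches,
            "hosts": hosts, "radix": radix}


# Small example with a = 2p = 2h, a ratio often suggested for balanced load.
print(dragonfly_size(p=4, a=8, h=4))
# {'groups': 33, 'switches': 264, 'hosts': 1056, 'radix': 15}
```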

Topology Comparison

Property | Fat Tree | Leaf-Spine | Dragonfly
Bisection BW | Full | Full (if provisioned) | Full (with adaptive routing)
Hop count | 2-4 | 2 | 2-3
Cabling complexity | High | Moderate | Low
Scalability | Very high | High | Very high
Typical use | Cloud DC | Cloud DC | HPC

Datacenter Transport

DCTCP (Data Center TCP)

Uses ECN to achieve high throughput with minimal queuing.

Switch: mark packets with CE when queue > K (e.g., K = 20 packets)
Receiver: feed back fraction of marked packets
Sender: cwnd = cwnd * (1 - alpha/2)
  where alpha = EWMA of marked fraction
  • Maintains queue occupancy near K, far below buffer capacity.
  • Requires ECN support on all switches and endpoints.
  • Coexistence with standard TCP requires careful isolation.
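
The sender-side rule above can be sketched in a few lines of Python. This is a simplified model, not a full TCP implementation: it updates once per window of acknowledged packets, the gain `g = 1/16` is an assumed value (it matches the commonly recommended default), and the class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class DctcpSender:
    cwnd: float = 100.0  # congestion window, in packets
    alpha: float = 0.0   # EWMA of the fraction of marked packets
    g: float = 1 / 16    # EWMA gain (assumed value)

    def on_window_acked(self, acked: int, marked: int) -> None:
        """Update alpha and cwnd after one window's worth of ACKs."""
        frac = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # Cut in proportion to the extent of congestion, not by half.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0  # ordinary additive increase


sender = DctcpSender()
sender.on_window_acked(acked=100, marked=100)
print(sender.alpha, sender.cwnd)  # 0.0625 96.875
```

Because alpha starts small and only grows under sustained marking, a brief burst of marks trims the window slightly, while persistent congestion (alpha near 1) approaches the classic halving.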

RDMA (Remote Direct Memory Access)

Allows direct memory-to-memory transfers between servers: the NIC reads and writes application buffers itself, so the data path bypasses the kernel and does not involve the remote host's CPU.

InfiniBand RDMA

  • Native RDMA on InfiniBand fabric.
  • Lossless transport built into the fabric.
  • Dominant in HPC clusters.

RoCE (RDMA over Converged Ethernet)

  • RDMA over standard Ethernet (RoCEv1 over L2, RoCEv2 over UDP/IP).
  • Requires Priority Flow Control (PFC) or other lossless mechanisms.
  • RoCEv2 is routable, enabling RDMA across L3 boundaries.
RoCEv2 packet:
  Ethernet | IP | UDP (port 4791) | IB BTH | Payload | ICRC
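
The layout above also fixes the per-packet overhead. The sketch below adds it up under stated assumptions: an untagged Ethernet header, IPv4 without options, and the 4-byte Ethernet FCS, which the layout above does not show. Extended transport headers that some operations carry after the BTH are not counted.

```python
ROCEV2_UDP_PORT = 4791  # destination port identifying RoCEv2

# Sizes in bytes. Assumes no VLAN tag and no IPv4 options.
HEADERS = {"ethernet": 14, "ipv4": 20, "udp": 8, "bth": 12}
TRAILERS = {"icrc": 4, "fcs": 4}


def rocev2_overhead() -> int:
    """Bytes of framing added around each RoCEv2 payload."""
    return sum(HEADERS.values()) + sum(TRAILERS.values())


def payload_efficiency(payload_bytes: int) -> float:
    """Fraction of frame bytes that carry payload."""
    return payload_bytes / (payload_bytes + rocev2_overhead())


print(rocev2_overhead())                     # 62
print(round(payload_efficiency(1024), 3))    # 0.943
```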

PFC and Lossless Ethernet

PFC (IEEE 802.1Qbb) provides per-priority flow control:

  • Switch sends a per-priority PAUSE frame upstream when the queue for that priority exceeds a threshold.
  • The upstream sender halts transmission for that priority class only; other priorities keep flowing.
  • Problem: PFC deadlocks (circular buffer dependencies) and head-of-line blocking.
  • Mitigations: PFC watchdog timers, DCQCN congestion control, careful VLAN/priority design.
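
The pause behavior can be modeled as a small state machine with two thresholds. In the sketch below, the names `xoff` and `xon` and the threshold values are illustrative assumptions; the model tracks a single priority queue and ignores pause timers and headroom sizing.

```python
from typing import Optional


class PfcQueue:
    """Pause/resume decisions for one priority queue, with hysteresis."""

    def __init__(self, xoff: int, xon: int):
        if xon >= xoff:
            raise ValueError("xon must be below xoff")
        self.xoff = xoff      # depth at which upstream is paused
        self.xon = xon        # depth at which upstream may resume
        self.paused = False   # is the upstream sender currently paused?

    def update(self, depth: int) -> Optional[str]:
        """Return "PAUSE" or "RESUME" when the state changes, else None."""
        if not self.paused and depth >= self.xoff:
            self.paused = True
            return "PAUSE"
        if self.paused and depth <= self.xon:
            self.paused = False
            return "RESUME"
        return None


q = PfcQueue(xoff=80, xon=40)
print([q.update(d) for d in [10, 85, 60, 90, 40, 20]])
# [None, 'PAUSE', None, None, 'RESUME', None]
```

The gap between the two thresholds keeps the queue from toggling between pause and resume on every small change in depth.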

DCQCN (Data Center QCN)

Congestion control for RoCE combining ECN-based rate reduction (like DCTCP) with QCN-style rate recovery:

  1. Switch marks packets with ECN at threshold.
  2. Receiver generates CNP (Congestion Notification Packet).
  3. Sender reduces rate based on CNP.
  4. Rate recovery proceeds in phases (fast recovery, then additive increase, then hyper increase), driven by a timer and a byte counter.
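
The sender side of these steps can be sketched as follows. This is a simplified model based on the published DCQCN description, not a NIC implementation: the gain `g = 1/256` is an assumed value, the timer and byte counter are abstracted into explicit method calls, and all names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class DcqcnSender:
    line_rate: float                   # link speed, e.g. in Gbps
    g: float = 1 / 256                 # alpha gain (assumed value)
    alpha: float = 1.0                 # congestion estimate
    rate: float = field(init=False)    # current sending rate
    target: float = field(init=False)  # rate to recover toward

    def __post_init__(self) -> None:
        self.rate = self.line_rate
        self.target = self.line_rate

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, then cut."""
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_interval(self) -> None:
        """No CNP during the last alpha-update interval: decay alpha."""
        self.alpha *= 1 - self.g

    def on_increase_event(self, additive_step: float = 0.0) -> None:
        """Timer or byte counter fired. A step of 0 models fast recovery."""
        self.target = min(self.line_rate, self.target + additive_step)
        self.rate = (self.target + self.rate) / 2


s = DcqcnSender(line_rate=100.0)
s.on_cnp()
print(s.rate)            # 50.0
s.on_increase_event()
s.on_increase_event()
print(s.rate)            # 87.5
```

Each fast-recovery step halves the distance to the target rate, so the sender returns close to its pre-congestion rate quickly before probing for more bandwidth.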

Traffic Engineering

ECMP (Equal-Cost Multi-Path)

  • Standard approach: hash on flow 5-tuple to select among equal-cost paths.
  • Provides load balancing across parallel links in Clos topologies.
  • Limitations: hash polarization (same hash at multiple stages), flow collisions (large flows share a path), no adaptation to congestion.
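
A minimal sketch of 5-tuple hashing, and of how polarization arises, is shown below. It uses CRC32 from Python's standard library as a stand-in for a switch hash function; real switches use their own hardware hash, so the function name, the key encoding, and the `seed` parameter are assumptions for illustration only.

```python
import zlib


def ecmp_path(src_ip: str, dst_ip: str, proto: int,
              src_port: int, dst_port: int,
              n_paths: int, seed: int = 0) -> int:
    """Pick one of n_paths equal-cost next hops from the flow 5-tuple."""
    key = f"{src_ip},{dst_ip},{proto},{src_port},{dst_port}".encode()
    return zlib.crc32(key, seed) % n_paths


# 200 hypothetical TCP flows that differ only in source port.
flows = [("10.0.0.1", "10.0.1.1", 6, 10000 + i, 443) for i in range(200)]

# Stage 1: a switch with two uplinks. Keep the flows sent to uplink 0.
stage1 = [f for f in flows if ecmp_path(*f, n_paths=2) == 0]

# Stage 2: the next switch uses the same hash and seed over two uplinks.
stage2_choices = {ecmp_path(*f, n_paths=2) for f in stage1}
print(stage2_choices)  # only uplink 0 appears: the second uplink is idle
```

Every packet of a flow hashes to the same path, which avoids reordering. But because the second stage repeats the first stage's computation, every flow it receives makes the same choice again. Giving each stage a different seed or hash function is the usual way to break this correlation.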

CONGA (Congestion-Aware Load Balancing)

Distributed, congestion-aware load balancing:

1. Leaf switch selects path based on congestion feedback from remote leaf.
2. Remote leaf piggybacks the congestion metric (max link utilization on path) on packets traveling back toward the source leaf, carried in the overlay header.
3. Source leaf maintains congestion table per destination leaf per path.
4. Selects least-congested path for each flowlet (not per-packet to avoid reordering).
  • Operates at flowlet granularity (bursts within a flow separated by gaps > ~500 µs).
  • Achieves near-optimal load balancing without centralized controller.
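
The flowlet decision at the source leaf can be sketched as below. This is a simplified model with assumed names: it keeps congestion state for a single destination leaf, takes congestion values as plain numbers supplied by the caller, and leaves out how feedback is carried and aged.

```python
class FlowletBalancer:
    """Per-flowlet path selection toward one destination leaf."""

    def __init__(self, n_paths: int, gap_us: int = 500):
        self.gap_us = gap_us                 # idle gap that ends a flowlet
        self.congestion = [0.0] * n_paths    # latest metric per path
        self.flowlets = {}                   # flow id -> (path, last seen in us)

    def update_congestion(self, path: int, metric: float) -> None:
        """Record congestion feedback received for a path."""
        self.congestion[path] = metric

    def select_path(self, flow, now_us: int) -> int:
        entry = self.flowlets.get(flow)
        if entry is not None and now_us - entry[1] <= self.gap_us:
            path = entry[0]  # same flowlet: stay put to avoid reordering
        else:
            # New flowlet: pick the currently least-congested path.
            path = min(range(len(self.congestion)),
                       key=self.congestion.__getitem__)
        self.flowlets[flow] = (path, now_us)
        return path


lb = FlowletBalancer(n_paths=2)
lb.update_congestion(0, 0.9)
lb.update_congestion(1, 0.2)
print(lb.select_path("flow-a", now_us=0))    # 1
lb.update_congestion(1, 0.95)
print(lb.select_path("flow-a", now_us=100))  # 1 (gap too short to move)
print(lb.select_path("flow-a", now_us=700))  # 0 (new flowlet, path 0 is better)
```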

Other TE Approaches

Approach | Mechanism
Hedera | Centralized; detects elephant flows, reroutes via OpenFlow
DRILL | Per-packet load balancing using local queue depths
LetFlow | Flowlet-based random load balancing (simple, effective)
HULA | Probe-based distributed TE using best-path tracking at each switch

Network Telemetry

In-Band Network Telemetry (INT)

Switches embed telemetry metadata directly into data packets as they traverse the network.

Original Packet | INT Header | INT Metadata (per-hop) |

Per-hop metadata:
  - Switch ID
  - Ingress/egress port
  - Queue occupancy
  - Ingress/egress timestamp
  - Queue congestion status
  • Provides per-packet, per-hop visibility.
  • Collected at sink node or telemetry collector.
  • P4-programmable switches can implement INT directly in the data plane.
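
Once the per-hop metadata reaches a collector, simple questions such as "which hop added the most delay?" become direct lookups. The sketch below models the metadata stack as plain Python objects; the field names and nanosecond units are hypothetical and do not reflect the INT wire format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HopMetadata:
    switch_id: int
    ingress_ts_ns: int   # timestamp when the packet entered the switch
    egress_ts_ns: int    # timestamp when it left
    queue_depth: int     # queue occupancy observed at this hop


def hop_latencies(stack: List[HopMetadata]) -> Dict[int, int]:
    """Time spent inside each switch, keyed by switch ID."""
    return {h.switch_id: h.egress_ts_ns - h.ingress_ts_ns for h in stack}


def most_congested(stack: List[HopMetadata]) -> int:
    """Switch ID of the hop reporting the deepest queue."""
    return max(stack, key=lambda h: h.queue_depth).switch_id


# Made-up metadata for a three-hop path.
stack = [
    HopMetadata(1, 1000, 1400, 3),
    HopMetadata(2, 2000, 5000, 42),
    HopMetadata(3, 6000, 6200, 1),
]
print(hop_latencies(stack))   # {1: 400, 2: 3000, 3: 200}
print(most_congested(stack))  # 2
```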

gNMI (gRPC Network Management Interface)

  • Model-driven telemetry using YANG models.
  • Streaming telemetry: switch pushes data at configured intervals (replaces SNMP polling).
  • Supports dial-in (the collector connects to the device and subscribes) and, on some platforms, dial-out (the device initiates the connection to the collector).
  • Significantly lower latency than SNMP for detecting failures.

Telemetry Pipeline

Switches → Streaming (gNMI/INT) → Collector → Time-series DB → Analysis/Dashboards
                                   (e.g., Telegraf)  (InfluxDB,     (Grafana,
                                                      Prometheus)    custom)

SmartNICs

SmartNICs offload network processing from the host CPU to the NIC hardware.

SmartNIC Architectures

Type | Implementation | Example
FPGA-based | Programmable logic | Microsoft Azure SmartNIC (Catapult)
SoC-based | ARM cores + accelerators | NVIDIA BlueField, AMD Pensando
ASIC-based | Fixed-function offload | Mellanox ConnectX (partial)

Offloaded Functions

  • OVS offload: Virtual switch forwarding in hardware (for example via tc-flower rules, or rte_flow when running OVS-DPDK).
  • Encryption: IPsec, TLS offload.
  • Storage: NVMe-oF target/initiator.
  • Telemetry: Packet sampling, flow tracking.
  • Security: Firewall rules, microsegmentation.

Infrastructure Processing Unit (IPU/DPU)

Trend toward treating SmartNICs as infrastructure processors that run a complete infrastructure OS, separating tenant workloads from infrastructure management:

Server:
  Host CPU → Tenant VMs/containers (application workloads)
  DPU/IPU → Infrastructure services (networking, storage, security)

DPDK (Data Plane Development Kit)

DPDK provides a set of libraries for fast packet processing in userspace, bypassing the kernel network stack.

Key Techniques

Technique | Description
Kernel bypass | UIO/VFIO drivers map NIC directly to userspace
Poll-mode drivers | Busy-polling instead of interrupt-driven I/O
Hugepages | Reduce TLB misses for packet buffer memory
Core pinning | Dedicate CPU cores to packet processing
Lockless ring buffers | Efficient inter-core communication
Batch processing | Amortize per-packet overhead

Performance

  • A single core can process roughly 10-40 Mpps (million packets per second), depending on packet size and the amount of work done per packet.
  • Enables line-rate processing at 100 Gbps on commodity servers when the load is spread across multiple cores.
  • Used in virtual switches (OVS-DPDK), NFV, load balancers, firewalls.
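
To relate per-core packet rates to link speed, it helps to compute the packet rate a link can deliver. The sketch below does that for Ethernet, counting the 20 bytes of preamble, start-of-frame delimiter, and inter-frame gap that accompany every frame on the wire. The function names are illustrative, and the per-core rates are simply the two ends of the range quoted above.

```python
import math


def line_rate_pps(link_gbps: float, frame_bytes: int) -> float:
    """Maximum frames per second on an Ethernet link."""
    wire_bytes = frame_bytes + 20  # 7B preamble + 1B SFD + 12B inter-frame gap
    return link_gbps * 1e9 / (wire_bytes * 8)


def cores_needed(link_gbps: float, frame_bytes: int,
                 per_core_mpps: float) -> int:
    """Cores required to keep up with line rate at a given per-core rate."""
    return math.ceil(line_rate_pps(link_gbps, frame_bytes)
                     / (per_core_mpps * 1e6))


print(round(line_rate_pps(100, 64) / 1e6, 2))    # 148.81 Mpps for 64-byte frames
print(round(line_rate_pps(100, 1518) / 1e6, 2))  # 8.13 Mpps for 1518-byte frames
print(cores_needed(100, 64, 10), cores_needed(100, 64, 40))  # 15 4
```

With minimum-size frames, 100 Gbps is far beyond a single core, whereas with full-size frames one core has headroom. This is why packet size dominates any throughput claim.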

Alternatives to DPDK

Framework | Approach
XDP (eXpress Data Path) | eBPF programs at NIC driver level in Linux
AF_XDP | Socket-based interface to XDP for userspace
io_uring | Async I/O framework with networking support
VPP (fd.io) | Vector packet processing framework

XDP is increasingly preferred over DPDK for many use cases because it integrates with the Linux kernel ecosystem while still achieving near-DPDK performance for common operations.