
Streaming Algorithms

Overview

Streaming algorithms process massive data in a single pass (or a few passes) using memory sublinear in the input size. The data stream model captures scenarios where the data is too large to store: network traffic, sensor readings, database logs, web clicks.

Formal Model

  • Input: stream of elements a_1, a_2, ..., a_m from universe [n] = {1, ..., n}
  • Space: O(polylog(n, m)) or O(n^epsilon) — much less than m
  • Passes: typically 1 (sometimes 2-3)
  • Per-element processing time: O(polylog(n))
  • Turnstile model: elements can be inserted or deleted (stream of updates to frequency vector)

[Figure: streaming data structures comparison]

Count-Min Sketch

Estimates point queries (frequency of any element) in the turnstile model.

Structure

  • d independent hash functions h_1, ..., h_d mapping [n] -> [w]
  • d x w array of counters, initialized to 0

Update(item, count)

for j = 1 to d:
    CMS[j][h_j(item)] += count

Query(item)

return min over j of CMS[j][h_j(item)]

Guarantees

  • With d = ceil(ln(1/delta)), w = ceil(e/epsilon):
  • f_hat(i) >= f(i) (never underestimates in insertion-only model)
  • f_hat(i) <= f(i) + epsilon * ||f||_1 with probability >= 1 - delta
  • Space: O((1/epsilon) * log(1/delta)) counters
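
The update/query logic above can be sketched in a few lines of Python. This is illustrative only: Python's built-in hash, salted per row, stands in for a true pairwise-independent hash family.

```python
import math

class CountMinSketch:
    def __init__(self, epsilon, delta):
        self.w = math.ceil(math.e / epsilon)      # width: controls additive error
        self.d = math.ceil(math.log(1 / delta))   # depth: controls failure probability
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, item, row):
        # Stand-in for pairwise-independent hashing (assumption).
        return hash((row, item)) % self.w

    def update(self, item, count=1):
        for j in range(self.d):
            self.table[j][self._bucket(item, j)] += count

    def query(self, item):
        # Minimum over rows: never underestimates for insertion-only streams.
        return min(self.table[j][self._bucket(item, j)] for j in range(self.d))
```

Because `update` accepts a signed `count`, the same structure handles turnstile (deletion) streams, though the min-query guarantee above only holds for insertion-only streams.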

Applications

  • Network traffic monitoring (heavy hitter detection)
  • Database query optimization (join size estimation)
  • NLP (approximate word frequency counts)

Count Sketch

Provides better error bounds than Count-Min Sketch on skewed distributions.

Structure

  • d hash functions h_j: [n] -> [w] and d sign functions s_j: [n] -> {-1, +1}

Update(item, count)

for j = 1 to d:
    CS[j][h_j(item)] += s_j(item) * count

Query(item)

return median over j of (s_j(item) * CS[j][h_j(item)])

Guarantees

  • |f_hat(i) - f(i)| <= epsilon * ||f||_2 with probability >= 1 - delta
  • L2 guarantee is stronger than L1 for skewed distributions
  • Space: O((1/epsilon^2) * log(1/delta))
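
A minimal Python sketch of the sign-hash/median scheme above. As with the Count-Min example, Python's salted built-in hash stands in for the 4-wise independent hash and sign families.

```python
import math
import statistics

class CountSketch:
    def __init__(self, epsilon, delta):
        self.w = math.ceil(1 / epsilon ** 2)      # width: controls eps * ||f||_2 error
        self.d = math.ceil(math.log(1 / delta))   # rows: median drives failure probability
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, item, row):
        return hash(("bucket", row, item)) % self.w

    def _sign(self, item, row):
        return 1 if hash(("sign", row, item)) % 2 == 0 else -1

    def update(self, item, count=1):
        for j in range(self.d):
            self.table[j][self._bucket(item, j)] += self._sign(item, j) * count

    def query(self, item):
        # Median over rows cancels collision noise in expectation.
        return statistics.median(
            self._sign(item, j) * self.table[j][self._bucket(item, j)]
            for j in range(self.d)
        )
```

Note the random signs mean estimates can err in either direction, unlike Count-Min's one-sided overestimates.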

HyperLogLog

Estimates the number of distinct elements (cardinality estimation).

Idea

Hash each element to a uniform binary string. If the maximum number of leading zeros observed is R, roughly 2^R distinct elements have been hashed, so R estimates log2 of the distinct count.

Algorithm

  1. Use m = 2^b registers (b bits for bucket selection)
  2. For each element x, compute h(x). Use first b bits to select register j.
  3. Store the maximum rank of the remaining bits (leading zeros + 1, i.e. the position of the leftmost 1-bit) in register[j]
  4. Estimate: alpha_m * m^2 / sum(2^(-register[j]))
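
A toy implementation of steps 1-4, using SHA-1 as the hash and including the small-range (linear counting) correction. The constant alpha_m below is the standard bias correction for m >= 128.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b                    # m = 2^b registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash; first b bits pick the register, the rest give the rank.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        j = h >> (64 - self.b)
        rest = h & ((1 << (64 - self.b)) - 1)
        rank = (64 - self.b) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[j] = max(self.registers[j], rank)

    def estimate(self):
        e = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return e
```

Merging two such sketches (for distributed counting) is just an element-wise max over the register arrays.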

Properties

  • Space: O(m * log(log n)) bits, typically 1.5 KB for ~2% error
  • Relative error: 1.04 / sqrt(m)
  • Mergeable: union of two HyperLogLog sketches by taking element-wise max
  • Used in Redis (PFADD/PFCOUNT), BigQuery, Presto

Corrections

  • Small range correction: linear counting for small cardinalities
  • Large range correction: for cardinalities near 2^32

Heavy Hitters

Find all elements with frequency >= epsilon * m (or epsilon * ||f||_1).

Misra-Gries Algorithm (Frequent)

Deterministic, using O(1/epsilon) counters.

Initialize: dictionary D (at most 1/epsilon - 1 entries)
For each element e:
    if e in D: D[e] += 1
    else if |D| < 1/epsilon - 1: D[e] = 1
    else:
        decrement all counters in D by 1
        remove entries with count 0

  • Every element with true frequency >= epsilon * m is in D
  • Reported frequency is an underestimate; error <= epsilon * m
  • Space: O(1/epsilon) counters
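
The pseudocode above translates directly into Python (a one-pass sketch over any iterable stream):

```python
def misra_gries(stream, epsilon):
    """One-pass Misra-Gries with k = floor(1/epsilon) - 1 counters."""
    k = max(int(1 / epsilon) - 1, 1)
    counters = {}
    for e in stream:
        if e in counters:
            counters[e] += 1
        elif len(counters) < k:
            counters[e] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # Each value underestimates the true frequency by at most epsilon * m.
    return counters
```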

Space-Saving Algorithm

Instead of Misra-Gries's decrement-all step, it evicts a single entry: the new element takes over the minimum counter.

Initialize: dictionary D (at most 1/epsilon entries)
For each element e:
    if e in D: D[e] += 1
    else if |D| < 1/epsilon: D[e] = 1
    else:
        find entry (e_min, c_min) with smallest count
        replace with (e, c_min + 1)

  • Same worst-case guarantees as Misra-Gries
  • Better practical accuracy; provides both under- and over-estimates
  • Can return top-k elements ordered by estimated frequency
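
A direct Python rendering of the replacement rule (a sketch, not a production implementation; real implementations use a min-heap or "stream summary" structure to find the minimum in O(1)):

```python
def space_saving(stream, epsilon):
    """One-pass Space-Saving with k = floor(1/epsilon) monitored entries."""
    k = max(int(1 / epsilon), 1)
    counts = {}
    for e in stream:
        if e in counts:
            counts[e] += 1
        elif len(counts) < k:
            counts[e] = 1
        else:
            # Evict the minimum-count entry; the newcomer inherits its counter.
            e_min = min(counts, key=counts.get)
            counts[e] = counts.pop(e_min) + 1
    # Each value overestimates the true frequency by at most epsilon * m.
    return counts
```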

Quantile Estimation

Estimate the phi-quantile: the element of rank ceil(phi * m) in the sorted stream of m elements.

GK Algorithm (Greenwald-Khanna)

  • Deterministic, epsilon-approximate quantiles
  • Space: O((1/epsilon) * log(epsilon * n))
  • Maintains a sorted summary of (value, gap, delta) tuples

t-Digest (Dunning)

  • Practical streaming quantile sketch
  • Clusters data points with size limits based on quantile
  • Clusters near the tails are kept small for accuracy at extreme percentiles
  • Mergeable: important for distributed systems
  • Used in Elasticsearch, Apache Spark

KLL Sketch (Karnin-Lang-Liberty)

  • Near-optimal space: O((1/epsilon) * log(log(1/delta)))
  • Randomized, mergeable
  • Used in Apache DataSketches

Sliding Window Model

Compute statistics over only the most recent W elements of the stream.

Exponential Histograms (Datar-Gionis-Indyk-Motwani)

  • Maintain counts in exponentially growing buckets
  • Estimate sum of last W bits with (1+epsilon) factor
  • Space: O((1/epsilon) * log^2 W)
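
A toy DGIM implementation for counting 1-bits in the last W positions. It keeps at most two buckets of each power-of-two size (merging the two oldest whenever a third appears), which corresponds to the epsilon = 1/2 setting; the count can be off by up to half the size of the oldest bucket.

```python
class DGIM:
    def __init__(self, W):
        self.W = W
        self.t = 0
        # (timestamp of the bucket's most recent 1, bucket size), newest first.
        self.buckets = []

    def update(self, bit):
        self.t += 1
        # Drop buckets whose most recent 1 fell out of the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.W:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets):
            s = self.buckets[i][1]
            if self.buckets[i + 1][1] == s and self.buckets[i + 2][1] == s:
                t_new = self.buckets[i + 1][0]
                self.buckets[i + 1:i + 3] = [(t_new, 2 * s)]
            else:
                i += 1

    def count(self):
        # All bucket sizes, minus half of the (possibly straddling) oldest bucket.
        if not self.buckets:
            return 0
        return sum(s for _, s in self.buckets) - self.buckets[-1][1] // 2
```

Note that expired buckets are pruned lazily on `update`, so `count` should be called right after an update.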

Smooth Histograms

  • Generalize exponential histograms to smooth functions
  • Applicable to distinct count, frequency moments, quantiles

AMS Sketch (Alon-Matias-Szegedy)

Estimate frequency moments F_k = sum(f_i^k) where f_i is the frequency of element i.

F_2 Estimation (Second Frequency Moment)

Choose random hash s: [n] -> {-1, +1} (4-wise independent)
Maintain counter Z = sum over stream of s(a_t)
Output Z^2 as estimate of F_2
  • E[Z^2] = F_2
  • Var[Z^2] <= 2 * F_2^2 (with 4-wise independence)
  • Average Z^2 over O(1/epsilon^2) independent copies, then take the median of O(log(1/delta)) such averages (median-of-means)
  • Space: O((1/epsilon^2) * log(1/delta) * log n)
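
The median-of-means construction can be sketched as follows; the hash of integer tuples stands in for a 4-wise independent sign function (an assumption, for illustration only).

```python
import statistics

def _sign(seed, x):
    # Stand-in for a 4-wise independent sign function s: [n] -> {-1, +1}.
    return 1 if hash((seed, x)) % 2 == 0 else -1

def ams_f2(stream, copies=8, groups=5):
    """Estimate F_2 = sum of squared frequencies via median-of-means."""
    # One counter Z per (group, copy); Z accumulates random signs.
    z = [[0] * copies for _ in range(groups)]
    for x in stream:
        for g in range(groups):
            for c in range(copies):
                z[g][c] += _sign(g * copies + c, x)
    # Average Z^2 within each group, then take the median across groups.
    means = [sum(v * v for v in row) / copies for row in z]
    return statistics.median(means)
```

With a single distinct element of frequency f, every counter is +/- f, so the estimate is exactly f^2 = F_2; in general the averaging tames the variance and the median handles the tail.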

General F_k

  • F_0 (distinct elements): O(1/epsilon^2 + log n) space is optimal — FM sketch, HyperLogLog
  • F_1 (total count): trivially exact in O(1) space
  • F_2: AMS sketch, O((1/epsilon^2) * log n) space
  • F_k for k > 2: requires Omega(n^(1-2/k)) space, achievable up to polylog factors (tight for k > 2)

Johnson-Lindenstrauss Transform

Lemma: For any set of n points in R^d, there exists a map to R^k with k = O(log(n) / epsilon^2) such that all pairwise distances are preserved within (1 +/- epsilon).

Random Projection

Multiply by a k x d random matrix A (entries from N(0,1/k)):

  • y = A * x
  • Pr[| ||y||^2 - ||x||^2 | > epsilon * ||x||^2] <= 2 * exp(-k * epsilon^2 / 4)
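
A quick stdlib-only demonstration of this norm-preservation bound: build one k x d Gaussian matrix with entries from N(0, 1/k) and check that a projected vector's squared norm stays close to the original. (All points must be projected with the same matrix for pairwise distances to be preserved.)

```python
import math
import random

def jl_matrix(k, d, seed=0):
    """k x d random projection matrix with entries ~ N(0, 1/k)."""
    rng = random.Random(seed)
    # Variance 1/k ensures E[||Ax||^2] = ||x||^2.
    return [[rng.gauss(0, 1 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]

def project(A, x):
    """Compute y = A * x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]
```

The fluctuation of the squared-norm ratio is on the order of sqrt(2/k), consistent with the tail bound above.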

Sparse JL Transforms

  • Use sparse random matrices for faster projection: O(nnz) time instead of O(kd)
  • Achlioptas: entries sqrt(3) * {-1, 0, +1} with probabilities {1/6, 2/3, 1/6} (scaled by 1/sqrt(k))
  • CountSketch-based projections: O(d) time

Applications in Streaming

  • Dimensionality reduction for distance-preserving sketches
  • Approximate nearest neighbor search
  • Compressed sensing and sparse recovery

Lower Bounds

Communication complexity provides streaming lower bounds:

| Problem         | Space Lower Bound           | Tight? |
|-----------------|-----------------------------|--------|
| Distinct count  | Omega(1/epsilon^2 + log n)  | Yes    |
| F_2             | Omega(1/epsilon^2 * log n)  | ~Yes   |
| F_k (k > 2)     | Omega(n^(1-2/k))            | Yes    |
| Median (exact)  | Omega(n)                    | Yes    |
| Heavy hitters   | Omega(1/epsilon)            | Yes    |

Practical Considerations

  • Mergeability: sketches that can be combined enable distributed computation
  • Deletions: turnstile model sketches (Count-Min, AMS) handle deletions
  • Sliding windows: require additional mechanisms beyond standard sketches
  • Implementations: Apache DataSketches, Redis modules, stream-lib (Java)

Summary

Streaming algorithms enable real-time analytics on massive data using tiny memory. The key insight is that approximate answers with provable guarantees are sufficient for most applications. Sketches form the foundation: they are small, mergeable, and composable summaries that support a variety of queries.