Sublinear Algorithms
Overview
Sublinear algorithms solve problems using resources (time, space, or communication) significantly less than the input size. They cannot read the entire input, so they rely on sampling, random access queries, and structural properties to provide approximate answers or distinguish between cases.
Property Testing
Given query access to an object, determine whether it satisfies a property P or is epsilon-far from satisfying P (i.e., more than epsilon fraction of the representation must change to satisfy P).
Framework
- Input: object x accessible via queries
- Property: P (set of valid objects)
- Distance: fraction of representation that must change
- epsilon-tester: accepts if x in P, rejects with probability >= 2/3 if x is epsilon-far from P, may do either if x is close but not in P
Query complexity: number of queries as a function of epsilon (and sometimes n).
Graph Property Testing
Dense Graph Model
- Graph on n vertices, query: "is (i,j) an edge?" (adjacency matrix)
- Distance: fraction of n^2 entries that must change
Testing Bipartiteness
- Algorithm: sample O(1/epsilon) random vertices, check induced subgraph for odd cycles via BFS
- Query complexity: O(1/epsilon^3) — independent of n
- Reject if any odd cycle found in sampled subgraph
Testing Triangle-Freeness
- Sample O(1/epsilon) triples, query all three edges
- For dense graphs: O(1/epsilon^6) queries suffice
- Fundamental result: every testable property in dense graphs has a tester with query complexity depending only on epsilon
Regularity Lemma Connection
- Szemeredi's regularity lemma underlies many dense graph testers
- Every dense graph can be approximated by a bounded partition into pseudorandom bipartite graphs
- Property testing for any first-order property: constant queries (depends on epsilon)
Sparse/Bounded-Degree Graph Model
- Graph with max degree d, queries: i-th neighbor of vertex v, degree of v
- Distance: fraction of dn edges that must change
Testing Bipartiteness (Sparse)
- Random walks from random vertices detect odd cycles
- Query complexity: O(sqrt(n) * poly(1/epsilon))
- Depends on n — unavoidable in sparse model
Testing Connectivity
- O(1/epsilon) random BFS explorations
- Check if large connected component exists
- Constant query complexity for bounded-degree graphs
Testing Expansion
- Test if graph is an expander or far from it
- Random walks: mixing time reveals expansion
- O(sqrt(n) * poly(log n, 1/epsilon)) queries
Testing Planarity
- Can test if a bounded-degree graph is planar or epsilon-far from planar
- Query complexity: O(sqrt(n) / epsilon^O(1))
- Uses partition oracles and local exploration
Algebraic Property Testing
Testing Linearity (Blum-Luby-Rubinfeld)
Test if f: F_2^n -> F_2 is linear (f(x+y) = f(x) + f(y) for all x, y).
Repeat O(1/epsilon) times:
Pick random x, y from F_2^n
Query f(x), f(y), f(x + y)
Check if f(x) + f(y) = f(x + y)
Accept if all checks pass
- 3 queries per test, O(1/epsilon) total
- Self-correction: if f is close to linear, can compute the nearest linear function at any point with O(1) additional queries
- Foundation of PCP theorem and algebraic property testing
Testing Low-Degree Polynomials
- Test if f: F^n -> F is a polynomial of degree <= d
- Query complexity: O(1/epsilon) for fixed d over large fields
- Uses self-correction along random lines/planes
Testing Reed-Solomon Codes
- Given word, test proximity to a Reed-Solomon codeword
- Poly(1/epsilon) queries
- Central to PCP constructions and error-correcting codes
Sublinear-Time Algorithms
Estimating Average Degree
- Sample O(sqrt(n) / epsilon) random vertices, query their degrees
- Estimate average degree within (1 +/- epsilon) factor
- Cannot do better than Omega(sqrt(n)) queries in general
Estimating the Number of Connected Components
- Sample random vertices, explore local neighborhoods
- Estimate the number of components within additive epsilon * n error
- Query complexity: O(d / epsilon^2) for bounded-degree d graphs
Minimum Spanning Tree Weight Estimation
- Estimate MST weight within (1+epsilon) factor without computing MST
- O(d * n / epsilon^O(1)) time for graphs with max degree d
- Based on counting connected components at different edge weight thresholds
Sublinear-Time Approximation of Maximum Matching
- Estimate matching size within (1+epsilon) factor
- O(n * poly(1/epsilon)) time for bounded-degree graphs
- Uses local augmenting path exploration
Communication Complexity
Studies the minimum communication needed between parties to compute a function.
Two-Party Model
- Alice holds x in {0,1}^n, Bob holds y in {0,1}^n
- They exchange messages to compute f(x, y)
- Communication complexity: minimum number of bits exchanged (worst case)
Key Problems and Bounds
| Problem | Deterministic | Randomized | |----------------------|---------------|----------------| | Equality | n + 1 | O(log n) | | Disjointness | n | Omega(n) | | Inner Product | n | Omega(n) | | Greater Than | O(log n) | O(log log n) | | Set Intersection Size | Theta(n) | Omega(n) |
Techniques for Lower Bounds
- Fooling sets: find large fooling set implies high complexity
- Rank method: D(f) >= log2(rank(M_f)) where M_f is communication matrix
- Discrepancy method: for randomized lower bounds
- Information complexity: mutual information between inputs and transcript
Applications to Streaming/Algorithms
- Disjointness lower bound implies Omega(n) space for exact frequency moments
- Index problem lower bound implies one-pass streaming lower bounds
- Multi-party communication yields multi-pass streaming bounds
Decision Tree Complexity
Minimum number of input bits queried to determine f(x) in the worst case.
Measures
- D(f): deterministic query complexity
- R(f): randomized query complexity (bounded error)
- Q(f): quantum query complexity
- C(f): certificate complexity
- bs(f): block sensitivity
- s(f): sensitivity
Key Relationships
- D(f) = O(R(f)^3) — not known to be tight in general
- R(f) = O(Q(f)^6) (later improved)
- Sensitivity conjecture (proved by Huang 2019): s(f)^4 >= bs(f)
- Implies s(f) = poly(D(f), R(f), Q(f), C(f), bs(f))
- All Boolean function complexity measures are polynomially related
Sensitivity Theorem (Huang 2019)
For any Boolean function f: {0,1}^n -> {0,1}:
- s(f) >= sqrt(bs(f))
- Proved via elegant spectral argument on the hypercube graph
- Resolved a 30-year open problem in complexity theory
Information Complexity
Measures the minimum amount of information that must be revealed about inputs to compute a function.
Definition
IC(f, epsilon): minimum mutual information I(X; Pi | Y) + I(Y; Pi | X) over all protocols Pi computing f with error epsilon, for worst-case distribution.
Properties
- IC(f, epsilon) <= CC(f, epsilon) (information <= communication)
- Direct sum: IC(f^n) = n * IC(f) for independent copies
- Used to prove tight communication bounds (e.g., disjointness)
Interactive Information Complexity
- Extends to multi-round protocols
- Information cost of each message depends on prior messages
- Compression theorems: communication can be compressed to information cost
Distribution Testing
Given sample access to a distribution, determine its properties.
Identity Testing
- Is distribution D equal to known distribution D*, or epsilon-far in total variation?
- Optimal: O(sqrt(n) / epsilon^2) samples where n is support size
Closeness Testing
- Are two unknown distributions D1, D2 equal or epsilon-far?
- Optimal: O(n^(2/3) / epsilon^(4/3)) samples
Uniformity Testing
- Special case of identity testing with D* = uniform
- O(sqrt(n) / epsilon^2) samples
Monotonicity Testing of Distributions
- Is distribution over [n] monotone non-increasing?
- O(sqrt(n) / epsilon^2) samples
Local Computation Algorithms (LCA)
Compute individual parts of a global solution using sublinear time and space, while maintaining global consistency.
Examples
- Maximal Independent Set LCA: query whether vertex v is in the MIS in O(poly(d * log n)) time
- Graph Coloring LCA: query color of vertex v in sublinear time
- Consistency: repeated queries always give the same answer; the union of all answers forms a valid global solution
Summary
Sublinear algorithms demonstrate that many problems can be solved approximately or decided with resources far less than the input size. Property testing provides the theoretical foundation for approximate decision-making. Communication complexity serves as both an independent area and a tool for proving lower bounds in streaming and other models. These techniques are essential for handling modern massive datasets.