Sublinear Algorithms

Overview

Sublinear algorithms solve problems using resources (time, space, or communication) significantly less than the input size. They cannot read the entire input, so they rely on sampling, random access queries, and structural properties to provide approximate answers or distinguish between cases.

Property Testing

Given query access to an object, determine whether it satisfies a property P or is epsilon-far from satisfying P (i.e., more than epsilon fraction of the representation must change to satisfy P).

Framework

Input: object x accessible via queries
Property: P (set of valid objects)
Distance: fraction of representation that must change
epsilon-tester: accepts if x in P, rejects with probability >= 2/3 if x is epsilon-far from P, may do either if x is close but not in P

Query complexity: number of queries as a function of epsilon (and sometimes n).

Graph Property Testing

Dense Graph Model

Graph on n vertices, query: "is (i,j) an edge?" (adjacency matrix)
Distance: fraction of n^2 entries that must change

Testing Bipartiteness

Algorithm: sample O(1/epsilon) random vertices, check induced subgraph for odd cycles via BFS
Query complexity: O(1/epsilon^3) — independent of n
Reject if any odd cycle found in sampled subgraph

Testing Triangle-Freeness

Sample O(1/epsilon) triples, query all three edges
For dense graphs: O(1/epsilon^6) queries suffice
Fundamental result: every testable property in dense graphs has a tester with query complexity depending only on epsilon

Regularity Lemma Connection

Szemeredi's regularity lemma underlies many dense graph testers
Every dense graph can be approximated by a bounded partition into pseudorandom bipartite graphs
Property testing for any first-order property: constant queries (depends on epsilon)

Sparse/Bounded-Degree Graph Model

Graph with max degree d, queries: i-th neighbor of vertex v, degree of v
Distance: fraction of dn edges that must change

Testing Bipartiteness (Sparse)

Random walks from random vertices detect odd cycles
Query complexity: O(sqrt(n) * poly(1/epsilon))
Depends on n — unavoidable in sparse model

Testing Connectivity

O(1/epsilon) random BFS explorations
Check if large connected component exists
Constant query complexity for bounded-degree graphs

Testing Expansion

Test if graph is an expander or far from it
Random walks: mixing time reveals expansion
O(sqrt(n) * poly(log n, 1/epsilon)) queries

Testing Planarity

Can test if a bounded-degree graph is planar or epsilon-far from planar
Query complexity: O(sqrt(n) / epsilon^O(1))
Uses partition oracles and local exploration

Algebraic Property Testing

Testing Linearity (Blum-Luby-Rubinfeld)

Test if f: F_2^n -> F_2 is linear (f(x+y) = f(x) + f(y) for all x, y).

Repeat O(1/epsilon) times:
    Pick random x, y from F_2^n
    Query f(x), f(y), f(x + y)
    Check if f(x) + f(y) = f(x + y)
Accept if all checks pass

3 queries per test, O(1/epsilon) total
Self-correction: if f is close to linear, can compute the nearest linear function at any point with O(1) additional queries
Foundation of PCP theorem and algebraic property testing

Testing Low-Degree Polynomials

Test if f: F^n -> F is a polynomial of degree <= d
Query complexity: O(1/epsilon) for fixed d over large fields
Uses self-correction along random lines/planes

Testing Reed-Solomon Codes

Given word, test proximity to a Reed-Solomon codeword
Poly(1/epsilon) queries
Central to PCP constructions and error-correcting codes

Sublinear-Time Algorithms

Estimating Average Degree

Sample O(sqrt(n) / epsilon) random vertices, query their degrees
Estimate average degree within (1 +/- epsilon) factor
Cannot do better than Omega(sqrt(n)) queries in general

Estimating the Number of Connected Components

Sample random vertices, explore local neighborhoods
Estimate the number of components within additive epsilon * n error
Query complexity: O(d / epsilon^2) for bounded-degree d graphs

Minimum Spanning Tree Weight Estimation

Estimate MST weight within (1+epsilon) factor without computing MST
O(d * n / epsilon^O(1)) time for graphs with max degree d
Based on counting connected components at different edge weight thresholds

Sublinear-Time Approximation of Maximum Matching

Estimate matching size within (1+epsilon) factor
O(n * poly(1/epsilon)) time for bounded-degree graphs
Uses local augmenting path exploration

Communication Complexity

Studies the minimum communication needed between parties to compute a function.

Two-Party Model

Alice holds x in {0,1}^n, Bob holds y in {0,1}^n
They exchange messages to compute f(x, y)
Communication complexity: minimum number of bits exchanged (worst case)

Key Problems and Bounds

Problem	Deterministic	Randomized
Equality	n + 1	O(log n)
Disjointness	n	Omega(n)
Inner Product	n	Omega(n)
Greater Than	O(log n)	O(log log n)
Set Intersection Size	Theta(n)	Omega(n)

Techniques for Lower Bounds

Fooling sets: find large fooling set implies high complexity
Rank method: D(f) >= log2(rank(M_f)) where M_f is communication matrix
Discrepancy method: for randomized lower bounds
Information complexity: mutual information between inputs and transcript

Applications to Streaming/Algorithms

Disjointness lower bound implies Omega(n) space for exact frequency moments
Index problem lower bound implies one-pass streaming lower bounds
Multi-party communication yields multi-pass streaming bounds

Decision Tree Complexity

Minimum number of input bits queried to determine f(x) in the worst case.

Measures

D(f): deterministic query complexity
R(f): randomized query complexity (bounded error)
Q(f): quantum query complexity
C(f): certificate complexity
bs(f): block sensitivity
s(f): sensitivity

Key Relationships

D(f) = O(R(f)^3) — not known to be tight in general
R(f) = O(Q(f)^6) (later improved)
Sensitivity conjecture (proved by Huang 2019): s(f)^4 >= bs(f)
- Implies s(f) = poly(D(f), R(f), Q(f), C(f), bs(f))
- All Boolean function complexity measures are polynomially related

Sensitivity Theorem (Huang 2019)

For any Boolean function f: {0,1}^n -> {0,1}:

s(f) >= sqrt(bs(f))
Proved via elegant spectral argument on the hypercube graph
Resolved a 30-year open problem in complexity theory

Information Complexity

Measures the minimum amount of information that must be revealed about inputs to compute a function.

Definition

IC(f, epsilon): minimum mutual information I(X; Pi | Y) + I(Y; Pi | X) over all protocols Pi computing f with error epsilon, for worst-case distribution.

Properties

IC(f, epsilon) <= CC(f, epsilon) (information <= communication)
Direct sum: IC(f^n) = n * IC(f) for independent copies
Used to prove tight communication bounds (e.g., disjointness)

Interactive Information Complexity

Extends to multi-round protocols
Information cost of each message depends on prior messages
Compression theorems: communication can be compressed to information cost

Distribution Testing

Given sample access to a distribution, determine its properties.

Identity Testing

Is distribution D equal to known distribution D*, or epsilon-far in total variation?
Optimal: O(sqrt(n) / epsilon^2) samples where n is support size

Closeness Testing

Are two unknown distributions D1, D2 equal or epsilon-far?
Optimal: O(n^(2/3) / epsilon^(4/3)) samples

Uniformity Testing

Special case of identity testing with D* = uniform
O(sqrt(n) / epsilon^2) samples

Monotonicity Testing of Distributions

Is distribution over [n] monotone non-increasing?
O(sqrt(n) / epsilon^2) samples

Local Computation Algorithms (LCA)

Compute individual parts of a global solution using sublinear time and space, while maintaining global consistency.

Examples

Maximal Independent Set LCA: query whether vertex v is in the MIS in O(poly(d * log n)) time
Graph Coloring LCA: query color of vertex v in sublinear time
Consistency: repeated queries always give the same answer; the union of all answers forms a valid global solution

Summary

Sublinear algorithms demonstrate that many problems can be solved approximately or decided with resources far less than the input size. Property testing provides the theoretical foundation for approximate decision-making. Communication complexity serves as both an independent area and a tool for proving lower bounds in streaming and other models. These techniques are essential for handling modern massive datasets.