High-Dimensional Geometry
Curse of Dimensionality
As dimensionality grows, geometric intuition from low dimensions breaks down in fundamental ways.
Volume Concentration
The volume of the unit $d$-ball is

$$V_d = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}$$

This vanishes as $d \to \infty$: $V_d \to 0$ faster than exponentially. Relative to the enclosing cube $[-1, 1]^d$ (volume $2^d$), the ball occupies a vanishing fraction.
Shell concentration: most of the volume of a $d$-ball is concentrated near its surface. The fraction of volume within distance $\epsilon$ (relative to the radius) of the boundary:

$$1 - (1 - \epsilon)^d \to 1 \quad \text{as } d \to \infty$$

For $d = 100$, over 99% of the ball's volume lies in the outer 5% shell (since $0.95^{100} \approx 0.006$).
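The shell fraction $1 - (1 - \epsilon)^d$ is easy to sanity-check numerically; a minimal sketch:

```python
def shell_fraction(d: int, eps: float) -> float:
    """Fraction of a d-ball's volume within distance eps*R of the boundary."""
    # Ball volume scales as r^d, so the inner ball of radius (1 - eps)
    # holds (1 - eps)^d of the total volume.
    return 1.0 - (1.0 - eps) ** d

for d in (2, 10, 100, 1000):
    print(d, shell_fraction(d, 0.05))
```

Already at $d = 1000$ essentially all of the volume sits in the outer 5% shell.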
Distance Concentration
For $n$ random points uniformly distributed in $[0, 1]^d$, the ratio of maximum to minimum pairwise distance converges to 1:

$$\frac{\max_{i,j} \|x_i - x_j\|}{\min_{i \neq j} \|x_i - x_j\|} \to 1 \quad \text{as } d \to \infty$$

More precisely, for i.i.d. random points the mean pairwise distance grows as $\Theta(\sqrt{d})$ while its standard deviation stays $O(1)$, so the relative spread is $O(1/\sqrt{d})$. All points become approximately equidistant, making nearest-neighbor search meaningless with naive approaches.
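The effect shows up immediately in simulation. The sketch below (pure stdlib, seeded for reproducibility) computes the max/min pairwise-distance ratio for uniform points as the dimension grows:

```python
import math
import random

def max_min_ratio(n: int, d: int, seed: int = 0) -> float:
    """Max/min pairwise Euclidean distance among n uniform points in [0,1]^d."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n)]
    dists = [math.dist(pts[i], pts[j])
             for i in range(n) for j in range(i + 1, n)]
    return max(dists) / min(dists)

for d in (2, 10, 100, 1000):
    print(d, round(max_min_ratio(50, d), 2))
```

The ratio is large in 2D (some pair is always nearly coincident) and collapses toward 1 by $d = 1000$.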
Implications for Algorithms
- Nearest neighbor: exact methods (kd-trees, Voronoi) degrade to brute-force linear scan for moderate dimensions (roughly $d \gtrsim 20$)
- Clustering: distances become less discriminative; density-based methods fail
- Volume-based methods: kernel density estimation requires a sample size growing exponentially in $d$ for fixed accuracy
- Sampling: uniform sampling of high-dimensional spaces is exponentially wasteful (most volume is in corners/shells)
Locality-Sensitive Hashing (LSH)
LSH (Indyk and Motwani, 1998): a family $\mathcal{H}$ of hash functions is $(r, cr, p_1, p_2)$-sensitive if:
- Close points hash together: $\Pr[h(x) = h(y)] \geq p_1$ if $\|x - y\| \leq r$
- Far points hash apart: $\Pr[h(x) = h(y)] \leq p_2$ if $\|x - y\| \geq cr$
where $p_1 > p_2$ and $c > 1$ is the approximation factor. The ratio $\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$ determines the performance: query time is $O(n^\rho)$ with $O(n^{1+\rho})$ space.
MinHash (Jaccard Similarity)
For sets $A, B$, the Jaccard similarity $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.
MinHash: apply a random permutation $\pi$ to the universe; the hash is $h_\pi(A) = \min_{a \in A} \pi(a)$. Then $\Pr[h_\pi(A) = h_\pi(B)] = J(A, B)$.
Use $k$ independent hash functions and band them into $b$ bands of $r$ rows ($k = br$). Two sets become candidates if they agree on all hashes in at least one band. This creates an S-curve for the candidate probability:

$$P(\text{candidate}) = 1 - (1 - s^r)^b, \qquad s = J(A, B)$$
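A minimal sketch of the scheme, with the usual practical substitution of universal hash functions $h(x) = (ax + b) \bmod p$ for explicit permutations (function names are illustrative, not from a library):

```python
import random

P = 2_147_483_647  # Mersenne prime, larger than any element id below

def minhash_signature(s, k, seed=0):
    """k-value MinHash signature; universal hashes stand in for permutations."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
    return [min((a * x + b) % P for x in s) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of agreeing positions is an unbiased estimate of J(A, B)."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def candidate_prob(s, r, b):
    """Banding S-curve: P(candidate) = 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** r) ** b

A = set(range(0, 100))
B = set(range(20, 120))   # |A ∩ B| = 80, |A ∪ B| = 120, so J = 2/3
print(estimate_jaccard(minhash_signature(A, 200), minhash_signature(B, 200)))
print(candidate_prob(2 / 3, r=5, b=40))
```

With 200 hash functions the Jaccard estimate lands close to the true 2/3, and the S-curve makes similar pairs near-certain candidates while suppressing dissimilar ones.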
Applications: near-duplicate document detection, web deduplication, entity resolution.
SimHash (Cosine Similarity)
For vectors $x, y \in \mathbb{R}^d$, the cosine similarity is $\cos\theta = \frac{\langle x, y \rangle}{\|x\| \, \|y\|}$.
SimHash (Charikar, 2002): draw a random hyperplane through the origin (random vector $r \sim \mathcal{N}(0, I_d)$); the hash bit is $h_r(x) = \operatorname{sign}(\langle r, x \rangle)$. The collision probability is $\Pr[h_r(x) = h_r(y)] = 1 - \theta/\pi$.
Concatenate $k$ hash bits to form a $k$-bit hash. Hamming distance between hashes approximates angular distance.
The LSH exponent for cosine: $\rho = \frac{\ln(1 - \theta/\pi)}{\ln(1 - c\theta/\pi)}$ (for $(\theta, c\theta)$-near neighbor under angular distance). This is optimal for hyperplane-based LSH.
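Since each bit collides with probability $1 - \theta/\pi$, the Hamming distance between two SimHash signatures recovers the angle. A stdlib sketch (illustrative function names, shared seed so both vectors see the same hyperplanes):

```python
import math
import random

def simhash_bits(x, k, seed=0):
    """k SimHash bits: sign of <r, x> for k random Gaussian hyperplanes."""
    rng = random.Random(seed)
    bits = []
    for _ in range(k):
        r = [rng.gauss(0, 1) for _ in x]
        bits.append(1 if sum(ri * xi for ri, xi in zip(r, x)) >= 0 else 0)
    return bits

def estimate_angle(x, y, k=2000):
    """Pr[bit differs] = theta/pi, so theta ≈ pi * (Hamming distance / k)."""
    bx, by = simhash_bits(x, k), simhash_bits(y, k)
    hamming = sum(a != b for a, b in zip(bx, by))
    return math.pi * hamming / k

x = [1.0, 0.0]
y = [1.0, 1.0]   # true angle = pi/4
print(estimate_angle(x, y))
```

The estimate converges to $\pi/4 \approx 0.785$ as $k$ grows, at rate $O(1/\sqrt{k})$.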
Cross-Polytope LSH
Andoni, Indyk, Laarhoven, Razenshteyn (2015): hash by applying a random rotation, then finding the nearest vertex of a cross-polytope (the set $\{\pm e_i\}_{i=1}^d$).
Performance: achieves $\rho = \frac{1}{c^2} + o(1)$ for Euclidean distance on the sphere, which approaches the optimal exponent from the data-independent LSH lower bound. Practically faster than random-projection LSH due to the structure of the hash family.
Data-Dependent LSH
Optimal LSH: for $c$-ANN in Euclidean space, $\rho = \frac{1}{c^2}$ is achievable and optimal assuming data-independent hashing. Data-dependent approaches can do better:
- LSH with data-dependent partitions (Andoni and Razenshteyn, 2015): learn the hash functions from the data distribution, achieving $\rho = \frac{1}{2c^2 - 1}$, which is optimal among data-dependent schemes
- In practice, learned hash functions (deep hashing) often outperform classical LSH
HNSW (Hierarchical Navigable Small World)
HNSW (Malkov and Yashunin, 2018): a graph-based ANN method building a hierarchical proximity graph.
Construction:
- Insert points one by one
- Each point is assigned a random level $\ell$ (geometric distribution: $P(\text{level} \geq \ell) = p^{\ell}$ for a fixed $p < 1$)
- At each level $\leq \ell$, connect the point to its $M$ nearest neighbors in the graph at that level
- Higher levels have fewer points and longer-range connections
Query:
- Start at the entry point (top level)
- Greedily traverse to the nearest neighbor at the current level
- Descend to the next level, using the found point as the starting point
- At level 0, perform a beam search (maintain a list of candidates)
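The level-0 beam search above can be sketched on a toy single-layer proximity graph; this is not the full HNSW algorithm (no hierarchy, no heuristic edge selection), just the greedy/beam core, with illustrative function names:

```python
import heapq
import math
import random

def build_knn_graph(points, m=10):
    """Toy single-layer proximity graph: each point links to its m nearest."""
    n = len(points)
    return {
        i: [j for _, j in sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:m]]
        for i in range(n)
    }

def beam_search(points, graph, query, entry, ef=16):
    """Greedy beam search as at HNSW level 0: expand the closest candidate,
    keep the best ef results, stop when no candidate can improve them."""
    visited = {entry}
    d0 = math.dist(points[entry], query)
    candidates = [(d0, entry)]    # min-heap by distance to query
    best = [(-d0, entry)]         # max-heap (negated) of current top-ef
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # closest remaining candidate is worse than worst kept
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(points[nb], query)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)
    return max(best)[1]  # index of the closest point found

rng = random.Random(0)
pts = [(rng.random(), rng.random()) for _ in range(200)]
graph = build_knn_graph(pts)
query = (0.5, 0.5)
found = beam_search(pts, graph, query, entry=0)
print(found, math.dist(pts[found], query))
```

HNSW runs this same search at every layer, using the result of each layer as the entry point of the next.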
Properties:
- Polylogarithmic query time empirically: expected $O(\log n)$ scaling observed in benchmarks
- High recall (>95%) with proper parameter tuning
- Memory: $O(n \cdot M)$ where $M$ is the max connections per node (typically 16-64)
- State-of-the-art practical ANN performance on many benchmarks
Comparison with LSH: HNSW typically achieves higher recall at lower query time but requires more memory. LSH has provable guarantees; HNSW's guarantees are empirical.
Approximate Nearest Neighbors (ANN)
The $c$-approximate nearest neighbor problem: find a point within distance $c \cdot r^*$ of the query, where $r^*$ is the distance to the true nearest neighbor.
Methods Landscape
| Method | Query Time | Space | Strengths |
|---|---|---|---|
| LSH | $O(n^\rho)$ | $O(n^{1+\rho})$ | Provable, sublinear |
| HNSW | $O(\log n)$ empirical | $O(nM)$ | Best practical recall-speed |
| Product quantization | $O(n)$ with cheap table lookups | compressed codes | Memory-efficient |
| IVF (inverted file) | $O(\text{nprobe} \cdot n / \text{nlist})$ | $O(n)$ | Billion-scale with disk |
| Annoy (random trees) | $O(\log n)$ per tree | $O(n \cdot \text{trees})$ | Simple, memory-mapped |
| ScaNN | sublinear | compressed codes | Anisotropic quantization |
Composite Systems
Real-world ANN systems combine techniques:
- IVF + PQ (FAISS): cluster points (IVF), encode residuals with product quantization. Handles billions of vectors
- IVF + HNSW: use HNSW to navigate the IVF cluster centroids
- Graph + quantization: HNSW with compressed vectors for reduced memory
FAISS (Facebook AI Similarity Search): the standard library. Supports GPU acceleration, billion-scale indices, various index types (flat, IVF, PQ, HNSW, composite).
Johnson-Lindenstrauss Lemma
Theorem (Johnson and Lindenstrauss, 1984): for any $0 < \epsilon < 1$ and any set $X$ of $n$ points in $\mathbb{R}^d$, there exists a map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that for all pairs $u, v \in X$:

$$(1 - \epsilon)\|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1 + \epsilon)\|u - v\|^2$$

Key point: $k$ depends only on $n$ and $\epsilon$, not on the original dimension $d$. A set of points in arbitrarily high dimensions can be embedded in $O(\epsilon^{-2} \log n)$ dimensions with bounded distortion.
Constructive proof: $f(x) = \frac{1}{\sqrt{k}} A x$ where $A$ is a $k \times d$ matrix with i.i.d. entries from $\mathcal{N}(0, 1)$ (or $\pm 1$ with equal probability, or sparse matrices).
Optimality: $k = \Omega\left(\frac{\log n}{\epsilon^2 \log(1/\epsilon)}\right)$ is necessary (Alon, 2003).
Random Projections
Random projection matrices provide practical dimensionality reduction:
Gaussian Random Projection
$A_{ij} \sim \mathcal{N}(0, 1/k)$ i.i.d. Satisfies JL with high probability.
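A quick stdlib check of the JL guarantee under this Gaussian construction (dimensions chosen small enough to run in pure Python):

```python
import math
import random

rng = random.Random(0)
d, k, n = 500, 200, 15

# k x d projection matrix with i.i.d. N(0, 1/k) entries
A = [[rng.gauss(0, 1 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]
points = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

def project(x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

proj = [project(x) for x in points]

# Ratio of projected to original distance for every pair: should be near 1
ratios = [
    math.dist(proj[i], proj[j]) / math.dist(points[i], points[j])
    for i in range(n) for j in range(i + 1, n)
]
print(min(ratios), max(ratios))
```

All $\binom{15}{2}$ pairwise distances survive the $500 \to 200$ reduction with small distortion, and the typical deviation shrinks like $1/\sqrt{k}$.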
Sparse Random Projection
Achlioptas (2003): $A_{ij} \in \{+1, -1\}$ with equal probability. Same JL guarantees, faster computation (no floating-point multiplications needed).
Very sparse (Li, Hastie, Church, 2006): $A_{ij} = \sqrt{s} \cdot \{+1, 0, -1\}$ with probabilities $\{\frac{1}{2s}, 1 - \frac{1}{s}, \frac{1}{2s}\}$, where $s = \sqrt{d}$ or $s = \frac{d}{\log d}$. Reduces projection time from $O(dk)$ to $O(dk/s)$ per point.
Fast JL Transforms
- FJLT (Ailon and Chazelle, 2006): $\Phi = PHD$ where $D$ is a random sign diagonal, $H$ is a Hadamard matrix, $P$ is sparse Gaussian. Projection time: $O(d \log d)$ plus a sparse multiply, vs. $O(dk)$ for a dense projection
- Subsampled randomized Hadamard transform (SRHT): $O(d \log d)$ projection time. Used in randomized numerical linear algebra
Applications
- Dimensionality reduction: preprocess high-dimensional data before running exact algorithms
- Compressed sensing: random projections as measurement matrices (RIP property implies JL for sparse signals)
- Streaming algorithms: maintain a random projection of a data stream, answering distance queries approximately
- Privacy: random projection provides some privacy (though not differential privacy without additional noise)
Concentration of Measure
The concentration of measure phenomenon: in high dimensions, well-behaved functions of many independent variables concentrate tightly around their expectation.
Levy's Lemma
For a 1-Lipschitz function $f$ on the unit sphere $S^{d-1}$:

$$\Pr[|f(x) - \mathbb{E}f| > t] \leq 2 e^{-c d t^2}$$

where $x$ is uniform on $S^{d-1}$ and $c > 0$ is an absolute constant. The concentration is exponential in $d$.
Gaussian Concentration
For $X \sim \mathcal{N}(0, I_d)$ and $L$-Lipschitz $f: \mathbb{R}^d \to \mathbb{R}$:

$$\Pr[|f(X) - \mathbb{E}f(X)| > t] \leq 2 e^{-t^2 / (2L^2)}$$
Consequences for Geometry
- Norm concentration: for $X \sim \mathcal{N}(0, I_d)$: $\|X\| = \sqrt{d} + O(1)$ with deviation $O(1)$. The norm concentrates around $\sqrt{d}$
- Inner product concentration: for independent unit vectors $X, Y$: $|\langle X, Y \rangle| = O(1/\sqrt{d})$ (nearly orthogonal). In high dimensions, random vectors are approximately orthogonal
- JL proof: the projection of a vector onto a random $k$-dimensional subspace concentrates around its expected length, by Gaussian concentration
- Maximum of Gaussians: $\mathbb{E} \max_{i \leq n} X_i \approx \sqrt{2 \log n}$ for i.i.d. standard Gaussians (relevant to covering numbers and epsilon-nets)
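The first two bullets can be observed directly with stdlib sampling (seeded; a sketch, not a proof):

```python
import math
import random

rng = random.Random(0)
d = 2000

def gaussian_vector():
    return [rng.gauss(0, 1) for _ in range(d)]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

# Norms of N(0, I_d) samples concentrate around sqrt(d) with O(1) spread
norms = [norm(gaussian_vector()) for _ in range(50)]
print(math.sqrt(d), min(norms), max(norms))

# Normalized inner product of independent samples is O(1/sqrt(d))
x, y = gaussian_vector(), gaussian_vector()
cos = sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y))
print(cos)
```

With $d = 2000$, all 50 norms land within a few units of $\sqrt{2000} \approx 44.7$, and the cosine between independent samples is on the order of $1/\sqrt{d} \approx 0.02$.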
Measure Concentration and Machine Learning
- Kernel methods: kernel values become uninformative in high dimensions (all pairwise distances concentrate)
- Manifold hypothesis: real data lies on or near a low-dimensional manifold, circumventing the curse of dimensionality
- Generalization: concentration inequalities (McDiarmid, Rademacher) underpin statistical learning theory
- Random features (Rahimi-Recht, 2007): random projections approximate kernel functions, connecting JL to kernel methods
Embeddings and Metric Spaces
Bourgain's Theorem
Any $n$-point metric space embeds into $\ell_2$ with distortion $O(\log n)$. This is tight for some metrics (e.g., expander graphs require $\Omega(\log n)$ distortion).
Assouad's Theorem
If a metric space has doubling dimension $d$ (every ball can be covered by $2^d$ balls of half the radius), then its snowflake $(X, \rho^\alpha)$ for any $0 < \alpha < 1$ embeds into $\mathbb{R}^{O(d)}$ with constant distortion. This formalizes the idea that "intrinsically low-dimensional" data can be reduced.
Nearest Neighbor Search on Manifolds
The intrinsic dimension $d$ (not the ambient dimension $D$) determines the true difficulty of nearest neighbor search. ANN methods for general metrics (cover trees, navigating nets) achieve query times whose exponential factors depend on $d$, not on $D$.
Cover tree (Beygelzimer, Kakade, Langford, 2006): a hierarchical data structure for metric spaces with bounded expansion rate. Nearest neighbor in $O(c^{12} \log n)$ where $c$ is the expansion constant (related to doubling dimension). Space $O(n)$.