High-Dimensional Geometry

Curse of Dimensionality

As dimensionality $d$ grows, geometric intuition from low dimensions breaks down in fundamental ways.

Volume Concentration

The volume of the unit $d$-ball $B_d$ is

$$V_d = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}$$

This vanishes as $d \to \infty$: $V_d \to 0$. Relative to the enclosing cube $[-1,1]^d$ (volume $2^d$), the ball occupies a vanishing fraction.

Shell concentration: most of the volume of a $d$-ball is concentrated near its surface. The fraction of volume in the outer shell of thickness $\epsilon$ (between radii $1-\epsilon$ and $1$) is

$$1 - (1-\epsilon)^d \to 1 \quad \text{as } d \to \infty$$

For $d = 100$, about 99.4% of the ball's volume lies in the outer 5% shell.
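Both facts are easy to check numerically; a minimal sketch (function names are illustrative):

```python
import math

def unit_ball_volume(d):
    """V_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the unit d-ball."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def outer_shell_fraction(d, eps):
    """Fraction of the d-ball's volume within eps of the boundary: 1 - (1-eps)^d."""
    return 1 - (1 - eps) ** d

# The ball occupies a vanishing fraction of the cube [-1, 1]^d (volume 2^d)...
for d in (2, 10, 100):
    print(f"d={d:3d}  V_d/2^d = {unit_ball_volume(d) / 2**d:.3e}")

# ...and nearly all of its own volume sits in a thin outer shell.
print(f"outer 5% shell, d=100: {outer_shell_fraction(100, 0.05):.4f}")
```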

Distance Concentration

For $n$ random points uniformly distributed in $[0,1]^d$, the ratio of maximum to minimum pairwise distance converges to 1:

$$\frac{d_{\max}}{d_{\min}} \to 1 \quad \text{as } d \to \infty$$

More precisely, for i.i.d. random points the spread $d_{\max} - d_{\min}$ stays $O(1)$ while the typical distance is $\mathbb{E}[d] = \Theta(\sqrt{d})$, so the relative contrast is $O(d^{-1/2})$. All points become approximately equidistant, making nearest-neighbor search meaningless with naive approaches.
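A quick simulation (illustrative, using numpy) shows the max/min distance ratio collapsing toward 1 as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_min_distance_ratio(n, d):
    """d_max / d_min over all pairs of n uniform points in [0, 1]^d."""
    X = rng.random((n, d))
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T      # squared pairwise distances
    pairs = np.sqrt(np.maximum(d2[np.triu_indices(n, k=1)], 0.0))
    return pairs.max() / pairs.min()

for d in (2, 100, 10_000):
    print(f"d={d:6d}  d_max/d_min = {max_min_distance_ratio(50, d):.2f}")
```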

Implications for Algorithms

  • Nearest neighbor: exact methods (k-d trees, Voronoi diagrams) degrade to brute-force linear scan for $d \gtrsim 20$
  • Clustering: distances become less discriminative; density-based methods fail
  • Volume-based methods: kernel density estimation requires $n = \Omega(\epsilon^{-d})$ samples for accuracy $\epsilon$
  • Sampling: uniform sampling of high-dimensional spaces is exponentially wasteful (most volume is in corners/shells)

Locality-Sensitive Hashing (LSH)

LSH (Indyk and Motwani, 1998): a family of hash functions $\mathcal{H}$ such that:

  • Close points hash together: $\Pr_{h \sim \mathcal{H}}[h(p) = h(q)] \geq P_1$ if $d(p,q) \leq r$
  • Far points hash apart: $\Pr_{h \sim \mathcal{H}}[h(p) = h(q)] \leq P_2$ if $d(p,q) > cr$

where $P_1 > P_2$ and $c > 1$ is the approximation factor. The exponent $\rho = \frac{\log P_1}{\log P_2}$ determines the performance: query time is $O(n^\rho)$ with $O(n^{1+\rho})$ space.

MinHash (Jaccard Similarity)

For sets $A, B$, the Jaccard similarity is $J(A,B) = |A \cap B| / |A \cup B|$.

MinHash: apply a random permutation $\pi$ to the universe; the hash is $h_\pi(A) = \min_{a \in A} \pi(a)$.

$$\Pr[h_\pi(A) = h_\pi(B)] = J(A,B)$$

Use $k$ independent hash functions and band into $b$ bands of $r$ rows ($k = br$). Two sets become candidates if they agree on all $r$ hashes in at least one band. This creates an S-curve for the candidate probability:

$$P(\text{candidate}) = 1 - (1 - J^r)^b$$
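A minimal MinHash sketch, using salted Python hash values as a stand-in for true random permutations (an approximation; all names here are illustrative):

```python
import random

def minhash_signature(items, k, seed=0):
    """k MinHash values; salted tuple hashes stand in for random permutations."""
    salts = random.Random(seed).sample(range(1 << 30), k)
    return [min(hash((salt, x)) for x in items) for salt in salts]

A, B = set(range(0, 80)), set(range(20, 100))   # true Jaccard = 60/100 = 0.6

k = 200
sig_a, sig_b = minhash_signature(A, k), minhash_signature(B, k)
jaccard_est = sum(a == b for a, b in zip(sig_a, sig_b)) / k
print(f"estimated J = {jaccard_est:.2f} (true 0.60)")

def candidate_prob(J, r, b):
    """S-curve: chance two sets agree on all r rows in at least one of b bands."""
    return 1 - (1 - J ** r) ** b

for J in (0.3, 0.6, 0.9):
    print(f"J={J}: P(candidate) = {candidate_prob(J, r=5, b=40):.3f}")
```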

Applications: near-duplicate document detection, web deduplication, entity resolution.

SimHash (Cosine Similarity)

For vectors $u, v \in \mathbb{R}^d$, the cosine similarity is $\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}$.

SimHash (Charikar, 2002): draw a random hyperplane through the origin (random vector $r$); the hash bit is $h_r(v) = \operatorname{sign}(r \cdot v)$.

$$\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta}{\pi}$$

Concatenate $k$ hash bits to form a $k$-bit hash. Hamming distance between hashes approximates angular distance.
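A sketch of SimHash angle estimation with Gaussian hyperplane normals (illustrative; the test vectors are constructed at a known 60-degree angle):

```python
import numpy as np

rng = np.random.default_rng(1)

d, k = 128, 2048
R = rng.standard_normal((k, d))        # k random hyperplane normals

def simhash_bits(v):
    """One sign bit per hyperplane: the k-bit SimHash of v."""
    return R @ v > 0

# Build v at a known 60-degree angle to u.
u = rng.standard_normal(d)
w = rng.standard_normal(d)
w -= (w @ u) / (u @ u) * u             # make w orthogonal to u
u, w = u / np.linalg.norm(u), w / np.linalg.norm(w)
v = np.cos(np.pi / 3) * u + np.sin(np.pi / 3) * w

# Pr[bits differ] = theta / pi, so the Hamming fraction estimates the angle.
hamming_frac = np.mean(simhash_bits(u) != simhash_bits(v))
theta_est = hamming_frac * np.pi
print(f"estimated angle: {np.degrees(theta_est):.1f} degrees (true 60)")
```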

The LSH exponent for cosine similarity: $\rho = \frac{1}{c}$ (for the $(r, cr)$-near-neighbor problem under angular distance). This is optimal for hyperplane-based LSH.

Cross-Polytope LSH

Andoni, Indyk, Laarhoven, Razenshteyn (2015): hash by applying a random rotation and then snapping to the nearest vertex of a cross-polytope (the set $\{\pm e_i\}_{i=1}^d$).

Performance: achieves $\rho = \frac{1}{c^2} + O(1/\sqrt{d})$ for Euclidean distance, which approaches the optimal $\rho = \frac{1}{c^2}$ from the LSH lower bound. Practically faster than random-projection LSH due to the structure of the hash family.
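A minimal sketch of a single cross-polytope hash, using a Haar-random rotation obtained by QR decomposition (illustrative; practical implementations use fast pseudo-random rotations instead of a dense matrix):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_rotation(d):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))       # sign fix so the distribution is uniform

def cross_polytope_hash(v, Q):
    """Nearest cross-polytope vertex +/- e_i of the rotated vector: (index, sign)."""
    x = Q @ v
    i = int(np.argmax(np.abs(x)))
    return i, int(np.sign(x[i]))

d = 64
Q = random_rotation(d)
v = rng.standard_normal(d)
print(cross_polytope_hash(v, Q))
# The hash depends only on direction: positive scaling never changes it.
print(cross_polytope_hash(3.0 * v, Q) == cross_polytope_hash(v, Q))
```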

Data-Dependent LSH

For $(r, cr)$-ANN in Euclidean space, $\rho = \frac{1}{c^2}$ is achievable and optimal among data-independent hash families. Data-dependent approaches can do better:

  • Data-dependent LSH (Andoni and Razenshteyn, 2015): hash partitions chosen after seeing the data achieve $\rho = \frac{1}{2c^2 - 1} < \frac{1}{c^2}$
  • In practice, learned hash functions (deep hashing) often outperform classical LSH

HNSW (Hierarchical Navigable Small World)

HNSW (Malkov and Yashunin, 2018): a graph-based ANN method that builds a hierarchical proximity graph.

Construction:

  1. Insert points one by one
  2. Each point is assigned a random level $\ell$ (geometric distribution: $\ell = \lfloor -\ln(\text{uniform}) \cdot m_L \rfloor$)
  3. At each level $\leq \ell$, connect the point to its nearest neighbors in the graph at that level
  4. Higher levels have fewer points and longer-range connections

Query:

  1. Start at the entry point (top level)
  2. Greedily traverse to the nearest neighbor at the current level
  3. Descend to the next level, using the found point as the starting point
  4. At level 0, perform a beam search (maintain a list of $ef$ candidates)
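The level-0 beam search can be sketched in pure numpy over a brute-force neighbor graph standing in for the real HNSW structure (illustrative only; this is not HNSW's actual construction, which is incremental and hierarchical):

```python
import heapq
import numpy as np

rng = np.random.default_rng(3)

def knn_graph(X, M):
    """Brute-force M-nearest-neighbor graph; a stand-in for HNSW's level-0 graph."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :M]

def beam_search(X, graph, q, entry, ef):
    """Best-first graph search that keeps the ef closest points seen so far."""
    dist = lambda i: float(((X[i] - q) ** 2).sum())
    visited = {entry}
    frontier = [(dist(entry), entry)]        # min-heap of unexpanded candidates
    best = [(-dist(entry), entry)]           # max-heap holding the ef best results
    while frontier:
        d_c, c = heapq.heappop(frontier)
        if d_c > -best[0][0]:
            break                            # closest candidate beats no result: stop
        for nb in graph[c]:
            nb = int(nb)
            if nb in visited:
                continue
            visited.add(nb)
            d_n = dist(nb)
            if len(best) < ef or d_n < -best[0][0]:
                heapq.heappush(frontier, (d_n, nb))
                heapq.heappush(best, (-d_n, nb))
                if len(best) > ef:
                    heapq.heappop(best)      # drop the current worst result
    return min((-nd, i) for nd, i in best)[1]  # index of the closest point found

n, d = 400, 4
X = rng.standard_normal((n, d))
q = rng.standard_normal(d)
found = beam_search(X, knn_graph(X, M=10), q, entry=0, ef=50)
exact = int(np.argmin(((X - q) ** 2).sum(axis=1)))
print(found, exact)
```

With a navigable graph and a generous $ef$, the search almost always returns the exact nearest neighbor while touching only a fraction of the points.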

Properties:

  • Logarithmic query time empirically: $O(\log n)$ expected
  • High recall (>95%) with proper parameter tuning
  • Memory: $O(n \cdot M)$ where $M$ is the max number of connections per node (typically 16-64)
  • State-of-the-art practical ANN performance on many benchmarks

Comparison with LSH: HNSW typically achieves higher recall at lower query time but requires more memory. LSH has provable guarantees; HNSW's performance is established only empirically.

Approximate Nearest Neighbors (ANN)

The $(1+\epsilon)$-approximate nearest neighbor problem: find any point $p$ with $d(q, p) \leq (1+\epsilon) \cdot d(q, \text{NN})$, where NN is the true nearest neighbor of the query $q$.

Methods Landscape

| Method | Query Time | Space | Strengths |
|---|---|---|---|
| LSH | $O(n^{1/c^2 + o(1)})$ | $O(n^{1+1/c^2})$ | Provable, sublinear |
| HNSW | $O(\log n)$ empirical | $O(nM)$ | Best practical recall-speed tradeoff |
| Product quantization | $O(n/b)$ | $O(n \cdot m)$ | Memory-efficient |
| IVF (inverted file) | $O(n/k + k)$ | $O(n + k)$ | Billion-scale with disk |
| Annoy (random trees) | $O(\log n)$ | $O(n \cdot T)$ | Simple, memory-mapped |
| ScaNN | $O(n/k)$ | $O(n)$ | Anisotropic quantization |

Composite Systems

Real-world ANN systems combine techniques:

  • IVF + PQ (FAISS): cluster points (IVF), encode residuals with product quantization. Handles billions of vectors
  • IVF + HNSW: use HNSW to navigate the IVF cluster centroids
  • Graph + quantization: HNSW with compressed vectors for reduced memory

FAISS (Facebook AI Similarity Search): the standard library. Supports GPU acceleration, billion-scale indices, various index types (flat, IVF, PQ, HNSW, composite).

Johnson-Lindenstrauss Lemma

Theorem (Johnson and Lindenstrauss, 1984): for any $\epsilon \in (0, 1)$ and any set of $n$ points in $\mathbb{R}^d$, there exists a map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that for all pairs $u, v$:

$$(1-\epsilon)\|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1+\epsilon)\|u - v\|^2$$

Key point: $k$ depends only on $\log n$ and $\epsilon$, not on the original dimension $d$. A set of $n$ points in arbitrarily high dimension can be embedded in $O(\log n / \epsilon^2)$ dimensions with bounded distortion.

Constructive proof: $f(x) = \frac{1}{\sqrt{k}} A x$ where $A$ is a $k \times d$ matrix with i.i.d. entries from $\mathcal{N}(0, 1)$ (or $\pm 1$ with equal probability, or sparse matrices).
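The Gaussian construction can be verified empirically; a sketch (the constant 16 in the choice of $k$ is a conservative illustrative choice, not the theorem's constant):

```python
import numpy as np

rng = np.random.default_rng(4)

n, d, eps = 200, 10_000, 0.3
k = int(np.ceil(16 * np.log(n) / eps ** 2))       # k = (C / eps^2) log n, C = 16
X = rng.standard_normal((n, d))

A = rng.standard_normal((k, d))                   # i.i.d. N(0, 1) entries
Y = (A @ X.T).T / np.sqrt(k)                      # f(x) = (1/sqrt(k)) A x

def pairwise_sq_dists(Z):
    """Squared distances over all pairs, flattened."""
    sq = (Z ** 2).sum(axis=1)
    return (sq[:, None] + sq[None, :] - 2 * Z @ Z.T)[np.triu_indices(len(Z), k=1)]

# Every pairwise squared distance should be preserved up to a (1 +/- eps) factor.
ratios = pairwise_sq_dists(Y) / pairwise_sq_dists(X)
print(f"k = {k}, squared-distance ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")
```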

Optimality: $k = \Omega(\epsilon^{-2} \log n)$ is necessary; Alon (2003) proved a nearly tight lower bound, and Larsen and Nielsen (2017) closed the gap.

Random Projections

Random projection matrices provide practical dimensionality reduction:

Gaussian Random Projection

$A_{ij} \sim \mathcal{N}(0, 1/k)$. Satisfies the JL guarantee with high probability.

Sparse Random Projection

Achlioptas (2003): $A_{ij} = \pm 1/\sqrt{k}$ with equal probability. Same JL guarantees, faster computation (no multiplications needed).

Very sparse (Li, Hastie, Church, 2006): $A_{ij} = \sqrt{s/k} \cdot \{+1, 0, -1\}$ with probabilities $\{1/(2s),\ 1 - 1/s,\ 1/(2s)\}$, where $s = \sqrt{d}$ or $s = d/\log d$. Reduces projection time from $O(dk)$ to $O(dk/s)$.

Fast JL Transforms

  • FJLT (Ailon and Chazelle, 2006): $\Phi = P \cdot H \cdot D$ where $D$ is a random sign diagonal, $H$ is a Hadamard matrix, and $P$ is a sparse Gaussian matrix. Projection time: $O(d \log d + k^3)$
  • Subsampled randomized Hadamard transform (SRHT): $O(d \log k)$ projection time. Used in randomized numerical linear algebra

Applications

  • Dimensionality reduction: preprocess high-dimensional data before running exact algorithms
  • Compressed sensing: random projections as measurement matrices (RIP property implies JL for sparse signals)
  • Streaming algorithms: maintain a random projection of a data stream, answering distance queries approximately
  • Privacy: random projection provides some privacy (though not differential privacy without additional noise)

Concentration of Measure

The concentration of measure phenomenon: in high dimensions, well-behaved functions of many independent variables concentrate tightly around their expectation.

Lévy's Lemma

For a 1-Lipschitz function $f$ on the unit sphere $S^{d-1}$:

$$\Pr[|f(X) - \mathbb{E}[f]| > \epsilon] \leq 2 \exp\left(-\frac{(d-1)\epsilon^2}{2}\right)$$

where $X$ is uniform on $S^{d-1}$. The concentration is exponential in $d$.

Gaussian Concentration

For $X \sim \mathcal{N}(0, I_d)$ and $L$-Lipschitz $f$:

$$\Pr[|f(X) - \mathbb{E}[f]| > t] \leq 2\exp\left(-\frac{t^2}{2L^2}\right)$$

Consequences for Geometry

  • Norm concentration: for $X \sim \mathcal{N}(0, I_d)$, $\|X\| \approx \sqrt{d}$ with deviation $O(1)$; the norm concentrates tightly around $\sqrt{d}$
  • Inner product concentration: for independent $X, Y \sim \mathcal{N}(0, I_d)$, the cosine $\frac{X \cdot Y}{\|X\|\,\|Y\|}$ is $O(1/\sqrt{d})$; in high dimensions, random vectors are approximately orthogonal
  • JL proof: the projection of a vector onto a random $k$-dimensional subspace concentrates around its expected length, by Gaussian concentration
  • Maximum of Gaussians: $\max_{i \leq n} X_i \approx \sqrt{2 \log n}$ (relevant to covering numbers and epsilon-nets)
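The first two bullets can be checked directly (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(5)

d, m = 10_000, 400
X = rng.standard_normal((m, d))

# Norm concentration: ||X|| ~ sqrt(d) = 100 with O(1) fluctuations.
norms = np.linalg.norm(X, axis=1)
print(f"mean norm {norms.mean():.1f} vs sqrt(d) = {np.sqrt(d):.1f}, std {norms.std():.2f}")

# Near-orthogonality: cosines of independent pairs behave like N(0, 1/d).
cosines = (X[:200] * X[200:]).sum(axis=1) / (norms[:200] * norms[200:])
print(f"largest |cosine| over 200 random pairs: {np.abs(cosines).max():.3f}")
```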

Measure Concentration and Machine Learning

  • Kernel methods: kernel values $K(x, y) = \exp(-\|x-y\|^2/\sigma^2)$ become uninformative in high dimensions (all pairwise distances concentrate)
  • Manifold hypothesis: real data lies on or near a low-dimensional manifold, circumventing the curse of dimensionality
  • Generalization: concentration inequalities (McDiarmid, Rademacher) underpin statistical learning theory
  • Random features (Rahimi-Recht, 2007): random projections approximate kernel functions, connecting JL to kernel methods

Embeddings and Metric Spaces

Bourgain's Theorem

Any $n$-point metric space embeds into $\ell_2$ with distortion $O(\log n)$. This is tight for some metrics: expander graph metrics require $\Omega(\log n)$ distortion.

Assouad's Theorem

If a metric space has doubling dimension $d$ (every ball can be covered by $2^d$ balls of half the radius), then its snowflake (the metric raised to a power $\alpha < 1$) embeds into $\mathbb{R}^{O(d)}$ with constant distortion. This formalizes the idea that "intrinsically low-dimensional" data can be reduced.

Nearest Neighbor Search on Manifolds

The intrinsic dimension $d$ (not the ambient dimension $D$) determines the true difficulty of nearest neighbor search. Methods such as cover trees and navigating nets achieve query times whose exponential dependence is on $d$, not $D$.

Cover tree (Beygelzimer, Kakade, Langford, 2006): a hierarchical data structure for metric spaces with bounded expansion rate. Nearest-neighbor queries take $O(c^{12} \log n)$ time, where $c$ is the expansion constant (related to the doubling dimension); space is $O(n)$.