High-Dimensional Geometry

Curse of Dimensionality

As dimensionality $d$ grows, geometric intuition from low dimensions breaks down in fundamental ways.

Volume Concentration

The volume of the unit $d$-ball $B_d$ is

$$V_d = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)}$$

This vanishes as $d \to \infty$: $V_d \to 0$. Relative to the enclosing cube $[-1,1]^d$ (volume $2^d$), the ball occupies a vanishing fraction.

Shell concentration: most of the volume of a $d$-ball is concentrated near its surface. The fraction of volume in the outer shell of thickness $\epsilon$ (between radii $1-\epsilon$ and $1$) is

$$1 - (1-\epsilon)^d \to 1 \quad \text{as } d \to \infty$$

For $d = 100$, about 99.4% of the ball's volume lies in the outer 5% shell.
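Both facts are easy to check numerically; a minimal sketch (function names are illustrative):

```python
import math

def unit_ball_volume(d):
    """V_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the unit d-ball."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def outer_shell_fraction(d, eps):
    """Fraction of the d-ball's volume within eps of the boundary: 1 - (1-eps)^d."""
    return 1 - (1 - eps) ** d

# The ball occupies a vanishing fraction of the cube [-1, 1]^d (volume 2^d)...
for d in (2, 10, 100):
    print(f"d={d:3d}  V_d/2^d = {unit_ball_volume(d) / 2**d:.3e}")

# ...and nearly all of its own volume sits in a thin outer shell.
print(f"outer 5% shell, d=100: {outer_shell_fraction(100, 0.05):.4f}")
```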

Distance Concentration

For $n$ random points uniformly distributed in $[0,1]^d$, the ratio of maximum to minimum pairwise distance converges to 1:

$$\frac{d_{\max}}{d_{\min}} \to 1 \quad \text{as } d \to \infty$$

More precisely, for i.i.d. random points the spread $d_{\max} - d_{\min}$ stays $O(1)$ while the typical distance is $\mathbb{E}[d] = \Theta(\sqrt{d})$, so the relative contrast is $O(d^{-1/2})$. All points become approximately equidistant, making nearest-neighbor search meaningless with naive approaches.
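A quick simulation (illustrative, using numpy) shows the max/min distance ratio collapsing toward 1 as $d$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_min_distance_ratio(n, d):
    """d_max / d_min over all pairs of n uniform points in [0, 1]^d."""
    X = rng.random((n, d))
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T      # squared pairwise distances
    pairs = np.sqrt(np.maximum(d2[np.triu_indices(n, k=1)], 0.0))
    return pairs.max() / pairs.min()

for d in (2, 100, 10_000):
    print(f"d={d:6d}  d_max/d_min = {max_min_distance_ratio(50, d):.2f}")
```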

Implications for Algorithms

  • Nearest neighbor: exact methods (k-d trees, Voronoi diagrams) degrade to brute-force linear scan for $d \gtrsim 20$
  • Clustering: distances become less discriminative; density-based methods fail
  • Volume-based methods: kernel density estimation requires $n = \Omega(\epsilon^{-d})$ samples for accuracy $\epsilon$
  • Sampling: uniform sampling of high-dimensional spaces is exponentially wasteful (most volume is in corners/shells)

Locality-Sensitive Hashing (LSH)

LSH (Indyk and Motwani, 1998): a family of hash functions $\mathcal{H}$ such that:

  • Close points hash together: $\Pr_{h \sim \mathcal{H}}[h(p) = h(q)] \geq P_1$ if $d(p,q) \leq r$
  • Far points hash apart: $\Pr_{h \sim \mathcal{H}}[h(p) = h(q)] \leq P_2$ if $d(p,q) > cr$

where $P_1 > P_2$ and $c > 1$ is the approximation factor. The exponent $\rho = \frac{\log P_1}{\log P_2}$ determines the performance: query time is $O(n^\rho)$ with $O(n^{1+\rho})$ space.

MinHash (Jaccard Similarity)

For sets $A, B$, the Jaccard similarity is $J(A,B) = |A \cap B| / |A \cup B|$.

MinHash: apply a random permutation $\pi$ to the universe; the hash is $h_\pi(A) = \min_{a \in A} \pi(a)$.

$$\Pr[h_\pi(A) = h_\pi(B)] = J(A,B)$$

Use $k$ independent hash functions and band into $b$ bands of $r$ rows ($k = br$). Two sets become candidates if they agree on all $r$ hashes in at least one band. This creates an S-curve for the candidate probability:

$$P(\text{candidate}) = 1 - (1 - J^r)^b$$
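A minimal MinHash sketch, using salted Python hash values as a stand-in for true random permutations (an approximation; all names here are illustrative):

```python
import random

def minhash_signature(items, k, seed=0):
    """k MinHash values; salted tuple hashes stand in for random permutations."""
    salts = random.Random(seed).sample(range(1 << 30), k)
    return [min(hash((salt, x)) for x in items) for salt in salts]

A, B = set(range(0, 80)), set(range(20, 100))   # true Jaccard = 60/100 = 0.6

k = 200
sig_a, sig_b = minhash_signature(A, k), minhash_signature(B, k)
jaccard_est = sum(a == b for a, b in zip(sig_a, sig_b)) / k
print(f"estimated J = {jaccard_est:.2f} (true 0.60)")

def candidate_prob(J, r, b):
    """S-curve: chance two sets agree on all r rows in at least one of b bands."""
    return 1 - (1 - J ** r) ** b

for J in (0.3, 0.6, 0.9):
    print(f"J={J}: P(candidate) = {candidate_prob(J, r=5, b=40):.3f}")
```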

Applications: near-duplicate document detection, web deduplication, entity resolution.

SimHash (Cosine Similarity)

For vectors $u, v \in \mathbb{R}^d$, the cosine similarity is $\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}$.

SimHash (Charikar, 2002): draw a random hyperplane through the origin (random vector $r$); the hash bit is $h_r(v) = \operatorname{sign}(r \cdot v)$.

$$\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta}{\pi}$$

Concatenate $k$ hash bits to form a $k$-bit hash. Hamming distance between hashes approximates angular distance.
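A sketch of SimHash angle estimation with Gaussian hyperplane normals (illustrative; the test vectors are constructed at a known 60-degree angle):

```python
import numpy as np

rng = np.random.default_rng(1)

d, k = 128, 2048
R = rng.standard_normal((k, d))        # k random hyperplane normals

def simhash_bits(v):
    """One sign bit per hyperplane: the k-bit SimHash of v."""
    return R @ v > 0

# Build v at a known 60-degree angle to u.
u = rng.standard_normal(d)
w = rng.standard_normal(d)
w -= (w @ u) / (u @ u) * u             # make w orthogonal to u
u, w = u / np.linalg.norm(u), w / np.linalg.norm(w)
v = np.cos(np.pi / 3) * u + np.sin(np.pi / 3) * w

# Pr[bits differ] = theta / pi, so the Hamming fraction estimates the angle.
hamming_frac = np.mean(simhash_bits(u) != simhash_bits(v))
theta_est = hamming_frac * np.pi
print(f"estimated angle: {np.degrees(theta_est):.1f} degrees (true 60)")
```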

The LSH exponent for cosine similarity: $\rho = \frac{1}{c}$ (for the $(r, cr)$-near-neighbor problem under angular distance). This is optimal for hyperplane-based LSH.

Cross-Polytope LSH

Andoni, Indyk, Laarhoven, Razenshteyn (2015): hash by applying a random rotation and then snapping to the nearest vertex of a cross-polytope (the set $\{\pm e_i\}_{i=1}^d$).

Performance: achieves $\rho = \frac{1}{c^2} + O(1/\sqrt{d})$ for Euclidean distance, which approaches the optimal $\rho = \frac{1}{c^2}$ from the LSH lower bound. Practically faster than random-projection LSH due to the structure of the hash family.
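A minimal sketch of a single cross-polytope hash, using a Haar-random rotation obtained by QR decomposition (illustrative; practical implementations use fast pseudo-random rotations instead of a dense matrix):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_rotation(d):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    return Q * np.sign(np.diag(R))       # sign fix so the distribution is uniform

def cross_polytope_hash(v, Q):
    """Nearest cross-polytope vertex +/- e_i of the rotated vector: (index, sign)."""
    x = Q @ v
    i = int(np.argmax(np.abs(x)))
    return i, int(np.sign(x[i]))

d = 64
Q = random_rotation(d)
v = rng.standard_normal(d)
print(cross_polytope_hash(v, Q))
# The hash depends only on direction: positive scaling never changes it.
print(cross_polytope_hash(3.0 * v, Q) == cross_polytope_hash(v, Q))
```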

Data-Dependent LSH

For $(r, cr)$-ANN in Euclidean space, $\rho = \frac{1}{c^2}$ is achievable and optimal among data-independent hash families. Data-dependent approaches can do better:

  • Data-dependent LSH (Andoni and Razenshteyn, 2015): hash partitions chosen after seeing the data achieve $\rho = \frac{1}{2c^2 - 1} < \frac{1}{c^2}$
  • In practice, learned hash functions (deep hashing) often outperform classical LSH

HNSW (Hierarchical Navigable Small World)

HNSW (Malkov and Yashunin, 2018): a graph-based ANN method that builds a hierarchical proximity graph.

Construction:

  1. Insert points one by one
  2. Each point is assigned a random level $\ell$ (geometric distribution: $\ell = \lfloor -\ln(\text{uniform}) \cdot m_L \rfloor$)
  3. At each level $\leq \ell$, connect the point to its nearest neighbors in the graph at that level
  4. Higher levels have fewer points and longer-range connections

Query:

  1. Start at the entry point (top level)
  2. Greedily traverse to the nearest neighbor at the current level
  3. Descend to the next level, using the found point as the starting point
  4. At level 0, perform a beam search (maintain a list of $ef$ candidates)
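The level-0 beam search can be sketched in pure numpy over a brute-force neighbor graph standing in for the real HNSW structure (illustrative only; this is not HNSW's actual construction, which is incremental and hierarchical):

```python
import heapq
import numpy as np

rng = np.random.default_rng(3)

def knn_graph(X, M):
    """Brute-force M-nearest-neighbor graph; a stand-in for HNSW's level-0 graph."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :M]

def beam_search(X, graph, q, entry, ef):
    """Best-first graph search that keeps the ef closest points seen so far."""
    dist = lambda i: float(((X[i] - q) ** 2).sum())
    visited = {entry}
    frontier = [(dist(entry), entry)]        # min-heap of unexpanded candidates
    best = [(-dist(entry), entry)]           # max-heap holding the ef best results
    while frontier:
        d_c, c = heapq.heappop(frontier)
        if d_c > -best[0][0]:
            break                            # closest candidate beats no result: stop
        for nb in graph[c]:
            nb = int(nb)
            if nb in visited:
                continue
            visited.add(nb)
            d_n = dist(nb)
            if len(best) < ef or d_n < -best[0][0]:
                heapq.heappush(frontier, (d_n, nb))
                heapq.heappush(best, (-d_n, nb))
                if len(best) > ef:
                    heapq.heappop(best)      # drop the current worst result
    return min((-nd, i) for nd, i in best)[1]  # index of the closest point found

n, d = 400, 4
X = rng.standard_normal((n, d))
q = rng.standard_normal(d)
found = beam_search(X, knn_graph(X, M=10), q, entry=0, ef=50)
exact = int(np.argmin(((X - q) ** 2).sum(axis=1)))
print(found, exact)
```

With a navigable graph and a generous $ef$, the search almost always returns the exact nearest neighbor while touching only a fraction of the points.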

Properties:

  • Logarithmic query time empirically: $O(\log n)$ expected
  • High recall (>95%) with proper parameter tuning
  • Memory: $O(n \cdot M)$ where $M$ is the max number of connections per node (typically 16-64)
  • State-of-the-art practical ANN performance on many benchmarks

Comparison with LSH: HNSW typically achieves higher recall at lower query time but requires more memory. LSH has provable guarantees; HNSW's performance is established only empirically.

Approximate Nearest Neighbors (ANN)

The $(1+\epsilon)$-approximate nearest neighbor problem: find any point $p$ with $d(q, p) \leq (1+\epsilon) \cdot d(q, \text{NN})$, where NN is the true nearest neighbor of the query $q$.

Methods Landscape

| Method | Query Time | Space | Strengths |
|---|---|---|---|
| LSH | $O(n^{1/c^2 + o(1)})$ | $O(n^{1+1/c^2})$ | Provable, sublinear |
| HNSW | $O(\log n)$ empirical | $O(nM)$ | Best practical recall-speed tradeoff |
| Product quantization | $O(n/b)$ | $O(n \cdot m)$ | Memory-efficient |
| IVF (inverted file) | $O(n/k + k)$ | $O(n + k)$ | Billion-scale with disk |
| Annoy (random trees) | $O(\log n)$ | $O(n \cdot T)$ | Simple, memory-mapped |
| ScaNN | $O(n/k)$ | $O(n)$ | Anisotropic quantization |

Composite Systems

Real-world ANN systems combine techniques:

  • IVF + PQ (FAISS): cluster points (IVF), encode residuals with product quantization. Handles billions of vectors
  • IVF + HNSW: use HNSW to navigate the IVF cluster centroids
  • Graph + quantization: HNSW with compressed vectors for reduced memory

FAISS (Facebook AI Similarity Search): the standard library. Supports GPU acceleration, billion-scale indices, various index types (flat, IVF, PQ, HNSW, composite).

Johnson-Lindenstrauss Lemma

Theorem (Johnson and Lindenstrauss, 1984): for any $\epsilon \in (0, 1)$ and any set of $n$ points in $\mathbb{R}^d$, there exists a map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that for all pairs $u, v$:

$$(1-\epsilon)\|u - v\|^2 \leq \|f(u) - f(v)\|^2 \leq (1+\epsilon)\|u - v\|^2$$

Key point: $k$ depends only on $\log n$ and $\epsilon$, not on the original dimension $d$. A set of $n$ points in arbitrarily high dimension can be embedded in $O(\log n / \epsilon^2)$ dimensions with bounded distortion.

Constructive proof: $f(x) = \frac{1}{\sqrt{k}} A x$ where $A$ is a $k \times d$ matrix with i.i.d. entries from $\mathcal{N}(0, 1)$ (or $\pm 1$ with equal probability, or sparse matrices).
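The Gaussian construction can be verified empirically; a sketch (the constant 16 in the choice of $k$ is a conservative illustrative choice, not the theorem's constant):

```python
import numpy as np

rng = np.random.default_rng(4)

n, d, eps = 200, 10_000, 0.3
k = int(np.ceil(16 * np.log(n) / eps ** 2))       # k = (C / eps^2) log n, C = 16
X = rng.standard_normal((n, d))

A = rng.standard_normal((k, d))                   # i.i.d. N(0, 1) entries
Y = (A @ X.T).T / np.sqrt(k)                      # f(x) = (1/sqrt(k)) A x

def pairwise_sq_dists(Z):
    """Squared distances over all pairs, flattened."""
    sq = (Z ** 2).sum(axis=1)
    return (sq[:, None] + sq[None, :] - 2 * Z @ Z.T)[np.triu_indices(len(Z), k=1)]

# Every pairwise squared distance should be preserved up to a (1 +/- eps) factor.
ratios = pairwise_sq_dists(Y) / pairwise_sq_dists(X)
print(f"k = {k}, squared-distance ratios in [{ratios.min():.3f}, {ratios.max():.3f}]")
```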

Optimality: $k = \Omega(\epsilon^{-2} \log n)$ is necessary; Alon (2003) proved a nearly tight lower bound, and Larsen and Nielsen (2017) closed the gap.

Random Projections

Random projection matrices provide practical dimensionality reduction:

Gaussian Random Projection

$A_{ij} \sim \mathcal{N}(0, 1/k)$. Satisfies the JL guarantee with high probability.

Sparse Random Projection

Achlioptas (2003): $A_{ij} = \pm 1/\sqrt{k}$ with equal probability. Same JL guarantees, faster computation (no multiplications needed).

Very sparse (Li, Hastie, Church, 2006): $A_{ij} = \sqrt{s/k} \cdot \{+1, 0, -1\}$ with probabilities $\{1/(2s),\ 1 - 1/s,\ 1/(2s)\}$, where $s = \sqrt{d}$ or $s = d/\log d$. Reduces projection time from $O(dk)$ to $O(dk/s)$.

Fast JL Transforms

  • FJLT (Ailon and Chazelle, 2006): $\Phi = P \cdot H \cdot D$ where $D$ is a random sign diagonal, $H$ is a Hadamard matrix, and $P$ is a sparse Gaussian matrix. Projection time: $O(d \log d + k^3)$
  • Subsampled randomized Hadamard transform (SRHT): $O(d \log k)$ projection time. Used in randomized numerical linear algebra

Applications

  • Dimensionality reduction: preprocess high-dimensional data before running exact algorithms
  • Compressed sensing: random projections as measurement matrices (RIP property implies JL for sparse signals)
  • Streaming algorithms: maintain a random projection of a data stream, answering distance queries approximately
  • Privacy: random projection provides some privacy (though not differential privacy without additional noise)

Concentration of Measure

The concentration of measure phenomenon: in high dimensions, well-behaved functions of many independent variables concentrate tightly around their expectation.

Lévy's Lemma

For a 1-Lipschitz function $f$ on the unit sphere $S^{d-1}$:

$$\Pr[|f(X) - \mathbb{E}[f]| > \epsilon] \leq 2 \exp\left(-\frac{(d-1)\epsilon^2}{2}\right)$$

where $X$ is uniform on $S^{d-1}$. The concentration is exponential in $d$.

Gaussian Concentration

For $X \sim \mathcal{N}(0, I_d)$ and $L$-Lipschitz $f$:

$$\Pr[|f(X) - \mathbb{E}[f]| > t] \leq 2\exp\left(-\frac{t^2}{2L^2}\right)$$

Consequences for Geometry

  • Norm concentration: for $X \sim \mathcal{N}(0, I_d)$, $\|X\| \approx \sqrt{d}$ with deviation $O(1)$; the norm concentrates tightly around $\sqrt{d}$
  • Inner product concentration: for independent $X, Y \sim \mathcal{N}(0, I_d)$, the cosine $\frac{X \cdot Y}{\|X\|\,\|Y\|}$ is $O(1/\sqrt{d})$; in high dimensions, random vectors are approximately orthogonal
  • JL proof: the projection of a vector onto a random $k$-dimensional subspace concentrates around its expected length, by Gaussian concentration
  • Maximum of Gaussians: $\max_{i \leq n} X_i \approx \sqrt{2 \log n}$ (relevant to covering numbers and epsilon-nets)
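The first two bullets can be checked directly (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(5)

d, m = 10_000, 400
X = rng.standard_normal((m, d))

# Norm concentration: ||X|| ~ sqrt(d) = 100 with O(1) fluctuations.
norms = np.linalg.norm(X, axis=1)
print(f"mean norm {norms.mean():.1f} vs sqrt(d) = {np.sqrt(d):.1f}, std {norms.std():.2f}")

# Near-orthogonality: cosines of independent pairs behave like N(0, 1/d).
cosines = (X[:200] * X[200:]).sum(axis=1) / (norms[:200] * norms[200:])
print(f"largest |cosine| over 200 random pairs: {np.abs(cosines).max():.3f}")
```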

Measure Concentration and Machine Learning

  • Kernel methods: kernel values $K(x, y) = \exp(-\|x-y\|^2/\sigma^2)$ become uninformative in high dimensions (all pairwise distances concentrate)
  • Manifold hypothesis: real data lies on or near a low-dimensional manifold, circumventing the curse of dimensionality
  • Generalization: concentration inequalities (McDiarmid, Rademacher) underpin statistical learning theory
  • Random features (Rahimi-Recht, 2007): random projections approximate kernel functions, connecting JL to kernel methods

Embeddings and Metric Spaces

Bourgain's Theorem

Any $n$-point metric space embeds into $\ell_2$ with distortion $O(\log n)$. This is tight for some metrics: expander graph metrics require $\Omega(\log n)$ distortion.

Assouad's Theorem

If a metric space has doubling dimension $d$ (every ball can be covered by $2^d$ balls of half the radius), then its snowflake (the metric raised to a power $\alpha < 1$) embeds into $\mathbb{R}^{O(d)}$ with constant distortion. This formalizes the idea that "intrinsically low-dimensional" data can be reduced.

Nearest Neighbor Search on Manifolds

The intrinsic dimension $d$ (not the ambient dimension $D$) determines the true difficulty of nearest neighbor search. Methods such as cover trees and navigating nets achieve query times whose exponential dependence is on $d$, not $D$.

Cover tree (Beygelzimer, Kakade, Langford, 2006): a hierarchical data structure for metric spaces with bounded expansion rate. Nearest-neighbor queries take $O(c^{12} \log n)$ time, where $c$ is the expansion constant (related to the doubling dimension); space is $O(n)$.