Topological Data Analysis

Overview

Topological data analysis (TDA) applies algebraic topology to study the "shape" of data. The central insight: topological features (connected components, loops, voids) captured at multiple scales reveal structural patterns invisible to traditional statistical methods. TDA is coordinate-free, invariant to continuous deformations, and robust to noise.

Simplicial Complexes

A simplicial complex $K$ is a collection of simplices (vertices, edges, triangles, tetrahedra, ...) closed under taking faces: if $\sigma \in K$ and $\tau \subseteq \sigma$ , then $\tau \in K$ .

A $k$ -simplex $\sigma = [v_0, v_1, \ldots, v_k]$ is the convex hull of $k+1$ affinely independent points. Its faces are simplices formed by subsets of its vertices.

Vietoris-Rips Complex

Given a point cloud $P$ and parameter $\epsilon > 0$ :

$\text{VR}_\epsilon(P) = \{\sigma \subseteq P : d(p_i, p_j) \leq \epsilon \text{ for all } p_i, p_j \in \sigma\}$

A simplex is included iff all pairwise distances are at most $\epsilon$ . Equivalently, a subset forms a simplex iff the set of $\epsilon$ -balls centered at its points has a common intersection (clique complex of the $\epsilon$ -neighborhood graph).

Properties:

Determined entirely by pairwise distances (metric data suffices)
Flag complex: determined by its 1-skeleton (edge graph). This is computationally advantageous
Can contain simplices of dimension $\geq d$ even for points in $\mathbb{R}^d$
Expensive to compute/store: up to $2^n$ simplices in the worst case. In practice, impose a maximum dimension

Relationship to Cech: $\text{Cech}_\epsilon(P) \subseteq \text{VR}_\epsilon(P) \subseteq \text{Cech}_{\sqrt{2}\epsilon}(P)$ (by the nerve lemma and Jung's theorem). They yield the same persistent homology at scale $\sqrt{2}$ .

Cech Complex

$\text{Cech}_\epsilon(P) = \{\sigma \subseteq P : \bigcap_{p \in \sigma} B(p, \epsilon/2) \neq \emptyset\}$

A simplex is included iff the balls of radius $\epsilon/2$ centered at its vertices have a common intersection. By the nerve lemma, the Cech complex has the same homotopy type as the union of balls $\bigcup_{p \in P} B(p, \epsilon/2)$ .

More geometrically faithful than Rips but computationally harder (requires checking higher-order intersections, not just pairwise distances). Miniball algorithms determine simplex inclusion.

Alpha Complex

The alpha complex $\alpha_\epsilon(P)$ is the subcomplex of the Delaunay triangulation consisting of simplices whose circumradius is at most $\epsilon$ :

$\alpha_\epsilon(P) = \{\sigma \in \text{Del}(P) : \text{circumradius}(\sigma) \leq \epsilon\}$

Properties:

Homotopy equivalent to the union of balls (same as Cech), by the nerve lemma applied to the Voronoi-restricted balls
Much smaller than the Rips or Cech complex: at most $O(n^{\lceil d/2 \rceil})$ simplices (the size of the Delaunay triangulation)
Efficient for low-dimensional data ( $d \leq 3$ )
Embedded in ambient space (simplices are geometric, not abstract)
For $d = 2$ : at most $O(n)$ simplices. For $d = 3$ : at most $O(n^2)$

Weighted alpha complexes: use weighted Delaunay (regular) triangulations. Accommodate points with different "radii" (e.g., atoms with van der Waals radii in molecular biology).

Cubical Complexes

For gridded data (images, volumes), cubical complexes are more natural than simplicial ones. Elementary cubes (pixels/voxels and their faces) form the cells. Persistent homology on cubical complexes avoids triangulation entirely.

Efficient for image analysis, medical imaging (CT/MRI), and volumetric data.

Persistent Homology

Homology Groups (Brief Review)

The $k$ -th homology group $H_k(K)$ of a simplicial complex $K$ captures $k$ -dimensional "holes":

$H_0$ : connected components
$H_1$ : loops (1-cycles not bounding a 2-chain)
$H_2$ : voids (2-cycles not bounding a 3-chain)

The rank of $H_k$ is the $k$ -th Betti number $\beta_k$ : the number of independent $k$ -dimensional holes.

Homology is computed over a coefficient field (typically $\mathbb{Z}_2$ for simplicity, or $\mathbb{Z}$ for orientability information). With $\mathbb{Z}_2$ coefficients, chains are formal sums of simplices mod 2, and the boundary operator $\partial_k$ maps $k$ -chains to $(k-1)$ -chains.

$H_k = \ker \partial_k / \text{im} \, \partial_{k+1}$

Filtrations

A filtration is a nested sequence of simplicial complexes:

$\emptyset = K_0 \subseteq K_1 \subseteq K_2 \subseteq \cdots \subseteq K_m = K$

Common filtrations:

Rips filtration: $\text{VR}_{\epsilon_1} \subseteq \text{VR}_{\epsilon_2} \subseteq \cdots$ for increasing $\epsilon$
Alpha filtration: $\alpha_{\epsilon_1} \subseteq \alpha_{\epsilon_2} \subseteq \cdots$
Sublevel set filtration: for a function $f: K \to \mathbb{R}$ , the sublevel sets $f^{-1}(-\infty, t]$ form a filtration as $t$ increases. Used for scalar field topology

Birth, Death, and Persistence

As the filtration parameter increases, topological features appear (birth) and disappear (death):

A new connected component appears when a vertex is added (birth of 0-cycle)
Two components merge when an edge connects them (death of younger component --- elder rule)
A loop appears when an edge closes a cycle (birth of 1-cycle)
A loop is filled when a triangle (2-simplex) is added (death of 1-cycle)

The persistence of a feature is $\text{death} - \text{birth}$ . Long-lived features represent genuine topological structure; short-lived features are noise.

Persistence Diagrams

A persistence diagram $\text{Dgm}_k$ is a multiset of points $(b_i, d_i)$ in the plane, where each point represents a $k$ -dimensional homological feature with birth $b_i$ and death $d_i$ .

Points near the diagonal ( $d \approx b$ ) represent noise
Points far from the diagonal represent significant features
Features that never die are plotted at $d = \infty$ (essential classes)
The diagram lies above the diagonal ( $d > b$ )

Barcodes

An equivalent representation: each feature is a horizontal interval $[b_i, d_i)$ (a "bar"). The collection of bars is the barcode. Barcodes are often more visually intuitive than diagrams.

Computation (The Standard Algorithm)

Matrix reduction: represent the filtration boundary operator as a matrix and reduce it (column operations) to identify persistent pairs.

Given the boundary matrix $D$ (columns ordered by filtration time), reduce it by adding columns left-to-right (analogous to column echelon form over $\mathbb{Z}_2$ ):

for j = 1 to m:
    while exists j' < j with low(j') = low(j):
        D[:, j] += D[:, j']  (mod 2)

where $\text{low}(j)$ is the row index of the lowest nonzero entry in column $j$ . A persistence pair is $(\text{low}(j), j)$ : the simplex at index $\text{low}(j)$ creates the feature, the simplex at index $j$ destroys it.

Complexity: $O(m^3)$ worst case for $m$ simplices, but $O(m^\omega)$ via matrix multiplication speedups. In practice, much faster due to sparsity (often nearly linear).

Optimizations:

Clearing optimization: after identifying a pair, clear the corresponding column (set to zero). Reduces work by ~50%
Twist optimization: combine clearing with cohomology-based persistence (Bauer)
Chunk reduction: parallelize by processing independent blocks
Apparent and emergent pairs: identify trivially paired simplices before reduction

Stability Theorem

Bottleneck stability (Cohen-Steiner, Edelsbrunner, Harer, 2007): if $f, g: K \to \mathbb{R}$ are two filtration functions:

$d_B(\text{Dgm}(f), \text{Dgm}(g)) \leq \|f - g\|_\infty$

where $d_B$ is the bottleneck distance between persistence diagrams (the infimum over all matchings of the maximum matched distance, with unmatched points paired to the diagonal).

This guarantees that small perturbations of the data produce small changes in the persistence diagram. Persistence diagrams are robust topological summaries.

Wasserstein stability: the $p$ -Wasserstein distance $W_p$ also satisfies stability bounds, providing finer control for applications requiring soft matching.

Betti Numbers

The $k$ -th Betti number $\beta_k$ counts the number of $k$ -dimensional holes:

$\beta_0$ : number of connected components
$\beta_1$ : number of independent loops/tunnels
$\beta_2$ : number of enclosed voids/cavities

Euler characteristic: $\chi = \sum_{k=0}^d (-1)^k \beta_k = \sum_{k=0}^d (-1)^k n_k$ where $n_k$ is the number of $k$ -simplices. Computable in linear time (just count simplices), but coarser than full homology.

Persistent Betti numbers: $\beta_k^{s,t}$ counts the number of $k$ -features born at or before $s$ that are still alive at $t$ . The persistence diagram encodes the complete persistent Betti number function.

Mapper Algorithm

Mapper (Singh, Memoli, Carlsson, 2007): a topological network summarization tool for high-dimensional data.

Algorithm:

Choose a filter function $f: X \to \mathbb{R}$ (or $\mathbb{R}^k$ ) --- a lens to view the data through (e.g., density, PCA coordinate, eccentricity)
Cover the range of $f$ with overlapping intervals $U_1, U_2, \ldots$
For each interval $U_i$ , cluster the preimage $f^{-1}(U_i)$ using any clustering algorithm
Build a nerve: create a node for each cluster, and an edge between nodes whose clusters share data points (from overlapping intervals)

Output: a graph (or simplicial complex) summarizing the topological structure of the data.

Parameter choices:

Filter function: problem-dependent. Density (DBSCAN), Gaussian density, PCA coordinates, graph Laplacian coordinates, L-infinity centrality
Number of intervals and overlap percentage (typically 20-50% overlap)
Clustering algorithm and its parameters

Applications:

Breast cancer subtype identification (Nicolau, Levine, Carlsson, 2011): Mapper revealed a new subgroup with 100% survival
Diabetes patient stratification
Gene expression analysis
NBA player analysis (playing style topology)

Kepler Mapper: standard Python implementation. Interactive visualization.

Vectorization of Persistence Diagrams

Persistence diagrams are not directly usable in ML (they are multisets, not vectors). Vectorization methods:

Persistence landscapes (Bubenik, 2015): transform diagrams into functions in a Banach space. Statistical tests are valid. The $k$ -th landscape function is the $k$ -th largest value of the tent functions centered at diagram points
Persistence images (Adams et al., 2017): weight diagram points by a function, convolve with Gaussian, discretize on a grid. Stable and differentiable
Persistence statistics: summarize by total persistence, entropy, polynomial features
Kernel methods: persistence weighted Gaussian kernel, sliced Wasserstein kernel, persistence Fisher kernel

Software

Ripser

Ripser (Bauer, 2021): the fastest software for computing Vietoris-Rips persistent homology.

Key algorithmic innovations:

Implicit representation: never explicitly stores the full boundary matrix. Computes matrix entries on-the-fly from the distance matrix
Apparent pairs: identifies persistence pairs directly from cofacet/facet relationships without reduction (handles the majority of pairs)
Clearing optimization: zeros out columns corresponding to identified birth simplices
Cohomology: computes persistent cohomology (dual of homology), which is more efficient for sparse data

Handles tens of thousands of points in minutes. Implementations: C++ (original), Ripser.py (Python bindings), Ripser++ (GPU-accelerated), Cubical Ripser (for gridded data).

GUDHI

GUDHI (Geometry Understanding in Higher Dimensions): comprehensive C++/Python library for TDA.

Features:

Multiple complex constructions: Rips, Cech, Alpha, witness, tangential, cubical
Persistent homology with various coefficient fields
Bottleneck and Wasserstein distances
Persistence representations (landscapes, images)
Nerve and graph-induced complexes
Topological inference and geometric tools

Other Tools

Dionysus2: C++/Python library by Dmitriy Morozov. Supports zigzag persistence, vineyards (tracking diagrams as data changes)
PHAT: optimized boundary matrix reduction (various strategies: standard, twist, chunk, spectral sequence)
Giotto-tda: scikit-learn compatible TDA library. Integrates persistence into ML pipelines
TDA (R package): R interface to multiple backends (GUDHI, Dionysus, PHAT)
Eirene: Julia library for persistent homology
scikit-tda: Python ecosystem unifying Ripser, Kepler Mapper, persim (persistence diagram utilities)

Applications

Shape Analysis

Persistent homology captures shape features (holes, tunnels, voids) across scales. Used in:

3D shape retrieval and classification
Protein structure analysis (alpha complexes of atomic coordinates)
Material science (pore structure characterization)

Time Series and Signals

Takens' embedding: embed a time series $x(t)$ into a point cloud via delay coordinates $(x(t), x(t+\tau), \ldots, x(t+(d-1)\tau))$ . Persistent homology of this point cloud detects:

Periodicity ( $H_1$ features with long persistence)
Quasiperiodicity (torus structure, $H_1$ with rank 2)
Chaos (higher complexity)

Machine Learning Integration

Topological features for classification: compute persistence diagrams, vectorize, feed to classifiers (SVM, random forest, neural network)
Topological loss functions: penalize or encourage topological features during training. Used in image segmentation to enforce correct topology (connected components, no spurious holes)
Topological autoencoders: regularize latent spaces to preserve topological features of the data
Persistent homology of neural network loss landscapes: characterize the topology of sublevel sets of the loss function