Topological Data Analysis
Overview
Topological data analysis (TDA) applies algebraic topology to study the "shape" of data. The central insight: topological features (connected components, loops, voids) captured at multiple scales reveal structural patterns invisible to traditional statistical methods. TDA is coordinate-free, invariant to continuous deformations, and robust to noise.
Simplicial Complexes
A simplicial complex is a collection of simplices (vertices, edges, triangles, tetrahedra, ...) closed under taking faces: if and , then .
A -simplex is the convex hull of affinely independent points. Its faces are simplices formed by subsets of its vertices.
Vietoris-Rips Complex
Given a point cloud and parameter :
A simplex is included iff all pairwise distances are at most . Equivalently, a subset forms a simplex iff the set of -balls centered at its points has a common intersection (clique complex of the -neighborhood graph).
Properties:
- Determined entirely by pairwise distances (metric data suffices)
- Flag complex: determined by its 1-skeleton (edge graph). This is computationally advantageous
- Can contain simplices of dimension even for points in
- Expensive to compute/store: up to simplices in the worst case. In practice, impose a maximum dimension
Relationship to Cech: (by the nerve lemma and Jung's theorem). They yield the same persistent homology at scale .
Cech Complex
A simplex is included iff the balls of radius centered at its vertices have a common intersection. By the nerve lemma, the Cech complex has the same homotopy type as the union of balls .
More geometrically faithful than Rips but computationally harder (requires checking higher-order intersections, not just pairwise distances). Miniball algorithms determine simplex inclusion.
Alpha Complex
The alpha complex is the subcomplex of the Delaunay triangulation consisting of simplices whose circumradius is at most :
Properties:
- Homotopy equivalent to the union of balls (same as Cech), by the nerve lemma applied to the Voronoi-restricted balls
- Much smaller than the Rips or Cech complex: at most simplices (the size of the Delaunay triangulation)
- Efficient for low-dimensional data ()
- Embedded in ambient space (simplices are geometric, not abstract)
- For : at most simplices. For : at most
Weighted alpha complexes: use weighted Delaunay (regular) triangulations. Accommodate points with different "radii" (e.g., atoms with van der Waals radii in molecular biology).
Cubical Complexes
For gridded data (images, volumes), cubical complexes are more natural than simplicial ones. Elementary cubes (pixels/voxels and their faces) form the cells. Persistent homology on cubical complexes avoids triangulation entirely.
Efficient for image analysis, medical imaging (CT/MRI), and volumetric data.
Persistent Homology
Homology Groups (Brief Review)
The -th homology group of a simplicial complex captures -dimensional "holes":
- : connected components
- : loops (1-cycles not bounding a 2-chain)
- : voids (2-cycles not bounding a 3-chain)
The rank of is the -th Betti number : the number of independent -dimensional holes.
Homology is computed over a coefficient field (typically for simplicity, or for orientability information). With coefficients, chains are formal sums of simplices mod 2, and the boundary operator maps -chains to -chains.
Filtrations
A filtration is a nested sequence of simplicial complexes:
Common filtrations:
- Rips filtration: for increasing
- Alpha filtration:
- Sublevel set filtration: for a function , the sublevel sets form a filtration as increases. Used for scalar field topology
Birth, Death, and Persistence
As the filtration parameter increases, topological features appear (birth) and disappear (death):
- A new connected component appears when a vertex is added (birth of 0-cycle)
- Two components merge when an edge connects them (death of younger component --- elder rule)
- A loop appears when an edge closes a cycle (birth of 1-cycle)
- A loop is filled when a triangle (2-simplex) is added (death of 1-cycle)
The persistence of a feature is . Long-lived features represent genuine topological structure; short-lived features are noise.
Persistence Diagrams
A persistence diagram is a multiset of points in the plane, where each point represents a -dimensional homological feature with birth and death .
- Points near the diagonal () represent noise
- Points far from the diagonal represent significant features
- Features that never die are plotted at (essential classes)
- The diagram lies above the diagonal ()
Barcodes
An equivalent representation: each feature is a horizontal interval (a "bar"). The collection of bars is the barcode. Barcodes are often more visually intuitive than diagrams.
Computation (The Standard Algorithm)
Matrix reduction: represent the filtration boundary operator as a matrix and reduce it (column operations) to identify persistent pairs.
Given the boundary matrix (columns ordered by filtration time), reduce it by adding columns left-to-right (analogous to column echelon form over ):
for j = 1 to m:
while exists j' < j with low(j') = low(j):
D[:, j] += D[:, j'] (mod 2)
where is the row index of the lowest nonzero entry in column . A persistence pair is : the simplex at index creates the feature, the simplex at index destroys it.
Complexity: worst case for simplices, but via matrix multiplication speedups. In practice, much faster due to sparsity (often nearly linear).
Optimizations:
- Clearing optimization: after identifying a pair, clear the corresponding column (set to zero). Reduces work by ~50%
- Twist optimization: combine clearing with cohomology-based persistence (Bauer)
- Chunk reduction: parallelize by processing independent blocks
- Apparent and emergent pairs: identify trivially paired simplices before reduction
Stability Theorem
Bottleneck stability (Cohen-Steiner, Edelsbrunner, Harer, 2007): if are two filtration functions:
where is the bottleneck distance between persistence diagrams (the infimum over all matchings of the maximum matched distance, with unmatched points paired to the diagonal).
This guarantees that small perturbations of the data produce small changes in the persistence diagram. Persistence diagrams are robust topological summaries.
Wasserstein stability: the -Wasserstein distance also satisfies stability bounds, providing finer control for applications requiring soft matching.
Betti Numbers
The -th Betti number counts the number of -dimensional holes:
- : number of connected components
- : number of independent loops/tunnels
- : number of enclosed voids/cavities
Euler characteristic: where is the number of -simplices. Computable in linear time (just count simplices), but coarser than full homology.
Persistent Betti numbers: counts the number of -features born at or before that are still alive at . The persistence diagram encodes the complete persistent Betti number function.
Mapper Algorithm
Mapper (Singh, Memoli, Carlsson, 2007): a topological network summarization tool for high-dimensional data.
Algorithm:
- Choose a filter function (or ) --- a lens to view the data through (e.g., density, PCA coordinate, eccentricity)
- Cover the range of with overlapping intervals
- For each interval , cluster the preimage using any clustering algorithm
- Build a nerve: create a node for each cluster, and an edge between nodes whose clusters share data points (from overlapping intervals)
Output: a graph (or simplicial complex) summarizing the topological structure of the data.
Parameter choices:
- Filter function: problem-dependent. Density (DBSCAN), Gaussian density, PCA coordinates, graph Laplacian coordinates, L-infinity centrality
- Number of intervals and overlap percentage (typically 20-50% overlap)
- Clustering algorithm and its parameters
Applications:
- Breast cancer subtype identification (Nicolau, Levine, Carlsson, 2011): Mapper revealed a new subgroup with 100% survival
- Diabetes patient stratification
- Gene expression analysis
- NBA player analysis (playing style topology)
Kepler Mapper: standard Python implementation. Interactive visualization.
Vectorization of Persistence Diagrams
Persistence diagrams are not directly usable in ML (they are multisets, not vectors). Vectorization methods:
- Persistence landscapes (Bubenik, 2015): transform diagrams into functions in a Banach space. Statistical tests are valid. The -th landscape function is the -th largest value of the tent functions centered at diagram points
- Persistence images (Adams et al., 2017): weight diagram points by a function, convolve with Gaussian, discretize on a grid. Stable and differentiable
- Persistence statistics: summarize by total persistence, entropy, polynomial features
- Kernel methods: persistence weighted Gaussian kernel, sliced Wasserstein kernel, persistence Fisher kernel
Software
Ripser
Ripser (Bauer, 2021): the fastest software for computing Vietoris-Rips persistent homology.
Key algorithmic innovations:
- Implicit representation: never explicitly stores the full boundary matrix. Computes matrix entries on-the-fly from the distance matrix
- Apparent pairs: identifies persistence pairs directly from cofacet/facet relationships without reduction (handles the majority of pairs)
- Clearing optimization: zeros out columns corresponding to identified birth simplices
- Cohomology: computes persistent cohomology (dual of homology), which is more efficient for sparse data
Handles tens of thousands of points in minutes. Implementations: C++ (original), Ripser.py (Python bindings), Ripser++ (GPU-accelerated), Cubical Ripser (for gridded data).
GUDHI
GUDHI (Geometry Understanding in Higher Dimensions): comprehensive C++/Python library for TDA.
Features:
- Multiple complex constructions: Rips, Cech, Alpha, witness, tangential, cubical
- Persistent homology with various coefficient fields
- Bottleneck and Wasserstein distances
- Persistence representations (landscapes, images)
- Nerve and graph-induced complexes
- Topological inference and geometric tools
Other Tools
- Dionysus2: C++/Python library by Dmitriy Morozov. Supports zigzag persistence, vineyards (tracking diagrams as data changes)
- PHAT: optimized boundary matrix reduction (various strategies: standard, twist, chunk, spectral sequence)
- Giotto-tda: scikit-learn compatible TDA library. Integrates persistence into ML pipelines
- TDA (R package): R interface to multiple backends (GUDHI, Dionysus, PHAT)
- Eirene: Julia library for persistent homology
- scikit-tda: Python ecosystem unifying Ripser, Kepler Mapper, persim (persistence diagram utilities)
Applications
Shape Analysis
Persistent homology captures shape features (holes, tunnels, voids) across scales. Used in:
- 3D shape retrieval and classification
- Protein structure analysis (alpha complexes of atomic coordinates)
- Material science (pore structure characterization)
Time Series and Signals
Takens' embedding: embed a time series into a point cloud via delay coordinates . Persistent homology of this point cloud detects:
- Periodicity ( features with long persistence)
- Quasiperiodicity (torus structure, with rank 2)
- Chaos (higher complexity)
Machine Learning Integration
- Topological features for classification: compute persistence diagrams, vectorize, feed to classifiers (SVM, random forest, neural network)
- Topological loss functions: penalize or encourage topological features during training. Used in image segmentation to enforce correct topology (connected components, no spurious holes)
- Topological autoencoders: regularize latent spaces to preserve topological features of the data
- Persistent homology of neural network loss landscapes: characterize the topology of sublevel sets of the loss function