Unsupervised Learning
Clustering
K-Means
Partition n points into K clusters by minimizing within-cluster sum of squares:
J = sum_{k=1}^{K} sum_{x_i in C_k} ||x_i - mu_k||^2
Lloyd's Algorithm:
```python
import numpy as np
from scipy.spatial.distance import cdist

def kmeans(X, K, max_iter=100):
    # Initialize centroids by sampling K distinct data points
    centroids = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        distances = cdist(X, centroids)
        labels = np.argmin(distances, axis=1)
        # Update each centroid to the mean of its assigned points
        # (a cluster that loses all its points would produce NaN here)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return labels, centroids
```
Complexity: O(n * K * d * iterations). Converges to a local minimum, not necessarily the global one.
K-Means++ Initialization
Choose initial centroids to be spread out:
- Pick first centroid uniformly at random
- For each subsequent centroid, pick point x with probability proportional to D(x)^2, where D(x) is the distance to the nearest existing centroid
- Repeat until K centroids are chosen
Guarantees an O(log K)-competitive approximation to the optimal clustering, in expectation.
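The selection procedure above can be sketched in a few lines of numpy (the function name `kmeans_pp_init` and the fixed seed are illustrative choices):

```python
import numpy as np

def kmeans_pp_init(X, K, rng=np.random.default_rng(0)):
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # D(x)^2: squared distance to the nearest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(-1), axis=1)
        # Sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```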
Choosing K
- Elbow method: plot J vs K, look for the "elbow"
- Silhouette score: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a = mean intra-cluster distance, b = mean nearest-cluster distance. Range [-1, 1]; higher is better.
- Gap statistic: compare log(J_K) with expected value under null reference distribution
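The silhouette score from the list above can be computed directly; a numpy/scipy sketch, assuming at least two clusters and every cluster containing at least two points (the helper name `silhouette_scores` is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def silhouette_scores(X, labels):
    D = cdist(X, X)
    n = len(X)
    scores = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        # a(i): mean distance to the other points in i's own cluster
        a = D[i, same & (np.arange(n) != i)].mean()
        # b(i): mean distance to the nearest other cluster
        b = min(D[i, labels == k].mean() for k in set(labels) if k != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores
```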
Hierarchical Clustering
Build a tree (dendrogram) of nested clusters.
Agglomerative (bottom-up):
- Start with each point as its own cluster
- Repeatedly merge the two closest clusters
- Stop at desired number of clusters or distance threshold
Linkage criteria:
| Linkage  | Distance Between Clusters                 | Properties                         |
|----------|-------------------------------------------|------------------------------------|
| Single   | min distance between any two points       | Finds elongated clusters, chaining |
| Complete | max distance between any two points       | Compact clusters                   |
| Average  | mean pairwise distance                    | Balanced                           |
| Ward     | increase in total within-cluster variance | Minimizes variance, like K-means   |
Complexity: O(n^3) naive, O(n^2 log n) with efficient data structures.
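In practice agglomerative clustering is rarely hand-rolled; a minimal sketch with scipy's hierarchy module, assuming two well-separated Gaussian blobs as input:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated blobs (illustrative data)
X = np.vstack([np.random.default_rng(0).normal(0, 0.5, (10, 2)),
               np.random.default_rng(1).normal(5, 0.5, (10, 2))])
# Ward linkage: merge the pair whose union least increases within-cluster variance
Z = linkage(X, method="ward")
# Cut the dendrogram into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```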
DBSCAN
Density-Based Spatial Clustering of Applications with Noise.
Parameters: epsilon (neighborhood radius), minPts (minimum points for core point).
```python
import numpy as np
from collections import deque

UNDEFINED, NOISE = -2, -1

def region_query(X, i, eps):
    # Indices of all points within eps of X[i] (including i itself)
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), UNDEFINED)
    cluster_id = 0
    for i in range(len(X)):
        if labels[i] != UNDEFINED:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
        else:
            expand_cluster(X, labels, i, neighbors, cluster_id, eps, min_pts)
            cluster_id += 1
    return labels

def expand_cluster(X, labels, i, neighbors, cluster_id, eps, min_pts):
    labels[i] = cluster_id
    queue = deque(neighbors)
    while queue:
        j = queue.popleft()
        if labels[j] == NOISE:
            labels[j] = cluster_id  # noise reachable from a core point: border point
        if labels[j] != UNDEFINED:
            continue
        labels[j] = cluster_id
        j_neighbors = region_query(X, j, eps)
        if len(j_neighbors) >= min_pts:  # j is itself a core point: keep expanding
            queue.extend(j_neighbors)
```
Advantages: arbitrary cluster shapes, detects outliers, no need to specify K. Disadvantages: sensitive to epsilon/minPts, struggles with varying densities.
Spectral Clustering
Use the graph Laplacian to embed data, then cluster in the embedding.
- Build similarity graph: W_ij = exp(-||x_i - x_j||^2 / (2*sigma^2))
- Compute graph Laplacian: L = D - W (unnormalized) or L_sym = I - D^{-1/2} W D^{-1/2}
- Find the eigenvectors corresponding to the K smallest eigenvalues of L (excluding the trivial one for L_sym)
- Run K-means on the rows of the eigenvector matrix
Captures non-convex cluster structures that K-means misses.
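The four steps can be sketched with numpy and scipy; the RBF bandwidth `sigma` and the helper name `spectral_cluster` are illustrative, and `kmeans2` stands in for the final K-means step:

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import cdist

def spectral_cluster(X, K, sigma=1.0, seed=0):
    # Similarity graph with a Gaussian (RBF) kernel
    W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    # Normalized Laplacian: L_sym = I - D^{-1/2} W D^{-1/2}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # eigh returns eigenvalues in ascending order: take the K smallest
    _, vecs = np.linalg.eigh(L_sym)
    U = vecs[:, :K]
    # Row-normalize the embedding, then run K-means on the rows
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    _, labels = kmeans2(U, K, seed=seed, minit="++")
    return labels
```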
Dimensionality Reduction
Principal Component Analysis (PCA)
Find orthogonal directions of maximum variance.
Maximize w^T Sigma w subject to ||w|| = 1
Where: Sigma = (1/n) X_centered^T X_centered (covariance matrix)
Solution: eigenvectors of Sigma, sorted by decreasing eigenvalue.
```python
import numpy as np

def pca(X, n_components):
    # Center the data, then form the covariance matrix
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / len(X)
    # eigh returns eigenvalues of a symmetric matrix in ascending order
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Sort by decreasing eigenvalue and keep the top n_components
    idx = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, idx]
    return X_centered @ components, eigenvalues[idx]
```
Explained variance ratio: lambda_k / sum(lambda_i). Choose n_components to capture e.g. 95% of variance.
Computational shortcut: for n << d, use the SVD X = U S V^T. Principal components are the columns of V; projections are U S (and eigenvalues relate to singular values via lambda_k = s_k^2 / n).
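A minimal sketch of the SVD route (here `S` denotes the singular values, to avoid clashing with the covariance matrix Sigma; the function name is illustrative):

```python
import numpy as np

def pca_svd(X, n_components):
    # Economy SVD of the centered data: X_centered = U S V^T
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Principal components are rows of V^T; projections are U S (= X_centered @ V)
    projections = U[:, :n_components] * S[:n_components]
    # Covariance eigenvalues from singular values: lambda = s^2 / n
    explained = S[:n_components] ** 2 / len(X)
    return projections, explained
```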
t-SNE (t-distributed Stochastic Neighbor Embedding)
Non-linear method for visualization (typically 2D/3D).
- Compute pairwise affinities in high-dim: p_{j|i} = exp(-||x_i - x_j||^2 / (2*sigma_i^2)) / sum_{k != i} exp(-||x_i - x_k||^2 / (2*sigma_i^2))
- Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2n)
- In low-dim, use a Student-t distribution: q_ij = (1 + ||y_i - y_j||^2)^{-1} / sum_{k != l} (1 + ||y_k - y_l||^2)^{-1}
- Minimize KL(P || Q) via gradient descent
The heavy-tailed t-distribution in low-dim alleviates the crowding problem.
Perplexity: controls the effective number of neighbors (typically ~5-50). Defined as 2^{H(P_i)}, where H(P_i) is the Shannon entropy (in bits) of the conditional distribution P_i.
Limitations: non-convex (different runs give different results), does not preserve global structure well, O(n^2) naive.
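In practice, each sigma_i is found by binary search so that the perplexity of P_i matches the user-chosen target; a minimal numpy sketch of the perplexity computation itself (helper name is illustrative):

```python
import numpy as np

def perplexity(dists_sq, sigma):
    # Conditional distribution p_{j|i} for one point, given its squared
    # distances to all other points and a bandwidth sigma
    p = np.exp(-dists_sq / (2 * sigma ** 2))
    p /= p.sum()
    # Shannon entropy in bits, then perplexity = 2^H
    H = -(p * np.log2(p)).sum()
    return 2 ** H
```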
UMAP (Uniform Manifold Approximation and Projection)
Based on topological data analysis and Riemannian geometry.
- Build weighted k-nearest neighbor graph in high-dim
- Optimize low-dim embedding to preserve the topological structure
- Uses cross-entropy loss instead of KL divergence
Advantages over t-SNE: faster, better global structure preservation, supports embedding new points (transform), works for general dimensionality reduction (not just 2D).
Key parameters: n_neighbors (local vs global), min_dist (clump tightness).
Independent Component Analysis (ICA)
Find statistically independent components (cocktail party problem).
Model: x = A * s, where s has independent non-Gaussian components.
Goal: find unmixing matrix W = A^{-1} such that s = Wx.
Algorithm (FastICA): maximize non-Gaussianity of projections using negentropy approximation.
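A compact deflation-based FastICA sketch using the tanh nonlinearity (the function name, iteration count, and seed are illustrative choices, not scikit-learn's API; data are whitened first so that components can be found one at a time):

```python
import numpy as np

def fastica(X, n_components, n_iter=200, seed=0):
    # X: (n_samples, n_features) of mixed signals. Whiten to identity covariance.
    X = X - X.mean(axis=0)
    d, E = np.linalg.eigh(X.T @ X / len(X))
    X_white = X @ E @ np.diag(1.0 / np.sqrt(d)) @ E.T
    rng = np.random.default_rng(seed)
    W = np.zeros((n_components, X.shape[1]))
    for k in range(n_components):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            wx = X_white @ w
            g, g_prime = np.tanh(wx), 1 - np.tanh(wx) ** 2
            # Fixed-point update maximizing non-Gaussianity (negentropy proxy)
            w_new = (X_white * g[:, None]).mean(axis=0) - g_prime.mean() * w
            # Deflation: stay orthogonal to previously extracted components
            w_new -= W[:k].T @ (W[:k] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1) < 1e-9
            w = w_new
            if converged:
                break
        W[k] = w
    return X_white @ W.T  # recovered sources (up to sign and scale)
```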
Non-Negative Matrix Factorization (NMF)
Factor V (n x d) into W (n x K) * H (K x d) with all entries >= 0.
Minimize: ||V - WH||_F^2 (or KL divergence)
Subject to: W >= 0, H >= 0
Multiplicative update rules (multiplication and division are element-wise):
H <- H * (W^T V) / (W^T W H)
W <- W * (V H^T) / (W H H^T)
Produces parts-based decomposition (e.g., facial features, topic words). Interpretable because components are additive.
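The multiplicative updates fit in a few lines of numpy; a minimal sketch (the small `eps` guarding against division by zero is an implementation detail, not part of the update rules):

```python
import numpy as np

def nmf(V, K, n_iter=500, seed=0, eps=1e-10):
    rng = np.random.default_rng(seed)
    n, d = V.shape
    # Random non-negative initialization
    W = rng.random((n, K))
    H = rng.random((K, d))
    for _ in range(n_iter):
        # Element-wise multiplicative updates keep W, H non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```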
Anomaly Detection
Isolation Forest
Idea: anomalies are easier to isolate (fewer splits needed).
- Build random trees by choosing random feature and random split value
- Anomaly score based on average path length across trees:
s(x, n) = 2^{-E[h(x)] / c(n)}
where h(x) = path length to isolate x, and c(n) = average path length of an unsuccessful search in a binary search tree of n points (a normalizer). Score near 1 = anomaly; near 0.5 = normal.
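The normalizer c(n) and the score can be sketched directly, using the harmonic-number approximation H(i) ~ ln(i) + Euler-Mascheroni constant:

```python
import numpy as np

def c(n):
    # Average unsuccessful-search path length in a BST of n points:
    # c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) ~ ln(i) + euler_gamma
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + np.euler_gamma
    return 2 * harmonic - 2 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # s = 2^{-E[h(x)] / c(n)}: near 1 for short paths (anomaly), 0.5 for average
    return 2 ** (-avg_path_length / c(n))
```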
Local Outlier Factor (LOF)
Compares local density of a point to its neighbors:
LOF(x) = mean_{o in kNN(x)} [ lrd(o) / lrd(x) ]
where lrd(x) = local reachability density = 1 / mean_{o in kNN(x)} reach_dist(x, o), with reach_dist(x, o) = max(k_dist(o), d(x, o)).
- LOF ~ 1: similar density to neighbors (inlier)
- LOF >> 1: much lower density than neighbors (outlier)
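A vectorized numpy/scipy sketch of LOF (the helper name and default k are illustrative; assumes distinct points so each point's nearest "neighbor" at distance 0 is itself):

```python
import numpy as np
from scipy.spatial.distance import cdist

def lof(X, k=3):
    D = cdist(X, X)
    n = len(X)
    # k nearest neighbors of each point (column 0 is the point itself)
    knn = np.argsort(D, axis=1)[:, 1:k + 1]
    k_dist = D[np.arange(n), knn[:, -1]]  # distance to the k-th neighbor
    # reach_dist(x, o) = max(k_dist(o), d(x, o)), for each neighbor o of x
    reach = np.maximum(k_dist[knn], D[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)  # local reachability density
    # LOF: mean ratio of the neighbors' lrd to the point's own lrd
    return lrd[knn].mean(axis=1) / lrd
```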
Method Selection Guide
| Task                    | Method                       |
|-------------------------|------------------------------|
| Compact clusters        | K-means, GMM                 |
| Arbitrary shapes        | DBSCAN, spectral clustering  |
| Hierarchical structure  | Agglomerative clustering     |
| Linear dim reduction    | PCA                          |
| Visualization           | t-SNE, UMAP                  |
| Parts-based decomp      | NMF                          |
| Signal separation       | ICA                          |
| Outlier detection       | Isolation forest, LOF        |