Unsupervised Learning

Clustering

K-Means

Partition n points into K clusters by minimizing within-cluster sum of squares:

J = sum_{k=1}^{K} sum_{x_i in C_k} ||x_i - mu_k||^2

Lloyd's Algorithm:

def kmeans(X, K, max_iter=100):
    # Initialize centroids
    centroids = X[np.random.choice(len(X), K, replace=False)]

    for _ in range(max_iter):
        # Assign each point to nearest centroid
        distances = cdist(X, centroids)
        labels = np.argmin(distances, axis=1)

        # Update centroids
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return labels, centroids

Complexity: O(n * K * d * iterations). Converges to local minimum.

K-Means++ Initialization

Choose initial centroids to be spread out:

Pick first centroid uniformly at random
For each subsequent centroid, pick point x with probability proportional to D(x)^2, where D(x) is the distance to the nearest existing centroid
Repeat until K centroids are chosen

Guarantees O(log K)-competitive approximation to optimal clustering.

Choosing K

Elbow method: plot J vs K, look for the "elbow"
Silhouette score: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a = mean intra-cluster distance, b = mean nearest-cluster distance. Range [-1, 1]; higher is better.
Gap statistic: compare log(J_K) with expected value under null reference distribution

Hierarchical Clustering

Build a tree (dendrogram) of nested clusters.

Agglomerative (bottom-up):

Start with each point as its own cluster
Repeatedly merge the two closest clusters
Stop at desired number of clusters or distance threshold

Linkage criteria:

Linkage	Distance Between Clusters	Properties
Single	min distance between any two points	Finds elongated clusters, chaining
Complete	max distance between any two points	Compact clusters
Average	mean pairwise distance	Balanced
Ward	increase in total within-cluster var	Minimizes variance, like K-means

Complexity: O(n^3) naive, O(n^2 log n) with efficient data structures.

DBSCAN

Density-Based Spatial Clustering of Applications with Noise.

Parameters: epsilon (neighborhood radius), minPts (minimum points for core point).

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), UNDEFINED)
    cluster_id = 0

    for i in range(len(X)):
        if labels[i] != UNDEFINED:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
        else:
            expand_cluster(X, labels, i, neighbors, cluster_id, eps, min_pts)
            cluster_id += 1
    return labels

def expand_cluster(X, labels, i, neighbors, cluster_id, eps, min_pts):
    labels[i] = cluster_id
    queue = list(neighbors)
    while queue:
        j = queue.pop(0)
        if labels[j] == NOISE:
            labels[j] = cluster_id
        if labels[j] != UNDEFINED:
            continue
        labels[j] = cluster_id
        j_neighbors = region_query(X, j, eps)
        if len(j_neighbors) >= min_pts:
            queue.extend(j_neighbors)

Advantages: arbitrary cluster shapes, detects outliers, no need to specify K. Disadvantages: sensitive to epsilon/minPts, struggles with varying densities.

Spectral Clustering

Use the graph Laplacian to embed data, then cluster in the embedding.

Build similarity graph: W_ij = exp(-||x_i - x_j||^2 / (2*sigma^2))
Compute graph Laplacian: L = D - W (unnormalized) or L_sym = I - D^{-1/2} W D^{-1/2}
Find the K smallest eigenvectors of L (excluding the trivial one for L_sym)
Run K-means on the rows of the eigenvector matrix

Captures non-convex cluster structures that K-means misses.

Dimensionality Reduction

Principal Component Analysis (PCA)

Find orthogonal directions of maximum variance.

Maximize: ||w||=1: w^T Sigma w
Where: Sigma = (1/n) X_centered^T X_centered  (covariance matrix)

Solution: eigenvectors of Sigma sorted by eigenvalue magnitude.

def pca(X, n_components):
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / len(X)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort by decreasing eigenvalue
    idx = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, idx]

    return X_centered @ components, eigenvalues[idx]

Explained variance ratio: lambda_k / sum(lambda_i). Choose n_components to capture e.g. 95% of variance.

Computational shortcut: for n << d, use SVD: X = U Sigma V^T. Principal components are columns of V; projections are U Sigma.

t-SNE (t-distributed Stochastic Neighbor Embedding)

Non-linear method for visualization (typically 2D/3D).

Compute pairwise affinities in high-dim: p_{j|i} = exp(-||x_i-x_j||^2 / 2*sigma_i^2) / sum_k...
Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / 2n
In low-dim, use Student-t distribution: q_ij = (1 + ||y_i - y_j||^2)^{-1} / sum_k...
Minimize KL(P || Q) via gradient descent

The heavy-tailed t-distribution in low-dim alleviates the crowding problem.

Perplexity: controls effective number of neighbors (~5-50). Roughly: 2^{H(P_i)}.

Limitations: non-convex (different runs give different results), does not preserve global structure well, O(n^2) naive.

UMAP (Uniform Manifold Approximation and Projection)

Based on topological data analysis and Riemannian geometry.

Build weighted k-nearest neighbor graph in high-dim
Optimize low-dim embedding to preserve the topological structure
Uses cross-entropy loss instead of KL divergence

Advantages over t-SNE: faster, better global structure preservation, supports embedding new points (transform), works for general dimensionality reduction (not just 2D).

Key parameters: n_neighbors (local vs global), min_dist (clump tightness).

Independent Component Analysis (ICA)

Find statistically independent components (cocktail party problem).

Model: x = A * s, where s has independent non-Gaussian components.

Goal: find unmixing matrix W = A^{-1} such that s = Wx.

Algorithm (FastICA): maximize non-Gaussianity of projections using negentropy approximation.

Non-Negative Matrix Factorization (NMF)

Factor V (n x d) into W (n x K) * H (K x d) with all entries >= 0.

Minimize: ||V - WH||_F^2   (or KL divergence)
Subject to: W >= 0, H >= 0

Multiplicative update rules:

H <- H * (W^T V) / (W^T W H)
W <- W * (V H^T) / (W H H^T)

Produces parts-based decomposition (e.g., facial features, topic words). Interpretable because components are additive.

Anomaly Detection

Isolation Forest

Idea: anomalies are easier to isolate (fewer splits needed).

Build random trees by choosing random feature and random split value
Anomaly score based on average path length across trees:

s(x, n) = 2^{-E[h(x)] / c(n)}

where h(x) = path length, c(n) = average path length in a BST of n nodes. Score near 1 = anomaly; near 0.5 = normal.

Local Outlier Factor (LOF)

Compares local density of a point to its neighbors:

LOF(x) = mean_{o in kNN(x)} [ lrd(o) / lrd(x) ]

where lrd(x) = local reachability density = 1 / mean(reach_dist(x, o)).

LOF ~ 1: similar density to neighbors (inlier)
LOF >> 1: much lower density than neighbors (outlier)

Method Selection Guide

Task	Method
Compact clusters	K-means, GMM
Arbitrary shapes	DBSCAN, spectral clustering
Hierarchical structure	Agglomerative clustering
Linear dim reduction	PCA
Visualization	t-SNE, UMAP
Parts-based decomp	NMF
Signal separation	ICA
Outlier detection	Isolation forest, LOF