
Unsupervised Learning

Clustering

K-Means

Partition n points into K clusters by minimizing within-cluster sum of squares:

J = sum_{k=1}^{K} sum_{x_i in C_k} ||x_i - mu_k||^2

Lloyd's Algorithm:

import numpy as np
from scipy.spatial.distance import cdist

def kmeans(X, K, max_iter=100):
    # Initialize centroids by sampling K distinct data points
    centroids = X[np.random.choice(len(X), K, replace=False)]

    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        distances = cdist(X, centroids)
        labels = np.argmin(distances, axis=1)

        # Update each centroid to the mean of its assigned points;
        # keep the old centroid if a cluster becomes empty
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])

        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids

    return labels, centroids

Complexity: O(n * K * d * iterations). Converges to local minimum.

K-Means++ Initialization

Choose initial centroids to be spread out:

  1. Pick first centroid uniformly at random
  2. For each subsequent centroid, pick point x with probability proportional to D(x)^2, where D(x) is the distance to the nearest existing centroid
  3. Repeat until K centroids are chosen

Guarantees an O(log K)-approximation to the optimal clustering in expectation.
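
The seeding steps above can be sketched in NumPy (a minimal version; the function name and `rng` handling are my own):

```python
import numpy as np

def kmeans_pp_init(X, K, rng=None):
    # Step 1: pick the first centroid uniformly at random
    rng = np.random.default_rng(rng)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # D(x)^2: squared distance to the nearest existing centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1).min(axis=1)
        # Step 2: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

Points already chosen have D(x) = 0, so they cannot be picked twice.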

Choosing K

  • Elbow method: plot J vs K, look for the "elbow"
  • Silhouette score: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a = mean intra-cluster distance, b = mean nearest-cluster distance. Range [-1, 1]; higher is better.
  • Gap statistic: compare log(J_K) with expected value under null reference distribution
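
The silhouette score, for instance, can be computed directly from its definition (an O(n^2) sketch; the function name is my own, and singleton clusters get s(i) = 0 by convention):

```python
import numpy as np

def silhouette(X, labels):
    # Pairwise distance matrix
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            scores.append(0.0)  # singleton cluster: s(i) = 0
            continue
        a = D[i, same].mean()  # a(i): mean intra-cluster distance
        # b(i): mean distance to the nearest other cluster
        b = min(D[i, labels == k].mean() for k in set(labels) - {labels[i]})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```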

Hierarchical Clustering

Build a tree (dendrogram) of nested clusters.

Agglomerative (bottom-up):

  1. Start with each point as its own cluster
  2. Repeatedly merge the two closest clusters
  3. Stop at desired number of clusters or distance threshold

Linkage criteria:

| Linkage  | Distance Between Clusters                 | Properties                                  |
|----------|-------------------------------------------|---------------------------------------------|
| Single   | min distance between any two points       | Finds elongated clusters, prone to chaining |
| Complete | max distance between any two points       | Compact clusters                            |
| Average  | mean pairwise distance                    | Balanced                                    |
| Ward     | increase in total within-cluster variance | Minimizes variance, like K-means            |

Complexity: O(n^3) naive, O(n^2 log n) with efficient data structures.
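
As a concrete usage sketch, SciPy implements agglomerative clustering (here with Ward linkage on made-up toy data):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well-separated blobs of toy points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

Z = linkage(X, method="ward")  # (n-1) x 4 merge history (the dendrogram)
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```

`Z` records which clusters merged at which distance; `scipy.cluster.hierarchy.dendrogram(Z)` plots the tree.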

DBSCAN

Density-Based Spatial Clustering of Applications with Noise.

Parameters: epsilon (neighborhood radius), minPts (minimum points for core point).

from collections import deque

import numpy as np

UNDEFINED, NOISE = -2, -1

def region_query(X, i, eps):
    # Indices of all points within eps of point i (including i itself)
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    labels = np.full(len(X), UNDEFINED)
    cluster_id = 0

    for i in range(len(X)):
        if labels[i] != UNDEFINED:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
        else:
            expand_cluster(X, labels, i, neighbors, cluster_id, eps, min_pts)
            cluster_id += 1
    return labels

def expand_cluster(X, labels, i, neighbors, cluster_id, eps, min_pts):
    labels[i] = cluster_id
    queue = deque(neighbors)
    while queue:
        j = queue.popleft()
        if labels[j] == NOISE:
            labels[j] = cluster_id  # noise reachable from a core point: border point
        if labels[j] != UNDEFINED:
            continue
        labels[j] = cluster_id
        j_neighbors = region_query(X, j, eps)
        if len(j_neighbors) >= min_pts:  # j is a core point: expand through it
            queue.extend(j_neighbors)

Advantages: arbitrary cluster shapes, detects outliers, no need to specify K. Disadvantages: sensitive to epsilon/minPts, struggles with varying densities.

Spectral Clustering

Use the graph Laplacian to embed data, then cluster in the embedding.

  1. Build similarity graph: W_ij = exp(-||x_i - x_j||^2 / (2*sigma^2))
  2. Compute graph Laplacian: L = D - W (unnormalized) or L_sym = I - D^{-1/2} W D^{-1/2}
  3. Find the eigenvectors corresponding to the K smallest eigenvalues of L (the trivial eigenvector at eigenvalue 0 is often excluded)
  4. Run K-means on the rows of the eigenvector matrix

Captures non-convex cluster structures that K-means misses.
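
A minimal sketch of steps 1-4 on toy data (my own construction; for K = 2, the sign of the second eigenvector, the Fiedler vector, substitutes for the final K-means step):

```python
import numpy as np

# Toy data: two well-separated groups
X = np.array([[0.0, 0.0], [0.0, 0.3], [0.3, 0.0],
              [8.0, 8.0], [8.0, 8.3], [8.3, 8.0]])

# Step 1: Gaussian similarity graph
sigma = 1.0
sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)
W = np.exp(-sq / (2 * sigma**2))
np.fill_diagonal(W, 0)

# Step 2: unnormalized Laplacian L = D - W
L = np.diag(W.sum(axis=1)) - W

# Step 3: eigenvectors for the smallest eigenvalues (eigh returns them in ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(L)

# Step 4 (K = 2 shortcut): the sign of the Fiedler vector separates the groups
labels = (eigenvectors[:, 1] > 0).astype(int)
```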

Dimensionality Reduction

Principal Component Analysis (PCA)

Find orthogonal directions of maximum variance.

Maximize: w^T Sigma w  subject to  ||w|| = 1
Where: Sigma = (1/n) X_centered^T X_centered  (covariance matrix)

Solution: eigenvectors of Sigma sorted by eigenvalue magnitude.

import numpy as np

def pca(X, n_components):
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / len(X)
    # eigh returns eigenvalues in ascending order for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # Sort by decreasing eigenvalue and keep the top n_components
    idx = np.argsort(eigenvalues)[::-1][:n_components]
    components = eigenvectors[:, idx]

    return X_centered @ components, eigenvalues[idx]

Explained variance ratio: lambda_k / sum(lambda_i). Choose n_components to capture e.g. 95% of variance.

Computational shortcut: for n << d, use SVD: X = U Sigma V^T. Principal components are columns of V; projections are U Sigma.
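
A quick numerical check of this equivalence on toy data (the two routes agree up to the sign of each component):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
Xc = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / len(X)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
top = eigenvectors[:, np.argsort(eigenvalues)[::-1][:2]]

# SVD route: principal components are columns of V, projections are U * S
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
proj_svd = U[:, :2] * S[:2]

# The projections agree up to a sign flip per component
assert np.allclose(np.abs(Xc @ top), np.abs(proj_svd))
```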

t-SNE (t-distributed Stochastic Neighbor Embedding)

Non-linear method for visualization (typically 2D/3D).

  1. Compute pairwise affinities in high dimension: p_{j|i} = exp(-||x_i - x_j||^2 / (2*sigma_i^2)) / sum_{k != i} exp(-||x_i - x_k||^2 / (2*sigma_i^2))
  2. Symmetrize: p_ij = (p_{j|i} + p_{i|j}) / (2n)
  3. In low dimension, use a Student-t kernel: q_ij = (1 + ||y_i - y_j||^2)^{-1} / sum_{k != l} (1 + ||y_k - y_l||^2)^{-1}
  4. Minimize KL(P || Q) via gradient descent

The heavy-tailed t-distribution in low-dim alleviates the crowding problem.

Perplexity: controls the effective number of neighbors (typically 5-50). Defined as 2^{H(P_i)}, where H(P_i) is the Shannon entropy (in bits) of the conditional distribution P_i.

Limitations: non-convex (different runs give different results), does not preserve global structure well, O(n^2) naive.
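
The perplexity definition can be computed directly from step 1 (a sketch; in a real implementation, sigma_i is found by binary search so that this value matches the user-specified perplexity):

```python
import numpy as np

def perplexity(X, i, sigma):
    # Conditional distribution p_{j|i} for point i at bandwidth sigma
    d2 = ((X - X[i]) ** 2).sum(axis=1)
    p = np.exp(-d2 / (2 * sigma**2))
    p[i] = 0.0           # exclude the point itself
    p /= p.sum()
    # Shannon entropy in bits, then perplexity = 2^H
    H = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return 2.0 ** H
```

For a point whose neighbors are all equidistant, P_i is uniform and the perplexity equals the number of neighbors exactly.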

UMAP (Uniform Manifold Approximation and Projection)

Based on topological data analysis and Riemannian geometry.

  1. Build weighted k-nearest neighbor graph in high-dim
  2. Optimize low-dim embedding to preserve the topological structure
  3. Uses cross-entropy loss instead of KL divergence

Advantages over t-SNE: faster, better global structure preservation, supports embedding new points (transform), works for general dimensionality reduction (not just 2D).

Key parameters: n_neighbors (local vs global), min_dist (clump tightness).

Independent Component Analysis (ICA)

Find statistically independent components (cocktail party problem).

Model: x = A * s, where s has independent non-Gaussian components.

Goal: find unmixing matrix W = A^{-1} such that s = Wx.

Algorithm (FastICA): maximize non-Gaussianity of projections using negentropy approximation.
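
A compact sketch of symmetric FastICA with the tanh nonlinearity (whitening via eigendecomposition; iteration count, seed, and all names are my choices):

```python
import numpy as np

def fastica(X, n_iter=200, seed=0):
    # X: (n_samples, n_signals) mixed observations
    # Whiten: zero mean, identity covariance
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    d, E = np.linalg.eigh(cov)
    Z = Xc @ E / np.sqrt(d)          # whitened data
    n = Z.shape[1]
    W = np.random.default_rng(seed).normal(size=(n, n))
    for _ in range(n_iter):
        WZ = Z @ W.T                  # current projections
        G, G_prime = np.tanh(WZ), 1 - np.tanh(WZ) ** 2
        # Fixed-point update: w <- E[z g(w^T z)] - E[g'(w^T z)] w
        W_new = G.T @ Z / len(Z) - np.diag(G_prime.mean(axis=0)) @ W
        # Symmetric decorrelation: W <- (W W^T)^{-1/2} W
        u, s, vt = np.linalg.svd(W_new)
        W = u @ vt
    return Z @ W.T                    # estimated sources
```

Recovered sources come back in arbitrary order and scale, which is inherent to ICA.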

Non-Negative Matrix Factorization (NMF)

Factor V (n x d) into W (n x K) * H (K x d) with all entries >= 0.

Minimize: ||V - WH||_F^2   (or KL divergence)
Subject to: W >= 0, H >= 0

Multiplicative update rules (multiplication and division are element-wise):

H <- H * (W^T V) / (W^T W H)
W <- W * (V H^T) / (W H H^T)

Produces parts-based decomposition (e.g., facial features, topic words). Interpretable because components are additive.
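
The update rules translate to NumPy almost verbatim (a sketch; the random initialization and the small epsilon guard against division by zero are my choices):

```python
import numpy as np

def nmf(V, K, n_iter=500, seed=0):
    rng = np.random.default_rng(seed)
    n, d = V.shape
    W = rng.random((n, K)) + 0.1   # strictly positive init
    H = rng.random((K, d)) + 0.1
    eps = 1e-10                    # guard against division by zero
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because each update multiplies by a non-negative ratio, W and H stay non-negative without any explicit projection.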

Anomaly Detection

Isolation Forest

Idea: anomalies are easier to isolate (fewer splits needed).

  1. Build random trees by choosing random feature and random split value
  2. Anomaly score based on average path length across trees:
s(x, n) = 2^{-E[h(x)] / c(n)}

where h(x) = path length to isolate x, and c(n) = average path length of an unsuccessful search in a binary search tree built on n points (a normalizing constant). Score near 1 = anomaly; near 0.5 = normal.
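
The normalizer c(n) and the score itself are cheap to compute (a sketch; the harmonic number H(n-1) is approximated via the Euler-Mascheroni constant):

```python
import numpy as np

def c(n):
    # Average path length of an unsuccessful BST search over n points
    if n <= 1:
        return 0.0
    harmonic = np.log(n - 1) + 0.5772156649  # H(n-1) ~ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_length, n):
    # s(x, n) = 2^{-E[h(x)] / c(n)}: near 1 = anomaly, near 0.5 = normal
    return 2.0 ** (-avg_path_length / c(n))
```

A point whose average path length equals c(n) scores exactly 0.5; much shorter paths push the score toward 1.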

Local Outlier Factor (LOF)

Compares local density of a point to its neighbors:

LOF(x) = mean_{o in kNN(x)} [ lrd(o) / lrd(x) ]

where lrd(x) = local reachability density = 1 / mean_{o in kNN(x)} reach_dist(x, o), with reach_dist(x, o) = max(k_dist(o), d(x, o)) and k_dist(o) = distance from o to its k-th nearest neighbor.

  • LOF ~ 1: similar density to neighbors (inlier)
  • LOF >> 1: much lower density than neighbors (outlier)
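
These definitions can be sketched directly in NumPy (an O(n^2) version; the function name is my own):

```python
import numpy as np

def lof(X, k):
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)                 # exclude self-distances
    knn = np.argsort(D, axis=1)[:, :k]          # k nearest neighbors of each point
    k_dist = D[np.arange(len(X)), knn[:, -1]]   # distance to the k-th neighbor
    lrd = np.empty(len(X))
    for i in range(len(X)):
        # reach_dist(x, o) = max(k_dist(o), d(x, o))
        reach = np.maximum(k_dist[knn[i]], D[i, knn[i]])
        lrd[i] = 1.0 / reach.mean()
    # LOF(x) = mean lrd of neighbors / lrd of x
    return np.array([lrd[knn[i]].mean() / lrd[i] for i in range(len(X))])
```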

Method Selection Guide

| Task                   | Method                      |
|------------------------|-----------------------------|
| Compact clusters       | K-means, GMM                |
| Arbitrary shapes       | DBSCAN, spectral clustering |
| Hierarchical structure | Agglomerative clustering    |
| Linear dim reduction   | PCA                         |
| Visualization          | t-SNE, UMAP                 |
| Parts-based decomp     | NMF                         |
| Signal separation      | ICA                         |
| Outlier detection      | Isolation forest, LOF       |