
Neural Networks


From Perceptron to MLP

Single Perceptron

y = activation(w^T x + b)

A single neuron computes a weighted sum of inputs plus bias, then applies a non-linear activation function.
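As a concrete sketch in NumPy (assuming a sigmoid activation and made-up weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one neuron with 3 inputs
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 0.5])

y = sigmoid(w @ x + b)   # activation(w^T x + b)
```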

Multi-Layer Perceptron (MLP)

Stack layers of neurons. For an L-layer network:

h_0 = x                                    # input
h_l = activation(W_l * h_{l-1} + b_l)     # hidden layers, l = 1..L-1
y   = output_fn(W_L * h_{L-1} + b_L)      # output layer
def mlp_forward(x, weights, biases, activations):
    # activations[-1] plays the role of output_fn (e.g. softmax or identity)
    h = x
    for W, b, act in zip(weights, biases, activations):
        h = act(W @ h + b)
    return h
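For example, a tiny 4 → 8 → 3 classifier (a self-contained sketch that inlines the same loop, with ReLU hidden units and softmax as the output function; the shapes and scale 0.1 are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

# shapes: 4 -> 8 -> 3
weights = [rng.normal(0, 0.1, (8, 4)), rng.normal(0, 0.1, (3, 8))]
biases = [np.zeros(8), np.zeros(3)]
activations = [relu, softmax]   # softmax acts as output_fn

h = rng.normal(size=4)
for W, b, act in zip(weights, biases, activations):
    h = act(W @ h + b)
# h is now a probability vector over 3 classes
```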

Activation Functions

| Function   | Formula                              | Range        | Derivative             | Notes                         |
|------------|--------------------------------------|--------------|------------------------|-------------------------------|
| Sigmoid    | 1 / (1 + exp(-z))                    | (0, 1)       | sigma(z)(1 - sigma(z)) | Vanishing gradient            |
| Tanh       | (exp(z)-exp(-z))/(exp(z)+exp(-z))    | (-1, 1)      | 1 - tanh^2(z)          | Zero-centered, still vanishes |
| ReLU       | max(0, z)                            | [0, inf)     | 0 if z<0, 1 if z>0     | Default choice, dying ReLU    |
| Leaky ReLU | z if z>0, alpha*z otherwise          | (-inf, inf)  | alpha or 1             | Fixes dying ReLU              |
| ELU        | z if z>0, alpha*(exp(z)-1) otherwise | (-alpha, inf)| smooth                 | Smooth, slightly slower       |
| GELU       | z * Phi(z)                           | (-0.17, inf) | smooth                 | Used in transformers          |
| SiLU/Swish | z * sigmoid(z)                       | (-0.28, inf) | smooth                 | Self-gated                    |
| Softmax    | exp(z_k) / sum_j exp(z_j)            | (0, 1)       | Jacobian matrix        | Multi-class output            |

ReLU is the default for hidden layers; GELU/SiLU are preferred in transformers and modern architectures.

Loss Functions

Classification

Binary cross-entropy:     L = -[y*log(p) + (1-y)*log(1-p)]
Categorical cross-entropy: L = -sum_k y_k * log(p_k)
Focal loss:                L = -alpha * (1-p)^gamma * log(p)  # for imbalanced data

Regression

MSE:    L = (y - y_hat)^2
MAE:    L = |y - y_hat|
Huber:  L = { 0.5*(y-y_hat)^2        if |y-y_hat| <= delta
            { delta*|y-y_hat| - 0.5*delta^2  otherwise
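These losses can be sketched in NumPy (a minimal version; the `eps` clipping to avoid `log(0)` is an implementation detail, not part of the formulas):

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # y in {0, 1}, p = predicted probability of the positive class
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y, p, eps=1e-12):
    # y one-hot, p a probability vector
    return -np.sum(y * np.log(np.clip(p, eps, None)))

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    return np.where(r <= delta, 0.5 * r**2, delta * r - 0.5 * delta**2)
```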

Backpropagation

Compute gradients of the loss w.r.t. all parameters using the chain rule, proceeding backward through the network.

Forward Pass

z_l = W_l * a_{l-1} + b_l      # pre-activation
a_l = f_l(z_l)                  # activation

Backward Pass

delta_L = dL/da_L * f_L'(z_L)                  # output layer error
delta_l = (W_{l+1}^T * delta_{l+1}) * f_l'(z_l)  # backpropagate error

dL/dW_l = delta_l * a_{l-1}^T
dL/db_l = delta_l
def backprop(x, y, weights, biases, activations):
    # Assumes column-vector inputs, activation objects exposing
    # .forward(z) and .derivative(z), and externally supplied
    # compute_loss / loss_gradient functions.

    # Forward pass: store intermediate values
    a = [x]
    z_list = []
    for W, b, act in zip(weights, biases, activations):
        z = W @ a[-1] + b
        z_list.append(z)
        a.append(act.forward(z))

    # Backward pass
    loss = compute_loss(a[-1], y)
    # The element-wise product below assumes an element-wise output
    # activation; for softmax + cross-entropy, use delta = a[-1] - y.
    delta = loss_gradient(a[-1], y) * activations[-1].derivative(z_list[-1])

    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        grads_W.insert(0, delta @ a[l].T)   # dL/dW_l = delta_l * a_{l-1}^T
        grads_b.insert(0, delta)            # dL/db_l = delta_l
        if l > 0:
            delta = (weights[l].T @ delta) * activations[l-1].derivative(z_list[l-1])

    return grads_W, grads_b, loss
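A standard sanity check for any backprop implementation is comparing analytic gradients against central finite differences on a tiny network. A self-contained sketch for a single tanh layer with MSE loss:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))
x = rng.normal(size=(3, 1))
y = rng.normal(size=(2, 1))

def loss(W):
    a = np.tanh(W @ x)
    return 0.5 * np.sum((a - y) ** 2)

# analytic gradient: delta = (a - y) * tanh'(z), dL/dW = delta @ x^T
z = W @ x
a = np.tanh(z)
delta = (a - y) * (1 - np.tanh(z) ** 2)
grad = delta @ x.T

# central finite differences, one parameter at a time
num = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)
```

If the two disagree beyond roughly `1e-6`, the backward pass has a bug.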

Computational graph frameworks (PyTorch, JAX) implement this via automatic differentiation, building and traversing the computation graph dynamically.

Universal Approximation Theorem

Theorem (Cybenko 1989, Hornik 1991): A feedforward network with a single hidden layer of sufficient width can approximate any continuous function on a compact subset of R^d to arbitrary precision.

Caveats:

  • Says nothing about how many neurons are needed (could be exponential)
  • Says nothing about learnability (can we find the right weights?)
  • Deep networks can be exponentially more efficient than shallow ones for many function classes

Weight Initialization

Proper initialization prevents vanishing/exploding activations.

Xavier/Glorot Initialization

For layers with sigmoid or tanh activation:

W ~ N(0, 2 / (n_in + n_out))    # or Uniform(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out)))

Maintains variance of activations across layers.

He (Kaiming) Initialization

For ReLU activations (accounts for the fact that ReLU zeros out half the values):

W ~ N(0, 2 / n_in)

Rules of Thumb

| Activation   | Initialization | Variance       |
|--------------|----------------|----------------|
| Sigmoid/Tanh | Xavier         | 2/(n_in+n_out) |
| ReLU         | He             | 2/n_in         |
| SELU         | LeCun          | 1/n_in         |

Biases: initialize to zero (or small constant for ReLU to avoid dead neurons).
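A minimal sketch of the two Gaussian initializers in NumPy (weight shape `(n_out, n_in)` matches the `W @ h` convention used above):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Var(W) = 2 / (n_in + n_out): for sigmoid/tanh layers
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), (n_out, n_in))

def he_init(n_in, n_out):
    # Var(W) = 2 / n_in: for ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_out, n_in))

W = he_init(512, 256)
```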

Normalization

Batch Normalization

Normalize activations across the mini-batch:

mu_B = (1/B) * sum x_i
sigma_B^2 = (1/B) * sum (x_i - mu_B)^2
x_hat_i = (x_i - mu_B) / sqrt(sigma_B^2 + epsilon)
y_i = gamma * x_hat_i + beta                    # learnable scale and shift
  • Originally motivated by reducing internal covariate shift; later work attributes much of the benefit to smoothing the optimization landscape
  • Acts as regularization (noise from batch statistics)
  • Allows higher learning rates
  • During inference: use running mean/variance
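A training-mode forward pass can be sketched as follows (assuming `x` has shape `(batch, features)`; tracking running statistics for inference is omitted):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # normalize each feature over the batch dimension
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta   # learnable scale and shift

x = np.random.default_rng(0).normal(3.0, 2.0, (64, 8))
y = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
# per-feature mean ~ 0 and std ~ 1 after normalization
```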

Layer Normalization

Normalize across features (not batch dimension):

mu_i = (1/d) * sum_j x_{ij}
sigma_i^2 = (1/d) * sum_j (x_{ij} - mu_i)^2
x_hat_{ij} = (x_{ij} - mu_i) / sqrt(sigma_i^2 + epsilon)
y_{ij} = gamma_j * x_hat_{ij} + beta_j          # learnable scale and shift
  • Independent of batch size
  • Preferred for RNNs and transformers
  • No discrepancy between training and inference

Other Variants

  • Group Normalization: normalize across groups of channels (good for small batches)
  • Instance Normalization: normalize each channel independently (style transfer)
  • RMSNorm: simplified LayerNorm without centering: y = x / RMS(x) * gamma
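LayerNorm and RMSNorm side by side (a sketch; `x` is a single feature vector, and `gamma`/`beta` are scalars here for simplicity, though in practice they are per-feature):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # center and scale across the feature dimension
    mu = x.mean()
    var = x.var()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # no centering: just divide by the root-mean-square
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return x / rms * gamma

x = np.array([1.0, 2.0, 3.0, 4.0])
ln = layer_norm(x, gamma=1.0, beta=0.0)
rn = rms_norm(x, gamma=1.0)
```

RMSNorm skips the mean subtraction, which saves a reduction per call; this is the main reason it appears in recent large-model architectures.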

Dropout

During training, randomly zero out neurons with probability p:

mask ~ Bernoulli(1 - p)
h_dropped = h * mask / (1 - p)     # inverted dropout (scale at train time)

At inference: use all neurons (no dropout, no scaling needed with inverted dropout).

Interpretation: approximately trains an ensemble of 2^n subnetworks. Also equivalent to approximate Bayesian inference (MC Dropout).

Typical values: p = 0.5 for hidden layers, p = 0.1-0.3 for input or embeddings.
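Inverted dropout in a few lines (a sketch; the large vector is just to make the expectation visible):

```python
import numpy as np

def dropout(h, p, training=True, rng=None):
    if not training or p == 0.0:
        return h                       # inference: identity, no rescaling
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p    # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)        # scale so E[output] == h

h = np.ones(100_000)
d = dropout(h, p=0.5, rng=np.random.default_rng(0))
# d.mean() is close to 1.0: expectation preserved by the 1/(1-p) scaling
```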

Learning Rate Schedules

| Schedule          | Formula / Description                                |
|-------------------|------------------------------------------------------|
| Step decay        | lr * gamma^{floor(epoch / step_size)}                |
| Exponential decay | lr_0 * exp(-k * t)                                   |
| Cosine annealing  | lr_min + 0.5*(lr_max - lr_min)*(1 + cos(pi*t/T))     |
| Warmup + decay    | Linear warmup for W steps, then cosine/linear decay  |
| ReduceOnPlateau   | Reduce lr by factor when metric plateaus             |
| One-cycle         | Warmup to max lr, then anneal, then final drop       |

Warmup is especially important for Adam and transformers: start with very small lr and linearly increase to target lr over the first few thousand steps.
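A warmup-then-cosine schedule expressed as a function of the step counter (a sketch; the specific `lr_max`, warmup length, and total steps are up to the user):

```python
import math

def lr_at(step, lr_max, warmup_steps, total_steps, lr_min=0.0):
    if step < warmup_steps:
        # linear warmup from ~0 up to lr_max
        return lr_max * (step + 1) / warmup_steps
    # cosine annealing from lr_max down to lr_min
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```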

Residual Connections

h_{l+1} = h_l + F(h_l)    # skip connection
  • Enable training of very deep networks (100+ layers)
  • Gradient flows directly through skip connections
  • Identity mapping provides a "highway" for gradients
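A residual block is just an additive skip around a sub-network F. A sketch with a two-layer ReLU MLP as F (the 0.1 weight scale is arbitrary; output dimension must match the input for the addition to work):

```python
import numpy as np

def residual_block(h, W1, b1, W2, b2):
    # F(h): two-layer MLP whose output dim matches the input dim
    f = np.maximum(0.0, W1 @ h + b1)
    f = W2 @ f + b2
    return h + f   # skip connection: h_{l+1} = h_l + F(h_l)

rng = np.random.default_rng(0)
d, hidden = 8, 16
h = rng.normal(size=d)
out = residual_block(h, rng.normal(size=(hidden, d)) * 0.1, np.zeros(hidden),
                     rng.normal(size=(d, hidden)) * 0.1, np.zeros(d))
```

Note that with all-zero weights the block reduces to the identity, which is exactly the "highway" property that keeps deep residual stacks trainable.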

Practical Training Pipeline

  1. Architecture: start simple (e.g., 2-3 hidden layers, 256-512 units)
  2. Initialization: He for ReLU, Xavier for others
  3. Optimizer: Adam/AdamW with lr=3e-4 as a starting point
  4. Regularization: dropout (0.1-0.5), weight decay, data augmentation
  5. Normalization: BatchNorm for CNNs, LayerNorm for transformers
  6. Schedule: cosine annealing with warmup
  7. Monitor: train/val loss curves, gradient norms, activation distributions
  8. Debug: overfit a single batch first to verify the pipeline works