Neural Networks

CNN, RNN, Transformer architecture comparison

The Perceptron to MLP

Single Perceptron

y = activation(w^T x + b)

A single neuron computes a weighted sum of inputs plus bias, then applies a non-linear activation function.
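As a minimal sketch of the formula above, the following NumPy snippet evaluates one neuron. The function name `perceptron`, the choice of `tanh` as the activation, and the example numbers are illustrative assumptions, not part of the original text.

```python
import numpy as np

def perceptron(x, w, b, activation=np.tanh):
    """Single neuron: weighted sum of the inputs plus a bias, then an activation."""
    return activation(w @ x + b)

# Hypothetical example values chosen only for illustration.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1
y = perceptron(x, w, b)  # pre-activation is 0.5*1 - 0.25*2 + 0.1 = 0.1
```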

Multi-Layer Perceptron (MLP)

Stack layers of neurons. For an L-layer network:

h_0 = x                                    # input
h_l = activation(W_l * h_{l-1} + b_l)     # hidden layers, l = 1..L-1
y   = output_fn(W_L * h_{L-1} + b_L)      # output layer
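
def mlp_forward(x, weights, biases, activations):
    """Forward pass of an MLP; activations[-1] plays the role of output_fn."""
    h = x
    for W, b, act in zip(weights, biases, activations):
        h = act(W @ h + b)  # affine map followed by the layer's non-linearity
    return h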

Activation Functions

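| Function | Formula | Range | Derivative | Notes |
|---|---|---|---|---|
| Sigmoid | 1 / (1 + exp(-z)) | (0, 1) | sigma(z)(1 - sigma(z)) | Vanishing gradient |
| Tanh | (exp(z)-exp(-z))/(exp(z)+exp(-z)) | (-1, 1) | 1 - tanh^2(z) | Zero-centered, still vanishes |
| ReLU | max(0, z) | [0, inf) | 0 if z<0, 1 if z>0 | Default choice, dying ReLU |
| Leaky ReLU | z if z>0, alpha*z otherwise | (-inf, inf) | 1 if z>0, alpha otherwise | Fixes dying ReLU |
| ELU | z if z>0, alpha(exp(z)-1) otherwise | (-alpha, inf) | 1 if z>0, alpha*exp(z) otherwise | Smooth, slightly slower |
| GELU | z * Phi(z) | [-0.17, inf) | Phi(z) + z*phi(z) | Used in transformers |
| SiLU/Swish | z * sigmoid(z) | [-0.28, inf) | sigma(z)(1 + z(1 - sigma(z))) | Self-gated |
| Softmax | exp(z_k) / sum_j exp(z_j) | (0, 1), sums to 1 | Jacobian: p_k(delta_kj - p_j) | Multi-class output |

Phi and phi denote the standard normal CDF and PDF; the lower bounds for GELU and SiLU are approximate.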

ReLU is the default for hidden layers. GELU/SiLU for transformers and modern architectures.
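The table entries can be sketched in NumPy as follows. This is an illustrative sketch rather than a reference implementation: the function names and the default `alpha=0.01` are assumptions, and `gelu` uses the exact `z * Phi(z)` form via `math.erf` rather than the tanh approximation some libraries use.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # alpha is the slope for negative inputs (assumed default).
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    erf = np.vectorize(math.erf)
    return z * 0.5 * (1.0 + erf(z / math.sqrt(2.0)))

def silu(z):
    return z * sigmoid(z)

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)
```

Evaluating `gelu` and `silu` on a fine grid is a quick way to confirm the approximate lower bounds listed in the Range column.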

Loss Functions

Classification

Binary cross-entropy:     L = -[y*log(p) + (1-y)*log(1-p)]
Categorical cross-entropy: L = -sum_k y_k * log(p_k)
Focal loss:                L = -alpha * (1-p_t)^gamma * log(p_t)  # p_t = predicted probability of the true class; for imbalanced data

Regression

MSE:    L = (y - y_hat)^2
MAE:    L = |y - y_hat|
Huber:  L = { 0.5*(y-y_hat)^2        if |y-y_hat| <= delta
            { delta*|y-y_hat| - 0.5*delta^2  otherwise
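The per-sample losses above can be sketched in NumPy as shown below. The function names and the `eps` clipping constant are assumptions added for numerical safety; the focal-loss defaults `alpha=0.25` and `gamma=2.0` are commonly used values, not something the text prescribes.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)), axis=-1)

def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)
    p_t = np.where(y == 1, p, 1.0 - p)  # probability assigned to the true class
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    # Quadratic near zero, linear in the tails.
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)
```

With `gamma=0` and `alpha=1`, the focal loss reduces to binary cross-entropy, which makes a convenient sanity check.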

Backpropagation

Compute gradients of the loss w.r.t. all parameters using the chain rule, proceeding backward through the network.

Forward Pass

z_l = W_l * a_{l-1} + b_l      # pre-activation
a_l = f_l(z_l)                  # activation

Backward Pass

delta_L = dL/da_L ⊙ f_L'(z_L)                      # output layer error (⊙ = elementwise product)
delta_l = (W_{l+1}^T * delta_{l+1}) ⊙ f_l'(z_l)    # backpropagate error, l = L-1..1

dL/dW_l = delta_l * a_{l-1}^T                      # outer product
dL/db_l = delta_l
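
def backprop(x, y, weights, biases, activations):
    """Gradients for one sample; x, y and each bias are column vectors of shape (n, 1).

    Each entry of activations exposes forward(z) and derivative(z).
    compute_loss and loss_gradient are loss helpers assumed to be defined elsewhere.
    """
    # Forward pass: store pre-activations z_l and activations a_l
    a = [x]
    z_list = []
    for W, b, act in zip(weights, biases, activations):
        z = W @ a[-1] + b
        z_list.append(z)
        a.append(act.forward(z))

    # Backward pass: start from the output-layer error
    loss = compute_loss(a[-1], y)
    delta = loss_gradient(a[-1], y) * activations[-1].derivative(z_list[-1])

    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        grads_W.append(delta @ a[l].T)  # outer product: dL/dW for this layer
        grads_b.append(delta)           # dL/db for this layer
        if l > 0:
            # Push the error back through W, then through the previous activation
            delta = (weights[l].T @ delta) * activations[l - 1].derivative(z_list[l - 1])

    # Gradients were collected from the last layer to the first; restore layer order
    return grads_W[::-1], grads_b[::-1], loss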

Frameworks such as PyTorch and JAX implement this via reverse-mode automatic differentiation. PyTorch records the computation graph dynamically as operations run; JAX traces Python functions and transforms them into gradient functions.
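A common way to sanity-check hand-derived backward-pass equations is to compare them against central finite differences. The sketch below does this for a tiny network with one tanh hidden layer, a linear output, and a 0.5 * squared-error loss; the network sizes, the seed, and all variable names are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(1, 3)), np.zeros((1, 1))
x, y = rng.normal(size=(2, 1)), np.array([[0.5]])

def loss_fn(W1_):
    """Loss as a function of the first-layer weights only."""
    a1 = np.tanh(W1_ @ x + b1)
    y_hat = W2 @ a1 + b2  # linear output layer
    return 0.5 * float(np.sum((y_hat - y) ** 2))

# Analytic gradient from the backward-pass equations.
a1 = np.tanh(W1 @ x + b1)
delta2 = (W2 @ a1 + b2) - y                # output error for 0.5 * squared error
delta1 = (W2.T @ delta2) * (1 - a1 ** 2)   # tanh'(z) = 1 - tanh(z)^2
grad_W1 = delta1 @ x.T

# Central finite differences, one weight entry at a time.
h = 1e-6
num_grad = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        E = np.zeros_like(W1)
        E[i, j] = h
        num_grad[i, j] = (loss_fn(W1 + E) - loss_fn(W1 - E)) / (2 * h)

max_err = np.max(np.abs(grad_W1 - num_grad))
```

If the backward pass is implemented correctly, `max_err` should be many orders of magnitude smaller than the gradient entries themselves.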

Universal Approximation Theorem

Theorem (Cybenko 1989, Hornik 1991): A feedforward network with a single hidden layer of sufficient width and a suitable non-linear activation (e.g., sigmoidal) can approximate any continuous function on a compact subset of R^d to arbitrary precision.

Caveats:

  • Says nothing about how many neurons are needed (could be exponential)
  • Says nothing about learnability (can we find the right weights?)
  • Says nothing about depth: for certain function classes, deep networks can be exponentially more efficient (in number of units) than shallow ones

Weight Initialization

Proper initialization prevents vanishing/exploding activations.

Xavier/Glorot Initialization

For layers with sigmoid or tanh activation:

W ~ N(0, 2 / (n_in + n_out))    # second argument is the variance; or Uniform(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out)))

Maintains variance of activations across layers.

He (Kaiming) Initialization

For ReLU activations (accounts for the fact that ReLU zeros out half the values):

W ~ N(0, 2 / n_in)    # second argument is the variance

Rules of Thumb

| Activation | Initialization | Variance |
|---|---|---|
| Sigmoid/Tanh | Xavier | 2/(n_in+n_out) |
| ReLU | He | 2/n_in |
| SELU | LeCun | 1/n_in |

Biases: initialize to zero (or small constant for ReLU to avoid dead neurons).
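The two normal-distribution schemes can be sketched as below. Note that NumPy's `normal` takes a standard deviation, so the variances from the table are square-rooted first. The function names, the `(n_out, n_in)` weight layout, and the layer sizes are illustrative assumptions.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    std = np.sqrt(2.0 / (n_in + n_out))  # variance 2 / (n_in + n_out)
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    std = np.sqrt(2.0 / n_in)            # variance 2 / n_in
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W_xavier = xavier_normal(512, 256, rng)
W_he = he_normal(512, 256, rng)
b = np.zeros((256, 1))                   # biases start at zero
```

Comparing `W_xavier.var()` and `W_he.var()` against the table values is a quick check that the scaling is applied as intended.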

Normalization

Batch Normalization

Normalize activations across the mini-batch:

mu_B = (1/B) * sum x_i
sigma_B^2 = (1/B) * sum (x_i - mu_B)^2
x_hat_i = (x_i - mu_B) / sqrt(sigma_B^2 + epsilon)
y_i = gamma * x_hat_i + beta                    # learnable scale and shift
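
  • Originally motivated as reducing internal covariate shift; later analyses attribute much of the benefit to a smoother optimization landscape
  • Acts as regularization (noise from batch statistics)
  • Allows higher learning rates
  • During inference: use running averages of the mean and variance collected during training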
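The training and inference paths can be sketched together in NumPy. This is a minimal forward-only sketch for inputs of shape (batch, features); the function name, the `momentum` convention for the running averages, and the default values are assumptions and differ between libraries.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training=True, momentum=0.1, eps=1e-5):
    """Batch norm over axis 0; returns the output and updated running statistics."""
    if training:
        mu = x.mean(axis=0)
        var = x.var(axis=0)  # biased estimate, (1/B) * sum(...)
        # Exponential moving averages, used later at inference time.
        running_mean = (1 - momentum) * running_mean + momentum * mu
        running_var = (1 - momentum) * running_var + momentum * var
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(64, 4))
out, run_mean, run_var = batchnorm_forward(
    x, np.ones(4), np.zeros(4), np.zeros(4), np.ones(4))
# Inference on a single example reuses the running statistics.
out_eval, _, _ = batchnorm_forward(
    x[:1], np.ones(4), np.zeros(4), run_mean, run_var, training=False)
```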

Layer Normalization

Normalize across features (not batch dimension):

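mu_i = (1/d) * sum_j x_{ij}
sigma_i^2 = (1/d) * sum_j (x_{ij} - mu_i)^2
x_hat_{ij} = (x_{ij} - mu_i) / sqrt(sigma_i^2 + epsilon)
y_{ij} = gamma_j * x_hat_{ij} + beta_j          # learnable per-feature scale and shift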
  • Independent of batch size
  • Preferred for RNNs and transformers
  • No discrepancy between training and inference
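A minimal NumPy sketch of layer normalization follows, assuming inputs of shape (batch, features) and per-feature `gamma` and `beta`; the function name and `eps` default are illustrative. Because the statistics are taken along the last axis only, each row is normalized without reference to the rest of the batch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)   # per-sample mean over features
    var = x.var(axis=-1, keepdims=True)   # per-sample variance over features
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5))
out = layer_norm(x, np.ones(5), np.zeros(5))
# Normalizing one row alone gives the same result as inside the batch.
out_single = layer_norm(x[:1], np.ones(5), np.zeros(5))
```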

Other Variants

  • Group Normalization: normalize across groups of channels (good for small batches)
  • Instance Normalization: normalize each channel of each sample independently, over its spatial dimensions (style transfer)
  • RMSNorm: simplified LayerNorm without centering: y = x / RMS(x) * gamma
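The RMSNorm formula can be sketched as follows. The function name and the small `eps` added inside the square root for numerical stability are assumptions; the formula in the text omits `eps`.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Root mean square over the feature axis; no mean subtraction.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, size=(4, 8))
out = rms_norm(x, np.ones(8))
```

With `gamma` set to ones, each output row has a root mean square close to 1, while its mean is generally non-zero because no centering is applied.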

Dropout

During training, randomly zero out neurons with probability p:

mask ~ Bernoulli(1 - p)
h_dropped = h * mask / (1 - p)     # inverted dropout (scale at train time)

At inference: use all neurons (no dropout, no scaling needed with inverted dropout).
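Inverted dropout can be sketched in a few lines of NumPy. The function name and the `training` flag are illustrative assumptions; the mask keeps each unit with probability 1 - p and rescales the survivors so the expected activation is unchanged.

```python
import numpy as np

def dropout(h, p, rng, training=True):
    if not training or p == 0.0:
        return h                          # inference: identity, no scaling
    mask = rng.random(h.shape) >= p       # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)           # rescale at train time

rng = np.random.default_rng(0)
h = np.ones(100_000)
h_train = dropout(h, 0.5, rng)
h_eval = dropout(h, 0.5, rng, training=False)
```

Here the surviving entries of `h_train` are scaled to 2.0, so its mean stays close to the mean of `h`.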

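Interpretation: dropout approximately trains an ensemble of up to 2^n weight-sharing subnetworks, where n is the number of units that can be dropped. Keeping dropout active at inference and averaging several stochastic passes can also be interpreted as approximate Bayesian inference (MC Dropout).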

Typical values: p = 0.5 for hidden layers, p = 0.1-0.3 for input or embeddings.

Learning Rate Schedules

| Schedule | Formula / Description |
|---|---|
| Step decay | lr_0 * gamma^{floor(epoch / step_size)} |
| Exponential decay | lr_0 * exp(-k * t) |
| Cosine annealing | lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)) |
| Warmup + decay | Linear warmup for W steps, then cosine/linear decay |
| ReduceOnPlateau | Reduce lr by factor when metric plateaus |
| One-cycle | Warmup to max lr, then anneal, then final drop |

Warmup is especially important for Adam and transformers: start with very small lr and linearly increase to target lr over the first few thousand steps.
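A linear warmup followed by cosine annealing can be sketched as a pure function of the step index. The function name, its parameters, and the choice to reach `lr_max` on the last warmup step are assumptions; frameworks differ in these details.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    if step < warmup_steps:
        # Linear warmup: reaches lr_max on the last warmup step.
        return lr_max * (step + 1) / warmup_steps
    t = step - warmup_steps
    T = max(1, total_steps - warmup_steps)
    # Cosine annealing from lr_max down to lr_min over the remaining steps.
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

schedule = [warmup_cosine_lr(s, 1000, 100, 3e-4) for s in range(1001)]
```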

Residual Connections

h_{l+1} = h_l + F(h_l)    # skip connection
  • Enable training of very deep networks (100+ layers)
  • Gradient flows directly through skip connections
  • Identity mapping provides a "highway" for gradients
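The skip connection can be sketched with a small residual branch. Here F is assumed to be a two-layer ReLU transform with hypothetical weights `W1` and `W2`; real architectures typically add normalization and biases.

```python
import numpy as np

def residual_block(h, W1, W2):
    """h_{l+1} = h_l + F(h_l), with F(h) = W2 @ relu(W1 @ h) (assumed form)."""
    f = W2 @ np.maximum(0.0, W1 @ h)
    return h + f

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 1))
W1 = rng.normal(size=(8, 4))
W2 = np.zeros((4, 8))          # zero residual branch
out = residual_block(h, W1, W2)
```

With the residual branch zeroed out, the block reduces to the identity, which is the "highway" that lets signals and gradients pass through unchanged.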

Practical Training Pipeline

  1. Architecture: start simple (e.g., 2-3 hidden layers, 256-512 units)
  2. Initialization: He for ReLU, Xavier for sigmoid/tanh
  3. Optimizer: Adam/AdamW with lr=3e-4 as a starting point
  4. Regularization: dropout (0.1-0.5), weight decay, data augmentation
  5. Normalization: BatchNorm for CNNs, LayerNorm for transformers
  6. Schedule: cosine annealing with warmup
  7. Monitor: train/val loss curves, gradient norms, activation distributions
  8. Debug: overfit a single batch first to verify the pipeline works