Neural Networks

CNN, RNN, Transformer architecture comparison

The Perceptron to MLP

Single Perceptron

y = activation(w^T x + b)

A single neuron computes a weighted sum of inputs plus bias, then applies a non-linear activation function.
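As a minimal sketch of the formula above, assuming NumPy and a tanh activation (both choices are illustrative, and the input values are made up):

```python
import numpy as np

def perceptron(x, w, b, activation=np.tanh):
    """Single neuron: weighted sum of inputs plus bias, then activation."""
    return activation(w @ x + b)

# Illustrative values, not taken from the text
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
b = 0.1

y = perceptron(x, w, b)   # w @ x = 0.5 - 0.5 = 0, so y = tanh(0.1)
print(y)
```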

Multi-Layer Perceptron (MLP)

Stack layers of neurons. For an L-layer network:

h_0 = x                                    # input
h_l = activation(W_l * h_{l-1} + b_l)     # hidden layers, l = 1..L-1
y   = output_fn(W_L * h_{L-1} + b_L)      # output layer

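def mlp_forward(x, weights, biases, activations):
    """Forward pass: apply each layer's affine map, then its activation."""
    h = x  # h_0 = x
    for W, b, act in zip(weights, biases, activations):
        h = act(W @ h + b)  # the last `act` plays the role of output_fn
    return h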
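A possible way to call this function, assuming NumPy arrays and plain Python callables as activations. The layer sizes, the 0.1 weight scale, and the `relu`/`identity` helpers are illustrative assumptions.

```python
import numpy as np

def mlp_forward(x, weights, biases, activations):
    # Same loop as above, repeated so this snippet runs on its own
    h = x
    for W, b, act in zip(weights, biases, activations):
        h = act(W @ h + b)
    return h

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # input, two hidden layers, output (arbitrary sizes)
weights = [rng.normal(0.0, 0.1, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
activations = [relu, relu, identity]  # identity output, e.g. for regression

x = rng.normal(size=4)
y = mlp_forward(x, weights, biases, activations)
print(y.shape)  # (3,)
```

Each weight matrix has shape (n_out, n_in), so `W @ h` maps a vector of size n_in to one of size n_out.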

Activation Functions

Function	Formula	Range	Derivative	Notes
Sigmoid	1 / (1 + exp(-z))	(0, 1)	sigma(z)(1-sigma(z))	Vanishing gradient
Tanh	(exp(z)-exp(-z))/(exp(z)+exp(-z))	(-1, 1)	1 - tanh^2(z)	Zero-centered, still vanishes
ReLU	max(0, z)	[0, inf)	0 if z<0, 1 if z>0	Default choice, dying ReLU
Leaky ReLU	z if z>0, alpha*z otherwise	(-inf, inf)	1 if z>0, alpha otherwise	Fixes dying ReLU
ELU	z if z>0, alpha*(exp(z)-1) otherwise	(-alpha, inf)	1 if z>0, alpha*exp(z) otherwise	Smooth for alpha=1, slightly slower (exp)
GELU	z * Phi(z)	[~-0.17, inf)	Phi(z) + z*phi(z)	Used in transformers; Phi, phi = standard normal CDF, PDF
SiLU/Swish	z * sigmoid(z)	[~-0.28, inf)	sigma(z)(1 + z(1-sigma(z)))	Self-gated
Softmax	exp(z_k) / sum exp(z_j)	(0, 1)	Jacobian matrix	Multi-class output

ReLU is a common default for hidden layers; GELU and SiLU are common in transformers and other recent architectures.
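The formulas in the table can be sketched in NumPy as follows. The default `alpha` values are illustrative assumptions, GELU uses the exact erf form of Phi, and the plain sigmoid here is not hardened against overflow for very negative inputs.

```python
import numpy as np
from math import erf, sqrt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def elu(z, alpha=1.0):
    # expm1 is only evaluated on the non-positive part to avoid overflow
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def gelu(z):
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))), the standard normal CDF
    phi_cdf = 0.5 * (1.0 + np.vectorize(erf)(np.asarray(z) / sqrt(2.0)))
    return z * phi_cdf

def silu(z):
    return z * sigmoid(z)

def softmax(z):
    # Subtracting the max leaves the result unchanged but avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

grid = np.linspace(-5.0, 5.0, 10001)
print(gelu(grid).min(), silu(grid).min())  # minima near -0.17 and -0.28
```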

Loss Functions

Classification

Binary cross-entropy:     L = -[y*log(p) + (1-y)*log(1-p)]
Categorical cross-entropy: L = -sum_k y_k * log(p_k)
Focal loss:                L = -alpha * (1-p_t)^gamma * log(p_t)  # p_t = predicted probability of the true class; for imbalanced data
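A per-sample sketch of these losses in NumPy. The clipping constant `EPS` and the defaults `alpha=0.25`, `gamma=2.0` are assumptions for illustration. In the focal loss, the probability entering the formula is taken to be that of the true class (`p` when y = 1, `1 - p` when y = 0).

```python
import numpy as np

EPS = 1e-12  # assumed clipping constant that keeps log() finite

def binary_cross_entropy(y, p):
    p = np.clip(p, EPS, 1.0 - EPS)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p):
    return -np.sum(y_onehot * np.log(np.clip(p, EPS, 1.0)))

def focal_loss(y, p, alpha=0.25, gamma=2.0):
    p = np.clip(p, EPS, 1.0 - EPS)
    p_true = np.where(y == 1, p, 1.0 - p)  # probability assigned to the true class
    return -alpha * (1.0 - p_true) ** gamma * np.log(p_true)

# A confident correct prediction is down-weighted much more than a poor one
print(focal_loss(1, 0.9), focal_loss(1, 0.1))
```

With `gamma=0` and `alpha=1` the focal loss reduces to binary cross-entropy, which gives a quick sanity check.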

Regression

MSE:    L = (y - y_hat)^2
MAE:    L = |y - y_hat|
Huber:  L = { 0.5*(y-y_hat)^2        if |y-y_hat| <= delta
            { delta*|y-y_hat| - 0.5*delta^2  otherwise
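The three regression losses, sketched per sample in NumPy (the default `delta=1.0` is an illustrative assumption):

```python
import numpy as np

def mse(y, y_hat):
    return (y - y_hat) ** 2

def mae(y, y_hat):
    return np.abs(y - y_hat)

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    # Quadratic near zero, linear in the tails; both branches meet at r = delta
    return np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2)

residuals = np.array([0.5, 3.0])
print(huber(0.0, residuals))  # 0.125 for the small residual, 2.5 for the large one
```

Huber behaves like MSE for small residuals and like MAE for large ones, which makes it less sensitive to outliers than MSE.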

Backpropagation

Compute gradients of the loss w.r.t. all parameters using the chain rule, proceeding backward through the network.

Forward Pass

z_l = W_l * a_{l-1} + b_l      # pre-activation
a_l = f_l(z_l)                  # activation

Backward Pass

delta_L = dL/da_L ⊙ f_L'(z_L)                    # output layer error (⊙ = elementwise product)
delta_l = (W_{l+1}^T * delta_{l+1}) ⊙ f_l'(z_l)  # backpropagate error, l = L-1..1

dL/dW_l = delta_l * a_{l-1}^T
dL/db_l = delta_l

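def backprop(x, y, weights, biases, activations):
    """Gradients for one sample; x, y and all biases are column vectors, shape (n, 1).

    compute_loss and loss_gradient (dL/da_L) are assumed to be defined elsewhere;
    each activation object exposes forward(z) and an elementwise derivative(z).
    """
    # Forward pass: store activations a_l and pre-activations z_l
    a = [x]
    z_list = []
    for W, b, act in zip(weights, biases, activations):
        z = W @ a[-1] + b
        z_list.append(z)
        a.append(act.forward(z))

    # Backward pass: start from the output-layer error delta_L
    loss = compute_loss(a[-1], y)
    delta = loss_gradient(a[-1], y) * activations[-1].derivative(z_list[-1])

    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        grads_W.append(delta @ a[l].T)  # dL/dW = delta * (layer input)^T
        grads_b.append(delta)           # dL/db = delta
        if l > 0:
            # Push the error back through W and the previous layer's activation
            delta = (weights[l].T @ delta) * activations[l-1].derivative(z_list[l-1])

    # Gradients were collected last layer first; restore layer order
    return grads_W[::-1], grads_b[::-1], loss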

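Frameworks such as PyTorch and JAX implement this via reverse-mode automatic differentiation. PyTorch records the computation graph dynamically during the forward pass; JAX traces Python functions and differentiates them through function transformations such as grad.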
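A common way to validate a hand-written backward pass is to compare it against central finite differences. This check is an addition here, not part of the derivation above. The sketch assumes a two-layer network with a tanh hidden layer, an identity output, and the loss 0.5 * ||a_L - y||^2; the sizes and random values are arbitrary.

```python
import numpy as np

def forward(x, params):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = np.tanh(z1)
    out = W2 @ a1 + b2  # identity output activation
    return z1, a1, out

def loss_fn(x, y, params):
    _, _, out = forward(x, params)
    return 0.5 * float(np.sum((out - y) ** 2))

def backprop(x, y, params):
    W1, b1, W2, b2 = params
    z1, a1, out = forward(x, params)
    delta2 = out - y  # dL/da_L times f'(z_L), with f' = 1
    delta1 = (W2.T @ delta2) * (1.0 - np.tanh(z1) ** 2)
    return [delta1 @ x.T, delta1, delta2 @ a1.T, delta2]  # same order as params

def numerical_grads(x, y, params, eps=1e-6):
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus = loss_fn(x, y, params)
            p[idx] = old - eps
            minus = loss_fn(x, y, params)
            p[idx] = old  # restore the parameter before moving on
            g[idx] = (plus - minus) / (2.0 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
x, y = rng.normal(size=(3, 1)), rng.normal(size=(2, 1))
params = [rng.normal(size=(4, 3)), rng.normal(size=(4, 1)),
          rng.normal(size=(2, 4)), rng.normal(size=(2, 1))]

analytic = backprop(x, y, params)
numeric = numerical_grads(x, y, params)
max_err = max(float(np.max(np.abs(a - n))) for a, n in zip(analytic, numeric))
print(max_err)  # should be tiny if the backward pass is correct
```

A large discrepancy usually points to a missing transpose, a wrong activation derivative, or an off-by-one layer index.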

Universal Approximation Theorem

Theorem (Cybenko 1989, Hornik 1991): A feedforward network with a single hidden layer of sufficient width and a suitable non-linear activation (sigmoidal in Cybenko's proof) can approximate any continuous function on a compact subset of R^d to arbitrary precision.

Caveats:

Says nothing about how many neurons are needed (could be exponential)
Says nothing about learnability (can we find the right weights?)
Says nothing about depth: for some function classes, deep networks need exponentially fewer units than shallow ones

Weight Initialization

Proper initialization prevents vanishing/exploding activations.

Xavier/Glorot Initialization

For layers with sigmoid or tanh activation:

W ~ N(0, 2 / (n_in + n_out))    # or Uniform(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out)))

Maintains variance of activations across layers.

He (Kaiming) Initialization

For ReLU activations (accounts for the fact that ReLU zeros out half the values):

W ~ N(0, 2 / n_in)

Rules of Thumb

Activation	Initialization	Variance
Sigmoid/Tanh	Xavier	2/(n_in+n_out)
ReLU	He	2/n_in
SELU	LeCun	1/n_in

Biases: initialize to zero (or small constant for ReLU to avoid dead neurons).
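A sketch of the three schemes from the table, drawing from a normal distribution with the listed variance. The function names, the (n_out, n_in) weight layout, and the layer sizes are assumptions for illustration.

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    std = np.sqrt(2.0 / (n_in + n_out))  # variance 2 / (n_in + n_out)
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    std = np.sqrt(2.0 / n_in)            # variance 2 / n_in
    return rng.normal(0.0, std, size=(n_out, n_in))

def lecun_normal(n_in, n_out, rng):
    std = np.sqrt(1.0 / n_in)            # variance 1 / n_in
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = he_normal(512, 256, rng)
b = np.zeros(256)                        # biases start at zero
print(W.var(), 2.0 / 512)                # empirical vs. target variance
```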

Normalization

Batch Normalization

Normalize activations across the mini-batch:

mu_B = (1/B) * sum x_i
sigma_B^2 = (1/B) * sum (x_i - mu_B)^2
x_hat_i = (x_i - mu_B) / sqrt(sigma_B^2 + epsilon)
y_i = gamma * x_hat_i + beta                    # learnable scale and shift

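Originally motivated as reducing internal covariate shift; later analyses attribute much of the benefit to a smoother optimization landscape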
Acts as regularization (noise from batch statistics)
Allows higher learning rates
During inference: use running mean/variance
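A forward-pass sketch covering both modes, assuming inputs of shape (B, d). The `running` dictionary and the `momentum` convention (fraction of the old running value that is kept) are assumptions of this sketch; libraries differ on how they define momentum.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running, training, momentum=0.9, eps=1e-5):
    """x: (B, d). running: dict with 'mean' and 'var', updated in place while training."""
    if training:
        mu = x.mean(axis=0)
        var = x.var(axis=0)  # biased (1/B) estimate, as in the formula above
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        mu, var = running["mean"], running["var"]  # fixed statistics at inference
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=(64, 5))
running = {"mean": np.zeros(5), "var": np.ones(5)}
gamma, beta = np.ones(5), np.zeros(5)

train_out = batchnorm_forward(x, gamma, beta, running, training=True)
eval_out = batchnorm_forward(x, gamma, beta, running, training=False)
print(train_out.mean(axis=0), train_out.std(axis=0))
```

In training mode each feature of the output has roughly zero mean and unit variance over the batch; in inference mode the output depends only on the stored running statistics, not on the other samples in the batch.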

Layer Normalization

Normalize across features (not batch dimension):

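mu_i = (1/d) * sum_j x_{ij}
sigma_i^2 = (1/d) * sum_j (x_{ij} - mu_i)^2
x_hat_{ij} = (x_{ij} - mu_i) / sqrt(sigma_i^2 + epsilon)
y_{ij} = gamma_j * x_hat_{ij} + beta_j            # learnable per-feature scale and shift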

Independent of batch size
Preferred for RNNs and transformers
No discrepancy between training and inference
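A minimal NumPy sketch, assuming the feature dimension is the last axis and a learnable per-feature scale `gamma` and shift `beta`, as in batch normalization:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Statistics are computed per sample over the feature axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))
gamma, beta = np.ones(6), np.zeros(6)

out = layernorm(x, gamma, beta)
out_single = layernorm(x[:1], gamma, beta)  # same sample, batch of one
print(np.allclose(out[:1], out_single))     # True
```

Because no statistic is shared across samples, the result for a given sample is identical whatever the batch size, which is why training and inference behave the same.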

Other Variants

Group Normalization: normalize within each sample across groups of channels (good for small batches)
Instance Normalization: normalize each channel of each sample independently over its spatial dimensions (style transfer)
RMSNorm: simplified LayerNorm without centering: y = x / RMS(x) * gamma
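The RMSNorm formula can be sketched as follows, assuming the feature dimension is the last axis. The `eps` term inside the square root is an assumption added for numerical stability.

```python
import numpy as np

def rmsnorm(x, gamma, eps=1e-6):
    # Root mean square over features; no mean subtraction, unlike LayerNorm
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=(4, 8))
out = rmsnorm(x, np.ones(8))
print(np.sqrt(np.mean(out ** 2, axis=-1)))  # each row has RMS close to 1
```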

Dropout

During training, randomly zero out neurons with probability p:

mask ~ Bernoulli(1 - p)
h_dropped = h * mask / (1 - p)     # inverted dropout (scale at train time)

At inference: use all neurons (no dropout, no scaling needed with inverted dropout).
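A sketch of inverted dropout in NumPy. The function signature and the `training` flag are assumptions for illustration.

```python
import numpy as np

def dropout(h, p, rng, training=True):
    if not training or p == 0.0:
        return h                          # inference: identity, no rescaling
    mask = rng.random(h.shape) >= p       # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)           # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
h = np.ones(100_000)
dropped = dropout(h, p=0.5, rng=rng)
print(dropped.mean())                     # close to 1.0
```

With p = 0.5, surviving units are scaled to 2 and dropped units are 0, so the mean activation stays near its original value.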

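Interpretation: approximately trains an ensemble of up to 2^n weight-sharing subnetworks (n = number of droppable units). Keeping dropout active at inference and averaging several stochastic passes (MC Dropout) can be interpreted as approximate Bayesian inference.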

Typical values: p = 0.5 for hidden layers, p = 0.1-0.3 for input or embeddings.

Learning Rate Schedules

Schedule	Formula / Description
Step decay	lr_0 * gamma^{floor(epoch / step_size)}
Exponential decay	lr_0 * exp(-k * t)
Cosine annealing	lr_min + 0.5*(lr_max - lr_min)*(1 + cos(pi*t/T)), t = current step, T = total steps
Warmup + decay	Linear warmup for W steps, then cosine/linear decay
ReduceLROnPlateau	Reduce lr by a factor when the monitored metric plateaus
One-cycle	Warmup to max lr, then anneal, then final drop

Warmup is especially important for Adam and transformers: start with very small lr and linearly increase to target lr over the first few thousand steps.
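Two of the schedules from the table, sketched as plain functions of the step or epoch. The function names, the argument names, and the exact warmup shape (reaching `lr_max` on the last warmup step) are assumptions of this sketch.

```python
import math

def step_decay(lr_0, gamma, step_size, epoch):
    return lr_0 * gamma ** (epoch // step_size)

def warmup_cosine(step, total_steps, warmup_steps, lr_max, lr_min=0.0):
    if step < warmup_steps:
        # Linear warmup: small lr at step 0, lr_max at the last warmup step
        return lr_max * (step + 1) / warmup_steps
    t = step - warmup_steps
    T = max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

for s in (0, 999, 1000, 5500, 10000):
    print(s, warmup_cosine(s, total_steps=10000, warmup_steps=1000, lr_max=3e-4))
```

In practice the schedule is evaluated once per optimizer step and the result is assigned to the optimizer's learning rate.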

Residual Connections

h_{l+1} = h_l + F(h_l)    # skip connection

Enable training of very deep networks (100+ layers)
Gradient flows directly through skip connections
Identity mapping provides a "highway" for gradients
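A small sketch of one residual block, assuming F is a two-layer ReLU transform (an illustrative choice). Differentiating h + F(h) gives the Jacobian I + dF/dh, so the identity term carries the gradient even when dF/dh is small.

```python
import numpy as np

def residual_block(h, W1, W2):
    F = W2 @ np.maximum(0.0, W1 @ h)  # F(h): the learned residual
    return h + F                      # skip connection adds the input back

rng = np.random.default_rng(0)
h = rng.normal(size=8)
W1 = rng.normal(0.0, 0.1, size=(16, 8))
W2 = np.zeros((8, 16))                # zero residual branch

out = residual_block(h, W1, W2)
print(np.array_equal(out, h))         # True: the block is exactly the identity
```

With the residual branch at zero the block passes its input through unchanged, which is why a freshly added block does not have to disrupt the rest of the network. F must return a vector of the same size as h for the addition to be defined.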

Practical Training Pipeline

Architecture: start simple (e.g., 2-3 hidden layers, 256-512 units)
Initialization: He for ReLU-family activations, Xavier for sigmoid/tanh
Optimizer: Adam/AdamW with lr=3e-4 as a starting point
Regularization: dropout (0.1-0.5), weight decay, data augmentation
Normalization: BatchNorm for CNNs, LayerNorm for transformers
Schedule: cosine annealing with warmup
Monitor: train/val loss curves, gradient norms, activation distributions
Debug: overfit a single batch first to verify the pipeline works
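The single-batch check from the last item can be sketched as below. An overparameterized linear model trained with plain gradient descent stands in for the real network and optimizer; the batch size, learning rate, and step count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))  # one small batch: 8 samples, 16 features
y = rng.normal(size=8)        # arbitrary targets that the model should memorize
w = np.zeros(16)

def batch_loss(w):
    return float(np.mean((X @ w - y) ** 2))

initial_loss = batch_loss(w)
for _ in range(5000):
    grad = 2.0 / len(y) * X.T @ (X @ w - y)  # gradient of the mean squared error
    w -= 0.05 * grad                         # fixed learning rate, no regularization
final_loss = batch_loss(w)

print(initial_loss, final_loss)
```

If the loss on a single batch does not fall toward zero with regularization switched off, the problem is more likely in the data pipeline, loss, or gradient code than in the model's capacity.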