Neural Networks

From the Perceptron to the MLP
Single Perceptron
y = activation(w^T x + b)
A single neuron computes a weighted sum of inputs plus bias, then applies a non-linear activation function.
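As a concrete sketch (NumPy, with sigmoid chosen as the activation purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b, activation=sigmoid):
    # Weighted sum of inputs plus bias, then the non-linearity
    return activation(w @ x + b)

x = np.array([1.0, 2.0, -1.0])
w = np.array([0.5, -0.25, 0.1])
b = 0.1
y = perceptron(x, w, b)   # w @ x + b = 0.0, so y = sigmoid(0) = 0.5
```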
Multi-Layer Perceptron (MLP)
Stack layers of neurons. For an L-layer network:
h_0 = x # input
h_l = activation(W_l * h_{l-1} + b_l) # hidden layers, l = 1..L-1
y = output_fn(W_L * h_{L-1} + b_L) # output layer
def mlp_forward(x, weights, biases, activations):
    h = x
    for W, b, act in zip(weights, biases, activations):
        h = act(W @ h + b)   # affine transform, then non-linearity
    return h
Activation Functions
| Function   | Formula                                 | Range              | Derivative              | Notes                         |
|------------|-----------------------------------------|--------------------|-------------------------|-------------------------------|
| Sigmoid    | 1 / (1 + exp(-z))                       | (0, 1)             | sigma(z)(1 - sigma(z))  | Vanishing gradient            |
| Tanh       | (exp(z) - exp(-z)) / (exp(z) + exp(-z)) | (-1, 1)            | 1 - tanh^2(z)           | Zero-centered, still vanishes |
| ReLU       | max(0, z)                               | [0, inf)           | 0 if z < 0, 1 if z > 0  | Default choice, dying ReLU    |
| Leaky ReLU | z if z > 0, alpha*z otherwise           | (-inf, inf)        | alpha or 1              | Fixes dying ReLU              |
| ELU        | z if z > 0, alpha*(exp(z)-1) otherwise  | (-alpha, inf)      | smooth                  | Smooth, slightly slower       |
| GELU       | z * Phi(z)                              | approx (-0.17, inf)| smooth                  | Used in transformers          |
| SiLU/Swish | z * sigmoid(z)                          | approx (-0.28, inf)| smooth                  | Self-gated                    |
| Softmax    | exp(z_k) / sum_j exp(z_j)               | (0, 1)             | Jacobian matrix         | Multi-class output            |
ReLU is the default for hidden layers; GELU/SiLU are the usual choices in transformers and other modern architectures.
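Illustrative NumPy versions of a few of these (the GELU here uses the common tanh approximation rather than the exact Gaussian CDF):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    # Tanh approximation of z * Phi(z)
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```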
Loss Functions
Classification
Binary cross-entropy: L = -[y*log(p) + (1-y)*log(1-p)]
Categorical cross-entropy: L = -sum_k y_k * log(p_k)
Focal loss: L = -alpha * (1-p)^gamma * log(p) # for imbalanced data
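A sketch of the binary losses in NumPy; with alpha = 1 and gamma = 0, focal loss reduces to plain cross-entropy:

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)       # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-12):
    pt = np.where(y == 1, p, 1 - p)      # probability of the true class
    pt = np.clip(pt, eps, 1.0 - eps)
    return -alpha * (1 - pt) ** gamma * np.log(pt)
```

The (1 - pt)^gamma factor down-weights easy, well-classified examples so training focuses on the hard ones.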
Regression
MSE: L = (y - y_hat)^2
MAE: L = |y - y_hat|
Huber: L = { 0.5*(y-y_hat)^2 if |y-y_hat| <= delta
{ delta*|y-y_hat| - 0.5*delta^2 otherwise
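The regression losses side by side in NumPy (per-example, no reduction):

```python
import numpy as np

def mse(y, y_hat):
    return (y - y_hat) ** 2

def mae(y, y_hat):
    return np.abs(y - y_hat)

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    # Quadratic near zero, linear in the tails: robust to outliers
    return np.where(r <= delta, 0.5 * r**2, delta * r - 0.5 * delta**2)
```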
Backpropagation
Compute gradients of the loss w.r.t. all parameters using the chain rule, proceeding backward through the network.
Forward Pass
z_l = W_l * a_{l-1} + b_l # pre-activation
a_l = f_l(z_l) # activation
Backward Pass
delta_L = dL/da_L * f_L'(z_L) # output layer error
delta_l = (W_{l+1}^T * delta_{l+1}) * f_l'(z_l) # backpropagate error
dL/dW_l = delta_l * a_{l-1}^T
dL/db_l = delta_l
def backprop(x, y, weights, biases, activations):
    # Assumes column-vector convention: x has shape (n_in, 1),
    # so delta @ a[l].T is the outer product in the equations above
    # Forward pass: store pre-activations and activations
    a = [x]
    z_list = []
    for W, b, act in zip(weights, biases, activations):
        z = W @ a[-1] + b
        z_list.append(z)
        a.append(act.forward(z))
    # Backward pass
    loss = compute_loss(a[-1], y)
    delta = loss_gradient(a[-1], y) * activations[-1].derivative(z_list[-1])
    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        grads_W.insert(0, delta @ a[l].T)   # a[l] is the input to layer l
        grads_b.insert(0, delta)
        if l > 0:
            delta = (weights[l].T @ delta) * activations[l-1].derivative(z_list[l-1])
    return grads_W, grads_b, loss
Computational graph frameworks (PyTorch, JAX) implement this via automatic differentiation, building and traversing the computation graph dynamically.
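Hand-written backprop is typically validated with a finite-difference gradient check; a minimal version (the tiny linear-plus-squared-loss model is just for illustration):

```python
import numpy as np

def numerical_grad(f, w, eps=1e-6):
    # Central differences: (f(w + eps) - f(w - eps)) / (2 * eps) per coordinate
    g = np.zeros_like(w)
    for i in range(w.size):
        w_p, w_m = w.copy(), w.copy()
        w_p.flat[i] += eps
        w_m.flat[i] -= eps
        g.flat[i] = (f(w_p) - f(w_m)) / (2 * eps)
    return g

# Tiny model: linear layer with squared loss
x = np.array([1.0, -2.0, 0.5])
y = 3.0
loss = lambda w: 0.5 * (w @ x - y) ** 2
w = np.array([0.1, 0.2, -0.3])

analytic = (w @ x - y) * x          # dL/dw from the chain rule
numeric = numerical_grad(loss, w)
```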
Universal Approximation Theorem
Theorem (Cybenko 1989, Hornik 1991): A feedforward network with a single hidden layer of sufficient width can approximate any continuous function on a compact subset of R^d to arbitrary precision.
Caveats:
- Says nothing about how many neurons are needed (could be exponential)
- Says nothing about learnability (can we find the right weights?)
- Deep networks can be exponentially more efficient than shallow ones for many function classes (depth-separation results)
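The theorem can be probed empirically: fix a single hidden ReLU layer with random weights and fit only the output weights by least squares; with enough units this already approximates a smooth 1-D function closely (illustrative sketch; 200 units and sin are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 200
x = np.linspace(-np.pi, np.pi, 400).reshape(-1, 1)
target = np.sin(x).ravel()

W = rng.normal(size=(1, n_hidden))
b = rng.uniform(-np.pi, np.pi, size=n_hidden)
H = np.maximum(0.0, x @ W + b)                  # fixed random ReLU features
c, *_ = np.linalg.lstsq(H, target, rcond=None)  # fit output weights only
approx = H @ c

max_err = np.max(np.abs(approx - target))       # small, but width-dependent
```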
Weight Initialization
Proper initialization prevents vanishing/exploding activations.
Xavier/Glorot Initialization
For layers with sigmoid or tanh activation:
W ~ N(0, 2 / (n_in + n_out)) # or Uniform(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out)))
Maintains variance of activations across layers.
He (Kaiming) Initialization
For ReLU activations (accounts for the fact that ReLU zeros out half the values):
W ~ N(0, 2 / n_in)
Rules of Thumb
| Activation   | Initialization | Variance       |
|--------------|----------------|----------------|
| Sigmoid/Tanh | Xavier         | 2/(n_in+n_out) |
| ReLU         | He             | 2/n_in         |
| SELU         | LeCun          | 1/n_in         |
Biases: initialize to zero (or a small positive constant for ReLU to avoid dead neurons).
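A sketch of both initializers, plus an empirical check that He initialization keeps the activation scale roughly stable through a deep ReLU stack (depth 20 and width 1024 are arbitrary):

```python
import numpy as np

def xavier_normal(n_in, n_out, rng):
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    # Variance 2/n_in compensates for ReLU zeroing half the pre-activations
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
h = rng.normal(size=(512, 1024))          # batch of inputs
for _ in range(20):
    W = he_normal(1024, 1024, rng)
    h = np.maximum(0.0, h @ W.T)          # ReLU layer
scale = h.std()                           # stays O(1), neither vanishing nor exploding
```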
Normalization
Batch Normalization
Normalize activations across the mini-batch:
mu_B = (1/B) * sum x_i
sigma_B^2 = (1/B) * sum (x_i - mu_B)^2
x_hat_i = (x_i - mu_B) / sqrt(sigma_B^2 + epsilon)
y_i = gamma * x_hat_i + beta # learnable scale and shift
- Originally motivated by reducing internal covariate shift (the actual mechanism is debated)
- Acts as regularization (noise from batch statistics)
- Allows higher learning rates
- During inference: use running mean/variance
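A training-mode forward pass for the formulas above (the running statistics used at inference are omitted for brevity):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features) -- normalize each feature across the batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
y = batchnorm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
# Each feature now has mean ~0 and std ~1 over the batch
```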
Layer Normalization
Normalize across features (not batch dimension):
mu_i = (1/d) * sum_j x_{ij}
sigma_i^2 = (1/d) * sum_j (x_{ij} - mu_i)^2
- Independent of batch size
- Preferred for RNNs and transformers
- No discrepancy between training and inference
Other Variants
- Group Normalization: normalize across groups of channels (good for small batches)
- Instance Normalization: normalize each channel independently (style transfer)
- RMSNorm: simplified LayerNorm without centering: y = x / RMS(x) * gamma
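LayerNorm and RMSNorm side by side in NumPy, both normalizing over the feature (last) axis:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Per-sample statistics over the feature axis; batch-size independent
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # LayerNorm without the mean subtraction
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.default_rng(0).normal(size=(4, 16))
ln_out = layer_norm(x, np.ones(16), np.zeros(16))
rms_out = rms_norm(x, np.ones(16))
```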
Dropout
During training, randomly zero out neurons with probability p:
mask ~ Bernoulli(1 - p)
h_dropped = h * mask / (1 - p) # inverted dropout (scale at train time)
At inference: use all neurons (no dropout, no scaling needed with inverted dropout).
Interpretation: approximately trains an ensemble of 2^n subnetworks. Also equivalent to approximate Bayesian inference (MC Dropout).
Typical values: p = 0.5 for hidden layers, p = 0.1-0.3 for input or embeddings.
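Inverted dropout as above, in NumPy:

```python
import numpy as np

def dropout(h, p, rng, training=True):
    # Inverted dropout: scale by 1/(1-p) at train time so that
    # inference needs no change at all
    if not training or p == 0.0:
        return h
    mask = rng.random(h.shape) >= p      # keep each unit with probability 1 - p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = np.ones(10000)
out = dropout(h, p=0.5, rng=rng)
# Roughly half the units are zeroed, yet E[out] == h
```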
Learning Rate Schedules
| Schedule          | Formula / Description                                |
|-------------------|------------------------------------------------------|
| Step decay        | lr * gamma^floor(epoch / step_size)                  |
| Exponential decay | lr_0 * exp(-k * t)                                   |
| Cosine annealing  | lr_min + 0.5*(lr_max - lr_min)*(1 + cos(pi * t / T)) |
| Warmup + decay    | Linear warmup for W steps, then cosine/linear decay  |
| ReduceOnPlateau   | Reduce lr by a factor when the metric plateaus       |
| One-cycle         | Warmup to max lr, then anneal, then final drop       |
Warmup is especially important for Adam and transformers: start with very small lr and linearly increase to target lr over the first few thousand steps.
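A sketch of linear warmup followed by cosine decay (the step counts in the examples are illustrative):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup from near 0 up to max_lr
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```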
Residual Connections
h_{l+1} = h_l + F(h_l) # skip connection
- Enable training of very deep networks (100+ layers)
- Gradient flows directly through skip connections
- Identity mapping provides a "highway" for gradients
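A two-layer residual block in NumPy; zero-initializing F's final projection makes the whole block an exact identity map at initialization (a common trick, shown here for illustration):

```python
import numpy as np

def residual_block(h, W1, b1, W2, b2):
    z = np.maximum(0.0, h @ W1 + b1)   # F's inner ReLU layer
    return h + (z @ W2 + b2)           # add the skip connection: h + F(h)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
W1, b1 = rng.normal(size=(8, 8)), np.zeros(8)
W2, b2 = np.zeros((8, 8)), np.zeros(8)  # zero-init: F(h) = 0 at start
out = residual_block(h, W1, b1, W2, b2)
```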
Practical Training Pipeline
- Architecture: start simple (e.g., 2-3 hidden layers, 256-512 units)
- Initialization: He for ReLU, Xavier for others
- Optimizer: Adam/AdamW with lr=3e-4 as a starting point
- Regularization: dropout (0.1-0.5), weight decay, data augmentation
- Normalization: BatchNorm for CNNs, LayerNorm for transformers
- Schedule: cosine annealing with warmup
- Monitor: train/val loss curves, gradient norms, activation distributions
- Debug: overfit a single batch first to verify the pipeline works