Recurrent Neural Networks
Vanilla RNN
Process sequences by maintaining a hidden state across time steps:
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y
The same weights (W_hh, W_xh, W_hy) are shared across all time steps.
    import numpy as np

    def rnn_forward(x_sequence, h_0, W_xh, W_hh, b_h, W_hy, b_y):
        h = h_0
        outputs = []
        for x_t in x_sequence:
            h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # hidden state update
            y_t = W_hy @ h + b_y                      # per-step output
            outputs.append(y_t)
        return outputs, h
Backpropagation Through Time (BPTT)
Unroll the RNN across time steps and apply standard backprop:
dL/dW = sum_{t=1}^{T} dL_t/dW
For gradient at time t flowing back to time k:
dh_t/dh_k = prod_{i=k+1}^{t} diag(1 - h_i^2) * W_hh
Vanishing and Exploding Gradients
The product of Jacobians across T steps leads to:
||dh_t/dh_k|| ~ ||W_hh||^{t-k}
- If largest singular value of W_hh > 1: exploding gradients (divergence)
- If largest singular value of W_hh < 1: vanishing gradients (can't learn long-range dependencies)
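A tiny numerical sketch of this effect, using an orthogonal W_hh so the norm changes by exactly the scale factor at each step (and ignoring the diag(1 - h_i^2) term, which only shrinks things further):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 32
Q, _ = np.linalg.qr(rng.standard_normal((H, H)))  # orthogonal: all singular values = 1

g = rng.standard_normal(H)
norms = {}
for scale, label in [(1.2, "exploding"), (0.8, "vanishing")]:
    v = g.copy()
    for _ in range(50):            # 50 steps of backprop through time
        v = (scale * Q).T @ v      # gradient picks up one factor of W_hh per step
    norms[label] = np.linalg.norm(v)

# norm grows like 1.2^50 (~9100x) or shrinks like 0.8^50 (~1e-5x)
```

Fifty steps is well within the length of a typical sentence or audio clip, so both regimes show up in practice.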
Mitigation Strategies
| Problem   | Solution                                                   |
|-----------|------------------------------------------------------------|
| Exploding | Gradient clipping: g = g * max_norm / max(||g||, max_norm) |
| Vanishing | LSTM/GRU (gating mechanisms)                               |
| Vanishing | Skip connections across time                               |
| Vanishing | Orthogonal initialization for W_hh                         |
| Vanishing | Gradient-friendly activations (ReLU, but risky)            |
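The clipping rule above can be implemented directly (a minimal sketch):

```python
import numpy as np

def clip_gradient(g, max_norm):
    # If ||g|| <= max_norm, max() returns max_norm and g is unchanged;
    # otherwise g is rescaled to have norm exactly max_norm.
    return g * max_norm / max(np.linalg.norm(g), max_norm)
```

A gradient of norm 5 clipped at 1.0 keeps its direction but comes out with norm 1.0; gradients already under the threshold pass through untouched.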
Long Short-Term Memory (LSTM)
Introduce a cell state c_t with gating mechanisms that control information flow:
f_t = sigma(W_f * [h_{t-1}, x_t] + b_f) # forget gate
i_t = sigma(W_i * [h_{t-1}, x_t] + b_i) # input gate
c_tilde = tanh(W_c * [h_{t-1}, x_t] + b_c) # candidate cell state
c_t = f_t * c_{t-1} + i_t * c_tilde # cell state update
o_t = sigma(W_o * [h_{t-1}, x_t] + b_o) # output gate
h_t = o_t * tanh(c_t) # hidden state
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_cell(x_t, h_prev, c_prev, W, b):
        # W: (4*hidden_size, input_size + hidden_size), b: (4*hidden_size,)
        # Concatenate previous hidden state and input, compute all four gates at once
        combined = np.concatenate([h_prev, x_t])
        gates = W @ combined + b
        H = len(h_prev)
        f = sigmoid(gates[0:H])            # forget gate
        i = sigmoid(gates[H:2*H])          # input gate
        c_tilde = np.tanh(gates[2*H:3*H])  # candidate cell state
        o = sigmoid(gates[3*H:4*H])        # output gate
        c = f * c_prev + i * c_tilde       # cell state update
        h = o * np.tanh(c)                 # hidden state
        return h, c
Why LSTMs Work
The cell state c_t acts as a "conveyor belt." The forget gate can be close to 1, allowing gradients to flow unchanged across many time steps:
dc_t/dc_{t-1} = f_t # when f_t ~ 1, gradient passes through unattenuated
Initialization tip: initialize forget gate bias to 1.0 so the LSTM starts by remembering (Jozefowicz et al., 2015).
Parameter Count
For LSTM with input size d and hidden size H:
params = 4 * H * (d + H + 1) # 4 gates, each with weights for input, hidden, and bias
Gated Recurrent Unit (GRU)
Simplified gating with fewer parameters (2 gates instead of 3, no separate cell state):
z_t = sigma(W_z * [h_{t-1}, x_t] + b_z) # update gate
r_t = sigma(W_r * [h_{t-1}, x_t] + b_r) # reset gate
h_tilde = tanh(W_h * [r_t * h_{t-1}, x_t] + b_h) # candidate
h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde # interpolate
- Update gate z_t: how much of new state to mix in (analogous to forget + input gates combined)
- Reset gate r_t: how much of previous state to expose to the candidate
GRU: 3 * H * (d + H + 1) params. Often performs comparably to LSTM with fewer parameters.
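The GRU equations can be sketched in the same style as the LSTM code above, assuming a fused weight matrix W of shape (3H, d+H) with rows ordered [update; reset; candidate] (a layout choice for this sketch, not a fixed convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, W, b):
    # W: (3*H, d + H), rows ordered [update; reset; candidate]; b: (3*H,)
    H = len(h_prev)
    combined = np.concatenate([h_prev, x_t])
    z = sigmoid(W[0:H] @ combined + b[0:H])      # update gate
    r = sigmoid(W[H:2*H] @ combined + b[H:2*H])  # reset gate
    # Candidate sees the reset-gated previous state, not h_prev itself
    combined_r = np.concatenate([r * h_prev, x_t])
    h_tilde = np.tanh(W[2*H:3*H] @ combined_r + b[2*H:3*H])
    return (1 - z) * h_prev + z * h_tilde        # interpolate old and new state
```

Note there is no cell state and no output gate: the interpolation in the last line plays the role of the LSTM's forget/input pair.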
LSTM vs GRU
| Aspect      | LSTM                              | GRU                  |
|-------------|-----------------------------------|----------------------|
| Gates       | 3 (forget, input, output)         | 2 (update, reset)    |
| Parameters  | 4H(d+H+1)                         | 3H(d+H+1)            |
| Cell state  | Separate c_t                      | Uses h_t directly    |
| Performance | Slightly better on long sequences | Comparable, faster   |
| Default     | When in doubt, use LSTM           | Resource-constrained |
Bidirectional RNNs
Process sequence in both forward and backward directions:
h_t_forward = RNN_forward(x_t, h_{t-1}_forward)
h_t_backward = RNN_backward(x_t, h_{t+1}_backward)
h_t = [h_t_forward; h_t_backward] # concatenation
Captures both past and future context at each position, but cannot be used for autoregressive generation (it needs future tokens). Used in NER, POS tagging, and speech recognition; the same bidirectional-context idea appears in ELMo (stacked BiLSTMs) and, via attention rather than recurrence, in BERT.
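The two passes can be sketched by running the same tanh RNN twice and concatenating per position (rnn_pass/birnn are hypothetical helper names, not library functions):

```python
import numpy as np

def rnn_pass(xs, W_xh, W_hh, b_h, h0):
    """Run a simple tanh RNN over xs, returning the hidden state at every step."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hs.append(h)
    return hs

def birnn(xs, fwd_params, bwd_params, h0):
    hs_f = rnn_pass(xs, *fwd_params, h0)              # left-to-right
    hs_b = rnn_pass(xs[::-1], *bwd_params, h0)[::-1]  # right-to-left, re-aligned
    # Concatenate forward and backward states at each position
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_f, hs_b)]
```

Each position's representation has size 2H, which is why layers that consume BiRNN output are sized accordingly.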
Sequence-to-Sequence (Seq2Seq)
Encoder-decoder architecture for variable-length input to variable-length output:
Encoder: reads input sequence, produces context vector c = h_T^enc
Decoder: generates output sequence conditioned on c
    def seq2seq(encoder, decoder, src_seq, max_len):
        # Encode: final hidden state summarizes the source sequence
        _, context = encoder(src_seq)
        # Decode autoregressively, feeding each predicted token back in
        h = context
        token = BOS_TOKEN
        output = []
        for _ in range(max_len):
            token_embed = embed(token)
            h, logits = decoder(token_embed, h)
            token = argmax(logits)
            if token == EOS_TOKEN:
                break
            output.append(token)
        return output
Bottleneck problem: compressing entire input into fixed-size vector c loses information for long sequences. Solved by attention.
Attention Mechanisms
Bahdanau Attention (Additive)
Instead of using only the final encoder state, attend to all encoder hidden states:
e_{t,s} = v^T * tanh(W_1 * h_s^enc + W_2 * h_t^dec) # alignment score
alpha_{t,s} = softmax(e_{t,s}) # attention weights
c_t = sum_s alpha_{t,s} * h_s^enc # context vector
h_t^dec = RNN(x_t, h_{t-1}^dec, c_t) # decoder update
The decoder can focus on different parts of the input at each output step.
Luong Attention (Multiplicative)
Simpler and often faster:
General: e_{t,s} = h_t^{dec T} * W * h_s^enc
Dot product: e_{t,s} = h_t^{dec T} * h_s^enc
Concat: e_{t,s} = v^T * tanh(W * [h_t^dec; h_s^enc])
Applied after the decoder RNN step (vs before in Bahdanau).
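A minimal sketch of the dot-product variant (assuming encoder and decoder hidden sizes match, which the dot score requires):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def dot_attention(h_dec, enc_states):
    # enc_states: (S, H) encoder hidden states; h_dec: (H,) current decoder state
    scores = enc_states @ h_dec  # e_{t,s} = h_dec . h_s, one score per source position
    alpha = softmax(scores)      # attention weights, sum to 1
    context = alpha @ enc_states # weighted sum of encoder states
    return context, alpha
```

The general variant just inserts a learned matrix between the two states; the concat variant replaces the dot product with a small MLP, as in Bahdanau.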
Comparison
| Aspect   | Bahdanau               | Luong                    |
|----------|------------------------|--------------------------|
| Score fn | Additive (MLP)         | Multiplicative (dot/gen) |
| When     | Before decoder step    | After decoder step       |
| Context  | Fed into decoder input | Concatenated with output |
| Speed    | Slower                 | Faster                   |
Connectionist Temporal Classification (CTC)
Loss function for sequence labeling when alignment between input and output is unknown (e.g., speech recognition, OCR).
Given input sequence of length T and target of length U (U <= T):
- Add blank token to vocabulary
- Define all valid alignments: monotonic paths from input to output allowing repetitions and blanks
- CTC loss = negative log of the total probability summed over all valid alignments:
P(y|x) = sum_{pi in B^{-1}(y)} prod_{t=1}^{T} P(pi_t | x)
where B is the many-to-one mapping that removes blanks and merges repeats.
Efficiently computed using forward-backward algorithm (dynamic programming).
CTC Decoding
- Greedy: take argmax at each time step, collapse
- Beam search: maintain top-k paths, considering blanks and merges
- Prefix beam search: merge paths with same output prefix
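Greedy decoding is just collapse-then-remove-blanks over the per-step argmax (a minimal sketch; treating index 0 as the blank is an assumption of this example):

```python
def ctc_greedy_decode(per_step_argmax, blank=0):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    out, prev = [], None
    for t in per_step_argmax:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```

For example, `ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0])` returns `[1, 1, 2]`: the blank between the two 1s is what lets CTC emit the same label twice in a row.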
Deep RNN Variants
Stacked RNNs
Multiple RNN layers where output of layer l feeds into layer l+1:
h_t^l = RNN^l(h_t^{l-1}, h_{t-1}^l)
Typically 2-4 layers. Deeper stacks generally need residual connections and dropout between layers to train well.
Variational Dropout for RNNs
Use the same dropout mask across all time steps (Gal & Ghahramani, 2016). Standard dropout with different masks per step breaks temporal structure.
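A sketch of the idea: sample one inverted-dropout mask per sequence and reuse it at every time step (the helper name is hypothetical):

```python
import numpy as np

def variational_mask(H, p_drop, rng):
    # Inverted dropout: kept units are scaled by 1/keep so expectations match at test time
    keep = 1.0 - p_drop
    return (rng.random(H) < keep).astype(float) / keep

rng = np.random.default_rng(0)
mask = variational_mask(8, 0.5, rng)  # sampled ONCE for the whole sequence
h = np.zeros(8)
for x in np.ones((3, 8)):             # 3 time steps
    h = np.tanh(x + h) * mask         # the SAME mask is applied at every step
```

With per-step masks, a unit that survives at step t can be zeroed at step t+1, which corrupts the recurrent signal; a fixed mask drops the same units for the entire sequence.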
Historical Perspective
RNNs were the dominant architecture for sequence tasks (NLP, speech, time series) from ~2014-2017. Transformers (2017) have largely replaced them due to:
- Parallelizable training (no sequential dependency)
- Better modeling of long-range dependencies
- More efficient scaling
RNNs remain relevant for: online/streaming inference, very long sequences with linear memory, and edge devices with limited compute. Mamba and other state-space models revive RNN-like ideas with improved parallelism.