Recurrent Neural Networks

Vanilla RNN

Process sequences by maintaining a hidden state across time steps:

h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y

The same weights (W_hh, W_xh, W_hy) are shared across all time steps.

def rnn_forward(x_sequence, h_0, W_xh, W_hh, b_h, W_hy, b_y):
    h = h_0
    outputs = []
    for x_t in x_sequence:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        y_t = W_hy @ h + b_y
        outputs.append(y_t)
    return outputs, h

Backpropagation Through Time (BPTT)

Unroll the RNN across time steps and apply standard backprop:

dL/dW = sum_{t=1}^{T} dL_t/dW

For gradient at time t flowing back to time k:

dh_t/dh_k = prod_{i=k+1}^{t} diag(1 - h_i^2) * W_hh

Vanishing and Exploding Gradients

The product of Jacobians across T steps leads to:

||dh_t/dh_k|| ~ ||W_hh||^{t-k}

If largest singular value of W_hh > 1: exploding gradients (divergence)
If largest singular value of W_hh < 1: vanishing gradients (can't learn long-range dependencies)

Mitigation Strategies

Problem	Solution
Exploding	Gradient clipping: g = g * max_norm / max(
Vanishing	LSTM/GRU (gating mechanisms)
Vanishing	Skip connections across time
Vanishing	Orthogonal initialization for W_hh
Vanishing	Gradient-friendly activations (ReLU, but risky)

Long Short-Term Memory (LSTM)

Introduce a cell state c_t with gating mechanisms that control information flow:

f_t = sigma(W_f * [h_{t-1}, x_t] + b_f)          # forget gate
i_t = sigma(W_i * [h_{t-1}, x_t] + b_i)          # input gate
c_tilde = tanh(W_c * [h_{t-1}, x_t] + b_c)       # candidate cell state
c_t = f_t * c_{t-1} + i_t * c_tilde               # cell state update
o_t = sigma(W_o * [h_{t-1}, x_t] + b_o)           # output gate
h_t = o_t * tanh(c_t)                             # hidden state

def lstm_cell(x_t, h_prev, c_prev, W, b):
    # W: (4*hidden_size, input_size + hidden_size)
    # Concatenate input and previous hidden state
    combined = np.concatenate([h_prev, x_t])
    gates = W @ combined + b

    H = len(h_prev)
    f = sigmoid(gates[0:H])        # forget
    i = sigmoid(gates[H:2*H])      # input
    c_tilde = np.tanh(gates[2*H:3*H])  # candidate
    o = sigmoid(gates[3*H:4*H])    # output

    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)

    return h, c

Why LSTMs Work

The cell state c_t acts as a "conveyor belt." The forget gate can be close to 1, allowing gradients to flow unchanged across many time steps:

dc_t/dc_{t-1} = f_t    # when f_t ~ 1, gradient passes through unattenuated

Initialization tip: initialize forget gate bias to 1.0 so the LSTM starts by remembering (Jozefowicz et al., 2015).

Parameter Count

For LSTM with input size d and hidden size H:

params = 4 * H * (d + H + 1)    # 4 gates, each with weights for input, hidden, and bias

Gated Recurrent Unit (GRU)

Simplified gating with fewer parameters (2 gates instead of 3, no separate cell state):

z_t = sigma(W_z * [h_{t-1}, x_t] + b_z)           # update gate
r_t = sigma(W_r * [h_{t-1}, x_t] + b_r)           # reset gate
h_tilde = tanh(W_h * [r_t * h_{t-1}, x_t] + b_h)  # candidate
h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde         # interpolate

Update gate z_t: how much of new state to mix in (analogous to forget + input gates combined)
Reset gate r_t: how much of previous state to expose to the candidate

GRU: 3 * H * (d + H + 1) params. Often performs comparably to LSTM with fewer parameters.

LSTM vs GRU

Aspect	LSTM	GRU
Gates	3 (forget, input, output)	2 (update, reset)
Parameters	4H(d+H+1)	3H(d+H+1)
Cell state	Separate c_t	Uses h_t directly
Performance	Slightly better on long sequences	Comparable, faster
Default	When in doubt, use LSTM	Resource-constrained

Bidirectional RNNs

Process sequence in both forward and backward directions:

h_t_forward  = RNN_forward(x_t, h_{t-1}_forward)
h_t_backward = RNN_backward(x_t, h_{t+1}_backward)
h_t = [h_t_forward; h_t_backward]    # concatenation

Captures both past and future context at each position. Cannot be used for autoregressive generation (needs future tokens). Used in NER, POS tagging, speech recognition, BERT.

Sequence-to-Sequence (Seq2Seq)

Encoder-decoder architecture for variable-length input to variable-length output:

Encoder: reads input sequence, produces context vector c = h_T^enc
Decoder: generates output sequence conditioned on c

def seq2seq(encoder, decoder, src_seq, max_len):
    # Encode
    _, context = encoder(src_seq)  # final hidden state

    # Decode
    h = context
    token = BOS_TOKEN
    output = []
    for _ in range(max_len):
        token_embed = embed(token)
        h, logits = decoder(token_embed, h)
        token = argmax(logits)
        if token == EOS_TOKEN:
            break
        output.append(token)
    return output

Bottleneck problem: compressing entire input into fixed-size vector c loses information for long sequences. Solved by attention.

Attention Mechanisms

Bahdanau Attention (Additive)

Instead of using only the final encoder state, attend to all encoder hidden states:

e_{t,s} = v^T * tanh(W_1 * h_s^enc + W_2 * h_t^dec)      # alignment score
alpha_{t,s} = softmax(e_{t,s})                              # attention weights
c_t = sum_s alpha_{t,s} * h_s^enc                           # context vector
h_t^dec = RNN(x_t, h_{t-1}^dec, c_t)                       # decoder update

The decoder can focus on different parts of the input at each output step.

Luong Attention (Multiplicative)

Simpler and often faster:

General:        e_{t,s} = h_t^{dec T} * W * h_s^enc
Dot product:    e_{t,s} = h_t^{dec T} * h_s^enc
Concat:         e_{t,s} = v^T * tanh(W * [h_t^dec; h_s^enc])

Applied after the decoder RNN step (vs before in Bahdanau).

Comparison

Aspect	Bahdanau	Luong
Score fn	Additive (MLP)	Multiplicative (dot/gen)
When	Before decoder step	After decoder step
Context	Fed into decoder input	Concatenated with output
Speed	Slower	Faster

Connectionist Temporal Classification (CTC)

Loss function for sequence labeling when alignment between input and output is unknown (e.g., speech recognition, OCR).

Given input sequence of length T and target of length U (U <= T):

Add blank token to vocabulary
Define all valid alignments: monotonic paths from input to output allowing repetitions and blanks
CTC loss = negative log probability of all valid alignments:

P(y|x) = sum_{pi in B^{-1}(y)} prod_{t=1}^{T} P(pi_t | x)

where B is the many-to-one mapping that removes blanks and merges repeats.

Efficiently computed using forward-backward algorithm (dynamic programming).

CTC Decoding

Greedy: take argmax at each time step, collapse
Beam search: maintain top-k paths, considering blanks and merges
Prefix beam search: merge paths with same output prefix

Deep RNN Variants

Stacked RNNs

Multiple RNN layers where output of layer l feeds into layer l+1:

h_t^l = RNN^l(h_t^{l-1}, h_{t-1}^l)

Typically 2-4 layers. Deeper stacks require residual connections and dropout between layers.

Variational Dropout for RNNs

Use the same dropout mask across all time steps (Gal & Ghahramani, 2016). Standard dropout with different masks per step breaks temporal structure.

Historical Perspective

RNNs were the dominant architecture for sequence tasks (NLP, speech, time series) from ~2014-2017. Transformers (2017) have largely replaced them due to:

Parallelizable training (no sequential dependency)
Better modeling of long-range dependencies
More efficient scaling

RNNs remain relevant for: online/streaming inference, very long sequences with linear memory, and edge devices with limited compute. Mamba and other state-space models revive RNN-like ideas with improved parallelism.