Recurrent Neural Networks

Vanilla RNN

Process sequences by maintaining a hidden state across time steps:

h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y

The same weights (W_hh, W_xh, W_hy) are shared across all time steps.

import numpy as np

def rnn_forward(x_sequence, h_0, W_xh, W_hh, b_h, W_hy, b_y):
    # x_sequence: iterable of input vectors x_t; h_0: initial hidden state
    h = h_0
    outputs = []
    for x_t in x_sequence:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # hidden state update
        y_t = W_hy @ h + b_y                        # per-step output
        outputs.append(y_t)
    return outputs, h
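
A quick usage sketch with illustrative shapes (input dim 3, hidden dim 5, two output classes), reusing rnn_forward and numpy from above; the same weight matrices are applied at every step:

rng = np.random.default_rng(0)
d, H, T = 3, 5, 4
x_seq = [rng.standard_normal(d) for _ in range(T)]
W_xh, W_hh = rng.standard_normal((H, d)) * 0.1, rng.standard_normal((H, H)) * 0.1
W_hy, b_h, b_y = rng.standard_normal((2, H)) * 0.1, np.zeros(H), np.zeros(2)

outputs, h_T = rnn_forward(x_seq, np.zeros(H), W_xh, W_hh, b_h, W_hy, b_y)
print(len(outputs), outputs[0].shape, h_T.shape)   # 4 (2,) (5,)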

Backpropagation Through Time (BPTT)

Unroll the RNN across time steps and apply standard backprop:

dL/dW = sum_{t=1}^{T} dL_t/dW

For gradient at time t flowing back to time k:

dh_t/dh_k = prod_{i=k+1}^{t} diag(1 - h_i^2) * W_hh

Vanishing and Exploding Gradients

The product of Jacobians across T steps leads to:

||dh_t/dh_k|| ~ ||W_hh||^{t-k}
  • If largest singular value of W_hh > 1: exploding gradients (divergence)
  • If largest singular value of W_hh < 1: vanishing gradients (can't learn long-range dependencies)
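For a sense of scale: over 100 time steps, a largest singular value of 1.1 compounds to roughly 1.1^100 ≈ 1.4e4, while 0.9 shrinks to 0.9^100 ≈ 2.7e-5.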

Mitigation Strategies

| Problem   | Solution                                                     |
|-----------|--------------------------------------------------------------|
| Exploding | Gradient clipping: g = g * max_norm / max(norm(g), max_norm) |
| Vanishing | LSTM/GRU (gating mechanisms)                                 |
| Vanishing | Skip connections across time                                 |
| Vanishing | Orthogonal initialization for W_hh                           |
| Vanishing | Gradient-friendly activations (ReLU, but risky)              |
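
A minimal sketch of clipping by global gradient norm (max_norm is a hyperparameter, commonly in the range 1-5):

import numpy as np

def clip_gradients(grads, max_norm):
    # Global L2 norm across all parameter gradients
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # scale <= 1: gradients are rescaled only when total_norm exceeds max_norm
    scale = max_norm / max(total_norm, max_norm)
    return [g * scale for g in grads]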

Long Short-Term Memory (LSTM)

Introduce a cell state c_t with gating mechanisms that control information flow:

f_t = sigma(W_f * [h_{t-1}, x_t] + b_f)          # forget gate
i_t = sigma(W_i * [h_{t-1}, x_t] + b_i)          # input gate
c_tilde = tanh(W_c * [h_{t-1}, x_t] + b_c)       # candidate cell state
c_t = f_t * c_{t-1} + i_t * c_tilde               # cell state update
o_t = sigma(W_o * [h_{t-1}, x_t] + b_o)           # output gate
h_t = o_t * tanh(c_t)                             # hidden state

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    # W: (4*hidden_size, hidden_size + input_size), b: (4*hidden_size,)
    # Concatenate previous hidden state and current input
    combined = np.concatenate([h_prev, x_t])
    gates = W @ combined + b

    H = len(h_prev)
    f = sigmoid(gates[0:H])            # forget gate
    i = sigmoid(gates[H:2*H])          # input gate
    c_tilde = np.tanh(gates[2*H:3*H])  # candidate cell state
    o = sigmoid(gates[3*H:4*H])        # output gate

    c = f * c_prev + i * c_tilde       # cell state update
    h = o * np.tanh(c)                 # hidden state

    return h, c

Why LSTMs Work

The cell state c_t acts as a "conveyor belt." The forget gate can be close to 1, allowing gradients to flow unchanged across many time steps:

dc_t/dc_{t-1} ~ f_t    # holding the gates fixed; when f_t ~ 1, the gradient passes through unattenuated

Initialization tip: initialize forget gate bias to 1.0 so the LSTM starts by remembering (Jozefowicz et al., 2015).
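
A minimal sketch of that tip for the lstm_cell layout above, where the forget gate occupies the first H rows of W and the first H entries of b (the weight scale here is an illustrative choice, not a recommendation):

def init_lstm_params(input_size, hidden_size, forget_bias=1.0, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((4 * hidden_size, hidden_size + input_size)) * 0.1
    b = np.zeros(4 * hidden_size)
    b[0:hidden_size] = forget_bias   # forget gate open by default -> remembers early in training
    return W, b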

Parameter Count

For LSTM with input size d and hidden size H:

params = 4 * H * (d + H + 1)    # 4 transforms (3 gates + candidate), each with input weights, recurrent weights, and a bias

Gated Recurrent Unit (GRU)

Simplified gating with fewer parameters (2 gates instead of 3, no separate cell state):

z_t = sigma(W_z * [h_{t-1}, x_t] + b_z)           # update gate
r_t = sigma(W_r * [h_{t-1}, x_t] + b_r)           # reset gate
h_tilde = tanh(W_h * [r_t * h_{t-1}, x_t] + b_h)  # candidate
h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde         # interpolate
  • Update gate z_t: how much of new state to mix in (analogous to forget + input gates combined)
  • Reset gate r_t: how much of previous state to expose to the candidate

GRU: 3 * H * (d + H + 1) params. Often performs comparably to LSTM with fewer parameters.
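
A quick sanity check of both counts (note that some frameworks use two bias vectors per transform, which adds an extra H per gate and changes the totals slightly):

def lstm_params(d, H):
    return 4 * H * (d + H + 1)   # forget, input, output gates + candidate

def gru_params(d, H):
    return 3 * H * (d + H + 1)   # update, reset gates + candidate

print(lstm_params(300, 512))   # 1665024
print(gru_params(300, 512))    # 1248768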

LSTM vs GRU

| Aspect      | LSTM                               | GRU                   |
|-------------|------------------------------------|-----------------------|
| Gates       | 3 (forget, input, output)          | 2 (update, reset)     |
| Parameters  | 4H(d+H+1)                          | 3H(d+H+1)             |
| Cell state  | Separate c_t                       | Uses h_t directly     |
| Performance | Slightly better on long sequences  | Comparable, faster    |
| Default     | When in doubt, use LSTM            | Resource-constrained  |

Bidirectional RNNs

Process sequence in both forward and backward directions:

h_t_forward  = RNN_forward(x_t, h_{t-1}_forward)
h_t_backward = RNN_backward(x_t, h_{t+1}_backward)
h_t = [h_t_forward; h_t_backward]    # concatenation

Captures both past and future context at each position. Cannot be used for autoregressive generation (it needs future tokens). Used in NER, POS tagging, and speech recognition; BERT applies the same bidirectional-context idea, though with Transformers rather than RNNs.
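
A minimal sketch of the two passes using the vanilla recurrence from above (the *_f / *_b parameters are independent per direction):

def birnn_hidden_states(x_sequence, h0_f, W_xh_f, W_hh_f, b_f,
                        h0_b, W_xh_b, W_hh_b, b_b):
    # Forward direction: left to right
    h, hs_f = h0_f, []
    for x_t in x_sequence:
        h = np.tanh(W_xh_f @ x_t + W_hh_f @ h + b_f)
        hs_f.append(h)
    # Backward direction: right to left, then re-align with forward positions
    h, hs_b = h0_b, []
    for x_t in reversed(x_sequence):
        h = np.tanh(W_xh_b @ x_t + W_hh_b @ h + b_b)
        hs_b.append(h)
    hs_b = hs_b[::-1]
    # h_t = [h_t_forward; h_t_backward]
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_f, hs_b)]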

Sequence-to-Sequence (Seq2Seq)

Encoder-decoder architecture for variable-length input to variable-length output:

Encoder: reads input sequence, produces context vector c = h_T^enc
Decoder: generates output sequence conditioned on c
def seq2seq(encoder, decoder, src_seq, max_len):
    # Encode
    _, context = encoder(src_seq)  # final hidden state

    # Decode
    h = context
    token = BOS_TOKEN
    output = []
    for _ in range(max_len):
        token_embed = embed(token)
        h, logits = decoder(token_embed, h)
        token = argmax(logits)
        if token == EOS_TOKEN:
            break
        output.append(token)
    return output

Bottleneck problem: compressing entire input into fixed-size vector c loses information for long sequences. Solved by attention.

Attention Mechanisms

Bahdanau Attention (Additive)

Instead of using only the final encoder state, attend to all encoder hidden states:

e_{t,s} = v^T * tanh(W_1 * h_s^enc + W_2 * h_t^dec)      # alignment score
alpha_{t,s} = softmax(e_{t,s})                              # attention weights
c_t = sum_s alpha_{t,s} * h_s^enc                           # context vector
h_t^dec = RNN(x_t, h_{t-1}^dec, c_t)                       # decoder update

The decoder can focus on different parts of the input at each output step.
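
A minimal numpy sketch of one decoder step's attention; the shapes and names (H_enc, W1, W2, v) are illustrative:

def bahdanau_attention(h_dec, H_enc, W1, W2, v):
    # h_dec: (H,) current decoder state; H_enc: (S, H) all encoder hidden states
    # W1, W2: (A, H) project into an alignment space of size A; v: (A,)
    scores = np.tanh(H_enc @ W1.T + h_dec @ W2.T) @ v   # (S,) alignment scores e_{t,s}
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                 # softmax over source positions
    context = alpha @ H_enc                              # (H,) context vector c_t
    return context, alpha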

Luong Attention (Multiplicative)

Simpler and often faster:

General:        e_{t,s} = h_t^{dec T} * W * h_s^enc
Dot product:    e_{t,s} = h_t^{dec T} * h_s^enc
Concat:         e_{t,s} = v^T * tanh(W * [h_t^dec; h_s^enc])

Applied after the decoder RNN step (vs before in Bahdanau).
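
The dot-product and general scores, sketched with the same shapes as above (W is a hypothetical (H_dec, H_enc) matrix):

def luong_scores(h_dec, H_enc, W=None):
    if W is None:
        # Dot product: requires decoder and encoder states of the same size
        return H_enc @ h_dec          # (S,)
    # General: e_{t,s} = h_dec^T W h_enc_s
    return H_enc @ (W.T @ h_dec)      # (S,)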

Comparison

| Aspect   | Bahdanau                | Luong                        |
|----------|-------------------------|------------------------------|
| Score fn | Additive (MLP)          | Multiplicative (dot/general) |
| When     | Before decoder step     | After decoder step           |
| Context  | Fed into decoder input  | Concatenated with output     |
| Speed    | Slower                  | Faster                       |

Connectionist Temporal Classification (CTC)

Loss function for sequence labeling when alignment between input and output is unknown (e.g., speech recognition, OCR).

Given input sequence of length T and target of length U (U <= T):

  1. Add blank token to vocabulary
  2. Define all valid alignments: monotonic paths from input to output allowing repetitions and blanks
  3. CTC loss = negative log of the total probability summed over all valid alignments:
P(y|x) = sum_{pi in B^{-1}(y)} prod_{t=1}^{T} P(pi_t | x)

where B is the many-to-one mapping that removes blanks and merges repeats.

Efficiently computed using forward-backward algorithm (dynamic programming).

CTC Decoding

  • Greedy: take argmax at each time step, collapse
  • Beam search: maintain top-k paths, considering blanks and merges
  • Prefix beam search: merge paths with same output prefix
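
A minimal sketch of the greedy decoder (treating index 0 as the blank is an assumption):

def ctc_greedy_decode(log_probs, blank=0):
    # log_probs: (T, V) per-frame log probabilities over the vocabulary incl. blank
    best_path = np.argmax(log_probs, axis=1)   # argmax at each time step
    decoded, prev = [], blank
    for token in best_path:
        # Collapse repeats of the previous symbol, then drop blanks (the mapping B)
        if token != prev and token != blank:
            decoded.append(int(token))
        prev = token
    return decoded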

Deep RNN Variants

Stacked RNNs

Multiple RNN layers where output of layer l feeds into layer l+1:

h_t^l = RNN^l(h_t^{l-1}, h_{t-1}^l)

Typically 2-4 layers. Deeper stacks generally need residual connections and dropout between layers to train well.
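
A minimal sketch of stacking the vanilla recurrence from above; layer_params is a hypothetical list of per-layer parameter dicts:

def stacked_rnn_forward(x_sequence, layer_params):
    inputs = x_sequence
    for p in layer_params:                 # layers l = 1..L
        h, layer_out = p["h_0"], []
        for x_t in inputs:
            h = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ h + p["b_h"])
            layer_out.append(h)
        inputs = layer_out                 # h_t^l feeds layer l+1
    return inputs                          # hidden states of the top layer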

Variational Dropout for RNNs

Use the same dropout mask across all time steps (Gal & Ghahramani, 2016). Standard dropout, which samples a new mask at every step, injects fresh noise into the recurrence at each step and hurts performance.
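
A minimal sketch of the idea: sample one mask per sequence and reuse it at every time step (the rate and shapes are illustrative):

def variational_dropout_masks(input_size, hidden_size, rate=0.3, seed=0):
    rng = np.random.default_rng(seed)
    keep = 1.0 - rate
    # One mask for the inputs and one for the recurrent state, fixed for the whole sequence
    mask_x = rng.binomial(1, keep, size=input_size) / keep
    mask_h = rng.binomial(1, keep, size=hidden_size) / keep
    return mask_x, mask_h

# Inside the recurrence, the same masks are applied at every step:
#   h = np.tanh(W_xh @ (mask_x * x_t) + W_hh @ (mask_h * h) + b_h)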

Historical Perspective

RNNs were the dominant architecture for sequence tasks (NLP, speech, time series) from ~2014-2017. Transformers (2017) have largely replaced them due to:

  • Parallelizable training (no sequential dependency)
  • Better modeling of long-range dependencies
  • More efficient scaling

RNNs remain relevant for: online/streaming inference, very long sequences with linear memory, and edge devices with limited compute. Mamba and other state-space models revive RNN-like ideas with improved parallelism.