4 min read
On this page

Recurrent Neural Networks

Vanilla RNN

Process sequences by maintaining a hidden state across time steps:

h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y

The same weights (W_hh, W_xh, W_hy) are shared across all time steps.

def rnn_forward(x_sequence, h_0, W_xh, W_hh, b_h, W_hy, b_y):
    h = h_0
    outputs = []
    for x_t in x_sequence:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        y_t = W_hy @ h + b_y
        outputs.append(y_t)
    return outputs, h

Backpropagation Through Time (BPTT)

Unroll the RNN across time steps and apply standard backprop:

dL/dW = sum_{t=1}^{T} dL_t/dW

For gradient at time t flowing back to time k:

dh_t/dh_k = prod_{i=k+1}^{t} diag(1 - h_i^2) * W_hh

Vanishing and Exploding Gradients

The product of Jacobians across T steps leads to:

||dh_t/dh_k|| ~ ||W_hh||^{t-k}
  • If largest singular value of W_hh > 1: exploding gradients (divergence)
  • If largest singular value of W_hh < 1: vanishing gradients (can't learn long-range dependencies)

Mitigation Strategies

Problem Solution
Exploding Gradient clipping: g = g * max_norm / max(
Vanishing LSTM/GRU (gating mechanisms)
Vanishing Skip connections across time
Vanishing Orthogonal initialization for W_hh
Vanishing Gradient-friendly activations (ReLU, but risky)

Long Short-Term Memory (LSTM)

Introduce a cell state c_t with gating mechanisms that control information flow:

f_t = sigma(W_f * [h_{t-1}, x_t] + b_f)          # forget gate
i_t = sigma(W_i * [h_{t-1}, x_t] + b_i)          # input gate
c_tilde = tanh(W_c * [h_{t-1}, x_t] + b_c)       # candidate cell state
c_t = f_t * c_{t-1} + i_t * c_tilde               # cell state update
o_t = sigma(W_o * [h_{t-1}, x_t] + b_o)           # output gate
h_t = o_t * tanh(c_t)                             # hidden state
def lstm_cell(x_t, h_prev, c_prev, W, b):
    # W: (4*hidden_size, input_size + hidden_size)
    # Concatenate input and previous hidden state
    combined = np.concatenate([h_prev, x_t])
    gates = W @ combined + b

    H = len(h_prev)
    f = sigmoid(gates[0:H])        # forget
    i = sigmoid(gates[H:2*H])      # input
    c_tilde = np.tanh(gates[2*H:3*H])  # candidate
    o = sigmoid(gates[3*H:4*H])    # output

    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)

    return h, c

Why LSTMs Work

The cell state c_t acts as a "conveyor belt." The forget gate can be close to 1, allowing gradients to flow unchanged across many time steps:

dc_t/dc_{t-1} = f_t    # when f_t ~ 1, gradient passes through unattenuated

Initialization tip: initialize forget gate bias to 1.0 so the LSTM starts by remembering (Jozefowicz et al., 2015).

Parameter Count

For LSTM with input size d and hidden size H:

params = 4 * H * (d + H + 1)    # 4 gates, each with weights for input, hidden, and bias

Gated Recurrent Unit (GRU)

Simplified gating with fewer parameters (2 gates instead of 3, no separate cell state):

z_t = sigma(W_z * [h_{t-1}, x_t] + b_z)           # update gate
r_t = sigma(W_r * [h_{t-1}, x_t] + b_r)           # reset gate
h_tilde = tanh(W_h * [r_t * h_{t-1}, x_t] + b_h)  # candidate
h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde         # interpolate
  • Update gate z_t: how much of new state to mix in (analogous to forget + input gates combined)
  • Reset gate r_t: how much of previous state to expose to the candidate

GRU: 3 * H * (d + H + 1) params. Often performs comparably to LSTM with fewer parameters.

LSTM vs GRU

Aspect LSTM GRU
Gates 3 (forget, input, output) 2 (update, reset)
Parameters 4H(d+H+1) 3H(d+H+1)
Cell state Separate c_t Uses h_t directly
Performance Slightly better on long sequences Comparable, faster
Default When in doubt, use LSTM Resource-constrained

Bidirectional RNNs

Process sequence in both forward and backward directions:

h_t_forward  = RNN_forward(x_t, h_{t-1}_forward)
h_t_backward = RNN_backward(x_t, h_{t+1}_backward)
h_t = [h_t_forward; h_t_backward]    # concatenation

Captures both past and future context at each position. Cannot be used for autoregressive generation (needs future tokens). Used in NER, POS tagging, speech recognition, BERT.

Sequence-to-Sequence (Seq2Seq)

Encoder-decoder architecture for variable-length input to variable-length output:

Encoder: reads input sequence, produces context vector c = h_T^enc
Decoder: generates output sequence conditioned on c
def seq2seq(encoder, decoder, src_seq, max_len):
    # Encode
    _, context = encoder(src_seq)  # final hidden state

    # Decode
    h = context
    token = BOS_TOKEN
    output = []
    for _ in range(max_len):
        token_embed = embed(token)
        h, logits = decoder(token_embed, h)
        token = argmax(logits)
        if token == EOS_TOKEN:
            break
        output.append(token)
    return output

Bottleneck problem: compressing entire input into fixed-size vector c loses information for long sequences. Solved by attention.

Attention Mechanisms

Bahdanau Attention (Additive)

Instead of using only the final encoder state, attend to all encoder hidden states:

e_{t,s} = v^T * tanh(W_1 * h_s^enc + W_2 * h_t^dec)      # alignment score
alpha_{t,s} = softmax(e_{t,s})                              # attention weights
c_t = sum_s alpha_{t,s} * h_s^enc                           # context vector
h_t^dec = RNN(x_t, h_{t-1}^dec, c_t)                       # decoder update

The decoder can focus on different parts of the input at each output step.

Luong Attention (Multiplicative)

Simpler and often faster:

General:        e_{t,s} = h_t^{dec T} * W * h_s^enc
Dot product:    e_{t,s} = h_t^{dec T} * h_s^enc
Concat:         e_{t,s} = v^T * tanh(W * [h_t^dec; h_s^enc])

Applied after the decoder RNN step (vs before in Bahdanau).

Comparison

Aspect Bahdanau Luong
Score fn Additive (MLP) Multiplicative (dot/gen)
When Before decoder step After decoder step
Context Fed into decoder input Concatenated with output
Speed Slower Faster

Connectionist Temporal Classification (CTC)

Loss function for sequence labeling when alignment between input and output is unknown (e.g., speech recognition, OCR).

Given input sequence of length T and target of length U (U <= T):

  1. Add blank token to vocabulary
  2. Define all valid alignments: monotonic paths from input to output allowing repetitions and blanks
  3. CTC loss = negative log probability of all valid alignments:
P(y|x) = sum_{pi in B^{-1}(y)} prod_{t=1}^{T} P(pi_t | x)

where B is the many-to-one mapping that removes blanks and merges repeats.

Efficiently computed using forward-backward algorithm (dynamic programming).

CTC Decoding

  • Greedy: take argmax at each time step, collapse
  • Beam search: maintain top-k paths, considering blanks and merges
  • Prefix beam search: merge paths with same output prefix

Deep RNN Variants

Stacked RNNs

Multiple RNN layers where output of layer l feeds into layer l+1:

h_t^l = RNN^l(h_t^{l-1}, h_{t-1}^l)

Typically 2-4 layers. Deeper stacks require residual connections and dropout between layers.

Variational Dropout for RNNs

Use the same dropout mask across all time steps (Gal & Ghahramani, 2016). Standard dropout with different masks per step breaks temporal structure.

Historical Perspective

RNNs were the dominant architecture for sequence tasks (NLP, speech, time series) from ~2014-2017. Transformers (2017) have largely replaced them due to:

  • Parallelizable training (no sequential dependency)
  • Better modeling of long-range dependencies
  • More efficient scaling

RNNs remain relevant for: online/streaming inference, very long sequences with linear memory, and edge devices with limited compute. Mamba and other state-space models revive RNN-like ideas with improved parallelism.