Lec4 Recurrent Neural Networks

Motivation: N-gram and feedforward models have limited context (N-1 tokens). RNNs provide theoretically infinite context.

Three Types of Sequence Models

  • Recurrence: Condition representations on an encoding of history
  • Convolution: Condition representations on local context
  • Attention: Condition representations on weighted average of all tokens

Basic RNN (Elman, 1990)

A sequence model \(f_\theta(x_1, \ldots, x_{|x|}) \rightarrow h_1, \ldots, h_{|x|}\) that transforms a hidden state at each step:

\[h_t = \sigma(W_h h_{t-1} + W_x x_t + b)\]

where \(\sigma\) is activation (tanh, relu), \(h_t \in \mathbb{R}^d\) is hidden state, parameters \(\theta = \{W_h \in \mathbb{R}^{d \times d}, W_x \in \mathbb{R}^{d \times d_{in}}, b \in \mathbb{R}^d\}\)

For language modeling: \(p_\theta(\cdot \mid x_{<t}) = \text{softmax}(W h_t)\)
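A minimal NumPy sketch of this recurrence and LM head (the dimensions, weights, and names here are toy assumptions, not from the lecture):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One Elman RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def lm_probs(h_t, W_out):
    """Next-token distribution: softmax(W h_t)."""
    logits = W_out @ h_t
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

# toy sizes (hypothetical): hidden d=4, input d_in=3, vocab V=5
rng = np.random.default_rng(0)
d, d_in, V = 4, 3, 5
W_h, W_x, b = rng.normal(size=(d, d)), rng.normal(size=(d, d_in)), np.zeros(d)
W_out = rng.normal(size=(V, d))

h = np.zeros(d)                         # h_0
for x_t in rng.normal(size=(6, d_in)):  # run over a length-6 input sequence
    h = rnn_step(h, x_t, W_h, W_x, b)
p = lm_probs(h, W_out)                  # p(. | x_<t)
```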

RNN Training

Backpropagation Through Time (BPTT): Unroll the RNN into a DAG computation graph, run standard backprop.

  • Same parameters at each step; gradients are accumulated
  • Training objective (MLE): \(\min -\sum_{x \in D_{train}} \sum_t \log p_\theta(x_t \mid x_{<t})\)
  • Parallelization is difficult: Computing loss at step \(t\) requires \(h_t\), which depends on \(h_{t-1}, h_{t-2}, \ldots\)
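The MLE objective for a single sequence can be sketched as follows (NumPy, teacher forcing; the token IDs, embedding table, and sizes are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_nll(tokens, embed, W_h, W_x, b, W_out):
    """-sum_t log p(x_t | x_<t): feed the true previous token (teacher forcing)."""
    h = np.zeros(W_h.shape[0])
    nll = 0.0
    for t in range(1, len(tokens)):
        h = np.tanh(W_h @ h + W_x @ embed[tokens[t - 1]] + b)  # consume x_{t-1}
        nll -= np.log(softmax(W_out @ h)[tokens[t]])           # score x_t
    return nll

rng = np.random.default_rng(0)
V, d, d_in = 5, 4, 3
embed = rng.normal(size=(V, d_in))
loss = sequence_nll([1, 3, 0, 2], embed,
                    rng.normal(size=(d, d)), rng.normal(size=(d, d_in)),
                    np.zeros(d), rng.normal(size=(V, d)))
```

Note how each step's hidden state depends on the previous one, which is exactly what blocks parallelization over \(t\).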

RNN Inference

Generate one token → use new hidden state → repeat until [EOS]

  • Pros: Only need to store previous hidden state (constant memory), \(O(1)\) per step, \(O(T)\) total for length \(T\)
  • Cons: Sequential, cannot parallelize
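The generation loop above, sketched in NumPy with greedy decoding (the parameters and the [EOS] id are toy assumptions); note that only the current hidden state is kept between steps:

```python
import numpy as np

def generate(h0, start_id, embed, W_h, W_x, b, W_out, eos_id, max_len=20):
    """Greedy autoregressive decoding; memory is O(1) in sequence length."""
    h, tok, out = h0, start_id, []
    for _ in range(max_len):
        h = np.tanh(W_h @ h + W_x @ embed[tok] + b)  # one O(1) recurrent step
        tok = int(np.argmax(W_out @ h))              # greedy next token
        if tok == eos_id:                            # stop at [EOS]
            break
        out.append(tok)
    return out

rng = np.random.default_rng(0)
V, d, d_in = 5, 4, 3
out = generate(np.zeros(d), 0, rng.normal(size=(V, d_in)),
               rng.normal(size=(d, d)), rng.normal(size=(d, d_in)), np.zeros(d),
               rng.normal(size=(V, d)), eos_id=4)
```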

Vanishing Gradients

Gradients decrease as they propagate back: \(\frac{\partial L}{\partial h_0}\) becomes tiny.

Why? \(\frac{\partial h_T}{\partial h_t} = \prod_{t'=t}^{T-1} \frac{\partial h_{t'+1}}{\partial h_{t'}}\), where \(\frac{\partial h_{t'+1}}{\partial h_{t'}} = \text{diag}(\tanh'(\cdot))\, W_h\)

When the dominant eigenvalue of \(W_h\) is less than 1, gradients vanish exponentially in the distance \(T - t\). Implication: the model cannot learn long-range dependencies!
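A quick numerical check of this claim (NumPy sketch; the recurrence here ignores the input term for simplicity, and the weights are scaled so the largest singular value of \(W_h\) is 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_h = rng.normal(size=(d, d))
W_h *= 0.5 / np.linalg.norm(W_h, 2)   # scale largest singular value to 0.5

def step_jacobian(h):
    """Jacobian of h' = tanh(W_h h): diag(tanh'(W_h h)) W_h."""
    a = W_h @ h
    return np.diag(1.0 - np.tanh(a) ** 2) @ W_h

# chain Jacobians over 30 timesteps and watch the gradient norm collapse
J, h, norms = np.eye(d), rng.normal(size=d), []
for _ in range(30):
    J = step_jacobian(h) @ J
    h = np.tanh(W_h @ h)
    norms.append(np.linalg.norm(J))
```

Since every per-step Jacobian has spectral norm at most 0.5 here, the chained Jacobian shrinks at least as fast as \(0.5^k\).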

Gating: The Solution

Basic idea: Pass information across timesteps with learned gates.

\[h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]

where \(z_t = \sigma(W_z x_t + U_z h_{t-1})\) is update gate, \(\tilde{h}_t\) is candidate state.

  • When \(z \to 0\): retain long-term info (\(\frac{\partial h_{t_2}}{\partial h_{t_1}} = I\), the identity)
  • When \(z \to 1\): overwrite with the candidate state \(\tilde{h}_t\)

GRU (Cho et al., 2014)

2-gate architecture:

  • Update gate \(z_t = \sigma(W_z x_t + U_z h_{t-1})\): should I update the previous hidden state?
  • Reset gate \(r_t = \sigma(W_r x_t + U_r h_{t-1})\): should I use the previous hidden state when computing the candidate?

\[\tilde{h}_t = \tanh(W_h x_t + U_h(r_t \odot h_{t-1})), \quad h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t\]
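A single GRU step in NumPy following the equations above (the weight shapes, dictionary layout, and small random initialization are toy assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(h_prev, x_t, P):
    """One GRU step (Cho et al., 2014); P holds the six weight matrices."""
    z = sigmoid(P["W_z"] @ x_t + P["U_z"] @ h_prev)             # update gate
    r = sigmoid(P["W_r"] @ x_t + P["U_r"] @ h_prev)             # reset gate
    h_cand = np.tanh(P["W_h"] @ x_t + P["U_h"] @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_cand                        # gated interpolation

rng = np.random.default_rng(0)
d, d_in = 4, 3
P = {k: rng.normal(scale=0.1, size=(d, d if k.startswith("U") else d_in))
     for k in ["W_z", "U_z", "W_r", "U_r", "W_h", "U_h"]}
h = gru_cell(np.zeros(d), rng.normal(size=d_in), P)
```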

LSTM (Hochreiter & Schmidhuber, 1997)

Architecture with input, forget, and output gates plus a separate cell (context) vector. In practice, GRU/LSTM are drop-in replacements for the vanilla RNN.

Encoder-Decoder

Motivation: Conditional generation \(p_\theta(y_1, \ldots, y_T \mid x)\) (e.g., translation, summarization)

Architecture: Encode input \(x\) into a context vector \(c \in \mathbb{R}^d\) and use it in the decoder:

  • Initialize the decoder hidden state with \(c\)
  • Include it in the recurrent update: \(W[h_t; x_t; c]\)
  • Include it in the output layer: \(\text{softmax}(W[h_t; c])\)

Training: \(\min_\theta \sum_{(x,y) \in D} \sum_t -\log p_\theta(y_t \mid y_{<t}, x)\)

Limitation: Single context vector for all tokens — can we do better?
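A NumPy sketch of this encoder-decoder setup, using the final encoder state as \(c\) and conditioning the decoder on it in both the recurrent update and the output layer (all weights and sizes are toy assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs, W_h, W_x, b):
    """Run the encoder RNN; the final hidden state serves as context c."""
    h = np.zeros(W_h.shape[0])
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h

def decoder_step(h, y_prev, c, W):
    """One decoder step: c enters the recurrent update and the output layer."""
    h_new = np.tanh(W["rec"] @ np.concatenate([h, y_prev, c]))  # W [h; y; c]
    p = softmax(W["out"] @ np.concatenate([h_new, c]))          # softmax(W [h; c])
    return h_new, p

rng = np.random.default_rng(0)
d, d_in, V = 4, 3, 5
c = encode(rng.normal(size=(6, d_in)),
           rng.normal(size=(d, d)), rng.normal(size=(d, d_in)), np.zeros(d))
W = {"rec": rng.normal(size=(d, d + d_in + d)),
     "out": rng.normal(size=(V, d + d))}
h, p = decoder_step(np.zeros(d), np.zeros(d_in), c, W)
```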

Attention (Bahdanau et al., 2015)

Key idea: Compute a different context vector at each decoder step as a weighted sum of encoder states.

  • Keys: Encoder states \(h_1^{enc}, \ldots, h_N^{enc}\)
  • Query: Current decoder hidden state \(h\)
  • Attention weights: \(\alpha_n = \text{softmax}_n(\text{score}(h, h_n^{enc}))\), normalized over encoder positions
  • Context vector: \(c = \sum_{n=1}^{N} \alpha_n h_n^{enc}\)

Score functions:

  • Dot product: \(\text{score}(q, k) = q^\top k\)
  • Bilinear: \(\text{score}(q, k) = q^\top W k\)
  • Nonlinear: \(\text{score}(q, k) = w^\top \tanh(W[q; k])\)
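The three score functions and the resulting context vector, sketched in NumPy (encoder states are rows of `K`; all weights here are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# q: decoder state (d,); K: encoder states as rows (N, d)
def dot_score(q, K):
    return K @ q                      # score(q, k) = q^T k for each key

def bilinear_score(q, K, W):
    return K @ (W.T @ q)              # score(q, k) = q^T W k

def nonlinear_score(q, K, W, w):
    return np.array([w @ np.tanh(W @ np.concatenate([q, k])) for k in K])

def attend(scores, K):
    alpha = softmax(scores)           # normalize scores into weights
    return alpha @ K                  # c = sum_n alpha_n h_n^enc

rng = np.random.default_rng(0)
d, N = 4, 6
q, K = rng.normal(size=d), rng.normal(size=(N, d))
c_dot = attend(dot_score(q, K), K)
c_bil = attend(bilinear_score(q, K, rng.normal(size=(d, d))), K)
c_non = attend(nonlinear_score(q, K, rng.normal(size=(d, 2 * d)),
                               rng.normal(size=d)), K)
```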

Usage: \(\tilde{h}_t = \tanh(W_{out}[c_t; h_t])\), from which the output logits are computed.

Key Takeaways

  • RNN: sequence model with recurrent hidden state (theoretically infinite context)
  • Vanilla RNN suffers from vanishing gradients → use GRU/LSTM
  • Encoder-decoder for conditional generation
  • Attention: dynamic context vector via weighted sum of encoder states
  • RNN training is sequential (hard to parallelize) → motivates Transformers (next lecture)

Lec5 Transformers

This content largely overlaps with CS336 Lec3 (Architectures & Hyperparameters).



Attention Score Functions: The most frequently used is the scaled dot product, which addresses the problem that the scale of a plain dot product grows as the dimensionality increases: \[ a(q, k) = \frac{q^\top k}{\sqrt{d_k}} \] where \(d_k\) is the key dimension.
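A matrix-form NumPy sketch of scaled dot-product attention (the queries, keys, and values are toy random arrays):

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    """a(q, k) = q^T k / sqrt(d_k), softmax over keys, weighted sum of values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (n_queries, n_keys)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)             # attention weights
    return A @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))   # 2 queries, d_k = 8
K = rng.normal(size=(5, 8))   # 5 keys
V = rng.normal(size=(5, 8))   # 5 values
out = scaled_dot_attention(Q, K, V)
```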

Intuition for Multi-heads: Information from different parts of the sentence can be useful to disambiguate in different ways.
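One way to sketch multi-head self-attention in NumPy: project once, then let each head attend within its own slice of the model dimension and concatenate (the head count and sizes are toy assumptions):

```python
import numpy as np

def multi_head_attention(X, n_heads, Wq, Wk, Wv, Wo):
    """Self-attention where each head operates on its own d/n_heads subspace."""
    n, d = X.shape
    d_h = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for i in range(n_heads):
        s = slice(i * d_h, (i + 1) * d_h)            # this head's subspace
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_h)  # scaled dot product
        scores -= scores.max(axis=-1, keepdims=True)
        A = np.exp(scores)
        A /= A.sum(axis=-1, keepdims=True)
        heads.append(A @ V[:, s])
    return np.concatenate(heads, axis=-1) @ Wo       # concat heads, project out

rng = np.random.default_rng(0)
n, d, n_heads = 5, 8, 2
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
Y = multi_head_attention(X, n_heads, Wq, Wk, Wv, Wo)
```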
