Lec6 Pretraining

Basic Idea

The pretrained base model will be adapted to downstream tasks.

image-20260130130847510

Transfer learning: take “knowledge” from one task and apply it to another task

Less task data: use less data to reach a given level of performance

Better task performance: reach higher performance than training from scratch

One model, multiple tasks: convenient, amortizes cost, a starting point for many uses, …

Major Factors

Each model is influenced by 4 major factors:

Architecture: neural network architecture

Task: what the model predicts (e.g. next-token)

Data: the data used to train the model

Hyper-parameters: e.g. learning rate, batch size

Masked Language Modeling

Masked Language Modeling (MLM) trains a language model to predict masked tokens given the remaining visible context. This objective is widely used in bidirectional models such as BERT.

The MLM loss is defined as:

\[ \mathcal{L}_{\text{MLM}}(\theta; D) = -\frac{1}{|D|} \sum_{x \in D} \;\mathbb{E}_{M \sim \text{corrupt}(x)} \sum_{t \in M} \log p_\theta(x_t \mid x_{-M}) \]

  • \(x \in D\): an input sequence (sentence) from the dataset

  • \(M\): a randomly sampled set of masked token positions

  • \(\text{corrupt}(x)\): the masking (corruption) process applied to \(x\)

  • \(x_{-M}\): all unmasked tokens in the sequence

  • \(x_t\): the ground-truth token at position \(t\)

  • \(p_\theta(x_t \mid x_{-M})\): model probability of predicting \(x_t\) given the visible context

  • Denoising Perspective MLM can be viewed as a denoising autoencoder: the input is corrupted by masking tokens, and the model learns to reconstruct the original sequence.

  • Pseudo-likelihood Objective: Instead of modeling the full joint likelihood of the sequence, MLM maximizes the sum of conditional log-probabilities of masked tokens given the rest of the sequence. This corresponds to maximizing a pseudo-likelihood rather than a true joint likelihood.

Autoregressive Language Modeling

Predict the next token (x_t) given all previous tokens (x_{<t}). [ {}(; D) = - {x D} {t=1}^{|x|} p(x_t x_{<t}) ] Key Properties - Maximizes the likelihood of the training data
- Fits the true data distribution (p_)
- Equivalent to cross-entropy with a one-hot target distribution
- Can be interpreted as learning to compress data generated by (p_
)

image-20260130140511145

Evaluate a model

  • Loss(traning, validation, test)
    • Diagnose training trajectory, compare models in the same family
  • Few-shot prompting
  • Fine-tuning

Data

  • Quantity: How much data do I have?
  • Quality: Is it beneficial for training? (Extraction, Filtering, Deduplication)
  • Coverage: Does the data cover the domain(s) I care about, and in the right proportions?
    • We can train a classifier filtering to help us detect the content

Lec7 Scaling Laws and In-Context Learning

Scaling Laws

Training = Spending Compute

Given a fixed pretraining budget, how should we allocate the budget to achieve the best possible model?

image-20260206232739205

Core approximation for transformer LMs (Kaplan et al. 2020): \[ C \approx 6ND \]

  • N = number of parameters, D = number of tokens, C = FLOPs
  • Example: Llama 2 7B on 2T tokens → C ≈ 8.4 × 10²² FLOPs

Central question: Given fixed C, how to split between N and D?

Power Laws

\[ L = a \cdot x^{b} \quad \Rightarrow \quad \log L = \log a + b \cdot \log x \]

Doubling x → L becomes L · 2ᵇ. With b = −0.095, doubling data only reduces loss by ~6.4%. Diminishing returns are severe — exponentially more resources for linear loss improvement.

image-20260206233531669

Data Scaling

\[ L = (D / D_c)^{-b} \]

  • Three regions: small data → power-law region → irreducible error (bounded by data entropy)
  • Repeated data: ~4 epochs ≈ new data; ~40 epochs → worthless (Muennighoff+ 2025)
  • Different domains (Wikipedia, Books, Common Crawl) have different scaling curves

image-20260206233726326

Model Size Scaling

\[ L = (N / N_c)^{-b} \]

  • Larger models are more sample-efficient (reach lower loss with fewer tokens)
  • Transformers scale significantly better than LSTMs
image-20260206233840078

Compute-Optimal Training

Paper N_opt D_opt Implication
Kaplan et al. 2020 ∝ C⁰·⁷³ ∝ C⁰·²⁷ Prioritize bigger models
Chinchilla (Hoffmann et al. 2022) ∝ C⁰·⁵ ∝ C⁰·⁵ Scale N and D equally

∝ means proportional to

image-20260206234520100

Chinchilla’s insight: many existing LLMs (e.g., Gopher 280B) were undertrained. This directly influenced LLaMA’s design — smaller models trained on much more data.

Run small-scale experiments → fit scaling laws → extrapolate optimal hyperparameters (model size, tokens, batch size, learning rate) for the expensive target run.

Prompting and In-Context Learning

Three Prompting Strategies

Strategy Format Key Issue
No prompt Raw text completion Uncontrollable
Zero-shot Instruction + input Output format unstable; wording-sensitive
Few-shot (ICL) Instruction + K examples + input Examples define format & task; no param updates

ICL Phenomena (Agarwal et al. 2024)

  • Task retrieval: sometimes input-only examples work as well as (input, output) pairs — ICL partly “retrieves” pretraining patterns rather than learning new mappings

image-20260206235526522

  • Pretraining bias: default labels need few shots; flipped/abstract labels need many more to overcome built-in associations

    image-20260206235557572

  • Sensitivity: example ordering, label balance, and label coverage all significantly affect performance (Lu et al. 2021; Zhang et al. 2022)

    image-20260206235916474

  • Model variation: ability to benefit from many-shot varies greatly (Gemini 1.5 Pro improves up to 1024-shot; GPT-4-Turbo plateaus early)

image-20260206235954906

Chat Prompts

Messages format with system / user / assistant roles → tokenizer converts to string with special tokens. System prompts define model behavior (reasoning strategy, safety rules, output format, etc.).

Chain-of-Thought (CoT)

  • Few-shot CoT (Wei et al. 2022): include reasoning steps in examples
  • Zero-shot CoT (Kojima et al. 2022): append “Let’s think step by step”
  • Key insight: CoT gives the model adaptive computation time — each reasoning token is an extra forward pass

Prompt Chains

Chain multiple LLM calls (with different prompts / external tools) sequentially:

1
input → LLM₁ → intermediate → LLM₂ → intermediate → ... → output

Enables problem decomposition, tool use (search, code execution), and multi-step reasoning.


Lec8 Fine-Tuning

What is Fine-Tuning?

Continued gradient-based training of a pre-trained model on task-specific data. Given pre-trained parameters θ₀ and dataset D = {(x, y)ₙ}: \[ \theta^* = \arg\min_\theta \mathbb{E}_{(x,y)\sim D} [\mathcal{L}(f_\theta(x), y)] \] Also called Supervised Fine-Tuning (SFT). Use regularization/dropout to prevent overfitting.

Two Fine-Tuning Paradigms

Classification fine-tuning: Add a linear head on the last hidden state. \[ p_\theta(y \mid x) = \text{softmax}(Wh + b), \quad W \in \mathbb{R}^{K \times d}, \; b \in \mathbb{R}^K \]

  • Loss: cross-entropy \(\mathcal{L} = -\log p_\theta(y \mid x)\)
  • Update all parameters \(\theta = (\theta_0, W, b)\)

Language model fine-tuning: Keep the LM architecture, train on (input, output) pairs.

Keep the LM architecture, train on (input, output) pairs. Concatenate as [start] x [sep] y [end], compute loss only on output tokens:

\[ \mathcal{L}_{\text{MLE}} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t}) \] No additional head needed — uses the existing LM head.

Which Parameters to Update?

Option What’s Updated Cost Trade-off
Head only \(K \times d + K\) params Cheapest Assumes pre-trained representations are already linearly separable
Full fine-tuning All \(\theta\) May lead to overfitting
PEFT Small subset \(\theta\) Can change the representations

LoRA (Low-Rank Adaptation) [Hu et al. 2021]

Key idea: approximate weight updates with low-rank matrices.

For weight \(W_0 \in \mathbb{R}^{d \times d'}\), decompose the update as:

\[ \Delta W = BA, \quad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times d'}, \; r \ll \min(d, d') \]

  • Freeze \(W_0\), only train \(A\) and \(B\)
  • Final weight: \(W = W_0 + \frac{\alpha}{r} \cdot BA\)
  • Typically applied to \(W_q\) and \(W_v\) in attention layers
  • After training, merge \(\Delta W\) into \(W_0\)no extra inference cost

image-20260207002605984

Effects of Fine-Tuning

Data efficiency: Pre-trained models reach good performance with far fewer examples than training from scratch (Howard & Ruder 2018 — ULMFiT).

Distribution narrowing: Fine-tuning minimizes \(D_{\text{KL}}(p_{\text{finetune}} \| p_\theta)\) instead of \(D_{\text{KL}}(p_{\text{data}} \| p_\theta)\). The model’s distribution becomes narrower and specialized.

Side effects:

  • Summarization model fails at translation (lost generality)
  • Model becomes dependent on exact prompt formatting used during training
  • Few-shot learning ability may degrade after fine-tunin

Instruction Tuning

Fine-tune a model on (instruction + input, output) pairs across multiple tasks, so it learns to follow instructions generically.

image-20260207002839337

Chat Tuning

Chat is just instruction tuning with conversational format:

  • System prompt + [user, assistant, user, assistant, …]
  • Special tokens mark role boundaries (e.g., <|start_header_id|>user<|end_header_id|>)
  • Instruction + input are implicitly embedded in the conversation

Data sources: OpenOrca (16 hand-written system prompts, GPT-4 outputs, 2.9M examples), LMSys-1M (real user conversations from online LLM service)

image-20260207002950797

Knowledge Distillation

Use a strong teacher model to train a smaller student model. Both approaches minimize \(D_{\text{KL}}(q \| p_\theta)\) between teacher \(q\) and student \(p_\theta\).

Token-Level Distillation [Hinton et al. 2015]

Student mimics teacher’s full probability distribution at each token position:

\[ \mathcal{L}_{\text{distill}} = -\sum_{y_t \in V} q(y_t \mid y_{<t}, x) \cdot \log p_\theta(y_t \mid y_{<t}, x) \]

  • Uses soft labels (teacher’s distribution) instead of hard one-hot labels
  • Requires access to teacher’s logits
  • Richer signal than standard cross-entropy (which only uses the correct token)

image-20260207003122001

Sequence-Level Distillation [Kim & Rush 2016]

Teacher generates complete outputs; student fine-tunes on them:

\[ \mathcal{L}_{\text{seq}} = \mathbb{E}_{y \sim q(y|x)} \left[-\log p_\theta(y \mid x)\right] \]

  • Only needs teacher’s generated text, not logits (black-box friendly)
  • Example: DeepSeek-R1-Distill-Qwen-7B = Qwen-7B fine-tuned on DeepSeek-R1’s outputs
  • Mathematically also minimizes \(D_{\text{KL}}(q \| p_\theta)\), just with Monte Carlo approximation

image-20260207003300579

Augmented Teacher [West et al. 2022]

Teacher can be an augmented LLM:

\[ q \propto p_{\text{LLM}}(y \mid x) \cdot A(x, y) \] where \(A\) is a classifier/verifier. If the augmented teacher is better than \(p_{\text{LLM}}\) alone, the distilled student can surpass \(p_{\text{LLM}}\) — the student becomes better than its base teacher through filtered/verified data.

Lec9 Decoding Algorithms

Basic Setup

Autoregressive Language Model

\[ p_\theta(y_{1:T} | x) = \prod_{t=1}^{T} p_\theta(y_t | y_{<t}, x) \]

Next-Token Distribution

Each term \(p_\theta(y_t | y_{<t}, x)\) gives us a probability distribution over next tokens.

Decoding: Choose next tokens so that we end up with an output \(y_{1:T}\).


Decoding as Optimization

Goal: MAP Decoding

Find the most likely output: \[ \hat{y} = \arg\max_{y \in \mathcal{Y}} p_\theta(y | x) \]

  • Also called mode-seeking or maximum a-posteriori (MAP)
  • Key challenge: output space \(\mathcal{Y}\) is very large

image-20260210155455010

Approach 1: Greedy Decoding

For \(t = 1 \ldots \text{End}\): \[ \hat{y}_t = \arg\max_{y_t \in V} p_\theta(y_t | \hat{y}_{<t}, x) \]

Limitation: Does not guarantee the most-likely sequence.

image-20260210155537243

  • Width-limited breadth-first search
  • Maintain \(B\) likely paths at each step
  • \(B = 1\): greedy decoding
  • \(B = |V|^{T_{\max}}\): exact MAP (intractable)
  • In practice: \(B = 16\) (hyperparameter)

image-20260210155609745

HuggingFace interface:

1
2
3
4
5
# Greedy
model.generate(do_sample=False, num_beams=1)

# Beam search
model.generate(do_sample=False, num_beams=16)

Pitfalls of MAP Decoding

1. Degeneracy - Repetition traps: Models assign high probability to repetitive loops - Short sequences: Highest-probability sequence might be empty - Remedy: length normalization

2. Atypicality - Most likely outcome ≠ typical outcome - Example: Biased coin \(\Pr[H] = 0.6\), most likely outcome of 100 flips is all heads (atypical!)

3. Probability spread - When multiple ways exist to express the same thing, probability spreads across variants - Highest-probability output may not be the “best”

image-20260210160105078


Sampling

Ancestral Sampling

For \(t = 1 \ldots \text{End}\): \[ \hat{y}_t \sim p_\theta(y_t | \hat{y}_{<t}, x) \]

  • Equivalent to sequence sampling: \(y_{1:T} \sim p_\theta(y_{1:T} | x)\)
  • Implemented via categorical sampling (PyTorch, etc.)

HuggingFace interface:

1
model.generate(do_sample=True)

Problems with Ancestral Sampling

Heavy tail problem:

  • Even if each bad token has small probability, sum of bad tokens has nontrivial probability
  • Leads to incoherence

Compounding error: - Probability of sampling no bad tokens: \((1 - \epsilon)^T\) - Example: \(\epsilon = 0.01, T = 128 \Rightarrow p(\text{no bad tokens}) = 0.276\)

Workaround: Truncate the Tail

Top-k Sampling: Sample only from \(k\) most-probable tokens \[ \hat{y}_t \sim \begin{cases} p_\theta(y_t | y_{<t}, x) / Z_t & \text{if } y_t \text{ in top-}k \\ 0 & \text{otherwise} \end{cases} \]

Top-p Sampling: Sample only from top \(p\) probability mass

Temperature Sampling: Make distribution more peaked \[ \text{softmax}(x, \tau) = \frac{\exp(x/\tau)}{\sum_i \exp(x_i/\tau)} \]

Temperature Parameter Pro Con
High \(\tau \geq 1\) Diverse Incoherent
Low \(\tau < 1\) Coherent Repetitive

image-20260210161335068

HuggingFace interface:

1
2
3
4
5
6
7
8
# Top-k
model.generate(do_sample=True, top_k=20)

# Top-p
model.generate(do_sample=True, top_p=0.9)

# Temperature
model.generate(do_sample=True, temperature=0.7)

Other Truncation Strategies

Method Threshold Strategy
Top-\(k\) Sample from \(k\)-most-probable
Top-\(p\) Cumulative probability at most \(p\)
\(\epsilon\) Probability at least \(\epsilon\)
\(\eta\) Min prob. proportional to entropy
Min-\(p\) Prob. at least \(p_{\min}\) scaled by max token prob.

Speeding Up Decoding

Why is Decoding Slow?

Time bottleneck: \[ \text{time} = \max\left(\frac{\text{OperationFLOPs}}{\text{DeviceFLOP/s}}, \frac{\text{DataTransferred(GB)}}{\text{MemoryBandwidth(GB/s)}}\right) \]

  • Compute-bound: e.g., \(A = BC\) (matrix multiplication)
  • Memory-bound: e.g., \(a = Bx\) (matrix-vector multiplication)
  • Decoding one token is typically memory-bound

Metrics

Latency: How long does a user wait? - Time to first token, time per request

Throughput: How many requests completed per second? - Tokens per second, requests per second

Key-Value Cache

Problem: Without caching, computing attention requires \(O(T^2)\) recomputation.

Solution: Store previously computed keys and values.

At step \(t\) of decoding (1 layer, 1 head): \[ \begin{align} q_t &= h_t W_q \in \mathbb{R}^{1 \times d_k} \\ k_t &= h_t W_K \in \mathbb{R}^{1 \times d_k} \\ v_t &= h_t W_V \in \mathbb{R}^{1 \times d_v} \end{align} \]

Cached: \(K_{1:t-1} \in \mathbb{R}^{(t-1) \times d_k}\), \(V_{1:t-1} \in \mathbb{R}^{(t-1) \times d_v}\)

Append \(k_t\) to \(K_{1:t-1}\) and \(v_t\) to \(V_{1:t-1}\): \[ z_t = \text{softmax}\left(\frac{q_t K_{1:t}^T}{\sqrt{d_k}}\right) V_{1:t} \]

Speedup: Reduces \(O(T^2)\) to \(O(T)\) computation.

Speeding Up a Single Token

Goal Strategy Examples
Reduce memory bandwidth Quantization, distillation, architecture GPTQ, AWQ, GQA/MQA
Increase FLOP/s Optimize operations on hardware FlashAttention, torch.compile
Decrease FLOPs Reduce FLOPs in architecture Mixture-of-Experts, Mamba

Speeding Up a Full Sequence

Strategy Idea Examples
Parallelize over time Draft tokens cheaply, verify in parallel Speculative decoding
Parallelize over time Generate multiple tokens in parallel Non-autoregressive models

Speculative Decoding: 1. Use small draft model \(q\) to generate tokens ahead 2. Large model \(p\) processes drafted tokens in parallel 3. Accept with probability: \(\alpha_i = \min\left(1, \frac{p(\text{token} | \ldots)}{q(\text{token} | \ldots)}\right)\)

Speeding Up Multiple Sequences

Strategy Idea Examples
State re-use Shared prefixes \(\Rightarrow\) shared KV cache PagedAttention, RadixAttention
Improved batching Better scheduling Continuous batching
Program-level optimization Optimize full generation graph SGLang, DSPy

Summary

  1. Decoding as optimization: Greedy, beam search, MAP (A fixed route the maxmize the probability )
    • Pitfalls: degeneracy, atypicality, probability spread
  2. Sampling: Ancestral, top-k, top-p, temperature. (change the probability distribution for the following sampling)
    • Addresses tail probability and diversity issues
  3. Efficient inference: KV cache, speculative decoding, batching
    • Critical for deployment at scale

A key takeaway is that the Beam search or Greedy is search-based, it wants to get a fiexd optimized output, but Samlping introduces diversity or variance to generate more creative answers and may get different answers.