Lec20 Parallelism and Scaling

This lecture addresses the central systems challenge of LLM pre-training: how to efficiently use multiple GPUs to train larger models on more data. The key question is how to split work across devices while minimizing wasted time on communication and idle bubbles.

Motivation

Pre-training benefits from both more data and bigger models — both reduce training loss. The problem is that scaling requires leveraging multiple GPUs effectively, which introduces three competing concerns:

  • Memory usage: all training state (weights, gradients, optimizer states, activations) must fit in GPU memory
  • Compute efficiency: GPUs should spend most of their time computing, not waiting
  • Communication overhead: synchronization between GPUs keeps them idle

The impact of getting this right is dramatic. The slides show two configurations for the same 3.57B model on 32 nodes: one achieves 17.4% MFU at 1.7M tok/s, the other gets 0.7% MFU at 60K tok/s — a 28x throughput difference just from parallelism strategy choices.

Training Basics on One GPU

Compute

Training compute is measured in floating point operations (FLOPs). For a forward and backward pass:

\[ \text{FLOPs} = 6 \times \text{model\_parameters} \times \text{token\_batch\_size} \]

The factor of 6 breaks down as roughly 2 FLOPs per parameter per token in the forward pass (one multiply and one add per weight), plus 4 in the backward pass (2 each for the gradient with respect to the activations and the gradient with respect to the weights).

Model FLOP Utilization (MFU) measures how effectively the hardware is used:

\[ \text{MFU} = \frac{\text{Achieved FLOPS}}{\text{Theoretical Peak FLOPS}} \]

For reference, an H100 SXM peaks at 1,979 teraFLOPS in BFloat16 with Tensor Cores (a figure that assumes structured sparsity; the dense peak is roughly 989 teraFLOPS). Real training runs typically achieve 30-45% MFU due to communication overhead, memory bandwidth limits, and idle time.
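Plugging in concrete numbers makes both formulas tangible. A minimal sketch; the model size, batch size, step time, and GPU count below are all assumed values, not measurements:

```python
params = 7e9            # hypothetical 7B-parameter model
tokens_per_step = 4e6   # hypothetical 4M-token global batch
step_time_s = 3.0       # assumed wall-clock seconds per optimizer step
n_gpus = 128

flops_per_step = 6 * params * tokens_per_step     # total training FLOPs per step
achieved = flops_per_step / step_time_s / n_gpus  # FLOP/s delivered per GPU
peak = 989e12                                     # H100 dense BF16 peak FLOP/s
print(f"MFU: {achieved / peak:.1%}")              # ~44% with these numbers
```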

Memory Usage

A training step must simultaneously hold model weights, gradients, optimizer states, and activations. With mixed precision, the weights appear twice (a BF16 working copy plus an FP32 master copy), giving:

\[ \text{peak\_memory} = \text{model\_bf16} + \text{model\_fp32} + \text{grads\_fp32} + \text{optim\_states} + \text{activations} \]

The rough breakdown per parameter (using mixed precision with Adam):

| Component | Bytes per Parameter |
| --- | --- |
| BF16 model weights | 2 |
| FP32 master weights | 4 |
| FP32 gradients | 4 |
| Adam momentum (FP32) | 4 |
| Adam variance (FP32) | 4 |
| Total (excl. activations) | 18 |

For a 7B model, this is ~126 GB just for parameters, gradients, and optimizer state, already exceeding a single H100's 80 GB. A 70B model needs ~1,260 GB at the same 18 bytes per parameter.
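The arithmetic is easy to script. A minimal sketch; the 18-bytes-per-parameter constant matches the mixed-precision Adam breakdown above, and activations are deliberately excluded:

```python
def training_state_gb(n_params: float, bytes_per_param: int = 18) -> float:
    """Memory for weights + gradients + Adam state, excluding activations.

    18 bytes/param = 2 (BF16 weights) + 4 (FP32 master) + 4 (FP32 grads)
                   + 4 (Adam momentum) + 4 (Adam variance).
    """
    return n_params * bytes_per_param / 1e9

print(training_state_gb(7e9))   # ~126 GB
print(training_state_gb(70e9))  # ~1260 GB
```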

Memory Usage: Activations

Activation memory scales linearly with batch size and quadratically with sequence length:

\[ m_{\text{act}} = L \cdot seq \cdot bs \cdot h \cdot \left(34 + \frac{5 \cdot n_{\text{heads}} \cdot seq}{h}\right) \]

where \(L\) is the number of layers, \(seq\) is sequence length, \(bs\) is batch size, \(h\) is hidden dimension, and \(n_{\text{heads}}\) is the number of attention heads. The quadratic term comes from the attention score matrix (\(seq \times seq\)). At long sequence lengths (8K+), activations dominate memory usage.
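The formula is easy to evaluate directly. A minimal sketch using illustrative 7B-scale dimensions (the specific layer/hidden/head values are assumptions):

```python
def activation_mem_gb(L, seq, bs, h, n_heads):
    """Activation memory in GB per the formula above.

    The constants (34, 5) already assume 16-bit activation storage;
    the 5 * n_heads * seq / h term is the quadratic attention-score part.
    """
    return L * seq * bs * h * (34 + 5 * n_heads * seq / h) / 1e9

# Illustrative config: 32 layers, hidden size 4096, 32 heads
print(activation_mem_gb(L=32, seq=2048, bs=1, h=4096, n_heads=32))  # ~31 GB
print(activation_mem_gb(L=32, seq=8192, bs=1, h=4096, n_heads=32))  # ~380 GB
```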

Activation Recomputation

Rather than storing all activations during the forward pass, activation recomputation (gradient checkpointing) saves only certain activations as “checkpoints” and recomputes the rest during the backward pass.

The tradeoff is clear: more compute for less memory. Selective recomputation targets the most memory-expensive activations (like attention scores) while keeping cheaper ones, striking a balance between the two extremes.
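In PyTorch this is exposed as torch.utils.checkpoint. A minimal sketch that wraps an arbitrary sub-module (the wrapper class is illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    """Recompute this block's internal activations during backward."""

    def __init__(self, block: torch.nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Only the input x is stored as a checkpoint; everything inside
        # self.block is recomputed when gradients are needed.
        return checkpoint(self.block, x, use_reentrant=False)
```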

Gradient Accumulation

Gradient accumulation splits a large batch into smaller micro-batches, running forward/backward on each sequentially and averaging the gradients before updating:

\[ bs = gbs = mbs \cdot grad\_acc \]

This lets you simulate a large batch size without needing the memory to hold all activations at once. The memory cost is determined by the micro-batch size, while the effective batch size equals micro-batch size times the number of accumulation steps. Pre-training typically uses 4-60 million tokens per batch.
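A minimal PyTorch sketch of the loop; model, optimizer, loss_fn, and the micro_batches iterator are assumed to exist:

```python
grad_acc = 4  # micro-batches per optimizer step
optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y) / grad_acc  # scale so gradients average, not sum
    loss.backward()                         # grads accumulate into param.grad
    if (i + 1) % grad_acc == 0:
        optimizer.step()                    # one update per grad_acc micro-batches
        optimizer.zero_grad()
```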

Data Parallelism

Basic Idea

Data parallelism (DP) replicates the entire model on every GPU. Each GPU processes a different micro-batch, computes gradients independently, then all GPUs average their gradients via an all-reduce operation before updating weights.

The global batch size becomes:

\[ \text{global batch size} = mbs \cdot grad\_acc \cdot dp \]

where \(dp\) is the number of data-parallel replicas.

Communication: Collective Operations

Multi-GPU training relies on collective communication primitives:

| Operation | Pattern |
| --- | --- |
| Send/Recv | Point-to-point transfer between two ranks |
| Scatter | One rank distributes chunks to all ranks |
| Gather | One rank collects chunks from all ranks |
| Broadcast | One rank sends identical data to all ranks |
| Reduce | All ranks contribute; one rank gets the aggregated result |
| All-Reduce | All ranks contribute; all ranks get the aggregated result |
| All-Gather | All ranks contribute chunks; all ranks get the full data |

In naive data parallelism, a single all-reduce happens after the entire backward pass completes. This means GPUs sit idle during the entire communication phase.
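These primitives are exposed directly in torch.distributed. A minimal sketch of the naive gradient all-reduce; it assumes the NCCL process group is already initialized (e.g., under torchrun) and the backward pass has finished:

```python
import torch
import torch.distributed as dist

def all_reduce_grads(model: torch.nn.Module, world_size: int) -> None:
    """Average gradients across all data-parallel ranks (naive version)."""
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum across ranks
            p.grad /= world_size                           # sum -> average
```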

Overlap + Bucketing

The key optimization is to overlap communication with computation:

  1. Start all-reduce as soon as gradients are ready — don’t wait for the full backward pass
  2. Group gradients into buckets — launch one all-reduce per bucket instead of per parameter

Since the backward pass computes gradients starting from the last layer, the all-reduce for later layers can overlap with gradient computation for earlier layers. This hides much of the communication latency behind useful compute.
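PyTorch's DistributedDataParallel implements exactly this overlap, with the bucket size exposed as a constructor argument. A minimal sketch (process-group setup and the training loop are elided):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# DDP registers autograd hooks that launch an all-reduce as each bucket
# of gradients becomes ready, overlapping communication with the rest
# of the backward pass. Larger buckets mean fewer collective launches;
# smaller buckets start overlapping earlier.
model = DDP(model, bucket_cap_mb=25)  # 25 MB is the PyTorch default
```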

Scaling Behavior

Data parallelism does not reduce per-GPU memory — each GPU still holds a full copy of the model, gradients, and optimizer states. It only increases throughput by processing more data in parallel.

As the number of GPUs increases, per-GPU throughput drops due to growing communication overhead. The slides show a 3B model going from ~38K tokens/sec/GPU at DP=8 down to ~22K tokens/sec/GPU at DP=256 (-40.6% degradation).

Worked Example

Consider training with a global batch size of 4M tokens and a sequence length of 4,096:

  • Batch size in sequences: 4M / 4K = 1,024
  • If one GPU fits mbs = 2 sequences, then 128 GPUs process 2 × 128 = 256 sequences per step
  • Need grad_acc = 1024 / 256 = 4 accumulation steps

With 512 GPUs instead: 2 × 512 = 1,024 sequences per step, so grad_acc = 1 (no accumulation needed, and each step completes faster).
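This planning arithmetic generalizes into a few lines of Python (a sketch; the function name and interface are illustrative):

```python
def plan_grad_acc(gbs_tokens: int, seq_len: int, mbs: int, n_gpus: int) -> int:
    """Accumulation steps needed to reach a token-denominated global batch."""
    gbs_sequences = gbs_tokens // seq_len
    per_step = mbs * n_gpus  # sequences processed per forward/backward
    assert gbs_sequences % per_step == 0, "batch size must divide evenly"
    return gbs_sequences // per_step

print(plan_grad_acc(4 * 2**20, 4096, mbs=2, n_gpus=128))  # 4
print(plan_grad_acc(4 * 2**20, 4096, mbs=2, n_gpus=512))  # 1
```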

Tensor Parallelism

Basic Idea

When a model is too large for a single GPU, tensor parallelism (TP) splits individual weight matrices across GPUs. Each GPU computes a portion of the matrix multiplication, then results are combined via communication.

Column-Wise Parallelism

Split the weight matrix \(W\) into column chunks. Each GPU holds one chunk and computes \(X \cdot W_i\) independently. The input \(X\) is broadcast to all GPUs, and the partial outputs \(Y_i\) are all-gathered to reconstruct the full output.

For a 2-GPU split of \(Y = XW\):

  • GPU 0 computes \(Y_0 = X \cdot W_0\) (first half of the columns)
  • GPU 1 computes \(Y_1 = X \cdot W_1\) (second half of the columns)
  • An all-gather produces \(Y = [Y_0, Y_1]\)

Row-Wise Parallelism

Split the weight matrix into row chunks (and correspondingly split the input into column chunks). Each GPU computes a partial result, then an all-reduce (sum) combines them.

For a 2-GPU split:

  • GPU 0 computes \(Y_0 = X_0 \cdot W_0\)
  • GPU 1 computes \(Y_1 = X_1 \cdot W_1\)
  • An all-reduce produces \(Y = Y_0 + Y_1\)
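Both decompositions are easy to verify numerically with plain tensor ops; a single-process sketch with no distributed setup:

```python
import torch

torch.manual_seed(0)
X, W = torch.randn(4, 8), torch.randn(8, 6)
Y = X @ W

# Column-wise: split W's columns; concatenating partials is the all-gather.
W0, W1 = W.chunk(2, dim=1)
assert torch.allclose(Y, torch.cat([X @ W0, X @ W1], dim=1), atol=1e-5)

# Row-wise: split W's rows and X's columns; summing partials is the all-reduce.
X0, X1 = X.chunk(2, dim=1)
Wr0, Wr1 = W.chunk(2, dim=0)
assert torch.allclose(Y, X0 @ Wr0 + X1 @ Wr1, atol=1e-5)
```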

Application to Transformer Layers

Feed-forward layers use column parallel for the first linear (\(A\)), then row parallel for the second linear (\(B\)). This avoids an intermediate all-reduce/all-gather between the two — the output of column parallel naturally becomes the input for row parallel:

\[ Y = \text{GeLU}(XA), \quad Z = \text{Dropout}(YB) \]

With \(A = [A_1, A_2]\) (column split) and \(B = \begin{bmatrix} B_1 \\ B_2 \end{bmatrix}\) (row split), each GPU computes \(\text{GeLU}(X \cdot A_i) \cdot B_i\) and only needs one all-reduce at the end.
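The same single-process check shows why the pairing works: GeLU applies elementwise, so it commutes with the column split, and the final sum is the lone all-reduce (sketch; dropout omitted):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(4, 16)
A, B = torch.randn(16, 32), torch.randn(32, 16)
Z = F.gelu(X @ A) @ B  # reference: the full feed-forward block

# Column-split A, row-split B; each "GPU" computes GeLU(X @ A_i) @ B_i.
A1, A2 = A.chunk(2, dim=1)
B1, B2 = B.chunk(2, dim=0)
Z_tp = F.gelu(X @ A1) @ B1 + F.gelu(X @ A2) @ B2  # the sum is the all-reduce

assert torch.allclose(Z, Z_tp, atol=1e-5)
```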

Attention layers split naturally by assigning different attention heads to different GPUs. The Q, K, V projections are column-split, each GPU handles its subset of heads independently, and the output projection is row-split.

Tradeoffs

Tensor parallelism reduces per-GPU memory for weights, gradients, optimizer states, and activations. A 70B model that requires 140 GB with no parallelism fits within 80 GB with TP=8 or TP=16.

The cost is communication. Each transformer layer requires all-reduce or all-gather operations in both forward and backward passes. Throughput drops sharply as TP increases: for a 3B model, TP=2 achieves ~13K tokens/sec/GPU while TP=32 drops to ~4.5K (-65.6%).

Cross-node communication bandwidth drops dramatically (436 GB/s intra-node vs ~34 GB/s at 64 nodes), so tensor parallelism should be confined within a single node (typically 8 GPUs per node).

Pipeline Parallelism

Basic Idea

Pipeline parallelism (PP) splits the model by layers across GPUs. GPU 1 holds layers 1-4, GPU 2 holds layers 5-8, and so on. Data flows sequentially through the pipeline.

This reduces per-GPU memory for weights, gradients, and optimizer states roughly in proportion to the number of stages; the slides show an 8B model with PP=8 dropping from ~140 GB to ~40 GB per GPU.

The Bubble Problem

The naive approach has a massive inefficiency: while GPU 1 processes the forward pass, GPUs 2-4 sit idle. Then while GPU 4 processes the backward pass, GPUs 1-3 are idle. This wasted time is called the pipeline bubble.

One-Forward-One-Backward (1F1B) Schedule

The 1F1B schedule shrinks the bubble by interleaving forward and backward passes across micro-batches. Instead of completing all forward passes before starting any backward work, each stage begins the backward pass for a micro-batch as soon as that micro-batch clears the final stage, then alternates one forward and one backward step.

With more micro-batches, the bubble becomes a smaller fraction of total time. The slides show that with PP=2, throughput is ~15K tokens/sec/GPU; scaling to PP=32 with only a few microbatches costs -52.4%, while with 32 microbatches the degradation shrinks to -44.3% but remains significant.
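This amortization has a standard back-of-envelope form: with p pipeline stages and m micro-batches (and assuming uniform stage times), the bubble occupies (p - 1) of the (m + p - 1) stage-slots in each direction. A sketch:

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of pipeline time: (p-1) bubble slots out of (m+p-1)."""
    return (p - 1) / (m + p - 1)

print(f"{bubble_fraction(p=8, m=4):.0%}")   # 64% idle: too few micro-batches
print(f"{bubble_fraction(p=8, m=32):.0%}")  # 18% idle: bubble amortized away
```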

Pipeline parallelism is best suited for cross-node parallelism since it only requires point-to-point communication (send activations to the next stage, receive gradients from the next stage), unlike tensor parallelism which requires all-reduce.

Memory Optimization: ZeRO

The Redundancy Problem

In standard data parallelism, every GPU stores a complete copy of model parameters, gradients, and optimizer states. For a 7B model with Adam, that’s ~126 GB replicated across every GPU — enormously wasteful.

ZeRO Stages

The Zero Redundancy Optimizer (ZeRO) progressively partitions training state across GPUs:

| Stage | What is Sharded | Memory Savings | Communication Cost |
| --- | --- | --- | --- |
| Baseline (DP) | Nothing | None | Low (all-reduce only) |
| ZeRO-1 | Optimizer states | ~4x reduction in optimizer memory | Low |
| ZeRO-2 | + Gradients | Further reduction | Moderate |
| ZeRO-3 | + Parameters | Maximum reduction | High (all-gather per layer) |

How ZeRO-3 Works

Each GPU stores only \(1/N\) of the parameters (where \(N\) is the number of GPUs). At each layer during the forward and backward passes:

  1. Issue an all-gather to fetch the layer's full parameters from all peers
  2. Compute activations (or gradients) using the full parameters
  3. Free the fetched parameters and move to the next layer

The key distinction from tensor/pipeline parallelism: ZeRO shards memory, not computation. Every GPU still performs the full computation on the full data — it just doesn’t store everything simultaneously. This means ZeRO-3 has the highest communication cost (all-gather for every layer in both forward and backward), but provides the largest memory savings.
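A schematic sketch of that per-layer rhythm (not a real API; all_gather_params and free_params stand in for the collectives a framework such as DeepSpeed ZeRO-3 or PyTorch FSDP performs, typically prefetching the next layer to overlap communication with compute):

```python
# Schematic ZeRO-3 forward pass: `layers` holds only this rank's 1/N
# parameter shards; the helpers below are placeholders, not library calls.
def zero3_forward(x, layers):
    for layer in layers:
        full_params = all_gather_params(layer.shard)  # fetch peers' shards
        x = layer.forward(x, full_params)             # compute with full weights
        free_params(full_params)                      # keep only our own shard
    return x
```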

Choosing Strategies

Summary of Approaches

| Strategy | Key Idea | Tradeoff | Best Use Case |
| --- | --- | --- | --- |
| Data Parallelism (DP) | Parallelize on the batch dimension | Redundant memory; model must fit on one GPU | Standard models that fit in GPU memory |
| Tensor Parallelism (TP) | Parallelize on the hidden dimension | High communication (all-reduce per layer) | Large layers; within a single node |
| Pipeline Parallelism (PP) | Parallelize on the model (layer) dimension | Pipeline bubbles waste time | Large deep models; across nodes |
| ZeRO | Shard model/optimizer/gradients within DP | High communication (all-gather) | Big models that don't fit in GPU memory |

Decision Framework

When configuring a training run, the goals are:

  1. Fit the model into memory: use TP, PP, or ZeRO to reduce per-GPU memory
  2. Satisfy the target global batch size: use DP and gradient accumulation
  3. Maximize training throughput: minimize communication overhead and idle time

In practice, these strategies are combined. A typical configuration might use:

  • TP within a node (e.g., TP=4 or TP=8) for memory reduction with fast intra-node communication
  • PP across nodes (e.g., PP=2 or PP=4) for additional memory reduction with only point-to-point cross-node traffic
  • DP across the remaining GPUs for throughput scaling
  • ZeRO-1 to shard optimizer states without major communication overhead
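In PyTorch, composing these dimensions typically starts from a device mesh. A minimal sketch; the 2 × 4 × 8 = 64-GPU shape is illustrative, and the actual wiring of each dimension is left to a framework such as torchtitan:

```python
from torch.distributed.device_mesh import init_device_mesh

# Illustrative 64-GPU layout: PP across node groups, DP across nodes,
# TP across the 8 GPUs inside each node (the fast NVLink domain).
mesh = init_device_mesh("cuda", (2, 4, 8), mesh_dim_names=("pp", "dp", "tp"))

tp_mesh = mesh["tp"]  # sub-mesh for per-layer all-reduce/all-gather
dp_mesh = mesh["dp"]  # sub-mesh for gradient averaging / ZeRO sharding
```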

Best Configuration Experiments

The Ultra-Scale Playbook benchmarks show how optimal configurations vary by model size and cluster scale (1M token GBS, seq length 4096, H100 nodes):

  • Small models (1.3B) on few nodes: DP-heavy configurations work well (e.g., DP=4, TP=1, PP=2, MFU ~38%)
  • Medium models (3.5-8.8B): TP=4 within nodes + moderate PP + DP achieves ~40-46% MFU
  • Large models (80B): require aggressive TP (TP=4-16) + PP (PP=4-16) + ZeRO-1, with MFU dropping to ~15-45% depending on cluster size
  • Adding nodes generally hurts MFU: going from 1 node to 64 nodes drops MFU from ~45% to ~5% for a 1.3B model, as communication overhead dominates

Training Frameworks

Two major frameworks implement these parallelism strategies:

  • torchtitan: PyTorch-native platform supporting FSDP2, Tensor Parallel, Pipeline Parallel, Context Parallel, activation checkpointing, Float8, gradient accumulation, and more
  • Megatron-LM: NVIDIA’s GPU-optimized library with Megatron Core providing kernels, tensor/pipeline parallelism, distributed training (FSDP, DDP), and post-training support

Lec20 Takeaways

  • Memory is the first bottleneck: a 7B model needs ~126 GB for training state alone; activation recomputation and gradient accumulation help but cannot solve the problem for large models
  • Data parallelism scales throughput, not memory: each GPU still holds the full model; communication overhead grows with more GPUs
  • Tensor parallelism trades communication for memory: split weight matrices across GPUs within a node; keep TP within fast intra-node links
  • Pipeline parallelism introduces bubbles: splitting layers across GPUs creates idle time; 1F1B scheduling and more microbatches reduce but don’t eliminate the bubble
  • ZeRO eliminates redundancy: progressively sharding optimizer states, gradients, and parameters trades communication for memory savings; fundamentally different from TP/PP since it shards memory, not computation

Final Summary

| Topic | Key Idea |
| --- | --- |
| Compute & MFU | FLOPs = 6 × params × tokens per step; MFU = achieved / peak FLOPS; real runs get 30-45% |
| Memory breakdown | 18 bytes/param for mixed-precision Adam; activations scale quadratically with seq length |
| Activation recomputation | Trade ~33% more compute for dramatically less activation memory |
| Gradient accumulation | Simulate large batches with constant memory via micro-batch averaging |
| Data parallelism | Replicate model, split data; all-reduce gradients; overlap with bucketing |
| Tensor parallelism | Split weight matrices (column/row); best within a node due to communication cost |
| Pipeline parallelism | Split layers across GPUs; 1F1B schedule reduces bubble; best across nodes |
| ZeRO | Shard optimizer/gradients/params across DP ranks; memory sharding, not compute sharding |
| Strategy selection | Combine TP (intra-node) + PP (inter-node) + DP (remaining) + ZeRO-1; tune for MFU |

Key takeaway: Training large LLMs requires combining multiple parallelism strategies — each addresses a different dimension of the problem (data, hidden, layers, memory), and the optimal combination depends on model size, cluster topology, and batch size constraints.

References

  • The Ultra-Scale Playbook: Training LLMs on GPU Clusters (HuggingFace)
  • torchtitan: github.com/pytorch/torchtitan
  • Megatron-LM: github.com/NVIDIA/Megatron-LM
  • Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models,” 2020
  • Shoeybi et al., “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism,” 2019

This post is based on lecture materials from CMU 11-711 Advanced NLP by Sean Welleck (Lecture 20: Parallelism and Scaling).