Lec 2: PyTorch & Resource accounting

Numeric Formats (Floating-Point Formats)

Before diving into computations, we need to understand what we are calculating. This involves many trade-offs between computational efficiency and numerical precision.

Any floating-point number in a computer consists of three parts: the sign bit (Sign), the exponent (Exponent), and the fraction (Fraction). The calculation formula is \[ (-1)^S \times 1.M \times 2^{E-Bias} \]

  • S determines the sign
  • M determines precision; the more fraction bits, the more densely packed the representable numbers
  • E determines the dynamic range; the more bits, the larger the range represented
  • Bias is a fixed constant that allows the exponent to be negative, balancing the positive and negative ranges
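The decomposition above can be made concrete with a small sketch that pulls the sign, exponent, and fraction bits out of an FP32 number's raw bit pattern (the `fp32_parts` helper is illustrative, not from the lecture):

```python
# Decompose an FP32 number into sign / exponent / fraction bits
# via its raw 32-bit pattern (struct is standard library).
import struct

def fp32_parts(x: float):
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                # 1 sign bit
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, biased by 127
    fraction = bits & 0x7FFFFF       # 23 fraction bits (the M after "1.")
    return sign, exponent, fraction

s, e, m = fp32_parts(-6.5)
# -6.5 = (-1)^1 x 1.625 x 2^(129-127)
print(s, e - 127, 1 + m / 2**23)  # → 1 2 1.625
```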

Why Not Use FP32 Exclusively

Standard FP32 (Single Precision) has 1 sign bit, 8 exponent bits, and 23 fraction bits.

  • Advantages: High precision, large range, stable training
  • Disadvantages: Slow, consumes a lot of memory
  • In modern large-model training, FP32 is mainly kept as the master copy of weights during optimizer updates, rather than used for the core computations.

FP16 and BF16

To seek faster computational efficiency, the industry mainly uses 16-bit formats, primarily FP16 and BF16.

  1. FP16 (Half Precision):

    • Structure: 5 bits for exponent + 10 bits for fraction

    • Drawback: The small number of exponent bits means a small dynamic range; very small values, such as a gradient around \(10^{-8}\), underflow to zero.

  2. BF16 (Brain Float 16):

    • Structure: 8 bits for exponent + 7 bits for fraction
    • Core advantage: The exponent bits are the same as FP32, meaning it has the same dynamic range as FP32, with slightly lower precision.
    • Conclusion: Deep learning is sensitive to “range” and not sensitive to “precision”; neural networks are inherently fuzzy approximations, making BF16 the standard format for training large models today.
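The range difference is easy to see in PyTorch: a tiny value like \(10^{-8}\) underflows to zero in FP16 but survives in BF16 (at reduced precision). A minimal sketch:

```python
import torch

# FP16 range is tiny (~±65504); BF16 shares FP32's exponent range (~±3.4e38)
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38

g = torch.tensor(1e-8)
print(g.to(torch.float16).item())   # 0.0 — underflow, the gradient vanishes
print(g.to(torch.bfloat16).item())  # nonzero, but with only ~3 decimal digits
```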

Measuring Computational Power: FLOPs

Matrix Multiplication (MatMul)

The core of deep learning is matrix multiplication \(Y = X \times W\), and the computational cost formula is as follows: \[ FLOPs = 2 \cdot B \cdot D \cdot K \]

  • B: Batch Size
  • D: Input Dimension
  • K: Output Dimension
  • Origin of 2: A single FMA (Fused Multiply-Add) instruction at the computer level typically counts as 2 FLOPs, one multiplication and one addition.
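The \(2 \cdot B \cdot D \cdot K\) rule can be checked with a quick sketch (the dimensions below are illustrative assumptions):

```python
import torch

B, D, K = 32, 1024, 4096      # batch size, input dim, output dim
X = torch.randn(B, D)
W = torch.randn(D, K)
Y = X @ W                     # Y = X · W
assert Y.shape == (B, K)

# one multiply + one add (one FMA) per (b, d, k) triple
flops = 2 * B * D * K
print(f"{flops:,} FLOPs")     # → 268,435,456 FLOPs
```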

Training Cost Law: 1:2:6

This is the most important Rule of Thumb for this lesson: Why is training so much more expensive than inference?

The entire training process is divided into forward and backward propagation, each with its own computational cost.

  • Forward Pass: Computes \(H = XW\). Consumes 1 unit of compute.
  • Backward Pass: Requires computing two gradients: the weight gradient \(dW\), used to update the parameters, and the input gradient \(dX\), propagated to the previous layer. It therefore costs roughly 2 units.

For a model with P parameters trained on N tokens, the total floating-point operation count is approximately: \[ C \approx 6 \cdot N \cdot P \] The forward pass accounts for \(2NP\) FLOPs and the backward pass for \(4NP\), for a total of \(6NP\), which gives us the ratio above.
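Plugging illustrative numbers into the rule (the 7B-parameter, 1T-token configuration is an assumption, not from the lecture):

```python
P = 7e9    # parameters (assumed, e.g. a 7B model)
N = 1e12   # training tokens (assumed, e.g. 1T tokens)

forward = 2 * N * P    # 2 FLOPs per token per parameter
backward = 4 * N * P   # dW and dX each cost roughly another 2·N·P
total = forward + backward
print(f"{total:.2e} total training FLOPs")  # ≈ 4.2e22
```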

The parameter count we discuss here always refers to the total number of physically independent variables (excluding shared parameters) that need to be updated by the optimizer.

Shared parameters do reduce the parameter count, but they do not reduce FLOPs: the forward and backward passes still traverse those layers and perform the corresponding computations.

The reason we use \(\approx\) here is that we ignore two things:

  • Embedding layers and Softmax layers: These two layers have a considerable number of parameters P, and the vocabulary is large, but they do not fully adhere to the 2P computational logic. However, compared to the dozens of intermediate Transformer Blocks, the error is acceptable.
  • Attention’s \(N^2\) computation: We previously assumed that computational power is related to the parameter count P, but due to Attention, the multiplication of Query and Key has a computational volume of \(N^2\). This part does not consume parameters, but as long as N (sequence length) is much smaller than P, this term can be ignored.

You might wonder why Attention is \(N^2\)?

Actually, this is more like a simplification of time complexity; the precise computational formula is:

\[ 12N^2LH = \underbrace{L}_{\text{number of layers}} \times \underbrace{(4N^2H)}_{\text{forward propagation}} \times \underbrace{3}_{\text{training coefficient}} \]

  • L is the number of layers in the Transformer, and the model is stacked with L identical blocks.
  • The forward propagation part involves the multiplication of QKV matrices.
  • The training coefficient follows forward propagation 1, backward propagation 2, totaling 3.

We are comparing: “huge parameter count P” vs “sequence length N”. In current standard model configurations, the former is much larger than the latter.
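That comparison can be sketched numerically with the two formulas above (the configuration values below are illustrative assumptions):

```python
L_layers = 32    # transformer layers (assumed)
H = 4096         # hidden size (assumed)
N_seq = 2048     # sequence length (assumed)
P = 7e9          # parameter count (assumed)

# per-sequence training FLOPs from the two terms
dense_term = 6 * N_seq * P                  # parameter-driven 6·N·P
attn_term = 12 * N_seq**2 * L_layers * H    # quadratic attention 12·N²·L·H
print(attn_term / dense_term)               # small fraction while N_seq << P
```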

Hardware: H100 and Tensor Cores

GPUs have general cores and specialized cores:

  • CUDA Cores: General, flexible but slow, handling FP32/FP64
  • Tensor Cores: Designed specifically for matrix multiplication, H100’s Tensor Core can execute \(4 \times 4\) matrix operations every clock cycle, supporting mixed precision; inputs can be FP16/BF16, while internal accumulation uses FP32, balancing speed and numerical stability.

Sparsity:

NVIDIA advertises that the H100 has 1979 TFLOPs of FP16 computational power. However, this is the theoretical value with 2:4 Structured Sparsity enabled, which requires that every group of 4 elements contain 2 zeros so the hardware can skip the zero multiplications.

In reality, most training is dense training, so computational power is effectively halved.

Engineering Practice: Code and Dimension Management

Dimensionality Hell

In Transformer code, manually handling dimensions (like view, transpose) is prone to errors and hard to debug:

```python
x = x.view(B, N, H, D).permute(0, 2, 1, 3)  # which axes do 0, 2, 1, 3 refer to?
```

Einops Library

The course strongly recommends using the einops library, which makes dimension transformations “declarative” and “self-documenting”:

```python
from einops import rearrange

# Clear semantics: extract the head dimension and place it before the sequence length
q = rearrange(x, 'b s (h d) -> b h s d', h=8)
```

Principle: Code is not just for machines to run; it is also for people to read. Explicit is better than implicit.


Optimizer

Origin of AdamW

The mainstream optimizer we use now is AdamW, but how did it evolve?

  • SGD: Simple gradient descent, moving based on the direct gradient.
  • Momentum: Adds “inertia,” like physical momentum: it accumulates speed on steep slopes and carries the update through flat regions, which may be saddle points or local optima where plain SGD would stall.
  • RMSProp: Adds an adaptive step size (second-order momentum), small steps on steep paths, large steps on flat paths.
  • Adam: Momentum + RMSProp. Integrates the advantages of the first two.

What is Momentum

In deep learning optimizers, momentum is mainly divided into first-order and second-order momentum:

  • First-order momentum (Momentum) is a running average of gradients; it retains the sign, so it determines the direction of movement. The new velocity is computed from the historical velocity plus the current acceleration (the gradient).
  • Second-order momentum (Second Moment) is a running average of squared gradients; squaring removes the sign, so it measures only magnitude (how steep the gradient is), not direction.

RMSProp (Root Mean Square Propagation): \[ \text{Parameter Update} = \frac{\text{Learning Rate}}{\sqrt{\text{Second-order Momentum}}} \times \text{Gradient} \] It can reduce the step size when the gradient is large (small steps on steep paths) and amplify it when small (large steps on flat paths), addressing the oscillation problem in valleys and convergence issues at saddle points.
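A minimal scalar sketch of that update (the `beta` and `eps` values are common defaults, assumed here for illustration):

```python
def rmsprop_step(theta, g, v, lr=1e-3, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * g * g            # second moment: squared gradients
    theta = theta - lr * g / (v ** 0.5 + eps)    # shrink steps where gradients are large
    return theta, v

theta, v = 1.0, 0.0
for _ in range(3):
    theta, v = rmsprop_step(theta, g=2.0, v=v)   # constant steep gradient
print(theta, v)  # theta decreases in small, damped steps
```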

Adam (Adaptive Moment Estimation)

\[ \begin{aligned} \text{1. Compute Gradient:} &\quad g_t = \nabla_\theta J(\theta_{t-1}) \\ \text{2. Update First-order Momentum (Momentum):} &\quad m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \\ \text{3. Update Second-order Momentum (RMSProp):} &\quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \\ \text{4. Bias Correction:} &\quad \hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ \text{5. Parameter Update (Final Update):} &\quad \theta_t = \theta_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \end{aligned} \]

\[ \theta_t = \theta_{t-1} - \eta \cdot \frac{\overbrace{\left( \frac{\beta_1 m_{t-1} + (1-\beta_1)g_t}{1-\beta_1^t} \right)}^{\text{Corrected First-order Momentum (Direction + Inertia)}}}{\underbrace{\sqrt{\frac{\beta_2 v_{t-1} + (1-\beta_2)g_t^2}{1-\beta_2^t}}}_{\text{Corrected Second-order Momentum (Adaptive Step Size)}} + \epsilon} \] A simple reading: the numerator is inertia (direction), while the denominator is resistance, or normalization. In essence, Momentum's inertia pushes the update forward while RMSProp's adaptive term acts as a brake.
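Steps 1–5 above translate directly into a scalar sketch (the `eta`, `b1`, `b2` values are the common defaults, assumed for illustration):

```python
def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # 2. first moment (direction + inertia)
    v = b2 * v + (1 - b2) * g * g    # 3. second moment (adaptive step size)
    m_hat = m / (1 - b1 ** t)        # 4. bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (v_hat ** 0.5 + eps)  # 5. parameter update
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, g=2.0, m=m, v=v, t=1)
print(theta)  # ≈ 0.999 — after bias correction the first step has size ≈ eta
```

Note how bias correction matters at t=1: without it, \(m_1 = 0.2\) and \(v_1 = 0.004\) would be far too small because both moments start at zero.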

Key Takeaways

  1. Where Did the Computational Power Go? The vast majority of computational power (>95%) is spent on Forward and Backward matrix multiplications.
  2. Where Did the Memory Go?
    • Static: Model parameters (Weights) + Optimizer states (Optimizer States).
    • Dynamic: Intermediate activations (Activations). The larger the Batch Size and the longer the Context Length, the more memory activations consume.
  3. Where Are the Bottlenecks?
    • Matrix multiplication layers are usually Compute Bound (limited by computational power).
    • LayerNorm, Softmax, CrossEntropy are typically Memory Bound (limited by memory bandwidth).
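The static part of the memory budget can be sketched for mixed-precision Adam training (BF16 weights plus FP32 master weights and both moments is a common layout; the 7B parameter count is an illustrative assumption):

```python
P = 7e9  # parameter count (assumed, e.g. a 7B model)

# bytes per component of the static state
bytes_per = {
    "bf16 weights": 2 * P,
    "fp32 master weights": 4 * P,
    "fp32 first moment (m)": 4 * P,
    "fp32 second moment (v)": 4 * P,
}
total_gb = sum(bytes_per.values()) / 1e9
print(f"{total_gb:.0f} GB of static state, before any activations")  # → 98 GB
```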