Lec11 Multimodal Modeling I (Multi-to-Text)

Big Picture

This part of the course focuses on multi-to-text systems:

  • Input can include images (and text).
  • Output is text.
  • Core challenge: map image content into a sequence of vectors a language model can use.

Representing Images as Tokens

For text, we already have token embeddings.
For images, we need an encoder: \[ f_{\text{enc}}(x_{\text{image}}) \rightarrow z_1,\dots,z_L \] where each \(z_i\) is a vector token.

Vision Transformer (ViT)

ViT turns an image into a sequence of patch embeddings, then applies a standard Transformer.

Given: \[ x_{\text{image}} \in \mathbb{R}^{H \times W \times C} \]

Split into patches of size \(P \times P\): \[ N=\frac{HW}{P^2}, \quad x_p \in \mathbb{R}^{N \times (P^2C)} \]

Project each patch to model dimension \(D\): \[ x = x_p W_e, \quad W_e \in \mathbb{R}^{(P^2C)\times D}, \quad x \in \mathbb{R}^{N\times D} \]

Then add position embeddings and process with Transformer layers.
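As a concrete check of the patch arithmetic above, here is a minimal numpy sketch; the 224-px input, \(P=16\), and \(D=768\) are illustrative ViT-Base-style values, and `patchify` is a hypothetical helper, not code from the lecture:

```python
import numpy as np

def patchify(image, P):
    """Split an (H, W, C) image into N = HW / P^2 flattened patches."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
P, D = 16, 768
x_p = patchify(image, P)                       # (196, 768), since 16*16*3 = 768
W_e = rng.normal(size=(P * P * 3, D)) * 0.02   # patch projection W_e
x = x_p @ W_e                                  # (196, D) patch embeddings
```

With these numbers, \(N = 224 \cdot 224 / 16^2 = 196\) patch tokens, each projected to dimension \(D\).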



Key intuition from lecture:

  • Early layers capture local/structural signals.
  • Later layers attend to broader semantic regions.
  • ViT scales well with pretraining compute.

CLIP: Learn Vision from Language Supervision

CLIP (Radford et al., 2021) learns a shared embedding space for images and text.

  • Image encoder: \(f_I(x)\)
  • Text encoder: \(f_T(y)\)
  • Matched image-text pairs should be close in this space; mismatched pairs should be far apart.


For a batch of \(N\) aligned pairs \((x_n,y_n)\), define similarities: \[ s_{ij}=\frac{f_I(x_i)^\top f_T(y_j)}{\tau} \]

Symmetric contrastive objective: \[ \mathcal{L}_{\text{img}}=-\frac{1}{N}\sum_{i=1}^N \log \frac{e^{s_{ii}}}{\sum_j e^{s_{ij}}}, \quad \mathcal{L}_{\text{text}}=-\frac{1}{N}\sum_{i=1}^N \log \frac{e^{s_{ii}}}{\sum_j e^{s_{ji}}} \] \[ \mathcal{L}_{\text{CLIP}}=\frac{1}{2}\left(\mathcal{L}_{\text{img}}+\mathcal{L}_{\text{text}}\right) \]
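The symmetric objective can be sketched directly in numpy; `clip_loss`, the identity-matrix toy embeddings, and \(\tau = 0.07\) are illustrative choices for this sketch, not CLIP's actual implementation:

```python
import numpy as np

def log_softmax(s, axis):
    s = s - s.max(axis=axis, keepdims=True)
    return s - np.log(np.exp(s).sum(axis=axis, keepdims=True))

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric contrastive loss over N aligned (image, text) pairs."""
    # Cosine similarity via L2 normalization, scaled by temperature tau.
    img_emb = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt_emb = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    s = img_emb @ txt_emb.T / tau                         # s[i, j] as defined above
    l_img = -np.mean(np.diag(log_softmax(s, axis=1)))     # image -> text direction
    l_txt = -np.mean(np.diag(log_softmax(s, axis=0)))     # text -> image direction
    return 0.5 * (l_img + l_txt)

emb = np.eye(4)   # 4 perfectly aligned, mutually orthogonal pairs
```

With perfectly aligned orthogonal pairs the loss is near zero; shuffling the text side makes it large, which is exactly the signal the contrastive objective exploits.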

Why it mattered:

  • Uses natural language descriptions instead of only class labels.
  • Scales to web-scale image-text data.
  • Enables strong zero-shot transfer.

LLaVA-style Combination with a Language Model

General pipeline from lecture:

  1. Preprocess image (patching/cropping).
  2. Encode image with a vision encoder (often CLIP ViT).
  3. Linearly project vision features to LM embedding dimension.
  4. Concatenate visual tokens with text tokens.
  5. Train/fine-tune on image-text instruction data.
  6. For image positions, skip token-level LM loss.

A simple form: \[ h_v=\text{Proj}(f_I(x_{\text{image}})), \quad p_\theta(y_t\mid y_{<t},x_{\text{text}},h_v) \]
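A minimal numpy sketch of the projection-and-concatenation interface described above; all dimensions (`D_vis`, `D_lm`, the token counts) are illustrative placeholders, not the actual LLaVA configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
D_vis, D_lm = 1024, 4096          # vision feature dim, LM embedding dim (illustrative)
L_vis, L_txt = 576, 32            # number of visual and text tokens (illustrative)

vis_feats = rng.normal(size=(L_vis, D_vis))      # f_I(x_image)
W_proj = rng.normal(size=(D_vis, D_lm)) * 0.02   # the linear projector Proj
h_v = vis_feats @ W_proj                         # visual tokens in LM embedding space

txt_emb = rng.normal(size=(L_txt, D_lm))         # text token embeddings
seq = np.concatenate([h_v, txt_emb], axis=0)     # concatenated input to the LM

# Step 6 above: mask out the token-level LM loss at image positions.
loss_mask = np.concatenate([np.zeros(L_vis), np.ones(L_txt)])
```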


Case Notes Mentioned in Lecture

  • Molmo (AI2): CLIP ViT-L/14 (336px), then pooling/projection before feeding the LM; uses the full image plus crops.


  • PaliGemma (Google): the lecture highlighted that jointly updating the vision encoder and LM can outperform freezing either side.


Lec11 Takeaways

  • ViT gives a clean tokenization interface for images.
  • CLIP gives strong, scalable image representations through contrastive learning.
  • Multimodal assistants are mostly about interface design between vision tokens and LM tokens.

Lec12 Multimodal Modeling II (Generating Images)

Generative Paradigms

Lecture compared four families:

  • Autoregressive (AR): model \(p(x_t\mid x_{<t})\).


  • VAE: encode to latent \(z\), decode back.


  • GAN: generator vs discriminator game.


  • Diffusion: denoise from noise to data.


Attempt 1: Pixel-level Autoregression

Flatten image into a long sequence of pixel values: \[ x_{\text{img}} \rightarrow (x_1,\dots,x_T),\quad x_t\in\{0,\dots,255\} \] \[ \mathcal{L}_{\text{MLE}}=-\sum_{t=1}^{T}\log p_\theta(x_t\mid x_{<t}) \]

Examples discussed: PixelRNN, Image Transformer, iGPT.

Main bottlenecks:

  • Sequence length explodes (e.g., \(1024\times1024\times3\approx 3\)M tokens).
  • Pixel tokens are low-level; learning semantics is data-hungry.
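The length explosion is easy to verify with a toy flattening; the raster ordering here is one common convention, used purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Flatten to a 1-D token sequence x_1, ..., x_T with x_t in {0, ..., 255}.
tokens = image.reshape(-1)    # raster order: row by row, channels interleaved
T = tokens.size               # 32*32*3 = 3072 tokens even for a tiny image

# At 1024x1024x3 the same flattening gives 3,145,728 tokens -- the bottleneck.
```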

Attempt 2: Learn Discrete Image Tokens

Core idea: learn an image tokenizer/de-tokenizer so the LM models a shorter, semantic token sequence.

VAE Refresher

Standard objective: \[ \mathcal{L}_{\text{VAE}}(x)= -\mathbb{E}_{q_{\theta_{\text{enc}}}(z\mid x)}[\log p_{\theta_{\text{dec}}}(x\mid z)] +D_{\text{KL}}\!\left(q_{\theta_{\text{enc}}}(z\mid x)\|p(z)\right) \]

Equivalent view via ELBO: \[ \log p(x)\ge \mathbb{E}_{q(z\mid x)}[\log p(x\mid z)] -D_{\text{KL}}(q(z\mid x)\|p(z)) \]

VQ-VAE: Continuous to Discrete

Encoder gives continuous latent: \[ z_e(x)\in\mathbb{R}^{d} \]

Quantize by nearest codebook entry: \[ k^*=\arg\min_j\|z_e(x)-e_j\|_2,\quad z_q(x)=e_{k^*} \]

Train with reconstruction + codebook + commitment terms: \[ \mathcal{L}= -\log p(x\mid z_q(x)) +\|\text{sg}[z_e(x)]-e\|_2^2 +\beta\|z_e(x)-\text{sg}[e]\|_2^2 \]

Then model discrete image tokens autoregressively.
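A minimal sketch of the quantization step above, assuming a random codebook and encoder outputs purely for illustration (training would add the codebook/commitment terms and a straight-through gradient, which numpy cannot show):

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-codebook lookup: map each encoder output to its closest entry."""
    # Squared L2 distances, expanded as |a|^2 - 2 a.b + |b|^2 to avoid a large temp.
    d2 = ((z_e ** 2).sum(1, keepdims=True)
          - 2.0 * z_e @ codebook.T
          + (codebook ** 2).sum(1))            # (N, K)
    k = d2.argmin(axis=1)                      # discrete token ids k*
    return k, codebook[k]                      # z_q(x) = e_{k*}

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))          # K = 512 entries of dimension d = 64
z_e = rng.normal(size=(1024, 64))              # e.g. a flattened 32x32 latent grid
k, z_q = quantize(z_e, codebook)
# In training, the straight-through estimator copies gradients from z_q back to z_e.
```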


VQ-GAN

VQ-GAN improves on VQ-VAE by adding adversarial and perceptual losses during tokenizer training, then trains an AR Transformer on the resulting token sequences.

Practical gain from lecture:

  • The token sequence is much shorter than raw pixels (e.g., a \(32\times32\) latent grid gives 1024 tokens).
  • Better tradeoff between generation quality and AR modeling feasibility.


Unifying Text and Image Tokens

After tokenizer training:

  1. Add image tokens to LM vocabulary.
  2. Train/fine-tune on mixed text+image token streams.
  3. Decode image tokens back to pixels using de-tokenizer.

Examples discussed: DALL-E (2021), Chameleon (Meta, 2024).

Lec12 Takeaways

  • Pure pixel AR is conceptually simple but hard to scale.
  • Discrete tokenizers (VQ-VAE/VQ-GAN) are the key bridge for AR multimodal models.
  • Tradeoff: unified token modeling vs information loss from tokenization.

Lec15 Multimodal Modeling III (Diffusion and Flows)

Where This Fits

The first two multimodal lectures covered:

  • multi-to-text: encode images into token-like representations for an LM
  • token-based multi-to-image: generate images through discrete visual tokens

This lecture moves to the dominant modern paradigm for text-to-image systems:

  • start from noise
  • iteratively denoise toward an image
  • condition the process on text (and potentially other modalities)

Stable Diffusion 3 is a canonical example of this style.

Diffusion: Core Idea

Diffusion models define a fixed forward noising process and a learned reverse denoising process.

Forward process: \[ q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\sqrt{\alpha_t}\,x_{t-1}, (1-\alpha_t)I\right) \]

Let \(\beta_t = 1-\alpha_t\) and \(\bar{\alpha}_t=\prod_{i=1}^t \alpha_i\). Then we can sample directly from \(x_0\): \[ q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\sqrt{\bar{\alpha}_t}\,x_0,(1-\bar{\alpha}_t)I\right) \] \[ x_t=\sqrt{\bar{\alpha}_t}\,x_0+\sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0,I) \]

Noise schedule requirements from lecture:

  • \(\bar{\alpha}_0 = 1\) so we start with data
  • \(\bar{\alpha}_T \approx 0\) so we end close to pure noise
  • \(\bar{\alpha}_t\) should decrease monotonically in \(t\)

A common choice is a linear schedule for \(\beta_t\).

Training DDPM

In principle, the model maximizes \(\log p_\theta(x_0)\) through an ELBO over latent variables \(x_{1:T}\).
The ELBO decomposes into:

  • prior matching at the final noisy step
  • denoising terms that match each reverse transition
  • a reconstruction term near \(x_0\)

The practical training objective from the lecture is the simplified noise-prediction loss: \[ \mathcal{L}_{\text{simple}}(\theta)= \mathbb{E}_{x_0,\epsilon,t} \left[\left\|\epsilon-\epsilon_\theta(x_t,t)\right\|_2^2\right], \quad t \sim \text{Uniform}\{1,\dots,T\} \]

Training loop:

  1. sample a real image \(x_0\)
  2. sample timestep \(t\)
  3. sample Gaussian noise \(\epsilon\)
  4. construct \(x_t\)
  5. train the network to recover the injected noise
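The loop above can be sketched in numpy; the linear \(\beta\) schedule endpoints (1e-4 to 0.02) are a common DDPM choice, and `training_example` is a hypothetical helper for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
beta = np.linspace(1e-4, 0.02, T)        # linear schedule for beta_t
alpha_bar = np.cumprod(1.0 - beta)       # \bar{alpha}_t = prod of alpha_i

def training_example(x0):
    """One DDPM training example: (x_t, t, epsilon), with the noise as target."""
    t = rng.integers(0, T)                               # sample timestep uniformly
    eps = rng.normal(size=x0.shape)                      # sample Gaussian noise
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, t, eps                                   # train eps_theta(x_t, t) -> eps

x0 = rng.normal(size=(3, 8, 8))          # stand-in for a real image
x_t, t, eps = training_example(x0)
# L_simple is then the mean squared error between eps_theta(x_t, t) and eps.
```

Note how the schedule satisfies the requirements above: \(\bar{\alpha}\) starts near 1 and ends near 0.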

Sampling

The learned reverse process starts from noise: \[ x_T \sim \mathcal{N}(0,I) \]

Then for \(t=T,\dots,1\), use \(\epsilon_\theta(x_t,t)\) to parameterize \[ p_\theta(x_{t-1}\mid x_t) \] and sample the next less-noisy state. Repeating this process gradually converts noise into an image.
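A sketch of ancestral sampling, assuming the standard DDPM posterior-mean parameterization of \(p_\theta(x_{t-1}\mid x_t)\); the zero-noise "model" is a placeholder only to make the loop runnable:

```python
import numpy as np

def ddpm_sample(eps_model, shape, beta, rng):
    """Ancestral DDPM sampling with a noise-prediction model eps_model(x_t, t)."""
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)
    x = rng.normal(size=shape)                        # x_T ~ N(0, I)
    for t in range(len(beta) - 1, -1, -1):
        eps = eps_model(x, t)
        # Posterior mean: subtract the predicted noise, then rescale.
        mean = (x - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])
        z = rng.normal(size=shape) if t > 0 else 0.0  # no noise at the final step
        x = mean + np.sqrt(beta[t]) * z
    return x

rng = np.random.default_rng(0)
beta = np.linspace(1e-4, 0.02, 50)
# Placeholder "model" that predicts zero noise, just to exercise the loop.
sample = ddpm_sample(lambda x, t: np.zeros_like(x), (2, 2), beta, rng)
```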

DDPM as Score Learning

Lecture made an important connection between noise prediction and score estimation.

Write the noisy variable as: \[ z_t = a_t x_0 + \sigma_t \epsilon \]

The score of the marginal distribution is: \[ \nabla_{z_t} \log p(z_t) \]

For this Gaussian corruption process, the posterior-averaged noise satisfies: \[ \mathbb{E}[\epsilon \mid z_t] = -\sigma_t \nabla_{z_t}\log p(z_t) \]

Since DDPM trains \(\epsilon_\theta(z_t,t)\) to approximate the true injected noise, the optimal predictor satisfies: \[ \epsilon_\theta(z_t,t)\approx -\sigma_t \nabla_{z_t}\log p(z_t) \]

So DDPM implicitly learns the score function. Intuitively:

  • predicting the noise tells the model how to remove corruption
  • removing corruption moves samples toward higher-probability regions

Conditional Generation and Guidance

To generate from a prompt or label \(c\), we need a conditional model.

Classifier Guidance

If we have an auxiliary classifier \(p_\phi(c \mid z_t)\), then we can modify the denoising prediction with the classifier gradient: \[ \tilde{\epsilon}_{\theta,\phi}(z_t,c) = \epsilon_\theta(z_t,c)-w\sigma_t \nabla_{z_t}\log p_\phi(c\mid z_t) \]

The extra term pushes sampling toward images more compatible with condition \(c\).

Classifier-Free Guidance

Instead of training a separate classifier, train one model both with and without conditioning, then combine the two predictions at sampling time: \[ \tilde{\epsilon}_\theta(z_t,c) = (1+w)\epsilon_\theta(z_t,c)-w\epsilon_\theta(z_t) \]

This is the standard practical recipe in modern diffusion systems.
Larger guidance weight \(w\) usually improves prompt alignment, but often reduces diversity.
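The combination rule itself is a one-liner; the toy predictions below are purely illustrative:

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    return (1.0 + w) * eps_cond - w * eps_uncond

eps_c = np.array([1.0, 0.0])             # eps_theta(z_t, c), toy values
eps_u = np.array([0.5, 0.5])             # eps_theta(z_t), unconditional prediction
guided = cfg_noise(eps_c, eps_u, w=3.0)  # moves further along (eps_c - eps_u)
```

Rewriting it as \(\epsilon_\theta(z_t,c) + w(\epsilon_\theta(z_t,c)-\epsilon_\theta(z_t))\) makes the extrapolation explicit, and shows why larger \(w\) sharpens conditioning at the cost of diversity.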

Important Extensions

The lecture highlighted several common extensions:

| Problem | Solution | Main idea |
| --- | --- | --- |
| Pixel-space diffusion is expensive | Latent diffusion | Run diffusion in an autoencoder latent space |
| U-Net inductive bias may be limiting | Diffusion Transformer (DiT) | Replace the convolution-heavy backbone with a Transformer |
| Reverse process can be stochastic and slow | Continuous-time / ODE views | Use deterministic paths and better solvers |
| Sampling still takes many steps | Distillation | Train a student sampler with fewer steps |

These are the main reasons modern systems are much more practical than the original pixel-space DDPM setup.

Continuous-Time View

As the number of steps goes to infinity, diffusion can be described as a stochastic differential equation (SDE): \[ dz = f(t)z\,dt + g(t)\,dw \]

The reverse-time SDE becomes: \[ dz= \left[f(t)z-g(t)^2 \nabla_z \log p_t(z)\right]dt + g(t)\,d\bar{w} \]

Because DDPM estimates the score, we can plug in: \[ \nabla_z \log p_t(z)\approx -\frac{\epsilon_\theta(z,t)}{\sigma_t} \]

There is also a deterministic probability-flow ODE with the same marginals: \[ dz= \left[f(t)z-\frac{1}{2}g(t)^2 \nabla_z \log p_t(z)\right]dt \]

This viewpoint matters because it connects diffusion to numerical ODE solvers and motivates alternative generation methods.
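A sketch of Euler integration of the probability-flow ODE, with the score plugged in from a (placeholder) noise predictor; `f`, `g`, and `sigma` here are toy illustrative choices, not a specific paper's schedule:

```python
import numpy as np

def probability_flow_euler(eps_model, z, f, g, sigma, ts):
    """Euler steps of dz = [f(t) z - 0.5 g(t)^2 score] dt,
    with score(z, t) approximated by -eps_model(z, t) / sigma(t)."""
    for i in range(len(ts) - 1):
        t, dt = ts[i], ts[i + 1] - ts[i]
        score = -eps_model(z, t) / sigma(t)
        z = z + (f(t) * z - 0.5 * g(t) ** 2 * score) * dt
    return z

rng = np.random.default_rng(0)
z = rng.normal(size=(4,))                 # start from noise
ts = np.linspace(1.0, 0.0, 11)            # integrate from t = 1 (noise) to t = 0 (data)
z_out = probability_flow_euler(
    eps_model=lambda z, t: 0.1 * z,       # placeholder noise predictor
    z=z, f=lambda t: 0.0, g=lambda t: 1.0,
    sigma=lambda t: max(t, 1e-3), ts=ts)
```

Because the trajectory is deterministic, any off-the-shelf ODE solver (higher-order, adaptive step) can replace the Euler steps, which is exactly what fast samplers exploit.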

Flow Matching

Flow matching keeps the idea of transporting one distribution into another, but formulates the problem more directly as learning a velocity field.

Let:

  • \(p_0\) be a simple source distribution such as Gaussian noise
  • \(p_1\) be the target data distribution
  • \(v_t(\cdot)\) be a time-dependent velocity field

The flow \(\phi_t\) is defined by the ODE: \[ \frac{d}{dt}\phi_t(x)=v_t(\phi_t(x)), \quad \phi_0(x)=x \]

If \(v_t\) is correct, pushing \(X_0 \sim p_0\) through this ODE gives \(X_1 \sim p_1\).

Conditional Flow Matching

For a single data point \(x_1\), lecture used the straight-line path: \[ X_t = (1-t)X_0 + t x_1 \]

Its velocity is known exactly: \[ \frac{d}{dt}X_t = x_1 - X_0 \]

So training becomes: \[ \mathcal{L}_{\text{CFM}} = \mathbb{E}_{x_1,X_0,t} \left[\left\|v_\theta(X_t,t)-(x_1-X_0)\right\|_2^2\right] \]

Key intuition:

  • diffusion learns how to denoise a corrupted point
  • flow matching learns how to move along a transport path

Compared with DDPM, flow matching often gives straighter trajectories and can be integrated in fewer steps.
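A sketch of the CFM objective on the straight-line path, with toy Gaussian "data" and two hypothetical predictors for comparison:

```python
import numpy as np

def cfm_loss(v_model, x1, x0, t):
    """Conditional flow matching loss on the straight-line path
    X_t = (1 - t) X_0 + t x_1, whose exact velocity is x_1 - X_0."""
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    target = x1 - x0
    pred = v_model(x_t, t)
    return np.mean((pred - target) ** 2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=(16, 2)) + 3.0   # toy "data" batch, shifted from the source
x0 = rng.normal(size=(16, 2))         # noise batch, X_0 ~ p_0
t = rng.uniform(size=16)              # timesteps in [0, 1]

# For this toy setup the mean displacement is about 3, so a constant predictor
# of 3 should beat a predictor of 0.
loss_good = cfm_loss(lambda x, t: np.full_like(x, 3.0), x1, x0, t)
loss_bad = cfm_loss(lambda x, t: np.zeros_like(x), x1, x0, t)
```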

Case Study: Stable Diffusion 3

Lecture used Stable Diffusion 3 as a modern example:

  • Flow matching: replaces the classic DDPM-style training target
  • Latent space: operates on the latent space of a pretrained autoencoder
  • Text conditioning: combines CLIP text features with T5 features

The broader lesson is that state-of-the-art text-to-image systems are now hybrids:

  • latent-space modeling for efficiency
  • strong text encoders for conditioning
  • Transformer-style backbones and ODE/flow-inspired training for better generation quality

Lec15 Takeaways

  • DDPM trains a model to reverse a fixed noising process.
  • Noise prediction is closely tied to score estimation.
  • Guidance makes diffusion usable for conditional generation.
  • Continuous-time views connect diffusion to ODE solvers.
  • Flow matching is a cleaner transport-based alternative that now powers systems like Stable Diffusion 3.

Final Summary

  • Lec11: how to encode images for text generation (ViT + CLIP + LM integration).
  • Lec12: how to generate images with token-based generative modeling.
  • Lec15: how modern text-to-image systems generate images by denoising or transporting noise into data.
  • Together they frame multimodal systems as: representation (encode) + alignment (shared space) + generation (discrete or continuous).

This post is based on CMU 11-711 Advanced NLP lecture materials (Multimodal Modeling I, II, and III, Sean Welleck).