Lec11 Multimodal Modeling I (Multi-to-Text)

Big Picture

This part of the course focuses on multi-to-text systems:

  • Input can include images (and text).
  • Output is text.
  • Core challenge: map image content into a sequence of vectors a language model can use.

Representing Images as Tokens

For text, we already have token embeddings.
For images, we need an encoder: \[ f_{\text{enc}}(x_{\text{image}}) \rightarrow z_1,\dots,z_L \] where each \(z_i\) is a vector token.

Vision Transformer (ViT)

ViT turns an image into a sequence of patch embeddings, then applies a standard Transformer.

Given: \[ x_{\text{image}} \in \mathbb{R}^{H \times W \times C} \]

Split into patches of size \(P \times P\): \[ N=\frac{HW}{P^2}, \quad x_p \in \mathbb{R}^{N \times (P^2C)} \]

Project each patch to model dimension \(D\): \[ x = x_p W_e, \quad W_e \in \mathbb{R}^{(P^2C)\times D}, \quad x \in \mathbb{R}^{N\times D} \]

Then add position embeddings and process with Transformer layers.
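As a concrete sketch, the patching and projection above can be written in a few lines of numpy (shapes follow the 224px, \(P=16\) ViT-Base setup; the weights here are random placeholders, not trained parameters):

```python
import numpy as np

def patchify(image, P):
    """Split an (H, W, C) image into N = HW / P^2 flattened patches."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
P, D = 16, 768
x_p = patchify(image, P)                      # (196, 768): 224/16 = 14, so N = 14*14 = 196
W_e = rng.normal(size=(P * P * 3, D)) * 0.02  # patch projection, (P^2 C, D)
x = x_p @ W_e                                 # (196, 768): one embedding per patch
```

Position embeddings would then be added to `x` before the Transformer layers.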

Key intuitions from the lecture:

  • Early layers capture local/structural signals.
  • Later layers attend to broader semantic regions.
  • ViT scales well with pretraining compute.

CLIP: Learn Vision from Language Supervision

CLIP (Radford et al., 2021) learns a shared embedding space for images and text.

  • Image encoder: \(f_I(x)\)
  • Text encoder: \(f_T(y)\)
  • Embeddings of matched image-text pairs should be close; mismatched pairs should be far apart.

For a batch of \(N\) aligned pairs \((x_n,y_n)\), define similarities: \[ s_{ij}=\frac{f_I(x_i)^\top f_T(y_j)}{\tau} \]

Symmetric contrastive objective: \[ \mathcal{L}_{\text{img}}=-\frac{1}{N}\sum_{i=1}^N \log \frac{e^{s_{ii}}}{\sum_j e^{s_{ij}}}, \quad \mathcal{L}_{\text{text}}=-\frac{1}{N}\sum_{i=1}^N \log \frac{e^{s_{ii}}}{\sum_j e^{s_{ji}}} \] \[ \mathcal{L}_{\text{CLIP}}=\frac{1}{2}\left(\mathcal{L}_{\text{img}}+\mathcal{L}_{\text{text}}\right) \]
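A minimal numpy sketch of this symmetric objective, assuming precomputed encoder outputs (the `xent_diag` helper is illustrative, not part of CLIP's code):

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE loss over a batch of N aligned (image, text) pairs."""
    # L2-normalize, as in CLIP, so similarity is cosine similarity / tau
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    s = img @ txt.T / tau  # s[i, j] = similarity of image i with text j

    def xent_diag(logits):
        # -log softmax of the diagonal (matched-pair) entries, averaged over the batch
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # rows of s: classify the right text per image; rows of s.T: the right image per text
    return 0.5 * (xent_diag(s) + xent_diag(s.T))
```

Shuffling one side of the batch breaks the pairing and should increase the loss.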

Why it mattered:

  • Uses natural language descriptions instead of only class labels.
  • Scales to web-scale image-text data.
  • Enables strong zero-shot transfer.
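Zero-shot classification follows directly from the shared space: embed one text prompt per class (e.g., "a photo of a {label}") and pick the nearest one. A hedged sketch, assuming the embeddings are already computed:

```python
import numpy as np

def zero_shot_classify(img_emb, class_txt_embs):
    """Return the index of the class whose text embedding is closest in cosine similarity."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = class_txt_embs / np.linalg.norm(class_txt_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))  # one prompt embedding per class, e.g. "a photo of a {label}"
```

No classifier head is trained; the label set can change at inference time by re-embedding the prompts.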

LLaVA-style Combination with a Language Model

General pipeline from the lecture:

  1. Preprocess image (patching/cropping).
  2. Encode image with a vision encoder (often CLIP ViT).
  3. Linearly project vision features to LM embedding dimension.
  4. Concatenate visual tokens with text tokens.
  5. Train/fine-tune on image-text instruction data.
  6. Skip the token-level LM loss at image positions (only text tokens are supervised).

A simple form: \[ h_v=\text{Proj}(f_I(x_{\text{image}})), \quad p_\theta(y_t\mid y_{<t},x_{\text{text}},h_v) \]
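A shape-level sketch of steps 3-4 (projection and concatenation); the dimensions below are illustrative, not the actual LLaVA configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
D_vis, D_lm = 1024, 4096                        # vision-encoder and LM hidden sizes (illustrative)
vision_feats = rng.normal(size=(576, D_vis))    # e.g. CLIP ViT patch features for one image
W_proj = rng.normal(size=(D_vis, D_lm)) * 0.02  # the learned linear projection

h_v = vision_feats @ W_proj                     # visual tokens mapped into LM embedding space
text_embs = rng.normal(size=(32, D_lm))         # embedded text-prompt tokens (placeholder)
lm_input = np.concatenate([h_v, text_embs], axis=0)  # (576 + 32, D_lm): one joint sequence
```

The LM then attends over this joint sequence; only the text positions contribute to the loss.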

Case Notes Mentioned in Lecture

  • Molmo (AI2): CLIP ViT-L/14 (336px), then pooling/projection before feeding the LM; uses the full image plus crops.

  • PaliGemma (Google): the lecture highlighted that jointly updating the vision encoder and LM can outperform freezing either side.

Lec11 Takeaways

  • ViT gives a clean tokenization interface for images.
  • CLIP gives strong, scalable image representations through contrastive learning.
  • Multimodal assistants are mostly about interface design between vision tokens and LM tokens.

Lec12 Multimodal Modeling II (Generating Images)

Generative Paradigms

The lecture compared four families:

  • Autoregressive (AR): model \(p(x_t\mid x_{<t})\).

  • VAE: encode to latent \(z\), decode back.

  • GAN: generator vs discriminator game.

  • Diffusion: denoise from noise to data.

Attempt 1: Pixel-level Autoregression

Flatten image into a long sequence of pixel values: \[ x_{\text{img}} \rightarrow (x_1,\dots,x_T),\quad x_t\in\{0,\dots,255\} \] \[ \mathcal{L}_{\text{MLE}}=-\sum_{t=1}^{T}\log p_\theta(x_t\mid x_{<t}) \]

Examples discussed: PixelRNN, Image Transformer, iGPT.

Main bottlenecks:

  • Sequence length explodes (e.g., \(1024\times1024\times3\approx 3\text{M}\) tokens).
  • Pixel tokens are low-level; learning semantics is data-hungry.
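The arithmetic behind the first bottleneck, using the lecture's numbers:

```python
# Raw-pixel AR sequence length vs. a discrete latent-token sequence (lecture's numbers)
pixel_tokens = 1024 * 1024 * 3   # one token per sub-pixel value of a 1024x1024 RGB image
vq_tokens = 32 * 32              # a 32x32 latent grid, as in VQ-GAN-style tokenizers
ratio = pixel_tokens // vq_tokens
print(pixel_tokens, vq_tokens, ratio)  # the latent sequence is ~3000x shorter
```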

Attempt 2: Learn Discrete Image Tokens

Core idea: learn an image tokenizer/de-tokenizer so the LM models a shorter, semantic token sequence.

VAE Refresher

Standard objective: \[ \mathcal{L}_{\text{VAE}}(x)= -\mathbb{E}_{q_{\theta_{\text{enc}}}(z\mid x)}[\log p_{\theta_{\text{dec}}}(x\mid z)] +D_{\text{KL}}\!\left(q_{\theta_{\text{enc}}}(z\mid x)\|p(z)\right) \]

Equivalent view via ELBO: \[ \log p(x)\ge \mathbb{E}_{q(z\mid x)}[\log p(x\mid z)] -D_{\text{KL}}(q(z\mid x)\|p(z)) \]
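For the common choice of a diagonal Gaussian posterior \(q(z\mid x)=\mathcal{N}(\mu,\operatorname{diag}(\sigma^2))\) against a standard normal prior, the KL term above has a closed form; a small numpy sketch:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    # closed form: 0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
```

It is zero exactly when the posterior matches the prior (\(\mu=0\), \(\sigma^2=1\)) and grows as the posterior drifts away.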

VQ-VAE: Continuous to Discrete

Encoder gives continuous latent: \[ z_e(x)\in\mathbb{R}^{d} \]

Quantize by nearest codebook entry: \[ k^*=\arg\min_j\|z_e(x)-e_j\|_2,\quad z_q(x)=e_{k^*} \]

Train with reconstruction + codebook + commitment terms: \[ \mathcal{L}= -\log p(x\mid z_q(x)) +\|\text{sg}[z_e(x)]-e_{k^*}\|_2^2 +\beta\|z_e(x)-\text{sg}[e_{k^*}]\|_2^2 \]

Then model discrete image tokens autoregressively.
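The nearest-neighbor quantization step can be sketched in numpy; the straight-through gradient trick is noted in a comment, since plain numpy has no autodiff:

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-codebook lookup: z_e (N, d) -> indices k* and quantized z_q (N, d)."""
    # squared distances ||z_e - e_j||^2 to every codebook entry
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    k = d2.argmin(axis=1)          # k* = argmin_j ||z_e - e_j||
    z_q = codebook[k]              # z_q = e_{k*}
    # straight-through estimator (in an autodiff framework):
    #   z_q = z_e + stop_gradient(z_q - z_e)
    # so the decoder sees z_q but gradients flow back through z_e
    return k, z_q
```

The resulting index sequence `k` is what the autoregressive model is trained on.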

VQ-GAN

VQ-GAN improves VQ-VAE by adding adversarial/perceptual signals in tokenizer training, then training an AR Transformer on resulting token sequences.

Practical gains from the lecture:

  • The sequence becomes much shorter than raw pixels (e.g., a \(32\times32=1024\)-token latent grid).
  • Better tradeoff between generation quality and AR modeling feasibility.

Unifying Text and Image Tokens

After tokenizer training:

  1. Add image tokens to LM vocabulary.
  2. Train/fine-tune on mixed text+image token streams.
  3. Decode image tokens back to pixels using the de-tokenizer.
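A toy sketch of the vocabulary unification above; all sizes and token ids here are hypothetical:

```python
# Extend the LM vocabulary with discrete image tokens (sizes are illustrative)
TEXT_VOCAB = 32000                          # text tokens occupy ids 0 .. TEXT_VOCAB-1
IMG_CODEBOOK = 8192                         # VQ codebook size

def img_token_id(k):
    """Shift an image code k* past the text range so both share one vocabulary."""
    assert 0 <= k < IMG_CODEBOOK
    return TEXT_VOCAB + k

# A mixed stream: text ids, then image-token ids, then text again
text_prefix = [17, 204, 9]                  # hypothetical text token ids
image_codes = list(range(5))                # k* indices produced by the tokenizer
stream = text_prefix + [img_token_id(k) for k in image_codes] + [2]
# at decode time, any id >= TEXT_VOCAB is mapped back to its code (id - TEXT_VOCAB)
# and the code grid is passed through the de-tokenizer to recover pixels
```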

Examples discussed: DALL-E (2021), Chameleon (Meta, 2024).

Lec12 Takeaways

  • Pure pixel AR is conceptually simple but hard to scale.
  • Discrete tokenizers (VQ-VAE/VQ-GAN) are the key bridge for AR multimodal models.
  • Tradeoff: unified token modeling vs information loss from tokenization.

Final Summary

  • Lec11: how to encode images for text generation (ViT + CLIP + LM integration).
  • Lec12: how to generate images with token-based generative modeling.
  • Together they frame multimodal systems as:
    representation (encode) + alignment (shared space) + generation (decode).

This post is based on CMU 11-711 Advanced NLP lecture materials (Multimodal Modeling I & II, Sean Welleck).