Tokenization and Embedding

What is Tokenization?

Tokenization is the process of splitting text into basic units (tokens) that can be processed by language models. Many words don’t map to a single token - for example, “indivisible” might be split into multiple subword tokens.

The general pipeline is:

  1. Tokenizer: Split text into token IDs
  2. Embedding table lookup: Convert token IDs to vector representations

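The two steps can be sketched with a toy vocabulary (the token IDs, vectors, and the subword split of “indivisible” are all illustrative):

```python
# Toy sketch of the tokenize -> embed pipeline (vocabulary and
# embedding values are made up for illustration).
vocab = {"in": 0, "divis": 1, "ible": 2}     # token -> token ID
embedding_table = [                          # one row per token ID
    [0.1, 0.3],   # "in"
    [0.7, 0.2],   # "divis"
    [0.4, 0.9],   # "ible"
]

def embed(tokens):
    ids = [vocab[t] for t in tokens]             # 1. tokens -> IDs
    return [embedding_table[i] for i in ids]     # 2. table lookup -> vectors

print(embed(["in", "divis", "ible"]))
```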

Simple Tokenization Approaches

Word-level Tokenization

Approach: Break text by spaces and punctuation

Examples:

  • Works well for English, French, German, Spanish
  • Special treatment: numbers replaced by a special token [number]

Example:

```
"The most eager is Oregon which is enlisting 5,000 drivers in the country's biggest experiment."

["The", "most", "eager", "is", "Oregon", "which", "is", "enlisting", "5,000", "drivers", ...]
```
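A minimal word-level tokenizer along these lines, including the [number] replacement, might look like:

```python
import re

def word_tokenize(text):
    # Split into numbers (with separators like "5,000"), words, or punctuation.
    # The number alternative must come first, or "5,000" splits apart.
    tokens = re.findall(r"[0-9][0-9,.]*|\w+|[^\w\s]", text)
    # Replace numbers with a special [number] token.
    return ["[number]" if t[0].isdigit() else t for t in tokens]

print(word_tokenize("The most eager is Oregon which is enlisting 5,000 drivers."))
```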

Challenge: What exactly is a “word”?

  • Clitics: “Bob’s”
  • Compounds: “handyman”
  • Multi-word expressions: “do-it-yourself”
  • Contractions: “isn’t”

Pros and Cons:

  • ✅ Easy to implement
  • ❌ Out-of-vocabulary (OOV) tokens (e.g., “Covid”)
  • ❌ Tradeoff between vocabulary size and unknown tokens:
    • Smaller vocab → fewer parameters, more OOV
    • Larger vocab → more parameters, less OOV
  • ❌ Hard for languages without spaces (Chinese, Japanese, Korean, Khmer)

Character-level Tokenization

Approach: Each letter and punctuation is a token

```
"The most eager is Oreg..."

["T", "h", "e", " ", "m", "o", "s", "t", " ", "e", "a", "g", "e", "r", ...]
```

Pros:

  • ✅ Very small vocabulary (except for languages like Chinese)
  • ✅ No OOV tokens

Cons:

  • ❌ Longer sequences
  • ❌ Tokens don’t represent semantic meaning

Subword-level Tokenization

Goal

  • Moderate size vocabulary
  • No OOV tokens
  • Represent rare words by sequences of subwords

Byte Pair Encoding (BPE)

Origin: Originally developed for data compression by Philip Gage (1994)

Algorithm for Building Vocabulary:

  1. Initialize the vocabulary with all characters as tokens (plus an end-of-word symbol) and their frequencies
  2. Loop until the vocabulary reaches capacity:
    1. Count successive pairs of tokens in the corpus
    2. Rank and select the most frequent pair
    3. Merge the pair to form a new token, add it to the vocabulary
  3. Output the final vocabulary

Example:

```
Initial corpus:
cat: 90
catch: 50
rat: 80
rattle: 40

Step 1: Start with characters
Vocab: {a, c, e, h, l, r, t}

Step 2: Merge 'a' + 't' (most frequent)
Vocab: {a, c, e, h, l, r, t, at}

Step 3: Merge 'c' + 'at'
Vocab: {a, c, e, h, l, r, t, at, cat}

Step 4: Merge 'r' + 'at'
Vocab: {a, c, e, h, l, r, t, at, cat, rat}

Step 5: Merge 'cat' + 'c'
Vocab: {a, c, e, h, l, r, t, at, cat, rat, catc}
```

Tokenization Process:

  1. Split text by spaces or other delimiters
  2. Repeat until no text is left:
    • Greedily find the longest prefix that matches a token in the BPE dictionary
    • Split it off and process the remaining parts
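The greedy matching loop can be sketched as follows, reusing the toy vocabulary from the cat/catch example (the single-character fallback is an assumption; real BPE vocabularies typically contain every base character or byte):

```python
def bpe_tokenize(word, vocab):
    """Greedily take the longest prefix of `word` that is in `vocab`."""
    tokens = []
    while word:
        for end in range(len(word), 0, -1):
            if word[:end] in vocab:
                tokens.append(word[:end])
                word = word[end:]
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(word[0])
            word = word[1:]
    return tokens

vocab = {"a", "c", "e", "h", "l", "r", "t", "at", "cat", "rat", "catc"}
print(bpe_tokenize("catch", vocab))   # ['catc', 'h']
```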

Example Output:

```
"The most eager is Oregon which is enlisting 5,000 drivers in the country's biggest experiment."

["The", "most", "eager", "is", "Oregon", "which", "is", "en", "listing", "5,000", "driver", "s", "in", "the", "country", "'s", "big", "g", "est", "experiment", "."]
```

Code Examples:
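A compact sketch of the vocabulary-building loop, run on the cat/catch/rat/rattle corpus from the example above (end-of-word symbols are omitted for brevity; this is not a production implementation):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from {word: frequency}; returns the merge list."""
    # Represent each word as a tuple of current tokens (characters at first).
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count successive token pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        merges.append(best)
        # Merge the chosen pair everywhere in the corpus.
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if word[i:i + 2] == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

freqs = {"cat": 90, "catch": 50, "rat": 80, "rattle": 40}
print(learn_bpe(freqs, 4))
```

The first three merges reproduce the steps in the example: 'a'+'t', then 'c'+'at', then 'r'+'at'.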

Information-Theoretic Vocabulary (VOLT)

Motivation

Finding the optimal vocabulary traditionally requires:

  • Multiple full training and testing cycles
  • Trying different vocabulary sizes (1k, 10k, 30k tokens)
  • Evaluating NLG/MT performance for each configuration

Challenge: Can we predict good vocabulary without full training?

Measuring Vocabulary Quality

1. Compression Metrics:

  • Bytes Per Token (BPT): \[BPT = \frac{\text{utf8 bytes}}{\text{tokens}}\]
  • Normalized Sequence Length (NSL): \[NSL = \frac{\text{tokens}}{\text{tokens in LLaMA}}\]

2. Normalized Entropy: \[\mathcal{H}(v) = -\frac{1}{l_v}\sum_{i \in v} P(i)\log P(i)\]

where \(l_v\) is the average number of characters for all tokens in vocabulary \(v\).

Interpretation: Measures semantic-information-per-character. Smaller is better (less ambiguity, easier to generate).

Example:

```
Vocabulary 1: {a: 200, e: 90, c: 30, t: 30, s: 90}
→ H(v) = 1.37

Vocabulary 2: {a: 100, aes: 90, cat: 30}
→ H(v) = 0.14 ✓ Better!
```
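The first value can be checked numerically. Note that \(l_v\) is taken here as the unweighted average token length, which reproduces Vocabulary 1’s 1.37; the exact value for Vocabulary 2 depends on how the averaging is weighted, so only the ordering is asserted:

```python
import math

def normalized_entropy(vocab_freqs):
    """H(v) = -(1/l_v) * sum_i P(i) log P(i), with l_v the average number
    of characters per token (unweighted here -- an assumption)."""
    total = sum(vocab_freqs.values())
    entropy = -sum((f / total) * math.log(f / total)
                   for f in vocab_freqs.values())
    l_v = sum(len(t) for t in vocab_freqs) / len(vocab_freqs)
    return entropy / l_v

v1 = {"a": 200, "e": 90, "c": 30, "t": 30, "s": 90}
v2 = {"a": 100, "aes": 90, "cat": 30}
print(round(normalized_entropy(v1), 2))                    # 1.37
print(normalized_entropy(v2) < normalized_entropy(v1))     # True
```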

Marginal Utility of Vocabulary (MUV)

Definition: \[M_{v_k \rightarrow v_{k+m}} = -\frac{H(v_{k+m}) - H(v_k)}{m}\]

  • Value: Normalized Entropy (semantic information)
  • Cost: Vocabulary size
  • MUV: Negative gradient of normalized entropy with respect to size
  • Interpretation: How much value each added token brings
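As a worked example (the entropies and vocabulary sizes below are made up), MUV is just the negated entropy change per added token, so it is positive when growing the vocabulary lowers normalized entropy:

```python
def muv(h_k, h_k_plus_m, m):
    # Negative gradient of normalized entropy w.r.t. vocabulary size:
    # positive when adding tokens lowers entropy.
    return -(h_k_plus_m - h_k) / m

# Hypothetical entropies for vocabularies 1000 tokens apart.
print(round(muv(h_k=1.37, h_k_plus_m=1.25, m=1000), 8))   # 0.00012
```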

Key Finding: MUV Predicts Performance

Empirical observations:

  • Maximum MUV correlates with the best BLEU scores
  • MUV and BLEU are correlated on ~2/3 of tasks
  • MUV serves as a coarse-grained evaluation metric for vocabulary quality


Practical Considerations in LLMs

1. Corpus Deduplication and Filtering

LLaMA 3 Deduplication Strategy:

  • URL-level dedup: Remove duplicate URLs
  • Document-level dedup: Using MinHash
  • Line-level dedup: Using a 64-bit SHA-1 hash, applied within buckets of 30M documents
  • Removes boilerplate (navigation menus, cookie warnings, contact info)

Filtering:

  • N-gram repeats within one line
  • “Dirty word” counting
  • Token-distribution KL divergence too different from the corpus
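A minimal sketch of hash-based line-level dedup in this spirit (a single in-memory set stands in for per-bucket processing, and the 64-bit truncation of SHA-1 is shown explicitly):

```python
import hashlib

def dedup_lines(documents):
    """Drop any line whose 64-bit hash has been seen before, across docs."""
    seen = set()
    out = []
    for doc in documents:
        kept = []
        for line in doc.splitlines():
            h = hashlib.sha1(line.encode("utf-8")).digest()[:8]  # 64 bits
            if h not in seen:
                seen.add(h)
                kept.append(line)
        out.append("\n".join(kept))
    return out

docs = ["Accept cookies\nReal content A", "Accept cookies\nReal content B"]
print(dedup_lines(docs))
```

The repeated boilerplate line survives only in the first document, mirroring how navigation menus and cookie warnings are stripped at scale.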

2. Modern Subword Tokenization Methods

Byte-level BPE (BBPE):

  • Treats language as a sequence of UTF-8 bytes
  • Universal for all languages
  • No need for language-specific preprocessing

WordPiece:

  • Similar to BPE, but merges the pair maximizing the likelihood score \(\frac{P(ab)}{P(a)\,P(b)}\) rather than raw frequency
  • Used in BERT

SentencePiece:

  • Uniform treatment of spaces and punctuation
  • Replaces the space character with ▁ (U+2581)
  • Then splits into characters and applies BPE
  • Language-agnostic approach
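The space-replacement step is essentially a one-liner (this sketch shows only the pre-processing, not the learned merges applied afterwards):

```python
def sp_preprocess(text):
    # Replace each space with the visible marker ▁ (U+2581), then split
    # into characters; BPE merges would then run on top of this.
    return list(text.replace(" ", "\u2581"))

print(sp_preprocess("Hello world"))
# ['H', 'e', 'l', 'l', 'o', '▁', 'w', 'o', 'r', 'l', 'd']
```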

3. Handling Code: Pre-tokenization

Use regular expressions to split sequences intelligently:

GPT-4 Pattern:

```
(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}|...
```

Components:

  • English contractions
  • Words (with an optional leading non-alphanumeric character)
  • Digits (1-3 at a time)
  • Non-alphanumeric characters
  • Line breaks
  • Trailing whitespace

Example: .append → single token (keeps method names intact)
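An ASCII-only approximation of this kind of pattern (the real GPT-4 pattern uses Unicode properties such as \p{L}, which need the third-party regex module; [A-Za-z] and [0-9] stand in here):

```python
import re

# Simplified, ASCII-only sketch of GPT-4-style pre-tokenization.
PAT = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"   # English contractions
    r"|[^\r\nA-Za-z0-9]?[A-Za-z]+"    # words, optional leading symbol
    r"|[0-9]{1,3}"                    # digits, at most 3 at a time
    r"|\s+(?!\S)|\s+"                 # whitespace runs
    r"|[^\sA-Za-z0-9]+"               # other symbol runs
)

def pretokenize(text):
    return PAT.findall(text)

print(pretokenize("tokens.append(12345)"))
```

Note how `.append` survives as a single piece (the optional leading symbol attaches the dot to the word), while the digit run is cut into groups of at most three.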

4. Handling Numbers

Challenge: Traditional tokenization treats each digit separately, making arithmetic difficult.

xVal Approach (Golkar et al., 2023):

  • Replace numbers with a special [NUM] token
  • Add a continuous numerical embedding alongside the discrete token embedding
  • Dual-head decoder:
    • Token head: Predicts the next token (logits)
    • Number head: Predicts the numerical value
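The input-side replacement can be sketched as follows (the regex and function are illustrative; only the [NUM] token name comes from the paper):

```python
import re

def xval_encode(text):
    """Replace each number with [NUM], keeping its value in a side channel.
    In xVal, these values modulate the [NUM] token's embedding."""
    values = []
    def repl(match):
        values.append(float(match.group()))
        return "[NUM]"
    out = re.sub(r"\d+(?:\.\d+)?", repl, text)
    return out, values

print(xval_encode("The sample weighs 2.5 kg at 300 K"))
```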


5. Multilingual Vocabulary

LLaMA 3 Evolution:

  • LLaMA 2: 32k tokens
  • LLaMA 3.1: 128k tokens
    • 100k from OpenAI’s tiktoken (reduced from 200k)
    • 28k allocated to multilingual support

Construction Methods:

  1. Joint BPE: Combine documents from multiple languages (176 in LLaMA 3), apply BPE on joint corpus
  2. Per-language BPE + Merge: Generate same number of tokens for each language, then merge
  3. ALP Balancing: Allocate capacity by balancing Average Log Probability across languages

Vocabulary Sharing Benefits:

Shared tokens enable cross-lingual transfer. Example - “television” in multiple languages:

```
English: television      Spanish: televisión
French:  télévision      Italian: televisione
Dutch:   televisie       Portuguese: televisão
Swedish: television      Finnish: televisio
```

Similar spellings → shared subword tokens → multilingual understanding

Vocabulary Sharing and Multilingual Performance

Experimental Setup

Research by Yuan et al. (ACL 2024) investigated:

  1. Construct a 10k instruction fine-tuning dataset using bilingual parallel data
  2. Fine-tune only the LLaMA-7B embedding layer
  3. Evaluate translation performance:
    • Bilingual: The supervised language pair
    • Multilingual: All other directions

Four Quadrants of Performance

| Quadrant   | Bilingual Perf. | Multilingual Perf. | Example Languages |
|------------|-----------------|--------------------|-------------------|
| Reciprocal | ✅ Improves     | ✅ Improves        | cs, da, fr, de    |
| Altruistic | ❌ Degrades     | ✅ Improves        | ar, vi, zh, ko    |
| Stagnant   | ❌ Degrades     | ❌ Degrades        | km, lo, gu, te    |
| Selfish    | ✅ Improves     | ❌ Degrades        | hi                |

Key Insight: Fine-tuning on bilingual data doesn’t always benefit the supervised direction!


Language Families Matter

  • Indo-European languages: Mostly in Reciprocal quadrant
  • Non-Indo-European languages: More varied distribution, including Altruistic and Stagnant quadrants

Stagnant Quadrant: Over-tokenization Problem

Root Cause: Byte-level BPE encodes each character of some scripts as several bytes, so token sequences grow far beyond the character count.

Example:

```
饕 [tāo] (gluttonous) → three tokens: [227, 234, 260]
```

Solution - Shortening:

  • Remove common byte prefixes (e.g., 227 for many Chinese characters)
  • Improves both bilingual and multilingual performance
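The inflation is easy to verify; note that the IDs in the example above are BBPE token IDs rather than raw bytes, so this sketch shows the underlying UTF-8 bytes instead:

```python
# One CJK character is a single character but three bytes under UTF-8,
# so byte-level vocabularies over-tokenize such scripts.
for ch in "饕金":
    print(ch, len(ch), "char ->", list(ch.encode("utf-8")))
# Both characters share the lead byte 233; shared lead bytes like this
# are the common prefixes that the shortening trick removes.
```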

Experimental Results:

| Direction | en→km | en→lo | en→gu | en→te |
|-----------|-------|-------|-------|-------|
| Full FT   | 10.1  | 7.0   | 10.0  | 17.0  |
| Extend    | 11.9  | 6.9   | 0.5   | 7.2   |
| Shorten ✓ | 12.7  | 9.4   | 11.3  | 21.6  |

Shorten is best across all four directions.

Recommendation: Shortening is more effective than extending vocabulary for over-tokenized languages.

Tokenizer-free Models: Byte Latent Transformer (BLT)

Motivation: Eliminate tokenizer entirely, work directly with bytes.

Architecture (Pagnoni et al., 2024):

  1. Byte-Level Small Transformer: Encodes byte stream
  2. Entropy-Based Grouping: Groups bytes into patches via cross-attention
  3. Large Latent Transformer: Predicts next patch (main computation)
  4. Unpatching: Converts patches back to byte sequence via cross-attention
  5. Small Byte-Level Transformer: Makes next-byte prediction

Key Benefits:

  • ✅ No tokenizer training needed
  • ✅ Universal across all languages
  • ✅ Handles any byte stream (text, code, data)
  • ✅ Dynamic granularity (entropy-based patching)

Trade-offs:

  • More complex architecture
  • Two-level transformer hierarchy
  • Cross-attention overhead
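A toy version of entropy-based patching, with bigram byte statistics standing in for BLT’s learned byte-level transformer (the threshold and corpus are illustrative):

```python
import math
from collections import Counter, defaultdict

def next_byte_entropy(counts):
    """Shannon entropy (bits) of a next-byte count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def patch(data, train, threshold=1.0):
    """Start a new patch wherever the next byte is hard to predict."""
    bigrams = defaultdict(Counter)
    for a, b in zip(train, train[1:]):
        bigrams[a][b] += 1
    patches, current = [], [data[0]]
    for a, b in zip(data, data[1:]):
        # Unseen context counts as maximally uncertain (8 bits).
        h = next_byte_entropy(bigrams[a]) if bigrams[a] else 8.0
        if h > threshold:          # uncertain: cut a patch boundary here
            patches.append(bytes(current))
            current = []
        current.append(b)
    patches.append(bytes(current))
    return patches

train = b"the cat sat on the mat " * 20
print(patch(b"the cat", train))
```

Predictable byte runs end up grouped into longer patches, while uncertain positions start new ones; BLT makes the same decision with a small transformer’s next-byte entropy instead of bigram counts.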


Summary

Key Takeaways

  1. Subword Tokenization (BPE):
    • Iteratively merges most frequent token pairs
    • Balances vocabulary size and OOV rate
    • Widely used (GPT, LLaMA, etc.)
  2. Information-Theoretic Vocabulary (VOLT):
    • Uses normalized entropy to measure vocabulary quality
    • MUV (Marginal Utility of Vocabulary) predicts performance
    • Solves entropy-constrained optimal transport problem
    • Avoids expensive grid search
  3. Practical Considerations:
    • Pre-tokenization with regex for code
    • Special handling for numbers (xVal approach)
    • Corpus deduplication at multiple levels
    • Multilingual vocabulary allocation
  4. Multilingual Challenges:
    • Vocabulary sharing enables cross-lingual transfer
    • Over-tokenization hurts some languages (stagnant quadrant)
    • Shortening byte sequences more effective than extending vocabulary
    • Language families affect sharing benefits
  5. Future Direction:
    • Tokenizer-free models (BLT) show promise
    • Trade-off between simplicity and efficiency
    • Dynamic granularity via entropy-based patching

References

  1. Rico Sennrich et al. “Neural Machine Translation of Rare Words with Subword Units.” ACL 2016.
  2. Jingjing Xu et al. “Vocabulary Learning via Optimal Transport for Neural Machine Translation.” ACL 2021.
  3. Kudo and Richardson. “SentencePiece: A simple and language independent approach to subword tokenization.” EMNLP 2018.
  4. Dagan et al. “Getting the most out of your tokenizer for pre-training and domain adaptation.” ICML 2024.
  5. Golkar et al. “xVal: A Continuous Numerical Tokenization for Scientific Language Models.” 2023.
  6. Zheng et al. “Allocating large vocabulary capacity for cross-lingual language model pre-training.” EMNLP 2021.
  7. Liang et al. “XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models.” EMNLP 2023.
  8. Yuan et al. “How Vocabulary Sharing Facilitates Multilingualism in LLaMA?” ACL 2024.
  9. Pagnoni et al. “Byte Latent Transformer: Patches Scale Better Than Tokens.” 2024.

Demo Tools

Interactive tokenizer demos:

  • LLaMA Tokenizer
  • GPT Tokenizer


This post is based on lecture materials from CMU 11-868 LLM Systems by Lei Li.