CS336-Lec1 Tokenization
The essence of a language model is to model a sequence of tokens.
\[ p(x_1, x_2, \dots, x_T) \] where each \(x_t\) is an integer token id. We also need an abstract interface for the bidirectional mapping between strings and token ids: the Tokenizer.
encode(str) -> List[int]
decode(List[int]) -> str
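This interface can be sketched as a small abstract base class (a minimal sketch; the class and method names here are illustrative, not from the lecture):

```python
from abc import ABC, abstractmethod

class Tokenizer(ABC):
    """Abstract bidirectional mapping between strings and token ids."""

    @abstractmethod
    def encode(self, string: str) -> list[int]:
        """Map a string to a list of integer token ids."""

    @abstractmethod
    def decode(self, indices: list[int]) -> str:
        """Map a list of token ids back to a string."""
```

Every tokenizer below (character, byte, word, BPE) is one concrete implementation of this pair of methods.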
Tradeoffs:
- Large vocabulary: sequences are short, but rare tokens may never be learned, and the embedding matrices are huge.
- Small vocabulary: good generalization and fewer parameters, but longer sequences make Transformer attention expensive (\(O(T^2)\)).
Q&A:
Why does a large vocabulary not mean "easier to learn," but instead cause rare tokens to go unlearned?
If the vocabulary contains a very rare whole word as a single token, such as "supercalifragilisticexpialidocious", which may appear only once or a few times in training, its embedding/output weights are essentially untrained. With a smaller vocabulary, the long word is split into more common subword/character pieces, each of which receives enough training signal to be learned well.
Why does a large vocabulary lead to an overall increase in embedding size?
This is mainly because the output logits (and input embedding) matrix has shape \(\text{vocab\_size} \times \text{hidden\_size}\). As the vocabulary grows, the parameter count in these layers grows proportionally, raising memory and compute costs.
Compression Ratio
```python
num_bytes = len(string.encode("utf-8"))
```
\[ \text{compression\_ratio} = \frac{\text{num\_bytes}}{\text{num\_tokens}} \] A larger value indicates that each token represents more bytes, resulting in a shorter sequence and a higher degree of information compression.
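The definition above translates directly into a small helper (a minimal sketch; the function name and the example token count are illustrative):

```python
def compression_ratio(string: str, num_tokens: int) -> float:
    """Bytes of UTF-8 text represented per token."""
    num_bytes = len(string.encode("utf-8"))
    return num_bytes / num_tokens

# "hello world" is 11 UTF-8 bytes; if a tokenizer emits 3 tokens,
# the ratio is 11 / 3, i.e. about 3.67 bytes per token.
```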
Character Tokenizer
The core idea is to break the string into Unicode characters, converting each character to an integer with ord(); decode uses chr() to piece it back together.
```python
# encode
indices = list(map(ord, string))
```
The advantage is that it is simple and reversible, while the disadvantage is that the vocabulary is very large (approximately 150K+ Unicode characters), with many rare characters, leading to wasted vocabulary capacity.
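The full round trip with ord()/chr() can be sketched as follows (a minimal sketch):

```python
def encode(string: str) -> list[int]:
    # Each Unicode character maps to its integer code point.
    return list(map(ord, string))

def decode(indices: list[int]) -> str:
    # chr() inverts ord(), so the mapping is lossless.
    return "".join(map(chr, indices))

# encode("héllo") -> [104, 233, 108, 108, 111]
```

Note that one character is always one token here, regardless of how many bytes it occupies in UTF-8.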
Byte Tokenizer
The core idea is to convert the string to UTF-8 bytes, where each byte is in the range of 0-255. The vocabulary is fixed at 256.
```python
string_bytes = string.encode("utf-8")
```
The compression ratio is always 1, since each byte becomes exactly one token.
The advantage is that the vocabulary is small and completely reversible, but the sequences are too long, which is disastrous for attention mechanisms.
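The byte-level round trip is equally short (a minimal sketch):

```python
def encode(string: str) -> list[int]:
    # UTF-8 bytes; every token id is in [0, 255].
    return list(string.encode("utf-8"))

def decode(indices: list[int]) -> str:
    # Reassemble the bytes and decode them as UTF-8.
    return bytes(indices).decode("utf-8")

# A non-ASCII character expands to several tokens:
# encode("é") -> [195, 169]  (two bytes, two tokens)
```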
Word Tokenizer
The core idea is to use regular expressions to split the text into segments, assigning an id to each segment.
The advantage is that sequences are shorter; the disadvantages are that the vocabulary size is not fixed and can explode, and new words require an UNK token (which hurts perplexity and generalization).
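A word tokenizer might be sketched like this (an assumption-laden toy: the regex below is a deliberate simplification — real pre-tokenizers such as GPT-2's use a much more elaborate pattern — and the vocab-building scheme is illustrative):

```python
import re

# Simplified split pattern: runs of word characters, or single symbols.
PATTERN = re.compile(r"\w+|[^\w\s]")

def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an id to each distinct segment; id 0 is reserved for UNK."""
    vocab: dict[str, int] = {"<UNK>": 0}
    for text in corpus:
        for segment in PATTERN.findall(text):
            vocab.setdefault(segment, len(vocab))
    return vocab

def encode(text: str, vocab: dict[str, int]) -> list[int]:
    # Segments never seen at vocab-building time fall back to UNK.
    return [vocab.get(seg, vocab["<UNK>"]) for seg in PATTERN.findall(text)]
```

The UNK fallback is exactly the weakness noted above: any word outside the training corpus loses its identity at encode time, so the mapping is not reversible.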
BPE Tokenizer
Find a balance between the “small vocabulary of bytes” and the “short sequences of words”:
- Common segments are merged into a single token (shortening the sequence).
- Rare segments can still degrade into multiple bytes.
Execution steps: starting from bytes, repeat num_merges times:
- Count the occurrences of all adjacent pairs in the current sequence.
- Find the most frequent pair.
- Merge it into a new token (new id = 256 + i).
- Update vocab: vocab[new] = vocab[a] + vocab[b].
- Replace all occurrences of that pair in the sequence.
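The training loop above can be sketched as a toy, unoptimized implementation (names are illustrative; real trainers count pairs over pre-tokenized words, not one long sequence):

```python
from collections import Counter

def train_bpe(string: str, num_merges: int):
    indices = list(string.encode("utf-8"))       # start from raw bytes
    vocab = {i: bytes([i]) for i in range(256)}  # byte vocabulary
    merges: dict[tuple[int, int], int] = {}
    for i in range(num_merges):
        # Count the occurrences of all adjacent pairs.
        pairs = Counter(zip(indices, indices[1:]))
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]        # most frequent pair
        new_id = 256 + i                         # new token id
        merges[(a, b)] = new_id
        vocab[new_id] = vocab[a] + vocab[b]
        # Replace all occurrences of the pair in the sequence.
        merged, j = [], 0
        while j < len(indices):
            if j + 1 < len(indices) and (indices[j], indices[j + 1]) == (a, b):
                merged.append(new_id)
                j += 2
            else:
                merged.append(indices[j])
                j += 1
        indices = merged
    return merges, vocab
```

Each iteration shortens the sequence wherever the chosen pair occurs, and the learned merges are stored in order for use at encode time.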
Encode/Decode
- encode: Start from bytes and merge pairs in the order of merges.
- decode: Map each token id back to its bytes, concatenate them, and decode the result as UTF-8.
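Encode and decode follow directly (a minimal sketch; it assumes a `merges` dict that preserves the learned order, as Python dicts do, and a `vocab` mapping ids to bytes):

```python
def encode(string: str, merges: dict[tuple[int, int], int]) -> list[int]:
    indices = list(string.encode("utf-8"))
    # Apply merges in the order they were learned.
    for (a, b), new_id in merges.items():
        merged, j = [], 0
        while j < len(indices):
            if j + 1 < len(indices) and (indices[j], indices[j + 1]) == (a, b):
                merged.append(new_id)
                j += 2
            else:
                merged.append(indices[j])
                j += 1
        indices = merged
    return indices

def decode(indices: list[int], vocab: dict[int, bytes]) -> str:
    # Map each id back to bytes, concatenate, and decode as UTF-8.
    return b"".join(vocab[i] for i in indices).decode("utf-8")
```

Scanning the whole sequence once per learned merge is what makes this naive version slow; production tokenizers use cleverer data structures for the same logic.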
The advantage is that the vocabulary is controllable (256 + num_merges) and the sequences are shorter, while the disadvantage is that naive encoding can be very slow.




