Lec1 Introduction and Fundamentals

Bag of Words

A simplified representation used in NLP where a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity (frequency).

Pros:

  • Simplicity: Very easy to understand and implement.
  • Efficiency: Computationally inexpensive for basic tasks like document classification or spam detection.
  • Interpretability: You can easily see which words (features) are most important for a prediction.

Cons:

  • Loss of Context: It ignores word order (e.g., “Not good” and “Good not” look identical).
  • Sparsity & High Dimensionality: As with one-hot encoding, the vector size grows with the vocabulary, leading to the “curse of dimensionality.”
  • No Semantic Similarity: It cannot capture the fact that “boat” and “ship” are related.
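A bag-of-words representation can be sketched with the standard library’s `Counter` (a minimal illustration, not a full featurizer):

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count word frequencies."""
    return Counter(text.lower().split())

# Word order is discarded: both sentences map to the same bag.
a = bag_of_words("Not good")
b = bag_of_words("Good not")
print(a == b)  # True
```

This makes the “loss of context” con concrete: any permutation of the same words yields an identical representation.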

One-hot Encoding

Definition: Converts discrete values into a binary vector in which exactly one position is 1 and all others are 0.

Pros:

  • Simple and straightforward to implement
  • Every token gets its own unique representation

Cons:

  • The vector dimension grows with the number of tokens in the vocabulary
  • Sparsity: requires more storage
  • Cannot represent the relationship between two different words
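A minimal one-hot sketch (the toy vocabulary here is illustrative):

```python
def one_hot(token, vocab):
    """Binary vector with a single 1 at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec

vocab = ["boat", "ship", "car"]
print(one_hot("ship", vocab))  # [0, 1, 0]

# The dot product of any two distinct one-hot vectors is 0,
# so "boat" and "ship" look completely unrelated.
```

Note how the vector length equals the vocabulary size, which is exactly the sparsity/dimensionality con above.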

Lec2 Learned Representation

Subword Model (e.g., BPE, FastText)

Definition: Splits words into smaller units (subwords or character n-grams). Frequent words remain whole, while rare words are broken down (e.g., “unhappiness” \(\rightarrow\) “un”, “happi”, “ness”).

Pros:

  • Limited vocabulary size; words sharing a root share subwords
  • Handles OOV (Out-of-Vocabulary): Can generate embeddings for words never seen during training by summing their subwords.
  • Morphological Richness: Excellent for languages with many word forms (like Turkish or German).

Cons:

  • Computational Overhead: Training and inference are slower because the model has to process more tokens per word.
  • Over-tokenization: Some words may be broken into too many meaningless fragments, losing “whole-word” semantic integrity.
  • Language Dependency: The efficiency of subword splitting varies significantly across different languages (e.g., English vs. Chinese).
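The splitting step can be sketched with a greedy longest-match segmenter over a known subword vocabulary (a toy stand-in for BPE/WordPiece; real tokenizers learn their merges from data rather than using a hand-written vocab):

```python
def subword_tokenize(word, vocab):
    """Greedily match the longest known subword at each position,
    falling back to single characters for unknown spans."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown span: emit one character
            i += 1
    return pieces

vocab = {"un", "happi", "ness"}
print(subword_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
```

The character fallback is what gives subword models their OOV coverage: every input string can be segmented, even if no learned subword matches.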

Continuous Bag of Words

Definition: A neural network-based model (Word2vec) that predicts a target center word based on its surrounding context words.

Pros:

  • Fast Training: Much faster than Skip-gram because it performs only one prediction per context window.
  • Better for Frequent Words: Provides a more stable representation for words that appear often in the data.
  • Low Dimensionality: Produces dense, fixed-size vectors (e.g., 300 dimensions) instead of sparse ones.

Cons:

  • Position Blurring: It averages context word vectors, meaning it ignores the specific order of words in the window.
  • Underperforms on Rare Words: Since it averages contexts, rare words tend to get “smoothed out” or drowned by frequent neighbors.
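The forward pass behind both cons can be sketched in a few lines: average the context embeddings, score every vocabulary word, and softmax. The vocabulary, dimension, and random initialization below are illustrative, not trained values.

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 4
# Input (context) and output (center-word) embedding tables.
emb = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
out = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def cbow_probs(context):
    """P(center word | context): average context vectors, dot with
    every output vector, softmax over the vocabulary."""
    h = [sum(emb[w][k] for w in context) / len(context) for k in range(dim)]
    scores = {w: sum(out[w][k] * h[k] for k in range(dim)) for w in vocab}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

# Averaging erases word order: swapping the context changes nothing.
print(cbow_probs(["the", "sat"]) == cbow_probs(["sat", "the"]))  # True
```

The final line demonstrates the “position blurring” con directly: the averaged hidden vector is identical for any ordering of the same context words.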

Lec3 Language Modeling

A language model (LM) defines a probability distribution over all possible sequences of tokens: \(P(X) = P(x_1, x_2, \dots, x_T)\)

Applications: Score sequences (fluency), text generation, conditional generation (MT, summarization), QA, classification, grammar correction.

Autoregressive Language Models

Factorize joint probability into next-token predictions: \[ P(X) = \prod_{t=1}^{T} P(x_t \mid x_{<t}) \]

This reduces sequence modeling to next-token modeling, with output space being the vocabulary.

Bigram Models

Predict next token using only the previous token: \(P(X) \approx \prod_{t=1}^{T} P(x_t \mid x_{t-1})\)

Training (MLE): \[ P(x_t \mid x_{t-1}) = \frac{\text{count}(x_{t-1}, x_t)}{\sum_{x'} \text{count}(x_{t-1}, x')} \]

  • Pros: Simple, fast, introduces core LM concepts (MLE, log-space, autoregressive generation)
  • Cons: Limited context (1 token), no long-range dependencies, no word similarity
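The MLE counting formula above translates almost line-for-line into code (a minimal sketch with a two-sentence toy corpus):

```python
from collections import defaultdict

def train_bigram(corpus):
    """MLE bigram model: P(next | prev) = count(prev, next) / count(prev, *)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        tokens = ["[S]"] + sent.split() + ["[EOS]"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {p: {n: c / sum(d.values()) for n, c in d.items()}
            for p, d in counts.items()}

model = train_bigram(["the cat sat", "the cat ran"])
print(model["cat"])  # {'sat': 0.5, 'ran': 0.5}
```

Each row of the resulting table is a normalized next-token distribution, which is exactly what autoregressive generation consumes.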

Training Objective

Goal: \(\max_\theta \sum_{x \in D} \log p_\theta(x)\) — equivalent to \(\min_\theta D_{KL}(p^* \parallel p_\theta)\)

Log space: Prevents underflow, turns multiplication into addition: \(\log P(X) = \sum_{t} \log P(x_t \mid x_{<t})\)

Text Generation

Iterative sampling: Start with [S] → sample \(x_t \sim P(x_t \mid x_{<t})\) → append → stop at [EOS]
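The sampling loop works for any model exposing a next-token distribution; the hand-written bigram table below is a toy assumption for illustration:

```python
import random

# Toy next-token distributions keyed by the previous token.
model = {
    "[S]": {"the": 1.0},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "mat": {"[EOS]": 1.0},
    "sat": {"[EOS]": 1.0},
}

def generate(model, max_len=20):
    """Start at [S], sample the next token from P(x_t | x_{t-1}),
    append, and stop at [EOS] (or at max_len as a safety cap)."""
    tokens = ["[S]"]
    while tokens[-1] != "[EOS]" and len(tokens) < max_len:
        dist = model[tokens[-1]]
        nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        tokens.append(nxt)
    return tokens[1:-1] if tokens[-1] == "[EOS]" else tokens[1:]

print(generate(model))  # e.g. ['the', 'cat', 'sat'] or ['the', 'mat']
```

The `max_len` cap matters in practice: an untrained or poorly smoothed model may never emit [EOS].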

Evaluation Metrics

Negative Log-Likelihood: \(\text{NLL} = - \sum_{i,t} \log P(x_t^{(i)} \mid x_{<t}^{(i)})\) (lower is better)

Perplexity: \(\text{PPL} = \exp\left(-\frac{1}{T}\sum_{i,t} \log P(x_t^{(i)} \mid x_{<t}^{(i)})\right)\) — measures model “confusion”, lower is better
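Perplexity is just the exponentiated mean NLL; a quick worked example (with made-up per-token probabilities):

```python
import math

# A model assigning probability 0.25 to every observed token is exactly
# as "confused" as a uniform choice over 4 options, so PPL should be 4.
token_probs = [0.25, 0.25, 0.25, 0.25]
nll = -sum(math.log(p) for p in token_probs)
ppl = math.exp(nll / len(token_probs))
print(ppl)  # ≈ 4.0
```

This also shows why lower is better: a model putting probability 1.0 on every observed token would reach the minimum PPL of 1.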

N-gram Models

Condition on previous \(n-1\) tokens: \(P(X) \approx \prod_{t=1}^{T} P(x_t \mid x_{t-n+1:t-1})\)

Smoothing (add-one): \(P(x_t \mid c) = \frac{1 + \text{count}(c, x_t)}{|V| + \sum_{x'} \text{count}(c, x')}\)

  • Pros: Longer context, fast, interpretable, strong memorization
  • Cons: No parameter sharing, can’t handle long-distance dependencies, vocab explosion with large \(n\)
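The add-one formula can be checked with a tiny worked example (the counts and vocabulary size below are illustrative):

```python
from collections import Counter

def add_one_prob(context, token, counts, vocab_size):
    """Laplace-smoothed P(token | context): unseen pairs get a small
    nonzero probability instead of 0."""
    total = sum(counts[context].values())
    return (1 + counts[context][token]) / (vocab_size + total)

counts = {"the": Counter({"cat": 2, "mat": 1})}  # count(context, token)
V = 5                                            # |V|

print(add_one_prob("the", "cat", counts, V))  # (1+2)/(5+3) = 0.375
print(add_one_prob("the", "dog", counts, V))  # (1+0)/(5+3) = 0.125, not 0
```

Smoothing trades a little probability mass away from seen n-grams so that unseen ones never zero out the whole product \(P(X)\).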

Feedforward Neural LMs (Bengio et al., 2003)

Replace count tables with neural networks: word embeddings → concatenate → hidden layers → softmax

\[ P(x_t \mid x_{<t}) = \text{softmax}(W h), \quad L_t = -\log P(x_t \mid x_{<t}) \]

  • Pros: Parameter sharing across similar words/contexts, dense representations, better generalization
  • Cons: Fixed context window (addressed by RNNs/Transformers)

Practical Training

  • Dataset splits: Train (fit params) / Validation (tune hyperparams) / Test (final eval, used once)
  • Overfitting: Val loss increases while train loss decreases → use regularization, early stopping
  • Weight init: Xavier: \(W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}}\right)\)
  • Learning rate: Large early (exploration) → small later (convergence); warmup stabilizes early gradients
  • Batching: Pad variable-length sequences, mask padding in forward pass and loss
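The Xavier initialization above can be sketched directly from its formula (pure-Python for clarity; a real setup would use a tensor library’s built-in initializer):

```python
import math
import random

def xavier_uniform(n_in, n_out):
    """Sample an (n_in x n_out) weight matrix from U(-a, a) with
    a = sqrt(6 / (n_in + n_out)), keeping activation variance stable
    across layers."""
    a = math.sqrt(6 / (n_in + n_out))
    return [[random.uniform(-a, a) for _ in range(n_out)] for _ in range(n_in)]

W = xavier_uniform(300, 128)
bound = math.sqrt(6 / (300 + 128))
print(all(-bound <= w <= bound for row in W for w in row))  # True
```

The bound shrinks as the layer widths grow, so wider layers start with proportionally smaller weights.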

Key Takeaways

  • LM = probabilistic sequence modeling with autoregressive factorization
  • Progression: Bigram → N-gram → Neural LMs (same MLE objective throughout)
  • Careful evaluation and experimental setup are essential