11-711 Advanced NLP: Fundamentals
Lec1 Introduction and Fundamentals
Bag of Words
A simplified representation used in NLP where a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity (frequency).
Pros:
- Simplicity: Very easy to understand and implement.
- Efficiency: Computationally inexpensive for basic tasks like document classification or spam detection.
- Interpretability: You can easily see which words (features) are most important for a prediction.
Cons:
- Loss of Context: It ignores word order (e.g., “Not good” and “Good not” look identical).
- Sparsity & High Dimensionality: Similar to One-hot, the vector size grows with the vocabulary, leading to “curse of dimensionality.”
- No Semantic Similarity: It cannot capture the fact that “boat” and “ship” are related.
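A minimal bag-of-words sketch (assuming naive whitespace tokenization and lowercasing; real pipelines would use a proper tokenizer):

```python
from collections import Counter

def bag_of_words(text):
    # Lowercase, split on whitespace, and count multiplicities.
    return Counter(text.lower().split())

# Word order is discarded, so "Not good" and "Good not" get the same bag,
# while multiplicity is preserved ("good good" counts "good" twice).
```

Because the representation is just a multiset of counts, the two cons above fall out directly: any reordering of the same words is indistinguishable, and each distinct word adds a dimension.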
One-hot Encoding
Definition: Converts discrete values into a binary vector in which exactly one position is 1 and all others are 0.
Pros:
- Simple and straightforward
- Every token gets its own unambiguous representation
Cons:
- The vector dimension equals the number of tokens in the vocabulary
- Sparsity: needs more storage
- Cannot represent the relationship between two different words
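A one-hot sketch over a toy vocabulary (the three-word vocab is illustrative only):

```python
def one_hot(token, vocab):
    # vocab: list of known tokens; vector length equals len(vocab).
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec

vocab = ["boat", "ship", "car"]
# The dot product of any two distinct one-hot vectors is 0, so the
# representation cannot express that "boat" and "ship" are related.
```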
Lec2 Learned Representation
Subword Model (e.g., BPE, FastText)
Definition: Splits words into smaller units (subwords or character n-grams). Frequent words remain whole, while rare words are broken down (e.g., “unhappiness” \(\rightarrow\) “un”, “happi”, “ness”).
Pros:
- Limited vocabulary size: words sharing a root reuse the same subword units
- Handles OOV (Out-of-Vocabulary): Can generate embeddings for words never seen during training by summing its subwords.
- Morphological Richness: Excellent for languages with many word forms (like Turkish or German).
Cons:
- Computational Overhead: Training and inference are slower because the model has to process more tokens per word.
- Over-tokenization: Some words may be broken into too many meaningless fragments, losing “whole-word” semantic integrity.
- Language Dependency: The efficiency of subword splitting varies significantly across different languages (e.g., English vs. Chinese).
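A FastText-style character n-gram extractor as a sketch (the boundary markers `<` and `>` and the 3–5 n-gram range follow the usual FastText convention; an OOV word's embedding would be the sum of its n-gram embeddings):

```python
def char_ngrams(word, n_min=3, n_max=5):
    # Pad with boundary markers, then slide windows of each size.
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams
```

For "unhappiness", the extracted grams include prefix pieces like `<un` and suffix pieces like `ess>`, which is what lets morphologically related or unseen words share representation.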
Continuous Bag of Words
Definition: A neural network-based model (Word2vec) that predicts a target center word based on its surrounding context words.
Pros:
- Fast Training: Much faster than Skip-gram because it performs only one prediction per context window.
- Better for Frequent Words: Provides a more stable representation for words that appear often in the data.
- Low Dimensionality: Produces dense, fixed-size vectors (e.g., 300 dimensions) instead of sparse ones.
Cons:
- Position Blurring: It averages context word vectors, meaning it ignores the specific order of words in the window.
- Underperforms on Rare Words: Since it averages contexts, rare words tend to get “smoothed out” or drowned by frequent neighbors.
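The "position blurring" con can be seen directly in a minimal CBOW forward pass (pure-Python sketch with toy dimensions; real Word2vec trains the embeddings and uses negative sampling or hierarchical softmax instead of a full softmax):

```python
import math

def cbow_scores(context_ids, embed, out_weights):
    # embed: one input vector per vocab id.
    # Average the context embeddings -- this is where word order is lost.
    dim = len(embed[0])
    h = [sum(embed[i][d] for i in context_ids) / len(context_ids)
         for d in range(dim)]
    # Score every vocabulary word, then softmax into probabilities.
    logits = [sum(w[d] * h[d] for d in range(dim)) for w in out_weights]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the context is averaged, permuting the context words yields exactly the same prediction.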
Lec3 Language Modeling
A language model (LM) defines a probability distribution over all possible sequences of tokens: \(P(X) = P(x_1, x_2, \dots, x_T)\)
Applications: Score sequences (fluency), text generation, conditional generation (MT, summarization), QA, classification, grammar correction.
Autoregressive Language Models
Factorize joint probability into next-token predictions: \[ P(X) = \prod_{t=1}^{T} P(x_t \mid x_{<t}) \]
This reduces sequence modeling to next-token modeling, with output space being the vocabulary.
Bigram Models
Predict next token using only the previous token: \(P(X) \approx \prod_{t=1}^{T} P(x_t \mid x_{t-1})\)
Training (MLE): \[ P(x_t \mid x_{t-1}) = \frac{\text{count}(x_{t-1}, x_t)}{\sum_{x'} \text{count}(x_{t-1}, x')} \]
- Pros: Simple, fast, introduces core LM concepts (MLE, log-space, autoregressive generation)
- Cons: Limited context (1 token), no long-range dependencies, no word similarity
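The MLE formula above amounts to normalized bigram counts; a sketch (corpus format and `[S]`/`[EOS]` markers follow the conventions used in these notes):

```python
from collections import defaultdict

def train_bigram(corpus):
    # corpus: list of token lists. Returns P(next | prev) via MLE,
    # i.e. count(prev, next) / sum of counts following prev.
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        for prev, nxt in zip(sent, sent[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        probs[prev] = {w: c / total for w, c in nexts.items()}
    return probs
```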
Training Objective
Goal: \(\max_\theta \sum_{x \in D} \log p_\theta(x)\) — equivalent to \(\min_\theta D_{KL}(p^* \parallel p_\theta)\)
Log space: Prevents underflow, turns multiplication into addition: \(\log P(X) = \sum_{t} \log P(x_t \mid x_{<t})\)
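The underflow problem is easy to demonstrate: a product of many small per-token probabilities collapses to 0.0 in floating point, while the sum of logs stays finite (the 0.01-per-token setup is an illustrative assumption):

```python
import math

probs = [0.01] * 1000          # 1000 tokens, each with probability 0.01
product = 1.0
for p in probs:
    product *= p               # underflows to exactly 0.0
log_sum = sum(math.log(p) for p in probs)  # finite: 1000 * log(0.01)
```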
Text Generation
Iterative sampling: start with [S] → sample \(x_t \sim P(x_t \mid x_{<t})\) → append → stop at [EOS]
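The sampling loop can be sketched for a bigram model (the bigram table format and the `max_len` cutoff are assumptions; `max_len` guards against sequences that never emit [EOS]):

```python
import random

def generate(probs, max_len=20, seed=0):
    # probs: dict mapping previous token -> {next token: probability}.
    rng = random.Random(seed)
    tokens = ["[S]"]
    while tokens[-1] != "[EOS]" and len(tokens) < max_len:
        dist = probs[tokens[-1]]
        nxt = rng.choices(list(dist), weights=list(dist.values()))[0]
        tokens.append(nxt)
    return tokens
```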
Evaluation Metrics
Negative Log-Likelihood: \(\text{NLL} = - \sum_{i,t} \log P(x_t^{(i)} \mid x_{<t}^{(i)})\) (lower is better)
Perplexity: \(\text{PPL} = \exp\left(-\frac{1}{T}\sum_{i,t} \log P(x_t^{(i)} \mid x_{<t}^{(i)})\right)\) — measures model “confusion”, lower is better
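Perplexity is the exponentiated average NLL; a sketch (input is a flat list of per-token log-probabilities over the eval set):

```python
import math

def perplexity(token_log_probs):
    # token_log_probs: log P(x_t | x_<t) for every token in the eval set.
    nll = -sum(token_log_probs)
    return math.exp(nll / len(token_log_probs))
```

Sanity check: a model that puts uniform probability 1/4 on each next token has perplexity 4, matching the intuition that it is "confused" among 4 options.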
N-gram Models
Condition on previous \(n-1\) tokens: \(P(X) \approx \prod_{t=1}^{T} P(x_t \mid x_{t-n+1:t-1})\)
Smoothing (add-one): \(P(x_t \mid c) = \frac{1 + \text{count}(c, x_t)}{|V| + \sum_{x'} \text{count}(c, x')}\)
- Pros: Longer context, fast, interpretable, strong memorization
- Cons: No parameter sharing, can’t handle long-distance dependencies, vocab explosion with large \(n\)
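The add-one smoothing formula translates directly to code (counts are for a single fixed context \(c\)):

```python
def addone_prob(context_counts, token, vocab_size):
    # context_counts: dict of next-token counts observed after context c.
    # Adds a pseudo-count of 1 to every vocab word, so unseen tokens
    # get nonzero probability and the distribution still sums to 1.
    total = sum(context_counts.values())
    return (1 + context_counts.get(token, 0)) / (vocab_size + total)
```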
Feedforward Neural LMs (Bengio et al., 2003)
Replace count tables with neural networks: word embeddings → concatenate → hidden layers → softmax
\[ P(x_t \mid x_{<t}) = \text{softmax}(W h), \quad L_t = -\log P(x_t \mid x_{<t}) \]
- Pros: Parameter sharing across similar words/contexts, dense representations, better generalization
- Cons: Fixed context window (addressed by RNNs/Transformers)
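A heavily simplified forward pass for one prediction step (pure Python, hidden layers and the nonlinearity omitted, one output matrix `W_out`; Bengio et al.'s model adds a tanh hidden layer between the concatenation and the softmax):

```python
import math

def ffnn_lm_step(context_ids, embed, W_out):
    # Look up and concatenate the context embeddings (fixed window).
    h = [v for i in context_ids for v in embed[i]]
    # Linear layer + softmax over the vocabulary: P(x_t | x_<t) = softmax(W h).
    logits = [sum(w[j] * h[j] for j in range(len(h))) for w in W_out]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def loss(probs, target_id):
    # Per-token cross-entropy: L_t = -log P(x_t | x_<t).
    return -math.log(probs[target_id])
```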
Practical Training
- Dataset splits: Train (fit params) / Validation (tune hyperparams) / Test (final eval, used once)
- Overfitting: Val loss increases while train loss decreases → use regularization, early stopping
- Weight init: Xavier: \(W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}}\right)\)
- Learning rate: Large early (exploration) → small later (convergence); warmup stabilizes early gradients
- Batching: Pad variable-length sequences, mask padding in forward pass and loss
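The batching point can be sketched as padding plus a mask that zeroes out padded positions in the loss (pad id 0 and the 0/1 mask convention are assumptions; deep-learning frameworks do the same with tensors):

```python
def pad_batch(seqs, pad_id=0):
    # Pad every sequence to the batch max length;
    # mask marks real tokens (1) vs padding (0).
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask

def masked_mean_loss(losses, mask):
    # Average per-token losses, ignoring padded positions entirely.
    total = sum(l * m for row_l, row_m in zip(losses, mask)
                for l, m in zip(row_l, row_m))
    n = sum(m for row in mask for m in row)
    return total / n
```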
Key Takeaways
- LM = probabilistic sequence modeling with autoregressive factorization
- Progression: Bigram → N-gram → Neural LMs (same MLE objective throughout)
- Careful evaluation and experimental setup are essential



