Lec1 Introduction and Fundamentals

Bag of Words

A simplified representation used in NLP where a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity (frequency).

Pros:

  • Simplicity: Very easy to understand and implement.
  • Efficiency: Computationally inexpensive for basic tasks like document classification or spam detection.
  • Interpretability: You can easily see which words (features) are most important for a prediction.

Cons:

  • Loss of Context: It ignores word order (e.g., “Not good” and “Good not” look identical).
  • Sparsity & High Dimensionality: As with one-hot encoding, the vector size grows with the vocabulary, leading to the “curse of dimensionality.”
  • No Semantic Similarity: It cannot capture the fact that “boat” and “ship” are related.
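A bag-of-words representation can be sketched with the standard library’s `Counter` (a minimal illustration, not a full featurizer):

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count word frequencies."""
    return Counter(text.lower().split())

# Word order is discarded: both sentences map to the same bag.
a = bag_of_words("Not good")
b = bag_of_words("Good not")
print(a == b)  # True
```

This makes the “loss of context” con concrete: any permutation of the same words yields an identical representation.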

One-hot Encoding

Definition: Converts discrete values into a binary vector in which exactly one position is 1 and all others are 0.

Pros:

  • Simple and straightforward to implement
  • Every token gets its own unique representation

Cons:

  • The vector dimension grows with the number of tokens in the vocabulary
  • Sparsity: requires more storage
  • Cannot represent the relationship between two different words
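A minimal one-hot sketch (the toy vocabulary here is illustrative):

```python
def one_hot(token, vocab):
    """Binary vector with a single 1 at the token's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.index(token)] = 1
    return vec

vocab = ["boat", "ship", "car"]
print(one_hot("ship", vocab))  # [0, 1, 0]

# The dot product of any two distinct one-hot vectors is 0,
# so "boat" and "ship" look completely unrelated.
```

Note how the vector length equals the vocabulary size, which is exactly the sparsity/dimensionality con above.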

Lec2 Learned Representation

Subword Model (e.g., BPE, FastText)

Definition: Splits words into smaller units (subwords or character n-grams). Frequent words remain whole, while rare words are broken down (e.g., “unhappiness” \(\rightarrow\) “un”, “happi”, “ness”).

Pros:

  • Limited vocabulary size; words sharing a root share subwords
  • Handles OOV (Out-of-Vocabulary): Can generate embeddings for words never seen during training by summing their subwords.
  • Morphological Richness: Excellent for languages with many word forms (like Turkish or German).

Cons:

  • Computational Overhead: Training and inference are slower because the model has to process more tokens per word.
  • Over-tokenization: Some words may be broken into too many meaningless fragments, losing “whole-word” semantic integrity.
  • Language Dependency: The efficiency of subword splitting varies significantly across different languages (e.g., English vs. Chinese).
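The splitting step can be sketched with a greedy longest-match segmenter over a known subword vocabulary (a toy stand-in for BPE/WordPiece; real tokenizers learn their merges from data rather than using a hand-written vocab):

```python
def subword_tokenize(word, vocab):
    """Greedily match the longest known subword at each position,
    falling back to single characters for unknown spans."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown span: emit one character
            i += 1
    return pieces

vocab = {"un", "happi", "ness"}
print(subword_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
```

The character fallback is what gives subword models their OOV coverage: every input string can be segmented, even if no learned subword matches.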

Continuous Bag of Words

Definition: A neural network-based model (Word2vec) that predicts a target center word based on its surrounding context words.

Pros:

  • Fast Training: Much faster than Skip-gram because it performs only one prediction per context window.
  • Better for Frequent Words: Provides a more stable representation for words that appear often in the data.
  • Low Dimensionality: Produces dense, fixed-size vectors (e.g., 300 dimensions) instead of sparse ones.

Cons:

  • Position Blurring: It averages context word vectors, meaning it ignores the specific order of words in the window.
  • Underperforms on Rare Words: Since it averages contexts, rare words tend to get “smoothed out” or drowned by frequent neighbors.
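The forward pass behind both cons can be sketched in a few lines: average the context embeddings, score every vocabulary word, and softmax. The vocabulary, dimension, and random initialization below are illustrative, not trained values.

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 4
# Input (context) and output (center-word) embedding tables.
emb = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
out = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def cbow_probs(context):
    """P(center word | context): average context vectors, dot with
    every output vector, softmax over the vocabulary."""
    h = [sum(emb[w][k] for w in context) / len(context) for k in range(dim)]
    scores = {w: sum(out[w][k] * h[k] for k in range(dim)) for w in vocab}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}

# Averaging erases word order: swapping the context changes nothing.
print(cbow_probs(["the", "sat"]) == cbow_probs(["sat", "the"]))  # True
```

The final line demonstrates the “position blurring” con directly: the averaged hidden vector is identical for any ordering of the same context words.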

Lec3 Language Modeling

A language model (LM) defines a probability distribution over all possible sequences of tokens: \(P(X) = P(x_1, x_2, \dots, x_T)\)

Applications: Score sequences (fluency), text generation, conditional generation (MT, summarization), QA, classification, grammar correction.

Autoregressive Language Models

Factorize joint probability into next-token predictions: \[ P(X) = \prod_{t=1}^{T} P(x_t \mid x_{<t}) \]

This reduces sequence modeling to next-token modeling, with output space being the vocabulary.

Bigram Models

Predict next token using only the previous token: \(P(X) \approx \prod_{t=1}^{T} P(x_t \mid x_{t-1})\)

Training (MLE): \[ P(x_t \mid x_{t-1}) = \frac{\text{count}(x_{t-1}, x_t)}{\sum_{x'} \text{count}(x_{t-1}, x')} \]

  • Pros: Simple, fast, introduces core LM concepts (MLE, log-space, autoregressive generation)
  • Cons: Limited context (1 token), no long-range dependencies, no word similarity
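The MLE counting formula above translates almost line-for-line into code (a minimal sketch with a two-sentence toy corpus):

```python
from collections import defaultdict

def train_bigram(corpus):
    """MLE bigram model: P(next | prev) = count(prev, next) / count(prev, *)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in corpus:
        tokens = ["[S]"] + sent.split() + ["[EOS]"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return {p: {n: c / sum(d.values()) for n, c in d.items()}
            for p, d in counts.items()}

model = train_bigram(["the cat sat", "the cat ran"])
print(model["cat"])  # {'sat': 0.5, 'ran': 0.5}
```

Each row of the resulting table is a normalized next-token distribution, which is exactly what autoregressive generation consumes.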

Training Objective

Goal: \(\max_\theta \sum_{x \in D} \log p_\theta(x)\) — equivalent to \(\min_\theta D_{KL}(p^* \parallel p_\theta)\)

Log space: Prevents underflow, turns multiplication into addition: \(\log P(X) = \sum_{t} \log P(x_t \mid x_{<t})\)

Text Generation

Iterative sampling: Start with [S] → sample \(x_t \sim P(x_t \mid x_{<t})\) → append → stop at [EOS]
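The sampling loop works for any model exposing a next-token distribution; the hand-written bigram table below is a toy assumption for illustration:

```python
import random

# Toy next-token distributions keyed by the previous token.
model = {
    "[S]": {"the": 1.0},
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 1.0},
    "mat": {"[EOS]": 1.0},
    "sat": {"[EOS]": 1.0},
}

def generate(model, max_len=20):
    """Start at [S], sample the next token from P(x_t | x_{t-1}),
    append, and stop at [EOS] (or at max_len as a safety cap)."""
    tokens = ["[S]"]
    while tokens[-1] != "[EOS]" and len(tokens) < max_len:
        dist = model[tokens[-1]]
        nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        tokens.append(nxt)
    return tokens[1:-1] if tokens[-1] == "[EOS]" else tokens[1:]

print(generate(model))  # e.g. ['the', 'cat', 'sat'] or ['the', 'mat']
```

The `max_len` cap matters in practice: an untrained or poorly smoothed model may never emit [EOS].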

Evaluation Metrics

Negative Log-Likelihood: \(\text{NLL} = - \sum_{i,t} \log P(x_t^{(i)} \mid x_{<t}^{(i)})\) (lower is better)

Perplexity: \(\text{PPL} = \exp\left(-\frac{1}{T}\sum_{i,t} \log P(x_t^{(i)} \mid x_{<t}^{(i)})\right)\) — measures model “confusion”, lower is better
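Perplexity is just the exponentiated mean NLL; a quick worked example (with made-up per-token probabilities):

```python
import math

# A model assigning probability 0.25 to every observed token is exactly
# as "confused" as a uniform choice over 4 options, so PPL should be 4.
token_probs = [0.25, 0.25, 0.25, 0.25]
nll = -sum(math.log(p) for p in token_probs)
ppl = math.exp(nll / len(token_probs))
print(ppl)  # ≈ 4.0
```

This also shows why lower is better: a model putting probability 1.0 on every observed token would reach the minimum PPL of 1.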

N-gram Models

Condition on previous \(n-1\) tokens: \(P(X) \approx \prod_{t=1}^{T} P(x_t \mid x_{t-n+1:t-1})\)

Smoothing (add-one): \(P(x_t \mid c) = \frac{1 + \text{count}(c, x_t)}{|V| + \sum_{x'} \text{count}(c, x')}\)

  • Pros: Longer context, fast, interpretable, strong memorization
  • Cons: No parameter sharing, can’t handle long-distance dependencies, vocab explosion with large \(n\)
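The add-one formula can be checked with a tiny worked example (the counts and vocabulary size below are illustrative):

```python
from collections import Counter

def add_one_prob(context, token, counts, vocab_size):
    """Laplace-smoothed P(token | context): unseen pairs get a small
    nonzero probability instead of 0."""
    total = sum(counts[context].values())
    return (1 + counts[context][token]) / (vocab_size + total)

counts = {"the": Counter({"cat": 2, "mat": 1})}  # count(context, token)
V = 5                                            # |V|

print(add_one_prob("the", "cat", counts, V))  # (1+2)/(5+3) = 0.375
print(add_one_prob("the", "dog", counts, V))  # (1+0)/(5+3) = 0.125, not 0
```

Smoothing trades a little probability mass away from seen n-grams so that unseen ones never zero out the whole product \(P(X)\).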

Feedforward Neural LMs (Bengio et al., 2003)

Replace count tables with neural networks: word embeddings → concatenate → hidden layers → softmax

\[ P(x_t \mid x_{<t}) = \text{softmax}(W h), \quad L_t = -\log P(x_t \mid x_{<t}) \]

  • Pros: Parameter sharing across similar words/contexts, dense representations, better generalization
  • Cons: Fixed context window (addressed by RNNs/Transformers)

Practical Training

  • Dataset splits: Train (fit params) / Validation (tune hyperparams) / Test (final eval, used once)
  • Overfitting: Val loss increases while train loss decreases → use regularization, early stopping
  • Weight init: Xavier: \(W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in}+n_{out}}}, \sqrt{\frac{6}{n_{in}+n_{out}}}\right)\)
  • Learning rate: Large early (exploration) → small later (convergence); warmup stabilizes early gradients
  • Batching: Pad variable-length sequences, mask padding in forward pass and loss
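The Xavier initialization above can be sketched directly from its formula (pure-Python for clarity; a real setup would use a tensor library’s built-in initializer):

```python
import math
import random

def xavier_uniform(n_in, n_out):
    """Sample an (n_in x n_out) weight matrix from U(-a, a) with
    a = sqrt(6 / (n_in + n_out)), keeping activation variance stable
    across layers."""
    a = math.sqrt(6 / (n_in + n_out))
    return [[random.uniform(-a, a) for _ in range(n_out)] for _ in range(n_in)]

W = xavier_uniform(300, 128)
bound = math.sqrt(6 / (300 + 128))
print(all(-bound <= w <= bound for row in W for w in row))  # True
```

The bound shrinks as the layer widths grow, so wider layers start with proportionally smaller weights.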

Key Takeaways

  • LM = probabilistic sequence modeling with autoregressive factorization
  • Progression: Bigram → N-gram → Neural LMs (same MLE objective throughout)
  • Careful evaluation and experimental setup are essential