Lec13 Evaluation and Benchmarks

Why Evaluation Matters

Evaluation is the mechanism by which we make progress claims in NLP. Poor evaluation leads to:

  • Misleading conclusions about model capability.
  • Models that game metrics without real improvement.
  • Inability to compare across systems or papers.

Two Families of Evaluation

Loss-Based Evaluation

During training (and sometimes validation), we use the model’s own loss:

\[ \mathcal{L} = -\sum_{t} \log p_\theta(x_t \mid x_{<t}) \]

Commonly reported as perplexity:

\[ \text{PPL} = \exp\!\left(\frac{1}{T}\sum_{t=1}^{T} -\log p_\theta(x_t \mid x_{<t})\right) \]

  • Lower PPL = better language modeling.
  • Fast to compute; directly tied to training objective.
  • Problem: PPL does not always correlate with downstream task performance.
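As a quick sketch, perplexity follows directly from per-token log-probabilities (here a pure-Python toy; a real evaluation would stream log-probs from the model):

```python
import math

def perplexity(token_logprobs):
    """Corpus perplexity from per-token log-probabilities (natural log)."""
    # PPL = exp(mean negative log-likelihood over T tokens)
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Sanity check: a uniform model over a vocabulary of size V assigns
# log p = -log V to every token, so its perplexity is exactly V.
print(perplexity([-math.log(50_000)] * 10))
```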

Task-Metric-Based Evaluation

Evaluate on held-out tasks using task-specific metrics.

| Metric | Task | Notes |
| --- | --- | --- |
| Exact Match (EM) | QA, code | Binary; strict |
| pass@K | Code generation | Does any of K samples pass tests? |
| ROUGE-L | Summarization | Longest common subsequence F1 |
| BLEU | Translation | n-gram precision with brevity penalty |
| Accuracy | Classification, MCQ | Fraction correct |

Benchmarks

Properties of a Good Benchmark

A benchmark should be:

  • Valid — actually measures the capability it claims to.
  • Reliable — low variance across runs/annotators.
  • Discriminative — separates models at the current frontier.
  • Not saturated — top models should not already be at ceiling.
  • Contamination-free — test examples not in training data.

Key Benchmarks Discussed

MMLU (Massive Multitask Language Understanding)

  • 57 subjects across STEM, humanities, law, etc.
  • Multiple-choice (4 options), evaluated by accuracy.
  • Originally challenging; frontier models now saturate it.

HumanEval

  • 164 Python programming problems.
  • Evaluated with pass@K:

\[ \text{pass@}k = \mathbb{E}_{\text{problems}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right] \]

where \(n\) = total samples generated, \(c\) = correct samples, \(k\) = budget.

  • Tests functional correctness via unit tests.
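The unbiased estimator above translates directly into code (a minimal sketch using Python's exact binomial coefficients):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n = total samples generated, c = correct samples, k = budget.
    """
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With one correct sample out of two, a budget of one recovers the intuitive 0.5.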

GSM8k

  • 8,500 grade-school math word problems.
  • Requires multi-step arithmetic reasoning.
  • Solved via chain-of-thought prompting; now also approaching saturation for large models.

Task Metrics in Detail

Exact Match

\[ \text{EM}(y, \hat{y}) = \mathbf{1}[y = \hat{y}] \]

Simple but harsh; sensitive to minor phrasing differences.
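In practice, EM implementations usually normalize both strings before comparing (lowercasing, stripping punctuation and articles, as in SQuAD-style scoring scripts). A minimal sketch:

```python
import re
import string

def normalize(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(gold, pred):
    return int(normalize(gold) == normalize(pred))
```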

ROUGE

ROUGE-N recall (reference \(r\), candidate \(c\)):

\[ \text{ROUGE-N} = \frac{\sum_{s \in r} \sum_{\text{gram}_n \in s} \text{Count}_{\text{match}}(\text{gram}_n)} {\sum_{s \in r} \sum_{\text{gram}_n \in s} \text{Count}(\text{gram}_n)} \]

ROUGE-L uses longest common subsequence instead of n-grams.
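A minimal ROUGE-L sketch via the classic LCS dynamic program (whitespace tokenization is an assumption here; real toolkits tokenize and stem more carefully):

```python
def lcs_len(a, b):
    """Longest common subsequence length via the standard DP table."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate, beta=1.0):
    r, c = reference.split(), candidate.split()
    lcs = lcs_len(r, c)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)
```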

BLEU

\[ \text{BLEU} = \text{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right) \]

where \(p_n\) is clipped n-gram precision and BP is the brevity penalty.

  • BLEU is a corpus-level metric.
  • Criticized for not capturing semantic similarity.
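Although BLEU is properly a corpus-level metric, a sentence-level sketch makes the clipped-precision and brevity-penalty mechanics concrete (no smoothing; uniform weights \(w_n = 1/N\)):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(reference, candidate, max_n=4):
    ref, cand = reference.split(), candidate.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ng, ref_ng = ngrams(cand, n), ngrams(ref, n)
        # clip each candidate n-gram count by its count in the reference
        clipped = sum(min(cnt, ref_ng[g]) for g, cnt in cand_ng.items())
        total = max(sum(cand_ng.values()), 1)
        if clipped == 0:
            return 0.0  # no smoothing in this sketch: any zero precision kills BLEU
        log_prec += math.log(clipped / total) / max_n  # uniform weights w_n = 1/N
    # brevity penalty: 1 if candidate is longer than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```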

LLM-as-Judge

Instead of a fixed metric, use a strong LLM to score or compare outputs.

Pairwise setting:

Given prompt \(x\) and two responses \(a\), \(b\), ask a judge model:

“Which response is better? A, B, or tie?”

Absolute scoring:

“Rate this response from 1 to 10 on helpfulness.”

Concerns:

  • Position bias (judge favors first/second response).
  • Self-enhancement bias (GPT-4 prefers GPT-4 style).
  • Verbosity bias (judges often prefer longer responses).

Mitigation:

  • Swap order and average.
  • Use multiple judge models.
  • Calibrate against human ratings.
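The swap-order mitigation can be sketched as follows, where `judge` is a hypothetical callable wrapping some judge-model API; a verdict is kept only when it is consistent across both presentation orders:

```python
def debiased_pairwise(judge, prompt, resp_a, resp_b):
    """Query a judge in both orders and keep only order-consistent verdicts.

    `judge(prompt, first, second)` is a hypothetical callable returning
    "first", "second", or "tie"; swapping and comparing cancels position bias.
    """
    v1 = judge(prompt, resp_a, resp_b)  # A shown first
    v2 = judge(prompt, resp_b, resp_a)  # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # disagreement across orders is treated as a tie
```

A purely position-biased judge (one that always picks whichever response is shown first) collapses to all ties under this scheme.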

Checklist-Based Evaluation

Lee et al. (2025) propose decomposing evaluation into fine-grained criteria:

  1. Define a checklist of desiderata for a task.
  2. Ask the LLM judge to check each criterion independently.
  3. Aggregate binary pass/fail signals.

Advantage: More interpretable; reveals where a model fails rather than just whether it fails.

Human Evaluation

Crowd-sourced Annotation

  • Collect human preference judgments at scale.
  • Use careful guidelines and quality control.

LMSys Chatbot Arena

  • Open-source platform where anonymous users compare model outputs.
  • Uses the Bradley-Terry model to estimate Elo-style ratings:

\[ P(\text{model A beats B}) = \frac{e^{\beta_A}}{e^{\beta_A} + e^{\beta_B}} \]

  • Large volume of real-user comparisons; diverse prompts.
  • Limitations: prompt distribution not controlled; users self-select.
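A minimal Bradley-Terry fit by gradient ascent on the log-likelihood (Chatbot Arena's actual pipeline adds confidence intervals and other refinements; this sketch only recovers ratings, which are identifiable up to a constant shift):

```python
import math

def fit_bradley_terry(wins, n_models, lr=0.05, steps=3000):
    """wins[i][j] = number of times model i beat model j."""
    beta = [0.0] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for i in range(n_models):
            for j in range(n_models):
                if i == j:
                    continue
                total = wins[i][j] + wins[j][i]
                if total == 0:
                    continue
                p_ij = math.exp(beta[i]) / (math.exp(beta[i]) + math.exp(beta[j]))
                grad[i] += wins[i][j] - total * p_ij  # d log-lik / d beta_i
        beta = [b + lr * g for b, g in zip(beta, grad)]
        mean = sum(beta) / n_models
        beta = [b - mean for b in beta]  # pin the free additive constant
    return beta
```

With model A beating B 20 times out of 30, the fitted ratings imply a win probability near 2/3, as expected.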

Lec13 Takeaways

  • Loss-based metrics (PPL) are cheap but not always predictive of downstream quality.
  • Good benchmarks must be valid, discriminative, and contamination-free.
  • No single metric is perfect; triangulate with multiple signals.
  • LLM-as-judge is increasingly practical but requires careful bias mitigation.

Lec14 Research Skills & Experimental Design

The Scientific Method in NLP

Doing NLP research is doing science:

  1. Observe a phenomenon or gap.
  2. Hypothesize a mechanism or solution.
  3. Design an experiment to test the hypothesis.
  4. Collect data / run models.
  5. Analyze results with appropriate statistics.
  6. Conclude and communicate.

A common failure mode: jumping to step 4 without clearly stating steps 2–3.


Forming a Research Hypothesis

A good hypothesis is:

  • Falsifiable — can be shown wrong by an experiment.
  • Specific — names the independent variable, dependent variable, and expected direction.
  • Motivated — connects to prior work or a theoretical reason.

Example:

“Fine-tuning on chain-of-thought data will improve accuracy on GSM8k compared to standard SFT, because step-by-step supervision reduces distribution shift at inference.”

Statistical Analysis

Standard Error and Confidence Intervals

For metric \(\mu\) estimated from \(n\) i.i.d. samples:

\[ \text{SE} = \frac{\sigma}{\sqrt{n}} \]

95% confidence interval (large-sample, Normal approximation):

\[ \hat{\mu} \pm 1.96 \cdot \text{SE} \]

Always report CIs alongside point estimates. A result without uncertainty is not a result.
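The formulas above take only a few lines (sample standard deviation, large-sample Normal interval):

```python
import math

def mean_ci95(values):
    """Mean with a large-sample 95% confidence interval."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                               # standard error
    return mean, (mean - 1.96 * se, mean + 1.96 * se)

# 60% accuracy on 100 examples: the interval is roughly +/- 0.10 around 0.6.
acc, (lo, hi) = mean_ci95([1.0] * 60 + [0.0] * 40)
```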

Hypothesis Testing

Null hypothesis \(H_0\): no difference between system A and system B.

p-value: probability of observing a difference at least as large as measured, assuming \(H_0\).

Common threshold: \(p < 0.05\) (but this should not be the only criterion).

Paired vs. Unpaired Tests

| Setting | Test |
| --- | --- |
| Same test set, two models | Paired (McNemar’s, Wilcoxon signed-rank) |
| Independent test sets | Unpaired (t-test, Mann-Whitney U) |

Paired tests are generally more powerful when the pairing is valid (same inputs evaluated twice).

McNemar’s test for classification:

\[ \chi^2 = \frac{(b - c)^2}{b + c} \]

where \(b\) = # examples where A correct, B wrong; \(c\) = vice versa.
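McNemar's statistic is one line; comparing it against the \(\chi^2\) critical value with 1 degree of freedom (3.84 at \(\alpha = 0.05\)) gives the decision:

```python
def mcnemar_chi2(b, c):
    """b = examples A correct & B wrong; c = A wrong & B correct."""
    if b + c == 0:
        return 0.0  # the models never disagree
    return (b - c) ** 2 / (b + c)

# chi2 = (30 - 10)^2 / 40 = 10 > 3.84, so the difference is significant
# at alpha = 0.05 (1 degree of freedom).
print(mcnemar_chi2(30, 10))
```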

Bootstrap Confidence Intervals

Resample the test set with replacement \(B\) times; compute metric on each resample:

\[ \hat{\mu}^{*(b)} = \text{metric}(\text{resample}_b) \]

Report the 2.5th–97.5th percentile of \(\{\hat{\mu}^{*(b)}\}\) as the 95% CI.

Non-parametric; works for any metric.
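A percentile-bootstrap sketch over per-example scores (the `metric` default is the mean, e.g. accuracy over 0/1 scores):

```python
import random

def bootstrap_ci95(scores, metric=lambda xs: sum(xs) / len(xs), B=2000, seed=0):
    """Percentile bootstrap: resample per-example scores with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        metric([rng.choice(scores) for _ in scores]) for _ in range(B)
    )
    # 2.5th and 97.5th percentiles of the bootstrap distribution
    return stats[int(0.025 * B)], stats[int(0.975 * B)]
```

Because it only needs per-example scores, the same function works unchanged for accuracy, EM, ROUGE, or any other metric computable on a resample.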

Data Contamination

The Problem

If test examples appear (verbatim or paraphrased) in pretraining data, reported accuracy is inflated.

Temporal Split

Train on data before date \(T\); evaluate on data after \(T\).

  • Guarantees no look-ahead contamination.
  • Used by benchmarks built as temporal holdouts.

N-gram Overlap Checks

Compute n-gram overlap between train and test; flag high-overlap examples.
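A toy overlap check on whitespace tokens (the choice of \(n\) and the flagging threshold are assumptions that vary across papers; checks in the 8–13-gram range are common):

```python
def ngram_set(text, n=8):
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_fraction(test_example, train_corpus, n=8):
    """Fraction of the test example's n-grams that also occur in training text."""
    test_ng = ngram_set(test_example, n)
    if not test_ng:
        return 0.0  # example too short to contain any n-gram
    train_ng = ngram_set(train_corpus, n)
    return len(test_ng & train_ng) / len(test_ng)
```

Examples whose overlap fraction exceeds a chosen threshold would be flagged for removal or reported separately.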

Power Analysis

Before running an experiment, determine the sample size \(n\) needed to detect an effect.

For a two-sample test at significance level \(\alpha\) and power \(1-\beta\):

\[ n = \frac{(z_{\alpha/2} + z_\beta)^2 \cdot 2\sigma^2}{\delta^2} \]

where \(\delta\) is the minimum detectable effect size.

Key insight: underpowered experiments waste resources and mislead. If \(n\) is too small, you may fail to detect a real effect (Type II error), and any significant differences you do find are more likely to be false positives (inflated Type I error, especially under multiple testing).
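The sample-size formula can be evaluated with standard Normal quantiles from Python's standard library:

```python
import math
from statistics import NormalDist

def required_n(delta, sigma, alpha=0.05, power=0.8):
    """Per-group sample size for a two-sided, two-sample test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)          # ~0.84 for power = 0.8
    return math.ceil((z_a + z_b) ** 2 * 2 * sigma ** 2 / delta ** 2)

# Detecting a half-standard-deviation effect needs ~63 examples per group.
print(required_n(delta=0.5, sigma=1.0))
```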

Annotation Quality

When building datasets with human labels, measure inter-annotator agreement.

Cohen’s Kappa (2 annotators)

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) = observed agreement, \(p_e\) = expected agreement by chance.

  • \(\kappa = 1\): perfect agreement.
  • \(\kappa = 0\): agreement at chance level.
  • \(\kappa < 0\): worse than chance.
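Cohen's \(\kappa\) in a few lines (nominal labels, two annotators):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over nominal labels."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n  # observed agreement
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / n ** 2              # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

When observed agreement sits exactly at the chance rate, the numerator vanishes and \(\kappa = 0\), matching the interpretation above.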

Fleiss’ Kappa (multiple annotators)

Generalization of Cohen’s Kappa to \(k\) raters:

\[ \bar{\kappa} = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \]

Krippendorff’s Alpha

More general: handles ordinal, interval, and ratio scales, missing data, and any number of annotators.

For nominal scale, \(\alpha\) reduces to a form similar to Fleiss’ \(\kappa\).

Guideline (Landis & Koch scale):

| \(\kappa\) range | Interpretation |
| --- | --- |
| \(< 0.20\) | Slight |
| \(0.21\)–\(0.40\) | Fair |
| \(0.41\)–\(0.60\) | Moderate |
| \(0.61\)–\(0.80\) | Substantial |
| \(> 0.80\) | Almost perfect |

Target \(\kappa \geq 0.6\) for publishable annotation quality.

Variance Reduction

When comparing models, reduce variance to make comparisons sharper.

Control Variates

Use the same random seeds, same test set, same batch order across compared models when possible.

Stratified Sampling

Ensure test set covers different difficulty strata; report per-strata results alongside aggregate.

Multiple Runs

Report mean and standard deviation across \(k\) random seeds:

\[ \bar{m} \pm \text{std}(m_1,\dots,m_k) \]

Experimental Design Checklist

Before running experiments, ask:

  • Is the hypothesis falsifiable and specific (variables and expected direction named)?
  • Is the sample size justified by a power analysis?
  • Has the test set been checked for contamination?
  • Is the statistical test chosen in advance (paired where the pairing is valid)?
  • Will results be reported with CIs and across multiple seeds?
  • If collecting labels: is annotator agreement measured (\(\kappa \geq 0.6\))?

Lec14 Takeaways

  • A well-posed hypothesis is half the experiment.
  • Always report uncertainty (SE, CI) alongside point estimates.
  • Use paired tests when the pairing is valid; they are more powerful.
  • Power analysis before data collection prevents underpowered studies.
  • High annotation agreement (\(\kappa \geq 0.6\)) is necessary for trustworthy datasets.
  • Data contamination is a serious, underreported threat to benchmark validity.

Final Summary

  • Lec13: NLP evaluation requires careful metric choice. Benchmarks must be valid, discriminative, and uncontaminated. LLM-as-judge is powerful but needs bias mitigation. Human evaluation (Chatbot Arena) provides ground truth at scale.
  • Lec14: Good experimental design starts with a falsifiable hypothesis. Statistical rigor (CIs, hypothesis tests, power analysis) and high annotation quality (\(\kappa\)) are non-negotiable in credible NLP research.

This post is based on CMU 11-711 Advanced NLP lecture materials (Evaluation and Benchmarks, Lec13; Research Skills & Experimental Design, Lec14; Sean Welleck).