LLM Evaluation Basics

If you cannot measure an AI feature, every model or prompt change is a guess.

Start with a small eval set. A 50-example set drawn from real user requests is often more useful than a giant generic benchmark, because it measures the task you actually ship.

What to evaluate

System type	Measure
Extraction	exact-match field accuracy, schema validity, missing values
Support bot	correctness, groundedness, tone, escalation
RAG	retrieval recall, faithfulness, citation quality
Agent	task success, tool errors, unsafe actions
Coding assistant	tests passed, compile status, patch size
Summarizer	coverage, factuality, compression ratio
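
Several of these measures reduce to simple arithmetic once the data is labeled. As one illustration, retrieval recall for a RAG system can be sketched as below. The function name `recall_at_k` and the document IDs are assumptions made for this example, not part of any particular library.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(relevant & top_k) / len(relevant)


# Hypothetical retriever output (ranked) and hand-labeled relevant documents.
retrieved = ["doc7", "doc2", "doc9", "doc4"]
relevant = ["doc2", "doc4"]

print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only doc2 is in the top 3
```

Averaging this value over every example in the eval set gives a single retrieval number to track across changes.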

Build a golden dataset

A useful eval example includes:

input
expected behavior
scoring rubric
edge case label
optional reference answer
notes about why it matters

json

{
  "id": "refund-001",
  "input": "Can I get a refund after 45 days?",
  "expected": "Say refund window is 30 days and suggest support escalation.",
  "rubric": ["policy_correct", "no_hallucinated_exception", "polite"]
}
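
Malformed examples quietly corrupt scores, so it can help to check each record before running the eval. The following is a minimal sketch, assuming the four fields shown in the example above are required. `REQUIRED_FIELDS` and `validate_example` are hypothetical names chosen for illustration.

```python
import json

# Assumption: these four fields are mandatory; the others listed above are optional.
REQUIRED_FIELDS = {"id", "input", "expected", "rubric"}


def validate_example(example):
    """Return a list of problems found in one eval example (empty if none)."""
    problems = []
    missing = REQUIRED_FIELDS - set(example)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "rubric" in example:
        rubric = example["rubric"]
        if not isinstance(rubric, list) or not rubric:
            problems.append("rubric must be a non-empty list")
    return problems


raw = """
{
  "id": "refund-001",
  "input": "Can I get a refund after 45 days?",
  "expected": "Say refund window is 30 days and suggest support escalation.",
  "rubric": ["policy_correct", "no_hallucinated_exception", "polite"]
}
"""

example = json.loads(raw)
print(validate_example(example))  # []
```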

Scoring methods

Method	Best for	Weakness
Exact match	IDs, labels, extracted fields	too strict for prose
Unit tests	code generation	misses UX quality
Human review	high-stakes decisions	slow and expensive
LLM-as-judge	open-ended answers	can be biased (verbosity, answer position); needs calibration against human labels
Pairwise comparison	model/prompt A/B tests	needs representative data
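
Exact match is the simplest of these methods to automate. A minimal sketch for extracted fields follows. The function name `field_accuracy` and the sample record are assumptions for illustration.

```python
def field_accuracy(predicted, expected):
    """Share of expected fields whose predicted value matches exactly."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    hits = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return hits / len(expected)


# Hypothetical extraction output compared against a hand-labeled record.
expected = {"order_id": "A-1042", "currency": "USD", "amount": "19.99", "country": "DE"}
predicted = {"order_id": "A-1042", "currency": "usd", "amount": "19.99", "country": "DE"}

print(field_accuracy(predicted, expected))  # 0.75: "usd" != "USD"
```

The failed `currency` comparison shows the strictness noted in the table: a trivial casing difference counts as a miss, so normalize values before comparing if such differences do not matter for the task.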

LLM-as-judge tips

use a clear rubric
hide which system produced the answer
ask for a score and short rationale
calibrate against human labels
watch for verbosity and position bias
keep judge prompts versioned
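
Some of these tips can be sketched in code: a blind pairwise prompt that shuffles answer order to counter position bias, a version tag on the judge prompt, and an agreement rate for calibrating against human labels. This is a sketch under stated assumptions, not a definitive implementation. All names here (`JUDGE_PROMPT_VERSION`, `build_pairwise_prompt`, `resolve_winner`, `agreement_rate`) are hypothetical, and the call to the judge model itself is deliberately left out.

```python
import random

JUDGE_PROMPT_VERSION = "v1"  # hypothetical tag; bump it whenever the prompt text changes


def build_pairwise_prompt(question, answer_a, answer_b, rubric, rng):
    """Build a blind judge prompt and return it with the shuffled order used."""
    order = ["a", "b"]
    rng.shuffle(order)  # random position, so neither system always appears first
    answers = {"a": answer_a, "b": answer_b}
    prompt = (
        f"[judge prompt {JUDGE_PROMPT_VERSION}]\n"
        f"Rubric: {', '.join(rubric)}\n"
        f"Question: {question}\n"
        f"Response 1: {answers[order[0]]}\n"
        f"Response 2: {answers[order[1]]}\n"
        "Reply with the better response (1 or 2) and a one-sentence rationale."
    )
    return prompt, order


def resolve_winner(judge_choice, order):
    """Map the judge's '1' or '2' back to the real system label."""
    return order[int(judge_choice) - 1]


def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the judge and the human picked the same label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


rng = random.Random(0)  # seeded only to make the example repeatable
prompt, order = build_pairwise_prompt(
    "Can I get a refund after 45 days?",
    "No, the refund window is 30 days. I can escalate this to support.",
    "Yes, refunds are always available.",
    ["policy_correct", "no_hallucinated_exception", "polite"],
    rng,
)
print(prompt)
print(agreement_rate(["a", "b", "a", "a"], ["a", "b", "b", "a"]))  # 0.75
```

The prompt never names the systems, and `order` stays on the caller's side so the judge's choice can be mapped back afterwards with `resolve_winner`.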

Regression testing workflow

text

baseline prompt/model
        |
run eval set
        |
change prompt/model/retriever
        |
run eval set again
        |
compare quality, cost, latency, safety
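
The final comparison step can be sketched as a small function that flags any metric that got worse between the two runs. The metric names, the direction of each metric, and the numbers below are illustrative assumptions, not measured results.

```python
# Assumption: True means higher is better, False means lower is better.
HIGHER_IS_BETTER = {
    "quality": True,
    "cost_usd": False,
    "latency_s": False,
    "unsafe_rate": False,
}


def find_regressions(baseline, candidate, tolerance=0.0):
    """Return {metric: delta} for every metric that worsened by more than tolerance."""
    regressions = {}
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        delta = candidate[metric] - baseline[metric]
        worse = delta < -tolerance if higher_is_better else delta > tolerance
        if worse:
            regressions[metric] = delta
    return regressions


# Made-up aggregate results from two runs over the same eval set.
baseline = {"quality": 0.80, "cost_usd": 0.50, "latency_s": 2.0, "unsafe_rate": 0.0}
candidate = {"quality": 0.85, "cost_usd": 0.50, "latency_s": 3.0, "unsafe_rate": 0.0}

print(sorted(find_regressions(baseline, candidate)))  # ['latency_s']
```

Here quality improved but latency regressed, which is the kind of trade-off that comparing only a single quality score would hide.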

Knowledge check

Q1: Why should evals come before fine-tuning?
Without evals you have no baseline, so you cannot show that the fine-tune improved the task.

Q2: What is a golden dataset?
A representative set of examples with expected behavior and scoring rules.