LLM Evaluation Basics
If you cannot measure an AI feature, every model or prompt change is a guess.
Start with a small eval set. A 50-example eval that reflects real users is often more useful than a giant generic benchmark.
What to evaluate
| System type | Measure |
|---|---|
| Extraction | exact fields, schema validity, missing values |
| Support bot | correctness, groundedness, tone, escalation |
| RAG | retrieval recall, faithfulness, citation quality |
| Agent | task success, tool errors, unsafe actions |
| Coding assistant | tests passed, compile status, patch size |
| Summarizer | coverage, factuality, compression ratio |
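As an illustration of the Extraction row, the three measures can be computed per example with a few lines of Python. This is a minimal sketch: the `SCHEMA` fields, the `score_extraction` name, and the sample invoice values are all assumptions made up for the example, not part of any particular library.

```python
from typing import Any, Dict

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"invoice_id": str, "total": float, "currency": str}


def score_extraction(predicted: Dict[str, Any], expected: Dict[str, Any]) -> Dict[str, Any]:
    """Score one extraction output on exact fields, schema validity, and missing values."""
    # Missing values: schema fields that are absent or None in the prediction.
    missing = [name for name in SCHEMA if predicted.get(name) is None]
    # Schema validity: nothing missing, no unknown fields, every value has the declared type.
    schema_valid = (
        not missing
        and set(predicted) <= set(SCHEMA)
        and all(isinstance(predicted[name], kind) for name, kind in SCHEMA.items())
    )
    # Exact fields: fraction of expected fields reproduced exactly.
    correct = sum(1 for name, value in expected.items() if predicted.get(name) == value)
    return {
        "field_accuracy": correct / len(expected) if expected else 1.0,
        "schema_valid": schema_valid,
        "missing": missing,
    }


result = score_extraction(
    {"invoice_id": "INV-7", "total": 19.99, "currency": None},
    {"invoice_id": "INV-7", "total": 19.99, "currency": "USD"},
)
print(result)
```

Reporting the three numbers separately, rather than as one blended score, makes it easier to see whether a change hurt accuracy or only broke the output format.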
Build a golden dataset
A useful eval example includes:
- input
- expected behavior
- scoring rubric
- edge case label
- optional reference answer
- notes about why it matters
```json
{
  "id": "refund-001",
  "input": "Can I get a refund after 45 days?",
  "expected": "Say refund window is 30 days and suggest support escalation.",
  "rubric": ["policy_correct", "no_hallucinated_exception", "polite"]
}
```
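Records like the one above are easier to trust if they are validated when loaded. The sketch below assumes the dataset is stored as one JSON object per line; the `EvalExample` class, the `load_examples` helper, and the optional key names (`edge_case`, `reference_answer`, `notes`) are hypothetical choices for illustration.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalExample:
    id: str
    input: str
    expected: str
    rubric: List[str]
    # Optional fields; these key names are assumptions, not part of the sample record.
    edge_case: Optional[str] = None
    reference_answer: Optional[str] = None
    notes: str = ""


def load_examples(lines):
    """Parse one JSON record per line, failing fast on duplicate ids or empty rubrics."""
    seen, examples = set(), []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # allow blank lines between records
        # Unknown keys or missing required keys raise TypeError here.
        example = EvalExample(**json.loads(line))
        if example.id in seen:
            raise ValueError(f"line {number}: duplicate id {example.id!r}")
        if not example.rubric:
            raise ValueError(f"line {number}: empty rubric")
        seen.add(example.id)
        examples.append(example)
    return examples


record = json.dumps({
    "id": "refund-001",
    "input": "Can I get a refund after 45 days?",
    "expected": "Say refund window is 30 days and suggest support escalation.",
    "rubric": ["policy_correct", "no_hallucinated_exception", "polite"],
})
examples = load_examples([record])
print(examples[0].id, examples[0].rubric)
```

Failing at load time keeps malformed examples from silently skewing scores later in the run.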
Scoring methods
| Method | Best for | Weakness |
|---|---|---|
| Exact match | IDs, labels, extracted fields | too strict for prose |
| Unit tests | code generation | misses UX quality |
| Human review | high-stakes decisions | slow and expensive |
| LLM-as-judge | open-ended answers | can favor longer or first-listed answers; needs calibration |
| Pairwise comparison | model/prompt A/B tests | needs representative data |
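Two of the methods in the table are simple enough to sketch directly. The normalization in `exact_match` (trimming and lowercasing) and the half-point rule for ties in `win_rate` are assumptions chosen for this example; a real project may want stricter or looser rules.

```python
from typing import List


def exact_match(predicted: str, expected: str) -> bool:
    """Compare after trimming whitespace and lowercasing (an assumed normalization)."""
    return predicted.strip().lower() == expected.strip().lower()


def win_rate(preferences: List[str]) -> float:
    """Share of pairwise comparisons won by system A; a tie counts as half a win."""
    if not preferences:
        raise ValueError("no comparisons to score")
    points = {"A": 1.0, "tie": 0.5, "B": 0.0}
    return sum(points[p] for p in preferences) / len(preferences)


print(exact_match(" Billing ", "billing"))   # True
print(win_rate(["A", "A", "B", "tie"]))      # 0.625
```

A win rate near 0.5 on a small eval set is usually not enough evidence to prefer either system.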
LLM-as-judge tips
- use a clear rubric
- hide which system produced the answer
- ask for a score and short rationale
- calibrate against human labels
- watch for verbosity and position bias
- keep judge prompts versioned
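Several of these tips can be combined in one small harness. The sketch below is an assumption-heavy illustration: `call_judge` stands in for whatever model client you use (it is not a real API), the prompt wording is invented, and `position_biased_judge` is a fake judge used only to show why both orderings are checked.

```python
import hashlib
from typing import Callable, List

JUDGE_PROMPT = (
    "Grade two answers to the same question against the rubric.\n"
    "Rubric: {rubric}\n"
    "Question: {question}\n"
    "Answer 1: {first}\n"
    "Answer 2: {second}\n"
    "First line: 1, 2, or tie. Second line: a one-sentence rationale."
)
# Version the prompt by content hash so scores can be traced to the exact wording.
PROMPT_VERSION = hashlib.sha256(JUDGE_PROMPT.encode("utf-8")).hexdigest()[:8]


def judge_pair(question: str, rubric: str, answer_a: str, answer_b: str,
               call_judge: Callable[[str], str]) -> str:
    """Return "A", "B", or "tie". The judge only sees "Answer 1" / "Answer 2"."""
    def verdict(first: str, second: str) -> str:
        reply = call_judge(JUDGE_PROMPT.format(
            rubric=rubric, question=question, first=first, second=second))
        lines = reply.strip().splitlines()
        return lines[0].strip().lower() if lines else "tie"

    # Ask in both orders; a verdict that flips with position is treated as a tie.
    forward = {"1": "A", "2": "B"}.get(verdict(answer_a, answer_b), "tie")
    swapped = {"1": "B", "2": "A"}.get(verdict(answer_b, answer_a), "tie")
    return forward if forward == swapped else "tie"


def agreement(judge_labels: List[str], human_labels: List[str]) -> float:
    """Fraction of examples where the judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)


def position_biased_judge(prompt: str) -> str:
    """Fake judge for the demo: always prefers whichever answer is listed first."""
    return "1\nThe first answer reads slightly better."


result = judge_pair(
    "Can I get a refund after 45 days?",
    "policy_correct, polite",
    "No, the refund window is 30 days.",
    "Yes, refunds are always available.",
    position_biased_judge,
)
print(PROMPT_VERSION, result)
```

With the fake judge, the two orderings disagree, so the pair is scored as a tie instead of crediting whichever answer happened to appear first. Storing `PROMPT_VERSION` alongside each score keeps results comparable only when the judge wording is unchanged.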
Regression testing workflow
```text
baseline prompt/model
  |
run eval set
  |
change prompt/model/retriever
  |
run eval set again
  |
compare quality, cost, latency, safety
```
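The final comparison step can be automated as a simple gate. In this sketch the `RunSummary` fields, the `find_regressions` helper, the tolerance values, and the sample numbers are all assumptions for illustration; pick thresholds that match your own product's tolerances.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunSummary:
    pass_rate: float        # fraction of eval examples that passed
    cost_usd: float         # total cost of the eval run
    p95_latency_s: float    # 95th percentile latency in seconds
    safety_violations: int  # count of outputs flagged as unsafe


def find_regressions(baseline: RunSummary, candidate: RunSummary,
                     max_quality_drop: float = 0.02,
                     max_cost_increase: float = 0.10,
                     max_latency_increase: float = 0.10) -> List[str]:
    """List the dimensions where the candidate is worse than the assumed tolerances allow."""
    regressions = []
    if candidate.pass_rate < baseline.pass_rate - max_quality_drop:
        regressions.append("quality")
    if candidate.cost_usd > baseline.cost_usd * (1 + max_cost_increase):
        regressions.append("cost")
    if candidate.p95_latency_s > baseline.p95_latency_s * (1 + max_latency_increase):
        regressions.append("latency")
    # Any new safety violation counts as a regression, with no tolerance.
    if candidate.safety_violations > baseline.safety_violations:
        regressions.append("safety")
    return regressions


# Made-up numbers: the candidate is more accurate but 25% more expensive.
baseline = RunSummary(pass_rate=0.86, cost_usd=1.20, p95_latency_s=2.1, safety_violations=0)
candidate = RunSummary(pass_rate=0.90, cost_usd=1.50, p95_latency_s=2.0, safety_violations=0)
print(find_regressions(baseline, candidate))  # ['cost']
```

Returning the list of regressed dimensions, instead of a single pass/fail flag, leaves the trade-off decision (for example, better quality at higher cost) to a person.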
Knowledge check
Q1: Why should evals come before fine-tuning?
Without evals you have no baseline, so you cannot show that the fine-tune improved performance on the task.
Q2: What is a golden dataset?
A representative set of examples with expected behavior and scoring rules.