intermediate
Application Reliability

LLM Evaluation Basics

Create small, useful eval sets before changing prompts, models, tools, or retrieval

24 min read · evaluation · evals · LLM-as-judge · quality

If you cannot measure an AI feature, every model or prompt change is a guess.

Start with a small eval set. A 50-example eval that reflects real users is often more useful than a giant generic benchmark.

What to evaluate

| System type      | Measure                                          |
|------------------|--------------------------------------------------|
| Extraction       | exact fields, schema validity, missing values    |
| Support bot      | correctness, groundedness, tone, escalation      |
| RAG              | retrieval recall, faithfulness, citation quality |
| Agent            | task success, tool errors, unsafe actions        |
| Coding assistant | tests passed, compile status, patch size         |
| Summarizer       | coverage, factuality, compression ratio          |
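
Two of the measures above are simple enough to compute directly. The following is a minimal sketch, not a standard library API: the function names `extraction_scores` and `recall_at_k` are hypothetical, and it assumes extraction outputs are flat dictionaries and retrieved documents are identified by string IDs. Schema validity is left out here.

```python
from typing import Any


def extraction_scores(predicted: dict[str, Any], expected: dict[str, Any]) -> dict[str, float]:
    """Score one extraction output against its expected fields (flat dicts assumed)."""
    total = len(expected)
    if total == 0:
        raise ValueError("expected must contain at least one field")
    # A field counts as exact only if the predicted value equals the expected one.
    exact = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    # A field counts as missing if it is absent, None, or an empty string.
    missing = sum(1 for key in expected if predicted.get(key) in (None, ""))
    return {"field_accuracy": exact / total, "missing_rate": missing / total}


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


scores = extraction_scores(
    {"order_id": "A17", "amount": None},
    {"order_id": "A17", "amount": "19.99"},
)
print(scores)  # {'field_accuracy': 0.5, 'missing_rate': 0.5}
print(recall_at_k(["d3", "d1", "d9"], {"d1", "d2"}, k=2))  # 0.5
```

Averaging these per-example scores over the eval set gives one number per measure that can be tracked across changes.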

Build a golden dataset

A useful eval example includes:

  • input
  • expected behavior
  • scoring rubric
  • edge case label
  • optional reference answer
  • notes about why it matters

```json
{
  "id": "refund-001",
  "input": "Can I get a refund after 45 days?",
  "expected": "Say refund window is 30 days and suggest support escalation.",
  "rubric": ["policy_correct", "no_hallucinated_exception", "polite"]
}
```
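
One way to keep a golden dataset honest is to validate it when it is loaded. The sketch below assumes the examples are stored one JSON object per line (JSONL), which the text above does not specify. The `EvalExample` class, the `load_examples` function, and the optional field names `edge_case`, `reference`, and `notes` are hypothetical names chosen to mirror the list above.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalExample:
    id: str
    input: str
    expected: str
    rubric: list[str]
    edge_case: Optional[str] = None   # hypothetical field: edge case label
    reference: Optional[str] = None   # hypothetical field: optional reference answer
    notes: str = ""                   # hypothetical field: why this example matters


REQUIRED_FIELDS = ("id", "input", "expected", "rubric")


def load_examples(jsonl_text: str) -> list[EvalExample]:
    """Parse JSONL text into examples, rejecting incomplete or duplicate records."""
    examples: list[EvalExample] = []
    seen_ids: set[str] = set()
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        if record["id"] in seen_ids:
            raise ValueError(f"line {line_no}: duplicate id {record['id']!r}")
        seen_ids.add(record["id"])
        # Unknown keys raise TypeError here, which surfaces typos in field names.
        examples.append(EvalExample(**record))
    return examples
```

Failing loudly on a duplicate ID or a missing rubric is a deliberate choice in this sketch: a silently malformed example skews every later comparison.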

Scoring methods

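| Method              | Best for                     | Weakness                                                      |
|---------------------|------------------------------|---------------------------------------------------------------|
| Exact match         | IDs, labels, extracted fields | too strict for prose                                          |
| Unit tests          | code generation              | misses UX quality                                             |
| Human review        | high-stakes decisions        | slow and expensive                                            |
| LLM-as-judge        | open-ended answers           | can be biased, e.g. toward longer answers or answer position  |
| Pairwise comparison | model/prompt A/B tests       | needs representative data                                     |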
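
As an illustration of the first and last rows, here is a small sketch of an exact-match scorer and a pairwise win-rate tally. The helper names (`normalize`, `exact_match`, `win_rate`) are hypothetical, and the normalization rule (trim, lowercase, collapse whitespace) is an assumption; stricter or looser rules may suit a given task better.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Trim, lowercase, and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.strip().lower().split())


def exact_match(predicted: str, expected: str) -> bool:
    """True if the two strings are identical after normalization."""
    return normalize(predicted) == normalize(expected)


def win_rate(preferences: list[str]) -> dict[str, float]:
    """Share of examples won by system A, by system B, or tied.

    Each entry in `preferences` must be "A", "B", or "tie".
    """
    if not preferences:
        raise ValueError("preferences must not be empty")
    counts = Counter(preferences)
    unknown = set(counts) - {"A", "B", "tie"}
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(preferences)
    return {label: counts[label] / total for label in ("A", "B", "tie")}


print(exact_match("  Refund-001 ", "refund-001"))  # True
print(win_rate(["A", "A", "B", "tie"]))  # {'A': 0.5, 'B': 0.25, 'tie': 0.25}
```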

LLM-as-judge tips

  • use a clear rubric
  • hide which system produced the answer
  • ask for a score and short rationale
  • calibrate against human labels
  • watch for verbosity and position bias
  • keep judge prompts versioned
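
Several of these tips can be combined in a small pairwise judging harness. The sketch below makes a number of assumptions: `judge` is a hypothetical callable that sends a prompt to whatever model you use and returns its text reply, the prompt wording and the `JUDGE_PROMPT_VERSION` tag are illustrative, and the reply is assumed to start with `1` or `2`. Shuffling the answer order hides which system produced each answer and spreads position bias across both systems.

```python
import random
from typing import Callable

JUDGE_PROMPT_VERSION = "pairwise-v1"  # hypothetical tag; store it with every score

JUDGE_TEMPLATE = """You are grading two answers to the same user question.
Rubric: {rubric}
Question: {question}

Answer 1:
{first}

Answer 2:
{second}

Reply with the better answer's number (1 or 2), then a one-sentence rationale."""


def build_blinded_prompt(question: str, answer_a: str, answer_b: str,
                         rubric: list[str], rng: random.Random) -> tuple[str, list[str]]:
    """Return the prompt plus the slot order, e.g. ["B", "A"] if B is shown first."""
    order = ["A", "B"]
    rng.shuffle(order)  # randomize position so the judge cannot favor one system by slot
    answers = {"A": answer_a, "B": answer_b}
    prompt = JUDGE_TEMPLATE.format(
        rubric=", ".join(rubric),
        question=question,
        first=answers[order[0]],
        second=answers[order[1]],
    )
    return prompt, order


def judge_pair(question: str, answer_a: str, answer_b: str, rubric: list[str],
               judge: Callable[[str], str], rng: random.Random) -> str:
    """Return "A" or "B" for the system the judge preferred."""
    prompt, order = build_blinded_prompt(question, answer_a, answer_b, rubric, rng)
    reply = judge(prompt).strip()
    if reply[:1] not in ("1", "2"):
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return order[int(reply[0]) - 1]  # map the chosen slot back to the system label


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of examples where the judge matches the human label."""
    if not human_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

A low `agreement_rate` on a human-labeled subset suggests the rubric or judge prompt needs revision before its scores are trusted.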

Regression testing workflow

```text
baseline prompt/model
        |
run eval set
        |
change prompt/model/retriever
        |
run eval set again
        |
compare quality, cost, latency, safety
```
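
The final comparison step can be sketched as a diff between two runs over the same eval set. Everything named here is hypothetical: the `RunResult` record, its fields (`passed`, `cost_usd`, `latency_s`, `unsafe`), and the `compare` function. It assumes each example has already been scored as pass or fail by one of the methods above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    example_id: str
    passed: bool       # quality: did the output satisfy the rubric?
    cost_usd: float    # cost of producing this output
    latency_s: float   # wall-clock time for this output
    unsafe: bool       # safety: did the output trip a safety check?


def summarize(results: list[RunResult]) -> dict[str, float]:
    """Average each tracked dimension over one run."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "pass_rate": sum(r.passed for r in results) / n,
        "mean_cost_usd": sum(r.cost_usd for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        "unsafe_rate": sum(r.unsafe for r in results) / n,
    }


def find_regressions(baseline: list[RunResult], candidate: list[RunResult]) -> list[str]:
    """IDs of examples that passed in the baseline run but fail in the candidate run."""
    candidate_by_id = {r.example_id: r for r in candidate}
    return sorted(
        r.example_id
        for r in baseline
        if r.passed
        and r.example_id in candidate_by_id
        and not candidate_by_id[r.example_id].passed
    )


def compare(baseline: list[RunResult], candidate: list[RunResult]) -> dict:
    """Summaries for both runs, per-metric deltas, and the regressed example IDs."""
    before, after = summarize(baseline), summarize(candidate)
    return {
        "baseline": before,
        "candidate": after,
        "delta": {name: after[name] - before[name] for name in before},
        "regressions": find_regressions(baseline, candidate),
    }
```

Listing the regressed IDs alongside the averages matters because an unchanged pass rate can hide one example that broke while another got fixed.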

Knowledge check

Q1: Why should evals come before fine-tuning?
Without evals, you cannot prove the fine-tune improved the task.

Q2: What is a golden dataset?
A representative set of examples with expected behavior and scoring rules.