RAG Evaluation for Production
A RAG system can fail even when the final answer sounds confident. Production evaluation must separate retrieval quality from generation quality.
The four RAG questions
| Question | Metric examples |
|---|---|
| Did we retrieve the right evidence? | recall@k, MRR, nDCG |
| Did we avoid irrelevant evidence? | precision@k, context precision |
| Did the answer use the evidence? | groundedness, faithfulness |
| Did the user get a useful answer? | answer relevance, task success |
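The ranking metrics in the first two rows can be computed from an ordered list of retrieved document IDs and a set of relevant IDs. The following is a minimal sketch, assuming binary relevance (a document is either relevant or not) and unique IDs in the ranked list; the function names are illustrative and not taken from any particular library. MRR is the mean of `reciprocal_rank` over all test queries.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant doc IDs that appear in the top k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k slots filled by relevant docs (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant_set) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the ideal ordering's gain."""
    relevant_set = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant_set
    )
    ideal_hits = min(len(relevant_set), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall-oriented metrics answer the first question in the table, while precision-oriented metrics answer the second, so it usually helps to report at least one of each.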
Create test cases
Include examples for:
- direct answer in one document
- answer spread across multiple documents
- no answer in corpus
- conflicting documents
- stale policy
- prompt injection inside a retrieved page
- long document with answer in the middle
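One way to keep a test set honest is to tag each case with the scenario it covers and check that no scenario is missing. The sketch below assumes a hypothetical `EvalCase` record and hypothetical category labels that mirror the list above; neither comes from a specific framework.

```python
from dataclasses import dataclass, field

# Hypothetical labels, one per scenario in the list above.
REQUIRED_CATEGORIES = frozenset({
    "single_doc",
    "multi_doc",
    "no_answer",
    "conflicting_docs",
    "stale_policy",
    "prompt_injection",
    "long_doc_middle",
})


@dataclass
class EvalCase:
    query: str
    category: str
    # Empty for "no_answer" cases, where nothing in the corpus should match.
    expected_doc_ids: list = field(default_factory=list)


def missing_categories(cases):
    """Return the scenario labels that no test case covers yet."""
    covered = {case.category for case in cases}
    unknown = covered - REQUIRED_CATEGORIES
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return set(REQUIRED_CATEGORIES - covered)
```

Running `missing_categories` in CI is one option for catching a test set that has quietly drifted toward only the easy single-document cases.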
Retrieval eval
```json
{
  "query": "What is the enterprise refund window?",
  "expected_doc_ids": ["policy-refunds-2026"],
  "expected_source_spans": ["Refunds are available within 30 days..."]
}
```
Before evaluating generation, measure whether the retriever returns the expected document and the expected source span.
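A retrieval check against a test case like the one above might look like the following sketch. It assumes each retrieved chunk is a dict with `doc_id` and `text` keys (a hypothetical shape, not a specific retriever's output), treats a trailing `...` in an expected span as an elision marker, and matches spans by case-insensitive, whitespace-normalized substring search.

```python
def _normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())


def span_found(span, chunks):
    """True if the span (minus a trailing '...') occurs in any chunk's text."""
    needle = _normalize(span.removesuffix("..."))
    return any(needle in _normalize(chunk["text"]) for chunk in chunks)


def evaluate_retrieval(case, retrieved_chunks, k=5):
    """Check the top k chunks for every expected doc ID and source span.

    `retrieved_chunks` is assumed to be in rank order, best match first.
    """
    top_k = retrieved_chunks[:k]
    retrieved_ids = {chunk["doc_id"] for chunk in top_k}
    return {
        "doc_hit": all(d in retrieved_ids for d in case["expected_doc_ids"]),
        "span_hit": all(span_found(s, top_k) for s in case["expected_source_spans"]),
    }
```

Reporting `doc_hit` and `span_hit` separately helps distinguish a retriever that finds the right document but the wrong chunk from one that misses the document entirely.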
Generation eval
Use a rubric such as:
| Criterion | Pass condition |
|---|---|
| grounded | every factual claim is supported by retrieved context |
| complete | answers every part of the user's question |
| calibrated | says when evidence is missing |
| cited | cites the retrieved sources that support its claims |
| safe | ignores malicious instructions in retrieved docs |
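Once a human reviewer or an LLM judge has produced a pass/fail judgment per criterion, the rubric can be aggregated mechanically. This sketch assumes judgments arrive as a dict of booleans keyed by the criterion names in the table; how those judgments are produced is out of scope here, and the function names are illustrative.

```python
RUBRIC = ("grounded", "complete", "calibrated", "cited", "safe")


def score_answer(judgments):
    """Pass only if every rubric criterion passed; list the ones that failed."""
    missing = [c for c in RUBRIC if c not in judgments]
    if missing:
        raise ValueError(f"missing judgments for: {missing}")
    failed = [c for c in RUBRIC if not judgments[c]]
    return {"passed": not failed, "failed_criteria": failed}


def pass_rates(all_judgments):
    """Per-criterion pass rate across a batch of judged answers."""
    total = len(all_judgments)
    if total == 0:
        return {}
    return {
        criterion: sum(1 for j in all_judgments if j[criterion]) / total
        for criterion in RUBRIC
    }
```

Tracking per-criterion rates rather than a single overall score makes it easier to see whether a regression comes from, say, missing citations or from ungrounded claims.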
Production monitoring
Track:
- no-answer rate
- top retrieved sources
- retrieval latency
- answer latency
- citation click-through
- user thumbs up/down
- schema validation failures
- prompt injection detections
- drift in embedding model or corpus
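Several of these signals can be derived from per-request trace records. The sketch below assumes a hypothetical trace shape (a dict with `no_answer`, `schema_valid`, `retrieval_ms`, `answer_ms`, and an optional `thumbs` field set to `"up"` or `"down"`) and uses a simple nearest-rank percentile; a real deployment would more likely rely on its metrics backend for this.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_traces(traces):
    """Aggregate a batch of trace dicts into a few monitoring numbers."""
    total = len(traces)
    if total == 0:
        return {}
    # Only traces where the user actually left feedback count toward the rate.
    rated = [t for t in traces if t.get("thumbs") is not None]
    ups = sum(1 for t in rated if t["thumbs"] == "up")
    return {
        "no_answer_rate": sum(1 for t in traces if t["no_answer"]) / total,
        "schema_failure_rate": sum(1 for t in traces if not t["schema_valid"]) / total,
        "p95_retrieval_ms": percentile([t["retrieval_ms"] for t in traces], 95),
        "p95_answer_ms": percentile([t["answer_ms"] for t in traces], 95),
        "thumbs_up_rate": ups / len(rated) if rated else None,
    }
```

Tail latency (p95 here) tends to be more informative than the mean for user-facing systems, though the right percentile is a judgment call.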
Release checklist
- Run offline evals before deploying prompt or retriever changes.
- Test with canary traffic.
- Compare quality, cost, latency, and no-answer rate against the current production baseline.
- Keep rollback paths for prompts, models, embeddings, and indexes.
- Review failed traces weekly.
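The comparison step in this checklist can be turned into an automated gate. The following sketch is one possible shape: the metric names and tolerances in `GATES` are hypothetical placeholders to be tuned per system, and no-answer rate is treated as a metric that should stay stable, since a large move in either direction can signal a problem.

```python
import math

# Hypothetical gates: (direction that counts as better, allowed relative change).
GATES = {
    "quality": ("higher", 0.02),
    "cost_usd": ("lower", 0.10),
    "p95_latency_ms": ("lower", 0.10),
    "no_answer_rate": ("stable", 0.20),
}


def release_blockers(baseline, candidate, gates=GATES):
    """Return the metrics where the candidate moves beyond its tolerance."""
    blockers = []
    for metric, (direction, tolerance) in gates.items():
        base, cand = baseline[metric], candidate[metric]
        if base == 0:
            # Relative change is undefined at zero; treat any move as infinite.
            change = 0.0 if cand == 0 else math.copysign(math.inf, cand)
        else:
            change = (cand - base) / base
        regressed = (
            (direction == "higher" and change < -tolerance)
            or (direction == "lower" and change > tolerance)
            or (direction == "stable" and abs(change) > tolerance)
        )
        if regressed:
            blockers.append(metric)
    return blockers
```

An empty list means the candidate is within tolerance on every gated metric; anything else is a reason to hold the release or roll back.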
Knowledge check
Q1: Why evaluate retrieval separately from answer quality?
Because a bad answer can be caused by missing evidence or by poor generation after good evidence was retrieved.
Q2: What should a RAG system do when evidence is missing?
Say it cannot answer from the available sources and ask for more information or escalate.
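The behavior in Q2 can be enforced in code before the generator is ever called. This is a minimal sketch, assuming each retrieved chunk carries a hypothetical `score` field and that a fixed `min_score` threshold is a reasonable proxy for "evidence is present"; in practice retrieval scores are often uncalibrated, so the threshold would need tuning against the no-answer test cases. The `generate` callable stands in for whatever model call the system uses.

```python
NO_ANSWER_MESSAGE = (
    "I can't answer that from the available sources. "
    "Could you share more detail, or would you like this escalated?"
)


def answer_or_abstain(query, chunks, generate, min_score=0.5):
    """Call the generator only when some chunk clears the score threshold."""
    evidence = [chunk for chunk in chunks if chunk["score"] >= min_score]
    if not evidence:
        # No usable evidence: abstain instead of letting the model guess.
        return {"answered": False, "text": NO_ANSWER_MESSAGE, "sources": []}
    return {
        "answered": True,
        "text": generate(query, evidence),
        "sources": [chunk["doc_id"] for chunk in evidence],
    }
```

Returning an explicit `answered` flag also gives monitoring a direct way to count no-answer responses.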