RAG Evaluation for Production

A RAG system can fail even when the final answer sounds confident. Production evaluation must separate retrieval quality from generation quality.

The four RAG questions

Question	Metric examples
Did we retrieve the right evidence?	recall@k, MRR, nDCG
Did we avoid irrelevant evidence?	precision@k, context precision
Did the answer use the evidence?	groundedness, faithfulness
Did the user get a useful answer?	answer relevance, task success
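
The retrieval metrics in the first two rows can be computed directly from ranked document IDs. The following is a minimal Python sketch, assuming binary relevance labels and unique document IDs within each ranking. The function names are illustrative and do not come from any particular library.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Share of the relevant doc IDs that appear in the top-k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(relevant_set & set(retrieved[:k])) / len(relevant_set)


def precision_at_k(retrieved, relevant, k):
    """Share of the top-k slots filled by relevant doc IDs."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant_set)
    return hits / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """MRR over (retrieved, relevant) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, relevant, k):
    """nDCG@k with binary gains: 1 for a relevant doc, 0 otherwise."""
    relevant_set = set(relevant)
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant_set
    )
    ideal_hits = min(len(relevant_set), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


retrieved = ["a", "b", "c", "d"]
relevant = ["b", "d", "e"]
print(recall_at_k(retrieved, relevant, 3))     # 1 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, 3))  # 1 of 3 slots is relevant
print(reciprocal_rank(retrieved, relevant))    # first hit is at rank 2
```

Graded relevance labels would need a gain per document instead of the 0/1 gain used here.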

Create test cases

Include examples for:

direct answer in one document
answer spread across multiple documents
no answer in corpus
conflicting documents
stale policy
prompt injection inside a retrieved page
long document with answer in the middle
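
One way to keep this list honest is to tag each test case with its scenario and check that no scenario is left uncovered. The sketch below assumes a hypothetical `RagCase` record and hypothetical category labels; neither is a fixed schema.

```python
from dataclasses import dataclass, field

# Hypothetical category labels, one per scenario in the list above.
CATEGORIES = (
    "single_doc",
    "multi_doc",
    "no_answer",
    "conflicting_docs",
    "stale_policy",
    "prompt_injection",
    "long_doc_middle",
)


@dataclass
class RagCase:
    query: str
    category: str
    expected_doc_ids: list = field(default_factory=list)
    expect_abstain: bool = False  # True when the corpus holds no answer


def missing_categories(cases):
    """Return the scenario categories that no test case covers yet."""
    seen = {case.category for case in cases}
    unknown = seen - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [name for name in CATEGORIES if name not in seen]


cases = [
    RagCase("What is the enterprise refund window?", "single_doc",
            expected_doc_ids=["policy-refunds-2026"]),
    RagCase("Placeholder question with no supporting document", "no_answer",
            expect_abstain=True),
]
print(missing_categories(cases))  # categories that still need test cases
```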

Retrieval eval

{
  "query": "What is the enterprise refund window?",
  "expected_doc_ids": ["policy-refunds-2026"],
  "expected_source_spans": ["Refunds are available within 30 days..."]
}

Before generation starts, check that the retriever returns the expected document and that the expected source span appears in the retrieved text.
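
A minimal sketch of that check follows. It assumes the retriever returns a ranked list of chunks with hypothetical `doc_id` and `text` fields, and it treats a trailing "..." in an expected span as a truncation marker, so only the text before it has to match.

```python
import json

CASE_JSON = """
{
  "query": "What is the enterprise refund window?",
  "expected_doc_ids": ["policy-refunds-2026"],
  "expected_source_spans": ["Refunds are available within 30 days..."]
}
"""


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not matter."""
    return " ".join(text.lower().split())


def span_key(span):
    """Assumption: a trailing '...' marks a truncated span, so match on the prefix."""
    return normalize(span.rstrip(".… "))


def evaluate_retrieval(case, chunks, k=5):
    """Check expected docs and spans against the top-k retrieved chunks."""
    top = chunks[:k]
    found_ids = {chunk["doc_id"] for chunk in top}
    texts = [normalize(chunk["text"]) for chunk in top]
    return {
        "doc_hit": all(d in found_ids for d in case["expected_doc_ids"]),
        "span_hit": all(
            any(span_key(span) in text for text in texts)
            for span in case["expected_source_spans"]
        ),
    }


case = json.loads(CASE_JSON)
# Stand-in retriever output; the chunk text reuses the span from the test case.
chunks = [
    {"doc_id": "pricing-overview", "text": "Placeholder text about pricing."},
    {"doc_id": "policy-refunds-2026", "text": "Refunds are available within 30 days..."},
]
print(evaluate_retrieval(case, chunks))       # both checks pass at k=5
print(evaluate_retrieval(case, chunks, k=1))  # both fail when only rank 1 is kept
```

Substring matching is deliberately strict. If chunking splits a span across two chunks, this check reports a miss, which may or may not be the behavior you want.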

Generation eval

Use a rubric such as:

Criterion	Pass condition
grounded	every factual claim is supported by retrieved context
complete	answers every part of the user question
calibrated	states when evidence is missing or insufficient
cited	cites the retrieved sources that support its claims
safe	ignores malicious instructions in retrieved docs
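
The rubric can be recorded as one pass/fail judgment per criterion and then aggregated. The sketch below assumes the judgments already exist, whether they come from a human reviewer or an automated judge; it only combines them. `score_answer` and `pass_rates` are illustrative names.

```python
RUBRIC = ("grounded", "complete", "calibrated", "cited", "safe")


def score_answer(judgments):
    """Combine per-criterion pass/fail judgments into one verdict."""
    missing = [name for name in RUBRIC if name not in judgments]
    if missing:
        raise ValueError(f"missing judgments: {missing}")
    failed = [name for name in RUBRIC if not judgments[name]]
    return {"passed": not failed, "failed": failed}


def pass_rates(all_judgments):
    """Per-criterion pass rate across a batch of judged answers."""
    total = len(all_judgments)
    if total == 0:
        return {name: 0.0 for name in RUBRIC}
    return {
        name: sum(1 for j in all_judgments if j[name]) / total
        for name in RUBRIC
    }


# Illustrative judgments for two answers, not real evaluation results.
batch = [
    {"grounded": True, "complete": True, "calibrated": True, "cited": True, "safe": True},
    {"grounded": True, "complete": True, "calibrated": True, "cited": False, "safe": True},
]
print(score_answer(batch[1]))      # {'passed': False, 'failed': ['cited']}
print(pass_rates(batch)["cited"])  # 0.5
```

Reporting the per-criterion rates alongside the overall verdict shows which criterion is driving failures.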

Production monitoring

Track:

no-answer rate
most frequently retrieved sources
retrieval latency
answer latency
citation click-through
user thumbs up/down
schema validation failures
prompt injection detections
drift in embedding model or corpus
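
Several of these signals can be derived from per-request traces. The sketch below assumes each trace is a dict with hypothetical fields (`no_answer`, `retrieval_ms`, `answer_ms`, `feedback`, `source_ids`); real logging schemas will differ, and the sample traces are made-up values for illustration only.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(traces, top_n=3):
    """Aggregate a batch of request traces into dashboard-style numbers."""
    if not traces:
        raise ValueError("summarize needs at least one trace")
    total = len(traces)
    rated = [t["feedback"] for t in traces if t.get("feedback") in ("up", "down")]
    sources = Counter(doc_id for t in traces for doc_id in t["source_ids"])
    return {
        "no_answer_rate": sum(1 for t in traces if t["no_answer"]) / total,
        "retrieval_p95_ms": percentile([t["retrieval_ms"] for t in traces], 95),
        "answer_p95_ms": percentile([t["answer_ms"] for t in traces], 95),
        "thumbs_up_rate": rated.count("up") / len(rated) if rated else None,
        "top_sources": sources.most_common(top_n),
    }


# Made-up traces for illustration.
traces = [
    {"no_answer": False, "retrieval_ms": 40, "answer_ms": 900,
     "feedback": "up", "source_ids": ["policy-refunds-2026"]},
    {"no_answer": True, "retrieval_ms": 55, "answer_ms": 300,
     "feedback": None, "source_ids": []},
    {"no_answer": False, "retrieval_ms": 35, "answer_ms": 1200,
     "feedback": "down", "source_ids": ["policy-refunds-2026", "pricing-overview"]},
    {"no_answer": False, "retrieval_ms": 60, "answer_ms": 800,
     "feedback": "up", "source_ids": ["pricing-overview"]},
]
print(summarize(traces))
```

Tail percentiles are used here instead of averages because a few slow requests can hide behind a healthy mean.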

Release checklist

Run offline evals before deploying prompt or retriever changes.
Route a small share of canary traffic to the change before full rollout.
Compare quality, cost, latency, and no-answer rate against the current production baseline.
Keep rollback paths for prompts, models, embeddings, and indexes.
Review failed traces weekly.
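
The comparison step in this checklist can be turned into a simple release gate. The sketch below assumes absolute tolerances per metric and uses made-up numbers; the metric names and thresholds are illustrative and should be replaced with your own.

```python
# Metrics where a larger value is an improvement; all others are lower-is-better.
HIGHER_IS_BETTER = {"quality"}


def regressions(baseline, candidate, tolerance):
    """Return metrics where the candidate is worse than baseline by more than allowed."""
    failures = []
    for name, allowed in tolerance.items():
        if name in HIGHER_IS_BETTER:
            worse_by = baseline[name] - candidate[name]
        else:
            worse_by = candidate[name] - baseline[name]
        if worse_by > allowed:
            failures.append(name)
    return failures


# Illustrative values only, not measured results.
baseline = {"quality": 0.86, "cost_usd": 0.010, "p95_latency_ms": 900, "no_answer_rate": 0.05}
candidate = {"quality": 0.84, "cost_usd": 0.011, "p95_latency_ms": 950, "no_answer_rate": 0.06}
tolerance = {"quality": 0.01, "cost_usd": 0.002, "p95_latency_ms": 100, "no_answer_rate": 0.02}

failed = regressions(baseline, candidate, tolerance)
print("block release" if failed else "ok to ship", failed)  # block release ['quality']
```

A failed gate is the point at which the rollback paths listed above come into play.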

Knowledge check

Q1: Why evaluate retrieval separately from answer quality?
Because a bad answer can come from missing evidence or from poor generation after good evidence was retrieved. Measuring each stage separately shows which one to fix.

Q2: What should a RAG system do when evidence is missing?
It should say that it cannot answer from the available sources, then ask for more information or escalate.