RAG Evaluation for Production

Measure retrieval quality, answer faithfulness, citation quality, and production drift

30 min read · RAG · evaluation · groundedness · retrieval

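A RAG system can fail even when the final answer sounds confident: a fluent response can hide missing evidence, or a model that ignored good evidence. Production evaluation must therefore measure retrieval quality and generation quality separately.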

The four RAG questions

| Question | Metric examples |
| --- | --- |
| Did we retrieve the right evidence? | recall@k, MRR, nDCG |
| Did we avoid irrelevant evidence? | precision@k, context precision |
| Did the answer use the evidence? | groundedness, faithfulness |
| Did the user get a useful answer? | answer relevance, task success |
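
The ranking metrics named in the first two rows can be computed from a ranked list of retrieved document IDs and a set of relevant IDs. The following is a minimal sketch that assumes binary relevance (a document is either relevant or not); the function names and the example IDs other than `policy-refunds-2026` are illustrative, not part of any particular library.

```python
import math
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant documents.

    Divides by k even when fewer than k results came back (a common convention).
    """
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains: discounted hits divided by the best possible score."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking: the expected document is returned in second place.
retrieved = ["faq-billing", "policy-refunds-2026", "blog-launch"]
relevant = {"policy-refunds-2026"}
print(recall_at_k(retrieved, relevant, 3))   # 1.0
print(reciprocal_rank(retrieved, relevant))  # 0.5
```

MRR is the mean of `reciprocal_rank` over all queries in the test set. Recall and nDCG here are per-query values and are usually averaged the same way.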

Create test cases

Include examples for:

  • direct answer in one document
  • answer spread across multiple documents
  • no answer in corpus
  • conflicting documents
  • stale policy
  • prompt injection inside a retrieved page
  • long document with answer in the middle
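
One way to keep the suite honest is to tag each test case with one of the scenarios above and check that none is missing. This is a sketch only: the category labels, the `EvalCase` type, and the `missing_categories` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical labels, one per scenario in the list above.
CATEGORIES = {
    "single_doc",
    "multi_doc",
    "no_answer",
    "conflicting_docs",
    "stale_policy",
    "prompt_injection",
    "long_doc_middle",
}


@dataclass
class EvalCase:
    query: str
    category: str
    # Empty for "no_answer" cases: the correct behavior is to abstain.
    expected_doc_ids: List[str] = field(default_factory=list)


def missing_categories(cases: List[EvalCase]) -> List[str]:
    """Return the scenario labels that no test case covers, sorted by name."""
    seen = set()
    for case in cases:
        if case.category not in CATEGORIES:
            raise ValueError(f"unknown category: {case.category!r}")
        seen.add(case.category)
    return sorted(CATEGORIES - seen)
```

Running `missing_categories` in CI before an eval run is a cheap guard against a suite that only contains easy single-document questions.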

Retrieval eval

```json
{
  "query": "What is the enterprise refund window?",
  "expected_doc_ids": ["policy-refunds-2026"],
  "expected_source_spans": ["Refunds are available within 30 days..."]
}
```

Measure whether the retriever finds the expected document and source span before generation starts.
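
A minimal sketch of that check, assuming the retriever returns a ranked list of chunks as dictionaries with `doc_id` and `text` keys (that chunk shape, the `evaluate_retrieval` name, and the `faq-billing` chunk are assumptions for illustration). Span matching is a case-insensitive, whitespace-normalized substring test, which is deliberately strict; fuzzy or token-overlap matching may suit paraphrased corpora better.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not cause misses."""
    return " ".join(text.lower().split())


def evaluate_retrieval(case: Dict, retrieved: List[Dict], k: int = 5) -> Dict[str, bool]:
    """Check the top k chunks for every expected document ID and source span."""
    top = retrieved[:k]
    top_ids = {chunk["doc_id"] for chunk in top}
    texts = [normalize(chunk["text"]) for chunk in top]
    doc_hit = all(doc_id in top_ids for doc_id in case["expected_doc_ids"])
    span_hit = all(
        any(normalize(span) in text for text in texts)
        for span in case["expected_source_spans"]
    )
    return {"doc_hit": doc_hit, "span_hit": span_hit}


case = {
    "query": "What is the enterprise refund window?",
    "expected_doc_ids": ["policy-refunds-2026"],
    "expected_source_spans": ["Refunds are available within 30 days..."],
}
# Hypothetical retriever output; the second chunk text reuses the span verbatim.
retrieved = [
    {"doc_id": "faq-billing", "text": "Billing FAQ"},
    {"doc_id": "policy-refunds-2026", "text": "Refund policy.  " + case["expected_source_spans"][0]},
]
print(evaluate_retrieval(case, retrieved, k=5))  # {'doc_hit': True, 'span_hit': True}
print(evaluate_retrieval(case, retrieved, k=1))  # {'doc_hit': False, 'span_hit': False}
```

Reporting `doc_hit` and `span_hit` separately distinguishes "right document, wrong chunk" from "document never retrieved", which usually point to different fixes (chunking versus indexing or query handling).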

Generation eval

Use a rubric such as:

| Criterion | Pass condition |
| --- | --- |
| grounded | every factual claim is supported by retrieved context |
| complete | answers the user question |
| calibrated | says when evidence is missing |
| cited | includes relevant citations |
| safe | ignores malicious instructions in retrieved docs |
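
However the verdicts are produced (human review or an LLM judge), the rubric can be aggregated with a few lines of code. The sketch below assumes each graded answer arrives as a dictionary mapping criterion name to a boolean; `score_answer` and `pass_rates` are hypothetical helper names, and the judging step itself is out of scope here.

```python
from typing import Dict, List

CRITERIA = ("grounded", "complete", "calibrated", "cited", "safe")


def score_answer(verdicts: Dict[str, bool]) -> Dict:
    """An answer passes only if every rubric criterion passes."""
    missing = [c for c in CRITERIA if c not in verdicts]
    if missing:
        raise ValueError(f"missing verdicts for: {missing}")
    failed = [c for c in CRITERIA if not verdicts[c]]
    return {"passed": not failed, "failed_criteria": failed}


def pass_rates(all_verdicts: List[Dict[str, bool]]) -> Dict[str, float]:
    """Per-criterion pass rate across a batch of graded answers."""
    total = len(all_verdicts)
    if total == 0:
        return {c: 0.0 for c in CRITERIA}
    return {c: sum(v[c] for v in all_verdicts) / total for c in CRITERIA}


graded = [
    {"grounded": True, "complete": True, "calibrated": True, "cited": False, "safe": True},
    {"grounded": True, "complete": True, "calibrated": True, "cited": True, "safe": True},
]
print(score_answer(graded[0]))      # {'passed': False, 'failed_criteria': ['cited']}
print(pass_rates(graded)["cited"])  # 0.5
```

Tracking the per-criterion rates rather than a single pass rate shows which failure mode a prompt or model change actually moved.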

Production monitoring

Track:

  • no-answer rate
  • top retrieved sources
  • retrieval latency
  • answer latency
  • citation click-through
  • user thumbs up/down
  • schema validation failures
  • prompt injection detections
  • drift in embedding model or corpus
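
Several of these signals can be derived from logged request traces. The following is a sketch under stated assumptions: the `Trace` fields, the nearest-rank percentile, and the fixed-tolerance `shifted` check are illustrative choices, not a prescribed monitoring design.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Trace:
    no_answer: bool                   # the system declined to answer
    retrieval_ms: float               # retrieval latency for this request
    thumbs_up: Optional[bool] = None  # None when the user gave no feedback


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100] and values is non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(traces: List[Trace]) -> Dict[str, float]:
    """Aggregate one monitoring window into a few headline numbers."""
    if not traces:
        raise ValueError("no traces to summarize")
    rated = [t.thumbs_up for t in traces if t.thumbs_up is not None]
    return {
        "no_answer_rate": sum(t.no_answer for t in traces) / len(traces),
        "retrieval_p95_ms": percentile([t.retrieval_ms for t in traces], 95),
        "thumbs_up_rate": sum(rated) / len(rated) if rated else float("nan"),
    }


def shifted(current: float, baseline: float, tolerance: float) -> bool:
    """Flag a metric that moved more than `tolerance` away from its baseline."""
    return abs(current - baseline) > tolerance
```

A sudden change in no-answer rate after an embedding model or corpus update is one practical symptom of the drift mentioned in the last bullet, so comparing each window against a stored baseline with `shifted` is a reasonable first alert.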

Release checklist

  1. Run offline evals before deploying prompt or retriever changes.
  2. Test with canary traffic.
  3. Compare quality, cost, latency, and no-answer rate.
  4. Keep rollback paths for prompts, models, embeddings, and indexes.
  5. Review failed traces weekly.
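
Step 3 can be automated as a release gate that compares a candidate's metrics with the current baseline. This is a minimal sketch: the metric names and every threshold below are hypothetical placeholders to be replaced with limits agreed for the product.

```python
from typing import Dict, List

# Hypothetical limits; tune them to the product's tolerance.
MAX_QUALITY_DROP = 0.02             # absolute drop in offline eval score
MAX_RELATIVE_INCREASE = {           # allowed relative increase per metric
    "cost_usd": 0.10,
    "latency_p95_ms": 0.10,
}
MAX_NO_ANSWER_RATE_INCREASE = 0.03  # absolute increase


def release_blockers(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Return the names of metrics that regressed beyond their limits."""
    blockers = []
    if baseline["quality"] - candidate["quality"] > MAX_QUALITY_DROP:
        blockers.append("quality")
    for metric, limit in MAX_RELATIVE_INCREASE.items():
        if candidate[metric] > baseline[metric] * (1 + limit):
            blockers.append(metric)
    if candidate["no_answer_rate"] - baseline["no_answer_rate"] > MAX_NO_ANSWER_RATE_INCREASE:
        blockers.append("no_answer_rate")
    return blockers


# Illustrative numbers only, not measurements.
baseline = {"quality": 0.90, "cost_usd": 0.010, "latency_p95_ms": 1200, "no_answer_rate": 0.08}
candidate = {"quality": 0.91, "cost_usd": 0.012, "latency_p95_ms": 1250, "no_answer_rate": 0.09}
print(release_blockers(baseline, candidate))  # ['cost_usd']
```

An empty list means the candidate can proceed to canary traffic. A non-empty list names what to investigate or roll back.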

Knowledge check

Q1: Why evaluate retrieval separately from answer quality?
Because a bad answer can be caused by missing evidence or by poor generation after good evidence was retrieved.

Q2: What should a RAG system do when evidence is missing?
Say it cannot answer from the available sources and ask for more information or escalate.