Every production system I maintain has an eval harness that runs before every deployment. Not after. Before. The cost of catching a retrieval regression in staging is minutes. The cost of catching it from a user complaint is days of debugging and lost trust.
This lesson covers how to build an evaluation system that measures retrieval quality, answer quality, and faithfulness --- and how to run it as part of your CI/CD pipeline.
Most teams evaluate only the final answer. This is like testing a car by checking if it reaches the destination without checking the engine, brakes, or steering. You need to evaluate each layer independently:
Layer 1: Retrieval Quality
"Did we find the right documents?"
Metrics: Recall@K, Precision@K, MRR, NDCG
Layer 2: Context Quality
"Is the assembled context faithful and relevant?"
Metrics: Context Precision, Context Recall, Noise Ratio
Layer 3: Answer Quality
"Is the final answer correct, complete, and grounded?"
Metrics: Faithfulness, Answer Relevancy, Correctness
When a user reports a bad answer, the eval harness tells you which layer failed. Without it, you guess --- and you guess wrong most of the time.
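This diagnosis can be made mechanical by encoding per-layer thresholds and walking them in order. A minimal sketch; the metric names and thresholds here are illustrative assumptions, not tied to any particular library:

```python
# Illustrative sketch: map per-layer metric scores to the first failing layer.
# Metric names and thresholds are assumptions; use whatever your harness emits.
LAYER_GATES = {
    "retrieval": {"recall_at_k": 0.85, "mrr": 0.70},
    "context": {"context_precision": 0.75, "context_recall": 0.75},
    "answer": {"faithfulness": 0.80, "answer_relevancy": 0.70},
}

def first_failing_layer(scores):
    """Return (layer, metric) for the first gate a score falls below, or (None, None)."""
    for layer, gates in LAYER_GATES.items():
        for metric, minimum in gates.items():
            if scores.get(metric, 0.0) < minimum:
                return layer, metric
    return None, None
```

A run where retrieval is healthy but context precision is low points you straight at reranking or context assembly, not at the prompt.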
You need a golden dataset: a set of queries paired with their correct documents. Start with 50--100 pairs and grow over time.
golden_dataset.json:

```json
[
  {
    "query": "What is the refund policy for digital products?",
    "relevant_chunk_ids": ["chunk_4a2f", "chunk_8b1c"],
    "irrelevant_chunk_ids": ["chunk_9d3e"],
    "expected_answer_contains": ["store credit", "digital products"]
  },
  {
    "query": "How long does international shipping take?",
    "relevant_chunk_ids": ["chunk_2e7a"],
    "irrelevant_chunk_ids": ["chunk_5f1b"],
    "expected_answer_contains": ["7-14 business days"]
  }
]
```

The `irrelevant_chunk_ids` are hard negatives: chunks that look similar to the query but must not be retrieved.
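Because every downstream metric depends on this file being right, it is worth validating entries before each eval run. A small sketch (field names follow the example above) that fails fast on a malformed entry instead of silently skewing metrics:

```python
# Sketch: validate golden dataset entries before an eval run.
# Field names match the golden_dataset.json example above.
REQUIRED_FIELDS = {"query", "relevant_chunk_ids", "expected_answer_contains"}

def validate_entry(entry):
    """Return a list of problems with one dataset entry (empty list means valid)."""
    problems = []
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if not entry.get("relevant_chunk_ids"):
        problems.append("no relevant chunks listed")
    overlap = set(entry.get("relevant_chunk_ids", [])) & set(entry.get("irrelevant_chunk_ids", []))
    if overlap:
        problems.append(f"chunks marked both relevant and irrelevant: {sorted(overlap)}")
    return problems
```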
Where to get golden data: mine your production query logs and have a domain expert mark the correct chunks; turn every user-reported bad answer into a new pair; and collect hard negatives from chunks that look relevant but are not. With golden data in hand, retrieval metrics are straightforward to compute:
```python
import math

def evaluate_retrieval(golden_dataset, retriever, k=5):
    metrics = {
        "recall_at_k": [],
        "precision_at_k": [],
        "mrr": [],
        "ndcg": []
    }
    for item in golden_dataset:
        retrieved = retriever.search(item["query"], top_k=k)
        retrieved_ids = [r.id for r in retrieved]
        relevant_ids = set(item["relevant_chunk_ids"])

        # Recall@K: fraction of relevant docs found in the top K
        hits = len(set(retrieved_ids) & relevant_ids)
        recall = hits / len(relevant_ids) if relevant_ids else 0.0
        metrics["recall_at_k"].append(recall)

        # Precision@K: fraction of retrieved docs that are relevant
        metrics["precision_at_k"].append(hits / k)

        # MRR: reciprocal rank of the first relevant result
        mrr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_ids:
                mrr = 1.0 / rank
                break
        metrics["mrr"].append(mrr)

        # NDCG@K: discounted cumulative gain, normalized by the ideal ranking
        dcg = sum(
            (1.0 if doc_id in relevant_ids else 0.0) / math.log2(rank + 1)
            for rank, doc_id in enumerate(retrieved_ids, 1)
        )
        ideal_hits = min(len(relevant_ids), k)  # the ideal ranking holds at most K docs
        idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
        metrics["ndcg"].append(dcg / idcg if idcg > 0 else 0.0)

    return {name: sum(vals) / len(vals) for name, vals in metrics.items()}
```
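To sanity-check the arithmetic, here is one query worked by hand (the chunk IDs are toy values, not from the dataset above):

```python
import math

# Toy worked example: 5 retrieved chunks, 2 of which are relevant.
retrieved = ["a", "x", "b", "y", "z"]   # ranks 1..5
relevant = {"a", "b"}                   # relevant chunks sit at ranks 1 and 3

recall = len(set(retrieved) & relevant) / len(relevant)      # 2/2 = 1.0
precision = len(set(retrieved) & relevant) / len(retrieved)  # 2/5 = 0.4
mrr = 1.0 / 1                                                # first hit at rank 1

dcg = sum((1.0 if d in relevant else 0.0) / math.log2(r + 1)
          for r, d in enumerate(retrieved, 1))               # 1/log2(2) + 1/log2(4) = 1.5
idcg = sum(1.0 / math.log2(i + 1)
           for i in range(1, len(relevant) + 1))             # 1 + 1/log2(3)
ndcg = dcg / idcg                                            # roughly 0.92
```

Note how NDCG penalizes the relevant chunk sitting at rank 3: recall is perfect, but the ranking is not.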
What to target: as a starting point, recall@5 of 0.85 or higher, MRR of 0.70 or higher, and faithfulness of 0.80 or higher (the same thresholds the CI quality gate later in this lesson enforces). These targets vary by domain. For a medical system, I push recall@5 above 0.95. For a general customer support bot, 0.80 may be acceptable.
RAGAS (Retrieval Augmented Generation Assessment) provides LLM-judged metrics, several of which, such as faithfulness and answer relevancy, are reference-free and do not require ground truth answers. This makes it useful for evaluating at scale.
```python
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Prepare the evaluation dataset
eval_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}
for item in test_queries:
    # Run your full RAG pipeline
    result = rag_pipeline(item["query"])
    eval_data["question"].append(item["query"])
    eval_data["answer"].append(result["answer"])
    eval_data["contexts"].append(result["retrieved_chunks"])  # list of strings
    eval_data["ground_truth"].append(item.get("expected_answer", ""))

dataset = Dataset.from_dict(eval_data)

# Run the RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[
        context_precision,   # Are retrieved chunks relevant?
        context_recall,      # Did we retrieve all needed info?
        faithfulness,        # Is the answer supported by context?
        answer_relevancy,    # Does the answer address the question?
    ],
)

print(results)
# Example output:
# {'context_precision': 0.82, 'context_recall': 0.78,
#  'faithfulness': 0.91, 'answer_relevancy': 0.87}
```
What each metric tells you:
- Context precision: are the retrieved chunks relevant to the question? Low scores implicate retrieval or reranking.
- Context recall: did the retrieved context contain all the information needed? Low scores implicate chunking or retrieval recall.
- Faithfulness: is the answer supported by the retrieved context? Low scores mean the model is hallucinating past its context.
- Answer relevancy: does the answer actually address the question? Low scores implicate the generation prompt.
DeepEval integrates with pytest, letting you write retrieval quality tests like unit tests:
```python
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
)

def test_refund_policy_query():
    """Test that refund policy queries return accurate, cited answers."""
    result = rag_pipeline("What is the refund policy for digital products?")
    test_case = LLMTestCase(
        input="What is the refund policy for digital products?",
        actual_output=result["answer"],
        # ContextualPrecisionMetric also needs a reference answer to rank against
        expected_output="Digital products are refunded as store credit.",
        retrieval_context=result["retrieved_chunks"],
    )
    faithfulness = FaithfulnessMetric(threshold=0.8)
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    precision = ContextualPrecisionMetric(threshold=0.7)
    assert_test(test_case, [faithfulness, relevancy, precision])

def test_no_hallucination_on_unknown():
    """Test that the system admits when it does not know."""
    result = rag_pipeline("What is the company's policy on teleportation?")
    test_case = LLMTestCase(
        input="What is the company's policy on teleportation?",
        actual_output=result["answer"],
        retrieval_context=result["retrieved_chunks"],
    )
    # Faithfulness should be high (answer grounded in context, or a refusal)
    faithfulness = FaithfulnessMetric(threshold=0.9)
    assert_test(test_case, [faithfulness])
```
Run as part of CI:

```shell
# In your CI/CD pipeline
deepeval test run tests/test_rag_quality.py
```
Here is the pipeline I use:

```yaml
# .github/workflows/rag-eval.yml
name: RAG Quality Gate
on:
  pull_request:
    paths:
      - 'src/rag/**'
      - 'prompts/**'
      - 'config/chunking.yaml'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run retrieval eval
        run: |
          python eval/run_retrieval_eval.py \
            --dataset eval/golden_dataset.json \
            --output eval/results.json
      - name: Check quality gates
        run: |
          python eval/check_gates.py \
            --results eval/results.json \
            --min-recall 0.85 \
            --min-mrr 0.70 \
            --min-faithfulness 0.80
      - name: Run DeepEval tests
        run: deepeval test run tests/test_rag_quality.py
```
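The workflow calls eval/check_gates.py without showing it. A minimal sketch of what it could look like; the file layout, metric keys, and flag names are assumptions inferred from the workflow, not a published script:

```python
# Hypothetical sketch of eval/check_gates.py: fail the build when any metric
# in results.json falls below its gate. Flag names mirror the workflow above.
import argparse
import json

def check_gates(results, gates):
    """Return human-readable failures for every metric below its threshold."""
    return [
        f"{metric}: {results.get(metric, 0.0):.2f} < {minimum:.2f}"
        for metric, minimum in gates.items()
        if results.get(metric, 0.0) < minimum
    ]

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    parser.add_argument("--min-recall", type=float, default=0.85)
    parser.add_argument("--min-mrr", type=float, default=0.70)
    parser.add_argument("--min-faithfulness", type=float, default=0.80)
    args = parser.parse_args(argv)

    with open(args.results) as f:
        results = json.load(f)
    failures = check_gates(results, {
        "recall_at_k": args.min_recall,
        "mrr": args.min_mrr,
        "faithfulness": args.min_faithfulness,
    })
    for failure in failures:
        print(f"GATE FAILED {failure}")
    return 1 if failures else 0  # nonzero exit code blocks the merge
```

The script entry point would call `sys.exit(main())`, so a single failed gate fails the CI step.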
The rule: No merge if retrieval metrics regress. This is a hard gate, not a suggestion. I have seen teams skip this "just once" and ship a chunking change that dropped recall by 15%. It took two weeks to notice because no user explicitly reported "your retrieval is worse" --- they just stopped using the product.
The golden dataset is a living document. Grow it systematically: turn every user-reported bad answer into a new query pair, promote near-miss chunks into hard negatives, and re-verify chunk IDs whenever the underlying documents are re-chunked or updated.
Target: 200+ pairs within 6 months of launch. At this scale, your eval harness catches subtle regressions that 50 pairs would miss.
Use this checklist to assess your evaluation infrastructure:
- A golden dataset of query/document pairs, including hard negatives, sourced from production queries
- Retrieval metrics (recall@K, precision@K, MRR, NDCG) computed on every change
- Context and answer metrics (faithfulness, answer relevancy) via a framework such as RAGAS or DeepEval
- A CI quality gate that blocks merges when metrics regress
- A process for growing the golden dataset over time
If you do not have a golden dataset, build one this week. Start with 50 pairs from your production query logs, have a domain expert mark the correct documents, and include at least 10 hard negatives. This single step enables every other evaluation practice in this checklist.
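The bootstrap step above can be partly scripted. A sketch that turns raw production queries into an annotation skeleton for a domain expert to fill in; the helper name and entry shape are illustrative, matching the golden_dataset.json format:

```python
# Sketch: bootstrap an annotation file from production query logs. A domain
# expert then fills in the chunk IDs and expected phrases for each entry.

def build_annotation_skeleton(queries, limit=50):
    """Turn raw production queries into empty golden-dataset entries for labeling."""
    seen = set()
    entries = []
    for query in queries:
        q = query.strip()
        if not q or q.lower() in seen:
            continue  # drop blanks and case-insensitive duplicates
        seen.add(q.lower())
        entries.append({
            "query": q,
            "relevant_chunk_ids": [],      # expert fills in
            "irrelevant_chunk_ids": [],    # expert adds hard negatives
            "expected_answer_contains": [],
        })
        if len(entries) == limit:
            break
    return entries
```

Dump the result with `json.dump` and hand the file to a domain expert; once the chunk IDs come back, it plugs straight into the retrieval eval.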
We close the loop with monitoring and observability for RAG systems in production. Evaluation tells you if your system is good at deployment time. Monitoring tells you if it is still good three weeks later when documents go stale, query patterns shift, and embeddings drift.