Every production system I maintain has an eval harness that runs before every deployment. Not after. Before. The cost of catching a retrieval regression in staging is minutes. The cost of catching it from a user complaint is days of debugging and lost trust.
This lesson covers how to build an evaluation system that measures retrieval quality, answer quality, and faithfulness --- and how to run it as part of your CI/CD pipeline.
Most teams evaluate only the final answer. This is like testing a car by checking if it reaches the destination without checking the engine, brakes, or steering. You need to evaluate each layer independently:
Layer 1: Retrieval Quality
"Did we find the right documents?"
Metrics: Recall@K, Precision@K, MRR, NDCG
Layer 2: Context Quality
"Is the assembled context faithful and relevant?"
Metrics: Context Precision, Context Recall, Noise Ratio
Layer 3: Answer Quality
"Is the final answer correct, complete, and grounded?"
Metrics: Faithfulness, Answer Relevancy, Correctness
When a user reports a bad answer, the eval harness tells you which layer failed. Without it, you guess --- and you guess wrong most of the time.
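This diagnosis can be made mechanical by encoding per-layer thresholds and walking them in order. A minimal sketch; the metric names and thresholds here are illustrative assumptions, not tied to any particular library:

```python
# Illustrative sketch: map per-layer metric scores to the first failing layer.
# Metric names and thresholds are assumptions; use whatever your harness emits.
LAYER_GATES = {
    "retrieval": {"recall_at_k": 0.85, "mrr": 0.70},
    "context": {"context_precision": 0.75, "context_recall": 0.75},
    "answer": {"faithfulness": 0.80, "answer_relevancy": 0.70},
}

def first_failing_layer(scores):
    """Return (layer, metric) for the first gate a score falls below, or (None, None)."""
    for layer, gates in LAYER_GATES.items():
        for metric, minimum in gates.items():
            if scores.get(metric, 0.0) < minimum:
                return layer, metric
    return None, None
```

A run where retrieval is healthy but context precision is low points you straight at reranking or context assembly, not at the prompt.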
You need a golden dataset: a set of queries paired with their correct documents. Start with 50--100 pairs and grow over time.
golden_dataset.json:

```json
[
  {
    "query": "What is the refund policy for digital products?",
    "relevant_chunk_ids": ["chunk_4a2f", "chunk_8b1c"],
    "irrelevant_chunk_ids": ["chunk_9d3e"],
    "expected_answer_contains": ["store credit", "digital products"]
  },
  {
    "query": "How long does international shipping take?",
    "relevant_chunk_ids": ["chunk_2e7a"],
    "irrelevant_chunk_ids": ["chunk_5f1b"],
    "expected_answer_contains": ["7-14 business days"]
  }
]
```

The `irrelevant_chunk_ids` are hard negatives: chunks that look similar to the query but must not be retrieved.
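Because every downstream metric depends on this file being right, it is worth validating entries before each eval run. A small sketch (field names follow the example above) that fails fast on a malformed entry instead of silently skewing metrics:

```python
# Sketch: validate golden dataset entries before an eval run.
# Field names match the golden_dataset.json example above.
REQUIRED_FIELDS = {"query", "relevant_chunk_ids", "expected_answer_contains"}

def validate_entry(entry):
    """Return a list of problems with one dataset entry (empty list means valid)."""
    problems = []
    missing = REQUIRED_FIELDS - entry.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if not entry.get("relevant_chunk_ids"):
        problems.append("no relevant chunks listed")
    overlap = set(entry.get("relevant_chunk_ids", [])) & set(entry.get("irrelevant_chunk_ids", []))
    if overlap:
        problems.append(f"chunks marked both relevant and irrelevant: {sorted(overlap)}")
    return problems
```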
Where to get golden data: mine your production query logs and have a domain expert mark the correct chunks; turn every user-reported bad answer into a new pair; and collect hard negatives from chunks that look relevant but are not. With golden data in hand, retrieval metrics are straightforward to compute:
```python
import math

def evaluate_retrieval(golden_dataset, retriever, k=5):
    metrics = {
        "recall_at_k": [],
        "precision_at_k": [],
        "mrr": [],
        "ndcg": []
    }
    for item in golden_dataset:
        retrieved = retriever.search(item["query"], top_k=k)
        retrieved_ids = [r.id for r in retrieved]
        relevant_ids = set(item["relevant_chunk_ids"])

        # Recall@K: fraction of relevant docs found in the top K
        hits = len(set(retrieved_ids) & relevant_ids)
        recall = hits / len(relevant_ids) if relevant_ids else 0.0
        metrics["recall_at_k"].append(recall)

        # Precision@K: fraction of retrieved docs that are relevant
        metrics["precision_at_k"].append(hits / k)

        # MRR: reciprocal rank of the first relevant result
        mrr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_ids:
                mrr = 1.0 / rank
                break
        metrics["mrr"].append(mrr)

        # NDCG@K: discounted cumulative gain, normalized by the ideal ranking
        dcg = sum(
            (1.0 if doc_id in relevant_ids else 0.0) / math.log2(rank + 1)
            for rank, doc_id in enumerate(retrieved_ids, 1)
        )
        ideal_hits = min(len(relevant_ids), k)  # the ideal ranking holds at most K docs
        idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
        metrics["ndcg"].append(dcg / idcg if idcg > 0 else 0.0)

    return {name: sum(vals) / len(vals) for name, vals in metrics.items()}
```
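To sanity-check the arithmetic, here is one query worked by hand (the chunk IDs are toy values, not from the dataset above):

```python
import math

# Toy worked example: 5 retrieved chunks, 2 of which are relevant.
retrieved = ["a", "x", "b", "y", "z"]   # ranks 1..5
relevant = {"a", "b"}                   # relevant chunks sit at ranks 1 and 3

recall = len(set(retrieved) & relevant) / len(relevant)      # 2/2 = 1.0
precision = len(set(retrieved) & relevant) / len(retrieved)  # 2/5 = 0.4
mrr = 1.0 / 1                                                # first hit at rank 1

dcg = sum((1.0 if d in relevant else 0.0) / math.log2(r + 1)
          for r, d in enumerate(retrieved, 1))               # 1/log2(2) + 1/log2(4) = 1.5
idcg = sum(1.0 / math.log2(i + 1)
           for i in range(1, len(relevant) + 1))             # 1 + 1/log2(3)
ndcg = dcg / idcg                                            # roughly 0.92
```

Note how NDCG penalizes the relevant chunk sitting at rank 3: recall is perfect, but the ranking is not.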
What to target: as a starting point, recall@5 of 0.85 or higher, MRR of 0.70 or higher, and faithfulness of 0.80 or higher (the same thresholds the CI quality gate later in this lesson enforces). These targets vary by domain. For a medical system, I push recall@5 above 0.95. For a general customer support bot, 0.80 may be acceptable.
RAGAS (Retrieval Augmented Generation Assessment) provides LLM-judged metrics, several of which, such as faithfulness and answer relevancy, are reference-free and do not require ground truth answers. This makes it useful for evaluating at scale.
```python
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy,
)
from datasets import Dataset

# Prepare the evaluation dataset
eval_data = {
    "question": [],
    "answer": [],
    "contexts": [],
    "ground_truth": []
}
for item in test_queries:
    # Run your full RAG pipeline
    result = rag_pipeline(item["query"])
    eval_data["question"].append(item["query"])
    eval_data["answer"].append(result["answer"])
    eval_data["contexts"].append(result["retrieved_chunks"])  # list of strings
    eval_data["ground_truth"].append(item.get("expected_answer", ""))

dataset = Dataset.from_dict(eval_data)

# Run the RAGAS evaluation
results = evaluate(
    dataset,
    metrics=[
        context_precision,   # Are retrieved chunks relevant?
        context_recall,      # Did we retrieve all needed info?
        faithfulness,        # Is the answer supported by context?
        answer_relevancy,    # Does the answer address the question?
    ],
)

print(results)
# Example output:
# {'context_precision': 0.82, 'context_recall': 0.78,
#  'faithfulness': 0.91, 'answer_relevancy': 0.87}
```
What each metric tells you:
- Context precision: are the retrieved chunks relevant to the question? Low scores implicate retrieval or reranking.
- Context recall: did the retrieved context contain all the information needed? Low scores implicate chunking or retrieval recall.
- Faithfulness: is the answer supported by the retrieved context? Low scores mean the model is hallucinating past its context.
- Answer relevancy: does the answer actually address the question? Low scores implicate the generation prompt.
DeepEval integrates with pytest, letting you write retrieval quality tests like unit tests:
```python
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
)

def test_refund_policy_query():
    """Test that refund policy queries return accurate, cited answers."""
    result = rag_pipeline("What is the refund policy for digital products?")
    test_case = LLMTestCase(
        input="What is the refund policy for digital products?",
        actual_output=result["answer"],
        # ContextualPrecisionMetric also needs a reference answer to rank against
        expected_output="Digital products are refunded as store credit.",
        retrieval_context=result["retrieved_chunks"],
    )
    faithfulness = FaithfulnessMetric(threshold=0.8)
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    precision = ContextualPrecisionMetric(threshold=0.7)
    assert_test(test_case, [faithfulness, relevancy, precision])

def test_no_hallucination_on_unknown():
    """Test that the system admits when it does not know."""
    result = rag_pipeline("What is the company's policy on teleportation?")
    test_case = LLMTestCase(
        input="What is the company's policy on teleportation?",
        actual_output=result["answer"],
        retrieval_context=result["retrieved_chunks"],
    )
    # Faithfulness should be high (answer grounded in context, or a refusal)
    faithfulness = FaithfulnessMetric(threshold=0.9)
    assert_test(test_case, [faithfulness])
```
Run as part of CI:

```shell
# In your CI/CD pipeline
deepeval test run tests/test_rag_quality.py
```
Here is the pipeline I use:

```yaml
# .github/workflows/rag-eval.yml
name: RAG Quality Gate
on:
  pull_request:
    paths:
      - 'src/rag/**'
      - 'prompts/**'
      - 'config/chunking.yaml'
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run retrieval eval
        run: |
          python eval/run_retrieval_eval.py \
            --dataset eval/golden_dataset.json \
            --output eval/results.json
      - name: Check quality gates
        run: |
          python eval/check_gates.py \
            --results eval/results.json \
            --min-recall 0.85 \
            --min-mrr 0.70 \
            --min-faithfulness 0.80
      - name: Run DeepEval tests
        run: deepeval test run tests/test_rag_quality.py
```
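The workflow calls eval/check_gates.py without showing it. A minimal sketch of what it could look like; the file layout, metric keys, and flag names are assumptions inferred from the workflow, not a published script:

```python
# Hypothetical sketch of eval/check_gates.py: fail the build when any metric
# in results.json falls below its gate. Flag names mirror the workflow above.
import argparse
import json

def check_gates(results, gates):
    """Return human-readable failures for every metric below its threshold."""
    return [
        f"{metric}: {results.get(metric, 0.0):.2f} < {minimum:.2f}"
        for metric, minimum in gates.items()
        if results.get(metric, 0.0) < minimum
    ]

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    parser.add_argument("--min-recall", type=float, default=0.85)
    parser.add_argument("--min-mrr", type=float, default=0.70)
    parser.add_argument("--min-faithfulness", type=float, default=0.80)
    args = parser.parse_args(argv)

    with open(args.results) as f:
        results = json.load(f)
    failures = check_gates(results, {
        "recall_at_k": args.min_recall,
        "mrr": args.min_mrr,
        "faithfulness": args.min_faithfulness,
    })
    for failure in failures:
        print(f"GATE FAILED {failure}")
    return 1 if failures else 0  # nonzero exit code blocks the merge
```

The script entry point would call `sys.exit(main())`, so a single failed gate fails the CI step.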
The rule: No merge if retrieval metrics regress. This is a hard gate, not a suggestion. I have seen teams skip this "just once" and ship a chunking change that dropped recall by 15%. It took two weeks to notice because no user explicitly reported "your retrieval is worse" --- they just stopped using the product.
The golden dataset is a living document. Grow it systematically: turn every user-reported bad answer into a new query pair, promote near-miss chunks into hard negatives, and re-verify chunk IDs whenever the underlying documents are re-chunked or updated.
Target: 200+ pairs within 6 months of launch. At this scale, your eval harness catches subtle regressions that 50 pairs would miss.
Use this checklist to assess your evaluation infrastructure:
- A golden dataset of query/document pairs, including hard negatives, sourced from production queries
- Retrieval metrics (recall@K, precision@K, MRR, NDCG) computed on every change
- Context and answer metrics (faithfulness, answer relevancy) via a framework such as RAGAS or DeepEval
- A CI quality gate that blocks merges when metrics regress
- A process for growing the golden dataset over time
If you do not have a golden dataset, build one this week. Start with 50 pairs from your production query logs, have a domain expert mark the correct documents, and include at least 10 hard negatives. This single step enables every other evaluation practice in this checklist.
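The bootstrap step above can be partly scripted. A sketch that turns raw production queries into an annotation skeleton for a domain expert to fill in; the helper name and entry shape are illustrative, matching the golden_dataset.json format:

```python
# Sketch: bootstrap an annotation file from production query logs. A domain
# expert then fills in the chunk IDs and expected phrases for each entry.

def build_annotation_skeleton(queries, limit=50):
    """Turn raw production queries into empty golden-dataset entries for labeling."""
    seen = set()
    entries = []
    for query in queries:
        q = query.strip()
        if not q or q.lower() in seen:
            continue  # drop blanks and case-insensitive duplicates
        seen.add(q.lower())
        entries.append({
            "query": q,
            "relevant_chunk_ids": [],      # expert fills in
            "irrelevant_chunk_ids": [],    # expert adds hard negatives
            "expected_answer_contains": [],
        })
        if len(entries) == limit:
            break
    return entries
```

Dump the result with `json.dump` and hand the file to a domain expert; once the chunk IDs come back, it plugs straight into the retrieval eval.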
We close the loop with monitoring and observability for RAG systems in production. Evaluation tells you if your system is good at deployment time. Monitoring tells you if it is still good three weeks later when documents go stale, query patterns shift, and embeddings drift.