Evaluation tells you if your system is good. Monitoring tells you if it is still good. I have watched RAG systems degrade silently over weeks: document stores go stale, embedding drift accumulates, a schema change in the source data breaks the chunker. Without monitoring, the first signal is a user complaint. With monitoring, you catch it before the user ever notices.
This lesson covers the full observability stack I deploy on every production RAG system.
1. Tracing: What happened on this specific query?
2. Metrics: How is the system performing overall?
3. Alerting: What changed that I need to act on?
4. Debugging: Why did this specific query fail?
Most teams implement logging and call it "observability." Logging is a component, not the whole picture. True observability means you can reconstruct and explain any system behavior from the telemetry data alone.
Every RAG query should produce a trace that captures each stage of the pipeline:
```python
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def rag_query(query: str):
    # Stage 1: Query processing
    with langfuse.span(name="query_processing") as span:
        normalized = normalize_query(query)
        rewritten = rewrite_query(normalized)
        span.update(
            input=query,
            output=rewritten,
            metadata={"was_rewritten": query != rewritten}
        )

    # Stage 2: Retrieval
    with langfuse.span(name="retrieval") as span:
        results = hybrid_search(rewritten, top_k=20)
        span.update(
            input=rewritten,
            output=[r.id for r in results],
            metadata={
                "num_results": len(results),
                "top_score": results[0].score if results else 0,
                "search_type": "hybrid"
            }
        )

    # Stage 3: Re-ranking
    with langfuse.span(name="reranking") as span:
        reranked = rerank(rewritten, results, top_k=5)
        span.update(
            input=[r.id for r in results],
            output=[r.id for r in reranked],
            metadata={
                "score_dropoff": results[0].score - reranked[-1].score,
                "reranker_model": "cross-encoder/ms-marco-MiniLM-L-12-v2"
            }
        )

    # Stage 4: Generation
    with langfuse.span(name="generation") as span:
        context = format_context(reranked)
        answer = generate_answer(rewritten, context)
        span.update(
            input={"query": rewritten, "context_tokens": count_tokens(context)},
            output=answer,
            metadata={
                "model": answer.model,
                "input_tokens": answer.usage.input_tokens,
                "output_tokens": answer.usage.output_tokens,
                "cost_usd": calculate_cost(answer.usage)
            }
        )

    return answer
```
What this gives you: When a user reports a bad answer, you pull up the trace ID and see exactly what query was processed, what documents were retrieved, how they were re-ranked, what context was sent to the LLM, and what the LLM generated. Debugging time drops from hours to minutes.
| Platform | Best For | Open Source | Key Feature |
|----------|----------|-------------|-------------|
| Langfuse | General RAG, any framework | Yes | Prompt management + eval integration |
| LangSmith | LangChain/LangGraph stacks | No | Deep LangChain integration |
| Phoenix (Arize) | LlamaIndex stacks | Yes | Notebook-friendly debugging |
| Opik (Comet) | Experiment tracking focus | Yes | A/B test comparison |
I default to Langfuse for most projects because it is open-source, framework-agnostic, and has strong eval integration. If you are deeply invested in LangChain, LangSmith's automatic tracing is hard to beat.
Track these metrics continuously and visualize them on a dashboard:
```python
retrieval_metrics = {
    # Retrieval quality signals
    "top_score": float,        # Highest similarity score in results
    "score_spread": float,     # Difference between top and bottom scores
    "num_results": int,        # How many chunks were retrieved
    "cache_hit": bool,         # Was the answer served from cache?

    # Latency breakdown
    "embedding_ms": float,     # Time to embed the query
    "search_ms": float,        # Time for vector + keyword search
    "rerank_ms": float,        # Time for re-ranking
    "generation_ms": float,    # Time for LLM generation
    "total_ms": float,         # End-to-end latency

    # Cost tracking
    "input_tokens": int,       # Tokens sent to LLM
    "output_tokens": int,      # Tokens generated
    "cost_usd": float,         # Total cost for this query
    "model_used": str,         # Which model handled generation
}

system_metrics = {
    # Quality over time
    "avg_top_score_24h": float,        # Trending down = retrieval degradation
    "low_confidence_rate_24h": float,  # % of queries with top_score < threshold
    "no_result_rate_24h": float,       # % of queries with zero results

    # Cost efficiency
    "cost_per_query_24h": float,       # Should be stable or declining
    "cache_hit_rate_24h": float,       # Should be 30-50% for healthy caching
    "model_routing_ratio_24h": dict,   # {cheap_model: 70%, expensive: 30%}

    # Volume and latency
    "query_volume_24h": int,
    "p50_latency_ms": float,
    "p95_latency_ms": float,
    "p99_latency_ms": float,

    # Document freshness
    "oldest_document_days": int,       # How stale is your corpus?
    "docs_updated_7d": int,            # Ingestion pipeline health
}
```
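The latency percentiles above roll up from the raw per-query `total_ms` values. A minimal sketch of that aggregation (nearest-rank percentiles, no external dependencies; the function names are my own, not from any monitoring library):

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in [0, 100]."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # Nearest-rank method: take the ceil(pct/100 * N)-th smallest value (1-indexed)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_summary(latencies_ms: list[float]) -> dict:
    """Roll raw per-query latencies into the dashboard fields above."""
    return {
        "query_volume_24h": len(latencies_ms),
        "p50_latency_ms": percentile(latencies_ms, 50),
        "p95_latency_ms": percentile(latencies_ms, 95),
        "p99_latency_ms": percentile(latencies_ms, 99),
    }
```

In practice Prometheus or Datadog computes these for you from histograms; the sketch is only to make the definitions concrete.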
Define alerts that catch degradation before users notice:
```python
alerts = [
    {
        "name": "retrieval_quality_degradation",
        "condition": "avg_top_score_24h < 0.75",
        "severity": "critical",
        "action": "Page on-call. Possible embedding drift or index corruption."
    },
    {
        "name": "cost_spike",
        "condition": "cost_per_query_24h > 2x historical average",
        "severity": "warning",
        "action": "Check model routing. Cache may be down."
    },
    {
        "name": "latency_degradation",
        "condition": "p95_latency_ms > 3000",
        "severity": "warning",
        "action": "Check vector DB performance and re-ranker latency."
    },
    {
        "name": "stale_corpus",
        "condition": "docs_updated_7d == 0 AND expected_update_frequency == 'weekly'",
        "severity": "warning",
        "action": "Ingestion pipeline may be broken. Check data source connectors."
    },
    {
        "name": "cache_failure",
        "condition": "cache_hit_rate_24h < 0.10 AND query_volume_24h > 1000",
        "severity": "critical",
        "action": "Cache infrastructure may be down. All queries hitting full pipeline."
    },
    {
        "name": "high_refusal_rate",
        "condition": "no_answer_rate_24h > 0.25",
        "severity": "warning",
        "action": "25%+ queries getting 'I don't know.' Check if new query patterns emerged."
    }
]
```
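Rules like these can be evaluated by a small loop over each metrics snapshot. A sketch, assuming the string conditions above are expressed as Python callables and that a trailing baseline is available for the "2x historical average" comparison (all names here are illustrative, not a specific alerting product's API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Alert:
    name: str
    condition: Callable[[dict], bool]  # True when the alert should fire
    severity: str                      # "critical" pages on-call; "warning" goes to Slack
    action: str

ALERTS = [
    Alert(
        name="retrieval_quality_degradation",
        condition=lambda m: m["avg_top_score_24h"] < 0.75,
        severity="critical",
        action="Page on-call. Possible embedding drift or index corruption.",
    ),
    Alert(
        name="cost_spike",
        # Leading indicator: compare against a trailing baseline, not a fixed number
        condition=lambda m: m["cost_per_query_24h"] > 2 * m["cost_per_query_baseline"],
        severity="warning",
        action="Check model routing. Cache may be down.",
    ),
]

def evaluate_alerts(metrics: dict) -> list[Alert]:
    """Return the alerts that fire for this metrics snapshot."""
    return [a for a in ALERTS if a.condition(metrics)]
```

In production these conditions live in Prometheus alerting rules or a Datadog monitor rather than application code; the shape of the logic is the same.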
The philosophy: Alert on leading indicators, not lagging ones. A drop in average retrieval score is a leading indicator. A user complaint is a lagging one. By the time you get the complaint, the system has been degraded for hours or days.
When an alert fires or a user reports an issue, follow this systematic debugging protocol:
Step 1: Pull the trace
-> What query was sent?
-> Was it rewritten? How?
Step 2: Inspect retrieval
-> What chunks were retrieved?
-> Are any of them relevant?
-> What were the similarity scores?
Step 3: Inspect re-ranking
-> Did re-ranking help or hurt?
-> Was the most relevant chunk promoted or demoted?
Step 4: Inspect context assembly
-> What was sent to the LLM?
-> Was there conflicting information?
-> Was the context too long / too short?
Step 5: Inspect generation
-> Did the LLM faithfully use the context?
-> Did it hallucinate beyond the provided information?
-> Were citations accurate?
Most bugs are found at Steps 2-3. The retrieval either missed the right document entirely (a chunking or embedding issue) or the right document was retrieved but ranked poorly (a re-ranking issue).
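When you have a labeled example, meaning a query whose correct document is known, that Step 2 versus Step 3 distinction can be checked mechanically. A sketch, where `retrieved_ids` and `reranked_ids` come from the trace and `gold_id` is the document a human judged relevant (the function and its labels are my own convention):

```python
def classify_retrieval_failure(gold_id: str,
                               retrieved_ids: list[str],
                               reranked_ids: list[str]) -> str:
    """Attribute a bad answer to the pipeline stage that lost the right document."""
    if gold_id not in retrieved_ids:
        # Never surfaced at all: look at chunking, embeddings, or query rewriting
        return "retrieval_miss"
    if gold_id not in reranked_ids:
        # Retrieved but cut by the re-ranker: look at the cross-encoder
        return "reranker_demotion"
    # The right context reached the LLM: inspect context assembly and generation
    return "generation_issue"
```

Run this over a batch of reported bad answers and the counts tell you which stage to fix first.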
For recurring issues, automate the diagnosis:
```python
def diagnose_bad_answer(trace_id: str):
    trace = langfuse.get_trace(trace_id)
    diagnosis = []

    # Check retrieval quality
    retrieval_span = trace.get_span("retrieval")
    top_score = retrieval_span.metadata["top_score"]
    if top_score < 0.70:
        diagnosis.append(f"RETRIEVAL: Low top score ({top_score:.2f}). Likely missing relevant documents.")

    # Check re-ranking impact
    rerank_span = trace.get_span("reranking")
    if rerank_span.metadata["score_dropoff"] > 0.3:
        diagnosis.append("RERANKING: Large score dropoff. Re-ranker may be demoting relevant results.")

    # Check context size
    gen_span = trace.get_span("generation")
    context_tokens = gen_span.input["context_tokens"]
    if context_tokens > 3000:
        diagnosis.append(f"CONTEXT: Large context ({context_tokens} tokens). May contain noise.")
    elif context_tokens < 200:
        diagnosis.append(f"CONTEXT: Thin context ({context_tokens} tokens). May lack sufficient information.")

    # Check cost
    cost = gen_span.metadata["cost_usd"]
    if cost > 0.05:
        diagnosis.append(f"COST: High query cost (${cost:.3f}). Check model routing.")

    return diagnosis
```
Here is the minimal stack I deploy on day one:
- Tracing: Langfuse (self-hosted or cloud)
- Metrics: Prometheus + Grafana (or Datadog)
- Alerting: PagerDuty / Opsgenie for critical, Slack for warnings
- Logging: Structured JSON logs to your existing log aggregator
- Dashboard: Grafana board with retrieval quality, latency, cost, and volume panels
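For the structured-logging piece, the standard library is enough. A minimal sketch of a JSON formatter (the field names are my own convention, not a requirement of any aggregator):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the aggregator can index fields."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Per-query fields attached via the `extra=` argument to logger calls
            "trace_id": getattr(record, "trace_id", None),
            "total_ms": getattr(record, "total_ms", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("rag")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every query log line carries the trace ID, linking logs back to Langfuse traces
logger.info("query completed", extra={"trace_id": "abc123", "total_ms": 412.0})
```

The key decision is putting the trace ID in every log line: that is what lets you jump from a log search straight to the full pipeline trace.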
Day one is not optional. I deploy observability alongside the first version of the RAG system, not after it is "stable." By the time you think you need monitoring, you have already missed weeks of data that would have told you how the system actually behaves under real traffic.
Use this checklist to assess your observability posture:
If you cannot reconstruct the full pipeline execution for a reported bad answer within 15 minutes, your observability is insufficient. Start with Langfuse or LangSmith; either gives you trace-level visibility with minimal integration work.
You have now covered the full stack of production RAG engineering: from diagnosing failure modes (Lesson 1) through building the retrieval pipeline (Lessons 2-4), optimizing cost and trust (Lessons 5-6), and operationalizing with evaluation and monitoring (Lessons 7-8).
These are not theoretical patterns. They are the techniques I use on every system I ship. The throughline across every lesson is the same: treat RAG as an information supply chain with measurable unit economics at every stage.
Here is the production readiness checklist that ties all eight lessons together:
When you think in systems, you build systems that last. Go build something hardened.