Evaluation tells you if your system is good. Monitoring tells you if it is still good. I have watched RAG systems degrade silently over weeks: document stores go stale, embedding drift accumulates, a schema change in the source data breaks the chunker. Without monitoring, the first signal is a user complaint. With monitoring, you catch it before the user ever notices.
This lesson covers the full observability stack I deploy on every production RAG system.
1. Tracing: What happened on this specific query?
2. Metrics: How is the system performing overall?
3. Alerting: What changed that I need to act on?
4. Debugging: Why did this specific query fail?
Most teams implement logging and call it "observability." Logging is a component, not the whole picture. True observability means you can reconstruct and explain any system behavior from the telemetry data alone.
Every RAG query should produce a trace that captures each stage of the pipeline:
```python
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()
def rag_query(query: str):
    # Stage 1: Query processing
    with langfuse.span(name="query_processing") as span:
        normalized = normalize_query(query)
        rewritten = rewrite_query(normalized)
        span.update(
            input=query,
            output=rewritten,
            metadata={"was_rewritten": query != rewritten}
        )

    # Stage 2: Retrieval
    with langfuse.span(name="retrieval") as span:
        results = hybrid_search(rewritten, top_k=20)
        span.update(
            input=rewritten,
            output=[r.id for r in results],
            metadata={
                "num_results": len(results),
                "top_score": results[0].score if results else 0,
                "search_type": "hybrid"
            }
        )

    # Stage 3: Re-ranking
    with langfuse.span(name="reranking") as span:
        reranked = rerank(rewritten, results, top_k=5)
        span.update(
            input=[r.id for r in results],
            output=[r.id for r in reranked],
            metadata={
                "score_dropoff": results[0].score - reranked[-1].score,
                "reranker_model": "cross-encoder/ms-marco-MiniLM-L-12-v2"
            }
        )

    # Stage 4: Generation
    with langfuse.span(name="generation") as span:
        context = format_context(reranked)
        answer = generate_answer(rewritten, context)
        span.update(
            input={"query": rewritten, "context_tokens": count_tokens(context)},
            output=answer,
            metadata={
                "model": answer.model,
                "input_tokens": answer.usage.input_tokens,
                "output_tokens": answer.usage.output_tokens,
                "cost_usd": calculate_cost(answer.usage)
            }
        )

    return answer
```
What this gives you: When a user reports a bad answer, you pull up the trace ID and see exactly what query was processed, what documents were retrieved, how they were re-ranked, what context was sent to the LLM, and what the LLM generated. Debugging time drops from hours to minutes.
| Platform | Best For | Open Source | Key Feature |
|----------|----------|-------------|-------------|
| Langfuse | General RAG, any framework | Yes | Prompt management + eval integration |
| LangSmith | LangChain/LangGraph stacks | No | Deep LangChain integration |
| Phoenix (Arize) | LlamaIndex stacks | Yes | Notebook-friendly debugging |
| Opik (Comet) | Experiment tracking focus | Yes | A/B test comparison |
I default to Langfuse for most projects because it is open-source, framework-agnostic, and has strong eval integration. If you are deeply invested in LangChain, LangSmith's automatic tracing is hard to beat.
Track these metrics continuously and visualize them on a dashboard:
```python
retrieval_metrics = {
    # Retrieval quality signals
    "top_score": float,        # Highest similarity score in results
    "score_spread": float,     # Difference between top and bottom scores
    "num_results": int,        # How many chunks were retrieved
    "cache_hit": bool,         # Was the answer served from cache?

    # Latency breakdown
    "embedding_ms": float,     # Time to embed the query
    "search_ms": float,        # Time for vector + keyword search
    "rerank_ms": float,        # Time for re-ranking
    "generation_ms": float,    # Time for LLM generation
    "total_ms": float,         # End-to-end latency

    # Cost tracking
    "input_tokens": int,       # Tokens sent to LLM
    "output_tokens": int,      # Tokens generated
    "cost_usd": float,         # Total cost for this query
    "model_used": str,         # Which model handled generation
}

system_metrics = {
    # Quality over time
    "avg_top_score_24h": float,        # Trending down = retrieval degradation
    "low_confidence_rate_24h": float,  # % of queries with top_score < threshold
    "no_result_rate_24h": float,       # % of queries with zero results

    # Cost efficiency
    "cost_per_query_24h": float,       # Should be stable or declining
    "cache_hit_rate_24h": float,       # Should be 30-50% for healthy caching
    "model_routing_ratio_24h": dict,   # {cheap_model: 70%, expensive: 30%}

    # Volume and latency
    "query_volume_24h": int,
    "p50_latency_ms": float,
    "p95_latency_ms": float,
    "p99_latency_ms": float,

    # Document freshness
    "oldest_document_days": int,       # How stale is your corpus?
    "docs_updated_7d": int,            # Ingestion pipeline health
}
```
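The latency percentiles above roll up from the raw per-query `total_ms` values. A minimal sketch of that aggregation (nearest-rank percentiles, no external dependencies; the function names are my own, not from any monitoring library):

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in [0, 100]."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # Nearest-rank method: take the ceil(pct/100 * N)-th smallest value (1-indexed)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_summary(latencies_ms: list[float]) -> dict:
    """Roll raw per-query latencies into the dashboard fields above."""
    return {
        "query_volume_24h": len(latencies_ms),
        "p50_latency_ms": percentile(latencies_ms, 50),
        "p95_latency_ms": percentile(latencies_ms, 95),
        "p99_latency_ms": percentile(latencies_ms, 99),
    }
```

In practice Prometheus or Datadog computes these for you from histograms; the sketch is only to make the definitions concrete.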
Define alerts that catch degradation before users notice:
```python
alerts = [
    {
        "name": "retrieval_quality_degradation",
        "condition": "avg_top_score_24h < 0.75",
        "severity": "critical",
        "action": "Page on-call. Possible embedding drift or index corruption."
    },
    {
        "name": "cost_spike",
        "condition": "cost_per_query_24h > 2x historical average",
        "severity": "warning",
        "action": "Check model routing. Cache may be down."
    },
    {
        "name": "latency_degradation",
        "condition": "p95_latency_ms > 3000",
        "severity": "warning",
        "action": "Check vector DB performance and re-ranker latency."
    },
    {
        "name": "stale_corpus",
        "condition": "docs_updated_7d == 0 AND expected_update_frequency == 'weekly'",
        "severity": "warning",
        "action": "Ingestion pipeline may be broken. Check data source connectors."
    },
    {
        "name": "cache_failure",
        "condition": "cache_hit_rate_24h < 0.10 AND query_volume_24h > 1000",
        "severity": "critical",
        "action": "Cache infrastructure may be down. All queries hitting full pipeline."
    },
    {
        "name": "high_refusal_rate",
        "condition": "no_answer_rate_24h > 0.25",
        "severity": "warning",
        "action": "25%+ queries getting 'I don't know.' Check if new query patterns emerged."
    }
]
```
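Rules like these can be evaluated by a small loop over each metrics snapshot. A sketch, assuming the string conditions above are expressed as Python callables and that a trailing baseline is available for the "2x historical average" comparison (all names here are illustrative, not a specific alerting product's API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Alert:
    name: str
    condition: Callable[[dict], bool]  # True when the alert should fire
    severity: str                      # "critical" pages on-call; "warning" goes to Slack
    action: str

ALERTS = [
    Alert(
        name="retrieval_quality_degradation",
        condition=lambda m: m["avg_top_score_24h"] < 0.75,
        severity="critical",
        action="Page on-call. Possible embedding drift or index corruption.",
    ),
    Alert(
        name="cost_spike",
        # Leading indicator: compare against a trailing baseline, not a fixed number
        condition=lambda m: m["cost_per_query_24h"] > 2 * m["cost_per_query_baseline"],
        severity="warning",
        action="Check model routing. Cache may be down.",
    ),
]

def evaluate_alerts(metrics: dict) -> list[Alert]:
    """Return the alerts that fire for this metrics snapshot."""
    return [a for a in ALERTS if a.condition(metrics)]
```

In production these conditions live in Prometheus alerting rules or a Datadog monitor rather than application code; the shape of the logic is the same.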
The philosophy: Alert on leading indicators, not lagging ones. A drop in average retrieval score is a leading indicator. A user complaint is a lagging one. By the time you get the complaint, the system has been degraded for hours or days.
When an alert fires or a user reports an issue, follow this systematic debugging protocol:
Step 1: Pull the trace
-> What query was sent?
-> Was it rewritten? How?
Step 2: Inspect retrieval
-> What chunks were retrieved?
-> Are any of them relevant?
-> What were the similarity scores?
Step 3: Inspect re-ranking
-> Did re-ranking help or hurt?
-> Was the most relevant chunk promoted or demoted?
Step 4: Inspect context assembly
-> What was sent to the LLM?
-> Was there conflicting information?
-> Was the context too long / too short?
Step 5: Inspect generation
-> Did the LLM faithfully use the context?
-> Did it hallucinate beyond the provided information?
-> Were citations accurate?
Most bugs are found at Steps 2-3. The retrieval either missed the right document entirely (a chunking or embedding issue) or the right document was retrieved but ranked poorly (a re-ranking issue).
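When you have a labeled example, meaning a query whose correct document is known, that Step 2 versus Step 3 distinction can be checked mechanically. A sketch, where `retrieved_ids` and `reranked_ids` come from the trace and `gold_id` is the document a human judged relevant (the function and its labels are my own convention):

```python
def classify_retrieval_failure(gold_id: str,
                               retrieved_ids: list[str],
                               reranked_ids: list[str]) -> str:
    """Attribute a bad answer to the pipeline stage that lost the right document."""
    if gold_id not in retrieved_ids:
        # Never surfaced at all: look at chunking, embeddings, or query rewriting
        return "retrieval_miss"
    if gold_id not in reranked_ids:
        # Retrieved but cut by the re-ranker: look at the cross-encoder
        return "reranker_demotion"
    # The right context reached the LLM: inspect context assembly and generation
    return "generation_issue"
```

Run this over a batch of reported bad answers and the counts tell you which stage to fix first.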
For recurring issues, automate the diagnosis:
```python
def diagnose_bad_answer(trace_id: str):
    trace = langfuse.get_trace(trace_id)
    diagnosis = []

    # Check retrieval quality
    retrieval_span = trace.get_span("retrieval")
    top_score = retrieval_span.metadata["top_score"]
    if top_score < 0.70:
        diagnosis.append(f"RETRIEVAL: Low top score ({top_score:.2f}). Likely missing relevant documents.")

    # Check re-ranking impact
    rerank_span = trace.get_span("reranking")
    if rerank_span.metadata["score_dropoff"] > 0.3:
        diagnosis.append("RERANKING: Large score dropoff. Re-ranker may be demoting relevant results.")

    # Check context size
    gen_span = trace.get_span("generation")
    context_tokens = gen_span.input["context_tokens"]
    if context_tokens > 3000:
        diagnosis.append(f"CONTEXT: Large context ({context_tokens} tokens). May contain noise.")
    elif context_tokens < 200:
        diagnosis.append(f"CONTEXT: Thin context ({context_tokens} tokens). May lack sufficient information.")

    # Check cost
    cost = gen_span.metadata["cost_usd"]
    if cost > 0.05:
        diagnosis.append(f"COST: High query cost (${cost:.3f}). Check model routing.")

    return diagnosis
```
Here is the minimal stack I deploy on day one:
- Tracing: Langfuse (self-hosted or cloud)
- Metrics: Prometheus + Grafana (or Datadog)
- Alerting: PagerDuty / Opsgenie for critical, Slack for warnings
- Logging: Structured JSON logs to your existing log aggregator
- Dashboard: Grafana board with retrieval quality, latency, cost, and volume panels
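For the structured-logging piece, the standard library is enough. A minimal sketch of a JSON formatter (the field names are my own convention, not a requirement of any aggregator):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so the aggregator can index fields."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Per-query fields attached via the `extra=` argument to logger calls
            "trace_id": getattr(record, "trace_id", None),
            "total_ms": getattr(record, "total_ms", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("rag")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Every query log line carries the trace ID, linking logs back to Langfuse traces
logger.info("query completed", extra={"trace_id": "abc123", "total_ms": 412.0})
```

The key decision is putting the trace ID in every log line: that is what lets you jump from a log search straight to the full pipeline trace.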
Day one is not optional. I deploy observability alongside the first version of the RAG system, not after it is "stable." By the time you think you need monitoring, you have already missed weeks of data that would have told you how the system actually behaves under real traffic.
Use this checklist to assess your observability posture:
If you cannot reconstruct the full pipeline execution for a reported bad answer within 15 minutes, your observability is insufficient. Start with Langfuse or LangSmith; either gives you trace-level visibility with minimal integration work.
You have now covered the full stack of production RAG engineering: from diagnosing failure modes (Lesson 1) through building the retrieval pipeline (Lessons 2-4), optimizing cost and trust (Lessons 5-6), and operationalizing with evaluation and monitoring (Lessons 7-8).
These are not theoretical patterns. They are the techniques I use on every system I ship. The throughline across every lesson is the same: treat RAG as an information supply chain with measurable unit economics at every stage.
Here is the production readiness checklist that ties all eight lessons together:
When you think in systems, you build systems that last. Go build something hardened.