Traditional software observability is a solved problem. You have structured logs, distributed tracing, metrics dashboards, and alert rules. A request comes in, hits your API, queries a database, and returns a response. You can trace the entire path, measure the latency at each step, and alert when something deviates.
AI systems break this model in two fundamental ways.
First, the most important failures are semantic, not structural. The request succeeds, the response returns in 800ms, the HTTP status is 200 -- and the answer is completely wrong. Your existing monitoring will show green across the board while your users lose trust.
Second, the cost of each request is variable and significant. A traditional API call costs fractions of a cent in compute. An LLM call can cost $0.01 to $0.50 depending on the model and token count. Cost is not just an infrastructure concern -- it is a product metric that needs real-time visibility.
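To make that concrete, here is a minimal sketch of per-call cost accounting from token counts. The per-1K-token rates below are placeholder assumptions for illustration, not real provider quotes; substitute your provider's current pricing.

```python
# Illustrative per-1K-token rates (assumed for this sketch -- not real
# quotes; check your provider's current pricing).
PRICING = {
    "small-model": {"input": 0.00015, "output": 0.0006},
    "large-model": {"input": 0.005, "output": 0.015},
}

def llm_call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one LLM call from its token counts."""
    rates = PRICING[model]
    return (input_tokens / 1000) * rates["input"] + \
           (output_tokens / 1000) * rates["output"]
```

With the assumed rates, a 2,000-token prompt and 500-token completion on the large model costs a couple of cents, while the same call on the small model costs a fraction of a tenth of a cent: exactly the variability that makes cost a first-class product metric.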
In my experience, the teams that operate AI systems successfully are the ones that built observability in from day one. The teams that bolt it on after the first incident are always playing catch-up.
I structure AI observability around three pillars, each serving a different operational need:
```
PILLAR 1: TRACES          PILLAR 2: METRICS         PILLAR 3: EVALS
────────────────          ─────────────────         ───────────────
What happened in          What is happening         How well is the
this specific             across the system         system performing
request?                  right now?                over time?

Debugging                 Monitoring                Quality assurance
Per-request detail        Real-time aggregates      Batch assessment
"Why did this fail?"      "Is something wrong?"     "Are we getting worse?"
```
A trace captures the complete lifecycle of an AI interaction -- from user input through preprocessing, model invocation, post-processing, and response delivery. For agentic systems with multiple LLM calls, a trace captures the entire chain with parent-child relationships.
Here is the trace schema I use:
```typescript
interface AITrace {
  traceId: string
  parentTraceId?: string     // For multi-step agent chains
  timestamp: string
  duration_ms: number

  // Input
  input: {
    userMessage: string
    systemPrompt: string
    contextChunks?: string[] // RAG context
    inputTokens: number
    useCase?: string         // Tag for per-feature metric aggregation
  }

  // Model
  model: {
    provider: string
    modelId: string
    temperature: number
    maxTokens: number
  }

  // Output
  output: {
    response: string
    outputTokens: number
    finishReason: string     // 'stop' | 'length' | 'content_filter'
  }

  // Operational
  operational: {
    latencyMs: number
    cost: number
    cached: boolean
    retryCount: number
    circuitBreakerState: string
    guardrailsTriggered: string[]
  }

  // Quality (computed async)
  quality?: {
    relevanceScore?: number
    groundednessScore?: number
    userFeedback?: 'positive' | 'negative'
  }
}
```
Every LLM call in my systems produces a trace with this schema. The trace is the atomic unit of AI observability -- it is what you pull up when debugging a specific issue.
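A minimal sketch of producing such a trace in Python, assuming a hypothetical `call_fn` that performs the model invocation and returns the response plus token counts (the helper name and return shape are illustrative, not from the source):

```python
import time
import uuid
from typing import Callable, Optional

def traced_llm_call(call_fn: Callable, user_message: str, model_id: str,
                    parent_trace_id: Optional[str] = None) -> dict:
    """Wrap an LLM call and emit a trace dict following the schema above.

    `call_fn` is a hypothetical callable returning
    (response_text, input_tokens, output_tokens).
    """
    start = time.time()
    response, input_tokens, output_tokens = call_fn(user_message)
    return {
        "traceId": str(uuid.uuid4()),
        "parentTraceId": parent_trace_id,  # set for multi-step agent chains
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(start)),
        "duration_ms": int((time.time() - start) * 1000),
        "input": {"userMessage": user_message, "inputTokens": input_tokens},
        "model": {"modelId": model_id},
        "output": {"response": response, "outputTokens": output_tokens},
    }
```

In a real system this wrapper would also populate the operational fields (cost, retries, circuit breaker state) and ship the trace asynchronously so it never adds latency to the user-facing path.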
Metrics are the aggregated signals you monitor in real time. I divide them into four categories:
- **Latency metrics:** p95 latency per model and per use case; latency broken out by cached vs. uncached responses.
- **Cost metrics:** cost per interaction, hourly burn rate, daily total vs. budget, cache savings.
- **Reliability metrics:** error rate, retry counts, circuit breaker state changes, fallback activations.
- **Quality metrics:** guardrail trigger rate, user feedback ratio, finish reason distribution.
```python
# Metrics collection in the gateway layer
class MetricsCollector:
    def record_interaction(self, trace: AITrace):
        # Latency
        self.histogram("ai.latency_ms", trace.operational.latencyMs,
                       tags={"model": trace.model.modelId,
                             "use_case": trace.input.useCase})

        # Cost
        self.gauge("ai.cost_per_hour",
                   self.calculate_hourly_rate(),
                   tags={"model": trace.model.modelId})
        self.counter("ai.total_cost", trace.operational.cost,
                     tags={"model": trace.model.modelId,
                           "use_case": trace.input.useCase})

        # Reliability
        if trace.operational.retryCount > 0:
            self.counter("ai.retries",
                         trace.operational.retryCount,
                         tags={"provider": trace.model.provider})

        # Quality signals
        if trace.operational.guardrailsTriggered:
            for rail in trace.operational.guardrailsTriggered:
                self.counter("ai.guardrail_triggered",
                             1, tags={"rail": rail})
```
Metrics tell you something is changing. Evaluations tell you whether the change matters. I run evaluations at two cadences:
Real-time spot checks. Sample 1-5% of production traffic and run lightweight quality assessments. This catches acute quality drops within hours.
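One way to implement the sampling decision is a deterministic hash-based sampler rather than a random coin flip, so a given trace is always in or out of the sample and re-running the pipeline is reproducible. A minimal sketch, with the sample rate chosen from the 1-5% band above:

```python
import hashlib

SAMPLE_RATE = 0.02  # 2% of production traffic, within the 1-5% band

def should_spot_check(trace_id: str) -> bool:
    """Deterministic sampler: hash the trace id into one of 10,000 buckets
    and select the first SAMPLE_RATE fraction of them. The same trace id
    always yields the same decision."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10000
    return bucket < SAMPLE_RATE * 10000
```

Traces that pass this check would be enqueued for the lightweight quality assessments; everything else skips evaluation entirely, keeping eval cost proportional to the sample rate.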
Weekly deep evaluations. Run a comprehensive eval suite against a representative sample. Track scores over time. This catches gradual quality drift that spot checks might miss.
```python
import random

# Weekly eval pipeline
def run_weekly_eval(eval_suite, production_traces):
    # Sample recent production traffic
    sample = random.sample(
        production_traces,
        min(500, len(production_traces))
    )

    scores = {}
    for trace in sample:
        scores[trace.traceId] = {
            "relevance": eval_relevance(
                trace.input.userMessage,
                trace.output.response
            ),
            "groundedness": eval_groundedness(
                trace.output.response,
                trace.input.contextChunks
            ),
            "format_compliance": eval_format(
                trace.output.response,
                eval_suite.expected_format
            )
        }

    # Compare against baseline; positive drift means the score dropped
    current_avg = aggregate_scores(scores)
    baseline = load_baseline_scores()
    for metric, score in current_avg.items():
        drift = baseline[metric] - score
        if drift > DRIFT_THRESHOLD:
            create_alert(
                f"Quality drift: {metric} dropped "
                f"{drift:.1%} from baseline"
            )

    store_eval_results(scores, week=current_week())
```
The AI observability market has matured significantly. Here is how I evaluate and select tools:
Langfuse is my recommendation for most teams. It is open-source, offers a generous free tier (50K observations/month), and provides tracing, prompt management, and evaluation in a single platform. The self-hosted option means you keep sensitive data in your own infrastructure. For production scale, the Pro tier starts at $59/month.
Helicone excels for teams that want minimal integration effort. It operates as a proxy -- you change your API base URL, and all requests are automatically logged. The 50-80ms latency overhead is acceptable for most applications, and the built-in semantic caching can reduce costs by 20-30%.
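A sketch of what the proxy-based integration looks like in practice. The base URL and header name below are assumptions based on Helicone's documentation as I recall it; verify them against the current docs before use:

```python
import os

# Proxy-based logging: point the SDK at the proxy instead of the provider.
# URL and header name are assumptions -- confirm against Helicone's docs.
client_config = {
    "base_url": "https://oai.helicone.ai/v1",  # proxy, not api.openai.com
    "api_key": os.environ.get("OPENAI_API_KEY", ""),
    "default_headers": {
        "Helicone-Auth": f"Bearer {os.environ.get('HELICONE_API_KEY', '')}",
    },
}
# e.g. OpenAI(**client_config) -- every request now flows through the proxy
# and is logged automatically, with no per-call instrumentation.
```

The appeal is that no application code changes: the same client object is used everywhere, and observability is a deployment-time configuration decision.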
LangSmith is the right choice if your stack is built on LangChain or LangGraph. The integration is automatic, and the debugging tools understand LangChain's internals. Its overhead is virtually zero, making it suitable for latency-critical applications.
Datadog LLM Observability is the enterprise option for teams already using Datadog. It integrates AI metrics alongside your existing infrastructure monitoring, which eliminates the "another dashboard" problem.
My general guidance: start with Langfuse for its flexibility and open-source foundation. Migrate to a managed solution if operational overhead becomes a constraint.
Bad alerting is worse than no alerting. If your team ignores alerts because 90% are false positives, you have no alerting. I design AI alerts with a clear severity taxonomy:
```
CRITICAL (page the on-call):
├── All LLM providers down (Tier 3 degradation active)
├── Daily cost exceeds 3x budget
└── Error rate > 20% for 5+ minutes

WARNING (Slack notification, investigate within 4 hours):
├── Primary provider circuit breaker open
├── Hourly cost exceeds 2x expected
├── p95 latency > 2x baseline
└── Guardrail trigger rate > 10%

INFO (daily digest, review in standup):
├── Weekly eval scores declined
├── Cache hit rate dropped below threshold
├── New model version available for testing
└── Rate limit utilization > 70%
```
The principle: a CRITICAL alert means someone needs to act immediately. If your system has a CRITICAL alert that does not require immediate action, demote it. Alert fatigue is the enemy of operational excellence.
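The taxonomy above can be encoded directly in the alert router, so the severity label, not ad-hoc per-alert wiring, decides the channel. A minimal sketch (channel names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str  # "CRITICAL" | "WARNING" | "INFO"

# Severity decides the channel, mirroring the taxonomy above.
ROUTES = {
    "CRITICAL": "pagerduty",    # page the on-call
    "WARNING": "slack",         # investigate within 4 hours
    "INFO": "daily_digest",     # review in standup
}

def route_alert(alert: Alert) -> str:
    """Return the delivery channel for an alert. Unknown severities are
    demoted to the digest rather than paging anyone by accident."""
    return ROUTES.get(alert.severity, "daily_digest")
```

Defaulting unknown severities downward, not upward, is deliberate: a miscategorized alert that lands in the digest is an annoyance, while one that pages the on-call at 3 a.m. erodes trust in the whole alerting system.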
Every production AI system I architect gets a single-pane dashboard with four sections:
```
┌──────────────────────────┬──────────────────────────┐
│ SYSTEM HEALTH            │ COST TRACKING            │
│ ● Provider status        │ $ Current burn rate      │
│ ● Circuit breaker state  │ $ Daily total vs budget  │
│ ● Error rate (5min)      │ $ Cost per interaction   │
│ ● p95 latency            │ $ Cache savings          │
├──────────────────────────┼──────────────────────────┤
│ QUALITY SIGNALS          │ TRAFFIC PATTERNS         │
│ -- Eval score trend      │ # Requests per minute    │
│ -- Guardrail triggers    │ # By model / use case    │
│ -- User feedback ratio   │ # Fallback activations   │
│ -- Finish reason dist.   │ # Token distribution     │
└──────────────────────────┴──────────────────────────┘
```
This dashboard is the first thing I open in the morning and the first thing I check after any deployment. It gives me a complete picture of system health in under 30 seconds.
| Tool | Best For | Integration Effort | Latency Overhead | Cost | When to Skip |
|------|----------|--------------------|------------------|------|--------------|
| Langfuse | Most teams; flexible, open-source, self-hostable | Medium (SDK integration) | Low (async reporting) | Free tier: 50K obs/month; Pro: $59/month | If you need zero-code setup |
| Helicone | Minimal integration effort; proxy-based | Low (change base URL) | 50-80ms per request | Free tier available; usage-based pricing | Latency-critical paths where 50ms matters |
| LangSmith | LangChain/LangGraph stacks | Near-zero (automatic if using LangChain) | Near-zero | Free tier: 5K traces/month; Plus: $39/month | Non-LangChain stacks -- the integration advantage disappears |
| Datadog LLM Obs | Enterprise teams already on Datadog | Medium (agent + SDK) | Low (agent-based) | Enterprise pricing (contact sales) | Teams without existing Datadog investment -- the value is integration, not standalone |
| Custom (OpenTelemetry) | Teams with strict data residency or unique requirements | High (build everything) | Controllable | Infrastructure cost only | When an existing tool covers 80%+ of your requirements |
Start with Langfuse unless you have a strong reason not to. Migrate if and when operational overhead or feature gaps justify the switch. The worst outcome is building custom observability when a mature tool would have worked.
Before considering your observability stack production-ready:

- Every LLM call emits a trace with input, model, output, operational, and (async) quality fields.
- Latency, cost, reliability, and quality metrics flow to a real-time dashboard.
- Spot-check and weekly evaluation pipelines run against production samples, with drift alerts.
- Alerts are routed by severity, and every CRITICAL alert demands immediate action.
- The single-pane dashboard covers system health, cost, quality signals, and traffic patterns.
Observability is not a feature. It is the infrastructure that makes every other feature trustworthy.
You can now see everything happening in your AI system. The final lesson turns that visibility into operational confidence. We build the runbooks, architecture decision records, and release checklists that let your team deploy on Fridays -- because boring deployments are safe deployments.