Traditional software observability is a solved problem. You have structured logs, distributed tracing, metrics dashboards, and alert rules. A request comes in, hits your API, queries a database, and returns a response. You can trace the entire path, measure the latency at each step, and alert when something deviates.
AI systems break this model in two fundamental ways.
First, the most important failures are semantic, not structural. The request succeeds, the response returns in 800ms, the HTTP status is 200 -- and the answer is completely wrong. Your existing monitoring will show green across the board while your users lose trust.
Second, the cost of each request is variable and significant. A traditional API call costs fractions of a cent in compute. An LLM call can cost $0.01 to $0.50 depending on the model and token count. Cost is not just an infrastructure concern -- it is a product metric that needs real-time visibility.
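To make that concrete, here is a minimal sketch of per-call cost accounting from token counts. The per-1K-token rates below are placeholder assumptions for illustration, not real provider quotes; substitute your provider's current pricing.

```python
# Illustrative per-1K-token rates (assumed for this sketch -- not real
# quotes; check your provider's current pricing).
PRICING = {
    "small-model": {"input": 0.00015, "output": 0.0006},
    "large-model": {"input": 0.005, "output": 0.015},
}

def llm_call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one LLM call from its token counts."""
    rates = PRICING[model]
    return (input_tokens / 1000) * rates["input"] + \
           (output_tokens / 1000) * rates["output"]
```

With the assumed rates, a 2,000-token prompt and 500-token completion on the large model costs a couple of cents, while the same call on the small model costs a fraction of a tenth of a cent: exactly the variability that makes cost a first-class product metric.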
In my experience, the teams that operate AI systems successfully are the ones that built observability in from day one. The teams that bolt it on after the first incident are always playing catch-up.
I structure AI observability around three pillars, each serving a different operational need:
```
PILLAR 1: TRACES          PILLAR 2: METRICS         PILLAR 3: EVALS
────────────────          ─────────────────         ───────────────
What happened in          What is happening         How well is the
this specific             across the system         system performing
request?                  right now?                over time?

Debugging                 Monitoring                Quality assurance
Per-request detail        Real-time aggregates      Batch assessment
"Why did this fail?"      "Is something wrong?"     "Are we getting worse?"
```
A trace captures the complete lifecycle of an AI interaction -- from user input through preprocessing, model invocation, post-processing, and response delivery. For agentic systems with multiple LLM calls, a trace captures the entire chain with parent-child relationships.
Here is the trace schema I use:
```typescript
interface AITrace {
  traceId: string
  parentTraceId?: string     // For multi-step agent chains
  timestamp: string
  duration_ms: number

  // Input
  input: {
    userMessage: string
    systemPrompt: string
    contextChunks?: string[] // RAG context
    inputTokens: number
    useCase?: string         // Tag for per-feature metric aggregation
  }

  // Model
  model: {
    provider: string
    modelId: string
    temperature: number
    maxTokens: number
  }

  // Output
  output: {
    response: string
    outputTokens: number
    finishReason: string     // 'stop' | 'length' | 'content_filter'
  }

  // Operational
  operational: {
    latencyMs: number
    cost: number
    cached: boolean
    retryCount: number
    circuitBreakerState: string
    guardrailsTriggered: string[]
  }

  // Quality (computed async)
  quality?: {
    relevanceScore?: number
    groundednessScore?: number
    userFeedback?: 'positive' | 'negative'
  }
}
```
Every LLM call in my systems produces a trace with this schema. The trace is the atomic unit of AI observability -- it is what you pull up when debugging a specific issue.
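A minimal sketch of producing such a trace in Python, assuming a hypothetical `call_fn` that performs the model invocation and returns the response plus token counts (the helper name and return shape are illustrative, not from the source):

```python
import time
import uuid
from typing import Callable, Optional

def traced_llm_call(call_fn: Callable, user_message: str, model_id: str,
                    parent_trace_id: Optional[str] = None) -> dict:
    """Wrap an LLM call and emit a trace dict following the schema above.

    `call_fn` is a hypothetical callable returning
    (response_text, input_tokens, output_tokens).
    """
    start = time.time()
    response, input_tokens, output_tokens = call_fn(user_message)
    return {
        "traceId": str(uuid.uuid4()),
        "parentTraceId": parent_trace_id,  # set for multi-step agent chains
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(start)),
        "duration_ms": int((time.time() - start) * 1000),
        "input": {"userMessage": user_message, "inputTokens": input_tokens},
        "model": {"modelId": model_id},
        "output": {"response": response, "outputTokens": output_tokens},
    }
```

In a real system this wrapper would also populate the operational fields (cost, retries, circuit breaker state) and ship the trace asynchronously so it never adds latency to the user-facing path.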
Metrics are the aggregated signals you monitor in real time. I divide them into four categories:
- **Latency metrics:** p95 latency per model and per use case; latency broken out by cached vs. uncached responses.
- **Cost metrics:** cost per interaction, hourly burn rate, daily total vs. budget, cache savings.
- **Reliability metrics:** error rate, retry counts, circuit breaker state changes, fallback activations.
- **Quality metrics:** guardrail trigger rate, user feedback ratio, finish reason distribution.
```python
# Metrics collection in the gateway layer
class MetricsCollector:
    def record_interaction(self, trace: AITrace):
        # Latency
        self.histogram("ai.latency_ms", trace.operational.latencyMs,
                       tags={"model": trace.model.modelId,
                             "use_case": trace.input.useCase})

        # Cost
        self.gauge("ai.cost_per_hour",
                   self.calculate_hourly_rate(),
                   tags={"model": trace.model.modelId})
        self.counter("ai.total_cost", trace.operational.cost,
                     tags={"model": trace.model.modelId,
                           "use_case": trace.input.useCase})

        # Reliability
        if trace.operational.retryCount > 0:
            self.counter("ai.retries",
                         trace.operational.retryCount,
                         tags={"provider": trace.model.provider})

        # Quality signals
        if trace.operational.guardrailsTriggered:
            for rail in trace.operational.guardrailsTriggered:
                self.counter("ai.guardrail_triggered",
                             1, tags={"rail": rail})
```
Metrics tell you something is changing. Evaluations tell you whether the change matters. I run evaluations at two cadences:
Real-time spot checks. Sample 1-5% of production traffic and run lightweight quality assessments. This catches acute quality drops within hours.
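One way to implement the sampling decision is a deterministic hash-based sampler rather than a random coin flip, so a given trace is always in or out of the sample and re-running the pipeline is reproducible. A minimal sketch, with the sample rate chosen from the 1-5% band above:

```python
import hashlib

SAMPLE_RATE = 0.02  # 2% of production traffic, within the 1-5% band

def should_spot_check(trace_id: str) -> bool:
    """Deterministic sampler: hash the trace id into one of 10,000 buckets
    and select the first SAMPLE_RATE fraction of them. The same trace id
    always yields the same decision."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10000
    return bucket < SAMPLE_RATE * 10000
```

Traces that pass this check would be enqueued for the lightweight quality assessments; everything else skips evaluation entirely, keeping eval cost proportional to the sample rate.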
Weekly deep evaluations. Run a comprehensive eval suite against a representative sample. Track scores over time. This catches gradual quality drift that spot checks might miss.
```python
import random

# Weekly eval pipeline
def run_weekly_eval(eval_suite, production_traces):
    # Sample recent production traffic
    sample = random.sample(
        production_traces,
        min(500, len(production_traces))
    )

    scores = {}
    for trace in sample:
        scores[trace.traceId] = {
            "relevance": eval_relevance(
                trace.input.userMessage,
                trace.output.response
            ),
            "groundedness": eval_groundedness(
                trace.output.response,
                trace.input.contextChunks
            ),
            "format_compliance": eval_format(
                trace.output.response,
                eval_suite.expected_format
            )
        }

    # Compare against baseline; positive drift means the score dropped
    current_avg = aggregate_scores(scores)
    baseline = load_baseline_scores()
    for metric, score in current_avg.items():
        drift = baseline[metric] - score
        if drift > DRIFT_THRESHOLD:
            create_alert(
                f"Quality drift: {metric} dropped "
                f"{drift:.1%} from baseline"
            )

    store_eval_results(scores, week=current_week())
```
The AI observability market has matured significantly. Here is how I evaluate and select tools:
Langfuse is my recommendation for most teams. It is open-source, offers a generous free tier (50K observations/month), and provides tracing, prompt management, and evaluation in a single platform. The self-hosted option means you keep sensitive data in your own infrastructure. For production scale, the Pro tier starts at $59/month.
Helicone excels for teams that want minimal integration effort. It operates as a proxy -- you change your API base URL, and all requests are automatically logged. The 50-80ms latency overhead is acceptable for most applications, and the built-in semantic caching can reduce costs by 20-30%.
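A sketch of what the proxy-based integration looks like in practice. The base URL and header name below are assumptions based on Helicone's documentation as I recall it; verify them against the current docs before use:

```python
import os

# Proxy-based logging: point the SDK at the proxy instead of the provider.
# URL and header name are assumptions -- confirm against Helicone's docs.
client_config = {
    "base_url": "https://oai.helicone.ai/v1",  # proxy, not api.openai.com
    "api_key": os.environ.get("OPENAI_API_KEY", ""),
    "default_headers": {
        "Helicone-Auth": f"Bearer {os.environ.get('HELICONE_API_KEY', '')}",
    },
}
# e.g. OpenAI(**client_config) -- every request now flows through the proxy
# and is logged automatically, with no per-call instrumentation.
```

The appeal is that no application code changes: the same client object is used everywhere, and observability is a deployment-time configuration decision.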
LangSmith is the right choice if your stack is built on LangChain or LangGraph. The integration is automatic, and the debugging tools understand LangChain's internals. Its overhead is virtually zero, making it suitable for latency-critical applications.
Datadog LLM Observability is the enterprise option for teams already using Datadog. It integrates AI metrics alongside your existing infrastructure monitoring, which eliminates the "another dashboard" problem.
My general guidance: start with Langfuse for its flexibility and open-source foundation. Migrate to a managed solution if operational overhead becomes a constraint.
Bad alerting is worse than no alerting. If your team ignores alerts because 90% are false positives, you have no alerting. I design AI alerts with a clear severity taxonomy:
```
CRITICAL (page the on-call):
├── All LLM providers down (Tier 3 degradation active)
├── Daily cost exceeds 3x budget
└── Error rate > 20% for 5+ minutes

WARNING (Slack notification, investigate within 4 hours):
├── Primary provider circuit breaker open
├── Hourly cost exceeds 2x expected
├── p95 latency > 2x baseline
└── Guardrail trigger rate > 10%

INFO (daily digest, review in standup):
├── Weekly eval scores declined
├── Cache hit rate dropped below threshold
├── New model version available for testing
└── Rate limit utilization > 70%
```
The principle: a CRITICAL alert means someone needs to act immediately. If your system has a CRITICAL alert that does not require immediate action, demote it. Alert fatigue is the enemy of operational excellence.
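The taxonomy above can be encoded directly in the alert router, so the severity label, not ad-hoc per-alert wiring, decides the channel. A minimal sketch (channel names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str  # "CRITICAL" | "WARNING" | "INFO"

# Severity decides the channel, mirroring the taxonomy above.
ROUTES = {
    "CRITICAL": "pagerduty",    # page the on-call
    "WARNING": "slack",         # investigate within 4 hours
    "INFO": "daily_digest",     # review in standup
}

def route_alert(alert: Alert) -> str:
    """Return the delivery channel for an alert. Unknown severities are
    demoted to the digest rather than paging anyone by accident."""
    return ROUTES.get(alert.severity, "daily_digest")
```

Defaulting unknown severities downward, not upward, is deliberate: a miscategorized alert that lands in the digest is an annoyance, while one that pages the on-call at 3 a.m. erodes trust in the whole alerting system.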
Every production AI system I architect gets a single-pane dashboard with four sections:
```
┌──────────────────────────┬──────────────────────────┐
│ SYSTEM HEALTH            │ COST TRACKING            │
│ ● Provider status        │ $ Current burn rate      │
│ ● Circuit breaker state  │ $ Daily total vs budget  │
│ ● Error rate (5min)      │ $ Cost per interaction   │
│ ● p95 latency            │ $ Cache savings          │
├──────────────────────────┼──────────────────────────┤
│ QUALITY SIGNALS          │ TRAFFIC PATTERNS         │
│ -- Eval score trend      │ # Requests per minute    │
│ -- Guardrail triggers    │ # By model / use case    │
│ -- User feedback ratio   │ # Fallback activations   │
│ -- Finish reason dist.   │ # Token distribution     │
└──────────────────────────┴──────────────────────────┘
```
This dashboard is the first thing I open in the morning and the first thing I check after any deployment. It gives me a complete picture of system health in under 30 seconds.
| Tool | Best For | Integration Effort | Latency Overhead | Cost | When to Skip |
|------|----------|--------------------|------------------|------|--------------|
| Langfuse | Most teams; flexible, open-source, self-hostable | Medium (SDK integration) | Low (async reporting) | Free tier: 50K obs/month; Pro: $59/month | If you need zero-code setup |
| Helicone | Minimal integration effort; proxy-based | Low (change base URL) | 50-80ms per request | Free tier available; usage-based pricing | Latency-critical paths where 50ms matters |
| LangSmith | LangChain/LangGraph stacks | Near-zero (automatic if using LangChain) | Near-zero | Free tier: 5K traces/month; Plus: $39/month | Non-LangChain stacks -- the integration advantage disappears |
| Datadog LLM Obs | Enterprise teams already on Datadog | Medium (agent + SDK) | Low (agent-based) | Enterprise pricing (contact sales) | Teams without existing Datadog investment -- the value is integration, not standalone |
| Custom (OpenTelemetry) | Teams with strict data residency or unique requirements | High (build everything) | Controllable | Infrastructure cost only | When an existing tool covers 80%+ of your requirements |
Start with Langfuse unless you have a strong reason not to. Migrate if and when operational overhead or feature gaps justify the switch. The worst outcome is building custom observability when a mature tool would have worked.
Before considering your observability stack production-ready:

- Every LLM call emits a trace with input, model, output, operational, and (async) quality fields.
- Latency, cost, reliability, and quality metrics flow to a real-time dashboard.
- Spot-check and weekly evaluation pipelines run against production samples, with drift alerts.
- Alerts are routed by severity, and every CRITICAL alert demands immediate action.
- The single-pane dashboard covers system health, cost, quality signals, and traffic patterns.
Observability is not a feature. It is the infrastructure that makes every other feature trustworthy.
You can now see everything happening in your AI system. The final lesson turns that visibility into operational confidence. We build the runbooks, architecture decision records, and release checklists that let your team deploy on Fridays -- because boring deployments are safe deployments.