In March 2024, a client called me in a panic. Their primary LLM provider had announced a pricing change that would triple their costs for the model they had built their entire product around. They had six weeks to migrate or absorb an additional $40K/month.
They could not migrate. Their codebase was littered with provider-specific SDK calls, prompt formats, and response parsing logic. The model's quirks had been baked into business logic. What should have been a configuration change became a three-month rewrite.
This is vendor lock-in for AI systems, and it is more dangerous than traditional SaaS lock-in because the AI landscape moves faster. Models are deprecated quarterly. Pricing changes without negotiation for smaller customers. New providers emerge that are 10x cheaper for your specific use case. If your architecture cannot respond to these shifts, your business is at the mercy of your vendor's roadmap.
The vendor off-ramp pattern is the architectural discipline that prevents this. It is the single most strategically important pattern in this course.
The vendor off-ramp pattern separates your AI system into three layers, each with a clear responsibility:
┌─────────────────────────────────────────────┐
│              APPLICATION LAYER              │
│   Your business logic, prompts, workflows   │
│      Speaks to the Gateway Layer only       │
├─────────────────────────────────────────────┤
│                GATEWAY LAYER                │
│  Unified interface, routing, cost control   │
│ Translates between Application and Provider │
├──────────┬──────────┬──────────┬────────────┤
│ Provider │ Provider │ Provider │  Provider  │
│ Adapter  │ Adapter  │ Adapter  │  Adapter   │
│ (Claude) │ (GPT-4o) │ (Gemini) │ (Mistral)  │
└──────────┴──────────┴──────────┴────────────┘
       PROVIDER LAYER (interchangeable)
Application Layer: Your product code never touches a provider SDK directly. It sends structured requests to the gateway and receives structured responses. The application does not know or care which model served the request.
Gateway Layer: The single point of control for all LLM interactions. Handles routing, failover, cost tracking, rate limiting, and observability. This is the architectural choke point where you enforce policy.
Provider Layer: Individual adapters that translate the gateway's unified format into provider-specific API calls. Adding a new provider means writing one adapter, not changing application code.
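That adapter contract can stay tiny. Here is a sketch of what the interface might look like; the simplified request/response shapes and the `EchoAdapter` stub are illustrative assumptions, not this lesson's production types:

```typescript
// A sketch of the provider contract. The request/response shapes are
// simplified here; in production they are owned by your gateway.
type LLMRequest = {
  messages: { role: string; content: string }[]
  maxTokens: number
}
type LLMResponse = { content: string; provider: string }

interface ProviderAdapter {
  estimateCost(request: LLMRequest): number // dollars, checked pre-flight
  generate(request: LLMRequest): Promise<LLMResponse>
}

// Adding a provider is one new class -- application code never changes.
class EchoAdapter implements ProviderAdapter {
  estimateCost(_request: LLMRequest): number {
    return 0 // a stub provider costs nothing
  }
  async generate(_request: LLMRequest): Promise<LLMResponse> {
    return { content: 'ok', provider: 'echo' }
  }
}
```

Because every adapter satisfies the same interface, the gateway can treat providers as interchangeable entries in a map.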
Here is how I implement the gateway layer in production. This is not a toy example -- this pattern runs in systems handling millions of requests:
// gateway.ts -- The central orchestrator
import { ProviderAdapter, LLMRequest, LLMResponse } from './types'
import { CostTracker } from './cost-tracker'
import { CircuitBreaker } from './circuit-breaker'
import { withTimeout } from './util'

interface RouteConfig {
  primary: string
  fallbacks: string[]
  maxRetries: number
  timeoutMs: number
  costCeiling: number // max cost per request in dollars
}

class LLMGateway {
  private providers: Map<string, ProviderAdapter>
  private routes: Map<string, RouteConfig>
  private costTracker: CostTracker
  private circuitBreaker: CircuitBreaker

  async generate(
    request: LLMRequest,
    route: string
  ): Promise<LLMResponse> {
    const config = this.routes.get(route)
    if (!config) {
      throw new Error(`Unknown route: ${route}`)
    }
    const providers = [config.primary, ...config.fallbacks]

    for (const providerId of providers) {
      if (this.circuitBreaker.isOpen(providerId)) {
        continue // Skip providers with open circuit breakers
      }
      const adapter = this.providers.get(providerId)
      if (!adapter) {
        continue // No adapter registered for this provider
      }
      try {
        const costEstimate = adapter.estimateCost(request)
        if (costEstimate > config.costCeiling) {
          this.logCostExceeded(providerId, costEstimate)
          continue
        }
        const response = await withTimeout(
          adapter.generate(request),
          config.timeoutMs
        )
        this.costTracker.record(providerId, response.usage)
        this.circuitBreaker.recordSuccess(providerId)
        return response
      } catch (error) {
        this.circuitBreaker.recordFailure(providerId)
        this.logFailover(providerId, error)
        // Continue to the next provider in the fallback chain
      }
    }
    // All providers exhausted
    return this.handleAllProvidersDown(request, route)
  }
}
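The gateway references `withTimeout` and a `CircuitBreaker` without defining them. Here are minimal sketches of both; these are my implementations, the failure threshold and cooldown values are illustrative, and a production breaker would add metrics and more careful half-open probing:

```typescript
// withTimeout: reject if the provider call outlasts the route's time budget.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timeout after ${ms}ms`)), ms)
    promise.then(
      value => { clearTimeout(timer); resolve(value) },
      err => { clearTimeout(timer); reject(err) }
    )
  })
}

// CircuitBreaker: stop sending traffic to a provider after repeated
// failures, then let one trial request through after a cooldown.
class CircuitBreaker {
  private failures = new Map<string, number>()
  private openedAt = new Map<string, number>()

  constructor(
    private threshold = 5,       // consecutive failures before opening
    private cooldownMs = 30_000  // how long to keep the circuit open
  ) {}

  isOpen(providerId: string): boolean {
    const opened = this.openedAt.get(providerId)
    if (opened === undefined) return false
    if (Date.now() - opened >= this.cooldownMs) {
      // Cooldown elapsed: close the circuit and allow a trial request
      this.openedAt.delete(providerId)
      return false
    }
    return true
  }

  recordSuccess(providerId: string): void {
    this.failures.set(providerId, 0)
    this.openedAt.delete(providerId)
  }

  recordFailure(providerId: string): void {
    const count = (this.failures.get(providerId) ?? 0) + 1
    this.failures.set(providerId, count)
    if (count >= this.threshold) this.openedAt.set(providerId, Date.now())
  }
}
```

The breaker is what keeps a dead provider from adding its full timeout to every request: after a few consecutive failures, the gateway skips it entirely until the cooldown passes.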
// adapters/anthropic.ts -- Provider-specific translation
class AnthropicAdapter implements ProviderAdapter {
  async generate(request: LLMRequest): Promise<LLMResponse> {
    const anthropicRequest = {
      model: this.modelId,
      max_tokens: request.maxTokens,
      messages: this.translateMessages(request.messages),
      system: request.systemPrompt,
    }
    const response = await this.client.messages.create(
      anthropicRequest
    )
    return {
      content: response.content[0].text,
      usage: {
        inputTokens: response.usage.input_tokens,
        outputTokens: response.usage.output_tokens,
      },
      latencyMs: this.measureLatency(),
      model: response.model,
      provider: 'anthropic',
      // cache_read_input_tokens is absent when caching is not in use
      cached: (response.usage.cache_read_input_tokens ?? 0) > 0,
    }
  }
}
The key architectural decision: the LLMRequest and LLMResponse types are owned by your gateway, not by any provider. Every provider adapter translates to and from these types. This is where the portability lives.
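Concretely, such gateway-owned types might look like the following sketch, inferred from the fields the gateway and adapter code above actually use:

```typescript
// types.ts -- provider-neutral request/response, owned by the gateway
interface LLMMessage {
  role: 'user' | 'assistant'
  content: string
}

interface LLMRequest {
  systemPrompt?: string
  messages: LLMMessage[]
  maxTokens: number
}

interface LLMUsage {
  inputTokens: number
  outputTokens: number
}

interface LLMResponse {
  content: string
  usage: LLMUsage
  latencyMs: number
  model: string     // the concrete model that served the request
  provider: string  // e.g. 'anthropic', 'openai'
  cached: boolean   // whether a provider-side cache was hit
}
```

Note that nothing in these types mentions a vendor's SDK: `input_tokens` versus `inputTokens` naming, content-block arrays, and other provider quirks are flattened away inside the adapters.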
You do not have to build this from scratch. The ecosystem now offers mature gateway solutions:
LiteLLM is the most widely adopted open-source gateway. It supports 100+ LLM providers through an OpenAI-compatible interface. You can swap from Anthropic to Google to a self-hosted model by changing a configuration string. It handles retries, fallbacks, and budget controls out of the box.
Bifrost (by Maxim AI) is a newer Go-based gateway focused on performance, adding less than 11 microseconds of overhead at 5,000 requests per second -- 50x faster than LiteLLM for latency-critical paths.
Portkey offers a managed gateway with built-in observability, caching, and a visual interface for managing routes and fallbacks.
My recommendation: start with LiteLLM for most production systems. Its OpenAI-compatible interface means your application code uses the familiar openai SDK format, and routing happens at the gateway level. If you need microsecond-level latency, evaluate Bifrost.
# LiteLLM example -- switching providers is a config change
from litellm import completion

# Route to Anthropic
response = completion(
    model="anthropic/claude-3.5-sonnet",
    messages=[{"role": "user", "content": prompt}]
)

# Route to OpenAI -- same interface, different config
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": prompt}]
)

# Route to self-hosted -- still the same interface
response = completion(
    model="ollama/llama3.1",
    messages=[{"role": "user", "content": prompt}]
)
The gateway handles API translation, but prompts are not perfectly portable between models. Each model has behavioral differences -- how it interprets system prompts, how it handles ambiguity, how it formats outputs.
I handle this with a prompt registry that stores model-specific adaptations:
// prompt-registry.ts
const promptRegistry = {
  'customer-support-summarize': {
    base: {
      system: 'You are a customer support analyst...',
      outputFormat: 'JSON with fields: summary, sentiment, action_items'
    },
    adaptations: {
      'anthropic/claude-3.5-sonnet': {
        // Claude responds well to explicit XML-style structure
        system: 'You are a customer support analyst...\n\n' +
          'Respond in this exact format:\n' +
          '<summary>...</summary>\n' +
          '<sentiment>...</sentiment>\n' +
          '<action_items>...</action_items>'
      },
      'openai/gpt-4o': {
        // GPT-4o responds well to a JSON schema in the system prompt
        system: 'You are a customer support analyst...\n\n' +
          'Respond with valid JSON matching this schema: {...}'
      }
    }
  }
}
When the gateway routes a request to a different provider, it pulls the appropriate prompt adaptation. The application layer never sees this complexity.
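The lookup itself can be a few lines inside the gateway. A sketch, where the `resolvePrompt` helper and the registry typing are mine, following the registry shape above:

```typescript
type PromptVariant = { system: string; outputFormat?: string }
type PromptEntry = {
  base: PromptVariant
  adaptations: Record<string, Partial<PromptVariant>>
}

// Return the model-specific adaptation merged over the base prompt;
// fall back to the base when no adaptation exists for that model.
function resolvePrompt(
  registry: Record<string, PromptEntry>,
  task: string,
  model: string
): PromptVariant {
  const entry = registry[task]
  if (!entry) throw new Error(`unknown prompt task: ${task}`)
  return { ...entry.base, ...entry.adaptations[model] }
}
```

Merging over the base means an adaptation only has to override what differs; fields it leaves out, like `outputFormat`, carry through unchanged.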
I run this checklist quarterly for every production AI system:

- Can you switch the primary provider for any route with a config change, not a code change?
- Have you actually exercised the fallback chain recently, rather than just written it?
- Are cost ceilings and per-provider cost tracking still enforced at the gateway?
- Does every model in every fallback chain have a current adaptation in the prompt registry?
- Has any application code grown a direct provider SDK dependency since the last review?
The vendor off-ramp pattern is not just about risk mitigation. It creates strategic leverage: when switching providers is a configuration change rather than a rewrite, you can negotiate pricing from a credible position and adopt cheaper or better models as soon as they appear.
The $60K/month savings I referenced in the previous lesson was only possible because the off-ramp architecture was already in place. Without it, the team would have identified the savings opportunity but been unable to act on it for months.
| Approach | Best For | Limitations | Maintenance Cost |
|----------|----------|-------------|------------------|
| LiteLLM (open-source) | Most production systems; 100+ providers, OpenAI-compatible | Higher latency than Go-based alternatives; Python dependency | Low -- community maintained, config-driven |
| Bifrost (Go-based) | Latency-critical paths; high-throughput batch processing | Newer project, smaller community, fewer provider integrations | Medium -- less ecosystem support |
| Portkey (managed) | Teams that want zero infra overhead; built-in observability | Vendor dependency on the gateway itself; cost at scale | Low operationally, but adds a vendor to manage |
| Custom gateway | Unique routing logic; strict compliance requirements; full control | Engineering time to build and maintain; no community fixes | High -- you own every bug and feature request |
My default recommendation: start with LiteLLM. If latency overhead is unacceptable after benchmarking, evaluate Bifrost. If your team cannot absorb any infrastructure operational load, consider Portkey. Build custom only when the alternatives genuinely do not support your requirements -- not because "we could build it better."
Before considering your vendor off-ramp production-ready:

- Every application-layer call goes through the gateway; no direct provider SDK imports remain.
- Fallback routing has been tested against a real or simulated provider outage, not assumed to work.
- Cost estimates, ceilings, and usage tracking are wired into every route.
- Prompt adaptations exist for each model in each fallback chain, and their outputs have been compared.
Build the off-ramp before you need it. By the time you need it, it is too late to build.
Your system can now switch providers, track costs, and route requests intelligently. But what about the content flowing through it? The next lesson builds the defensive layers -- guardrails and safety valves -- that prevent your AI from producing harmful, off-topic, or financially dangerous outputs. We cover the five-layer guardrail architecture, financial circuit breakers, and the testing protocols that prove they actually work.