The Vendor Off-Ramp: How I Cut $60K/Month in AI Spend
Vendor lock-in in AI is not just annoying. It is existential.
I have watched teams build products on top of a single model
provider, ship fast, celebrate the launch, and then open the
next invoice. One contract renewal, one pricing change, one
rate-limit adjustment, and suddenly your margins are gone.
At Eventbrite, I saw this firsthand. External API costs were
running at $15K per day before I built the caching and
deduplication layer that brought them down to $40/month.
On the Ads platform, improved cost visibility led to
sunsetting a third-party ML ranking system that was costing
$60K/month. In both cases, the fix was not switching
vendors -- it was making vendor choice a routing decision
instead of an architecture decision.
This is the pattern: the Vendor Off-Ramp.
The Problem: import OpenAI in Every File
The pattern I see most often looks like this:
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

async function classifyTransaction(text: string) {
  const response = await client.chat.completions.create({
    model: 'gpt-4',
    messages: [
      { role: 'user', content: `Classify this transaction: ${text}` },
    ],
  });
  return response.choices[0].message.content;
}
The same provider SDK imported in every service. GPT-4 for
classification tasks that a model at one-tenth the cost could
handle. GPT-4 for extracting structured data. GPT-4 for
generating one-sentence summaries. No caching layer. No
fallback provider. No routing logic. Just raw, unoptimized
calls to the most expensive model available, across every
code path.
Ask the team: "What happens if this provider changes pricing
tomorrow? Or has a multi-hour outage?" If the answer is blank
stares, you have a live grenade under your P&L.
The Three-Layer Off-Ramp Pattern
I think about vendor off-ramps in three layers. Each
addresses a different dimension of vendor dependency, and
each compounds the value of the others.
Layer 1: The Model Gateway
The most impactful change: put a gateway between your
application code and your model providers. Instead of every
service importing a vendor SDK directly, every service talks
to your gateway. The gateway handles provider selection,
failover, retry logic, and cost tracking.
You can use an open-source solution like LiteLLM, which
gives you a unified OpenAI-compatible API across 100+ model
providers. Or you can build a thin custom router if you need
domain-specific routing logic.
The principle: your application code should never know
which vendor is serving a request. The moment your business
logic contains a provider name, you have created a dependency
that will cost you money to unwind.
Layer 2: Embedding Portability
This is the one teams overlook until it is too late. If you
are building RAG pipelines, your embeddings are your most
valuable derived asset -- the entire knowledge base of your
application, vectorized and indexed.
The mistake I see repeatedly: teams generate embeddings with
one provider, store only the vectors, and throw away the
source text. When they want to switch embedding providers --
because a new model offers better retrieval quality at half
the cost -- they realize they cannot re-embed without re-
collecting all the original data.
The fix is straightforward: always store the raw text
alongside the embedding vectors. Treat embeddings as a
cache that can be regenerated, not as the source of truth.
When a better embedding model ships (and it will), you run a
background re-indexing job and you are done.
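A minimal sketch of what "text as source of truth, vector as cache" looks like in practice. The `EmbedFn` signature, the `model` field, and `embedDoc` are illustrative assumptions, not a prescribed API:

```typescript
// Hypothetical provider signature -- swap in any embedding API.
type EmbedFn = (text: string) => Promise<number[]>;

// Each record keeps the source text next to its derived vector.
interface EmbeddedDoc {
  id: string;
  text: string;      // source of truth: never discard this
  vector: number[];  // derived cache: regenerable from `text`
  model: string;     // which embedding model produced `vector`
}

async function embedDoc(
  id: string,
  text: string,
  embed: EmbedFn,
  model: string
): Promise<EmbeddedDoc> {
  return { id, text, vector: await embed(text), model };
}
```

Recording which model produced each vector also lets a re-indexing job skip documents that are already on the new model.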
Layer 3: Storage Abstraction
The vector database market is moving fast. Pinecone,
Weaviate, Qdrant, Chroma, pgvector -- each has different
strengths, pricing, and scaling characteristics. Hardcoding
your application to a specific vector database is the storage
equivalent of hardcoding to a specific LLM provider.
The answer is an adapter pattern that lets you swap vector
backends without touching application code. Keep the interface
intentionally minimal: store, query, delete. Everything else is
implementation detail.
These three layers together form the Vendor Off-Ramp: a
set of abstractions that give you freedom to move between
providers based on cost, quality, and reliability -- not
based on how much code you would have to rewrite.
The Implementation
Here is what the architecture looks like in code.
The Gateway Contract
type TaskTier = 'reasoning' | 'standard' | 'classification';

interface CompletionRequest {
  task: TaskTier;
  messages: Message[];
  maxTokens?: number;
  temperature?: number;
}

interface CompletionResponse {
  content: string;
  provider: string;
  model: string;
  usage: { inputTokens: number; outputTokens: number };
  latencyMs: number;
  cost: number;
}

interface ModelGateway {
  complete(request: CompletionRequest): Promise<CompletionResponse>;
  embed(input: string | string[]): Promise<EmbeddingResult>;
}
Every service in the system talks to this interface. Not to OpenAI. Not to Anthropic. Not to Google. To the gateway.
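A call site then looks something like this (a sketch with the contract condensed inline; the injected `gateway` is whatever implementation you wire up):

```typescript
// Types from the gateway contract, condensed for a self-contained example.
type TaskTier = 'reasoning' | 'standard' | 'classification';
interface Message { role: string; content: string }
interface CompletionResponse { content: string; provider: string; model: string }
interface ModelGateway {
  complete(req: { task: TaskTier; messages: Message[] }): Promise<CompletionResponse>;
}

// Application code asks for a capability tier, not a vendor.
async function classifyTransaction(
  gateway: ModelGateway,
  text: string
): Promise<string> {
  const res = await gateway.complete({
    task: 'classification',
    messages: [{ role: 'user', content: `Classify this transaction: ${text}` }],
  });
  return res.content;
}
```

Compare this with the anti-pattern version at the top of the post: same function, but no provider name, no model name, no SDK import.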
The Routing Table
This is where the money is. Instead of sending every request to the most expensive model, you route by task complexity:
interface ModelConfig {
  provider: string;
  model: string;
  priority: number;
  costPer1kInput: number;
  costPer1kOutput: number;
}

const ROUTING_TABLE: Record<TaskTier, ModelConfig[]> = {
  reasoning: [
    {
      provider: 'anthropic',
      model: 'claude-sonnet-4-5',
      priority: 1,
      costPer1kInput: 0.003,
      costPer1kOutput: 0.015,
    },
    {
      provider: 'openai',
      model: 'gpt-4o',
      priority: 2,
      costPer1kInput: 0.01,
      costPer1kOutput: 0.03,
    },
  ],
  standard: [
    {
      provider: 'anthropic',
      model: 'claude-haiku-4-5',
      priority: 1,
      costPer1kInput: 0.001,
      costPer1kOutput: 0.005,
    },
    {
      provider: 'openai',
      model: 'gpt-4o-mini',
      priority: 2,
      costPer1kInput: 0.00015,
      costPer1kOutput: 0.0006,
    },
  ],
  classification: [
    {
      provider: 'google',
      model: 'gemini-2.0-flash',
      priority: 1,
      costPer1kInput: 0.0001,
      costPer1kOutput: 0.0004,
    },
    {
      provider: 'anthropic',
      model: 'claude-haiku-4-5',
      priority: 2,
      costPer1kInput: 0.001,
      costPer1kOutput: 0.005,
    },
  ],
};
Notice the failover chain. Every task tier has a primary and secondary provider. If Anthropic goes down, traffic automatically routes to OpenAI. If Google has a bad day, Haiku picks up the classification work. No human intervention. The system is resilient against single-vendor failure.
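The router that consumes this table also needs a `calculateCost` helper. The original does not show one, so here is a minimal sketch: plain arithmetic over the per-1K-token rates in the routing table.

```typescript
interface Usage { inputTokens: number; outputTokens: number }

interface ModelConfig {
  provider: string;
  model: string;
  priority: number;
  costPer1kInput: number;
  costPer1kOutput: number;
}

// Dollar cost of one completion, from the routing-table rates.
function calculateCost(usage: Usage, config: ModelConfig): number {
  return (
    (usage.inputTokens / 1000) * config.costPer1kInput +
    (usage.outputTokens / 1000) * config.costPer1kOutput
  );
}
```

Keeping the rates in the routing table, next to the failover chain, means cost tracking stays correct even when a request fails over to a different provider mid-flight.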
The Router
async function route(
  request: CompletionRequest
): Promise<CompletionResponse> {
  const candidates = ROUTING_TABLE[request.task];

  // Check semantic cache first
  const cached = await semanticCache.get(request.messages);
  if (cached) return cached;

  for (const candidate of candidates) {
    try {
      const start = performance.now();
      const response = await providers[candidate.provider].complete({
        model: candidate.model,
        messages: request.messages,
        maxTokens: request.maxTokens,
        temperature: request.temperature,
      });

      const result: CompletionResponse = {
        content: response.content,
        provider: candidate.provider,
        model: candidate.model,
        usage: response.usage,
        latencyMs: performance.now() - start,
        cost: calculateCost(response.usage, candidate),
      };

      // Cache the result for semantically similar future queries
      await semanticCache.set(request.messages, result);
      await costTracker.record(result);

      return result;
    } catch (error) {
      logger.warn(
        `Failover: ${candidate.provider}/${candidate.model} failed`,
        { error }
      );
      continue;
    }
  }

  throw new Error('All providers exhausted for task: ' + request.task);
}
Two details matter here. First, the semantic cache: before
making any API call, we check if a sufficiently similar query
has been answered recently. For classification tasks, this
eliminates 30%+ of redundant calls. Second, the cost
tracker: every response gets its actual cost recorded,
giving you the observability to know exactly where the money
is going.
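The `semanticCache` the router calls is assumed rather than shown. A toy version can be sketched as an embedding-similarity lookup with a fixed cosine threshold; the `EmbedFn`, string keys, threshold value, and in-memory storage here are all illustrative simplifications, not a production design:

```typescript
type EmbedFn = (text: string) => Promise<number[]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

class SemanticCache<T> {
  private entries: { vector: number[]; value: T }[] = [];

  constructor(private embed: EmbedFn, private threshold = 0.95) {}

  // Return the cached value for the most similar key above the threshold.
  async get(key: string): Promise<T | undefined> {
    const v = await this.embed(key);
    let best: { score: number; value: T } | undefined;
    for (const e of this.entries) {
      const score = cosine(v, e.vector);
      if (score >= this.threshold && (!best || score > best.score)) {
        best = { score, value: e.value };
      }
    }
    return best?.value;
  }

  async set(key: string, value: T): Promise<void> {
    this.entries.push({ vector: await this.embed(key), value });
  }
}
```

A production version would add eviction and a TTL, and would swap the linear scan for the vector store behind Layer 3. The threshold is the knob that trades hit rate against the risk of serving a stale or subtly wrong answer.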
The Embedding Abstraction
interface EmbeddingStore {
  store(
    id: string,
    text: string,
    metadata?: Record<string, unknown>
  ): Promise<void>;

  query(
    text: string,
    options?: { topK?: number; filter?: Record<string, unknown> }
  ): Promise<SearchResult[]>;

  reindex(provider: EmbeddingProvider): Promise<ReindexReport>;
}
The reindex method is the escape hatch. When a better embedding model ships (and in this market, that happens quarterly), you call reindex with the new provider, and the system re-embeds every stored document in the background. No migration project. No downtime. You just move.
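Internally, reindex can be little more than a loop, precisely because Layer 2 kept the raw text. A sketch, assuming a hypothetical `EmbeddingProvider` with a single `embed` method and a simplified document shape:

```typescript
interface EmbeddingProvider {
  name: string;
  embed(text: string): Promise<number[]>;
}

interface StoredDoc { id: string; text: string; vector: number[] }
interface ReindexReport { provider: string; reembedded: number }

// Because raw text sits alongside every vector, switching embedding
// models is a background loop, not a data-recovery project.
async function reindex(
  docs: StoredDoc[],
  provider: EmbeddingProvider
): Promise<ReindexReport> {
  let reembedded = 0;
  for (const doc of docs) {
    doc.vector = await provider.embed(doc.text);
    reembedded++;
  }
  return { provider: provider.name, reembedded };
}
```

A real implementation would batch the embed calls, rate-limit against the provider, and write into a fresh index that is swapped in atomically once the report comes back clean.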
When NOT to Abstract
This is not a universal pattern. There are real situations
where building a vendor abstraction is premature or
counterproductive.
Before product-market fit. If you are still figuring out
whether customers want your product, do not spend three
months building a model gateway. Ship with a single provider.
Validate the business. The abstraction comes later.
When compliance requires a specific vendor. Some
regulated industries mandate that data processing happens
through approved vendors. In healthcare and defense, I have
seen cases where the vendor lock-in is the feature -- it
satisfies an audit requirement. Abstracting around it
creates compliance risk.
When the abstraction tax exceeds the savings. Every layer
you add introduces latency, failure modes, and cognitive
overhead for your team. If your AI spend is $2K/month, a
gateway is over-engineering. The break-even point, in my
experience, is around $15-20K/month in AI spend. Below
that, the operational cost of maintaining the abstraction
outweighs the savings.
When you genuinely only use one capability. If your
entire AI integration is a single summarization endpoint, a
full gateway is a sledgehammer for a nail. Start with a
simple provider interface and grow from there.
The judgment call is always the same: is the cost of the
abstraction less than the cost of the dependency? If you
are not sure, you probably do not need it yet.
The Broader Principle: Optionality
The vendor off-ramp is not really about vendors. It is about
optionality.
The AI model ecosystem is moving faster than any technology
market I have worked in. The best model for your use case
today will not be the best model six months from now. The
cheapest provider this quarter will not be the cheapest next
quarter. If your architecture cannot absorb that change
without a rewrite, your unit economics are at the mercy of
forces you do not control.
The three questions I ask on every system I review:
- What is your cost per inference, broken down by task?
If you do not know this number, you cannot optimize it.
- How long would it take to switch providers for your
highest-volume endpoint? If the answer is "weeks" or
"I don't know," you have a vendor dependency, not a vendor
relationship.
- Are you storing raw text alongside your embeddings? If
not, your most valuable data asset is locked to whichever
embedding model you chose on day one.
Building sustainable AI infrastructure means building for the
ecosystem you will have in two years, not the one you have
today. The vendors will change. The models will change. The
pricing will change. The only question is whether your
architecture is ready for it.
If you want to see this pattern running in production,
talk to my AI. It runs on the exact gateway architecture described here --
model routing, failover chains, cost tracking, all of it. Or
if you are looking at your own AI infrastructure costs and
wondering whether there is an off-ramp,
reach out.