Why Your RAG System Is Bleeding Money
Your RAG prototype works. It answers questions, retrieves
relevant context, and impresses stakeholders in demos. There
is just one problem: it costs $2-5 per query, and you are
about to deploy it to production where 10,000 users per day
will turn your AI feature into a financial sinkhole.
At $3 per query and 10,000 daily queries, you are burning
$30,000 per day. Over $900,000 per month. For a single
feature. That is not a viable product. That is a line item
that will get your project killed in the next budget review.
I have been there. I re-architected a RAG system that was
hemorrhaging money in production and brought the cost per
query down by 99%. Not through magic. Through engineering
discipline, unit economics, and a systematic approach to
understanding where every cent goes.
Where the Money Actually Goes
Before you can fix a cost problem, you need to understand its
anatomy. A RAG query touches four billable components, and
most teams have no idea which one is eating their budget.
1. Embedding Generation
Every incoming query needs to be converted into a vector.
Every document chunk in your knowledge base needs the same
treatment. Usually the cheapest part of the pipeline.
Current pricing for OpenAI embeddings:
- text-embedding-3-small: $0.02 per 1M tokens
- text-embedding-3-large: $0.13 per 1M tokens
- Voyage AI embeddings: $0.06 per 1M tokens
A typical 50-token query costs a few millionths of a dollar to embed.
But here is where teams bleed money: they re-embed their
entire document corpus every time they update a single
document. Or they use 3072-dimensional embeddings when 1024
dimensions deliver 95% of the retrieval quality at one-third
the storage cost. These decisions compound.
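As a concrete example, here is what the dimension decision
looks like with the OpenAI Python SDK. The dimensions
parameter on the text-embedding-3 models truncates the
embedding, so you pay the same per token but store and search
far fewer floats; the query string is illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Same model, same per-token price; the dimensions parameter
# controls how many floats you store and search per vector.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do I rotate my API keys?",  # example query
    dimensions=1024,  # down from the default 1536; validate with evals
)
vector = resp.data[0].embedding  # len(vector) == 1024
```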
2. Vector Storage and Search
Your vectors need to live somewhere, and that somewhere has
a monthly bill.
| Provider | Monthly Cost (1M vectors, 1536d) | Notes |
|----------|----------------------------------|-------|
| Pinecone (managed) | $70-150 | Usage-based, premium |
| Weaviate Cloud | $25-153 | Varies by compression |
| Qdrant Cloud | $27-102 | Scalar quantization cuts 70% |
| pgvector (self-hosted) | ~$0 | Free with existing Postgres |
The hidden cost is not storage. It is the queries. Managed
vector databases charge per read operation, and a single RAG
query might trigger multiple vector searches if you are doing
hybrid retrieval or querying multiple namespaces.
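For comparison, the self-hosted path is a query away. A
minimal sketch with psycopg, assuming a hypothetical chunks
table with a pgvector embedding column (`<=>` is pgvector's
cosine-distance operator):

```python
import psycopg  # assumes Postgres with the pgvector extension enabled

def search_chunks(conn, query_vec, k=5):
    # Hypothetical schema: chunks(id, content, embedding vector(1024)).
    # <=> is cosine distance in pgvector, so 1 - distance is similarity.
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content, 1 - (embedding <=> %s::vector) AS similarity "
            "FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec_literal, vec_literal, k),
        )
        return cur.fetchall()
```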
3. Reranking
The middle child of RAG costs. Most teams either skip it
entirely or run it on every query without thinking about
whether it is necessary.
A cross-encoder reranker scores each candidate document
against your query with higher accuracy than vector
similarity alone. The typical flow: retrieve 20-50 candidates
via vector search, rerank them, send the top 3-5 to the LLM.
The reranking step itself costs $0.001-0.005 per query.
But the cost savings come downstream. By sending 5 highly
relevant chunks instead of 20 marginally relevant ones to the
LLM, you reduce generation input tokens by 75%. That is
where the real money is.
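A minimal sketch of that flow using an open cross-encoder
from sentence-transformers; hosted rerank APIs follow the
same retrieve-score-truncate shape, and the checkpoint here
is just one common choice:

```python
from sentence_transformers import CrossEncoder

# One common open checkpoint; swap in your preferred reranker.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=5):
    # Score every (query, document) pair, then keep only the best few.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```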
4. LLM Generation (The Budget Killer)
70-85% of your per-query cost lives here. You are sending
retrieved context plus the user query to a large language
model and paying for every token in and out.
Current inference pricing per 1M tokens:
| Model | Input | Output |
|-------|-------|--------|
| GPT-4o | $2.50 | $10.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
A naive RAG query that sends 20 retrieved chunks (~40,000
tokens of context) to Claude Sonnet with a 500-token response
costs approximately $0.13 per query. Do that 10,000 times
a day: $1,300 daily on generation alone.
The real damage happens when teams use agent loops. If your
RAG system routes through a multi-step agent that makes 3-5
LLM calls per user query, each with its own context window,
a single user interaction can cost $0.50-5.00. That is the
$2-5 per query figure I see in most unoptimized prototypes.
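You can sanity-check these numbers with a few lines of
arithmetic. A back-of-envelope helper (the prices are the
per-1M-token figures from the table above and will drift):

```python
def generation_cost(input_tokens, output_tokens, input_price, output_price):
    """Per-query LLM cost; prices are USD per 1M tokens."""
    return (input_tokens / 1e6) * input_price + (output_tokens / 1e6) * output_price

# Naive RAG call to Claude Sonnet: 40K context, 500-token answer.
naive = generation_cost(40_000, 500, 3.00, 15.00)      # ~$0.13
# Hypothetical 4-call agent loop, each call carrying its own context.
agent = 4 * generation_cost(40_000, 500, 3.00, 15.00)  # ~$0.51
```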
The 99% Playbook
I did not achieve a 99% cost reduction through a single
trick. It was the compounding effect of four strategies
applied systematically. Each one alone delivers 30-70%
savings. Together, they compound.
Strategy 1: Semantic Caching (Highest Leverage)
The insight is simple: users ask similar questions. Not
identical, but semantically similar.
A semantic cache stores embeddings of past queries alongside
their responses. When a new query arrives, you compute its
embedding and check the cache for a match above a similarity
threshold (typically 0.92-0.95). If found, you return the
cached response instantly. No vector search. No reranking.
No LLM call. Cost of a cache hit: effectively zero.
In my experience, a well-tuned semantic cache achieves a
60-70% hit rate in production for domain-specific
applications. For customer support and documentation use
cases, hit rates can exceed 80%.
I layer this with an exact-match cache for deterministic
queries (e.g., "What is your return policy?"). The exact-match
layer catches another 5-10% before they reach the semantic
layer.
Implementation: Redis for the exact-match layer, a
lightweight vector index (even a local FAISS instance) for
the semantic layer. Total infrastructure cost: under
$20/month.
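A minimal sketch of the semantic layer with a local FAISS
index; the 512-dimension vectors and 0.93 threshold are
illustrative, so tune both against your own traffic:

```python
import faiss
import numpy as np

DIM, THRESHOLD = 512, 0.93
index = faiss.IndexFlatIP(DIM)  # inner product == cosine on normalized vectors
responses = []                  # row i in the index maps to responses[i]

def _normalized(query_vec):
    q = np.asarray([query_vec], dtype="float32")
    faiss.normalize_L2(q)
    return q

def cache_lookup(query_vec):
    if index.ntotal == 0:
        return None
    sims, ids = index.search(_normalized(query_vec), 1)
    if sims[0][0] >= THRESHOLD:
        return responses[ids[0][0]]  # hit: no retrieval, no LLM call
    return None

def cache_store(query_vec, response):
    index.add(_normalized(query_vec))
    responses.append(response)
```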
Impact: 65-75% cost reduction on blended query volume.
Strategy 2: Chunk Optimization
Most teams inherit their chunking strategy from a LangChain
tutorial and never revisit it. 1,000-token chunks with
200-token overlap because that is what the example code did.
This is leaving money on the table.
Right-sizing your chunks has cascading cost effects:
- Smaller, more precise chunks (300-500 tokens) mean
fewer irrelevant tokens sent to the LLM. 500-token chunks
instead of 1,000, retrieving 5 chunks: 2,500 tokens of
context instead of 5,000. A 50% reduction in generation
input costs.
- Semantic chunking splits documents at natural boundaries
(paragraphs, sections, topic shifts) rather than arbitrary
token counts. This improves retrieval precision by
15-25%, meaning the retriever returns more relevant
content and the LLM needs fewer chunks.
- Reduced embedding dimensions compound the savings.
Switching from text-embedding-3-large (3072d) to text-
embedding-3-small (1536d) with Matryoshka reduction to
512d drops vector storage costs by 80% with minimal
retrieval quality loss.
The right approach is empirical. Run retrieval evals across
your actual query distribution, measure precision@5 and
recall@10, and find the smallest chunk size and lowest
dimensionality that maintain your quality threshold. Most
teams can cut chunk size by 40-60% without measurable quality
degradation.
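The eval loop does not need to be fancy. A sketch, assuming
a hypothetical labeled set of real queries mapped to the
chunk ids a correct retrieval should return:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

def run_eval(retriever, eval_set, k=5):
    # eval_set: hypothetical [{"query": str, "relevant_ids": set}, ...]
    scores = [
        precision_at_k(retriever(case["query"]), case["relevant_ids"], k)
        for case in eval_set
    ]
    return sum(scores) / len(scores)

# Re-run run_eval for each candidate chunk size and dimensionality,
# then pick the cheapest configuration above your quality bar.
```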
Impact: 40-60% reduction in storage and generation costs.
Strategy 3: Model Tiering
This is where unit economics thinking separates production
engineers from prototype builders. Not every query deserves
your most expensive model.
The architecture is a classifier-router pattern:
- A lightweight classifier (fine-tuned distilled model, or
even rules-based) categorizes incoming queries by
complexity.
- Simple queries (60-70% of traffic): Route to GPT-4o
mini or Gemini Flash. 10-20x cheaper than frontier models,
equivalent quality for factual retrieval and templated
responses.
- Complex queries (20-30% of traffic): Route to Claude
Sonnet or GPT-4o. Multi-step reasoning, nuanced synthesis,
careful judgment.
- Critical queries (5-10% of traffic): Route to Claude
Opus or GPT-4. High-stakes decisions, complex analysis,
accuracy non-negotiable.
The math: if 65% of your queries hit a model at $0.15/1M
input tokens instead of $3.00/1M, you have reduced generation
cost on that segment by 95%. Blended across all tiers, I
typically see 60-80% reduction in LLM spend.
The classifier itself is cheap. A keyword-based heuristic can
achieve 85%+ routing accuracy. The cost of occasional
misrouting is a slightly worse answer, not a catastrophic
failure. Quality monitoring catches it, and you adjust
thresholds over time.
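The simplest version really is keywords and length. A sketch
in which the marker lists and model ids are placeholders to
tune against your own traffic:

```python
COMPLEX_MARKERS = ("compare", "why", "trade-off", "recommend", "analyze")
CRITICAL_MARKERS = ("legal", "compliance", "contract", "incident")

def route(query: str) -> str:
    q = query.lower()
    if any(marker in q for marker in CRITICAL_MARKERS):
        return "frontier-model"   # placeholder id for your Opus/GPT-4 tier
    if any(marker in q for marker in COMPLEX_MARKERS) or len(q.split()) > 40:
        return "mid-tier-model"   # Sonnet / GPT-4o class
    return "small-model"          # mini / Flash class handles the rest
```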
Impact: 60-80% reduction in LLM inference costs.
Strategy 4: Batch Processing and Intelligent Retrieval
Real-time retrieval is expensive because it runs the full
pipeline on every query. But not every operation needs to
happen in real time.
Batch embedding updates: Instead of re-embedding
documents on write, queue changes and process them in batch
during off-peak hours. OpenAI's Batch API offers a 50%
discount ($0.01/1M tokens for text-embedding-3-small vs.
$0.02 standard). At thousands of documents daily, this adds
up.
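A sketch of the queue-and-batch flow against OpenAI's Batch
API; the docs list and its (doc_id, text) shape are
assumptions, and the API accepts a JSONL file of requests it
completes within 24 hours at the discounted rate:

```python
import json
from openai import OpenAI

client = OpenAI()

def queue_embedding_batch(docs):
    # docs: hypothetical list of (doc_id, text) pairs queued since last run.
    lines = [
        json.dumps({
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": "text-embedding-3-small", "input": text},
        })
        for doc_id, text in docs
    ]
    batch_file = client.files.create(
        file=("embeddings.jsonl", "\n".join(lines).encode()),
        purpose="batch",
    )
    return client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/embeddings",
        completion_window="24h",  # the 50%-discounted tier
    )
```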
Precomputed retrievals: For predictable query patterns
(40-60% of queries in domain-specific applications),
precompute and cache retrieval results. When someone asks
about "pricing" or "installation," you already know which
chunks are relevant.
Conditional reranking: Only invoke the reranker when the
top vector search result falls below a confidence threshold.
At 0.95+ similarity, the reranker is unlikely to change the
ranking. Skip it and save the compute. In practice, this
eliminates reranking on 40-50% of queries.
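Wired up with the rerank helper from earlier, the gate is a
few lines; vector_search here is an assumed helper returning
candidates alongside their similarity scores:

```python
RERANK_THRESHOLD = 0.95  # above this, trust the vector ordering as-is

def retrieve(query, query_vec, top_k=5):
    candidates, sims = vector_search(query_vec, k=50)  # assumed helper
    if sims[0] >= RERANK_THRESHOLD:
        return candidates[:top_k]            # skip the reranker, save the compute
    return rerank(query, candidates, top_k)  # fall back to the full rerank pass
```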
Smart retrieval reduction: Not every query needs RAG at
all. Conversational follow-ups, clarifications, and simple
questions can be answered by the LLM directly. A lightweight
intent classifier that determines whether retrieval is
necessary can reduce vector search volume by 30-45%.
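The retrieval gate can start just as simple; the opener list
below is illustrative, with a small fine-tuned classifier as
the upgrade path:

```python
NO_RETRIEVAL_OPENERS = ("thanks", "ok", "shorter", "what do you mean", "try again")

def needs_retrieval(query: str, has_history: bool) -> bool:
    q = query.lower().strip()
    if has_history and (len(q.split()) < 4 or q.startswith(NO_RETRIEVAL_OPENERS)):
        return False  # follow-up or clarification: answer from conversation
    return True       # everything else runs the full RAG pipeline
```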
Impact: 30-50% reduction in infrastructure and embedding
costs.
The Unit Economics Framework
Cost optimization without measurement is guessing. Here is
the framework I use to make RAG systems financially viable.
Cost Per Query (CPQ)
CPQ = C_embed + C_search + C_rerank + C_generate + C_infra
Where:
C_embed = (query_tokens / 1M) * embedding_price
C_search = vector_db_monthly / monthly_queries
C_rerank = (candidates * tokens_per_doc / 1M) * rerank_price
C_generate = (input_tokens / 1M * input_price) + (output_tokens / 1M * output_price)
C_infra = (cache + compute + monitoring) / monthly_queries
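The same formula transcribed into code, so you can plug in
your own measurements:

```python
def cost_per_query(
    query_tokens, embed_price,           # $ per 1M tokens
    vector_db_monthly, monthly_queries,  # flat bill amortized per query
    candidates, tokens_per_doc, rerank_price,
    input_tokens, input_price, output_tokens, output_price,
    infra_monthly,                       # cache + compute + monitoring
):
    c_embed = query_tokens / 1e6 * embed_price
    c_search = vector_db_monthly / monthly_queries
    c_rerank = candidates * tokens_per_doc / 1e6 * rerank_price
    c_generate = input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price
    c_infra = infra_monthly / monthly_queries
    return c_embed + c_search + c_rerank + c_generate + c_infra
```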
Profitability Threshold
For any AI feature to be viable, your CPQ must sit below your
revenue-per-query or value-per-query threshold:
| Product Type | CPQ Target |
|-------------|-----------|
| SaaS product | Under 5% of per-user monthly AI revenue |
| Internal tool | Deliver 10x query cost in time savings |
| Consumer (ad-supported) | Under $0.01 |
| Consumer (subscription) | Under $0.05 |
If your CPQ does not clear these thresholds, your AI feature
is a cost center, not a product.
Real Numbers: Before and After
Here is the actual cost breakdown from a production RAG
system I re-architected. The system handles approximately
8,000 queries per day for a B2B documentation and support
use case.
Before Optimization
| Component | Cost Per Query | Daily Cost (8K queries) |
|-----------|---------------|------------------------|
| Embedding (text-embedding-3-large) | $0.0052 | $41.60 |
| Vector search (Pinecone, 20 retrievals) | $0.0080 | $64.00 |
| Reranking (50 candidates, every query) | $0.0040 | $32.00 |
| LLM generation (Claude Sonnet, 40K context) | $0.1350 | $1,080.00 |
| Infrastructure | $0.0030 | $24.00 |
| Total | $0.1552 | $1,241.60 |
Monthly cost: $37,248. Annual: $446,976.
After Optimization
| Component | Cost Per Query | Daily Cost (8K queries) |
|-----------|---------------|------------------------|
| Semantic + exact cache (68% hit rate) | $0.0000 | $0.00 (for cached) |
| Embedding (text-embedding-3-small, batch) | $0.0001 | $0.26 (uncached only) |
| Vector search (pgvector, 5 retrievals) | $0.0005 | $1.28 (uncached only) |
| Conditional reranking (40% of uncached) | $0.0004 | $0.41 |
| LLM generation (tiered: 65% mini, 30% Sonnet, 5% Opus) | $0.0089 | $22.85 (uncached only) |
| Infrastructure (Redis + pgvector + monitoring) | $0.0008 | $6.40 |
| Blended total | $0.0039 | $31.20 |
Monthly cost: $936. Annual: $11,232.
That is a 97.5% reduction. Chipping away at the remaining
2.5% gets you past 99% once you factor in precomputed
retrievals for the top 200 query patterns, aggressive TTL
management on the cache, and continuous tuning of the routing
classifier.
The Compounding Effect
These strategies are not additive. They compound. Caching
eliminates 68% of queries from the pipeline entirely. Chunk
optimization halves the cost of the remaining 32%. Model
tiering cuts that figure by another 70%, and the batch and
conditional-retrieval savings from Strategy 4 close the rest
of the gap. The math:
Original cost: $0.1552/query
After caching (68% free): $0.1552 * 0.32 = $0.0497 blended
After chunk optimization (-50%): $0.0497 * 0.50 = $0.0248 blended
After model tiering (-70%): $0.0248 * 0.30 = $0.0075 blended
After Strategy 4 (-48%): $0.0075 * 0.52 = ~$0.0039 blended
This is what production AI looks like. Not the flashiest
demo. Systems that survive contact with production economics,
where you know your unit economics cold and can defend every
architectural decision with a spreadsheet.
What to Do This Week
If you are running a RAG system in production, or planning
to, here is your immediate action plan:
- Instrument your CPQ today. If you do not know your
cost per query broken down by component, you are flying
blind. Add logging for token counts, cache hit rates, and
model routing decisions (see the sketch after this list).
- Deploy semantic caching this week. Highest-ROI
optimization, lowest implementation cost. Even a naive
implementation saves 40-50% immediately.
- Audit your chunk sizes. Run retrieval evals. Your
chunks are almost certainly bigger than they need to be.
- Build a routing classifier. Start simple. Even a
keyword-based router that sends "what is" queries to a
cheap model will move the needle.
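For the first item, one structured log line per query is
enough to start; aggregate these and the blended CPQ falls
out (the field names are just a suggestion):

```python
import json
import logging
import time

logger = logging.getLogger("rag.metrics")

def log_query_metrics(*, cache_hit, route, input_tokens, output_tokens,
                      cost_usd, latency_ms):
    # One structured line per query; sum(cost_usd) / count == blended CPQ.
    logger.info(json.dumps({
        "ts": time.time(),
        "cache_hit": cache_hit,
        "route": route,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost_usd, 6),
        "latency_ms": latency_ms,
    }))
```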
I built a course around taking RAG systems from prototype to
production-grade: caching layers, evaluation frameworks,
model routing, and cost monitoring at scale.
Check out the RAG engineering course if you want the complete system.
And if you want to see this architecture in action,
talk to my AI. It runs on the exact pipeline described in this post. Ask
it anything. Check the response quality. Then consider that
it costs me less than a penny per conversation.
That is what viable AI looks like.