I cut a RAG system's per-query cost from $0.12 to $0.001 --- a 99% reduction --- while simultaneously improving answer quality. This was not a single optimization. It was a systematic audit of every cost center in the pipeline, applying the right lever at each stage.
This lesson is the playbook. I will walk through every technique in the order I applied them, with the dollar impact of each.
Before optimizing, you need to know where the money goes. Here is the typical cost breakdown for an unoptimized RAG pipeline serving 100,000 queries per day:
Per-query cost breakdown (unoptimized):

```
Query embedding:       $0.001   (embed the user query)
Vector DB query:       $0.002   (similarity search)
Re-ranking:            $0.005   (cross-encoder inference)
LLM generation:        $0.110   (GPT-4-class model, ~2,000-token context)
Logging & monitoring:  $0.002
─────────────────────────────────
Total per query:       $0.120

Daily cost (100K queries):  $12,000
Monthly cost:               $360,000
```
LLM generation dominates at 92% of cost. This is where most of the savings come from. But the other stages have optimization opportunities too.
Semantic caching is the single highest-impact optimization. In most production systems, 30--50% of queries are semantically identical or near-identical: users ask the same questions in slightly different ways.
```python
import hashlib
import time

import numpy as np
from sentence_transformers import SentenceTransformer


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.cache = {}  # In production, use Redis + a vector index
        self.threshold = similarity_threshold

    def _hash(self, query: str) -> str:
        return hashlib.sha256(query.lower().strip().encode()).hexdigest()

    def get(self, query: str):
        # Check the exact cache first (hash-based, near-zero latency)
        query_hash = self._hash(query)
        if query_hash in self.cache:
            return self.cache[query_hash]["response"]

        # Fall back to the semantic cache (vector similarity; a real
        # deployment would use an ANN index instead of a linear scan)
        query_vec = self.model.encode(query)
        for entry in self.cache.values():
            if cosine_similarity(query_vec, entry["query_vec"]) >= self.threshold:
                return entry["response"]
        return None  # Cache miss

    def put(self, query: str, response: str):
        self.cache[self._hash(query)] = {
            "query_vec": self.model.encode(query),
            "response": response,
            "timestamp": time.time(),
        }
```
Impact: With 40% cache hit rate, per-query cost drops from $0.12 to ~$0.05. Monthly savings: $210,000.
Not every query needs GPT-4. A query like "What are your business hours?" does not require the same model as "Explain the tax implications of our Q3 restructuring."
```python
class ModelRouter:
    SIMPLE_MODEL = "gpt-4o-mini"  # $0.15 / 1M input tokens
    COMPLEX_MODEL = "gpt-4o"      # $2.50 / 1M input tokens

    def classify_complexity(self, query: str, retrieved_chunks: list) -> str:
        """Route based on query and retrieval signals."""
        if not retrieved_chunks:
            return self.COMPLEX_MODEL  # no retrieval signal: play it safe
        signals = {
            "short_query": len(query.split()) < 10,
            "single_chunk_match": len(retrieved_chunks) == 1,
            "high_confidence": retrieved_chunks[0].score > 0.92,
            "factoid_query": self._is_factoid(query),
        }
        if sum(signals.values()) >= 3:
            return self.SIMPLE_MODEL
        return self.COMPLEX_MODEL

    def _is_factoid(self, query: str) -> bool:
        """Detect simple factual queries by their leading phrase."""
        factoid_patterns = (
            "what is", "what are", "how much", "when does",
            "where is", "who is", "how many",
        )
        return query.lower().startswith(factoid_patterns)
```
The economics: GPT-4o-mini is roughly 17x cheaper than GPT-4o per token. If 70% of queries can be handled by the smaller model (and in most customer-facing RAG systems, they can), the blended cost drops significantly.
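The blended-cost arithmetic is worth making explicit. A minimal sketch, using the per-token prices quoted above:

```python
# Blended per-1M-input-token cost under a routing split.
MINI_COST = 0.15   # $ per 1M input tokens (gpt-4o-mini)
LARGE_COST = 2.50  # $ per 1M input tokens (gpt-4o)

def blended_cost(simple_fraction: float) -> float:
    """Average $ per 1M input tokens, given the share routed to the small model."""
    return simple_fraction * MINI_COST + (1 - simple_fraction) * LARGE_COST

# With 70% of queries routed to gpt-4o-mini:
print(round(blended_cost(0.70), 3))  # 0.855 -> roughly 3x cheaper than all-gpt-4o
```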
Impact: Per-query cost drops from $0.05 to ~$0.02. Monthly savings on top of caching: $90,000.
After retrieval and re-ranking, you have a set of chunks to send to the LLM. Most pipelines send everything. Smarter pipelines compress first.
Instead of sending the top 5 chunks, send only chunks above a relevance threshold:
```python
def compress_context(query, chunks, min_score=0.7, max_tokens=1500):
    """Only include chunks that meet the quality bar.

    Assumes a `reranker` with a .score() method and a `count_tokens`
    helper are available in scope.
    """
    scored_chunks = reranker.score(query, chunks)
    filtered = [c for c in scored_chunks if c.score >= min_score]

    # Enforce the token budget
    context = []
    token_count = 0
    for chunk in filtered:
        chunk_tokens = count_tokens(chunk.text)
        if token_count + chunk_tokens > max_tokens:
            break
        context.append(chunk)
        token_count += chunk_tokens
    return context
```
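The `count_tokens` helper above is left undefined; a rough stand-in, assuming the common ~4-characters-per-token heuristic for English text (for exact budgeting, use the model's real tokenizer, e.g. tiktoken):

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for GPT-class tokenizers.
    Replace with the model's actual tokenizer for precise budgets."""
    return max(1, len(text) // 4)
```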
Pull only the relevant sentences from each chunk instead of the full chunk:
```python
from nltk.tokenize import sent_tokenize  # requires nltk's "punkt" data


def extract_relevant_sentences(query, chunk_text, max_sentences=3):
    """Extract only the sentences most relevant to the query.

    Assumes `embed` and `cosine_similarity` helpers are in scope
    (e.g. backed by a sentence-transformers model).
    """
    sentences = sent_tokenize(chunk_text)
    query_vec = embed(query)
    sentence_vecs = [embed(s) for s in sentences]
    scored = [
        (s, cosine_similarity(query_vec, sv))
        for s, sv in zip(sentences, sentence_vecs)
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    # Note: this returns sentences in relevance order; re-sort by original
    # position if the LLM needs them in document order.
    return " ".join(s for s, _ in scored[:max_sentences])
```
Impact: Reducing average context from 2000 tokens to 800 tokens cuts LLM input costs by 60%. Per-query cost drops to ~$0.008. Monthly savings: $36,000.
Embed documents in batches during ingestion rather than one at a time. Batching slashes per-request overhead and rate-limit pressure, and providers' asynchronous batch APIs (e.g. OpenAI's Batch API) discount token costs on top of that.
```python
# Expensive: one-at-a-time embedding (one HTTP request per document)
for doc in documents:
    embedding = openai.embeddings.create(
        input=doc.text, model="text-embedding-3-large"
    )

# Cheap: batch embedding (up to 2048 inputs per request)
BATCH_SIZE = 2048
for i in range(0, len(documents), BATCH_SIZE):
    batch = [doc.text for doc in documents[i:i + BATCH_SIZE]]
    embeddings = openai.embeddings.create(
        input=batch, model="text-embedding-3-large"
    )
```
Reduce storage costs by compressing vectors:
```python
# int8 quantization: 4x memory reduction, retains ~96% quality
# binary quantization: 32x memory reduction, retains ~92-96% quality

# Qdrant example
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
    VectorParams,
)

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True,  # keep quantized vectors in RAM for speed
        )
    ),
)
```
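The 4x figure falls straight out of the component widths. A back-of-envelope sketch, using an illustrative corpus of one million 1024-dimensional vectors:

```python
# Memory for 1M vectors at 1024 dims, float32 vs int8.
n_vectors, dims = 1_000_000, 1024
float32_bytes = n_vectors * dims * 4  # 4 bytes per float32 component
int8_bytes = n_vectors * dims * 1     # 1 byte per int8 component

print(f"{float32_bytes / 1024**3:.2f} GB -> {int8_bytes / 1024**3:.2f} GB")
# 3.81 GB -> 0.95 GB: the 4x reduction quoted above
```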
If using OpenAI's text-embedding-3 models, you can truncate dimensions:

```python
# Full dimensions: 3072 (default for text-embedding-3-large)
# Reduced: 256 dimensions, ~95% quality retention
response = openai.embeddings.create(
    input="your text",
    model="text-embedding-3-large",
    dimensions=256,  # 12x smaller vectors
)
```
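If you have already stored full-size vectors, you can shorten them client-side instead of re-embedding, but you must re-normalize after truncating; per OpenAI's documentation, the `dimensions` parameter handles normalization for you. A minimal sketch:

```python
import numpy as np


def truncate_embedding(vec, dims: int = 256) -> np.ndarray:
    """Truncate a stored full-size embedding and re-normalize to unit length
    so cosine similarity remains meaningful."""
    v = np.asarray(vec[:dims], dtype=np.float32)
    return v / np.linalg.norm(v)
```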
Impact: Combined embedding and storage optimizations bring per-query cost to ~$0.003.
Before any processing, normalize and deduplicate queries:
```python
import re


def normalize_query(query: str) -> str:
    """Normalize a query for caching and deduplication."""
    query = query.lower().strip()
    query = re.sub(r"\s+", " ", query)     # collapse whitespace
    query = re.sub(r"[?!.]+$", "", query)  # remove trailing punctuation
    return query
```
Identify your top 100 queries (they often cover 30--40% of traffic) and precompute answers:
```python
# Daily job: analyze query logs, precompute answers for the top queries
top_queries = analytics.get_top_queries(days=7, limit=100)
for query in top_queries:
    answer = full_rag_pipeline(query)
    precomputed_cache.set(query, answer, ttl=86400)  # 24-hour TTL
```
Impact: Final per-query cost: ~$0.001. Monthly cost for 100K daily queries: $3,000 (down from $360,000).
| Layer | Technique | Cost Reduction | Cumulative Per-Query |
|-------|-----------|----------------|----------------------|
| 0 | Unoptimized baseline | --- | $0.120 |
| 1 | Semantic caching | 58% | $0.050 |
| 2 | Model routing | 60% | $0.020 |
| 3 | Context compression | 60% | $0.008 |
| 4 | Embedding optimization | 63% | $0.003 |
| 5 | Query optimization | 67% | $0.001 |
Each layer is independent. You can apply them in any order based on what is easiest to implement in your system. But I recommend this order because each layer builds on the data from the previous one (e.g., cache hit rates inform model routing thresholds).
After optimization, run this sanity check:

```
Revenue (or value) per query:  $X
Cost per query:                $0.001
Gross margin per query:        $X - $0.001
```

If the gross margin is positive, you have a viable product. If it is negative, optimize further or rethink the product.
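The check is trivial to automate against your metrics pipeline. A minimal sketch, with the function name and default cost as assumptions:

```python
def gross_margin(value_per_query: float, cost_per_query: float = 0.001) -> float:
    """Per-query gross margin; positive means the unit economics work."""
    return value_per_query - cost_per_query


# A query worth $0.05 in value clears the bar; one worth $0.0005 does not.
assert gross_margin(0.05) > 0
assert gross_margin(0.0005) < 0
```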
RAG systems that cannot pass the unit economics test should not go to production. The techniques in this lesson make most RAG products economically viable. The 99% cost reduction is not a stunt --- it is what separates prototypes from businesses.
Start with semantic caching. It is the highest-impact, lowest-effort optimization and typically saves 40-60% alone. Then add model routing. Those two layers combined get most systems to an economically viable per-query cost.
In the next lesson, we tackle the trust layer: building citation systems that users can verify. Cost optimization without citation quality creates a cheap system nobody trusts. Lesson 6 shows how to build the attribution layer that earns user confidence.