I cut a RAG system's per-query cost from $0.12 to $0.001 --- a 99% reduction --- while simultaneously improving answer quality. This was not a single optimization. It was a systematic audit of every cost center in the pipeline, applying the right lever at each stage.
This lesson is the playbook. I will walk through every technique in the order I applied them, with the dollar impact of each.
Before optimizing, you need to know where the money goes. Here is the typical cost breakdown for an unoptimized RAG pipeline serving 100,000 queries per day:
Per-query cost breakdown (unoptimized):

```
Query embedding:       $0.001   (embed the user query)
Vector DB query:       $0.002   (similarity search)
Re-ranking:            $0.005   (cross-encoder inference)
LLM generation:        $0.110   (GPT-4-class model, ~2,000-token context)
Logging & monitoring:  $0.002
─────────────────────────────────
Total per query:       $0.120

Daily cost (100K queries):  $12,000
Monthly cost:               $360,000
```
LLM generation dominates at 92% of cost. This is where most of the savings come from. But the other stages have optimization opportunities too.
Semantic caching is the single highest-impact optimization. In most production systems, 30--50% of queries are semantically identical or near-identical: users ask the same questions in slightly different ways.
```python
import hashlib
import time

import numpy as np
from sentence_transformers import SentenceTransformer


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


class SemanticCache:
    def __init__(self, similarity_threshold=0.95):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.cache = {}  # In production, use Redis + a vector index
        self.threshold = similarity_threshold

    def _hash(self, query: str) -> str:
        return hashlib.sha256(query.lower().strip().encode()).hexdigest()

    def get(self, query: str):
        # Check the exact cache first (hash-based, near-zero latency)
        query_hash = self._hash(query)
        if query_hash in self.cache:
            return self.cache[query_hash]["response"]

        # Fall back to the semantic cache (vector similarity; a real
        # deployment would use an ANN index instead of a linear scan)
        query_vec = self.model.encode(query)
        for entry in self.cache.values():
            if cosine_similarity(query_vec, entry["query_vec"]) >= self.threshold:
                return entry["response"]
        return None  # Cache miss

    def put(self, query: str, response: str):
        self.cache[self._hash(query)] = {
            "query_vec": self.model.encode(query),
            "response": response,
            "timestamp": time.time(),
        }
```
Impact: With 40% cache hit rate, per-query cost drops from $0.12 to ~$0.05. Monthly savings: $210,000.
Not every query needs GPT-4. A query like "What are your business hours?" does not require the same model as "Explain the tax implications of our Q3 restructuring."
```python
class ModelRouter:
    SIMPLE_MODEL = "gpt-4o-mini"  # $0.15 / 1M input tokens
    COMPLEX_MODEL = "gpt-4o"      # $2.50 / 1M input tokens

    def classify_complexity(self, query: str, retrieved_chunks: list) -> str:
        """Route based on query and retrieval signals."""
        if not retrieved_chunks:
            return self.COMPLEX_MODEL  # no retrieval signal: play it safe
        signals = {
            "short_query": len(query.split()) < 10,
            "single_chunk_match": len(retrieved_chunks) == 1,
            "high_confidence": retrieved_chunks[0].score > 0.92,
            "factoid_query": self._is_factoid(query),
        }
        if sum(signals.values()) >= 3:
            return self.SIMPLE_MODEL
        return self.COMPLEX_MODEL

    def _is_factoid(self, query: str) -> bool:
        """Detect simple factual queries by their leading phrase."""
        factoid_patterns = (
            "what is", "what are", "how much", "when does",
            "where is", "who is", "how many",
        )
        return query.lower().startswith(factoid_patterns)
```
The economics: GPT-4o-mini is roughly 17x cheaper than GPT-4o per token. If 70% of queries can be handled by the smaller model (and in most customer-facing RAG systems, they can), the blended cost drops significantly.
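The blended-cost arithmetic is worth making explicit. A minimal sketch, using the per-token prices quoted above:

```python
# Blended per-1M-input-token cost under a routing split.
MINI_COST = 0.15   # $ per 1M input tokens (gpt-4o-mini)
LARGE_COST = 2.50  # $ per 1M input tokens (gpt-4o)

def blended_cost(simple_fraction: float) -> float:
    """Average $ per 1M input tokens, given the share routed to the small model."""
    return simple_fraction * MINI_COST + (1 - simple_fraction) * LARGE_COST

# With 70% of queries routed to gpt-4o-mini:
print(round(blended_cost(0.70), 3))  # 0.855 -> roughly 3x cheaper than all-gpt-4o
```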
Impact: Per-query cost drops from $0.05 to ~$0.02. Monthly savings on top of caching: $90,000.
After retrieval and re-ranking, you have a set of chunks to send to the LLM. Most pipelines send everything. Smarter pipelines compress first.
Instead of sending the top 5 chunks, send only chunks above a relevance threshold:
```python
def compress_context(query, chunks, min_score=0.7, max_tokens=1500):
    """Only include chunks that meet the quality bar.

    Assumes a `reranker` with a .score() method and a `count_tokens`
    helper are available in scope.
    """
    scored_chunks = reranker.score(query, chunks)
    filtered = [c for c in scored_chunks if c.score >= min_score]

    # Enforce the token budget
    context = []
    token_count = 0
    for chunk in filtered:
        chunk_tokens = count_tokens(chunk.text)
        if token_count + chunk_tokens > max_tokens:
            break
        context.append(chunk)
        token_count += chunk_tokens
    return context
```
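The `count_tokens` helper above is left undefined; a rough stand-in, assuming the common ~4-characters-per-token heuristic for English text (for exact budgeting, use the model's real tokenizer, e.g. tiktoken):

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token for GPT-class tokenizers.
    Replace with the model's actual tokenizer for precise budgets."""
    return max(1, len(text) // 4)
```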
Pull only the relevant sentences from each chunk instead of the full chunk:
```python
from nltk.tokenize import sent_tokenize  # requires nltk's "punkt" data


def extract_relevant_sentences(query, chunk_text, max_sentences=3):
    """Extract only the sentences most relevant to the query.

    Assumes `embed` and `cosine_similarity` helpers are in scope
    (e.g. backed by a sentence-transformers model).
    """
    sentences = sent_tokenize(chunk_text)
    query_vec = embed(query)
    sentence_vecs = [embed(s) for s in sentences]
    scored = [
        (s, cosine_similarity(query_vec, sv))
        for s, sv in zip(sentences, sentence_vecs)
    ]
    scored.sort(key=lambda x: x[1], reverse=True)
    # Note: this returns sentences in relevance order; re-sort by original
    # position if the LLM needs them in document order.
    return " ".join(s for s, _ in scored[:max_sentences])
```
Impact: Reducing average context from 2000 tokens to 800 tokens cuts LLM input costs by 60%. Per-query cost drops to ~$0.008. Monthly savings: $36,000.
Embed documents in batches during ingestion rather than one at a time. Batching slashes per-request overhead and rate-limit pressure, and providers' asynchronous batch APIs (e.g. OpenAI's Batch API) discount token costs on top of that.
```python
# Expensive: one-at-a-time embedding (one HTTP request per document)
for doc in documents:
    embedding = openai.embeddings.create(
        input=doc.text, model="text-embedding-3-large"
    )

# Cheap: batch embedding (up to 2048 inputs per request)
BATCH_SIZE = 2048
for i in range(0, len(documents), BATCH_SIZE):
    batch = [doc.text for doc in documents[i:i + BATCH_SIZE]]
    embeddings = openai.embeddings.create(
        input=batch, model="text-embedding-3-large"
    )
```
Reduce storage costs by compressing vectors:
```python
# int8 quantization: 4x memory reduction, retains ~96% quality
# binary quantization: 32x memory reduction, retains ~92-96% quality

# Qdrant example
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance,
    ScalarQuantization,
    ScalarQuantizationConfig,
    ScalarType,
    VectorParams,
)

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,
            always_ram=True,  # keep quantized vectors in RAM for speed
        )
    ),
)
```
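The 4x figure falls straight out of the component widths. A back-of-envelope sketch, using an illustrative corpus of one million 1024-dimensional vectors:

```python
# Memory for 1M vectors at 1024 dims, float32 vs int8.
n_vectors, dims = 1_000_000, 1024
float32_bytes = n_vectors * dims * 4  # 4 bytes per float32 component
int8_bytes = n_vectors * dims * 1     # 1 byte per int8 component

print(f"{float32_bytes / 1024**3:.2f} GB -> {int8_bytes / 1024**3:.2f} GB")
# 3.81 GB -> 0.95 GB: the 4x reduction quoted above
```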
If using OpenAI's text-embedding-3 models, you can truncate dimensions:

```python
# Full dimensions: 3072 (default for text-embedding-3-large)
# Reduced: 256 dimensions, ~95% quality retention
response = openai.embeddings.create(
    input="your text",
    model="text-embedding-3-large",
    dimensions=256,  # 12x smaller vectors
)
```
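If you have already stored full-size vectors, you can shorten them client-side instead of re-embedding, but you must re-normalize after truncating; per OpenAI's documentation, the `dimensions` parameter handles normalization for you. A minimal sketch:

```python
import numpy as np


def truncate_embedding(vec, dims: int = 256) -> np.ndarray:
    """Truncate a stored full-size embedding and re-normalize to unit length
    so cosine similarity remains meaningful."""
    v = np.asarray(vec[:dims], dtype=np.float32)
    return v / np.linalg.norm(v)
```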
Impact: Combined embedding and storage optimizations bring per-query cost to ~$0.003.
Before any processing, normalize and deduplicate queries:
```python
import re


def normalize_query(query: str) -> str:
    """Normalize a query for caching and deduplication."""
    query = query.lower().strip()
    query = re.sub(r"\s+", " ", query)     # collapse whitespace
    query = re.sub(r"[?!.]+$", "", query)  # remove trailing punctuation
    return query
```
Identify your top 100 queries (they often cover 30--40% of traffic) and precompute answers:
```python
# Daily job: analyze query logs, precompute answers for the top queries
top_queries = analytics.get_top_queries(days=7, limit=100)
for query in top_queries:
    answer = full_rag_pipeline(query)
    precomputed_cache.set(query, answer, ttl=86400)  # 24-hour TTL
```
Impact: Final per-query cost: ~$0.001. Monthly cost for 100K daily queries: $3,000 (down from $360,000).
| Layer | Technique | Cost Reduction | Cumulative Per-Query |
|-------|-----------|----------------|----------------------|
| 0 | Unoptimized baseline | --- | $0.120 |
| 1 | Semantic caching | 58% | $0.050 |
| 2 | Model routing | 60% | $0.020 |
| 3 | Context compression | 60% | $0.008 |
| 4 | Embedding optimization | 63% | $0.003 |
| 5 | Query optimization | 67% | $0.001 |
Each layer is independent. You can apply them in any order based on what is easiest to implement in your system. But I recommend this order because each layer builds on the data from the previous one (e.g., cache hit rates inform model routing thresholds).
After optimization, run this sanity check:

```
Revenue (or value) per query:  $X
Cost per query:                $0.001
Gross margin per query:        $X - $0.001
```

If the gross margin is positive, you have a viable product. If it is negative, optimize further or rethink the product.
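The check is trivial to automate against your metrics pipeline. A minimal sketch, with the function name and default cost as assumptions:

```python
def gross_margin(value_per_query: float, cost_per_query: float = 0.001) -> float:
    """Per-query gross margin; positive means the unit economics work."""
    return value_per_query - cost_per_query


# A query worth $0.05 in value clears the bar; one worth $0.0005 does not.
assert gross_margin(0.05) > 0
assert gross_margin(0.0005) < 0
```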
RAG systems that cannot pass the unit economics test should not go to production. The techniques in this lesson make most RAG products economically viable. The 99% cost reduction is not a stunt --- it is what separates prototypes from businesses.
Start with semantic caching. It is the highest-impact, lowest-effort optimization and typically saves 40-60% alone. Then add model routing. Those two layers combined get most systems to an economically viable per-query cost.
In the next lesson, we tackle the trust layer: building citation systems that users can verify. Cost optimization without citation quality creates a cheap system nobody trusts. Lesson 6 shows how to build the attribution layer that earns user confidence.