Chunking is where most RAG pipelines silently fail. You pick a chunk size, run your splitter, and move on to the "interesting" parts --- embeddings and prompts. Months later, you discover that your system cannot answer questions that span two chunks, and you realize the foundation was wrong from the start.
I have tested every chunking strategy on this list in production. The right choice depends on your document types, query patterns, and cost constraints. There is no universal answer, but there are clear principles.
Chunking choices produce a measurable gap in retrieval quality. In my own benchmarks across enterprise document sets, the difference between the best and worst chunking strategy was 8-12% in Recall@5. That gap is the difference between a system users trust and one they abandon.
Here is the core tension: chunks that are too large dilute the embedding with irrelevant content, making retrieval imprecise. Chunks that are too small lose context, making retrieved fragments useless without their neighbors.
```
Too large:  "Here is a 2000-word section. The one relevant
             sentence is buried on line 47."
            -> Embedding captures the average meaning, not the specific answer.

Too small:  "The refund window is 30 days."
            -> Retrieved, but the user asked about exceptions,
               which live in the next chunk.

Just right: "Refund Policy: Customers may request a full refund
             within 30 days of purchase. Exceptions include
             digital products and custom orders, which are
             eligible for store credit only."
            -> Complete, self-contained, retrievable.
```
## Strategy 1: Fixed-Size Chunking

Split text every N tokens with M tokens of overlap.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Note: chunk_size is measured in characters by default; use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) to size by tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,  # ~12% overlap
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(document)
```
When to use: General-purpose starting point. Works well for homogeneous text like blog posts, documentation, and articles.
The numbers: 400--512 tokens with 10--20% overlap is the reliable default. I start every project here and only move to fancier strategies when metrics prove I need to.
Weakness: Ignores document structure. A chunk boundary can land in the middle of a table, a code block, or a critical paragraph.
## Strategy 2: Structure-Aware Chunking

Respect document boundaries: headings, sections, paragraphs, and HTML/Markdown structure.
```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = splitter.split_text(markdown_doc)
```
When to use: Structured documents --- technical docs, knowledge bases, legal contracts, API references.
Key technique: contextual headers. Prepend the section hierarchy to each chunk so the embedding captures where this chunk lives in the document:
```
## Billing > Refund Policy > Exceptions

Digital products and custom orders are eligible for
store credit only. Processing takes 5-7 business days.
```
This gives the embedding model critical context that would otherwise be lost. In my testing, adding contextual headers improved retrieval recall by 8--12% on hierarchical documents.
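The prepending step itself is trivial; a minimal sketch, assuming your structure-aware splitter hands you each chunk alongside the heading path it came from (the helper name and record shape here are hypothetical):

```python
def add_contextual_header(chunk_text, heading_path):
    """Prepend the section hierarchy so the embedding sees where the chunk lives."""
    header = "## " + " > ".join(heading_path)
    return f"{header}\n{chunk_text}"

chunk = "Digital products and custom orders are eligible for store credit only."
enriched = add_contextual_header(
    chunk, ["Billing", "Refund Policy", "Exceptions"]
)
# enriched now starts with "## Billing > Refund Policy > Exceptions"
```

Run this transformation at index time, before embedding, so the stored vectors already carry the hierarchy.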
## Strategy 3: Semantic Chunking

Group sentences by meaning rather than position. Measure embedding similarity between consecutive sentences and split where similarity drops below a threshold.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=85,  # split where similarity drops below the 85th percentile
)
chunks = chunker.create_documents([document])
```
When to use: Documents with topic shifts that do not align with structural markers --- transcripts, meeting notes, long-form articles without clear headings.
Trade-off: Requires embedding every sentence at index time, which adds cost. For a 100,000-document corpus, this can add significant preprocessing expense. I only use semantic chunking when the documents lack reliable structure and the retrieval metrics justify the cost.
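The core mechanism is simple enough to sketch from scratch. A toy version with hand-made sentence vectors, splitting wherever consecutive-sentence similarity drops below a fixed threshold (production chunkers use real embeddings and a percentile threshold instead):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_breakpoints(sentence_vectors, threshold=0.5):
    """Return indices where a new chunk should start."""
    breaks = []
    for i in range(len(sentence_vectors) - 1):
        if cosine(sentence_vectors[i], sentence_vectors[i + 1]) < threshold:
            breaks.append(i + 1)  # new chunk starts at sentence i+1
    return breaks

# Toy vectors: sentences 0 and 1 are similar; sentence 2 shifts topic.
vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
print(semantic_breakpoints(vecs))  # [2]
```

The cost comes from computing `sentence_vectors` with a real embedding model: one API call per sentence across the whole corpus.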
## Strategy 4: Hierarchical Chunking

Create multiple chunk sizes for the same document and store them in parallel. Retrieve small chunks for precision, then expand to their parent chunk for context.
```
Level 1 (coarse): Full section (~2000 tokens)
Level 2 (medium): Paragraph groups (~500 tokens)
Level 3 (fine):   Individual paragraphs (~150 tokens)

Query: "What is the refund timeline?"
  -> Level 3 match: "Refunds are processed within 30 days."
  -> Expand to Level 2: Full refund policy paragraph with exceptions.
  -> Send Level 2 to LLM for complete context.
```
When to use: When you need both precise retrieval and rich context. Works well for technical documentation and long legal documents.
Trade-off: 2--3x storage cost. Worth it for high-stakes domains where answer completeness matters more than storage bills.
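The expansion step can be sketched with an in-memory parent map; real systems store the `parent` pointer as chunk metadata in the vector DB (the ids and store shape here are hypothetical):

```python
# Fine chunks carry a pointer to the coarser chunk they were cut from.
fine_chunks = {
    "f1": {"text": "Refunds are processed within 30 days.", "parent": "p1"},
    "f2": {"text": "Shipping is free over $50.", "parent": "p2"},
}
parent_chunks = {
    "p1": "Refund Policy: Customers may request a full refund within 30 days. "
          "Exceptions include digital products and custom orders.",
    "p2": "Shipping Policy: Standard shipping is free on orders over $50.",
}

def expand_to_parent(fine_id):
    """Retrieve precisely on the fine chunk, but send its parent to the LLM."""
    return parent_chunks[fine_chunks[fine_id]["parent"]]

# A fine-grained hit on "refund timeline" expands to the full policy paragraph:
context_for_llm = expand_to_parent("f1")
```

Vector search scores the small, precise chunks; the LLM sees the larger parent, so the answer includes the exceptions the fine chunk omitted.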
## Strategy 5: Content-Type Routing

Different document types deserve different chunking strategies. Route based on format.
```python
def chunk_document(doc):
    if doc.type == "pdf_table":
        return table_chunker(doc)      # keep rows together
    elif doc.type == "code":
        return code_chunker(doc)       # split on functions/classes
    elif doc.type == "markdown":
        return structure_chunker(doc)  # split on headers
    elif doc.type == "transcript":
        return semantic_chunker(doc)   # split on topic shifts
    else:
        return fixed_chunker(doc)      # default fallback
```
When to use: Real production systems with mixed document formats. This is what I run in every system past the prototype stage.
Why it matters: A table chunked as plain text becomes meaningless fragments. Code split mid-function loses its logic. Transcripts split at fixed intervals break mid-thought. Routing solves this.
## Enrich Chunks with Metadata

Raw chunks are not enough. Before embedding, enrich each chunk with metadata that improves retrieval:
```python
enriched_chunk = {
    "text": chunk_text,
    "metadata": {
        "source": "billing-docs-v3.md",
        "section": "Refund Policy > Exceptions",
        "doc_type": "knowledge_base",
        "last_updated": "2025-01-15",
        "chunk_index": 4,
        "total_chunks": 12,
        "word_count": 187,
    },
}
```
This metadata enables filtered retrieval (for example, restricting to a `doc_type` or to recently updated content), source citations in generated answers, and neighbor-chunk expansion via `chunk_index`.
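As one illustration, metadata filtering can narrow candidates before (or alongside) vector search; a minimal in-memory sketch, with hypothetical chunk records (real vector DBs expose equivalent metadata filters):

```python
from datetime import date

chunks = [
    {"text": "Refund exceptions ...", "metadata": {"doc_type": "knowledge_base", "last_updated": "2025-01-15"}},
    {"text": "v2.3 release notes ...", "metadata": {"doc_type": "changelog", "last_updated": "2023-06-01"}},
]

def filter_chunks(chunks, doc_type=None, updated_after=None):
    """Keep only chunks matching the metadata constraints."""
    out = []
    for c in chunks:
        m = c["metadata"]
        if doc_type and m["doc_type"] != doc_type:
            continue
        if updated_after and date.fromisoformat(m["last_updated"]) < updated_after:
            continue
        out.append(c)
    return out

recent_kb = filter_chunks(chunks, doc_type="knowledge_base", updated_after=date(2024, 1, 1))
# Only the knowledge-base chunk updated in 2025 survives.
```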
## Choosing a Strategy

```
Start here:
  Fixed-size (512 tokens, 12% overlap)
          |
  Measure recall & precision
          |
  Below target? ---> Are documents structured?
                        /           \
                      Yes            No
                       |              |
              Structure-aware   Semantic chunking
                       \              /
                     Still below target?
                            |
                           Yes
                            |
               Hierarchical (multi-level)
                            |
             Mixed doc types? ---> Content-type routing
```
Do not skip to the complex strategies. Start simple, measure, and escalate only when the data tells you to. Every level of complexity adds engineering cost, debugging surface area, and processing time.
| Strategy | When It Works | When It Fails | Relative Cost |
|----------|---------------|---------------|---------------|
| Fixed-size (512 tokens) | Homogeneous prose, quick start | Structured docs, tables, code | Low (baseline) |
| Structure-aware | Markdown, HTML, technical docs | Unstructured text, transcripts | Low |
| Semantic | Transcripts, long-form without headers | Cost-sensitive pipelines, large corpora | High (embed every sentence) |
| Hierarchical | High-stakes domains needing precision + context | Storage-constrained environments | 2-3x storage |
| Content-type routing | Mixed-format production systems | Single-format document sets (overkill) | Medium (engineering complexity) |
To assess your own pipeline: if your recall is below 0.80 and you are still using fixed-size chunking on structured documents, structure-aware chunking is the most likely fix. If you are below 0.80 on unstructured text, test semantic chunking against your eval set before adding complexity.
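The recall numbers used throughout this lesson can be computed with a small helper. A minimal sketch of Recall@K over a hypothetical eval set, where each query pairs ranked retrieval results with its known-relevant chunk ids:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Toy eval set: (ranked retrieval results, relevant chunk ids)
evals = [
    (["c3", "c7", "c1", "c9", "c2"], ["c3", "c4"]),  # found 1 of 2 -> 0.5
    (["c5", "c8", "c6", "c0", "c4"], ["c5"]),        # found 1 of 1 -> 1.0
]
mean_recall = sum(recall_at_k(r, rel) for r, rel in evals) / len(evals)
print(mean_recall)  # 0.75
```

Run the same eval set against each candidate chunking strategy; the strategy comparison table above only matters insofar as it moves this number.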
In the next lesson, we tackle the other half of the retrieval equation: choosing the right embedding model for your domain. A great chunking strategy paired with the wrong embedding model still produces poor retrieval. Lesson 3 covers how to evaluate and select models for your specific data.