There is a quiet, specific way that RAG pipelines fail that nobody talks about enough.
A clinical pharmacist at a major hospital system asks their AI assistant: "What is the recommended warfarin dosage for patients with renal impairment?" The correct answer requires synthesizing two pieces of information that appear in different sections of a clinical guideline document: the dosage table for renal patients (which says "reduce by 25-50%") and the contraindication note three pages earlier (which says "avoid doses above 7.5mg in patients with GFR below 30").
The RAG system retrieves the dosage table. It does not retrieve the contraindication note. The chunk boundary fell between them. The AI assistant gives an answer that is technically grounded in the retrieved content and dangerously incomplete.
This is a chunking problem. And it produces exactly this class of failure: not random errors, but systematically predictable failures at boundaries the chunking strategy drew in the wrong place.

Why the 512-Token Default Fails
Most RAG implementations ship with a chunk size of 512 tokens. The number traces back to the 512-token input limit of early transformer encoders such as BERT, to tutorial defaults, and to the fact that it works well enough on general text to pass an initial demo. It is wrong for most specialized content.
The problem is boundary destruction. Information in real documents has structure:
- A policy document states a rule, then lists exceptions
- A technical manual describes a process, then lists safety conditions
- A legal contract defines a term, then specifies the conditions under which it applies
When chunk boundaries cut across these structures, the chunks become semantically incomplete. A chunk containing the rule but not its exceptions is not just less informative; it is dangerous, because it is retrieved as highly relevant and yields an authoritative-looking but incomplete answer.
The Three Classes of Boundary Failure
Class 1: Context Truncation. A chunk ends mid-explanation. The chunk is retrieved as relevant, but the answer it contains is missing critical qualifications that appeared after the boundary.
Class 2: Context Separation. Two related pieces of information that should be read together appear in different chunks. Retrieval may find one or the other, but not both, depending on how the query maps to each.
Class 3: Ambiguous Referents. A chunk contains pronouns, abbreviations, or references ("the aforementioned exception," "Section 4.2") whose antecedents are in a different chunk. The retrieved chunk is grammatically coherent but semantically incomplete without context.
Content-Type-Aware Chunking Strategies
The correct chunking strategy depends on the content type, not on a universal default.
Prose documents (policies, reports, analyses): Chunk at paragraph boundaries, not at fixed token counts. Paragraphs in well-written prose represent semantic units. Splitting within them loses structure; splitting at them preserves it. If paragraph lengths vary widely, use a soft maximum (800 tokens) and allow variable chunk sizes.
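A minimal sketch of paragraph-boundary chunking with a soft maximum. It assumes paragraphs are separated by blank lines and approximates token counts by splitting on whitespace; a real pipeline would count tokens with its own tokenizer. The function name `chunk_paragraphs` is illustrative only.

```python
import re


def chunk_paragraphs(text: str, soft_max_tokens: int = 800) -> list[str]:
    """Group whole paragraphs into chunks without ever splitting inside a paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        n = len(para.split())  # crude whitespace token count (assumption)
        # Close the current chunk if this paragraph would push it past the soft maximum.
        if current and current_tokens + n > soft_max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The maximum is soft in the sense that a single paragraph longer than the limit is still kept whole, so chunk sizes vary with the writing rather than with a fixed counter.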
Code and technical documentation: Chunk at function, class, or logical block boundaries. A function that spans 900 tokens should remain one chunk. A file with 20 short functions should split between them. Never chunk inside a function body.
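For Python source, one way to find function and class boundaries is the standard library's `ast` module. The sketch below is an assumption about how this could be done, not a general solution: it handles only top-level definitions, needs Python 3.8 or later for `end_lineno`, and drops comments that sit between top-level statements.

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class, plus grouped loose statements."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[str] = []
    loose: list[str] = []  # imports, constants, and other top-level statements
    for node in tree.body:
        # Decorators sit above node.lineno, so start from the first decorator if present.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        segment = "\n".join(lines[start - 1 : node.end_lineno])
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if loose:
                chunks.append("\n".join(loose))
                loose = []
            chunks.append(segment)  # the whole body stays in one chunk, whatever its length
        else:
            loose.append(segment)
    if loose:
        chunks.append("\n".join(loose))
    return chunks
```

Other languages would need their own parser; the point is only that the boundary comes from the syntax tree rather than from a token counter.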
Tables and structured data: Keep the table header with every row-level chunk. A retrieved row is often meaningless without the column context. For large tables, chunk by row groups with the header repeated.
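A small sketch of row-group chunking with the header repeated. It assumes the table has already been parsed into a header list and a list of rows; the pipe-separated rendering and the name `chunk_table` are illustrative choices.

```python
def chunk_table(header: list[str], rows: list[list[str]], rows_per_chunk: int = 20) -> list[str]:
    """Split a table into row groups, repeating the header line in every chunk."""
    header_line = " | ".join(header)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start : start + rows_per_chunk]
        body = "\n".join(" | ".join(row) for row in group)
        # Every chunk carries the column names, so a retrieved row is self-describing.
        chunks.append(f"{header_line}\n{body}")
    return chunks
```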
Lists and enumerated items: Keep the list introduction with the first few items, and repeat the introduction with each subsequent chunk. "The five conditions that trigger this override are: [items 3-5]" needs to include the introduction to be interpretable.
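The same idea applied to lists, as a rough sketch. It assumes the introduction and the items have already been separated, and it keeps the original item numbers so that a later chunk still reads as items 3 to 5 of the same list. The name `chunk_list` is hypothetical.

```python
def chunk_list(intro: str, items: list[str], items_per_chunk: int = 3) -> list[str]:
    """Split a list into groups of items, repeating the introduction in every chunk."""
    chunks = []
    for start in range(0, len(items), items_per_chunk):
        group = items[start : start + items_per_chunk]
        # Preserve original numbering so "item 4" still means item 4 after splitting.
        numbered = [f"{start + i + 1}. {item}" for i, item in enumerate(group)]
        chunks.append("\n".join([intro] + numbered))
    return chunks
```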
Hierarchical documents (legal, clinical, technical standards): Use hierarchical chunking that preserves the section context. Each leaf chunk carries the breadcrumb of its parent sections: Chapter > Section > Subsection > Paragraph. Retrieval finds the leaf; the context provides the hierarchy.
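One possible shape for breadcrumb-carrying leaf chunks is sketched below. The `Section` dataclass and `leaf_chunks` function are illustrative assumptions; they presume the document has already been parsed into a tree of titled sections.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    paragraphs: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def leaf_chunks(section: Section, trail: tuple[str, ...] = ()) -> list[dict]:
    """Walk the section tree and emit one chunk per paragraph, tagged with its breadcrumb."""
    path = trail + (section.title,)
    breadcrumb = " > ".join(path)
    chunks = [{"breadcrumb": breadcrumb, "text": p} for p in section.paragraphs]
    for child in section.children:
        # Children inherit the full path of their ancestors.
        chunks.extend(leaf_chunks(child, path))
    return chunks
```

The breadcrumb can be stored as metadata, prepended to the chunk text before embedding, or both, depending on how the retriever is set up.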

The Retrieval vs. Generation Tradeoff
Chunking sits at the intersection of two optimization targets that pull in different directions:
Retrieval favors smaller chunks. Smaller chunks have higher semantic density: each chunk is about one thing, so it matches a specific query more precisely. Retrieval recall and precision both tend to improve as chunks shrink, up to the point where a chunk is too small to carry a complete thought.
Generation favors larger chunks. Larger chunks provide more context to the LLM. The answer to a complex question usually requires multiple pieces of information. If those pieces are in different small chunks, they may not all be retrieved. If they're in one large chunk, they arrive together.
Hierarchical chunking is the most principled solution to this tension. Embed small chunks for retrieval, but store parent chunks for generation. When a small chunk is retrieved, return its parent chunk to the LLM as context. This combines the retrieval precision of small chunks with the generation context of large ones.
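A toy sketch of the embed-small, return-parent pattern. The `embed` function here is a bag-of-words stand-in for a real embedding model, and the index is a plain list instead of a vector database; both are assumptions made to keep the example self-contained.

```python
from collections import Counter
import math
import re


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(parents: dict[str, list[str]]) -> list[tuple[Counter, str]]:
    """Embed each small child chunk, remembering which parent it came from."""
    return [(embed(child), pid) for pid, children in parents.items() for child in children]


def retrieve_parents(query: str, index: list[tuple[Counter, str]],
                     parents: dict[str, list[str]], k: int = 2) -> list[str]:
    """Rank child chunks by similarity, then return their de-duplicated parent texts."""
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    seen: list[str] = []
    for _, pid in ranked[:k]:
        if pid not in seen:
            seen.append(pid)
    # The LLM receives the full parent, not just the child that matched.
    return [" ".join(parents[pid]) for pid in seen]
```

In this sketch a parent is simply the concatenation of its children. A query that matches only the rule still brings back the exception stored next to it, which is the behavior the pattern is meant to provide.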
Sliding window is a simpler solution that works reasonably well: each chunk overlaps with the previous by 10-20%. Context near boundaries appears in two chunks, reducing the probability of catastrophic boundary failures.
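A short sketch of the sliding window, assuming the document is already tokenized into a list. The defaults (512 tokens, 15% overlap) are illustrative values, not recommendations.

```python
def sliding_window(tokens: list[str], size: int = 512, overlap: float = 0.15) -> list[list[str]]:
    """Split tokens into fixed-size windows where each window overlaps the previous one."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(size * (1 - overlap)))  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

The overlap is paid for in index size: at 15% overlap, roughly 15% of the tokens are embedded and stored twice.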
Semantic Chunking
Semantic chunking uses the embedding model itself to find natural boundaries: a chunk ends where the embeddings of adjacent sentences or passages diverge significantly. It finds content-aware boundaries without content-type-specific rules.
The genuine value: semantic chunking adapts automatically to mixed-content documents and finds boundaries in prose that don't align with standard structural markers.
The honest limitation: semantic chunking adds meaningful computational cost, is sensitive to the choice of embedding model, and can produce chunks of dramatically varying length that complicate downstream processing. For teams with well-defined content types, structure-aware chunking often outperforms semantic chunking at lower cost.
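A simplified sketch of the idea: compare each sentence with the one before it and start a new chunk when similarity drops below a threshold. The bag-of-words `embed` function is a toy stand-in for a real sentence embedding model, and the threshold value is an arbitrary assumption that would need tuning per model and corpus.

```python
from collections import Counter
import math
import re


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding model: a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk wherever adjacent sentences fall below the similarity threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append([curr])  # similarity dropped: treat this as a boundary
        else:
            chunks[-1].append(curr)
    return chunks
```

Even this small version shows where the cost comes from: every sentence is embedded once just to decide boundaries, before any chunk is embedded for retrieval.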
Metadata as a Quality Multiplier
The chunking strategy handles what goes into each chunk. Metadata handles what the retrieval system knows about each chunk that isn't in its text.
Metadata that consistently improves retrieval quality:
- Document recency / version
- Document authority level (official policy vs. working draft)
- Section type (definition, exception, procedure, reference)
- Jurisdiction, department, or product scope
Contextual chunking, which adds to each chunk a sentence of natural-language context describing where in the document it appears, is one of the highest-leverage techniques and among the cheapest to implement. "This passage is from the third section of the warfarin dosing protocol, following the renal adjustment table and preceding the contraindication notes" turns a fragment into a navigable document location.
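One way to carry both the context sentence and the metadata alongside each chunk is sketched below. The `Chunk` dataclass, its field names, and `filter_chunks` are hypothetical; how the context sentence is produced (a template, an LLM call, or manual annotation) is left open.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    context: str  # one sentence describing where the chunk sits in its document
    metadata: dict[str, str] = field(default_factory=dict)

    def text_for_embedding(self) -> str:
        """Prepend the context sentence so it is embedded together with the chunk text."""
        return f"{self.context}\n\n{self.text}"


def filter_chunks(chunks: list[Chunk], **required: str) -> list[Chunk]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [c for c in chunks if all(c.metadata.get(k) == v for k, v in required.items())]
```

A call such as `filter_chunks(chunks, authority="official")` would then narrow the candidates before or after similarity ranking, so a working draft does not outrank the official policy it was written to replace.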
The Case Study: 67% to 91% Recall from Chunking Alone
A legal document review team spent three months fighting retrieval quality issues in their RAG system. They had good retrieval infrastructure, good embedding models, and a modern LLM. Their recall on held-out evaluation queries was 67%.
Over two weeks, they:
- Identified the document types in their corpus (contracts, regulations, correspondence, court decisions)
- Implemented type-specific chunking strategies for each
- Added hierarchical context to every leaf chunk
- Implemented contextual metadata enrichment
Recall on the same held-out evaluation set: 91%.
No embedding model change. No vector database change. No LLM change. The entire gain came from fixing what the retrieval system was actually receiving.
This is not unusual. Chunking is often the highest-leverage intervention in a struggling RAG pipeline, and it is usually the component that received the least deliberate design attention.
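Measuring a change like this requires a fixed evaluation set and a recall metric that stays the same across chunking strategies. The article does not say how the team computed recall, so the sketch below assumes one common definition: for each query, the fraction of labeled relevant chunk IDs that appear in the top k results, averaged over queries.

```python
def recall_at_k(retrieved: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 5) -> float:
    """Average, over queries, of the fraction of relevant chunk IDs found in the top k results."""
    scores = []
    for query, gold in relevant.items():
        if not gold:
            continue  # skip queries with no labeled relevant chunks
        top_k = set(retrieved.get(query, [])[:k])
        scores.append(len(gold & top_k) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0
```

Because chunk IDs change whenever the chunking strategy changes, the relevance labels are usually easier to maintain if they point at source passages and are mapped to chunk IDs after each re-chunking.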