Chunking Strategies for Legal & Reference RAG Systems

Chunking is the load-bearing decision in any RAG system, and the one most teams get wrong. Generic tutorials reach for fixed-size chunking with 512-token windows and 50-token overlap. That setting is fine for blog content and disastrous for a statute, a clinical guideline, or a 1,400-page practitioner handbook. The wrong chunking produces a system that retrieves “things mentioning the query” rather than “the passage that answers it.”

This guide is for technical readers building RAG over legal, regulatory, scholarly, or reference content: corpora where the structure of the document is meaningful and the cost of a wrong answer is high. The strategies below are the ones that held up in production, including in deployments where retrieval errors get noticed immediately because the users are domain experts. The AAAi Chat Book is one of those deployments; the case-study section near the end walks through how the choices were made for that specific corpus.

Why generic chunking fails on legal and reference corpora

The default RAG tutorial chunks documents into fixed-size windows of N tokens with M-token overlap. For unstructured prose — articles, blog posts, transcripts — this is workable. For structured documents it is not.

Three problems recur.

Boundaries land in the middle of semantic units. A statute’s subsection (a)(2)(iv) defines a term used in subsection (a)(2)(v). Fixed-size chunking places boundaries by token count alone, so it can cut straight through the definition. The retrieval system finds the chunk containing the definition’s first half, presents it to the model, and the model produces an answer based on an incomplete definition. Sometimes the answer is correct anyway; often it isn’t.

Context that depends on structure is lost. Legal and reference documents derive meaning from where they sit: which chapter, which section, which jurisdiction, which version. A clause about “the Buyer’s obligations” needs the surrounding contract context to make sense. Strip the structure and the clause becomes ambiguous.

Cross-references break. A regulation says “subject to the conditions in section 4(b).” A treatise says “as discussed in chapter 7.” Generic chunking ignores these — the retrieved chunk references content that wasn’t retrieved with it, so the model can’t follow the chain.

The fixes are not exotic. They are well-known to anyone who has done serious retrieval work. But they require treating chunking as a design decision per corpus rather than a global setting.

Five chunking strategies compared

The strategies below cover most of what you will use in production. Almost no real deployment uses just one — they get combined per corpus and per document type.

Strategy	Precision	Context preservation	Cost	Ideal use case	Failure mode
Fixed-size	Low-Medium	Poor	Lowest	Unstructured prose, transcripts, blog content	Boundaries split semantic units; cross-refs broken
Recursive (header/separator-aware)	Medium-High	Good	Low	Documents with consistent structure (Markdown, HTML, structured PDFs)	Fails when headers are inconsistent or missing
Semantic (embedding-based)	High	Good	Medium	Documents without reliable headers; ambiguous narrative structure	Slower to index; sensitive to embedding model choice
Parent-document retrieval	High	Excellent	Medium	Long reference works, statutes, clauses with surrounding context dependencies	More storage; need both child and parent stores
Late chunking	Medium-High	Excellent	Higher	Heavily cross-referenced corpora; needs document-level embedding awareness	Heavier compute; less mature tooling

Each strategy is expanded below. Which strategy you pick matters less than how carefully you calibrate it against your corpus.

Fixed-size chunking

Split the document into N-token chunks with M-token overlap. Simple, fast, predictable. Fine for unstructured content where the loss of semantic boundaries doesn’t matter much.
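As a minimal sketch of the mechanic just described, the function below slides a window of N tokens forward by N minus M tokens at a time. The 512/50 values mirror the generic defaults mentioned at the top of this guide, and the placeholder token list is an assumption standing in for real tokenizer output.

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Placeholder tokens stand in for real tokenizer output (assumption).
tokens = [f"t{i}" for i in range(1200)]
chunks = fixed_size_chunks(tokens, size=512, overlap=50)
print([len(c) for c in chunks])
```

For the 1,200-token placeholder document this yields three windows, and the last 50 tokens of each window reappear at the start of the next one. Nothing in the function knows where a sentence, clause, or subsection ends, which is exactly the weakness discussed above.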

For legal and reference content, the place fixed-size still shows up is as a fallback for sections of a document that lack clean structure (a long introduction with no subsections, for example). It is not the right primary strategy for these corpora.

Recursive chunking

Split on the document’s own structural markers first — headings, section breaks, paragraph boundaries — and only fall back to fixed-size splits when no structural marker is available within the target chunk length. This is what most libraries call “recursive character text splitting” with a custom separator list.
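The idea can be sketched in a few lines of plain Python. This is an illustrative re-implementation under simple assumptions (character counts rather than tokens, a hypothetical separator list, greedy merging of adjacent pieces), not the code of any particular library.

```python
def recursive_split(text, separators, max_chars):
    """Split on the coarsest separator first, recurse with finer separators
    only for pieces that are still too long, and hard-cut as a last resort."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    head, rest = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in (p for p in text.split(head) if p.strip()):
        candidate = buffer + head + piece if buffer else piece
        if len(candidate) <= max_chars:
            buffer = candidate  # keep merging neighbours while they fit
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= max_chars:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, rest, max_chars))
    if buffer:
        chunks.append(buffer)
    return chunks


# Hypothetical two-section document; blank lines separate sections.
doc = (
    "Section 1. Scope\nThis part applies to all filings.\n\n"
    "Section 2. Definitions\n(a) Filing means a submitted claim.\n"
    "(b) Party means a claimant or respondent."
)
chunks = recursive_split(doc, ["\n\n", "\n", " "], max_chars=60)
for chunk in chunks:
    print(repr(chunk))
```

With a 60-character budget, Section 1 fits in one chunk, while Section 2 is too long and is split again on line breaks, so the cut falls between definitions (a) and (b) rather than inside either of them.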

For well-structured corpora (Markdown, HTML, properly-tagged PDFs), recursive chunking is a strong default. It produces chunks that align with the document’s own organisation, which is usually a good proxy for semantic boundaries.

For PDFs of legal and scholarly content, the catch is that the structure has to actually be readable. PDFs vary wildly in how cleanly text extraction recovers the document’s organisation. Investing in a good extractor (e.g., a layout-aware one rather than a naive text dump) pays back many times over in chunking quality.

Semantic chunking

Embed sentences or short passages, compute similarity between adjacent passages, and split where the semantic distance crosses a threshold. The chunks are bounded by topic shifts rather than by character count.

Useful when document structure is inconsistent or absent — narrative reference works, transcripts of structured discussions, edited compilations where headings are stylistic rather than semantic. The downside is computational: every sentence has to be embedded during the chunking step. The upside is that chunks reflect topic, which improves retrieval precision.

Parent-document retrieval

Index small chunks for retrieval precision and large chunks (or full sections, or full chapters) for context. When a small chunk is retrieved, the system returns its parent chunk to the model. The retrieval is precise; the context is rich.

This is one of the most useful patterns for long-form reference content. A user asks a question about an arbitration procedure. A specific paragraph is the best match — it is the one paragraph that addresses the question — but the model needs the surrounding section to understand the framework. Parent-document retrieval gives you both without compromising either.

The cost is storage: you maintain two stores (or one store with two granularities). Worth it for reference content. Often unnecessary for general-purpose content where small chunks alone are enough.

Late chunking

A newer pattern: run the entire document (or a long sub-section) through the embedding model first, then derive each chunk’s embedding by pooling the token-level representations that fall inside it. Each chunk’s embedding reflects its place in the broader document rather than being computed in isolation.

The intuition is that meaning depends on context. “The Buyer” in a contract is determined by the contract’s definitions; embedding the clause in isolation loses that. Late chunking lets the embedding capture the document-level context before the chunk is extracted.

Late chunking is most useful for heavily cross-referenced corpora: contracts, statutes, and treatises where individual clauses depend on document-wide context to be interpretable. The compute cost is higher (the whole document goes through the embedding model in one long-context pass, which also requires a model whose context window can hold it), and the tooling is still maturing. For most deployments it is worth experimenting with on a subset of the corpus before committing.

Parent-document retrieval for clauses, statutes, chaptered reference works

Parent-document retrieval deserves its own section because it is the pattern that does the most work in legal and reference deployments.

The mechanic, in concrete terms:

1. For each document, produce two sets of chunks: small chunks (a paragraph or two, optimised for retrieval precision) and large chunks (the enclosing section, chapter, or clause, optimised for context).
2. Index the small chunks in the vector store with metadata pointing to their parent.
3. At retrieval time, search over the small chunks for precision.
4. When you return retrieved chunks to the LLM, swap each one for its parent.
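The four steps above can be sketched as follows. Everything here is illustrative: the `Child` record, the word-overlap scorer (a stand-in for vector search), and the two-section sample text are assumptions, not part of any real deployment.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str        # small chunk that gets indexed and searched
    parent_id: str   # metadata pointing at the enclosing section


def overlap_score(query, text):
    """Toy relevance score (shared words); a real system would use vector search."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, children, parents, k=2):
    """Search the small chunks, then return each hit's parent once, in rank order."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c.text), reverse=True)
    seen, results = set(), []
    for child in ranked[:k]:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            results.append(parents[child.parent_id])
    return results


# Hypothetical sections and their paragraph-level children.
parents = {
    "s1": "Section 1: Pre-hearing conference. The tribunal schedules a conference. "
          "Parties exchange witness lists before the conference.",
    "s2": "Section 2: Awards. The award is issued in writing. The award states the reasons.",
}
children = [
    Child("The tribunal schedules a conference.", "s1"),
    Child("Parties exchange witness lists before the conference.", "s1"),
    Child("The award is issued in writing.", "s2"),
    Child("The award states the reasons.", "s2"),
]
results = retrieve_parents("when do parties exchange witness lists", children, parents)
print(results)
```

Both top-ranked children belong to the same section, so the model receives that section once rather than twice. Deduplicating parents is a design choice worth keeping: without it, several adjacent hits can fill the context window with copies of the same text.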

The reason this works for legal content is that legal documents are organised around hierarchical units that mean more than the sum of their sentences. A subsection of a regulation needs the surrounding section to be interpretable. A clause in a contract needs the section it sits in. Parent-document retrieval respects that structure.

In the AAAi Chat Book corpus, the practitioner handbook is organised into chapters and sections that already carry meaning. Chapter-and-section-aware chunking with parent-document retrieval was the natural fit. A user’s question about pre-hearing procedure retrieves the specific paragraph that addresses it, but the model sees the whole section’s framing. Citations point to the section, not just the matched paragraph, so the user can read the broader context if they want to verify.

The same pattern works for chaptered treatises, regulations with consistent subsection structure, contract templates with clearly-bounded clauses, and journal articles with structured abstracts and method sections. It does not work as well for unstructured prose where there is no useful parent unit larger than the chunk itself.

Semantic chunking when headings are unreliable

Some corpora are structured in principle but unreliable in practice. The PDF extraction is messy. The Markdown was generated by an OCR pipeline. The headings are stylistic rather than semantic. The structure is there but you cannot trust it.

In these cases, semantic chunking is the right tool. The mechanic:

1. Split the document into sentences (or short fixed-size sub-units).
2. Embed each.
3. Compute similarity between adjacent sentences.
4. Insert a chunk boundary where the similarity drops below a threshold.
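A minimal sketch of these steps is shown below. The `toy_embed` function is a bag-of-words stand-in for a real embedding model, and the 0.3 threshold is an arbitrary illustrative value; both are assumptions that would be replaced and calibrated per corpus.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(w.strip(".,;:()").lower() for w in sentence.split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences, embed, threshold):
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([])  # topic shift: open a new chunk
        chunks[-1].append(sentence)
    return [" ".join(c) for c in chunks]


# Hypothetical sentences covering two topics.
sentences = [
    "The tribunal may order interim measures.",
    "Interim measures may be ordered by the tribunal before the hearing.",
    "Fees are payable within thirty days.",
    "Late fees accrue after thirty days.",
]
chunks = semantic_chunks(sentences, toy_embed, threshold=0.3)
for chunk in chunks:
    print(chunk)
```

The only boundary falls between the second and third sentences, where the topic moves from interim measures to fees. With a real embedding model the structure stays the same; only `embed` and the threshold change.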

The resulting chunks reflect topic shifts rather than structural markers. For content where headings cannot be trusted, this is more reliable than recursive chunking would be.

The cost is real: you are embedding every sentence during the chunking step, which is a meaningful compute hit for large corpora. The benefit is better retrieval quality on poorly-structured documents, because chunk boundaries follow the content rather than markup that cannot be trusted.

A common hybrid: use semantic chunking for documents where the structure is unreliable, recursive chunking for documents where the structure is good. The choice can be per-document or per-corpus, driven by a sample evaluation during ingestion.

Late chunking for cross-referenced documents

Late chunking is the most experimental of the strategies in this article and the one with the most upside for the right corpus.

The traditional pipeline embeds chunks in isolation. The chunk “the Buyer shall deliver the Closing Documents to the Seller within five Business Days” gets an embedding that ignores who the Buyer and Seller are, what Closing Documents are, what counts as a Business Day in this contract.

Late chunking inverts the order. The whole document (or a long sub-section) is embedded first, producing token-level representations that reflect document-wide context. Chunks are extracted from those contextualised representations. Each chunk’s embedding “remembers” the surrounding document.
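The pooling step at the heart of this can be sketched without any model at all. In the snippet below the token vectors are hard-coded toy values; in a real pipeline they would come from a single forward pass of a long-context embedding model over the whole document, and the span offsets would come from whatever chunker is in use. Both are assumptions here, and mean pooling is one common choice rather than the only one.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def late_chunk(token_vectors, spans):
    """Pool already-contextualised token vectors into one vector per chunk.

    `spans` holds (start, end) token offsets, end exclusive.
    """
    return [mean_pool(token_vectors[start:end]) for start, end in spans]


# Toy 2-dimensional "contextualised" token vectors for a 4-token document.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # two chunks of two tokens each
chunk_vectors = late_chunk(token_vectors, spans)
print(chunk_vectors)
```

The difference from the traditional pipeline is not in this function but in its input: because every token vector was computed with the full document in view, the pooled chunk vector inherits that context.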

For heavily cross-referenced legal content this is a meaningful improvement. Retrieval becomes more aware of definitional context. The clause about Buyer’s obligations is more likely to be retrieved for a query about Buyer obligations because the embedding actually reflects that the chunk is about the Buyer (not just “a buyer”).

The trade-offs are compute (embedding long documents is more expensive than embedding short chunks) and tooling maturity (the libraries are newer and have rougher edges). For most deployments, late chunking is worth piloting on a subset of the corpus, then rolling out for document types where it materially improves retrieval precision.

Evaluation: which metrics matter for legal Q&A

Chunking decisions are only as good as the evaluation that validates them. For legal and reference content, three metrics carry most of the load.

Retrieval recall at a fixed cut-off. For a labelled set of queries, what fraction of the gold-relevant chunks appear in the top-k retrieved results? This is the metric that tells you whether your chunking captures the right material at all. If recall is low, no amount of reranking or generation will save you.

Retrieval precision at a fixed cut-off. For the same query set, what fraction of the top-k retrieved chunks are actually gold-relevant? Precision tells you how much noise the model has to filter through. Low precision wastes context window and increases the risk that the model latches onto an irrelevant chunk.
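Both retrieval metrics reduce to a few lines of set arithmetic. The sketch below assumes chunk IDs are unique strings and that each query has a set of gold-relevant IDs; the IDs themselves are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of gold-relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are gold-relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return len(set(top) & set(relevant)) / len(top)


# Hypothetical ranked results and gold labels for one query.
retrieved = ["c7", "c2", "c9", "c4", "c1"]
relevant = {"c2", "c4", "c8"}
recall = recall_at_k(retrieved, relevant, k=4)
precision = precision_at_k(retrieved, relevant, k=4)
print(round(recall, 3), precision)
```

Here two of the three gold chunks are found in the top four results, so recall is two thirds and precision is one half. In practice both numbers are averaged over the whole labelled query set and compared across chunking configurations.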

Answer faithfulness against the retrieved context. Does the generated answer make only claims that the retrieved context supports? This is checked by an LLM judge (often a different model than the generator). It is the metric that catches the case where retrieval was fine but the model still drifted off-source.

For heterogeneous corpora (a legal practice library mixed with a journal back-catalog, for example), a single relevance score typically does not generalise. The fix we use in production is a weighted linear combination of several scores (cosine similarity, dot product, sometimes BM25 score), with weights derived from per-corpus calibration. The weights and the thresholds come out of the same evaluation loop.
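A weighted combination of this kind might look like the sketch below. The min-max normalisation step is an assumption added here so that scores on different scales (a cosine near 1, a BM25 score in the tens) can be mixed at all; the weights and score values are hypothetical.

```python
def min_max(scores):
    """Rescale one score list to [0, 1] so different scales can be mixed."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def combine(score_lists, weights):
    """Weighted sum of normalised scores, one combined value per candidate."""
    normalised = {name: min_max(vals) for name, vals in score_lists.items()}
    count = len(next(iter(score_lists.values())))
    return [
        sum(weights[name] * normalised[name][i] for name in score_lists)
        for i in range(count)
    ]


# Three candidate chunks scored two ways (hypothetical values).
scores = {"cosine": [0.2, 0.8, 0.5], "bm25": [12.0, 3.0, 9.0]}
weights = {"cosine": 0.7, "bm25": 0.3}  # hypothetical; would come from calibration
combined = combine(scores, weights)
best = max(range(len(combined)), key=combined.__getitem__)
print(combined, best)
```

With these weights the second candidate wins on the strength of its cosine score even though its BM25 score is the lowest. Shifting weight toward BM25 changes the ranking, which is why the weights belong in the same calibration loop as the thresholds.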

Two more pieces matter operationally:

Per-corpus threshold calibration. Generic thresholds do not work. Score distributions differ by content type, embedding model, and chunking strategy. For each corpus we either run an automated sweep against a labelled test set to pick the threshold, or set it by hand based on a sample review. The threshold is what gates the refusal behaviour described in Hallucination-Proof RAG Architecture.
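An automated sweep can be as simple as the sketch below. It assumes the labelled test set has been reduced to (score, is_relevant) pairs and uses F1 as the selection criterion; both the data and the choice of F1 are illustrative assumptions, and a deployment that prefers refusing over guessing might weight precision more heavily.

```python
def f1_at_threshold(examples, threshold):
    """F1 of the accept/reject decision `score >= threshold` on labelled examples."""
    tp = sum(1 for score, rel in examples if score >= threshold and rel)
    fp = sum(1 for score, rel in examples if score >= threshold and not rel)
    fn = sum(1 for score, rel in examples if score < threshold and rel)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep(examples, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at_threshold(examples, t))


# Hypothetical (retrieval score, gold relevance) pairs from a labelled test set.
examples = [
    (0.91, True), (0.84, True), (0.77, False),
    (0.65, True), (0.52, False), (0.40, False),
]
best = sweep(examples, [0.5, 0.6, 0.7, 0.8, 0.9])
print(best)
```

On this toy data the sweep settles on 0.6: a lower threshold admits an extra irrelevant chunk, and a higher one starts rejecting relevant ones. The same loop re-run on a different corpus or embedding model will generally land somewhere else.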

Regression coverage via promptfoo or equivalent. Every chunking change is a system change. The test corpus re-runs whenever chunking parameters, embedding model, or generator change. This is what keeps quality from drifting silently as components are upgraded. Per-client test corpora capture the specific question patterns each deployment sees; a model upgrade that passes our internal tests can still regress on a client-specific question pattern, and the per-client suite catches that.

How chunking was approached for the AAAi Chat Book corpus

The AAAi practitioner handbook presented a clean test case: consistent chapter structure, predictable subsection hierarchy, and a user base of arbitrators who would immediately notice a retrieval error. That made chapter-and-section-aware recursive chunking with parent-document retrieval the natural baseline.

Specific choices that mattered for that corpus, described qualitatively:

Chunk granularity at the section level for retrieval candidates, with parent-document retrieval surfacing the chapter context. Arbitrators asking specific procedural questions need the section’s specific guidance, but the model benefits from seeing the broader procedural framework.
Manual threshold calibration during the initial deployment. A test set was built from representative practitioner questions; thresholds were tuned against this set rather than carried over from defaults.
Cross-encoder reranking on the candidate set. The initial retrieval pass returns more candidates than needed; the reranker selects the few that go to generation. This is what keeps precision high enough to satisfy expert users.
Per-section page-level citation. Citations point to the section and the page within it, so the practitioner can verify in seconds.
Promptfoo regression coverage built during deployment, re-run on every component change. The test corpus contains the kinds of questions the AAAi practitioner audience actually asks.
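The over-fetch-then-rerank step in the list above follows a simple shape, sketched here with a toy scorer. The `jaccard` function is only a placeholder: a cross-encoder would score each query and passage pair jointly with a trained model. The candidate passages are invented for illustration and are not taken from the AAAi corpus.

```python
def jaccard(query, passage):
    """Toy pair scorer (assumption); a cross-encoder would score the pair jointly."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q | p) if q | p else 0.0


def rerank(query, candidates, score_pair, keep=2):
    """Score every (query, candidate) pair and keep the best `keep` candidates."""
    ranked = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return ranked[:keep]


# Hypothetical first-pass candidates, deliberately more than generation needs.
candidates = [
    "The award is issued in writing.",
    "Witness lists are exchanged before the hearing.",
    "The hearing date is set by the tribunal.",
    "Parties exchange witness lists ten days before the hearing.",
]
query = "when are witness lists exchanged before the hearing"
top = rerank(query, candidates, jaccard, keep=2)
print(top)
```

Because the scorer is passed in as a function, swapping the placeholder for a real reranking model does not change the surrounding pipeline.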

None of these decisions transfer mechanically to other corpora. The general principle does: the right chunking choices come from looking at the corpus, defining a representative test set, and calibrating against it. There is no useful default.

Frequently asked questions

What is the best chunking strategy for RAG?

There is no single best strategy. For structured legal and reference content, recursive chunking with parent-document retrieval is the strongest default. For unreliably structured content, use semantic chunking. For heavily cross-referenced corpora, experiment with late chunking. The right answer is calibrated per corpus against a representative test set.

What chunk size should I use?

Chunk size is corpus-dependent and should be tuned, not defaulted. For legal practitioner content, section-level chunks with parent-document retrieval surfacing the chapter context are a useful starting point. For shorter dense content (clauses, regulations), smaller is often better. Run a sweep against a labelled test set and pick the size that maximises retrieval precision and recall jointly.

Should I use semantic chunking or recursive chunking?

Use recursive chunking when document structure is reliable (clean Markdown, well-tagged HTML, properly-extracted PDFs). Use semantic chunking when structure is inconsistent or absent. For corpora that mix both, run both strategies on a sample and compare retrieval quality. A hybrid approach (semantic chunking only for documents that fail structural extraction) is often the best operational answer.

What is parent-document retrieval and when do I need it?

Index small chunks for retrieval precision; return larger parent chunks (sections, chapters, full clauses) to the model for context. Use it when the retrievable unit (a paragraph) needs surrounding context (the section) to be interpretable. This is the norm for long-form reference works, statutes, contracts, and chaptered treatises.

How do I evaluate whether my chunking is working?

Build a labelled test set of representative queries with gold-relevant chunks marked. Measure recall and precision at a fixed retrieval cut-off, and faithfulness of generated answers against retrieved context. Re-run after every chunking change. Without a labelled test set you are guessing; with one you can compare strategies quantitatively.

How often should I re-run chunking evaluation?

Any time a chunking parameter, embedding model, or generator changes. A promptfoo (or equivalent) regression suite makes this automatic. For static deployments it might run on every release; for evolving corpora it runs on every meaningful change. The cost of running the suite is small compared to the cost of shipping a regression.

What about overlap between chunks?

Overlap helps recover boundary content that would otherwise be split between chunks, at the cost of redundancy in retrieval. Typical defaults (10-15% of chunk size) are a starting point. For structured content where boundaries align with semantic units (parent-document retrieval over sections), overlap matters less. For fixed-size chunking it matters more.

Should I use the same embedding model for all my content?

Usually yes. Vectors from different embedding models live in different spaces, so mixing them in a single index makes similarity scores incomparable. The deeper choice is whether the embedding model fits your content type. Some embedding models are stronger on technical and legal language than others. For multi-content-type deployments, a single strong general-purpose embedding model is often the right answer; specialised embedding models can be tested but rarely justify the operational complexity.