Chunking Strategies That Actually Improve RAG
Fixed, recursive, and semantic chunking compared on the same corpus — with the recall numbers that show why splitting documents well matters more than your embedding model.
The hybrid search post tuned how we retrieve chunks. But retrieval can only return chunks that exist — if a document is split so that the answer to a question is scattered across two pieces, or buried in a 2,000-token wall with nine unrelated paragraphs, no fusion formula saves you. Chunking is the step everyone skips and then blames the embedding model for. This post compares the three strategies that matter — fixed, recursive, semantic — on the same corpus, with the recall numbers that show how much it moves.
We keep the running example from this series: a documents table of GoGoDuk API-doc pages and Vietnamese-address knowledge, embedded into 1536-dim vectors for pgvector. The question is no longer "how do we store and search" — it's "what exactly goes into each row."
Why chunking decides RAG quality
A chunk is the unit of retrieval. You can get it wrong in two opposite ways:
- Too big. A 2,000-token chunk that contains the answer also contains a dozen other topics. Its embedding is an average of all of them, so it ranks lower for the specific question — the signal is diluted. And even when retrieved, you've spent context budget (and tokens, and latency) on noise the LLM has to read past.
- Too small. A 50-token chunk splits a procedure mid-sentence. The retriever finds "step 3" but step 3 references "the token from step 1," which lives in a different chunk that didn't get retrieved. The answer is technically in your corpus and still unreachable.
The job of a chunking strategy is to keep one coherent idea per chunk — big enough to stand alone, small enough that its embedding means one thing. The three strategies below are increasingly clever ways to find that boundary.
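The dilution effect can be seen in a toy sketch. This is an illustration under loud assumptions: bag-of-words counts stand in for a real embedding model, and the filler sentences in `diluted` are invented for the demo. The names `bagOfWords` and `cosineSim` are hypothetical helpers, not part of the pipeline in this series.

```typescript
// Toy illustration only: word counts stand in for a real embedding model.
type Vec = Map<string, number>;

function bagOfWords(text: string): Vec {
  const vec: Vec = new Map();
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    vec.set(word, (vec.get(word) ?? 0) + 1);
  }
  return vec;
}

function cosineSim(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((count, word) => {
    dot += count * (b.get(word) ?? 0);
    normA += count * count;
  });
  b.forEach((count) => {
    normB += count * count;
  });
  return normA > 0 && normB > 0 ? dot / Math.sqrt(normA * normB) : 0;
}

const query = bagOfWords("suggest endpoint predictions");

// One idea per chunk.
const focused = "The suggest endpoint returns predictions with a placeId.";
// The same sentence buried among unrelated (invented) topics.
const diluted =
  focused +
  " Billing runs monthly. Webhooks retry on failure." +
  " Rate limits reset hourly. Keys rotate every ninety days.";

const focusedScore = cosineSim(query, bagOfWords(focused));
const dilutedScore = cosineSim(query, bagOfWords(diluted));

// The focused chunk scores higher: the extra topics add length to the
// vector without adding any overlap with the query.
console.log({ focusedScore, dilutedScore });
```

Real embeddings are not word counts, but the same pressure applies: every unrelated paragraph in a chunk pulls its vector away from any single question.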
Fixed-size chunking: the baseline
Split every N characters (or tokens), ignore structure entirely. It's the default in most tutorials because it's trivial:
```ts
function fixedChunks(text: string, size = 1000, overlap = 150): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    // Otherwise the loop below would never advance.
    throw new RangeError("need size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  const step = size - overlap;
  for (let i = 0; i < text.length; i += step) {
    chunks.push(text.slice(i, i + size));
    if (i + size >= text.length) break; // window reached the end; skip an overlap-only tail
  }
  return chunks;
}
```

It works, sort of. The failure is obvious the moment you look at the output: it cuts mid-sentence, mid-code-block, mid-table. A GoGoDuk doc page that reads "...the /v1/suggest endpoint returns up to 7 predictions; each has a placeId..." gets sliced right after "each has a" because that's where character 1,000 landed. The embedding of that fragment is half a thought. Use it as a baseline to beat, not a destination.
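One rough way to put a number on the mid-sentence cuts described above is to count how many chunks end without sentence-final punctuation. This is a sketch, not a rigorous metric: `endsMidSentence` and `midSentenceRate` are hypothetical helpers, and the regex assumes English-style punctuation, so headings, code, and list items will read as "mid-sentence" too.

```typescript
// Heuristic: a chunk "ends cleanly" if its last non-space character is
// sentence-final punctuation, optionally followed by a closing quote or bracket.
function endsMidSentence(chunk: string): boolean {
  return !/[.!?]["')\]]?\s*$/.test(chunk);
}

// Fraction of chunks that appear to stop mid-sentence (0 for an empty list).
function midSentenceRate(chunks: string[]): number {
  if (chunks.length === 0) return 0;
  return chunks.filter(endsMidSentence).length / chunks.length;
}
```

Running this over the output of a fixed-size splitter and a structure-aware one on the same page gives a quick sanity check before any recall benchmark exists.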
Recursive chunking: split on structure first
The insight: documents already have boundaries — paragraphs, headings, list items, code fences. Recursive chunking tries to split on the largest natural separator first, and only falls back to a smaller one when a piece is still over the size limit:
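```ts
function recursiveChunks(text: string, max = 1000): string[] {
  // Coarsest to finest: H2 heading, paragraph, line, sentence, word.
  // max is measured in characters here.
  const separators = ["\n## ", "\n\n", "\n", ". ", " "];

  function split(input: string, depth: number): string[] {
    if (input.length <= max) return [input];
    if (depth >= separators.length) {
      // No separators left: hard-split so the recursion always terminates.
      const pieces: string[] = [];
      for (let i = 0; i < input.length; i += max) pieces.push(input.slice(i, i + max));
      return pieces;
    }
    const sep = separators[depth];
    const out: string[] = [];
    let buffer = "";
    for (const part of input.split(sep)) {
      const candidate = buffer ? buffer + sep + part : part;
      if (candidate.length <= max) {
        buffer = candidate;
        continue;
      }
      if (buffer) out.push(buffer);
      if (part.length > max) {
        // Part still too big -> recurse with the next finer separator.
        out.push(...split(part, depth + 1));
        buffer = "";
      } else {
        buffer = part; // fits on its own: start the next chunk with it
      }
    }
    if (buffer) out.push(buffer);
    return out;
  }

  return split(text, 0);
}
```

Now a chunk break lands between paragraphs or before a ## H2 heading, not mid-sentence. One caveat: the separator at a chunk boundary is dropped, so a chunk that starts at a heading loses its leading "## " marker. For structured technical docs this is the biggest single jump in quality, and it's cheap: pure string work, no model calls. It's the right default for most corpora.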
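The separator list in recursiveChunks does not know about code fences, so a long fenced block can still be split on a blank line inside it. One possible guard, sketched here under assumptions, is to separate fenced blocks from prose first and only run the recursive splitter on the prose. `splitOnFences` and `Segment` are hypothetical names, and the regex only handles plain triple-backtick fences.

```typescript
interface Segment {
  kind: "prose" | "code";
  text: string;
}

// Built at runtime so this example contains no literal fence of its own.
const FENCE = "`".repeat(3);

// Splits markdown into alternating prose and fenced-code segments.
// An unclosed fence never matches the pattern, so it stays inside prose.
function splitOnFences(markdown: string): Segment[] {
  const pattern = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");
  const segments: Segment[] = [];
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(markdown)) !== null) {
    const prose = markdown.slice(last, match.index);
    if (prose.trim()) segments.push({ kind: "prose", text: prose });
    segments.push({ kind: "code", text: match[0] });
    last = match.index + match[0].length;
  }
  const tail = markdown.slice(last);
  if (tail.trim()) segments.push({ kind: "prose", text: tail });
  return segments;
}
```

A caller would then keep each code segment as a single chunk and pass each prose segment's text to recursiveChunks. The tradeoff is that a very long code block becomes one oversized chunk, which may or may not be acceptable for your embedding model's input limit.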
Semantic chunking: split where meaning shifts
Recursive splits on format. Semantic chunking splits on meaning: embed each sentence, then start a new chunk wherever consecutive sentences stop being similar — a topic shift the formatting didn't mark.
```ts
// embed() -> number[]; cosine() -> similarity in [-1, 1] (text embeddings mostly land above 0).
async function semanticChunks(sentences: string[], threshold = 0.45): Promise<string[]> {
  if (sentences.length === 0) return [];
  // One embed() call per sentence; batch or rate-limit these for a large corpus.
  const vecs = await Promise.all(sentences.map((s) => embed(s)));
  const chunks: string[] = [];
  let current = [sentences[0]];
  for (let i = 1; i < sentences.length; i++) {
    if (cosine(vecs[i - 1], vecs[i]) < threshold) {
      chunks.push(current.join(" ")); // similarity dropped -> topic boundary
      current = [];
    }
    current.push(sentences[i]);
  }
  if (current.length) chunks.push(current.join(" "));
  return chunks;
}
```

This catches the case where one prose section drifts across two subjects with no heading between them. The cost is real, though: you embed every sentence up front (one extra embedding call per sentence at ingest), and threshold is a knob you have to tune per corpus. Set it too low and everything is one chunk; set it too high and you're back to per-sentence fragments. Reach for it when recursive chunks are still mixing topics, not before.
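The semanticChunks function above treats `cosine()` as given and uses a fixed threshold. The sketch below fills in a plain cosine similarity and shows one alternative way to choose the threshold: derive it from the corpus so that roughly a chosen fraction of adjacent sentence pairs become boundaries. This percentile approach is a swapped-in technique, not the method used in this post, and `thresholdAtPercentile` is a hypothetical helper.

```typescript
// Cosine similarity of two equal-length vectors; result is in [-1, 1].
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors have no direction
}

// Returns a threshold such that about `fraction` of adjacent-sentence
// similarities fall strictly below it (and so would start a new chunk).
function thresholdAtPercentile(vecs: number[][], fraction = 0.1): number {
  const sims: number[] = [];
  for (let i = 1; i < vecs.length; i++) sims.push(cosine(vecs[i - 1], vecs[i]));
  if (sims.length === 0) return 0;
  sims.sort((x, y) => x - y);
  const index = Math.min(sims.length - 1, Math.floor(fraction * sims.length));
  return sims[index];
}
```

The design choice here is to tune the boundary rate rather than the raw similarity value, which tends to transfer better between embedding models whose similarity scales differ. It still needs checking against real output on your corpus.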
Overlap and chunk size: the two dials
Two parameters cut across all three strategies:
- Overlap repeats the last ~10–15% of one chunk at the start of the next, so a fact sitting on a boundary survives whole in at least one chunk. It costs storage and a little redundancy in results, and it is cheap insurance against the "step 3 references step 1" failure when those steps sit in adjacent chunks. Zero overlap is a common silent recall killer.
- Chunk size is the diluted-vs-fragmented tradeoff itself. For 1536-dim general embeddings, 256–512 tokens is the sweet spot for most text — long enough to hold one idea, short enough that the vector stays specific. Code and tables want larger; chat logs want smaller.
These aren't independent of retrieval: smaller chunks mean you should retrieve a larger top-k to reassemble the answer, which feeds straight back into the k you tuned in the hybrid search step.
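Overlap can be layered on top of any splitter as a post-processing step. The sketch below is one way to do it, assuming character-based sizes (the benchmark in this post sizes chunks in tokens, so a tokenizer would replace the character arithmetic there). `withOverlap` is a hypothetical helper, not code from the pipeline.

```typescript
// Prefixes each chunk (except the first) with the tail of the previous chunk.
// `overlap` is a character count; resulting chunks can exceed the splitter's
// max size by up to `overlap` characters plus one joining space.
function withOverlap(chunks: string[], overlap = 64): string[] {
  if (overlap <= 0) return chunks.slice();
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk;
    const prev = chunks[i - 1];
    let tail = prev.slice(-overlap);
    // When the tail was cut out of a longer chunk, drop its first
    // (possibly partial) word so the overlap starts on a word boundary.
    const firstSpace = tail.indexOf(" ");
    if (prev.length > overlap && firstSpace !== -1) {
      tail = tail.slice(firstSpace + 1);
    }
    return tail + " " + chunk;
  });
}
```

For example, `withOverlap(recursiveChunks(page, 1000), 120)` would give roughly 12% overlap on 1,000-character chunks. Note that the overlap is taken from the original chunks, not the already-prefixed ones, so it does not compound down the list.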
Benchmark: recall by strategy
Same GoGoDuk doc/address corpus (~1M chunks after splitting, 1536-dim embeddings, HNSW index hot in RAM), same hand-labeled 200-query set from the hybrid post, retrieval held constant at hybrid RRF top-k=10. Only the chunking changed. The metric is recall@10 (did a chunk containing the answer land in the top 10), reported alongside median tokens per chunk and ingest cost:
| Strategy | Recall@10 | Median tokens / chunk | Ingest cost |
| --- | --- | --- | --- |
| Fixed (1000c, no overlap) | 0.71 | ~240 | baseline |
| Fixed (1000c, 150 overlap) | 0.79 | ~240 | +15% storage |
| Recursive (max 512t, 64 overlap) | 0.90 | ~180 | +15% storage |
| Semantic (threshold 0.45) | 0.92 | ~150 | +1 embed / sentence |
The jump that matters is fixed → recursive: +0.11 recall over fixed-with-overlap (+0.19 over plain fixed) for zero extra model cost, just from splitting on structure instead of character count. Adding overlap to plain fixed bought +0.08 on its own, confirming that boundary loss accounted for a real share of the misses. Semantic edged recursive by +0.02 but paid for it with an embedding call per sentence at ingest. On this corpus that is not worth it; on messier prose with few headings, it can be. Treat the absolute numbers as directional (they shift with corpus and embedding model), but the ranking (recursive ≫ fixed, semantic ≈ recursive at much higher ingest cost) reproduces consistently.
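For readers who want to reproduce this kind of comparison, here is a minimal recall@k sketch. It rests on an assumption the post does not spell out: because chunk ids change whenever the chunking changes, each labeled query carries a short answer snippet, and a hit means some retrieved chunk's text contains that snippet. `LabeledQuery`, `Retriever`, and `recallAtK` are hypothetical names, and `retrieve` stands in for whatever search function your pipeline exposes.

```typescript
interface LabeledQuery {
  query: string;
  answerSnippet: string; // short verbatim text from the source document
}

// Assumed shape: returns the text of the top-k chunks, best first.
type Retriever = (query: string, k: number) => Promise<string[]>;

// Fraction of queries for which at least one of the top-k chunks
// contains the labeled answer snippet.
async function recallAtK(
  queries: LabeledQuery[],
  retrieve: Retriever,
  k = 10,
): Promise<number> {
  if (queries.length === 0) return 0;
  let hits = 0;
  for (const q of queries) {
    const topK = (await retrieve(q.query, k)).slice(0, k);
    const found = topK.some((text) => text.indexOf(q.answerSnippet) !== -1);
    if (found) hits++;
  }
  return hits / queries.length;
}
```

Substring matching is strict: a snippet split across two chunks counts as a miss, which is the behavior you want when measuring boundary loss. Keep snippets short enough to fit inside the smallest chunk size you plan to test.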
What to actually do
- Start recursive, 512 tokens, ~12% overlap. It's the best recall-per-effort point and needs no model calls. This single change is usually a bigger win than swapping embedding models.
- Add semantic chunking only when you measure recursive chunks still mixing topics — and only on the sections that need it, not the whole corpus.
- Never ship zero overlap. It's the cheapest recall you'll ever buy.
- Measure, don't guess. Build the 200-query labeled set once; every chunking change becomes a number, not a vibe — exactly the discipline the RAG-from-scratch pipeline needs and the upcoming evaluation post formalizes.
Chunking is the highest-leverage, lowest-glamour step in RAG: no new dependency, no GPU, just splitting text where ideas actually end. Get it right and retrieval has good material to find; get it wrong and the fanciest reranker is polishing fragments. Next in the series we push retrieval quality further with reranking, and put real numbers on "is this RAG system any good." The full map is the AI for Backend Engineers pillar.