Building RAG From Scratch in Node.js
Build a working Retrieval-Augmented Generation pipeline in Node.js — ingest, embed, retrieve, prompt — no framework, just the moving parts explained with real code.
The pillar of this series made the case that RAG is two HTTP calls wrapped around a SELECT. This post cashes that in: we build a working Retrieval-Augmented Generation pipeline in plain Node.js — no LangChain, no framework, no magic. Just the moving parts, each one a function you could have written yourself, so that when a framework does hide one of them you'll know exactly what it's doing.
The running example is on-brand for this blog: a small Q&A assistant over GoGoDuk's API docs and Vietnamese-address knowledge base. "How do I geocode a Hanoi street address?" should retrieve the right doc chunk and answer from our facts, not the model's training data. Everything here is ~150 lines of TypeScript you can paste into a service.
What RAG is (and the problem it solves)
An LLM only knows what was in its training data. Ask it about your internal docs, last week's changelog, or a customer's order and it will either refuse or — worse — invent something plausible. RAG fixes this by retrieving the relevant facts first and pasting them into the prompt, so the model answers from text you handed it instead of from memory.
The trade you're making is concrete: instead of fine-tuning a model on your data (expensive, slow, stale the moment your data changes), you keep your data in a database and fetch the relevant slice per request. New docs are a row insert, not a retraining run. That's why RAG is a systems problem, not an ML one — and why a backend engineer is the right person to build it.
The pipeline: ingest → chunk → embed → store → retrieve → generate
The whole thing is a one-time ingest path and a per-request query path. Draw it once and the code writes itself:
INGEST (offline, once per document)
  doc ──▶ chunk ──▶ embed each chunk ──▶ store {text, embedding} in pgvector
QUERY (per request)
  question ──▶ embed ──▶ top-k vector search ──▶ build prompt ──▶ LLM ──▶ stream answer
Six verbs, two of which are external API calls (embed, generate) and one of which is a SELECT (retrieve). The rest is string handling. We'll build the ingest path first, then the query path.
Steps 1–3: chunk and embed documents
A model has a context window and you pay per token, so you don't embed a whole 40-page doc as one blob — retrieval would return the entire thing and bury the answer. You chunk it into passages of a few hundred tokens, with a little overlap so a sentence split across a boundary isn't lost.
A naive-but-honest chunker — split on paragraphs, pack up to a size budget, carry an overlap:
function chunk(text: string, size = 800, overlap = 120): string[] {
  const paras = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let buf = "";
  for (const p of paras) {
    if (buf.length + p.length > size && buf) {
      chunks.push(buf);
      buf = buf.slice(Math.max(0, buf.length - overlap)); // carry overlap
    }
    buf += (buf ? "\n\n" : "") + p;
  }
  if (buf) chunks.push(buf);
  return chunks;
}
Chunking strategy is its own deep rabbit hole: fixed vs semantic vs recursive splitting changes recall measurably, and a later post in this series benchmarks them. For now, ~800 chars with ~120 overlap is a fine default for prose docs.
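One gap worth knowing about in the chunker above: a single paragraph longer than the size budget is never split, so it ends up as one oversized chunk. A minimal sketch of a pre-pass that closes the gap, assuming a hard character cut is acceptable. The name splitLong is hypothetical and not part of the original code.

```typescript
// Hypothetical helper: hard-split a paragraph into pieces of at most `size`
// characters so that no single piece exceeds the chunker's budget.
// Assumption: cutting at arbitrary character positions (possibly mid-word)
// is acceptable; a sentence-aware splitter would be gentler.
function splitLong(paragraph: string, size = 800): string[] {
  if (size <= 0) throw new Error("size must be positive");
  const pieces: string[] = [];
  for (let start = 0; start < paragraph.length; start += size) {
    pieces.push(paragraph.slice(start, start + size));
  }
  return pieces;
}
```

Inside chunk(), each trimmed paragraph could be passed through splitLong before packing, so the size budget holds even for a wall-of-text paragraph.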
Next, turn each chunk into an embedding — a fixed-length float array that captures meaning. One API call per batch of chunks:
// Provider-neutral: any embeddings endpoint returns number[] per input.
async function embed(inputs: string[]): Promise<number[][]> {
  const res = await fetch("https://api.example.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.EMBEDDINGS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: inputs }),
  });
  if (!res.ok) throw new Error(`embed failed: ${res.status}`);
  const json = await res.json();
  return json.data.map((d: { embedding: number[] }) => d.embedding);
}
Batch your inputs: embedding 64 chunks in one call is far cheaper and faster than 64 calls. A typical small embedding model returns 1536-dim vectors; remember that number, because the database column depends on it.
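The batching advice can be sketched as a small wrapper. This is an illustration under assumptions: toBatches and embedAll are hypothetical names, the batch size of 64 is just the figure used in the text (real per-request limits vary by provider), and embedFn stands in for the embed() function above.

```typescript
// Hypothetical helper: split a list into consecutive groups of at most `size` items.
function toBatches<T>(items: T[], size = 64): T[][] {
  if (size <= 0) throw new Error("size must be positive");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed any number of chunks with one request per batch, preserving input order.
// `embedFn` is assumed to behave like embed(): one vector per input, same order.
async function embedAll(
  chunks: string[],
  embedFn: (inputs: string[]) => Promise<number[][]>,
  batchSize = 64,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    vectors.push(...(await embedFn(batch)));
  }
  return vectors;
}
```

Batches run sequentially here, which keeps rate-limit behavior predictable; running them concurrently is possible but needs a concurrency cap.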
Steps 4–5: store in pgvector and retrieve top-k
We store vectors next to their text in Postgres with pgvector. If you're wondering why Postgres and not a dedicated vector database, that's exactly the question the companion post on pgvector vs a dedicated vector DB answers. For a corpus this size, one less moving part wins.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE doc_chunks (
  id        bigserial PRIMARY KEY,
  source    text NOT NULL,          -- which doc this came from
  content   text NOT NULL,          -- the chunk text we'll paste into the prompt
  embedding vector(1536) NOT NULL   -- must match the model's dimension
);
CREATE INDEX ON doc_chunks
  USING hnsw (embedding vector_cosine_ops);
Insert is one row per chunk. Note the ::vector cast, needed because pgvector takes the array as a string literal:
// Assumes `db` is a Postgres client exposing query(text, params),
// for example a node-postgres Pool created elsewhere in the service.
async function store(source: string, chunks: string[]) {
  const vectors = await embed(chunks);
  for (let i = 0; i < chunks.length; i++) {
    await db.query(
      `INSERT INTO doc_chunks (source, content, embedding) VALUES ($1, $2, $3::vector)`,
      [source, chunks[i], JSON.stringify(vectors[i])],
    );
  }
}
Retrieval is the heart of RAG and it's a single query. Embed the question, then ask Postgres for the closest chunks by cosine distance (<=>):
async function retrieve(question: string, k = 5): Promise<string[]> {
  const [qVec] = await embed([question]);
  const { rows } = await db.query(
    `SELECT content
       FROM doc_chunks
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [JSON.stringify(qVec), k],
  );
  return rows.map((r) => r.content);
}
On a corpus of a few thousand chunks with the HNSW index, this SELECT returns in single-digit milliseconds; the embedding call in front of it (tens of ms) dominates retrieval latency, not the database. k = 5 is a sensible start: enough context to answer, few enough to keep the prompt cheap.
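For intuition about what the <=> ordering means: the operator returns cosine distance, which is 1 minus cosine similarity, so smaller values are closer and an ascending ORDER BY puts the best matches first. The in-memory sketch below is illustrative only; Postgres computes this inside the index, and HNSW returns approximate rather than exact nearest neighbors. The function name cosineDistance is hypothetical.

```typescript
// Cosine distance between two equal-length vectors: 1 - cos(angle between them).
// 0 means same direction, 1 means orthogonal, 2 means opposite direction.
// Note: a zero-length (all zeros) vector makes the result NaN.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because only the angle matters, vector length has no effect: a chunk embedding scaled by any positive factor ranks the same.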
One honest caveat: pure vector search retrieves by meaning, so an exact-keyword query (a product SKU, a function name) can rank surprisingly low. The fix is to combine it with the keyword search from PostgreSQL full-text search for Vietnamese addresses — a hybrid of both — which is its own post later in this series.
Building the prompt with retrieved context
Now assemble the prompt: a system instruction that pins the model to the supplied context, the retrieved chunks, and the question. The instruction to refuse when the context doesn't contain the answer is the single most important line for keeping the assistant honest:
function buildPrompt(question: string, contexts: string[]): string {
  const context = contexts.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    "You are a GoGoDuk support assistant. Answer ONLY from the context below.",
    "If the context does not contain the answer, say you don't know. Cite sources like [1].",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
For five 800-char chunks this prompt runs roughly 1,000–1,500 tokens. That is your unit cost per question, and the number to watch as you raise k.
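The token figure above can be sanity-checked with a back-of-the-envelope estimator. This sketch assumes roughly 4 characters per token, a common rule of thumb for English prose; real tokenizers vary by model, and non-English text such as Vietnamese may use more tokens per character. The names estimateTokens and maxChunksForBudget, and the 100-token overhead default, are hypothetical.

```typescript
// Rough token estimate, assuming ~4 characters per token (heuristic, not a tokenizer).
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// Hypothetical budget check: how many context chunks of `chunkChars` characters
// fit under `budgetTokens`, after reserving `overheadTokens` for the
// instructions and the question.
function maxChunksForBudget(
  budgetTokens: number,
  chunkChars = 800,
  overheadTokens = 100,
  charsPerToken = 4,
): number {
  const tokensPerChunk = Math.ceil(chunkChars / charsPerToken);
  return Math.max(0, Math.floor((budgetTokens - overheadTokens) / tokensPerChunk));
}
```

Under these assumptions an 800-char chunk is about 200 tokens, so five chunks plus instructions land near the low end of the range quoted above. For billing, use the token counts the provider reports rather than this estimate.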
Step 6: call the LLM and stream the answer
Send the prompt and stream the tokens back, so the user sees output within a few hundred milliseconds instead of waiting for the full answer. Wiring the streamed chunks all the way to the browser over SSE is its own topic later in this series; here's the server-side generation loop:
async function* answer(question: string) {
  const contexts = await retrieve(question);
  const prompt = buildPrompt(question, contexts);
  const res = await fetch("https://api.example.com/v1/messages", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.LLM_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5",
      max_tokens: 512,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  // Fail fast on HTTP errors instead of parsing an error body as a stream.
  if (!res.ok) throw new Error(`generate failed: ${res.status}`);
  // parseSSE is assumed to be a helper that yields text tokens from the SSE response body.
  for await (const token of parseSSE(res.body)) {
    yield token; // forward to the caller / client as it arrives
  }
}
That's the entire pipeline. A real question, "How do I geocode a Hanoi street address with GoGoDuk?", flows through embed → top-k → prompt → stream and comes back as a grounded answer citing [2] from the geocoding doc chunk, in well under a second to first token. Compared to asking the bare model, the difference is night and day: the bare model invents an endpoint; the RAG version quotes the real one.
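The generation loop relies on a parseSSE helper that the post does not define. One possible shape is sketched below, assuming the endpoint speaks standard server-sent events (events separated by a blank line, payload on data: lines). Both splitSSE and this parseSSE body are hypothetical. The sketch yields each event's raw data payload; pulling the token text out of that payload depends on the provider's JSON format and is deliberately left out.

```typescript
// Hypothetical helper: extract complete SSE events from a text buffer.
// Returns the `data:` payload of each complete event plus the unparsed remainder.
function splitSSE(buffer: string): { payloads: string[]; rest: string } {
  const normalized = buffer.replace(/\r\n/g, "\n");
  const parts = normalized.split("\n\n");
  const rest = parts.pop() ?? ""; // last piece may be an incomplete event
  const payloads: string[] = [];
  for (const part of parts) {
    const data = part
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).replace(/^ /, "")) // drop "data:" and one optional space
      .join("\n");
    if (data) payloads.push(data); // comment-only events (": ping") carry no data
  }
  return { payloads, rest };
}

// Hypothetical parseSSE: read the response body incrementally and yield each
// event's raw data payload as soon as the event is complete.
async function* parseSSE(body: ReadableStream<Uint8Array> | null): AsyncGenerator<string> {
  if (!body) return;
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const { payloads, rest } = splitSSE(buffer);
    buffer = rest;
    for (const payload of payloads) yield payload;
  }
}
```

Keeping the buffer-splitting step as a pure function makes the tricky part (events cut in half by network chunk boundaries) easy to unit test without a live stream.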
Common failure modes & how to debug them
Most RAG bugs are retrieval bugs wearing a generation costume. When the answer is wrong, check the retrieved chunks before you touch the prompt:
- The answer isn't in the top-k. Log the retrieved content on every request. If the right chunk never shows up, the problem is upstream: chunks too big (the answer is diluted), k too small, or a keyword query that semantic search ranks poorly. Print the chunks first; it's almost always here.
- Dimension mismatch. Embedding with a 1536-dim model into a vector(768) column throws at insert. The column dimension must equal the model's output, and re-embedding everything is the only fix if you switch models.
- The model answers from training data anyway. Strengthen the "answer only from context, say you don't know otherwise" instruction. Without it, the model happily fills gaps from memory, which is the exact hallucination RAG exists to prevent.
- Cost creeps up. Every question is an embedding call plus a generation call, billed per token. Track tokens, cost, and latency per request from day one. It's the same discipline the pillar insists on, and a natural fit for the kind of usage analytics in real-time API analytics with ClickHouse.
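The dimension-mismatch failure in the list above can be caught in application code before the INSERT ever reaches Postgres. A minimal sketch, assuming the 1536-dimension column from the schema in this post; toVectorLiteral and EMBEDDING_DIM are hypothetical names.

```typescript
const EMBEDDING_DIM = 1536; // assumption: must match the vector(1536) column

// Hypothetical guard: validate a vector, then serialize it to the text form
// that is passed alongside the ::vector cast, e.g. "[0.1,0.2,0.3]".
function toVectorLiteral(vec: number[], dim = EMBEDDING_DIM): string {
  if (vec.length !== dim) {
    throw new Error(`expected ${dim} dimensions, got ${vec.length}`);
  }
  // JSON.stringify would silently turn NaN or Infinity into null, so reject them here.
  if (!vec.every((x) => Number.isFinite(x))) {
    throw new Error("vector contains NaN or Infinity");
  }
  return JSON.stringify(vec);
}
```

Using a helper like this in place of the bare JSON.stringify calls in store() and retrieve() turns a database error into an application error that names the expected and actual dimensions.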
You now have a RAG pipeline you fully understand, end to end — no framework between you and the moving parts. The next posts in this series go deeper on the pieces we glossed: chunking strategies that measurably lift recall, storing and indexing embeddings in Postgres properly, and hybrid keyword-plus-vector search. Bookmark the series map and build the next one with me.