Building RAG From Scratch in Node.js

Build a working Retrieval-Augmented Generation pipeline in Node.js — ingest, embed, retrieve, prompt — no framework, just the moving parts explained with real code.

Nguyen Hoang Tuan•23 Jun 2026•10 min read

The pillar post of this series made the case that RAG is two HTTP calls wrapped around a SELECT. This post cashes that in: we build a working Retrieval-Augmented Generation pipeline in plain Node.js, with no LangChain, no framework, and no magic. Just the moving parts, each one a function you could have written yourself, so that when a framework hides one of them you'll know exactly what it's doing.

The running example is on-brand for this blog: a small Q&A assistant over GoGoDuk's API docs and Vietnamese-address knowledge base. "How do I geocode a Hanoi street address?" should retrieve the right doc chunk and answer from our facts, not the model's training data. Everything here is ~150 lines of TypeScript you can paste into a service.

What RAG is (and the problem it solves)

An LLM only knows what was in its training data. Ask it about your internal docs, last week's changelog, or a customer's order and it will either refuse or — worse — invent something plausible. RAG fixes this by retrieving the relevant facts first and pasting them into the prompt, so the model answers from text you handed it instead of from memory.

The trade you're making is concrete: instead of fine-tuning a model on your data (expensive, slow, stale the moment your data changes), you keep your data in a database and fetch the relevant slice per request. New docs are a row insert, not a retraining run. That's why RAG is a systems problem, not an ML one — and why a backend engineer is the right person to build it.

The pipeline: ingest → chunk → embed → store → retrieve → generate

The whole thing is a one-time ingest path and a per-request query path. Draw it once and the code writes itself:

INGEST (offline, once per document)
  doc ──▶ chunk ──▶ embed each chunk ──▶ store {text, embedding} in pgvector

QUERY (per request)
  question ──▶ embed ──▶ top-k vector search ──▶ build prompt ──▶ LLM ──▶ stream answer

Six verbs, two of which are external API calls (embed, generate) and one of which is a SELECT (retrieve). The rest is string handling. We'll build the ingest path first, then the query path.

Steps 1–3: chunk and embed documents

A model has a context window and you pay per token, so you don't embed a whole 40-page doc as one blob: a single vector would blur every topic in the document together, and retrieval would return the entire thing and bury the answer. You chunk it into passages of a few hundred tokens, with a little overlap so a sentence split across a boundary isn't lost.

A naive-but-honest chunker — split on paragraphs, pack up to a size budget, carry an overlap:

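// Packs paragraphs into chunks of roughly `size` characters (not tokens),
// carrying the last `overlap` characters of each chunk into the next one.
// Note: a single paragraph longer than `size` is not split, so a chunk can
// exceed the budget.
function chunk(text: string, size = 800, overlap = 120): string[] {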
  const paras = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let buf = "";
  for (const p of paras) {
    if (buf.length + p.length > size && buf) {
      chunks.push(buf);
      buf = buf.slice(Math.max(0, buf.length - overlap)); // carry overlap
    }
    buf += (buf ? "\n\n" : "") + p;
  }
  if (buf) chunks.push(buf);
  return chunks;
}

Chunking strategy is its own deep rabbit hole — fixed vs semantic vs recursive splitting changes recall measurably, and a later post in this series benchmarks them. For now, ~800 chars with ~120 overlap is a fine default for prose docs.
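One gap worth closing before real documents hit the chunker above: it never splits a single paragraph, so a 3,000-character paragraph becomes a 3,000-character chunk. A minimal sketch of one way to handle that, assuming a plain character window is acceptable; `splitOversized` is a hypothetical helper, not part of the original code:

```typescript
// Hypothetical helper: break a paragraph longer than `size` into pieces of at
// most `size` characters, preferring to cut at the last space or newline.
function splitOversized(para: string, size = 800): string[] {
  const pieces: string[] = [];
  let rest = para.trim();
  while (rest.length > size) {
    const head = rest.slice(0, size);
    // Last space/newline inside the window; -1 means there is none.
    const cut = Math.max(head.lastIndexOf(" "), head.lastIndexOf("\n"));
    // Fall back to a hard cut when the window has no whitespace at all.
    const end = cut > 0 ? cut : size;
    pieces.push(rest.slice(0, end).trim());
    rest = rest.slice(end).trim();
  }
  if (rest) pieces.push(rest);
  return pieces;
}
```

To use it, run each paragraph through the helper before packing, for example `paras.flatMap((p) => splitOversized(p, size))`, so the packing loop only ever sees paragraphs that fit the budget.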

Next, turn each chunk into an embedding — a fixed-length float array that captures meaning. One API call per batch of chunks:

// Assumes an OpenAI-style embeddings endpoint (request: { model, input };
// response: data[].embedding). The URL is a placeholder; use your provider's.
async function embed(inputs: string[]): Promise<number[][]> {
  const res = await fetch("https://api.example.com/v1/embeddings", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.EMBEDDINGS_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "text-embedding-3-small", input: inputs }),
  });
  if (!res.ok) throw new Error(`embed failed: ${res.status}`);
  const json = await res.json();
  return json.data.map((d: { embedding: number[] }) => d.embedding);
}

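Batch your inputs: embedding 64 chunks in one call is far faster than 64 separate calls, because you pay the network round trip and count against request rate limits once instead of 64 times. (With per-token billing, the token cost is the same either way.) text-embedding-3-small, the model used above, returns 1536-dim vectors by default; remember that number, because the database column depends on it.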
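For documents with more chunks than fit in one request, the batching can be made explicit. This is a sketch under assumptions: `toBatches` and `embedAll` are hypothetical helpers, the batch size of 64 is just the figure used above (providers set their own per-request caps), and `embedFn` stands in for whatever embeddings call you already have, such as the `embed()` function above.

```typescript
// Split an array into consecutive groups of at most `size` items.
function toBatches<T>(items: T[], size = 64): T[][] {
  if (size < 1) throw new RangeError("size must be >= 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Embed any number of chunks, one request per batch, preserving input order.
// Batches run sequentially here to stay friendly to rate limits.
async function embedAll(
  chunks: string[],
  embedFn: (inputs: string[]) => Promise<number[][]>,
  batchSize = 64,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    vectors.push(...(await embedFn(batch)));
  }
  return vectors;
}
```

Passing the embedding function in as a parameter keeps the helper testable with a fake, and lets you swap providers without touching the batching logic.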

Steps 4–5: store in pgvector and retrieve top-k

We store vectors next to their text in Postgres with pgvector. If you're wondering why Postgres and not a dedicated vector database, that's exactly the question "pgvector vs Dedicated Vector DB: When to Use Which" answers. For a corpus this size, one less moving part wins.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
  id        bigserial PRIMARY KEY,
  source    text NOT NULL,        -- which doc this came from
  content   text NOT NULL,        -- the chunk text we'll paste into the prompt
  embedding vector(1536) NOT NULL -- must match the model's dimension
);

CREATE INDEX ON doc_chunks
  USING hnsw (embedding vector_cosine_ops);

Insert is one row per chunk. Note the ::vector cast — pgvector takes the array as a string literal:

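// Assumes node-postgres (`pg`); DATABASE_URL is an assumed env variable name.
import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.DATABASE_URL });

async function store(source: string, chunks: string[]) {
  const vectors = await embed(chunks); // one embeddings call for the whole document
  for (let i = 0; i < chunks.length; i++) {
    // JSON.stringify yields "[0.1,0.2,...]", which pgvector parses as a vector literal.
    await db.query(
      `INSERT INTO doc_chunks (source, content, embedding) VALUES ($1, $2, $3::vector)`,
      [source, chunks[i], JSON.stringify(vectors[i])],
    );
  }
}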
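The loop above issues one INSERT per chunk, which is fine for a handful of documents but means one round trip per row. A possible refinement is a single multi-row INSERT. The sketch below only builds the statement; `ChunkRow`, `toVectorLiteral`, and `buildInsert` are hypothetical names, and the dimension check assumes the `vector(1536)` column from the schema above.

```typescript
type ChunkRow = { source: string; content: string; embedding: number[] };

// Serialize a vector in pgvector's text form, e.g. "[0.1,0.2]", after
// checking that it has the expected dimension and only finite numbers.
function toVectorLiteral(v: number[], dim: number): string {
  if (v.length !== dim) {
    throw new Error(`expected ${dim} dimensions, got ${v.length}`);
  }
  if (!v.every(Number.isFinite)) {
    throw new Error("embedding contains a non-finite value");
  }
  return JSON.stringify(v);
}

// Build one parameterized INSERT covering every row:
// ($1, $2, $3::vector), ($4, $5, $6::vector), ...
function buildInsert(rows: ChunkRow[], dim = 1536): { text: string; values: string[] } {
  if (rows.length === 0) throw new Error("no rows to insert");
  const values: string[] = [];
  const tuples = rows.map((row, i) => {
    values.push(row.source, row.content, toVectorLiteral(row.embedding, dim));
    const base = i * 3;
    return `($${base + 1}, $${base + 2}, $${base + 3}::vector)`;
  });
  return {
    text: `INSERT INTO doc_chunks (source, content, embedding) VALUES ${tuples.join(", ")}`,
    values,
  };
}
```

With a query client shaped like the one above, usage would be `const { text, values } = buildInsert(rows); await db.query(text, values);`. Postgres caps a single statement at 65,535 bind parameters, so with three parameters per row very large documents still need to be inserted in groups.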

Retrieval is the heart of RAG and it's a single query. Embed the question, then ask Postgres for the closest chunks by cosine distance (<=>):

async function retrieve(question: string, k = 5): Promise<string[]> {
  const [qVec] = await embed([question]);
  const { rows } = await db.query(
    `SELECT content
       FROM doc_chunks
       ORDER BY embedding <=> $1::vector
       LIMIT $2`,
    [JSON.stringify(qVec), k],
  );
  return rows.map((r) => r.content);
}

On a corpus of a few thousand chunks with the HNSW index, this SELECT returns in single-digit milliseconds — the embedding call in front of it (tens of ms) dominates retrieval latency, not the database. k = 5 is a sensible start: enough context to answer, few enough to keep the prompt cheap.
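To make the `<=>` ordering less of a black box, here is what cosine distance computes, written out as a brute-force reference. It is an illustration rather than something to run in production: `cosineDistance` and `topK` are hypothetical helpers, and the in-memory scan is exact where an HNSW index is approximate.

```typescript
// Cosine distance as used by pgvector's <=> operator: 1 - cosine similarity.
// 0 means same direction, 1 means orthogonal, 2 means opposite.
// A zero vector has no direction, so the result is NaN in that case.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact top-k by scanning every row: the result an approximate index
// is trying to reproduce.
function topK<T extends { embedding: number[] }>(query: number[], rows: T[], k = 5) {
  return rows
    .map((row) => ({ row, distance: cosineDistance(query, row.embedding) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}
```

On a small sample of chunks, comparing this exact ranking against what the database returns is a cheap way to check whether a surprising result comes from the embeddings themselves or from the index.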

One honest caveat: pure vector search retrieves by meaning, so an exact-keyword query (a product SKU, a function name) can rank surprisingly low. The fix is a hybrid: combine vector search with keyword search, such as the approach in "PostgreSQL full-text search for Vietnamese addresses". Hybrid search is its own post later in this series.

Building the prompt with retrieved context

Now assemble the prompt: a system instruction that pins the model to the supplied context, the retrieved chunks, and the question. The instruction to refuse when the context doesn't contain the answer is the single most important line for keeping the assistant honest:

function buildPrompt(question: string, contexts: string[]): string {
  const context = contexts.map((c, i) => `[${i + 1}] ${c}`).join("\n\n");
  return [
    "You are a GoGoDuk support assistant. Answer ONLY from the context below.",
    "If the context does not contain the answer, say you don't know. Cite sources like [1].",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

For five 800-char chunks this prompt runs roughly 1,000–1,500 tokens — your unit cost per question, and the number to watch as you raise k.
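The arithmetic behind that estimate can be sketched in code. This assumes the common rule of thumb of roughly 4 characters per token for English prose, which is a heuristic and not a tokenizer; `estimateTokens` and `fitToBudget` are hypothetical helpers.

```typescript
// Rough heuristic (an assumption, not a tokenizer): ~4 characters per token.
function estimateTokens(text: string, charsPerToken = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// Keep contexts in rank order until the next one would exceed the budget,
// so the best-ranked chunks are the ones that survive.
function fitToBudget(contexts: string[], maxTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const c of contexts) {
    const cost = estimateTokens(c);
    if (used + cost > maxTokens) break;
    kept.push(c);
    used += cost;
  }
  return kept;
}
```

Five 800-character chunks come to 4,000 characters, or about 1,000 tokens by this estimate, with the instructions and question on top. Non-English text such as Vietnamese often tokenizes less efficiently, so treat the provider's reported usage as the source of truth and this only as a pre-flight guard.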

Step 6: call the LLM and stream the answer

Send the prompt, stream the tokens back so the user sees output in ~hundreds of ms instead of waiting for the full answer. Wiring the streamed chunks all the way to the browser over SSE is its own topic later in this series; here's the server-side generation loop:

async function* answer(question: string) {
  const contexts = await retrieve(question);
  const prompt = buildPrompt(question, contexts);

  const res = await fetch("https://api.example.com/v1/messages", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.LLM_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-haiku-4-5",
      max_tokens: 512,
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  });

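  if (!res.ok || !res.body) throw new Error(`generate failed: ${res.status}`);

  // parseSSE is a small helper that turns the SSE byte stream into text deltas;
  // its details depend on your provider's event format.
  for await (const token of parseSSE(res.body)) {
    yield token; // forward to the caller / client as it arrives
  }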
}

That's the entire pipeline. A real question — "How do I geocode a Hanoi street address with GoGoDuk?" — flows through embed → top-k → prompt → stream, and comes back as a grounded answer citing [2] from the geocoding doc chunk, in well under a second to first token. Compared to asking the bare model, the difference is night and day: the bare model invents an endpoint; the RAG version quotes the real one.
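The `parseSSE` helper used in the generation loop is not defined in this post. A minimal sketch of the generic part follows, assuming the provider streams standard server-sent events (events separated by a blank line, payloads on `data:` lines). `drainEvents` and `sseData` are hypothetical names, and the sketch yields raw `data` payloads; mapping each payload to a token (for example, parsing JSON and reading a text delta field) is provider-specific and left out.

```typescript
// Split buffered SSE text into complete events. Returns the data payload of
// each complete event plus the unparsed remainder to keep for the next read.
function drainEvents(buffer: string): { payloads: string[]; rest: string } {
  const parts = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = parts.pop() ?? ""; // the last part may be an incomplete event
  const payloads: string[] = [];
  for (const event of parts) {
    const data = event
      .split("\n")
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).replace(/^ /, "")) // drop "data:" and one space
      .join("\n"); // multiple data lines in one event are joined with newlines
    if (data) payloads.push(data);
  }
  return { payloads, rest };
}

// Decode a byte stream incrementally and yield each event's data payload.
async function* sseData(body: AsyncIterable<Uint8Array>): AsyncGenerator<string> {
  const decoder = new TextDecoder();
  let buffer = "";
  for await (const bytes of body) {
    // stream: true keeps multi-byte characters intact across read boundaries.
    buffer += decoder.decode(bytes, { stream: true });
    const { payloads, rest } = drainEvents(buffer);
    buffer = rest;
    for (const payload of payloads) yield payload;
  }
}
```

Buffering matters because network reads do not align with event boundaries: one read can end in the middle of a `data:` line, and the remainder has to wait for the next read before it can be parsed.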

Common failure modes & how to debug them

Most RAG bugs are retrieval bugs wearing a generation costume. When the answer is wrong, check the retrieved chunks before you touch the prompt:

The answer isn't in the top-k. Log the retrieved content on every request. If the right chunk never shows up, the problem is upstream — chunks too big (the answer is diluted), k too small, or a keyword query that semantic search ranks poorly. Print the chunks first; it's almost always here.
Dimension mismatch. Embedding with a 1536-dim model into a vector(768) column throws at insert. The column dimension must equal the model's output, and re-embedding everything is the only fix if you switch models.
The model answers from training data anyway. Strengthen the "answer only from context, say you don't know otherwise" instruction. Without it, the model happily fills gaps from memory — the exact hallucination RAG exists to prevent.
Cost creeps up. Every question is an embedding call plus a generation call, billed per token. Track tokens, cost, and latency per request from day one. It is the same discipline the pillar post insists on, and a natural fit for the kind of usage analytics described in "real-time API analytics with ClickHouse".
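For the first failure mode, a per-request retrieval log is the cheapest tool. The sketch below formats one; it assumes the retrieval query is extended to also select `source` and `embedding <=> $1::vector AS distance` (the version above returns only `content`), and `Hit` and `retrievalReport` are hypothetical names.

```typescript
type Hit = { source: string; content: string; distance: number };

// One line per retrieved chunk: rank, cosine distance, source, short preview.
function retrievalReport(question: string, hits: Hit[], previewChars = 80): string {
  const lines = hits.map((h, i) => {
    // Collapse whitespace so each chunk stays on a single log line.
    const preview = h.content.replace(/\s+/g, " ").slice(0, previewChars);
    return `  [${i + 1}] d=${h.distance.toFixed(3)} ${h.source}: ${preview}`;
  });
  return [`q: ${question}`, ...lines].join("\n");
}
```

The rank numbers match the `[1]`, `[2]` citation labels that `buildPrompt` assigns, so a cited source in the answer can be traced straight back to a log line. Distances that are all close together, or a top hit from an unrelated source, point at chunking or the query rather than the prompt.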

You now have a RAG pipeline you fully understand, end to end — no framework between you and the moving parts. The next posts in this series go deeper on the pieces we glossed: chunking strategies that measurably lift recall, storing and indexing embeddings in Postgres properly, and hybrid keyword-plus-vector search. Bookmark the series map and build the next one with me.
