NHT

Storing & Querying Embeddings in PostgreSQL

Add pgvector to an existing Postgres, store embeddings alongside your relational data, and run fast similarity queries — schema, index, and SQL included.

Nguyen Hoang TuanNguyen Hoang Tuan24 Jun 20268 min read

The RAG-from-scratch post stored embeddings in Postgres in two lines and moved on. This post slows down on that step, because it's the one a backend engineer actually owns: a vector is just a new column type, and "search by meaning" is just one operator in a SELECT. If you already run Postgres, you do not need a new database to start — you need pgvector, one schema, and one index.

We'll use the running example from this series: a documents table of GoGoDuk API-doc chunks and Vietnamese-address knowledge, each row carrying its text and a 1536-dimension embedding. Everything below is plain SQL you can run today.

Installing pgvector & defining a vector column

pgvector ships as a Postgres extension. On a managed Postgres (RDS, Cloud SQL, Supabase, Neon) it's usually already available — you just enable it. Self-hosted, install the package, then:

CREATE EXTENSION IF NOT EXISTS vector;

That gives you a new column type, vector(N), where N is the dimension of your embeddings — fixed by whichever model you use (e.g. 1536 for many small embedding models, 768 for others). The dimension is part of the type, so pick it once and stick with it:

ALTER TABLE documents ADD COLUMN embedding vector(1536);

That's the whole setup. No new service to deploy, secure, or back up — the vectors live in the same database, and the same backup, as your rows.

Schema design: embeddings next to your real data

The single most useful decision is to keep the embedding on the same row as the data it describes, not in a separate store. That's what makes a filtered semantic search one query instead of two systems talking over the network — the point pgvector vs a dedicated vector DB argues at length.

CREATE TABLE documents (
  id          bigserial PRIMARY KEY,
  source      text NOT NULL,          -- which doc / dataset this came from
  tenant_id   bigint NOT NULL,        -- your normal relational columns…
  published   boolean NOT NULL DEFAULT true,
  content     text NOT NULL,          -- the chunk text
  embedding   vector(1536),           -- …and the vector, right beside them
  created_at  timestamptz NOT NULL DEFAULT now()
);

tenant_id and published are ordinary columns. Because they sit on the same row as embedding, the query planner can reason about a relational filter and a vector search together — no fetching candidate IDs from one system and hydrating them from another.

Inserting embeddings (batching to avoid round-trips)

You generate the embedding in your service layer (one API call to your embedding model), then write it as a normal value. pgvector accepts a vector as a string literal like '[0.12, -0.03, ...]', so from most drivers you pass a JSON-stringified array and cast:

// One round trip for a whole batch — not one INSERT per row.
const rows = chunks.map((c, i) => ({ source, content: c, embedding: vectors[i] }));

await db.query(
  `INSERT INTO documents (source, content, embedding)
   SELECT * FROM UNNEST($1::text[], $2::text[], $3::vector[])`,
  [
    rows.map((r) => r.source),
    rows.map((r) => r.content),
    rows.map((r) => JSON.stringify(r.embedding)), // '[...]' per vector
  ],
);

The thing to avoid is a per-row INSERT in a loop: embedding 1,000 chunks is one model call you can batch, and the database write should be one multi-row statement too. Round-trips, not compute, are what make bulk loads slow.

Similarity queries: the distance operators explained

pgvector adds three distance operators. Each returns a number you can ORDER BY; smaller means closer:

  • <->L2 (Euclidean) distance. Straight-line distance between two points.
  • <=>cosine distance (1 - cosine similarity). Ignores magnitude, compares direction. The usual choice for text embeddings.
  • <#>negative inner product. Fastest; correct only when your vectors are normalized to unit length.

For text-embedding search, use <=> unless you have a specific reason not to. The query is exactly what you'd expect — embed the question in your app, pass the vector as $1, and ask for the nearest rows:

-- Top-5 chunks closest in meaning to the query embedding ($1)
SELECT id, content
FROM documents
ORDER BY embedding <=> $1
LIMIT 5;

That ORDER BY embedding <=> $1 is the entire "search by meaning." No new query language, no second database — a SELECT your team already knows how to read.

Filtering relational + vector in one query

This is the property a bolt-on vector store can't match cleanly. Because the vector lives beside your columns, you filter and rank in the same statement:

SELECT id, content
FROM documents
WHERE tenant_id = $2 AND published        -- relational filter
ORDER BY embedding <=> $1                  -- semantic ranking
LIMIT 5;

One query, one transaction, no consistency drift between a "vector system" and a "source-of-truth system." For a tenant's published docs ranked by meaning, that's the whole feature.

Adding an index and measuring the speedup

Without an index, that ORDER BY does an exact nearest-neighbour scan — it reads every row. Fine at 10k rows, unusable at millions. pgvector solves it with an approximate index (HNSW is the strong default), which trades a sliver of recall for a large speedup. Build it per distance operator you query with (here, cosine):

CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Recall/speed knob, set per-session or per-transaction:
SET hnsw.ef_search = 40;

Measure it, don't guess. Run EXPLAIN ANALYZE on the same query before and after:

EXPLAIN ANALYZE
SELECT id FROM documents ORDER BY embedding <=> $1 LIMIT 5;

The shape you'll see, on a corpus of ~1M 1536-dim vectors with the index hot in RAM:

| | Plan | Latency (p50) | | --- | --- | --- | | No index | Seq Scan + Sort (reads all rows) | hundreds of ms → seconds | | HNSW index | Index Scan | single-digit to low-tens of ms |

Treat the numbers as orders of magnitude — they move with m, ef_search, hardware, and whether the index fits in memory — but the jump is real: an exact scan grows linearly with rows, the index barely moves. Raise hnsw.ef_search for more recall at the cost of latency; that one knob is most of the tuning you'll do.

Gotchas: dimensions, normalization, NULL embeddings

A few things that bite in production:

  • Dimension is fixed by the model. A vector(1536) column rejects a 768-dim embedding at insert time. If you switch embedding models, you re-embed everything — there's no in-place reshape. Store the model name alongside so you know what a row was embedded with.
  • Normalization matters for <#>. Inner product (<#>) is only a correct ranking if vectors are unit-length. If you're unsure, use cosine (<=>), which normalizes for you. Mixing normalized and raw vectors in one column silently corrupts ranking.
  • NULL embeddings don't error — they just vanish. A row whose embedding is NULL won't match any nearest-neighbour query, so a half-finished backfill quietly drops rows from results. Guard with WHERE embedding IS NOT NULL, or make the column NOT NULL once backfill is done, and track coverage during ingestion.
  • Index builds compete with traffic. A large HNSW build on a live table is minutes-to-hours and contends with production. Build it off-peak (and CONCURRENTLY where you can).

That's the entire skill: a column, three operators, one index, and a SELECT your team already understands. With this in place, the retrieval step from the RAG pipeline is no longer a black box — and when keyword-exact matches start ranking poorly, the next move is to combine this with the PostgreSQL full-text search you may already run, the hybrid approach this series turns to next. The map for all of it is the series pillar.

Facing performance issues or scaling challenges?

I specialize in building low-latency map infrastructure, real-time streaming pipelines (Kafka, ClickHouse), and highly optimized backend systems. Let's work together to scale your product.

Let's Work Together

Written by

Nguyen Hoang Tuan

Nguyen Hoang Tuan

Full-stack developer focused on practical backend architecture, web performance, and production delivery.

Related Articles

23 Jun 2026

Building RAG From Scratch in Node.js

Build a working Retrieval-Augmented Generation pipeline in Node.js — ingest, embed, retrieve, prompt — no framework, just the moving parts explained with real code.