
AI for Backend Engineers: A Practical Field Guide

A practical guide to AI for backend engineers: embeddings, vector search, RAG, and LLM APIs — what they are, where they fit, and the cost and latency numbers to track.

Nguyen Hoang Tuan · 21 Jun 2026 · 10 min read

Most "AI for developers" content is written for people building chatbots. This guide is for the rest of us — the backend engineers who already run Postgres, Redis, and a few services in production, and who now have to add AI features without setting the existing system on fire.

The good news: AI for backend engineers is mostly a data and systems problem, not a machine-learning problem. You do not need to train a model. You need to move text in and out of a few APIs, store some vectors, and keep an eye on cost and latency — all things a backend engineer already knows how to do. This post maps the terrain so the rest of this series can go deep on each piece.

Why backend engineers should care about AI now

The skills that matter for shipping AI features in 2026 are not the skills of an ML researcher. They are the skills of a backend engineer: API integration, schema design, caching, rate limiting, retries, and observability.

A typical "AI feature" — search that understands meaning, a support assistant grounded in your docs, automatic tagging or extraction — breaks down into ordinary backend work:

  • Call an external API and handle its failure modes.
  • Store and query a new column type (vectors).
  • Cache expensive responses.
  • Limit and budget usage so a loop does not cost you $400 overnight.
  • Trace requests so you can debug them.

If you own the backend, you already own most of the AI stack. The model is just one more dependency behind an HTTP call.

The four building blocks

Almost every practical AI backend feature is built from four primitives. Learn these and the buzzwords stop being scary.

  • Embeddings — a function that turns text (or an image) into a fixed-length array of floats that captures meaning. "Quán cà phê gần đây" and "coffee shop nearby" land close together in this vector space even though they share no words.
  • Vector search — finding the rows whose embeddings are closest to a query embedding. This is how you do "search by meaning" instead of "search by keyword."
  • RAG (Retrieval-Augmented Generation) — retrieve relevant chunks of your own data with vector search, then paste them into the prompt so the model answers from your facts instead of its training data.
  • LLM API calls — sending a prompt to a large language model and getting text (or structured JSON) back. This is the generation step, and the one with the real cost.

That is the whole vocabulary. Embeddings + vector search give you semantic retrieval; add an LLM call on top and you have RAG.
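To make the first two primitives concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are invented toy values, not output from a real embedding model (real embeddings have hundreds or thousands of dimensions), and `top_k` is a hypothetical helper that does a brute-force scan, which is the work a vector index speeds up at scale.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], rows: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force vector search: ids of the k rows closest to the query."""
    ranked = sorted(rows, key=lambda row_id: cosine_distance(query, rows[row_id]))
    return ranked[:k]


# Toy embeddings (invented values for illustration only).
rows = {
    "coffee shop nearby": [0.9, 0.1, 0.0],
    "quán cà phê gần đây": [0.8, 0.2, 0.1],
    "postgres connection pool": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]
nearest = top_k(query, rows, k=2)
```

With these toy values the two coffee-shop phrases rank ahead of the unrelated one, even though they share no words. In a real system the vectors come from an embedding API and the scan is replaced by an indexed query.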

Where AI fits in a typical backend

Nothing about your architecture changes. AI features slot into the request/response flow you already have. A RAG request looks like this:

client ──▶ API service
              │  1. embed the user's question (LLM/embedding API)
              │  2. vector search in Postgres (pgvector) → top-k chunks
              │  3. build prompt = system + retrieved chunks + question
              │  4. call LLM API → answer (optionally streamed)
              │  5. log tokens, cost, latency
              ▼
            client

Steps 1 and 4 are external API calls. Steps 2 and 5 are plain database work. The "AI" is two HTTP calls wrapped around a SELECT. Everything you know about timeouts, connection pools, and idempotency still applies — and matters more, because external model APIs are slower and flakier than your own services.
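The five steps can be sketched as one function with its dependencies injected. This is an illustration under stated assumptions, not a reference implementation: `embed`, `search`, `generate`, and `log` are hypothetical callables standing in for your embedding client, pgvector query, LLM client, and metrics sink, and the system prompt text is a placeholder. The logged record here carries only latency and chunk count; a real one would also include token counts and cost.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RagResult:
    answer: str
    chunks_used: int
    latency_ms: float


def build_prompt(system: str, chunks: list[str], question: str) -> str:
    """Step 3: system instructions + retrieved context + the user's question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"


def answer_question(
    question: str,
    embed: Callable[[str], list[float]],              # step 1: embedding API call
    search: Callable[[list[float], int], list[str]],  # step 2: vector search
    generate: Callable[[str], str],                   # step 4: LLM API call
    log: Callable[[RagResult], None],                 # step 5: metrics sink
    top_k: int = 5,
) -> RagResult:
    started = time.perf_counter()
    query_embedding = embed(question)
    chunks = search(query_embedding, top_k)
    prompt = build_prompt("Answer only from the context.", chunks, question)
    answer = generate(prompt)
    elapsed_ms = (time.perf_counter() - started) * 1000
    result = RagResult(answer, len(chunks), elapsed_ms)
    log(result)
    return result


# Stub dependencies so the sketch runs without a network or a database.
logged: list[RagResult] = []
result = answer_question(
    "What is pgvector?",
    embed=lambda text: [0.1, 0.2, 0.3],
    search=lambda vec, k: ["pgvector is a Postgres extension for vectors."][:k],
    generate=lambda prompt: f"(stub answer, prompt was {len(prompt)} chars)",
    log=logged.append,
)
```

Passing the external calls in as parameters keeps timeouts, retries, and test doubles outside the flow itself, which is the same separation you would use for any other third-party dependency.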

The minimal stack: Postgres + pgvector + one LLM API

You do not need a new platform to start. The smallest stack that ships a real feature is the database you already run plus one extension and one API key.

Add pgvector to Postgres and you can store embeddings next to your relational data and query them with SQL:

CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE documents ADD COLUMN embedding vector(1536);

-- Find the 5 chunks closest to a query embedding (cosine distance)
SELECT id, content
FROM documents
ORDER BY embedding <=> $1   -- $1 = the query's embedding
LIMIT 5;

That <=> operator is cosine distance. One column, one operator, and your existing Postgres is now a vector database. Generating the embedding is a single API call from your service layer — the same pattern as any third-party integration you already maintain.
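On the service side, the query embedding has to reach Postgres as a `vector` value. One way to do that, sketched below, is to send it in pgvector's text form (`[0.1,0.2,...]`) and cast it in SQL. The helper name `to_pgvector_literal` is hypothetical, the placeholder syntax (`$1` here) depends on your driver, and some drivers have adapters that make the manual formatting unnecessary.

```python
import math

EMBEDDING_DIM = 1536  # must match the vector(1536) column definition


def to_pgvector_literal(embedding: list[float], dim: int = EMBEDDING_DIM) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} dimensions, got {len(embedding)}")
    if not all(math.isfinite(x) for x in embedding):
        # Catch bad values in the service instead of at the database.
        raise ValueError("embedding contains NaN or infinity")
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# The same query as above, with an explicit cast on the parameter.
NEAREST_CHUNKS_SQL = """
SELECT id, content
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 5
"""

literal = to_pgvector_literal([0.1, 0.2, 0.3], dim=3)
```

Checking the dimension in the service turns a model or configuration mismatch into a clear error before the query runs.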

Why you probably don't need a dedicated vector DB yet

It is tempting to reach for Pinecone, Qdrant, or Weaviate on day one. For most products, that is premature. A dedicated vector database is another service to deploy, secure, back up, and keep in sync with your source of truth.

pgvector keeps everything in one place: your vectors live in the same transaction as your rows, so there is no dual-write problem and no consistency drift. Up to a few million vectors with a sensible index, it is fast enough — and when you do outgrow it, you will know exactly why. Start on pgvector; graduate later when the numbers force you to. (A future post in this series benchmarks the crossover point.)

Cost and latency: the numbers to track from day one

This is where backend instincts pay off. LLM calls are billed per token and are typically orders of magnitude slower than an indexed database query: hundreds of milliseconds to several seconds, against a few milliseconds. If you do not measure them, your model-provider bill and your p95 latency will both surprise you.

Track these from the first deploy:

| Metric | Why it matters | Rough order of magnitude |
| --- | --- | --- |
| Tokens per request (in + out) | Tokens are money; this is your unit cost | hundreds to a few thousand |
| Cost per request | tokens × price; multiply by traffic | fractions of a cent to a few cents |
| Embedding latency | Runs on every search query | tens of ms |
| LLM generation latency (p95) | Dominates total response time | hundreds of ms to several seconds |
| Cache hit rate | Every hit is a request you didn't pay for | aim high for repeated queries |
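The cost rows of the table reduce to simple arithmetic, sketched below. The prices are placeholders chosen for round numbers, not any provider's real rates, and `ModelPrice`, `request_cost_usd`, and `monthly_cost_usd` are hypothetical names. The sketch also assumes a cache hit costs nothing, which ignores the cost of running the cache itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """USD per one million tokens (placeholder values are used below)."""
    input_per_million: float
    output_per_million: float


def request_cost_usd(input_tokens: int, output_tokens: int, price: ModelPrice) -> float:
    """Cost per request = tokens x price, with input and output priced separately."""
    total = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    )
    return total / 1_000_000


def monthly_cost_usd(
    cost_per_request: float,
    requests_per_day: int,
    cache_hit_rate: float = 0.0,
    days: int = 30,
) -> float:
    """Only cache misses reach the paid API; hits are treated as free."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    paid_requests = requests_per_day * days * (1.0 - cache_hit_rate)
    return cost_per_request * paid_requests


price = ModelPrice(input_per_million=1.00, output_per_million=4.00)  # hypothetical
per_request = request_cost_usd(input_tokens=1500, output_tokens=300, price=price)
monthly = monthly_cost_usd(per_request, requests_per_day=10_000, cache_hit_rate=0.4)
```

With these made-up numbers a request costs $0.0027, about a quarter of a cent, and 10,000 requests a day at a 40% cache hit rate comes to $486 a month. The per-request figure looks negligible until it is multiplied by traffic, which is why it belongs on a dashboard from the first deploy.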

The practical takeaways follow directly: cache aggressively (identical and near-identical prompts), stream long answers so the user sees output sooner, use a smaller, cheaper model for simple tasks, and put a hard budget cap and a per-user rate limit in front of every LLM call. This is the same discipline you already apply to any expensive external dependency, as covered in Redis Lua Script & SETNX: High-Performance Rate Limiting & Quota Alerting for APIs. And because every call produces token, cost, and latency figures, those events are a natural fit for the kind of analytics pipeline described in Real-Time API Usage Analytics & Billing with ClickHouse and Redis Streams.
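Two of those takeaways, caching and a hard budget cap, fit in a small wrapper. The sketch below is deliberately simplified and rests on several assumptions: `GuardedLLM` and `call_llm` are hypothetical names, the cache is a process-local dict (a shared store such as Redis with a TTL is the more likely production choice), every call is charged a flat estimated cost instead of its real token cost, and the budget never resets. Lowercasing the prompt to build the cache key is only safe when case does not change the meaning of your prompts.

```python
import hashlib
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised instead of making a paid call once the budget is used up."""


class GuardedLLM:
    def __init__(self, call_llm: Callable[[str], str],
                 cost_per_call_usd: float, budget_usd: float):
        self._call_llm = call_llm        # hypothetical client: prompt -> answer
        self._cost = cost_per_call_usd   # flat estimate; real cost varies by tokens
        self._budget = budget_usd
        self._spent = 0.0
        self._cache: dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse case and whitespace so trivially different prompts share an entry.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        if self._spent + self._cost > self._budget:
            raise BudgetExceeded(f"budget of ${self._budget:.2f} reached")
        self.misses += 1
        answer = self._call_llm(prompt)
        self._spent += self._cost
        self._cache[key] = answer
        return answer


# Stub model call so the sketch runs offline.
calls: list[str] = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer #{len(calls)}"

llm = GuardedLLM(fake_llm, cost_per_call_usd=0.01, budget_usd=0.02)
first = llm.complete("What is pgvector?")
second = llm.complete("  what is   PGVECTOR? ")  # served from the cache
```

The budget check sits after the cache lookup on purpose: cached answers stay available even when the cap has been hit, so only new paid calls are refused.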

What this series covers

This is the pillar of a hands-on series on AI for backend engineers — field notes, not theory. Each upcoming post takes one of the building blocks above and goes deep with real code, real numbers, and the production pitfalls, grouped into four threads:

  • RAG & semantic search — pgvector vs dedicated vector DBs, chunking, hybrid keyword-plus-vector search, reranking, and evaluation.
  • LLMs in the backend — calling model APIs cleanly, streaming, structured output, cost control, caching, retries, and observability.
  • AI on real data & infra — embeddings in Postgres, semantic place search, address dedup, and analyzing LLM usage with ClickHouse.
  • Productionizing AI — self-host vs API, Docker deploys, prompt injection defense, guardrails, and latency tuning.

Several of these build directly on existing field notes — semantic search extends the keyword work in PostgreSQL Full-Text Search: Optimizing Fast Address Autocomplete for Vietnamese Text, and semantic place search applies it to the geocoding stack behind GoGoDuk: Vietnam Map APIs Built for Developers.

Bookmark this post as the map. The destination is the same one you reach with every other backend system: something that is correct, observable, and cheap enough to run at scale.

