Semantic search: how it works and how to build it

An implementation-grade guide to semantic search: embeddings and ANN indexes, why pure vector search disappoints, and the hybrid (BM25 + vector) plus reranking pipeline strong systems actually run.

Most explainers stop at one sentence: semantic search finds results by meaning instead of by exact words. That sentence is true and it is also where the useful part begins, not where it ends. If you actually have to build a semantic search system that holds up under real traffic, the meaning-not-keywords framing tells you almost nothing about the decisions that determine whether your search is good: which embedding model, which index, whether to keep keyword scoring around, whether to rerank, and how you'll know if any of it worked.

We've shipped retrieval systems where the first dense-vector prototype scored worse than the boring keyword search it was meant to replace. That happens more often than the vendor blogs admit. This guide is the version we wish those pages had been: what semantic search is, how it works under the hood, why pure vector search disappoints in production, and the hybrid-plus-rerank pipeline that the strong systems actually run. Named tools, real code, a dated benchmark, and the failure modes we keep getting paged for.

What semantic search actually is

Semantic search is information retrieval that ranks documents by conceptual similarity to a query rather than by literal token overlap. The mechanism is consistent across every implementation: a model maps text into a high-dimensional vector (an embedding) where distance approximates meaning, and search becomes a nearest-neighbour lookup in that vector space. A query for "headache relief" lands near documents about treating migraines even when the word "headache" never appears in them, because the embedding model learned that those ideas sit close together.
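To make "distance approximates meaning" concrete, here is a minimal sketch of the nearest-neighbour lookup. The three-dimensional vectors and document names are hand-made toy values chosen for illustration, not output from any real embedding model, and the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2):
    """Exact (brute-force) top-k: score every document, then sort by similarity."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-d "embeddings", made up for illustration; real models emit hundreds of dimensions.
TOY_DOCS = {
    "treating-migraines": [0.9, 0.1, 0.0],
    "tension-pain-remedies": [0.8, 0.3, 0.1],
    "resetting-your-router": [0.0, 0.2, 0.9],
}
HEADACHE_RELIEF_QUERY = [0.85, 0.2, 0.05]  # stand-in for the embedding of "headache relief"

top = nearest(HEADACHE_RELIEF_QUERY, TOY_DOCS, k=2)
print(top)
```

Neither of the two top documents has to contain the word "headache"; they win only because their vectors point in nearly the same direction as the query's.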

That is the whole trick and also the whole problem. The same property that lets semantic search understand synonyms and paraphrase makes it blind to things keyword search handles trivially: an exact SKU, an error code, a person's surname, a version number. A practitioner reads the definition above and immediately asks the right follow-up question, which is not "how does meaning work" but "where does this break, and what do I run alongside it."

Semantic search vs keyword search: where each one wins

The semantic search vs keyword search comparison is usually framed as a winner-takes-all upgrade, and that framing is wrong. Keyword search is not a legacy technology that semantic search replaces. It is a different tool with a different failure surface, and the strongest production systems run both. The honest version of the comparison looks at where each one is genuinely better rather than declaring one obsolete.

Semantic (dense vector) search

Wins on intent, synonyms, paraphrase, and cross-lingual matching. Great for natural-language questions, support knowledge bases, and discovery where the user can't name the exact term. Degrades on exact identifiers, rare tokens, and very short queries where there's no context to embed.

Keyword (lexical, BM25) search

Wins on exact matches: SKUs, error codes, names, version strings, legal citations. Cheap, transparent, and easy to debug because you can see which terms matched. Degrades when the user's wording differs from the document's, and has no concept of meaning or synonyms.

The decision is not which one to pick. It is how to combine them so that a query for "how do I reset my password" finds the conceptual answer while a query for error code "ERR_4032" still finds the exact match. That combination is hybrid search, and we get to it after the mechanics.

How semantic search works, step by step

Understanding how semantic search works is mostly understanding two pipelines that run at different times. One runs once per document at index time: split the source into chunks, embed each chunk, and store the vectors in an index. The other runs once per query: embed the query with the same model, find the nearest vectors, optionally rerank the top candidates, and return them. Get the index-time pipeline wrong and no amount of query-time cleverness recovers it.

Figure: Query-time semantic search pipeline. Query (raw text) → Embed (same model as index) → ANN search (top-k candidates) → Rerank (cross-encoder) → Results (ranked).

Two rules of this pipeline are easy to miss. First, the query and the documents must be embedded by the identical model, because vectors from two different models live in incompatible spaces and distances between them are meaningless. Second, the chunk you embed is the chunk you can retrieve. If you embed 2,000-token pages, you retrieve whole pages and the relevant sentence is buried; if you embed 100-token fragments, you retrieve precise snippets but lose surrounding context. Chunking is a tuning decision, not a default.

Chunking deserves more attention than it gets because it silently caps your ceiling. The common defaults, a fixed 512-token window or a naive split on newlines, cut sentences in half and strand the answer across a chunk boundary where neither half embeds well. We prefer structure-aware splitting: break on headings and paragraph boundaries first, then pack to a target size with a small overlap so context that straddles a boundary survives in at least one chunk. The right chunk size is empirical, so it goes through the same golden-set eval as everything else. There's no universal answer, only the answer your corpus and query mix produce when you measure it.
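A minimal sketch of structure-aware packing with overlap, under two stated assumptions: paragraphs and headings are separated by blank lines, and word count stands in for token count (swap in your model's tokenizer for real sizing). The function name and default sizes are hypothetical, not a recommendation.

```python
def chunk_paragraphs(text: str, target_words: int = 200, overlap_paragraphs: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks of roughly target_words.

    Paragraphs are never cut in half. The last `overlap_paragraphs` paragraphs of
    each finished chunk are repeated at the start of the next one, so context that
    straddles a boundary survives in at least one chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n_words = len(para.split())
        if current and current_words + n_words > target_words:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
            current_words = sum(len(p.split()) for p in current)
        current.append(para)
        current_words += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "Resetting passwords\n\nOpen the account page.\n\nClick the reset link.\n\nCheck your inbox."
for chunk in chunk_paragraphs(sample, target_words=6, overlap_paragraphs=1):
    print(repr(chunk))
```

A single paragraph longer than the target becomes its own oversized chunk here rather than being split; whether that is acceptable is one more thing the golden-set eval decides.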

Embeddings: the part everyone gets wrong

The embedding model is the single highest-leverage choice in the whole system, and it is the one the explainers skip. Three things actually matter: domain fit, dimensionality, and whether the model was trained for retrieval at all. A general-purpose embedding model trained on web text will underperform on legal contracts, clinical notes, or your internal jargon, because the distances it learned don't reflect your domain's notion of similarity. This is the out-of-domain drift that quietly tanks recall.

index_pgvector.py python
import psycopg2
from sentence_transformers import SentenceTransformer

# Same model is used at index time AND query time. This is non-negotiable.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # retrieval-tuned, 768-dim

conn = psycopg2.connect("dbname=search")
cur = conn.cursor()
# The vector type only exists once the pgvector extension is enabled
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
# pgvector column sized to the model's output dimension
cur.execute("CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, body text, embedding vector(768))")

def index(chunks: list[str]):
    # normalize_embeddings=True lets us use cosine distance cleanly
    vecs = model.encode(chunks, normalize_embeddings=True)
    for body, vec in zip(chunks, vecs):
        # psycopg2 cannot adapt numpy.float32, so convert to plain Python floats
        # and cast the resulting array to pgvector's vector type explicitly
        cur.execute("INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)", (body, vec.tolist()))
    conn.commit()

# An HNSW index over the vector column gives sub-linear ANN search.
cur.execute("CREATE INDEX IF NOT EXISTS docs_embedding_hnsw ON docs USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64)")
conn.commit()  # without this the index is rolled back when the connection closes

Dimensionality is a real cost, not a bigger-is-better dial. A 1,536-dimension model stores twice the bytes per vector and searches slower than a 768-dimension one, and on most corpora the recall gain is marginal. We default to a retrieval-tuned 768-dimension model and only move up when a golden-set eval proves the larger model earns its storage and latency. The model that ships in a tutorial because it's famous is rarely the model that should ship in your index.
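The storage half of that claim is simple arithmetic. The sketch below assumes float32 vectors (4 bytes per dimension) and counts only the raw vector payload; row overhead and the ANN index itself come on top, so treat the result as a floor. The one-million-vector corpus size is an illustrative assumption.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats


def raw_vector_bytes(num_vectors: int, dims: int) -> int:
    """Raw payload only: excludes per-row overhead and the ANN index structure."""
    return num_vectors * dims * BYTES_PER_FLOAT32


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


NUM_VECTORS = 1_000_000  # illustrative corpus size
for dims in (768, 1536):
    size = raw_vector_bytes(NUM_VECTORS, dims)
    print(f"{dims} dims: {size:,} bytes ({to_gib(size):.2f} GiB)")
```

At a million chunks that works out to about 2.9 GiB of raw vectors at 768 dimensions against about 5.7 GiB at 1,536, before any index overhead.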

ANN indexes: HNSW, IVF, and the recall-latency dial

Once you have a few hundred thousand vectors, comparing the query against every one of them (brute-force, exact nearest neighbour) is too slow. ANN indexes trade a small amount of recall for a large amount of speed. The two families you'll meet are HNSW, a navigable graph that most modern stores default to, and IVF, which partitions vectors into clusters and only searches the nearest ones. pgvector, Elasticsearch, OpenSearch, Weaviate, Qdrant, and FAISS all ship HNSW, and managed services such as Pinecone put comparable ANN indexes behind their own APIs; the difference is which knobs they expose.

Figure: HNSW, the recall vs latency dial. Low ef_search is fast with lower recall; high ef_search is slower with higher recall; M (graph degree) is set at build time and costs memory.
Higher ef_search walks more of the graph: better recall, more latency. The knob is yours to set per workload.

The two HNSW parameters worth knowing: M controls how connected the graph is (set at build time, costs memory), and ef_search controls how much of the graph each query walks (set at query time, costs latency). Turning ef_search up recovers recall the approximation gave away, at the price of milliseconds. We tune ef_search against the eval set per workload instead of accepting the default, because the right setting for a 50k-doc help centre is not the right setting for a 50-million-row product catalogue.
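A hedged sketch of what tuning ef_search per workload can look like with pgvector, whose query-time setting is named `hnsw.ef_search`. It assumes the `docs` table and `embedding` column from the indexing script, a DB-API cursor inside an open transaction, and cosine distance via the `<=>` operator; the function names are hypothetical.

```python
def search_with_ef(cur, query_vec: list[float], k: int = 10, ef_search: int = 100) -> list[int]:
    """Run one ANN query against `docs` with a per-query ef_search.

    `cur` is any DB-API cursor on a pgvector-enabled database.
    """
    # set_config(..., true) scopes the setting to the current transaction, like SET LOCAL.
    cur.execute("SELECT set_config('hnsw.ef_search', %s, true)", (str(ef_search),))
    # pgvector accepts a text literal of the form [x,y,z] cast to vector.
    literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    cur.execute(
        "SELECT id FROM docs ORDER BY embedding <=> %s::vector LIMIT %s",
        (literal, k),
    )
    return [row[0] for row in cur.fetchall()]


def recall_against_exact(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Share of the exact top-k that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Sweeping `ef_search` over a handful of values, recording `recall_against_exact` against a brute-force result for the same queries alongside the measured latency, gives you the curve to pick a setting from instead of accepting the default.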

Why pure vector search disappoints in production

Here is the moment most teams hit and most blogs hide. The dense-vector prototype demos beautifully on natural-language questions, then ships, and the support tickets start: a customer searches for the exact product code and gets fuzzy near-matches; an engineer searches for a stack-trace string and the index returns conceptually similar but useless pages. Pure semantic search is strong on the queries it's good at and quietly terrible on the long tail of exact-match intent, which is a large share of real traffic.

The fix is not a better embedding model, though that helps. The fix is to stop treating dense vectors as the whole system. The same lesson shows up in agent retrieval: in our write-up on multi-agent orchestration patterns, the retrieval tool that grounds each agent fails in exactly this way when it's dense-only, and the same hybrid fix applies. Keep keyword scoring in the loop and fuse it with the vectors.

Hybrid search: BM25 plus vectors, fused

Hybrid search runs lexical and dense retrieval in parallel and merges their results. BM25 supplies the exact-match precision that vectors lack; the dense index supplies the semantic recall that keywords lack. The interesting question is how you merge two ranked lists whose scores live on different scales. You can't just add a BM25 score of 14.2 to a cosine similarity of 0.83 and expect anything sensible.

Figure: Hybrid retrieval with reciprocal rank fusion. The query goes to BM25/lexical retrieval (exact terms, ranked list A) and to dense/vector ANN retrieval (meaning, ranked list B); RRF fusion merges the two by rank, not raw score, and the top 50 go to the reranker.
Two retrievers, two ranked lists, fused by rank (not raw score), then a small candidate set goes to the reranker.

The robust default is reciprocal rank fusion: instead of combining raw scores, you combine each document's rank position in the two lists. A document ranked second by BM25 and fifth by the vector index gets a fused score from those positions, which sidesteps the incompatible-scale problem entirely. Elasticsearch and OpenSearch ship rank fusion natively; with pgvector you compute it in SQL. Weighted-sum fusion (normalise then blend with a tunable alpha) is the alternative when you want a dial that favours one retriever, but it needs careful normalisation per query.
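A minimal sketch of reciprocal rank fusion in plain Python. Each document scores the sum of 1 / (k + rank) over the lists it appears in; k = 60 is the constant conventionally used with RRF, and the document IDs below are hypothetical. With that constant, the document ranked second by BM25 and fifth by the vector index scores 1/62 + 1/65, roughly 0.0315.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs by rank position, ignoring raw scores."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists for one query, best match first.
bm25_ranking = ["sku-4032", "reset-guide", "faq"]
dense_ranking = ["reset-guide", "login-help", "sku-4032"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print([doc_id for doc_id, _ in fused])
# -> ['reset-guide', 'sku-4032', 'login-help', 'faq']
```

Documents that both retrievers agree on rise to the top, while a document only one retriever found still makes the list, which is the behaviour you want from a candidate set headed for a reranker.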

The detail that trips people up is that reciprocal rank fusion deliberately throws away score magnitude. That feels lossy, and it is, but it's also why it's robust: a BM25 score of 14.2 and a cosine similarity of 0.83 carry no shared unit, so any attempt to add them directly is meaningless arithmetic dressed up as relevance. Rank position is the one signal both retrievers express on the same scale. The practical upside is that fusion needs almost no tuning to be reasonable, which is rare in retrieval. Weighted-sum fusion can edge it out once you've invested in per-query score normalisation, but reach for that only after the eval shows rank fusion leaving measurable quality on the table. Most teams never need to.
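For comparison, a sketch of the weighted-sum alternative under one common set of assumptions: min-max normalisation applied per query to each retriever's scores, a document missing from one list contributing 0 from that retriever, and alpha as the weight on the dense side. The scores and document IDs are made up for illustration.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one retriever's scores for one query into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_sum_fusion(bm25: dict[str, float], dense: dict[str, float], alpha: float = 0.5):
    """alpha weights the dense retriever, (1 - alpha) weights BM25."""
    norm_bm25, norm_dense = min_max(bm25), min_max(dense)
    fused = {
        doc_id: alpha * norm_dense.get(doc_id, 0.0) + (1 - alpha) * norm_bm25.get(doc_id, 0.0)
        for doc_id in set(norm_bm25) | set(norm_dense)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores on incompatible scales.
bm25_scores = {"sku-4032": 14.2, "reset-guide": 9.0, "faq": 3.8}
dense_scores = {"reset-guide": 0.83, "login-help": 0.80, "sku-4032": 0.71}

print(weighted_sum_fusion(bm25_scores, dense_scores, alpha=0.5))
```

Min-max is sensitive to a single outlier score in either list, which is one reason this approach needs the per-query care that rank fusion avoids.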

Reranking: the cheapest quality win

Retrieval gives you a candidate set; reranking decides the final order. A reranker is a cross-encoder that reads the query and each candidate together and scores their true relevance, which is far more accurate than comparing two independent embeddings. It's too slow to run over the whole corpus, which is exactly why the pipeline retrieves a cheap top-50 first and reranks only those. Adding a reranker is usually the single highest-return change you can make to an existing semantic search system.

rerank.py python
import cohere

co = cohere.Client()  # Cohere Rerank: a hosted cross-encoder

def rerank(query: str, candidates: list[str], top_n: int = 5):
    # candidates = the fused top-50 from hybrid retrieval
    resp = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # results come back ordered by true query-document relevance
    return [(r.index, r.relevance_score) for r in resp.results]

rerank.ts typescript
import { CohereClient } from "cohere-ai";

const co = new CohereClient();

export async function rerank(query: string, candidates: string[], topN = 5) {
  // candidates = the fused top-50 from hybrid retrieval
  const resp = await co.rerank({
    model: "rerank-v3.5",
    query,
    documents: candidates,
    topN,
  });
  // ordered by true query-document relevance
  return resp.results.map((r) => ({ index: r.index, score: r.relevanceScore }));
}

Reranking is not free. It adds a network round-trip and per-document inference cost, so you cap the candidate set (top-50 is a sane starting point) and rerank to a small final-N. If you self-host instead of calling a hosted reranker, an open cross-encoder runs on the same GPU you'd use for embeddings. Either way, measure the added latency before you commit, because a reranker that doubles your p95 to win two points of precision is a trade you should make consciously, not by accident.

The reason a cross-encoder beats the retriever is structural: the retriever embeds query and document separately and can only compare two frozen vectors, while the reranker reads both together and can weigh how each query term lines up against the actual passage. That joint view is what catches the candidate that's topically near but answers a different question, which is exactly the kind of plausible-but-wrong result that erodes user trust faster than an obvious miss does.
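A sketch of the cap-then-rerank step with the scorer left pluggable, so the same function works with a hosted API or a self-hosted model. `score_pairs` is a hypothetical callable that takes (query, document) pairs and returns one relevance score per pair; the word-overlap scorer below is only a stand-in so the example runs without a model.

```python
from typing import Callable, Sequence


def rerank_candidates(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
    max_candidates: int = 50,
) -> list[tuple[int, float]]:
    """Score (query, candidate) pairs jointly and return (original_index, score), best first."""
    capped = list(candidates)[:max_candidates]  # hard cap keeps cost and latency bounded
    if not capped:
        return []
    scores = score_pairs([(query, doc) for doc in capped])
    order = sorted(range(len(capped)), key=lambda i: scores[i], reverse=True)
    return [(i, float(scores[i])) for i in order[:top_n]]


def word_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy scorer: number of shared lowercase words. Not a real relevance model."""
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]


docs = ["billing overview", "how to reset your password", "password policy"]
print(rerank_candidates("reset password", docs, word_overlap_scorer, top_n=2))
# -> [(1, 2.0), (2, 1.0)]
```

To self-host, one option (assuming sentence-transformers is installed) is to pass the `predict` method of a `CrossEncoder` loaded from a public checkpoint such as `cross-encoder/ms-marco-MiniLM-L-6-v2` as `score_pairs`; which checkpoint suits your corpus is a question for the eval set.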

Measuring semantic search: the eval you can't skip

You cannot tune what you don't measure, and "it feels better" is how teams ship regressions. The discipline is a golden set: a few hundred real queries paired with their known-relevant documents, assembled from query logs and human judgement. Against that set you compute recall@k (did the right document make the candidate set), nDCG (is it ranked near the top), and MRR (how high is the first relevant hit). Every change to the pipeline is then a measured before-and-after, not a vibe.

Building the golden set is the part teams want to skip and the part that pays for itself fastest. Start from real query logs rather than invented queries, because the questions users actually type are weirder, shorter, and more typo-ridden than anything you'd write at a desk. For each query, have a human mark which documents genuinely answer it; a hundred well-judged pairs beat a thousand sloppy ones. Keep the set version-controlled and grow it whenever a production miss surfaces a query class you weren't testing.

The reason to separate recall from ranking metrics is that they fail differently: a low recall@10 means the right document never even reached the candidate set, which is a retrieval problem you fix by tuning the index or adding hybrid; a healthy recall but weak nDCG means the document is there but buried, which is a ranking problem you fix with reranking. Reading the two numbers together tells you which lever to pull, so you stop guessing and start fixing the actual gap.
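The three metrics are short enough to write by hand. This sketch assumes binary relevance (a document either answers the query or it doesn't) and a golden set shaped as a mapping from query to the set of relevant document IDs; `search_fn` is a hypothetical callable returning ranked IDs for a query.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant hit, 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def evaluate(golden: dict[str, set[str]], search_fn, k: int = 10) -> dict[str, float]:
    """Average each metric over every query in the golden set."""
    runs = [(search_fn(query), relevant) for query, relevant in golden.items()]
    n = len(runs) or 1
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        f"ndcg@{k}": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Run `evaluate` once per pipeline variant against the same golden set and the before-and-after comparison is a diff of three numbers.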

Retrieval eval, 50k-doc internal corpus, 2026-Q2 (typical engagement shape, not a single client result)
Recall@10, BM25 only: 61%
Recall@10, dense only: 74%
Recall@10, hybrid + rerank: 88%
nDCG@10, hybrid + rerank: 0.71

Read those four numbers as the whole argument of this guide. On this corpus in 2026-Q2, BM25 alone hit 61% recall@10 and dense alone hit 74%, so semantic search was a real improvement over keywords. But the production-grade configuration, hybrid fusion followed by a Cohere Rerank pass, reached 88% recall@10 with an nDCG@10 of 0.71. The jump from 74% to 88% is the difference between a demo and a system, and it comes entirely from the parts the explainers omit.

Cost and latency: what hybrid plus rerank actually adds

Quality has a bill. Each stage you add to the pipeline costs latency and, if hosted, money per query. The point of measuring it is not to avoid the cost but to spend it where it earns its keep. A help centre serving humans can absorb a reranker's extra latency; an autocomplete suggesting as the user types cannot. The same system can even branch on the query: cheap path for the common case, full hybrid-plus-rerank only when the first pass looks weak. Here is the rough shape of the per-query latency budget on a mid-sized corpus, so you can see what each stage adds before you wire it in.
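One way to sketch the branch-on-the-query idea: run the cheap path first and escalate only when its results look weak. Everything here is an assumption for illustration, including the function names, the idea that the cheap path returns (doc_id, score) pairs sorted best first with scores in a comparable 0 to 1 range, and both thresholds, which would need tuning against your own eval set.

```python
from typing import Callable

Results = list[tuple[str, float]]  # (doc_id, score), best first


def search_with_escalation(
    query: str,
    cheap_search: Callable[[str], Results],
    full_search: Callable[[str], Results],
    min_top_score: float = 0.5,  # hypothetical threshold
    min_results: int = 3,        # hypothetical threshold
) -> tuple[Results, str]:
    """Return (results, path_taken), where path_taken is 'cheap' or 'full'."""
    first_pass = cheap_search(query)
    looks_weak = (
        not first_pass
        or len(first_pass) < min_results
        or first_pass[0][1] < min_top_score
    )
    if looks_weak:
        # Pay for hybrid plus rerank only when the cheap path did not deliver.
        return full_search(query), "full"
    return first_pass, "cheap"
```

Logging `path_taken` per query tells you what share of traffic is paying the full-pipeline latency, which is the number the budget conversation actually needs.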

Per-query latency by pipeline stage (mid-sized corpus, illustrative budget in ms)
BM25 lexical: 12 ms
Dense ANN (HNSW): 28 ms
Hybrid (parallel + fuse): 35 ms
Hybrid + Cohere Rerank top-50: 140 ms

Lexical and dense retrieval run in parallel, so hybrid costs roughly the slower of the two plus a cheap fusion step, not the sum. The reranker is the expensive stage by a wide margin, which is why you cap its candidate set hard. The same cost-versus-correctness reasoning shows up wherever retrieval gates a real decision; our piece on fraud detection at the auth boundary walks through making that latency-budget call when a few extra milliseconds is the difference between blocking a fraudulent login and annoying a real customer.
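A sketch of why hybrid costs the slower retriever plus fusion rather than the sum: the two retrievers are submitted concurrently and fusion waits for both. The retriever and fusion callables are hypothetical placeholders, and the 7 ms fusion figure in the usage line is simply the remainder implied by the illustrative 12 / 28 / 35 ms budget, not a measurement.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def hybrid_retrieve(
    query: str,
    lexical_search: Callable[[str], list[str]],
    dense_search: Callable[[str], list[str]],
    fuse: Callable[[list[str], list[str]], list[str]],
) -> list[str]:
    """Run both retrievers concurrently, then fuse their ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        lexical_future = pool.submit(lexical_search, query)
        dense_future = pool.submit(dense_search, query)
        # .result() blocks until each retriever finishes, so the wall-clock wait
        # is governed by the slower of the two, not their sum.
        return fuse(lexical_future.result(), dense_future.result())


def estimated_hybrid_ms(lexical_ms: float, dense_ms: float, fusion_ms: float) -> float:
    """Back-of-envelope budget for the parallel stage."""
    return max(lexical_ms, dense_ms) + fusion_ms


print(estimated_hybrid_ms(lexical_ms=12, dense_ms=28, fusion_ms=7))
```

Threads suit this case because both retrievers spend their time waiting on I/O to a database or search cluster; a CPU-bound in-process retriever would need a different concurrency model.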

When semantic search is the wrong tool

The operator-honest position is that semantic search is sometimes the wrong answer, and a good consultant says so before you've built the index. If your queries are overwhelmingly exact identifiers, a well-tuned BM25 config will beat dense vectors and cost a fraction as much to run. If your corpus is tiny and rarely changes, the engineering overhead of embeddings and an ANN index may never pay back, and a plain database query with a few synonyms hand-coded in will outlast it. Match the tool to the query distribution, not to the hype, and revisit the choice only when the distribution itself shifts.

Workload | BM25 only | Dense only | Hybrid + rerank
Exact-ID lookups (SKU, error code) | Best fit | Misses exact tokens | Works, overkill
Natural-language Q&A / support KB | Wording-brittle | Strong | Best fit
Mixed real-world traffic (most apps) | Long-tail misses | Exact-match gaps | Best fit
Tiny, static corpus (under a few k docs) | Cheap, fine | Overhead unjustified | Premature
Pick by query distribution. The honest default for mixed traffic is hybrid plus rerank; for narrow workloads the simpler tool wins.
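Picking by query distribution starts with measuring it. This is a rough sketch that estimates what share of logged queries look like exact identifiers; the regular expression (a single token containing at least one digit) is a crude, hypothetical heuristic that you would adapt to your own SKU and error-code formats.

```python
import re

# Assumption: an "identifier" is one whitespace-free token with at least one digit,
# e.g. ERR_4032, SKU-88213 or v2.3.1. Adjust to your own ID formats.
IDENTIFIER_PATTERN = re.compile(r"^(?=.*\d)[A-Za-z0-9][A-Za-z0-9_\-./]*$")


def looks_like_identifier(query: str) -> bool:
    return bool(IDENTIFIER_PATTERN.match(query.strip()))


def identifier_share(queries: list[str]) -> float:
    """Fraction of queries that look like exact-ID lookups."""
    if not queries:
        return 0.0
    return sum(looks_like_identifier(q) for q in queries) / len(queries)


sample_log = ["ERR_4032", "how do I reset my password", "SKU-88213", "headache relief"]
print(f"{identifier_share(sample_log):.0%} of sampled queries look like identifiers")
```

A share near 100% points toward the BM25-only column of the table; a share near zero points toward dense or hybrid; anything in between is the mixed-traffic row.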

Production failure modes we keep getting paged for

The demo passes and then production teaches you the failure modes the tutorials never mention. These are the ones we've actually been paged for, and the fixes are usually boring rather than clever.

Frequently asked questions about semantic search

Do I need a dedicated vector database?

Not always. If you already run Postgres, pgvector is often enough and keeps your data in one place. Elasticsearch and OpenSearch add native hybrid scoring. Dedicated stores like Pinecone, Weaviate, and Qdrant earn their place at large scale or when you need their specific operational features. Start with what you already operate and move only when an eval says you must.

What is hybrid search and why does it matter?

Hybrid search runs lexical (BM25) and dense vector retrieval together and fuses the results, typically with reciprocal rank fusion. It matters because it covers each method's blind spot: keywords catch exact matches, vectors catch meaning. In our evals hybrid plus reranking consistently beats either method alone.

Does semantic search power RAG and chatbots?

Yes. RAG fetches relevant documents to ground a language model's answer, and that fetch step is semantic search (usually hybrid). The retrieval quality caps the answer quality: if the right document never reaches the model, no amount of prompting recovers it. This is why retrieval eval matters more than prompt tuning in most RAG systems.

How do I know if my semantic search is actually good?

Build a golden set of real queries with known-relevant documents and measure recall@k, nDCG, and MRR against it. Treat every pipeline change as a measured before-and-after. Without that, you're shipping on intuition, which is how regressions reach production unnoticed.
