# Semantic search: how it works and how to build it

> An implementation-grade guide to semantic search: embeddings and ANN indexes, why pure vector search disappoints, and the hybrid (BM25 + vector) plus reranking pipeline strong systems actually run.

**HTML version:** https://www.paiteq.com/blog/semantic-search/
**Published:** 2026-06-20T12:25:20.272Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~13 min


---

Most explainers stop at one sentence: semantic search finds results by meaning instead of by exact words. That sentence is true and it is also where the useful part begins, not where it ends. If you actually have to build a semantic search system that holds up under real traffic, the meaning-not-keywords framing tells you almost nothing about the decisions that determine whether your search is good: which embedding model, which index, whether to keep keyword scoring around, whether to rerank, and how you'll know if any of it worked.

We've shipped retrieval systems where the first dense-vector prototype scored worse than the boring keyword search it was meant to replace. That happens more often than the vendor blogs admit. This guide is the version we wish those pages had been: what semantic search is, how it works under the hood, why pure vector search disappoints in production, and the hybrid-plus-rerank pipeline that the strong systems actually run. Named tools, real code, a dated benchmark, and the failure modes we keep getting paged for.

## What semantic search actually is

Semantic search is information retrieval that ranks documents by conceptual similarity to a query rather than by literal token overlap. The mechanism is consistent across every implementation: a model maps text into a high-dimensional vector (an embedding) where distance approximates meaning, and search becomes a nearest-neighbour lookup in that vector space. A query for "headache relief" lands near documents about treating migraines even when the word "headache" never appears in them, because the embedding model learned that those ideas sit close together.

That is the whole trick and also the whole problem. The same property that lets semantic search understand synonyms and paraphrase makes it blind to things keyword search handles trivially: an exact SKU, an error code, a person's surname, a version number. A practitioner reads the definition above and immediately asks the right follow-up question, which is not "how does meaning work" but "where does this break, and what do I run alongside it."
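
To make the nearest-neighbour lookup concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hand-made stand-ins for real embeddings (an assumption for illustration; a real model produces hundreds of dimensions), and the document ids are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, doc_vecs, k=2):
    """Exact (brute-force) search: score every document, keep the top k ids."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy vectors; a real embedding model would produce these.
DOCS = {
    "treating-migraines": [0.9, 0.1, 0.0],
    "aspirin-dosage": [0.7, 0.3, 0.1],
    "router-setup": [0.0, 0.2, 0.9],
}
QUERY = [0.8, 0.2, 0.0]  # pretend embedding of "headache relief"

print(nearest(QUERY, DOCS))  # ['treating-migraines', 'aspirin-dosage']
```

Nothing in the lookup inspects the words themselves, which is both why the migraine document wins and why an exact SKU gets no special treatment.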

## Semantic search vs keyword search: where each one wins

The semantic search vs keyword search comparison is usually framed as a winner-takes-all upgrade, and that framing is wrong. Keyword search is not a legacy technology that semantic search replaces. It is a different tool with a different failure surface, and the strongest production systems run both. The honest version of the comparison looks at where each one is genuinely better rather than declaring one obsolete.

The decision is not which one to pick. It is how to combine them so that a query for "how do I reset my password" finds the conceptual answer while a query for error code "ERR_4032" still finds the exact match. That combination is hybrid search, and we get to it after the mechanics.

## How semantic search works, step by step

Understanding how semantic search works is mostly understanding two pipelines that run at different times. One runs once per document at index time: split the source into chunks, embed each chunk, and store the vectors in an index. The other runs once per query: embed the query with the same model, find the nearest vectors, optionally rerank the top candidates, and return them. Get the index-time pipeline wrong and no amount of query-time cleverness recovers it.

Two non-obvious rules govern this pipeline. First, the query and the documents must be embedded by the identical model, because vectors from two different models live in incompatible spaces and distances between them are meaningless. Second, the chunk you embed is the chunk you can retrieve. If you embed 2,000-token pages, you retrieve whole pages and the relevant sentence is buried; if you embed 100-token fragments, you retrieve precise snippets but lose surrounding context. Chunking is a tuning decision, not a default.

Chunking deserves more attention than it gets because it silently caps your ceiling. The common defaults, a fixed 512-token window or a naive split on newlines, cut sentences in half and strand the answer across a chunk boundary where neither half embeds well. We prefer structure-aware splitting: break on headings and paragraph boundaries first, then pack to a target size with a small overlap so context that straddles a boundary survives in at least one chunk. The right chunk size is empirical, so it goes through the same golden-set eval as everything else. There's no universal answer, only the answer your corpus and query mix produce when you measure it.
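
The structure-aware splitting described above can be sketched roughly as follows. This is one possible shape, not a reference implementation: it counts whitespace-separated words as a stand-in for tokens, treats any block starting with `#` as a heading, and the function names and the `target_words` and `overlap_blocks` parameters are ours.

```python
import re

def split_blocks(text):
    """Split markdown-style text into blocks on blank lines (headings, paragraphs)."""
    return [b.strip() for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]

def pack_chunks(blocks, target_words=200, overlap_blocks=1):
    """Pack whole blocks into chunks of roughly target_words.

    A heading always starts a new chunk. When a chunk is closed for size,
    its last overlap_blocks blocks are repeated at the start of the next one.
    """
    chunks, current, count = [], [], 0
    for block in blocks:
        words = len(block.split())  # crude proxy for a token count
        is_heading = block.startswith("#")
        if current and (is_heading or count + words > target_words):
            chunks.append("\n\n".join(current))
            # Carry overlap across size-based splits only, not across sections.
            carry = overlap_blocks > 0 and not is_heading
            current = current[-overlap_blocks:] if carry else []
            count = sum(len(b.split()) for b in current)
        current.append(block)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "# A\n\none two three\n\nfour five six\n\n# B\n\nseven eight"
for chunk in pack_chunks(split_blocks(sample), target_words=6):
    print(repr(chunk))
```

Because blocks are never cut, no sentence is split mid-way; the cost is that a single oversized paragraph becomes an oversized chunk, which a production splitter would need to handle separately.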

## Embeddings: the part everyone gets wrong

The embedding model is the single highest-leverage choice in the whole system, and it is the one the explainers skip. Three things actually matter: domain fit, dimensionality, and whether the model was trained for retrieval at all. A general-purpose embedding model trained on web text will underperform on legal contracts, clinical notes, or your internal jargon, because the distances it learned don't reflect your domain's notion of similarity. This is the out-of-domain drift that quietly tanks recall.

Dimensionality is a real cost, not a bigger-is-better dial. A 1,536-dimension model stores twice the bytes per vector and searches slower than a 768-dimension one, and on most corpora the recall gain is marginal. We default to a retrieval-tuned 768-dimension model and only move up when a golden-set eval proves the larger model earns its storage and latency. The model that ships in a tutorial because it's famous is rarely the model that should ship in your index.
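
The storage half of that cost is simple arithmetic. The sketch below assumes uncompressed float32 values (4 bytes each) and ignores index overhead such as graph links, so treat the figures as a lower bound rather than a sizing guide.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage, assuming float32 values and no index overhead."""
    return num_vectors * dims * bytes_per_value

for dims in (768, 1536):
    gib = index_size_bytes(1_000_000, dims) / 2**30
    print(f"{dims} dims: {gib:.2f} GiB per million vectors")
# 768 dims: 2.86 GiB per million vectors
# 1536 dims: 5.72 GiB per million vectors
```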

## ANN indexes: HNSW, IVF, and the recall-latency dial

Once you have a few hundred thousand vectors, comparing the query against every one of them (brute-force, exact nearest neighbour) is too slow. Approximate nearest-neighbour (ANN) indexes trade a small amount of recall for a large amount of speed. The two families you'll meet are HNSW, a navigable graph that most modern stores default to, and IVF, which partitions vectors into clusters and only searches the nearest ones. pgvector, Elasticsearch, OpenSearch, Weaviate, Qdrant, and FAISS all ship HNSW; the difference is which knobs they expose. A fully managed service such as Pinecone keeps most index internals behind its API, so check its documentation rather than assuming HNSW knobs are available.

The two HNSW parameters worth knowing: M controls how connected the graph is (set at build time, costs memory), and ef_search controls how much of the graph each query walks (set at query time, costs latency). The spelling varies by store: pgvector calls them `m` and `hnsw.ef_search`, FAISS `M` and `efSearch`. Turning ef_search up recovers recall the approximation gave away, at the price of milliseconds. We tune ef_search against the eval set per workload instead of accepting the default, because the right setting for a 50k-doc help centre is not the right setting for a 50-million-row product catalogue.
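
One way to tune ef_search against an eval set is a simple sweep: compute exact neighbours once by brute force, then find the smallest setting whose results recover enough of them. In this sketch `search_fn` is a hypothetical wrapper around whatever query call your store exposes, and the candidate values and the 0.95 target are placeholders, not recommendations.

```python
def ann_recall(exact_ids, approx_ids):
    """Fraction of the true nearest neighbours that the ANN index returned."""
    if not exact_ids:
        return 0.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)

def pick_ef_search(search_fn, queries, ground_truth,
                   candidates=(16, 32, 64, 128, 256), target=0.95):
    """Return the smallest ef_search whose mean recall meets the target.

    search_fn(query, ef_search) -> list of ids (hypothetical store wrapper).
    ground_truth[query] -> ids from an exact brute-force search.
    """
    for ef in sorted(candidates):
        recalls = [ann_recall(ground_truth[q], search_fn(q, ef)) for q in queries]
        if sum(recalls) / len(recalls) >= target:
            return ef
    return None  # nothing reached the target: widen the sweep or revisit M

# Stand-in index for demonstration: it only finds everything once ef >= 64.
def fake_search(query, ef_search):
    return [1, 2, 3, 4] if ef_search >= 64 else [1, 2]

print(pick_ef_search(fake_search, ["q"], {"q": [1, 2, 3, 4]}))  # 64
```

In practice you would record latency alongside recall for each setting, since the point of the sweep is to see what each extra point of recall costs.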

## Why pure vector search disappoints in production

Here is the moment most teams hit and most blogs hide. The dense-vector prototype demos beautifully on natural-language questions, then ships, and the support tickets start: a customer searches for the exact product code and gets fuzzy near-matches; an engineer searches for a stack-trace string and the index returns conceptually similar but useless pages. Pure semantic search is strong on the queries it's good at and quietly terrible on the long tail of exact-match intent, which is a large share of real traffic.

The fix is not a better embedding model, though that helps. The fix is to stop treating dense vectors as the whole system. The same lesson shows up in agent retrieval: in our write-up on [multi-agent orchestration patterns](/blog/multi-agent-orchestration-patterns/) the retrieval tool that grounds each agent fails in exactly this way when it's dense-only, and the same hybrid fix applies. Keep keyword scoring in the loop and fuse it with the vectors.

## Hybrid search: BM25 plus vectors, fused

Hybrid search runs lexical and dense retrieval in parallel and merges their results. BM25 supplies the exact-match precision that vectors lack; the dense index supplies the semantic recall that keywords lack. The interesting question is how you merge two ranked lists whose scores live on different scales. You can't just add a BM25 score of 14.2 to a cosine similarity of 0.83 and expect anything sensible.

The robust default is reciprocal rank fusion: instead of combining raw scores, you combine each document's rank position in the two lists. A document ranked second by BM25 and fifth by the vector index gets a fused score from those positions, which sidesteps the incompatible-scale problem entirely. Elasticsearch and OpenSearch ship rank fusion natively; with pgvector you compute it in SQL. Weighted-sum fusion (normalise then blend with a tunable alpha) is the alternative when you want a dial that favours one retriever, but it needs careful normalisation per query.

The detail that trips people up is that reciprocal rank fusion deliberately throws away score magnitude. That feels lossy, and it is, but it's also why it's robust: a BM25 score of 14.2 and a cosine similarity of 0.83 carry no shared unit, so any attempt to add them directly is meaningless arithmetic dressed up as relevance. Rank position is the one signal both retrievers express on the same scale. The practical upside is that fusion needs almost no tuning to be reasonable, which is rare in retrieval. Weighted-sum fusion can edge it out once you've invested in per-query score normalisation, but reach for that only after the eval shows rank fusion leaving measurable quality on the table. Most teams never need to.
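
Reciprocal rank fusion is short enough to write out. In the usual formulation each document scores the sum of 1 / (k + rank) over the lists it appears in; k = 60 is the conventional constant and is an assumption here, not something tuned for any corpus. The document ids are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc ids using rank positions only (1-based)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_hits = ["sku-4032", "reset-guide", "faq"]
dense_hits = ["reset-guide", "login-help", "sku-4032"]

print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
# ['reset-guide', 'sku-4032', 'login-help', 'faq']
```

Documents that both retrievers return rise above documents only one of them found, and no raw score was ever compared across retrievers. With k = 60, a document ranked second in one list and fifth in the other scores 1/62 + 1/65.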

## Reranking: the cheapest quality win

Retrieval gives you a candidate set; reranking decides the final order. A reranker is a cross-encoder that reads the query and each candidate together and scores their true relevance, which is far more accurate than comparing two independent embeddings. It's too slow to run over the whole corpus, which is exactly why the pipeline retrieves a cheap top-50 first and reranks only those. Adding a reranker is usually the single highest-return change you can make to an existing semantic search system.

Reranking is not free. It adds a network round-trip and per-document inference cost, so you cap the candidate set (top-50 is a sane starting point) and rerank to a small final-N. If you self-host instead of calling a hosted reranker, an open cross-encoder runs on the same GPU you'd use for embeddings. Either way, measure the latency add before you commit, because a reranker that doubles your p95 to win two points of precision is a trade you should make consciously, not by accident.

The reason a cross-encoder beats the retriever is structural: the retriever embeds query and document separately and can only compare two frozen vectors, while the reranker reads both together and can weigh how each query term lines up against the actual passage. That joint view is what catches the candidate that's topically near but answers a different question, which is exactly the kind of plausible-but-wrong result that erodes user trust faster than an obvious miss does.
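
The retrieve-then-rerank shape looks roughly like this. Both callables are placeholders: `retrieve_fn` stands for your first-stage retriever and `score_fn` for a cross-encoder or hosted rerank call. The toy word-overlap scorer below is not a cross-encoder; it exists only so the sketch runs without a model.

```python
def retrieve_then_rerank(query, retrieve_fn, score_fn, candidates=50, final_n=5):
    """Fetch a capped candidate set cheaply, then reorder it with a costlier scorer."""
    docs = retrieve_fn(query, candidates)
    # The scorer sees query and document together, once per candidate.
    scored = [(score_fn(query, doc), doc) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_n]]

# Stand-ins for demonstration only (hypothetical corpus and scorer).
CORPUS = [
    "billing overview",
    "how to reset your password",
    "password policy for admins",
]

def first_stage(query, limit):
    return CORPUS[:limit]

def toy_overlap_score(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

print(retrieve_then_rerank("reset password", first_stage, toy_overlap_score, final_n=2))
# ['how to reset your password', 'password policy for admins']
```

The `candidates` cap is the cost control: scorer calls grow linearly with it, so it is the first number to revisit when the reranker dominates your latency.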

## Measuring semantic search: the eval you can't skip

You cannot tune what you don't measure, and "it feels better" is how teams ship regressions. The discipline is a golden set: a few hundred real queries paired with their known-relevant documents, assembled from query logs and human judgement. Against that set you compute recall@k (did the right document make the candidate set), nDCG (is it ranked near the top), and MRR (how high is the first relevant hit). Every change to the pipeline is then a measured before-and-after, not a vibe.

Building the golden set is the part teams want to skip and the part that pays for itself fastest. Start from real query logs rather than invented queries, because the questions users actually type are weirder, shorter, and more typo-ridden than anything you'd write at a desk. For each query, have a human mark which documents genuinely answer it; a hundred well-judged pairs beat a thousand sloppy ones. Keep the set version-controlled and grow it whenever a production miss surfaces a query class you weren't testing.

The reason to separate recall from ranking metrics is that they fail differently: a low recall@10 means the right document never even reached the candidate set, which is a retrieval problem you fix by tuning the index or adding hybrid; a healthy recall but weak nDCG means the document is there but buried, which is a ranking problem you fix with reranking. Reading the two numbers together tells you which lever to pull, so you stop guessing and start fixing the actual gap.

Read that matrix as the whole argument of this guide in four numbers. On this corpus in 2026-Q2, BM25 alone hit 61% recall@10 and dense alone hit 74%, so semantic search was a real improvement over keywords. But the production-grade configuration, hybrid fusion followed by a Cohere Rerank pass, reached 88% recall@10. The jump from 74% to 88% is the difference between a demo and a system, and it comes entirely from the parts the explainers omit.
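
The three metrics named in this section fit in a few lines each. This sketch assumes binary relevance (a document is either relevant or not), and the `golden_set` shape, a mapping from query to a set of relevant ids, is our assumption about how you store judgements.

```python
import math

def recall_at_k(relevant, ranked, k):
    """Share of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(relevant, ranked):
    """1 / position of the first relevant hit, or 0 if there is none."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0

def ndcg_at_k(relevant, ranked, k):
    """Binary-relevance nDCG: discounted gain relative to an ideal ordering."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(pos + 1)
                for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

def evaluate(golden_set, search_fn, k=10):
    """Mean recall@k, MRR and nDCG@k over a golden set.

    golden_set: {query: set of relevant ids}; search_fn(query) -> ranked ids.
    """
    totals = {"recall": 0.0, "mrr": 0.0, "ndcg": 0.0}
    for query, relevant in golden_set.items():
        ranked = search_fn(query)
        totals["recall"] += recall_at_k(relevant, ranked, k)
        totals["mrr"] += reciprocal_rank(relevant, ranked)
        totals["ndcg"] += ndcg_at_k(relevant, ranked, k)
    return {name: value / len(golden_set) for name, value in totals.items()}
```

Run `evaluate` once per pipeline configuration against the same golden set, and the before-and-after comparison becomes a comparison of three numbers.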

## Cost and latency: what hybrid plus rerank actually adds

Quality has a bill. Each stage you add to the pipeline costs latency and, if hosted, money per query. The point of measuring it is not to avoid the cost but to spend it where it earns its keep. A help centre serving humans can absorb a reranker's extra latency; an autocomplete suggesting as the user types cannot. The same system can even branch on the query: cheap path for the common case, full hybrid-plus-rerank only when the first pass looks weak. Here is the rough shape of the per-query latency budget on a mid-sized corpus, so you can see what each stage adds before you wire it in.

Lexical and dense retrieval run in parallel, so hybrid costs roughly the slower of the two plus a cheap fusion step, not the sum. The reranker is the expensive stage by a wide margin, which is why you cap its candidate set hard. The same cost-versus-correctness reasoning shows up wherever retrieval gates a real decision; our piece on [fraud detection at the auth boundary](/blog/ai-fraud-detection-at-auth-boundary/) walks through making that latency-budget call when a few extra milliseconds is the difference between blocking a fraudulent login and annoying a real customer.
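
Running the two retrievers concurrently is what makes hybrid cost the slower of the two rather than the sum. A minimal sketch with `asyncio`, assuming both retrievers are exposed as async calls; the stub functions, their sleep times, and the simple de-duplicating merge are placeholders for real clients and a real fusion step.

```python
import asyncio

async def hybrid_retrieve(query, lexical_fn, dense_fn, fuse_fn):
    """Issue both retrievals at once, then fuse when both have returned."""
    lexical, dense = await asyncio.gather(lexical_fn(query), dense_fn(query))
    return fuse_fn(lexical, dense)

# Stubs for demonstration; the sleeps imitate network latency.
async def fake_lexical(query):
    await asyncio.sleep(0.02)
    return ["sku-4032", "reset-guide"]

async def fake_dense(query):
    await asyncio.sleep(0.03)
    return ["reset-guide", "login-help"]

def merge_unique(lexical, dense):
    """Placeholder fusion: concatenate and drop duplicates, keeping order."""
    seen, merged = set(), []
    for doc_id in lexical + dense:
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

print(asyncio.run(hybrid_retrieve("reset", fake_lexical, fake_dense, merge_unique)))
# ['sku-4032', 'reset-guide', 'login-help']
```

With these stubs the call waits about as long as the slower retriever rather than for both sleeps added together.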

## When semantic search is the wrong tool

The operator-honest position is that semantic search is sometimes the wrong answer, and a good consultant says so before you've built the index. If your queries are overwhelmingly exact identifiers, a well-tuned BM25 config will beat dense vectors and cost a fraction as much to run. If your corpus is tiny and rarely changes, the engineering overhead of embeddings and an ANN index may never pay back, and a plain database query with a few synonyms hand-coded in will outlast it. Match the tool to the query distribution, not to the hype, and revisit the choice only when the distribution itself shifts.
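
One rough way to check the query distribution before building anything is to estimate what share of logged queries look like exact identifiers. The pattern below is a heuristic of ours, not a standard: it flags any token of three or more characters that contains a digit and is made up only of letters, digits, dots, underscores and hyphens. It will misfire on some queries, so read the result as a signal, not a measurement.

```python
import re

# Heuristic only: a code-like token such as ERR_4032, SKU-99812 or v2.3.1.
IDENTIFIER_TOKEN = re.compile(r"^(?=.*\d)[A-Za-z0-9._-]{3,}$")

def looks_like_identifier_query(query):
    """True if any whitespace-separated token looks like an exact identifier."""
    return any(IDENTIFIER_TOKEN.match(token) for token in query.split())

def identifier_share(queries):
    """Fraction of queries that contain an identifier-like token."""
    if not queries:
        return 0.0
    return sum(looks_like_identifier_query(q) for q in queries) / len(queries)

sample_log = [
    "ERR_4032",
    "how do I reset my password",
    "SKU-99812 price",
    "headache relief",
]
print(identifier_share(sample_log))  # 0.5
```

A high share points toward BM25 first; a low share says the semantic half of the pipeline is where the gains are.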

## Production failure modes we keep getting paged for

The demo passes and then production teaches you the failure modes the tutorials never mention. These are the ones we've actually been paged for, and the fixes are usually boring rather than clever.

## Frequently asked questions about semantic search

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

