# Embedding models: how to pick one for RAG

> A practitioner's guide to choosing embedding models for RAG: dimensions, the MTEB trap, domain fit, multilingual, cost vs latency, open-source vs API — with defaults.

**HTML version:** https://www.paiteq.com/blog/embedding-models/
**Published:** 2026-06-20T12:25:09.180Z
**Author:** Navin Sharma, Founder · AI Engineering Lead
**Reading time:** ~13 min


---

Every RAG system has a model nobody talks about until retrieval starts returning the wrong chunks. It isn't the large language model writing the answers. It's the small one sitting upstream, turning your documents and your user's question into vectors so the two can be compared. **Embedding models** are the part of the stack that decides whether the right passage is even in the candidate set before the LLM ever sees it. Pick the wrong one and no amount of prompt engineering downstream will save you, because the relevant chunk never made it into the context.

The search results for the term are split. Half are vendors explaining what an embedding model is, which you probably already know if you're reading this. The other half are leaderboards. What's missing is the part a practitioner actually needs: how to pick one, why the top of the leaderboard is usually the wrong answer, and which models we'd reach for in 2026 by default. We build RAG systems for a living, we don't sell an embedding model, and we've shipped the boring default more often than the clever pick. This is the selection guide we wish the first two pages of Google gave us.

## Embedding models in one paragraph, and the only question that matters for RAG

An embedding model is a neural network that maps a piece of text to a fixed-length list of numbers, a vector, positioned so that texts with similar meaning land near each other. In a RAG pipeline you run it twice: once over every chunk of your corpus at index time, and once over each incoming query at search time. Retrieval is then just finding the corpus vectors closest to the query vector, usually by cosine similarity. The large language model never searches your documents directly; it only ever sees the chunks the embedding model's geometry surfaced. That's why this choice is load-bearing in a way that doesn't show up on any architecture diagram until recall is bad.

So the only question that matters isn't "which model scores highest in general." It's "which model puts the chunk that answers my user's question into the top results, on my corpus, at a cost and latency I can live with." Everything in this guide is in service of answering that one question, and most of the popular advice (read the leaderboard, pick the biggest) actively works against it.
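
To make the "closest vectors" step concrete, here is a minimal sketch of cosine-similarity retrieval in plain Python. The three-dimensional vectors and chunk ids are made up for illustration; a real system would get its vectors from an embedding model and would usually search through a vector index instead of a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus_vecs, k=3):
    """Return the ids of the k corpus vectors closest to the query, best first."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in corpus_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy vectors standing in for real embeddings (hypothetical values).
corpus = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-auth": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.1]
print(top_k(query, corpus, k=2))
```

Only the chunks returned by `top_k` would reach the LLM, which is the whole reason the embedder's geometry matters.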

## The selection procedure: how to actually choose an embedding model

Choosing an embedding model is a procedure, not a lookup. The mistake most teams make is starting at the leaderboard and ending at the model with the highest number. The procedure that actually works runs the other direction: start from your corpus and your constraints, build a tiny shortlist, then let a measurement on your own data break the tie. It takes an afternoon and it routinely overturns the leaderboard's verdict.

Step one is to characterize the corpus: what language is it in, how technical is the vocabulary, how long are the chunks, and is it general prose or a specialist domain like code or contracts. Step two is to write down the hard constraints: does the data have to stay inside your network, what's the budget per million tokens, what query latency can you tolerate. Those two steps usually cut the field to a shortlist of two or three before you've run anything. Step three is the part everyone skips, and the one that decides it: *measure the shortlist on your own corpus*. This is the same discipline we push in any [AI readiness assessment](/blog/ai-readiness-assessment/): measure on your data before you commit to infrastructure. The default you ship should be the smallest, cheapest model that clears your recall bar, not the highest-ranked one.
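
The last step of that procedure reduces to a small filter. The sketch below is one way to express it, under assumptions: the `Candidate` fields, the model names, and every number are hypothetical placeholders, and price per million tokens stands in for "smallest, cheapest".

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    usd_per_million_tokens: float
    self_hostable: bool
    recall_at_k: float  # measured on your own eval set, not taken from a leaderboard

def pick_default(candidates, recall_bar, must_self_host=False, max_usd_per_million=None):
    """Return the cheapest candidate that meets the hard constraints and clears the recall bar."""
    eligible = [
        c for c in candidates
        if c.recall_at_k >= recall_bar
        and (c.self_hostable or not must_self_host)
        and (max_usd_per_million is None or c.usd_per_million_tokens <= max_usd_per_million)
    ]
    if not eligible:
        return None  # nothing clears the bar: widen the shortlist or revisit the constraints
    return min(eligible, key=lambda c: c.usd_per_million_tokens)

# Hypothetical shortlist with made-up prices and recall numbers.
shortlist = [
    Candidate("model-a", 0.13, False, 0.91),
    Candidate("model-b", 0.02, False, 0.89),
    Candidate("model-c", 0.05, True, 0.84),
]
choice = pick_default(shortlist, recall_bar=0.85)
print(choice.name if choice else "no candidate clears the bar")
```

Note that the highest-recall candidate is not the one selected here: once the bar is cleared, cost decides.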

## The MTEB trap: why the top of the leaderboard is usually the wrong pick

The Massive Text Embedding Benchmark, MTEB, is the leaderboard everyone reaches for, and it's a genuinely useful artifact — a broad, public, multi-task scoreboard that lets you compare hundreds of models on the same footing. The trap isn't the benchmark. The trap is treating leaderboard rank as a buying recommendation. The model at the top is the model that scored best across MTEB's particular mix of tasks, which is not the same as the model that will retrieve best on your support tickets or your legal corpus.


Use MTEB the way it's meant to be used: to assemble a shortlist and to filter out the genuinely weak models. As of the 2026-Q2 leaderboard the top general retrieval models sit within roughly 2 nDCG points of each other, which is exactly why a two-point leaderboard gap rarely justifies a model switch. Look specifically at the retrieval sub-scores, not the headline average, and look at them alongside the model's parameter count and dimension size. A model ranked eighth on the average can sit second on retrieval while being a quarter of the size — and on your corpus it can beat the top model outright. The leaderboard narrows the field. It does not pick the winner. As a piece of *embedding model comparison*, MTEB is a starting filter, never the verdict.
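
As a rough illustration of using the leaderboard as a filter, the sketch below keeps models whose retrieval sub-score is within a margin of the best and orders them by size. The rows are invented placeholder values, not real MTEB results, and the field names are assumptions about how you might export the table.

```python
def shortlist_from_leaderboard(rows, margin=2.0, max_size=3):
    """Keep models whose retrieval score is within `margin` points of the best, smallest first."""
    best = max(row["retrieval"] for row in rows)
    close = [row for row in rows if best - row["retrieval"] <= margin]
    close.sort(key=lambda row: row["params_m"])
    return [row["name"] for row in close[:max_size]]

# Hypothetical leaderboard export: scores and sizes are placeholders.
rows = [
    {"name": "big-1", "average": 70.0, "retrieval": 62.0, "params_m": 7000},
    {"name": "mid-2", "average": 68.5, "retrieval": 61.5, "params_m": 560},
    {"name": "small-3", "average": 66.0, "retrieval": 60.5, "params_m": 140},
    {"name": "weak-4", "average": 60.0, "retrieval": 52.0, "params_m": 110},
]
print(shortlist_from_leaderboard(rows))
```

The output is a shortlist to measure, not a ranking to trust: the order only reflects size, and the eval on your own corpus still decides.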

## Build a 30-minute eval on your own corpus (the step everyone skips)

Here's the entire argument of this guide in one move: before you commit, measure two candidate models on a held-out set of real questions from your own corpus. You need maybe 30 to 50 question-and-known-answer pairs: questions a user would actually ask, each tagged with the chunk that should answer it. Embed your corpus with each candidate, run the questions, and measure recall@k: of the questions, how often did the correct chunk land in the top k results. That's it. The model with higher recall@k on your data wins, full stop, regardless of where it sits on any leaderboard.

Two practical notes. First, watch the prompt prefixes: some open models expect an instruction prefix (E5 uses "query: " and "passage: ", and several BGE models expect a retrieval instruction prepended to the query), and forgetting it quietly tanks recall. Check the model card for the exact string. That one detail explains a surprising share of "this open model is terrible" complaints on Reddit. Second, build the QA set once and reuse it forever; it becomes your regression test every time a new model ships and you're tempted to migrate. We treat this eval set as a durable asset, same as a unit test.
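
The eval itself is only a few lines. Below is a minimal sketch of recall@k that takes the query and passage embedders as separate callables, so a prefix can be applied on one side without touching the other. The `toy_embed` function is a bag-of-words stand-in we made up so the example runs without a model; swap in real model calls for each candidate.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recall_at_k(qa_pairs, chunks, embed_query, embed_passage, k=5):
    """Fraction of questions whose gold chunk id appears in the top-k results."""
    index = {chunk_id: embed_passage(text) for chunk_id, text in chunks.items()}
    hits = 0
    for question, gold_id in qa_pairs:
        q_vec = embed_query(question)
        ranked = sorted(index, key=lambda cid: cosine(q_vec, index[cid]), reverse=True)
        hits += gold_id in ranked[:k]
    return hits / len(qa_pairs)

def toy_embed(text):
    """Stand-in embedder (word counts). Replace with a real embedding call."""
    return dict(Counter(text.lower().split()))

# Hypothetical corpus and QA set; yours would hold 30 to 50 real questions.
chunks = {
    "c1": "refunds are issued within 14 days of purchase",
    "c2": "shipping takes three to five business days",
    "c3": "api keys rotate every 90 days",
}
qa_pairs = [
    ("how long do refunds take", "c1"),
    ("when do api keys rotate", "c3"),
    ("how many business days for shipping", "c2"),
]
# For a prefix-expecting model, wrap the calls, e.g.
#   embed_query=lambda q: model_embed("query: " + q)   # model_embed is hypothetical
print(recall_at_k(qa_pairs, chunks, toy_embed, toy_embed, k=1))
```

Run the same `qa_pairs` and `chunks` through each candidate's embedder and compare the numbers; keeping the harness fixed is what makes it a regression test.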

## Dimensions, context length, and the cost they hide

The dimension count is the length of the vector each model produces, and it's where the most expensive instinct lives: bigger must be better. It isn't, past a point. Every extra dimension is more bytes per vector in your index, more memory at query time, and more arithmetic per similarity comparison. Double the dimensions and you've roughly doubled storage and search cost for a recall gain that flattens fast. A 1024-dimension model frequently lands within a point or two of a 3072-dimension one on real retrieval while costing about a third of the storage and latency. Measure it on your corpus before you pay for the bigger vector.

Two newer levers change the math. Matryoshka Representation Learning trains a single model whose vectors stay useful when you truncate them, so models like OpenAI's text-embedding-3 (truncatable to 256, 512, or 1024 dimensions) and Nomic Embed v2 (trained for 256 and 768) let you trade a sliver of recall for a large cut in index cost, without re-training or switching models. Context length is the other axis: most embedding models cap at 512 tokens, which is fine if your chunks are small, but if you're embedding long passages you want a long-context model like Jina Embeddings v3 (8192 tokens) or BGE-M3 so the tail of the chunk isn't silently truncated before it's ever encoded.
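
The storage side of this is simple arithmetic, and truncation is a slice plus a rescale. The sketch below assumes a flat float32 index (4 bytes per value, ignoring graph and metadata overhead) and a hypothetical corpus of 10 million chunks. The truncation helper only makes sense for models trained with Matryoshka-style objectives; slicing an ordinary model's vectors this way is not expected to preserve quality.

```python
import math

def index_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage for a flat float32 index, without index overhead."""
    return num_vectors * dims * bytes_per_value

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` values and rescale to unit length for cosine search."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

num_chunks = 10_000_000  # assumed corpus size
for dims in (3072, 1024, 256):
    gigabytes = index_bytes(num_chunks, dims) / 1e9
    print(f"{dims} dims: {gigabytes:.2f} GB")
```

At these assumed sizes, going from 3072 to 1024 dimensions cuts raw vector storage to a third, which is the ratio the paragraph above describes.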

## Domain fit: when the general-purpose default breaks

General-purpose embedding models are trained mostly on general web text, so they're strongest on general prose and weakest exactly where the vocabulary diverges from it. The failure is quiet: retrieval just gets a little worse, the LLM hallucinates a little more to fill the gap, and nobody connects it back to the embedder. Knowing which domains break the default, and which ones don't despite the hype, is most of the value of a selection guide.

The honest version of this section is that for plain English business documents (support content, internal wikis, product docs) the general default is genuinely fine, and reaching for a specialist is premature optimization. The domains that reliably break it are code, dense scientific text, and anything multilingual. When the chunks you retrieve are about to feed an agent that takes actions, the cost of a wrong chunk compounds, which is why we treat retrieval quality as a first-class concern in any [multi-agent orchestration](/blog/multi-agent-orchestration-patterns/) design: the agent is only as grounded as the embedding model that fed it.

## Multilingual retrieval: a different decision entirely

Multilingual RAG isn't the English decision with more languages bolted on; it's a different problem. There are two distinct requirements people conflate. The first is per-language quality: does the model embed Spanish documents well enough to retrieve Spanish answers to Spanish questions. The second, harder one is cross-lingual alignment: can a query in English retrieve a relevant document in German, because the two land near each other in the same vector space. A model can be strong on the first and weak on the second, and most general English-first models are exactly that.

If you're building multilingual retrieval, the shortlist narrows fast: BGE-M3 is the strong open-source default with genuine cross-lingual alignment across 100-plus languages, Cohere embed-v4 is the hosted option built for multilingual enterprise search, and Jina v3 is a capable open alternative. Read MMTEB, the multilingual extension of MTEB, rather than the English board, and then, as always, confirm on your own corpus, because language coverage on a model card is a promise, not a measurement of your recall.
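
One way to keep the two requirements separate in your own eval is to report recall per language pair instead of one blended number. The sketch below assumes you have already run retrieval and recorded, for each question, the query language, the gold document's language, and whether the gold chunk landed in the top k; the sample results are invented.

```python
from collections import defaultdict

def recall_by_language_pair(results):
    """results: iterable of (query_lang, doc_lang, hit), hit=True if the gold chunk was in the top k."""
    totals = defaultdict(lambda: [0, 0])  # (query_lang, doc_lang) -> [hits, count]
    for query_lang, doc_lang, hit in results:
        bucket = totals[(query_lang, doc_lang)]
        bucket[0] += int(hit)
        bucket[1] += 1
    return {pair: hits / count for pair, (hits, count) in totals.items()}

# Hypothetical eval results.
results = [
    ("es", "es", True),
    ("es", "es", True),
    ("en", "de", True),
    ("en", "de", False),
]
for pair, recall in recall_by_language_pair(results).items():
    print(pair, recall)
```

Same-language pairs measure per-language quality; mixed pairs measure cross-lingual alignment. A model that looks fine on the blended average can still be weak on the mixed rows.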

## Open-source vs API embedding models: the real trade

The open-versus-hosted choice gets argued as ideology and decided by three boring variables: where your data is allowed to go, what it costs at your volume, and who owns the upgrade treadmill. Strip the ideology out and it's a clean trade.

Our bias is explicit: start with the API model, move to self-hosted when a real constraint pushes you there. The constraints that justify the move are data residency or compliance, per-token cost that's become material at scale, a need for sub-10ms embedding latency the network hop can't give you, or a specialist domain where the best model happens to be open. "We prefer open source" is a fine value but a poor reason to take on a GPU fleet before you've outgrown a $0.02-per-million-tokens API call. When you do go open, the best *open source embedding models* are good enough that you're trading convenience for control, not quality for control.
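
That bias can be written down as a checklist. The sketch below encodes the four constraints named above; the parameter names and the "material cost" threshold are our own illustrative assumptions, and the threshold in particular is a judgment call for your finance team, not a constant.

```python
def reasons_to_self_host(data_must_stay_in_network, monthly_api_usd, material_cost_usd,
                         required_latency_ms, best_model_is_open):
    """Return the constraints that justify self-hosting; an empty list means stay on the API."""
    reasons = []
    if data_must_stay_in_network:
        reasons.append("data residency or compliance")
    if monthly_api_usd >= material_cost_usd:
        reasons.append("API cost is material at this volume")
    if required_latency_ms < 10:
        reasons.append("sub-10ms embedding latency")
    if best_model_is_open:
        reasons.append("best model for the domain is open")
    return reasons

# Hypothetical workload: modest spend, relaxed latency, no residency rule.
print(reasons_to_self_host(False, monthly_api_usd=40, material_cost_usd=5000,
                           required_latency_ms=150, best_model_is_open=False))
```

An empty list is the common case, and it means the hosted default stands.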

## The current roster: embedding models worth shortlisting in 2026

This is where the stale guides hurt the most. The two selection pages ranking for this term still anchor on OpenAI's Ada 002, a model that's been superseded for years. Any honest list of the best embedding models in 2026 looks nothing like a 2023 one. Here's the roster we'd actually shortlist from, grouped by where it fits, not ranked, because the rank depends on your corpus. One caveat before the list: these models get stored and searched in a vector index (Pinecone, Qdrant, or Weaviate are the common choices), and the model's dimension count drives that index's cost more than the model's API price does.

## Cost and latency: the numbers that decide it at scale

At prototype scale, embedding cost is a rounding error and you should ignore it. At production scale it becomes a real line item, and it's worth understanding which lever moves it. Two costs matter: the one-time cost to embed your corpus at index time, and the recurring cost to embed every query. The corpus cost scales with how much text you have; the query cost scales with traffic. As of 2026, hosted APIs sit roughly in the $0.02 to $0.13 per million tokens range: text-embedding-3-small near $0.02, text-embedding-3-large near $0.13, Cohere embed-v4 in the same low-double-digit-cents band. Verify the current number against each provider's pricing page before you quote it; these move.

Latency splits the same way. A hosted API adds a network round trip of tens of milliseconds, fine for most retrieval, painful only when you're embedding queries in a tight interactive loop. Self-hosting on a co-located GPU gets you single-digit-millisecond embedding but costs you the serving stack. Treat these figures as the pattern, not your invoice: model size and dimension are the two dials, and "smallest model + smallest dimension that clears your recall bar" is the cheapest point on the curve every time. If you want the broader cost framing for an AI build, our [AI automation buyer's guide](/blog/ai-automation-solutions-buyers-guide/) walks the same measure-then-commit discipline across a whole project, not just the embedder.
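
The two costs are easy to estimate before you commit. The sketch below uses the approximate per-million-token prices quoted above; the corpus size, chunk length, and query volume are assumptions chosen only to show the arithmetic, so substitute your own.

```python
def embedding_cost_usd(tokens, usd_per_million_tokens):
    """API cost to embed a given number of tokens at a per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens

# Assumed workload: 5M chunks of ~400 tokens, 3M queries a month of ~30 tokens.
corpus_tokens = 5_000_000 * 400
monthly_query_tokens = 3_000_000 * 30

# Approximate prices from the text; check the provider's pricing page.
for name, price in [("text-embedding-3-small", 0.02), ("text-embedding-3-large", 0.13)]:
    one_time = embedding_cost_usd(corpus_tokens, price)
    monthly = embedding_cost_usd(monthly_query_tokens, price)
    print(f"{name}: index ${one_time:,.2f} once, queries ${monthly:,.2f}/month")
```

Under these assumptions the small model costs about $40 to index and under $2 a month in queries, against about $260 and about $12 for the large one. Either API bill is small, which is why vector storage, driven by dimension count, tends to be the larger line item.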

## The re-embedding tax: why your choice is a migration cost later

Here's the input nobody puts on the decision sheet, and the one that should make you measure twice: switching embedding models is not a config change. Vectors from one model live in a different geometric space than vectors from another, so they're not comparable. The day you decide a better model has shipped, you have to re-embed your entire corpus with the new model and rebuild the vector index from scratch. For a small corpus that's an afternoon. For tens of millions of chunks it's a real project: compute budget, a re-index window, and a careful cutover so search doesn't degrade mid-migration.

Two design moves soften the tax. Keep your raw chunks and your embedding pipeline reproducible, so re-embedding is a re-run, not an archaeology project. And version your index, so you can build the new one alongside the old and cut over atomically. If you've done both, a model migration is a planned operation instead of an outage. If you've done neither, the re-embedding tax is the reason teams stay on a worse model for years.
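
The versioning idea can be sketched with an in-memory stand-in. `VersionedIndex` below is a hypothetical class, not any vector database's API; in practice you would map `build` to creating a new collection or namespace and `cut_over` to whatever alias or pointer mechanism your store provides.

```python
class VersionedIndex:
    """In-memory stand-in for a vector store with named index versions and one live pointer."""

    def __init__(self):
        self._versions = {}
        self._live = None

    def build(self, version, chunks, embed):
        """Embed every raw chunk with the given model and store the result under `version`."""
        self._versions[version] = {cid: embed(text) for cid, text in chunks.items()}

    def cut_over(self, version):
        """Point live traffic at a fully built version in a single assignment."""
        if version not in self._versions:
            raise KeyError(f"index version {version!r} has not been built")
        self._live = version

    def live_vectors(self):
        """Vectors currently served to queries."""
        return self._versions[self._live]

# Raw chunks are kept, so re-embedding is a re-run. Embedders here are toy placeholders.
chunks = {"c1": "refund policy", "c2": "shipping times"}
old_embed = lambda text: [float(len(text))]
new_embed = lambda text: [float(len(text)), float(text.count(" "))]

store = VersionedIndex()
store.build("v1", chunks, old_embed)
store.cut_over("v1")
store.build("v2", chunks, new_embed)  # v1 keeps serving while v2 is built
store.cut_over("v2")
print(len(store.live_vectors()["c1"]))
```

Queries must be embedded with the same model as the live version, so the cutover has to switch the query-side embedder in the same step.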

## Our defaults: what we reach for, and when we don't

After all the nuance, you still want a default, so here are ours. These aren't the best models in the abstract; they're the models we reach for first because they clear the bar for the common case with the least cost and risk, and we deviate only when an eval on the customer's corpus tells us to. Read down to the row that matches your workload.

## FAQ: embedding model questions, answered straight

---

## About Paiteq

Enterprise AI engineering — production agents, RAG, LLM apps, automation, generative AI. Eval-first, senior-led, fixed-scope engagements.

- **Site index for agents:** https://www.paiteq.com/llms.txt
- **Full content for agents:** https://www.paiteq.com/llms-full.txt
- **Book a call:** https://www.paiteq.com/contact/