Embedding models: how to pick one for RAG

Every RAG system has a model nobody talks about until retrieval starts returning the wrong chunks. It isn't the large language model writing the answers. It's the small one sitting upstream, turning your documents and your user's question into vectors so the two can be compared. Embedding models are the part of the stack that decides whether the right passage is even in the candidate set before the LLM ever sees it. Pick the wrong one and no amount of prompt engineering downstream will save you, because the relevant chunk never made it into the context.

The search results for the term are split. Half are vendors explaining what an embedding model is, which you probably already know if you're reading this. The other half are leaderboards. What's missing is the part a practitioner actually needs: how to pick one, why the top of the leaderboard is usually the wrong answer, and which models we'd reach for in 2026 by default. We build RAG systems for a living, we don't sell an embedding model, and we've shipped the boring default more often than the clever pick. This is the selection guide we wish the first two pages of Google gave us.

Embedding models in one paragraph, and the only question that matters for RAG

An embedding model is a neural network that maps a piece of text to a fixed-length list of numbers, a vector, positioned so that texts with similar meaning land near each other. In a RAG pipeline you run it twice: once over every chunk of your corpus at index time, and once over each incoming query at search time. Retrieval is then just finding the corpus vectors closest to the query vector, usually by cosine similarity. The large language model never searches your documents directly; it only ever sees the chunks the embedding model's geometry surfaced. That's why this choice is load-bearing in a way that doesn't show up on any architecture diagram until recall is bad.
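To make "closest by cosine similarity" concrete, here is a minimal sketch of the retrieval step. The three-dimensional vectors are toy stand-ins for real embeddings, and `top_k_by_cosine` is our own illustrative name, not a library call.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, k=3):
    """Return (indices, scores) of the k corpus vectors closest to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity, since both sides are unit-norm
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order.tolist(), sims[order].tolist()

# Toy vectors (illustrative only): chunks 0 and 2 point roughly the same way as the query.
doc_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
])
query_vec = np.array([1.0, 0.05, 0.0])

indices, scores = top_k_by_cosine(query_vec, doc_vecs, k=2)
```

Whatever lands in `indices` is all the LLM will ever see, which is the whole reason the embedder's geometry matters.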

So the only question that matters isn't "which model scores highest in general." It's "which model puts the chunk that answers my user's question into the top results, on my corpus, at a cost and latency I can live with." Everything in this guide is in service of answering that one question, and most of the popular advice — read the leaderboard, pick the biggest — actively works against it.

The selection procedure: how to actually choose an embedding model

Choosing an embedding model is a procedure, not a lookup. The mistake most teams make is starting at the leaderboard and ending at the model with the highest number. The procedure that actually works runs the other direction: start from your corpus and your constraints, build a tiny shortlist, then let a measurement on your own data break the tie. It takes an afternoon and it routinely overturns the leaderboard's verdict.

[Figure: The selection procedure, start to finish. Your corpus (domain + languages) → Constraints (data residency / cost) → Shortlist (2-3 candidates) → Eval on your data (recall@k) → Cost + latency (at your scale) → Default (smallest that clears the bar).]

Step one is to characterize the corpus: what language is it in, how technical is the vocabulary, how long are the chunks, and is it general prose or a specialist domain like code or contracts. Step two is to write down the hard constraints — does the data have to stay inside your network, what's the budget per million tokens, what query latency can you tolerate. Those two steps usually cut the field to a shortlist of two or three before you've run anything. Step three is the part everyone skips, and the one that decides it: measure the shortlist on your own corpus. This is the same discipline we push in any AI readiness assessment — measure on your data before you commit to infrastructure. The default you ship should be the smallest, cheapest model that clears your recall bar, not the highest-ranked one.
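The last step of that procedure, "smallest, cheapest model that clears the bar," fits in a few lines. This is a sketch under our own assumptions: the `Candidate` fields, the model names, and every number below are hypothetical placeholders, and the recall values stand in for what you would measure on your own QA set.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    recall_at_5: float         # measured on YOUR held-out questions, not a leaderboard
    cost_per_1m_tokens: float  # USD, from the provider's pricing page or your GPU bill
    dims: int

def pick_default(candidates, recall_bar):
    """Cheapest candidate that clears the recall bar; ties broken by smaller dimension."""
    passing = [c for c in candidates if c.recall_at_5 >= recall_bar]
    if not passing:
        return None  # nothing clears the bar: revisit chunking or widen the shortlist
    return min(passing, key=lambda c: (c.cost_per_1m_tokens, c.dims))

# Hypothetical shortlist with made-up numbers, purely to show the decision rule.
shortlist = [
    Candidate("model-tiny", 0.71, 0.01, 384),
    Candidate("model-small", 0.88, 0.02, 1024),
    Candidate("model-large", 0.90, 0.13, 3072),
]
choice = pick_default(shortlist, recall_bar=0.85)
```

Here the large model scores higher but the small one clears the bar at a fraction of the cost, so the small one ships.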

The MTEB trap: why the top of the leaderboard is usually the wrong pick

The Massive Text Embedding Benchmark, MTEB, is the leaderboard everyone reaches for, and it's a genuinely useful artifact — a broad, public, multi-task scoreboard that lets you compare hundreds of models on the same footing. The trap isn't the benchmark. The trap is treating leaderboard rank as a buying recommendation. The model at the top is the model that scored best across MTEB's particular mix of tasks, which is not the same as the model that will retrieve best on your support tickets or your legal corpus.

Use MTEB the way it's meant to be used: to assemble a shortlist and to filter out the genuinely weak models. As of the 2026-Q2 leaderboard the top general retrieval models sit within roughly 2 nDCG points of each other, which is exactly why a two-point leaderboard gap rarely justifies a model switch. Look specifically at the retrieval sub-scores, not the headline average, and look at them alongside the model's parameter count and dimension size. A model ranked eighth on the average can sit second on retrieval while being a quarter of the size, and on your corpus it can beat the top model outright. The leaderboard narrows the field. It does not pick the winner.

Build a 30-minute eval on your own corpus (the step everyone skips)

Here's the entire argument of this guide in one move: before you commit, measure two candidate models on a held-out set of real questions from your own corpus. You need maybe 30 to 50 question-and-known-answer pairs — questions a user would actually ask, each tagged with the chunk that should answer it. Embed your corpus with each candidate, run the questions, and measure recall@k: of the questions, how often did the correct chunk land in the top k results. That's it. The model with higher recall@k on your data wins, full stop, regardless of where it sits on any leaderboard.

embed_eval.py python

# Measure recall@k for two embedding models on YOUR held-out QA set.
# Each item: a question + the id of the chunk that should answer it.
# This is the 30-minute eval that beats any leaderboard.
import numpy as np

def recall_at_k(model, corpus, qa_pairs, k=5):
    # corpus: {chunk_id: text}; qa_pairs: [(question, correct_chunk_id)]
    ids = list(corpus)
    doc_vecs = model.embed([corpus[i] for i in ids])     # index-time pass
    doc_vecs = _normalize(np.array(doc_vecs))
    hits = 0
    for question, gold_id in qa_pairs:
        q = _normalize(np.array(model.embed([question])))  # query-time pass
        sims = doc_vecs @ q[0]                              # cosine, vecs are unit-norm
        top_k = [ids[i] for i in np.argsort(-sims)[:k]]
        hits += gold_id in top_k
    return hits / len(qa_pairs)

def _normalize(v):
    return v / np.clip(np.linalg.norm(v, axis=-1, keepdims=True), 1e-9, None)

# Decide on the number, not the leaderboard:
# r_small = recall_at_k(text_embedding_3_small, corpus, qa, k=5)
# r_bge   = recall_at_k(bge_m3,                 corpus, qa, k=5)
# Ship the smaller model unless the bigger one clears a margin you can measure.

Two practical notes. First, watch the input prefixes, because some open models are trained with them. E5 wants "query: " or "passage: " in front of every input, and older BGE releases want a retrieval instruction on the query side, while BGE-M3 needs none. Forgetting the prefix a model was trained with quietly tanks recall, and that one detail explains a surprising share of "this open model is terrible" complaints on Reddit. Check each model card rather than assuming. Second, build the QA set once and reuse it forever; it becomes your regression test every time a new model ships and you're tempted to migrate. We treat this eval set as a durable asset, same as a unit test.
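One way to make the prefix hard to forget is to route every input through a small helper. The sketch below assumes E5-style "query: " / "passage: " prefixes; `add_prefix` and `E5_PREFIXES` are our own names, and other models need a different mapping (or none), so check the model card first.

```python
# E5-style prefixes; swap in whatever the model card of your candidate specifies.
E5_PREFIXES = {"query": "query: ", "passage": "passage: "}

def add_prefix(texts, role, prefixes=E5_PREFIXES):
    """Prepend the role prefix to each text, skipping texts that already carry it."""
    if role not in prefixes:
        raise ValueError(f"unknown role: {role!r}")
    prefix = prefixes[role]
    return [t if t.startswith(prefix) else prefix + t for t in texts]

queries = add_prefix(["How do I rotate an API key?"], role="query")
passages = add_prefix(["Rotate keys from the Security tab."], role="passage")
```

Calling it at both index time and query time keeps the two passes symmetric, which is the property the eval depends on.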

Dimensions, context length, and the cost they hide

The dimension count is the length of the vector each model produces, and it's where the most expensive instinct lives: bigger must be better. It isn't, past a point. Every extra dimension is more bytes per vector in your index, more memory at query time, and more arithmetic per similarity comparison. Double the dimensions and you've roughly doubled storage and search cost for a recall gain that flattens fast. A 1024-dimension model frequently lands within a point or two of a 3072-dimension one on real retrieval while costing about a third of the storage and latency. Measure it on your corpus before you pay for the bigger vector.
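The storage side of that claim is plain arithmetic. A rough sketch, assuming uncompressed float32 vectors and ignoring the vector index's own overhead and metadata; the 10-million-chunk corpus is a hypothetical size.

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GB (float32 by default), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

num_chunks = 10_000_000  # hypothetical corpus size
sizes = {dims: index_size_gb(num_chunks, dims) for dims in (256, 1024, 1536, 3072)}
```

For this corpus the 1024-dimension index is about 41 GB against roughly 123 GB at 3072 dimensions, which is where the "about a third" figure comes from.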

Dimension budget vs index cost — and how Matryoshka truncation cuts both

Recall climbs steeply then flattens as dimensions grow, while index size and search latency keep climbing linearly. Matryoshka-trained models (MRL) let you truncate one model to several dimension budgets and pick the knee of the curve instead of paying for the tail.

Two newer levers change the math. Matryoshka Representation Learning trains a single model whose vectors stay useful when you truncate them, so models like OpenAI's text-embedding-3 (shortened through the API's dimensions parameter, for example to 256, 512, or 1024) and Nomic Embed v2 (trained for 256 and 768) let you trade a sliver of recall for a large cut in index cost, without re-training or switching models. Context length is the other axis: many older embedding models cap at 512 tokens, which is fine if your chunks are small, but if you're embedding long passages you want a long-context model like Jina Embeddings v3 (8192 tokens) or BGE-M3 so the tail of the chunk isn't silently truncated before it's ever encoded.
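For a self-hosted Matryoshka-trained model, truncation is something you can do yourself: keep the leading dimensions, then re-normalize so cosine still behaves. A minimal sketch, with `truncate_embeddings` as our own helper name and random vectors standing in for real embeddings. It is only meaningful for models actually trained with MRL; truncating an ordinary model this way usually costs far more recall.

```python
import numpy as np

def truncate_embeddings(vecs, dims):
    """Keep the first `dims` components of each vector and rescale to unit length."""
    vecs = np.asarray(vecs, dtype=np.float32)
    if dims > vecs.shape[-1]:
        raise ValueError("cannot truncate to more dimensions than the vectors have")
    cut = vecs[..., :dims]
    norms = np.linalg.norm(cut, axis=-1, keepdims=True)
    return cut / np.clip(norms, 1e-9, None)

# Random stand-ins for 4 embeddings of width 1536 (illustrative only).
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 1536))
small = truncate_embeddings(full, 256)
```

Feed the truncated vectors to the same recall@k eval at a few dimension budgets and the knee of the curve shows up quickly.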

Domain fit: when the general-purpose default breaks

General-purpose embedding models are trained mostly on general web text, so they're strongest on general prose and weakest exactly where the vocabulary diverges from it. The failure is quiet: retrieval just gets a little worse, the LLM hallucinates a little more to fill the gap, and nobody connects it back to the embedder. Knowing which domains break the default — and which ones don't, despite the hype — is most of the value of a selection guide.

Domain	How the default breaks	Specialist worth testing	Is the default fine?
Code / API docs	Treats identifiers as opaque tokens; misses symbol-level similarity	Voyage voyage-code-2, jina-embeddings-v2-base-code	Often no; test the specialist
Legal / contracts	Long clauses + boilerplate dilute the signal that matters	Voyage AI (law-tuned), BGE-M3 long-context	Test it; chunking matters as much as the model
Biomedical / scientific	Dense jargon and entities the general model never saw	Domain-tuned models on Hugging Face (e.g. PubMed-tuned)	Usually no; the gap is real
Multilingual / non-English	Weak or absent cross-lingual alignment	BGE-M3, Cohere embed-v4, Jina v3	No; see the next section
General business prose	It mostly doesn't break	n/a	Yes; ship text-embedding-3-small or embed-v4

When to reach past the general-purpose default. "Default is fine" means a top general model will clear the bar for most teams; reach for the specialist only when your eval says it doesn't.

The honest version of this section is that for plain English business documents (support content, internal wikis, product docs) the general default is genuinely fine, and reaching for a specialist is premature optimization. The domains that reliably break it are code, dense scientific text, and anything multilingual. When the chunks you retrieve are about to feed an agent that takes actions, the cost of a wrong chunk compounds, which is why we treat retrieval quality as a first-class concern in any multi-agent orchestration design — the agent is only as grounded as the embedding model that fed it.

Multilingual retrieval: a different decision entirely

Multilingual RAG isn't the English decision with more languages bolted on; it's a different problem. There are two distinct requirements people conflate. The first is per-language quality: does the model embed Spanish documents well enough to retrieve Spanish answers to Spanish questions. The second, harder one is cross-lingual alignment: can a query in English retrieve a relevant document in German, because the two land near each other in the same vector space. A model can be strong on the first and weak on the second, and most general English-first models are exactly that.
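The practical consequence is that a single recall number hides the split. A sketch of reporting recall per language pair instead, assuming your eval already produces one record per question; the `(query_lang, doc_lang, hit)` record shape and the sample rows are hypothetical.

```python
from collections import defaultdict

def recall_by_language_pair(results):
    """results: iterable of (query_lang, doc_lang, hit), hit=True if the gold chunk made top k."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for query_lang, doc_lang, hit in results:
        pair = (query_lang, doc_lang)
        totals[pair] += 1
        hits[pair] += bool(hit)
    return {pair: hits[pair] / totals[pair] for pair in totals}

# Made-up eval records: same-language pairs vs a cross-lingual pair.
results = [
    ("es", "es", True),
    ("es", "es", True),
    ("en", "de", True),
    ("en", "de", False),
]
report = recall_by_language_pair(results)
```

A model that looks fine on the blended average can still be failing every pair where the query and document languages differ.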

What to actually check for multilingual retrieval (2026)

Languages: 100+ (BGE-M3), with strong multilingual and cross-lingual coverage.
The hard metric: cross-lingual retrieval (EN query → DE doc), not just per-language quality.
Benchmark to read: MMTEB, the multilingual MTEB. Read it; don't trust English MTEB.
Still eval on your data, per corpus: language coverage on paper ≠ recall on your corpus.

If you're building multilingual retrieval, the shortlist narrows fast: BGE-M3 is the strong open-source default with genuine cross-lingual alignment across 100-plus languages, Cohere embed-v4 is the hosted option built for multilingual enterprise search, and Jina v3 is a capable open-weights alternative (check its license terms before commercial use). Read MMTEB, the multilingual extension of MTEB, rather than the English board. Then, as always, confirm on your own corpus, because language coverage on a model card is a promise, not a measurement of your recall.

Open-source vs API embedding models: the real trade

The open-versus-hosted choice gets argued as ideology and decided by three boring variables: where your data is allowed to go, what it costs at your volume, and who owns the upgrade treadmill. Strip the ideology out and it's a clean trade.

Hosted API model

OpenAI text-embedding-3, Cohere embed-v4, or Voyage AI (acquired by MongoDB in 2025, and the embedding provider Anthropic's documentation recommends for use with Claude). Zero infrastructure, an HTTP call, and you're embedding. Per-token cost that's near-free at small scale. The provider owns quality upgrades. The trade: your text leaves your network on every call, you're priced per token forever, and you inherit their rate limits and outages. The right default for most English RAG, and for any team that shouldn't be running GPUs.

Self-hosted open model

BGE-M3, E5, Nomic Embed v2, Jina v3, or Qwen3-Embedding, served with sentence-transformers or vLLM, on your own GPUs or a platform like Modal. Your data never leaves your network, latency is yours to control, and at high volume the amortized cost beats per-token pricing. The trade: you run the serving infrastructure, you own the GPU bill and the upgrade treadmill, and a top open model is often the 7B-parameter one that's expensive to serve. Right when residency, scale, or a specialist domain forces it, not by default.

Our bias is explicit: start with the API model, move to self-hosted when a real constraint pushes you there. The constraints that justify the move are data residency or compliance, per-token cost that's become material at scale, a need for sub-10ms embedding latency the network hop can't give you, or a specialist domain where the best model happens to be open. "We prefer open source" is a fine value but a poor reason to take on a GPU fleet before you've outgrown a $0.02-per-million-tokens API call. When you do go open, the best open source embedding models are good enough that you're trading convenience for control, not quality for control.
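"Material at scale" can be given a number. A back-of-envelope sketch of the break-even volume, where the $500 monthly GPU cost is a hypothetical figure and the calculation deliberately ignores engineering time and whether one GPU can actually sustain that throughput.

```python
def breakeven_tokens_per_month(gpu_cost_per_month, api_price_per_1m_tokens):
    """Monthly token volume at which a fixed GPU bill equals per-token API spend."""
    return gpu_cost_per_month / api_price_per_1m_tokens * 1_000_000

# Hypothetical GPU bill against the ~$0.02 per 1M tokens API price quoted above.
tokens = breakeven_tokens_per_month(gpu_cost_per_month=500.0, api_price_per_1m_tokens=0.02)
```

Under those assumptions the break-even sits at about 25 billion tokens a month, which is why cost alone rarely justifies the move early on.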

The current roster: embedding models worth shortlisting in 2026

This is where the stale guides hurt the most. The two selection pages ranking for this term still anchor on OpenAI's Ada 002, a model that's been superseded for years. Any honest list of the best embedding models in 2026 looks nothing like a 2023 one. Here's the roster we'd actually shortlist from, grouped by where it fits, not ranked, because the rank depends on your corpus. One caveat before the list: these models get stored and searched in a vector index (Pinecone, Qdrant, or Weaviate are the common choices), and the model's dimension count drives that index's cost more than the model's API price does.

The 2026 shortlist map: API tier, open tier, specialist tier

Three tiers, not a ranking. Start in the API tier; drop to the open tier when a constraint forces it; reach into the specialist tier only when your eval shows the general model misses your domain.

Three minimal embedding calls follow: OpenAI, Cohere, and self-hosted BGE-M3.

embed_openai.py python

from openai import OpenAI
client = OpenAI()

# text-embedding-3-small: 1536 dims by default, MRL-truncatable.
# Pass dimensions= to trade recall for a smaller, cheaper index.
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=["How do I rotate an API key?"],
    dimensions=1024,   # truncate via Matryoshka — measure the recall cost
)
vec = resp.data[0].embedding

embed_cohere.py python

import cohere
co = cohere.Client()

# embed-v4: set input_type so query and passage embeddings align.
# Getting input_type wrong is the Cohere equivalent of the E5 prefix bug.
resp = co.embed(
    model="embed-v4.0",
    texts=["How do I rotate an API key?"],
    input_type="search_query",   # use "search_document" at index time
)
vec = resp.embeddings[0]

embed_bge.py python

from sentence_transformers import SentenceTransformer

# BGE-M3: open, multilingual, long-context, 1024 dims.
# Runs on your GPU/CPU — data never leaves your network.
model = SentenceTransformer("BAAI/bge-m3")
vec = model.encode(
    ["How do I rotate an API key?"],
    normalize_embeddings=True,   # so cosine == dot product
)[0]

Cost and latency: the numbers that decide it at scale

At prototype scale, embedding cost is a rounding error and you should ignore it. At production scale it becomes a real line item, and it's worth understanding which lever moves it. Two costs matter: the one-time cost to embed your corpus at index time, and the recurring cost to embed every query. The corpus cost scales with how much text you have; the query cost scales with traffic. As of 2026, hosted APIs sit roughly in the $0.02 to $0.13 per million tokens range — text-embedding-3-small near $0.02, text-embedding-3-large near $0.13, Cohere embed-v4 in the same low-double-digit-cents band. Verify the current number against each provider's pricing page before you quote it; these move.
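The two costs separate cleanly in code. A sketch using the approximate per-million-token prices quoted above; the corpus size and monthly query volume are hypothetical, and the prices should be re-checked against the providers' pricing pages.

```python
def embedding_cost_usd(tokens, price_per_1m_tokens):
    """API cost in USD to embed a given number of tokens."""
    return tokens / 1_000_000 * price_per_1m_tokens

corpus_tokens = 500_000_000        # hypothetical: one-time, at index time
monthly_query_tokens = 20_000_000  # hypothetical: recurring, scales with traffic

index_cost_small = embedding_cost_usd(corpus_tokens, 0.02)   # text-embedding-3-small
index_cost_large = embedding_cost_usd(corpus_tokens, 0.13)   # text-embedding-3-large
monthly_query_cost_small = embedding_cost_usd(monthly_query_tokens, 0.02)
```

With these inputs the one-time index pass is about $10 on the small model and about $65 on the large one, and the recurring query cost is well under a dollar a month.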

Relative cost to embed 1M tokens as you change the model and dimension (bar chart values)

text-embedding-3-large, full 3072 dims (baseline): 100% of baseline. Most expensive common API choice.
text-embedding-3-small, 1536 dims: about 15% of baseline (~$0.02 vs ~$0.13 per 1M tokens). Usually clears the bar.
text-embedding-3-small, truncated to 1024 dims (MRL): about 15% of baseline. Same API price, with ~⅓ the index storage and search latency of the 3072-dim baseline.
Self-hosted BGE-M3 at high volume: 6% of baseline. Amortized GPU only, but you run the infra.

Latency splits the same way. A hosted API adds a network round trip of tens of milliseconds, fine for most retrieval, painful only when you're embedding queries in a tight interactive loop. Self-hosting on a co-located GPU gets you single-digit-millisecond embedding but costs you the serving stack. The bars above are the pattern, not your invoice: model size and dimension are the two dials, and "smallest model + smallest dimension that clears your recall bar" is the cheapest point on the curve every time. If you want the broader cost framing for an AI build, our AI automation buyer's guide walks the same measure-then-commit discipline across a whole project, not just the embedder.

The re-embedding tax: why your choice is a migration cost later

Here's the input nobody puts on the decision sheet, and the one that should make you measure twice: switching embedding models is not a config change. Vectors from one model live in a different geometric space than vectors from another, so they're not comparable. The day you decide a better model has shipped, you have to re-embed your entire corpus with the new model and rebuild the vector index from scratch. For a small corpus that's an afternoon. For tens of millions of chunks it's a real project — compute budget, a re-index window, and a careful cutover so search doesn't degrade mid-migration.

Two design moves soften the tax. Keep your raw chunks and your embedding pipeline reproducible, so re-embedding is a re-run, not an archaeology project. And version your index, so you can build the new one alongside the old and cut over atomically. If you've done both, a model migration is a planned operation instead of an outage. If you've done neither, the re-embedding tax is the reason teams stay on a worse model for years.
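The versioning move can be sketched as a small registry that holds several index builds and swaps which one is live. This is an in-memory illustration under our own assumptions: `IndexRegistry` and the version labels are hypothetical, and the plain dicts stand in for real vector indexes. In practice, most vector databases offer some alias or collection-swap mechanism that plays this role.

```python
class IndexRegistry:
    """Keeps several index versions side by side and tracks which one serves traffic."""

    def __init__(self):
        self._indexes = {}   # version label -> index object
        self._live = None    # version label currently serving queries

    def register(self, version, index):
        # Build the new index next to the old one; nothing is served from it yet.
        self._indexes[version] = index

    def promote(self, version):
        # Cutover is one reference swap, so a query sees the old index or the new one.
        if version not in self._indexes:
            raise KeyError(f"unknown index version: {version}")
        previous, self._live = self._live, version
        return previous  # keep this label around for rollback

    @property
    def live_version(self):
        return self._live

    def live_index(self):
        return self._indexes[self._live]


registry = IndexRegistry()
registry.register("small-v1", {"chunk-1": [0.1, 0.2]})        # placeholder index
registry.promote("small-v1")
registry.register("m3-v2", {"chunk-1": [0.3, 0.4, 0.5]})      # built alongside the old one
live_during_build = registry.live_version
rolled_from = registry.promote("m3-v2")
```

The query-side embedder has to switch in the same step as the index, since query vectors from the old model are not comparable with the new index.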

Our defaults: what we reach for, and when we don't

After all the nuance, you still want a default, so here are ours. These aren't the best models in the abstract; they're the models we reach for first because they clear the bar for the common case with the least cost and risk, and we deviate only when an eval on the customer's corpus tells us to. Read down to the row that matches your workload.

Workload	Reach for first	Why	When we don't
English business RAG (the common case)	OpenAI text-embedding-3-small	Clears the bar, MRL-truncatable, ~$0.02/1M, zero infra	Data can't leave the network → go open
Multilingual / cross-lingual	BGE-M3 or Cohere embed-v4	Real cross-lingual alignment, 100+ languages	English-only corpus → the small API model is cheaper
Code / API documentation	Voyage voyage-code-2	Tuned for symbol-level similarity the general model misses	Eval shows the general model already clears it
Data residency / high volume at scale	Self-hosted BGE-M3 or Nomic Embed v2	Data stays put; amortized GPU beats per-token at volume	No residency rule + modest volume → API is simpler
Quick prototype / throwaway demo	all-MiniLM via sentence-transformers	Tiny, free, runs anywhere — good enough to prove the loop	Anything heading to production → start with the real default

Defaults, not laws. Every row assumes you'll confirm with the 30-minute eval before you ship. The point of a default is to start the eval from a sane shortlist, not to skip it.

FAQ: embedding model questions, answered straight

What are embedding models, in one sentence?

Embedding models are neural networks that turn text into fixed-length vectors positioned so that texts with similar meaning land near each other, which is what lets a RAG system retrieve the right chunk by comparing the query vector to the corpus vectors.

What is the best embedding model for RAG in 2026?

There's no single best embedding model; the right one is the smallest model that clears your recall bar on your corpus. For most English RAG we'd start with OpenAI text-embedding-3-small or Cohere embed-v4; for multilingual or self-hosted needs, BGE-M3; for code, Voyage voyage-code-2. Run a 30-minute recall@k eval on your own data to break the tie.

Should I just pick the top model on the MTEB leaderboard?

No. MTEB rank is overfit to the benchmark's task mix, has no column for cost or latency, and is dominated by large models that are expensive to serve. Use MTEB to build a shortlist and read its retrieval sub-scores, then decide with an eval on your own corpus.

Do more dimensions mean better retrieval?

Only up to a point. Recall flattens as dimensions grow while index size, memory, and search latency keep climbing. A 1024-dimension model often matches a 3072-dimension one within a point or two of recall at a third of the cost. With Matryoshka-trained models like text-embedding-3 or Nomic Embed v2 you can truncate one model to a smaller dimension and keep most of the recall.

Open-source or API embedding models — which should I use?

Start with a hosted API model unless a real constraint pushes you to self-host. Go open — BGE-M3, Nomic Embed v2, E5, Jina v3 — when data must stay in your network, when per-token cost is material at scale, or when a specialist domain needs an open model. "We prefer open source" alone isn't a reason to take on a GPU fleet.

What does it cost to switch embedding models later?

A full re-embed of your corpus plus an index rebuild, because vectors from different models aren't comparable, so there's no incremental migration. For a small corpus that's an afternoon; for tens of millions of chunks it's a real project. That re-embedding tax is why your initial choice matters, and why we don't chase small leaderboard updates.

RAG DEVELOPMENT

Pick the embedding model your corpus actually needs — not the one at the top of the leaderboard.

We design RAG systems that retrieve the right chunk before the LLM ever runs, and we measure on your data before we commit to a model. If your retriever is returning the wrong context, talk to the team that has made this call across 200+ engagements.
