Best Text Embedding Models for RAG, Ranked by Retrieval Quality and Cost
We scored five production embedding models on retrieval quality, multilingual coverage, context length, dimension flexibility, and price per million tokens, using public MTEB results and vendor-documented specs as of June 2026.
Voyage 3 Large finishes first on raw retrieval quality and is the right default when retrieval is the bottleneck. Gemini Embedding 001 wins for multilingual corpora at API scale, Cohere Embed v4 is the only mainstream pick for mixed text-and-image RAG, OpenAI text-embedding-3-large remains the safest default inside the OpenAI stack, and Qwen3-Embedding-8B is the strongest open-weight option once self-hosting volume justifies the GPUs.
Five general-purpose embedding models, one ranked field. We picked the embedders most teams actually shortlist for production RAG in 2026: three managed APIs (Voyage, Google Gemini, Cohere), one incumbent default (OpenAI), and one open-weight model (Qwen3). Each was scored against the same five metrics so the differences trace to the models rather than the workload.
Quality scores are tied to public MTEB results on retrieval-leaning subsets, with multilingual coverage scored separately because that's the dimension where general-purpose averages hide the most. Price per million tokens is tracked alongside quality but kept out of the quality score. A buyer optimizing for retrieval accuracy and a buyer optimizing for token spend are answering different questions.
Quality is grounded in published MTEB scores and vendor-documented specs, verified against official pricing and model pages in June 2026. Where vendors report different MTEB subsets, we use English MTEB averages for the English retrieval score and MMTEB / multilingual MTEB averages for the multilingual score. Self-hosted models are scored on retrieval quality and on the engineering cost of running them, not on a hosted API price.
English retrieval quality
Scored against published English MTEB averages on the retrieval-leaning subsets reported by each vendor or by the model's Hugging Face card. We mapped the raw MTEB number to a 0-100 score with 60 MTEB anchored to 60 and 75 MTEB anchored to 95, so a five-point MTEB gap corresponds to roughly a 12-point score gap, consistent with published guidance that a five-point MTEB gap typically translates to 3-8% better recall@10 in production search. Weighted 30%.
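Both anchors sit on one straight line, so the mapping works out to about 2.33 score points per MTEB point. The sketch below reproduces that arithmetic. The function name is hypothetical, and clamping to 0-100 is our assumption, since the rubric doesn't say how scores outside the anchors are handled.

```python
def mteb_to_score(mteb: float) -> float:
    """Map a raw MTEB average onto the 0-100 rubric scale.

    Linear through the two stated anchors: 60 MTEB -> 60 points and
    75 MTEB -> 95 points. Clamping to [0, 100] is an assumption.
    """
    x0, y0 = 60.0, 60.0
    x1, y1 = 75.0, 95.0
    slope = (y1 - y0) / (x1 - x0)  # 35 / 15, about 2.33 points per MTEB point
    score = y0 + slope * (mteb - x0)
    return max(0.0, min(100.0, score))


# A five-point MTEB gap (65 vs 70) under these anchors:
gap = mteb_to_score(70.0) - mteb_to_score(65.0)
print(round(gap, 1))  # 11.7
```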
Multilingual coverage
Scored on MMTEB / MTEB Multilingual averages plus the documented number of languages each model is trained or marketed to support. Models that lead the MMTEB leaderboard (Qwen3-Embedding-8B at 70.58, Gemini Embedding 001 at 68.32) anchor the top of the band. English-first models with limited cross-lingual training score in the middle even when their English MTEB is competitive. Weighted 20%.
Context length
Maximum input length per embedding call, taken from the official model documentation. 128K-token models (Cohere Embed v4) anchor the top of the band, 32K and 8K models sit in the middle, and the 2K Gemini Embedding 001 anchors the bottom. Longer context reduces the share of the RAG pipeline that's dominated by chunking strategy. Weighted 15%.
Dimension flexibility
Scored on Matryoshka Representation Learning support, the number of documented dimension cut-points, and the supported output precisions (float, int8, binary). Models that publish multiple Matryoshka cuts and quantization-aware training (Voyage 3 Large at 2048/1024/512/256 with binary support, Cohere Embed v4 at 256/512/1024/1536 with int8/binary) score at the top because they expose direct storage-vs-quality controls without re-embedding. Weighted 15%.
Price per million tokens
List input price per million tokens from each vendor's official pricing page in June 2026, normalized so the cheapest model in the field scores highest. Self-hosted models are scored as a separate cost regime: no per-token API charge, but a non-zero GPU bill that only beats API pricing at high embed volume. Weighted 20%.
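The page lists the five weights but not the rule that combines them. Assuming a plain weighted sum of 0-100 metric scores, the overall number would be computed as in the sketch below. The metric keys and the example scores are hypothetical and are not the published per-model scores.

```python
# Rubric weights as stated in the methodology (they sum to 100%).
WEIGHTS = {
    "english_retrieval": 0.30,
    "multilingual": 0.20,
    "context_length": 0.15,
    "dimension_flexibility": 0.15,
    "price": 0.20,
}


def composite(scores: dict[str, float]) -> float:
    """Weighted sum of 0-100 metric scores; every rubric metric must be present."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    return sum(WEIGHTS[m] * scores[m] for m in WEIGHTS)


# Hypothetical per-metric scores, for illustration only.
example = {
    "english_retrieval": 90,
    "multilingual": 80,
    "context_length": 70,
    "dimension_flexibility": 95,
    "price": 60,
}
print(round(composite(example), 2))  # 79.75
```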
voyage-3-large is Voyage AI's general-purpose embedding model, with Matryoshka dimensionality at 2048, 1024, 512, and 256, plus quantization-aware training that supports binary and int8 outputs. Vendor benchmarks report a 9.74% retrieval-quality margin over OpenAI text-embedding-3-large across 100 datasets in eight domains, and third-party spec tables put it at the top of MTEB retrieval among commercial APIs. It lists at $0.18 per million input tokens with a 32K-token context window. The trade-off is price: it sits at the top of the field on both quality and cost, which makes it the right default when retrieval is the binding constraint and the wrong default when API spend is.
Source: Voyage AI
Strengths
- Highest reported retrieval quality across general-purpose APIs
- Matryoshka cuts at 2048/1024/512/256 plus binary outputs cut storage up to 200x
- First 200M tokens are free per account on voyage-4 family
Weaknesses
- $0.18 per million tokens is the most expensive in the field
- English-first; multilingual coverage trails Gemini and Qwen3
gemini-embedding-001 is Google's production text embedding model, accessible through the Gemini API and Vertex AI. It's trained with Matryoshka Representation Learning, outputs 3,072 dimensions by default with documented truncation to 1,536 or 768 without quality loss, and has held a top spot on the MMTEB leaderboard with an average task score of 68.32. List pricing is $0.15 per million input tokens, with a 50% batch discount at $0.075. The binding limitation is the 2,048-token input window, which forces aggressive chunking on any document workflow longer than a few pages.
Source: Google
Strengths
- MMTEB multilingual leader at 68.32 across managed APIs
- Matryoshka cuts at 3072/1536/768 with documented quality preservation
- Batch API at $0.075 per million tokens cuts indexing cost in half
Weaknesses
- 2,048-token context window forces chunking on long documents
- API errors rather than truncating when inputs exceed 2K tokens
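Matryoshka truncation keeps a prefix of the full vector, and that prefix is generally no longer unit length. Google's embeddings guide notes that only the 3,072-dim output comes normalized, so reduced-dimension vectors should be re-normalized before cosine or dot-product search. A minimal pure-Python sketch follows; the helper name and the toy vector are illustrative, and a real pipeline could instead request the smaller output size from the API.

```python
import math


def truncate_and_normalize(vec: list[float], dim: int) -> list[float]:
    """Keep the first `dim` coordinates of a Matryoshka-trained embedding
    and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be between 1 and {len(vec)}, got {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


# Toy 6-dim vector standing in for a 3,072-dim embedding.
full = [3.0, 4.0, 0.0, 1.0, 2.0, 2.0]
short = truncate_and_normalize(full, 2)
print(short)  # [0.6, 0.8]
```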
Cohere Embed v4 is a multimodal embedding model that vectorizes text, single images, and interleaved text-and-image content in one 1,536-dimensional space, with Matryoshka cuts at 256, 512, 1,024, and 1,536 and support for float, int8, uint8, binary, and ubinary output types. The model posts an MTEB score of 65.2 (ahead of OpenAI text-embedding-3-large at 64.6), lists at $0.12 per million text tokens, and its 128,000-token context window sits in the top 5% of embedders. It's API-only on the public tier, with on-prem available under Cohere's enterprise contract for SOC 2 Type 2 and HIPAA postures.
Source: Cohere
Strengths
- 128K context window, top 5% among embedding models
- Embeds PDFs, slides, tables, and figures directly without OCR
- 1536-dim Matryoshka with int8 and binary precision end-to-end
Weaknesses
- Text-only MTEB trails Voyage 3 Large by 3-4 points
- Image tokens billed at $0.47 per million, materially above the text rate
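Cohere computes its binary and ubinary outputs server-side. The sketch below only illustrates the generic sign-based quantization behind such formats (one bit per dimension, 1 where the value is positive), packed eight dimensions per byte and compared by Hamming distance. It is not Cohere's exact encoding, and the function names and toy vectors are hypothetical.

```python
def binarize(vec: list[float]) -> bytes:
    """Sign quantization: bit = 1 where a coordinate is > 0, else 0.
    Bits are packed 8 per byte, most significant bit first."""
    packed = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def hamming(a: bytes, b: bytes) -> int:
    """Count differing bits between two packed codes (lower = more similar)."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


query = [0.2, -0.1, 0.5, -0.3, 0.0, 0.9, -0.4, 0.1]
doc = [0.3, -0.2, 0.4, 0.1, -0.5, 0.8, -0.1, 0.2]
print(binarize(query).hex(), binarize(doc).hex(), hamming(binarize(query), binarize(doc)))
# a5 b5 1
```

At 1,536 dimensions, a float32 vector takes 6,144 bytes and the packed binary form takes 192, a 32x reduction before any Matryoshka cut. Binary search is often paired with a rescoring pass on higher-precision vectors; that step is omitted here.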
text-embedding-3-large is OpenAI's flagship embedding model, with a default 3,072-dimensional output that can be truncated via the dimensions parameter (Matryoshka-style) without re-embedding, an 8,192-token context window, and an MTEB English average of 64.6 reported at launch. List pricing is $0.13 per million input tokens, with a 50% batch discount at $0.065. It's the most-integrated embedding model in the ecosystem and the safest pick inside the OpenAI stack, but it hasn't been refreshed since January 2024 and now trails Voyage 3 Large, Gemini Embedding 001, and the strongest open-weight models on MTEB.
Source: OpenAI
Strengths
- Widest ecosystem and tooling support in the field
- Dimensions parameter shortens 3,072-dim vectors without re-training
- Batch API halves indexing cost to $0.065 per million tokens
Weaknesses
- Not refreshed since January 2024; trails newer flagships on MTEB
- 8K context window is mid-pack, not class-leading
Qwen3-Embedding-8B is the largest model in Alibaba's Qwen3 Embedding series, released under Apache 2.0 with 8B parameters, a 32K-token context window, and flexible vector dimensions up to 4,096. It ranks No.1 on the MTEB multilingual leaderboard with a score of 70.58 (as of June 5, 2025) and posts MTEB English 75.22, Chinese 73.84, and Code 80.68, outperforming Gemini-Embedding on the reported subsets. It's the strongest open-weight option for teams with GPU infrastructure, and the wrong pick for teams without one. Self-hosting only beats API pricing once embed volume crosses roughly 10-15 million embeddings per month.
Source: Alibaba (Qwen)
Strengths
- MMTEB leader at 70.58, ahead of every managed-API model in this ranking
- Apache 2.0 license supports commercial use, fine-tuning, and on-prem
- 32K context window and flexible dimensions up to 4,096
Weaknesses
- No managed API, requires GPU infrastructure and MLOps
- Self-hosting is more expensive than APIs below ~10-15M embeddings per month
The ranking above reflects published MTEB and MMTEB scores plus vendor-documented specs verified against official pricing pages in June 2026. The single largest separator at the top of the table isn't English retrieval quality, since every model in the field clears the threshold where the embedder is no longer the bottleneck. It's how each one trades off against the other four metrics: multilingual reach, context length, dimension flexibility, and price.
What the scores measure
English retrieval quality carries the most weight because a RAG pipeline that retrieves the wrong chunk gets the wrong answer regardless of how strong the generation model is. We scored it from public MTEB averages on the retrieval-leaning subsets rather than from any vendor-on-vendor head-to-head, because vendor benchmarks consistently flatter the vendor. The mapping is calibrated so a five-point MTEB gap is meaningful, consistent with the public observation that a 5-point gap on MTEB typically translates to 3-8% better recall@10 in real-world search applications.
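Recall@10 is the share of a query's relevant documents that appear among the first ten results, usually averaged over all queries in an evaluation set. A minimal per-query sketch, with hypothetical document IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Share of the relevant documents found among the top-k retrieved IDs."""
    if not relevant:
        raise ValueError("recall is undefined when no documents are relevant")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical ranking: d2 is relevant but only appears at rank 11.
retrieved = ["d3", "d8", "d1", "d12", "d15", "d4", "d20", "d21", "d22", "d23", "d2"]
relevant = {"d1", "d2", "d3", "d4"}
print(recall_at_k(retrieved, relevant))  # 0.75
```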
Multilingual coverage is scored separately because it’s the dimension that general-purpose MTEB averages hide most. For multilingual applications, Qwen3-Embedding-8B offers the best performance (70.58 on MMTEB), with a reported French score of 69.8. Among managed APIs, Gemini Embedding 001 offers strong multilingual performance at a competitive price, especially with batch pricing, while OpenAI text-embedding-3-large lags on European languages. A buyer indexing a multilingual corpus will see those gaps in production retrieval long before the headline MTEB number suggests they should.
Where the field separates
Voyage 3 Large and Qwen3-Embedding-8B lead on raw retrieval quality on the subsets we scored. Cohere Embed v4, Cohere's fourth-generation embedding model released April 15, 2025, leads on context length and is the only model in the table that natively embeds interleaved text and images in the same vector space, so screenshots of PDFs, slides, figures, and tables can be indexed alongside text documents without converting visual content to text first. Its 65.2 MTEB score also edges out OpenAI text-embedding-3-large (64.6). Gemini Embedding 001 leads on multilingual MMTEB among managed APIs but is bounded by a 2,048-token input limit per embedding request, so longer documents must be chunked.
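Working inside a fixed input limit such as Gemini Embedding 001's 2,048 tokens usually means splitting documents into overlapping windows. The sketch below uses whitespace-split words as a stand-in for real tokens, which is an assumption: the API counts tokens with its own tokenizer, so in practice you'd leave headroom below the limit. The function name and the 200-token overlap are arbitrary illustrative choices.

```python
def chunk_tokens(tokens: list[str], max_tokens: int = 2048,
                 overlap: int = 200) -> list[list[str]]:
    """Split a token sequence into windows of at most `max_tokens`, repeating
    `overlap` tokens between consecutive windows."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):  # this window reached the end
            break
    return chunks


words = ("lorem " * 5000).split()  # 5,000 stand-in tokens
chunks = chunk_tokens(words)
print(len(chunks), [len(c) for c in chunks])  # 3 [2048, 2048, 1304]
```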
Cost and the self-host crossover
Price per million tokens is tracked alongside the quality metrics but kept out of the quality score, because a buyer optimizing for retrieval recall and a buyer optimizing for embed spend are answering different questions. The list rates are tight: $0.12 per million input tokens for Cohere Embed v4, $0.13 for OpenAI text-embedding-3-large (8,192-token context), $0.15 for Gemini Embedding 001 (2,048-token context), and $0.18 for Voyage 3 Large (32K-token context). The bigger structural decision is the API-versus-self-host crossover: Qwen3-Embedding-8B leads the field on MMTEB at no per-token cost, but it only beats the cheapest API in this ranking once embed volume crosses roughly 10-15 million embeddings per month.
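To see how the quoted list rates translate into an indexing bill, the sketch below prices a one-off embedding pass over a corpus. The 2-billion-token corpus size is hypothetical, the 50% batch discount is applied only where this article quotes one (OpenAI and Gemini), and free-tier allowances are ignored.

```python
# List input prices (USD per million tokens) as quoted in this article, June 2026.
LIST_PRICE = {
    "Cohere Embed v4": 0.12,
    "OpenAI text-embedding-3-large": 0.13,
    "Gemini Embedding 001": 0.15,
    "Voyage 3 Large": 0.18,
}
# Batch discounts quoted in this article; models not listed are priced at list.
BATCH_DISCOUNT = {
    "OpenAI text-embedding-3-large": 0.50,
    "Gemini Embedding 001": 0.50,
}


def indexing_cost(total_tokens: int, model: str, batch: bool = False) -> float:
    """USD to embed `total_tokens` once with `model` (no free-tier credit)."""
    price = LIST_PRICE[model]
    if batch:
        price *= 1.0 - BATCH_DISCOUNT.get(model, 0.0)
    return total_tokens / 1_000_000 * price


corpus_tokens = 2_000_000_000  # hypothetical 2B-token corpus
for model in LIST_PRICE:
    print(f"{model:<30} list ${indexing_cost(corpus_tokens, model):7.2f}"
          f"  batch ${indexing_cost(corpus_tokens, model, batch=True):7.2f}")
```

At list price that corpus costs between $240 (Cohere) and $360 (Voyage); with batch pricing, OpenAI ($130) and Gemini ($150) undercut the rest.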
Dimension flexibility is the underrated metric
Every model in the top four supports Matryoshka-style dimension reduction, but the cut-points and precision options differ enough to matter at scale. Voyage 3 Large exposes four Matryoshka output sizes (2048, 1024, 512, and 256), so you can tune the trade-off between retrieval accuracy and vector storage cost without retraining or running multiple models. Cohere Embed v4 goes further on precision: it supports float, int8, uint8, binary, and ubinary embedding types, with configurable output dimensions from 256 to 1,536. Those two controls, Matryoshka cuts plus quantization, make the difference between a vector store that fits in RAM and one that doesn’t, and they’re why Voyage 3 Large and Cohere Embed v4 score 90 or higher on dimension flexibility while the bottom of the table sits in the low 80s.
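Whether a store fits in RAM comes down to arithmetic on dimensions and precision. The sketch below counts raw vector payload only (no ANN index overhead, IDs, or metadata), and the 100-million-vector corpus is a hypothetical size chosen for illustration.

```python
BYTES_PER_DIM = {"float32": 4.0, "int8": 1.0, "binary": 1.0 / 8}


def payload_bytes(n_vectors: int, dims: int, precision: str) -> float:
    """Raw vector storage only: excludes ANN index structures, IDs, metadata."""
    return n_vectors * dims * BYTES_PER_DIM[precision]


n = 100_000_000  # hypothetical corpus of 100M chunks
for dims, precision in [(2048, "float32"), (1536, "int8"), (256, "binary")]:
    gb = payload_bytes(n, dims, precision) / 1e9
    print(f"{dims:>5}-dim {precision:<8} {gb:8.1f} GB")
```

Going from 2,048-dim float32 to 256-dim binary shrinks the payload from 8,192 bytes to 32 bytes per vector, or from about 819 GB to 3.2 GB at 100 million vectors.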
- https://www.voyageai.com/
- https://ai.google.dev/gemini-api/docs/embeddings
- https://cohere.com/embed
- https://platform.openai.com/docs/guides/embeddings
- https://huggingface.co/Qwen/Qwen3-Embedding-8B
- https://docs.voyageai.com/docs/pricing
- https://docs.cohere.com/docs/cohere-embed
- https://qwenlm.github.io/blog/qwen3-embedding/
Q.Which embedding model has the highest retrieval quality in 2026?
Among managed APIs, Voyage AI reports that voyage-3-large outperforms OpenAI text-embedding-3-large by 9.74% and Cohere Embed v3 English by 20.71% across 100 retrieval datasets in eight domains, and third-party spec tables place it at the top of MTEB retrieval among commercial embedders. Among open-weight models, Qwen3-Embedding-8B achieves state-of-the-art mean task-level scores on MMTEB (70.58) and MTEB (English: 75.22, Chinese: 73.84, Code: 80.68), outperforming Gemini-Embedding on those subsets, but it requires self-hosting on GPU infrastructure.
Q.What is the best embedding model for multilingual RAG?
For multilingual corpora on a managed API, Gemini Embedding 001 is the leader: it holds top positions on the MTEB Multilingual (MMTEB) leaderboard with an average task score of 68.32. If self-hosting is on the table, Qwen3-Embedding-8B sits ahead of it on MMTEB at 70.58 under an Apache 2.0 license.
Q.When does it make sense to self-host instead of using an embedding API?
The crossover is volume. On commodity GPUs, a single A10G GPU (~$0.75/hour on AWS) can process roughly 500-1,000 embeddings per second. At that rate, self-hosting becomes cheaper than API calls only when you exceed about 10-15 million embeddings per month. Below that threshold, a paid API is both cheaper and simpler to maintain. Sovereignty constraints, fine-tuning needs, or MMTEB leadership at no per-token cost are the other reasons to take on the GPU bill.
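The crossover follows from three inputs: the GPU's hourly price, the API's per-token price, and the average tokens per embedding. The sketch below takes the A10G figure quoted above and the cheapest list rate in this ranking ($0.12 per million tokens). The 500-token average chunk length is our assumption, as is billing the GPU around the clock; the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # about 8,760 hours / 12; assumes the GPU runs 24/7


def breakeven_per_month(gpu_usd_per_hour: float,
                        api_usd_per_m_tokens: float,
                        avg_tokens_per_embedding: float) -> float:
    """Monthly embeddings at which an always-on GPU costs the same as the API."""
    gpu_monthly_usd = gpu_usd_per_hour * HOURS_PER_MONTH
    api_usd_per_embedding = api_usd_per_m_tokens * avg_tokens_per_embedding / 1_000_000
    return gpu_monthly_usd / api_usd_per_embedding


def gpu_capacity_per_month(embeddings_per_second: float) -> float:
    """Upper bound on monthly throughput at full utilization."""
    return embeddings_per_second * 3600 * HOURS_PER_MONTH


print(f"{breakeven_per_month(0.75, 0.12, 500):,.0f}")  # 9,125,000
print(f"{gpu_capacity_per_month(500):,.0f}")           # 1,314,000,000
```

With those inputs the break-even lands near 9.1 million embeddings per month, and the quoted 10-15 million range corresponds to average chunks of roughly 300 to 450 tokens. Throughput isn't the limiting factor: at 500 embeddings per second, one GPU could in principle handle over a billion embeddings a month, so the real question is whether volume is high enough to justify paying for it.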
Q.Should I still use OpenAI text-embedding-3-large?
Inside the OpenAI stack, it remains a defensible default at $0.13 per million tokens with a documented 64.6 MTEB and a dimensions parameter that lets you truncate the 3,072-dim output. Outside that stack, the model hasn't been refreshed since January 2024, and newer flagships (Voyage 3 Large, Gemini Embedding 001) and open-weight models (Qwen3-Embedding-8B) now lead it on the public MTEB and MMTEB boards.
Priya Raman runs the Top AI Tracker test bench. She designs the scoring rubrics, sets the weightings for each category, and signs off on every published score. Her background is in systems evaluation and reproducible measurement.