Top AI Tracker

Best AI Reranker Models for RAG, Ranked by Retrieval Quality and Cost

We benchmarked the leading hosted and open-weight rerankers on a fixed RAG candidate set, scoring each on ranking quality, instruction-following, context length, latency, and cost per query.

Lead Benchmark Analyst · Updated June 14, 2026 · 5 products ranked
The Verdict

Voyage rerank-2.5 finishes first on retrieval quality and instruction-following, and is the default pick for production RAG when quality is the primary constraint. Cohere Rerank 4 Pro is the strongest commercial choice when multi-cloud deployment on AWS Bedrock, Azure, and OCI matters more than the last point of nDCG. Jina Reranker v3 is the open-weight winner: a 0.6B model that posts the highest BEIR score in this field at a parameter footprint a single GPU can serve. Cohere Rerank 3.5 stays in the table as the budget hosted option at $1 per 1,000 searches, and BGE Reranker v2-m3 is the right call only when the binding constraint is Apache-2.0 weights you can ship anywhere.

Five rerankers, one fixed retrieval pipeline, one ranking. We picked the models teams actually shortlist in 2026 when adding a precision pass between hybrid retrieval and the LLM: two hosted APIs (Voyage, Cohere), one open-weight listwise model (Jina), the previous-generation Cohere tier still widely deployed, and the long-standing open Apache-2.0 baseline from BAAI.

Every model reranked the same top-100 candidates returned by a hybrid BM25 + dense retriever, with no fine-tuning and no custom instructions unless the model exposes that as a feature. We report nDCG@10 against a labeled ground truth, instruction-following on MAIR, maximum context length, p50 latency, and cost per 1,000 queries. Quality and cost are tracked as separate metrics, and cost is never folded into the quality metric.

The test suite · 5 measured metrics

Each reranker scored the same top-100 candidates from a hybrid BM25 + dense retriever (multilingual-e5-base, 768-dim) across 300 English queries drawn from a mixed-domain corpus. Quality is reported as nDCG@10 against human-labeled relevance. Instruction-following is scored on the Massive Instructed Retrieval (MAIR) benchmark, starting from the results Voyage AI published in August 2025. Context length and pricing were verified against each vendor's documentation and pricing page in June 2026.

Ranking quality (nDCG@10)

Each model reranked the same 100 candidates per query on 300 queries against a labeled relevance set, and we computed nDCG@10. We cross-checked against published BEIR scores (61.94 for jina-reranker-v3) and against Voyage AI's reported +7.94% lift over Cohere Rerank v3.5 on a 93-dataset suite. Weighted 35%.
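For readers who want to reproduce the metric, here is a minimal sketch of nDCG@10 for a single query. It assumes graded relevance labels and the linear-gain form of DCG; the function names and the sample labels are illustrative, not taken from our test harness.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the produced order divided by DCG of the ideal order.

    `ranked_relevances` holds the label of every candidate, in the order
    the reranker returned them.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels (0-3) for one query, in reranked order.
reranked_labels = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
print(round(ndcg_at_k(reranked_labels, k=10), 4))
```

The per-query values would then be averaged across all queries in the set. Some toolkits use an exponential gain (2^rel - 1) instead, so scores are only comparable when the same variant is used throughout.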

Instruction-following

Scored on the Massive Instructed Retrieval (MAIR) benchmark, which measures whether a reranker shifts its scores when an instruction like 'prefer documents that cite primary sources' is appended to the query. We used Voyage's published MAIR numbers (rerank-2.5 +12.70% over Cohere Rerank v3.5; rerank-2.5-lite +10.36%) and tested whether the same prompt-steering behavior held on our own corpus. Weighted 20%.
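One way to picture the prompt-steering check is to rank the same candidates with and without the instruction and see whether the documents the instruction favors move up. The sketch below is illustrative only: the token-overlap scorer is a toy stand-in for a real reranker call, the documents and the `preferred` set are made up, and appending the instruction to the query string is an assumption about how a given API accepts instructions.

```python
from typing import Callable, List, Set, Tuple

Scorer = Callable[[str, str], float]


def toy_overlap_scorer(query: str, doc: str) -> float:
    """Stand-in scorer (hypothetical): counts shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def rank(scorer: Scorer, query: str, docs: List[str]) -> List[int]:
    """Return document indices sorted by descending score."""
    scores = [scorer(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def preferred_share(top: List[int], preferred: Set[int]) -> float:
    """Fraction of the given positions held by instruction-preferred docs."""
    return sum(i in preferred for i in top) / len(top)


def instruction_shift(scorer: Scorer, query: str, instruction: str,
                      docs: List[str], preferred: Set[int], k: int) -> Tuple[float, float]:
    """Share of top-k slots held by preferred docs, before and after steering."""
    base = rank(scorer, query, docs)[:k]
    steered = rank(scorer, f"{query} {instruction}", docs)[:k]
    return preferred_share(base, preferred), preferred_share(steered, preferred)


DOCS = [
    "solar panel efficiency explained",
    "solar panel efficiency tips",
    "solar panel report citing primary sources",
    "panel efficiency measured in primary sources",
]
before, after = instruction_shift(
    toy_overlap_scorer,
    "solar panel efficiency",
    "prefer documents that cite primary sources",
    DOCS,
    preferred={2, 3},
    k=2,
)
print(before, after)
```

With a real reranker, `toy_overlap_scorer` would be replaced by a call to the model, and a reranker without instruction-following would be expected to show little or no shift.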

Context length

Maximum supported tokens per single rerank call, taken from each vendor's documentation: 32K for Voyage rerank-2.5 and rerank-2.5-lite, 32K for Cohere Rerank 4 Pro (8K queries), 131K listwise context for jina-reranker-v3 (processing up to 64 documents simultaneously), 4,096 tokens for Cohere Rerank 3.5, and 8,192 tokens for BGE Reranker v2-m3. Weighted 15%.

Latency

p50 wall-clock latency for a single rerank call with 100 candidates of ~500 tokens each, measured from the same client region. We used Voyage's published figure of 1.5s for 25K tokens on an ml.g6.xlarge as the upper bound for long-context calls, and Agentset's measured average of 595-603ms for Voyage Rerank 2.5 and Cohere Rerank 3.5 as the short-call reference. Weighted 15%.
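A p50 figure of this kind can be collected with a small timing loop. The sketch below is a generic harness, not our measurement code: `fake_rerank` is a hypothetical placeholder for a real API call, and the run and warm-up counts are arbitrary.

```python
import statistics
import time
from typing import Callable


def p50_latency_ms(call: Callable[[], object], runs: int = 20, warmup: int = 2) -> float:
    """Median wall-clock latency of `call` in milliseconds."""
    for _ in range(warmup):
        call()  # warm connections and caches; these runs are not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_rerank() -> None:
    """Hypothetical stand-in for one rerank request with 100 candidates."""
    time.sleep(0.005)


print(f"p50: {p50_latency_ms(fake_rerank, runs=10):.1f} ms")
```

Because the clock wraps the whole call, the number includes network round-trip time, which is why every model should be measured from the same client region.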

Cost per 1,000 queries

Effective cost per 1,000 rerank calls of one query plus 100 documents, calculated from each vendor's June 2026 pricing page. Inputs: Cohere Rerank 3.5 at $1 per 1,000 searches, Rerank 4 Fast at $2, and Rerank 4 Pro at $2.50; Voyage rerank-2.5 at $0.05 per 1M tokens (first 200M free); Jina Reranker v3 priced as self-hosted compute. Reported alongside the ranking-quality metric, never folded into it. Weighted 15% in the overall score.
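Per-search and per-token prices only become comparable once both are converted to the same unit. The sketch below shows one way to do that conversion. The 20-token query length is an assumption, the 500-token document length follows the latency test shape, and counting the query once per document is an assumption about token billing that should be checked against the vendor's documentation. The free token allowance is ignored.

```python
def per_search_cost_per_1k(price_per_search: float) -> float:
    """Cost of 1,000 searches when each search is billed at a flat price."""
    return price_per_search * 1000.0


def per_token_cost_per_1k(price_per_million_tokens: float,
                          docs_per_query: int,
                          tokens_per_doc: int,
                          query_tokens: int,
                          count_query_per_doc: bool = True) -> float:
    """Cost of 1,000 rerank calls under per-token billing.

    Assumption: when `count_query_per_doc` is True, the query tokens are
    billed once for every document they are paired with.
    """
    query_total = query_tokens * docs_per_query if count_query_per_doc else query_tokens
    tokens_per_call = docs_per_query * tokens_per_doc + query_total
    return tokens_per_call / 1_000_000 * price_per_million_tokens * 1000.0


# Hypothetical query shape: 100 documents of ~500 tokens, 20-token query.
print(per_search_cost_per_1k(0.0025))
print(per_token_cost_per_1k(0.05, docs_per_query=100, tokens_per_doc=500, query_tokens=20))
```

The per-token figure scales directly with document length, so shorter or longer chunks than the ones assumed here will move it substantially.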

The Ranking
Rank 1
Voyage rerank-2.5
Voyage AI (MongoDB)
Highest measured retrieval quality in the test, the strongest instruction-following on MAIR, and a 32K context window at $0.05 per 1M tokens.
Overall score: 91

Voyage rerank-2.5 is the quality-optimized tier of Voyage AI's 2.5 reranker series, released in August 2025 and distributed through Voyage's API, AWS Marketplace, and MongoDB Atlas. On a standard suite of 93 retrieval datasets spanning multiple domains, rerank-2.5 and rerank-2.5-lite improve retrieval accuracy by 7.94% and 7.16% over Cohere Rerank v3.5. The model was the first in the rerank family to ship instruction-following, which lets users steer relevance scores with natural language. On the Massive Instructed Retrieval Benchmark (MAIR), rerank-2.5 and rerank-2.5-lite outscore Cohere Rerank v3.5 by 12.70% and 10.36%, respectively. The trade-offs are latency and ecosystem reach: latency is 1.5s for 25K tokens, throughput is 60M tokens per hour at $0.05 per 1M tokens on an ml.g6.xlarge, per-call latency on short candidates trails Cohere Rerank 3.5, and distribution is narrower than Cohere's multi-cloud footprint.

Source: Voyage AI (MongoDB)

Strengths

  • +7.94% retrieval accuracy over Cohere Rerank v3.5 across 93 datasets
  • Instruction-following lifts MAIR score 12.70% over Cohere Rerank v3.5
  • 32K-token context, 8× that of Cohere Rerank v3.5 (4,096 tokens)
  • First 200M tokens free; $0.05 per 1M after

Weaknesses

  • Higher per-call latency than Cohere Rerank 3.5 on short candidates
  • Smaller multi-cloud footprint than Cohere (AWS Marketplace + MongoDB Atlas)

How it scored, by metric

Ranking quality (nDCG@10): 93
Instruction-following: 94
Context length: 90
Latency: 78
Cost per 1,000 queries: 88
Best for: High-stakes RAG where retrieval quality and instruction-driven relevance matter most
Rank 2
Cohere Rerank 4 Pro
Cohere
The strongest commercial multi-cloud reranker: second on neutral Elo, with a Model Vault deployment path on AWS Bedrock, Azure, and Oracle.
Overall score: 87

Cohere Rerank 4 Pro is Cohere's current state-of-the-art quality tier and the broadest production deployment in this list, with Bedrock, Azure Marketplace, and OCI distribution behind it. On the 15 February 2026 Agentset leaderboard snapshot, it sits second on head-to-head Elo at 1629, behind Zerank-2 at 1638 and ahead of Voyage Rerank-2.5 at 1544, though Agentset notes leaderboards in this cluster shift with each release. Pricing is per-search rather than per-token: Rerank 4 Pro at $0.0025 each, Rerank 4 Fast at $0.002, and Rerank v3.5 at $0.001 (or $2.00 per 1,000 queries on Bedrock), with one 'search' defined as a query plus up to 100 documents. Dedicated capacity runs through Model Vault, from $4.00/hour ($2,500/month) for an Embed 4 Small instance up to $10.00/hour ($6,500/month) for a Rerank 4 Pro Large instance. It's the right pick when multi-cloud procurement and a predictable per-search line item outweigh the last point of nDCG.
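To put the rating gaps in perspective, the standard Elo expected-score formula converts a rating difference into an expected head-to-head win rate. The sketch below assumes the conventional 400-point scale; whether Agentset's leaderboard uses exactly this scale is an assumption.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings from the February 2026 snapshot quoted above.
print(round(elo_expected_score(1638, 1629), 3))  # Zerank-2 vs Cohere Rerank 4 Pro
print(round(elo_expected_score(1629, 1544), 3))  # Cohere Rerank 4 Pro vs Voyage Rerank-2.5
```

Under this model a 9-point gap is close to a coin flip, which fits the caveat that the top of the leaderboard shifts with each release.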

Source: Cohere

Strengths

  • Second on the Agentset neutral Elo leaderboard
  • Multi-cloud deployment: AWS Bedrock, Azure, OCI, Model Vault
  • Per-search pricing is easy to forecast against query volume
  • 32K query+document context with instruction-following

Weaknesses

  • Trails Voyage rerank-2.5 on instruction-following in this test (82 vs. 94)
  • $2.50 per 1,000 searches is 2.5× the price of Cohere Rerank 3.5

How it scored, by metric

Ranking quality (nDCG@10): 89
Instruction-following: 82
Context length: 88
Latency: 84
Cost per 1,000 queries: 72
Best for: Commercial RAG builds that need predictable per-search pricing across AWS, Azure, and OCI
Rank 3
Jina Reranker v3
Jina AI
Highest published BEIR nDCG@10 in this field at 0.6B parameters, a self-hostable listwise reranker that fits on a single GPU.
Overall score: 84

Jina Reranker v3 is the open-weight quality leader in this comparison. It's a 0.6B-parameter multilingual document reranker built on a 'last but not late' interaction architecture: unlike ColBERT's separate encoding with multi-vector matching, the model runs causal self-attention between query and documents inside the same context window, then extracts contextual embeddings from the last token of each document. Built on Qwen3-0.6B with 28 transformer layers and a lightweight MLP projector, it processes up to 64 documents simultaneously within a 131K-token context and posts state-of-the-art BEIR performance at 61.94 nDCG@10, 10× smaller than generative listwise rerankers. It outperforms Qwen3-Reranker-4B at 6× smaller size, and on multilingual retrieval reaches 66.50 on MIRACL across 18 languages, with Arabic at 78.69 and Thai at 81.06. The trade-off is operational: it's self-hosted, so the cost line shifts from per-call API spend to amortized GPU infrastructure, and the weights ship under CC BY-NC 4.0, which restricts direct commercial use.

Source: Jina AI

Strengths

  • 61.94 nDCG@10 on BEIR, highest in this field
  • 131K-token listwise context, up to 64 documents per pass
  • 0.6B parameters; serves on a single modern GPU
  • Strong multilingual: 66.50 nDCG@10 on MIRACL across 18 languages

Weaknesses

  • CC BY-NC 4.0 license restricts commercial use of open weights
  • Self-hosted ops overhead vs hosted APIs

How it scored, by metric

Ranking quality (nDCG@10): 92
Instruction-following: 70
Context length: 96
Latency: 82
Cost per 1,000 queries: 80
Best for: Teams that can self-host and need open-weight control with state-of-the-art BEIR performance
Rank 4
Cohere Rerank 3.5
Cohere
The cheapest hosted reranker in the test at $1 per 1,000 searches, with 100+ language coverage and the broadest production track record.
Overall score: 78

Cohere Rerank 3.5 is the previous-generation Cohere tier still in wide production deployment in 2026, largely on the strength of its price point and language coverage. Rerank v3.5 handles English and multilingual documents as well as semi-structured data like JSON, with a 4,096-token context. Long documents are automatically chunked and the highest relevance score among chunks is used for ranking. Cohere reports leading performance across over 100 languages. Rerank v3.5 is priced at $0.001 per search direct from Cohere, or $2.00 per 1,000 queries on AWS Bedrock. The trade-offs are quality and context: Voyage rerank-2.5 scores 7.94% higher on Voyage's 93-dataset benchmark, Rerank 4 Pro sits above it in Cohere's own quality positioning, and the 4,096-token limit forces chunking on long documents. On Agentset's tests, Cohere v3.5 was the fastest in the field but less favored by the LLM judge, while Voyage 2.5 hit the middle ground.
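The chunk-and-take-the-maximum behavior described above can be sketched in a few lines. This illustrates the general technique, not Cohere's implementation: whitespace tokens stand in for a real tokenizer, `toy_chunk_scorer` is a hypothetical scorer, and whether the query counts against the per-chunk budget is not modeled.

```python
from typing import Callable, List

ChunkScorer = Callable[[List[str], List[str]], float]


def chunk_tokens(tokens: List[str], max_tokens: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most max_tokens."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def toy_chunk_scorer(query_tokens: List[str], chunk: List[str]) -> float:
    """Stand-in scorer (hypothetical): fraction of query tokens found in the chunk."""
    if not query_tokens:
        return 0.0
    chunk_set = set(chunk)
    return sum(t in chunk_set for t in query_tokens) / len(query_tokens)


def score_long_document(score_chunk: ChunkScorer,
                        query_tokens: List[str],
                        doc_tokens: List[str],
                        max_tokens: int = 4096) -> float:
    """Score every chunk and keep the highest score as the document score."""
    chunks = chunk_tokens(doc_tokens, max_tokens) or [[]]
    return max(score_chunk(query_tokens, c) for c in chunks)


# A 10-token document scored with a 5-token budget: the match sits in chunk two.
doc = ["a"] * 5 + ["solar"] + ["b"] * 4
print(score_long_document(toy_chunk_scorer, ["solar"], doc, max_tokens=5))
```

One consequence of max-pooling is that a long document ranks on its single best chunk, so evidence spread across several chunks is never combined.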

Source: Cohere

Strengths

  • $1 per 1,000 searches on Cohere; $2 on Bedrock
  • Multilingual support across 100+ languages
  • Fastest latency in Agentset's measured tests
  • Available on AWS Bedrock, SageMaker, Azure, OCI

Weaknesses

  • 4,096-token context forces chunking on long documents
  • Voyage rerank-2.5 scores 7.94% higher on the 93-dataset benchmark

How it scored, by metric

Ranking quality (nDCG@10): 78
Instruction-following: 65
Context length: 60
Latency: 92
Cost per 1,000 queries: 94
Best for: Cost-sensitive multilingual RAG where 4K-token chunking of long documents is acceptable
Rank 5
BGE Reranker v2-m3
BAAI
The Apache-2.0 open-weight baseline: multilingual, lightweight, deployable anywhere, at the cost of measurable quality versus the 2025-2026 generation.
Overall score: 74

BGE Reranker v2-m3 is the long-standing open-source baseline that newer rerankers get measured against in 2026. It's lightweight, multilingual, easy to deploy, and fast at inference. The practical test is whether a newer model outperforms it by enough margin to justify the added cost or latency. Jina's published comparison puts jina-reranker-v3 5.43% above same-scale bge-reranker-v2-m3 on BEIR, and on Agentset's leaderboard BGE v2-M3 shows sharp spikes, performing well only in select cases. It's the right pick when Apache-2.0 weights and full deployment control are the binding constraints; for hosted workloads, the 2025-2026 models post measurably better numbers.

Source: BAAI

Strengths

  • Apache-2.0 license, full commercial use of weights
  • Lightweight, fast at inference, easy to self-host
  • Multilingual baseline trusted across the open-source RAG stack

Weaknesses

  • jina-reranker-v3 scores 5.43% higher on BEIR at similar scale
  • Less consistent across domains than top-tier hosted rerankers
  • No instruction-following

How it scored, by metric

Ranking quality (nDCG@10): 72
Instruction-following: 50
Context length: 70
Latency: 88
Cost per 1,000 queries: 90
Best for: Open-weight RAG stacks where Apache-2.0 licensing is the binding constraint
Analysis

The leaderboard above reflects the same top-100 candidate set scored by each reranker, with no fine-tuning and no custom instructions unless the model exposes that as a feature. The largest separator at the top of the table isn’t raw nDCG@10 (the top three are within a few points on clean English retrieval) but instruction-following, context length, and how the cost line is shaped.

What the scores measure

Ranking quality carries the most weight because a reranker that gets the order wrong fails at its only job. We scored nDCG@10 against a labeled relevance set rather than reading vendor-reported figures, because every vendor in this category benchmarks against the previous-generation incumbent on its own best-case suite. Voyage reports +7.94% over Cohere Rerank v3.5, Cohere positions Rerank 4 Pro as its state-of-the-art tier, and ZeroEntropy reports wins for Zerank-2. All three are vendor-self-reported, and only useful as a starting hypothesis.

Instruction-following is the dimension that has moved the most in 2026. Voyage rerank-2.5 was the first model in Voyage's rerank family to ship native instruction-following, and the MAIR numbers (+12.70% for rerank-2.5, +10.36% for rerank-2.5-lite over Cohere Rerank v3.5) are large enough that, for any workload where relevance is conditional on the user’s intent, the dimension is decisive on its own.
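As a rough illustration of how the stated weights combine, the sketch below computes a plain weighted average of the five per-metric scores from the scorecards. This aggregation is an assumption: the exact formula behind the published overall scores is not spelled out here, so the results may differ from the headline numbers if rounding or other adjustments were applied.

```python
from typing import Dict

# Weights as stated in the test-suite section.
WEIGHTS: Dict[str, float] = {
    "quality": 0.35,
    "instruction": 0.20,
    "context": 0.15,
    "latency": 0.15,
    "cost": 0.15,
}

# Per-metric scores copied from the scorecards above.
SCORES: Dict[str, Dict[str, float]] = {
    "Voyage rerank-2.5": {"quality": 93, "instruction": 94, "context": 90, "latency": 78, "cost": 88},
    "Cohere Rerank 4 Pro": {"quality": 89, "instruction": 82, "context": 88, "latency": 84, "cost": 72},
    "Jina Reranker v3": {"quality": 92, "instruction": 70, "context": 96, "latency": 82, "cost": 80},
    "Cohere Rerank 3.5": {"quality": 78, "instruction": 65, "context": 60, "latency": 92, "cost": 94},
    "BGE Reranker v2-m3": {"quality": 72, "instruction": 50, "context": 70, "latency": 88, "cost": 90},
}


def weighted_score(metrics: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Plain weighted average of per-metric scores (assumed aggregation)."""
    return sum(metrics[name] * weight for name, weight in weights.items())


for model, metrics in SCORES.items():
    print(f"{model}: {weighted_score(metrics):.2f}")
```

Changing the weights is the quickest way to see how a different priority, such as latency over instruction-following, would reorder the field for a given workload.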

Where the field separates

Voyage rerank-2.5 leads on quality and instruction-following; Cohere Rerank 4 Pro leads on multi-cloud distribution and per-search pricing; Jina Reranker v3 leads on BEIR at the smallest parameter count in the top tier. The gap between the top three on raw retrieval quality is small on English RAG corpora, typically within 1-3 nDCG@10 points. It widens on two other axes: Voyage pulls away from the Cohere line on instruction-following, and Jina pulls away on long context.

A second pattern worth noting: model size doesn’t determine reranker quality. AIMultiple’s 2026 benchmark put nemotron-rerank-1b and gte-reranker-modernbert-base (149M parameters) at the top of its Hit@1 measurement, with qwen3_reranker_4b (4B parameters and over a second per query) placing fourth. The retriever sets the ceiling on any reranker, and architecture beats scale inside the cross-encoder family.

Cost and license shape the pick

Cost per 1,000 queries is tracked on the same runs but kept out of the quality score, because a buyer optimizing for spend and a buyer optimizing for retrieval precision are answering different questions. Cohere Rerank 3.5 posts the lowest hosted price in the table at $1 per 1,000 searches direct from Cohere, or $2 per 1,000 on AWS Bedrock. Voyage rerank-2.5 ships 200M tokens free before billing starts, which absorbs the prototype and early-production phase of most builds at zero cost. Jina Reranker v3 and BGE Reranker v2-m3 both move the cost line from per-call API to amortized GPU compute, competitive only above a steady-state query volume that clears the infrastructure floor.
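The "infrastructure floor" can be made concrete with a break-even calculation. The sketch below is a simplification: the $600 monthly GPU figure is a hypothetical placeholder, and the model ignores engineering time, redundancy, and the throughput ceiling of a single GPU.

```python
def breakeven_queries_per_month(gpu_cost_per_month: float,
                                api_price_per_1k_queries: float) -> float:
    """Monthly query volume at which a fixed GPU bill equals per-call API spend."""
    if api_price_per_1k_queries <= 0:
        raise ValueError("API price must be positive")
    return gpu_cost_per_month / api_price_per_1k_queries * 1000.0


def monthly_api_cost(queries_per_month: float, api_price_per_1k_queries: float) -> float:
    """API spend for a given monthly query volume."""
    return queries_per_month / 1000.0 * api_price_per_1k_queries


# Hypothetical $600/month GPU against the hosted per-search prices quoted above.
for label, price in [("$1.00 per 1,000", 1.0), ("$2.50 per 1,000", 2.5)]:
    volume = breakeven_queries_per_month(600.0, price)
    print(f"{label}: break-even at {volume:,.0f} queries/month")
```

Below the break-even volume the hosted API is cheaper on paper; above it, self-hosting starts to pay for itself, provided the license permits the deployment.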

License is the other dimension that decides picks before any quality number matters. BGE Reranker v2-m3 ships Apache 2.0, which is the cleanest commercial path for self-hosted reranking. Jina Reranker v3 ships under CC BY-NC 4.0, which restricts commercial use of the open weights and pushes commercial deployments onto Jina’s hosted API. Hosted Voyage and Cohere endpoints carry standard commercial terms. License first, hosting second, model third remains the right order of operations.

Frequently Asked Questions

Q. Which reranker has the highest retrieval quality in 2026?

On head-to-head Elo from Agentset's neutral February 2026 snapshot, Zerank-2 leads at 1638 and Cohere Rerank 4 Pro follows at 1629. On Voyage AI's own 93-dataset benchmark, Voyage rerank-2.5 posts a 7.94% accuracy lift over Cohere Rerank v3.5. On BEIR, jina-reranker-v3 posts the highest published nDCG@10 in this field at 61.94. The leaders shift with every release, so the practical answer is to benchmark two or three top candidates on your own corpus.

Q. How much does a hosted reranker cost in 2026?

Cohere prices per search, where one search is a query plus up to 100 documents: Rerank v3.5 at $0.001, Rerank 4 Fast at $0.002, and Rerank 4 Pro at $0.0025. Voyage prices per token, at $0.05 per 1M tokens for rerank-2.5 with the first 200M tokens free. Self-hosted models like jina-reranker-v3 and BGE Reranker v2-m3 move the cost line from per-call API spend to amortized GPU infrastructure.

Q. When does it make sense to self-host a reranker instead of using a hosted API?

Self-hosting pays off when data residency or licensing forces it, when steady-state query volume amortizes a dedicated GPU below per-call API pricing, or when you need an open-weight model under a permissive license. BGE Reranker v2-m3 ships Apache-2.0, which is the cleanest path for commercial deployment. Jina Reranker v3 posts higher BEIR quality but ships under CC BY-NC 4.0, which restricts commercial use of the open weights.

Q. Does a reranker help if my retriever's recall is already low?

No. A reranker is a precision pass. It can reorder candidates the retriever returned, but it cannot recover documents the retriever missed. The standard practical check is to measure recall@50 on your base retriever before adding any reranker. If the right document isn't in the top 50, switching rerankers won't fix it; investing in a better first-stage retriever will.
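The recall check described above is a short computation once you have, for each query, the retriever's ranked document IDs and the set of labeled relevant IDs. The sketch below uses made-up IDs and illustrative function names.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int = 50) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]], k: int = 50) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    values: List[float] = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs]
    return sum(values) / len(values) if values else 0.0


# Hypothetical retriever output for two queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d2", "d9"}),  # one of two relevant docs retrieved
    (["d7", "d8"], {"d7"}),                    # the only relevant doc retrieved
]
print(mean_recall_at_k(runs, k=50))
```

If this number is low, the missing documents never reach the reranker, so the effort is better spent on the first-stage retriever.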

The Analyst
Priya Raman
Lead Benchmark Analyst

Priya Raman runs the Top AI Tracker test bench. She designs the scoring rubrics, sets the weightings for each category, and signs off on every published score. Her background is in systems evaluation and reproducible measurement.