Agents Leaderboard

Best AI Deep Research Agents, Ranked by Report Quality and Workflow

We tested five long-running research agents on the same set of investigations, scoring each on report quality, citation reliability, source coverage, throughput, and cost per report.

Tested by Hana Koizumi, Multimodal & Tooling Analyst · Updated June 8, 2026 · 5 products ranked

The Verdict

ChatGPT Deep Research produces the strongest standalone reports on hard, multi-source investigations and is the right default for analysts whose work product is a long structured brief. Gemini Deep Research is the best all-around pick for buyers already inside Google Workspace, with the most generous paid quota in the test. Elicit is the choice when the corpus is peer-reviewed literature; Perplexity Deep Research wins on turnaround and free-tier access; Grok DeepSearch sits behind the field on report depth but is the cheapest path to live-web synthesis at scale.

Five long-running "deep research" agents, one fixed set of investigations, one ranking. We picked the products most analysts and researchers actually shortlist when they want an AI that plans a multi-step search, reads dozens of sources, and returns a sourced report instead of a chat reply. Prompts were held constant so the differences on the table trace to the agents rather than the questions.

Every agent ran the same three briefs: a competitive-landscape report on a mid-cap SaaS category, a regulatory-history question requiring primary-document retrieval, and a scientific-literature synthesis across roughly 30 peer-reviewed papers. We scored report quality, citation reliability, source coverage, and throughput on a single ground-truth rubric per brief, with cost per report tracked alongside but kept out of the quality score.

The test suite · 5 measured metrics

Each agent ran the same three briefs at default settings on the lowest paid tier that exposes the deep-research feature (ChatGPT Plus, Google AI Pro, Perplexity Pro, Elicit Plus, SuperGrok). Report quality was scored against a human-built ground-truth answer key for each brief. Citation reliability was scored by opening every cited URL and checking whether the source actually supported the attributed claim. Pricing was verified against each vendor's official pricing page in May–June 2026.

Report quality

Each agent produced one report per brief. Reports were scored on a rubric of completeness, factual accuracy, structure, and analytical depth against a human-built answer key, with two reviewers scoring blind to the agent name and the scores averaged. Weighted 30%.

Citation reliability

We opened every citation in every report (mean: 38 citations per report) and checked whether the linked source actually supported the attributed claim. The metric is the share of citations that passed, converted to a 0-100 score. Industry context: published evaluations of deep-research agents commonly find 10-25% of citations either misattributed or pointing to a source that does not say what the report claims. Weighted 25%.
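As a rough illustration of the metric, the sketch below turns a list of per-citation pass/fail checks into a 0-100 score. The function name and the sample pass count are hypothetical; only the 38-citation mean comes from the test.

```python
def citation_reliability(checks: list[bool]) -> float:
    """Share of citations whose source supported the claim, as a 0-100 score."""
    if not checks:
        raise ValueError("a report with no citations cannot be scored")
    return round(100 * sum(checks) / len(checks), 1)


# Hypothetical report: 38 citations opened, 32 supported the attributed claim.
sample_checks = [True] * 32 + [False] * 6
sample_score = citation_reliability(sample_checks)
print(sample_score)  # 84.2
```

Each entry in the list stands for one opened URL, so the score is only as good as the manual check behind each True or False.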

Source coverage

We counted the unique credible domains cited per report on the competitive-landscape brief, after applying a domain-quality filter that excludes SEO content farms and AI-generated aggregator pages. The metric rewards agents that read deeper into primary sources rather than recycling the same top-10 search results. Weighted 15%.
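A minimal sketch of how a unique-domain count with a quality filter might work, using only the Python standard library. The blocklist entries and sample URLs are placeholders; the actual domain-quality filter used in the test is not published here.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist standing in for the domain-quality filter.
LOW_QUALITY_DOMAINS = {"contentfarm.example", "ai-aggregator.example"}


def domain_of(url: str) -> str:
    """Return the lowercased host of a URL with a leading 'www.' removed."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def unique_credible_domains(cited_urls, blocklist=LOW_QUALITY_DOMAINS):
    """Collapse cited URLs to unique domains, then drop blocklisted ones."""
    domains = {domain_of(u) for u in cited_urls}
    domains.discard("")  # ignore strings that had no parseable host
    return domains - set(blocklist)


# Placeholder citations: two pages on one domain count once.
cited = [
    "https://www.sec.example/filings/10-k",
    "https://sec.example/filings/8-k",
    "https://vendor.example/pricing",
    "https://contentfarm.example/top-10-saas-tools",
]
coverage = unique_credible_domains(cited)
print(len(coverage))  # 2
```

Counting domains rather than URLs is what keeps an agent from scoring well by citing many pages from the same few sites.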

Turnaround

Wall-clock time from prompt submission to delivered report, averaged across three runs per brief per agent. Reported in minutes. Weighted 15%.

Cost per report

Effective dollar cost per deep-research report at each vendor's lowest paid individual plan that exposes the feature, calculated by dividing the monthly subscription by the monthly research-session quota published on the vendor's pricing or help-center page. Reported alongside the quality score, never folded into it. Weighted 15%.
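The division described above can be sketched as follows, using the prices and quotas quoted in this article. The 30-day month used to convert daily quotas is an assumption, as are the class and function names. SuperGrok is left out because its quota is described only as effectively unlimited. How dollar figures map onto the 0-100 cost score is not specified, so the sketch stops at dollars.

```python
from dataclasses import dataclass

DAYS_PER_MONTH = 30  # assumption: conversion of daily quotas is not specified


@dataclass(frozen=True)
class Plan:
    name: str
    monthly_price: float  # USD per month
    quota: int            # research sessions per period
    period: str           # "month" or "day"

    def monthly_quota(self) -> int:
        """Sessions available per month if the full quota is used."""
        return self.quota * DAYS_PER_MONTH if self.period == "day" else self.quota


def cost_per_report(plan: Plan) -> float:
    """Monthly subscription divided by the monthly research-session quota."""
    return plan.monthly_price / plan.monthly_quota()


# Prices and quotas as quoted in the article.
PLANS = [
    Plan("ChatGPT Plus", 20.00, 10, "month"),
    Plan("Google AI Pro", 19.99, 20, "day"),
    Plan("Perplexity Pro", 20.00, 20, "day"),
    Plan("Elicit Plus", 12.00, 4, "month"),
]

for plan in PLANS:
    print(f"{plan.name}: ${cost_per_report(plan):.2f} per report")
```

These are floor prices: they assume every session in the quota is used, so a light user's real cost per report will be higher.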

The Ranking

Rank 1

ChatGPT Deep Research

OpenAI

Highest report-quality score in the test on the hardest briefs, with the deepest source coverage and a 10-session monthly quota on the entry paid plan.

ChatGPT Deep Research is OpenAI's autonomous research mode, available to Plus subscribers at $20/month with a quota of 10 sessions per month. Reports cluster in the 20-30 page range with structured citations, and on the hardest briefs in our suite it produced the most complete and best-organized synthesis of the field. The trade-offs are quota and price-per-report: the 10-session monthly cap is the limit most Plus users hit first, and the $100/month Pro tier (50 sessions) or $200/month Pro tier (250 sessions) is the only way past it for heavy users.

Source: OpenAI

Strengths

Highest report-quality score in the test on the competitive-landscape and regulatory briefs
Deepest unique-source coverage of any agent tested
Three paid tiers (Plus, $100 Pro, $200 Pro) scale the monthly quota from 10 to 250 sessions

Weaknesses

Plus quota of 10 sessions per month runs out fast for daily research use
Citation reliability lagged Elicit on the scientific-literature brief

How it scored, by metric

Report quality 92

Citation reliability 84

Source coverage 93

Turnaround 78

Cost per report 70

Best for: Analysts whose work product is a long, structured brief on a hard question

Rank 2

Gemini Deep Research

Google

Best all-around pick for Google Workspace teams, with the most generous paid quota in the test at 20 Deep Research sessions per day on AI Pro.

Gemini Deep Research is Google's autonomous research mode, available on the $19.99/month Google AI Pro plan with a quota of 20 Deep Research sessions per day, plus a 1M-token context window and tight integration with Gmail, Docs, and Drive. Reports were a close second to ChatGPT on the competitive-landscape brief and matched it on the regulatory brief, with throughput-per-dollar among the strongest in the test. The free Gemini tier also exposes Deep Research at 5 reports per month.

Source: Google

Strengths

20 Deep Research sessions per day on the $19.99/month AI Pro plan
1M-token context window for working over long documents
Free tier exposes Deep Research at 5 reports per month

Weaknesses

Report structure trailed ChatGPT on the hardest competitive-landscape brief
Source coverage skewed toward Google-indexed web at the expense of primary documents

How it scored, by metric

Report quality 87

Citation reliability 83

Source coverage 86

Turnaround 86

Cost per report 92

Best for: Workspace-anchored teams that need volume and tight Gmail/Docs/Drive handoff

Rank 3

Elicit

Elicit (Ought)

Highest citation-reliability score in the test on the scientific-literature brief, with structured data extraction across 138M+ papers.

Elicit is a purpose-built academic research assistant that searches across more than 138 million papers, extracts structured data into tables, and supports PRISMA-compliant systematic reviews. On the scientific-literature brief it posted the highest citation-reliability score in the test and the cleanest data-extraction tables. Elicit's own evaluation reports 95% search recall, 97% abstract screening, 99% full-text screening, and 96% extraction across 994 Cochrane reviews. It is not the right tool for general competitive or regulatory research, where the corpus is the open web rather than peer-reviewed literature.

Source: Elicit (Ought)

Strengths

Highest citation reliability in the test on peer-reviewed literature
Structured data extraction into PRISMA-compliant tables
Plus plan at $12/month with 4 automated reports per month (48 per year on annual)

Weaknesses

Narrow scope: weakest on briefs where the answer lives outside academic literature
Free Basic tier is capped at 2 automated reports per month

How it scored, by metric

Report quality 82

Citation reliability 95

Source coverage 78

Turnaround 80

Cost per report 88

Best for: Literature reviews, systematic reviews, and evidence synthesis in academic and biomedical work

Rank 4

Perplexity Deep Research

Perplexity

Fastest end-to-end turnaround in the test, with the most generous deep-research quota at the $20/month tier and a usable free tier.

Perplexity Deep Research runs an agentic multi-step research pass on top of the Sonar search stack, available on Pro at $20/month with 20 Deep Research queries per day, and on the free tier at 5 per day. Reports finished fastest in the test (2-4 minutes per query is the typical range), and the Pro tier also lets you choose between GPT-5.2, Claude 4.6 Sonnet, Gemini 3.1 Pro, Grok, and Perplexity's Sonar for the underlying model. Report quality and source coverage trailed the top two on the competitive-landscape brief, and the inline-citation density meant a higher share of citations did not directly support the attributed claim.

Source: Perplexity

Strengths

Fastest turnaround in the test
20 Deep Research queries per day on the $20/month Pro plan
Free tier exposes Deep Research at 5 queries per day

Weaknesses

Citation reliability lagged ChatGPT, Gemini, and Elicit
Report depth on the competitive-landscape brief trailed the top two

How it scored, by metric

Report quality 79

Citation reliability 76

Source coverage 80

Turnaround 94

Cost per report 93

Best for: High-volume research where turnaround and quota matter more than report depth

Rank 5

Grok DeepSearch

xAI

Cheapest path to high-volume live-web synthesis with effectively unlimited DeepSearch on SuperGrok, with the weakest report depth in the test.

Grok DeepSearch is xAI's research mode, available on SuperGrok at $30/month (or $300/year) with effectively unlimited DeepSearch use and a 128K context window, plus Big Brain mode for extended reasoning. On the briefs that benefited from live X/Twitter data the source mix was distinctive, but on the competitive-landscape and regulatory briefs the reports were shorter and less structured than the top four, and citation reliability trailed the field. It is the right call when the workflow is high-volume real-time monitoring and a weaker call than ChatGPT or Gemini for set-piece analytical briefs.

Source: xAI

Strengths

Effectively unlimited DeepSearch use on the $30/month SuperGrok plan
Distinctive live X/Twitter source coverage on real-time briefs
Lower API cost than ChatGPT or Claude for programmatic research workflows

Weaknesses

Lowest report-quality score in the test on analytical briefs
Citation reliability trailed the field on the regulatory brief

How it scored, by metric

Report quality 71

Citation reliability 70

Source coverage 74

Turnaround 82

Cost per report 90

Best for: High-volume live-web monitoring where set-piece report depth is not the priority

Analysis

The ranking above reflects the same three briefs run through each agent at default settings on the lowest paid tier that exposes the feature. The single largest separator at the top of the table isn’t raw report length (every agent in the top four produced 15+ pages on the competitive-landscape brief) but citation reliability and the structural quality of the synthesis.

What the scores measure

Report quality carries the most weight because a sourced report that misreads the field isn’t a research deliverable. We scored it against a human-built ground-truth answer key for each brief rather than against vendor-reported figures, with two reviewers scoring blind to the agent name and their scores averaged. Vendors in this category routinely advertise accuracy on their own best-case prompts; independent measurement on identical briefs is the only way to compare like with like.
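The exact aggregation formula is not spelled out, so the sketch below shows one possible reading: a weighted mean of the four non-cost metrics, using the weights from the test-suite section and the per-metric scores from the ranking cards, with cost per report left out as described. The shortened metric keys and the `composite` function are illustrative names, not part of the published method.

```python
# Metric weights in percent, as listed in the test-suite section (cost excluded).
WEIGHTS = {"quality": 30, "citations": 25, "coverage": 15, "turnaround": 15}

# Per-metric scores as published in the ranking cards.
SCORES = {
    "ChatGPT Deep Research": {"quality": 92, "citations": 84, "coverage": 93, "turnaround": 78},
    "Gemini Deep Research": {"quality": 87, "citations": 83, "coverage": 86, "turnaround": 86},
    "Elicit": {"quality": 82, "citations": 95, "coverage": 78, "turnaround": 80},
    "Perplexity Deep Research": {"quality": 79, "citations": 76, "coverage": 80, "turnaround": 94},
    "Grok DeepSearch": {"quality": 71, "citations": 70, "coverage": 74, "turnaround": 82},
}


def composite(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted mean of the metric scores, renormalized over the weights used."""
    total_weight = sum(weights.values())
    return sum(scores[metric] * w for metric, w in weights.items()) / total_weight


# Sort agents from highest to lowest composite score.
ranking = sorted(SCORES, key=lambda agent: composite(SCORES[agent]), reverse=True)
for agent in ranking:
    print(f"{agent}: {composite(SCORES[agent]):.1f}")
```

Under this reading the ordering comes out as ChatGPT, Gemini, Elicit, Perplexity, Grok, which matches the published ranking.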

Where the field separates

ChatGPT and Gemini lead the table on analytical report quality; Elicit leads on citation reliability inside its corpus; Perplexity leads on turnaround and quota. The gap between the top two and the rest is small on the regulatory brief, where the answer is a small number of primary documents, and widens on the competitive-landscape brief, where the agent has to plan a multi-step search, weigh sources of different quality, and assemble a structured synthesis. Grok DeepSearch’s distinctive live X/Twitter source mix is real, but on set-piece analytical briefs it produced shorter, less-structured reports than the top four.

Cost, quota, and corpus

Cost per report is tracked on the same runs but kept out of the quality score, because a buyer optimizing for spend and a buyer optimizing for report depth are answering different questions. Gemini Deep Research posts one of the strongest cost-per-report positions at the entry tier on quota alone (20 per day on AI Pro), and Perplexity Pro matches that quota at effectively the same price ($20 versus $19.99 per month). ChatGPT Plus is the most expensive per report at the $20 tier because of the 10-session monthly cap, and the natural upgrade path for heavy users is the $100/month Pro tier with a 50-session quota. Corpus is the other dimension that doesn’t show up in the headline score: Elicit’s 138M-paper academic index, Perplexity’s Sonar live-web stack, and Grok’s X/Twitter integration each define what an agent can and cannot see, and that single fact will decide the pick for many buyers before any quality number matters.

Frequently Asked Questions

Q.Which AI deep research agent produced the highest-quality reports?

ChatGPT Deep Research posted the highest report-quality score in our test on the hardest competitive-landscape and regulatory briefs, with the deepest unique-source coverage of any agent. The trade-off is quota: the Plus plan at $20/month exposes 10 Deep Research sessions per month, which is the limit most Plus users hit first, and the $100/month Pro tier (50 sessions) or $200/month Pro tier (250 sessions) is the only way past it.

Q.What is the best deep research tool for academic literature reviews?

Elicit is the strongest pick when the corpus is peer-reviewed literature. It searches 138 million-plus papers, extracts structured data into PRISMA-compliant tables, and posted the highest citation-reliability score in the test on the scientific-literature brief. Elicit's own published evaluation reports 95% search recall, 97% abstract screening, 99% full-text screening, and 96% extraction across 994 Cochrane reviews. It isn't the right tool for general competitive or regulatory research, where the answer lives outside academic literature.

Q.Which deep research tool gives the most usage for the money?

Gemini Deep Research offers 20 Deep Research sessions per day on the $19.99/month Google AI Pro plan, tied for the most generous capped paid quota in the test. Perplexity Deep Research matches it at 20 Deep Research queries per day on the $20/month Pro plan, and also exposes the feature on its free tier at 5 queries per day. Both are meaningfully cheaper per report than ChatGPT Plus, with its 10-session monthly cap.

Q.Can I trust the citations in an AI deep research report?

Not without checking them. Across our test, citation reliability ranged from 95% (Elicit on peer-reviewed literature) down to about 70% (Grok DeepSearch on the regulatory brief). The citation URLs are usually real, but the sources sometimes don't support the claims attributed to them, and that's a limitation of how current agents work rather than a failing of one specific tool. Open the key sources before citing a deep research report in work that matters.

The Analyst

Hana Koizumi

Multimodal & Tooling Analyst

Hana Koizumi evaluates image, audio, and agentic tool use. She writes the task suites that probe vision and function-calling reliability, and she scores how a product behaves when it has to act, not just answer.
