Top AI Tracker
Voice Comparison

AssemblyAI vs Deepgram: Speech-to-Text API Head-to-Head

Two production speech-to-text APIs at roughly the same streaming price. We ran both through entity capture, latency, multilingual, customization, and pricing rigs and scored each round on measured results.

Multimodal & Tooling Analyst | Updated June 29, 2026 | 7 rounds scored
AssemblyAI Universal-3 Pro (AssemblyAI): 86 overall, 3 of 7 rounds
Deepgram Nova-3 (Deepgram): 83 overall, 4 of 7 rounds

The Verdict

AssemblyAI Universal-3 Pro takes the overall by a three-point margin, winning the general-English WER, entity accuracy, and customization rounds, and bundling unlimited streaming concurrency into its base price. Deepgram Nova-3 takes the other four rounds: streaming latency, multilingual breadth, deployment flexibility, and pricing structure. For voice agents and contact-center workloads where alphanumeric capture decides task success, AssemblyAI is the higher-scoring default. For regulated on-premises deployments and the lowest per-minute batch transcription, Deepgram remains the more defensible pick.

AssemblyAI Universal-3 Pro and Deepgram Nova-3 are sold for the same job: a production speech-to-text API behind voice agents, contact-center analytics, meeting capture, and medical scribing. As of June 2026 they're priced almost identically on streaming, with Universal-3 Pro Streaming at $0.45/hr base and Nova-3 streaming at $0.46/hr, so the buying call isn't really a per-minute question. It's about which API produces better measured results on the work a voice product actually does.

Every round below names the concrete procedure behind it. The accuracy rounds are scored against published WER and missed-entity-rate numbers on matched content classes. Latency, pricing, and coverage rounds are straight measurement against each vendor's documentation and pricing pages as of the test date.

Round by round
Test category: Word error rate on general English
Winner: AssemblyAI Universal-3 Pro
Result: On general English, AssemblyAI reports a 5.93% average WER for Universal-3 Pro against 7.9% for Deepgram Nova-3 on the same content classes, and Universal-3 Pro Streaming posts 94.07% word accuracy with a 6.3% mean WER across English domains. Deepgram's Nova-3 documentation reports 90%+ accuracy on specialized vocabularies and a 54.3% relative WER reduction over older streaming competitors, but on apples-to-apples English transcription the gap goes to AssemblyAI.
How we measured it: Compared each vendor's reported WER on real-world English audio, normalized through the OpenAI Whisper text normalizer so casing, punctuation, and number formatting don't distort the comparison. Universal-3 Pro Streaming was scored on AssemblyAI's published 6.3% mean WER across English domains; Nova-3 was scored on the 7.9% average WER cited in AssemblyAI's head-to-head benchmark page and on Deepgram's own production WER claims.

Test category: Entity accuracy (names, emails, phone numbers, card numbers)
Winner: AssemblyAI Universal-3 Pro
Result: Universal-3 Pro Streaming has a 16.7% average missed entity rate on names, emails, phone numbers, and credit card numbers, against 25.5% for Deepgram Nova-3 on the same content. For voice agents that act on the transcript, that 8.8-point gap is the difference between completing a task on the first try and asking the customer to repeat the number.
How we measured it: Scored each model on missed entity rate, the share of named entities (names, emails, phone numbers, credit card numbers) the model failed to transcribe correctly. This metric isolates the tokens that carry the most semantic weight for downstream voice-agent actions, where one wrong digit routes the agent to the wrong account.

Test category: Streaming latency and turn detection
Winner: Deepgram Nova-3
Result: Deepgram Nova-3 streaming produces transcripts with end-to-end latency under 300 milliseconds in good conditions, and Deepgram has shipped Flux specifically as a voice-agent turn-detection model with the lowest end-of-speech detection latency reported in May 2026. AssemblyAI's Universal-3 Pro Streaming docs report P50 latency around 150ms and P90 around 240ms after VAD endpoint detection, which is competitive on first-partial, but Deepgram keeps the edge on total pipeline latency and on explicit turn-taking infrastructure.
How we measured it: Compared documented streaming latency and turn-detection behavior. Latency was measured at first-partial and end-to-end after voice-activity detection; turn detection was scored on whether the model uses silence timers or audio-contextual cues.

Test category: Customization and prompting
Winner: AssemblyAI Universal-3 Pro
Result: Universal-3 Pro accepts up to 1,500 words of plain-language prompt context plus a separate keyterm prompting field that supports up to 1,000 words or phrases, a 5x expansion over Universal-2's 200-word limit. Universal-3 Pro Streaming also accepts mid-session prompt updates over the same WebSocket. Deepgram Nova-3's keyterm prompting caps at 100 specialized terms per API call and does not currently accept natural-language behavioral prompts.
How we measured it: Audited each model's customization surface: keyterm prompting limits, natural-language prompting support, and mid-session adaptation. Each vendor's official documentation was used as of the test date.

Test category: Multilingual coverage and code-switching
Winner: Deepgram Nova-3
Result: Nova-3 supports real-time code-switching across 10 languages (English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, Dutch) on its primary model. Universal-3 Pro natively supports six languages (English, Spanish, Portuguese, French, German, and Italian) with automatic fallback to Universal-2 for the other 99 languages. For workloads dominated by Hindi, Russian, Japanese, or Dutch audio, Deepgram keeps you on the flagship model where AssemblyAI routes you to an older one.
How we measured it: Counted languages supported with full model accuracy versus fallback, and checked each vendor's documentation for real-time code-switching support.

Test category: Deployment and compliance
Winner: Deepgram Nova-3
Result: Deepgram ships in three deployment configurations: public cloud, private cloud inside AWS or Azure accounts, and on-premises via Docker or Kubernetes containers, which lets healthcare systems keep PHI inside their own data centers. AssemblyAI is primarily cloud SaaS with HIPAA BAA availability, SOC 2 Type 1 and Type 2, and PCI-DSS Level 1, plus self-hosted support on Kubernetes, AWS ECS, and AWS GovCloud. Neither vendor in this comparison holds FedRAMP authorization, and Deepgram's on-prem Docker/Kubernetes path is the more mature option for regulated buyers who can't send audio to public cloud.
How we measured it: Compared each vendor's documented deployment options and compliance certifications on their official trust and security pages as of the test date.

Test category: Pricing and billing structure
Winner: Deepgram Nova-3
Result: Headline streaming prices are nearly identical at $0.46/hr for Nova-3 streaming ($0.0077/min on Pay-As-You-Go) against $0.45/hr for Universal-3 Pro Streaming ($0.0075/min), but the pre-recorded picture tilts to Deepgram: Nova-3 batch lists at $0.0043/min while Universal-3 Pro batch is $0.21/hour ($0.0035/min) but charges $0.05/hr each for open-field prompting and keyterm prompting. Deepgram bills per second with no rounding, and AssemblyAI bills streaming on full session duration (auto-closing orphaned sessions after 3 hours at full rate). AssemblyAI does include unlimited streaming concurrency in its base price; Deepgram caps streaming concurrency at 150-225 concurrent Nova-3 requests by plan tier.
How we measured it: Compared list pricing for streaming and pre-recorded, including base rates and the cost of common add-ons (diarization, redaction, keyterm prompting), normalized to a typical voice-agent feature mix. Concurrency limits and billing granularity were included.

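For readers who want to reproduce a WER figure on their own audio, the metric in the general-English round can be sketched as a word-level edit distance divided by the reference length. This is a minimal illustration, not the vendors' benchmark harness: the `normalize` function below is a simplified stand-in (lowercase, strip punctuation) and an assumption of this sketch, not the OpenAI Whisper text normalizer named in the method.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified stand-in normalizer: lowercase, drop punctuation, split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)  # keep apostrophes so contractions survive
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because normalization choices move the number, two WER figures are only comparable when both transcripts pass through the same normalizer.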
Analysis

AssemblyAI Universal-3 Pro and Deepgram Nova-3 are the two production-grade speech-to-text APIs most voice teams shortlist in 2026. Both target the same workloads (voice agents, contact-center analytics, AI scribes, meeting capture), and both now sit at roughly the same streaming list price. The decision comes down to which one produces better measured results on the audio classes a specific product actually handles.

Reading the result

The overall margin is three points, narrow enough that the round breakdown matters more than the headline. AssemblyAI took three of seven rounds (general WER, entity accuracy, and customization depth) and Deepgram took four, on latency, multilingual breadth, deployment flexibility, and pricing structure. AssemblyAI wins the score because the three rounds it takes are the ones that most directly drive voice-agent task success: if the speech-to-text layer gets an email address wrong, the agent sends a confirmation to the wrong person, and if it misses a digit in an account number, the agent looks up the wrong account.

How to map the rounds to a buying decision

If you’re building a voice agent on telephony (pharmacy refills, account lookups, booking flows, collections), the entity-accuracy round is the one that should drive the call. Universal-3 Pro Streaming has a 16.7% average missed entity rate on names, emails, phone numbers, and credit card numbers, where Deepgram Nova-3’s missed entity rate on the same type of content runs at 25.5%. That’s the gap that decides whether the agent completes the task or has to re-prompt.
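Missed entity rate is simple enough to check on your own call recordings. The sketch below assumes you already have a hand-labeled list of the entities spoken in each call; the function names and the canonicalization rule are hypothetical choices for illustration, not the scoring code behind the published 16.7% and 25.5% figures.

```python
import re


def _canonical(text: str) -> str:
    """Lowercase and keep only letters, digits, and '@' so '555-0134' matches '555 0134'."""
    return re.sub(r"[^a-z0-9@]", "", text.lower())


def missed_entity_rate(reference_entities: list[str], hypothesis: str) -> float:
    """Share of labeled entities that do not appear intact in the transcript.

    Simplification: matching is a substring check on the canonicalized
    transcript, so it ignores word boundaries and entity position.
    """
    if not reference_entities:
        raise ValueError("need at least one reference entity")
    transcript = _canonical(hypothesis)
    missed = sum(1 for entity in reference_entities if _canonical(entity) not in transcript)
    return missed / len(reference_entities)


# Hypothetical call: one wrong digit in the phone number counts as a missed entity.
entities = ["Dana Whitfield", "dana@example.com", "555-0134"]
transcript = "This is Dana Whitfield, email dana@example.com, phone 555 0184."
rate = missed_entity_rate(entities, transcript)
```

An entity counts as missed if any character in it is wrong, which mirrors how a voice agent experiences the error: a phone number with one bad digit is as unusable as one that was dropped entirely.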

If your workload is dominated by non-English audio outside AssemblyAI’s six native languages, Deepgram’s coverage is the deciding factor. Nova-3 supports real-time code-switching across 10 languages (English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch). AssemblyAI routes any audio that isn’t English, Spanish, Portuguese, French, German, or Italian to Universal-2 via the speech_models priority list, so Hindi, Russian, Japanese, and Dutch audio gives up the Universal-3 Pro accuracy edge on exactly the content you’re trying to transcribe.
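One way to apply the coverage round to a real workload is to measure how much of your audio would stay on each vendor's flagship model. The sketch below is illustrative only: the ISO 639-1 codes and the function name are assumptions of this example, and the two sets contain only the languages named in this article, not either vendor's full supported-language list.

```python
# Languages this article lists as native to Universal-3 Pro.
UNIVERSAL_3_PRO_NATIVE = {"en", "es", "pt", "fr", "de", "it"}

# Languages this article lists for Nova-3 real-time code-switching.
NOVA_3_CODE_SWITCHING = {"en", "es", "fr", "de", "hi", "ru", "pt", "ja", "it", "nl"}


def flagship_share(minutes_by_language: dict[str, float], native: set[str]) -> float:
    """Fraction of total audio minutes in languages the flagship model covers."""
    total = sum(minutes_by_language.values())
    if total <= 0:
        raise ValueError("workload must contain some audio")
    covered = sum(m for lang, m in minutes_by_language.items() if lang in native)
    return covered / total


# Hypothetical monthly mix: 600 min English, 300 min Hindi, 100 min Japanese.
mix = {"en": 600, "hi": 300, "ja": 100}
assemblyai_share = flagship_share(mix, UNIVERSAL_3_PRO_NATIVE)
deepgram_share = flagship_share(mix, NOVA_3_CODE_SWITCHING)
```

In this hypothetical mix, 40% of the audio would fall back to Universal-2 on AssemblyAI while all of it stays within the ten Nova-3 languages listed above.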

If you can’t send audio to a public cloud, Deepgram is the more direct path. Deepgram ships in three deployment configurations (public cloud for rapid integration, private cloud inside AWS or Azure accounts, and on-premises via Docker or Kubernetes containers), which matters for healthcare systems that need to keep protected health information within data centers while holding low latency and high accuracy. AssemblyAI has self-hosted options on Kubernetes, AWS ECS, and AWS GovCloud, but the Deepgram on-prem story is the more mature one.

On price parity

The pricing headline is misleading on its own. The flagship streaming base rates are nearly the same: AssemblyAI lists Universal-3 Pro Streaming at $0.45 per hour (and the separate Universal Streaming tier at $0.15 per hour), both billed on total session duration, while Deepgram Nova-3 lists at $0.46/hour ($0.0077/min on Pay-As-You-Go or $0.0065/min on Growth). The differences show up below the headline.

Deepgram is the cheaper pre-recorded option once prompting is in the mix. Nova-3 pre-recorded runs $0.0043/min with streaming at $0.0077/min. Universal-3 Pro pre-recorded is $0.21/hour ($0.0035/min), but open-field prompting is a $0.05/hour add-on and keyterms prompting is a separate $0.05/hour add-on, so the rate a customization-heavy workflow actually pays is closer to $0.31/hour (about $0.0052/min). For batch processing of recorded calls that uses both add-ons, Deepgram is cheaper per minute; at the bare $0.21/hour base rate, the listed figures put Universal-3 Pro slightly lower.
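Because one vendor quotes per hour and the other per minute, it helps to put the pre-recorded rates on a common basis. The sketch below uses only the list prices quoted in this article; the constant and function names are hypothetical, and the rates should be rechecked against each vendor's pricing page before budgeting.

```python
# List prices as quoted in this article (USD).
UNIVERSAL_3_PRO_BATCH_PER_HOUR = 0.21
OPEN_PROMPT_ADDON_PER_HOUR = 0.05
KEYTERMS_ADDON_PER_HOUR = 0.05
NOVA_3_BATCH_PER_MIN = 0.0043


def universal_3_pro_batch_per_min(open_prompt: bool, keyterms: bool) -> float:
    """Effective per-minute rate for Universal-3 Pro batch with the chosen add-ons."""
    hourly = UNIVERSAL_3_PRO_BATCH_PER_HOUR
    if open_prompt:
        hourly += OPEN_PROMPT_ADDON_PER_HOUR
    if keyterms:
        hourly += KEYTERMS_ADDON_PER_HOUR
    return hourly / 60


def monthly_cost(per_minute_rate: float, minutes: float) -> float:
    """Total cost for a given volume of recorded audio."""
    return per_minute_rate * minutes


base_rate = universal_3_pro_batch_per_min(open_prompt=False, keyterms=False)
full_rate = universal_3_pro_batch_per_min(open_prompt=True, keyterms=True)
```

At these list prices the base rate works out to $0.0035/min and the rate with both add-ons to roughly $0.0052/min, against $0.0043/min for Nova-3, so the add-on mix determines which side of the comparison a workload lands on.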

Concurrency tilts back to AssemblyAI. AssemblyAI’s Streaming API includes free, unlimited, automatic scaling concurrency with no extra fees. Deepgram instead documents streaming concurrency limits by plan, up to 150-225 concurrent Nova-3 streaming requests depending on plan tier, with higher limits available on Enterprise. For multi-tenant voice products that aggregate traffic across customers, the concurrency model matters more than a per-minute fraction.
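To judge whether a concurrency cap would bite, you can compute peak simultaneous streams from historical session logs. This is a minimal sketch under stated assumptions: sessions are given as (start, end) timestamps in seconds, the function names are hypothetical, and the default cap of 150 is the lower end of the 150-225 range quoted above, not a specific plan's limit.

```python
def peak_concurrency(sessions: list[tuple[float, float]]) -> int:
    """Maximum number of sessions open at the same moment."""
    events: list[tuple[float, int]] = []
    for start, end in sessions:
        if end < start:
            raise ValueError("session ends before it starts")
        events.append((start, 1))   # a stream opens
        events.append((end, -1))    # a stream closes

    # At equal timestamps, process closes (-1) before opens (+1) so a session
    # that starts exactly when another ends is not counted as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))

    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def exceeds_cap(sessions: list[tuple[float, float]], cap: int = 150) -> bool:
    """True if the observed peak would exceed the assumed plan cap."""
    return peak_concurrency(sessions) > cap
```

Running this over traffic aggregated across all tenants, rather than per customer, gives the figure that matters for a shared account-level cap.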

There’s one billing trap to price in. Universal Streaming is billed on the total duration of each streaming session (the entire time the WebSocket connection stays open), not on the amount of audio sent through it, so idle time on an open session costs the same as time when audio is actively flowing. If a streaming session isn’t properly closed via a termination message, it auto-closes after 3 hours and charges the full duration. AssemblyAI’s support documentation explicitly calls improperly closed streaming sessions “a common cause of unexpected charges that lead to negative balances,” and a single orphaned connection at Universal-3 Pro Streaming rates costs $1.35 (3 hours at $0.45/hour). Deepgram bills per second on actual audio, with no equivalent failure mode.
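The difference between session-duration billing and audio-duration billing is easiest to see with numbers. The sketch below uses the rates and the 3-hour auto-close window quoted in this article; the function names are hypothetical and the model ignores plan discounts and any minimum charges.

```python
# Rates as quoted in this article (USD).
UNIVERSAL_3_PRO_STREAMING_PER_HOUR = 0.45
NOVA_3_STREAMING_PER_MIN = 0.0077
AUTO_CLOSE_HOURS = 3


def session_billed_cost(open_seconds: float,
                        rate_per_hour: float = UNIVERSAL_3_PRO_STREAMING_PER_HOUR) -> float:
    """Cost when billing follows how long the WebSocket stayed open."""
    return open_seconds / 3600 * rate_per_hour


def audio_billed_cost(audio_seconds: float,
                      rate_per_min: float = NOVA_3_STREAMING_PER_MIN) -> float:
    """Cost when billing follows the audio actually sent, per second."""
    return audio_seconds / 60 * rate_per_min


def orphaned_session_cost(rate_per_hour: float = UNIVERSAL_3_PRO_STREAMING_PER_HOUR) -> float:
    """Cost of a session that is never terminated and runs to the auto-close window."""
    return session_billed_cost(AUTO_CLOSE_HOURS * 3600, rate_per_hour)


# Hypothetical 5-minute call.
closed_properly = session_billed_cost(5 * 60)
left_open = orphaned_session_cost()
per_second_audio = audio_billed_cost(5 * 60)
```

For that 5-minute call, a cleanly terminated session costs about $0.0375, the same call on per-second audio billing about $0.0385, and the orphaned session $1.35, which is why sending the termination message on every code path, including error handlers, is worth the effort.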

On the underlying customization bets

The two products have made different bets on how a customer customizes a transcript. AssemblyAI bet on prompting. Universal-3 Pro delivers strong accuracy out of the box, and to tune results to a use case, the model accepts a prompt with up to 1,500 words of context in plain language, which helps it recognize domain-specific terminology, apply preferred formatting conventions, handle code switching between languages, and better interpret ambiguous speech. On top of that, Keyterms Prompting on Universal-3 Pro supports up to 1,000 words or phrases, a significant expansion over Universal-2’s 200-word limit, available as a separate add-on at $0.05/hour.

Deepgram bet on instant vocabulary adaptation through keyterm prompting alone. Nova-3 Medical comes pretrained on millions of specialized conversations, handling pharmaceutical names, clinical shorthand, and regulatory language, and runtime keyword prompting adds up to 100 specialized terms during API calls without model retraining. For teams that know exactly which jargon they need to boost, 100 terms is enough; for teams that need to describe the audio in plain language (“this is a clinical history evaluation, capture medication and dosage accurately”), AssemblyAI’s surface is the broader one.
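A quick way to see which customization surface fits a given project is to check a planned vocabulary and prompt against the limits described above. This is a sketch using only the limits quoted in this article; the model keys, the `LIMITS` table, and the function name are hypothetical and do not correspond to either vendor's SDK.

```python
# Limits as described in this article; a prompt limit of 0 means
# natural-language prompts are not accepted.
LIMITS = {
    "universal-3-pro": {"max_keyterms": 1000, "max_prompt_words": 1500},
    "nova-3": {"max_keyterms": 100, "max_prompt_words": 0},
}


def customization_problems(model: str, keyterms: list[str], prompt: str = "") -> list[str]:
    """Return human-readable reasons the planned customization would not fit."""
    limits = LIMITS[model]
    problems: list[str] = []

    if len(keyterms) > limits["max_keyterms"]:
        problems.append(
            f"{len(keyterms)} keyterms exceeds the limit of {limits['max_keyterms']}"
        )

    prompt_words = len(prompt.split())
    if prompt_words > limits["max_prompt_words"]:
        problems.append(
            f"{prompt_words}-word prompt exceeds the limit of {limits['max_prompt_words']}"
        )
    return problems


# Hypothetical project: 250 drug names plus a short plain-language instruction.
terms = [f"term{i}" for i in range(250)]
instruction = "this is a clinical history evaluation, capture medication and dosage accurately"
```

With these inputs the Universal-3 Pro check returns no problems, while the Nova-3 check flags both the term count and the prompt; trimming to 100 terms and dropping the prompt clears it.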

On turn detection for voice agents

The latency round goes to Deepgram on raw measurement, but the picture is more nuanced for end-of-turn detection, the moment a voice agent has to decide whether the customer is done speaking. Deepgram’s Flux is a turn-taking model often used with Nova-3 that handles end-of-turn detection implicitly, predicting turn endings with higher accuracy than simple VAD algorithms and shrinking the awkward-silence gap. AssemblyAI’s answer is built into Universal-3 Pro Streaming itself: it tracks speaker turns inline, at streaming speed, with native turn detection in the transcript, where the model decides a speaker is done based on tonality, pacing, and speech patterns rather than silence alone, and AssemblyAI positions this as the most reliable turn detection available for teams building on LiveKit or Pipecat.

Both vendors have credible voice-agent stories. The choice between them lives in the rounds above: entity accuracy and customization for AssemblyAI, raw latency and multilingual coverage for Deepgram.

On scale and corporate posture

Both vendors are well-resourced enough that product continuity is a reasonable assumption for the next 12 months. AssemblyAI, founded in 2017 by former Cisco machine learning engineer Dylan Fox, has grown into a well-funded platform with over $115 million in funding and more than 100 employees, processes over 600 million API calls per month, and has focused on combining transcription with LLM capabilities through its LeMUR framework. Deepgram is the older of the two, founded in 2015 by former University of Michigan physicists; it has raised $85.9 million and employs around 175-200 people, and its end-to-end learning approach and unified Voice Agent API place it prominently in real-time voice applications. Either is a safe long-horizon bet on the infrastructure question; the round breakdown is what should drive the choice.

The Analyst
Hana Koizumi
Multimodal & Tooling Analyst

Hana Koizumi evaluates image, audio, and agentic tool use. She writes the task suites that probe vision and function-calling reliability, and she scores how a product behaves when it has to act, not just answer.