Cost & Latency Leaderboard

Best Fast LLM Inference APIs for Open-Weight Models, Ranked by Speed, Latency, and Cost

We benchmarked five inference providers serving the same open-weight models, scoring each on throughput, time-to-first-token, model catalog, cost per million tokens, and developer experience.

Tested by Devon Mizrahi, Cost & Latency Analyst · Updated June 28, 2026 · 5 products ranked

The Verdict

Cerebras posts the highest sustained tokens per second on the open-weight models it hosts, with independently benchmarked throughput well above any GPU-based provider. Groq is the right default for most teams: its LPU silicon records the lowest time-to-first-token in the field, holds 250-394 tokens/second on Llama 3.3 70B, and undercuts Together and Fireworks on per-token price for the same model. SambaNova leads on frontier-size open-weight throughput (gpt-oss-120B, MiniMax M2.7) but ships a thinner catalog and a weaker developer surface. Fireworks and Together are the right call when you need the broadest open-model catalog, LoRA fine-tuning, or dedicated GPU endpoints, and can accept GPU-class latency.

Five inference providers, the same open-weight models, one ranking. We picked the platforms production teams actually shortlist when they want fast, OpenAI-compatible APIs for Llama, DeepSeek, Qwen, Kimi, gpt-oss, and the rest of the open-weight field. The model is held constant so every difference on the table traces to the provider, not the weights.

Each provider was scored against the same five-metric suite: sustained output throughput, time-to-first-token, model catalog breadth, cost per million tokens at the popular 70B-class tier, and developer experience. Throughput and TTFT figures come from publicly available endpoint benchmarks; pricing was verified against each vendor's pricing page in June 2026. Quality isn't scored here, because every provider serves the same released weights. Where a provider degrades a model with aggressive quantization, that shows up in the catalog notes rather than as a separate score.

The test suite · 5 measured metrics

Each provider was scored against the same five-metric suite on the same model class wherever possible (Llama 3.3 70B as the cross-provider reference, gpt-oss-120B as the frontier-size cross-check). Throughput and TTFT figures come from publicly reported endpoint benchmarks (Artificial Analysis, vendor case studies, and third-party benchmark harnesses) rather than vendor marketing positioning. Pricing was verified against each vendor's pricing page in June 2026 and reported in USD per million tokens. Weights: Throughput 30%, TTFT 25%, Model catalog 15%, Cost 20%, Developer experience 10%.

Output throughput

Median sustained output tokens per second on the reference model (Llama 3.3 70B for cross-provider parity, with gpt-oss-120B as a frontier-size cross-check). Numbers come from publicly reported endpoint benchmarks (Artificial Analysis live data and third-party benchmark harnesses), not vendor headline positioning. A 0-100 score is mapped against a fixed scale where 100 corresponds to the highest sustained TPS observed in this category in 2026 (Cerebras at ~3,000 TPS on gpt-oss-120B) and 50 corresponds to typical GPU-cloud throughput on the same model (~60-100 TPS). Weighted 30%.

Time to first token

Median TTFT in milliseconds from request sent to first streamed token on the same Llama 3.3 70B reference endpoint, taken from publicly reported endpoint benchmarks. Scored on a fixed scale where 100 corresponds to sub-100ms TTFT (LPU/WSE territory) and 50 corresponds to 400-600ms (typical GPU inference). TTFT is scored separately from throughput because voice agents, autocomplete, and interactive chat live or die on TTFT, not on sustained tokens-per-second. Weighted 25%.
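For readers who want to reproduce the two speed metrics on their own endpoint, the sketch below times any token stream. It is a minimal illustration, not the harness behind the published figures: the name `time_stream` is hypothetical, each streamed chunk is assumed to carry one token (real streams may bundle several), and the clock is injectable so the logic can be checked without a network call.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float      # seconds from start until the first chunk arrives
    output_tps: float  # chunks per second after the first chunk
    tokens: int        # number of chunks seen


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Measure TTFT and sustained output rate for a lazily evaluated stream."""
    start = clock()  # assumes the request is sent when iteration begins
    first = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # arrival time of the first streamed chunk
        count += 1
    end = clock()
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = end - first
    # Throughput excludes the first chunk so TTFT is not counted twice.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return StreamTiming(first - start, tps, count)
```

In practice `chunks` would be the streaming iterator returned by an OpenAI-compatible client, and the median over many runs is the number to compare.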

Model catalog

We counted the production-ready open-weight chat models served by each provider, then graded breadth and currency: presence of the current Llama family, gpt-oss, DeepSeek (V3/V4 and R1), Kimi K2, Qwen3, and MiniMax. Custom-silicon providers were penalized for catalog gaps where the model hasn't been ported to their stack; GPU-based providers were credited for serving frontier-size open weights without porting delay. Weighted 15%.

Cost per million tokens

Blended input/output price for the reference Llama 3.3 70B endpoint, in USD per million tokens, verified against each vendor's pricing page on 2026-06-25. Where a provider doesn't host Llama 3.3 70B, we used its closest equivalent 70B-class open-weight model and noted the substitution. Batch and prompt-caching discounts were noted but not folded into the headline score, because they apply to a subset of workloads. Normalized so a lower blended price scores higher. Weighted 20%.
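The blend ratio behind the headline price is not stated here, so the sketch below assumes a 3:1 input-to-output token mix purely for illustration. The function name `blended_price` and the `input_share` parameter are hypothetical; the per-model prices are the Llama 3.3 70B figures quoted in this article.

```python
def blended_price(input_usd: float, output_usd: float, input_share: float = 0.75) -> float:
    """Blend input and output prices (USD per million tokens).

    input_share is the assumed fraction of tokens that are input tokens;
    0.75 corresponds to a 3:1 input:output mix (an assumption, not the
    article's stated method).
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    return input_share * input_usd + (1.0 - input_share) * output_usd


# (input, output) USD per million tokens on Llama 3.3 70B, as quoted in this article.
LLAMA_33_70B = {
    "Groq": (0.59, 0.79),
    "Fireworks": (0.90, 0.90),
    "Together": (1.04, 1.04),
}

blended = {name: blended_price(i, o) for name, (i, o) in LLAMA_33_70B.items()}
cheapest_first = sorted(blended, key=blended.get)
```

Changing `input_share` shifts the blend toward whichever side of the price dominates your traffic; for flat-priced providers it has no effect.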

Developer experience

Scored on OpenAI-compatibility of the chat-completions endpoint, free-tier generosity for prototyping, presence of structured output and function calling, fine-tuning support, observability/dashboards, and documented production rate limits. Each capability was scored present-and-good, present-but-weak, or absent. Weighted 10%.
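The article does not publish a composite total per provider. Assuming the overall score is a plain weighted sum of the five metric scores, the sketch below combines the stated weights with the per-metric scores from the provider profiles; the names `composite` and `rank` are illustrative.

```python
# Weights as stated in the test suite (fractions of the overall score).
WEIGHTS = {
    "throughput": 0.30,
    "ttft": 0.25,
    "catalog": 0.15,
    "cost": 0.20,
    "dx": 0.10,
}

# Per-metric scores copied from the provider profiles in this ranking.
SCORES = {
    "Groq":      {"throughput": 84, "ttft": 96, "catalog": 78, "cost": 88, "dx": 92},
    "Cerebras":  {"throughput": 99, "ttft": 92, "catalog": 62, "cost": 74, "dx": 82},
    "Fireworks": {"throughput": 76, "ttft": 70, "catalog": 90, "cost": 84, "dx": 90},
    "Together":  {"throughput": 72, "ttft": 68, "catalog": 96, "cost": 76, "dx": 92},
    "SambaNova": {"throughput": 90, "ttft": 78, "catalog": 58, "cost": 72, "dx": 70},
}


def composite(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of metric scores (assumed aggregation rule)."""
    return sum(scores[metric] * weight for metric, weight in weights.items())


def rank(all_scores: dict) -> list:
    """Provider names ordered from highest to lowest composite score."""
    return sorted(all_scores, key=lambda name: composite(all_scores[name]), reverse=True)
```

Under that assumption the ordering matches the published ranking, and swapping in your own weights (for example, a heavier TTFT weight for voice workloads) shows how sensitive the order is to priorities.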

The Ranking

Rank 1

Groq

Groq, Inc.

Lowest TTFT in the field on Llama and gpt-oss, and the cheapest per-million-token price among first-party hosts on the 70B reference model.

Groq runs open-weight models on its custom Language Processing Unit (LPU), an SRAM-resident architecture that pre-schedules every operation at compile time and removes the dynamic scheduling overhead of GPU inference. The measured result is sub-100ms TTFT on supported models and sustained throughput of 250-394 tokens per second on Llama 3.3 70B at $0.59 input / $0.79 output per million tokens, the lowest first-party price among the providers in this ranking on the reference model. The trade-offs are catalog and ownership: Groq serves only open-weight models that have been explicitly ported to the LPU (Llama variants, gpt-oss, Kimi K2, Qwen3, DeepSeek R1 Distill), and in December 2025 NVIDIA licensed Groq's LPU technology for roughly $20 billion, with GroqCloud continuing to operate independently.

Source: Groq, Inc.

Strengths

Sub-100ms TTFT and 250-394 TPS on Llama 3.3 70B via custom LPU silicon
$0.59/$0.79 per million tokens on Llama 3.3 70B undercuts Together and Fireworks on the same model
Generous no-credit-card free tier with OpenAI-compatible endpoints

Weaknesses

Open-weight catalog only, no GPT, Claude, or Gemini
Catalog depth depends on Groq's porting roadmap; some new models arrive late
Fixed organization-level rate limits at the free tier

How it scored, by metric

Output throughput 84

Time to first token 96

Model catalog 78

Cost per million tokens 88

Developer experience 92

Best for: Interactive chat, voice agents, autocomplete, and any workload where TTFT is the critical path

Rank 2

Cerebras Inference

Cerebras Systems

Highest sustained tokens per second in the field on supported open-weight models, on wafer-scale silicon.

Cerebras runs inference on its Wafer-Scale Engine, a single chip the size of a dinner plate with roughly 900,000 cores and enough on-chip memory to keep large open-weight models resident without external DRAM transfers. Independently benchmarked throughput on gpt-oss-120B reaches roughly 3,000 tokens per second, and Llama 3.3 70B runs in the 1,600-2,000 TPS range, roughly 10-20x the GPU-based providers on the same models. The trade-offs are catalog size and ecosystem maturity: only models explicitly ported to the WSE are available (currently a handful of Llama variants, Qwen3, gpt-oss, and a small set of others), and on-demand capacity can be constrained when a new model goes viral. Cerebras completed a U.S. IPO in May 2026 (Nasdaq: CBRS) at a roughly $66B day-one market cap, with an OpenAI supply relationship behind it.

Source: Cerebras Systems

Strengths

~3,000 TPS on gpt-oss-120B is the highest sustained throughput in the category
TTFT in the 80-150ms range, second only to Groq on supported models
OpenAI-compatible API and on-prem deployment options

Weaknesses

Small catalog of ported models compared with GPU-based hosts
Pricing on some newer models (e.g. GLM-class) runs higher than slower providers
Capacity has been constrained on demand surges in 2026

How it scored, by metric

Output throughput 99

Time to first token 92

Model catalog 62

Cost per million tokens 74

Developer experience 82

Best for: Long-output generation, code generation, and high-QPS endpoints where sustained TPS is the binding constraint

Rank 3

Fireworks AI

Fireworks AI, Inc.

Production-grade GPU inference with adaptive speculative decoding, the broadest function-calling support, and the lowest per-token price on frontier-size open weights.

Fireworks AI runs open-weight models on optimized GPU infrastructure with a proprietary FireAttention inference engine and adaptive speculative decoding, a runtime technique where a smaller draft model proposes tokens that the target model verifies in parallel. The result is the highest throughput among the GPU-based stacks in this ranking, paired with low pricing on frontier-size open weights: DeepSeek V4 Pro at $1.74 input / $3.48 output per million tokens (about 17% cheaper on input than Together on the same model), Kimi K2.6 at $0.95 / $4.00, and Llama 3.3 70B at $0.90 flat. The trade-off is latency: GPU-class TTFT runs 4-6x higher than the custom-silicon providers, and Fireworks gates monthly spend by tier with a fixed 6,000 RPM ceiling even at the top tier. ARR was reported at roughly $800M in early 2026.
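To make the draft-and-verify idea concrete, here is a toy sketch of generic greedy speculative decoding. It is not Fireworks' proprietary implementation and includes none of its adaptive behavior: `target` and `draft` are hypothetical stand-ins for the large and small models, each reduced to a function that returns the next token for a given prefix.

```python
from typing import Callable, List

# A "model" here is just a function from a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    k: int = 4,
) -> List[int]:
    """Greedy speculative decoding: output matches plain greedy decoding of target."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(out) < limit:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks each proposed position. A real engine scores all
        #    positions in one batched forward pass; this loop stands in for it.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)       # draft guessed right: keep it
            else:
                accepted.append(expected)  # first mismatch: take the target's token
                break
        out.extend(accepted)
    return out
```

The speedup comes from how often the draft agrees with the target: every accepted token is one the large model did not have to generate sequentially, which is why agreement-heavy workloads benefit most.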

Source: Fireworks AI, Inc.

Strengths

Adaptive speculative decoding lifts effective throughput on agreement-heavy workloads
Strong structured output and function-calling support for agent workflows
Cheaper than Together on DeepSeek V4 Pro and Kimi K2.6 in the June 2026 pricing snapshot

Weaknesses

GPU-class TTFT, well behind Groq and Cerebras on interactive workloads
Fixed 6,000 RPM ceiling even at the top spend tier
Request-level tracing is limited compared with full observability platforms

How it scored, by metric

Output throughput 76

Time to first token 70

Model catalog 90

Cost per million tokens 84

Developer experience 90

Best for: Production agent systems running frontier-size open weights with structured output or function calling

Rank 4

Together AI

Together Computer, Inc.

Broadest open-model catalog in the field, with first-class LoRA fine-tuning and downloadable trained weights.

Together AI hosts over 200 open-weight models behind a single OpenAI-compatible API and is the most full-stack of the GPU-based providers in this ranking: serverless per-token inference, dedicated H100 endpoints at $6.49/hr, raw HGX H100 clusters from $3.99/hr reserved, and LoRA fine-tuning across most Llama, Mistral, and Qwen sizes including the 405B flagship. Llama 3.3 70B is $1.04 flat per million tokens, which is mid-pack on cost in this ranking but above both Groq ($0.59/$0.79) and Fireworks ($0.90). CEO Vipul Ved Prakash frames the platform as the 'AI Acceleration Cloud' for teams that need both serverless inference and reserved capacity. ARR was reported at roughly $1B in early 2026.

Source: Together Computer, Inc.

Strengths

200+ open-weight models, including long-tail specialty models others don't carry
LoRA fine-tuning with downloadable weights, the cleanest path to a custom model
Dedicated H100 endpoints and raw HGX clusters for predictable-load workloads

Weaknesses

Mid-pack on Llama 3.3 70B per-token price ($1.04 vs Groq's $0.59/$0.79)
GPU-class TTFT, 5-10x slower than Groq on the same Llama models
Asymmetric serverless rate limits scale with sustained traffic rather than published ceilings

How it scored, by metric

Output throughput 72

Time to first token 68

Model catalog 96

Cost per million tokens 76

Developer experience 92

Best for: Teams that need a broad open-weight catalog, LoRA fine-tuning, or a mix of serverless and dedicated capacity

Rank 5

SambaNova Cloud

SambaNova Systems, Inc.

Class-leading throughput on frontier-size open-weight models on Reconfigurable Dataflow Unit silicon, with a thinner catalog and weaker developer surface.

SambaNova runs inference on its fourth-generation Reconfigurable Dataflow Unit (the SN40L), a tightly coupled three-tier memory architecture designed for large sparse models. Published throughput is 435 tokens per second on MiniMax M2.7 and over 600 tokens per second on gpt-oss-120B, with single-rack deployments at roughly 10 kW versus tens of racks for equivalent GPU throughput. The trade-offs are catalog and ecosystem: SambaNova has fewer ported models than Groq, the developer surface trails Groq and Fireworks on observability and tooling, and the company has had a turbulent year (April 2025 layoffs and a Series E down round at roughly $2.2B from a prior $5.1B mark) before refocusing on inference and sovereign AI.

Source: SambaNova Systems, Inc.

Strengths

Class-leading throughput on frontier-size open weights (gpt-oss-120B, MiniMax M2.7)
Single-rack power footprint at roughly 10 kW for an equivalent model deployment
OpenAI-compatible endpoint that works with the OpenAI SDK

Weaknesses

Thinner ported-model catalog than Groq or the GPU-based hosts
Developer surface (dashboards, observability, free-tier generosity) trails Groq and Fireworks
Corporate trajectory in 2025 (layoffs, down round) raises continuity questions for risk-averse buyers

How it scored, by metric

Output throughput 90

Time to first token 78

Model catalog 58

Cost per million tokens 72

Developer experience 70

Best for: Teams running frontier-size open-weight reasoning models where sustained TPS on the largest open weights is the binding constraint

Analysis

The ranking above reflects publicly reported throughput and TTFT for each provider on the same open-weight reference models, with per-token pricing verified against each vendor's pricing page on 2026-06-25. The single largest separator at the top of the table isn't raw throughput alone (Groq and the GPU-based hosts sit within an order of magnitude of each other on Llama 3.3 70B, though Cerebras runs roughly 10-20x the GPU-based providers); it's the architectural split between custom-silicon hosts (Groq, Cerebras, SambaNova) and GPU-based hosts (Together, Fireworks), which decides whether you're buying speed or catalog breadth.

What the scores measure

Throughput and TTFT are scored separately and weighted differently because they answer different production questions. Throughput decides how fast a long response completes, which is what matters for code generation, document summarization, and any batch-shaped workload. TTFT decides whether streaming output feels instant, which is what matters for voice agents, autocomplete, and any conversational UX where the user is waiting on the first token. A provider can win one and lose the other: GPU-based hosts have closed part of the throughput gap through speculative decoding, paged attention, and FlashAttention-class kernels, but the architectural advantage of custom silicon on TTFT for single-request low-concurrency inference is still measurable in this snapshot.
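A simple way to see why the two metrics matter for different workloads is to model end-to-end response time as TTFT plus decode time. The sketch below assumes that additive model and uses round illustrative inputs drawn from the ranges quoted in this article (about 100 ms and 300 TPS for an LPU-class endpoint, about 500 ms and 80 TPS for a typical GPU endpoint); the function names are hypothetical.

```python
def response_time_s(ttft_ms: float, output_tps: float, output_tokens: int) -> float:
    """Approximate wall-clock time for a full response: TTFT + tokens / throughput."""
    return ttft_ms / 1000.0 + output_tokens / output_tps


def ttft_share(ttft_ms: float, output_tps: float, output_tokens: int) -> float:
    """Fraction of the total response time spent waiting for the first token."""
    return (ttft_ms / 1000.0) / response_time_s(ttft_ms, output_tps, output_tokens)


# Illustrative profiles: (TTFT in ms, sustained output tokens per second).
PROFILES = {
    "custom_silicon": (100.0, 300.0),
    "gpu_typical": (500.0, 80.0),
}

# A short voice-style reply versus a long code-generation reply.
short_reply = {name: response_time_s(t, tps, 20) for name, (t, tps) in PROFILES.items()}
long_reply = {name: response_time_s(t, tps, 2000) for name, (t, tps) in PROFILES.items()}
```

For the 20-token reply, TTFT is more than half of the total time in both profiles; for the 2,000-token reply it drops to a few percent and throughput dominates.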

Where the field separates

Cerebras and Groq lead the table on the two speed metrics, with Cerebras taking sustained TPS and Groq taking TTFT. SambaNova lands third on speed on the strength of its published throughput on frontier-size open weights (gpt-oss-120B, MiniMax M2.7), but its catalog and developer surface trail. Fireworks and Together close the speed gap on the metrics where software optimization compounds (high-concurrency workloads, function-calling agents, structured output) and pull ahead on catalog and on fine-tuning, which is where most teams' second-order requirements live. The headline tokens-per-second number is largest at low concurrency and smallest at high concurrency, and that single fact decides the pick for many teams before any other metric matters.

Cost and catalog

Cost per million tokens is tracked on the same reference model but kept out of the speed scores, because a buyer optimizing for spend and a buyer optimizing for latency are answering different questions. Groq posts the lowest first-party per-token price on Llama 3.3 70B at $0.59/$0.79, with prompt caching and Batch API discounts that stack to roughly 25% of the on-demand rate on supported workloads. Fireworks and Together cluster around $0.90-$1.04 flat on the same model. Catalog coverage is the dimension that speed and price numbers don't capture: Groq, Cerebras, and SambaNova serve only models that have been ported to their custom silicon (catalogs of roughly 4-15 models each, all open-weight), while Together's catalog spans 200+ open-weight models including the long-tail specialty checkpoints that the custom-silicon providers don't carry. For teams whose model selection is a moving target, that one fact decides the pick before any speed or price number matters.

Frequently Asked Questions

Q. Which inference API is the fastest in 2026?

There's no single fastest provider for every workload. Cerebras leads on sustained tokens per second for the open-weight models it has ported to its Wafer-Scale Engine, with independently reported throughput around 3,000 TPS on gpt-oss-120B. Groq leads on time-to-first-token via its custom LPU silicon and holds 250-394 TPS on Llama 3.3 70B. For interactive chat and voice, Groq is the right default; for long-output generation on supported open weights, Cerebras is the right default.

Q. Which provider is cheapest on Llama 3.3 70B?

Among the first-party providers in this ranking, Groq is the cheapest at $0.59 per million input tokens and $0.79 per million output tokens, verified against the Groq pricing page in June 2026. Fireworks lists Llama 3.3 70B at $0.90 flat per million tokens, and Together at $1.04 flat. DeepInfra (not ranked here) lists $0.23/$0.40 per million tokens on the same model and is the cheapest hosted option overall, with the trade-off that latency and uptime trail the providers in this ranking.
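Per-million-token prices are easier to compare once they are applied to a concrete workload. The sketch below does that arithmetic for the three ranked providers with prices quoted above; the workload shape (one million requests a month, 1,000 input and 300 output tokens each) is an assumption chosen only for illustration, and `monthly_cost_usd` is a hypothetical helper.

```python
# (input, output) USD per million tokens on Llama 3.3 70B, as quoted in this article.
PRICES = {
    "Groq": (0.59, 0.79),
    "Fireworks": (0.90, 0.90),
    "Together": (1.04, 1.04),
}


def monthly_cost_usd(
    input_price: float,
    output_price: float,
    requests: int,
    input_tokens: int,
    output_tokens: int,
) -> float:
    """On-demand cost for a month of traffic, ignoring batch and caching discounts."""
    per_request = input_tokens * input_price + output_tokens * output_price
    return requests * per_request / 1_000_000


# Assumed workload: 1M requests per month, 1,000 input and 300 output tokens each.
costs = {
    name: monthly_cost_usd(i, o, requests=1_000_000, input_tokens=1_000, output_tokens=300)
    for name, (i, o) in PRICES.items()
}
```

With that workload the gap between the cheapest and most expensive of the three is a few hundred dollars a month; output-heavy workloads narrow Groq's lead slightly because its output price is higher than its input price.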

Q. Why isn't Llama 3.3 70B available on every provider at the same price?

The model weights are identical; the infrastructure and business model underneath are not. Custom-silicon providers (Groq, Cerebras, SambaNova) bypass GPU memory bandwidth limits with purpose-built dataflow or wafer-scale architectures and price for the speed advantage. GPU-based providers (Together, Fireworks) run the same weights on H100-class hardware with software optimizations like speculative decoding and FlashAttention, and compete on catalog breadth and fine-tuning rather than raw throughput. The same Llama 3.3 70B inference can cost $0.59 per million input tokens on Groq and $1.04 on Together because the providers are selling different things on top of the same weights.

Q. Can I run GPT-5, Claude, or Gemini through these providers?

No. Every provider in this ranking serves open-weight models only. Groq, Cerebras, and SambaNova host models that have been explicitly ported to their custom silicon; Together and Fireworks host the broad open-weight field on GPU infrastructure. For proprietary frontier models you need OpenAI, Anthropic, or Google directly, or a gateway like OpenRouter that routes across both open and proprietary providers.

Q. What's the right fallback architecture for production?

Multi-provider routing is the production answer. A common pattern is primary on Groq for interactive workloads with a fallback to Together or Fireworks for capacity or for models that Groq hasn't yet ported, or primary on Cerebras for long-output generation with a GPU fallback. Every provider in this ranking exposes an OpenAI-compatible chat-completions endpoint, so routing logic and observability can be shared. Treat latency and cost as separable axes: the right provider for a voice agent and the right provider for a batch summarization job are rarely the same.
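The primary-plus-fallback pattern can be sketched in a few lines. This is a minimal illustration, not a production router: `complete_with_fallback` and `AllProvidersFailed` are hypothetical names, and each provider is represented by a plain callable that would, in a real system, wrap an OpenAI-compatible client pointed at that provider's base URL.

```python
from typing import Callable, List, Sequence, Tuple

# (provider name, function that takes a prompt and returns the completion text)
Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in priority order; return (provider name, completion)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Broad catch for brevity; a production router would retry only on
            # timeouts, rate limits, and server errors.
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

Ordering the list per workload (for example, a low-TTFT host first for interactive traffic and a GPU host first for models the custom-silicon providers have not ported) keeps latency and cost as separate routing decisions.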

The Analyst

Devon Mizrahi

Cost & Latency Analyst

Devon Mizrahi measures what a model costs to run and how fast it answers. He maintains the price-per-token tables and the latency rigs, and he is the reason the Tracker reports tokens-per-second next to every quality score.
