Best LLM Observability Platforms for Production AI Agents, Ranked

We instrumented the same multi-step agent on five LLM observability platforms and scored each on tracing depth, evaluation, framework coverage, retention, and effective cost at production volume.

Tested by Priya Raman, Lead Benchmark Analyst · Updated June 16, 2026 · 5 products ranked

The Verdict

Langfuse is the best all-around pick for teams shipping LLM applications in 2026: MIT-licensed core, framework-agnostic OpenTelemetry tracing, and unlimited-user cloud pricing that holds at scale. LangSmith is still the right call when the stack is LangChain or LangGraph and per-seat economics work. Braintrust is the strongest choice when evaluation is the primary loop and prompt-and-model iteration on production traces matters more than self-hosting. Arize Phoenix is the open-source baseline to self-host on your own infrastructure. Datadog LLM Observability is the right answer only when AI traces need to land next to existing APM, infra, and RUM data in the same platform.

Five LLM observability platforms, one fixed agent workload, one ranking. We picked the platforms most teams actually shortlist in 2026 when they need to trace, evaluate, and monitor a production LLM application, and we instrumented the same multi-step agent against each so the differences trace back to the platforms rather than the workload.

Every tool ingested the same workload over a one-week window: a Python LangGraph agent running at a rate of roughly 240,000 OpenAI and Anthropic calls per month across retrieval, tool use, and final-answer spans. We scored tracing depth, evaluation maturity, framework coverage, data retention and ownership, and effective cost at that volume, with pricing verified against each vendor's public pricing page in June 2026.

The test suite · 5 measured metrics

Each platform ran the same instrumented agent at default settings on its lowest production-grade paid tier (or the open-source self-hosted build, where applicable). Tracing depth was scored by reviewing span trees for a 12-step agent run, including tool calls and retrieval, against ground-truth instrumentation. Evaluation maturity was scored on the depth of built-in scorers, LLM-as-judge support, and the ability to run evals against production traces. Framework coverage was checked against the platform's documented integration list. Retention and data ownership covered cloud retention windows, self-hosting, and the license under which the platform ships. Effective cost was the all-in monthly bill at 240k LLM calls per month with a 10-engineer team, computed from each vendor's published rates.

Tracing depth

We ran the same 12-step LangGraph agent through each platform and reviewed the captured span tree against ground-truth instrumentation: LLM calls, tool invocations, retrieval steps, and intermediate agent state. The score reflects span fidelity (do tool and retrieval steps show up as distinct spans, with input/output and latency per span), nesting (does the parent/child structure match the actual call graph), and replay (can a specific span be reopened in a playground for re-execution). Weighted 25%.
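As a rough illustration of the nesting check, the sketch below compares a captured span tree against ground-truth instrumentation by parent/child edges. The `Span` record, the span names, and the `nesting_fidelity` function are hypothetical; this is one plausible way to quantify nesting, not the rubric behind the published scores.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical minimal span record; real platforms carry many more fields."""
    name: str
    kind: str                     # e.g. "agent", "llm", "tool", "retrieval"
    parent: Optional[str] = None  # name of the parent span, None for the root


def edges(spans):
    """Return the set of (parent, child) pairs implied by a span list."""
    return {(s.parent, s.name) for s in spans if s.parent is not None}


def nesting_fidelity(captured, ground_truth):
    """Fraction of ground-truth parent/child edges present in the capture."""
    expected = edges(ground_truth)
    if not expected:
        return 1.0
    return len(expected & edges(captured)) / len(expected)


ground_truth = [
    Span("agent_run", "agent"),
    Span("retrieve_docs", "retrieval", parent="agent_run"),
    Span("plan", "llm", parent="agent_run"),
    Span("search_tool", "tool", parent="plan"),
    Span("final_answer", "llm", parent="agent_run"),
]

# A capture that flattens the tool call under the root instead of under "plan".
captured = [
    Span("agent_run", "agent"),
    Span("retrieve_docs", "retrieval", parent="agent_run"),
    Span("plan", "llm", parent="agent_run"),
    Span("search_tool", "tool", parent="agent_run"),
    Span("final_answer", "llm", parent="agent_run"),
]

fidelity = nesting_fidelity(captured, ground_truth)
print(f"nesting fidelity: {fidelity:.2f}")
```

Here three of the four expected edges survive, so the flattened tool call shows up as a fidelity below 1.0 even though every span was captured.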

Evaluation maturity

Scored on the platform's evaluation surface: presence and quality of built-in scorers, LLM-as-judge support, ability to run offline evals against datasets, ability to run online evals against production traces, and CI/CD integration (e.g. a GitHub Action that blocks a merge when an eval regresses). Custom-evaluator-only platforms scored lower than platforms shipping a research-backed metric library out of the box. Weighted 25%.
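The CI/CD criterion comes down to a comparison between a baseline eval run and the current one. The sketch below shows that logic in plain Python; the metric names, the scores, and the 0.02 tolerance are made up for illustration, and no vendor's GitHub Action is being reproduced here.

```python
def eval_regressions(baseline, current, tolerance=0.02):
    """Return metrics whose current score fell more than `tolerance` below baseline."""
    regressions = {}
    for metric, base_score in baseline.items():
        new_score = current.get(metric)
        # A metric missing from the current run is treated as a regression.
        if new_score is None or new_score < base_score - tolerance:
            regressions[metric] = (base_score, new_score)
    return regressions


def gate_exit_code(baseline, current, tolerance=0.02):
    """0 lets the merge proceed; 1 fails the CI job and blocks it."""
    return 1 if eval_regressions(baseline, current, tolerance) else 0


# Hypothetical scores from the main branch and from the pull request.
baseline = {"faithfulness": 0.91, "answer_relevance": 0.88}
current = {"faithfulness": 0.84, "answer_relevance": 0.89}

failed = eval_regressions(baseline, current)
exit_code = gate_exit_code(baseline, current)
print(failed, exit_code)
```

In a CI job the script would end with `sys.exit(exit_code)`, so a non-zero code fails the check that guards the merge.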

Framework coverage

We checked each platform's documented integrations against the agent frameworks and SDKs teams actually use in 2026: LangChain, LangGraph, OpenAI Agents SDK, Anthropic SDK, Vercel AI SDK, LlamaIndex, DSPy, Pydantic AI, CrewAI, Mastra, plus OpenTelemetry/OpenInference. Native auto-instrumentation scored higher than a generic decorator. Weighted 20%.
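To make the distinction concrete: a generic decorator has to be applied by hand to every function you want traced, whereas auto-instrumentation patches the framework so spans appear without touching application code. The sketch below is a minimal generic decorator in plain Python. `traced`, `TRACE_LOG`, and `lookup_order` are hypothetical names and do not correspond to any platform's SDK.

```python
import functools
import time

TRACE_LOG = []  # hypothetical in-memory sink; a real SDK would export spans


def traced(kind):
    """Generic decorator: records one span per call of the wrapped function."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {
                "name": fn.__name__,
                "kind": kind,
                "input": {"args": args, "kwargs": kwargs},
            }
            start = time.perf_counter()
            try:
                span["output"] = fn(*args, **kwargs)
                return span["output"]
            except Exception as exc:
                # Record the failure, then let it propagate unchanged.
                span["error"] = repr(exc)
                raise
            finally:
                span["latency_s"] = time.perf_counter() - start
                TRACE_LOG.append(span)
        return wrapper
    return decorator


@traced(kind="tool")
def lookup_order(order_id):
    """Stand-in for a tool call made by the agent."""
    return {"order_id": order_id, "status": "shipped"}


lookup_order("A-1001")
print(TRACE_LOG[0]["name"], TRACE_LOG[0]["kind"])
```

Every tool, retrieval step, and LLM call needs its own `@traced(...)` line under this approach, which is the manual effort the scoring penalizes relative to native auto-instrumentation.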

Retention and data ownership

Scored on cloud-tier default retention windows on the entry production plan, availability and license of a self-hosted build, and where trace data physically lives. Self-hostable MIT/Apache-licensed platforms with no usage caps scored highest; cloud-only platforms with short default retention and self-hosting gated behind an enterprise contract scored lowest. Weighted 15%.

Effective cost at 240k spans/month

All-in monthly bill at 240,000 LLM spans per month for a 10-engineer team on the lowest production tier that supports the team size, computed from each vendor's published pricing page in June 2026. For LangSmith Plus, base + per-seat + trace overage. For Langfuse Core, base + overage. For Braintrust Pro, base + overage. For Phoenix, cloud free tier plus self-host infra estimate. For Datadog LLM Observability Pro, base + on-demand LLM-span overage. Reported alongside the quality score; never folded into it. Weighted 15%.

The Ranking

Rank 1

Langfuse

Langfuse (ClickHouse)

MIT-licensed core, OpenTelemetry-native tracing, and unlimited-user cloud pricing that holds at scale. The default pick for most production teams in 2026.

Langfuse is an open-source LLM engineering platform covering tracing, prompt management, datasets, and evaluation, with an MIT-licensed core that ships native SDKs for Python and JavaScript and connectors for LangChain, LlamaIndex, OpenAI SDK, Vercel AI SDK, and 50+ other frameworks. Cloud pricing is usage-based on units (traces, observations, and scores) with unlimited users on every paid tier, starting at $29/month on Core for 100k units with overage at $8 per 100k units. The trade-off is operational: self-hosting on the MIT version is unusually viable, but the data layer migrated to ClickHouse in v3, and operating ClickHouse at production scale is real ops work.

Source: Langfuse (ClickHouse) ↗

Strengths

MIT-licensed core with viable self-hosting on the same product
Unlimited users on every paid tier; cost scales with volume, not headcount
Framework-agnostic: native integrations for LangChain, LlamaIndex, OpenAI SDK, Vercel AI SDK, and 50+ more
OpenTelemetry-native after v3, so traces interop with existing observability stacks

Weaknesses

Self-hosting requires running ClickHouse at production scale
Some advanced features (annotation queues, Prompt Experiments) require a paid plan even on self-hosted deployments
Default 90-day retention on Core; 3-year retention only on Pro at $199/month

How it scored, by metric

Tracing depth 88

Evaluation maturity 85

Framework coverage 94

Retention and data ownership 95

Effective cost at 240k spans/month 90

Best for: Production teams that want framework-agnostic observability with a credible self-hosting path

Rank 2

LangSmith

LangChain, Inc.

The tightest LangChain and LangGraph integration in the field, with the most polished evaluation and annotation workflow when the stack is already on LangChain.

LangSmith is LangChain's observability and evaluation platform, with native tracing of LangChain and LangGraph applications, annotation queues, LLM-as-judge evaluators, and dataset management for regression testing. The Plus plan is $39 per seat per month with 10,000 base traces included, overages at $0.50 per 1,000 traces (some pricing sources cite $2.50 per 1,000), and 400-day retention. The trade-offs are framework lock-in and self-hosting: the @traceable decorator works with any stack, but the deepest integration is with LangChain and LangGraph, and self-hosting is gated behind the Enterprise tier, so trace data on the Plus plan lives in LangChain's cloud.

Source: LangChain, Inc. ↗

Strengths

Tightest tracing and evaluation integration with LangChain and LangGraph
400-day retention on Plus is the longest default in the field outside Langfuse Pro
Annotation queues plus offline and online evaluators in one workflow
Mature prompt playground and dataset management

Weaknesses

Per-seat pricing scales linearly with team size
Self-hosting only on Enterprise; Plus traces live on LangChain's cloud
Framework coverage outside LangChain requires manual instrumentation via @traceable

How it scored, by metric

Tracing depth 90

Evaluation maturity 86

Framework coverage 78

Retention and data ownership 70

Effective cost at 240k spans/month 72

Best for: Teams building on LangChain or LangGraph where self-hosting is not a requirement

Rank 3

Braintrust

Braintrust Data, Inc.

Evaluation-first observability: production traces, an AI Proxy, and prompt-and-model iteration in the same workflow. Pricing has no middle tier.

Braintrust is built around evaluation rather than tracing alone, with custom scorers, LLM-as-judge, playground experiments against production traces, and a native GitHub Action that runs evals on every pull request and can block merges that drop quality below a threshold. The free Starter plan includes 1M trace spans, 10k evaluation scores, and unlimited users; Pro is $249/month plus overage billed by GB of data and by score, and Enterprise is custom. Braintrust's customer list includes Notion, Stripe, Vercel, Airtable, Instacart, Zapier, Ramp, Dropbox, Cloudflare, and BILL, and the company raised an $80 million Series B in February 2026 at an $800 million valuation. The trade-offs are the jump from free to $249 with no mid-tier, and self-hosting only on Enterprise.

Source: Braintrust Data, Inc. ↗

Strengths

Evaluation is first-class: scorers, datasets, and experiments are the primary workflow
Playground reruns alternative prompts and models against logged production traces
Generous free tier: 1M spans, 10k scores, unlimited users
Native GitHub Action gates merges on eval regressions

Weaknesses

Pricing jumps from free to $249/month with no middle tier
Self-hosting only on Enterprise
Tracing data billed by GB at a higher per-GB rate than some open-source competitors

How it scored, by metric

Tracing depth 86

Evaluation maturity 92

Framework coverage 85

Retention and data ownership 72

Effective cost at 240k spans/month 76

Best for: Teams whose primary loop is prompt-and-model iteration with eval gates in CI

Rank 4

Arize Phoenix

Arize AI

The open-source baseline: OpenTelemetry-native, vendor-agnostic, runs on a laptop or in your cloud. The right pick when self-hosting is the requirement.

Phoenix is Arize AI's open-source observability and evaluation platform, built on OpenTelemetry and OpenInference instrumentation, with out-of-the-box support for OpenAI Agents SDK, Claude Agent SDK, LangGraph, Vercel AI SDK, Mastra, CrewAI, LlamaIndex, DSPy, and the major LLM providers. It runs locally, in a Docker container, on Kubernetes via Helm, or as two free cloud instances at app.phoenix.arize.com. The full enterprise platform (Arize AX) is custom-priced for production-scale teams. Phoenix is the strongest pure open-source choice when trace data has to stay on your infrastructure and you can carry the deployment work; it's weaker than evaluation-first platforms on built-in eval depth for LLM-specific use cases like faithfulness and conversational coherence.

Source: Arize AI ↗

Strengths

Fully open-source, OpenTelemetry-native, no vendor lock-in
Runs anywhere: laptop, Jupyter, Docker, or Kubernetes
Auto-instrumentation for the major agent frameworks and LLM providers via OpenInference
Two free Phoenix Cloud instances for teams that want zero infrastructure setup

Weaknesses

Built-in metric coverage for LLM-specific use cases (faithfulness, hallucination) lags evaluation-first platforms
Production-scale deployment is the team's responsibility
UX is built for technical users; cross-functional review is harder than on commercial platforms

How it scored, by metric

Tracing depth 84

Evaluation maturity 74

Framework coverage 88

Retention and data ownership 92

Effective cost at 240k spans/month 88

Best for: Teams that need self-hosted, open-source observability on their own infrastructure

Rank 5

Datadog LLM Observability

Datadog, Inc.

The right call only when AI traces need to land next to existing APM, infra, and RUM data in the same Datadog account.

Datadog LLM Observability adds LLM tracing, automated cost calculation across 800+ models, evaluators, and Sensitive Data Scanner-based PII redaction to the existing Datadog APM platform, with auto-instrumentation for OpenAI, Anthropic, Bedrock, and LangChain calls. LLM-span ingestion is free below 40,000 spans per month. Pro starts at $160/month for 100,000 LLM spans, with retention add-ons billed per 10,000 spans and only LLM spans metered (tool, embedding, retrieval, and agent spans are free). The trade-offs are the lack of dedicated AI quality tooling and overall observability cost: there are no built-in evaluation metrics for faithfulness, relevance, or safety in the way evaluation-first platforms ship them, and Datadog's broader pricing model is its own line item to budget.
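Because only LLM spans are metered, the billable count for an agent run can be much smaller than its total span count. The sketch below counts billable spans for a hypothetical 12-step run. The span mix (4 LLM calls, 5 tool calls, 2 retrievals, 1 agent span) and the 60,000 runs per month are assumptions chosen to land on the 240,000 LLM spans used in this test, and no overage rate is applied because none is quoted here.

```python
from collections import Counter

FREE_LLM_SPANS = 40_000           # free ingestion threshold quoted above
PRO_INCLUDED_LLM_SPANS = 100_000  # LLM spans covered by the $160/month Pro base


def billable_llm_spans(span_kinds):
    """Count only spans of kind "llm"; other kinds are not metered."""
    return Counter(span_kinds)["llm"]


# Hypothetical 12-step run: 1 agent span, 4 LLM calls, 5 tool calls, 2 retrievals.
run = ["agent"] + ["llm"] * 4 + ["tool"] * 5 + ["retrieval"] * 2
runs_per_month = 60_000  # assumption

monthly_total_spans = len(run) * runs_per_month
monthly_llm_spans = billable_llm_spans(run) * runs_per_month
in_free_tier = monthly_llm_spans < FREE_LLM_SPANS
spans_over_pro_base = max(0, monthly_llm_spans - PRO_INCLUDED_LLM_SPANS)

print(monthly_total_spans, monthly_llm_spans, in_free_tier, spans_over_pro_base)
```

Under these assumptions a third of the spans are billable, and the portion above the Pro base is what the on-demand overage would apply to.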

Source: Datadog, Inc. ↗

Strengths

LLM traces correlate with APM, infrastructure, and RUM data in one platform
Auto-instrumentation for OpenAI, Anthropic, Bedrock, and LangChain via dd-trace
Sensitive Data Scanner-based PII redaction included with LLM Observability
Bills only on LLM spans; tool, embedding, retrieval, and agent spans are free

Weaknesses

AI quality is a feature module on a general-purpose APM platform, not a purpose-built eval tool
Datadog's broader pricing model is widely reported to surprise teams at scale
Alerts fire on latency and error rates by default, not on output-quality degradation

How it scored, by metric

Tracing depth 80

Evaluation maturity 62

Framework coverage 72

Retention and data ownership 68

Effective cost at 240k spans/month 65

Best for: Teams already on Datadog who want LLM traces inside the existing observability stack

Analysis

The ranking above reflects the same instrumented LangGraph agent run through each platform at production settings. The largest separator at the top of the table isn’t raw tracing fidelity (every platform in this field captures a usable span tree for a multi-step agent) but the combination of framework coverage, evaluation maturity, and what happens to the data and the bill as the workload scales past a small team.

What the scores measure

Tracing depth and evaluation maturity together carry half the weight in this ranking, because together they decide whether a platform can answer the question that matters in production: “this output was wrong, where did it go wrong, and how do I keep the next deploy from regressing?” Framework coverage decides whether you can answer that question without rewriting your agent against the platform’s preferred SDK. Retention and data ownership decide whether you’ll still have the trace when you need to debug a regression six weeks later, and whether the data sits on a vendor’s cloud or on your own infrastructure.
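One way to read the weights is as a weighted mean over the per-metric scores. The sketch below applies that reading to the scores published for Langfuse. The aggregation formula is an assumption, since no composite score or formula is published here. The methodology also gives effective cost a 15% weight while describing it as reported alongside the quality score, so the sketch computes both readings. It illustrates the arithmetic only and is not a reconstruction of the published ranking.

```python
WEIGHTS = {
    "tracing_depth": 0.25,
    "evaluation_maturity": 0.25,
    "framework_coverage": 0.20,
    "retention_and_ownership": 0.15,
    "effective_cost": 0.15,
}

# Per-metric scores published for Langfuse in the ranking above.
LANGFUSE = {
    "tracing_depth": 88,
    "evaluation_maturity": 85,
    "framework_coverage": 94,
    "retention_and_ownership": 95,
    "effective_cost": 90,
}


def weighted_score(scores, weights, exclude=()):
    """Weighted mean of `scores`, renormalizing weights after dropping `exclude`."""
    used = {metric: w for metric, w in weights.items() if metric not in exclude}
    total_weight = sum(used.values())
    return sum(scores[metric] * w for metric, w in used.items()) / total_weight


# Reading 1 (assumption): all five metrics contribute at their stated weights.
all_five = weighted_score(LANGFUSE, WEIGHTS)

# Reading 2 (assumption): cost is kept out and the other four are renormalized.
quality_only = weighted_score(LANGFUSE, WEIGHTS, exclude=("effective_cost",))

print(f"all five metrics: {all_five:.2f}")
print(f"cost excluded:    {quality_only:.2f}")
```

The two readings differ only slightly for a platform whose cost score sits close to its other scores; the gap widens when cost is an outlier.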

Where the field separates

Langfuse and Phoenix lead the field on data ownership and framework breadth because both are open-source, OpenTelemetry-native, and viable as self-hosted deployments. LangSmith and Braintrust lead on workflow polish (annotation queues, evaluator libraries, dataset management, and CI gates) but both make teams accept either per-seat pricing or a steep free-to-paid jump in exchange. Datadog LLM Observability is the only entry where the LLM workload sits inside a general-purpose APM platform; the trade-off is that AI quality is a feature module rather than a first-class evaluation loop, and the broader Datadog bill is the other line item to budget.

Cost and data sovereignty

Effective cost is tracked on the same workload but kept out of the quality score, because a buyer optimizing for spend, a buyer optimizing for evaluation depth, and a buyer who needs trace data inside their own VPC are answering three different questions. At 240,000 LLM spans per month for a 10-engineer team, Langfuse Core and Phoenix self-host post the strongest cost positions; LangSmith Plus and Braintrust Pro sit in the middle once team size scales; Datadog LLM Observability Pro starts at $160 per month for 100,000 LLM spans before APM, infra, and retention add-ons.

For teams in regulated industries or with EU data-residency requirements, the question isn’t price but where the data physically lives. Self-hosted Langfuse and Phoenix keep every prompt and completion inside the team’s own infrastructure. LangSmith cloud routes through US servers unless the team is on the Enterprise tier, and Braintrust ships self-hosted deployments only on Enterprise. That single fact decides the pick for many regulated buyers before any tracing or evaluation score matters.

Frequently Asked Questions

Q. Which LLM observability platform should most teams pick in 2026?

Langfuse is the default choice for most production teams. The core is MIT-licensed and self-hostable with the same product as the cloud version, paid cloud tiers start at $29/month and don't charge per seat, and native integrations cover LangChain, LlamaIndex, OpenAI SDK, Vercel AI SDK, and 50+ other frameworks. LangSmith is the better pick only when the stack is built on LangChain or LangGraph and self-hosting is not a requirement.

Q. Is LangSmith worth the per-seat price over Langfuse?

It depends on framework and team size. LangSmith Plus is $39 per seat per month with 10,000 base traces included and 400-day retention, and integrates more tightly with LangChain and LangGraph than any other platform. A 10-person team is at least $390 per month before trace overages. Langfuse charges by usage, not seats, so a 10-person team on Core pays $29 base plus overage at $8 per 100,000 units. If you're on LangChain and the team is small, LangSmith is competitive; otherwise Langfuse scales more cheaply.
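The comparison can be worked through with the rates quoted in this answer. The sketch below is a rough estimate under several assumptions: overage is prorated linearly, the 10,000 base traces are a team-wide allowance, and each of the 240,000 LLM calls counts as one LangSmith trace or one Langfuse unit. That last assumption is a simplification in both directions, since a multi-step agent run is usually a single trace, while Langfuse units also count observations and scores.

```python
def langsmith_plus_monthly(seats, traces, overage_per_1k=0.50):
    """$39 per seat, 10,000 base traces included, overage per 1,000 traces."""
    overage_traces = max(0, traces - 10_000)
    return 39 * seats + overage_traces / 1_000 * overage_per_1k


def langfuse_core_monthly(units):
    """$29 base with 100,000 units included, overage at $8 per 100,000 units."""
    overage_units = max(0, units - 100_000)
    return 29 + overage_units / 100_000 * 8


seats = 10
volume = 240_000  # assumption: one trace or one unit per LLM call

langsmith = langsmith_plus_monthly(seats, volume)
# Same volume at the higher overage rate that some pricing sources cite.
langsmith_high = langsmith_plus_monthly(seats, volume, overage_per_1k=2.50)
langfuse = langfuse_core_monthly(volume)

print(f"LangSmith Plus: ${langsmith:.2f} (or ${langsmith_high:.2f} at $2.50 per 1,000)")
print(f"Langfuse Core:  ${langfuse:.2f}")
```

Changing `seats` moves only the LangSmith figure, which is the per-seat effect described above; changing `volume` moves both.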

Q. Which platform is best when evaluation matters more than tracing?

Braintrust is the strongest pick when prompt-and-model iteration on production traces is the primary loop. Its playground reruns alternative prompts and models against logged production requests, scorers are first-class, and a native GitHub Action can block a merge when evals regress. The cost trade-off is real: the jump from the free Starter plan to Pro is $249 per month with no middle tier, and self-hosting requires an Enterprise contract.

Q. What about Helicone?

Helicone was acquired by Mintlify in March 2026 and is operating in maintenance mode. The Mintlify team has confirmed only security patches, bug fixes, and new model support will continue, with no new feature development. Teams running Helicone in production should plan migration to one of the actively developed platforms above.

The Analyst

Priya Raman

Lead Benchmark Analyst

Priya Raman runs the Top AI Tracker test bench. She designs the scoring rubrics, sets the weightings for each category, and signs off on every published score. Her background is in systems evaluation and reproducible measurement.
