Best LLM Observability Platforms for Production AI Agents, Ranked
We instrumented the same multi-step agent on five LLM observability platforms and scored each on tracing depth, evaluation, framework coverage, retention, and effective cost at production volume.
Langfuse is the best all-around pick for teams shipping LLM applications in 2026: MIT-licensed core, framework-agnostic OpenTelemetry tracing, and unlimited-user cloud pricing that holds at scale. LangSmith is still the right call when the stack is LangChain or LangGraph and per-seat economics work. Braintrust is the strongest choice when evaluation is the primary loop and prompt-and-model iteration on production traces matters more than self-hosting. Arize Phoenix is the open-source baseline to self-host on your own infrastructure. Datadog LLM Observability is the right answer only when AI traces need to land next to existing APM, infra, and RUM data in the same platform.
Five LLM observability platforms, one fixed agent workload, one ranking. We picked the platforms most teams actually shortlist in 2026 when they need to trace, evaluate, and monitor a production LLM application, and we instrumented the same multi-step agent against each so the differences trace back to the platforms rather than the workload.
Every tool ingested the same workload over a one-week window: a Python LangGraph agent making roughly 240,000 OpenAI and Anthropic calls per month across retrieval, tool use, and final-answer spans. We scored tracing depth, evaluation maturity, framework coverage, data retention and ownership, and effective cost at that volume, with pricing verified against each vendor's public pricing page in June 2026.
Each platform ran the same instrumented agent at default settings on its lowest production-grade paid tier (or the open-source self-hosted build, where applicable). Tracing depth was scored by reviewing span trees for a 12-step agent run, including tool calls and retrieval, against ground-truth instrumentation. Evaluation maturity was scored on the depth of built-in scorers, LLM-as-judge support, and the ability to run evals against production traces. Framework coverage was checked against the platform's documented integration list. Retention and data ownership covered cloud retention windows, self-hosting, and the license under which the platform ships. Effective cost was the all-in monthly bill at 240k LLM calls per month with a 10-engineer team, computed from each vendor's published rates.
We ran the same 12-step LangGraph agent through each platform and reviewed the captured span tree against ground-truth instrumentation: LLM calls, tool invocations, retrieval steps, and intermediate agent state. The score reflects span fidelity (do tool and retrieval steps show up as distinct spans, with input/output and latency per span), nesting (does the parent/child structure match the actual call graph), and replay (can a specific span be reopened in a playground for re-execution). Weighted 25%.
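The nesting check described above can be sketched as a comparison of parent/child edges between a ground-truth span tree and the tree a platform captured. This is a minimal illustration, not the rubric actually used: the `Span` class, the span names, and the `nesting_fidelity` function are hypothetical, and the sketch assumes span names are unique within a run.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified span: a name, a kind, and child spans."""
    name: str
    kind: str  # e.g. "agent", "llm", "tool", "retrieval"
    children: list["Span"] = field(default_factory=list)


def edges(span, parent=None):
    """Collect (parent_name, span_name, kind) for every span in the tree."""
    found = {(parent, span.name, span.kind)}
    for child in span.children:
        found |= edges(child, span.name)
    return found


def nesting_fidelity(expected, captured):
    """Fraction of ground-truth edges that also appear in the captured tree."""
    want = edges(expected)
    return len(want & edges(captured)) / len(want)


# Ground truth: the tool call is nested under the planning LLM call.
ground_truth = Span("agent_run", "agent", [
    Span("retrieve_docs", "retrieval"),
    Span("plan", "llm", [Span("search_tool", "tool")]),
    Span("final_answer", "llm"),
])

# A capture that flattens the tool call directly under the root span.
flattened = Span("agent_run", "agent", [
    Span("retrieve_docs", "retrieval"),
    Span("plan", "llm"),
    Span("search_tool", "tool"),
    Span("final_answer", "llm"),
])

print(nesting_fidelity(ground_truth, ground_truth))  # 1.0
print(nesting_fidelity(ground_truth, flattened))     # 0.8
```

In the flattened capture every span is present, but one of the five edges has the wrong parent, so span fidelity would look perfect while nesting does not.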
Scored on the platform's evaluation surface: presence and quality of built-in scorers, LLM-as-judge support, ability to run offline evals against datasets, ability to run online evals against production traces, and CI/CD integration (e.g. a GitHub Action that blocks a merge when an eval regresses). Custom-evaluator-only platforms scored lower than platforms shipping a research-backed metric library out of the box. Weighted 25%.
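A merge-blocking eval gate of the kind mentioned above reduces to comparing current eval scores against a baseline and returning a non-zero exit code on regression. The sketch below is a generic illustration, not any vendor's GitHub Action: the metric names, the scores, and the 0.02 tolerance are assumptions chosen for the example.

```python
import sys


def find_regressions(baseline, current, tolerance=0.02):
    """Return {metric: (old, new)} for metrics that dropped by more than tolerance."""
    regressions = {}
    for metric, old in baseline.items():
        new = current.get(metric)
        # A metric missing from the current run is treated as a regression.
        if new is None or old - new > tolerance:
            regressions[metric] = (old, new)
    return regressions


def gate(baseline, current, tolerance=0.02):
    """Return a process exit code: 0 allows the merge, 1 blocks it."""
    regressions = find_regressions(baseline, current, tolerance)
    for metric, (old, new) in sorted(regressions.items()):
        print(f"REGRESSION {metric}: {old} -> {new}", file=sys.stderr)
    return 1 if regressions else 0


# Hypothetical scores from the main branch and from a pull request.
baseline = {"faithfulness": 0.91, "answer_relevance": 0.88}
current = {"faithfulness": 0.84, "answer_relevance": 0.89}

exit_code = gate(baseline, current)
print(exit_code)  # 1, because faithfulness dropped by more than the tolerance
```

A CI step would pass `exit_code` to `sys.exit` so the pipeline fails and the merge is blocked.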
We checked each platform's documented integrations against the agent frameworks and SDKs teams actually use in 2026: LangChain, LangGraph, OpenAI Agents SDK, Anthropic SDK, Vercel AI SDK, LlamaIndex, DSPy, Pydantic AI, CrewAI, Mastra, plus OpenTelemetry/OpenInference. Native auto-instrumentation scored higher than a generic decorator. Weighted 20%.
Scored on cloud-tier default retention windows on the entry production plan, availability and license of a self-hosted build, and where trace data physically lives. Self-hostable MIT/Apache-licensed platforms with no usage caps scored highest; cloud-only platforms with short default retention and self-hosting gated behind an enterprise contract scored lowest. Weighted 15%.
All-in monthly bill at 240,000 LLM spans per month for a 10-engineer team on the lowest production tier that supports the team size, computed from each vendor's published pricing page in June 2026. The bill components were: for LangSmith Plus, base + per-seat + trace overage; for Langfuse Core, base + overage; for Braintrust Pro, base + overage; for Phoenix, cloud free tier plus self-host infra estimate; for Datadog LLM Observability Pro, base + on-demand LLM-span overage. Cost is reported alongside the quality score rather than folded into it. Weighted 15%.
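The five weights above sum to 100%. One plausible way to combine them is a linear weighted sum, sketched below. The aggregation formula is an assumption, since the rubric states only the weights, and the 0-10 scale and the example scores are hypothetical rather than results from this test.

```python
# Weights as stated in the rubric above.
WEIGHTS = {
    "tracing_depth": 0.25,
    "evaluation_maturity": 0.25,
    "framework_coverage": 0.20,
    "retention_and_ownership": 0.15,
    "effective_cost": 0.15,
}


def composite_score(metric_scores, weights=WEIGHTS):
    """Linear weighted sum of per-metric scores (assumed aggregation)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    missing = weights.keys() - metric_scores.keys()
    if missing:
        raise ValueError(f"missing metric scores: {sorted(missing)}")
    return sum(weights[name] * metric_scores[name] for name in weights)


# Hypothetical per-metric scores on a 0-10 scale, for illustration only.
example_scores = {
    "tracing_depth": 8,
    "evaluation_maturity": 6,
    "framework_coverage": 9,
    "retention_and_ownership": 10,
    "effective_cost": 7,
}

print(round(composite_score(example_scores), 2))  # 7.85
```

Passing a different `weights` mapping makes it easy to see how far a ranking would move under another weighting.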
Langfuse is an open-source LLM engineering platform covering tracing, prompt management, datasets, and evaluation, with an MIT-licensed core that ships native SDKs for Python and JavaScript and connectors for LangChain, LlamaIndex, OpenAI SDK, Vercel AI SDK, and 50+ other frameworks. Cloud pricing is usage-based on units (traces, observations, and scores) with unlimited users on every paid tier, starting at $29/month on Core for 100k units with overage at $8 per 100k units. The trade-off is operational: self-hosting on the MIT version is unusually viable, but the data layer migrated to ClickHouse in v3, and operating ClickHouse at production scale is real ops work.
Source: Langfuse (ClickHouse)
Strengths
- MIT-licensed core with viable self-hosting on the same product
- Unlimited users on every paid tier; cost scales with volume, not headcount
- Framework-agnostic: native integrations for LangChain, LlamaIndex, OpenAI SDK, Vercel AI SDK, and 50+ more
- OpenTelemetry-native after v3, so traces interop with existing observability stacks
Weaknesses
- Self-hosting requires running ClickHouse at production scale
- Some advanced features (annotation queues, Prompt Experiments) require a paid plan even on self-hosted deployments
- Default 90-day retention on Core; 3-year retention only on Pro at $199/month
How it scored, by metric
LangSmith is LangChain's observability and evaluation platform, with native tracing of LangChain and LangGraph applications, annotation queues, LLM-as-judge evaluators, and dataset management for regression testing. The Plus plan is $39 per seat per month with 10,000 base traces included, overages at $0.50 per 1,000 traces (some pricing sources cite $2.50 per 1,000), and 400-day retention. The trade-offs are framework lock-in and self-hosting: the @traceable decorator works with any stack, but the deepest integration is with LangChain and LangGraph, and self-hosting is gated behind the Enterprise tier, so trace data on the Plus plan lives on LangChain's cloud.
Source: LangChain, Inc.
Strengths
- Tightest tracing and evaluation integration with LangChain and LangGraph
- 400-day retention on Plus is the longest default in the field outside Langfuse Pro
- Annotation queues plus offline and online evaluators in one workflow
- Mature prompt playground and dataset management
Weaknesses
- Per-seat pricing scales linearly with team size
- Self-hosting only on Enterprise; Plus traces live on LangChain's cloud
- Framework coverage outside LangChain requires manual instrumentation via @traceable
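Manual instrumentation with `@traceable` outside LangChain looks roughly like the sketch below. The functions and the tiny in-memory corpus are hypothetical, and the fallback decorator exists only so the example runs where the `langsmith` package is not installed. Whether traces are actually sent depends on the LangSmith environment configuration, which this sketch does not set.

```python
try:
    from langsmith import traceable
except ImportError:
    # Fallback no-op so the sketch runs without the langsmith package.
    def traceable(*args, **kwargs):
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]        # used as a bare @traceable
        return lambda func: func  # used as @traceable(...)


@traceable(run_type="tool", name="lookup_docs")
def lookup_docs(query: str) -> list[str]:
    """Hypothetical retrieval step; a real one would query a vector store."""
    corpus = {"refund": "Refunds are processed within 5 days."}
    return [text for key, text in corpus.items() if key in query.lower()]


@traceable(run_type="chain", name="answer_question")
def answer_question(question: str) -> str:
    """Parent step: the nested lookup_docs call is traced as a child run."""
    docs = lookup_docs(question)
    return docs[0] if docs else "No matching document."


print(answer_question("How do I get a refund?"))
```

Each function that should appear as its own span needs its own decorator, which is the manual work the weakness above refers to.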
How it scored, by metric
Braintrust is built around evaluation rather than tracing alone, with custom scorers, LLM-as-judge, playground experiments against production traces, and a native GitHub Action that runs evals on every pull request and can block merges that drop quality below a threshold. The free Starter plan includes 1M trace spans, 10k evaluation scores, and unlimited users; Pro is $249/month with an additional GB-and-score overage, and Enterprise is custom. Braintrust's customer list includes Notion, Stripe, Vercel, Airtable, Instacart, Zapier, Ramp, Dropbox, Cloudflare, and BILL, and the company raised an $80 million Series B in February 2026 at an $800 million valuation. The trade-offs are the jump from free to $249 with no mid-tier, and self-hosting only on Enterprise.
Source: Braintrust Data, Inc.
Strengths
- Evaluation is first-class: scorers, datasets, and experiments are the primary workflow
- Playground reruns alternative prompts and models against logged production traces
- Generous free tier: 1M spans, 10k scores, unlimited users
- Native GitHub Action gates merges on eval regressions
Weaknesses
- Pricing jumps from free to $249/month with no middle tier
- Self-hosting only on Enterprise
- Tracing data billed by GB at a higher per-GB rate than some open-source competitors
How it scored, by metric
Phoenix is Arize AI's open-source observability and evaluation platform, built on OpenTelemetry and OpenInference instrumentation, with out-of-the-box support for OpenAI Agents SDK, Claude Agent SDK, LangGraph, Vercel AI SDK, Mastra, CrewAI, LlamaIndex, DSPy, and the major LLM providers. It runs locally, in a Docker container, on Kubernetes via Helm, or as two free cloud instances at app.phoenix.arize.com. The full enterprise platform (Arize AX) is custom-priced for production-scale teams. Phoenix is the strongest pure open-source choice when trace data has to stay on your infrastructure and you can carry the deployment work; it's weaker than evaluation-first platforms on built-in eval depth for LLM-specific use cases like faithfulness and conversational coherence.
Source: Arize AI
Strengths
- Fully open-source, OpenTelemetry-native, no vendor lock-in
- Runs anywhere: laptop, Jupyter, Docker, or Kubernetes
- Auto-instrumentation for the major agent frameworks and LLM providers via OpenInference
- Two free Phoenix Cloud instances for teams that want zero infrastructure setup
Weaknesses
- Built-in metric coverage for LLM-specific use cases (faithfulness, hallucination) lags evaluation-first platforms
- Production-scale deployment is the team's responsibility
- UX is built for technical users; cross-functional review is harder than on commercial platforms
How it scored, by metric
Datadog LLM Observability adds LLM tracing, automated cost calculation across 800+ models, evaluators, and Sensitive Data Scanner-based PII redaction to the existing Datadog APM platform, with auto-instrumentation for OpenAI, Anthropic, Bedrock, and LangChain calls and free LLM-span ingestion below 40,000 spans per month. Pro starts at $160/month for 100,000 LLM spans, with retention add-ons billed per 10,000 spans and only LLM spans metered (tool, embedding, retrieval, and agent spans are free). The trade-offs are the lack of dedicated AI quality tooling and the overall observability cost: there are no built-in evaluation metrics for faithfulness, relevance, or safety in the way evaluation-first platforms ship them, and Datadog's broader pricing model is its own line item to budget.
Source: Datadog, Inc.
Strengths
- LLM traces correlate with APM, infrastructure, and RUM data in one platform
- Auto-instrumentation for OpenAI, Anthropic, Bedrock, and LangChain via dd-trace
- Sensitive Data Scanner-based PII redaction included with LLM Observability
- Bills only on LLM spans; tool, embedding, retrieval, and agent spans are free
Weaknesses
- AI quality is a feature module on a general-purpose APM platform, not a purpose-built eval tool
- Datadog's broader pricing model is widely reported to surprise teams at scale
- Alerts fire on latency and error rates by default, not on output-quality degradation
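The metering rule described for Datadog, where only LLM spans count toward the bill, can be sketched as a filter over span kinds followed by a tier lookup. The thresholds come from the figures quoted in this article. The span-kind labels, the function names, and the span mix are hypothetical, and overage pricing is left out because the rate is not quoted here.

```python
from collections import Counter

FREE_TIER_LLM_SPANS = 40_000      # free ingestion below this, per the article
PRO_INCLUDED_LLM_SPANS = 100_000  # included in the $160/month Pro base


def billable_llm_spans(span_kinds):
    """Count only spans labeled "llm"; other kinds are not metered."""
    return Counter(span_kinds)["llm"]


def tier_for(llm_spans_per_month):
    """Map a monthly LLM-span count to the tier described in the article."""
    if llm_spans_per_month < FREE_TIER_LLM_SPANS:
        return "free"
    if llm_spans_per_month <= PRO_INCLUDED_LLM_SPANS:
        return "pro (within included spans)"
    return "pro (overage applies)"


# Hypothetical mix for one 12-step agent run.
one_run = ["agent"] + ["llm"] * 5 + ["tool"] * 4 + ["retrieval"] * 2

print(billable_llm_spans(one_run))  # 5 of the 12 spans are metered
print(tier_for(240_000))            # pro (overage applies)
```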
How it scored, by metric
The ranking above reflects the same instrumented LangGraph agent run through each platform at production settings. The largest separator at the top of the table isn’t raw tracing fidelity (every platform in this field captures a usable span tree for a multi-step agent) but the combination of framework coverage, evaluation maturity, and what happens to the data and the bill as the workload scales past a small team.
What the scores measure
Tracing depth and evaluation maturity together carry half the weight in this ranking, because together they decide whether a platform can answer the question that matters in production: “this output was wrong, where did it go wrong, and how do I keep the next deploy from regressing?” Framework coverage decides whether you can answer that question without rewriting your agent against the platform’s preferred SDK. Retention and data ownership decide whether you’ll still have the trace when you need to debug a regression six weeks later, and whether the data sits on a vendor’s cloud or on your own infrastructure.
Where the field separates
Langfuse and Phoenix lead the field on data ownership and framework breadth because both are open-source, OpenTelemetry-native, and viable as self-hosted deployments. LangSmith and Braintrust lead on workflow polish (annotation queues, evaluator libraries, dataset management, and CI gates), but each comes with a pricing cost: per-seat billing on LangSmith and a steep free-to-paid jump on Braintrust. Datadog LLM Observability is the only entry where the LLM workload sits inside a general-purpose APM platform; the trade-off is that AI quality is a feature module rather than a first-class evaluation loop, and the broader Datadog bill is the other line item to budget.
Cost and data sovereignty
Effective cost is tracked on the same workload but kept out of the quality score, because a buyer optimizing for spend, a buyer optimizing for evaluation depth, and a buyer who needs trace data inside their own VPC are answering three different questions. At 240,000 LLM spans per month for a 10-engineer team, Langfuse Core and Phoenix self-host post the strongest cost positions; LangSmith Plus and Braintrust Pro sit in the middle once team size scales; Datadog LLM Observability Pro starts at $160 per month for 100,000 LLM spans before APM, infra, and retention add-ons.
For teams in regulated industries or with EU data-residency requirements, the question isn’t price but where the data physically lives. Self-hosted Langfuse and Phoenix keep every prompt and completion inside the team’s own infrastructure. LangSmith cloud routes through US servers unless the team is on the Enterprise tier, and Braintrust ships self-hosted deployments only on Enterprise. That single fact decides the pick for many regulated buyers before any tracing or evaluation score matters.
- https://langfuse.com/
- https://www.langchain.com/langsmith
- https://www.braintrust.dev/
- https://phoenix.arize.com/
- https://www.datadoghq.com/product/llm-observability/
- https://langfuse.com/pricing
- https://www.langchain.com/pricing
- https://www.braintrust.dev/pricing
- https://github.com/Arize-ai/phoenix
Q. Which LLM observability platform should most teams pick in 2026?
Langfuse is the default choice for most production teams. The core is MIT-licensed and self-hostable with the same product as the cloud version, paid cloud tiers start at $29/month and don't charge per seat, and native integrations cover LangChain, LlamaIndex, OpenAI SDK, Vercel AI SDK, and 50+ other frameworks. LangSmith is the better pick only when the stack is built on LangChain or LangGraph and self-hosting is not a requirement.
Q. Is LangSmith worth the per-seat price over Langfuse?
It depends on framework and team size. LangSmith Plus is $39 per seat per month with 10,000 base traces included and 400-day retention, and integrates more tightly with LangChain and LangGraph than any other platform. A 10-person team is at least $390 per month before trace overages. Langfuse charges by usage, not seats, so a 10-person team on Core pays $29 base plus overage at $8 per 100,000 units. If you're on LangChain and the team is small, LangSmith is competitive; otherwise Langfuse scales more cheaply.
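The comparison above can be worked through with the quoted rates. This is a rough sketch under loud assumptions: each LLM call is treated as one Langfuse unit and one LangSmith trace (real unit and trace counts depend on how the agent is instrumented), overage is prorated linearly, and the 10,000 included LangSmith traces are taken to apply once per organization. The function names are hypothetical.

```python
def langfuse_core_monthly(units, base=29.0, included=100_000,
                          overage_per_100k=8.0):
    """Base fee plus linearly prorated overage (proration is an assumption)."""
    extra = max(0, units - included)
    return base + extra / 100_000 * overage_per_100k


def langsmith_plus_monthly(traces, seats, per_seat=39.0, included=10_000,
                           overage_per_1k=0.50):
    """Per-seat fee plus trace overage; included traces assumed per organization."""
    extra = max(0, traces - included)
    return seats * per_seat + extra / 1_000 * overage_per_1k


# Assumption: 240,000 calls map one-to-one onto units and traces.
volume = 240_000
print(round(langfuse_core_monthly(volume), 2))
print(round(langsmith_plus_monthly(volume, seats=10), 2))
```

Changing `seats` or `volume` shows where the two curves cross: seat count moves only the LangSmith figure, while volume moves both.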
Q. Which platform is best when evaluation matters more than tracing?
Braintrust is the strongest pick when prompt-and-model iteration on production traces is the primary loop. Its playground reruns alternative prompts and models against logged production requests, scorers are first-class, and a native GitHub Action can block a merge when evals regress. The cost trade-off is real: the jump from the free Starter plan to Pro is $249 per month with no middle tier, and self-hosting requires an Enterprise contract.
Q. What about Helicone?
Helicone was acquired by Mintlify in March 2026 and is operating in maintenance mode. The Mintlify team has confirmed only security patches, bug fixes, and new model support will continue, with no new feature development. Teams running Helicone in production should plan migration to one of the actively developed platforms above.
Priya Raman runs the Top AI Tracker test bench. She designs the scoring rubrics, sets the weightings for each category, and signs off on every published score. Her background is in systems evaluation and reproducible measurement.