I Tested 5 Agent Observability Tools and 4 of Them Failed the Causality Test.
I ran LangSmith, Arize Phoenix, Langfuse, LangGraph Studio, and GraphEvals + Neo4j through the same scenario and asked the same question: "Which context influenced this decision?" Only GraphEvals + Neo4j passed the test.
Two weeks ago, I published an article about how an AI agent system cost a company $240,000. One question kept coming up in the comments: “Which observability tool should we use to prevent this?”
So I tested them. All of them.
The Test Scenario
I built the same multi-agent system in five different environments:
Agent A (Router) loads historical Context X from cache
Agent B (Researcher) fetches fresh Context Z
Agent C (Decision Maker) makes final call using both contexts
The bug: Context X is 14 days old (stale). Agent C uses outdated information to make a bad decision.
The debug task: Start with Agent C’s bad decision. Find which context caused it and where it originated.
This is a causality question. Not “what happened when” (timeline). But “what caused what” (relationships).
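To make the scenario concrete, here is a minimal Python sketch of the three agents and the stale-context bug. It is an illustration only, not the system from the test: the class names, the function names, and the fixed clock date are all assumptions. The point is that the decision only looks wrong once you know which contexts fed it and how old each one was.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

# Fixed clock so the sketch is reproducible; the date itself is hypothetical.
NOW = datetime(2025, 1, 15, tzinfo=timezone.utc)


@dataclass(frozen=True)
class Context:
    name: str
    source_agent: str
    fetched_at: datetime

    def age_days(self, now: datetime = NOW) -> int:
        return (now - self.fetched_at).days


@dataclass
class Decision:
    decision_id: str
    made_by: str
    inputs: list = field(default_factory=list)


def router_load_cached() -> Context:
    # Agent A: serves Context X from cache; the cached copy is 14 days old.
    return Context("X", "router", NOW - timedelta(days=14))


def researcher_fetch_fresh() -> Context:
    # Agent B: fetches Context Z at request time.
    return Context("Z", "researcher", NOW)


def decision_maker(contexts) -> Decision:
    # Agent C: decides using both contexts; nothing here checks freshness.
    return Decision("decision_123", "decision_maker", list(contexts))


decision = decision_maker([router_load_cached(), researcher_fetch_fresh()])

# The debug task, stated as code: which inputs were older than 7 days, and who supplied them?
stale = [c for c in decision.inputs if c.age_days() > 7]
print([(c.name, c.source_agent, c.age_days()) for c in stale])
# prints [('X', 'router', 14)]
```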
I tested five tools: GraphEvals + Neo4j, LangGraph Studio, LangSmith, Arize Phoenix, and Langfuse.
The results weren’t what the vendor demos showed.
The Benchmark Result
Here's what I found.
LangGraph Studio
LangGraph Studio is a local debugging UI that visualizes LangGraph execution as a graph of nodes and edges.
For a single trace, this tool is perfect.
I opened the trace in the browser, clicked through the nodes, saw Context X in the Decision Maker’s state, and traced it back to the Router.
Root cause time: 15 minutes.
But here’s the trap.
LangGraph Studio is a debugger, not a database. You cannot query across traces.
If you process 10,000 agent runs per day, you cannot ask: “How many decisions used context older than 7 days?”
You have to manually open each trace and click through the UI.
Perfect for development. Useless for production observability at scale.
LangSmith
LangSmith uses an OpenTelemetry-style span architecture with a waterfall timeline view.
To find the root cause, I had to:
Open the trace and find the Decision Maker span
Expand the nested JSON payload
Find the context field and read the timestamp manually
Scroll up the timeline to find the Router span
Confirm where the context originated
Root cause time: 60-90 minutes of manual JSON archaeology.
LangSmith shows you when things happened. It doesn’t show you why.
To find causality, you must manually parse data and calculate time deltas in your head. It has no topological query engine.
You can filter by keyword. You cannot query relationships between spans.
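The manual steps above amount to roughly the following script. This is a sketch over hypothetical flat spans, not LangSmith's real export schema: the field names (`span_id`, `parent_id`, `input`, `output`) and the payload shapes are assumptions. Notice that the link from a context back to the span that produced it is never stored; it has to be rebuilt by matching payload contents.

```python
import json
from datetime import datetime

# Hypothetical flat spans, shaped like a generic trace export (not any vendor's real schema).
spans = [
    {"span_id": "s1", "parent_id": None, "name": "router",
     "start": "2025-01-15T10:00:00+00:00",
     "output": json.dumps({"context": {"id": "X", "fetched_at": "2025-01-01T10:00:00+00:00"}})},
    {"span_id": "s2", "parent_id": "s1", "name": "researcher",
     "start": "2025-01-15T10:00:01+00:00",
     "output": json.dumps({"context": {"id": "Z", "fetched_at": "2025-01-15T10:00:01+00:00"}})},
    {"span_id": "s3", "parent_id": "s1", "name": "decision_maker",
     "start": "2025-01-15T10:00:02+00:00",
     "input": json.dumps({"contexts": [
         {"id": "X", "fetched_at": "2025-01-01T10:00:00+00:00"},
         {"id": "Z", "fetched_at": "2025-01-15T10:00:01+00:00"}]})},
]


def find_stale_origin(spans, max_age_days=7):
    """Replay the manual steps: open the decision span, parse its JSON,
    compute each context's age, then scan the other spans for the producer."""
    decision = next(s for s in spans if s["name"] == "decision_maker")
    decided_at = datetime.fromisoformat(decision["start"])
    findings = []
    for ctx in json.loads(decision["input"])["contexts"]:
        age = (decided_at - datetime.fromisoformat(ctx["fetched_at"])).days
        if age <= max_age_days:
            continue
        # No edge links a context to its producer, so match on payload contents.
        producer = next(
            (s["name"] for s in spans
             if "output" in s and json.loads(s["output"])["context"]["id"] == ctx["id"]),
            None,
        )
        findings.append((ctx["id"], age, producer))
    return findings


print(find_stale_origin(spans))
# prints [('X', 14, 'router')]
```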
Arize Phoenix
Arize Phoenix markets “graph tracing” prominently. But it’s built on flat OTLP traces.
The debugging process was nearly identical to LangSmith's:
Manually expand JSON payloads
Calculate timestamps to find stale context
Root cause time: ~60 minutes.
The “graph tracing” claim is marketing, not architecture.
You can see parent-child relationships between spans. You cannot query the data flow topology.
Phoenix has better attribute filtering than LangSmith. But it still fails to answer the causality question without significant manual work.
Langfuse
Langfuse is functionally similar to LangSmith and Arize Phoenix. Timeline-based waterfall view.
The debugging process required the same manual JSON digging.
Root cause time: 60+ minutes.
This tool proves the divide is not between open-source and paid software.
The divide is between flat logs and graph databases.
Langfuse is good for tracking token counts and latency. It fails the causality test at scale.
GraphEvals + Neo4j
I built a custom instrumentation layer that writes every agent decision and context propagation directly into a Neo4j graph database.
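A layer like this could look roughly like the sketch below. It is not the actual implementation. The `Decision` and `Context` labels, the `INFLUENCED_BY` relationship, and the `age_days` property come from the query in this article; every other name (`ContextRecord`, `build_decision_write`, the `agent` and `source_agent` properties) is an assumption. The function only builds a parameterized Cypher statement, so it runs without a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextRecord:
    context_id: str
    source_agent: str
    age_days: int  # age at decision time, matching the age_days property the query filters on


# Parameterized Cypher: MERGE keeps nodes unique, UNWIND writes one edge per context.
# Property names other than id and age_days are assumptions made for this sketch.
WRITE_DECISION = """
MERGE (d:Decision {id: $decision_id})
SET d.agent = $agent
WITH d
UNWIND $contexts AS c
MERGE (ctx:Context {id: c.id})
SET ctx.age_days = c.age_days, ctx.source_agent = c.source_agent
MERGE (d)-[:INFLUENCED_BY]->(ctx)
"""


def build_decision_write(decision_id, agent, contexts):
    """Return (query, parameters) for one decision and the contexts it consumed."""
    params = {
        "decision_id": decision_id,
        "agent": agent,
        "contexts": [
            {"id": c.context_id, "age_days": c.age_days, "source_agent": c.source_agent}
            for c in contexts
        ],
    }
    return WRITE_DECISION, params


query, params = build_decision_write(
    "decision_123",
    "decision_maker",
    [ContextRecord("X", "router", 14), ContextRecord("Z", "researcher", 0)],
)
# With the official neo4j Python driver, this pair could be passed to
# session.run(query, params). No database is contacted in this sketch.
print(params["contexts"][0])
```

Writing the edge at the moment the decision is made is the design choice that matters: the relationship is recorded once, up front, instead of being reconstructed from payloads later.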
When Agent C made the bad decision, I ran the following Cypher query:
MATCH (decision:Decision {id: 'decision_123'})
      -[:INFLUENCED_BY]->(context:Context)
WHERE context.age_days > 7
RETURN context
The result: Context X from Agent A, 14 days old. Returned in 2 seconds.
Root cause time: 40 minutes (including time to write the query and understand the graph topology).
This is the only tool that allows you to query causality at scale.
If you have 10,000 agent executions, you can find every decision influenced by stale context with one command.
The trade-off: It requires building your own instrumentation and maintaining the database. For a single trace it is slower than LangGraph Studio (40 minutes vs. 15), but it is the only one that scales to bulk analysis.
Quick Check:
If you use any of these tools, pull up your last agent debugging session. How long did it take to find which context caused the bad output?
If the answer is > 30 minutes, you don’t have causality tracing.
You have manual log archaeology.
Why They Failed
Four out of five tools failed the causality test. It comes down to the fundamental data model.
Timeline-Based Architecture (LangSmith, Phoenix, Langfuse)
These tools model execution as spans - records of things that happened at specific times.
They show you a waterfall of boxes stacked vertically. Causality is buried in JSON metadata. Data flow is not a first-class relationship.
To find a pattern across 1,000 traces, you have to export the data and write a script to parse JSON.
They record the “when,” not the “why.”
Graph-Based Architecture (Neo4j, LangGraph Studio)
Agents, decisions, and contexts are nodes. Relationships like PROVIDED_CONTEXT or INFLUENCED_BY are explicit edges.
This allows topological queries. You’re not parsing JSON. You’re querying the shape of execution.
This is the only way to find multi-hop context propagation at scale.
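To show what "querying the shape of execution" means, here is a small in-memory Python sketch of the same idea, with no database involved. The `PROVIDED_CONTEXT` and `INFLUENCED_BY` relationship names come from this article; the node ids and the `DERIVED_FROM` hop are hypothetical, added only to create a multi-hop path.

```python
from collections import defaultdict

# Edges as (source, relationship, target). The node ids and the DERIVED_FROM
# hop are hypothetical; they exist only to give the traversal more than one hop.
edges = [
    ("agent:router", "PROVIDED_CONTEXT", "context:X"),
    ("agent:researcher", "PROVIDED_CONTEXT", "context:Z"),
    ("context:summary", "DERIVED_FROM", "context:X"),
    ("decision:123", "INFLUENCED_BY", "context:summary"),
    ("decision:123", "INFLUENCED_BY", "context:Z"),
]

outgoing = defaultdict(list)
incoming = defaultdict(list)
for src, rel, dst in edges:
    outgoing[src].append((rel, dst))
    incoming[dst].append((rel, src))


def origins(decision):
    """Walk INFLUENCED_BY and DERIVED_FROM edges from a decision, then report
    which agent provided each context reached along the way."""
    found, stack, seen = set(), [decision], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for rel, dst in outgoing[node]:
            if rel in ("INFLUENCED_BY", "DERIVED_FROM"):
                stack.append(dst)
        for rel, src in incoming[node]:
            if rel == "PROVIDED_CONTEXT":
                found.add((node, src))
    return sorted(found)


print(origins("decision:123"))
# prints [('context:X', 'agent:router'), ('context:Z', 'agent:researcher')]
```

The traversal reaches the router through an intermediate context without parsing a single payload, because the relationships are the data.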
The LangGraph Studio Trap
LangGraph Studio looks like it solves the problem because it draws a graph.
But a graph UI is not a graph database.
Studio is a snapshot of one execution. Neo4j is a queryable history of all executions.
Beautiful graph UI ≠ architectural capability.
You cannot query patterns across thousands of executions in Studio. You can only click through one trace at a time.
Which One Should You Use?
Your choice depends on your scale and debugging style.
< 100 traces/day → LangGraph Studio
If you’re generating fewer than 100 traces per day, LangGraph Studio is the best choice.
Fast visual debugging. Zero custom instrumentation. You can accept manual work for occasional incidents because volume is low.
100-1,000 traces/day → LangSmith or Arize Phoenix
If you’re generating 100-1,000 traces per day, LangSmith or Phoenix are acceptable.
You get vendor support and standard LLM observability (token tracking, latency).
Accept that causality debugging will take 60 minutes of manual JSON digging per incident.
> 1,000 traces/day → GraphEvals + Neo4j
If you’re generating more than 1,000 traces per day and need to find patterns in context propagation, build a custom Neo4j solution.
It’s the only option that lets you query causality paths across massive datasets in seconds.
Trade-off: Higher maintenance burden for the ability to understand why your agents are failing.
The lesson learned
I didn't set out to prove Neo4j was the best tool. I set out to find which tool could answer the causality question.
The testing revealed a brutal truth:
Pretty UIs ≠ architectural capability.
LangSmith has a beautiful timeline. Still requires manual archaeology.
Arize Phoenix has impressive graphics. Still flat spans underneath.
LangGraph Studio is perfect for one trace. Cannot query at scale.
The divide: Timeline architecture (records "when") vs Graph architecture (records "what caused what").
If you’re building complex multi-agent systems, you probably need graph observability.
But today, the only way to get true graph observability at scale is to build it yourself using Neo4j.
Conclusion
Every vendor demo showed me beautiful waterfalls and pretty arrows.
None showed me how long it takes to answer: “Which context influenced this decision?”
The Causality Test Results
LangSmith
Single trace: 60-90 minutes (manual JSON digging)
Bulk queries: Export + script required
Verdict: Timeline tool ❌
Arize Phoenix
Single trace: 60 minutes (flat span inspection)
Bulk queries: Export + script required
Verdict: "Graph tracing" is marketing ❌
Langfuse
Single trace: 60+ minutes (timeline analysis)
Bulk queries: Export + script required
Verdict: Open-source timeline tool ❌
LangGraph Studio
Single trace: 15 minutes (visual debugging) ✅
Bulk queries: No query language
Verdict: Debugger, not database ⚠️
GraphEvals + Neo4j
Single trace: 40 minutes (slower than LangGraph Studio)
Bulk queries: <5 seconds ✅
Verdict: Only option that scales ✅
Beautiful timelines ≠ causality tracing.
"Graph" in marketing ≠ graph database underneath.
Open Source ≠ better architecture.
Before you buy or build, ask the causality question:
"Show me every decision influenced by context older than 7 days that bypassed the researcher node."
If the answer is “export the data and write a script,” you don’t have observability.
You have expensive logs with a nice dashboard.
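To pin down what that question asks for, here it is as a Python predicate over a handful of hypothetical run records. The `Run` fields and the sample data are assumptions, and I read "bypassed the researcher node" as "the researcher never executed in that run". A tool passes if it can evaluate something equivalent across all stored runs without an export step.

```python
from dataclasses import dataclass


@dataclass
class Run:
    decision_id: str
    nodes_visited: tuple   # agent nodes executed in this run
    context_ages: dict     # context id -> age in days at decision time


# Hypothetical run records; in a graph store these would be nodes and edges.
runs = [
    Run("d1", ("router", "researcher", "decision_maker"), {"X": 14, "Z": 0}),
    Run("d2", ("router", "decision_maker"), {"X": 14}),
    Run("d3", ("router", "decision_maker"), {"X": 2}),
    Run("d4", ("router", "researcher", "decision_maker"), {"X": 1, "Z": 0}),
]


def stale_and_bypassed(runs, max_age_days=7, required_node="researcher"):
    """Decisions that used at least one stale context AND never ran the required node."""
    return [
        (r.decision_id,
         sorted(c for c, age in r.context_ages.items() if age > max_age_days))
        for r in runs
        if required_node not in r.nodes_visited
        and any(age > max_age_days for age in r.context_ages.values())
    ]


print(stale_and_bypassed(runs))
# prints [('d2', ['X'])]
```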
If your agents process <100 traces/day, use LangGraph Studio.
If you need to query thousands of traces, build a custom graph solution.
Everything else is a timeline tool with varying degrees of polish.
I tested the tools so you don't have to.
The one that passes the causality test is the one you have to build yourself.
For now.



