BinaryBox

I tested 5 Agent Observability Tools and they all Failed the Causality Test.

Ashok Vishwakarma — Mon, 18 May 2026 06:02:36 GMT

Two weeks ago, I published an article about how an AI agent system cost a company $240,000. One question kept coming up in the comments: “Which observability tool should we use to prevent this?”

So I tested them. All of them.

The Test Scenario

I built the same multi-agent system in five different environments:

Agent A (Router) loads historical Context X from cache
Agent B (Researcher) fetches fresh Context Z
Agent C (Decision Maker) makes final call using both contexts

The bug: Context X is 14 days old (stale). Agent C uses outdated information to make a bad decision.

The debug task: Start with Agent C’s bad decision. Find which context caused it and where it originated.

This is a causality question. Not “what happened when” (timeline). But “what caused what” (relationships).

I tested five tools: GraphEvals + Neo4j, LangGraph Studio, LangSmith, Arize Phoenix, and Langfuse.

The results weren’t what the vendor demos showed.

The Benchmark Result

Here's what I found.

LangGraph Studio

LangGraph Studio is a local debugging UI that visualizes LangGraph execution as a directed graph of nodes and edges.

For a single trace, this tool is perfect.

I opened the trace in the browser, clicked through the nodes, saw Context X in the Decision Maker’s state, and traced it back to the Router.

Root cause time: 15 minutes.

But here’s the trap.

LangGraph Studio is a debugger, not a database. You cannot query across traces.

If you process 10,000 agent runs per day, you cannot ask: “How many decisions used context older than 7 days?”

You have to manually open each trace and click through the UI.

Perfect for development. Useless for production observability at scale.

LangSmith

LangSmith uses an OpenTelemetry-style span architecture with a waterfall timeline view.

To find the root cause, I had to:

Open the trace and find the Decision Maker span
Expand the nested JSON payload
Find the context field and read the timestamp manually
Scroll up the timeline to find the Router span
Confirm where the context originated

Root cause time: 60-90 minutes of manual JSON archaeology.

LangSmith shows you when things happened. It doesn’t show you why.

To find causality, you must manually parse data and calculate time deltas in your head. It has no topological query engine.

You can filter by keyword. You cannot query relationships between spans.

Arize Phoenix

Arize Phoenix markets “graph tracing” prominently. But it’s built on flat OTLP traces.

The debugging process was nearly identical to LangSmith:

Manually expand JSON payloads
Calculate timestamps to find stale context

Root cause time: ~60 minutes.

The “graph tracing” claim is marketing, not architecture.

You can see parent-child relationships between spans. You cannot query the data flow topology.

Phoenix has better attribute filtering than LangSmith. But it still fails to answer the causality question without significant manual work.

Langfuse

Langfuse is functionally similar to LangSmith and Arize Phoenix. Timeline-based waterfall view.

The debugging process required the same manual JSON digging.

Root cause time: 60+ minutes.

This tool proves the divide is not between open-source and paid software.

The divide is between flat logs and graph databases.

Langfuse is good for tracking token counts and latency. It fails the causality test at scale.

GraphEvals + Neo4j

I built a custom instrumentation layer that writes every agent decision and context propagation directly into a Neo4j graph database.

When Agent C made the bad decision, I ran the following Cypher query:

MATCH (decision:Decision {id: 'decision_123'})
      -[:INFLUENCED_BY]->(context:Context)
WHERE context.age_days > 7
RETURN context

The result: Context X from Agent A, 14 days old, returned in 2 seconds.

Root cause time: 40 minutes (including time to write the query and understand the graph topology).

This is the only tool that allows you to query causality at scale.

If you have 10,000 agent executions, you can find every decision influenced by stale context with one command.

The trade-off: Requires building your own instrumentation and maintaining the database. Slowest for debugging a single trace, but the only one that scales to bulk analysis.

Quick Check:

If you use any of these tools, pull up your last agent debugging session. How long did it take to find which context caused the bad output?

If the answer is > 30 minutes, you don’t have causality tracing.

You have manual log archaeology.

Why did they fail?

Four out of five tools failed the causality test. It comes down to the fundamental data model.

Timeline Based Architecture (LangSmith, Phoenix, Langfuse)

These tools model execution as spans - records of things that happened at specific times.

They show you a waterfall of boxes stacked vertically. Causality is buried in JSON metadata. Data flow is not a first-class relationship.

To find a pattern across 1,000 traces, you have to export the data and write a script to parse JSON.
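The export-and-parse script that this implies can be sketched roughly as follows. It is a minimal sketch, not any vendor's export schema: the span layout (`name`, `start_time`, `inputs.context.retrieved_at`) and the sample data are assumptions made up for illustration.

```python
import json
from datetime import datetime

# Hypothetical export: one JSON object per span. Real exports differ per tool.
EXPORTED_SPANS = """
[
  {"trace_id": "t1", "name": "decision_maker",
   "start_time": "2026-05-01T10:00:00+00:00",
   "inputs": {"context": {"id": "ctx_x",
                          "retrieved_at": "2026-04-17T10:00:00+00:00"}}},
  {"trace_id": "t2", "name": "decision_maker",
   "start_time": "2026-05-01T11:00:00+00:00",
   "inputs": {"context": {"id": "ctx_z",
                          "retrieved_at": "2026-05-01T10:59:00+00:00"}}},
  {"trace_id": "t2", "name": "router",
   "start_time": "2026-05-01T10:58:00+00:00", "inputs": {}}
]
"""


def stale_decisions(spans, max_age_days=7):
    """Return (trace_id, context_id, age_days) for decisions that used old context."""
    hits = []
    for span in spans:
        if span["name"] != "decision_maker":
            continue
        context = span.get("inputs", {}).get("context")
        if context is None:
            continue
        # The age has to be derived by hand from two timestamps in the payload.
        started = datetime.fromisoformat(span["start_time"])
        retrieved = datetime.fromisoformat(context["retrieved_at"])
        age_days = (started - retrieved).days
        if age_days > max_age_days:
            hits.append((span["trace_id"], context["id"], age_days))
    return hits


print(stale_decisions(json.loads(EXPORTED_SPANS)))
```

Note what the script cannot tell you: it finds the stale payload, but where that context originated still has to be reconstructed separately from the other spans.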

They record the “when,” not the “why.”

Graph Based Architecture (Neo4j, LangGraph Studio)

Agents, decisions, and contexts are nodes. Relationships like PROVIDED_CONTEXT or INFLUENCED_BY are explicit edges.

This allows topological queries. You’re not parsing JSON. You’re querying the shape of execution.

This is the only way to find multi-hop context propagation at scale.
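To make "querying the shape of execution" concrete, here is a small in-memory sketch in Python. It is not the Neo4j setup from the benchmark: the `Context` record, the `path` field, and the sample edges are hypothetical stand-ins for INFLUENCED_BY relationships. It answers a question of the form "which decisions were influenced by context older than 7 days that never passed through the researcher?"

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    id: str
    age_days: int
    path: tuple  # agent nodes the context passed through, in order


# Hypothetical INFLUENCED_BY edges: decision id -> contexts that fed it.
INFLUENCED_BY = {
    "decision_123": [
        Context("ctx_x", 14, ("router",)),
        Context("ctx_z", 0, ("researcher",)),
    ],
    "decision_124": [Context("ctx_y", 10, ("router", "researcher"))],
    "decision_125": [Context("ctx_w", 2, ("router",))],
}


def stale_and_bypassed(edges, max_age_days=7, required_node="researcher"):
    """Decisions fed by at least one old context that skipped required_node."""
    return sorted(
        decision
        for decision, contexts in edges.items()
        if any(
            c.age_days > max_age_days and required_node not in c.path
            for c in contexts
        )
    )


print(stale_and_bypassed(INFLUENCED_BY))
```

Because the relationship is stored explicitly, the filter never has to open a payload or compare timestamps by hand.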

The LangGraph Studio Trap

LangGraph Studio looks like it solves the problem because it draws a graph.

But a graph UI is not a graph database.

Studio is a snapshot of one execution. Neo4j is a queryable history of all executions.

Beautiful graph UI ≠ architectural capability.

You cannot query patterns across thousands of executions in Studio. You can only click through one trace at a time.

Which one should you use?

Your choice depends on your scale and debugging style.

< 100 traces/day → LangGraph Studio

If you’re generating fewer than 100 traces per day, LangGraph Studio is the best choice.

Fast visual debugging. Zero custom instrumentation. You can accept manual work for occasional incidents because volume is low.

100-1,000 traces/day → LangSmith or Arize Phoenix

If you’re generating 100-1,000 traces per day, LangSmith or Phoenix are acceptable.

You get vendor support and standard LLM observability (token tracking, latency).

Accept that causality debugging will take 60 minutes of manual JSON digging per incident.

> 1,000 traces/day → GraphEvals + Neo4j

If you’re generating more than 1,000 traces per day and need to find patterns in context propagation, build a custom Neo4j solution.

It’s the only option that lets you query causality paths across massive datasets in seconds.

Trade-off: Higher maintenance burden for the ability to understand why your agents are failing.

The lesson learned

I didn't set out to prove Neo4j was the best tool. I set out to find which tool could answer the causality question.

The testing revealed a brutal truth:

Pretty UIs ≠ architectural capability.

LangSmith has a beautiful timeline. Still requires manual archaeology.
Arize Phoenix has impressive graphics. Still flat spans underneath.
LangGraph Studio is perfect for one trace. Cannot query at scale.

The divide: Timeline architecture (records "when") vs Graph architecture (records "what caused what").

If you’re building complex multi-agent systems, you probably need graph observability.

But today, the only way to get true graph observability at scale is to build it yourself using Neo4j.

Conclusion

Every vendor demo showed me beautiful waterfalls and pretty arrows.

None showed me how long it takes to answer: “Which context influenced this decision?”

The Causality Test Results

LangSmith

Single trace: 60-90 minutes (manual JSON digging)
Bulk queries: Export + script required
Verdict: Timeline tool ❌

Arize Phoenix

Single trace: 60 minutes (flat span inspection)
Bulk queries: Export + script required
Verdict: "Graph tracing" is marketing ❌

Langfuse

Single trace: 60-90 minutes (timeline analysis)
Bulk queries: Export + script required
Verdict: Open-source timeline tool ❌

LangGraph Studio

Single trace: 15 minutes (visual debugging) ✅
Bulk queries: No query language
Verdict: Debugger, not database ⚠️

GraphEvals + Neo4j

Single trace: 40 minutes (slower)
Bulk queries: <5 seconds ✅
Verdict: Only option that scales ✅

Beautiful timelines ≠ causality tracing.

"Graph" in marketing ≠ graph database underneath.

Open Source ≠ better architecture.

Before you buy or build, ask the causality question:

"Show me every decision influenced by context older than 7 days that bypassed the researcher node."

If the answer is “export the data and write a script,” you don’t have observability.

You have expensive logs with a nice dashboard.

If your agents process <100 traces/day, use LangGraph Studio.

If you need to query thousands of traces, build a custom graph solution.

Everything else is a timeline tool with varying degrees of polish.

I tested the tools so you don't have to.

The one that passes the causality test is the one you have to build yourself.

For now.

Why LinkedIn is leaving Kafka and Why you should not be worried.

Ashok Vishwakarma — Sat, 16 May 2026 06:01:19 GMT

For the last decade engineering teams made a pilgrimage to Kafka. You deployed it because LinkedIn built it. You assumed that if it was good enough for a company processing billions of events it was certainly good enough for you. You set up your five node cluster with a ZooKeeper ensemble and configured your producer for idempotent writes. You spent three months getting it production ready.

Your team processes roughly 10 million events per day. LinkedIn operates at 32 trillion events per day in 2026.

The gap is staggering. You are processing 3.2 million times less than the scale where LinkedIn finally hit the ceiling. And here is the kicker. LinkedIn is moving off Kafka. They built Northguard and Xinfra to replace it.

The question you should be asking is simple. If LinkedIn outgrew Kafka at 32 trillion events and you are at 10 million, do you actually need Kafka? Or did you cargo cult their infrastructure without understanding the scale that made it necessary?

The story of LinkedIn replacing Kafka is not a signal that Kafka is dead. It is a lesson in understanding scale. Kafka scaled 23,000x before LinkedIn needed something else. You are running 140x smaller than where they started in 2011. Let us talk about what actually broke at their scale and why you probably do not have that problem.

What broke Kafka at LinkedIn scale?

We need to look at the architectural limits of centralized coordination to understand the failure.

The problem of Centralized Metadata

Kafka relies on a single controller node to manage all partition metadata. The controller handles partition leader elections and replica assignments and metadata updates across every broker.

At your scale of 100 topics this works perfectly. At the LinkedIn scale of 400,000 topics and millions of partitions the controller becomes a catastrophic bottleneck.

When the controller fails a new leader election must occur via the KRaft quorum. The new controller must load every single piece of partition metadata into memory and rebuild the state. At the scale of 32 trillion events this reconstruction takes minutes.

And during that time the infrastructure is essentially frozen. No new topics can be created and no partitions can rebalance.

Centralized metadata management eventually hits a physical ceiling.

The problem of Infra wide Rebalancing

When a consumer group rebalances Kafka pauses consumption for the affected partitions.

At small scale this takes seconds. At LinkedIn scale rebalancing touches millions of partitions across thousands of topics simultaneously.

This causes infrastructure wide pauses that last for minutes. Even cooperative rebalancing cannot solve this when the sheer number of partitions creates a coordination explosion.

Consumer lag spikes and downstream systems experience massive latency increases during these coordinated events.

The problem of Fixed Partition

In Kafka you choose your partition count upfront. If you choose too few you hit a throughput wall. If you choose too many you waste resources.

As your data grows you cannot dynamically split a partition without downtime.

For a company with 400,000 topics repartitioning is operationally impossible. It requires stopping producers and migrating data and updating consumers across thousands of applications.

The problem of Coordination

The metadata storage for millions of partitions reaches gigabytes in size. Every producer and consumer reads this partition metadata on startup. Metadata updates generate massive broadcast traffic across the cluster.

Coordination mechanisms like ZooKeeper or KRaft eventually hit a physical limit of how much state they can broadcast to every broker in a timely manner.

Calculate this right now.

Divide 32 trillion by your daily event volume. If the result is greater than 1 million you are more than a million times smaller than the scale where these problems appear.

You do not have a Kafka problem. You have a scale perception problem.
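The calculation takes a few lines of Python. The two LinkedIn figures are the ones quoted in this article; the function name is just illustrative.

```python
LINKEDIN_EVENTS_PER_DAY = 32_000_000_000_000  # 32 trillion, as quoted above
LINKEDIN_2011_EVENTS_PER_DAY = 1_400_000_000  # 1.4 billion, as quoted above


def scale_gap(your_events_per_day):
    """How many times smaller you are than LinkedIn's 2026 and 2011 volumes."""
    return (
        LINKEDIN_EVENTS_PER_DAY / your_events_per_day,
        LINKEDIN_2011_EVENTS_PER_DAY / your_events_per_day,
    )


gap_2026, gap_2011 = scale_gap(10_000_000)
print(f"{gap_2026:,.0f}x below the 2026 volume")  # 3,200,000x
print(f"{gap_2011:,.0f}x below the 2011 volume")  # 140x
```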

How does Northguard solve the scale problem?

Northguard replaces Kafka with a fundamentally different model designed for the frontier of distributed systems. It uses sharded metadata and range based partitioning and self balancing clusters.

Sharded Metadata

Instead of a single controller Northguard distributes metadata across vnodes using consistent hashing. Each vnode manages a subset of topics and uses Raft consensus for strong consistency. This removes the single point of coordination.

If your metadata fits comfortably in a single KRaft quorum you do not need this. LinkedIn needed it because their metadata exceeded the memory capacity of any single node.
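For readers who have not met consistent hashing, here is a generic textbook sketch of how topics can be mapped to metadata shards. It is not Northguard's implementation: the `HashRing` class, the vnode names, and the points-per-vnode setting are assumptions for illustration, and it covers placement only, not the Raft replication inside each vnode.

```python
import bisect
import hashlib


class HashRing:
    """Maps topic names to vnodes using a sorted ring of hash points."""

    def __init__(self, vnodes, points_per_vnode=64):
        ring = []
        for vnode in vnodes:
            # Several points per vnode smooth out the distribution.
            for i in range(points_per_vnode):
                ring.append((self._hash(f"{vnode}#{i}"), vnode))
        ring.sort()
        self._hashes = [h for h, _ in ring]
        self._owners = [v for _, v in ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def owner(self, topic):
        # First ring point clockwise from the topic's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, self._hash(topic))
        return self._owners[idx % len(self._owners)]


ring = HashRing(["vnode-a", "vnode-b", "vnode-c"])
print(ring.owner("orders"), ring.owner("page-views"))
```

The useful property is that adding a vnode only moves the topics that land on the new vnode; every other topic keeps its owner, so no global reshuffle is needed.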

Range Based Partitioning

Instead of fixed partitions Northguard uses ranges. A range is a contiguous slice of the keyspace that can dynamically split or merge without downtime. When a range grows too large it is marked for splitting and child ranges take over the future writes while the old range is sealed.

If you can estimate your partition count upfront you do not need this complexity. Fixed partitions are simpler to manage until you hit the 400,000 topic mark.
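The split behaviour described above can be sketched in a few lines. This is an illustrative toy, not Northguard code: the `Topic` and `Range` classes, the integer keyspace, and the record-count split trigger are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Range:
    start: int
    end: int  # exclusive
    sealed: bool = False
    records: list = field(default_factory=list)


class Topic:
    def __init__(self, keyspace=2**16, max_records=4):
        self.ranges = [Range(0, keyspace)]
        self.max_records = max_records

    def active_range(self, key_hash):
        """Find the unsealed range that owns this key."""
        for r in self.ranges:
            if not r.sealed and r.start <= key_hash < r.end:
                return r
        raise KeyError(key_hash)

    def append(self, key_hash, value):
        r = self.active_range(key_hash)
        r.records.append((key_hash, value))
        if len(r.records) >= self.max_records and r.end - r.start > 1:
            self._split(r)

    def _split(self, parent):
        # Seal the parent; its data stays readable, children take new writes.
        mid = (parent.start + parent.end) // 2
        parent.sealed = True
        self.ranges.append(Range(parent.start, mid))
        self.ranges.append(Range(mid, parent.end))


topic = Topic(keyspace=100, max_records=2)
topic.append(10, "a")
topic.append(60, "b")  # triggers the split
topic.append(70, "c")  # lands in the child range [50, 100)
print([(r.start, r.end, r.sealed, len(r.records)) for r in topic.ranges])
```

No producer has to stop while this happens, which is the contrast with changing a fixed partition count.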

Self Balancing Clusters

In Northguard new segments are automatically assigned to the least loaded brokers. There is no explicit rebalancing operation required. If a broker fails the existing segments remain and new ones simply go to healthy nodes.

If your Kafka cluster only rebalances once per quarter then rebalancing is not your bottleneck. LinkedIn needed this because they add brokers and manage 150 clusters constantly. For them manual rebalancing was a full time operational tax.
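The placement rule itself is simple. A minimal sketch, assuming a hypothetical `assign_segment` helper and a per-broker segment count as the load metric (the real system's load signal is not described here):

```python
def assign_segment(segment_counts, healthy_brokers):
    """Place a new segment on the least loaded healthy broker."""
    candidates = [b for b in segment_counts if b in healthy_brokers]
    if not candidates:
        raise RuntimeError("no healthy brokers available")
    # Ties are broken by broker name so the choice is deterministic.
    target = min(candidates, key=lambda b: (segment_counts[b], b))
    segment_counts[target] += 1
    return target


counts = {"broker-1": 5, "broker-2": 2, "broker-3": 2}
print(assign_segment(counts, {"broker-1", "broker-2", "broker-3"}))
# broker-2 fails: its existing segments stay where they are,
# and new segments simply go to the remaining healthy brokers.
print(assign_segment(counts, {"broker-1", "broker-3"}))
```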

Do you actually need Kafka?

We need to use a strict framework based on actual event volume to decide if Kafka really belongs in your stack.

If you process less than 10 million events per day you probably do not need Kafka. Redis Streams or SQS or even Postgres NOTIFY will work with significantly less operational overhead.

If you process between 10 million and 100 million events per day managed Kafka like AWS MSK makes sense. The volume justifies the tool but not the team required to self host it.

You only hit the scale where self hosted Kafka is justified when you cross 100 million events per day. You only hit the LinkedIn 2011 scale at 1.4 billion events per day.
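Written as code, the volume thresholds above look like this. The function name and return strings are illustrative, and volume is only the first filter.

```python
def recommend(events_per_day):
    """Map daily event volume to the tiers described above."""
    if events_per_day < 10_000_000:
        return "lightweight queue (Redis Streams, SQS, or Postgres NOTIFY)"
    if events_per_day <= 100_000_000:
        return "managed Kafka (for example AWS MSK)"
    return "self-hosted Kafka"


for volume in (2_000_000, 10_000_000, 250_000_000):
    print(f"{volume:>13,} events/day -> {recommend(volume)}")
```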

Ask yourself these three questions.

Do you actually need per partition ordering guarantees? If not, use SQS.
Do you need event replay for backfilling new services? If not, use a standard message queue.
Do you need exactly once semantics for financial transactions? If yes, then Kafka is the right tool.

If you are choosing Kafka because everyone uses it you are cargo culting. Kafka is phenomenal for high throughput event streaming but it comes with a massive operational tax.

How did LinkedIn migrate?

There is a deeper lesson to learn from how LinkedIn migrated away from Kafka.

They built Xinfra which is a virtualized Pub/Sub layer that abstracts the physical clusters.

This allowed them to migrate topics from Kafka to Northguard without rewriting their application code. They used a dual write mechanism to ensure zero downtime. This is what mature platform engineering looks like.

The lesson is not that you should deploy Northguard. The lesson is that you should abstract your infrastructure. Do not let your applications call Kafka APIs directly. Wrap them in an internal library. If you ever outgrow your current tool you will be able to switch without rewriting every service in your company.
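A minimal sketch of what such an internal library could look like. This is not Xinfra: `EventBus`, `InMemoryBus`, and `DualWriteBus` are hypothetical names, and the in-memory backend stands in for a real Kafka or SQS client.

```python
from collections import defaultdict
from typing import Protocol


class EventBus(Protocol):
    """The only messaging interface application code is allowed to see."""

    def publish(self, topic: str, payload: bytes) -> None: ...


class InMemoryBus:
    """Stand-in backend; a real one would wrap a broker client."""

    def __init__(self):
        self.messages = defaultdict(list)

    def publish(self, topic, payload):
        self.messages[topic].append(payload)


class DualWriteBus:
    """Writes to the old and the new backend during a migration window."""

    def __init__(self, old: EventBus, new: EventBus):
        self._old, self._new = old, new

    def publish(self, topic, payload):
        self._old.publish(topic, payload)
        self._new.publish(topic, payload)


def record_order(bus: EventBus, order_id: str) -> None:
    # Application code depends on EventBus, never on a vendor API.
    bus.publish("orders", order_id.encode())


old_backend, new_backend = InMemoryBus(), InMemoryBus()
record_order(DualWriteBus(old_backend, new_backend), "order-42")
print(old_backend.messages["orders"], new_backend.messages["orders"])
```

Swapping the backend then means changing one constructor call, not every service.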

Conclusion

LinkedIn built Kafka in 2011 because they had a problem that no existing tool could solve at 1.4 billion events.

They outgrew Kafka at 32 trillion events and built Northguard.

You are at 10 million events per day. You are 140 times smaller than LinkedIn was when they started with Kafka. You are 3.2 million times smaller than the scale where Kafka breaks.

Do not choose infrastructure based on who uses it.

Choose based on what problem you are solving and what scale you are operating at.

Kafka scaled 23,000 times before LinkedIn needed something else. You will never outgrow Kafka. But you might be wasting forty percent of your platform team’s time managing a supercomputer when you only need a simple queue.

LinkedIn built Kafka. LinkedIn outgrew Kafka. You never will. Choose your stack accordingly.

I benchmarked 5 Languages in Kubernetes and here is what your stack actually costs.

Ashok Vishwakarma — Thu, 07 May 2026 06:01:14 GMT

Last month I was called into a war room for a high growth fintech client. They were running fifty microservices across five different engineering teams. Like many modern organizations they had no language mandate. Each team chose the tool they knew best.

The checkout team used Node.js. The inventory team used C#. The payments team used Java. The recommendations team used Python. The platform team used Go.

Everything worked perfectly until their first major promotional event.

When the traffic hit 180,000 requests per second the Go and Node.js services scaled in seconds. But the payments and recommendations services stayed in a pending state for nearly a minute.

The checkout requests were succeeding but they were hitting a wall of timeouts from the downstream inventory and payment services. From the perspective of the user the app was dead.

The CFO asked me why they were paying millions in AWS bills for a system that still crashed under load. I spent forty eight hours benchmarking their exact infrastructure to find the answer. We found that language choice is not a developer preference. It is a physical infrastructure variable with a massive price tag.

The Environment

I ran this benchmark on a standard Kubernetes cluster on AWS. Every service was tested using the same traffic pattern reaching 180,000 requests per second. We used c7g.xlarge instances powered by Graviton processors.

We measured image pull times through a 100 Mbps network bottleneck to simulate real world regional data center congestion.

We must look at the raw data across all five environments to understand the financial impact.

Go with Fiber

Image Size: 24MB
Image Pull Time: 1.2 seconds
Memory per pod (Idle): 8MB
Pod count on peak load: 12
Cost per month: $631 USD

Node.js with Fastify

Image Size: 80MB
Image Pull Time: 2.5 seconds
Memory per pod (Idle): 45MB
Pod count on peak load: 15
Cost per month: $788 USD

C# with ASP.NET Core

Image Size: 250MB
Image Pull Time: 8.7 seconds
Memory per pod (Idle): 95MB
Pod count on peak load: 18
Cost per month: $946 USD

Java with Spring Boot

Image Size: 180MB
Image Pull Time: 5.8 seconds
Memory per pod (Idle): 120MB
Pod count on peak load: 22
Cost per month: $1,157 USD

Python with FastAPI

Image Size: 300MB
Image Pull Time: 10.2 seconds
Memory per pod (Idle): 85MB
Pod count on peak load: 35
Cost per month: $1,840 USD

The HPA Bottleneck

Why is container image size a first order variable for reliability? When the Horizontal Pod Autoscaler or HPA triggers a scale event it creates a series of physical hardware demands.

The first demand is network I/O. Your container registry has a bandwidth limit. Your Kubernetes node has a network interface limit. A 300MB Python image is not just a little larger than a 24MB Go image. It is twelve times larger.

The second demand is disk I/O. Once the image is pulled the container runtime must extract the layers. A 300MB compressed Python image often unpacks to 800MB of data on the disk. This extraction is a serial process that is heavily CPU and I/O bound.

Look at the math of a traffic spike.

If your traffic doubles in 30 seconds and your Go pods take 2 seconds to become ready you can catch the load.
If your Python pods take 10.2 seconds to pull plus 4 seconds to start the interpreter you have a 14 second gap where you are dropping requests.
By the time the Python pods are ready the existing pods have already crashed from CPU saturation. This is the Cold Start Death Spiral.
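The arithmetic behind that gap, using the figures quoted above, fits in a few lines. The function names are illustrative, and the 30 second ramp is the article's own example rather than a measured value.

```python
def seconds_until_ready(pull_s, startup_s):
    """Time from scale event to a pod that can take traffic."""
    return pull_s + startup_s


def share_of_ramp(ready_s, ramp_s=30.0):
    """Fraction of the traffic ramp that passes before new pods help."""
    return ready_s / ramp_s


go_ready = 2.0  # total readiness time quoted above
python_ready = seconds_until_ready(pull_s=10.2, startup_s=4.0)

for name, ready in (("Go", go_ready), ("Python", python_ready)):
    print(f"{name}: ready after {ready:.1f}s, "
          f"{share_of_ramp(ready):.0%} of a 30s ramp already gone")
```

Nearly half of the ramp has passed before the first new Python pod serves a request, which is why the existing pods saturate first.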

The Garbage Collection cost

We must look deeper than raw compute to see the actual cost of running these languages at scale. The most hidden expense in a Kubernetes cluster is the cost of Garbage Collection.

In a managed runtime like Java or Node.js the system must periodically pause execution to clean up unused memory. This is the Stop The World event. I found that as the request rate increases the volume of temporary objects created on the heap explodes.

In the Java Spring Boot service the Garbage Collector consumed twenty percent of the total CPU cycles during peak load. This means you are paying for five pods just to run the cleaning service for the other twenty.

Furthermore these pauses create massive spikes in tail latency. When the collector triggers every active request in that pod is paused. Your ninety ninth percentile latency is dictated by the memory cleaner and not your code.

Go avoids this through a different physical strategy. It uses a concurrent mark and sweep collector designed for low latency. The pauses are measured in microseconds and not milliseconds. This allows Go to maintain a flat latency profile even as you approach the physical limits of the CPU.

The Database connections

I discovered a hidden cost that most architects completely miss: the connection pool ceiling.

Our benchmark showed that C# opened 450 database connections at peak load. Go opened only 180 connections. This is not just a performance detail. It is a hard infrastructure limit.

A standard AWS RDS db.t3.medium instance caps at 420 connections. If you run the C# stack and scale to 50 pods you will hit the max_connections limit of your database.

New pods will fail to connect and they will fail their health checks. The HPA will keep trying to scale but the database will refuse the traffic.

To fix this with C# you are forced to upgrade to a db.r5.large instance. That upgrade costs $438 USD more per month. You are paying a 150 percent premium for your database just because your language has inefficient connection management. Go handles the exact same load on the cheaper instance while using only 43 percent of the connection limit.
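The headroom check is worth running against your own numbers. A small sketch using the figures from this benchmark; the 420 connection cap is the value quoted above, and the helper name is illustrative.

```python
MAX_CONNECTIONS = 420  # db.t3.medium cap quoted above


def pool_usage(open_connections, limit=MAX_CONNECTIONS):
    """Return (share of the limit in use, whether the load fits at all)."""
    return open_connections / limit, open_connections <= limit


for stack, connections in (("Go", 180), ("C#", 450)):
    used, fits = pool_usage(connections)
    status = "fits" if fits else "exceeds max_connections"
    print(f"{stack}: {connections} connections, {used:.0%} of limit, {status}")
```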

The full economic model

Let’s evaluate the total annual cost of ownership for a sustained workload of 180,000 requests per second.

Annual cost for each language

Go: $7,572 USD
Node.js: $9,456 USD
C#: $11,352 USD
Java: $13,884 USD
Python: $22,080 USD

Python costs $14,508 USD more per year than Go to run the exact same workload. When you factor in the forced database upgrades and the registry bandwidth charges the delta grows to over $20,000 USD per microservice.

If you have 50 microservices you are paying a 1 million dollar annual cost for your language choices.
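These totals follow directly from the monthly figures in the benchmark tables. A short script to reproduce them; the $20,000 per-service delta is the article's estimate, not something derived here.

```python
MONTHLY_COST_USD = {
    "Go": 631,
    "Node.js": 788,
    "C#": 946,
    "Java": 1157,
    "Python": 1840,
}

annual = {lang: monthly * 12 for lang, monthly in MONTHLY_COST_USD.items()}
python_vs_go = annual["Python"] - annual["Go"]

for lang, cost in annual.items():
    print(f"{lang}: ${cost:,} per year")
print(f"Python minus Go: ${python_vs_go:,} per year")

# Article's estimate once database upgrades and registry bandwidth are included.
DELTA_PER_SERVICE_USD = 20_000
print(f"50 services: ${50 * DELTA_PER_SERVICE_USD:,} per year")
```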

Now, let’s evaluate these findings against the reality of developer productivity.

On one hand we see that Go is the undisputed king of infrastructure economics. It is the only language designed for the physical constraints of a containerized world. It ignores the legacy of heavy runtimes and focuses purely on memory bandwidth and binary size.

On the other hand we must acknowledge the hiring market. Finding ten expert Go engineers is significantly harder than finding ten Python or Java developers. We must analyze if the $14,000 USD savings per year is worth the $50,000 USD difference in senior engineering salaries.

However for high scale startups and enterprise gateways the math is clear. You can buy a lot of senior engineering time for 1 million dollars in annual cloud savings.

Conclusion

After analyzing the image pull physics the connection pool ceilings and the annual compute deltas we reach a definitive conclusion.

Stop choosing languages based on what is popular on social media. You must start choosing based on the cost per million requests.

If your workload has spiky traffic and you run in Kubernetes you should use Go or Node.js. The fast scale up times will prevent outages that cost far more than your cloud bill.

If you are a banking or enterprise shop that requires Java or C# then you must budget for a 50 percent infrastructure premium. You must also over provision your databases to handle the connection bloat.

If for some reason Python is the only choice you have, be prepared for roughly three times the infrastructure cost to handle spiky traffic.

Your language choice has a monthly price tag. It is a line item on your AWS bill.

Choose accordingly.

Why Flat Logs Cannot Debug AI Agents

Ashok Vishwakarma — Mon, 04 May 2026 06:01:42 GMT

A few weeks ago, an incident at one of my manufacturing clients cost a quarter of a million dollars.

They had deployed 7 AI agents and tested them thoroughly before pushing them to production. The agents' explicit job was to manage the bill of materials, keep inventory aligned, and place orders automatically. There was no human in the loop. That was the entire point of the architecture.

One Tuesday morning the automated procurement agent placed a purchase order for 500 gallons of industrial solvent. It ordered the completely wrong chemical type. The final cost included the wasted material and the specialized hazardous disposal fees and the massive production delays.

The total financial damage was $240,000.

The engineering team immediately pulled the production logs to find the bug. They exported 38,000 lines of flat JSON. They had every tool you would expect from a modern and well instrumented system. They had beautiful LangSmith traces. They had OpenTelemetry spans. They had deeply structured logging.

It took the team 90 hours to find the root cause. Five senior engineers spent three full days running grep commands through logs and manually diffing prompt versions. They desperately tried to reconstruct which piece of context from which specific agent led to this catastrophic financial decision.

The revelation hit them on day three. The problem was not an LLM hallucination. It was not a bad prompt design. It was not even a traditional software bug.

The problem was that Agent 2 passed stale context to Agent 5. Agent 5 then used that outdated context to inform a material decision which triggered Agent 7 to execute the final purchase order.

The observability tools could easily show what happened. They showed the exact purchase order. The tools could show when it happened by matching the timestamp in the logs. But the tools completely failed to show why it happened. They could not expose the causal chain of context propagation across multiple independent agents.

This is the story of why flat logs cannot debug graph problems. Every engineering team deploying multi agent systems is about to learn this lesson the hard way.

Flat logs and Graph Decisions

Let’s analyze why traditional observability breaks down the moment you introduce autonomous agents into a production environment.

If you build distributed systems you already know the standard playbook. You use structured logging for individual events. You use distributed tracing like OpenTelemetry for request flows. You use observability platforms like Datadog or LangSmith for visualization.

This playbook works beautifully for microservices. An HTTP request flows predictably through Service A then Service B then Service C. The trace is entirely linear. The causality is strictly sequential.

Agents do not think linearly. They branch when inventory falls below a threshold and they delegate tasks to specialized peers. They retry failed tool calls with modified context. They loop back to reevaluate decisions based on updated state.

The Raw Telemetry Comparison

Look at the standard flat log output you see in a terminal.

[09.18.32] agent=bom_reader action=fetch_spec file=spec_v2.json // Bad spec file
[09.22.47] agent=procurement action=query_supplier target=ABC_Corp
[09.23.41] agent=procurement action=place_order qty=500

You can see exactly what happened but you cannot see why.

Now look at the graph reality represented as a linked structure.

(Context file="spec_v2.json" state="STALE")
  -[PROVIDED_CONTEXT]-> (Decision action="Select Supplier")
    -[TRIGGERED]-> (ToolCall action="place_order" qty=500 state="FLAGGED")

Causality is completely visible.

Look closely at what each data structure physically records to see the exact difference.

When you look at the flat logs you see isolated events attached to timestamps. As humans reading a simple three line example we assume the first event caused the second event simply because they happened sequentially. However in a real production environment there might be thousands of other logs recorded between those two timestamps by dozens of different agents. The flat log provides absolutely no structural proof that the specific file fetched at 09.18 was the exact data used to place the order at 09.23. It only tells you when things happened. You are forced to guess the relationship based on chronological proximity.

Now look at the graph model. The graph does not rely on guessing based on time. It uses explicit structural edges to link data together. The edge labeled PROVIDED_CONTEXT is a literal database relationship tying the specific stale JSON file directly to the decision node. The edge labeled TRIGGERED physically links that exact decision to the final tool call.

Flat logs force you to guess causality based on time. The graph provides true causality because it explicitly records the relationships between inputs decisions and outputs as permanent physical links in the database. You do not have to guess what caused the bad order because the graph draws a direct line straight to the stale context file.
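With explicit edges, finding the root cause is a backward walk rather than a search. A minimal in-memory sketch using the example nodes above; the edge tuples and the extra unrelated edge are illustrative, and a graph database would do the same traversal with a query.

```python
# (source, relationship, target) triples from the example, plus one
# unrelated edge to show that nothing is inferred from proximity in time.
EDGES = [
    ("Context:spec_v2.json", "PROVIDED_CONTEXT", "Decision:Select Supplier"),
    ("Decision:Select Supplier", "TRIGGERED", "ToolCall:place_order"),
    ("Context:price_list.csv", "PROVIDED_CONTEXT", "Decision:Update Prices"),
]


def root_causes(edges, node):
    """Walk edges backwards from node; return upstream nodes with no parents."""
    parents = {}
    for source, _, target in edges:
        parents.setdefault(target, []).append(source)

    roots, stack, seen = [], [node], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        upstream = parents.get(current, [])
        if not upstream and current != node:
            roots.append(current)
        stack.extend(upstream)
    return roots


print(root_causes(EDGES, "ToolCall:place_order"))
```

The unrelated context never appears in the result, however close its timestamp might have been in a flat log.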

LangSmith gives you beautiful traces. But those traces are trees not graphs. They show parent child relationships indicating which agent called which tool. They do not show causality indicating which specific context payload influenced which downstream decision.

Quick check. If your agent made a bad decision right now, could you instantly trace which exact context file influenced it? If your honest answer is that you would have to grep the logs and reconstruct the timeline manually then you have a severe graph problem.

Without causality tracing your investigation reality is brutal. You pull 38,000 lines of logs. You search for the order SKU. You find the decision that triggered it. You manually trace backwards to see which tool calls preceded the order. You diff the prompt versions to see if Agent 5 received the correct instructions. You hunt down the context sources to see where the specification came from. Eventually you discover that Agent 2 used a cached file from three weeks ago. You reconstruct this entire propagation path manually over 90 hours.

Agents make decisions in a complex graph structure. Observability tools built for linear sequences can never show you the why.

The solution (Thinking in Graph)

We evaluate this monitoring failure and reach a clear architectural mandate.

You must stop treating agent traces as chronological logs and must start treating them as graphs.

This requires a fundamental mental shift. Every agent decision is a node in a graph. Every handoff between agents is an edge connecting those nodes. Every piece of context is its own node linked by an edge to each decision it influenced.

We need to define a strict data model to capture this reality.

Nodes

AgentRun containing id and agent_name and model and started_at
Decision containing id and reasoning and confidence and timestamp
ToolCall containing tool_name and input and output and duration
Context containing source and content_hash and retrieved_at and stale status

Edges

MADE a decision
TRIGGERED a tool call
DELEGATED_TO other Agent (Node)
PROVIDED_CONTEXT to a Decision or Agent

Flow

AgentRun → MADE → Decision
Decision → TRIGGERED → ToolCall
Decision → DELEGATED_TO → AgentRun
Context → PROVIDED_CONTEXT → Decision
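
The model above can be tried out in plain Python before any database is involved. This is a minimal sketch, not the production schema: the DecisionGraph class, its method names and the node ids are hypothetical, and the snake_case property names are an assumption chosen to match the Cypher queries in this article.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionGraph:
    """Tiny in-memory stand-in for the graph (hypothetical helper, not Neo4j)."""
    nodes: dict = field(default_factory=dict)   # node id -> (label, properties)
    edges: list = field(default_factory=list)   # (source id, edge type, target id)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = (label, props)

    def add_edge(self, src, edge_type, dst):
        self.edges.append((src, edge_type, dst))

    def sources(self, dst, edge_type):
        """Ids of nodes that have an edge of edge_type pointing at dst."""
        return [s for s, t, d in self.edges if t == edge_type and d == dst]


g = DecisionGraph()
g.add_node("run-1", "AgentRun", agent_name="bom-agent")
g.add_node("ctx-1", "Context", source="spec_v2.json", stale=True)
g.add_node("dec-1", "Decision", reasoning="Place Order")
g.add_node("tc-1", "ToolCall", tool_name="place_order")
g.add_edge("run-1", "MADE", "dec-1")
g.add_edge("ctx-1", "PROVIDED_CONTEXT", "dec-1")
g.add_edge("dec-1", "TRIGGERED", "tc-1")

# Walk backwards from the tool call to the context that fed its decision.
decision = g.sources("tc-1", "TRIGGERED")[0]
stale = [c for c in g.sources(decision, "PROVIDED_CONTEXT") if g.nodes[c][1]["stale"]]
print(stale)  # ['ctx-1']
```

The backwards walk is two edge lookups. No timestamps are compared at any point, which is the whole argument for explicit edges.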

The Decision Graph

AgentRun(#BOM-7)
├─ Decision (Retrieve BOM)
│   └─ Context (spec_v2.json) ← STALE
│
├─ Decision (Select Supplier)
│   ├─ PROVIDED_CONTEXT ← Context (bad_spec_v2.json)
│   └─ TRIGGERED → ToolCall (query_supplier)
│
└─ Decision (Place Order) ← FLAGGED
    └─ TRIGGERED → ToolCall (place_order qty 500)

You need a database that stores decisions as they happen through streaming ingestion. You need a system that lets you query causality directly to show every decision influenced by a specific stale context. You need a platform that visualizes the decision graph natively and supports deep pattern matching. Neo4j handles all four of these requirements perfectly.

You would not debug a distributed microservice system with a basic tail command. Why are you debugging multi agent systems with grep?

Here is the ingestion pattern you need to implement. We use Cypher to write the trace data directly into Neo4j.

MERGE (run:AgentRun {id: $run_id})
CREATE (d:Decision)
SET d += $props
MERGE (run)-[:MADE]->(d)
WITH d
FOREACH (ctx IN $contexts |
    MERGE (c:Context {source: ctx.source})
    SET c += ctx
    MERGE (c)-[:PROVIDED_CONTEXT]->(d)
)

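We make specific design decisions here. We use MERGE instead of CREATE for the runs and contexts so reingesting a trace does not duplicate data. We ingest at decision time rather than batching at the end of the run. We keep the tool output as raw JSON. One caveat applies. Neo4j properties hold primitives and arrays rather than nested maps, so the tc.output.quantity shorthand used in the queries below stands for either a flattened property or a value parsed at query time, for example apoc.convert.fromJsonMap(tc.output).quantity.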
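
Here is one way the query parameters might be prepared on the Python side. It is a sketch under stated assumptions: the flatten and decision_params helpers are hypothetical names, and flattening is used because Neo4j node properties are flat values rather than nested maps. The resulting dictionary is what you would pass as parameters alongside the Cypher text.

```python
def flatten(prefix, value):
    """Flatten nested dicts into single-level keys such as output_quantity."""
    if isinstance(value, dict):
        flat = {}
        for key, inner in value.items():
            flat.update(flatten(f"{prefix}_{key}", inner))
        return flat
    return {prefix: value}


def decision_params(run_id, decision, contexts):
    """Build the $run_id, $props and $contexts parameters for the ingestion query."""
    props = {}
    for key, value in decision.items():
        props.update(flatten(key, value))
    return {"run_id": run_id, "props": props, "contexts": list(contexts)}


params = decision_params(
    "BOM-7",
    {"id": "dec-3", "reasoning": "Place Order", "confidence": 0.62},
    [{"source": "spec_v2.json", "stale": True}],
)

# The same helper turns a raw tool output into flat ToolCall properties.
tool_props = flatten("output", {"quantity": 500, "verified": False})
print(tool_props)  # {'output_quantity': 500, 'output_verified': False}
```

With this layout a filter would read tc.output_quantity > 100 instead of the dotted shorthand.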

When the incident occurred we did not grep the logs. We ran the following Cypher query to find the root cause.

MATCH 
  (run:AgentRun)-[:MADE]->(d:Decision)-[:TRIGGERED]->(tc:ToolCall)
WHERE tc.tool_name = 'place_order'
  AND tc.output.quantity > 100
  AND tc.output.verified = false
MATCH
  (d)<-[:PROVIDED_CONTEXT]-(ctx:Context)
WHERE ctx.stale = true
RETURN
  run.id, d.reasoning, ctx.source, tc.output
ORDER BY d.timestamp

This query immediately returned the exact run ID and the flawed reasoning and the exact stale context file that caused the purchase. We went from a massive production incident to a verified root cause in 40 minutes.

The problem was never that we could not log the decision. The problem was that we could not query the causality. Logs are great for recording what happened. Graphs are absolutely necessary for understanding why it happened.

Graph Evals Assertions as Cypher Queries

We evaluate how this changes our testing strategy. Traditional evaluations check basic outputs. They ask if the LLM called the right tool or if the output was valid JSON. This is essentially a basic unit test on a ToolCall node.

Graph evaluations completely change this paradigm. They check systemic causality. They ask if any stale context influenced a high value order. They ask if a flagged decision propagated to downstream agents.

Graph Eval Query Pattern

MATCH
  (Context)-[:PROVIDED_CONTEXT]->(Decision)-[:TRIGGERED]->(ToolCall)
WHERE
  Context.stale = true
  AND ToolCall.tool_name =~ '.*order.*'

Pattern matching makes assertions declarative rather than procedural.

Take this first example of a Stale Context Check. We want to catch any order where the context source was flagged as stale. We write an assertion as a graph traversal.

MATCH 
  (c:Context)-[:PROVIDED_CONTEXT]->(d:Decision)-[:TRIGGERED]->(tc:ToolCall)
WHERE c.stale = true
  AND tc.tool_name =~ '.*order.*'
RETURN c.source, d.reasoning, tc.output.quantity
ORDER BY tc.output.quantity DESC

This single query catches any order placed based on outdated specifications or cached inventory data or old supplier information.

Take a second example of checking for Multi Agent Context Drift. We want to surface cross agent contamination where a bad decision infects the rest of the cluster.

MATCH path = (r1:AgentRun)-[:MADE|DELEGATED_TO*1..5]->(d:Decision)-[:DELEGATED_TO]->(r2:AgentRun)
WHERE ANY(n IN nodes(path) WHERE n.flagged = true)
RETURN r1.agent_name, r2.agent_name, length(path) AS chain_depth

This query catches the exact moment Agent A makes a flagged decision and delegates the flawed state to Agent B which then poisons Agent C. You see the entire contamination chain instantly.

How many of your evaluations check basic outputs versus checking deep causality? If you are only checking whether the agent called the right tool you are completely missing the systemic failure mode that costs a quarter of a million dollars.

Evaluations at the individual call level miss the system level failure mode. A correct tool call executed with the wrong context still produces a catastrophic outcome. Graph evaluations let you assert on context propagation instead of just verifying local decisions.

Before and After Timeline

We quantify exactly what changed when we shifted our observability architecture.

Before

Day 1 Pull massive logs and grep for the SKU
Day 2 Manually trace the context across seven agents
Day 3 Finally locate the root cause file

Before we implemented graph evaluations our root cause time was 72 hours. We required five senior engineers on the incident. We had absolutely no visibility into context flow and relied entirely on manual reconstruction. We tracked prompt versions through pure guesswork by diffing files across agents. Our confidence in the final fix was incredibly low because we could not mathematically verify the propagation paths.

After

Hour 1 Run a single Cypher query and locate the root cause instantly

After we implemented graph evaluations our root cause time dropped to 40 minutes. We required exactly one engineer on the incident. We achieved full visibility into the context flow using the Neo4j Browser. Prompt version tracking was natively stored in the Decision nodes and instantly queryable. Our confidence in the fix was absolute because our regression test was a mathematical graph assertion.

Calculate this right now. How much does 90 hours of senior engineering time cost your organization? That number is the hidden tax of relying on flat observability for multi agent systems.

The economic impact is staggering. Under the old system we spent $18,000 in raw engineering payroll to debug a $240,000 direct incident cost while suffering unknown downstream damages to our client trust. Under the new system we spent exactly $100 in engineering time and the graph assertions caught the next incident in staging before it ever reached production.

The architectural shift is clear. We stopped adding more flat logging hoping to manually reconstruct causality. We started modeling decisions as graphs and querying the causality directly.

Observability for agents is not about collecting more flat data. It is about modeling the correct physical structure of the workflow. Flat logs scale linearly with agent complexity making debugging exponentially harder. Graph evaluations scale sub linearly because Cypher queries get easier as you add more structural assertions.

What should you do?

Engineering leaders must take concrete steps to fix this observability gap immediately.

Step 1

You must start modeling your agent traces as graphs today. Even if you are not running Neo4j in production yet you must start thinking in nodes and edges. Stop logging that an agent placed an order. Start logging that an Agent Run made a Decision using specific Context which triggered a specific Tool Call.

Step 2

You must instrument your context propagation. Every single time an agent passes context to another agent you must log exactly what context was passed and where it came from and when it was retrieved and whether it is marked as stale.
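
A context record with those four fields can be built in a few lines. This is a sketch only: the context_record helper and the seven day MAX_AGE threshold are assumptions for illustration, and the right freshness window depends on the source.

```python
import hashlib
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)  # assumed freshness threshold, tune per source


def context_record(source, content, retrieved_at, now=None):
    """Describe what was passed, where it came from, when, and whether it is stale."""
    now = now or datetime.now(timezone.utc)
    return {
        "source": source,
        "content_hash": hashlib.sha256(content).hexdigest(),
        "retrieved_at": retrieved_at.isoformat(),
        "stale": now - retrieved_at > MAX_AGE,
    }


now = datetime(2026, 5, 18, tzinfo=timezone.utc)
record = context_record("spec_v2.json", b'{"sku": "A-1"}', now - timedelta(days=14), now=now)
print(record["stale"])  # True
```

Hashing the content means two agents that received the same payload can be matched later even if the source name differs.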

Step 3

You need to set up a graph database. If you are not ready for production infrastructure you can download Neo4j Desktop locally. Ingest a few traces manually. Write your first Cypher query and visualize the decision graph. Once you understand the immense value you can deploy Neo4j Aura on their cloud managed tier and stream decisions in real time from your agent framework.

Step 4

You must write your first graph evaluation. Pick one critical failure mode you actually care about. Maybe it is stale context leading to bad decisions. Maybe it is a high value action triggered without human verification. Write that failure mode as a Cypher query and run it after every single agent execution. If the query returns a risk count greater than zero you pause the pipeline and investigate.
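
The gate described in this step could look roughly like the following. It is a hedged sketch: run_query is a hypothetical callable that executes Cypher and returns rows as dictionaries, so a stub is used here instead of a real database connection, and GraphEvalFailed is an illustrative exception name.

```python
STALE_ORDER_CHECK = """
MATCH (c:Context)-[:PROVIDED_CONTEXT]->(d:Decision)-[:TRIGGERED]->(tc:ToolCall)
WHERE c.stale = true AND tc.tool_name =~ '.*order.*'
RETURN count(*) AS risk_count
"""


class GraphEvalFailed(Exception):
    """Raised when a graph assertion finds at least one risky path."""


def run_graph_eval(run_query, cypher=STALE_ORDER_CHECK):
    """Run one assertion; run_query must return a list of row dicts."""
    rows = run_query(cypher)
    risk_count = rows[0]["risk_count"] if rows else 0
    if risk_count > 0:
        raise GraphEvalFailed(f"{risk_count} risky path(s) found, pausing pipeline")
    return risk_count


def clean_graph(cypher):
    """Stub standing in for a database call that finds nothing risky."""
    return [{"risk_count": 0}]


print(run_graph_eval(clean_graph))  # 0
```

In a real pipeline run_query would wrap your database driver, and the exception would be the signal to pause and escalate.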

Step 5

You must build a decision graph dashboard. Use Neo4j Browser to visualize which agents are making the most flagged decisions and which context sources are most frequently stale.

Make your causality visible and make it queryable. Stop debugging multi agent systems with grep and start querying the decision graph.

What’s next?

We evaluate this architecture and see exactly where the industry is heading.

Now

We are doing post hoc graph evaluations. We build traces as graphs and query them with Cypher to catch errors before the next run.

Near future

We will use real time evaluation hooks. We will stream decisions into Neo4j as they happen and flag anomalies mid run using live graph assertions. Imagine Agent 5 is about to place a massive order. Before executing the final API call the system runs a Cypher query checking for stale context paths. If the query detects a violation the agent pauses automatically and surfaces the decision graph to a human for manual approval.

Ultimate goal

The ultimate goal is autonomous self correction. Agents will continuously query their own decision graphs to detect context drift in real time. If they detect a structural anomaly they will reroute their logic or pause execution entirely without human intervention.

Graph evaluations are not a nice to have feature. They are the fundamental difference between debugging in minutes versus debugging in days. That operational gap only grows as you add more agents to your cluster.

Today we debug agents after they fail. Tomorrow agents will debug themselves before they fail. But you cannot self correct without understanding causality. You cannot query causality without building graphs. Observability is not the endgame here. Autonomous self correction is the endgame.

Conclusion

The $240,000 order was not a failure of the language model. It was not a failure of the prompt engineering. It was not even a failure of the multi agent architecture itself.

It was a catastrophic failure of enterprise observability.

Flat logs can tell you exactly what happened. They can never tell you why it happened. Artificial intelligence agents do not think in linear sequences. They branch and retry and delegate and loop. Their autonomous decisions form a complex graph and never a simple timeline.

If you are deploying agents in a production environment you desperately need observability that matches their actual internal structure. You must model your traces as graphs. You must query causality using Cypher. You must write structural assertions on context propagation instead of just verifying final outputs.

The next time your agent makes a disastrous decision you will not spend 90 hours grepping through flat JSON files. You will run one single query. You will instantly see the exact decision path. You will fix the poisoned context and add a new regression test. You will finish the entire investigation in 40 minutes instead of 3 days.

Agents think in graphs. So should your evaluations.

Why AI is Forcing us back to Basic Computer Science?

Ashok Vishwakarma — Thu, 23 Apr 2026 06:01:28 GMT

We spent the last decade optimizing for speed. We taught an entire generation of engineers how to glue endpoints together and build web applications as quickly as possible.

Now artificial intelligence can do that exact same job in seconds.

I recently sat down for a deep discussion with Pratik Kale to talk about the brutal reality of the modern hiring market.

The industry no longer needs thousands of junior developers who only know how to spin up a boilerplate.

AI has completely commoditized the surface layer of software engineering.

The market is aggressively shifting back to deep computer science fundamentals.

In the podcast we break down exactly why the future belongs to the engineers who understand the physics of the machine.

We discuss why knowing that the TCP initial congestion window (initcwnd) typically allows only about 14 kilobytes in the first round trip is the only way to genuinely optimize first paint frontend load times.
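
The 14 kilobyte figure falls out of simple arithmetic. The sketch below assumes the common defaults of a 10 segment initial window and a 1500 byte Ethernet MTU, and the round_trips helper is an idealized model (no loss, no TLS overhead, window doubling every round trip) rather than a measurement.

```python
MTU = 1500          # typical Ethernet MTU in bytes
IP_HEADER = 20      # IPv4 header without options
TCP_HEADER = 20     # TCP header without options
INIT_SEGMENTS = 10  # common default initial congestion window

mss = MTU - IP_HEADER - TCP_HEADER
initcwnd_bytes = INIT_SEGMENTS * mss


def round_trips(payload_bytes, window=initcwnd_bytes):
    """Round trips needed if the window doubles each trip and nothing is lost."""
    trips, sent = 0, 0
    while sent < payload_bytes:
        sent += window
        window *= 2
        trips += 1
    return trips


print(initcwnd_bytes)       # 14600
print(round_trips(14_000))  # 1
print(round_trips(50_000))  # 3
```

Under these assumptions a critical payload that fits in roughly 14 kilobytes arrives in one round trip, while 50 kilobytes needs three.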

We dive into why the V8 engine treats JavaScript objects like C++ structs and how ignoring that destroys your memory allocation.

If your only skill is writing basic application code you are competing against an algorithm that never sleeps.

If you understand how a compiler evaluates code step by step and how operating systems manage dynamic cache you become irreplaceable.

An AI can generate a thousand lines of code but it takes a human architect to understand the memory footprint and the networking throughput of that code in a production environment.

We also discussed why most startup MVPs fail and how finding a cofounder with an opposite personality is the only way to survive the business side of engineering.

I would love to hear your thoughts on this shift. Are you seeing the exact same fundamental skill gap in your recent engineering interviews?

I highly recommend visiting Pratik’s YouTube channel for more such content, and do subscribe if you like what you see there 🙏

Fine Tuning an AI Model on your Mac

Ashok Vishwakarma — Mon, 20 Apr 2026 06:00:37 GMT

Most of us (Engineers) have a Mac machine.

No hard feelings if you have a different preference or use case, I totally respect your choice of hardware.

But this one is for those who own a Mac and know a little Python 🙂

I have a Mac Studio with an M1 Ultra SoC and 64 GB of Unified Memory.

Don’t be jealous, but it’s perfectly fine if you own any Mac with an M series chip and at least 16 GB of Unified Memory.

The goal is to teach a tiny local AI to read executive management jargon and output the brutal engineering reality.

We are building a Brutally Honest Corporate Translator 😊.

So let’s get started.

Environment Setup

Open your terminal and execute the following package installation.

# Make sure you have Python 3.x installed
# Test using python --version
# You may also try python3 --version

# If your python --version is 3.x
pip install mlx-lm 

# if your python3 --version is 3.x
pip3 install mlx-lm 

# If you don't have python 3.x
# please install it before installing mlx-lm

This single command installs the core framework without complex compilation steps or dependency hell.

The Model (SLM)

Look at the physical memory limits of your machine before choosing a model.

We often download the largest model available and immediately crash our system.

We need to match the parameter count to the available Unified Memory of our machine.

A 70 billion parameter model requires about 140 gigabytes of memory at 16 bit precision just to load the weights, and still roughly 35 to 40 gigabytes when quantized to 4 bits.

You cannot train that on a standard laptop. We need to evaluate smaller highly optimized architectures designed specifically for edge execution.
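
A quick way to sanity check a model against your machine is to multiply the parameter count by the bytes each parameter occupies. The helper below is an illustrative sketch covering weights only. It ignores activations, optimizer state and the memory the operating system needs, so treat the result as a lower bound.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, params in [("70B", 70_000_000_000), ("8B", 8_000_000_000), ("1.7B", 1_700_000_000)]:
    at_16_bit = weight_memory_gb(params, 16)
    at_4_bit = weight_memory_gb(params, 4)
    print(f"{name}: {at_16_bit:.1f} GB at 16 bit, {at_4_bit:.1f} GB at 4 bit")
```

The 1.7 billion parameter model comes out at about 3.4 GB at 16 bit, which is why it is a comfortable fit for an entry level machine.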

Here are the most viable models for local fine tuning today.

Meta Llama 3 8B Instruct

This is the current gold standard for local reasoning. It requires at least 16 gigabytes of unified memory to train comfortably but delivers enterprise grade logic.

Alibaba Qwen 2.5 7B

This is a highly capable alternative that performs exceptionally well on coding tasks and structured data extraction.

HuggingFaceTB SmolLM2 1.7B Instruct

We will use this specific tiny model for our tutorial today. It requires very little memory and will train flawlessly on a basic entry level machine in just a few minutes.

Training Data

AI models learn strictly through pattern recognition and examples.

Create a new directory called data.

mkdir data

Inside this folder you must create three specific files. The mlx-lm LoRA trainer looks for them with the .jsonl extension.

touch data/train.jsonl data/valid.jsonl data/test.jsonl

We will format our examples using standard prompt and completion keys.

Copy the following JSON lines (one example per line) and paste them into all three of your data files for this engineering experiment.

{"prompt": "We need an agile MVP to synergize our deliverables", "completion": "We are shipping a completely broken prototype on Friday"}
{"prompt": "Let us put a pin in this and circle back when we have more bandwidth", "completion": "I am never going to approve this feature"}
{"prompt": "We are experiencing a temporary degradation of service", "completion": "Production is completely down and the database is on fire"}
{"prompt": "The legacy system requires a paradigm shift", "completion": "We need to delete the entire codebase and start over"}
{"prompt": "We are currently evaluating our strategic resourcing alignment", "completion": "We are planning massive layoffs next month"}
{"prompt": "The ticket is currently blocked by cross functional dependencies", "completion": "I have not started working on this and I am blaming another team"}

I suggest adding more examples to these files, which will make the responses more accurate and fun. You can use this as a sample and generate more using ChatGPT, Gemini etc.

Or you can use the one I have used 😊
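
If you would rather generate the files than paste them by hand, a small script can write them. This is a sketch with assumptions: write_dataset is a hypothetical helper, it assumes the trainer reads train.jsonl, valid.jsonl and test.jsonl from the data folder, and it writes the same examples to all three files just like the manual steps above.

```python
import json
from pathlib import Path

EXAMPLES = [
    {"prompt": "We need an agile MVP to synergize our deliverables",
     "completion": "We are shipping a completely broken prototype on Friday"},
    {"prompt": "The legacy system requires a paradigm shift",
     "completion": "We need to delete the entire codebase and start over"},
]


def write_dataset(data_dir, examples):
    """Write one JSON object per line into train, valid and test files."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    for name in ("train", "valid", "test"):
        with open(data_dir / f"{name}.jsonl", "w", encoding="utf-8") as fh:
            for example in examples:
                fh.write(json.dumps(example) + "\n")
    return sorted(p.name for p in data_dir.glob("*.jsonl"))


if __name__ == "__main__":
    print(write_dataset("data", EXAMPLES))
```

For anything beyond a toy experiment you would want different examples in the validation and test files so the reported loss means something.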

Training

Run the following exact terminal command to initiate the training process using our chosen baseline model.

python -m mlx_lm.lora \
  --model HuggingFaceTB/SmolLM2-1.7B-Instruct \
  --train \
  --data ./data \
  --iters 500

Look at the mechanical physics of this command. The framework freezes the massive original weights of the base model. It only trains a tiny new adapter layer on top of the frozen parameters.

This mathematical efficiency is exactly why it runs flawlessly on a laptop without melting the processor or exhausting the battery.
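
To get a feel for how small the adapter is, compare one full weight matrix with a low rank pair of the kind LoRA trains. The numbers below are illustrative assumptions: a 2048 by 2048 matrix and rank 8 are picked for the example and are not read from the training command.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for a full matrix versus a rank-r adapter pair."""
    full = d_out * d_in               # every entry of the original matrix
    adapter = rank * (d_out + d_in)   # B is d_out x r, A is r x d_in
    return full, adapter


full, adapter = lora_param_counts(2048, 2048, 8)
print(full, adapter, f"{adapter / full:.2%}")  # 4194304 32768 0.78%
```

Under these assumptions the adapter carries less than one percent of the parameters of the matrix it adapts, so gradients and optimizer state shrink by the same factor.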

The Adapters

Once the training step completes successfully, it will create an adapters folder.

This specific folder holds the specialized sarcastic knowledge we just mathematically injected into the system.

We now need to merge this new knowledge permanently into the base model.

Run the following command to fuse the new weights.

python -m mlx_lm.fuse \
  --model HuggingFaceTB/SmolLM2-1.7B-Instruct \
  --adapter-path ./adapters \
  --save-path ./fused_model

This command executes a permanent mathematical merge. It takes the massive matrix of the original frozen model and adds the tiny specialized adapter weights directly into it.

The result is a single consolidated model directory.

You no longer need to manage separate adapter files because the new corporate knowledge is permanently baked into the core neural network.
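
Conceptually the fuse step is a single matrix addition per adapted layer. The notation below is the usual LoRA formulation and is an assumption about what happens under the hood rather than something taken from the mlx-lm source: W_0 is a frozen weight matrix, B and A are the two small adapter matrices of rank r, and s is the adapter scale.

```latex
W_{\text{fused}} = W_0 + s \, B A, \qquad
W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Because the sum has the same shape as W_0 the fused model is no larger than the base model.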

The Fun

We are ready to test the compiled engine.

Execute the following generation command to see your new translator in production.

python -m mlx_lm.generate \
  --model ./fused_model \
  --prompt "The legacy system requires a paradigm shift"

The model will instantly process the prompt and output the honest translation natively right on your machine.

Here are some fun responses I got while playing with it

Prompt: We are adopting a flat organizational structure
Management wants everyone to do the work of three people without a title promotion

Prompt: We value your feedback and have added it to the product backlog
We are ignoring your idea completely

Prompt: The new architecture is highly scalable and future proof
We added Kubernetes to a basic web application and nobody knows how it works anymore

Prompt: We are embracing a fast paced startup culture
You will work weekends and we will not pay you overtime

Conclusion

Weigh the unit economics of this exercise against the operational output before reaching a conclusion.

You now possess a fully customized and functional model without paying a single cent to an external cloud provider.

The return on investment for utilizing existing local hardware is massive.

While Apple Silicon is absolutely not built for planetary scale distributed training it is the undisputed king of local edge engineering.

You own the hardware and you own the intelligence.

Do share your prompts and responses in the comments 😊

If you are curious and want to learn more watch this video from WWDC 2024

Thinking about Training your own AI Model?

Ashok Vishwakarma — Fri, 17 Apr 2026 06:01:19 GMT

One of my clients approached me last month. They wanted to train a proprietary foundation model entirely from scratch.

They wanted to own their intelligence and stop paying external API fees.

They had a massive budget but absolutely no clue how to actually build the system.

They assumed they could just buy a thousand graphics cards and string them together with standard network cables.

I had to stop them from wasting millions of dollars. While researching the exact architecture required to build their cluster I found the reality absolutely fascinating.

I wondered how so few people in our industry truly understand the brutal math and physics behind distributed training.

Based on my research and understanding here is what I know so far.

To understand this physical reality we need to divide the training landscape into two distinct parts.

The SLM (Local)

If you want to train a Small Language Model (SLM) or fine tune an existing model for a highly scoped internal routing task you can rely on a single local system.

You do not need a massive data center. You just need to understand the physics of a single motherboard.

The Software

To make raw silicon do math you need a translator. In the local training environment you have two distinct software paths with completely different hardware requirements.

The first path is CUDA.

Nvidia built CUDA to be their proprietary programming interface. It converts high level Python code into low level parallel math instructions that the hardware can physically execute.

If you use CUDA you must buy Nvidia graphics cards.

The second path is MLX.

Apple built the MLX framework to compete directly with CUDA for local execution.

MLX is designed exclusively for Apple Silicon. It allows developers to run complex machine learning math on standard Mac Studio desktops.

MLX is still very new and less mature than CUDA. It also has very limited library support, which makes CUDA the better choice.

The Shared Memory Advantage

When training locally the physical layout of the memory dictates your performance.

In a standard Nvidia PC build the system relies on separated memory.

You must copy your training data from the system RAM across the PCIe bus into the dedicated VRAM of the graphics card.

Apple avoids this entirely with Unified Memory.

The central processor and the graphics processor share the exact same physical memory pool.

When you load your training data into an Apple system there is zero data copying required.

The processors simply pass a pointer to the shared memory block. This shared memory architecture makes Apple Silicon incredibly efficient for training small models on a single local machine.

Apple’s Unified Memory gives you a huge advantage here.

The Frontier (Cluster)

When you leave the local machine to train a billion or trillion parameter model you enter a realm governed by strict physical capacity.

You are no longer building a computer. You are building a cluster of computers that acts as a supercomputer.

The Software

Apple and MLX completely disappear at this scale.

Nvidia dominates the frontier because CUDA scales perfectly across thousands of machines.

Nvidia spent a decade ensuring frameworks like PyTorch default exclusively to CUDA for distributed workloads.

The software moat forces you to buy into their enterprise hardware ecosystem.

The Hardware

We hit a hardware capacity wall very quickly when building foundation models.

A 70 billion parameter model consumes terabytes of memory. You have to store the massive weight matrices the optimizer states and the continuous training batches.
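
The terabyte figure can be reproduced with a rough per parameter budget. The breakdown below assumes mixed precision training with the Adam optimizer, which is a common setup but an assumption here, and it excludes activations and training batches.

```python
# Bytes held per parameter during mixed precision Adam training (assumed setup).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}


def training_state_tb(num_params):
    """Weights, gradients and optimizer state in terabytes (10^12 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e12


print(f"{training_state_tb(70_000_000_000):.2f} TB")  # 1.12 TB
```

That is already more than a terabyte before a single activation is stored.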

No single GPU on earth holds that much data.

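Because the full training state does not fit on one device you must build a distributed cluster and shard the weights, gradients and optimizer states across GPUs with techniques such as tensor, pipeline or sharded data parallelism.

Data Parallelism then multiplies throughput. You replicate the model, or the sharded group of GPUs that holds one copy of it, many times across the cluster. You slice the massive training dataset into smaller manageable chunks and feed them to the separate replicas simultaneously.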

Gradient Synchronization

Having thousands of processors working at once sounds incredibly efficient until you realize they have to talk to each other constantly.

As GPUs process their data chunks they calculate gradients.

Gradients are massive mathematical vectors that tell the model exactly how to adjust its weights to decrease errors.

This is the literal process of machine learning.

This requirement introduces a massive architectural bottleneck. Before moving to the next training step every single one of those thousands of GPUs must share its gradients. They must average them together and update their local copies identically so the cluster learns as one single brain.

This Gradient Synchronization happens millions of times per run.
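
The synchronization step is easier to see on a toy problem. The sketch below is plain Python with two simulated workers fitting y = w * x. The function names are hypothetical, and all_reduce_mean simply averages a list where a real cluster would run a collective operation over the network.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for the collective: every worker ends up with the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[0:2], data[2:4]]   # each worker sees a different slice
replicas = [0.0, 0.0]             # each worker holds its own copy of w
learning_rate = 0.05

for step in range(50):
    grads = [local_gradient(w, shard) for w, shard in zip(replicas, shards)]
    synced = all_reduce_mean(grads)   # gradient synchronization
    replicas = [w - learning_rate * g for w, g in zip(replicas, synced)]

print(replicas)
```

Both copies stay identical after every step because they apply the same averaged gradient, and both converge towards 2. Skip the averaging and the two copies drift apart immediately.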

The NCCL

Moving this much data simultaneously breaks normal computer networks.

If ten thousand GPUs broadcast massive files simultaneously over a standard network the data center collapses instantly.

The GPUs finish their math in milliseconds and then sit completely idle waiting for network switches to clear the traffic jam.

Nvidia solved this network choke with a software tool called the Nvidia Collective Communications Library or NCCL.

NCCL uses a brilliant mathematical layout called Ring All Reduce.

It arranges GPUs in a logical ring. Data is broken into small chunks and passed strictly to immediate neighbors instead of broadcasting to everyone.

For a cluster of N GPUs and a data size D the data sent and received by each GPU is bound by this exact formula.

$$\text{data per GPU} = 2 \cdot \frac{N - 1}{N} \cdot D$$

Because the fraction approaches 1 the total data transferred by any single node never exceeds 2D.

This bound guarantees that the bandwidth each GPU needs stays roughly constant no matter how many thousands of GPUs you add to the cluster, although the number of ring steps and therefore the latency still grows with N.
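
The ring pattern can be simulated in a few dozen lines to check that bound. This is an illustrative single process sketch, not how NCCL is implemented: workers are list entries, a "send" is a list write, and the vector length is assumed to divide evenly by the number of workers.

```python
def ring_all_reduce(vectors):
    """Sum equal-length vectors across workers using a simulated ring.

    Returns (result held by each worker, elements sent by each worker).
    """
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "sketch assumes the vector splits evenly into n chunks"
    chunk = size // n
    data = [list(v) for v in vectors]
    sent = [0] * n

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: worker i sends one chunk to worker i+1, which adds it.
    for step in range(n - 1):
        outgoing = [[data[i][k] for k in span((i - step) % n)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for offset, k in enumerate(span((i - step) % n)):
                data[dst][k] += outgoing[i][offset]
            sent[i] += chunk

    # Phase 2, all-gather: finished chunks travel around the ring and overwrite.
    for step in range(n - 1):
        outgoing = [[data[i][k] for k in span((i + 1 - step) % n)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for offset, k in enumerate(span((i + 1 - step) % n)):
                data[dst][k] = outgoing[i][offset]
            sent[i] += chunk

    return data, sent


workers = [[float(i + 1)] * 8 for i in range(4)]   # 4 workers, D = 8 elements
result, sent = ring_all_reduce(workers)
print(result[0])  # [10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
print(sent)       # [12, 12, 12, 12]
```

With N = 4 and D = 8 each worker sends 12 elements, which matches 2 * (N - 1) / N * D and stays below 2D.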

The NVLink

Software optimization can only take you so far before physics gets in the way.

Even with NCCL mathematically optimizing the traffic we still hit a physical motherboard limit. Moving hundreds of gigabytes over a standard PCIe connection is cripplingly slow.

A PCIe 5.0 x16 link maxes out around 64 gigabytes per second in each direction. That is far too slow for gradient synchronization.

Nvidia built NVLink to bypass the motherboard entirely.

NVLink is a proprietary physical bridge connecting GPUs directly to each other with thick copper cables.

In its fifth generation it delivers a staggering 1.8 terabytes per second of bidirectional bandwidth per GPU.

It transfers calculations so fast that the entire server rack functions as a single unified processor.
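
The gap is easier to appreciate as transfer time. The sketch below uses the two bandwidth figures quoted in this section and an illustrative 100 gigabyte payload, which is an assumption rather than a measured gradient size. It also ignores latency and protocol overhead.

```python
def transfer_seconds(payload_gb, bandwidth_gb_per_s):
    """Idealized time to move a payload at a sustained bandwidth."""
    return payload_gb / bandwidth_gb_per_s


PAYLOAD_GB = 100          # illustrative payload, not a measured gradient size
PCIE_GB_S = 64            # one direction, figure quoted in this section
NVLINK_GB_S = 1800 / 2    # 1.8 TB/s bidirectional, so 900 GB/s each way

pcie = transfer_seconds(PAYLOAD_GB, PCIE_GB_S)
nvlink = transfer_seconds(PAYLOAD_GB, NVLINK_GB_S)
print(f"PCIe: {pcie:.2f} s, NVLink: {nvlink:.2f} s, ratio: {pcie / nvlink:.0f}x")
```

Under these assumptions the same payload moves roughly fourteen times faster over NVLink, and that saving is paid back on every synchronization step.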

The Cost

We need to connect this complex architecture back to the financial reality facing my client.

We must evaluate the strict economic difference between the local path and the frontier path before we reach a final decision.

The SLM (Local)

Training a Small Language Model (SLM) locally is a highly predictable financial commitment.

You buy the hardware once. You plug it into a standard wall outlet.

The unified memory architecture allows your engineers to experiment and fail rapidly without incurring hourly cloud compute penalties.

Your total financial risk is strictly capped at the initial purchase price of the desktop machine.

This makes local SLM development an incredible bargain for enterprise teams building scoped internal tools.

The Frontier (Cluster)

Training a frontier model is a completely different financial universe.

It requires ten thousand to one hundred thousand GPUs running continuously for months.

Without NCCL and NVLink a cluster spends forty percent of its time waiting for data transfers.

When your cloud bill is hundreds of thousands of dollars a day idle compute is pure financial hemorrhage.

You are burning cash to power hardware that is doing absolutely nothing.

The physics directly dictates the invoice.

Conclusion

We need to weigh the ambition of training proprietary models against the harsh reality of data center physics before jumping in.

If you want to train a small model locally the unified memory of an Apple system running MLX is an incredible and cost effective engineering marvel.

You avoid the vendor lock in and bypass the PCIe bottleneck entirely on a single machine but you have to deal with the drawbacks of MLX in general.

If you want to compete at the frontier and train massive models you have absolutely no choice.

You are buying into a closed high speed distributed network ecosystem. You cannot build a custom cluster with cheap networking and expect it to survive the synchronization penalty.

Nvidia dominates because they control the entire vertical stack.

Their true moat is not just the silicon chip processing the math.

Their moat is the complex distributed software and the thick physical cables tying the entire data center together.

Why CUDA (Nvidia) won the AI game even when Apple built the best hardware?

Ashok Vishwakarma — Thu, 16 Apr 2026 08:45:47 GMT

Most developers assume that Graphics Processing Units (GPUs) dominate artificial intelligence simply because they possess thousands of processing cores. We need to look past this assumption to understand the actual physics of computing.

Artificial Intelligence (AI) inference is not fundamentally a compute problem. It is a data movement problem. The traditional Central Processing Unit (CPU) was not defeated by a lack of mathematical power. It was defeated by physical distance.

There is a strange twist to this story. Apple solved this hardware problem years ago. Yet Nvidia remains a multi trillion dollar monopoly today. Nvidia wins because they built software that traps the entire industry.

The Silicon Real Estate

Look at the physical die of a modern server processor.

You will see massive silicon real estate dedicated to branch prediction and deep caching layers. We must understand the physics of this layout.

When a central processing unit executes a web application it encounters millions of conditional branch instructions at the machine code level.

The silicon cannot wait to evaluate every single condition before fetching the next operation. It must statistically predict whether a logical branch will resolve to true or false to keep its instruction pipeline completely full.

If the hardware predicts incorrectly it must flush the entire pipeline and start the cycle over.

This speculative execution is absolutely necessary to run highly unpredictable software like operating systems and relational databases.

A traditional processor survives by calculating the statistical probability of the immediate future.

Graphics processors strip all of that predictive logic away.

Artificial Intelligence (AI) does not branch. It multiplies.

The core of a neural network is a massive sequence of Multiply Accumulate operations. You take a matrix of weights and multiply it by a matrix of inputs.

This math is entirely deterministic and highly parallel.

Graphics processors dedicate their silicon entirely to Arithmetic Logic Units (ALUs) to process thousands of these matrix operations simultaneously.

They do not guess. They just compute.

The Von Neumann Bottleneck

The physics of inference exposes a massive flaw in traditional computer design. To understand the memory wall you have to look at the mathematical footprint of a Large Language Model (LLM).

The absolute minimum physical memory required to load a model for inference is dictated by a strict formula.

M = P × B

Where M is the minimum memory in bytes, P is the total number of parameters, and B is the byte size of the precision format.

If you want to run Llama 3 with 70 billion parameters in standard 16-bit precision, where each parameter is 2 bytes, you are looking at a minimum of 140 GB of VRAM just to hold the weights.

This completely excludes the KV cache and context window. You are not loading a program. You are loading a 140 GB matrix of floating point numbers into memory.
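The weight footprint is simple enough to check in a few lines. This is a minimal sketch of the parameters-times-bytes calculation. The function name is made up, and the 4-bit case is an extra assumption added for comparison, not a figure from this article.

```python
def min_weight_memory_gb(parameters: float, bytes_per_parameter: float) -> float:
    """Return the minimum memory, in GB (10^9 bytes), needed just to hold the weights."""
    return parameters * bytes_per_parameter / 1e9


# 70 billion parameters at 16-bit precision (2 bytes each), as in the text.
llama_70b_fp16 = min_weight_memory_gb(70e9, 2)

# Assumption for comparison only: the same model quantized to 4 bits (0.5 bytes each).
llama_70b_int4 = min_weight_memory_gb(70e9, 0.5)

print(f"16-bit: {llama_70b_fp16:.0f} GB, 4-bit: {llama_70b_int4:.0f} GB")
```

The result covers the weights only. The KV cache and context window come on top of it.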

We must evaluate the architecture John von Neumann designed in 1945. He separated the processing unit from the memory unit.

To perform math, the processor must fetch data across a physical wire. For a CPU this wire is the DDR memory bus on the motherboard. For a discrete GPU fed from system RAM it is the PCIe bus.

Moving hundreds of gigabytes of data across a physical motherboard trace requires electricity and introduces massive latency. The electrons literally have too far to travel.

The CPU was defeated by a mathematical ratio known as Arithmetic Intensity.

Arithmetic Intensity = FLOPs executed / Bytes moved from memory

This formula measures how many floating point operations (FLOPs) the processor can execute for every byte of data it fetches from memory.

Generative AI inference has an incredibly low arithmetic intensity. Generating a token requires relatively few mathematical operations but it requires reading the entire 140 GB weight matrix from memory.

Because the math is simple but the data is massive, the CPU cores finish their calculations in nanoseconds and then sit idle, waiting for the standard DDR memory bus to fetch the next batch of data.

This is the Von Neumann Bottleneck. Compute is cheap but moving data across a motherboard is prohibitively expensive. The CPU is starved by the motherboard.
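To see how hard memory bandwidth caps generation speed, here is a back-of-the-envelope sketch. It assumes a batch size of one and that every generated token requires reading all the weights once, and it ignores the KV cache and compute time. The 800 GB/s figure is the one this article quotes for unified memory. The 100 GB/s figure is an assumed, illustrative value for a DDR-class bus, not a measured one.

```python
def max_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when each token requires one full read of the weights."""
    return bandwidth_bytes_per_s / weight_bytes


WEIGHTS = 140e9  # 140 GB of 16-bit weights, from the example above

scenarios = [
    ("assumed DDR-class bus (100 GB/s)", 100e9),  # illustrative assumption
    ("unified memory (800 GB/s)", 800e9),         # figure quoted in this article
]

for label, bandwidth in scenarios:
    limit = max_tokens_per_second(WEIGHTS, bandwidth)
    print(f"{label}: at most {limit:.2f} tokens/s")
```

Under these assumptions the ceiling is set entirely by bandwidth. A faster core changes nothing until the bytes arrive sooner.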

Nvidia did not win by making the compute slightly faster. They won by massively increasing memory bandwidth, using vertically stacked High Bandwidth Memory (HBM) placed on the same package as the GPU die.

The Unified Memory Architecture

If we treat this strictly as a data movement problem Apple should theoretically dominate the entire enterprise AI industry.

Apple engineers evaluated this physical bottleneck and radically redesigned the motherboard. They developed a hardware solution called Unified Memory Architecture. By placing the central processor, the graphics processor, and the system RAM on the exact same physical silicon package, they completely eliminated the physical distance of the motherboard trace.

They did not just shorten the wire. They eliminated the PCIe bus entirely.

In a traditional PC an Nvidia GPU must copy data from system RAM over the PCIe bus into its own dedicated VRAM before it can execute matrix multiplication.

In an Apple system the CPU and GPU simply pass a pointer to the exact same block of memory. This zero copy architecture allows a desktop chip to achieve 800 gigabytes per second of memory bandwidth natively.

To achieve that specific memory capacity in standard PC ecosystems you would require massive server racks and you would suffer from crippling network cable latency.

Structurally Apple built the absolute perfect AI machine.

The CUDA Monopoly

Look at the reality on the ground.

Walk into any elite artificial intelligence lab today and you will see engineers completely ignoring Apple hardware.

They are hoarding Nvidia hardware instead.

This reveals a brutal engineering truth. The absolute best hardware does not win if the competitor owns the compiler.

Nvidia is not just a silicon company. They are a ruthless software monopoly.

To understand their moat you must understand Compute Unified Device Architecture or CUDA.

CUDA is an inescapable software layer: the platform, compiler, and libraries that turn high level Python framework code into low level GPU instructions.

Nvidia spent fifteen years optimizing their proprietary math libraries and ensuring that every foundational AI framework including PyTorch was built natively on top of their compiler.

We must understand why the industry cannot simply switch to competing silicon.

If you buy an AMD chip or an Apple desktop you must rely on translation layers to convert CUDA calls into alternative instructions.

Translation introduces bugs and destroys performance. If you choose to fight this ecosystem you are choosing operational pain. You will face compiler lock in immediately. You will encounter missing tensor libraries. You will watch your mathematical compilations fail. Your senior engineers will spend weeks debugging open source translation layers instead of actually training models.

Nvidia gave the compiler away for free to ensure you could never leave their hardware ecosystem.

Conclusion

Weighing hardware physics against software ecosystems brings us to a definitive conclusion.

The fundamental architectural maxim is simple. Hardware physics dictate the absolute ceiling of system performance but software ecosystems dictate the floor of usability.

Here is the defensive playbook for technical leadership.

If your team is strictly running local inference on a pre-trained model, Apple Silicon is a cost-effective choice. It will save you massive cloud compute bills and bypass the memory wall beautifully.

However if your engineering team is actively training foundational models or building complex agentic frameworks you have absolutely no choice.

You must pay the Nvidia tax for CUDA.

Do not choose your enterprise infrastructure based strictly on a hardware specification sheet. Fighting the dominant software ecosystem will burn your entire financial runway on operational overhead.

The best silicon in the world is completely useless if you cannot compile the math.

Why AI Giants are Abandoning the Public Cloud?

Ashok Vishwakarma — Fri, 10 Apr 2026 08:01:56 GMT

For a decade we were told owning physical servers was a relic of the past. The public cloud was the final destination.

We must critically analyze this assumption today in April 2026.

The generative AI boom is aggressively reversing that trend. The companies building frontier AI are quietly abandoning generic public cloud infrastructure.

They are pouring hundreds of billions into custom physical data centers.

Let us examine the evidence to prove this shift.

Look at Microsoft and OpenAI. They are planning Project Stargate. This is a proposed 500 billion dollar supercomputer data center. We must critically analyze that number. It is five hundred times more expensive than current massive data centers.

Look at Meta. They are bypassing standard cloud providers entirely. They are hoarding hundreds of thousands of H100 GPUs in custom built facilities. They are designing the cooling and power routing from scratch.

Google has always relied on its own custom TPU pods rather than generic cloud hardware for its core AI research.

The takeaway is clear. The smartest money in tech is no longer renting compute by the hour.

They are buying the bare metal.

But why?

Why is this happening?

We must evaluate the physics.

We must analyze the architectural mismatch between cloud virtualization and generative AI.

The public cloud relies on virtualization to host stateless web servers. Generative AI completely breaks this model.

Training a trillion parameter model requires tens of thousands of GPUs communicating synchronously in real time, exchanging gradients in lockstep at every training step.

A single delayed packet on a standard cloud network switch stalls the entire run.

AI requires bare metal access. It requires custom InfiniBand networking and extreme thermal cooling.

Then we evaluate the math.

Renting a GPU means paying the hyperscaler's profit margin. When your compute bill hits 10 billion dollars a year, building a custom nuclear-powered data center is mathematically cheaper.

What’s the Impact?

Let’s analyze what this means for a Chief Technology Officer who is not building a custom data center.

There is good news for speed.

Custom networking pipelines designed exclusively for inference will drastically drop latency. Your Time to First Token (TTFT) will plummet. The math will execute perfectly on their bare metal.

However, there is some bad news as well. We call this the Monopoly Trap.

You must destroy the assumption that cheaper internal costs for the giants mean cheaper API costs for developers. The AI giants will subsidize costs now to capture the market.

This creates an impenetrable physical moat. Startups cannot build their own Stargate to compete.

Once the market consolidates to two or three mega providers who own the physical hardware the era of cheap AI ends.

They will have total leverage to exponentially raise API token prices when investors demand profitability.

Conclusion

The cloud abstraction is failing under the weight of artificial intelligence.

You can enjoy the speed and subsidized prices for now but you must architect defensively.

If your business model relies on a single provider keeping APIs cheap forever it is built on borrowed time.

The AI Giants are building physical castles.

Make sure you are not locked inside when they raise the drawbridge.

Why Agentic AI requires Graph based Observability?

Ashok Vishwakarma — Wed, 08 Apr 2026 06:01:32 GMT

Recently, I received a call from an old client whose AI agent was making decisions they could not trace, even though they had invested in observability tooling.

They had deployed a sophisticated procurement agent to manage their raw materials inventory. The agent was designed to read the Bill of Materials (BOM) for incoming orders and automatically interact with the ERP to ensure supply.

It had been running smoothly for weeks. Then, one night, the agent confidently issued a purchase order for 5,000 gallons of highly reactive industrial solvent that the company absolutely did not need. It was a $200,000 mistake executed in milliseconds.

The engineering team did what they were trained to do: they opened their AI Observability dashboards to debug the failure.

They were using one of the industry-standard LLMOps platforms (think LangSmith or Arize AI). They opened the specific trace, expecting to see a giant red error box.

Instead, the dashboard showed a 100% success rate.

The token usage was optimal. The latency was fine. The visual Directed Acyclic Graph (DAG) in the UI showed a perfectly clean execution path: the agent checked the ERP, saw a shortage, queried the vector DB for an alternative, found the solvent, and successfully executed the Create_Purchase_Order tool.

According to modern AI observability tools, the agent performed flawlessly. The JSON parsed correctly. The APIs returned 200 OK.

The tools were completely blind to the fact that the agent had just committed a catastrophic logical error. This is the “Silent Success” problem of Agentic AI, and it exposes a massive architectural flaw in how the industry approaches observability.

The AI Observability

The industry is currently flooded with vendors claiming to have “solved” AI observability. When you view them through the lens of enterprise architecture, their solutions fall apart because they suffer from a severe semantic blindspot.

The Giants (Datadog, Dynatrace, New Relic etc)

To be fair, these platforms have evolved. They boast “Interactive Dependency Graphs” and “Entity Maps.” But look at what these graphs actually map: infrastructure and service flow, not reasoning.

They are built for deterministic microservices. They treat an LLM call exactly like a SQL query.

A 200 OK HTTP status code between your agent container and the OpenAI endpoint means nothing if the agent just authorized a catastrophic purchase order.

They map the servers, but they are blind to the thought process.

The AI Focused (Arize AI, LangSmith, Kore.ai etc)

These platforms specifically target AI, and their marketing heavily touts “agentic tracing” and “graph visualization.” They do a great job of mapping a single agent’s trajectory.

But under the hood, they are still fundamentally OpenTelemetry (OTel) log stores. They suffer from Trace Isolation. They capture the agent’s actions perfectly, but that trace exists in a vacuum.

The tool does not know your business logic.

If an agent skips a mandatory safety check, LangSmith simply shows you a UI graph where the “Safety Check” node isn’t there.

Because the tool doesn’t know the rules of your enterprise, it assumes the agent’s path was correct. They treat the graph as a UI visualization of an ephemeral log, not as a computable mathematical structure tied to reality.

The Solution

If an agent’s execution is a branching decision tree, the observability layer must be a native graph database (like Neo4j) that already holds your company’s business logic.

I wrote the agent’s execution state directly into Neo4j, where the company’s manufacturing ontology (their Enterprise Knowledge Graph) already lived.

We ran the agent in a shadow environment, and a week later, it attempted the exact same hallucination.

This time, we didn’t look at a UI visualization of a single trace. We mathematically queried the agent’s behavior against the laws of the business.

Because both the Agent Trace and the Enterprise Ontology lived in the same database, we could write a Cypher query to perform an “Ontological Join”:

// Find any execution trace where the agent ordered a hazardous material
// WITHOUT executing a corresponding safety check tool.
MATCH (trace:AgentSession)-[:TOOL_CALL]->(po:PurchaseOrder)-[:TARGETS_ITEM]->(item:Material)
MATCH (item)-[:HAS_PROPERTY]->(prop:HazardLevel {value: 'High'})
WHERE NOT (trace)-[:TOOL_CALL]->(:SafetyMatrixCheck)
RETURN trace.id, item.name

The Graph revealed the semantic missing edge. The agent had queried the vector database for an alternative chemical. It found “Solvent Y-200”. In the company’s ontology, Y-200 is linked to a [HazardLevel: High] node, which strictly requires a [:SAFETY_CLEARANCE] edge.

The agent had bypassed the safety check tool entirely.

LangSmith couldn’t catch this because LangSmith doesn’t know what “Solvent Y-200” is. It only knows what the LLM typed.

Neo4j caught it instantly because it cross-referenced the agent’s ephemeral trace against the physical reality of the business.

What else is needed?

Once you have Native Graph Evals in place, you stop guessing and start engineering. Seeing the failure is step one. Preventing it requires expanding your architecture beyond trusting the LLM.

Here is what else you must implement when deploying agents to production:

Deterministic Tool Gateways

An LLM should never speak directly to an ERP or a production database.

You must build a middleware gateway between the agent’s output and the actual API execution. Using our Neo4j setup, we built a gateway that intercepts the Create_Purchase_Order tool call, queries the graph to ensure the [Safety_Check] node exists in the current session trace, and blocks the API if the graph topology is invalid.
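The gateway idea can be sketched without a database. The example below is a minimal illustration, not the production implementation described above. The SessionTrace class stands in for the session's subgraph in Neo4j, and the REQUIRED_BEFORE rule table, the exception type, and the function names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SessionTrace:
    """Hypothetical in-memory stand-in for the current session's trace subgraph."""
    session_id: str
    completed_tools: set[str] = field(default_factory=set)


class GatewayBlocked(Exception):
    """Raised when the gateway refuses to forward a tool call."""


# Hypothetical rule table: tool name -> tools that must already appear in the trace.
REQUIRED_BEFORE = {"Create_Purchase_Order": {"Safety_Check"}}


def gateway_call(trace: SessionTrace, tool_name: str, execute):
    """Run execute() only if every prerequisite tool is already in the trace."""
    missing = REQUIRED_BEFORE.get(tool_name, set()) - trace.completed_tools
    if missing:
        raise GatewayBlocked(f"{tool_name} blocked; missing: {sorted(missing)}")
    result = execute()
    trace.completed_tools.add(tool_name)  # record the call so later checks can see it
    return result


trace = SessionTrace("session-1")
try:
    gateway_call(trace, "Create_Purchase_Order", lambda: "PO created")
except GatewayBlocked as err:
    print(err)  # the order is rejected before any API is touched

gateway_call(trace, "Safety_Check", lambda: "cleared")
print(gateway_call(trace, "Create_Purchase_Order", lambda: "PO created"))
```

The important property is that the check is plain deterministic code. The LLM can propose the call, but it cannot talk its way past the rule.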

State-Bound Prompts

Do not give an agent all 15 of its tools in the initial system prompt. That is begging for hallucinations.

Use the graph state to dynamically inject only the tools that are valid for that specific moment.

If the agent has not successfully completed the Inventory_Check node, the Create_Purchase_Order tool should literally not exist in its context window.
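One way to picture state-bound tool injection is as a filter over a tool registry. This is a minimal sketch under assumptions: the registry contents, the prerequisite sets, and the function name are hypothetical, and in a real system the completed nodes would be read from the graph rather than passed in as a set.

```python
# Hypothetical tool registry: tool name -> nodes that must be completed first.
TOOL_PREREQUISITES = {
    "Inventory_Check": set(),
    "Safety_Check": {"Inventory_Check"},
    "Create_Purchase_Order": {"Inventory_Check", "Safety_Check"},
}


def available_tools(completed_nodes: set[str]) -> list[str]:
    """Return tools whose prerequisites are all met and that have not already run."""
    return sorted(
        name
        for name, prereqs in TOOL_PREREQUISITES.items()
        if prereqs <= completed_nodes and name not in completed_nodes
    )


print(available_tools(set()))                                # ['Inventory_Check']
print(available_tools({"Inventory_Check"}))                  # ['Safety_Check']
print(available_tools({"Inventory_Check", "Safety_Check"}))  # ['Create_Purchase_Order']
```

Only the returned tool definitions would be placed in the prompt for the next step. Hiding tools that have already run is a design choice made for this sketch. Some workflows will want to allow repeats.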

Macro-Topological Analysis

Stop looking at crashes one by one.

With a native graph database, you can run PageRank or Cycle Detection algorithms across millions of historical traces simultaneously.

You can mathematically prove that 90% of your token-burning infinite loops only happen when the agent interacts with Tool A immediately after failing Tool B.
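The kind of question this enables can be shown on a toy data set. The sketch below uses plain Python lists instead of a graph database, and it assumes each trace has already been labelled as looping or not, for example by a cycle detection pass. The class names, tool names, and sample traces are invented for illustration.

```python
from typing import NamedTuple


class Step(NamedTuple):
    tool: str
    succeeded: bool


class Trace(NamedTuple):
    steps: list[Step]
    looped: bool  # assumed to be labelled upstream, e.g. by cycle detection


def follows_failure(trace: Trace, failed_tool: str, next_tool: str) -> bool:
    """True if next_tool is called immediately after a failed failed_tool call."""
    return any(
        prev.tool == failed_tool and not prev.succeeded and cur.tool == next_tool
        for prev, cur in zip(trace.steps, trace.steps[1:])
    )


def loop_share_with_pattern(traces: list[Trace], failed_tool: str, next_tool: str) -> float:
    """Fraction of looping traces that contain the failure-then-call pattern."""
    loops = [t for t in traces if t.looped]
    if not loops:
        return 0.0
    return sum(follows_failure(t, failed_tool, next_tool) for t in loops) / len(loops)


traces = [
    Trace([Step("Tool B", False), Step("Tool A", True), Step("Tool A", True)], looped=True),
    Trace([Step("Tool B", True), Step("Tool A", True)], looped=False),
    Trace([Step("Tool C", True), Step("Tool C", True)], looped=True),
]
print(loop_share_with_pattern(traces, "Tool B", "Tool A"))  # 0.5
```

In a graph database the same question becomes a pattern match over millions of stored traces instead of a loop over a list.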

Conclusion

If you deploy autonomous agents to production using isolated trace stores, you are not deploying software. You are deploying a dangerous, expensive black box.

Native graph observability shifts AI from “magic” back to determinism. By interrogating the topology of the agent’s reasoning against the ontology of your business, you stop relying on “Silent Successes.”

You patch the system prompt, build a gateway, or constrain the toolset precisely at the node where the graph broke.

Stop treating agents like magic functions. Treat them like autonomous state machines traversing a graph, and build the infrastructure to observe them accordingly.

If you want to see exactly how to build this architecture in production, I am doing a deep dive on this exact topic at Neo4j’s NODES AI.

The $20/$200 AI Subscription is going to be Dead?

Ashok Vishwakarma — Thu, 02 Apr 2026 06:02:25 GMT

For two years, the industry sold us the idea that frontier AI was just another consumer utility. They priced access to massive supercomputers like a Spotify subscription. You paid $20 or $200 a month, and in exchange, you got an all-you-can-eat buffet of compute power.

But by early 2026, the physical cost of running these models caught up with the business reality.

Look at the events of the last quarter. OpenAI shut down Sora and backed out of their Disney partnership. Microsoft is tightening its cloud budget. And a leaked internal Anthropic dashboard, shared by The Signal, showed just how unsustainable the generative AI business model is right now.

We aren’t entering an era of unlimited personal AI. We’re actually going back to the 1970s model, the era of the IBM Mainframe. The hallucination seems officially over 🙂

AI isn’t a SaaS

Software-as-a-Service (SaaS) is a great business model because the marginal cost of serving one more user is effectively zero.

AI inference isn’t SaaS. It scales linearly. Every prompt requires physical hardware to do work.

From The Signal

To understand why the flat-rate AI subscription is failing, look at the leaked Anthropic dashboard. An enterprise power user on a $200/month “Pro” tier ran an autonomous coding loop. In 23 days, that single user consumed 1.1 billion tokens and triggered 9,221 sub-agent tasks.

The actual compute cost of running those inferences on Anthropic’s GPU clusters was $27,000. Anthropic took a 135x loss on a single customer in less than a month.

Analyzing a 100-page PDF or running an autonomous agent isn’t a simple database query. It requires firing up clusters of GPUs and executing billions of operations. You can’t sell that kind of compute for a flat fee and hope to make it up in volume. Volume is exactly what causes the losses.
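The arithmetic behind that loss is worth writing down. This short sketch uses only the figures quoted above. The variable names are mine, and the per-token cost is simply what those figures imply, not a published price.

```python
subscription_usd = 200      # monthly subscription price quoted above
compute_cost_usd = 27_000   # compute cost quoted above
tokens_consumed = 1.1e9     # tokens consumed in 23 days, quoted above

# How many times the subscription price the compute actually cost.
loss_multiple = compute_cost_usd / subscription_usd

# Implied compute cost per million tokens for this one user.
cost_per_million_tokens = compute_cost_usd / (tokens_consumed / 1e6)

print(f"Compute cost is {loss_multiple:.0f}x the subscription price")
print(f"Implied cost: ${cost_per_million_tokens:.2f} per million tokens")
```

At a flat $200, the provider is effectively selling each million tokens to this user for about 18 cents while paying roughly $24.55 to produce it.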

The Death of Sora

If you want proof that the subsidy is over, look at Sora.

OpenAI killed their flagship video generation model less than six months after its public launch. The tech press talked about “safety concerns” and “copyright,” but the reality was the Cost of Goods Sold (COGS).

Generating 60 frames per second of photorealistic video requires massive compute. Keeping the Sora clusters running for 500,000 active users burned an estimated $1 million a day in electricity and GPU depreciation.

They tried to pivot to the enterprise by signing a $1 billion partnership with Disney. But Disney realized that offloading their rendering pipeline to OpenAI’s servers was actually more expensive than doing it in-house. The unit economics didn’t make sense, so the servers were shut down.

OpenAI’s “Risk Factor”

The cloud providers have woken up. For years, Microsoft subsidized OpenAI’s compute to gain market share. That era of cheap infrastructure is over.

In a recent financial disclosure, OpenAI explicitly listed their reliance on Microsoft Azure’s compute pricing as a “Risk Factor.”

Wall Street is demanding a return on the billions poured into data centers. Investors are forcing AI labs to drop unprofitable consumer tools and focus entirely on enterprise contracts that actually pay the bills.

Edge vs. Mainframe

To fix the broken math, the AI market is splitting into two distinct tiers.

The middle ground, the $200/month frontier web app, is disappearing.

Tier 1: The Consumer Edge

Consumers will get smaller, 8-billion parameter models running locally on their phones and laptops (Apple Silicon, Snapdragon NPUs etc).

These models are good enough for basic tasks like grammar correction and summarizing emails, but they aren’t capable of deep reasoning.

Why the shift?

Because by pushing the model to the edge, companies offload the cost of the compute and electricity directly onto the user’s battery. It is the only way the consumer unit economics work.

Tier 2: The AI Mainframe (Enterprise)

True frontier AI, massive models capable of deep, autonomous workflows, will become bespoke enterprise tools.

These won’t be accessible via a casual web interface.

They will be sold via multi-million-dollar B2B contracts to pharmaceutical companies running protein simulations, and quant hedge funds executing trading logic.

They are the only businesses with gross margins high enough to afford the true, unsubsidized cost of compute.

Conclusion

The idea that a solo developer in a garage will have the exact same compute power as the CTO of JPMorgan isn’t realistic.

The physics of data centers dictate otherwise.

As a Software Architect, you need to plan defensively. Relying on cloud providers to subsidize your application’s heavy reasoning is a risk.

Build your systems around cheap, local, open-source models for the basic plumbing such as routing, classification, and simple tasks.

Treat frontier API calls as an expensive, highly constrained physical resource.

Use them only when absolutely necessary, and plan for AI the way you would plan for heavy industrial equipment.

The era of cheap, subsidized compute is over.

A ₹500 ($5) course won't make you an AI Engineer

Ashok Vishwakarma — Tue, 31 Mar 2026 12:03:53 GMT

Take a breath.

I know what your LinkedIn and Twitter feeds look like right now.

Every time you open your phone, an influencer in a rented sports car or a sleek home office is screaming at you.

They claim that AI is replacing software engineers tomorrow, and if you don’t buy their “AI Mastery Guide” today, your career is over before it begins.

I see the panic in computer science cohorts and junior developers. It is entirely valid.

But you need to understand that this anxiety is not a natural reaction to the tech market. It is a manufactured emotion, meticulously engineered by marketers to make you feel inadequate so you will open your wallet.

In any gold rush, the people who actually get rich are the ones selling the shovels. Today’s “AI Influencers” are selling cheap, plastic shovels to desperate, anxious students.

Let’s tear down their selling tactics the way we would tear down an architecture.

The Scam

If you want to be an engineer, you need to learn how to reverse-engineer systems.

Let’s look at the system design of the influencer business model.

When you see an ad for a “Master ChatGPT in 30 Days” course priced at ₹500 or $5, you think you are buying an educational product.

You are not. You are participating in a Customer Acquisition Cost (CAC) mechanism.

That ₹500 course is just the entry point to a sales funnel. The content inside is nothing but regurgitated Twitter threads, repackaged into a PDF to create artificial complexity.

It is designed to give you a dopamine hit of “productivity” while making you feel like you are still missing the real secret.

Then, the trap springs. Once you finish the cheap course, the email sequence begins.

They tell you that to actually secure a six-figure job, you need to join their exclusive “AI Mastermind” or buy the advanced bootcamp for ₹50,000.

You are being farmed for your attention and your anxiety.

The Delusion

The core lie holding this entire grift together is the concept of the “Prompt Engineer.”

Let me give you the hard truth:

Prompt Engineering is not a sustainable technical career.

A prompt is simply a natural language API call. You are no more an “engineer” for typing instructions into ChatGPT than you are a “Search Engineer” for typing queries into Google.

More importantly, it is a rapidly depreciating asset.

Two years ago, getting an LLM to output a specific JSON format required paragraphs of complex “jailbreak” constraints and magical phrasing.

Today, frontier models like Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro natively understand intent, reason through logic, and output strict JSON out of the box.

The models are getting smarter at interpreting human ambiguity. As the compilers (the LLMs) get better at understanding natural language, the need for complex, esoteric “prompts” disappears.

Buying a course on prompting is investing your limited time and money into a dying skill.

The Trap

Part of the illusion relies heavily on visual spectacle.

The influencer will open a video by shouting,

“I just generated a 50-page PowerPoint, a PDF marketing brochure, and 12 photorealistic images in under three minutes using these SECRET tools!”

They will show you a curated list of glossy AI web apps or tools that automatically build slide decks, generate stock photos, or format PDFs.

They will try to convince you that knowing which button to click on these websites is a highly monetizable, technical skill.

Let me be brutally clear.

Using a SaaS wrapper to generate a PDF makes you an end-user, not an engineer.

You are not building software. You are consuming it.

Knowing how to type a prompt into an image generator or a presentation-maker is equivalent to claiming you are a Database Architect because you know how to sort a column in Microsoft Excel.

It is a neat productivity trick for a marketing manager, but it has absolutely zero engineering value.

When you buy a course to learn these tools, you are paying for a glorified software tutorial, not an engineering curriculum.

The Real AI Engineering

Contrast the influencer fantasy with the brutal reality of what the industry actually pays for.

Companies do not pay AI Engineers $150,000 a year to type clever sentences into a web interface or generate a pretty slide deck.

They pay engineers to build resilient, scalable systems around non-deterministic intelligence.

Real AI engineering is systems engineering. It is building deterministic RAG (Retrieval-Augmented Generation) pipelines. It is managing the severe latency of vector databases. It is calculating the unit economics of token costs versus API throughput.

An actual AI Engineer spends their day setting up strict guardrails to catch LLM hallucinations before they reach a customer.

They write middleware to mask Personally Identifiable Information (PII) before it hits a third-party API.

They write exponential backoff algorithms to handle 502 Bad Gateway errors and 429 Too Many Requests rate limits from OpenAI.

They build intelligent chunking strategies to fit a massive PDF into a 100k context window without destroying the semantic meaning.
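To make one of these concrete, here is a minimal sketch of the exponential backoff mentioned above, with random jitter added. It does not use any real provider SDK. The ApiError class, the function names, and the default delays are assumptions chosen for illustration.

```python
import random
import time

RETRYABLE_STATUS = {429, 502}  # rate limit and bad gateway, as named above


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_backoff(request, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call request() and retry retryable failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE_STATUS or last_attempt:
                raise  # give up on non-retryable errors or when attempts run out
            delay = min(max_delay, base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries


attempts = {"count": 0}


def flaky_request():
    """Simulated call that is rate limited twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ApiError(429)
    return "ok"


print(call_with_backoff(flaky_request, sleep=lambda seconds: None))  # ok
```

The sleep function is passed in as a parameter so the retry logic can be tested without waiting. The jitter matters in production because many clients retrying on the same schedule will hammer a recovering endpoint in waves.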

If a course does not teach you how to handle network failures, rate limits, or context window chunking, it is not an AI engineering course.

It is a typing class.

Where should you learn?

So, where do you actually go to learn?

You ignore the influencers and you go to the raw, unpolished sources.

Here is your actual roadmap, and it is almost entirely free.

The Math & Mechanics

Go to YouTube and search for Andrej Karpathy’s “Neural Networks: Zero to Hero” series. Karpathy was a founding member of OpenAI and the Director of AI at Tesla.

He will teach you how to build a neural network from scratch using Python.

Understanding the underlying calculus and matrix multiplication is infinitely more valuable than memorizing a prompt template.

The Applied Systems

Stop buying courses and start reading official documentation.

Take Grow with Google AI Lessons.

Read the Anthropic documentation.

Read the OpenAI Cookbooks.

If you need structured courses, take Andrew Ng’s DeepLearning.AI classes.

They are built by actual AI researchers, not Twitter marketers.

The Framework

Want to build AI apps?

Go read the raw GitHub documentation for LangChain, LlamaIndex, or Hugging Face.

Build a simple Python script that takes a local text file, converts it into embeddings, stores it in a local ChromaDB, and queries it.

You will learn more in one weekend of fighting Python dependency errors than in a year of watching influencer videos.
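Before you install anything, it helps to see the shape of that script using only the standard library. The sketch below is a deliberately crude stand-in: word counts replace a real embedding model, and a Python list replaces ChromaDB. All function names and the sample text are made up for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(document: str) -> list[tuple[str, Counter]]:
    """Split on blank lines and store (chunk, vector) pairs; the list stands in for a vector store."""
    chunks = [c.strip() for c in document.split("\n\n") if c.strip()]
    return [(chunk, embed(chunk)) for chunk in chunks]


def query(index: list[tuple[str, Counter]], question: str, top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


# In a real script this text would be read from a local file.
document = (
    "Refunds are processed within five days.\n\n"
    "Shipping takes two weeks for international orders."
)
index = build_index(document)
print(query(index, "How long does shipping take?"))
```

Swapping the toy embed function for a real embedding model and the list for a ChromaDB collection is the weekend project. The control flow of chunk, embed, store, and query stays the same.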

Conclusion

True technical leverage does not live behind a paywall on an Instagram ad.

It lives in your ability to sit in a quiet room, read difficult technical documentation, and build things that break until you figure out how to fix them.

A career in software engineering is a 40-year marathon.

Technologies will rise and fall. Frameworks will die.

Do not let marketers rush you into buying plastic shovels.

Ignore the noise. Protect your wallet. Focus on the fundamentals.

Thought - Why Token Costs will Bankrupt your LLM Wrapper

Ashok Vishwakarma — Thu, 26 Mar 2026 06:00:40 GMT

Last month, I was brought in to consult for a generative AI startup that had just closed a massive Seed round. They had built an “AI Customer Success Agent” that looked incredible in staging.

Then I opened their Anthropic billing dashboard.

Their Cost of Goods Sold (COGS) wasn’t just high. It was terminal. Every time a user asked their bot a question, the company lost money.

When I tore down their architecture to find the leak, I realized they hadn’t actually built a software company. They had built an unoptimized API proxy.

And this is not an isolated incident.

Look at the startup graveyard from the last two years. The AI Copywriter. The AI PDF Chatbot. The AI Email Assistant. They launched to deafening hype, dominated Product Hunt, and convinced investors they had cracked a new market.

Twelve months later, they were quietly shuttered or sold for parts.

Their entire product architecture consisted of prepending a system prompt to a user input and piping it directly to an LLM provider. They had no defensible moat, and their gross margins were systematically eaten alive by the underlying compute provider.

The Maths of Margin Erosion

Software-as-a-Service (SaaS) valuations are predicated on 80 percent gross margins and a near-zero marginal cost of replication.

If your COGS scales perfectly linearly with every single user click, you do not have a software business. You have a subsidized consulting firm.

The RAG Trap

Let’s examine the exact architecture that was killing my client: a standard enterprise RAG (Retrieval-Augmented Generation) implementation. You deploy a customer support bot handling a modest 10,000 conversations a day.

You write a dense, 2,000-token system prompt detailing the company’s tone and rules. Every time a user asks a question, your vector database retrieves 5,000 tokens of documentation context. Before the LLM generates a single word, your baseline payload is 7,000 input tokens.

At premium model pricing (roughly $10 to $15 per million input tokens for models like Claude 3 Opus or GPT-4), that single interaction costs about $0.10 just to read the prompt. Multiply that by 10,000 conversations, factoring in average multi-turn context window inflation, and you are burning $1,500+ a day.

You are bleeding $500,000 a year purely on input tokens before factoring in output costs, vector database hosting, or salaries.
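You can reproduce this estimate in a few lines. The sketch below uses the token counts, volume, and price range from the text. The function name is mine, and the 1.5x multiplier for multi-turn inflation is an assumed value chosen to illustrate the effect, not a figure from the client's bill.

```python
def daily_input_cost(system_tokens, context_tokens, price_per_million_usd,
                     conversations_per_day, turns_multiplier=1.0):
    """Estimate daily spend on input tokens; turns_multiplier models multi-turn inflation."""
    tokens_per_call = system_tokens + context_tokens
    cost_per_call = tokens_per_call * price_per_million_usd / 1e6
    return cost_per_call * conversations_per_day * turns_multiplier


# Figures from the text: 2,000 + 5,000 input tokens, 10,000 conversations a day,
# and $10 to $15 per million input tokens.
low = daily_input_cost(2_000, 5_000, 10.0, 10_000)
high = daily_input_cost(2_000, 5_000, 15.0, 10_000)
print(f"Single-turn: ${low:,.0f} to ${high:,.0f} per day")

# Assumption: a 1.5x multiplier for multi-turn context growth.
inflated = daily_input_cost(2_000, 5_000, 15.0, 10_000, turns_multiplier=1.5)
print(f"With inflation: ${inflated:,.0f} per day, ${inflated * 365:,.0f} per year")
```

Changing any single input, such as a shorter system prompt or less retrieved context, moves the annual figure proportionally. That is exactly why the baseline payload deserves engineering attention.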

The Output Multiplier

The input tax is just the entry fee. The output tax is where margins actually die.

Generating tokens requires significantly more compute than reading them. Providers charge 3x to 5x more for output tokens. If your bot generates a helpful 500-word response (roughly 650 to 700 tokens), that single output costs roughly another $0.02 to $0.05 at those rates.

Users do not ask one question; they iterate. A four-message conversation easily compounds to 30,000 total tokens processed. You are paying a premium cloud tax on every single syllable your system outputs, 24 hours a day.
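The compounding is easier to see in code. This sketch assumes four question-and-answer turns, and that every turn resends the system prompt, the retrieved context, and the full history. The 50-token question and 650-token answer are assumed sizes, not figures from the text.

```python
def conversation_tokens(turns, system_tokens=2_000, context_tokens=5_000,
                        question_tokens=50, answer_tokens=650):
    """Total tokens processed when each turn resends the full conversation history."""
    history = 0
    total = 0
    for _ in range(turns):
        input_tokens = system_tokens + context_tokens + history + question_tokens
        total += input_tokens + answer_tokens
        history += question_tokens + answer_tokens  # the next turn carries this along
    return total


print(conversation_tokens(1))  # 7700
print(conversation_tokens(4))  # 35000
```

Under these assumptions, four turns cost more than four times a single turn, because each new turn pays again for everything said before it.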

The Latency Tax

Financial bleed is only the first failure mode. The second is Time to First Token (TTFT).

TTFT

TTFT is the ultimate metric in AI architecture. When you wire your frontend directly to an external LLM provider, you surrender total control of your application’s physics.

You inherit their network round-trips. You inherit their TLS handshakes. You inherit their peak-hour queue delays. You are at the mercy of raw 2-to-5-second inference times.

Human beings abandon interfaces that take more than 400 milliseconds to react. If your application takes five seconds to stream the first word because us-east-1 is saturated, your users will leave.

The Streaming Band-Aid

Many developers try to hide this latency by streaming tokens to the UI. This is a visual band-aid, not an architectural fix.

Streaming gives the illusion of speed, but it does not change the physical time it takes to complete the task.

You cannot out-prompt the speed of light.

Prompt engineering cannot fix an overloaded cloud endpoint. If a backend task requires parsing JSON from an LLM response before executing a database query, streaming is useless.

You are simply blocked for five seconds.

The Self-Hosting Fallacy

Faced with massive API bills and latency spikes, engineering teams typically experience a knee-jerk reaction:

“Our API bill is too high. Let’s buy an H100 and host Llama 3 ourselves.”

This is an ego trip disguised as a financial strategy.

The Silicon Math

Generating a million tokens on a hosted, heavily optimized API endpoint costs pennies for smaller models. Renting a single A100 or H100 GPU node costs upwards of $80 to $150 a day.

Because user traffic is spiky and unpredictable, that expensive silicon will sit completely idle 80 percent of the time.

You are paying for maximum capacity, but only utilizing a fraction of it.

The DevOps Nightmare

Running LLMs in production is not like running a Node.js server. It is a grueling battle with KV cache memory fragmentation, continuous batching algorithms, and CUDA out-of-memory crashes.

To keep a self-hosted model running efficiently, you need specialized AI infrastructure engineers. Those engineers cost $250,000 a year.

Unless you process tens of millions of tokens daily at a perfectly consistent utilization rate, or you have strict air-gapped compliance requirements, self-hosting will bankrupt you faster than using LLM APIs.

You trade variable API costs for massive fixed CapEx and severe operational friction.

Architecting the Defensive Moat

To survive the LLM API tax, architects must build a ruthless abstraction layer between their application and the model provider.

You need an AI Gateway.

Tactic 1: Semantic Caching

Users ask the exact same questions constantly. You should not pay Anthropic 1,000 times a day to answer “What is your refund policy?”

Instead, pass the user’s query through a cheap, sub-millisecond embedding model. Store that vector in a Redis cache alongside the LLM’s final generated response.

When the next user asks “How do I get my money back?”, perform a cosine similarity search.

If the mathematical match exceeds a 0.95 threshold, serve the cached string instantly. Your compute cost drops to $0.

Your TTFT drops to 50 milliseconds.
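A minimal sketch of the lookup logic, assuming the embeddings are already computed elsewhere. The `SemanticCache` class is a hypothetical in-memory stand-in for the Redis cache described above, and the linear scan would be replaced by a proper vector index in production.

```python
import math
from typing import List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Hypothetical in-memory stand-in for a Redis-backed semantic cache."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: List[float], response: str) -> None:
        """Store the query embedding next to the LLM's final response."""
        self._entries.append((embedding, response))

    def get(self, embedding: List[float]) -> Optional[str]:
        """Return the best cached response if it clears the threshold."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

On a miss, `get` returns `None`, the gateway calls the model, and the fresh response is written back with `put`.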

Tactic 2: Intelligent Routing (The Cascade)

Stop using premium frontier models for everything. A massive percentage of application logic involves basic classification, sentiment analysis, or summarization.

Your gateway should route trivial tasks to ultra-cheap, high-speed models like Gemini Flash, Llama 3 8B, or Claude Haiku.

Reserve the expensive, heavy reasoning models strictly for complex escalation paths. Implement a cascade: try the cheap model first, and only trigger the expensive model if the output fails a deterministic validation check.
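The cascade can be sketched in a few lines. Everything here is illustrative: the `Model` callables are hypothetical wrappers around whichever provider SDK you use, and the validation rule (valid JSON with a string `label` field) is just one example of a deterministic check.

```python
import json
from typing import Callable, Optional

Model = Callable[[str], str]  # hypothetical: prompt in, raw text out


def parse_label(raw: str) -> Optional[dict]:
    """Deterministic validation: output must be JSON with a 'label' string."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and isinstance(data.get("label"), str):
        return data
    return None


def cascade(prompt: str, cheap: Model, expensive: Model) -> dict:
    """Try the cheap model first; escalate only when validation fails."""
    result = parse_label(cheap(prompt))
    if result is not None:
        return {"tier": "cheap", **result}
    result = parse_label(expensive(prompt))
    if result is None:
        raise ValueError("both tiers failed validation")
    return {"tier": "expensive", **result}
```

Because the check is deterministic, the escalation rate is measurable, and that rate is what decides whether the cheap tier is actually saving money.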

Tactic 3: Context Pruning

Stop treating the LLM context window as a dumping ground.

Throwing 50 pages of PDF text into an API call because “the model supports 1M tokens” is financial suicide.

Implement strict context pruning.

Use a fast Cross-Encoder to re-rank your vector search results. Strip out boilerplate text, HTML tags, and redundant paragraphs before assembling the final prompt.

Sending exactly 800 highly relevant tokens is roughly ten times cheaper and yields better model accuracy than blindly dumping 8,000 tokens.
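A rough sketch of the final pruning step, assuming the re-ranker has already produced a relevance score per chunk. The `prune_context` helper is hypothetical, and token counts are approximated by whitespace splitting, which a real tokenizer would replace.

```python
from typing import List, Tuple


def prune_context(scored_chunks: List[Tuple[float, str]], token_budget: int) -> List[str]:
    """Keep the highest-scoring chunks that fit within token_budget.

    Scores are assumed to come from a re-ranker. Token counts are
    approximated by whitespace splitting (an assumption for this sketch).
    """
    kept: List[str] = []
    seen = set()
    used = 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for _score, chunk in ranked:
        text = " ".join(chunk.split())  # collapse runs of whitespace
        if not text or text in seen:
            continue  # drop empty and duplicate chunks
        cost = len(text.split())
        if used + cost > token_budget:
            continue  # skip chunks that would blow the budget
        kept.append(text)
        seen.add(text)
        used += cost
    return kept
```

The budget is a hard ceiling: a chunk that does not fit is skipped rather than truncated, so the prompt never contains half a paragraph.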

Tactic 4: Circuit Breakers

External APIs fail. Rate limits get exhausted. If your application crashes when the LLM API returns a 502 Bad Gateway, your architecture is brittle.

Your gateway must implement strict circuit breakers.

If the primary provider times out or degrades, the gateway must instantly and transparently reroute the exact payload to a backup provider (e.g., failing over from OpenAI to Anthropic).

Graceful degradation is a mandatory requirement, not a feature.
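A minimal circuit breaker sketch, assuming both providers are wrapped behind the same hypothetical `Provider` callable signature. The failure threshold and cooldown values are illustrative, not recommendations.

```python
import time
from typing import Callable, Optional

Provider = Callable[[str], str]  # hypothetical: payload in, completion out


class CircuitBreaker:
    """Fail over to a backup provider after repeated primary failures."""

    def __init__(self, primary: Provider, backup: Provider,
                 max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.primary, self.backup = primary, backup
        self.max_failures, self.cooldown_s = max_failures, cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def _is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and let the primary try again.
            self.opened_at, self.failures = None, 0
            return False
        return True

    def call(self, payload: str) -> str:
        if not self._is_open():
            try:
                result = self.primary(payload)
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = self.clock()
        # Primary failed or circuit is open: send the same payload to the backup.
        return self.backup(payload)
```

While the circuit is open the primary is not called at all, which is what protects you from stacking timeouts on top of an already degraded provider.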

Conclusion

Large Language Models are not magic brains. They are commodity compute.

The value of your engineering team is not in writing a clever system prompt. The value is in building the caching, routing, pruning, and abstraction infrastructure that makes running that prompt economically viable.

If you build a thin wrapper, the API provider will eventually consume your margins.

Build the moat, or prepare to join the graveyard.

Do you really need a Vector Database for your AI Product?

Ashok Vishwakarma — Tue, 24 Mar 2026 06:00:50 GMT

Last week, I sat in an architecture review with a client who had just secured Series A funding. They were building a standard Retrieval-Augmented Generation (RAG) pipeline for their internal documents.

Before anyone had written a single line of backend logic, the engineering lead proudly displayed a slide proposing a six-figure enterprise contract with a dedicated Vector Database.

When I asked why we weren’t just using Postgres, the answer was immediate:

“Because this is an AI app. You need a Vector DB for AI.”

This is the exact kind of hype-driven development that destroys startup runways.

At the physical level, Vector Databases are not magic AI boxes. They do not understand “meaning,” “context,” or “semantics.”

They are simply C++ or Rust memory allocators navigating multi-dimensional mathematical graphs.

To do this quickly, they trade exact accuracy for speed, and they force you to pay an absolute fortune in RAM to do it.

Before you sign a massive SaaS contract for a dedicated vector engine, you need to understand the physics of the hardware you are renting.

The Brute Force Math

To understand why traditional databases fail at vector search, you have to look at the math of a “Vector.”

An OpenAI text-embedding-3-small embedding is a 1,536-dimensional array of floating-point numbers. In physical memory, a single 32-bit float consumes 4 bytes.

1,536 dimensions × 4 bytes = 6,144 bytes (6 KB) per vector

If you want to find the closest vectors to a user’s query using exact math, a process called Exact K-Nearest Neighbors (KNN), your CPU must calculate the cosine similarity between the query vector and every single row in the database.

Let’s do the memory bandwidth math on a small 1-million-row table:

1,000,000 rows × 6 KB = 6.1 Gigabytes of data

To answer one user query, the CPU must pull 6.1 GB of data from RAM through the memory bus, load it into the L1 cache, and execute millions of AVX-512 SIMD (Single Instruction, Multiple Data) dot-product operations.

Even on modern DDR5 RAM peaking at 50 GB/s of bandwidth, one Exact KNN scan needs roughly 120 milliseconds of pure memory transfer. A single user issuing one search per second consumes 12% of your entire server’s memory bandwidth.

If you get 10 such searches per second, you are asking for more bandwidth than the memory bus can deliver. Your CPU is completely starved for data. The system grinds to an absolute halt.
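The bandwidth math can be reproduced directly. This is a back-of-the-envelope sketch using the figures from the text; it ignores cache effects and compute time, so real throughput would be lower.

```python
DIMENSIONS = 1_536
BYTES_PER_FLOAT32 = 4
ROWS = 1_000_000
MEMORY_BANDWIDTH_BYTES_PER_S = 50e9  # the 50 GB/s figure from the text

bytes_per_vector = DIMENSIONS * BYTES_PER_FLOAT32
scan_bytes = bytes_per_vector * ROWS
seconds_per_query = scan_bytes / MEMORY_BANDWIDTH_BYTES_PER_S
max_queries_per_second = 1 / seconds_per_query

print(f"bytes per vector: {bytes_per_vector}")
print(f"full scan: {scan_bytes / 1e9:.2f} GB")
print(f"memory transfer per query: {seconds_per_query * 1000:.0f} ms")
print(f"bandwidth-bound ceiling: {max_queries_per_second:.1f} queries/s")
```

The ceiling lands at roughly eight exact scans per second for the whole machine, before a single dot product has been computed.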

Even traditional B-Trees cannot save you.

A B-Tree relies on 1-dimensional inequalities (Is X > 10? Go right). You cannot sort or bisect 1,536 dimensions simultaneously.

HNSW (Hierarchical Navigable Small World)

If Exact KNN locks up the CPU, how do Vector DBs return results in 20 milliseconds?

They don’t do exact searches. They cheat.

Vector databases rely on Approximate Nearest Neighbor (ANN) algorithms.

They accept that finding the perfect match is computationally impossible at scale, so they settle for finding a very good match almost instantly.

The undisputed king of these algorithms is HNSW.

Do not let the academic name intimidate you. HNSW is just a multi-layered skip-list mapped over a proximity graph.

Imagine you are driving from New York to a specific house in a suburb of Los Angeles

The Top Layer (Interstates)

You don’t take local roads across the country. You get on an interstate.

In HNSW, the top layer has very few nodes, but they have long-distance links.

The search algorithm drops in here and makes massive, cross-graph jumps toward the general cluster of the target.

The Middle Layers (City Roads)

Once you are near Los Angeles, you drop down a layer.

There are more nodes here, connected by shorter links.

You navigate to the correct neighborhood.

The Bottom Layer (Local Streets)

You drop to the base layer, which contains every single vector in the database, and you traverse the local streets until you hit the closest possible house (a local minimum).

HNSW is a masterpiece of algorithmic engineering. It reduces a catastrophic O(N) full table scan into a blisteringly fast O(log N) graph traversal.

But it comes with a brutal physical cost called Pointer Chasing.

Why NVMe SSDs Hate HNSW (Pointer Chasing)

Traditional relational databases are famous for being disk-friendly because they exploit Locality of Reference.

B-Tree nodes are packed cleanly into contiguous 8KB blocks (pages) on your SSD. When Postgres needs an index node, it pulls a single 8KB block into memory, and all the sequential keys are right there next to each other.

HNSW graphs are the exact opposite.

An HNSW index is a giant, chaotic web of pointers (memory addresses) pointing to other pointers across multiple layers.

The nodes are scattered randomly across the heap during insertion.

Traversing this graph means jumping wildly from memory address to memory address.

If your HNSW index does not fit in RAM and is forced onto a Solid-State Drive, following those pointers requires thousands of Random Disk Reads per query.

A standard NVMe SSD takes roughly 100 microseconds (µs) to complete a random 4KB read. If an HNSW search requires 200 graph hops to find the nearest neighbor:

200 hops × 100 µs = 20 milliseconds of pure disk latency

That sounds fast, until you realize this is for one query.

If your app is doing 1,000 queries per second, your SSD must sustain 200,000 random IOPS.

Even the most expensive AWS io2 Block Express volumes will buckle under that kind of random I/O queue depth.

Your 20ms latency will instantly spike to 2 seconds as the disk controllers choke.
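The same numbers as a small helper, assuming every hop is a dependent random read that cannot be issued until the previous one returns. The function name is hypothetical.

```python
from typing import Tuple


def disk_bound_hnsw(hops_per_query: int, read_latency_us: float,
                    queries_per_second: int) -> Tuple[float, int]:
    """Return (per-query disk latency in ms, sustained random IOPS required)."""
    latency_ms = hops_per_query * read_latency_us / 1_000
    required_iops = hops_per_query * queries_per_second
    return latency_ms, required_iops


latency_ms, required_iops = disk_bound_hnsw(200, 100, 1_000)
print(f"disk latency per query: {latency_ms:.0f} ms")
print(f"required random IOPS: {required_iops:,}")
```

Plug in your own hop count and target throughput before trusting any vendor claim about disk-backed indexes.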

Furthermore, pointer chasing completely defeats the OS Page Cache and the CPU’s hardware prefetchers.

The CPU cannot guess which memory address the graph will jump to next, resulting in continuous L3 cache misses.

The RAM Tax

Here is the physical reality of vector search:

To get the sub-50ms latency that Vector DBs advertise, the entire HNSW index must physically live in RAM.

This brings us to the invoice (cost of the RAM).

RAM is orders of magnitude more expensive than NVMe storage. Let’s do the math on a production-scale deployment of 100 million vectors.

Vector Payload - 100,000,000 × 6 KB (1536-dim floats) = 600 GB
HNSW Graph Overhead - Each node in HNSW maintains bidirectional pointers to its neighbors across multiple layers. This pointer overhead usually adds 30% to 50% to the base vector size. Let’s add 200 GB.

To serve 100 million vectors, you need 800 GB of RAM.

Storing 800 GB on a standard Postgres SSD costs about $65 a month. Holding 800 GB in an AWS r6a.24xlarge memory-optimized instance costs $5,300 a month.
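A quick sizing sketch for the RAM figure. The 33% graph overhead is an assumed value inside the 30% to 50% range quoted above, and the exact payload comes out at about 614 GB, which the article rounds to 600 GB.

```python
from typing import Tuple


def hnsw_memory_gb(vectors: int, dimensions: int = 1_536, bytes_per_dim: int = 4,
                   graph_overhead: float = 0.33) -> Tuple[float, float]:
    """Return (raw vector payload in GB, payload plus graph overhead in GB)."""
    payload_bytes = vectors * dimensions * bytes_per_dim
    total_bytes = payload_bytes * (1 + graph_overhead)
    return payload_bytes / 1e9, total_bytes / 1e9


payload_gb, total_gb = hnsw_memory_gb(100_000_000)
print(f"vector payload: {payload_gb:.1f} GB")
print(f"with graph overhead: {total_gb:.1f} GB")
```

Halving the dimensions or storing 16-bit floats halves the payload, which is often a cheaper first move than a bigger instance.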

When you buy a dedicated Vector Database, you are not buying better “AI.”

You are buying an expensive fleet of high-RAM cloud instances to hold a massive, chaotic pointer graph in volatile memory because the algorithm physically cannot survive on a disk.

What should you do?

Do not subsidize a SaaS company’s valuation because you think standard infrastructure can’t handle vector math.

Here is the architectural framework you should use before signing a vendor contract:

Use Postgres (`pgvector`)

If you have fewer than 5 million vectors, you do not have a scale problem. You have a standard CRUD problem.

Install the pgvector extension on your existing Postgres instance.

Postgres is perfectly capable of building an HNSW index and holding it in its shared_buffers (RAM).

You save thousands of dollars, you eliminate a fragile network hop in your infrastructure, and you keep your relational data and your embeddings in the exact same ACID-compliant transaction block.

You can JOIN your semantic search results directly against your user permission tables in a single query.

Use a Dedicated Vector DB (Pinecone, Milvus, Qdrant)

You only graduate to a dedicated vector engine when your index physically outgrows the RAM limits of a single, massive database instance.

When you hit 50 million, 100 million, or a billion vectors, a single Postgres node will OOM (Out of Memory) trying to fit the HNSW graph into shared_buffers.

That is the exact moment you pay a premium for a distributed vector database.

You are paying them to shard the massive HNSW graph across multiple clustered nodes and handle the distributed scatter-gather networking required to query it.

Conclusion

Architecture requires alignment between the physical realities of the hardware and the economic realities of the business.

HNSW is a brilliant algorithm, but it is an unapologetic memory glutton.

It defeats disk I/O, thrashes CPU caches, and demands expensive DDR5 RAM to function at scale.

Start with Postgres, monitor your memory saturation, and scale out to a distributed vector engine only when the physics of the graph demand it.

Don’t buy a distributed system until you have a distributed problem.

Thought - Why Postgres is a dangerous default

Ashok Vishwakarma — Thu, 19 Mar 2026 06:00:52 GMT

Nobody gets fired for choosing Postgres.

It is the safest, most defensible technical decision you can make. Until it isn’t. Especially when you are building a highly concurrent, event-driven logger.

You pick it because it handles everything perfectly on day one, completely ignoring that you are buying a general-purpose engine for what might be a highly specialized problem.

We do this constantly in software. We slap a UUIDv4 on a primary key because it is convenient, blindly accepting the B-Tree fragmentation it guarantees at scale.

We spin up Postgres because it is the industry standard, ignoring what its storage engine actually does to our specific data shape.

Convenience dictates the architecture. Physics dictates the bill.

The cost of MVCC Write

Look at how Postgres manages concurrency. Multi-Version Concurrency Control (MVCC) is a brilliant mechanism. It allows readers and writers to operate simultaneously by creating new tuple versions instead of modifying data in place.

Let’s be absolutely clear.

If you are building a standard B2B SaaS application, with users, organizations, billing tables, and dynamic records, Postgres is flawless. If you are building a banking ledger where balances update constantly, MVCC is your best friend.

But if you are building an event log, a sensor stream, or an audit trail, MVCC is a parasite.

Consider a standard telemetry or event logging table. The schema looks entirely innocent:

CREATE TABLE user_events (
    event_id UUID PRIMARY KEY,
    user_id BIGINT,
    event_type VARCHAR(50),
    payload JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Executed 50,000 times a second
INSERT INTO user_events (event_id, user_id, event_type, payload)
VALUES ('...', 123, 'click', '{"button": "signup"}');

To the developer, this is a simple, lightweight append operation. To the Postgres storage engine, this is a massive overhead event.

Your rows are immutable the second they hit the disk. Yet, every single insert carries a hidden 23-byte tuple header (xmin, xmax, cmin, cmax) tracking version visibility for concurrency that will never happen.

At 50,000 inserts a second, you are paying a massive, constant write tax on data that will never change. You are generating gigabytes of Write-Ahead Log (WAL) just to track ghosts.
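To put a number on the header tax alone, here is a rough sketch using the 23-byte figure and insert rate from the text. It counts only the tuple header; alignment padding, item pointers, WAL records, and index entries would all add to it.

```python
INSERTS_PER_SECOND = 50_000   # from the text
TUPLE_HEADER_BYTES = 23       # from the text
SECONDS_PER_DAY = 86_400

header_bytes_per_second = INSERTS_PER_SECOND * TUPLE_HEADER_BYTES
header_gb_per_day = header_bytes_per_second * SECONDS_PER_DAY / 1e9

print(f"header overhead: {header_bytes_per_second / 1e6:.2f} MB/s")
print(f"header overhead: {header_gb_per_day:.2f} GB/day")
```

That is storage spent purely on visibility bookkeeping for rows that will never have a second version.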

8KB at a Time

When you do actually update or delete records in a highly active table, the architecture fights you in a completely different way.

-- You think you are modifying data in place. You aren't.
UPDATE user_sessions 
SET last_seen = NOW() 
WHERE session_id = 'abc-123';

Postgres does not delete old rows. It leaves the dead tuple sitting exactly where it was on the physical 8KB page. It writes the new, updated row to a new location. A background worker called autovacuum is supposed to sweep through eventually and mark that old space as reusable.

Under a heavy write load, autovacuum loses the race. Dead tuples accumulate faster than the database can process them.

Because live and dead records share the exact same physical pages, the query planner cannot skip the graveyard. If your table has 10 million live rows and 40 million dead ones, your hardware is pulling massive amounts of garbage from the disk into your buffer pool just to find the active data. The engine literally chokes on its own history.
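The 10 million versus 40 million example translates into a simple amplification factor. This sketch assumes dead and live tuples are roughly the same size and evenly interleaved across pages, which is a simplification.

```python
from typing import Tuple


def scan_amplification(live_rows: int, dead_rows: int) -> Tuple[float, float]:
    """Return (fraction of tuples that are dead, tuples read per live tuple)."""
    total = live_rows + dead_rows
    return dead_rows / total, total / live_rows


dead_fraction, amplification = scan_amplification(10_000_000, 40_000_000)
print(f"dead fraction: {dead_fraction:.0%}")
print(f"tuples read per live tuple: {amplification:.1f}")
```

Under those assumptions a sequential scan reads five tuples for every one it can actually return.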

As Andy Pavlo at CMU has pointed out, if someone were building a new MVCC database today, they would not implement it the way Postgres does. Its append-only storage design is a relic of the 1980s that predates log-structured storage patterns entirely.

The Snapshot Collision

One day someone asks for a dashboard. You run a long analytical query against this live operational data.

-- The Data Analyst runs this at 9:00 AM
SELECT date_trunc('hour', created_at), count(*)
FROM user_events
GROUP BY 1
ORDER BY 1;

Running a daily sales report on a 5-gigabyte SaaS database is fine. Running a 2-hour aggregation on a table absorbing 50,000 sensor readings a second is a death sentence.

To guarantee consistent reads, Postgres holds the transaction snapshot open. While that two-hour report runs, autovacuum is paralyzed.

It cannot clean up any dead tuples that fall behind that active transaction horizon. Your dashboard is actively blocking garbage collection for the entire operational database.

Every UPDATE happening on your site during those two hours leaves a permanent scar on the disk until the report finishes.

Friction generates heat. Heat generates wear. Wear destroys your query latency. Postgres was simply not designed to serve heavy OLTP and OLAP from the same instance under load.

The Optimization Treadmill

This is when you step onto the Optimization Treadmill. You do not fix the architecture. You treat the symptoms.

Step 1 - The Index Trap

Write throughput dips, so you add a composite index to speed up the read queries causing the locks. You just amplified your write penalty. Every insert now has to update multiple B-Trees.

Step 2 - The Vacuum Illusion

The table bloats, so you aggressively tune the background workers.

ALTER TABLE user_sessions 
SET (autovacuum_vacuum_scale_factor = 0.01);

You force the database to vacuum constantly. You are now burning precious CPU cycles fighting your own insert rate.

Step 3 - Hardware Scaling

CPU spikes, so you upgrade the instance class. You double the RAM. You provision higher IOPS.

Every single one of these interventions is the technically correct response to a real symptom. You are doing exactly what an experienced DBA would do. But none of it changes your trajectory. You are fighting an architectural ceiling, not a misconfiguration.

The treadmill only speeds up. Each fix buys you a few months of runway before the database collects its tax again.

How to Spot the Treadmill Early

The most critical decision you can make is recognizing the treadmill before you get on it. Stop treating Postgres as a zero-cost default. Look at the physical reality of your data before you write the schema.

Are your writes predominantly inserts with no updates?
Are the rows permanently immutable once written?
Is your data volume growing by 50 percent year over year?
Do you need to run heavy analytics against the same live table?

If you nod your head to any of those, the architectural mismatch is already baked in. The only variable is how much engineering time and AWS budget you burn before you stop asking how to tune the database, and start asking the question you should have asked on day one.

Are we actually building a general-purpose OLTP system?

The HTAP Escape

If you recognize the treadmill early, you can escape it. If you actually need to serve heavy OLTP and OLAP concurrently from the same system without melting your disks, you need an architecture designed for Hybrid Transactional/Analytical Processing (HTAP).

When Postgres is no longer the default, you have three primary paths:

1. SingleStore

SingleStore (formerly MemSQL) solves the OLTP/OLAP collision by physically splitting the storage engine under the hood. It uses an in-memory rowstore for blazing-fast transactional inserts and a disk-based columnstore for your analytical queries.

It is true HTAP. You get millisecond ingest and sub-second aggregations on the exact same cluster. It speaks the MySQL wire protocol, making integration trivial.

But, it is proprietary and expensive. The in-memory rowstore demands that your operational working set fits entirely in RAM. You are trading software friction for a massive hardware bill.

2. TiDB

TiDB separates compute from storage entirely. It uses a Raft-based distributed key-value store (TiKV) for your transactional inserts, and asynchronously replicates that data to a columnar engine (TiFlash) for analytics. The smart query planner automatically routes your dashboard queries to the column nodes and your point-writes to the row nodes.

It scales horizontally forever. You can run massive analytical reports against TiFlash without ever locking an operational row in TiKV.

But, it has severe operational complexity. You do not just “spin up” TiDB. Running the control plane, compute nodes, row storage nodes, and column storage nodes requires a dedicated infrastructure team. It is a Ferrari; you have to hire the mechanics.

3. Kafka + ClickHouse (Split Stack)

If you cannot justify HTAP licensing or operational overhead, stop trying to force both workloads into one database. Write your immutable events to an append-only log (Kafka). Consume that log into a pure columnar database (ClickHouse) for your dashboards.

The physics align perfectly. Kafka’s sequential writes effortlessly absorb the OLTP load. ClickHouse’s columnar compression destroys the OLAP load. Both systems run at absolute maximum efficiency.

But, you now maintain two distinct distributed systems and the pipeline connecting them. Eventual consistency is guaranteed; immediate consistency is impossible.

Conclusion

There is a recurring theme in engineering failures: we mistake popularity for universal fit.

Postgres is arguably the greatest relational database ever built. But it is a machine with specific operating parameters.

When you force an append-only, high-velocity stream into an MVCC engine, you are not testing the limits of Postgres. You are testing the limits of your infrastructure budget.

Stop letting convenience dictate your architecture. The cost of learning a specialized system like ClickHouse or Kafka today is far lower than the cost of fighting the Optimization Treadmill tomorrow.

Physics always wins. Choose your database accordingly.

Research - A Formal Threshold Model for B-Tree Performance Degradation Under Random Primary Keys in OLTP Systems

Ashok Vishwakarma — Tue, 17 Mar 2026 06:01:02 GMT


The adoption of UUIDv4 as a primary key in OLTP databases has become a widespread default, driven by ORM convenience and distributed system requirements.

This paper demonstrates that UUIDv4 primary keys guarantee severe, predictable performance degradation as table size grows. We introduce the Buffer Saturation Ratio (BSR) and derive a closed-form threshold (N^*) that predicts exactly when write latency will spike.

We prove that this degradation is a physical limitation of B-Tree mechanics under random insertion, compounded by Write Amplification Factor (WAF) penalties, rather than a hardware or traffic-scaling failure.

The Anatomy of a Deferred Crisis

Modern frameworks make it trivial to assign a random, 128-bit string as a primary key. Distributed ID generation solves immediate engineering headaches. It prevents sequence contention. It hides entity counts from users.

But it introduces a delayed tax on your storage layer.

Engineers rarely observe the cost of UUIDv4 during the first year of a project. Databases aggressively cache active indexes in RAM. As long as the primary key index fits entirely within the database buffer pool, random inserts perform acceptably.

The crisis arrives silently. The index quietly outgrows the allocated memory. The database begins reading cold pages from the physical disk to complete routine inserts. Write latency jumps by two orders of magnitude. The typical engineering response, upgrading the instance class, fails to resolve the underlying physics.

B-Tree Mechanics and the Fill Factor Penalty

Relational databases use B-Tree variants to store indexes. Sequential keys, such as BigInt sequences or UUIDv7, append new records to the rightmost edge of the tree. The database keeps that single, active leaf node hot in memory.

UUIDv4 values possess zero temporal locality. Every insert targets a random location in the B-Tree.

When you insert a record into a random leaf node that is already full, the database executes a page split. It allocates a new page (16KB in MySQL InnoDB, 8KB in PostgreSQL), transfers half the records, and updates the parent nodes.

Sequential inserts achieve a high index fill factor (f ≈ 0.90 to 0.94). Random UUIDv4 inserts guarantee continuous page splits, driving the equilibrium fill factor down to approximately 0.50. You are permanently storing half-empty pages. Your index requires twice the RAM just to exist.

The Buffer Saturation Ratio (BSR)

To predict performance failure, we must track the relationship between index size and available memory. We define the Buffer Saturation Ratio (BSR):

BSR = B_pool / I_size

Where B_pool is the configured memory for caching (InnoDB Buffer Pool or PostgreSQL shared_buffers), and I_size is the total size of the primary key index.

When BSR ≥ 1.0, the entire index resides in memory. Writes are fast.
When BSR < 1.0, the index exceeds available RAM.

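At BSR < 1.0, every random insert carries a cache miss probability (P_miss). Because UUIDv4 inserts land uniformly across the index, this is approximately the fraction of the index that does not fit in memory:

P_miss ≈ 1 − BSR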

When a cache miss occurs, the database evicts a page to load the target leaf node from disk. Because UUIDv4 targets pages uniformly, the newly loaded page is highly unlikely to be reused before it is evicted again. This creates LRU eviction thrashing.

Your expected insert latency (L_insert) becomes completely dominated by disk I/O
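One way to write this expectation, assuming L_mem and L_disk (symbols introduced here for illustration, not taken from the paper) denote the latency of an insert served from the buffer pool and of one that needs a disk read:

```latex
L_{insert} = (1 - P_{miss}) \cdot L_{mem} + P_{miss} \cdot L_{disk}
```

Because L_disk is orders of magnitude larger than L_mem, even a modest P_miss makes the second term dominate the sum.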

Deriving the Threshold Row Count (N^*)

We can calculate the exact row count where a database will fall off the performance cliff. First, we model the projected index size:

I_size = N × (K_size + P_overhead) / f

Where

N = Number of rows
K_size = Key size in bytes (16 for binary UUID, 36 for char UUID, 8 for BigInt)
P_overhead = Pointer overhead (roughly 8 bytes)
f = Fill factor (0.50 for UUIDv4, 0.94 for BigInt)

To find the threshold row count (N^*), we set I_size equal to B_pool (BSR = 1.0) and solve for N:

N^* = (B_pool × f) / (K_size + P_overhead)

This equation allows architects to project the exact lifespan of a database schema configuration.
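A small calculator for the threshold, assuming the index size is modelled as N × (K_size + P_overhead) / f using the parameters listed above. The 24 GB buffer pool is an assumption (roughly 75 percent of a 32 GB server), and the function names are hypothetical.

```python
def index_size_bytes(rows: float, key_size: int, pointer_overhead: int = 8,
                     fill_factor: float = 0.50) -> float:
    """Projected primary key index size for a given row count."""
    return rows * (key_size + pointer_overhead) / fill_factor


def buffer_saturation_ratio(buffer_pool_bytes: float, rows: float, key_size: int,
                            pointer_overhead: int = 8,
                            fill_factor: float = 0.50) -> float:
    """BSR: configured cache memory divided by index size."""
    size = index_size_bytes(rows, key_size, pointer_overhead, fill_factor)
    return buffer_pool_bytes / size


def threshold_rows(buffer_pool_bytes: float, key_size: int,
                   pointer_overhead: int = 8, fill_factor: float = 0.50) -> float:
    """Row count N* at which the index exactly fills the buffer pool (BSR = 1.0)."""
    return buffer_pool_bytes * fill_factor / (key_size + pointer_overhead)


BUFFER_POOL_BYTES = 24e9  # assumption: about 75% of a 32 GB server

char_uuid_limit = threshold_rows(BUFFER_POOL_BYTES, key_size=36)
binary_uuid_limit = threshold_rows(BUFFER_POOL_BYTES, key_size=16)

print(f"char(36) UUIDv4:   {char_uuid_limit / 1e6:.0f} million rows")
print(f"binary(16) UUIDv4: {binary_uuid_limit / 1e6:.0f} million rows")
```

Under these assumptions the char(36) case lands at about 273 million rows, and simply storing the UUID as 16 binary bytes nearly doubles the runway.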

Write Amplification Factor (WAF)

Random inserts punish the storage layer beyond just cache misses. They generate massive Write Amplification Factor (WAF) in the Write-Ahead Log (WAL).

The write amplification for a UUIDv4 index follows a logarithmic curve

Where m is the B-Tree order (roughly 80 for InnoDB at 16KB pages with 200-byte rows), N is total rows, and M is the number of rows that fit in the buffer pool.

Your total write amplification multiplies by your secondary indexes (k)

MySQL InnoDB uses clustered indexes. The massive, random primary key is duplicated into the leaf nodes of every secondary index. PostgreSQL uses heap tables, bypassing clustered index bloat, but secondary indexes still suffer the brutal page split penalty and identical WAF multipliers.

Case study

One of my clients ran a package scan event log using a char(36) UUIDv4 primary key on a 16 GB RDS instance equipped with a 12 GB buffer pool.

During the first year, at 50 million rows, performance remained nominal. Their BSR sat at 8.0. The buffer pool easily absorbed the random writes.

The table eventually reached 500 million rows. The char(36) key, combined with the 0.50 fill factor, swelled the index size to roughly 15 GB.

Their BSR dropped to 0.80. Write latency instantly climbed from 2 ms to 200 ms. CPU utilization on the RDS instance pinned at 90 percent, entirely consumed by managing LRU evictions.

They upgraded the RDS instance class twice. Neither upgrade resolved the latency. Each vertical scaling event raised CPU, RAM, and the buffer pool proportionally, but the BSR remained strictly below 1.0. The index remained larger than the pool.

Hardware scaling cannot outrun mathematical thresholds. The only viable solution was migrating the primary key to a sequential format.

The N^* Decision Matrix

Applying the N^* formula to standard infrastructure configurations reveals the severe capacity penalty of UUIDv4 strings

A 32 GB server running a char(36) UUIDv4 index chokes at 273 million rows. The exact same hardware running a BigInt handles 2.25 billion rows. You are paying a 400 percent RAM tax for a framework default.

Conclusion

UUIDv4 primary keys constitute a systemic architectural flaw in high-throughput OLTP databases. They actively defeat database caching mechanisms and maximize physical disk I/O.

Engineers must stop treating database performance degradation as a generic symptom of traffic scaling. Use the N^* threshold formula to project your schema’s structural limits before executing a CREATE TABLE statement.

If distributed ID generation is a hard requirement, adopt temporally sequential identifiers like UUIDv7 or ULID. Align your data structures with the physical constraints of B-Tree mechanics.

Read the paper here

https://zenodo.org/records/19034540

Thought - Why Banks Can't "Just Rewrite" COBOL in Java/Python/Go?

Ashok Vishwakarma — Thu, 12 Mar 2026 06:01:49 GMT

It happens every five years. A new CTO arrives at a Global 500 bank.

They look at the “Green Screen” terminals, the 40-year-old IBM z/OS mainframes, and the millions of lines of COBOL.

They cringe.

“This is legacy,” they declare.

“This is technical debt. We are going to rewrite the Core Ledger in Java microservices. We will be cloud-native.”

Three years and $50 million later, the project is quietly cancelled.

The CTO moves on to a different company.

The COBOL remains.

Why?

It’s not because the bank is lazy. It’s not because we lost the source code.

It is because the fundamental architecture of modern programming languages is hostile to the requirements of global finance.

You are not fighting code. You are fighting how computers do math.

The Math Problem (IEEE 754)

When you write float or double in Java, Python, C++, or Go, you are using IEEE 754 Binary Floating Point arithmetic.

This standard is designed for scientific calculation, measuring the distance to a star or the velocity of a particle.

It prioritizes range and speed over absolute precision.

The Rounding Error

Open your browser console or a Node.js shell and type:

0.1 + 0.2
// The result: 0.30000000000000004

In a physics simulation, that 0.00000000000000004 is noise.

In a banking ledger processing trillions of dollars a day, that error is a regulatory violation, a failed audit, and potentially a lawsuit.

Fixed-Point Decimal (COBOL Way)

COBOL was not designed for science.

It was designed for Business (Common Business Oriented Language).

It does not use Binary Floating Point for money. It uses Fixed-Point arithmetic stored as Binary-Coded Decimal (BCD).

In COBOL, you define a variable like this

01 ACCOUNT-BALANCE PIC S9(13)V99 COMP-3.

01: Level Number. In COBOL, 01 represents a top-level record or variable. Think of it like const or let at the root scope.
ACCOUNT-BALANCE: The Variable Name. (COBOL uses kebab-case because it was invented before snake_case or camelCase won the war).
PIC: Picture Clause. This tells the compiler exactly what the data looks like.
S: Signed. The value can be positive or negative. With COMP-3, the sign is stored in the last half-byte (nibble) of the field rather than in a single bit.
9(13): Numeric Integers. The number 9 represents a digit. (13) means “allocate space for 13 of them.”
V: Virtual Decimal. This is the magic. It tells the CPU “assume a decimal point here,” but it does not store a dot character in memory. It saves space.
99: Precision. Allocate exactly 2 digits for cents.
COMP-3: Packed Decimal (BCD). This is the instruction to store 2 digits per byte (using 4 bits each), rather than the standard 1 byte per character. This is what enables the hardware math precision.
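
To make the storage layout concrete, here is a minimal Python sketch of how a PIC S9(13)V99 COMP-3 value could be packed: 15 digit nibbles plus one sign nibble, giving 8 bytes. The function names pack_comp3 and unpack_comp3 are hypothetical helpers for illustration, and the sign codes (0xC positive, 0xD negative) follow the common packed-decimal convention. This is not how a COBOL compiler is implemented, only a picture of the bytes.

```python
def pack_comp3(value_in_cents: int, digits: int = 15) -> bytes:
    """Pack a signed integer into COMP-3 style packed decimal.

    Each decimal digit takes one 4-bit nibble. The final nibble holds the
    sign: 0xC for positive, 0xD for negative (common convention).
    """
    sign = 0xD if value_in_cents < 0 else 0xC
    text = str(abs(value_in_cents)).rjust(digits, "0")
    if len(text) > digits:
        raise OverflowError("value does not fit in the declared picture")
    nibbles = [int(ch) for ch in text] + [sign]
    if len(nibbles) % 2:  # pad to a whole number of bytes
        nibbles.insert(0, 0)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


def unpack_comp3(raw: bytes) -> int:
    """Reverse of pack_comp3: read the digit nibbles, then apply the sign nibble."""
    nibbles = []
    for byte in raw:
        nibbles.extend((byte >> 4, byte & 0x0F))
    sign = nibbles.pop()
    magnitude = int("".join(str(n) for n in nibbles))
    return -magnitude if sign == 0xD else magnitude


# 12,345.67 stored as 1234567 "cents"; the decimal point (V) is implied, never stored.
packed = pack_comp3(1234567)
print(packed.hex())  # 000000001234567c
```

Notice that the hex dump reads like the number itself. That is the point of BCD: the digits in memory are the digits on the statement.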

When COBOL calculates 0.1 + 0.2, it does not convert them to binary approximations.

It calculates them in base-10, digit by digit, often using dedicated decimal instructions on the mainframe (native packed-decimal arithmetic and, on newer machines, Decimal Floating Point units) that x86 architectures have historically lacked or emulated poorly.

The result is exactly 0.3. Always.
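
You can see the same base-10 behavior without a mainframe. The sketch below uses Python’s standard decimal module as a stand-in. It is software arithmetic, not the hardware path COBOL uses, so it only illustrates the result, not the performance.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 are approximations, so the sum drifts.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# Base-10 arithmetic: the operands are exact, so the sum is exact.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True

# The drift compounds when you accumulate, which is what a ledger does all day.
total_float = 0.0
total_decimal = Decimal("0.00")
for _ in range(10):
    total_float += 0.10
    total_decimal += Decimal("0.10")

print(total_float == 1.0)               # False
print(total_decimal == Decimal("1.00"))  # True
```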

The Cost of Emulation

You must be wondering: can’t we do this in Java?

Yes, using java.math.BigDecimal.

But BigDecimal is a software object. It adds memory overhead and CPU cycles for every single calculation.

COBOL operates on money at the instruction-set level. When you are processing 50,000 transactions per second (TPS), that overhead isn’t just a performance hit.

It’s an infrastructure bill.

True Transactionality (The CICS)

The second reason rewrites fail is the misunderstanding of “Transactionality.”

Modern web development is obsessed with “Statelessness.”

You send a REST request, the server forgets you, and you send another.

State is hard.

The Mainframe world runs on CICS (Customer Information Control System). CICS is an application server that manages transactions with ACID properties (Atomicity, Consistency, Isolation, Durability) as a law of physics.

Let’s take the example of an ATM withdrawal

Check Balance.
Debit Ledger.
Dispense Cash.
Log Audit Trail.

In CICS, if step 3 fails (the cash jams), the entire transaction rolls back. The debit from step 2 is undone as part of the same unit of work, so the ledger ends up exactly as if it had never been touched.

The state remains consistent.
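
The all-or-nothing behavior is easy to demonstrate on a laptop. This is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for a single transactional system. It is not CICS, and the table names, the CashJam exception, and the dispense_cash helper are all hypothetical.

```python
import sqlite3


class CashJam(Exception):
    """Hypothetical failure raised when the dispenser cannot release cash."""


def dispense_cash(amount_cents: int, jam: bool) -> None:
    # Stand-in for the physical dispenser (step 3).
    if jam:
        raise CashJam("dispenser jammed")


def open_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance_cents INTEGER NOT NULL)")
    conn.execute("CREATE TABLE audit_log (account_id INTEGER, amount_cents INTEGER)")
    conn.execute("INSERT INTO accounts VALUES (1, 50000)")
    conn.commit()
    return conn


def withdraw(conn: sqlite3.Connection, account_id: int, amount_cents: int, jam: bool = False) -> bool:
    try:
        with conn:  # commits if the block succeeds, rolls back on any exception
            (balance,) = conn.execute(
                "SELECT balance_cents FROM accounts WHERE id = ?", (account_id,)
            ).fetchone()                                   # 1. check balance
            if balance < amount_cents:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance_cents = balance_cents - ? WHERE id = ?",
                (amount_cents, account_id),
            )                                              # 2. debit ledger
            dispense_cash(amount_cents, jam)               # 3. dispense cash
            conn.execute(
                "INSERT INTO audit_log (account_id, amount_cents) VALUES (?, ?)",
                (account_id, amount_cents),
            )                                              # 4. log audit trail
        return True
    except (CashJam, ValueError):
        return False
```

Call withdraw(conn, 1, 10_000, jam=True) and the balance is still 50000 afterwards, with nothing in the audit log. There is no compensation code anywhere, because the debit never became visible.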

The Microservices Nightmare

In a distributed microservices architecture, you break these steps into different services.

Service A (Ledger) debits the account.
Service B (ATM) fails to dispense.
Now Service A must “compensate” (undo) the transaction.

You have moved from immediate consistency to Eventual Consistency.

You are now managing “Distributed Transactions” and “Sagas.”
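
Here is what that compensation logic looks like in its simplest form. This is a toy, in-process sketch: LedgerService, AtmService, and withdraw_saga are hypothetical names, and a real saga would also have to survive crashes, retries, and network partitions between the two steps.

```python
class LedgerService:
    """Hypothetical Service A: owns the account balance."""

    def __init__(self, balance_cents: int) -> None:
        self.balance_cents = balance_cents

    def debit(self, amount_cents: int) -> None:
        self.balance_cents -= amount_cents

    def credit(self, amount_cents: int) -> None:
        self.balance_cents += amount_cents


class AtmService:
    """Hypothetical Service B: owns the cash dispenser."""

    def __init__(self, jammed: bool) -> None:
        self.jammed = jammed

    def dispense(self, amount_cents: int) -> None:
        if self.jammed:
            raise RuntimeError("cash jam")


def withdraw_saga(ledger: LedgerService, atm: AtmService, amount_cents: int) -> list[str]:
    history = []
    ledger.debit(amount_cents)
    history.append("debited")  # from here on, other readers can see the lower balance
    try:
        atm.dispense(amount_cents)
        history.append("dispensed")
    except RuntimeError:
        ledger.credit(amount_cents)  # compensating action: undo by doing the opposite
        history.append("compensated")
    return history
```

The final balance is correct, but between "debited" and "compensated" there is a window where the ledger showed money missing that was never dispensed. That window is what "eventual" means.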

So the question to you is: do you want your checking account balance to be “Eventually Consistent”?

Or do you want it to be Correct?

Scale Up vs. Scale Out

The philosophy of the modern cloud is Scale Out.

“If the server is overloaded, spin up 100 more cheap nodes. If a node fails, kill it and retry.”

The philosophy of the Mainframe is Scale Up.

“This single machine will process everything. It will not fail. If a CPU dies, a backup CPU takes over without the operating system even noticing.”

Mainframes utilize specialized I/O Processors (Channel Subsystems).

The main CPU doesn’t waste time reading from a disk or talking to the network card. It delegates that to a sub-processor and keeps crunching numbers. This allows mainframes to run at 100% CPU utilization for years without throttling.

Try running a typical Linux server at a sustained 100% CPU for an hour, and it will often become sluggish or unresponsive.

Finally, the verdict is

For Twitter, Scale Out. If a tweet fails to load, nobody loses money.
For the Global Economy, Scale Up. Reliability is not optional.

Why does this matter to you?

Even if you never touch a line of COBOL, the principles of the Mainframe apply to your modern stack.

Never Use Floats for Money

If you are building a FinTech app in JavaScript or Python, do not use standard number types for currency.

const price = 19.99; // Is Wrong

Use libraries like decimal.js, Python’s decimal module, or store values as Integers (cents) and format them only for display

(const priceInCents = 1999;)
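
The integer-cents approach looks like this in Python. It is a minimal sketch: to_cents and format_cents are hypothetical helper names, and the formatting assumes a two-decimal currency shown with a dollar sign.

```python
from decimal import Decimal


def to_cents(amount: str) -> int:
    """Parse a decimal string such as '19.99' into integer cents."""
    scaled = Decimal(amount) * 100
    if scaled != scaled.to_integral_value():
        # Refuse to round silently: sub-cent input is a data problem, not a display problem.
        raise ValueError(f"{amount!r} has sub-cent precision")
    return int(scaled)


def format_cents(cents: int) -> str:
    """Format integer cents for display only, e.g. 1999 -> '$19.99'."""
    sign = "-" if cents < 0 else ""
    dollars, remainder = divmod(abs(cents), 100)
    return f"{sign}${dollars:,}.{remainder:02d}"


price_in_cents = to_cents("19.99")
total = price_in_cents * 3           # plain integer math, no rounding drift
print(format_cents(total))           # $59.97
```

All arithmetic happens on integers. The decimal point only appears at the edges, when you parse input and when you render output.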

Respect the Monolith

We have been trained to think “Microservices = Good” and “Monolith = Bad.”

But if your application requires strict ACID compliance (inventory management, voting systems, banking), a well-structured Monolith with a single database transaction is infinitely less buggy than a distributed system trying to coordinate state across ten services.

“Legacy” is a Success Metric

Code becomes “legacy” only if it survives.

If you are looking at a system that has been running for 30 years, do not mock it.

Learn from it. It has survived market crashes, leap years, and user stupidity for decades.

It is doing something right.

Conclusion

Rewriting a Core Banking System is rarely a “refactoring” effort. It is an archaeological excavation.

That “ugly” COBOL code contains 40 years of edge cases: the tax law change of 1992, the negative interest rate handling of 2015, the specific rounding rule required by the Bank of Japan.

If you rewrite it, you will miss these rules.

You will introduce bugs that were solved in 1985.

Don’t replace the Mainframe. Modernize it.

Wrap the COBOL logic in REST APIs (using z/OS Connect).

Let the Java frontend look pretty, but let the Iron down in the basement handle the math.

Respect the Old Gods.

They are still holding up the sky.

Thought - Why Abstract Syntax Tree (AST) makes sense for AI Code Migration

Ashok Vishwakarma — Tue, 10 Mar 2026 06:01:04 GMT

Right now, in boardrooms and daily stand-ups across the world, executives are signing off on millions for “AI Modernization,” while engineers are blindly feeding legacy scripts into ChatGPT to refactor core business logic.

The pitch is intoxicatingly simple: take a 10,000-line C++ monolith, paste it into an LLM window, and ask it to output pristine, idiomatic Python, Go, or Rust microservices.

This is not engineering. This is reckless gambling.

The fundamental flaw in this strategy, whether you are a VP buying an enterprise AI tool or a Senior Engineer writing a migration script, is a misunderstanding of what an LLM actually is.

Large Language Models are probabilistic text generators. They are not compilers. They do not possess an underlying, deterministic model of the code they are ingesting. They do not mathematically understand state mutation, variable shadowing, or lexical scope.

They simply predict the next most likely token based on latent space vectors.

An LLM might translate a complex legacy while loop correctly 99 times. But on the 100th time, distracted by a weirdly named variable or an obscure GOTO statement, it will silently hallucinate. It will drop a crucial state mutation. It will misinterpret the lexical scope of a nested variable.

When you are migrating a critical system, such as a core banking ledger or a flight control system, a 99% success rate is another term for a catastrophic production outage.

Let’s look at an example of how a pure LLM migration breaks.

Smart LLMs rarely make basic syntax errors anymore. Instead, they make semantic errors: they translate the text perfectly while fundamentally changing how the program manages memory, as the example below shows.

Imagine a pricing calculation in legacy C++ using a struct. By default, C++ passes structs by value (meaning it creates a safe copy).

struct Trade { 
    double price; 
};

// Passed by VALUE. Modifies a local copy for a "what-if" simulation.
void simulateDiscount(Trade t) { 
    t.price *= 0.9; 
    logSimulation("Discounted price would be: ", t.price);
}

// ... elsewhere in the system ...
Trade myTrade = { 100.0 };
simulateDiscount(myTrade);

// myTrade.price is STILL 100.0 here. The original is safely untouched.
processPayment(myTrade); // Processes the actual, full price.

An engineer pastes this into an LLM and asks for an “idiomatic Python” translation. The LLM confidently spits out beautifully formatted Python

class Trade:
    def __init__(self, price: float):
        self.price = price

# Passed by REFERENCE. Modifies the original object!
def simulate_discount(t: Trade):
    t.price *= 0.9
    log_simulation("Discounted price would be: ", t.price)

# ... elsewhere in the system ...
my_trade = Trade(100.0)
simulate_discount(my_trade)

# my_trade.price is now 90.0! 
process_payment(my_trade) 
# You just undercharged the client by 10% because of a simulation!

The LLM got a 100% on the syntax. The code is highly readable.

But it completely corrupted your ledger.

The LLM didn’t know why this was wrong because it doesn’t build a memory execution graph. It mapped a C++ function signature to a Python function signature. It completely failed to realize it had just changed a safe, isolated pass-by-value operation (working on a private copy) into a destructive mutation of the caller’s object.

Why AST matters

To translate code safely, we have to return to Computer Science 101.

Code is not text. Treating code as a string of characters is the original issue of pure-LLM migration.

Code is a tree. Specifically, it is an Abstract Syntax Tree (AST).

When a traditional compiler reads your code, the very first thing it does is strip away the formatting, the whitespace, and the text, converting the logic into a strict, hierarchical graph of nodes and edges.

If we parse our legacy C++ simulateDiscount(Trade t) function into an AST, the parser generates a structural map that looks something like this

{
  "node_type": "FunctionDeclaration",
  "identifier": "simulateDiscount",
  "parameters": [
    {
      "node_type": "Parameter",
      "identifier": "t",
      "data_type": "Trade",
      "memory_model": "PASS_BY_VALUE" // <-- The Lifesaver
    }
  ],
  "body": [ ... ]
}

Notice the memory_model metadata.

The AST isn’t guessing based on text patterns. It knows, from the grammar of the language, that this specific C++ construct creates a local copy.

When an AST-driven migration maps this to Python, it compares the source memory model (PASS_BY_VALUE) to the target language’s default behavior (in Python, a function receives a reference to the caller’s object, so mutations are visible outside the function).

Because it detects this mismatch, the structural translation engine intervenes before the LLM can make a mistake.

It physically forces a constraint into the generated Python code to preserve the semantic contract of the original architecture.

The resulting code structurally generated by the tool will look like this

import copy

def simulate_discount(t: Trade):
    # AST-enforced safety boundary to preserve C++ pass-by-value
    t_local = copy.copy(t) 
    t_local.price *= 0.9
    log_simulation("Discounted price would be: ", t_local.price)

A pure LLM drops the constraint because it just sees words.

An AST preserves the constraint because it enforces mathematical logic.

Probabilistic models guess. ASTs prove.
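
If you want to watch the difference yourself, here is a small runnable version of both translations side by side. The function names simulate_discount_naive and simulate_discount_safe are mine, and logging is replaced by a return value to keep the sketch self-contained. Note that copy.copy is a shallow copy: it is enough for a flat object like Trade, but a struct containing nested objects would need copy.deepcopy to match C++ value semantics.

```python
import copy
from dataclasses import dataclass


@dataclass
class Trade:
    price: float


def simulate_discount_naive(t: Trade) -> float:
    # What the pure-LLM translation does: mutates the caller's object.
    t.price *= 0.9
    return t.price


def simulate_discount_safe(t: Trade) -> float:
    # What the constrained translation does: work on a private copy,
    # emulating C++ pass-by-value.
    t_local = copy.copy(t)
    t_local.price *= 0.9
    return t_local.price


naive_trade = Trade(100.0)
simulate_discount_naive(naive_trade)
print(naive_trade.price != 100.0)  # True: the "simulation" changed the real trade

safe_trade = Trade(100.0)
simulate_discount_safe(safe_trade)
print(safe_trade.price == 100.0)   # True: the original is untouched
```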

The Hybrid approach

If pure LLMs are dangerous, and pure AST translation (source-to-source compilers/transpilers) produces ugly, unmaintainable “machine code,” what is the solution?

If you are building an enterprise-grade migration tool, like the local AI agent architectures we build at BinaryBox, you must use a hybrid pipeline. You do not ask the AI to write code from scratch. You force it to operate within a deterministic pipeline

Parse

Use a traditional, deterministic parser (like Tree-sitter) to generate the AST of the legacy C++ or Java codebase.

This locks the exact structure, memory model, and control flow into a mathematical graph.

Analyze

Pass specific, isolated subtrees of the AST to the LLM.

Instead of asking the AI to “rewrite this file,” you ask it to identify intent.

“Analyze this AST node block. Is this an implementation of a bubble sort? Is this a deprecated synchronous network call?”

Generate

Use the AST to deterministically generate the new syntax, utilizing the LLM only to provide modern idiomatic mapping.

As shown above, if the AST dictates a pass-by-value parameter, the LLM is constrained to generating code that enforces a clone or copy in the target language.

The AST guarantees the structure. The AI translates the semantics.
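
To show what a deterministic structural check looks like in practice, here is a minimal sketch using Python’s built-in ast module. It is a stand-in: a real pipeline for C++ or Java would use a parser such as Tree-sitter, and find_parameter_mutations is a hypothetical helper, not part of any tool. The idea is to scan generated Python and flag every function that assigns to an attribute of one of its own parameters, which is exactly the pattern that breaks pass-by-value semantics.

```python
import ast


def find_parameter_mutations(source: str) -> dict[str, list[str]]:
    """Map each function name to the parameters whose attributes it assigns to."""
    findings = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        params = {arg.arg for arg in node.args.args}
        mutated = set()
        for inner in ast.walk(node):
            if isinstance(inner, ast.Assign):
                targets = inner.targets
            elif isinstance(inner, (ast.AugAssign, ast.AnnAssign)):
                targets = [inner.target]
            else:
                continue
            for target in targets:
                # Matches `param.attr = ...` and `param.attr *= ...`
                if (
                    isinstance(target, ast.Attribute)
                    and isinstance(target.value, ast.Name)
                    and target.value.id in params
                ):
                    mutated.add(target.value.id)
        if mutated:
            findings[node.name] = sorted(mutated)
    return findings


GENERATED = '''
def simulate_discount(t):
    t.price *= 0.9

def quote_discount(t):
    return t.price * 0.9
'''

print(find_parameter_mutations(GENERATED))  # {'simulate_discount': ['t']}
```

A check like this never gets tired and never gets distracted by a weird variable name. If the source AST says the parameter was passed by value and this check fires on the output, the pipeline can reject the translation before a human ever reads it.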

Why should you care?

Why does this architectural distinction matter to a CTO?

Because of QA costs.

If you use a pure LLM to translate a 500,000-line monolith, you cannot trust the output.

Because the system is probabilistic, human Senior Engineers must manually read, verify, and test every single line of the generated code to ensure no silent bugs were introduced.

The time and salary cost of having a Principal Engineer QA 500,000 lines of AI-generated spaghetti is often higher than the cost of having them rewrite it by hand.

The ROI of the migration drops to zero.

An AST+AI hybrid approach mathematically guarantees structural equivalence.

If the AST parser proves that all control flows and state mutations have been preserved in the new language, your engineers do not need to read every line.

They only need to review the architectural patterns and idiomatic choices.

You eliminate the operational risk of silent logic drops, and you cut the migration timeline from years to months.

Conclusion

There is a recurring theme in resilient system design: architecture requires constraints.

We saw this in the Editor Wars. Atom gave developers total freedom to touch the DOM, resulting in chaotic performance. VS Code built a strict Extension Host, a structural prison that constrained plugins and guaranteed a flawless user experience.

AI agents are no different. If you give an LLM the freedom to write whatever text it wants, it will eventually write a bug that bankrupts your company.

To do a reliable AI code migration, you must build a structural prison.

The Abstract Syntax Tree (AST) is that prison. You lock the LLM inside the boundaries of the AST, forcing it to translate node-by-node, bounded by the laws of lexical scope, memory models, and control flow.

Freedom is chaos. Constraints are reliability.

Stop trying to use a magic wand, and start building a compiler.

Deep Dive - How Kafka hit 1 Million writes per second on a $40 HDD

Ashok Vishwakarma — Thu, 05 Mar 2026 06:01:47 GMT

In my previous Deep Dive, I tried to write 1,000,000 records per second to PostgreSQL running on an AWS c8g.48xlarge instance backed by Provisioned IOPS SSDs (io2 Block Express).

The database locked up. The queue depth exploded. The disk, a $30,000/month NVMe SSD, simply couldn’t physically accept the write signals fast enough.

We had to abandon persistent storage entirely and switch to a Redis cluster.

We traded durability for speed, accepting that a power failure would vaporize millions of transactions in an instant.

But here’s the part that breaks most engineers’ mental models

Apache Kafka handles 1 million writes per second on cheap, spinning hard drives.

Not NVMe. Not even SATA SSDs. Actual magnetic platters with mechanical arms. The kind of physical, spinning rust drives you can buy for $40 at Amazon.

How is this possible?

The answer isn’t “Kafka is written in a faster language” (it runs on the JVM, which is notoriously heavy).

The answer isn’t “Kafka uses better compression.”

The answer is physics.

Kafka doesn’t fight the hard drive. It exploits it. This is the story of how Kafka “cheats” by respecting the fundamental constraints of hardware, while traditional databases try to bend reality and lose.

RAM is Fast, Disk is Slow?

Every engineer “knows” RAM is faster than disk. But at scale, throughput beats latency. Sequential disk can outrun random RAM.

Let’s challenge the conventional wisdom directly. We are all taught these standard latency numbers

RAM ~100 nanoseconds access time
SSD ~100 microseconds access time (1,000x slower)
HDD ~10 milliseconds access time (100,000x slower)

These numbers are factually correct. But for high-throughput workloads, they are completely irrelevant. The missing context in that table is Random vs. Sequential I/O.

Those latency numbers assume random access: your application is jumping to arbitrary memory addresses or disk sectors.

But when you switch to sequential access, the story completely flips.

Let’s look at Sequential Read/Write Throughput

RAM (DDR4) ~20-50 GB/sec
NVMe SSD ~3-7 GB/sec
SATA SSD ~500 MB/sec
7200 RPM HDD ~200 MB/sec

Here is the key insight.

A standard, cheap hard drive doing pure sequential I/O can easily saturate a 1 Gbps network link. If your system bottleneck is network throughput (which it absolutely is at 1,000,000 requests per second), the magnetic disk is actually fast enough.

Let’s look at the real comparison between our failed experiment and Kafka’s architecture.

Postgres doing 1M random inserts

Each insert updates multiple B-Tree indexes. Each index update requires seeking to a random page on the disk. Even on an enterprise SSD, a random seek takes ~100 microseconds.

1,000,000 random seeks × 100 microseconds = 100 seconds of pure seek time.

If those seeks are serviced one after another, it is mathematically impossible to fit them into one second.

Kafka doing 1M sequential appends

Kafka writes to the end of a log file. There is no seeking. A modern hard drive sequentially writes at ~200 MB/sec.

1,000,000 writes × 1KB each = 1 GB.

At sequential speeds, that takes roughly 5 seconds on a single cheap disk, and is trivially parallelized across 5-10 disks in a JBOD (Just a Bunch of Disks) configuration to handle it in 1 second.

The lesson here: sequential disk beats random RAM when throughput matters more than latency.
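
The back-of-the-envelope math above fits in a few lines, so you can plug in your own numbers. This sketch uses the article’s round figures (100 microseconds per random seek, 1 KB messages, 200 MB/sec sequential throughput, decimal units). The function names are hypothetical, and the model deliberately ignores queuing, parallel I/O, and caching.

```python
def random_write_seconds(writes: int, seek_microseconds: int) -> float:
    """Time spent purely on seeks if every write needs one serialized random seek."""
    return writes * seek_microseconds / 1_000_000


def sequential_write_seconds(
    writes: int,
    bytes_per_write: int,
    throughput_bytes_per_sec: int,
    disks: int = 1,
) -> float:
    """Time to stream the same writes sequentially, striped evenly across disks."""
    return writes * bytes_per_write / (throughput_bytes_per_sec * disks)


WRITES = 1_000_000

print(random_write_seconds(WRITES, seek_microseconds=100))                   # 100.0
print(sequential_write_seconds(WRITES, 1_000, 200_000_000))                  # 5.0
print(sequential_write_seconds(WRITES, 1_000, 200_000_000, disks=5))         # 1.0
```

One hundred seconds versus five, on hardware that is supposedly 100x slower. The access pattern, not the device, is doing the heavy lifting.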

This is why Kafka doesn’t need NVMe. It just needs sequential access patterns.

What’s the throughput of your production database doing sequential scans vs. indexed lookups? If you don’t know, you’re optimizing blind.

How Kafka Enforces Sequential I/O

Unlike relational databases, which rely heavily on B-Trees to enable fast random lookups, Kafka is built around a single, aggressively simple data structure: the Commit Log.

When a message arrives at a Kafka broker, the system does exactly one thing: it appends the message to the end of the current log segment (a raw file on disk).

It never updates existing entries.
It never seeks backward to modify a state.
It writes in large batches (saving multiple messages in a single system call).

Why does this work so perfectly?

Hard drives have a mechanical read/write head. To read or write data, that head must physically move across the platter to find the correct sector.

This is why random I/O is so devastatingly slow, the mechanical arm is constantly repositioning. It is physically vibrating.

But when you strictly append to the end of a file, the head drops into position and stays there. The disk controller can stop worrying about seek times and optimize entirely for pure throughput.

You are essentially turning a hard drive into a firehose.

If you add batching, writing 100 messages per write() syscall instead of 1, you reduce the CPU context-switch overhead by 100x while keeping the disk arm perfectly still.
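
Here is a toy version of the idea: an append-only log that buffers messages and writes each batch in one go. This is a minimal sketch, not Kafka’s on-disk format. The AppendOnlyLog class, the 4-byte length prefix, and the flush_count counter are all assumptions made for illustration.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy commit log: length-prefixed records, appended in batches."""

    def __init__(self, path: str, batch_size: int = 100, fsync: bool = False) -> None:
        self._file = open(path, "ab")  # append mode: writes always land at the end
        self._batch = []
        self._batch_size = batch_size
        self._fsync = fsync
        self.flush_count = 0

    def append(self, payload: bytes) -> None:
        # Frame each record with a 4-byte big-endian length prefix.
        self._batch.append(struct.pack(">I", len(payload)) + payload)
        if len(self._batch) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._batch:
            return
        self._file.write(b"".join(self._batch))  # one buffered write for the whole batch
        self._file.flush()                       # hand the bytes to the OS page cache
        if self._fsync:
            os.fsync(self._file.fileno())        # optionally force them to the device
        self._batch.clear()
        self.flush_count += 1

    def close(self) -> None:
        self.flush()
        self._file.close()


def read_log(path: str):
    """Replay records in order, from the start of the file."""
    with open(path, "rb") as f:
        while header := f.read(4):
            (length,) = struct.unpack(">I", header)
            yield f.read(length)


def demo_append_log():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "segment.log")
        log = AppendOnlyLog(path, batch_size=100)
        for i in range(250):
            log.append(f"event-{i}".encode())
        log.close()
        return list(read_log(path)), log.flush_count
```

Running demo_append_log() writes 250 records with only 3 flushes instead of 250 separate writes. There is no update path and no seek anywhere in the class, and that absence is the whole design.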

The Trade-off Kafka Makes

Databases optimize for flexibility, random access (give me record ID 47293), updates in place (change this user’s email), and complex queries (JOIN across three tables).

Kafka completely abandons this. It optimizes strictly for append-only writes (add this event), sequential reads (replay messages in order), and time-based queries (give me all events from 2:00 PM).

This is a conscious architectural choice, not magic. By refusing to support random updates, Kafka gets to use the fastest possible I/O pattern the hardware offers.

Look at Netflix. They log every single user interaction (play, pause, seek, stop) to Kafka. At peak, that is hundreds of thousands of events per second from millions of concurrent users.

Netflix doesn’t need to query “what did user X do exactly 4 seconds ago?” in real-time. They need to capture the firehose of data and process it asynchronously.

A B-Tree database would collapse under that write load. Kafka’s append-only log absorbs it effortlessly.

Look at your highest-volume write workload.
Is it actually appending events, or are you using INSERT/UPDATE simply because “that’s what databases do”?

Kafka Doesn’t Manage Memory

If you write a standard application that interacts with a disk and a network, your data flow generally looks like this

Read data from disk into a kernel buffer.
Copy data from the kernel buffer to application memory (like the JVM heap).
Process the data.
Copy data from application memory back to a kernel socket buffer.
Write to the network socket.

At extreme throughput, this standard pattern creates two massive system bottlenecks.

1. Garbage Collection Death

Every message object allocated in the JVM heap must eventually be garbage collected.

If you are pushing 1,000,000 messages per second through application memory, the Garbage Collector cannot keep up.

You will experience massive “stop-the-world” pauses that instantly kill your throughput and trigger network timeouts.

2. Double Buffering

Your data exists in two places at once, kernel memory (the OS page cache) and application memory (the JVM heap).

You are wasting RAM, and more importantly, you are wasting CPU cycles copying the exact same bytes back and forth between user space and kernel space.

Kafka’s Solution

Bypass Application Memory Entirely.

Kafka does not attempt to manage a complex internal buffer pool. It relies entirely on the Linux OS Page Cache.

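When Kafka writes a message, it calls write() to append to the log file. The OS buffers this write in the page cache in RAM. Kafka does not wait for the data to reach the physical disk before acknowledging the producer (how many replicas must confirm first depends on the producer’s acks setting).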

The Linux kernel flushes that page cache to the physical disk asynchronously in the background.

When Kafka reads a message, the OS loads the file into the page cache. Kafka references the page cache directly. The message data is never copied into the JVM heap. Therefore, there is no garbage collection penalty.

Modern operating systems are ruthlessly efficient at managing file caches. Linux will gladly use 100% of your free RAM as a page cache. Kafka doesn’t try to outsmart the OS, it defers to the kernel.

The practical result?

A Kafka broker with 64 GB of RAM effectively has ~4 GB dedicated to the JVM heap (which is tiny), and ~60 GB dedicated to the OS page cache.

Consumers reading recent data get RAM-speed access because the OS serves it directly from the cache. Older messages fall out of cache and are read from disk, but because it’s sequential, it remains incredibly fast.

Postgres must manage its own complex buffer pool because it supports random updates, ACID transactions, and row-level locking.

Kafka can rely entirely on the OS because it only does sequential access.

Zero-Copy

This brings us to the centerpiece of Kafka’s architecture.

The fastest code is the code that never runs. Zero-Copy means the CPU doesn’t touch your data.

That’s why it’s fast.

The Traditional Data Path (4 Copies)

When a consumer requests a batch of messages from Kafka, a naive implementation would execute the following path

Disk → Kernel Buffer - Via DMA - Direct Memory Access. Hardware does the work.
Kernel Buffer → Application Buffer - CPU copies data from kernel space into the Kafka JVM heap.
Application Buffer → Socket Buffer - CPU copies data from the JVM back to the kernel network stack.
Socket Buffer → NIC - Via DMA to the Network Interface Card.

At each copy boundary, you suffer. CPU cycles are wasted. You force expensive context switches between user space and kernel space. You pollute the CPU L1/L2 caches, evicting hot application state just to make room for transient message bytes that are passing through.

At scale, serving 1,000,000 messages/sec means copying 1 GB/sec four times. That is 4 GB/sec of memory bandwidth consumed just moving the exact same bytes around the motherboard.

On our massive c8g.48xlarge server, the CPU would be saturated just copying data, doing absolutely zero actual processing.

The Zero-Copy Solution

Linux provides the sendfile() syscall (and its cousin splice()) to solve this exact bottleneck.

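// Traditional approach (CPU intensive)
read(file_fd, buffer, size);           // Copy from kernel page cache to app buffer
write(socket_fd, buffer, size);        // Copy from app buffer back to kernel socket buffer

// Zero-copy approach (no user-space copy)
// Linux signature: ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
off_t offset = 0;
sendfile(socket_fd, file_fd, &offset, size);  // Kernel moves the bytes and advances offset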

What actually happens under the hood when Kafka uses zero-copy?

Disk → Kernel Page Cache - DMA read.
Page Cache → Socket Buffer - A kernel-to-kernel copy. User-space is completely bypassed.
Socket Buffer → NIC - DMA write.

The message data never enters Kafka’s application memory. Kafka simply issues a command to the Linux kernel,

“Take 500 bytes from file descriptor Z at offset X, and stream them directly into network socket W.”

The kernel and the DMA controllers handle everything.

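Instead of 4 copies, you get 3, and only 2 when the NIC supports scatter-gather DMA, because the kernel can then hand the socket a descriptor pointing at the page cache instead of copying the bytes. Instead of 4 context switches (two syscalls, each crossing into the kernel and back), you get 2. Instead of thrashing the CPU cache, you keep it pristine for actual orchestration logic.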
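
You can try the same idea from a high-level language. The sketch below uses Python’s socket.sendfile(), which uses os.sendfile() where the platform supports it and otherwise falls back to ordinary send() calls. Kafka’s broker itself runs on the JVM, so this is only an illustration of the syscall idea, not Kafka’s code path. The helper names and the socketpair-based demo are assumptions for the sake of a self-contained example.

```python
import os
import socket
import tempfile


def send_file_zero_copy(sock: socket.socket, path: str) -> int:
    """Stream a file into a socket without reading it into a Python buffer."""
    with open(path, "rb") as f:
        # Uses os.sendfile() when available; otherwise falls back to send().
        return sock.sendfile(f)


def demo_sendfile():
    payload = b"kafka-log-segment\n" * 256  # small enough to fit in the socket buffer
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name

    sender, receiver = socket.socketpair()  # a connected pair, standing in for broker and consumer
    try:
        sent = send_file_zero_copy(sender, path)
        sender.close()  # signal end-of-stream to the receiver
        chunks = []
        while chunk := receiver.recv(65536):
            chunks.append(chunk)
    finally:
        receiver.close()
        os.unlink(path)
    return sent, b"".join(chunks), payload
```

Look at what send_file_zero_copy does not contain: no read(), no buffer, no loop over bytes. The application names a source and a destination, and the kernel does the rest.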

LinkedIn engineering published benchmarks demonstrating that zero-copy improves Kafka’s throughput by 2-3x for consumer reads. At 1M messages/sec, that is the literal difference between needing a cluster of 3 massive servers versus 1 server.

Why traditional message queues can’t do this

RabbitMQ, ActiveMQ, and traditional enterprise queues usually transform messages (adding headers, parsing routing keys), encrypt payloads in the application layer, or apply middleware.

All of these actions require the message to be pulled into application memory so the CPU can inspect and alter the bytes.

Kafka’s messages are opaque byte arrays. Kafka does not parse them, it does not transform them, and it does not care about their contents.

This architectural constraint allows Kafka to use zero-copy. The broker is just a dumb, incredibly fast pipe moving bytes from a disk to a network card.

How many times is your data copied between receiving it and sending it? Every single copy is CPU waste you are paying for on your AWS bill.

When to Use This Pattern

Understanding how Kafka works is only half the battle. You need to know when to apply these principles (append-only logs and zero-copy transfers) to your own systems.

This pattern WORKS when

Write-heavy, read-sequential workloads

Event logging, audit trails, analytics ingestion pipelines, and background job queues.

Messages are opaque blobs

You don’t need the broker to parse, transform, or route based on content. Consumers handle the deserialization.

Recent data is hot, old data is cold

99% of your reads are for data written in the last few minutes (guaranteeing a page cache hit). Occasional historical reads (requiring a disk seek) are acceptable.

Durability matters, but immediate consistency does not

Relying on the OS page cache flush (write-behind caching) is “good enough,” and you don’t need to force an fsync() to the physical platter on every single write.

This pattern DOES NOT work when

Random access queries

“Give me user 47293’s profile.” (Use a traditional database).

Low-latency single-message processing

If you need sub-millisecond latency per message, Kafka isn’t the tool. Zero-copy optimizes for massive batch throughput, not single-message latency.

Message transformation in the broker

If your broker must decrypt, dynamically route, or mutate messages, you cannot use zero-copy because you must pull the data into application memory.

Before reaching for Postgres, Redis, or MongoDB for a high-volume endpoint, ask yourself

“Am I appending events, or am I updating records?”

If your workload is append-mostly and sequential-read, you are leaving 10x performance on the table by using a general-purpose B-Tree database.

Consider Kafka for event streams, ClickHouse for analytics, or InfluxDB for time-series metrics.

All of them favor append-oriented storage (commit logs or merge-based structures). All of them respect sequential I/O.

Conclusion

In my 1M RPS test, Postgres failed not because it was poorly designed, but because it was designed for an entirely different problem space.

Postgres optimizes for maximum flexibility, random updates, complex queries, and strict ACID guarantees. To deliver this flexibility, it must use B-Trees, endure random I/O, and manage its own application buffers.

Kafka optimizes for maximum throughput, append-only writes, sequential reads, and eventual consistency. To deliver this throughput, it uses commit logs, demands sequential I/O, and relies entirely on kernel-managed caching.

Neither system is “better.” They solve different physical problems.

The lesson here isn’t “use Kafka instead of Postgres.”

The lesson is: understand the physics of your hardware, and then choose the data structure that ruthlessly exploits it.

Sequential disk is faster than random RAM. Zero-copy is faster than application processing. The Linux OS page cache is smarter than your hand-rolled buffer pool.

Stop fighting the metal. Start respecting it. When you align your architecture with the strict constraints of the underlying hardware, you don’t need to scale out to hundreds of servers. You can handle 1 million writes per second on a $40 hard drive.

That’s not magic. That’s engineering.

Report - How Andres Freund saves the Internet

Ashok Vishwakarma — Tue, 03 Mar 2026 06:00:15 GMT

March 2024. Andres Freund, a Microsoft engineer and PostgreSQL developer, is working from home when he notices something odd.

His SSH logins are taking 500 milliseconds longer than normal.

Most engineers would ignore this. Network latency. A busy server. Maybe restart the daemon and forget about it.

Freund didn’t. He broke out the profiler and debugged it.

He traced the micro-delay to liblzma, a ubiquitous compression library that ends up loaded into OpenSSH’s sshd on several distributions (via a libsystemd dependency).

He found that recent versions contained a backdoor so sophisticated that it had bypassed

Automated security scanners
Code reviews from major Linux distributions
Penetration testing
Static analysis tools
The “many eyes” of the open-source community

The backdoor had been planted over 2.5 years by a suspected nation-state actor who earned the trust of the maintainer, contributed legitimate code, and then injected malware into the build process itself.

If Freund had ignored that 500ms delay, every major Linux distribution would have shipped SSH servers with a pre-installed, pre-authentication remote access backdoor.

Every CISO in the world relies on million-dollar security budgets, automated scanners, and penetration tests. A backdoor that took 2.5 years to plant was caught by one engineer who thought, “Huh, that CPU spike is weird.”

This is the story of why your security budget missed what one curious engineer caught by accident. And why “free” open-source software might be the most expensive dependency in your infrastructure.

The Myth

Let’s challenge the core religion of modern software engineering.

Linus’s Law states

“Given enough eyeballs, all bugs are shallow.”

This has been gospel in the tech industry for 30 years. It’s why enterprise companies confidently build trillion-dollar infrastructures on open-source software.

The assumption is that open source is inherently more secure than proprietary code because anyone can audit it. Bad actors can’t hide in plain sight when millions of developers are watching.

Here is the reality of XZ Utils

Downloads - 1.4 million per day.
Usage - Core dependency for every major Linux distribution (Debian, Ubuntu, Fedora, Red Hat).
Criticality - Loaded into the SSH daemon (via libsystemd) on the major distributions that run much of the internet’s servers.
Maintainers - ONE unpaid volunteer (Lasse Collin).
Annual Budget - $0.
Last Security Audit - Never.

According to the Open Source Security and Risk Analysis (OSSRA) report, 87% of commercial applications contain open-source components.

Yet, less than 3% of organizations perform regular security audits of their dependencies, and the average application relies on 500+ transitive dependencies.

Linus’s Law assumes the eyes are looking. But nobody audits compression libraries. Nobody reviews arcane build scripts. Nobody sponsors the maintainer working nights and weekends for zero pay.

Your production infrastructure runs on code that hasn’t been reviewed since it was written. The “many eyes” aren’t watching. They’re assuming someone else is. This collective delusion is exactly what made the XZ attack possible.

How many dependencies does your main application have? How many of those have you personally reviewed in the last year? If the answer is ‘zero,’ you’re trusting strangers with your production environment.

How Jia Tan Bypassed the Eyeballs

To understand why your scanners failed, you have to understand the mechanics of the attack. It was a masterpiece of both social and software engineering.

Phase 1 - Social Engineering the Maintainer

The attack didn’t start with code. It started with economics.

Lasse Collin had maintained XZ Utils alone, unpaid, for over 15 years. He had a full-time job. XZ was his nights-and-weekends project. In 2022, a coordinated pressure campaign began. Multiple sock-puppet accounts posted hostile messages to the XZ mailing list:

“Lasse is unresponsive to patches.”
“This project is effectively unmaintained.”
“We need new leadership or XZ will die.”

The manufactured urgency worked. Lasse, burnt out and dealing with personal health issues, accepted help.

Enter “Jia Tan”

A seemingly legitimate open-source contributor who had been making small, helpful commits for months. Jia earned trust slowly. He fixed real bugs. He responded to issues professionally. He became indispensable. After two years of free labor, Lasse granted Jia Tan co-maintainer status and release access.

The initial vulnerability wasn’t a buffer overflow. It was the economics. One unpaid volunteer maintaining critical infrastructure for billions of users. The attackers didn’t need a zero-day. They just needed patience.

Phase 2 - Hiding the Payload

Traditional malware injection is obvious: You add malicious code to .c source files. But Jia Tan knew code reviewers read source files. Even casual contributors glance at GitHub diffs.

Jia Tan’s solution was brilliant. Don’t hide the backdoor in the source code. Hide it where nobody looks.

The payload was concealed inside binary test files, specifically, .xz compressed blobs used for the test suite to verify the compressor was working. Developers assume test data is benign. Nobody reviews binary blobs. The malicious test files sat in the repository for months, completely invisible to the human eye.

Phase 3 - Build-Time Injection

The source code remained clean. Your Veracode or Snyk scanners found nothing suspicious in the .c files. But Jia Tan slipped heavily obfuscated logic into the build machinery: the release tarballs carried a modified build-to-host.m4 macro that did not match the public Git repository.

During the make compilation process on a Debian or RPM build server:

The build script extracted the hidden payload from the binary test files.
It decrypted the payload and injected the backdoor directly into the compiled binary (liblzma.so).

The source code never contained the malware; it existed only in the build artifact.

Your build process is a black box.
You review code, but do you verify the compiled binary matches the source? If not, how would you detect a build-time injection?
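One cheap check in this direction, sketched here under stated assumptions, is to compare an unpacked release tarball against a checkout of the matching tag and list every file that exists only in the tarball or whose contents differ. The function names and directory layout are hypothetical. Generated files such as configure will legitimately show up, so treat the output as a review queue, not a verdict.

```python
import hashlib
from pathlib import Path


def hash_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_release(tarball_dir: Path, checkout_dir: Path) -> dict[str, list[str]]:
    """List files found only in the tarball, and files whose contents differ."""
    shipped = hash_tree(tarball_dir)    # what users actually build from
    tracked = hash_tree(checkout_dir)   # what reviewers actually read
    only_in_tarball = sorted(set(shipped) - set(tracked))
    changed = sorted(
        name for name in set(shipped) & set(tracked)
        if shipped[name] != tracked[name]
    )
    return {"only_in_tarball": only_in_tarball, "changed": changed}
```

Calling diff_release(Path("xz-unpacked"), Path("xz-git")) with two hypothetical directories returns both lists. Build scripts and macros that appear in either list are the first things to read by hand.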

Phase 4 - The IFUNC/GOT Hijacking

Once the malicious liblzma.so was loaded into memory by the SSH daemon, it executed a sophisticated Linux exploitation technique:

IFUNC (Indirect Function) Resolvers and GOT (Global Offset Table) Hijacking.

Linux supports IFUNC, a mechanism where a function’s implementation is chosen dynamically at runtime based on CPU capabilities (e.g., choosing an AVX-512 optimized version if the hardware supports it).

The backdoor registered a malicious IFUNC resolver that executed during the dynamic linking phase, before main() even runs.

This resolver modified the Global Offset Table, replacing the pointer to RSA_public_decrypt(), an OpenSSL function that OpenSSH calls during authentication, with a pointer to the attacker’s own code.

The backdoor allowed remote code execution using a specific Ed448 cryptographic key controlled by the attacker. No valid SSH credentials were required.

It was built for stealth. It checked whether the process was /usr/sbin/sshd so it wouldn’t crash normal tarball extractions, and it shut itself off if it detected profiling or debugging tools. (It failed to hide from Freund’s profiler only because of a tiny CPU cycle overhead bug.)
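The pointer swap is easier to picture with a toy model. The sketch below is an analogy in Python, not the real mechanism: the dictionary stands in for the Global Offset Table, the resolver functions stand in for IFUNC resolvers, and every name except RSA_public_decrypt is made up for illustration.

```python
from typing import Callable, Dict, List

GotTable = Dict[str, Callable[[bytes], str]]


def rsa_public_decrypt(signature: bytes) -> str:
    """Stand-in for the legitimate library function."""
    return "legitimate signature check"


def attacker_hook(signature: bytes) -> str:
    """Stand-in for attacker code that sees the data first."""
    return "attacker code inspects the payload"


def honest_resolver(got: GotTable) -> None:
    """A normal resolver only picks an implementation; it leaves the table alone."""


def malicious_resolver(got: GotTable) -> None:
    """A hostile resolver abuses its early execution to repoint another symbol."""
    got["RSA_public_decrypt"] = attacker_hook


def start_process(resolvers: List[Callable[[GotTable], None]]) -> str:
    """Toy loader: fill the table, run resolvers, then run 'main'."""
    got: GotTable = {"RSA_public_decrypt": rsa_public_decrypt}
    for resolve in resolvers:  # models the dynamic linking phase, before main()
        resolve(got)
    # Models main(): the program calls through the table, whatever it points to.
    return got["RSA_public_decrypt"](b"client-supplied bytes")
```

start_process([honest_resolver]) reaches the legitimate function, while start_process([malicious_resolver]) reaches the hook. In both cases the calling code is identical, which is why reviewing the caller reveals nothing.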

Modern security assumes you can trust the build process. Code review, static analysis, and fuzzing all operate on source code.

But if the build itself is compromised, none of that matters. This was an attack on the software supply chain, not the software itself.

The Root Cause

Let’s shift from the technical to the economic. The brutal math of open-source maintenance is the actual root cause of CVE-2024-3094.

Let’s quantify what Lasse Collin was managing:

Impact - Billions of devices worldwide.
Critical dependencies - OpenSSH, systemd, dpkg, rpm.
Maintainer compensation - $0/year.
Support burden - Thousands of emails, bug reports, and feature requests from corporations demanding free support.

The Fortune 500 runs mission-critical infrastructure on a foundation of unpaid labor.

OpenSSL - Maintained by ~10 people on an annual budget of ~$1M (via grants); used by billions of users.
curl - Maintained by Daniel Stenberg (mostly solo) on an annual budget of ~$100K (sponsors); used by billions of users.
SQLite - Maintained by D. Richard Hipp plus 2 contractors, self-funded; used by billions of users.
XZ Utils (pre-attack) - Maintained by 1 unpaid volunteer on a $0 budget; used by billions of users.

The sock-puppet pressure campaign worked because Lasse was genuinely overwhelmed. The complaints about slow response times were real. “Helpful” contributors offered relief, and no one else (not Amazon, not Google, not Microsoft) was stepping up to help him. Handing off maintenance seemed like the responsible thing to do.

If XZ Utils disappeared tomorrow and you had to build a replacement in-house, how much would it cost?

A conservative estimate:

3 senior engineers × 2 years × $200K = $1.2 million.

That is the real enterprise value Lasse Collin was providing. For free. While working a separate full-time job.

Companies will happily pay $50,000 annually for a Datadog license to monitor their infrastructure.

But they won’t pay $50,000 to sponsor the unpaid maintainer of the compression library that every server in their data center depends on. Until the maintainer burns out. Hands over access. And a nation-state backdoor ends up in production.

If your most critical open-source dependency disappeared tomorrow, how much would it cost to replace it? That’s what the maintainer is saving you. Are you paying them anything?

What should you do now?

The era of blind trust in npm install and apt-get upgrade is over.

Here is a framework for securing your supply chain, starting Monday morning:

Audit Your Dependency Tree

Don’t just look for known CVEs. Identify maintenance risk.

Run dependency analysis (npm audit, pip-audit, cargo audit).
Find dependencies with fewer than 3 active maintainers.
Check the last commit date (>1 year without activity is a massive red flag).
Red Flags - A single maintainer, hundreds of unresolved issues, or a maintainer posting “looking for new ownership.”
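The checklist above can be turned into a small scoring function. This is a minimal sketch: the Dependency record, its field names, and the 200-issue cutoff are assumptions made for illustration, and the metadata itself would have to come from your own inventory or your code host’s API.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Dependency:
    name: str
    active_maintainers: int
    last_commit: date
    open_issues: int


def maintenance_flags(dep: Dependency, today: date) -> list[str]:
    """Return the red flags from the checklist that apply to one dependency."""
    flags = []
    if dep.active_maintainers < 3:
        flags.append("fewer than 3 active maintainers")
    if today - dep.last_commit > timedelta(days=365):
        flags.append("no commits in over a year")
    if dep.open_issues >= 200:  # assumed cutoff for "hundreds of unresolved issues"
        flags.append("hundreds of unresolved issues")
    return flags


# Hypothetical inventory entries, not real projects.
inventory = [
    Dependency("tiny-compress", 1, date(2022, 1, 10), 340),
    Dependency("big-framework", 12, date(2024, 3, 1), 50),
]
for dep in inventory:
    print(dep.name, maintenance_flags(dep, today=date(2024, 3, 29)))
```

Sorting the inventory by the number of flags gives a first-pass list of which maintainers to contact or sponsor.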

Pin and Vendor Critical Dependencies

Stop trusting the internet to compile your software.

Stop using latest or ^1.0.0 version ranges in production. Stop auto-updating dependencies without review.

Pin exact versions with SHA checksums. Use lock files (package-lock.json, Cargo.lock). For ultra-critical infrastructure libraries, vendor them (copy the source into your own repository) and build them from source.
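Finding the loose version ranges is easy to automate. The sketch below assumes an npm-style package.json and treats anything that is not an exact x.y.z version as unpinned; the function name and the regular expression are illustrative choices, not part of any npm tooling.

```python
import json
import re

# An exact pin looks like 1.2.3, optionally with a pre-release or build suffix.
EXACT_VERSION = re.compile(r"^\d+\.\d+\.\d+(?:[-+][0-9A-Za-z.+-]+)?$")


def unpinned_dependencies(package_json_text: str) -> dict[str, str]:
    """Return dependencies whose version spec is a range, tag, or wildcard."""
    manifest = json.loads(package_json_text)
    loose = {}
    for section in ("dependencies", "devDependencies"):
        for name, spec in manifest.get(section, {}).items():
            if not EXACT_VERSION.match(spec):
                loose[name] = spec
    return loose
```

Failing the CI job whenever this returns a non-empty result enforces the pinning rule. It only inspects the manifest, so the lock file and its integrity hashes still need to be committed and reviewed.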

Zero Trust for the Build Process

If the build is compromised, the code review doesn’t matter.

Implement Reproducible Builds (the exact same source code must produce the exact same byte-for-byte binary).
Isolate your build environments in ephemeral containers or VMs.
Generate a Software Bill of Materials (SBOM) for every release.
Verify that downloaded packages match published checksums and ensure no unexpected network requests happen during your CI/CD pipeline.
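The checksum step needs nothing beyond the standard library. A minimal sketch, assuming the pinned SHA-256 digest was recorded out of band (for example in your lock file) and that the helper names are your own:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, pinned_sha256: str) -> None:
    """Raise ValueError if the file does not match the pinned digest."""
    actual = sha256_of(path)
    if actual != pinned_sha256.lower():
        raise ValueError(f"{path.name}: expected {pinned_sha256}, got {actual}")
```

The same sha256_of helper gives a rough reproducibility check: build the same commit twice in separate clean environments and compare the two digests. A mismatch means the build is not yet deterministic, or something changed the output.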

Fund the Maintainers

Funding maintainers isn’t charity. It’s cheaper than breach response, cheaper than in-house development, and cheaper than explaining to your board why you ignored supply chain risk.

If a library is critical to your business, sponsor it via GitHub Sponsors, Tidelift, or Open Collective.
The Goal - $50K–$100K/year for infrastructure-level libraries.
This reduces maintainer burnout, creates accountability, and gives you input on roadmap and priority bug fixes.

Treat Open Source Like Vendors

Apply the exact same due diligence to your open-source libraries that you apply to a SaaS vendor. Monitor dependencies like production services. Alert on new releases, review changelogs before upgrading, and test updates in staging.

You cannot audit every line of every dependency. You have 500+ transitive dependencies, and they change constantly. But you CAN control your build process, fund the maintainers of critical libraries, and treat “free” software as unpaid labor with systemic risk.

Conclusion

The XZ backdoor wasn’t a failure of code. It was a failure of economics.

Jia Tan didn’t find a zero-day or exploit a race condition. He exploited the fact that trillion-dollar companies depend on unpaid volunteers, and then act surprised when those volunteers burn out and hand over the keys.

Linus’s Law (“given enough eyeballs, all bugs are shallow”) assumes the eyeballs are actually looking. But nobody reviews compression libraries. Nobody audits build scripts. Nobody funds the maintainer. Until something breaks.

The next time your CFO questions a $500,000 budget for open-source sponsorships, show them the XZ timeline.

Show them the 2.5 years of patient infiltration. Show them that one burnt-out maintainer was all that stood between production systems and a global, nation-state backdoor.

“Free” software isn’t free. You’re just deferring the payment. The only question is:

Will you pay before the breach, or after?

Every dependency in your stack is either funded, or it’s a countdown timer.

Credit to Veritasium for their exceptional video breakdown of the Jia Tan timeline, which served as the foundation for this architectural teardown.