Why LinkedIn is leaving Kafka and why you should not be worried
For the last decade, engineering teams made a pilgrimage to Kafka. You deployed it because LinkedIn built it. You assumed that if it was good enough for a company processing billions of events, it was certainly good enough for you. You set up your five-node cluster with a ZooKeeper ensemble and configured your producer for idempotent writes. You spent three months getting it production-ready.
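If that ritual sounds familiar, the configuration probably looked something like this. A minimal sketch using the confluent-kafka Python client; the broker addresses and topic name are placeholders, not anything from a real setup.

```python
# Minimal idempotent-producer setup with the confluent-kafka client.
# Broker addresses and topic name are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092,broker2:9092",
    "enable.idempotence": True,  # broker deduplicates retried batches
    "acks": "all",               # wait for all in-sync replicas
})

producer.produce("user-events", key=b"user-123", value=b'{"action": "click"}')
producer.flush()
```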
Your team processes roughly 10 million events per day. As of 2026, LinkedIn operates at 32 trillion events per day.
The gap is staggering. You are processing at one 3.2-millionth of the scale where LinkedIn finally hit the ceiling. And here is the kicker: LinkedIn is moving off Kafka. They built Northguard and Xinfra to replace it.
The question you should be asking is simple. If LinkedIn outgrew Kafka at 32 trillion events and you are at 10 million, do you actually need Kafka? Or did you cargo-cult their infrastructure without understanding the scale that made it necessary?
The story of LinkedIn replacing Kafka is not a signal that Kafka is dead. It is a lesson in understanding scale. Kafka scaled 23,000x before LinkedIn needed something else. You are running 140x smaller than where they started in 2011. Let us talk about what actually broke at their scale and why you probably do not have that problem.
What broke Kafka at LinkedIn scale?
To understand what broke, we need to look at the architectural limits of centralized coordination.
The Problem of Centralized Metadata
Kafka relies on a single active controller node to manage all partition metadata. The controller handles partition leader elections, replica assignments, and metadata updates across every broker.
At your scale of 100 topics, this works perfectly. At LinkedIn's scale of 400,000 topics and millions of partitions, the controller becomes a catastrophic bottleneck.
When the controller fails, a new leader election must occur via the KRaft quorum. The new controller must load every piece of partition metadata into memory and rebuild its state. At the scale of 32 trillion events, this reconstruction takes minutes.
And during that time the infrastructure is essentially frozen. No new topics can be created and no partitions can rebalance.
Centralized metadata management eventually hits a physical ceiling.
The Problem of Infrastructure-Wide Rebalancing
When a consumer group rebalances, Kafka pauses consumption for the affected partitions.
At small scale this takes seconds. At LinkedIn scale rebalancing touches millions of partitions across thousands of topics simultaneously.
This causes infrastructure wide pauses that last for minutes. Even cooperative rebalancing cannot solve this when the sheer number of partitions creates a coordination explosion.
Consumer lag spikes and downstream systems experience massive latency increases during these coordinated events.
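For context, opting into cooperative rebalancing is a one-line configuration change. Here is a sketch with the confluent-kafka client; the group and topic names are hypothetical, and the point is that no client setting fixes coordination that scales with partition count.

```python
# Cooperative (incremental) rebalancing: a config change, not a fix
# for the coordination explosion. Names are hypothetical.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker1:9092",
    "group.id": "analytics-pipeline",
    "partition.assignment.strategy": "cooperative-sticky",  # avoid stop-the-world
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["user-events"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    handle(msg)  # hypothetical handler
```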
The Problem of Fixed Partitions
In Kafka you choose your partition count upfront. If you choose too few, you hit a throughput wall. If you choose too many, you waste resources.
As your data grows, you cannot dynamically split a partition without downtime.
For a company with 400,000 topics, repartitioning is operationally impossible. It requires stopping producers, migrating data, and updating consumers across thousands of applications.
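The underlying reason is simple: keyed messages land on a partition by hashing the key modulo the partition count, so changing the count silently remaps keys and breaks per-key ordering. A plain-Python illustration (Kafka's default partitioner uses murmur2; crc32 stands in here):

```python
# Changing the partition count remaps keys. crc32 stands in for
# Kafka's murmur2; the modulo effect is identical.
import zlib

def partition_for(key: bytes, partitions: int) -> int:
    return zlib.crc32(key) % partitions

keys = [f"user-{i}".encode() for i in range(10_000)]
moved = sum(partition_for(k, 12) != partition_for(k, 24) for k in keys)
print(f"{moved / len(keys):.0%} of keys change partition")  # roughly half
```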
The Problem of Coordination
The metadata storage for millions of partitions reaches gigabytes in size. Every producer and consumer reads this partition metadata on startup. Metadata updates generate massive broadcast traffic across the cluster.
Coordination mechanisms like ZooKeeper or KRaft eventually hit a physical limit on how much state they can broadcast to every broker in a timely manner.
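A back-of-envelope calculation shows why. The per-partition byte figure below is an illustrative assumption, not a measured Kafka number:

```python
# Back-of-envelope only: ~1 KB of metadata per partition is an assumed
# figure (replicas, ISR, leader epochs, configs all add up).
partitions = 5_000_000
bytes_per_partition = 1_000
total_gb = partitions * bytes_per_partition / 1e9
print(f"~{total_gb:.0f} GB of state to load, replicate, and broadcast")  # ~5 GB
```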
Calculate this right now.
Divide 32 trillion by your daily event volume. If the result is greater than 1 million, you are operating more than a million times below the scale where these problems appear.
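Here is the arithmetic as a copy-paste snippet; swap in your real volume:

```python
linkedin_daily_events = 32_000_000_000_000  # 32 trillion
your_daily_events = 10_000_000              # placeholder: your number here

ratio = linkedin_daily_events / your_daily_events
print(f"LinkedIn runs at {ratio:,.0f}x your volume")  # 3,200,000x
if ratio > 1_000_000:
    print("You are over a million times below the scale where Kafka breaks.")
```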
You do not have a Kafka problem. You have a scale perception problem.
How does Northguard solve the scale problem?
Northguard replaces Kafka with a fundamentally different model designed for the frontier of distributed systems. It uses sharded metadata, range-based partitioning, and self-balancing clusters.
Sharded Metadata
Instead of a single controller, Northguard distributes metadata across vnodes using consistent hashing. Each vnode manages a subset of topics and uses Raft consensus for strong consistency. This removes the single point of coordination.
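LinkedIn has not published Northguard's internals here, so treat this as a generic illustration of the idea rather than their code: a consistent-hashing ring maps each topic to one metadata shard, so a lookup touches one Raft group instead of one global controller.

```python
# Illustration of sharded metadata via consistent hashing.
# Not Northguard's actual implementation.
import bisect
import hashlib

class MetadataRing:
    def __init__(self, vnodes, points=64):
        # Each vnode gets many points on the ring for an even spread.
        self._ring = sorted(
            (self._hash(f"{v}#{i}"), v) for v in vnodes for i in range(points)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def vnode_for(self, topic):
        # Walk clockwise to the first vnode point at or after the hash.
        idx = bisect.bisect(self._keys, self._hash(topic)) % len(self._keys)
        return self._ring[idx][1]

ring = MetadataRing(["vnode-0", "vnode-1", "vnode-2"])
print(ring.vnode_for("payments-events"))  # only one shard's Raft group is involved
```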
If your metadata fits comfortably in a single KRaft quorum, you do not need this. LinkedIn needed it because their metadata exceeded the memory capacity of any single node.
Range-Based Partitioning
Instead of fixed partitions, Northguard uses ranges. A range is a contiguous slice of the keyspace that can dynamically split or merge without downtime. When a range grows too large, it is marked for splitting; child ranges take over future writes while the old range is sealed.
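A toy model of split-and-seal makes the mechanism concrete. This illustrates the concept, not Northguard internals:

```python
# Toy split-and-seal: the sealed parent keeps serving old reads while
# children take all new writes for their half of the keyspace.
from dataclasses import dataclass

@dataclass
class Range:
    start: str   # inclusive keyspace bound
    end: str     # exclusive keyspace bound
    sealed: bool = False

    def split(self, mid):
        assert self.start < mid < self.end and not self.sealed
        self.sealed = True  # parent stops taking writes, stays readable
        return Range(self.start, mid), Range(mid, self.end)

parent = Range("a", "z")
left, right = parent.split("m")  # routing flips to the children; no downtime
```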
If you can estimate your partition count upfront, you do not need this complexity. Fixed partitions are simpler to manage until you approach the 400,000-topic mark.
Self-Balancing Clusters
In Northguard, new segments are automatically assigned to the least-loaded brokers. There is no explicit rebalancing operation. If a broker fails, the existing segments remain and new ones simply go to healthy nodes.
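The placement policy is as simple as it sounds. A sketch, with a heap of (load, broker) pairs standing in for whatever load signal a real system would use:

```python
# "New segments go to the least-loaded broker" in miniature.
import heapq

brokers = [(0, "broker-1"), (0, "broker-2"), (0, "broker-3")]
heapq.heapify(brokers)

def place_segment(size_gb):
    load, broker = heapq.heappop(brokers)         # least loaded wins
    heapq.heappush(brokers, (load + size_gb, broker))
    return broker

print([place_segment(1) for _ in range(5)])
# A dead broker is simply never pushed back onto the heap: no rebalance
# step, new segments flow to whoever is healthy.
```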
If your Kafka cluster only rebalances once per quarter, rebalancing is not your bottleneck. LinkedIn needed this because they constantly add brokers across roughly 150 clusters. For them, manual rebalancing was a full-time operational tax.
Do you actually need Kafka?
To decide whether Kafka really belongs in your stack, we need a strict framework based on actual event volume.
If you process fewer than 10 million events per day, you probably do not need Kafka. Redis Streams, SQS, or even Postgres NOTIFY will work with significantly less operational overhead.
If you process between 10 million and 100 million events per day, managed Kafka such as AWS MSK makes sense. The volume justifies the tool but not the team required to self-host it.
Self-hosted Kafka is only justified once you cross 100 million events per day. You only reach LinkedIn's 2011 scale at 1.4 billion events per day.
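To make the first tier concrete, here is roughly what the Redis Streams alternative looks like. Stream, group, and consumer names are hypothetical:

```python
# An append-only log with consumer groups in a dozen lines of redis-py.
import redis

r = redis.Redis()

# Produce: XADD appends, with an approximate cap to bound memory.
r.xadd("user-events", {"action": "click", "user": "123"}, maxlen=1_000_000)

# Consume: groups track delivery per consumer, much like Kafka groups.
try:
    r.xgroup_create("user-events", "analytics", id="0")
except redis.ResponseError:
    pass  # group already exists

resp = r.xreadgroup("analytics", "worker-1", {"user-events": ">"},
                    count=10, block=1000)
for _stream, messages in resp or []:
    for msg_id, fields in messages:
        print(msg_id, fields)
        r.xack("user-events", "analytics", msg_id)
```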
Ask yourself these three questions.
Do you actually need per-partition ordering guarantees? If not, use SQS.
Do you need event replay for backfilling new services? If not, use a standard message queue.
Do you need exactly-once semantics for financial transactions? If yes, Kafka is the right tool (see the sketch below).
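For the rare case where the answer to the third question is yes, this is the machinery you are signing up for: a sketch of Kafka's transactional producer API via the confluent-kafka client, with placeholder IDs and topics.

```python
# Exactly-once across two topics: both records become visible
# atomically, or neither does. Names are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker1:9092",
    "transactional.id": "payments-service-1",  # stable per producer instance
})
producer.init_transactions()

producer.begin_transaction()
try:
    producer.produce("ledger-debits", key=b"acct-1", value=b'{"amount": -100}')
    producer.produce("ledger-credits", key=b"acct-2", value=b'{"amount": 100}')
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
```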
If you are choosing Kafka because everyone uses it, you are cargo-culting. Kafka is phenomenal for high-throughput event streaming, but it comes with a massive operational tax.
How did LinkedIn migrate?
There is a deeper lesson to learn from how LinkedIn migrated away from Kafka.
They built Xinfra, a virtualized Pub/Sub layer that abstracts the physical clusters.
This allowed them to migrate topics from Kafka to Northguard without rewriting application code. They used a dual-write mechanism to ensure zero downtime. This is what mature platform engineering looks like.
The lesson is not that you should deploy Northguard. The lesson is that you should abstract your infrastructure. Do not let your applications call Kafka APIs directly; wrap them in an internal library. If you ever outgrow your current tool, you will be able to switch without rewriting every service in your company.
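A miniature version of that idea, with every name hypothetical: applications code against an internal interface, and a dual-write implementation turns a migration into a config change.

```python
# Hypothetical internal pub/sub facade. Call sites never import a
# Kafka client directly, so swapping backends never touches them.
from typing import Protocol

class EventPublisher(Protocol):
    def publish(self, topic: str, key: bytes, value: bytes) -> None: ...

class KafkaPublisher:
    def publish(self, topic: str, key: bytes, value: bytes) -> None:
        ...  # delegate to your Kafka client here

class DualWritePublisher:
    """Writes to both backends during a migration window."""
    def __init__(self, old: EventPublisher, new: EventPublisher):
        self._old, self._new = old, new

    def publish(self, topic: str, key: bytes, value: bytes) -> None:
        self._old.publish(topic, key, value)
        self._new.publish(topic, key, value)  # cut over by dropping the old leg
```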
Conclusion
LinkedIn built Kafka in 2011 because they had a problem no existing tool could solve at 1.4 billion events per day.
They outgrew Kafka at 32 trillion events and built Northguard.
You are at 10 million events per day. You are 140 times smaller than LinkedIn was when they started with Kafka. You are 3.2 million times smaller than the scale where Kafka breaks.
Do not choose infrastructure based on who uses it.
Choose based on what problem you are solving and what scale you are operating at.
Kafka scaled 23,000 times before LinkedIn needed something else. You will never outgrow Kafka. But you might be wasting forty percent of your platform team’s time managing a supercomputer when you only need a simple queue.
LinkedIn built Kafka. LinkedIn outgrew Kafka. You never will. Choose your stack accordingly.



