Why Postgres is a dangerous default
Nobody gets fired for choosing Postgres. It is the safest, most defensible technical decision you can make. Until it isn't. Especially when you are building a highly concurrent, event-driven logger.
You pick it because it handles everything perfectly on day one, completely ignoring that you are buying a general-purpose engine for what might be a highly specialized problem.
We do this constantly in software. We slap a UUIDv4 on a primary key because it is convenient, blindly accepting the B-Tree fragmentation it guarantees at scale.
We spin up Postgres because it is the industry standard, ignoring what its storage engine actually does to our specific data shape.
Convenience dictates the architecture. Physics dictates the bill.
The Cost of MVCC Writes
Look at how Postgres manages concurrency. Multi-Version Concurrency Control (MVCC) is a brilliant mechanism. It allows readers and writers to operate simultaneously by creating new tuple versions instead of modifying data in place.
Let’s be absolutely clear.
If you are building a standard B2B SaaS application, with users, organizations, billing tables, and dynamic records, Postgres is flawless. If you are building a banking ledger where balances update constantly, MVCC is your best friend.
But if you are building an event log, a sensor stream, or an audit trail, MVCC is a parasite.
Consider a standard telemetry or event logging table. The schema looks entirely innocent:
CREATE TABLE user_events (
event_id UUID PRIMARY KEY,
user_id BIGINT,
event_type VARCHAR(50),
payload JSONB,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Executed 50,000 times a second
INSERT INTO user_events (event_id, user_id, event_type, payload)
VALUES ('...', 123, 'click', '{"button": "signup"}');

To the developer, this is a simple, lightweight append operation. To the Postgres storage engine, this is a massive overhead event.
Your rows are immutable the second they hit the disk. Yet every single insert carries a hidden 23-byte tuple header (t_xmin, t_xmax, t_cid, t_ctid) tracking version visibility for concurrent updates that will never happen.
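You can see this header on disk yourself. A quick sketch using the pageinspect extension (it ships in contrib and requires superuser privileges), assuming the user_events table from above:

```sql
-- pageinspect ships in contrib; requires superuser privileges
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Dump the per-tuple header fields from the first 8KB page of user_events.
-- Every row carries xmin/xmax version metadata, even though these rows
-- will never be updated.
SELECT lp, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('user_events', 0));
```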
At 50,000 inserts a second, you are paying a massive, constant write tax on data that will never change. You are generating gigabytes of Write-Ahead Log (WAL) just to track ghosts.
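If you want to put a number on that tax, the built-in LSN functions let you measure how much WAL an insert burst generates. A minimal sketch; the LSN value below is illustrative, not something to copy:

```sql
-- Capture the WAL position before the workload (the value shown is illustrative)
SELECT pg_current_wal_lsn();   -- e.g. 0/3000060

-- ... run the insert workload ...

-- Bytes of WAL written since the captured position
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/3000060');
```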
8KB at a Time
When you do actually update or delete records in a highly active table, the architecture fights you in a completely different way.
-- You think you are modifying data in place. You aren't.
UPDATE user_sessions
SET last_seen = NOW()
WHERE session_id = 'abc-123';

Postgres does not delete old rows. It leaves the dead tuple sitting exactly where it was on the physical 8KB page. It writes the new, updated row to a new location. A background worker called autovacuum is supposed to sweep through eventually and mark that old space as reusable.
Under a heavy write load, autovacuum loses the race. Dead tuples accumulate faster than the database can process them.
Because live and dead records share the exact same physical pages, the query planner cannot skip the graveyard. If your table has 10 million live rows and 40 million dead ones, your hardware is pulling massive amounts of garbage from the disk into your buffer pool just to find the active data. The engine literally chokes on its own history.
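You do not have to guess at the dead-to-live ratio; the statistics collector tracks it. A minimal check (the numbers are estimates, refreshed by the stats collector rather than exact counts):

```sql
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup
             / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_user_tables
WHERE relname = 'user_events';
```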
As Andy Pavlo at CMU has pointed out, if someone were building a new MVCC database today, they would not implement it the way Postgres does. Its append-only storage design is a relic of the 1980s that predates log-structured storage patterns entirely.
The Snapshot Collision
One day someone asks for a dashboard. You run a long analytical query against this live operational data.
-- The Data Analyst runs this at 9:00 AM
SELECT date_trunc('hour', created_at), count(*)
FROM user_events
GROUP BY 1
ORDER BY 1;

Running a daily sales report on a 5-gigabyte SaaS database is fine. Running a 2-hour aggregation on a table absorbing 50,000 sensor readings a second is a death sentence.
To guarantee consistent reads, Postgres holds the transaction snapshot open. While that two-hour report runs, autovacuum is paralyzed.
It cannot clean up any dead tuples that fall behind that active transaction horizon. Your dashboard is actively blocking garbage collection for the entire operational database.
Every UPDATE happening on your site during those two hours leaves a permanent scar on the disk until the report finishes.
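You can catch the culprit while it is still running. This query lists the sessions currently pinning the xmin horizon, oldest first; the oldest backend_xmin is the one blocking autovacuum:

```sql
-- Sessions holding a snapshot open. Autovacuum cannot reclaim any tuple
-- newer than the oldest backend_xmin shown here.
SELECT pid,
       state,
       now() - xact_start AS xact_age,
       backend_xmin,
       left(query, 60)    AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start;
```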
Friction generates heat. Heat generates wear. Wear destroys your query latency. Postgres was simply not designed to serve heavy OLTP and OLAP from the same instance under load.
The Optimization Treadmill
This is when you step onto the Optimization Treadmill. You do not fix the architecture. You treat the symptoms.
Step 1 - The Index Trap
Write throughput dips, so you add a composite index to speed up the read queries causing the locks. You just amplified your write penalty. Every insert now has to update multiple B-Trees.
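Before adding another index, it is worth checking what the existing ones already cost. Every entry returned here is a B-Tree that every single insert must update, whether or not anyone ever reads it:

```sql
SELECT indexrelname,
       idx_scan,                                        -- how often it is read
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE relname = 'user_events'
ORDER BY idx_scan ASC;                                  -- least-used first
```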
Step 2 - The Vacuum Illusion
The table bloats, so you aggressively tune the background workers.
ALTER TABLE user_sessions
SET (autovacuum_vacuum_scale_factor = 0.01);

You force the database to vacuum constantly. You are now burning precious CPU cycles fighting your own insert rate.
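You can watch whether that gamble is paying off. The pg_stat_progress_vacuum view shows vacuum workers in flight; if the same table keeps reappearing here around the clock, you are on the treadmill:

```sql
SELECT p.pid,
       c.relname,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
```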
Step 3 - Hardware Scaling
CPU spikes, so you upgrade the instance class. You double the RAM. You provision higher IOPS.
Every single one of these interventions is the technically correct response to a real symptom. You are doing exactly what an experienced DBA would do. But none of it changes your trajectory. You are fighting an architectural ceiling, not a misconfiguration.
The treadmill only speeds up. Each fix buys you a few months of runway before the database collects its tax again.
How to Spot the Treadmill Early
The most critical decision you can make is recognizing the treadmill before you get on it. Stop treating Postgres as a zero-cost default. Look at the physical reality of your data before you write the schema.
Are your writes predominantly inserts with no updates?
Are the rows permanently immutable once written?
Is your data volume growing by 50 percent year over year?
Do you need to run heavy analytics against the same live table?
If you nod your head to any of those, the architectural mismatch is already baked in. The only variable is how much engineering time and AWS budget you burn before you stop asking how to tune the database, and start asking the question you should have asked on day one.
Are we actually building a general-purpose OLTP system?
The HTAP Escape
If you recognize the treadmill early, you can escape it. If you actually need to serve heavy OLTP and OLAP concurrently from the same system without melting your disks, you need an architecture designed for Hybrid Transactional/Analytical Processing (HTAP).
When Postgres is no longer the default, you have three primary paths:
1. SingleStore
SingleStore (formerly MemSQL) solves the OLTP/OLAP collision by physically splitting the storage engine under the hood. It uses an in-memory rowstore for blazing-fast transactional inserts and a disk-based columnstore for your analytical queries.
It is true HTAP. You get millisecond ingest and sub-second aggregations on the exact same cluster. It speaks the MySQL wire protocol, making integration trivial.
But, it is proprietary and expensive. The in-memory rowstore demands that your operational working set fits entirely in RAM. You are trading software friction for a massive hardware bill.
2. TiDB
TiDB separates compute from storage entirely. It uses a Raft-based distributed key-value store (TiKV) for your transactional inserts, and asynchronously replicates that data to a columnar engine (TiFlash) for analytics. The smart query planner automatically routes your dashboard queries to the column nodes and your point-writes to the row nodes.
It scales horizontally forever. You can run massive analytical reports against TiFlash without ever locking an operational row in TiKV.
But, it has severe operational complexity. You do not just “spin up” TiDB. Running the control plane, compute nodes, row storage nodes, and column storage nodes requires a dedicated infrastructure team. It is a Ferrari; you have to hire the mechanics.
3. Kafka + ClickHouse (Split Stack)
If you cannot justify HTAP licensing or operational overhead, stop trying to force both workloads into one database. Write your immutable events to an append-only log (Kafka). Consume that log into a pure columnar database (ClickHouse) for your dashboards.
The physics align perfectly. Kafka’s sequential writes effortlessly absorb the OLTP load. ClickHouse’s columnar compression destroys the OLAP load. Both systems run at absolute maximum efficiency.
But, you now maintain two distinct distributed systems and the pipeline connecting them. Eventual consistency is guaranteed; immediate consistency is impossible.
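As a sketch of what the analytical side of that split looks like, here is a minimal ClickHouse mirror of the user_events table from earlier. The column and sort-key choices are assumptions for illustration, not a prescription:

```sql
-- ClickHouse DDL: append-only columnar storage, sorted by time
CREATE TABLE user_events
(
    event_id   UUID,
    user_id    Int64,
    event_type LowCardinality(String),
    payload    String,            -- raw JSON kept as a string for simplicity
    created_at DateTime DEFAULT now()
)
ENGINE = MergeTree
ORDER BY (created_at, user_id);

-- The 9:00 AM dashboard query, now scanning compressed column files
SELECT toStartOfHour(created_at) AS hour, count()
FROM user_events
GROUP BY hour
ORDER BY hour;
```

In practice the Kafka side would feed this table through ClickHouse's Kafka table engine or an external consumer, which is where the eventual-consistency trade-off lives.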
Conclusion
There is a recurring theme in engineering failures: we mistake popularity for universal fit.
Postgres is arguably the greatest relational database ever built. But it is a machine with specific operating parameters.
When you force an append-only, high-velocity stream into an MVCC engine, you are not testing the limits of Postgres. You are testing the limits of your infrastructure budget.
Stop letting convenience dictate your architecture. The cost of learning a specialized database like ClickHouse or Kafka today is infinitely lower than the cost of fighting the Optimization Treadmill tomorrow.
Physics always wins. Choose your database accordingly.


