System Design - Performance and Capacity Planning
We have built our system, chosen our databases, and added redundancy to handle failures. Now we need to answer the most critical financial and engineering question: How many servers do we actually need?
This is the job of Performance and Capacity Planning. It is the process of estimating load, predicting growth, and ensuring that you buy or rent just enough computing power to keep your system fast without wasting vast amounts of money on idle servers.
A designer who knows how to calculate capacity saves the company millions of dollars and avoids embarrassing outages during peak traffic events like a major product launch or the Super Bowl. This guide will give you the tools to stop guessing and start calculating.
The Metrics That Matter
In Blog 2 we briefly introduced Latency and Throughput. Now we need to look at how engineers actually measure and optimize these factors. When we talk about performance, we focus on three related metrics.
Latency
Latency is the time delay between the request leaving the user and the response returning. It is the time taken for a single transaction.
Measured in milliseconds (ms).
Most systems are fast 90% of the time. However, a good designer does not care about average speed; they care about worst-case speed. This is known as Tail Latency.
P95 Latency - 95% of all requests must complete within this time. If your P95 is 200 ms, it means 95 out of 100 users see a response within 200 ms.
P99 Latency - 99% of all requests must complete within this time. This covers the users with bad luck or poor network connections. Keeping the P99 low is extremely difficult but vital for a smooth user experience. If a user's experience is consistently slow, they will leave.
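To make the percentile idea concrete, here is a minimal sketch of computing tail latency from a list of measured response times, using the nearest-rank method. The sample latencies are made up for illustration; note how one slow outlier dominates the tail even though the average looks healthy.

```python
import math

def percentile(latencies_ms, p):
    """Return the latency (ms) that p percent of requests complete within."""
    ordered = sorted(latencies_ms)
    # Nearest-rank method: take the ceiling of p% of the sample count.
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# Nine fast requests and one very slow one (all values in ms).
samples = [120, 95, 110, 105, 2000, 130, 100, 98, 115, 102]
print("P50:", percentile(samples, 50), "ms")  # 105 ms - the median looks fine
print("P95:", percentile(samples, 95), "ms")  # 2000 ms - the tail does not
```

This is why dashboards report P95 and P99 rather than averages: the mean of this sample is about 298 ms, which hides the 2-second experience the unluckiest user actually had.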
Throughput
Throughput is the total number of operations processed in a given time period.
Measured in Requests Per Second (RPS) or Transactions Per Minute (TPM).
This metric tells you the maximum capacity of your system. If your server cluster can handle 10,000 RPS and you get 10,001 RPS, your system is overloaded and performance falls off a cliff.
Concurrency
Concurrency is the number of tasks or requests that a system is handling at the same exact moment.
On a single-lane road (one CPU core) you can only handle one car (one request) at a time (low concurrency). A server with 16 cores can handle 16 requests concurrently, and often many more thanks to techniques like multi-threading.
Concurrency sets the upper limit on how many users can be actively waiting for a response from your service without causing a massive slowdown.
The goal of capacity planning is to ensure your servers have enough Concurrency to meet the expected Throughput while maintaining the required Latency targets (P95 and P99).
Estimating Load
Before you provision a single server you must perform simple back of the envelope calculations. These rough estimates guide your entire design.
Converting Users to RPS
We must translate the vague requirement of "handle a lot of users" into a concrete, measurable load: Requests Per Second.
Let us use some assumptions to see how we can convert users to requests per second. DAU means Daily Active Users.
Let us assume your service has 1 million DAU and each user makes 10 requests on average (logging in, loading a feed, submitting a post).
Total daily requests: 1,000,000 users * 10 requests/user = 10 million requests/day.
Traffic is never spread evenly. Most of it will arrive during a peak window, usually the morning or evening commute. Assume 50% of your daily traffic happens in 20% of the day (about 4.8 hours).
Peak requests: 10 million total requests * 50% peak concentration = 5 million peak requests.
Peak seconds: 24 hours * 0.2 peak concentration * 3,600 seconds/hour = 17,280 peak seconds.
Peak RPS: 5,000,000 requests / 17,280 seconds ≈ 290 RPS.
Therefore your system must be able to handle at least 290 requests per second. If you estimate that a single server can handle 100 RPS, you immediately know you need at least three servers (290 / 100, rounded up).
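The back-of-the-envelope arithmetic above can be written as a few lines of code, which makes it easy to re-run when your assumptions change. The 100-RPS single-server capacity is the assumed figure from the example.

```python
import math

dau = 1_000_000              # Daily Active Users
requests_per_user = 10       # average requests per user per day
peak_share = 0.50            # 50% of traffic falls in the peak window
peak_window_fraction = 0.20  # the peak window is 20% of the day

daily_requests = dau * requests_per_user           # 10 million/day
peak_requests = daily_requests * peak_share        # 5 million in the window
peak_seconds = 24 * 3600 * peak_window_fraction    # 17,280 seconds
peak_rps = peak_requests / peak_seconds            # ~290 RPS

server_capacity_rps = 100  # assumed capacity of one server
servers_needed = math.ceil(peak_rps / server_capacity_rps)

print(f"Peak RPS: {peak_rps:.0f}")          # ~289
print(f"Servers needed: {servers_needed}")  # 3
```

Try doubling `requests_per_user` or halving `peak_window_fraction` to see how sensitive the server count is to each assumption.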
The Read Write Ratio (R/W)
As discussed in Blog 6, the R/W ratio is essential for sizing your databases.
If your system has a 10:1 R/W ratio, then for every one data write (e.g. posting a message) there are 10 reads (e.g. people loading the message).
This tells you to spend roughly 90% of your database scaling effort on read replicas and caching, because that is where the vast majority of your load comes from. If the ratio were 1:1, you would focus more on sharding the primary database.
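A quick sketch of how the ratio splits the load, reusing the 290 RPS peak estimate from earlier. The split tells you how much capacity belongs on read replicas and caches versus the primary database.

```python
peak_rps = 290        # peak load from the earlier estimate
reads_per_write = 10  # a 10:1 read/write ratio

write_rps = peak_rps / (reads_per_write + 1)  # ~26 writes/sec hit the primary
read_rps = peak_rps - write_rps               # ~264 reads/sec for replicas/cache
read_share = read_rps / peak_rps              # ~91% of all load is reads

print(f"Reads: {read_rps:.0f}/s, Writes: {write_rps:.0f}/s "
      f"({read_share:.0%} of load is reads)")
```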
Benchmarking and Profiling
Estimation is just the start. Before deployment you must verify your assumptions using tools.
Benchmarking
Benchmarking is the process of testing a component under controlled high load to see what its actual maximum capacity is.
You use a load-testing tool to simulate, say, 1,000 concurrent users hitting your API or database. You measure the system's performance and find its breaking point.
The goal is to discover the true P99 latency and maximum RPS of your current setup. This replaces guesswork with actual data. If your estimate said you need 290 RPS and your benchmark shows a single server can handle 350 RPS, you know your initial estimate was safe.
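Here is a toy benchmark harness showing the shape of a load test: fire requests at a handler from many threads, then report throughput and tail latency. `handle_request` is a stand-in that just sleeps for 10 ms; in practice you would point a dedicated load-testing tool such as wrk, k6, or JMeter at your real API instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i):
    time.sleep(0.01)  # stand-in: simulate a 10 ms request

def timed_request(i):
    t0 = time.perf_counter()
    handle_request(i)
    return (time.perf_counter() - t0) * 1000  # latency in ms

def benchmark(total=200, concurrency=50):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_request, range(total)))
    elapsed = time.perf_counter() - start
    rps = total / elapsed                          # achieved throughput
    p99 = latencies[int(len(latencies) * 0.99) - 1]  # tail latency
    return rps, p99

rps, p99 = benchmark()
print(f"Throughput: {rps:.0f} RPS, P99 latency: {p99:.1f} ms")
```

Raising `concurrency` until `p99` starts climbing is exactly how you find the breaking point described above.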
Profiling
While benchmarking tells you what the maximum throughput is, Profiling tells you why it is that way.
Profiling involves running special tools on your application code that track exactly how much time your server spends on different tasks.
Profiling helps you find the bottleneck: the single slowest step that is limiting your overall speed. The bottleneck is usually one of four things.
CPU - Your server is spending too much time executing code (pure computation).
Memory - Your server is constantly moving data in and out of memory.
Network I/O - Your service is waiting too long for external services or databases over the network.
Disk I/O - Your database is spending too much time physically reading data from the hard drive.
You cannot fix a performance problem until you profile the code and identify the specific line or operation that is slowing down the entire chain.
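Here is a minimal profiling sketch using Python's built-in cProfile. The two functions are made-up stand-ins for a CPU-bound step and an I/O-bound step inside a real request handler; the sorted stats reveal which one dominates the total time.

```python
import cProfile
import io
import pstats
import time

def cpu_heavy():
    return sum(i * i for i in range(200_000))  # CPU-bound calculation

def io_heavy():
    time.sleep(0.05)  # simulates waiting on a database or network call

def handle_request():
    cpu_heavy()
    io_heavy()

profiler = cProfile.Profile()
profiler.enable()
handle_request()
profiler.disable()

# Sort by cumulative time so the most expensive call chains appear first.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())  # the sleep inside io_heavy dominates: an I/O bottleneck
```

With these numbers, the profile would point you at the I/O wait, not the CPU work; buying faster CPUs would change nothing.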
Capacity Planning and Autoscaling
Once you know your required capacity (290 RPS in our example) you need a plan for managing those resources in the real world. This is Capacity Planning.
Safety Margin
You never provision for exactly 290 RPS. If you do, your system will crash when traffic hits 291 RPS.
Always add a Safety Margin, usually 50% to 100% above your peak calculation. If you need 290 RPS, you provision for roughly 435 to 580 RPS. This buffer gives you breathing room for unexpected traffic spikes or inefficient code.
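The margin calculation in code, reusing the 290 RPS peak and the assumed 100-RPS-per-server capacity from earlier:

```python
import math

peak_rps = 290
margin_low, margin_high = 1.5, 2.0  # 50% to 100% headroom

provision_low = peak_rps * margin_low    # 435 RPS
provision_high = peak_rps * margin_high  # 580 RPS

server_capacity = 100  # assumed capacity of one server
servers = math.ceil(provision_high / server_capacity)  # 6 servers at the top end

print(f"Provision for {provision_low:.0f}-{provision_high:.0f} RPS "
      f"({servers} servers at {server_capacity} RPS each)")
```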
Autoscaling
Manually provisioning servers wastes money. You do not need 10 servers running at 2 AM when only 100 people are online; you need 10 servers running at the 5 PM peak.
Autoscaling is the process of automatically adding or removing servers based on real time load metrics.
Reactive Autoscaling
This is the most common method. The system monitors a metric like CPU utilization.
Scale Out - If the average CPU load across the server cluster stays above 70% for more than five minutes, the autoscaler automatically launches three new servers.
Scale In - If the average CPU load drops below 20% for a period of time, the autoscaler automatically shuts down two servers to save money.
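The two rules above boil down to a small decision function. This is a toy sketch with the thresholds and step sizes from the example; a real autoscaler (such as AWS Auto Scaling or the Kubernetes Horizontal Pod Autoscaler) also enforces cooldown periods so the fleet does not flap between scaling out and in.

```python
def autoscale(current_servers, avg_cpu_percent,
              scale_out_step=3, scale_in_step=2, min_servers=2):
    """Return the new server count for the observed average CPU load."""
    if avg_cpu_percent > 70:
        # Scale out: cluster is running hot, add capacity.
        return current_servers + scale_out_step
    if avg_cpu_percent < 20:
        # Scale in: cluster is mostly idle, shed servers (but keep a floor).
        return max(min_servers, current_servers - scale_in_step)
    return current_servers  # steady state, no change

print(autoscale(10, 85))  # 13 -> scale out under heavy load
print(autoscale(10, 10))  # 8  -> scale in to save money
print(autoscale(10, 50))  # 10 -> steady state
```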
Predictive Autoscaling
Used for predictable events. For example, an e-commerce site knows its traffic will be 300% higher every Black Friday. It uses predictive scaling to launch 15 servers at 11 PM on Thanksgiving night, handling the predictable traffic spike before the servers ever get overloaded.
This combination of reactive and predictive scaling ensures you pay only for the compute power you are actually using, making the most of your budget (a key constraint from Blog 2).
The Launch Spike
Imagine a gaming company launches a new console.
The team predicts a peak of 50,000 concurrent users resulting in 5,000 RPS for the first hour of sales.
The system designer provisions for 5,500 RPS with a fixed number of servers. They forgot about the bottleneck.
The launch traffic hits 5,000 RPS and the system slows to a crawl, even though the CPUs are only at 50% usage. A quick profile reveals the database is the bottleneck: it is spending too much time reading from the disk (Disk I/O), which prevents the servers from handling more traffic.
The designer realizes they needed to spend less money on web servers and more on upgrading the database's I/O performance or adding more read replicas. Capacity planning failed not because the math was wrong, but because the designer did not correctly identify the slowest component: the bottleneck.
The Next Step
Capacity planning is the financial guardrail of system design. It takes you from theory to reality by translating business goals into measurable RPS targets. You must always use back-of-the-envelope math to set a baseline, then use benchmarking and profiling to find the true bottlenecks, and finally implement autoscaling to manage costs and maintain performance automatically.
We have now covered the essentials of building and scaling systems. The next challenge is the final and perhaps most mind-bending theoretical problem in distributed systems: the CAP Theorem.
In the next post, "Consistency versus Availability and Partition Tolerance," we will dive deep into this fundamental trade-off, learning what each term truly means and how the CAP theorem forces you to make painful choices between real-time accuracy and reliability.


