System Design - Data Design and Storage Strategies
We have covered the fundamentals: how to gather requirements, how to organize a system into microservices, and how to keep the system reliable when things inevitably fail.
Now we turn to the most important part of any system: the data.
Data is the heart of your application. Everything else (the servers, the APIs, the load balancers) can crash, be replaced, or be rewired. But if you lose your data, you lose your business. Choosing the right place to store that data is therefore the single biggest architectural decision a designer makes.
This post explains the core storage options. We will explore the trade-offs between the two major database types, SQL and NoSQL, and then look at essential techniques like sharding and partitioning that allow massive companies like Google and Facebook to manage data that simply would not fit on one computer.
SQL vs NoSQL vs GraphDB
When picking a database, you are choosing a structure for how you organize and retrieve information. The choice is primarily driven by your data's structure and your system's read/write ratio (from Blog 2).
SQL Databases (Relational)
These are the classic databases, like MySQL, PostgreSQL, and Oracle. They are built on the relational model, where data is stored in strict tables with predefined rows and columns.
Highly structured. Think of a spreadsheet where every row has exactly the same columns, defined by a schema.
Key Advantage: ACID Guarantees
Atomicity - All or nothing. If a transaction fails halfway through, the entire operation is cancelled. Crucial for financial transfers.
Consistency - Transactions only bring the database from one valid state to another. Data integrity is guaranteed.
Isolation - Multiple transactions happening at the same time do not interfere with each other.
Durability - Once data is committed, it will not be lost.
Best Use Cases
Anywhere you need strong consistency and complex relationships: banking transactions, user authentication, or inventory management, where the numbers must always add up.
Scaling reads is easy using replicas. Scaling writes is hard because you have only one primary database managing all those ACID guarantees.
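Atomicity is easiest to see in code. Below is a minimal sketch using Python's built-in `sqlite3` module; the `accounts` table, names, and amounts are made-up illustrations, not a production schema. The `with conn:` block opens a transaction that commits on success and rolls back on any exception, so a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts, all-or-nothing."""
    try:
        with conn:  # transaction: commit on success, rollback on exception
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
    except ValueError:
        pass  # transfer cancelled; no half-finished state survives

transfer(conn, "alice", "bob", 30)   # succeeds
transfer(conn, "alice", "bob", 500)  # fails halfway, fully rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 70, 'bob': 80}
```

The second transfer subtracts the money, detects the negative balance, and the whole transaction is undone; that is exactly the all-or-nothing behaviour a bank relies on.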
NoSQL Databases (Non Relational)
These are newer databases built to solve the scaling challenges of the modern internet. They sacrifice some of the strict ACID guarantees for massive speed and flexibility.
There are three main types of NoSQL databases.
Key-Value Stores (Example: Redis)
The simplest type, like a giant Python dictionary or JavaScript object. You store a key (like user123) and a value (the user's data). Extremely fast for reads. Great for caching, sessions, and leaderboards.
Document Databases (Example: MongoDB)
Stores data in flexible JSON-like documents. No strict schema is required, so you can add new fields to one document without affecting others. Great for content management systems and user profiles, where the data structure changes often.
Wide Column Stores (Example: Cassandra)
Designed for extremely large scale and high write throughput. Used by companies like Facebook to handle millions of writes per second across many servers.
NoSQL is much easier to scale horizontally, but it often provides only Eventual Consistency: if you write data to one server, it may take a moment or two for all the other servers to have the updated copy. This is usually fine for social media posts but not for bank accounts.
Graph Databases (Relationship Focused)
Graph databases like Neo4j are fundamentally different from both SQL and other NoSQL stores because they prioritize relationships above all else.
Core Structure
These databases use nodes (entities like a Person or a Product) and edges (the connections or relationships between them like “works at” or “is friends with”). The edge itself can hold data such as the duration of a friendship or the rating of a purchase.
Primary Use Case
They are highly optimized for quickly traversing complex networks and finding connections. This makes them essential for social networks, recommendation engines, and fraud detection, where the path between two data points matters more than the data points themselves.
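The kind of query a graph database excels at can be sketched with a plain adjacency list and a breadth-first search: find the shortest "friendship path" between two people. The names and friendships below are made up; a graph database does this traversal natively and at much larger scale.

```python
from collections import deque

# Nodes are people; edges are mutual friendships.
friends = {
    "ana":  ["ben", "cara"],
    "ben":  ["ana", "dev"],
    "cara": ["ana", "dev"],
    "dev":  ["ben", "cara", "eli"],
    "eli":  ["dev"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns the shortest node path, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path(friends, "ana", "eli"))  # ['ana', 'ben', 'dev', 'eli']
```

In a relational database the same question would require a self-join per hop; a graph store follows edges directly, which is why path-heavy workloads favor it.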
Scaling Strategies for Databases
Once your data volume or traffic exceeds what a single database server can handle you must implement techniques to split the data.
Sharding (Horizontal Partitioning)
Sharding is the process of splitting a single logical database into multiple smaller physical database servers called shards.
Imagine one giant library (the database) that has grown too big. You split it into three smaller branch libraries (the shards), each holding a different portion of the collection.
When a user signs up you might assign them to a shard based on their User ID number.
Users with IDs 1 to 1 million go to Shard A.
Users with IDs 1,000,001 to 2 million go to Shard B.
And so on.
The load is distributed: each shard handles only a third of the data and a third of the traffic, so the system as a whole can serve roughly triple the load.
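The range-based routing described above can be sketched in a few lines. The shard names and ID boundaries are illustrative; real systems often use hash-based routing instead, precisely to avoid the hot-shard problem discussed next.

```python
# Inclusive ID ranges mapped to shard names (illustrative values).
SHARD_RANGES = [
    (1,         1_000_000, "shard_a"),
    (1_000_001, 2_000_000, "shard_b"),
    (2_000_001, 3_000_000, "shard_c"),
]

def route(user_id):
    """Return the shard responsible for this user ID (the sharding key)."""
    for low, high, shard in SHARD_RANGES:
        if low <= user_id <= high:
            return shard
    raise ValueError(f"no shard configured for user {user_id}")

print(route(42))         # shard_a
print(route(1_500_000))  # shard_b
```

This lookup is exactly the "Router or Lookup Service" a sharded system needs: every request must first answer "which shard owns this key?" before it can touch the data.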
Challenges of Sharding
Complexity - You need a Sharding Key (the User ID in this case) and a Router or Lookup Service that knows which shard to send each request to.
Joins - Running a query that needs data from both Shard A and Shard B becomes very slow and complicated.
Hot Shards - If a celebrity is on Shard B and suddenly gains 10 million followers, Shard B becomes much busier than Shard A. This uneven load is called a hot shard and requires manual rebalancing.
Vertical Partitioning
Instead of splitting the data by row (sharding), you split it by column.
Instead of one giant User Details table with 50 columns, you split it into two tables: User Identity (Name, Password, Location) and User Activity (Last Login Time, Post Count).
If, say, 90% of requests only need the User Identity information, you load only that smaller table from disk, making those queries much faster. The rarely accessed data is kept separate.
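Here is a small sketch of that split using SQLite; the table and column names mirror the example above but are otherwise made up. The hot identity table is queried on its own, and the partitions can still be joined back together by user_id when needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Vertical partitions of one logical "user" record.
    CREATE TABLE user_identity (user_id INTEGER PRIMARY KEY, name TEXT, location TEXT);
    CREATE TABLE user_activity (user_id INTEGER PRIMARY KEY, last_login TEXT, post_count INTEGER);
    INSERT INTO user_identity VALUES (1, 'Alice', 'Berlin');
    INSERT INTO user_activity VALUES (1, '2024-01-01', 42);
""")

# The common case touches only the small, hot identity table ...
name, location = conn.execute(
    "SELECT name, location FROM user_identity WHERE user_id = 1").fetchone()

# ... while the occasional query joins the partitions back together.
row = conn.execute("""
    SELECT i.name, a.post_count
    FROM user_identity i JOIN user_activity a USING (user_id)
    WHERE i.user_id = 1
""").fetchone()
print(name, location, row)
```

The win is purely physical: the frequent query reads fewer bytes per row, so more rows fit in memory and cache.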
Data Warehousing and Analytics
What happens when you need to run complex reports, like "What was our average sale price in every state last month?", that require reading and analyzing billions of records? You cannot run such a report on your live Online Transaction Processing (OLTP) database, because it would slow down the entire application for all live users.
The solution is a Data Warehouse.
A separate database optimized only for fast, complex, large-scale analytical queries.
Once every night, or every hour, data from the live OLTP database is copied and transformed into the Data Warehouse, a process commonly called ETL (Extract, Transform, Load).
Analysts can then run their multi-hour complex reports without ever impacting the speed of the live user application. This creates a clean separation between the live system and the analysis system.
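A toy version of that ETL step makes the separation obvious: raw OLTP rows are read once and reduced into an aggregated "warehouse" shape that reports query instead of the live store. All of the sales data below is made up.

```python
# Raw rows as they might look in the live OLTP store (illustrative data).
oltp_sales = [
    {"state": "CA", "price": 100},
    {"state": "CA", "price": 300},
    {"state": "NY", "price": 200},
]

def etl(rows):
    """Transform raw sale rows into average sale price per state."""
    totals = {}  # state -> (sum of prices, count)
    for row in rows:
        total, count = totals.get(row["state"], (0, 0))
        totals[row["state"]] = (total + row["price"], count + 1)
    return {state: total / count for state, (total, count) in totals.items()}

warehouse = etl(oltp_sales)  # run on a schedule, off the live database
print(warehouse)  # {'CA': 200.0, 'NY': 200.0}
```

The report then reads the tiny pre-aggregated table, not billions of raw records, which is the whole point of the warehouse.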
Indexing and Replication
Two final crucial techniques determine the efficiency and reliability of your data layer.
Indexing for Speed
If a database is a library, an index is the card catalog.
If you want to find a book by its title, you do not walk every aisle scanning every shelf; you look the title up in the catalog, which tells you exactly where the book sits.
Likewise, an index is a separate, sorted data structure that tells the database exactly where on the hard drive to find the data you are looking for.
Indexes make reads lightning fast but they slightly slow down writes because every time you update a record the index also has to be updated. A good designer knows exactly which columns to index based on the system’s most common read queries.
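You can watch an index change a query's execution strategy using SQLite's query planner. The table and index names below are illustrative; the exact plan wording varies between SQLite versions, but the shift from a full scan to an index search is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")

def plan(sql):
    """Return the query planner's strategy for a statement."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[-1]

query = "SELECT id FROM books WHERE title = 'Dune'"

before = plan(query)  # without an index: a full table scan
conn.execute("CREATE INDEX idx_books_title ON books (title)")
after = plan(query)   # with the index: a direct index search

print(before)
print(after)
```

The trade-off described above is visible here too: every INSERT or UPDATE on `books` now also maintains `idx_books_title`, which is why indexes speed up reads at a small cost to writes.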
Read Replicas for Scalability
We touched on this in Blog 5. Read replicas are the workhorse of horizontal scaling for the database layer.
Your Primary database handles all Create, Update, and Delete (C, U, and D) operations. It sends a copy of every change to its Replica databases.
Your web servers send their Read requests (R operations) only to the Replicas. Since most applications are read-heavy (e.g. 10 reads for every 1 write), you can add 5, 10, or 20 read replicas to handle the massive volume of user traffic without ever overloading the single Primary database.
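The routing rule is simple enough to sketch: writes go to the primary, reads are spread across replicas round-robin. The connection names below are plain strings standing in for real database clients, and classifying a statement by its first keyword is a deliberate simplification.

```python
import itertools

class RoutedDB:
    """Route reads to replicas (round-robin) and writes to the primary."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def execute(self, sql):
        # Reads can use any replica; C/U/D must hit the primary.
        is_read = sql.lstrip().upper().startswith("SELECT")
        target = next(self._replicas) if is_read else self.primary
        return target, sql

db = RoutedDB(primary="primary", replicas=["replica_1", "replica_2"])
print(db.execute("SELECT * FROM users"))             # routed to replica_1
print(db.execute("UPDATE users SET name = 'x'"))     # routed to primary
print(db.execute("SELECT * FROM posts"))             # routed to replica_2
```

One caveat worth remembering: because replication takes a moment to propagate, a read that lands on a replica may briefly miss a write just sent to the primary, the same eventual-consistency effect discussed in the NoSQL section.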
The Next Step
The database is the most critical piece of the puzzle. The choice between SQL and NoSQL is determined by whether you prioritize ACID consistency (SQL) or eventual consistency and flexibility (NoSQL).
When you must scale your data layer, you use Sharding to split the load and Read Replicas to offload read traffic. The whole goal is to ensure your users get the data they need quickly, consistently, and reliably.
We have now covered the complete lifecycle of a system from requirements to architecture to reliability to data.
In the final post of this series, "The System Design Interview: A Comprehensive Example", we will bring all six concepts together. We will take a classic design problem, like Designing Twitter or Designing a URL Shortener, and use every concept we have learned, from functional requirements to sharding, to build a complete and scalable architecture. This will be your final test run for a real world design interview.


