LARPing as a genius until it works

Diving into Systems Design

The most intimidating thing about web development has never been digging through massive codebases or learning a new language. That's the fun part. For me, it's been the systems that support them. And the systems that support those systems. I've never had to think too deeply about these issues since these decisions were made long before I started working at my current company. But of course, curiosity got the best of me and I decided to do a deeper dive into some basics of distributed computing.

A good friend of mine recommended that I read Designing Data-Intensive Applications by Martin Kleppmann to get my feet wet on the topic. I found nearly all of it to be invaluable, but I thought I'd write mainly on the second part of the book: Distributed Data. In it, he speaks on replication, partitioning/sharding, transactions, and how everything can and will go wrong. This won't be an exhaustive write-up, but I will add some additional resources at the end that I've found helpful over the years.

Replication

Keeping a single copy of your data is not a good idea. It is entirely possible that your DB can become unavailable at any time for any reason, and our system needs to be reliable in the sense that it needs to be available at all times. To guard against this, we replicate our database and create multiple copies that can be housed in different locations. This allows us to route traffic to another DB in the case that one fails. We can also choose to house these DBs in locations that are closer to a certain subset of users for quicker load times. So, replication can be used defensively (making sure our data is available) and offensively (making sure our data is available quickly).

While the explanation might be simple enough, the implementation is tricky. It turns out it isn't as simple as copying and pasting your DB everywhere! We need a way to keep all of our replicas up to date so that when our leader goes down, the replicated data isn't too far behind. Remember, no downtime is key to any reliable distributed system. One of the most common ways to do this is leader-based replication. Let's dive deeper into that.

Leader-based replication

Leader-based replication designates a single node as the leader to handle all the writes and then propagate those changes to several followers, or replicas. Those replicas can then handle all read-only queries. Additionally, it's common to have a synchronous follower and several asynchronous followers. A synchronous follower means that the leader waits until it has received confirmation that the follower received the write before reporting the write as successful to the user. With asynchronous followers, the leader sends the change but doesn't wait for a response.

Let's demonstrate this with a simple example in the figure below where we have one leader, one synchronous follower, and one async follower. First, the user attempts to update a value and the leader sends out two messages, one to each of its followers. Follower 1 acts as our synchronous follower, and once it has confirmed it received the write, it sends a success message back to the leader. Now that the leader has received a success message from all (in this case just one) sync followers, it can make the write visible to all other clients. Eventually, follower 2 sends its success message back to the leader, but this does not affect the leader's ability to make the write visible.

	   user performs write
User - - - -------------------------------- - - - - - - - - -> time
	    \  waiting for follower...   / ok              
Leader - - - ---------------------------- - - - - - - - - - ->
	data change	\          / ok            /
Follower 1 - - - - - - - ---------- - - - - - - - - - - - - ->
			  \	                 / ok
Follower 2 - - - - - - - - ---------------------- - - - - - ->

Leader-based replication with one synchronous and one asynchronous follower
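
To make the flow above a bit more concrete, here's a minimal sketch in Python of a leader that blocks on its synchronous follower and fires-and-forgets its asynchronous ones. The Leader and Follower classes are purely illustrative; no real database propagates writes with method calls like this, but the waiting behavior is the same idea.

    import threading

    class Follower:
        """Illustrative follower: applies changes it receives from the leader."""
        def __init__(self, name):
            self.name = name
            self.data = {}

        def apply(self, key, value):
            self.data[key] = value
            return "ok"

    class Leader:
        """Illustrative leader: waits on the synchronous follower,
        fires and forgets the asynchronous ones."""
        def __init__(self, sync_follower, async_followers):
            self.data = {}
            self.sync_follower = sync_follower
            self.async_followers = async_followers

        def write(self, key, value):
            self.data[key] = value
            # Block until the synchronous follower confirms the change...
            ack = self.sync_follower.apply(key, value)
            assert ack == "ok"
            # ...then send the change to async followers without waiting for a reply.
            for follower in self.async_followers:
                threading.Thread(target=follower.apply, args=(key, value)).start()
            return "write visible"   # safe to report success to the client now

    leader = Leader(Follower("follower-1"), [Follower("follower-2")])
    print(leader.write("user:42", "ada@example.com"))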

A quick note on failover

Let's say that our leader node crashes for some reason. A new node needs to be picked to accept writes, and the remaining nodes need to be reconfigured so that they send and receive messages from the new leader. This process is called failover, and there are a few steps we take to perform it automatically.

  1. Determining that the leader failed. This is usually done with a timeout.
  2. Choosing the new leader. One example would be picking the replica with the most up-to-date data.
  3. Reconfiguring the system to respect the new leader that was chosen (sketched below).

We can see why replication is so powerful. Failures happen. All the time. Having multiple copies of our data provides us with a reasonable way to account for these failures while still providing high availability for our users.
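
Here's a rough sketch of that three-step loop, assuming (purely for illustration) that each node exposes a last-heartbeat timestamp and a replication position. Real systems use consensus protocols and far more careful bookkeeping, but the shape is the same.

    import time

    HEARTBEAT_TIMEOUT = 5.0   # seconds of silence before we declare the leader dead

    def leader_failed(leader):
        # Step 1: detect failure with a timeout.
        return time.time() - leader["last_heartbeat"] > HEARTBEAT_TIMEOUT

    def elect_new_leader(replicas):
        # Step 2: pick the replica with the most up-to-date data
        # (here, the highest replication position).
        return max(replicas, key=lambda r: r["replication_position"])

    def failover(leader, replicas):
        if not leader_failed(leader):
            return leader
        new_leader = elect_new_leader(replicas)
        # Step 3: point every other node at the new leader.
        for replica in replicas:
            if replica is not new_leader:
                replica["leader"] = new_leader["name"]
        return new_leader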

Things I don't want to talk about

We've covered just the basics of replication. We didn't talk about the other ways to replicate a database, such as using multiple leaders or even leaderless replication (popular in Dynamo-style DBs). We also haven't even talked about how replication occurs under the hood! While those topics are interesting, I just don't want to write about them since someone else can probably do a better job of explaining them to you. Just let me live, man.

Partitioning

Ok, so now you have lots of copies of your data. Fantastic! Well, assuming you're building something great, that data is only going to continue to grow and grow. Not only is your data going to grow, but your user base is going to grow which means more throughput. What happens when you combine a large dataset and a lot of throughput? A system that performs very slowly. There's no amount of indexing that is going to be able to overcome those issues. What if there was a way to take your large dataset but get the performance benefits of working on a much smaller dataset? This is what partitioning does for us. Rather than having all of your data live on one disk, we can split it up among many disks. When we need to retrieve data, we ask a coordinator to politely tell us where the data we are looking for lives. Then we can perform our query on a smaller dataset. Much more efficient!

Methods for partitioning

All we need to do is chop up our database into smaller pieces, right? Well yes, but how you divide it up is very important. Let's say you have a database where you decide to partition your users based on the first letter of their first names. The users starting with A-C will be on one partition, the ones with D-F will be on another, and so on sequentially. While you have divided up your data, you've now created data that's skewed (the overloaded partition is sometimes called a hot spot). It's much more likely that your users will have a first name that starts with A-C or D-F rather than X-Z. We want a method that will distribute our data much more evenly and prevent this issue from happening.

Key Range Partitioning

One way would be partitioning by key range. Let's assume that we know the most popular first names in our database. We could decide to manually create ranges that would lead to a more even distribution. Say we have a lot of users whose names happen to start with the letters A-B and C-D but not so many that start with Q-Z. We could create our own boundaries such that they reflect the distribution of our data. This can be done manually by an admin or automatically by the database. One benefit of this method is that range scans are easy. In this example, we could easily ask the database for an alphabetically ordered list of users starting at L and ending at P. Of course, you can choose whatever key you want to divide up your data. A more concrete example might be dividing up a table that contains audits of user actions using a timestamp. If you needed to get a certain user's actions within a given timeframe, that would be easy with this approach.
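
As a sketch of how range-partitioned routing might work, here's a toy function in Python. The split points are made up and hand-picked, the way an admin might choose them; the point is that a single key maps to one partition and a range like L-P only touches a contiguous run of partitions.

    import bisect

    # Hand-picked split points: first letters before "D" land on partition 0,
    # "D" up to (but not including) "G" on partition 1, and so on.
    SPLIT_POINTS = ["D", "G", "M", "T"]   # 5 partitions total

    def partition_for(key: str) -> int:
        """Return the partition a key belongs to under range partitioning."""
        return bisect.bisect_right(SPLIT_POINTS, key[0].upper())

    def partitions_for_range(start: str, end: str) -> list:
        """A range scan like L-P only needs a contiguous run of partitions."""
        return list(range(partition_for(start), partition_for(end) + 1))

    print(partition_for("Adam"))            # 0
    print(partition_for("Diane"))           # 1
    print(partitions_for_range("L", "P"))   # [2, 3] -- only two partitions to visit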

Hash Key Partitioning

Of course, we still run the risk of having skewed data since our boundaries won't be perfect. At minimum, we are going to have the database rebalance itself more frequently. A better way of avoiding skewed data would be to use a hash function (this is the method that Citus uses with Postgres). Whenever a key is passed in, the hash function returns the number of the partition that the key should belong to. Even if the keys are very similar, like "Adam" and "Ada", they will likely end up on different partitions, so we have a much better chance of distributing the keys evenly. What we trade for uniformity is the ability to do range scans efficiently. Since our data is scattered across partitions, we would need to visit every partition to perform a range query, making it massively inefficient.
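
Here's the same routing idea sketched with a hash function instead. The choice of md5 and eight partitions is arbitrary and not what Citus actually does internally; it just shows how the assignment stops depending on the spelling of the key, and why a range scan now has to fan out to every partition.

    import hashlib

    NUM_PARTITIONS = 8

    def hash_partition_for(key: str) -> int:
        """Map a key to a partition using a stable hash of the key."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_PARTITIONS

    # The partition now depends on the hash, not the spelling, so lexicographically
    # adjacent keys get scattered effectively at random across the partitions...
    print(hash_partition_for("Adam"), hash_partition_for("Ada"))
    # ...which is great for an even spread, but a scan over A-C now has to
    # ask every one of the eight partitions.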

Transactions

Let's talk about everyone's favorite acronym: ACID. If you're not familiar, ACID stands for Atomicity, Consistency, Isolation, and Durability. I'm going to focus mostly on the I, since isolation is where concurrency gets interesting.

Isolation Levels

If our system is going to be reliable, we need it to handle concurrency well. If user A makes an update to a table, we don't want user B to be able to see that update until it's been committed (a dirty read). What about when user A decides to update a row at the same time as user B? We don't want one of them to overwrite the other's uncommitted change (a dirty write); the data should stay consistent across the entire system rather than being partially updated in certain spots. One way to avoid this is to completely forgo concurrency and have the database perform reads and writes one at a time. This is called serializable isolation. Hooray! We solved concurrency. Well, not really. There are massive performance costs to pulling this off. Instead, it's much more common to accept that things will never be perfect and to use a weaker isolation level. I'm not going to talk about every way databases attempt to handle concurrency, just a few that I feel are worth your time.

Read Committed

Let's start with a basic level of isolation: read committed. With this type of isolation, we only read data that has been committed and only overwrite data that has been committed, which is exactly what protects us from dirty reads and dirty writes. Under the hood, we use something called a row-level lock to prevent the dirty writes. Essentially, if a transaction is attempting to write to a row, it gets to hold a lock that says "I'm the only thing that gets to update this row right now". All other transactions must wait to write to that row until that lock has been released. However, we can still allow reads to take place on that row. We just have to make sure that the transaction reads the old committed value rather than the new, uncommitted one, which the DB is able to provide to us.
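
A toy version of that behavior might look like the sketch below: one lock per row for writers, while readers simply get the last committed value. This is not how Postgres actually implements read committed, just an illustration of the two rules.

    import threading

    class Row:
        """A toy row under read committed: writers take the row lock, readers
        always see the last committed value, even mid-write."""
        def __init__(self, value):
            self.committed_value = value
            self.uncommitted_value = None
            self.lock = threading.Lock()   # row-level write lock

        def begin_write(self, value):
            self.lock.acquire()            # other writers block here until commit
            self.uncommitted_value = value

        def commit(self):
            self.committed_value = self.uncommitted_value
            self.uncommitted_value = None
            self.lock.release()

        def read(self):
            # Readers never block: they get the old committed value,
            # which is what prevents dirty reads.
            return self.committed_value

    row = Row("alice@example.com")
    row.begin_write("alice@newdomain.com")
    print(row.read())   # still "alice@example.com" -- the write isn't committed yet
    row.commit()
    print(row.read())   # now "alice@newdomain.com"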

Snapshot Isolation

Read committed, while effective, can still lead to issues such as read skew, where a user reads from a database that's in an inconsistent state (if you're not familiar with read skew, you should look it up). How can we avoid these timing anomalies? This is what snapshot isolation attempts to solve, and I'll use the approach that Postgres takes. Whenever a transaction is created, it's given a transaction ID (txid) from a counter that increments every time a new transaction is created. Using this transaction ID, we can only view or modify rows (tuples) that have a txid less than or equal to the one we're using. Since we can only access rows that were created in the "past", we avoid the timing anomaly of read skew.
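
Here's a heavily simplified sketch of that visibility check. Postgres's real visibility rules also track whether the creating transaction committed, aborted, or is still in flight, and handle deleted versions; this only shows the "no peeking into the future" part.

    from dataclasses import dataclass
    from itertools import count

    txid_counter = count(1)   # each new transaction gets the next id

    @dataclass
    class TupleVersion:
        created_by: int   # txid that created this version of the row
        value: str

    def begin_transaction() -> int:
        return next(txid_counter)

    def visible(version: TupleVersion, my_txid: int) -> bool:
        # Simplified rule: we only see versions created by transactions that
        # started at or before ours.
        return version.created_by <= my_txid

    t1 = begin_transaction()                       # txid 1
    v1 = TupleVersion(created_by=t1, value="balance=100")
    t2 = begin_transaction()                       # txid 2
    v2 = TupleVersion(created_by=t2, value="balance=50")

    print(visible(v2, t1))   # False -- t1 can't see data written by the "future" t2
    print(visible(v1, t2))   # True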

Actual Serial Execution

There are real reasons to use serializable isolation, such as with banking software. People really hate it when you mess up money, so removing concurrency is really enticing even at the cost of performance. One way to do this is with... literal serial execution. Yep, just have everything run on a single thread. This is actually somewhat feasible nowadays since RAM is so cheap: just load your dataset into memory and get to work. Another way to get more performance out of this method is to use stored procedures. Rather than bouncing between our application code and the database constantly, we wrap everything in one short procedure to avoid the extra round trips. (Random tidbit: you can write stored procedures in Redis with Lua. This only strikes me as interesting because I use nvim btw)
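
A tiny sketch of the single-threaded idea: transactions are submitted as whole units (in the same spirit as stored procedures) and one worker executes them strictly one at a time, so there are no locks and no races. The transfer example and its account names are made up.

    import queue
    import threading

    db = {"alice": 100, "bob": 100}
    transactions = queue.Queue()

    def transfer(source, dest, amount):
        """A whole transaction submitted as one unit, so the executor never
        waits on application round trips mid-transaction."""
        if db[source] >= amount:
            db[source] -= amount
            db[dest] += amount

    def run_serially():
        # One thread processes transactions one at a time: no concurrency at all.
        while True:
            txn = transactions.get()
            if txn is None:
                break
            txn()

    worker = threading.Thread(target=run_serially)
    worker.start()
    transactions.put(lambda: transfer("alice", "bob", 30))
    transactions.put(lambda: transfer("bob", "alice", 10))
    transactions.put(None)   # shut down the worker
    worker.join()
    print(db)   # {'alice': 80, 'bob': 120}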

Two-Phase Locking

Let's talk about another way to serialize your transactions, perhaps the most famous way: Two-Phase Locking. At this isolation level, a transaction must have an exclusive lock to write to the database. Let's say user A is reading from a row that user B wants to update. User B must wait until user A relinquishes the lock it holds on the row; once it's released, user B may update. And vice versa: if user A wants to read from that same row while user B is updating it, then it must wait until user B has released the lock. While this does avoid the race conditions we would find with weaker isolation levels, it's not uncommon for concurrent transactions to end up waiting on each other's locks with neither able to make progress. We call this a deadlock. Most DBs have ways to detect deadlocks and will abort one of the transactions so that the other can make progress.
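
Below is a compact sketch of the two phases: locks are acquired as the transaction touches rows (the growing phase) and only released at commit (the shrinking phase). To stay short it uses exclusive locks for both reads and writes, whereas real 2PL also has shared read locks and, as mentioned, deadlock detection.

    import threading
    from collections import defaultdict

    class LockManager:
        """Toy two-phase locking: one exclusive lock per row."""
        def __init__(self):
            self.row_locks = defaultdict(threading.Lock)

        def acquire(self, txn, row_key):
            if row_key in txn.held_keys:
                return                      # this transaction already holds the lock
            lock = self.row_locks[row_key]
            # If another transaction holds this lock, we block here. Two transactions
            # blocking on each other's rows is exactly a deadlock; a real database
            # would detect that and abort one of them.
            lock.acquire()
            txn.held_keys.add(row_key)
            txn.held_locks.append(lock)

    class Transaction:
        def __init__(self, lock_manager):
            self.lm = lock_manager
            self.held_keys = set()
            self.held_locks = []

        def read(self, db, row_key):
            self.lm.acquire(self, row_key)   # growing phase
            return db.get(row_key)

        def write(self, db, row_key, value):
            self.lm.acquire(self, row_key)   # growing phase
            db[row_key] = value

        def commit(self):
            # Shrinking phase: release everything at once, only at the end.
            for lock in self.held_locks:
                lock.release()
            self.held_locks.clear()
            self.held_keys.clear()

    lm = LockManager()
    db = {"row-1": "old value"}
    t1 = Transaction(lm)
    print(t1.read(db, "row-1"))   # t1 now holds the lock; writers of row-1 must wait
    t1.commit()                   # lock released, others may proceed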

Bringing it all together... to mess it all up

Phew, that was a lot of text. We started at the top by showing how to keep your data available via replication. We then looked at how to access the database efficiently with partitioning. Then we went deeper into how data is read and modified with transactions. While this is a good foundation to help you understand how some of our biggest systems operate, we've barely scratched the surface on what it takes to bring those systems down. Hint: it doesn't take much! Network failures, entire racks going down, clocks (don't get me started on clocks)... it's all much too commonplace. Which is why I plead with you to always consider what happens when everything blows up. It might be the most important thing you get from this post. We've seen how just two users attempting to update or access a database can lead to unintentional bugs that have massive consequences. Luckily my examples mean nothing, but the code you write does! Always be on the lookout for how something can go wrong. Never take anything for granted.

Some final thoughts that don't have much to do with the above

I started this post talking about how reading and writing the code is the fun part. I love being able to make a computer talk. I also love that I can go line by line in someone else's codebase and say "oh wow, I never thought of that". And for the rest of your career, reader, you could get far by just understanding the code and the syntax and the hottest framework and how this package does that and blah blah blah. But if you're in a field that deals with distributed computing, do yourself a favor and dive head first into these tough concepts. Really understand how data is flowing between all these services. How the data is being processed. Don't just write the code; understand what it's truly doing.

Further resources