What if you had very important information that you needed to safeguard against corruption? What if you couldn’t afford to lose access to it? How should you go about storing it so that your data maintains its integrity whilst always being available when you need it? To answer these questions, we here at KeyDB decided to use the Raft algorithm, which we plan to implement in KeyDB in a future release.
To demonstrate how Raft is effective, let’s start off with an example. Suppose you work at a financial institution as a software architect, and you need a database system to store information about account balances. This information is incredibly sensitive: balances changing unexpectedly could lead to some very disastrous outcomes (imagine losing your life savings because a database wasn’t designed well). Users will also want their accounts available to them 24/7.
The system will work like this:
- A user will request to deposit or withdraw money from their account.
- The system will then increment or decrement their account balance accordingly.
- The system will relay a message back to the user indicating whether their transaction went through.
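The steps above can be sketched in a few lines. This is a hypothetical illustration, not KeyDB code; names like AccountStore and apply_transaction are made up for the example.

```python
# Minimal sketch of the request flow: apply a deposit or withdrawal and
# relay back whether the transaction went through. Illustrative only.

class AccountStore:
    def __init__(self):
        self.balances = {}  # account id -> balance

    def apply_transaction(self, account, amount):
        """Deposit (amount > 0) or withdraw (amount < 0); report success."""
        balance = self.balances.get(account, 0)
        if balance + amount < 0:
            return False  # insufficient funds: reject the withdrawal
        self.balances[account] = balance + amount
        return True  # confirmation relayed back to the user

store = AccountStore()
assert store.apply_transaction("alice", 100) is True   # deposit succeeds
assert store.apply_transaction("alice", -30) is True   # withdrawal succeeds
assert store.apply_transaction("alice", -500) is False # rejected
```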
You want this system to have a few properties:
- Minimal downtime: You don't want to inconvenience your users too often.
- Strong consistency: Once an account has its balance changed, any new users looking at said account will see the updated balance.
- Performance: You want the system to be as fast as possible without compromising the criteria above.
How do we go about meeting these criteria? Let’s walk through some design decisions.
The first choice is between a centralized and decentralized approach. Do we want to have a single server storing the entire database, or do we wish to distribute the database over a network of servers? With a single server, if it goes down, the entire database is rendered inaccessible. Since we wish to minimize the downtime of our system, a central point of failure is a bad idea. We can mitigate this by opting for a decentralized approach: if part of the network is up, we should still have some level of access to account information. For this reason, we will go with a decentralized approach.
Now that we’ve decided that we’re going to have multiple servers bearing the load of our database, how do we split it up across them? There are many ways to go about this, such as sharding, but in our case, replication serves our needs best.
Replication is a method of database decentralization where each server stores an exact copy of the database. In our case, this means each server on our network has balance information about every account. By replicating our data in this manner, we provide a straightforward means to tolerate partial network failures; if one server goes down, we can simply go to another server for our data. This allows us to minimize user-observed downtime, so we'll go with this approach.
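The failover property described above can be shown with a toy cluster. This is an illustrative sketch (the Replica class and read_balance function are hypothetical, and "propagation" here is just copying a dict, with no real networking):

```python
# Every replica holds a full copy of the database, so a read can fail
# over to any live server. Illustrative sketch only.

class Replica:
    def __init__(self, data, alive=True):
        self.data = dict(data)  # full copy of the database
        self.alive = alive

def read_balance(replicas, account):
    """Try each replica in turn; any live one can answer."""
    for replica in replicas:
        if replica.alive:
            return replica.data[account]
    raise RuntimeError("no live replicas")

snapshot = {"alice": 70}
cluster = [Replica(snapshot, alive=False), Replica(snapshot)]  # first is down
assert read_balance(cluster, "alice") == 70  # served by the second replica
```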
We have yet another choice to make: do we want our replication to occur synchronously or asynchronously?
Synchronous replication is where commands are first propagated to the servers in the network, and only once a majority of servers have received a command is it processed by the servers on the network. In our example, when a user updates their balance, the following happens:
- The server receives the update request.
- The server propagates said request to the rest of the servers in the network in the hopes of reaching a majority.
- The server waits until it receives acknowledgement that a majority of the network has seen the request.
- The server and all of the servers on the network apply the update request to their copy of the database, in this case they update the user’s account balance to the correct amount.
- The user receives confirmation of their update.
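The five steps above can be sketched as follows. This is a toy model where "propagation" is a direct method call rather than a network message, and all names (Server, synchronous_update) are made up for illustration:

```python
# Sketch of synchronous replication: acknowledge, wait for a majority,
# apply everywhere, then confirm to the user. Illustrative only.

class Server:
    def __init__(self):
        self.balances = {}
        self.alive = True

    def ack(self, request):
        return self.alive  # a live server acknowledges the request

    def apply(self, request):
        account, amount = request
        self.balances[account] = self.balances.get(account, 0) + amount

def synchronous_update(leader, followers, request):
    # Steps 1-2: leader receives the request and propagates it to followers.
    acks = 1 + sum(f.ack(request) for f in followers)  # leader counts itself
    # Step 3: wait until a strict majority of the whole cluster has seen it.
    if acks <= (1 + len(followers)) // 2:
        return False  # no majority: do not apply the update
    # Step 4: every live server applies the update to its copy.
    leader.apply(request)
    for f in followers:
        if f.alive:
            f.apply(request)
    # Step 5: only now does the user receive confirmation.
    return True

leader, followers = Server(), [Server(), Server()]
assert synchronous_update(leader, followers, ("alice", 100)) is True
assert all(s.balances["alice"] == 100 for s in [leader] + followers)
```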
Asynchronous replication is when the command is processed in parallel with its propagation to other servers. For us, this means updating the account balance and propagating the request to the rest of the network at the same time.
Asynchronous replication is faster in that you don’t have to wait for the command to be replicated before responding to the user. However, you can run into consistency issues with this approach. Suppose a user makes a transaction on a server. The server responds to the user but goes down before it can propagate the request to the rest of the servers in the network. Now, when a new server services the user’s next transaction, it will operate as if the transaction had not occurred, but the user will think it had, thus leading to an inconsistent state. With synchronous replication, you wait for replication to occur before responding to the user. As long as the new server is a part of the majority that has seen the request, this situation will not occur. Although the critical path takes more time with a synchronous approach, we have guarantees that replication has occurred, which makes it a better fit for our objectives going forward.
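The inconsistency scenario above can be made concrete with a toy sketch. Here the "crash" is simulated with a flag, and async_update is a hypothetical name, not a real API:

```python
# Asynchronous replication responds to the user before replication is
# guaranteed, so a crash in between loses an acknowledged update.

def async_update(server_db, peer_dbs, account, amount, crash_before_replicating):
    server_db[account] = server_db.get(account, 0) + amount
    confirmed_to_user = True  # the user already sees "success"
    if not crash_before_replicating:
        for db in peer_dbs:  # replication happens after the response
            db[account] = db.get(account, 0) + amount
    return confirmed_to_user

primary, backup = {}, {}
async_update(primary, [backup], "alice", 100, crash_before_replicating=True)
# The user was told the deposit succeeded, but the backup never saw it:
assert primary.get("alice", 0) == 100
assert backup.get("alice", 0) == 0  # inconsistent state after failover
```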
Now that we have decided that we wish to synchronously replicate, there is one final decision to be made: how many masters do we want in our network? A master is a server that can both read from and write to the database (and propagate said changes to the rest of the network), whilst a replica is a server that holds a copy of the database and can function as a potential backup if the master goes down. If you have less stringent consistency requirements, you can opt to read from replicas as well for a performance improvement; however, that's not applicable in our case. There are two options here: single-master and multi-master.
Single-master replication, as the name implies, allows for only a single master on the network at a time. Similarly, multi-master replication allows for multiple masters on the network. Multi-master replication allows for higher overall uptime because if one master fails, there are still other masters available on the network to process incoming requests, which leads to zero failover time. In contrast, a single-master setup will encounter some failover time when the master goes down as one of the replicas will have to be promoted to a master in order to continue processing requests.
However, with multi-master replication, given that multiple masters can change the database (modify account balances), it is harder to enforce our strong consistency requirement. In contrast, having only a single source of database changes allows for strong consistency guarantees to be made much more easily, since we don’t have to handle multiple masters updating the same account balance concurrently. For this reason, we will go with a single-master setup.
Now that we’ve decided that our solution is a single-master, synchronously replicated database, how do we go about coming up with an algorithm for it? The answer is we don’t need to: that’s what Raft is for! You can find a more in-depth description here, a nice guided visualization here, and access to further resources here.
In short, Raft guarantees that the system remains consistent by having a single master - known as the leader - that only applies a command to the database once a majority of servers have seen said command and all commands prior. In the context of Raft, a majority refers to strictly over half of all servers in the network, including those that are not operational. In addition to its copy of the database, each server keeps a log of all commands it has witnessed. This way, a majority of replicas - known as followers - will be completely up to date with the leader at any given time. Any followers that are not completely up to date will be periodically fed updates, in order, from the leader until they catch up.
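The majority rule above can be sketched numerically. In Raft implementations the leader typically tracks how far each server's log matches its own; assuming such bookkeeping (the names below are illustrative, not KeyDB code), the committed index is:

```python
# An entry is committed once a strict majority of the cluster holds it.
# match_indexes[i] = highest log index known to be stored on server i
# (including the leader's own log). Illustrative sketch only.

def commit_index(match_indexes, cluster_size):
    """Highest log index replicated on a strict majority of the cluster."""
    majority = cluster_size // 2 + 1
    # Sort descending; the entry at position majority-1 is held by at
    # least `majority` servers, so everything up to it is committed.
    return sorted(match_indexes, reverse=True)[majority - 1]

# 5-server cluster: index 7 is present on 3 of 5 logs, so it is committed,
# while index 9 (on only 1 server) is not yet.
assert commit_index([9, 7, 7, 4, 2], cluster_size=5) == 7
```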
The leader is elected through a majority vote, triggered by a follower when it has not received a response from the leader in a while. This majority vote is known as an election. That follower will vote for itself and then send a vote request to everyone else on the network, at which point it is known as a candidate. Each server votes for the first candidate that comes along that is at least as up to date as itself, and votes at most once during an election. Once a candidate receives the votes of a majority of servers, it is declared the leader and broadcasts a message to the other servers, who then demote back to followers (if they were candidates). At cluster startup, each server starts as a follower; eventually, one or more servers will trigger a vote, which leads to a leader being selected.
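The voting rule can be sketched as a small function. This is an illustration of the two conditions described above (at most one vote per election, and only for a candidate at least as up to date); the class and field names are made up, not taken from any Raft implementation:

```python
# A follower grants its vote only if it hasn't voted this election and
# the candidate's log is at least as up to date as its own. Sketch only.

class Follower:
    def __init__(self, last_log_term, last_log_index):
        self.last_log_term = last_log_term
        self.last_log_index = last_log_index
        self.voted = False  # at most one vote per election

    def request_vote(self, cand_last_term, cand_last_index):
        # "At least as up to date": higher last-entry term wins; on equal
        # terms, the longer (or equal) log wins. Tuple comparison does this.
        up_to_date = (cand_last_term, cand_last_index) >= \
                     (self.last_log_term, self.last_log_index)
        if self.voted or not up_to_date:
            return False
        self.voted = True
        return True

f = Follower(last_log_term=2, last_log_index=5)
assert f.request_vote(2, 5) is True    # equally up to date: vote granted
assert f.request_vote(3, 9) is False   # already voted this election
assert Follower(2, 5).request_vote(1, 9) is False  # stale log: rejected
```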
Since a majority of servers on the network are up to date at all times, if an out-of-date candidate tries to become the leader, the up-to-date followers – which form a majority – will not vote for it, so that candidate can never gain a majority and will thus never become the leader. Therefore, the only candidates that can become the leader are those that are up to date. The corollary is that if a majority of servers weren’t up to date with the leader and that leader were to go down, it would be possible to elect an out-of-date leader (since only a minority would refuse to vote for it). This would cause us to lose data already applied to the database and violate the strong consistency requirement as a result.
In our case, this means that all requests go through the leader. The leader processes these requests and returns the result (either how much is in an account, or a confirmation that a transaction went through) only after making sure that a majority of followers have seen the request. If the leader goes down at some point, a new leader is selected from among the up-to-date followers, which will process transactions from then on, until the next leader needs to be selected.
By using Raft, since commands are only processed once a majority has witnessed them, and those servers are the only ones that can possibly be leaders (and subsequently be read from), there is no risk of inconsistent state. In addition, by the distributed nature of Raft, you minimize downtime by having other servers that can function as the leader in the network in the event the leader goes down without resulting in loss of processed data. That solves two of the three issues, but what about the third?
Given that every request, be it a read or a write, must go through a single leader, Raft's performance is strongly correlated with the performance of its leader. If the leader's underlying database is slow, the algorithm's performance will suffer as a result. How do we go about maximizing performance while still meeting our other requirements? This is where KeyDB comes in. KeyDB is really fast, and as such is uniquely positioned to provide a performant Raft cluster solution while maintaining strong consistency guarantees. We are actively working on incorporating Raft into KeyDB, so stay tuned!
We have some really cool features in the works. To keep up to date with what we are doing, subscribe to the KeyDB mailing list or follow us on one of the following channels: