I implemented the Raft consensus algorithm (the poster child of distributed algorithms) in Python. It's a pretty bad implementation! But also (somewhat) correct.
Here are some notes I'd share with anyone else who's interested in taking on a similar challenge.
In hindsight, these were the most useful resources for learning about Raft and implementing it correctly.
- The Raft paper (read up to section 5 and reference figure 2 heavily)
- Students' Guide to Raft
- one of the most widely used Raft implementations
- clone the repo, skim
raft.goand go back and forth with an LLM to understand the code base and design decisions
- clone the repo, skim
- Eli Bendersky's blog series
I'd suggest spending an hour or so reading the paper first, then stubbing out some code for a UDP or TCP server that reads incoming bytes and adds them to an array. I then followed along with Eli's implementation, adding features to my Raft implementation in the same order.
After getting something that looks like elections working, I started looking for bugs and errors in my understanding of the algorithm. I'd go back and forth between the students' guide, Figure 2 in the Raft paper, and my implementation, thinking carefully about where my implementation was the same (or differed). I also heavily used an LLM to review this code, adding material from the above resources into the context.
Repeat the above process for log replication, persistence, etc.
re: implementation
I made some simplifying design choices in my implementation. In no particular order:
- each node runs and processes messages on a single thread
- use a "logical clock" to keep track of local "time" on the system (e.g.
tick()and increment a counter local to each node, vs using system time) - "muddy" the implementation by having everything in one file. e.g network parsing, storage/persistence, the core raft algorithm, and utilities/commands for controlling the node itself
2 seems like a sound and correct design choice (logical clocks are what's used in etcd's implementation). 3 is arguably better for learning/pedagogy. It's nice to have everything in one file so you can see it all at once, and gives you a nice implementation you can rip up and see which abstractions fit the algorithm the best.
1 is a bit of an egregious choice to me. It does make the implementation far simpler (you worry less about getting into deadlocks and atomic updates to the node's internal state), but you also end up with something that isn't quite Raft. For a first implementation, this seems fine. The algorithm is complex enough and I think you'd rather spend your time debugging logical errors in the core Raft algorithm vs fussing with mutexes.
I'd consider implementing Raft this way as a ~30-hour project. The initial reading of the Raft paper and reviewing related materials should take a few hours. I did the bulk of the coding in ~3 days during the holidays, hacking for about 6-8 hours/day. I still have some things to polish and improve (e.g. fix some subtle bugs) in the existing implementation, which might be another half a day of work.
All in all, not too bad for understanding one of the core algorithms that powers so much infrastructure.