In Twitter’s early days it used Memcached as a critical component for virtually every part of the web site. This wasn’t much of an issue initially, but as Twitter scaled larger and larger these Memcached instances became more and more critical. Memcached quickly became the bane of Operations’ existence as the scale increased, and some of the heroic engineering required to keep these beasts running was amazing.
The biggest issue was that rebuilding the data in our Memcached instances was taking longer and longer. During rebuilds users would experience extremely long timeline loads and a high rate of errors. Normally a user’s Twitter feed was stored as hundreds of tweets in a list under a single key in one of the dozens of Memcached instances that stored timelines. If we lost even a single Memcached instance it could take hours to fully recover. Every timeline key had to be rebuilt one by one: pull the user’s list of followed users, pull each of those followed users’ complete list of tweets, and then merge them all together into a new timeline key. It was quite load intensive.
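To make the rebuild cost concrete, here is a rough Python sketch of what that per-key rebuild amounts to. The `cache` and `db` helpers and their method names are hypothetical stand-ins, not Twitter’s actual code.

```python
import heapq

def rebuild_timeline(cache, db, user_id, max_tweets=800):
    """Hypothetical sketch of rebuilding a single user's timeline key."""
    followed_ids = db.get_followed_user_ids(user_id)                    # one query
    tweet_lists = [db.get_recent_tweets(fid) for fid in followed_ids]   # one query per followed user

    # Merge the per-author tweet lists (each assumed newest-first) into one
    # newest-first timeline and keep only the most recent entries.
    merged = heapq.merge(*tweet_lists, key=lambda t: t.created_at, reverse=True)
    timeline = list(merged)[:max_tweets]

    cache.set(f"timeline:{user_id}", timeline)
    return timeline
```

Losing a cache node meant running something like this for every timeline key that had lived on that node, which is why recovery took hours and hammered the backing stores.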
What’s worse is that our fail-over algorithm had some issues. We used something called ‘ketama’ to balance load across the destination servers. Each node would hash its name a few dozen times, and each of those hashes would be placed on a ring running from 0 to the max hash value. With dozens of nodes this distributed ownership of key ranges fairly evenly. It also meant that adding nodes required no re-balancing: the new node would simply take over its slices of the ring and nothing else would change. In theory removing nodes was easy as well; once a node was removed, key lookups would automatically fall through to the next node on the ring.
Adding and removing nodes, though, carried a serious risk of inconsistency. If a node was added, we would immediately have to start rebuilding all the timelines that now landed on the new node. Worse yet, if that new node was later removed, lookups would fall back to the old node and start serving its old data. This was problematic if the data had been mutated during the time it lived on the new node.
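As an illustration of that failure mode, here is a minimal ketama-style ring in Python (not Twitter’s actual implementation), including the add-then-remove sequence that hands a key back to its old node, stale data and all.

```python
import bisect
import hashlib

class KetamaRing:
    """Minimal ketama-style consistent hash ring, for illustration only."""

    def __init__(self, nodes, points_per_node=40):
        self.points_per_node = points_per_node
        self._ring = []  # sorted list of (hash_value, node_name)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    def add(self, node):
        # Each node claims many points on the ring; only the key ranges
        # immediately behind those points move to the new node.
        for i in range(self.points_per_node):
            self._ring.append((self._hash(f"{node}-{i}"), node))
        self._ring.sort()

    def remove(self, node):
        # Removing a node hands its ranges back to the next node on the
        # ring -- which may still hold whatever data it owned before.
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def lookup(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._ring, (h, ""))
        return self._ring[idx % len(self._ring)][1]


ring = KetamaRing([f"cache{i:02d}" for i in range(12)])
before = ring.lookup("timeline:12345")
ring.add("cache99")                      # some keys now land on the new node
during = ring.lookup("timeline:12345")
ring.remove("cache99")                   # ...and revert when it is removed
after = ring.lookup("timeline:12345")
print(before, during, after)             # 'before' always equals 'after'
```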
The first global restart I was involved with
The first time we were pushed into restarting the Memcached instances was for a mandatory OS reboot forced on us by our managed hosting provider. We needed to restart each instance one by one. Doing that, however, would trigger an absolutely huge amount of timeline rebuilding. Worse, it would create a serious inconsistency problem due to the re-balancing issues mentioned above.
Rather than allowing re-balancing and data inconsistency, we planned to modify the code base to try only a key’s primary node and then never, ever fail over. If that node was unavailable, the site would instead serve an archaic error known as the “Moan Cone” (a slightly different version of the famous Fail Whale).
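Sketching what that change amounts to (the class and exception names here are hypothetical, not the actual code): the client makes a single attempt against a key’s primary node, and any failure surfaces as a distinct error rather than silently walking the ring.

```python
class MoanConeError(Exception):
    """Hypothetical stand-in for the 'serve an error instead' path."""


class NoFailoverCache:
    """Illustrative wrapper: one attempt against the primary node, no fallback."""

    def __init__(self, ring, clients):
        self.ring = ring        # e.g. the KetamaRing sketch above
        self.clients = clients  # node name -> memcached client

    def get(self, key):
        node = self.ring.lookup(key)
        try:
            return self.clients[node].get(key)
        except OSError as exc:
            # Previously the lookup would have walked to the next node on
            # the ring; during the maintenance we raised a distinct error
            # instead, which the web layer rendered as the Moan Cone.
            raise MoanConeError(f"primary node {node} is unavailable") from exc
```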
This actually worked fantastically. We watched the error rates and they never really got out of hand. Moan Cones were served with a different HTTP error code, so we could tell the normal errors apart from the special error we had reintroduced for this maintenance. We also watched Twitter itself for mentions, and by the end of the whole maintenance we had seen only one Tweet about the Moan Cone.
The problem with connections
Each Ruby instance connected to every Memcached instance. There was no need for pooling, since the Ruby processes were not multi-threaded, but that still meant a lot of connections. Our machines ran 16 Ruby instances each, and we had thousands of machines. This quickly ran us out of the default 32k file descriptors that RHEL gave us. We had upped the limit to 128k at some point, but even that eventually ran out. When that happened we were in serious trouble, again: in theory we needed to restart every instance to get a higher file descriptor limit.
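The back-of-the-envelope math makes the problem obvious; the machine count below is an assumption for illustration, since all we can say is that there were thousands of machines.

```python
# Rough connection math with illustrative numbers.
ruby_processes_per_machine = 16
machines = 3000  # assumption; we only know it was "thousands" of machines

# Every Ruby process held one connection to every Memcached instance,
# so each Memcached instance saw one socket per Ruby process in the fleet.
connections_per_memcached = ruby_processes_per_machine * machines
print(connections_per_memcached)  # 48000 -- well past a 32k descriptor limit
```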
We had recently hired a kernel developer who came up with a quick fix: write a kernel module that allowed changing the file descriptor limits on live processes. (This was on CentOS 5, long before the prlimit syscall was added to accomplish the same task.) This module changed everything: we were able to update the descriptor limits pretty much any time without having to restart!
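For comparison, on a modern kernel the same live adjustment needs no custom module; Python, for example, exposes the prlimit syscall as resource.prlimit (Linux, Python 3.4+). A minimal sketch:

```python
import resource

def raise_nofile_limit(pid, soft, hard):
    """Raise another process's open-file limit in place via prlimit(2)."""
    # Requires appropriate privileges (root or CAP_SYS_RESOURCE).
    resource.prlimit(pid, resource.RLIMIT_NOFILE, (soft, hard))

# Example (the pid and limit values are placeholders):
# raise_nofile_limit(12345, 262144, 262144)
```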
Living with the beast
Eventually we were able to adapt our operation of the cache clusters to reduce the faults and issues. One of the lessons learned was to never remove a failed instance, but instead always replace it with a new node. This was necessary to prevent the possibility of cache keys reverting to stale data. Over time the use of TTLs on keys was also improved, so they worked more effectively without the risk of reverting keys.
twemproxy
Eventually the connection count issue got so bad that something bigger needed to be done. This was the genesis of Twemproxy! The service started life as a simple proxy: our Ruby processes connected only to the proxy, and the proxy in turn connected to all of the Memcached instances in the cluster. This allowed connection pooling, which vastly reduced the total number of connections to each individual Memcached instance. By letting Twemproxy handle most of the cache management, many of the issues that created these restart struggles went away completely.
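For a sense of what running it looks like, a minimal twemproxy (nutcracker) pool definition is roughly the following; the pool name, addresses, and tuning values are placeholders rather than Twitter’s configuration.

```yaml
timelines:
  listen: 127.0.0.1:22121
  hash: fnv1a_64
  distribution: ketama
  timeout: 400
  auto_eject_hosts: false
  server_connections: 1
  servers:
    - 10.0.0.1:11211:1
    - 10.0.0.2:11211:1
    - 10.0.0.3:11211:1
```

Each Ruby process then only opens a connection to the local proxy, and the proxy multiplexes those requests over its own small set of connections to the Memcached cluster.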
As a side note: Twemproxy has gone on to have a significant life outside of Twitter. It’s in use at Uber, Twitch, Yahoo!, Wikimedia, and others (a more complete list can be found in the Twemproxy GitHub repo).