In 2010 Twitter took on one of the highest sustained traffic events in its history and pretty much fell flat on its face. However, from that moment of abject failure, Twitter managed to build an incident management process that actually cut through a lot of the nonsense and started making things better.
Going into the World Cup it was well established that we were going to have issues. We were already pretty well known for the “Fail Whale” at this point, but we had bent over backwards trying to prepare. We nearly doubled the web servers, added MySQL replicas, and tried to clear out as much unnecessary cruft as possible.
Everybody’s got a plan until they get punched in the mouth. - Mike Tyson
The very first game managed to cause our site to start throwing errors. I can’t remember the exact rate, but it was significant. The problem was that there didn’t seem to be an obvious source. Apache was serving 502s, but according to our metrics Rails wasn’t. In fact, Rails was serving far more traffic than it ever had. The databases were not overloaded; they appeared to have capacity to spare.
In this first contact with elevated error rates, the reaction was to, well, panic. Owners of each individual service started throwing switches, turning off features, breaking glass, and generally trying everything in their book to see what would reduce the errors. The error rate ping-ponged all over, but so many changes were being made so quickly that it wasn’t clear what impact each change was having! By the end of that day’s games we had not learned anything, and we still had weeks of games left that would only get bigger and draw more traffic.
Meanwhile, I had slept through the whole event. I tended to stay awake until the wee hours of the morning and wake up much later than a normal human should. So by the time I had heard about any of this it was far too late to get involved. I showed up at the office in time for the meeting with all of the tech leads, directors, vice presidents, and basically anybody else who had an opinion. This meeting got contentious quickly. Theories were thrown around, fingers were pointed, and while everybody had the best of intentions to figure out the problem, the reality was that the meeting was quickly turning into the same pointless exercise that the incident’s operational work had become.
At this point I got a chance to interject before things went further off the Rails, and I was able to do something that was unique in my career: I got to tell a room full of leadership that they were not needed to solve the problem and that they needed to leave the investigation and debugging to the engineers. I specifically called out that they had wasted an entire day by not being methodical and by failing to actually learn from the outage.
I then introduced the new process we would use for the next day. We would propose a possible cause and a corresponding fix. Each proposal would be tried one at a time and given a fixed amount of time to take effect, so we would know for sure what worked and what didn’t.
Science
This model was branded simply “Science.” We would use it each day to learn what was going on: we would gather data during the outage, then process it in a meeting afterwards to gain understanding and come up with the experiments we needed to run the next day. We specifically called out that major changes to the system were to be coordinated by the Incident Manager only. Nobody was to change anything without getting approval first. This ensured that we would know what changed and when. Communication would also go through the incident process: executives didn’t need to individually ping their senior staff, who would in turn individually ping the people working on the incident, effectively creating a human DDoS of interruptions.
This model worked. We learned a lot quickly. We isolated where the issue could be, and where it absolutely wasn’t. Best of all, it eliminated a lot of the confusing chatter that had been happening on the side.
The actual issue
In the end we discovered the underlying issue, and it was far more complicated than most would initially assume. We used Apache as a proxy between the load-balancer and the Rails Unicorn processes. This helped fan requests out to the Rails processes, each of which handled requests serially. The Apache process would receive a request and write it to the Rails socket. The Rails worker would accept the connection, read the HTTP request, process it, then write the response.
Except when the system got busy, it all started to backlog a bit. Apache would open the connection to the Rails instance and start a five-second timer, after which it would serve a 502 to the caller and close the connection to Rails. Inside Rails, however, the worker would still see the connection and could still read the request headers, because all of that was buffered in the Linux kernel. Only when it got to the very end, ready to serve a success to the caller, would it receive a broken pipe error while writing (funnily enough, it would then try to write an error, which would also fail).
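To make that buffering behaviour concrete, here is a toy Ruby reproduction, not anything we actually ran and with made-up names: a “proxy” sends a request and hangs up, yet the “worker” can still read the request out of the kernel’s socket buffer and only discovers the problem when it tries to write the response.

```ruby
require "socket"

# Toy reproduction of the failure mode described above.
server = TCPServer.new("127.0.0.1", 0)
port   = server.addr[1]

proxy = TCPSocket.new("127.0.0.1", port)
proxy.write("GET /timeline HTTP/1.1\r\nHost: example\r\n\r\n")
proxy.close                            # the "proxy" gives up and serves its 502

worker = server.accept
puts worker.readpartial(4096)          # the request still reads back cleanly
sleep 1                                # stand-in for slow application work

begin
  worker.write("HTTP/1.1 200 OK\r\n\r\n") # may appear to succeed: it only lands in the buffer
  sleep 0.1                               # give the peer's reset time to arrive
  worker.write("hello")                   # this write surfaces the broken pipe
rescue Errno::EPIPE, Errno::ECONNRESET
  puts "broken pipe: nobody is listening for this response any more"
end
```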
Rails was responding successfully; the queue was simply longer than the timeout. The heavier the load, the longer the queue. This was the death spiral we were encountering. The fix for this was actually fairly trivial. When Apache wrote the request to the Rails instance, it would include a header containing the timestamp of when the request was initiated. Rails, when reading the request, could then calculate how much time the request had left before Apache would give up completely. If we were too close to that line, we could just serve a very quick and cheap 502 rather than wasting time and resources responding to a request that had already timed out.
This feature was called the ‘Whalinator’ (never let nerds name things!)
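For illustration, here is a minimal Rack-middleware sketch of that idea. It is not Twitter’s actual code: the middleware name, the “X-Request-Start” header, and the exact numbers are assumptions made for the example.

```ruby
# Hypothetical sketch: assumes the proxy sets an "X-Request-Start" header with
# the epoch time (in seconds) at which it accepted the request, and that it
# gives up after roughly five seconds, as described above.
class LoadShedder
  PROXY_TIMEOUT = 5.0 # seconds the proxy waits before serving its own 502
  SAFETY_MARGIN = 0.5 # shed anything within half a second of that deadline

  def initialize(app)
    @app = app
  end

  def call(env)
    started = env["HTTP_X_REQUEST_START"].to_f # 0.0 if the header is missing
    waited  = Time.now.to_f - started

    if started > 0 && waited > (PROXY_TIMEOUT - SAFETY_MARGIN)
      # The proxy has already timed this request out (or is about to), so a
      # cheap 502 here beats burning a worker on an answer nobody will read.
      [502, { "Content-Type" => "text/plain" }, ["Over capacity"]]
    else
      @app.call(env)
    end
  end
end
```

In a `config.ru` you would mount something like this ahead of the application with `use LoadShedder`, so the check runs before any expensive work is done.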
Testing the solution
This feature was put in place before the final World Cup games. Interestingly, for the next game we didn’t see a massive sustained level of errors. At a few points we saw error spikes, but they would subside and the site would return to greater than 99% success rates. We set a rule that if the error rate spiked over 10% we would turn off the Whalinator to see if that would help. We would then wait ten minutes to observe the effects.
At one point in the game we hit 10% errors, so we turned off the Whalinator, and the effects were immediate and terrible. The error rate spiked beyond 60%, the whole site became unresponsive, and basically everything went completely sideways. Rather than wait ten minutes, we turned the Whalinator back on immediately and the error rate dropped back down to below 5%. We had found our fix.
Lessons learned
This ended up being an absolutely amazing demonstration of proper incident management in action. We removed the noise created by too many people responding at once, we created a system that allowed us to learn from the ongoing issues without immediately reacting and losing debugging data, and we created a well-defined process for owning the resolution. After this we had a clearly established Incident Manager on-call role, staffed by a select group of people who knew the system well enough to answer those pages.
For me at least, this moment represented a mentality shift in the way Twitter dealt with reliability as a function. It was no longer an afterthought; it started to become an actual, practiced form of engineering. It also opened the door to solving problems closer to design time, rather than only when they presented themselves publicly.
Shout outs
This event was significant to me and to several other people at Twitter, and it would be remiss of me not to call them out as well: Jonathan Reichhold, Jeff Hodges, and Mike Abbott. There was also a pile of others who helped at various points along the way whom I am unfairly forgetting in my old age, I am sure.