In the months leading up to Christmas 2007, Gmail had been growing like wildfire. Our clusters were filling up fast, and even though we were adding new capacity as quickly as we could, we were still running out.
Gmail had a crystal ball, and by that I mean an amazing wizard of planning named Jason who could work the numbers and seemingly predict the future using the largest Excel spreadsheet I have ever seen in my entire life. Our crystal ball predicted that, at our growth rate, we would run out of space sometime around Christmas. The closer we got, the more desperate we became to add capacity, to the point of stealing machine deliveries from other teams.
Our back-end engineers were debugging, trying to figure out ways to help. They had figured out that something was wrong with the storage system we were using (Google File System). The cluster had a very large amount of data labeled as “unknown”, which was something that never showed up in any significant quantity during normal operation.
The fuller the storage cluster got, the higher the error rate Gmail served. All operations in Gmail were appends to a log, including deletes. As the cluster ran lower on free space, it took longer and longer to find a server with enough room to store appended data. The added latency caused operations to time out and left resources in use longer, which in turn increased overall resource consumption. We were running out of space quickly. This was going to be a nightmare, and it was all going to happen within a week of Christmas Day!
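To make the append-only part concrete, here is a rough sketch (the names and structure are made up for illustration; the real Gmail back end was nothing this simple) of a log where even a delete is just another record appended to the tail, which is why deleting mail didn't immediately free any space:

```python
import json


class AppendOnlyLog:
    """Toy append-only log: nothing is ever overwritten or truncated."""

    def __init__(self):
        self.records = []  # stand-in for a file we only ever append to

    def append(self, op, msg_id, payload=None):
        self.records.append(json.dumps({"op": op, "msg_id": msg_id, "payload": payload}))

    def deliver(self, msg_id, body):
        self.append("deliver", msg_id, body)

    def delete(self, msg_id):
        # A delete is just one more append: a tombstone record.
        self.append("delete", msg_id)

    def live_messages(self):
        # Replay the log; tombstones hide earlier deliveries.
        state = {}
        for raw in self.records:
            rec = json.loads(raw)
            if rec["op"] == "deliver":
                state[rec["msg_id"]] = rec["payload"]
            else:  # "delete"
                state.pop(rec["msg_id"], None)
        return state


log = AppendOnlyLog()
log.deliver(1, "hello")
log.delete(1)
print(len(log.records), log.live_messages())  # 2 {} -- the delete still consumed space
```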
I wasn’t planning on going home for Christmas that year, so I had actually volunteered to take on-call over Christmas months before it became clear just how bad this was shaping up to be. As it got a bit closer, my amazing boss offered to get anybody who took a holiday on-call shift a new Xbox 360.
Eventually, while tracking down the unknown data in GFS, one of our amazing back-end engineers turned up a lead. The unknown data seemed to be linked to deleted data that had not been freed yet. When a file was deleted in GFS, each of its chunks (portions of the file) was marked as deleted as well. The chunk server would send a list of all updated chunks to the elected leader, and only once the leader confirmed that a chunk was no longer needed would the chunk server actually delete the files on disk. There was a block of code in the GFS source that set up a timer which fired eight times a second, and each time it fired it would send 100 of the highest-priority chunk modifications to the leader. That list prioritized mutations over deletions. The idea behind the code was simple: smooth deletes out over time so they don't overwhelm the leaders.
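In Python-flavored pseudocode, that reporting loop looked something like the sketch below (the names are mine and the real code was C++ and far more involved, but the shape matches the description: a fixed timer, a per-tick batch limit, and mutations always sorted ahead of deletions):

```python
from collections import deque

TICKS_PER_SECOND = 8     # hard-coded in the binary
CHUNKS_PER_TICK = 100    # a dynamic flag (this matters later)


class ChunkServerReporter:
    def __init__(self, chunks_per_tick=CHUNKS_PER_TICK):
        self.chunks_per_tick = chunks_per_tick
        self.pending_mutations = deque()  # always reported first
        self.pending_deletions = deque()  # only get whatever room is left over

    def tick(self, send_to_leader):
        """Called TICKS_PER_SECOND times a second by a timer."""
        batch = []
        while self.pending_mutations and len(batch) < self.chunks_per_tick:
            batch.append(self.pending_mutations.popleft())
        while self.pending_deletions and len(batch) < self.chunks_per_tick:
            batch.append(self.pending_deletions.popleft())
        if batch:
            send_to_leader(batch)
```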
Over time our storage cells had gotten bigger and bigger, with more and more chunks. As the cells grew, the number of mutated chunks increased to the point that there were more than 800 mutated chunks per second, which was everything the timer could report (8 ticks × 100 chunks = 800 per second). That left no capacity to report deleted chunks, and so the leader would never give the chunk servers permission to delete the data. The storage cluster would accumulate more and more of this useless data until it got full enough that the user load-balancer started moving users out of the cluster. Eventually enough users would move that the number of updates would fall below the magic 800 number, and suddenly data would be purged, freeing up space and allowing users to move back into the cluster.
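Plugging made-up but representative numbers into the sketch above shows the starvation: once mutations alone exceed the 800-per-second budget, the deletion queue never drains at all.

```python
reporter = ChunkServerReporter()
reported = []

for second in range(10):
    # Pretend load: 900 mutated chunks and 50 deleted chunks arrive each second.
    reporter.pending_mutations.extend(f"mut-{second}-{i}" for i in range(900))
    reporter.pending_deletions.extend(f"del-{second}-{i}" for i in range(50))
    for _ in range(TICKS_PER_SECOND):
        reporter.tick(reported.extend)

print(len(reported))                    # 8000 reports sent, every one of them a mutation
print(len(reporter.pending_deletions))  # 500: not a single deletion got through
```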
Now, the eight times a second was hard-coded. Changing that would require a complete rebuild, but the 100 chunks per tick was a dynamic flag. Within Google those dynamic flags could be changed via an HTTP call. This was good news for us! We could dynamically set it on each chunk server in a single cluster to see if it resolved the problem. Luckily Gmail was multi-homed, which meant that if we broke a single cluster we could recover. The downside was that it was around 7pm just a few days before Christmas. Most of the engineers had left to go home for the holidays. We had basically no support if anything went wrong.
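What the change amounted to was something like the following sketch (the port, path, and flag name are placeholders; I'm not going to reproduce the real chunk server interface here):

```python
import urllib.error
import urllib.request


def set_chunks_per_tick(host, new_value, port=8080):
    # Placeholder endpoint standing in for the real dynamic-flag HTTP interface.
    url = f"http://{host}:{port}/setflag?chunks_per_tick={new_value}"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False
```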
With not many options left to us, we decided to just go ahead on a single cell and see what happened. It took less than a minute to run curl in parallel against a few hundred machines. We watched the graphs to see the impact, and the total bytes in the cluster started dropping like a rock. Within a minute all of the unknown data had been cleared. We watched the error rates improve. We watched for consistency errors, and there were none. After watching for half an hour or so we had the confidence to cast a wider net. Next up was the whole region. Again, exactly the same as before: graphs dropped, error rates improved, no signs of issues.
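The parallel curl run boiled down to fanning the set_chunks_per_tick call from the sketch above out across the cell's chunk servers, something along these lines (the host list file and the new limit are illustrative, not the actual values):

```python
from concurrent.futures import ThreadPoolExecutor

NEW_LIMIT = 1000  # illustrative value only

with open("chunkservers.txt") as f:   # hypothetical list of hosts, one per line
    hosts = [line.strip() for line in f if line.strip()]

with ThreadPoolExecutor(max_workers=50) as pool:
    ok_by_host = dict(zip(hosts, pool.map(lambda h: set_chunks_per_tick(h, NEW_LIMIT), hosts)))

failed = [h for h, ok in ok_by_host.items() if not ok]
print(f"updated {len(hosts) - len(failed)} of {len(hosts)} chunk servers; failed: {failed}")
```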
With that success under our belts, it was time for the last step: changing the setting on every chunk server, in every region. This went the same as before, but at a much, much larger scale. Watching the storage graph was one of the single most alarming experiences of my career. Rob, the super amazing back-end engineer, and I pretty much sat there motionless, watching the graphs and half expecting the error rate to launch into the sky. Instead, the error rate dropped like crazy, and the system was suddenly… healthy.
After eventually regaining my composure, I turned to Rob and told him we had just “deleted more useless data in an hour than most people would create in their entire lifetimes.” Needless to say, I ended up having a very quiet on-call period playing games on my new Xbox!