Twitter Tales: Project Dirt Goose

Published: April 17, 2024
This article is part of ‘Twitter Tales’, a series that describes some of the amusing, impressive, or just plain crazy things that happened during the early days at Twitter, as written by those who worked there at the time.
Brady Catherman
Sultan of Scale

In the early days of Twitter, all of the Tweets were stored in a single MySQL database and replicated out to ten or so read replicas. Every tweet required a write to a single table in that database. This worked fine at first, but as Twitter grew, the number of tweets written per second grew with it. The runway on this solution was running out quickly.

Project Snow Goose

The solution came in the form of a new database called Cassandra, which had recently been open-sourced by Facebook. It was a distributed key-value store that supported indexing and the other features we required, but best of all it could scale out as needed. In theory this was exactly what we needed: scaling out would simply mean adding servers to the cluster. That would be far easier than migrating the existing database, which required building a new machine with more hardware and faster storage, copying all the data, and then scheduling a swap. That process could take as much as a week.

It didn’t take long after bootstrapping the project to start seeing some exciting results. The first pass at load testing showed lots of promise. Monitoring was set up and ready to go. Soon enough the project had all the code needed to work and could be enabled as an experiment for a percentage of traffic.

This is where everything started to go off the rails. Every time we added a bit of load, the Cassandra cluster would start to fail: latency increased, and eventually the cluster got so backlogged that it fell into a garbage collection spiral. When a server became overloaded it would pause for longer and longer, the rest of the cluster would slow down dealing with the slow server, and eventually another server would be pushed into a spiral of its own, and so on. Soon enough, the whole cluster would go completely offline. Seemingly any amount of load triggered this issue.

The deadline

The whole project was starting to come undone, and even worse, we were facing a deadline. The primary database was pretty much out of runway at this point. Just keeping up with the write volume, the storage requirements, and everything else was taxing the server beyond what we were comfortable with. We needed a solution, and we needed it fast.

After trying repeatedly, it became clear that Snow Goose wasn’t going to work. Cassandra had a feature called a ‘hinted hand-off’: when a replica is unreachable, another node holds on to the writes destined for it as hints and replays them once it comes back. It seemed that Cassandra’s hinted hand-off system was generating massive amounts of garbage, which would kill the server. Every time a server went offline it would generate more hinted hand-offs, which in turn would help push the whole cluster offline.

Enter Dirt Goose

So, with very little runway left and the leading solution untenable, we needed an alternative as quickly as possible. The proposal was to shard the database ourselves, rather than distributing data across servers the way Cassandra did, and to shard based purely on time. Older tweets would remain in the existing database, and at a fixed cutover time we would stop writing tweets to the old database and start writing them to a new one. This was not a great overall solution, but it was the only one we could possibly finish before the primary database keeled over from the load.
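To make the scheme concrete, here is a minimal sketch of what time-based routing like this could look like. The cutover date, database names, and the print-instead-of-INSERT are all hypothetical stand-ins for illustration, not Twitter’s actual code.

```python
from datetime import datetime, timezone

# Hypothetical cutover moment (not the real date): tweets created before this
# instant stay in the old database, everything at or after it goes to the new one.
CUTOVER = datetime(2010, 1, 1, tzinfo=timezone.utc)

OLD_DB = "tweets_old"   # the existing, nearly-full MySQL primary
NEW_DB = "tweets_new"   # the freshly provisioned MySQL primary

def shard_for(created_at: datetime) -> str:
    """Route a tweet to a database based purely on when it was created."""
    return OLD_DB if created_at < CUTOVER else NEW_DB

def write_tweet(tweet_id: int, text: str, created_at: datetime) -> None:
    db = shard_for(created_at)
    # The real system would issue an INSERT against the chosen MySQL host;
    # here we only show the routing decision.
    print(f"-> {db}: INSERT INTO tweets (id, text) VALUES ({tweet_id}, {text!r})")

if __name__ == "__main__":
    write_tweet(1, "pre-cutover tweet", datetime(2009, 6, 1, tzinfo=timezone.utc))
    write_tweet(2, "post-cutover tweet", datetime(2010, 2, 1, tzinfo=timezone.utc))
```

The routing itself is trivial; the pain is that every reader and writer has to agree on the cutover, which is part of why nobody saw this as more than a stopgap.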

This solution was awful and everybody knew it, but there was simply no alternative. Thus, Dirt Goose was born. It was never intended to be a long-term solution, but it was what we needed in the moment. We had to rethink the whole project, and eventually the answer was to build a distributed system on top of MySQL.

Snowflake and Flock

Eventually a pair of services was launched that replaced Dirt Goose. The first was a service called [Snowflake](https://en.wikipedia.org/wiki/Snowflake_ID), which generates unique IDs in a distributed way. The second was a database service called [Flock](https://blog.twitter.com/engineering/en_us/a/2010/introducing-flockdb). Together, these two services finally put the nail in the coffin of the scalability issues that led to Snow Goose/Dirt Goose in the first place.
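For the curious, here is a rough sketch of a Snowflake-style ID generator: 41 bits of milliseconds since a custom epoch, 10 bits of worker ID, and a 12-bit per-millisecond sequence. The field widths and epoch value follow the commonly published scheme, but the code itself is an illustrative sketch, not Twitter’s implementation.

```python
import threading
import time

# Commonly cited Twitter epoch: 2010-11-04T01:42:54.657Z, in milliseconds.
TWEPOCH_MS = 1288834974657

class SnowflakeGenerator:
    """Generate roughly time-ordered 64-bit IDs without central coordination."""

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024, "worker_id must fit in 10 bits"
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self) -> int:
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                # Same millisecond: bump the 12-bit sequence, and if it
                # overflows, wait for the next millisecond.
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            # Layout: | 41-bit timestamp | 10-bit worker | 12-bit sequence |
            return ((now_ms - TWEPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

if __name__ == "__main__":
    gen = SnowflakeGenerator(worker_id=1)
    print(gen.next_id())
```

Because the high bits are a timestamp, IDs generated anywhere in the fleet sort roughly by creation time, which is exactly what a sharded tweet store needs.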

