Twitter Tales: The Ill-Fated Data Center!

Published: March 14, 2024
This article is part of ‘Twitter Tales’, a series that describes some of the amusing, impressive or just plain crazy things that happened during the early days at Twitter, as written by those who worked there at the time.
Brady Catherman
Sultan of Scale

One of the hardest events to explain to outsiders from my time at Twitter was easily the addition of our first data center. We were moving out of managed hosting and into a colocation facility because we had simply outgrown what was possible with our initial provider. Somehow the decision was made to go with a completely new facility rather than using an existing building with existing tenants. This decision proved to be the start of so many funny and interesting stories that it’s impossible to even remember them all.

The problem was that we were crunched for time. Badly. We needed to get out of our managed hosting or the price would go through the roof. This schedule left us with virtually no room for slips or errors, which became a problem when the physical construction of the data center slipped. Soon enough, the only way for us to hit our timeline was to start installing the hardware alongside the finish work required for the data center. We would rack up networking kit, core switches, etc. as they set up power and got cooling systems online. The goal was to be able to turn on hardware the moment that power was available in the facility.

This turned out to be a bit optimistic. When we started racking up hardware there was still construction debris strewn about. The colocation facility was covered in a thick layer of concrete dust from where they had drilled anchors. Even worse, about half the floor was missing. We soldiered on, though, getting our core switches running with the help of the vendor (who happily didn’t void our warranties as the switches blew the dust all over).

The security audit.

At one point, about halfway through the building process, we had a security audit by Twitter’s head of security: Bob Lord. He showed up, looked around the office and pointed out all of the issues with the layout that would cause security problems. One of the first problems he pointed out was that the only way to access the security controls was through the middle of the man trap. A man trap ensures that anybody trying to access the facility has to go through two separate security doors; if they fail the security check at the second door, they are trapped in the space between the doors and held there until they can be dealt with. In the new facility the security room was only accessible from the space between the two doors. This was problematic because it meant that once an intruder got “trapped” it became impossible to get to the security room.

Another major issue that Bob found was that the roof was accessible via a tree in front of the office. I bet the owners that he could get on the roof of the building without security access, and once the bet was taken up, he simply climbed up the pine tree and jumped to the roof. The next day the tree was cut down.

It’s raining fire!

It turns out that some of the other things that were not done as we started moving hardware in included the AC system’s chillers on the roof, and any level of power to the facility beyond what it got as a typical commercial building. During the installation of the AC units on the roof, some of the vendors had not put down “slag mats”, which catch the hot sparks that come off the things being welded. The lack of slag mats meant that the hot bits of metal fell onto the foam sealant on the roof, which immediately melted into molten plastic. Since the foam layer was supposed to seal the roof from rain, the newly created holes allowed the now molten plastic to drip through the roof and into the data center. Flaming bits of molten plastic started dripping down into the data center, right onto the rack full of our very expensive core switches.

Luckily, I was on the other side of the room when the flaming rain started to fall. I remember a brief moment where I thought that somebody was welding directly behind our switches because there was so much fire and illumination. Within a couple of seconds the fire alarm triggered and the building was evacuated.

It’s raining... well... rain.

The secondary effect of the fire rain was that the roof was no longer sealed, so when a large rainstorm came through the area we suddenly experienced rain in the data center, directly on top of our now powered-on racks. We had to rush to wrap every rack up in painter’s drop cloths to try to keep the water dripping randomly through the ceiling off the racks. It ended up taking several major rainstorms before the roof was reliable enough to leave servers on during a storm.

In this picture we were trying to get the water that was dripping directly on the rack into the trashcans rather than on the servers.

The trashcans, they do nothing

Zip-ties everywhere!

Once we got to the point where we were able to power on racks of hardware the real fun began. I had written a custom burn-in tool (that I won’t go into detail about here) that automatically moved machines through several types of tests: wiring, memory, CPU, and disk. Each component would be thoroughly tested and failures would be tracked in a machine database.
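The real tool is not described here, so the following is only a rough illustration of the idea of walking a machine through the four test stages and recording failures. Everything in it is an assumption made for the sketch: the names `Machine`, `run_burn_in`, and `STAGES`, the use of a plain dictionary as the "machine database", and the choice to run every stage even after an earlier one fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Test stages in the order listed above: wiring, memory, CPU, disk.
STAGES = ["wiring", "memory", "cpu", "disk"]


@dataclass
class Machine:
    """Hypothetical record for one server under test."""
    hostname: str
    failures: List[str] = field(default_factory=list)


def run_burn_in(
    machine: Machine,
    checks: Dict[str, Callable[[Machine], bool]],
    database: Dict[str, List[str]],
) -> List[str]:
    """Run each stage's check in order and record which stages failed.

    `checks` maps a stage name to a function returning True on success.
    `database` stands in for the machine database (an assumption: here it
    is just a dict keyed by hostname).
    """
    for stage in STAGES:
        passed = checks[stage](machine)
        if not passed:
            # Assumption: keep going after a failure so that every
            # problem on the machine is found in a single pass.
            machine.failures.append(stage)
    database[machine.hostname] = list(machine.failures)
    return machine.failures
```

A caller would supply one check function per stage and look up the hostname in the dictionary afterwards to see which repairs a machine needs.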

Each rack was set up by our vendor and burn-in was run on them where they were installed, so in theory, beyond issues that came up during shipping, we should have been good to go. Because of this we didn’t really order much in the way of spare parts: basically enough for regular operation but not for a hard start of a new facility.

To mark the machines that needed work we used colored zip ties. Zipped to the rack face with the tail sticking out, each tie made a clear visual indicator of where repairs were needed, and thanks to the color coding it also showed what type of repair was expected.

Everything started going off the rails very fast. Whole racks needed to be rewired to get identification working properly, memory tests started failing, CPU tests would overpower the machine causing it to overheat and shut down, and disks failed. Lots of disks. Roughly one out of two servers needed manual intervention and we didn’t have the parts to handle that, so the zip ties stayed on the racks for what seemed like ages, marking out just how bad the failure rate had been.

This picture was taken early in the process, and if I recall correctly, red indicated memory failures and green indicated CPU failures. This was before the full extent of the zip tie madness had set in. There was a point where every rack had twenty or thirty zip ties!

Zip ties everywhere!

The outcome

Eventually this whole data center project came to an abrupt end. With all the delays it became clear that we would never be able to finish the move on time, which put the whole project at risk. Instead, now that we finally had multiple experienced veterans of the colocation industry on staff, we were able to get space in an established facility that was ready to use and already had everything we needed. This prompted an emergency change of plans, which ended up being successful and to this day is one of my proudest engineering moments.
