Do not unplug the load bearing mac mini!
One afternoon, sometime in late 2010 or perhaps early 2011, the whole Twitter office suddenly got very loud. Everybody was asking “Is Jira down?” and “Is git working for you?” Almost every single internal service had gone dark in an instant. The internet was working, just not the internal services.
As people sat confused, trying to figure out what had happened, I knew immediately. I sprinted off to the IT closet on the 6th floor and found our IT guys standing over my precious Mac Mini with the Ethernet cable unplugged. They knew right away that they had done something bad. Plugging it back in restored service to the office, and after that the little machine was named the “Load Bearing Mac Mini” (inspired by a Simpsons gag that had been used in the office as a joke on multiple occasions).
Background
In the early days of Twitter we ran all of our infrastructure in a 100% managed hosting environment. Our data center provider managed the machines, operating systems, and the network. We only controlled what ran on the machines. By 2009 we had already reached the edge of what was possible with this model. We had told them that they couldn’t upgrade packages without our approval, that they couldn’t replace or rebuild drive arrays without notice from us, and so on. We had even implemented a code word for calls, as we had already had people unaffiliated with us calling and asking them to do hands-on work on our behalf. One of the downsides to this situation was that the provider didn’t give us a way to VPN in, or any form of private networking, and specifically disallowed setting up a VPN service on any of their boxes.
Then came the hack in 2009, where our administrative dashboard was broken into using a dictionary attack against popular accounts. As an emergency fix, all of the administrative functionality was hidden away behind another web page that was protected by basic auth. We didn’t have a VPN, so this page was also on the internet and was quickly identified, dictionary attacked, and broken into as well.
In need of a quick, dirty fix, I came up with the idea of just using an SSH tunnel with DNS in the office pointed at it. This didn’t get flagged by the hosting provider and allowed communication to go over something more secure. I asked for hardware to run this on, but this was an era where “IT” was still outsourced and we didn’t have anything approaching a network closet. Thus, the fastest solution was to take a Mac Mini, set it on my desk, and set up the SSH tunnel as a launchd service. And with that, the admin page was off the internet.
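For readers who have not set up something like this, here is a rough sketch of what "an SSH tunnel as a launchd service" can look like. It is not the configuration we actually ran: the host name, ports, and job label are all made-up placeholders, and the script only builds and prints a launchd job definition rather than installing it.

```python
import plistlib

# Everything below is a hypothetical placeholder, not the original setup.
REMOTE_HOST = "tunnel@admin-gateway.example.com"  # assumed SSH account on a hosted box
LOCAL_BIND = "0.0.0.0:8443"                       # assumed address the office connects to
REMOTE_TARGET = "127.0.0.1:443"                   # assumed admin service, as seen from the remote box


def build_tunnel_job(label, remote_host, local_bind, remote_target):
    """Describe a launchd job that keeps one SSH local port forward running."""
    ssh_args = [
        "/usr/bin/ssh",
        "-N",                              # forward ports only, run no remote command
        "-o", "ExitOnForwardFailure=yes",  # exit if the forward cannot be set up
        "-o", "ServerAliveInterval=30",    # notice a dead connection and exit
        "-L", f"{local_bind}:{remote_target}",
        remote_host,
    ]
    return {
        "Label": label,
        "ProgramArguments": ssh_args,
        "RunAtLoad": True,   # start when the job is loaded
        "KeepAlive": True,   # have launchd restart ssh whenever it exits
    }


job = build_tunnel_job(
    "com.example.admin-tunnel", REMOTE_HOST, LOCAL_BIND, REMOTE_TARGET
)
plist_xml = plistlib.dumps(job).decode("utf-8")
print(plist_xml)
```

The printed XML is the kind of file that would be saved as a launchd job definition. Office DNS would then point the admin host name at the Mini, so requests go through the encrypted tunnel instead of straight across the internet.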
Over time we continued adding services. Search had an admin page that needed protecting, Trust and Safety built some tooling as well, and eventually we moved our git server behind https so we could prevent things like force pushes. That little Mac Mini ended up becoming quite important for our day-to-day operations.
We moved offices after that and I still couldn’t convince anybody to get me some better hardware and a location in a secured closet. There just wasn’t time or enough IT resources to deal with that. That is, until the cleaning people managed to unplug the machine by accident and half the company was unable to do anything until I sprinted to the office in the middle of the night and plugged it back in. After that I got approval to put it in the IT closet so we could avoid unexpected interruptions like that.
With that, things just kind of hummed along. Everything was working, and adding services was trivial enough. We had a huge initiative to move out of the managed hosting environment, so there just wasn’t time to do anything with the Mac Mini, and either way it would be obsolete once we controlled our own network.
IT cleans out the closet
IT at Twitter had always been a mess. It started as a contracted service, and when we hired our first IT staff they were mostly focused on keeping people’s hardware working and had zero time to do anything beyond that. Slowly they took on more and more until they finally got a handle on all the services, hardware, etc.
At that point they started a project to identify and clean out all of the devices in the server closet. There were old G5 Mac Pros sitting on the floor with zero documentation, network switches that appeared to be completely disconnected except for the upstream link, etc. It was a disorganized mess. Square in the middle of that mess was a random Mac Mini with what used to be permanent marker written on the top that had long since been wiped off. They began identifying each item and removing what wasn’t necessary. When it came to the Mini they had no login details, no ownership details, and no way to track who it belonged to. Thus the plan was born: unplug it and see who complains.
And boy howdy did we complain!
Aftermath
At this point it became clear that, first of all, we needed a label, which was helpfully provided by IT: “Load Bearing mac mini, do not unplug” and a contact name/email. We also decided to beef up the poor little Mini in the most “startup engineering” way possible. We added a second Mac Mini and set up CARP so that if the main Mini failed, the second would assume its IPs and continue serving. Remember, this whole thing was going away soon anyway, so why invest much more than that?
The legend is born
I have heard this tale from other people more times than I can count. It usually comes with a tone suggesting how silly or stupid Twitter was to have gotten into that situation, but honestly this type of problem solving is common in small, very fast-paced startups. There just isn’t the time, or necessarily the money, to invest in a perfect solution, and often the technical debts taken on to get the system to this point are so high that a perfect solution wouldn’t even be possible.
As a side note, the reason it was a Mac Mini was pretty simple: Twitter IT only bought Apple hardware at the time, Apple had discontinued the Xserve, and the Mac Pro was too big to fit in any of the racks available to us.