I started life as a bioinformatics cluster administrator, but quickly shifted
into Site Reliability Engineering when I started working on Gmail at Google in
2006. I was able to work with the early SRE teams at Google, learning how to
think about problems at exceptionally large scale. This is where I met Phil
Pennock.
After Google I was able to join Twitter very early as an Operations Engineer.
That experience was completely out of this world, giving me one of those
startup stories that is sometimes outright unbelievable: from the Load
Bearing Mac-Mini to the 48 days spent living on a data center floor trying to
get the new cluster working before the clock ran out on our managed hosting
environment. At Twitter I implemented the Incident Management process, built
out the data center operations program, and helped to eliminate the “Fail Whale.”
While at Twitter I also initiated a project called “app-app,” which was meant
to be the “application app” that managed deployments to machines. It used
cgroups, namespaces, and unionfs to create private spaces for applications to
run in, isolated from the base operating system. This was conceptually Docker
two years before Docker, and I still kick myself for not working harder to get
it open sourced earlier. This led to my next experience: writing a
Kubernetes-like container management engine at Apcera as founding employee #3, again
before Kubernetes existed. It was at Apcera that I met Jonathan Klobucar for
the first time.
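If you are curious what that looked like conceptually, here is a minimal sketch
in Go (my illustration written for this page, not app-app's actual code) of the
namespace half of the trick: start a process in its own UTS, PID, and mount
namespaces so it gets a private view of the machine. It assumes Linux and root
privileges, and it leaves out the cgroup and unionfs pieces entirely.

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Launch a shell in fresh UTS, PID, and mount namespaces so it sees
	// its own hostname, its own PID 1, and a private mount table.
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Add resource limits via cgroups and a unionfs overlay for the filesystem and
you have the bones of a container runtime.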
At this point I switched over to working as a full-time remote SRE for the rest
of my career. I started in this new model working for a DBaaS company called
Orchestrate (acquired by CenturyLink) and a mobile app streaming service
called Mobcrush (acquired by Super League). In both roles I was the only
infrastructure engineer when the companies had fewer than twenty people.
My next gig was an opportunity to work with Jonathan Klobucar again, whom I had
initially met at Apcera. This job involved building an SRE team from the ground
up, including hiring great people, building infrastructure as code, achieving
SOC 2 compliance, implementing incident management, and more. I also worked on
the Architectural Support Group, helping engineers think through designs to
make sure they saw a problem from every angle, including angles not typically
considered by a developer.
I then ventured off to a startup called Cookie.ai (later renamed Veza) to help
solve a problem that had plagued many of the companies I had worked with up to
that point. I have recently moved on to a new adventure that should prove to be
just as challenging and rewarding!
In 2010 Twitter embarked on an ambitious project to replace its MySQL tweet
store with Cassandra, a newfangled distributed key-value store. Called Project
`Snow Goose`, this ended up leading to a crazy hack of a project called
Project Dirt Goose.
A story about the time that Gmail nearly ran out of storage space days before
Christmas of 2007 and the heroic efforts that went into keeping it running.
Restarting Memcached at Twitter was always an extremely problematic experience.
We often went to great lengths to avoid restarting it at all, and when that
wasn't possible we had to jump through hoops to keep things working. This
article explores some of those events and the eventual fixes that eliminated
the problems.
Twitter used to have a system to help teach people to lock their work laptops.
This is a story of how that system was used to get real brownies in the
office.
Single page apps deliver a nice experience to users, but come with some
unexpected side effects that are often not planned for at design time. This
article attempts to expose some of those issues to help prevent them from
becoming support issues later on down the line.
In 2010 Twitter embarked on a project to move from managed hosting into a new,
bespoke data center, which all went completely wrong in no time flat. This is a
quick write up of some of the catastrophic issues we encountered along the way.
The 2010 World Cup was a pivotal moment in Twitter's history. It established
both the model for incident management and a process by which debugging could
be done on exceptionally large distributed systems. It started with a total
failure, but ended with a team coming together and finding the convoluted
cause just in time.
An errant feature launch managed to erase one poor unsuspecting user from
Twitter. This user got stuck in a broken state, able to tweet and see a
timeline, but nobody could see their profile. This article explores the
reason why it happened and provides tips for preventing this type of failure
on your own site.
During Twitter's early days, when the company was still less than 100
engineers, a small computer became a crucial piece of infrastructure. This
is the true story of Twitter's infamous "Load Bearing Mac-Mini."
A quick guide to hiring your first SRE, as written by an SRE who was often
the first to come into a company. This is targeted at a hiring manager
or interviewer who is trying to figure out what SRE does, as well as how
to interview somebody in order to find a good candidate.
Amazon's Classic Elastic Load Balancers (ELBs) had an issue where they would
pre-open connections, which could cause strange timeout behavior in the
services behind them. This article documents what that looks like
and how to resolve it.
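If you want the short version before reading the full article, one common
mitigation is sketched below (my own hedged example, not necessarily the exact
fix the article lands on): keep the backend's idle timeout longer than the
ELB's 60-second default so the load balancer, not your service, is the side
that closes quiet connections.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":8080",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "ok")
		}),
		// Keep idle keep-alive connections around longer than the ELB's
		// 60-second idle timeout so the load balancer, not this server,
		// is the one that tears down quiet (or pre-opened) connections.
		IdleTimeout: 120 * time.Second,
	}
	if err := srv.ListenAndServe(); err != nil {
		panic(err)
	}
}
```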
A list of mistakes to avoid, as well as suggestions for making an API more
resilient and future-proof, from an industry veteran with decades of experience
working with APIs large and small.
It's the evening of a holiday weekend and your entire site has been hacked. How
do you ensure that the people you are talking to are who they say they
are if you have never communicated outside of work channels?
There is nothing worse than encountering a major issue right in the middle of
trying to deal with another major issue. This article tries to describe
some failure modes worth considering when setting up your new status page.
A summary of a type of engineering failure common when writing services, in
which a child request has a timeout that is larger than its parent's. This can
lead to a total breakdown in request and error handling that causes confusing
traffic patterns, as experienced by Twitter in 2010.
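As a taste of the pattern, here is a small Go sketch (my illustration, not code
from the article) of the inversion: the parent gives up after one second while
the child keeps its own five-second budget, so work continues long after anyone
is listening.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callBackend simulates a downstream call that ignores the caller's
// deadline and uses its own, more generous five-second budget instead.
// That mismatch is the whole bug.
func callBackend(ctx context.Context) error {
	childCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = ctx // the parent's deadline is deliberately ignored here

	select {
	case <-time.After(3 * time.Second): // pretend the backend is slow today
		return nil
	case <-childCtx.Done():
		return childCtx.Err()
	}
}

func main() {
	// The parent only waits one second before giving up (and likely retrying).
	parentCtx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- callBackend(parentCtx) }()

	select {
	case <-parentCtx.Done():
		fmt.Println("parent gave up:", parentCtx.Err())
		// The child keeps burning resources on work nobody wants, and
		// during an incident the retries multiply that load.
		fmt.Println("stray child finally returned:", <-done)
	case err := <-done:
		fmt.Println("child finished in time:", err)
	}
}
```

The usual fix is to derive the child's deadline from the parent's context so a
request can never outlive its caller.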