SREally?

Some tales from SREs of how things really happened, and other hard-won lessons and design advice.

Meet the authors:

Brady Catherman

A Site Reliability Engineer with over twenty years of experience running large web sites. Past experience includes working as an early-ish SRE at Google on Gmail and as a very early Twitter Operations Engineer. Further experience at multiple startups in many spaces, generally as the first SRE or as somebody hired to help build out an SRE team.

Recent Articles on SREally?

Twitter Tales: Project Dirt Goose

17 April, 2024

In 2010 Twitter embarked on an ambitious project to replace its MySQL tweet store with Cassandra, a newfangled distributed key-value store. Called Project `Snow Goose`, it ended up leading to a crazy hack of a project called Project Dirt Goose.

Alphabet Soup: Gmail's Out of Space!

15 April, 2024

A story about the time that Gmail nearly ran out of storage space days before Christmas of 2007 and the heroic efforts that were taken to keep Gmail running.

Twitter Tales: Memcached Can Not Be Restarted

01 April, 2024

Restarting Memcached at Twitter was always an extremely problematic experience. We often went to great lengths to avoid restarting it at all, and when that wasn't possible we often had to jump through hoops to keep things working. This article explores some of those events and the eventual fixes that eliminated the problems.

Twitter Tales: Brownie Manifestation

25 March, 2024

Twitter used to have a system to help teach people to lock their work laptops. This is a story of how that system was used to get real brownies in the office.

Single page apps are harder than you think.

15 March, 2024

Single page apps deliver a nice experience to users, but come with some unexpected side effects that are often not planned for at design time. This article attempts to expose some of those issues to help prevent them before they become a support issue later on down the line.

Twitter Tales: The Ill Fated Data Center!

14 March, 2024

In 2010 Twitter embarked on a project to move from managed hosting into a new, bespoke data center, which all went completely wrong in no time flat. This is a quick write-up of some of the catastrophic issues we encountered along the way.

Twitter Tales: The 2010 World Cup

13 March, 2024

The 2010 World Cup was a pivotal moment in Twitter's history. It established both the model for incident management and a process by which debugging could be done on exceptionally large distributed systems. It started with a total failure, but ended with a team coming together and finding the convoluted cause just in time.

Twitter Tales: Sorry @Flash

12 March, 2024

An errant feature launch managed to erase one poor unsuspecting user from Twitter. This user got stuck in a broken state, able to tweet and see a timeline, but nobody could see their profile. This article explores the reason why it happened and provides tips for preventing this type of failure on your own site.

Twitter Tales: Load Bearing Mac Mini

11 March, 2024

During Twitter's early days, when the company still had fewer than 100 engineers, a small computer became a crucial piece of infrastructure. This is the true story of Twitter's infamous "Load Bearing Mac Mini."

Hiring Your First SRE

14 October, 2016

A quick guide on hiring your first SRE, as written by an SRE who was often the first to come into a company. This is targeted at a hiring manager or interviewer who is trying to figure out what an SRE does, as well as how to interview somebody in order to find a good candidate.

What is SREally?

10 October, 2016

A brief synopsis of SREally explaining what the initial inspiration was for starting and contributing to it.

A Tale of Unexpected ELB Behavior

10 June, 2016

Amazon's Classic Elastic Load Balancers (ELBs) had an issue where they would pre-open connections, which caused strange timeout behavior with the services that they were fronting. This article documents what that looks like and how to resolve it.

Tips for Resilient API Design

31 May, 2016

A list of mistakes to avoid, as well as suggestions to make an API more resilient and future-proof, by an industry veteran with decades of experience working with large and small APIs.

Who are you? How do I know you?

30 May, 2016

It's evening on a holiday weekend and your entire site has been hacked. How do you ensure that the people you are talking to are who they say they are if you have never communicated outside of work channels?

Status Page Conundrum

30 May, 2016

There is nothing worse than encountering a major issue right in the middle of trying to deal with another major issue. This article tries to describe some failure modes worth considering when setting up your new status page.

Tower of Hanoi Timeouts

09 October, 2015

A summary of a type of engineering failure, common when writing services, in which a child request has a timeout that is larger than its parent's. This can lead to a total breakdown in request and error handling that causes confusing traffic patterns, as experienced by Twitter in 2010.
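
One common way to guard against this kind of failure is to cap every child request's timeout at the time remaining in its parent's budget. The sketch below illustrates that idea only; the function name `child_timeout` and its parameters are hypothetical and are not taken from the article or from any Twitter code.

```python
import time


def child_timeout(parent_deadline, configured_timeout, now=None):
    """Return a timeout (in seconds) for a child request.

    parent_deadline is an absolute time on the time.monotonic() clock;
    configured_timeout is the child's own preferred timeout in seconds.
    Both names are illustrative assumptions.
    """
    if now is None:
        now = time.monotonic()
    remaining = parent_deadline - now
    if remaining <= 0:
        # The parent has already given up, so there is no point
        # starting work that nobody will wait for.
        raise TimeoutError("parent deadline already passed")
    # Never let the child wait longer than its parent will.
    return min(configured_timeout, remaining)
```

With this in place a child that is configured for five seconds, but whose parent has only two seconds left, waits two seconds at most, so the child can never outlive the request that spawned it.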