SREally?
Some tales from SREs of how things really happened, and other hard-won lessons and design advice.
Recent Articles on SREally?
Twitter Tales: Project Dirt Goose
In 2010 Twitter embarked on an ambitious project to replace its MySQL tweet
store with Cassandra, a newfangled distributed key-value store. Called Project
`Snow Goose`, this led to a crazy hack of a project called
Project Dirt Goose.
Alphabet Soup: Gmail's Out of Space!
A story about the time that Gmail nearly ran out of storage space days before
Christmas of 2007 and the heroic efforts that were taken to keep Gmail running.
Twitter Tales: Memcached Can Not Be Restarted
Restarting Memcached at Twitter was always an extremely problematic experience.
We often went to great lengths to avoid restarting it at all, and when that
wasn't possible we often had to jump through hoops to keep things working. This
article explores some of those events and the eventual fixes that eliminated
the problems.
Twitter Tales: Brownie Manifestation
Twitter used to have a system to help teach people to lock their work laptops.
This is a story of how that system was used to get real brownies in the
office.
Single page apps are harder than you think.
Single page apps deliver a nice experience to users, but come with some
unexpected side effects that are often not planned for at design time. This
article attempts to expose some of those issues to help prevent them before
they become a support issue later on down the line.
Twitter Tales: The Ill Fated Data Center!
In 2010 Twitter embarked on a project to move from managed hosting into a new,
bespoke data center, a move that went completely wrong in no time flat. This is a
quick write-up of some of the catastrophic issues we encountered along the way.
Twitter Tales: The 2010 World Cup
The 2010 World Cup was a pivotal moment in Twitter's history. It established
both the model for incident management and a process by which debugging could
be done on exceptionally large distributed systems. It started
with a total failure, but ended with a team coming together and finding the
convoluted cause just in time.
Twitter Tales: Sorry @Flash
An errant feature launch managed to erase one poor unsuspecting user from
Twitter. This user got stuck in a broken state, able to tweet and see a
timeline, but nobody could see their profile. This article explores the
reason why it happened and provides tips for preventing this type of failure
on your own site.
Twitter Tales: Load Bearing Mac Mini
During Twitter's early days, when the company still had fewer than 100
engineers, a small computer became a crucial piece of infrastructure. This
is the true story of Twitter's infamous "Load Bearing Mac-Mini."
Hiring Your First SRE
A quick guide on hiring your first SRE, written by an SRE who was often
the first to join a company. This is targeted at a hiring manager
or interviewer who is trying to figure out what an SRE does, as well as how
to interview somebody in order to find a good candidate.
What is SREally?
A brief synopsis of SREally explaining what the initial inspiration was
for starting and contributing to it.
A Tale of Unexpected ELB Behavior
Amazon's Classic Elastic Load Balancers (ELBs) had an issue where they would
pre-open connections, which caused strange timeout behavior in the
services behind them. This article documents what that looks like
and how to resolve it.
Tips for Resilient API Design
A list of mistakes to avoid, as well as suggestions for making an API more
resilient and future-proof, by an industry veteran with decades of experience
working with large and small APIs.
Who are you? How do I know you?
It's evening on a holiday weekend and your entire site has been hacked. How
do you ensure that the people you are talking to are who they say they
are if you have never communicated outside of work channels?
Status Page Conundrum
There is nothing worse than encountering a major issue right in the middle of
trying to deal with another major issue. This article tries to describe
some failure modes worth considering when setting up your new status page.
Tower of Hanoi Timeouts
A summary of a type of engineering failure common when writing services, in
which a child request has a timeout that is larger than its parent's. This can lead
to a total breakdown in request and error handling that causes confusing
traffic patterns, as experienced by Twitter in 2010.
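One way to avoid this kind of inversion is to derive each child timeout from
whatever time the parent request has left. The following is a minimal sketch,
not the approach Twitter used: the function name `child_timeout` and the
`safety_margin` and `default` parameters are hypothetical and chosen only for
illustration.

```python
def child_timeout(parent_deadline, now, safety_margin=0.05, default=1.0):
    """Return a timeout (seconds) for a child request.

    parent_deadline and now are timestamps in seconds on the same clock.
    safety_margin (hypothetical) reserves time for the parent to handle
    the child's result or error before its own deadline expires.
    default (hypothetical) is the timeout the child would normally use.
    """
    remaining = parent_deadline - now - safety_margin
    if remaining <= 0:
        # The parent has no budget left, so do not start the child at all.
        raise TimeoutError("parent deadline already exhausted")
    # The child never waits longer than the parent can still wait.
    return min(default, remaining)
```

With this shape, a child that would normally wait one second waits less when
its parent is about to give up, so the parent always sees the child's error
instead of timing out first.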
Copyright 2016 - 2024