I started life as a bioinformatics cluster administrator, but quickly shifted
into Site Reliability Engineering when I started working on Gmail at Google in
2006. I was able to work with the early SRE teams at Google, learning how to
think about problems at exceptionally large scale. This is where I met Phil
Pennock.
After Google I was able to join Twitter very early as an Operations Engineer.
That experience was completely out of this world, giving me one of those
startup stories that is sometimes outright unbelievable: from the Load
Bearing Mac-Mini to the 48 days spent living on a data center floor trying to
get the new cluster working before the clock ran out on our managed hosting
environment. At Twitter I implemented the Incident Management process, built
out the data center operations program, and helped to eliminate the “Fail Whale.”
While at Twitter I also initiated a project called “app-app,” which was meant
to be the “application app” that managed deployments to machines. It used
cgroups, namespaces, and unionfs to create private spaces for applications to
run in, isolated from the base operating system. This was conceptually Docker
two years before Docker, and I still kick myself for not working harder to get
it open sourced earlier. This led to my next experience: writing a
Kubernetes-like container management engine at Apcera as founding employee #3, again
before Kubernetes existed. It was at Apcera that I met Jonathan Klobucar for
the first time.
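If you are curious what that looked like conceptually, here is a minimal sketch
in Go (my illustration written for this page, not app-app's actual code) of the
namespace half of the trick: start a process in its own UTS, PID, and mount
namespaces so it gets a private view of the machine. It assumes Linux and root
privileges, and it leaves out the cgroup and unionfs pieces entirely.

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Launch a shell in fresh UTS, PID, and mount namespaces so it sees
	// its own hostname, its own PID 1, and a private mount table.
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Add resource limits via cgroups and a unionfs overlay for the filesystem and
you have the bones of a container runtime.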
At this point I switched over to working as a full-time remote SRE for the rest
of my career. I started in this new model working for a DBaaS company called
Orchestrate (acquired by CenturyLink) and a mobile app streaming service
called Mobcrush (acquired by Super League). In both roles I was the only
infrastructure engineer when the companies had fewer than twenty people.
My next gig was an opportunity to work with Jonathan Klobucar again, whom I had
initially met at Apcera. This job involved building an SRE team from the ground
up, including hiring great people, building infrastructure as code, achieving
SOC 2 compliance, implementing incident management, and more. I also worked on
the Architectural Support Group, helping engineers think through designs to
make sure they saw a problem from every angle, including angles not typically
considered by a developer.
I then ventured off to a startup called Cookie.ai (later renamed Veza) to help
solve a problem that had plagued many of the companies I had worked with up to
that point. I have recently moved on to a new adventure that should prove to be
just as challenging and rewarding!
In 2010 Twitter embarked on an ambitious project to replace its MySQL tweet
store with Cassandra, a newfangled distributed key-value store. Called Project
`Snow Goose`, this ended up leading to a crazy hack of a project called
Project Dirt Goose.
A story about the time that Gmail nearly ran out of storage space days before
Christmas of 2007 and the heroic efforts that went into keeping it running.
Restarting Memcached at Twitter was always an extremely problematic experience.
We often went to great lengths to avoid restarting it at all, and when that
wasn't possible we had to jump through hoops to keep things working. This
article explores some of those events and the eventual fixes that eliminated
the problems.
Twitter used to have a system to help teach people to lock their work laptops.
This is a story of how that system was used to get real brownies in the
office.
Single page apps deliver a nice experience to users, but come with some
unexpected side effects that are often not planned for at design time. This
article attempts to expose some of those issues to help prevent them from
becoming support issues later on down the line.
In 2010 Twitter embarked on a project to move from managed hosting into a new,
bespoke data center, which all went completely wrong in no time flat. This is a
quick write up of some of the catastrophic issues we encountered along the way.
The 2010 World Cup was a pivotal moment in Twitter's history. It established
both the model for incident management and a process by which debugging could
be done on exceptionally large distributed systems. It started with a total
failure, but ended with a team coming together and finding the convoluted
cause just in time.
An errant feature launch managed to erase one poor unsuspecting user from
Twitter. This user got stuck in a broken state, able to tweet and see a
timeline, but nobody could see their profile. This article explores the
reason why it happened and provides tips for preventing this type of failure
on your own site.
During Twitter's early days, when the company was still less than 100
engineers, a small computer became a crucial piece of infrastructure. This
is the true story of Twitter's infamous "Load Bearing Mac-Mini."
A quick guide to hiring your first SRE, as written by an SRE who was often
the first to come into a company. This is targeted at a hiring manager
or interviewer who is trying to figure out what SRE does, as well as how
to interview somebody in order to find a good candidate.
Amazon's Classic Elastic Load Balancers (ELBs) had an issue where they would
pre-open connections, which could cause strange timeout behavior in the
services behind them. This article documents what that looks like
and how to resolve it.
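If you want the short version before reading the full article, one common
mitigation is sketched below (my own hedged example, not necessarily the exact
fix the article lands on): keep the backend's idle timeout longer than the
ELB's 60-second default so the load balancer, not your service, is the side
that closes quiet connections.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr: ":8080",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "ok")
		}),
		// Keep idle keep-alive connections around longer than the ELB's
		// 60-second idle timeout so the load balancer, not this server,
		// is the one that tears down quiet (or pre-opened) connections.
		IdleTimeout: 120 * time.Second,
	}
	if err := srv.ListenAndServe(); err != nil {
		panic(err)
	}
}
```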
A list of mistakes to avoid, as well as suggestions for making an API more
resilient and future-proof, from an industry veteran with decades of experience
working with APIs large and small.
It's the evening of a holiday weekend and your entire site has been hacked. How
do you ensure that the people you are talking to are who they say they
are if you have never communicated outside of work channels?
There is nothing worse than encountering a major issue right in the middle of
trying to deal with another major issue. This article tries to describe
some failure modes worth considering when setting up your new status page.
A summary of a type of engineering failure common when writing services, in
which a child request has a timeout that is larger than its parent's. This can
lead to a total breakdown in request and error handling that causes confusing
traffic patterns, as experienced by Twitter in 2010.
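As a taste of the pattern, here is a small Go sketch (my illustration, not code
from the article) of the inversion: the parent gives up after one second while
the child keeps its own five-second budget, so work continues long after anyone
is listening.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// callBackend simulates a downstream call that ignores the caller's
// deadline and uses its own, more generous five-second budget instead.
// That mismatch is the whole bug.
func callBackend(ctx context.Context) error {
	childCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = ctx // the parent's deadline is deliberately ignored here

	select {
	case <-time.After(3 * time.Second): // pretend the backend is slow today
		return nil
	case <-childCtx.Done():
		return childCtx.Err()
	}
}

func main() {
	// The parent only waits one second before giving up (and likely retrying).
	parentCtx, cancel := context.WithTimeout(context.Background(), 1*time.Second)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- callBackend(parentCtx) }()

	select {
	case <-parentCtx.Done():
		fmt.Println("parent gave up:", parentCtx.Err())
		// The child keeps burning resources on work nobody wants, and
		// during an incident the retries multiply that load.
		fmt.Println("stray child finally returned:", <-done)
	case err := <-done:
		fmt.Println("child finished in time:", err)
	}
}
```

The usual fix is to derive the child's deadline from the parent's context so a
request can never outlive its caller.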