Hiring Your First SRE [2016]

Published: October 14, 2016
Brady Catherman
Sultan of Scale

Your startup is growing and has reached the point where you can no longer operate with just a pool of generalist engineers (and one or two DevOps engineers). It’s time to start specializing! One of the roles you should look for is a Site Reliability Engineer, or SRE.

If the term SRE is new to you, here’s the TL;DR: an SRE is a systems developer who focuses on infrastructure problems. For more background, you can check out an interview with Ben Treynor, the head of Google’s SRE team, or the book Google’s SRE team wrote on the subject.

Now, you might be thinking: “Why not just hire more DevOps engineers and be done with it?” If your experience with DevOps has been engineers with deep systems knowledge who like to focus on all infrastructure components, then congrats! You may have hired an SRE and not even realized it. The term “DevOps” is grossly overused, covering everything from a true SRE to an engineer with experience setting up a deploy pipeline and maintaining a CI server. An SRE should help you solve not only the problems you have today, but also the problems that are coming and that you may not even know about yet. They will be able to set up monitoring for your service to help you see problems before they start, reduce overall cost by tracking resources, reduce deployment time with automation, and reduce development time by finding preexisting solutions that might otherwise be missed. There is far more besides, which we will likely write about in later articles.

So the question we are left with is “How do I hire an SRE if I am not really sure what one does?” That is a challenging question to answer, but one of the most important tasks of an SRE is spreading knowledge, so start that process in the interview. Unless they are a bundle of nerves, you should be able to talk with them as you would with a colleague, asking them to explain answers you don’t understand and letting them teach you what you don’t know.

Reliability

As implied by the ‘reliability’ in ‘site reliability engineer’, your new SRE’s role is to help your organization build reliable systems.

Ask questions like:

Monitoring

SREs should be installing, increasing, and fighting for metrics. The best SREs are data-driven and make data-driven decisions. If the discussion turns to something like “We should replace this Java process because GC is becoming an issue,” an SRE should respond with, “Let me graph our GC timing to see what it looks like.”
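As a rough illustration of "measure before deciding," the sketch below records how long each garbage collection pass takes. The article's example is a Java process; this uses Python's own collector as a stand-in, and the `GCTimer` class and the cycle-building workload are hypothetical names invented for the example. In a real service the durations would be sent to whatever metrics system you already graph from, rather than printed.

```python
import gc
import statistics
import time


class GCTimer:
    """Records the wall-clock duration of each garbage collection pass."""

    def __init__(self):
        self.pauses_ms = []
        self._started_at = None

    def __call__(self, phase, info):
        # CPython calls every entry in gc.callbacks with phase "start"
        # before a collection and phase "stop" after it.
        if phase == "start":
            self._started_at = time.perf_counter()
        elif phase == "stop" and self._started_at is not None:
            elapsed = time.perf_counter() - self._started_at
            self.pauses_ms.append(elapsed * 1000.0)
            self._started_at = None

    def summary(self):
        if not self.pauses_ms:
            return {"count": 0}
        return {
            "count": len(self.pauses_ms),
            "mean_ms": statistics.mean(self.pauses_ms),
            "max_ms": max(self.pauses_ms),
        }


timer = GCTimer()
gc.callbacks.append(timer)

# Hypothetical workload: build reference cycles so the collector has work.
for _ in range(1000):
    a = []
    b = [a]
    a.append(b)
gc.collect()

gc.callbacks.remove(timer)
print(timer.summary())
```

The point is the habit rather than the tool: with numbers like these on a graph, "GC is becoming an issue" turns into a claim that can be checked.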

Ask questions like:

Alerting

As with monitoring, an SRE should be able to set up alerting that catches issues before you get a wave of customer support tickets. They should know how to do this and have a list of things they like to monitor right out of the gate.
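To make the idea concrete, here is a minimal sketch of one common alerting pattern: fire only when the error rate over a sliding window stays above a threshold for several consecutive checks, so a single bad interval does not page anyone. The class name, thresholds, and sample numbers are all assumptions chosen for illustration; real alerting would normally live in your monitoring system's rule language rather than in application code.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the windowed error rate stays above a threshold."""

    def __init__(self, threshold=0.05, window=3, sustained=2):
        self.threshold = threshold            # maximum acceptable error rate
        self.samples = deque(maxlen=window)   # most recent (requests, errors)
        self.sustained = sustained            # consecutive breaches required
        self._breaches = 0

    def observe(self, requests, errors):
        """Record one interval; return True when the alert should fire."""
        self.samples.append((requests, errors))
        total_requests = sum(r for r, _ in self.samples)
        total_errors = sum(e for _, e in self.samples)
        rate = total_errors / total_requests if total_requests else 0.0
        if rate > self.threshold:
            self._breaches += 1
        else:
            self._breaches = 0
        return self._breaches >= self.sustained


alert = ErrorRateAlert(threshold=0.05, window=3, sustained=2)

# Hypothetical per-interval (requests, errors) counts.
intervals = [(100, 1), (100, 20), (100, 20), (100, 0), (100, 0), (100, 0)]
fired = [alert.observe(requests, errors) for requests, errors in intervals]
print(fired)
```

A candidate who can explain why the `sustained` and `window` knobs exist (noise versus detection delay) is showing the kind of judgment this section is looking for.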

Ask questions like:

Infrastructure

SREs spend a lot of time dealing with infrastructure components like Amazon Web Services (AWS), Google Compute Engine (GCE), Nginx or Apache, MySQL, Redis, etc. On top of that, they should have a good understanding of how well each solution scales.

Ask questions like:

Product Engineering

One of the major things that sets SREs apart from traditional Operations roles is that they will often get involved in the product development side of the company. They are, after all, software engineers too. This means that they are going to have to interact very closely with your existing engineering staff. Always be on the lookout for somebody who will cause clashes. A toxic relationship between your SRE-focused engineers and product-focused engineers will more or less doom the SREs.

Ask questions like:

Debugging

Another common trait of an SRE is the ability to debug. They are often among the few people with cross-team knowledge of the whole system, which makes them well placed to debug complex issues. Finding an SRE who can keep this state in their head while tracking down a specific issue is a prime goal. When talking about debugging, make sure that they apply rigor when problem-solving: do they seem to evaluate, test, then conclude, or are they just throwing solutions at the wall to see what sticks?

Ask questions like:

Security

Though a security background is not strictly a requirement for an SRE, many have one.

Ask questions like:

Other thoughts

Also keep in mind that an SRE is a developer, so it should be safe to ask them general development questions too. The old favorites work well, like “What are the problems of your favorite programming language?” or “What is the difference between a system call and a library call?” If they are going to debug within your infrastructure, they must be able to grasp your code, even if they are not software architects, so don’t hesitate to ask questions along those lines.
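For interviewers who want a reference point for the system call question, the sketch below shows one visible difference. A system call crosses into the kernel, while a library call runs in user space and may or may not issue system calls underneath. Here `os.write` is Python's thin wrapper over the `write(2)` system call, and a file object from `open()` stands in for a buffering library call. The temporary file and variable names are made up for the example, and the behaviour shown assumes CPython's default buffered I/O.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
try:
    # Thin wrapper over the write(2) system call: the bytes are handed to
    # the kernel immediately, so the file size changes right away.
    os.write(fd, b"syscall\n")
    size_after_syscall = os.stat(path).st_size
finally:
    os.close(fd)

try:
    # Buffered file object: write() is a library call that copies the bytes
    # into a user-space buffer. No system call is needed until the buffer
    # fills or is flushed.
    with open(path, "ab") as f:
        f.write(b"library\n")
        size_before_flush = os.stat(path).st_size
        f.flush()  # now the library issues the underlying write
        size_after_flush = os.stat(path).st_size
finally:
    os.remove(path)

print(size_after_syscall, size_before_flush, size_after_flush)
```

A strong answer usually touches on the same points: the cost of crossing into the kernel, why libraries buffer to avoid it, and how tools that trace system calls only see the second kind of write once the flush happens.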

Some more generic questions include:

