Your startup is growing and has reached the point where you can no longer operate with just a pool of generalist engineers (and one or two DevOps engineers). It’s time to start specializing! One of the roles you should look for is a Site Reliability Engineer, or SRE.
If the term SRE is new to you, here’s the TL;DR: an SRE is a systems developer who focuses on infrastructure problems. For more resources, you can check out an interview with Ben Treynor, the head of Google’s SRE team, or the book Google’s SRE team wrote on the subject.
Now, you might be thinking: “Why not just hire more DevOps engineers and be done with it?” If your experience with DevOps has been engineers with deep systems knowledge who like to focus on all infrastructure components, then congrats! You may have hired an SRE and not even realized it. The term “DevOps” is grossly overused, covering everything from a true SRE to an engineer with experience setting up a deploy pipeline and maintaining a CI server. An SRE should help you solve not only the problems you have today, but also the problems that are coming that you may not even know about yet. They will be able to set up monitoring for your service to help catch problems before they start, reduce overall cost by tracking resources, reduce deployment time with automation, reduce development time by finding preexisting solutions that might otherwise be missed, and far more that we will likely write about in later articles.
So the question we are left with is: “How do I hire an SRE if I am not really sure what one does?” That is a challenging question to answer, but one of the most important tasks of an SRE is spreading knowledge, so start that process in the interview. Unless they are a bundle of nerves, you should be able to talk with them as you would with a colleague, asking them to explain answers you don’t understand and letting them teach you what you don’t know.
Reliability
As implied by the ‘reliability’ in ‘site reliability engineer’, your new SRE’s role is to help your organization build reliable systems.
Ask questions like:
- How do you balance reliability and moving fast and breaking things? When would you choose one over the other?
- How do you approach system design that might have a single point of failure?
- What methods have you used to verify that the code running in production matches what you think is running?
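One answer a candidate might give to the last question is to compare a digest of the deployed artifact against the digest recorded at build time. Here is a minimal Python sketch of that idea. The function names and the notion of a single artifact file are assumptions made for illustration, not a description of any particular deploy tool.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_build(deployed: Path, expected_digest: str) -> bool:
    """True if the deployed file has the digest recorded at build time."""
    return sha256_of(deployed) == expected_digest.lower()


if __name__ == "__main__":
    # Hypothetical artifact, created here only so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        artifact = Path(tmp) / "app.tar"
        artifact.write_bytes(b"release contents")
        recorded = sha256_of(artifact)  # what the build step would store

        print("fresh deploy matches:", matches_build(artifact, recorded))

        # Simulate someone patching the file by hand on the server.
        artifact.write_bytes(b"hotfix applied by hand")
        print("after manual edit matches:", matches_build(artifact, recorded))
```

A strong candidate will probably go further than this, for example by talking about exposing the running version from the service itself, but a digest check is a reasonable starting point for the conversation.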
Monitoring
SREs should be installing, expanding, and fighting for metrics. The best SREs make data-driven decisions. If the discussion turns to something like “We should replace this Java process because GC is becoming an issue,” an SRE should respond with, “Let me graph our GC timing to see what it looks like.”
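To make “measure before you decide” concrete, here is a small sketch of collecting GC timing. The example above is about a Java process; this version uses Python instead, as an assumption for illustration, relying on CPython’s `gc.callbacks` hook. The `GcTimer` class is a hypothetical helper, and in practice the numbers would be sent to whatever graphing system you already run.

```python
import gc
import time


class GcTimer:
    """Record the wall-clock duration of each garbage collection pass."""

    def __init__(self):
        self.durations = []
        self._started_at = None

    def __call__(self, phase, info):
        # CPython calls every entry in gc.callbacks with phase "start"
        # before a collection and "stop" after it.
        if phase == "start":
            self._started_at = time.perf_counter()
        elif phase == "stop" and self._started_at is not None:
            self.durations.append(time.perf_counter() - self._started_at)
            self._started_at = None

    def summary(self):
        """Aggregate the recorded pauses into numbers that are easy to graph."""
        return {
            "collections": len(self.durations),
            "total_seconds": sum(self.durations),
            "max_seconds": max(self.durations, default=0.0),
        }


if __name__ == "__main__":
    timer = GcTimer()
    gc.callbacks.append(timer)
    try:
        # Create some reference cycles so the collector has work to do.
        for _ in range(10000):
            a, b = [], []
            a.append(b)
            b.append(a)
        gc.collect()
    finally:
        gc.callbacks.remove(timer)
    print(timer.summary())
```

The point is less the code than the habit: with numbers like these on a graph, the “is GC really the issue?” debate becomes a short one.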
Ask questions like:
- How would you monitor an infrastructure like ours?
- What would you monitor in an application like ours?
- What do you think makes for good application logging?
Alerting
As with monitoring, an SRE should be able to set up alerting that catches issues before you get a wave of customer support tickets. They should know how to do this and have a list of things they like to monitor right out of the gate.
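As a rough illustration of catching an issue before the support tickets arrive, the sketch below fires when the error rate over a sliding time window crosses a threshold. The class name, the 60-second window, the 5% threshold, and the minimum request count are all assumptions chosen for the example. A real setup would normally express this as a rule in an existing alerting system rather than as hand-written code.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over a sliding time window exceeds a threshold."""

    def __init__(self, window_seconds=60.0, threshold=0.05, min_requests=20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        # Avoid paging on a single failed request during quiet periods.
        self.min_requests = min_requests
        self._events = deque()  # (timestamp, is_error), oldest first

    def record(self, timestamp, is_error):
        """Register one request outcome at the given time (in seconds)."""
        self._events.append((timestamp, is_error))
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def should_alert(self, now):
        """True if enough traffic was seen and too much of it failed."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold


if __name__ == "__main__":
    alert = ErrorRateAlert()
    # 100 requests over 50 seconds, one in ten failing.
    for i in range(100):
        alert.record(i * 0.5, is_error=(i % 10 == 0))
    print("alert at t=50s:", alert.should_alert(50.0))
```

The `min_requests` guard is the kind of detail worth probing in an interview: a candidate who has been paged at 3 a.m. by a noisy alert will usually have opinions about it.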
Ask questions like:
- What is your approach to alerting?
- What would you alert on in an infrastructure like ours and why?
- How do you put together an on-call rotation including both SREs and non-SREs?
Infrastructure
SREs spend a lot of time dealing with infrastructure components like Amazon Web Services (AWS), Google Compute Engine (GCE), Nginx or Apache, MySQL, Redis, etc. On top of that, they should have a good understanding of how well each solution scales.
Ask questions like:
- What technology would you suggest for a stack like ours? Why?
- What are some things we should be avoiding?
Product Engineering
One of the major things that sets SREs apart from traditional Operations roles is that they will often get involved in the product development side of the company. They are, after all, software engineers too. This means that they are going to have to interact very closely with your existing engineering staff. Always be on the lookout for somebody who will cause clashes. A toxic relationship between your SRE-focused engineers and product-focused engineers will more or less doom the SREs.
Ask questions like:
- What are some improvements that you feel we can make in our product to make it more reliable? More scalable?
- If you found a major issue with the design of our site, how would you go about getting it fixed?
Debugging
Another common trait of an SRE is the ability to debug. They are often among the few people with cross-team knowledge of the whole system, which makes them well suited to debugging complex issues. Finding an SRE who can keep that whole picture in their head while tracking down a specific issue is a prime goal. When talking about debugging, make sure that they use rigor when problem-solving: do they seem to evaluate, test, then conclude, or are they just throwing solutions at the wall to see what sticks?
Ask questions like:
- How would you debug this hypothetical issue?
- Now that you found the solution, how would you prevent this from happening in the first place?
- Is the cost of preventing it from happening worthwhile for us?
Security
Though it is not strictly a requirement, SREs often have a security background as well.
Ask questions like:
- What is a common security issue with products like ours?
- What are common vulnerabilities that most people do not think about?
Other thoughts
Also keep in mind that an SRE is a developer, so it should be safe to ask them general development questions too. The old favorites work well here, like “What are the problems of your favorite programming language?” or “What is the difference between a system call and a library call?” If they are going to debug within your infrastructure, they must be able to grasp your code, even if they are not software architects, so don’t hesitate to ask questions along those lines.
Some more generic questions include:
- What was your biggest “Wow!” moment where you suddenly got something about running a reliable service?
- What are some indicators of a successful SRE team?
- What is the difference between exporting a metric and aggregating a log, and when would you do one vs the other?
- What are the advantages and disadvantages of a microservices architecture?
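If the metric-versus-log question in the list above is unfamiliar territory for you as the interviewer, the following Python sketch shows the two approaches side by side. It is only an illustration under assumed names: `RequestMetrics`, `aggregate_logs`, and the JSON log format with a `status` field are all hypothetical.

```python
import json
from collections import Counter


class RequestMetrics:
    """Exported metric: a counter updated in-process as requests happen."""

    def __init__(self):
        self.by_status = Counter()

    def observe(self, status):
        # Cheap and already aggregated, but per-request detail is gone.
        self.by_status[status] += 1


def aggregate_logs(lines):
    """Log aggregation: derive the same counts later from raw log lines."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        counts[event["status"]] += 1
    return counts


if __name__ == "__main__":
    requests = [
        {"path": "/login", "status": 200},
        {"path": "/login", "status": 500},
        {"path": "/cart", "status": 200},
    ]

    metrics = RequestMetrics()
    log_lines = []
    for request in requests:
        metrics.observe(request["status"])      # metric path
        log_lines.append(json.dumps(request))   # log path keeps the full event

    print("from metric:", dict(metrics.by_status))
    print("from logs:  ", dict(aggregate_logs(log_lines)))
```

Both paths arrive at the same counts here. The trade-off a candidate should be able to talk through is that the metric is cheap to store and query but has discarded details such as the path, while the log keeps every event at the cost of storage and parsing work.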