The Site Reliability Engineer Interview, Decoded
The Mythic Intel Team · May 14, 2026 · 9 min read
Site Reliability Engineering interviews are not software interviews with a reliability flavor on top. The loop is built to find people who stay calm when a service is on fire, who reason about systems in concrete numbers instead of hand-waving, and who can write working code under the same pressure. If you are preparing for SRE interview questions and treat the process like a standard engineering loop, you will be surprised by how much production thinking the panel expects.
Below is the real shape of a site reliability engineer interview at the companies that defined the discipline, what each round actually probes, and the technical ground you need to hold. SRE interview preparation works best when you map your practice to these rounds rather than grinding generic puzzles.
The recruiter screen and how the loop is structured
The first conversation is a recruiter screen. It confirms logistics, your experience level, and on-call appetite, and it sets expectations for the rounds ahead. A Google-style SRE loop typically runs several distinct interviews: coding, troubleshooting, systems and production design (often called Non-Abstract Large Systems Design, or NALSD), Unix and systems internals, and behavioral. Some loops fold internals into the troubleshooting round. Expect four or more interviews on the day.
Be ready to state, in plain terms, what you owned: the services, their SLOs, your on-call rotation, the worst incident you ran, and what changed afterward. Vague ownership claims get probed hard later.
Coding, but non-abstract
SRE coding rounds skew practical. You will still get data-structures-and-algorithms style problems, but the framing is operational: parse a log stream and compute the p99 latency, deduplicate events, find the request rate per source IP over a sliding window, walk a process tree. The bar is working code, sensible complexity, and clean handling of edge cases like empty input and malformed lines.
# A typical SRE coding shape: per-minute error rate from a log
# parse timestamp + status, bucket by minute, emit rate where errors/total > threshold
Write real code, run through it line by line, and reason about time and memory out loud. Interviewers care that you would trust this code at 3am.
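As a rough illustration of the per-minute error rate shape described above, here is a minimal Python sketch. The log format (an ISO-8601 timestamp followed by an HTTP status, separated by whitespace), the function name `error_rate_by_minute`, and the 5% default threshold are all assumptions made for the example, not a format any particular interviewer will hand you.

```python
from collections import defaultdict


def error_rate_by_minute(lines, threshold=0.05):
    """Return {minute: error_rate} for minutes whose 5xx rate exceeds threshold.

    Assumed (hypothetical) line format: "<ISO-8601 timestamp> <status>",
    for example "2026-05-14T03:12:45Z 500".
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # empty or truncated line
        timestamp, status_text = parts[0], parts[1]
        if len(timestamp) < 16 or not status_text.isdecimal():
            continue  # malformed timestamp or status
        minute = timestamp[:16]  # "YYYY-MM-DDTHH:MM"
        totals[minute] += 1
        if int(status_text) >= 500:
            errors[minute] += 1
    # Emit only the minutes whose error ratio is above the threshold.
    return {
        minute: errors[minute] / totals[minute]
        for minute in sorted(totals)
        if errors[minute] / totals[minute] > threshold
    }
```

The sketch makes one pass over the input, so time is linear in the number of lines and memory grows with the number of distinct minutes, not with the number of requests. Empty input returns an empty dict and malformed lines are skipped, which are the edge cases worth saying out loud.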
Troubleshooting: the round that fails senior people
The troubleshooting round is the heart of the SRE interview. The interviewer describes a service that is misbehaving (latency spiked, error rate climbed, a region went dark) and you drive the diagnosis. There is no clean answer key. They are watching your method.
Strong candidates work top down and narrow systematically:
- Establish the symptom precisely and check whether it is user-visible. What does the SLI say, and is the SLO actually breached?
- Ask what changed: recent deploy, config push, traffic shift, dependency, certificate, quota.
- Bisect the request path: load balancer, application, cache, database, downstream services. Use the four golden signals (latency, traffic, errors, saturation) to localize the fault.
- Form one hypothesis at a time, predict what you would see if it were true, then check. Do not shotgun.
- Separate mitigation from root cause. Stop the bleeding first (roll back, drain a region, shed load), then investigate.
The failure mode that sinks experienced engineers is jumping straight to a clever root cause without grounding it in evidence. Verbalize the evidence chain.
Production and systems design (NALSD)
NALSD asks you to design a real system and then make it concrete. The "non-abstract" part is the whole point: you are expected to do back-of-the-envelope math. If you build a service handling 50,000 queries per second, you estimate request size, fan-out, storage growth, and how many machines that implies, then check whether your design survives the load and a machine or zone failure.
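The back-of-the-envelope step can be sketched in a few lines of Python. Only the 50,000 queries per second figure comes from the example above; the per-machine throughput, utilization target, zone count, request size, and stored bytes per request are illustrative assumptions that you would replace with measured values.

```python
import math


def machines_needed(peak_qps, qps_per_machine, target_utilization, zones):
    """Return (machines per zone, total) sized to survive losing one zone.

    Assumes zones >= 2 and traffic spread evenly across zones.
    """
    # Machines required to serve peak while staying under the utilization target.
    serving = math.ceil(peak_qps / (qps_per_machine * target_utilization))
    # With one zone down, the remaining zones must still hold `serving` machines.
    per_zone = math.ceil(serving / (zones - 1))
    return per_zone, per_zone * zones


PEAK_QPS = 50_000         # from the example in the text
QPS_PER_MACHINE = 1_000   # assumption: would come from a load test
TARGET_UTILIZATION = 0.5  # assumption: run machines at half capacity
ZONES = 3                 # assumption
REQUEST_BYTES = 2_000     # assumption: average request size
LOG_BYTES = 200           # assumption: bytes stored per request

per_zone, total = machines_needed(PEAK_QPS, QPS_PER_MACHINE, TARGET_UTILIZATION, ZONES)
_, total_2x = machines_needed(2 * PEAK_QPS, QPS_PER_MACHINE, TARGET_UTILIZATION, ZONES)

ingress_mbit_per_s = PEAK_QPS * REQUEST_BYTES * 8 / 1e6
# Upper bound: assumes peak traffic is sustained for the whole day.
storage_gb_per_day = PEAK_QPS * LOG_BYTES * 86_400 / 1e9

print(f"{per_zone} machines per zone, {total} total ({total_2x} at 2x traffic)")
print(f"ingress ~{ingress_mbit_per_s:.0f} Mbit/s, storage <= {storage_gb_per_day:.0f} GB/day")
```

Under these assumed inputs the sketch arrives at 50 machines per zone and 150 in total, doubling to 300 at 2x traffic. The arithmetic matters less than showing where each number comes from and which ones you would verify before trusting the design.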
Cover the reliability dimensions explicitly:
- Capacity and headroom, with numbers. How many replicas, how much memory, what happens at 2x traffic.
- Failure domains: what breaks when a node, a zone, or a dependency dies, and how the system degrades.
- Data: replication, consistency model, and what you sacrifice under partition.
- Load management: backpressure, retries with jitter and caps, circuit breaking, graceful degradation.
A senior answer ties the design back to an SLO. State the target, then show the design meets it.
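Of the load-management items listed above, "retries with jitter and caps" is the easiest to show in code. The following is a minimal Python sketch of capped exponential backoff with full jitter. The function name `call_with_retries`, its parameters, and the default values are hypothetical, and the `sleep` and `rng` arguments are injectable only so the behavior can be tested without waiting.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying on exception with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Simplification: a real client would retry only errors it knows
            # to be transient, not every exception.
            if attempt == max_attempts - 1:
                raise  # attempt cap reached: surface the failure
            # Exponential ceiling, capped so delays never grow unbounded.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the ceiling so clients
            # that failed together do not retry together.
            sleep(rng() * ceiling)
```

The two caps do different jobs. The attempt cap bounds how much extra load one failing request can add to a struggling dependency, and the delay cap keeps a long outage from pushing waits into minutes. Jitter spreads the retries out so a recovering service is not hit by a synchronized wave.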
SLI, SLO, and error budget reasoning
This vocabulary needs to be exact, because panels test it directly.
- An SLI is a quantitative measure of the service that users actually experience: the proportion of successful requests, request latency under a threshold, data freshness.
- An SLO is the target for that SLI over a window, for example 99.9% of requests succeed over 28 days.
- The error budget is 100% minus the SLO. A 99.9% SLO grants a 0.1% budget of allowed failures. It turns reliability into a number both SRE and product teams can negotiate against.
- Burn rate is how fast you consume the budget relative to how fast it accrues. A burn rate of 1.0 spends the budget exactly over the SLO window; higher rates spend it faster. Modern alerting uses multiple windows and multiple burn-rate thresholds so a fast, severe burn pages immediately while a slow burn raises a ticket.
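The definitions above reduce to a few lines of arithmetic, sketched here in Python. The function names and the page and ticket thresholds in `alert_action` are illustrative assumptions, not values from any specific alerting policy. Requiring both a long and a short window to exceed the threshold is one common way to build multi-window alerts, shown here as a simplified version.

```python
def error_budget(slo):
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(failed, total, slo):
    """Observed error ratio divided by the budgeted error ratio."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    return (failed / total) / error_budget(slo)


def hours_to_exhaustion(rate, window_days):
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def alert_action(long_window_rate, short_window_rate,
                 page_threshold=10.0, ticket_threshold=1.0):
    """Page on a fast burn, ticket on a slow one (thresholds are assumptions)."""
    # Requiring both windows to agree avoids paging on a spike that has
    # already ended or on an old burn that has already stopped.
    lowest = min(long_window_rate, short_window_rate)
    if lowest >= page_threshold:
        return "page"
    if lowest >= ticket_threshold:
        return "ticket"
    return "none"
```

For example, with a 99.9% SLO, 10 failures in 1,000 requests is a 1% error ratio against a 0.1% budget, so the burn rate is about 10. Held constant, that rate would spend a 28-day budget in roughly 67 hours.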
If asked what to do when the error budget is exhausted, the answer is to freeze risky launches and shift effort to reliability until the budget recovers. That tradeoff is the policy SRE exists to enforce.
On-call, incident response, and behavioral
The behavioral round probes operational maturity. Expect to walk a real incident end to end: detection, triage, who you paged, how you communicated, mitigation, and the follow-up. Use the language of structured incident response, with a clear incident commander and ongoing comms.
When you discuss the aftermath, frame it as a blameless postmortem. The goal is to fix the systemic and process gaps that let the failure happen, not to assign blame to a person. Panels also probe toil: what repetitive operational work you automated away, and how you measured the win. SRE rewards engineers who delete their own busywork.
Rehearse out loud
These rounds are spoken, not written. Practice the troubleshooting walkthrough and the SLO explanation by saying them aloud, end to end, until the evidence chain and the numbers come out cleanly under time pressure. A voice-driven trainer like Mythic Intel can build a verified SRE room and grade your spoken answers on accuracy, completeness, and structure, which is the gap most candidates leave open when they practice only in their heads.