Overview
Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations. The concept was pioneered by Google in 2003, and the role was popularized by their famous SRE book, published in 2016. Think about it this way: traditional operations teams spend their time manually maintaining systems.
SREs write code to manage infrastructure, monitor systems, and respond to incidents automatically. The core philosophy is simple: "if a human is doing it, we should automate it."
Key Terms You Should Know
SLI (Service Level Indicator)
A measurement of service behavior. For example: "the percentage of requests that complete in under 200ms" or "the percentage of requests that succeed." SLIs are the raw numbers you measure.
SLO (Service Level Objective)
A target value for an SLI. For example: "99.9% of requests will complete in under 200ms." SLOs define what "reliable enough" means for your service. They're internal goals, not contracts.
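To make the SLI/SLO relationship concrete, here is a minimal Python sketch. The `Request` record, the sample data, and the function names are illustrative assumptions, not part of any particular monitoring tool; in practice the numbers would come from your metrics system.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # how long the request took
    ok: bool           # whether it succeeded


def latency_sli(requests, threshold_ms=200.0):
    """SLI: fraction of requests that completed in under threshold_ms."""
    if not requests:
        raise ValueError("no requests to measure")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


def availability_sli(requests):
    """SLI: fraction of requests that succeeded."""
    if not requests:
        raise ValueError("no requests to measure")
    return sum(1 for r in requests if r.ok) / len(requests)


def meets_slo(sli, slo_target=0.999):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical sample: 1,000 requests, 2 of them slow, 1 of those failed.
sample = [Request(120.0, True)] * 998 + [Request(450.0, True), Request(900.0, False)]

sli = latency_sli(sample)
print(f"latency SLI = {sli:.4f}, meets 99.9% SLO: {meets_slo(sli)}")
print(f"availability SLI = {availability_sli(sample):.4f}")
```

The SLI is just a measured ratio; the SLO is the line you compare it against. Here two slow requests out of a thousand are enough to miss a 99.9% latency target.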
SLA (Service Level Agreement)
A legal contract with customers about reliability, usually with financial penalties. SLAs should always be looser than SLOs—if your SLO is 99.9%, your SLA might be 99.5% to give you buffer.
Error Budget
The allowed amount of unreliability. If your SLO is 99.9% uptime, you're allowed 0.1% downtime per month (~43 minutes). This is your "budget" to spend on risky deployments, experiments, or maintenance. When you run out, you freeze changes and focus on reliability.
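The arithmetic behind that ~43 minutes can be sketched in a few lines of Python. The function names and the 30-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


# 99.9% over 30 days: 0.1% of 43,200 minutes.
print(round(error_budget_minutes(0.999), 1))  # 43.2
```

A negative result from `budget_remaining` is the signal, in this sketch, that the budget is exhausted and a change freeze would apply.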
Toil
Manual, repetitive work that doesn't provide lasting value. Examples: manually restarting servers, copying data between systems, responding to the same alerts repeatedly. SREs aim to eliminate toil through automation.
Observability
The ability to understand what's happening inside your systems from external outputs. The three pillars are: metrics (numbers over time), logs (event records), and traces (request paths through distributed systems).
Postmortem
A document written after an incident to understand what happened, why it happened, and how to prevent it in the future. Postmortems should be blameless—focused on systems and processes, not individuals.
On-Call
Being the designated person who responds to production incidents outside normal working hours. SREs typically rotate on-call duty. Good on-call culture means clear escalation paths, runbooks, and reasonable alert volume.
SRE vs DevOps: What's the Difference?
DevOps is a culture and set of practices that emphasizes collaboration between development and operations teams. It's about breaking down silos, automating deployments, and shipping faster. SRE is a specific job role that implements DevOps principles with a particular focus on reliability. Google famously said: "SRE is what happens when you ask a software engineer to design an operations team."
The Complete Learning Path
Follow these steps in order. Each builds on the previous. All resources are 100% free.
Master Programming (Python or Go)
Duration: 6-8 weeks, Foundation level
Why this matters: SREs write code constantly: automation scripts, monitoring tools, internal services. You need to be comfortable programming, not just scripting.
- Python: More common, easier to learn, excellent for automation and scripting. Most SRE tooling has Python libraries.
- Go: Used heavily at Google, and the language Kubernetes and Docker are written in. Faster and better for building tools, but harder to learn.
Recommendation: Start with Python, then learn Go later. Python has a large ecosystem of libraries for infrastructure automation.
Key concepts to master:
- Variables, functions, data structures (lists, dicts, sets)
- File I/O and text processing (crucial for log analysis)
- HTTP requests and API interactions
- Error handling and exceptions
- Working with JSON and YAML
- Unit testing basics
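Several of these concepts (JSON, error handling, unit testing) fit in one small exercise. The following is a sketch only: the config fields `name` and `port`, the `ConfigError` class, and the function names are made up for illustration.

```python
import json


class ConfigError(Exception):
    """Raised when a service config is missing or malformed."""


def load_service_config(text):
    """Parse a JSON service config and validate the fields we rely on."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"invalid JSON: {exc}") from exc
    if not isinstance(config, dict):
        raise ConfigError("config must be a JSON object")
    for key in ("name", "port"):
        if key not in config:
            raise ConfigError(f"missing required field: {key}")
    port = config["port"]
    if not isinstance(port, int) or not 0 < port < 65536:
        raise ConfigError("port must be an integer between 1 and 65535")
    return config


def test_load_service_config():
    """A unit test in the plain-assert style that pytest also accepts."""
    good = load_service_config('{"name": "checkout", "port": 8080}')
    assert good["port"] == 8080
    try:
        load_service_config('{"name": "checkout"}')
    except ConfigError as exc:
        assert "port" in str(exc)
    else:
        raise AssertionError("expected ConfigError")


test_load_service_config()
```

Raising one specific exception type, rather than letting `KeyError` or `JSONDecodeError` leak out, keeps the calling script's error handling simple.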
Master Linux & Networking
Duration: 6-8 weeks, Core infrastructure
Why this matters: SREs operate in Linux environments. You need deep knowledge of how systems work: not just how to use commands, but how processes, memory, disk I/O, and networking behave, and how to troubleshoot them.
Linux essentials:
- Command line fluency (bash, shell scripting)
- Process management (ps, top, htop, systemd)
- File systems and disk management (df, du, lsblk, mount)
- User permissions and security
- Package management (apt, yum, dpkg)
- Log analysis (journalctl, grep, awk, sed)
Networking fundamentals:
- TCP/IP, UDP, HTTP/HTTPS protocols
- DNS (how domain names resolve to IPs)
- Load balancing and reverse proxies
- Firewalls and security groups
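Once the shell tools feel familiar, the same log analysis can be scripted. This Python sketch counts HTTP status codes, roughly what `awk '{print $4}' access.log | sort | uniq -c` would do. The five-field log format (timestamp, method, path, status, latency in ms) is an assumption made for this example, not a standard.

```python
from collections import Counter


def count_statuses(lines):
    """Count HTTP status codes found in the fourth field of each log line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 5 or not fields[3].isdigit():
            continue  # skip blank or malformed lines instead of crashing
        counts[fields[3]] += 1
    return counts


# Hypothetical log lines in the assumed format.
sample_lines = [
    "2025-01-15T10:00:00Z GET /api/orders 200 35",
    "2025-01-15T10:00:01Z POST /api/orders 500 132",
    "2025-01-15T10:00:02Z GET /healthz 200 2",
    "corrupted line",
]

for status, count in count_statuses(sample_lines).most_common():
    print(status, count)
```

To run it against a real file, pass an open file object, since iterating over a file yields lines: `count_statuses(open("access.log"))`, where the path is whatever your service actually writes.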
Learn Observability (Metrics, Logs, Traces)
Duration: 5-6 weeks, Core SRE skill
Why this matters: You can't fix what you can't observe. Observability is how SREs understand what's happening in production systems. The three pillars are metrics, logs, and traces.
Metrics: Numbers that change over time. Examples: CPU usage, request latency, error rates. Tools: Prometheus (collection) + Grafana (visualization).
Logs: Event records from applications. Tell you what happened and when. Tools: Loki, Elasticsearch/ELK Stack, Splunk (enterprise).
Traces: Records of a request's path through a distributed system. Tell you where the time went and which service failed. Tools: Jaeger, OpenTelemetry.
- Setting up Prometheus to scrape metrics
- Creating useful Grafana dashboards
- Writing PromQL queries to analyze metrics
- Configuring alerting rules
- Understanding the RED method (Rate, Errors, Duration) and USE method (Utilization, Saturation, Errors)
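As a rough illustration of the RED method, the sketch below computes Rate, Errors, and Duration from raw request samples in plain Python. In a real setup Prometheus would do this from counters and histograms; the sample data, function name, and nearest-rank p95 are assumptions for this example.

```python
def red_summary(samples, window_seconds):
    """Compute Rate, Errors, Duration from (latency_seconds, is_error) samples."""
    if not samples or window_seconds <= 0:
        raise ValueError("need at least one sample and a positive window")
    latencies = sorted(latency for latency, _ in samples)
    errors = sum(1 for _, is_error in samples if is_error)
    # Nearest-rank 95th percentile, computed with integer math:
    # rank = ceil(0.95 * n), then take the value at that 1-based rank.
    rank = (95 * len(latencies) + 99) // 100
    return {
        "rate_per_second": len(samples) / window_seconds,
        "error_ratio": errors / len(samples),
        "p95_latency_seconds": latencies[rank - 1],
    }


# Hypothetical minute of traffic: 95 fast successes, 5 slow failures.
samples = [(0.1, False)] * 95 + [(0.9, True)] * 5
summary = red_summary(samples, window_seconds=60)
print(summary)
```

Note how the p95 here is still 0.1 s even though 5% of requests were slow and failing. That is one reason RED tracks errors separately from duration.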
Master Containers & Kubernetes
Duration: 6-8 weeks, Industry standard
Why this matters: Many production workloads run on Kubernetes. As an SRE, you'll often be responsible for the reliability of K8s clusters and the applications running on them.
Start with Docker:
- What containers are (isolated processes with their own filesystem, sharing the host kernel)
- Building images with Dockerfiles
- Running containers, port mapping, volumes
- Multi-stage builds and image optimization
Then Kubernetes:
- Core concepts: Pods, Deployments, Services, ConfigMaps, Secrets
- Scheduling and resource management
- Networking (Services, Ingress, Network Policies)
- Storage (PersistentVolumes, StorageClasses)
- Debugging with kubectl (logs, exec, describe)
- Helm for package management
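To see how the core concepts fit together, here is a sketch that builds a Deployment manifest as a Python dict and prints it as JSON, which kubectl accepts alongside YAML. The service name, image, port, and resource values are placeholders, not recommendations.

```python
import json


def deployment_manifest(name, image, replicas=2):
    """Build an apps/v1 Deployment as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the Pod template's labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": 8080}],
                            # Requests guide scheduling; limits cap usage.
                            "resources": {
                                "requests": {"cpu": "100m", "memory": "128Mi"},
                                "limits": {"memory": "256Mi"},
                            },
                        }
                    ]
                },
            },
        },
    }


# Hypothetical service and image name.
manifest = deployment_manifest("checkout", "example.com/checkout:1.0.0")
print(json.dumps(manifest, indent=2))
```

Saved to a file, the output could be applied with `kubectl apply -f deployment.json`. Writing it by hand once makes the relationship between Deployments, Pod templates, and labels much easier to debug later.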
Understand SRE Practices
Duration: 4-6 weeks, Core theory
Why this matters: This is what makes SRE different from generic DevOps. These practices are how Google and other top companies manage reliability at massive scale.
- Choosing the right SLIs for your service (availability, latency, throughput)
- Setting realistic SLOs (hint: 100% is never the goal)
- Writing SLO-based alerts that page on user-facing symptoms instead of cause-based alerts
- Calculating error budgets from SLOs
- Using error budgets to balance reliability vs. velocity
- Error budget policies (what happens when you run out)
- Identifying toil in your work
- Automation strategies
- Google's 50% rule (SREs should spend <50% time on operational work)
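One common way to turn an SLO into an alert is the burn rate: how fast you are spending the error budget relative to plan. The sketch below assumes a 30-day SLO window and uses 14.4 as the paging threshold, the value the Site Reliability Workbook uses for "2% of the budget gone in one hour". The function names are illustrative.

```python
def burn_rate(error_ratio, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_ratio / budget


def budget_consumed(rate, alert_window_hours, slo_window_days=30):
    """Fraction of the whole budget spent if `rate` holds for the alert window."""
    return rate * alert_window_hours / (slo_window_days * 24)


def should_page(error_ratio, slo_target, threshold=14.4):
    """Page when the current burn rate reaches the threshold."""
    return burn_rate(error_ratio, slo_target) >= threshold


# Hypothetical reading: 2% of requests failing against a 99.9% SLO.
rate = burn_rate(0.02, 0.999)
print(f"burn rate: {rate:.1f}")
print(f"budget spent in 1 hour at this rate: {budget_consumed(rate, 1):.1%}")
print(f"page someone: {should_page(0.02, 0.999)}")
```

The appeal of this approach is that the alert threshold follows directly from the SLO, so loosening or tightening the target automatically changes when people get paged.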
Learn Incident Management
Duration: 3-4 weeks, Real-world skills
Why this matters: SREs are responsible for keeping systems running. When things break (and they will), you need to respond quickly and effectively. Good incident management separates great SREs from average ones.
On-call fundamentals:
- Setting up on-call rotations (PagerDuty, Opsgenie)
- Creating runbooks for common incidents
- Escalation procedures
- Healthy on-call practices (sustainable workload, compensation)
Incident response:
- Incident command system (roles: Incident Commander, Communications Lead, etc.)
- Triage and prioritization
- Communication during incidents
- Post-incident cleanup
- Writing effective postmortem documents
- Focusing on systems, not people
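An escalation procedure is easier to reason about when written down as data. This is a toy sketch, not how PagerDuty or Opsgenie model it: the roles, the timings, and the `who_to_notify` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the page has gone unacknowledged
    notify: str         # role to notify at this step


# Hypothetical policy; real timings depend on the service's severity levels.
POLICY = [
    EscalationStep(0, "primary on-call"),
    EscalationStep(15, "secondary on-call"),
    EscalationStep(30, "engineering manager"),
]


def who_to_notify(minutes_unacknowledged, policy=POLICY):
    """Return every role that should have been paged by now, in order."""
    if minutes_unacknowledged < 0:
        raise ValueError("elapsed time cannot be negative")
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [s.notify for s in ordered if s.after_minutes <= minutes_unacknowledged]


print(who_to_notify(20))  # ['primary on-call', 'secondary on-call']
```

Keeping the policy explicit like this also gives you something concrete to review in a postmortem when a page went unanswered.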
Tips for Success
- Read the Google SRE books. They're free online and are the bible of SRE. Start with "Site Reliability Engineering" then read "The Site Reliability Workbook" for practical exercises.
- Get comfortable with chaos. SRE is about embracing that things will break. Practice chaos engineering—intentionally break things in controlled ways to find weaknesses.
- Automate your own workflow first. Before you automate production systems, automate your daily tasks. This builds the habit and skills.
- Practice writing postmortems. Even for personal projects, write postmortems when things go wrong. The skill of analyzing failures is invaluable.
- Join the SRE community. The SRE subreddit, SREcon talks (free on YouTube), and various Slack communities are great for learning from experienced practitioners.
