Overview

Site Reliability Engineering (SRE) is a discipline that applies software engineering principles to infrastructure and operations. Google pioneered the concept in 2003, and the role was formalized through the company's SRE book, published in 2016. Think about it this way: traditional operations teams spend their time manually maintaining systems.

The core philosophy is simple: "if a human is doing it, we should automate it." SREs write code to manage infrastructure, monitor systems, and respond to incidents automatically.

Expected Salaries (2025)

USA: $100K-$165K
Europe: €80K-€150K
India: ₹12L-₹28L
UK: €80K-€150K

Key Terms You Should Know

SLI (Service Level Indicator)

A measurement of service behavior. For example: "the percentage of requests that complete in under 200ms" or "the percentage of requests that succeed." SLIs are the raw numbers you measure.
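
The two example SLIs above can be computed from raw request records. The following is a minimal sketch, assuming each request is recorded with a latency and a success flag; the `Request` type, the function names, and the sample data are illustrative, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    succeeded: bool    # True if the response was not an error


def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Percentage of requests that completed in under threshold_ms."""
    if not requests:
        return 100.0  # assumption: no traffic counts as meeting the target
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


def availability_sli(requests: list[Request]) -> float:
    """Percentage of requests that succeeded."""
    if not requests:
        return 100.0  # same assumption as above
    ok = sum(1 for r in requests if r.succeeded)
    return 100.0 * ok / len(requests)


# Hypothetical sample: three of four requests are fast, three of four succeed.
sample = [
    Request(120.0, True),
    Request(180.0, True),
    Request(250.0, True),
    Request(90.0, False),
]
print(latency_sli(sample))       # 75.0
print(availability_sli(sample))  # 75.0
```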

SLO (Service Level Objective)

A target value for an SLI. For example: "99.9% of requests will complete in under 200ms." SLOs define what "reliable enough" means for your service. They're internal goals, not contracts.

SLA (Service Level Agreement)

A legal contract with customers about reliability, usually with financial penalties. SLAs should always be looser than SLOs—if your SLO is 99.9%, your SLA might be 99.5% to give you buffer.

Error Budget

The allowed amount of unreliability. If your SLO is 99.9% uptime, you're allowed 0.1% downtime per month (~43 minutes). This is your "budget" to spend on risky deployments, experiments, or maintenance. When you run out, you freeze changes and focus on reliability.
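
The "~43 minutes" figure comes from simple arithmetic: 0.1% of a 30-day month. Here is a small sketch of that calculation, assuming a 30-day window and an availability SLO expressed as a percentage; the function names are illustrative.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime in the window for an availability SLO."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1.0 - downtime_minutes / budget


print(round(error_budget_minutes(99.9), 1))    # 43.2
print(round(budget_remaining(99.9, 21.6), 2))  # 0.5
```

A negative result from `budget_remaining` is the signal to freeze changes and focus on reliability.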

Toil

Manual, repetitive work that doesn't provide lasting value. Examples: manually restarting servers, copying data between systems, responding to the same alerts repeatedly. SREs aim to eliminate toil through automation.

Observability

The ability to understand what's happening inside your systems from external outputs. The three pillars are: metrics (numbers over time), logs (event records), and traces (request paths through distributed systems).

Postmortem

A document written after an incident to understand what happened, why it happened, and how to prevent it in the future. Postmortems should be blameless—focused on systems and processes, not individuals.

On-Call

Being the designated person who responds to production incidents outside normal working hours. SREs typically rotate on-call duty. Good on-call culture means clear escalation paths, runbooks, and reasonable alert volume.

SRE vs DevOps: What's the Difference?

DevOps is a culture and set of practices that emphasizes collaboration between development and operations teams. It's about breaking down silos, automating deployments, and shipping faster.

SRE is a specific job role that implements DevOps principles with a particular focus on reliability. Ben Treynor Sloss, who founded Google's SRE team, famously described it this way: "SRE is what happens when you ask a software engineer to design an operations team."

The Complete Learning Path

Follow these steps in order. Each builds on the previous. All resources are 100% free.

Master Programming (Python or Go)

Duration: 6-8 weeks — Foundation level

Why this matters: SREs write code constantly—automation scripts, monitoring tools, internal services. You need to be comfortable programming, not just scripting.

Recommendation: Start with Python, then learn Go later. Python's ecosystem for infrastructure automation is unmatched.

Python: More common, easier to learn, excellent for automation and scripting. Most SRE tooling has Python libraries.
Go: Used heavily at Google, and the language Kubernetes and Docker are written in. Faster and better for building tools, but harder to learn.

Key concepts to master:

Variables, functions, data structures (lists, dicts, sets)
File I/O and text processing (crucial for log analysis)
HTTP requests and API interactions
Error handling and exceptions
Working with JSON and YAML
Unit testing basics
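
Several of these concepts show up together in even a tiny log-analysis script. The sketch below is one possible illustration, assuming logs are written as one JSON object per line with a `status` field; that log format and the function names are assumptions made for the example.

```python
import json
from collections import Counter


def count_status_codes(lines):
    """Count HTTP status codes in JSON-formatted log lines.

    Accepts any iterable of strings, so an open file object works too.
    Lines that are not valid JSON, or that lack a "status" field, are
    tallied under the key "malformed" instead of crashing the script.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
            counts[record["status"]] += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            counts["malformed"] += 1
    return counts


def test_count_status_codes():
    """A basic unit test: known input, asserted output."""
    lines = [
        '{"status": 200, "path": "/"}',
        '{"status": 500, "path": "/api"}',
        '{"status": 200, "path": "/health"}',
        "not json at all",
        '{"path": "/missing-status"}',
    ]
    counts = count_status_codes(lines)
    assert counts[200] == 2
    assert counts[500] == 1
    assert counts["malformed"] == 2


test_count_status_codes()
print("ok")
```

To run it against a real file, pass the file object directly, for example `with open("app.log") as f: count_status_codes(f)`, where `app.log` is a hypothetical path. A test runner such as pytest would pick up the `test_` function without the explicit call.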

Python, Go, Scripting, API interaction, Unit testing

Free Resources

CS50's Introduction to Python (Harvard): Rigorous introduction, free certificate available
Scientific Computing with Python (freeCodeCamp): 300 hours, certificate included
A Tour of Go (official Go tutorial): Interactive, learn Go basics

Master Linux & Networking

Duration: 6-8 weeks — Core infrastructure

Why this matters: SREs operate in Linux environments. You need deep knowledge of how systems work—not just how to use commands, but understanding processes, memory, disk I/O, networking, and troubleshooting.

Linux essentials:

Command line fluency (bash, shell scripting)
Process management (ps, top, htop, systemd)
File systems and disk management (df, du, lsblk, mount)
User permissions and security
Package management (apt, yum, dpkg)
Log analysis (journalctl, grep, awk, sed)

Networking fundamentals:

TCP/IP, UDP, HTTP/HTTPS protocols
DNS (how domain names resolve to IPs)
Load balancing and reverse proxies
Firewalls and security groups
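
A common first troubleshooting step that ties these topics together is checking whether a TCP port is reachable at all. The following is a rough sketch using only Python's standard `socket` module; `port_is_open` is an illustrative helper, and the demo uses a throwaway listener on the loopback interface so it does not depend on any external host.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


# Demo: open a temporary listener on localhost, probe it, then close it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

print(port_is_open("127.0.0.1", demo_port))  # expected True while listening
listener.close()
print(port_is_open("127.0.0.1", demo_port))  # expected False once closed
```

A `False` result does not say why the port is unreachable: the service may be down, a firewall or security group may be dropping packets, or the hostname may not resolve. Distinguishing those cases is where the tools listed above come in.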

Linux administration, Bash scripting, TCP/IP, DNS, Troubleshooting

Free Resources

Linux Journey: Interactive, comprehensive, perfect for beginners
Cisco Networking Essentials (Cisco NetAcad): Industry-standard networking fundamentals

Learn Observability (Metrics, Logs, Traces)

Duration: 5-6 weeks — Core SRE skill

Why this matters: You can't fix what you can't observe. Observability is how SREs understand what's happening in production systems. The three pillars are metrics, logs, and traces.

Metrics: Numbers that change over time. Examples: CPU usage, request latency, error rates. Tools: Prometheus (collection) + Grafana (visualization).

Logs: Event records from applications. Tell you what happened and when. Tools: Loki, Elasticsearch/ELK Stack, Splunk (enterprise).

Traces: Request paths through distributed systems. Show how a single request moves from service to service.

Setting up Prometheus to scrape metrics
Creating useful Grafana dashboards
Writing PromQL queries to analyze metrics
Configuring alerting rules
Understanding the RED method (Rate, Errors, Duration) and USE method (Utilization, Saturation, Errors)
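
To make the RED method concrete, here is a minimal sketch that summarizes Rate, Errors, and Duration for one window of request samples. In practice these numbers would usually come from a metrics system such as Prometheus; the `Sample` type, the `red_summary` function, and the nearest-rank percentile are assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    duration_ms: float  # how long the request took
    is_error: bool      # whether the request failed


def red_summary(samples: list[Sample], window_seconds: float) -> dict:
    """Summarize Rate, Errors, and Duration for one time window."""
    total = len(samples)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    errors = sum(1 for s in samples if s.is_error)
    durations = sorted(s.duration_ms for s in samples)
    # Nearest-rank p95: the smallest value with at least 95% of samples
    # at or below it. Integer math avoids floating-point rounding surprises.
    rank = (95 * total + 99) // 100  # ceil(0.95 * total)
    return {
        "rate_per_s": total / window_seconds,  # Rate
        "error_ratio": errors / total,         # Errors
        "p95_ms": durations[rank - 1],         # Duration
    }


# Hypothetical two-second window with four requests, one of which failed.
window = [
    Sample(100.0, False),
    Sample(200.0, False),
    Sample(300.0, True),
    Sample(400.0, False),
]
print(red_summary(window, window_seconds=2.0))
# {'rate_per_s': 2.0, 'error_ratio': 0.25, 'p95_ms': 400.0}
```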

Prometheus, Grafana, PromQL, Alerting, Distributed tracing

Free Resources

Prometheus Getting Started (official documentation): Hands-on tutorial
Grafana Tutorials (official): Dashboard creation and visualization

Master Containers & Kubernetes

Duration: 6-8 weeks — Industry standard

Why this matters: Many production workloads run on Kubernetes. As an SRE, you'll be responsible for the reliability of K8s clusters and the applications running on them.

Start with Docker:

What containers are (isolated environments with their own filesystem)
Building images with Dockerfiles
Running containers, port mapping, volumes
Multi-stage builds and image optimization

Then Kubernetes:

Core concepts: Pods, Deployments, Services, ConfigMaps, Secrets
Scheduling and resource management
Networking (Services, Ingress, Network Policies)
Storage (PersistentVolumes, StorageClasses)
Debugging with kubectl (logs, exec, describe)
Helm for package management
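
To see how a few of the Kubernetes concepts fit together, it can help to build a Deployment manifest by hand. The sketch below generates one as a plain Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The function name, the image reference, and the resource values are hypothetical examples, not recommendations.

```python
import json


def deployment_manifest(name: str, image: str, replicas: int = 2,
                        port: int = 8080) -> dict:
    """Build a minimal apps/v1 Deployment manifest as a dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the Pod template's labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                        # Requests inform scheduling; limits cap usage.
                        "resources": {
                            "requests": {"cpu": "100m", "memory": "128Mi"},
                            "limits": {"memory": "256Mi"},
                        },
                    }],
                },
            },
        },
    }


# Hypothetical service name and image reference.
manifest = deployment_manifest("web", "registry.example.com/web:1.0.0")
print(json.dumps(manifest, indent=2))
```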

Docker, Kubernetes, kubectl, Helm, Container networking

Free Resources

Kubernetes Basics (official K8s docs): Interactive tutorials
CKA Certification (CNCF): Industry-standard credential for K8s admins (the exam itself is paid)
Killercoda Interactive Labs: Free K8s playground for practicing kubectl commands

Understand SRE Practices

Duration: 4-6 weeks — Core theory

Why this matters: This is what makes SRE different from generic DevOps. These practices are how Google and other top companies manage reliability at massive scale.

Choosing the right SLIs for your service (availability, latency, throughput)
Setting realistic SLOs (hint: 100% is never the goal)
Writing SLO-based alerts (alerting on user-facing symptoms) instead of cause-based alerts
Calculating error budgets from SLOs
Using error budgets to balance reliability vs. velocity
Error budget policies (what happens when you run out)
Identifying toil in your work
Automation strategies
Google's 50% rule (SREs should spend <50% time on operational work)
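
One way to connect SLOs, error budgets, and alerting is the burn rate: how fast the budget is being consumed relative to the pace that would exactly exhaust it over the SLO window. The sketch below is an illustration under stated assumptions: the function names are made up, and the default threshold of 14.4 is the example figure from the Site Reliability Workbook (2% of a 30-day budget spent in one hour), which should be tuned per service.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget lasts exactly the SLO window; 2.0 means it
    would be gone in half the window.
    """
    allowed = 1.0 - slo
    if allowed <= 0:
        raise ValueError("slo must be below 1.0 (100% is never the goal)")
    return error_ratio / allowed


def should_page(short_window_ratio: float, long_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window burn too fast.

    Requiring both windows filters out brief blips (long window stays
    low) and stale incidents (short window has already recovered).
    """
    return (burn_rate(short_window_ratio, slo) >= threshold
            and burn_rate(long_window_ratio, slo) >= threshold)


# With a 99.9% SLO, a 0.2% error ratio burns budget at twice the planned pace.
print(round(burn_rate(0.002, 0.999), 2))  # 2.0
print(should_page(0.02, 0.016, 0.999))    # True: both windows burn fast
print(should_page(0.02, 0.001, 0.999))    # False: long window is healthy
```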

SLIs/SLOs, Error budgets, Toil reduction, Capacity planning

Free Resources

Google SRE Books: Free online. The definitive SRE resource and a must-read
Site Reliability Engineering: Measuring and Managing Reliability (Google Cloud on Coursera): Free to audit

Learn Incident Management

Duration: 3-4 weeks — Real-world skills

Why this matters: SREs are responsible for keeping systems running. When things break (and they will), you need to respond quickly and effectively. Good incident management separates great SREs from average ones.

On-call fundamentals:

Setting up on-call rotations (PagerDuty, Opsgenie)
Creating runbooks for common incidents
Escalation procedures
Healthy on-call practices (sustainable workload, compensation)

Incident response:

Incident command system (roles: Incident Commander, Communications Lead, etc.)
Triage and prioritization
Communication during incidents
Post-incident cleanup

Postmortems:

Writing effective postmortem documents
Focusing on systems, not people
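
Tools like PagerDuty and Opsgenie manage rotations for you, but the underlying idea is simple enough to sketch. The example below builds a weekly round-robin schedule with a primary and a secondary (the first escalation target). The function name, the engineer names, and the one-week shift length are assumptions for illustration; this does not use any vendor API.

```python
from datetime import date, timedelta


def build_rotation(engineers: list[str], start: date, weeks: int) -> list[dict]:
    """Assign a primary and a secondary on-call engineer for each week.

    The secondary is the next person in the list, so every page has a
    clear escalation path and nobody is primary two weeks in a row.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for week in range(weeks):
        shift_start = start + timedelta(weeks=week)
        schedule.append({
            "start": shift_start,
            "end": shift_start + timedelta(days=6),
            "primary": engineers[week % len(engineers)],
            "secondary": engineers[(week + 1) % len(engineers)],
        })
    return schedule


# Hypothetical three-person team, starting on Monday 2025-01-06.
for shift in build_rotation(["ana", "bo", "chen"], date(2025, 1, 6), weeks=4):
    print(shift["start"], shift["end"], shift["primary"], shift["secondary"])
```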

Incident response, On-call, Postmortems, Runbooks

Free Resources

PagerDuty Incident Response Guide: Free, comprehensive incident management documentation
Atlassian Incident Management Handbook: Free guide to practical incident response

Tips for Success

Read the Google SRE books. They're free online and are the bible of SRE. Start with "Site Reliability Engineering" then read "The Site Reliability Workbook" for practical exercises.
Get comfortable with chaos. SRE is about embracing that things will break. Practice chaos engineering—intentionally break things in controlled ways to find weaknesses.
Automate your own workflow first. Before you automate production systems, automate your daily tasks. This builds the habit and skills.
Practice writing postmortems. Even for personal projects, write postmortems when things go wrong. The skill of analyzing failures is invaluable.
Join the SRE community. The SRE subreddit, SREcon talks (free on YouTube), and various Slack communities are great for learning from experienced practitioners.

Site Reliability Engineer (SRE) Roadmap 2025
