10. Site Reliability Engineering (SRE).

Posted December 26, 2021

Updated December 11, 2025

Author: Rajkumar Aute

SRE Principles.

Site Reliability Engineering (SRE) is a discipline created by Google that combines software engineering with operations to build highly reliable, scalable, and efficient systems.

SRE brings an engineering mindset, automation, and data-driven decisions to operations.

Its goal is to keep systems fast, reliable, and available without slowing down development.

1. Reduce toil

Toil is repetitive, manual operations work that creates no long-term value.

Ex: –

Manual deployments.
Manual server setups.
Repeated ticket handling.
Manually restarting services.

Why it matters: –

Manual work leads to mistakes.
Developers lose time to operational tasks.
Toil grows with the system, which limits scaling.

How SRE reduces toil: – Automate or eliminate as much of it as possible.
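As one illustration of automating a toil item from the list above (manually restarting services), here is a minimal sketch of a health-check-and-restart loop. The function name `ensure_healthy` and the two callables are hypothetical; in a real setup they would wrap whatever health endpoint and service manager the team actually uses.

```python
from typing import Callable


def ensure_healthy(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 3,
) -> int:
    """Restart a service until it reports healthy or attempts run out.

    Returns the number of restarts performed. Raises RuntimeError if the
    service is still unhealthy after max_restarts restarts, so a human is
    paged only when automation could not fix the problem.
    """
    restarts = 0
    while not is_healthy():
        if restarts >= max_restarts:
            raise RuntimeError(f"still unhealthy after {restarts} restarts")
        restart()
        restarts += 1
    return restarts


if __name__ == "__main__":
    # Simulated service (an assumption for the demo): healthy after two restarts.
    state = {"restarts": 0}

    def fake_is_healthy() -> bool:
        return state["restarts"] >= 2

    def fake_restart() -> None:
        state["restarts"] += 1

    print("restarts needed:", ensure_healthy(fake_is_healthy, fake_restart))
```

Capping the number of restarts is a deliberate choice in this sketch: unbounded automatic restarts can hide a real fault instead of surfacing it.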

—

2. Reliability as a feature

Reliability is not optional; it is treated as a core feature of the product.

Why it matters: –

Customers expect services to be always available.
Unreliable systems harm business reputation.
Reliability builds trust.

Ex: – A team may delay new features when reliability drops below the agreed level.

—

3. Error budgets

Error budget = the amount of failure an SLO allows within a given period, that is, 100% minus the SLO target.

Ex: – If SLO = 99.9% uptime, Error Budget = 0.1% allowed downtime, which is about 43 minutes in a 30-day window.
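The arithmetic behind an availability error budget can be sketched in a few lines. The function name and the 30-day default window are assumptions chosen for illustration; teams pick their own measurement window.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows per window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the SLO leaves over: 100% minus the target.
    error_budget_fraction = 1 - slo_percent / 100
    return window_minutes * error_budget_fraction


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(slo)
        print(f"SLO {slo}% allows {minutes:.1f} minutes of downtime per 30 days")
```

Each extra "nine" in the target cuts the budget by a factor of ten, which is why the SLO should reflect what users actually need rather than the highest number available.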

Why error budgets matter: –

They balance release speed against reliability.
If the error budget is consumed, teams slow down releases and focus on stability.
If error budget remains, teams can release faster.

Error budgets give SRE and development teams a shared, objective basis for release decisions.
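A request-based version of this policy might look like the sketch below. The function names and the 25% "caution" threshold are hypothetical; the actual thresholds and the actions tied to them are something each team agrees on in its own error budget policy.

```python
def error_budget_remaining(
    slo_percent: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be strictly between 0 and 100")
    allowed_failures = total_requests * (1 - slo_percent / 100)
    return 1 - failed_requests / allowed_failures


def release_decision(remaining: float, caution_below: float = 0.25) -> str:
    """Map the remaining budget to a release posture (thresholds are illustrative)."""
    if remaining <= 0:
        return "freeze releases"
    if remaining < caution_below:
        return "slow down releases"
    return "release normally"


if __name__ == "__main__":
    # 1,000,000 requests at a 99.9% SLO allow roughly 1,000 failures.
    remaining = error_budget_remaining(99.9, 1_000_000, 400)
    print(f"budget remaining: {remaining:.0%} -> {release_decision(remaining)}")
```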

—

4. SLOs / SLIs

SLO (Service Level Objective): –

A target value for an SLI, measured over a stated period.
Ex: – “Availability should be 99.9% over 30 days.”

SLI (Service Level Indicator): –

A quantitative measure of how the service behaves for its users.
Ex: – latency, error rate, availability.

Why they matter: –

Help define reliability goals.
Provide clear expectations.
Remove guesswork and conflict.
Guide engineering decisions.

SLOs help teams know when they can move fast and when they need to stabilize.
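To make the relationship concrete, the sketch below computes two SLIs from a batch of request records and checks them against SLO targets. The `Request` type, the choice to count only 5xx responses as failures, and the nearest-rank percentile method are all assumptions for illustration; real systems usually derive SLIs from their monitoring stack.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: Sequence[Request]) -> float:
    """Percentage of requests that did not fail with a server error (5xx)."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.status < 500)
    return 100 * good / len(requests)


def latency_sli(requests: Sequence[Request], percentile: float) -> float:
    """Latency at the given percentile, using the nearest-rank method."""
    if not requests:
        raise ValueError("need at least one request")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(
    requests: Sequence[Request],
    availability_target: float,
    p95_latency_target_ms: float,
) -> bool:
    """True only when every SLI is within its SLO target."""
    return (
        availability_sli(requests) >= availability_target
        and latency_sli(requests, 95) <= p95_latency_target_ms
    )


if __name__ == "__main__":
    # Nine successful requests and one server error, latencies 10..100 ms.
    sample = [Request(200, 10.0 * (i + 1)) for i in range(9)] + [Request(503, 100.0)]
    print("availability SLI:", availability_sli(sample))
    print("p95 latency SLI:", latency_sli(sample, 95))
    print("meets SLO:", meets_slo(sample, 99.9, 300))
```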

—

5. Automate everything

SRE teams automate as many processes as possible: deployments, scaling, monitoring, and remediation.

Why it matters: –

Reduces manual errors.
Saves engineer time.
Makes systems predictable.
Helps with scaling.
Enables reliable, repeatable processes.

Automation is a core SRE principle.
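Scaling is one of the processes listed above that lends itself to automation. The sketch below uses a simple proportional rule, similar in shape to the formula documented for the Kubernetes Horizontal Pod Autoscaler; the function name, parameters, and replica limits are assumptions for illustration, not a real autoscaler API.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Scale the replica count in proportion to how far load is from target."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp so a metric spike or a bad reading cannot scale without bound.
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    print(desired_replicas(4, 90, 60))   # load above target: scale out
    print(desired_replicas(4, 30, 60))   # load below target: scale in
    print(desired_replicas(4, 300, 60))  # capped at max_replicas
```

Encoding the decision as a pure function keeps it predictable and easy to test, which is much of what automation buys over an engineer adjusting capacity by hand.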

—

6. Blameless Postmortems

After an incident or outage, the team analyses what happened without blaming any individual.

Why it matters: –

Encourages honesty.
Enables learning from failures.
Builds trust.
Reduces fear.
Prevents repeating mistakes.

Blameless culture is essential for reliability improvement.

—

7. Observability

Observability is the ability to understand what is happening inside a system from the data it emits: logs, metrics, and traces, usually brought together in dashboards.

Why it matters: –

Helps diagnose issues quickly.
Reduces MTTR (mean time to recovery).
Helps identify performance bottlenecks.
Supports proactive improvements.

Observability is the backbone of modern SRE.
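As a rough illustration of how one operation can feed logs, metrics, and traces at once, the sketch below wraps a block of work in a context manager. Everything here is an assumption for demonstration: the `observed` helper, the in-memory `METRICS` dictionary, and the JSON log shape stand in for a real logging pipeline, metrics backend, and tracing library.

```python
import json
import time
import uuid
from contextlib import contextmanager
from typing import Iterator, Optional

# In-memory stand-in for a metrics backend (hypothetical).
METRICS = {"requests_total": 0, "errors_total": 0, "latency_ms": []}


@contextmanager
def observed(operation: str, trace_id: Optional[str] = None) -> Iterator[str]:
    """Record a metric and emit a structured log line for one operation."""
    trace_id = trace_id or uuid.uuid4().hex  # correlates logs across services
    start = time.perf_counter()
    status = "ok"
    try:
        yield trace_id
    except Exception:
        status = "error"
        raise  # observe the failure, but never swallow it
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        METRICS["requests_total"] += 1
        if status == "error":
            METRICS["errors_total"] += 1
        METRICS["latency_ms"].append(duration_ms)
        print(json.dumps({
            "operation": operation,
            "trace_id": trace_id,
            "status": status,
            "duration_ms": round(duration_ms, 3),
        }))


if __name__ == "__main__":
    with observed("load_profile"):
        time.sleep(0.01)  # placeholder for real work
    print("requests:", METRICS["requests_total"], "errors:", METRICS["errors_total"])
```

Sharing one `trace_id` between the log line and any downstream calls is what lets an engineer follow a single request through the system during an incident.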
