SRE & Observability
Hope is not a strategy. Learn how to define SLIs/SLOs, implement the 'Three Pillars of Observability', and manage incident response like a seasoned Site Reliability Engineer.
You Will Build
Implement the complete 'Three Pillars of Observability' (Logs, Metrics, Traces) for a provided black-box microservice architecture and successfully debug a live injected failure.
The 5-Week Syllabus
An intense, week-by-week breakdown designed to push your limits.
SRE Foundations: SLIs, SLOs, SLAs
Translating business requirements into engineering metrics.
Core Topics
- Error Budgets
- Defining the Four Golden Signals (latency, traffic, errors, saturation)
- SRE vs DevOps
Hands-on Lab
Draft a comprehensive SLO document for a critical payment microservice.
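To give a feel for how an SLO turns into an error budget, here is a minimal sketch in Python. The function name `error_budget`, the dictionary keys, and the request counts are illustrative assumptions, not part of the course material; the arithmetic itself is just "allowed failures = (1 - SLO target) x total requests".

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize SLI and error-budget consumption for one SLO window.

    slo_target is a fraction, e.g. 0.999 for a 99.9% availability SLO.
    All names here are hypothetical and chosen for illustration only.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        # No traffic in the window: nothing failed, so no budget was spent.
        return {"sli": 1.0, "allowed_failures": 0.0, "budget_remaining": 1.0}

    # The budget is the share of requests the SLO permits to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    # The measured SLI is the share of requests that succeeded.
    sli = 1.0 - failed_requests / total_requests
    # Fraction of the budget still unspent; negative means the SLO is blown.
    budget_remaining = 1.0 - failed_requests / allowed_failures
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
    }


if __name__ == "__main__":
    # Assumed example: a payment service with a 99.9% SLO over one window.
    print(error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250))
```

A negative `budget_remaining` is the usual signal to slow feature releases and spend engineering time on reliability instead.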
Metrics & Time Series Data
Scraping and visualizing system health.
Core Topics
- Prometheus Architecture
- PromQL Deep Dive
- Grafana Dashboards
Hands-on Lab
Deploy Prometheus and build a Grafana dashboard driven by PromQL queries.
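Much of PromQL revolves around turning ever-increasing counters into per-second rates, as in an expression like `rate(http_requests_total[5m])` (the metric name is an assumed example). The sketch below imitates the core idea in plain Python: sum the increases between scraped samples, treat any drop as a counter reset, and divide by elapsed time. It is deliberately simplified and omits the window-boundary extrapolation that Prometheus itself applies; `simple_rate` is a hypothetical helper name.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: Sequence[Sample]) -> Optional[float]:
    """Approximate per-second rate of a counter over the given samples.

    Samples must be sorted by timestamp. Returns None when there are
    fewer than two samples or no time has elapsed.
    """
    if len(samples) < 2:
        return None

    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went down, so the process likely restarted and the
            # counter began again from zero: count the new value in full.
            increase += value
        else:
            increase += value - previous
        previous = value

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


if __name__ == "__main__":
    # Assumed scrape data: 15-second interval with one restart at t=30.
    scraped = [(0, 100), (15, 130), (30, 10), (45, 40)]
    print(simple_rate(scraped))
```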
Centralized Logging
Finding the needle in the haystack of text logs.
Core Topics
- Structured vs Unstructured Logs
- Fluentd / Logstash
- Elasticsearch / Loki queries
Hands-on Lab
Configure a centralized logging pipeline aggregating logs from 10 different containers.
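The difference between structured and unstructured logs is easiest to see in code. The following sketch uses only Python's standard `logging` and `json` modules to emit one JSON object per log line, which is the shape that shippers such as Fluentd or Logstash can parse without custom regexes. The `JsonFormatter` class, the `build_logger` helper, and the `fields` convention for extra attributes are assumptions made for this example.

```python
import io
import json
import logging
import sys
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}}
        # and those key/value pairs are merged into the JSON event.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str, stream: TextIO) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("payments", sys.stdout)
    # Searchable fields instead of values buried in a free-text message.
    log.info("charge failed", extra={"fields": {"order_id": "A-17", "status": 502}})
```

With events in this form, a query for `status` equal to 502 becomes a field match rather than a text search.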
Distributed Tracing & APM
Following a request across 20 microservices.
Core Topics
- OpenTelemetry Standards
- Jaeger
- Context Propagation
Hands-on Lab
Instrument a legacy Python application with OpenTelemetry to map database bottlenecks.
Alerting & Incident Management
Waking up the right person, at the right time, with the right info.
Core Topics
- Alertmanager / PagerDuty
- Reducing Alert Fatigue
- Blameless Post-Mortems
Hands-on Lab
Configure intelligent alerts that route to Slack, and write a blameless post-mortem for a simulated outage.
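One common way to reduce alert fatigue is to page on error-budget burn rate rather than on raw error counts, and to require both a long and a short window to agree before waking anyone. The sketch below shows that decision in Python. The function names are hypothetical, and the default threshold of 14.4 is an assumption: it corresponds to spending 2% of a 30-day budget within one hour, and should be tuned to your own SLO window.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 would spend the whole error budget precisely at the
    end of the SLO window; higher values exhaust it sooner.
    """
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default; see the note above
) -> bool:
    """Page only if both windows burn budget faster than the threshold.

    The long window (e.g. 1 hour) shows the problem is significant; the
    short window (e.g. 5 minutes) shows it is still happening, so the
    alert stops firing soon after recovery.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )


if __name__ == "__main__":
    # Assumed numbers: 99.9% SLO, 2% of requests failing in both windows.
    print(should_page(0.02, 0.02, slo_target=0.999))
    # Same long-window errors, but the short window has already recovered.
    print(should_page(0.02, 0.001, slo_target=0.999))
```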
Expert Facilitator
Trained at Google, Marcus wrote the playbook for incident response at his current startup. He teaches ruthless prioritization of reliability.
Student Perks
- PagerDuty simulation trial
- Post-mortem templates
- 1-on-1 Dashboard Review