
SRE & Observability

Hope is not a strategy. Learn how to define SLIs/SLOs, implement the 'Three Pillars of Observability', and manage incident response like a seasoned Site Reliability Engineer.

5 Weeks (Intensive)
Advanced
Live Online + Async Labs
Lead Instructor
Marcus Thorne
Ex-SRE Lead, Google

Mastered Technologies

Prometheus, Grafana, OpenTelemetry, Elasticsearch, PagerDuty

You Will Build

Capstone Project

Implement the complete 'Three Pillars of Observability' (Logs, Metrics, Traces) for a provided black-box microservice architecture and successfully debug a live injected failure.

The 5-Week Syllabus

An intense, week-by-week breakdown designed to push your limits.

Week 1

SRE Foundations: SLIs, SLOs, SLAs

Translating business requirements into engineering metrics.

Core Topics

  • Error Budgets
  • Defining golden signals
  • SRE vs DevOps

Hands-on Lab

Draft a comprehensive SLO document for a critical payment microservice.
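
To give a feel for the error-budget arithmetic behind an SLO document, here is a minimal Python sketch. It is not course material: the function names and the 30-day window are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent for a request-based SLO.

    A negative result means the budget is overspent.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures
```

For example, a 99.9% target over 30 days leaves roughly 43.2 minutes of budget, and 250 failures out of 1,000,000 requests would spend about a quarter of it.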

Week 2

Metrics & Time Series Data

Scraping and visualizing system health.

Core Topics

  • Prometheus Architecture
  • PromQL Deep Dive
  • Grafana Dashboards

Hands-on Lab

Deploy Prometheus and create a complex Grafana dashboard utilizing PromQL.
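
As an illustration of what a per-second rate over a counter means, the sketch below computes one in plain Python from raw samples. It is a simplification and not how Prometheus is implemented: it handles counter resets but skips the window-boundary extrapolation that PromQL's rate() performs. The function name simple_rate is hypothetical.

```python
from typing import List, Optional, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Per-second increase of a monotonically increasing counter.

    samples: (timestamp_seconds, counter_value) pairs sorted by time.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None

    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset (e.g. process restart): the counter began again
            # from zero, so the whole new value counts as increase.
            increase += value
        else:
            increase += value - previous
        previous = value

    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

Feeding it samples scraped every 15 seconds shows why a restart in the middle of the window does not produce a negative rate.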

Week 3

Centralized Logging

Finding the needle in the haystack of text logs.

Core Topics

  • Structured vs Unstructured Logs
  • Fluentd / Logstash
  • Elasticsearch / Loki queries

Hands-on Lab

Configure a centralized logging pipeline aggregating logs from 10 different containers.
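
The difference between structured and unstructured logs is easiest to see in code. The sketch below uses only Python's standard logging module to emit one JSON object per line, which shippers such as Fluentd or Logstash can parse without custom regexes. The class name JsonFormatter, the helper build_logger, and the "context" key are assumptions made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Merge caller-supplied fields passed via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as build_logger("payments", sys.stdout).info("payment declined", extra={"context": {"order_id": "A-1001"}}) then yields a line whose order_id field can be queried directly instead of being grepped out of free text.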

Week 4

Distributed Tracing & APM

Following a request across 20 microservices.

Core Topics

  • OpenTelemetry Standards
  • Jaeger
  • Context Propagation

Hands-on Lab

Instrument a legacy Python application with OpenTelemetry to map database bottlenecks.
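
Context propagation comes down to passing a trace identifier from one service to the next. The sketch below hand-rolls the W3C traceparent header format (version, 32-hex trace ID, 16-hex parent span ID, flags) using only the standard library. It is not the OpenTelemetry API, which provides its own propagators; the function names are hypothetical and the validation is deliberately simplified.

```python
import re
import secrets
from typing import Dict, Optional

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_trace_id() -> str:
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(headers: Dict[str, str], trace_id: str, span_id: str,
           sampled: bool = True) -> Dict[str, str]:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers: Dict[str, str]) -> Optional[Dict[str, object]]:
    """Read trace context from incoming headers, or None if absent/invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in the W3C format.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }
```

The calling service runs inject before making a request, and the receiving service runs extract, then starts its own span with the same trace_id and the caller's span as parent. That shared trace_id is what lets a tracing backend such as Jaeger stitch the hops together.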

Week 5

Alerting & Incident Management

Waking up the right person, at the right time, with the right info.

Core Topics

  • Alertmanager / PagerDuty
  • Reducing Alert Fatigue
  • Blameless Post-Mortems

Hands-on Lab

Configure intelligent alerts that route to Slack, and write a blameless post-mortem for a simulated outage.
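
One widely used way to reduce alert fatigue is to page on error-budget burn rate rather than on raw error counts. The sketch below is a minimal illustration, not the course's alerting config: the function names are hypothetical, and the 14.4 default threshold is an assumption that corresponds to spending 2% of a 30-day budget within one hour.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the whole budget exactly over the SLO window.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both a long and a short window exceed the threshold.

    The long window (e.g. 1 hour) shows the problem is significant; the
    short window (e.g. 5 minutes) shows it is still happening right now.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

Requiring both windows means a spike that has already recovered does not wake anyone up, while a sustained failure still pages quickly.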

Expert Facilitator

Marcus Thorne
Ex-SRE Lead, Google

Trained at Google, Marcus wrote the playbook for incident response at his current startup. He teaches ruthless prioritization of reliability.

Student Perks

  • PagerDuty simulation trial
  • Post-mortem templates
  • 1-on-1 Dashboard Review