ENGINEERING

SRE & Observability

Hope is not a strategy. Learn how to define SLIs/SLOs, implement the 'Three Pillars of Observability', and manage incident response like a seasoned Site Reliability Engineer.

6 Weeks (Intensive)

Advanced

Live Online

Lead Instructor

Marcus Thorne

Ex-SRE Lead, Google

Tech Stack Mastery

Prometheus, Grafana, OpenTelemetry, Elasticsearch, PagerDuty

Capstone Milestone

Implement the complete 'Three Pillars of Observability' (Logs, Metrics, Traces) for a provided black-box microservice architecture and successfully debug a live injected failure.

The 6-Week Syllabus

An intense, week-by-week breakdown designed for deep technical mastery.

Module 01

SRE Foundations: SLIs, SLOs, SLAs

Translating business requirements into engineering metrics.

Core Knowledge Units

Error Budgets
Defining Golden Signals
SRE vs DevOps

Practical Component

Draft a comprehensive SLO document for a critical payment microservice.
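
To make the error budget idea concrete, here is a minimal sketch of the arithmetic an SLO document usually rests on. The function names, the 30-day window, and the 99.9% target are illustrative assumptions, not values taken from the course material.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of unavailability allowed by an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent.

    A negative result means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = total_requests * (1.0 - slo_target)
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical payment service with a 99.9% availability target.
    print(f"{error_budget_minutes(0.999):.1f} minutes of budget per 30 days")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the budget remaining")
```

A request-based budget like `budget_remaining` tends to suit a payment service better than a time-based one, because it weights failures by how many users were actually affected.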

Module 02

Metrics & Time Series Data

Scraping and visualizing system health.

Core Knowledge Units

Prometheus Architecture
PromQL Deep Dive
Grafana Dashboards

Practical Component

Deploy Prometheus and build a multi-panel Grafana dashboard driven by PromQL queries.
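
As a rough illustration of what a PromQL expression such as `rate(http_requests_total[1m])` computes, the sketch below derives a per-second rate from raw counter samples, including the handling of counter resets. It is a simplification under stated assumptions: Prometheus's own `rate()` also extrapolates to the edges of the query window, which this version does not attempt. The `Sample` type and the sample values are hypothetical.

```python
from typing import Sequence, Tuple

# (unix timestamp in seconds, counter value) -- a hypothetical sample shape
Sample = Tuple[float, float]


def per_second_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase of a monotonic counter across the samples.

    If a value drops below its predecessor, the counter is assumed to have
    reset to zero (for example after a process restart), so the new value
    itself is counted as the increase for that step.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


if __name__ == "__main__":
    # Scraped every 15 s; the counter resets between t=30 and t=45.
    scraped = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
    print(per_second_rate(scraped))
```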

Module 03

Centralized Logging

Finding the needle in the haystack of text logs.

Core Knowledge Units

Structured vs Unstructured Logs
Fluentd / Logstash
Elasticsearch / Loki queries

Practical Component

Configure a centralized logging pipeline aggregating logs from 10 different containers.
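
The difference between structured and unstructured logs is easiest to see in code. Below is a minimal sketch using only Python's standard `logging` module to emit one JSON object per line, which is the shape that shippers such as Fluentd or Logstash can parse without custom patterns. The `JsonFormatter` class, the `context` key, and the field names are assumptions chosen for illustration.

```python
import io
import json
import logging
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        # Optional per-call fields, passed as extra={"context": {...}}.
        context = getattr(record, "context", None)
        payload = dict(context) if isinstance(context, dict) else {}
        # Core fields are written last so context cannot overwrite them.
        payload.update(
            ts=self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            level=record.levelname,
            logger=record.name,
            message=record.getMessage(),
        )
        return json.dumps(payload)


def build_logger(name: str, stream: Optional[TextIO] = None) -> logging.Logger:
    """Return a logger that writes JSON lines to stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def demo() -> dict:
    """Log one event to an in-memory buffer and parse it back."""
    buffer = io.StringIO()
    log = build_logger("payments", stream=buffer)
    log.info("charge failed", extra={"context": {"order_id": "A-1001", "status": 502}})
    return json.loads(buffer.getvalue())


if __name__ == "__main__":
    print(demo())
```

Because every field is a key rather than a fragment of free text, a query such as "all events with status 502 for this order" becomes a filter instead of a regular expression.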

Module 04

Distributed Tracing & APM

Following a request across 20 microservices.

Core Knowledge Units

OpenTelemetry Standards
Jaeger
Context Propagation

Practical Component

Instrument a legacy Python application with OpenTelemetry to map database bottlenecks.
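
Context propagation is what ties spans from different services into one trace. The sketch below hand-rolls the W3C Trace Context `traceparent` header (version `00`) to show the mechanism: a service reads the incoming header, keeps the trace ID, and mints a new span ID for its outgoing calls. In practice the OpenTelemetry SDK's propagators do this work; the `SpanContext` type and helper names here are hypothetical.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional


class SpanContext(NamedTuple):
    trace_id: str  # 32 lowercase hex characters
    span_id: str   # 16 lowercase hex characters
    sampled: bool


# version "00" - trace-id - parent-id - trace-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse an incoming traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are explicitly invalid in the specification.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def outgoing_headers(parent: Optional[SpanContext]) -> Dict[str, str]:
    """Build headers for a downstream call, continuing the parent's trace.

    With no valid parent, a new trace is started (sampling it is an
    assumption made for this sketch).
    """
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True
    span_id = secrets.token_hex(8)  # fresh ID for this service's span
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(outgoing_headers(parse_traceparent(incoming)))
```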

Module 05

Alerting & Incident Management

Waking up the right person, at the right time, with the right info.

Core Knowledge Units

Alertmanager / PagerDuty
Reducing Alert Fatigue
Blameless Post-Mortems

Practical Component

Configure intelligent alerts that route to Slack, and write a blameless post-mortem for a simulated outage.
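
One common way to reduce alert fatigue is to page on error budget burn rate rather than on raw error counts. The sketch below shows the idea: page only when both a long and a short window are burning budget faster than a threshold, so that brief blips and already-recovered incidents do not wake anyone. The function names are hypothetical, and the default threshold (2% of a 30-day budget spent within one hour) is an assumption borrowed from widely cited SRE guidance, not a value stated in this syllabus.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the whole budget precisely at the end of the
    SLO window; higher values exhaust it sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def threshold_for(budget_fraction: float, slo_window_hours: float,
                  alert_window_hours: float) -> float:
    """Burn rate that spends budget_fraction of the budget in the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Page only if both windows exceed the burn-rate threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening right now.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # Assumed policy: 2% of a 30-day (720 h) budget consumed within 1 hour.
    limit = threshold_for(0.02, 720, 1)
    # Ongoing incident: 2% of requests failing in both windows.
    print(should_page(0.02, 0.02, slo_target=0.999, threshold=limit))
    # Recovered incident: the short window is healthy again.
    print(should_page(0.02, 0.0005, slo_target=0.999, threshold=limit))
```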

KML Consulting

Kairon-AI
