Three Pillars of Observability
- Metrics — Numeric measurements over time (CPU, memory, request rate, error rate)
- Logs — Timestamped text records of discrete events
- Traces — End-to-end request flow across microservices
Prometheus
Prometheus is the leading open-source monitoring and alerting toolkit (a graduated CNCF project). It uses a pull model: the server scrapes metrics over HTTP from configured targets at a regular interval.
```yaml
# prometheus.yml — Prometheus server configuration
global:
  scrape_interval: 15s      # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate rules

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'webapp'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['webapp:3000']

  # Discover only pods annotated prometheus.io/scrape: "true"
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```
PromQL — Query Language
```promql
# Instant vector — current value
up{job="webapp"}

# Rate — per-second rate over 5 minutes
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage by container
sum(rate(container_cpu_usage_seconds_total[5m])) by (container)

# Memory usage percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100
```
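Expensive or frequently dashboarded expressions like these can be precomputed with recording rules, which store the result as a new time series. A minimal sketch, reusing the `http_requests_total` metric above (the file would be listed under `rule_files` in `prometheus.yml`):

```yaml
# recording_rules.yml — precompute the per-job request rate
groups:
  - name: webapp_recording
    rules:
      # Naming convention: level:metric:operations
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
```

Dashboards and alerts can then query `job:http_requests:rate5m` directly instead of re-evaluating the `rate()` on every refresh.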
Grafana
```yaml
# docker-compose.yml — Prometheus + Grafana + node-exporter
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

volumes:
  prometheus_data:
  grafana_data:
```
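Rather than adding the data source by hand in the UI, Grafana can provision it from a file at startup. A sketch assuming the Compose service names above, mounted under `/etc/grafana/provisioning/datasources/`:

```yaml
# datasource.yml — auto-provision Prometheus as the default data source
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```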
Alerting
```yaml
# alert_rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 5%)"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value }}s"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash-looping"
```
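Once fired, these alerts are routed by Alertmanager (the `alertmanager:9093` target in `prometheus.yml`). A minimal `alertmanager.yml` sketch — the receiver names and webhook URL are placeholders, not values from this setup:

```yaml
# alertmanager.yml — send critical alerts to a separate receiver
route:
  receiver: 'default'
  group_by: ['alertname']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: 'oncall'

receivers:
  - name: 'default'
  - name: 'oncall'
    webhook_configs:
      - url: 'http://example.internal/hook'  # placeholder endpoint
```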
ELK Stack (Elasticsearch, Logstash, Kibana)
```yaml
# docker-compose.yml — single-node ELK stack
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"

  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  es_data:
```
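The Compose file mounts a `./logstash.conf` pipeline that is not shown. A minimal sketch — the Beats input and the Apache `grok` pattern are assumptions about the log source, not part of the original setup:

```conf
# logstash.conf — receive, parse, and index logs
input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```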
The Four Golden Signals
From Google's SRE book — the four signals every service should monitor:
- Latency — Time to service a request (distinguish success vs error latency)
- Traffic — Demand on your system (requests/sec, sessions, reads/writes)
- Errors — Rate of failed requests (explicit errors + implicit like slow responses)
- Saturation — How "full" your service is (CPU, memory, disk, queue depth)
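Each signal maps naturally onto PromQL. A sketch reusing the metric names from earlier sections (the `status` label on the histogram is an assumption; the node metrics come from node-exporter):

```promql
# Latency — P95 request duration
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Traffic — requests per second
sum(rate(http_requests_total[5m]))

# Errors — fraction of 5xx responses
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Saturation — CPU busy fraction per node
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)
```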
Distributed Tracing
Tracing follows a single request across multiple microservices, showing the latency contributed at each step (spans linked by a shared trace ID).
- Jaeger — Open-source distributed tracing (CNCF)
- Zipkin — Distributed tracing system
- OpenTelemetry — Vendor-neutral observability framework (recommended)
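With OpenTelemetry, services export spans to a collector, which fans them out to a backend such as Jaeger. A minimal collector config sketch — OTLP in, OTLP out; the `jaeger:4317` endpoint is an illustrative hostname, not part of this setup:

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # Jaeger's OTLP gRPC port
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```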
SRE Practices
- SLIs (Service Level Indicators) — Measurable metrics (e.g., latency, availability)
- SLOs (Service Level Objectives) — Targets for SLIs (e.g., 99.9% availability)
- SLAs (Service Level Agreements) — Contractual obligations with consequences
- Error budgets — Allowed downtime based on SLO (e.g., 99.9% = ~8.76 hours/year)
- Incident management — On-call rotations, runbooks, postmortems
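Error budgets can be enforced with burn-rate alerts. A sketch for a 99.9% SLO over a 30-day window, reusing the `http_requests_total` metric from earlier; a 14.4x burn rate means roughly 2% of the monthly budget is consumed in a single hour:

```yaml
# Fast-burn alert for a 99.9% availability SLO (0.1% error budget)
- alert: ErrorBudgetFastBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
```

In practice, multi-window variants (e.g., pairing a 1h and a 6h window) reduce false positives from short traffic spikes.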