---
name: promql-alerting
description: "PromQL & alerting — query patterns, aggregations, recording rules, alert design, SLO-based alerting, and Grafana dashboard queries."
risk: low
source: custom
date_added: '2026-03-20'
---
# PromQL & Alerting
Expert guide to PromQL queries, alert rules, and Grafana dashboard design.
## Use this skill when
- Writing PromQL queries for dashboards and alerts
- Designing alert rules with proper thresholds and severity
- Implementing SLO-based monitoring and error budgets
- Building Grafana dashboards with effective visualizations
## Do not use this skill when
- Setting up Prometheus infrastructure (see prometheus-configuration)
- Working with non-Prometheus monitoring systems
---
## PromQL Fundamentals
### Selectors and Matchers
```promql
# Exact match
http_requests_total{method="GET", status="200"}
# Regex match
http_requests_total{method=~"GET|POST"}
# Negative match
http_requests_total{status!="200"}
# Negative regex
http_requests_total{status!~"2.."}
```
### Rate and Increase
```promql
# Per-second rate over 5 minutes (for counters)
rate(http_requests_total[5m])
# Increase over time period (total count increase)
increase(http_requests_total[1h])
# Use irate for high-resolution, volatile metrics
irate(http_requests_total[5m])
```
### Aggregations
```promql
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Sum by specific labels
sum by (method, status) (rate(http_requests_total[5m]))
# Average, min, max
avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
max by (job) (up)
# Top 5 by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))
# Count of series
count by (status) (http_requests_total)
```
## Common Dashboard Queries
### Request Rate
```promql
# Total request rate
sum(rate(http_requests_total[5m]))
# Request rate by endpoint
sum by (handler) (rate(http_requests_total[5m]))
# Request rate by status code class
sum by (status_class) (
  label_replace(
    rate(http_requests_total[5m]),
    "status_class", "${1}xx", "status", "(.).*"
  )
)
```
### Error Rate
```promql
# Error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
# Error rate by service
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))
* 100
```
### Latency (Histograms)
```promql
# P50 latency
histogram_quantile(0.50,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# P95 latency
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# P99 latency by endpoint
histogram_quantile(0.99,
  sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m]))
)
# Average latency
sum(rate(http_request_duration_seconds_sum[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
```
### Resource Utilization
```promql
# CPU utilization percentage
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory utilization percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk utilization percentage
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# Network throughput
rate(node_network_receive_bytes_total{device="eth0"}[5m]) * 8 # bits/sec
```
### Container Metrics
```promql
# Container CPU usage (percentage of requested)
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
/
sum by (pod) (kube_pod_container_resource_requests{resource="cpu"})
* 100
# Container memory usage (percentage of limit)
sum by (pod) (container_memory_working_set_bytes)
/
sum by (pod) (kube_pod_container_resource_limits{resource="memory"})
* 100
# Container restart count
sum by (pod) (increase(kube_pod_container_status_restarts_total[1h]))
```
## Alert Rules
### Availability Alerts
```yaml
groups:
  - name: availability
    rules:
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is down"
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} P95 latency is {{ $value | humanizeDuration }}"
```
### Resource Alerts
```yaml
- name: resources
  rules:
    - alert: HighCPU
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[10m])) * 100) > 85
      for: 10m
      labels:
        severity: warning
    - alert: HighMemory
      expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
      for: 5m
      labels:
        severity: critical
    - alert: DiskWillFillIn24h
      expr: |
        predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: "Disk on {{ $labels.instance }} predicted to fill within 24 hours"
    - alert: HighDiskUsage
      expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 90
      for: 5m
      labels:
        severity: critical
```
## SLO-Based Alerting
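The rules below implement multi-window, multi-burn-rate alerting as described in the Google SRE Workbook. For a 30-day SLO window, the burn-rate multiplier is the budget fraction consumed times (SLO window / alert window): burning 2% of the budget in 1 hour gives 0.02 × 720h / 1h = 14.4, and burning 5% in 6 hours gives 0.05 × 720h / 6h = 6. Multiplying those factors by the 0.1% error budget yields the thresholds used in the expressions.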
```yaml
# Multi-window multi-burn-rate alerting
# Based on Google SRE practices
- name: slo
  rules:
    # SLO: 99.9% availability (error budget: 0.1%)
    # Fast burn (2% of budget in 1 hour)
    - alert: SLOErrorBudgetFastBurn
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
        ) > (14.4 * 0.001)
        and
        (
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
        ) > (14.4 * 0.001)
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "Error budget burning fast — 2% consumed in 1h at current rate"
    # Slow burn (5% of budget in 6 hours)
    - alert: SLOErrorBudgetSlowBurn
      expr: |
        (
          sum(rate(http_requests_total{status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total[6h]))
        ) > (6 * 0.001)
        and
        (
          sum(rate(http_requests_total{status=~"5.."}[30m]))
          /
          sum(rate(http_requests_total[30m]))
        ) > (6 * 0.001)
      for: 15m
      labels:
        severity: warning
```
## Recording Rules (Pre-Compute Expensive Queries)
```yaml
groups:
  - name: request_metrics
    interval: 15s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      - record: job:http_error_ratio:rate5m
        expr: job:http_errors:rate5m / job:http_requests:rate5m
      - record: job:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
      - record: job:http_latency:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```
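Recorded series can then be referenced directly in alert expressions and dashboard panels, keeping rule evaluation cheap. A minimal sketch, assuming the recording rules above are loaded, that restates the earlier error-rate alert in terms of the pre-computed ratio (note it is aggregated by `job` here rather than `service`):
```yaml
- name: availability
  rules:
    - alert: HighErrorRate
      # Same 5% threshold as before, evaluated against the cheap pre-computed ratio
      expr: job:http_error_ratio:rate5m > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.job }} error rate is {{ $value | humanizePercentage }}"
```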
## Alert Design Best Practices
1. **Page on symptoms, not causes** — Alert on "error rate > 5%" not "CPU > 80%".
2. **Use `for` duration** — Avoid flapping. `for: 5m` means the condition must hold continuously for 5 minutes before the alert fires.
3. **Set meaningful severity** — `critical` = pages someone, `warning` = investigate next business day.
4. **Include runbook links** — Add `runbook_url` annotation with troubleshooting steps.
5. **Use recording rules** — Pre-compute expensive queries to reduce Prometheus load.
6. **Test alert rules** — Use `promtool test rules` with unit tests for alerts (see the example below).
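As an illustration of point 6, a minimal `promtool test rules` unit test for the `ServiceDown` alert defined above might look like the following; the file names, job, and instance values are placeholders:
```yaml
# service_down_test.yml (run with: promtool test rules service_down_test.yml)
rule_files:
  - alerts.yml            # file containing the ServiceDown rule
evaluation_interval: 1m
tests:
  - interval: 1m
    # up drops to 0 at t=2m and stays down
    input_series:
      - series: 'up{job="api", instance="10.0.0.1:9100"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      # With for: 2m, the alert should be firing by t=5m
      - eval_time: 5m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: api
              instance: "10.0.0.1:9100"
            exp_annotations:
              summary: "api on 10.0.0.1:9100 is down"
```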
## Common Pitfalls
1. **Using `irate` for alerting** — `irate` is too volatile. Use `rate` for stable alerting.
2. **Missing label in `by` clause** — Forgetting `le` label in histogram aggregation breaks quantile calculation.
3. **Alerting on individual instances** — Alert on service-level aggregates, not individual containers.
4. **No `for` duration** — Without it, brief spikes trigger alerts.
5. **Too many alerts** — Alert fatigue leads to ignored alerts. Each alert should be actionable.
6. **Fixed thresholds** — Use `predict_linear()` for capacity planning alerts instead of fixed percentages.