Commit c27d953

Merge pull request #167 from projectsyn/bestpractice/runbook
Add best practices for writing Prometheus Alerts
2 parents 463248f + b3abb32 commit c27d953

2 files changed

+133
-0
lines changed

docs/modules/ROOT/nav.adoc

Lines changed: 1 addition & 0 deletions
@@ -96,6 +96,7 @@ include::steward:ROOT:partial$nav-reference.adoc[]
9696
** xref:explanations/commodore-components/helm-charts.adoc[Using Helm charts]
9797
** xref:explanations/commodore-components/parameters-logic.adoc[Conditionals in the parameters hierarchy]
9898
** xref:explanations/commodore-components/crds.adoc[Custom Resource Definitions]
99+
** xref:explanations/commodore-components/alerts.adoc[Writing Prometheus Alert Rules]
99100
* xref:explanations/commodore-packages.adoc[Commodore Packages Best Practices]
100101
* xref:explanations/jsonnet.adoc[Jsonnet Best Practices]
101102
* xref:explanations/component_template_sync.adoc[Keep components in sync]
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
1+
= Providing Prometheus Alert Rules
2+
3+
When writing a component that manages a critical piece of infrastructure, you should provide alerts that notify the operator if it fails.
4+
Writing good alerts and runbooks is difficult.
5+
This document collects some best practices that have worked for us so far.
6+
7+
== Writing Alert Rules
8+
9+
In nearly all cases you can provide Prometheus alert rules through the https://prometheus-operator.dev/docs/operator/api/#monitoring.coreos.com/v1.PrometheusRule[PrometheusRule CRD].
10+
This definition is then picked up by the responsible monitoring component.
11+
12+
For OpenShift clusters, this generally means labeling the namespace with `openshift.io/cluster-monitoring: 'true'`; for clusters with Rancher monitoring, it means labeling the namespace with `SYNMonitoring: 'main'`.
13+
14+
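As a minimal sketch of the namespace labeling described above, a namespace on an OpenShift cluster could look like the following.
The namespace name `syn-example` is a placeholder and not taken from any real component.

```yaml
# Hypothetical namespace for a component; only the label key and value
# come from the text above, the name is an assumption.
apiVersion: v1
kind: Namespace
metadata:
  name: syn-example
  labels:
    # Lets OpenShift cluster monitoring pick up rules in this namespace.
    openshift.io/cluster-monitoring: 'true'
    # On clusters with Rancher monitoring, the label would instead be:
    # SYNMonitoring: 'main'
```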
15+
16+
* *Alerts need to be actionable*
17+
+
18+
Try to imagine what you would do if you received this alert.
19+
If the answer is "I don't know" or "wait and see if it resolves itself", you probably shouldn't emit this alert.
20+
21+
* *Label your alert*
22+
+
23+
Label your alerts, so that they can be routed effectively.
24+
At the very least, add the labels `syn: 'true'` and `syn_component: 'COMPONENT_NAME'` to indicate that the alert is managed by a Syn component, plus a `severity` label.
25+
26+
* *Assess severity*
27+
+
28+
How critical is this alert?
29+
We generally differentiate three severity levels.
30+
+
31+
`info` for alerts that don't need urgent intervention.
32+
These are things that someone should look into, but that can usually wait a few days.
33+
Info alerts could also often just be part of a dashboard.
34+
+
35+
`warning` for alerts that should be looked at as soon as possible, but that can usually wait until regular office hours.
36+
+
37+
`critical` for alerts that need immediate attention, even outside office hours.
38+
+
39+
Carefully decide in which category your alert should be and add the appropriate `severity` label.
40+
But keep in mind that if all alerts are critical, none of them are.
41+
42+
* *Make alerts tunable*
43+
+
44+
You most likely won't be able to write a perfect alert out of the box.
45+
It will either be too noisy, not sensitive enough, or in some other way not relevant for the user.
46+
With that in mind, give the user a way to tune your alert.
47+
+
48+
At the very least provide ways to selectively enable or disable individual alerts.
49+
It's considered best practice to let the user override any part of the alert specification if they wish.
50+
However, it's a good idea to also provide more convenient parameters for settings that often need to be adapted, such as alert labels or alert-specific parameters like a list of relevant namespaces.
51+
+
52+
Try to imagine what a user might need to change and make tuning it as easy as possible.
53+
54+
* *Provide a runbook*
55+
+
56+
You should always provide a link to a runbook in an annotation `runbook_url`.
57+
See the section below on writing good runbooks.
58+
59+
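The guidelines above can be sketched as a single rule in a `PrometheusRule` object.
This is only an illustration: the alert name, namespace, expression, `for` duration, and runbook URL are all made up, and a real alert needs an expression that matches the metrics your component actually exposes.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-component-alerts   # hypothetical name
  namespace: syn-example           # hypothetical namespace
spec:
  groups:
    - name: example-component.rules
      rules:
        - alert: ExampleComponentDown   # hypothetical alert
          # Placeholder expression: fires when the scrape target is down.
          expr: up{job="example-component"} == 0
          # Wait before firing so short restarts don't cause noise.
          for: 15m
          labels:
            # Labels used for routing, as recommended above.
            syn: 'true'
            syn_component: example-component
            severity: critical
          annotations:
            summary: Example component has been down for 15 minutes.
            # Hypothetical URL; point this at the rendered runbook page.
            runbook_url: https://docs.example.com/example-component/runbooks/ExampleComponentDown.html
```

The `severity` here is `critical` only for the sake of the example; pick the level by following the severity guideline above.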
60+
Following these guidelines, you should get a usable alert.
61+
There are still some pitfalls when writing Prometheus alerts, but there are also many guides to help you write them.
62+
You can look at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/[the official documentation] or check out how https://blog.cloudflare.com/monitoring-our-monitoring/[Cloudflare writes alert rules].
63+
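One possible way to follow the "Make alerts tunable" guideline is to expose the alerts in the component's default parameters.
The sketch below assumes a hypothetical component whose parameters live under `example_component`; every key under `alerts` is illustrative, and the component's Jsonnet code would have to implement the corresponding filtering and merging.

```yaml
# class/defaults.yml of a hypothetical component.
# All keys below are assumptions made for illustration.
parameters:
  example_component:
    namespace: syn-example
    alerts:
      # Namespaces the alert expressions should consider.
      namespaces:
        - syn-example
      # Labels added to every alert rule of this component.
      additionalLabels: {}
      rules:
        ExampleComponentDown:
          # Set to false to drop this alert entirely.
          enabled: true
          # Anything set here is merged over the default rule spec,
          # so users can change `for`, labels, or even `expr`.
          overrides:
            for: 15m
            labels:
              severity: critical
```

With a layout like this, a user could disable a noisy alert or lower its severity from the cluster configuration without having to fork the component.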
64+
[WARNING]
65+
====
66+
When installing third-party software, there are often upstream alerts.
67+
It's a good idea to reuse these alerts, but the best practices still apply.
68+
69+
Don't blindly include all upstream alerts.
70+
Check if they're actionable, add labels, make them tunable, and provide a runbook, even if you didn't write the alert yourself.
71+
====
72+
73+
== Writing Runbooks
74+
75+
Every alert rule should have a runbook.
76+
The runbook is the first place a user looks to get information on the alert and how to debug it.
77+
78+
It should tell the reader:
79+
80+
* *What does this alert mean?*
81+
+
82+
Tell the reader why they got the alert.
83+
What exactly doesn't work as it should?
84+
If possible, also tell the reader how the alert condition is measured and whether there might be false positives.
85+
* *What's the impact?*
86+
+
87+
Who and what is affected?
88+
How fast should the reader react?
89+
The alert labels should already give an impression of how critical the alert is, but try to be more explicit in the runbook.
90+
* *How do I diagnose this?*
91+
+
92+
Provide some input on how to debug this.
93+
Where might the reader get the relevant events or logs?
94+
How can they narrow down the possible root causes?
95+
* *How may I mitigate the issue?*
96+
+
97+
List some possible mitigation strategies or ways to resolve this alert for good.
98+
+
99+
NOTE: Ideally, you shouldn't alert on issues that could be fixed automatically.
100+
If you have one clear way to resolve this alert, check if you could resolve this automatically.
101+
* *How do I tune the alert?*
102+
+
103+
Maybe this alert wasn't actionable, or maybe the alert was raised far too late.
104+
Give the reader options to tune the alert to make it less noisy or more sensitive.
105+
106+
Whenever possible, provide code snippets and precise instructions.
107+
If the reader got a critical alert, they don't have the time or nerves to build the `jq` query they need right now or to find out exactly which controller is responsible for this CRD.
108+
109+
It's considered best practice to put all your runbooks at `docs/modules/ROOT/pages/runbooks/ALERTNAME.adoc`, but there might be good reasons to deviate from this.
110+
Just make sure to adjust the runbook links as necessary.
111+
112+
Finally, a runbook doesn't have to be perfect.
113+
Maybe you don't really know how this might fail or how to debug this, or maybe you simply don't have the resources right now to write a comprehensive runbook.
114+
Add one anyway.
115+
Any input can be valuable when debugging an alert, and at the very least there is now a basis to improve on when we learn more.
116+
117+
[IMPORTANT]
118+
====
119+
.Removing or Renaming Alert Rules
120+
121+
Sometimes alerts become obsolete.
122+
Maybe the system can now resolve the issue automatically, or the responsible part simply doesn't exist anymore.
123+
124+
However, you need to make sure that you *never* break a runbook link.
125+
There might be people using older releases of your component and their runbook links should still lead to valid runbooks.
126+
127+
* Don't remove runbook remarks when they become obsolete; instead, note that they're only relevant for older versions.
128+
* Don't remove runbooks, but simply remove them from the navigation.
129+
* If you rename an alert or move the runbook, use https://docs.antora.org/antora/latest/page/page-aliases/[page aliases] to keep old links valid.
130+
131+
If you follow these three rules, runbook links should always stay relevant.
132+
====
