= Providing Prometheus Alert Rules

When writing a component that manages a critical piece of infrastructure, you should provide alerts that notify the operator if it fails.
Writing good alerts and runbooks is difficult.
This document collects some best practices that have worked for us so far.

== Writing Alert Rules

In nearly all cases you can provide Prometheus alert rules through the https://prometheus-operator.dev/docs/operator/api/#monitoring.coreos.com/v1.PrometheusRule[PrometheusRule CRD].
This definition is then picked up by the responsible monitoring component.

For OpenShift clusters this generally means labeling the namespace with `openshift.io/cluster-monitoring: 'true'`, and for clusters with Rancher monitoring it means labeling it with `SYNMonitoring: 'main'`.
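
For example, the namespace holding your component's rules could be labeled as in the following minimal sketch; the namespace name is only a placeholder:

[source,yaml]
----
apiVersion: v1
kind: Namespace
metadata:
  name: syn-my-component  # placeholder namespace for the component
  labels:
    # Pick the label matching the cluster's monitoring stack:
    openshift.io/cluster-monitoring: 'true'  # OpenShift cluster monitoring
    # SYNMonitoring: 'main'                  # Rancher-based monitoring
----
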

The following best practices should help you write useful alert rules:

* *Alerts need to be actionable*
+
Try to imagine what you would do if you received this alert.
If the answer is "I don't know" or "wait and see if it resolves itself", you probably shouldn't emit this alert.

* *Label your alerts*
+
Label your alerts so that they can be routed effectively.
At the very least, add the labels `syn: 'true'` and `syn_component: 'COMPONENT_NAME'` to indicate that this alert is managed by the Syn component, and a `severity` label.
The example after this list shows these labels, together with a severity and a runbook link, on a complete rule.

* *Assess severity*
+
How critical is this alert?
We generally differentiate between three severity levels.
+
`info` for alerts that don't need urgent intervention.
These are things that someone should look into, but that can usually wait up to a few days.
Info alerts can often also just be part of a dashboard.
+
`warning` for alerts that should be looked at as soon as possible, but that can usually wait until regular office hours.
+
`critical` for alerts that need immediate attention, even outside office hours.
+
Carefully decide which category your alert falls into and add the appropriate `severity` label.
But keep in mind that if all alerts are critical, none of them are.

* *Make alerts tunable*
+
You most likely won't be able to write a perfect alert out of the box.
It will either be too noisy, not sensitive enough, or in some other way not relevant for the user.
With that in mind, give the user a way to tune your alert.
+
At the very least, provide ways to selectively enable or disable individual alerts.
It's considered best practice to let the user override the complete alert specification if they wish.
However, it's a good idea to also provide more convenient parameters for configuration that often needs to be adapted, such as alert labels or alert-specific parameters like a list of relevant namespaces.
One possible shape for such parameters is sketched below, after the admonition on upstream alerts.
+
Try to imagine what a user might need to change and make tuning it as easy as possible.

* *Provide a runbook*
+
You should always provide a link to a runbook in the `runbook_url` annotation.
See the section below on writing good runbooks.
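
Putting these points together, an alert rule shipped by a component could look roughly like the following sketch.
The alert name, expression, threshold, and runbook URL are placeholders, not recommendations:

[source,yaml]
----
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-component-alerts    # placeholder name
  namespace: syn-my-component  # placeholder namespace, labeled as shown above
spec:
  groups:
    - name: my-component.alerts
      rules:
        - alert: MyComponentDown  # placeholder alert name
          expr: up{job="my-component"} == 0
          for: 10m
          labels:
            syn: 'true'
            syn_component: my-component
            severity: critical
          annotations:
            summary: "my-component has been down for more than 10 minutes."
            runbook_url: https://example.com/my-component/runbooks/MyComponentDown.html
----
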

Following these guidelines, you should get a usable alert.
There are still some pitfalls when writing Prometheus alerts, but there are also many guides to help you write them.
You can look at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/[the official documentation] or check out how https://blog.cloudflare.com/monitoring-our-monitoring/[Cloudflare writes alert rules].

[WARNING]
====
When installing third-party software, there are often upstream alerts available.
It's a good idea to reuse these alerts, but the best practices above still apply.

Don't blindly include all upstream alerts.
Check if they're actionable, add labels, make them tunable, and provide a runbook, even if you didn't write the alert yourself.
====
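
To make alerts tunable (both your own and reused upstream ones), a component could expose parameters along the lines of the following sketch.
The parameter names and hierarchy keys here are made up for illustration and don't correspond to any existing component:

[source,yaml]
----
parameters:
  my_component:                  # hypothetical component parameter key
    monitored_namespaces:        # example of an alert-specific tuning parameter
      - syn-my-component
    alerts:
      MyComponentDown:
        enabled: true            # let users disable individual alerts
        rule:                    # let users override any part of the rule spec
          for: 30m
          labels:
            severity: warning
----

The component could then merge these overrides into the `PrometheusRule` objects it renders.
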

== Writing Runbooks

Every alert rule should have a runbook.
The runbook is the first place a user looks to get information on the alert and how to debug it.

It should tell the reader:

* *What does this alert mean?*
+
Tell the reader why they got the alert.
What exactly doesn't work as it should?
Maybe also tell the user how the alert was measured and whether there might be false positives.
* *What's the impact?*
+
Who and what's affected?
How fast should the reader react?
The alert labels should already give an impression of how critical the alert is, but try to be more explicit in the runbook.
* *How do I diagnose this?*
+
Provide some input on how to debug this.
Where might the reader get the relevant events or logs?
How can they narrow down the possible root causes?
* *How can I mitigate the issue?*
+
List some possible mitigation strategies or ways to resolve this alert for good.
+
NOTE: Ideally, you shouldn't alert on issues that could be fixed automatically.
If you have one clear way to resolve this alert, check whether you could resolve it automatically.
* *How do I tune the alert?*
+
Maybe this alert wasn't actionable, or maybe it was raised far too late.
Give the reader options to tune the alert to make it less noisy or more sensitive.

Whenever possible, try to provide code snippets and precise instructions.
If the reader got a critical alert, they don't have the time or nerves to build the `jq` query they need right now or to find out exactly which controller is responsible for this CRD.

It's considered best practice to put all your runbooks at `docs/modules/ROOT/pages/runbooks/ALERTNAME.adoc`, but there might be good reasons to deviate from this.
Just make sure to adjust the runbook links as necessary.
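
For example, if the runbook lives at `docs/modules/ROOT/pages/runbooks/MyComponentDown.adoc`, the annotation might look like the snippet below; the host and path depend on where your component documentation is published, so treat them as placeholders:

[source,yaml]
----
annotations:
  # Adjust the host and path to wherever the rendered Antora page ends up.
  runbook_url: https://example.com/my-component/runbooks/MyComponentDown.html
----
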

Finally, a runbook doesn't have to be perfect.
Maybe you don't really know how this might fail or how to debug it, or maybe you simply don't have the resources right now to write a comprehensive runbook.
Add one anyway.
Any input can be valuable when debugging an alert, and at the very least there is now a basis to improve on as we learn more.

[IMPORTANT]
.Removing or Renaming Alert Rules
====
Sometimes alerts become obsolete.
Maybe the system can now resolve the issue automatically, or the responsible part simply doesn't exist anymore.

However, you need to make sure that you *never* break a runbook link.
There might be people using older releases of your component, and their runbook links should still lead to valid runbooks.

* Don't remove remarks from a runbook when they become obsolete; instead, note that they're only relevant for older versions.
* Don't remove runbooks; simply remove them from the navigation instead.
* If you rename an alert or move its runbook, use https://docs.antora.org/antora/latest/page/page-aliases/[page aliases] to keep old links valid.

If you follow these three rules, runbook links should always stay valid.
====