Commit acc679e

amotl, hlcianfagna, and matriv authored
Monitoring: Guide about Prometheus and Grafana (#302)
* Admin: Pull tutorial about CrateDB monitoring with Prometheus and Grafana
* Refactor JMX and SQL Exporter details to separate pages
* Refactor Ubuntu setup example/walkthrough
* Refer to Grafana's documentation for installation
* Improve landing page
* Adjust cross linking
* Copy editing
* Wrap lines at 80 characters
* Implement suggestions by CodeRabbit

---------

Co-authored-by: Hernan Cianfagna <110453267+hlcianfagna@users.noreply.github.com>
Co-authored-by: Marios Trivyzas <5058131+matriv@users.noreply.github.com>
1 parent 098de2e commit acc679e

7 files changed: +465 -8 lines changed

docs/admin/monitoring/index.md

Lines changed: 11 additions & 6 deletions
@@ -4,8 +4,9 @@
44
# Monitoring and diagnostics
55

66
It is important to continuously monitor your CrateDB database cluster
7-
to detect anomalies and follow usage trends, so you can react to
8-
them properly and timely.
7+
to detect anomalies, so you can react to them promptly.
8+
Collecting statistics and following usage trends is also important
9+
for proper capacity planning.
910

1011
CrateDB provides system information about the cluster as a whole,
1112
individual cluster nodes, and about the entities and resources it manages.
@@ -72,12 +73,12 @@ and for ad hoc use. Below are a few popular and recommended options.
7273

7374
:Prometheus:
7475

75-
The [Crate JMX HTTP Exporter] is a Prometheus exporter that consumes
76+
The {ref}`Crate JMX HTTP Exporter <prometheus-jmx-exporter>` is a Prometheus exporter that consumes
7677
metrics information from CrateDB's JMX collectors and exposes them
7778
via HTTP so they can be scraped by Prometheus, and, for example,
7879
subsequently displayed in Grafana, or used for alerting via Alertmanager.
7980

80-
[Monitoring a CrateDB cluster with Prometheus and Grafana] illustrates
81+
{ref}`monitoring-prometheus-grafana` illustrates
8182
a full setup for making CrateDB-specific metrics available to Prometheus.
8283
The tutorial uses the _Crate JMX HTTP Exporter_ for exposing telemetry
8384
information, the _Prometheus SQL Exporter_ for conducting system table
@@ -104,5 +105,9 @@ and for ad hoc use. Below are a few popular and recommended options.
104105
real-time information about the cluster, its nodes, and their shards.
105106

106107

107-
[Crate JMX HTTP Exporter]: https://github.com/crate/jmx_exporter
108-
[Monitoring a CrateDB cluster with Prometheus and Grafana]: https://community.cratedb.com/t/monitoring-a-self-managed-cratedb-cluster-with-prometheus-and-grafana/1236
108+
:::{toctree}
109+
:hidden:
110+
Prometheus and Grafana <prometheus-grafana>
111+
prometheus-jmx-exporter
112+
prometheus-sql-exporter
113+
:::
Lines changed: 211 additions & 0 deletions
@@ -0,0 +1,211 @@
1+
(monitoring-prometheus-grafana)=
2+
# Monitoring a CrateDB cluster with Prometheus and Grafana
3+
4+
:::{div} sd-text-muted
5+
:::
6+
7+
:::{rubric} Introduction
8+
:::
9+
10+
We recommend[^standalone] pairing two standard observability tools:
11+
[Prometheus] to collect and store metrics,
12+
and [Grafana] to build dashboards.
13+
14+
This guide describes how to set up a Grafana dashboard that lets you
15+
view live and historical performance and capacity metrics for your
16+
CrateDB cluster. It uses instructions suitable for
17+
Debian or Ubuntu Linux, but can be adapted for other Linux distributions.
18+
19+
[^standalone]: {ref}`Containerized <install-container>` and [CrateDB Cloud] setups differ.
20+
This tutorial targets standalone and on‑premises installations.
21+
22+
:::{rubric} Overview
23+
:::
24+
25+
For a CrateDB environment, you are interested in CrateDB-specific metrics,
26+
such as the number of shards or number of failed queries, and OS metrics,
27+
such as available disk space, memory usage, or CPU usage.
28+
Based on Prometheus, the monitoring stack uses the following exporters
29+
to fulfill those requirements.
30+
31+
:Node Exporter:
32+
33+
Exposes a wide variety of hardware and kernel-related metrics.
34+
35+
:JMX Exporter:
36+
37+
Consumes metrics information from CrateDB's
38+
JMX collectors and exposes them via HTTP so they can be scraped by Prometheus.
39+
40+
:SQL Exporter:
41+
42+
Allows running arbitrary SQL
43+
statements against a CrateDB cluster to retrieve additional
44+
information from CrateDB's system tables.
45+
46+
## Set up CrateDB cluster
47+
48+
First, you need a CrateDB cluster.
49+
The {ref}`multi-node setup instructions <multi-node-setup-example>` provide
50+
a quick walkthrough for Ubuntu Linux.
51+
52+
## Set up Prometheus Exporters
53+
54+
The Node Exporter and the JMX Exporter need to be installed on all
55+
machines that are running CrateDB nodes.
56+
57+
1. Install the [Prometheus Node Exporter].
58+
```shell
59+
apt install prometheus-node-exporter
60+
```
61+
62+
2. Install the {ref}`prometheus-jmx-exporter`.
63+
64+
## Set up Prometheus
65+
66+
Prometheus typically runs on a machine that is not part of the
67+
CrateDB cluster.
68+
The {ref}`prometheus-sql-exporter` does not need to be installed
69+
on each machine either.
70+
71+
```shell
72+
apt install prometheus prometheus-sql-exporter --no-install-recommends
73+
```
74+
75+
For advanced configuration options, see {ref}`prometheus-auth` and
76+
{ref}`prometheus-storage`.
77+
78+
Now, configure Prometheus to scrape metrics from Node Exporters and
79+
JMX Exporters on all CrateDB nodes, and also metrics from the SQL
80+
Exporter.
81+
```shell
82+
nano /etc/prometheus/prometheus.yml
83+
```
84+
85+
:Node Exporter: Port 9100
86+
:JMX Exporter: Port 8080
87+
:SQL Exporter: Port 9237
88+
89+
```yaml
90+
- job_name: 'node'
91+
static_configs:
92+
- targets: ['ubuntuvm1:9100', 'ubuntuvm2:9100']
93+
94+
- job_name: 'cratedb_jmx'
95+
static_configs:
96+
- targets: ['ubuntuvm1:8080', 'ubuntuvm2:8080']
97+
98+
- job_name: 'sql_exporter'
99+
static_configs:
100+
- targets: ['localhost:9237']
101+
```
102+
103+
Restart the Prometheus daemon if it was already started.
104+
```shell
105+
systemctl restart prometheus
106+
```
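
Before relying on the new scrape jobs, it can help to confirm that each
exporter endpoint answers. The following is a minimal sketch and not part of
the original setup: the helper names `exporter_urls` and `check_exporters` are
hypothetical, and the `/metrics` path is assumed because the configuration
above does not override the Prometheus default scrape path.

```shell
# Print the metrics URL of each per-node exporter for the given hosts.
# Ports follow the list above: 9100 (Node Exporter), 8080 (JMX Exporter).
exporter_urls() {
    for host in "$@"; do
        for port in 9100 8080; do
            printf 'http://%s:%s/metrics\n' "$host" "$port"
        done
    done
}

# Probe each endpoint. With --fail, curl returns non-zero on HTTP errors.
check_exporters() {
    exporter_urls "$@" | while read -r url; do
        if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
            echo "OK   $url"
        else
            echo "FAIL $url"
        fi
    done
}
```

For example, `check_exporters ubuntuvm1 ubuntuvm2` prints one status line per
endpoint, using the placeholder host names from the configuration above.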
107+
108+
## Set up Grafana
109+
110+
Install Grafana on the same machine where you installed Prometheus.
111+
On a Debian or Ubuntu machine, run the following:
112+
```shell
113+
apt install --yes wget gpg
114+
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | tee /usr/share/keyrings/grafana.gpg >/dev/null
115+
echo "deb [signed-by=/usr/share/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | tee /etc/apt/sources.list.d/grafana.list
116+
apt update
117+
apt install --yes grafana
118+
```
119+
Then, start Grafana.
120+
```shell
121+
systemctl start grafana-server
122+
```
123+
For other systems, see the [Grafana installation documentation][grafana-debian].
124+
125+
:::{rubric} Data source
126+
:::
127+
128+
Navigate to `http://<grafana-host>:3000/` to access the Grafana login screen.
129+
The default credentials are `admin`/`admin`; change the password immediately.
130+
Navigate to "Add your first data source", then select "Prometheus" and set the
131+
URL to `http://<prometheus-host>:9090/`.
132+
If you configured basic authentication for Prometheus, this is where you
133+
would need to enter the credentials.
134+
Confirm using "Save & test".
135+
136+
:::{rubric} Dashboard
137+
:::
138+
139+
An example dashboard based on the discussed setup is available for easy importing
140+
from [Grafana » CrateDB Monitoring Dashboard].
141+
In your Grafana installation, on the left-hand side, hover over the “Dashboards”
142+
icon and select “Import”. Specify the dashboard ID **17174** and load the dashboard.
143+
On the next screen, finalize the setup by selecting the previously created
144+
Prometheus data source.
145+
146+
![CrateDB monitoring dashboard in Grafana|690x396](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/0e01a3f0b8fc61ae97250fdeb2fe741f34ac7422.png){width=690px}
147+
148+
## Alternative implementations
149+
150+
Build your own dashboard or use an entirely different monitoring approach while
151+
still covering metrics similar to those discussed in this article.
152+
The list below is a good starting point for troubleshooting most operational issues.
153+
154+
* CrateDB metrics (with example Prometheus queries based on the Crate JMX HTTP Exporter)
155+
* Thread pools rejected: `sum(rate(crate_threadpools{property="rejected"}[5m])) by (name)`
156+
* Thread pool queue size: `sum(crate_threadpools{property="queueSize"}) by (name)`
157+
* Thread pools active: `sum(crate_threadpools{property="active"}) by (name)`
158+
* Queries per second: `sum(rate(crate_query_total_count[5m])) by (query)`
159+
* Query error rate: `sum(rate(crate_query_failed_count[5m])) by (query)`
160+
* Average query duration over the last 5 minutes: `sum(rate(crate_query_sum_of_durations_millis[5m])) by (query) / sum(rate(crate_query_total_count[5m])) by (query)`
161+
* Circuit breaker memory in use: `sum(crate_circuitbreakers{property="used"}) by (name)`
162+
* Number of shards: `crate_node{name="shard_stats",property="total"}`
163+
* Garbage collector rates: `sum(rate(jvm_gc_collection_seconds_count[5m])) by (gc)`
164+
* Thread pool rejected operations: `crate_threadpools{property="rejected"}`
165+
* Operating system metrics
166+
* CPU utilization
167+
* Memory usage
168+
* Open file descriptors
169+
* Disk usage
170+
* Disk read/write operations and throughput
171+
* Received and transmitted network traffic
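
The average query duration expression in the list above divides two rates taken
over the same window, so the window length cancels out: the result is the
increase in total duration divided by the increase in query count. The sketch
below works through that arithmetic with a hypothetical helper
`avg_duration_ms` and made-up counter samples. It ignores counter resets, which
Prometheus' `rate()` handles for you.

```shell
# Average duration in milliseconds between two counter samples.
# Arguments: duration sum at start, duration sum at end,
#            query count at start, query count at end.
avg_duration_ms() {
    awk -v d0="$1" -v d1="$2" -v c0="$3" -v c1="$4" 'BEGIN {
        if (c1 == c0) { print "NaN"; exit }   # no queries in the window
        printf "%.2f\n", (d1 - d0) / (c1 - c0)
    }'
}

# Made-up samples: 30 queries took 3000 ms in total during the window.
avg_duration_ms 1000 4000 10 40
```

With these sample values, the helper reports an average of 100 ms per query.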
172+
173+
## Appendix
174+
175+
(prometheus-auth)=
176+
:::{rubric} Prometheus authentication
177+
:::
178+
179+
By default, Prometheus binds to port 9090 without authentication. Prevent
180+
auto-start during install (e.g., with `policy-rcd-declarative`), then
181+
configure web auth using a YAML file.
182+
183+
Create `/etc/prometheus/web.yml`:
184+
```yaml
185+
basic_auth_users:
186+
admin: <bcrypt hash>
187+
```
188+
189+
Point Prometheus at it (e.g., in `/etc/default/prometheus`):
190+
191+
```shell
192+
ARGS="--web.config.file=/etc/prometheus/web.yml --web.enable-lifecycle"
193+
```
194+
195+
Restart Prometheus after setting ownership and 0640 permissions on `web.yml`.
196+
197+
(prometheus-storage)=
198+
:::{rubric} CrateDB as Prometheus storage
199+
:::
200+
201+
For a large deployment where you also use Prometheus to monitor other systems,
202+
you may also want to use a CrateDB cluster as the storage for all Prometheus
203+
metrics. The {ref}`CrateDB Prometheus Adapter <prometheus>` achieves that.
204+
205+
206+
[CrateDB Cloud]: https://cratedb.com/products/cratedb-cloud
207+
[Grafana]: https://grafana.com/
208+
[grafana-debian]: https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/
209+
[Grafana » CrateDB Monitoring Dashboard]: https://grafana.com/grafana/dashboards/17174-cratedb-monitoring/
210+
[Prometheus]: https://prometheus.io/
211+
[Prometheus Node Exporter]: https://prometheus.io/docs/guides/node-exporter/
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
1+
(prometheus-jmx-exporter)=
2+
3+
# Prometheus JMX Exporter
4+
5+
The [Crate JMX HTTP Exporter] is a Prometheus exporter that consumes metrics
6+
information from CrateDB's JMX collectors and exposes them via HTTP so they can
7+
be scraped by Prometheus, and, for example, subsequently displayed in Grafana,
8+
or used for alerting via Alertmanager.
9+
10+
:::{rubric} Setup
11+
:::
12+
13+
Setup is simple. On each node, run the following:
14+
15+
```shell
16+
cd /usr/share/crate/lib
17+
wget https://repo1.maven.org/maven2/io/crate/crate-jmx-exporter/1.2.0/crate-jmx-exporter-1.2.0.jar
18+
nano /etc/default/crate
19+
```
20+
21+
Then, uncomment the `CRATE_JAVA_OPTS` line and change its value to:
22+
23+
```shell
24+
# Append to existing options (preserve other flags).
25+
CRATE_JAVA_OPTS="${CRATE_JAVA_OPTS:-} -javaagent:/usr/share/crate/lib/crate-jmx-exporter-1.2.0.jar=8080"
26+
```
27+
28+
Finally, restart the CrateDB daemon:
29+
30+
```bash
31+
systemctl restart crate
32+
```
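
After the restart, the agent should serve metrics on port 8080. The quick check
below is a sketch under two assumptions: the exporter is reachable on
`localhost`, and it answers on the default `/metrics` path. The helper name
`count_crate_metrics` is hypothetical.

```shell
# Count sample lines whose metric name starts with "crate_".
# Reads Prometheus text exposition format from standard input.
count_crate_metrics() {
    grep -c '^crate_' || true   # grep exits non-zero when the count is 0
}

# Usage on a CrateDB node (requires the running JMX exporter):
#   curl --silent http://localhost:8080/metrics | count_crate_metrics
```

A count of 0 suggests that the agent was not loaded, for example because the
`CRATE_JAVA_OPTS` line is still commented out.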
33+
34+
35+
[Crate JMX HTTP Exporter]: https://github.com/crate/jmx_exporter
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
1+
(prometheus-sql-exporter)=
2+
3+
# Prometheus SQL Exporter
4+
5+
The SQL Exporter allows running arbitrary SQL statements against a CrateDB
6+
cluster to retrieve additional information. As the cluster contains information
7+
from each node, we do not need to install the SQL Exporter on every node.
8+
Instead, we install it centrally on the same machine that also hosts Prometheus.
9+
10+
Note that setting up a Grafana data source that points directly at CrateDB
11+
only displays query output in real time, whereas Prometheus collects
12+
these values over time.
13+
14+
Installing the package is straightforward:
15+
16+
```shell
17+
apt install prometheus-sql-exporter
18+
```
19+
20+
For the SQL exporter to connect to the cluster, we need to create a new user
21+
`sql_exporter`. We grant the user read access to the `sys` schema. Run the
22+
commands below on any CrateDB node:
23+
24+
```shell
25+
curl -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' -d '{"stmt":"CREATE USER sql_exporter WITH (password = '\''insert_password'\'');"}'
26+
curl -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' -d '{"stmt":"GRANT DQL ON SCHEMA sys TO sql_exporter;"}'
27+
```
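
The nested quoting in the commands above is easy to get wrong. As a sketch, a
small hypothetical helper `sql_payload` can build the JSON body for single-line
statements. It only escapes backslashes and double quotes, so it is not a
general JSON encoder.

```shell
# Wrap a single-line SQL statement in the JSON body expected by /_sql.
sql_payload() {
    printf '{"stmt":"%s"}' "$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')"
}

# Usage: check that the new user can read from the sys schema.
# The password is the placeholder used above; whether credentials are
# enforced depends on the cluster's authentication settings.
#   curl -u 'sql_exporter:insert_password' \
#        -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' \
#        -d "$(sql_payload 'SELECT COUNT(*) FROM sys.shards')"
```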
28+
29+
We then create a configuration file at `/etc/prometheus-sql-exporter.yml` with
30+
sample queries, the first of which retrieves the number of shards per node:
31+
32+
```yaml
33+
jobs:
34+
- name: "global"
35+
interval: '5m'
36+
connections: ['postgres://sql_exporter:insert_password@ubuntuvm1:5433?sslmode=disable']
37+
queries:
38+
- name: "shard_distribution"
39+
help: "Number of shards per node"
40+
labels: ["node_name"]
41+
values: ["shards"]
42+
query: |
43+
SELECT node['name'] AS node_name, COUNT(*) AS shards
44+
FROM sys.shards
45+
GROUP BY 1;
46+
allow_zero_rows: true
47+
48+
- name: "heap_usage"
49+
help: "Used heap space per node"
50+
labels: ["node_name"]
51+
values: ["heap_used"]
52+
query: |
53+
SELECT name AS node_name, heap['used'] / heap['max']::DOUBLE AS heap_used
54+
FROM sys.nodes;
55+
56+
- name: "global_translog"
57+
help: "Global translog statistics"
58+
values: ["translog_uncommitted_size"]
59+
query: |
60+
SELECT COALESCE(SUM(translog_stats['uncommitted_size']), 0) AS translog_uncommitted_size
61+
FROM sys.shards;
62+
63+
- name: "checkpoints"
64+
help: "Maximum global/local checkpoint delta"
65+
values: ["max_checkpoint_delta"]
66+
query: |
67+
SELECT COALESCE(MAX(seq_no_stats['local_checkpoint'] - seq_no_stats['global_checkpoint']), 0) AS max_checkpoint_delta
68+
FROM sys.shards;
69+
70+
- name: "shard_allocation_issues"
71+
help: "Shard allocation issues"
72+
labels: ["shard_type"]
73+
values: ["shards"]
74+
query: |
75+
SELECT IF(s.primary = TRUE, 'primary', 'replica') AS shard_type, COALESCE(shards, 0) AS shards
76+
FROM UNNEST([true, false]) s(primary)
77+
LEFT JOIN (
78+
SELECT primary, COUNT(*) AS shards
79+
FROM sys.allocations
80+
WHERE current_state <> 'STARTED'
81+
GROUP BY 1
82+
) a ON s.primary = a.primary;
83+
```
84+
85+
*Please note: There exist two implementations of the SQL Exporter:
86+
[burningalchemist/sql_exporter](https://github.com/burningalchemist/sql_exporter)
87+
and [justwatchcom/sql_exporter](https://github.com/justwatchcom/sql_exporter).
88+
They don't share the same configuration options. Our example is based on the
89+
implementation that is shipped with the Ubuntu package, which is
90+
`justwatchcom/sql_exporter`.*
91+
92+
To apply the new configuration, we restart the service:
93+
94+
```shell
95+
systemctl restart prometheus-sql-exporter
96+
```
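
To see which metrics the exporter publishes after the restart, you can list
the distinct metric names it serves. This is a sketch with a hypothetical
helper `metric_names`. It assumes the exporter listens on port 9237, as in the
Prometheus configuration of the main guide, and answers on the default
`/metrics` path.

```shell
# List distinct metric names from Prometheus text exposition format.
# Drops comment and blank lines, then strips labels and values.
metric_names() {
    grep -v -e '^#' -e '^$' | sed -e 's/[{ ].*//' | sort -u
}

# Usage on the machine that runs the SQL Exporter:
#   curl --silent http://localhost:9237/metrics | metric_names
```

If none of the listed names relate to the configured queries, check the
exporter's service log for connection or permission errors.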
97+
98+
The SQL Exporter can also be used to monitor business metrics, but
99+
be careful with regularly running expensive queries. Below is a more advanced
100+
monitoring query for CrateDB that may be useful:
101+
102+
```sql
103+
/* Time since the last successful snapshot (backup) */
104+
SELECT (NOW() - MAX(started)) / 60000 AS MinutesSinceLastSuccessfulSnapshot
105+
FROM sys.snapshots
106+
WHERE "state" = 'SUCCESS';
107+
```
