Commit ba78343

hlcianfagna authored and amotl committed
Admin: Tutorial about CrateDB monitoring with Prometheus and Grafana
1 parent 8ed9a9a commit ba78343

4 files changed

+340
-0
lines changed

docs/admin/index.md

Lines changed: 1 addition & 0 deletions
@@ -32,6 +32,7 @@ Production and troubleshooting guidelines and system resource considerations.
3232
3333
clustering/index
3434
sharding-partitioning
35+
Monitoring <monitoring/prometheus-grafana>
3536
../performance/index
3637
```
3738
+++
Lines changed: 313 additions & 0 deletions
@@ -0,0 +1,313 @@
1+
(monitoring-prometheus-grafana)=
2+
# Monitoring a self-managed CrateDB cluster with Prometheus and Grafana
3+
4+
## Introduction
5+
6+
If you are running CrateDB in a production environment, you have probably wondered what would be the best way to monitor the servers to identify issues before they become problematic and to collect statistics that you can use for capacity planning.
7+
8+
We recommend pairing two well-known open-source solutions: [Prometheus](https://prometheus.io/), which collects and stores performance metrics, and [Grafana](https://grafana.com/), which builds dashboards on top of them.
9+
10+
For a CrateDB environment, we are interested in:
11+
* CrateDB-specific metrics, such as the number of shards or number of failed queries
12+
* and OS metrics, such as available disk space, memory usage, or CPU usage
13+
14+
For CrateDB-specific metrics, we recommend making these available to Prometheus by using the [Crate JMX HTTP Exporter](https://cratedb.com/docs/crate/reference/en/5.1/admin/monitoring.html#exposing-jmx-via-http) and the [Prometheus SQL Exporter](https://github.com/justwatchcom/sql_exporter). For OS metrics in Linux environments, we recommend using the [Prometheus Node Exporter](https://prometheus.io/docs/guides/node-exporter/).
15+
16+
Things are a bit different, of course, if you are using containers or the fully managed, cloud-hosted [CrateDB Cloud](https://cratedb.com/products/cratedb-cloud), but let’s see how this works on an on-premises installation by setting it all up together.
17+
18+
## First we need a CrateDB cluster
19+
20+
First things first: we need a CrateDB cluster. If you already have one, great. If you do not, we can get one up quickly.
21+
22+
You can review the installation documentation at {ref}`install` and {ref}`multi_node_setup`.
23+
24+
In my case, I am using Ubuntu. First, I SSH into the first machine and run:
25+
26+
```
27+
nano /etc/default/crate
28+
```
29+
30+
CrateDB reads this configuration file at startup. We only need one line here, to configure the heap size (this step is required, otherwise CrateDB will fail its bootstrap checks):
31+
32+
```
33+
CRATE_HEAP_SIZE=4G
34+
```
35+
36+
We also need to create another configuration file:
37+
38+
```shell
39+
mkdir /etc/crate
40+
nano /etc/crate/crate.yml
41+
```
42+
43+
In my case I used the following values:
44+
45+
```yaml
46+
network.host: _local_,_site_
47+
```
48+
49+
This tells CrateDB to respond to requests both from localhost and the local network.
50+
51+
```yaml
52+
discovery.seed_hosts:
53+
- ubuntuvm1:4300
54+
- ubuntuvm2:4300
55+
```
56+
57+
This lists all the machines that make up our cluster. Here I only have two, but for production use we recommend at least three nodes, so that a quorum can still be established during a network partition and split-brain scenarios are avoided.
58+
59+
```yaml
60+
cluster.initial_master_nodes:
61+
- ubuntuvm1
62+
- ubuntuvm2
63+
```
64+
65+
This lists the nodes that are eligible to act as master nodes during bootstrap.
66+
67+
```yaml
auth.host_based.enabled: true
auth:
  host_based:
    config:
      0:
        user: crate
        address: _local_
        method: trust
      99:
        method: password
```
79+
80+
This indicates that the `crate` superuser can connect locally without a password, while connections from other machines require a username and password.
81+
82+
```yaml
83+
gateway.recover_after_data_nodes: 2
84+
gateway.expected_data_nodes: 2
85+
```
86+
87+
These settings make the cluster wait until both data nodes are available before it recovers its state after a full restart. With more nodes, we could have set `recover_after_data_nodes` to a value smaller than the total number of nodes.
88+
89+
Now let’s install CrateDB:
90+
91+
```bash
92+
wget https://cdn.crate.io/downloads/deb/DEB-GPG-KEY-crate
93+
apt-key add DEB-GPG-KEY-crate
94+
add-apt-repository "deb https://cdn.crate.io/downloads/deb/stable/ $(lsb_release -cs) main"
95+
apt update
96+
apt install crate -o Dpkg::Options::="--force-confold"
97+
```
98+
(`force-confold` is used to keep the configuration files we created earlier)
99+
100+
Repeat the above steps on the other node.
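
Before moving on, it can be worth confirming that both nodes joined the same cluster. The following is a minimal sketch, assuming the HTTP endpoint is reachable on `localhost:4200` without credentials (which the `trust` rule for local connections should allow). The helper names `extract_single_value` and `count_cluster_nodes` are made up for illustration and are not part of CrateDB.

```shell
# Hypothetical helper: extract the single integer from a CrateDB /_sql JSON
# response that looks like {"cols":[...],"rows":[[2]],"rowcount":1,...}.
extract_single_value() {
  sed -n 's/.*"rows":\[\[\([0-9][0-9]*\)]].*/\1/p'
}

# Hypothetical helper: ask one node how many nodes it currently sees.
# $1 is the base URL of the node (defaults to the local node).
count_cluster_nodes() {
  curl -sS -H 'Content-Type: application/json' -X POST \
    "${1:-http://localhost:4200}/_sql" \
    -d '{"stmt":"SELECT COUNT(*) FROM sys.nodes"}' | extract_single_value
}
```

Running `count_cluster_nodes` on either machine should print `2` for the two-node setup described here.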
101+
102+
## Setup of the Crate JMX HTTP Exporter
103+
104+
This is very simple. On each node, run the following:
105+
106+
```shell
107+
cd /usr/share/crate/lib
108+
wget https://repo1.maven.org/maven2/io/crate/crate-jmx-exporter/1.0.0/crate-jmx-exporter-1.0.0.jar
109+
nano /etc/default/crate
110+
```
111+
112+
Then uncomment the `CRATE_JAVA_OPTS` line and change its value to:
113+
114+
```
115+
CRATE_JAVA_OPTS="-javaagent:/usr/share/crate/lib/crate-jmx-exporter-1.0.0.jar=8080"
116+
```
117+
118+
Finally, restart the crate daemon:
119+
120+
```bash
121+
systemctl restart crate
122+
```
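
To confirm the agent was loaded after the restart, you can fetch the metrics page and list the CrateDB-specific metric names. This is a minimal sketch, assuming the exporter serves the Prometheus text format at the `/metrics` path on the port configured above. `crate_metric_names` is a hypothetical helper name.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# the distinct metric names that start with "crate_".
crate_metric_names() {
  # Drop everything from the first "{" or space onward, leaving only the name.
  awk '/^crate_/ { sub(/[{ ].*/, ""); print }' | sort -u
}
```

For example: `curl -s http://localhost:8080/metrics | crate_metric_names`. An empty result suggests the `-javaagent` option was not picked up.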
123+
124+
## Prometheus Node Exporter
125+
126+
This can be set up with a one-liner:
127+
128+
```
129+
apt install prometheus-node-exporter
130+
```
131+
132+
## Prometheus SQL Exporter
133+
134+
The SQL Exporter allows running arbitrary SQL statements against a CrateDB cluster to retrieve additional information. As the cluster contains information from each node, we do not need to install the SQL Exporter on every node. Instead, we install it centrally on the same machine that also hosts Prometheus.
135+
136+
Please note that this differs from setting up a Grafana data source that points directly to CrateDB: such a data source displays the output of queries in real time, whereas Prometheus collects these values over time.
137+
138+
Installing the package is straightforward:
139+
```shell
140+
apt install prometheus-sql-exporter
141+
```
142+
143+
For the SQL Exporter to connect to the cluster, we need to create a new user `sql_exporter`. We grant the user read access to the `sys` schema. Run the commands below on any CrateDB node:
144+
```shell
145+
curl -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' -d '{"stmt":"CREATE USER sql_exporter WITH (password = '\''insert_password'\'');"}'
146+
curl -H 'Content-Type: application/json' -X POST 'http://localhost:4200/_sql' -d '{"stmt":"GRANT DQL ON SCHEMA sys TO sql_exporter;"}'
147+
```
148+
149+
We then create a configuration file in `/etc/prometheus-sql-exporter.yml` with a few sample queries, the first of which retrieves the number of shards per node:
150+
151+
```yaml
jobs:
- name: "global"
  interval: '5m'
  connections: ['postgres://sql_exporter:insert_password@ubuntuvm1:5433?sslmode=disable']
  queries:
  - name: "shard_distribution"
    help: "Number of shards per node"
    labels: ["node_name"]
    values: ["shards"]
    query: |
      SELECT node['name'] AS node_name, COUNT(*) AS shards
      FROM sys.shards
      GROUP BY 1;
    allow_zero_rows: true

  - name: "heap_usage"
    help: "Used heap space per node"
    labels: ["node_name"]
    values: ["heap_used"]
    query: |
      SELECT name AS node_name, heap['used'] / heap['max']::DOUBLE AS heap_used
      FROM sys.nodes;

  - name: "global_translog"
    help: "Global translog statistics"
    values: ["translog_uncommitted_size"]
    query: |
      SELECT COALESCE(SUM(translog_stats['uncommitted_size']), 0) AS translog_uncommitted_size
      FROM sys.shards;

  - name: "checkpoints"
    help: "Maximum global/local checkpoint delta"
    values: ["max_checkpoint_delta"]
    query: |
      SELECT COALESCE(MAX(seq_no_stats['local_checkpoint'] - seq_no_stats['global_checkpoint']), 0) AS max_checkpoint_delta
      FROM sys.shards;

  - name: "shard_allocation_issues"
    help: "Shard allocation issues"
    labels: ["shard_type"]
    values: ["shards"]
    query: |
      SELECT IF(s.primary = TRUE, 'primary', 'replica') AS shard_type, COALESCE(shards, 0) AS shards
      FROM UNNEST([true, false]) s(primary)
      LEFT JOIN (
        SELECT primary, COUNT(*) AS shards
        FROM sys.allocations
        WHERE current_state <> 'STARTED'
        GROUP BY 1
      ) a ON s.primary = a.primary;
```
203+
204+
*Please note: There exist two implementations of the SQL Exporter: [burningalchemist/sql_exporter](https://github.com/burningalchemist/sql_exporter) and [justwatchcom/sql_exporter](https://github.com/justwatchcom/sql_exporter). They don't share the same configuration options.
205+
Our example is based on the implementation that is shipped with the Ubuntu package, which is justwatchcom/sql_exporter.*
206+
207+
To apply the new configuration, we restart the service:
208+
209+
```shell
210+
systemctl restart prometheus-sql-exporter
211+
```
212+
213+
The SQL Exporter can also be used to monitor business metrics, but be careful with regularly running expensive queries. Below is one more advanced CrateDB monitoring query that may be useful:
214+
215+
```sql
216+
/* Time since the last successful snapshot (backup) */
217+
SELECT (NOW() - MAX(started)) / 60000 AS MinutesSinceLastSuccessfulSnapshot
218+
FROM sys.snapshots
219+
WHERE "state" = 'SUCCESS';
220+
```
221+
222+
## Prometheus setup
223+
224+
Run Prometheus on a machine that is not part of the CrateDB cluster. It can be installed with:
225+
226+
```
227+
apt install prometheus --no-install-recommends
228+
```
229+
230+
Please note that, by default, Prometheus becomes available on port 9090 right after installation, without any authentication requirements. You can use `policy-rcd-declarative` to prevent the service from starting immediately after installation. To require credentials, define a YAML web config file with `basic_auth_users` and refer to that file in `/etc/default/prometheus`.
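
As a minimal sketch of such a web config file, the hypothetical helper below writes a single `basic_auth_users` entry. It assumes you already have a bcrypt hash of the password (Prometheus expects hashed, not plain-text, passwords), and the file path in the usage note is only an example.

```shell
# Hypothetical helper: write a Prometheus web config with one basic-auth user.
# $1 = target file, $2 = username, $3 = bcrypt hash of the password.
write_web_config() {
  # The hash is single-quoted in the YAML so its "$" characters stay literal.
  cat > "$1" <<EOF
basic_auth_users:
  $2: '$3'
EOF
}
```

For example, run `write_web_config /etc/prometheus/web.yml admin "$HASH"` and then pass `--web.config.file=/etc/prometheus/web.yml` to Prometheus. On Debian-based systems this is typically done through the `ARGS` variable in `/etc/default/prometheus`, but check the comments in that file for your package version.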
231+
232+
For a large deployment where you also use Prometheus to monitor other systems, you may also want to use a CrateDB cluster as the storage for all Prometheus metrics. You can read more about this at [CrateDB Prometheus Adapter](https://github.com/crate/cratedb-prometheus-adapter).
233+
234+
Now we will configure Prometheus to scrape metrics from the Node Exporter on the CrateDB machines, from our Crate JMX HTTP Exporter, and from the SQL Exporter:
235+
236+
```
237+
nano /etc/prometheus/prometheus.yml
238+
```
239+
240+
Where it says:
241+
242+
```yaml
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
```
247+
248+
We replace this with the configuration below, which covers port 9100 (Prometheus Node Exporter), port 8080 (Crate JMX HTTP Exporter), and port 9237 (Prometheus SQL Exporter).
249+
```yaml
  - job_name: 'node'
    static_configs:
      - targets: ['ubuntuvm1:9100', 'ubuntuvm2:9100']
  - job_name: 'cratedb_jmx'
    static_configs:
      - targets: ['ubuntuvm1:8080', 'ubuntuvm2:8080']
  - job_name: 'sql_exporter'
    static_configs:
      - targets: ['localhost:9237']
```
260+
261+
Restart the `prometheus` daemon if it was already started (`systemctl restart prometheus`).
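
Once Prometheus is running, its HTTP API can tell you whether the scrape targets are reachable. This is a minimal sketch, assuming the API is available without authentication on `localhost:9090`. `count_healthy_targets` is a hypothetical helper that simply counts `"health":"up"` entries in the JSON returned by `/api/v1/targets`.

```shell
# Hypothetical helper: read the JSON from Prometheus' /api/v1/targets on stdin
# and print how many targets report "health":"up".
count_healthy_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}
```

For example: `curl -s http://localhost:9090/api/v1/targets | count_healthy_targets`. With the jobs shown above, expect at least five healthy targets (two Node Exporter, two JMX, and one SQL Exporter endpoint), plus any other jobs already present in your file.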
262+
263+
## Grafana setup
264+
265+
Grafana can run on the same machine as Prometheus. It can be installed with:
266+
267+
```shell
268+
echo "deb https://packages.grafana.com/oss/deb stable main" | tee -a /etc/apt/sources.list.d/grafana.list
269+
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
270+
apt update
271+
apt install grafana
272+
systemctl start grafana-server
273+
```
274+
275+
If you now point your browser to *http://\<Grafana host>:3000*, you will be welcomed by the Grafana login screen. The first time, you can log in with `admin` as both the username and password. Make sure to change this password right away.
276+
277+
Click on "Add your first data source", then click on "Prometheus", and enter the URL *http://\<Prometheus host>:9090*.
278+
279+
If you configured basic authentication for Prometheus, this is where you enter the credentials.
280+
281+
Click "Save & test".
282+
283+
An example dashboard based on the discussed setup is available for easy importing on [grafana.com](https://grafana.com/grafana/dashboards/17174-cratedb-monitoring/). In your Grafana installation, on the left-hand side, hover over the “Dashboards” icon and select “Import”. Specify the ID 17174 and load the dashboard. On the next screen, finalize the setup by selecting your previously created Prometheus data source.
284+
285+
![CrateDB monitoring dashboard in Grafana|690x396](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/0e01a3f0b8fc61ae97250fdeb2fe741f34ac7422.png)
286+
287+
## Alternative implementations
288+
289+
If you decide to build your own dashboard or use an entirely different monitoring approach, we recommend still covering similar metrics as discussed in this article. The list below is a good starting point for troubleshooting most operational issues:
290+
291+
* CrateDB metrics (with example Prometheus queries based on the Crate JMX HTTP Exporter)
  * Thread pools rejected: `sum(rate(crate_threadpools{property="rejected"}[5m])) by (name)`
  * Thread pool queue size: `sum(crate_threadpools{property="queueSize"}) by (name)`
  * Thread pools active: `sum(crate_threadpools{property="active"}) by (name)`
  * Queries per second: `sum(rate(crate_query_total_count[5m])) by (query)`
  * Query error rate: `sum(rate(crate_query_failed_count[5m])) by (query)`
  * Average Query Duration over the last 5 minutes: `sum(rate(crate_query_sum_of_durations_millis[5m])) by (query) / sum(rate(crate_query_total_count[5m])) by (query)`
  * Circuit breaker memory in use: `sum(crate_circuitbreakers{property="used"}) by (name)`
  * Number of shards: `crate_node{name="shard_stats",property="total"}`
  * Garbage Collector rates: `sum(rate(jvm_gc_collection_seconds_count[5m])) by (gc)`
  * Thread pool queue size: `crate_threadpools{property="queueSize"}`
  * Thread pool rejected operations: `crate_threadpools{property="rejected"}`
* Operating system metrics
  * CPU utilization
  * Memory usage
  * Open file descriptors
  * Disk usage
  * Disk read/write operations and throughput
  * Received and transmitted network traffic
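
Any of the example queries in this list can also be tried outside Grafana through the Prometheus HTTP API. This is a minimal sketch, assuming Prometheus listens on `localhost:9090` without authentication. `prom_query` is a hypothetical helper name.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus
# HTTP API and print the raw JSON response.
# $1 = PromQL expression, $2 = Prometheus base URL (optional).
prom_query() {
  # -G sends the URL-encoded data as query-string parameters of a GET request.
  curl -sS -G "${2:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example: `prom_query 'sum(crate_threadpools{property="active"}) by (name)'`.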
310+
311+
## Wrapping up
312+
313+
We now have a Grafana dashboard that lets us check live and historical performance and capacity metrics for our CrateDB cluster. This illustrates one possible setup. You could use different tools depending on your environment and preferences. Still, we recommend that you use the interface of the Crate JMX HTTP Exporter to collect CrateDB-specific metrics and that you always monitor the health of the environment at the OS level as well, as we have done here with the Prometheus Node Exporter.

docs/integrate/grafana/index.md

Lines changed: 12 additions & 0 deletions
@@ -51,6 +51,18 @@ Connecting to a CrateDB cluster uses the Grafana PostgreSQL data source adapter.
5151

5252
::::
5353

54+
:::{rubric} See also
55+
:::
56+
57+
::::{grid} 2
58+
59+
:::{grid-item-card} Tutorial: Monitoring CrateDB with Prometheus and Grafana
60+
:link: monitoring-prometheus-grafana
61+
:link-type: ref
62+
Production-grade monitoring and graphing of CrateDB metrics.
63+
:::
64+
65+
::::
5466

5567
:::{toctree}
5668
:maxdepth: 1

docs/integrate/prometheus/index.md

Lines changed: 14 additions & 0 deletions
@@ -107,6 +107,20 @@ tutorial.
107107
[CrateDB Prometheus Adapter]
108108

109109

110+
:::{rubric} See also
111+
:::
112+
113+
::::{grid} 2
114+
115+
:::{grid-item-card} Tutorial: Monitoring CrateDB with Prometheus and Grafana
116+
:link: monitoring-prometheus-grafana
117+
:link-type: ref
118+
Production-grade monitoring and graphing of CrateDB metrics.
119+
:::
120+
121+
::::
122+
123+
110124
```{seealso}
111125
[CrateDB and Prometheus]
112126
```
