From d0d710fe4d8a7c924491ac6172f55c7bd2a2ddd3 Mon Sep 17 00:00:00 2001
From: Rich Loveland
Date: Tue, 14 Oct 2025 14:37:10 -0400
Subject: [PATCH] Add `storage.wal.failover.write_and_sync.latency`

Fixes DOC-13184

Summary of changes:

- Add a mention of the `storage.wal.failover.write_and_sync.latency` metric to the `wal-failover-metrics.md` include file, which will pull it into the 'WAL failover' and 'cockroach start' pages.

- We're also opening a cockroachdb/cockroach PR to mark this metric as 'essential', so that it shows up in the list of Storage essential metrics at e.g. https://www.cockroachlabs.com/docs/v25.3/essential-metrics-self-hosted.html#storage
---
 src/current/_includes/v24.1/wal-failover-metrics.md | 4 ++++
 src/current/_includes/v24.3/wal-failover-metrics.md | 4 ++++
 src/current/_includes/v25.2/wal-failover-metrics.md | 4 ++++
 src/current/_includes/v25.3/wal-failover-metrics.md | 4 ++++
 src/current/_includes/v25.4/wal-failover-metrics.md | 4 ++++
 5 files changed, 20 insertions(+)

diff --git a/src/current/_includes/v24.1/wal-failover-metrics.md b/src/current/_includes/v24.1/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v24.1/wal-failover-metrics.md
+++ b/src/current/_includes/v24.1/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Measures the fsync latency of writes to WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: When WAL failover is configured in a cluster, monitor this metric, which shows the effective latency observed by the higher layer writing to the WAL. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or the secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).
diff --git a/src/current/_includes/v24.3/wal-failover-metrics.md b/src/current/_includes/v24.3/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v24.3/wal-failover-metrics.md
+++ b/src/current/_includes/v24.3/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Measures the fsync latency of writes to WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: When WAL failover is configured in a cluster, monitor this metric, which shows the effective latency observed by the higher layer writing to the WAL. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or the secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).
diff --git a/src/current/_includes/v25.2/wal-failover-metrics.md b/src/current/_includes/v25.2/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v25.2/wal-failover-metrics.md
+++ b/src/current/_includes/v25.2/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Measures the fsync latency of writes to WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: When WAL failover is configured in a cluster, monitor this metric, which shows the effective latency observed by the higher layer writing to the WAL. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or the secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).
diff --git a/src/current/_includes/v25.3/wal-failover-metrics.md b/src/current/_includes/v25.3/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v25.3/wal-failover-metrics.md
+++ b/src/current/_includes/v25.3/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Measures the fsync latency of writes to WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: When WAL failover is configured in a cluster, monitor this metric, which shows the effective latency observed by the higher layer writing to the WAL. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or the secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).
diff --git a/src/current/_includes/v25.4/wal-failover-metrics.md b/src/current/_includes/v25.4/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v25.4/wal-failover-metrics.md
+++ b/src/current/_includes/v25.4/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Measures the fsync latency of writes to WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: When WAL failover is configured in a cluster, monitor this metric, which shows the effective latency observed by the higher layer writing to the WAL. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or the secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).