From d0d710fe4d8a7c924491ac6172f55c7bd2a2ddd3 Mon Sep 17 00:00:00 2001
From: Rich Loveland <rich@cockroachlabs.com>
Date: Tue, 14 Oct 2025 14:37:10 -0400
Subject: [PATCH] Add `storage.wal.failover.write_and_sync.latency`

Fixes DOC-13184

Summary of changes:

- Add a mention of the `storage.wal.failover.write_and_sync.latency`
  metric to the `wal-failover-metrics.md` include file, which will pull
  it into the 'WAL failover' and 'cockroach start' pages.

- A separate cockroachdb/cockroach PR will mark this metric as
  'essential', so that it appears in the list of Storage essential
  metrics, e.g., at
  https://www.cockroachlabs.com/docs/v25.3/essential-metrics-self-hosted.html#storage
---
 src/current/_includes/v24.1/wal-failover-metrics.md | 4 ++++
 src/current/_includes/v24.3/wal-failover-metrics.md | 4 ++++
 src/current/_includes/v25.2/wal-failover-metrics.md | 4 ++++
 src/current/_includes/v25.3/wal-failover-metrics.md | 4 ++++
 src/current/_includes/v25.4/wal-failover-metrics.md | 4 ++++
 5 files changed, 20 insertions(+)

diff --git a/src/current/_includes/v24.1/wal-failover-metrics.md b/src/current/_includes/v24.1/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v24.1/wal-failover-metrics.md
+++ b/src/current/_includes/v24.1/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Latency of fsync operations on WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: Effective latency observed by the higher layer writing to the WAL. When WAL failover is configured in a cluster, operators should monitor this metric. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).
diff --git a/src/current/_includes/v24.3/wal-failover-metrics.md b/src/current/_includes/v24.3/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v24.3/wal-failover-metrics.md
+++ b/src/current/_includes/v24.3/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Latency of fsync operations on WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: Effective latency observed by the higher layer writing to the WAL. When WAL failover is configured in a cluster, operators should monitor this metric. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).
diff --git a/src/current/_includes/v25.2/wal-failover-metrics.md b/src/current/_includes/v25.2/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v25.2/wal-failover-metrics.md
+++ b/src/current/_includes/v25.2/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Latency of fsync operations on WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: Effective latency observed by the higher layer writing to the WAL. When WAL failover is configured in a cluster, operators should monitor this metric. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).
diff --git a/src/current/_includes/v25.3/wal-failover-metrics.md b/src/current/_includes/v25.3/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v25.3/wal-failover-metrics.md
+++ b/src/current/_includes/v25.3/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Latency of fsync operations on WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: Effective latency observed by the higher layer writing to the WAL. When WAL failover is configured in a cluster, operators should monitor this metric. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).
diff --git a/src/current/_includes/v25.4/wal-failover-metrics.md b/src/current/_includes/v25.4/wal-failover-metrics.md
index 96d17a83d48..449d1b2332d 100644
--- a/src/current/_includes/v25.4/wal-failover-metrics.md
+++ b/src/current/_includes/v25.4/wal-failover-metrics.md
@@ -3,6 +3,8 @@ You can monitor WAL failover occurrences using the following metrics:
 - `storage.wal.failover.secondary.duration`: Cumulative time spent (in nanoseconds) writing to the secondary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.primary.duration`: Cumulative time spent (in nanoseconds) writing to the primary WAL directory. Only populated when WAL failover is configured.
 - `storage.wal.failover.switch.count`: Count of the number of times WAL writing has switched from primary to secondary store, and vice versa.
+- `storage.wal.fsync.latency`: Latency of fsync operations on WAL files. If WAL failover is enabled and a failover is in progress, this metric includes the latency of the stalled primary.
+- `storage.wal.failover.write_and_sync.latency`: Effective latency observed by the higher layer writing to the WAL. When WAL failover is configured in a cluster, operators should monitor this metric. It is expected to stay low in a healthy system, regardless of whether WAL files are being written to the primary or secondary.
 
 The `storage.wal.failover.secondary.duration` is the primary metric to monitor. You should expect this metric to be `0` unless a WAL failover occurs. If a WAL failover occurs, the rate at which it increases provides an indication of the health of the primary store.
 
@@ -10,3 +12,5 @@ You can access these metrics via the following methods:
 
 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
+
+For more information, refer to [Essential storage metrics]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage).