Add docs for storage engine value separation (#19941)

rmloveland · web-flow · commit a978713b31de · 2025-07-31T20:34:15.000Z
Fixes:

- DOC-13931
- DOC-14051
- DOC-14109
- DOC-14110
- DOC-14293

Summary of changes:

- Update 'Storage Layer' docs with new section on value separation, as well as breaking out read amp and write amp into their own new sections
- Update 'Essential Metrics for Self-Hosted' with docs for the new timeseries metrics for value separation
- Update previously published release notes for the feature to link to the 'Storage Layer' docs
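
Note for reviewers: the write amplification tradeoff described in the new 'Value separation' section can be illustrated with a rough sketch. This is a toy model, not how the storage engine accounts for I/O. It assumes each entry is written once to the WAL, once at flush, and then rewritten a fixed number of times by compactions, and it ignores any rewriting of blob files. The function name, the pointer size, the threshold, and all byte counts are hypothetical.

```python
# Toy model of LSM write amplification with and without value separation.
# Every number here (sizes, rewrite count, threshold, pointer size) is a
# hypothetical illustration value, not a CockroachDB default or measurement.

POINTER_SIZE = 16  # assumed bytes for a pointer into a blob file


def write_amplification(entries, rewrites, min_separated_size=None):
    """Return bytes physically written divided by bytes logically committed.

    entries: list of (key_size, value_size) tuples, in bytes.
    rewrites: how many times compactions rewrite each entry after the flush.
    min_separated_size: values at least this large are written once to a
        blob file and represented in the LSM by a pointer. None disables
        value separation.
    """
    logical = sum(k + v for k, v in entries)
    physical = 0
    for k, v in entries:
        physical += k + v  # one write to the WAL
        separated = min_separated_size is not None and v >= min_separated_size
        if separated:
            physical += v  # the value is written once to a blob file
            in_lsm = k + POINTER_SIZE  # the LSM holds the key and a pointer
        else:
            in_lsm = k + v  # the LSM holds the full key-value pair
        physical += in_lsm * (1 + rewrites)  # flush plus compaction rewrites
    return physical / logical


# Hypothetical workload: 100 entries with 16-byte keys and 1 KiB values.
entries = [(16, 1024)] * 100
baseline = write_amplification(entries, rewrites=5)
separated = write_amplification(entries, rewrites=5, min_separated_size=256)
print(baseline, separated)
```

In this model, the separated case writes fewer bytes because compactions rewrite only the key and pointer. The size of the reduction depends entirely on the assumed inputs and should not be read as a prediction for a real cluster.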
diff --git a/src/current/_includes/releases/v25.3/v25.3.0-alpha.2.md b/src/current/_includes/releases/v25.3/v25.3.0-alpha.2.md
@@ -30,7 +30,7 @@ Release Date: June 16, 2025
 
 - Added an `alter_changefeed` structured log event to provide more visibility into when an `ALTER CHANGEFEED` event occurred and what changed.
  [#147679][#147679]
-- Added new timeseries metrics to the `storage.value_separation.*` namespace for observing the behavior of storage engine value separation.
+- Added new timeseries metrics to the [`storage.value_separation.*` namespace]({% link v25.3/essential-metrics-self-hosted.md %}#storage-value-separation) for observing the behavior of [storage engine value separation]({% link v25.3/architecture/storage-layer.md %}#value-separation).
  [#147728][#147728]
 
 <h3 id="v25-3-0-alpha-2-db-console-changes">DB Console changes</h3>
diff --git a/src/current/_includes/releases/v25.3/v25.3.0-beta.1.md b/src/current/_includes/releases/v25.3/v25.3.0-beta.1.md
@@ -15,7 +15,7 @@ Release Date: July 2, 2025
 
 <h3 id="v25-3-0-beta-1-operational-changes">Operational changes</h3>
 
-- Introduced the following cluster settings for enabling and configuring value separation in the storage engine: `storage.value_separation.enabled`, `storage.value_separation.minimum_size`, and `storage.value_separation.max_reference_depth`.
+- Introduced the following cluster settings for enabling and configuring [value separation in the storage engine]({% link v25.3/architecture/storage-layer.md %}#value-separation): `storage.value_separation.enabled`, `storage.value_separation.minimum_size`, and `storage.value_separation.max_reference_depth`.
  [#148535][#148535]
 - Non-admin users no longer have access to changefeed jobs they do not own and which are not owned by a role of which they are a member, regardless of whether they have the `CHANGEFEED` privilege on the table or tables those jobs may be watching. Admin users, or those with global `SHOWJOB` / `CONTROLJOB` privileges, can still interact with all jobs, regardless of ownership.
  [#148537][#148537]
diff --git a/src/current/_includes/v25.3/essential-metrics.md b/src/current/_includes/v25.3/essential-metrics.md
@@ -40,6 +40,10 @@ The **Usage** column explains why each metric is important to visualize in a cus
 | rocksdb.compactions                                 | rocksdb.compactions.total                                    | Number of SST compactions                                    | This metric reports the number of a node's [LSM compactions]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#lsm-health). If the number of compactions remains elevated while the LSM health does not improve, compactions are not keeping up with the workload. If the condition persists for an extended period, the cluster will initially exhibit performance issues that will eventually escalate into stability issues. |
 | rocksdb.block.cache.hits                            | rocksdb.block.cache.hits                                     | Count of block cache hits                                    | This metric gives hits to block cache which is reserved memory. It is allocated upon the start of a node process by the [`--cache` flag]({% link {{ page.version.version }}/cockroach-start.md %}#general) and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload. |
 | rocksdb.block.cache.misses                          | rocksdb.block.cache.misses                                   | Count of block cache misses                                  | This metric gives misses to block cache which is reserved memory. It is allocated upon the start of a node process by the [`--cache` flag]({% link {{ page.version.version }}/cockroach-start.md %}#general) and never shrinks. By observing block cache hits and misses, you can fine-tune memory allocations in the node process for the demands of the workload. |
+| <a name="storage-value-separation"></a> storage.value_separation.blob_files.count | storage.value_separation.blob_files.count | The number of blob files that are used to store [separated values]({% link {{ page.version.version }}/architecture/storage-layer.md %}#value-separation) within the storage engine. | Use this metric to track how many blob files the storage engine is using to store large values (of key-value pairs) outside of the [LSM]({% link {{ page.version.version }}/architecture/storage-layer.md %}#log-structured-merge-trees). |
+| storage.value_separation.blob_files.size | storage.value_separation.blob_files.size | The size of the physical blob files that are used to store [separated values]({% link {{ page.version.version }}/architecture/storage-layer.md %}#value-separation) within the storage engine. This value is the physical post-compression sum of the `storage.value_separation.value_bytes.referenced` and `storage.value_separation.value_bytes.unreferenced` metrics. | Use this metric to see how much of your physical storage capacity is being used by separated values in blob files. |
+| storage.value_separation.value_bytes.referenced | storage.value_separation.value_bytes.referenced | The size of storage engine value bytes (pre-compression) that are [stored separately in blob files]({% link {{ page.version.version }}/architecture/storage-layer.md %}#value-separation) and referenced by a live [SSTable]({% link {{ page.version.version }}/architecture/storage-layer.md %}#ssts). | Use this metric to see how much live (i.e., not yet eligible for compaction) blob storage is in use by separated values. |
+| storage.value_separation.value_bytes.unreferenced | storage.value_separation.value_bytes.unreferenced | The size of storage engine value bytes (pre-compression) that are [stored separately in blob files]({% link {{ page.version.version }}/architecture/storage-layer.md %}#value-separation) and not referenced by any live [SSTable]({% link {{ page.version.version }}/architecture/storage-layer.md %}#ssts). These bytes are garbage that could be reclaimed by a [compaction]({% link {{ page.version.version }}/architecture/storage-layer.md %}#compaction). | Use this metric to see how much blob storage is no longer in use and waiting to be compacted. |
 
 ## Health
 
diff --git a/src/current/v25.3/architecture/storage-layer.md b/src/current/v25.3/architecture/storage-layer.md
@@ -80,6 +80,39 @@ The SSTs within each level are guaranteed to be non-overlapping: for example, if
 
 <img src="{{ 'images/v21.2/lsm-with-ssts.png' | relative_url }}" alt="LSM tree with SST files" style="max-width:100%" />
 
+##### Write amplification
+
+_Write amplification_ measures the volume of data written to disk relative to the volume of data logically committed to the storage engine. When values are committed, CockroachDB writes them to the [write-ahead log (WAL)](#memtable-and-write-ahead-log) and then to [SSTables](#ssts) during flushes. [Compactions](#compaction) rewrite those SSTables multiple times over the value's lifetime. Most write amplification, and write bandwidth more broadly, originates from compactions.
+
+This tradeoff is necessary: if the storage engine performs too few compactions, [L0](#lsm-levels) grows too large and the LSM becomes [inverted](#inverted-lsms), which degrades read performance. In contrast, writes to the WAL are a small fraction of a [store]({% link {{ page.version.version }}/cockroach-start.md %}#store)'s overall write bandwidth and IOPS.
+
+##### Read amplification
+
+_Read amplification_ measures the number of SSTable files consulted to satisfy a logical read. High read amplification occurs when value lookups must search multiple LSM levels or SST files, such as in an inverted LSM state. Keeping read and [write amplification](#write-amplification) in balance is critical for optimal storage engine performance.
+
+Read amplification is high [when the LSM is inverted](#inverted-lsms). In the inverted LSM state, reads need to start in higher levels and "look down" through many SSTs to find a key's correct (freshest) value.
+
+Read amplification can be especially bad if a large [`IMPORT INTO`]({% link {{ page.version.version }}/import-into.md %}) is overloading the cluster (due to insufficient CPU and/or IOPS) and the storage engine has to consult many small SSTs in L0 to determine the most up-to-date value of the keys being read (e.g., using a [`SELECT`]({% link {{ page.version.version }}/select-clause.md %})).
+
+A certain amount of read amplification is expected in a normally functioning CockroachDB cluster. For example, a read amplification factor less than 10 as shown in the [**Read Amplification** graph on the **Storage** dashboard]({% link {{ page.version.version }}/ui-storage-dashboard.md %}#other-graphs) is considered healthy.
+
+##### Value separation
+
+{% include feature-phases/preview.md %}
+
+{% include_cached new-in.html version="v25.3" %} The storage engine can optimize performance using _value separation_. When the engine encounters a key-value pair with a sufficiently large value component, it stores the key in the [LSM](#log-structured-merge-trees) alongside a pointer to the value's location in a _blob file_ that is located outside the LSM. This indirection allows [compactions](#compaction) of the LSM to skip rewriting large values over and over; instead, compactions can copy a pointer to the large value's location.
+
+Value separation is especially beneficial for workloads with large values relative to key size (for example, [Raft log]({% link {{ page.version.version }}/architecture/replication-layer.md %}#raft) entries). It reduces [write amplification](#write-amplification) by about 50%, at the cost of about 20% in [space amplification]({% link {{ page.version.version }}/operational-faqs.md %}#space-amplification). In practice, value separation causes the storage engine to use far fewer [IOPS]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#disk-iops) and much less storage bandwidth, both of which are expensive, in exchange for using more storage capacity, which is much cheaper.
+
+To enable value separation, set the following [cluster setting]({% link {{ page.version.version }}/set-cluster-setting.md %}):
+
+{% include_cached copy-clipboard.html %}
+~~~ sql
+SET CLUSTER SETTING storage.value_separation.enabled = true;
+~~~
+
+To monitor this feature, refer to [the documentation for the metrics in the `storage.value_separation.*` namespace]({% link {{ page.version.version }}/essential-metrics-self-hosted.md %}#storage-value-separation).
+
 ##### Compaction
 
 The process of merging SSTs and moving them from L0 down to L6 in the LSM is called _compaction_. The storage engine works to compact data as quickly as possible. As a result of this process, lower levels of the LSM should contain larger SSTs that contain less recently updated keys, while higher levels of the LSM should contain smaller SSTs that contain more recently updated keys.
@@ -102,19 +135,6 @@ During normal operation, the LSM should look like this: ◣. An inverted LSM loo
 
 An inverted LSM will have degraded read performance.
 
-<a name="read-amplification"></a>
-
-Read amplification is high when the LSM is inverted. In the inverted LSM state, reads need to start in higher levels and "look down" through a lot of SSTs to read a key's correct (freshest) value. When the storage engine needs to read from multiple SST files in order to service a single logical read, this state is known as _read amplification_.
-
-Read amplification can be especially bad if a large [`IMPORT INTO`]({% link {{ page.version.version }}/import-into.md %}) is overloading the cluster (due to insufficient CPU and/or IOPS) and the storage engine has to consult many small SSTs in L0 to determine the most up-to-date value of the keys being read (e.g., using a [`SELECT`]({% link {{ page.version.version }}/select-clause.md %})).
-
-A certain amount of read amplification is expected in a normally functioning CockroachDB cluster. For example, a read amplification factor less than 10 as shown in the [**Read Amplification** graph on the **Storage** dashboard]({% link {{ page.version.version }}/ui-storage-dashboard.md %}#other-graphs) is considered healthy.
-
-<a name="write-amplification"></a>
-
-_Write amplification_ measures the volume of data written to disk, relative to the volume of data logically committed to the storage engine. When a value is committed to the storage engine, CockroachDB writes it once to the [write-ahead log (WAL)](#memtable-and-write-ahead-log). CockroachDB writes the value again when flushing it to an [SSTable](#ssts). CockroachDB subsequently writes the value multiple times as part of [compactions](#compaction) over the lifetime of the value. Most write amplification, and write bandwidth more broadly, originates from compactions. This is a necessary tradeoff, because if the storage engine performs too few compactions, the size of [L0](#lsm-levels) will get too large and an inverted LSM will result, which also has ill effects. In contrast, writes to the WAL are a small fraction of a [store]({% link {{ page.version.version }}/cockroach-start.md %}#store)'s overall write bandwidth and IOPs.
-
-Read amplification and write amplification are key metrics for LSM performance. Neither is inherently "good" or "bad", but they must not occur in excess, and for optimum performance they must be kept in balance. That balance involves tradeoffs.
 
 Inverted LSMs also have excessive compaction debt. In this state, the storage engine has a large backlog of [compactions](#compaction) to do to return the inverted LSM to a normal, non-inverted state.
 
diff --git a/src/current/v25.3/cockroachdb-feature-availability.md b/src/current/v25.3/cockroachdb-feature-availability.md
@@ -53,6 +53,10 @@ Any feature made available in a phase prior to GA is provided without any warran
 
 The `metrics` Prometheus endpoint is commonly used and is the default in Prometheus configurations.
 
+### Value separation
+
+[Value separation]({% link {{ page.version.version }}/architecture/storage-layer.md %}#value-separation) reduces write amplification by storing large values separately from the LSM in blob files. Value separation can reduce write amplification by up to 50% for large-value workloads, while introducing minor read overhead and a slight increase in disk space usage. This feature is available in Preview.
+
 ### `database` and `application_name` labels for certain metrics
 
 The following cluster settings enable the [`database` and `application_name` labels for certain metrics]({% link {{ page.version.version }}/multi-dimensional-metrics.md %}#enable-database-and-application_name-labels), along with their internal counterparts if they exist:
diff --git a/src/current/v25.3/operational-faqs.md b/src/current/v25.3/operational-faqs.md
@@ -77,7 +77,9 @@ For more information about how MVCC works, see [MVCC]({% link {{ page.version.ve
 
 ### The data could be in the process of being compacted
 
-When MVCC garbage is deleted by garbage collection, the data is still not yet physically removed from the filesystem by the [Storage Layer]({% link {{ page.version.version }}/architecture/storage-layer.md %}). Removing data from the filesystem requires rewriting the files containing the data using a process also known as [compaction]({% link {{ page.version.version }}/architecture/storage-layer.md %}#compaction), which can be expensive. The storage engine has heuristics to compact data and remove deleted rows when enough garbage has accumulated to warrant a compaction. It strives to always restrict the overhead of obsolete data (called the space amplification) to at most 10%. If a lot of data was just deleted, it may take the storage engine some time to compact the files and restore this property.
+<a name="space-amplification"></a>
+
+When MVCC garbage is deleted by garbage collection, the data is not yet physically removed from the filesystem by the [Storage Layer]({% link {{ page.version.version }}/architecture/storage-layer.md %}). Removing data from the filesystem requires rewriting the files containing the data using a process called [compaction]({% link {{ page.version.version }}/architecture/storage-layer.md %}#compaction), which can be expensive. The storage engine has heuristics to compact data and remove deleted rows when enough garbage has accumulated to warrant a compaction. It strives to limit the overhead of this obsolete data (called the _space amplification_) to a small fixed percentage. If a lot of data was just deleted, it may take the storage engine some time to compact the files and restore this property.
 
 {% include {{page.version.version}}/storage/free-up-disk-space.md %}