2 changes: 0 additions & 2 deletions docs/Basic Concepts/RAIDZ.rst
@@ -1,8 +1,6 @@
RAIDZ
=====

tl;dr: RAIDZ is effective for large block sizes and sequential workloads.

Introduction
~~~~~~~~~~~~

87 changes: 87 additions & 0 deletions docs/Basic Concepts/VDEVs.rst
@@ -0,0 +1,87 @@
What is a VDEV?
===============

A vdev (virtual device) is a fundamental building block of ZFS storage pools. It represents a logical grouping of physical storage devices, such as hard drives, SSDs, or partitions.

What is a leaf vdev?
====================

A leaf vdev is the most basic type of vdev; it corresponds directly to a physical storage device and is the endpoint of the storage hierarchy in ZFS.

What is a top-level vdev?
=========================

Top-level vdevs are the direct children of the root vdev. They can be single devices or logical groups that aggregate multiple leaf vdevs (like mirrors or RAIDZ groups). ZFS dynamically stripes data across all top-level vdevs in a pool.

What is a root vdev?
====================

The root vdev is the top of the pool hierarchy. It aggregates all top-level vdevs into a single logical storage unit (the pool).

What are the different types of vdevs?
======================================

OpenZFS supports several types of vdevs. Top-level vdevs carry data and provide redundancy:

* **Striped Disk(s)**: A vdev consisting of one or more physical devices striped together (like RAID 0). It provides no redundancy and will lead to data loss if a drive fails.
* **Mirror**: A vdev that stores the same data on two or more drives for redundancy.
* `RAIDZ <./RAIDZ.html>`__: A vdev that uses parity to provide fault tolerance, similar to traditional RAID 5/6. There are three levels of RAIDZ:

  * **RAIDZ1**: Single parity, similar to RAID 5. Requires at least 2 disks (3+ recommended) and can tolerate one drive failure.
  * **RAIDZ2**: Double parity, similar to RAID 6. Requires at least 3 disks (5+ recommended) and can tolerate two drive failures.
  * **RAIDZ3**: Triple parity. Requires at least 4 disks (7+ recommended) and can tolerate three drive failures.

* `dRAID <./dRAID%20Howto.html>`__: Distributed RAID. A vdev that provides distributed parity and hot spares across multiple drives, allowing for much faster rebuild performance after a failure.

Auxiliary vdevs provide specific functionality:

* **Spare**: A drive that acts as a hot spare, automatically replacing a failed drive in another vdev.
* **Cache (L2ARC)**: A Level 2 ARC vdev used for caching frequently accessed data to improve random read performance.
* **Log (SLOG)**: A separate log vdev (SLOG) used to store the ZFS Intent Log (ZIL) for improved synchronous write performance.
* **Special**: A vdev dedicated to storing metadata, and optionally small file blocks and the Dedup Table (DDT).
* **Dedup**: A vdev dedicated strictly to storing the Deduplication Table (DDT).
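
As a hedged illustration of how these vdev types are specified on the command line (the pool name tank and the device names are placeholders, not recommendations), a single `zpool-create(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-create.8.html>`_ invocation can combine data and auxiliary vdevs::

   # mirrored data vdev, mirrored log, cache device, mirrored special vdev and a hot spare
   zpool create tank \
         mirror sda sdb \
         log mirror sdc sdd \
         cache sde \
         special mirror sdf sdg \
         spare sdh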

How do vdevs relate to storage pools?
=====================================

Vdevs are the building blocks of ZFS storage pools. A storage pool (zpool) is created by combining one or more top-level vdevs. The overall performance, capacity, and redundancy of the storage pool depend on the configuration and types of vdevs used.

Here is an example layout as seen in `zpool-status(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-status.8.html>`_ output
for a pool with two RAIDZ1 top-level vdevs and 10 leaf vdevs:

::

   datapoolname (root vdev)
     raidz1-0 (top-level vdev)
       /dev/dsk/disk0 (leaf vdev)
       /dev/dsk/disk1 (leaf vdev)
       /dev/dsk/disk2 (leaf vdev)
       /dev/dsk/disk3 (leaf vdev)
       /dev/dsk/disk4 (leaf vdev)
     raidz1-1 (top-level vdev)
       /dev/dsk/disk5 (leaf vdev)
       /dev/dsk/disk6 (leaf vdev)
       /dev/dsk/disk7 (leaf vdev)
       /dev/dsk/disk8 (leaf vdev)
       /dev/dsk/disk9 (leaf vdev)
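
A pool with this layout could be created in one step (a sketch reusing the device paths from the example above)::

   zpool create datapoolname \
         raidz1 /dev/dsk/disk0 /dev/dsk/disk1 /dev/dsk/disk2 /dev/dsk/disk3 /dev/dsk/disk4 \
         raidz1 /dev/dsk/disk5 /dev/dsk/disk6 /dev/dsk/disk7 /dev/dsk/disk8 /dev/dsk/disk9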

How does ZFS handle vdev failures?
==================================

ZFS is designed to handle vdev failures gracefully. If a vdev fails, ZFS can continue to operate using the remaining vdevs in the pool,
provided that the redundancy level of the pool allows for it (e.g., in a mirror, RAIDZ, or dRAID configuration).
When there is still enough redundancy in the pool, ZFS will mark the failed vdev as "faulted" and reconstruct its data from the remaining vdevs.
Administrators can replace the failed vdev with a new one using `zpool-replace(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-replace.8.html>`_, and ZFS will automatically resilver (rebuild)
the data onto the new vdev to return the pool to a healthy state.
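
For example, with the layout shown earlier and assuming /dev/dsk/disk3 has faulted, replacing it with a hypothetical new device could look like this (a sketch, not a prescription)::

   zpool replace datapoolname /dev/dsk/disk3 /dev/dsk/disk10
   zpool status datapoolname    # watch resilver progress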

How do I manage vdevs in ZFS?
=============================

Vdevs are managed using the `zpool(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool.8.html>`_ command-line utility. Common operations include:

* **Creating a pool**: `zpool create` allows you to specify the vdev layout. See `zpool-create(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-create.8.html>`_.
* **Adding vdevs**: `zpool add` adds new top-level vdevs to an existing pool, expanding its capacity and performance (by increasing the stripe width). See `zpool-add(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-add.8.html>`_.
* **Removing vdevs**: `zpool remove` can remove certain types of top-level vdevs, evacuating their data to the remaining vdevs. See `zpool-remove(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-remove.8.html>`_.
* **Replacing drives**: `zpool replace` swaps a failed or undersized drive with a new one. See `zpool-replace(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-replace.8.html>`_.
* **Monitoring status**: `zpool status` shows the health and layout of all vdevs. See `zpool-status(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-status.8.html>`_.
* **Monitoring performance**: `zpool iostat` displays I/O statistics for the pool and individual vdevs. See `zpool-iostat(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-iostat.8.html>`_.
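
As a brief, hedged sketch (device names are hypothetical), adding a third RAIDZ1 top-level vdev to the example pool and checking the result could look like::

   zpool add datapoolname raidz1 /dev/dsk/disk10 /dev/dsk/disk11 /dev/dsk/disk12 /dev/dsk/disk13 /dev/dsk/disk14
   zpool status datapoolname
   zpool iostat -v datapoolname 5
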
36 changes: 0 additions & 36 deletions docs/Performance and Tuning/Async Write.rst

This file was deleted.

61 changes: 61 additions & 0 deletions docs/Performance and Tuning/Module Parameters.rst
Original file line number Diff line number Diff line change
@@ -488,6 +488,8 @@ resilver
- `zfs_top_maxinflight <#zfs-top-maxinflight>`__
- `zfs_vdev_scrub_max_active <#zfs-vdev-scrub-max-active>`__
- `zfs_vdev_scrub_min_active <#zfs-vdev-scrub-min-active>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

scrub
~~~~~
@@ -511,6 +513,8 @@ scrub
- `zfs_top_maxinflight <#zfs-top-maxinflight>`__
- `zfs_vdev_scrub_max_active <#zfs-vdev-scrub-max-active>`__
- `zfs_vdev_scrub_min_active <#zfs-vdev-scrub-min-active>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

send
~~~~
@@ -632,6 +636,8 @@ vdev
- `zfs_vdev_write_gap_limit <#zfs-vdev-write-gap-limit>`__
- `zio_dva_throttle_enabled <#zio-dva-throttle-enabled>`__
- `zio_slow_io_ms <#zio-slow-io-ms>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

vdev_cache
~~~~~~~~~~
@@ -644,6 +650,8 @@ vdev_initialize
~~~~~~~~~~~~~~~

- `zfs_initialize_value <#zfs-initialize-value>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

vdev_removal
~~~~~~~~~~~~
@@ -656,6 +664,8 @@ vdev_removal
- `zfs_removal_ignore_errors <#zfs-removal-ignore-errors>`__
- `zfs_removal_suspend_progress <#zfs-removal-suspend-progress>`__
- `vdev_removal_max_span <#vdev-removal-max-span>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

volume
~~~~~~
@@ -736,6 +746,8 @@ ZIO_scheduler
- `zfs_vdev_trim_max_active <#zfs-vdev-trim-max-active>`__
- `zfs_vdev_trim_min_active <#zfs-vdev-trim-min-active>`__
- `zfs_vdev_write_gap_limit <#zfs-vdev-write-gap-limit>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
- `zio_dva_throttle_enabled <#zio-dva-throttle-enabled>`__
- `zio_requeue_io_start_cut_in_line <#zio-requeue-io-start-cut-in-line>`__
- `zio_taskq_batch_pct <#zio-taskq-batch-pct>`__
@@ -979,6 +991,8 @@ Index
- `zfs_vdev_mirror_rotating_seek_inc <#zfs-vdev-mirror-rotating-seek-inc>`__
- `zfs_vdev_mirror_rotating_seek_offset <#zfs-vdev-mirror-rotating-seek-offset>`__
- `zfs_vdev_ms_count_limit <#zfs-vdev-ms-count-limit>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
- `zfs_vdev_queue_depth_pct <#zfs-vdev-queue-depth-pct>`__
- `zfs_vdev_raidz_impl <#zfs-vdev-raidz-impl>`__
- `zfs_vdev_read_gap_limit <#zfs-vdev-read-gap-limit>`__
@@ -4009,6 +4023,53 @@ See also `zio_dva_throttle_enabled <#zio-dva-throttle-enabled>`__
| Versions Affected | v0.7.0 and later |
+--------------------------+------------------------------------------+

zfs_vdev_nia_delay
~~~~~~~~~~~~~~~~~~

For non-interactive I/O (scrub, resilver, removal, initialize and rebuild),
the number of concurrently-active I/O operations is limited to zfs_*_min_active,
unless the vdev is "idle". When there are no interactive I/O operations active
(synchronous or otherwise), and zfs_vdev_nia_delay operations have completed
since the last interactive operation, then the vdev is considered to be "idle",
and the number of concurrently-active non-interactive operations is increased to
zfs_*_max_active. See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__.

====================== ====================
zfs_vdev_nia_delay     Notes
====================== ====================
Tags                   `vdev <#vdev>`__, `scrub <#scrub>`__, `resilver <#resilver>`__
When to change         See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__
Data Type              uint
Range                  1 to UINT_MAX
Default                5
Change                 Dynamic
Versions Affected      v2.0.0 and later
====================== ====================
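
On Linux, the current value can be read or changed at runtime through the module parameter interface (a hedged sketch; the value 10 is purely illustrative)::

   cat /sys/module/zfs/parameters/zfs_vdev_nia_delay
   echo 10 > /sys/module/zfs/parameters/zfs_vdev_nia_delay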

zfs_vdev_nia_credit
~~~~~~~~~~~~~~~~~~~

Some HDDs prioritize sequential I/O so strongly that concurrent
random I/O latency can reach several seconds. On some HDDs this happens even
if sequential I/O operations are submitted one at a time, so setting
zfs_*_max_active=1 does not help. To prevent non-interactive I/O, such as scrub,
from monopolizing the device, no more than zfs_vdev_nia_credit operations
can be sent while there are outstanding incomplete interactive operations.
This enforced wait ensures the HDD services the interactive I/O within a
reasonable amount of time. See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__.

====================== ====================
zfs_vdev_nia_credit    Notes
====================== ====================
Tags                   `vdev <#vdev>`__, `scrub <#scrub>`__, `resilver <#resilver>`__
When to change         See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__
Data Type              uint
Range                  1 to UINT_MAX
Default                5
Change                 Dynamic
Versions Affected      v2.0.0 and later
====================== ====================
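
To make a tuned value persist across reboots on Linux, it can also be set as a module option (a sketch; the value 10 is purely illustrative)::

   # /etc/modprobe.d/zfs.conf
   options zfs zfs_vdev_nia_credit=10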

zfs_disable_dup_eviction
~~~~~~~~~~~~~~~~~~~~~~~~

22 changes: 18 additions & 4 deletions docs/Performance and Tuning/ZFS Transaction Delay.rst
@@ -1,9 +1,19 @@
ZFS Transaction Delay
=====================
OpenZFS Transaction Delay
=========================

ZFS write operations are delayed when the backend storage isn't able to
OpenZFS write operations are delayed when the backend storage isn't able to
accommodate the rate of incoming writes. This delay process is known as
the ZFS write throttle.
the OpenZFS write throttle. Because different hardware and workloads have different
performance characteristics, tuning the write throttle is hardware- and
workload-specific.

Writes are grouped into transactions. Transactions are grouped into transaction groups.
When a transaction group is synced to disk, all transactions in that group are
considered complete. When a delay is applied to a transaction, it delays the transaction's
assignment to a transaction group.

To check whether the write throttle is being activated on a pool, monitor the dmu_tx_delay
and/or dmu_tx_dirty_delay kstat counters.
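
On Linux these counters are exposed through the kstat interface; a hedged sketch of checking them::

   grep -E 'dmu_tx_delay|dmu_tx_dirty_delay' /proc/spl/kstat/zfs/dmu_tx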

If there is already a write transaction waiting, the delay is relative
to when that transaction will finish waiting. Thus the calculated delay
@@ -103,3 +113,7 @@ system should be to keep the amount of dirty data out of that range by
first ensuring that the appropriate limits are set for the I/O scheduler
to reach optimal throughput on the backend storage, and then by changing
the value of zfs_delay_scale to increase the steepness of the curve.
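
On Linux, zfs_delay_scale can be changed at runtime via the module parameter interface (a hedged sketch; 1000000, twice the default, is only an illustrative value)::

   echo 1000000 > /sys/module/zfs/parameters/zfs_delay_scale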



Code reference: `dmu_tx.c <https://github.com/openzfs/zfs/blob/master/module/zfs/dmu_tx.c#L866>`_