
Commit 7526625

updated TX delay, ZIO scheduler, module params and add VDEV pages
1 parent 015070a commit 7526625

File tree

5 files changed: +277 / -47 lines

docs/Basic Concepts/RAIDZ.rst

Lines changed: 0 additions & 2 deletions
@@ -1,8 +1,6 @@
 RAIDZ
 =====
 
-tl;dr: RAIDZ is effective for large block sizes and sequential workloads.
-
 Introduction
 ~~~~~~~~~~~~

docs/Basic Concepts/VDEVs.rst

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
What is a VDEV?
===============

A vdev (virtual device) is a fundamental building block of ZFS storage pools. It represents a logical grouping of physical storage devices, such as hard drives, SSDs, or partitions.

What is a leaf vdev?
====================

A leaf vdev is the most basic type of vdev: it corresponds directly to a physical storage device and is the endpoint of the storage hierarchy in ZFS.

What is a top-level vdev?
=========================

Top-level vdevs are the direct children of the root vdev. They can be single devices or logical groups that aggregate multiple leaf vdevs (like mirrors or RAIDZ groups). ZFS dynamically stripes data across all top-level vdevs in a pool.

What is a root vdev?
====================

The root vdev is the top of the pool hierarchy. It aggregates all top-level vdevs into a single logical storage unit (the pool).

What are the different types of vdevs?
======================================

OpenZFS supports several types of vdevs. Top-level vdevs carry data and provide redundancy:

* **Striped Disk(s)**: A vdev consisting of one or more physical devices striped together (like RAID 0). It provides no redundancy, and a single drive failure will cause data loss.
* **Mirror**: A vdev that stores the same data on two or more drives for redundancy.
* `RAIDZ <https://openzfs.github.io/openzfs-docs/Basic%20Concepts/RAIDZ.html>`__: A vdev that uses parity to provide fault tolerance, similar to traditional RAID 5/6. There are three levels of RAIDZ:

  * **RAIDZ1**: Single parity, similar to RAID 5. Requires at least 2 disks (3+ recommended) and can tolerate one drive failure.
  * **RAIDZ2**: Double parity, similar to RAID 6. Requires at least 3 disks (5+ recommended) and can tolerate two drive failures.
  * **RAIDZ3**: Triple parity. Requires at least 4 disks (7+ recommended) and can tolerate three drive failures.

* `dRAID <https://openzfs.github.io/openzfs-docs/Basic%20Concepts/dRAID%20Howto.html>`__: Distributed RAID. A vdev that provides distributed parity and hot spares across multiple drives, allowing much faster rebuilds after a failure.

Auxiliary vdevs provide specific functionality (an example combining several vdev types follows this list):

* **Spare**: A drive that acts as a hot spare, automatically replacing a failed drive in another vdev.
* **Cache (L2ARC)**: A Level 2 ARC vdev used for caching frequently accessed data to improve random read performance.
* **Log (SLOG)**: A separate log vdev (SLOG) used to store the ZFS Intent Log (ZIL) for improved synchronous write performance.
* **Special**: A vdev dedicated to storing metadata, and optionally small file blocks and the Dedup Table (DDT).
* **Dedup**: A vdev dedicated strictly to storing the Deduplication Table (DDT).

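For illustration, a pool combining a mirrored data vdev with several auxiliary vdevs could be created as follows. This is only a sketch: the pool name `tank` and the device paths are placeholders, not recommendations.

::

  # Sketch only -- substitute real device paths before use.
  zpool create tank \
      mirror /dev/sdb /dev/sdc \
      log /dev/sdd \
      cache /dev/sde \
      spare /dev/sdf
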
How do vdevs relate to storage pools?
=====================================

Vdevs are the building blocks of ZFS storage pools. A storage pool (zpool) is created by combining one or more top-level vdevs. The overall performance, capacity, and redundancy of the storage pool depend on the configuration and types of vdevs used.

Here is an example layout as seen in `zpool-status(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-status.8.html>`_ output
for a pool with two RAIDZ1 top-level vdevs and 10 leaf vdevs:

::

  datapoolname (root vdev)
    raidz1-0 (top-level vdev)
      /dev/dsk/disk0 (leaf vdev)
      /dev/dsk/disk1 (leaf vdev)
      /dev/dsk/disk2 (leaf vdev)
      /dev/dsk/disk3 (leaf vdev)
      /dev/dsk/disk4 (leaf vdev)
    raidz1-1 (top-level vdev)
      /dev/dsk/disk5 (leaf vdev)
      /dev/dsk/disk6 (leaf vdev)
      /dev/dsk/disk7 (leaf vdev)
      /dev/dsk/disk8 (leaf vdev)
      /dev/dsk/disk9 (leaf vdev)

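A layout like this could be created with a single command. The following is a sketch using the same illustrative device paths as the listing above:

::

  # Sketch: create "datapoolname" from two 5-disk RAIDZ1 top-level vdevs
  zpool create datapoolname \
      raidz1 /dev/dsk/disk0 /dev/dsk/disk1 /dev/dsk/disk2 /dev/dsk/disk3 /dev/dsk/disk4 \
      raidz1 /dev/dsk/disk5 /dev/dsk/disk6 /dev/dsk/disk7 /dev/dsk/disk8 /dev/dsk/disk9
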
How does ZFS handle vdev failures?
==================================

ZFS is designed to handle vdev failures gracefully. If a vdev fails, ZFS can continue to operate using the remaining vdevs in the pool,
provided that the redundancy level of the pool allows for it (e.g., in a mirror, RAIDZ, or dRAID configuration).
When there is still enough redundancy in the pool, ZFS marks the failed vdev as "faulted" and recovers data from the remaining vdevs.
Administrators can replace the failed vdev with a new one using `zpool-replace(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-replace.8.html>`_, and ZFS will automatically resilver (rebuild)
the data onto the new vdev to return the pool to a healthy state.

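A sketch of that workflow, assuming the example pool above and an illustrative failed disk:

::

  # Inspect pool health; a failed leaf vdev is reported as FAULTED or UNAVAIL
  zpool status datapoolname

  # Replace the failed leaf vdev with a new device (paths are illustrative)
  zpool replace datapoolname /dev/dsk/disk3 /dev/dsk/disk10

  # Watch resilver progress until the pool returns to a healthy state
  zpool status -v datapoolname
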
How do I manage vdevs in ZFS?
=============================

Vdevs are managed using the `zpool(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool.8.html>`_ command-line utility. Common operations include the following (a short usage sketch follows the list):

* **Creating a pool**: `zpool create` allows you to specify the vdev layout. See `zpool-create(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-create.8.html>`_.
* **Adding vdevs**: `zpool add` attaches new top-level vdevs to an existing pool, expanding its capacity and performance (by increasing stripe width). See `zpool-add(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-add.8.html>`_.
* **Removing vdevs**: `zpool remove` can remove certain types of top-level vdevs, evacuating their data to other vdevs. See `zpool-remove(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-remove.8.html>`_.
* **Replacing drives**: `zpool replace` swaps a failed or smaller drive for a new one. See `zpool-replace(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-replace.8.html>`_.
* **Monitoring status**: `zpool status` shows the health and layout of all vdevs. See `zpool-status(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-status.8.html>`_.
* **Monitoring performance**: `zpool iostat` displays I/O statistics for the pool and individual vdevs. See `zpool-iostat(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-iostat.8.html>`_.

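For example (a sketch; the pool and device names are illustrative):

::

  # Expand the pool with a third RAIDZ1 top-level vdev
  zpool add datapoolname raidz1 /dev/dsk/disk10 /dev/dsk/disk11 /dev/dsk/disk12 \
      /dev/dsk/disk13 /dev/dsk/disk14

  # Show per-vdev health and layout
  zpool status datapoolname

  # Show per-vdev I/O statistics, refreshed every 5 seconds
  zpool iostat -v datapoolname 5
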
----

Last updated: November 24, 2025

docs/Performance and Tuning/Module Parameters.rst

Lines changed: 61 additions & 0 deletions
@@ -488,6 +488,8 @@ resilver
 - `zfs_top_maxinflight <#zfs-top-maxinflight>`__
 - `zfs_vdev_scrub_max_active <#zfs-vdev-scrub-max-active>`__
 - `zfs_vdev_scrub_min_active <#zfs-vdev-scrub-min-active>`__
+- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
+- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
 
 scrub
 ~~~~~
@@ -511,6 +513,8 @@ scrub
 - `zfs_top_maxinflight <#zfs-top-maxinflight>`__
 - `zfs_vdev_scrub_max_active <#zfs-vdev-scrub-max-active>`__
 - `zfs_vdev_scrub_min_active <#zfs-vdev-scrub-min-active>`__
+- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
+- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
 
 send
 ~~~~
@@ -632,6 +636,8 @@ vdev
 - `zfs_vdev_write_gap_limit <#zfs-vdev-write-gap-limit>`__
 - `zio_dva_throttle_enabled <#zio-dva-throttle-enabled>`__
 - `zio_slow_io_ms <#zio-slow-io-ms>`__
+- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
+- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
 
 vdev_cache
 ~~~~~~~~~~
@@ -644,6 +650,8 @@ vdev_initialize
 ~~~~~~~~~~~~~~~
 
 - `zfs_initialize_value <#zfs-initialize-value>`__
+- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
+- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
 
 vdev_removal
 ~~~~~~~~~~~~
@@ -656,6 +664,8 @@ vdev_removal
 - `zfs_removal_ignore_errors <#zfs-removal-ignore-errors>`__
 - `zfs_removal_suspend_progress <#zfs-removal-suspend-progress>`__
 - `vdev_removal_max_span <#vdev-removal-max-span>`__
+- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
+- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
 
 volume
 ~~~~~~
@@ -736,6 +746,8 @@ ZIO_scheduler
 - `zfs_vdev_trim_max_active <#zfs-vdev-trim-max-active>`__
 - `zfs_vdev_trim_min_active <#zfs-vdev-trim-min-active>`__
 - `zfs_vdev_write_gap_limit <#zfs-vdev-write-gap-limit>`__
+- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
+- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
 - `zio_dva_throttle_enabled <#zio-dva-throttle-enabled>`__
 - `zio_requeue_io_start_cut_in_line <#zio-requeue-io-start-cut-in-line>`__
 - `zio_taskq_batch_pct <#zio-taskq-batch-pct>`__
@@ -979,6 +991,8 @@ Index
 - `zfs_vdev_mirror_rotating_seek_inc <#zfs-vdev-mirror-rotating-seek-inc>`__
 - `zfs_vdev_mirror_rotating_seek_offset <#zfs-vdev-mirror-rotating-seek-offset>`__
 - `zfs_vdev_ms_count_limit <#zfs-vdev-ms-count-limit>`__
+- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
+- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
 - `zfs_vdev_queue_depth_pct <#zfs-vdev-queue-depth-pct>`__
 - `zfs_vdev_raidz_impl <#zfs-vdev-raidz-impl>`__
 - `zfs_vdev_read_gap_limit <#zfs-vdev-read-gap-limit>`__
@@ -4009,6 +4023,53 @@ See also `zio_dva_throttle_enabled <#zio-dva-throttle-enabled>`__
 | Versions Affected        | v0.7.0 and later                         |
 +--------------------------+------------------------------------------+
 
+zfs_vdev_nia_delay
+~~~~~~~~~~~~~~~~~~
+
+For non-interactive I/O (scrub, resilver, removal, initialize, and rebuild),
+the number of concurrently-active I/O operations is limited to zfs_*_min_active,
+unless the vdev is "idle". When there are no interactive I/O operations active
+(synchronous or otherwise), and zfs_vdev_nia_delay operations have completed
+since the last interactive operation, the vdev is considered to be "idle",
+and the number of concurrently-active non-interactive operations is increased to
+zfs_*_max_active. See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__.
+
+====================== ====================
+zfs_vdev_nia_delay     Notes
+====================== ====================
+Tags                   `vdev <#vdev>`__, `scrub <#scrub>`__, `resilver <#resilver>`__
+When to change         See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__
+Data Type              uint
+Range                  1 to UINT_MAX
+Default                5
+Change                 Dynamic
+Versions Affected      v2.0.0 and later
+====================== ====================
+
+zfs_vdev_nia_credit
+~~~~~~~~~~~~~~~~~~~
+
+Some HDDs tend to prioritize sequential I/O so strongly that concurrent
+random I/O latency reaches several seconds. On some HDDs this happens even
+if sequential I/O operations are submitted one at a time, so setting
+zfs_*_max_active=1 does not help. To prevent non-interactive I/O, like scrub,
+from monopolizing the device, no more than zfs_vdev_nia_credit operations
+can be sent while there are outstanding incomplete interactive operations.
+This enforced wait ensures the HDD services the interactive I/O within a
+reasonable amount of time. See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__.
+
+====================== ====================
+zfs_vdev_nia_credit    Notes
+====================== ====================
+Tags                   `vdev <#vdev>`__, `scrub <#scrub>`__, `resilver <#resilver>`__
+When to change         See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__
+Data Type              uint
+Range                  1 to UINT_MAX
+Default                5
+Change                 Dynamic
+Versions Affected      v2.0.0 and later
+====================== ====================
+
 zfs_disable_dup_eviction
 ~~~~~~~~~~~~~~~~~~~~~~~~
 
docs/Performance and Tuning/ZFS Transaction Delay.rst

Lines changed: 20 additions & 4 deletions
@@ -1,9 +1,19 @@
-ZFS Transaction Delay
-=====================
+OpenZFS Transaction Delay
+=========================
 
-ZFS write operations are delayed when the backend storage isn't able to
+OpenZFS write operations are delayed when the backend storage isn't able to
 accommodate the rate of incoming writes. This delay process is known as
-the ZFS write throttle.
+the OpenZFS write throttle. As different hardware and workloads have different
+performance characteristics, tuning the write throttle is hardware- and
+workload-specific.
+
+Writes are grouped into transactions, and transactions are grouped into transaction groups.
+When a transaction group is synced to disk, all transactions in that group are
+considered complete. When a delay is applied to a transaction, it delays the transaction's
+assignment to a transaction group.
+
+To check whether the write throttle is active on a pool, monitor the dmu_tx_delay
+and/or dmu_tx_dirty_delay kstat counters.
 
 If there is already a write transaction waiting, the delay is relative
 to when that transaction will finish waiting. Thus the calculated delay
@@ -103,3 +113,9 @@ system should be to keep the amount of dirty data out of that range by
 first ensuring that the appropriate limits are set for the I/O scheduler
 to reach optimal throughput on the backend storage, and then by changing
 the value of zfs_delay_scale to increase the steepness of the curve.
+
+----
+
+Code reference: `dmu_tx.c <https://github.com/openzfs/zfs/blob/master/module/zfs/dmu_tx.c#L866>`_
+
+Last updated: October 28, 2025
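The added text refers to the dmu_tx_delay and dmu_tx_dirty_delay kstat counters. A minimal sketch for watching them, assuming the Linux kstat layout under /proc/spl/kstat/zfs:

::

  # Non-zero, growing values indicate the write throttle is engaging
  grep -E 'dmu_tx_delay|dmu_tx_dirty_delay' /proc/spl/kstat/zfs/dmu_tx

  # Re-check after running the workload for a while to see whether they grow
  sleep 60
  grep -E 'dmu_tx_delay|dmu_tx_dirty_delay' /proc/spl/kstat/zfs/dmu_tx
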
