2 changes: 0 additions & 2 deletions docs/Basic Concepts/RAIDZ.rst
@@ -1,8 +1,6 @@
RAIDZ
=====

tl;dr: RAIDZ is effective for large block sizes and sequential workloads.

Introduction
~~~~~~~~~~~~

87 changes: 87 additions & 0 deletions docs/Basic Concepts/VDEVs.rst
@@ -0,0 +1,87 @@
What is a VDEV?
===============

A vdev (virtual device) is a fundamental building block of ZFS storage pools. It represents a logical grouping of physical storage devices, such as hard drives, SSDs, or partitions.

What is a leaf vdev?
====================

A leaf vdev is the most basic type of vdev; it corresponds directly to a physical storage device and is the endpoint of the storage hierarchy in ZFS.

What is a top-level vdev?
=========================

Top-level vdevs are the direct children of the root vdev. They can be single devices or logical groups that aggregate multiple leaf vdevs (like mirrors or RAIDZ groups). ZFS dynamically stripes data across all top-level vdevs in a pool.

What is a root vdev?
====================

The root vdev is the top of the pool hierarchy. It aggregates all top-level vdevs into a single logical storage unit (the pool).

What are the different types of vdevs?
======================================

OpenZFS supports several types of vdevs. Top-level vdevs carry data and provide redundancy:

* **Striped Disk(s)**: A vdev consisting of one or more physical devices striped together (like RAID 0). It provides no redundancy and will lead to data loss if a drive fails.
* **Mirror**: A vdev that stores the same data on two or more drives for redundancy.
* `RAIDZ <./RAIDZ.html>`__: A vdev that uses parity to provide fault tolerance, similar to traditional RAID 5/6. There are three levels of RAIDZ:

  * **RAIDZ1**: Single parity, similar to RAID 5. Requires at least 2 disks (3+ recommended) and can tolerate one drive failure.
  * **RAIDZ2**: Double parity, similar to RAID 6. Requires at least 3 disks (5+ recommended) and can tolerate two drive failures.
  * **RAIDZ3**: Triple parity. Requires at least 4 disks (7+ recommended) and can tolerate three drive failures.

* `dRAID <./dRAID%20Howto.html>`__: Distributed RAID. A vdev that provides distributed parity and hot spares across multiple drives, allowing for much faster rebuild performance after a failure.

Auxiliary vdevs provide specific functionality:

* **Spare**: A drive that acts as a hot spare, automatically replacing a failed drive in another vdev.
* **Cache (L2ARC)**: A Level 2 ARC vdev used for caching frequently accessed data to improve random read performance.
* **Log (SLOG)**: A separate log vdev (SLOG) used to store the ZFS Intent Log (ZIL) for improved synchronous write performance.
* **Special**: A vdev dedicated to storing metadata, and optionally small file blocks and the Dedup Table (DDT).
* **Dedup**: A vdev dedicated strictly to storing the Deduplication Table (DDT).
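
As a hedged illustration of how these vdev types are specified on the command line (the pool name tank and the device names are placeholders, not recommendations), a single `zpool-create(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-create.8.html>`_ invocation can combine data and auxiliary vdevs::

   # mirrored data vdev, mirrored log, cache device, mirrored special vdev and a hot spare
   zpool create tank \
         mirror sda sdb \
         log mirror sdc sdd \
         cache sde \
         special mirror sdf sdg \
         spare sdh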

How do vdevs relate to storage pools?
=====================================

Vdevs are the building blocks of ZFS storage pools. A storage pool (zpool) is created by combining one or more top-level vdevs. The overall performance, capacity, and redundancy of the storage pool depend on the configuration and types of vdevs used.

Here is an example layout as seen in `zpool-status(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-status.8.html>`_ output
for a pool with two RAIDZ1 top-level vdevs and 10 leaf vdevs:

::

   datapoolname (root vdev)
     raidz1-0 (top-level vdev)
       /dev/dsk/disk0 (leaf vdev)
       /dev/dsk/disk1 (leaf vdev)
       /dev/dsk/disk2 (leaf vdev)
       /dev/dsk/disk3 (leaf vdev)
       /dev/dsk/disk4 (leaf vdev)
     raidz1-1 (top-level vdev)
       /dev/dsk/disk5 (leaf vdev)
       /dev/dsk/disk6 (leaf vdev)
       /dev/dsk/disk7 (leaf vdev)
       /dev/dsk/disk8 (leaf vdev)
       /dev/dsk/disk9 (leaf vdev)
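
A pool with this layout could be created in one step (a sketch reusing the device paths from the example above)::

   zpool create datapoolname \
         raidz1 /dev/dsk/disk0 /dev/dsk/disk1 /dev/dsk/disk2 /dev/dsk/disk3 /dev/dsk/disk4 \
         raidz1 /dev/dsk/disk5 /dev/dsk/disk6 /dev/dsk/disk7 /dev/dsk/disk8 /dev/dsk/disk9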

How does ZFS handle vdev failures?
==================================

ZFS is designed to handle vdev failures gracefully. If a vdev fails, ZFS can continue to operate using the remaining vdevs in the pool,
provided that the redundancy level of the pool allows for it (e.g., in a mirror, RAIDZ, or dRAID configuration).
When there is still enough redundancy in the pool, ZFS will mark the failed vdev as "faulted" and reconstruct its data from the remaining vdevs.
Administrators can replace the failed vdev with a new one using `zpool-replace(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-replace.8.html>`_, and ZFS will automatically resilver (rebuild)
the data onto the new vdev to return the pool to a healthy state.
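
For example, with the layout shown earlier and assuming /dev/dsk/disk3 has faulted, replacing it with a hypothetical new device could look like this (a sketch, not a prescription)::

   zpool replace datapoolname /dev/dsk/disk3 /dev/dsk/disk10
   zpool status datapoolname    # watch resilver progress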

How do I manage vdevs in ZFS?
=============================

Vdevs are managed using the `zpool(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool.8.html>`_ command-line utility. Common operations include:

* **Creating a pool**: `zpool create` allows you to specify the vdev layout. See `zpool-create(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-create.8.html>`_.
* **Adding vdevs**: `zpool add` adds new top-level vdevs to an existing pool, expanding its capacity and performance (by increasing the stripe width). See `zpool-add(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-add.8.html>`_.
* **Removing vdevs**: `zpool remove` can remove certain types of top-level vdevs, evacuating their data to the remaining vdevs. See `zpool-remove(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-remove.8.html>`_.
* **Replacing drives**: `zpool replace` swaps a failed or undersized drive with a new one. See `zpool-replace(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-replace.8.html>`_.
* **Monitoring status**: `zpool status` shows the health and layout of all vdevs. See `zpool-status(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-status.8.html>`_.
* **Monitoring performance**: `zpool iostat` displays I/O statistics for the pool and individual vdevs. See `zpool-iostat(8) <https://openzfs.github.io/openzfs-docs/man/master/8/zpool-iostat.8.html>`_.
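
As a brief, hedged sketch (device names are hypothetical), adding a third RAIDZ1 top-level vdev to the example pool and checking the result could look like::

   zpool add datapoolname raidz1 /dev/dsk/disk10 /dev/dsk/disk11 /dev/dsk/disk12 /dev/dsk/disk13 /dev/dsk/disk14
   zpool status datapoolname
   zpool iostat -v datapoolname 5
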
36 changes: 0 additions & 36 deletions docs/Performance and Tuning/Async Write.rst

This file was deleted.

61 changes: 61 additions & 0 deletions docs/Performance and Tuning/Module Parameters.rst
Original file line number Diff line number Diff line change
@@ -488,6 +488,8 @@ resilver
- `zfs_top_maxinflight <#zfs-top-maxinflight>`__
- `zfs_vdev_scrub_max_active <#zfs-vdev-scrub-max-active>`__
- `zfs_vdev_scrub_min_active <#zfs-vdev-scrub-min-active>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

scrub
~~~~~
@@ -511,6 +513,8 @@ scrub
- `zfs_top_maxinflight <#zfs-top-maxinflight>`__
- `zfs_vdev_scrub_max_active <#zfs-vdev-scrub-max-active>`__
- `zfs_vdev_scrub_min_active <#zfs-vdev-scrub-min-active>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

send
~~~~
@@ -632,6 +636,8 @@ vdev
- `zfs_vdev_write_gap_limit <#zfs-vdev-write-gap-limit>`__
- `zio_dva_throttle_enabled <#zio-dva-throttle-enabled>`__
- `zio_slow_io_ms <#zio-slow-io-ms>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

vdev_cache
~~~~~~~~~~
@@ -644,6 +650,8 @@ vdev_initialize
~~~~~~~~~~~~~~~

- `zfs_initialize_value <#zfs-initialize-value>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

vdev_removal
~~~~~~~~~~~~
@@ -656,6 +664,8 @@ vdev_removal
- `zfs_removal_ignore_errors <#zfs-removal-ignore-errors>`__
- `zfs_removal_suspend_progress <#zfs-removal-suspend-progress>`__
- `vdev_removal_max_span <#vdev-removal-max-span>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__

volume
~~~~~~
@@ -736,6 +746,8 @@ ZIO_scheduler
- `zfs_vdev_trim_max_active <#zfs-vdev-trim-max-active>`__
- `zfs_vdev_trim_min_active <#zfs-vdev-trim-min-active>`__
- `zfs_vdev_write_gap_limit <#zfs-vdev-write-gap-limit>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
- `zio_dva_throttle_enabled <#zio-dva-throttle-enabled>`__
- `zio_requeue_io_start_cut_in_line <#zio-requeue-io-start-cut-in-line>`__
- `zio_taskq_batch_pct <#zio-taskq-batch-pct>`__
@@ -979,6 +991,8 @@ Index
- `zfs_vdev_mirror_rotating_seek_inc <#zfs-vdev-mirror-rotating-seek-inc>`__
- `zfs_vdev_mirror_rotating_seek_offset <#zfs-vdev-mirror-rotating-seek-offset>`__
- `zfs_vdev_ms_count_limit <#zfs-vdev-ms-count-limit>`__
- `zfs_vdev_nia_credit <#zfs-vdev-nia-credit>`__
- `zfs_vdev_nia_delay <#zfs-vdev-nia-delay>`__
- `zfs_vdev_queue_depth_pct <#zfs-vdev-queue-depth-pct>`__
- `zfs_vdev_raidz_impl <#zfs-vdev-raidz-impl>`__
- `zfs_vdev_read_gap_limit <#zfs-vdev-read-gap-limit>`__
@@ -4009,6 +4023,53 @@ See also `zio_dva_throttle_enabled <#zio-dva-throttle-enabled>`__
| Versions Affected | v0.7.0 and later |
+--------------------------+------------------------------------------+

zfs_vdev_nia_delay
~~~~~~~~~~~~~~~~~~

For non-interactive I/O (scrub, resilver, removal, initialize and rebuild),
the number of concurrently-active I/O operations is limited to zfs_*_min_active,
unless the vdev is "idle". When there are no interactive I/O operations active
(synchronous or otherwise), and zfs_vdev_nia_delay operations have completed
since the last interactive operation, then the vdev is considered to be "idle",
and the number of concurrently-active non-interactive operations is increased to
zfs_*_max_active. See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__.

====================== ====================
zfs_vdev_nia_delay     Notes
====================== ====================
Tags                   `vdev <#vdev>`__, `scrub <#scrub>`__, `resilver <#resilver>`__
When to change         See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__
Data Type              uint
Range                  1 to UINT_MAX
Default                5
Change                 Dynamic
Versions Affected      v2.0.0 and later
====================== ====================
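
On Linux, the current value can be read or changed at runtime through the module parameter interface (a hedged sketch; the value 10 is purely illustrative)::

   cat /sys/module/zfs/parameters/zfs_vdev_nia_delay
   echo 10 > /sys/module/zfs/parameters/zfs_vdev_nia_delay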

zfs_vdev_nia_credit
~~~~~~~~~~~~~~~~~~~

Some HDDs prioritize sequential I/O so strongly that concurrent
random I/O latency can reach several seconds. On some HDDs this happens even
if sequential I/O operations are submitted one at a time, so setting
zfs_*_max_active=1 does not help. To prevent non-interactive I/O, such as scrub,
from monopolizing the device, no more than zfs_vdev_nia_credit operations
can be sent while there are outstanding incomplete interactive operations.
This enforced wait ensures the HDD services the interactive I/O within a
reasonable amount of time. See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__.

====================== ====================
zfs_vdev_nia_credit    Notes
====================== ====================
Tags                   `vdev <#vdev>`__, `scrub <#scrub>`__, `resilver <#resilver>`__
When to change         See `ZIO SCHEDULER <https://openzfs.github.io/openzfs-docs/Performance%20and%20Tuning/ZIO%20Scheduler.html>`__
Data Type              uint
Range                  1 to UINT_MAX
Default                5
Change                 Dynamic
Versions Affected      v2.0.0 and later
====================== ====================
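
To make a tuned value persist across reboots on Linux, it can also be set as a module option (a sketch; the value 10 is purely illustrative)::

   # /etc/modprobe.d/zfs.conf
   options zfs zfs_vdev_nia_credit=10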

zfs_disable_dup_eviction
~~~~~~~~~~~~~~~~~~~~~~~~

22 changes: 18 additions & 4 deletions docs/Performance and Tuning/ZFS Transaction Delay.rst
@@ -1,9 +1,19 @@
ZFS Transaction Delay
=====================
OpenZFS Transaction Delay
=========================

ZFS write operations are delayed when the backend storage isn't able to
OpenZFS write operations are delayed when the backend storage isn't able to
accommodate the rate of incoming writes. This delay process is known as
the ZFS write throttle.
the OpenZFS write throttle. Because different hardware and workloads have different
performance characteristics, tuning the write throttle is hardware- and
workload-specific.

Writes are grouped into transactions. Transactions are grouped into transaction groups.
When a transaction group is synced to disk, all transactions in that group are
considered complete. When a delay is applied to a transaction, it delays the transaction's
assignment to a transaction group.

To check whether the write throttle is being activated on a pool, monitor the dmu_tx_delay
and/or dmu_tx_dirty_delay kstat counters.
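
On Linux these counters are exposed through the kstat interface; a hedged sketch of checking them::

   grep -E 'dmu_tx_delay|dmu_tx_dirty_delay' /proc/spl/kstat/zfs/dmu_tx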

If there is already a write transaction waiting, the delay is relative
to when that transaction will finish waiting. Thus the calculated delay
@@ -103,3 +113,7 @@ system should be to keep the amount of dirty data out of that range by
first ensuring that the appropriate limits are set for the I/O scheduler
to reach optimal throughput on the backend storage, and then by changing
the value of zfs_delay_scale to increase the steepness of the curve.
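
On Linux, zfs_delay_scale can be changed at runtime via the module parameter interface (a hedged sketch; 1000000, twice the default, is only an illustrative value)::

   echo 1000000 > /sys/module/zfs/parameters/zfs_delay_scale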



Code reference: `dmu_tx.c <https://github.com/openzfs/zfs/blob/master/module/zfs/dmu_tx.c#L866>`_