diff --git a/src/core_ondisk.md b/src/core_ondisk.md deleted file mode 100644 index d99cd61..0000000 --- a/src/core_ondisk.md +++ /dev/null @@ -1,228 +0,0 @@ -# Core on-disk format - -## Overview - -The EROFS core on-disk format is designed to be **as simple as possible**, since -one of the basic use cases of EROFS is as a drop-in replacement for -[tar](https://pubs.opengroup.org/onlinepubs/007908799/xcu/tar.html) or -[cpio](https://pubs.opengroup.org/onlinepubs/007908799/xcu/cpio.html): - -![EROFS core on-disk format](_static/erofs_core_format.svg) - -The format design principles are as follows: - - - Data (except for _inline data_) is always block-based; metadata is not strictly block-based. - - - There are **no centralized inode or directory tables**. These are not - suitable for image incremental updates, metadata flexibility, and - extensibility. It is up to users to determine whether inodes or directories - are arranged one by one or not. - - - I/O amplification from **extra metadata access** should be as small as - possible. - -There are _only **three** on-disk components to form a full filesystem tree_: -`erofs_super_block`, `erofs_inode_{compact,extended}`, and `erofs_dirent`. -If [extended attribute](https://man7.org/linux/man-pages/man7/xattr.7.html) -support also needs to be considered, the additional components will still be -limited. - -Note that only `erofs_super_block` needs to be kept at a fixed offset, as -mentioned below. - -(on_disk_superblock)= -## Superblock - -EROFS superblock is currently 128 bytes in size, which records various -information about the enclosing filesystem. The superblock will start at an -absolute offset of 1024 bytes, where the first 1024 bytes are currently -unused. This will allow for support of other advanced formats based on -EROFS filesystem, as well as the installation of x86 boot sectors and other oddities. 
- -The EROFS superblock is laid out as follows in [`struct erofs_super_block`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/erofs_fs.h?h=v6.6#n55): - -| Offset | Size | Name | Description | -| ------ | ------ | --------------------- | -------------------------------------------------------------------------------------------------- | -| 0x0 | __le32 | magic | Magic signature, 0xE0F5E1E2 | -| 0x4 | __le32 | checksum | Superblock `crc32c` checksum | -| 0x8 | __le32 | feature_compat | Compatible feature flags. The kernel can still read the fs even if it doesn't understand a flag | -| 0xC | __u8 | blkszbits | Block size is 2{sup}`blkszbits`. It should be no less than 9 (512-byte block size) | -| 0xD | __u8 | sb_extslots | The total superblock size is 128 + `sb_extslots` * 16. It should be 0 for future expansion | -| 0xE | __le16 | root_nid | NID (node number) of the root directory | -| 0x10 | __le64 | inos | Total valid inode count | -| 0x18 | __le64 | build_time | Filesystem creation date, in seconds since the epoch | -| 0x20 | __le32 | build_time_ns | Nanoseconds component of the above (`build_time`) timestamp | -| 0x24 | __le32 | blocks | Total block count | -| 0x28 | __le32 | meta_blkaddr | Start block address of metadata area | -| 0x2C | __le32 | xattr_blkaddr | Start block address of shared xattr area | -| 0x30 | __u8 | uuid[16] | 128-bit UUID for volume | -| 0x40 | __u8 | volume_name[16] | Filesystem label | -| 0x50 | __le32 | feature_incompat | Incompatible feature flags. The kernel will refuse to mount if it doesn't understand a flag | -| 0x54 | __le16 | available_compr_algs | Bitmap for compression algorithms used in this image (FEATURE_INCOMPAT_COMPR_CFGS is set) | -| 0x54 | __le16 | lz4_max_distance | Customized LZ4 window size. 0 means the default value (FEATURE_INCOMPAT_COMPR_CFGS isn't set) | -| 0x56 | __le16 | extra_devices | Number of external devices. 
0 means no extra device | -| 0x58 | __le16 | devt_slotoff | (Indicate the start address of the external device table) | -| 0x5A | __u8 | dirblkbits | Directory block size is 2{sup}`blkszbits + dirblkbits`. Always 0 for now | -| 0x5B | __u8 | xattr_prefix_count | Total number of long xattr name prefixes | -| 0x5C | __le32 | xattr_prefix_start | (Indicate the start address of long xattr prefixes) | -| 0x60 | __le64 | packed_nid | NID of the special packed inode, which is mainly used to keep fragments for now | -| 0x68 | __u8 | xattr_filter_reserved | Always 0 for reserved use | -| 0x69 | __u8 | reserved[23] | Reserved | - -### Verify superblock checksum - -The CRC32-C checksum is calculated from the first byte of the superblock -(offset `1024`) to the end of the filesystem block, the `checksum` field should -be filled with zero. - -The filesystem block size is defined in `blkszbits`. For block size is larger -than 1024 bytes, the first 1 KiB will be skipped (since the superblock offset -is `1024`). This approach allows some use-cases which may contain user-defined -contents, such as MBR, boot sector or others. - -> For example, when `blkszbits` is 12 (block size is 4KiB): -> -> | Offset | Size | Description | Checksum covered | -> |--------|------|------------------------------------------------|------------------| -> | 0 | 1024 | Padding | No | -> | 1024 | 4 | Magic number | Yes | -> | 1028 | 4 | Checksum field in superblock, filled with zero | Yes | -> | 1032 | 3064 | Remain bytes in the filesystem block | Yes | - -> Tips: Some implementations (e.g., java.util.zip.CRC32C) apply a final -> bit-wise inversion. If the superblock checksum doesn't match, -> try inverting it. - -## Inodes - -Each valid on-disk inode should be aligned to a fixed inode slot (32-byte) -boundary, which is set to be kept in line with the compact inode size. 
- -Each inode can be directly located using the following formula: -``` - inode absolute offset = meta_blkaddr * block_size + 32 * NID -``` - -Valid inode sizes are either 32 or 64 bytes, which can be distinguished from -a common field that all inode versions have -- `i_format`: - -32-byte compact inodes are defined as -[`struct erofs_inode_compact`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/erofs_fs.h?h=v6.6#n156) -as below: - -| Offset | Size | Name | Description | -| ------ | ------ | -------------- | ---------------------------------------------------------------------- | -| 0x0 | __le16 | i_format | Inode format hints (e.g. on-disk inode version, datalayout, etc.) | -| 0x2 | __le16 | i_xattr_icount | Indicate the extended attribute metadata size of this inode | -| 0x4 | __le16 | i_mode | File mode | -| 0x6 | __le16 | i_nlink | Hard link count | -| 0x8 | __le32 | i_size | Inode size in bytes | -| 0xC | __u8 | i_reserved[4] | Reserved | -| 0x10 | __u8 | i_u[4] | (Up to the specific inode datalayout) | -| 0x14 | __le32 | i_ino | Inode incremental number, mainly used for 32-bit stat(2) compatibility | -| 0x18 | __le16 | i_uid | Owner UID | -| 0x1A | __le16 | i_gid | Owner GID | -| 0x1C | __u8 | i_reserved2[4] | Reserved | - -64-byte extended inodes are defined as -[`struct erofs_inode_extended`](https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git/tree/fs/erofs/erofs_fs.h?h=v6.6#n174) -as below: - -| Offset | Size | Name | Description | -| ------ | ------ | --------------- | ----------------------------------------------------------------------------------- | -| 0x0 | __le16 | i_format | Inode format hints (e.g. on-disk inode version, datalayout, etc.) 
| -| 0x2 | __le16 | i_xattr_icount | Indicate the extended attribute metadata size of this inode | -| 0x4 | __le16 | i_mode | File mode | -| 0x6 | __u8 | i_reserved[4] | Reserved | -| 0x8 | __le64 | i_size | Inode size in bytes | -| 0x10 | __u8 | i_u[4] | (Up to the specific inode datalayout) | -| 0x14 | __le32 | i_ino | Inode incremental number, mainly used for 32-bit stat(2) compatibility | -| 0x18 | __le32 | i_uid | Owner UID | -| 0x1C | __le32 | i_gid | Owner GID | -| 0x20 | __le64 | i_mtime | Inode timestamp derived from the original `mtime`, in seconds since the UNIX epoch | -| 0x28 | __le32 | i_mtime_nsec | This provides nanosecond precision | -| 0x2C | __le32 | i_nlink | Hard link count | -| 0x30 | __u8 | i_reserved2[16] | Reserved | - -`inode.i_format` contains format hints for each inode as below: - -| | Bits | Description | -| - | ---- | ------------------------------------------------------------ | -| 0 | 1 | Inode version (0 - compact; 1 - extended) | -| 1 | 3 | Inode data layout (0-4 are valid; 5-7 are reserved for now) | - -## Inode data layouts - -There are **five** valid data layouts in total for each inode to indicate how -inode data is recorded on disk. Only **three** values are taken into account in -the EROFS core on-disk format: - - - `EROFS_INODE_FLAT_PLAIN (0)`: - - The consecutive physical blocks contain the entirety of the inode's content -with the starting block address stored in `inode.i_u`. - - - `EROFS_INODE_FLAT_INLINE (2)`: - - Except for the tail data block, all consecutive physical blocks hold the -entire content of the inode with the starting block address stored in -`inode.i_u`. The tail block is kept within the block immediately following the -on-disk inode metadata. If there are no blocks other than the tail inlined -block, the value in `inode.i_u` (now treated as a "don't care" field) -will be ignored at runtime. - - :::{note} - This layout is not allowed if the tail inode data block cannot be inlined. 
- ::: - - - `EROFS_INODE_CHUNK_BASED (4)`: - - The entire inode is split into several fixed-size chunks. Each chunk has -consecutive physical blocks. - -## Directories - -All on-disk directories are now organized in the form of `directory blocks`. - -Each directory block is split into two variable-size parts (`directory entries` -and `filenames`) in order to make random lookups work. All directory entries -(including `.` and `..` ) are _strictly_ recorded in alphabetical order to -enable the improved prefix binary search algorithm. - -Each directory entry is defined as 12-byte -[`struct erofs_dirent`](https://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs.git/tree/fs/erofs/erofs_fs.h?h=v6.6#n276): - -| Offset | Size | Name | Description | -| ------ | ------ | -------------- | -------------------------------------------------------------------------------------------------------------- | -| 0x0 | __le64 | nid | Node number of the inode that this directory entry points to | -| 0x8 | __le16 | nameoff | Start offset of the file name in this directory block | -| 0xA | __u8 | file_type | [File type code](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fs_types.c?h=v6.6) | -| 0xB | __u8 | reserved | Reserved | - -Note that _nameoff{sub}`0`_ (`nameoff` of the 1st directory entry) also -indicates the total number of directory entries in this directory block. - -File names are not null-terminated (`\0`): For each directory block, if the last -file name doesn't reach up to the end of the block, the remaining bytes must be -filled with `0x00`. - -:::{note} - -Other alternative forms (e.g., `Eytzinger order`) were also considered (that is -why we once had [.*_classic](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/namei.c?h=v5.4#n90) -naming). 
Here are some reasons that those forms weren't supported: - - Filenames are variable-sized strings, which makes `Eytzinger order` harder -to be utilized unless `namehash` is also introduced, but it also complicates -the overall implementation and expands the directory sizes; - - Also, it makes it harder to keep filenames and directory entries in the same -directory block (especially _large directories_) to minimize I/O amplification; - - readdir(3) will be impacted too if we'd like to keep alphabetical order -strictly. - -If there are some better ideas to resolve these, the on-disk definition could be -updated in the future. - -::: diff --git a/src/design.md b/src/design.md index 5d336b2..2f59696 100644 --- a/src/design.md +++ b/src/design.md @@ -119,6 +119,6 @@ or user space tools like `ureadahead`, so there is no need to bother the kernel. ```{toctree} :hidden: -core_ondisk.md +ondisk/index merging.md ``` diff --git a/src/merging.md b/src/merging.md index e8c1acc..e86f026 100644 --- a/src/merging.md +++ b/src/merging.md @@ -8,7 +8,7 @@ in one go. ## Single device Unlike other kernel filesystems which have inflexible layouts, EROFS has only -one fixed-offset on_disk part: [the superblock](#on_disk_superblock). It's +one fixed-offset on_disk part: {ref}`the superblock <on_disk_superblock>`. It's quite easy to compose external payloads such as binary formats like [tarballs](https://pubs.opengroup.org/onlinepubs/007908799/xcu/tar.html) or other filesystems (e.g., EROFS, EXT4, XFS, etc.) 
just with **linear physical diff --git a/src/ondisk/core_ondisk.md b/src/ondisk/core_ondisk.md new file mode 100644 index 0000000..a3883a6 --- /dev/null +++ b/src/ondisk/core_ondisk.md @@ -0,0 +1,318 @@ +# Core On-disk Format + +## Overview + +The EROFS core on-disk format is designed to be **as simple as possible**, since +one of the basic use cases of EROFS is as a drop-in replacement for +[tar](https://pubs.opengroup.org/onlinepubs/007908799/xcu/tar.html) or +[cpio](https://pubs.opengroup.org/onlinepubs/007908799/xcu/cpio.html): + +![EROFS core on-disk format](../_static/erofs_core_format.svg) + +The format design principles are as follows: + + - Data (except for _inline data_) is always block-based; metadata is not strictly block-based. + + - There are **no centralized inode or directory tables**. These are not + suitable for image incremental updates, metadata flexibility, and + extensibility. It is up to users to determine whether inodes or directories + are arranged one by one or not. + + - I/O amplification from **extra metadata access** should be as small as + possible. + +There are _only **three** on-disk components to form a full filesystem tree_: +superblock, inodes, and directory entries. + +Note that only the superblock needs to be kept at a fixed offset, as mentioned below. + +### Conformance to Core Format + +An EROFS image conforms to the core on-disk format if and only if **all** of the +following conditions are met: + +1. The `is_compressed` field (offset 0x54, 2 bytes) in the superblock is **0**. +2. All bits in `feature_compat` and `feature_incompat`, except those listed in + the [Feature Flags](#feature-flags) section below, are **0**. + +An image that does not meet these conditions uses one or more optional features +described in separate feature-specific documents. + +(on_disk_superblock)= +## Superblock + +The EROFS superblock is located at a fixed absolute offset of **1024 bytes**. +Its base size is 128 bytes. 
When `sb_extslots` is non-zero, the total superblock +size is `128 + sb_extslots * 16` bytes. The first 1024 bytes are currently unused, +which allows for support of other advanced formats based on EROFS, as well as +the installation of x86 boot sectors and other oddities. + +### Field Definitions + +| Offset | Size | Type | Name | Description | +|--------|------|--------|--------------------------|-------------| +| 0x00 | 4 | `u32` | `magic` | Magic signature: `0xE0F5E1E2` | +| 0x04 | 4 | `u32` | `checksum` | CRC32-C checksum of the superblock block; see {ref}`superblock-checksum` | +| 0x08 | 4 | `u32` | `feature_compat` | Compatible feature flags; see {ref}`feature-flags` | +| 0x0C | 1 | `u8` | `blkszbits` | Block size = `2^blkszbits`; minimum 9 | +| 0x0D | 1 | `u8` | `sb_extslots` | Number of 16-byte superblock extension slots | +| 0x0E | 2 | `u16` | `rootnid` | Root directory NID | +| 0x10 | 8 | `u64` | `inos` | Total valid inode count | +| 0x18 | 8 | `u64` | `build_time` | Filesystem creation time, seconds since UNIX epoch | +| 0x20 | 4 | `u32` | `build_time_nsec` | Nanoseconds component of `build_time` | +| 0x24 | 4 | `u32` | `blocks` | Total filesystem block count | +| 0x28 | 4 | `u32` | `meta_blkaddr` | Start block address of the metadata area | +| 0x2C | 4 | `u32` | `reserved` | Feature-specific; not described in core format | +| 0x30 | 16 | `u8[]` | `uuid` | 128-bit UUID for the volume | +| 0x40 | 16 | `u8[]` | `volume_name` | Filesystem label (not null-terminated if 16 bytes) | +| 0x50 | 4 | `u32` | `feature_incompat` | Incompatible feature flags; see {ref}`feature-flags` | +| 0x54 | 2 | `u16` | `is_compressed` | 0 for non-compressed images, any non-zero value for compressed images | +| 0x56 | 4 | `u32` | `reserved` | Feature-specific; not described in core format | +| 0x5A | 1 | `u8` | `dirblkbits` | Directory block size = `2^(blkszbits + dirblkbits)`; currently always 0 | +| 0x5B | 37 | `u8[]` | `reserved` | Feature-specific; not described in core 
format | + +### Magic Number + +The magic number at offset 0x00 must be `0xE0F5E1E2` (little-endian). A reader must +reject any image whose first four bytes at offset 1024 do not match this value. + +(superblock-checksum)= +### Superblock Checksum + +When `EROFS_FEATURE_COMPAT_SB_CHKSUM` is set, the `checksum` field contains a +CRC32-C digest. The digest is computed from byte offset 1024 to the end of the +filesystem block that contains the superblock (that is, `[1024, block_size)` when the +block size is larger than 1024 bytes), with the four bytes of the `checksum` field +itself treated as zero during computation. + +> For example, when `blkszbits` is 12 (block size is 4 KiB): +> +> | Offset | Size | Description | Checksum covered | +> |--------|------|------------------------------------------------|------------------| +> | 0 | 1024 | Padding | No | +> | 1024 | 4 | Magic number | Yes | +> | 1028 | 4 | Checksum field in superblock, filled with zero | Yes | +> | 1032 | 3064 | Remaining bytes in the filesystem block | Yes | + +> **Tip:** Some implementations (e.g., `java.util.zip.CRC32C`) apply a final +> bit-wise inversion. If the superblock checksum does not match, try inverting it. + +(feature-flags)= +### Feature Flags + +#### `feature_compat` — Compatible Feature Flags + +A mount implementation that does not recognise a bit in `feature_compat` may still +mount the filesystem without loss of correctness. + +| Bit mask | Name | Description | +|--------------|---------------------------------------------|-------------| +| `0x00000001` | `EROFS_FEATURE_COMPAT_SB_CHKSUM` | Superblock CRC32-C checksum is present; see {ref}`superblock-checksum` | +| `0x00000002` | `EROFS_FEATURE_COMPAT_MTIME` | Per-inode mtime is stored in extended inodes | + +#### `feature_incompat` — Incompatible Feature Flags + +A mount implementation that does not recognise any bit in `feature_incompat` must +refuse to mount the filesystem. + +The core on-disk format defines no incompatible feature flags. A non-zero +`feature_incompat` value indicates one or more optional extensions. 
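The superblock rules above (field offsets, magic check, conformance conditions, and checksum) can be sketched in a few lines of Python. This is only an illustration, not the kernel's or `erofs-utils`' implementation: the helper names (`parse_superblock`, `is_core_format`, `checksum_matches`) and the dictionary keys for the reserved fields are hypothetical. Following the worked 4 KiB example, the checksummed range is assumed to end with the filesystem block that holds the superblock. Following the tip, the stored value is accepted either with or without the final bit-wise inversion.

```python
import struct

EROFS_SUPER_OFFSET = 1024
EROFS_MAGIC = 0xE0F5E1E2
# feature_compat bits listed in the core format: SB_CHKSUM | MTIME
CORE_COMPAT_MASK = 0x00000001 | 0x00000002

# "<" means little-endian with no implicit padding; "37x" skips the
# trailing reserved bytes so the struct is exactly 128 bytes long.
SB_STRUCT = struct.Struct("<IIIBBHQQIIII16s16sIHIB37x")
SB_FIELDS = (
    "magic", "checksum", "feature_compat", "blkszbits", "sb_extslots",
    "rootnid", "inos", "build_time", "build_time_nsec", "blocks",
    "meta_blkaddr", "reserved_0x2c", "uuid", "volume_name",
    "feature_incompat", "is_compressed", "reserved_0x56", "dirblkbits",
)


def parse_superblock(image: bytes) -> dict:
    """Decode the 128-byte base superblock found at offset 1024."""
    raw = image[EROFS_SUPER_OFFSET:EROFS_SUPER_OFFSET + SB_STRUCT.size]
    if len(raw) < SB_STRUCT.size:
        raise ValueError("image is too small to hold a superblock")
    sb = dict(zip(SB_FIELDS, SB_STRUCT.unpack(raw)))
    if sb["magic"] != EROFS_MAGIC:
        raise ValueError("bad magic number")
    if sb["blkszbits"] < 9:
        raise ValueError("blkszbits below the minimum of 9")
    return sb


def is_core_format(sb: dict) -> bool:
    """Apply the two conformance conditions of the core format."""
    return (
        sb["is_compressed"] == 0
        and sb["feature_incompat"] == 0
        and (sb["feature_compat"] & ~CORE_COMPAT_MASK) == 0
    )


def crc32c(data: bytes) -> int:
    """Bitwise CRC32-C (Castagnoli) with the conventional final inversion."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def checksum_matches(image: bytes, sb: dict) -> bool:
    """Recompute the digest with the checksum field zeroed and compare."""
    block_size = 1 << sb["blkszbits"]
    # Assumption: coverage ends with the filesystem block holding the superblock.
    end = (EROFS_SUPER_OFFSET // block_size + 1) * block_size
    if len(image) < end:
        raise ValueError("image is shorter than the checksummed range")
    covered = bytearray(image[EROFS_SUPER_OFFSET:end])
    covered[4:8] = bytes(4)  # the checksum field itself counts as zero
    digest = crc32c(bytes(covered))
    # Following the tip above, accept the digest with or without inversion.
    return sb["checksum"] in (digest, digest ^ 0xFFFFFFFF)
```

A reader would typically call `parse_superblock()` first, then `checksum_matches()` only when bit `0x00000001` of `feature_compat` is set, and finally `is_core_format()` to decide whether optional feature documents need to be consulted. The bitwise CRC loop is slow but dependency-free; a table-driven or hardware-accelerated routine would be the usual choice for real images.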
+ +(on_disk_inodes)= +## Inodes + +Each on-disk inode must be aligned to a **32-byte inode slot** boundary, which is +set to be kept in line with the compact inode size. Given a NID `nid`, its inode can +be located in O(1) time by computing the absolute byte offset as follows: + +``` +inode_offset = meta_blkaddr * block_size + 32 * nid +``` + +The NIDs for the root directory and special-purpose inodes are stored in the +superblock. Valid inode sizes are either **32 bytes** (compact) or **64 bytes** +(extended), distinguished by bit 0 of the `i_format` field. + +### Compact Inode (32 bytes) + +Defined as [`struct erofs_inode_compact`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/erofs_fs.h): + +| Offset | Size | Type | Name | Description | +|--------|------|-------|------------------|-------------| +| 0x00 | 2 | `u16` | `i_format` | Inode format hints; see {ref}`i_format-field` | +| 0x02 | 2 | `u16` | `reserved` | Feature-specific; not described in core format | +| 0x04 | 2 | `u16` | `i_mode` | File type and permission bits | +| 0x06 | 2 | `u16` | `i_nb` | Union; see {ref}`i_nb-union` | +| 0x08 | 4 | `u32` | `i_size` | File size in bytes (32-bit) | +| 0x0C | 4 | `u32` | `reserved` | Feature-specific; not described in core format | +| 0x10 | 4 | `u32` | `i_u` | Union; see {ref}`i_u-union` | +| 0x14 | 4 | `u32` | `i_ino` | Inode serial number for 32-bit `stat(2)` compatibility | +| 0x18 | 2 | `u16` | `i_uid` | Owner UID (16-bit) | +| 0x1A | 2 | `u16` | `i_gid` | Owner GID (16-bit) | +| 0x1C | 4 | `u32` | `i_reserved` | Reserved; must be 0 | + +### Extended Inode (64 bytes) + +Defined as [`struct erofs_inode_extended`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/erofs_fs.h): + +| Offset | Size | Type | Name | Description | +|--------|------|--------|-------------------|-------------| +| 0x00 | 2 | `u16` | `i_format` | Inode format hints; see {ref}`i_format-field` | +| 0x02 | 2 | `u16` | `reserved` 
| Feature-specific; not described in core format | +| 0x04 | 2 | `u16` | `i_mode` | File type and permission bits | +| 0x06 | 2 | `u16` | `i_nb` | Union; see {ref}`i_nb-union` | +| 0x08 | 8 | `u64` | `i_size` | File size in bytes (64-bit) | +| 0x10 | 4 | `u32` | `i_u` | Union; see {ref}`i_u-union` | +| 0x14 | 4 | `u32` | `i_ino` | Inode serial number for 32-bit `stat(2)` compatibility | +| 0x18 | 4 | `u32` | `i_uid` | Owner UID (32-bit) | +| 0x1C | 4 | `u32` | `i_gid` | Owner GID (32-bit) | +| 0x20 | 8 | `u64` | `i_mtime` | Modification time, seconds since UNIX epoch | +| 0x28 | 4 | `u32` | `i_mtime_nsec` | Nanoseconds component of `i_mtime` | +| 0x2C | 4 | `u32` | `i_nlink` | Hard link count (32-bit) | +| 0x30 | 16 | `u8[]` | `i_reserved2` | Reserved; must be 0 | + +(i_format-field)= +### `i_format` Field + +The `i_format` field is present at offset 0x00 in both inode variants and encodes +layout metadata: + +| Bits | Width | Description | +|-------|-------|-------------| +| 0 | 1 | Inode version: 0 = compact (32-byte), 1 = extended (64-byte) | +| 1–3 | 3 | Data layout: values 0–4 are defined; 5–7 are reserved. See {ref}`inode_data_layouts` | +| 4 | 1 | `EROFS_I_NLINK_1_BIT` (non-directory compact inodes) / `EROFS_I_DOT_OMITTED_BIT` (directory inodes) | +| 5–15 | 11 | Reserved; must be 0 | + +Bit 4 has two mutually exclusive interpretations: + +- **`EROFS_I_NLINK_1_BIT`** (non-directory compact inodes only): when set, the hard + link count is implicitly 1 and `i_nb.nlink` need not be read, freeing `i_nb` for + other feature-specific uses. +- **`EROFS_I_DOT_OMITTED_BIT`** (directory inodes only): when set, the `.` entry is + omitted from the directory's dirent list to save space. 
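As a rough illustration of the inode addressing formula and the `i_format` bit layout described above, the following Python sketch computes an inode's byte offset from its NID and splits `i_format` into its parts. The function names and the returned dictionary keys are hypothetical and chosen only for this example. The sketch reports the reserved bits instead of rejecting them, which is a design choice of the example and not something the format text prescribes.

```python
INODE_SLOT_SIZE = 32  # every NID addresses one 32-byte inode slot


def inode_offset(meta_blkaddr: int, blkszbits: int, nid: int) -> int:
    """inode_offset = meta_blkaddr * block_size + 32 * nid"""
    return (meta_blkaddr << blkszbits) + INODE_SLOT_SIZE * nid


def decode_i_format(i_format: int) -> dict:
    """Split the 16-bit i_format value into the fields listed in the table."""
    extended = bool(i_format & 0x1)             # bit 0: inode version
    return {
        "extended": extended,
        "inode_size": 64 if extended else 32,   # bytes occupied by the inode
        "datalayout": (i_format >> 1) & 0x7,    # bits 1-3: data layout
        "bit4": bool((i_format >> 4) & 0x1),    # NLINK_1 or DOT_OMITTED
        "reserved_bits": i_format >> 5,         # bits 5-15: expected to be 0
    }
```

How `bit4` is interpreted depends on the inode type taken from `i_mode`, so a caller would combine this result with the file type before deciding between `EROFS_I_NLINK_1_BIT` and `EROFS_I_DOT_OMITTED_BIT`.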
+ +(i_nb-union)= +### `i_nb` Union + +The `i_nb` field (2 bytes at offset 0x06) is interpreted based on the data layout +and whether `EROFS_I_NLINK_1_BIT` is set: + +| Name | Applicable when | Description | +|--------------------|-----------------|-------------| +| `i_nb.nlink` | `EROFS_I_NLINK_1_BIT` unset (non-directory compact inodes) | Hard link count | + +Other interpretations of `i_nb` are defined by optional extensions. + +(i_u-union)= +### `i_u` Union + +The `i_u` field (4 bytes at offset 0x10) is interpreted based on the data layout: + +| Name | Applicable when | Description | +|-------------------|-----------------|-------------| +| `i_u.startblk` | Flat inodes | Starting block number | +| `i_u.rdev` | Character/block device inodes | Device ID | + +(inode_data_layouts)= +## Inode Data Layouts + +The data layout of an inode is encoded in bits 1–3 of `i_format`. The core format +defines two flat layouts. + +### `EROFS_INODE_FLAT_PLAIN` (0) + +`i_u` is interpreted as `startblk` (the 32-bit starting block address). + +The inode's data lies in consecutive blocks starting from that address, +occupying `ceil(i_size / block_size)` consecutive blocks. + +### `EROFS_INODE_FLAT_INLINE` (2) + +`i_u` is interpreted as `startblk` (the 32-bit starting block address). + +The inode's data lies in consecutive blocks starting from that address, except for the tail part that is inlined in the block immediately following the inode metadata. +If `i_size` is small enough that the entire content fits in the inline tail, there +are no preceding blocks and `i_u` is a don't-care field. + +:::{note} +This layout is not allowed if the tail inode data block cannot be inlined. +::: + +(on_disk_directories)= +## Directories + +All on-disk directories are organized in the form of **directory blocks** of size +`2^(blkszbits + dirblkbits)` (currently `dirblkbits` is always 0). + +### Directory Block Structure + +Each directory block is divided into two contiguous regions: + +1. 
A fixed-size array of directory entry records at the start of the block. +2. Variable-length filename strings stored immediately after the entry array, so the + first filename starts at `nameoff[0]`. + +The `nameoff` field of the **first** entry in a block encodes the total number of +directory entries in that block: + +``` +entry_count = nameoff[0] / sizeof(erofs_dirent) +``` + +All entries within a directory block, including `.` and `..`, are stored in strict +**lexicographic (byte-value ascending) order** to enable an improved prefix binary +search algorithm. + +### Directory Entry Record (12 bytes) + +Defined as [`struct erofs_dirent`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/erofs_fs.h): + +| Offset | Size | Type | Name | Description | +|--------|------|-------|-------------|-------------| +| 0x00 | 8 | `u64` | `nid` | Node number of the target inode | +| 0x08 | 2 | `u16` | `nameoff` | Byte offset of the filename within this directory block | +| 0x0A | 1 | `u8` | `file_type` | File type code (see below) | +| 0x0B | 1 | `u8` | `reserved` | Reserved; must be 0 | + +#### `file_type` Values + +| Value | Constant | POSIX type | +|-------|---------------------|------------| +| 0 | `EROFS_FT_UNKNOWN` | Unknown | +| 1 | `EROFS_FT_REG_FILE` | Regular file | +| 2 | `EROFS_FT_DIR` | Directory | +| 3 | `EROFS_FT_CHRDEV` | Character device | +| 4 | `EROFS_FT_BLKDEV` | Block device | +| 5 | `EROFS_FT_FIFO` | FIFO | +| 6 | `EROFS_FT_SOCK` | Socket | +| 7 | `EROFS_FT_SYMLINK` | Symbolic link | + +### Filename Encoding + +Filenames are stored as raw byte sequences and are **not** null-terminated. The +length of entry `i` is derived as: + +- For all entries except the last: `nameoff[i+1] − nameoff[i]`. +- For the last entry in the block: at most `block_end − nameoff[last]`, where `block_end` + is the first byte past the block. If the name does not reach the end of the block, the + remaining bytes are filled with `0x00` and are not part of the name. + +No character encoding is mandated; UTF-8 is recommended. 
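The directory block rules above (entry count from `nameoff[0]`, 12-byte records, name lengths derived from neighbouring offsets) can be sketched as a small Python parser. The names `DIRENT` and `parse_dir_block` are hypothetical. The sketch also assumes that unused bytes after the last filename are `0x00` padding and strips them, as the earlier revision of this document required. It is an illustration under those assumptions and not a reference implementation.

```python
import struct

# nid (u64), nameoff (u16), file_type (u8), reserved (u8): 12 bytes in total
DIRENT = struct.Struct("<QHBx")


def parse_dir_block(block: bytes) -> list:
    """Return (name, nid, file_type) tuples for one directory block."""
    _, nameoff0, _ = DIRENT.unpack_from(block, 0)
    # nameoff of the first entry also encodes the number of entries.
    count = nameoff0 // DIRENT.size
    entries = [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]

    result = []
    for i, (nid, nameoff, file_type) in enumerate(entries):
        if i + 1 < count:
            # Every name except the last ends where the next one begins.
            name = block[nameoff:entries[i + 1][1]]
        else:
            # Assumption: trailing 0x00 bytes in the block are padding.
            name = block[nameoff:].rstrip(b"\0")
        result.append((name, nid, file_type))
    return result
```

Because entries are sorted in byte-value order, a lookup could run a binary search over the result (or directly over the on-disk records) instead of a linear scan; the sketch returns raw `bytes` names since no character encoding is mandated.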
+ +:::{note} + +Other alternative forms (e.g., `Eytzinger order`) were also considered (that is +why there was once [.*_classic](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/erofs/namei.c?h=v5.4#n90) +naming). Here are some reasons those forms were not supported: + + - Filenames are variable-sized strings, which makes `Eytzinger order` harder + to utilize unless `namehash` is also introduced, but that complicates the + overall implementation and expands directory sizes. + + - It is harder to keep filenames and directory entries in the same directory + block (especially _large directories_) to minimize I/O amplification. + + - `readdir(3)` would be impacted too if strict alphabetical order were required. + +If there are better ideas to resolve these, the on-disk definition could be updated +in the future. + +::: diff --git a/src/ondisk/index.md b/src/ondisk/index.md new file mode 100644 index 0000000..0e2735c --- /dev/null +++ b/src/ondisk/index.md @@ -0,0 +1,29 @@ +(erofs_ondisk_format)= +# EROFS On-disk Format + +EROFS uses a flexible, hierarchical, block-aligned on-disk layout that is built +with the following goals: + +- DMA- and mmap-friendly, block-aligned data to maximize runtime performance on + all kinds of storage devices; +- A simple core on-disk format that is easy to parse and has zero unnecessary + metadata redundancy for archive use unlike other generic filesystems, ideal + for data auditing and accessing remote untrusted data; +- Advanced on-disk features like compression (compressed inodes and metadata + compression) are completely optional and aren’t mixed with the core design: + you can use them only when needed. + +The entire filesystem tree is built from just three core on-disk structures: + +- **Superblock** — located at a fixed offset of 1024 bytes; the only + structure at a fixed position in the filesystem. 
+- **Compact/Extended inodes** — per regular file, device, symlink, or directory; + addressed in O(1) time via a simple NID-to-offset formula. +- **Directory entries** — 12-byte records, sorted lexicographically by filename + at the beginning of each directory block (each data block of a directory inode). + +```{toctree} +:hidden: +core_ondisk +48bit +```