
fix: prevent NFS handle eviction during reads and parallelize per-inode reads#80

Draft
XciD wants to merge 3 commits into main from fix/nfs-handle-pool-and-read-serialization

Conversation


@XciD XciD commented Mar 26, 2026

Summary

Fixes the last two bugs from the REVIEW.md audit (PR #69):

Bug 6 (NFS handle pool TOCTOU): Adds pin_count to HandlePool entries. get() pins the handle so it cannot be LRU-evicted while an async read is in flight. The caller unpins after the operation completes. Eviction skips pinned entries and the pool grows temporarily if all entries are pinned. Eliminates the TOCTOU window between get_or_open_handle and read that caused EBADF under heavy load (>64 concurrent files).
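The pin/evict interaction described above can be sketched as follows. This is a simplified, illustrative model, not the PR's actual code: the field names (`pin_count`, `last_used`), the `u64` stand-in for a real NFS handle, and the logical-clock LRU are all assumptions made for the sketch.

```rust
use std::collections::HashMap;

// Hypothetical, simplified sketch of the pinned handle pool.
struct Entry {
    handle: u64,      // stand-in for the real NFS file handle
    pin_count: usize, // > 0 while an async read is in flight
    last_used: u64,   // logical clock for LRU ordering
}

struct HandlePool {
    capacity: usize,
    clock: u64,
    entries: HashMap<u64, Entry>, // keyed by inode
}

impl HandlePool {
    fn new(capacity: usize) -> Self {
        HandlePool { capacity, clock: 0, entries: HashMap::new() }
    }

    // get() pins the entry so eviction cannot reclaim it mid-read.
    fn get(&mut self, ino: u64) -> Option<u64> {
        self.clock += 1;
        let clock = self.clock;
        self.entries.get_mut(&ino).map(|e| {
            e.pin_count += 1;
            e.last_used = clock;
            e.handle
        })
    }

    // Caller unpins once the async read completes.
    fn unpin(&mut self, ino: u64) {
        if let Some(e) = self.entries.get_mut(&ino) {
            e.pin_count = e.pin_count.saturating_sub(1);
        }
    }

    fn insert(&mut self, ino: u64, handle: u64) {
        self.clock += 1;
        // Evict unpinned LRU entries until back under capacity; if every
        // entry is pinned, the pool grows temporarily instead of evicting.
        while self.entries.len() >= self.capacity {
            let victim = self
                .entries
                .iter()
                .filter(|(_, e)| e.pin_count == 0)
                .min_by_key(|(_, e)| e.last_used)
                .map(|(k, _)| *k);
            match victim {
                Some(k) => { self.entries.remove(&k); }
                None => break, // all pinned: overflow rather than evict
            }
        }
        self.entries.insert(ino, Entry { handle, pin_count: 0, last_used: self.clock });
    }
}

fn main() {
    let mut pool = HandlePool::new(2);
    pool.insert(1, 101);
    pool.insert(2, 102);
    let h = pool.get(1).unwrap(); // pins inode 1, as a read would
    assert_eq!(h, 101);
    pool.insert(3, 103); // evicts unpinned LRU inode 2, not pinned inode 1
    assert!(pool.get(1).is_some());
    assert!(pool.get(2).is_none());
    pool.unpin(1);
}
```

The key property is in `insert()`: a pinned entry is never a victim, so the handle a reader obtained from `get()` stays valid for the duration of the read, closing the TOCTOU window.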

Bug 7 (NFS read serialization): Uses try_lock() on the per-handle prefetch mutex. When contended (another reader holds the lock on the same NFS inode), falls back to a direct range download that bypasses prefetch entirely. This prevents two NFS clients reading different parts of the same file from serializing behind each other's network I/O, and avoids one reader's seek pattern disrupting the other's prefetch window.
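The path-selection logic can be illustrated with `std::sync::Mutex::try_lock`; the `Handle` struct and `choose_path` function below are assumptions for the sketch, not the PR's actual API.

```rust
use std::sync::Mutex;

// Illustrative stand-in for a per-handle prefetch state guarded by a mutex.
struct Handle {
    prefetch: Mutex<()>,
}

/// Returns which read path a caller would take for this handle.
fn choose_path(handle: &Handle) -> &'static str {
    match handle.prefetch.try_lock() {
        Ok(_guard) => "prefetch",    // uncontended: normal prefetching read
        Err(_) => "direct-download", // contended: bypass prefetch entirely
    }
}

fn main() {
    let h = Handle { prefetch: Mutex::new(()) };
    assert_eq!(choose_path(&h), "prefetch");

    // Simulate a concurrent reader holding the prefetch lock on the same inode.
    let _busy = h.prefetch.lock().unwrap();
    assert_eq!(choose_path(&h), "direct-download");
}
```

Because `try_lock()` never blocks, a contended reader pays only the cost of its own range download instead of queueing behind the other reader's network I/O, and it never touches the other reader's prefetch window.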

Includes 5 new unit tests for pin/unpin behavior and updates to existing tests.

XciD added 2 commits March 26, 2026 19:04
…de reads

Bug #6: Add pin_count to HandlePool entries. get() pins the handle,
preventing LRU eviction while an async read is in flight. The caller
unpins after the operation completes. Eviction skips pinned entries,
and the pool grows temporarily if all entries are pinned. This
eliminates the TOCTOU window between get_or_open_handle and read
that could cause EBADF under heavy load (>64 concurrent files).

Bug #7: Use try_lock() on the per-handle prefetch mutex. When
contended (another reader holds the lock on the same NFS inode),
fall back to a direct range download that bypasses prefetch entirely.
This prevents two NFS clients reading different parts of the same
file from serializing behind each other's network I/O, and avoids
one reader's seek pattern disrupting the other's prefetch window.
- try_read_direct returns None for non-xet files (plain git/LFS) so the
  caller falls back to waiting for the prefetch lock instead of silently
  returning empty data
- Use get_file_entry() instead of manual inode table lookup
- Add HandlePool::pin() method instead of reaching into handles directly
- Remove duplicate comment
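The first bullet's `Option`-based fallback contract can be sketched like this. The `FileEntry` shape and placeholder byte source are assumptions for illustration; only the `None`-means-fall-back-to-the-prefetch-lock behavior comes from the commit message.

```rust
// Hypothetical shapes; the real types in the PR differ.
struct FileEntry {
    xet_hash: Option<String>, // None for plain git/LFS files
    file_size: u64,
}

// None signals "not a xet-backed file": the caller must fall back to
// waiting on the prefetch lock rather than returning empty data.
fn try_read_direct(entry: &FileEntry, offset: u64, len: u64) -> Option<Vec<u8>> {
    let _hash = entry.xet_hash.as_ref()?; // plain git/LFS file: bail out
    let end = (offset + len).min(entry.file_size);
    // Placeholder bytes standing in for the direct range download.
    Some(vec![0u8; end.saturating_sub(offset) as usize])
}

fn main() {
    let xet = FileEntry { xet_hash: Some("abc123".to_string()), file_size: 8 };
    assert_eq!(try_read_direct(&xet, 0, 16).unwrap().len(), 8); // clamped to file size

    let plain = FileEntry { xet_hash: None, file_size: 8 };
    assert!(try_read_direct(&plain, 0, 4).is_none()); // caller takes the prefetch-lock path
}
```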
@github-actions

POSIX Compliance (pjdfstest)

============================================================
  pjdfstest POSIX Compliance Results
------------------------------------------------------------
  Files: 130/130 passed    Tests: 832 total (0 subtests failed)
  Result: PASS
------------------------------------------------------------
  Category               Passed    Total   Status
  -------------------- -------- -------- --------
  chflags                     5        5       OK
  chmod                       8        8       OK
  chown                       6        6       OK
  ftruncate                  13       13       OK
  granular                    5        5       OK
  mkdir                       9        9       OK
  open                       19       19       OK
  posix_fallocate             1        1       OK
  rename                     10       10       OK
  rmdir                      11       11       OK
  symlink                    10       10       OK
  truncate                   13       13       OK
  unlink                     11       11       OK
  utimensat                   9        9       OK
============================================================

@github-actions

github-actions bot commented Mar 26, 2026

Benchmark Results

============================================================
  Benchmark — 50MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    120.7 MB/s     186.8 MB/s
  Sequential re-read                 463.5 MB/s    2355.9 MB/s
  Range read (1MB@25MB)               36.8 ms         0.2 ms
  Random reads (100x4KB avg)          34.1 ms         0.0 ms
  Sequential write (FUSE)           1485.0 MB/s
  Close latency (CAS+Hub)            0.153 s
  Write end-to-end                   267.8 MB/s
  Dedup write                       1796.1 MB/s
  Dedup close latency                0.092 s
  Dedup end-to-end                   416.2 MB/s
============================================================
============================================================
  Benchmark — 200MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    507.8 MB/s     936.8 MB/s
  Sequential re-read                 911.6 MB/s    2365.7 MB/s
  Range read (1MB@25MB)               29.1 ms         0.2 ms
  Random reads (100x4KB avg)          35.8 ms         0.0 ms
  Sequential write (FUSE)           1629.7 MB/s
  Close latency (CAS+Hub)            0.114 s
  Write end-to-end                   843.5 MB/s
  Dedup write                       1619.1 MB/s
  Dedup close latency                0.091 s
  Dedup end-to-end                   931.7 MB/s
============================================================
============================================================
  Benchmark — 500MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    692.6 MB/s    1376.9 MB/s
  Sequential re-read                1037.5 MB/s    2333.4 MB/s
  Range read (1MB@25MB)               34.0 ms         0.2 ms
  Random reads (100x4KB avg)          35.0 ms         0.0 ms
  Sequential write (FUSE)           1442.5 MB/s
  Close latency (CAS+Hub)            0.117 s
  Write end-to-end                  1078.6 MB/s
  Dedup write                       1470.3 MB/s
  Dedup close latency                1.584 s
  Dedup end-to-end                   259.8 MB/s
============================================================
============================================================
  fio Benchmark Results
------------------------------------------------------------
  Job                        FUSE MB/s   NFS MB/s  FUSE IOPS   NFS IOPS
  ------------------------- ---------- ---------- ---------- ----------
  seq-read-100M                  303.0      318.5                      
  seq-reread-100M                444.4      529.1                      
  rand-read-4k-100M                0.1        0.1         20         18
  seq-read-5x10M                 819.7      806.5                      
  rand-read-10x1M                  0.1        0.1         35         35
  Random Read Latency           FUSE avg      NFS avg
  ------------------------- ------------ ------------
  rand-read-4k-100M           50865.7 us   54488.5 us
  rand-read-10x1M             28332.7 us   28398.0 us
============================================================

1. Reinstate EBADF retry in NFS read: remove()/rename can delete a
   handle while pinned (pins only prevent LRU eviction). On EBADF,
   purge the stale entry and retry once with a fresh handle.

2. Use open-time xet_hash/file_size snapshot in try_read_direct instead
   of re-reading from the inode table. Prevents the contended read path
   from seeing a different file revision than the prefetch path.

3. Return EIO on premature stream EOF in try_read_direct instead of a
   short read, matching the invariant enforced by read_with_prefetch
   (FUSE fuse_read_update_size shrinks i_size on short reads).

4. Evict multiple entries in HandlePool::insert() to bring the pool
   back to capacity after pinned overflow bursts. Previously the pool
   grew permanently after a burst of pinned reads.
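The retry-once-on-EBADF pattern from point 1 can be sketched generically. The `ReadError` enum and closure-based shape are simplifications invented for this sketch; the real code purges the stale pool entry before reopening.

```rust
// Simplified stand-in for the NFS read error type.
#[derive(Debug, PartialEq)]
enum ReadError {
    BadFd, // EBADF: handle was deleted out from under us
    Io,    // any other I/O failure
}

// The closure's argument says whether this attempt uses a fresh handle.
fn read_with_retry<F>(mut do_read: F) -> Result<Vec<u8>, ReadError>
where
    F: FnMut(bool) -> Result<Vec<u8>, ReadError>,
{
    match do_read(false) {
        // remove()/rename() can delete a handle even while pinned (pins only
        // prevent LRU eviction), so on EBADF purge the stale entry and retry
        // exactly once with a fresh handle.
        Err(ReadError::BadFd) => do_read(true),
        other => other,
    }
}

fn main() {
    // First attempt hits a stale handle; the single retry succeeds.
    let result = read_with_retry(|fresh| {
        if fresh { Ok(vec![1, 2, 3]) } else { Err(ReadError::BadFd) }
    });
    assert_eq!(result, Ok(vec![1, 2, 3]));

    // Non-EBADF errors are not retried.
    let result = read_with_retry(|_| Err::<Vec<u8>, _>(ReadError::Io));
    assert_eq!(result, Err(ReadError::Io));
}
```

Retrying exactly once is deliberate: a second EBADF with a freshly opened handle means the file really is gone, so the error should propagate rather than loop.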
