feat: sparse writes with range_upload, zero-download write path#41

Open
XciD wants to merge 2 commits into main from feat/append-write

Conversation

@XciD (Member) commented Mar 16, 2026

Summary

Replaces the full-download write path with sparse staging files and dirty range tracking. When opening an existing file for write, we create a sparse staging file (set_len()) instead of downloading CAS content. Only the dirty byte ranges are tracked and uploaded.

How it works

  • Open for write: create sparse staging file at original size, no CAS download
  • Write: pwrite() to staging file, SparseWriteState::track_write() records dirty ranges (O(log n) merge with binary search)
  • Read: pread() from staging, fill_sparse_holes() downloads CAS data on demand for non-dirty regions
  • Truncate: clip_to_size() trims dirty ranges on shrink, track_write() marks extension as dirty on grow
  • Flush: range_upload composes a new CAS file from stable prefix/suffix + re-chunked dirty regions via upload_ranges() in xet-core

For a 200MB file with a 1KB mid-file edit, this downloads 0 bytes on open and uploads only the dirty chunks (not 200MB).
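The dirty-range bookkeeping behind this can be sketched as follows. This is an illustrative stand-in, not the PR's `SparseWriteState`: the `DirtyRanges` name, fields, and method bodies are assumptions, and only the behavior (sorted non-overlapping ranges, binary-search insertion point, merging of overlapping or adjacent writes, clipping on shrink) follows the description above.

```rust
/// Illustrative sketch of dirty-range tracking (not the PR's code).
#[derive(Debug, Default)]
struct DirtyRanges {
    /// Sorted, non-overlapping half-open byte ranges [start, end).
    ranges: Vec<(u64, u64)>,
}

impl DirtyRanges {
    /// Record a write of `len` bytes at `offset`, merging with any
    /// overlapping or adjacent ranges. The insertion point is found in
    /// O(log n) via binary search; merging is linear in ranges touched.
    fn track_write(&mut self, offset: u64, len: u64) {
        let (mut start, mut end) = (offset, offset + len);
        // First existing range whose end reaches the new range.
        let i = self.ranges.partition_point(|&(_, e)| e < start);
        let mut j = i;
        while j < self.ranges.len() && self.ranges[j].0 <= end {
            start = start.min(self.ranges[j].0);
            end = end.max(self.ranges[j].1);
            j += 1;
        }
        // Replace the merged run with the single coalesced range.
        self.ranges.drain(i..j);
        self.ranges.insert(i, (start, end));
    }

    /// Trim dirty ranges after a truncate that shrinks the file.
    fn clip_to_size(&mut self, size: u64) {
        self.ranges.retain_mut(|r| {
            r.1 = r.1.min(size);
            r.0 < r.1
        });
    }
}

fn main() {
    let mut d = DirtyRanges::default();
    d.track_write(10, 5);
    d.track_write(0, 5);
    d.track_write(5, 5); // bridges the gap: ranges collapse to [0, 15)
    println!("{:?}", d.ranges);
}
```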

Key components

  • SparseWriteState (inode.rs): tracks original_hash, original_size, sorted dirty_ranges vec
  • fill_sparse_holes (mod.rs): downloads CAS data for sparse holes in read buffer
  • range_upload (xet.rs): builds DirtyInput vec from dirty ranges, each with its own async reader
  • FlushEntry/FlushSuccess structs (flush.rs): replace raw tuples for flush pipeline
  • flush_generation counter: prevents stale flush from clearing dirty state after concurrent writes/renames
  • NFS handle upgrade (nfs.rs): upgrades read-only pool handles to writable on first WRITE RPC
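As a rough sketch of what a `fill_sparse_holes`-style read path must compute, the helper below (hypothetical, not the PR's implementation) takes a read span and the sorted dirty ranges and returns the non-dirty sub-spans — the sparse holes whose bytes must come from CAS before the read can be served:

```rust
/// Given a read span [start, end) and sorted, non-overlapping dirty
/// ranges, return the sub-spans NOT covered by a dirty range — i.e.
/// the sparse holes that must be filled from CAS data on demand.
fn sparse_holes(start: u64, end: u64, dirty: &[(u64, u64)]) -> Vec<(u64, u64)> {
    let mut holes = Vec::new();
    let mut cursor = start;
    for &(ds, de) in dirty {
        if de <= cursor {
            continue; // dirty range entirely before the cursor
        }
        if ds >= end {
            break; // dirty range entirely after the read span
        }
        if ds > cursor {
            holes.push((cursor, ds.min(end))); // gap before this dirty range
        }
        cursor = cursor.max(de);
        if cursor >= end {
            break;
        }
    }
    if cursor < end {
        holes.push((cursor, end)); // trailing gap after the last dirty range
    }
    holes
}

fn main() {
    // Read 0..50 against dirty ranges [10,20) and [30,40):
    let holes = sparse_holes(0, 50, &[(10, 20), (30, 40)]);
    println!("{holes:?}"); // the three non-dirty gaps
}
```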

Tests

  • 47 new unit tests covering sparse writes, flush races, POSIX edge cases
  • Sparse write smoke tests in integration suite (mid-file write, append past EOF, CAS round-trip, multi-write accumulation, large file range_upload)
  • fsx: 50k random ops (staging) + 100 paranoid ops (CAS round-trip per mutation)
  • xfstests generic/quick suite (167+ pass)

Dependencies

  • xet-core branch adrien/combined-hf-mount (PR #717) which adds upload_ranges() API for composing files from CAS prefix/suffix + dirty regions
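The composition `upload_ranges()` is described as performing — stable prefix/suffix taken from CAS plus re-chunked dirty regions — can be sketched as a plan over byte spans. The `Segment` enum and `plan` function below are illustrative assumptions for exposition, not the xet-core API:

```rust
/// Illustrative flush-composition plan (not the xet-core API).
#[derive(Debug, PartialEq)]
enum Segment {
    /// Byte range still backed by the original CAS file.
    Cas(u64, u64),
    /// Byte range to be re-chunked and uploaded from the staging file.
    Dirty(u64, u64),
}

/// Interleave stable CAS spans with dirty spans to describe the new
/// file of length `new_size` composed at flush time. `dirty` must be
/// sorted and non-overlapping.
fn plan(dirty: &[(u64, u64)], new_size: u64) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut cursor = 0;
    for &(s, e) in dirty {
        // Clamp dirty ranges to the (possibly truncated) new size.
        let (s, e) = (s.min(new_size), e.min(new_size));
        if s > cursor {
            out.push(Segment::Cas(cursor, s)); // stable span from CAS
        }
        if e > s {
            out.push(Segment::Dirty(s, e)); // re-chunked from staging
        }
        cursor = cursor.max(e);
    }
    if cursor < new_size {
        out.push(Segment::Cas(cursor, new_size)); // stable suffix
    }
    out
}

fn main() {
    // A 1KB edit at offset 100MB in a 200MB file: two stable CAS spans
    // bracketing one dirty span — nothing else is downloaded or uploaded.
    let segments = plan(&[(100_000_000, 100_001_024)], 200_000_000);
    println!("{segments:?}");
}
```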

@github-actions bot

POSIX Compliance (pjdfstest)

============================================================
  pjdfstest POSIX Compliance Results
------------------------------------------------------------
  Files: 130/130 passed    Tests: 832 total (0 subtests failed)
  Result: PASS
------------------------------------------------------------
  Category               Passed    Total   Status
  -------------------- -------- -------- --------
  chflags                     5        5       OK
  chmod                       8        8       OK
  chown                       6        6       OK
  ftruncate                  13       13       OK
  granular                    5        5       OK
  mkdir                       9        9       OK
  open                       19       19       OK
  posix_fallocate             1        1       OK
  rename                     10       10       OK
  rmdir                      11       11       OK
  symlink                    10       10       OK
  truncate                   13       13       OK
  unlink                     11       11       OK
  utimensat                   9        9       OK
============================================================

@github-actions bot commented Mar 16, 2026

Benchmark Results

============================================================
  Benchmark — 50MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                    239.8 MB/s     239.1 MB/s
  Sequential re-read                1571.7 MB/s    2149.4 MB/s
  Range read (1MB@25MB)               33.0 ms         0.2 ms
  Random reads (100x4KB avg)          34.9 ms         0.0 ms
  Sequential write (FUSE)           1013.4 MB/s
  Close latency (CAS+Hub)            0.106 s
  Write end-to-end                   322.8 MB/s
  Dedup write                       1063.3 MB/s
  Dedup close latency                0.360 s
  Dedup end-to-end                   122.7 MB/s
============================================================
============================================================
  Benchmark — 200MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                   1005.6 MB/s     967.3 MB/s
  Sequential re-read                1747.6 MB/s    2232.7 MB/s
  Range read (1MB@25MB)               31.4 ms         0.3 ms
  Random reads (100x4KB avg)          32.4 ms         0.0 ms
  Sequential write (FUSE)            963.7 MB/s
  Close latency (CAS+Hub)            0.077 s
  Write end-to-end                   701.7 MB/s
  Dedup write                       1331.0 MB/s
  Dedup close latency                1.119 s
  Dedup end-to-end                   157.5 MB/s
============================================================
============================================================
  Benchmark — 500MB
------------------------------------------------------------
  Metric                                 FUSE          NFS
  ------------------------------ ------------ ------------
  Sequential read                   1556.9 MB/s    1348.3 MB/s
  Sequential re-read                1751.6 MB/s    2224.7 MB/s
  Range read (1MB@25MB)               36.8 ms         0.2 ms
  Random reads (100x4KB avg)          31.9 ms         0.0 ms
  Sequential write (FUSE)           1015.8 MB/s
  Close latency (CAS+Hub)            0.095 s
  Write end-to-end                   851.0 MB/s
  Dedup write                       1323.3 MB/s
  Dedup close latency                0.617 s
  Dedup end-to-end                   502.4 MB/s
============================================================
============================================================
  fio Benchmark Results
------------------------------------------------------------
  Job                        FUSE MB/s   NFS MB/s  FUSE IOPS   NFS IOPS
  ------------------------- ---------- ---------- ---------- ----------
  seq-read-100M                  529.1      458.7                      
  seq-reread-100M               2381.0       19.7                      
  rand-read-4k-100M                0.1        0.1         18         19
  seq-read-5x10M                 862.1      625.0                      
  rand-read-10x1M                  0.1        0.1         37         36
  Random Read Latency           FUSE avg      NFS avg
  ------------------------- ------------ ------------
  rand-read-4k-100M           54586.4 us   52205.2 us
  rand-read-10x1M             26960.4 us   27890.4 us
============================================================

Sparse staging: open for write creates a sparse file (set_len) instead of
downloading the original CAS content. Dirty byte ranges are tracked in
SparseWriteState and only modified regions are uploaded via range_upload.

Key changes:
- Sparse staging file on open (no CAS download)
- SparseWriteState tracks dirty ranges with O(log n) merge
- fill_sparse_holes reads CAS data on demand for read-after-write
- flush_generation counter prevents stale flush from clearing dirty state
- Rename re-enqueues dirty files for flush at new path
- setattr truncate/grow handled via clip_to_size + gap tracking
- write past original_size automatically tracks gap as dirty
- file.metadata() guard against concurrent truncate vs write race

New xet-core API (DirtyInput with AsyncRead per range):
- range_upload builds DirtyInput per dirty range from staging file
- upload_ranges handles truncation boundary from CAS directly
- No download needed for any write/truncate path

Testing:
- 245 unit tests (47 new for sparse writes, flush races, edge cases)
- fsx: 50k random ops (staging) + 100 paranoid CAS round-trip ops
- xfstests: generic/quick suite with FUSE patches (167 pass)
- pjdfstest: 8789 POSIX syscall tests
- Integration smoke tests: mid-file edit, append, truncate, multi-write,
  large file (512KB) CAS round-trip
@XciD force-pushed the feat/append-write branch from 67b8d38 to 6a35987 on March 20, 2026 at 15:25
NFS handle pool opens handles read-only for reads. When a WRITE RPC
arrives for an existing CAS file, the handle lacks a staging file and
VFS rejects with EBADF. Fix: try write with existing handle, on EBADF
evict it and reopen writable (creating sparse staging).
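The retry described in this commit can be sketched with toy types; `HandlePool`, `Mode`, and `write_rpc` below are placeholders for exposition, not the PR's actual nfs.rs API:

```rust
use std::collections::HashMap;
use std::io::{Error, ErrorKind};

// Toy stand-ins for the NFS handle pool; purely illustrative.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Mode {
    ReadOnly,
    Writable,
}

struct HandlePool {
    handles: HashMap<u64, Mode>,
}

impl HandlePool {
    fn try_write(&self, fh: u64, buf: &[u8]) -> std::io::Result<usize> {
        match self.handles.get(&fh) {
            // A read-only pooled handle has no staging file: EBADF.
            Some(Mode::ReadOnly) => Err(Error::from_raw_os_error(9 /* EBADF */)),
            Some(Mode::Writable) => Ok(buf.len()),
            None => Err(Error::new(ErrorKind::NotFound, "no such handle")),
        }
    }

    fn reopen_writable(&mut self, fh: u64) {
        // Evict the read-only handle and reopen writable; in the real
        // code this is where the sparse staging file gets created.
        self.handles.insert(fh, Mode::Writable);
    }
}

/// First WRITE RPC against a pooled read-only handle: try the write,
/// and on EBADF evict/reopen writable and retry once.
fn write_rpc(pool: &mut HandlePool, fh: u64, buf: &[u8]) -> std::io::Result<usize> {
    match pool.try_write(fh, buf) {
        Err(e) if e.raw_os_error() == Some(9) => {
            pool.reopen_writable(fh);
            pool.try_write(fh, buf)
        }
        other => other,
    }
}

fn main() {
    let mut pool = HandlePool {
        handles: HashMap::from([(1u64, Mode::ReadOnly)]),
    };
    // First WRITE against the read-only pooled handle succeeds via upgrade.
    let n = write_rpc(&mut pool, 1, b"hello").unwrap();
    println!("wrote {n} bytes; handle is now {:?}", pool.handles[&1]);
}
```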
@XciD XciD changed the title feat: append-aware writes with CDC-correct file composition feat: sparse writes with range_upload, zero-download write path Mar 20, 2026
@XciD XciD marked this pull request as ready for review March 20, 2026 15:57