
fix: upload reliability hardening -- retry, fault tolerance, spill cleanup#26

Open
grumbach wants to merge 12 commits into main from fix/payment-upload-hardening

Conversation


@grumbach grumbach commented Apr 2, 2026

Summary

Production hardening of the upload and payment pipeline:

Upload reliability:

  • Retry failed chunk stores up to 3x with exponential backoff (500ms, 1s, 2s)
  • Track partial progress via PartialUpload error variant instead of silently losing stored chunks
  • Multi-batch merkle returns explicit error with partial proofs instead of misleading Ok
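The retry behavior above can be sketched as follows. This is a minimal std-only illustration with a hypothetical `store_with_retry` helper (the real code lives in `store_paid_chunks()`): one initial attempt, then up to three retries with the stated backoff delays.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Retry a fallible chunk store with exponential backoff (500ms, 1s, 2s)
/// before giving up, returning the last error on exhaustion.
fn store_with_retry<F>(mut store: F) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut result = store();
    for delay_ms in [500u64, 1_000, 2_000] {
        if result.is_ok() {
            break;
        }
        // Back off before the next attempt.
        sleep(Duration::from_millis(delay_ms));
        result = store();
    }
    result
}
```

A store that fails twice and then succeeds consumes the 500ms and 1s delays and returns `Ok` on the third attempt.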

Quote collection fault tolerance:

  • Query 2x peers (10 instead of 5), keep closest 5 by XOR distance
  • AlreadyStored only counted from close-group peers, not distant ones
  • Tolerates up to 5 peer failures without aborting
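The selection step can be sketched like this, simplified to `u64` peer keys rather than real Kademlia IDs (names here are illustrative, not the PR's actual API): query twice the close-group size, then keep only the `CLOSE_GROUP_SIZE` successful responders closest to the target by XOR distance.

```rust
const CLOSE_GROUP_SIZE: usize = 5;

/// From up to 2x CLOSE_GROUP_SIZE successful responders, keep the
/// CLOSE_GROUP_SIZE peers closest to `target` by XOR distance,
/// sorted closest-first. Any peers beyond that are dropped, which is
/// what lets up to CLOSE_GROUP_SIZE failures be tolerated.
fn closest_responders(target: u64, mut responders: Vec<u64>) -> Vec<u64> {
    responders.sort_by_key(|peer| peer ^ target);
    responders.truncate(CLOSE_GROUP_SIZE);
    responders
}
```

If fewer than five peers respond, `truncate` is a no-op and all survivors are kept, so partial failure degrades gracefully instead of aborting.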

Merkle candidate validation:

  • Validate data_size on merkle candidate responses (prevents pricing manipulation)
  • Fix direct indexing (addresses[i]) to safe zip iterator
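The indexing fix follows a standard pattern, sketched here with hypothetical types: `zip` pairs elements positionally and simply stops at the shorter slice, so a length mismatch between addresses and candidate metadata can no longer cause an out-of-bounds panic the way `addresses[i]` could.

```rust
/// Pair each candidate address with its reported data size using a zip
/// iterator instead of direct indexing. Unmatched trailing elements on
/// either side are silently dropped rather than panicking.
fn pair_candidates(addresses: &[&str], sizes: &[u64]) -> Vec<(String, u64)> {
    addresses
        .iter()
        .zip(sizes.iter())
        .map(|(addr, size)| (addr.to_string(), *size))
        .collect()
}
```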

External signer improvements:

  • Concurrent quote collection (was sequential, now uses buffer_unordered)
  • Correct payment_mode tracking in PreparedUpload
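The real code uses `buffer_unordered` from the futures crate; this std-only analogue (all names hypothetical) shows the same idea: fan quote requests out concurrently and gather results as they complete, in arbitrary order, instead of awaiting each peer sequentially.

```rust
use std::sync::mpsc;
use std::thread;

/// Send one "quote request" per peer concurrently and collect the
/// responses as they arrive. The multiply-by-10 body is a stand-in
/// for a network round-trip returning a quote.
fn collect_quotes_concurrently(peers: Vec<u64>) -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for peer in peers {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            let quote = peer * 10;
            tx.send(quote).expect("receiver alive");
        }));
    }
    drop(tx); // close the channel so rx.iter() terminates
    let mut quotes: Vec<u64> = rx.iter().collect();
    for h in handles {
        h.join().expect("worker panicked");
    }
    quotes.sort_unstable(); // completion order is nondeterministic
    quotes
}
```

With `buffer_unordered(n)` the concurrency is additionally bounded to `n` in-flight requests, which threads-per-peer does not capture.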

Chunk spill hardening:

  • Moved from system temp to data_dir/spill/ (predictable location)
  • Lockfile-based protection: cleanup skips dirs with active uploads
  • Symlink attack prevention: only deletes actual directories, not symlinks
  • Prefix-based filtering: only cleans spill_ prefixed dirs
  • Stale cleanup after 24h for orphaned dirs from crashed processes
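The cleanup rules above compose into a single predicate. This sketch (hypothetical signature; the real code inspects `std::fs` metadata and an fs2 lockfile) shows the decision logic: delete an entry only when every guard passes.

```rust
const STALE_AFTER_SECS: u64 = 24 * 60 * 60;

/// Decide whether a spill-root entry may be deleted: it must carry the
/// `spill_` prefix, be a real directory (never a symlink), be older than
/// 24h, and hold no active upload lock.
fn should_delete(
    name: &str,
    is_dir: bool,
    is_symlink: bool,
    age_secs: u64,
    locked: bool,
) -> bool {
    name.starts_with("spill_")
        && is_dir
        && !is_symlink      // symlink attack prevention
        && age_secs > STALE_AFTER_SECS
        && !locked          // an in-progress upload holds the .lock file
}
```

Keeping the checks in one predicate makes each hardening rule independently testable: dropping any single guard reopens the corresponding attack or data-loss path.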

Measured results

Gas savings (merkle vs single, local Anvil):

  • 50MB (16 chunks): 1.5x cheaper
  • 200MB (54 chunks): 4x cheaper
  • 500MB (129 chunks): 8x cheaper
  • 2GB (516 chunks): 22x cheaper

Memory: peak RSS stays flat at ~1GB regardless of file size (1GB and 4GB uploads show the same peak RSS)

No external dependencies

This PR is self-contained. Uses published evmlib 0.5.0 and ant-node 0.9.0 from crates.io.

Related PRs (independent improvements to other layers)

Test plan

  • cfd passes
  • Clippy + Format pass
  • Unit tests pass
  • Merkle E2E pass
  • e2e_upload_costs test (10MB-4GB, single+merkle) passes in release mode
  • 3-agent adversarial review (2 Claude Opus + 1 Codex gpt-5.4) -- all issues fixed
  • E2E macOS: test_payment_flow_with_node_failure flaky (also fails on main)

grumbach added 12 commits April 2, 2026 15:35
…indexing

- Add expected_data_size parameter to collect_validated_candidates()
  to reject nodes that return tampered data_size in quoting metrics
- Fix direct indexing addresses[i] -> safe zip iterator (pre-existing
  project rule violation)
Query CLOSE_GROUP_SIZE * 2 (10) peers from DHT, send quote requests
to all concurrently, and keep the CLOSE_GROUP_SIZE (5) closest
successful responders sorted by XOR distance. This tolerates up to
5 peer failures (timeout, bad quote, etc.) without aborting the
entire quote collection.
…racking

- store_paid_chunks() now retries failed chunks up to 3 times with
  exponential backoff (500ms, 1s, 2s) before giving up
- Returns WaveResult { stored, failed } instead of discarding partial
  successes on first error
- Add PartialUpload error variant that carries both stored addresses
  and failed chunk details so callers can track progress
- PaidChunk is now Clone to support retry
- batch_upload_chunks() accumulates stored addresses across waves and
  reports them even on partial failure
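The wave accounting described in this commit can be sketched as a pure function (types are illustrative, mirroring the `WaveResult { stored, failed }` shape named above): every chunk outcome is recorded, so a failure no longer discards the addresses that did store successfully.

```rust
/// Per-wave outcome: both successes and failures are reported, instead
/// of the wave aborting on the first error.
#[derive(Debug, Default, PartialEq)]
struct WaveResult {
    stored: Vec<String>,
    failed: Vec<String>,
}

/// Partition chunk outcomes for one wave. Each tuple is a stand-in for
/// (chunk address, did-the-store-succeed).
fn run_wave(chunks: &[(&str, bool)]) -> WaveResult {
    let mut result = WaveResult::default();
    for (addr, ok) in chunks {
        if *ok {
            result.stored.push(addr.to_string());
        } else {
            result.failed.push(addr.to_string());
        }
    }
    result
}
```

A caller accumulating `stored` across waves can then surface partial progress (the `PartialUpload` variant above) rather than pretending nothing was stored.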
When pay_for_merkle_multi_batch fails on sub-batch N, return proofs
from sub-batches 1..N-1 instead of discarding them. This prevents
losing already-paid tokens when a later sub-batch fails -- callers
can still store the chunks that were paid for.
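The control flow of that fix can be sketched like this (enum and helper names are hypothetical): on a sub-batch failure, return the proofs already obtained rather than dropping them, so the tokens spent on earlier sub-batches are not wasted.

```rust
/// Outcome of a multi-batch payment: either every sub-batch succeeded,
/// or we stopped at `failed_batch` but kept the earlier proofs.
#[derive(Debug)]
enum PayOutcome {
    Complete(Vec<String>),
    Partial { proofs: Vec<String>, failed_batch: usize },
}

/// Pay `batches` sub-batches in order; on the first failure, surface the
/// proofs accumulated so far instead of discarding them.
fn pay_multi_batch<F>(batches: usize, mut pay: F) -> PayOutcome
where
    F: FnMut(usize) -> Result<String, String>,
{
    let mut proofs = Vec::new();
    for batch in 0..batches {
        match pay(batch) {
            Ok(proof) => proofs.push(proof),
            // Already-paid proofs survive; callers can still store the
            // chunks those proofs cover.
            Err(_) => return PayOutcome::Partial { proofs, failed_batch: batch },
        }
    }
    PayOutcome::Complete(proofs)
}
```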
…mode

- file_prepare_upload: replace sequential for-loop with concurrent
  buffer_unordered pattern (same as prepare_wave) for quote collection
- data_prepare_upload: same concurrent fix
- Add payment_mode field to PreparedUpload so finalize_upload reports
  the actual mode used instead of hardcoding PaymentMode::Single
…kle error, log accuracy

- AlreadyStored now only counts votes from the closest CLOSE_GROUP_SIZE
  peers by XOR distance, preventing false positives from distant peers
- Multi-batch merkle partial failure now returns PartialUpload error
  instead of misleading Ok with fewer proofs
- Fix log message to capture total response count before truncation
- Document Bytes clone overhead as O(1) ref-counted
- New e2e_upload_costs test: uploads files at 200MB, 1GB, 4GB, 8GB
  in both Single and Merkle modes, reports ANT cost, gas cost,
  chunk count, and EVM transaction count in a formatted table
- New 8GB test in e2e_huge_file for memory bounding verification
- Increase testnet to 20 nodes (merkle needs CANDIDATES_PER_POOL=16)
- Create separate files with different seeds for Single vs Merkle
  to prevent AlreadyStored when uploading the same content twice
Measured results on local Anvil testnet (20 nodes):
- 10MB (3 chunks):   Single 108K gwei, Merkle 172K gwei (overhead)
- 50MB (16 chunks):  Single 278K gwei, Merkle 177K gwei (36% savings)
- 200MB (54 chunks): Single 596K gwei, Merkle 160K gwei (73% savings)
- 500MB (129 chunks): Single 1041K gwei (merkle: disk full on testnet)

Gas savings increase with file size. Merkle breaks even around ~10 chunks
and delivers 3-4x gas reduction at 50+ chunks.
- Spill dirs now live under <data_dir>/spill/ instead of system temp
  (e.g. ~/Library/Application Support/ant/spill/ on macOS)
- Dir names include Unix timestamp: <timestamp>_<random>
- On each new upload, stale spill dirs older than 24h are cleaned up
  (catches orphans from crashed/killed processes)
- Disk space check now queries the spill root instead of temp dir
…prefix

Review-driven fixes for ChunkSpill:

- Add lockfile (.lock) inside each spill dir, held for the upload's
  lifetime via fs2 exclusive lock. cleanup_stale skips dirs with
  active locks, preventing deletion of in-progress uploads.
- Only delete entries starting with 'spill_' prefix, preventing
  accidental deletion of unrelated files in the spill root.
- Check entry.file_type().is_dir() before remove_dir_all to prevent
  symlink attacks (following symlinks to delete arbitrary dirs).
- Skip cleanup entirely when system clock is broken (timestamp 0).
- Add run_cleanup() public method for client startup use.
- Import crate::config at function scope instead of inline path.