fix: upload reliability hardening -- retry, fault tolerance, spill cleanup #26
…indexing
- Add expected_data_size parameter to collect_validated_candidates() to reject nodes that return a tampered data_size in their quoting metrics
- Fix direct indexing addresses[i] -> safe zip iterator (pre-existing project rule violation)
Query CLOSE_GROUP_SIZE * 2 (10) peers from the DHT, send quote requests to all of them concurrently, and keep the CLOSE_GROUP_SIZE (5) closest successful responders, sorted by XOR distance. This tolerates up to 5 peer failures (timeout, bad quote, etc.) without aborting the entire quote collection.
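The selection step can be sketched as below. This is a simplified, std-only sketch: the function names, the 32-byte address type, and the u64 quote placeholder are illustrative, and the real code issues the quote requests concurrently (the failed peers simply never appear in `responses`).

```rust
/// Illustrative constant mirroring the PR description.
const CLOSE_GROUP_SIZE: usize = 5;

/// XOR distance between a peer ID and a target address, compared
/// lexicographically over the byte-wise XOR (standard Kademlia ordering).
fn xor_distance(peer: &[u8; 32], target: &[u8; 32]) -> [u8; 32] {
    let mut d = [0u8; 32];
    for i in 0..32 {
        d[i] = peer[i] ^ target[i];
    }
    d
}

/// Keep the CLOSE_GROUP_SIZE closest successful responders.
/// `responses` holds (peer_id, quote) pairs for peers that answered in
/// time; since CLOSE_GROUP_SIZE * 2 peers were queried, up to
/// CLOSE_GROUP_SIZE failures are tolerated before this comes up short.
fn closest_responders(
    mut responses: Vec<([u8; 32], u64)>,
    target: &[u8; 32],
) -> Vec<([u8; 32], u64)> {
    responses.sort_by_key(|(peer, _)| xor_distance(peer, target));
    responses.truncate(CLOSE_GROUP_SIZE);
    responses
}
```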
…racking
- store_paid_chunks() now retries failed chunks up to 3 times with
exponential backoff (500ms, 1s, 2s) before giving up
- Returns WaveResult { stored, failed } instead of discarding partial
successes on first error
- Add PartialUpload error variant that carries both stored addresses
and failed chunk details so callers can track progress
- PaidChunk is now Clone to support retry
- batch_upload_chunks() accumulates stored addresses across waves and
reports them even on partial failure
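The retry-with-partial-results flow above can be sketched as follows. This is a synchronous, std-only sketch under assumed names: `store_one` stands in for the real per-chunk network store, and the generic `WaveResult` mirrors the struct described above.

```rust
use std::time::Duration;

/// Result of one storage wave: partial successes are kept, not discarded.
struct WaveResult<A, F> {
    stored: Vec<A>,
    failed: Vec<F>,
}

/// Retry each failed chunk up to `max_retries` times with exponential
/// backoff (500ms, 1s, 2s for attempts 0, 1, 2) before recording it as
/// failed. Successes from other chunks survive any individual failure.
fn store_with_retries<C, A>(
    chunks: Vec<C>,
    mut store_one: impl FnMut(&C) -> Result<A, String>,
    max_retries: u32,
) -> WaveResult<A, (C, String)> {
    let mut stored = Vec::new();
    let mut failed = Vec::new();
    for chunk in chunks {
        let mut attempt = 0u32;
        loop {
            match store_one(&chunk) {
                Ok(addr) => {
                    stored.push(addr);
                    break;
                }
                Err(_) if attempt < max_retries => {
                    // 500ms << attempt gives the 500ms / 1s / 2s schedule.
                    std::thread::sleep(Duration::from_millis(500u64 << attempt));
                    attempt += 1;
                }
                Err(e) => {
                    // Keep the chunk and its error so callers can track it.
                    failed.push((chunk, e));
                    break;
                }
            }
        }
    }
    WaveResult { stored, failed }
}
```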
When pay_for_merkle_multi_batch fails on sub-batch N, return proofs from sub-batches 1..N-1 instead of discarding them. This prevents losing already-paid tokens when a later sub-batch fails -- callers can still store the chunks that were paid for.
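The keep-the-earlier-proofs behavior can be sketched like this. The names `PayOutcome` and `pay_sub_batch` are hypothetical stand-ins; the real pay_for_merkle_multi_batch talks to the EVM and returns payment proofs per sub-batch.

```rust
/// Outcome of paying sub-batches in order: either every proof, or the
/// proofs from the sub-batches that succeeded before the failure.
enum PayOutcome<P, E> {
    Complete(Vec<P>),
    Partial { proofs: Vec<P>, failed_batch: usize, error: E },
}

/// Pay each sub-batch in order; on failure at sub-batch N, return the
/// proofs already obtained from sub-batches 0..N instead of dropping
/// them, so already-paid tokens are not lost.
fn pay_all_sub_batches<B, P, E>(
    batches: Vec<B>,
    mut pay_sub_batch: impl FnMut(&B) -> Result<Vec<P>, E>,
) -> PayOutcome<P, E> {
    let mut proofs = Vec::new();
    for (i, batch) in batches.iter().enumerate() {
        match pay_sub_batch(batch) {
            Ok(mut p) => proofs.append(&mut p),
            Err(error) => {
                return PayOutcome::Partial { proofs, failed_batch: i, error };
            }
        }
    }
    PayOutcome::Complete(proofs)
}
```

A caller receiving `Partial` can still store the chunks covered by the returned proofs before surfacing the error.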
…mode
- file_prepare_upload: replace the sequential for-loop with the concurrent buffer_unordered pattern (same as prepare_wave) for quote collection
- data_prepare_upload: same concurrent fix
- Add a payment_mode field to PreparedUpload so finalize_upload reports the actual mode used instead of hardcoding PaymentMode::Single
…kle error, log accuracy
- AlreadyStored now only counts votes from the closest CLOSE_GROUP_SIZE peers by XOR distance, preventing false positives from distant peers
- Multi-batch merkle partial failure now returns a PartialUpload error instead of a misleading Ok with fewer proofs
- Fix a log message to capture the total response count before truncation
- Document Bytes clone overhead as O(1) ref-counted
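The close-group vote filter can be sketched as below, a std-only sketch under assumed names (the function name, the 32-byte address type, and the bool vote are illustrative):

```rust
const CLOSE_GROUP_SIZE: usize = 5;

/// Byte-wise XOR distance, compared lexicographically (Kademlia ordering).
fn xor_distance(peer: &[u8; 32], target: &[u8; 32]) -> [u8; 32] {
    let mut d = [0u8; 32];
    for i in 0..32 {
        d[i] = peer[i] ^ target[i];
    }
    d
}

/// Count AlreadyStored votes only among the CLOSE_GROUP_SIZE peers
/// closest to the chunk address, so a distant peer's stale or bogus
/// vote cannot produce a false positive.
fn already_stored_votes(
    mut responses: Vec<([u8; 32], bool)>, // (peer_id, voted_already_stored)
    chunk_addr: &[u8; 32],
) -> usize {
    responses.sort_by_key(|(peer, _)| xor_distance(peer, chunk_addr));
    responses
        .iter()
        .take(CLOSE_GROUP_SIZE)
        .filter(|r| r.1)
        .count()
}
```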
- New e2e_upload_costs test: uploads files at 200MB, 1GB, 4GB, and 8GB in both Single and Merkle modes, and reports ANT cost, gas cost, chunk count, and EVM transaction count in a formatted table
- New 8GB test in e2e_huge_file for memory-bounding verification
- Increase the testnet to 20 nodes (merkle needs CANDIDATES_PER_POOL=16)
- Create separate files with different seeds for Single vs Merkle to prevent AlreadyStored when uploading the same content twice
Measured results on a local Anvil testnet (20 nodes):

| File size | Chunks | Single | Merkle | Savings |
| --- | --- | --- | --- | --- |
| 10MB | 3 | 108K gwei | 172K gwei | (overhead) |
| 50MB | 16 | 278K gwei | 177K gwei | 36% |
| 200MB | 54 | 596K gwei | 160K gwei | 73% |
| 500MB | 129 | 1041K gwei | disk full on testnet | - |

Gas savings increase with file size. Merkle breaks even around ~10 chunks and delivers a 3-4x gas reduction at 50+ chunks.
- Spill dirs now live under <data_dir>/spill/ instead of the system temp dir (e.g. ~/Library/Application Support/ant/spill/ on macOS)
- Dir names include a Unix timestamp: <timestamp>_<random>
- On each new upload, stale spill dirs older than 24h are cleaned up (catching orphans from crashed or killed processes)
- The disk space check now queries the spill root instead of the temp dir
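The staleness check can be sketched as a pure predicate over the dir name. This is a sketch under assumptions: the function name is hypothetical, and it assumes the <timestamp>_<random> naming described above.

```rust
/// 24 hours, the staleness threshold described in the PR.
const STALE_AFTER_SECS: u64 = 24 * 60 * 60;

/// A spill dir counts as stale if its name parses as
/// `<unix_timestamp>_<random>` and that timestamp is more than 24h
/// before `now_secs`. Unparseable names are never treated as stale:
/// better to leak a directory than delete something we don't own.
fn is_stale_spill_dir(name: &str, now_secs: u64) -> bool {
    // Skip cleanup entirely when the system clock is broken (now == 0).
    if now_secs == 0 {
        return false;
    }
    let Some((ts, _random)) = name.split_once('_') else {
        return false;
    };
    let Ok(ts) = ts.parse::<u64>() else {
        return false;
    };
    now_secs.saturating_sub(ts) > STALE_AFTER_SECS
}
```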
…prefix
Review-driven fixes for ChunkSpill:
- Add a lockfile (.lock) inside each spill dir, held for the upload's lifetime via an fs2 exclusive lock. cleanup_stale skips dirs with active locks, preventing deletion of in-progress uploads.
- Only delete entries starting with the 'spill_' prefix, preventing accidental deletion of unrelated files in the spill root.
- Check entry.file_type().is_dir() before remove_dir_all to prevent symlink attacks (following a symlink to delete an arbitrary directory).
- Skip cleanup entirely when the system clock is broken (timestamp 0).
- Add a public run_cleanup() method for client startup use.
- Import crate::config at function scope instead of using an inline path.
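The deletion guards above can be condensed into one predicate. This sketch is purely illustrative: the function name and bool parameters are hypothetical, standing in for the real checks (fs2's try_lock_exclusive on the .lock file, and DirEntry::file_type, which does not follow symlinks).

```rust
/// Decide whether cleanup may delete an entry found in the spill root,
/// mirroring the review-driven guards.
fn may_delete_entry(name: &str, is_real_dir: bool, lock_is_held: bool) -> bool {
    // Only touch entries we created: they all carry the 'spill_' prefix.
    if !name.starts_with("spill_") {
        return false;
    }
    // A symlink pointing at an arbitrary directory fails the
    // real-directory check, so remove_dir_all never follows it.
    if !is_real_dir {
        return false;
    }
    // An exclusive lock held on the dir's .lock file means an upload is
    // still in progress there.
    !lock_is_held
}
```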
Summary
Production hardening of the upload and payment pipeline:
Upload reliability:
Quote collection fault tolerance:
Merkle candidate validation:
External signer improvements:
Chunk spill hardening:
Measured results
Gas savings (merkle vs single, local Anvil):
Memory: flat at ~1GB regardless of file size (1GB and 4GB uploads same peak RSS)
No external dependencies
This PR is self-contained. Uses published evmlib 0.5.0 and ant-node 0.9.0 from crates.io.
Related PRs (independent improvements to other layers)
Test plan