
feat(portal): stream file uploads to reduce memory from O(file_size) to O(5 MiB)#409

Open
lilongen wants to merge 1 commit into boxlite-ai:main from lilongen:feat/streaming-file-upload

Conversation

@lilongen

Summary

  • Replace full-file Vec<UploadChunk> buffering in upload_tar with a bounded mpsc::channel(4) + spawned reader task, capping peak memory at ~5 MiB regardless of file size
  • Extract stream_file_chunks helper (accepts impl AsyncRead) for testability and single-responsibility
  • Use std::mem::take for dest_path/container_id (first chunk only, zero-copy on subsequent chunks)
  • Always await reader JoinHandle before checking gRPC result to surface root-cause errors first
  • Return error on receiver drop instead of silent Ok(()) to detect incomplete uploads
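The bullets above describe the core pattern: a spawned reader task feeding fixed-size chunks through a bounded channel so backpressure caps buffering. As a minimal illustration of that idea using only the standard library (the PR itself uses tokio's `mpsc::channel(4)` and `impl AsyncRead`; the names and chunk size below are illustrative, not the PR's code):

```rust
use std::io::Read;
use std::sync::mpsc;
use std::thread;

const CHUNK_SIZE: usize = 64; // small for illustration; the PR uses a much larger chunk

// Std-library analogue of the PR's pattern: a reader thread sends fixed-size
// chunks through a bounded channel, so at most 4 chunks are queued at once
// regardless of total input size. The sender blocks when the channel is full,
// which is exactly the backpressure that bounds peak memory.
fn stream_chunks<R: Read + Send + 'static>(mut src: R) -> mpsc::Receiver<Vec<u8>> {
    let (tx, rx) = mpsc::sync_channel(4); // bounded: capacity 4
    thread::spawn(move || {
        let mut buf = vec![0u8; CHUNK_SIZE];
        loop {
            match src.read(&mut buf) {
                Ok(0) => break, // EOF
                Ok(n) => {
                    if tx.send(buf[..n].to_vec()).is_err() {
                        break; // receiver dropped: stop reading
                    }
                }
                Err(_) => break, // a real implementation would propagate this
            }
        }
    });
    rx
}
```

A 200-byte input, for example, arrives as four chunks (64 + 64 + 64 + 8 bytes) with the original bytes intact.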

Motivation

upload_tar previously read the entire tar file into a Vec<UploadChunk> before streaming to the guest. For a 500 MB file, this consumed ~500 MB of heap. The download_tar path in the same file already used streaming correctly — this PR closes the asymmetry.

| Scenario      | Before       | After  |
| ------------- | ------------ | ------ |
| 500 MB upload | ~500 MB heap | ~5 MiB |
| 1 GB upload   | ~1 GB heap   | ~5 MiB |
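Back-of-envelope for the ~5 MiB cap, assuming a 1 MiB `CHUNK_SIZE` (the exact constant is not quoted in this PR description): with `mpsc::channel(4)`, at most 4 chunks sit in the channel plus 1 being filled by the reader.

```rust
// Assumed values: CHUNK_SIZE is not stated in the PR text, only the ~5 MiB
// result; 1 MiB chunks with a capacity-4 channel are consistent with it.
const CHUNK_SIZE: usize = 1024 * 1024;
const CHANNEL_CAPACITY: usize = 4;

fn peak_buffered_bytes() -> usize {
    // 4 chunks queued in the bounded channel + 1 chunk in flight in the reader
    (CHANNEL_CAPACITY + 1) * CHUNK_SIZE
}
```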

Test plan

  • cargo fmt -- --check: clean
  • cargo clippy -p boxlite --no-default-features --lib -- -D warnings: clean (0 warnings)
  • cargo test -p boxlite --no-default-features --lib — 608 passed, 0 failed
  • 8 new unit tests covering all branches of stream_file_chunks:
    • Multi-chunk (3 MiB), single-chunk (500 B), partial last chunk, empty file
    • Receiver dropped (cancellation) returns error
    • I/O read error propagation via FailingReader mock
    • Byte-level data integrity verification
    • upload_tar file-not-found error path

🤖 Generated with Claude Code

…to O(5 MiB)

Replace full-file Vec buffering in upload_tar with a bounded mpsc channel
(capacity=4) and a spawned reader task. This caps peak memory at ~5 MiB
regardless of file size, matching the streaming pattern already used in
download_tar and the guest-side upload handler.

Key changes:
- Extract stream_file_chunks helper accepting impl AsyncRead for testability
- Use std::mem::take for dest_path/container_id (first chunk only, zero-copy)
- Always await reader JoinHandle before checking gRPC result (root-cause priority)
- Return error on receiver drop instead of silent Ok (incomplete upload detection)
- Add 8 unit tests: multi-chunk, single-chunk, partial, empty, receiver-drop,
  read-error (FailingReader mock), data-integrity, and file-not-found

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 27, 2026 05:53

Copilot AI left a comment


Pull request overview

This PR updates the portal-side tar upload path to stream file contents instead of buffering the entire archive in memory, bringing upload behavior in line with the already-streaming download path and reducing peak heap usage for large uploads.

Changes:

  • Reworked FilesInterface::upload_tar to stream UploadChunks via a bounded mpsc channel + ReceiverStream rather than prebuilding a Vec<UploadChunk>.
  • Added stream_file_chunks helper to read any AsyncRead in CHUNK_SIZE increments and emit UploadChunks.
  • Added unit tests covering chunking behavior, integrity, cancellation/receiver-drop behavior, and error propagation.


Comment on lines +57 to +73

```diff
+        let upload_result = self.client.upload(stream).await;
+
+        // Always join the reader task first — its error is the root cause when
+        // both the reader and the gRPC call fail (e.g. read error causes the
+        // stream to end, which causes the server to report failure).
+        match reader_handle.await {
+            Ok(Err(read_err)) => return Err(read_err),
+            Err(join_err) => {
+                return Err(BoxliteError::Internal(format!(
+                    "Upload reader task failed: {}",
+                    join_err
+                )));
+            }
+            Ok(Ok(())) => {}
+        }
-        // Use futures::stream::iter for the upload stream
-        let stream = futures::stream::iter(chunks);
-
-        let response = self
-            .client
-            .upload(stream)
-            .await
-            .map_err(map_tonic_err)?
-            .into_inner();
+        let response = upload_result.map_err(map_tonic_err)?.into_inner();
```

Copilot AI Mar 27, 2026


upload_tar returns the reader task error before inspecting the gRPC result. If the server rejects the upload early (e.g. permission denied) and drops the receiver, stream_file_chunks will likely return Internal("Upload stream closed..."), which then masks the real tonic Status from upload_result. Consider combining both results: if upload_result is Err, prefer that error unless the reader error is a real I/O failure (e.g. Storage("Failed to read...")), or include both errors in the returned message.
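The error-selection rule the reviewer suggests can be sketched in isolation. The `BoxliteError` variant names below mirror the ones quoted in this PR, but the selection logic itself is a hypothetical illustration of the suggestion, not the merged behavior:

```rust
// Hypothetical sketch of the reviewer's suggestion: when both the reader
// task and the gRPC call fail, prefer the most informative error.
#[derive(Debug, PartialEq)]
enum BoxliteError {
    Storage(String),  // real I/O failure while reading the file
    Internal(String), // e.g. "Upload stream closed..." after receiver drop
    Grpc(String),     // mapped tonic::Status, e.g. permission denied
}

fn combine_errors(reader_err: BoxliteError, upload_err: BoxliteError) -> BoxliteError {
    match reader_err {
        // A genuine read failure is the root cause; keep it.
        BoxliteError::Storage(_) => reader_err,
        // A "stream closed" error is usually a symptom of the server dropping
        // the receiver, so surface the gRPC status instead of masking it.
        _ => upload_err,
    }
}
```

Under this rule, a permission-denied status wins over a receiver-drop error, while a real read failure still takes priority.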

Comment on lines +57 to 71 (same hunk as above)

Copilot AI Mar 27, 2026


After the gRPC call completes, upload_tar always awaits reader_handle. If the upload fails quickly (connection error) but the file read blocks indefinitely (e.g. stalled filesystem), this can hang the whole operation and hide the gRPC failure. Consider aborting the reader task (or using a timeout) when upload_result is Err, while still joining normally on the success path to surface read errors.
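The bounded-wait idea can be sketched with std threads and a result channel (the PR's actual code uses tokio's `JoinHandle`, which also offers `abort()`; everything below, including the 5-second figure, is an illustrative assumption):

```rust
use std::sync::mpsc;
use std::time::Duration;

// Hypothetical illustration of the reviewer's suggestion: on an upload
// failure, wait only briefly for the reader's result instead of joining
// unconditionally, so a stalled filesystem read cannot hang the operation.
fn join_reader_bounded(
    done_rx: mpsc::Receiver<Result<(), String>>,
    upload_failed: bool,
) -> Result<(), String> {
    if upload_failed {
        // Don't block forever on a stalled reader; prefer reporting failure.
        match done_rx.recv_timeout(Duration::from_secs(5)) {
            Ok(Err(read_err)) => Err(read_err),
            _ => Err("upload failed (reader outcome unknown)".to_string()),
        }
    } else {
        // Success path: join normally so read errors still surface.
        done_rx.recv().map_err(|e| e.to_string())?
    }
}
```

On the success path this behaves like the current unconditional join; on the failure path a missing or slow reader result no longer blocks error reporting.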
