Automatic full file download fallback #92

Open · wants to merge 2 commits into main
Conversation

samansmink (Collaborator)
This PR adds a new feature (enabled by default) that automatically falls back to a full file download when the server does not support HTTP range requests.
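As a rough illustration of the kind of decision involved (a minimal sketch with hypothetical names, not the actual httpfs code): a server that honors a `Range` header replies with `206 Partial Content`; anything else means range reads are unusable and the full-file fallback kicks in.

```cpp
#include <stdexcept>

// Hypothetical decision helper: given the status code a server returned
// for a ranged GET, decide whether to keep using range requests or fall
// back to downloading the whole file.
enum class ReadStrategy { RANGE_REQUESTS, FULL_FILE_DOWNLOAD };

ReadStrategy ChooseStrategy(int status_code, bool fallback_enabled) {
	if (status_code == 206) {
		// 206 Partial Content: the server honored the Range header.
		return ReadStrategy::RANGE_REQUESTS;
	}
	// Anything else (e.g. 200 with the full body) means range requests
	// are not usable; fall back to a full download if enabled.
	if (fallback_enabled) {
		return ReadStrategy::FULL_FILE_DOWNLOAD;
	}
	throw std::runtime_error("server does not support HTTP range requests");
}
```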

@pdet

pdet (Contributor) commented Jul 29, 2025
Thanks, Sam!

I think we briefly chatted about this yesterday, but I don't recall the exact answer. Just for my understanding, is the full download always a blocking operation?

For example, the CSV reader always reads buffers of 32MB. If a file is 320MB, does it need to be fully downloaded upfront before execution, or can execution start as soon as one buffer is filled? If so, would this work in parallel?

My guess is that this might be totally outside the scope of this PR and more related to async I/O, but just checking :-)

samansmink (Collaborator, Author)
@pdet AFAICT the full-file download mechanism this is based on has not changed much since you implemented it a while back in duckdb/duckdb#6448.

The original mechanism uses a shared CachedFileHandle to ensure cached data is downloaded only once and then shared between threads. Currently, no mechanism exists to start reading from the CachedFileHandle before the file is fully downloaded.

I think for CSV files we should be able to implement this even without async I/O, but it would be complex: the CSV reading code would need to orchestrate a special thread to start reading the file through the file handle, and we would need to modify CachedFileHandle to allow reading it while it is still uninitialized. This is likely way too complex for what we would gain. Once we have proper async I/O, we could revisit this, because we might then be able to implement it cleanly.

Also, note that for Parquet this is probably fundamentally impossible, because the footer lives at the end of the file, so we can only start scanning a file once it has finished downloading.
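The shared-handle limitation described above can be sketched as follows (a simplified, illustrative model; the real CachedFileHandle lives in the httpfs extension and these names are not its actual API):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Simplified sketch of the shared cached-file mechanism: one thread
// downloads the file into the cache, and readers may only access the
// buffer once the download has fully completed.
struct CachedFile {
	std::vector<uint8_t> data;
	bool initialized = false; // flipped to true only after the full download
};

void Read(const CachedFile &file, uint8_t *out, size_t n, size_t offset) {
	// This is the limitation discussed above: there is no way to serve a
	// partial read while the download is still in progress.
	if (!file.initialized) {
		throw std::runtime_error("file not fully downloaded yet");
	}
	if (offset + n > file.data.size()) {
		throw std::runtime_error("read out of bounds");
	}
	std::copy(file.data.begin() + offset, file.data.begin() + offset + n, out);
}
```

Allowing reads while `initialized` is still false is exactly the change that would need careful synchronization, which is why it is deferred until proper async I/O exists.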

pdet (Contributor) commented Jul 29, 2025
> @pdet AFAICT the full-file download mechanism this is based on has not changed much since you implemented it a while back in duckdb/duckdb#6448.
>
> The original mechanism uses a shared CachedFileHandle to ensure cached data is downloaded only once and then shared between threads. Currently, no mechanism exists to start reading from the CachedFileHandle before the file is fully downloaded.
>
> I think for CSV files we should be able to implement this even without async I/O, but it would be complex: the CSV reading code would need to orchestrate a special thread to start reading the file through the file handle, and we would need to modify CachedFileHandle to allow reading it while it is still uninitialized. This is likely way too complex for what we would gain. Once we have proper async I/O, we could revisit this, because we might then be able to implement it cleanly.
>
> Also, note that for Parquet this is probably fundamentally impossible, because the footer lives at the end of the file, so we can only start scanning a file once it has finished downloading.

Understood! Indeed, different file formats can benefit differently from this. I think CSV, JSON, and Arrow IPC could benefit, but it probably makes more sense to solve async I/O at the filesystem level than to do format-specific implementations!

Thanks again Sam!
