Automatic full file download fallback #92

Open · wants to merge 2 commits into main
Conversation

samansmink (Collaborator)
This PR adds a new feature (enabled by default) that automatically falls back to a full file download when the server does not support HTTP range requests.
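As a rough illustration of the kind of decision involved (a minimal sketch with hypothetical names, not the actual httpfs code): a server that honors a `Range` header replies with `206 Partial Content`; anything else means range reads are unusable and the full-file fallback kicks in.

```cpp
#include <stdexcept>

// Hypothetical decision helper: given the status code a server returned
// for a ranged GET, decide whether to keep using range requests or fall
// back to downloading the whole file.
enum class ReadStrategy { RANGE_REQUESTS, FULL_FILE_DOWNLOAD };

ReadStrategy ChooseStrategy(int status_code, bool fallback_enabled) {
	if (status_code == 206) {
		// 206 Partial Content: the server honored the Range header.
		return ReadStrategy::RANGE_REQUESTS;
	}
	// Anything else (e.g. 200 with the full body) means range requests
	// are not usable; fall back to a full download if enabled.
	if (fallback_enabled) {
		return ReadStrategy::FULL_FILE_DOWNLOAD;
	}
	throw std::runtime_error("server does not support HTTP range requests");
}
```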

@pdet

pdet (Contributor) commented Jul 29, 2025
Thanks, Sam!

I think we briefly chatted about this yesterday, but I don't recall the exact answer. Just for my understanding, is the full download always a blocking operation?

For example, the CSV reader always reads buffers of 32MB. If a file is 320MB, does it need to be fully downloaded upfront before execution, or can execution start as soon as one buffer is filled? If so, would this work in parallel?

My guess is that this might be totally outside the scope of this PR and more related to async I/O, but just checking :-)

samansmink (Collaborator, Author)
@pdet AFAICT the full-file download mechanism this is based on has not changed much since you implemented it a while back in duckdb/duckdb#6448.

The original mechanism uses a shared CachedFileHandle to ensure cached data is downloaded only once and then shared between threads. Currently, no mechanism exists to start reading from the CachedFileHandle before the file is fully downloaded.

I think for CSV files we should be able to implement this even without async I/O, but it would be complex: the CSV reading code would need to orchestrate a special thread to start reading the file through the file handle, and we would need to modify CachedFileHandle to allow reading it while it is still uninitialized. This is likely way too complex for what we would gain. Once we have proper async I/O, we could revisit this, because we might then be able to implement it cleanly.

Also, note that for Parquet this is probably fundamentally impossible, because the footer lives at the end of the file, so we can only start scanning a file once it has finished downloading.
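The shared-handle limitation described above can be sketched as follows (a simplified, illustrative model; the real CachedFileHandle lives in the httpfs extension and these names are not its actual API):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Simplified sketch of the shared cached-file mechanism: one thread
// downloads the file into the cache, and readers may only access the
// buffer once the download has fully completed.
struct CachedFile {
	std::vector<uint8_t> data;
	bool initialized = false; // flipped to true only after the full download
};

void Read(const CachedFile &file, uint8_t *out, size_t n, size_t offset) {
	// This is the limitation discussed above: there is no way to serve a
	// partial read while the download is still in progress.
	if (!file.initialized) {
		throw std::runtime_error("file not fully downloaded yet");
	}
	if (offset + n > file.data.size()) {
		throw std::runtime_error("read out of bounds");
	}
	std::copy(file.data.begin() + offset, file.data.begin() + offset + n, out);
}
```

Allowing reads while `initialized` is still false is exactly the change that would need careful synchronization, which is why it is deferred until proper async I/O exists.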

pdet (Contributor) commented Jul 29, 2025
> @pdet AFAICT the full-file download mechanism this is based on has not changed much since you implemented it a while back in duckdb/duckdb#6448.
>
> The original mechanism uses a shared CachedFileHandle to ensure cached data is downloaded only once and then shared between threads. Currently, no mechanism exists to start reading from the CachedFileHandle before the file is fully downloaded.
>
> I think for CSV files we should be able to implement this even without async I/O, but it would be complex: the CSV reading code would need to orchestrate a special thread to start reading the file through the file handle, and we would need to modify CachedFileHandle to allow reading it while it is still uninitialized. This is likely way too complex for what we would gain. Once we have proper async I/O, we could revisit this, because we might then be able to implement it cleanly.
>
> Also, note that for Parquet this is probably fundamentally impossible, because the footer lives at the end of the file, so we can only start scanning a file once it has finished downloading.

Understood! Indeed, different file formats can benefit differently from this. I think CSV, JSON, and Arrow IPC could benefit, but it probably makes more sense to solve async I/O at the filesystem level than to do format-specific implementations!

Thanks again Sam!
