Stage dataset files locally #47
Description
When SDK users want to access data from a specific dataset, the current functionality for reading directly from S3 via boto is only helpful for tools that can process data from a buffer. For Python users, this is almost always sufficient.
However, when showing the SDK functionality to the Bioinformatics Core, they raised multiple use cases in which R analysis software needs to read from a local file.
To support this use case, it might make sense to provide a local cache of files downloaded from the portal. Instead of just downloading those files to the working directory, I thought it would be more efficient to maintain a cache directory, perhaps with a location that could be configured by the user. The downloaded files could then be saved in `<cache_dir>/<dataset_id>/<relative_path_to_file>`.
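A minimal sketch of the cache-path construction described above. The function name, the `SDK_CACHE_DIR` environment variable, and the default location under the home directory are all assumptions for illustration, not existing SDK behavior:

```python
import os
from pathlib import Path


def cache_path(dataset_id, relative_path, cache_dir=None):
    # Resolve the cache root: explicit argument, then an (assumed)
    # environment variable, then a default under the user's home directory.
    base = Path(
        cache_dir
        or os.environ.get("SDK_CACHE_DIR", Path.home() / ".sdk_cache")
    )
    # Files land in <cache_dir>/<dataset_id>/<relative_path_to_file>
    return base / dataset_id / relative_path
```

Keying the layout on dataset ID plus relative path means two datasets can contain files with the same name without colliding in the cache.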
The user interaction could be a `.local_path()` method on each File object which returns the local path to that file. When the method is called, it could first check whether the file has already been staged; if the local file size matches the size in S3, it wouldn't have to download it again.
This is intended as a discussion issue; I'm not sure exactly what the best implementation is for this particular use case.