Stage dataset files locally #47
Description
When SDK users want to access data from a specific dataset, the current functionality for reading directly from S3 via boto is only helpful for tools that can process data from a buffer. For Python users, this is almost always sufficient.
However, when showing the SDK functionality to the Bioinformatics Core, they raised multiple use cases in which R analysis software needs to read from a local file.
To support this use case, it might make sense to provide a local cache of files downloaded from the portal. Instead of just downloading those files to the working directory, I thought it would be more efficient to maintain a cache directory, perhaps with a location that could be configured by the user. The downloaded files could then be saved in `<cache_dir>/<dataset_id>/<relative_path_to_file>`.
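A minimal sketch of the cache-path construction described above. The function name, the `SDK_CACHE_DIR` environment variable, and the default location under the home directory are all assumptions for illustration, not existing SDK behavior:

```python
import os
from pathlib import Path


def cache_path(dataset_id, relative_path, cache_dir=None):
    # Resolve the cache root: explicit argument, then an (assumed)
    # environment variable, then a default under the user's home directory.
    base = Path(
        cache_dir
        or os.environ.get("SDK_CACHE_DIR", Path.home() / ".sdk_cache")
    )
    # Files land in <cache_dir>/<dataset_id>/<relative_path_to_file>
    return base / dataset_id / relative_path
```

Keying the layout on dataset ID plus relative path means two datasets can contain files with the same name without colliding in the cache.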
The user interaction could be a `.local_path()` method on each File object which returns the local path to that file. When the method is called, it could first check whether the file has already been staged; if the local file size matches the size in S3, it wouldn't have to download it again.
This is intended as a discussion issue; I'm not sure exactly what the best implementation is for this particular use case.