
get_catchment_characteristics should batch s3 requests #449

@apsoras

Description


The implementation of get_catchment_characteristics opens and collects the s3 data source once per requested characteristic. As a result, the number of open_dataset/collect calls grows linearly with the number of variables, even when several of them could be pulled from the same s3 source in a single call.
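For illustration, here is a minimal sketch of that per-variable pattern using arrow. The names (`lookup_url()`, `varnames`, `ids`, the `COMID` column) are hypothetical stand-ins, not the package's actual internals:

```r
# Sketch (not the package's actual code) of the current pattern:
# one open_dataset()/collect() round trip per requested characteristic,
# even when several characteristics live in the same s3 object.
library(arrow)
library(dplyr)

get_one_characteristic <- function(s3_url, varname, ids) {
  open_dataset(s3_url) |>
    select(COMID, all_of(varname)) |>  # "COMID" is an assumed ID column
    filter(COMID %in% ids) |>
    collect()
}

# n variables -> n open/collect calls, even when lookup_url()
# (hypothetical) returns the same s3 url for every variable
results <- lapply(varnames, function(v) {
  get_one_characteristic(lookup_url(v), v, ids)
})
```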

Below is a performance comparison for 1, 5, and 9 variable pulls, all for the same 20 catchment IDs. My implementation batches the variable pulls by s3 source, retrieving multiple variables per call where possible rather than one at a time.

[Figure: execution time vs. number of variables for the current and batched implementations; top panel: single s3 url, bottom panel: one s3 url per variable]

The top pane (single url) shows the performance difference when all variables can be pulled from the same s3 url: the batched execution time stays roughly constant as the number of variables grows, while the current function's runtime climbs from ~2 s to ~20 s.

The bottom pane (multi url) shows that the two functions perform the same when each variable comes from a unique s3 bucket (no batching possible), so the performance difference in the top pane is attributable to the batching itself.

Batching variable pulls reduces the number of open and collect calls made against the s3 data source, which is good both for the s3 storage request count and for the user's wait time. A rough sketch of the batched approach follows.
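This sketch assumes a mapping from variable name to s3 source is available; `var_sources`, `ids`, and the `COMID` column are illustrative names, not the package's API:

```r
# Hedged sketch of the batched alternative: group the requested variables
# by their s3 source, then do a single open_dataset()/collect() per
# distinct source, selecting all of that source's variables at once.
library(arrow)
library(dplyr)

# var_sources: named character vector, variable name -> s3 url (assumed input)
by_source <- split(names(var_sources), var_sources)

results <- lapply(names(by_source), function(url) {
  vars <- by_source[[url]]
  open_dataset(url) |>
    select(COMID, all_of(vars)) |>  # "COMID" is an assumed ID column
    filter(COMID %in% ids) |>
    collect()
})
# open/collect calls now scale with the number of distinct s3 sources,
# not with the number of variables.
```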

I already have my own implementation that I can share, and I am happy to submit a PR for this. I would, however, need more information on how the percent missing column is meant to be applied: across the >300 variables I have pulled, I have yet to see that column come out of the s3 storage, and it looks like it is overwritten in the original function anyway.

Thanks!
