Description
The implementation of get_catchment_characteristics opens and collects the s3 data source once for each column. This means the number of open_dataset/collect calls on the s3 data source grows linearly with the number of variables/characteristics, even when multiple variables could be pulled from the same s3 source in a single call.
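For illustration, here is a minimal sketch of the per-variable pattern described above, assuming an arrow/dplyr workflow; `s3_url_for()`, `varnames`, and `comids` are hypothetical placeholders, not the package's actual internals:

```r
library(arrow)
library(dplyr)

# One open_dataset()/collect() round trip per requested characteristic,
# even when several characteristics live at the same s3 url.
results <- lapply(varnames, function(var) {
  open_dataset(s3_url_for(var)) |>   # one open per variable
    filter(COMID %in% comids) |>
    select(COMID, all_of(var)) |>
    collect()                        # one collect per variable
})
```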
Below is a performance comparison for pulls of 1, 5, and 9 variables, all for the same 20 catchment IDs. My implementation batches the pulls by s3 source, retrieving multiple variables in a single call where possible rather than one at a time.
The top pane (single url) shows the performance difference when all variables can be pulled from the same s3 url: the batched execution time stays roughly constant as the number of variables grows, while the current function's runtime increases from ~2 s to ~20 s.
The bottom pane (multi url) shows that the two functions perform the same when each variable comes from a unique s3 bucket (no batching possible), confirming that the performance difference in the top pane is due to the batching alone.
Batching variable pulls decreases the number of open and collect calls made against the s3 data source, which reduces the s3 request count and the user's wait time.
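To make the batching idea concrete, here is a rough sketch under the same assumptions as above (hypothetical `s3_url_for()`, `varnames`, and `comids`), grouping the requested characteristics by source url so each unique url is opened and collected only once:

```r
library(arrow)
library(dplyr)

# One open_dataset()/collect() per unique s3 url, not per variable.
vars_by_url <- split(varnames, s3_url_for(varnames))

results <- lapply(names(vars_by_url), function(url) {
  open_dataset(url) |>
    filter(COMID %in% comids) |>
    select(COMID, all_of(vars_by_url[[url]])) |>  # all variables from this url at once
    collect()
})
```

With this grouping, the number of s3 requests scales with the number of distinct source urls rather than with the number of requested variables.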
I already have my own implementation that I can share, and I am happy to submit a PR for this. I would, however, need more information on how the percent missing column is meant to be applied: across the >300 variables I have pulled, I have yet to see that column come out of the s3 storage (and it looks like it is overwritten in the original function anyway?).
Thanks!