Problem
`__getitem__` does three full-DataFrame scans per call:

```python
stay_id = self.outcome_df[self.vars["GROUP"]].unique()[idx]   # scan 1: outcome_df
window = self.features_df.filter(pl.col(GROUP) == stay_id)... # scan 2: features_df (2.8M rows)
labels = self.outcome_df.filter(pl.col(GROUP) == stay_id)...  # scan 3: outcome_df
```
Every stay lookup re-examines the entire feature frame, so building the RAM cache takes a couple of minutes. For the AKI dataset (~66k stays, ~2.8M feature rows), the cache build calls `__getitem__` ~66k times, each call scanning ~2.8M rows — on the order of 10^11 row comparisons in total.
Proposed solution
Partition both DataFrames by stay once at __init__, keep the per-stay slices as numpy arrays in dicts, and make __getitem__ do a dict lookup.