Inefficient caching with polars #184

@prockenschaub

Description

Problem

__getitem__ does three full-DataFrame scans per call:

stay_id = self.outcome_df[self.vars["GROUP"]].unique()[idx]      # scan 1: outcome_df
window = self.features_df.filter(pl.col(GROUP) == stay_id)...    # scan 2: features_df (2.8M rows)
labels = self.outcome_df.filter(pl.col(GROUP) == stay_id)...     # scan 3: outcome_df

Every stay lookup re-examines the entire feature frame, which makes building the cache take a couple of minutes. For example, on the AKI dataset (~66k stays, ~2.8M feature rows), building the RAM cache calls __getitem__ ~66k times, each call scanning ~2.8M rows.
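A back-of-the-envelope estimate of the total work under the numbers above (the figures are the approximate ones from this issue, so treat the result as an order of magnitude, not a measurement):

```python
n_stays = 66_000          # ~number of stays in the AKI dataset
n_feature_rows = 2_800_000  # ~rows in features_df

# Each __getitem__ filters the full feature frame once,
# so total predicate evaluations scale as stays x rows.
rows_scanned = n_stays * n_feature_rows
print(f"{rows_scanned:.2e} row comparisons")  # ~1.85e11
```

A dict-based lookup replaces this with a single partitioning pass over each frame, i.e. ~2.8M rows touched once instead of ~66k times.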

Proposed solution

Partition both DataFrames by stay once at __init__, keep the per-stay slices as numpy arrays in dicts, and make __getitem__ do a dict lookup.
