Fixes + Convenience updates [Vectorized ram_cache, multi label padding, debug sampling] #183
unartig wants to merge 12 commits into rvandewater:development
Conversation
Walkthrough

The PR modifies data loading, preprocessing, and splitting logic across three files: it reorders dataset columns and enhances cache handling, simplifies the preprocessing transformers, and improves train/val splitting logic with configurable sizing and refined debug sampling.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 2 failed (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
icu_benchmarks/data/split_process_data.py (1)
552-572: ⚠️ Potential issue | 🟠 Major

Sort `stays` to match `labels` ordering in stratified k-fold split.

Line 552 creates `stays` from `.unique()`, which does not guarantee order, while line 556 creates `labels` sorted by `_id`. When these are passed to `outer_cv.split(stays, labels)` and `inner_cv.split(dev_stays, labels[dev])`, misaligned indices silently corrupt the stratified split's class assignments. For string stay_ids this causes index 0 to reference different stays than index 0 of labels.

🔧 Proposed fix

```diff
-    stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique())
+    stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique().sort())
     # If there are labels, and the task is classification, use stratified k-fold
     if VarType.label in vars and runmode is RunMode.classification:
         # Get labels from outcome data (takes the highest value (or True) in case seq2seq classification)
         labels: pl.Series = data[DataSegment.outcome].group_by(_id).max().sort(_id)[vars[VarType.label]]
```

Note: the same misalignment concern applies to `make_train_val_polars` at lines 378/381.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@icu_benchmarks/data/split_process_data.py` around lines 552 - 572, stays is built from data[DataSegment.outcome][_id].unique() which can be unordered while labels is explicitly sorted by _id; to fix, ensure stays uses the exact same ordering as labels before calling outer_cv.split and inner_cv.split (e.g., derive stays from the sorted labels index or sort stays with the same key used for labels) so indices align; apply the same change in make_train_val_polars where stays/ids are created (ensure dev_stays is sliced from an ordered stays that matches labels ordering before inner_cv.split).
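For illustration, here is a minimal standalone sketch (made-up stay ids, recent polars and scikit-learn assumed; not the benchmark's actual data or implementation) of why both series must share one ordering before reaching the splitter:

```python
import polars as pl
from sklearn.model_selection import StratifiedKFold

# Toy outcome frame; the unique() order of stay_ids is not guaranteed to be sorted.
outcome = pl.DataFrame({
    "stay_id": ["c", "a", "d", "b", "c"],
    "label":   [1,   0,   1,   0,   1],
})

# Sorting both by stay_id makes position i of `stays` refer to the same stay as position i of `labels`.
stays = outcome["stay_id"].unique().sort()
labels = outcome.group_by("stay_id").agg(pl.col("label").max()).sort("stay_id")["label"]

stays_np, labels_np = stays.to_numpy(), labels.to_numpy()
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(stays_np, labels_np):
    # Indices returned by the splitter now address aligned entries in both arrays.
    print(stays_np[train_idx].tolist(), labels_np[train_idx].tolist())
```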
🧹 Nitpick comments (1)
icu_benchmarks/data/split_process_data.py (1)
174-176: Minor: inconsistent source for the complete-train splitter.

The non-complete-train branch (line 162) passes `sanitized_data`, while the complete-train branch passes the raw `data`. Since `check_sanitize_data` and `modality_selection` mutate the underlying dict in place, the two are currently the same object, but relying on that identity is brittle. For consistency, and to guard against future refactors where one of those functions starts returning a new dict, pass `sanitized_data` here as well.

```diff
-    sanitized_data = make_train_val_polars(data, vars, train_size=train_size, seed=seed, debug=debug, runmode=runmode)
+    sanitized_data = make_train_val_polars(
+        sanitized_data, vars, train_size=train_size, seed=seed, debug=debug, runmode=runmode
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@icu_benchmarks/data/split_process_data.py` around lines 174 - 176, The complete-train branch calls make_train_val_polars with the raw data object, which is inconsistent and brittle because check_sanitize_data and modality_selection may mutate or later return a new dict; change the call in the full-train branch to pass sanitized_data (the same variable used in the other branch) to make_train_val_polars and keep the same train_size, seed, debug, and runmode parameters to ensure consistent behavior regardless of in-place mutations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@icu_benchmarks/data/loader.py`:
- Around line 50-57: The code hardcodes m_index = ["stay_id"] causing the group
column to be omitted/misordered when the dataset uses a different group name;
replace the literal with the configured group variable by setting m_index =
[self.vars["GROUP"]], then rebuild front/back using that m_index so the group
column is preserved first in self.features_df; also ensure the rest of the class
(e.g., functions referencing self.features_df and ram_cache logic that calls
pl.exclude(self.vars["GROUP"]) and computes n_features) will now align with the
corrected first-column ordering.
- Around line 149-188: ram_cache currently assumes outcome_df has one label per
timestep; detect the single-label-per-stay case (when outcome_df.shape[0] ==
n_stays) and expand labels_np into per-timestep rows before slicing by offsets:
for each stay create an array of shape (length, n_labels) that is all NaNs
except the final timestep contains the original stay label (mirroring the logic
in __getitem__ that pads single-labels), then use that expanded labels array for
padded_labels population and nan masking; update references to labels_np,
padded_labels, pad_mask, and ram_cache accordingly and add a unit test that
constructs a single-label classification dataset with ram_cache=True to lock
this behavior.
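As a rough sketch of the first suggestion (hypothetical frame, column names, and helper; only the idea of deriving the index column from the configured group variable rather than a hardcoded "stay_id" comes from the comment above):

```python
import polars as pl

def group_column_first(features_df: pl.DataFrame, group_col: str) -> pl.DataFrame:
    """Move the configured group column to the front, keeping the remaining column order."""
    back = [c for c in features_df.columns if c != group_col]
    return features_df.select([group_col, *back])

# Works for any configured group name, e.g. "icustay_id" instead of "stay_id".
df = pl.DataFrame({"hr": [80, 82], "icustay_id": [1, 1], "map": [65.0, 70.0]})
print(group_column_first(df, "icustay_id").columns)  # ['icustay_id', 'hr', 'map']
```

And a sketch of the per-stay label expansion described in the second comment (function name, shapes, and the NaN convention are assumptions, not the repository's actual code):

```python
import numpy as np

def expand_stay_labels(stay_labels: np.ndarray, lengths: list) -> np.ndarray:
    """One label row per stay -> per-timestep rows that are NaN everywhere
    except the stay's final timestep, which carries the original label."""
    stay_labels = np.asarray(stay_labels, dtype=float)
    blocks = []
    for label_row, length in zip(stay_labels, lengths):
        block = np.full((length, stay_labels.shape[1]), np.nan)
        block[-1] = label_row  # only the last timestep of each stay is labelled
        blocks.append(block)
    return np.concatenate(blocks, axis=0)

# Two stays of lengths 3 and 2, one binary label each -> 5 rows, labels only at each stay's last step.
print(expand_stay_labels(np.array([[1.0], [0.0]]), [3, 2]))
```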
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 9d330f83-863a-490c-ae7f-61246332cbac
📒 Files selected for processing (3)
- icu_benchmarks/data/loader.py
- icu_benchmarks/data/preprocessor.py
- icu_benchmarks/data/split_process_data.py
@coderabbitai anything still missing for this PR?
🐛 Bugs / Potential Bugs
🔧 Feature Requests / Proposals
Worth noting: #158 (optimize …
✅ What's done
| Check | Status | Detail |
|---|---|---|
| Docstring Coverage | ⚠️ Warning | 55.56%, below the 80% threshold. New methods like `ram_cache` are likely missing docstrings. |
| PR Title | ❓ Inconclusive | "Fixes + Convenience updates" is too generic. Consider something like: "Support train_size with complete_train, fix data overwrite, and add multi-label padding" |
3. 🟠 Outside-diff comment — may still be open
An outside-diff comment (couldn't be posted inline) flagged a potential silent bug:
`stays` ordering in stratified k-fold split (`split_process_data.py`, lines 552–572): `stays` is built from `.unique()` (unordered) while `labels` is sorted by `_id`. Passing misaligned arrays to `outer_cv.split(stays, labels)` silently corrupts stratified fold class assignments.
The suggested fix was:
```diff
- stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique())
+ stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique().sort())
```

The same concern applies to `make_train_val_polars` at lines 378/381. Please verify if this was addressed in the latest commits.
4. 🟡 Nitpick — open
`split_process_data.py` lines 174–176: The complete-train branch passes the raw `data` to `make_train_val_polars` instead of `sanitized_data`. Currently harmless (same object in practice), but brittle for future refactors. Consider passing `sanitized_data` for consistency.
5. 🔵 No human reviewer approval yet
The reviewDecision is currently empty — no maintainer has formally approved this PR.
💡 Recommended next steps
- Add a docstring to `ram_cache` (and any other new/modified methods lacking them); a sketch of what that could look like follows this list.
- Update the PR title to be more descriptive.
- Verify the `stays` ordering fix was applied (or apply it if not).
- Trigger a new review; I'll re-inspect all the latest commits cleanly.
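To illustrate the first point, a docstring along these lines would satisfy the coverage check; the signature shown is an assumption rather than the actual one in `loader.py`:

```python
def ram_cache(self, cache: bool = True) -> None:
    """Pre-compute padded feature and label arrays and keep them in RAM.

    Vectorizes the per-stay padding otherwise done in ``__getitem__`` so that
    later epochs read from memory instead of re-slicing the dataframes.

    Args:
        cache: If True, build and store the cached arrays; if False, fall back
            to on-the-fly slicing. (Illustrative only; the real method may take
            different parameters.)
    """
```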
Would you like me to trigger a fresh review now?
✅ Actions performed
Review triggered.
Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.
@coderabbitai go ahead

✅ Actions performed
Full review triggered.
@unartig it seems like you ran into the same cache issue that I noticed this week after revisiting YAIB. I already opened #186 before I saw your PR. I think our implementations address fundamentally the same thing. A couple of differences I could spot while skimming over it:

@rvandewater I am happy with either as long as the speed-up remains :)

Hey @prockenschaub, I very much prefer your version over mine, it does make sense to do it in
Summary:
- Support `train_size` when `complete_train=True` (only touched the polars paths; deprecated pandas is unchanged)
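A rough sketch of what that amounts to, with illustrative names and data rather than the actual code in `split_process_data.py`:

```python
import polars as pl
from sklearn.model_selection import train_test_split

def split_stays(outcome: pl.DataFrame, id_col: str, train_size: float = 0.8, seed: int = 42):
    """Illustrative only: split unique stay ids into train/val sets of configurable size."""
    stays = outcome[id_col].unique().sort().to_list()
    train_ids, val_ids = train_test_split(stays, train_size=train_size, random_state=seed)
    return set(train_ids), set(val_ids)

outcome = pl.DataFrame({"stay_id": [1, 1, 2, 3, 3, 4, 5], "label": [0, 0, 1, 0, 0, 1, 1]})
train_ids, val_ids = split_stays(outcome, "stay_id", train_size=0.8, seed=0)
print(sorted(train_ids), sorted(val_ids))
```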