
Fixes + Convenience updates [Vectorized ram_cache, multi label padding, debug sampling]#183

Open
unartig wants to merge 12 commits into rvandewater:development from unartig:development

Conversation

@unartig
Contributor

@unartig unartig commented Apr 23, 2026

Summary:

  1. allow specifying train_size when complete_train=True
  2. fix an overwrite-instead-of-update error and fix debug sampling
  3. use inner joins instead of left joins, which should avoid unwanted null rows
  4. column names are now ordered alphabetically
  5. updated padding logic in the Polars datasets, which can now take more than one label (behavior should be unchanged if only one label is used)
  6. buildup in PolarsPredictionDataset now uses a vectorized scheme, offering a massive speedup

(only the Polars paths were touched; the deprecated Pandas code is unchanged)
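The multi-label padding change (point 5) can be sketched roughly as below. This is a hedged NumPy illustration, not the PR's actual PolarsDataset code; `pad_labels` is a hypothetical name. The idea is that labels are always kept 2-D, so a single label column is just the `n_labels == 1` case of the same code path:

```python
import numpy as np

def pad_labels(labels: np.ndarray, max_len: int) -> np.ndarray:
    """Pad a per-timestep label array to (max_len, n_labels) with NaNs.

    A 1-D input (single label column) is reshaped to (T, 1), so the
    single-label behaviour is unchanged, just expressed in 2-D form.
    """
    labels = np.asarray(labels, dtype=float).reshape(len(labels), -1)
    padded = np.full((max_len, labels.shape[1]), np.nan)
    padded[: len(labels)] = labels
    return padded
```

With one label column the result has shape `(max_len, 1)`; with several columns the same call pads all of them at once.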

Summary by CodeRabbit

Release Notes

  • New Features

    • Added RAM cache functionality for faster dataset loading.
  • Bug Fixes

    • Fixed train/validation splitting behavior when complete training is enabled.
    • Improved handling of missing values in label data.
    • Enhanced debug mode sampling for better data representation.
  • Improvements

    • Standardized dataset feature column ordering for consistency.
    • Simplified preprocessing logic.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@unartig has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 7 minutes and 19 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 7 minutes and 19 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 09b08f1e-675e-4d13-94a8-037b3bce0a87

📥 Commits

Reviewing files that changed from the base of the PR and between ef999f7 and 0b5ae48.

📒 Files selected for processing (3)
  • icu_benchmarks/data/loader.py
  • icu_benchmarks/data/preprocessor.py
  • icu_benchmarks/data/split_process_data.py

Walkthrough

The PR modifies data loading, preprocessing, and splitting logic across three files: it reorders dataset columns and enhances cache handling, simplifies preprocessing transformers, and improves train/val splitting with configurable sizing and refined debug sampling.

Changes

  • Data Loader Enhancements (icu_benchmarks/data/loader.py): Reorders dataset feature columns with stay_id first, followed by non-indicator columns alphabetically, then indicator columns; updates cache access to index per tensor element; adjusts label padding to enforce 2D format; adds a ram_cache method for RAM-based full-tensor caching with NaN handling.
  • Preprocessing Simplification (icu_benchmarks/data/preprocessor.py): Removes redundant outer parentheses from FunctionTransformer lambdas in regression outcome scaling for both Polars and Pandas preprocessors.
  • Splitting and Preprocessing Logic (icu_benchmarks/data/split_process_data.py): Updates the train/val splitter to respect user-provided train_size when complete_train is enabled; refines debug mode to sample by unique stay_id fractions; applies null/NaN handling to the current dataframe in the loop; changes join operations from left to inner with updated key references.
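The "vectorized scheme" for the RAM cache buildup might look roughly like the following NumPy sketch. This is an illustration under stated assumptions, not the PR's actual loader.py code: `build_ram_cache`, its argument shapes, and the requirement that rows are pre-sorted by stay_id are all assumptions here.

```python
import numpy as np

def build_ram_cache(features: np.ndarray, stay_ids: np.ndarray):
    """Pad every stay to the longest stay in one vectorized scatter.

    features: (n_rows, n_features), rows sorted by stay_id.
    stay_ids: (n_rows,) group key per row.
    Returns padded (n_stays, max_len, n_features) and mask (n_stays, max_len).
    """
    # Counts per stay; np.unique sorts keys, matching the sorted row order.
    _, lengths = np.unique(stay_ids, return_counts=True)
    max_len = int(lengths.max())
    n_stays = len(lengths)
    padded = np.full((n_stays, max_len, features.shape[1]), np.nan)
    mask = np.zeros((n_stays, max_len), dtype=bool)
    # Build (stay, timestep) coordinates for every row, then scatter once
    # instead of looping over stays row by row.
    row_idx = np.repeat(np.arange(n_stays), lengths)
    col_idx = np.concatenate([np.arange(n) for n in lengths])
    padded[row_idx, col_idx] = features
    mask[row_idx, col_idx] = True
    return padded, mask
```

The single scatter assignment is what replaces a per-stay Python loop and is where the speedup would come from.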

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop, hop through data streams we go,
Columns sorted, tensors flow,
Cache reordered, labels padded true,
RAM compiled, our split renewed.
Simplified lambdas, joins precise—
Data pipelines now so nice!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 55.56%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title mentions three key improvements (vectorized ram_cache, multi-label padding, debug sampling) but does not clearly communicate the primary objective of the PR. Resolution: consider a more descriptive title that leads with the primary fix or feature, such as "Support vectorized RAM caching and multi-label padding in PredictionPolarsDataset", or break it into subtitles to clarify priorities.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
icu_benchmarks/data/split_process_data.py (1)

552-572: ⚠️ Potential issue | 🟠 Major

Sort stays to match labels ordering in stratified k-fold split.

Line 552 creates stays from .unique() which does not guarantee order, while line 556 creates labels sorted by _id. When these are passed to outer_cv.split(stays, labels) and inner_cv.split(dev_stays, labels[dev]), misaligned indices silently corrupt the stratified split's class assignments. For string stay_ids this causes index 0 to reference different stays than index 0 of labels.

🔧 Proposed fix
-    stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique())
+    stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique().sort())
     # If there are labels, and the task is classification, use stratified k-fold
     if VarType.label in vars and runmode is RunMode.classification:
         # Get labels from outcome data (takes the highest value (or True) in case seq2seq classification)
         labels: pl.Series = data[DataSegment.outcome].group_by(_id).max().sort(_id)[vars[VarType.label]]

Note: the same misalignment concern applies to make_train_val_polars at lines 378/381.
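A tiny plain-Python illustration of the misalignment described above, using hypothetical string stay ids (not values from the actual dataset):

```python
# Order that .unique() happened to return, vs. labels built with .sort(_id).
stays_unordered = ["s10", "s2", "s1"]
labels_by_id = {"s1": 0, "s10": 1, "s2": 0}
label_values = [labels_by_id[k] for k in sorted(labels_by_id)]  # sorted by _id

# Index 0 of the stays array is "s10", but index 0 of labels belongs to "s1",
# so cv.split(stays, labels) would pair stays with the wrong class labels.
assert stays_unordered[0] != sorted(labels_by_id)[0]

# Sorting the stays restores the alignment the stratified split assumes.
stays_sorted = sorted(stays_unordered)
assert stays_sorted == sorted(labels_by_id)
```

With integer stay_ids the unordered and sorted sequences often coincide by accident, which is why string ids make the bug visible.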

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@icu_benchmarks/data/split_process_data.py` around lines 552 - 572, stays is
built from data[DataSegment.outcome][_id].unique() which can be unordered while
labels is explicitly sorted by _id; to fix, ensure stays uses the exact same
ordering as labels before calling outer_cv.split and inner_cv.split (e.g.,
derive stays from the sorted labels index or sort stays with the same key used
for labels) so indices align; apply the same change in make_train_val_polars
where stays/ids are created (ensure dev_stays is sliced from an ordered stays
that matches labels ordering before inner_cv.split).
🧹 Nitpick comments (1)
icu_benchmarks/data/split_process_data.py (1)

174-176: Minor: inconsistent source for the complete-train splitter.

The non-complete-train branch (line 162) passes sanitized_data, while the complete-train branch passes the raw data. Since check_sanitize_data and modality_selection mutate the underlying dict in place, the two are currently the same object — but relying on that identity is brittle. For consistency and to guard against future refactors where one of those functions starts returning a new dict, pass sanitized_data here as well.

-        sanitized_data = make_train_val_polars(data, vars, train_size=train_size, seed=seed, debug=debug, runmode=runmode)
+        sanitized_data = make_train_val_polars(
+            sanitized_data, vars, train_size=train_size, seed=seed, debug=debug, runmode=runmode
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@icu_benchmarks/data/split_process_data.py` around lines 174 - 176, The
complete-train branch calls make_train_val_polars with the raw data object,
which is inconsistent and brittle because check_sanitize_data and
modality_selection may mutate or later return a new dict; change the call in the
full-train branch to pass sanitized_data (the same variable used in the other
branch) to make_train_val_polars and keep the same train_size, seed, debug, and
runmode parameters to ensure consistent behavior regardless of in-place
mutations.
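A minimal stand-alone sketch of why relying on in-place mutation identity is brittle; the `sanitize` functions here are hypothetical stand-ins, not the project's actual check_sanitize_data:

```python
def sanitize(d):
    d.pop("bad", None)  # mutates in place and returns the same object
    return d

data = {"good": 1, "bad": 2}
sanitized = sanitize(data)
assert sanitized is data  # identical today, so passing either name "works"

def sanitize_v2(d):
    # A future refactor that returns a new dict instead of mutating.
    return {k: v for k, v in d.items() if k != "bad"}

sanitized2 = sanitize_v2(data)
assert sanitized2 is not data  # now the two names silently diverge
```

Once the function returns a fresh dict, any call site still passing the raw `data` keeps working on stale contents, which is exactly the hazard the nitpick guards against.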
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@icu_benchmarks/data/loader.py`:
- Around line 50-57: The code hardcodes m_index = ["stay_id"] causing the group
column to be omitted/misordered when the dataset uses a different group name;
replace the literal with the configured group variable by setting m_index =
[self.vars["GROUP"]], then rebuild front/back using that m_index so the group
column is preserved first in self.features_df; also ensure the rest of the class
(e.g., functions referencing self.features_df and ram_cache logic that calls
pl.exclude(self.vars["GROUP"]) and computes n_features) will now align with the
corrected first-column ordering.
- Around line 149-188: ram_cache currently assumes outcome_df has one label per
timestep; detect the single-label-per-stay case (when outcome_df.shape[0] ==
n_stays) and expand labels_np into per-timestep rows before slicing by offsets:
for each stay create an array of shape (length, n_labels) that is all NaNs
except the final timestep contains the original stay label (mirroring the logic
in __getitem__ that pads single-labels), then use that expanded labels array for
padded_labels population and nan masking; update references to labels_np,
padded_labels, pad_mask, and ram_cache accordingly and add a unit test that
constructs a single-label classification dataset with ram_cache=True to lock
this behavior.
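The single-label expansion described in that prompt could be sketched as follows. This is a hedged NumPy illustration; `expand_single_labels` is a hypothetical name and the shapes are assumptions, not the PR's actual loader.py code:

```python
import numpy as np

def expand_single_labels(labels: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Expand one label per stay into per-timestep rows.

    Every row is NaN except the final timestep of each stay, which
    carries that stay's label (mirroring single-label padding logic).
    """
    labels_2d = labels.reshape(len(labels), -1).astype(float)
    expanded = np.full((int(lengths.sum()), labels_2d.shape[1]), np.nan)
    # The last row of each stay sits at its cumulative end offset minus one.
    end_rows = np.cumsum(lengths) - 1
    expanded[end_rows] = labels_2d
    return expanded
```

After this expansion, the per-timestep slicing by offsets works identically for the single-label-per-stay and label-per-timestep cases.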

---

Outside diff comments:
In `@icu_benchmarks/data/split_process_data.py`:
- Around line 552-572: stays is built from
data[DataSegment.outcome][_id].unique() which can be unordered while labels is
explicitly sorted by _id; to fix, ensure stays uses the exact same ordering as
labels before calling outer_cv.split and inner_cv.split (e.g., derive stays from
the sorted labels index or sort stays with the same key used for labels) so
indices align; apply the same change in make_train_val_polars where stays/ids
are created (ensure dev_stays is sliced from an ordered stays that matches
labels ordering before inner_cv.split).

---

Nitpick comments:
In `@icu_benchmarks/data/split_process_data.py`:
- Around line 174-176: The complete-train branch calls make_train_val_polars
with the raw data object, which is inconsistent and brittle because
check_sanitize_data and modality_selection may mutate or later return a new
dict; change the call in the full-train branch to pass sanitized_data (the same
variable used in the other branch) to make_train_val_polars and keep the same
train_size, seed, debug, and runmode parameters to ensure consistent behavior
regardless of in-place mutations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9d330f83-863a-490c-ae7f-61246332cbac

📥 Commits

Reviewing files that changed from the base of the PR and between ef999f7 and 178bfd0.

📒 Files selected for processing (3)
  • icu_benchmarks/data/loader.py
  • icu_benchmarks/data/preprocessor.py
  • icu_benchmarks/data/split_process_data.py

Comment threads (2): icu_benchmarks/data/loader.py
@unartig
Contributor Author

unartig commented Apr 23, 2026

@coderabbitai any things still missing for this PR?

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh issue list --repo rvandewater/YAIB --state open --limit 50

Length of output: 1192


@unartig: Yes, there are 14 open issues on the repository. Here's a summary:

🐛 Bugs / Potential Bugs

  • #182: Logistic Regression Model Not converging (opened 2026-04-13)
  • #181: Bug: W&B API Key length validation error (40-character limit) (opened 2026-03-19)
  • #178: Possible Bug: Scaled MAE in Regression Tasks (opened 2025-12-12)

🔧 Feature Requests / Proposals

  • #179: Proposing a new prediction task: Causative Pathogen Identification (opened 2026-03-04)
  • #162: PyPOTS applies additional amputation (help wanted) (opened 2025-02-14)
  • #158: Optimize __getitem__ method using more Polars operations (opened 2024-10-17)
  • #157: Creating a Polars-based version of ImputationPandasDataset (opened 2024-10-17)
  • #154: Automatically infer variables in a dataset (enhancement) (opened 2024-09-18)
  • #146: Memory optimisation for Temporal Fusion Transformer (opened 2024-06-04)
  • #143: Add cross-correlation as imputation evaluation metric (enhancement) (opened 2023-10-30)
  • #97: Model Calibration (enhancement) (opened 2022-12-16)
  • #92: Evaluate replacing argparse with abseil (enhancement) (opened 2022-12-13)
  • #90: Write tests for train and evaluate (enhancement) (opened 2024-04-15)
  • #67: Add Sampling Options (enhancement) (opened 2024-04-15)

Worth noting: #158 (optimize __getitem__ with Polars) is directly related to this PR — the new ram_cache vectorized approach in PredictionPolarsDataset partially addresses that optimization goal. You may want to consider closing or updating that issue once this PR is merged.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== PR Reviews ==="
gh pr view 183 --repo rvandewater/YAIB --json reviews,reviewDecision,statusCheckRollup,comments | jq .

echo "=== PR Review Comments (inline) ==="
gh api repos/rvandewater/YAIB/pulls/183/comments | jq '[.[] | {id, path, line, body, user: .user.login, resolved: .position}]'

echo "=== PR Comments (issue-level) ==="
gh api repos/rvandewater/YAIB/issues/183/comments | jq '[.[] | {id, user: .user.login, body: .body[:300]}]'

echo "=== PR Checks ==="
gh pr checks 183 --repo rvandewater/YAIB

Length of output: 36203


@unartig: Good timing to check! Here's a thorough status breakdown:


✅ What's done

  • Lint (Ruff) CI: Passing ✅
  • Both critical inline comments: marked as addressed in commits 786c86a and 7dc68bf:
    • Hardcoded "stay_id" → self.vars["GROUP"] fix in loader.py
    • ram_cache single-label-per-stay backwards-compatibility fix in loader.py

⚠️ What still needs attention

1. 🔴 Latest commits were never reviewed (rate-limited)

The review of commits 178bfd0 through 7dc68bf was cut short by a rate limit before CodeRabbit could inspect your fixes. You should trigger a fresh review to confirm the addressed comments are correct.

2. ❌ Pre-merge checks failing

  • Docstring Coverage (⚠️ Warning): 55.56%; the threshold is 80%. New methods like ram_cache are likely missing docstrings.
  • PR Title (❓ Inconclusive): "Fixes + Convenience updates" is too generic. Consider something like: "Support train_size with complete_train, fix data overwrite, and add multi-label padding"

3. 🟠 Outside-diff comment — may still be open

An outside-diff comment (couldn't be posted inline) flagged a potential silent bug:

stays ordering in stratified k-fold split (split_process_data.py, lines 552–572): stays is built from .unique() (unordered) while labels is sorted by _id. Passing misaligned arrays to outer_cv.split(stays, labels) silently corrupts stratified fold class assignments.

The suggested fix was:

- stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique())
+ stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique().sort())

The same concern applies to make_train_val_polars at lines 378/381. Please verify if this was addressed in the latest commits.

4. 🟡 Nitpick — open

split_process_data.py lines 174–176: The complete-train branch passes raw data to make_train_val_polars instead of sanitized_data. Currently harmless (same object in practice), but brittle for future refactors. Consider passing sanitized_data for consistency.

5. 🔵 No human reviewer approval yet

The reviewDecision is currently empty — no maintainer has formally approved this PR.


💡 Recommended next steps

  1. Add a docstring to ram_cache (and any other new/modified methods lacking them).
  2. Update the PR title to be more descriptive.
  3. Verify the stays ordering fix was applied (or apply it if not).
  4. Trigger a new review — I'll re-inspect all the latest commits cleanly.

Would you like me to trigger a fresh review now?

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@unartig changed the title from "Fixes + Convenience updates" to "Fixes + Convenience updates [Vectorized ram_cache, multi label padding, debug sampling]" on Apr 23, 2026
@unartig
Contributor Author

unartig commented Apr 23, 2026

@coderabbitai go ahead

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

@unartig: Triggering a full review now — I'll inspect all the latest commits cleanly, including the fixes from 786c86a and 7dc68bf. Hang tight!

✅ Actions performed

Full review triggered.

@prockenschaub
Collaborator

@unartig it seems like you ran into the same cache issue that I noticed this week after revisiting YAIB. I had already opened #186 before I saw your PR.

I think our implementations address fundamentally the same thing. A couple of differences I could spot while skimming over it:

  • I use polars' partition_by and keep everything in a stay_id-indexed dict, you seem to do the partitioning yourself and store everything in a stacked tensor. Using partition allows my changes to be fewer lines of changed code.
  • I currently store the partitions in addition to features_df and outcome_df, which has some memory overhead. This could be changed to overwrite the base DataFrames.
  • I had my partition step already in __init__; yours only does this during caching. I think your version makes sense for the ML models, which don't need this step anyway. At the same time, doing the partitioning is still useful for DL models if ram_cache=False.

@rvandewater I am happy with either as long as the speed-up remains :)
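A plain-Python stand-in for the partition_by approach described above; polars' `DataFrame.partition_by("stay_id", as_dict=True)` returns a comparable stay-indexed mapping of sub-frames, and this sketch only mimics that shape with a dict of row lists:

```python
from collections import defaultdict

def partition_by_stay(rows):
    """Group feature rows into a stay_id-indexed dict, mimicking the
    shape of polars' DataFrame.partition_by("stay_id", as_dict=True)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["stay_id"]].append(row)
    return dict(parts)
```

A dataset's `__getitem__` can then fetch one stay's rows by key instead of filtering the full frame on every access, which is where the speedup in both implementations comes from.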

@unartig
Contributor Author

unartig commented Apr 24, 2026

Hey @prockenschaub, I very much prefer your version over mine; it does make sense to do it in __init__. I basically chose the ram_cache approach because it felt the least invasive at the time.
That said, I would still like to see my other changes make it through (especially the >1 labels and train_size), since I heavily rely on them in my project.

