
Fixes + Convenience updates [Vectorized ram_cache, multi label padding, debug sampling]#183

Open
unartig wants to merge 12 commits into rvandewater:development from unartig:development

Conversation

@unartig
Contributor

@unartig unartig commented Apr 23, 2026

Summary:

  1. allow specifying train_size when complete_train=True
  2. fix an overwrite-instead-of-update error and fix debug sampling
  3. use inner joins instead of left joins, which should avoid unwanted null rows
  4. column names are now ordered alphabetically
  5. updated padding logic in the Polars datasets, which can now take more than one label (behavior should be unchanged if only one label is used)
  6. buildup in PolarsPredictionDataset now uses a vectorized scheme, offering a massive speedup

(only the Polars paths were touched; the deprecated Pandas code is unchanged)
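The multi-label padding change (point 5) can be sketched roughly as below. This is a hedged NumPy illustration, not the PR's actual PolarsDataset code; `pad_labels` is a hypothetical name. The idea is that labels are always kept 2-D, so a single label column is just the `n_labels == 1` case of the same code path:

```python
import numpy as np

def pad_labels(labels: np.ndarray, max_len: int) -> np.ndarray:
    """Pad a per-timestep label array to (max_len, n_labels) with NaNs.

    A 1-D input (single label column) is reshaped to (T, 1), so the
    single-label behaviour is unchanged, just expressed in 2-D form.
    """
    labels = np.asarray(labels, dtype=float).reshape(len(labels), -1)
    padded = np.full((max_len, labels.shape[1]), np.nan)
    padded[: len(labels)] = labels
    return padded
```

With one label column the result has shape `(max_len, 1)`; with several columns the same call pads all of them at once.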

Summary by CodeRabbit

Release Notes

  • New Features

    • Added RAM cache functionality for faster dataset loading.
  • Bug Fixes

    • Fixed train/validation splitting behavior when complete training is enabled.
    • Improved handling of missing values in label data.
    • Enhanced debug mode sampling for better data representation.
  • Improvements

    • Standardized dataset feature column ordering for consistency.
    • Simplified preprocessing logic.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Warning

Rate limit exceeded

@unartig has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 7 minutes and 19 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 7 minutes and 19 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 09b08f1e-675e-4d13-94a8-037b3bce0a87

📥 Commits

Reviewing files that changed from the base of the PR and between ef999f7 and 0b5ae48.

📒 Files selected for processing (3)
  • icu_benchmarks/data/loader.py
  • icu_benchmarks/data/preprocessor.py
  • icu_benchmarks/data/split_process_data.py

Walkthrough

The PR modifies data loading, preprocessing, and splitting logic across three files: it reorders dataset columns and enhances cache handling, simplifies preprocessing transformers, and improves train/val splitting with configurable sizing and refined debug sampling.

Changes

  • Data Loader Enhancements (icu_benchmarks/data/loader.py): Reorders dataset feature columns with stay_id first, followed by non-indicator columns alphabetically, then indicator columns; updates cache access to index per tensor element; adjusts label padding to enforce 2D format; adds a ram_cache method for RAM-based full-tensor caching with NaN handling.
  • Preprocessing Simplification (icu_benchmarks/data/preprocessor.py): Removes redundant outer parentheses from FunctionTransformer lambdas in regression outcome scaling for both Polars and Pandas preprocessors.
  • Splitting and Preprocessing Logic (icu_benchmarks/data/split_process_data.py): Updates the train/val splitter to respect user-provided train_size when complete_train is enabled; refines debug mode to sample by unique stay_id fractions; applies null/NaN handling to the current dataframe in the loop; changes join operations from left to inner with updated key references.
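The "vectorized scheme" for the RAM cache buildup might look roughly like the following NumPy sketch. This is an illustration under stated assumptions, not the PR's actual loader.py code: `build_ram_cache`, its argument shapes, and the requirement that rows are pre-sorted by stay_id are all assumptions here.

```python
import numpy as np

def build_ram_cache(features: np.ndarray, stay_ids: np.ndarray):
    """Pad every stay to the longest stay in one vectorized scatter.

    features: (n_rows, n_features), rows sorted by stay_id.
    stay_ids: (n_rows,) group key per row.
    Returns padded (n_stays, max_len, n_features) and mask (n_stays, max_len).
    """
    # Counts per stay; np.unique sorts keys, matching the sorted row order.
    _, lengths = np.unique(stay_ids, return_counts=True)
    max_len = int(lengths.max())
    n_stays = len(lengths)
    padded = np.full((n_stays, max_len, features.shape[1]), np.nan)
    mask = np.zeros((n_stays, max_len), dtype=bool)
    # Build (stay, timestep) coordinates for every row, then scatter once
    # instead of looping over stays row by row.
    row_idx = np.repeat(np.arange(n_stays), lengths)
    col_idx = np.concatenate([np.arange(n) for n in lengths])
    padded[row_idx, col_idx] = features
    mask[row_idx, col_idx] = True
    return padded, mask
```

The single scatter assignment is what replaces a per-stay Python loop and is where the speedup would come from.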

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 Hop, hop through data streams we go,
Columns sorted, tensors flow,
Cache reordered, labels padded true,
RAM compiled, our split renewed.
Simplified lambdas, joins precise—
Data pipelines now so nice!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 55.56%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title mentions three key improvements (vectorized ram_cache, multi-label padding, debug sampling) but does not clearly communicate the primary objective of the PR. Resolution: consider a more descriptive title that leads with the primary fix or feature, such as "Support vectorized RAM caching and multi-label padding in PredictionPolarsDataset", or break it into subtitles to clarify priorities.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
icu_benchmarks/data/split_process_data.py (1)

552-572: ⚠️ Potential issue | 🟠 Major

Sort stays to match labels ordering in stratified k-fold split.

Line 552 creates stays from .unique() which does not guarantee order, while line 556 creates labels sorted by _id. When these are passed to outer_cv.split(stays, labels) and inner_cv.split(dev_stays, labels[dev]), misaligned indices silently corrupt the stratified split's class assignments. For string stay_ids this causes index 0 to reference different stays than index 0 of labels.

🔧 Proposed fix
-    stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique())
+    stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique().sort())
     # If there are labels, and the task is classification, use stratified k-fold
     if VarType.label in vars and runmode is RunMode.classification:
         # Get labels from outcome data (takes the highest value (or True) in case seq2seq classification)
         labels: pl.Series = data[DataSegment.outcome].group_by(_id).max().sort(_id)[vars[VarType.label]]

Note: the same misalignment concern applies to make_train_val_polars at lines 378/381.
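A tiny plain-Python illustration of the misalignment described above, using hypothetical string stay ids (not values from the actual dataset):

```python
# Order that .unique() happened to return, vs. labels built with .sort(_id).
stays_unordered = ["s10", "s2", "s1"]
labels_by_id = {"s1": 0, "s10": 1, "s2": 0}
label_values = [labels_by_id[k] for k in sorted(labels_by_id)]  # sorted by _id

# Index 0 of the stays array is "s10", but index 0 of labels belongs to "s1",
# so cv.split(stays, labels) would pair stays with the wrong class labels.
assert stays_unordered[0] != sorted(labels_by_id)[0]

# Sorting the stays restores the alignment the stratified split assumes.
stays_sorted = sorted(stays_unordered)
assert stays_sorted == sorted(labels_by_id)
```

With integer stay_ids the unordered and sorted sequences often coincide by accident, which is why string ids make the bug visible.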

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@icu_benchmarks/data/split_process_data.py` around lines 552 - 572, stays is
built from data[DataSegment.outcome][_id].unique() which can be unordered while
labels is explicitly sorted by _id; to fix, ensure stays uses the exact same
ordering as labels before calling outer_cv.split and inner_cv.split (e.g.,
derive stays from the sorted labels index or sort stays with the same key used
for labels) so indices align; apply the same change in make_train_val_polars
where stays/ids are created (ensure dev_stays is sliced from an ordered stays
that matches labels ordering before inner_cv.split).
🧹 Nitpick comments (1)
icu_benchmarks/data/split_process_data.py (1)

174-176: Minor: inconsistent source for the complete-train splitter.

The non-complete-train branch (line 162) passes sanitized_data, while the complete-train branch passes the raw data. Since check_sanitize_data and modality_selection mutate the underlying dict in place, the two are currently the same object — but relying on that identity is brittle. For consistency and to guard against future refactors where one of those functions starts returning a new dict, pass sanitized_data here as well.

-        sanitized_data = make_train_val_polars(data, vars, train_size=train_size, seed=seed, debug=debug, runmode=runmode)
+        sanitized_data = make_train_val_polars(
+            sanitized_data, vars, train_size=train_size, seed=seed, debug=debug, runmode=runmode
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@icu_benchmarks/data/split_process_data.py` around lines 174 - 176, The
complete-train branch calls make_train_val_polars with the raw data object,
which is inconsistent and brittle because check_sanitize_data and
modality_selection may mutate or later return a new dict; change the call in the
full-train branch to pass sanitized_data (the same variable used in the other
branch) to make_train_val_polars and keep the same train_size, seed, debug, and
runmode parameters to ensure consistent behavior regardless of in-place
mutations.
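A minimal stand-alone sketch of why relying on in-place mutation identity is brittle; the `sanitize` functions here are hypothetical stand-ins, not the project's actual check_sanitize_data:

```python
def sanitize(d):
    d.pop("bad", None)  # mutates in place and returns the same object
    return d

data = {"good": 1, "bad": 2}
sanitized = sanitize(data)
assert sanitized is data  # identical today, so passing either name "works"

def sanitize_v2(d):
    # A future refactor that returns a new dict instead of mutating.
    return {k: v for k, v in d.items() if k != "bad"}

sanitized2 = sanitize_v2(data)
assert sanitized2 is not data  # now the two names silently diverge
```

Once the function returns a fresh dict, any call site still passing the raw `data` keeps working on stale contents, which is exactly the hazard the nitpick guards against.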
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@icu_benchmarks/data/loader.py`:
- Around line 50-57: The code hardcodes m_index = ["stay_id"] causing the group
column to be omitted/misordered when the dataset uses a different group name;
replace the literal with the configured group variable by setting m_index =
[self.vars["GROUP"]], then rebuild front/back using that m_index so the group
column is preserved first in self.features_df; also ensure the rest of the class
(e.g., functions referencing self.features_df and ram_cache logic that calls
pl.exclude(self.vars["GROUP"]) and computes n_features) will now align with the
corrected first-column ordering.
- Around line 149-188: ram_cache currently assumes outcome_df has one label per
timestep; detect the single-label-per-stay case (when outcome_df.shape[0] ==
n_stays) and expand labels_np into per-timestep rows before slicing by offsets:
for each stay create an array of shape (length, n_labels) that is all NaNs
except the final timestep contains the original stay label (mirroring the logic
in __getitem__ that pads single-labels), then use that expanded labels array for
padded_labels population and nan masking; update references to labels_np,
padded_labels, pad_mask, and ram_cache accordingly and add a unit test that
constructs a single-label classification dataset with ram_cache=True to lock
this behavior.
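The single-label expansion described in that prompt could be sketched as follows. This is a hedged NumPy illustration; `expand_single_labels` is a hypothetical name and the shapes are assumptions, not the PR's actual loader.py code:

```python
import numpy as np

def expand_single_labels(labels: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Expand one label per stay into per-timestep rows.

    Every row is NaN except the final timestep of each stay, which
    carries that stay's label (mirroring single-label padding logic).
    """
    labels_2d = labels.reshape(len(labels), -1).astype(float)
    expanded = np.full((int(lengths.sum()), labels_2d.shape[1]), np.nan)
    # The last row of each stay sits at its cumulative end offset minus one.
    end_rows = np.cumsum(lengths) - 1
    expanded[end_rows] = labels_2d
    return expanded
```

After this expansion, the per-timestep slicing by offsets works identically for the single-label-per-stay and label-per-timestep cases.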

---

Outside diff comments:
In `@icu_benchmarks/data/split_process_data.py`:
- Around line 552-572: stays is built from
data[DataSegment.outcome][_id].unique() which can be unordered while labels is
explicitly sorted by _id; to fix, ensure stays uses the exact same ordering as
labels before calling outer_cv.split and inner_cv.split (e.g., derive stays from
the sorted labels index or sort stays with the same key used for labels) so
indices align; apply the same change in make_train_val_polars where stays/ids
are created (ensure dev_stays is sliced from an ordered stays that matches
labels ordering before inner_cv.split).

---

Nitpick comments:
In `@icu_benchmarks/data/split_process_data.py`:
- Around line 174-176: The complete-train branch calls make_train_val_polars
with the raw data object, which is inconsistent and brittle because
check_sanitize_data and modality_selection may mutate or later return a new
dict; change the call in the full-train branch to pass sanitized_data (the same
variable used in the other branch) to make_train_val_polars and keep the same
train_size, seed, debug, and runmode parameters to ensure consistent behavior
regardless of in-place mutations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 9d330f83-863a-490c-ae7f-61246332cbac

📥 Commits

Reviewing files that changed from the base of the PR and between ef999f7 and 178bfd0.

📒 Files selected for processing (3)
  • icu_benchmarks/data/loader.py
  • icu_benchmarks/data/preprocessor.py
  • icu_benchmarks/data/split_process_data.py

Comment threads (2): icu_benchmarks/data/loader.py
@unartig
Contributor Author

unartig commented Apr 23, 2026

@coderabbitai any things still missing for this PR?

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh issue list --repo rvandewater/YAIB --state open --limit 50

Length of output: 1192


@unartig: Yes, there are 14 open issues on the repository. Here's a summary:

🐛 Bugs / Potential Bugs

  • #182: Logistic Regression Model Not converging (opened 2026-04-13)
  • #181: Bug: W&B API Key length validation error (40-character limit) (opened 2026-03-19)
  • #178: Possible Bug: Scaled MAE in Regression Tasks (opened 2025-12-12)

🔧 Feature Requests / Proposals

  • #179: Proposing a new prediction task: Causative Pathogen Identification (opened 2026-03-04)
  • #162: PyPOTS applies additional amputation (help wanted) (opened 2025-02-14)
  • #158: Optimize __getitem__ method using more Polars operations (opened 2024-10-17)
  • #157: Creating a Polars-based version of ImputationPandasDataset (opened 2024-10-17)
  • #154: Automatically infer variables in a dataset (enhancement) (opened 2024-09-18)
  • #146: Memory optimisation for Temporal Fusion Transformer (opened 2024-06-04)
  • #143: Add cross-correlation as imputation evaluation metric (enhancement) (opened 2023-10-30)
  • #97: Model Calibration (enhancement) (opened 2022-12-16)
  • #92: Evaluate replacing argparse with abseil (enhancement) (opened 2022-12-13)
  • #90: Write tests for train and evaluate (enhancement) (opened 2024-04-15)
  • #67: Add Sampling Options (enhancement) (opened 2024-04-15)

Worth noting: #158 (optimize __getitem__ with Polars) is directly related to this PR — the new ram_cache vectorized approach in PredictionPolarsDataset partially addresses that optimization goal. You may want to consider closing or updating that issue once this PR is merged.

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== PR Reviews ==="
gh pr view 183 --repo rvandewater/YAIB --json reviews,reviewDecision,statusCheckRollup,comments | jq .

echo "=== PR Review Comments (inline) ==="
gh api repos/rvandewater/YAIB/pulls/183/comments | jq '[.[] | {id, path, line, body, user: .user.login, resolved: .position}]'

echo "=== PR Comments (issue-level) ==="
gh api repos/rvandewater/YAIB/issues/183/comments | jq '[.[] | {id, user: .user.login, body: .body[:300]}]'

echo "=== PR Checks ==="
gh pr checks 183 --repo rvandewater/YAIB

Length of output: 36203


@unartig: Good timing to check! Here's a thorough status breakdown:


✅ What's done

  • Lint (Ruff) CI: Passing ✅
  • Both critical inline comments: marked as addressed in commits 786c86a and 7dc68bf:
    • Hardcoded "stay_id" → self.vars["GROUP"] fix in loader.py
    • ram_cache single-label-per-stay backwards-compatibility fix in loader.py

⚠️ What still needs attention

1. 🔴 Latest commits were never reviewed (rate-limited)

The review of commits 178bfd0 through 7dc68bf was cut short by a rate limit before CodeRabbit could inspect your fixes. You should trigger a fresh review to confirm the addressed comments are correct.

2. ❌ Pre-merge checks failing

  • Docstring Coverage (⚠️ Warning): 55.56%; the threshold is 80%. New methods like ram_cache are likely missing docstrings.
  • PR Title (❓ Inconclusive): "Fixes + Convenience updates" is too generic. Consider something like: "Support train_size with complete_train, fix data overwrite, and add multi-label padding"

3. 🟠 Outside-diff comment — may still be open

An outside-diff comment (couldn't be posted inline) flagged a potential silent bug:

stays ordering in stratified k-fold split (split_process_data.py, lines 552–572): stays is built from .unique() (unordered) while labels is sorted by _id. Passing misaligned arrays to outer_cv.split(stays, labels) silently corrupts stratified fold class assignments.

The suggested fix was:

- stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique())
+ stays = pl.Series(name=_id, values=data[DataSegment.outcome][_id].unique().sort())

The same concern applies to make_train_val_polars at lines 378/381. Please verify if this was addressed in the latest commits.

4. 🟡 Nitpick — open

split_process_data.py lines 174–176: The complete-train branch passes raw data to make_train_val_polars instead of sanitized_data. Currently harmless (same object in practice), but brittle for future refactors. Consider passing sanitized_data for consistency.

5. 🔵 No human reviewer approval yet

The reviewDecision is currently empty — no maintainer has formally approved this PR.


💡 Recommended next steps

  1. Add a docstring to ram_cache (and any other new/modified methods lacking them).
  2. Update the PR title to be more descriptive.
  3. Verify the stays ordering fix was applied (or apply it if not).
  4. Trigger a new review — I'll re-inspect all the latest commits cleanly.

Would you like me to trigger a fresh review now?

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@unartig changed the title from "Fixes + Convenience updates" to "Fixes + Convenience updates [Vectorized ram_cache, multi label padding, debug sampling]" on Apr 23, 2026
@unartig
Contributor Author

unartig commented Apr 23, 2026

@coderabbitai go ahead

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

@unartig: Triggering a full review now — I'll inspect all the latest commits cleanly, including the fixes from 786c86a and 7dc68bf. Hang tight!

✅ Actions performed

Full review triggered.

@prockenschaub
Collaborator

@unartig it seems like you ran into the same cache issue that I noticed this week after revisiting YAIB. I had already opened #186 before I saw your PR.

I think our implementations address fundamentally the same thing. A couple of differences I could spot while skimming over it:

  • I use polars' partition_by and keep everything in a stay_id-indexed dict, you seem to do the partitioning yourself and store everything in a stacked tensor. Using partition allows my changes to be fewer lines of changed code.
  • I currently store the partitions in addition to features_df and outcome_df, which has some memory overhead. This could be changed to overwrite the base DataFrames.
  • I had my partition step already in __init__; yours only does this during caching. I think your version makes sense for the ML models, which don't need this step anyway. At the same time, doing the partitioning is still useful for DL models if ram_cache=False.

@rvandewater I am happy with either as long as the speed-up remains :)
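A plain-Python stand-in for the partition_by approach described above; polars' `DataFrame.partition_by("stay_id", as_dict=True)` returns a comparable stay-indexed mapping of sub-frames, and this sketch only mimics that shape with a dict of row lists:

```python
from collections import defaultdict

def partition_by_stay(rows):
    """Group feature rows into a stay_id-indexed dict, mimicking the
    shape of polars' DataFrame.partition_by("stay_id", as_dict=True)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["stay_id"]].append(row)
    return dict(parts)
```

A dataset's `__getitem__` can then fetch one stay's rows by key instead of filtering the full frame on every access, which is where the speedup in both implementations comes from.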

@unartig
Contributor Author

unartig commented Apr 24, 2026

Hey @prockenschaub, I very much prefer your version over mine; it does make sense to do it in __init__. I basically chose the ram_cache approach because it felt the least invasive at the time.
That said, I would still like to see my other changes make it through (especially the >1 labels and train_size), since I heavily rely on them in my project.

