Fix/inference new tok new dim #623

jhnwu3 · 2025-11-21T17:59:15Z

This pull request introduces significant improvements to the StageNet mortality prediction workflow, focusing on processor robustness, reproducibility, and efficiency. The main changes include the addition of synthetic hold-out set testing to verify processor handling of unseen codes and varying sequence lengths, enhancements to processor vocabulary handling for unknown tokens, and improved multi-threaded sample generation in the dataset loader. Several usability and reproducibility improvements were also made, such as processor saving/loading and updated documentation.

Processor Robustness and Synthetic Hold-out Testing

Added a new generate_holdout_set function in examples/mortality_mimic4_stagenet_v2.py to create synthetic samples with unseen codes and varying sequence lengths, and integrated comprehensive testing and reporting of processor behavior on these samples.
Updated the notebook tutorial to include hold-out inference and interpretability examples, improving documentation for StageNet usage.
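
As a rough illustration of the hold-out idea above, here is a minimal sketch of generating synthetic patients whose codes fall outside the training vocabulary and whose visit counts vary. It is not the generate_holdout_set implementation from this PR: the function name make_holdout_samples, the sample keys, and the UNSEEN_ code prefix are all assumptions made for the example.

```python
import random


def make_holdout_samples(known_codes, n_samples=5, max_visits=8,
                         max_codes_per_visit=4, seed=0):
    """Hypothetical helper: synthetic patients with out-of-vocabulary codes."""
    rng = random.Random(seed)  # fixed seed so the hold-out set is reproducible
    samples = []
    for i in range(n_samples):
        n_visits = rng.randint(1, max_visits)  # sequence length varies per patient
        visits = []
        for _ in range(n_visits):
            n_codes = rng.randint(1, max_codes_per_visit)
            # The "UNSEEN_" prefix is an assumed convention for unseen codes.
            visits.append(
                [f"UNSEEN_{rng.randint(0, 9999):04d}" for _ in range(n_codes)]
            )
        samples.append({
            "patient_id": f"holdout_{i}",
            "codes": visits,
            "mortality": rng.randint(0, 1),
        })
    # Guard: every generated code must be absent from the training vocabulary.
    assert all(
        code not in known_codes
        for sample in samples
        for visit in sample["codes"]
        for code in visit
    )
    return samples


holdout = make_holdout_samples(known_codes={"I10", "E11"}, n_samples=20)
```

Feeding samples like these through fitted processors is one way to check that unknown codes map to a valid index and that longer or shorter sequences do not break batching.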

Processor Vocabulary Handling

Changed the initialization and fitting logic for <unk> tokens in SequenceProcessor, NestedSequenceProcessor, and StageNetProcessor so that <unk> is set to the next available index after vocabulary construction, ensuring compatibility with PyTorch embeddings and correct handling of unknown tokens.
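
The indexing scheme described above can be sketched as follows. This is a simplified stand-in, not the actual SequenceProcessor code: the class name TinySequenceVocab is hypothetical, and reserving index 0 for <pad> is an assumption. The point it shows is that assigning <unk> last keeps every index in the range 0..size-1, so an embedding table sized to the vocabulary never sees an out-of-range index.

```python
class TinySequenceVocab:
    """Hypothetical, simplified vocabulary; not the PyHealth processor itself."""

    def __init__(self):
        # Assumption: index 0 is reserved for padding.
        self.code_to_index = {"<pad>": 0}

    def fit(self, sequences):
        for seq in sequences:
            for code in seq:
                if code not in self.code_to_index:
                    self.code_to_index[code] = len(self.code_to_index)
        # <unk> takes the next free index only after all known codes are assigned.
        self.code_to_index["<unk>"] = len(self.code_to_index)
        return self

    def encode(self, seq):
        unk = self.code_to_index["<unk>"]
        # Codes never seen during fit fall back to the <unk> index.
        return [self.code_to_index.get(code, unk) for code in seq]

    def size(self):
        # Suitable as num_embeddings for an embedding table.
        return len(self.code_to_index)


vocab = TinySequenceVocab().fit([["I10", "E11"], ["E11", "N18"]])
encoded = vocab.encode(["I10", "NEVER_SEEN"])
```

With this layout, max(encoded) is always strictly less than vocab.size(), which is the condition an embedding lookup requires.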

Reproducibility and Usability

Added processor saving/loading to examples/mortality_mimic4_stagenet_v2.py, allowing reuse of fitted processors for consistent encoding and reduced computation in future runs.
Improved documentation for the MortalityPredictionStageNetMIMIC4 task to clarify the purpose of the padding argument.
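
A minimal sketch of persisting fitted processors between runs, assuming they can be pickled. The helper names dump_fitted and load_fitted and the file layout are hypothetical and are not the functions used in the example script; a plain dictionary stands in for a fitted processor here.

```python
import pickle
import tempfile
from pathlib import Path


def dump_fitted(processors, path):
    """Hypothetical helper: write fitted processors to disk with pickle."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(processors, f)


def load_fitted(path):
    """Hypothetical helper: read processors back. Only load files you trust."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Stand-in for a fitted processor: a code-to-index mapping per feature.
fitted = {"conditions": {"<pad>": 0, "I10": 1, "E11": 2, "<unk>": 3}}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "processors" / "stagenet.pkl"
    dump_fitted(fitted, target)
    restored = load_fitted(target)
```

Reusing the restored mapping at inference time means a code receives the same index it had during training, which is what makes the encoding consistent across runs.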

Performance and Efficiency

Refactored multi-threaded sample generation in pyhealth/datasets/base_dataset.py to use a bounded queue and lazy iteration, reducing memory usage and improving progress tracking for large datasets.
Minor code formatting and logging improvements in pyhealth/datasets/base_dataset.py for readability and maintainability.
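
The bounded-queue pattern mentioned above can be sketched roughly as below. This is an illustration under assumptions, not the code in pyhealth/datasets/base_dataset.py: the function generate_samples, its parameters, and the sentinel-based shutdown are all choices made for the example, and error handling inside workers is omitted for brevity.

```python
import queue
import threading


def generate_samples(patients, task_fn, num_workers=4, max_pending=16):
    """Hypothetical sketch: yield samples lazily with bounded memory use."""
    in_q = queue.Queue(maxsize=max_pending)   # bounded: feeder blocks when full
    out_q = queue.Queue(maxsize=max_pending)  # bounded: workers block when full
    stop = object()                           # sentinel marking end of work

    def feeder():
        for patient in patients:  # patients may be any lazy iterable
            in_q.put(patient)
        for _ in range(num_workers):
            in_q.put(stop)        # one sentinel per worker

    def worker():
        while True:
            patient = in_q.get()
            if patient is stop:
                out_q.put(stop)
                return
            out_q.put(task_fn(patient))  # task_fn returns a list of samples

    threads = [threading.Thread(target=feeder, daemon=True)]
    threads += [threading.Thread(target=worker, daemon=True)
                for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    finished = 0
    while finished < num_workers:
        item = out_q.get()
        if item is stop:
            finished += 1
        else:
            yield from item  # results arrive in completion order, not input order


collected = list(
    generate_samples(iter(range(50)), lambda p: [p, p * 10],
                     num_workers=3, max_pending=4)
)
```

Because both queues are bounded, at most roughly 2 * max_pending + num_workers patients are in flight at once regardless of dataset size, and the consumer can update a progress bar as each batch of samples is yielded.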

Training and Testing Adjustments

Updated training configuration in the example script to use CPU and run for a single epoch, facilitating quick testing and reproducibility.

These changes collectively enhance the reliability, scalability, and user experience of the StageNet pipeline for mortality prediction on MIMIC-IV.

jhnwu3 added 7 commits November 19, 2025 14:53

init commit for 1 solution

622c2a2

also change ex cache

36e0b34

init commit

d581587

commit new test case and fixes

58ca0fd

new update

8da24ba

minor update to the number of workers used, turns out it does make a difference in processing speed

b1e95f5

update again

0327e6b
