@jhnwu3 jhnwu3 commented Nov 21, 2025

This pull request improves the StageNet mortality prediction workflow in three areas: processor robustness, reproducibility, and efficiency. It adds synthetic hold-out set testing to verify that processors handle unseen codes and varying sequence lengths, changes how processor vocabularies assign the <unk> token, and refactors multi-threaded sample generation in the dataset loader. It also adds processor saving/loading and updates documentation.

Processor Robustness and Synthetic Hold-out Testing

  • Added a new generate_holdout_set function in examples/mortality_mimic4_stagenet_v2.py to create synthetic samples with unseen codes and varying sequence lengths, and integrated comprehensive testing and reporting of processor behavior on these samples. [1] [2]
  • Updated the notebook tutorial to include hold-out inference and interpretability examples, improving documentation for StageNet usage.
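
As a rough illustration of the hold-out idea (not the PR's actual `generate_holdout_set`), the sketch below builds synthetic samples that mix known codes with codes that cannot appear in a fitted vocabulary, and varies the number of visits per sample. The function name `sketch_holdout_samples`, the sample layout (`patient_id`, `codes`, `mortality`), and all parameters are assumptions made for this example.

```python
import random


def sketch_holdout_samples(n_samples, known_codes, seed=0, max_visits=6,
                           max_codes_per_visit=5, unseen_ratio=0.5):
    """Hypothetical helper: synthetic samples with unseen codes and varying lengths."""
    rng = random.Random(seed)  # seeded so the hold-out set is reproducible
    samples = []
    for i in range(n_samples):
        n_visits = rng.randint(1, max_visits)  # varying sequence length
        visits = []
        for v in range(n_visits):
            n_codes = rng.randint(1, max_codes_per_visit)
            codes = []
            for c in range(n_codes):
                # The first code of every sample is forced to be unseen so each
                # sample exercises the unknown-token path at least once.
                force_unseen = (v == 0 and c == 0)
                if force_unseen or rng.random() < unseen_ratio:
                    codes.append(f"UNSEEN_{rng.randint(0, 999)}")
                else:
                    codes.append(rng.choice(known_codes))
            visits.append(codes)
        samples.append({
            "patient_id": f"holdout_{i}",   # assumed field name
            "codes": visits,                # assumed nested visit/code layout
            "mortality": rng.randint(0, 1), # random label, for plumbing tests only
        })
    return samples


holdout = sketch_holdout_samples(5, ["I10", "E11", "N18"])
print(len(holdout), [len(s["codes"]) for s in holdout])
```

Feeding samples like these through fitted processors makes it easy to check that every unseen code maps to the unknown token and that short and long sequences are both padded correctly.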

Processor Vocabulary Handling

  • Changed the initialization and fitting logic for <unk> tokens in SequenceProcessor, NestedSequenceProcessor, and StageNetProcessor so that <unk> is set to the next available index after vocabulary construction, ensuring compatibility with PyTorch embeddings and correct handling of unknown tokens. [1] [2] [3] [4] [5] [6]
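
The index assignment matters because `torch.nn.Embedding` only accepts indices in the range `[0, num_embeddings)`. A minimal sketch of the "assign `<unk>` last" scheme follows, in plain Python rather than the actual processor classes; the helper names and the choice of `<pad>` at index 0 are assumptions for illustration.

```python
def build_vocab(sequences):
    """Hypothetical helper: build a code vocabulary, then append <unk> last."""
    vocab = {"<pad>": 0}  # assumption: padding occupies index 0
    for seq in sequences:
        for code in seq:
            if code not in vocab:
                vocab[code] = len(vocab)
    # <unk> takes the next free index once all known codes are registered,
    # so indices stay contiguous: 0 .. len(vocab) - 1.
    vocab["<unk>"] = len(vocab)
    return vocab


def encode(seq, vocab):
    """Map codes to indices, sending anything unseen to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(code, unk) for code in seq]


vocab = build_vocab([["I10", "E11"], ["I10", "N18"]])
print(vocab["<unk>"])                 # 4
print(encode(["I10", "Z99"], vocab))  # [1, 4]
```

Because every index, including `<unk>`, is below `len(vocab)`, an embedding table with `len(vocab)` rows covers all encoded values.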

Reproducibility and Usability

  • Added processor saving/loading to examples/mortality_mimic4_stagenet_v2.py, allowing reuse of fitted processors for consistent encoding and reduced computation in future runs.
  • Improved documentation for the MortalityPredictionStageNetMIMIC4 task to clarify the purpose of the padding argument.
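
The save/load pattern can be sketched as a small "load if cached, otherwise fit and save" helper. This is an assumption-laden illustration: the helper name `load_or_fit`, the use of `pickle`, and the file layout are hypothetical, and the example script's real persistence mechanism may differ.

```python
import pickle
from pathlib import Path


def load_or_fit(path, fit_fn):
    """Hypothetical helper: reuse a saved object if present, else fit and save it."""
    path = Path(path)
    if path.exists():
        # Reusing the saved object keeps token-to-index encoding identical across runs.
        with path.open("rb") as f:
            return pickle.load(f)
    obj = fit_fn()  # the expensive fitting step runs only on the first call
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)
    return obj
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.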

Performance and Efficiency

  • Refactored multi-threaded sample generation in pyhealth/datasets/base_dataset.py to use a bounded queue and lazy iteration, reducing memory usage and improving progress tracking for large datasets.
  • Minor code formatting and logging improvements in pyhealth/datasets/base_dataset.py for readability and maintainability. [1] [2] [3] [4]
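
One way to get bounded memory with lazy iteration is to cap the number of in-flight tasks and pull new inputs only as results are consumed. The sketch below uses `concurrent.futures` with a bounded set of pending futures rather than an explicit `queue.Queue`; it is not the code in `pyhealth/datasets/base_dataset.py`, and the name `bounded_map` and its parameters are assumptions.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from itertools import islice


def bounded_map(fn, items, max_workers=4, max_pending=8):
    """Hypothetical helper: yield fn(item) lazily with at most max_pending tasks in flight.

    Results arrive in completion order, not input order.
    """
    it = iter(items)  # inputs are pulled lazily, never materialized as a list
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Prime the pool with the first max_pending items only.
        pending = {pool.submit(fn, x) for x in islice(it, max_pending)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                yield fut.result()  # re-raises any exception from the worker
            # Refill: submit one new task per finished task, if inputs remain.
            for x in islice(it, len(done)):
                pending.add(pool.submit(fn, x))


total = 0
for result in bounded_map(lambda x: x * x, range(1000)):
    total += result  # a progress bar could be updated here, once per result
print(total)
```

Since the consumer sees one result at a time, progress can be reported per sample instead of per batch of submitted futures.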

Training and Testing Adjustments

  • Updated training configuration in the example script to use CPU and run for a single epoch, facilitating quick testing and reproducibility.

These changes collectively enhance the reliability, scalability, and user experience of the StageNet pipeline for mortality prediction on MIMIC-IV.
