Fix/inference new tok new dim #623
Open
This pull request introduces significant improvements to the StageNet mortality prediction workflow, focusing on processor robustness, reproducibility, and efficiency. The main changes include the addition of synthetic hold-out set testing to verify processor handling of unseen codes and varying sequence lengths, enhancements to processor vocabulary handling for unknown tokens, and improved multi-threaded sample generation in the dataset loader. Several usability and reproducibility improvements were also made, such as processor saving/loading and updated documentation.
Processor Robustness and Synthetic Hold-out Testing
- Added a generate_holdout_set function in examples/mortality_mimic4_stagenet_v2.py to create synthetic samples with unseen codes and varying sequence lengths, and integrated comprehensive testing and reporting of processor behavior on these samples. [1] [2]
Processor Vocabulary Handling
- Changed the handling of <unk> tokens in SequenceProcessor, NestedSequenceProcessor, and StageNetProcessor so that <unk> is set to the next available index after vocabulary construction, ensuring compatibility with PyTorch embeddings and correct handling of unknown tokens. [1] [2] [3] [4] [5] [6]
Reproducibility and Usability
- Added processor saving and loading to examples/mortality_mimic4_stagenet_v2.py, allowing reuse of fitted processors for consistent encoding and reduced computation in future runs.
- Updated the documentation of the MortalityPredictionStageNetMIMIC4 task to clarify the purpose of the padding argument.
Performance and Efficiency
- Changed multi-threaded sample generation in pyhealth/datasets/base_dataset.py to use a bounded queue and lazy iteration, reducing memory usage and improving progress tracking for large datasets.
- Cleaned up code in pyhealth/datasets/base_dataset.py for readability and maintainability. [1] [2] [3] [4]
Training and Testing Adjustments
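For reviewers who want a picture of the bounded-queue, lazily iterated sample generation described for pyhealth/datasets/base_dataset.py, here is a minimal standalone sketch. It is not the code in this PR: the names generate_samples, task_fn, num_workers, and max_pending are hypothetical, and the sketch assumes each patient maps to a list of samples. It only illustrates why a bounded queue caps memory: the feeder thread blocks once max_pending patients are waiting, and results are yielded as they arrive instead of being collected first.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator, List

_DONE = object()  # sentinel marking the end of the stream


def generate_samples(
    patients: Iterable[Any],
    task_fn: Callable[[Any], List[Any]],  # hypothetical: one patient -> list of samples
    num_workers: int = 4,
    max_pending: int = 8,
) -> Iterator[Any]:
    """Lazily yield samples while keeping at most max_pending items queued."""
    in_q: queue.Queue = queue.Queue(maxsize=max_pending)
    out_q: queue.Queue = queue.Queue(maxsize=max_pending)

    def feeder() -> None:
        for patient in patients:   # the input is consumed lazily
            in_q.put(patient)      # blocks while the queue is full
        for _ in range(num_workers):
            in_q.put(_DONE)        # one sentinel per worker

    def worker() -> None:
        while True:
            item = in_q.get()
            if item is _DONE:
                out_q.put(_DONE)
                return
            try:
                out_q.put(task_fn(item))
            except Exception as exc:  # forward errors to the consumer
                out_q.put(exc)

    threads = [threading.Thread(target=feeder, daemon=True)]
    threads += [threading.Thread(target=worker, daemon=True) for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    finished = 0
    while finished < num_workers:
        result = out_q.get()
        if result is _DONE:
            finished += 1
        elif isinstance(result, Exception):
            raise result
        else:
            yield from result  # samples arrive in completion order, not input order


# Usage: a toy task that emits (patient_id % 3) samples per patient.
def toy_task(patient_id: int) -> List[dict]:
    return [{"patient_id": patient_id, "visit": v} for v in range(patient_id % 3)]


samples = list(generate_samples(iter(range(100)), toy_task))
```

Because the function is a generator, a caller can wrap it in a progress bar and see progress per yielded sample. The trade-off in this sketch is that output order is not preserved, and abandoning the generator early leaves the daemon threads blocked until the process exits.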
These changes collectively enhance the reliability, scalability, and user experience of the StageNet pipeline for mortality prediction on MIMIC-IV.
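As an illustration of the <unk> change described under Processor Vocabulary Handling, the sketch below builds a vocabulary and then assigns <unk> the next available index. It is a simplified stand-in, not the SequenceProcessor code from this PR: build_vocab and encode are hypothetical helpers, and reserving index 0 for <pad> is an assumption. The point it shows is that every encoded id stays below len(vocab), which is what an embedding table sized with num_embeddings=len(vocab) requires.

```python
from typing import Dict, Iterable, List

PAD, UNK = "<pad>", "<unk>"


def build_vocab(sequences: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Hypothetical helper: index codes in first-seen order, then append <unk>."""
    vocab: Dict[str, int] = {PAD: 0}  # assumption: index 0 is reserved for padding
    for seq in sequences:
        for code in seq:
            if code not in vocab:
                vocab[code] = len(vocab)
    # <unk> takes the next free index only after all known codes are added,
    # so the ids form a contiguous range 0 .. len(vocab) - 1.
    vocab[UNK] = len(vocab)
    return vocab


def encode(seq: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Map codes to ids, sending unseen codes to the <unk> index."""
    unk_id = vocab[UNK]
    return [vocab.get(code, unk_id) for code in seq]


train_sequences = [["A01", "B02"], ["B02", "C03"]]
vocab = build_vocab(train_sequences)

# "Z99" was never seen during fitting, so it maps to the <unk> index.
holdout_ids = encode(["A01", "Z99"], vocab)

# Every id is a valid row of an embedding table with len(vocab) rows.
assert max(holdout_ids) < len(vocab)
```

With this layout, an embedding layer created with len(vocab) rows can look up hold-out samples containing unseen codes without an out-of-range index.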