NIDX - Code for: "A Machine Learning Approach for Identifying People with Neuroinfectious Diseases in Electronic Health Records"

Overview

This repository contains code for training, evaluating, and testing a machine learning model that identifies neuroinfectious diseases (NIDX) from unstructured clinical notes. The model addresses limitations of traditional ICD code-based identification methods by leveraging natural language processing (NLP) techniques to analyze clinical text and accurately classify patients with NIDX.

Dependencies

Python
NumPy
Pandas
scikit-learn
XGBoost
Matplotlib
tqdm

Project Structure

Background and Motivation

Neuroinfectious diseases (NIDX) pose serious threats to neurological health and may have long-term consequences, including increased risk for neurodegenerative diseases like Alzheimer's. Accurately identifying NIDX cases in electronic health records (EHR) is critical for clinical research but challenging due to:

Imprecise identification using ICD billing codes (high sensitivity but low specificity)
Labor-intensive manual chart reviews that are impractical for large datasets
Diverse array of pathogens and often clinically indistinguishable symptoms

This project aims to overcome these limitations by using machine learning to analyze unstructured clinical notes, providing more accurate and efficient NIDX identification than traditional methods.

Data Files

bidmc_notes_deid.csv: Deidentified clinical notes from Beth Israel Deaconess Medical Center (BIDMC) used for external validation
annot_20240223.csv: Annotation file with expert-labeled ground truth for classification
Various pickle files containing trained models and preprocessed data:
- features_bow_2231227_manualbow_l1_c10.pkl: Bag-of-Words features
- non_zero_features_2231227_manualbow_l1_c10.pkl: Selected non-zero features after feature selection
- vectorizer_2231227_manualbow_l1_c10.pkl: Text vectorizer model
- model4_2231227_manualbow_l1_c10.pkl: Trained XGBoost model
- X_train_l1_selected_2231227_manualbow_l1_c10.pkl: Training features
- X_test_l1_selected_2231227_manualbow_l1_c10.pkl: Testing features
- y_train_2231227_manualbow_l1_c10.pkl: Training labels
- y_test_2231227_manualbow_l1_c10.pkl: Testing labels

Code Modules

load_data_preprocess.py: Contains functions for loading and preprocessing text data
plots.py: Contains visualization functions for model evaluation

Functionality

Data Collection and Preprocessing

Source Data: The model was developed using clinical notes from patients who underwent lumbar punctures at Mass General Brigham (MGB) hospitals, with external validation on notes from Beth Israel Deaconess Medical Center (BIDMC)
Ground Truth: Six physician experts in neuroimmunology or neuroinfectious diseases manually labeled notes following a standardized operating procedure
Text Processing:
- Text cleaning: Removing carriage returns and newlines
- Text matching with regular expressions
- Custom text preprocessing via the preprocess_text function
- Conversion to bag-of-words representation using n-grams (n=1, 2, 3)

Feature Selection and Model Training

From an initial set of 1,284 n-gram features, 342 were selected as significant predictors
Feature selection was performed using Logistic Regression with L1 regularization
Several models were evaluated through 5-fold cross-validation, with XGBoost selected for its superior performance, particularly on the Area Under the Precision-Recall Curve (AUPRC)
The model was optimized to address class imbalance, as only 16% of notes in the training data represented positive NIDX cases
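
A minimal sketch of L1-based feature selection on synthetic data is shown below. The data, the regularization strength C=0.1, the liblinear solver, and class_weight="balanced" are assumptions chosen for the illustration; the settings actually used for the published model are not stated in this README. The final ratio is one common way to weight the positive class (for example via XGBoost's scale_pos_weight parameter), and is likewise an assumption about how class imbalance might be handled.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary bag-of-words matrix: 400 notes, 30 candidate n-grams.
rng = np.random.default_rng(0)
n_notes, n_features = 400, 30
X = rng.integers(0, 2, size=(n_notes, n_features)).astype(float)

# Only the first three features carry signal; the rest are noise.
logits = 2.5 * X[:, 0] + 2.0 * X[:, 1] + 1.5 * X[:, 2] - 4.5
y = (rng.random(n_notes) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# The L1 penalty drives uninformative coefficients to exactly zero.
selector = LogisticRegression(
    penalty="l1", solver="liblinear", C=0.1, class_weight="balanced"
)
selector.fit(X, y)

selected = np.flatnonzero(selector.coef_[0])  # indices of surviving features
X_selected = X[:, selected]

# Negative-to-positive ratio, usable as a positive-class weight downstream.
scale_pos_weight = (y == 0).sum() / max((y == 1).sum(), 1)

print(f"kept {selected.size} of {n_features} features")
print(f"scale_pos_weight = {scale_pos_weight:.2f}")
```

The reduced matrix X_selected would then be passed to the downstream classifier in place of the full feature matrix.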

Model Evaluation

The codebase includes the following evaluation functions:

bootstrap_and_plot: Performs bootstrap resampling to generate confidence intervals for model performance metrics and plots ROC and precision-recall curves.
plot_top_feature_importances: Visualizes the most important features in the trained model based on gain. The top features included clinically relevant terms such as "meningitis," "ventriculitis," and "meningoencephalitis."
plot_roc_curve, plot_precision_recall_curve, plot_roc_pr_curves: Functions for visualizing model performance.
bootstrap_and_plot_results: An enhanced evaluation function that calculates and plots:
- ROC curves with AUROC and confidence intervals
- Precision-recall curves with AUPRC and confidence intervals
- Additional metrics including F1 scores, recall, precision, and specificity for both classes
- Confidence intervals for all metrics using 1,000 bootstrap iterations
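
The confidence intervals described above can be approximated with a percentile bootstrap. The sketch below is not the repository's bootstrap_and_plot_results function: the bootstrap_ci helper, the synthetic labels and scores, and the choice of the percentile method are assumptions. It also uses scikit-learn's average_precision_score as the AUPRC estimate, which is one of several ways to summarize a precision-recall curve.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def bootstrap_ci(y_true, y_score, metric, n_boot=1000, alpha=0.05, seed=0):
    """Return (point estimate, lower, upper) from a percentile bootstrap."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample notes with replacement
        if np.unique(y_true[idx]).size < 2:
            continue  # ranking metrics are undefined without both classes
        stats.append(metric(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(metric(y_true, y_score)), float(lower), float(upper)


# Synthetic scores with roughly 16% positives, for illustration only.
rng = np.random.default_rng(42)
y_true = (rng.random(300) < 0.16).astype(int)
y_score = 0.6 * y_true + rng.normal(0.0, 0.3, size=300)

auroc, auroc_lo, auroc_hi = bootstrap_ci(y_true, y_score, roc_auc_score)
auprc, auprc_lo, auprc_hi = bootstrap_ci(y_true, y_score, average_precision_score)
print(f"AUROC {auroc:.3f} (95% CI: {auroc_lo:.3f}-{auroc_hi:.3f})")
print(f"AUPRC {auprc:.3f} (95% CI: {auprc_lo:.3f}-{auprc_hi:.3f})")
```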

Model Testing and Performance

The code includes evaluation on two separate datasets:

MGB dataset (Mass General Brigham): A limited subset of the main test data (445 notes)
BIDMC dataset (Beth Israel Deaconess Medical Center): External validation data (600 notes)

Performance metrics:

On the MGB test set:
- AUROC: 0.977 (95% CI: 0.964-0.988)
- AUPRC: 0.894 (95% CI: 0.831-0.943)
- F1 score: 0.822 (95% CI: 0.752-0.879)
- Recall: 0.846 (95% CI: 0.753-0.923)
- Precision: 0.802 (95% CI: 0.709-0.889)
- Specificity: 0.960 (95% CI: 0.939-0.978)
On the BIDMC external validation set:
- AUROC: 0.976 (95% CI: 0.961-0.989)
- AUPRC: 0.779 (95% CI: 0.655-0.885)
- F1 score: 0.658 (95% CI: 0.528-0.778)
- Recall: 0.687 (95% CI: 0.538-0.839)
- Precision: 0.637 (95% CI: 0.487-0.795)
- Specificity: 0.976 (95% CI: 0.963-0.988)

The model outperformed ICD code-based identification mainly through much higher specificity (ICD codes reached a sensitivity of 97.1% but a specificity of only 59.1%) and also surpassed a zero-shot LLaMA 3.2 model (which achieved an AUROC of 0.80 and a specificity of 0.94, but a lower recall of 0.64).
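
Recall (sensitivity), precision, specificity, and F1 all derive from the four cells of a confusion matrix, which makes the trade-off between a sensitive but unspecific method and a balanced one easy to inspect. The helper below is a hypothetical sketch on toy labels, not code from this repository.

```python
from sklearn.metrics import confusion_matrix


def binary_metrics(y_true, y_pred):
    """Compute recall, precision, specificity, and F1 for the positive class."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "recall": float(recall),
        "precision": float(precision),
        "specificity": float(specificity),
        "f1": float(f1),
    }


# Toy labels: 3 positive notes and 7 negative notes.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # 2 TP, 1 FN, 1 FP, 6 TN

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```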

Workflow

Load preprocessed data and trained model from pickle files
Preprocess clinical notes from the BIDMC dataset
Select specific indices from the MGB dataset for evaluation
Apply label corrections to both datasets (as of June 6, 2024)
Evaluate model performance on both datasets using bootstrap resampling
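
The label-correction step in the workflow above could look like the following sketch. The apply_label_corrections helper, the note indices, and the corrected values are all hypothetical; the actual corrections are defined in the notebook and are not listed in this README.

```python
import pandas as pd


def apply_label_corrections(labels: pd.Series, corrections: dict) -> pd.Series:
    """Return a copy of labels with corrected values at the given note indices."""
    missing = sorted(set(corrections) - set(labels.index))
    if missing:
        raise KeyError(f"Unknown note indices: {missing}")
    corrected = labels.copy()  # leave the original labels untouched
    for note_idx, new_label in corrections.items():
        corrected.loc[note_idx] = new_label
    return corrected


# Hypothetical labels indexed by note ID, and hypothetical corrections.
y_test = pd.Series([0, 1, 0, 0, 1], index=[101, 102, 103, 104, 105], name="label")
corrections = {102: 0, 104: 1}

y_corrected = apply_label_corrections(y_test, corrections)
print(y_corrected.tolist())
```

Validating the indices before writing helps catch a correction aimed at a note that is not in the evaluation subset.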

Usage

The code is structured in blocks that can be executed sequentially:

Block 1: Load all the necessary pickle files containing the trained model and preprocessed data
Block 2: Define evaluation functions for model performance
Block 3: Load and preprocess the BIDMC dataset
Block 4: Define a limited subset of the MGB dataset for evaluation
Block 5: Apply label corrections to both datasets
Block 6: Evaluate model performance on the MGB dataset
Block 7: Transform the BIDMC dataset using the same vectorizer
Block 8: Evaluate model performance on the BIDMC dataset
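
Block 1 amounts to unpickling a set of files that share a common suffix. The sketch below assumes the naming pattern seen in the Data Files list; the load_artifacts helper is hypothetical, and the demo writes placeholder pickles to a temporary directory so that it runs without the real files.

```python
import pickle
import tempfile
from pathlib import Path

SUFFIX = "2231227_manualbow_l1_c10"  # shared suffix from the Data Files list
ARTIFACT_NAMES = [
    "vectorizer",
    "non_zero_features",
    "model4",
    "X_test_l1_selected",
    "y_test",
]


def load_artifacts(directory, names=ARTIFACT_NAMES, suffix=SUFFIX):
    """Load each '<name>_<suffix>.pkl' file in directory into a dict."""
    loaded = {}
    for name in names:
        path = Path(directory) / f"{name}_{suffix}.pkl"
        with path.open("rb") as fh:
            loaded[name] = pickle.load(fh)  # only unpickle files you trust
    return loaded


# Demo with placeholder objects standing in for the real artifacts.
with tempfile.TemporaryDirectory() as tmp:
    for name in ARTIFACT_NAMES:
        with (Path(tmp) / f"{name}_{SUFFIX}.pkl").open("wb") as fh:
            pickle.dump({"placeholder": name}, fh)
    artifacts = load_artifacts(tmp)

print(sorted(artifacts))
```

With the real files, Blocks 7 and 8 would presumably transform the cleaned BIDMC notes with the loaded vectorizer, keep only the selected feature columns, and score them with the loaded model. Whether non_zero_features stores column indices or feature names is not stated here, so check the notebook before relying on either.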

Significance and Applications

This NLP-based model enables accurate identification of neuroinfectious disease cases from clinical notes, addressing the limitations of ICD code-based methods
The approach offers significant advantages for large-scale epidemiological research, particularly for studying associations between neuroinvasive pathogens and long-term outcomes like neurodegenerative diseases
The model demonstrated strong performance across two independent hospital datasets, suggesting potential generalizability to other institutions
The selected features align with clinical expertise, focusing on markers of CNS inflammation, diagnostic tests, and specific pathogens

Notes

The code suppresses warnings at the beginning of execution
The model is an XGBoost classifier with feature selection performed using L1 regularization
Original data included 3,000 notes from MGB, with 16% (479 notes) labeled as NIDX by expert review
Bootstrap resampling with 1,000 iterations is used to establish confidence intervals for performance metrics
Recent label corrections (dated June 6, 2024) have been applied to both datasets to improve label accuracy
The model is part of a research effort described in "A Machine Learning Approach for Identifying People with Neuroinfectious Diseases in Electronic Health Records" by Singh, Sartipi, et al.

