This project is a logistic regression model built from scratch to predict the mortality of COVID-19 patients based on their clinical data. The model is implemented in Python using only NumPy and Pandas, without the use of high-level machine learning libraries like Scikit-learn for the core modeling part.
The dataset used for training this model is Mortality_incidence_sociodemographic_and_clinical_data_in_COVID19_patients.xlsx. It contains sociodemographic and clinical data for patients with COVID-19.
The following features were used for training the model:
- Age.1
- MI (Myocardial Infarction)
- CHF (Congestive Heart Failure)
- CVD (Cerebrovascular Disease)
- DM Simple (Diabetes Mellitus without complications)
- DM Complicated (Diabetes Mellitus with complications)
- COPD (Chronic Obstructive Pulmonary Disease)
- Renal Disease
- DEMENT (Dementia)
- Stroke
- Seizure
- OldOtherNeuro (Other Neurological Disorders)
The target variable is Death, which is a binary outcome (0 for survival, 1 for death).
- Rows with missing values were dropped.
- The features were scaled using `StandardScaler` from scikit-learn to have a mean of 0 and a standard deviation of 1. This helps the gradient descent algorithm to converge faster.
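The two preprocessing steps above can be sketched in plain NumPy. This is only an illustration on made-up toy data, not the project's actual script: the helper names `drop_missing_rows` and `standardize` are hypothetical, and the scaling is written by hand to show what `StandardScaler` computes.

```python
import numpy as np

def drop_missing_rows(X, y):
    """Keep only rows where neither the features nor the target are NaN."""
    keep = ~np.isnan(X).any(axis=1) & ~np.isnan(y)
    return X[keep], y[keep]

def standardize(X):
    """Scale each column to mean 0 and standard deviation 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)   # population std (ddof=0), the convention StandardScaler uses
    std[std == 0] = 1.0   # avoid dividing by zero; constant columns become all zeros
    return (X - mean) / std

# Hypothetical toy data: column 0 plays the role of age, column 1 a binary comorbidity flag.
X_raw = np.array([[34.0, 0.0], [71.0, 1.0], [np.nan, 0.0], [58.0, 1.0], [89.0, 0.0]])
y_raw = np.array([0.0, 1.0, 0.0, 0.0, 1.0])

X_clean, y_clean = drop_missing_rows(X_raw, y_raw)   # the row containing NaN is removed
X_scaled = standardize(X_clean)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```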
In the dataset, the 'Age' feature has a much larger range of values (e.g., 20 to 90) compared to the other clinical features, which are binary (0 or 1). This difference in scale can cause problems for the gradient descent algorithm.
When features are on vastly different scales, the cost function surface can become elongated and skewed. This means that the algorithm will take a long time to converge, or it might even fail to converge to the optimal solution. The learning algorithm will be dominated by the feature with the larger range, in this case, 'Age'.
By normalising the 'Age' feature (scaling it to have a mean of 0 and a standard deviation of 1), we ensure that all features have a similar scale. This results in a more symmetrical cost function, allowing gradient descent to converge much faster and more reliably.
Without normalisation, the gradient updates would be dominated by the 'Age' feature. The learning rate would have to be kept small enough to avoid overshooting along 'Age', so the weights for the binary features would change very slowly and training might stop before the model learns their true importance. This would lead to a less accurate and less reliable model. With a larger learning rate, the gradient descent algorithm might instead oscillate and never find the minimum of the cost function.
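One way to see the "elongated" cost surface is to look at the curvature of the cost function. The sketch below uses synthetic data (an age-like column and one binary column, both made up for illustration) and compares the condition number of the Hessian of the logistic cost at `w = 0, b = 0`, with the bias term ignored for simplicity. A large condition number means the surface is much steeper in one direction than in another.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
age = rng.uniform(20, 90, size=m)                  # wide-range feature
flag = rng.integers(0, 2, size=m).astype(float)    # binary feature (0 or 1)

X_raw = np.column_stack([age, flag])
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def curvature_condition_number(X):
    """Condition number of the logistic-cost Hessian at w = 0, b = 0 (bias ignored).

    At that point every predicted probability is 0.5, so the Hessian with
    respect to w is X^T X / (4m).
    """
    hessian = X.T @ X / (4 * X.shape[0])
    return np.linalg.cond(hessian)

cond_raw = curvature_condition_number(X_raw)
cond_scaled = curvature_condition_number(X_scaled)
print(f"raw: {cond_raw:.1f}, standardized: {cond_scaled:.2f}")
```

On this toy data the raw features give a condition number in the thousands, while the standardized features give a value close to 1, which is the nearly symmetrical bowl that gradient descent handles well.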
This project requires Python and the following libraries:
- NumPy
- Pandas
- scikit-learn (only for `StandardScaler`)
- Clone the repository:
- Install the required packages:
- Run the `covid_death_predictor.py` script to train the model and see the results:
The logistic regression model is implemented from scratch. Here's a breakdown of the core components:
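The sigmoid function is used to map the output of the linear model to a probability between 0 and 1:
$$g(z) = \frac{1}{1 + e^{-z}}$$
where $z = w \cdot x + b$ is the linear combination of the features $x$ with the weights $w$ and the bias $b$.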
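A minimal NumPy sketch of the sigmoid and of how it turns a linear score into a probability. The function names and the toy inputs are illustrative and may differ from those in the script.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Probability of the positive class for each row of X."""
    return sigmoid(X @ w + b)

# Hypothetical inputs: two samples with two features each.
X = np.array([[1.0, 2.0], [-1.0, 0.0]])
w = np.array([0.5, -0.25])
b = 0.0

probs = predict_proba(X, w, b)   # linear scores are 0.0 and -0.5
print(probs)
```

A score of 0 maps to exactly 0.5, negative scores map below 0.5, and positive scores map above it.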
The cost function used is the binary cross-entropy loss, which measures the performance of a classification model whose output is a probability value between 0 and 1.
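For $m$ samples with labels $y_i$ and predicted probabilities $p_i$, the binary cross-entropy is $J = -\frac{1}{m}\sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$. The sketch below is one way to write it in NumPy; the clipping constant `eps` is an assumption added here to avoid `log(0)` and is not necessarily part of the original script.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b, eps=1e-15):
    """Binary cross-entropy averaged over all samples."""
    p = sigmoid(X @ w + b)
    p = np.clip(p, eps, 1 - eps)   # keep log() finite when p saturates at 0 or 1
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical toy data.
X = np.array([[0.5, 1.0], [-1.5, 0.0], [2.0, 1.0], [-0.5, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

# With all parameters at zero every prediction is 0.5, so the cost is log(2).
cost_at_zero = compute_cost(X, y, np.zeros(2), 0.0)
print(cost_at_zero)
```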
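Gradient descent is used to optimize the parameters (weights $w$ and bias $b$) by repeatedly applying the updates
$$w := w - \alpha \frac{\partial J}{\partial w}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$
where $J(w, b)$ is the cost function and $\alpha$ is the learning rate.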
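For the binary cross-entropy cost with a sigmoid output, the gradients have the simple form $\frac{\partial J}{\partial w} = \frac{1}{m} X^\top (p - y)$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_i (p_i - y_i)$. A single update step could be sketched as follows; the function names and the two-sample dataset are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradients(X, y, w, b):
    """Gradients of the binary cross-entropy with respect to w and b."""
    m = X.shape[0]
    error = sigmoid(X @ w + b) - y   # predicted probability minus true label
    dw = X.T @ error / m
    db = error.mean()
    return dw, db

def gradient_step(X, y, w, b, alpha):
    """Apply one gradient descent update with learning rate alpha."""
    dw, db = compute_gradients(X, y, w, b)
    return w - alpha * dw, b - alpha * db

# Hypothetical toy data: one feature, two samples.
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, 0.0])

dw, db = compute_gradients(X, y, np.zeros(1), 0.0)
w_new, b_new = gradient_step(X, y, np.zeros(1), 0.0, alpha=0.1)
print(dw, db, w_new, b_new)
```

Starting from zero, both predictions are 0.5, so the weight gradient is -0.5 and the bias gradient is 0. The step therefore moves the weight to 0.05 and leaves the bias unchanged.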
The model is trained using the gradient descent algorithm to find the optimal weights $w$ and bias $b$.
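Put together, a training loop might look like the sketch below. The learning rate, the iteration count, the zero initialisation, and the synthetic dataset are all assumptions chosen for illustration; the script's actual hyperparameters are not stated here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b, eps=1e-15):
    p = np.clip(sigmoid(X @ w + b), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def train(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression, starting from zeros."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    cost_history = []
    for _ in range(num_iters):
        cost_history.append(compute_cost(X, y, w, b))   # cost before this update
        error = sigmoid(X @ w + b) - y
        w = w - alpha * (X.T @ error) / m
        b = b - alpha * error.mean()
    cost_history.append(compute_cost(X, y, w, b))       # cost after the last update
    return w, b, cost_history

# Synthetic data: labels drawn from a known logistic model (made up for this example).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_w, true_b = np.array([1.5, -1.0, 0.5]), -0.5
y = (rng.random(200) < sigmoid(X @ true_w + true_b)).astype(float)

w, b, cost_history = train(X, y)
print(f"cost: {cost_history[0]:.4f} -> {cost_history[-1]:.4f}")
```

Tracking the cost at every iteration is a cheap way to check that the learning rate is reasonable: the values should fall steadily rather than oscillate.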
The final parameters after training are:
- `w_final = np.array([0.38096335, 0.22271814, 0.2052443, 0.2037135, 0.1706696, 0.20320603, 0.21854035, 0.21190117, 0.20912185, 0.22683057, 0.03534338])`
- `b_final = -1.0959193809146979`
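As a rough illustration of how these parameters would be used, the sketch below applies them to two hypothetical, already standardized input vectors. The input length is simply set to match the number of weights listed above, and the 0.5 decision threshold is an assumption. After standardization, a vector of zeros represents a patient whose features all sit at the dataset mean.

```python
import numpy as np

w_final = np.array([0.38096335, 0.22271814, 0.2052443, 0.2037135, 0.1706696,
                    0.20320603, 0.21854035, 0.21190117, 0.20912185, 0.22683057,
                    0.03534338])
b_final = -1.0959193809146979

def predict(X, w, b, threshold=0.5):
    """Return predicted probabilities and hard 0/1 labels."""
    proba = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return proba, (proba >= threshold).astype(int)

# Hypothetical standardized inputs (not real patients).
x_average = np.zeros(w_final.size)       # every feature at its mean
x_high = np.full(w_final.size, 2.0)      # every feature two standard deviations above its mean
X_new = np.vstack([x_average, x_high])

proba, labels = predict(X_new, w_final, b_final)
print(proba, labels)
```

For the all-zeros input the probability depends only on the bias and comes out at about 0.25, so the predicted label is 0 (survival).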
The model's performance was evaluated using the following metrics:
- Confusion Matrix:
  - True Positives (TP): 27 - The model correctly predicted 27 deaths.
  - False Positives (FP): 18 - The model incorrectly predicted 18 deaths (they survived).
  - False Negatives (FN): 1121 - The model incorrectly predicted 1121 survivals (they died). This is a high number and a major concern.
  - True Negatives (TN): 3545 - The model correctly predicted 3545 survivals.
- Accuracy: 75.82% - This is the percentage of total correct predictions. While it seems decent, accuracy can be misleading in the case of imbalanced datasets.
- Precision: 60.0% - Of all the patients the model predicted would die, 60% actually did.
- Recall: 2.35% - This is the most alarming metric. It means the model only identified 2.35% of all the patients who actually died. The model is failing to identify the positive class (death).
- F1 Score: 4.53% - The F1 score is the harmonic mean of precision and recall. A low F1 score indicates that the model has poor performance, especially when there is an imbalance between precision and recall, as seen here.
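The metrics above follow directly from the four confusion-matrix counts, as the short sketch below shows. The baseline accuracy (always predicting survival) and the death rate are extra quantities derived here from the same counts to put the accuracy figure in context.

```python
# Confusion-matrix counts reported above.
tp, fp, fn, tn = 27, 18, 1121, 3545
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Context for the accuracy figure.
death_rate = (tp + fn) / total          # share of patients who actually died
baseline_accuracy = (fp + tn) / total   # accuracy of always predicting survival

print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
      f"recall={recall:.2%} f1={f1:.2%}")
print(f"death rate={death_rate:.2%} baseline accuracy={baseline_accuracy:.2%}")
```

A classifier that always predicts survival would reach roughly 75.6% accuracy on these counts, which is why the 75.82% figure says little about how well deaths are detected.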
The results show that while the model has a reasonable accuracy, it is a poor predictor of mortality. The extremely low recall and F1 score indicate that the model is heavily biased towards predicting survival (the majority class). It fails to identify the vast majority of patients who died. This is likely due to the class imbalance in the dataset (more survivors than deaths). I was unable to find a free, balanced dataset with enough clinical data to train a model. Please let me know if you find one. The model is not suitable for clinical use in its current state.