Multiclass Classification Analysis and Quality Testing of MADDY (Microtubule Assembly and Disassembly Dynamics model) Output
Published by: Maria S. Kelly - Dima Group @ University of Cincinnati
Details of the applied analysis can be found in the original publication: https://doi.org/10.1016/j.csbj.2022.01.028
Overview: This script performs a multiclass classification of a given dataset by applying cross-validation to several classification and ensemble algorithms, so that the prediction accuracy and standard deviation of each method can be compared. The SMOTE oversampling technique is applied in case of uneven class populations. Further quality analysis (confusion matrices, PR curves, and ROC curves) is produced for selected algorithms of interest.
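The comparison described in the overview can be sketched as follows. This is a minimal illustration, not the script's own code: the synthetic data, the `candidates` dictionary, and the fold and repeat counts are assumptions made for the example, and the SMOTE step is left out because it needs the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic, imbalanced 3-class data standing in for the real feature table.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0,
)

# Hypothetical candidates; the classifiers actually used are listed
# in the original publication.
candidates = {
    "LogReg": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Stratification keeps the class proportions roughly equal in every fold;
# repeating the split gives more accuracy samples per classifier.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)

all_scores = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    all_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

The per-fold scores kept in `all_scores` are the kind of values a boxplot of prediction accuracy would be drawn from.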
Procedure:
- User Input: Enter the name of the file containing all features and class labels, in the order in which they appear in the dataset (the header of the class-label column is needed so that the script can identify the correct column). Enter the class names and one color per class for plot parameterization. Use "clf_dict" to select the algorithms found to give the best initial predictions.
- Loading Data: The script will attempt to identify the number of features automatically. A prompt then lets the user either continue with the classification if the feature count is correct or remove a number of features. To remove features, enter their names at the prompt; a new dataframe will then be generated.
- Feature Distribution: The distributions of all features will be plotted together in a single saved figure.
- Classification: A repeated stratified k-fold was utilized because of the skew found in the datasets of the original publication. The classifiers used, and the reasoning behind their parameters, are described in detail in the original publication. SMOTE was also tested using another selection of methods. Two boxplots representing prediction accuracy measurements will be generated.
- Confusion Matrices: Multiclass confusion matrices will be generated and labeled according to the 'class_names' variable listed above. The center value of each cell is the sum of testing-set predictions over all k-folds. Each cell also shows the total number of predictions for its true class and the percentage of that true class's predictions that the cell represents. The colorbar represents the strength of prediction success.
- PR/ROC Curves: Using the One-vs-Rest approach, the testing set of the last k-fold will be plotted to compare the algorithms selected in 'clf_dict'.
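The bookkeeping behind the confusion-matrix and PR/ROC steps might look like the sketch below. It is an illustration under stated assumptions rather than the script's implementation: the data are synthetic, a single random forest stands in for the entries of 'clf_dict', and a plain (non-repeated) stratified k-fold is used so that every sample is counted exactly once in the summed matrix. Plotting is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import label_binarize

# Synthetic, imbalanced 3-class data (assumption for this example).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, weights=[0.6, 0.3, 0.1], random_state=0,
)
labels = np.unique(y)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
total_cm = np.zeros((len(labels), len(labels)), dtype=int)

for train_idx, test_idx in cv.split(X, y):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    # Add this fold's testing-set predictions to the running matrix.
    total_cm += confusion_matrix(
        y[test_idx], clf.predict(X[test_idx]), labels=labels
    )

# Rows are true classes: total predictions per true class, and the
# percentage of each true class that falls in each cell.
row_totals = total_cm.sum(axis=1)
row_percent = 100.0 * total_cm / row_totals[:, None]

# One-vs-Rest curves on the testing set of the last fold: each class in
# turn is treated as positive and all other classes as negative.
y_bin = label_binarize(y[test_idx], classes=labels)
proba = clf.predict_proba(X[test_idx])
curves = {}
for i, label in enumerate(labels):
    fpr, tpr, _ = roc_curve(y_bin[:, i], proba[:, i])
    precision, recall, _ = precision_recall_curve(y_bin[:, i], proba[:, i])
    curves[int(label)] = {"roc": (fpr, tpr), "pr": (recall, precision)}
```

With a repeated k-fold, each sample would appear in the summed matrix once per repeat, so the row totals would be the class counts multiplied by the number of repeats.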