diff --git a/README.md b/README.md
index f746e56..4848459 100644
--- a/README.md
+++ b/README.md
@@ -1,29 +1,190 @@
-# Project 2
+# Machine Learning
+## Project 2

-Select one of the following two options:
+# Group Members - Contribution
+* Venkata Naga Lakshmi Sai Snigdha Sri Jata - A20560684 - 33.33%
+* Sharan Rama Prakash Shenoy - A20560684 - 33.33%
+* Adarsh Chidirala - A20561069 - 33.33%

-## Boosting Trees
+---
+## Usage Instructions

-Implement the gradient-boosting tree algorithm (with the usual fit-predict interface) as described in Sections 10.9-10.10 of Elements of Statistical Learning (2nd Edition). Answer the questions below as you did for Project 1.
+### Installation

-Put your README below. Answer the following questions.
+To get started with this project, you first need **Python 3.x**. Then follow these installation steps:

-* What does the model you have implemented do and when should it be used?
-* How did you test your model to determine if it is working reasonably correctly?
-* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
-* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
+#### 1. Clone the repository to your local machine:

-## Model Selection
+```bash
+git clone https://github.com/adarsh-chidirala/Project2_Adarsh_Ch_Group.git
+```
+#### 2. Steps to Run the Code on Mac

-Implement generic k-fold cross-validation and bootstrapping model selection methods.
+Follow these steps to set up and run the project:

-In your README, answer the following questions:
+1. **Create a Virtual Environment**:
+   - Navigate to your project directory and create a virtual environment using:
+   ```bash
+   python3 -m venv myenv
+   ```

-* Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
-* In what cases might the methods you've written fail or give incorrect or undesirable results?
-* What could you implement given more time to mitigate these cases or help users of your methods?
-* What parameters have you exposed to your users in order to use your model selectors.
+2. **Activate the Virtual Environment**:
+   - Activate the created virtual environment by running:
+   ```bash
+   source myenv/bin/activate
+   ```
+
+3. **Install Required Libraries**:
+   - Install the necessary Python libraries with the following command:
+   ```bash
+   pip install numpy pandas matplotlib scikit-learn
+   ```
+
+4. **Run the Script**:
+   - Navigate to the directory containing your script and run it:
+   ```bash
+   python project2.py
+   ```
+
+Make sure that the script `project2.py` and any required dataset files are correctly placed in your project directory.
+
+#### 3. Steps to Run the Code on Windows
+
+Follow these instructions to set up and execute the project on a Windows system:
+
+1. **Create a Virtual Environment**:
+   - Open Command Prompt and navigate to your project directory:
+   ```cmd
+   cd path\to\your\project\directory
+   ```
+   - Create a virtual environment in your project directory by running:
+   ```cmd
+   python -m venv myenv
+   ```
+
+2. **Activate the Virtual Environment**:
+   - Activate the virtual environment with the following command:
+   ```cmd
+   myenv\Scripts\activate
+   ```
+
+3. **Install Required Libraries**:
+   - Install the necessary libraries by executing:
+   ```cmd
+   pip install numpy pandas matplotlib scikit-learn
+   ```
+
+4. **Run the Script**:
+   - Make sure the script `project2.py` and any necessary dataset files are placed in your project directory. Run the script with:
+   ```cmd
+   python project2.py
+   ```
+Ensure that all paths are correct and relevant files are located in the specified directories.
+
+#### 4. Running Datasets:
+- We use two datasets: `customer_dataset.csv` and `patient_data.csv`. For each of these, we need to specify the feature columns and the target column.
+- This is done in the main block at the bottom of the code, as follows.
+- For customer_dataset.csv:
+```python
+data = pd.read_csv('customer_dataset.csv')
+
+feature_columns = ['Age','Annual_Income','Spending_Score']
+target_column = 'Store_Visits'
+```
+- For patient_data.csv:
+```python
+data = pd.read_csv('patient_data.csv')
+
+feature_columns = ['RR_Interval','QRS_Duration','QT_Interval']
+target_column = 'Heart_Rate'
+```
+
+## K-Fold Cross-Validation and Bootstrapping Model Selection Methods
+## Introduction
+- This project implements generic k-fold cross-validation and bootstrapping model selection methods, using AIC scores to evaluate model performance.
+- These techniques are designed to evaluate and compare the performance of machine learning models on different datasets, providing insight into how well each model fits the data and how reliably it predicts outcomes, based on the scores calculated.
+- The implementation can be adapted to various models and allows customization to suit individual requirements by modifying its features.
+
+## Key Features Implemented
+- The implementation supports three regression models (Linear, Ridge, and Lasso) and calculates a mean AIC score for each model using the following validation methods.
+- It implements both K-Fold Cross-Validation and Bootstrapping Validation.
+- **K-Fold Cross-Validation:** This method evaluates a model by splitting the dataset into k subsets, training on k-1 subsets and testing on the remaining subset, rotating through all k folds.
+- **Bootstrapping:** This method evaluates performance by repeatedly resampling the dataset with replacement.
+- The project also generates plots of the k-fold and bootstrap AIC scores and their distributions.
+
+
+### 1. Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
+In simple cases like linear regression, where the relationship between the dependent and independent variables is assumed to be linear, cross-validation, bootstrapping, and AIC will often agree on the same model.
+However, agreement is not guaranteed: it depends on the dataset provided and the complexity of the models being compared.
+
+Note that cross-validation and bootstrapping provide estimates of model performance, while AIC is a model-selection criterion. Because they serve different purposes, they can disagree in model evaluation and selection.
+
+To determine whether cross-validation and bootstrapping model selectors agree with AIC in a specific case, it is recommended to perform experiments and compare the results, as in the sketch below.
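+
+For example, a minimal experiment of this kind (a sketch, assuming `project2.py` and `customer_dataset.csv` are in the working directory) fits ordinary least squares once on the full data and compares its AIC (2k - 2 log-likelihood, as computed by `AIC_calculator`) against the resampling-based estimates from our `Model_Selector`:
+
+```python
+import numpy as np
+import pandas as pd
+from project2 import Model_Selector
+
+data = pd.read_csv('customer_dataset.csv')
+X = data[['Age', 'Annual_Income', 'Spending_Score']].values
+y = data['Store_Visits'].values
+
+selector = Model_Selector()
+
+# AIC of a single linear fit on the full dataset
+beta = selector.linear_regression_model(X, y)
+full_aic = selector.AIC_calculator(X, y, beta)
+
+# Resampling-based AIC estimates for the same model
+kf_scores = selector.k_fold_cross_validation(X, y, selector.linear_regression_model, k=5)
+bs_scores = selector.bootstrap_validation(X, y, selector.linear_regression_model, n_bootstraps=100)
+
+print(f"Full-data AIC:      {full_aic:.2f}")
+print(f"k-fold mean AIC:    {np.mean(kf_scores):.2f}")
+print(f"Bootstrap mean AIC: {np.mean(bs_scores):.2f}")
+```
+
+Note that the k-fold scores are computed on the held-out folds (about n/k rows each), so their scale differs from the full-data and bootstrap values; the comparison is meaningful across models, not across methods.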
+### 2. In what cases might the methods you've written fail or give incorrect or undesirable results?
+Failures or undesirable results may occur under these conditions:
+- **Small Datasets:** k-fold cross-validation and bootstrapping can give unreliable AIC scores when the dataset is too small.
+- **Imbalanced Data:** Models trained on imbalanced datasets may make biased predictions, which affects the AIC calculations.
+- **Correlated Predictors:** Strong correlations between predictors can lead to unstable coefficient estimates, affecting AIC evaluations.
+- **Wrong Model Assumptions:** AIC relies on specific model assumptions (e.g., linear regression assumes Gaussian errors). If these assumptions are violated, AIC might not reflect the true model fit.
+- **Overfitting with Bootstrapping:** Repeated sampling with replacement might make bootstrapped datasets favor complex models too much, leading to overly optimistic AIC values for those models.

+### 3. What could you implement given more time to mitigate these cases or help users of your methods?
+
+1. Use Ridge and Lasso penalties during cross-validation: Regularization adds a penalty on the model's complexity to reduce overfitting. Ridge regression adds an L2 penalty, while Lasso regression adds an L1 penalty.
+Use RidgeCV and LassoCV from scikit-learn: These classes perform cross-validation to find the best regularization parameter (alpha) for Ridge and Lasso regression (see the sketch after this list).
+
+2. Use stratified sampling for k-fold cross-validation: Stratified k-fold ensures that each fold has the same proportion of class labels as the original dataset, which helps with imbalanced datasets.
+Implement balanced bootstrapping: Bootstrapping involves sampling with replacement. Balanced bootstrapping ensures that each class is represented equally in each sample.
+
+3. Add BIC or deviance as options besides AIC: AIC is a common metric for model selection, but BIC (Bayesian Information Criterion) and deviance can provide additional insights.
+
+4. Plot performance for different hyperparameters: Validation curves help visualize how the model's performance changes with different values of hyperparameters (e.g., alpha in Ridge/Lasso). This helps in selecting a suitable hyperparameter value.
+
+5. Provide summaries for bootstrap and k-fold iterations: Summarizing the results of resampling methods (like bootstrap and k-fold) helps users understand the variability and stability of the model's performance.
+
+6. Let users set evaluation metrics and sampling methods: Providing flexibility in setting these parameters allows users to customize model selection according to their specific needs and preferences.
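+
+As a quick sketch of item 1 above (the alpha grid is illustrative, and `customer_dataset.csv` is assumed to be in the working directory):
+
+```python
+import numpy as np
+import pandas as pd
+from sklearn.linear_model import RidgeCV, LassoCV
+
+data = pd.read_csv('customer_dataset.csv')
+X = data[['Age', 'Annual_Income', 'Spending_Score']].values
+y = data['Store_Visits'].values
+
+# Search a grid of regularization strengths with built-in cross-validation.
+alphas = np.logspace(-3, 3, 13)
+ridge = RidgeCV(alphas=alphas).fit(X, y)
+lasso = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X, y)
+
+print(f"Ridge alpha chosen by CV: {ridge.alpha_:.4f}")
+print(f"Lasso alpha chosen by CV: {lasso.alpha_:.4f}")
+```
+
+The chosen alphas could then be passed to our `ridge_regression_model` and `lasso_regression_model` in place of the fixed `alpha=1.0` used in `generic_process`.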
+
+### 4. What parameters have you exposed to your users in order to use your model selectors.
+1. **Cross-Validation:**
+- `k`: Number of folds to split the data into.
+- `model_fn` and `alpha`: The regression model to evaluate (linear, ridge, or lasso) and its regularization strength.
+2. **Bootstrapping:**
+- `n_bootstraps`: Number of bootstrap resamples to draw.
+- `model_fn` and `alpha`: As above.
+3. **General Settings:**
+- Features (`X_columns`) and target (`y_column`): Columns to use from any dataset.
+
+
+### Code Visualization:
+- The following screenshots display the results of each test case implemented in this project:
+
+### 1. customer_dataset.csv:
+- Tests the models on the customer dataset and verifies that the results are reasonable.
+- i. K-fold cross-validation AIC score:
+  ![Customer Test Image](customer_aic_k_fold.png)
+- ii. Bootstrap AIC score:
+  ![Customer Test Image](customer_aic_bootstrap.png)
+- iii. K-fold cross-validation AIC distribution:
+  ![Customer Test Image](customer_aic_distr_k_fold.png)
+- iv. Bootstrap AIC distribution:
+  ![Customer Test Image](customer_aic_dis_bootstrap.png)
+- v. Bootstrap Output:
+  ![Customer Test Image](customer_output.png)
+
+### 2. patient_data.csv:
+- Tests the models on the patient dataset and verifies that the results are reasonable.
+- i. K-fold cross-validation AIC score:
+  ![Patient Test Image](patient_aic_k_fold.png)
+- ii. Bootstrap AIC score:
+  ![Patient Test Image](patient_aic_bootstrap.png)
+- iii. K-fold cross-validation AIC distribution:
+  ![Patient Test Image](patient_aic_dis_k_fold.png)
+- iv. Bootstrap AIC distribution:
+  ![Patient Test Image](patient_aic_dis_bootstrap.png)
+- v. Bootstrap Output:
+  ![Patient Test Image](patient_output.png)
-See sections 7.10-7.11 of Elements of Statistical Learning and the lecture notes. Pay particular attention to Section 7.10.2.
-As usual, above-and-beyond efforts will be considered for bonus points.
diff --git a/customer_aic_bootstrap.png b/customer_aic_bootstrap.png
new file mode 100644
index 0000000..8a15b7b
Binary files /dev/null and b/customer_aic_bootstrap.png differ
diff --git a/customer_aic_dis_bootstrap.png b/customer_aic_dis_bootstrap.png
new file mode 100644
index 0000000..113e58a
Binary files /dev/null and b/customer_aic_dis_bootstrap.png differ
diff --git a/customer_aic_distr_k_fold.png b/customer_aic_distr_k_fold.png
new file mode 100644
index 0000000..e30fa9e
Binary files /dev/null and b/customer_aic_distr_k_fold.png differ
diff --git a/customer_aic_k_fold.png b/customer_aic_k_fold.png
new file mode 100644
index 0000000..2b10f7e
Binary files /dev/null and b/customer_aic_k_fold.png differ
diff --git a/customer_dataset.csv b/customer_dataset.csv
new file mode 100644
index 0000000..5728d1a
--- /dev/null
+++ b/customer_dataset.csv
@@ -0,0 +1,101 @@
+Customer_ID,Age,Annual_Income,Spending_Score,Store_Visits
+Customer_1,56,21864,42,7
+Customer_2,69,29498,99,10
+Customer_3,46,59544,7,9
+Customer_4,32,36399,16,16
+Customer_5,60,57140,90,7
+Customer_6,25,69554,60,9
+Customer_7,38,53173,2,8
+Customer_8,56,58955,1,14
+Customer_9,36,36554,48,12
+Customer_10,40,48320,12,8
+Customer_11,28,72034,69,7
+Customer_12,28,33141,37,5
+Customer_13,41,64250,32,11
+Customer_14,53,75897,9,4
+Customer_15,57,56868,99,9
+Customer_16,41,24735,19,8
+Customer_17,20,54902,48,14
+Customer_18,39,48783,80,10
+Customer_19,19,57016,3,13
+Customer_20,41,61041,20,18
+Customer_21,61,38304,24,14
+Customer_22,47,37341,54,9
+Customer_23,55,47741,33,6
+Customer_24,19,35516,24,9
+Customer_25,38,52257,75,7
+Customer_26,50,48298,72,4
+Customer_27,29,89502,36,12
+Customer_28,39,34623,38,15
+Customer_29,61,38269,84,7
+Customer_30,42,56359,99,7
+Customer_31,66,63090,89,11
+Customer_32,44,84308,99,11
+Customer_33,59,74343,25,12
+Customer_34,45,62355,93,11
+Customer_35,33,54395,18,6
+Customer_36,32,63449,82,5
+Customer_37,64,40845,66,6
+Customer_38,68,45257,54,8
+Customer_39,61,27763,35,11
+Customer_40,69,46567,80,7
+Customer_41,20,64439,61,16
+Customer_42,54,46854,41,10
+Customer_43,68,38389,33,16
+Customer_44,24,44603,68,11
+Customer_45,38,60861,33,12
+Customer_46,26,46163,14,6
+Customer_47,56,62748,21,13
+Customer_48,35,30330,48,13
+Customer_49,21,36945,20,7
+Customer_50,42,42400,8,7
+Customer_51,31,30350,7,8
+Customer_52,67,94154,67,9
+Customer_53,26,33556,17,8
+Customer_54,43,63723,33,10
+Customer_55,19,40009,48,9
+Customer_56,37,42293,76,11
+Customer_57,45,54519,59,6 +Customer_58,64,28122,86,11 +Customer_59,24,40058,22,12 +Customer_60,61,47802,30,7 +Customer_61,25,37309,38,10 +Customer_62,64,37662,51,9 +Customer_63,52,66300,54,5 +Customer_64,31,65074,8,9 +Customer_65,34,43373,27,12 +Customer_66,53,48737,27,10 +Customer_67,67,68555,98,5 +Customer_68,57,28602,21,10 +Customer_69,21,55070,30,10 +Customer_70,19,79618,97,14 +Customer_71,23,79475,28,7 +Customer_72,59,20901,64,15 +Customer_73,21,38560,97,10 +Customer_74,46,52529,69,14 +Customer_75,35,30171,61,15 +Customer_76,43,39976,48,8 +Customer_77,61,47940,19,10 +Customer_78,51,71019,4,12 +Customer_79,27,49318,35,8 +Customer_80,53,53254,64,7 +Customer_81,31,57686,49,11 +Customer_82,48,58152,17,12 +Customer_83,65,50421,44,7 +Customer_84,32,32043,92,10 +Customer_85,25,61845,30,10 +Customer_86,31,56472,93,10 +Customer_87,40,33548,46,15 +Customer_88,57,39765,6,12 +Customer_89,38,63312,99,6 +Customer_90,33,59082,37,12 +Customer_91,62,39195,24,18 +Customer_92,35,45950,93,9 +Customer_93,64,47848,46,8 +Customer_94,41,59255,53,11 +Customer_95,43,68734,95,7 +Customer_96,42,50984,99,9 +Customer_97,62,67538,60,9 +Customer_98,58,63099,97,8 +Customer_99,46,53084,63,11 +Customer_100,32,28347,85,6 diff --git a/customer_output.png b/customer_output.png new file mode 100644 index 0000000..1e1aa6d Binary files /dev/null and b/customer_output.png differ diff --git a/patient_aic_bootstrap.png b/patient_aic_bootstrap.png new file mode 100644 index 0000000..31a95e9 Binary files /dev/null and b/patient_aic_bootstrap.png differ diff --git a/patient_aic_dis_bootstrap.png b/patient_aic_dis_bootstrap.png new file mode 100644 index 0000000..133437a Binary files /dev/null and b/patient_aic_dis_bootstrap.png differ diff --git a/patient_aic_dis_k_fold.png b/patient_aic_dis_k_fold.png new file mode 100644 index 0000000..9872414 Binary files /dev/null and b/patient_aic_dis_k_fold.png differ diff --git a/patient_aic_k_fold.png b/patient_aic_k_fold.png new file mode 100644 index 0000000..9ab0943 Binary files /dev/null and b/patient_aic_k_fold.png differ diff --git a/patient_data.csv b/patient_data.csv new file mode 100644 index 0000000..4a781fa --- /dev/null +++ b/patient_data.csv @@ -0,0 +1,101 @@ +Name,RR_Interval,QRS_Duration,QT_Interval,Heart_Rate +Patient_1,0.8496714153011233,0.07169258515899171,0.3743114944139313,66.470558102682 +Patient_2,0.7861735698828816,0.09158709354469283,0.38243138105472935,73.5181193738789 +Patient_3,0.8647688538100693,0.09314570966946462,0.4033220497270111,73.11917039243617 +Patient_4,0.9523029856408026,0.08395445461556762,0.4021520820813961,66.05700924928192 +Patient_5,0.7765846625276664,0.09677428576667982,0.30489322528171636,77.15686853574789 +Patient_6,0.776586304305082,0.10808101713629077,0.32248699840339506,77.84785008420363 +Patient_7,0.9579212815507392,0.1377237180242106,0.38060141068834635,69.0239513885651 +Patient_8,0.876743472915291,0.10349155625663678,0.38055143803648833,65.47721198252965 +Patient_9,0.7530525614065048,0.1051510078144553,0.3806019074522419,82.41120022914828 +Patient_10,0.8542560043585965,0.09851108168467666,0.5141092596261888,69.22559310251184 +Patient_11,0.7536582307187538,0.061624575694019176,0.3828356204277267,78.5232767790996 +Patient_12,0.7534270246429744,0.09946972249101567,0.405422625607224,85.12999769564942 +Patient_13,0.8241962271566035,0.10120460419882053,0.3981600705397281,76.9252795806596 +Patient_14,0.6086719755342203,0.14926484224970574,0.3860556500522319,102.64281106661645 
+Patient_15,0.6275082167486967,0.09615278070437755,0.34738923021438617,102.1436718765265 +Patient_16,0.7437712470759028,0.10603094684667226,0.3903587688197307,80.77498357676075 +Patient_17,0.6987168879665576,0.09930576460589514,0.3290869914184971,89.28145485471593 +Patient_18,0.8314247332595275,0.07662643924760937,0.3505272557303996,70.61394781116486 +Patient_19,0.7091975924478789,0.12285645629030043,0.3405854580868358,86.22348783309093 +Patient_20,0.6587696298664709,0.11503866065373548,0.3632749655754529,90.42816462759119 +Patient_21,0.9465648768921555,0.11582063894086095,0.4525863426669403,63.8720771753293 +Patient_22,0.7774223699513465,0.08181225090410522,0.28530939229633007,80.15390903910003 +Patient_23,0.8067528204687924,0.128055886218722,0.3874504076149805,70.28111875361058 +Patient_24,0.6575251813786543,0.07196297874415439,0.2954913651524139,101.7131944278849 +Patient_25,0.7455617275474817,0.11173714187600542,0.34112272536842264,75.44614702871685 +Patient_26,0.8110922589709867,0.14380911251619957,0.40355802387869466,67.90337901394749 +Patient_27,0.6849006422577697,0.08018927349738623,0.3625712007638185,93.39450200319018 +Patient_28,0.8375698018345672,0.08867404540794457,0.31689020888282776,75.59413399316603 +Patient_29,0.7399361310081195,0.10199302730175283,0.33138785162960127,84.20867883378384 +Patient_30,0.7708306250206723,0.08993048691767602,0.387183909957387,80.97983886732776 +Patient_31,0.7398293387770604,0.06898673137867735,0.3307853347313145,81.03855071470649 +Patient_32,0.9852278184508938,0.10137125949612055,0.368658343583279,56.41334838912456 +Patient_33,0.7986502775262067,0.0787539257254779,0.36182287359615256,75.50577312009649 +Patient_34,0.69422890710441,0.10947184861270363,0.3339359860957673,83.0410160617954 +Patient_35,0.882254491210319,0.08161151531532394,0.445757763573013,72.88318672446336 +Patient_36,0.6779156350028979,0.13099868810035079,0.3853567608927204,87.77130137871981 +Patient_37,0.8208863595004756,0.08433493415327527,0.2789942965336957,68.96423896138931 +Patient_38,0.6040329876120225,0.0935587696758865,0.36745817259077707,97.72539407626846 +Patient_39,0.667181395110157,0.1162703443473934,0.33352854140926447,91.99522254155255 +Patient_40,0.8196861235869124,0.0753827136713209,0.39409733339184894,70.38012767650545 +Patient_41,0.8738466579995411,0.1045491986920826,0.32829917046269197,64.55083023960674 +Patient_42,0.8171368281189971,0.12614285508564857,0.355410542341324,74.64555124221584 +Patient_43,0.788435171761176,0.06785033530877546,0.38019949115921825,77.32493867144008 +Patient_44,0.7698896304410712,0.10369267717064609,0.39463020776680485,75.39853277026003 +Patient_45,0.6521478009632573,0.10519765588496847,0.31198814371776895,89.64849216351446 +Patient_46,0.7280155791605292,0.11563645743554621,0.3466199505663621,83.57606841031834 +Patient_47,0.7539361229040213,0.07526098578243837,0.34100218755356176,72.34191713693855 +Patient_48,0.9057122226218917,0.07359086773831448,0.33386683069705153,59.20888881018827 +Patient_49,0.8343618289568462,0.11043883131233796,0.4306181696112439,68.31902640971597 +Patient_50,0.6236959844637266,0.10593969346466373,0.3761992684384382,95.13347980575513 +Patient_51,0.8324083969394795,0.10500985700691753,0.30956464182659815,73.63454112997775 +Patient_52,0.7614917719583684,0.10692896418993952,0.39671447788219105,86.16949061816116 +Patient_53,0.7323077999694042,0.0863995055684302,0.44488624788050535,86.22105917974149 +Patient_54,0.8611676288840868,0.10464507394322008,0.40129861042204584,68.87315382926947 
+Patient_55,0.9030999522495952,0.10586144946597363,0.29922520136183944,66.34274774183217 +Patient_56,0.89312801191162,0.08571297163947265,0.3406306370853499,62.16697267119687 +Patient_57,0.7160782476777362,0.13731549022289513,0.4106764459674649,83.69716008615627 +Patient_58,0.7690787624148786,0.10947665841823576,0.33169322137524876,76.57212334297503 +Patient_59,0.8331263431403564,0.07617393005594703,0.37775277712584915,73.63148119763069 +Patient_60,0.897554512712236,0.1131310721726766,0.39098536213717344,62.71215271115612 +Patient_61,0.752082576215471,0.08050636659545357,0.32292278113687667,82.37520623552601 +Patient_62,0.7814341023336183,0.11574169207484905,0.357618985757528,84.44560083301468 +Patient_63,0.6893665025993972,0.12317191158014809,0.23034930639723708,86.49263020374767 +Patient_64,0.6803793375919329,0.08358635363296579,0.3190244943466284,90.19465816542541 +Patient_65,0.8812525822394198,0.11926752258488645,0.34989727394427356,71.5356268390884 +Patient_66,0.9356240028570824,0.10825561853872998,0.310088672721406,62.12222249839843 +Patient_67,0.7927989878419667,0.1164412031998898,0.4252964521572654,76.8016892018897 +Patient_68,0.9003532897892025,0.13793585965307895,0.3027943448815747,66.70346932526019 +Patient_69,0.8361636025047634,0.0950922376799426,0.34239822053212066,72.24467294588068 +Patient_70,0.7354880245394876,0.08492527671285022,0.36522962309144363,77.71343774278343 +Patient_71,0.8361395605508415,0.08220971140748953,0.4176509315626446,71.88090657027938 +Patient_72,0.9538036566465969,0.08368379430069124,0.3025655139528224,65.39601994740464 +Patient_73,0.7964173960890049,0.09845796581171792,0.40652655008619837,82.59309802922168 +Patient_74,0.9564643655814007,0.10682303949633289,0.36040932244078344,67.52738955538736 +Patient_75,0.5380254895910256,0.10553381598660039,0.32073965395808196,122.28479226777651 +Patient_76,0.8821902504375224,0.11654366498072048,0.3784841389705308,64.17580253073245 +Patient_77,0.8087047068238171,0.10026003783755814,0.3679623878229388,78.55432086376831 +Patient_78,0.7700992649534133,0.12907068154314635,0.3359913249136482,78.82874388010656 +Patient_79,0.8091760776535503,0.09470686333524088,0.36279208339960073,85.09851272587785 +Patient_80,0.6012431085399108,0.15440338333179238,0.3445874561255296,95.75175218429371 +Patient_81,0.7780328112162489,0.11251334695530013,0.3645406938100499,72.91896107023354 +Patient_82,0.8357112571511747,0.08285684887167435,0.38648522698084187,68.79816874758635 +Patient_83,0.9477894044741517,0.07858215003877776,0.4234406726458141,52.685723691077506 +Patient_84,0.7481729781726353,0.10964944830486371,0.310487380046926,77.56658311817392 +Patient_85,0.7191506397106813,0.09553074429348299,0.44532133498625065,79.63609171943745 +Patient_86,0.7498242956415464,0.11428000988184185,0.2819164880190999,80.77071512233246 +Patient_87,0.8915402117702075,0.1094647524914709,0.35392859619857664,69.00804379079206 +Patient_88,0.8328751109659684,0.09854342174686255,0.383532688259383,81.42046639005171 +Patient_89,0.7470239796232961,0.08306412563863191,0.3712396747094013,85.0708260017295 +Patient_90,0.8513267433113356,0.06970305550628271,0.33509201920717624,67.59370934863053 +Patient_91,0.8097077549348041,0.09106970095865959,0.351675109985709,69.60873602087257 +Patient_92,0.896864499053289,0.11712797588646945,0.3402799626136467,69.35933384842717 +Patient_93,0.7297946906122648,0.10428187488260408,0.3364254097222315,75.61373737788642 +Patient_94,0.7672337853402232,0.07508522442576024,0.393984083880841,87.36031505492532 
+Patient_95,0.7607891846867842,0.10346361851702364,0.37428061943860186,84.76267504473155 +Patient_96,0.6536485051867882,0.10770634759457674,0.3322836161895738,89.44657521239348 +Patient_97,0.8296120277064577,0.08232285127597734,0.39598399501733,63.757290373627896 +Patient_98,0.826105527217989,0.10307450211891056,0.3722919808350644,79.39930807401652 +Patient_99,0.8005113456642461,0.10116417436892,0.3925144847535584,74.37939273967072 +Patient_100,0.7765412866624853,0.07714059404338755,0.3851851536769445,83.4547737136939 diff --git a/patient_output.png b/patient_output.png new file mode 100644 index 0000000..d122c2c Binary files /dev/null and b/patient_output.png differ diff --git a/project2.py b/project2.py new file mode 100644 index 0000000..0585afd --- /dev/null +++ b/project2.py @@ -0,0 +1,219 @@ +import numpy as np +import pandas as pd +import matplotlib.pyplot as plt + +class Model_Selector: + def AIC_calculator(self, X, y, beta): + + #Calculating the AIC scores for the models + + length_y = len(y) + preds = X @ beta + residual = y - preds + rss = np.sum(residual**2) + square_of_sigma = rss / length_y + log_likelihood = -0.5 * length_y * (np.log(2 * np.pi * square_of_sigma) + 1) + + length_beta = len(beta) # Number of coefficients + aic_scores = 2 * length_beta - 2 * log_likelihood + return aic_scores + + def linear_regression_model(self, X, y): + + #Implement Linear regression model + + XTX = X.T @ X + XTy = X.T @ y + beta = np.linalg.solve(XTX, XTy) + return beta + + def ridge_regression_model(self, X, y, alpha=1.0): + #Implement Ridge Regression model. + + features_n = X.shape[1] + XTX = X.T @ X + alpha * np.eye(features_n) + XTy = X.T @ y + beta = np.linalg.solve(XTX, XTy) + return beta + + def lasso_regression_model(self, X, y, alpha=1.0, max_iter=1000, tol=1e-4): + #Implement Lasso Regression model + + n, m = X.shape + beta = np.zeros(m) + for _ in range(max_iter): + beta_value = beta.copy() + for j in range(m): + residual = y - X @ beta + beta[j] * X[:, j] + rho = X[:, j].T @ residual + if rho < -alpha: + beta[j] = (rho + alpha) / (X[:, j] @ X[:, j]) + elif rho > alpha: + beta[j] = (rho - alpha) / (X[:, j] @ X[:, j]) + else: + beta[j] = 0 + if np.linalg.norm(beta - beta_value, ord=1) < tol: + break + return beta + + def k_fold_cross_validation(self, X, y, model_fn, alpha=None, k=5): + n = len(y) + indices = np.arange(n) + np.random.shuffle(indices) + fold_size = n // k + aic_scores = [] + + for i in range(k): + test_indices = indices[i * fold_size:(i + 1) * fold_size] + train_indices = np.setdiff1d(indices, test_indices) + + X_train, X_test = X[train_indices], X[test_indices] + y_train, y_test = y[train_indices], y[test_indices] + + if alpha is not None: + beta = model_fn(X_train, y_train, alpha) + else: + beta = model_fn(X_train, y_train) + + aic_score = self.AIC_calculator(X_test, y_test, beta) + aic_scores.append(aic_score) + + return aic_scores + + def bootstrap_validation(self, X, y, model_fn, alpha=None, n_bootstraps=100): + n = len(y) + aic_scores = [] + + for _ in range(n_bootstraps): + bootstrap_indices = np.random.choice(n, size=n, replace=True) + X_bootstrap, y_bootstrap = X[bootstrap_indices], y[bootstrap_indices] + + if alpha is not None: + beta = model_fn(X_bootstrap, y_bootstrap, alpha) + else: + beta = model_fn(X_bootstrap, y_bootstrap) + + aic_score = self.AIC_calculator(X_bootstrap, y_bootstrap, beta) + aic_scores.append(aic_score) + + return aic_scores + +def summarize_results(aic_scores): + n = len(aic_scores) + mean_aic = sum(aic_scores) / n + variance = 
sum((x - mean_aic) ** 2 for x in aic_scores) / n + std_dev = variance ** 0.5 + return mean_aic, std_dev + +def plot_results(results): + """ + Plot the AIC results for k-Fold and Bootstrapping across models using horizontal bars. + """ + models = list(results.keys()) + mean_aic_kf = [results[model]["mean_aic_kf"] for model in models] + mean_aic_bootstrap = [results[model]["mean_aic_bootstrap"] for model in models] + kf_std_dev = [results[model]["kf_std_dev"] for model in models] + bootstrap_std_dev = [results[model]["bootstrap_std_dev"] for model in models] + + # k-Fold Horizontal Bar Plot with Error Bars + fig, ax = plt.subplots(figsize=(8, 6)) + ax.barh(models, mean_aic_kf, color='green', xerr=kf_std_dev, capsize=5) + ax.set_title('k-Fold Cross-Validation AIC') + ax.set_ylabel('Models') + ax.set_xlabel('AIC') + plt.show() + + # Bootstrap AIC Horizontal Bar Plot with Error Bars + fig, ax = plt.subplots(figsize=(8, 6)) + ax.barh(models, mean_aic_bootstrap, color='purple', xerr=bootstrap_std_dev, capsize=5) + ax.set_title('Bootstrap AIC') + ax.set_ylabel('Models') + ax.set_xlabel('AIC') + plt.show() + + # Boxplots for AIC Distribution + fig, ax = plt.subplots(figsize=(8, 6)) + ax.boxplot([results[model]["kf_scores"] for model in models], vert=True) + ax.set_title('k-Fold AIC Distribution') + ax.set_xticklabels(models) + ax.set_ylabel('AIC') + plt.show() + + fig, ax = plt.subplots(figsize=(8, 6)) + ax.boxplot([results[model]["bootstrap_scores"] for model in models], vert=True) + ax.set_title('Bootstrap AIC Distribution') + ax.set_xticklabels(models) + ax.set_ylabel('AIC') + plt.show() + +def generic_process(data, X_columns, y_column): + """ + Generic process to evaluate AIC for Linear, Ridge, and Lasso models on any dataset. + """ + # Split features and target + X = data[X_columns].values + y = data[y_column].values + + # Initialize Model_Selector + selector = Model_Selector() + + # Define models + models = { + 'linear': selector.linear_regression_model, + 'ridge': lambda X, y: selector.ridge_regression_model(X, y, alpha=1.0), + 'lasso': lambda X, y: selector.lasso_regression_model(X, y, alpha=1.0) + } + + results = {} + + for model_name, model_fn in models.items(): + print(f"Evaluating model: {model_name.capitalize()}") + + # Perform k-Fold Cross-Validation + aic_scores_kf = selector.k_fold_cross_validation(X, y, model_fn, k=5) + + # Perform Bootstrapping + aic_scores_bootstrap = selector.bootstrap_validation(X, y, model_fn, n_bootstraps=100) + + # Calculate mean and standard deviation + mean_aic_kf, std_aic_kf = summarize_results(aic_scores_kf) + mean_aic_bootstrap, std_aic_bootstrap = summarize_results(aic_scores_bootstrap) + + # Save the results + results[model_name] = { + "mean_aic_kf": mean_aic_kf, + "mean_aic_bootstrap": mean_aic_bootstrap, + "kf_std_dev": std_aic_kf, + "bootstrap_std_dev": std_aic_bootstrap, + "kf_scores": aic_scores_kf, + "bootstrap_scores": aic_scores_bootstrap + } + + # Find the best model + best_model = None + best_mean_aic = float('inf') + + for model_name, scores in results.items(): + avg_aic = (scores["mean_aic_kf"] + scores["mean_aic_bootstrap"]) / 2 + print(f"\n{model_name.capitalize()} - Mean AIC: k-Fold: {scores['mean_aic_kf']:.3f}, Bootstrapping: {scores['mean_aic_bootstrap']:.3f}") + + if avg_aic < best_mean_aic: + best_mean_aic = avg_aic + best_model = model_name + + print(f"\nBest Model: {best_model.capitalize()} with an average AIC of {best_mean_aic:.3f}") + + # Plot results + plot_results(results) + + +# Example Usage: +if __name__ == "__main__": + + 
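+    # To evaluate the customer dataset from the README instead, replace
+    # the dataset lines below with:
+    #   data = pd.read_csv('customer_dataset.csv')
+    #   feature_columns = ['Age','Annual_Income','Spending_Score']
+    #   target_column = 'Store_Visits'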
data = pd.read_csv('patient_data.csv') + + feature_columns = ['RR_Interval','QRS_Duration','QT_Interval'] + target_column = 'Heart_Rate' + + # Run the process + generic_process(data, feature_columns, target_column) \ No newline at end of file