197 changes: 179 additions & 18 deletions README.md
# Machine Learning
## Project 2

# Group Members - Contribution
* Venkata Naga Lakshmi Sai Snigdha Sri Jata - A20560684 - 33.33%
* Sharan Rama Prakash Shenoy - A20560684 - 33.33%
* Adarsh Chidirala - A20561069 - 33.33%

## Usage Instructions

### Installation

To get started with this project, you need **Python 3.x**. Then follow these installation steps:

#### 1. Clone the Repository to your local machine:

```bash
git clone https://github.com/adarsh-chidirala/Project2_Adarsh_Ch_Group.git
```
#### 2. Steps to Run the Code on Mac

Follow these steps to set up and run the project:

1. **Create a Virtual Environment**:
- Navigate to your project directory and create a virtual environment using:
```bash
python3 -m venv myenv
```

2. **Activate the Virtual Environment**:
- Activate the created virtual environment by running:
```bash
source myenv/bin/activate
```

3. **Install Required Libraries**:
- Install the necessary Python libraries with the following command:
```bash
pip install numpy pandas matplotlib scikit-learn
```

4. **Run the Script**:
- Navigate to the directory containing your script and run it:
```bash
python project2.py
```

Make sure that the script `project2.py` and any required dataset files are correctly placed in your project directory.

#### 3. Steps to Run the Code on Windows

Follow these instructions to set up and execute the project on a Windows system:

1. **Create a Virtual Environment**:
- Open Command Prompt and navigate to your project directory:
```cmd
cd path\to\your\project\directory
```
- Create a virtual environment in your project directory by running:
```cmd
python -m venv myenv
```

2. **Activate the Virtual Environment**:
- Activate the virtual environment with the following command:
```cmd
myenv\Scripts\activate
```

3. **Install Required Libraries**:
- Install the necessary libraries by executing:
```cmd
pip install numpy pandas matplotlib scikit-learn
```

4. **Run the Script**:
- Make sure the script `project2.py` and any necessary dataset files are placed in your project directory. Run the script with:
```cmd
python project2.py
```
Ensure that all paths are correct and relevant files are located in the specified directories.

#### 4. Running Datasets:
- We use two datasets: `customer_dataset.csv` and `patient_data.csv`. For each one, the feature columns and the target column must be specified.
- This is done in the main function at the bottom of the code, as follows.
- For customer_dataset.csv
```python
data = pd.read_csv('customer_dataset.csv')

feature_columns = ['Age', 'Annual_Income', 'Spending_Score']
target_column = 'Store_Visits'
```
- For patient_data.csv
```python
data = pd.read_csv('patient_data.csv')

feature_columns = ['RR_Interval', 'QRS_Duration', 'QT_Interval']
target_column = 'Heart_Rate'
```
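Once `feature_columns` and `target_column` are set, the feature matrix and target vector can be extracted as below. This is a minimal sketch using a small in-memory stand-in for the CSV (the actual script reads the file from disk, and its variable names may differ):

```python
import pandas as pd

# Small in-memory stand-in for customer_dataset.csv (same column names).
data = pd.DataFrame({
    'Age': [56, 69, 46],
    'Annual_Income': [21864, 29498, 59544],
    'Spending_Score': [42, 99, 7],
    'Store_Visits': [7, 10, 9],
})

feature_columns = ['Age', 'Annual_Income', 'Spending_Score']
target_column = 'Store_Visits'

X = data[feature_columns].values  # (n_samples, n_features) feature matrix
y = data[target_column].values    # (n_samples,) target vector
```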

## K-Fold Cross-Validation and Bootstrapping Model Selection
## Introduction
- This project implements generic k-fold cross-validation and bootstrapping model selection methods, using AIC scores to evaluate model performance.
- These techniques evaluate and compare the performance of machine learning models on different datasets, providing insight into how well a model fits a given situation and how accurately it predicts outcomes.
- The implementation can be adapted to various models and allows customization to suit individual requirements and resources.

## Key Features Implemented
- Supports three regression models (Linear, Ridge, and Lasso) and computes the mean AIC score for each using the validation methods below.
- Implements both K-Fold Cross-Validation and Bootstrapping validation.
- **K-Fold Cross-Validation:** splits the dataset into k subsets, trains on k-1 of them, and tests on the remaining one, rotating through all k folds.
- **Bootstrapping:** evaluates performance by repeatedly resampling the dataset with replacement.
- The project also plots the K-fold cross-validation and Bootstrapping AIC scores and their distributions.
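The two validation schemes above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's exact code; the AIC formula assumes Gaussian errors, and the out-of-bag evaluation in the bootstrap loop is one common choice among several:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def gaussian_aic(y_true, y_pred, n_params):
    # AIC = n * ln(RSS / n) + 2k, assuming Gaussian errors.
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)
    return n * np.log(rss / n) + 2 * n_params

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# --- k-fold cross-validation: train on k-1 folds, test on the remaining one ---
k = 5
indices = np.arange(len(y))
folds = np.array_split(indices, k)
kfold_aics = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    kfold_aics.append(gaussian_aic(y[test_idx], pred, n_params=X.shape[1] + 1))

# --- bootstrapping: resample with replacement, evaluate on out-of-bag rows ---
boot_aics = []
for _ in range(20):
    boot_idx = rng.integers(0, len(y), size=len(y))
    oob_idx = np.setdiff1d(indices, boot_idx)  # rows not drawn into the resample
    model = LinearRegression().fit(X[boot_idx], y[boot_idx])
    pred = model.predict(X[oob_idx])
    boot_aics.append(gaussian_aic(y[oob_idx], pred, n_params=X.shape[1] + 1))

print(np.mean(kfold_aics), np.mean(boot_aics))
```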


### 1. Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
In simple cases like linear regression, where the relationship between the dependent and independent variables is linear, cross-validation, bootstrapping, and AIC will often agree on the selected model.
However, this is not guaranteed: it depends on the dataset and on the complexity of the candidate models.

Note that cross-validation and bootstrapping estimate out-of-sample predictive performance, while AIC trades in-sample fit against model complexity, so the two kinds of criteria can disagree.

To determine whether cross-validation and bootstrapping agree with AIC in a specific case, it is best to run the experiment and compare the results.
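One quick way to run such a comparison: fit a linear model with and without an irrelevant extra feature, then check whether the AIC ranking matches the cross-validated MSE ranking. A sketch on synthetic data (the `aic` helper and variable names are illustrative, not from `project2.py`; with a pure noise feature the two criteria usually, but not always, agree):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 200
x_true = rng.normal(size=(n, 2))
noise_col = rng.normal(size=(n, 1))  # irrelevant feature
y = x_true @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

def aic(X, y):
    # Gaussian AIC on the full-data fit: n * ln(RSS / n) + 2k.
    model = LinearRegression().fit(X, y)
    rss = np.sum((y - model.predict(X)) ** 2)
    k = X.shape[1] + 1  # coefficients + intercept
    return n * np.log(rss / n) + 2 * k

X_small = x_true
X_big = np.hstack([x_true, noise_col])

aic_small, aic_big = aic(X_small, y), aic(X_big, y)

# 5-fold cross-validated MSE for the same two candidate models.
mse_small = -cross_val_score(LinearRegression(), X_small, y,
                             scoring='neg_mean_squared_error', cv=5).mean()
mse_big = -cross_val_score(LinearRegression(), X_big, y,
                           scoring='neg_mean_squared_error', cv=5).mean()
print('AIC prefers small model:', aic_small < aic_big)
print('CV prefers small model: ', mse_small <= mse_big)
```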

### 2. In what cases might the methods you've written fail or give incorrect or undesirable results?
Failures or undesirable results may occur under these conditions:
- **Small datasets:** k-fold cross-validation and bootstrapping can give unreliable AIC scores when the dataset is too small.
- **Imbalanced data:** models trained on imbalanced datasets may make biased predictions, which distorts the AIC calculations.
- **Correlated predictors:** strong correlations between predictors can lead to unstable coefficient estimates, affecting AIC evaluations.
- **Wrong model assumptions:** AIC relies on specific model assumptions (e.g., linear regression assumes Gaussian errors). If these assumptions are violated, AIC may not reflect the true model fit.
- **Overfitting with bootstrapping:** repeated sampling with replacement can bias bootstrapped evaluations toward overly complex models.

### 3. What could you implement given more time to mitigate these cases or help users of your methods?

1. **Regularization during cross-validation:** Ridge and Lasso penalties reduce overfitting by penalizing model complexity; Ridge adds an L2 penalty, Lasso an L1 penalty. `RidgeCV` and `LassoCV` from scikit-learn perform cross-validation to find the best regularization parameter (alpha) for Ridge and Lasso regression.

2. **Stratified sampling for k-fold cross-validation:** stratified k-fold ensures that each fold has the same proportion of class labels as the original dataset, which helps with imbalanced data. Balanced bootstrapping, in which each class is represented equally in every resample, would help in the same way.

3. **Additional criteria besides AIC:** AIC is a common model-selection metric, but BIC (Bayesian Information Criterion) and deviance can provide additional insight.

4. **Validation curves:** plotting performance for different hyperparameter values (e.g., alpha in Ridge/Lasso) helps visualize how performance changes and select a good value.

5. **Resampling summaries:** summarizing the results of bootstrap and k-fold iterations helps users understand the variability and stability of the model's performance.

6. **Configurable metrics and sampling:** letting users set evaluation metrics and sampling methods allows them to customize the selectors to their specific needs.
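As a sketch of the first mitigation, `RidgeCV` and `LassoCV` choose the penalty strength by internal cross-validation. The synthetic data below is purely illustrative (two of the four true coefficients are zero, so Lasso has something to shrink away):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.2, size=150)

# Each estimator searches its alpha grid via cross-validation and
# stores the winning penalty strength in the alpha_ attribute.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(alphas=np.logspace(-3, 1, 20), cv=5).fit(X, y)

print('ridge alpha:', ridge.alpha_)
print('lasso alpha:', lasso.alpha_)
print('lasso zeroed coefficients:', int(np.sum(lasso.coef_ == 0)))
```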

### 4. What parameters have you exposed to your users in order to use your model selectors?
1. **Cross-Validation:**
   - `k`: number of folds to split the data into.
   - `model_type`: type of model (e.g., 'linear', 'ridge', 'lasso', 'logistic').
2. **Bootstrapping:**
   - `n_bootstraps`: number of random resamples to draw.
   - `model_type`: type of model (e.g., 'linear', 'ridge', 'lasso', 'logistic').
3. **General Settings:**
   - `X_columns` (features) and `y_column` (target): columns to use from any dataset.


### Code Visualization:
- The following screenshots display the results of each test case implemented in this project:

### 1. customer_dataset.csv:
- Tests the model on a small dataset and verifies that the predictions are reasonable.
- i. K-fold cross validation AIC score:
![Customer Test Image](customer_aic_k_fold.png)
- ii. Bootstrap AIC score:
![Customer Test Image](customer_aic_bootstrap.png)
- iii. K-fold cross validation AIC distribution:
![Customer Test Image](customer_aic_distr_k_fold.png)
- iv. Bootstrap AIC distribution:
![Customer Test Image](customer_aic_dis_bootstrap.png)
- v. Bootstrap Output:
![Customer Test Image](customer_output.png)

### 2. patient_data.csv:
- Tests the model on a large dataset and verifies that the predictions are reasonable.
- i. K-fold cross validation AIC score:
![Patient Test Image](patient_aic_k_fold.png)
- ii. Bootstrap AIC score:
![Patient Test Image](patient_aic_bootstrap.png)
- iii. K-fold cross validation AIC distribution:
![Patient Test Image](patient_aic_dis_k_fold.png)
- iv. Bootstrap AIC distribution:
![Patient Test Image](patient_aic_dis_bootstrap.png)
- v. Bootstrap Output:
![Patient Test Image](patient_output.png)

Binary file added customer_aic_bootstrap.png
Binary file added customer_aic_dis_bootstrap.png
Binary file added customer_aic_distr_k_fold.png
Binary file added customer_aic_k_fold.png
101 changes: 101 additions & 0 deletions customer_dataset.csv
Customer_ID,Age,Annual_Income,Spending_Score,Store_Visits
Customer_1,56,21864,42,7
Customer_2,69,29498,99,10
Customer_3,46,59544,7,9
Customer_4,32,36399,16,16
Customer_5,60,57140,90,7
Customer_6,25,69554,60,9
Customer_7,38,53173,2,8
Customer_8,56,58955,1,14
Customer_9,36,36554,48,12
Customer_10,40,48320,12,8
Customer_11,28,72034,69,7
Customer_12,28,33141,37,5
Customer_13,41,64250,32,11
Customer_14,53,75897,9,4
Customer_15,57,56868,99,9
Customer_16,41,24735,19,8
Customer_17,20,54902,48,14
Customer_18,39,48783,80,10
Customer_19,19,57016,3,13
Customer_20,41,61041,20,18
Customer_21,61,38304,24,14
Customer_22,47,37341,54,9
Customer_23,55,47741,33,6
Customer_24,19,35516,24,9
Customer_25,38,52257,75,7
Customer_26,50,48298,72,4
Customer_27,29,89502,36,12
Customer_28,39,34623,38,15
Customer_29,61,38269,84,7
Customer_30,42,56359,99,7
Customer_31,66,63090,89,11
Customer_32,44,84308,99,11
Customer_33,59,74343,25,12
Customer_34,45,62355,93,11
Customer_35,33,54395,18,6
Customer_36,32,63449,82,5
Customer_37,64,40845,66,6
Customer_38,68,45257,54,8
Customer_39,61,27763,35,11
Customer_40,69,46567,80,7
Customer_41,20,64439,61,16
Customer_42,54,46854,41,10
Customer_43,68,38389,33,16
Customer_44,24,44603,68,11
Customer_45,38,60861,33,12
Customer_46,26,46163,14,6
Customer_47,56,62748,21,13
Customer_48,35,30330,48,13
Customer_49,21,36945,20,7
Customer_50,42,42400,8,7
Customer_51,31,30350,7,8
Customer_52,67,94154,67,9
Customer_53,26,33556,17,8
Customer_54,43,63723,33,10
Customer_55,19,40009,48,9
Customer_56,37,42293,76,11
Customer_57,45,54519,59,6
Customer_58,64,28122,86,11
Customer_59,24,40058,22,12
Customer_60,61,47802,30,7
Customer_61,25,37309,38,10
Customer_62,64,37662,51,9
Customer_63,52,66300,54,5
Customer_64,31,65074,8,9
Customer_65,34,43373,27,12
Customer_66,53,48737,27,10
Customer_67,67,68555,98,5
Customer_68,57,28602,21,10
Customer_69,21,55070,30,10
Customer_70,19,79618,97,14
Customer_71,23,79475,28,7
Customer_72,59,20901,64,15
Customer_73,21,38560,97,10
Customer_74,46,52529,69,14
Customer_75,35,30171,61,15
Customer_76,43,39976,48,8
Customer_77,61,47940,19,10
Customer_78,51,71019,4,12
Customer_79,27,49318,35,8
Customer_80,53,53254,64,7
Customer_81,31,57686,49,11
Customer_82,48,58152,17,12
Customer_83,65,50421,44,7
Customer_84,32,32043,92,10
Customer_85,25,61845,30,10
Customer_86,31,56472,93,10
Customer_87,40,33548,46,15
Customer_88,57,39765,6,12
Customer_89,38,63312,99,6
Customer_90,33,59082,37,12
Customer_91,62,39195,24,18
Customer_92,35,45950,93,9
Customer_93,64,47848,46,8
Customer_94,41,59255,53,11
Customer_95,43,68734,95,7
Customer_96,42,50984,99,9
Customer_97,62,67538,60,9
Customer_98,58,63099,97,8
Customer_99,46,53084,63,11
Customer_100,32,28347,85,6
Binary file added customer_output.png
Binary file added patient_aic_bootstrap.png
Binary file added patient_aic_dis_bootstrap.png
Binary file added patient_aic_dis_k_fold.png
Binary file added patient_aic_k_fold.png