197 changes: 179 additions & 18 deletions README.md
# Machine Learning
## Project 2

# Group Members - Contribution
* Venkata Naga Lakshmi Sai Snigdha Sri Jata - A20560684 - 33.33%
* Sharan Rama Prakash Shenoy - A20560684 - 33.33%
* Adarsh Chidirala - A20561069 - 33.33%

## Usage Instructions

### Installation

To get started with this project, you need **Python 3.x**. Then follow these installation steps:

#### 1. Clone the Repository to your local machine:

```bash
git clone https://github.com/adarsh-chidirala/Project2_Adarsh_Ch_Group.git
```
#### 2. Steps to Run the Code on Mac

Follow these steps to set up and run the project:

1. **Create a Virtual Environment**:
- Navigate to your project directory and create a virtual environment using:
```bash
python3 -m venv myenv
```

2. **Activate the Virtual Environment**:
- Activate the created virtual environment by running:
```bash
source myenv/bin/activate
```

3. **Install Required Libraries**:
- Install the necessary Python libraries with the following command:
```bash
pip install numpy pandas matplotlib scikit-learn
```

4. **Run the Script**:
- Navigate to the directory containing your script and run it:
```bash
python project2.py
```

Make sure that the script `project2.py` and any required dataset files are correctly placed in your project directory.

#### 3. Steps to Run the Code on Windows

Follow these instructions to set up and execute the project on a Windows system:

1. **Create a Virtual Environment**:
- Open Command Prompt and navigate to your project directory:
```cmd
cd path\to\your\project\directory
```
- Create a virtual environment in your project directory by running:
```cmd
python -m venv myenv
```

2. **Activate the Virtual Environment**:
- Activate the virtual environment with the following command:
```cmd
myenv\Scripts\activate
```

3. **Install Required Libraries**:
- Install the necessary libraries by executing:
```cmd
pip install numpy pandas matplotlib scikit-learn
```

4. **Run the Script**:
- Make sure the script `project2.py` and any necessary dataset files are placed in your project directory. Run the script with:
```cmd
python project2.py
```
Ensure that all paths are correct and relevant files are located in the specified directories.

#### 4. Running Datasets:
- We use two datasets: `customer_dataset.csv` and `patient_data.csv`. For each one, the feature columns and the target column must be specified.
- This is done in the main function at the bottom of the code, as follows.
- For customer_dataset.csv
```python
data = pd.read_csv('customer_dataset.csv')

feature_columns = ['Age', 'Annual_Income', 'Spending_Score']
target_column = 'Store_Visits'
```
- For patient_data.csv
```python
data = pd.read_csv('patient_data.csv')

feature_columns = ['RR_Interval', 'QRS_Duration', 'QT_Interval']
target_column = 'Heart_Rate'
```
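Once `feature_columns` and `target_column` are set, the feature matrix and target vector can be extracted as below. This is a minimal sketch using a small in-memory stand-in for the CSV (the actual script reads the file from disk, and its variable names may differ):

```python
import pandas as pd

# Small in-memory stand-in for customer_dataset.csv (same column names).
data = pd.DataFrame({
    'Age': [56, 69, 46],
    'Annual_Income': [21864, 29498, 59544],
    'Spending_Score': [42, 99, 7],
    'Store_Visits': [7, 10, 9],
})

feature_columns = ['Age', 'Annual_Income', 'Spending_Score']
target_column = 'Store_Visits'

X = data[feature_columns].values  # (n_samples, n_features) feature matrix
y = data[target_column].values    # (n_samples,) target vector
```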

## K-Fold Cross-Validation and Bootstrapping Model Selection
## Introduction
- This project implements generic k-fold cross-validation and bootstrapping model selection methods, using AIC scores to evaluate model performance.
- These techniques evaluate and compare the performance of machine learning models on different datasets, providing insight into how well a model fits a given situation and how accurately it predicts outcomes.
- The implementation can be adapted to various models and allows customization to suit individual requirements and resources.

## Key Features Implemented
- Supports three regression models (Linear, Ridge, and Lasso) and computes the mean AIC score for each using the validation methods below.
- Implements both K-Fold Cross-Validation and Bootstrapping validation.
- **K-Fold Cross-Validation:** splits the dataset into k subsets, trains on k-1 of them, and tests on the remaining one, rotating through all k folds.
- **Bootstrapping:** evaluates performance by repeatedly resampling the dataset with replacement.
- The project also plots the K-fold cross-validation and Bootstrapping AIC scores and their distributions.
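The two validation schemes above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's exact code; the AIC formula assumes Gaussian errors, and the out-of-bag evaluation in the bootstrap loop is one common choice among several:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def gaussian_aic(y_true, y_pred, n_params):
    # AIC = n * ln(RSS / n) + 2k, assuming Gaussian errors.
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)
    return n * np.log(rss / n) + 2 * n_params

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

# --- k-fold cross-validation: train on k-1 folds, test on the remaining one ---
k = 5
indices = np.arange(len(y))
folds = np.array_split(indices, k)
kfold_aics = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    kfold_aics.append(gaussian_aic(y[test_idx], pred, n_params=X.shape[1] + 1))

# --- bootstrapping: resample with replacement, evaluate on out-of-bag rows ---
boot_aics = []
for _ in range(20):
    boot_idx = rng.integers(0, len(y), size=len(y))
    oob_idx = np.setdiff1d(indices, boot_idx)  # rows not drawn into the resample
    model = LinearRegression().fit(X[boot_idx], y[boot_idx])
    pred = model.predict(X[oob_idx])
    boot_aics.append(gaussian_aic(y[oob_idx], pred, n_params=X.shape[1] + 1))

print(np.mean(kfold_aics), np.mean(boot_aics))
```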


### 1. Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
In simple cases like linear regression, where the relationship between the dependent and independent variables is linear, cross-validation, bootstrapping, and AIC will often agree on the selected model.
However, this is not guaranteed: it depends on the dataset and on the complexity of the candidate models.

Note that cross-validation and bootstrapping estimate out-of-sample predictive performance, while AIC trades in-sample fit against model complexity, so the two kinds of criteria can disagree.

To determine whether cross-validation and bootstrapping agree with AIC in a specific case, it is best to run the experiment and compare the results.
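One quick way to run such a comparison: fit a linear model with and without an irrelevant extra feature, then check whether the AIC ranking matches the cross-validated MSE ranking. A sketch on synthetic data (the `aic` helper and variable names are illustrative, not from `project2.py`; with a pure noise feature the two criteria usually, but not always, agree):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 200
x_true = rng.normal(size=(n, 2))
noise_col = rng.normal(size=(n, 1))  # irrelevant feature
y = x_true @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)

def aic(X, y):
    # Gaussian AIC on the full-data fit: n * ln(RSS / n) + 2k.
    model = LinearRegression().fit(X, y)
    rss = np.sum((y - model.predict(X)) ** 2)
    k = X.shape[1] + 1  # coefficients + intercept
    return n * np.log(rss / n) + 2 * k

X_small = x_true
X_big = np.hstack([x_true, noise_col])

aic_small, aic_big = aic(X_small, y), aic(X_big, y)

# 5-fold cross-validated MSE for the same two candidate models.
mse_small = -cross_val_score(LinearRegression(), X_small, y,
                             scoring='neg_mean_squared_error', cv=5).mean()
mse_big = -cross_val_score(LinearRegression(), X_big, y,
                           scoring='neg_mean_squared_error', cv=5).mean()
print('AIC prefers small model:', aic_small < aic_big)
print('CV prefers small model: ', mse_small <= mse_big)
```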

### 2. In what cases might the methods you've written fail or give incorrect or undesirable results?
Failures or undesirable results may occur under these conditions:
- **Small datasets:** k-fold cross-validation and bootstrapping can give unreliable AIC scores when the dataset is too small.
- **Imbalanced data:** models trained on imbalanced datasets may make biased predictions, which distorts the AIC calculations.
- **Correlated predictors:** strong correlations between predictors can lead to unstable coefficient estimates, affecting AIC evaluations.
- **Wrong model assumptions:** AIC relies on specific model assumptions (e.g., linear regression assumes Gaussian errors). If these assumptions are violated, AIC may not reflect the true model fit.
- **Overfitting with bootstrapping:** repeated sampling with replacement can bias bootstrapped evaluations toward overly complex models.

### 3. What could you implement given more time to mitigate these cases or help users of your methods?

1. **Regularization during cross-validation:** Ridge and Lasso penalties reduce overfitting by penalizing model complexity; Ridge adds an L2 penalty, Lasso an L1 penalty. `RidgeCV` and `LassoCV` from scikit-learn perform cross-validation to find the best regularization parameter (alpha) for Ridge and Lasso regression.

2. **Stratified sampling for k-fold cross-validation:** stratified k-fold ensures that each fold has the same proportion of class labels as the original dataset, which helps with imbalanced data. Balanced bootstrapping, in which each class is represented equally in every resample, would help in the same way.

3. **Additional criteria besides AIC:** AIC is a common model-selection metric, but BIC (Bayesian Information Criterion) and deviance can provide additional insight.

4. **Validation curves:** plotting performance for different hyperparameter values (e.g., alpha in Ridge/Lasso) helps visualize how performance changes and select a good value.

5. **Resampling summaries:** summarizing the results of bootstrap and k-fold iterations helps users understand the variability and stability of the model's performance.

6. **Configurable metrics and sampling:** letting users set evaluation metrics and sampling methods allows them to customize the selectors to their specific needs.
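As a sketch of the first mitigation, `RidgeCV` and `LassoCV` choose the penalty strength by internal cross-validation. The synthetic data below is purely illustrative (two of the four true coefficients are zero, so Lasso has something to shrink away):

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.0]) + rng.normal(scale=0.2, size=150)

# Each estimator searches its alpha grid via cross-validation and
# stores the winning penalty strength in the alpha_ attribute.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
lasso = LassoCV(alphas=np.logspace(-3, 1, 20), cv=5).fit(X, y)

print('ridge alpha:', ridge.alpha_)
print('lasso alpha:', lasso.alpha_)
print('lasso zeroed coefficients:', int(np.sum(lasso.coef_ == 0)))
```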

### 4. What parameters have you exposed to your users in order to use your model selectors?
1. **Cross-Validation:**
   - `k`: number of folds to split the data into.
   - `model_type`: type of model (e.g., 'linear', 'ridge', 'lasso', 'logistic').
2. **Bootstrapping:**
   - `n_bootstraps`: number of random resamples to draw.
   - `model_type`: type of model (e.g., 'linear', 'ridge', 'lasso', 'logistic').
3. **General Settings:**
   - `X_columns` (features) and `y_column` (target): columns to use from any dataset.


### Code Visualization:
- The following screenshots display the results of each test case implemented in this project:

### 1. customer_dataset.csv:
- Tests the model on a small dataset and verifies that the predictions are reasonable.
- i. K-fold cross validation AIC score:
![Customer Test Image](customer_aic_k_fold.png)
- ii. Bootstrap AIC score:
![Customer Test Image](customer_aic_bootstrap.png)
- iii. K-fold cross validation AIC distribution:
![Customer Test Image](customer_aic_distr_k_fold.png)
- iv. Bootstrap AIC distribution:
![Customer Test Image](customer_aic_dis_bootstrap.png)
- v. Bootstrap Output:
![Customer Test Image](customer_output.png)

### 2. patient_data.csv:
- Tests the model on a large dataset and verifies that the predictions are reasonable.
- i. K-fold cross validation AIC score:
![Patient Test Image](patient_aic_k_fold.png)
- ii. Bootstrap AIC score:
![Patient Test Image](patient_aic_bootstrap.png)
- iii. K-fold cross validation AIC distribution:
![Patient Test Image](patient_aic_dis_k_fold.png)
- iv. Bootstrap AIC distribution:
![Patient Test Image](patient_aic_dis_bootstrap.png)
- v. Bootstrap Output:
![Patient Test Image](patient_output.png)

Binary file added customer_aic_bootstrap.png
Binary file added customer_aic_dis_bootstrap.png
Binary file added customer_aic_distr_k_fold.png
Binary file added customer_aic_k_fold.png
101 changes: 101 additions & 0 deletions customer_dataset.csv
Customer_ID,Age,Annual_Income,Spending_Score,Store_Visits
Customer_1,56,21864,42,7
Customer_2,69,29498,99,10
Customer_3,46,59544,7,9
Customer_4,32,36399,16,16
Customer_5,60,57140,90,7
Customer_6,25,69554,60,9
Customer_7,38,53173,2,8
Customer_8,56,58955,1,14
Customer_9,36,36554,48,12
Customer_10,40,48320,12,8
Customer_11,28,72034,69,7
Customer_12,28,33141,37,5
Customer_13,41,64250,32,11
Customer_14,53,75897,9,4
Customer_15,57,56868,99,9
Customer_16,41,24735,19,8
Customer_17,20,54902,48,14
Customer_18,39,48783,80,10
Customer_19,19,57016,3,13
Customer_20,41,61041,20,18
Customer_21,61,38304,24,14
Customer_22,47,37341,54,9
Customer_23,55,47741,33,6
Customer_24,19,35516,24,9
Customer_25,38,52257,75,7
Customer_26,50,48298,72,4
Customer_27,29,89502,36,12
Customer_28,39,34623,38,15
Customer_29,61,38269,84,7
Customer_30,42,56359,99,7
Customer_31,66,63090,89,11
Customer_32,44,84308,99,11
Customer_33,59,74343,25,12
Customer_34,45,62355,93,11
Customer_35,33,54395,18,6
Customer_36,32,63449,82,5
Customer_37,64,40845,66,6
Customer_38,68,45257,54,8
Customer_39,61,27763,35,11
Customer_40,69,46567,80,7
Customer_41,20,64439,61,16
Customer_42,54,46854,41,10
Customer_43,68,38389,33,16
Customer_44,24,44603,68,11
Customer_45,38,60861,33,12
Customer_46,26,46163,14,6
Customer_47,56,62748,21,13
Customer_48,35,30330,48,13
Customer_49,21,36945,20,7
Customer_50,42,42400,8,7
Customer_51,31,30350,7,8
Customer_52,67,94154,67,9
Customer_53,26,33556,17,8
Customer_54,43,63723,33,10
Customer_55,19,40009,48,9
Customer_56,37,42293,76,11
Customer_57,45,54519,59,6
Customer_58,64,28122,86,11
Customer_59,24,40058,22,12
Customer_60,61,47802,30,7
Customer_61,25,37309,38,10
Customer_62,64,37662,51,9
Customer_63,52,66300,54,5
Customer_64,31,65074,8,9
Customer_65,34,43373,27,12
Customer_66,53,48737,27,10
Customer_67,67,68555,98,5
Customer_68,57,28602,21,10
Customer_69,21,55070,30,10
Customer_70,19,79618,97,14
Customer_71,23,79475,28,7
Customer_72,59,20901,64,15
Customer_73,21,38560,97,10
Customer_74,46,52529,69,14
Customer_75,35,30171,61,15
Customer_76,43,39976,48,8
Customer_77,61,47940,19,10
Customer_78,51,71019,4,12
Customer_79,27,49318,35,8
Customer_80,53,53254,64,7
Customer_81,31,57686,49,11
Customer_82,48,58152,17,12
Customer_83,65,50421,44,7
Customer_84,32,32043,92,10
Customer_85,25,61845,30,10
Customer_86,31,56472,93,10
Customer_87,40,33548,46,15
Customer_88,57,39765,6,12
Customer_89,38,63312,99,6
Customer_90,33,59082,37,12
Customer_91,62,39195,24,18
Customer_92,35,45950,93,9
Customer_93,64,47848,46,8
Customer_94,41,59255,53,11
Customer_95,43,68734,95,7
Customer_96,42,50984,99,9
Customer_97,62,67538,60,9
Customer_98,58,63099,97,8
Customer_99,46,53084,63,11
Customer_100,32,28347,85,6
Binary file added customer_output.png
Binary file added patient_aic_bootstrap.png
Binary file added patient_aic_dis_bootstrap.png
Binary file added patient_aic_dis_k_fold.png
Binary file added patient_aic_k_fold.png