6 changes: 3 additions & 3 deletions .gitignore
@@ -130,11 +130,11 @@ ENV/
env.bak/
venv.bak/

# Spyder project settings
# Spyder project params
.spyderproject
.spyproject

# Rope project settings
# Rope project params
.ropeproject

# mkdocs documentation
@@ -159,4 +159,4 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
#.idea/
8 changes: 8 additions & 0 deletions .idea/.gitignore


138 changes: 121 additions & 17 deletions README.md
@@ -1,29 +1,133 @@
# Project 2

### Implemented Model
* Generic k-fold cross-validation and bootstrapping model selection methods.

### Creator Description
- Name: Haeun Suh
- HawkID: A20542585
- Class: CS584-04 Machine Learning (Instructor: Steve Avsec)
- Email: hsuh7@hawk.iit.edu

#### [Question 1] Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
- For simple datasets, the cross-validation and bootstrapping selectors generally reached the same conclusions as simpler methods such as AIC.
- However, on more complex datasets, particularly those with random elements or multi-collinearity, the results were somewhat inconsistent.
- Below is a comparison of the average metric scores from cross-validation and bootstrapping with AIC scores for the same tests.

##### Size test:
- Model: Simple Linear Regression
- Metrics: R^2
- [K-fold] k = 5, shuffle = Yes
- [Bootstrapping] sampling size: 100, epochs: 100
- [Configuration file] params/test_size.json
<img alt="" src="./results/size_test_k_fold.png" width="600"/>
<img alt="" src="./results/size_test_bootstrapping.png" width="600"/>

- AIC tends to increase roughly in proportion to the dataset size, whereas the average metric scores from cross-validation and bootstrapping are comparatively irregular, with much gentler trend lines regardless of the dataset size.
- The k value applied to the model may have contributed to an underestimation of the dataset size; as a result, these methods do not reach the same conclusions as AIC.
- Below is another test to evaluate the impact of correlation.

##### Correlation test:
- Model: Simple Linear Regression
- Metrics: R^2
- [K-fold] k = 5, shuffle = Yes
- [Bootstrapping] sampling size: 100, epochs: 100
- [Configuration file] params/test_correlation.json
<img alt="" src="./results/correlation_test_k_fold.png" width="600">
<img alt="" src="./results/correlation_test_bootstrapping.png" width="600">

- Under multi-collinearity, the cross-validation and bootstrapping selectors showed clear trends. However, AIC did not exhibit a similarly strong trend and reported artificially high performance when the correlation coefficient was exactly 1.
- The two implemented selectors do not yield the same conclusions as AIC; in fact, the conclusions were often contradictory, in part because the scores run in opposite directions (lower is better for AIC, higher is better for R^2), as the sketch below illustrates.
- In datasets with multi-collinearity or heavily biased structures, cross-validation and bootstrapping model selectors may not align with simpler model selectors like AIC.
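
For reference, here is a minimal sketch of how an AIC-style score can sit next to an R^2 score for ordinary least squares. It uses the common Gaussian-likelihood form AIC = n*ln(RSS/n) + 2p; this is an assumption for illustration and may differ from the exact AIC convention used in the tests above.

```python
# Minimal sketch: an AIC-style score next to R^2 for ordinary least squares.
# Assumption: Gaussian-likelihood AIC = n*ln(RSS/n) + 2p; the AIC baseline
# used in the tests above may follow a different convention.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

rss = float(np.sum((y - pred) ** 2))
n, p = X.shape
aic = n * np.log(rss / n) + 2 * (p + 1)  # +1 for the intercept

print(f"R^2 = {r2_score(y, pred):.3f}  (higher is better)")
print(f"AIC = {aic:.1f}  (lower is better)")
```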

#### [Question 2] In what cases might the methods you've written fail or give incorrect or undesirable results?
- Based on the test results above, there is a high likelihood of incorrect conclusions when the test data is very large, exhibits multi-collinearity, or has a biased structure.
- In particular, in the multi-factor test shown below, performance fluctuations were most severe when multi-collinearity was present.

##### Multi-factor test:
- Model: Simple Linear Regression
- Metrics: R^2
- [K-fold] k = 5, shuffle = Yes
- [Bootstrapping] sampling size: 100, epochs: 100
- [Configuration file] params/test_multi.json
<img alt="multi-test_k_fold" src="./results/multi-test_k_fold.jpg" width="600">
<img alt="multi-test_bootstrapping" src="./results/multi-test_bootstrapping.jpg" width="600">

- When the data was predictable but noisy, the cross-validation and bootstrapping selectors performed better than AIC. However, in more complex scenarios, such as multi-collinearity or data bias, both exhibited unstable metric scores.
- Since both methods aggregate performance across folds or resamples, they may not be appropriate indicators for model selection if the data distribution is highly unstable.
- In summary, if the data distribution is overly complex and biased, the two implemented selectors may not be suitable for model selection.

#### [Question 3] What could you implement given more time to mitigate these cases or help users of your methods?
- To address these limitations, additional improvements could include:
> - Adding regularization techniques (e.g., Ridge, Lasso, ElasticNet) to handle multi-collinearity.
> - Implementing preprocessing methods, such as Principal Component Analysis (PCA), for datasets with high correlation coefficients.
> - Providing automated warnings or recommendations for datasets with bias, multi-collinearity, or extreme size imbalances (a minimal sketch of such a check follows this list).
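
As an illustration of the last point, below is a minimal sketch of what such an automated warning could look like. `warn_if_collinear` is a hypothetical helper that is not part of the current codebase, and the 0.95 threshold is an arbitrary choice.

```python
# Hypothetical helper (not in the current codebase): warn when the feature
# matrix shows strong pairwise correlation before running model selection.
import numpy as np

def warn_if_collinear(X, threshold=0.95):
    """Print a warning if any pair of features has |correlation| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlation matrix
    upper = np.triu(np.abs(corr), k=1)    # keep only distinct pairs
    i, j = np.unravel_index(np.argmax(upper), upper.shape)
    if upper[i, j] >= threshold:
        print(f"Warning: features {i} and {j} are highly correlated "
              f"(|r| = {upper[i, j]:.2f}); CV/bootstrap scores may be unstable.")

# Example: two nearly identical columns trigger the warning.
rng = np.random.default_rng(0)
x0 = rng.normal(size=100)
X = np.column_stack([x0, x0 + rng.normal(scale=0.01, size=100), rng.normal(size=100)])
warn_if_collinear(X)
```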

#### [Question 4] What parameters have you exposed to your users in order to use your model selectors?
- For user convenience, all parameter settings, including data generation conditions, are included in:
> params/
- When running the main script test.py, the user specifies the desired settings by passing the path to one of the JSON configuration files in that folder.
- Regarding parameter settings, the specification is provided in 'params/param_example.txt'; a sample image is shown below.
<img alt="param_sample" src="./params/param_sample.jpg">

- However, the parameters related to the actual model are limited; the specification is as follows (an example configuration is sketched after this list).
>> - "test":
>> - "general":
>> - "model": "LinearRegression", # [Options] "LinearRegression" (default), "LogisticRegression".
>> - "metric": "MSE" # [Options] "MSE" (default), "Accuracy score", "R2".
>> - "k_fold_cross_validation":
>> - "k": [5], # Number of folds for k-fold cross-validation.
>> - "shuffle": true # Whether to shuffle the data before splitting into folds.
>> - "bootstrapping":
>> - "size": [50], # The size of the training dataset for each bootstrap sample.
>> - "epochs": [100] # The number of bootstrapping iterations to perform.

- For k-fold cross-validation, the direct parameters are as follows (a usage sketch follows this list).
>> - model: The statistical model to test.
>> - metric: The metric function to measure the model's performance.
>> - X: The feature matrix for training.
>> - y: The target labels for training.
>> - k: The number of folds to divide the data into.
>> - shuffle: Whether to shuffle the data before splitting into folds.
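
A minimal usage sketch is shown below. It assumes `k_fold_cross_validation` accepts these arguments by keyword and returns the metric scores; the exact signature and return value should be confirmed in modelSelection.py.

```python
# Sketch only: argument names follow the list above; confirm the exact
# signature and return value of k_fold_cross_validation in modelSelection.py.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from modelSelection import k_fold_cross_validation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

scores = k_fold_cross_validation(
    model=LinearRegression(),  # the statistical model to test
    metric=r2_score,           # metric(y_true, y_pred)
    X=X, y=y,
    k=5,                       # number of folds
    shuffle=True,              # shuffle before splitting
)
print(scores)
```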

- For the bootstrapping selector, the direct parameters are as follows (a usage sketch follows this list).
>> - model: The statistical model to test.
>> - metric: The metric function to measure the model's performance.
>> - X: The feature matrix for training.
>> - y: The target labels for training.
>> - s: The size of the training dataset for each bootstrap sample.
>> - epochs: The number of bootstrap iterations to perform.
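
Similarly, a minimal usage sketch for the bootstrapping selector, under the same assumptions about keyword arguments and return value.

```python
# Sketch only: argument names follow the list above; confirm the exact
# signature and return value of bootstrapping in modelSelection.py.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from modelSelection import bootstrapping

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

scores = bootstrapping(
    model=LinearRegression(),    # the statistical model to test
    metric=mean_squared_error,   # metric(y_true, y_pred)
    X=X, y=y,
    s=50,                        # training-set size per bootstrap sample
    epochs=100,                  # number of bootstrap iterations
)
print(scores)
```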

#### Additional Notes
Since each function and its execution are already described in code comments, only the points to keep in mind when running the project are explained here:
- You can refer to the guidelines in 'params/param_example.txt' or run one of several pre-written configuration files.
- The simplest and easiest-to-modify file is 'param_single.json'; use it to test and verify that the program runs.
- Visualization (including image file creation) is only enabled for some items and is only supported for 'generate'-type datasets.
- Since the purpose of this task is to implement model selection methods, the learning model being evaluated is a linear model from scikit-learn.
- The data folder contains sample files you can experiment with; the name, description, and source of each file are listed in dataSource.txt.

#### Sample execution
* To call the implemented selectors directly, import the relevant functions from modelSelection.py:
>> from modelSelection import k_fold_cross_validation, bootstrapping
1. Adjust a configuration file. Refer to 'params/param_example.txt' to set one up.
* Alternatively, choose one of the existing settings in the params folder; see param_list.csv for a brief introduction to those settings.
2. Run 'test.py' from the prompt (in the script's directory), passing the path to the configuration file. (The example below uses param_single.json.)
* Sample command:
>> python test.py ./params/param_single.json
* Sample execution image:
<br/>
<img alt="simple test" src="./results/execution_sample.jpg" width="600"/>
3. (Optional) Another run, executing the size test as a batch job.
* Sample command:
>> python test.py ./params/test_size.json
* Sample execution image:
<br/>
<img alt="size test" src="./results/size-test.jpg" width="600"/>
<br/>
* Image of the created result files:
<br/>
<img alt="file created" src="./results/size-test-file_created.jpg" width="600"/>
151 changes: 151 additions & 0 deletions data/IRIS.csv
@@ -0,0 +1,151 @@
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,0
4.9,3,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5,3.6,1.4,0.2,0
5.4,3.9,1.7,0.4,0
4.6,3.4,1.4,0.3,0
5,3.4,1.5,0.2,0
4.4,2.9,1.4,0.2,0
4.9,3.1,1.5,0.1,0
5.4,3.7,1.5,0.2,0
4.8,3.4,1.6,0.2,0
4.8,3,1.4,0.1,0
4.3,3,1.1,0.1,0
5.8,4,1.2,0.2,0
5.7,4.4,1.5,0.4,0
5.4,3.9,1.3,0.4,0
5.1,3.5,1.4,0.3,0
5.7,3.8,1.7,0.3,0
5.1,3.8,1.5,0.3,0
5.4,3.4,1.7,0.2,0
5.1,3.7,1.5,0.4,0
4.6,3.6,1,0.2,0
5.1,3.3,1.7,0.5,0
4.8,3.4,1.9,0.2,0
5,3,1.6,0.2,0
5,3.4,1.6,0.4,0
5.2,3.5,1.5,0.2,0
5.2,3.4,1.4,0.2,0
4.7,3.2,1.6,0.2,0
4.8,3.1,1.6,0.2,0
5.4,3.4,1.5,0.4,0
5.2,4.1,1.5,0.1,0
5.5,4.2,1.4,0.2,0
4.9,3.1,1.5,0.1,0
5,3.2,1.2,0.2,0
5.5,3.5,1.3,0.2,0
4.9,3.1,1.5,0.1,0
4.4,3,1.3,0.2,0
5.1,3.4,1.5,0.2,0
5,3.5,1.3,0.3,0
4.5,2.3,1.3,0.3,0
4.4,3.2,1.3,0.2,0
5,3.5,1.6,0.6,0
5.1,3.8,1.9,0.4,0
4.8,3,1.4,0.3,0
5.1,3.8,1.6,0.2,0
4.6,3.2,1.4,0.2,0
5.3,3.7,1.5,0.2,0
5,3.3,1.4,0.2,0
7,3.2,4.7,1.4,1
6.4,3.2,4.5,1.5,1
6.9,3.1,4.9,1.5,1
5.5,2.3,4,1.3,1
6.5,2.8,4.6,1.5,1
5.7,2.8,4.5,1.3,1
6.3,3.3,4.7,1.6,1
4.9,2.4,3.3,1,1
6.6,2.9,4.6,1.3,1
5.2,2.7,3.9,1.4,1
5,2,3.5,1,1
5.9,3,4.2,1.5,1
6,2.2,4,1,1
6.1,2.9,4.7,1.4,1
5.6,2.9,3.6,1.3,1
6.7,3.1,4.4,1.4,1
5.6,3,4.5,1.5,1
5.8,2.7,4.1,1,1
6.2,2.2,4.5,1.5,1
5.6,2.5,3.9,1.1,1
5.9,3.2,4.8,1.8,1
6.1,2.8,4,1.3,1
6.3,2.5,4.9,1.5,1
6.1,2.8,4.7,1.2,1
6.4,2.9,4.3,1.3,1
6.6,3,4.4,1.4,1
6.8,2.8,4.8,1.4,1
6.7,3,5,1.7,1
6,2.9,4.5,1.5,1
5.7,2.6,3.5,1,1
5.5,2.4,3.8,1.1,1
5.5,2.4,3.7,1,1
5.8,2.7,3.9,1.2,1
6,2.7,5.1,1.6,1
5.4,3,4.5,1.5,1
6,3.4,4.5,1.6,1
6.7,3.1,4.7,1.5,1
6.3,2.3,4.4,1.3,1
5.6,3,4.1,1.3,1
5.5,2.5,4,1.3,1
5.5,2.6,4.4,1.2,1
6.1,3,4.6,1.4,1
5.8,2.6,4,1.2,1
5,2.3,3.3,1,1
5.6,2.7,4.2,1.3,1
5.7,3,4.2,1.2,1
5.7,2.9,4.2,1.3,1
6.2,2.9,4.3,1.3,1
5.1,2.5,3,1.1,1
5.7,2.8,4.1,1.3,1
6.3,3.3,6,2.5,2
5.8,2.7,5.1,1.9,2
7.1,3,5.9,2.1,2
6.3,2.9,5.6,1.8,2
6.5,3,5.8,2.2,2
7.6,3,6.6,2.1,2
4.9,2.5,4.5,1.7,2
7.3,2.9,6.3,1.8,2
6.7,2.5,5.8,1.8,2
7.2,3.6,6.1,2.5,2
6.5,3.2,5.1,2,2
6.4,2.7,5.3,1.9,2
6.8,3,5.5,2.1,2
5.7,2.5,5,2,2
5.8,2.8,5.1,2.4,2
6.4,3.2,5.3,2.3,2
6.5,3,5.5,1.8,2
7.7,3.8,6.7,2.2,2
7.7,2.6,6.9,2.3,2
6,2.2,5,1.5,2
6.9,3.2,5.7,2.3,2
5.6,2.8,4.9,2,2
7.7,2.8,6.7,2,2
6.3,2.7,4.9,1.8,2
6.7,3.3,5.7,2.1,2
7.2,3.2,6,1.8,2
6.2,2.8,4.8,1.8,2
6.1,3,4.9,1.8,2
6.4,2.8,5.6,2.1,2
7.2,3,5.8,1.6,2
7.4,2.8,6.1,1.9,2
7.9,3.8,6.4,2,2
6.4,2.8,5.6,2.2,2
6.3,2.8,5.1,1.5,2
6.1,2.6,5.6,1.4,2
7.7,3,6.1,2.3,2
6.3,3.4,5.6,2.4,2
6.4,3.1,5.5,1.8,2
6,3,4.8,1.8,2
6.9,3.1,5.4,2.1,2
6.7,3.1,5.6,2.4,2
6.9,3.1,5.1,2.3,2
5.8,2.7,5.1,1.9,2
6.8,3.2,5.9,2.3,2
6.7,3.3,5.7,2.5,2
6.7,3,5.2,2.3,2
6.3,2.5,5,1.9,2
6.5,3,5.2,2,2
6.2,3.4,5.4,2.3,2
5.9,3,5.1,1.8,2
Empty file added data/__init__.py
Empty file.
15 changes: 15 additions & 0 deletions data/dataSource.txt
@@ -0,0 +1,15 @@

[File #1] Boston Housing
[description] Concerns housing values in suburbs of Boston
[LINK] https://www.kaggle.com/datasets/schirmerchad/bostonhoustingmlnd

[File #2] Iris Flower Dataset
[description] Iris flower data set used for multi-class classification.

* Modified species as numerical values:
0: Iris-setosa 1: Iris-versicolor 2: Iris-virginica
[LINK] https://www.kaggle.com/datasets/arshid/iris-flower-dataset

[File #3] Red Wine Quality
[description] Simple and clean practice dataset for regression or classification modelling
[LINK] https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009