6 changes: 3 additions & 3 deletions .gitignore
@@ -130,11 +130,11 @@ ENV/
env.bak/
venv.bak/

# Spyder project settings
# Spyder project params
.spyderproject
.spyproject

# Rope project settings
# Rope project params
.ropeproject

# mkdocs documentation
@@ -159,4 +159,4 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
#.idea/
8 changes: 8 additions & 0 deletions .idea/.gitignore


138 changes: 121 additions & 17 deletions README.md
@@ -1,29 +1,133 @@
# Project 2

### Implemented Model
* Generic k-fold cross-validation and bootstrapping model selection methods.

### Creator Description
- Name: Haeun Suh
- HawkID: A20542585
- Class: CS584-04 Machine Learning (Instructor: Steve Avsec)
- Email: hsuh7@hawk.iit.edu

#### [Question 1] Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
- For simple datasets, the cross-validation and bootstrapping selectors generally reached the same conclusions as simpler methods such as AIC.
- However, on more complex datasets, particularly those with random elements or multi-collinearity, the results were somewhat inconsistent.
- Below is a comparison of the average metric scores from cross-validation and bootstrapping with AIC scores for the same tests.

##### Size test:
- Model: Simple Linear Regression
- Metrics: R^2
- [K-fold] k = 5, shuffle = Yes
- [Bootstrapping] sampling size: 100, epochs: 100
- [Configuration file] params/test_size.json
<img alt="" src="./results/size_test_k_fold.png" width="600"/>
<img alt="" src="./results/size_test_bootstrapping.png" width="600"/>

- AIC tends to increase roughly in proportion to the dataset size, whereas the average metric scores from cross-validation and bootstrapping are comparatively irregular, with much gentler trend lines regardless of the dataset size.
- The k value applied to the model may have contributed to an underestimation of the dataset size; as a result, these methods do not reach the same conclusions as AIC.
- Below is another test to evaluate the impact of correlation.

##### Correlation test:
- Model: Simple Linear Regression
- Metrics: R^2
- [K-fold] k = 5, shuffle = Yes
- [Bootstrapping] sampling size: 100, epochs: 100
- [Configuration file] params/test_correlation.json
<img alt="" src="./results/correlation_test_k_fold.png" width="600">
<img alt="" src="./results/correlation_test_bootstrapping.png" width="600">

- Under multi-collinearity, the cross-validation and bootstrapping selectors showed clear trends. However, AIC did not exhibit a similarly strong trend and reported artificially high performance when the correlation coefficient was exactly 1.
- The two implemented selectors do not yield the same conclusions as AIC; in fact, the conclusions were often contradictory, in part because the scores run in opposite directions (lower is better for AIC, higher is better for R^2), as the sketch below illustrates.
- In datasets with multi-collinearity or heavily biased structures, cross-validation and bootstrapping model selectors may not align with simpler model selectors like AIC.
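
For reference, here is a minimal sketch of how an AIC-style score can sit next to an R^2 score for ordinary least squares. It uses the common Gaussian-likelihood form AIC = n*ln(RSS/n) + 2p; this is an assumption for illustration and may differ from the exact AIC convention used in the tests above.

```python
# Minimal sketch: an AIC-style score next to R^2 for ordinary least squares.
# Assumption: Gaussian-likelihood AIC = n*ln(RSS/n) + 2p; the AIC baseline
# used in the tests above may follow a different convention.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

rss = float(np.sum((y - pred) ** 2))
n, p = X.shape
aic = n * np.log(rss / n) + 2 * (p + 1)  # +1 for the intercept

print(f"R^2 = {r2_score(y, pred):.3f}  (higher is better)")
print(f"AIC = {aic:.1f}  (lower is better)")
```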

#### [Question 2] In what cases might the methods you've written fail or give incorrect or undesirable results?
- Based on the test results above, there is a high likelihood of incorrect conclusions when the test data is very large, exhibits multi-collinearity, or has a biased structure.
- In particular, in the multi-factor test shown below, performance fluctuations were most severe when multi-collinearity was present.

##### Multi-factor test:
- Model: Simple Linear Regression
- Metrics: R^2
- [K-fold] k = 5, shuffle = Yes
- [Bootstrapping] sampling size: 100, epochs: 100
- [Configuration file] params/test_multi.json
<img alt="multi-test_k_fold" src="./results/multi-test_k_fold.jpg" width="600">
<img alt="multi-test_bootstrapping" src="./results/multi-test_bootstrapping.jpg" width="600">

- When the data was predictable but noisy, the cross-validation and bootstrapping selectors performed better than AIC. However, in more complex scenarios, such as multi-collinearity or data bias, both exhibited unstable metric scores.
- Since both methods aggregate performance across folds or resamples, they may not be appropriate indicators for model selection if the data distribution is highly unstable.
- In summary, if the data distribution is overly complex and biased, the two implemented selectors may not be suitable for model selection.

#### [Question 3] What could you implement given more time to mitigate these cases or help users of your methods?
- To address these limitations, additional improvements could include:
> - Adding regularization techniques (e.g., Ridge, Lasso, ElasticNet) to handle multi-collinearity.
> - Implementing preprocessing methods, such as Principal Component Analysis (PCA), for datasets with high correlation coefficients.
> - Providing automated warnings or recommendations for datasets with bias, multi-collinearity, or extreme size imbalances (a minimal sketch of such a check follows this list).
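
As an illustration of the last point, below is a minimal sketch of what such an automated warning could look like. `warn_if_collinear` is a hypothetical helper that is not part of the current codebase, and the 0.95 threshold is an arbitrary choice.

```python
# Hypothetical helper (not in the current codebase): warn when the feature
# matrix shows strong pairwise correlation before running model selection.
import numpy as np

def warn_if_collinear(X, threshold=0.95):
    """Print a warning if any pair of features has |correlation| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)   # feature-by-feature correlation matrix
    upper = np.triu(np.abs(corr), k=1)    # keep only distinct pairs
    i, j = np.unravel_index(np.argmax(upper), upper.shape)
    if upper[i, j] >= threshold:
        print(f"Warning: features {i} and {j} are highly correlated "
              f"(|r| = {upper[i, j]:.2f}); CV/bootstrap scores may be unstable.")

# Example: two nearly identical columns trigger the warning.
rng = np.random.default_rng(0)
x0 = rng.normal(size=100)
X = np.column_stack([x0, x0 + rng.normal(scale=0.01, size=100), rng.normal(size=100)])
warn_if_collinear(X)
```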

#### [Question 4] What parameters have you exposed to your users in order to use your model selectors?
- For user convenience, all parameter settings, including data generation conditions, are included in:
> params/
- When running the main script test.py, the user specifies the desired settings by passing the path to one of the JSON configuration files in that folder.
- Regarding parameter settings, the specification is provided in 'params/param_example.txt'; a sample image is shown below.
<img alt="param_sample" src="./params/param_sample.jpg">

- However, the parameters related to the actual model are limited; the specification is as follows (an example configuration is sketched after this list).
>> - "test":
>> - "general":
>> - "model": "LinearRegression", # [Options] "LinearRegression" (default), "LogisticRegression".
>> - "metric": "MSE" # [Options] "MSE" (default), "Accuracy score", "R2".
>> - "k_fold_cross_validation":
>> - "k": [5], # Number of folds for k-fold cross-validation.
>> - "shuffle": true # Whether to shuffle the data before splitting into folds.
>> - "bootstrapping":
>> - "size": [50], # The size of the training dataset for each bootstrap sample.
>> - "epochs": [100] # The number of bootstrapping iterations to perform.

- For k-fold cross-validation, the direct parameters are as follows (a usage sketch follows this list).
>> - model: The statistical model to test.
>> - metric: The metric function to measure the model's performance.
>> - X: The feature matrix for training.
>> - y: The target labels for training.
>> - k: The number of folds to divide the data into.
>> - shuffle: Whether to shuffle the data before splitting into folds.
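
A minimal usage sketch is shown below. It assumes `k_fold_cross_validation` accepts these arguments by keyword and returns the metric scores; the exact signature and return value should be confirmed in modelSelection.py.

```python
# Sketch only: argument names follow the list above; confirm the exact
# signature and return value of k_fold_cross_validation in modelSelection.py.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from modelSelection import k_fold_cross_validation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

scores = k_fold_cross_validation(
    model=LinearRegression(),  # the statistical model to test
    metric=r2_score,           # metric(y_true, y_pred)
    X=X, y=y,
    k=5,                       # number of folds
    shuffle=True,              # shuffle before splitting
)
print(scores)
```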

- For the bootstrapping selector, the direct parameters are as follows (a usage sketch follows this list).
>> - model: The statistical model to test.
>> - metric: The metric function to measure the model's performance.
>> - X: The feature matrix for training.
>> - y: The target labels for training.
>> - s: The size of the training dataset for each bootstrap sample.
>> - epochs: The number of bootstrap iterations to perform.
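
Similarly, a minimal usage sketch for the bootstrapping selector, under the same assumptions about keyword arguments and return value.

```python
# Sketch only: argument names follow the list above; confirm the exact
# signature and return value of bootstrapping in modelSelection.py.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from modelSelection import bootstrapping

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

scores = bootstrapping(
    model=LinearRegression(),    # the statistical model to test
    metric=mean_squared_error,   # metric(y_true, y_pred)
    X=X, y=y,
    s=50,                        # training-set size per bootstrap sample
    epochs=100,                  # number of bootstrap iterations
)
print(scores)
```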

#### Additional Notes
Since each function and its execution are already described in code comments, only the points to keep in mind when running the project are explained here:
- You can refer to the guidelines in 'params/param_example.txt' or run one of several pre-written configuration files.
- The simplest and easiest-to-modify file is 'param_single.json'; use it to test and verify that the program runs.
- Visualization (including image file creation) is only enabled for some items and is only supported for 'generate'-type datasets.
- Since the purpose of this task is to implement model selection methods, the learning model being evaluated is a linear model from scikit-learn.
- The data folder contains sample files you can experiment with; the name, description, and source of each file are listed in dataSource.txt.

#### Sample execution
* To call the implemented selectors directly, import the relevant functions from modelSelection.py:
>> from modelSelection import k_fold_cross_validation, bootstrapping
1. Adjust a configuration file. Refer to 'params/param_example.txt' to set one up.
* Alternatively, choose one of the existing settings in the params folder; see param_list.csv for a brief introduction to those settings.
2. Run 'test.py' from the prompt (in the script's directory), passing the path to the configuration file. (The example below uses param_single.json.)
* Sample command:
>> python test.py ./params/param_single.json
* Sample execution image:
<br/>
<img alt="simple test" src="./results/execution_sample.jpg" width="600"/>
3. (Optional) Another run, executing the size test as a batch job.
* Sample command:
>> python test.py ./params/test_size.json
* Sample execution image:
<br/>
<img alt="size test" src="./results/size-test.jpg" width="600"/>
<br/>
* Image of the created result files:
<br/>
<img alt="file created" src="./results/size-test-file_created.jpg" width="600"/>
151 changes: 151 additions & 0 deletions data/IRIS.csv
@@ -0,0 +1,151 @@
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,0
4.9,3,1.4,0.2,0
4.7,3.2,1.3,0.2,0
4.6,3.1,1.5,0.2,0
5,3.6,1.4,0.2,0
5.4,3.9,1.7,0.4,0
4.6,3.4,1.4,0.3,0
5,3.4,1.5,0.2,0
4.4,2.9,1.4,0.2,0
4.9,3.1,1.5,0.1,0
5.4,3.7,1.5,0.2,0
4.8,3.4,1.6,0.2,0
4.8,3,1.4,0.1,0
4.3,3,1.1,0.1,0
5.8,4,1.2,0.2,0
5.7,4.4,1.5,0.4,0
5.4,3.9,1.3,0.4,0
5.1,3.5,1.4,0.3,0
5.7,3.8,1.7,0.3,0
5.1,3.8,1.5,0.3,0
5.4,3.4,1.7,0.2,0
5.1,3.7,1.5,0.4,0
4.6,3.6,1,0.2,0
5.1,3.3,1.7,0.5,0
4.8,3.4,1.9,0.2,0
5,3,1.6,0.2,0
5,3.4,1.6,0.4,0
5.2,3.5,1.5,0.2,0
5.2,3.4,1.4,0.2,0
4.7,3.2,1.6,0.2,0
4.8,3.1,1.6,0.2,0
5.4,3.4,1.5,0.4,0
5.2,4.1,1.5,0.1,0
5.5,4.2,1.4,0.2,0
4.9,3.1,1.5,0.1,0
5,3.2,1.2,0.2,0
5.5,3.5,1.3,0.2,0
4.9,3.1,1.5,0.1,0
4.4,3,1.3,0.2,0
5.1,3.4,1.5,0.2,0
5,3.5,1.3,0.3,0
4.5,2.3,1.3,0.3,0
4.4,3.2,1.3,0.2,0
5,3.5,1.6,0.6,0
5.1,3.8,1.9,0.4,0
4.8,3,1.4,0.3,0
5.1,3.8,1.6,0.2,0
4.6,3.2,1.4,0.2,0
5.3,3.7,1.5,0.2,0
5,3.3,1.4,0.2,0
7,3.2,4.7,1.4,1
6.4,3.2,4.5,1.5,1
6.9,3.1,4.9,1.5,1
5.5,2.3,4,1.3,1
6.5,2.8,4.6,1.5,1
5.7,2.8,4.5,1.3,1
6.3,3.3,4.7,1.6,1
4.9,2.4,3.3,1,1
6.6,2.9,4.6,1.3,1
5.2,2.7,3.9,1.4,1
5,2,3.5,1,1
5.9,3,4.2,1.5,1
6,2.2,4,1,1
6.1,2.9,4.7,1.4,1
5.6,2.9,3.6,1.3,1
6.7,3.1,4.4,1.4,1
5.6,3,4.5,1.5,1
5.8,2.7,4.1,1,1
6.2,2.2,4.5,1.5,1
5.6,2.5,3.9,1.1,1
5.9,3.2,4.8,1.8,1
6.1,2.8,4,1.3,1
6.3,2.5,4.9,1.5,1
6.1,2.8,4.7,1.2,1
6.4,2.9,4.3,1.3,1
6.6,3,4.4,1.4,1
6.8,2.8,4.8,1.4,1
6.7,3,5,1.7,1
6,2.9,4.5,1.5,1
5.7,2.6,3.5,1,1
5.5,2.4,3.8,1.1,1
5.5,2.4,3.7,1,1
5.8,2.7,3.9,1.2,1
6,2.7,5.1,1.6,1
5.4,3,4.5,1.5,1
6,3.4,4.5,1.6,1
6.7,3.1,4.7,1.5,1
6.3,2.3,4.4,1.3,1
5.6,3,4.1,1.3,1
5.5,2.5,4,1.3,1
5.5,2.6,4.4,1.2,1
6.1,3,4.6,1.4,1
5.8,2.6,4,1.2,1
5,2.3,3.3,1,1
5.6,2.7,4.2,1.3,1
5.7,3,4.2,1.2,1
5.7,2.9,4.2,1.3,1
6.2,2.9,4.3,1.3,1
5.1,2.5,3,1.1,1
5.7,2.8,4.1,1.3,1
6.3,3.3,6,2.5,2
5.8,2.7,5.1,1.9,2
7.1,3,5.9,2.1,2
6.3,2.9,5.6,1.8,2
6.5,3,5.8,2.2,2
7.6,3,6.6,2.1,2
4.9,2.5,4.5,1.7,2
7.3,2.9,6.3,1.8,2
6.7,2.5,5.8,1.8,2
7.2,3.6,6.1,2.5,2
6.5,3.2,5.1,2,2
6.4,2.7,5.3,1.9,2
6.8,3,5.5,2.1,2
5.7,2.5,5,2,2
5.8,2.8,5.1,2.4,2
6.4,3.2,5.3,2.3,2
6.5,3,5.5,1.8,2
7.7,3.8,6.7,2.2,2
7.7,2.6,6.9,2.3,2
6,2.2,5,1.5,2
6.9,3.2,5.7,2.3,2
5.6,2.8,4.9,2,2
7.7,2.8,6.7,2,2
6.3,2.7,4.9,1.8,2
6.7,3.3,5.7,2.1,2
7.2,3.2,6,1.8,2
6.2,2.8,4.8,1.8,2
6.1,3,4.9,1.8,2
6.4,2.8,5.6,2.1,2
7.2,3,5.8,1.6,2
7.4,2.8,6.1,1.9,2
7.9,3.8,6.4,2,2
6.4,2.8,5.6,2.2,2
6.3,2.8,5.1,1.5,2
6.1,2.6,5.6,1.4,2
7.7,3,6.1,2.3,2
6.3,3.4,5.6,2.4,2
6.4,3.1,5.5,1.8,2
6,3,4.8,1.8,2
6.9,3.1,5.4,2.1,2
6.7,3.1,5.6,2.4,2
6.9,3.1,5.1,2.3,2
5.8,2.7,5.1,1.9,2
6.8,3.2,5.9,2.3,2
6.7,3.3,5.7,2.5,2
6.7,3,5.2,2.3,2
6.3,2.5,5,1.9,2
6.5,3,5.2,2,2
6.2,3.4,5.4,2.3,2
5.9,3,5.1,1.8,2
Empty file added data/__init__.py
Empty file.
15 changes: 15 additions & 0 deletions data/dataSource.txt
@@ -0,0 +1,15 @@

[File #1] Boston Housing
[description] Concerns housing values in suburbs of Boston
[LINK] https://www.kaggle.com/datasets/schirmerchad/bostonhoustingmlnd

[File #2] Iris Flower Dataset
[description] Iris flower data set used for multi-class classification.

* Modified species as numerical values:
0: Iris-setosa 1: Iris-versicolor 2: Iris-virginica
[LINK] https://www.kaggle.com/datasets/arshid/iris-flower-dataset

[File #3] Red Wine Quality
[description] Simple and clean practice dataset for regression or classification modelling
[LINK] https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009