diff --git a/README.md b/README.md
index f746e56..7830efb 100644
--- a/README.md
+++ b/README.md
@@ -1,29 +1,120 @@
-# Project 2
-Select one of the following two options:
-## Boosting Trees
-Implement the gradient-boosting tree algorithm (with the usual fit-predict interface) as described in Sections 10.9-10.10 of Elements of Statistical Learning (2nd Edition). Answer the questions below as you did for Project 1.
-Put your README below. Answer the following questions.
-* What does the model you have implemented do and when should it be used?
-* How did you test your model to determine if it is working reasonably correctly?
-* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
-* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
-## Model Selection
-Implement generic k-fold cross-validation and bootstrapping model selection methods.
-In your README, answer the following questions:
-* Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
-* In what cases might the methods you've written fail or give incorrect or undesirable results?
-* What could you implement given more time to mitigate these cases or help users of your methods?
-* What parameters have you exposed to your users in order to use your model selectors.
-See sections 7.10-7.11 of Elements of Statistical Learning and the lecture notes. Pay particular attention to Section 7.10.2.
-As usual, above-and-beyond efforts will be considered for bonus points.
+> ZIRUI OU A20516756
+
+## Model Selection
+
+### Program Design
+
+![Program design](images/image.png)
+
+### Results Overview
+
+| Metric                      | **Linear Regression** | **Ridge Regression** |
+|-----------------------------|-----------------------|----------------------|
+| Coefficients                | [0.2151, 0.5402]      | [0.2411, 0.4849]     |
+| K-Fold MSE                  | 0.8418                | 0.8471               |
+| Bootstrapping MSE           | 0.8219                | 0.8214               |
+| AIC                         | -17.495               | -17.461              |
+| SSR                         | 80.6585               | 80.6853              |
+| Verification (Coefficients) | True                  | True                 |
+| Verification (AIC)          | True                  | True                 |
+
+![MSE and AIC comparison](images/Figure_1.png)
+
+### How to Run the Code
+
+1. **Clone the Repository:**
+   ```bash
+   git clone https://github.com/Jerry-zirui/Project2.git
+   cd Project2
+   ```
+
+2. **Install Required Packages:**
+   ```bash
+   pip install numpy
+   pip install matplotlib  # optional, only needed for the plots
+   ```
+
+3. **Generate Synthetic Data**
+   ```python
+   import numpy as np
+
+   # Set random seed for reproducibility
+   np.random.seed(42)
+
+   # Generate synthetic data
+   n = 100                                  # Number of samples
+   p = 1                                    # Number of features
+   X = np.random.rand(n, p)                 # Feature matrix
+   beta_true = np.array([1])                # True coefficients
+   y = X @ beta_true + np.random.randn(n)   # Target variable with noise
+   ```
+
+4. **Define Models**
+   ```python
+   # linear_regression and ridge_regression are defined in model_selection.py
+   models = {
+       "Linear Regression": {"fn": linear_regression},
+       "Ridge Regression": {"fn": ridge_regression, "params": {"alpha": 1.0}},
+   }
+   ```
+
+5. **Evaluate Models**
+   ```python
+   # Score each model with k-fold CV, bootstrapping, and AIC
+   results = evaluate_models(models, X, y, k=5, B=1000)
+   ```
+
+6. **Visualize Results**
+   ```python
+   # Plot the MSE and AIC comparisons (requires matplotlib)
+   results = visualize(evaluate_models, X, y, models, results)
+   ```
+
+7. **Run the Model Selection Script:**
+   ```bash
+   python model_selection.py
+   ```
+
+### Q&A
+
+#### Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
+
+- For the Linear Regression and Ridge Regression models in this project, the cross-validation and bootstrapping model selectors generally agree with the AIC model selector.
+- The cross-validation MSE and the AIC both favor Linear Regression, and the bootstrapping MSEs of the two models are nearly identical, so the three criteria lead to the same overall conclusion; the agreement can be checked directly from the `results` dictionary, as sketched below.
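+
+As a rough check of this agreement, the preferred model under each criterion can be read off the dictionary returned by `evaluate_models` (illustrative snippet; lower is better for all three metrics):
+
+```python
+# Which model does each selector prefer?
+best_by_cv = min(results, key=lambda name: results[name]["K-Fold MSE"])
+best_by_boot = min(results, key=lambda name: results[name]["Bootstrapping MSE"])
+best_by_aic = min(results, key=lambda name: results[name]["AIC"])
+print(best_by_cv, best_by_boot, best_by_aic)  # agreement => the same name three times
+```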
+
+#### In what cases might the methods you've written fail or give incorrect or undesirable results?
+
+- **Multicollinearity:**
+  - If features are highly correlated, the methods may give incorrect or unstable results. For example, if two features `Feature1` and `Feature2` were nearly duplicates of each other, their estimated coefficients might not accurately reflect their individual contributions (a small illustration follows this list).
+
+- **Small Sample Sizes:**
+  - With small sample sizes, the methods may not have enough data to provide reliable estimates, which leads to high variance in both the model coefficients and the MSE values.
+
+- **High Variance Models:**
+  - With high-variance (overfitting) models, the resampled MSE estimates themselves become noisy, so the comparison between models is less reliable.
+
+- **Poorly Specified Models:**
+  - If the candidate models are not properly specified (e.g., an unsuitable regularization parameter), the methods can only rank the options they are given. For instance, if the Ridge Regression `alpha` is set far too high, the coefficients are over-shrunk and the model underfits.
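+
+The multicollinearity failure mode can be reproduced with a short, self-contained sketch (illustrative only, not part of `model_selection.py`): when two features are nearly duplicates, the normal-equation solve used by `linear_regression` becomes ill-conditioned and the individual coefficients are unstable.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+x1 = rng.random(200)
+x2 = x1 + 1e-6 * rng.standard_normal(200)         # near-duplicate of x1
+X_corr = np.column_stack([np.ones(200), x1, x2])  # intercept + two collinear columns
+y_corr = x1 + rng.standard_normal(200)
+
+# X^T X is nearly singular: expect a huge condition number and large,
+# offsetting coefficient estimates for the two collinear columns.
+print(np.linalg.cond(X_corr.T @ X_corr))
+print(np.linalg.inv(X_corr.T @ X_corr) @ X_corr.T @ y_corr)
+```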
+
+#### What could you implement given more time to mitigate these cases or help users of your methods?
+
+- **Feature Selection:**
+  - Implement feature-selection methods to handle multicollinearity. Techniques like the Variance Inflation Factor (VIF) can be used to identify and remove highly correlated features.
+
+- **Regularization Techniques:**
+  - Implement additional regularization techniques such as Lasso and Elastic Net (alongside the existing Ridge) to handle different types of data and model requirements.
+
+- **Cross-Validation with Multiple Splits:**
+  - Repeat the k-fold cross-validation over several random splits and average the results, which gives a more stable estimate of model performance.
+
+- **Bootstrap Aggregation:**
+  - Implement bootstrap aggregation (bagging) of the fitted models to reduce the variance of the estimates and improve the stability of the results.
+
+- **Model Diagnostics:**
+  - Provide detailed diagnostics to help users understand the model selection process, such as residual plots, coefficient plots, and other visualizations that aid in interpreting the results.
+
+#### What parameters have you exposed to your users in order to use your model selectors?
+
+- **`ridge_regression`:**
+  - `alpha`: Strength of the L2 regularization; the intercept is not penalized. (Default: 1.0)
+
+- **K-Fold Cross-Validation (`k_fold_cross_validation`):**
+  - `k`: Number of folds. (Default: 5)
+  - `model_fn`: Model-fitting function. (e.g., `linear_regression`, `ridge_regression`)
+  - `model_params`: Model parameters, forwarded to `model_fn` as keyword arguments. (e.g., `{'alpha': 1.0}` for Ridge Regression)
+
+- **Bootstrapping (`bootstrapping`):**
+  - `B`: Number of bootstrap samples. (Default: 1000)
+  - `model_fn`: Model-fitting function. (e.g., `linear_regression`, `ridge_regression`)
+  - `model_params`: Model parameters, forwarded to `model_fn` as keyword arguments. (e.g., `{'alpha': 1.0}` for Ridge Regression)
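+
+For reference, a minimal usage sketch of these parameters (illustrative values; assumes the functions and the synthetic `X`, `y` from `model_selection.py` are in scope, as in the snippets above):
+
+```python
+# Model-specific parameters such as `alpha` are forwarded to the model function.
+cv_mse = k_fold_cross_validation(X, y, k=10, model_fn=ridge_regression, alpha=0.5)
+boot_mse = bootstrapping(X, y, B=500, model_fn=linear_regression)
+print(cv_mse, boot_mse)
+```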
diff --git a/images/Figure_1.png b/images/Figure_1.png
new file mode 100644
index 0000000..8c10ccb
Binary files /dev/null and b/images/Figure_1.png differ
diff --git a/images/image.png b/images/image.png
new file mode 100644
index 0000000..ab07fcf
Binary files /dev/null and b/images/image.png differ
diff --git a/model_selection.py b/model_selection.py
new file mode 100644
index 0000000..dee28b9
--- /dev/null
+++ b/model_selection.py
@@ -0,0 +1,175 @@
+import numpy as np
+from typing import Callable, Dict, Any, Tuple
+
+try:
+    import matplotlib.pyplot as plt
+except Exception:
+    plt = None  # plotting is optional; fall back to text output only
+
+
+def linear_regression(X: np.ndarray, y: np.ndarray) -> np.ndarray:
+    """Ordinary least squares via the normal equations; returns [intercept, coefficients...]."""
+    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
+    beta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
+    return beta
+
+
+def ridge_regression(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
+    """Ridge regression with an unpenalized intercept; alpha is the L2 strength."""
+    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
+    I = np.eye(X_b.shape[1])
+    I[0, 0] = 0  # do not penalize the intercept
+    beta = np.linalg.inv(X_b.T @ X_b + alpha * I) @ X_b.T @ y
+    return beta
+
+
+def k_fold_cross_validation(X: np.ndarray, y: np.ndarray, k: int = 5,
+                            model_fn: Callable = linear_regression, **model_params) -> float:
+    """Average test MSE over k folds (samples beyond k * (n // k) are only used for training)."""
+    n = X.shape[0]
+    fold_size = n // k
+    indices = np.arange(n)
+    np.random.shuffle(indices)
+    mse_scores = np.zeros(k)
+
+    for i in range(k):
+        test_indices = indices[i * fold_size : (i + 1) * fold_size]
+        train_indices = np.setdiff1d(indices, test_indices)
+        X_train, y_train = X[train_indices], y[train_indices]
+        X_test, y_test = X[test_indices], y[test_indices]
+
+        beta = model_fn(X_train, y_train, **model_params)
+        X_test_b = np.hstack((np.ones((X_test.shape[0], 1)), X_test))
+        predictions = X_test_b @ beta
+        mse_scores[i] = np.mean((y_test - predictions) ** 2)
+
+    return np.mean(mse_scores)
+
+
+def bootstrapping(X: np.ndarray, y: np.ndarray, B: int = 1000,
+                  model_fn: Callable = linear_regression, **model_params) -> float:
+    """Average MSE on the full original data over B models fitted to bootstrap resamples."""
+    n = X.shape[0]
+    mse_scores = np.zeros(B)
+
+    for b in range(B):
+        indices = np.random.choice(n, n, replace=True)
+        X_b = X[indices]
+        y_b = y[indices]
+
+        beta = model_fn(X_b, y_b, **model_params)
+        X_test_b = np.hstack((np.ones((n, 1)), X))
+        predictions = X_test_b @ beta
+        mse_scores[b] = np.mean((y - predictions) ** 2)
+
+    return np.mean(mse_scores)
+
+
+def aic(X: np.ndarray, y: np.ndarray, beta: np.ndarray) -> Tuple[float, float]:
+    """Gaussian AIC up to an additive constant: n*log(SSR/n) + 2*(p+1). Returns (AIC, SSR)."""
+    n = X.shape[0]
+    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
+    y_pred = X_b @ beta
+    ssr = np.sum((y - y_pred) ** 2)
+    p = len(beta) - 1
+    aic_value = n * np.log(ssr / n) + 2 * (p + 1)
+    return aic_value, ssr
+
+
+def evaluate_models(models: Dict[str, Any], X: np.ndarray, y: np.ndarray,
+                    k: int = 5, B: int = 1000) -> Dict[str, Dict[str, Any]]:
+    """Fit each model on the full data and score it with k-fold CV, bootstrapping, and AIC."""
+    results = {}
+    for name, model_info in models.items():
+        model_fn = model_info['fn']
+        model_params = model_info.get('params', {})
+        beta = model_fn(X, y, **model_params)
+        cv_mse = k_fold_cross_validation(X, y, k, model_fn, **model_params)
+        bootstrap_mse = bootstrapping(X, y, B, model_fn, **model_params)
+        aic_val, ssr = aic(X, y, beta)
+        results[name] = {
+            "Coefficients": beta,
+            "K-Fold MSE": cv_mse,
+            "Bootstrapping MSE": bootstrap_mse,
+            "AIC": aic_val,
+            "SSR": ssr,
+        }
+    return results
+
+
+def verify_coefficients(X: np.ndarray, y: np.ndarray, beta: np.ndarray,
+                        model_fn: Callable, **model_params) -> bool:
+    """Sanity check: refitting on the same data reproduces the stored coefficients."""
+    beta_sanity = model_fn(X, y, **model_params)
+    return np.allclose(beta, beta_sanity)
+
+
+def verify_aic(X: np.ndarray, y: np.ndarray, beta: np.ndarray, aic_value: float) -> bool:
+    """Sanity check: recompute the AIC by hand and compare it with the stored value."""
+    n = X.shape[0]
+    X_b = np.hstack((np.ones((X.shape[0], 1)), X))
+    y_pred = X_b @ beta
+    ssr = np.sum((y - y_pred) ** 2)
+    p = len(beta) - 1
+    aic_manual = n * np.log(ssr / n) + 2 * (p + 1)
+    return np.isclose(aic_value, aic_manual)
+
+
+# --- Demo: synthetic data, model definitions, evaluation, and reporting ---
+np.random.seed(42)
+n = 100
+p = 1
+X = np.random.rand(n, p)
+beta_true = np.array([1])
+y = X @ beta_true + np.random.randn(n)
+
+models = {
+    "Linear Regression": {"fn": linear_regression},
+    "Ridge Regression": {"fn": ridge_regression, "params": {"alpha": 1.0}},
+}
+
+results = evaluate_models(models, X, y, k=5, B=1000)
+
+if plt is not None:
+    def visualize(evaluate_models, X, y, models, results):
+        # Recompute the metrics so the plotted values come from a fresh evaluation.
+        results = evaluate_models(models, X, y, k=5, B=1000)
+        kfold_mse = [result['K-Fold MSE'] for result in results.values()]
+        bootstrap_mse = [result['Bootstrapping MSE'] for result in results.values()]
+        aic_values = [result['AIC'] for result in results.values()]
+        models_names = list(results.keys())
+
+        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
+        bar_width = 0.35
+        index = np.arange(len(models_names))
+        bars1 = ax1.bar(index, kfold_mse, bar_width, label='K-Fold MSE', color='blue')
+        bars2 = ax1.bar(index + bar_width, bootstrap_mse, bar_width, label='Bootstrapping MSE', color='green')
+
+        ax1.set_xlabel('Models', fontsize=12)
+        ax1.set_xticks(index + bar_width / 2)
+        ax1.set_xticklabels(models_names, rotation=45)
+        ax1.set_title('Model Evaluation: MSE Comparison', fontsize=14)
+
+        def autolabel(bars, ax):
+            # Write the bar height above each bar.
+            for bar in bars:
+                height = bar.get_height()
+                ax.annotate(f'{height:.4f}',
+                            xy=(bar.get_x() + bar.get_width() / 2, height),
+                            xytext=(0, 3),
+                            textcoords="offset points",
+                            ha='center', va='bottom')
+
+        autolabel(bars1, ax1)
+        autolabel(bars2, ax1)
+
+        ax1.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15), ncol=2, fontsize=10)
+        ax1.set_yticks([])
+
+        bars3 = ax2.bar(index, aic_values, bar_width * 2, label='AIC', color='red')
+        ax2.set_xlabel('Models', fontsize=12)
+        ax2.set_xticks(index)
+        ax2.set_xticklabels(models_names, rotation=45)
+        ax2.set_title('Model Evaluation: AIC Comparison', fontsize=14)
+
+        autolabel(bars3, ax2)
+        ax2.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15), ncol=1, fontsize=10)
+        ax2.set_yticks([])
+
+        plt.tight_layout()
+        plt.show()
+        return results
+
+    results = visualize(evaluate_models, X, y, models, results)
+
+
+for name, result in results.items():
+    print(f"Model: {name}")
+    print(f"Coefficients: {result['Coefficients']}")
+    print(f"K-Fold MSE: {result['K-Fold MSE']}")
+    print(f"Bootstrapping MSE: {result['Bootstrapping MSE']}")
+    print(f"AIC: {result['AIC']}")
+    print(f"SSR: {result['SSR']}")
+    print(f"Verification (Coefficients): {verify_coefficients(X, y, result['Coefficients'], models[name]['fn'], **models[name].get('params', {}))}")
+    print(f"Verification (AIC): {verify_aic(X, y, result['Coefficients'], result['AIC'])}")
+    print("\n")
+    print("-" * 30)