Binary file added .DS_Store
1 change: 1 addition & 0 deletions .gitignore
@@ -2,6 +2,7 @@
__pycache__/
*.py[cod]
*$py.class
.DS_Store

# C extensions
*.so
217 changes: 199 additions & 18 deletions README.md
# Gradient Boosting Tree Model
---

## Authors

| Name | Student ID | Contribution Files | Contribution % |
|----------------------------|-------------|----------------------------------------------------|----------------|
| Sankar Ganesh Paramasivam | A20553053 | `test_gradientboost.py`, `GradientBoost.py` | 25% |
| Neelarapu Tejaswini | A20592053 | `checker.py`, `GradientBoost.py` | 25% |
| Vijaya Sai Dasari | A20540356 | `test_gradientboost.py`, `GradientBoost.py` | 25% |
| Aravinth Ananth | A20537468 | `gridsearch.py`, `GradientBoost.py` | 25% |

---

## Team Name: THALA 7
---

## Hosted Streamlit App

You can interact with the project live using the hosted Streamlit app:

👉 **[Gradient Boosting Model App](https://mlprojectaravinthgreat.streamlit.app/)**

---

### **Note for Rapid Output**

To reduce execution time in the Streamlit app, select fewer hyperparameter options during grid search. This lets you see results quickly while exploring the model's functionality. For example:
- Limit `n_estimators` to [10, 50].
- Use a smaller range for `learning_rate`, such as [0.1].
- Limit `max_depth` to [2, 3].

This ensures faster performance without sacrificing functionality.
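
For reference, this reduced grid maps directly onto the grid-search call used later in this README; the call signature below is the one shown in the Q3 example, with the helper assumed to live in `gridsearch.py`:

```python
# Reduced grid for quick Streamlit runs, mirroring the note above.
# (Import as in the Q3 example: from gridsearch import grid_search_gradient_boosting)
n_estimators_values = [10, 50]
learning_rate_values = [0.1]
max_depth_values = [2, 3]

best_params, best_score = grid_search_gradient_boosting(
    X_train, y_train, X_test, y_test,
    n_estimators_values, learning_rate_values, max_depth_values,
)
```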

---

## Project Overview

This project implements a **Gradient Boosting Tree model**. Gradient Boosting is an ensemble learning technique that builds decision trees sequentially, with each tree trained to correct the errors of the ensemble built so far. This iterative refinement typically yields strong accuracy on complex, non-linear datasets (a minimal sketch of the loop appears after the feature list below).

### Key Features:
- **Custom Implementation**:
- Implements Gradient Boosting from scratch.
- Does not rely on prebuilt machine learning libraries for the core model.
- **Hyperparameter Optimization**:
- Includes grid search to tune hyperparameters (`n_estimators`, `learning_rate`, and `max_depth`).
- **Performance Evaluation**:
- Supports R² Score, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).
- **Visualization**:
- Provides detailed visual comparisons between actual and predicted values.
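
To make "each tree corrects its predecessor" concrete, here is a minimal, illustrative sketch of the squared-error boosting loop. It is not the project's implementation: `GradientBoost.py` builds its own trees, whereas this sketch borrows `DecisionTreeRegressor` from scikit-learn as a stand-in weak learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # stand-in weak learner, for illustration only


class BoostingSketch:
    """Minimal gradient boosting for squared error (cf. ESL Algorithm 10.3)."""

    def __init__(self, n_estimators=50, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        self.f0 = y.mean()                # initial prediction: mean of the targets
        self.trees = []
        pred = np.full(y.shape, self.f0)
        for _ in range(self.n_estimators):
            residuals = y - pred          # negative gradient of squared error
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)        # the next tree learns the current errors
            pred += self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        pred = np.full(np.shape(X)[0], self.f0)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)
        return pred
```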

---

## Questions and Answers

### **Q1. What does the model do, and when should it be used?**

The **Gradient Boosting Tree model**:
- Fits decision trees sequentially, each one trained on the residual errors of the ensemble so far, so the overall error shrinks round by round (formalized just below).
- Handles non-linear relationships and feature interactions effectively.
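
In the squared-error setting of ESL Sections 10.9-10.10, each round $m$ fits a tree $h_m$ to the current residuals and adds it with shrinkage $\nu$ (the `learning_rate`):

$$
r_{im} = y_i - f_{m-1}(x_i), \qquad f_m(x) = f_{m-1}(x) + \nu\, h_m(x),
$$

so after $M$ = `n_estimators` rounds the prediction is $f_M(x) = f_0 + \nu \sum_{m=1}^{M} h_m(x)$, where $f_0$ is the mean of the training targets.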

**Use cases**:
- Predicting housing prices based on features like location, size, and amenities.
- Estimating demand in time-series data.
- Financial modeling, such as credit scoring or risk assessment.

---

### **Q2. How did you test the model?**

The model was tested using:
1. **Synthetic Data**:
- Generated synthetic datasets with a known target function and verified that the model's predictions recover it (a minimal version of this check is sketched after this list).
2. **Hyperparameter Tuning**:
- Performed grid search across combinations of `n_estimators`, `learning_rate`, and `max_depth` to identify the best parameters.
3. **Metrics**:
- Evaluated the model using R² Score, MAE, and RMSE to assess accuracy and robustness.
4. **Visualization**:
- Plotted density and scatter plots to validate how closely predicted values align with actual values.
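
A minimal version of the synthetic-data check might look like the following. The class name and `r2_score_manual` match the usage shown later in this README; the dataset and the 0.8 threshold are illustrative choices:

```python
import numpy as np
from GradientBoost import GradientBoostingTree

rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(500, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=500)  # known non-linear target

X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

model = GradientBoostingTree(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

r2 = model.r2_score_manual(y_test, predictions)
assert r2 > 0.8, f"Expected a close fit on this smooth target, got R² = {r2:.3f}"
```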

---

### **Q3. What parameters can users tune to improve performance?**

Users can optimize the following hyperparameters:
- **n_estimators**: Controls the number of boosting iterations (trees). Larger values can improve accuracy but increase training time and the risk of overfitting.
- **learning_rate**: Determines the contribution of each tree. Lower values lead to better generalization but require more iterations.
- **max_depth**: Limits the depth of individual trees, balancing model complexity and overfitting.

**Example Grid Search**:
```python
# Assumes the grid-search helper is importable from gridsearch.py (see File Structure below).
from gridsearch import grid_search_gradient_boosting

n_estimators_values = [10, 50, 100]
learning_rate_values = [0.1, 0.01]
max_depth_values = [2, 3, 5]

best_params, best_score = grid_search_gradient_boosting(
    X_train, y_train, X_test, y_test,
    n_estimators_values, learning_rate_values, max_depth_values,
)
print("Best Parameters:", best_params)
```
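
The helper itself lives in `gridsearch.py`; a simplified sketch of what such an exhaustive search does (selecting by held-out R², which is an assumption about the scoring rule) looks like this:

```python
from itertools import product

from GradientBoost import GradientBoostingTree


def grid_search_sketch(X_train, y_train, X_test, y_test,
                       n_estimators_values, learning_rate_values, max_depth_values):
    """Try every combination and keep the one with the best held-out R²."""
    best_params, best_score = None, float("-inf")
    for n_est, lr, depth in product(n_estimators_values, learning_rate_values, max_depth_values):
        model = GradientBoostingTree(n_estimators=n_est, learning_rate=lr, max_depth=depth)
        model.fit(X_train, y_train)
        score = model.r2_score_manual(y_test, model.predict(X_test))
        if score > best_score:
            best_params = {"n_estimators": n_est, "learning_rate": lr, "max_depth": depth}
            best_score = score
    return best_params, best_score
```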

---

### **Q4. Are there specific inputs the model has trouble with?**

The model has the following limitations:
1. **Categorical Data**:
- The current implementation does not handle categorical data effectively. Categorical features must be preprocessed (e.g., using one-hot or label encoding) before using the model. Without preprocessing, the model will fail to interpret non-numerical inputs.
2. **Sparse Features**:
- If the dataset contains many zero or irrelevant features, performance may degrade.
3. **High Dimensionality**:
- Large feature sets can significantly increase computational time.
4. **Imbalanced Data**:
- On heavily imbalanced or skewed data, the model may overfit the rare samples without proper handling.

**Potential Solutions**:
- Extend the implementation to natively support categorical data with automatic encoding; a minimal preprocessing sketch follows this list.
- Apply feature selection to reduce high-dimensional datasets.
- Incorporate early stopping or regularization to handle overfitting on imbalanced data.
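
For the categorical-data limitation in particular, a simple one-hot preprocessing step is usually enough before calling `fit`. The example below uses pandas, which is an assumption about the available tooling (it is not required by the model itself):

```python
import pandas as pd

# Hypothetical raw data with one categorical column.
df = pd.DataFrame({
    "size_sqft": [850, 1200, 640],
    "neighborhood": ["north", "south", "north"],  # non-numeric: must be encoded first
    "price": [210.0, 340.0, 175.0],
})

# One-hot encode the categorical feature so every input column is numeric.
X = pd.get_dummies(df.drop(columns="price"), columns=["neighborhood"], dtype=float).to_numpy()
y = df["price"].to_numpy()
```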

---

## File Structure

| File Name | Description |
|------------------------|-----------------------------------------------------------------------------|
| `test_gradientboost.py`| Main script to test the model. Handles data preprocessing, training, and evaluation. |
| `GradientBoost.py` | Core implementation of the Gradient Boosting Tree model. |
| `gridsearch.py` | Implements grid search for hyperparameter optimization. |
| `checker.py` | Utility functions for null value handling, feature scaling, and more. |

---

## Code Flow Diagram

The diagram below illustrates the main code flow:

![Code Flow Diagram](diagram.png)

---

## How to Use

### Installation

Ensure Python is installed and dependencies are available. Install requirements using:

```bash
pip install -r requirements.txt
```

### Running the Model

Execute the following command to test the Gradient Boosting model:

```bash
python -m tests.test_gradientboost
```

### Model Training and Prediction

```python
from GradientBoost import GradientBoostingTree

# Initialize the model
model = GradientBoostingTree(n_estimators=50, learning_rate=0.1, max_depth=3)

# Train the model
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)
```

### Evaluate Performance

```python
r2 = model.r2_score_manual(y_test, predictions)
print(f"R² Score: {r2}")
```
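
MAE and RMSE can be computed in the same spirit; since the README only shows `r2_score_manual`, the NumPy one-liners below are stand-ins rather than the project's own helpers:

```python
import numpy as np

mae = np.mean(np.abs(y_test - predictions))           # Mean Absolute Error
rmse = np.sqrt(np.mean((y_test - predictions) ** 2))  # Root Mean Squared Error
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}")
```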

---

## Example Outputs

### Visualizations
1. **Density Plot**:
- Shows the distribution of actual vs. predicted values.

2. **Prediction Error Plot**:
- Compares predicted values to actual values for better interpretability.

### Sample Evaluation Metrics
```plaintext
R² Score: 0.85
Mean Absolute Error (MAE): 3.12
Root Mean Squared Error (RMSE): 4.25
```

---

## Reference

This implementation is inspired by **The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Second Edition)** by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2009).

The Gradient Boosting methodology specifically follows the concepts described in **Sections 10.9-10.10** of the book.

---
Binary file added diagram.png