106 changes: 106 additions & 0 deletions Readme.md
@@ -0,0 +1,106 @@
# Project 2

By:

- A20539949 - Usha Devaraju
- A20548244 - Roopashri Kommana
- A20550565 - Sai Sandeep Neerukonda

# Gradient Boosting for Regression

This repository contains a custom implementation of a Gradient Boosting model for regression tasks, using decision trees as base learners. The model is designed to be versatile and easily adjustable to fit various regression problems.

# 1. What does the model you have implemented do and when should it be used?
## Model Description

The Gradient Boosting model implemented here constructs an ensemble of decision trees in a sequential manner, where each tree is built to correct the errors made by the previous ones. The model is particularly useful for datasets where relationships between features and the target variable are complex and non-linear.
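
In update form, the model starts from a constant prediction and, at each round, fits a tree $h_m$ to the current residuals and adds a shrunken copy of it (this mirrors `gradient_boosting.py`, which initializes with the mean of $y$):

$$F_0(x) = \bar{y}, \qquad F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, M,$$

where $\nu$ is the `learning_rate` and $M$ is `n_estimators`.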

### When to Use This Model

This model should be used when:
- You are dealing with regression tasks that require robust predictive power.
- The dataset contains complex, non-linear relationships between features and the target.
- Simpler models (like linear regression) prove insufficient.

# 2. How did you test your model to determine if it is working reasonably correctly?
## Testing the Model

The model has been tested on the California Housing dataset, a standard benchmark for evaluating regression models. The testing involves:
- Splitting the data into training and testing sets.
- Scaling the feature matrix to standardize the input data.
- Training the Gradient Boosting model on the training data.
- Evaluating its performance using Mean Squared Error (MSE) on the test set.
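
For reference, the reported metric over the $n$ test samples is

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,$$

where $\hat{y}_i$ is the model's prediction for the $i$-th sample.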

# 3. What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
## Exposed Parameters

Users can tune the following parameters to optimize the model's performance (a tuning sketch follows the list):
- `n_estimators`: The number of trees to build (default is 100).
- `learning_rate`: The step size at each iteration to control overfitting (default is 0.1).
- `max_depth`: The maximum depth of each decision tree (default is 3).
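
As a rough tuning sketch (not an exhaustive search), a small grid over these parameters can be compared by test-set MSE. The scaled splits `X_train_scaled`, `X_test_scaled`, `y_train`, and `y_test` are assumed to come from the usage example below:

```python
from itertools import product

from sklearn.metrics import mean_squared_error

from gradient_boosting import GradientBoosting

# Small illustrative grid; larger values trade bias for variance and runtime.
for n_estimators, learning_rate, max_depth in product([50, 100], [0.05, 0.1], [2, 3]):
    model = GradientBoosting(n_estimators=n_estimators,
                             learning_rate=learning_rate,
                             max_depth=max_depth)
    model.fit(X_train_scaled, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test_scaled))
    print(f"n={n_estimators}, lr={learning_rate}, depth={max_depth}: MSE={mse:.4f}")
```

As a rule of thumb, lowering `learning_rate` usually calls for a matching increase in `n_estimators` to reach the same training error.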

### Prerequisites

Ensure you have Python installed along with the following libraries:
- `numpy`
- `scikit-learn`

To install missing dependencies, use:
```bash
pip install numpy scikit-learn
```

### Basic Usage Example

```python
from gradient_boosting import GradientBoosting
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Load data
data = fetch_california_housing()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize and train the gradient boosting model
model = GradientBoosting(n_estimators=50, learning_rate=0.1, max_depth=3)
model.fit(X_train_scaled, y_train)

# Predict and evaluate
predictions = model.predict(X_test_scaled)
print("Test MSE:", mean_squared_error(y_test, predictions))
```

## Running Tests
To test the model on the California Housing dataset, run:
```bash
python testing.py
```
The script will:
- **Load the dataset.**
- **Train and test the Gradient Boosting model.**
- **Output the Mean Squared Error (MSE) of the predictions.**

# 4. Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
## Potential Issues and Workarounds

The model may encounter difficulties with specific types of inputs such as:

- **Extremely Noisy Data**: High levels of noise can lead to overfitting, where the model learns the noise as patterns, degrading prediction accuracy on new data.
- **Outliers**: Outliers can disproportionately influence the decision boundaries established by the decision trees, leading to suboptimal models.

### Workarounds

To enhance model robustness and performance:
- **Preprocessing Steps**: Handle outliers and noise before fitting, for example with an outlier-detection filter or robust scaling methods; see the sketch after this list.
- **Advanced Techniques**: Given more time, noise-filtering steps or a more outlier-tolerant loss function (e.g. Huber loss in place of the squared loss) could be integrated to improve the model's generalization.
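
As a minimal sketch of the preprocessing idea (one possible choice, not part of the current implementation), scikit-learn's `RobustScaler` can stand in for `StandardScaler` in the pipeline above:

```python
from sklearn.preprocessing import RobustScaler

# Centers on the median and scales by the interquartile range, so a handful
# of extreme values has far less influence than with mean/std standardization.
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```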

182 changes: 182 additions & 0 deletions gradient_boosting.py
@@ -0,0 +1,182 @@
import numpy as np

# Define the function that computes the pseudo-residuals for the squared loss
def squared_loss_gradient(y, f):
    """
    Compute the negative gradient (pseudo-residuals) of the squared loss.

    For L(y, f) = (y - f)^2 / 2, the gradient with respect to f is f - y,
    so the negative gradient, which each new tree is fitted to, is y - f.

    Parameters:
    - y (np.array): The target values.
    - f (np.array): The predicted values.

    Returns:
    - np.array: The pseudo-residuals y - f.
    """
    return y - f
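
# Example: squared_loss_gradient(np.array([3.0]), np.array([2.5])) -> array([0.5])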

# Define the Node class to represent each node in the decision tree
class Node:
"""
A node in the decision tree.

Attributes:
- value (float): The value at the node, used for leaf nodes.
- left (Node): Left child node.
- right (Node): Right child node.
- threshold (float): The threshold for splitting.
- feature (int): The index of the feature used for splitting.
"""
def __init__(self, value=None, left=None, right=None, threshold=None, feature=None):
self.value = value
self.left = left
self.right = right
self.threshold = threshold
self.feature = feature

# Define the DecisionTree class for building the regression tree
class DecisionTree:
"""
A simple decision tree for regression.

Attributes:
- max_depth (int): The maximum depth of the tree.
- root (Node): The root node of the tree.
"""
def __init__(self, max_depth=3):
self.max_depth = max_depth
self.root = None

def fit(self, X, residuals):
"""
Fit the decision tree to the residuals.

Parameters:
- X (np.array): Feature matrix.
- residuals (np.array): Residuals to fit.
"""
self.root = self._build_tree(X, residuals, depth=0)

def _build_tree(self, X, residuals, depth):
"""
Recursively build the decision tree.

Parameters:
- X (np.array): Feature matrix.
- residuals (np.array): Residuals to fit.
- depth (int): Current depth of the tree.

Returns:
- Node: The constructed tree node.
"""
num_samples = X.shape[0]
if depth >= self.max_depth or num_samples <= 1:
leaf_value = np.mean(residuals)
return Node(value=leaf_value)

best_feature, best_threshold, best_var = None, None, np.inf
for feature in range(X.shape[1]):
thresholds = np.unique(X[:, feature])
for threshold in thresholds:
left_mask = X[:, feature] <= threshold
right_mask = X[:, feature] > threshold
if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
continue
                left_var = np.var(residuals[left_mask])
                right_var = np.var(residuals[right_mask])
                # Weight each side's variance by its sample count; an
                # unweighted sum would favor splits that isolate a handful
                # of points into a tiny, zero-variance partition.
                total_var = (np.sum(left_mask) * left_var
                             + np.sum(right_mask) * right_var)
                if total_var < best_var:
                    best_feature, best_threshold, best_var = feature, threshold, total_var

        # If no valid split was found (e.g. every feature is constant across
        # the remaining rows), return a leaf instead of indexing with None.
        if best_feature is None:
            return Node(value=np.mean(residuals))

        left_mask = X[:, best_feature] <= best_threshold
        right_mask = X[:, best_feature] > best_threshold
left_node = self._build_tree(X[left_mask], residuals[left_mask], depth + 1)
right_node = self._build_tree(X[right_mask], residuals[right_mask], depth + 1)
return Node(feature=best_feature, threshold=best_threshold, left=left_node, right=right_node)

def predict(self, X):
"""
Make predictions using the decision tree.

Parameters:
- X (np.array): Feature matrix.

Returns:
- np.array: Predicted values.
"""
return np.array([self._predict(x, self.root) for x in X])

def _predict(self, x, node):
"""
Recursively predict by traversing the decision tree.

Parameters:
- x (np.array): Single feature vector.
- node (Node): Current node of the tree.

Returns:
- float: Predicted value.
"""
if node.value is not None:
return node.value
if x[node.feature] <= node.threshold:
return self._predict(x, node.left)
else:
return self._predict(x, node.right)
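
# Illustrative standalone use of DecisionTree (a sketch, not executed here):
#   tree = DecisionTree(max_depth=2)
#   tree.fit(X, y - np.mean(y))  # fit the tree to residuals around the mean
#   preds = tree.predict(X)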

# Define the GradientBoosting class for boosting decision trees
class GradientBoosting:
"""
Gradient Boosting for regression.

Attributes:
- n_estimators (int): Number of boosting stages to perform.
- learning_rate (float): Learning rate shrinks the contribution of each tree.
- max_depth (int): Maximum depth of each decision tree.
    - trees (list): List of fitted decision trees, applied in sequence.
- initial_prediction (float): Initial prediction to start the boosting.
"""
def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.max_depth = max_depth
self.trees = []
self.initial_prediction = None

def fit(self, X, y):
"""
Fit the gradient boosting model.

Parameters:
- X (np.array): Feature matrix.
- y (np.array): Target values.
"""
# Initialize the first model to the mean of y
self.initial_prediction = np.mean(y)
f_m = np.full(y.shape, self.initial_prediction)

for _ in range(self.n_estimators):
            # Pseudo-residuals: the negative gradient of the squared loss.
            residuals = squared_loss_gradient(y, f_m)
tree = DecisionTree(max_depth=self.max_depth)
tree.fit(X, residuals)
predictions = tree.predict(X)
f_m += self.learning_rate * predictions
            self.trees.append(tree)  # Store the fitted tree for use at prediction time

def predict(self, X):
"""
Make predictions using the boosted model.

Parameters:
- X (np.array): Feature matrix.

Returns:
- np.array: Predicted values.
"""
# Start with the initial mean prediction
f_m = np.full(X.shape[0], self.initial_prediction)

# Accumulate predictions from each tree
for tree in self.trees:
f_m += self.learning_rate * tree.predict(X)

return f_m
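
# Minimal smoke test (a sketch on synthetic data, not part of the module):
#   rng = np.random.default_rng(0)
#   X = rng.normal(size=(200, 3))
#   y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
#   gb = GradientBoosting(n_estimators=20, learning_rate=0.1, max_depth=2)
#   gb.fit(X, y)
#   print(np.mean((y - gb.predict(X)) ** 2))  # training MSE should be small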
File renamed without changes.
42 changes: 42 additions & 0 deletions testing.py
@@ -0,0 +1,42 @@
# Import necessary libraries
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# The custom GradientBoosting class is defined in gradient_boosting.py
from gradient_boosting import GradientBoosting

def main():
# Load the California housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data with the same scaler
X_test_scaled = scaler.transform(X_test)

# Initialize the GradientBoosting model
model = GradientBoosting(n_estimators=50, learning_rate=0.1, max_depth=3)

# Train the model on the scaled training data
model.fit(X_train_scaled, y_train)

    # Predict on the scaled test set
predictions = model.predict(X_test_scaled)

# Evaluate the model using mean squared error
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error on Test Set:", mse)

if __name__ == "__main__":
main()