185 changes: 179 additions & 6 deletions README.md
@@ -1,8 +1,181 @@
# Project 1

Put your README here. Answer the following questions.

* What does the model you have implemented do and when should it be used?
* How did you test your model to determine if it is working reasonably correctly?
* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
# Project 1: ElasticNet

## Overview

This project implements an **ElasticNet** regression model from scratch, using **NumPy** for all numerical computation. ElasticNet is a form of linear regression that combines **L1 (Lasso)** and **L2 (Ridge)** penalties, which makes it especially useful for datasets with many features, some of which may be irrelevant or highly correlated with others.

Unlike prebuilt libraries like Scikit-Learn, this implementation relies on manually coded gradient descent to optimize the model’s weights.

The `ElasticNetModel` class includes two main methods:
- `fit(X, y)`: This trains the model using the dataset.
- `predict(X)`: This makes predictions based on the trained model.

## Requirements

This project meets the following requirements:

1. **Algorithm Implementation**: The ElasticNet regression is implemented from scratch, combining both L1 and L2 regularization penalties with gradient descent.
2. **From First Principles**: The model uses **NumPy** for matrix calculations, with no prebuilt machine learning libraries (like Scikit-Learn or Statsmodels).
3. **Testing the Model**: We tested the model using a custom script that runs it on a dataset and evaluates performance using metrics like Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R²). We also generated visualizations to assess the model's performance.
4. **Flexible Input**: The model can handle any numerical dataset with proper preprocessing and normalization. The test script works with a provided `test.csv` dataset generated by a separate script.


## What does the model you have implemented do and when should it be used?

The **ElasticNet** model we have implemented is a form of linear regression that combines two types of regularization: **L1 (Lasso)** and **L2 (Ridge)**. This makes it useful when you have a dataset with many features, especially when some of those features are unimportant or highly correlated with each other.

ElasticNet helps prevent overfitting by penalizing large coefficients and can also automatically select important features by driving some coefficients to zero. It’s a good choice when you’re dealing with high-dimensional data or when you want a balance between selecting features and generalizing well.
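
For reference, one conventional way to write the objective that the gradient-descent training loop minimizes (with the penalty scaling that matches the gradients used in `fit`) is:

$$
J(\mathbf{w}, b) = \frac{1}{2n} \sum_{i=1}^{n} \left( \mathbf{x}_i^\top \mathbf{w} + b - y_i \right)^2 + \lambda_1 \lVert \mathbf{w} \rVert_1 + \frac{\lambda_2}{2} \lVert \mathbf{w} \rVert_2^2
$$

where $n$ is the number of samples, $\lambda_1$ corresponds to `l1_penalty`, and $\lambda_2$ corresponds to `l2_penalty`. The L1 term can push some weights exactly to zero (feature selection), while the L2 term shrinks all weights toward zero (less overfit solutions).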


## How did you test your model to determine if it is working reasonably correctly?

We tested the model by writing a script that loads a dataset, trains the ElasticNet model on it, and then evaluates how well it performs using standard metrics like **Mean Squared Error (MSE)**, **Mean Absolute Error (MAE)**, and **R-squared (R²)**. These metrics give a good sense of how accurate the model's predictions are compared to the actual values.

We also generated visualizations, including a plot of actual vs. predicted values, a residuals plot, and a bar plot showing the importance of each feature (based on the learned coefficients). This helped visually confirm that the model is working as expected.



## What parameters have you exposed to users of your implementation in order to tune performance?

We have exposed several parameters that let users tune the model's behavior:

1. **l1_penalty**: Controls the strength of L1 regularization, which helps with feature selection.
2. **l2_penalty**: Controls the strength of L2 regularization, which helps prevent overfitting.
3. **learning_rate**: Adjusts the step size during the training process. A smaller value means slower but more precise updates.
4. **max_iterations**: Sets how many times the model will update its weights during training.
5. **tolerance**: Stops training early once the gradient (and therefore the weight updates) becomes smaller than this threshold.

Here’s an example of how to use these parameters:

```python
import numpy as np

from elasticnet.models.ElasticNet import ElasticNetModel

# Example usage of ElasticNet (fit expects NumPy arrays, since it uses X.shape internally)
model = ElasticNetModel(l1_penalty=0.5, l2_penalty=0.3, learning_rate=0.01, max_iterations=5000)
X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]], dtype=float)
y = np.array([3, 4, 5, 6], dtype=float)

model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```


## Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?

The model works well with numerical data, but there are a few things it doesn’t handle well:

1. **Categorical data**: Non-numeric values (like "male" or "female") must be converted into numbers before being passed to the model; right now this has to be done manually. With more time, we could add automatic encoding of categorical features.

2. **Missing data**: The model expects every input value to be a valid number, so missing or `NaN` values need to be filled in before training. This could be improved by adding automatic handling for missing data (e.g., filling with column means).
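
As a rough illustration of the kind of manual preprocessing currently required, the sketch below one-hot encodes a categorical column and mean-imputes missing numeric values using only NumPy. The data and column layout are made up for the example; none of this is part of the library itself.

```python
import numpy as np

# Illustrative preprocessing sketch (not part of the ElasticNet implementation)
categories = np.array(["male", "female", "female", "male"])
numeric = np.array([[1.0, np.nan],
                    [2.0, 3.5],
                    [np.nan, 4.0],
                    [4.0, 5.5]])

# One-hot encode the categorical column: one indicator column per unique value
levels = np.unique(categories)
one_hot = (categories[:, None] == levels[None, :]).astype(float)

# Mean-impute missing numeric values column by column
col_means = np.nanmean(numeric, axis=0)
numeric_filled = np.where(np.isnan(numeric), col_means, numeric)

# Final feature matrix that could be passed to ElasticNetModel.fit
X = np.hstack([numeric_filled, one_hot])
```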


## Model Description

### When Should You Use ElasticNet?

ElasticNet is particularly useful in these scenarios:
- If you want to encourage **sparsity** in your model (meaning some of the less important features will be ignored), ElasticNet’s L1 regularization helps with that.
- If you’re working with a **large number of features** or when some features are highly correlated, ElasticNet's combination of L1 and L2 regularization helps handle this better than basic linear regression or Lasso alone.
- If you're worried about **overfitting**, the L2 penalty helps keep the model generalized by preventing large coefficients.


## How to Run the Model

### 1. Install Dependencies

First, make sure you have **NumPy** and **Matplotlib** installed. You can install them using `pip`:

```bash
pip install numpy matplotlib
```


### 2. Set Up the Environment

Before running any scripts, make sure your Python environment can find the project's modules. You can do this by setting the `PYTHONPATH` environment variable to your current directory:

```bash
export PYTHONPATH=$PWD
```

If you are using the Windows Command Prompt, use:

```bat
set PYTHONPATH=%cd%
```

This tells Python where to look for the project's files.

### 3. Generate the Dataset

You’ll need to generate the dataset before running the model. Use the `generate_test_CSV.py` script to create the `test.csv` file:

```bash
python generate_test_CSV.py
```

This will generate a dataset that the model can use for training and testing.

### 4. Run the Test Script

Once the dataset has been generated, you can run the ElasticNet model by using the test script:

```bash
python elasticnet/tests/test_ElasticNetModel.py
```

This will:
- Load the `test.csv` dataset.
- Preprocess the data (if necessary, convert categorical features to numerical form).
- Train the model using `fit()`.
- Predict and evaluate the results using `predict()`.
- Generate visualizations like "Actual vs Predicted" and "Residuals."
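
Roughly, these steps correspond to a flow like the sketch below. The exact column layout of `test.csv` is an assumption here (all numeric, target in the last column, header row present); the real test script may differ in its details.

```python
import numpy as np
from elasticnet.models.ElasticNet import ElasticNetModel

# Load the generated dataset (assumes a header row, numeric columns, target last)
data = np.genfromtxt("test.csv", delimiter=",", skip_header=1)
X, y = data[:, :-1], data[:, -1]

# Train the model and predict on the same data
model = ElasticNetModel(l1_penalty=0.1, l2_penalty=0.1, learning_rate=0.01, max_iterations=5000)
model.fit(X, y)
y_pred = model.predict(X)
print("First five predictions:", y_pred[:5])
```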

![alt text](<Screenshot 2024-10-10 194317.png>)


## Evaluation Metrics

We used the following metrics to evaluate how well the model performed:

- **Mean Squared Error (MSE)**: The average of the squared differences between the predicted and actual values.
- **Mean Absolute Error (MAE)**: The average of the absolute differences between the predicted and actual values.
- **R-squared (R²)**: Indicates how much of the variance in the target the model explains; the closer to 1, the better the fit.
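
As a minimal sketch, these metrics can be computed directly with NumPy from arrays of actual and predicted values (the function name here is just illustrative):

```python
import numpy as np

def evaluate(y_true, y_pred):
    # Mean Squared Error: average of squared differences
    mse = np.mean((y_true - y_pred) ** 2)
    # Mean Absolute Error: average of absolute differences
    mae = np.mean(np.abs(y_true - y_pred))
    # R-squared: 1 minus the ratio of residual variance to total variance
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return mse, mae, 1 - ss_res / ss_tot
```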

Here’s an example of the output:

```
Mean Squared Error (MSE): 1.85
Mean Absolute Error (MAE): 1.2
R-squared (R²): 0.75
```

![alt text](<Screenshot 2024-10-10 194334.png>)


## Visualizations

The test script will generate several plots to help visualize the model’s performance:

1. **Actual vs Predicted**: A scatter plot that compares the actual target values to the predicted values.
2. **Residuals**: A plot showing the differences between the actual and predicted values.
3. **Distribution of Target Values**: A histogram that shows the spread of the target variable.
4. **Feature Weights**: A bar chart showing the learned importance of each feature in the model.
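
For reference, the first two of these plots can be produced with a few lines of Matplotlib, roughly as in the sketch below (assuming `y` and `y_pred` hold the actual and predicted values):

```python
import matplotlib.pyplot as plt

# Actual vs Predicted scatter plot
plt.figure()
plt.scatter(y, y_pred, alpha=0.6)
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Actual vs Predicted")

# Residuals plot: actual minus predicted, against the predictions
plt.figure()
plt.scatter(y_pred, y - y_pred, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residuals")

plt.show()
```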

![alt text](<Screenshot 2024-10-10 194352.png>)


## Known Limitations

The current implementation handles numerical datasets well, but there are a few things to keep in mind:
- **Categorical Data**: If your dataset has non-numeric columns (like 'male' and 'female'), you’ll need to convert them to numbers before using the model.
- **Missing Values**: If your dataset has any missing values (`NaN`), you should handle those first by either filling them in or dropping the rows.


## Future Work

With more time, the following improvements could be made:
- Automating the preprocessing of **categorical data** so that users don’t need to manually encode it.
- Adding better handling for **missing data** by automatically filling or dropping missing values.
- Implementing **cross-validation** to automatically tune the regularization parameters (`l1_penalty`, `l2_penalty`) for better performance.
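
As a sketch of how the last item could work, a simple hold-out grid search over the two penalties might look like this (illustrative only, not part of the current implementation; a full k-fold cross-validation would average the validation error over several splits):

```python
import numpy as np
from elasticnet.models.ElasticNet import ElasticNetModel

def tune_penalties(X, y, l1_grid=(0.01, 0.1, 1.0), l2_grid=(0.01, 0.1, 1.0), val_fraction=0.2):
    # Simple hold-out split: the last val_fraction of the rows is the validation set
    n_val = max(1, int(len(y) * val_fraction))
    X_train, X_val = X[:-n_val], X[-n_val:]
    y_train, y_val = y[:-n_val], y[-n_val:]

    best = None
    for l1 in l1_grid:
        for l2 in l2_grid:
            model = ElasticNetModel(l1_penalty=l1, l2_penalty=l2)
            model.fit(X_train, y_train)
            mse = np.mean((model.predict(X_val) - y_val) ** 2)
            if best is None or mse < best[0]:
                best = (mse, l1, l2)
    return best  # (validation MSE, best l1_penalty, best l2_penalty)
```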


### Group Members
Krishna Manideep Malladi (A20550891)
Udaya Sree Vankdavath (A20552992)
Manvitha Byrineni (A20550783)
Binary file added Screenshot 2024-10-10 194317.png
Binary file added Screenshot 2024-10-10 194334.png
Binary file added Screenshot 2024-10-10 194352.png
46 changes: 36 additions & 10 deletions elasticnet/models/ElasticNet.py
@@ -1,17 +1,43 @@
 import numpy as np

-class ElasticNetModel():
-    def __init__(self):
-        pass
+class ElasticNetModel:
+    def __init__(self, l1_penalty=1.0, l2_penalty=1.0, learning_rate=0.001, max_iterations=1000, tolerance=1e-5):
+        self.l1_penalty = l1_penalty
+        self.l2_penalty = l2_penalty
+        self.learning_rate = learning_rate
+        self.max_iterations = max_iterations
+        self.tolerance = tolerance
+        self.weights = None
+        self.bias = None

     def fit(self, X, y):
-        return ElasticNetModelResults()
+        num_samples, num_features = X.shape
+        self.weights = np.zeros(num_features)
+        self.bias = 0
+
+        for i in range(self.max_iterations):
+            predictions = self.predict(X)
+            errors = predictions - y
+
+            # Gradient w.r.t. the weights: data term plus L1 subgradient and L2 term
+            gradient_weights = (X.T @ errors) / num_samples + self.l1_penalty * np.sign(self.weights) + self.l2_penalty * self.weights
+            gradient_bias = np.mean(errors)
+
+            self.weights -= self.learning_rate * gradient_weights
+            self.bias -= self.learning_rate * gradient_bias
+
+            # Stop early once the gradient has become negligibly small
+            if np.all(np.abs(gradient_weights) < self.tolerance):
+                break
+
+        return ElasticNetModelResults(self.weights, self.bias)
+
+    def predict(self, X):
+        return X @ self.weights + self.bias


-class ElasticNetModelResults():
-    def __init__(self):
-        pass
+class ElasticNetModelResults:
+    def __init__(self, weights, bias):
+        self.weights = weights
+        self.bias = bias

-    def predict(self, x):
-        return 0.5
+    def predict(self, X):
+        return X @ self.weights + self.bias