2,401 changes: 2,401 additions & 0 deletions ML_Project1.ipynb

Large diffs are not rendered by default.

89 changes: 84 additions & 5 deletions README.md
@@ -1,8 +1,87 @@
# Project 1
- Course: CS584 - Machine Learning
- Instructor: Steve Avsec

## Team Members
1. Munish Patel - mpatel176@hawk.iit.edu (A20544034)
2. Jaya Karthik Muppinidi - jmuppinidi@hawk.iit.edu (A20551726)
3. Meghana Mahamkali - mmahamkali@hawk.iit.edu (A20564182)
4. Nirusha Mantralaya Ramesh - nmantralayaramesh@hawk.iit.edu (A20600814)

## Linear Regression with ElasticNet Regularization
### Project Overview
Linear regression with ElasticNet regularization (a combination of L1 and L2 regularization).
This project implements an ElasticNet model that combines L1 and L2 penalties to improve generalization and prevent overfitting. The model is written from scratch in Python using NumPy and optimizes its parameters via gradient descent.

### Usage
```python
import numpy as np

from elasticnet.models.ElasticNet import ElasticNetModel

# x_train_scaled, y_train, x_test_scaled and label_encoder are assumed to be
# prepared beforehand (feature scaling and label encoding of the dataset).

# Fit and predict using the model
model = ElasticNetModel(lambdas=0.1, l1_ratio=0.5, iterations=1000, learning_rate=0.001)
results = model.fit(x_train_scaled, y_train)
predictions = results.predict(x_test_scaled)

# Convert the continuous predictions back to valid class indices
predicted_categories = np.clip(np.round(predictions), 0, len(label_encoder.classes_) - 1).astype(int)
# Convert numeric predictions back to job role labels
predicted_job_roles = label_encoder.inverse_transform(predicted_categories)

print("Numerical Predictions:", predictions)
print("Predicted Job Roles:", predicted_job_roles)
```
### Initial Data Used
https://github.com/munishpatel/ML-DATA/blob/c9442334645ca2ac71820578d17125c630c6199f/mldata.csv

### Explanation of the Model
1. What does the model you have implemented do and when should it be used?

- The ElasticNetModel implemented is a type of regularized linear regression that combines both L1 and L2 regularization.
- L1 Regularization (Lasso) helps in feature selection by shrinking some coefficients to zero, which is beneficial in models with high dimensionality.
- L2 Regularization (Ridge) tends to shrink coefficients evenly and helps in dealing with multicollinearity and model stability by keeping the coefficients small.
- The main reason for using ElasticNet is to keep the model simple while handling situations where features are correlated or where there are more features than observations. When it is desirable to reduce model complexity caused by collinear features, ElasticNet is effective.
- ElasticNet should be used when multicollinearity is suspected or known to be present in the data, when there are many features of which some may be irrelevant, or when a model that performs feature selection is needed to improve prediction accuracy. The objective the implementation minimizes is sketched below.
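
As a compact reference, the objective implied by the gradient updates in `elasticnet/models/ElasticNet.py` can be sketched as follows, with `lambdas` corresponding to λ and `l1_ratio` to α:

```math
\min_{w,\,b}\;\frac{1}{2n}\sum_{i=1}^{n}\big(y_i - x_i^\top w - b\big)^2
\;+\;\lambda\Big(\alpha\,\lVert w \rVert_1 + (1-\alpha)\,\lVert w \rVert_2^2\Big)
```

Differentiating the penalty gives the `l1_ratio * lambdas * sign(w)` and `(1 - l1_ratio) * lambdas * 2 * w` terms that appear in the coefficient gradient of the implementation.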


2. How did you test your model to determine if it is working reasonably correctly?

- We evaluated the model by training it on a dataset that predicts suggested job roles.
- To verify the model's ability to generalize, we divided the data into training and testing sets.
- We fit the model on the training data using `results = model.fit(x_train_scaled, y_train)`.
- We make predictions on the testing data using `results.predict(x_test_scaled)`.
- We also tested the model in test.py using small_data.csv, and in generate_regression_data.py, where we generated random data and stored it in data.csv. A minimal sketch of the train/test workflow is shown below.
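
A minimal sketch of this train/test workflow (assuming `X` and `y` are an already-encoded feature matrix and target vector; the `train_test_split` helper and the MSE check are illustrative additions, not part of the submitted code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from elasticnet.models.ElasticNet import ElasticNetModel

# X, y: assumed pre-encoded feature matrix and target vector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features on the training set only, then apply the same scaling to the test set
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(X_train)
x_test_scaled = scaler.transform(X_test)

model = ElasticNetModel(lambdas=0.1, l1_ratio=0.5, iterations=1000, learning_rate=0.001)
results = model.fit(x_train_scaled, y_train)
predictions = results.predict(x_test_scaled)

mse = np.mean((predictions - y_test) ** 2)  # simple sanity-check metric
print("Test MSE:", mse)
```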


3. What parameters have you exposed to users of your implementation in order to tune performance?

- ```lambdas```: Controls the strength of the regularization. A higher value means more regularization.
- ```l1_ratio```: Balances between L1 and L2 regularization.
- ```iterations```: Determines the number of iterations in the gradient descent algorithm.
- ```learning_rate```: Controls the step size at each iteration while moving toward a minimum of the loss function.
- Example usage for randomly generated data (the data-loading and scaling steps are shown here for completeness):
```python
import csv

import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

from elasticnet.models.ElasticNet import ElasticNetModel

def test_model_with_generated_data(csv_path):
    # Load the generated data (columns x_0, x_1, ..., y)
    with open(csv_path, "r") as file:
        rows = [{k: float(v) for k, v in row.items()} for row in csv.DictReader(file)]
    X = np.array([[v for k, v in row.items() if k.startswith('x')] for row in rows])
    y = np.array([row['y'] for row in rows])
    X_scaled = StandardScaler().fit_transform(X)  # standardize features before fitting

    model = ElasticNetModel(lambdas=1.0, l1_ratio=0.5, iterations=1000, learning_rate=0.01)
    results = model.fit(X_scaled, y)
    predictions = results.predict(X_scaled)

    # Plotting the results
    plt.figure(figsize=(10, 6))
    plt.scatter(y, predictions, alpha=0.5)
    plt.title('Comparison of Actual and Predicted Values')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)  # Diagonal line for reference
    plt.grid(True)
    plt.show()

    print("Predictions:", predictions)
    print("Actuals:", y)
    return predictions, y

predictions, actuals = test_model_with_generated_data('data.csv')
```

4. Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?

- Non-linear relationships: the ElasticNetModel, being a linear model, inherently assumes that the relationships between the predictors and the response variable are linear. This assumption limits its ability to model complex, non-linear interactions.
- High-dimensional data: although ElasticNet is designed to handle multicollinearity and can perform feature selection via L1 regularization, it may still struggle with very high-dimensional data (the p >> n scenario), where the number of features far exceeds the number of observations.
- Categorical feature handling: we used binary encoding, number (label) encoding, and dummy-variable encoding in this project, since our dataset contained more categorical features than numerical ones.
- Further tuning of the regularization parameters, and potentially applying dimensionality reduction techniques such as PCA (Principal Component Analysis) before ElasticNet, could improve model performance; a hypothetical sketch follows.
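
A hypothetical sketch of combining PCA with this project's `ElasticNetModel` (`X` and `y` are assumed to be an already-encoded feature matrix and target vector; this pipeline is illustrative, not part of the submitted code):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from elasticnet.models.ElasticNet import ElasticNetModel

# X, y: assumed pre-encoded feature matrix and target vector
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)  # keep ~95% of the variance

model = ElasticNetModel(lambdas=1.0, l1_ratio=0.5, iterations=1000, learning_rate=0.01)
results = model.fit(X_reduced, y)
predictions = results.predict(X_reduced)
```
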
101 changes: 101 additions & 0 deletions data.csv
@@ -0,0 +1,101 @@
x_0,x_1,y
5.479120971119267,-1.2224312049589532,19.635827298128937
7.171958398227648,3.9473605811872776,34.01599123419791
-8.11645304224701,9.512447032735118,-0.3295869279908642
5.222794039807059,5.721286105539075,31.947511431974814
-7.4377273464890825,-0.9922812420886569,-18.960169974845627
-2.5840395153483753,8.535299776972035,15.725962869212662
2.8773024016132904,6.455232265416598,26.63295664257481
-1.131716023453377,-5.455225564304462,-8.661660405692198
1.0916957403166965,-8.723654877916493,-11.222394635913918
6.55262343985164,2.6332879824412974,29.875727882507398
5.1617548017074775,-2.9094806374026323,13.823072860024297
9.413960487898066,7.862422426443953,47.747913256158476
5.567669941475238,-6.1072258429606485,8.610405771575666
-0.6655799254593155,-9.123924684255424,-15.578712585589608
-6.914210158649043,3.6609790648490925,-7.5047698038929305
4.895243118156342,9.350194648684202,37.05972593409786
-3.4834928372369607,-2.5908058793026223,-10.601458777721708
-0.6088837744838411,-6.2105728183142865,-9.731966393413675
-7.401569893290567,-0.48590147548132556,-18.504185725196315
-5.461813018982317,3.396279893650206,-3.5901214443419356
-1.2569616225533853,6.653563921156749,15.074358411657268
4.005302040044983,-3.7526671723591782,10.847969882859335
6.645196027904021,6.095287149936038,36.97165670433424
-2.2504324193965104,-4.233437921395118,-10.91411571265047
3.6499100794995094,-7.204950327813804,1.315970765990421
-6.001835950497833,-9.85275460497989,-32.468520270182054
5.738487550042768,3.2970171318406436,28.9860702722633
4.1033075725267025,5.614580620439359,27.45469588622546
-0.8216844892332009,1.3748239190578744,5.37508415204402
-7.204060037446851,-7.709398529280531,-31.80274884076271
3.368059235809433,-0.5780758771373495,16.46549999068752
1.3047221296237765,5.299977148320512,21.390965296794022
2.694366400011816,1.071588013159916,14.37303187579646
1.1841432149082713,-3.920998038747756,0.42305020568012575
-9.383643308641211,-1.2656522153527519,-27.14567635846614
-5.708306543609416,-1.8294271255072765,-16.374480895806286
7.068061465363321,-5.321210282693185,15.877368834293934
-8.83394516621868,-4.372322159560069,-29.040626196887942
-4.128124844666328,3.238330294537901,-1.6367977826667892
1.1406430468255664,5.677964182128271,19.123711064665443
3.2862708065477513,-1.8722627711985886,8.966997847507422
6.280407693320694,-6.660541601845922,10.357473955725329
-9.54575853732279,-8.199042784487165,-41.097775592802265
4.447187011929007,-0.7624553949722532,16.28721081847644
-6.774564419327964,0.020895502067270755,-16.158763032016942
-6.953757945736632,3.9264075015547206,-8.102721388353011
-1.0768744885193868,-2.3795754780703504,-4.747502813055492
-3.96975821704247,2.605651862377769,-3.165016171762862
-2.763747788932191,-8.24700161367798,-17.655999482124233
-7.639881957589694,9.23795329099029,-0.7311618720625308
8.171613814152142,3.9941426762149916,36.40634121643177
-4.682600770809609,9.383527546954479,11.556166309801446
5.5750180793158925,4.337803783179911,33.30572897354791
-1.0127699571242275,-4.55516876309682,-8.320214026391664
-8.072180756930013,8.052047930876832,-3.480695365824407
-0.8844742033277786,-5.952732704095394,-9.217332467233614
-3.8808675169869495,1.5843913788379194,-1.7451221488794175
-6.464544341215365,7.1322856818475096,-1.1159187383793143
5.170390596704202,4.389259119018735,29.044412182207967
-1.3581392044979257,2.546176814048863,6.795273590665693
1.681959378254712,2.9969320310963994,16.474508271423556
-8.311113577202217,-1.683851956587807,-23.677200716012365
-9.16771652276215,-0.12018361510962094,-22.877339763021737
-3.4027757533442937,-7.109516222679062,-20.802255513760986
-7.931940645548967,1.7528914435542404,-15.528212793513086
-6.588140629262278,8.502402367535943,1.973995357276097
1.621222794007899,-3.0626039093032587,3.9706304530434355
1.818309829628335,-9.54392257940605,-9.188242888746112
9.171184264828906,-0.35393126114199447,32.27722879474787
5.654704545005725,-8.345400001551228,6.286029449734547
-0.2668333832367935,-0.18586011290958204,3.9832089521389213
8.756529099499659,1.4345610475215071,34.49046580194119
-0.5302119788609243,-4.660486738162128,-5.858454065329163
-3.3686200531489563,0.4134480494307553,-4.278879667493945
-1.2217707938990667,-9.567758402393391,-18.52238722002683
6.525838483887156,7.923215436795335,40.740440586925445
-7.195018220027785,1.0807228707809884,-14.520895516934868
-7.828485177291129,3.4448018607962343,-9.502683501350324
-4.375324323219834,3.1884526938380358,-0.17571267950818803
4.539892285737652,5.37294983835314,29.75142308647574
-7.845181080882069,8.320236902752157,-2.6581266468366613
-5.395720182102384,-9.251748876476405,-30.803069771243383
1.0970493878296672,-2.5815543227512254,4.319182471075416
6.595794862648262,6.165029441286038,37.380192695664
-3.6572221435456935,9.057987901394899,12.624452775544329
-4.181643237197628,0.30114258463429167,-8.687230529298493
-4.880698188647945,8.720871400979266,8.72708671757288
-6.707843648359637,-9.10178761215342,-32.87268583117139
-1.2980587999392412,9.847511281116741,19.69041547800128
7.833545325098278,4.972160389138985,37.97343194612286
7.8158498175704985,7.8689327939572635,44.44913224498254
0.3771672077289807,-3.6814189633841394,-1.178869505225093
5.4402486422197605,3.233225263355221,27.49502526756617
-2.526845422525799,-8.110666638769695,-18.905357813182746
4.935792226980521,-4.750789681542706,10.053819939649266
8.736263010675586,-5.180588499886305,21.000174544356618
-7.5448413517702795,6.622253442498124,-2.9185251973214323
-6.93431366751012,-6.414633836845218,-31.198867117152098
1.9876558304168697,7.49124081674929,25.708598860239505
-6.071306685708535,-3.793526541998105,-20.62446071974937
55 changes: 45 additions & 10 deletions elasticnet/models/ElasticNet.py
@@ -1,17 +1,52 @@

import numpy as np


class ElasticNetModel:

    def __init__(self, lambdas=1.0, l1_ratio=0.5, iterations=10000, learning_rate=0.001):
        self.lambdas = lambdas
        self.l1_ratio = l1_ratio
        self.iterations = iterations
        self.learning_rate = learning_rate
        self.coef_ = None
        self.intercept_ = 0

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.coef_ = np.zeros(n_features)
        self.intercept_ = 0

        # Performing gradient descent
        for _ in range(self.iterations):
            current_predictions = np.dot(X, self.coef_) + self.intercept_
            residuals = current_predictions - y

            # Computing gradients for coefficients
            # First, we calculate the gradient from the residuals
            residual_gradient = np.dot(X.T, residuals) / n_samples

            # Computing the L1 regularization term
            l1_term = self.l1_ratio * self.lambdas * np.sign(self.coef_)

            # Computing the L2 regularization term
            l2_term = (1 - self.l1_ratio) * self.lambdas * 2 * self.coef_

            # Combining the gradients from residuals, L1, and L2 terms
            coef_gradient = residual_gradient + l1_term + l2_term

            # Computing the gradient for the intercept
            intercept_gradient = np.sum(residuals) / n_samples

            # Updating the model parameters
            self.coef_ -= self.learning_rate * coef_gradient
            self.intercept_ -= self.learning_rate * intercept_gradient

        return ElasticNetModelResults(self.coef_, self.intercept_)


class ElasticNetModelResults:
    def __init__(self, coef, intercept):
        self.coef_ = coef
        self.intercept_ = intercept

    def predict(self, X):
        return np.dot(X, self.coef_) + self.intercept_
49 changes: 38 additions & 11 deletions elasticnet/tests/test_ElasticNetModel.py
@@ -1,19 +1,46 @@
import csv

import matplotlib.pyplot as plt  # Needed for the plots below
import numpy as np
from sklearn.preprocessing import StandardScaler  # To standardize the features

from elasticnet.models.ElasticNet import ElasticNetModel


def test_predict():
    model = ElasticNetModel(lambdas=1.0, l1_ratio=0.5, iterations=1000, learning_rate=0.01)
    data = []

    # Load data from the CSV file
    with open("/content/small_test.csv", "r") as file:
        reader = csv.DictReader(file)
        for row in reader:
            # Convert all values to float for consistency
            data.append({k: float(v) for k, v in row.items()})

    # Extract features and targets
    X = np.array([[v for k, v in datum.items() if k.startswith('x')] for datum in data])
    y = np.array([datum['y'] for datum in data if 'y' in datum])

    # Normalize the feature data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit the model
    results = model.fit(X_scaled, y)

    # Make predictions
    preds = results.predict(X_scaled)

    # Print predictions to verify outputs
    print("Predictions:", preds)

    # Plotting the results
    plt.figure(figsize=(10, 6))
    plt.scatter(y, preds, alpha=0.5, color='blue')  # Plot predictions vs actual values
    plt.title('Actual vs. Predicted Values')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)  # Diagonal line
    plt.grid(True)
    plt.show()


# Run the test function
test_predict()