2,401 changes: 2,401 additions & 0 deletions ML_Project1.ipynb

Large diffs are not rendered by default.

89 changes: 84 additions & 5 deletions README.md
@@ -1,8 +1,87 @@
# Project 1
- Course: CS584 - Machine Learning
- Instructor: Steve Avsec

## Team Members
1. Munish Patel - mpatel176@hawk.iit.edu (A20544034)
2. Jaya Karthik Muppinidi - jmuppinidi@hawk.iit.edu (A20551726)
3. Meghana Mahamkali - mmahamkali@hawk.iit.edu (A20564182)
4. Nirusha Mantralaya Ramesh - nmantralayaramesh@hawk.iit.edu (A20600814)

## Linear Regression with ElasticNet Regularization
### Project Overview
Linear regression with ElasticNet regularization (a combination of L1 and L2 regularization).
This project implements an ElasticNet model that combines L1 and L2 penalties to improve generalization and prevent overfitting. The model is written from scratch in Python using NumPy and optimizes its parameters via gradient descent.

### Usage
```python
import numpy as np

from elasticnet.models.ElasticNet import ElasticNetModel

# x_train_scaled, y_train, x_test_scaled and label_encoder are assumed to be
# prepared beforehand (feature scaling and label encoding of the dataset).

# Fit and predict using the model
model = ElasticNetModel(lambdas=0.1, l1_ratio=0.5, iterations=1000, learning_rate=0.001)
results = model.fit(x_train_scaled, y_train)
predictions = results.predict(x_test_scaled)

# Convert the continuous predictions back to valid class indices
predicted_categories = np.clip(np.round(predictions), 0, len(label_encoder.classes_) - 1).astype(int)
# Convert numeric predictions back to job role labels
predicted_job_roles = label_encoder.inverse_transform(predicted_categories)

print("Numerical Predictions:", predictions)
print("Predicted Job Roles:", predicted_job_roles)
```
### Initial Data Used
https://github.com/munishpatel/ML-DATA/blob/c9442334645ca2ac71820578d17125c630c6199f/mldata.csv

### Explanation of the Model
1. What does the model you have implemented do and when should it be used?

- The ElasticNetModel implemented is a type of regularized linear regression that combines both L1 and L2 regularization.
- L1 Regularization (Lasso) helps in feature selection by shrinking some coefficients to zero, which is beneficial in models with high dimensionality.
- L2 Regularization (Ridge) tends to shrink coefficients evenly and helps in dealing with multicollinearity and model stability by keeping the coefficients small.
- The main reason for using ElasticNet is to keep the model simple while handling situations where features are correlated or where there are more features than observations. When it is desirable to reduce model complexity caused by collinear features, ElasticNet is effective.
- ElasticNet should be used when multicollinearity is suspected or known to be present in the data, when there are many features of which some may be irrelevant, or when a model that performs feature selection is needed to improve prediction accuracy. The objective the implementation minimizes is sketched below.
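
As a compact reference, the objective implied by the gradient updates in `elasticnet/models/ElasticNet.py` can be sketched as follows, with `lambdas` corresponding to λ and `l1_ratio` to α:

```math
\min_{w,\,b}\;\frac{1}{2n}\sum_{i=1}^{n}\big(y_i - x_i^\top w - b\big)^2
\;+\;\lambda\Big(\alpha\,\lVert w \rVert_1 + (1-\alpha)\,\lVert w \rVert_2^2\Big)
```

Differentiating the penalty gives the `l1_ratio * lambdas * sign(w)` and `(1 - l1_ratio) * lambdas * 2 * w` terms that appear in the coefficient gradient of the implementation.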


2. How did you test your model to determine if it is working reasonably correctly?

- We evaluated the model by training it on a dataset that predicts suggested job roles.
- To verify the model's ability to generalize, we divided the data into training and testing sets.
- We fit the model on the training data using `results = model.fit(x_train_scaled, y_train)`.
- We make predictions on the testing data using `results.predict(x_test_scaled)`.
- We also tested the model in test.py using small_data.csv, and in generate_regression_data.py, where we generated random data and stored it in data.csv. A minimal sketch of the train/test workflow is shown below.
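
A minimal sketch of this train/test workflow (assuming `X` and `y` are an already-encoded feature matrix and target vector; the `train_test_split` helper and the MSE check are illustrative additions, not part of the submitted code):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from elasticnet.models.ElasticNet import ElasticNetModel

# X, y: assumed pre-encoded feature matrix and target vector
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features on the training set only, then apply the same scaling to the test set
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(X_train)
x_test_scaled = scaler.transform(X_test)

model = ElasticNetModel(lambdas=0.1, l1_ratio=0.5, iterations=1000, learning_rate=0.001)
results = model.fit(x_train_scaled, y_train)
predictions = results.predict(x_test_scaled)

mse = np.mean((predictions - y_test) ** 2)  # simple sanity-check metric
print("Test MSE:", mse)
```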


3. What parameters have you exposed to users of your implementation in order to tune performance?

- ```lambdas```: Controls the strength of the regularization. A higher value means more regularization.
- ```l1_ratio```: Balances between L1 and L2 regularization.
- ```iterations```: Determines the number of iterations in the gradient descent algorithm.
- ```learning_rate```: Controls the step size at each iteration while moving toward a minimum of the loss function.
- Example usage for randomly generated data (the data-loading and scaling steps are shown here for completeness):
```python
import csv

import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler

from elasticnet.models.ElasticNet import ElasticNetModel

def test_model_with_generated_data(csv_path):
    # Load the generated data (columns x_0, x_1, ..., y)
    with open(csv_path, "r") as file:
        rows = [{k: float(v) for k, v in row.items()} for row in csv.DictReader(file)]
    X = np.array([[v for k, v in row.items() if k.startswith('x')] for row in rows])
    y = np.array([row['y'] for row in rows])
    X_scaled = StandardScaler().fit_transform(X)  # standardize features before fitting

    model = ElasticNetModel(lambdas=1.0, l1_ratio=0.5, iterations=1000, learning_rate=0.01)
    results = model.fit(X_scaled, y)
    predictions = results.predict(X_scaled)

    # Plotting the results
    plt.figure(figsize=(10, 6))
    plt.scatter(y, predictions, alpha=0.5)
    plt.title('Comparison of Actual and Predicted Values')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)  # Diagonal line for reference
    plt.grid(True)
    plt.show()

    print("Predictions:", predictions)
    print("Actuals:", y)
    return predictions, y

predictions, actuals = test_model_with_generated_data('data.csv')
```

4. Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?

- Non-linear relationships: the ElasticNetModel, being a linear model, inherently assumes that the relationships between the predictors and the response variable are linear. This assumption limits its ability to model complex, non-linear interactions.
- High-dimensional data: although ElasticNet is designed to handle multicollinearity and can perform feature selection via L1 regularization, it may still struggle with very high-dimensional data (the p >> n scenario), where the number of features far exceeds the number of observations.
- Categorical feature handling: we used binary encoding, number (label) encoding, and dummy-variable encoding in this project, since our dataset contained more categorical features than numerical ones.
- Further tuning of the regularization parameters, and potentially applying dimensionality reduction techniques such as PCA (Principal Component Analysis) before ElasticNet, could improve model performance; a hypothetical sketch follows.
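
A hypothetical sketch of combining PCA with this project's `ElasticNetModel` (`X` and `y` are assumed to be an already-encoded feature matrix and target vector; this pipeline is illustrative, not part of the submitted code):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

from elasticnet.models.ElasticNet import ElasticNetModel

# X, y: assumed pre-encoded feature matrix and target vector
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)  # keep ~95% of the variance

model = ElasticNetModel(lambdas=1.0, l1_ratio=0.5, iterations=1000, learning_rate=0.01)
results = model.fit(X_reduced, y)
predictions = results.predict(X_reduced)
```
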
101 changes: 101 additions & 0 deletions data.csv
@@ -0,0 +1,101 @@
x_0,x_1,y
5.479120971119267,-1.2224312049589532,19.635827298128937
7.171958398227648,3.9473605811872776,34.01599123419791
-8.11645304224701,9.512447032735118,-0.3295869279908642
5.222794039807059,5.721286105539075,31.947511431974814
-7.4377273464890825,-0.9922812420886569,-18.960169974845627
-2.5840395153483753,8.535299776972035,15.725962869212662
2.8773024016132904,6.455232265416598,26.63295664257481
-1.131716023453377,-5.455225564304462,-8.661660405692198
1.0916957403166965,-8.723654877916493,-11.222394635913918
6.55262343985164,2.6332879824412974,29.875727882507398
5.1617548017074775,-2.9094806374026323,13.823072860024297
9.413960487898066,7.862422426443953,47.747913256158476
5.567669941475238,-6.1072258429606485,8.610405771575666
-0.6655799254593155,-9.123924684255424,-15.578712585589608
-6.914210158649043,3.6609790648490925,-7.5047698038929305
4.895243118156342,9.350194648684202,37.05972593409786
-3.4834928372369607,-2.5908058793026223,-10.601458777721708
-0.6088837744838411,-6.2105728183142865,-9.731966393413675
-7.401569893290567,-0.48590147548132556,-18.504185725196315
-5.461813018982317,3.396279893650206,-3.5901214443419356
-1.2569616225533853,6.653563921156749,15.074358411657268
4.005302040044983,-3.7526671723591782,10.847969882859335
6.645196027904021,6.095287149936038,36.97165670433424
-2.2504324193965104,-4.233437921395118,-10.91411571265047
3.6499100794995094,-7.204950327813804,1.315970765990421
-6.001835950497833,-9.85275460497989,-32.468520270182054
5.738487550042768,3.2970171318406436,28.9860702722633
4.1033075725267025,5.614580620439359,27.45469588622546
-0.8216844892332009,1.3748239190578744,5.37508415204402
-7.204060037446851,-7.709398529280531,-31.80274884076271
3.368059235809433,-0.5780758771373495,16.46549999068752
1.3047221296237765,5.299977148320512,21.390965296794022
2.694366400011816,1.071588013159916,14.37303187579646
1.1841432149082713,-3.920998038747756,0.42305020568012575
-9.383643308641211,-1.2656522153527519,-27.14567635846614
-5.708306543609416,-1.8294271255072765,-16.374480895806286
7.068061465363321,-5.321210282693185,15.877368834293934
-8.83394516621868,-4.372322159560069,-29.040626196887942
-4.128124844666328,3.238330294537901,-1.6367977826667892
1.1406430468255664,5.677964182128271,19.123711064665443
3.2862708065477513,-1.8722627711985886,8.966997847507422
6.280407693320694,-6.660541601845922,10.357473955725329
-9.54575853732279,-8.199042784487165,-41.097775592802265
4.447187011929007,-0.7624553949722532,16.28721081847644
-6.774564419327964,0.020895502067270755,-16.158763032016942
-6.953757945736632,3.9264075015547206,-8.102721388353011
-1.0768744885193868,-2.3795754780703504,-4.747502813055492
-3.96975821704247,2.605651862377769,-3.165016171762862
-2.763747788932191,-8.24700161367798,-17.655999482124233
-7.639881957589694,9.23795329099029,-0.7311618720625308
8.171613814152142,3.9941426762149916,36.40634121643177
-4.682600770809609,9.383527546954479,11.556166309801446
5.5750180793158925,4.337803783179911,33.30572897354791
-1.0127699571242275,-4.55516876309682,-8.320214026391664
-8.072180756930013,8.052047930876832,-3.480695365824407
-0.8844742033277786,-5.952732704095394,-9.217332467233614
-3.8808675169869495,1.5843913788379194,-1.7451221488794175
-6.464544341215365,7.1322856818475096,-1.1159187383793143
5.170390596704202,4.389259119018735,29.044412182207967
-1.3581392044979257,2.546176814048863,6.795273590665693
1.681959378254712,2.9969320310963994,16.474508271423556
-8.311113577202217,-1.683851956587807,-23.677200716012365
-9.16771652276215,-0.12018361510962094,-22.877339763021737
-3.4027757533442937,-7.109516222679062,-20.802255513760986
-7.931940645548967,1.7528914435542404,-15.528212793513086
-6.588140629262278,8.502402367535943,1.973995357276097
1.621222794007899,-3.0626039093032587,3.9706304530434355
1.818309829628335,-9.54392257940605,-9.188242888746112
9.171184264828906,-0.35393126114199447,32.27722879474787
5.654704545005725,-8.345400001551228,6.286029449734547
-0.2668333832367935,-0.18586011290958204,3.9832089521389213
8.756529099499659,1.4345610475215071,34.49046580194119
-0.5302119788609243,-4.660486738162128,-5.858454065329163
-3.3686200531489563,0.4134480494307553,-4.278879667493945
-1.2217707938990667,-9.567758402393391,-18.52238722002683
6.525838483887156,7.923215436795335,40.740440586925445
-7.195018220027785,1.0807228707809884,-14.520895516934868
-7.828485177291129,3.4448018607962343,-9.502683501350324
-4.375324323219834,3.1884526938380358,-0.17571267950818803
4.539892285737652,5.37294983835314,29.75142308647574
-7.845181080882069,8.320236902752157,-2.6581266468366613
-5.395720182102384,-9.251748876476405,-30.803069771243383
1.0970493878296672,-2.5815543227512254,4.319182471075416
6.595794862648262,6.165029441286038,37.380192695664
-3.6572221435456935,9.057987901394899,12.624452775544329
-4.181643237197628,0.30114258463429167,-8.687230529298493
-4.880698188647945,8.720871400979266,8.72708671757288
-6.707843648359637,-9.10178761215342,-32.87268583117139
-1.2980587999392412,9.847511281116741,19.69041547800128
7.833545325098278,4.972160389138985,37.97343194612286
7.8158498175704985,7.8689327939572635,44.44913224498254
0.3771672077289807,-3.6814189633841394,-1.178869505225093
5.4402486422197605,3.233225263355221,27.49502526756617
-2.526845422525799,-8.110666638769695,-18.905357813182746
4.935792226980521,-4.750789681542706,10.053819939649266
8.736263010675586,-5.180588499886305,21.000174544356618
-7.5448413517702795,6.622253442498124,-2.9185251973214323
-6.93431366751012,-6.414633836845218,-31.198867117152098
1.9876558304168697,7.49124081674929,25.708598860239505
-6.071306685708535,-3.793526541998105,-20.62446071974937
55 changes: 45 additions & 10 deletions elasticnet/models/ElasticNet.py
@@ -1,17 +1,52 @@

import numpy as np


class ElasticNetModel:

    def __init__(self, lambdas=1.0, l1_ratio=0.5, iterations=10000, learning_rate=0.001):
        self.lambdas = lambdas
        self.l1_ratio = l1_ratio
        self.iterations = iterations
        self.learning_rate = learning_rate
        self.coef_ = None
        self.intercept_ = 0

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.coef_ = np.zeros(n_features)
        self.intercept_ = 0

        # Performing gradient descent
        for _ in range(self.iterations):
            current_predictions = np.dot(X, self.coef_) + self.intercept_
            residuals = current_predictions - y

            # Computing gradients for coefficients
            # First, we calculate the gradient from the residuals
            residual_gradient = np.dot(X.T, residuals) / n_samples

            # Computing the L1 regularization term
            l1_term = self.l1_ratio * self.lambdas * np.sign(self.coef_)

            # Computing the L2 regularization term
            l2_term = (1 - self.l1_ratio) * self.lambdas * 2 * self.coef_

            # Combining the gradients from residuals, L1, and L2 terms
            coef_gradient = residual_gradient + l1_term + l2_term

            # Computing the gradient for the intercept
            intercept_gradient = np.sum(residuals) / n_samples

            # Updating the model parameters
            self.coef_ -= self.learning_rate * coef_gradient
            self.intercept_ -= self.learning_rate * intercept_gradient

        return ElasticNetModelResults(self.coef_, self.intercept_)


class ElasticNetModelResults:
    def __init__(self, coef, intercept):
        self.coef_ = coef
        self.intercept_ = intercept

    def predict(self, X):
        return np.dot(X, self.coef_) + self.intercept_
49 changes: 38 additions & 11 deletions elasticnet/tests/test_ElasticNetModel.py
@@ -1,19 +1,46 @@
import csv

import matplotlib.pyplot as plt  # Needed for the plots below
import numpy as np
from sklearn.preprocessing import StandardScaler  # To standardize the features

from elasticnet.models.ElasticNet import ElasticNetModel


def test_predict():
    model = ElasticNetModel(lambdas=1.0, l1_ratio=0.5, iterations=1000, learning_rate=0.01)
    data = []

    # Load data from the CSV file
    with open("/content/small_test.csv", "r") as file:
        reader = csv.DictReader(file)
        for row in reader:
            # Convert all values to float for consistency
            data.append({k: float(v) for k, v in row.items()})

    # Extract features and targets
    X = np.array([[v for k, v in datum.items() if k.startswith('x')] for datum in data])
    y = np.array([datum['y'] for datum in data if 'y' in datum])

    # Normalize the feature data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit the model
    results = model.fit(X_scaled, y)

    # Make predictions
    preds = results.predict(X_scaled)

    # Print predictions to verify outputs
    print("Predictions:", preds)

    # Plotting the results
    plt.figure(figsize=(10, 6))
    plt.scatter(y, preds, alpha=0.5, color='blue')  # Plot predictions vs actual values
    plt.title('Actual vs. Predicted Values')
    plt.xlabel('Actual Values')
    plt.ylabel('Predicted Values')
    plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)  # Diagonal line
    plt.grid(True)
    plt.show()


# Run the test function
test_predict()