148 changes: 142 additions & 6 deletions README.md
@@ -1,8 +1,144 @@
# Project 1
Ajay Anand A20581927
Anish Vishwanathan VR A20596106
Mohit Panchatcharam A20562455
Sibi Chandra sekar A20577946

OVERVIEW:

This README addresses the following questions:

* What does the model you have implemented do and when should it be used?
* How did you test your model to determine if it is working reasonably correctly?
* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
The main goal of this project is to implement ElasticNet linear regression from scratch using Python and libraries such as Pandas, NumPy, Matplotlib, Seaborn, and SciPy.
The model predicts house prices from the features in the Paris Housing dataset.
ElasticNet regression is a linear regression model that combines the L1 (Lasso) and L2 (Ridge) regularization techniques.

IMPLEMENTATION:

The ElasticNet model is a regression method that applies both the L1 (Lasso) and L2 (Ridge) penalties while fitting the model. These penalties reduce overfitting by shrinking the model's coefficients and can even remove irrelevant features, leading to a more generalized solution.
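
For reference, the objective minimized by this implementation (it matches the elastic_net_loss function shown later in this README, with rho denoting l1_ratio) is:

$$
L(\beta) = \frac{1}{2n}\sum_{i=1}^{n}\left(y_i - x_i^{\top}\beta\right)^2 + \alpha\left(\rho\lVert\beta\rVert_1 + (1-\rho)\lVert\beta\rVert_2^2\right)
$$

where $n$ is the number of samples, $\alpha$ is the regularization strength, and $\rho$ is l1_ratio.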

WHEN TO USE THIS MODEL?

1. When you need a regression model with regularization to prevent overfitting.
2. When the dataset contains multicollinearity, i.e., features that are highly correlated with one another.
3. When you want the model to keep only the most relevant features by shrinking the less useful ones toward zero (a small illustration follows).
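
As a small hedged illustration of point 3 (best_beta comes from the grid-search code later in this README; the threshold is arbitrary):

import numpy as np

# Near-zero entries of best_beta are features the L1 penalty has effectively removed
selected = np.where(np.abs(best_beta) > 1e-3)[0]
print("Retained feature indices:", selected)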

TESTING THE MODEL:

The Paris Housing dataset is used to train and test the model. The workflow is given below:
1. Exploring the data - The first step was to check the relationships between features and housing prices using a correlation heatmap.
2. Handling outliers - We capped the outliers at the 1st and 99th percentiles to reduce their impact on model performance. This technique is called Winsorization.
3. Fitting the model - The model was trained with two hyperparameters, alpha and l1_ratio:
alpha : controls the regularization strength.
l1_ratio : controls the balance between L1 and L2.
4. Evaluation - The model was evaluated with the R-squared metric, which indicates how much of the variance in house prices the model explains (see the formula below).
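
For reference, the R-squared value computed later in the code (1 - ss_residual / ss_total) follows the standard definition:

$$
R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}
$$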




PARAMETERS FOR IMPLEMENTATION:

1. alpha : controls the overall regularization strength.
2. l1_ratio : controls the balance between L1 (Lasso) and L2 (Ridge).
if l1_ratio = 1 , the penalty is pure Lasso.
if l1_ratio = 0 , the penalty is pure Ridge.
3. max_iter : specifies the maximum number of iterations allowed during the optimization process.
4. tol : sets the tolerance for convergence; when the change in the loss falls below tol, the optimization stops. A basic usage sketch follows.
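
A minimal usage sketch, assuming toy data (the X and y below are illustrative), the elastic_net_loss function defined later in this README, and that max_iter and tol map to BFGS's maxiter and gtol options:

import numpy as np
from scipy.optimize import minimize

# Toy data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

initial_beta = np.zeros(X.shape[1])
result = minimize(
    elastic_net_loss, initial_beta,
    args=(X, y, 0.01, 0.5),                   # alpha=0.01, l1_ratio=0.5
    method='BFGS',
    options={'maxiter': 1000, 'gtol': 1e-6},  # max_iter and tol
)
print("Fitted coefficients:", result.x)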

TROUBLES WITH IMPLEMENTATION:

Correlated Features: Although ElasticNet is designed to handle multicollinearity, highly correlated features can still cause instability. In some cases, further feature engineering may be necessary (for example, Principal Component Analysis or other dimensionality reduction methods).

Outliers: Even after capping extreme values, outliers can still adversely affect the model's performance. For datasets with many outliers, additional transformations (such as a log transform; a sketch follows) may be needed to normalize the data.
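
A hedged sketch of such a transform (the column name is illustrative and this step is not part of the current pipeline):

import numpy as np

# Log-transform a heavily skewed feature before standardization (illustrative only)
df['squareMeters'] = np.log1p(df['squareMeters'])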

Scalability: For very large datasets, this implementation may struggle to scale efficiently because of the iterative optimization process. If scalability becomes an issue, more advanced optimization techniques or parallel processing should be considered (one possible direction is sketched below).
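
As one hedged sketch of such an alternative, plain batch gradient descent on the same loss avoids BFGS's memory overhead; this is an illustration under the assumption of the loss defined above, not the project's implementation:

import numpy as np

def elastic_net_gd(X, y, alpha=0.01, l1_ratio=0.5, lr=0.01, n_iter=1000):
    # Illustrative gradient-descent alternative to BFGS for large datasets
    n = len(y)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        residuals = y - X @ beta
        # Gradient of the squared loss plus (sub)gradients of the L1 and L2 penalties
        grad = -(X.T @ residuals) / n + alpha * (l1_ratio * np.sign(beta) + 2 * (1 - l1_ratio) * beta)
        beta -= lr * grad
    return beta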


import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize

Purpose: This section imports the libraries needed for data manipulation, visualization, and model optimization.

df = pd.read_csv(r'C:\Users\vrani\Desktop\machine_learning\ParisHousing.csv')

Purpose: Load the dataset into a pandas DataFrame for subsequent analysis.

df.head()
df.shape
df.columns
df.isnull().sum()

Purpose: Inspect the dataset by displaying the first few rows, the shape of the DataFrame, the column names, and the count of missing values per column. This step is important for understanding the data structure.

df = df.select_dtypes(include=[np.number])
plt.figure(figsize=(20, 20))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

Purpose: Filter for numeric columns and create a correlation heatmap to visualize the relationships between the features and the target. This helps identify which features are most relevant for prediction.

correlations = df.corr()['price'].sort_values(ascending=False)
print("Correlations with price:\n", correlations)

low_correlation_features = correlations[correlations.abs() < 0.1].index
df.drop(low_correlation_features, axis=1, inplace=True)

Purpose: Identify and drop features with low correlation to the target variable. This simplifies the model, speeds up training, and can improve performance.

for col in df.columns:
    if col != 'price':
        upper_bound = df[col].quantile(0.99)
        lower_bound = df[col].quantile(0.01)
        df[col] = np.clip(df[col], lower_bound, upper_bound)

Purpose: We clip each feature to its 1st and 99th percentiles to manage outliers (Winsorization). This stabilizes the model and prevents extreme values from skewing predictions.

numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
for col in numeric_columns:
    mean = df[col].mean()
    std = df[col].std()
    df[col] = (df[col] - mean) / std

Purpose: Standardization is critical for regression models because it ensures every feature contributes equally to the model's computation. The numeric features are standardized to have a mean of 0 and a standard deviation of 1.

X = df.drop('price', axis=1)
y = df['price']
X = sm.add_constant(X)

Purpose: We split the DataFrame into features (X) and the target variable (y). We then add a constant column, which provides the intercept term in the regression model.

def elastic_net_loss(beta, X, y, alpha=0.001, l1_ratio=0.5):
...
Purpose: We define a custom loss function for ElasticNet regression. It computes the squared-error loss of the predictions plus the weighted L1 and L2 penalties (the full definition appears in elasticnet/__init__.py).

alphas = [0.0001, 0.001, 0.01, 0.1]
l1_ratios = [0.1, 0.5, 0.9]
best_r2 = -np.inf
...
Purpose: We loop through various combinations of regularization strengths (alpha) and L1/L2 ratios (l1_ratio) to find the best performing model based on the R-squared value. This is very important for optimizing the model's performance.

ss_total = np.sum((y - np.mean(y)) ** 2)
ss_residual = np.sum((y - predictions) ** 2)
r_squared = 1 - (ss_residual / ss_total)

Purpose: Here we calculate the R-squared values to evaluate the model's performance. The R-squared value indicates how well the model explains the variance in the target variable.

final_predictions = np.dot(X, best_beta)
print("\nFinal Predictions:", final_predictions)

Purpose: We make predictions using the optimized model coefficients derived from the best hyperparameters.

results_df = pd.DataFrame({
    'Actual': y,
    'Predicted': final_predictions
})

plt.figure(figsize=(8, 6))
sns.regplot(x='Actual', y='Predicted', data=results_df, scatter_kws={'s':10}, line_kws={'color':'red'})
plt.title('Linear Regression: Actual vs. Predicted Values')
plt.xlabel('Actual Prices (in Lacs)')
plt.ylabel('Predicted Prices (in Lacs)')
plt.show()

Purpose: We create a DataFrame to compare the actual and predicted values, then visualize the results as a scatter plot with a regression line. This makes it easy to assess the model's performance visually.
122 changes: 122 additions & 0 deletions elasticnet/__init__.py
@@ -0,0 +1,122 @@
#Ajay Anand A20581927
#Anish Vishwanathan VR A20596106
#Mohit Panchatcharam A20562455
#Sibi Chandra sekar A20577946

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
df = pd.read_csv(r'C:\Users\vrani\Desktop\machine_learning\ParisHousing.csv')
df.head()
df.shape
df.columns
df.isnull().sum()
df = df.select_dtypes(include=[np.number])
plt.figure(figsize=(20, 20))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

correlations = df.corr()['price'].sort_values(ascending=False)
print("Correlations with price:\n", correlations)

low_correlation_features = correlations[correlations.abs() < 0.1].index
df.drop(low_correlation_features, axis=1, inplace=True)

# Cap outliers at the 1st and 99th percentiles (Winsorization)
for col in df.columns:
    if col != 'price':
        upper_bound = df[col].quantile(0.99)
        lower_bound = df[col].quantile(0.01)
        df[col] = np.clip(df[col], lower_bound, upper_bound)

# Standardize the numeric features to mean 0 and standard deviation 1
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns
for col in numeric_columns:
    mean = df[col].mean()
    std = df[col].std()
    df[col] = (df[col] - mean) / std
df.head()
X = df.drop('price', axis=1)
y = df['price']

X = sm.add_constant(X)

def elastic_net_loss(beta, X, y, alpha=0.001, l1_ratio=0.5):
    """
    Loss function for ElasticNet regression (L1 + L2 regularization)
    beta: coefficients (weights)
    X: feature matrix (with significant features)
    y: target variable
    alpha: regularization strength (controls the penalty)
    l1_ratio: balance between L1 and L2 (1.0 = L1, 0.0 = L2)
    """
    predictions = np.dot(X, beta)  # Predicted values
    residuals = y - predictions    # Error in predictions

    l1_penalty = np.sum(np.abs(beta))
    l2_penalty = np.sum(beta**2)

    # ElasticNet loss function: squared loss + L1 + L2
    loss = np.sum(residuals**2) / (2 * len(y)) + alpha * (l1_ratio * l1_penalty + (1 - l1_ratio) * l2_penalty)
    return loss

alphas = [0.0001, 0.001, 0.01, 0.1] # Regularization strength values
l1_ratios = [0.1, 0.5, 0.9] # Balance between L1 and L2
best_r2 = -np.inf
best_alpha = None
best_l1_ratio = None
best_beta = None
# Loop over all combinations of alpha and l1_ratio
for alpha in alphas:
    for l1_ratio in l1_ratios:
        # Initialize beta (coefficients) with small random values
        np.random.seed(42)
        initial_beta = np.random.normal(size=X.shape[1])

        # Minimize the loss function using optimization for the current alpha and l1_ratio
        result = minimize(elastic_net_loss, initial_beta, args=(X, y, alpha, l1_ratio), method='BFGS')

        # Get the optimized coefficients
        optimal_beta = result.x

        # Make predictions
        predictions = np.dot(X, optimal_beta)

        # Evaluate the model (R-squared)
        ss_total = np.sum((y - np.mean(y)) ** 2)      # Total sum of squares
        ss_residual = np.sum((y - predictions) ** 2)  # Residual sum of squares
        r_squared = 1 - (ss_residual / ss_total)

        # Check if this is the best model so far
        if r_squared > best_r2:
            best_r2 = r_squared
            best_alpha = alpha
            best_l1_ratio = l1_ratio
            best_beta = optimal_beta

        print(f"Alpha: {alpha}, L1_ratio: {l1_ratio}, R-squared: {r_squared}")
# Step 7: Print the best hyperparameters and corresponding R-squared
print(f"\nBest Alpha: {best_alpha}")
print(f"Best L1_ratio: {best_l1_ratio}")
print(f"Best R-squared: {best_r2}")
print(f"Best coefficients: {best_beta}")

# Step 8: Final model predictions using the best hyperparameters
final_predictions = np.dot(X, best_beta)
print("\nFinal Predictions:", final_predictions)
# Step 11: Plot the regression results using seaborn
# Create a DataFrame to store actual and predicted values
results_df = pd.DataFrame({
    'Actual': y,
    'Predicted': final_predictions
})

# Plot the actual vs. predicted values using seaborn
plt.figure(figsize=(8, 6))
sns.regplot(x='Actual', y='Predicted', data=results_df, scatter_kws={'s':10}, line_kws={'color':'red'})
plt.title('Linear Regression: Actual vs. Predicted Values')
plt.xlabel('Actual Prices (in Lacs)')
plt.ylabel('Predicted Prices (in Lacs)')
plt.show()
34 changes: 18 additions & 16 deletions elasticnet/models/ElasticNet.py
@@ -1,17 +1,19 @@
import numpy as np


def elastic_net_loss(beta, X, y, alpha=0.001, l1_ratio=0.5):
    """
    Loss function for ElasticNet regression (L1 + L2 regularization)
    beta: coefficients (weights)
    X: feature matrix (with significant features)
    y: target variable
    alpha: regularization strength (controls the penalty)
    l1_ratio: balance between L1 and L2 (1.0 = L1, 0.0 = L2)
    """
    predictions = np.dot(X, beta)  # Predicted values
    residuals = y - predictions    # Error in predictions

    l1_penalty = np.sum(np.abs(beta))
    l2_penalty = np.sum(beta**2)

    # ElasticNet loss function: squared loss + L1 + L2
    loss = np.sum(residuals**2) / (2 * len(y)) + alpha * (l1_ratio * l1_penalty + (1 - l1_ratio) * l2_penalty)
    return loss