72 changes: 54 additions & 18 deletions README.md
# Project 2

## Team Members
- Amruta Sanjay Pawar - A20570864
- Raghav Shah - A20570886
- Shreedhruthi Boinpally - A20572883

# Model Selection

This project implements a custom linear regression model in Python and evaluates its performance with K-Fold Cross-Validation, Bootstrapping, and the Akaike Information Criterion (AIC). The `heart.csv` dataset is used for testing and validation.

## Objective
To test and evaluate the performance of a custom linear regression model on any dataset using the following metrics:
1. K-Fold Cross-Validation Mean Squared Error (MSE)
2. Bootstrapping Mean Squared Error (MSE)
3. Akaike Information Criterion (AIC)
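
All three metrics are computed by the functions in `model.py` (shown in full below); a minimal evaluation run looks like this:

```python
from model import LinearRegression, load_csv, k_fold_cv, bootstrap, calculate_aic

# Load features and target from the CSV (all columns must be numeric)
X, y = load_csv("heart.csv", "target")
model = LinearRegression()

print(f"K-Fold CV MSE:     {k_fold_cv(model, X, y, k=5, metric='mse', random_seed=42):.4f}")
print(f"Bootstrapping MSE: {bootstrap(model, X, y, num_samples=100, metric='mse', random_seed=42):.4f}")
print(f"AIC:               {calculate_aic(model, X, y):.4f}")
```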

### Do your cross-validation and bootstrapping model selectors agree with a simpler model selector like AIC in simple cases (like linear regression)?
Yes, in our testing with the `heart.csv` dataset using linear regression, the model selectors produced consistent results:
- **K-Fold Cross-Validation MSE**: 0.1241
- **Bootstrapping MSE**: 0.1254
- **AIC**: -2137.2136

The two resampling estimates of prediction error are nearly identical, and AIC ranks the model consistently with them, indicating that cross-validation, bootstrapping, and AIC agree for simple cases like this one.

---

### In what cases might the methods you've written fail or give incorrect or undesirable results?
1. **Small datasets**: Bootstrapping may produce biased results due to repeated sampling from limited data.
2. **Imbalanced data**: K-Fold Cross-Validation can give misleading estimates when folds are not stratified, since some folds may under-represent a class.
3. **High-dimensional data**: AIC's fixed `2p` penalty can be too weak when the number of features approaches the sample size, biasing selection toward overly complex models.
4. **Outliers**: All methods may give undesirable results if outliers dominate the dataset.

---

### What could you implement given more time to mitigate these cases or help users of your methods?
- Implement **Stratified K-Folds** for handling imbalanced datasets (a sketch follows this list).
- Incorporate **robust regression techniques** to handle outliers effectively.
- Use **Bayesian Information Criterion (BIC)** alongside AIC for high-dimensional datasets.
- Add automated **dataset analysis and warnings** for users regarding dataset size, balance, or presence of outliers.
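
As a sketch of the first item, a stratified splitter could replace the plain shuffle inside `k_fold_cv`. This is a hypothetical helper (`stratified_fold_indices` is not part of `model.py`) and assumes the target holds discrete class labels, as the 0/1 `target` in `heart.csv` does:

```python
import random
from collections import defaultdict

def stratified_fold_indices(y, k=5, random_seed=None):
    """Split row indices into k folds that roughly preserve class proportions."""
    if random_seed is not None:
        random.seed(random_seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        random.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets its share
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)
    return folds
```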

---

### What parameters have you exposed to your users in order to use your model selectors?
- **K-Fold Cross-Validation**:
- `k`: Number of folds.
- `metric`: Metric to evaluate (e.g., MSE or R²).
- `random_seed`: Seed for reproducibility.

- **Bootstrapping**:
- `num_samples`: Number of bootstrap samples.
- `metric`: Metric to evaluate (e.g., MSE or R²).
- `random_seed`: Seed for reproducibility.

- **AIC**:
- Requires no additional parameters and uses the full dataset.

These parameters allow users to adapt the methods to their specific datasets and modeling requirements.
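
For instance, the same selectors can be rerun with a different fold count, metric, and seed without touching the library code:

```python
from model import LinearRegression, load_csv, k_fold_cv, bootstrap

X, y = load_csv("heart.csv", "target")
model = LinearRegression()

# 10 folds scored with R² instead of MSE, seeded for reproducibility
r2_cv = k_fold_cv(model, X, y, k=10, metric='r2', random_seed=0)

# 500 bootstrap resamples, scored on the out-of-bag rows
r2_boot = bootstrap(model, X, y, num_samples=500, metric='r2', random_seed=0)
```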


## How to Run
1. Clone this repository.
2. Install the one third-party dependency (`matplotlib`, used for plotting): `pip install matplotlib`.
3. To test with a different dataset, edit the values at the bottom of `model.py`: set `file_path` to the path of your CSV file and `target_column` to the name of its target column (for example, in `heart.csv` it is `target`).
4. Run `python model.py`.

201 changes: 201 additions & 0 deletions model.py
import csv
import random
import math
import matplotlib.pyplot as plt

def load_csv(file_path, target_column):
"""
Load a CSV file and split into features (X) and target (y).

Parameters:
file_path: Path to the CSV file.
target_column: Name of the target column.

Returns:
X: List of feature rows (2D list).
y: List of target values (1D list).
"""
with open(file_path, 'r') as file:
reader = csv.DictReader(file)
data = list(reader)

X = []
y = []
for row in data:
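        # Assumes every column is numeric; feature order follows the CSV header order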
y.append(float(row[target_column]))
X.append([float(value) for key, value in row.items() if key != target_column])

return X, y

def mean_squared_error(y_true, y_pred):
return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
mean_y = sum(y_true) / len(y_true)
ss_total = sum((yt - mean_y) ** 2 for yt in y_true)
ss_residual = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
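    # Note: raises ZeroDivisionError when y_true is constant (ss_total == 0)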
return 1 - (ss_residual / ss_total)

def split_data(X, y, indices):
X_split = [X[i] for i in indices]
y_split = [y[i] for i in indices]
return X_split, y_split

def k_fold_cv(model, X, y, k=5, metric='mse', random_seed=None):
if random_seed is not None:
random.seed(random_seed)

indices = list(range(len(X)))
random.shuffle(indices)
fold_size = len(X) // k
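    # Integer division: the len(X) % k leftover rows never appear in a test fold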
scores = []

for i in range(k):
test_indices = indices[i * fold_size:(i + 1) * fold_size]
train_indices = indices[:i * fold_size] + indices[(i + 1) * fold_size:]

X_train, y_train = split_data(X, y, train_indices)
X_test, y_test = split_data(X, y, test_indices)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

if metric == 'mse':
scores.append(mean_squared_error(y_test, y_pred))
elif metric == 'r2':
scores.append(r_squared(y_test, y_pred))
else:
raise ValueError(f"Unsupported metric: {metric}")

return sum(scores) / len(scores)

def bootstrap(model, X, y, num_samples=100, metric='mse', random_seed=None):
if random_seed is not None:
random.seed(random_seed)

scores = []
n = len(X)

for _ in range(num_samples):
bootstrap_indices = [random.randint(0, n - 1) for _ in range(n)]
oob_indices = list(set(range(n)) - set(bootstrap_indices))
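        # On average ~36.8% of rows are out-of-bag (OOB) and serve as the test set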

        if not oob_indices:
            continue

        X_sample, y_sample = split_data(X, y, bootstrap_indices)
        X_oob, y_oob = split_data(X, y, oob_indices)

model.fit(X_sample, y_sample)
y_pred = model.predict(X_oob)

if metric == 'mse':
scores.append(mean_squared_error(y_oob, y_pred))
elif metric == 'r2':
scores.append(r_squared(y_oob, y_pred))
else:
raise ValueError(f"Unsupported metric: {metric}")

return sum(scores) / len(scores)

def calculate_aic(model, X, y):
model.fit(X, y)
y_pred = model.predict(X)
resid = [yt - yp for yt, yp in zip(y, y_pred)]
n = len(y)
p = len(X[0])
rss = sum(r ** 2 for r in resid)
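    # Gaussian-likelihood AIC up to an additive constant: n*ln(RSS/n) + 2p.
    # p counts only the slope coefficients; counting the intercept as well would
    # shift AIC by a constant 2 and leave model rankings unchanged.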
return n * math.log(rss / n) + 2 * p

def plot_results(y_true, y_pred):
"""
Plot observed vs. predicted values and residuals.

Parameters:
y_true: List of true target values.
y_pred: List of predicted target values.
"""
# Observed vs. Predicted
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(y_true, y_pred, alpha=0.7, edgecolors='k')
plt.plot([min(y_true), max(y_true)], [min(y_true), max(y_true)], 'r--', label="Ideal Fit")
plt.title("Observed vs Predicted")
plt.xlabel("Observed Values")
plt.ylabel("Predicted Values")
plt.legend()
plt.grid(True)

# Residuals
residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
plt.subplot(1, 2, 2)
plt.scatter(y_pred, residuals, alpha=0.7, edgecolors='k')
plt.axhline(y=0, color='r', linestyle='--', label="Zero Residual Line")
plt.title("Residuals")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.legend()
plt.grid(True)

# Show plots
plt.tight_layout()
plt.show()

class LinearRegression:
def __init__(self):
self.coef_ = []
self.intercept_ = 0

def fit(self, X, y):
n = len(X)
p = len(X[0])
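        # Append a bias column of 1s and form the normal equations
        # (X^T X) w = (X^T y); the intercept is solved as the last unknown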
X_flat = [x + [1] for x in X]
XtX = [[sum(X_flat[i][k] * X_flat[i][j] for i in range(n)) for j in range(p + 1)] for k in range(p + 1)]
Xty = [sum(X_flat[i][j] * y[i] for i in range(n)) for j in range(p + 1)]
self.coef_, self.intercept_ = self.solve_linear_system(XtX, Xty)

def predict(self, X):
return [sum(c * x for c, x in zip(self.coef_, xi)) + self.intercept_ for xi in X]

def solve_linear_system(self, A, b):
n = len(A)
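        # Plain Gaussian elimination without pivoting: divides by A[i][i],
        # which fails if X^T X is singular (e.g., perfectly collinear features)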
for i in range(n):
for j in range(i + 1, n):
factor = A[j][i] / A[i][i]
for k in range(i, n):
A[j][k] -= factor * A[i][k]
b[j] -= factor * b[i]
x = [0] * n
for i in range(n - 1, -1, -1):
x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
return x[:-1], x[-1]

if __name__ == "__main__":
# Example CSV Usage
file_path = "mlp\small_test.csv" # Replace with your CSV file path
target_column = "target" # Replace with your target column name

# Load data from CSV
X, y = load_csv("file_path", "target_column")

# Create a Linear Regression model
model = LinearRegression()

# K-Fold Cross-Validation
kfold_score = k_fold_cv(model, X, y, k=5, metric='mse', random_seed=42)
print(f"K-Fold Cross-Validation MSE: {kfold_score:.4f}")

# Bootstrapping
bootstrap_score = bootstrap(model, X, y, num_samples=100, metric='mse', random_seed=42)
print(f"Bootstrapping MSE: {bootstrap_score:.4f}")

# AIC
aic_score = calculate_aic(model, X, y)
print(f"AIC: {aic_score:.4f}")

# Fit the model and get predictions
model.fit(X, y)
y_pred = model.predict(X)

# Plot results
plot_results(y, y_pred)