juhii31/AutoML-Research-Assistant

AutoML Research Assistant

A comprehensive AutoML system with data preprocessing, model training, explainability, and reporting capabilities.

Project Structure

AutoML-Research-Assistant/
│
├── app.py
├── requirements.txt
├── README.md
│
├── src/
│   ├── data_preprocessing.py    ✅ Phase 1 - Complete
│   ├── automl_engine.py         ⏳ Phase 2 - In Progress (requires FLAML)
│   ├── explainability.py        ⏳ Phase 3 - Pending
│   └── report_generator.py      ⏳ Phase 4 - Pending
│
├── data/
│   ├── 1_titanic.csv               # Binary classification (Titanic)
│   ├── 2_house_prices.csv          # Regression (House Prices)
│   ├── 3_wine_quality.csv          # Multi-class classification (Wine)
│   ├── 4_credit_card.csv           # Binary classification (Credit Fraud)
│   ├── sample.csv                  # Simple demo dataset
│   └── DATASETS_INFO.md            # Dataset documentation
├── reports/
│   ├── shap_summary.png
│   ├── model_leaderboard.json
│   └── ai_report.txt

Phase 1: Data Preprocessing Module ✅

The data_preprocessing.py module provides a complete data preprocessing pipeline:

Features

  • Load CSV → pandas DataFrame
  • Identify data types and missing values
  • Handle NaN (imputation by mean/median for numeric, mode for categorical)
  • Encode categorical variables (Label/OneHot encoding with auto-strategy)
  • Scale numeric features (StandardScaler)
  • Split train/test sets
  • Output: cleaned_train.csv, cleaned_test.csv, preprocessing_metadata.json
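As an illustration only, the steps above roughly correspond to the following scikit-learn sketch. The column names and toy values are invented for the example; the module's actual internals may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy frame standing in for a loaded CSV (columns invented for illustration)
df = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0],
    "Sex": ["male", "female", "female", np.nan],
    "Survived": [0, 1, 1, 0],
})
y = df.pop("Survived")

# Impute: mean for numeric columns, mode for categorical ones
df["Age"] = SimpleImputer(strategy="mean").fit_transform(df[["Age"]]).ravel()
df["Sex"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["Sex"]]).ravel()

# Encode the categorical column, then scale the numeric one
df["Sex"] = LabelEncoder().fit_transform(df["Sex"])
df[["Age"]] = StandardScaler().fit_transform(df[["Age"]])

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=42
)
```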

Usage

from src.data_preprocessing import DataPreprocessor

# Initialize preprocessor
preprocessor = DataPreprocessor(data_dir="data", output_dir="data")

# Run complete pipeline
X_train, X_test, y_train, y_test = preprocessor.process(
    csv_path="1_titanic.csv",      # Or use: 2_house_prices.csv, 3_wine_quality.csv, 4_credit_card.csv
    target_col="Survived",          # Target column name (varies by dataset)
    test_size=0.2,
    encoding_strategy="auto"       # Options: "auto", "label", "onehot"
)

Output Files

  • data/cleaned_train.csv - Preprocessed training data
  • data/cleaned_test.csv - Preprocessed test data
  • data/preprocessing_metadata.json - Complete metadata including:
    • Data type information
    • Missing value handling
    • Encoding strategies
    • Scaling parameters
    • Feature names and counts
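One use of the metadata file is to reapply the same transforms to new data later. The sketch below round-trips metadata in the shape listed above; every key name and value here is an assumption for illustration, not the module's actual schema.

```python
import json

# Illustrative metadata in the shape described above (key names assumed)
metadata = {
    "dtypes": {"Age": "float64", "Sex": "object"},
    "missing": {"Age": {"strategy": "mean"}},
    "encoding": {"Sex": "label"},
    "scaling": {"Age": {"mean": 29.7, "std": 14.5}},
    "features": ["Age", "Sex"],
    "n_features": 2,
}

# Round-trip exactly as the module would via preprocessing_metadata.json
text = json.dumps(metadata, indent=2)
meta = json.loads(text)

# e.g. reuse the stored scaling parameters on a new value
age_scaled = (35.0 - meta["scaling"]["Age"]["mean"]) / meta["scaling"]["Age"]["std"]
```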

Installation

pip install -r requirements.txt

Available Datasets

The data/ directory contains 4 sample datasets from popular Kaggle competitions:

  1. Titanic (1_titanic.csv) - Binary classification, missing values, mixed features
  2. House Prices (2_house_prices.csv) - Regression, many features, categorical variables
  3. Wine Quality (3_wine_quality.csv) - Multi-class classification, all numeric, clean data
  4. Credit Card Fraud (4_credit_card.csv) - Binary classification, imbalanced, all numeric

See data/DATASETS_INFO.md for detailed information about each dataset.

Phase 2: AutoML Engine (FLAML) ⏳

Status: Code complete; requires FLAML installation.

The automl_engine.py module is ready to use once FLAML is installed:

pip install flaml

Features (once FLAML is installed)

  • FLAML Integration - Automated model selection and hyperparameter tuning
  • Task Support - Classification and regression tasks
  • Time Budget Control - Configurable training time limits
  • Model Evaluation - Comprehensive metrics (accuracy, F1, ROC-AUC, R², RMSE, etc.)
  • Leaderboard Generation - Track all models tried during search
  • Model Persistence - Save/load trained models

Next Steps

  • Phase 3: Explainability Module (SHAP)
  • Phase 4: GenAI Report Generator
  • Phase 5: Streamlit Dashboard
