A comprehensive AutoML system with data preprocessing, model training, explainability, and reporting capabilities.
AutoML-Research-Assistant/
│
├── app.py
├── requirements.txt
├── README.md
│
├── src/
│   ├── data_preprocessing.py    ✅ Phase 1 - Complete
│   ├── automl_engine.py         ⏳ Phase 2 - In Progress (requires FLAML)
│   ├── explainability.py        ⏳ Phase 3 - Pending
│   └── report_generator.py      ⏳ Phase 4 - Pending
│
├── data/
│   ├── 1_titanic.csv            # Binary classification (Titanic)
│   ├── 2_house_prices.csv       # Regression (House Prices)
│   ├── 3_wine_quality.csv       # Multi-class classification (Wine)
│   ├── 4_credit_card.csv        # Binary classification (Credit Fraud)
│   ├── sample.csv               # Simple demo dataset
│   └── DATASETS_INFO.md         # Dataset documentation
│
└── reports/
    ├── shap_summary.png
    ├── model_leaderboard.json
    └── ai_report.txt

The data_preprocessing.py module provides a complete data preprocessing pipeline:
- Load CSV → pandas DataFrame
- Identify data types and missing values
- Handle NaN (imputation by mean/median for numeric, mode for categorical)
- Encode categorical variables (Label/OneHot encoding with auto-strategy)
- Scale numeric features (StandardScaler)
- Split train/test sets
- Output: cleaned_train.csv, cleaned_test.csv, preprocessing_metadata.json
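
The steps listed above can be sketched roughly as follows with pandas and scikit-learn. This is a minimal illustration under stated assumptions, not the code in data_preprocessing.py: the `preprocess` function, the toy DataFrame, and its column names are hypothetical, and one-hot encoding stands in for the module's "auto" strategy.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for a CSV loaded with pd.read_csv(...)
df = pd.DataFrame({
    "age": [22.0, None, 35.0, 58.0, 41.0, 29.0, None, 47.0],
    "fare": [7.25, 71.28, 8.05, 26.55, 13.0, 8.46, 51.86, 21.08],
    "sex": ["male", "female", None, "female", "male", "male", "female", "male"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 0],
})


def preprocess(df, target_col, test_size=0.25, random_state=0):
    """Impute, encode, scale, then split (hypothetical helper, not the project's API)."""
    X = df.drop(columns=[target_col]).copy()
    y = df[target_col]

    # Identify numeric vs. categorical columns
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in X.columns if c not in numeric_cols]

    # Impute: median for numeric columns, mode for categorical columns
    for col in numeric_cols:
        X[col] = X[col].astype(float).fillna(X[col].median())
    for col in categorical_cols:
        X[col] = X[col].fillna(X[col].mode().iloc[0])

    # Encode categorical columns (one-hot is an assumption of this sketch)
    X = pd.get_dummies(X, columns=categorical_cols, dtype=float)

    # Scale numeric features to zero mean and unit variance
    X[numeric_cols] = StandardScaler().fit_transform(X[numeric_cols])

    # Split into train and test sets
    return train_test_split(X, y, test_size=test_size, random_state=random_state)


X_train, X_test, y_train, y_test = preprocess(df, target_col="survived")
print(X_train.shape, X_test.shape)
```

The sketch follows the listed order for readability. Fitting the imputation values and the scaler on the training split only is a common variant that avoids leaking test-set statistics into training.
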
from src.data_preprocessing import DataPreprocessor

# Initialize preprocessor
preprocessor = DataPreprocessor(data_dir="data", output_dir="data")

# Run complete pipeline
X_train, X_test, y_train, y_test = preprocessor.process(
    csv_path="1_titanic.csv",  # Or use: 2_house_prices.csv, 3_wine_quality.csv, 4_credit_card.csv
    target_col="Survived",     # Target column name (varies by dataset)
    test_size=0.2,
    encoding_strategy="auto"   # Options: "auto", "label", "onehot"
)

Output files:
- data/cleaned_train.csv - Preprocessed training data
- data/cleaned_test.csv - Preprocessed test data
- data/preprocessing_metadata.json - Complete metadata including:
  - Data type information
  - Missing value handling
  - Encoding strategies
  - Scaling parameters
  - Feature names and counts
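
To see what the metadata file actually contains, something like the following may help. It is a small sketch that assumes only that preprocessing_metadata.json holds a JSON object at the top level; the exact key names are not documented here, so the hypothetical helper `summarize_metadata` reports whatever keys it finds instead of assuming a schema.

```python
import json
from pathlib import Path


def summarize_metadata(path):
    """Map each top-level key of a JSON metadata file to the type of its value."""
    with Path(path).open(encoding="utf-8") as fh:
        metadata = json.load(fh)
    if not isinstance(metadata, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in metadata.items()}


# Only runs if the preprocessing pipeline has already written the file
metadata_path = Path("data/preprocessing_metadata.json")
if metadata_path.exists():
    for key, kind in summarize_metadata(metadata_path).items():
        print(f"{key}: {kind}")
```
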
pip install -r requirements.txt

The data/ directory contains four sample datasets from popular Kaggle competitions:
- Titanic (1_titanic.csv) - Binary classification, missing values, mixed features
- House Prices (2_house_prices.csv) - Regression, many features, categorical variables
- Wine Quality (3_wine_quality.csv) - Multi-class classification, all numeric, clean data
- Credit Card Fraud (4_credit_card.csv) - Binary classification, imbalanced, all numeric
See data/DATASETS_INFO.md for detailed information about each dataset.
Status: Code complete, requires FLAML installation
The automl_engine.py module is ready but requires FLAML to be installed. Install with: pip install flaml
- FLAML Integration - Automated model selection and hyperparameter tuning
- Task Support - Classification and regression tasks
- Time Budget Control - Configurable training time limits
- Model Evaluation - Comprehensive metrics (accuracy, F1, ROC-AUC, R², RMSE, etc.)
- Leaderboard Generation - Track all models tried during search
- Model Persistence - Save/load trained models
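
As an illustration of the evaluation metrics named above, the sketch below computes them with scikit-learn on toy predictions. It does not call FLAML or automl_engine.py; the function names `classification_metrics` and `regression_metrics` and the sample values are hypothetical, and the F1 computation assumes a binary target.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    mean_squared_error,
    r2_score,
    roc_auc_score,
)


def classification_metrics(y_true, y_pred, y_score):
    """Accuracy, F1, and ROC-AUC for a binary classifier.

    y_pred holds predicted labels; y_score holds positive-class scores.
    """
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }


def regression_metrics(y_true, y_pred):
    """R² and RMSE for a regressor (RMSE is the square root of the MSE)."""
    mse = mean_squared_error(y_true, y_pred)
    return {"r2": r2_score(y_true, y_pred), "rmse": float(np.sqrt(mse))}


# Toy values for illustration only
clf = classification_metrics(
    y_true=[0, 1, 1, 0],
    y_pred=[0, 1, 0, 0],
    y_score=[0.1, 0.9, 0.4, 0.3],
)
reg = regression_metrics(y_true=[3.0, 5.0, 7.0], y_pred=[2.5, 5.0, 7.5])
print(clf)
print(reg)
```

In practice the labels and scores would come from a trained model, for example from its `predict` and `predict_proba` outputs on the held-out test set.
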
Upcoming phases:
- Phase 3: Explainability Module (SHAP)
- Phase 4: GenAI Report Generator
- Phase 5: Streamlit Dashboard