DataScience Agent is a comprehensive, automated data science platform that orchestrates the entire machine learning pipeline through intelligent agents. It seamlessly integrates data cleaning, feature engineering, visualization, validation, and model training with cutting-edge tools like H2O.ai and MLflow, all presented through an intuitive interactive dashboard.
- Data Cleaning Agent: Automated data preprocessing, outlier detection, and quality assessment
- Feature Engineering Agent: Intelligent feature creation, selection, and transformation
- Data Visualization Agent: Comprehensive exploratory data analysis and interactive plots
- Validation Agent: Cross-validation, data integrity checks, and model validation
- Model Training Agent: Automated ML with H2O.ai integration and hyperparameter optimization
- End-to-End Automation: Complete ML pipeline automation from raw data to deployed models
- Interactive Dashboard: Real-time monitoring and control interface
- MLflow Integration: Experiment tracking, model versioning, and artifact management
- H2O.ai AutoML: Automated machine learning with state-of-the-art algorithms
- Scalable Architecture: Handle datasets of any size with distributed computing support
- Installation
- Quick Start
- Architecture
- Agent Details
- Configuration
- Dashboard
- API Reference
- Examples
- Contributing
- License
- Python 3.8+
- Java 8+ (for H2O.ai)
- Docker (optional, for containerized deployment)
pip install datascience-agent
git clone https://github.com/your-org/datascience-agent.git
cd datascience-agent
pip install -r requirements.txt
pip install -e .
docker pull datascience-agent:latest
docker run -p 8080:8080 datascience-agent:latest
from datascience_agent import DataScienceAgent
from datascience_agent.config import Config
# Initialize the agent
config = Config(
data_path="data/dataset.csv",
target_column="target",
project_name="my_ml_project"
)
agent = DataScienceAgent(config)
# Run the complete pipeline
results = agent.run_pipeline()
# Access results
print(f"Best Model Accuracy: {results.best_model_accuracy}")
print(f"Feature Importance: {results.feature_importance}")
# Start the dashboard server
datascience-agent --dashboard --port 8080
# Or programmatically
from datascience_agent.dashboard import launch_dashboard
launch_dashboard(port=8080, debug=True)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DataScience Agent β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β
β β Data Cleaning β β Feature Engine. β β Visualizationβ β
β β Agent β β Agent β β Agent β β
β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β
β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β
β β Validation β β Model Training β β Dashboard β β
β β Agent β β Agent β β Interface β β
β βββββββββββββββββββ βββββββββββββββββββ ββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Integration Layer β
β βββββββββββββββ βββββββββββββββ β
β β H2O.ai β β MLflow β β
β β AutoML β β Tracking β β
β βββββββββββββββ βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
from datascience_agent.agents import DataCleaningAgent
cleaner = DataCleaningAgent()
cleaned_data = cleaner.process(
data=raw_data,
strategies=['outlier_removal', 'missing_value_imputation', 'duplicate_removal']
)
Capabilities:
- Automated missing value detection and imputation
- Outlier detection using multiple algorithms (IQR, Z-score, Isolation Forest)
- Data type optimization and standardization
- Duplicate record identification and removal
- Data quality scoring and reporting
from datascience_agent.agents import FeatureEngineeringAgent
fe_agent = FeatureEngineeringAgent()
engineered_features = fe_agent.create_features(
data=cleaned_data,
target=target_column,
feature_types=['polynomial', 'interactions', 'temporal']
)
Capabilities:
- Automated feature creation (polynomial, interaction, temporal features)
- Feature selection using statistical tests and ML-based methods
- Categorical encoding (one-hot, target, ordinal)
- Feature scaling and normalization
- Dimensionality reduction (PCA, t-SNE, UMAP)
from datascience_agent.agents import VisualizationAgent
viz_agent = VisualizationAgent()
charts = viz_agent.create_comprehensive_eda(
data=data,
target=target_column,
chart_types=['correlation', 'distribution', 'trends']
)
Capabilities:
- Automated exploratory data analysis (EDA)
- Interactive plots using Plotly and Bokeh
- Statistical summaries and data profiling
- Correlation analysis and feature relationship mapping
- Custom visualization templates
from datascience_agent.agents import ValidationAgent
validator = ValidationAgent()
validation_results = validator.validate_pipeline(
data=data,
model=trained_model,
validation_strategy='stratified_kfold'
)
Capabilities:
- Cross-validation strategies (K-fold, Stratified K-fold, Time Series CV)
- Data integrity and schema validation
- Model performance validation across different metrics
- A/B testing framework integration
- Statistical significance testing
from datascience_agent.agents import ModelTrainingAgent
trainer = ModelTrainingAgent(
h2o_config={'max_models': 50, 'max_runtime_secs': 3600}
)
best_model = trainer.auto_train(
train_data=train_data,
validation_data=val_data,
target=target_column
)
Capabilities:
- H2O.ai AutoML integration for automated model selection
- Hyperparameter optimization using advanced search strategies
- Model ensemble creation and stacking
- MLflow experiment tracking and model versioning
- Automated model deployment pipeline
# Data Configuration
data:
input_path: "data/raw/dataset.csv"
output_path: "data/processed/"
target_column: "target"
test_size: 0.2
random_state: 42
# Agent Configuration
agents:
data_cleaning:
enabled: true
outlier_method: "isolation_forest"
missing_value_strategy: "iterative"
feature_engineering:
enabled: true
max_features: 1000
feature_selection_method: "mutual_info"
visualization:
enabled: true
chart_style: "plotly"
save_plots: true
validation:
enabled: true
cv_folds: 5
validation_metrics: ["accuracy", "precision", "recall", "f1"]
model_training:
enabled: true
max_models: 20
max_runtime_secs: 1800
# MLflow Configuration
mlflow:
tracking_uri: "sqlite:///mlflow.db"
experiment_name: "datascience_agent_experiments"
# H2O Configuration
h2o:
max_mem_size: "4G"
nthreads: -1
# Dashboard Configuration
dashboard:
host: "0.0.0.0"
port: 8080
debug: false
from datascience_agent.config import Config
config = Config(
data_path="data/dataset.csv",
target_column="target",
agents_config={
'data_cleaning': {'enabled': True, 'outlier_method': 'isolation_forest'},
'feature_engineering': {'enabled': True, 'max_features': 500},
'model_training': {'max_models': 10, 'max_runtime_secs': 900}
},
mlflow_config={'experiment_name': 'my_experiment'},
h2o_config={'max_mem_size': '8G'}
)
The DataScience Agent provides a comprehensive web-based dashboard for monitoring and controlling the ML pipeline.
- Real-time Pipeline Monitoring: Track progress of each agent
- Interactive Data Explorer: Browse and filter datasets
- Model Performance Metrics: Compare models with interactive charts
- Experiment Tracking: View MLflow experiments and runs
- Configuration Manager: Modify settings without code changes
- Log Viewer: Real-time application logs and debugging
# Method 1: Command line
datascience-agent --dashboard
# Method 2: Python script
from datascience_agent.dashboard import app
app.run(host='0.0.0.0', port=8080, debug=True)
# Method 3: Docker
docker run -p 8080:8080 datascience-agent:latest
Navigate to http://localhost:8080
to access the dashboard.
class DataScienceAgent:
def __init__(self, config: Config):
"""Initialize the DataScience Agent"""
def run_pipeline(self) -> PipelineResults:
"""Execute the complete ML pipeline"""
def run_agent(self, agent_name: str) -> AgentResults:
"""Run a specific agent"""
def get_results(self) -> Dict:
"""Get current pipeline results"""
class AgentBase:
def execute(self, data: DataFrame) -> AgentResults:
"""Execute agent logic"""
def validate(self, results: AgentResults) -> bool:
"""Validate agent results"""
def get_metrics(self) -> Dict:
"""Get agent performance metrics"""
# Data utilities
from datascience_agent.utils import load_data, save_data, profile_data
# Model utilities
from datascience_agent.utils import compare_models, export_model, load_model
# Visualization utilities
from datascience_agent.utils import create_plot, save_figure, plot_metrics
from datascience_agent import DataScienceAgent, Config
# Load configuration
config = Config.from_file('config/classification.yaml')
# Initialize agent
agent = DataScienceAgent(config)
# Run pipeline
results = agent.run_pipeline()
# Print results
print(f"Best Model: {results.best_model.model_id}")
print(f"Best Accuracy: {results.best_model.metrics['accuracy']:.4f}")
print(f"Features Used: {len(results.final_features)}")
# Run agents individually for custom control
agent = DataScienceAgent(config)
# Step 1: Data cleaning
cleaning_results = agent.run_agent('data_cleaning')
print(f"Data Quality Score: {cleaning_results.quality_score}")
# Step 2: Feature engineering
fe_results = agent.run_agent('feature_engineering')
print(f"Features Created: {fe_results.num_features_created}")
# Step 3: Model training with custom parameters
agent.agents['model_training'].update_config({
'max_models': 50,
'include_algos': ['GBM', 'RF', 'XGBoost']
})
training_results = agent.run_agent('model_training')
config = Config(
data_path="data/timeseries.csv",
target_column="sales",
problem_type="regression",
time_column="date",
agents_config={
'feature_engineering': {
'temporal_features': True,
'lag_features': [1, 7, 30],
'rolling_features': [7, 30, 90]
},
'validation': {
'cv_strategy': 'time_series_split',
'n_splits': 5
}
}
)
agent = DataScienceAgent(config)
results = agent.run_pipeline()
Run the test suite:
# Run all tests
pytest tests/
# Run specific test categories
pytest tests/test_agents.py
pytest tests/test_integration.py
# Run with coverage
pytest --cov=datascience_agent tests/
Dataset Size | Pipeline Time | Memory Usage | Best Accuracy |
---|---|---|---|
1K rows | 45 seconds | 512 MB | 94.2% |
10K rows | 3.2 minutes | 1.2 GB | 96.1% |
100K rows | 12.5 minutes | 4.8 GB | 97.3% |
1M rows | 45 minutes | 12 GB | 98.1% |
We welcome contributions! Please see our Contributing Guidelines for details.
# Clone repository
git clone https://github.com/your-org/datascience-agent.git
cd datascience-agent
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install development dependencies
pip install -r requirements-dev.txt
pip install -e .
# Run pre-commit hooks
pre-commit install
We use Black, isort, and flake8 for code formatting:
black datascience_agent/
isort datascience_agent/
flake8 datascience_agent/
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation: https://datascience-agent.readthedocs.io
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: support@datascience-agent.com
- H2O.ai team for the excellent AutoML platform
- MLflow contributors for experiment tracking capabilities
- The open-source data science community
- All contributors and users of this project
Made with β€οΈ by the DataScience Agent Team