AutoData Agent is a modular and autonomous data assistant designed to:
- 🧹 Inspect raw datasets (missing values, data types, anomalies)
- 🧼 Clean and preprocess data (imputation, encoding, scaling)
- 📊 Visualize key statistical properties (distributions, correlations, boxplots)
- 🤖 Train machine learning models automatically (classification or regression)
- ⚙️ Tune model hyperparameters with GridSearchCV
This project demonstrates how a Hugging Face agent can orchestrate an end-to-end Data Science pipeline, making smart decisions and reasoning about the dataset structure.
datacleaner-agent/
├── app.py # Main entry point
├── agent_logic.py # Builds the Hugging Face Agent (model + tools)
├── tools/
│ ├── inspect.py # InspectTool: automatic EDA (plots + summary)
│ ├── cleaning.py # CleaningTool: data cleaning & encoding
│ └── train.py # TrainTool: automatic ML training + evaluation
├── test_tools.py # Local tests for each tool
├── requirements.txt
└── README.md
InspectTool (tools/inspect.py) performs an Exploratory Data Analysis (EDA):
- Displays dataset info (shape, dtypes, missing values)
- Generates histograms, boxplots, and correlation heatmaps
- Detects data imbalances and null distributions
Example output:
- `df.info()`, `df.describe()`
- Automatic visualizations for numeric variables
- Summary of missing data
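The summary part of the inspection step can be sketched with pandas alone. This is a minimal illustration, not the project's actual InspectTool: the function name `inspect_summary` and the sample frame are assumptions made for the example, and plotting is left out.

```python
import pandas as pd


def inspect_summary(df: pd.DataFrame) -> dict:
    """Collect the basic facts an EDA step would report (hypothetical helper)."""
    missing = df.isna().sum()
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Keep only the columns that actually contain missing values.
        "missing": missing[missing > 0].to_dict(),
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
    }


# Tiny made-up frame, used only to exercise the function.
sample = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, 41.0],
        "fare": [7.25, 71.28, 8.05, 13.0],
        "sex": ["male", "female", None, "female"],
    }
)
summary = inspect_summary(sample)
print(summary["shape"])
print(summary["missing"])
```

The list in `numeric_columns` is what a plotting step would iterate over to draw histograms and boxplots.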
CleaningTool (tools/cleaning.py) cleans and prepares the dataset:
- Removes duplicates
- Handles missing values (median or mode)
- Encodes categorical variables (LabelEncoder or OneHotEncoder)
- Scales numerical columns (StandardScaler)
- Detects and drops low-variance features
Goal: produce a dataset ready for model training.
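The cleaning steps listed above could look roughly like the sketch below. It is an assumption-laden illustration rather than the code in tools/cleaning.py: `clean_dataframe` is a hypothetical name, one-hot encoding is done with `pd.get_dummies` instead of an sklearn encoder, and "low variance" is simplified to "constant column".

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def clean_dataframe(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Deduplicate, impute, encode and scale every column except the target."""
    df = df.drop_duplicates().reset_index(drop=True)
    features = df.drop(columns=[target])

    numeric_cols = features.select_dtypes(include="number").columns
    categorical_cols = features.columns.difference(numeric_cols)

    # Median for numeric columns, mode (most frequent value) for the others.
    for col in numeric_cols:
        features[col] = features[col].fillna(features[col].median())
    for col in categorical_cols:
        features[col] = features[col].fillna(features[col].mode().iloc[0])

    # Simplified low-variance filter: drop numeric columns with a single value.
    constant_cols = [c for c in numeric_cols if features[c].nunique() <= 1]
    features = features.drop(columns=constant_cols)
    numeric_cols = [c for c in numeric_cols if c not in constant_cols]

    # Standardize the remaining numeric columns (zero mean, unit variance).
    if numeric_cols:
        features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])

    # One-hot encode the categorical columns.
    if len(categorical_cols):
        features = pd.get_dummies(features, columns=list(categorical_cols))

    features[target] = df[target]
    return features


# Made-up data: one duplicate row, missing values, and a constant column.
raw = pd.DataFrame(
    {
        "age": [22.0, None, 35.0, 35.0, 58.0],
        "deck": ["A", "B", None, None, "B"],
        "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
        "survived": [0, 1, 1, 1, 0],
    }
)
cleaned = clean_dataframe(raw, target="survived")
print(cleaned.columns.tolist())
```

The target column is set aside before imputation and scaling so that the label values reach the training step unchanged.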
TrainTool (tools/train.py) automatically trains a model based on the target variable type:
- Detects whether the task is classification or regression
- Chooses the appropriate RandomForest model
- Runs GridSearchCV to optimize hyperparameters
- Splits data into train/test (80/20)
- Displays key metrics:
  - Classification: `Accuracy`, `Precision`, `Recall`, `F1`
  - Regression: `RMSE`, `R²`
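One way to wire these training steps together with scikit-learn is sketched below. The task-detection heuristic (non-numeric target, or at most 20 distinct values, means classification), the parameter grid, and the names `is_classification` and `train_auto` are assumptions made for illustration; the project's TrainTool may decide differently. Precision and recall are omitted to keep the sketch short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split


def is_classification(y: pd.Series, max_classes: int = 20) -> bool:
    """Heuristic (assumed): non-numeric or few distinct values means classification."""
    if not pd.api.types.is_numeric_dtype(y):
        return True
    return y.nunique() <= max_classes


def train_auto(X: pd.DataFrame, y: pd.Series) -> dict:
    classification = is_classification(y)
    # 80/20 split; the grid search cross-validates on the training part only.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    if classification:
        estimator = RandomForestClassifier(random_state=42)
    else:
        estimator = RandomForestRegressor(random_state=42)

    # Deliberately small grid so the example runs quickly.
    param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
    search = GridSearchCV(estimator, param_grid, cv=3)
    search.fit(X_train, y_train)
    predictions = search.predict(X_test)

    if classification:
        metrics = {
            "accuracy": accuracy_score(y_test, predictions),
            "f1": f1_score(y_test, predictions, average="weighted"),
        }
    else:
        metrics = {
            "rmse": mean_squared_error(y_test, predictions) ** 0.5,
            "r2": r2_score(y_test, predictions),
        }
    return {
        "task": "classification" if classification else "regression",
        "best_params": search.best_params_,
        "metrics": metrics,
    }


# Synthetic data stands in for a cleaned dataset.
X_arr, y_arr = make_classification(n_samples=200, n_features=6, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
y = pd.Series(y_arr, name="target")
result = train_auto(X, y)
print(result["task"], result["best_params"])
```

Splitting before the grid search keeps the 20% test set out of hyperparameter selection, so the reported metrics are not inflated by tuning.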
- Python ≥ 3.11
- Virtual environment recommended
git clone https://github.com/MeidiLprog/datacleaner-agent.git
cd datacleaner-agent
pip install -r requirements.txt
python app.py
The agent will:
- Load the Titanic dataset (default)
- Inspect and clean it automatically
- Train a predictive model on `Survived`
- Output key metrics and model summary
Expected Output:
Dataset successfully loaded ! (891, 12)
Agent ready !
Inspecting dataset...
Cleaning done...
GridSearch training...
Accuracy: 0.84
F1 Score: 0.81
Runs the reasoning model via the Hugging Face Inference API.
Set your token:
export HUGGINGFACEHUB_API_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxx"
Then in `agent_logic.py`:
model = LiteLLMModel(model_id="huggingface/mistralai/Mistral-7B-Instruct-v0.2")
If you prefer running locally:
- Install Ollama
- Pull the model:
  ollama pull qwen2:1.5b
- Replace the model in `agent_logic.py`:
  model = LiteLLMModel(model_id="ollama/qwen2:1.5b")
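If switching between the two backends by hand becomes tedious, the choice can be derived from the environment. The sketch below is only a suggestion: `choose_model_id` is a hypothetical helper that is not part of the repository, and the fallback rule (use Ollama whenever no token is set) is an assumption.

```python
import os

HF_MODEL_ID = "huggingface/mistralai/Mistral-7B-Instruct-v0.2"
OLLAMA_MODEL_ID = "ollama/qwen2:1.5b"


def choose_model_id(env=None) -> str:
    """Return the cloud model id when a token is set, else the local Ollama id."""
    env = os.environ if env is None else env
    token = env.get("HUGGINGFACEHUB_API_TOKEN", "").strip()
    return HF_MODEL_ID if token else OLLAMA_MODEL_ID


# Explicit dictionaries are passed here so the example does not depend on the shell.
print(choose_model_id({"HUGGINGFACEHUB_API_TOKEN": "hf_example"}))
print(choose_model_id({}))
```

The returned string can then be passed as `model_id` to `LiteLLMModel` in `agent_logic.py`.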
| Stack | Purpose |
|---|---|
| 🤗 Hugging Face SmolAgents | Agent orchestration |
| 🔮 LiteLLM / HfApiModel | LLM reasoning |
| 🧹 Pandas / NumPy | Data wrangling |
| 📊 Matplotlib / Seaborn | Data visualization |
| ⚙️ Scikit-Learn | Model training & GridSearch |
| 💻 Ollama | Local LLM inference (offline mode) |
| Visualization | Description |
|---|---|
| ![]() | Automatic data histograms & boxplots |
| ![]() | Correlation matrix |
| ![]() | Model training output |
Lefki Meidi
🎓 Data Science & Machine Learning Engineer
💬 LinkedIn • GitHub • HuggingFace
- Built entirely from scratch in less than 24h
- Modular architecture (plug & play tools)
- Hugging Face AI agent integrated locally and via cloud API
- Fully autonomous workflow: from raw data → cleaned dataset → trained model
- Ideal for data preprocessing automation or teaching agent reasoning
Special thanks to Hugging Face for the SmolAgents framework, and the open-source community for making AI accessible.
“Why spend hours cleaning data when your agent can do it for you?”


