Ioannis-Stamatakis/datalyst-smolagent


🔬 Datalyst Agent

Autonomous CSV Data Analysis, Powered by AI


Drop in any CSV. Get a full analysis report, charts, and insights, automatically.

The agent writes and executes its own Python code to explore your data.


What Is This?

Datalyst Agent is an agentic data analysis pipeline built on smolagents (Hugging Face) and Google Gemini 2.5 Flash. You point it at a CSV file, and it autonomously runs through a complete analysis protocol: writing and executing pandas code at each step, generating matplotlib/seaborn charts, and producing a structured written report.

The key differentiator: smolagents' CodeAgent doesn't just describe what to do. It writes Python code as its action and executes it live. Every step is a real Thought → Code → Observation loop running in a sandboxed interpreter.
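To make the Thought → Code → Observation loop concrete, here is a toy sketch. It is not smolagents' implementation: the "model" is a hypothetical scripted list of steps, and plain `exec` is used only for illustration and provides no sandboxing. It shows the shape of the loop: each code action runs, and whatever it prints becomes the observation for the next step.

```python
import contextlib
import io

# Hypothetical stand-in for the LLM: one (thought, code) pair per step.
SCRIPTED_STEPS = [
    ("Count the rows first.", "rows = [3, 5, 8]\nprint(len(rows))"),
    ("Now compute the mean.", "print(sum(rows) / len(rows))"),
]


def run_loop(steps):
    """Execute each code action and record what it printed as the observation."""
    namespace = {}  # variables persist across steps, like cells in a notebook
    history = []
    for thought, code in steps:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # toy only: plain exec is NOT a sandbox
        observation = buffer.getvalue().strip()
        history.append(
            {"thought": thought, "code": code, "observation": observation}
        )
    return history


history = run_loop(SCRIPTED_STEPS)
for step in history:
    print(step["thought"], "->", step["observation"])
```

In a real agent the observation would be sent back to the model so it can decide the next action. Here the steps are fixed in advance to keep the example self-contained.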


Sample Output

The following charts were generated autonomously by the agent on the bundled sales dataset:

  • Revenue by Region & Product Category
  • Units Sold vs Revenue (with Regression)
  • Product Category Mix
  • Correlation Heatmap

How It Works

The agent follows a 16-step analysis protocol (plus a duplicate-check sub-step, 2b) end-to-end, without any human intervention:

Step 1  →  Load CSV            shape, dtypes, missing value counts
Step 2  →  Schema detection    classify each column: numeric / categorical / datetime / text
Step 2b →  Duplicate detection flag duplicate row count and % before any stats are computed
Step 3  →  Descriptive stats   mean, median, std, min, max, Q1, Q3, skewness, kurtosis
Step 4  →  Outlier detection   IQR method for every numeric column
Step 5  →  Value counts        top-N frequency analysis for every categorical column
Step 6  →  Correlation matrix  Pearson correlations across all numeric columns
Step 7  →  Histograms          distribution + KDE overlay per numeric column → PNG
Step 8  →  Heatmap             annotated correlation heatmap → PNG
Step 9  →  Bar charts          top-N category frequencies → PNG
Step 10 →  Missing values      missing % per column → PNG
Step 11 →  Pie / donut charts  share breakdown for low-cardinality categoricals → PNG
Step 12 →  Box plots           distribution spread, optionally grouped by a categorical → PNG
Step 13 →  Stacked bar chart   [conditional] cross-tabulation of two categoricals → PNG
Step 14 →  Time series         [conditional] line chart over a datetime column → PNG
Step 15 →  Scatter + regression [conditional] scatter with R² line for correlated pairs → PNG
Step 16 →  Summary report      structured analysis_summary.txt with all findings
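As an illustration of Step 2, here is a minimal sketch of how a column could be classified into the four schema types. It is an assumption about the approach, not the project's `get_column_schema` code. The function name `classify_column` and the `max_categories` threshold are hypothetical, and it uses only the standard library rather than pandas.

```python
from datetime import datetime


def classify_column(values, max_categories=20):
    """Classify raw CSV strings as numeric, datetime, categorical, or text."""
    present = [v for v in values if v not in ("", None)]  # ignore missing cells
    if not present:
        return "text"

    def all_parse(parser):
        # True only if every non-missing value parses without error.
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numeric"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    # Few distinct values relative to the (hypothetical) threshold: categorical.
    if len(set(present)) <= max_categories:
        return "categorical"
    return "text"


print(classify_column(["1", "2.5", ""]))
print(classify_column(["2024-01-05", "2024-02-01"]))
print(classify_column(["North", "South", "North"]))
```

The order of the checks matters: numeric parsing is tried first, so a column of year-like integers is reported as numeric rather than datetime.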

The terminal streams every step live:

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 4 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 ─ Executing parsed code: ──────────────────────────────────────────────────────
  for col in numeric_columns:
      outliers = detect_outliers_iqr(filepath=csv_filepath, column=col)
      print(outliers)
 ───────────────────────────────────────────────────────────────────────────────
Execution logs:
{"column": "units_sold", "outlier_count": 8, "outlier_pct": 2.67,
 "outlier_sample": [506, 523, 501, 879, 522, 760, 616, 625]}
[Step 4: Duration 12.31 seconds | Input tokens: 26,447 | Output tokens: 1,438]
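The log above suggests that `detect_outliers_iqr` returns a JSON string with `column`, `outlier_count`, `outlier_pct`, and `outlier_sample` fields. The sketch below reproduces that output shape under stated assumptions: it takes a list of values instead of the `filepath` argument seen in the log, uses the conventional 1.5 × IQR fences (the project's multiplier is not confirmed), and relies on the standard library rather than pandas.

```python
import json
import statistics


def detect_outliers_iqr(values, column, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] and report them as JSON."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return json.dumps({
        "column": column,
        "outlier_count": len(outliers),
        "outlier_pct": round(100 * len(outliers) / len(values), 2),
        "outlier_sample": outliers[:10],  # cap the sample to keep logs short
    })


units = [10, 12, 11, 13, 12, 14, 11, 10, 13, 600]
print(detect_outliers_iqr(units, "units_sold"))
```

The percentage is computed against the total row count, which is consistent with the log: 8 outliers in a 300-row dataset is 2.67%.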

Tech Stack

Layer            Technology
Agent framework  smolagents CodeAgent
LLM              Google Gemini 2.5 Flash via LiteLLM
Data             pandas, numpy
Visualizations   matplotlib, seaborn
Config           python-dotenv

Project Structure

datalyst-agent/
│
├── data/
│   ├── sales_data.py         # 300-row sales dataset (regions, revenue, reps)
│   ├── weather_data.py       # 365-row weather dataset (5 cities, seasonal temps)
│   └── population_data.py    # 150-row population dataset (6 continents, GDP)
│
├── tools/
│   ├── data_tools.py         # load_csv_file, get_column_schema, detect_duplicates
│   ├── stats_tools.py        # descriptive stats, IQR outliers, value counts, correlation
│   ├── chart_tools.py        # histograms, heatmap, bar charts, pie/donut, box plots, time series, scatter+regression
│   └── summary_tools.py      # write_analysis_summary
│
├── docs/images/              # Sample charts (committed for README)
├── agent.py                  # CodeAgent + GeminiLiteLLMModel config
├── main.py                   # CLI entry point
├── requirements.txt
└── .env                      # Your API key (not committed)

Getting Started

1. Clone and install

git clone https://github.com/Ioannis-Stamatakis/datalyst-smolagent.git
cd datalyst-smolagent

python3 -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

2. Set your API key

# Create a .env file
echo "GEMINI_API_KEY=your_key_here" > .env

Get a free key at Google AI Studio.

3. Run

# Analyze a bundled demo dataset
python main.py --demo sales
python main.py --demo weather
python main.py --demo population

# Analyze your own CSV
python main.py --csv path/to/your/data.csv

# Custom output directory
python main.py --demo sales --output my_results/

# Generate demo CSVs without running analysis
python main.py --generate-demos
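For readers curious how such a command line could be wired up, here is a minimal `argparse` sketch covering the four flags shown above. It is an assumption, not the contents of `main.py`. Treating `--demo`, `--csv`, and `--generate-demos` as mutually exclusive and defaulting `--output` to `outputs` are illustrative choices.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the usage examples (sketch)."""
    parser = argparse.ArgumentParser(description="Datalyst Agent CLI (sketch)")
    # Assumption: exactly one input mode is chosen per run.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument(
        "--demo",
        choices=["sales", "weather", "population"],
        help="analyze one of the bundled demo datasets",
    )
    source.add_argument("--csv", help="path to your own CSV file")
    source.add_argument(
        "--generate-demos",
        action="store_true",
        help="write the demo CSVs without running an analysis",
    )
    parser.add_argument(
        "--output",
        default="outputs",  # hypothetical default
        help="directory for charts and the summary report",
    )
    return parser


args = build_parser().parse_args(["--demo", "sales", "--output", "my_results/"])
print(args.demo, args.output)
```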

Output

Each run creates a timestamped output directory:

outputs/sales_data_analysis/
├── analysis_summary.txt
├── correlation_heatmap.png
├── hist_units_sold.png          # + one per numeric column
├── bar_region.png               # + one per categorical column
├── missing_values.png
├── pie_region.png               # + one per low-cardinality categorical
├── box_revenue_by_region.png    # + one per numeric column (grouped)
├── stacked_region_by_product_category.png
├── timeseries_date.png          # if a datetime column exists
└── scatter_units_sold_vs_revenue.png  # if |r| ≥ 0.3 found

Demo Datasets

All datasets are synthetically generated; no external downloads are required.

Dataset     Rows  Notable Features
Sales       300   4 regions · 5 product categories · 8 sales reps · intentional outliers in units_sold (500–900 range) · ~5% missing in discount_pct
Weather     365   5 cities · sinusoidal seasonal temperatures · exponential precipitation distribution · sparse NaNs
Population  150   30 countries · 6 continents · right-skewed population · GDP correlated with continent

The intentional outliers and missing values are there to verify the agent actually finds them.
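A minimal sketch of how planted outliers and missing values could be generated for a dataset like the sales demo. This is an assumption, not the code in `data/sales_data.py`. The region names, the normal `units_sold` range, and the count of 8 outliers (taken from the sample log) are illustrative.

```python
import random


def make_sales_rows(n_rows=300, n_outliers=8, missing_rate=0.05, seed=42):
    """Generate synthetic sales rows with planted outliers and missing values."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    regions = ["North", "South", "East", "West"]  # hypothetical region names
    outlier_rows = set(rng.sample(range(n_rows), n_outliers))
    rows = []
    for i in range(n_rows):
        if i in outlier_rows:
            units = rng.randint(500, 900)  # planted outlier range
        else:
            units = rng.randint(5, 60)     # hypothetical normal range
        # Roughly 5% of discount values are left missing.
        missing = rng.random() < missing_rate
        discount = None if missing else round(rng.uniform(0, 30), 1)
        rows.append({
            "region": rng.choice(regions),
            "units_sold": units,
            "discount_pct": discount,
        })
    return rows


rows = make_sales_rows()
planted = sum(1 for r in rows if r["units_sold"] >= 500)
missing = sum(1 for r in rows if r["discount_pct"] is None)
print(len(rows), planted, missing)
```

Because the number of planted anomalies is known in advance, the agent's reported outlier and missing-value counts can be checked against it.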


Architecture

                          ┌─────────────────────────────────┐
   main.py ──► agent.py   │         CodeAgent Loop          │
                          │                                 │
                          │  ┌──────────┐  ┌────────────┐   │
                          │  │  Gemini  │  │  16 Tools  │   │
                          │  │ 2.5 Flash│  │ (pandas /  │   │
                          │  │(LiteLLM) │  │  mpl / sns)│   │
                          │  └────┬─────┘  └──────┬─────┘   │
                          │       │               │         │
                          │   Thought ──► Code ──► Observe  │
                          │       └───────────────┘         │
                          └─────────────────────────────────┘

Each tool is a plain Python function decorated with @tool. The agent decides which tools to call, writes the code to call them, and adapts based on the output, with no rigid orchestration required.
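As a sketch of what such a tool might look like, the example below defines a hypothetical `count_duplicate_rows` function (it is not one of the project's tools). smolagents' `@tool` decorator expects type hints and a docstring with an `Args:` section describing each parameter. The `try`/`except` fallback is only there so the sketch runs even when smolagents is not installed.

```python
try:
    from smolagents import tool  # the real decorator, when available
except ImportError:
    def tool(func):
        """Fallback no-op decorator so the sketch runs without smolagents."""
        return func


@tool
def count_duplicate_rows(filepath: str) -> str:
    """Count fully duplicated rows in a CSV file and return the result as JSON.

    Args:
        filepath: Path to the CSV file to inspect.
    """
    # Imports live inside the function so the tool is self-contained.
    import csv
    import json

    with open(filepath, newline="") as handle:
        rows = [tuple(row) for row in csv.reader(handle)][1:]  # skip header
    duplicates = len(rows) - len(set(rows))
    pct = round(100 * duplicates / len(rows), 2) if rows else 0.0
    return json.dumps({"duplicate_count": duplicates, "duplicate_pct": pct})
```

Returning a JSON string keeps the observation compact and easy for the model to read back, which matches the style of the tool output in the sample log.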


Requirements

  • Python 3.10+
  • Google Gemini API key
smolagents>=1.24.0
litellm>=1.50.0
pandas>=2.0.0
numpy>=1.26.0
matplotlib>=3.8.0
seaborn>=0.13.0
python-dotenv>=1.0.0

About

An autonomous data analysis agent powered by the smolagents framework. It takes any CSV file, explores it, writes and executes its own analysis code, generates charts, and produces a structured report, all without human intervention.
