Drop in any CSV. Get a full analysis report, charts, and insights, automatically.
The agent writes and executes its own Python code to explore your data.
Datalyst Agent is an agentic data analysis pipeline built on smolagents (HuggingFace) and Google Gemini 2.5 Flash. You point it at a CSV file; it autonomously runs through a complete analysis protocol, writing and executing pandas code at each step, generating matplotlib/seaborn charts, and producing a structured written report.
The key differentiator: smolagents' `CodeAgent` doesn't just describe what to do; it writes Python code as its action and executes it live. Every step is a real Thought → Code → Observation loop running in a sandboxed interpreter.
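To make the loop concrete, here is a toy sketch in plain Python. It is not how smolagents is implemented: the "model" is a hypothetical hard-coded list of steps, and `run_action` is an illustrative helper that uses a bare `exec` with no sandboxing. It only shows the shape of the cycle, where each code action runs in a shared namespace and its printed output becomes the observation.

```python
import contextlib
import io


def run_action(code: str, namespace: dict) -> str:
    """Execute one code action and return what it printed (the observation)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)  # plain exec: illustrative only, NOT a sandbox
    return buffer.getvalue().strip()


# Hypothetical scripted "model". A real agent would ask the LLM for each step.
scripted_steps = [
    ("Load the values first.", "values = [3, 5, 8, 13]\nprint(len(values))"),
    ("Now compute the mean.", "print(sum(values) / len(values))"),
]

namespace: dict = {}  # variables persist between steps
observations = []
for thought, code in scripted_steps:
    observation = run_action(code, namespace)
    observations.append(observation)
    print(f"Thought: {thought}\nCode: {code!r}\nObservation: {observation}\n")
```

Because the namespace is shared, the second step can use `values` defined by the first, which is what lets an agent build on earlier results.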
The following charts were generated autonomously by the agent on the bundled sales dataset:
- Revenue by Region & Product Category
- Units Sold vs Revenue (with Regression)
- Product Category Mix
- Correlation Heatmap
The agent follows a 16-step analysis protocol, end-to-end, without any human intervention:

| Step | Stage | Details |
|---|---|---|
| 1 | Load CSV | shape, dtypes, missing value counts |
| 2 | Schema detection | classify each column: numeric / categorical / datetime / text |
| 2b | Duplicate detection | flag duplicate row count and % before any stats are computed |
| 3 | Descriptive stats | mean, median, std, min, max, Q1, Q3, skewness, kurtosis |
| 4 | Outlier detection | IQR method for every numeric column |
| 5 | Value counts | top-N frequency analysis for every categorical column |
| 6 | Correlation matrix | Pearson correlations across all numeric columns |
| 7 | Histograms | distribution + KDE overlay per numeric column → PNG |
| 8 | Heatmap | annotated correlation heatmap → PNG |
| 9 | Bar charts | top-N category frequencies → PNG |
| 10 | Missing values | missing % per column → PNG |
| 11 | Pie / donut charts | share breakdown for low-cardinality categoricals → PNG |
| 12 | Box plots | distribution spread, optionally grouped by a categorical → PNG |
| 13 | Stacked bar chart [conditional] | cross-tabulation of two categoricals → PNG |
| 14 | Time series [conditional] | line chart over a datetime column → PNG |
| 15 | Scatter + regression [conditional] | scatter with R² line for correlated pairs → PNG |
| 16 | Summary report | structured analysis_summary.txt with all findings |

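
As an illustration of Steps 2 and 2b, the sketch below classifies columns and counts duplicate rows using only the standard library. The project's tools are pandas-based, so this is a stand-in rather than their implementation; `classify_column`, the "few distinct values" heuristic, and the sample CSV are all assumptions made for the example.

```python
import csv
import io
from collections import Counter
from datetime import datetime

SAMPLE_CSV = """\
date,region,units_sold,note
2024-01-05,North,12,first order of the year
2024-01-06,South,7,customer asked for invoice
2024-01-06,South,7,customer asked for invoice
2024-01-07,North,,late delivery
"""


def classify_column(values: list[str]) -> str:
    """Label a column numeric, datetime, categorical, or text from its non-empty values."""
    present = [v for v in values if v.strip()]
    if not present:
        return "text"

    def all_parse(parser) -> bool:
        try:
            for v in present:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numeric"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    # Assumed heuristic: few distinct values relative to row count means categorical.
    if len(set(present)) <= max(2, len(present) // 2):
        return "categorical"
    return "text"


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
schema = {name: classify_column([row[name] for row in rows]) for name in rows[0]}

# Count every repeat of a row after its first occurrence.
row_counts = Counter(tuple(row.values()) for row in rows)
duplicate_rows = sum(count - 1 for count in row_counts.values())
duplicate_pct = round(100 * duplicate_rows / len(rows), 2)

print(schema)
print(duplicate_rows, duplicate_pct)
```

Counting duplicates before any statistics are computed matters because repeated rows would otherwise inflate counts and skew means.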
The terminal streams every step live:

    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Step 4 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    ─ Executing parsed code: ──────────────────────────────────────────────────────
      for col in numeric_columns:
          outliers = detect_outliers_iqr(filepath=csv_filepath, column=col)
          print(outliers)
    ───────────────────────────────────────────────────────────────────────────────
    Execution logs:
    {"column": "units_sold", "outlier_count": 8, "outlier_pct": 2.67,
     "outlier_sample": [506, 523, 501, 879, 522, 760, 616, 625]}
    [Step 4: Duration 12.31 seconds | Input tokens: 26,447 | Output tokens: 1,438]

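The `detect_outliers_iqr` call in the log presumably applies the usual 1.5 × IQR rule. A minimal standard-library sketch of that rule, returning a dict shaped like the logged output, might look like this. The function name `iqr_outliers`, its list-based signature, the 10-item sample cap, and the sample data are assumptions; the real tool takes a file path and a column name.

```python
import statistics


def iqr_outliers(column: str, values: list[float], k: float = 1.5) -> dict:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] and summarise them."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {
        "column": column,
        "outlier_count": len(outliers),
        "outlier_pct": round(100 * len(outliers) / len(values), 2),
        "outlier_sample": outliers[:10],  # assumed cap on the sample size
    }


# Made-up data: 18 ordinary values plus two planted extremes.
units = [12, 15, 14, 13, 16, 15, 14, 13, 15, 14,
         16, 12, 13, 15, 14, 13, 14, 15, 520, 760]
result = iqr_outliers("units_sold", units)
print(result)
```

Here Q1 is 13 and Q3 is 15, so the fences sit at 10 and 18 and only the two planted values fall outside them.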
| Layer | Technology |
|---|---|
| Agent framework | smolagents CodeAgent |
| LLM | Google Gemini 2.5 Flash via LiteLLM |
| Data | pandas, numpy |
| Visualizations | matplotlib, seaborn |
| Config | python-dotenv |

    datalyst-agent/
    │
    ├── data/
    │   ├── sales_data.py        # 300-row sales dataset (regions, revenue, reps)
    │   ├── weather_data.py      # 365-row weather dataset (5 cities, seasonal temps)
    │   └── population_data.py   # 150-row population dataset (6 continents, GDP)
    │
    ├── tools/
    │   ├── data_tools.py        # load_csv_file, get_column_schema, detect_duplicates
    │   ├── stats_tools.py       # descriptive stats, IQR outliers, value counts, correlation
    │   ├── chart_tools.py       # histograms, heatmap, bar charts, pie/donut, box plots, time series, scatter+regression
    │   └── summary_tools.py     # write_analysis_summary
    │
    ├── docs/images/             # Sample charts (committed for README)
    ├── agent.py                 # CodeAgent + GeminiLiteLLMModel config
    ├── main.py                  # CLI entry point
    ├── requirements.txt
    └── .env                     # Your API key (not committed)


Clone the repository and install the dependencies:

    git clone https://github.com/yourusername/datalyst-agent.git
    cd datalyst-agent
    python3 -m venv venv
    source venv/bin/activate   # Windows: venv\Scripts\activate
    pip install -r requirements.txt

Create a `.env` file containing your Gemini API key:

    echo "GEMINI_API_KEY=your_key_here" > .env

Get a free key at Google AI Studio.


    # Analyze a bundled demo dataset
    python main.py --demo sales
    python main.py --demo weather
    python main.py --demo population

    # Analyze your own CSV
    python main.py --csv path/to/your/data.csv

    # Custom output directory
    python main.py --demo sales --output my_results/

    # Generate demo CSVs without running analysis
    python main.py --generate-demos

Each run creates a timestamped output directory:

    outputs/sales_data_analysis/
    ├── analysis_summary.txt
    ├── correlation_heatmap.png
    ├── hist_units_sold.png                      # + one per numeric column
    ├── bar_region.png                           # + one per categorical column
    ├── missing_values.png
    ├── pie_region.png                           # + one per low-cardinality categorical
    ├── box_revenue_by_region.png                # + one per numeric column (grouped)
    ├── stacked_region_by_product_category.png
    ├── timeseries_date.png                      # if a datetime column exists
    └── scatter_units_sold_vs_revenue.png        # if |r| ≥ 0.3 found

All datasets are synthetically generated; no external downloads required.
| Dataset | Rows | Notable Features |
|---|---|---|
| Sales | 300 | 4 regions · 5 product categories · 8 sales reps · intentional outliers in units_sold (500–900 range) · ~5% missing in discount_pct |
| Weather | 365 | 5 cities · sinusoidal seasonal temperatures · exponential precipitation distribution · sparse NaNs |
| Population | 150 | 30 countries · 6 continents · right-skewed population · GDP correlated with continent |
The intentional outliers and missing values are there to verify the agent actually finds them.
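One way to plant anomalies so that a detector's output can be checked against known counts is sketched below. The generators bundled in `data/` may work differently: the function name, the "normal" value ranges, and the seed are assumptions. The figure of 8 outliers is borrowed from the outlier count in the Step 4 log, and 5% is the missing rate the table gives for `discount_pct`.

```python
import random


def make_sales_rows(n_rows=300, n_outliers=8, missing_frac=0.05, seed=42):
    """Build rows with planted outliers in units_sold and blanks in discount_pct."""
    rng = random.Random(seed)  # seeded so every run yields the same data
    rows = [
        {
            "units_sold": rng.randint(1, 60),  # assumed ordinary range
            "discount_pct": round(rng.uniform(0, 30), 1),
        }
        for _ in range(n_rows)
    ]
    # Plant a known number of extreme values so detection can be checked exactly.
    outlier_idx = rng.sample(range(n_rows), n_outliers)
    for i in outlier_idx:
        rows[i]["units_sold"] = rng.randint(500, 900)
    # Blank out a fixed share of discount_pct to simulate missing values.
    for i in rng.sample(range(n_rows), round(n_rows * missing_frac)):
        rows[i]["discount_pct"] = None
    return rows, sorted(outlier_idx)


rows, planted_idx = make_sales_rows()
found_idx = [i for i, row in enumerate(rows) if row["units_sold"] >= 500]
missing = sum(row["discount_pct"] is None for row in rows)
print(len(found_idx), missing)
```

Returning the planted indices alongside the rows gives a ground truth to compare the agent's findings against.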

                         ┌─────────────────────────────────┐
    main.py ──► agent.py │         CodeAgent Loop          │
                         │                                 │
                         │  ┌──────────┐  ┌────────────┐   │
                         │  │  Gemini  │  │  16 Tools  │   │
                         │  │ 2.5 Flash│  │ (pandas /  │   │
                         │  │(LiteLLM) │  │  mpl / sns)│   │
                         │  └────┬─────┘  └─────┬──────┘   │
                         │       │              │          │
                         │  Thought ──► Code ──► Observe   │
                         │       └──────────────┘          │
                         └─────────────────────────────────┘

Each tool is a plain Python function decorated with `@tool`. The agent decides which tools to call, writes the code to call them, and adapts based on the output; no rigid orchestration is required.
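As a sketch of what such a tool could look like, the hypothetical function below counts missing values in one CSV column and returns JSON, in the same style as the `detect_outliers_iqr` output in the Step 4 log. It is not one of the project's tools. The smolagents `@tool` decorator expects type hints and a docstring with an `Args:` section; the import fallback is only there so the example runs without smolagents installed.

```python
import csv
import json

try:
    from smolagents import tool
except ImportError:
    # Fallback so this sketch still runs where smolagents is not installed.
    def tool(func):
        return func


@tool
def count_missing_values(filepath: str, column: str) -> str:
    """Count empty cells in one column of a CSV file and return the result as JSON.

    Args:
        filepath: Path to the CSV file to inspect.
        column: Name of the column to check for missing values.
    """
    with open(filepath, newline="", encoding="utf-8") as handle:
        values = [row[column] for row in csv.DictReader(handle)]
    missing = sum(1 for value in values if value is None or not value.strip())
    pct = round(100 * missing / len(values), 2) if values else 0.0
    return json.dumps({"column": column, "missing_count": missing, "missing_pct": pct})
```

The agent would then call it from generated code in the same way as the logged tools, for example `count_missing_values(filepath=csv_filepath, column=col)`.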
- Python 3.10+
- Google Gemini API key

Python packages:

- smolagents>=1.24.0
- litellm>=1.50.0
- pandas>=2.0.0
- numpy>=1.26.0
- matplotlib>=3.8.0
- seaborn>=0.13.0
- python-dotenv>=1.0.0



