CSVCleaner is a small FastAPI service that profiles a CSV and can clean it end-to-end using an LLM-generated cleaning plan.
The main idea is an agentic workflow that stays reproducible:
- The LLM proposes a structured cleaning plan (JSON)
- The plan is validated (schema + semantic checks)
- Execution is deterministic via pandas transforms
- Artifacts are saved (cleaned csv, plan json, report json)
This tool does not directly modify your data or run arbitrary code.
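To make the plan-then-execute idea concrete, here is a minimal sketch. The op names, plan shape, and `TRANSFORMS` table below are illustrative only, not the project's actual schema (that lives in src/llm/schemas.py and src/tools/transforms.py): the LLM can only select from a whitelist of deterministic pandas transforms, so no arbitrary code ever runs.

```python
import pandas as pd

# Hypothetical plan shape -- the real schema lives in src/llm/schemas.py.
plan = {
    "steps": [
        {"op": "drop_duplicates", "params": {}},
        {"op": "fill_missing", "params": {"column": "score", "value": 0}},
    ]
}

# Whitelist of deterministic transforms; execution only dispatches to these.
TRANSFORMS = {
    "drop_duplicates": lambda df, **_: df.drop_duplicates(),
    "fill_missing": lambda df, column, value: df.fillna({column: value}),
}

def execute(df: pd.DataFrame, plan: dict) -> pd.DataFrame:
    # Apply each step in order; unknown ops would raise a KeyError.
    for step in plan["steps"]:
        df = TRANSFORMS[step["op"]](df, **step["params"])
    return df

df = pd.DataFrame({"score": [1.0, None, 1.0]})
cleaned = execute(df, plan)
```

Because the same plan JSON always dispatches to the same pandas calls, re-running a saved plan reproduces the same cleaned output.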
Swagger UI screenshot:
This API is deployed on Render:
- Base URL: https://csvcleaner.onrender.com
- Docs: https://csvcleaner.onrender.com/docs
- Health: https://csvcleaner.onrender.com/health
Benchmarks were run against the deployed service after a warm-up request to avoid cold-start skew.
- `GET /health` (k6, 10 VUs, 30s)
  - p95 latency: 68.16ms
  - throughput: ~101.10 req/s (~6,066 req/min)
  - error rate: 0% (0/3043)
- `POST /clean/basic` (50-run file upload benchmark, Windows PowerShell + curl, sample IMDb CSV)
  - p95 latency: 0.360s
  - throughput: ~203 req/min
  - errors: 0/50
Explore everything interactively at /docs.
- `GET /health`: Basic health check.
- `POST /profile`: Upload a CSV and get a dataset profile (shape, missingness, preview rows).
- `POST /plan`: Upload a CSV and get an LLM cleaning plan (validated and saved).
- `POST /clean/basic`: Deterministic cleaning only (no LLM).
- `POST /clean/llm`: Full pipeline: profile -> plan -> validate -> execute -> save artifacts.
- `GET /jobs/{job_id}`: Read the saved report JSON for a prior run.
- `GET /jobs/{job_id}/cleaned.csv`: Download the cleaned CSV artifact for a prior run.
Example outputs are committed so you can review the project without running it:
- reports/examples/profile_example.json
- reports/examples/plan_example.json
- reports/examples/report_example.json
These were generated using the included sample dataset:
data/raw/messy_IMDB_dataset.csv
Dataset source (Kaggle):
Create and activate a virtual environment, then install dependencies.
Windows (PowerShell):
```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -r requirements.txt
```
Mac/Linux:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
This project only needs one env var: OPENAI_API_KEY
Create a local .env file in the repo root (not tracked by git):
```
OPENAI_API_KEY=your_key_here
```
From the repo root:
```bash
uvicorn src.api.app:app --reload
```
Then open:
Mac/Linux/GitBash:
```bash
chmod +x scripts/demo_local.sh
./scripts/demo_local.sh
```
The script calls /health, /profile, /plan, and /clean/llm using the sample CSV and saves results to reports/examples/.
```
csv-cleaner/
  data/
    raw/
      messy_IMDB_dataset.csv
    processed/ (not tracked)
  reports/
    figures/
      docs_example.png
    examples/
      report_example.json
      plan_example.json
      profile_example.json
  outputs/ (not tracked)
    cleaned/
    reports/
    plans/
  src/
    api/
      app.py (FastAPI app + router wiring)
      routes.py (endpoints)
    llm/
      client.py (OpenAI wrapper)
      prompts.py (prompt templates)
      schemas.py (cleaning plan schema)
    pipeline/
      profile.py (dataset profiling)
      planner.py (LLM -> plan)
      executor.py (apply plan)
      validate.py (semantic plan validation)
      artifacts.py (write artifacts)
    tools/
      transforms.py (cleaning transforms)
  tests/
    conftest.py
    test_api_health.py
    test_clean_and_jobs.py
    test_llm_prompts_and_schema.py
  scripts/
    demo_local.sh
  .github/workflows/ci.yml
  .env.example
  .gitignore
  README.md
  requirements.txt
```
`/profile`
- Read the uploaded CSV into pandas
- Compute missingness and preview rows
- Return a profile object
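The steps above can be sketched as follows. The field names (`shape`, `missingness`, `preview`) are illustrative and may differ from the real profile object built in src/pipeline/profile.py:

```python
import io

import pandas as pd

def profile_csv(raw: bytes, preview_rows: int = 5) -> dict:
    # Read the uploaded bytes into a DataFrame.
    df = pd.read_csv(io.BytesIO(raw))
    return {
        "shape": {"rows": len(df), "columns": df.shape[1]},
        # Fraction of missing values per column.
        "missingness": df.isna().mean().round(4).to_dict(),
        # First few rows as plain dicts, ready for a JSON response.
        "preview": df.head(preview_rows).to_dict(orient="records"),
    }

sample = b"title,score\nA,9.0\nB,\nC,7.5\n"
result = profile_csv(sample)
```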
`/plan`
- Build a prompt using the profile
- Call OpenAI for a structured JSON plan
- Validate plan semantics (422 if invalid)
- Save plan to outputs/plans/{job_id}.json
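A semantic check beyond plain JSON-schema validation might look like the sketch below. The allowed op names and error messages are hypothetical; the actual checks live in src/pipeline/validate.py:

```python
ALLOWED_OPS = {"drop_duplicates", "fill_missing", "rename_column"}  # illustrative

def validate_plan(plan: dict, columns: list[str]) -> list[str]:
    """Return a list of semantic errors; an empty list means the plan is valid."""
    errors = []
    for i, step in enumerate(plan.get("steps", [])):
        op = step.get("op")
        if op not in ALLOWED_OPS:
            errors.append(f"step {i}: unknown op {op!r}")
        # Any column a step targets must actually exist in the dataset.
        col = step.get("params", {}).get("column")
        if col is not None and col not in columns:
            errors.append(f"step {i}: column {col!r} not in dataset")
    return errors

good = {"steps": [{"op": "fill_missing", "params": {"column": "score", "value": 0}}]}
bad = {"steps": [{"op": "delete_everything", "params": {"column": "nope"}}]}
```

A non-empty error list is what maps to the 422 response mentioned above.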
`/clean/llm`
- Generate a plan (same as /plan)
- Validate plan semantics
- Execute deterministic transforms in executor.py
- Generate before/after profiles
- Save artifacts:
  - outputs/cleaned/{job_id}.csv
  - outputs/plans/{job_id}.json
  - outputs/reports/{job_id}.json
- Return job_id, plan, report, and artifact paths
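The artifact step can be sketched as follows. `save_artifacts` and the report fields are hypothetical names mirroring the outputs/ layout above, not the actual code in src/pipeline/artifacts.py:

```python
import json
import tempfile
import uuid
from pathlib import Path

import pandas as pd

def save_artifacts(df: pd.DataFrame, plan: dict, report: dict, out_root: Path) -> dict:
    # One job_id ties the three artifacts together under outputs/.
    job_id = uuid.uuid4().hex
    paths = {
        "cleaned": out_root / "cleaned" / f"{job_id}.csv",
        "plan": out_root / "plans" / f"{job_id}.json",
        "report": out_root / "reports" / f"{job_id}.json",
    }
    for path in paths.values():
        path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(paths["cleaned"], index=False)
    paths["plan"].write_text(json.dumps(plan, indent=2))
    paths["report"].write_text(json.dumps(report, indent=2))
    return {"job_id": job_id, **{name: str(p) for name, p in paths.items()}}

# Usage with a throwaway directory and made-up report fields:
out = save_artifacts(
    pd.DataFrame({"a": [1]}),
    {"steps": []},
    {"rows_before": 1, "rows_after": 1},
    Path(tempfile.mkdtemp()),
)
```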
Run unit and API smoke tests: `pytest -q`
CI runs via GitHub Actions in .github/workflows/ci.yml.
