
CSVCleaner (Agentic CSV Cleaning API)

CSVCleaner is a small FastAPI service that profiles a CSV and can clean it end-to-end using an LLM-generated cleaning plan.

The main idea is an agentic workflow that stays reproducible:

  • The LLM proposes a structured cleaning plan (JSON)
  • The plan is validated (schema + semantic checks)
  • Execution is deterministic via pandas transforms
  • Artifacts are saved (cleaned CSV, plan JSON, report JSON)

The tool never modifies your input file in place and never runs LLM-generated code: the model only proposes a plan, and every step is executed by fixed pandas transforms.
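The propose → validate → execute loop above can be sketched roughly like this. This is a minimal illustration, not the project's actual code: `propose_plan` stands in for the LLM call, and the operation names and plan shape are hypothetical.

```python
import pandas as pd

# Whitelist of deterministic transforms; anything outside it is rejected.
ALLOWED_OPS = {"drop_duplicates", "fill_missing"}

def propose_plan(profile: dict) -> dict:
    # Stand-in for the LLM call: returns a structured JSON plan.
    return {"steps": [{"op": "drop_duplicates"},
                      {"op": "fill_missing", "column": "score", "value": 0}]}

def validate_plan(plan: dict, columns: list) -> None:
    # Schema + semantic checks: only known ops, only existing columns.
    for step in plan["steps"]:
        if step["op"] not in ALLOWED_OPS:
            raise ValueError(f"unknown op: {step['op']}")
        if "column" in step and step["column"] not in columns:
            raise ValueError(f"unknown column: {step['column']}")

def execute_plan(df: pd.DataFrame, plan: dict) -> pd.DataFrame:
    # Deterministic execution: each step maps to a plain pandas transform.
    df = df.copy()
    for step in plan["steps"]:
        if step["op"] == "drop_duplicates":
            df = df.drop_duplicates()
        elif step["op"] == "fill_missing":
            df[step["column"]] = df[step["column"]].fillna(step["value"])
    return df
```

Because the plan is plain JSON and the executor is a closed set of transforms, the same plan applied to the same CSV always yields the same output.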


API Docs

A Swagger UI screenshot is included at reports/figures/docs_example.png.


Deployment + Benchmarks

The API is deployed on Render.

Benchmarks were run against the deployed service after a warm-up request to avoid cold-start skew.

  • GET /health (k6, 10 VUs, 30s)

    • p95 latency: 68.16ms
    • throughput: ~101.10 req/s (~6,066 req/min)
    • error rate: 0% (0/3043)
  • POST /clean/basic (50-run file upload benchmark, Windows PowerShell + curl, sample IMDb CSV)

    • p95 latency: 0.360s
    • throughput: ~203 req/min
    • errors: 0/50
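As a sanity check on the figures above, p95 latency and throughput can be recomputed from raw per-request timings. This is a generic sketch (nearest-rank percentile), not the original k6/PowerShell benchmark scripts:

```python
def p95_latency(latencies_s: list) -> float:
    # Nearest-rank 95th percentile of per-request latencies (seconds).
    ranked = sorted(latencies_s)
    idx = max(0, int(round(0.95 * len(ranked))) - 1)
    return ranked[idx]

def throughput_per_min(n_requests: int, wall_clock_s: float) -> float:
    # Requests per minute over the whole benchmark window.
    return n_requests / wall_clock_s * 60.0
```

For example, 3043 requests over a 30s window works out to roughly 6,086 req/min, consistent with the ~101 req/s reported above.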

Endpoints

Explore everything interactively at /docs.

  • GET /health Basic health check.

  • POST /profile Upload a CSV and get a dataset profile (shape, missingness, preview rows).

  • POST /plan Upload a CSV and get an LLM cleaning plan (validated and saved).

  • POST /clean/basic Deterministic cleaning only (no LLM).

  • POST /clean/llm Full pipeline: profile -> plan -> validate -> execute -> save artifacts.

  • GET /jobs/{job_id} Read the saved report JSON for a prior run.

  • GET /jobs/{job_id}/cleaned.csv Download the cleaned CSV artifact for a prior run.
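The two /jobs endpoints above key everything off the job_id returned by a cleaning run. A tiny helper illustrating that mapping (the function name and base URL are made up for this sketch):

```python
def artifact_urls(job_id: str, base_url: str = "http://localhost:8000") -> dict:
    # Each completed job exposes its report JSON and cleaned CSV by job_id.
    return {
        "report": f"{base_url}/jobs/{job_id}",
        "cleaned_csv": f"{base_url}/jobs/{job_id}/cleaned.csv",
    }
```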


Example Outputs

Example outputs are committed so you can review the project without running it:

  • reports/examples/profile_example.json
  • reports/examples/plan_example.json
  • reports/examples/report_example.json

These were generated using the included sample dataset:

  • data/raw/messy_IMDB_dataset.csv

Dataset source (Kaggle):


Quickstart

1) Setup

Create and activate a virtual environment, then install dependencies.

Windows (PowerShell):

  • python -m venv .venv
  • .venv\Scripts\Activate.ps1
  • pip install -r requirements.txt

Mac/Linux:

  • python -m venv .venv
  • source .venv/bin/activate
  • pip install -r requirements.txt

2) Configure environment variables

This project needs only one environment variable: OPENAI_API_KEY

Create a local .env file in the repo root (not tracked by git):

OPENAI_API_KEY=your_key_here
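A .env line like the one above is just KEY=value. As a sketch of how it can be loaded without extra dependencies (the project may well use python-dotenv instead; this parser is an assumption):

```python
import os

def load_dotenv_line(line: str) -> None:
    # Parse a single KEY=value line and export it, skipping comments/blanks.
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return
    key, _, value = line.partition("=")
    os.environ.setdefault(key.strip(), value.strip())
```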

3) Run the API locally

From the repo root:

  • uvicorn src.api.app:app --reload

Then open the interactive docs at /docs (by default uvicorn serves on http://127.0.0.1:8000).

4) Run the demo script (curl)

Mac/Linux/GitBash:

  • chmod +x scripts/demo_local.sh
  • ./scripts/demo_local.sh

The script calls /health, /profile, /plan, and /clean/llm using the sample CSV and saves results to reports/examples/.


Project Structure

csv-cleaner/
  data/
    raw/
      messy_IMDB_dataset.csv
    processed/               (not tracked)
  reports/
    figures/
      docs_example.png
    examples/
      report_example.json
      plan_example.json
      profile_example.json
  outputs/                   (not tracked)
    cleaned/
    reports/
    plans/
  src/
    api/
      app.py                 (FastAPI app + router wiring)
      routes.py              (endpoints)
    llm/
      client.py              (OpenAI wrapper)
      prompts.py             (prompt templates)
      schemas.py             (cleaning plan schema)
    pipeline/
      profile.py             (dataset profiling)
      planner.py             (LLM -> plan)
      executor.py            (apply plan)
      validate.py            (semantic plan validation)
      artifacts.py           (write artifacts)
    tools/
      transforms.py          (cleaning transforms)
  tests/
    conftest.py
    test_api_health.py
    test_clean_and_jobs.py
    test_llm_prompts_and_schema.py
  scripts/
    demo_local.sh
  .github/workflows/ci.yml
  .env.example
  .gitignore
  README.md
  requirements.txt

How the Pipeline Works

/profile

  1. Read the uploaded CSV into pandas
  2. Compute missingness and preview rows
  3. Return a profile object
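These three steps can be approximated as follows. This is a sketch, not the project's profile.py, and the field names are illustrative:

```python
import pandas as pd

def profile_dataframe(df: pd.DataFrame, preview_rows: int = 5) -> dict:
    # Shape, per-column missingness ratio, and a small head() preview.
    return {
        "shape": {"rows": int(df.shape[0]), "columns": int(df.shape[1])},
        "missingness": {col: float(df[col].isna().mean()) for col in df.columns},
        "preview": df.head(preview_rows).to_dict(orient="records"),
    }
```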

/plan

  1. Build a prompt using the profile
  2. Call OpenAI for a structured JSON plan
  3. Validate plan semantics (422 if invalid)
  4. Save plan to outputs/plans/{job_id}.json

/clean/llm

  1. Generate a plan (same as /plan)
  2. Validate plan semantics
  3. Execute deterministic transforms in executor.py
  4. Generate before/after profiles
  5. Save artifacts:
    • outputs/cleaned/{job_id}.csv
    • outputs/plans/{job_id}.json
    • outputs/reports/{job_id}.json
  6. Return job_id, plan, report, and artifact paths
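Step 5's artifact layout can be sketched as below. The paths follow the outputs/ structure shown earlier; the function name and signature are hypothetical:

```python
import json
from pathlib import Path
import pandas as pd

def save_artifacts(out_dir: Path, job_id: str,
                   cleaned: pd.DataFrame, plan: dict, report: dict) -> dict:
    # Write the three artifacts under outputs/{cleaned,plans,reports}/.
    paths = {
        "cleaned": out_dir / "cleaned" / f"{job_id}.csv",
        "plan": out_dir / "plans" / f"{job_id}.json",
        "report": out_dir / "reports" / f"{job_id}.json",
    }
    for p in paths.values():
        p.parent.mkdir(parents=True, exist_ok=True)
    cleaned.to_csv(paths["cleaned"], index=False)
    paths["plan"].write_text(json.dumps(plan, indent=2))
    paths["report"].write_text(json.dumps(report, indent=2))
    return {k: str(v) for k, v in paths.items()}
```

Returning the paths as strings makes them easy to echo back in the JSON response alongside the job_id.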

Testing

Run unit and API smoke tests: pytest -q

CI runs via GitHub Actions in .github/workflows/ci.yml.
