
🔍 dataflow-agent

AI-powered data pipeline debugger

Python Gemini LangGraph License


An agentic CLI tool that uses Google Gemini and LangGraph to autonomously diagnose, explain, and fix broken or slow data pipelines.


dbt Airflow Prefect Apache Spark PostgreSQL Snowflake


What it does

dataflow-agent acts as an autonomous data engineering assistant. Point it at a broken pipeline — a dbt error log, a failing Airflow DAG, a Spark OOM crash — and it will:

  1. Read the logs and project files using its tool suite
  2. Reason over the evidence with Gemini running inside a LangGraph ReAct loop
  3. Explain the root cause in plain English
  4. Fix the broken code — showing a rich diff and asking for your confirmation before writing

Features

| Capability | Details |
|---|---|
| 🔎 Autonomous diagnosis | Reads logs, traverses project files, calls tools in sequence until it has enough evidence |
| 🛠 Interactive fixes | Rich unified diff preview with y/n confirmation before any file is modified |
| 🗄 Schema introspection | Queries live Postgres or Snowflake to validate column and table references |
| 📝 SQL analysis | sqlglot-powered parsing, linting, and optimization hints (indexes, skew, wildcards, etc.) |
| 💬 Chat mode | Multi-turn session with the full pipeline context retained across turns |
| 🧩 Framework-aware | Dedicated parsers for dbt artifacts, Airflow DAGs, Prefect flows, and Spark driver logs |
| 📊 dbt project profiler | Scans all SQL models, scores complexity (CTEs, JOINs, anti-patterns), ranks by risk |
| 🧪 dbt test generator | Infers and writes a schema.yml with not_null, unique, and FK tests from model SQL |
| 🔗 dbt lineage tracer | Traces upstream/downstream model dependencies from manifest.json or SQL ref() scan; optional AI impact analysis |

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        dataflow-agent                           │
│                                                                 │
│   CLI (Typer + Rich)                                            │
│       │                                                         │
│       ▼                                                         │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                  LangGraph Agent Graph                  │    │
│  │                                                         │    │
│  │   START → context_loader → agent_node ⇄ tools_node      │    │
│  │                                │            │           │    │
│  │                           Gemini LLM   Tool Executor    │    │
│  │                                │            │           │    │
│  │                                └────────────┘           │    │
│  │                                      │                  │    │
│  │                                     END                 │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                 │
│   Tools available to the agent:                                 │
│   read_log · read_file · list_files · extract_errors            │
│   analyze_sql · explain_query · inspect_schema                  │
│   parse_dbt_manifest · generate_dbt_tests · trace_dbt_lineage   │
│   profile_dbt_project · parse_airflow_dag · parse_prefect_flow  │
│   parse_spark_log · write_fix                                   │
└─────────────────────────────────────────────────────────────────┘

The agent loops — calling tools, processing results, calling more tools — until Gemini decides it has enough information to produce a final diagnosis. No hardcoded logic, no fixed pipelines.
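Stripped of the LangGraph machinery, this control flow reduces to a plain ReAct-style loop. A minimal Python sketch with Gemini stubbed out; every name below is illustrative, not the project's actual API:

```python
# Minimal ReAct-style loop: call the model, execute any tool it requests,
# feed the result back, and stop when the model returns a final answer.
# `stub_model` stands in for the Gemini call the real agent makes.

def run_agent(model, tools, question):
    messages = [{"role": "user", "content": question}]
    while True:
        reply = model(messages)               # LLM decides: call a tool, or answer
        if reply.get("tool") is None:         # no tool requested -> final diagnosis
            return reply["content"]
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})

# Stub model: first asks to read the log, then produces a diagnosis.
def stub_model(messages):
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "read_log", "args": {"path": "dbt.log"}, "content": None}
    return {"tool": None, "content": "Column `discount_amount` does not exist."}

tools = {"read_log": lambda path: f"ERROR in {path}: column not found"}
print(run_agent(stub_model, tools, "Why did fct_orders fail?"))
```

The real tool replaces the while-loop with a LangGraph graph, which adds state persistence and the conditional agent_node ⇄ tools_node edges, but the termination condition is the same: the model stops requesting tools.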


Installation

Requirements: Python 3.11+, a free Gemini API key

# 1. Clone
git clone https://github.com/yourname/dataflow-agent
cd dataflow-agent

# 2. Virtual environment
python -m venv venv
source venv/bin/activate       # Windows: venv\Scripts\activate

# 3. Install
pip install -e .

# 4. Configure
cp .env.example .env
#  → open .env and paste your GEMINI_API_KEY

Configuration

# .env

# Required
GEMINI_API_KEY=your_gemini_api_key_here

# Optional — defaults shown
GEMINI_MODEL=gemini-2.5-flash

# Optional — for schema introspection
POSTGRES_URL=postgresql://user:password@localhost:5432/mydb

SNOWFLAKE_ACCOUNT=myorg-myaccount
SNOWFLAKE_USER=myuser
SNOWFLAKE_PASSWORD=mypassword
SNOWFLAKE_DATABASE=ANALYTICS
SNOWFLAKE_SCHEMA=PUBLIC
SNOWFLAKE_WAREHOUSE=COMPUTE_WH
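Only GEMINI_API_KEY is mandatory; the optional values fall back to the defaults shown above. A stdlib-only illustration of that precedence (the project itself loads these via pydantic + python-dotenv; `get_setting` and `DEFAULTS` are hypothetical names):

```python
import os

# A value set in the environment (or .env) wins; otherwise the documented default.
DEFAULTS = {"GEMINI_MODEL": "gemini-2.5-flash"}

def get_setting(name, required=False):
    value = os.environ.get(name, DEFAULTS.get(name))
    if required and value is None:
        raise RuntimeError(f"{name} is required (set it in .env)")
    return value

# GEMINI_API_KEY has no default, so this raises unless it is set:
#   get_setting("GEMINI_API_KEY", required=True)
print(get_setting("GEMINI_MODEL"))  # "gemini-2.5-flash" unless overridden
```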

Usage

diagnose — find the root cause

# dbt run log
dataflow-agent diagnose --framework dbt --log ./logs/dbt.log

# dbt log + full project (agent can traverse models, macros, etc.)
dataflow-agent diagnose --framework dbt \
  --log ./logs/dbt.log \
  --project ./my_dbt_project

# Airflow — DAG file + task log
dataflow-agent diagnose --framework airflow \
  --dag ./dags/my_dag.py \
  --log ./logs/task.log

# Spark — OOM, executor loss, shuffle failures
dataflow-agent diagnose --framework spark \
  --log ./logs/spark_driver.log

diagnose --fix — diagnose and repair

# Shows a rich diff, asks y/n before writing
dataflow-agent diagnose --framework dbt \
  --log ./logs/dbt.log \
  --project ./my_dbt_project \
  --fix

Example output:

╭─ Proposed change ──────────────────────────────────╮
│ --- a/models/marts/fct_orders.sql                  │
│ +++ b/models/marts/fct_orders.sql                  │
│ @@ -11,7 +11,7 @@                                  │
│ -    SUM(oi.discount_amount) AS total_discount,    │
│ +    SUM(oi.discount_pct)    AS total_discount,    │
╰────────────────────────────────────────────────────╯
Apply this fix? [y/N]:
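The preview above is a standard unified diff. A minimal sketch of how such a preview can be produced with stdlib difflib (`preview_fix` is a hypothetical helper, not the project's write_fix implementation):

```python
import difflib

def preview_fix(path, old_text, new_text):
    """Return a unified diff string for a proposed file change."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)

old = "SELECT\n    SUM(oi.discount_amount) AS total_discount,\n"
new = "SELECT\n    SUM(oi.discount_pct)    AS total_discount,\n"
print(preview_fix("models/marts/fct_orders.sql", old, new))
# The real tool renders the diff with Rich and gates the actual
# file write behind the y/n confirmation prompt shown above.
```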

optimize — find performance issues

dataflow-agent optimize --framework spark \
  --log ./logs/spark.log \
  --db postgres

chat — interactive session

# Full context retained across turns — ask follow-up questions
dataflow-agent chat --framework dbt --project ./my_dbt_project

dataflow-agent chat --framework spark --log ./logs/spark.log
╭─ Interactive Pipeline Chat ────────────────────────╮
│ Framework: dbt  |  Type exit to end the session.   │
╰────────────────────────────────────────────────────╯

You> Why did fct_orders fail?
Agent> The model failed because it references `discount_amount`, but
       the upstream `int_order_items` model exposes `discount_pct`...

You> What other models depend on fct_orders?
Agent> Based on the manifest, three models reference fct_orders: ...

profile — rank dbt models by complexity

Scans every .sql file under models/, scores each by CTEs, JOINs, subqueries, and anti-patterns, and returns a ranked report with LLM-generated refactoring recommendations.

dataflow-agent profile ./my_dbt_project

# Show top 20 models instead of the default 10
dataflow-agent profile ./my_dbt_project --top 20
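As a rough illustration of this kind of scoring, the heuristic below counts CTEs, JOINs, subqueries, and two common anti-patterns with regexes; the weights and patterns are invented, not the project's actual ones:

```python
import re

def score_model(sql: str) -> int:
    """Rough complexity score: more CTEs/JOINs/anti-patterns -> higher risk."""
    s = sql.upper()
    score = 0
    score += 2 * len(re.findall(r"\bWITH\b|,\s*\w+\s+AS\s*\(", s))  # CTEs
    score += 3 * s.count(" JOIN ")                                  # joins
    score += 2 * s.count("(SELECT")                                 # subqueries
    score += 5 * s.count("SELECT *")                                # wildcard anti-pattern
    score += 5 * s.count("DISTINCT")                                # often masks join fan-out
    return score

simple = "SELECT id FROM orders"
messy = ("WITH o AS (SELECT * FROM orders) "
         "SELECT DISTINCT o.id FROM o JOIN items ON o.id = items.order_id")
print(score_model(simple), score_model(messy))
```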

generate-tests — generate a schema.yml from model SQL

Infers not_null, unique, and foreign-key tests from a model's SQL and writes a ready-to-use schema.yml.

dataflow-agent generate-tests --model ./models/marts/fct_orders.sql

# Write to a specific output path
dataflow-agent generate-tests \
  --model ./models/marts/fct_orders.sql \
  --output ./models/marts/schema.yml
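A simplified sketch of the inference idea: columns that look like the model's primary key get not_null + unique, other *_id columns get not_null. The heuristics and the `infer_tests` helper are illustrative only; the real generator also proposes FK relationship tests:

```python
import re

def infer_tests(model_name: str, sql: str) -> str:
    """Emit a minimal schema.yml snippet from column aliases found in the SQL."""
    columns = re.findall(r"\bAS\s+(\w+)", sql, flags=re.IGNORECASE)
    pk_guess = model_name.split("_", 1)[-1].rstrip("s") + "_id"  # fct_orders -> order_id
    lines = ["version: 2", "models:", f"  - name: {model_name}", "    columns:"]
    for col in columns:
        lines.append(f"      - name: {col}")
        if col == pk_guess or col.endswith("_key"):
            tests = ["not_null", "unique"]      # looks like the primary key
        elif col.endswith("_id"):
            tests = ["not_null"]                # looks like a foreign key
        else:
            tests = []
        if tests:
            lines.append("        tests:")
            lines.extend(f"          - {t}" for t in tests)
    return "\n".join(lines)

sql = "SELECT o.id AS order_id, o.customer_id AS customer_id, SUM(x) AS total FROM ..."
print(infer_tests("fct_orders", sql))
```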

lineage — trace dbt model dependencies

Builds an upstream/downstream dependency graph from manifest.json (preferred) or by scanning ref()/source() calls in SQL files. No API key required unless --analyze is used.

# From manifest
dataflow-agent lineage fct_orders --manifest ./target/manifest.json

# From project directory (SQL scan fallback)
dataflow-agent lineage fct_orders --project ./my_dbt_project

# Upstream only, limited to 2 hops
dataflow-agent lineage fct_orders -m ./target/manifest.json -d upstream --depth 2

# AI-powered impact analysis (requires GEMINI_API_KEY)
dataflow-agent lineage fct_orders -m ./target/manifest.json --analyze
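With a manifest, the trace is essentially a bounded graph walk over manifest.json's parent_map. A simplified sketch, assuming the standard dbt manifest layout (`upstream` is a hypothetical helper):

```python
# In the real tool the map comes from json.load(...)["parent_map"]
# of target/manifest.json; a toy version is inlined below.

def upstream(parent_map, node, depth=None, _seen=None):
    """Recursively collect upstream dependencies, optionally hop-limited."""
    seen = _seen if _seen is not None else set()
    if depth is not None and depth <= 0:
        return seen
    for parent in parent_map.get(node, []):
        if parent not in seen:
            seen.add(parent)
            upstream(parent_map, parent,
                     None if depth is None else depth - 1, seen)
    return seen

parent_map = {
    "model.proj.fct_orders": ["model.proj.int_order_items"],
    "model.proj.int_order_items": ["model.proj.stg_orders"],
    "model.proj.stg_orders": ["source.proj.raw.orders"],
}
print(sorted(upstream(parent_map, "model.proj.fct_orders", depth=2)))
```

The --depth flag maps onto the hop limit, and downstream tracing is the same walk over the inverted map.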

--model — override the LLM

dataflow-agent diagnose --framework dbt --log ./dbt.log --model gemini-1.5-pro

Demo — try it now

The repo ships with realistic broken fixtures. No database needed.

# dbt: two failed models with column name mismatches
dataflow-agent diagnose --framework dbt \
  --log tests/fixtures/dbt_error.log

# Spark: executor OOM → stage failure → job abort
dataflow-agent diagnose --framework spark \
  --log tests/fixtures/spark_error.log

# Airflow: KeyError on a renamed column + catchup=True misconfiguration
dataflow-agent diagnose --framework airflow \
  --dag  tests/fixtures/broken_dag.py \
  --log  tests/fixtures/airflow_error.log

# Interactive chat about the broken Airflow DAG
dataflow-agent chat --framework airflow \
  --dag tests/fixtures/broken_dag.py

Project Structure

dataflow-agent/
├── dataflow_agent/
│   ├── cli.py                   # Typer CLI: diagnose · optimize · chat · profile · generate-tests · lineage · explain
│   ├── agent.py                 # LangGraph graph, state, nodes, chat loop, run_* entry points
│   ├── config.py                # Pydantic settings loaded from .env
│   ├── tools/
│   │   ├── log_reader.py        # read_log
│   │   ├── file_reader.py       # read_file · list_files
│   │   ├── file_writer.py       # write_fix  (Rich diff + y/n prompt)
│   │   ├── schema_inspector.py  # inspect_schema  (Postgres + Snowflake)
│   │   ├── sql_analyzer.py      # analyze_sql · explain_query
│   │   ├── dbt_profiler.py      # profile_dbt_project  (complexity ranking)
│   │   └── framework/
│   │       ├── dbt.py           # parse_dbt_manifest · generate_dbt_tests · trace_dbt_lineage
│   │       ├── airflow.py       # parse_airflow_dag
│   │       ├── prefect.py       # parse_prefect_flow
│   │       └── spark.py         # parse_spark_log
│   └── parsers/
│       └── error_extractor.py   # extract_errors  (framework-aware regex)
├── tests/
│   ├── fixtures/
│   │   ├── dbt_error.log        # dbt run with 2 failures + 5 skips
│   │   ├── airflow_error.log    # Airflow KeyError after 3 retry attempts
│   │   ├── spark_error.log      # Spark OOM → executor loss → job abort
│   │   ├── broken_dag.py        # Airflow DAG with column bug + catchup=True
│   │   ├── fct_orders.sql       # dbt model with wrong column reference
│   │   ├── dbt_model.sql        # dbt model fixture for test generation
│   │   └── manifest.json        # minimal dbt manifest with multi-layer lineage graph
│   └── test_tools.py            # 24 smoke tests (all passing)
├── .env.example
├── .gitignore
└── pyproject.toml

Supported Frameworks

| Framework | Log parsing | File parsing | Schema validation | SQL analysis |
|---|---|---|---|---|
| dbt | ✅ Run logs | ✅ manifest.json, run_results.json | ✅ | ✅ |
| Apache Airflow | ✅ Task logs | ✅ DAG .py files | ✅ | ✅ |
| Prefect | ✅ Flow run logs | ✅ Flow .py files | ✅ | ✅ |
| Apache Spark | ✅ Java exceptions, OOM, shuffle | ✅ PySpark scripts | ✅ | ✅ |
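Framework-aware log parsing largely comes down to per-framework regexes. A hypothetical sketch in the spirit of extract_errors, with invented patterns for the Spark row above (the real tool ships its own, more complete set):

```python
import re

# Illustrative per-framework error patterns, not the project's actual ones.
PATTERNS = {
    "spark": [
        r"java\.lang\.OutOfMemoryError.*",
        r"ExecutorLostFailure.*",
        r"org\.apache\.spark\.shuffle\.FetchFailedException.*",
    ],
    "dbt": [r"Database Error in model \w+.*", r"Compilation Error.*"],
}

def extract_errors(framework: str, log_text: str) -> list[str]:
    """Pull error lines matching the framework's known failure signatures."""
    hits = []
    for pattern in PATTERNS.get(framework, []):
        hits.extend(re.findall(pattern, log_text))
    return hits

log = ("24/01/15 INFO Scheduler: stage 4 started\n"
       "java.lang.OutOfMemoryError: Java heap space\n"
       "ExecutorLostFailure (executor 7 exited)\n")
print(extract_errors("spark", log))
```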

Tech Stack

| Layer | Library |
|---|---|
| LLM | google-generativeai / langchain-google-genai |
| Agent framework | langgraph |
| CLI | typer + rich |
| SQL parsing | sqlglot |
| DB connectors | psycopg2-binary, snowflake-connector-python |
| Config | pydantic + python-dotenv |

License

MIT — see LICENSE.
