Open-source data infrastructure tools. Built in India. Used everywhere.
We build lightweight, self-hosted tooling that gives small data teams enterprise-grade
pipeline observability, auditing, and automation — without vendor lock-in or SaaS bills.
WillowVibe is a data engineering & AI tooling studio — solo-founded, contributor-driven, OSS-first.
- Pipeline Auditing — point-in-time health checks on Airflow + dbt + warehouses; one command, one report
- Data Observability — continuous monitoring for pipeline health, data freshness, volume anomalies, and schema drift
- FinOps for Data — tracking Snowflake credits and BigQuery bytes billed, turning cloud cost chaos into actionable visibility
- AI-Augmented Pipelines — embedding AI at the right layer of the data stack without replacing what already works
- Open-Source First — every internal tool we build, we ship as OSS so the community benefits
We operate a solo + contributor model — lean by design, moving fast, building things that solve real problems for data teams.
🔬 PipelineProbe — New
Instant Data Pipeline Audit Report for Airflow + dbt + modern warehouses
Run a single command, get a full HTML audit report. PipelineProbe is a read-only CLI audit tool for data engineers who want a fast, objective health check of their pipeline stack — before a migration, after an incident, or as a recurring CI gate.
```bash
pip install pipelineprobe
pipelineprobe init    # generates pipelineprobe.yml
pipelineprobe audit   # produces pipelineprobe-report.html
```

- ✅ Airflow checks — high failure-rate DAGs, missing retries, missing SLAs, stale pipelines
- ✅ dbt checks — models with zero tests, failing test runs, orphaned models
- ✅ Warehouse checks — oversized tables, missing audit timestamps (Postgres, BigQuery, Snowflake)
- ✅ HTML + JSON report — traffic-light severity, health score 0–100, per-issue recommendations
- ✅ CI-ready — `fail_on_critical` exit-code gates for GitHub Actions / GitLab CI
- ✅ Zero mutations — 100% read-only; safe to run against production
Stack: Python · Typer · Pydantic · httpx · Jinja2 · psycopg2 · dbt artifacts
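The failure-rate check can be pictured with a short sketch: given the recent DAG-run states for each DAG (the kind of data Airflow's stable REST API returns from `/api/v1/dags/{dag_id}/dagRuns`), flag any DAG whose failure rate crosses a threshold. The function names and the 20% threshold below are illustrative, not PipelineProbe's actual implementation.

```python
def failure_rate(run_states: list[str]) -> float:
    """Fraction of runs that ended in 'failed' (pure computation, read-only)."""
    if not run_states:
        return 0.0
    return sum(state == "failed" for state in run_states) / len(run_states)

def flag_high_failure_dags(dag_runs: dict[str, list[str]], threshold: float = 0.2) -> dict[str, float]:
    """dag_runs maps dag_id -> recent run states, as collected from the Airflow REST API."""
    return {
        dag_id: failure_rate(states)
        for dag_id, states in dag_runs.items()
        if failure_rate(states) > threshold
    }

runs = {
    "daily_sales": ["success", "failed", "failed", "success", "failed"],
    "hourly_sync": ["success"] * 10,
}
print(flag_high_failure_dags(runs))  # {'daily_sales': 0.6}
```

Because the check only reads run metadata that has already been fetched, it stays safe to point at production.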
🔭 ObservaKit
Self-hosted Data Observability & FinOps Starter Kit for small data teams
ObservaKit gives 1–5 person data teams the 5 core observability pillars — Freshness, Volume, Quality, Schema Drift, and Pipeline Health — in a single `docker-compose up`. No Monte Carlo. No Metaplane. No SaaS bill.
- ✅ Freshness Monitor — detects stale tables by tracking `max(updated_at)`
- ✅ Volume Anomaly — Z-score detection against 7-day rolling averages
- ✅ Quality Checks — Soda Core & Great Expectations templates, ready to use
- ✅ Schema Drift Detector — snapshots `information_schema`, diffs on every run
- ✅ Pipeline Health — Airflow/Prefect REST API + OpenTelemetry + Grafana
- ✅ FinOps Tracker — Snowflake credits & BigQuery bytes billed, natively
- ✅ Native dbt Integration — parses `run_results.json` directly, no extra packages
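The Z-score volume rule above boils down to a few lines: compare today's row count against the mean and standard deviation of a trailing window (7 days here). This is a minimal sketch with illustrative names, not ObservaKit's exact code.

```python
from statistics import mean, stdev

def volume_zscore(history: list[int], today: int) -> float:
    """Z-score of today's row count vs. a trailing window (e.g. the last 7 days)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0
    return (today - mu) / sigma

def is_volume_anomaly(history: list[int], today: int, threshold: float = 3.0) -> bool:
    return abs(volume_zscore(history, today)) > threshold

week = [10_000, 10_200, 9_900, 10_100, 10_050, 9_950, 10_000]
print(is_volume_anomaly(week, 10_100))  # False — within normal variation
print(is_volume_anomaly(week, 2_000))   # True — sudden drop flagged
```

A rolling window keeps the baseline adaptive: gradual growth shifts the mean, so only abrupt jumps or drops trip the threshold.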
Stack: Python · FastAPI · SQLAlchemy · Alembic · Prometheus · Grafana · Docker Compose · dbt · Airflow / Prefect
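Reading `run_results.json` really does need no extra packages: each entry in the artifact's `results` array carries a `unique_id` and a `status`, which is enough to surface failing models and tests. The sample payload below is illustrative; real files live in dbt's `target/` directory.

```python
def failing_nodes(run_results: dict) -> list[str]:
    """Return unique_ids of dbt nodes whose last run ended in 'error' or 'fail'."""
    return [
        result["unique_id"]
        for result in run_results.get("results", [])
        if result.get("status") in ("error", "fail")
    ]

# Illustrative artifact content, mimicking target/run_results.json
artifact = {
    "results": [
        {"unique_id": "model.shop.orders", "status": "success"},
        {"unique_id": "test.shop.not_null_orders_id", "status": "fail"},
    ]
}
print(failing_nodes(artifact))  # ['test.shop.not_null_orders_id']
```

In practice you would `json.load` the file after each `dbt build`; parsing the artifact directly avoids taking a dependency on dbt's internal Python APIs.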
| Repo | Description | Language | Status |
|---|---|---|---|
| 🔬 pipelineprobe | Instant pipeline audit CLI — Airflow + dbt + warehouse | Python | active |
| 🔭 ObservaKit | Self-hosted data observability & FinOps starter kit | Python | active |
| 🧰 toolscontainer | Multi-purpose Python utility scripts & automations | Python | maintained |
| 🕷️ scrapy-bot | Scrapy + Flask web scraping bot experiment | Python | archived |
| 💻 online-ide | Lightweight online Python execution environment | Python | experimental |
| Layer | Tools |
|---|---|
| Data Engineering | Python · dbt · Apache Airflow · Prefect · Apache Spark |
| Warehouses | PostgreSQL · Snowflake · BigQuery · DuckDB |
| Observability | Prometheus · Grafana · OpenTelemetry · Soda Core |
| Backend | FastAPI · SQLAlchemy · Alembic · Pydantic |
| Infra & DevOps | Docker · Docker Compose · Terraform · GitHub Actions |
| AI / ML | LangChain · OpenAI APIs · Vector DBs (Qdrant / ChromaDB) |
"Build what the ecosystem needs. Share what you build. Let the community make it better."
Every project we open-source follows three rules:
- Zero vendor lock-in — runs on infra you own and control
- Quickstart in under 10 minutes — if onboarding is painful, it won't get adopted
- Progressive complexity — adopt one layer at a time; no all-or-nothing commitment
We actively maintain what we ship. Issues get responses. PRs get reviewed. Roadmaps get published.
All public repos welcome contributions. Best places to start:
- 🔬 PipelineProbe → good first issues
  - Add a new warehouse connector (Redshift, DuckDB)
  - Add a new rule (task duration outliers, dbt source freshness)
  - Improve the HTML report template
- 🔭 ObservaKit → good first issues
  - Add a new warehouse connector (Redshift, Delta Lake)
  - Write a Grafana dashboard for a new observability use case
  - Improve documentation or add a real-world example
Read CONTRIBUTING.md before opening a PR.
We are open to:
- Collaborations on data tooling, AI pipelines, or observability infra
- Consulting engagements — data platform audits, pipeline migrations, cost optimization
- Freelance / contract data engineering for startups and scaleups
| Channel | Link |
|---|---|
| 🐙 GitHub | @willowvibe |
| 🔬 PipelineProbe Issues | Open an issue |
| 🔭 ObservaKit Issues | Open an issue |
| 🔐 Security Reports | See SECURITY.md |
🌿 WillowVibe — Bengaluru, India · Building in the open since 2024 · Try PipelineProbe 🔬 · Star ObservaKit ⭐