# Data-Pulse
Catch silent data pipeline failures before your stakeholders do.
Quick Start · Features · Architecture · Configuration
Bad data costs companies an average of $12.9 million per year (Gartner). Most data quality issues don't throw errors - they corrupt silently: null values creep into critical columns, duplicate records pile up, stale data gets served to dashboards, and nobody notices until the CEO asks "why does this number look wrong?"
Data-Pulse is a lightweight, open-source data quality engine that:
- Profiles your datasets automatically - row counts, null rates, distributions, data types
- Runs checks defined in simple YAML - no Python required to add new rules
- Detects anomalies using statistical methods (z-scores) against historical baselines
- Alerts your team on Slack the moment something breaks
- Displays health on a real-time dashboard - green, yellow, red at a glance
One command. Zero database config. Works on CSV, Parquet, or any SQL source.
Pipeline health at a glance - overall score, pass/fail breakdown, check details
Real-time notifications with severity routing - critical vs warning
Detailed check results with PASS/FAIL status and severity levels
```bash
git clone https://github.com/MuditNautiyal-21/Data-Pulse.git
cd Data-Pulse
docker-compose up --build
```

Open http://localhost:8000/dashboard - done.
```bash
git clone https://github.com/MuditNautiyal-21/Data-Pulse.git
cd Data-Pulse

# Setup
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Generate sample data
python sample_data/generate_data.py

# Run all checks
python run.py

# Launch dashboard
uvicorn api.main:app --reload
```

Open http://localhost:8000/dashboard
```yaml
checks:
  - name: "Order ID is unique"
    type: unique_check
    column: order_id
    severity: critical

  - name: "Amount is positive"
    type: value_check
    column: amount
    condition: "> 0"
    severity: warning
```

Anyone on your team can add quality rules without writing Python. Just edit a YAML file.
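Since rules are plain YAML, they can be loaded with PyYAML (already in the project's requirements). A minimal sketch of parsing a rule like the one above - the loop and variable names are illustrative, not the engine's actual internals:

```python
import yaml

raw = """
checks:
  - name: "Order ID is unique"
    type: unique_check
    column: order_id
    severity: critical
"""

# safe_load parses plain YAML without executing arbitrary tags
config = yaml.safe_load(raw)
for check in config["checks"]:
    print(f'{check["name"]} -> {check["type"]} on {check["column"]}')
```

Each check arrives as an ordinary dict, which is what makes "no Python required" possible: the engine only needs to dispatch on the `type` field.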
| Check Type | What It Catches | Example |
|---|---|---|
| `null_check` | Missing values | "order_id should never be empty" |
| `unique_check` | Duplicate records | "customer_id must be unique" |
| `value_check` | Out-of-range values | "amount must be > 0" |
| `accepted_values_check` | Invalid categories | "status must be completed, pending, shipped, or cancelled" |
| `freshness_check` | Stale or future dates | "order_date should not be in the future" |
| `row_count_check` | Missing data loads | "table must have at least 100 rows" |
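To make the mapping from rule to verdict concrete, here is an illustrative sketch of how two of these check types could be evaluated with pandas. The function names and result shape are hypothetical, not Data-Pulse's actual check-runner API:

```python
import pandas as pd

def null_check(df: pd.DataFrame, column: str) -> dict:
    """Fail if the column contains any missing values."""
    failures = int(df[column].isna().sum())
    return {"passed": failures == 0, "failing_rows": failures}

def unique_check(df: pd.DataFrame, column: str) -> dict:
    """Fail if the column contains duplicate values."""
    dupes = int(df[column].duplicated().sum())
    return {"passed": dupes == 0, "failing_rows": dupes}

# Tiny frame with one null amount and one duplicate order_id
orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, None, 5.0]})
print(null_check(orders, "amount"))      # fails: 1 null amount
print(unique_check(orders, "order_id"))  # fails: 1 duplicate id
```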
Every run automatically profiles each data source:
- Row and column counts
- Null rates per column
- Unique value counts
- Min / Max / Mean for numeric columns
- Data type detection
Data-Pulse stores profile history in SQLite and uses z-score analysis to detect when today's data deviates significantly from the baseline. If your amount column usually has 2% nulls but today it's 15% - Data-Pulse flags it.
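A minimal sketch of that z-score test using only the standard library (the engine itself uses SciPy); the 3-standard-deviation threshold here is an illustrative choice, not necessarily the engine's default:

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], current: float,
                 threshold: float = 3.0) -> bool:
    """Flag `current` when it sits more than `threshold` standard
    deviations away from the historical baseline."""
    if len(history) < 2:
        return False              # not enough history to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu      # any change to a constant metric is drift
    return abs(current - mu) / sigma > threshold

# Null rate of `amount` hovered near 2% all week - today it is 15%:
baseline = [0.021, 0.019, 0.020, 0.022, 0.018, 0.020, 0.021]
print(is_anomalous(baseline, 0.15))   # flagged
print(is_anomalous(baseline, 0.020))  # normal
```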
Failed checks trigger Slack notifications automatically:
- 🔴 Critical failures - things that should never happen (duplicate primary keys, missing IDs)
- 🟡 Warning failures - things to investigate (null emails, negative amounts)
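Slack Incoming Webhooks accept a simple JSON body with a `text` field, so severity routing reduces to formatting the message. A sketch using only the standard library - the payload builder and icon mapping are illustrative, not the project's `alerts.py`:

```python
import json
import urllib.request

SEVERITY_ICON = {"critical": "🔴", "warning": "🟡"}

def build_alert(check_name: str, severity: str, detail: str) -> dict:
    """Format one failed check as a Slack Incoming Webhook payload."""
    icon = SEVERITY_ICON.get(severity, "⚪")
    return {"text": f"{icon} *{severity.upper()}* - {check_name}: {detail}"}

def send_alert(webhook_url: str, payload: dict) -> None:
    """POST the payload to Slack; Slack replies with the body 'ok'."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

payload = build_alert("Order ID is unique", "critical",
                      "duplicate order_ids found")
# send_alert("https://hooks.slack.com/services/YOUR/WEBHOOK/URL", payload)
```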
A dark-themed web dashboard showing:
- Pipeline Health Score - single percentage showing overall data quality
- Pass/Fail/Critical stats at a glance
- Check results table - sortable, with severity badges and failure details
- Auto-refreshes every 30 seconds
```
          ┌─────────────────┐
          │   config.yaml   │
          │ + checks/*.yaml │
          └────────┬────────┘
                   │
                   ▼
┌───────────────────────────────────────────┐
│              DataPulse Engine             │
│                                           │
│  ┌──────────┐  ┌──────────┐  ┌─────────┐  │
│  │ Profiler │─▶│  Check   │─▶│ Anomaly │  │
│  │          │  │  Runner  │  │ Detector│  │
│  └──────────┘  └──────────┘  └─────────┘  │
│                                           │
└──────────────────┬────────────────────────┘
                   │
       ┌───────────┼───────────┐
       ▼           ▼           ▼
 ┌──────────┐ ┌──────────┐ ┌──────────┐
 │  SQLite  │ │  Slack   │ │ FastAPI  │
 │ Storage  │ │  Alerts  │ │  + Dash  │
 └──────────┘ └──────────┘ └──────────┘
      │                         │
      │ Historical Data         │ http://localhost:8000
      │ & Trend Analysis        │ /dashboard
      ▼                         ▼
```
Data flows in one direction:
1. Config defines sources and check rules
2. Profiler scans every column and records statistics
3. Check Runner evaluates each YAML rule against the data
4. Anomaly Detector compares the current profile to historical baselines
5. Results are stored in SQLite, pushed to Slack, and displayed on the dashboard
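The storage side of this flow needs nothing beyond the standard library. A sketch of how profile history might be persisted and queried back as a baseline - the table and column names here are hypothetical, not `storage.py`'s actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # Data-Pulse persists to a file on disk

conn.execute("""
    CREATE TABLE profile_history (
        source      TEXT,
        column_name TEXT,
        null_rate   REAL,
        run_at      TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")
conn.execute(
    "INSERT INTO profile_history (source, column_name, null_rate) "
    "VALUES (?, ?, ?)",
    ("orders", "amount", 0.02),
)
conn.commit()

# The baseline query an anomaly detector could run on later visits:
baseline = [row[0] for row in conn.execute(
    "SELECT null_rate FROM profile_history "
    "WHERE source = ? AND column_name = ?",
    ("orders", "amount"),
)]
print(baseline)  # [0.02]
```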
```
Data-Pulse/
├── engine/
│   ├── __init__.py
│   ├── profiler.py           # Auto-profiles data sources
│   ├── check_runner.py       # Executes YAML-defined checks
│   ├── storage.py            # SQLite persistence layer
│   ├── anomaly.py            # Z-score anomaly detection
│   └── alerts.py             # Slack webhook integration
├── api/
│   ├── __init__.py
│   └── main.py               # FastAPI server + dashboard route
├── checks/
│   ├── orders_checks.yaml    # Check rules for orders table
│   └── customers_checks.yaml
├── sample_data/
│   ├── generate_data.py      # Creates demo data with intentional issues
│   ├── orders.csv            # 1005 orders (5% null amounts, duplicates)
│   └── customers.csv         # 200 customers (8% null emails)
├── templates/
│   └── dashboard.html        # Real-time health dashboard
├── tests/
│   └── __init__.py
├── run.py                    # Main entry point - runs everything
├── config.yaml               # Source definitions + alert config
├── requirements.txt
├── Dockerfile
├── docker-compose.yaml
└── README.md
```
Edit `config.yaml`:

```yaml
sources:
  my_new_table:
    type: csv
    path: path/to/your/data.csv
    description: "Description of this source"
```

Create a YAML file in checks/:
```yaml
source: path/to/your/data.csv
checks:
  - name: "Revenue is positive"
    type: value_check
    column: revenue
    condition: "> 0"
    severity: critical
```

1. Create a Slack Incoming Webhook
2. Update `config.yaml`:

```yaml
alerts:
  slack:
    enabled: true
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
```

| Technology | Role |
|---|---|
| Python 3.11 | Core engine, profiling, anomaly detection |
| FastAPI | REST API serving dashboard and check results |
| SQLite | Zero-config metadata storage and history tracking |
| Pandas | Data loading, profiling, and analysis |
| SciPy | Statistical anomaly detection (z-scores) |
| PyYAML | Human-readable check definitions |
| Docker | One-command deployment |
| Slack Webhooks | Real-time alerting |
- PostgreSQL / MySQL source connectors
- Schema change detection
- Email alerts
- Custom SQL check support
- Airflow DAG for scheduled runs
- Profile trend charts on dashboard
- GitHub Actions CI/CD pipeline
- CLI tool (`datapulse check --source orders`)
Contributions are welcome. To get started:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/your-feature`)
3. Commit your changes (`git commit -m 'Add your feature'`)
4. Push to the branch (`git push origin feature/your-feature`)
5. Open a Pull Request
This project is licensed under the MIT License - use it however you want.
Built by Mudit Nautiyal
If this helped you, consider giving it a ⭐