
    ██████╗  █████╗ ████████╗ █████╗       ██████╗ ██╗   ██╗██╗     ███████╗███████╗
    ██╔══██╗██╔══██╗╚══██╔══╝██╔══██╗      ██╔══██╗██║   ██║██║     ██╔════╝██╔════╝
    ██║  ██║███████║   ██║   ███████║█████╗██████╔╝██║   ██║██║     ███████╗█████╗
    ██║  ██║██╔══██║   ██║   ██╔══██║╚════╝██╔═══╝ ██║   ██║██║     ╚════██║██╔══╝
    ██████╔╝██║  ██║   ██║   ██║  ██║      ██║     ╚██████╔╝███████╗███████║███████╗
    ╚═════╝ ╚═╝  ╚═╝   ╚═╝   ╚═╝  ╚═╝      ╚═╝      ╚═════╝ ╚══════╝╚══════╝╚══════╝

Catch silent data pipeline failures before your stakeholders do.

Python FastAPI Docker License Slack

Quick Start · Features · Architecture · Configuration


The Problem

Bad data costs companies an average of $12.9 million per year (Gartner). Most data quality issues don't throw errors - they corrupt silently: null values creep into critical columns, duplicate records pile up, stale data gets served to dashboards, and nobody notices until the CEO asks "why does this number look wrong?"

The Solution

Data-Pulse is a lightweight, open-source data quality engine that:

  1. Profiles your datasets automatically - row counts, null rates, distributions, data types
  2. Runs checks defined in simple YAML - no Python required to add new rules
  3. Detects anomalies using statistical methods (z-scores) against historical baselines
  4. Alerts your team on Slack the moment something breaks
  5. Displays health on a real-time dashboard - green, yellow, red at a glance

One command. Zero config databases. Works on CSV, Parquet, or any SQL source.


Screenshots

Dashboard

Pipeline health at a glance - overall score, pass/fail breakdown, check details

DataPulse Dashboard



Slack Alerts

Real-time notifications with severity routing - critical vs warning

Slack Alert



Terminal Output

Detailed check results with PASS/FAIL status and severity levels

Terminal Output

Quick Start

Option A: Docker (Recommended)

git clone https://github.com/MuditNautiyal-21/Data-Pulse.git
cd Data-Pulse
docker-compose up --build

Open http://localhost:8000/dashboard - done.

Option B: Local Python

git clone https://github.com/MuditNautiyal-21/Data-Pulse.git
cd Data-Pulse

# Setup
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Generate sample data
python sample_data/generate_data.py

# Run all checks
python run.py

# Launch dashboard
uvicorn api.main:app --reload

Open http://localhost:8000/dashboard


Features

YAML-Defined Checks - No Code Required

checks:
  - name: "Order ID is unique"
    type: unique_check
    column: order_id
    severity: critical

  - name: "Amount is positive"
    type: value_check
    column: amount
    condition: "> 0"
    severity: warning

Anyone on your team can add quality rules without writing Python. Just edit a YAML file.
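To make that concrete, here is a minimal sketch of how such a rule could be evaluated. This is an illustration, not DataPulse's actual check_runner.py: the `run_check` helper and the sample frame are invented for the example.

```python
# Minimal sketch: parse a YAML rule and evaluate it against a DataFrame.
# `run_check` is a hypothetical helper, not DataPulse's real API.
import pandas as pd
import yaml

RULES = yaml.safe_load("""
checks:
  - name: "Order ID is unique"
    type: unique_check
    column: order_id
    severity: critical
""")

def run_check(df: pd.DataFrame, rule: dict) -> bool:
    """Return True when the rule passes."""
    if rule["type"] == "unique_check":
        return not df[rule["column"]].duplicated().any()
    raise ValueError(f"unknown check type: {rule['type']}")

df = pd.DataFrame({"order_id": [101, 102, 102]})  # one duplicate on purpose
for rule in RULES["checks"]:
    status = "PASS" if run_check(df, rule) else "FAIL"
    print(f"{rule['name']}: {status} ({rule['severity']})")
```

The real engine dispatches on the same `type` field, so adding a rule is purely a config change.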

6 Built-in Check Types

| Check Type | What It Catches | Example |
|------------|-----------------|---------|
| `null_check` | Missing values | "order_id should never be empty" |
| `unique_check` | Duplicate records | "customer_id must be unique" |
| `value_check` | Out-of-range values | "amount must be > 0" |
| `accepted_values_check` | Invalid categories | "status must be completed, pending, shipped, or cancelled" |
| `freshness_check` | Stale or future dates | "order_date should not be in the future" |
| `row_count_check` | Missing data loads | "table must have at least 100 rows" |

Auto-Profiling

Every run automatically profiles each data source:

  • Row and column counts
  • Null rates per column
  • Unique value counts
  • Min / Max / Mean for numeric columns
  • Data type detection
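
A rough sketch of that per-column profiling with pandas (the dictionary shape and field names here are illustrative, not the engine's actual schema):

```python
# Illustrative auto-profile: row/column counts, null rates, unique counts,
# and min/max/mean for numeric columns.
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    report = {"row_count": len(df), "column_count": df.shape[1], "columns": {}}
    for col in df.columns:
        s = df[col]
        stats = {
            "dtype": str(s.dtype),
            "null_rate": float(s.isna().mean()),
            "unique_count": int(s.nunique()),
        }
        if pd.api.types.is_numeric_dtype(s):
            stats["min"] = float(s.min())
            stats["max"] = float(s.max())
            stats["mean"] = float(s.mean())
        report["columns"][col] = stats
    return report

orders = pd.DataFrame({"order_id": [1, 2, 3, 4], "amount": [10.0, None, 30.0, 20.0]})
print(profile(orders)["columns"]["amount"]["null_rate"])  # 0.25
```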

Statistical Anomaly Detection

Data-Pulse stores profile history in SQLite and uses z-score analysis to detect when today's data deviates significantly from the baseline. If your amount column usually has 2% nulls but today it's 15% - Data-Pulse flags it.
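
The idea can be sketched in a few lines. The threshold and the handling of short histories below are assumptions for illustration; DataPulse's actual defaults may differ.

```python
# Flag a metric when it sits more than `threshold` standard deviations
# from its historical mean. Uses stdlib statistics only.
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, threshold: float = 3.0) -> bool:
    if len(history) < 2:
        return False  # not enough baseline to compute a spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # flat history: any change is a deviation
    return abs(today - mu) / sigma > threshold

null_rate_history = [0.02, 0.021, 0.019, 0.020, 0.022]  # ~2% nulls per run
print(is_anomalous(null_rate_history, 0.15))  # 15% nulls today -> flagged
```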

Slack Alerts with Severity Routing

Failed checks trigger Slack notifications automatically:

  • 🔴 Critical failures - things that should never happen (duplicate primary keys, missing IDs)
  • 🟡 Warning failures - things to investigate (null emails, negative amounts)
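
A sketch of how such an alert could be posted. The payload shape follows Slack's Incoming Webhooks API; the severity-to-emoji mapping and the function names are assumptions, not DataPulse's actual alerts.py.

```python
# Build a severity-routed Slack message and POST it to an Incoming Webhook.
import json
import urllib.request

def build_alert(check_name: str, severity: str, detail: str) -> dict:
    icon = "🔴" if severity == "critical" else "🟡"
    return {"text": f"{icon} [{severity.upper()}] {check_name}: {detail}"}

def send_alert(webhook_url: str, payload: dict) -> None:
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Slack responds with the body "ok"

payload = build_alert("Order ID is unique", "critical", "3 duplicate order_ids")
print(payload["text"])
# send_alert("https://hooks.slack.com/services/YOUR/WEBHOOK/URL", payload)
```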

Real-Time Dashboard

A dark-themed web dashboard showing:

  • Pipeline Health Score - single percentage showing overall data quality
  • Pass/Fail/Critical stats at a glance
  • Check results table - sortable, with severity badges and failure details
  • Auto-refreshes every 30 seconds

Architecture

                        ┌─────────────────┐
                        │   config.yaml   │
                        │  + checks/*.yaml│
                        └────────┬────────┘
                                 │
                                 ▼
         ┌───────────────────────────────────────────────┐
         │                DataPulse Engine               │
         │                                               │
         │  ┌──────────┐   ┌──────────┐   ┌──────────┐   │
         │  │ Profiler │ → │  Check   │ → │ Anomaly  │   │
         │  │          │   │  Runner  │   │ Detector │   │
         │  └──────────┘   └──────────┘   └──────────┘   │
         │                                               │
         └──────────────────────┬────────────────────────┘
                                │
                   ┌────────────┼────────────┐
                   ▼            ▼            ▼
             ┌──────────┐  ┌──────────┐  ┌──────────┐
             │  SQLite  │  │  Slack   │  │  FastAPI │
             │ Storage  │  │  Alerts  │  │  + Dash  │
             └──────────┘  └──────────┘  └──────────┘
                  │                          │
                  │   Historical Data        │   http://localhost:8000
                  │   & Trend Analysis       │   /dashboard
                  ▼                          ▼

Data flows in one direction:

  1. Config defines sources and check rules
  2. Profiler scans every column and records statistics
  3. Check Runner evaluates each YAML rule against the data
  4. Anomaly Detector compares current profile to historical baselines
  5. Results are stored in SQLite, pushed to Slack, and displayed on the dashboard
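
Step 5's SQLite persistence might look roughly like this. The table name and columns are guesses for illustration; engine/storage.py holds the real schema.

```python
# Persist check results to SQLite. Schema is illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")  # the real engine writes to a file on disk
conn.execute(
    """CREATE TABLE IF NOT EXISTS check_results (
           run_at     TEXT,
           check_name TEXT,
           passed     INTEGER,
           severity   TEXT
       )"""
)
conn.execute(
    "INSERT INTO check_results VALUES (datetime('now'), ?, ?, ?)",
    ("Order ID is unique", 0, "critical"),
)
rows = conn.execute("SELECT check_name, passed FROM check_results").fetchall()
print(rows)  # [('Order ID is unique', 0)]
```

Because SQLite needs no server, the history database travels with the project and works identically in Docker and local runs.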

Project Structure

Data-Pulse/
├── engine/
│   ├── __init__.py
│   ├── profiler.py          # Auto-profiles data sources
│   ├── check_runner.py      # Executes YAML-defined checks
│   ├── storage.py           # SQLite persistence layer
│   ├── anomaly.py           # Z-score anomaly detection
│   └── alerts.py            # Slack webhook integration
├── api/
│   ├── __init__.py
│   └── main.py              # FastAPI server + dashboard route
├── checks/
│   ├── orders_checks.yaml   # Check rules for orders table
│   └── customers_checks.yaml
├── sample_data/
│   ├── generate_data.py     # Creates demo data with intentional issues
│   ├── orders.csv           # 1005 orders (5% null amounts, duplicates)
│   └── customers.csv        # 200 customers (8% null emails)
├── templates/
│   └── dashboard.html       # Real-time health dashboard
├── tests/
│   └── __init__.py
├── run.py                   # Main entry point - runs everything
├── config.yaml              # Source definitions + alert config
├── requirements.txt
├── Dockerfile
├── docker-compose.yaml
└── README.md

⚙ Configuration

Adding a New Data Source

Edit config.yaml:

sources:
  my_new_table:
    type: csv
    path: path/to/your/data.csv
    description: "Description of this source"
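
One plausible way the engine could dispatch on that `type` field (the loader table below is an assumption for illustration; only CSV is wired up in this sketch, and the roadmap's Parquet/SQL connectors would slot into the same table):

```python
# Map a source's `type` to a loader function. Only `csv` is shown.
import pandas as pd

LOADERS = {"csv": pd.read_csv}

def load_source(source: dict) -> pd.DataFrame:
    try:
        loader = LOADERS[source["type"]]
    except KeyError:
        raise ValueError(f"unsupported source type: {source['type']}") from None
    return loader(source["path"])

# e.g. load_source({"type": "csv", "path": "sample_data/orders.csv"})
```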

Writing a New Check

Create a YAML file in checks/:

source: path/to/your/data.csv

checks:
  - name: "Revenue is positive"
    type: value_check
    column: revenue
    condition: "> 0"
    severity: critical

Enabling Slack Alerts

  1. Create a Slack Incoming Webhook
  2. Update config.yaml:
alerts:
  slack:
    enabled: true
    webhook_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

Tech Stack

| Technology | Role |
|------------|------|
| Python 3.11 | Core engine, profiling, anomaly detection |
| FastAPI | REST API serving dashboard and check results |
| SQLite | Zero-config metadata storage and history tracking |
| Pandas | Data loading, profiling, and analysis |
| SciPy | Statistical anomaly detection (z-scores) |
| PyYAML | Human-readable check definitions |
| Docker | One-command deployment |
| Slack Webhooks | Real-time alerting |

Roadmap

  • PostgreSQL / MySQL source connectors
  • Schema change detection
  • Email alerts
  • Custom SQL check support
  • Airflow DAG for scheduled runs
  • Profile trend charts on dashboard
  • GitHub Actions CI/CD pipeline
  • CLI tool (datapulse check --source orders)

Contributing

Contributions are welcome. To get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/your-feature)
  3. Commit your changes (git commit -m 'Add your feature')
  4. Push to the branch (git push origin feature/your-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - use it however you want.


Built by Mudit Nautiyal

If this helped you, consider giving it a ⭐
