A production-grade, artifact-driven data engineering pipeline designed to ingest, normalize, and aggregate personal health telemetry (Blood Pressure, Heart Rate, Sleep) from heterogeneous sources.
Core Engineering Signal: This project demonstrates Idempotency, Atomic I/O, Cache-Aware Execution, and Metadata Management without relying on heavy external frameworks like Airflow or Dagster.
The pipeline follows a Push-Model architecture where nodes return explicit artifact paths, coordinated by a central Orchestrator backed by a SQLite Registry.
graph TD
subgraph "Orchestration Layer"
Orchestrator[Pipeline Orchestrator]
Registry[(SQLite Artifact Registry)]
State[Pipeline State JSON]
end
subgraph "Ingestion Layer"
GS[Google Sheets Source]
MB[Mi Band Drive Source]
Raw{Atomic Write}
end
subgraph "Processing Layer"
Norm[Normalizers]
Val[Validators]
Merge[Aggregator]
end
Orchestrator -->|Topological Sort| Norm
Orchestrator -->|Check Hash| Registry
GS --> Raw
MB --> Raw
Raw -->|Push Paths| Orchestrator
Norm -->|Push Paths| Orchestrator
Val -->|Push Paths| Orchestrator
Merge -->|Push Paths| Orchestrator
Orchestrator -->|Register Artifact| Registry
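To make the push model concrete, here is a minimal sketch of such a coordination loop. The class and method names (Node, Registry.lookup, compute_input_hash) are illustrative assumptions, not the project's actual API:

```python
# Illustrative sketch of a push-model orchestration loop.
# Names (Node, Registry methods, run signature) are assumptions for demonstration.
from dataclasses import dataclass
from graphlib import TopologicalSorter
from pathlib import Path
from typing import Callable

@dataclass
class Node:
    name: str
    deps: list[str]                               # upstream node names
    run: Callable[[list[Path]], list[Path]]       # returns artifact paths ("push")

def execute(nodes: dict[str, Node], registry) -> dict[str, list[Path]]:
    outputs: dict[str, list[Path]] = {}
    order = TopologicalSorter({n.name: set(n.deps) for n in nodes.values()})
    for name in order.static_order():             # DAG topological sort
        node = nodes[name]
        inputs = [p for dep in node.deps for p in outputs[dep]]
        digest = registry.compute_input_hash(node, inputs)
        cached = registry.lookup(name, digest)
        if cached:                                # cache hit: skip the stage
            outputs[name] = cached
            continue
        outputs[name] = node.run(inputs)          # node pushes its artifact paths back
        registry.register(name, digest, outputs[name])
    return outputs
```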
The Orchestrator calculates a deterministic Input Hash for every stage, combining:
- Data Identity: SHA-256 content hash of input files.
- Code Identity: AST-based source code hashing of logic classes (inspect.getsource).
- Config Identity: State of global settings and pyproject.toml.

If the hash matches a valid entry in the Registry, the stage is skipped entirely (0ms execution time).
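A simplified sketch of how such a composite hash can be computed follows; the helper names and field ordering are assumptions, and the real pipeline may normalize code via the AST rather than hashing raw source:

```python
# Sketch: deterministic stage input hash from data, code, and config identity.
# Helper names and field ordering are illustrative assumptions.
import hashlib
import inspect
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def stage_input_hash(input_files: list[Path], logic_cls: type, config: dict) -> str:
    payload = {
        # Data identity: content hashes of every input file, order-independent.
        "data": sorted(file_sha256(p) for p in input_files),
        # Code identity: source of the logic class (the real pipeline may AST-normalize first).
        "code": hashlib.sha256(inspect.getsource(logic_cls).encode()).hexdigest(),
        # Config identity: settings serialized deterministically.
        "config": hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest(),
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()
```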
To prevent "Zombie Artifacts" (corrupted files from crashed runs), the pipeline uses an Atomic Rename Pattern:
- Write: Nodes write to filename.csv.tmp.
- Verify: Node returns success.
- Commit: Orchestrator performs an OS-level atomic rename (replace()) to .csv and registers the metadata in the same transaction scope.
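A minimal sketch of the write/commit split, assuming pandas DataFrames and a `.tmp` suffix convention (function names are illustrative, not the project's exact helpers):

```python
# Sketch of the atomic write/commit pattern; names and suffix convention are assumptions.
import os
from pathlib import Path

def write_artifact(df, final_path: Path) -> Path:
    """Node-side write: always target a temporary file, never the final name."""
    tmp_path = final_path.with_suffix(final_path.suffix + ".tmp")
    df.to_csv(tmp_path, index=False)          # assumes a pandas DataFrame
    return tmp_path                           # the orchestrator commits on success

def commit_artifact(tmp_path: Path) -> Path:
    """Orchestrator-side commit: promote the completed .tmp file atomically."""
    final_path = tmp_path.with_suffix("")     # strip the trailing ".tmp"
    os.replace(tmp_path, final_path)          # atomic on the same filesystem
    return final_path
```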
Metadata is treated as a first-class citizen. A SQLite backend tracks:
- Lineage: Which inputs produced which outputs.
- Versioning: Automatic v1, v2 incrementing.
- Audit: Execution timestamps and row counts.
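One plausible shape for that registry table, sketched with column names that are assumptions rather than the exact schema of registry_sqlite.py:

```python
# Sketch of an artifact registry table; columns are illustrative assumptions.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS artifacts (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    stage       TEXT NOT NULL,                     -- producing pipeline stage
    version     INTEGER NOT NULL,                  -- automatic v1, v2 incrementing
    input_hash  TEXT NOT NULL,                     -- composite data/code/config hash
    path        TEXT NOT NULL,                     -- committed artifact location
    parents     TEXT,                              -- JSON list of upstream artifact ids (lineage)
    row_count   INTEGER,                           -- audit: rows written
    created_at  TEXT DEFAULT (datetime('now'))     -- audit: execution timestamp
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_stage_hash_version
    ON artifacts (stage, input_hash, version);
"""

conn = sqlite3.connect("registry.db")
conn.executescript(SCHEMA)
conn.commit()
```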
The Streamlit dashboard implements a Hybrid Auth pattern:
- Public View: Automatically loads synthetic data (Mock Mode) for demonstration purposes, ensuring zero-click value for visitors.
- Private View: An authenticated sidebar login unlocks the real telemetry data (decrypted from cloud secrets) for the owner.
- UX: Features Altair charts with magnetic crosshairs, dynamic date filtering, and physiological threshold markers.
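A condensed sketch of the mock/real switch; the secret keys, password check, and file locations below are illustrative assumptions, not the dashboard's actual configuration:

```python
# Sketch of the hybrid public/private dashboard pattern.
# Secret names, the auth check, and data paths are assumptions.
import pandas as pd
import streamlit as st

@st.cache_data
def load_mock_metrics() -> pd.DataFrame:
    # Public view: synthetic data, no credentials required.
    return pd.read_csv("data/mock/merged_metrics.csv")

def load_private_metrics() -> pd.DataFrame:
    # Private view: real telemetry located via secrets available only to the owner.
    return pd.read_csv(st.secrets["merged_metrics_path"])

with st.sidebar:
    password = st.text_input("Owner login", type="password")

authenticated = bool(password) and password == st.secrets.get("dashboard_password", "")
df = load_private_metrics() if authenticated else load_mock_metrics()
st.caption("Private telemetry" if authenticated else "Mock mode (synthetic data)")
```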
To support ephemeral environments (Docker/GitHub Actions), the pipeline implements a Remote State architecture.
- Pull: On startup, entrypoint.sh downloads the SQLite Registry and Merged Metrics from a secure Google Drive folder.
- Push: After execution, updated state is atomically uploaded back to Drive.
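In Python terms, the pull/push steps that entrypoint.sh drives might look roughly like the sketch below, using the Drive v3 client; the file IDs, paths, and function names are assumptions:

```python
# Sketch of remote state sync against Google Drive; IDs and paths are assumptions.
from google.oauth2.service_account import Credentials
from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload, MediaIoBaseDownload

creds = Credentials.from_service_account_file(
    ".secrets/service_account.json",
    scopes=["https://www.googleapis.com/auth/drive"],
)
drive = build("drive", "v3", credentials=creds)

def pull(file_id: str, local_path: str) -> None:
    """Download registry/metrics state before the pipeline runs."""
    request = drive.files().get_media(fileId=file_id)
    with open(local_path, "wb") as fh:
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()

def push(file_id: str, local_path: str) -> None:
    """Upload the updated state after the pipeline finishes."""
    media = MediaFileUpload(local_path, resumable=True)
    drive.files().update(fileId=file_id, media_body=media).execute()
```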
- Python 3.10+
- Google Cloud Service Account (Drive & Sheets API enabled)
# 1. Clone the repo
git clone https://github.com/YOUR_USERNAME/personal-data-platform.git
cd personal-data-platform
# 2. Install dependencies
pip install -e .
# 3. Configure Secrets
cp .env.example .env
# Edit .env with your Google Credentials path and IDs

The CLI (main.py) supports granular control over execution.
# Standard Run (Ingest -> Normalize -> Validate -> Merge)
python main.py
# Skip Ingestion (Use existing raw files)
python main.py --skip-ingestion
# Resume from a specific stage (e.g., Validation)
python main.py --start-stage validation
# Clean up all generated artifacts (Dry Run)
python main.py --clean --dry-run
# Run Dashboard (Auto-switches based on env vars)
streamlit run dashboard.py
# Run via Docker (Stateless)
docker build -t pdp .
docker run --rm --env-file .env -v $(pwd)/.secrets:/app/.secrets pdp

personal-data-platform/
├── config/ # Pydantic settings & logging config
├── ingestion/ # Source connectors (Strategy Pattern)
├── pipeline/ # Core Engine
│ ├── orchestrator.py # DAG execution & Caching logic
│ ├── registry_sqlite.py # Repository Pattern for Metadata
│ └── nodes.py # Stage wrappers
├── processing/ # Business Logic (Normalizers/Validators)
├── scripts/ # Utility scripts (Sync, Mock Data)
├── dashboard.py # Streamlit Visualization Layer
├── Dockerfile # Multi-stage build definition
├── fly.toml # Cloud deployment config
├── entrypoint.sh # Cloud state synchronization logic
└── main.py # CLI Entrypoint
- Repository Pattern: Decoupling database access from domain logic (registry_sqlite.py).
- Strategy Pattern: Interchangeable ingestion sources (ingestion/interfaces.py); see the sketch after this list.
- Inversion of Control: Nodes "push" results to the Orchestrator rather than the Orchestrator "pulling" from hardcoded paths.
- Guard Clauses: Enforcing "Identity > Status" in execution logic.
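As an illustration of the Strategy Pattern bullet above, here is a sketch of an interchangeable source interface; the Protocol name, method signature, and file names are assumptions, not the exact contents of ingestion/interfaces.py:

```python
# Sketch of interchangeable ingestion sources (Strategy Pattern); names are assumptions.
from pathlib import Path
from typing import Protocol

class IngestionSource(Protocol):
    """Common contract every concrete source must satisfy."""
    def fetch(self, output_dir: Path) -> list[Path]:
        """Download raw data and return the paths of the files written."""
        ...

class GoogleSheetsSource:
    def fetch(self, output_dir: Path) -> list[Path]:
        out = output_dir / "sheets_raw.csv"
        # ... call the Sheets API and write rows to `out` (details omitted) ...
        return [out]

class MiBandDriveSource:
    def fetch(self, output_dir: Path) -> list[Path]:
        out = output_dir / "miband_raw.csv"
        # ... download the Mi Band export from Drive to `out` (details omitted) ...
        return [out]

def ingest(sources: list[IngestionSource], output_dir: Path) -> list[Path]:
    # The caller depends only on the Protocol, never on a concrete source class.
    return [path for source in sources for path in source.fetch(output_dir)]
```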
- DAG Topological Sort
- SQLite Artifact Registry
- Atomic File Operations
- Visualization Layer (Streamlit/Altair)
- Docker Containerization & Cloud Sync
- CI/CD (GitHub Actions -> Fly.io)
- Unit Tests for DAG Logic