Open data pipeline infrastructure for discovering, installing, and querying public datasets. Each "tap" extracts data from public sources (APIs, satellites, surveys) and publishes versioned snapshots to S3 with DuckLake catalogs.
```bash
# Install CLI
pip install walkthru-data

# Search for datasets
walkthru search "real estate"

# Install a tap
walkthru install re01

# Query data
walkthru query re01 "SELECT * FROM data WHERE year = 2025 LIMIT 10"

# Time travel
walkthru query re01@20251101 "SELECT COUNT(*) FROM data"
```

| ID | Name | Category | Update Frequency | Records |
|---|---|---|---|---|
| re01 | Nawy Real Estate Data | Real Estate | Monthly | ~1,200 |
| dm01 | World Population Data | Demographics | Monthly | ~250 |
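
If you prefer to script the CLI from Python rather than a shell, here is a minimal sketch that uses only the commands shown in the quick start above; it assumes `walkthru` is on your PATH after `pip install walkthru-data`.

```python
# Minimal sketch: drive the walkthru CLI from Python. Uses only the commands
# shown in the quick start; assumes `walkthru` is on PATH.
import subprocess

def walkthru(*args: str) -> str:
    """Run a walkthru CLI command and return its stdout."""
    result = subprocess.run(
        ["walkthru", *args], capture_output=True, text=True, check=True
    )
    return result.stdout

walkthru("install", "re01")
print(walkthru("query", "re01", "SELECT COUNT(*) FROM data"))
```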
```
walkthru-data/
├── taps/                    # Dataset definitions
│   └── re01/                # Short ID: Real Estate #01
│       ├── tap.yaml         # Manifest
│       ├── extract.sql      # DuckDB extraction script
│       └── catalog.db       # DuckLake version history
│
├── registry/                # Central discovery
│   ├── registry.db          # Searchable catalog
│   └── taps.json            # Index
│
├── cli/                     # CLI tool
│   └── walkthru_data/
│
└── .github/workflows/       # Automated extraction
```
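
For discovery without the CLI, the registry's taps.json index can be read directly from a checkout. The schema sketched here (a list of records with id, name, and category fields) is an assumption for illustration, not the documented format.

```python
# Hedged sketch: list taps from the local registry index. The field names
# (id/name/category) are assumed, not guaranteed by the actual taps.json schema.
import json
from pathlib import Path

index = json.loads(Path("registry/taps.json").read_text())
for tap in index:
    print(f"{tap['id']:<6} {tap['category']:<14} {tap['name']}")
```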
Each tap has its own S3 directory:
```
s3://walkthru-earth/
├── dm01/                            # World Population (isolated)
│   ├── catalog.ducklake             # SQLite catalog
│   └── main/countries/*.parquet     # Data files (DuckLake managed)
│
├── re01/                            # Nawy Real Estate (isolated)
│   ├── catalog.ducklake             # SQLite catalog
│   └── main/compounds/*.parquet     # Data files (DuckLake managed)
│
└── _registry/                       # Central discovery
    └── taps.json                    # Tap metadata
```
Simple Design:
- ✅ Isolated: Each tap = one S3 folder
- ✅ No Cross-Access: Taps can't touch other taps' data
- ✅ Homebrew Model: Contribute taps like Homebrew formulae
- ✅ Cloud-Native: Hard-to-access data → easy S3 access
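
Because every tap publishes plain Parquet under its own prefix, the data can be read straight from object storage with DuckDB and no CLI at all. A minimal sketch, with the endpoint and credential settings left as placeholders; the bucket path follows the tree above.

```python
# Hedged sketch: query a tap's published Parquet files directly from S3 with
# DuckDB. Endpoint/credential values are placeholders, not real configuration.
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
con.sql("SET s3_endpoint = 'YOUR_S3_ENDPOINT'")   # e.g. a Hetzner Object Storage endpoint
con.sql("SET s3_region = 'YOUR_REGION'")

# re01 data lives only under its own prefix (one tap = one folder).
print(con.sql("""
    SELECT COUNT(*) AS n
    FROM read_parquet('s3://walkthru-earth/re01/main/compounds/*.parquet')
""").fetchone())
```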
Format: {category}{number}
Categories:
- re - Real estate
- cl - Climate
- dm - Demographics
- st - Satellite
- tr - Transit
- en - Environment
- if - Infrastructure
Examples: re01, cl01, dm01, st01
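
As a sanity check, the format is easy to validate mechanically. Below is a hypothetical helper (not part of the CLI) that accepts the category codes above plus a two-digit number, matching the examples.

```python
# Hypothetical validator for tap IDs of the form {category}{number}.
# The two-digit number is inferred from the examples (re01, cl01, ...).
import re

CATEGORIES = {"re", "cl", "dm", "st", "tr", "en", "if"}
TAP_ID = re.compile(r"^(?P<category>[a-z]{2})(?P<number>\d{2})$")

def is_valid_tap_id(tap_id: str) -> bool:
    match = TAP_ID.match(tap_id)
    return bool(match) and match.group("category") in CATEGORIES

assert is_valid_tap_id("re01")
assert not is_valid_tap_id("realestate01")
```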
```bash
# 1. Create tap structure
./scripts/create-tap.sh re02 "My Real Estate Data"

# 2. Edit manifest
cd taps/re02
vim tap.yaml

# 3. Write extraction
vim extract.sql

# 4. Test locally
walkthru tap test re02

# 5. Submit PR
git add taps/re02/
git commit -m "Add re02: My Real Estate Data"
git push origin add-re02
```

See CONTRIBUTING.md for details.
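
Before opening the PR, it can also help to exercise the extraction with plain DuckDB, independently of `walkthru tap test`. A rough sketch, assuming extract.sql materializes a table named `data`; that table name and the output path are assumptions.

```python
# Hedged local smoke test for a new tap: run extract.sql with DuckDB and write
# a Parquet snapshot. Assumes the script creates a table named `data`.
from pathlib import Path
import duckdb

con = duckdb.connect()
con.execute(Path("taps/re02/extract.sql").read_text())  # run the extraction script

Path("out").mkdir(exist_ok=True)
con.execute("COPY data TO 'out/re02.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")
print("rows:", con.sql("SELECT COUNT(*) FROM data").fetchone()[0])
```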
- ✅ Short IDs: re01 vs real-estate-nawy-data
- ✅ Version history: DuckLake catalogs track all snapshots
- ✅ Time travel: Query any historical snapshot
- ✅ Auto-generated workflows: From the tap.yaml manifest
- ✅ CLI tool: Install and query datasets
- ✅ Searchable registry: Find datasets quickly
- ✅ Hive partitioning: Efficient queries
- ✅ Open formats: Parquet, GeoParquet
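
Time travel goes through the per-tap DuckLake catalog rather than the raw Parquet files. A hedged sketch follows, assuming the DuckLake DuckDB extension's ATTACH string and AT (VERSION => ...) time-travel syntax, a locally fetched catalog file, and a compounds table under the default schema; S3 credential setup is omitted.

```python
# Hedged sketch: query an older snapshot through a tap's DuckLake catalog.
# The ATTACH string, DATA_PATH option, table name, and version number are all
# assumptions for illustration; check the DuckLake docs for the exact syntax.
import duckdb

con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("LOAD ducklake")

# Assumes catalog.ducklake has been downloaded locally and httpfs/S3 credentials
# are already configured for the data files.
con.sql("ATTACH 'ducklake:catalog.ducklake' AS re01 (DATA_PATH 's3://walkthru-earth/re01/')")

print(con.sql("SELECT COUNT(*) FROM re01.compounds").fetchone())                    # latest snapshot
print(con.sql("SELECT COUNT(*) FROM re01.compounds AT (VERSION => 1)").fetchone())  # earlier snapshot
```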
- Table Format: DuckLake (SQL-based lakehouse)
- Query Engine: DuckDB
- Storage: S3-compatible object storage
- Compression: Parquet with ZSTD
- Catalogs: SQLite per-tap + central PostgreSQL (optional)
- CI/CD: GitHub Actions + Hetzner Cloud runners
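
The Parquet/ZSTD and Hive-partitioning choices map onto a single DuckDB COPY call in the extraction step. A small sketch with an illustrative table, output path, and partition column:

```python
# Sketch: write Hive-partitioned Parquet with ZSTD compression via DuckDB's
# COPY ... PARTITION_BY. Table contents, output path, and partition column are
# illustrative only.
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE data AS SELECT 2025 AS year, 'example' AS name, 42 AS value")
con.sql("""
    COPY data TO 'snapshot'
    (FORMAT PARQUET, COMPRESSION ZSTD, PARTITION_BY (year))
""")
# Produces snapshot/year=2025/*.parquet
```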
- Runners: Self-hosted on Hetzner Cloud (~98% cheaper than GitHub-hosted runners)
- Storage: S3-compatible (Hetzner Object Storage)
- Cost: ~$73/month for 100 datasets, 1TB storage
- Setup: See SECRETS.md for required GitHub secrets configuration
MIT. Data is subject to the original sources' terms.
Built by Walkthru Earth
Powered by:
- DuckDB - Analytics engine
- DuckLake - Lakehouse format
- Hetzner Cloud - Infrastructure