A multi-source scraper engine for extracting public data from Connecticut. Currently supports:
- VGSI — property assessment data from ~80 CT cities/towns
- CT Data — business registry datasets from data.ct.gov
## Installation

This project uses uv for dependency management.

```bash
uv sync
uv sync --extra dev  # for tests
```

## Docker

```bash
docker build -t ctcityscraper .

# Show help
docker run --rm ctcityscraper
# Scrape with persistent data volume
docker run --rm -v ./data:/app/data ctcityscraper load vgsi newhaven --entry-id-max 27000
# Refresh
docker run --rm -v ./data:/app/data ctcityscraper refresh-all --quiet
```

## Usage

### VGSI (property assessments)

```bash
# Populate the cities lookup table (one-time setup)
uv run scrape admin vgsi --fetch-cities
# Scrape a range of properties
uv run scrape load vgsi newhaven --entry-id-max 27000 --workers 10
# Resume after interruption (checkpoints are automatic)
uv run scrape load vgsi newhaven --entry-id-max 27000
# Re-scrape known properties to detect changes
uv run scrape refresh vgsi newhaven
# Download building photos
uv run scrape load vgsi newhaven --entry-id-max 27000 --download-photos
```

### CT Data (business registry)

```bash
# Load all 5 datasets
uv run scrape load llc_ct_data
# Load specific datasets
uv run scrape load llc_ct_data --datasets n7gp-d28j,enwv-52we
# Refresh existing data
uv run scrape refresh llc_ct_data
```

### Scheduled refreshes

```bash
# Refresh all known sources and scopes (no flags needed)
uv run scrape refresh-all --quiet
# Example crontab entry (daily at 2am)
0 2 * * * cd /path/to/ctcityscraper && uv run scrape refresh-all --quiet --log-level WARNING
```

## Project structure

```text
ctcityscraper/
├── scrape.py             # CLI entry point
├── Dockerfile            # Multi-stage Docker build
├── src/engine/           # Generic scraper engine (source-agnostic)
│   ├── base.py           # SourceDefinition, SourceConfig contracts
│   ├── database.py       # ParquetWriter (append-only storage)
│   ├── engine.py         # Parallel execution, rate limiting, checkpoints
│   └── hash.py           # Row hashing for change detection
├── scrapers/             # Source-specific scrapers
│   ├── __init__.py       # REGISTRY of all sources
│   ├── vgsi/             # VGSI property data
│   │   └── source.py
│   └── llc_ct_data/      # CT business registry
│       └── source.py
└── tests/
```
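New sources plug into the engine through the contracts in `src/engine/base.py` and the `REGISTRY` in `scrapers/__init__.py`. The real interface is documented in SCRAPER_DEVELOPMENT.md; purely as an illustrative sketch (the names and signatures below are assumptions, not the actual contract), a source looks roughly like this:

```python
# Illustrative sketch only -- the actual contract lives in src/engine/base.py.
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ExampleSource:
    """A hypothetical source. The engine drives it, handling parallelism,
    rate limiting, checkpoints, and parquet writes."""

    name: str = "example"

    def entry_ids(self, config) -> Iterable[int]:
        # Which entries this run should scrape, e.g. a numeric ID range
        # like VGSI's --entry-id-min/--entry-id-max.
        return range(config.entry_id_min, config.entry_id_max + 1)

    def fetch(self, entry_id: int) -> dict[str, list[dict]]:
        # Fetch and parse one entry, returning rows grouped by table name.
        return {"listings": [{"uuid": str(entry_id), "name": "..."}]}


# Registering the source (hypothetically) would make `scrape load example` work:
# REGISTRY["example"] = ExampleSource()
```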
## Data layout

All data is stored as append-only parquet files under `data/`:

```text
data/
├── newhaven/
│   ├── properties/
│   │   └── 20260226_143000_123456.parquet
│   ├── buildings/
│   ├── ownership/
│   └── ...
├── llc_ct_data/
│   ├── businesses/
│   ├── filings/
│   └── ...
└── _checkpoints/
    └── newhaven.json
```
Each scrape session produces one parquet file per table (batch files are compacted at the end of each run). Every row includes `scraped_at` and `row_hash` metadata.

`load` writes every scraped result to parquet; running `load` twice for the same ID range produces duplicate rows. `refresh` re-scrapes all known entries but only writes rows whose `row_hash` differs from what's already stored. Unchanged entries are skipped, so a refresh over stable data adds zero new rows. The engine preloads existing hashes at the start of each refresh and logs how many rows were written vs. skipped.
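The exact hashing scheme in `src/engine/hash.py` isn't shown here, but the behavior above implies something like the following sketch (the serialization and hash function are assumptions, not the project's actual implementation):

```python
# Sketch of hash-based change detection; details are assumed, not actual.
import hashlib
import json


def row_hash(row: dict) -> str:
    # Metadata columns must be excluded from the hash, otherwise every
    # re-scrape would look like a change.
    data = {k: v for k, v in row.items() if k not in ("scraped_at", "row_hash")}
    payload = json.dumps(data, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()


def rows_to_write(scraped: list[dict], existing_hashes: set[str]) -> list[dict]:
    """During a refresh, keep only rows whose hash isn't already stored."""
    return [row for row in scraped if row_hash(row) not in existing_hashes]
```

Sorting keys gives a stable serialization, so identical data always produces an identical hash regardless of dict ordering.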
## Querying with DuckDB

```sql
-- Current state of all properties
SELECT * FROM read_parquet('data/newhaven/properties/*.parquet')
QUALIFY ROW_NUMBER() OVER (PARTITION BY uuid ORDER BY scraped_at DESC) = 1;
-- Properties that changed between scrapes
SELECT * FROM (
SELECT *,
LAG(row_hash) OVER (PARTITION BY uuid ORDER BY scraped_at) AS prev_hash
FROM read_parquet('data/newhaven/properties/*.parquet')
)
WHERE prev_hash IS NOT NULL AND row_hash != prev_hash;
```

## CLI reference

```text
scrape <command> [options]

Commands:
  load <source> [city]     Load new entries
  refresh <source> [city]  Re-scrape known entries, write only changes
  refresh-all              Refresh all known sources and scopes
  admin <source>           Source-specific admin commands

Global options:
  --data-dir DIR           Parquet output directory (default: data)
  --db PATH                DuckDB path for city lookup (default: ctcityscraper.duckdb)
  --workers N              Concurrent threads (default: 10)
  --rate N                 Requests per second (default: 5)
  --batch-size N           Batch write size (default: 10)
  --checkpoint-every N     Checkpoint frequency (default: 100)
  --no-resume              Don't resume from checkpoint
  --base-url URL           Override base URL for the source
  --quiet                  Suppress progress bars, log to file
  --log-level LEVEL        DEBUG, INFO, WARNING, ERROR (default: INFO)

VGSI-specific:
  --entry-id-min N         Starting entry ID (default: 1)
  --entry-id-max N         Ending entry ID (required for load)
  --fetch-cities           Fetch VGSI city list (admin command)
  --download-photos        Download building photos
  --photo-dir DIR          Photo directory (default: photos)

CT Data-specific:
  --datasets IDS           Comma-separated dataset IDs (default: all)
```
## Development

See SCRAPER_DEVELOPMENT.md for a step-by-step guide.

Run the tests:

```bash
uv run pytest tests/ -v
```