A complete platform to discover, verify, and export every healthcare facility in the United States — hospitals, clinics, dialysis centers, nursing homes, hospices, and more.
One-click pipeline from the browser. After the one-time setup below, no terminal commands are needed.
```bash
# 1. Install dependencies
pip install requests beautifulsoup4 pandas openpyxl lxml flask

# 2. Set your Serper API key (get a free key at serper.dev — 2,500 credits)
export SERPER_API_KEY='your_key_here'

# 3. Start the platform
python3 run.py

# 4. Open your browser at http://localhost:5000
```

That's it. Everything runs from the browser.
Searches Google Maps + imports government databases to build a complete list of US healthcare facilities with their official websites.
Input: Select a state, click "Start Pipeline"
Output: CSV/Excel/JSON with facility name, address, phone, website, type, confidence score
User selects state (e.g., Wyoming) → Clicks "Start Pipeline"
│
┌───────────────┼───────────────────────────────────────┐
│ │
▼ │
Step 1: CMS Hospital Import (FREE) │
• Downloads 5,426 Medicare-certified hospitals │
• Source: data.cms.gov │
• Data: name, address, phone, CMS ID │
• NO website, NO email │
• ~35 seconds │
│ │
▼ │
Step 2: Government Datasets (FREE) │
• Dialysis centers (~7,600) │
• Hospice providers (~5,500) │
• Home health agencies (~11,000) │
• Nursing homes (~15,000) │
• Filtered to selected state │
• ~3-8 minutes │
│ │
▼ │
Step 3: NPPES/NPI Import (FREE) │
• Every billing healthcare provider has an NPI │
• Catches small practices CMS misses │
• Source: npiregistry.cms.hhs.gov │
• ~1-5 minutes per state │
│ │
▼ │
Step 4: Google Discovery (USES API CREDITS) │
• Searches Google Maps + Web for each city × category │
• "hospital in Cheyenne WY", "dental in Casper WY"... │
• 5 cities × 26 categories = 130 searches per state │
• These results COME WITH websites + coordinates │
• ~5-8 minutes per state │
│ │
▼ │
Step 5: Normalize (FREE, ~15 seconds) │
• Cleans names: "ST. JOSEPHS MED CTR LLC" │
→ "Saint Josephs Medical Center" │
• Formats phones: "2145551234" → "(214) 555-1234" │
• Standardizes addresses, URLs, states, ZIPs │
│ │
▼ │
Step 6: Find Websites (USES API CREDITS) │
• For facilities WITHOUT a website (from Steps 1-3) │
• Searches Google 3 times per facility │
• Scores candidates, picks best official website │
• Blacklists Yelp, Facebook, Healthgrades, etc. │
• ~3 credits per facility │
│ │
▼ │
Step 7: Score (FREE, ~5 seconds) │
• Assigns 0.0-1.0 confidence to each facility │
• Based on: has name, address, phone, website, NPI, etc. │
│ │
▼ │
Step 8: Export (FREE, ~2 seconds) │
• CSV (open in Excel) │
• XLSX (formatted Excel) │
• JSON (for developers) │
│ │
▼ │
DONE — Dashboard shows results │
└───────────────────────────────────────────────────────┘
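Step 5's transformations can be sketched in a few lines. These helpers are illustrative stand-ins for `engines/normalization.py` (the abbreviation and suffix lists here are assumptions; the real engine covers many more cases):

```python
import re

# Hypothetical abbreviation/suffix tables for the sketch.
ABBREVIATIONS = {"st": "Saint", "med": "Medical", "ctr": "Center"}
SUFFIXES = {"llc", "inc", "corp", "ltd"}

def normalize_name(raw: str) -> str:
    """Expand abbreviations, drop legal suffixes, title-case the rest."""
    out = []
    for word in re.findall(r"[A-Za-z]+", raw):
        key = word.lower()
        if key in SUFFIXES:
            continue  # drop "LLC", "Inc", etc.
        out.append(ABBREVIATIONS.get(key, key.capitalize()))
    return " ".join(out)

def format_phone(raw: str) -> str:
    """Format a 10-digit US number as (XXX) XXX-XXXX."""
    digits = re.sub(r"\D", "", raw)[-10:]
    if len(digits) != 10:
        return raw  # leave malformed numbers untouched
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

print(normalize_name("ST. JOSEPHS MED CTR LLC"))  # Saint Josephs Medical Center
print(format_phone("2145551234"))                 # (214) 555-1234
```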
| Source | Type | Records | Has Website? | Cost |
|---|---|---|---|---|
| CMS Hospitals | Federal database | 5,426 | No | FREE |
| CMS Dialysis | Federal database | ~7,600 | No | FREE |
| CMS Hospice | Federal database | ~5,500 | No | FREE |
| CMS Home Health | Federal database | ~11,000 | No | FREE |
| CMS Nursing Homes | Federal database | ~15,000 | No | FREE |
| NPPES/NPI Registry | Federal database | Varies | No | FREE |
| Google Maps (via Serper) | Search API | ~20 per search | Yes | 1 credit/search |
| Google Web (via Serper) | Search API | ~10 per search | Yes | 1 credit/search |
| Website Resolution | Search API | 3 searches/facility | Finds website | 3 credits/facility |
Free tier: 2,500 Serper credits (enough for ~5-6 full state runs)
| Page | URL | What it shows |
|---|---|---|
| Dashboard | `/` | Pipeline controls, metrics, charts |
| Facilities | `/facilities` | Searchable list with filters (state, type, website) |
| Detail | `/facilities/123` | Single facility — all fields, sources, scores |
| Map | `/map` | All facilities on a map (green=website, red=none) |
| Export | `/export` | Download CSV/Excel/JSON with state filter |
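The five pages map onto Flask routes roughly like this. The handlers below are placeholders; `web_ui/app.py` is the real implementation:

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def dashboard():
    return "Pipeline controls, metrics, charts"

@app.route("/facilities")
def facilities():
    return "Searchable facility list"

@app.route("/facilities/<int:facility_id>")
def detail(facility_id):
    # Integer converter keeps non-numeric IDs from matching this route.
    return f"Facility {facility_id}"

@app.route("/map")
def map_view():
    return "Map view"

@app.route("/export")
def export_page():
    return "Export page"

client = app.test_client()
print(client.get("/facilities/123").get_data(as_text=True))  # Facility 123
```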
**Start Pipeline:** runs all 8 steps. Select a state, set a website limit, click Start.

**Find Websites Only:** skips import/discovery and just searches Google for official websites of facilities already in the database. Use this after the first full run to increase coverage.

**Skip govt data:** checkbox that skips Step 2 (supplemental datasets), making the pipeline ~5 minutes faster.
| Operation | Credits | Notes |
|---|---|---|
| CMS import | 0 | Free government data |
| Supplemental | 0 | Free government data |
| NPPES import | 0 | Free government data |
| Serper Maps search | 1 | Per search query |
| Serper Web search | 1 | Per search query |
| Website resolution | 3 | Per facility (3 Google searches) |
Per-state estimates:
| State Size | Discovery | Websites (50) | Total |
|---|---|---|---|
| Small (WY, VT) | ~260 credits | ~150 credits | ~410 credits |
| Medium (AL, OR) | ~260 credits | ~150 credits | ~410 credits |
| Large (CA, TX) | ~728 credits | ~150 credits | ~878 credits |
Free tier: 2,500 credits. Paid: $1 per 1,000 credits.
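The table's totals follow from a simple formula. In this sketch, the ×2 multiplier (one Maps plus one Web search per city × category) and the 14-city count for large states are inferences from the ~260 and ~728 figures above, not confirmed constants:

```python
CATEGORIES = 26              # facility categories searched per city
CREDITS_PER_FACILITY = 3     # website resolution: 3 searches per facility

def estimate_credits(cities: int, website_lookups: int) -> int:
    """Rough Serper credit estimate for one state run."""
    discovery = cities * CATEGORIES * 2  # assumed: Maps + Web per query
    websites = website_lookups * CREDITS_PER_FACILITY
    return discovery + websites

print(estimate_credits(5, 50))   # small state: 260 + 150 = 410
print(estimate_credits(14, 50))  # large state: 728 + 150 = 878
```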
Each exported record contains:
| Field | Description | Example |
|---|---|---|
| facility_name | Normalized name | Saint Josephs Medical Center |
| facility_type | Classification | Hospital, Clinic, Ambulatory |
| address_1 | Street address | 123 North Main Street Suite 200 |
| city | City | Cheyenne |
| state | 2-letter code | WY |
| zip | 5-digit ZIP | 82001 |
| phone_primary | Formatted phone | (307) 634-2273 |
| website_url | Official website | https://cheyenneregional.org |
| website_domain | Root domain | cheyenneregional.org |
| entity_confidence | Data quality (0-1) | 0.75 |
| website_confidence | Website match (0-1) | 0.92 |
| specialties | Medical specialties | cardiology, orthopedic |
| npi_ids | NPI number | 1234567890 |
| cms_ids | CMS/Medicare ID | 530010 |
| status | Active/closed | active |
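Assembled from the example values in the table, one JSON-exported record would look like this (illustrative values, not real data):

```python
import json

record = {
    "facility_name": "Saint Josephs Medical Center",
    "facility_type": "Hospital",
    "address_1": "123 North Main Street Suite 200",
    "city": "Cheyenne",
    "state": "WY",
    "zip": "82001",
    "phone_primary": "(307) 634-2273",
    "website_url": "https://cheyenneregional.org",
    "website_domain": "cheyenneregional.org",
    "entity_confidence": 0.75,
    "website_confidence": 0.92,
    "specialties": "cardiology, orthopedic",
    "npi_ids": "1234567890",
    "cms_ids": "530010",
    "status": "active",
}
print(json.dumps(record, indent=2))
```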
For each facility without a website:
Facility: "Crenshaw Community Hospital" in Luverne, AL
Search Google 3 times:
1. "Crenshaw Community Hospital" Luverne AL official website
2. Crenshaw Community Hospital Luverne AL
3. Crenshaw Community Hospital +13343353374
Collect ~20 candidate URLs from results
Score each candidate (0 to 1):
crenshawcommunityhospital.com → 0.95 ✓ WINNER
healthgrades.com/hospital/... → 0.00 ✗ BLACKLISTED
facebook.com/Crenshaw → 0.00 ✗ BLACKLISTED
yelp.com/biz/crenshaw → 0.00 ✗ BLACKLISTED
Scoring signals:
+0.30 domain contains facility name
+0.20 title contains facility name
+0.10 snippet mentions city/state
+0.10 domain is .com or .org
+0.20 NOT a directory site
+0.10 position #1 in results
-0.50 IS a directory (yelp, facebook, etc.)
Best score > 0.4 → save as official website
Blacklisted domains (never selected as official website): yelp.com, facebook.com, healthgrades.com, zocdoc.com, vitals.com, webmd.com, linkedin.com, instagram.com, twitter.com, youtube.com, yellowpages.com, bbb.org, indeed.com, google.com, wikipedia.org, npidb.org, mapquest.com
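The scoring signals and blacklist above could be combined roughly as follows. The weights mirror the listed signals, but `engines/website_resolution.py` is the source of truth and this name-matching heuristic is an assumption:

```python
from urllib.parse import urlparse

BLACKLIST = {"yelp.com", "facebook.com", "healthgrades.com", "zocdoc.com",
             "vitals.com", "webmd.com", "linkedin.com", "instagram.com",
             "twitter.com", "youtube.com", "yellowpages.com", "bbb.org",
             "indeed.com", "google.com", "wikipedia.org", "npidb.org",
             "mapquest.com"}

def score_candidate(url, title, snippet, position, name, city, state):
    """Score one candidate URL from 0.0 to 1.0."""
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if domain in BLACKLIST:
        return 0.0  # directories are never the official site
    slug = "".join(c for c in name.lower() if c.isalnum())
    score = 0.20  # not a directory site
    if slug[:12] in domain.replace("-", "").replace(".", ""):
        score += 0.30  # domain contains facility name
    if name.lower() in title.lower():
        score += 0.20  # title contains facility name
    if city.lower() in snippet.lower() or state.lower() in snippet.lower():
        score += 0.10  # snippet mentions city/state
    if domain.endswith((".com", ".org")):
        score += 0.10
    if position == 1:
        score += 0.10  # top result
    return round(score, 2)

s = score_candidate(
    "https://crenshawcommunityhospital.com", "Crenshaw Community Hospital",
    "Hospital in Luverne, AL", 1, "Crenshaw Community Hospital", "Luverne", "AL")
print(s)  # clears the 0.4 acceptance threshold
```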
serper_scraper/
├── run.py ← START HERE
├── database.py ← SQLite database (11 tables)
├── config.py ← API key + settings
│
├── connectors/ ← Data source connectors
│ ├── serper_connector.py ← Google Maps/Web via Serper API
│ ├── cms_connector.py ← CMS hospital data (free)
│ ├── nppes_connector.py ← NPI registry (free)
│ └── supplemental.py ← Dialysis/hospice/nursing/home health
│
├── engines/ ← Processing engines
│ ├── geo_discovery.py ← Orchestrates all imports + discovery
│ ├── website_resolution.py ← Finds official websites via Google
│ ├── normalization.py ← Cleans names/phones/addresses
│ ├── scoring.py ← Confidence scoring (0-1)
│ ├── deduplication.py ← Finds and merges duplicates
│ └── ... ← 15+ more engines
│
├── web_ui/ ← Flask web interface
│ ├── app.py ← Routes + pipeline runner
│ └── templates/
│ ├── index.html ← Dashboard + pipeline controls
│ ├── facilities.html ← Searchable facility list
│ ├── detail.html ← Single facility detail
│ ├── map.html ← Map view (Leaflet.js)
│ └── export.html ← Export page
│
├── export/
│ └── exporter.py ← CSV/XLSX/JSON generation
│
├── exports/ ← Generated export files go here
│
└── healthcare_providers.db ← SQLite database (auto-created)
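`config.py` holds the API key and settings; a minimal sketch of how it might read them (the variable names besides `SERPER_API_KEY` are assumptions):

```python
import os

# Read the Serper key from the environment set during installation.
SERPER_API_KEY = os.environ.get("SERPER_API_KEY", "")
DATABASE_PATH = "healthcare_providers.db"
WEBSITE_SCORE_THRESHOLD = 0.4  # acceptance cutoff from the resolution step

if not SERPER_API_KEY:
    print("Warning: SERPER_API_KEY not set; Steps 4 and 6 will fail")
```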
| Table | Purpose | Key Fields |
|---|---|---|
| facilities | Main entity table | name, address, phone, website, type, confidence |
| organizations | Health systems/chains | name, type, location_count |
| source_records | Raw data lineage | source_type, raw_name, raw_payload |
| crawl_results | Website crawl data | url, status_code, extracted_json |
| website_candidates | Website scoring | candidate_url, score, reasons |
| discovery_jobs | Search job tracking | state, city, category, status |
| geo_cells | Grid-based coverage | lat/lng bounds, results_count |
| crawl_policies | robots.txt compliance | domain, rate_limit, is_excluded |
| change_log | Field change tracking | field_name, old_value, new_value |
| review_queue | Manual QA queue | review_type, priority, status |
| facilities_fts | Full-text search index | name, city, specialties |
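A minimal slice of the main `facilities` table, for orientation only; `database.py` defines all 11 tables with more columns than shown here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE facilities (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        address_1 TEXT, city TEXT, state TEXT, zip TEXT,
        phone_primary TEXT,
        website_url TEXT,
        facility_type TEXT,
        entity_confidence REAL DEFAULT 0.0
    )
""")
conn.execute(
    "INSERT INTO facilities (name, city, state) VALUES (?, ?, ?)",
    ("Cheyenne Regional Medical Center", "Cheyenne", "WY"),
)
row = conn.execute("SELECT name, state FROM facilities").fetchone()
print(row)  # ('Cheyenne Regional Medical Center', 'WY')
```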
Coverage = facilities with website / total facilities × 100
After Steps 1-3 (imports): 0% coverage (no websites)
After Step 4 (Google): ~5-10% coverage (Serper results have websites)
After Step 6 (resolve 50): ~10-12% coverage
After Step 6 (resolve 500): ~15-20% coverage
After Step 6 (resolve 5000): ~80-90% coverage (uses 15,000 API credits)
To increase coverage cheaply: Use "Find Websites Only" button repeatedly.
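The coverage formula above, applied to an illustrative run:

```python
def coverage(with_website: int, total: int) -> float:
    """Percent of facilities that have a website."""
    return 100.0 * with_website / total if total else 0.0

print(coverage(120, 1500))  # 8.0 — in the ~5-10% range typical after Step 4
```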
| Problem | Solution |
|---|---|
| `403 Forbidden` from Serper | API key expired. Get a new one at serper.dev |
| `database disk image is malformed` | Delete `healthcare_providers.db` and restart |
| Pipeline stuck on Step 2 | Check "Skip govt data" or use a smaller state (WY) |
| `unhashable type: 'dict'` | Known scoring bug; scoring still errors on some records, non-critical |
| Port 5000 in use | Kill the other process: `lsof -ti:5000 \| xargs kill` |
| No facilities on map | Only Serper-discovered facilities have lat/lng |
Run all 80 tests:

````bash
python3 -c "exec(open('TESTING_GUIDE.md').read().split('```bash')[-1].split('```')[0])"
````

Or test individual components:
```bash
# Database
python3 -c "from database import init_db; init_db(); print('DB OK')"

# CMS API
python3 -c "from connectors.cms_connector import fetch_cms_hospitals; print(len(fetch_cms_hospitals(limit=5)), 'hospitals')"

# Serper API
python3 -c "from connectors.serper_connector import search_maps; print(len(search_maps('hospital Houston TX').get('places',[])), 'places')"
```

- Python 3.10+ — all backend
- Flask — web UI server
- SQLite — database (no external DB needed)
- Serper.dev — Google Maps/Search API
- Leaflet.js — map visualization
- Chart.js — dashboard charts
- Pandas + openpyxl — Excel export
No Docker, no Redis, no PostgreSQL, no Node.js. Just Python + SQLite.
Internal project. Not for public distribution.