This project is a Python-based web scraping and data processing pipeline for collecting and cleaning real estate listing data from Zillow-style property sites. It manages sessions, parses structured listing data, and produces analysis-ready CSV files for downstream modeling or analytics.
It is released here as a standalone, reusable data collection tool.
- Scrapes housing listings from Zillow-style endpoints
- Handles session setup and request flow
- Parses and normalizes listing data
- Outputs clean CSV files for analysis
- Includes backup + recovery JSON snapshots
- Modular architecture for reuse in other projects
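As a taste of the "parse and normalize" step, here is a minimal sketch of the kind of field cleanup `parser.py` performs. The field names and raw formats below are assumptions for illustration, not the project's actual schema:

```python
def normalize_listing(raw: dict) -> dict:
    """Turn raw scraped string fields into typed, analysis-ready values."""
    price = raw.get("price", "").replace("$", "").replace(",", "").strip()
    beds = (raw.get("beds") or "").split()
    return {
        "address": (raw.get("address") or "").strip(),
        "price": int(price) if price.isdigit() else None,
        "beds": float(beds[0]) if beds and beds[0].replace(".", "", 1).isdigit() else None,
    }

print(normalize_listing({"address": " 123 Main St ", "price": "$350,000", "beds": "3 bds"}))
# → {'address': '123 Main St', 'price': 350000, 'beds': 3.0}
```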
```
.
├── data/
│   ├── raw/                 # Raw scraped CSV output
│   ├── processed/           # Cleaned / analysis-ready CSVs
│   └── tmp/                 # Backup JSON + intermediate files
├── notebooks/
│   └── sanity_check.ipynb   # Quick inspection & validation
├── src/
│   ├── scraper.py           # Core scraping logic
│   ├── session.py           # Session & request handling
│   ├── parser.py            # Parsing + field normalization
│   ├── build.py             # Data pipeline orchestration
│   └── main.py              # Entry point / CLI-style runner
├── .env                     # Local environment variables
├── .gitignore
├── README.md
```
- Create a virtual environment:

  ```
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install dependencies:

  ```
  pip install -r requirements.txt
  ```

Run the main pipeline:

```
python -m src.main
```
This will:

- Initialize a session
- Scrape housing data
- Parse data
- Clean data
- Save results to data/raw/ and data/processed/
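The flow above can be sketched end to end. This is a minimal illustration assuming the pipeline passes lists of dicts between stages; the function names are hypothetical stand-ins, not the project's real API:

```python
import csv
from pathlib import Path

def save_csv(rows: list[dict], path: Path) -> None:
    """Write a list of dicts to CSV, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

def clean(rows: list[dict]) -> list[dict]:
    """Placeholder cleaning step: drop listings with no price."""
    return [r for r in rows if r.get("price")]

def run_pipeline(scraped: list[dict], data_dir: Path) -> list[dict]:
    """Persist raw output, clean it, persist the cleaned result."""
    save_csv(scraped, data_dir / "raw" / "listings.csv")
    cleaned = clean(scraped)
    save_csv(cleaned, data_dir / "processed" / "listings_clean.csv")
    return cleaned
```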
- data/raw/irving_tx_housing.csv → raw scraped data
- data/processed/irving_tx_housing_clean.csv → cleaned dataset
- data/tmp/*.json → backup snapshots for recovery / debugging
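The backup snapshots could be produced and recovered roughly like this. This is a sketch assuming timestamped file names; the project's actual snapshot format may differ:

```python
import json
import time
from pathlib import Path

def save_snapshot(listings: list[dict], tmp_dir: Path) -> Path:
    """Write a timestamped JSON backup so a failed run can be inspected or resumed."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    path = tmp_dir / f"snapshot_{int(time.time())}.json"
    path.write_text(json.dumps(listings))
    return path

def load_latest_snapshot(tmp_dir: Path) -> list[dict]:
    """Recover the most recent backup (most recent = last by file name)."""
    latest = max(tmp_dir.glob("snapshot_*.json"))
    return json.loads(latest.read_text())
```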
This project is structured to be:

- Modular
- Reusable
- Analysis-friendly
- Easy to extend to other cities or data sources
It separates concerns between:

- Session handling
- Scraping
- Parsing
- Pipeline orchestration
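A minimal sketch of how those concerns compose. Every name below is an illustrative stand-in rather than the project's real API, and the scrape step is stubbed with a fake payload instead of a live request:

```python
import re

def make_session() -> dict:
    # session.py concern: shared request state (headers, cookies, retries)
    return {"headers": {"User-Agent": "research-scraper/0.1"}}

def scrape(session: dict, city: str, state: str) -> list[str]:
    # scraper.py concern: fetch raw payloads (stubbed with fake HTML here)
    return [f'<li data-price="350000">{city}, {state}</li>']

def parse(payload: str) -> dict:
    # parser.py concern: extract structured fields from a raw payload
    match = re.search(r'data-price="(\d+)"', payload)
    return {"price": int(match.group(1)) if match else None}

def run(city: str, state: str) -> list[dict]:
    # build.py / main.py concern: orchestrate the steps end to end
    return [parse(p) for p in scrape(make_session(), city, state)]

print(run("Irving", "TX"))
# → [{'price': 350000}]
```

Because each concern is its own function (module, in the real project), any one piece can be swapped out, e.g. a different scraper for another listing platform, without touching the rest.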
This project is for educational and research purposes only. Users are responsible for complying with the terms of service of any website they access and for respecting robots.txt, rate limits, and local laws.
Planned extensions:

- CLI arguments for city, state, and filters
- Support for additional listing platforms
- Database export (Postgres / DuckDB)
- Geospatial enrichment (school zones, crime, transit, etc.)
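The planned CLI could take a shape like the following; the argument names and defaults are assumptions, not a committed interface:

```python
import argparse

def build_arg_parser() -> argparse.ArgumentParser:
    """One possible CLI for the planned city/state/filter arguments."""
    parser = argparse.ArgumentParser(description="Scrape housing listings")
    parser.add_argument("--city", default="Irving")
    parser.add_argument("--state", default="TX")
    parser.add_argument("--max-price", type=int, default=None,
                        help="Drop listings above this price")
    return parser

args = build_arg_parser().parse_args(["--city", "Dallas", "--max-price", "400000"])
print(args.city, args.state, args.max_price)
# → Dallas TX 400000
```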
Built by Bilal Haroon
Data Science · ML · Systems · Open Source