An AI-powered tool that generates custom Playwright-based web scrapers. Point it at a page, and it analyzes the DOM and screenshot with GPT-4o to write a working Python scraper — then tests and refines it automatically.
Built for journalists and researchers who need to monitor many sources without writing code from scratch each time.
```
scraper-factory/
├── cli.py                      # Entry point — generate, test, register
├── config.json                 # What to scrape and which fields to extract
├── .env                        # API keys and DB connection (never commit this)
│
├── scraper_generator/          # Core library
│   ├── generator.py            # Page analysis, code generation, refinement loop
│   ├── test.py                 # Test framework (validates generated scrapers)
│   ├── utils.py                # Shared utilities
│   └── prompts/                # Jinja2 templates for LLM prompts
│
├── scrapers/                   # Generated scrapers live here
│   └── <org_name>/
│       ├── scraper.py          # The generated scraper
│       ├── seed.json           # Registration metadata
│       ├── result.json         # Output from the last run
│       └── page_analysis.json  # Selector candidates from DOM analysis
│
├── scripts/
│   ├── seed.py                 # Seeds MongoDB from seed.json files
│   ├── scrape_indexes.py       # Runs all active scrapers
│   ├── scrape_articles.py      # Fetches full article content
│   └── setup.py                # DB setup
│
├── example_configs/            # Prebuilt content type configs
│   ├── articles.json
│   ├── faculty_bios.json
│   ├── police_reports.json
│   └── school_board_meetings.json
│
├── streamlit/
│   ├── app.py                  # Dashboard for viewing scraped data
│   └── requirements.txt
│
├── requirements.txt
└── Dockerfile
```
Clone the repo and install dependencies:

```
git clone https://github.com/yourusername/scraper-factory.git
cd scraper-factory
pip install -r requirements.txt
playwright install chromium
```

Create a `.env` file in the project root. This file is required before you can do anything:
```
OPENAI_API_KEY=sk-your-key-here
MONGO_URI=mongodb://localhost:27017
DB_NAME=scraper_data
```

MongoDB options: use a local MongoDB instance (`mongodb://localhost:27017`) or a hosted cluster like MongoDB Atlas or DigitalOcean Managed MongoDB. The URI goes in `MONGO_URI`.
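Before generating anything, it can be worth confirming the URI is reachable. A quick check with `pymongo` and `python-dotenv` (assumed to be installed for this stack; the script name is made up) might look like:

```python
# check_mongo.py (made-up name) — verify the MONGO_URI in .env is reachable.
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # reads MONGO_URI and DB_NAME from .env in the project root

client = MongoClient(os.environ["MONGO_URI"], serverSelectionTimeoutMS=5000)
client.admin.command("ping")  # raises ServerSelectionTimeoutError if unreachable
print("Connected:", client[os.environ["DB_NAME"]].list_collection_names())
```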
config.json at the project root controls what kind of content to scrape and which fields to extract. Every scraper you generate will follow this schema.
```
{
  "content_type": "articles",
  "description": "News articles from an organization",
  "item_label": "article",
  "fields": [
    { "name": "title", "description": "Headline or title", "required": true, "type": "text" },
    { "name": "date", "description": "Publication date", "required": false, "type": "date" },
    { "name": "url", "description": "Link to the article", "required": true, "type": "url" }
  ]
}
```

`content_type` also becomes the MongoDB collection name (e.g., `articles` → data stored in `articles`, scrapers registered in `articles_scrapers`).
Field types:
"text"— plain string"date"— validated as YYYY-MM-DD"url"— validated as a URL
Prebuilt configs are in example_configs/ — copy one to config.json to use it:
```
cp example_configs/police_reports.json config.json
```

| Config | Fields |
|---|---|
| `articles.json` | `title`, `date`, `url` |
| `police_reports.json` | `title`, `date`, `url`, `incident_type` |
| `school_board_meetings.json` | `title`, `date`, `url`, `agenda_url` |
| `faculty_bios.json` | `name`, `position`, `department`, `url` |
You can also write your own. Any fields you define here will be passed to the AI and validated on the scraped output.
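For example, a hypothetical config for city council minutes could be written by hand or generated with a short script; the `council_minutes` content type and its fields below are invented for illustration:

```python
# build_config.py (made-up name) — write a hypothetical custom config.json.
import json

config = {
    "content_type": "council_minutes",  # also becomes the MongoDB collection name
    "description": "City council meeting minutes",
    "item_label": "meeting",
    "fields": [
        {"name": "title", "description": "Meeting title", "required": True, "type": "text"},
        {"name": "date", "description": "Meeting date", "required": True, "type": "date"},
        {"name": "url", "description": "Link to the meeting page", "required": True, "type": "url"},
        {"name": "minutes_url", "description": "PDF of the minutes", "required": False, "type": "url"},
    ],
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```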
To generate a scraper for a single site:

```
python cli.py generate --org "Los Angeles Times" --url "https://www.latimes.com/"
```

GitHub Codespaces: the generator opens a real browser window during refinement. In Codespaces there's no display, so prefix the command with `xvfb-run` to provide a virtual one:

```
xvfb-run python cli.py generate --org "Los Angeles Times" --url "https://www.latimes.com/"
```

Or run without arguments for an interactive prompt:

```
python cli.py generate
```

To use a different content config than config.json:

```
python cli.py generate --org "LAPD" --url "https://lapd.com/reports" --config example_configs/police_reports.json
```

The tool will:
- Load the page in a headless browser and take a screenshot
- Send the DOM + screenshot to GPT-4o to identify CSS selectors (a standalone sketch of this step follows the list)
- Generate a complete Playwright scraper based on those selectors
- Test it — if it fails or returns zero results, it refines automatically
- Save the scraper to `scrapers/<org_name>/`
Generate many scrapers at once from a JSON or CSV file.
JSON format (one entry per URL):
```
[
  { "org": "Chicago Tribune", "url": "https://chicagotribune.com/news/" },
  { "org": "LA Times", "url": "https://www.latimes.com/local" },
  { "org": "Boston Globe", "url": "https://www.bostonglobe.com/metro" }
]
```

JSON format (multiple URLs per org):
```
[
  {
    "org": "LA Times",
    "urls": [
      "https://www.latimes.com/california",
      "https://www.latimes.com/entertainment-arts",
      "https://www.latimes.com/sports"
    ]
  }
]
```

CSV format:
```
org,url
Chicago Tribune,https://chicagotribune.com/news/
LA Times,https://www.latimes.com/local
```

Run it:

```
python cli.py generate --batch-file batch/local_papers.json
```

Each scraper is generated, tested, and registered independently. Failed ones are logged but won't stop the rest.
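The CSV format holds one URL per row. If you want the multi-URL-per-org JSON form instead, a few lines of Python can group a flat CSV into it (a sketch; `sources.csv` and the output filename are assumed names):

```python
# csv_to_batch.py (made-up name) — group a flat org,url CSV into the
# multi-URL-per-org batch format.
import csv
import json
import os
from collections import defaultdict

groups = defaultdict(list)
with open("sources.csv", newline="") as f:
    for row in csv.DictReader(f):
        groups[row["org"]].append(row["url"])

entries = [{"org": org, "urls": urls} for org, urls in groups.items()]

os.makedirs("batch", exist_ok=True)
with open("batch/local_papers.json", "w") as f:
    json.dump(entries, f, indent=2)
```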
Once you have scrapers generated, seed them into MongoDB so the scraping scripts know what to run.
```
python scripts/seed.py
```

This walks through every `scrapers/<org>/seed.json` file and upserts each org into the `{content_type}_scrapers` collection in MongoDB. Each scraper entry gets default metadata:
- `active: true`
- `last_run_status: "error"` (updated when it runs)
- `last_run` and `last_run_count` fields
Run this again any time you add new scrapers.
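Under the hood this is a standard pymongo upsert. A sketch of the equivalent operation for one org follows; it is not the actual `seed.py` code, the collection name assumes `content_type` is `articles`, and `seed.json` is assumed to carry an `org` key with no fields overlapping the defaults:

```python
# Illustrative upsert of one seed.json into {content_type}_scrapers.
import json
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()
db = MongoClient(os.environ["MONGO_URI"])[os.environ["DB_NAME"]]

with open("scrapers/los_angeles_times/seed.json") as f:
    seed = json.load(f)

db["articles_scrapers"].update_one(
    {"org": seed["org"]},              # match existing registration by org
    {
        "$set": seed,                  # refresh metadata on every seeding run
        "$setOnInsert": {              # defaults applied only on first insert
            "active": True,
            "last_run_status": "error",
            "last_run": None,
            "last_run_count": None,
        },
    },
    upsert=True,
)
```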
If you need to set up the DB schema or indexes first:
```
python scripts/setup.py
```

To run all registered scrapers:

```
python scripts/scrape_indexes.py
```

This reads the `{content_type}_scrapers` collection, runs every scraper marked `active: true`, and writes results into MongoDB (the `{content_type}` collection). Each scraper also updates its `last_run`, `last_run_status`, and `last_run_count` metadata.
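In outline, that loop queries for active entries, executes each generated `scraper.py`, and records results plus metadata. The following sketch approximates the logic; it is not `scrape_indexes.py` itself, the collection names assume `content_type` is `articles`, and the folder-naming rule is inferred from the `scrapers/los_angeles_times` example below:

```python
# Sketch of the active-scraper run loop; the real logic is in
# scripts/scrape_indexes.py.
import json
import os
import subprocess
from datetime import datetime, timezone

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()
db = MongoClient(os.environ["MONGO_URI"])[os.environ["DB_NAME"]]

for entry in db["articles_scrapers"].find({"active": True}):
    folder = "scrapers/" + entry["org"].lower().replace(" ", "_")  # assumed naming
    proc = subprocess.run(["python", "scraper.py"], cwd=folder)

    items = []
    if proc.returncode == 0:
        with open(os.path.join(folder, "result.json")) as f:
            items = json.load(f)
        if items:
            db["articles"].insert_many(items)  # the real script likely dedupes

    db["articles_scrapers"].update_one(
        {"_id": entry["_id"]},
        {"$set": {
            "last_run": datetime.now(timezone.utc),
            "last_run_status": "success" if items else "error",
            "last_run_count": len(items),
        }},
    )
```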
You can also run a single scraper directly:

```
cd scrapers/los_angeles_times
python scraper.py
```

This writes results to `result.json` in the same directory. Useful for testing a scraper in isolation.
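For reference, a generated `scraper.py` tends to be a short, self-contained Playwright script in this spirit; the CSS selectors here are invented for illustration, and real output varies by page:

```python
# Illustrative shape of a generated scraper.py; not actual generator output.
import json

from playwright.sync_api import sync_playwright

URL = "https://www.latimes.com/"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(URL, wait_until="domcontentloaded")

    items = []
    for card in page.query_selector_all("article.promo"):  # invented selector
        link = card.query_selector("h2 a")
        date = card.query_selector("time")
        if link is None:
            continue
        items.append({
            "title": link.inner_text().strip(),
            "date": date.get_attribute("datetime") if date else None,
            "url": link.get_attribute("href"),
        })
    browser.close()

with open("result.json", "w") as f:
    json.dump(items, f, indent=2)
```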
To fetch full article bodies:

```
python scripts/scrape_articles.py
```

This goes deeper than the index page, fetching body content from the individual article URLs stored in MongoDB.
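Extracting clean body text from arbitrary article pages is a hard problem in itself. One common approach, shown here as a sketch, uses the third-party `trafilatura` library; whether `scrape_articles.py` works this way, and the `body` field name, are assumptions:

```python
# Sketch of fetching full article bodies for URLs already in MongoDB, using
# trafilatura for text extraction.
import os

import trafilatura
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()
db = MongoClient(os.environ["MONGO_URI"])[os.environ["DB_NAME"]]

for doc in db["articles"].find({"body": {"$exists": False}}).limit(25):
    html = trafilatura.fetch_url(doc["url"])
    body = trafilatura.extract(html) if html else None
    if body:
        db["articles"].update_one({"_id": doc["_id"]}, {"$set": {"body": body}})
```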
A Streamlit app lets you browse scraped data, filter by organization, and export to CSV.
```
cd streamlit
pip install -r requirements.txt
streamlit run app.py
```

It reads from the same MongoDB collections, using `config.json` to know which collection to query. The `.env` file must be present with valid `MONGO_URI` and `DB_NAME` values. The app can also be deployed like any standard Streamlit app.
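To give a sense of what the dashboard does, here is a minimal Streamlit sketch along the same lines; it is not the shipped `app.py`, and the `org` field on stored items and the collection name are assumptions:

```python
# Minimal dashboard sketch in the spirit of streamlit/app.py.
import os

import pandas as pd
import streamlit as st
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()
db = MongoClient(os.environ["MONGO_URI"])[os.environ["DB_NAME"]]

st.title("Scraped articles")
df = pd.DataFrame(db["articles"].find({}, {"_id": 0}))

if "org" in df.columns:
    org = st.selectbox("Organization", ["All"] + sorted(df["org"].dropna().unique()))
    if org != "All":
        df = df[df["org"] == org]

st.dataframe(df)
st.download_button("Export CSV", df.to_csv(index=False), file_name="articles.csv")
```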
To test scrapers:

```
# Test a single scraper
python cli.py test --path scrapers/los_angeles_times/scraper.py

# Test all scrapers for an org
python cli.py test --org "Los Angeles Times"
```

Generation already runs the test suite automatically, but you can re-run tests after editing a scraper.
Tests are dynamic based on your `config.json`:

- Checks that scraped items have the expected fields
- Validates `date` fields as YYYY-MM-DD
- Validates `url` fields as valid URLs
- Enforces non-blank values for required fields
Logs are written to the `logs/` directory:

```
logs/
  generate.log             # Generation process
  test.log                 # Test results
  <scraper_name>_llm.log   # Full LLM prompts and responses (useful for debugging)
```
- API cost: Each scraper generation typically costs $0.01–0.10 in OpenAI API fees. Running the generated scrapers afterward incurs no API fees.
- robots.txt: The tool checks `robots.txt` before generating a scraper. If scraping is disallowed, you'll be warned.
- Rate limiting: Don't run scrapers more frequently than necessary. Respect the sites you're scraping.
- Terms of service: Check each site's ToS before deploying scrapers against it.
Skip local setup and run via Docker:
```
# Single scraper
docker run -it --init --rm \
  -v "$PWD/scrapers:/app/scrapers" \
  -v "$PWD/logs:/app/logs" \
  -e OPENAI_API_KEY=sk-... \
  towcenter/scraper-factory:latest \
  generate --org "LA Times" --url "https://www.latimes.com/"

# Batch
docker run -it --init --rm \
  -v "$PWD/scrapers:/app/scrapers" \
  -v "$PWD/logs:/app/logs" \
  -v "$PWD/batch:/app/batch" \
  -e OPENAI_API_KEY=sk-... \
  towcenter/scraper-factory:latest \
  generate --batch-file batch/example.json
```

Maintained by: Tow Center for Digital Journalism

Last Updated: February 2026