
Scraper Factory

An AI-powered tool that generates custom Playwright-based web scrapers. Point it at a page, and it analyzes the DOM and screenshot with GPT-4o to write a working Python scraper — then tests and refines it automatically.

Built for journalists and researchers who need to monitor many sources without writing code from scratch each time.


Project Structure

scraper-factory/
├── cli.py                        # Entry point — generate, test, register
├── config.json                   # What to scrape and which fields to extract
├── .env                          # API keys and DB connection (never commit this)
│
├── scraper_generator/            # Core library
│   ├── generator.py              # Page analysis, code generation, refinement loop
│   ├── test.py                   # Test framework (validates generated scrapers)
│   ├── utils.py                  # Shared utilities
│   └── prompts/                  # Jinja2 templates for LLM prompts
│
├── scrapers/                     # Generated scrapers live here
│   └── <org_name>/
│       ├── scraper.py            # The generated scraper
│       ├── seed.json             # Registration metadata
│       ├── result.json           # Output from the last run
│       └── page_analysis.json    # Selector candidates from DOM analysis
│
├── scripts/
│   ├── seed.py                   # Seeds MongoDB from seed.json files
│   ├── scrape_indexes.py         # Runs all active scrapers
│   ├── scrape_articles.py        # Fetches full article content
│   └── setup.py                  # DB setup
│
├── example_configs/              # Prebuilt content type configs
│   ├── articles.json
│   ├── faculty_bios.json
│   ├── police_reports.json
│   └── school_board_meetings.json
│
├── streamlit/
│   ├── app.py                    # Dashboard for viewing scraped data
│   └── requirements.txt
│
├── requirements.txt
└── Dockerfile

Installation

git clone https://github.com/yourusername/scraper-factory.git
cd scraper-factory
pip install -r requirements.txt
playwright install chromium

Step 1: Set Up Your .env

Create a .env file in the project root. This file is required before you can do anything.

OPENAI_API_KEY=sk-your-key-here
MONGO_URI=mongodb://localhost:27017
DB_NAME=scraper_data

MongoDB options: Use a local MongoDB instance (mongodb://localhost:27017) or a hosted cluster like MongoDB Atlas or DigitalOcean Managed MongoDB. The URI goes in MONGO_URI.
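
To sanity-check the connection before generating anything, a quick script like this works (a sketch assuming pymongo and python-dotenv, which is how values like these are typically loaded; check requirements.txt for the project's actual dependencies):

import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # reads OPENAI_API_KEY, MONGO_URI, DB_NAME from .env

client = MongoClient(os.environ["MONGO_URI"])
db = client[os.environ["DB_NAME"]]
print(db.list_collection_names())  # should print without raising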


Step 2: Configure What to Scrape

config.json at the project root controls what kind of content to scrape and which fields to extract. Every scraper you generate will follow this schema.

{
  "content_type": "articles",
  "description": "News articles from an organization",
  "item_label": "article",
  "fields": [
    { "name": "title", "description": "Headline or title", "required": true,  "type": "text" },
    { "name": "date",  "description": "Publication date",  "required": false, "type": "date" },
    { "name": "url",   "description": "Link to the article", "required": true, "type": "url" }
  ]
}

content_type also becomes the MongoDB collection name (e.g., articles → data stored in articles, scrapers registered in articles_scrapers).

Field types:

  • "text" — plain string
  • "date" — validated as YYYY-MM-DD
  • "url" — validated as a URL
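
These checks map onto small standard-library tests. A rough sketch (the helper names are illustrative, not the project's actual API):

from datetime import datetime
from urllib.parse import urlparse

def is_valid_date(value: str) -> bool:
    """True if value parses as YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def is_valid_url(value: str) -> bool:
    """True if value has a scheme and a network location."""
    parts = urlparse(value)
    return bool(parts.scheme and parts.netloc)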

Prebuilt configs are in example_configs/ — copy one to config.json to use it:

cp example_configs/police_reports.json config.json

Config                        Fields
articles.json                 title, date, url
police_reports.json           title, date, url, incident_type
school_board_meetings.json    title, date, url, agenda_url
faculty_bios.json             name, position, department, url

You can also write your own. Any fields you define here will be passed to the AI and validated on the scraped output.


Step 3: Generate Scrapers

Single scraper

python cli.py generate --org "Los Angeles Times" --url "https://www.latimes.com/"

GitHub Codespaces: The generator opens a real browser window during refinement. In Codespaces there's no display, so prefix the command with xvfb-run to provide a virtual one:

xvfb-run python cli.py generate --org "Los Angeles Times" --url "https://www.latimes.com/"

Or run without arguments for an interactive prompt:

python cli.py generate

To use a different content config than config.json:

python cli.py generate --org "LAPD" --url "https://lapd.com/reports" --config example_configs/police_reports.json

The tool will:

  1. Load the page in a headless browser and take a screenshot
  2. Send the DOM + screenshot to GPT-4o to identify CSS selectors
  3. Generate a complete Playwright scraper based on those selectors
  4. Test it — if it fails or returns zero results, it refines automatically
  5. Save the scraper to scrapers/<org_name>/
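
In outline, that generate-test-refine cycle looks like the sketch below. This is illustrative flow pseudocode, not the actual generator.py; analyze_page, generate_scraper, run_tests, refine_scraper, and the retry cap are all stand-in names:

MAX_ATTEMPTS = 3  # assumed cap; the real limit may differ

def build_scraper(org, url, config):
    analysis = analyze_page(url)               # DOM + screenshot -> selector candidates
    code = generate_scraper(analysis, config)  # GPT-4o writes the Playwright scraper
    for attempt in range(MAX_ATTEMPTS):
        results, errors = run_tests(code, config)
        if results and not errors:
            return code                        # passing scraper, save it
        code = refine_scraper(code, errors)    # feed failures back to the model
    raise RuntimeError(f"could not produce a working scraper for {org}")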

Batch generation

Generate many scrapers at once from a JSON or CSV file.

JSON format (one entry per URL):

[
  { "org": "Chicago Tribune", "url": "https://chicagotribune.com/news/" },
  { "org": "LA Times",        "url": "https://www.latimes.com/local" },
  { "org": "Boston Globe",    "url": "https://www.bostonglobe.com/metro" }
]

JSON format (multiple URLs per org):

[
  {
    "org": "LA Times",
    "urls": [
      "https://www.latimes.com/california",
      "https://www.latimes.com/entertainment-arts",
      "https://www.latimes.com/sports"
    ]
  }
]

CSV format:

org,url
Chicago Tribune,https://chicagotribune.com/news/
LA Times,https://www.latimes.com/local

Run it:

python cli.py generate --batch-file batch/local_papers.json

Each scraper is generated, tested, and registered independently. Failed ones are logged but won't stop the rest.
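
All three formats reduce to (org, url) pairs. If you need to preprocess or validate a batch file yourself, a loader along these lines covers them (illustrative, not the CLI's internal parser):

import csv
import json
from pathlib import Path

def load_batch(path: str):
    """Yield (org, url) pairs from a JSON or CSV batch file."""
    p = Path(path)
    if p.suffix == ".csv":
        with p.open() as f:
            for row in csv.DictReader(f):
                yield row["org"], row["url"]
    else:
        for entry in json.loads(p.read_text()):
            for url in entry.get("urls", [entry.get("url")]):
                yield entry["org"], url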


Step 4: Set Up the Database

Once you have scrapers generated, seed them into MongoDB so the scraping scripts know what to run.

python scripts/seed.py

This walks through every scrapers/<org>/seed.json file and upserts each org into the {content_type}_scrapers collection in MongoDB. Each scraper entry gets default metadata:

  • active: true
  • last_run_status: "error" (updated when it runs)
  • a last_run timestamp and a last_run_count (both updated on each run)

Run this again any time you add new scrapers.
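
The upsert has roughly this shape (a sketch assuming pymongo; it presumes seed.json carries an org key, so treat scripts/seed.py as authoritative):

import json
import os
from pathlib import Path

from pymongo import MongoClient

content_type = json.loads(Path("config.json").read_text())["content_type"]
db = MongoClient(os.environ["MONGO_URI"])[os.environ["DB_NAME"]]

for seed_file in Path("scrapers").glob("*/seed.json"):
    seed = json.loads(seed_file.read_text())
    db[f"{content_type}_scrapers"].update_one(
        {"org": seed["org"]},  # assumed match key; check seed.json's real schema
        {"$set": seed,
         "$setOnInsert": {"active": True, "last_run_status": "error"}},
        upsert=True,
    )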

If you need to set up the DB schema or indexes first:

python scripts/setup.py

Step 5: Start Scraping

Run all active scrapers (index pages only)

python scripts/scrape_indexes.py

This reads the {content_type}_scrapers collection, runs every scraper marked active: true, and writes results into MongoDB ({content_type} collection). Each scraper also updates its last_run, last_run_status, and last_run_count metadata.
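
Schematically, the run loop does something like this (illustrative only; run_scraper is a hypothetical stand-in for however the script actually executes scrapers/<org>/scraper.py):

import json
import os
from datetime import datetime, timezone
from pathlib import Path

from pymongo import MongoClient

content_type = json.loads(Path("config.json").read_text())["content_type"]
db = MongoClient(os.environ["MONGO_URI"])[os.environ["DB_NAME"]]

for entry in db[f"{content_type}_scrapers"].find({"active": True}):
    try:
        items = run_scraper(entry)  # hypothetical: runs the org's scraper.py
        if items:
            db[content_type].insert_many(items)
        status, count = "success", len(items)
    except Exception:
        status, count = "error", 0
    db[f"{content_type}_scrapers"].update_one(
        {"_id": entry["_id"]},
        {"$set": {"last_run": datetime.now(timezone.utc),
                  "last_run_status": status,
                  "last_run_count": count}},
    )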

Run a single scraper manually

cd scrapers/los_angeles_times
python scraper.py

This writes results to result.json in the same directory. Useful for testing a scraper in isolation.
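
For orientation, a generated scraper.py usually amounts to a short Playwright script in this shape (a hand-written sketch with made-up selectors, not real tool output):

import json

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.latimes.com/")
    items = []
    for card in page.query_selector_all("article.promo"):  # made-up selector
        link = card.query_selector("h2 a")
        if link is None:
            continue
        items.append({"title": link.inner_text().strip(),
                      "url": link.get_attribute("href")})
    browser.close()

with open("result.json", "w") as f:
    json.dump(items, f, indent=2)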

Fetch full article content (if needed)

python scripts/scrape_articles.py

This is for going deeper than the index page — fetching body content from individual article URLs stored in MongoDB.
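
In outline, that second pass might look like this (a sketch; the body field and the article selector are assumptions, not the script's actual schema):

import json
import os
from pathlib import Path

from playwright.sync_api import sync_playwright
from pymongo import MongoClient

content_type = json.loads(Path("config.json").read_text())["content_type"]
db = MongoClient(os.environ["MONGO_URI"])[os.environ["DB_NAME"]]

with sync_playwright() as p:
    page = p.chromium.launch(headless=True).new_page()
    for doc in db[content_type].find({"body": {"$exists": False}}):
        page.goto(doc["url"])
        body = page.inner_text("article")  # assumed container for article text
        db[content_type].update_one({"_id": doc["_id"]},
                                    {"$set": {"body": body}})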


View the Dashboard

A Streamlit app lets you browse scraped data, filter by organization, and export to CSV.

cd streamlit
pip install -r requirements.txt
streamlit run app.py

It reads from the same MongoDB collections, using config.json to determine which collection to query. The .env file must be present with valid MONGO_URI and DB_NAME values. Since it's a standard Streamlit app, it can also be deployed to production like any other.
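
A pared-down version of such a dashboard fits in a few lines, for reference (a sketch, not the bundled app.py; it assumes scraped documents carry an org field):

import json
import os
from pathlib import Path

import pandas as pd
import streamlit as st
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # picks up MONGO_URI and DB_NAME
content_type = json.loads(Path("config.json").read_text())["content_type"]
db = MongoClient(os.environ["MONGO_URI"])[os.environ["DB_NAME"]]

df = pd.DataFrame(db[content_type].find({}, {"_id": 0}))
org = st.selectbox("Organization", sorted(df["org"].unique()))  # assumed field
st.dataframe(df[df["org"] == org])
st.download_button("Export CSV", df.to_csv(index=False), "export.csv")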


Testing Scrapers

# Test a single scraper
python cli.py test --path scrapers/los_angeles_times/scraper.py

# Test all scrapers for an org
python cli.py test --org "Los Angeles Times"

Generation runs the test suite automatically, but you can re-run the tests at any time, for example after hand-editing a generated scraper.

Tests are dynamic based on your config.json:

  • Checks that scraped items have the expected fields
  • Validates date fields as YYYY-MM-DD
  • Validates url fields as valid URLs
  • Enforces non-blank values for required fields
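
Put together, the per-item check is essentially a loop over the config's field list. A sketch, reusing the validators from the Step 2 sketch above (the real test.py covers more cases):

def check_item(item: dict, fields: list[dict]) -> list[str]:
    """Return a list of validation errors for one scraped item."""
    errors = []
    for field in fields:
        value = item.get(field["name"])
        if field["required"] and not (value and str(value).strip()):
            errors.append(f"missing required field: {field['name']}")
        elif value and field["type"] == "date" and not is_valid_date(value):
            errors.append(f"bad date: {field['name']}={value!r}")
        elif value and field["type"] == "url" and not is_valid_url(value):
            errors.append(f"bad url: {field['name']}={value!r}")
    return errors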

Logs

logs/
  generate.log              # Generation process
  test.log                  # Test results
  <scraper_name>_llm.log    # Full LLM prompts and responses (useful for debugging)

Notes on Cost and Ethics

  • API cost: Each scraper generation typically costs $0.01–0.10 in OpenAI API fees. Running a generated scraper makes no further API calls, so it costs nothing beyond generation.
  • robots.txt: The tool checks robots.txt before generating a scraper. If scraping is disallowed, you'll be warned. (A standalone check is sketched after this list.)
  • Rate limiting: Don't run scrapers more frequently than necessary. Respect the sites you're scraping.
  • Terms of service: Check each site's ToS before deploying scrapers against it.
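
If you want to verify a target's robots.txt yourself, the standard library covers it (a standalone sketch; the tool's own check may differ):

from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def scraping_allowed(url: str, user_agent: str = "*") -> bool:
    """Check whether robots.txt permits fetching the given URL."""
    rp = RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    return rp.can_fetch(user_agent, url)

print(scraping_allowed("https://www.latimes.com/"))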

Docker

Skip local setup and run via Docker:

# Single scraper
docker run -it --init --rm \
  -v "$PWD/scrapers:/app/scrapers" \
  -v "$PWD/logs:/app/logs" \
  -e OPENAI_API_KEY=sk-... \
  towcenter/scraper-factory:latest \
  generate --org "LA Times" --url "https://www.latimes.com/"

# Batch
docker run -it --init --rm \
  -v "$PWD/scrapers:/app/scrapers" \
  -v "$PWD/logs:/app/logs" \
  -v "$PWD/batch:/app/batch" \
  -e OPENAI_API_KEY=sk-... \
  towcenter/scraper-factory:latest \
  generate --batch-file batch/example.json

Maintained by: Tow Center for Digital Journalism
Last updated: February 2026
