Batch scrape website URLs from CSV with the Olostep API. This project provides both a Streamlit UI and a CLI for creating batches, polling progress, retrieving markdown/html/json content, and saving a single JSON output with completed and failed items.
This project helps you:
- Process large URL lists from CSV in a single batch run.
- Keep every result mapped back to your own `custom_id`.
- Monitor live progress and logs in the Streamlit UI.
- Export one JSON payload with completed results and failed URLs.
- Use an Olostep Parser when you need structured extraction.
- Python 3.9+
- An Olostep account with a valid API key or API token.
Create and activate a virtual environment, then install dependencies:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Create `.env` in the project root:

```bash
OLOSTEP_API_KEY=your_olostep_api_key_here
```

This repo also accepts `OLOSTEP_API_TOKEN`. You can create an API key from the Olostep API Keys dashboard.
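Since either variable name is accepted, credential lookup can be sketched as below. This is an illustration only; the helper name `resolve_api_key` is our own, not part of the repo:

```python
import os


def resolve_api_key(env=None):
    """Return the Olostep credential, preferring OLOSTEP_API_KEY over OLOSTEP_API_TOKEN."""
    env = os.environ if env is None else env
    return env.get("OLOSTEP_API_KEY") or env.get("OLOSTEP_API_TOKEN")
```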
Run the Streamlit UI:

```bash
streamlit run app.py
```

Or run the CLI:

```bash
python main.py --csv data/urls_sample.csv --out output.json --formats markdown
```

For each batch run, the workflow:

- Reads a CSV with `custom_id` or `id`, plus `url`.
- Creates a batch through Olostep Batch.
- Polls the batch until processing completes.
- Lists completed and failed items with cursor-based pagination.
- Retrieves content for completed items from `/v1/retrieve`.
- Writes a single JSON payload with batch metadata, results, and failed items.
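The polling step above can be sketched as a small loop. This is an illustrative sketch, not the repo's actual implementation in `src/`; the terminal status values (`completed`, `failed`) and the injected `fetch_status` callable are assumptions:

```python
import time


def poll_until_done(fetch_status, poll_seconds=5.0, sleep=time.sleep):
    """Call fetch_status() until the batch reports a terminal status.

    fetch_status is any zero-argument callable returning a batch dict with a
    "status" key, e.g. a wrapper around the Olostep batch-status endpoint.
    """
    while True:
        batch = fetch_status()
        # "completed"/"failed" as terminal states is an assumption for this sketch.
        if batch.get("status") in ("completed", "failed"):
            return batch
        sleep(poll_seconds)
```

Injecting `sleep` keeps the loop testable without real delays.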
Launch:

```bash
streamlit run app.py
```

The UI includes:
- CSV upload.
- Retrieve format selection.
- Live batch status and progress.
- Streaming logs during batch execution.
- JSON download after the run completes.
Default run:

```bash
python main.py --csv data/urls_sample.csv --out output.json
```

Example with additional options:

```bash
python main.py \
  --csv data/urls_sample.csv \
  --out output.json \
  --country US \
  --parser-id your_parser_id \
  --formats markdown,html
```

You can also pass the token directly:

```bash
python main.py --csv data/urls_sample.csv --out output.json --token "YOUR_TOKEN"
```

Required columns:
- `custom_id` or `id`
- `url`
Example:

```csv
custom_id,url
heat-003,https://heat.gov/tools-resources/cdc-heatrisk-dashboard/
heat-004,https://heat.gov/tools-resources/extreme-heat-vulnerability-mapping-tool/
```

Sample file: `data/urls_sample.csv`
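The required-column rule (either `custom_id` or `id`, plus `url`) can be sketched with the standard library. This is an illustrative reader, not the project's actual parser in `src/batch_workflow.py`:

```python
import csv
import io


def read_batch_rows(csv_text):
    """Yield (custom_id, url) pairs, accepting either a custom_id or an id column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        custom_id = row.get("custom_id") or row.get("id")
        url = row.get("url")
        if not custom_id or not url:
            raise ValueError(f"row missing custom_id/id or url: {row}")
        yield custom_id, url
```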
The output file contains:

- `batch` and `batch_id` for the final Olostep batch object.
- `requested_count`, `results_count`, and `failed_count` summary fields.
- `results` with `custom_id`, `url`, `retrieve_id`, and retrieved content.
- `failed_items` for URLs returned as failed batch items.
If content is too large, Olostep may return `*_hosted_url` fields instead of inline content. Hosted URLs typically expire after about 7 days, so download what you need promptly.
```json
{
  "batch": { "id": "batch_...", "status": "completed" },
  "batch_id": "batch_...",
  "requested_count": 2,
  "results_count": 2,
  "results": [
    {
      "custom_id": "heat-004",
      "url": "https://...",
      "retrieve_id": "...",
      "retrieved": {
        "success": true,
        "size_exceeded": false,
        "markdown_content": "...",
        "markdown_hosted_url": null
      }
    }
  ],
  "failed_count": 0,
  "failed_items": []
}
```

- `--country`: Set the ISO 3166-1 alpha-2 country code, such as `US` or `IN`.
- `--parser-id`: Use an Olostep Parser for structured extraction.
- `--poll-seconds`: Control the polling interval between batch status checks.
- `--formats`: Request `markdown`, `html`, or `json` from `/v1/retrieve`.
- `--items-limit`: Control page size for `/v1/batches/{batch_id}/items` pagination.
- `--log-every`: Log batch status every N polls in CLI mode.
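To show how the output fields fit together, here is a hedged consumer sketch that splits results into inline content versus hosted-URL fallbacks. Field names follow the example payload above; the helper name `summarize_payload` is our own:

```python
def summarize_payload(payload):
    """Count inline, hosted-only, and failed items in an output.json payload."""
    inline, hosted = [], []
    for item in payload.get("results", []):
        retrieved = item.get("retrieved", {})
        if retrieved.get("markdown_content"):
            inline.append(item["custom_id"])
        elif retrieved.get("markdown_hosted_url"):
            # Hosted URLs expire after roughly 7 days, so fetch these promptly.
            hosted.append(item["custom_id"])
    return {
        "inline": inline,
        "hosted": hosted,
        "failed": len(payload.get("failed_items", [])),
    }
```

A consumer could then download everything in `hosted` before the links expire.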
```
.
├── app.py                 # Streamlit UI for CSV upload, progress, and JSON download
├── main.py                # CLI entrypoint for batch runs
├── data/
│   └── urls_sample.csv    # Sample batch input file
├── src/
│   ├── batch_scraper.py   # Async client for batch and retrieve endpoints
│   └── batch_workflow.py  # CSV parsing, polling, retrieval, and payload helpers
├── requirements.txt       # Python dependencies
├── output.json            # Example output artifact
└── ui.png                 # UI preview image used in this README
```
