osf-scraper

A Python package for scraping preregistration data from the Open Science Framework (OSF).

Installation

pip install -e .

For development (testing, linting):

pip install -e ".[dev]"

Configuration

Authenticated requests enjoy higher rate limits. Set your OSF API token via an environment variable:

export OSF_API_TOKEN=your_token_here

Or create a .env file in the project root:

OSF_API_TOKEN=your_token_here

Get your token at: https://osf.io/settings/tokens

CLI Commands

After installation, the following commands are available on your PATH:

`osf-discover` — Discover preregistration IDs

# Discover all preregistration IDs
osf-discover

# Limit results
osf-discover --max-results 1000

# Include all registrations (not just preregistrations)
osf-discover --no-filter

# Use a specific API token
osf-discover --token YOUR_TOKEN

# Specify output file
osf-discover --output data/osf_ids.txt

`osf-scrape` — Scrape registration data

# Scrape from a file of IDs
osf-scrape --file data/osf_ids.txt

# Specify output file
osf-scrape --file data/osf_ids.txt --output data/raw/preregistrations.jsonl

# Resume a previous run
osf-scrape --file data/osf_ids.txt --resume

`osf-remaining` — Compute remaining unprocessed IDs

osf-remaining

# With custom paths
osf-remaining --all-ids data/osf_ids.txt \
              --successful-ids data/raw/successful_ids.txt \
              --output data/osf_ids_remaining.txt
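
Conceptually, the remaining IDs are the full ID list minus the IDs already scraped. The sketch below shows one way to compute that while keeping the original order and dropping duplicates. The function name `compute_remaining` is an assumption made for this example; the package's own implementation in utils.py may differ.

```python
def compute_remaining(all_ids: list[str], successful_ids: list[str]) -> list[str]:
    """Return IDs from all_ids that are not in successful_ids, in order.

    Illustrative only; blank lines and duplicate IDs are skipped.
    """
    done = {i.strip() for i in successful_ids if i.strip()}
    seen: set[str] = set()
    remaining: list[str] = []
    for raw in all_ids:
        osf_id = raw.strip()
        if osf_id and osf_id not in done and osf_id not in seen:
            seen.add(osf_id)
            remaining.append(osf_id)
    return remaining
```

Lines read from a text file can be passed in directly, since surrounding whitespace and newlines are stripped before comparison.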

`osf-process` — Flatten raw JSONL into normalised data

osf-process

# With custom paths
osf-process --input data/raw/preregistrations.jsonl \
            --output data/processed/preregistrations.jsonl
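
To make "flattening" concrete, the sketch below turns a nested record into a single-level dictionary with dotted keys. This is an assumption about what flattening could look like, not a description of what `osf-process` actually emits; the function name `flatten` and the field names in the usage note are hypothetical.

```python
from typing import Any


def flatten(record: dict[str, Any], parent: str = "", sep: str = ".") -> dict[str, Any]:
    """Recursively flatten nested dicts into dotted keys (illustrative)."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        path = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict) and value:
            # Descend into non-empty dicts and merge their flattened keys.
            flat.update(flatten(value, path, sep))
        else:
            # Scalars, lists, and empty dicts are kept as leaf values.
            flat[path] = value
    return flat
```

For example, `flatten({"id": "abc12", "attributes": {"title": "T"}})` yields a dictionary with the keys `id` and `attributes.title`.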

`osf-analyse` — Extract column names from processed data

osf-analyse

# With custom paths
osf-analyse --input data/processed/preregistrations.jsonl \
            --output data/analysed/columns.json
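
A simple way to picture column extraction is to collect every key that appears across the JSONL records and count how often each one occurs. The sketch below does that for an iterable of lines; `count_columns` is a hypothetical name, and the real `osf-analyse` output format may differ.

```python
import json
from collections import Counter
from typing import Iterable


def count_columns(lines: Iterable[str]) -> Counter:
    """Count how many records contain each top-level key (illustrative)."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        counts.update(json.loads(line).keys())
    return counts
```

An open file handle can be passed straight in, for instance `count_columns(open("data/processed/preregistrations.jsonl"))`, and `sorted(counts)` then gives the list of column names.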

Typical Workflow

# 1. Discover IDs
osf-discover --output data/osf_ids.txt

# 2. Scrape registration data
osf-scrape --file data/osf_ids.txt

# 3. If interrupted, compute remaining IDs and resume
osf-remaining
osf-scrape --file data/osf_ids_remaining.txt --resume

# 4. Flatten raw JSONL into normalised data
osf-process

# 5. Analyse columns
osf-analyse

Python API

You can also use the package programmatically:

from osf_scraper import OSFIDScraper

# Discover up to 100 preregistration IDs and save them to a text file
scraper = OSFIDScraper(api_token="your_token")
ids = scraper.discover_preregistration_ids(max_results=100)
scraper.save_ids(ids, "data/osf_ids.txt")

from osf_scraper import process_registrations

# Flatten the raw JSONL into normalised records
process_registrations("data/raw/preregistrations.jsonl",
                      "data/processed/preregistrations.jsonl")

Running Tests

pytest

Project Structure

osf-scraper/
├── src/
│   └── osf_scraper/
│       ├── __init__.py        # Package exports
│       ├── cli.py             # CLI entry points
│       ├── discovery.py       # OSF ID discovery (OSFIDScraper)
│       ├── scraper.py         # Async batch scraper (TokenBucket, fetch logic)
│       ├── processing.py      # JSONL flattening & analysis
│       └── utils.py           # Remaining-IDs computation
├── tests/
│   ├── test_id_scraper.py
│   ├── test_token_bucket.py
│   └── test_process_registrations.py
├── data/                      # Data directory (not tracked)
├── pyproject.toml             # Package metadata, deps, entry points
└── README.md
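
The tree above mentions a TokenBucket in scraper.py. For readers unfamiliar with the technique, here is a minimal synchronous sketch of a token bucket rate limiter. It is not the package's class: the name `SimpleTokenBucket`, its parameters, and the injectable clock are all assumptions made for illustration, and the real scraper is described as async.

```python
import time
from typing import Callable


class SimpleTokenBucket:
    """Illustrative token bucket; not the TokenBucket shipped in scraper.py."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        """Take n tokens if available; return False instead of waiting."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A caller would check `try_acquire()` before each request and sleep briefly when it returns False. Passing a fake clock makes the behaviour deterministic in tests.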

Requirements

Python 3.10+
Dependencies are managed via pyproject.toml; install them with pip install -e .

License

MIT — see LICENSE for details.
