OSF Preprints - Modular Pipeline (No Celery)

This repository runs a bounded, stage-based pipeline for OSF preprints using DynamoDB as the single source of truth.

Pipeline stages:

sync: ingest preprints from OSF
pdf: download/convert primary files
grobid: generate TEI from PDFs
extract: parse TEI and write references
enrich: fill missing reference DOIs
flora: FLoRA lookup + screening
author: author/email candidate extraction

All stages run as normal Python commands and exit. Scheduling is external (cron or GitHub Actions). The flora stage checks whether originals have replications cited in the FLoRA database (the FORRT Library of Replication Attempts).

See the trial flowchart for a visual overview of how a preprint flows through the pipeline.

Quick Start (Local)

Create a virtual environment and install Python dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Install LibreOffice (soffice) locally if you need DOCX -> PDF conversion in the pdf stage.
Configure .env:

cp .env.example .env

Review committed runtime rules in config/runtime.toml (for example ingest.anchor_date and FLORA endpoint).
Start local infrastructure services (optional if you use AWS DynamoDB and/or a remote GROBID):

docker compose up -d dynamodb-local grobid

Initialize DynamoDB tables:

python -c "from osf_sync.db import init_db; init_db(); print('Dynamo tables ready')"

Run pipeline stages:

python -m osf_sync.pipeline run --stage sync --limit 1000
python -m osf_sync.pipeline run --stage pdf --limit 100
python -m osf_sync.pipeline run --stage grobid --limit 50
python -m osf_sync.pipeline run --stage extract --limit 200
python -m osf_sync.pipeline run --stage enrich --limit 300
python -m osf_sync.pipeline run --stage flora --limit-lookup 200 --limit-screen 500

Main Commands

Single stage:

python -m osf_sync.pipeline run --stage <sync|pdf|grobid|extract|enrich|flora|author> [options]

Full bounded run:

python -m osf_sync.pipeline run-all \
  --sync-limit 1000 --pdf-limit 100 --grobid-limit 50 --extract-limit 200 --enrich-limit 300

run-all includes the author stage by default; use --skip-author to disable it for a run. By default, run-all keeps local PDF/TEI files during author; use --cleanup-author-files to allow cleanup. By default, author updates DynamoDB only (no local CSV output). Use --write-debug-csv (and optionally --out) for local debug snapshots.

Ad-hoc window sync:

python -m osf_sync.pipeline sync-from-date --start 2025-07-01

One-off preprint:

python -m osf_sync.pipeline fetch-one --id <OSF_ID>
# or
python -m osf_sync.pipeline fetch-one --doi <DOI_OR_URL>

Author-cluster randomisation (standalone, not in run-all):

python -m osf_sync.pipeline author-randomize \
  --network-state-key trial:author_network_state

Optionally add --authors-csv <path> to use an enriched author CSV if available. Status: this workflow is not yet validated end-to-end in production and should be treated as experimental. This command processes only unassigned preprints. If no prior trial network exists, it initializes one from those preprints; otherwise it loads the latest network from DynamoDB and augments it. Allocations, graph state, and run metadata are stored in DynamoDB trial tables plus sync_state. Use --dry-run to preview candidate processing and allocation counts without writing to DynamoDB:

python -m osf_sync.pipeline author-randomize --dry-run

python -m osf_sync.cli ... is now a thin alias to the same pipeline CLI.

Common Options

--limit: max items for the stage.
--max-seconds: stop the stage after N seconds.
--dry-run: compute/select work without executing mutations.
--debug: enable verbose logging.
--owner and --lease-seconds (queue stages): override DynamoDB claim ownership/lease duration.
--skip-author (run-all): skip author extraction when needed.
--cleanup-author-files (run-all): allow author stage file deletion (off by default).
--write-debug-csv (author stage): write a local debug CSV snapshot (--out overrides the default path).

Environment (`.env`)

# local Docker GROBID:
GROBID_URL=http://localhost:8070
# remote GROBID example:
# GROBID_URL=https://grobid.example.org
GROBID_INCLUDE_RAW_CITATIONS=true
PIPELINE_ENV=dev
DDB_BILLING_MODE=PAY_PER_REQUEST
DEV_SYNC_LOOKBACK_DAYS=7
# Optional explicit override for sync start date:
# SYNC_START_DATE_OVERRIDE=2026-01-01
# Optional explicit override for sync end date:
# SYNC_END_DATE_OVERRIDE=2026-03-15
# Safety default: override runs do not rewrite sync cursor.
SYNC_OVERRIDE_WRITES_CURSOR=false
# Optional global cursor-write disable.
SYNC_DISABLE_CURSOR_WRITE=false
DYNAMO_LOCAL_URL=http://localhost:8000
AWS_REGION=eu-north-1
AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>
AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>
DDB_TABLE_PREPRINTS=dev_preprints
DDB_TABLE_REFERENCES=dev_preprint_references
DDB_TABLE_TEI=dev_preprint_tei
DDB_TABLE_EXCLUDED_PREPRINTS=dev_excluded_preprints
DDB_TABLE_SYNCSTATE=dev_sync_state
DDB_TABLE_API_CACHE=dev_api_cache
DDB_TABLE_TRIAL_AUTHOR_NODES=dev_trial_author_nodes
DDB_TABLE_TRIAL_AUTHOR_TOKENS=dev_trial_author_tokens
DDB_TABLE_TRIAL_CLUSTERS=dev_trial_clusters
DDB_TABLE_TRIAL_ASSIGNMENTS=dev_trial_preprint_assignments
OPENALEX_EMAIL=<PERSONAL_EMAIL_ID>
PDF_DEST_ROOT=./data/preprints
LOG_LEVEL=INFO
OSF_INGEST_SKIP_EXISTING=false
API_CACHE_TTL_MONTHS=6
PIPELINE_CLAIM_LEASE_SECONDS=1800

sync window behavior:

PIPELINE_ENV=dev: sync uses a rolling DEV_SYNC_LOOKBACK_DAYS window (default 7).
PIPELINE_ENV=prod: sync uses ingest.anchor_date/ingest.window_months from config/runtime.toml.
In prod, changing ingest.anchor_date or ingest.window_months triggers an automatic bounded backfill (controlled by ingest.backfill_on_config_change).
SYNC_START_DATE_OVERRIDE (optional): forces an explicit start date in either mode.
SYNC_END_DATE_OVERRIDE (optional): explicit end date; in prod override mode, omitted end defaults to ingest.anchor_date.
Recommended naming: keep local .env on dev_* tables; GH Actions prod workflows are set to prod_*.
SYNC_OVERRIDE_WRITES_CURSOR=false (default) keeps continuation cursor unchanged during override/backfill runs.

Backfill without breaking continuation:

In GROBID workflow dispatch, set sync_start_date_override (and optionally sync_end_date_override).
Leave sync_override_writes_cursor as false (default).
Run backfill as needed; normal continuation cursor is preserved.
Clear override inputs for subsequent normal runs.

Runtime Rules (`config/runtime.toml`)

These non-secret operational rules are committed in git:

[ingest]
anchor_date = "2026-02-20" # ISO date/timestamp; empty disables date-window filter
window_months = 6

[flora]
original_lookup_url = "https://rep-api.forrt.org/v1/original-lookup"
cache_ttl_hours = 48
csv_url = "https://github.com/forrtproject/FReD-data/raw/refs/heads/main/output/flora_filtered.csv"
csv_path = "data/flora.csv"

Scheduling

Use either:

Cron/systemd timers on a VM, or
GitHub Actions schedule workflows.

Recommended pattern:

Run each stage independently on a cadence with bounded limits.
Allow overlap; claim/lease fields in DynamoDB prevent duplicate processing.

DynamoDB Queue Flow

sync sets queue_pdf=pending when eligible.
pdf marks queue_pdf=done, queue_grobid=pending.
grobid marks queue_grobid=done, queue_extract=pending.
extract marks queue_extract=done.

Queue stages use claim/lease metadata (claim_*_owner, claim_*_until) and error tracking fields (last_error_*, retry_count_*).

DOI Experiment Command

Use the module entrypoint directly for DOI matching experiments:

python -m osf_sync.augmentation.doi_multi_method_lookup --from-db --limit 400 --output doi_multi_method.csv

Name		Name	Last commit message	Last commit date
Latest commit History 225 Commits
.github		.github
config		config
data		data
docs		docs
grobid-home/config		grobid-home/config
osf_sync		osf_sync
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
AGENTS.md		AGENTS.md
docker-compose.yml		docker-compose.yml
email_draft.md		email_draft.md
readme.md		readme.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OSF Preprints - Modular Pipeline (No Celery)

Quick Start (Local)

Main Commands

Common Options

Environment (`.env`)

Runtime Rules (`config/runtime.toml`)

Scheduling

DynamoDB Queue Flow

DOI Experiment Command

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

forrtproject/flora_preprint_notifier

Folders and files

Latest commit

History

Repository files navigation

OSF Preprints - Modular Pipeline (No Celery)

Quick Start (Local)

Main Commands

Common Options

Environment (.env)

Runtime Rules (config/runtime.toml)

Scheduling

DynamoDB Queue Flow

DOI Experiment Command

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Environment (`.env`)

Runtime Rules (`config/runtime.toml`)

Packages