GPCR Annotation Tools

An end-to-end, AI-assisted annotation and human-in-the-loop curation suite for GPCR structural biology.

GPCR Annotation Tools automates the extraction of structured metadata from GPCR crystal and cryo-EM structures deposited in the PDB. It combines automated data enrichment, multi-run AI annotation with structured output, algorithmic cross-validation, and an interactive expert review dashboard to produce database-ready CSVs with full decision provenance.

Pipeline at a Glance

                        PDB IDs (targets.txt)
                                │
                                ▼
┌─────────────────────────────────────────────────────────────────────┐
│  1. gpcr-tools fetch          Download RCSB metadata + enrich       │
│                                (UniProt, PubChem, CrossRef, SMILES) │
├─────────────────────────────────────────────────────────────────────┤
│  2. gpcr-tools fetch-papers   Download open-access PDFs             │
│                                (Unpaywall → PMC OA → abstract       │
│                                 fallback + manual watch mode)       │
├─────────────────────────────────────────────────────────────────────┤
│  3. gpcr-tools annotate       AI annotation via Gemini              │
│                                (10 independent runs per PDB,        │
│                                 structured output via tool calling) │
├─────────────────────────────────────────────────────────────────────┤
│  4. gpcr-tools aggregate      Majority-vote consensus + validation  │
│                                (7-validator chain against PDB/      │
│                                 UniProt/PubChem ground truth)       │
├─────────────────────────────────────────────────────────────────────┤
│  5. gpcr-tools curate         Interactive expert review dashboard   │
│                                (Rich terminal UI + audit trail)     │
└───────────────────────────────┬─────────────────────────────────────┘
                                │
                                ▼
                          output/csv/
                    (database-ready CSVs)

Each step is resumable and idempotent — re-running any command skips already-completed work unless --force is passed.

Key Features

Data Enrichment (pre-annotation)

RCSB GraphQL integration — Downloads comprehensive PDB metadata including polymer/nonpolymer entities, assemblies, citations, and experimental details.
Multi-source enrichment — Automatically resolves UniProt entry names, PubChem CIDs + synonyms, SMILES/InChIKey descriptors, and sibling PDB structures sharing the same publication.
Persistent caching — All external API responses are cached locally with atomic writes, eliminating redundant network calls across pipeline runs.
Tiered paper acquisition — Fetches open-access PDFs via Unpaywall and NCBI PMC OA, with PubMed abstract fallback for paywalled papers and a live filesystem watcher for manual drops.

AI Annotation

Multi-run consensus — Each PDB is annotated 10 times independently (configurable via --runs), producing a statistically robust basis for majority voting.
Structured output via tool calling — Gemini returns annotations in a strict JSON schema enforced by function calling, not free-form text. Every field (receptor identity, ligand roles, signaling partners, state classification) is constrained to defined types and enumerations.
Context-rich prompts — The AI receives not just the paper PDF but also pre-enriched PDB metadata, a chain inventory reminder, and sibling structure warnings — reducing hallucination by grounding the model in API-verified facts.
Flexible model selection — Switch models at runtime via --model flag or GPCR_GEMINI_MODEL environment variable without code changes.
Batch API support — Large-scale annotation via Gemini Batch API with JSONL submission, polling, and automatic result recovery.
Rate-limited client — Sliding-window rate limiting (1000 RPM) with exponential backoff on 429 responses.

Post-Annotation Validation

7-validator chain — Each aggregated annotation passes through a chain of cross-validation steps:
1. Chimera detection — Identifies fusion constructs by comparing G-alpha C-terminal tails against UniProt reference sequences.
2. Receptor identity verification — Validates UniProt entry names against the UniProt API.
3. Ligand existence check — Confirms every annotated ligand exists in PDB Chemical Component Dictionary, filtering common buffers and crystallization artifacts.
4. Oligomer analysis — Classifies complexes (monomer / homomer / heteromer), scans 7TM domain completeness per chain, suggests the primary protomer, and auto-corrects chain-ID assignments when API evidence disagrees with AI output.
5. Structural integrity — Cross-checks internal consistency of the annotation structure.
6. Ground truth injection — Overwrites method, resolution, and release date with PDB-authoritative values.
7. Controversy detection — Flags fields where AI runs disagreed, with per-field vote breakdowns.

Expert Curation

Rich terminal dashboard — An ergonomic review interface built with Rich for rapid, informed decision-making.
Context-aware validation alerts — Real-time display of ghost chains, hallucinated ligands, UniProt identity clashes, and chimera warnings alongside the data being reviewed.
Recursive review engine — Navigate field-by-field through the annotation tree, with controversy highlights guiding attention to disputed values.
Append-only audit trail — Every human decision (accept / edit / reject) is logged to audit_trail.jsonl with timestamps, providing full reproducibility.
Resumable sessions — Curation progress is persisted; interrupted sessions resume exactly where they left off.

Quick Start

Option 1: Docker (Recommended)

# Pull the latest image
docker pull ghcr.io/protwis/gpcr-annotation-tools:latest

# Initialize a workspace
mkdir -p ~/gpcr_workspace
docker run --rm \
  -v ~/gpcr_workspace:/workspace \
  ghcr.io/protwis/gpcr-annotation-tools init-workspace

# Add PDB IDs to the target list
echo -e "8TII\n7W55\n9BLW" >> ~/gpcr_workspace/targets.txt

# Run the full pipeline
docker run --rm \
  -v ~/gpcr_workspace:/workspace \
  -e GPCR_GEMINI_API_KEY="$GPCR_GEMINI_API_KEY" \
  -e GPCR_EMAIL_FOR_APIS="you@example.com" \
  ghcr.io/protwis/gpcr-annotation-tools fetch

docker run --rm \
  -v ~/gpcr_workspace:/workspace \
  -e GPCR_EMAIL_FOR_APIS="you@example.com" \
  ghcr.io/protwis/gpcr-annotation-tools fetch-papers --auto-only

docker run --rm \
  -v ~/gpcr_workspace:/workspace \
  -e GPCR_GEMINI_API_KEY="$GPCR_GEMINI_API_KEY" \
  ghcr.io/protwis/gpcr-annotation-tools annotate

docker run --rm \
  -v ~/gpcr_workspace:/workspace \
  ghcr.io/protwis/gpcr-annotation-tools aggregate

docker run -it --rm \
  -v ~/gpcr_workspace:/workspace \
  ghcr.io/protwis/gpcr-annotation-tools curate

Note: The -it flags are required only for the interactive curate command. Pass --user "$(id -u):$(id -g)" to avoid root-owned files on the host.

Option 2: Local Installation

Requires Python 3.11+.

git clone https://github.com/protwis/GPCR-annotation-tools.git
cd GPCR-annotation-tools

# Install with all optional dependencies
pip install -e ".[dev]"

# Configure
export GPCR_WORKSPACE=~/gpcr_workspace
export GPCR_GEMINI_API_KEY=your-api-key
export GPCR_EMAIL_FOR_APIS=you@example.com

# Initialize and run
gpcr-tools init-workspace

gpcr-tools fetch
gpcr-tools fetch-papers
gpcr-tools annotate
gpcr-tools aggregate
gpcr-tools curate

CLI Reference

`gpcr-tools fetch`

Download PDB metadata from RCSB GraphQL and enrich with UniProt, PubChem, and CrossRef data.

gpcr-tools fetch                        # Process all targets
gpcr-tools fetch 8TII                   # Single PDB
gpcr-tools fetch --targets ids.txt      # Custom target file
gpcr-tools fetch --force                # Re-fetch existing entries

`gpcr-tools fetch-papers`

Download open-access papers with tiered fallback (Unpaywall → PMC OA → abstract).

gpcr-tools fetch-papers                 # All targets, with watch mode for paywalled papers
gpcr-tools fetch-papers --auto-only     # Skip watch mode (for CI/scripting)
gpcr-tools fetch-papers 8TII           # Single PDB

`gpcr-tools annotate`

Run Gemini AI annotation with structured output.

gpcr-tools annotate                                    # Auto-discover pending PDBs
gpcr-tools annotate 8TII --runs 5                      # Single PDB, 5 runs
gpcr-tools annotate --model gemini-2.5-flash            # Use a different model
gpcr-tools annotate --prompt prompts/custom.txt         # Custom prompt template
gpcr-tools annotate --batch                             # Submit via Batch API
gpcr-tools annotate --check-batch                       # Poll batch status
gpcr-tools annotate --recover                           # Re-process raw batch output

`gpcr-tools aggregate`

Aggregate multi-run AI results with majority voting and cross-validation.

gpcr-tools aggregate                    # All pending PDBs
gpcr-tools aggregate 8TII              # Single PDB
gpcr-tools aggregate --skip-api-checks  # Offline mode (no UniProt/PubChem calls)
gpcr-tools aggregate --force            # Re-process already-aggregated entries

`gpcr-tools curate`

Interactive expert review dashboard.

gpcr-tools curate                       # Review all pending PDBs
gpcr-tools curate 8TII                  # Target a single PDB
gpcr-tools curate --auto-accept         # Non-interactive mode (CI/testing)

Configuration

Environment Variables

Variable	Required	Description
`GPCR_WORKSPACE`	No	Workspace root (default: `/workspace`)
`GPCR_GEMINI_API_KEY`	For `annotate`	Google Gemini API key
`GPCR_GEMINI_MODEL`	No	Model override (default: `gemini-2.5-pro`)
`GPCR_EMAIL_FOR_APIS`	For `fetch-papers`	Email for Unpaywall/NCBI polite access

Advanced: per-directory path overrides

For non-standard workspace layouts (e.g., separate storage mounts), each subdirectory can be overridden independently:

Variable	Default
`GPCR_RAW_PATH`	`{workspace}/raw`
`GPCR_ENRICHED_PATH`	`{workspace}/enriched`
`GPCR_PAPERS_PATH`	`{workspace}/papers`
`GPCR_AI_RESULTS_PATH`	`{workspace}/ai_results`
`GPCR_AGGREGATED_PATH`	`{workspace}/aggregated`
`GPCR_OUTPUT_PATH`	`{workspace}/output`
`GPCR_CACHE_PATH`	`{workspace}/cache`
`GPCR_STATE_PATH`	`{workspace}/state`
`GPCR_TMP_PATH`	`{workspace}/tmp`

Workspace Layout

/workspace/
├── contract/storage_contract.json    # Versioned workspace contract
├── targets.txt                       # PDB IDs to process (one per line)
├── prompts/v5.txt                    # Default annotation prompt template
│
├── raw/pdb_json/                     # RCSB GraphQL responses
├── enriched/                         # Enriched PDB metadata (AI input)
├── papers/                           # Downloaded PDFs and abstracts
├── ai_results/{pdb_id}/run_*.json   # 10 independent AI annotation runs
│
├── aggregated/                       # Voted + validated annotations
│   ├── {pdb_id}.json
│   ├── logs/                         # Per-field voting discrepancy logs
│   └── validation_logs/              # Algorithmic validation reports
│
├── output/
│   ├── csv/                          # Database-ready CSV exports
│   └── audit/audit_trail.jsonl       # Append-only decision provenance
│
├── cache/                            # Persistent API caches
└── state/                            # Operational state (resumability)

Output Artifacts

Database CSVs (`output/csv/`)

Tab-separated, normalized files ready for database ingestion:

File	Contents
`structures.csv`	PDB ID, receptor UniProt, method, resolution, state, chain, date
`ligands.csv`	Ligand names, PubChem IDs, roles, types, SMILES, InChIKey, sequences
`g_proteins.csv`	G-protein subunit UniProt IDs and chain assignments
`arrestins.csv`	Arrestin UniProt IDs and chains
`fusion_proteins.csv`	Fusion protein names
`nanobodies.csv`, `antibodies.csv`, `scfv.csv`	Binding partner names
`grk.csv`, `ramp.csv`, `other_aux_proteins.csv`	Auxiliary protein names

Validation Reports (`aggregated/validation_logs/`)

Per-PDB structured reports containing:

Critical warnings — hallucinated ligands, chimeric fusion proteins, identity clashes
Algorithmic conflicts — AI annotation vs. API ground truth disagreements
Oligomer analysis — complex classification, 7TM completeness, chain corrections

Provenance Logs

Log	Purpose
`output/audit/audit_trail.jsonl`	Every human decision, timestamped and append-only
`aggregated/logs/*_voting_log.json`	Per-field majority-vote breakdowns across 10 AI runs
`state/processed_log.json`	Curation completion status (enables resumable sessions)

Architecture

src/gpcr_tools/
├── config.py                  # All constants, URLs, timeouts, thresholds
├── workspace.py               # Workspace initialization & contract validation
├── __main__.py                # CLI entry point
│
├── fetcher/                   # Stage 1: RCSB download + enrichment
│   ├── rcsb_client.py         #   GraphQL query + rate-limited download
│   ├── enricher.py            #   UniProt / PubChem / CrossRef enrichment
│   └── cache.py               #   Atomic JSON cache with version invalidation
│
├── papers/                    # Stage 2: Paper acquisition
│   ├── downloader.py          #   Tiered PDF download (Unpaywall → PMC → abstract)
│   └── watcher.py             #   Filesystem watcher for manual PDF drops
│
├── annotator/                 # Stage 3: Gemini AI annotation
│   ├── gemini_client.py       #   Rate-limited API client
│   ├── prompt_builder.py      #   Context-rich prompt assembly
│   ├── schema.py              #   Structured output schema (tool calling)
│   ├── pdf_compressor.py      #   Ghostscript compression for large PDFs
│   ├── post_processor.py      #   Response normalization
│   └── runner.py              #   Single-call + batch modes with recovery
│
├── aggregator/                # Stage 4: Consensus + validation
│   ├── voting.py              #   Majority-vote engine + controversy detection
│   ├── ground_truth.py        #   PDB/UniProt ground truth injection
│   └── runner.py              #   12-step orchestration with error isolation
│
├── validator/                 # 7-validator cross-check chain
│   ├── chimera.py             #   Fusion protein detection (C-terminal tail matching)
│   ├── receptor_validator.py  #   UniProt identity verification
│   ├── ligand_validator.py    #   PDB-CCD existence check
│   ├── oligomer.py            #   Complex classification + 7TM completeness
│   ├── integrity_checker.py   #   Structural consistency validation
│   └── api_clients.py         #   Shared API wrappers with retry + caching
│
└── csv_generator/             # Stage 5: Expert curation
    ├── app.py                 #   Main curation loop
    ├── review_engine.py       #   Recursive review tree
    ├── ui.py                  #   Rich terminal panels
    ├── csv_writer.py          #   Pure data → CSV export
    └── audit.py               #   JSONL audit trail writer

Design Principles

Principle	Implementation
Atomic writes	`tempfile` + `os.replace` + `try/finally` cleanup — no partial outputs
Mutation isolation	`deepcopy()` boundary before validator invocations
None-safety	`(data.get(key) or {}).get(child)` — never `.get(key, {})` on external data
Centralized configuration	All URLs, timeouts, thresholds, and magic strings in `config.py`
Immutable constants	`frozenset`, `tuple`, `MappingProxyType` for module-level data
Error isolation	Each PDB wrapped in `try/except` — failures logged, pipeline continues
Timeout-guarded I/O	Every HTTP call has an explicit timeout; sessions use `urllib3.Retry`

Development

Prerequisites

pip install -e ".[dev]"

Quality Gates

# Lint + format
ruff check src/ tests/
ruff format src/ tests/

# Type checking
mypy src/

# Tests
pytest tests/ -v

Test Suite

The test suite includes 770+ tests:

Unit tests for every module across all five pipeline stages
Integration tests for the full aggregation pipeline, error isolation, and atomic write safety
Real PDB fixture tests covering 9 canonical GPCR structures (5G53, 8TII, 9AS1, 9BLW, 9EJZ, 9IQS, 9M88, 9NOR, 9O38) with 10 AI runs each
Mock HTTP for external APIs in the default test suite; live network integration tests are gated and skipped unless GPCR_RUN_LIVE_TESTS=1 is set

CI/CD

GitHub Actions workflows run on every push and pull request:

Ruff — Enforced linting and formatting
mypy — Static type checking with ignore_missing_imports = false
pytest — Test matrix across Python 3.11 and 3.12
Docker smoke tests — Build + exercise init-workspace, curate --help, and curate --auto-accept
Automated releases — Docker image published to GHCR on semantic version tags (v*)

License

This project is licensed under the Apache License 2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 51 Commits
.github/workflows		.github/workflows
src/gpcr_tools		src/gpcr_tools
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
developing_styles.md		developing_styles.md
docker-compose.yml		docker-compose.yml
future_plan.md		future_plan.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

GPCR Annotation Tools

Pipeline at a Glance

Key Features

Data Enrichment (pre-annotation)

AI Annotation

Post-Annotation Validation

Expert Curation

Quick Start

Option 1: Docker (Recommended)

Option 2: Local Installation

CLI Reference

gpcr-tools fetch

gpcr-tools fetch-papers

gpcr-tools annotate

gpcr-tools aggregate

gpcr-tools curate

Configuration

Environment Variables

Workspace Layout

Output Artifacts

Database CSVs (output/csv/)

Validation Reports (aggregated/validation_logs/)

Provenance Logs

Architecture

Design Principles

Development

Prerequisites

Quality Gates

Test Suite

CI/CD

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`gpcr-tools fetch`

`gpcr-tools fetch-papers`

`gpcr-tools annotate`

`gpcr-tools aggregate`

`gpcr-tools curate`

Database CSVs (`output/csv/`)

Validation Reports (`aggregated/validation_logs/`)

Packages