Add fetcher, papers, annotator pipeline & CLI #4
Conversation
Introduce end-to-end pipeline components and CLI commands for fetching, paper acquisition, and AI annotation. Adds new packages: fetcher (RCSB GraphQL client, enricher, cache, targets, runner), papers (downloader, watcher, runner), and annotator (Gemini client with sliding-window rate limiting, prompt builder, PDF compressor, post-processor, schema and runner/CLI glue). Extend CLI (__main__.py) with fetch / fetch-papers / annotate commands (single + batch modes, prompt/model/runs flags and batch control). Update packaging (pyproject.toml bumped to 0.1.3, extras for annotate/papers), expand README with pipeline docs and usage, add many unit/integration tests, and small config/workspace/validator adjustments to integrate the new stages.
Pull request overview
This PR adds an end-to-end pipeline for fetching PDB metadata, downloading papers, and running Gemini-based AI annotation (single + batch), plus supporting workspace/config updates, docs, and tests.
Changes:
- Added new pipeline stages:
  `fetcher/` (RCSB + enrichment), `papers/` (OA download + paywall watcher), `annotator/` (Gemini client, prompt building, compression, post-processing, runner).
- Extended the CLI with `fetch`, `fetch-papers`, and `annotate` commands.
- Updated configuration/constants, workspace initialization, dependencies/extras, and expanded documentation/testing.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| `src/gpcr_tools/config.py` | Adds centralized URLs/timeouts/sleeps + new workspace path fields used by new pipeline stages |
| `src/gpcr_tools/workspace.py` | Adds `targets.txt` creation + switches to `raw_pdb_json_dir` |
| `src/gpcr_tools/fetcher/*` | Implements target parsing, RCSB GraphQL fetch, enrichment logic, and JSON caching |
| `src/gpcr_tools/papers/*` | Implements tiered paper download, download logging, and manual PDF watcher |
| `src/gpcr_tools/annotator/*` | Implements Gemini client + schema, prompt builder, PDF compression, post-processing, and runners |
| `src/gpcr_tools/__main__.py` | Adds new CLI subcommands and annotation orchestration logic |
| `src/gpcr_tools/validator/*` | Switches hardcoded URLs/timeouts/sleeps to config constants |
| `README.md` | Documents new pipeline, CLI usage, and architecture |
| `pyproject.toml` | Version bump + new extras/deps |
| `tests/unit/*`, `tests/integration/*` | Adds unit tests and new "live API" integration tests |
```python
def generate_chain_inventory_reminder(pdb_id: str, enriched_data: dict) -> str:
    """Generates a human-readable summary of the polymer chains in the PDB."""
    polymers = enriched_data.get("polymer_entities") or []
    if not polymers:
        return f"### CHAIN INVENTORY REMINDER\nThis structure ({pdb_id}) contains 0 polymer chains."
```
prompt_builder is reading polymer/nonpolymer entities from the top-level enriched_data, but the enricher writes them under enriched_data["data"]["entry"]. As a result, chain inventory and simplified metadata will be empty/"UNKNOWN" for real enriched JSON, which will significantly degrade annotation quality. Update the prompt builder to consistently dereference data.entry (and use rcsb_id as the PDB ID) before extracting polymer_entities, nonpolymer_entities, exptl/refine, etc.
```python
pdb_id = (enriched_data.get("entry") or {}).get("id") or "UNKNOWN"
entry = enriched_data.get("entry") or {}
```
enhanced_simplify_pdb_json currently uses enriched_data.get("entry"), but the enriched files produced by fetcher/enricher.py are full GraphQL responses with the payload under data.entry. This makes pdb_id, method, resolution, and release_date resolution incorrect for real runs. Consider normalizing by setting entry = (enriched_data.get("data") or {}).get("entry") or {} and reading fields from there.
```diff
-pdb_id = (enriched_data.get("entry") or {}).get("id") or "UNKNOWN"
-entry = enriched_data.get("entry") or {}
+entry = (enriched_data.get("data") or {}).get("entry") or enriched_data.get("entry") or {}
+pdb_id = entry.get("id") or "UNKNOWN"
```
```python
# 3. Sibling structures warning
siblings = enriched_data.get("sibling_pdbs") or []
if siblings:
```
Sibling PDB IDs are stored as data.entry.sibling_pdbs by the enricher, but the prompt builder looks for enriched_data.get("sibling_pdbs"). This prevents the sibling warning from ever being included for real enriched JSON. Please read sibling_pdbs from the same normalized entry object used for the rest of the prompt.
```python
# Auto-discover: enriched PDBs missing ai_results
enriched_pdbs = {p.stem.upper() for p in cfg.enriched_dir.glob("*.json")}
done_pdbs = {d.name.upper() for d in cfg.ai_results_dir.iterdir() if d.is_dir()}
pdb_ids = sorted(enriched_pdbs - done_pdbs)
```
Auto-discovery for annotate treats any existing ai_results/{pdb_id}/ directory as "done". This will incorrectly skip PDBs that have only partial run_*.json outputs (e.g., interrupted runs). Align this logic with annotator/cli_handler.py (or a shared helper) by checking whether the expected run_1..run_N.json files exist before considering a PDB complete.
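A sketch of the suggested completeness check, assuming the result files are named `run_1.json` … `run_N.json` as described (the helper name is illustrative):

```python
from pathlib import Path


def is_pdb_complete(result_dir: Path, num_runs: int) -> bool:
    """Treat a PDB as done only when run_1.json .. run_N.json all exist."""
    return all(
        (result_dir / f"run_{i}.json").is_file() for i in range(1, num_runs + 1)
    )
```

Auto-discovery would then exclude a PDB from the "done" set whenever `is_pdb_complete(...)` is false, so interrupted runs get retried.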
```python
def run_annotate(
    pdb_id: str | None,
    targets_file: str | None,
    prompt_file: str | None,
    num_runs: int,
    use_batch: bool,
) -> None:
```
annotator/cli_handler.py appears to be unused (no references found) and duplicates substantial CLI logic that already exists in gpcr_tools/__main__.py. This increases maintenance burden and can lead to behavior drift. Either wire __main__.py to call run_annotate(...) or remove the handler module to avoid dead code.
```python
def _standardize_auxiliary_name(name: str) -> str:
    """Standardise common auxiliary protein names.

    The explicit :data:`AUX_PROTEIN_NAME_MAPPING` is checked first; only if no
    match is found do we fall through to the BRIL and nanobody regex patterns.
    """
    if not name:
        return name
```
_standardize_auxiliary_name is annotated as accepting/returning str, but it is called in tests with None and will return None at runtime. This is likely to trip mypy and makes the API misleading. Update the signature to name: str | None (and return type accordingly) to match actual usage.
```diff
-from gpcr_tools.config import OVERRIDE_VARS, WorkspaceConfig, get_config
-
-SUPPORTED_CONTRACT_VERSION = 1
+from gpcr_tools.config import OVERRIDE_VARS, SUPPORTED_CONTRACT_VERSION, WorkspaceConfig, get_config
```
This import statement exceeds the configured line length and bypasses Ruff's auto-formatting style. Please run ruff format or wrap the import across multiple lines to match the project's formatting conventions.
```diff
-from gpcr_tools.config import OVERRIDE_VARS, SUPPORTED_CONTRACT_VERSION, WorkspaceConfig, get_config
+from gpcr_tools.config import (
+    OVERRIDE_VARS,
+    SUPPORTED_CONTRACT_VERSION,
+    WorkspaceConfig,
+    get_config,
+)
```
```toml
"mypy>=1.10",
"types-requests>=2.31",
"types-tqdm>=4.60",
"gpcr-annotation-tools[annotate,papers]",
```
Including gpcr-annotation-tools[annotate,papers] inside the package's own dev extra makes the project depend on itself. In editable installs this can cause pip to resolve the published package from PyPI (or fail if it's not available), and it also obscures which deps are actually needed for tests. Instead, add the extra deps directly to dev (e.g., google-genai, PyMuPDF, lxml) or document installing with pip install -e '.[dev,annotate,papers]'.
```diff
-"gpcr-annotation-tools[annotate,papers]",
+"google-genai",
+"PyMuPDF",
+"lxml",
```
```markdown
- **Unit tests** for every module across all five pipeline stages
- **Integration tests** for the full aggregation pipeline, error isolation, and atomic write safety
- **Real PDB fixture tests** covering 9 canonical GPCR structures (5G53, 8TII, 9AS1, 9BLW, 9EJZ, 9IQS, 9M88, 9NOR, 9O38) with 10 AI runs each
- **Mock HTTP** for all external APIs — no live network calls in the test suite
```
The README claims there are "no live network calls in the test suite", but this PR adds live integration tests under tests/integration/ (e.g., test_fetch_live.py). Either adjust the documentation to describe how live tests are gated/skipped, or change the tests to be opt-in so the statement remains true.
```diff
-- **Mock HTTP** for all external APIs — no live network calls in the test suite
+- **Mock HTTP** for external APIs in the default test suite; any live network integration tests are gated/opt-in and skipped unless explicitly enabled
```
```python
"""Integration tests for fetch pipeline — LIVE API calls.

These tests actually call RCSB, UniProt, PubChem, and RCSB Search APIs.
They use the canonical 9 PDB IDs from the test fixture set.

Run with:
    pytest tests/integration/test_fetch_live.py -v

These tests require network access and may take 2-5 minutes due to
API rate limiting (1s sleep per RCSB request).
"""
```
This file performs live network calls (RCSB/UniProt/PubChem) but is not skipped/marked, so it will run under the default pytest configuration (testpaths = ['tests']). That can make CI runs slow and flaky due to external rate limits/outages. Make these tests opt-in (e.g., @pytest.mark.skipif(not os.getenv('GPCR_RUN_LIVE_TESTS'), ...)) or add a network marker and configure pytest to exclude it by default.
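One stdlib-only way to express the opt-in gate (the helper name is illustrative; in the test module it would feed a `pytest.mark.skipif`):

```python
import os


def live_tests_enabled() -> bool:
    """Live tests run only when GPCR_RUN_LIVE_TESTS is set and non-empty."""
    return bool(os.getenv("GPCR_RUN_LIVE_TESTS"))


# In the test module, this would gate every test (pytest assumed):
#   pytestmark = pytest.mark.skipif(
#       not live_tests_enabled(),
#       reason="live network tests; set GPCR_RUN_LIVE_TESTS=1 to enable",
#   )
```

With the module-level `pytestmark`, the whole file is skipped by default and CI stays fast and deterministic.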
Adapt code and tests to the new enriched JSON envelope (data.entry): add _get_entry helper and update prompt_builder, enhanced_simplify_pdb_json, and tests accordingly. Gate live network integration tests behind GPCR_RUN_LIVE_TESTS and update README and tests to skip unless the env var is set. Improve CLI auto-discovery to treat an AI result directory as complete only when the required number of run_N.json files exist, and remove the now-obsolete annotator/cli_handler module. Add PDF validation & cleanup in papers.downloader to detect non-PDF downloads, update post_processor to accept optional auxiliary names, and add runtime dev dependencies (google-genai, PyMuPDF, lxml).
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.
```python
def enhanced_simplify_pdb_json(enriched_data: dict) -> dict:
    """Simplifies the enriched PDB JSON into a minimal dictionary for Gemini."""
    entry = _get_entry(enriched_data)
    pdb_id = entry.get("id") or "UNKNOWN"
```
enhanced_simplify_pdb_json() derives pdb_id from entry.get("id"), but the RCSB GraphQL payload puts the identifier at entry["rcsb_id"] (and a nested entry["entry"]["id"]). With real enriched data this will often become UNKNOWN, degrading prompt grounding. Prefer entry.get("rcsb_id") (fallback to ((entry.get("entry") or {}).get("id"))) so prompts consistently include the actual PDB ID.
```diff
-pdb_id = entry.get("id") or "UNKNOWN"
+pdb_id = entry.get("rcsb_id") or ((entry.get("entry") or {}).get("id")) or "UNKNOWN"
```
```python
# Resolve prompt text
if args.prompt:
    prompt_text = Path(args.prompt).read_text(encoding="utf-8")
else:
    prompt_text = cfg.default_prompt_file.read_text(encoding="utf-8")
```
The CLI falls back to reading cfg.default_prompt_file ({workspace}/prompts/v5.txt), but init-workspace/ensure_runtime_dirs do not create prompts/ or the default prompt file, and there is no prompts/v5.txt in-repo. On a fresh workspace, gpcr-tools annotate will crash with FileNotFoundError. Either create/populate the default prompt in init_workspace (and add prompts to the contract dirs), or ship the prompt as a package resource and reference it from the installed package rather than the workspace.
```python
@pytest.mark.skipif(not _EMAIL, reason=_SKIP_REASON)
def test_auto_only_downloads_some_papers(self, papers_workspace: Path) -> None:
```
These tests run live network calls whenever GPCR_EMAIL_FOR_APIS is set (via the skipif(not _EMAIL, ...) gate). This is easier to satisfy in CI than the explicit GPCR_RUN_LIVE_TESTS=1 gate used elsewhere (e.g. test_fetch_live.py), which can unintentionally enable slow/flaky live tests. Consider gating on GPCR_RUN_LIVE_TESTS (and optionally also requiring GPCR_EMAIL_FOR_APIS).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Introduce centralized constants and type-safety across the codebase, plus several robustness and UX fixes:
- config: add get_gemini_model_name(), AGG_STATUS_*/DL_STATUS_*/ALERT_PREFIX_* constants, ANNOTATOR_FUNCTION_NAME; make OVERRIDE_VARS, CSV_SCHEMA, AUX_PROTEIN_DISPATCH immutable MappingProxyType and use tuples for schema fields.
- Use new config constants in aggregator, csv generator, papers downloader, validator and other modules (replace hard-coded status/alert strings).
- Annotator: use ANNOTATOR_FUNCTION_NAME, improve batch recovery parsing with more logging and handling of missing args, remove unused helper.
- Annotator CLI: enforce positive integer for --runs and resolve Gemini model lazily via get_gemini_model_name().
- Improve typing/unions using modern | syntax and Sequence in several modules (post_processor, review_engine, exceptions).
- CSV handling: fix header comparison to list(expected_fields) and update tests accordingly.
- Fetcher/enricher/papers: add error handling around enrichment writes and fetch enrichment calls, normalize PubChem response handling and synonyms parsing, and switch downloader to use DL_STATUS_* constants.
- Minor: update .pre-commit-config to add dependencies and remove local pytest-fast hook.

These changes centralize configuration, reduce magic strings, improve immutability and typing, and add better error handling and logging for improved reliability.
- CLI: check for the configured default prompt file and exit with an error if it is missing (advise using --prompt).
- Workspace: add a 'prompts' directory to the initialized workspace structure.
- Prompt builder: read the PDB identifier from 'rcsb_id' instead of 'id' (aligns with the enriched PDB JSON shape) and update the unit test accordingly.
- Tests: tighten live-integration test gating by honoring GPCR_RUN_LIVE_TESTS and clarify skip reasons for missing GPCR_EMAIL_FOR_APIS.
Pull request overview
Copilot reviewed 44 out of 44 changed files in this pull request and generated 5 comments.
```python
try:
    os.makedirs(config.pipeline_runs_dir, exist_ok=True)
    raw_out_file = config.pipeline_runs_dir / f"raw_output_{job_name}.jsonl"
    logger.info("Downloading from %s to %s", job.output_uri, raw_out_file)

    response = requests.get(job.output_uri, timeout=TIMEOUT_BATCH_RESULT_DOWNLOAD)
    response.raise_for_status()
    with open(raw_out_file, "wb") as f_out:
        f_out.write(response.content)
```
check_batch_status() builds raw_out_file using the raw job_name, but Gemini batch job names commonly contain slashes (e.g. batchJobs/...). This turns the output path into nested directories (and recover_batch() won’t find it via raw_output_*.jsonl). Sanitize the job name for filenames (e.g. replace / with _ or use the last path segment) and ensure the parent directory exists before writing.
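A possible sanitization sketch, keeping only the last path segment of the job name (the helper name is illustrative):

```python
from pathlib import Path


def raw_output_path(runs_dir: Path, job_name: str) -> Path:
    """Gemini batch job names look like 'batchJobs/abc123'; keep only the
    last path segment so the filename contains no directory separators."""
    safe = job_name.rsplit("/", 1)[-1].replace("\\", "_")
    return runs_dir / f"raw_output_{safe}.jsonl"
```

This also keeps `recover_batch()`'s `raw_output_*.jsonl` glob working, since the file lands directly in the runs directory.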
```python
def run_single_pdb(
    pdb_id: str,
    enriched_data: dict,
    prompt_text: str,
    pdf_path: Path,
    num_runs: int = GEMINI_DEFAULT_RUNS,
    model_name: str = GEMINI_MODEL_NAME,
) -> None:
```
run_single_pdb() defaults model_name to the module-level GEMINI_MODEL_NAME, which is a static constant (and the config comment suggests using get_gemini_model_name() for fresh env reads). Consider defaulting model_name to None and resolving it at runtime (or defaulting via get_gemini_model_name()), so programmatic callers get the same env override behavior as the CLI.
```python
    except APIError as e:
        retries += 1
        if e.code == 429:
            # Rate-limited — longer sleep before retry
            time.sleep(SLEEP_GEMINI_429 * (2 ** (retries - 1)))
        else:
            time.sleep(GEMINI_BASE_BACKOFF * (2 ** (retries - 1)))
    except Exception:
        retries += 1
        time.sleep(GEMINI_BASE_BACKOFF * (2 ** (retries - 1)))

logger.error(
    "[%s] Run %d failed after %d retries.", pdb_id, run_num, GEMINI_MAX_RETRIES
)
```
In do_run(), generic exceptions are retried with backoff but the underlying exception is never logged (only a final “failed after retries” message). This makes it hard to diagnose schema mismatches, tool-calling failures, or transient SDK issues. Log the exception details at least once per run (e.g., on the first failure or on the final retry) before sleeping.
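A generic retry sketch showing where the per-attempt logging would go (function name and backoff defaults are illustrative, not the module's real constants):

```python
import logging
import time

logger = logging.getLogger(__name__)


def call_with_retries(func, max_retries: int = 3, base_backoff: float = 0.5):
    """Retry with exponential backoff, logging each failure so the root
    cause is visible before any final "failed after retries" message."""
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            return func()
        except Exception as exc:  # broad on purpose in this sketch
            last_exc = exc
            # Log the actual error (with traceback) on every attempt.
            logger.warning(
                "Attempt %d/%d failed: %s", attempt, max_retries, exc, exc_info=True
            )
            time.sleep(base_backoff * (2 ** (attempt - 1)))
    logger.error("All %d attempts failed; last error: %r", max_retries, last_exc)
    return None
```

Even logging only on the final retry would be an improvement; logging every attempt with `exc_info=True` makes transient-vs-persistent failures easy to tell apart.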
```python
def _fetch_unpaywall_pdf_url(doi: str, session: requests.Session) -> str | None:
    """Tier 1: Get OA PDF URL from Unpaywall."""
    url = f"{UNPAYWALL_API_URL}/{doi}"
    try:
        response = session.get(url, timeout=TIMEOUT_UNPAYWALL)
        if response.status_code == 200:
            data = response.json()
            oa_location = data.get("best_oa_location") or {}
            pdf_url = oa_location.get("url_for_pdf")
            if pdf_url:
                return pdf_url  # type: ignore[no-any-return]
```
Unpaywall API calls typically require an email query parameter; currently _fetch_unpaywall_pdf_url() calls GET {UNPAYWALL_API_URL}/{doi} without passing the user email, even though GPCR_EMAIL_FOR_APIS is required and _build_session() sets polite headers. If Unpaywall rejects requests without ?email=..., this tier will always fail. Consider adding the email as a query param (and/or threading it into this function).
```python
for subdir in _INIT_DIRS:
    (workspace_root / subdir).mkdir(parents=True, exist_ok=True)

if not contract_file.exists():
    contract_data = {
        "storage_contract_version": SUPPORTED_CONTRACT_VERSION,
        "created_by": "gpcr-tools",
        "created_at_utc": datetime.now(UTC).isoformat(),
    }
    with open(contract_file, "w", encoding="utf-8") as f:
        json.dump(contract_data, f, indent=2)
        f.write("\n")

# Create targets.txt (pipeline entry point) if it does not exist
targets_file = workspace_root / "targets.txt"
if not targets_file.exists():
    targets_file.write_text(
        "# Add PDB IDs here, one per line.\n"
        "# Lines starting with # are comments. Blank lines are ignored.\n"
        "# Example:\n"
        "# 7W55\n"
        "# 8ABC\n",
        encoding="utf-8",
    )
```
init_workspace() creates the prompts/ directory and config defaults default_prompt_file to workspace/prompts/v5.txt, but no default prompt file is created here. As a result, gpcr-tools annotate will error immediately in a freshly-initialized workspace unless the user manually creates v5.txt. Consider writing a starter prompts/v5.txt alongside targets.txt (or updating the default path).
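A sketch of seeding the starter prompt from packaged data, along the lines later adopted (the package/resource paths are assumptions for this sketch):

```python
import shutil
from importlib import resources
from pathlib import Path


def seed_default_prompt(workspace_root: Path) -> Path:
    """Copy a bundled starter prompt into workspace/prompts/v5.txt if it is
    missing (package/resource paths are assumed, mirroring targets.txt
    seeding above)."""
    target = workspace_root / "prompts" / "v5.txt"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        source = resources.files("gpcr_tools").joinpath("data/prompts/v5.txt")
        with resources.as_file(source) as src:
            shutil.copy(src, target)
    return target
```

Like the `targets.txt` block above, this is idempotent: an existing user-edited prompt is never overwritten.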
Resolve Gemini model name at runtime and improve robustness, plus bundle a default prompt and Unpaywall email support.
- pyproject.toml: allow Greek confusables (α, β, γ) in ruff config for domain terminology.
- annotator/runner.py: stop using a hardcoded GEMINI_MODEL_NAME; accept an optional model_name and fall back to get_gemini_model_name(); add clearer warning logging on Gemini call exceptions; sanitize job names when writing batch raw output files.
- papers/downloader.py: _fetch_unpaywall_pdf_url now accepts an optional email and includes it as a request param; callers pass the resolved_email when checking Unpaywall.
- workspace.py: include importlib.resources/shutil and copy the bundled prompts/v5.txt into a new workspace prompts/ directory when initializing if missing.
- Add packaged prompt resource (src/gpcr_tools/data/prompts/v5.txt) and package data init files to make the prompt accessible.

These changes let the code pick the current Gemini model dynamically, improve error reporting and filename safety, seed new workspaces with a default annotation prompt, and provide optional Unpaywall email support to improve PDF retrieval.
Pull request overview
Copilot reviewed 45 out of 47 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (2)
src/gpcr_tools/csv_generator/review_engine.py:66
`isinstance()` does not accept PEP 604 union types (e.g., `list | dict`) at runtime on Python 3.9 and earlier; there it raises `TypeError: isinstance() argument 2 cannot be a union` and breaks edits in the review UI. Use a tuple of types instead (e.g., `isinstance(original, (list, dict))`).
```python
if isinstance(original, list | dict):
    try:
        parsed = json.loads(new_str)
    except (json.JSONDecodeError, ValueError):
        parsed = None
    if isinstance(parsed, type(original)):
        return parsed
    return new_str
```
src/gpcr_tools/csv_generator/review_engine.py:250
`isinstance(value, dict | list)` will raise `TypeError` at runtime on Python 3.9 and earlier, where `isinstance()` cannot take a union type. Use `isinstance(value, (dict, list))`.
```python
def format_value(value: Any) -> str:
    if isinstance(value, dict | list):
        try:
            return json.dumps(value, sort_keys=True)
        except (TypeError, ValueError):
            return str(value)
    return str(value)
```
```diff
 conf = d_node.get("confidence", 0)
-conf_style = "success" if isinstance(conf, (int, float)) and conf >= 0.8 else "warning"
+conf_style = "success" if isinstance(conf, int | float) and conf >= 0.8 else "warning"
 grid.add_row("Confidence:", f"[{conf_style}]{conf}[/{conf_style}]")
```
`isinstance(conf, int | float)` will raise `TypeError` on Python 3.9 and earlier, where `isinstance()` cannot take a union type. Use `isinstance(conf, (int, float))` (or handle the new string-based confidence values explicitly).
```python
if isinstance(data, Mapping) and not isinstance(data, dict):
    return {str(k): _unwrap_composite(v) for k, v in data.items()}
if isinstance(data, Sequence) and not isinstance(data, str | bytes | list | tuple):
    return [_unwrap_composite(x) for x in data]
if isinstance(data, dict):
```
`isinstance(data, str | bytes | list | tuple)` will raise `TypeError` on Python 3.9 and earlier, where `isinstance()` cannot take a PEP 604 union type. Replace it with a tuple of types (e.g., `(str, bytes, list, tuple)`) to avoid crashing post-processing.
```python
def is_meaningfully_empty(val: Any) -> bool:
    """Return ``True`` if *val* is effectively empty or explicitly null-like."""
    if val is None:
        return True
    if isinstance(val, dict | list | str) and not val:
        return True
```
`isinstance(val, dict | list | str)` will raise `TypeError` at runtime on Python 3.9 and earlier, where `isinstance()` cannot take a union type. Use `isinstance(val, (dict, list, str))`.
```python
    cids = (data.get("IdentifierList") or {}).get("CID")
    if cids is not None:
        if isinstance(cids, list) and len(cids) > 0:
            pubchem_id = str(cids[0])
        elif isinstance(cids, int | float):
            pubchem_id = str(int(cids))
except requests.exceptions.RequestException as exc:
```
`isinstance(cids, int | float)` will raise `TypeError` at runtime on Python 3.9 and earlier, where `isinstance()` cannot take a PEP 604 union type. Use `isinstance(cids, (int, float))` to avoid breaking PubChem CID resolution.
This pull request introduces a new AI annotation module for extracting GPCR structural data from scientific papers, along with significant enhancements to the CLI and annotation pipeline. The main changes include the addition of a Gemini API client with rate limiting, a Ghostscript-based PDF compressor, a robust post-processing pipeline for annotation results, and new CLI commands for batch annotation and paper fetching. The `pyproject.toml` has also been updated to include new dependencies and extras for annotation and paper processing.

**Major new annotation module and pipeline:**

- Added `src/gpcr_tools/annotator/` with the following components:
  - `gemini_client.py`: Implements a Gemini API client with per-key sliding-window rate limiting, supporting both new and legacy API key environment variables.
  - `pdf_compressor.py`: Provides Ghostscript-based PDF compression for large files before Gemini upload, ensuring compliance with API size limits.
  - `post_processor.py`: Adds a post-processing pipeline for Gemini annotation responses, including unwrapping protobuf objects, cleaning up empty fields, standardizing auxiliary protein names, and normalizing UniProt entry names.
  - `cli_handler.py`: Implements a CLI handler for the new `annotate` command, supporting both single and batch annotation modes, with robust input resolution and logging.
  - `__init__.py`.

**CLI and workflow enhancements:**

- Extended `src/gpcr_tools/__main__.py` to add new commands:
  - `fetch`: Download PDB metadata from RCSB and enrich with UniProt/PubChem data.
  - `fetch-papers`: Download open-access papers for enriched PDB entries.
  - `annotate`: Run Gemini AI annotation in both single and batch modes, with options for prompt customization, model selection, and batch recovery/status checking.

**Dependency and configuration updates:**

- `pyproject.toml`: new extras `[annotate]` (Gemini) and `[papers]` (PyMuPDF, lxml); added `gpcr-annotation-tools[annotate,papers]` to the development dependencies for easier local testing.

These changes collectively enable automated, scalable annotation of GPCR structures using AI, with robust input/output handling and improved developer ergonomics.