
Add fetcher, papers, annotator pipeline & CLI #4

Merged
iskoldt-X merged 7 commits into protwis:main from iskoldt-X:main
Apr 13, 2026

Conversation

@iskoldt-X
Collaborator

This pull request introduces a new AI annotation module for extracting GPCR structural data from scientific papers, along with significant enhancements to the CLI and annotation pipeline. The main changes include the addition of a Gemini API client with rate limiting, a Ghostscript-based PDF compressor, a robust post-processing pipeline for annotation results, and new CLI commands for batch annotation and paper fetching. The pyproject.toml has also been updated to include new dependencies and extras for annotation and paper processing.

Major new annotation module and pipeline:

  • Added a new AI annotation module in src/gpcr_tools/annotator/ with the following components:
    • gemini_client.py: Implements a Gemini API client with per-key sliding-window rate limiting, supporting both new and legacy API key environment variables.
    • pdf_compressor.py: Provides Ghostscript-based PDF compression for large files before Gemini upload, ensuring compliance with API size limits.
    • post_processor.py: Adds a post-processing pipeline for Gemini annotation responses, including unwrapping protobuf objects, cleaning up empty fields, standardizing auxiliary protein names, and normalizing UniProt entry names.
    • cli_handler.py: Implements a CLI handler for the new annotate command, supporting both single and batch annotation modes, with robust input resolution and logging.
    • Added module docstring to __init__.py.
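The per-key sliding-window rate limiting described above could look roughly like the sketch below (class and parameter names are illustrative, not the actual gemini_client.py API):

```python
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most max_calls per window_s seconds for one API key."""

    def __init__(self, max_calls: int, window_s: float) -> None:
        self.max_calls = max_calls
        self.window_s = window_s
        self._timestamps: deque[float] = deque()

    def acquire(self) -> None:
        """Block until a call is permitted, then record its timestamp."""
        now = time.monotonic()
        # Evict timestamps that have fallen out of the sliding window.
        while self._timestamps and now - self._timestamps[0] >= self.window_s:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_calls:
            # Sleep until the oldest recorded call exits the window, then retry.
            time.sleep(self.window_s - (now - self._timestamps[0]))
            self.acquire()
            return
        self._timestamps.append(time.monotonic())
```

In the real client there would presumably be one limiter instance per configured API key.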

CLI and workflow enhancements:

  • Extended the main CLI in src/gpcr_tools/__main__.py to add new commands:
    • fetch: Download PDB metadata from RCSB and enrich with UniProt/PubChem data.
    • fetch-papers: Download open-access papers for enriched PDB entries.
    • annotate: Run Gemini AI annotation in both single and batch modes, with options for prompt customization, model selection, and batch recovery/status checking. [1] [2]

Dependency and configuration updates:

  • Updated pyproject.toml:
    • Bumped version to 0.1.3.
    • Added new extras for [annotate] (Gemini) and [papers] (PyMuPDF, lxml).
    • Included gpcr-annotation-tools[annotate,papers] in the development dependencies for easier local testing. [1] [2]

These changes collectively enable automated, scalable annotation of GPCR structures using AI, with robust input/output handling and improved developer ergonomics.

Introduce end-to-end pipeline components and CLI commands for fetching, paper acquisition, and AI annotation. Adds new packages: fetcher (RCSB GraphQL client, enricher, cache, targets, runner), papers (downloader, watcher, runner), and annotator (Gemini client with sliding-window rate limiting, prompt builder, PDF compressor, post-processor, schema and runner/CLI glue). Extend CLI (__main__.py) with fetch / fetch-papers / annotate commands (single + batch modes, prompt/model/runs flags and batch control). Update packaging (pyproject.toml bumped to 0.1.3, extras for annotate/papers), expand README with pipeline docs and usage, add many unit/integration tests, and make small config/workspace/validator adjustments to integrate the new stages.
Contributor

Copilot AI left a comment


Pull request overview

This PR adds an end-to-end pipeline for fetching PDB metadata, downloading papers, and running Gemini-based AI annotation (single + batch), plus supporting workspace/config updates, docs, and tests.

Changes:

  • Added new pipeline stages: fetcher/ (RCSB + enrichment), papers/ (OA download + paywall watcher), annotator/ (Gemini client, prompt building, compression, post-processing, runner).
  • Extended the CLI with fetch, fetch-papers, and annotate commands.
  • Updated configuration/constants, workspace initialization, dependencies/extras, and expanded documentation/testing.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 12 comments.

Summary per file:

  • src/gpcr_tools/config.py: Adds centralized URLs/timeouts/sleeps + new workspace path fields used by new pipeline stages
  • src/gpcr_tools/workspace.py: Adds targets.txt creation + switches to raw_pdb_json_dir
  • src/gpcr_tools/fetcher/*: Implements target parsing, RCSB GraphQL fetch, enrichment logic, and JSON caching
  • src/gpcr_tools/papers/*: Implements tiered paper download, download logging, and manual PDF watcher
  • src/gpcr_tools/annotator/*: Implements Gemini client + schema, prompt builder, PDF compression, post-processing, and runners
  • src/gpcr_tools/__main__.py: Adds new CLI subcommands and annotation orchestration logic
  • src/gpcr_tools/validator/*: Switches hardcoded URLs/timeouts/sleeps to config constants
  • README.md: Documents new pipeline, CLI usage, and architecture
  • pyproject.toml: Version bump + new extras/deps
  • tests/unit/*, tests/integration/*: Adds unit tests and new “live API” integration tests


Comment on lines +10 to +14
def generate_chain_inventory_reminder(pdb_id: str, enriched_data: dict) -> str:
"""Generates a human-readable summary of the polymer chains in the PDB."""
polymers = enriched_data.get("polymer_entities") or []
if not polymers:
return f"### CHAIN INVENTORY REMINDER\nThis structure ({pdb_id}) contains 0 polymer chains."
Copilot AI Apr 12, 2026

prompt_builder is reading polymer/nonpolymer entities from the top-level enriched_data, but the enricher writes them under enriched_data["data"]["entry"]. As a result, chain inventory and simplified metadata will be empty/"UNKNOWN" for real enriched JSON, which will significantly degrade annotation quality. Update the prompt builder to consistently dereference data.entry (and use rcsb_id as the PDB ID) before extracting polymer_entities, nonpolymer_entities, exptl/refine, etc.

Comment on lines +49 to +50
pdb_id = (enriched_data.get("entry") or {}).get("id") or "UNKNOWN"
entry = enriched_data.get("entry") or {}
Copilot AI Apr 12, 2026

enhanced_simplify_pdb_json currently uses enriched_data.get("entry"), but the enriched files produced by fetcher/enricher.py are full GraphQL responses with the payload under data.entry. This makes pdb_id, method, resolution, and release_date resolution incorrect for real runs. Consider normalizing by setting entry = (enriched_data.get("data") or {}).get("entry") or {} and reading fields from there.

Suggested change:
- pdb_id = (enriched_data.get("entry") or {}).get("id") or "UNKNOWN"
- entry = enriched_data.get("entry") or {}
+ entry = (enriched_data.get("data") or {}).get("entry") or enriched_data.get("entry") or {}
+ pdb_id = entry.get("id") or "UNKNOWN"

Comment on lines +168 to +170
# 3. Sibling structures warning
siblings = enriched_data.get("sibling_pdbs") or []
if siblings:
Copilot AI Apr 12, 2026

Sibling PDB IDs are stored as data.entry.sibling_pdbs by the enricher, but the prompt builder looks for enriched_data.get("sibling_pdbs"). This prevents the sibling warning from ever being included for real enriched JSON. Please read sibling_pdbs from the same normalized entry object used for the rest of the prompt.

Comment thread src/gpcr_tools/__main__.py Outdated
Comment on lines +243 to +246
# Auto-discover: enriched PDBs missing ai_results
enriched_pdbs = {p.stem.upper() for p in cfg.enriched_dir.glob("*.json")}
done_pdbs = {d.name.upper() for d in cfg.ai_results_dir.iterdir() if d.is_dir()}
pdb_ids = sorted(enriched_pdbs - done_pdbs)
Copilot AI Apr 12, 2026

Auto-discovery for annotate treats any existing ai_results/{pdb_id}/ directory as "done". This will incorrectly skip PDBs that have only partial run_*.json outputs (e.g., interrupted runs). Align this logic with annotator/cli_handler.py (or a shared helper) by checking whether the expected run_1..run_N.json files exist before considering a PDB complete.
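A shared completeness check of the suggested kind might look like this (sketch; the helper name is hypothetical):

```python
from pathlib import Path


def pdb_annotation_complete(result_dir: Path, num_runs: int) -> bool:
    """Treat a PDB as done only when run_1.json .. run_{num_runs}.json all exist."""
    return all((result_dir / f"run_{i}.json").exists() for i in range(1, num_runs + 1))
```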

Comment thread src/gpcr_tools/annotator/cli_handler.py Outdated
Comment on lines +11 to +17
def run_annotate(
pdb_id: str | None,
targets_file: str | None,
prompt_file: str | None,
num_runs: int,
use_batch: bool,
) -> None:
Copilot AI Apr 12, 2026

annotator/cli_handler.py appears to be unused (no references found) and duplicates substantial CLI logic that already exists in gpcr_tools/__main__.py. This increases maintenance burden and can lead to behavior drift. Either wire __main__.py to call run_annotate(...) or remove the handler module to avoid dead code.

Comment on lines +70 to +77
def _standardize_auxiliary_name(name: str) -> str:
"""Standardise common auxiliary protein names.

The explicit :data:`AUX_PROTEIN_NAME_MAPPING` is checked first; only if no
match is found do we fall through to the BRIL and nanobody regex patterns.
"""
if not name:
return name
Copilot AI Apr 12, 2026

_standardize_auxiliary_name is annotated as accepting/returning str, but it is called in tests with None and will return None at runtime. This is likely to trip mypy and makes the API misleading. Update the signature to name: str | None (and return type accordingly) to match actual usage.

from gpcr_tools.config import OVERRIDE_VARS, WorkspaceConfig, get_config

SUPPORTED_CONTRACT_VERSION = 1
from gpcr_tools.config import OVERRIDE_VARS, SUPPORTED_CONTRACT_VERSION, WorkspaceConfig, get_config
Copilot AI Apr 12, 2026

This import statement exceeds the configured line length and bypasses Ruff's auto-formatting style. Please run ruff format or wrap the import across multiple lines to match the project's formatting conventions.

Suggested change:
- from gpcr_tools.config import OVERRIDE_VARS, SUPPORTED_CONTRACT_VERSION, WorkspaceConfig, get_config
+ from gpcr_tools.config import (
+     OVERRIDE_VARS,
+     SUPPORTED_CONTRACT_VERSION,
+     WorkspaceConfig,
+     get_config,
+ )

Comment thread pyproject.toml Outdated
"mypy>=1.10",
"types-requests>=2.31",
"types-tqdm>=4.60",
"gpcr-annotation-tools[annotate,papers]",
Copilot AI Apr 12, 2026

Including gpcr-annotation-tools[annotate,papers] inside the package's own dev extra makes the project depend on itself. In editable installs this can cause pip to resolve the published package from PyPI (or fail if it's not available), and it also obscures which deps are actually needed for tests. Instead, add the extra deps directly to dev (e.g., google-genai, PyMuPDF, lxml) or document installing with pip install -e '.[dev,annotate,papers]'.

Suggested change:
- "gpcr-annotation-tools[annotate,papers]",
+ "google-genai",
+ "PyMuPDF",
+ "lxml",

Comment thread README.md Outdated
- **Unit tests** for every module across all five pipeline stages
- **Integration tests** for the full aggregation pipeline, error isolation, and atomic write safety
- **Real PDB fixture tests** covering 9 canonical GPCR structures (5G53, 8TII, 9AS1, 9BLW, 9EJZ, 9IQS, 9M88, 9NOR, 9O38) with 10 AI runs each
- **Mock HTTP** for all external APIs — no live network calls in the test suite
Copilot AI Apr 12, 2026

The README claims there are "no live network calls in the test suite", but this PR adds live integration tests under tests/integration/ (e.g., test_fetch_live.py). Either adjust the documentation to describe how live tests are gated/skipped, or change the tests to be opt-in so the statement remains true.

Suggested change:
  old: - **Mock HTTP** for all external APIs — no live network calls in the test suite
  new: - **Mock HTTP** for external APIs in the default test suite; any live network integration tests are gated/opt-in and skipped unless explicitly enabled

Comment on lines +1 to +11
"""Integration tests for fetch pipeline — LIVE API calls.

These tests actually call RCSB, UniProt, PubChem, and RCSB Search APIs.
They use the canonical 9 PDB IDs from the test fixture set.

Run with:
pytest tests/integration/test_fetch_live.py -v

These tests require network access and may take 2-5 minutes due to
API rate limiting (1s sleep per RCSB request).
"""
Copilot AI Apr 12, 2026

This file performs live network calls (RCSB/UniProt/PubChem) but is not skipped/marked, so it will run under the default pytest configuration (testpaths = ['tests']). That can make CI runs slow and flaky due to external rate limits/outages. Make these tests opt-in (e.g., @pytest.mark.skipif(not os.getenv('GPCR_RUN_LIVE_TESTS'), ...)) or add a network marker and configure pytest to exclude it by default.
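An opt-in gate of the kind suggested could be written as follows (sketch; the marker name and test body are illustrative):

```python
import os

import pytest

# Opt-in gate: live-API tests run only when GPCR_RUN_LIVE_TESTS is set.
requires_live = pytest.mark.skipif(
    not os.getenv("GPCR_RUN_LIVE_TESTS"),
    reason="set GPCR_RUN_LIVE_TESTS=1 to run live-API integration tests",
)


@requires_live
def test_fetch_live_entry() -> None:
    ...  # real RCSB/UniProt/PubChem calls would go here
```

Alternatively, a `network` marker excluded by default in pytest configuration achieves the same effect.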

Adapt code and tests to the new enriched JSON envelope (data.entry): add _get_entry helper and update prompt_builder, enhanced_simplify_pdb_json, and tests accordingly. Gate live network integration tests behind GPCR_RUN_LIVE_TESTS and update README and tests to skip unless the env var is set. Improve CLI auto-discovery to treat an AI result directory as complete only when the required number of run_N.json files exist, and remove the now-obsolete annotator/cli_handler module. Add PDF validation & cleanup in papers.downloader to detect non-PDF downloads, update post_processor to accept optional auxiliary names, and add runtime dev dependencies (google-genai, PyMuPDF, lxml).
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 5 comments.



def enhanced_simplify_pdb_json(enriched_data: dict) -> dict:
"""Simplifies the enriched PDB JSON into a minimal dictionary for Gemini."""
entry = _get_entry(enriched_data)
pdb_id = entry.get("id") or "UNKNOWN"
Copilot AI Apr 12, 2026

enhanced_simplify_pdb_json() derives pdb_id from entry.get("id"), but the RCSB GraphQL payload puts the identifier at entry["rcsb_id"] (and a nested entry["entry"]["id"]). With real enriched data this will often become UNKNOWN, degrading prompt grounding. Prefer entry.get("rcsb_id") (fallback to ((entry.get("entry") or {}).get("id"))) so prompts consistently include the actual PDB ID.

Suggested change:
- pdb_id = entry.get("id") or "UNKNOWN"
+ pdb_id = entry.get("rcsb_id") or ((entry.get("entry") or {}).get("id")) or "UNKNOWN"

Comment on lines +256 to +261
# Resolve prompt text
if args.prompt:
prompt_text = Path(args.prompt).read_text(encoding="utf-8")
else:
prompt_text = cfg.default_prompt_file.read_text(encoding="utf-8")

Copilot AI Apr 12, 2026

The CLI falls back to reading cfg.default_prompt_file ({workspace}/prompts/v5.txt), but init-workspace/ensure_runtime_dirs do not create prompts/ or the default prompt file, and there is no prompts/v5.txt in-repo. On a fresh workspace, gpcr-tools annotate will crash with FileNotFoundError. Either create/populate the default prompt in init_workspace (and add prompts to the contract dirs), or ship the prompt as a package resource and reference it from the installed package rather than the workspace.

Comment thread .pre-commit-config.yaml Outdated
Comment thread README.md Outdated
Comment on lines +52 to +53
@pytest.mark.skipif(not _EMAIL, reason=_SKIP_REASON)
def test_auto_only_downloads_some_papers(self, papers_workspace: Path) -> None:
Copilot AI Apr 12, 2026

These tests run live network calls whenever GPCR_EMAIL_FOR_APIS is set (via the skipif(not _EMAIL, ...) gate). This is easier to satisfy in CI than the explicit GPCR_RUN_LIVE_TESTS=1 gate used elsewhere (e.g. test_fetch_live.py), which can unintentionally enable slow/flaky live tests. Consider gating on GPCR_RUN_LIVE_TESTS (and optionally also requiring GPCR_EMAIL_FOR_APIS).

iskoldt-X and others added 4 commits April 13, 2026 01:26
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Introduce centralized constants and type-safety across the codebase, plus several robustness and UX fixes:

- config: add get_gemini_model_name(), AGG_STATUS_*/DL_STATUS_*/ALERT_PREFIX_* constants, ANNOTATOR_FUNCTION_NAME; make OVERRIDE_VARS, CSV_SCHEMA, AUX_PROTEIN_DISPATCH immutable MappingProxyType and use tuples for schema fields.
- Use new config constants in aggregator, csv generator, papers downloader, validator and other modules (replace hard-coded status/alert strings).
- Annotator: use ANNOTATOR_FUNCTION_NAME, improve batch recovery parsing with more logging and handling of missing args, remove unused helper.
- Annotator CLI: enforce positive integer for --runs and resolve Gemini model lazily via get_gemini_model_name().
- Improve typing/unions using modern | syntax and Sequence in several modules (post_processor, review_engine, exceptions).
- CSV handling: fix header comparison to list(expected_fields) and update tests accordingly.
- Fetcher/enricher/papers: add error handling around enrichment writes and fetch enrichment calls, normalize PubChem response handling and synonyms parsing, and switch downloader to use DL_STATUS_* constants.
- Minor: update .pre-commit-config to add dependencies and remove local pytest-fast hook.

These changes centralize configuration, reduce magic strings, improve immutability and typing, and add better error handling and logging for improved reliability.
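The MappingProxyType change mentioned above makes config mappings read-only at runtime; for example (the key/value shown are illustrative, not the real OVERRIDE_VARS contents):

```python
from types import MappingProxyType

# A read-only view over a dict: lookups work, mutation raises TypeError.
OVERRIDE_VARS = MappingProxyType({
    "GPCR_WORKSPACE": "workspace_root",  # illustrative entry
})
```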
CLI: check for the configured default prompt file and exit with an error if it is missing (advise using --prompt). Workspace: add a 'prompts' directory to initialized workspace structure. Prompt builder: read PDB identifier from 'rcsb_id' instead of 'id' (aligns with enriched PDB JSON shape) and update unit test accordingly. Tests: tighten live-integration test gating by honoring GPCR_RUN_LIVE_TESTS and clarify skip reasons for missing GPCR_EMAIL_FOR_APIS.
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 44 out of 44 changed files in this pull request and generated 5 comments.



Comment on lines +320 to +328
try:
os.makedirs(config.pipeline_runs_dir, exist_ok=True)
raw_out_file = config.pipeline_runs_dir / f"raw_output_{job_name}.jsonl"
logger.info("Downloading from %s to %s", job.output_uri, raw_out_file)

response = requests.get(job.output_uri, timeout=TIMEOUT_BATCH_RESULT_DOWNLOAD)
response.raise_for_status()
with open(raw_out_file, "wb") as f_out:
f_out.write(response.content)
Copilot AI Apr 13, 2026

check_batch_status() builds raw_out_file using the raw job_name, but Gemini batch job names commonly contain slashes (e.g. batchJobs/...). This turns the output path into nested directories (and recover_batch() won’t find it via raw_output_*.jsonl). Sanitize the job name for filenames (e.g. replace / with _ or use the last path segment) and ensure the parent directory exists before writing.
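Sanitizing the job name before building the output path could be as simple as this sketch (the function name is hypothetical):

```python
from pathlib import Path


def batch_output_path(runs_dir: Path, job_name: str) -> Path:
    """Flatten job names like 'batchJobs/abc123' into a single safe filename."""
    return runs_dir / f"raw_output_{job_name.replace('/', '_')}.jsonl"
```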

Comment on lines +36 to +43
def run_single_pdb(
pdb_id: str,
enriched_data: dict,
prompt_text: str,
pdf_path: Path,
num_runs: int = GEMINI_DEFAULT_RUNS,
model_name: str = GEMINI_MODEL_NAME,
) -> None:
Copilot AI Apr 13, 2026

run_single_pdb() defaults model_name to the module-level GEMINI_MODEL_NAME, which is a static constant (and the config comment suggests using get_gemini_model_name() for fresh env reads). Consider defaulting model_name to None and resolving it at runtime (or defaulting via get_gemini_model_name()), so programmatic callers get the same env override behavior as the CLI.

Comment on lines +125 to +138
except APIError as e:
retries += 1
if e.code == 429:
# Rate-limited — longer sleep before retry
time.sleep(SLEEP_GEMINI_429 * (2 ** (retries - 1)))
else:
time.sleep(GEMINI_BASE_BACKOFF * (2 ** (retries - 1)))
except Exception:
retries += 1
time.sleep(GEMINI_BASE_BACKOFF * (2 ** (retries - 1)))

logger.error(
"[%s] Run %d failed after %d retries.", pdb_id, run_num, GEMINI_MAX_RETRIES
)
Copilot AI Apr 13, 2026

In do_run(), generic exceptions are retried with backoff but the underlying exception is never logged (only a final “failed after retries” message). This makes it hard to diagnose schema mismatches, tool-calling failures, or transient SDK issues. Log the exception details at least once per run (e.g., on the first failure or on the final retry) before sleeping.
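A retry loop that logs the underlying exception, as suggested, might be structured like this (sketch; names are illustrative, not the actual do_run() code):

```python
import logging
import time

logger = logging.getLogger("gpcr_tools.annotator")


def call_with_retries(fn, max_retries: int = 3, base_backoff: float = 0.01):
    """Call fn(); on failure, log the exception and retry with exponential backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception:
            # Log the underlying error so schema/SDK failures stay diagnosable.
            logger.warning("attempt %d/%d failed", attempt, max_retries, exc_info=True)
            time.sleep(base_backoff * (2 ** (attempt - 1)))
    logger.error("all %d attempts failed", max_retries)
    return None
```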

Comment thread src/gpcr_tools/papers/downloader.py Outdated
Comment on lines +176 to +186
def _fetch_unpaywall_pdf_url(doi: str, session: requests.Session) -> str | None:
"""Tier 1: Get OA PDF URL from Unpaywall."""
url = f"{UNPAYWALL_API_URL}/{doi}"
try:
response = session.get(url, timeout=TIMEOUT_UNPAYWALL)
if response.status_code == 200:
data = response.json()
oa_location = data.get("best_oa_location") or {}
pdf_url = oa_location.get("url_for_pdf")
if pdf_url:
return pdf_url # type: ignore[no-any-return]
Copilot AI Apr 13, 2026

Unpaywall API calls typically require an email query parameter; currently _fetch_unpaywall_pdf_url() calls GET {UNPAYWALL_API_URL}/{doi} without passing the user email, even though GPCR_EMAIL_FOR_APIS is required and _build_session() sets polite headers. If Unpaywall rejects requests without ?email=..., this tier will always fail. Consider adding the email as a query param (and/or threading it into this function).

Comment on lines 65 to +89
for subdir in _INIT_DIRS:
(workspace_root / subdir).mkdir(parents=True, exist_ok=True)

if not contract_file.exists():
contract_data = {
"storage_contract_version": SUPPORTED_CONTRACT_VERSION,
"created_by": "gpcr-tools",
"created_at_utc": datetime.now(UTC).isoformat(),
}
with open(contract_file, "w", encoding="utf-8") as f:
json.dump(contract_data, f, indent=2)
f.write("\n")

# Create targets.txt (pipeline entry point) if it does not exist
targets_file = workspace_root / "targets.txt"
if not targets_file.exists():
targets_file.write_text(
"# Add PDB IDs here, one per line.\n"
"# Lines starting with # are comments. Blank lines are ignored.\n"
"# Example:\n"
"# 7W55\n"
"# 8ABC\n",
encoding="utf-8",
)

Copilot AI Apr 13, 2026

init_workspace() creates the prompts/ directory and config defaults default_prompt_file to workspace/prompts/v5.txt, but no default prompt file is created here. As a result, gpcr-tools annotate will error immediately in a freshly-initialized workspace unless the user manually creates v5.txt. Consider writing a starter prompts/v5.txt alongside targets.txt (or updating the default path).

Resolve Gemini model name at runtime and improve robustness, plus bundle a default prompt and Unpaywall email support.

- pyproject.toml: allow Greek confusables (α, β, γ) in ruff config for domain terminology.
- annotator/runner.py: stop using a hardcoded GEMINI_MODEL_NAME; accept an optional model_name and fall back to get_gemini_model_name(); add clearer warning logging on Gemini call exceptions; sanitize job names when writing batch raw output files.
- papers/downloader.py: _fetch_unpaywall_pdf_url now accepts an optional email and includes it as a request param; callers pass the resolved_email when checking Unpaywall.
- workspace.py: include importlib.resources/shutil and copy the bundled prompts/v5.txt into a new workspace prompts/ directory when initializing if missing.
- Add packaged prompt resource (src/gpcr_tools/data/prompts/v5.txt) and package data init files to make the prompt accessible.

These changes let the code pick the current Gemini model dynamically, improve error reporting and filename safety, seed new workspaces with a default annotation prompt, and provide optional Unpaywall email support to improve PDF retrieval.
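The workspace seeding described in this commit might be implemented roughly as follows (sketch; assumes the packaged resource lives at gpcr_tools/data/prompts/v5.txt as stated above):

```python
from importlib import resources
from pathlib import Path


def seed_default_prompt(workspace_root: Path, package: str = "gpcr_tools.data") -> None:
    """Copy the bundled prompts/v5.txt into a fresh workspace if not already present."""
    dst = workspace_root / "prompts" / "v5.txt"
    if dst.exists():
        return  # never overwrite a user-edited prompt
    dst.parent.mkdir(parents=True, exist_ok=True)
    src = resources.files(package).joinpath("prompts/v5.txt")
    dst.write_text(src.read_text(encoding="utf-8"), encoding="utf-8")
```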
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 45 out of 47 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

src/gpcr_tools/csv_generator/review_engine.py:66

  • isinstance() does not accept PEP 604 union types (e.g., list | dict) at runtime; this will raise TypeError: isinstance() argument 2 cannot be a union and break edits in the review UI. Use a tuple of types instead (e.g., isinstance(original, (list, dict))).
    if isinstance(original, list | dict):
        try:
            parsed = json.loads(new_str)
        except (json.JSONDecodeError, ValueError):
            parsed = None
        if isinstance(parsed, type(original)):
            return parsed
    return new_str

src/gpcr_tools/csv_generator/review_engine.py:250

  • isinstance(value, dict | list) will raise TypeError at runtime because isinstance() cannot take a union type. Use isinstance(value, (dict, list)).
        def format_value(value: Any) -> str:
            if isinstance(value, dict | list):
                try:
                    return json.dumps(value, sort_keys=True)
                except (TypeError, ValueError):
                    return str(value)
            return str(value)


Comment on lines 146 to 148
conf = d_node.get("confidence", 0)
conf_style = "success" if isinstance(conf, (int, float)) and conf >= 0.8 else "warning"
conf_style = "success" if isinstance(conf, int | float) and conf >= 0.8 else "warning"
grid.add_row("Confidence:", f"[{conf_style}]{conf}[/{conf_style}]")
Copilot AI Apr 13, 2026

isinstance(conf, int | float) will raise TypeError because isinstance() cannot take a union type. Use isinstance(conf, (int, float)) (or handle the new string-based confidence values explicitly).

Comment on lines +31 to +35
if isinstance(data, Mapping) and not isinstance(data, dict):
return {str(k): _unwrap_composite(v) for k, v in data.items()}
if isinstance(data, Sequence) and not isinstance(data, str | bytes | list | tuple):
return [_unwrap_composite(x) for x in data]
if isinstance(data, dict):
Copilot AI Apr 13, 2026

isinstance(data, str | bytes | list | tuple) will raise TypeError because isinstance() cannot take a PEP 604 union type. Replace with a tuple of types (e.g., (str, bytes, list, tuple)) to avoid crashing post-processing.

Comment on lines +42 to +47
def is_meaningfully_empty(val: Any) -> bool:
"""Return ``True`` if *val* is effectively empty or explicitly null-like."""
if val is None:
return True
if isinstance(val, dict | list | str) and not val:
return True
Copilot AI Apr 13, 2026

isinstance(val, dict | list | str) will raise TypeError at runtime because isinstance() cannot take a union type. Use isinstance(val, (dict, list, str)).

Comment on lines +312 to +318
cids = (data.get("IdentifierList") or {}).get("CID")
if cids is not None:
if isinstance(cids, list) and len(cids) > 0:
pubchem_id = str(cids[0])
elif isinstance(cids, int | float):
pubchem_id = str(int(cids))
except requests.exceptions.RequestException as exc:
Copilot AI Apr 13, 2026

isinstance(cids, int | float) will raise TypeError at runtime because isinstance() cannot take a PEP 604 union type. Use isinstance(cids, (int, float)) to avoid breaking PubChem CID resolution.

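One caveat on the isinstance comments in this review: since Python 3.10 (PEP 604), isinstance does accept union types like int | float directly, so the TypeError described only occurs on Python 3.9 and earlier. The tuple form remains the safer choice when older interpreters must be supported, and the two are equivalent where both work:

```python
import sys

# The tuple form works on every Python version.
assert isinstance(3.5, (int, float))
assert not isinstance("3.5", (int, float))

# On 3.10+, a PEP 604 union is also accepted by isinstance.
if sys.version_info >= (3, 10):
    assert isinstance(3.5, int | float)
    assert isinstance([1], list | dict)
```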
iskoldt-X marked this pull request as ready for review April 13, 2026 00:50
iskoldt-X merged commit 9ac54ed into protwis:main Apr 13, 2026
9 checks passed
