Add aggregation & validation pipeline with oligomer analysis #3

Merged
iskoldt-X merged 21 commits into protwis:main from iskoldt-X:main on Apr 5, 2026

Conversation

@iskoldt-X (Collaborator)

Summary

  • New aggregator/ package — loads AI annotation runs per PDB, selects the best by score, applies majority-vote consensus with per-field controversy detection, and injects ground truth from enriched PDB metadata
  • New validator/ package — chimera detection, receptor identity validation (UniProt API cross-check), ligand PDB-CCD validation, oligomer analysis (classification, 7TM completeness, chain-ID override, primary protomer suggestion), structural integrity checker, API clients, and persistent JSON cache
  • CLI aggregate subcommand with --force and --skip-api-checks flags
  • Hardened csv_generator/ — replaced all .get(key, {}) anti-patterns with None-safe idioms, centralized 20+ magic strings into config.py constants
  • Updated README — full pipeline documentation with architecture diagram, workspace layout, and design principles
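
As an illustration of the consensus step, here is a minimal majority-vote-with-controversy sketch (the function name and quorum threshold are assumptions for illustration, not the aggregator's actual API):

```python
from collections import Counter
from typing import Any


def vote_field(values: list[Any], quorum: float = 0.5) -> tuple[Any, bool]:
    """Return (winner, controversial) for one field across AI runs.

    The winner is the most common non-missing value; it is flagged
    controversial when it does not exceed the quorum fraction of votes.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None, False
    winner, n_votes = Counter(present).most_common(1)[0]
    return winner, n_votes / len(present) <= quorum


# Example: 10 runs voting on a receptor slug; 7/9 agreement is not controversial.
runs = ["oprm_human"] * 7 + ["oprd_human"] * 2 + [None]
print(vote_field(runs))  # ('oprm_human', False)
```

Missing values are excluded from the denominator so a run that omits a field does not dilute the vote.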

Architecture highlights

  • Atomic writes: tempfile + os.replace + try/finally cleanup
  • Mutation isolation: deepcopy() boundary in the runner before passing data to validators
  • None-safety: (data.get(key) or {}).get(child) — no .get(key, {}) anywhere
  • No magic strings: all cross-module strings defined as constants in config.py
  • Immutability: frozenset, tuple, and MappingProxyType for module-level constants
  • Error isolation: each PDB wrapped in try/except; failures are logged and never crash the batch
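
The atomic-write principle above can be sketched as a small helper (function name and paths are illustrative; this mirrors the described tempfile + os.replace + try/finally pattern, with the fsync hardening raised in review):

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON so readers only ever see the old file or the complete new one."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd = tempfile.NamedTemporaryFile(
        mode="w", dir=dir_name, suffix=".tmp", delete=False, encoding="utf-8"
    )
    tmp_path = fd.name
    try:
        json.dump(data, fd, indent=2)
        fd.flush()
        os.fsync(fd.fileno())  # harden against power loss before the rename
        fd.close()
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    finally:
        if not fd.closed:
            fd.close()
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up if anything failed before the rename
```

The temp file is created in the destination directory so the final os.replace never crosses a filesystem boundary.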

Stats

  • ~2,650 lines of new production code across 13 new modules
  • ~3,400 lines of new tests across 11 new test files
  • 685 tests passing (unit + integration + 9 canonical real-PDB fixtures)
  • Zero ruff, mypy, or formatting issues

Test plan

  • ruff check src/ tests/ — all checks passed
  • ruff format --check src/ tests/ — all files formatted
  • mypy src/ — no issues (with ignore_missing_imports = false)
  • pytest tests/ -v — 685 passed
  • Smoke-test gpcr-tools aggregate <workspace> on a real workspace with 10 AI runs per PDB

🤖 Generated with Claude Code

iskoldt-X added 15 commits April 2, 2026 23:34
Expand the tests documentation to add a Real Data Testing Strategy and update the test layout. The changes introduce a tests/fixtures/real_pdbs/ structure (with {pdb_id}.json, logs, validation_logs, ai_results, enriched) and add multiple integration tests (test_real_pdb_*.py). The new section defines the "live chain" goal (run pipeline stages live; AI annotation outputs remain the only permanent static breakpoint), explains progressive retirement/migration of fixtures, intermediate fixture rules, fixture selection criteria to ensure coverage, and recommended test organization. Purpose: avoid format drift from stale fixtures and guide staged migration of pipeline stages so downstream tests consume live outputs.
Add future_plan.md describing an improvement to type handling in interactive review. Notes that review_engine.py's coerce_type currently falls back to strings on failed coercion, which can cause type degradation in CSV outputs. Proposes refactoring the UI to validate user input for strict types (booleans/numbers), reject invalid entries with a clear validation error, and prompt users to retry instead of silently accepting string fallbacks.
Update future_plan.md to add a new section describing a validation-ordering issue in pdb-annotation/aggregate_results.py. Documents that validate_receptor_identity() runs before analyze_oligomer()'s chain override, explains the misleading validation reports this can produce, and why the migration preserves the original ordering. Adds an action plan for a follow-up epic: move the chain-override earlier, re-run receptor validation after the override, add regression tests for AI-hallucination scenarios, and check for other validators depending on pre-override state. Also fixes a trailing-newline/formatting issue around the earlier UI validation note.
Clarify the canonical PDB list size (currently 9 entries) and add a "Mandatory Data Density" requirement: each PDB in the canonical set must include a complete set of 10 AI annotation runs (run_01.json through run_10.json) under tests/fixtures/real_pdbs/ai_results/{PDB_ID}/. This ensures robust majority-voting and discrepancy-detection in tests while keeping AI outputs as the sole permanent static fixture.
Introduce a new gpcr_tools.aggregator subpackage (runner, voting, ai_results_loader, enriched_loader, ground_truth) and wire an "aggregate" CLI command into __main__.py. Implements majority-voting, run scoring/selection, discrepancy detection, ground-truth injection, chimera integration, atomic output writes, and aggregate logging; uses caches for sequence/validation lookups. Add validator scaffolding and many tests/fixtures for real-PDB aggregation flows. Update config with voting/validation constants and add requests and typing stubs to pyproject; also tighten mypy by enabling import checks.
Document the new multi-run aggregation and algorithmic validation pipeline: added a Pipeline Overview, CLI examples for the `aggregate` command (and note about `-it`), and a text diagram of the aggregate→curate flow. Updated workspace layout to show `ai_results/` (10 runs per PDB), `aggregated/` logging/validation outputs, persistent caches, and new state files (`aggregate_log.json`). Expanded System Architecture with module-level layouts for aggregator, validator, csv_generator, and supporting modules, plus Design Principles and test-suite details. Small wording/alert changes to clarify provenance, validation reports, and CI type-safety config.
Bump GitHub Actions in .github/workflows/ci.yml: upgrade actions/checkout from v4 to v6 and actions/setup-python from v5 to v6 in lint, type-check and test jobs; upgrade actions/upload-artifact from v4 to v7 for coverage upload. No behavioral changes to job steps or Python versions; this refreshes action versions for CI reliability.
Change workflow tag filter from "v*" to "*" so the Docker publish workflow runs for any pushed tag instead of only tags starting with "v". This allows publishing images for tags that don't use the "v" prefix.
Replace occurrences of /path/to/gpcr_workspace with ~/gpcr_workspace in README usage snippets. Updated docker -v mounts for init-workspace, aggregate, curate (including targeting a PDB entry) and the export GPCR_WORKSPACE line to recommend the home-directory path for convenience.
Introduce API_MAX_RETRIES in config and add simple retry/backoff for UniProt calls: check_uniprot_existence (HEAD) and get_sequence_from_uniprot (FASTA) now retry up to the configured attempts with a 1s sleep and only log on the final failure. Also import time and the new constant where needed. Add a tqdm progress bar to aggregate_all to display progress when processing pending PDB IDs.
Update GitHub Action versions in .github/workflows/docker-publish.yml to newer releases for compatibility and fixes: actions/checkout v4→v6, docker/login-action v3→v4, docker/setup-qemu-action v3→v4, docker/setup-buildx-action v3→v4, docker/metadata-action v5→v6, and docker/build-push-action v5→v7.
Introduce ALERT_SUSPICIOUS_7TM and mark chains with no detected transmembrane helices as suspicious.

- config.py: add ALERT_SUSPICIOUS_7TM and extend GPCR_SLUG_NEGATIVE_PREFIXES with glycoprotein hormone prefixes (glha, fshb, lhb, tshb, cgb).
- csv_generator/ui.py: import new alert, include it in the set that triggers oligomer highlighting, and add a display style for the alert.
- csv_generator/validation_display.py: treat ALERT_SUSPICIOUS_7TM as a warning to be injected into validation output.
- validator/oligomer.py: detect chains with TM_STATUS_UNKNOWN and append a SUSPICIOUS_7TM alert advising that the chains may be non-GPCR soluble ligands and suggesting updating GPCR_SLUG_NEGATIVE_PREFIXES if needed.

This change surfaces potential false-positive GPCR assignments (no TM helices) to reviewers and provides guidance for updating negative prefixes.
tests: add test_suspicious_7tm_format to tests/unit/test_oligomer.py. The new test imports ALERT_SUSPICIOUS_7TM and mocks scan_all_chains_7tm and is_gpcr_slug to simulate an UNKNOWN TM status for a GPCR chain, runs analyze_oligomer, and asserts that a suspicious 7TM alert is produced and its message matches the expected BL3 regex. This verifies alert injection and formatting for unresolved 7TM cases.
Split comma-separated chain_id values for peptide/protein ligands, lookup each chain in api['poly_by_chain'], collect available sequences and join them with ' / ' into lig['Sequence'], and mark the ligand as VALIDATION_MATCHED_POLYMER if any chain is found. Adds unit test test_protein_multi_chain to verify multi-chain behavior.
Expand single-line list/dict literals into multi-line form in tests to improve readability and maintainability. Changes applied to tests/unit/test_ligand_validator.py and tests/unit/test_oligomer.py; no functional behavior was modified.
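
The comma-separated chain handling described in the commits above can be sketched as follows (the dict shape and helper name are hypothetical; the ' / ' join follows the commit message):

```python
def collect_ligand_sequences(
    chain_id: str, poly_by_chain: dict[str, dict]
) -> tuple[str, bool]:
    """Split a comma-separated chain_id, look up each chain, join found sequences."""
    chains = [c.strip() for c in chain_id.split(",") if c.strip()]
    sequences = [
        poly_by_chain[c]["sequence"]
        for c in chains
        if c in poly_by_chain and poly_by_chain[c].get("sequence")
    ]
    # Matched if any chain resolved to a sequence, mirroring the
    # VALIDATION_MATCHED_POLYMER behavior described in the commit.
    return " / ".join(sequences), bool(sequences)


poly = {"B": {"sequence": "MKT"}, "F": {"sequence": "GGS"}}
print(collect_ligand_sequences("B, F", poly))  # ('MKT / GGS', True)
```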
@iskoldt-X iskoldt-X requested a review from Copilot April 4, 2026 23:08
@iskoldt-X iskoldt-X self-assigned this Apr 4, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Introduces an aggregation + validation pipeline backed by enriched PDB metadata, adds supporting CLI/workflow wiring, and expands real-PDB AI-run fixtures to exercise the end-to-end path.

Changes:

  • Added offline validators (receptor identity, ligand validation) plus API client + persistent caching utilities.
  • Added aggregation helpers (enriched loader, ground-truth injector, AI-run loader) and exposed an aggregate CLI subcommand.
  • Updated CI/Docker workflows and substantially expanded real-PDB fixture coverage (multi-run AI outputs).

Reviewed changes

Copilot reviewed 117 out of 141 changed files in this pull request and generated 6 comments.

File Description
tests/fixtures/real_pdbs/ai_results/9O38/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
src/gpcr_tools/validator/receptor_validator.py Add offline receptor identity validation against enriched polymer entities
src/gpcr_tools/validator/ligand_validator.py Add offline ligand validation + chemical identifier enrichment
src/gpcr_tools/validator/cache.py Add persistent JSON caches with atomic writes
src/gpcr_tools/validator/api_clients.py Add UniProt/PubChem/RCSB GraphQL client wrappers
tests/fixtures/real_pdbs/ai_results — src/gpcr_tools/validator/__init__.py Declare validator subpackage
src/gpcr_tools/csv_generator/validation_display.py Update validation display/oligomer alert injection + None-safety improvements
src/gpcr_tools/csv_generator/review_engine.py Replace .get(..., {}) patterns with None-safe idioms
src/gpcr_tools/csv_generator/logic.py Add constants usage + None-safe access in CSV note/oligomer handling
src/gpcr_tools/csv_generator/csv_writer.py Harden CSV transformation against None/missing dicts
src/gpcr_tools/csv_generator/app.py Wire oligomer alerts into the interactive curation path
src/gpcr_tools/aggregator/ground_truth.py Inject authoritative method/resolution/release_date from enriched metadata
src/gpcr_tools/aggregator/enriched_loader.py Load and validate enriched entry JSON structure
src/gpcr_tools/aggregator/ai_results_loader.py Load multi-run AI results and compute pending PDB IDs
src/gpcr_tools/aggregator/__init__.py Declare aggregator subpackage
src/gpcr_tools/__main__.py Add aggregate CLI subcommand + flags
pyproject.toml Bump version and add runtime/dev deps + stricter mypy setting
future_plan.md Document follow-up work items and migration ordering caveats
developing_styles.md Update testing layout/docs incl. real-data strategy
.github/workflows/docker-publish.yml Update Docker publish workflow triggers/action versions
.github/workflows/ci.yml Update CI action versions and artifact upload step


Comment on lines +41 to +42
structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})

Copilot AI, Apr 4, 2026:

setdefault('structure_info', {}) does not guarantee structure_info is a dict (it returns the existing value unchanged). If upstream data contains a non-dict structure_info (e.g., a string/list), the subsequent item assignments will raise at runtime. Recommendation (mandatory): after retrieving structure_info, validate it is a dict and replace it with {} (and/or log) when it is not.

Suggested change:

    # before
    structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})

    # after
    structure_info_raw = best_run_data.get("structure_info")
    if not isinstance(structure_info_raw, dict):
        if structure_info_raw is not None:
            logger.warning(
                "[%s] structure_info is %s, replacing with empty dict before ground truth injection",
                pdb_id,
                type(structure_info_raw).__name__,
            )
        structure_info_raw = {}
        best_run_data["structure_info"] = structure_info_raw
    structure_info: dict[str, Any] = structure_info_raw

Comment on lines +72 to +77
    if method:
        structure_info["method"] = method
    if resolution is not None:
        structure_info["resolution"] = resolution
    if release_date:
        structure_info["release_date"] = release_date
Copilot AI, Apr 4, 2026:

setdefault('structure_info', {}) does not guarantee structure_info is a dict (it returns the existing value unchanged). If upstream data contains a non-dict structure_info (e.g., a string/list), the subsequent item assignments will raise at runtime. Recommendation (mandatory): after retrieving structure_info, validate it is a dict and replace it with {} (and/or log) when it is not.

Comment on lines +17 to +21
from gpcr_tools.aggregator.voting import _first_list_entry

logger = logging.getLogger(__name__)


Copilot AI, Apr 4, 2026:

This module imports _first_list_entry, a private helper (leading underscore) from another module. That creates a fragile coupling: refactors in voting.py can break ground_truth.py without a public API contract. Recommendation: either make the helper public (rename to first_list_entry and export it) or move it into a shared utils module used by both.

Suggested change:

    # before
    from gpcr_tools.aggregator.voting import _first_list_entry

    logger = logging.getLogger(__name__)

    # after
    logger = logging.getLogger(__name__)


    def _first_list_entry(entry: dict[str, Any], key: str) -> Any:
        """Return the first element from a list-valued field, else ``None``."""
        value = entry.get(key)
        if isinstance(value, list) and value:
            return value[0]
        return None

Comment on lines +23 to +27
    def validate_receptor_identity(
        pdb_id: str,
        best_run_data: dict[str, Any],
        enriched_entry: dict[str, Any],
    ) -> list[str]:
Copilot AI, Apr 4, 2026:

pdb_id is currently unused in this function. Recommendation (optional): either remove it from the signature (if not part of a common validator interface), or use it in log messages / warning strings to improve diagnosability.

        encoding="utf-8",
    ) as fd:
        tmp_path = fd.name
        json.dump(data, fd, indent=2)
Copilot AI, Apr 4, 2026:

Atomic os.replace prevents partial-file reads, but without an explicit flush() + os.fsync() on the temp file, a sudden power loss can still leave a zero-length/truncated file that gets renamed into place (depending on OS/filesystem buffering). Recommendation: explicitly flush and fsync the temp file before os.replace to harden cache durability.

Suggested change:

    # before
    json.dump(data, fd, indent=2)

    # after
    json.dump(data, fd, indent=2)
    fd.flush()
    os.fsync(fd.fileno())

iskoldt-X and others added 2 commits April 5, 2026 01:12
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Introduce a local first_list_entry helper that safely returns the first dict element of container[key] or an empty dict if the container/key/value is missing or invalid. Replace the previous _first_list_entry import and update inject_ground_truth to use the new helper for exptl, em_3d_reconstruction and refine entries, improving robustness and removing the external dependency.
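A helper matching that description might look like this sketch (not necessarily the exact implementation):

```python
from typing import Any


def first_list_entry(container: Any, key: str) -> dict[str, Any]:
    """Return the first dict element of container[key], or {} when the
    container, key, or value is missing or has the wrong shape."""
    if not isinstance(container, dict):
        return {}
    value = container.get(key)
    if isinstance(value, list) and value and isinstance(value[0], dict):
        return value[0]
    return {}


enriched = {"exptl": [{"method": "ELECTRON MICROSCOPY"}]}
print(first_list_entry(enriched, "exptl"))   # {'method': 'ELECTRON MICROSCOPY'}
print(first_list_entry(enriched, "refine"))  # {}
```

Returning an empty dict rather than None lets callers chain .get() lookups without extra guards.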
@iskoldt-X iskoldt-X requested a review from Copilot April 5, 2026 13:04
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 117 out of 141 changed files in this pull request and generated 3 comments.



        logger.error("[%s] best_run_data is not a dict; cannot inject ground truth", pdb_id)
        return

    structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})
Copilot AI, Apr 5, 2026:

setdefault('structure_info', {}) does not guarantee a dict: if best_run_data['structure_info'] exists but is not a dict (e.g., None or a string), setdefault returns the existing non-dict value and later assignments will fail. Prefer normalizing with a type check (replace non-dicts with {}) before writing method/resolution/release_date.

Suggested change:

    # before
    structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})

    # after
    structure_info_value = best_run_data.get("structure_info")
    if not isinstance(structure_info_value, dict):
        structure_info_value = {}
    best_run_data["structure_info"] = structure_info_value
    structure_info: dict[str, Any] = structure_info_value

Comment thread src/gpcr_tools/validator/api_clients.py Outdated
Comment on lines +30 to +45
    clean_name = entry_name.split(".")[0].lower()
    key = f"uniprot:{clean_name}"

    cached = cache.get(key)
    if cached is not None:
        return cached

    url = f"https://rest.uniprot.org/uniprotkb/{clean_name}.txt"
    for attempt in range(API_MAX_RETRIES):
        try:
            resp = requests.head(url, timeout=5, allow_redirects=True)
            is_valid = resp.status_code == 200
            cache.set(key, is_valid)
            return is_valid
        except (requests.RequestException, OSError) as exc:
            if attempt == API_MAX_RETRIES - 1:
Copilot AI, Apr 5, 2026:

The UniProt existence check builds /uniprotkb/{clean_name}.txt from a uniprot_entry_name slug (e.g., GPCRdb-style ts1r2_human). UniProt REST commonly expects an accession or a UniProtKB identifier (e.g., P63092 or P53_HUMAN), so this is likely to return 404 for valid proteins and poison the cache with false negatives. Consider switching to a UniProt search query endpoint (querying by UniProtKB ID vs accession explicitly) and only caching False when the API definitively indicates 'not found' for the intended identifier type.

Suggested change:

    # before
    clean_name = entry_name.split(".")[0].lower()
    key = f"uniprot:{clean_name}"

    cached = cache.get(key)
    if cached is not None:
        return cached

    url = f"https://rest.uniprot.org/uniprotkb/{clean_name}.txt"
    for attempt in range(API_MAX_RETRIES):
        try:
            resp = requests.head(url, timeout=5, allow_redirects=True)
            is_valid = resp.status_code == 200
            cache.set(key, is_valid)
            return is_valid
        except (requests.RequestException, OSError) as exc:
            if attempt == API_MAX_RETRIES - 1:

    # after
    clean_name = entry_name.split(".")[0].upper()
    key = f"uniprot:{clean_name}"

    cached = cache.get(key)
    if cached is not None:
        return cached

    url = "https://rest.uniprot.org/uniprotkb/search"
    params = {
        "query": f"id:{clean_name}",
        "format": "json",
        "size": 1,
    }
    for attempt in range(API_MAX_RETRIES):
        try:
            resp = requests.get(url, params=params, timeout=5)
            if resp.status_code == 200:
                payload: dict[str, Any] = resp.json()
                is_valid = bool(payload.get("results"))
                cache.set(key, is_valid)
                return is_valid
            if attempt == API_MAX_RETRIES - 1:
                logger.warning(
                    "UniProt API returned status %s for '%s'",
                    resp.status_code,
                    entry_name,
                )
                return None
            time.sleep(1)
        except (requests.RequestException, OSError, ValueError) as exc:
            if attempt == API_MAX_RETRIES - 1:
iskoldt-X and others added 3 commits April 5, 2026 15:09
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Set clean_name to upper-case (matching UniProt ID conventions) while continuing to use a lower-case cache key. This normalizes the local variable for downstream logic without changing existing cache lookup behavior.
@iskoldt-X iskoldt-X requested a review from Copilot April 5, 2026 13:25
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 117 out of 141 changed files in this pull request and generated 1 comment.



    if not found_chain:
        return warnings

    match = ai_uniprot in api_slugs
Copilot AI, Apr 5, 2026:

The receptor chain_id is treated as a single chain when locating the polymer entity (if ai_chain in auth_asym_ids:). Several real fixtures use comma-separated chain IDs (e.g., "B, F"), which will never match auth_asym_ids and causes receptor validation to silently skip (returns no warnings). Parse ai_chain as a list (split on commas, strip whitespace) and treat the receptor as matched if any chain is found; optionally include which chain matched in api_reality/warning text.
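
A minimal sketch of the proposed parsing (names taken from the comment; behavior assumed):

```python
def receptor_chain_matches(ai_chain: str, auth_asym_ids: list[str]) -> list[str]:
    """Parse a possibly comma-separated chain_id and return the chains
    actually present in the polymer entity's auth_asym_ids."""
    chains = [c.strip() for c in ai_chain.split(",") if c.strip()]
    return [c for c in chains if c in auth_asym_ids]


print(receptor_chain_matches("B, F", ["A", "B", "F"]))  # ['B', 'F']
print(receptor_chain_matches("B,F", ["A"]))             # []
```

The receptor would count as matched when the returned list is non-empty, and the matched chains could be echoed in the warning text.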

Parse comma-separated chain_id values and validate each reported chain independently. Collect GPCRdb slugs per chain, aggregate unique slugs for api_reality, and mark UNIPROT_CLASH only for chains whose entity slugs don't include the asserted UniProt. Update warning message to include per-chain clash detail. Add unit tests (TestMultiChain) covering same-entity homodimers, multi-entity matches/clashes, partial clashes, missing chains, and no-space chain_id formats.
@iskoldt-X iskoldt-X requested a review from Copilot April 5, 2026 15:10
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 117 out of 141 changed files in this pull request and generated 1 comment.



Comment on lines +77 to +81
    def validate_and_enrich_ligands(
        pdb_id: str,
        best_run_data: dict[str, Any],
        enriched_entry: dict[str, Any],
    ) -> list[str]:
Copilot AI, Apr 5, 2026:

The pdb_id parameter is unused here as well. Consider either removing it or incorporating it into the GHOST_LIGAND warnings (and any logger output) so that batch validation logs remain self-identifying when multiple PDBs are processed together.

@iskoldt-X iskoldt-X merged commit 30f2202 into protwis:main Apr 5, 2026
5 checks passed