Add aggregation & validation pipeline with oligomer analysis #3

Merged
iskoldt-X merged 21 commits into protwis:main from iskoldt-X:main on Apr 5, 2026

Conversation

@iskoldt-X (Collaborator)

Summary

  • New aggregator/ package — loads AI annotation runs per PDB, selects the best by score, applies majority-vote consensus with per-field controversy detection, and injects ground truth from enriched PDB metadata
  • New validator/ package — chimera detection, receptor identity validation (UniProt API cross-check), ligand PDB-CCD validation, oligomer analysis (classification, 7TM completeness, chain-ID override, primary protomer suggestion), structural integrity checker, API clients, and persistent JSON cache
  • CLI aggregate subcommand with --force and --skip-api-checks flags
  • Hardened csv_generator/ — replaced all .get(key, {}) anti-patterns with None-safe idioms, centralized 20+ magic strings into config.py constants
  • Updated README — full pipeline documentation with architecture diagram, workspace layout, and design principles
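
As an illustration of the consensus step, here is a minimal majority-vote-with-controversy sketch (the function name and quorum threshold are assumptions for illustration, not the aggregator's actual API):

```python
from collections import Counter
from typing import Any


def vote_field(values: list[Any], quorum: float = 0.5) -> tuple[Any, bool]:
    """Return (winner, controversial) for one field across AI runs.

    The winner is the most common non-missing value; it is flagged
    controversial when it does not exceed the quorum fraction of votes.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None, False
    winner, n_votes = Counter(present).most_common(1)[0]
    return winner, n_votes / len(present) <= quorum


# Example: 10 runs voting on a receptor slug; 7/9 agreement is not controversial.
runs = ["oprm_human"] * 7 + ["oprd_human"] * 2 + [None]
print(vote_field(runs))  # ('oprm_human', False)
```

Missing values are excluded from the denominator so a run that omits a field does not dilute the vote.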

Architecture highlights

  • Atomic writes: tempfile + os.replace + try/finally cleanup
  • Mutation isolation: deepcopy() boundary in the runner before passing data to validators
  • None-safety: (data.get(key) or {}).get(child) — no .get(key, {}) anywhere
  • No magic strings: all cross-module strings defined as constants in config.py
  • Immutability: frozenset, tuple, and MappingProxyType for module-level constants
  • Error isolation: each PDB wrapped in try/except; failures are logged and never crash the batch
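
The atomic-write principle above can be sketched as a small helper (function name and paths are illustrative; this mirrors the described tempfile + os.replace + try/finally pattern, with the fsync hardening raised in review):

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON so readers only ever see the old file or the complete new one."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd = tempfile.NamedTemporaryFile(
        mode="w", dir=dir_name, suffix=".tmp", delete=False, encoding="utf-8"
    )
    tmp_path = fd.name
    try:
        json.dump(data, fd, indent=2)
        fd.flush()
        os.fsync(fd.fileno())  # harden against power loss before the rename
        fd.close()
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    finally:
        if not fd.closed:
            fd.close()
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up if anything failed before the rename
```

The temp file is created in the destination directory so the final os.replace never crosses a filesystem boundary.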

Stats

  • ~2,650 lines of new production code across 13 new modules
  • ~3,400 lines of new tests across 11 new test files
  • 685 tests passing (unit + integration + 9 canonical real-PDB fixtures)
  • Zero ruff, mypy, or formatting issues

Test plan

  • ruff check src/ tests/ — all checks passed
  • ruff format --check src/ tests/ — all files formatted
  • mypy src/ — no issues (with ignore_missing_imports = false)
  • pytest tests/ -v — 685 passed
  • Smoke-test gpcr-tools aggregate <workspace> on a real workspace with 10 AI runs per PDB

🤖 Generated with Claude Code

iskoldt-X added 15 commits April 2, 2026 23:34
Expand the tests documentation to add a Real Data Testing Strategy and update the test layout. The changes introduce a tests/fixtures/real_pdbs/ structure (with {pdb_id}.json, logs, validation_logs, ai_results, enriched) and add multiple integration tests (test_real_pdb_*.py). The new section defines the "live chain" goal (run pipeline stages live; AI annotation outputs remain the only permanent static breakpoint), explains progressive retirement/migration of fixtures, intermediate fixture rules, fixture selection criteria to ensure coverage, and recommended test organization. Purpose: avoid format drift from stale fixtures and guide staged migration of pipeline stages so downstream tests consume live outputs.
Add future_plan.md describing an improvement to type handling in interactive review. Notes that review_engine.py's coerce_type currently falls back to strings on failed coercion, which can cause type degradation in CSV outputs. Proposes refactoring the UI to validate user input for strict types (booleans/numbers), reject invalid entries with a clear validation error, and prompt users to retry instead of silently accepting string fallbacks.
Update future_plan.md to add a new section describing a validation-ordering issue in pdb-annotation/aggregate_results.py. Documents that validate_receptor_identity() runs before analyze_oligomer()'s chain override, explains the misleading validation reports this can produce, and why the migration preserves the original ordering. Adds an action plan for a follow-up epic: move the chain-override earlier, re-run receptor validation after the override, add regression tests for AI-hallucination scenarios, and check for other validators depending on pre-override state. Also fixes a trailing-newline/formatting issue around the earlier UI validation note.
Clarify the canonical PDB list size (currently 9 entries) and add a "Mandatory Data Density" requirement: each PDB in the canonical set must include a complete set of 10 AI annotation runs (run_01.json through run_10.json) under tests/fixtures/real_pdbs/ai_results/{PDB_ID}/. This ensures robust majority-voting and discrepancy-detection in tests while keeping AI outputs as the sole permanent static fixture.
Introduce a new gpcr_tools.aggregator subpackage (runner, voting, ai_results_loader, enriched_loader, ground_truth) and wire an "aggregate" CLI command into __main__.py. Implements majority-voting, run scoring/selection, discrepancy detection, ground-truth injection, chimera integration, atomic output writes, and aggregate logging; uses caches for sequence/validation lookups. Add validator scaffolding and many tests/fixtures for real-PDB aggregation flows. Update config with voting/validation constants and add requests and typing stubs to pyproject; also tighten mypy by enabling import checks.
Document the new multi-run aggregation and algorithmic validation pipeline: added a Pipeline Overview, CLI examples for the `aggregate` command (and note about `-it`), and a text diagram of the aggregate→curate flow. Updated workspace layout to show `ai_results/` (10 runs per PDB), `aggregated/` logging/validation outputs, persistent caches, and new state files (`aggregate_log.json`). Expanded System Architecture with module-level layouts for aggregator, validator, csv_generator, and supporting modules, plus Design Principles and test-suite details. Small wording/alert changes to clarify provenance, validation reports, and CI type-safety config.
Bump GitHub Actions in .github/workflows/ci.yml: upgrade actions/checkout from v4 to v6 and actions/setup-python from v5 to v6 in lint, type-check and test jobs; upgrade actions/upload-artifact from v4 to v7 for coverage upload. No behavioral changes to job steps or Python versions; this refreshes action versions for CI reliability.
Change workflow tag filter from "v*" to "*" so the Docker publish workflow runs for any pushed tag instead of only tags starting with "v". This allows publishing images for tags that don't use the "v" prefix.
Replace occurrences of /path/to/gpcr_workspace with ~/gpcr_workspace in README usage snippets. Updated docker -v mounts for init-workspace, aggregate, curate (including targeting a PDB entry) and the export GPCR_WORKSPACE line to recommend the home-directory path for convenience.
Introduce API_MAX_RETRIES in config and add simple retry/backoff for UniProt calls: check_uniprot_existence (HEAD) and get_sequence_from_uniprot (FASTA) now retry up to the configured attempts with a 1s sleep and only log on the final failure. Also import time and the new constant where needed. Add a tqdm progress bar to aggregate_all to display progress when processing pending PDB IDs.
Update GitHub Action versions in .github/workflows/docker-publish.yml to newer releases for compatibility and fixes: actions/checkout v4→v6, docker/login-action v3→v4, docker/setup-qemu-action v3→v4, docker/setup-buildx-action v3→v4, docker/metadata-action v5→v6, and docker/build-push-action v5→v7.
Introduce ALERT_SUSPICIOUS_7TM and mark chains with no detected transmembrane helices as suspicious.

- config.py: add ALERT_SUSPICIOUS_7TM and extend GPCR_SLUG_NEGATIVE_PREFIXES with glycoprotein hormone prefixes (glha, fshb, lhb, tshb, cgb).
- csv_generator/ui.py: import new alert, include it in the set that triggers oligomer highlighting, and add a display style for the alert.
- csv_generator/validation_display.py: treat ALERT_SUSPICIOUS_7TM as a warning to be injected into validation output.
- validator/oligomer.py: detect chains with TM_STATUS_UNKNOWN and append a SUSPICIOUS_7TM alert advising that the chains may be non-GPCR soluble ligands and suggesting updating GPCR_SLUG_NEGATIVE_PREFIXES if needed.

This change surfaces potential false-positive GPCR assignments (no TM helices) to reviewers and provides guidance for updating negative prefixes.
tests: add test_suspicious_7tm_format to tests/unit/test_oligomer.py. The new test imports ALERT_SUSPICIOUS_7TM and mocks scan_all_chains_7tm and is_gpcr_slug to simulate an UNKNOWN TM status for a GPCR chain, runs analyze_oligomer, and asserts that a suspicious 7TM alert is produced and its message matches the expected BL3 regex. This verifies alert injection and formatting for unresolved 7TM cases.
Split comma-separated chain_id values for peptide/protein ligands, lookup each chain in api['poly_by_chain'], collect available sequences and join them with ' / ' into lig['Sequence'], and mark the ligand as VALIDATION_MATCHED_POLYMER if any chain is found. Adds unit test test_protein_multi_chain to verify multi-chain behavior.
Expand single-line list/dict literals into multi-line form in tests to improve readability and maintainability. Changes applied to tests/unit/test_ligand_validator.py and tests/unit/test_oligomer.py; no functional behavior was modified.
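
The comma-separated chain handling described in the commits above can be sketched as follows (the dict shape and helper name are hypothetical; the ' / ' join follows the commit message):

```python
def collect_ligand_sequences(
    chain_id: str, poly_by_chain: dict[str, dict]
) -> tuple[str, bool]:
    """Split a comma-separated chain_id, look up each chain, join found sequences."""
    chains = [c.strip() for c in chain_id.split(",") if c.strip()]
    sequences = [
        poly_by_chain[c]["sequence"]
        for c in chains
        if c in poly_by_chain and poly_by_chain[c].get("sequence")
    ]
    # Matched if any chain resolved to a sequence, mirroring the
    # VALIDATION_MATCHED_POLYMER behavior described in the commit.
    return " / ".join(sequences), bool(sequences)


poly = {"B": {"sequence": "MKT"}, "F": {"sequence": "GGS"}}
print(collect_ligand_sequences("B, F", poly))  # ('MKT / GGS', True)
```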
@iskoldt-X iskoldt-X requested a review from Copilot April 4, 2026 23:08
@iskoldt-X iskoldt-X self-assigned this Apr 4, 2026
Copilot AI (Contributor) left a comment

Pull request overview

Introduces an aggregation + validation pipeline backed by enriched PDB metadata, adds supporting CLI/workflow wiring, and expands real-PDB AI-run fixtures to exercise the end-to-end path.

Changes:

  • Added offline validators (receptor identity, ligand validation) plus API client + persistent caching utilities.
  • Added aggregation helpers (enriched loader, ground-truth injector, AI-run loader) and exposed an aggregate CLI subcommand.
  • Updated CI/Docker workflows and substantially expanded real-PDB fixture coverage (multi-run AI outputs).

Reviewed changes

Copilot reviewed 117 out of 141 changed files in this pull request and generated 6 comments.

File Description
tests/fixtures/real_pdbs/ai_results/9O38/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9O38/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9NOR/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9IQS/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9BLW/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/9AS1/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_1.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_2.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_3.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_4.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_5.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_6.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_7.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_8.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_9.json Add real-PDB AI run fixture for aggregation/validation tests
tests/fixtures/real_pdbs/ai_results/8TII/run_10.json Add real-PDB AI run fixture for aggregation/validation tests
src/gpcr_tools/validator/receptor_validator.py Add offline receptor identity validation against enriched polymer entities
src/gpcr_tools/validator/ligand_validator.py Add offline ligand validation + chemical identifier enrichment
src/gpcr_tools/validator/cache.py Add persistent JSON caches with atomic writes
src/gpcr_tools/validator/api_clients.py Add UniProt/PubChem/RCSB GraphQL client wrappers
tests/fixtures/real_pdbs/ai_results — src/gpcr_tools/validator/__init__.py Declare validator subpackage
src/gpcr_tools/csv_generator/validation_display.py Update validation display/oligomer alert injection + None-safety improvements
src/gpcr_tools/csv_generator/review_engine.py Replace .get(..., {}) patterns with None-safe idioms
src/gpcr_tools/csv_generator/logic.py Add constants usage + None-safe access in CSV note/oligomer handling
src/gpcr_tools/csv_generator/csv_writer.py Harden CSV transformation against None/missing dicts
src/gpcr_tools/csv_generator/app.py Wire oligomer alerts into the interactive curation path
src/gpcr_tools/aggregator/ground_truth.py Inject authoritative method/resolution/release_date from enriched metadata
src/gpcr_tools/aggregator/enriched_loader.py Load and validate enriched entry JSON structure
src/gpcr_tools/aggregator/ai_results_loader.py Load multi-run AI results and compute pending PDB IDs
src/gpcr_tools/aggregator/__init__.py Declare aggregator subpackage
src/gpcr_tools/__main__.py Add aggregate CLI subcommand + flags
pyproject.toml Bump version and add runtime/dev deps + stricter mypy setting
future_plan.md Document follow-up work items and migration ordering caveats
developing_styles.md Update testing layout/docs incl. real-data strategy
.github/workflows/docker-publish.yml Update Docker publish workflow triggers/action versions
.github/workflows/ci.yml Update CI action versions and artifact upload step


Comment on lines +41 to +42
structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})

Copilot AI, Apr 4, 2026:

setdefault('structure_info', {}) does not guarantee structure_info is a dict (it returns the existing value unchanged). If upstream data contains a non-dict structure_info (e.g., a string/list), the subsequent item assignments will raise at runtime. Recommendation (mandatory): after retrieving structure_info, validate it is a dict and replace it with {} (and/or log) when it is not.

Suggested change:

    # before
    structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})

    # after
    structure_info_raw = best_run_data.get("structure_info")
    if not isinstance(structure_info_raw, dict):
        if structure_info_raw is not None:
            logger.warning(
                "[%s] structure_info is %s, replacing with empty dict before ground truth injection",
                pdb_id,
                type(structure_info_raw).__name__,
            )
        structure_info_raw = {}
        best_run_data["structure_info"] = structure_info_raw
    structure_info: dict[str, Any] = structure_info_raw

Comment on lines +72 to +77
    if method:
        structure_info["method"] = method
    if resolution is not None:
        structure_info["resolution"] = resolution
    if release_date:
        structure_info["release_date"] = release_date
Copilot AI, Apr 4, 2026:

setdefault('structure_info', {}) does not guarantee structure_info is a dict (it returns the existing value unchanged). If upstream data contains a non-dict structure_info (e.g., a string/list), the subsequent item assignments will raise at runtime. Recommendation (mandatory): after retrieving structure_info, validate it is a dict and replace it with {} (and/or log) when it is not.

Comment on lines +17 to +21
from gpcr_tools.aggregator.voting import _first_list_entry

logger = logging.getLogger(__name__)


Copilot AI, Apr 4, 2026:

This module imports _first_list_entry, a private helper (leading underscore) from another module. That creates a fragile coupling: refactors in voting.py can break ground_truth.py without a public API contract. Recommendation: either make the helper public (rename to first_list_entry and export it) or move it into a shared utils module used by both.

Suggested change:

    # before
    from gpcr_tools.aggregator.voting import _first_list_entry

    logger = logging.getLogger(__name__)

    # after
    logger = logging.getLogger(__name__)


    def _first_list_entry(entry: dict[str, Any], key: str) -> Any:
        """Return the first element from a list-valued field, else ``None``."""
        value = entry.get(key)
        if isinstance(value, list) and value:
            return value[0]
        return None

Comment on lines +23 to +27
    def validate_receptor_identity(
        pdb_id: str,
        best_run_data: dict[str, Any],
        enriched_entry: dict[str, Any],
    ) -> list[str]:
Copilot AI, Apr 4, 2026:

pdb_id is currently unused in this function. Recommendation (optional): either remove it from the signature (if not part of a common validator interface), or use it in log messages / warning strings to improve diagnosability.

        encoding="utf-8",
    ) as fd:
        tmp_path = fd.name
        json.dump(data, fd, indent=2)
Copilot AI, Apr 4, 2026:

Atomic os.replace prevents partial-file reads, but without an explicit flush() + os.fsync() on the temp file, a sudden power loss can still leave a zero-length/truncated file that gets renamed into place (depending on OS/filesystem buffering). Recommendation: explicitly flush and fsync the temp file before os.replace to harden cache durability.

Suggested change:

    # before
    json.dump(data, fd, indent=2)

    # after
    json.dump(data, fd, indent=2)
    fd.flush()
    os.fsync(fd.fileno())

iskoldt-X and others added 2 commits April 5, 2026 01:12
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Introduce a local first_list_entry helper that safely returns the first dict element of container[key] or an empty dict if the container/key/value is missing or invalid. Replace the previous _first_list_entry import and update inject_ground_truth to use the new helper for exptl, em_3d_reconstruction and refine entries, improving robustness and removing the external dependency.
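A helper matching that description might look like this sketch (not necessarily the exact implementation):

```python
from typing import Any


def first_list_entry(container: Any, key: str) -> dict[str, Any]:
    """Return the first dict element of container[key], or {} when the
    container, key, or value is missing or has the wrong shape."""
    if not isinstance(container, dict):
        return {}
    value = container.get(key)
    if isinstance(value, list) and value and isinstance(value[0], dict):
        return value[0]
    return {}


enriched = {"exptl": [{"method": "ELECTRON MICROSCOPY"}]}
print(first_list_entry(enriched, "exptl"))   # {'method': 'ELECTRON MICROSCOPY'}
print(first_list_entry(enriched, "refine"))  # {}
```

Returning an empty dict rather than None lets callers chain .get() lookups without extra guards.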
@iskoldt-X iskoldt-X requested a review from Copilot April 5, 2026 13:04
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 117 out of 141 changed files in this pull request and generated 3 comments.



        logger.error("[%s] best_run_data is not a dict; cannot inject ground truth", pdb_id)
        return

    structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})
Copilot AI, Apr 5, 2026:

setdefault('structure_info', {}) does not guarantee a dict: if best_run_data['structure_info'] exists but is not a dict (e.g., None or a string), setdefault returns the existing non-dict value and later assignments will fail. Prefer normalizing with a type check (replace non-dicts with {}) before writing method/resolution/release_date.

Suggested change:

    # before
    structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})

    # after
    structure_info_value = best_run_data.get("structure_info")
    if not isinstance(structure_info_value, dict):
        structure_info_value = {}
    best_run_data["structure_info"] = structure_info_value
    structure_info: dict[str, Any] = structure_info_value

Comment thread src/gpcr_tools/validator/api_clients.py Outdated
Comment on lines +30 to +45
    clean_name = entry_name.split(".")[0].lower()
    key = f"uniprot:{clean_name}"

    cached = cache.get(key)
    if cached is not None:
        return cached

    url = f"https://rest.uniprot.org/uniprotkb/{clean_name}.txt"
    for attempt in range(API_MAX_RETRIES):
        try:
            resp = requests.head(url, timeout=5, allow_redirects=True)
            is_valid = resp.status_code == 200
            cache.set(key, is_valid)
            return is_valid
        except (requests.RequestException, OSError) as exc:
            if attempt == API_MAX_RETRIES - 1:
Copilot AI, Apr 5, 2026:

The UniProt existence check builds /uniprotkb/{clean_name}.txt from a uniprot_entry_name slug (e.g., GPCRdb-style ts1r2_human). UniProt REST commonly expects an accession or a UniProtKB identifier (e.g., P63092 or P53_HUMAN), so this is likely to return 404 for valid proteins and poison the cache with false negatives. Consider switching to a UniProt search query endpoint (querying by UniProtKB ID vs accession explicitly) and only caching False when the API definitively indicates 'not found' for the intended identifier type.

Suggested change:

    # before
    clean_name = entry_name.split(".")[0].lower()
    key = f"uniprot:{clean_name}"

    cached = cache.get(key)
    if cached is not None:
        return cached

    url = f"https://rest.uniprot.org/uniprotkb/{clean_name}.txt"
    for attempt in range(API_MAX_RETRIES):
        try:
            resp = requests.head(url, timeout=5, allow_redirects=True)
            is_valid = resp.status_code == 200
            cache.set(key, is_valid)
            return is_valid
        except (requests.RequestException, OSError) as exc:
            if attempt == API_MAX_RETRIES - 1:

    # after
    clean_name = entry_name.split(".")[0].upper()
    key = f"uniprot:{clean_name}"

    cached = cache.get(key)
    if cached is not None:
        return cached

    url = "https://rest.uniprot.org/uniprotkb/search"
    params = {
        "query": f"id:{clean_name}",
        "format": "json",
        "size": 1,
    }
    for attempt in range(API_MAX_RETRIES):
        try:
            resp = requests.get(url, params=params, timeout=5)
            if resp.status_code == 200:
                payload: dict[str, Any] = resp.json()
                is_valid = bool(payload.get("results"))
                cache.set(key, is_valid)
                return is_valid
            if attempt == API_MAX_RETRIES - 1:
                logger.warning(
                    "UniProt API returned status %s for '%s'",
                    resp.status_code,
                    entry_name,
                )
                return None
            time.sleep(1)
        except (requests.RequestException, OSError, ValueError) as exc:
            if attempt == API_MAX_RETRIES - 1:
iskoldt-X and others added 3 commits April 5, 2026 15:09
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Set clean_name to upper-case (matching UniProt ID conventions) while continuing to use a lower-case cache key. This normalizes the local variable for downstream logic without changing existing cache lookup behavior.
@iskoldt-X iskoldt-X requested a review from Copilot April 5, 2026 13:25
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 117 out of 141 changed files in this pull request and generated 1 comment.



    if not found_chain:
        return warnings

    match = ai_uniprot in api_slugs
Copilot AI, Apr 5, 2026:

The receptor chain_id is treated as a single chain when locating the polymer entity (if ai_chain in auth_asym_ids:). Several real fixtures use comma-separated chain IDs (e.g., "B, F"), which will never match auth_asym_ids and causes receptor validation to silently skip (returns no warnings). Parse ai_chain as a list (split on commas, strip whitespace) and treat the receptor as matched if any chain is found; optionally include which chain matched in api_reality/warning text.
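
A minimal sketch of the proposed parsing (names taken from the comment; behavior assumed):

```python
def receptor_chain_matches(ai_chain: str, auth_asym_ids: list[str]) -> list[str]:
    """Parse a possibly comma-separated chain_id and return the chains
    actually present in the polymer entity's auth_asym_ids."""
    chains = [c.strip() for c in ai_chain.split(",") if c.strip()]
    return [c for c in chains if c in auth_asym_ids]


print(receptor_chain_matches("B, F", ["A", "B", "F"]))  # ['B', 'F']
print(receptor_chain_matches("B,F", ["A"]))             # []
```

The receptor would count as matched when the returned list is non-empty, and the matched chains could be echoed in the warning text.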

Parse comma-separated chain_id values and validate each reported chain independently. Collect GPCRdb slugs per chain, aggregate unique slugs for api_reality, and mark UNIPROT_CLASH only for chains whose entity slugs don't include the asserted UniProt. Update warning message to include per-chain clash detail. Add unit tests (TestMultiChain) covering same-entity homodimers, multi-entity matches/clashes, partial clashes, missing chains, and no-space chain_id formats.
@iskoldt-X iskoldt-X requested a review from Copilot April 5, 2026 15:10
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 117 out of 141 changed files in this pull request and generated 1 comment.



Comment on lines +77 to +81
    def validate_and_enrich_ligands(
        pdb_id: str,
        best_run_data: dict[str, Any],
        enriched_entry: dict[str, Any],
    ) -> list[str]:
Copilot AI, Apr 5, 2026:

The pdb_id parameter is unused here as well. Consider either removing it or incorporating it into the GHOST_LIGAND warnings (and any logger output) so that batch validation logs remain self-identifying when multiple PDBs are processed together.

@iskoldt-X iskoldt-X merged commit 30f2202 into protwis:main Apr 5, 2026
5 checks passed