Add aggregation & validation pipeline with oligomer analysis #3

iskoldt-X merged 21 commits into protwis:main
Conversation
Expand the tests documentation to add a Real Data Testing Strategy and update the test layout. The changes introduce a tests/fixtures/real_pdbs/ structure (with {pdb_id}.json, logs, validation_logs, ai_results, enriched) and add multiple integration tests (test_real_pdb_*.py). The new section defines the "live chain" goal (run pipeline stages live; AI annotation outputs remain the only permanent static breakpoint), explains progressive retirement/migration of fixtures, intermediate fixture rules, fixture selection criteria to ensure coverage, and recommended test organization. Purpose: avoid format drift from stale fixtures and guide staged migration of pipeline stages so downstream tests consume live outputs.
Add future_plan.md describing an improvement to type handling in interactive review. Notes that review_engine.py's coerce_type currently falls back to strings on failed coercion, which can cause type degradation in CSV outputs. Proposes refactoring the UI to validate user input for strict types (booleans/numbers), reject invalid entries with a clear validation error, and prompt users to retry instead of silently accepting string fallbacks.
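The proposed strict-validation behavior could look like the following minimal sketch. The names (`coerce_strict`, `ValidationError`) are illustrative assumptions, not the actual `review_engine.py` API; the point is that invalid input for a strict type raises instead of silently degrading to a string:

```python
class ValidationError(ValueError):
    """Raised when user input cannot be coerced to the expected type."""


def coerce_strict(raw: str, expected: type) -> object:
    # Hypothetical replacement for the string-fallback coercion: booleans and
    # numbers either parse cleanly or raise, so the UI can prompt a retry.
    text = raw.strip()
    if expected is bool:
        lowered = text.lower()
        if lowered in {"true", "yes", "y", "1"}:
            return True
        if lowered in {"false", "no", "n", "0"}:
            return False
        raise ValidationError(f"expected a boolean, got {raw!r}")
    if expected in (int, float):
        try:
            return expected(text)
        except ValueError:
            raise ValidationError(f"expected {expected.__name__}, got {raw!r}") from None
    return text  # strings pass through unchanged
```

The interactive loop would catch `ValidationError`, show the message, and re-prompt rather than writing a stringly-typed value into the CSV.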
Update future_plan.md to add a new section describing a validation-ordering issue in pdb-annotation/aggregate_results.py. Documents that validate_receptor_identity() runs before analyze_oligomer()'s chain override, explains the misleading validation reports this can produce, and why the migration preserves the original ordering. Adds an action plan for a follow-up epic: move the chain-override earlier, re-run receptor validation after the override, add regression tests for AI-hallucination scenarios, and check for other validators depending on pre-override state. Also fixes a trailing-newline/formatting issue around the earlier UI validation note.
Clarify the canonical PDB list size (currently 9 entries) and add a "Mandatory Data Density" requirement: each PDB in the canonical set must include a complete set of 10 AI annotation runs (run_01.json through run_10.json) under tests/fixtures/real_pdbs/ai_results/{PDB_ID}/. This ensures robust majority-voting and discrepancy-detection in tests while keeping AI outputs as the sole permanent static fixture.
Introduce a new gpcr_tools.aggregator subpackage (runner, voting, ai_results_loader, enriched_loader, ground_truth) and wire an "aggregate" CLI command into __main__.py. Implements majority-voting, run scoring/selection, discrepancy detection, ground-truth injection, chimera integration, atomic output writes, and aggregate logging; uses caches for sequence/validation lookups. Add validator scaffolding and many tests/fixtures for real-PDB aggregation flows. Update config with voting/validation constants and add requests and typing stubs to pyproject; also tighten mypy by enabling import checks.
Document the new multi-run aggregation and algorithmic validation pipeline: added a Pipeline Overview, CLI examples for the `aggregate` command (and note about `-it`), and a text diagram of the aggregate→curate flow. Updated workspace layout to show `ai_results/` (10 runs per PDB), `aggregated/` logging/validation outputs, persistent caches, and new state files (`aggregate_log.json`). Expanded System Architecture with module-level layouts for aggregator, validator, csv_generator, and supporting modules, plus Design Principles and test-suite details. Small wording/alert changes to clarify provenance, validation reports, and CI type-safety config.
Bump GitHub Actions in .github/workflows/ci.yml: upgrade actions/checkout from v4 to v6 and actions/setup-python from v5 to v6 in lint, type-check and test jobs; upgrade actions/upload-artifact from v4 to v7 for coverage upload. No behavioral changes to job steps or Python versions; this refreshes action versions for CI reliability.
Change workflow tag filter from "v*" to "*" so the Docker publish workflow runs for any pushed tag instead of only tags starting with "v". This allows publishing images for tags that don't use the "v" prefix.
Replace occurrences of /path/to/gpcr_workspace with ~/gpcr_workspace in README usage snippets. Updated docker -v mounts for init-workspace, aggregate, curate (including targeting a PDB entry) and the export GPCR_WORKSPACE line to recommend the home-directory path for convenience.
Introduce API_MAX_RETRIES in config and add simple retry/backoff for UniProt calls: check_uniprot_existence (HEAD) and get_sequence_from_uniprot (FASTA) now retry up to the configured attempts with a 1s sleep and only log on the final failure. Also import time and the new constant where needed. Add a tqdm progress bar to aggregate_all to display progress when processing pending PDB IDs.
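The retry behavior described here (fixed 1 s sleep, only the final failure surfaces) can be factored as a generic helper. This is a sketch, not the shipped code; the actual implementation lives inline in the UniProt client functions and reads `API_MAX_RETRIES` from `config`:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T], retries: int = 3, delay: float = 1.0
) -> Optional[T]:
    """Call fn, retrying on network-style errors with a fixed sleep.

    Returns None after the final failure so the caller can log once,
    mirroring the 'only log on the final failure' behavior above.
    """
    for attempt in range(retries):
        try:
            return fn()
        except OSError:  # requests.RequestException subclasses OSError
            if attempt == retries - 1:
                return None
            time.sleep(delay)
    return None
```

Wrapping the `requests.head(...)` / FASTA fetch in such a helper would keep the backoff policy in one place.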
Update GitHub Action versions in .github/workflows/docker-publish.yml to newer releases for compatibility and fixes: actions/checkout v4→v6, docker/login-action v3→v4, docker/setup-qemu-action v3→v4, docker/setup-buildx-action v3→v4, docker/metadata-action v5→v6, and docker/build-push-action v5→v7.
Introduce ALERT_SUSPICIOUS_7TM and mark chains with no detected transmembrane helices as suspicious. - config.py: add ALERT_SUSPICIOUS_7TM and extend GPCR_SLUG_NEGATIVE_PREFIXES with glycoprotein hormone prefixes (glha, fshb, lhb, tshb, cgb). - csv_generator/ui.py: import new alert, include it in the set that triggers oligomer highlighting, and add a display style for the alert. - csv_generator/validation_display.py: treat ALERT_SUSPICIOUS_7TM as a warning to be injected into validation output. - validator/oligomer.py: detect chains with TM_STATUS_UNKNOWN and append a SUSPICIOUS_7TM alert advising that the chains may be non-GPCR soluble ligands and suggesting updating GPCR_SLUG_NEGATIVE_PREFIXES if needed. This change surfaces potential false-positive GPCR assignments (no TM helices) to reviewers and provides guidance for updating negative prefixes.
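The alert-injection step can be illustrated with this simplified sketch. The constant values and the flat chain-to-status mapping are assumptions for illustration; the real structures live in `config.py` and `validator/oligomer.py`:

```python
# Illustrative constants; actual values are defined in config.py.
ALERT_SUSPICIOUS_7TM = "SUSPICIOUS_7TM"
TM_STATUS_UNKNOWN = "UNKNOWN"


def flag_suspicious_chains(tm_status_by_chain: dict[str, str]) -> list[str]:
    """Return alerts for GPCR-assigned chains with no detected TM helices."""
    suspect = sorted(
        chain for chain, status in tm_status_by_chain.items()
        if status == TM_STATUS_UNKNOWN
    )
    if not suspect:
        return []
    return [
        f"{ALERT_SUSPICIOUS_7TM}: chains {', '.join(suspect)} have no detected "
        "TM helices; they may be non-GPCR soluble ligands (consider updating "
        "GPCR_SLUG_NEGATIVE_PREFIXES)."
    ]
```

The returned alert string is what the UI styles and the validation display injects as a warning.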
tests: add test_suspicious_7tm_format to tests/unit/test_oligomer.py. The new test imports ALERT_SUSPICIOUS_7TM and mocks scan_all_chains_7tm and is_gpcr_slug to simulate an UNKNOWN TM status for a GPCR chain, runs analyze_oligomer, and asserts that a suspicious 7TM alert is produced and its message matches the expected BL3 regex. This verifies alert injection and formatting for unresolved 7TM cases.
Split comma-separated chain_id values for peptide/protein ligands, lookup each chain in api['poly_by_chain'], collect available sequences and join them with ' / ' into lig['Sequence'], and mark the ligand as VALIDATION_MATCHED_POLYMER if any chain is found. Adds unit test test_protein_multi_chain to verify multi-chain behavior.
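A compact sketch of the multi-chain sequence lookup; the helper name and the flat `poly_by_chain` mapping are simplifications of the validator's actual data shapes:

```python
def collect_ligand_sequence(
    chain_id: str, poly_by_chain: dict[str, str]
) -> tuple[str, bool]:
    """Join sequences for a comma-separated chain_id such as "B, F".

    Returns (joined_sequence, matched), where matched mirrors the
    VALIDATION_MATCHED_POLYMER flag: True if any listed chain was found.
    """
    chains = [c.strip() for c in chain_id.split(",") if c.strip()]
    sequences = [poly_by_chain[c] for c in chains if c in poly_by_chain]
    return " / ".join(sequences), bool(sequences)
```

Chains absent from the polymer map are simply skipped, so a partial match still yields a sequence and a positive validation flag.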
Expand single-line list/dict literals into multi-line form in tests to improve readability and maintainability. Changes applied to tests/unit/test_ligand_validator.py and tests/unit/test_oligomer.py; no functional behavior was modified.
Pull request overview
Introduces an aggregation + validation pipeline backed by enriched PDB metadata, adds supporting CLI/workflow wiring, and expands real-PDB AI-run fixtures to exercise the end-to-end path.
Changes:
- Added offline validators (receptor identity, ligand validation) plus API client + persistent caching utilities.
- Added aggregation helpers (enriched loader, ground-truth injector, AI-run loader) and exposed an `aggregate` CLI subcommand.
- Updated CI/Docker workflows and substantially expanded real-PDB fixture coverage (multi-run AI outputs).
Reviewed changes
Copilot reviewed 117 out of 141 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/fixtures/real_pdbs/ai_results/9O38/run_1.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9O38/run_2.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9O38/run_3.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9O38/run_4.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9O38/run_5.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9O38/run_6.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9O38/run_7.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9O38/run_8.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9O38/run_10.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_1.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_2.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_3.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_4.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_5.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_6.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_7.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_8.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_9.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9NOR/run_10.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_1.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_2.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_3.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_4.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_5.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_6.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_7.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_8.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_9.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9IQS/run_10.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_1.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_2.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_3.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_4.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_5.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_6.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_7.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_8.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_9.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9BLW/run_10.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9AS1/run_2.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9AS1/run_3.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9AS1/run_4.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9AS1/run_5.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9AS1/run_8.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9AS1/run_9.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/9AS1/run_10.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_1.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_2.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_3.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_4.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_5.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_6.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_7.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_8.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_9.json | Add real-PDB AI run fixture for aggregation/validation tests |
| tests/fixtures/real_pdbs/ai_results/8TII/run_10.json | Add real-PDB AI run fixture for aggregation/validation tests |
| src/gpcr_tools/validator/receptor_validator.py | Add offline receptor identity validation against enriched polymer entities |
| src/gpcr_tools/validator/ligand_validator.py | Add offline ligand validation + chemical identifier enrichment |
| src/gpcr_tools/validator/cache.py | Add persistent JSON caches with atomic writes |
| src/gpcr_tools/validator/api_clients.py | Add UniProt/PubChem/RCSB GraphQL client wrappers |
| src/gpcr_tools/validator/__init__.py | Declare validator subpackage |
| src/gpcr_tools/csv_generator/validation_display.py | Update validation display/oligomer alert injection + None-safety improvements |
| src/gpcr_tools/csv_generator/review_engine.py | Replace .get(..., {}) patterns with None-safe idioms |
| src/gpcr_tools/csv_generator/logic.py | Add constants usage + None-safe access in CSV note/oligomer handling |
| src/gpcr_tools/csv_generator/csv_writer.py | Harden CSV transformation against None/missing dicts |
| src/gpcr_tools/csv_generator/app.py | Wire oligomer alerts into the interactive curation path |
| src/gpcr_tools/aggregator/ground_truth.py | Inject authoritative method/resolution/release_date from enriched metadata |
| src/gpcr_tools/aggregator/enriched_loader.py | Load and validate enriched entry JSON structure |
| src/gpcr_tools/aggregator/ai_results_loader.py | Load multi-run AI results and compute pending PDB IDs |
| src/gpcr_tools/aggregator/__init__.py | Declare aggregator subpackage |
| src/gpcr_tools/__main__.py | Add aggregate CLI subcommand + flags |
| pyproject.toml | Bump version and add runtime/dev deps + stricter mypy setting |
| future_plan.md | Document follow-up work items and migration ordering caveats |
| developing_styles.md | Update testing layout/docs incl. real-data strategy |
| .github/workflows/docker-publish.yml | Update Docker publish workflow triggers/action versions |
| .github/workflows/ci.yml | Update CI action versions and artifact upload step |
```python
structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})
```
setdefault('structure_info', {}) does not guarantee structure_info is a dict (it returns the existing value unchanged). If upstream data contains a non-dict structure_info (e.g., a string/list), the subsequent item assignments will raise at runtime. Recommendation (mandatory): after retrieving structure_info, validate it is a dict and replace it with {} (and/or log) when it is not.
Suggested change:

```python
structure_info_raw = best_run_data.get("structure_info")
if not isinstance(structure_info_raw, dict):
    if structure_info_raw is not None:
        logger.warning(
            "[%s] structure_info is %s, replacing with empty dict before ground truth injection",
            pdb_id,
            type(structure_info_raw).__name__,
        )
    structure_info_raw = {}
    best_run_data["structure_info"] = structure_info_raw
structure_info: dict[str, Any] = structure_info_raw
```
```python
if method:
    structure_info["method"] = method
if resolution is not None:
    structure_info["resolution"] = resolution
if release_date:
    structure_info["release_date"] = release_date
```
```python
from gpcr_tools.aggregator.voting import _first_list_entry

logger = logging.getLogger(__name__)
```
This module imports _first_list_entry, a private helper (leading underscore) from another module. That creates a fragile coupling: refactors in voting.py can break ground_truth.py without a public API contract. Recommendation: either make the helper public (rename to first_list_entry and export it) or move it into a shared utils module used by both.
Suggested change:

```python
logger = logging.getLogger(__name__)


def _first_list_entry(entry: dict[str, Any], key: str) -> Any:
    """Return the first element from a list-valued field, else ``None``."""
    value = entry.get(key)
    if isinstance(value, list) and value:
        return value[0]
    return None
```
```python
def validate_receptor_identity(
    pdb_id: str,
    best_run_data: dict[str, Any],
    enriched_entry: dict[str, Any],
) -> list[str]:
```
pdb_id is currently unused in this function. Recommendation (optional): either remove it from the signature (if not part of a common validator interface), or use it in log messages / warning strings to improve diagnosability.
```python
    encoding="utf-8",
) as fd:
    tmp_path = fd.name
    json.dump(data, fd, indent=2)
```
Atomic os.replace prevents partial-file reads, but without an explicit flush() + os.fsync() on the temp file, a sudden power loss can still leave a zero-length/truncated file that gets renamed into place (depending on OS/filesystem buffering). Recommendation: explicitly flush and fsync the temp file before os.replace to harden cache durability.
Suggested change:

```python
json.dump(data, fd, indent=2)
fd.flush()
os.fsync(fd.fileno())
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Introduce a local first_list_entry helper that safely returns the first dict element of container[key] or an empty dict if the container/key/value is missing or invalid. Replace the previous _first_list_entry import and update inject_ground_truth to use the new helper for exptl, em_3d_reconstruction and refine entries, improving robustness and removing the external dependency.
Pull request overview
Copilot reviewed 117 out of 141 changed files in this pull request and generated 3 comments.
```python
    logger.error("[%s] best_run_data is not a dict; cannot inject ground truth", pdb_id)
    return

structure_info: dict[str, Any] = best_run_data.setdefault("structure_info", {})
```
setdefault('structure_info', {}) does not guarantee a dict: if best_run_data['structure_info'] exists but is not a dict (e.g., None or a string), setdefault returns the existing non-dict value and later assignments will fail. Prefer normalizing with a type check (replace non-dicts with {}) before writing method/resolution/release_date.
Suggested change:

```python
structure_info_value = best_run_data.get("structure_info")
if not isinstance(structure_info_value, dict):
    structure_info_value = {}
    best_run_data["structure_info"] = structure_info_value
structure_info: dict[str, Any] = structure_info_value
```
```python
clean_name = entry_name.split(".")[0].lower()
key = f"uniprot:{clean_name}"

cached = cache.get(key)
if cached is not None:
    return cached

url = f"https://rest.uniprot.org/uniprotkb/{clean_name}.txt"
for attempt in range(API_MAX_RETRIES):
    try:
        resp = requests.head(url, timeout=5, allow_redirects=True)
        is_valid = resp.status_code == 200
        cache.set(key, is_valid)
        return is_valid
    except (requests.RequestException, OSError) as exc:
        if attempt == API_MAX_RETRIES - 1:
```
The UniProt existence check builds /uniprotkb/{clean_name}.txt from a uniprot_entry_name slug (e.g., GPCRdb-style ts1r2_human). UniProt REST commonly expects an accession or a UniProtKB identifier (e.g., P63092 or P53_HUMAN), so this is likely to return 404 for valid proteins and poison the cache with false negatives. Consider switching to a UniProt search query endpoint (querying by UniProtKB ID vs accession explicitly) and only caching False when the API definitively indicates 'not found' for the intended identifier type.
Suggested change:

```python
clean_name = entry_name.split(".")[0].upper()
key = f"uniprot:{clean_name}"

cached = cache.get(key)
if cached is not None:
    return cached

url = "https://rest.uniprot.org/uniprotkb/search"
params = {
    "query": f"id:{clean_name}",
    "format": "json",
    "size": 1,
}
for attempt in range(API_MAX_RETRIES):
    try:
        resp = requests.get(url, params=params, timeout=5)
        if resp.status_code == 200:
            payload: dict[str, Any] = resp.json()
            is_valid = bool(payload.get("results"))
            cache.set(key, is_valid)
            return is_valid
        if attempt == API_MAX_RETRIES - 1:
            logger.warning(
                "UniProt API returned status %s for '%s'",
                resp.status_code,
                entry_name,
            )
            return None
        time.sleep(1)
    except (requests.RequestException, OSError, ValueError) as exc:
        if attempt == API_MAX_RETRIES - 1:
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Set clean_name to upper-case (matching UniProt ID conventions) while continuing to use a lower-case cache key. This normalizes the local variable for downstream logic without changing existing cache lookup behavior.
Pull request overview
Copilot reviewed 117 out of 141 changed files in this pull request and generated 1 comment.
```python
if not found_chain:
    return warnings

match = ai_uniprot in api_slugs
```
The receptor chain_id is treated as a single chain when locating the polymer entity (`if ai_chain in auth_asym_ids:`). Several real fixtures use comma-separated chain IDs (e.g., "B, F"), which will never match auth_asym_ids and causes receptor validation to silently skip (returns no warnings). Parse ai_chain as a list (split on commas, strip whitespace) and treat the receptor as matched if any chain is found; optionally include which chain matched in api_reality/warning text.
Parse comma-separated chain_id values and validate each reported chain independently. Collect GPCRdb slugs per chain, aggregate unique slugs for api_reality, and mark UNIPROT_CLASH only for chains whose entity slugs don't include the asserted UniProt. Update warning message to include per-chain clash detail. Add unit tests (TestMultiChain) covering same-entity homodimers, multi-entity matches/clashes, partial clashes, missing chains, and no-space chain_id formats.
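Under simplified data-shape assumptions (a flat mapping from chain ID to the set of GPCRdb slugs of its polymer entity), the per-chain validation described in this commit might look like:

```python
def match_receptor_chains(
    ai_chain: str,
    entity_slugs_by_chain: dict[str, set[str]],
    ai_uniprot: str,
) -> tuple[set[str], list[str]]:
    """Validate each comma-separated reported chain independently.

    Returns (all observed slugs for api_reality, chains whose entity slugs
    do not include the asserted UniProt, i.e. UNIPROT_CLASH candidates).
    """
    observed: set[str] = set()
    clashes: list[str] = []
    for chain in (c.strip() for c in ai_chain.split(",") if c.strip()):
        slugs = entity_slugs_by_chain.get(chain)
        if not slugs:
            continue  # missing chains are reported separately
        observed |= slugs
        if ai_uniprot not in slugs:
            clashes.append(chain)
    return observed, clashes
```

This handles both "B, F" and "B,F" formats, same-entity homodimers (no clash), and partial clashes where only some chains disagree with the asserted receptor.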
Pull request overview
Copilot reviewed 117 out of 141 changed files in this pull request and generated 1 comment.
```python
def validate_and_enrich_ligands(
    pdb_id: str,
    best_run_data: dict[str, Any],
    enriched_entry: dict[str, Any],
) -> list[str]:
```
The pdb_id parameter is unused here as well. Consider either removing it or incorporating it into the GHOST_LIGAND warnings (and any logger output) so that batch validation logs remain self-identifying when multiple PDBs are processed together.
Summary
- `aggregator/` package — loads AI annotation runs per PDB, selects the best by score, applies majority-vote consensus with per-field controversy detection, and injects ground truth from enriched PDB metadata
- `validator/` package — chimera detection, receptor identity validation (UniProt API cross-check), ligand PDB-CCD validation, oligomer analysis (classification, 7TM completeness, chain-ID override, primary protomer suggestion), structural integrity checker, API clients, and persistent JSON cache
- `aggregate` subcommand with `--force` and `--skip-api-checks` flags
- `csv_generator/` — replaced all `.get(key, {})` anti-patterns with None-safe idioms, centralized 20+ magic strings into `config.py` constants

Architecture highlights

- `tempfile` + `os.replace` + `try/finally` cleanup for atomic writes
- `deepcopy()` boundary in runner before passing data to validators
- None-safe access: `(data.get(key) or {}).get(child)` — no `.get(key, {})` anywhere
- `config.py` uses `frozenset`, `tuple`, `MappingProxyType` for module-level constants

Stats
Test plan
- `ruff check src/ tests/` — all checks passed
- `ruff format --check src/ tests/` — all files formatted
- `mypy src/` — no issues (with `ignore_missing_imports = false`)
- `pytest tests/ -v` — 685 passed
- `gpcr-tools aggregate <workspace>` on a real workspace with 10 AI runs per PDB

🤖 Generated with Claude Code