Add development standards and enhance curation architecture by iskoldt-X · Pull Request #2 · protwis/GPCR-annotation-tools

iskoldt-X · 2026-04-01T17:31:18Z

This PR completes the functional migration of the Interactive Curation Suite and extends it with a real-data regression layer. The new repository now reaches parity with the latest curated behavior from the legacy pdb-annotation project while continuing to enforce the architectural rules in developing_styles.md.

Key Features Ported

Safety: Corrected the fix_mode bypass, added non-destructive field-level skip behavior, and preserved edited value types through typed coercion.
Oligomer Radar: Integrated the unified oligomer_analysis display flow and promoted oligomer-derived issues into validation gating.
Science Logic: Implemented multi-chain receptor truncation, orphaned-ligand detection, structure-note enrichment, and label_asym_id mapping.

Real PDB Regression Coverage

This PR also adds a committed real-data regression suite built from audited aggregated PDB fixtures checked into tests/fixtures/real_pdbs/.

Real fixture scope

Added 9 real PDB fixtures with repo-local JSON bundles
Added 5 voting logs and 9 validation logs
No runtime dependence on local workstation paths or external fixture directories

New integration coverage

Fixture integrity tests: verify committed real fixtures, expected sidecars, and workspace population
Function-level pipeline tests: cover load_pdb_data, inject_oligomer_alerts, transform_for_csv, and append_to_csvs
Gating tests: verify app.py accept-all / review / fix-mode availability against real validation and controversy data
Review-engine prompt tests: verify selected real-data prompt flows using deterministic monkeypatching

Real behaviors now regression-tested

Real DB TRUNCATION via 5G53
Real chain_id_override + HALLUCINATION via 9M88
Real HETEROMER, MISSED_PROTOMER, INCOMPLETE_7TM, and algo_conflicts
Real auxiliary dispatch coverage for:
- antibodies.csv
- nanobodies.csv
- scfv.csv
- ramp.csv
- other_aux_proteins.csv
Real warning-routing coverage for:
- UNIPROT_CLASH
- Fake UniProt
- Ghost Chain
- GHOST_LIGAND

Architectural Improvements

Decoupling: Kept csv_writer.py UI-free and preserved separation of concerns across loader, review, validation, and CSV layers.
Robustness: Batch mode (auto-accept) continues to fail loudly on schema mismatches rather than silently skipping writes.
Functional Purity: Preserved return-data-over-mutation patterns in review flows and science transforms.
Repo portability: Real PDB tests now run entirely from committed fixtures in this repository.

Follow-up Hardening Included

A few small robustness fixes were added after review feedback:

Normalized validation_display.py handling of explicit null validation buckets
Hardened extract_validation_entries() against critical_warnings: null / algo_conflicts: null
Added underscore-form ghost_ligand fatal matching so upstream-style GHOST_LIGAND warnings are treated consistently

Verification Results

Pytest: 307/307 passed
New real-data tests: 129 passed
Linting: no linter errors in touched files
Coverage: expanded from synthetic fixtures to audited real aggregated PDB data across pipeline, gating, and selected interactive review paths

Known Real-Data Gaps

The real-data audit also made the remaining uncovered paths explicit:

grk.csv: no audited real fixture currently produces GRK rows
CLEAN_ENTRIES via real fixture: not currently triggered by the audited real validation logs and remains uncovered by real-data tests

Copilot's summary

This pull request introduces several improvements and standards updates for the GPCR-annotation-tools codebase, focusing on documentation, CSV schema consistency, defensive coding, and user interface enhancements. The most significant changes include the addition of a comprehensive development standards document, updates to CSV schemas to support new data fields, stricter use of immutable constants, and a refactored approach to displaying oligomer analysis in the UI.

Documentation and Standards

Added a new developing_styles.md file detailing development philosophy, code quality, testing, Git/PR standards, defensive coding idioms, and domain-specific conventions to guide contributors and maintain codebase consistency.

CSV Schema and Data Model Updates

Updated CSV_SCHEMA in config.py to add label_asym_id columns for structures.csv, ligands.csv, arrestins.csv, and all relevant subunit columns in signaling_partners.csv, ensuring consistent support for asymmetric unit labeling across all data exports. [1] [2]

Defensive Coding and Constants

Replaced mutable module-level constants (set, list) with immutable types (frozenset, tuple) for BLACKLISTED_KEYS, AUTO_RESOLVE_KEYS, and TOPLEVEL_BLOCK_KEYS in config.py to prevent accidental mutation and cross-test contamination. [1] [2] [3]
Expanded VALIDATION_FATAL_KEYWORDS to include "ghost ligand" for improved validation alert matching.

User Interface and Validation Improvements

Replaced legacy heteromer and 7TM completeness panels with a unified display_oligomer_analysis_panel in the main app UI, and introduced inject_oligomer_alerts to centralize validation alert injection for oligomer analysis. This streamlines the user experience and validation logic. [1] [2]
Fixed a logic bug by changing if final_data: to if final_data is not None: to correctly distinguish between "user quit" and "user accepted with empty data," following defensive coding best practices.

Other Changes

Bumped the package version from 0.1.0 to 0.1.1 in pyproject.toml to reflect schema and standards changes.

These changes collectively improve code reliability, maintainability, and contributor onboarding, while reducing common sources of bugs in collaborative scientific software development.

Introduce developing_styles.md documenting the project's development standards and philosophy. The file defines core principles (one epic at a time, PoC before production, minimal changes, TDD), code quality rules (Python 3.11, ruff, mypy, typing, module architecture and function guidelines), testing standards and organization, Git/PR/commit conventions, domain-specific data-flow and CSV/audit rules, and anti-patterns to avoid. This centralized guidance is intended to standardize contributions, enforce CI requirements, and reduce churn across the GPCR-annotation-tools codebase.

…tecture Introduce unified oligomer analysis support and safer CSV writing. Key changes: - Add CSV schema columns (label_asym_id) and update BLACKLISTED_KEYS to use "oligomer_analysis". - New pure-logic module (csv_generator.logic) for label_asym_id mapping, receptor truncation, orphaned-ligand detection, and structure-note assembly. - Transform pipeline updated (csv_writer.transform_for_csv) to map label_asym_id, apply truncation, enrich notes, and map subunit/ligand chains. - CSV writer now performs header migration checks and raises CsvSchemaMismatchError when existing files have outdated headers (new exceptions.py). - UI: unified display_oligomer_analysis_panel with highlighting and rich diagnostics; validation_display.inject_oligomer_alerts promotes oligomer issues into validation gating. - Review engine: add coerce_type to preserve value types on edits, add skip actions and guard core blocks, and ensure fix_mode override when user requests review. - App: show oligomer panel, inject oligomer alerts, and handle CsvSchemaMismatchError during auto-accept flow. - Tests: add fixture and multiple unit tests covering logic, UI/validation integration, CSV writer behavior and schema mismatch handling. Overall goal: surface oligomer findings to reviewers, prevent silent CSV column drift, and improve type-safety and review ergonomics.

Copilot

Pull request overview

This PR completes migration work for the Interactive Curation Suite by porting legacy fixes and introducing/standardizing the new oligomer_analysis + label_asym_id architecture across CSV generation, UI display, validation gating, and review safety.

Changes:

Add a pure-logic transformation layer (logic.py) for label mapping, DB truncation, and note enrichment; wire it into CSV generation.
Introduce unified oligomer analysis UI + validation warning injection to gate fix-mode/accept-all appropriately.
Add schema-mismatch detection for CSV appends (new exception + header check) and expand unit tests/fixtures to cover the new paths.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
`src/gpcr_tools/csv_generator/logic.py`	New pure transformation helpers (label mapping, truncation, note assembly).
`src/gpcr_tools/csv_generator/csv_writer.py`	Use new logic layer; emit `label_asym_id` fields; enforce header/schema validation on append.
`src/gpcr_tools/csv_generator/ui.py`	New unified oligomer analysis panel and highlight logic.
`src/gpcr_tools/csv_generator/validation_display.py`	Inject oligomer-derived warnings into `validation_data["critical_warnings"]`.
`src/gpcr_tools/csv_generator/review_engine.py`	Add `coerce_type()` and new skip paths; reduce mutation by returning new decision-unit dicts on edit.
`src/gpcr_tools/csv_generator/app.py`	Replace legacy heteromer/7TM panels with unified oligomer panel; handle schema mismatch in auto-accept; audit “decline write”.
`src/gpcr_tools/csv_generator/exceptions.py`	New `CsvSchemaMismatchError`.
`src/gpcr_tools/config.py`	Add new schema columns + blacklist key update (`oligomer_analysis`).
`tests/*`	New fixtures + unit tests for NE1/NE2/NE3 logic and CSV schema behavior.
`developing_styles.md`	New development standards and architectural guidance.
`pyproject.toml`	Version bump to `0.1.1`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Improve robustness across CSV generator and validation code. Key changes: - Make several config collections immutable (frozenset/tuple) and fix typo in validation keywords. - Pre-flight CSV schema validation to avoid partial writes; append_to_csvs now validates all files before opening for append and updates exception payloads. - Enhance CsvSchemaMismatchError to include detailed differences (missing/extra/order) and store expected/found fields. - Catch CsvSchemaMismatchError in CLI flow and show a clear panel, marking the PDB as failed if write aborts. - Tighten type coercion and JSON parsing in review_engine to return parsed data only when types match and otherwise fall back safely. - Multiple defensive fixes using `.get(...) or {}` or `.get(... ) or ""` to avoid KeyError/TypeError in logic, ui and validation_display; improve oligomer/chain handling and sequence display. - Improve validation warning matching and index extraction to correctly associate warnings with list entries. - Update unit test name to expect an error when headers mismatch. These changes reduce crash/partial-write risk, provide clearer error messaging for schema mismatches, and make display/validation logic more resilient to missing or unexpected data shapes.

Expand developing_styles.md with a new Defensive Coding Idioms section that documents common low-level bugs and safe patterns for processing external/AI-generated JSON: None-safety (.get() vs or), nested access guards, cross-module string constants, bool vs int dispatch, json.loads type checks, truthiness vs identity, immutable module-level constants, and multi-file write atomicity. Clarify csv_writer.py is UI-free but performs data transformation and CSV file appending (was previously described as a pure function module). Add an "exception handling symmetry" rule to require handling new custom exceptions at all call sites. Update the PR checklist to include None-safety, string-constant reuse, exception symmetry, and truthiness checks.

Apply non-functional style and readability changes across modules. - src/gpcr_tools/config.py: Reformat BLACKLISTED_KEYS and AUTO_RESOLVE_KEYS to use multi-line frozenset(...) with consistent braces/spacing. - src/gpcr_tools/csv_generator/exceptions.py: Simplify construction of the CsvSchemaMismatchError detail string into a single f-string. - src/gpcr_tools/csv_generator/validation_display.py: Wrap long conditional expressions with parentheses for clarity and adjust slice spacing to match style guidelines. These edits are purely stylistic and do not change runtime behavior.

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Remove redundant parentheses around the label_map expression in transform_for_csv (src/gpcr_tools/csv_generator/csv_writer.py). This is a minor style cleanup with no functional change.

Add a set of committed real-PDB JSON fixtures (including aggregated, logs, and validation_logs) and new integration tests that exercise them. Update tests/conftest.py to introduce REAL_PDB_* constants, a helper to copy fixtures into a temporary workspace, and a real_pdb_workspace pytest fixture that sets GPCR_WORKSPACE and prepares output/state directories. Update .gitignore to globally ignore *.json while allowing committed fixtures under tests/fixtures/**/*.json. These changes enable deterministic integration tests against real PDB fixtures.

Copilot

Pull request overview

Copilot reviewed 43 out of 44 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/gpcr_tools/csv_generator/review_engine.py:660

Top-level block action s currently breaks out of the loop without assigning final_data[key], which effectively removes the block from the returned data. That makes s destructive (and differs from the non-destructive s behavior added at leaf/block review levels). If the intent is “skip reviewing but keep current data”, set final_data[key] = current_block (or the original block) before breaking; if the intent is deletion, consider labeling it explicitly as delete to avoid confusion.

                if action == "q":
                    return None
                if action == "s":
                    core_blocks = {"receptor_info", "ligands", "signaling_partners"}
                    if key in core_blocks and not Confirm.ask(
                        f"[bold red]'{key}' is a core block. "
                        f"Skipping will leave its CSV fields empty. Continue?[/]",
                        default=False,
                    ):
                        continue
                    console.print(f"[yellow]Skipping/Deleting block '{key}'[/yellow]")
                    log_audit_trail(
                        pdb_id,
                        key,
                        "skip_block",
                        create_display_copy(current_block),
                        "SKIPPED",
                    )
                    break

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-01T20:20:01Z

                    "text": warn_str,
                    "path": path_match.group(1) if path_match else None,
                    "bucket": bucket,
-                    "is_hallucination": "HALLUCINATION ALERT" in warn_str.upper(),
+                    "is_hallucination": "HALLUCINATION" in warn_str.upper(),
                }
            )
    return entries


extract_validation_entries() can still raise TypeError if validation_data["critical_warnings"] or validation_data["algo_conflicts"] is explicitly None in the loaded JSON, because the function iterates validation_data.get(bucket, []) (default only applies when key is missing). Consider iterating over (validation_data.get(bucket) or []) to be robust to nulls.

@copilot apply changes based on this feedback

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Tolerate None validation buckets by iterating over validation_data.get(bucket) or [] in extract_validation_entries, preventing crashes when buckets are null. Add 'ghost_ligand' to VALIDATION_FATAL_KEYWORDS so entries with an underscore are treated as fatal. Update/add unit tests to cover null buckets for extract_validation_entries and inject_oligomer_alerts, and to assert GHOST_LIGAND is handled as a fatal validation.

iskoldt-X added 2 commits April 1, 2026 19:14

iskoldt-X requested a review from Copilot April 1, 2026 17:31

Copilot started reviewing on behalf of iskoldt-X April 1, 2026 17:31 View session

Copilot AI reviewed Apr 1, 2026

View reviewed changes

Comment thread src/gpcr_tools/csv_generator/review_engine.py

Comment thread src/gpcr_tools/csv_generator/review_engine.py Outdated

Comment thread src/gpcr_tools/csv_generator/exceptions.py

Comment thread tests/unit/test_csv_writer.py Outdated

Comment thread developing_styles.md

iskoldt-X added 3 commits April 1, 2026 20:11

iskoldt-X requested a review from Copilot April 1, 2026 18:16

Copilot started reviewing on behalf of iskoldt-X April 1, 2026 18:17 View session

Copilot AI reviewed Apr 1, 2026

View reviewed changes

Comment thread src/gpcr_tools/csv_generator/csv_writer.py Outdated

Comment thread src/gpcr_tools/csv_generator/logic.py Outdated

iskoldt-X and others added 4 commits April 1, 2026 20:25

Update src/gpcr_tools/csv_generator/csv_writer.py

6d4d5cd

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/gpcr_tools/csv_generator/logic.py

af4cb21

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Simplify label_map assignment in csv_writer

6079ed2

Remove redundant parentheses around the label_map expression in transform_for_csv (src/gpcr_tools/csv_generator/csv_writer.py). This is a minor style cleanup with no functional change.

iskoldt-X requested a review from Copilot April 1, 2026 20:14

Copilot started reviewing on behalf of iskoldt-X April 1, 2026 20:15 View session

Copilot AI reviewed Apr 1, 2026

View reviewed changes

Update src/gpcr_tools/csv_generator/validation_display.py

9f48311

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

iskoldt-X marked this pull request as ready for review April 1, 2026 20:29

iskoldt-X merged commit 63645f2 into protwis:main Apr 1, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add development standards and enhance curation architecture#2

Add development standards and enhance curation architecture#2
iskoldt-X merged 11 commits intoprotwis:mainfrom
iskoldt-X:main

iskoldt-X commented Apr 1, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI Apr 1, 2026

Uh oh!

iskoldt-X Apr 1, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iskoldt-X commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key Features Ported

Real PDB Regression Coverage

Real fixture scope

New integration coverage

Real behaviors now regression-tested

Architectural Improvements

Follow-up Hardening Included

Verification Results

Known Real-Data Gaps

Copilot's summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

iskoldt-X Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

iskoldt-X commented Apr 1, 2026 •

edited

Loading