Skip to content

Add development standards and enhance curation architecture#2

Merged
iskoldt-X merged 11 commits intoprotwis:mainfrom
iskoldt-X:main
Apr 1, 2026
Merged

Add development standards and enhance curation architecture#2
iskoldt-X merged 11 commits intoprotwis:mainfrom
iskoldt-X:main

Conversation

@iskoldt-X
Copy link
Copy Markdown
Collaborator

@iskoldt-X iskoldt-X commented Apr 1, 2026

This PR completes the functional migration of the Interactive Curation Suite and extends it with a real-data regression layer. The new repository now reaches parity with the latest curated behavior from the legacy pdb-annotation project while continuing to enforce the architectural rules in developing_styles.md.

Key Features Ported

  • Safety: Corrected the fix_mode bypass, added non-destructive field-level skip behavior, and preserved edited value types through typed coercion.
  • Oligomer Radar: Integrated the unified oligomer_analysis display flow and promoted oligomer-derived issues into validation gating.
  • Science Logic: Implemented multi-chain receptor truncation, orphaned-ligand detection, structure-note enrichment, and label_asym_id mapping.

Real PDB Regression Coverage

This PR also adds a committed real-data regression suite built from audited aggregated PDB fixtures checked into tests/fixtures/real_pdbs/.

Real fixture scope

  • Added 9 real PDB fixtures with repo-local JSON bundles
  • Added 5 voting logs and 9 validation logs
  • No runtime dependence on local workstation paths or external fixture directories

New integration coverage

  • Fixture integrity tests: verify committed real fixtures, expected sidecars, and workspace population
  • Function-level pipeline tests: cover load_pdb_data, inject_oligomer_alerts, transform_for_csv, and append_to_csvs
  • Gating tests: verify app.py accept-all / review / fix-mode availability against real validation and controversy data
  • Review-engine prompt tests: verify selected real-data prompt flows using deterministic monkeypatching

Real behaviors now regression-tested

  • Real DB TRUNCATION via 5G53
  • Real chain_id_override + HALLUCINATION via 9M88
  • Real HETEROMER, MISSED_PROTOMER, INCOMPLETE_7TM, and algo_conflicts
  • Real auxiliary dispatch coverage for:
    • antibodies.csv
    • nanobodies.csv
    • scfv.csv
    • ramp.csv
    • other_aux_proteins.csv
  • Real warning-routing coverage for:
    • UNIPROT_CLASH
    • Fake UniProt
    • Ghost Chain
    • GHOST_LIGAND

Architectural Improvements

  • Decoupling: Kept csv_writer.py UI-free and preserved separation of concerns across loader, review, validation, and CSV layers.
  • Robustness: Batch mode (auto-accept) continues to fail loudly on schema mismatches rather than silently skipping writes.
  • Functional Purity: Preserved return-data-over-mutation patterns in review flows and science transforms.
  • Repo portability: Real PDB tests now run entirely from committed fixtures in this repository.

Follow-up Hardening Included

A few small robustness fixes were added after review feedback:

  • Normalized validation_display.py handling of explicit null validation buckets
  • Hardened extract_validation_entries() against critical_warnings: null / algo_conflicts: null
  • Added underscore-form ghost_ligand fatal matching so upstream-style GHOST_LIGAND warnings are treated consistently

Verification Results

  • Pytest: 307/307 passed
  • New real-data tests: 129 passed
  • Linting: no linter errors in touched files
  • Coverage: expanded from synthetic fixtures to audited real aggregated PDB data across pipeline, gating, and selected interactive review paths

Known Real-Data Gaps

The real-data audit also made the remaining uncovered paths explicit:

  • grk.csv: no audited real fixture currently produces GRK rows
  • CLEAN_ENTRIES via real fixture: not currently triggered by the audited real validation logs and remains uncovered by real-data tests

Copilot's summary

This pull request introduces several improvements and standards updates for the GPCR-annotation-tools codebase, focusing on documentation, CSV schema consistency, defensive coding, and user interface enhancements. The most significant changes include the addition of a comprehensive development standards document, updates to CSV schemas to support new data fields, stricter use of immutable constants, and a refactored approach to displaying oligomer analysis in the UI.

Documentation and Standards

  • Added a new developing_styles.md file detailing development philosophy, code quality, testing, Git/PR standards, defensive coding idioms, and domain-specific conventions to guide contributors and maintain codebase consistency.

CSV Schema and Data Model Updates

  • Updated CSV_SCHEMA in config.py to add label_asym_id columns for structures.csv, ligands.csv, arrestins.csv, and all relevant subunit columns in signaling_partners.csv, ensuring consistent support for asymmetric unit labeling across all data exports. [1] [2]

Defensive Coding and Constants

  • Replaced mutable module-level constants (set, list) with immutable types (frozenset, tuple) for BLACKLISTED_KEYS, AUTO_RESOLVE_KEYS, and TOPLEVEL_BLOCK_KEYS in config.py to prevent accidental mutation and cross-test contamination. [1] [2] [3]
  • Expanded VALIDATION_FATAL_KEYWORDS to include "ghost ligand" for improved validation alert matching.

User Interface and Validation Improvements

  • Replaced legacy heteromer and 7TM completeness panels with a unified display_oligomer_analysis_panel in the main app UI, and introduced inject_oligomer_alerts to centralize validation alert injection for oligomer analysis. This streamlines the user experience and validation logic. [1] [2]
  • Fixed a logic bug by changing if final_data: to if final_data is not None: to correctly distinguish between "user quit" and "user accepted with empty data," following defensive coding best practices.

Other Changes

  • Bumped the package version from 0.1.0 to 0.1.1 in pyproject.toml to reflect schema and standards changes.

These changes collectively improve code reliability, maintainability, and contributor onboarding, while reducing common sources of bugs in collaborative scientific software development.

Introduce developing_styles.md documenting the project's development standards and philosophy. The file defines core principles (one epic at a time, PoC before production, minimal changes, TDD), code quality rules (Python 3.11, ruff, mypy, typing, module architecture and function guidelines), testing standards and organization, Git/PR/commit conventions, domain-specific data-flow and CSV/audit rules, and anti-patterns to avoid. This centralized guidance is intended to standardize contributions, enforce CI requirements, and reduce churn across the GPCR-annotation-tools codebase.
…tecture

Introduce unified oligomer analysis support and safer CSV writing. Key changes:

- Add CSV schema columns (label_asym_id) and update BLACKLISTED_KEYS to use "oligomer_analysis".
- New pure-logic module (csv_generator.logic) for label_asym_id mapping, receptor truncation, orphaned-ligand detection, and structure-note assembly.
- Transform pipeline updated (csv_writer.transform_for_csv) to map label_asym_id, apply truncation, enrich notes, and map subunit/ligand chains.
- CSV writer now performs header migration checks and raises CsvSchemaMismatchError when existing files have outdated headers (new exceptions.py).
- UI: unified display_oligomer_analysis_panel with highlighting and rich diagnostics; validation_display.inject_oligomer_alerts promotes oligomer issues into validation gating.
- Review engine: add coerce_type to preserve value types on edits, add skip actions and guard core blocks, and ensure fix_mode override when user requests review.
- App: show oligomer panel, inject oligomer alerts, and handle CsvSchemaMismatchError during auto-accept flow.
- Tests: add fixture and multiple unit tests covering logic, UI/validation integration, CSV writer behavior and schema mismatch handling.

Overall goal: surface oligomer findings to reviewers, prevent silent CSV column drift, and improve type-safety and review ergonomics.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes migration work for the Interactive Curation Suite by porting legacy fixes and introducing/standardizing the new oligomer_analysis + label_asym_id architecture across CSV generation, UI display, validation gating, and review safety.

Changes:

  • Add a pure-logic transformation layer (logic.py) for label mapping, DB truncation, and note enrichment; wire it into CSV generation.
  • Introduce unified oligomer analysis UI + validation warning injection to gate fix-mode/accept-all appropriately.
  • Add schema-mismatch detection for CSV appends (new exception + header check) and expand unit tests/fixtures to cover the new paths.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/gpcr_tools/csv_generator/logic.py New pure transformation helpers (label mapping, truncation, note assembly).
src/gpcr_tools/csv_generator/csv_writer.py Use new logic layer; emit label_asym_id fields; enforce header/schema validation on append.
src/gpcr_tools/csv_generator/ui.py New unified oligomer analysis panel and highlight logic.
src/gpcr_tools/csv_generator/validation_display.py Inject oligomer-derived warnings into validation_data["critical_warnings"].
src/gpcr_tools/csv_generator/review_engine.py Add coerce_type() and new skip paths; reduce mutation by returning new decision-unit dicts on edit.
src/gpcr_tools/csv_generator/app.py Replace legacy heteromer/7TM panels with unified oligomer panel; handle schema mismatch in auto-accept; audit “decline write”.
src/gpcr_tools/csv_generator/exceptions.py New CsvSchemaMismatchError.
src/gpcr_tools/config.py Add new schema columns + blacklist key update (oligomer_analysis).
tests/* New fixtures + unit tests for NE1/NE2/NE3 logic and CSV schema behavior.
developing_styles.md New development standards and architectural guidance.
pyproject.toml Version bump to 0.1.1.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/gpcr_tools/csv_generator/review_engine.py
Comment thread src/gpcr_tools/csv_generator/review_engine.py Outdated
Comment thread src/gpcr_tools/csv_generator/exceptions.py
Comment thread tests/unit/test_csv_writer.py Outdated
Comment thread developing_styles.md
Improve robustness across CSV generator and validation code. Key changes:
- Make several config collections immutable (frozenset/tuple) and fix typo in validation keywords.
- Pre-flight CSV schema validation to avoid partial writes; append_to_csvs now validates all files before opening for append and updates exception payloads.
- Enhance CsvSchemaMismatchError to include detailed differences (missing/extra/order) and store expected/found fields.
- Catch CsvSchemaMismatchError in CLI flow and show a clear panel, marking the PDB as failed if write aborts.
- Tighten type coercion and JSON parsing in review_engine to return parsed data only when types match and otherwise fall back safely.
- Multiple defensive fixes using `.get(...) or {}` or `.get(... ) or ""` to avoid KeyError/TypeError in logic, ui and validation_display; improve oligomer/chain handling and sequence display.
- Improve validation warning matching and index extraction to correctly associate warnings with list entries.
- Update unit test name to expect an error when headers mismatch.
These changes reduce crash/partial-write risk, provide clearer error messaging for schema mismatches, and make display/validation logic more resilient to missing or unexpected data shapes.
Expand developing_styles.md with a new Defensive Coding Idioms section that documents common low-level bugs and safe patterns for processing external/AI-generated JSON: None-safety (.get() vs or), nested access guards, cross-module string constants, bool vs int dispatch, json.loads type checks, truthiness vs identity, immutable module-level constants, and multi-file write atomicity. Clarify csv_writer.py is UI-free but performs data transformation and CSV file appending (was previously described as a pure function module). Add an "exception handling symmetry" rule to require handling new custom exceptions at all call sites. Update the PR checklist to include None-safety, string-constant reuse, exception symmetry, and truthiness checks.
Apply non-functional style and readability changes across modules.

- src/gpcr_tools/config.py: Reformat BLACKLISTED_KEYS and AUTO_RESOLVE_KEYS to use multi-line frozenset(...) with consistent braces/spacing.
- src/gpcr_tools/csv_generator/exceptions.py: Simplify construction of the CsvSchemaMismatchError detail string into a single f-string.
- src/gpcr_tools/csv_generator/validation_display.py: Wrap long conditional expressions with parentheses for clarity and adjust slice spacing to match style guidelines.

These edits are purely stylistic and do not change runtime behavior.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/gpcr_tools/csv_generator/csv_writer.py Outdated
Comment thread src/gpcr_tools/csv_generator/logic.py Outdated
iskoldt-X and others added 4 commits April 1, 2026 20:25
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Remove redundant parentheses around the label_map expression in transform_for_csv (src/gpcr_tools/csv_generator/csv_writer.py). This is a minor style cleanup with no functional change.
Add a set of committed real-PDB JSON fixtures (including aggregated, logs, and validation_logs) and new integration tests that exercise them. Update tests/conftest.py to introduce REAL_PDB_* constants, a helper to copy fixtures into a temporary workspace, and a real_pdb_workspace pytest fixture that sets GPCR_WORKSPACE and prepares output/state directories. Update .gitignore to globally ignore *.json while allowing committed fixtures under tests/fixtures/**/*.json. These changes enable deterministic integration tests against real PDB fixtures.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 43 out of 44 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

src/gpcr_tools/csv_generator/review_engine.py:660

  • Top-level block action s currently breaks out of the loop without assigning final_data[key], which effectively removes the block from the returned data. That makes s destructive (and differs from the non-destructive s behavior added at leaf/block review levels). If the intent is “skip reviewing but keep current data”, set final_data[key] = current_block (or the original block) before breaking; if the intent is deletion, consider labeling it explicitly as delete to avoid confusion.
                if action == "q":
                    return None
                if action == "s":
                    core_blocks = {"receptor_info", "ligands", "signaling_partners"}
                    if key in core_blocks and not Confirm.ask(
                        f"[bold red]'{key}' is a core block. "
                        f"Skipping will leave its CSV fields empty. Continue?[/]",
                        default=False,
                    ):
                        continue
                    console.print(f"[yellow]Skipping/Deleting block '{key}'[/yellow]")
                    log_audit_trail(
                        pdb_id,
                        key,
                        "skip_block",
                        create_display_copy(current_block),
                        "SKIPPED",
                    )
                    break

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/gpcr_tools/csv_generator/validation_display.py Outdated
Comment on lines 117 to 123
"text": warn_str,
"path": path_match.group(1) if path_match else None,
"bucket": bucket,
"is_hallucination": "HALLUCINATION ALERT" in warn_str.upper(),
"is_hallucination": "HALLUCINATION" in warn_str.upper(),
}
)
return entries
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_validation_entries() can still raise TypeError if validation_data["critical_warnings"] or validation_data["algo_conflicts"] is explicitly None in the loaded JSON, because the function iterates validation_data.get(bucket, []) (default only applies when key is missing). Consider iterating over (validation_data.get(bucket) or []) to be robust to nulls.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot apply changes based on this feedback

Comment thread src/gpcr_tools/config.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@iskoldt-X iskoldt-X marked this pull request as ready for review April 1, 2026 20:29
Tolerate None validation buckets by iterating over validation_data.get(bucket) or [] in extract_validation_entries, preventing crashes when buckets are null. Add 'ghost_ligand' to VALIDATION_FATAL_KEYWORDS so entries with an underscore are treated as fatal. Update/add unit tests to cover null buckets for extract_validation_entries and inject_oligomer_alerts, and to assert GHOST_LIGAND is handled as a fatal validation.
@iskoldt-X iskoldt-X merged commit 63645f2 into protwis:main Apr 1, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants