feat: widen chain IDs from u8 to String for multi-char CIF support #44
Open
adobles96 wants to merge 1 commit into steineggerlab:master from
Large cryo-EM structures (e.g. PDB 9A1O) have multi-character chain IDs
like "10" in auth_asym_id. Previously these were truncated to a single
byte. This commit widens chain storage to String throughout the pipeline:
- AtomVector, Structure, CompactStructure chain fields: u8 -> String
- ResidueMatch type: (u8, u64) -> (String, u64)
- CIF parser: replace get_one_char with get_text_string (preserves full chain name)
- PDB/FCZ parsers: convert at boundary, Atom struct unchanged for FFI compat
- Query format: underscore separator (e.g. "AA_250") for unambiguous multi-char chains,
with backward compat for single-char legacy format ("A250")
- All output formatting uses CHAIN_RESIDUE separator
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
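To illustrate why the truncation described above loses information, here is a minimal sketch. It is not the PR's code: `truncate_chain`, `chain_from_byte`, and `format_residue` are hypothetical helpers, and only the `(String, u64)` shape of `ResidueMatch` and the `CHAIN_RESIDUE` output form are taken from the commit message.

```rust
use std::collections::HashSet;

/// Hypothetical alias mirroring the widened match type named in the commit message.
type ResidueMatch = (String, u64);

/// Illustrative stand-in for the old behaviour: keep only the first byte of the chain name.
fn truncate_chain(auth_asym_id: &str) -> u8 {
    auth_asym_id.as_bytes().first().copied().unwrap_or(b' ')
}

/// Assumed boundary conversion for parsers that still carry a single-byte chain.
fn chain_from_byte(chain: u8) -> String {
    (chain as char).to_string()
}

/// Format a match as CHAIN_RESIDUE, e.g. ("10", 250) becomes "10_250".
fn format_residue(m: &ResidueMatch) -> String {
    format!("{}_{}", m.0, m.1)
}

fn main() {
    let chains = ["1", "10", "11"];

    // Truncating to one byte collapses all three chains onto b'1'.
    let truncated: HashSet<u8> = chains.iter().map(|c| truncate_chain(c)).collect();
    // Storing the full name keeps them distinct.
    let full: HashSet<String> = chains.iter().map(|c| c.to_string()).collect();

    println!("distinct chains after truncation: {}", truncated.len());
    println!("distinct chains as String: {}", full.len());
    println!("{}", format_residue(&("10".to_string(), 250)));
    println!("{}", format_residue(&(chain_from_byte(b'A'), 42)));
}
```

Converting single-byte chains to `String` only at the parser boundary, as the commit message describes, would let byte-oriented structs stay unchanged while everything downstream sees one chain type.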
Contributor
Author
We will probably need to propagate the new motif format through the documentation if this is approved. I can do that after/if you give the green light. I introduced the separator because there may be multi-character chain IDs that contain digits (eg
Member
Ah, I was thinking of the same query format: an underscore separator with backward compatibility.
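The query format discussed here can be sketched as follows. This is a hedged illustration, not the parser from the PR: `parse_query_residue` is a hypothetical name, and splitting at the last underscore is an assumption about how chain names that themselves contain underscores would be handled.

```rust
/// Parse one query token into (chain, residue number).
/// Assumption: the separator form wins when an underscore is present;
/// otherwise the token is read in the legacy single-character form.
fn parse_query_residue(token: &str) -> Option<(String, u64)> {
    // New form: CHAIN_RESIDUE, split at the last underscore.
    if let Some((chain, num)) = token.rsplit_once('_') {
        if chain.is_empty() {
            return None;
        }
        return num.parse::<u64>().ok().map(|n| (chain.to_string(), n));
    }

    // Legacy form: the first character is the chain, the rest is the residue number.
    let mut chars = token.chars();
    let chain = chars.next()?;
    let num = chars.as_str().parse::<u64>().ok()?;
    Some((chain.to_string(), num))
}

fn main() {
    for token in ["AA_250", "A250", "10_250", "10250"] {
        println!("{:?} -> {:?}", token, parse_query_residue(token));
    }
}
```

Without the separator, the token `10250` is read as chain `1`, residue 250 under the legacy rule, which is why a chain named `10` needs the explicit `10_250` form.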
Summary
- Widen chain storage from u8 to String throughout the pipeline to support multi-character chain IDs (e.g. "10" in auth_asym_id from large cryo-EM structures like PDB 9A1O)
- Update AtomVector, Structure, CompactStructure chain fields and ResidueMatch type accordingly
- CIF parser: replace get_one_char with get_text_string to preserve full chain names
- PDB/FCZ parsers: convert at boundary, Atom struct unchanged for FFI compat
- Query format: underscore separator (AA_250) for unambiguous multi-char chains, with backward compat for single-char legacy format (A250)
Test plan
🤖 Generated with Claude Code