feat: widen chain IDs from u8 to String for multi-char CIF support #44

Open
adobles96 wants to merge 1 commit into steineggerlab:master from genesistherapeutics:upstream/widen-chain-ids

@adobles96
Contributor

Summary

  • Widens chain ID storage from u8 to String throughout the pipeline to support multi-character chain IDs (e.g. "10" in auth_asym_id from large cryo-EM structures like PDB 9A1O)
  • Updates AtomVector, Structure, CompactStructure chain fields and ResidueMatch type accordingly
  • CIF parser: replaces get_one_char with get_text_string to preserve full chain names
  • PDB/FCZ parsers: convert at boundary, Atom struct unchanged for FFI compat
  • Query format: underscore separator (e.g. AA_250) for unambiguous multi-char chains, with backward compat for single-char legacy format (A250)
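The query format described in the last bullet can be sketched as follows. This is only an illustration of the parsing rule under stated assumptions, not the code in this PR: the function name `parse_query_residue` is hypothetical, and splitting on the last underscore is an assumed way to keep chain IDs that contain digits (such as "10") intact.

```rust
/// Hypothetical helper (not taken from the PR): split one query token
/// into a (chain ID, residue number) pair.
fn parse_query_residue(token: &str) -> Option<(String, u64)> {
    // New format: CHAIN_RESIDUE. Splitting on the last underscore lets the
    // chain ID itself contain digits, e.g. "10_42" or "A1_250".
    if let Some((chain, residue)) = token.rsplit_once('_') {
        if chain.is_empty() {
            return None;
        }
        return residue.parse::<u64>().ok().map(|n| (chain.to_string(), n));
    }

    // Legacy format: the first character is the chain ID and the rest is
    // the residue number, e.g. "A250".
    let mut chars = token.chars();
    let chain = chars.next()?;
    let residue = chars.as_str().parse::<u64>().ok()?;
    Some((chain.to_string(), residue))
}
```

With this rule, a multi-character chain without a separator (for example "AA250") is rejected rather than guessed at, which is the ambiguity the underscore is meant to remove.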

Test plan

  • Tested with large cryo-EM structures (PDB 9A1O) that have multi-char chain IDs
  • Verified backward compatibility with single-char chain ID queries

🤖 Generated with Claude Code

Large cryo-EM structures (e.g. PDB 9A1O) have multi-character chain IDs
like "10" in auth_asym_id. Previously these were truncated to a single
byte. This commit widens chain storage to String throughout the pipeline:

- AtomVector, Structure, CompactStructure chain fields: u8 -> String
- ResidueMatch type: (u8, u64) -> (String, u64)
- CIF parser: replace get_one_char with get_text_string (preserves full chain name)
- PDB/FCZ parsers: convert at boundary, Atom struct unchanged for FFI compat
- Query format: underscore separator (e.g. "AA_250") for unambiguous multi-char chains,
  with backward compat for single-char legacy format ("A250")
- All output formatting uses CHAIN_RESIDUE separator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
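As a rough illustration of "convert at boundary" and the CHAIN_RESIDUE output format, the sketch below keeps a one-byte chain in the atom record and widens it to `String` only when building a match. The `Atom` layout, its field names, and the helper functions are hypothetical stand-ins, not the definitions in this repository; only the `(String, u64)` shape of `ResidueMatch` comes from the commit message, and the output is assumed to be chain, underscore, residue number.

```rust
/// Hypothetical stand-in for the fixed-layout atom record. The chain stays
/// a single byte here, so any FFI-facing layout is untouched.
#[repr(C)]
#[derive(Debug, Clone, Copy)]
struct Atom {
    chain: u8,
    res_serial: u64,
}

/// (chain ID, residue number), as described for ResidueMatch above.
type ResidueMatch = (String, u64);

/// Widen the one-byte chain to an owned String at the parser boundary.
fn chain_to_string(chain: u8) -> String {
    char::from(chain).to_string()
}

/// Build a match from a PDB-style atom whose chain is a single byte.
fn to_residue_match(atom: &Atom) -> ResidueMatch {
    (chain_to_string(atom.chain), atom.res_serial)
}

/// Format a match with the underscore separator (assumed: CHAIN_RESIDUE).
fn format_residue(m: &ResidueMatch) -> String {
    format!("{}_{}", m.0, m.1)
}
```

Under these assumptions, a PDB atom on chain `A` at residue 250 is printed as `A_250`, and a CIF-derived match on chain "10" at residue 42 is printed as `10_42`.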
@adobles96
Contributor Author

The new motif format will probably need to be propagated through the documentation if this is approved; I can do that once you give the green light. I introduced the separator because multi-character chain IDs may contain digits (e.g. A1_250); I think gemmi, for example, sometimes does this.

@khb7840
Member

khb7840 commented Apr 16, 2026

Ah, I was thinking of the same query format: an underscore separator with backward compatibility.
Thank you for the PR. I'll review this version.
