Add PEP section parsing and peptide-protein group resolution #201
ypriverol merged 7 commits into bigbio:dev
Not up to standards ⛔
🔴 Issues
| Category | Results |
|---|---|
| Documentation | 2 minor |
🟢 Metrics: 14 complexity · 2 duplication
Pull request overview
This PR extends the mzTab loading and QuantMS feature extraction pipeline to incorporate peptide (PEP) sections, and uses peptide-level protein inference to resolve ambiguous protein groups during feature record construction.
Changes:
- Generalizes load_mztab_sections() to parse multiple mzTab sections (proteins, peptides, PSMs) via a dispatch mechanism, supporting both classic and fast load paths.
- Adds a peptide→protein lookup built from the mzTab PEP section and uses it to resolve multi-accession protein groups in QuantMS LFQ feature extraction.
- Refactors fast loader temp-file handling and cleanup into shared helpers.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| qpx/converters/mztab.py | Adds peptide section support via section dispatch; refactors fast loader to stream sections into temp TSVs for DuckDB ingestion. |
| qpx/converters/quantms/feature_adapter.py | Builds a peptide→protein mapping from the peptides table and uses it to resolve protein groups when forming feature records. |
# Section definitions: (header_prefix, data_prefix, table_name, dedup_col)
_MZTAB_SECTIONS = [
    (_PROTEIN_HEADER_PREFIX, _PROTEIN_LINE_PREFIX, "proteins", "accession"),
    (_PSM_HEADER_PREFIX, _PSM_LINE_PREFIX, "psms", "sequence"),
    (_PEPTIDE_HEADER_PREFIX, _PEPTIDE_LINE_PREFIX, "peptides", "sequence"),
]
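
A section table like the one above can be turned into two prefix lookups, one for header lines and one for data lines. The sketch below is only an illustration under stated assumptions: the three-letter codes are the line prefixes used by the mzTab format, but the literal values of the `_*_PREFIX` constants in this PR are not shown in the diff, `build_prefix_maps` is a hypothetical helper name, and the sketch ignores `dedup_col` entirely.

```python
# Assumed prefix values (mzTab line codes); the PR's actual constants are not shown.
SECTIONS = [
    ("PRH", "PRT", "proteins", "accession"),
    ("PSH", "PSM", "psms", "sequence"),
    ("PEH", "PEP", "peptides", "sequence"),
]


def build_prefix_maps(sections):
    """Hypothetical helper: map header and data prefixes to table names."""
    header_map = {}
    data_map = {}
    for header_prefix, data_prefix, table_name, _dedup_col in sections:
        header_map[header_prefix] = table_name
        data_map[data_prefix] = table_name
    return header_map, data_map


header_map, data_map = build_prefix_maps(SECTIONS)
```

With this layout, supporting another section is a matter of adding one tuple to the list rather than another branch in the parser.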
acc_list = protein_name.split(";") if protein_name else []
if len(acc_list) > 1 and sequence in _pep_map:
    resolved = _pep_map[sequence]
    acc_list = [resolved]
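
The snippet above can be read in isolation as a small pure function. The following sketch wraps the same logic so its behaviour on a few inputs is easy to check; `resolve_accessions` and the sample accessions are hypothetical names chosen for illustration, not part of the PR.

```python
def resolve_accessions(protein_name, sequence, pep_map):
    """Split a ';'-joined protein group and narrow it using the peptide map."""
    acc_list = protein_name.split(";") if protein_name else []
    # Only multi-accession groups are narrowed, and only when the peptide
    # sequence has an entry in the peptide->protein map.
    if len(acc_list) > 1 and sequence in pep_map:
        acc_list = [pep_map[sequence]]
    return acc_list


# Hypothetical example data.
pep_map = {"PEPTIDEK": "P12345"}

resolved = resolve_accessions("P12345;Q67890", "PEPTIDEK", pep_map)
unresolved = resolve_accessions("P12345;Q67890", "OTHERSEQK", pep_map)
single = resolve_accessions("Q67890", "PEPTIDEK", pep_map)
empty = resolve_accessions("", "PEPTIDEK", pep_map)
```

Note that a single-accession group is left untouched even when the map holds a different accession for the same sequence, and that the mapped accession is not checked for membership in the original group.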
if not self._table_exists("peptides"):
    self.logger.info("No mzTab peptides table — skipping peptide protein map")
    return {}
if not line:
    continue
parts = line.split("\t")
prefix = parts[0][:3] if parts else ""
_dispatch_line(prefix, parts, header_map, data_map, on_metadata, on_header, on_data)
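
The body of `_dispatch_line` is not shown in this excerpt. As a rough sketch of how such a callback-based dispatcher could work, the example below routes each line by its three-letter prefix. Everything in it is an assumption made for illustration: the callback signatures, the `MTD` handling, the `parse_lines` wrapper, and the sample input.

```python
def dispatch_line(prefix, parts, header_map, data_map, on_metadata, on_header, on_data):
    """Hypothetical dispatcher: route one split line to the matching callback."""
    if prefix == "MTD":
        on_metadata(parts)
    elif prefix in header_map:
        on_header(header_map[prefix], parts[1:])
    elif prefix in data_map:
        on_data(data_map[prefix], parts[1:])
    # Any other prefix (for example COM comment lines) is ignored.


def parse_lines(lines, header_map, data_map):
    metadata, headers, rows = [], {}, {}

    def on_metadata(parts):
        metadata.append(parts[1:])

    def on_header(table, columns):
        headers[table] = columns

    def on_data(table, values):
        rows.setdefault(table, []).append(values)

    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t")
        prefix = parts[0][:3] if parts else ""
        dispatch_line(prefix, parts, header_map, data_map, on_metadata, on_header, on_data)
    return metadata, headers, rows


# Hypothetical input covering metadata, a blank line, a header, a data row, a comment.
sample = [
    "MTD\tmzTab-version\t1.0.0",
    "",
    "PEH\tsequence\taccession",
    "PEP\tPEPTIDEK\tP12345",
    "COM\tignored comment",
]
metadata, headers, rows = parse_lines(sample, {"PEH": "peptides"}, {"PEP": "peptides"})
```

Passing callbacks keeps the line loop identical for different consumers, which would fit the PR's stated goal of supporting both the classic and fast load paths, although how the two paths actually share code is not visible here.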
… R1732, D407/D413)
This pull request refactors and extends the mzTab loader and quantms feature adapter to support peptide-level (PEP) sections, and improves how peptide-to-protein mappings are handled for downstream processing. The most significant changes are the generalization of the mzTab parser to handle multiple sections (including peptides), the addition of a peptide-to-protein mapping step, and the use of this mapping to resolve ambiguous protein groups in feature extraction.
mzTab parsing and section handling improvements:
- Extended qpx/converters/mztab.py to support parsing of the peptide (PEP) section, in addition to proteins and PSMs, by introducing a configurable section dispatch mechanism. This includes new constants for peptide section prefixes and a unified approach for handling headers and data lines across all sections. [1] [2]

Feature adapter and peptide-protein mapping:
- Added a peptide-to-protein mapping (_pep_protein_map) in qpx/converters/quantms/feature_adapter.py, using the newly loaded peptide section to map unambiguously resolved peptides to their corresponding protein accession. This mapping is now built as part of the LFQ conversion process. [1] [2] [3]

These changes make the mzTab loader more robust and extensible, and improve the biological accuracy of quantms feature extraction by leveraging peptide-level protein inference.
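
How the peptide-to-protein mapping is built is not shown in this excerpt. One plausible reading of "unambiguously resolved" is that a peptide sequence is kept only when every PEP row for it names the same single accession. The sketch below implements that reading over plain (sequence, accession) tuples; the function name, the input shape, and the treatment of empty or "null" accessions are all assumptions, and the PR's own implementation reads from a peptides table instead.

```python
def build_pep_protein_map(rows):
    """Hypothetical helper: keep sequences that map to exactly one accession."""
    seen = {}
    for sequence, accession in rows:
        # Assumption: rows without a usable accession are skipped.
        if not accession or accession == "null":
            continue
        seen.setdefault(sequence, set()).add(accession)
    # A sequence seen with two or more distinct accessions is ambiguous and dropped.
    return {seq: next(iter(accs)) for seq, accs in seen.items() if len(accs) == 1}


# Hypothetical example rows.
peptide_rows = [
    ("AAK", "P1"),
    ("AAK", "P1"),    # duplicate row, still unambiguous
    ("BBK", "P1"),
    ("BBK", "P2"),    # second accession makes BBK ambiguous
    ("CCK", "null"),  # no accession reported
]
pep_protein_map = build_pep_protein_map(peptide_rows)
```

Dropping ambiguous sequences rather than picking one accession means the downstream group resolution only ever narrows a protein group when the peptide evidence is clear-cut.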