Support for de novo peptide sequencing workflows (no database search) #203

@BioGeek

De novo peptide sequencing (predicting peptide sequences directly from MS/MS spectra, without searching a protein database) is growing rapidly with tools like InstaNovo, Casanovo, π-PrimeNovo, etc. QPX could potentially be an interchange format for these workflows, but two required fields currently block adoption.

The PSM schema is a good fit for de novo predictions: sequence, peptidoform, charge, observed_mz, calculated_mz, rt, scan, and additional_scores all map directly, and protein_accessions is already optional. Issue #144 already acknowledged de novo sequencing as a use case when adding fragment method and collision energy fields. However:

  1. is_decoy is required in PSM and feature schemas

De novo sequencing has no target-decoy paradigm. There is no database to generate decoys from; instead, confidence is estimated by the model itself (e.g. sequence log-probability) and can be calibrated into a posterior error probability by downstream tools. Setting is_decoy = False for all rows is a workaround, but it makes a required field semantically meaningless and could confuse downstream tools that use it for FDR calculation.

Suggestion: Make is_decoy optional, or add a dataset-level flag (e.g. in dataset.parquet) indicating the identification method (database_search vs de_novo) so downstream tools know whether to expect decoys.
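
As a rough sketch of the dataset-level flag option, a downstream tool could check the identification method before attempting any target-decoy calculation. The `identification_method` key and its values `database_search` / `de_novo` follow the suggestion above and are not part of the current schema; the function name and the dict-based rows are purely illustrative.

```python
from typing import Iterable, Mapping, Optional


def decoy_fraction(
    dataset_meta: Mapping[str, str],
    psms: Iterable[Mapping[str, Optional[bool]]],
) -> Optional[float]:
    """Return decoys / targets, or None when decoys are not expected.

    Assumes a hypothetical dataset-level key `identification_method`
    (proposed in this issue, not present in the current schema).
    """
    method = dataset_meta.get("identification_method", "database_search")
    if method == "de_novo":
        # No target-decoy paradigm: is_decoy carries no information here.
        return None

    targets = decoys = 0
    for psm in psms:
        # A missing or null is_decoy is counted as a target in this sketch.
        if psm.get("is_decoy"):
            decoys += 1
        else:
            targets += 1
    return decoys / targets if targets else None
```

With a flag like this, tools that rely on is_decoy can skip or replace their FDR step explicitly instead of silently computing a zero decoy rate.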

  2. anchor_protein is required in feature schema (and part of the primary key)

De novo results have no protein mapping by default. Some de novo pipelines offer an optional post-hoc alignment step against a reference proteome, but in the standard workflow there is no protein context. Since anchor_protein is part of the feature table's primary key [sequence, charge, run_file_name, anchor_protein], the feature table cannot be produced at all without protein mapping.

Suggestion: Either make anchor_protein optional (with a fallback primary key of [sequence, charge, run_file_name] when no protein mapping is available), or define the feature table as not required when identification was performed without a database.
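
The fallback-key option could look roughly like the following. This is only a sketch under the assumption that the key is chosen per table: the full key is used when every row has an anchor_protein, and the three-column key otherwise. The helper names and the list-of-dicts representation are hypothetical and not part of QPX.

```python
from collections import Counter
from typing import Any, List, Mapping, Sequence, Tuple

# Primary key as described in this issue, and the proposed fallback.
FULL_KEY = ("sequence", "charge", "run_file_name", "anchor_protein")
FALLBACK_KEY = ("sequence", "charge", "run_file_name")


def key_columns(rows: Sequence[Mapping[str, Any]]) -> Tuple[str, ...]:
    """Use the full key only if every row carries a protein mapping."""
    has_mapping = bool(rows) and all(
        r.get("anchor_protein") is not None for r in rows
    )
    return FULL_KEY if has_mapping else FALLBACK_KEY


def duplicate_keys(rows: Sequence[Mapping[str, Any]]) -> List[Tuple[Any, ...]]:
    """Return key tuples that occur more than once under the chosen key."""
    cols = key_columns(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return [key for key, n in counts.items() if n > 1]
```

A writer could run a check like `duplicate_keys` before emitting the feature table to confirm that the shorter key still identifies rows uniquely.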

What a de novo pipeline can produce today:

| QPX structure | Default (de novo only) | With optional protein mapping | With quantification + mapping |
|---|---|---|---|
| .psm.parquet | Yes (except is_decoy) | Yes | Yes |
| .feature.parquet | No (anchor_protein missing) | No (no quantification) | Yes |
| .pg.parquet | No | Yes | Yes |
| .pepmap.parquet | No | Yes | Yes |
| .run.parquet | Yes (from SDRF) | Yes | Yes |
| .sample.parquet | Yes (from SDRF) | Yes | Yes |
| .provenance.parquet | Yes | Yes | Yes |
| .mz.parquet | Yes | Yes | Yes |
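
The same matrix can be expressed as a small helper that a writer might use to decide which files to emit. This is an illustrative sketch only: the function and parameter names are made up, and the combination "quantification without protein mapping" is not covered by the table, so it is treated here as producing no feature table.

```python
from typing import List


def producible_files(
    protein_mapping: bool = False, quantification: bool = False
) -> List[str]:
    """List the QPX files a de novo pipeline can produce today."""
    # Always available; .psm.parquet is complete except for is_decoy.
    files = [
        ".psm.parquet",
        ".run.parquet",
        ".sample.parquet",
        ".provenance.parquet",
        ".mz.parquet",
    ]
    if protein_mapping:
        # Protein-level tables need the optional post-hoc alignment step.
        files += [".pg.parquet", ".pepmap.parquet"]
    if protein_mapping and quantification:
        # The feature table needs both anchor_protein and quantification.
        files.append(".feature.parquet")
    return files
```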

Resolving the two blockers above would allow de novo tools to produce valid PSM-level QPX output in the default case and full QPX datasets when protein mapping and quantification are enabled. This would extend QPX's reach beyond database-search pipelines into a universal proteomics interchange format.

Happy to contribute a QPX writer for de novo output and help test schema changes if the above adjustments are feasible.