Support for de novo peptide sequencing workflows (no database search) #203

_De novo_ peptide sequencing (predicting peptide sequences directly from MS/MS spectra, without searching a protein database) is growing rapidly, with tools such as [InstaNovo](https://github.com/instadeepai/instanovo), [Casanovo](https://github.com/Noble-Lab/casanovo), and [π-PrimeNovo](https://github.com/BEAM-Labs/denovo/tree/main/PrimeNovo). QPX could serve as an interchange format for these workflows, but two required fields currently block adoption.

The [PSM schema](https://qpx.quantms.org/spec/psm/#schema) is a good fit for _de novo_ predictions: `sequence`, `peptidoform`, `charge`, `observed_mz`, `calculated_mz`, `rt`, `scan`, and `additional_scores` all map directly, and `protein_accessions` is already optional. Issue #144 already acknowledged _de novo_ sequencing as a use case when adding fragment method and collision energy fields. However:

  1. **`is_decoy` is required in PSM and feature schemas**

_De novo_ sequencing has no target-decoy paradigm. There is no database to generate decoys from; instead, confidence is estimated by the model itself (e.g. sequence log-probability) and can be calibrated into a posterior error probability by downstream tools. Setting `is_decoy` = `False` for all rows is a workaround, but it makes a required field semantically meaningless and could confuse downstream tools that use it for FDR calculation.
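
To illustrate the concern, here is a minimal sketch of what a naive target-decoy FDR estimate would report on _de novo_ output with the workaround applied. The `score` column, the example rows, and the `target_decoy_fdr` helper are hypothetical and not part of QPX or any existing tool; the FDR formula is the simple decoys/targets ratio, not any specific tool's implementation.

```python
def target_decoy_fdr(psms, threshold):
    """Naive target-decoy FDR estimate: decoys / targets at or above a score threshold."""
    accepted = [p for p in psms if p["score"] >= threshold]
    decoys = sum(1 for p in accepted if p["is_decoy"])
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


# Hypothetical de novo predictions with is_decoy forced to False (the workaround).
de_novo_psms = [
    {"sequence": "PEPTIDEK", "score": 0.95, "is_decoy": False},
    {"sequence": "ACDEFGHK", "score": 0.40, "is_decoy": False},
    {"sequence": "LMNPQRSK", "score": 0.05, "is_decoy": False},
]

# With no decoys present, the estimate is zero at every threshold,
# including thresholds that accept very low-confidence predictions.
fdr_by_threshold = {t: target_decoy_fdr(de_novo_psms, t) for t in (0.0, 0.5, 0.9)}
```

A tool that trusts `is_decoy` would therefore report an FDR of zero regardless of prediction quality, which is the misleading behaviour described above.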

**Suggestion:** Make `is_decoy` optional, or add a dataset-level flag (e.g. in [dataset.parquet](https://qpx.quantms.org/spec/ontology/?h=dataset.parquet#dataset)) indicating the identification method (`database_search` vs `de_novo`) so downstream tools know whether to expect decoys.
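
A rough sketch of how a downstream tool might use such a dataset-level flag. The metadata key `identification_method`, the returned strategy names, and the default for datasets without the flag are all assumptions made for illustration; none of them exist in the current QPX spec.

```python
from enum import Enum


class IdentificationMethod(str, Enum):
    # Values taken from the suggestion above; the enum itself is hypothetical.
    DATABASE_SEARCH = "database_search"
    DE_NOVO = "de_novo"


def fdr_strategy(dataset_metadata):
    """Decide how to estimate confidence based on a dataset-level flag.

    Assumption: datasets without the flag are treated as database-search
    results, so existing files keep their current behaviour.
    """
    raw = dataset_metadata.get("identification_method", "database_search")
    method = IdentificationMethod(raw)  # raises ValueError on unknown values
    if method is IdentificationMethod.DE_NOVO:
        # No decoys expected: rely on model-derived scores instead of is_decoy.
        return "model_confidence"
    return "target_decoy"


strategy = fdr_strategy({"identification_method": "de_novo"})
```

With this kind of guard, `is_decoy` is only consulted when the dataset declares a database search, so it could be left null for _de novo_ datasets.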

  2. **`anchor_protein` is required in feature schema (and part of the primary key)**

_De novo_ results have no protein mapping by default. Some _de novo_ pipelines offer an optional post-hoc alignment step against a reference proteome, but in the standard workflow there is no protein context. Since `anchor_protein` is part of the [feature table's primary key](https://qpx.quantms.org/spec/schemas/?h=charge%2C+run_file_name%2C+anchor_protein#__tabbed_1_2) `[sequence, charge, run_file_name, anchor_protein]`, the feature table cannot be produced at all without protein mapping.

**Suggestion:** Either make `anchor_protein` optional (with a fallback primary key of `[sequence, charge, run_file_name]` when no protein mapping is available), or define the feature table as not required when identification was performed without a database.
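
The fallback-key option could look roughly like the sketch below. The helper names and example rows are hypothetical, and the rule "use the fallback key if any row lacks `anchor_protein`" is one possible interpretation of the suggestion, not an agreed design.

```python
FULL_KEY = ("sequence", "charge", "run_file_name", "anchor_protein")
FALLBACK_KEY = ("sequence", "charge", "run_file_name")


def feature_primary_key(rows):
    """Pick the key columns: the full key only if every row has a protein mapping."""
    has_protein = all(row.get("anchor_protein") is not None for row in rows)
    return FULL_KEY if has_protein else FALLBACK_KEY


def find_duplicate_keys(rows, key):
    """Return the set of key tuples that occur more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        value = tuple(row.get(column) for column in key)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates


# Hypothetical de novo feature rows without any protein mapping.
rows = [
    {"sequence": "PEPTIDEK", "charge": 2, "run_file_name": "run01"},
    {"sequence": "PEPTIDEK", "charge": 3, "run_file_name": "run01"},
]

key = feature_primary_key(rows)
duplicates = find_duplicate_keys(rows, key)
```

Here the rows are still uniquely identified by `[sequence, charge, run_file_name]`, so a writer could validate uniqueness on the shorter key when no mapping is available.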

What a _de novo_ pipeline can produce today:
  
| QPX structure         | Default (de novo only)     | With optional protein mapping | With quantification + mapping |
|----------------------|----------------------------|-------------------------------|-------------------------------|
| .psm.parquet         | Yes (except `is_decoy`)      | Yes                           | Yes                           |
| .feature.parquet     | No (`anchor_protein` missing)| No (no quantification)        | Yes                           |
| .pg.parquet          | No                         | Yes                           | Yes                           |
| .pepmap.parquet      | No                         | Yes                           | Yes                           |
| .run.parquet         | Yes (from SDRF)            | Yes                           | Yes                           |
| .sample.parquet      | Yes (from SDRF)            | Yes                           | Yes                           |
| .provenance.parquet  | Yes                        | Yes                           | Yes                           |
| .mz.parquet          | Yes                        | Yes                           | Yes                           |
  

Resolving the two blockers above would allow _de novo_ tools to produce valid PSM-level QPX output in the default case and full QPX datasets when protein mapping and quantification are enabled. This would extend QPX's reach beyond database-search pipelines and move it toward a universal proteomics interchange format.

Happy to contribute a QPX writer for _de novo_ output and help test schema changes if the above adjustments are feasible.
