De novo peptide sequencing (predicting peptide sequences directly from MS/MS spectra, without searching a protein database) is growing rapidly with tools like InstaNovo, Casanovo, π-PrimeNovo, etc. QPX could potentially be an interchange format for these workflows, but two required fields currently block adoption.
The PSM schema is a good fit for de novo predictions: `sequence`, `peptidoform`, `charge`, `observed_mz`, `calculated_mz`, `rt`, `scan`, and `additional_scores` all map directly, and `protein_accessions` is already optional. Issue #144 already acknowledged de novo sequencing as a use case when adding fragment method and collision energy fields. However:
**`is_decoy` is required in PSM and feature schemas**
De novo sequencing has no target-decoy paradigm. There is no database to generate decoys from; confidence is estimated by the model itself (e.g. the sequence log-probability) and can be calibrated into a posterior error probability by downstream tools. Setting `is_decoy = False` for all rows is a workaround, but it makes a required field semantically meaningless and could confuse downstream tools that use it for FDR calculation.
**Suggestion:** Make `is_decoy` optional, or add a dataset-level flag (e.g. in `dataset.parquet`) indicating the identification method (`database_search` vs `de_novo`) so downstream tools know whether to expect decoys.
**`anchor_protein` is required in feature schema (and part of the primary key)**
De novo results have no protein mapping by default. Some de novo pipelines offer an optional post-hoc alignment step against a reference proteome, but in the standard workflow there is no protein context. Since `anchor_protein` is part of the feature table's primary key `[sequence, charge, run_file_name, anchor_protein]`, the feature table cannot be produced at all without protein mapping.
**Suggestion:** Either make `anchor_protein` optional (with a fallback primary key of `[sequence, charge, run_file_name]` when no protein mapping is available), or define the feature table as not required when identification was performed without a database.
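The fallback-key logic is simple for a validator to implement. A sketch with pandas, using the column names above (this is illustrative, not the official QPX validation code):

```python
import pandas as pd

# Example feature rows from a de novo run: no protein mapping,
# so anchor_protein is entirely null.
features = pd.DataFrame({
    "sequence": ["PEPTIDEK", "ACDEFGHIK"],
    "charge": [2, 3],
    "run_file_name": ["run01.mzML", "run01.mzML"],
    "anchor_protein": [None, None],
})

# Use the full key when protein mapping is present,
# otherwise fall back to the three-column key.
key = ["sequence", "charge", "run_file_name"]
if features["anchor_protein"].notna().any():
    key.append("anchor_protein")

# Primary-key uniqueness check against the selected key.
assert not features.duplicated(subset=key).any()
```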
What a de novo pipeline can produce today:
| QPX structure | Default (de novo only) | With optional protein mapping | With quantification + mapping |
|---|---|---|---|
| `.psm.parquet` | Yes (except `is_decoy`) | Yes | Yes |
| `.feature.parquet` | No (`anchor_protein` missing) | No (no quantification) | Yes |
| `.pg.parquet` | No | Yes | Yes |
| `.pepmap.parquet` | No | Yes | Yes |
| `.run.parquet` | Yes (from SDRF) | Yes | Yes |
| `.sample.parquet` | Yes (from SDRF) | Yes | Yes |
| `.provenance.parquet` | Yes | Yes | Yes |
| `.mz.parquet` | Yes | Yes | Yes |
Resolving the two blockers above would allow de novo tools to produce valid PSM-level QPX output in the default case and full QPX datasets when protein mapping and quantification are enabled. This would extend QPX's reach beyond database-search pipelines into a universal proteomics interchange format.
Happy to contribute a QPX writer for de novo output and help test schema changes if the above adjustments are feasible.