Feature: Export Library #156

Merged
singjc merged 27 commits into PyProphet:master from Roestlab:feature/lib_export
Aug 28, 2025

@jcharkow jcharkow commented Aug 8, 2025

This adds functionality to export a .oswpq file or a .oswpqd file to a library. For RT/IM and for fragment ion intensity, the library can use either the experimental values or the values from the previous library.

Currently, .osw and .parquet inputs are unsupported, but support can be added in the future.

@singjc singjc left a comment

Thanks for the addition! Looks mostly good to go, I just had a few questions and suggestions.

Comment on lines 371 to 372
type=float,
help="Filter results to maximum run-specific peak group-level q-value, should not use values > 0.01.",

Contributor

If we want to limit the filters to be only less than or equal to 0.01, maybe we should change the type to a click.FloatRange? Or add param validation in the export_library if we want to limit the qvalue thresholds.

Contributor

Can you also add to the description that, if multiple runs contain the same precursor, the run with the lowest q-value is used?
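
The selection rule described in this comment can be illustrated with a small sketch. This is not the PR's implementation, which queries the split parquet files; the record fields `precursor`, `run`, and `qvalue` and the helper `best_run_per_precursor` are hypothetical names chosen only for illustration.

```python
from typing import Iterable, NamedTuple


class PeakGroup(NamedTuple):
    # Hypothetical record layout, for illustration only.
    precursor: str
    run: str
    qvalue: float


def best_run_per_precursor(rows: Iterable[PeakGroup]) -> dict[str, PeakGroup]:
    """Keep, for each precursor, the peak group with the lowest q-value."""
    best: dict[str, PeakGroup] = {}
    for row in rows:
        current = best.get(row.precursor)
        # Replace the stored entry only when this run scores strictly better.
        if current is None or row.qvalue < current.qvalue:
            best[row.precursor] = row
    return best


rows = [
    PeakGroup("PEPTIDEK/2", "run_A", 0.004),
    PeakGroup("PEPTIDEK/2", "run_B", 0.001),
    PeakGroup("ANOTHERR/3", "run_A", 0.008),
]
selected = best_run_per_precursor(rows)
```

In this sketch, ties keep the first run encountered; whether the actual export breaks ties the same way is not stated in the conversation.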

Contributor Author

> If we want to limit the filters to be only less than or equal to 0.01, maybe we should change the type to a click.FloatRange? Or add param validation in the export_library if we want to limit the qvalue thresholds.

I am not sure I want to enforce a hard limit, because I am still experimenting with values greater than 1% FDR, and a value greater than 1% is fine if you are filtering to that value anyway.

E.g., if you are doing your entire analysis at 5% FDR, it is probably fine to use 5% FDR here.

Contributor

We should probably change the help description then: either remove the "should not use values > 0.01" part, or reword it as a suggestion.
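
One way to reconcile the two positions in this thread is a soft check: reject values that cannot be q-values at all, but only warn when the cutoff is more lenient than the suggested 0.01. The sketch below uses only the standard library; the function name `check_max_qvalue` and its `recommended_max` parameter are assumptions for illustration and do not appear in the PR.

```python
import warnings


def check_max_qvalue(value: float, recommended_max: float = 0.01) -> float:
    """Validate a q-value cutoff: reject impossible values, warn on lenient ones."""
    # Hard limit: a q-value is a proportion, so it must lie in [0, 1].
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"q-value cutoff must be between 0 and 1, got {value}")
    # Soft limit: allowed, but flagged so the choice is deliberate.
    if value > recommended_max:
        warnings.warn(
            f"q-value cutoff {value} exceeds the recommended {recommended_max}; "
            "make sure the rest of the analysis uses the same FDR.",
            UserWarning,
            stacklevel=2,
        )
    return value
```

A hard bound, as first suggested with click.FloatRange, would instead refuse such values outright, which is the behaviour the author wanted to avoid while still experimenting above 1% FDR.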

singjc commented Aug 11, 2025

I think the tests need to be updated with the added rt_unit option?

jcharkow and others added 5 commits August 12, 2025 15:03
Co-authored-by: Justin Sing <32938975+singjc@users.noreply.github.com>
after minor changes in data manipulation update snapshot tests
@singjc singjc requested a review from Copilot August 19, 2025 17:12

This comment was marked as outdated.

@jcharkow jcharkow requested review from Copilot and singjc August 19, 2025 21:56

Copilot AI left a comment

Pull Request Overview

This PR adds functionality to export library files (.oswpq and .oswpqd) to a TSV library format that can be used with OpenSWATH. The export supports both experimental and previous libraries for RT/IM or fragment ion intensity.

  • Implements library export functionality through a new export library command
  • Adds support for various calibration options (RT, IM, intensity) and filtering parameters
  • Restricts library export to split parquet files only (OSW and non-split parquet files raise NotImplementedError)

Reviewed Changes

Copilot reviewed 11 out of 19 changed files in this pull request and generated 6 comments.

File | Description
--- | ---
tests/test_pyprophet_export.py | Adds test cases for library export functionality with different calibration and RT unit configurations
tests/_regtest_outputs/ | Reference outputs for the new library export test cases showing expected TSV format
pyprophet/io/export/split_parquet.py | Implements library-specific data reading logic with proper validation and SQL queries
pyprophet/io/export/parquet.py | Adds NotImplementedError for library export from non-split parquet files
pyprophet/io/export/osw.py | Adds NotImplementedError for library export from OSW files
pyprophet/io/_base.py | Implements library cleaning, processing, and export functionality with calibration support
pyprophet/cli/export.py | Adds new CLI command for library export with comprehensive configuration options
pyprophet/_config.py | Extends configuration to support library export parameters and options


if self.config.export_format == "library":
    if self._is_unscored_file():
        descr= "Files must be scored for library generation."

Copilot AI Aug 19, 2025

Inconsistent spacing around assignment operator. Should be 'descr = "Files must be scored for library generation."'

    logger.exception(descr)
    raise ValueError(descr)
if not self._has_peptide_protein_global_scores():
    descr= "Files must have peptide and protein level global scores for library generation."

Copilot AI Aug 19, 2025

Inconsistent spacing around assignment operator. Should be 'descr = "Files must have peptide and protein level global scores for library generation."'

if self.config.keep_decoys:
    decoy_query = ""
else:
    decoy_query ="p.PRECURSOR_DECOY is false and t.TRANSITION_DECOY is false and"

Copilot AI Aug 19, 2025

Inconsistent spacing around assignment operator. Should be 'decoy_query = "p.PRECURSOR_DECOY is false and t.TRANSITION_DECOY is false and"'
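
Beyond spacing, the snippet above builds its filter as a string fragment that ends in a trailing "and", which presumably relies on another condition being appended later. A possible alternative, sketched here under that assumption, is to collect conditions in a list and join them. The helper `build_where_clause` and the column `p.Q_VALUE` are hypothetical; only the two decoy columns come from the code under review.

```python
def build_where_clause(keep_decoys: bool, extra_conditions: list[str]) -> str:
    """Join filter conditions with 'and', so no fragment needs a trailing 'and'."""
    conditions: list[str] = []
    if not keep_decoys:
        # Column names taken from the snippet under review.
        conditions.append("p.PRECURSOR_DECOY is false")
        conditions.append("t.TRANSITION_DECOY is false")
    conditions.extend(extra_conditions)
    if not conditions:
        return ""
    return "where " + " and ".join(conditions)


# 'p.Q_VALUE' is a placeholder column name, not taken from the PR.
clause = build_where_clause(keep_decoys=False, extra_conditions=["p.Q_VALUE <= 0.01"])
```

With this shape, an empty condition list yields no clause at all, so the query stays valid whether or not decoys are kept.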

import duckdb
import pandas as pd
import polars as pl
import sklearn.preprocessing as preprocessing # For MinMaxScaler

Copilot AI Aug 19, 2025

Import comment should follow PEP 8 guidelines with proper spacing: ' # For MinMaxScaler' should be '  # For MinMaxScaler' (two spaces before #)

Suggested change
- import sklearn.preprocessing as preprocessing # For MinMaxScaler
+ import sklearn.preprocessing as preprocessing  # For MinMaxScaler

logger.info(f"Library Contains {len(data['Precursor'].drop_duplicates())} Precursors")

logger.info(f"Precursor Fragment Distribution (Before Filtering)")
num_frags_per_prec = data[['Precursor', 'TransitionId']].groupby("Precursor").count().reset_index(names='Precursor').groupby('TransitionId').count()

Copilot AI Aug 19, 2025

This line is overly complex with multiple chained operations. Consider breaking it into multiple steps for better readability and debugging.

Suggested change
- num_frags_per_prec = data[['Precursor', 'TransitionId']].groupby("Precursor").count().reset_index(names='Precursor').groupby('TransitionId').count()
+ precursor_transition = data[['Precursor', 'TransitionId']]
+ precursor_counts = precursor_transition.groupby("Precursor").count()
+ precursor_counts_reset = precursor_counts.reset_index(names='Precursor')
+ num_frags_per_prec = precursor_counts_reset.groupby('TransitionId').count()
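
To make clear what this two-stage count produces, here is the same idea written with the standard library's `Counter` rather than pandas. The sample pairs are made up for illustration, and the sketch assumes no missing transition ids (pandas `count()` would skip nulls, whereas this counts every row).

```python
from collections import Counter

# Hypothetical (precursor, transition id) pairs standing in for the
# 'Precursor' and 'TransitionId' columns.
pairs = [
    ("PEPTIDEK/2", 1), ("PEPTIDEK/2", 2), ("PEPTIDEK/2", 3),
    ("ANOTHERR/3", 4), ("ANOTHERR/3", 5), ("ANOTHERR/3", 6),
    ("SHORTK/2", 7),
]

# Step 1: number of transitions per precursor.
frags_per_precursor = Counter(precursor for precursor, _ in pairs)

# Step 2: number of precursors for each transition count.
distribution = Counter(frags_per_precursor.values())
```

Here `distribution` maps a fragment count to the number of precursors having that many fragments, which is the quantity logged as the precursor fragment distribution.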


logger.info(f"After filtering, library contains {len(data['Precursor'].drop_duplicates())} Precursors")
if cfg.keep_decoys:
    logger.info("Of Which {} are decoys".format(len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())))

Copilot AI Aug 19, 2025

Use f-string formatting instead of .format() for consistency with the rest of the codebase and better performance: f"Of which {len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())} are decoys"

Suggested change
- logger.info("Of Which {} are decoys".format(len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())))
+ logger.info(f"Of Which {len(data[data['Decoy'] == 1]['Precursor'].drop_duplicates())} are decoys")

@singjc singjc left a comment

Great, thanks for the changes. Will merge.

@singjc singjc merged commit 62cb5f4 into PyProphet:master Aug 28, 2025
1 check passed