Skip to content

Conversation

@jcharkow
Copy link
Contributor

This streamlines sqMass reading and writing to parquet (bypasses the need for pandas dataframes) in order to make the process more memory efficient.

I have not benchmarked against the old method but this method requires 21Gb to convert an 11Gb .sqMass file while I was running into memory errors with the old build.

I kept the old reader functions in case there is a need to read to a pandas dataframe however, I am not sure if these would be used anywhere so it might be worth just removing these old reader functions.

I am not sure if there are any tests which currently test the sqMass conversion functionality so I hope I did not break anything.

Copy link
Contributor

@singjc singjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the optimization! I just made a few comments, mainly for the large export query builder, can you break this into smaller sub-query builders. Unfortunately, there is no unit tests for the sqMass exporter, should probably add one..

For now, can you test that these changes in two ways if you don't mind:

  1. Can you use the old export method on a smaller sqMass file (that doesn't oom) and test for dataframe equality (minus the added columns) to make sure they're the same.
  2. Can you load the converted xic parquet file and the feature osw (Sqlite) file in the arycal-gui visualization tab to make sure it doesn't break anything there. I use the XIC parquet file for the alignment and the visualization gui, so want to make sure there are no breaking changes there.

Can you update the documentation with the added columns to the xic parquet format schema.

@singjc singjc requested a review from Copilot August 19, 2025 17:08
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors sqMass to parquet export functionality to improve memory efficiency by bypassing pandas DataFrames and using DuckDB directly for data processing. The implementation removes the need to load data into memory as DataFrames, reportedly reducing memory usage from causing memory errors to requiring only 21GB for an 11GB sqMass file.

Key changes:

  • Replaced pandas-based data reading with direct DuckDB SQL queries
  • Moved shared _execute_copy_query method to base class for reusability
  • Implemented complex SQL query to join sqMass and PQP data without intermediate DataFrames

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
pyprophet/io/export/sqmass.py Replaced pandas-based export with DuckDB SQL query approach, removed reader dependency, added comprehensive SQL query for data extraction
pyprophet/io/export/osw.py Removed duplicate _execute_copy_query method that was moved to base class
pyprophet/io/_base.py Added shared _execute_copy_query method to base class for reuse across exporters

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@jcharkow
Copy link
Contributor Author

Here is a new and old converted xic.

Examining them on my end they are the same
test_chrom_1_.zip

@jcharkow
Copy link
Contributor Author

Tested the above .parquet with arycal and seems to work on my end. If you want to test yourself the .osw is just the test one for massdash (tests/test_data/osw)

@jcharkow jcharkow requested a review from singjc August 19, 2025 21:56
Copy link
Contributor

@singjc singjc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for adding the changes and testing. One last thing, can you please update the format schema in the documentation in docs/file_formats.rst with the added columns for transition annotation.

@singjc singjc merged commit aced07c into PyProphet:master Sep 30, 2025
1 check passed
@singjc singjc deleted the feature/mem_efficient_chrom_export branch September 30, 2025 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants