feature: more efficient sqMass to parquet export #158
Conversation
singjc left a comment:
Thanks for the optimization! I just made a few comments, mainly about the large export query builder: can you break it into smaller sub-query builders? Unfortunately, there are no unit tests for the sqMass exporter; we should probably add one.
For now, can you test these changes in two ways, if you don't mind:
- Can you use the old export method on a smaller sqMass file (one small enough not to OOM) and test for dataframe equality (minus the added columns) to make sure the outputs are the same.
- Can you load the converted XIC parquet file and the feature OSW (SQLite) file in the arycal-gui visualization tab to make sure it doesn't break anything there. I use the XIC parquet file for the alignment and the visualization GUI, so I want to make sure there are no breaking changes there.
Can you also update the documentation with the added columns to the XIC parquet format schema?
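The dataframe-equality check suggested above could be sketched roughly as follows. This is a hypothetical helper, not part of pyprophet; the example column names are assumptions.

```python
# Hypothetical sketch of the suggested equality check between the old
# (pandas-based) and new (DuckDB-based) exports. The helper name and the
# example column names are assumptions, not part of pyprophet.
import pandas as pd


def frames_match(old_df: pd.DataFrame, new_df: pd.DataFrame) -> bool:
    """Compare the two exports on the columns they share, ignoring
    columns that only the new exporter adds."""
    shared = [c for c in old_df.columns if c in new_df.columns]
    # Sort both frames so row order differences don't cause false failures.
    old_sorted = old_df[shared].sort_values(shared).reset_index(drop=True)
    new_sorted = new_df[shared].sort_values(shared).reset_index(drop=True)
    try:
        pd.testing.assert_frame_equal(old_sorted, new_sorted, check_dtype=False)
        return True
    except AssertionError:
        return False
```

In an actual test, one would load both parquet files with `pd.read_parquet` and assert that `frames_match` returns `True`.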
Pull Request Overview
This PR refactors the sqMass to parquet export functionality to improve memory efficiency by bypassing pandas DataFrames and using DuckDB directly for data processing. The implementation removes the need to load data into memory as DataFrames; the old build ran out of memory on an 11GB sqMass file, while the new approach reportedly requires only 21GB.
Key changes:
- Replaced pandas-based data reading with direct DuckDB SQL queries
- Moved the shared `_execute_copy_query` method to the base class for reusability
- Implemented a complex SQL query to join sqMass and PQP data without intermediate DataFrames
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `pyprophet/io/export/sqmass.py` | Replaced pandas-based export with a DuckDB SQL query approach, removed the reader dependency, added a comprehensive SQL query for data extraction |
| `pyprophet/io/export/osw.py` | Removed the duplicate `_execute_copy_query` method that was moved to the base class |
| `pyprophet/io/_base.py` | Added the shared `_execute_copy_query` method to the base class for reuse across exporters |
Co-authored-by: Justin Sing <32938975+singjc@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ity" This reverts commit 9cdef04.
Here are the new and old converted XICs. Examining them on my end, they are the same.

Tested the above .parquet with arycal, and it seems to work on my end. If you want to test it yourself, the .osw is just the test one for massdash (tests/test_data/osw).
singjc left a comment:
Great, thanks for adding the changes and testing. One last thing: can you please update the format schema in the documentation in `docs/file_formats.rst` with the added columns for transition annotation.
This streamlines sqMass reading and writing to parquet (bypassing the need for pandas DataFrames) in order to make the process more memory efficient.
I have not benchmarked against the old method, but this method requires 21GB to convert an 11GB .sqMass file, whereas I was running into memory errors with the old build.
I kept the old reader functions in case there is a need to read into a pandas DataFrame; however, I am not sure whether these would be used anywhere, so it might be worth just removing them.
I am not sure whether any tests currently cover the sqMass conversion functionality, so I hope I did not break anything.
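As a rough illustration of the pandas-free approach described above, the export can be phrased as a single DuckDB COPY statement that reads the SQLite files via `sqlite_scan` and writes parquet directly. The table and column names below are placeholders, not the actual sqMass/PQP schema, and the real query in this PR is considerably larger.

```python
# Rough illustration of the pandas-free export: one DuckDB COPY statement
# reading the SQLite files via sqlite_scan and writing parquet directly.
# Table and column names are placeholders; the real query is much larger.
def build_xic_export_sql(sqmass_path: str, pqp_path: str, out_path: str) -> str:
    return f"""
    COPY (
        SELECT c.NATIVE_ID,
               d.DATA,
               t.ANNOTATION
        FROM sqlite_scan('{sqmass_path}', 'CHROMATOGRAM') AS c
        JOIN sqlite_scan('{sqmass_path}', 'DATA') AS d
          ON d.CHROMATOGRAM_ID = c.ID
        JOIN sqlite_scan('{pqp_path}', 'TRANSITION') AS t
          ON t.ID = CAST(c.NATIVE_ID AS INTEGER)
    ) TO '{out_path}' (FORMAT PARQUET)
    """
```

Since DuckDB executes the whole join and parquet write inside its own streaming engine, peak memory stays bounded by its buffer pool rather than by the size of a materialized DataFrame, which is consistent with the 21GB figure reported above.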