feature: more efficient sqMass to parquet export #158
Conversation
singjc left a comment:
Thanks for the optimization! I just made a few comments, mainly about the large export query builder: can you break it into smaller sub-query builders? Unfortunately, there are no unit tests for the sqMass exporter; we should probably add one.
For now, can you test these changes in two ways, if you don't mind:
- Can you use the old export method on a smaller sqMass file (one small enough not to OOM) and test for dataframe equality (minus the added columns) to make sure the outputs are the same.
- Can you load the converted XIC parquet file and the feature OSW (SQLite) file in the arycal-gui visualization tab to make sure it doesn't break anything there. I use the XIC parquet file for the alignment and the visualization GUI, so I want to make sure there are no breaking changes there.
Can you also update the documentation with the added columns to the XIC parquet format schema?
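The dataframe-equality check suggested above could be sketched roughly as follows. This is a hypothetical helper, not part of pyprophet; the example column names are assumptions.

```python
# Hypothetical sketch of the suggested equality check between the old
# (pandas-based) and new (DuckDB-based) exports. The helper name and the
# example column names are assumptions, not part of pyprophet.
import pandas as pd


def frames_match(old_df: pd.DataFrame, new_df: pd.DataFrame) -> bool:
    """Compare the two exports on the columns they share, ignoring
    columns that only the new exporter adds."""
    shared = [c for c in old_df.columns if c in new_df.columns]
    # Sort both frames so row order differences don't cause false failures.
    old_sorted = old_df[shared].sort_values(shared).reset_index(drop=True)
    new_sorted = new_df[shared].sort_values(shared).reset_index(drop=True)
    try:
        pd.testing.assert_frame_equal(old_sorted, new_sorted, check_dtype=False)
        return True
    except AssertionError:
        return False
```

In an actual test, one would load both parquet files with `pd.read_parquet` and assert that `frames_match` returns `True`.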
Pull Request Overview
This PR refactors the sqMass to parquet export functionality to improve memory efficiency by bypassing pandas DataFrames and using DuckDB directly for data processing. The implementation removes the need to load data into memory as DataFrames; the old build ran out of memory on an 11GB sqMass file, while the new approach reportedly requires only 21GB.
Key changes:
- Replaced pandas-based data reading with direct DuckDB SQL queries
- Moved the shared `_execute_copy_query` method to the base class for reusability
- Implemented a complex SQL query to join sqMass and PQP data without intermediate DataFrames
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `pyprophet/io/export/sqmass.py` | Replaced pandas-based export with a DuckDB SQL query approach, removed the reader dependency, added a comprehensive SQL query for data extraction |
| `pyprophet/io/export/osw.py` | Removed the duplicate `_execute_copy_query` method that was moved to the base class |
| `pyprophet/io/_base.py` | Added the shared `_execute_copy_query` method to the base class for reuse across exporters |
Co-authored-by: Justin Sing <32938975+singjc@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…ity" This reverts commit 9cdef04.
Here are the new and old converted XICs. Examining them on my end, they are the same.

Tested the above .parquet with arycal, and it seems to work on my end. If you want to test it yourself, the .osw is just the test one for massdash (tests/test_data/osw).
singjc left a comment:
Great, thanks for adding the changes and testing. One last thing: can you please update the format schema in the documentation in `docs/file_formats.rst` with the added columns for transition annotation.
This streamlines sqMass reading and writing to parquet (bypassing the need for pandas DataFrames) in order to make the process more memory efficient.
I have not benchmarked against the old method, but this method requires 21GB to convert an 11GB .sqMass file, whereas I was running into memory errors with the old build.
I kept the old reader functions in case there is a need to read into a pandas DataFrame; however, I am not sure whether these would be used anywhere, so it might be worth just removing them.
I am not sure whether any tests currently cover the sqMass conversion functionality, so I hope I did not break anything.
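As a rough illustration of the pandas-free approach described above, the export can be phrased as a single DuckDB COPY statement that reads the SQLite files via `sqlite_scan` and writes parquet directly. The table and column names below are placeholders, not the actual sqMass/PQP schema, and the real query in this PR is considerably larger.

```python
# Rough illustration of the pandas-free export: one DuckDB COPY statement
# reading the SQLite files via sqlite_scan and writing parquet directly.
# Table and column names are placeholders; the real query is much larger.
def build_xic_export_sql(sqmass_path: str, pqp_path: str, out_path: str) -> str:
    return f"""
    COPY (
        SELECT c.NATIVE_ID,
               d.DATA,
               t.ANNOTATION
        FROM sqlite_scan('{sqmass_path}', 'CHROMATOGRAM') AS c
        JOIN sqlite_scan('{sqmass_path}', 'DATA') AS d
          ON d.CHROMATOGRAM_ID = c.ID
        JOIN sqlite_scan('{pqp_path}', 'TRANSITION') AS t
          ON t.ID = CAST(c.NATIVE_ID AS INTEGER)
    ) TO '{out_path}' (FORMAT PARQUET)
    """
```

Since DuckDB executes the whole join and parquet write inside its own streaming engine, peak memory stays bounded by its buffer pool rather than by the size of a materialized DataFrame, which is consistent with the 21GB figure reported above.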