Smaller Memory, Smaller Storage footprint and Faster Custom Dataset Creation by bio-info-guy · Pull Request #98 · snap-stanford/GEARS

bio-info-guy · 2025-11-13T04:13:51Z

Modifications to PertData for a smaller memory footprint, a smaller storage footprint, and faster custom dataset creation

Use the new GearsDataset class after data splitting to generate data objects on the fly.
Add a low_mem option that first stores the indices of x, y pairs instead of the actual x and y values. PyG pickle objects no longer need to be created; dataset_processed is built directly each time, either when loading the h5ad or when creating a new dataset.
Add more adata filtering steps during dataset creation:
- Filter out singlet condition_names and contexts/cell lines with no ctrl/ko pairings (prevents some errors during DEG calculation on custom datasets).
- Filter adata during new_data_process to include only cells whose perturbation is in the gene GO graph.
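
The index-based, on-the-fly approach can be illustrated with a minimal, library-free sketch. This is not the PR's GearsDataset: the class name `LazyPairDataset`, its constructor arguments, and the dictionary it returns are all hypothetical, and plain lists stand in for the expression matrix and the PyG objects. It only shows why storing (control, perturbed) index pairs is cheaper than materializing every x, y pair up front.

```python
from typing import Dict, List, Sequence, Tuple


class LazyPairDataset:
    """Hypothetical stand-in for an index-backed dataset.

    Only (ctrl_idx, pert_idx) pairs are stored; the x/y rows are looked up
    in the shared expression matrix when a sample is requested.
    """

    def __init__(
        self,
        expression: Sequence[Sequence[float]],
        pairs: List[Tuple[int, int]],
    ) -> None:
        self.expression = expression  # one row per cell, shared by all samples
        self.pairs = pairs            # small: two integers per sample

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, i: int) -> Dict[str, Sequence[float]]:
        ctrl_idx, pert_idx = self.pairs[i]
        # Build the sample on access instead of reading it from a pickle.
        return {"x": self.expression[ctrl_idx], "y": self.expression[pert_idx]}


# Toy data: two control cells and one perturbed cell (values are made up).
expression = [
    [1.0, 0.0, 2.0],  # cell 0: control
    [0.5, 0.1, 1.9],  # cell 1: control
    [3.0, 0.0, 0.2],  # cell 2: perturbed
]
pairs = [(0, 2), (1, 2)]  # each control paired with the perturbed cell

dataset = LazyPairDataset(expression, pairs)
sample = dataset[1]
```

With this layout the perturbed row is stored once even though it appears in two samples, which is where the memory and storage savings would come from under these assumptions.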
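
The filtering steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the PR's code: cells are represented as dictionaries rather than an AnnData object, the function names `drop_unpaired_and_singlets` and `keep_go_perturbations` are hypothetical, a "singlet" is taken to mean a (cell line, condition) combination observed in exactly one cell, and conditions are assumed to use the `GENE+ctrl` / `GENE1+GENE2` naming.

```python
from collections import Counter
from typing import Dict, List, Set

Cell = Dict[str, str]  # hypothetical record: {"cell_line": ..., "condition": ...}


def drop_unpaired_and_singlets(cells: List[Cell]) -> List[Cell]:
    """Drop singlet conditions, then cell lines lacking ctrl or perturbed cells."""
    # Step 1: remove (cell_line, condition) groups that contain a single cell.
    counts = Counter((c["cell_line"], c["condition"]) for c in cells)
    kept = [c for c in cells if counts[(c["cell_line"], c["condition"])] > 1]

    # Step 2: keep only cell lines that still have both ctrl and non-ctrl cells.
    seen: Dict[str, Set[bool]] = {}
    for c in kept:
        seen.setdefault(c["cell_line"], set()).add(c["condition"] == "ctrl")
    paired = {line for line, flags in seen.items() if flags == {True, False}}
    return [c for c in kept if c["cell_line"] in paired]


def keep_go_perturbations(cells: List[Cell], go_genes: Set[str]) -> List[Cell]:
    """Keep cells whose perturbed genes are all present in the GO graph gene set."""
    def in_graph(condition: str) -> bool:
        genes = [g for g in condition.split("+") if g != "ctrl"]
        return all(g in go_genes for g in genes)

    return [c for c in cells if in_graph(c["condition"])]


# Toy input (made-up labels): line B has no ctrl cells, MYC+ctrl is a singlet.
cells = [
    {"cell_line": "A", "condition": "ctrl"},
    {"cell_line": "A", "condition": "ctrl"},
    {"cell_line": "A", "condition": "TP53+ctrl"},
    {"cell_line": "A", "condition": "TP53+ctrl"},
    {"cell_line": "A", "condition": "MYC+ctrl"},
    {"cell_line": "B", "condition": "TP53+ctrl"},
    {"cell_line": "B", "condition": "TP53+ctrl"},
]

filtered = drop_unpaired_and_singlets(cells)
in_go = keep_go_perturbations(filtered, go_genes={"TP53"})
```

Singlets are removed before the pairing check here so that a cell line whose only control (or only perturbed) cell was a singlet is also dropped; whether the PR applies the steps in this order is not stated in the description.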

bio-info-guy added 7 commits November 11, 2025 11:00

- faster and lower mem creation of cell graph

0ca85e2

- when processing new custom datasets, filter by genes in GO graph be…

6ac8b64

…fore saving to h5ad

- fix mistake when appending no-perturbation data-pair indices

886129d

- add a function to help filter out cells in contexts (cells) without…

c520df4

… any 'ctrl' or 'knockouts' - also filter out cells that have only are singlets in terms of condition_name

- add some print statements to show filtering

786ad14

- small fix to print statements

4841ef5

- revert print to print_sys

b772699

bio-info-guy marked this pull request as ready for review November 13, 2025 04:16

bio-info-guy wants to merge 7 commits into snap-stanford:master from bio-info-guy:preprocess_mem_reduce
