Smaller Memory, Smaller Storage footprint and Faster Custom Dataset Creation #98

Open
bio-info-guy wants to merge 7 commits into snap-stanford:master from bio-info-guy:preprocess_mem_reduce

@bio-info-guy

Modifications to PertData for smaller memory use, a smaller storage footprint, and faster custom dataset creation

  • Use a new GearsDataset class after data splitting to generate data objects on the fly.
  • Add a low_mem option that first stores the indices of x, y pairs instead of the actual x and y. PyG pickle objects no longer need to be created; dataset_processed is built directly each time, either by loading the h5ad or by creating a new dataset.
  • Add more adata filtering steps during dataset creation:
    • Filter out singlet condition_names and contexts/cell_lines with no ctrl/ko pairings (prevents some errors during DEG calculation on custom datasets).
    • Filter adata during new_data_process to include only cells whose perturbation is in the gene GO graph.
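
For readers unfamiliar with the index-based approach, here is a minimal sketch of the idea behind low_mem: keep only (control, perturbed) row indices and build each x, y pair when it is requested. This is not the code in this PR. `LazyPairDataset`, `expression`, and `pairs` are hypothetical names used only for illustration, and plain Python lists stand in for the expression matrix in adata.

```python
from typing import List, Sequence, Tuple


class LazyPairDataset:
    """Hypothetical index-backed dataset (illustration only, not the PR's GearsDataset)."""

    def __init__(self, expression: Sequence[Sequence[float]],
                 pairs: List[Tuple[int, int]]):
        # One shared reference to the expression rows; nothing is copied here.
        self.expression = expression
        # Each entry is (control_row_index, perturbed_row_index).
        self.pairs = pairs

    def __len__(self) -> int:
        return len(self.pairs)

    def __getitem__(self, i: int) -> Tuple[List[float], List[float]]:
        # Materialize x (control) and y (perturbed) only when the item is requested.
        ctrl_idx, pert_idx = self.pairs[i]
        x = list(self.expression[ctrl_idx])
        y = list(self.expression[pert_idx])
        return x, y


# Toy data: two control cells and one perturbed cell, three genes each.
expression = [
    [0.0, 1.0, 2.0],  # cell 0: control
    [0.5, 1.5, 2.5],  # cell 1: control
    [3.0, 0.0, 2.0],  # cell 2: perturbed
]
pairs = [(0, 2), (1, 2)]  # each control cell paired with the perturbed cell

dataset = LazyPairDataset(expression, pairs)
x, y = dataset[1]
```

With this layout the stored state grows with the number of index pairs rather than with the number of materialized x, y copies, which is the kind of trade-off the low_mem option is aiming for.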

@bio-info-guy marked this pull request as ready for review on November 13, 2025, 04:16