Adding the option of regressing out variables when preprocessing count matrix with harmony #127

Open
aregano wants to merge 13 commits into dylkot:main from aregano:main

aregano commented Jan 13, 2026

Greetings,

First, thanks a mil for developing such a straightforward package for computing cNMF! I am very satisfied with how it processes the pediatric brain cancer datasets I am working on, especially since you added the preprocess_for_cnmf() function, which harmonizes the counts matrix across samples prior to cNMF computation.

In that function I noticed there is no option to regress out specific variables, which is standard in harmony pipelines. I assume it was left out because the counts matrix needs to go through log1p normalization before the linear regression model provided in scanpy can be applied.

I have modified preprocess_for_cnmf() by adding the regressing_vars argument. In brief, I regress the variables out of the log1p-transformed counts matrix and then revert the result to the scale of the normalize_total() step, so the rest of the pipeline can be carried out normally. I believe that should suffice to make it effective, and it adds an extra tool that can improve the analysis (as it definitely did with mine).

Best,
Alvaro
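
For readers following the thread, the idea described in the comment above can be sketched roughly as follows. This is not the code from the PR: plain NumPy least squares stands in for scanpy's regression, the helper name `regress_out_log_space` and its arguments are hypothetical, and the choices to keep each gene's mean log expression and to clip after back-transforming are assumptions made for illustration.

```python
import numpy as np

def regress_out_log_space(norm_counts, covariates):
    """Hypothetical helper: remove covariate effects in log1p space.

    norm_counts: (cells x genes) non-negative, library-size-normalized counts.
    covariates:  (cells,) or (cells x k) numeric variables to regress out.
    Returns a non-negative matrix on the same scale as norm_counts.
    """
    norm_counts = np.asarray(norm_counts, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    if covariates.ndim == 1:
        covariates = covariates[:, None]

    # Work in log1p space, where a linear model is a more reasonable fit.
    log_x = np.log1p(norm_counts)

    # Center the covariates so that removing their effect leaves each
    # gene's mean log expression unchanged (an assumption of this sketch).
    centered = covariates - covariates.mean(axis=0)
    design = np.column_stack([np.ones(len(centered)), centered])

    # Ordinary least squares, fitted for all genes at once.
    beta, *_ = np.linalg.lstsq(design, log_x, rcond=None)

    # Subtract only the covariate contribution, not the intercept.
    corrected_log = log_x - centered @ beta[1:]

    # Invert log1p and clip, since expm1 of a negative value is negative.
    return np.clip(np.expm1(corrected_log), 0.0, None)
```

Keeping the intercept is one design choice among several; a residual-only correction would center every gene at zero in log space and change what the back-transformed values mean.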

dylkot (Owner) commented Jan 14, 2026

Hi, thanks for this! My concern about regressing out signals is not only the log1p normalization but also that regressing out variables could produce negative values. I suppose that if the regression is done in log space and the result is then exponentiated, it shouldn't end up with negatives. But I am still somewhat worried about making corrections in log space and then transforming back. I'll look into this!
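
As a small illustration of the negative-value concern, the toy values below (made up for this sketch, not taken from the PR) show that the back-transform matters: `exp` always returns positive numbers, while `expm1`, the exact inverse of log1p, returns a negative number for any negative input unless the result is clipped.

```python
import math

# Hypothetical log-space values after a regression step; residuals
# from a linear model can be negative.
residuals = [-1.5, -0.3, 0.0, 0.8]

# exp maps every real number to a strictly positive value.
via_exp = [math.exp(r) for r in residuals]

# expm1 inverts log1p exactly, but is negative whenever r < 0.
via_expm1 = [math.expm1(r) for r in residuals]

# Clipping at zero is one possible way to restore non-negativity.
clipped = [max(v, 0.0) for v in via_expm1]
```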

aregano (Author) commented Jan 14, 2026

Exactly, doing it this way you do not get negative values. I was skeptical too, but when I tested it on my dataset the results were fantastic. Before regressing out, the stability plot suggested 13 programs, 2 of which were a cycling program and a strong mt-related program. After regressing out, the highest stability was at 11 programs, with those 2 removed. The 11 programs were also quite similar to the ones produced before regressing out (judging by the top 100 genes per program). I can share the two notebooks I produced if you would be interested in checking this out :)
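
The top-100-gene comparison mentioned above could be made quantitative with something like the sketch below. It assumes each run has been summarized as a dict mapping a program name to its genes ranked by weight; the function names `jaccard` and `best_matches` and the toy inputs are hypothetical and not part of the package or the PR.

```python
def jaccard(a, b):
    """Jaccard index of two gene collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(before, after, top_n=100):
    """For each program in `after`, find the most similar program in `before`.

    before, after: dict mapping program name -> list of genes ranked by weight.
    Returns a dict: after_name -> (best_before_name, jaccard_score).
    """
    matches = {}
    for name_after, genes_after in after.items():
        scores = {
            name_before: jaccard(genes_after[:top_n], genes_before[:top_n])
            for name_before, genes_before in before.items()
        }
        best = max(scores, key=scores.get)
        matches[name_after] = (best, scores[best])
    return matches

# Toy usage with three-gene programs (placeholder data, not real results).
before = {"P1": ["A", "B", "C"], "cycling": ["MKI67", "TOP2A", "X"]}
after = {"Q1": ["A", "B", "D"]}
result = best_matches(before, after, top_n=3)
```

A program from the first run that no program in the second run matches well would be a candidate for one that the regression removed.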
