Fine-tuning CpGPT on whole genome sequencing methylation data with ~4M CpG sites per sample #5

@tnnandi

Description

Hi,

The available pretrained models ("small" and "large") appear to support up to 10,000 CpG sites (max_length) per sample. There also seems to be a limit in the RoPE embedding calculation, with max_seq_len set to 20,001.

I plan to use CpGPT with whole-genome sequencing methylation data, which contains approximately 4 million CpG sites per sample. Could you please suggest how to fine-tune the pretrained models to obtain sample-level embeddings that account for such a large number of CpG sites, without losing information by restricting each sample to a 10,000-site subset?

Would it be reasonable to use a windowing approach, where I generate embeddings with the pretrained model over overlapping or adjacent segments (of length max_length = 5K or 10K, depending on the "small" or "large" pretrained model), and then aggregate the resulting embeddings per sample (e.g., via mean pooling, attention pooling, etc.)? My goal is to leverage the methylation representations learned by the pretrained models rather than training a new model from scratch.
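For concreteness, here is a minimal sketch of the windowing-plus-pooling idea I have in mind. Note that `embed_window` is a hypothetical placeholder for a forward pass through the pretrained CpGPT encoder (it is not the actual CpGPT API); the default implementation just computes summary statistics so the snippet runs standalone.

```python
import numpy as np

def windowed_sample_embedding(betas, window_size=10_000, stride=10_000,
                              embed_window=None):
    """Build one sample-level embedding from ~4M CpG sites.

    betas        : 1-D array of methylation beta values, ordered by
                   genomic position.
    window_size  : max sites per forward pass (model's max_length).
    stride       : step between window starts; set stride < window_size
                   to get overlapping windows.
    embed_window : callable mapping a window (<= window_size sites) to a
                   fixed-size vector -- a stand-in for the pretrained
                   encoder's forward pass.
    """
    if embed_window is None:
        # Placeholder "encoder": per-window summary stats, used here only
        # so the sketch runs without the real model.
        embed_window = lambda w: np.array([w.mean(), w.std()])

    window_embeddings = []
    for start in range(0, len(betas), stride):
        window = betas[start:start + window_size]
        if len(window) == 0:
            break
        window_embeddings.append(embed_window(window))

    # Mean pooling across windows; an attention-pooling head trained
    # during fine-tuning could replace this simple average.
    return np.mean(np.stack(window_embeddings), axis=0)
```

With stride equal to window_size this gives adjacent segments; a smaller stride gives overlap, at the cost of proportionally more forward passes (~400 windows per sample at 10K sites per window for 4M sites).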

Thanks,
Tarak
