Hi,
The available pretrained models ("small" and "large") appear to accept up to 10,000 CpG sites per sample (max_length). There also seems to be a cap in the RoPE (rotary position embedding) calculation, with max_seq_len set to 20,001.
I plan to use CpGPT with whole-genome sequencing methylation data, which contains approximately 4 million CpG sites per sample. Could you please suggest how to fine-tune the pretrained models to obtain sample-level embeddings that account for such a large number of CpG sites, rather than losing information by restricting each sample to a 10,000-site subset?
Would it be reasonable to use a windowing approach: generate embeddings with the pretrained model over overlapping or adjacent segments (each of length max_length, i.e., 5K or 10K depending on whether the "small" or "large" model is used), and then aggregate the resulting window embeddings per sample (e.g., via mean pooling or attention pooling)? A rough sketch of what I have in mind is below. My goal is to leverage the methylation representations learned by the pretrained models rather than training a new model from scratch.
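To make the idea concrete, here is a minimal sketch of the windowing-and-pooling scheme. Note that `embed_window` is a hypothetical placeholder for whatever CpGPT call produces a window-level embedding; the name, signature, and the `window`/`stride` defaults are my assumptions, not the actual CpGPT API.

```python
import torch


def windowed_sample_embedding(
    betas: torch.Tensor,      # (n_sites,) methylation beta values for one sample
    positions: torch.Tensor,  # (n_sites,) matching genomic positions / site IDs
    embed_window,             # HYPOTHETICAL: callable (betas, positions) -> (d,) embedding
    window: int = 10_000,     # assumed max_length of the pretrained model
    stride: int = 5_000,      # stride < window gives overlapping segments
) -> torch.Tensor:
    """Embed each window with the pretrained model, then mean-pool per sample."""
    n = betas.numel()
    embs = []
    for start in range(0, n, stride):
        end = min(start + window, n)
        with torch.no_grad():  # frozen pretrained encoder; no fine-tuning here
            embs.append(embed_window(betas[start:end], positions[start:end]))
        if end == n:  # last (possibly shorter) window reached
            break
    # Simple mean pooling over window embeddings -> one (d,) sample embedding
    return torch.stack(embs).mean(dim=0)


class AttentionPool(torch.nn.Module):
    """Learnable alternative to mean pooling over window embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = torch.nn.Linear(dim, 1)

    def forward(self, window_embs: torch.Tensor) -> torch.Tensor:
        # window_embs: (n_windows, d) -> softmax weights over windows -> (d,)
        weights = torch.softmax(self.score(window_embs), dim=0)
        return (weights * window_embs).sum(dim=0)
```

Mean pooling would be training-free, while the attention-pooling head would need a small amount of fine-tuning on a downstream objective; I would be happy to hear which aggregation you would expect to work better with CpGPT's representations.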
Thanks,
Tarak