Hi,
The available pretrained models ("small" and "large") appear to accept up to 10,000 CpG sites per sample (max_length). There also seems to be a cap in the RoPE (rotary position embedding) calculation, with max_seq_len set to 20,001.
I plan to use CpGPT with whole-genome sequencing methylation data, which contains approximately 4 million CpG sites per sample. Could you please suggest how to fine-tune the pretrained models to obtain sample-level embeddings that account for such a large number of CpG sites, rather than losing information by restricting each sample to a 10,000-site subset?
Would it be reasonable to use a windowing approach: generate embeddings with the pretrained model over overlapping or adjacent segments (each of length max_length, i.e., 5K or 10K depending on whether the "small" or "large" model is used), and then aggregate the resulting window embeddings per sample (e.g., via mean pooling or attention pooling)? A rough sketch of what I have in mind is below. My goal is to leverage the methylation representations learned by the pretrained models rather than training a new model from scratch.
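To make the idea concrete, here is a minimal sketch of the windowing-and-pooling scheme. Note that `embed_window` is a hypothetical placeholder for whatever CpGPT call produces a window-level embedding; the name, signature, and the `window`/`stride` defaults are my assumptions, not the actual CpGPT API.

```python
import torch


def windowed_sample_embedding(
    betas: torch.Tensor,      # (n_sites,) methylation beta values for one sample
    positions: torch.Tensor,  # (n_sites,) matching genomic positions / site IDs
    embed_window,             # HYPOTHETICAL: callable (betas, positions) -> (d,) embedding
    window: int = 10_000,     # assumed max_length of the pretrained model
    stride: int = 5_000,      # stride < window gives overlapping segments
) -> torch.Tensor:
    """Embed each window with the pretrained model, then mean-pool per sample."""
    n = betas.numel()
    embs = []
    for start in range(0, n, stride):
        end = min(start + window, n)
        with torch.no_grad():  # frozen pretrained encoder; no fine-tuning here
            embs.append(embed_window(betas[start:end], positions[start:end]))
        if end == n:  # last (possibly shorter) window reached
            break
    # Simple mean pooling over window embeddings -> one (d,) sample embedding
    return torch.stack(embs).mean(dim=0)


class AttentionPool(torch.nn.Module):
    """Learnable alternative to mean pooling over window embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = torch.nn.Linear(dim, 1)

    def forward(self, window_embs: torch.Tensor) -> torch.Tensor:
        # window_embs: (n_windows, d) -> softmax weights over windows -> (d,)
        weights = torch.softmax(self.score(window_embs), dim=0)
        return (weights * window_embs).sum(dim=0)
```

Mean pooling would be training-free, while the attention-pooling head would need a small amount of fine-tuning on a downstream objective; I would be happy to hear which aggregation you would expect to work better with CpGPT's representations.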
Thanks,
Tarak