
Overfitting/poor generalization #12

@jesspeers

Description


Hi,

Thanks so much for such great documentation/tutorials for ExplaiNN! I'm aiming to use ExplaiNN to predict chromatin accessibility by training it on ATAC-seq peaks.

Despite following the tutorials closely, I am struggling to get the model to perform well on a test data set.

When trained on 59,521 positive sequences (ATAC-seq peaks from mouse chromosomes 2 and above) and 679,671 negative sequences (windows with no overlap with any ATAC-seq peak, matched by size and GC content), the model performs very well: aucROC = 0.998 and aucPR = 0.999.

However, when I then test this model on a distinct dataset of 1,799 positives and 397,802 negatives from mouse chromosome 1 only, the model performs much worse: aucROC = 0.759, aucPR = 0.011. Based on my (beginner's) understanding of ML, this suggests the model is overfitting/memorising the training set and does not generalise to other datasets.
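To put the aucPR number in context: a random classifier's aucPR equals the positive prevalence, which for the chromosome 1 set is 1,799 / 399,601 ≈ 0.0045, so 0.011 is above chance but still far below the training figure. A minimal sketch with scikit-learn (the toy labels/scores below are made up for illustration; only the class counts come from my datasets):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy stand-in for the chr1 test set at a scaled-down ~1:221 ratio;
# the real set has 1,799 positives and 397,802 negatives.
rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(18), np.zeros(3978)])
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, y_true.size), 0, 1)

auc_roc = roc_auc_score(y_true, y_score)
auc_pr = average_precision_score(y_true, y_score)  # aucPR

# Chance-level aucPR is the positive prevalence, not 0.5:
prevalence = 1_799 / (1_799 + 397_802)
print(f"random-classifier aucPR baseline: {prevalence:.4f}")
```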

I have tried to reduce overfitting by decreasing the number of fully connected layers (2 → 1), decreasing the number of units (100 → 50), and increasing the patience (to 5), but this did not solve the issue.

I am aware that my dataset is highly imbalanced and potentially quite small for this application, but I was wondering if you could advise on what I should try next to improve model performance.
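One thing I have been considering for the imbalance (a generic mitigation, not something from the ExplaiNN tutorials) is upweighting positives in the loss by the training-set negative/positive ratio, e.g. via PyTorch's `BCEWithLogitsLoss` `pos_weight` argument. A sketch, assuming a single-task binary setup:

```python
import torch
import torch.nn as nn

# Generic class-imbalance mitigation (not ExplaiNN-specific):
# weight positives by the train-set neg/pos ratio.
n_pos, n_neg = 59_521, 679_671              # counts from my training set
pos_weight = torch.tensor([n_neg / n_pos])  # ~11.42

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Toy logits/labels just to show the call; shape (batch,)
logits = torch.tensor([2.0, -1.5, 0.3])
labels = torch.tensor([1.0, 0.0, 1.0])
loss = criterion(logits, labels)
print(f"weighted BCE loss: {loss.item():.3f}")
```

Would this kind of loss weighting (or, alternatively, downsampling the negatives) be sensible here?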

I'd really appreciate any suggestions you have! Please let me know if you need any more information from me.

Many thanks,
Jess
