
Overfitting/poor generalization #12

@jesspeers

Description


Hi,

Thanks so much for such great documentation/tutorials for ExplaiNN! I'm aiming to use ExplaiNN to predict chromatin accessibility by training it on ATAC-seq peaks.

Despite following the tutorials closely, I am struggling to get the model to perform well on a test data set.

When trained on 59,521 positive sequences (ATAC-seq peaks from mouse chromosomes 2 and above) and 679,671 negative sequences (windows with no overlap with any ATAC-seq peak, matched by size and GC content), the model performs very well: aucROC = 0.998 and aucPR = 0.999.

However, when I then test this model on a distinct dataset of 1,799 positives and 397,802 negatives from mouse chromosome 1 only, the model performs much worse: aucROC = 0.759, aucPR = 0.011. Based on my (beginner's) understanding of ML, this suggests the model is overfitting/memorising the training set and does not generalise to other datasets.
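To put the aucPR number in context: a random classifier's aucPR equals the positive prevalence, which for the chromosome 1 set is 1,799 / 399,601 ≈ 0.0045, so 0.011 is above chance but still far below the training figure. A minimal sketch with scikit-learn (the toy labels/scores below are made up for illustration; only the class counts come from my datasets):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy stand-in for the chr1 test set at a scaled-down ~1:221 ratio;
# the real set has 1,799 positives and 397,802 negatives.
rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(18), np.zeros(3978)])
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.2, y_true.size), 0, 1)

auc_roc = roc_auc_score(y_true, y_score)
auc_pr = average_precision_score(y_true, y_score)  # aucPR

# Chance-level aucPR is the positive prevalence, not 0.5:
prevalence = 1_799 / (1_799 + 397_802)
print(f"random-classifier aucPR baseline: {prevalence:.4f}")
```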

I have tried to reduce overfitting by decreasing the number of fully connected layers (2 → 1), decreasing the number of units (100 → 50), and increasing the patience (to 5), but this did not solve the issue.

I am aware that my dataset is highly imbalanced and potentially quite small for this application, but I was wondering if you could advise on what I should try next to improve model performance.
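One thing I have been considering for the imbalance (a generic mitigation, not something from the ExplaiNN tutorials) is upweighting positives in the loss by the training-set negative/positive ratio, e.g. via PyTorch's `BCEWithLogitsLoss` `pos_weight` argument. A sketch, assuming a single-task binary setup:

```python
import torch
import torch.nn as nn

# Generic class-imbalance mitigation (not ExplaiNN-specific):
# weight positives by the train-set neg/pos ratio.
n_pos, n_neg = 59_521, 679_671              # counts from my training set
pos_weight = torch.tensor([n_neg / n_pos])  # ~11.42

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Toy logits/labels just to show the call; shape (batch,)
logits = torch.tensor([2.0, -1.5, 0.3])
labels = torch.tensor([1.0, 0.0, 1.0])
loss = criterion(logits, labels)
print(f"weighted BCE loss: {loss.item():.3f}")
```

Would this kind of loss weighting (or, alternatively, downsampling the negatives) be sensible here?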

I'd really appreciate any suggestions you have! Please let me know if you need any more information from me.

Many thanks,
Jess
