ESM-2 is a protein language model (pLM) trained with unsupervised masked language modelling on 250 million protein sequences by researchers at Facebook AI Research (FAIR).
It is available in several sizes, ranging from 8 million to 15 billion parameters. The smaller models are suitable for a variety of sequence- and token-classification tasks. The FAIR team also adapted the 3 billion parameter version into the ESMFold protein structure prediction algorithm.
In this repository, I have used the ESM-2 model for the following tasks:
- Masked Language Modeling: the mutation_effect_enzyme notebook calculates the mutation score introduced in the paper "Language models enable zero-shot prediction of the effects of mutations on protein function". It ranks the functional impact of mutated sequences relative to the original (wild-type) sequence (see the first sketch after this list).
- Sequence Classification
  - Training a small model (model_name = "facebook/esm2_t6_8M_UR50D"): the sequence_classification_small notebook predicts whether a protein is (1) an enzyme, (2) a receptor protein, or (3) a structural protein. The dataset was taken from Amelie-Schreiber. This model can be trained on Google Colab (see the second sketch after this list).
  - Training a larger model (model_name = "facebook/esm2_t33_650M_UR50D"): running this model requires more compute than Google Colab offers, so I ran the sequence_classification_large notebook on AWS SageMaker; the GPU instances used are mentioned in the notebook (see the third sketch after this list). The specific problem addressed here is subcellular localization: given a protein sequence, can we build a model that predicts whether the protein lives on the outside (cell membrane) or the inside of a cell? This information helps us understand a protein's function and whether it would make a good drug target. The notebook is adapted from aws-healthcare-lifescience-ai-ml-sample-notebooks.
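Below is a minimal sketch of the masked-marginal scoring the paper describes, assuming the Hugging Face transformers API and the small 8M checkpoint; the sequence and the mutation in the example are toy placeholders, and the actual implementation lives in the mutation_effect_enzyme notebook.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def mutation_score(wt_seq: str, pos: int, wt_aa: str, mut_aa: str) -> float:
    """Masked-marginal score: log p(mutant aa) - log p(wild-type aa)
    at position `pos` (0-indexed), with that position masked out."""
    assert wt_seq[pos] == wt_aa, "wild-type residue mismatch"
    inputs = tokenizer(wt_seq, return_tensors="pt")
    tok_pos = pos + 1  # the ESM tokenizer prepends a CLS token
    inputs["input_ids"][0, tok_pos] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[0, tok_pos], dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt_aa)
    mut_id = tokenizer.convert_tokens_to_ids(mut_aa)
    # Positive score: the model prefers the mutant residue at this position.
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# Toy example: score the substitution A4G in a made-up sequence.
print(mutation_score("MKTAYIAKQR", 3, "A", "G"))
```

For a multi-point mutant, the per-position scores are summed, which is how the paper ranks variants against the wild-type sequence.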
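The small-model classification task follows the standard transformers fine-tuning recipe. The sketch below shows its general shape, assuming AutoModelForSequenceClassification with three labels; the toy sequences and labels stand in for the Amelie-Schreiber dataset used in the notebook.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

model_name = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Toy stand-in for the real dataset: sequence -> class id
# (0 = enzyme, 1 = receptor protein, 2 = structural protein).
data = Dataset.from_dict({
    "sequence": ["MKTAYIAKQR", "GAVLIPFWMS", "MSTNPKPQRK"],
    "label": [0, 1, 2],
})

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=1024)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="esm2_seq_cls",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
)
trainer = Trainer(model=model, args=args, train_dataset=data, tokenizer=tokenizer)
trainer.train()
```

Passing the tokenizer to the Trainer gives dynamic padding per batch, which matters for protein sequences of very uneven lengths.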
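For the larger 650M model, the same fine-tuning is submitted as a SageMaker training job rather than run in-process. Here is a rough sketch of launching such a job with the SageMaker Hugging Face estimator; the entry-point script, instance type, framework versions, and S3 path are illustrative assumptions, not the notebook's actual values.

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # IAM role of the SageMaker session

estimator = HuggingFace(
    entry_point="train.py",         # hypothetical training script
    source_dir="scripts",
    instance_type="ml.g5.2xlarge",  # example GPU instance; see the notebook
    instance_count=1,
    role=role,
    transformers_version="4.28",
    pytorch_version="2.0",
    py_version="py310",
    hyperparameters={
        "model_name": "facebook/esm2_t33_650M_UR50D",
        "epochs": 3,
    },
)

# Train on data previously uploaded to S3 (path is a placeholder).
estimator.fit({"train": "s3://my-bucket/subcellular-localization/train"})
```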