Codebase for the paper "OneProt: towards multi-modal protein foundation models via latent space alignment of sequence, structure, binding sites and text encoders".
- Lightning tutorial: https://lightning.ai/docs/pytorch/stable/tutorials.html
- Hydra tutorial: https://hydra.cc/docs/tutorials/intro/
This project is dedicated to advancing the understanding and application of various protein modalities: sequence; structure, represented both as graphs and as Foldseek tokens; binding pockets; and sequence similarity tuples based on mutational information and multiple sequence alignments (MSAs).
We aim to learn aligned embeddings for these different protein modalities, which can later be used for retrieval, prediction, and generation tasks on proteins.
The project started as a prototype at the Bio x ML Hackathon 2023, where it won the first prize and the impact prize; the initial version of the model is available here.
The weights of the models and example datasets are available on Hugging Face. OneProt aligns the following modalities:
- Sequence
- Structure
- Text
- Pockets
- Sequence similarity
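Since the encoders are trained to produce aligned embeddings, cross-modal retrieval reduces to nearest-neighbour search in the shared latent space. The minimal sketch below illustrates such a lookup with FAISS (which appears among the dependencies listed further down); the embedding dimension, the random arrays, and the encoder outputs are placeholders rather than the actual OneProt API.

```python
import faiss
import numpy as np

d = 1024  # assumed embedding dimension of the shared latent space

# Stand-in for an (N, d) array of structure-encoder embeddings for a database
# of N proteins; in practice these would come from the trained structure encoder.
structure_embeddings = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(structure_embeddings)       # cosine similarity via inner product

index = faiss.IndexFlatIP(d)                   # exact inner-product index
index.add(structure_embeddings)

# Query with a sequence embedding: because the latent spaces are aligned,
# the nearest structure embeddings should belong to related proteins.
query = np.random.rand(1, d).astype("float32") # stand-in for a sequence-encoder output
faiss.normalize_L2(query)

scores, ids = index.search(query, k=5)         # top-5 cross-modal hits
print(ids[0], scores[0])
```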
We only require datasets of paired modalities.
We used OpenProteinSet, which contains structures, sequences, and MSAs for proteins from the PDB, UniClust30, and UniProtKB/Swiss-Prot. We clustered the sequences with MMseqs2 using a sequence identity cut-off of 50%, so that each cluster represents a homologous cluster in protein fold space, and aligned the training, validation, and test splits along these sequence clusters. For each cluster representative and member, we use the sequence to find the structure in the AlphaFold2DB, the MSA in OpenProteinSet, and the binding pocket with P2Rank. As an MSA and a binding pocket could not be found for every protein, fewer data points are available for these modalities.

The sequence similarity dataset was constructed from ClinVar variant summary data and MSA data. Each data point consists of three pairs of sequences corresponding to the same protein: the original sequence and a sequence with a benign mutation; two distinct pathogenic sequences; and the original sequence and a sequence sampled from the corresponding MSA. Such a dataset enforces clustering of proteins based on their biological relevance, e.g. it moves pathogenic mutations away from benign ones.
| Modality 1 | Modality 2 | Dataset Size (Train/Val/Test) |
|---|---|---|
| Sequence | Structure Graph | 647781 / 1000 / 1000 |
| Sequence | Structure Token | 1000000 / 1000 / 1000 |
| Sequence | Text | 540077 / 1000 / 1000 |
| Sequence | Pockets | 335086 / 1000 / 1000 |
| Sequence | Sequence similarity | 1040560 / 1000 / 1000 |
Dataset splits, as well as the .h5 file for the pocket modality, are available on Zenodo; data for the sequence and structure modalities is also available in two parts: pt 1, pt 2.
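For illustration, a single data point of the sequence similarity dataset described above can be thought of as three sequence pairs of the same protein. The sketch below is only a schematic representation; the class and field names are assumptions, not the actual dataset schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SequenceSimilarityItem:
    """One data point of the sequence similarity modality (illustrative schema).

    All six sequences refer to the same protein.
    """
    benign_pair: Tuple[str, str]      # (original sequence, sequence with a benign mutation)
    pathogenic_pair: Tuple[str, str]  # (two distinct pathogenic sequences)
    msa_pair: Tuple[str, str]         # (original sequence, sequence sampled from the MSA)

# Toy example with truncated sequences.
item = SequenceSimilarityItem(
    benign_pair=("MKT...A", "MKT...V"),
    pathogenic_pair=("MKT...R", "MKT...W"),
    msa_pair=("MKT...A", "MRT...A"),
)
```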
Downstream Tasks:
We recommend using PyTorch 2.1.0 with CUDA 12.1 and the corresponding version of torch-geometric, which can be installed via

```bash
pip install torch_geometric
pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
```

The remaining package requirements are listed in the requirements.txt file.
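After installation, a quick sanity check can confirm that the PyTorch, CUDA, and PyG wheels match. The snippet below is an illustrative check, not part of the repository:

```python
import torch
import torch_geometric
import torch_scatter  # import fails if the wheel does not match the torch/CUDA build

print("torch:", torch.__version__)         # expected: 2.1.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)   # expected: 12.1
print("torch_geometric:", torch_geometric.__version__)
```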
A Singularity container containing most of the necessary packages is available from Zenodo. However, one still needs to create a small virtual environment on top of it. This requires

```bash
pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
```

as well as the following packages:
- wandb
- faiss-gpu
- transformers
- biopandas
A workflow for activating the required environment within a batch script may therefore look as follows:

```bash
srun --cpu-bind=none bash -c "export CUDA_VISIBLE_DEVICES=\"0,1,2,3\"; export PYTHONPATH=\"\"; export HYDRA_FULL_ERROR=1; apptainer run --nv singularity_docker_jupyter.sif bash -c \"source environment_folder/venv/bin/activate && python src/script_of_your_choice.py\""
```
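The training scripts follow the Lightning + Hydra setup referenced by the tutorials at the top of this README. Below is a minimal sketch of what such a Hydra-configured entry point could look like; the file name, config path, config name, and config keys are assumptions, not the repository's actual layout.

```python
# minimal_train.py -- illustrative Hydra + Lightning entry point (not the actual training script)
import hydra
from omegaconf import DictConfig
from lightning import Trainer  # or pytorch_lightning, depending on the installed package


@hydra.main(version_base="1.3", config_path="configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # Instantiate the LightningModule and LightningDataModule from the Hydra config.
    model = hydra.utils.instantiate(cfg.model)
    datamodule = hydra.utils.instantiate(cfg.data)

    # Trainer options (devices, max_epochs, ...) are taken from the config.
    trainer = Trainer(**cfg.trainer)
    trainer.fit(model=model, datamodule=datamodule)


if __name__ == "__main__":
    main()
```

Hydra then allows overriding any config value from the command line, e.g. `python minimal_train.py trainer.max_epochs=10`.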