Codebase for the paper "OneProt: towards multi-modal protein foundation models via latent space alignment of sequence, structure, binding sites and text encoders".
- Lightning tutorial: https://lightning.ai/docs/pytorch/stable/tutorials.html
- Hydra tutorial: https://hydra.cc/docs/tutorials/intro/
This project is dedicated to advancing the understanding and application of various protein modalities: sequence; structure, represented both as graphs and as Foldseek tokens; binding pockets; and sequence similarity tuples based on mutational information and multiple sequence alignments (MSAs).
We aim to learn aligned embeddings for these different protein modalities, which can later be used for retrieval, prediction, and generation tasks on proteins.
The project started as a prototype at the Bio x ML Hackathon 2023, where it won the first prize and the impact prize; the initial version of the model is available here.
The weights of the models and example datasets are available on Hugging Face. OneProt aligns the following modalities:
- Sequence
- Structure
- Text
- Pockets
- Sequence similarity
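Since the encoders are trained to produce aligned embeddings, cross-modal retrieval reduces to nearest-neighbour search in the shared latent space. The minimal sketch below illustrates such a lookup with FAISS (which appears among the dependencies listed further down); the embedding dimension, the random arrays, and the encoder outputs are placeholders rather than the actual OneProt API.

```python
import faiss
import numpy as np

d = 1024  # assumed embedding dimension of the shared latent space

# Stand-in for an (N, d) array of structure-encoder embeddings for a database
# of N proteins; in practice these would come from the trained structure encoder.
structure_embeddings = np.random.rand(10_000, d).astype("float32")
faiss.normalize_L2(structure_embeddings)       # cosine similarity via inner product

index = faiss.IndexFlatIP(d)                   # exact inner-product index
index.add(structure_embeddings)

# Query with a sequence embedding: because the latent spaces are aligned,
# the nearest structure embeddings should belong to related proteins.
query = np.random.rand(1, d).astype("float32") # stand-in for a sequence-encoder output
faiss.normalize_L2(query)

scores, ids = index.search(query, k=5)         # top-5 cross-modal hits
print(ids[0], scores[0])
```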
We only require datasets of paired modalities.
We used OpenProteinSet, which contains structures, sequences, and MSAs for proteins from the PDB, UniClust30, and UniProtKB/Swiss-Prot. We clustered the sequences with MMseqs2 using a sequence identity cut-off of 50%, so that each cluster represents a homologous cluster in protein fold space, and aligned the training, validation, and test splits along these sequence clusters. For each cluster representative and member, we use the sequence to find the structure in the AlphaFold2DB, the MSA in OpenProteinSet, and the binding pocket with P2Rank. As an MSA and a binding pocket could not be found for every protein, fewer data points are available for these modalities.

The sequence similarity dataset was constructed from ClinVar variant summary data and MSA data. Each data point consists of three pairs of sequences corresponding to the same protein: the original sequence and a sequence with a benign mutation; two distinct pathogenic sequences; and the original sequence and a sequence sampled from the corresponding MSA. Such a dataset enforces clustering of proteins based on their biological relevance, e.g. it moves pathogenic mutations away from benign ones.
| Modality 1 | Modality 2 | Dataset Size (Train/Val/Test) |
|---|---|---|
| Sequence | Structure Graph | 647781 / 1000 / 1000 |
| Sequence | Structure Token | 1000000 / 1000 / 1000 |
| Sequence | Text | 540077 / 1000 / 1000 |
| Sequence | Pockets | 335086 / 1000 / 1000 |
| Sequence | Sequence similarity | 1040560 / 1000 / 1000 |
Dataset splits, as well as the .h5 file for the pocket modality, are available on Zenodo; data for the sequence and structure modalities is also available in two parts: pt 1, pt 2.
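For illustration, a single data point of the sequence similarity dataset described above can be thought of as three sequence pairs of the same protein. The sketch below is only a schematic representation; the class and field names are assumptions, not the actual dataset schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SequenceSimilarityItem:
    """One data point of the sequence similarity modality (illustrative schema).

    All six sequences refer to the same protein.
    """
    benign_pair: Tuple[str, str]      # (original sequence, sequence with a benign mutation)
    pathogenic_pair: Tuple[str, str]  # (two distinct pathogenic sequences)
    msa_pair: Tuple[str, str]         # (original sequence, sequence sampled from the MSA)

# Toy example with truncated sequences.
item = SequenceSimilarityItem(
    benign_pair=("MKT...A", "MKT...V"),
    pathogenic_pair=("MKT...R", "MKT...W"),
    msa_pair=("MKT...A", "MRT...A"),
)
```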
Downstream Tasks:
We recommend using PyTorch 2.1.0 with CUDA 12.1 and the corresponding version of torch-geometric, which can be installed via

```bash
pip install torch_geometric
pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
```

The remaining package requirements are listed in the requirements.txt file.
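After installation, a quick sanity check can confirm that the PyTorch, CUDA, and PyG wheels match. The snippet below is an illustrative check, not part of the repository:

```python
import torch
import torch_geometric
import torch_scatter  # import fails if the wheel does not match the torch/CUDA build

print("torch:", torch.__version__)         # expected: 2.1.0
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)   # expected: 12.1
print("torch_geometric:", torch_geometric.__version__)
```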
A Singularity container containing most of the necessary packages is available from Zenodo. However, one still needs to create a small virtual environment on top of it. This requires

```bash
pip install torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
```

as well as the following packages:
- wandb
- faiss-gpu
- transformers
- biopandas
A workflow for activating the required environment within a batch script may therefore look as follows:

```bash
srun --cpu-bind=none bash -c "export CUDA_VISIBLE_DEVICES=\"0,1,2,3\"; export PYTHONPATH=\"\"; export HYDRA_FULL_ERROR=1; apptainer run --nv singularity_docker_jupyter.sif bash -c \"source environment_folder/venv/bin/activate && python src/script_of_your_choice.py\""
```
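The training scripts follow the Lightning + Hydra setup referenced by the tutorials at the top of this README. Below is a minimal sketch of what such a Hydra-configured entry point could look like; the file name, config path, config name, and config keys are assumptions, not the repository's actual layout.

```python
# minimal_train.py -- illustrative Hydra + Lightning entry point (not the actual training script)
import hydra
from omegaconf import DictConfig
from lightning import Trainer  # or pytorch_lightning, depending on the installed package


@hydra.main(version_base="1.3", config_path="configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # Instantiate the LightningModule and LightningDataModule from the Hydra config.
    model = hydra.utils.instantiate(cfg.model)
    datamodule = hydra.utils.instantiate(cfg.data)

    # Trainer options (devices, max_epochs, ...) are taken from the config.
    trainer = Trainer(**cfg.trainer)
    trainer.fit(model=model, datamodule=datamodule)


if __name__ == "__main__":
    main()
```

Hydra then allows overriding any config value from the command line, e.g. `python minimal_train.py trainer.max_epochs=10`.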