Glydentify is a deep learning framework for predicting glycosyltransferase donor substrates using structure-aware protein language models and molecular representations. This repository contains the code and resources for the paper submitted to Nature Communications.
```
glydentify_public/
├── src/               # Core package source code
│   ├── model.py       # GTDonorPredictor model definition
│   ├── dataset.py     # Dataset loading and processing
│   ├── trainer.py     # Training and evaluation logic
│   ├── losses.py      # Custom loss functions (e.g., Asymmetric Loss)
│   └── utils.py       # Utility functions (including foldseek utils)
├── scripts/           # Executable scripts
│   ├── train.py       # Main training script
│   ├── inference.py   # Inference on a folder of PDB/CIF files
│   └── annotate.py    # Annotate structures with attention weights
├── data/              # Datasets (e.g., gta/train.csv, gtb/test.csv)
├── checkpoints/       # Model checkpoints (e.g., saprot_unimol/gta/)
├── bin/               # External binaries (currently the foldseek binary)
├── requirements.txt   # Python dependencies
└── README.md          # This file
```
Clone the repository:

```bash
git clone https://github.com/RuiliF/Glydentify.git
cd Glydentify
```
Environment Setup:

```bash
conda create -n glydentify python=3.10
conda activate glydentify
```

It is recommended to install PyTorch manually first to ensure the correct CUDA build for your hardware. Run the command below (or check the official PyTorch website for your system):

```bash
# Example for Linux with CUDA 12.9
pip install torch==2.8.0+cu129 torchvision==0.23.0+cu129 torchaudio==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129
```

Install the rest of the dependencies:

```bash
pip install -r requirements.txt
```
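Once the dependencies are installed, a quick sanity check that PyTorch can actually see your GPU can save debugging time later. A minimal sketch (the helper name is ours, not part of the repo):

```python
def cuda_status():
    """Report whether PyTorch was installed with working CUDA support."""
    try:
        import torch  # installed in the step above
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        return f"CUDA OK: {torch.cuda.get_device_name(0)} (torch {torch.__version__})"
    return f"CPU only (torch {torch.__version__})"

if __name__ == "__main__":
    print(cuda_status())
```

If this prints "CPU only", the wheel you installed likely does not match your driver's CUDA version.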
Install Foldseek: To use the SaProt-UniMol version, download the `foldseek` binary and place it in the `bin/` directory. You can download it from this Google Drive Link. Note: this structural encoding approach is based on SaProt.
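Before running the SaProt-UniMol pipeline, you can verify the binary is in place with a few lines of Python (an illustrative helper, not part of the repo; it assumes the binary is named exactly `foldseek`):

```python
import os
import pathlib

def foldseek_ready(bin_dir="bin"):
    """Return True if an executable foldseek binary sits in bin_dir."""
    binary = pathlib.Path(bin_dir) / "foldseek"
    return binary.is_file() and os.access(binary, os.X_OK)

if __name__ == "__main__":
    print("foldseek ready" if foldseek_ready() else "foldseek missing from bin/")
```

A common pitfall after downloading from Drive is a missing execute bit; `chmod +x bin/foldseek` fixes it.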
You can run Glydentify inside a Docker container for reproducibility and to easily manage dependencies. Before building the Docker image, ensure you have downloaded the foldseek binary and placed it in the bin/ directory as described above.
To utilize GPUs inside Docker, your host machine must have:
- Docker installed.
- NVIDIA GPU Drivers installed and running.
- NVIDIA Container Toolkit installed to bridge your host GPU to Docker.
  - Ubuntu/Debian installation: `sudo apt install nvidia-container-toolkit`, followed by `sudo systemctl restart docker`.
  - See the official guide for more OS options.
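A quick way to confirm the host-side prerequisites are on your `PATH` before building (a sketch; the function name and tool list are ours):

```python
import shutil

def missing_tools(tools=("docker", "nvidia-smi")):
    """Return the subset of required CLI tools not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    gone = missing_tools()
    print("all prerequisites found" if not gone else f"missing: {', '.join(gone)}")
```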
From the root of the repository, build the Docker image (this will bake the checkpoints/ directly into the image):
```bash
docker build -t glydentify .
```

When running the container, use the `--gpus all` flag to allow the container to access your host GPUs. Also, mount your local `data/` folder so the dataloader can read your datasets.
Example for Training:

```bash
docker run --rm --gpus all \
  -v "$(pwd)/data:/app/data" \
  glydentify python scripts/train.py --fold <gta|gtb> --model_type <saprot|esm2|esmc> --batch_size 16
```

Example for Inference:
```bash
docker run --rm --gpus all \
  -v "$(pwd)/data:/app/data" \
  glydentify python scripts/inference.py data/gta/test.csv --checkpoint checkpoints/saprot_unimol/gta/
```

To train the model (supports SaProt, ESM2, ESM-C):
```bash
python scripts/train.py --fold <gta|gtb> --model_type <saprot|esm2|esmc> --batch_size 16
```

Arguments:
- `--fold`: Name of the dataset fold (expected in `data/<fold>/` or `../data/<fold>`).
- `--model_type`: Model architecture (`saprot`, `esm2`, `esmc`). Default: `saprot`.
- `--train_unimol` (optional): Fine-tune the UniMol encoder.
- `--train_seq_encoder` (optional): Fine-tune the sequence encoder (SaProt/ESM).
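For reference, the flags above correspond to a CLI along these lines. This is a hypothetical mirror of `scripts/train.py`'s parser reconstructed from this README, not the repo's actual code; the real script may define additional options:

```python
import argparse

def build_train_parser():
    """Sketch of the train.py CLI as documented above (assumed, not verified)."""
    p = argparse.ArgumentParser(description="Train a Glydentify donor predictor")
    p.add_argument("--fold", required=True, choices=["gta", "gtb"],
                   help="dataset fold, expected in data/<fold>/")
    p.add_argument("--model_type", default="saprot",
                   choices=["saprot", "esm2", "esmc"])
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--train_unimol", action="store_true",
                   help="fine-tune the UniMol encoder")
    p.add_argument("--train_seq_encoder", action="store_true",
                   help="fine-tune the sequence encoder (SaProt/ESM)")
    return p
```

With this shape, omitting `--train_unimol` and `--train_seq_encoder` keeps both encoders frozen, which matches the flags being described as optional.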
To run inference on a folder of protein structures (.pdb or .cif) using a trained checkpoint:
```bash
python scripts/inference.py <input_folder> --checkpoint <path_to_checkpoint> --model_type <saprot|esm2|esmc>
```

or evaluate on the test set:
```bash
python scripts/inference.py --checkpoint checkpoints/saprot_unimol/gta/ data/gta/test.csv
```

To visualize attention weights on the protein structure:

```bash
python scripts/annotate.py <input_folder> --checkpoint <path_to_checkpoint> --target_donor <donor_name>
```

This project is licensed under the MIT License - see the LICENSE file for details.
If you use this code or data, please cite our paper:
[Citation Placeholder]