Glydentify is a deep learning framework for predicting glycosyltransferase donor substrates using structure-aware protein language models and molecular representations. This repository contains the code and resources for the paper submitted to Nature Communications.
```
glydentify_public/
├── src/               # Core package source code
│   ├── model.py       # GTDonorPredictor model definition
│   ├── dataset.py     # Dataset loading and processing
│   ├── trainer.py     # Training and evaluation logic
│   ├── losses.py      # Custom loss functions (e.g., Asymmetric Loss)
│   └── utils.py       # Utility functions (including foldseek utils)
├── scripts/           # Executable scripts
│   ├── train.py       # Main training script
│   ├── inference.py   # Inference on a folder of PDB/CIF files
│   └── annotate.py    # Annotate structures with attention weights
├── data/              # Datasets (e.g., gta/train.csv, gtb/test.csv)
├── checkpoints/       # Model checkpoints (e.g., saprot_unimol/gta/)
├── bin/               # External binaries (currently the foldseek binary)
├── requirements.txt   # Python dependencies
└── README.md          # This file
```
Clone the repository:

```bash
git clone https://github.com/RuiliF/Glydentify.git
cd Glydentify
```
Environment Setup:

```bash
conda create -n glydentify python=3.10
conda activate glydentify
```

It is recommended to install PyTorch manually first to ensure the correct CUDA build for your hardware. Run the command below (or check the official PyTorch website for your system):

```bash
# Example for Linux with CUDA 12.9
pip install torch==2.8.0+cu129 torchvision==0.23.0+cu129 torchaudio==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129
```

Install the rest of the dependencies:

```bash
pip install -r requirements.txt
```
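Once the dependencies are installed, a quick sanity check that PyTorch can actually see your GPU can save debugging time later. A minimal sketch (the helper name is ours, not part of the repo):

```python
def cuda_status():
    """Report whether PyTorch was installed with working CUDA support."""
    try:
        import torch  # installed in the step above
    except ImportError:
        return "torch is not installed"
    if torch.cuda.is_available():
        return f"CUDA OK: {torch.cuda.get_device_name(0)} (torch {torch.__version__})"
    return f"CPU only (torch {torch.__version__})"

if __name__ == "__main__":
    print(cuda_status())
```

If this prints "CPU only", the wheel you installed likely does not match your driver's CUDA version.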
Install Foldseek: To use the SaProt-UniMol version, download the `foldseek` binary and place it in the `bin/` directory. You can download it from this Google Drive Link. Note: this structural encoding approach is based on SaProt.
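Before running the SaProt-UniMol pipeline, you can verify the binary is in place with a few lines of Python (an illustrative helper, not part of the repo; it assumes the binary is named exactly `foldseek`):

```python
import os
import pathlib

def foldseek_ready(bin_dir="bin"):
    """Return True if an executable foldseek binary sits in bin_dir."""
    binary = pathlib.Path(bin_dir) / "foldseek"
    return binary.is_file() and os.access(binary, os.X_OK)

if __name__ == "__main__":
    print("foldseek ready" if foldseek_ready() else "foldseek missing from bin/")
```

A common pitfall after downloading from Drive is a missing execute bit; `chmod +x bin/foldseek` fixes it.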
You can run Glydentify inside a Docker container for reproducibility and to easily manage dependencies. Before building the Docker image, ensure you have downloaded the foldseek binary and placed it in the bin/ directory as described above.
To utilize GPUs inside Docker, your host machine must have:
- Docker installed.
- NVIDIA GPU Drivers installed and running.
- NVIDIA Container Toolkit installed to bridge your host GPU to Docker.
  - Ubuntu/Debian installation: `sudo apt install nvidia-container-toolkit`, followed by `sudo systemctl restart docker`.
  - See the official guide for more OS options.
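A quick way to confirm the host-side prerequisites are on your `PATH` before building (a sketch; the function name and tool list are ours):

```python
import shutil

def missing_tools(tools=("docker", "nvidia-smi")):
    """Return the subset of required CLI tools not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

if __name__ == "__main__":
    gone = missing_tools()
    print("all prerequisites found" if not gone else f"missing: {', '.join(gone)}")
```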
From the root of the repository, build the Docker image (this will bake the checkpoints/ directly into the image):
```bash
docker build -t glydentify .
```

When running the container, use the `--gpus all` flag to allow the container to access your host GPUs. Also, mount your local `data/` folder so the dataloader can read your datasets.
Example for Training:

```bash
docker run --rm --gpus all \
  -v "$(pwd)/data:/app/data" \
  glydentify python scripts/train.py --fold <gta|gtb> --model_type <saprot|esm2|esmc> --batch_size 16
```

Example for Inference:
```bash
docker run --rm --gpus all \
  -v "$(pwd)/data:/app/data" \
  glydentify python scripts/inference.py data/gta/test.csv --checkpoint checkpoints/saprot_unimol/gta/
```

To train the model (supports SaProt, ESM2, ESM-C):
```bash
python scripts/train.py --fold <gta|gtb> --model_type <saprot|esm2|esmc> --batch_size 16
```

Arguments:
- `--fold`: Name of the dataset fold (expected in `data/<fold>/` or `../data/<fold>`).
- `--model_type`: Model architecture (`saprot`, `esm2`, `esmc`). Default: `saprot`.
- `--train_unimol` (optional): Fine-tune the UniMol encoder.
- `--train_seq_encoder` (optional): Fine-tune the sequence encoder (SaProt/ESM).
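For reference, the flags above correspond to a CLI along these lines. This is a hypothetical mirror of `scripts/train.py`'s parser reconstructed from this README, not the repo's actual code; the real script may define additional options:

```python
import argparse

def build_train_parser():
    """Sketch of the train.py CLI as documented above (assumed, not verified)."""
    p = argparse.ArgumentParser(description="Train a Glydentify donor predictor")
    p.add_argument("--fold", required=True, choices=["gta", "gtb"],
                   help="dataset fold, expected in data/<fold>/")
    p.add_argument("--model_type", default="saprot",
                   choices=["saprot", "esm2", "esmc"])
    p.add_argument("--batch_size", type=int, default=16)
    p.add_argument("--train_unimol", action="store_true",
                   help="fine-tune the UniMol encoder")
    p.add_argument("--train_seq_encoder", action="store_true",
                   help="fine-tune the sequence encoder (SaProt/ESM)")
    return p
```

With this shape, omitting `--train_unimol` and `--train_seq_encoder` keeps both encoders frozen, which matches the flags being described as optional.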
To run inference on a folder of protein structures (.pdb or .cif) using a trained checkpoint:
```bash
python scripts/inference.py <input_folder> --checkpoint <path_to_checkpoint> --model_type <saprot|esm2|esmc>
```

or evaluate on the test set:
```bash
python scripts/inference.py --checkpoint checkpoints/saprot_unimol/gta/ data/gta/test.csv
```

To visualize attention weights on the protein structure:

```bash
python scripts/annotate.py <input_folder> --checkpoint <path_to_checkpoint> --target_donor <donor_name>
```

This project is licensed under the MIT License - see the LICENSE file for details.
If you use this code or data, please cite our paper:
[Citation Placeholder]