Glydentify

Glydentify is a deep learning framework for predicting glycosyltransferase donor substrates using structure-aware protein language models and molecular representations. This repository contains the code and resources for the paper submitted to Nature Communications.

Repository Structure

glydentify_public/
├── src/                  # Core package source code
│   ├── model.py          # GTDonorPredictor model definition
│   ├── dataset.py        # Dataset loading and processing
│   ├── trainer.py        # Training and evaluation logic
│   ├── losses.py         # Custom loss functions (e.g., Asymmetric Loss)
│   └── utils.py          # Utility functions (including foldseek utils)
├── scripts/              # Executable scripts
│   ├── train.py          # Main training script
│   ├── inference.py      # Inference on a folder of PDB/CIF files
│   └── annotate.py       # Annotate structures with attention weights
├── data/                 # Datasets (e.g., gta/train.csv, gtb/test.csv)
├── checkpoints/          # Model checkpoints (e.g., saprot_unimol/gta/)
├── bin/                  # External binaries (the Foldseek executable goes here)
├── requirements.txt      # Python dependencies
└── README.md             # This file
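src/losses.py is described as containing custom losses such as an asymmetric loss. The snippet below is a minimal pure-Python sketch of the asymmetric focal formulation of Ridnik et al. (a focusing exponent per class plus a probability margin on negatives); the exact variant and function signature used in the repository are assumptions:

```python
import math

def asymmetric_loss(p, target, gamma_pos=1.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Asymmetric focal loss for a single sigmoid output p in (0, 1).

    Positive term: (1 - p)^gamma_pos * log(p)
    Negative term: p_m^gamma_neg * log(1 - p_m), with p_m = max(p - margin, 0),
    so easy negatives below the margin contribute nothing.
    """
    if target == 1:
        return -((1.0 - p) ** gamma_pos) * math.log(p + eps)
    p_m = max(p - margin, 0.0)
    return -(p_m ** gamma_neg) * math.log(1.0 - p_m + eps)
```

With both exponents and the margin set to zero this reduces to plain binary cross-entropy, which is a quick sanity check on any implementation.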

Installation

  1. Clone the repository:

    git clone https://github.com/RuiliF/Glydentify.git
    cd Glydentify
  2. Environment Setup:

    conda create -n glydentify python=3.10
    conda activate glydentify

    It is recommended to install PyTorch manually first to ensure the correct CUDA version for your hardware. Run the command below (or check the official PyTorch website for your system):

    # Example for Linux with CUDA 12.9
    pip install torch==2.8.0+cu129 torchvision==0.23.0+cu129 torchaudio==2.8.0+cu129 --index-url https://download.pytorch.org/whl/cu129

    Install the rest of the dependencies:

    pip install -r requirements.txt
  3. Install Foldseek: To use the SaProt-UniMol version, download the foldseek binary and place it in the bin/ directory. You can download it from this Google Drive Link. Note: this structural encoding approach is based on SaProt.
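SaProt's structure-aware vocabulary fuses each amino-acid letter with the lowercase Foldseek 3Di state of the same residue (e.g. 'M' + 'd' → 'Md'). A minimal sketch of that fusion step is below; the helper name is hypothetical, and the actual pipeline lives in src/utils.py:

```python
def fuse_sa_tokens(aa_seq: str, di_seq: str) -> str:
    """Fuse an amino-acid sequence with its Foldseek 3Di string into
    SaProt-style structure-aware tokens, e.g. 'MK' + 'dv' -> 'MdKv'.

    Positions without a structural state are conventionally masked
    with '#' in the 3Di string and pass through unchanged.
    """
    if len(aa_seq) != len(di_seq):
        raise ValueError("sequence and 3Di string must have equal length")
    return "".join(aa + di.lower() for aa, di in zip(aa_seq, di_seq))
```

The fused string is what gets tokenized by the SaProt vocabulary, which is why the foldseek binary is a hard requirement for this model variant.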

Docker Setup (Optional but Recommended)

You can run Glydentify inside a Docker container for reproducibility and to easily manage dependencies. Before building the Docker image, ensure you have downloaded the foldseek binary and placed it in the bin/ directory as described above.

1. Prerequisites (Host Machine)

To utilize GPUs inside Docker, your host machine must have:

  • Docker installed.
  • NVIDIA GPU Drivers installed and running.
  • NVIDIA Container Toolkit installed to bridge your host GPU to Docker.
    • Ubuntu/Debian installation: sudo apt install nvidia-container-toolkit followed by sudo systemctl restart docker.
    • See the official guide for more OS options.

2. Building the Image

From the root of the repository, build the Docker image (this bakes the checkpoints/ directory directly into the image):

docker build -t glydentify .

3. Running with Docker

When running the container, use the --gpus all flag to allow the container to access your host GPUs. Also, mount your local data/ folder so the dataloader can read your datasets.

Example for Training:

docker run --rm --gpus all \
  -v "$(pwd)/data:/app/data" \
  glydentify python scripts/train.py --fold <gta|gtb> --model_type <saprot|esm2|esmc> --batch_size 16

Example for Inference:

docker run --rm --gpus all \
  -v "$(pwd)/data:/app/data" \
  glydentify python scripts/inference.py data/gta/test.csv --checkpoint checkpoints/saprot_unimol/gta/

Local Usage

Training

To train the model (supports SaProt, ESM2, ESM-C):

python scripts/train.py --fold <gta|gtb> --model_type <saprot|esm2|esmc> --batch_size 16

Arguments:

  • --fold: Name of the dataset fold (expected in data/<fold>/ or ../data/<fold>).
  • --model_type: Model architecture (saprot, esm2, esmc). Default: saprot.
  • --train_unimol (optional): Fine-tune the UniMol encoder.
  • --train_seq_encoder (optional): Fine-tune the sequence encoder (SaProt/ESM).

Inference

To run inference on a folder of protein structures (.pdb or .cif) using a trained checkpoint:

python scripts/inference.py <input_folder> --checkpoint <path_to_checkpoint> --model_type <saprot|esm2|esmc>

or evaluate on the test set:

python scripts/inference.py --checkpoint checkpoints/saprot_unimol/gta/ data/gta/test.csv
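For the folder mode, inference presumably begins by collecting every .pdb/.cif file in the input directory. A small sketch of that collection step, assumed rather than taken from scripts/inference.py:

```python
from pathlib import Path

def collect_structures(input_folder: str) -> list[Path]:
    """Return all .pdb/.cif files in a folder, sorted for a
    reproducible prediction order."""
    folder = Path(input_folder)
    return sorted(
        p for p in folder.iterdir()
        if p.suffix.lower() in {".pdb", ".cif"}
    )
```

Sorting keeps the output order stable across runs, which makes prediction CSVs easy to diff.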

Structure Annotation

To visualize attention weights on the protein structure:

python scripts/annotate.py <input_folder> --checkpoint <path_to_checkpoint> --target_donor <donor_name>
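A common way to make attention visible in structure viewers such as PyMOL or ChimeraX is to write the per-residue weights into the B-factor field (columns 61-66) of each ATOM record, then color by B-factor. The sketch below shows that rewrite for a single PDB line; whether annotate.py uses exactly this mechanism is an assumption:

```python
def set_bfactor(atom_line: str, weight: float) -> str:
    """Overwrite the B-factor field (columns 61-66, 1-indexed) of a PDB
    ATOM/HETATM record with an attention weight, formatted %6.2f.
    Non-coordinate records are returned unchanged."""
    if not atom_line.startswith(("ATOM", "HETATM")):
        return atom_line
    line = atom_line.ljust(66)          # pad short lines to the field end
    return line[:60] + f"{weight:6.2f}" + line[66:]
```

Because the PDB format is fixed-width, the slice-and-format approach preserves every other field (coordinates, occupancy, element) byte for byte.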

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this code or data, please cite our paper:

[Citation Placeholder]
