A comprehensive framework for protein sequence classification using Large Protein Language Models (pLMs) with support for multiple embedding models and classifier architectures.
- Overview
- Features
- Architecture
- Installation
- Usage
- Configuration
- Dataset Format
- Notebooks
- Project Structure
- Citation
- License
- Contact
Metagenome-AI is a flexible framework for protein sequence classification. It uses state-of-the-art protein language models to generate embeddings, then trains lightweight classifiers on those embeddings for downstream tasks. The framework supports multiple embedding models (ESM-2, ESM3, ProtTrans, ProteinVec) and classifier architectures (MLP, XGBoost), so it adapts to a range of protein classification problems, including antimicrobial peptide (AMP) prediction, toxicity prediction, and Gram-positive/negative activity classification.
- Multiple Embedding Models: Support for ESM-2 (650M, 3B parameters), ESM3, ProtTrans, and ProteinVec
- Flexible Classifiers: Choose between MLP and XGBoost classifiers
- Fine-tuning Support: Fine-tune protein language models on custom datasets
- Multi-GPU Training: Distributed training support using PyTorch DDP
- Modular Architecture: Easy to extend with new models and classifiers
- Experiment Tracking: Integration with Weights & Biases (WandB)
- Memory Efficient: Separate embedding generation and classifier training phases
The framework follows a two-stage pipeline:
- Embedding Generation: Protein sequences are processed by pre-trained language models to generate fixed-size embeddings
- Classifier Training: Lightweight classifiers are trained on the generated embeddings for specific classification tasks
This approach allows for efficient experimentation with different classifiers without re-computing embeddings.
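The two-stage split can be sketched in a few lines. This is a minimal illustration, not the framework's code: `toy_embed` is a hypothetical stand-in for a protein language model (it returns amino-acid composition), the nearest-centroid classifier stands in for MLP/XGBoost, and the JSON file layout is an assumption made for the example.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(sequence):
    """Hypothetical stand-in for a pLM: amino-acid composition as a 20-dim vector."""
    counts = Counter(sequence)
    total = max(len(sequence), 1)
    return [counts.get(aa, 0) / total for aa in AMINO_ACIDS]


def store_embeddings(records, emb_path):
    """Stage 1: embed every sequence once and write the vectors to disk."""
    rows = [{"id": pid, "emb": toy_embed(seq), "label": label}
            for pid, seq, label in records]
    Path(emb_path).write_text(json.dumps(rows))


def train_centroids(emb_path):
    """Stage 2: fit a nearest-centroid classifier using only the stored vectors."""
    rows = json.loads(Path(emb_path).read_text())
    sums, counts = {}, Counter()
    for row in rows:
        acc = sums.setdefault(row["label"], [0.0] * len(row["emb"]))
        for i, value in enumerate(row["emb"]):
            acc[i] += value
        counts[row["label"]] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, sequence):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    emb = toy_embed(sequence)

    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(emb, centroid))

    return min(centroids, key=lambda label: dist(centroids[label]))


# Made-up toy records: (protein_id, sequence, label)
records = [("p1", "KKKKLLKK", 1), ("p2", "KLKKLKKK", 1),
           ("p3", "DDEEDDEE", 0), ("p4", "EEDDEDDD", 0)]

with tempfile.TemporaryDirectory() as tmp:
    emb_file = Path(tmp) / "train_embeddings.json"
    store_embeddings(records, emb_file)    # expensive step, run once
    centroids = train_centroids(emb_file)  # cheap step, repeatable per classifier

print(predict(centroids, "KKLKKKLK"))
```

Because stage 2 reads only the stored file, swapping the classifier means re-running `train_centroids` (or its replacement) without touching the embedding step.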
- Python 3.9+
- CUDA-capable GPU (recommended)
- Conda or pip package manager
- Clone the repository:
git clone https://github.com/yourusername/Metagenome-AI.git
cd Metagenome-AI
- Create and activate a conda environment:
conda create -n mai python==3.9
conda activate mai
- Install dependencies:
pip install -r requirements.txt
- (Optional) For ESM3 model access, configure Hugging Face authentication:
huggingface-cli login
To run the complete pipeline (embedding generation + classifier training):
python src/train.py -c src/configs/config_sample.json
To fine-tune a protein language model on your dataset:
python src/finetune.py -c src/configs/config_ft.json
Or using Hugging Face's training framework:
python src/finetuning_hf.py -c src/configs/config_ft.json
The framework supports three operational modes:
- RUN_ALL (default): Generate embeddings and train classifier
python src/train.py -c src/configs/config_sample.json
- ONLY_STORE_EMBEDDINGS: Only generate and store embeddings
{
  "program_mode": "ONLY_STORE_EMBEDDINGS",
  ...
}
- TRAIN_PREDICT_FROM_STORED: Train classifier using pre-computed embeddings
{
  "program_mode": "TRAIN_PREDICT_FROM_STORED",
  ...
}
The src/configs/ directory contains several example configurations:
- config_sample.json: Basic ESM-2 with XGBoost classifier
- config_esm2.json: ESM-2 3B model configuration
- config_esm3.json: ESM3 model configuration
- config_protein_trans.json: ProtTrans model configuration
- config_ft.json: Fine-tuning configuration
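The three program modes described above map onto the two pipeline stages. The sketch below shows that mapping; `stages_for_mode` and the returned key names are hypothetical and only illustrate the documented behaviour, they are not taken from `src/train.py`.

```python
VALID_MODES = ("RUN_ALL", "ONLY_STORE_EMBEDDINGS", "TRAIN_PREDICT_FROM_STORED")


def stages_for_mode(config):
    """Return which pipeline stages a config would run (hypothetical helper)."""
    mode = config.get("program_mode", "RUN_ALL")  # RUN_ALL is the documented default
    if mode not in VALID_MODES:
        raise ValueError(f"unknown program_mode: {mode!r}")
    return {
        "generate_embeddings": mode in ("RUN_ALL", "ONLY_STORE_EMBEDDINGS"),
        "train_classifier": mode in ("RUN_ALL", "TRAIN_PREDICT_FROM_STORED"),
    }


print(stages_for_mode({"program_mode": "ONLY_STORE_EMBEDDINGS"}))
```

A typical workflow under this reading is to run ONLY_STORE_EMBEDDINGS once on a GPU machine, then iterate on classifier settings with TRAIN_PREDICT_FROM_STORED.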
| Parameter | Description | Example |
|---|---|---|
| model_type | Embedding model type | "ESM", "ESM3", "PTRANS", "PVEC" |
| classifier_type | Classifier architecture | "MLP", "XGBoost" |
| train | Path to training dataset | "data/sample_train.tsv" |
| valid | Path to validation dataset | "data/sample_validation.tsv" |
| test | Path to test dataset | "data/sample_test.tsv" |
| emb_dir | Directory for storing embeddings | "emb_dir/sample" |
| model_folder | Directory for model checkpoints | "classifier_results/sample" |
| model_basename | Basename for saved models | "esm_sample" |
| wandb_key | Weights & Biases API key | Your WandB API key |
Optional parameters:
- program_mode: Execution mode (default: "RUN_ALL")
- batch_size: Batch size for training (default: 32)
- max_tokens: Maximum tokens per batch for embedding generation (default: 2500)
- log_dir: Directory for log files (default: "./logs/")
- pred_dir: Directory for predictions (default: "./predictions/")
- classifier_path: Path to a pre-trained classifier; when set, training is skipped
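A token cap such as max_tokens is commonly applied by grouping sequences so that the padded batch size (number of sequences times the longest sequence) stays within the budget. The sketch below shows that general technique; whether the framework batches exactly this way is an assumption, `batch_by_token_budget` is a hypothetical name, and the token count is approximated by sequence length (special tokens are ignored).

```python
def batch_by_token_budget(sequences, max_tokens=2500):
    """Group sequence indices so each padded batch stays within max_tokens.

    Sequences are visited shortest first so that padding waste stays small.
    A single sequence longer than max_tokens still gets a batch of its own.
    """
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches, current, longest = [], [], 0
    for idx in order:
        length = len(sequences[idx])
        new_longest = max(longest, length)
        # Padded cost = (sequences in batch) x (longest sequence in batch).
        if current and new_longest * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, new_longest = [], length
        current.append(idx)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


# Made-up sequences of length 10, 20, 30 and 40 with a small budget of 60.
toy_sequences = ["A" * 10, "A" * 20, "A" * 30, "A" * 40]
print(batch_by_token_budget(toy_sequences, max_tokens=60))
```

Lowering max_tokens trades throughput for a smaller peak memory footprint during embedding generation.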
MLP classifier parameters:
- num_epochs: Number of training epochs (default: 10)
- lr: Learning rate (default: 0.001)
- hidden_layers: Hidden layer sizes, e.g., [1024, 512]
- early_stop_patience: Early stopping patience (default: 4)
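Patience-based early stopping ends training once the validation loss has failed to improve for a set number of consecutive epochs. The class below is a generic sketch of that idea, not the contents of the project's own early stopping utility; the class name, the `min_delta` parameter, and the loss values are illustrative assumptions.

```python
class EarlyStopper:
    """Generic patience-based early stopping (illustrative, not the project's code)."""

    def __init__(self, patience=4, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        """Record one epoch's validation loss and report whether to stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses for illustration.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.65, 0.68, 0.64]
stopper = EarlyStopper(patience=4)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(stopped_at)
```

In this run the best loss (0.65) is reached at epoch 3 and the next four epochs do not beat it, so training stops before the final value is ever seen.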
XGBoost classifier parameters:
- objective: Objective function (default: "multi:softmax")
- n_estimators: Number of trees (default: 10)
- eta: Learning rate (default: 0.001)
- early_stop: Early stopping rounds (default: 4)
- max_depth: Maximum tree depth (default: 8)
- eval_metric: Evaluation metric (default: "mlogloss")
- verbosity: Verbosity level (default: 1)
Fine-tuning parameters:
- num_epochs_finetune: Number of fine-tuning epochs
- batch_size_finetune: Batch size for fine-tuning
- max_mask_prob: Maximum masking probability for MLM training
- model_name_or_path: Path to base model
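To show how the parameters above fit together, here is a minimal sketch that parses a config, checks for the keys from the parameter table, and fills in the documented defaults. It is not the project's `config.py`: `load_config` is a hypothetical helper, and treating every key in the table (including wandb_key) as required is an assumption.

```python
import json

# Assumption: every key from the parameter table is treated as required.
REQUIRED_KEYS = ("model_type", "classifier_type", "train", "valid", "test",
                 "emb_dir", "model_folder", "model_basename", "wandb_key")

# Defaults as documented for the optional parameters.
DEFAULTS = {
    "program_mode": "RUN_ALL",
    "batch_size": 32,
    "max_tokens": 2500,
    "log_dir": "./logs/",
    "pred_dir": "./predictions/",
}


def load_config(text):
    """Parse a JSON config string, validate it, and apply defaults."""
    user = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in user]
    if missing:
        raise KeyError(f"missing required config keys: {missing}")
    return {**DEFAULTS, **user}  # user values override defaults


example = json.dumps({
    "model_type": "ESM",
    "classifier_type": "XGBoost",
    "train": "data/sample_train.tsv",
    "valid": "data/sample_validation.tsv",
    "test": "data/sample_test.tsv",
    "emb_dir": "emb_dir/sample",
    "model_folder": "classifier_results/sample",
    "model_basename": "esm_sample",
    "wandb_key": "YOUR_WANDB_API_KEY",  # placeholder, not a real key
    "program_mode": "ONLY_STORE_EMBEDDINGS",
})
config = load_config(example)
print(config["program_mode"], config["batch_size"], config["max_tokens"])
```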
Datasets should be in TSV (tab-separated values) format with the following columns:
<protein_id> <length> <sequence> <label>
Example:
protein_001 150 MKTIIALSYIFCLVFA... 1
protein_002 89 ARTKQTARKSTGGKA... 0
For multi-label classification, additional label columns can be added:
<protein_id> <length> <sequence> <label1> <label2> <label3>
Sample datasets are provided in the data/ directory.
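The TSV layout described above can be read with Python's standard csv module. The sketch below is illustrative rather than the project's `dataset.py`: it assumes there is no header row, that the length column equals the sequence length, and that labels are integers. `read_dataset` is a hypothetical name and the two peptides are made up.

```python
import csv
import io


def read_dataset(handle):
    """Parse rows of <protein_id> <length> <sequence> <label...> from a TSV handle."""
    records = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) < 4:
            raise ValueError(f"line {line_no}: expected at least 4 columns, got {len(row)}")
        protein_id, length, sequence, *labels = row
        # Assumption: the length column must match the actual sequence length.
        if int(length) != len(sequence):
            raise ValueError(f"line {line_no}: length {length} != {len(sequence)}")
        records.append({
            "id": protein_id,
            "sequence": sequence,
            "labels": [int(label) for label in labels],  # one or more label columns
        })
    return records


sample = "pep_a\t8\tKKLLKKLL\t1\npep_b\t6\tDDEEDD\t0\n"
dataset = read_dataset(io.StringIO(sample))
print(dataset[0]["id"], dataset[0]["labels"])
```

The same function handles the multi-label layout, since any columns after the sequence are collected into the `labels` list.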
The notebooks/ directory contains Jupyter notebooks for data analysis and experimentation:
Needleman-Wunsch.ipynb: This notebook measures sequence similarity between predicted antimicrobial peptides (from RiPP core peptides) and known AMP databases. It uses the Needleman-Wunsch global alignment algorithm to calculate pairwise similarity scores between query sequences and a reference database. The notebook processes over 11,379 RiPP core peptide sequences, computes their alignment scores against known AMPs, normalizes the scores by sequence length to obtain percentage identity, and plots the distribution of similarity scores as publication-quality histograms. The result quantifies how similar each predicted candidate is to previously characterized AMPs, which helps assess its novelty. The notebook also supports comparison with DIAMOND BLASTP for computational efficiency, though the primary focus is on the more sensitive Needleman-Wunsch approach.
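For readers unfamiliar with the algorithm, the following is a compact sketch of Needleman-Wunsch scoring with a length-normalized identity. The scoring values (match, mismatch, gap) and the choice to normalize by the longer sequence are assumptions made for illustration; the notebook's actual scoring scheme is not stated here and may differ.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via Needleman-Wunsch, keeping one DP row in memory."""
    prev = [j * gap for j in range(len(b) + 1)]  # aligning "" against prefixes of b
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against ""
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def percent_identity(a, b):
    """Identity in percent, assuming match=1, mismatch=0, gap=0 scoring.

    Under that scoring the alignment score is the largest possible number of
    identical aligned positions; it is divided by the longer sequence length.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * nw_score(a, b, match=1, mismatch=0, gap=0) / longest


print(nw_score("ACGT", "AGT"))          # 3 matches and 1 gap
print(percent_identity("ACGT", "AGT"))  # 3 identical positions out of 4
```

The full table costs O(len(a) x len(b)) time per pair, which is why a faster heuristic search is attractive when screening many queries against a large reference set.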
amp_tox_ripp-round2.ipynb: This notebook evaluates and compares multiple protein language models for antimicrobial peptide prediction across four classification tasks: global AMP activity, Gram-positive activity, Gram-negative activity, and toxicity prediction. It benchmarks the ESM2-650M, ESM2-3B, ESM3, and ProtTrans models against existing AMP prediction tools (Macrel, AMPScanner, iAMPpred, amPEPpy, and ToxinPred3) using accuracy, F1-score, AUC, and MCC. The notebook processes predictions for approximately 11,379 RiPP core peptides derived from metagenomes and microbial genomes, applies an ensemble strategy that intersects the predictions of multiple models, and filters candidates by normalized probability score to identify high-confidence antimicrobial peptides with low toxicity. Visualizations include Venn diagrams of model prediction overlaps and comparative bar plots showing that the ESM2-3B model reaches >90% accuracy and outperforms the traditional AMP prediction tools.
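The ensemble step described above (keep only peptides that every model calls an AMP, then drop likely toxic ones) can be sketched with plain set operations. The function name, the thresholds, and all scores below are hypothetical values chosen for illustration; they are not the notebook's settings or results.

```python
def high_confidence_candidates(amp_probs_by_model, tox_probs,
                               amp_threshold=0.9, tox_threshold=0.5):
    """Intersect per-model AMP calls, then keep peptides with low toxicity scores.

    Thresholds are illustrative assumptions. Peptides without a toxicity
    score are conservatively treated as toxic.
    """
    per_model_hits = [
        {pid for pid, prob in probs.items() if prob >= amp_threshold}
        for probs in amp_probs_by_model.values()
    ]
    consensus = set.intersection(*per_model_hits) if per_model_hits else set()
    return sorted(pid for pid in consensus
                  if tox_probs.get(pid, 1.0) < tox_threshold)


# Made-up probabilities for three hypothetical peptides.
amp_probs = {
    "ESM2-3B": {"pep1": 0.97, "pep2": 0.95, "pep3": 0.40},
    "ESM3": {"pep1": 0.93, "pep2": 0.91, "pep3": 0.96},
    "ProtTrans": {"pep1": 0.92, "pep2": 0.99, "pep3": 0.95},
}
tox_probs = {"pep1": 0.10, "pep2": 0.80, "pep3": 0.05}

candidates = high_confidence_candidates(amp_probs, tox_probs)
print(candidates)
```

Intersection is a strict rule: one dissenting model removes a peptide, which favours precision over recall in the final candidate list.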
Metagenome-AI/
├── data/                          # Sample datasets and predictions
│   ├── sample_train.tsv
│   ├── sample_validation.tsv
│   ├── sample_test.tsv
│   └── predictions/               # Model predictions output
├── src/                           # Source code
│   ├── train.py                   # Main training script
│   ├── finetune.py                # Model fine-tuning script
│   ├── config.py                  # Configuration management
│   ├── dataset.py                 # Dataset classes
│   ├── configs/                   # Configuration files
│   ├── embeddings/                # Embedding model implementations
│   │   ├── embedding_esm.py
│   │   ├── embedding_esm3.py
│   │   ├── embedding_protein_trans.py
│   │   └── embedding_protein_vec.py
│   ├── classifiers/               # Classifier implementations
│   │   ├── classifier_mlp.py
│   │   └── classifier_xgboost.py
│   ├── finetuning/                # Fine-tuning utilities
│   └── utils/                     # Utility functions
│       ├── metrics.py
│       ├── wandb.py
│       └── early_stopper.py
├── notebooks/                     # Jupyter notebooks for analysis
│   ├── Needleman-Wunsch.ipynb
│   └── amp_tox_ripp-round2.ipynb
├── requirements.txt               # Python dependencies
└── README.md                      # This file
This project is licensed under the MIT License - see the LICENSE file for details.
For questions, suggestions, or issues:
- Create an issue: GitHub Issues
- Email: vladimir.kovacevic@etf.rs
Developed at: BGI Research
Contributors: Nikola Milicevic, Vladimir Kovacevic