MicroGenomer: A Foundation Model for Transferable Microbial Genome Representations Enabling Multi-scale Genomic Understanding and Ecophysiological Trait Prediction
MicroGenomer is a foundation model for transferable microbial genome representations, enabling multi-scale genomic comprehension and ecophysiological trait prediction. It adopts a hierarchical training strategy that integrates large-scale genomic sequence pre-training (234.5 billion base pairs), domain-specific mid-training with the GTDB-curated marker gene set, and task-specific post-training. Surpassing existing gene-scale encoders, MicroGenomer generates robust embeddings at the whole-genome scale.
Figure 1. Overview of MicroGenomer.
Because the model uses FlashAttention to accelerate genomic representation extraction, we recommend the following hardware configuration:
- NVIDIA GPUs with Ampere architecture or newer are required (e.g., RTX 30/40 series consumer-grade GPUs, A10/A100/A800 professional-grade GPUs).
- A minimum of 16GB VRAM is recommended (32GB or higher for batch processing of genomic data).
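The hardware recommendations above can be checked from Python before installation. This is a minimal sketch and not part of MicroGenomer: it assumes `nvidia-smi` is on the PATH and that the installed driver supports the `compute_cap` query field (older drivers may not). Ampere corresponds to CUDA compute capability 8.0, so the check treats 8.0 or higher as sufficient. `parse_gpu_line` is a hypothetical helper name.

```python
import shutil
import subprocess

# Assumption: the driver's nvidia-smi supports the compute_cap query field.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,compute_cap,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse one 'name, compute_cap, memory_MiB' line into a small dict."""
    # rsplit keeps any commas inside the GPU name intact.
    name, cap, mem = [field.strip() for field in line.rsplit(",", 2)]
    return {
        "name": name,
        "ampere_or_newer": float(cap) >= 8.0,  # Ampere starts at 8.0
        "memory_gib": int(mem) / 1024,  # nounits reports MiB
    }


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found; cannot inspect GPUs")
    else:
        result = subprocess.run(QUERY, capture_output=True, text=True)
        if result.returncode != 0:
            print("nvidia-smi query failed:", result.stderr.strip())
        for line in result.stdout.splitlines():
            if line.strip():
                print(parse_gpu_line(line))
```

Compare the reported `memory_gib` against the 16GB (or 32GB for batch processing) recommendation by eye; cards often report slightly less than their nominal size, so the sketch does not apply a hard threshold.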
Download the GitHub repository and extract the files to a designated folder.
MicroGenomer takes a comma-separated values (CSV) file as input for inference and genomic representation extraction. The file contains only annotated coding sequences (CDS), the genomic regions that encode proteins in microbial genomes. The input data format is as follows:
- genome_id: Unique identifier for the current genome
- aa_seq: DNA sequence corresponding to CDS in the current genome
- unique_id: Unique ID for identifying the DNA sequence
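A file in this layout can be sanity-checked before inference. The sketch below is not part of MicroGenomer; `validate_rows` is a hypothetical helper. It checks that the three columns are present, that `unique_id` values do not repeat, and that `aa_seq` holds DNA. Accepting `N` and lowercase letters is an assumption, since the accepted alphabet is not specified here.

```python
import csv
import io

REQUIRED_COLUMNS = {"unique_id", "aa_seq", "genome_id"}
DNA_ALPHABET = set("ACGTN")  # assumption: ambiguous base N is tolerated


def validate_rows(handle):
    """Return a list of problem descriptions for a MicroGenomer-style CSV."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    problems, seen = [], set()
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        uid = row["unique_id"]
        seq = (row["aa_seq"] or "").upper()  # short rows yield None
        if uid in seen:
            problems.append(f"line {line_no}: duplicate unique_id {uid!r}")
        seen.add(uid)
        if not seq or set(seq) - DNA_ALPHABET:
            problems.append(f"line {line_no}: aa_seq is empty or not DNA")
        if not row["genome_id"]:
            problems.append(f"line {line_no}: empty genome_id")
    return problems


if __name__ == "__main__":
    # Tiny in-memory sample: the second row reuses an ID and holds protein.
    sample = io.StringIO(
        "unique_id,aa_seq,genome_id\n"
        "g1@m1,TTGCCGTCC,g1\n"
        "g1@m1,MKVLA,g1\n"
    )
    for problem in validate_rows(sample):
        print(problem)
```

To check a real file, pass an open handle, for example `validate_rows(open("input.csv", newline=""))`, where `input.csv` is a placeholder path.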
Example:
unique_id,aa_seq,genome_id
RS_GCF_013393365.1@TIGR00168,TTGCCGTCCGTAG...GTAGTAAATAA,RS_GCF_013393365.1
RS_GCF_013393365.1@TIGR00382,TTGGCAAAAGATA...TGAAATCGTAA,RS_GCF_013393365.1
...
# Create a Conda Python environment
conda create -n MicroGenomer python=3.10
conda activate MicroGenomer
git clone https://github.com/BGIResearch/MicroGenomer.git
cd MicroGenomer
pip install -r requirements.txt
- Download the weights for MicroGenomer.
- Place the MicroGenomer-470M and downstream_tasks folders in the weights/ directory.
Six downstream tasks are available: maximum growth rate, oxygen tolerance, salinity tolerance, optimal pH, optimal temperature, and probiotic prediction. Embedding extraction is also supported.
bash run.sh \
--input_path '/path/to/input/' \
--output_dir '/path/to/output/' \
[--task 'downstream_task'] \ # Options: test_growth, test_oxygen, test_salinity, test_pH, test_temperature, test_probiotic, extract_embed or none
[--level 'level_of_probiotic'] \ # For test_probiotic task only. Options: family, genus.
- --input_path: Path to the input file/folder.
- --output_dir: Path for saving output files.
- --task: Downstream task option.
- --level: Level of probiotic prediction. Default is family.
The input path can be a single CSV file or a folder. If it is a folder, all CSV files in that folder are processed in batches. If the task parameter is extract_embed or is omitted, the model only extracts embeddings and does not run any downstream task.
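When several tasks are run over the same input, the command line can be assembled programmatically. The sketch below is an illustration and not part of the repository: `build_run_command` is a hypothetical helper that only encodes the flags and option values listed above, and it rejects `--level` for any task other than `test_probiotic`.

```python
import shlex

# Option values as listed in the usage description above.
TASKS = {
    "test_growth", "test_oxygen", "test_salinity", "test_pH",
    "test_temperature", "test_probiotic", "extract_embed",
}
LEVELS = {"family", "genus"}


def build_run_command(input_path, output_dir, task=None, level=None):
    """Build the argv list for run.sh from the documented flags."""
    if task is not None and task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if level is not None:
        if task != "test_probiotic":
            raise ValueError("--level applies to test_probiotic only")
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
    cmd = ["bash", "run.sh",
           "--input_path", input_path,
           "--output_dir", output_dir]
    if task is not None:  # omitting --task means embeddings only
        cmd += ["--task", task]
    if level is not None:
        cmd += ["--level", level]
    return cmd


if __name__ == "__main__":
    # Print one shell-quoted command per task; paths are placeholders.
    for task in ("test_growth", "test_pH"):
        argv = build_run_command("/path/to/input/", "/path/to/output/", task)
        print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run(argv, check=True)` from the repository root, which avoids shell quoting issues with paths that contain spaces.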
- First, pull the Docker image:
docker pull sunhaotong0605/microgenomer:0.1
- Run Inference for Different Downstream Tasks:
docker run --rm --gpus all \
-v /path/to/input/:/data/input \
-v /path/to/output/:/data/output \
-e INPUT_PATH="/data/input" \
-e OUTPUT_PATH="/data/output" \
sunhaotong0605/microgenomer:0.1 \
sh -c 'bash run.sh --input_path $INPUT_PATH --output_dir $OUTPUT_PATH --task test_growth'
Please replace /path/to/input/, /path/to/output/, and the downstream task option (e.g., test_growth) with your actual paths and desired task.
This project is licensed under the MIT License. See LICENSE for more details.
Kang Q, Guo Y, Hu B, et al. MicroGenomer: A Foundation Model for Transferable Microbial Genome Representations Enabling Multi-scale Genomic Understanding and Ecophysiological Trait Prediction. bioRxiv. doi: https://doi.org/10.64898/2025.12.28.696777.
