DIA-BERT: a pre-trained model for data-independent acquisition mass spectrometry-based proteomics data analysis
If you use DIA-BERT in your work, please cite the following publication:
Liu, Z., Liu, P., Sun, Y. et al. DIA-BERT: pre-trained end-to-end transformer models for enhanced DIA proteomics data analysis. Nat Commun 16, 3530 (2025). https://doi.org/10.1038/s41467-025-58866-4.
The software and manual can be downloaded from https://guomics.com/DIA-BERT/downloads.html. On Linux, download the file from the release page. The packaged version of DIA-BERT runs without installation and requires no additional environment configuration.
To run DIA-BERT from source code instead, install Python and the required packages.
Please make sure you have a valid installation of conda or miniconda. We recommend setting up miniconda as described on their website.
git clone https://github.com/guomics-lab/DIA-BERT.git
cd DIA-BERT
conda create -n DIA-BERT python=3.10
source activate DIA-BERT
(If the command doesn't work, please refer to the Conda installation guide for instructions on how to activate Conda.)
On Linux:
pip install -r requirements_linux.txt
You also need to install torch from PyTorch (https://pytorch.org/). It is advisable to install the full PyTorch package using the official installation method.
First, identify the CUDA version installed on your system (you can check it by running the nvcc --version command). Then choose the installation command that matches that CUDA version. For example, for CUDA 12.4 run "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124".
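As a rough illustration of the CUDA check described above, the following sketch runs nvcc --version, extracts the release number, and derives the "cuXYZ" style tag that appears in the example index URL. The function names are hypothetical and are not part of DIA-BERT. PyTorch does not publish wheels for every CUDA release, so treat the result as a hint and confirm the exact command on the PyTorch website.

```python
import re
import subprocess
from typing import Optional


def parse_cuda_release(nvcc_output: str) -> Optional[str]:
    """Return the CUDA release as 'major.minor', or None if it is not found."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def wheel_tag(cuda_release: str) -> str:
    """Turn '12.4' into the 'cu124' style tag used in the example index URL."""
    major, minor = cuda_release.split(".")
    return f"cu{major}{minor}"


def detect_cuda_release() -> Optional[str]:
    """Run `nvcc --version`; return None when nvcc is missing or fails."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return parse_cuda_release(result.stdout)


if __name__ == "__main__":
    release = detect_cuda_release()
    if release is None:
        print("nvcc not found; check your CUDA installation.")
    else:
        # The tag is only a candidate: verify it against the PyTorch website.
        print(f"CUDA {release} detected; candidate wheel tag: {wheel_tag(release)}")
```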
Linux command-line run
```shell
python main_linux.py
```
• Operating System: Supports both Windows and Linux operating systems.
• Processor: A dual-core processor is recommended, but it can run on a single-core processor.
• Memory: 40GB or more is recommended. If the mass spectrometry files or library files to be identified are large, it is advised to use more memory.
• Storage: At least 100GB of available hard disk space is recommended.
• Graphics Card: A 40GB NVIDIA GPU with CUDA support or a V100 32GB GPU is recommended.
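The memory and storage recommendations above can be checked before starting a long run. The sketch below is an assumption about how such a check might look, not something shipped with DIA-BERT: the function names are hypothetical, the memory query relies on os.sysconf (available on Linux), and sizes are computed in GiB.

```python
import os
import shutil
from typing import List

# Recommended minimums taken from the requirements list above.
RECOMMENDED_MEMORY_GB = 40
RECOMMENDED_DISK_GB = 100


def check_resources(memory_gb: float, free_disk_gb: float) -> List[str]:
    """Return one warning per recommendation that is not met."""
    warnings = []
    if memory_gb < RECOMMENDED_MEMORY_GB:
        warnings.append(
            f"Memory: {memory_gb:.0f} GB found, {RECOMMENDED_MEMORY_GB} GB recommended."
        )
    if free_disk_gb < RECOMMENDED_DISK_GB:
        warnings.append(
            f"Storage: {free_disk_gb:.0f} GB free, {RECOMMENDED_DISK_GB} GB recommended."
        )
    return warnings


def total_memory_gb() -> float:
    """Total physical memory in GiB (Linux only, via os.sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3


def free_disk_gb(path: str = ".") -> float:
    """Free space in GiB on the file system that holds `path`."""
    return shutil.disk_usage(path).free / 1024**3


if __name__ == "__main__":
    try:
        problems = check_resources(total_memory_gb(), free_disk_gb("."))
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing or unsupported on this platform.
        problems = ["Could not query system resources on this platform."]
    for line in problems or ["Memory and storage meet the recommendations."]:
        print(line)
```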
This software is licensed under a custom license that allows academic use but prohibits commercial use. For more details, see the LICENSE file.
For any questions or licensing inquiries, please contact Dr. Guo (e-mail: guotiannan@westlake.edu.cn, website: www.guomics.com).
Q: Can it analyze semi-specific tryptic samples with DIA?
A: This model was trained exclusively on fully tryptic peptides (canonical tryptic cleavage rules). We have not yet evaluated its performance on semi-specific tryptic samples (e.g., peptides with missed cleavages or non-canonical termini).
If you’re interested in testing the model on such data, we’d be very keen to learn about your findings! Your feedback would be invaluable for further optimizing the model’s generalizability. Feel free to share any results or observations with us.
Q: Would it be possible to make the model code available as open source?
A: Yes. The full source code, including the model architecture implementation, is publicly available in this GitHub repository under the terms of the LICENSE file.
Q: Is it possible to create a spectral library using a human proteome FASTA file to use within DIA-BERT?
A: DIA-BERT does not support generating a spectral library directly from a FASTA file. However, you can create the spectral library using external tools. The required elements and format for the library are detailed in the user manual, which is available at: https://guomics.com/DIA-BERT/downloads.html.
Q: What should I do if the file is too large and causes an out-of-memory (OOM) error?
A: Try reducing the step_size and batch_size parameters to lower memory usage. Alternatively, run the analysis on a GPU with more memory.
Q: What information is required in a DIA-MS library for use with DIA-BERT?
A: The MS DIA spectral library used in DIA-BERT must include the following fields: PeptideSequence, FullUniModPeptideName, PrecursorCharge, PrecursorMz, FragmentMz, iRT, FragmentType, LibraryIntensity, FragmentCharge, ProteinID, and FragmentNumber.
For detailed format specifications, please refer to the user manual, available at: https://guomics.com/DIA-BERT/downloads.html
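Before starting a run, it can help to confirm that a library file contains every field listed above. The following is a minimal sketch, assuming the fields appear as column names in the first row of a delimited text file. The function name and the file name in the usage note are hypothetical, and the user manual remains the authoritative format description.

```python
import csv
import io
from typing import List, TextIO

# Field names as listed in the answer above.
REQUIRED_COLUMNS = [
    "PeptideSequence", "FullUniModPeptideName", "PrecursorCharge",
    "PrecursorMz", "FragmentMz", "iRT", "FragmentType",
    "LibraryIntensity", "FragmentCharge", "ProteinID", "FragmentNumber",
]


def missing_library_columns(handle: TextIO, delimiter: str = "\t") -> List[str]:
    """Return the required columns that are absent from the header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


if __name__ == "__main__":
    # Tiny in-memory example: a header that lacks the iRT column.
    columns = [name for name in REQUIRED_COLUMNS if name != "iRT"]
    sample = io.StringIO("\t".join(columns) + "\n")
    print("Missing columns:", missing_library_columns(sample))
```

For a file on disk, open it with `open("library.tsv", newline="")` and pass the handle to the function. Use `delimiter=","` for comma-separated libraries.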
Q: What format requirements does the DIA-MS library need to meet?
A: DIA-BERT supports spectral library files in the following formats: comma-separated (.csv, .txt) and tab-separated (.tsv, .xls, .xlsx).
Q: Can I train the model using my own data?
A: Yes, the model architecture is fully open and publicly available. You can build and train your own model using custom data, and then replace the pre-trained model file in the software with your version.
However, please note that the current version of the software does not support direct training within the application.
Q: How do I install the torch package required by DIA-BERT?
A: Install torch from PyTorch (https://pytorch.org/). It is advisable to install the full PyTorch package using the official installation method. First, identify the CUDA version installed on your system, then choose the installation command that matches that CUDA version. For example, for CUDA 12.4 run "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124".
Q: Can DIA-BERT process Astral data?
A: Yes, it can process Astral files. You can select "Other" as the instrument type. However, the identification of Astral files has not been thoroughly evaluated yet, so please use them with caution.
Q: Is there a way to automatically convert FASTA files or other database formats into the format required by the software?
A: You can use DIA-NN (https://github.com/vdemichev/DiaNN) to generate a library file in .tsv format.
Q: If I rent a cloud service with multiple V100 32GB GPUs (say N of them), should I expect the search to be N times faster, or is some additional programming needed?
A: GPU Usage Instructions:
- If you utilize n GPUs, the processing speed will theoretically be close to n times faster than using a single GPU.
- DIA-BERT automatically detects and uses all GPUs with more than 50% available memory, requiring no additional configuration.
- If you prefer to manually specify which GPUs to use, please use the following parameter: --gpu_devices [List of GPU indices to be used, separated by commas]
Example: --gpu_devices 0,1,2
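To preview which GPUs would pass a "more than 50% available memory" rule, or to build a value for --gpu_devices by hand, something like the sketch below could be used. It is an independent approximation based on nvidia-smi output and is not DIA-BERT's internal detection logic. The function names are hypothetical.

```python
import subprocess
from typing import List

# nvidia-smi query that prints one "index, used, total" line per GPU (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def gpus_with_free_memory(smi_csv: str, min_free_fraction: float = 0.5) -> List[int]:
    """Return indices of GPUs whose free memory fraction exceeds the threshold."""
    selected = []
    for line in smi_csv.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        free_fraction = 1.0 - float(used) / float(total)
        if free_fraction > min_free_fraction:
            selected.append(int(index))
    return selected


def gpu_devices_value(indices: List[int]) -> str:
    """Format indices as the comma-separated value used by --gpu_devices."""
    return ",".join(str(i) for i in indices)


if __name__ == "__main__":
    try:
        output = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        print("nvidia-smi is not available on this machine.")
    else:
        print("--gpu_devices", gpu_devices_value(gpus_with_free_memory(output)))
```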
Q: How do I pass multiple file paths using the --rawdata_file_dir_path parameter?
A: To combine the results of individual analyses for cross-run quantification, you can set the following parameters:
--open_identify=0 --open_cross_quantification=1
For example:
cd /DIA_BERT/00Versions/v1.1; python main_linux.py --rawdata_file_dir_path=/DIA_BERT/00Benchmark/DZ/Raw_data/Combined/DZ_run_combined.txt --lib=/DIA_BERT/lib_proteome/DPHLv2_reviewedfull_library_QC.tsv --out_path=/DIA_BERT/00Benchmark/DZ/Raw_data/Combined --open_identify=0 --open_cross_quantification=1
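When the same cross-run quantification step is launched for several projects, the command can be assembled programmatically. The sketch below only uses the flags shown in the example above. The helper name is hypothetical, and the paths are the ones from the example.

```python
import shlex
from typing import List


def cross_quantification_command(rawdata_path: str, lib: str, out_path: str) -> List[str]:
    """Build the argument list for a cross-run quantification call."""
    return [
        "python",
        "main_linux.py",
        f"--rawdata_file_dir_path={rawdata_path}",
        f"--lib={lib}",
        f"--out_path={out_path}",
        "--open_identify=0",              # skip identification
        "--open_cross_quantification=1",  # combine results across runs
    ]


if __name__ == "__main__":
    command = cross_quantification_command(
        "/DIA_BERT/00Benchmark/DZ/Raw_data/Combined/DZ_run_combined.txt",
        "/DIA_BERT/lib_proteome/DPHLv2_reviewedfull_library_QC.tsv",
        "/DIA_BERT/00Benchmark/DZ/Raw_data/Combined",
    )
    # Print a shell-safe version; pass `command` to subprocess.run to execute
    # it from the directory that contains main_linux.py.
    print(shlex.join(command))
```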
Q: How can I find or create the library file for the HYC dataset?
A: You can download the library file using the link below:
https://pan.baidu.com/share/init?surl=Vx-61YbQVPxMTby4bB_l2A&pwd=1y9r
