This repository contains the code for training Sailor2, a family of powerful and inclusive open language models for South-East Asia.
This codebase is based on Megatron-LLM, with adaptations for Qwen2.5 and Sailor2.
- Launch the container

Run with the common NVIDIA NGC PyTorch docker image:

```bash
sudo docker run --gpus all -it --rm --shm-size 512g \
  -v /path/to/Megatron-Sailor2/:/mpt/Megatron-Sailor2 \
  nvcr.io/nvidia/pytorch:23.07-py3
```

Note: per the NGC container documentation, "if you use Torch multiprocessing for multi-threaded data loaders, the default shared memory segment size that the container runs with may not be enough. Therefore, you should increase the shared memory size by issuing …". We set `--shm-size` to 512g since model sharding takes a large amount of shared memory. The flag must appear before the image name, otherwise docker treats it as a command for the container.
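If a fixed size is inconvenient, the NGC docs also suggest sharing the host IPC namespace instead of raising `--shm-size`; a minimal alternative sketch with the same mount as above:

```bash
# Alternative: --ipc=host gives the container access to the host's shared memory.
sudo docker run --gpus all -it --rm --ipc=host \
  -v /path/to/Megatron-Sailor2/:/mpt/Megatron-Sailor2 \
  nvcr.io/nvidia/pytorch:23.07-py3
```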
Enter the repository:

```bash
cd /mpt/Megatron-Sailor2/
```

Install the additional dependencies not included in the nvcr image:

```bash
pip install -r requirements.txt
```
- Configure your Hugging Face token and wandb token in `run_setup.sh`, then run:

```bash
bash run_setup.sh
```
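The exact variable names are defined in `run_setup.sh` and may differ; purely as an illustration, token configuration usually boils down to something like:

```bash
# Hypothetical sketch; check run_setup.sh for the actual variable names.
export HF_TOKEN="hf_..."        # Hugging Face access token
export WANDB_API_KEY="..."      # Weights & Biases API key

huggingface-cli login --token "$HF_TOKEN"   # authenticate against the HF Hub
wandb login "$WANDB_API_KEY"                # authenticate wandb logging
```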
- Install the `megatron/data/helpers` binary:

```bash
cd megatron/data/
make
cd ../../
```
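A quick way to verify the build is to import the compiled extension from the repository root (assuming the standard Megatron layout, where `helpers` is a C++ extension used by the data loaders):

```bash
# Exits silently on success; an ImportError means the make step failed.
python -c "from megatron.data import helpers"
```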
- Run data preprocessing and model conversion:

```bash
bash run_sailor2.sh
```
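For orientation, scripts like this typically wrap a Megatron-style `tools/preprocess_data.py` call that tokenizes raw JSONL into binary `.bin`/`.idx` index files. The sketch below is illustrative only; take the real paths and flags from `run_sailor2.sh`:

```bash
# Hypothetical invocation following the Megatron-LLM flag convention; the exact
# flags may differ in this fork. <TOKENIZER> is a placeholder for its tokenizer
# choice. --input expects one JSON document per line.
python tools/preprocess_data.py \
    --input=/path/to/corpus.jsonl \
    --output_prefix=sailor2_corpus \
    --tokenizer_type=<TOKENIZER> \
    --workers=16 \
    --append_eod
```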
- Train the model:

```bash
bash run_train_sailor2_20b.sh
```

Adjust `DISTRIBUTED_ARGS` in the launch script when using more GPUs (default = 8); see the sketch below.
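A minimal sketch, assuming the script forwards `DISTRIBUTED_ARGS` to `torchrun`/`torch.distributed.launch` (check the actual script for how the variable is used):

```bash
# Hypothetical 2-node x 8-GPU setting; NODE_RANK and MASTER_ADDR normally come
# from your scheduler or environment.
DISTRIBUTED_ARGS="--nproc_per_node 8 \
                  --nnodes 2 \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port 6000"
```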
- For supervised fine-tuning (SFT), preprocess the SFT data and then launch SFT training:

```bash
bash run_preprocess_data_sft.sh
bash run_train_qwen2_05b_sft.sh
```
Please refer to the Megatron-LLM documentation for more details.
If you find this work useful, please cite:

```
@article{sailor2report,
  title   = {Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},
  author  = {Longxu Dou and Qian Liu and Fan Zhou and Changyu Chen and Zili Wang and Ziqi Jin and Zichen Liu and Tongyao Zhu and Cunxiao Du and Penghui Yang and Haonan Wang and Jiaheng Liu and Yongchi Zhao and Xiachong Feng and Xin Mao and Man Tsung Yeung and Kunat Pipatanakul and Fajri Koto and Min Si Thu and Hynek Kydl{\'\i}{\v{c}}ek and Zeyi Liu and Qunshu Lin and Sittipong Sripaisarnmongkol and Kridtaphad Sae-Khow and Nirattisai Thongchim and Taechawat Konkaew and Narong Borijindargoon and Anh Dao and Matichon Maneegard and Phakphum Artkaew and Zheng-Xin Yong and Quan Nguyen and Wannaphong Phatthiyaphaibun and Hoang H. Tran and Mike Zhang and Shiqi Chen and Tianyu Pang and Chao Du and Xinyi Wan and Wei Lu and Min Lin},
  journal = {arXiv preprint arXiv:2502.12982},
  year    = {2025}
}
```
```
@software{epfmgtrn,
  author = {Alejandro Hernández Cano and
            Matteo Pagliardini and
            Andreas Köpf and
            Kyle Matoba and
            Amirkeivan Mohtashami and
            Xingyao Wang and
            Olivia Simin Fan and
            Axel Marmet and
            Deniz Bayazit and
            Igor Krawczuk and
            Zeming Chen and
            Francesco Salvi and
            Antoine Bosselut and
            Martin Jaggi},
  title  = {epfLLM Megatron-LLM},
  year   = {2023},
  url    = {https://github.com/epfLLM/Megatron-LLM}
}
```