- Feb. 12, 2026: The MiLMMT paper "Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models" is available on arXiv!
- Jan. 23, 2025: The GemmaX2 paper "Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study" has been accepted at NAACL 2025!
Model checkpoints are released on Hugging Face:
| Models | Descriptions |
|---|---|
| GemmaX2-28-2B-Pretrain | Developed through continual pretraining of Gemma2-2B. |
| GemmaX2-28-2B-v0.1 | Finetuned on GemmaX2-28-2B-Pretrain with translation instructions (v0.1). |
| GemmaX2-28-2B-v0.2 | Finetuned on GemmaX2-28-2B-Pretrain with translation instructions (v0.2). |
| GemmaX2-28-9B-Pretrain | Developed through continual pretraining of Gemma2-9B. |
| GemmaX2-28-9B-v0.1 | Finetuned on GemmaX2-28-9B-Pretrain with translation instructions (v0.1). |
| GemmaX2-28-9B-v0.2 | Finetuned on GemmaX2-28-9B-Pretrain with translation instructions (v0.2). |
Note that GemmaX2-28-2B-Pretrain and GemmaX2-28-9B-Pretrain are NOT translation models.
| Models | Descriptions |
|---|---|
| MiLMMT-46-1B-Pretrain | Developed through continual pretraining of Gemma3-1B. |
| MiLMMT-46-1B-v0.1 | Finetuned on MiLMMT-46-1B-Pretrain with translation instructions. |
| MiLMMT-46-4B-Pretrain | Developed through continual pretraining of Gemma3-4B. |
| MiLMMT-46-4B-v0.1 | Finetuned on MiLMMT-46-4B-Pretrain with translation instructions. |
| MiLMMT-46-12B-Pretrain | Developed through continual pretraining of Gemma3-12B. |
| MiLMMT-46-12B-v0.1 | Finetuned on MiLMMT-46-12B-Pretrain with translation instructions. |
Note that MiLMMT-46-1B-Pretrain, MiLMMT-46-4B-Pretrain, and MiLMMT-46-12B-Pretrain are NOT translation models.
GemmaX2-28 models support 28 languages: Arabic, Bengali, Czech, German, English, Spanish, Persian, French, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Burmese, Dutch, Polish, Portuguese, Russian, Thai, Tagalog, Turkish, Urdu, Vietnamese, Chinese.
MiLMMT-46 models support 46 languages: Arabic, Azerbaijani, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, English, Spanish, Persian, Finnish, French, Hebrew, Hindi, Croatian, Hungarian, Indonesian, Italian, Japanese, Kazakh, Khmer, Korean, Lao, Malay, Burmese, Norwegian, Dutch, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Swedish, Tamil, Thai, Tagalog, Turkish, Urdu, Uzbek, Vietnamese, Cantonese, Chinese (Simplified), Chinese (Traditional).
The translation prompt follows this template:
```
Translate this from <source language name> to <target language name>:
<source language name>: <source language sentence>
<target language name>:
```
Please use the language names exactly as listed above in the translation prompt.
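For programmatic use, the prompt can be assembled directly from this template. A minimal sketch (the `build_prompt` helper is ours for illustration, not part of the released code):

```python
# Minimal sketch: assemble the translation prompt from the template above.
# `build_prompt` is an illustrative helper, not part of the released code.
def build_prompt(src_lang: str, tgt_lang: str, sentence: str) -> str:
    return (
        f"Translate this from {src_lang} to {tgt_lang}:\n"
        f"{src_lang}: {sentence}\n"
        f"{tgt_lang}:"
    )


print(build_prompt("Chinese (Simplified)", "English", "我爱机器翻译"))
```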
Example of translation inference with vLLM:
```python
from vllm import LLM, SamplingParams
model_id = "xiaomi-research/MiLMMT-46-12B-v0.1"
model = LLM(model=model_id)
sampling_params = SamplingParams(top_k=1, temperature=0, max_tokens=2048)
text = "Translate this from Chinese (Simplified) to English:\nChinese (Simplified): 我爱机器翻译\nEnglish:"
outputs = model.generate(text, sampling_params)
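# Note: generate() also accepts a list of prompts, so many sentences can be
# translated in one batched call, e.g. model.generate([text1, text2], sampling_params).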
print(outputs[0].outputs[0].text)
```

Example of translation inference with Hugging Face Transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "xiaomi-research/MiLMMT-46-12B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
text = "Translate this from Chinese (Simplified) to English:\nChinese (Simplified): 我爱机器翻译\nEnglish:"
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=1024)
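# Note: outputs[0] holds the prompt tokens followed by the generated tokens; to print
# only the translation, slice off the prompt before decoding, e.g.:
# tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)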
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

We train our models with the LlamaFactory framework. Please check here for instructions on adding pretraining and finetuning datasets in LlamaFactory.
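LlamaFactory registers custom datasets through its data/dataset_info.json file. The sketch below shows what such entries might look like; the dataset and file names are placeholders of ours, and the exact schema can differ across LlamaFactory versions, so treat it as an assumption and follow the guide linked above.

```python
# Hypothetical sketch: registering custom datasets in LlamaFactory's data/dataset_info.json.
# Dataset and file names are placeholders; the exact schema depends on your LlamaFactory version.
import json
from pathlib import Path

info_path = Path("data/dataset_info.json")
dataset_info = json.loads(info_path.read_text(encoding="utf-8")) if info_path.exists() else {}

# Plain-text corpus for continual pretraining (one "text" field per sample).
dataset_info["my_cpt_corpus"] = {
    "file_name": "cpt.json",
    "columns": {"prompt": "text"},
}

# Instruction-style translation pairs for supervised finetuning.
dataset_info["my_translation_sft"] = {
    "file_name": "sft.json",
    "columns": {"prompt": "instruction", "query": "input", "response": "output"},
}

info_path.write_text(json.dumps(dataset_info, ensure_ascii=False, indent=2), encoding="utf-8")
```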
Example data samples for multilingual continual pretraining are provided in examples/cpt.json. Use the following command for reference:
```bash
bash scripts/cpt.sh
```

Example data samples for translation instruction finetuning are provided in examples/sft.json. Use the following command for reference:
```bash
bash scripts/sft.sh
```

If you find the resources in this repository helpful, please cite as:
```bibtex
@misc{shang2026scalingmodeldatamultilingual,
title={Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models},
author={Yuzhe Shang and Pengzhi Gao and Wei Liu and Jian Luan and Jinsong Su},
year={2026},
eprint={2602.11961},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.11961},
}
@inproceedings{cui-etal-2025-multilingual,
title = "Multilingual Machine Translation with Open Large Language Models at Practical Scale: An Empirical Study",
author = "Cui, Menglong and
Gao, Pengzhi and
Liu, Wei and
Luan, Jian and
Wang, Bin",
editor = "Chiruzzo, Luis and
Ritter, Alan and
Wang, Lu",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-long.280/",
doi = "10.18653/v1/2025.naacl-long.280",
pages = "5420--5443",
ISBN = "979-8-89176-189-6",
abstract = "Large language models (LLMs) have shown continuously improving multilingual capabilities, and even small-scale open-source models have demonstrated rapid performance enhancement. In this paper, we systematically explore the abilities of open LLMs with less than ten billion parameters to handle multilingual machine translation (MT) tasks. We conduct comprehensive evaluations on six popular LLMs and find that models like Gemma2-9B exhibit impressive multilingual translation capabilities. We then introduce the Parallel-First Monolingual-Second (PFMS) data mixing strategy in the continual pretraining stage to further enhance the MT performance and present GemmaX2-28, a 9B model achieving top-tier multilingual translation performance across 28 languages. Specifically, GemmaX2-28 consistently outperforms the state-of-the-art (SOTA) models such as TowerInstruct and X-ALMA and achieves competitive performance with Google Translate and GPT-4-turbo."
}
```
