`unimax_sampling` implements the UniMax sampling method introduced by Chung et al. (2023). UniMax balances language representation in multilingual large language models by explicitly capping the number of repeats over each language's corpus, which mitigates overfitting on tail languages while delivering more uniform coverage of head languages.
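The core allocation can be sketched in a few lines of plain Python. This is a simplified, illustrative re-implementation under a made-up helper name (`unimax_budgets`), not the package's internal code:

```python
# Simplified, illustrative sketch of the UniMax "waterfall" allocation
# (Chung et al., 2023). The unimax package exposes this logic as unimax();
# the helper name unimax_budgets is ours.
def unimax_budgets(char_counts, character_budget, max_epochs):
    """Allocate a character budget across languages, capping each
    language at max_epochs passes over its corpus."""
    budgets = {}
    remaining = float(character_budget)
    # Visit languages from the smallest corpus to the largest.
    todo = sorted(char_counts, key=char_counts.get)
    while todo:
        lang = todo[0]
        uniform_share = remaining / len(todo)
        cap = char_counts[lang] * max_epochs
        if cap < uniform_share:
            # This corpus is exhausted after max_epochs: take the cap and
            # redistribute the leftover budget over the larger languages.
            budgets[lang] = cap
            remaining -= cap
            todo.pop(0)
        else:
            # Every remaining language can absorb a full uniform share.
            for other in todo:
                budgets[other] = uniform_share
            todo = []
    return budgets


budgets = unimax_budgets(
    {"swe": 179955884499, "fas": 184595788282, "ekk": 42541080893,
     "isl": 10027573389, "fao": 549707867},
    character_budget=250_000_000_000,
    max_epochs=4,
)
# budgets["fao"] == 2198831468 (capped at 4 epochs), while the three
# largest corpora split the remaining budget uniformly.
```

Languages are visited from smallest to largest corpus, so once the smallest remaining corpus can absorb a full uniform share, every larger one can too.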
```shell
# UniMax algorithm only
pip install unimax_sampling

# Including optional dependencies for the count-characters sub-command
pip install unimax_sampling[count]
```

```python
from unimax import unimax, count_characters

# (Optional) count characters in each dataset; only available when the
# optional dependencies are installed via unimax_sampling[count]
character_counts = {}
for subset in ("swe_Latn", "fas_Arab", "ekk_Latn", "isl_Latn", "fao_Latn"):
    character_counts[subset.split("_")[0]] = count_characters("HuggingFaceFW/fineweb-2", subset)

# Compute the UniMax distribution from the available characters per language
character_counts = {
    "swe": 179955884499,
    "fas": 184595788282,
    "ekk": 42541080893,
    "isl": 10027573389,
    "fao": 549707867,
}
distribution = unimax(
    character_counts,
    character_budget=250_000_000_000,
    max_epochs=4,
)
```

Output:
```python
UniMaxDistribution(
    budgets={
        "fao": 2198831468,
        "isl": 40110293556,
        "ekk": 69230291658.66667,
        "swe": 69230291658.66666,
        "fas": 69230291658.66666,
    },
    epochs={
        "fao": 4.0,
        "isl": 4.0,
        "ekk": 1.627375003300828,
        "swe": 0.3847070177860806,
        "fas": 0.37503722215431134,
    },
    probabilities={
        "fao": 0.008795325872,
        "isl": 0.160441174224,
        "ekk": 0.2769211666346667,
        "swe": 0.27692116663466665,
        "fas": 0.27692116663466665,
    },
)
```

For convenience, the package can also be executed as a command-line utility:
```shell
python -m unimax count-characters <dataset_path> <language_code> <output_json_file> [-c <dataset_configuration>] [-s <split>]
```

Note: `count-characters` requires `unimax_sampling` to be installed via `pip install unimax_sampling[count]`.

Example:

```shell
python -m unimax count-characters HuggingFaceFW/fineweb-2 fra french.json -c fra_Latn
```

```shell
python -m unimax unimax <character_count_files> -c <character_budget> -m <max_epochs> [-r <language_codes>] [-o <distribution_json_file>]
```

Example:
```shell
python -m unimax unimax character_counts.json -c 250000000000 -m 4 -o distribution.json
```

Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, and Noah Constant. 2023. UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda.
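As a closing usage sketch (illustrative only, not part of the package), the probabilities from a computed distribution can be fed directly into Python's `random.choices` to pick a language for each training example. The values below are the example probabilities from the output shown earlier:

```python
import random

# Probabilities taken from the example UniMax distribution above.
probabilities = {
    "fao": 0.008795325872,
    "isl": 0.160441174224,
    "ekk": 0.2769211666346667,
    "swe": 0.27692116663466665,
    "fas": 0.27692116663466665,
}

rng = random.Random(0)  # seeded for reproducibility
languages = list(probabilities)
weights = [probabilities[lang] for lang in languages]

# Draw the language for each of the next 10 training examples.
sample = rng.choices(languages, weights=weights, k=10)
```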