kgnlp/unimax

UniMax

unimax_sampling implements the UniMax sampling method introduced by Chung et al. (2023). This method aims to balance language representation in multilingual large language models by explicitly capping the number of repeats over each language's corpus, thereby mitigating overfitting on tail languages while delivering more uniform coverage of head languages.
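The capping idea can be illustrated with a short, self-contained sketch. This is not the package's actual code: the function name `unimax_sketch` and its return shape are assumptions made for illustration, and the logic follows the procedure described by Chung et al. (2023) as understood here. Languages are visited from smallest to largest corpus, each receives a uniform share of the remaining budget, and that share is capped at `max_epochs` passes over the corpus.

```python
def unimax_sketch(character_counts, character_budget, max_epochs):
    """Illustrative UniMax budget allocation (hypothetical helper, not the package API)."""
    budgets = {}
    remaining_budget = character_budget
    # Visit languages from the smallest corpus to the largest.
    languages = sorted(character_counts, key=character_counts.get)
    for index, language in enumerate(languages):
        remaining_languages = len(languages) - index
        # Uniform share of the budget that is still unallocated.
        uniform_share = remaining_budget / remaining_languages
        # Cap: at most max_epochs passes over this language's corpus.
        cap = character_counts[language] * max_epochs
        budgets[language] = min(uniform_share, cap)
        remaining_budget -= budgets[language]
    # Normalize the budgets into sampling probabilities.
    total = sum(budgets.values())
    probabilities = {lang: budget / total for lang, budget in budgets.items()}
    return budgets, probabilities
```

Because small corpora hit the cap first, the budget they leave unused is redistributed evenly among the larger languages that follow.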

Installation

# UniMax algorithm only
pip install unimax_sampling
# Including optional dependencies for the count-characters sub-command
pip install unimax_sampling[count]

Programmatic Usage

from unimax import unimax, count_characters

# (Optional) Count the characters in each dataset. Only available when the optional dependencies are installed via unimax_sampling[count].
character_counts = {}
for subset in ("swe_Latn", "fas_Arab", "ekk_Latn", "isl_Latn", "fao_Latn"):
    character_counts[subset.split("_")[0]] = count_characters("HuggingFaceFW/fineweb-2", subset)

# Compute the UniMax distribution from the available characters per language.
# The precomputed counts below replace the counts gathered in the optional step above.
character_counts = {
    "swe": 179955884499,
    "fas": 184595788282,
    "ekk": 42541080893,
    "isl": 10027573389,
    "fao": 549707867,
}

distribution = unimax(
    character_counts,
    character_budget=250_000_000_000,
    max_epochs=4,
)

Output:

UniMaxDistribution(
    budgets={
        "fao": 2198831468,
        "isl": 40110293556,
        "ekk": 69230291658.66667,
        "swe": 69230291658.66666,
        "fas": 69230291658.66666,
    },
    epochs={
        "fao": 4.0,
        "isl": 4.0,
        "ekk": 1.627375003300828,
        "swe": 0.3847070177860806,
        "fas": 0.37503722215431134,
    },
    probabilities={
        "fao": 0.008795325872,
        "isl": 0.160441174224,
        "ekk": 0.2769211666346667,
        "swe": 0.27692116663466665,
        "fas": 0.27692116663466665,
    },
)
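One way to consume the resulting probabilities is to draw the source language of each training document from them. The sketch below is only an illustration: the probabilities are copied (rounded) from the output above into a plain dictionary rather than read from the returned object, and the 10,000-document draw and the fixed seed are arbitrary assumptions.

```python
import random

# Probabilities copied (rounded) from the output above.
probabilities = {
    "fao": 0.0088,
    "isl": 0.1604,
    "ekk": 0.2769,
    "swe": 0.2769,
    "fas": 0.2769,
}

rng = random.Random(0)  # fixed seed so the draw is repeatable
languages = list(probabilities)
weights = [probabilities[lang] for lang in languages]

# Draw the source language for each of 10,000 hypothetical training documents.
# random.choices treats the weights as relative, so they need not sum to exactly 1.
draws = rng.choices(languages, weights=weights, k=10_000)

# Empirical share of each language among the drawn documents.
observed = {lang: draws.count(lang) / len(draws) for lang in languages}
```

With enough draws, the observed shares should approach the UniMax probabilities, so the three largest languages are sampled about equally often while Faroese stays near its capped share.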

Command-Line Usage

For convenience, the package can also be run as a command-line utility.

Counting Characters

python -m unimax count-characters <dataset_path> <language_code> <output_json_file> [-c <dataset_configuration>] [-s <split>]

Note: count-characters requires unimax_sampling to be installed with its optional dependencies via pip install unimax_sampling[count].

Example:

python -m unimax count-characters HuggingFaceFW/fineweb-2 fra french.json -c fra_Latn

Calculating the UniMax Distribution

python -m unimax unimax <character_count_files> -c <character_budget> -m <max_epochs> [-r <language_codes>] [-o <distribution_json_file>]

Example:

python -m unimax unimax character_counts.json -c 250000000000 -m 4 -o distribution.json

References

Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, and Noah Constant. 2023. UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda.
