`unimax_sampling` implements the UniMax sampling method introduced by Chung et al. (2023). UniMax balances language representation in multilingual large language models by explicitly capping the number of repeats over each language's corpus, which mitigates overfitting on tail languages while delivering more uniform coverage of head languages.
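The core allocation can be sketched in a few lines of plain Python. This is a simplified, illustrative re-implementation under a made-up helper name (`unimax_budgets`), not the package's internal code:

```python
# Simplified, illustrative sketch of the UniMax "waterfall" allocation
# (Chung et al., 2023). The unimax package exposes this logic as unimax();
# the helper name unimax_budgets is ours.
def unimax_budgets(char_counts, character_budget, max_epochs):
    """Allocate a character budget across languages, capping each
    language at max_epochs passes over its corpus."""
    budgets = {}
    remaining = float(character_budget)
    # Visit languages from the smallest corpus to the largest.
    todo = sorted(char_counts, key=char_counts.get)
    while todo:
        lang = todo[0]
        uniform_share = remaining / len(todo)
        cap = char_counts[lang] * max_epochs
        if cap < uniform_share:
            # This corpus is exhausted after max_epochs: take the cap and
            # redistribute the leftover budget over the larger languages.
            budgets[lang] = cap
            remaining -= cap
            todo.pop(0)
        else:
            # Every remaining language can absorb a full uniform share.
            for other in todo:
                budgets[other] = uniform_share
            todo = []
    return budgets


budgets = unimax_budgets(
    {"swe": 179955884499, "fas": 184595788282, "ekk": 42541080893,
     "isl": 10027573389, "fao": 549707867},
    character_budget=250_000_000_000,
    max_epochs=4,
)
# budgets["fao"] == 2198831468 (capped at 4 epochs), while the three
# largest corpora split the remaining budget uniformly.
```

Languages are visited from smallest to largest corpus, so once the smallest remaining corpus can absorb a full uniform share, every larger one can too.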
```shell
# UniMax algorithm only
pip install unimax_sampling

# Including optional dependencies for the count-characters sub-command
pip install unimax_sampling[count]
```

```python
from unimax import unimax, count_characters

# (Optional) count characters in each dataset; only available when the
# optional dependencies are installed via unimax_sampling[count]
character_counts = {}
for subset in ("swe_Latn", "fas_Arab", "ekk_Latn", "isl_Latn", "fao_Latn"):
    character_counts[subset.split("_")[0]] = count_characters("HuggingFaceFW/fineweb-2", subset)

# Compute the UniMax distribution from the available characters per language
character_counts = {
    "swe": 179955884499,
    "fas": 184595788282,
    "ekk": 42541080893,
    "isl": 10027573389,
    "fao": 549707867,
}
distribution = unimax(
    character_counts,
    character_budget=250_000_000_000,
    max_epochs=4,
)
```

Output:
```python
UniMaxDistribution(
    budgets={
        "fao": 2198831468,
        "isl": 40110293556,
        "ekk": 69230291658.66667,
        "swe": 69230291658.66666,
        "fas": 69230291658.66666,
    },
    epochs={
        "fao": 4.0,
        "isl": 4.0,
        "ekk": 1.627375003300828,
        "swe": 0.3847070177860806,
        "fas": 0.37503722215431134,
    },
    probabilities={
        "fao": 0.008795325872,
        "isl": 0.160441174224,
        "ekk": 0.2769211666346667,
        "swe": 0.27692116663466665,
        "fas": 0.27692116663466665,
    },
)
```

For convenience, the package can also be executed as a command-line utility:
```shell
python -m unimax count-characters <dataset_path> <language_code> <output_json_file> [-c <dataset_configuration>] [-s <split>]
```

Note: `count-characters` requires `unimax_sampling` to be installed via `pip install unimax_sampling[count]`.

Example:

```shell
python -m unimax count-characters HuggingFaceFW/fineweb-2 fra french.json -c fra_Latn
```

```shell
python -m unimax unimax <character_count_files> -c <character_budget> -m <max_epochs> [-r <language_codes>] [-o <distribution_json_file>]
```

Example:
```shell
python -m unimax unimax character_counts.json -c 250000000000 -m 4 -o distribution.json
```

Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, and Noah Constant. 2023. UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda.
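As a closing usage sketch (illustrative only, not part of the package), the probabilities from a computed distribution can be fed directly into Python's `random.choices` to pick a language for each training example. The values below are the example probabilities from the output shown earlier:

```python
import random

# Probabilities taken from the example UniMax distribution above.
probabilities = {
    "fao": 0.008795325872,
    "isl": 0.160441174224,
    "ekk": 0.2769211666346667,
    "swe": 0.27692116663466665,
    "fas": 0.27692116663466665,
}

rng = random.Random(0)  # seeded for reproducibility
languages = list(probabilities)
weights = [probabilities[lang] for lang in languages]

# Draw the language for each of the next 10 training examples.
sample = rng.choices(languages, weights=weights, k=10)
```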