SLIDE

Data and code (still being added) for the paper "Multi-label Scandinavian Language Identification (SLIDE)", presented at RESOURCEFUL-2025.

Models on HuggingFace

The SLIDE-Fast model now available on HuggingFace is an updated version. It scores a strict accuracy of 93.6 on our test dataset and 94.9 on the dataset of Haas and Derczynski (2021).

Reproduce metrics (Table 4):

cd src/
./run_all.sh

Reproduce evaluation on nordic_langid (Table 5):

Obtain the data from nordic_langid on HuggingFace and put the *test.csv files into src/evaluation.

cd src/
python3 nordic_langid2jsonl.py 
python3 evaluate.py --method bert --model ltg/SLIDE-base --dataset nordic_dsl_test50k.jsonl
python3 evaluate.py --method bert --model ltg/SLIDE-base --dataset nordic_dsl_test10k.jsonl

The values printed by these commands will differ from those in Table 5 of the paper.

The values in Table 5 were obtained with loose accuracy as it is described in the paper.

The current evaluate.py instead accepts a prediction only if the predicted languages are a subset of the gold languages, not if they merely intersect with them. (The values in Table 4 were obtained with this subset definition.) The change affects the exact values (by less than 2%), but the ranking of the models remains the same.
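
The difference between the subset and the intersection reading of loose accuracy can be sketched as follows. This is only an illustration, not the code in evaluate.py: the function names, the language codes, the toy examples, and the choice to count an empty prediction as wrong are all assumptions made for this sketch.

```python
# Illustrative sketch only; NOT the actual evaluate.py implementation.
# All names and the toy data below are hypothetical.


def loose_match_subset(pred, gold):
    """Correct if every predicted language is among the gold languages.

    Assumption: an empty prediction counts as incorrect.
    """
    pred, gold = set(pred), set(gold)
    return bool(pred) and pred <= gold


def loose_match_intersection(pred, gold):
    """Correct if at least one predicted language is among the gold languages."""
    return bool(set(pred) & set(gold))


def accuracy(examples, match):
    """Fraction of (prediction, gold) pairs accepted by the `match` criterion."""
    return sum(match(pred, gold) for pred, gold in examples) / len(examples)


# Toy (prediction, gold) pairs with made-up language codes.
examples = [
    ({"nb"}, {"nb", "da"}),        # subset of gold: accepted by both criteria
    ({"nb", "sv"}, {"nb", "da"}),  # overlaps gold but is not a subset
    ({"sv"}, {"nb"}),              # no overlap: rejected by both criteria
    ({"nn"}, {"nn"}),              # exact match: accepted by both criteria
]

print(accuracy(examples, loose_match_subset))        # 0.5
print(accuracy(examples, loose_match_intersection))  # 0.75
```

Only the second toy example is scored differently by the two criteria, which is why the subset criterion can never give a higher score than the intersection criterion on the same predictions.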

Cite us

@inproceedings{fedorova-etal-2025-multi,
    title = "Multi-label {S}candinavian Language Identification ({SLIDE})",
    author = "Fedorova, Mariia  and
      Frydenberg, Jonas Sebulon  and
      Handford, Victoria  and
      Lang{\o}, Victoria Ovedie Chruickshank  and
      Willoch, Solveig Helene  and
      Midtgaard, Marthe L{\o}ken  and
      Scherrer, Yves  and
      M{\ae}hlum, Petter  and
      Samuel, David",
    editor = "Holdt, {\v{S}}pela Arhar  and
      Ilinykh, Nikolai  and
      Scalvini, Barbara  and
      Bruton, Micaella  and
      Debess, Iben Nyholm  and
      Tudor, Crina Madalina",
    booktitle = "Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)",
    month = mar,
    year = "2025",
    address = "Tallinn, Estonia",
    publisher = "University of Tartu Library, Estonia",
    url = "https://aclanthology.org/2025.resourceful-1.33/",
    pages = "179--189",
    ISBN = "978-9908-53-121-2",
    abstract = "Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokm{\r{a}}l, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed{--}accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models."
}
