- Evaluating OpenAI GPT Models for Translation of Endangered Uralic Languages: A Comparison of Reasoning and Non-Reasoning Architectures
- PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
- MultiWikiQA: A Reading Comprehension Benchmark in 300+ Languages
- Combining Automated and Manual Data for Effective Downstream Fine-Tuning of Transformers for Low-Resource Language Applications
- Alligators All Around: Mitigating Lexical Confusion in Low-resource Machine Translation
- Using Large Language Models to Transliterate Endangered Uralic Languages
- MaLA-500: Massive Language Adaptation of Large Language Models
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
- MADLAD-400: A Multilingual And Document-Level Large Audited Dataset
- Machine Translation for Low-resource Finno-Ugric Languages
- Sentiment Analysis Using Aligned Word Embeddings for Uralic Languages
- Low-Resource ASR with an Augmented Language Model, 2021
- The role of the Udmurt spell checker in replenishment of the Udmurt national corpus, 2020
- Corpora of social media in minority Uralic languages, 2019
- Sound-aligned corpus of Udmurt dialectal texts, 2018
- Creating seed lexicons for under-resourced languages, 2016
- FinUgRevita: Developing Language Technology Tools for Udmurt and Mansi, 2015
