
Lost in Transliteration: Bridging the Script Gap in Neural IR

Code for the SIGIR 2025 short paper

Abstract

Most human languages use scripts other than the Latin alphabet. Search users in these languages often formulate their information needs in a transliterated --usually Latinized-- form for ease of typing. For example, Greek speakers might use Greeklish, and Arabic speakers might use Arabizi. This paper shows that current search systems, including those that use multilingual dense embeddings such as BGE-M3, do not generalise to this setting, and their performance rapidly deteriorates when exposed to transliterated queries. This creates a "script gap" between the performance of the same queries when written in their native or transliterated form. We explore whether adapting the popular "translate-train" paradigm to transliterations can enhance the robustness of multilingual Information Retrieval (IR) methods and bridge the gap between native and transliterated scripts. By exploring various combinations of non-Latin and Latinized query text for training, we investigate whether we can enhance the capacity of existing neural retrieval techniques and enable them to apply to this important setting. We show that by further fine-tuning IR models on an even mixture of native and Latinized text, they can perform this cross-script matching at nearly the same performance as when the query was formulated in the native script. Out-of-domain evaluation and further qualitative analysis show that transliterations can also cause queries to lose some of their nuances, motivating further research.

Released Artifacts

We release the model checkpoints and transliterated queries in the Hugging Face Collection here. The collection contains:

  • The BGE-M3 and mT5 models used in the experiments.
  • The transliterated queries for the datasets used in the experiments.

Reproducing the results

Setup

The instructions here are focused on setting up a conda environment.

This code was developed and tested with Python 3.10.

First, create a virtual environment:

conda create -n lt python=3.10
conda activate lt

To run the retrieval and re-ranking experiments for BGE-M3 and mT5, you will need to install the dependencies required by the transliterations_experiments.py script.

Transliterating the Queries

The transliterations depend on the uroman library. You can install it using pip:

python3 -m pip install uroman
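
uroman also exposes a Python API, which is convenient for romanizing strings ad hoc. A minimal sketch, assuming the interface documented in the uroman package (the Uroman class and romanize_string come from uroman itself, not from this repository):

import uroman as ur

# Loading the romanization data takes a few seconds, so build the
# Uroman object once and reuse it across queries.
uroman = ur.Uroman()

# lcode is the ISO 639-3 code of the source language.
print(uroman.romanize_string("Καλημέρα κόσμε", lcode="ell"))
print(uroman.romanize_string("مرحبا بالعالم", lcode="ara"))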

You can then use the transliterate.py script to transliterate the queries, e.g.

python transliterate.py --lang <ISO 639-3 code of the query language> --dataset <dataset in IRDS format> [--do_docs]

Passing the optional --do_docs flag transliterates the dataset's documents instead of the queries.
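
For instance, a hypothetical invocation for the Arabic mMARCO dev queries (substitute whichever IRDS dataset ID you are actually using):

python transliterate.py --lang ara --dataset mmarco/v2/ar/dev/small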

Note: you need to set the output_path variable in the transliterate.py script to point to where you want to save the transliterated queries. This should be the same path as the TRANSLITERATIONS_DIR variable in the transliterations_experiments.py script.

We provide the transliterated queries in the Hugging Face Collection here. You can download the transliterated queries from there and place them in the TRANSLITERATIONS_DIR path.

Fine-tuning the Models

BGE-M3

You can use the finetune_bgem3.sh script to fine-tune the BGE-M3 model on the mMARCO dataset. The only requirement is setting the output_dir variable to the desired output directory and train_data to the path of the training JSONL file. (We provide our JSONL files in the Hugging Face Collection but if you want to create your own you can follow the steps in our other repo here.)
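
If the script wraps FlagEmbedding's standard fine-tuning format (an assumption; the JSONL files we provide are authoritative), each line of train_data is a JSON object pairing a query with positive and hard-negative passages, e.g. a Greeklish query against native-script passages:

{"query": "pos leitourgei to GPS", "pos": ["Το GPS λειτουργεί με δορυφόρους που εκπέμπουν σήματα χρονισμού."], "neg": ["Η Αθήνα είναι η πρωτεύουσα της Ελλάδας."]}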

mT5

You can use the finetune_mt5.py script to fine-tune the mT5 model on the mMARCO dataset.
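
The script's exact arguments are defined in the file itself, but mT5 rerankers trained on mMARCO (e.g. the unicamp-dl checkpoints) conventionally follow the monoT5 recipe: the model reads a prompt containing the query and passage and learns to generate "yes" or "no". A sketch of that input format, assuming the standard recipe rather than transcribing finetune_mt5.py:

# monoT5-style (source, target) pair for a sequence-to-sequence reranker.
# The prompt template and yes/no targets follow the standard monoT5 recipe;
# whether finetune_mt5.py uses exactly this template is an assumption.
def make_example(query: str, passage: str, relevant: bool) -> tuple[str, str]:
    source = f"Query: {query} Document: {passage} Relevant:"
    target = "yes" if relevant else "no"
    return source, target

src, tgt = make_example("pos leitourgei to GPS",
                        "Το GPS λειτουργεί με δορυφόρους που εκπέμπουν σήματα χρονισμού.",
                        relevant=True)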

Retrieval

The list of available BGE-M3 and mT5 models can be found in the models.py file. The models are available in this Hugging Face Collection. You can use the --model argument to specify which model you want to use.

BGE-M3 Retrieval

First, use the indexing.py script to index the collection, e.g.

python indexing.py --dataset <dataset in IRDS format> --model <BGE-M3 model>
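
For example, with the base BGE-M3 checkpoint (hypothetical IDs; use a model name listed in models.py and the IRDS dataset you are indexing):

python indexing.py --dataset mmarco/v2/ar --model BAAI/bge-m3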

The transliterations_experiments.py script is used to run the BGE-M3 retrieval experiments, e.g.

python transliterations_experiments.py --lang <language of queries> --index <index path> --model <BGE-M3 model> --dataset <dataset in IRDS format> --evaluate
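
A hypothetical end-to-end example, assuming an index built as above:

python transliterations_experiments.py --lang ara --index indices/mmarco-ar-bge-m3 --model BAAI/bge-m3 --dataset mmarco/v2/ar/dev/small --evaluate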

Note: there are two paths you need to manually set in the transliterations_experiments.py script:

  • TRANSLITERATIONS_DIR, which should point to where the transliterated queries are saved.
  • RETRIEVAL_DIR, which should point to where you want to save the retrieval results.

The --evaluate flag will run the evaluation in addition to retrieval.

mT5 Re-Ranking

The mT5 re-ranking experiments are run using the transliterations_rerank.py script, e.g.

python transliterations_rerank.py --lang <language of queries> --index <index path> --first_stage_model <BGE-M3 model> --rerank_model <mT5 model> --dataset <dataset in IRDS format> --evaluate
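
A hypothetical example, reusing the first-stage index from above with an mT5 reranker (use a rerank model listed in models.py; unicamp-dl/mt5-base-mmarco-v2 is shown only as a placeholder):

python transliterations_rerank.py --lang ara --index indices/mmarco-ar-bge-m3 --first_stage_model BAAI/bge-m3 --rerank_model unicamp-dl/mt5-base-mmarco-v2 --dataset mmarco/v2/ar/dev/small --evaluate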

Note: Similar to the BGE-M3 retrieval experiments, you need to set the TRANSLITERATIONS_DIR and RETRIEVAL_DIR variables in the transliterations_rerank.py script to point to where the transliterated queries are saved and where to save the retrieval results respectively.

Qualitative Analysis and Significance Testing

The qualitative_analysis.ipynb and the significance_testing.ipynb notebooks are used to run the qualitative analysis and significance testing respectively. You can run them using Jupyter Notebook.
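
For example:

jupyter notebook qualitative_analysis.ipynb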

Citation

If you use this code in your research, please cite the following paper:

@inproceedings{chari:sigir2025-translit,
  author = {Chari, Andreas and Ounis, Iadh and MacAvaney, Sean},
  title = {Lost in Transliteration: Bridging the Script Gap in Neural IR},
  booktitle = {Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year = {2025}
}
