Ideally we would have infinite resources (e.g., Google).
Low-resource alternative: Optimize the dataset for model refinement
Ansatz: Let the model "see" examples at least once.
Our task: Mine rare words. Find example sentences and their translations.
The WikiMatrix* dataset mines parallel sentences from Wikipedia articles in different language pairs. WikiMatrix.LANG_1-LANG_2.tsv contains sentences in LANG_1 and their counterparts in LANG_2.
Each sentence pair carries a score** measuring the quality of the correspondence.
** Maximum margin criterion from Haifeng Li, Tao Jiang and Keshu Zhang, "Efficient and robust feature extraction by maximum margin criterion," IEEE Transactions on Neural Networks (2006)
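A minimal loading sketch, assuming the common WikiMatrix layout of one tab-separated line per pair (margin score, LANG_1 sentence, LANG_2 sentence); the threshold value is illustrative and should be tuned per language pair.

```python
# Minimal sketch: load a WikiMatrix TSV and keep only high-scoring pairs.
# Assumes three columns per line: score, LANG_1 sentence, LANG_2 sentence.
import csv

SCORE_THRESHOLD = 1.04  # hypothetical cut-off; tune per language pair

def load_pairs(path, threshold=SCORE_THRESHOLD):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 3:
                continue  # skip malformed lines
            score, src, tgt = float(row[0]), row[1], row[2]
            if score >= threshold:
                pairs.append((score, src, tgt))
    return pairs

pairs = load_pairs("WikiMatrix.en-de.tsv")
```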
- Identify LANG_1 as the preferred language based on the resources available.
- Create a frequency dictionary of lemmas* for LANG_1.
- Reduce the dictionary to lemmas occurring between n and m times: not too frequent, not extremely rare (possible misspellings, wrong lemmatizations, etc.).
- Revisit WikiMatrix and extract one example sentence per lemma (see the sketch after this list).
- E.g.: "Furthermore, individual V1 neurons in humans and animals with binocular vision have ocular dominance". Just one sentence serves as an example for three rare words.
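A sketch of the rare-lemma mining step, assuming LANG_1 is handled by simplemma and that `pairs` holds the (score, LANG_1 sentence, LANG_2 sentence) tuples from the loading sketch above; N_MIN and M_MAX are placeholders for the (n, m) frequency window.

```python
# Sketch: frequency dictionary of lemmas, (n, m) filtering, and one example
# sentence pair per rare lemma. Bounds and language code are illustrative.
from collections import Counter
import re

import simplemma

N_MIN, M_MAX = 3, 50  # hypothetical frequency window

def lemmatize(sentence, lang="en"):
    tokens = re.findall(r"\w+", sentence.lower())
    return [simplemma.lemmatize(t, lang=lang) for t in tokens]

def mine_rare_lemma_examples(pairs, lang="en"):
    # 1. Frequency dictionary of lemmas over all LANG_1 sentences.
    freq = Counter()
    for _, src, _ in pairs:
        freq.update(lemmatize(src, lang))
    # 2. Keep lemmas occurring between N_MIN and M_MAX times.
    rare = {lemma for lemma, count in freq.items() if N_MIN <= count <= M_MAX}
    # 3. One example sentence pair per rare lemma (one sentence can cover several).
    examples = {}
    for _, src, tgt in pairs:
        for lemma in set(lemmatize(src, lang)) & rare:
            examples.setdefault(lemma, (src, tgt))
    return examples
```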
An all-to-all extension was built, altering the pipeline to account for language-specific characteristics (see the dispatch sketch after this list):
- Language-specific regexes
- Different lemmatizers (spaCy, simplemma)
- Lemmatization may not apply (Chinese)
- Special tokenizers (konlpy, qalsadi, jieba, sudachipy)
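A sketch of the language-specific dispatch, illustrating the two extremes: jieba segmentation for Chinese, where lemmatization does not apply, and simplemma lemmatization elsewhere. The konlpy/qalsadi/sudachipy branches would follow the same pattern; the language codes are illustrative.

```python
# Sketch: choose the tokenization/lemmatization strategy per language.
import re

import jieba
import simplemma

def tokens_for_counting(sentence, lang):
    if lang == "zh":
        # Chinese: word segmentation only, no lemmatization.
        return [tok for tok in jieba.cut(sentence) if tok.strip()]
    words = re.findall(r"\w+", sentence.lower())
    return [simplemma.lemmatize(w, lang=lang) for w in words]
```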
- Preprocess sentences
- Generate candidate alignments (source 1- or 2-grams against target n-grams of arbitrary length)
- Encode the source n-gram and its candidate translations (multilingual encoder)
- Compute cosine similarity and keep the best candidate if its score exceeds 0.7 (see the scoring sketch below)
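A minimal sketch of the alignment scoring. The slide only says "multilingual encoder"; LaBSE via sentence-transformers is an illustrative choice, and the candidate generation mirrors the n-gram step above.

```python
# Sketch: score source n-gram against target n-gram candidates and keep
# the best match if its cosine similarity exceeds 0.7.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def ngrams(tokens, max_n):
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def best_alignment(source_ngram, target_sentence, max_n=4, threshold=0.7):
    """Return the target n-gram most similar to the source, if above threshold."""
    candidates = ngrams(target_sentence.split(), max_n)
    if not candidates:
        return None
    embeddings = model.encode([source_ngram] + candidates)
    sims = util.cos_sim(embeddings[:1], embeddings[1:])[0]
    best = int(sims.argmax())
    score = float(sims[best])
    return (candidates[best], score) if score >= threshold else None
```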

