Subword-informed word representation training framework

We provide a general framework for training subword-informed word representations by varying the following components: the subword segmentation method, the subword embedding, the position embedding, and the composition function.

For the overall framework architecture and more details, please refer to the reference.

There are 4 segmentation methods, 3 possible ways of embedding subwords, 3 ways of enhancing with position embeddings, and 3 different composition functions.

Here is a full table of different options and their labels:

| Component | Option | Label |
| --- | --- | --- |
| Segmentation methods | CHIPMUNK | sms |
| | Morfessor | morf |
| | BPE | bpe |
| | Character n-gram | charn |
| Subword embeddings | w/o word token | - |
| | w/ word token | ww |
| | w/ morphotactic tag (only for sms) | wp |
| Position embeddings | w/o position embedding | - |
| | addition | pp (not applicable to wp) |
| | elementwise multiplication | mp (not applicable to wp) |
| Composition functions | addition | add |
| | single self-attention | att |
| | multi-head self-attention | mtxatt |

For example, sms.wwppmtxatt means we use CHIPMUNK for segmentation, insert the word token into the subword sequence, enhance with additive position embeddings, and use multi-head self-attention as the composition function.
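Such a label can be decoded mechanically from the table above. Below is a minimal sketch of a parser (a hypothetical helper, not part of this repository):

```python
# Hypothetical helper (not part of this repository): split a configuration
# label such as "sms.wwppmtxatt" into its four components.

SEGMENTATIONS = ("sms", "morf", "bpe", "charn")
SUBWORD_EMBEDDINGS = ("ww", "wp")        # absent -> plain subwords ("-")
POSITION_EMBEDDINGS = ("pp", "mp")       # absent -> no position embedding
COMPOSITIONS = ("add", "att", "mtxatt")

def parse_label(label):
    seg, rest = label.split(".")
    assert seg in SEGMENTATIONS, f"unknown segmentation: {seg}"
    subword = next((s for s in SUBWORD_EMBEDDINGS if rest.startswith(s)), "-")
    if subword != "-":
        rest = rest[len(subword):]
    position = next((p for p in POSITION_EMBEDDINGS if rest.startswith(p)), "-")
    if position != "-":
        rest = rest[len(position):]
    assert rest in COMPOSITIONS, f"unknown composition: {rest}"
    return {"segmentation": seg, "subword": subword,
            "position": position, "composition": rest}

print(parse_label("sms.wwppmtxatt"))
# {'segmentation': 'sms', 'subword': 'ww', 'position': 'pp', 'composition': 'mtxatt'}
```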

Subword segmentation methods

Taking the word dishonestly as an example, the different segmentation methods produce the following subword sequences:

  • ChipMunk: (<dis, honest, ly>) + (PREFIX, ROOT, SUFFIX)
  • Morfessor: (<dishonest, ly>)
  • BPE (10k merge ops): (<dish, on, est, ly>)
  • Character n-gram (from 3 to 6): (<di, dis, ... , ly>, <dis, ... ,tly>, <dish, ... , stly>, <disho, ... , estly>)

where < and > are word start and end markers.

After segmentation, we obtain a subword sequence S for each segmentation method, plus a morphotactic tag sequence T for sms.
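Of the four methods, the character n-gram segmentation is simple enough to sketch in a few lines (hypothetical code, not the repository's implementation):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams from a word wrapped in the < > boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

grams = char_ngrams("dishonestly")
print(grams[:3], "...", grams[-1])
# ['<di', 'dis', 'ish'] ... 'estly>'
```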

Subword embeddings and position embeddings

We can embed the subword sequence S directly into a sequence of subword embeddings by looking it up in the subword embedding matrix, or insert a word token (ww) into S before embedding; for sms this gives (<dis, honest, ly>, <dishonestly>).

Then we can enhance the subword embeddings with additive (pp) or elementwise-multiplicative (mp) position embeddings.
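In sketch form (toy NumPy code with made-up dimensions, not the repository's implementation), the two modes differ only in the operator used to combine each subword embedding with the position embedding of its slot:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 3, 8                  # e.g. (<dis, honest, ly>), toy dimension
S = rng.normal(size=(seq_len, dim))  # subword embeddings
P = rng.normal(size=(seq_len, dim))  # position embeddings for slots 0..seq_len-1

S_pp = S + P   # additive position enhancement (pp)
S_mp = S * P   # elementwise-multiplicative enhancement (mp)
```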

For sms, we can also embed the concatenation of each subword and its morphotactic tag (wp): (<dis:PREFIX, honest:ROOT, ly>:SUFFIX). <dishonestly>:WORD will additionally be inserted if we choose ww. Note that position embeddings are not applicable to wp, as morphological position information has already been provided by the tags.
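Finally, the (optionally position-enhanced) subword embeddings are composed into a single word vector. Here is a rough sketch of the additive (add) and single self-attention (att) compositions; the scoring-vector parameterisation of the attention is an assumption, and the repository's exact formulation may differ:

```python
import numpy as np

def compose_add(S):
    """add: sum the subword embeddings into one word vector."""
    return S.sum(axis=0)

def compose_att(S, w):
    """att: one self-attention head. w is a learned scoring vector
    (hypothetical parameterisation): a softmax over per-subword
    scores weights the sum of the subword embeddings."""
    scores = S @ w                    # one score per subword, shape (seq_len,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()              # softmax attention weights
    return alpha @ S                  # weighted sum -> shape (dim,)

rng = np.random.default_rng(1)
S = rng.normal(size=(3, 8))
print(compose_add(S).shape, compose_att(S, rng.normal(size=8)).shape)  # (8,) (8,)
```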

Prerequisites

Calculate new word embeddings from subword embeddings

Call gen_word_emb.py to generate embeddings of new words for a specific composition function, or use batch_gen_word_emb.sh to generate them for all composition functions.

Your input file, i.e. --in_file in the input arguments, needs to be a word list with exactly one word per line.
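For example, a file in the expected format can be produced as follows (the words and the file name are hypothetical):

```python
# Write a word list in the format gen_word_emb.py expects: one word per line.
words = ["dishonestly", "unhappiness", "rethinking"]
with open("new_words.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(words) + "\n")
```

The file is then passed via --in_file, e.g. python gen_word_emb.py --in_file new_words.txt (the script's remaining arguments are omitted here).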

References
