Subword-informed word representation training framework

We provide a general framework for training subword-informed word representations by varying the following components: the subword segmentation method, the subword embedding, the position embedding, and the composition function.

For the overall framework architecture and more details, please refer to the reference.

There are 4 segmentation methods, 3 possible ways of embedding subwords, 3 ways of enhancing with position embeddings, and 3 different composition functions.

Here is a full table of different options and their labels:

| Component | Option | Label |
| --- | --- | --- |
| Segmentation methods | CHIPMUNK | sms |
| | Morfessor | morf |
| | BPE | bpe |
| | Character n-gram | charn |
| Subword embeddings | w/o word token | - |
| | w/ word token | ww |
| | w/ morphotactic tag (only for sms) | wp |
| Position embeddings | w/o position embedding | - |
| | addition | pp (not applicable to wp) |
| | elementwise multiplication | mp (not applicable to wp) |
| Composition functions | addition | add |
| | single self-attention | att |
| | multi-head self-attention | mtxatt |

For example, sms.wwppmtxatt means we use CHIPMUNK for segmentation, insert the word token into the subword sequence, enhance with additive position embeddings, and use multi-head self-attention as the composition function.
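Such a label can be decoded mechanically from the table above. Below is a minimal sketch of a parser (a hypothetical helper, not part of this repository):

```python
# Hypothetical helper (not part of this repository): split a configuration
# label such as "sms.wwppmtxatt" into its four components.

SEGMENTATIONS = ("sms", "morf", "bpe", "charn")
SUBWORD_EMBEDDINGS = ("ww", "wp")        # absent -> plain subwords ("-")
POSITION_EMBEDDINGS = ("pp", "mp")       # absent -> no position embedding
COMPOSITIONS = ("add", "att", "mtxatt")

def parse_label(label):
    seg, rest = label.split(".")
    assert seg in SEGMENTATIONS, f"unknown segmentation: {seg}"
    subword = next((s for s in SUBWORD_EMBEDDINGS if rest.startswith(s)), "-")
    if subword != "-":
        rest = rest[len(subword):]
    position = next((p for p in POSITION_EMBEDDINGS if rest.startswith(p)), "-")
    if position != "-":
        rest = rest[len(position):]
    assert rest in COMPOSITIONS, f"unknown composition: {rest}"
    return {"segmentation": seg, "subword": subword,
            "position": position, "composition": rest}

print(parse_label("sms.wwppmtxatt"))
# {'segmentation': 'sms', 'subword': 'ww', 'position': 'pp', 'composition': 'mtxatt'}
```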

Subword segmentation methods

Taking the word dishonestly as an example, the different segmentation methods produce the following subword sequences:

  • ChipMunk: (<dis, honest, ly>) + (PREFIX, ROOT, SUFFIX)
  • Morfessor: (<dishonest, ly>)
  • BPE (10k merge ops): (<dish, on, est, ly>)
  • Character n-gram (from 3 to 6): (<di, dis, ... , ly>, <dis, ... ,tly>, <dish, ... , stly>, <disho, ... , estly>)

where < and > are word start and end markers.

After segmentation, we obtain a subword sequence S for each segmentation method, plus a morphotactic tag sequence T for sms.
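Of the four methods, the character n-gram segmentation is simple enough to sketch in a few lines (hypothetical code, not the repository's implementation):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract character n-grams from a word wrapped in the < > boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

grams = char_ngrams("dishonestly")
print(grams[:3], "...", grams[-1])
# ['<di', 'dis', 'ish'] ... 'estly>'
```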

Subword embeddings and position embeddings

We can embed the subword sequence S directly into a sequence of subword embeddings by looking it up in the subword embedding matrix, or insert a word token (ww) into S before embedding; for sms this gives (<dis, honest, ly>, <dishonestly>).

Then we can enhance the subword embeddings with additive (pp) or elementwise-multiplicative (mp) position embeddings.
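In sketch form (toy NumPy code with made-up dimensions, not the repository's implementation), the two modes differ only in the operator used to combine each subword embedding with the position embedding of its slot:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 3, 8                  # e.g. (<dis, honest, ly>), toy dimension
S = rng.normal(size=(seq_len, dim))  # subword embeddings
P = rng.normal(size=(seq_len, dim))  # position embeddings for slots 0..seq_len-1

S_pp = S + P   # additive position enhancement (pp)
S_mp = S * P   # elementwise-multiplicative enhancement (mp)
```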

For sms, we can also embed the concatenation of each subword and its morphotactic tag (wp): (<dis:PREFIX, honest:ROOT, ly>:SUFFIX). <dishonestly>:WORD will additionally be inserted if we choose ww. Note that position embeddings are not applicable to wp, as morphological position information has already been provided by the tags.
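Finally, the (optionally position-enhanced) subword embeddings are composed into a single word vector. Here is a rough sketch of the additive (add) and single self-attention (att) compositions; the scoring-vector parameterisation of the attention is an assumption, and the repository's exact formulation may differ:

```python
import numpy as np

def compose_add(S):
    """add: sum the subword embeddings into one word vector."""
    return S.sum(axis=0)

def compose_att(S, w):
    """att: one self-attention head. w is a learned scoring vector
    (hypothetical parameterisation): a softmax over per-subword
    scores weights the sum of the subword embeddings."""
    scores = S @ w                    # one score per subword, shape (seq_len,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()              # softmax attention weights
    return alpha @ S                  # weighted sum -> shape (dim,)

rng = np.random.default_rng(1)
S = rng.normal(size=(3, 8))
print(compose_add(S).shape, compose_att(S, rng.normal(size=8)).shape)  # (8,) (8,)
```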

Prerequisites

Calculate new word embeddings from subword embeddings

Call gen_word_emb.py to generate embeddings of new words for a specific composition function, or use batch_gen_word_emb.sh to generate them for all composition functions.

Your input file, i.e. --in_file in the input arguments, needs to be a word list with exactly one word per line.
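For example, a file in the expected format can be produced as follows (the words and the file name are hypothetical):

```python
# Write a word list in the format gen_word_emb.py expects: one word per line.
words = ["dishonestly", "unhappiness", "rethinking"]
with open("new_words.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(words) + "\n")
```

The file is then passed via --in_file, e.g. python gen_word_emb.py --in_file new_words.txt (the script's remaining arguments are omitted here).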

References
