Abstract
- proposes a framework for training text generation models in non-monotonic orders
- tokens are generated in a binary-tree structure
- learning is framed as imitation learning
- achieves performance competitive with conventional left-to-right generation
- tasks: language modeling, sentence completion, word reordering, and machine translation
Details
Non-Monotonic Generation as a Binary Tree
- an example generation from the proposed approach:
- generation can start from any token
- the number in the green box is the generation order
- the number in the blue box is the reconstruction order
- conventional left-to-right generation is a special case of the binary tree, a chain of right children (see the sketch below)
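A minimal sketch (not the authors' code) of the binary-tree view: tokens sit at tree nodes, the tree is built in generation order, and an in-order traversal yields the final sentence, i.e. the reconstruction order.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    token: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def in_order(node: Optional[Node]) -> List[str]:
    # reconstruction order: left subtree, own token, right subtree
    if node is None:
        return []
    return in_order(node.left) + [node.token] + in_order(node.right)

# generation can start anywhere, e.g. from the verb:
# generation order: "are" -> "how" -> "you" -> "?"
root = Node("are", left=Node("how"), right=Node("you", right=Node("?")))
print(in_order(root))  # ['how', 'are', 'you', '?']

# left-to-right generation is the degenerate tree of only right children
l2r = Node("how", right=Node("are", right=Node("you", right=Node("?"))))
print(in_order(l2r))   # same sentence, generated strictly left-to-right
```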

Learning for Non-Monotonic Generation
- an imitation learning framework: an oracle policy provides a valid distribution over token choices, and the model parameters are learned against it via a KL-divergence loss

Oracle policy is defined by

π*(a | s) = P_a if a is a valid token at state s, and 0 otherwise (with Σ P_a = 1 over the valid tokens)

- where we have a choice for P_a:
- uniform oracle: P_a = 1/|valid tokens|, i.e. a uniform distribution over the valid tokens (does not lead to optimal quality)
- coaching oracle: multiply the uniform oracle by the current policy and renormalize, so the oracle prefers tokens the learner already assigns high probability

- annealed coaching oracle: a linear interpolation β·uniform + (1−β)·coaching, with β annealed from 1 to 0, to provide variety in learning

- in imitation learning the roll-in policy is usually a stochastic mixture of the learned model and the oracle policy, but for this task simply rolling in with the oracle policy throughout performs better (see the sketch below)
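A hedged sketch of the three oracles and the KL imitation loss (variable names such as `valid_ids` and `beta` are mine; in the paper, `model_logits` would come from the model conditioned on the current tree state):

```python
import torch
import torch.nn.functional as F

def uniform_oracle(valid_ids: torch.Tensor, vocab_size: int) -> torch.Tensor:
    # P_a = 1/|valid| on the valid tokens, 0 everywhere else
    p = torch.zeros(vocab_size)
    p[valid_ids] = 1.0 / len(valid_ids)
    return p

def coaching_oracle(valid_ids: torch.Tensor, model_probs: torch.Tensor) -> torch.Tensor:
    # multiply the uniform oracle by the current policy, then renormalize
    p = uniform_oracle(valid_ids, model_probs.numel()) * model_probs
    return p / p.sum()

def annealed_oracle(valid_ids: torch.Tensor, model_probs: torch.Tensor,
                    beta: float) -> torch.Tensor:
    # beta * uniform + (1 - beta) * coaching, with beta annealed 1 -> 0
    return (beta * uniform_oracle(valid_ids, model_probs.numel())
            + (1.0 - beta) * coaching_oracle(valid_ids, model_probs))

def imitation_loss(model_logits: torch.Tensor, oracle_probs: torch.Tensor) -> torch.Tensor:
    # KL(oracle || model); tokens with zero oracle mass contribute nothing
    log_p = F.log_softmax(model_logits, dim=-1)
    mask = oracle_probs > 0
    return (oracle_probs[mask] * (oracle_probs[mask].log() - log_p[mask])).sum()

# toy usage: 5-token vocabulary, valid next tokens {1, 3}
logits = torch.randn(5)
oracle = annealed_oracle(torch.tensor([1, 3]), logits.softmax(-1), beta=0.5)
loss = imitation_loss(logits, oracle)
```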
Experiments
Language Model
- Dataset: Persona-Chat, with 133k / 16k / 15k train / valid / test splits
- Model : 2-layered uni-directional LSTM
- the non-monotonic (annealed) LM produced more diverse (unique and novel) sentences, with an average span of 1.3~1.4 (span = avg number of child nodes; see the sketch after this list)

- POS-tag analysis leads to interesting insights:
- the non-monotonic (annealed) model produces tokens in the order PUNCT > PNOUN > VERB > NOUN
- the left-to-right model produces tokens in the order PNOUN > VERB > NOUN > PUNCT
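A small sketch of one plausible reading of the span statistic (the paper's exact definition may differ): the average number of children over non-leaf nodes, reusing `Node` from the first sketch. A strict left-to-right chain gives exactly 1 and a perfectly balanced tree approaches 2, so 1.3~1.4 indicates mild branching.

```python
def average_span(root: Node) -> float:
    # average number of child nodes, computed over non-leaf nodes
    internal, children = 0, 0
    stack = [root]
    while stack:
        n = stack.pop()
        kids = [c for c in (n.left, n.right) if c is not None]
        if kids:
            internal += 1
            children += len(kids)
        stack.extend(kids)
    return children / internal if internal else 0.0

print(average_span(root))  # 1.5 for the 4-token example tree above
print(average_span(l2r))   # 1.0 for the left-to-right chain
```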

Sentence Completion
- non-monotonic generation opens up a new spectrum of sentence completion, since generation can take place anywhere in the sentence (toy sketch below)
- left-to-right models can only complete a sentence to the right of a given prefix
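A toy illustration only, not the paper's decoding procedure: seed tokens are fixed anywhere in the tree, and the empty slots on both sides of them get filled in. It reuses `Node` and `in_order` from the first sketch and replaces the learned policy with a coin flip.

```python
import random

def grow(node: Optional[Node], vocab: List[str], depth: int = 0) -> Optional[Node]:
    # stub policy: randomly emit a token into each empty slot;
    # a real model would sample from pi(a | current tree state)
    if node is None:
        if depth < 2 and random.random() < 0.7:
            node = Node(random.choice(vocab))
        else:
            return None
    node.left = grow(node.left, vocab, depth + 1)
    node.right = grow(node.right, vocab, depth + 1)
    return node

# seed words fixed mid-sentence; completion may happen on either side,
# which a left-to-right model cannot do
seed = Node("love", left=Node("i"))
print(in_order(grow(seed, ["really", "you", "so", "much"])))
```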

Machine Translation
- Dataset: IWSLT16 De-En (196k pairs) / TED tst2013 / TED tst2014
- Model : 1-layer bi-LSTM
- end-tuning: since the <end> token is frequent in training, the model over-produces <end> at inference time; its P_a value is tuned down on the validation set (see the sketch after this list)
- 7~8 BLEU points lower than left-to-right, due to a drop in 4-gram precision (1- and 2-gram precisions are higher; 3-gram is comparable)
- the discrepancy is smaller on other metrics, but still below left-to-right
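One plausible implementation of the end-tuning trick (the names `end_id` and `end_scale` are mine): scale down the probability mass on <end> by a factor chosen on the validation set, then renormalize.

```python
import torch

def tune_end(probs: torch.Tensor, end_id: int, end_scale: float) -> torch.Tensor:
    # down-weight <end> (end_scale < 1, chosen on the validation set),
    # then renormalize so the result is still a distribution
    tuned = probs.clone()
    tuned[end_id] = tuned[end_id] * end_scale
    return tuned / tuned.sum()
```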

Personal Thoughts
- left-to-right seems to be a good inductive bias for generation, which is why there is a big gap in the quantitative machine translation results
- generating tokens in non-monotonic order is far from human intuition, but a VERY interesting idea
- what is the potential gain of generating machine translation outputs in non-monotonic order?
- the idea is interesting, but it seems to make the problem harder for the model to learn: the model now has to cover a combinatorial number of generation orders for each sentence
Link : https://arxiv.org/pdf/1902.02192.pdf
Authors : Welleck et al. 2019