
Insertion-based Decoding with Automatically Inferred Generation Order #122

@kweonwooj

Description


Abstract

  • proposes a novel decoding algorithm, INDIGO, which generates text in an arbitrary order via insertion operations
  • achieves competitive or even better machine translation performance than conventional left-to-right generation
    • Datasets: WMT16 Ro-En, WMT18 En-Tr, KFTT En-Ja

Details

INDIGO

  • INsertion-based Decoding with Inferred Generation Order
  • treats the generation order as a latent variable
  • uses relative position representations to capture generation order
  • uses a Transformer model with relative positions
  • maximizes the evidence lower bound (ELBO) of the original objective and studies four approximate posterior distributions over generation orders
    (screenshot of the objective from the paper)

Neural Autoregressive Decoding

  • a neural autoregressive model learns the probability of Y given X as a product of per-token probabilities, each conditioned on X and the previously generated tokens Y_{<t} (see the factorization below)
  • the common way to decode such a sequence model is left-to-right, since that is the natural reading order for most humans (a strong inductive bias)
  • however, L2R may not be the optimal order for generating every kind of sequence
    • Japanese tends to produce better results with R2L decoding
    • code generation benefits from following the abstract syntax tree, etc.
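For reference, a sketch of the standard left-to-right factorization being described (notation is mine, not copied verbatim from the paper):

```latex
% Standard autoregressive factorization: each target token is conditioned on
% the source X and all previously generated target tokens.
p_\theta(Y \mid X) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid y_{<t},\, X\right)
```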

Ordering as latent variable

  • add an order function π to the conditional probability (see the sketch below)
  • L2R can be recovered if z_t = t
    (screenshot of the equation from the paper)
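Roughly, the order-augmented objective looks as follows (a paraphrase in my own notation, where z_t is the position assigned to the t-th generated token and the sum runs over generation orders π):

```latex
% Marginalize over generation orders; under a given order pi, tokens are generated
% together with their positions z, and L2R is the special case z_t = t.
p_\theta(Y \mid X) \;=\; \sum_{\pi} p_\theta(Y_\pi \mid X),
\qquad
p_\theta(Y_\pi \mid X) \;=\; \prod_{t=1}^{T}
  p_\theta\!\left(y_{\pi_t},\, z_{\pi_t} \mid y_{\pi_{<t}},\, z_{\pi_{<t}},\, X\right)
```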

Relative Representation of Positions

  • relative position representations are essential for modeling position, because the final number of tokens is unknown while decoding
  • a relative-position vector models the position of the new token relative to all existing tokens at each timestep; accumulating these vectors across timesteps yields a relative-position matrix (see the sketch below)
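A minimal NumPy sketch of this bookkeeping as I read it (helper names are my own; entries -1 / 0 / +1 mean left-of / same slot / right-of):

```python
import numpy as np

def extend_relative_matrix(R: np.ndarray, rel_new: np.ndarray) -> np.ndarray:
    """Append a newly inserted token to the relative-position matrix.

    R[i, j] = -1 means token i is to the left of token j, +1 to the right.
    rel_new[j] is the new token's relation to existing token j.
    Existing pairwise relations never change, since insertion cannot reorder tokens.
    """
    t = R.shape[0]
    R_new = np.zeros((t + 1, t + 1), dtype=R.dtype)
    R_new[:t, :t] = R           # old relations are kept as-is
    R_new[t, :t] = rel_new      # new token relative to each old token
    R_new[:t, t] = -rel_new     # old tokens relative to the new one (antisymmetry)
    return R_new

def absolute_order(R: np.ndarray) -> np.ndarray:
    """A token's absolute index equals the number of tokens to its left."""
    return (R == 1).sum(axis=1)
```

For example, starting from `<s>` and `</s>` and inserting a token to the right of `<s>` but left of `</s>` corresponds to `rel_new = [1, -1]`.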

Insertion based Decoding

  • at each timestep, INDIGO predicts the next token and its relative position, as shown in Alg. 1 (a greedy decoding sketch follows below)
    (screenshots of Algorithm 1 from the paper)
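A greedy sketch of that loop, reusing the helpers above; `model.predict_word` and `model.predict_position` are hypothetical stand-ins for the paper's word and position predictors, not its actual API:

```python
import numpy as np

def indigo_decode(model, src, max_len=200):
    """Greedy insertion-based decoding (a sketch of Alg. 1, not the reference implementation)."""
    tokens = ["<s>", "</s>"]                  # the canvas starts with the two boundary tokens
    R = np.array([[0, -1],                    # <s> is to the left of </s>
                  [1,  0]])
    for _ in range(max_len):
        word = model.predict_word(src, tokens, R)         # hypothetical word predictor
        if word == "<eod>":                               # end-of-decoding symbol stops insertion
            break
        rel_new = model.predict_position(src, tokens, R)  # relation to every existing token
        R = extend_relative_matrix(R, rel_new)
        tokens.append(word)                   # stored in generation order, not surface order
    # sort the tokens into the surface order recovered from the relative-position matrix
    return [tok for _, tok in sorted(zip(absolute_order(R), tokens))]
```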

Learning

  • maximizing the marginal likelihood directly is intractable, because all T! permutations of the tokens would have to be considered now that the tokens are order-free
  • instead, the evidence lower bound (ELBO) of the original objective is maximized by introducing an approximate posterior distribution over generation orders, which can be controlled flexibly (see the bound below)
    (screenshots of the ELBO derivation from the paper)
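The resulting bound, in rough notation (q is the approximate posterior over generation orders and H its entropy):

```latex
% Jensen's inequality over the latent order pi gives the ELBO that is actually optimized.
\log p_\theta(Y \mid X)
  \;\ge\;
  \mathbb{E}_{\pi \sim q(\pi \mid Y, X)}\!\left[ \log p_\theta(Y_\pi \mid X) \right]
  \;+\; \mathcal{H}\!\left( q(\pi \mid Y, X) \right)
```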

Experiment - Machine Translation

  • Datasets
    • WMT16 Ro-En: 620k / 2k / 2k (train / dev / test)
    • WMT18 En-Tr: 207k / 3k / 3k (train / dev / test)
    • KFTT En-Ja: 405k / 1k / 1k (train / dev / test)
  • Result
    • except for the random order, all pre-defined orders perform fairly similarly, with L2R / R2L the best
    • Adaptive Order with beam=8 outperforms L2R and R2L on all language pairs
      (screenshot of the results table from the paper)

Experiment - Word Order Recovery / Code Generation

  • the improvement from INDIGO is more pronounced on the word order recovery and code generation tasks
    (screenshots of the results from the paper)

Personal Thoughts

  • the paper was a bit difficult to read
  • predicting tokens and their positions autoregressively is an interesting idea
  • wish there were more ablations on what kinds of tokens the model predicts first, in terms of POS, frequency, etc.
  • interesting to see that the common-first approach is worse than L2R/R2L; surprised by how strong the L2R inductive bias is

Link: https://arxiv.org/pdf/1902.01370.pdf
Authors: Gu et al., 2019
