
Insertion-based Decoding with Automatically Inferred Generation Order #122

@kweonwooj

Description


Abstract

  • proposes a novel decoding algorithm, INDIGO, which generates text in an arbitrary order via insertion operations
  • achieves competitive or even better machine translation performance than conventional left-to-right generation
    • Datasets: WMT16 Ro-En, WMT18 En-Tr, KFTT En-Ja

Details

INDIGO

  • INsertion-based Decoding with Inferred Generation Order
  • treats the generation order as a latent variable
  • uses relative position representations to capture generation order
  • uses a Transformer model with relative positions
  • maximizes the evidence lower bound (ELBO) of the original objective and studies four approximate posterior distributions over generation orders
    (screenshot of the objective from the paper)

Neural Autoregressive Decoding

  • a neural autoregressive model learns the probability of Y given X as a product of per-token probabilities, each conditioned on X and the previously generated tokens Y_{<t} (see the factorization below)
  • the common way to decode such a sequence model is left-to-right, since that is the natural reading order for most humans (a strong inductive bias)
  • however, L2R may not be the optimal order for generating every kind of sequence
    • Japanese tends to produce better results with R2L decoding
    • code generation benefits from following the abstract syntax tree, etc.
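For reference, a sketch of the standard left-to-right factorization being described (notation is mine, not copied verbatim from the paper):

```latex
% Standard autoregressive factorization: each target token is conditioned on
% the source X and all previously generated target tokens.
p_\theta(Y \mid X) \;=\; \prod_{t=1}^{T} p_\theta\!\left(y_t \mid y_{<t},\, X\right)
```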

Ordering as latent variable

  • add an order function π to the conditional probability (see the sketch below)
  • L2R can be recovered if z_t = t
    (screenshot of the equation from the paper)
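Roughly, the order-augmented objective looks as follows (a paraphrase in my own notation, where z_t is the position assigned to the t-th generated token and the sum runs over generation orders π):

```latex
% Marginalize over generation orders; under a given order pi, tokens are generated
% together with their positions z, and L2R is the special case z_t = t.
p_\theta(Y \mid X) \;=\; \sum_{\pi} p_\theta(Y_\pi \mid X),
\qquad
p_\theta(Y_\pi \mid X) \;=\; \prod_{t=1}^{T}
  p_\theta\!\left(y_{\pi_t},\, z_{\pi_t} \mid y_{\pi_{<t}},\, z_{\pi_{<t}},\, X\right)
```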

Relative Representation of Positions

  • relative position representations are essential for modeling position, because the final number of tokens is unknown while decoding
  • a relative-position vector models the position of the new token relative to all existing tokens at each timestep; accumulating these vectors across timesteps yields a relative-position matrix (see the sketch below)
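A minimal NumPy sketch of this bookkeeping as I read it (helper names are my own; entries -1 / 0 / +1 mean left-of / same slot / right-of):

```python
import numpy as np

def extend_relative_matrix(R: np.ndarray, rel_new: np.ndarray) -> np.ndarray:
    """Append a newly inserted token to the relative-position matrix.

    R[i, j] = -1 means token i is to the left of token j, +1 to the right.
    rel_new[j] is the new token's relation to existing token j.
    Existing pairwise relations never change, since insertion cannot reorder tokens.
    """
    t = R.shape[0]
    R_new = np.zeros((t + 1, t + 1), dtype=R.dtype)
    R_new[:t, :t] = R           # old relations are kept as-is
    R_new[t, :t] = rel_new      # new token relative to each old token
    R_new[:t, t] = -rel_new     # old tokens relative to the new one (antisymmetry)
    return R_new

def absolute_order(R: np.ndarray) -> np.ndarray:
    """A token's absolute index equals the number of tokens to its left."""
    return (R == 1).sum(axis=1)
```

For example, starting from `<s>` and `</s>` and inserting a token to the right of `<s>` but left of `</s>` corresponds to `rel_new = [1, -1]`.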

Insertion based Decoding

  • at each timestep, INDIGO predicts the next token and its relative position, as shown in Alg. 1 (a greedy decoding sketch follows below)
    (screenshots of Algorithm 1 from the paper)
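A greedy sketch of that loop, reusing the helpers above; `model.predict_word` and `model.predict_position` are hypothetical stand-ins for the paper's word and position predictors, not its actual API:

```python
import numpy as np

def indigo_decode(model, src, max_len=200):
    """Greedy insertion-based decoding (a sketch of Alg. 1, not the reference implementation)."""
    tokens = ["<s>", "</s>"]                  # the canvas starts with the two boundary tokens
    R = np.array([[0, -1],                    # <s> is to the left of </s>
                  [1,  0]])
    for _ in range(max_len):
        word = model.predict_word(src, tokens, R)         # hypothetical word predictor
        if word == "<eod>":                               # end-of-decoding symbol stops insertion
            break
        rel_new = model.predict_position(src, tokens, R)  # relation to every existing token
        R = extend_relative_matrix(R, rel_new)
        tokens.append(word)                   # stored in generation order, not surface order
    # sort the tokens into the surface order recovered from the relative-position matrix
    return [tok for _, tok in sorted(zip(absolute_order(R), tokens))]
```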

Learning

  • maximizing the marginal likelihood directly is intractable, because all T! permutations of the tokens would have to be considered now that the tokens are order-free
  • instead, the evidence lower bound (ELBO) of the original objective is maximized by introducing an approximate posterior distribution over generation orders, which can be controlled flexibly (see the bound below)
    (screenshots of the ELBO derivation from the paper)
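The resulting bound, in rough notation (q is the approximate posterior over generation orders and H its entropy):

```latex
% Jensen's inequality over the latent order pi gives the ELBO that is actually optimized.
\log p_\theta(Y \mid X)
  \;\ge\;
  \mathbb{E}_{\pi \sim q(\pi \mid Y, X)}\!\left[ \log p_\theta(Y_\pi \mid X) \right]
  \;+\; \mathcal{H}\!\left( q(\pi \mid Y, X) \right)
```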

Experiment - Machine Translation

  • Datasets
    • WMT16 Ro-En: 620k / 2k / 2k (train / dev / test)
    • WMT18 En-Tr: 207k / 3k / 3k (train / dev / test)
    • KFTT En-Ja: 405k / 1k / 1k (train / dev / test)
  • Result
    • except for the random order, all pre-defined orders perform fairly similarly, with L2R / R2L the best
    • Adaptive Order with beam=8 outperforms L2R and R2L on all language pairs
      (screenshot of the results table from the paper)

Experiment - Word Order Recovery / Code Generation

  • the improvement from INDIGO is more pronounced on the word order recovery and code generation tasks
    (screenshots of the results from the paper)

Personal Thoughts

  • the paper was a bit difficult to read
  • predicting tokens and their positions autoregressively is an interesting idea
  • wish there were more ablations on what kinds of tokens the model predicts first, in terms of POS, frequency, etc.
  • interesting to see that the common-first approach is worse than L2R/R2L; surprised by how strong the L2R inductive bias is

Link: https://arxiv.org/pdf/1902.01370.pdf
Authors: Gu et al., 2019
