# Partial Hypothesis Generation Routine
(1) randomly sample the number of kept tokens k ~ Uniform({0, ..., |y|}), where |y| is the target length
(2) shuffle the list of target indices and extract k of them; the corresponding tokens, kept in their original order, form the partial hypothesis
(3) for each slot, take the span of target tokens not yet produced between the extracted tokens and compute each token's distance from the span center (Eq. 10 in the paper)
(4) define the slot loss as the negative weighted sum of the tokens' log-likelihoods, where the weights w_i are a softmax over the negative distances (the balanced binary tree loss); a sketch follows below
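A minimal NumPy sketch of the routine above; the function names (`sample_partial_hypothesis`, `slot_weights`, `slot_loss`), the temperature `tau`, and the exact distance formula are illustrative reconstructions, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_partial_hypothesis(y):
    """Steps (1)-(2): keep k randomly chosen tokens of y, in their original order."""
    n = len(y)
    k = rng.integers(0, n + 1)                  # k ~ Uniform({0, ..., n})
    kept = np.sort(rng.permutation(n)[:k])      # shuffle indices, take k, restore order
    return kept                                  # indices forming the partial hypothesis

def slot_weights(span_len, tau=1.0):
    """Steps (3)-(4): softmax weights over a span of missing tokens.

    Tokens near the center of the span get the largest weight, which induces the
    balanced-binary-tree generation order (cf. Eq. 10 in the paper; the exact
    distance definition here is a reconstruction).
    """
    pos = np.arange(span_len)
    center = (span_len - 1) / 2.0
    d = np.abs(pos - center)                     # distance from span center
    w = np.exp(-d / tau)
    return w / w.sum()

def slot_loss(log_probs, tau=1.0):
    """Weighted negative log-likelihood for one slot.

    log_probs[i] is the model's log-probability of producing the i-th missing
    token of this slot's span.
    """
    w = slot_weights(len(log_probs), tau)
    return -float(np.sum(w * log_probs))

# Toy example: target of 7 token ids.
y = np.array([11, 12, 13, 14, 15, 16, 17])
kept = sample_partial_hypothesis(y)
print("partial hypothesis:", y[kept])
print("weights for a span of 5 missing tokens:", slot_weights(5))
```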
# Abstract
The Insertion Transformer is an iterative, partially autoregressive model for sequence generation based on insertion operations.
# Introduction
## Model Adjustments from the Original Transformer
- For n target tokens, the Insertion Transformer models n + 1 slots: one between every adjacent pair of tokens plus the two boundary positions.
- Each slot is represented by the concatenation of the adjacent pair of token representations, with special bos/eos tokens padding the boundaries.
- The content–location distribution p(c, l) can be modeled jointly or in factorized form (see the sketch after this list):
  - Joint: a single softmax over H W, where H (shape (T+1) × h) is the last decoder layer and W (shape h × C) is the softmax projection; this covers all vocabulary items over all locations.
  - Factorized: p(c, l) = p(c | l) · p(l), where h_l (shape h) is the l-th row of H and q (shape h) is a learnable query vector used to score locations.
- Contextual + Mixture slot representations give the best performance, but the gap disappears once the eos penalty is used.
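A small sketch of the two parameterizations, assuming the slot representations H have already been computed and using the shapes from the notes (H: (T+1) × h, W: h × C, q: h); the variable names and toy dimensions are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

T, h, C = 4, 8, 10                      # 4 tokens -> T + 1 = 5 slots, vocab of 10
rng = np.random.default_rng(0)
H = rng.normal(size=(T + 1, h))         # slot representations from the last decoder layer
W = rng.normal(size=(h, C))             # shared softmax projection
q = rng.normal(size=(h,))               # learnable query vector (factorized variant)

# Joint: one softmax over all (location, content) pairs, i.e. over the
# flattened (T+1) x C matrix of logits H @ W.
p_joint = softmax((H @ W).reshape(-1)).reshape(T + 1, C)

# Factorized: p(c, l) = p(c | l) * p(l).
p_loc = softmax(H @ q)                            # p(l): distribution over the T+1 slots
p_content_given_loc = softmax(H @ W, axis=-1)     # p(c | l): per-slot vocab distribution
p_factorized = p_loc[:, None] * p_content_given_loc

# Both are proper joint distributions over (content, location).
assert np.isclose(p_joint.sum(), 1.0)
assert np.isclose(p_factorized.sum(), 1.0)
```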
# Training

## Balanced Binary Tree
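This is what steps (3)–(4) of the routine at the top implement. Sketch of the loss as I recall it from the paper (the distance is the paper's Eq. 10; the exact form here is a reconstruction, so indices and normalization may differ slightly). For a slot l whose missing span covers target positions i_l through i_r:

```latex
d_l(i) = \left| \frac{i_l + i_r}{2} - i \right|,
\qquad
w_l(i) = \frac{\exp(-d_l(i)/\tau)}{\sum_{j=i_l}^{i_r} \exp(-d_l(j)/\tau)},
\qquad
\text{slot-loss}(x, \hat{y}, l) = -\sum_{i=i_l}^{i_r} w_l(i)\, \log p(y_i, l \mid x, \hat{y})
```

As τ → 0 all the weight falls on the center token of each span, which yields the balanced-binary-tree generation order; larger τ interpolates toward a uniform loss over the span.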
## Termination Condition
- Two variants: a dedicated end-of-slot token (slot finalization) or the regular eos token (sequence finalization).
- Slot finalization: every slot whose span of missing tokens is empty is trained to predict the end-of-slot token, even while other slots are still incomplete (see the sketch below).
- Sequence finalization: the eos token only becomes the target once the entire sequence is produced and all locations are empty spans.
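A toy sketch of how slot-finalization targets could be built from a partial hypothesis, with a hypothetical `END_OF_SLOT` sentinel; this illustrates the idea, not the paper's actual data pipeline.

```python
END_OF_SLOT = -1   # hypothetical sentinel id for the end-of-slot token

def slot_targets(y, kept_indices):
    """For each of the len(kept_indices) + 1 slots, return the span of missing
    target tokens; an empty span means the slot's target is END_OF_SLOT."""
    bounds = [-1] + list(kept_indices) + [len(y)]
    targets = []
    for left, right in zip(bounds[:-1], bounds[1:]):
        span = y[left + 1:right]                 # tokens of y missing in this slot
        targets.append(list(span) if len(span) else [END_OF_SLOT])
    return targets

# Toy example: target "A B C D E" with only A and D kept in the partial hypothesis.
y = ["A", "B", "C", "D", "E"]
print(slot_targets(y, kept_indices=[0, 3]))
# -> [[-1], ['B', 'C'], ['E']]  (the slot before 'A' is already complete)
```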
## Training Differences

# Inference
## Greedy Decoding
## Parallel Decoding
- A sequence of n tokens can be generated in as few as ⌊log₂ n⌋ + 1 steps, since every slot can receive an insertion in parallel at each step (see the sketch below).
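A minimal sketch of parallel decoding under slot finalization, assuming a hypothetical `model(hypothesis)` callable that returns, for each of the len(hypothesis) + 1 slots, either its argmax token or `None` for end-of-slot. The ⌊log₂ n⌋ + 1 bound follows because every slot can receive one insertion per step, so the hypothesis can roughly double in length each iteration.

```python
def parallel_decode(model, max_steps=64):
    """Insert the best token into every unfinished slot simultaneously."""
    hypothesis = []                               # start from the empty canvas
    for _ in range(max_steps):
        choices = model(hypothesis)               # len(hypothesis) + 1 slot decisions
        if all(c is None for c in choices):       # every slot predicts end-of-slot
            break
        new_hypothesis = []
        for i, token in enumerate(hypothesis):
            if choices[i] is not None:            # fill the slot to the left of token i
                new_hypothesis.append(choices[i])
            new_hypothesis.append(token)
        if choices[-1] is not None:               # fill the final slot
            new_hypothesis.append(choices[-1])
        hypothesis = new_hypothesis
    return hypothesis

# Toy "model" that rebuilds a fixed target (distinct tokens, so position lookup
# is unambiguous) by always proposing the center token of each missing span.
TARGET = list("insert")

def toy_model(hyp):
    positions = [-1] + [TARGET.index(t) for t in hyp] + [len(TARGET)]
    out = []
    for left, right in zip(positions[:-1], positions[1:]):
        span = list(range(left + 1, right))
        out.append(TARGET[span[len(span) // 2]] if span else None)
    return out

print("".join(parallel_decode(toy_model)))        # -> "insert" in 3 insertion steps
```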
# Experiments

- `transformer_base` setup, trained for up to 1M steps on 8 × P100 GPUs.
- EOS penalty: the end token is selected only if its log-probability beats the best alternative by at least β; in other words, unless the model is really confident about eos, it does not produce eos. This is needed because end tokens are very frequent as training targets, which biases the model toward terminating too early. A sketch of the rule follows this list.
- Using the eos penalty plus knowledge-distillation data as the training target, together with parallel decoding, gives improved performance on the dev set.
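A sketch of the EOS-penalty rule as described in these notes (a margin test on the end token's log-probability); `pick_token`, `eos_id`, and the value of `beta` are illustrative.

```python
import numpy as np

def pick_token(log_probs, eos_id, beta=1.0):
    """Select the argmax token for a slot, but only allow the end token when its
    log-probability beats the best non-end alternative by at least beta."""
    best_non_eos = int(np.argmax(np.delete(log_probs, eos_id)))
    # np.delete shifts indices after eos_id back by one; undo that shift.
    if best_non_eos >= eos_id:
        best_non_eos += 1
    if log_probs[eos_id] - log_probs[best_non_eos] >= beta:
        return eos_id
    return best_non_eos

# Example: the end token (id 0) is slightly preferred, but not by the margin beta,
# so the slot keeps generating content instead of terminating early.
log_probs = np.log(np.array([0.45, 0.40, 0.15]))
print(pick_token(log_probs, eos_id=0, beta=1.0))   # -> 1
```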
## Parallel Decoding

## Test Result
## Examples of Decoding
# Personal Thoughts
Link: https://arxiv.org/pdf/1902.03249.pdf
Authors: Stern et al., 2019