## Abstract

- propose to leverage `hints` from a pre-trained AutoRegressive Translation (ART) model to train a Non-AutoRegressive Translation (NART) model
  - `hints` from hidden states
  - `hints` from word alignment
- on WMT14 EnDe, 17.8x faster inference with ~2.00 BLEU loss
  - NART : 25.20 BLEU / 44 ms
  - ART : 27.30 BLEU / 784 ms

## Details

### Introduction

- NART models
  - fully NART models suffer from a loss of accuracy
  - to improve decoder accuracy,
    - [Gu et al 2017]() introduce `fertilities` from an SMT model and copy source tokens to initialize decoder states
    - [Lee et al 2018]() propose an iterative refinement process
    - [Kaiser et al 2018]() embed an ART model that outputs discrete latent variables, then decode with a NART model
  - there is a trade-off: these fixes improve translation accuracy, but their computational overhead cuts into the inference speed-up
- Contribution
  - improve translation accuracy by enriching the training signal with two kinds of `hints` from a pre-trained ART model

### Motivation

- empirical error analysis of NART models leads to two findings: outputs contain `incoherent phrases and miss meaningful tokens on the source side`
  - incoherent phrases visualized via cosine similarity of hidden states (see the sketch after this list)
    - NART models w/o hints show higher cosine similarity across decoder positions, which leads to repetitive outputs
  - missing tokens visualized via attention weights
    - NART models w/o hints have low accuracy on attention weights
- enhancing the loss function with these two additional signals (cosine similarity between hidden states and attention weights) is the main contribution
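To make the analysis concrete, here is my own minimal PyTorch sketch (not from the paper) of the cosine-similarity diagnostic: compute pairwise cosine similarity between the decoder hidden states of one layer; high off-diagonal values mean positions have collapsed onto each other, which shows up as repeated phrases.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine_similarity(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (seq_len, d_model) states of one decoder layer.
    Returns a (seq_len, seq_len) matrix of cosine similarities."""
    normed = F.normalize(hidden, dim=-1)  # unit-normalize each position
    return normed @ normed.T              # cos(h_i, h_j) for all pairs

# toy usage: 10 decoder positions, model dim 512
sim = pairwise_cosine_similarity(torch.randn(10, 512))
print(sim.shape)  # torch.Size([10, 10])
```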
### Hint-based NMT

- `Hints from hidden state`
  - provide a penalty when NART hidden states are similar to each other but the corresponding ART hidden states are not
- `Hints from word alignment`
  - KL divergence loss that pulls NART attention distributions toward the ART model's (both hint losses are sketched right below)
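A minimal sketch of the two hint losses, under my own assumptions: I am guessing a hinge-style penalty with a single hypothetical threshold `gamma` for the hidden-state hint (the paper's exact thresholding may differ), and a KL divergence between encoder-decoder attention rows for the alignment hint.

```python
import torch
import torch.nn.functional as F

def hidden_state_hint_loss(student_h, teacher_h, gamma=0.5):
    """Penalize position pairs where NART (student) hidden states are highly
    similar although the ART (teacher) states are not.
    student_h, teacher_h: (seq_len, d_model) states of matched decoder layers.
    gamma: similarity threshold (a hypothetical value, not from the paper)."""
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    sim_s = s @ s.T  # student pairwise cosine similarity
    sim_t = t @ t.T  # teacher pairwise cosine similarity
    # pairs the teacher keeps distinct (low similarity) but the student
    # collapses (high similarity) contribute to the penalty
    distinct_in_teacher = (sim_t < gamma).float()
    return (distinct_in_teacher * F.relu(sim_s - gamma)).mean()

def word_alignment_hint_loss(student_attn, teacher_attn):
    """KL(teacher || student) over encoder-decoder attention.
    *_attn: (tgt_len, src_len), each row a distribution over source tokens."""
    return F.kl_div(student_attn.clamp_min(1e-9).log(), teacher_attn,
                    reduction="batchmean")
```

Presumably both terms are added to the usual NART cross-entropy loss with tuned weights.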
- `Initial Decoder State` (`z`) : linear combination of source embeddings
  - exponential weights, so source tokens at closer indices get more weight (see the sketch below)
- `Multihead Positional Attention` : additional sub-layer in the decoder to re-configure token positions
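One plausible implementation of the exponentially weighted initial state (my reading, not the authors' code): map each target position onto the source axis, then weight source embeddings by a softmax over negative distances; the sharpness factor `lam` is my own knob.

```python
import torch

def initial_decoder_states(src_emb: torch.Tensor, tgt_len: int, lam: float = 1.0):
    """src_emb: (src_len, d_model) source token embeddings.
    Returns (tgt_len, d_model) initial decoder states z."""
    src_len = src_emb.size(0)
    src_pos = torch.arange(src_len, dtype=torch.float)
    # map each target index onto the source axis ...
    tgt_pos = torch.arange(tgt_len, dtype=torch.float) * src_len / tgt_len
    # ... then weight source tokens by exponentially decaying distance
    dist = (tgt_pos[:, None] - src_pos[None, :]).abs()  # (tgt_len, src_len)
    weights = torch.softmax(-lam * dist, dim=-1)        # closer index => more weight
    return weights @ src_emb
```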
- `Inference Tricks`
  - `Length Prediction` : instead of predicting the target length, use a constant bias `C` obtained from the train corpus (no computational overhead)
  - `Length Range Prediction` : instead of committing to a single length, decode candidates over a range of target lengths
  - `ART re-scoring` : use the ART model to re-score the multiple target candidates and select the final one (rescoring can take place in a non-autoregressive manner; a combined sketch follows this list)
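Putting the three tricks together, a rough decoding loop might look like the sketch below; `nart_model`, `art_model.score`, `c`, and `width` are all assumed names and interfaces for illustration, not the paper's API. Because the ART model scores a complete hypothesis with teacher forcing, every position is evaluated in one parallel pass, which is why rescoring stays non-autoregressive.

```python
import torch

@torch.no_grad()
def translate_with_rescoring(nart_model, art_model, src, c=2, width=4):
    """src: (src_len,) source token ids.
    Assumes nart_model(src, tgt_len=...) -> decoded token ids, and
    art_model.score(src, hyp) -> log-probability of hyp under the teacher
    (computed with teacher forcing, i.e. all positions in parallel)."""
    base = src.size(0) + c  # constant length bias estimated on the train corpus
    # decode one candidate per length in the predicted range
    candidates = [nart_model(src, tgt_len=base + d)
                  for d in range(-width // 2, width // 2 + 1)]
    # let the pre-trained ART model pick the best candidate
    scores = [art_model.score(src, hyp) for hyp in candidates]
    return candidates[max(range(len(candidates)), key=lambda i: scores[i])]
```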
### Overall Performance

- 17.8x speed-up with 1.90 BLEU loss on WMT14 EnDe
## Personal Thoughts

- I totally agree that all the semantics and syntax are already in the source sentence, hence NART models can work if we train them correctly
- the `Inference Tricks` seem to be a strong contribution that the authors do not explicitly point out
- the ICLR submission was rejected due to insufficient related work/story-telling and `bad luck`
Link : https://openreview.net/pdf?id=r1gGpjActQ
Authors : Li et al 2018