## Abstract

- propose to leverage `hints` from a pre-trained AutoRegressive Translation (ART) model to train a Non-AutoRegressive Translation (NART) model
  - `hints` from hidden states
  - `hints` from word alignment
- on WMT14 EnDe, 17.8x faster inference with ~2.00 BLEU loss
  - NART : 25.20 BLEU / 44 ms
  - ART : 27.30 BLEU / 784 ms

## Details

### Introduction

- NART models
  - fully NART models suffer from a loss of accuracy
  - to improve decoder accuracy,
    - [Gu et al 2017]() introduce `fertilities` from an SMT model and copy source tokens to initialize decoder states
    - [Lee et al 2018]() propose an iterative refinement process
    - [Kaiser et al 2018]() embed an ART model that outputs discrete latent variables, then decode with a NART model
  - there is a trade-off: these fixes improve translation accuracy, but their computational overhead cuts into the inference speed-up
- Contribution
  - improve translation accuracy by enriching the training signal with two kinds of `hints` from a pre-trained ART model

### Motivation

- empirical error analysis of NART models leads to two findings: outputs contain `incoherent phrases and miss meaningful tokens on the source side`
  - incoherent phrases visualized via cosine similarity of hidden states (see the sketch after this list)
    - NART models w/o hints show higher cosine similarity across decoder positions, which leads to repetitive outputs
  - missing tokens visualized via attention weights
    - NART models w/o hints have low accuracy on attention weights
- enhancing the loss function with these two additional signals (cosine similarity between hidden states and attention weights) is the main contribution
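To make the analysis concrete, here is my own minimal PyTorch sketch (not from the paper) of the cosine-similarity diagnostic: compute pairwise cosine similarity between the decoder hidden states of one layer; high off-diagonal values mean positions have collapsed onto each other, which shows up as repeated phrases.

```python
import torch
import torch.nn.functional as F

def pairwise_cosine_similarity(hidden: torch.Tensor) -> torch.Tensor:
    """hidden: (seq_len, d_model) states of one decoder layer.
    Returns a (seq_len, seq_len) matrix of cosine similarities."""
    normed = F.normalize(hidden, dim=-1)  # unit-normalize each position
    return normed @ normed.T              # cos(h_i, h_j) for all pairs

# toy usage: 10 decoder positions, model dim 512
sim = pairwise_cosine_similarity(torch.randn(10, 512))
print(sim.shape)  # torch.Size([10, 10])
```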
### Hint-based NMT

- `Hints from hidden state`
  - provide a penalty when NART hidden states are similar to each other but the corresponding ART hidden states are not
- `Hints from word alignment`
  - KL divergence loss that pulls NART attention distributions toward the ART model's (both hint losses are sketched right below)
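A minimal sketch of the two hint losses, under my own assumptions: I am guessing a hinge-style penalty with a single hypothetical threshold `gamma` for the hidden-state hint (the paper's exact thresholding may differ), and a KL divergence between encoder-decoder attention rows for the alignment hint.

```python
import torch
import torch.nn.functional as F

def hidden_state_hint_loss(student_h, teacher_h, gamma=0.5):
    """Penalize position pairs where NART (student) hidden states are highly
    similar although the ART (teacher) states are not.
    student_h, teacher_h: (seq_len, d_model) states of matched decoder layers.
    gamma: similarity threshold (a hypothetical value, not from the paper)."""
    s = F.normalize(student_h, dim=-1)
    t = F.normalize(teacher_h, dim=-1)
    sim_s = s @ s.T  # student pairwise cosine similarity
    sim_t = t @ t.T  # teacher pairwise cosine similarity
    # pairs the teacher keeps distinct (low similarity) but the student
    # collapses (high similarity) contribute to the penalty
    distinct_in_teacher = (sim_t < gamma).float()
    return (distinct_in_teacher * F.relu(sim_s - gamma)).mean()

def word_alignment_hint_loss(student_attn, teacher_attn):
    """KL(teacher || student) over encoder-decoder attention.
    *_attn: (tgt_len, src_len), each row a distribution over source tokens."""
    return F.kl_div(student_attn.clamp_min(1e-9).log(), teacher_attn,
                    reduction="batchmean")
```

Presumably both terms are added to the usual NART cross-entropy loss with tuned weights.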
- `Initial Decoder State` (`z`) : linear combination of source embeddings
  - exponential weights, so source tokens at closer indices get more weight (see the sketch below)
- `Multihead Positional Attention` : additional sub-layer in the decoder to re-configure token positions
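One plausible implementation of the exponentially weighted initial state (my reading, not the authors' code): map each target position onto the source axis, then weight source embeddings by a softmax over negative distances; the sharpness factor `lam` is my own knob.

```python
import torch

def initial_decoder_states(src_emb: torch.Tensor, tgt_len: int, lam: float = 1.0):
    """src_emb: (src_len, d_model) source token embeddings.
    Returns (tgt_len, d_model) initial decoder states z."""
    src_len = src_emb.size(0)
    src_pos = torch.arange(src_len, dtype=torch.float)
    # map each target index onto the source axis ...
    tgt_pos = torch.arange(tgt_len, dtype=torch.float) * src_len / tgt_len
    # ... then weight source tokens by exponentially decaying distance
    dist = (tgt_pos[:, None] - src_pos[None, :]).abs()  # (tgt_len, src_len)
    weights = torch.softmax(-lam * dist, dim=-1)        # closer index => more weight
    return weights @ src_emb
```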
- `Inference Tricks`
  - `Length Prediction` : instead of predicting the target length, use a constant bias `C` obtained from the train corpus (no computational overhead)
  - `Length Range Prediction` : instead of committing to a single length, decode candidates over a range of target lengths
  - `ART re-scoring` : use the ART model to re-score the multiple target candidates and select the final one (rescoring can take place in a non-autoregressive manner; a combined sketch follows this list)
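Putting the three tricks together, a rough decoding loop might look like the sketch below; `nart_model`, `art_model.score`, `c`, and `width` are all assumed names and interfaces for illustration, not the paper's API. Because the ART model scores a complete hypothesis with teacher forcing, every position is evaluated in one parallel pass, which is why rescoring stays non-autoregressive.

```python
import torch

@torch.no_grad()
def translate_with_rescoring(nart_model, art_model, src, c=2, width=4):
    """src: (src_len,) source token ids.
    Assumes nart_model(src, tgt_len=...) -> decoded token ids, and
    art_model.score(src, hyp) -> log-probability of hyp under the teacher
    (computed with teacher forcing, i.e. all positions in parallel)."""
    base = src.size(0) + c  # constant length bias estimated on the train corpus
    # decode one candidate per length in the predicted range
    candidates = [nart_model(src, tgt_len=base + d)
                  for d in range(-width // 2, width // 2 + 1)]
    # let the pre-trained ART model pick the best candidate
    scores = [art_model.score(src, hyp) for hyp in candidates]
    return candidates[max(range(len(candidates)), key=lambda i: scores[i])]
```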
### Overall Performance

- 17.8x speed-up with 1.90 BLEU loss on WMT14 EnDe
## Personal Thoughts

- I totally agree that all the semantics and syntax are already in the source sentence, hence NART models can work if we train them correctly
- the `Inference Tricks` seem to be a strong contribution that the authors do not explicitly point out
- the ICLR submission was rejected due to insufficient related work/story-telling and `bad luck`
Link : https://openreview.net/pdf?id=r1gGpjActQ
Authors : Li et al 2018