Abstract
- proposes the Semi-Autoregressive Transformer (SAT), a novel model for fast sequence generation
- produces multiple successive words in parallel at each decoding step (K = 2, 4, 6, etc.)
- achieves a good balance between translation quality and decoding speed on WMT'14 En-De and NIST En-Zh
- up to 5.58x speedup while preserving 88% of translation quality on En-De (maximum-speedup setting)
- with K=2, SAT is almost lossless (only ~1% BLEU degradation)
Details
Introduction
- sequence generation tasks suffer from their autoregressive nature: output tokens must be decoded one by one, in order
- although CNN and self-attention modules enable parallel processing on the source/encoder side, the target/decoder side remains autoregressive at inference time
- Recent Works
- Gu et al. 2017 proposed a fully non-autoregressive NMT model that uses fertilities to predict the target length; decoding speed improves significantly, but translation quality degrades too much
- Lee et al. 2018 proposed a non-autoregressive sequence model based on iterative refinement, but quality still suffers
- Kaiser et al. 2018 proposed a semi-autoregressive model in which a Transformer first autoregressively generates a shorter sequence of discrete latent variables, from which the target sentence is generated in parallel
Semi-Autoregressive Transformer

- Group-level Chain-Rule
- the chain rule is applied to groups of K consecutive target tokens instead of individual tokens (see the factorization sketch below)
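- a minimal sketch of the group-level factorization; G_t denotes the t-th group of K tokens as in the paper, but the exact formula below is my reconstruction:

```latex
% Target y_1 .. y_m is split into consecutive groups G_1 .. G_{\lceil m/K \rceil}
% of K tokens each; tokens within a group are predicted independently
% (in parallel), conditioned on all earlier groups.
p(y_1,\dots,y_m \mid x)
  = \prod_{t=1}^{\lceil m/K \rceil} p\!\left(G_t \mid G_{<t},\, x\right),
\qquad
p\!\left(G_t \mid G_{<t},\, x\right)
  = \prod_{y_j \in G_t} p\!\left(y_j \mid G_{<t},\, x\right)
```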

- Long-Distance Prediction
- the model predicts K steps ahead rather than one step ahead (sketched below)
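- in other words, where a standard Transformer predicts the next token from all previous ones, SAT predicts the token K positions ahead; a one-line contrast (my paraphrase of the paper's definition):

```latex
% standard short-distance prediction   vs.   SAT long-distance prediction
p\!\left(y_{t+1} \mid y_{\le t},\, x\right)
\quad\longrightarrow\quad
p\!\left(y_{t+K} \mid y_{\le t},\, x\right)
```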

- Relaxed Causal Mask
- the train-time attention mask is relaxed from strictly lower-triangular to a coarse, block-wise lower-triangular one, so positions within the same group of K can attend to each other (see the sketch below)
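- a minimal NumPy sketch of such a block-wise mask; the function name and indexing convention are mine, not from the released code:

```python
import numpy as np

def relaxed_causal_mask(n: int, K: int) -> np.ndarray:
    """Block lower-triangular mask: query position i may attend to
    every key position in its own K-token group and in all earlier
    groups (1 = attend, 0 = masked)."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    # the group containing position i ends (exclusively) at (i // K + 1) * K
    return (j < (i // K + 1) * K).astype(np.int32)

# n=6, K=2 gives the coarse lower-triangular pattern with 2x2 blocks
print(relaxed_causal_mask(6, 2))
```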

- Complexity and Acceleration (a = time on decoder network, b = time on beam search; a rough derivation follows)
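- a back-of-the-envelope estimate under those definitions; I am assuming the decoder-network cost is per iteration and the beam-search cost is per generated token, which is my reading rather than a quoted formula:

```latex
% Autoregressive: n iterations, one token per iteration
T_{\mathrm{AT}} \approx n\,(a + b)
% SAT: \lceil n/K \rceil iterations, K tokens per iteration
T_{\mathrm{SAT}} \approx \frac{n}{K}\,a + n\,b
% Theoretical acceleration, approaching K when a dominates b
\frac{T_{\mathrm{AT}}}{T_{\mathrm{SAT}}}
  = \frac{a + b}{a/K + b} \;\xrightarrow{\;a \gg b\;}\; K
```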

Train
- trained with knowledge distillation (teacher-student setup) for better performance (a sketch follows)
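- sequence-level distillation of this kind usually means re-translating the training corpus with a strong autoregressive teacher and training the student on those outputs; a hypothetical sketch (the translate/train_step APIs below are placeholders, not from the released sa-nmt code):

```python
# Hypothetical sequence-level knowledge distillation loop.
# `teacher.translate` and `student.train_step` are placeholder APIs
# assumed for illustration only.

def build_distilled_corpus(teacher, source_sentences, beam_size=4):
    """Replace each reference with the teacher's own translation."""
    return [(src, teacher.translate(src, beam_size=beam_size))
            for src in source_sentences]

def train_with_distillation(student, teacher, source_sentences, epochs=10):
    distilled = build_distilled_corpus(teacher, source_sentences)
    for _ in range(epochs):
        for src, tgt in distilled:
            student.train_step(src, tgt)  # ordinary MLE on the teacher's output
```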

Result
- WMT14 EnDe
- with K=2, BLEU is 26.90 (vs. the SoTA 27.11), with a 1.51x speedup
- offers a better speed-quality balance than other non-autoregressive methods

- NIST02 En-Zh
- with K=2, BLEU is 39.57 (vs. the SoTA 40.59), with a 1.69x speedup

Case Study
- position-wise cross-entropy is higher at later positions, indicating that long-distance prediction is consistently harder

- the outputs show a frequent word-repetition issue
Future Work
- design a loss function or model better suited to long-distance prediction
- explore more stable training methods to combine with KD
- let the network adaptively determine the group size K
Personal Thoughts
- nice implementation; the idea itself is not especially creative since NAT was already out and a semi-autoregressive middle ground is a natural next step
- surprised to see that KD helps a lot in training
- very practical paper
Link : https://arxiv.org/pdf/1808.08583v2.pdf
Code : https://github.com/chqiwang/sa-nmt
Authors : Wang et al 2018