Abstract
- proposes the Semi-Autoregressive Transformer (SAT), a novel model for fast sequence generation
- produces multiple successive words in parallel at each decoding step (K = 2, 4, 6, etc.)
- achieves a good balance between translation quality and decoding speed on WMT'14 En-De and NIST En-Zh
- up to 5.58x speedup while preserving 88% of translation quality on En-De (maximum-speedup setting)
- with K=2, SAT is almost lossless (only ~1% BLEU degradation)
Details
Introduction
- sequence generation tasks suffer from their autoregressive nature: output tokens must be decoded one by one, in order
- although CNN and self-attention modules enable parallel processing on the source/encoder side, the target/decoder side remains autoregressive at inference time
- Recent Works
- Gu et al. 2017 proposed a fully non-autoregressive NMT model that uses fertilities to predict the target length; decoding speed improves significantly, but translation quality degrades too much
- Lee et al. 2018 proposed a non-autoregressive sequence model based on iterative refinement, but quality still suffers
- Kaiser et al. 2018 proposed a semi-autoregressive model in which a Transformer first autoregressively generates a shorter sequence of discrete latent variables, from which the target sentence is generated in parallel
Semi-Autoregressive Transformer

- Group-level Chain-Rule
- the chain rule is applied to groups of K consecutive target tokens instead of individual tokens (see the factorization sketch below)
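- a minimal sketch of the group-level factorization; G_t denotes the t-th group of K tokens as in the paper, but the exact formula below is my reconstruction:

```latex
% Target y_1 .. y_m is split into consecutive groups G_1 .. G_{\lceil m/K \rceil}
% of K tokens each; tokens within a group are predicted independently
% (in parallel), conditioned on all earlier groups.
p(y_1,\dots,y_m \mid x)
  = \prod_{t=1}^{\lceil m/K \rceil} p\!\left(G_t \mid G_{<t},\, x\right),
\qquad
p\!\left(G_t \mid G_{<t},\, x\right)
  = \prod_{y_j \in G_t} p\!\left(y_j \mid G_{<t},\, x\right)
```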

- Long-Distance Prediction
- the model predicts K steps ahead rather than one step ahead (sketched below)
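- in other words, where a standard Transformer predicts the next token from all previous ones, SAT predicts the token K positions ahead; a one-line contrast (my paraphrase of the paper's definition):

```latex
% standard short-distance prediction   vs.   SAT long-distance prediction
p\!\left(y_{t+1} \mid y_{\le t},\, x\right)
\quad\longrightarrow\quad
p\!\left(y_{t+K} \mid y_{\le t},\, x\right)
```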

- Relaxed Causal Mask
- the train-time attention mask is relaxed from strictly lower-triangular to a coarse, block-wise lower-triangular one, so positions within the same group of K can attend to each other (see the sketch below)
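- a minimal NumPy sketch of such a block-wise mask; the function name and indexing convention are mine, not from the released code:

```python
import numpy as np

def relaxed_causal_mask(n: int, K: int) -> np.ndarray:
    """Block lower-triangular mask: query position i may attend to
    every key position in its own K-token group and in all earlier
    groups (1 = attend, 0 = masked)."""
    i = np.arange(n)[:, None]  # query positions
    j = np.arange(n)[None, :]  # key positions
    # the group containing position i ends (exclusively) at (i // K + 1) * K
    return (j < (i // K + 1) * K).astype(np.int32)

# n=6, K=2 gives the coarse lower-triangular pattern with 2x2 blocks
print(relaxed_causal_mask(6, 2))
```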

- Complexity and Acceleration (a = time on decoder network, b = time on beam search; a rough derivation follows)
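- a back-of-the-envelope estimate under those definitions; I am assuming the decoder-network cost is per iteration and the beam-search cost is per generated token, which is my reading rather than a quoted formula:

```latex
% Autoregressive: n iterations, one token per iteration
T_{\mathrm{AT}} \approx n\,(a + b)
% SAT: \lceil n/K \rceil iterations, K tokens per iteration
T_{\mathrm{SAT}} \approx \frac{n}{K}\,a + n\,b
% Theoretical acceleration, approaching K when a dominates b
\frac{T_{\mathrm{AT}}}{T_{\mathrm{SAT}}}
  = \frac{a + b}{a/K + b} \;\xrightarrow{\;a \gg b\;}\; K
```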

Train
- trained with knowledge distillation (teacher-student setup) for better performance (a sketch follows)
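- sequence-level distillation of this kind usually means re-translating the training corpus with a strong autoregressive teacher and training the student on those outputs; a hypothetical sketch (the translate/train_step APIs below are placeholders, not from the released sa-nmt code):

```python
# Hypothetical sequence-level knowledge distillation loop.
# `teacher.translate` and `student.train_step` are placeholder APIs
# assumed for illustration only.

def build_distilled_corpus(teacher, source_sentences, beam_size=4):
    """Replace each reference with the teacher's own translation."""
    return [(src, teacher.translate(src, beam_size=beam_size))
            for src in source_sentences]

def train_with_distillation(student, teacher, source_sentences, epochs=10):
    distilled = build_distilled_corpus(teacher, source_sentences)
    for _ in range(epochs):
        for src, tgt in distilled:
            student.train_step(src, tgt)  # ordinary MLE on the teacher's output
```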

Result
- WMT14 EnDe
- with K=2, BLEU is 26.90 (vs. the SoTA 27.11), with a 1.51x speedup
- offers a better speed-quality balance than other non-autoregressive methods

- NIST02 En-Zh
- with K=2, BLEU is 39.57 (vs. the SoTA 40.59), with a 1.69x speedup

Case Study
- position-wise cross-entropy is higher at later positions, indicating that long-distance prediction is consistently harder

- the outputs show a frequent word-repetition issue
Future Work
- design a loss function or model better suited to long-distance prediction
- explore more stable training methods to combine with KD
- let the network adaptively determine the group size K
Personal Thoughts
- nice implementation; the idea itself is not especially creative since NAT was already out and a semi-autoregressive middle ground is a natural next step
- surprised to see that KD helps a lot in training
- very practical paper
Link : https://arxiv.org/pdf/1808.08583v2.pdf
Code : https://github.com/chqiwang/sa-nmt
Authors : Wang et al 2018