Abstract
- propose novel block-wise parallel decoding scheme to make predictions for multiple time steps in parallel and then back off to the longest prefix validated by a scoring model
- apply to existing SoTA models in machine translation and image super-resolution
- achieves 2x speed up w/o loss in quality (MT)
- achieves 3.3x speed up with slight loss in quality (MT)
Details
Introduction
- Non-Autoregressive Decoding
- Problem : although encoding the source sentence can be parallelized via self-attention, decoding the target sentence is still autoregressive, and hence slow and inefficient
- Fully Non-Autoregressive models (by Gu et al 2017) are difficult to train and lead to a large loss in quality
- Discrete latent variable models (by Kaiser et al 2018) do not reach SoTA quality
- Iterative refinement (by Lee et al 2018) shows impressive results, but the speed up is not significant
Blockwise Parallel Decoding
- restricted to Greedy Decoding
- Algorithm (a code sketch follows this list)
- Predict : predict k block tokens in parallel
- Verify : find the largest k~ that is quality-equivalent to greedy decoding
- for verification, predict the k blocks in parallel with each predicted token as oracle (this step can be re-used as the next predict step)
- check the validity of the k block tokens, and accept the best k~
- Accept : extend the result up to k~
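A minimal Python sketch of the predict / verify / accept loop under greedy decoding. `propose_block` and `score_next` are hypothetical stand-ins: the first returns the k tokens proposed by the parallel output heads, the second returns the base model's greedy next token (in the paper, verification and the next proposal are fused into a single model invocation; they are kept separate here for readability).

```python
def blockwise_greedy_decode(propose_block, score_next, prefix, k, max_len):
    """Greedy blockwise parallel decoding: predict, verify, accept.

    propose_block(prefix, k) -> list of k proposed tokens (hypothetical
        stand-in for the k parallel output heads)
    score_next(prefix)       -> the base model's greedy next token
    """
    output = list(prefix)
    while len(output) < max_len:
        # Predict: propose the next k tokens in one shot.
        block = propose_block(output, k)

        # Verify: keep the longest prefix of the block that matches what
        # greedy decoding would produce token by token.
        accepted = []
        for token in block:
            if token == score_next(output + accepted):
                accepted.append(token)
            else:
                break

        # Accept: always take at least one token so the loop makes progress
        # (on a mismatch, fall back to the verifier's own greedy token).
        if not accepted:
            accepted = [score_next(output)]
        output.extend(accepted)

    return output[:max_len]
```

In the basic scheme each iteration needs one proposal call plus one (batched) verification call; with the paper's combined scoring and proposal, the two are merged so that each iteration costs roughly one model invocation.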

Approximate Inference
- Top-k selection : relax the accept condition by allowing an exact match with any of the top k items
- Distance-based selection : for images, one can use a distance metric d as the criterion
- Minimum Block Size : to guarantee a minimum speed up, we can constrain at least l tokens to be accepted; the ablation study shows this hurts quality (min_block_size=1 is best). A sketch of these relaxed criteria follows this list.
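A hedged sketch of the relaxed acceptance checks; `accept_token`, its `mode` argument, and the default values of `top_k` and `eps` are illustrative names and numbers, not the paper's API or settings.

```python
import numpy as np

def accept_token(proposed, verifier_logits, mode="exact", top_k=5, eps=2.0):
    """Acceptance check for one block position during verification.

    proposed        : token id from the proposal head at this position
    verifier_logits : np.ndarray of base-model scores over the vocabulary
    top_k, eps      : illustrative hyperparameters (not the paper's values)
    """
    greedy = int(np.argmax(verifier_logits))
    if mode == "exact":        # strict agreement with greedy decoding
        return proposed == greedy
    if mode == "top_k":        # accept anything inside the verifier's top-k set
        top = np.argsort(verifier_logits)[-top_k:]
        return proposed in top
    if mode == "distance":     # e.g. image intensities: close enough in value
        return abs(proposed - greedy) <= eps
    raise ValueError(f"unknown mode: {mode}")
```

A minimum block size l would be enforced outside this check, by unconditionally accepting the first l proposed tokens of each block.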
Training
- pre-train a Transformer base model on WMT14 EnDe data for 100k steps
- modify the decoder part by extending it to k output layers, then fine-tune for 100k steps
- due to memory constraints, the mean of the k cross-entropy losses cannot be used, so one of the k sub-losses is selected uniformly at random as an unbiased estimate of the full loss (see the sketch after this list)
- Knowledge distillation for smoother training
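A minimal PyTorch-style sketch of the fine-tuning loss, assuming the decoder has been extended with k output projections (`heads` below is a hypothetical name). Sampling a single head uniformly and backpropagating only its cross-entropy gives an unbiased estimate of the mean over all k sub-losses while materializing just one set of logits.

```python
import random
import torch.nn.functional as F

def sampled_block_loss(decoder_hidden, heads, targets):
    """Unbiased single-head estimate of the mean of k cross-entropy losses.

    decoder_hidden : (batch, seq, d_model) decoder states of the base model
    heads          : list of k nn.Linear(d_model, vocab) output projections
                     (hypothetical; the paper extends the decoder to k
                     output layers)
    targets        : (k, batch, seq) gold tokens offset by 1..k positions
    """
    k = len(heads)
    i = random.randrange(k)                # pick one sub-loss uniformly
    logits = heads[i](decoder_hidden)      # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets[i].reshape(-1),
    )
    # E_i[loss_i] equals the mean of the k sub-losses, so the estimator is
    # unbiased while using the memory of a single output head.
    return loss
```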

Machine Translation (Experiments)
- Methods
- Regular : fix the pre-trained model and train only the modified k output layers
- Distillation : fix the pre-trained model and train the modified k output layers with distillation
- Fine-Tuning : fine-tune the pre-trained model along with the modified k output layers
- Both : fine-tune the pre-trained model with the modified k output layers and distillation
- Result
- Combining Distillation and Fine-Tuning leads to significant improvement in speed while maintaining quality

Wall-Clock Speedup
- the mean accepted block size is a proxy for speed up; an actual wall-clock speed up of 3x is obtained for MT (a rough estimate is sketched below)
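A back-of-the-envelope estimate relating the mean accepted block size to speed up; the per-iteration cost factor and the example numbers below are placeholder assumptions, not measurements from the paper.

```python
def estimated_speedup(seq_len, mean_accepted, iter_cost=1.0):
    """Rough speed up over token-by-token greedy decoding.

    mean_accepted : average number of tokens accepted per iteration
    iter_cost     : cost of one blockwise iteration relative to one greedy
                    step (assumed ~1 when prediction and verification are
                    fused into a single model invocation)
    """
    greedy_steps = seq_len
    blockwise_steps = (seq_len / mean_accepted) * iter_cost
    return greedy_steps / blockwise_steps

# Illustrative numbers only: an average accepted block of ~3 tokens with
# little per-iteration overhead gives roughly a 3x reduction in decoding steps.
print(estimated_speedup(seq_len=30, mean_accepted=3.0))  # -> 3.0
```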

Example
- Generation Process : e.g., in step 1 the model outputs 10 tokens simultaneously

Overall Performance
- 3x speed up with only 1 BLEU point loss (k=4 or 6 seems practical)

Personal Thoughts
- wow, great paper with a simple yet effective idea!
- must implement
Link : https://arxiv.org/pdf/1811.03115.pdf
Authors : Stern et al 2018