Abstract
- propose novel block-wise parallel decoding scheme to make predictions for multiple time steps in parallel and then back off to the longest prefix validated by a scoring model
- apply to existing SoTA models in machine translation and image super-resolution
- achieves 2x speed up w/o loss in quality (MT)
- achieves 3.3x speed up with slight loss in quality (MT)
Details
Introduction
- Non-Autoregressive Decoding
- Problem : although encoding the source sentence can be parallelized via self-attention, decoding the target sentence is still autoregressive, and hence slow and inefficient
- Fully Non-Autoregressive models (by Gu et al 2017) are difficult to train and lead to a large loss in quality
- Discrete latent variable models (by Kaiser et al 2018) do not reach SoTA quality
- Iterative refinement (by Lee et al 2018) shows impressive results, but the speed up is not significant
Blockwise Parallel Decoding
- restricted to Greedy Decoding
- Algorithm (a code sketch follows this list)
- Predict : predict k block tokens in parallel
- Verify : find the largest k~ that is quality-equivalent to greedy decoding
- for verification, predict the k blocks in parallel with each predicted token as oracle (this step can be re-used as the next predict step)
- check the validity of the k block tokens, and accept the best k~
- Accept : extend the result up to k~
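A minimal Python sketch of the predict / verify / accept loop under greedy decoding. `propose_block` and `score_next` are hypothetical stand-ins: the first returns the k tokens proposed by the parallel output heads, the second returns the base model's greedy next token (in the paper, verification and the next proposal are fused into a single model invocation; they are kept separate here for readability).

```python
def blockwise_greedy_decode(propose_block, score_next, prefix, k, max_len):
    """Greedy blockwise parallel decoding: predict, verify, accept.

    propose_block(prefix, k) -> list of k proposed tokens (hypothetical
        stand-in for the k parallel output heads)
    score_next(prefix)       -> the base model's greedy next token
    """
    output = list(prefix)
    while len(output) < max_len:
        # Predict: propose the next k tokens in one shot.
        block = propose_block(output, k)

        # Verify: keep the longest prefix of the block that matches what
        # greedy decoding would produce token by token.
        accepted = []
        for token in block:
            if token == score_next(output + accepted):
                accepted.append(token)
            else:
                break

        # Accept: always take at least one token so the loop makes progress
        # (on a mismatch, fall back to the verifier's own greedy token).
        if not accepted:
            accepted = [score_next(output)]
        output.extend(accepted)

    return output[:max_len]
```

In the basic scheme each iteration needs one proposal call plus one (batched) verification call; with the paper's combined scoring and proposal, the two are merged so that each iteration costs roughly one model invocation.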

Approximate Inference
- Top-k selection : relax the accept condition by allowing an exact match with any of the top k items
- Distance-based selection : for images, one can use a distance metric d as the criterion
- Minimum Block Size : to guarantee a minimum speed up, we can constrain at least l tokens to be accepted; the ablation study shows this hurts quality (min_block_size=1 is best). A sketch of these relaxed criteria follows this list.
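A hedged sketch of the relaxed acceptance checks; `accept_token`, its `mode` argument, and the default values of `top_k` and `eps` are illustrative names and numbers, not the paper's API or settings.

```python
import numpy as np

def accept_token(proposed, verifier_logits, mode="exact", top_k=5, eps=2.0):
    """Acceptance check for one block position during verification.

    proposed        : token id from the proposal head at this position
    verifier_logits : np.ndarray of base-model scores over the vocabulary
    top_k, eps      : illustrative hyperparameters (not the paper's values)
    """
    greedy = int(np.argmax(verifier_logits))
    if mode == "exact":        # strict agreement with greedy decoding
        return proposed == greedy
    if mode == "top_k":        # accept anything inside the verifier's top-k set
        top = np.argsort(verifier_logits)[-top_k:]
        return proposed in top
    if mode == "distance":     # e.g. image intensities: close enough in value
        return abs(proposed - greedy) <= eps
    raise ValueError(f"unknown mode: {mode}")
```

A minimum block size l would be enforced outside this check, by unconditionally accepting the first l proposed tokens of each block.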
Training
- pre-train a Transformer base model on WMT14 EnDe data for 100k steps
- modify the decoder part by extending it to k output layers, then fine-tune for 100k steps
- due to memory constraints, the mean of the k cross-entropy losses cannot be used, so one of the k sub-losses is selected uniformly at random as an unbiased estimate of the full loss (see the sketch after this list)
- Knowledge distillation for smoother training
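A minimal PyTorch-style sketch of the fine-tuning loss, assuming the decoder has been extended with k output projections (`heads` below is a hypothetical name). Sampling a single head uniformly and backpropagating only its cross-entropy gives an unbiased estimate of the mean over all k sub-losses while materializing just one set of logits.

```python
import random
import torch.nn.functional as F

def sampled_block_loss(decoder_hidden, heads, targets):
    """Unbiased single-head estimate of the mean of k cross-entropy losses.

    decoder_hidden : (batch, seq, d_model) decoder states of the base model
    heads          : list of k nn.Linear(d_model, vocab) output projections
                     (hypothetical; the paper extends the decoder to k
                     output layers)
    targets        : (k, batch, seq) gold tokens offset by 1..k positions
    """
    k = len(heads)
    i = random.randrange(k)                # pick one sub-loss uniformly
    logits = heads[i](decoder_hidden)      # (batch, seq, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets[i].reshape(-1),
    )
    # E_i[loss_i] equals the mean of the k sub-losses, so the estimator is
    # unbiased while using the memory of a single output head.
    return loss
```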

Machine Translation (Experiments)
- Methods
- Regular : fix the pre-trained model and train only the modified k output layers
- Distillation : fix the pre-trained model and train the modified k output layers with distillation
- Fine-Tuning : fine-tune the pre-trained model along with the modified k output layers
- Both : fine-tune the pre-trained model with the modified k output layers and distillation
- Result
- Combining Distillation and Fine-Tuning leads to significant improvement in speed while maintaining quality

Wall-Clock Speedup
- the mean accepted block size is a proxy for speed up; an actual wall-clock speed up of 3x is obtained for MT (a rough estimate is sketched below)
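A back-of-the-envelope estimate relating the mean accepted block size to speed up; the per-iteration cost factor and the example numbers below are placeholder assumptions, not measurements from the paper.

```python
def estimated_speedup(seq_len, mean_accepted, iter_cost=1.0):
    """Rough speed up over token-by-token greedy decoding.

    mean_accepted : average number of tokens accepted per iteration
    iter_cost     : cost of one blockwise iteration relative to one greedy
                    step (assumed ~1 when prediction and verification are
                    fused into a single model invocation)
    """
    greedy_steps = seq_len
    blockwise_steps = (seq_len / mean_accepted) * iter_cost
    return greedy_steps / blockwise_steps

# Illustrative numbers only: an average accepted block of ~3 tokens with
# little per-iteration overhead gives roughly a 3x reduction in decoding steps.
print(estimated_speedup(seq_len=30, mean_accepted=3.0))  # -> 3.0
```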

Example
- Generation Process : e.g., in step 1 the model outputs 10 tokens simultaneously

Overall Performance
- 3x speed up with only 1 BLEU point loss (k=4 or 6 seems practical)

Personal Thoughts
- wow, great paper with a simple yet effective idea!
- must implement
Link : https://arxiv.org/pdf/1811.03115.pdf
Authors : Stern et al 2018