Blockwise Parallel Decoding for Deep Autoregressive Models #116

@kweonwooj

Description

Abstract

  • proposes a novel blockwise parallel decoding scheme: make predictions for multiple time steps in parallel, then back off to the longest prefix validated by a scoring model
  • applied to existing SoTA models in machine translation and image super-resolution
  • achieves 2x speedup with no loss in quality (MT)
  • achieves 3.3x speedup with a slight loss in quality (MT)

Details

Introduction

  • Non-Autoregressive Decoding
    • Problem : although encoding the source sentence can be parallelized via self-attention, decoding the target sentence is still autoregressive, hence slow and inefficient
    • Fully non-autoregressive models (Gu et al. 2017) are difficult to train and lead to a large loss in quality
    • Discrete latent variable models (Kaiser et al. 2018) do not reach SoTA quality
    • Iterative refinement (Lee et al. 2018) shows impressive results, but the speedup is not significant

Blockwise Parallel Decoding

  • restricted to greedy decoding
  • Algorithm
    • Predict : propose the next k tokens of the block in parallel
    • Verify : find the largest prefix length k~ that is quality-equivalent to greedy decoding
      • score all k positions in parallel, feeding each predicted token as if it were the oracle (these scores can be re-used as the next Predict step)
      • check the validity of the k proposed tokens and accept the longest matching prefix of length k~
    • Accept : extend the output by the verified k~ tokens
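The predict/verify/accept loop above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `base_model` and `block_model` are hypothetical stand-ins for the scoring model's greedy next-token prediction and the k-token block proposal, and the verification here runs as a loop although in the paper it is a single batched forward pass.

```python
def decode_step(prefix, base_model, block_model):
    # Predict: propose the next k tokens of the block in parallel.
    proposal = block_model(prefix)  # list of k candidate tokens

    # Verify: for each position i, check that the i-th proposed token
    # matches what greedy decoding would produce given the prefix
    # extended by the first i proposals. In the paper this check is a
    # single parallel call; the loop here is for clarity only.
    accepted = []
    for token in proposal:
        greedy_token = base_model(prefix + accepted)
        if token != greedy_token:
            # On the first mismatch, keep the greedy token (it was
            # already computed for free) and stop verifying.
            accepted.append(greedy_token)
            break
        accepted.append(token)

    # Accept: extend the output by the verified prefix (length k~ <= k).
    return prefix + accepted
```

With correct proposals the whole block is accepted in one step; with a wrong proposal at position i, the first i tokens plus one greedy token are still gained.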

Approximate Inference

  • Top-k selection : relax the acceptance condition from an exact match to membership in the scoring model's top-k candidates
  • Distance-based selection : for images, accept a token if it lies within a threshold of the greedy choice under a distance metric d
  • Minimum block size : to guarantee a minimum speedup, at least l tokens can be forced to be accepted per step; the ablation shows this hurts quality, so min_block_size=1 (no constraint) is best
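The two relaxed acceptance criteria above amount to changing the verification test. A hypothetical sketch (the function name, `scores` dictionary, and parameters are illustrative, not the paper's API): a proposed token passes if it is among the scoring model's top-k candidates, or, for images, if it is within distance eps of the greedy choice.

```python
import heapq

def accept_token(proposed, scores, top_k=1, dist=None, eps=0.0):
    # `scores` maps each vocabulary item to the scoring model's
    # probability for the next position.
    greedy = max(scores, key=scores.get)
    if top_k > 1:
        # Top-k selection: accept any of the k highest-scoring tokens.
        best = heapq.nlargest(top_k, scores, key=scores.get)
        return proposed in best
    if dist is not None:
        # Distance-based selection (e.g. pixel intensities): accept if
        # the proposal is within eps of the greedy choice under d.
        return dist(proposed, greedy) <= eps
    # Exact match recovers standard greedy verification.
    return proposed == greedy
```

Larger top_k (or eps) accepts longer prefixes and thus decodes faster, at the cost of drifting from the greedy output.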

Training

  • pre-train a Transformer base model on WMT14 EnDe for 100k steps
  • modify the decoder by extending it to k output layers, then fine-tune for 100k steps
    • due to memory constraints, the mean of the k cross-entropy losses cannot be used, so one of the k sub-losses is selected uniformly at random as an unbiased estimate of the full loss
  • knowledge distillation is used for smoother training
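The memory-saving trick above, sampling one of the k sub-losses instead of averaging all of them, can be sketched as follows (function names are illustrative). Because each sub-loss is chosen with probability 1/k, the expected value of the sampled loss equals the mean, so the gradient estimate is unbiased while only one output head's loss is materialized per step.

```python
import random

def sampled_loss(sub_losses, rng=random):
    # Uniformly sample one of the k per-head cross-entropy losses.
    # E[sampled_loss] = mean(sub_losses), so gradients are an unbiased
    # estimate of the full objective at a fraction of the memory.
    return rng.choice(sub_losses)

def mean_loss(sub_losses):
    # The full (memory-hungry) objective the sample stands in for.
    return sum(sub_losses) / len(sub_losses)
```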

Machine Translation (Experiments)

  • Methods
    • Regular : freeze the pre-trained model and train only the modified k output layers
    • Distillation : freeze the pre-trained model and train the modified k output layers with distillation
    • Fine-Tuning : fine-tune the entire pre-trained model together with the modified k output layers
    • Both : fine-tune the entire model with the modified k output layers and distillation
  • Result
    • Combining Distillation and Fine-Tuning leads to significant improvement in speed while maintaining quality

Wall-Clock Speedup

  • mean accepted block size is a proxy for speedup; an actual wall-clock speedup of about 3x is obtained in MT

Example

  • Generation process : in the example, step 1 outputs 10 tokens simultaneously

Overall Performance

  • 3x speedup with only about a 1 BLEU point loss (k=4 or k=6 seems practical)

Personal Thoughts

  • wow, great paper with a simple yet effective idea!
  • must implement

Link : https://arxiv.org/pdf/1811.03115.pdf
Authors : Stern et al 2018
