
Pay Less Attention with Lightweight and Dynamic Convolutions #120

@kweonwooj


Abstract

  • self-attention is strong, but how much of its benefit comes from content-based, long-range attention is in question
  • propose light-weight convolution and dynamic convolution: the kernel is a function of the current timestep only, the parameter count is small, cost is linear in input length, and results are on par with or better than self-attention on machine translation, summarization and language modeling
  • in machine translation, a new WMT'14 En-De state of the art of 29.7 BLEU

Details

Background

  • Depth-wise Convolution : convolves each channel independently with its own kernel, reducing parameters from d²k (standard convolution) to dk
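A minimal NumPy sketch of the depth-wise convolution above (not the paper's fairseq code; the function name and toy shapes are illustrative):

```python
import numpy as np

def depthwise_conv(x, w):
    """Depth-wise convolution: each channel c is convolved with its own
    1-D kernel w[c]; no mixing across channels.
    x: (n, d) sequence of length n with d channels
    w: (d, k) one kernel of width k per channel
    Returns an (n, d) output, zero-padded so the length is preserved."""
    n, d = x.shape
    _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for c in range(d):                     # channels are independent
        for i in range(n):                 # one k-wide dot per position
            out[i, c] = xp[i:i + k, c] @ w[c]
    return out

x = np.arange(12, dtype=float).reshape(4, 3)  # toy sequence: n=4, d=3
w = np.ones((3, 3)) / 3.0                     # per-channel moving-average kernels
y = depthwise_conv(x, w)
print(y.shape)  # (4, 3)
```

Note the dk parameter count: `w` has one kernel per channel, versus d²k kernels mixing all input channels into all output channels in a standard convolution.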

Light-weight Convolution

  • depth-wise convolution with weight sharing across channels (only H = 16 distinct kernels) and softmax normalization over the kernel width
  • DropConnect on the normalized weights is used for regularization
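A sketch of the two changes that turn a depth-wise convolution into LightConv, again in illustrative NumPy rather than the paper's implementation: only H kernel rows exist (shared across groups of d/H channels), and each kernel is softmax-normalized over its width:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def light_conv(x, w, H):
    """Light-weight convolution sketch.
    x: (n, d) input sequence; w: (H, k) shared kernels, H << d.
    Channel c uses kernel row floor(c*H/d), softmax-normalized over k."""
    n, d = x.shape
    _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for c in range(d):
        wc = softmax(w[c * H // d])        # shared, normalized kernel
        for i in range(n):
            out[i, c] = xp[i:i + k, c] @ wc
    return out

rng = np.random.default_rng(0)
x = np.ones((5, 4))                        # constant toy input, n=5, d=4
w = rng.normal(size=(2, 3))                # H=2 shared kernels of width k=3
y = light_conv(x, w, H=2)
# softmax weights sum to 1, so interior outputs of a constant input stay 1
```

The softmax makes each output a convex combination of its window, which is what lets the same kernel be reused across many channels and timesteps without blowing up activations.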

Dynamic Convolution

  • a light-weight convolution whose kernel is predicted from the current timestep's input by a learned linear function
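The timestep-dependent kernel can be sketched as follows (a toy NumPy version under the same assumptions as above; `Wq` is the illustrative name for the kernel-predicting linear map): at each position i, fresh kernels are predicted from x[i] alone and applied as a light-weight convolution, so the cost stays linear in sequence length.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def dynamic_conv(x, Wq, H):
    """Dynamic convolution sketch.
    x: (n, d) input; Wq: (H*k, d) linear map predicting H kernels of width k
    from the current timestep's input. O(n): one prediction and one k-wide
    dot per position, unlike self-attention's O(n^2) pairwise scores."""
    n, d = x.shape
    k = Wq.shape[0] // H
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for i in range(n):
        kernels = (Wq @ x[i]).reshape(H, k)   # timestep-dependent kernels
        for c in range(d):
            out[i, c] = xp[i:i + k, c] @ softmax(kernels[c * H // d])
    return out

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))               # toy input, n=6, d=4
Wq = rng.normal(size=(2 * 3, 4))          # predicts H=2 kernels of width k=3
y = dynamic_conv(x, Wq, H=2)
```

Unlike self-attention, the weights over the window depend only on the current position's content, not on comparisons with every other position.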

Overall Structure

  • the convolution module is a linear input projection followed by a GLU, then the convolution, then a final linear projection; it replaces self-attention inside an otherwise standard Transformer block

Results

  • DynamicConv achieves 29.7 BLEU on WMT'14 En-De with the same parameter count as Transformer Big
  • Ablation
    • DynamicConv decodes about 20% faster than the self-attention baseline

Personal Thoughts

  • impressive result, improving both performance and speed over the Transformer
  • I wonder what the timestep-dependent kernels are capturing
  • would performance with a small number of layers be equivalent? CNNs seem to gather contextual information by stacking layers, whereas self-attention obtains global context in a single operation

Link : https://openreview.net/pdf?id=SkVhlh09tX
Authors : Wu et al., ICLR 2019
