Abstract
self-attention is strong, but how much it actually contributes to long-range dependency modeling is open to question
propose lightweight convolution and dynamic convolution, the latter a convolution whose kernel is a function of the timestep; both are lightweight, cost linear time in the input length, and perform better than or on par with self-attention in machine translation, summarization, and language modeling
in machine translation, sets a new WMT'14 En-De SoTA of 29.7 BLEU
Details
Background
Depth-wise Convolution : performs convolution independently over every channel
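To make the definition concrete, a minimal PyTorch sketch (my own illustration with toy sizes, not the paper's code): in PyTorch a depth-wise convolution is a grouped Conv1d with groups equal to the channel count, so each channel is convolved with its own kernel, independently of all others.

```python
import torch
import torch.nn as nn

d, k = 8, 3                          # hypothetical channel count and kernel width
depthwise = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=k,
                      padding=k // 2, groups=d)  # groups == channels -> depth-wise

x = torch.randn(2, d, 10)            # (batch, channels, time)
y = depthwise(x)
print(y.shape)                       # torch.Size([2, 8, 10]) -- shape preserved
```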
Light-weight Convolution
weight sharing across channel groups (H = 16) + softmax normalized depth-wise convolution
Dynamic Convolution
timestep-dependent kernel function + light-weight convolution: the kernel at each position is predicted from the current input alone
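A rough sketch of both operations under my reading of the paper: LightConv is a depth-wise convolution whose kernel is shared across channel groups and softmax-normalized over the kernel width; DynamicConv predicts that kernel from the input at each timestep via a linear projection. The unfold-based implementation and toy sizes are my own simplification, not the fairseq code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, k = 4, 3  # weight-sharing groups (the paper uses H = 16) and kernel width

def light_conv(x, W):
    """LightConv sketch: x is (B, T, d); W is (H, k), shared by d // H channels per head."""
    B, T, d = x.shape
    W = F.softmax(W, dim=-1)                       # normalize over the kernel width
    xh = x.view(B, T, H, d // H)
    xp = F.pad(xh, (0, 0, 0, 0, k // 2, k // 2))   # pad the time axis
    win = xp.unfold(1, k, 1)                       # (B, T, H, d // H, k) windows
    return torch.einsum('bthck,hk->bthc', win, W).reshape(B, T, d)

d = 8
proj = nn.Linear(d, H * k)  # predicts a kernel from the current timestep alone

def dynamic_conv(x):
    """DynamicConv sketch: per-timestep kernels, then the same normalized conv."""
    B, T, d = x.shape
    W_t = F.softmax(proj(x).view(B, T, H, k), dim=-1)   # (B, T, H, k) kernels
    xh = x.view(B, T, H, d // H)
    xp = F.pad(xh, (0, 0, 0, 0, k // 2, k // 2))
    win = xp.unfold(1, k, 1)
    return torch.einsum('bthck,bthk->bthc', win, W_t).reshape(B, T, d)

x = torch.randn(2, 10, d)                        # (batch, time, model dim)
print(light_conv(x, torch.randn(H, k)).shape)    # torch.Size([2, 10, 8])
print(dynamic_conv(x).shape)                     # torch.Size([2, 10, 8])
```

Because each output position only looks at a fixed k-wide window, the cost is linear in the sequence length, in contrast to self-attention's quadratic pairwise comparisons.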
Overall Structure
Results
DynamicConv achieves 29.7 BLEU on WMT'14 En-De with the same parameter count as Transformer Big
Ablation
inference with DynamicConv is 20% faster than with self-attention
Personal Thoughts
impressive result, improving both performance and speed over the Transformer
I wonder what the timestep-dependent kernel is actually capturing
would the performance hold up with a small number of layers? CNNs seem to gather contextual information by stacking layers (a stack of L layers with kernel width k sees only L(k-1)+1 timesteps), whereas self-attention obtains global context in a single operation
Link : https://openreview.net/pdf?id=SkVhlh09tX
Authors : Wu et al. 2018