Abstract
self-attention is strong, but how much it actually contributes to long-range dependency modeling is open to question
propose lightweight convolution and dynamic convolution, the latter a convolution whose kernel is a function of the timestep; both are lightweight, cost linear time in the input length, and perform better than or on par with self-attention in machine translation, summarization, and language modeling
in machine translation, sets a new WMT'14 En-De SoTA of 29.7 BLEU
Details
Background
Depth-wise Convolution : performs convolution independently over every channel
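To make the definition concrete, a minimal PyTorch sketch (my own illustration with toy sizes, not the paper's code): in PyTorch a depth-wise convolution is a grouped Conv1d with groups equal to the channel count, so each channel is convolved with its own kernel, independently of all others.

```python
import torch
import torch.nn as nn

d, k = 8, 3                          # hypothetical channel count and kernel width
depthwise = nn.Conv1d(in_channels=d, out_channels=d, kernel_size=k,
                      padding=k // 2, groups=d)  # groups == channels -> depth-wise

x = torch.randn(2, d, 10)            # (batch, channels, time)
y = depthwise(x)
print(y.shape)                       # torch.Size([2, 8, 10]) -- shape preserved
```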
Light-weight Convolution
weight sharing across channel groups (H = 16) + softmax normalized depth-wise convolution
Dynamic Convolution
timestep-dependent kernel function + light-weight convolution: the kernel at each position is predicted from the current input alone
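A rough sketch of both operations under my reading of the paper: LightConv is a depth-wise convolution whose kernel is shared across channel groups and softmax-normalized over the kernel width; DynamicConv predicts that kernel from the input at each timestep via a linear projection. The unfold-based implementation and toy sizes are my own simplification, not the fairseq code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

H, k = 4, 3  # weight-sharing groups (the paper uses H = 16) and kernel width

def light_conv(x, W):
    """LightConv sketch: x is (B, T, d); W is (H, k), shared by d // H channels per head."""
    B, T, d = x.shape
    W = F.softmax(W, dim=-1)                       # normalize over the kernel width
    xh = x.view(B, T, H, d // H)
    xp = F.pad(xh, (0, 0, 0, 0, k // 2, k // 2))   # pad the time axis
    win = xp.unfold(1, k, 1)                       # (B, T, H, d // H, k) windows
    return torch.einsum('bthck,hk->bthc', win, W).reshape(B, T, d)

d = 8
proj = nn.Linear(d, H * k)  # predicts a kernel from the current timestep alone

def dynamic_conv(x):
    """DynamicConv sketch: per-timestep kernels, then the same normalized conv."""
    B, T, d = x.shape
    W_t = F.softmax(proj(x).view(B, T, H, k), dim=-1)   # (B, T, H, k) kernels
    xh = x.view(B, T, H, d // H)
    xp = F.pad(xh, (0, 0, 0, 0, k // 2, k // 2))
    win = xp.unfold(1, k, 1)
    return torch.einsum('bthck,bthk->bthc', win, W_t).reshape(B, T, d)

x = torch.randn(2, 10, d)                        # (batch, time, model dim)
print(light_conv(x, torch.randn(H, k)).shape)    # torch.Size([2, 10, 8])
print(dynamic_conv(x).shape)                     # torch.Size([2, 10, 8])
```

Because each output position only looks at a fixed k-wide window, the cost is linear in the sequence length, in contrast to self-attention's quadratic pairwise comparisons.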
Overall Structure
Results
DynamicConv achieves 29.7 BLEU on WMT'14 En-De with the same parameter count as Transformer Big
Ablation
inference with DynamicConv is 20% faster than with self-attention
Personal Thoughts
impressive result, improving both performance and speed over the Transformer
I wonder what the timestep-dependent kernel is actually capturing
would the performance hold up with a small number of layers? CNNs seem to gather contextual information by stacking layers (a stack of L layers with kernel width k sees only L(k-1)+1 timesteps), whereas self-attention obtains global context in a single operation
Link : https://openreview.net/pdf?id=SkVhlh09tX
Authors : Wu et al. 2018