Abstract
- present an alternative view to explain the success of LSTM: "the gates themselves are versatile recurrent models that provide more representational power than previously appreciated."
Details
- LSTM/GRU were introduced to resolve the vanishing gradient problem of the naive RNN
- the authors experiment with various architectures within the LSTM module to show that the gates themselves are powerful recurrent models that provide representational power beyond simply mitigating the vanishing gradient problem
- Test various LSTM sub-module architectures on a range of sequential tasks: Language Modeling (PTB), Question Answering (SQuAD), Dependency Parsing (Universal Dependencies English Web Treebank v1.3), and Machine Translation (En-De WMT16)
LSTM
- Sub-components of the LSTM can be outlined as below, where the content layer is Eq. 2, the memory cell is Eq. 3-5, and the output layer is Eq. 6-7
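For reference, a standard LSTM formulation consistent with this decomposition (the equation numbering is assumed to match the paper; this is the textbook parameterization, not copied from it):

```latex
% Content layer (Eq. 2): the S-RNN embedded inside the LSTM
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)
% Memory cell (Eq. 3-5): input/forget gates and the accumulated cell state
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t
% Output layer (Eq. 6-7): output gate and hidden state
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t \circ \tanh(c_t)
```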

- Models
LSTM - S-RNN : replace the S-RNN in the content layer (Eq. 2) with a simple linear transformation of the input (c̃_t = W x_t); see the sketch after this list
LSTM - S-RNN - OUT : also remove the output gate from Eq. 7, leaving only the activation function (h_t = tanh(c_t))
LSTM - S-RNN - HIDDEN : each gate is computed from x_t only, with no dependence on h_{t-1}; this variant can be seen as a type of QRNN or SRU
LSTM - GATES : ablate the gates, isolating the S-RNN
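A minimal NumPy sketch of the LSTM - S-RNN variant (the content layer replaced by a linear transform of the input); the function and parameter names are assumptions for illustration, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_minus_srnn_step(x_t, h_prev, c_prev, p):
    """One step of the (hypothetical) LSTM - S-RNN cell: the S-RNN content
    layer is replaced by a context-independent linear map of the input."""
    c_tilde = p["Wc"] @ x_t                                    # content layer: linear, no recurrence
    i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])  # input gate
    f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])  # forget gate
    c_t = f_t * c_prev + i_t * c_tilde                         # memory cell update
    o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])  # output gate
    h_t = o_t * np.tanh(c_t)                                   # output layer
    return h_t, c_t
```

The other ablations follow the same pattern: LSTM - S-RNN - OUT drops o_t (h_t = tanh(c_t)), LSTM - S-RNN - HIDDEN drops the h_prev terms from the gate computations, and LSTM - GATES keeps only the S-RNN content layer.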
Results
- Performance degrades most when the GATES are missing
- when S-RNN, OUT, or HIDDEN is missing, the performance drop is not significant


Discussion
LSTM - S-RNN - OUT can be interpreted as computing an element-wise weighted sum of context-independent functions of the inputs, revealing a link to self-attention (see the derivation after the list of differences below)
- Three key differences from self-attention
- LSTM weights are vectors applied element-wise, whereas self-attention computes scalar weights
- the LSTM weighted sum is accumulated recurrently via dynamic programming, whereas self-attention is computed at once over the whole sequence
- LSTM weights are unnormalized and can sum up to the sequence length, whereas attention weights have a probabilistic interpretation (they sum to 1)
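A sketch of the derivation behind this interpretation, using the symbols from the standard formulation above: unrolling the memory-cell recurrence (with c_0 = 0) expresses c_t as an element-wise weighted sum of the content terms.

```latex
c_t = \sum_{j=1}^{t} \Big( i_j \circ \prod_{k=j+1}^{t} f_k \Big) \circ \tilde{c}_j
    = \sum_{j=1}^{t} w_j^t \circ \tilde{c}_j ,
\qquad w_j^t := i_j \circ \prod_{k=j+1}^{t} f_k
```

When \tilde{c}_j = W x_j (LSTM - S-RNN), c_t becomes a dynamically weighted sum of context-independent input projections, which is the structural analogy to self-attention, modulo the three differences above.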
Personal Thoughts
- The detailed ablation study on a wide range of sequential tasks provides solid evidence for their claims
- The interpretation of the LSTM as a weighted sum and its link to attention was surprising
Link : https://arxiv.org/pdf/1805.03716.pdf
Authors : Levy et al. 2018