
Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum #112


Abstract

  • presents an alternative view to explain the success of the LSTM: "the gates themselves are versatile recurrent models that provide more representational power than previously appreciated."

Details

  • LSTM/GRU were introduced to resolve the vanishing gradient problem of the naive RNN (S-RNN)
  • the authors experiment with various architectures within the LSTM module to show that the gates themselves are powerful recurrent models, providing representational power beyond simply mitigating the vanishing gradient problem
  • they test various LSTM sub-module ablations on a range of sequential tasks: Language Modeling (PTB), Question Answering (SQuAD), Dependency Parsing (Universal Dependencies English Web Treebank v1.3), and Machine Translation (En-De WMT16)

LSTM

  • the sub-components of the LSTM can be outlined as below, where the content layer is Eq 2, the memory cell is Eq 3-5, and the output layer is Eq 6-7 (see the sketch after this list)
    [figure: LSTM sub-component equations (Eq 1-7)]
  • Models
    • LSTM - S-RNN : replace the S-RNN in the content layer (Eq 2) with a simple linear transformation (c̃_t = W x_t)
    • LSTM - S-RNN - OUT : additionally remove the output gate from Eq 7, leaving only the activation function
    • LSTM - S-RNN - HIDDEN : each gate is computed from x_t only; this variant can be seen as a type of QRNN or SRU
    • LSTM - GATES : ablate the gates, isolating the S-RNN
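
To make the decomposition concrete, here is a minimal NumPy sketch of a single LSTM step split into the three sub-components, with flags for the ablations above. The parameter names (`Wc`, `Uc`, etc.) and the exact way the HIDDEN ablation is wired are my assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p,
              use_srnn=True, use_out=True, use_hidden=True, use_gates=True):
    """One LSTM step decomposed into content layer, memory cell, output layer.
    `p` is a dict of (hypothetical) parameter arrays; the flags switch off the
    pieces ablated in the paper: S-RNN, OUT, HIDDEN, GATES."""
    # HIDDEN ablation: gates see only x_t, not the previous hidden state
    h_for_gates = h_prev if use_hidden else np.zeros_like(h_prev)

    # Content layer (Eq 2): S-RNN candidate, or a plain linear map when ablated
    if use_srnn:
        c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    else:
        c_tilde = p["Wc"] @ x_t

    # Memory cell (Eq 3-5): input/forget gates build an element-wise weighted sum
    if use_gates:
        i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_for_gates + p["bi"])
        f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_for_gates + p["bf"])
        c_t = f_t * c_prev + i_t * c_tilde
    else:
        c_t = c_tilde  # LSTM - GATES: the bare S-RNN

    # Output layer (Eq 6-7): output gate applied to the squashed cell state
    if use_out:
        o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_for_gates + p["bo"])
        h_t = o_t * np.tanh(c_t)
    else:
        h_t = np.tanh(c_t)
    return h_t, c_t
```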

Results

  • performance degrades most when the GATES are removed
  • when S-RNN, OUT, or HIDDEN is removed, the performance drop is not significant
    [figures: ablation results on each task]

Discussion

  • LSTM - S-RNN - OUT can be interpreted as a weighted sum of context-independent functions of the inputs, showing a link to self-attention (see the sketch after this list)
  • Three key differences from self-attention
    • LSTM weights are vectors (element-wise), whereas self-attention computes scalar weights
    • the LSTM weighted sum is accumulated via a dynamic-programming recurrence, whereas self-attention is computed all at once
    • LSTM weight sums can grow up to the sequence length, whereas attention weights are normalized and have a probabilistic interpretation
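
A quick way to see the weighted-sum view: unrolling the memory-cell recurrence expresses c_t as a sum of context-independent content vectors c̃_j = W x_j, each weighted element-wise by the input gate at step j times the product of the subsequent forget gates. Below is a minimal sketch (random gate values stand in for the actual gate computations) verifying that the recurrence and the explicit weighted sum agree.

```python
import numpy as np

def unrolled_cell_state(contents, i_gates, f_gates):
    """Compute c_T two ways for a toy sequence:
    (1) the recurrence  c_t = f_t * c_{t-1} + i_t * c~_t, and
    (2) the explicit sum c_T = sum_j (prod_{k>j} f_k) * i_j * c~_j."""
    T, d = contents.shape
    # (1) dynamic-programming recurrence
    c = np.zeros(d)
    for t in range(T):
        c = f_gates[t] * c + i_gates[t] * contents[t]
    # (2) weighted sum with element-wise (vector) weights
    c_sum = np.zeros(d)
    for j in range(T):
        w_j = i_gates[j] * np.prod(f_gates[j + 1:], axis=0)
        c_sum += w_j * contents[j]
    return c, c_sum

rng = np.random.default_rng(0)
T, d = 5, 4
contents = rng.normal(size=(T, d))   # c~_j = W x_j (context-independent)
i_gates = rng.uniform(size=(T, d))   # input gates in (0, 1)
f_gates = rng.uniform(size=(T, d))   # forget gates in (0, 1)
c_rec, c_sum = unrolled_cell_state(contents, i_gates, f_gates)
print(np.allclose(c_rec, c_sum))     # True: both views agree
```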

Personal Thoughts

  • the detailed ablation study over a wide range of sequential tasks provides solid evidence for their claims
  • the interpretation and its linkage to self-attention were surprising

Link : https://arxiv.org/pdf/1805.03716.pdf
Authors : Levy et al. 2018
