
Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum #112


Abstract

  • presents an alternative view to explain the success of the LSTM: "the gates themselves are versatile recurrent models that provide more representational power than previously appreciated."

Details

  • LSTM/GRU were introduced to resolve the vanishing gradient problem of the naive RNN (S-RNN)
  • the authors experiment with various architectures within the LSTM module to show that the gates themselves are powerful recurrent models, providing representational power beyond simply mitigating the vanishing gradient problem
  • they test various LSTM sub-module ablations on a range of sequential tasks: Language Modeling (PTB), Question Answering (SQuAD), Dependency Parsing (Universal Dependencies English Web Treebank v1.3), and Machine Translation (En-De WMT16)

LSTM

  • the sub-components of the LSTM can be outlined as below, where the content layer is Eq 2, the memory cell is Eq 3-5, and the output layer is Eq 6-7 (see the sketch after this list)
    [figure: LSTM sub-component equations (Eq 1-7)]
  • Models
    • LSTM - S-RNN : replace the S-RNN in the content layer (Eq 2) with a simple linear transformation (c̃_t = W x_t)
    • LSTM - S-RNN - OUT : additionally remove the output gate from Eq 7, leaving only the activation function
    • LSTM - S-RNN - HIDDEN : each gate is computed from x_t only; this variant can be seen as a type of QRNN or SRU
    • LSTM - GATES : ablate the gates, isolating the S-RNN
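
To make the decomposition concrete, here is a minimal NumPy sketch of a single LSTM step split into the three sub-components, with flags for the ablations above. The parameter names (`Wc`, `Uc`, etc.) and the exact way the HIDDEN ablation is wired are my assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p,
              use_srnn=True, use_out=True, use_hidden=True, use_gates=True):
    """One LSTM step decomposed into content layer, memory cell, output layer.
    `p` is a dict of (hypothetical) parameter arrays; the flags switch off the
    pieces ablated in the paper: S-RNN, OUT, HIDDEN, GATES."""
    # HIDDEN ablation: gates see only x_t, not the previous hidden state
    h_for_gates = h_prev if use_hidden else np.zeros_like(h_prev)

    # Content layer (Eq 2): S-RNN candidate, or a plain linear map when ablated
    if use_srnn:
        c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
    else:
        c_tilde = p["Wc"] @ x_t

    # Memory cell (Eq 3-5): input/forget gates build an element-wise weighted sum
    if use_gates:
        i_t = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_for_gates + p["bi"])
        f_t = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_for_gates + p["bf"])
        c_t = f_t * c_prev + i_t * c_tilde
    else:
        c_t = c_tilde  # LSTM - GATES: the bare S-RNN

    # Output layer (Eq 6-7): output gate applied to the squashed cell state
    if use_out:
        o_t = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_for_gates + p["bo"])
        h_t = o_t * np.tanh(c_t)
    else:
        h_t = np.tanh(c_t)
    return h_t, c_t
```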

Results

  • performance degrades most when the GATES are removed
  • when S-RNN, OUT, or HIDDEN is removed, the performance drop is not significant
    [figures: ablation results on each task]

Discussion

  • LSTM - S-RNN - OUT can be interpreted as a weighted sum of context-independent functions of the inputs, showing a link to self-attention (see the sketch after this list)
  • Three key differences from self-attention
    • LSTM weights are vectors (element-wise), whereas self-attention computes scalar weights
    • the LSTM weighted sum is accumulated via a dynamic-programming recurrence, whereas self-attention is computed all at once
    • LSTM weight sums can grow up to the sequence length, whereas attention weights are normalized and have a probabilistic interpretation
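
A quick way to see the weighted-sum view: unrolling the memory-cell recurrence expresses c_t as a sum of context-independent content vectors c̃_j = W x_j, each weighted element-wise by the input gate at step j times the product of the subsequent forget gates. Below is a minimal sketch (random gate values stand in for the actual gate computations) verifying that the recurrence and the explicit weighted sum agree.

```python
import numpy as np

def unrolled_cell_state(contents, i_gates, f_gates):
    """Compute c_T two ways for a toy sequence:
    (1) the recurrence  c_t = f_t * c_{t-1} + i_t * c~_t, and
    (2) the explicit sum c_T = sum_j (prod_{k>j} f_k) * i_j * c~_j."""
    T, d = contents.shape
    # (1) dynamic-programming recurrence
    c = np.zeros(d)
    for t in range(T):
        c = f_gates[t] * c + i_gates[t] * contents[t]
    # (2) weighted sum with element-wise (vector) weights
    c_sum = np.zeros(d)
    for j in range(T):
        w_j = i_gates[j] * np.prod(f_gates[j + 1:], axis=0)
        c_sum += w_j * contents[j]
    return c, c_sum

rng = np.random.default_rng(0)
T, d = 5, 4
contents = rng.normal(size=(T, d))   # c~_j = W x_j (context-independent)
i_gates = rng.uniform(size=(T, d))   # input gates in (0, 1)
f_gates = rng.uniform(size=(T, d))   # forget gates in (0, 1)
c_rec, c_sum = unrolled_cell_state(contents, i_gates, f_gates)
print(np.allclose(c_rec, c_sum))     # True: both views agree
```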

Personal Thoughts

  • the detailed ablation study over a wide range of sequential tasks provides solid evidence for their claims
  • the interpretation and its linkage to self-attention were surprising

Link : https://arxiv.org/pdf/1805.03716.pdf
Authors : Levy et al. 2018
