
Representation Degeneration Problem in Training Natural Language Generation Models #119


Abstract

  • representation degeneration: NLG (Natural Language Generation) models trained with MLE (Maximum Likelihood Estimation) and the weight-tying trick (shared word embedding & pre-softmax layer) on large training datasets suffer from most of the learnt word embeddings concentrating into a narrow cone, which limits the representation power of the word embeddings
  • propose a novel regularization method to address this problem
  • WMT14 EnDe +1.08 BLEU with Transformer Base, +0.54 BLEU with Transformer Big

Details

Representation Degeneration


  • word2vec embeddings (b) and the softmax parameters learnt from a classification task on MNIST (c) are diversely distributed around the origin under a 2D SVD projection (a rough sketch of the projection follows below)
  • whereas the Transformer word embeddings (a) are concentrated in a narrow cone
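A minimal sketch of how such a 2D view can be produced, assuming E is a (vocab_size × dim) embedding matrix pulled from some trained model; this is only an illustration of the SVD projection, not the authors' plotting code, and the variable names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def project_2d(E: np.ndarray) -> np.ndarray:
    """Project an embedding matrix (vocab_size x dim) onto its top-2
    right singular directions; returns (vocab_size x 2) coordinates."""
    _, _, Vt = np.linalg.svd(E, full_matrices=False)
    return E @ Vt[:2].T

# E could e.g. come from model.decoder.embed_tokens.weight (hypothetical name)
E = np.random.randn(1000, 512)  # stand-in matrix so the sketch runs on its own
xy = project_2d(E)
plt.scatter(xy[:, 0], xy[:, 1], s=2)
plt.title("2D SVD projection of word embeddings")
plt.show()
```

In such a plot, a degenerate embedding matrix puts most points on one side of the origin, while a healthy one scatters them around it.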

Understanding the Problem

  • when the word embedding is tied to the softmax layer:
    • word representations should be widely distributed to represent different semantic meanings
    • a softmax layer with more diversely distributed parameters is expected to obtain a larger-margin result
  • however, in reality, the learnt word embeddings are clustered into a narrow cone and the model faces the challenge of limited expressiveness
  • Cause: intuitively, during training with the likelihood loss,
    • for any given hidden state, the embedding of the corresponding ground-truth word is pushed towards the direction of the hidden state to increase its likelihood, while the embeddings of all other words are pushed towards the negative direction of the hidden state to decrease theirs
    • since each individual word occurs with low frequency in natural language, its embedding is pushed towards the negative directions of most hidden states, which vary drastically
    • as a result, the embeddings of most words end up pushed towards similar directions negatively correlated with most hidden states, and thus cluster together in a local region of the embedding space (see the gradient sketch below)
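To make the "pushed towards the negative direction" argument concrete, here is the standard log-softmax gradient with tied embeddings (my own derivation of the usual algebra, not copied from the paper): with hidden state h_t, ground-truth word y_t, and tied embedding w_i serving as the softmax weight for word i,

```latex
\ell_t = -\log \frac{\exp(w_{y_t}^\top h_t)}{\sum_j \exp(w_j^\top h_t)},
\qquad
\frac{\partial \ell_t}{\partial w_i} = \bigl(P(i \mid h_t) - \mathbb{1}[i = y_t]\bigr)\, h_t .
```

A gradient-descent step therefore moves the ground-truth embedding w_{y_t} towards +h_t and every other embedding towards -h_t (scaled by its predicted probability); a low-frequency word accumulates almost exclusively the -h_t updates across many different hidden states.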

MLE with Cosine Regularization


  • propose MLE-CosReg, which adds the sum of pairwise cosine similarities between word embeddings to the MLE loss as a regularization term, pushing word embeddings apart from each other instead of letting them cluster into a narrow cone (a rough sketch of the regularizer follows below)
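A minimal PyTorch-style sketch of such a regularizer, assuming `embedding` is the tied (vocab × dim) weight matrix and `gamma` the regularization weight; the exact normalization over pairs is my reading of the idea, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def cosine_regularizer(embedding: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between word embeddings; adding this
    to the MLE loss penalizes embeddings that all point in similar directions."""
    W = F.normalize(embedding, dim=-1)            # unit-norm rows
    V = W.size(0)
    sim = W @ W.t()                               # (V x V) cosine similarities
    off_diag = sim.sum() - sim.diagonal().sum()   # drop the self-similarity terms
    return off_diag / (V * (V - 1))               # average over ordered pairs

# hypothetical usage: nll is the usual cross-entropy / MLE term
# loss = nll + gamma * cosine_regularizer(model.decoder.embed_tokens.weight)
```

For large vocabularies the full V × V similarity matrix is expensive, so in practice one might compute the term on a sampled subset of rows per step.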

Experimental Result


  • the WMT14 En→De and De→En tasks with Transformer Base show an increase in BLEU score
  • word embeddings are now distributed more uniformly around the origin (see the diagnostic sketch below)
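One simple way to check this claim numerically (my own diagnostic, not a metric reported in the paper) is the average pairwise cosine similarity of the embedding matrix, which should drop once the regularizer spreads the embeddings out:

```python
import torch
import torch.nn.functional as F

def avg_pairwise_cosine(embedding: torch.Tensor) -> float:
    """Average cosine similarity over all ordered word pairs; values near 1
    indicate a degenerate narrow cone, values near 0 a well spread-out layout."""
    W = F.normalize(embedding, dim=-1)
    V = W.size(0)
    sim = W @ W.t()
    return ((sim.sum() - V) / (V * (V - 1))).item()  # exclude self-similarity (= 1)

# e.g. compare avg_pairwise_cosine(baseline_embedding) vs avg_pairwise_cosine(cosreg_embedding)
```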

Personal Thoughts

  • interesting phenomenon, and a simple solution
  • paper is well-written
  • not sure whether the impact will be visible in NMT outputs

Link : https://openreview.net/pdf?id=SkEYojRqtm
Authors : Gao et al. 2018
