Abstract
representation degeneration == in NLG (Natural Language Generation) tasks trained with MLE (Maximum Likelihood Estimation), weight tying (sharing the word embedding and pre-softmax matrices), and a large training dataset, most of the learnt word embeddings concentrate into a narrow cone, which limits their representation power
- propose a novel regularization method to address this problem
- WMT14 EnDe: +1.08 BLEU with Transformer Base, +0.54 BLEU with Transformer Big
Details
Representation Degeneration

- word2vec embeddings (b) and softmax parameters learnt from a classification task on MNIST (c) are diversely distributed around the origin under a 2D SVD projection
- whereas the Transformer's word embeddings are concentrated in a narrow cone
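The 2D visualization above can be reproduced with a minimal numpy sketch (not code from the paper; the function name and toy data are illustrative): center the embedding matrix and project it onto its top-2 singular directions.

```python
import numpy as np

def project_2d(E):
    """Project word embeddings E (V x d) onto their top-2 singular
    directions, as done for the paper's scatter plots."""
    E_centered = E - E.mean(axis=0)                  # center around the origin
    U, S, Vt = np.linalg.svd(E_centered, full_matrices=False)
    return E_centered @ Vt[:2].T                     # (V, 2) coordinates

# toy check with random embeddings: 100 words, 16 dims -> 100 2D points
E = np.random.default_rng(0).normal(size=(100, 16))
coords = project_2d(E)
```

Plotting `coords` for a trained Transformer's embeddings would show the narrow-cone shape; for word2vec it shows points spread around the origin.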
Understanding the Problem
- since the word embedding is tied to the softmax layer:
- word representations should be widely distributed to represent different semantic meanings
- a more diversely distributed softmax is expected to yield a larger-margin result
- however, in practice the learnt word embeddings cluster into a narrow cone and the model suffers from limited expressiveness
- Cause (the paper's intuition):
Intuitively speaking, during the training process of a model with a likelihood loss, for any given hidden state, the embedding of the corresponding ground-truth word is pushed towards the direction of the hidden state to obtain a larger likelihood, while the embeddings of all other words are pushed towards the negative direction of the hidden state to obtain a smaller likelihood. Since in natural language most words occur with very low frequency, a word's embedding is pushed towards the negative directions of most hidden states, which vary drastically. As a result, the embeddings of most words in the vocabulary are pushed towards similar directions negatively correlated with most hidden states and thus cluster together in a local region of the embedding space.
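This intuition can be checked numerically: with tied weights, the gradient of the negative log-likelihood w.r.t. a word embedding w_i is (p_i − 1{i == target}) · h, so a gradient-descent step moves the target embedding towards the hidden state h and every other embedding away from it. A toy sketch (sizes and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8                        # toy vocabulary size, embedding dim
W = rng.normal(size=(V, d))        # tied embedding / pre-softmax matrix
h = rng.normal(size=d)             # one hidden state
target = 2                         # index of the ground-truth word

logits = W @ h
p = np.exp(logits - logits.max())  # softmax (shifted for stability)
p /= p.sum()

# d(-log p_target)/d w_i = (p_i - 1{i == target}) * h
grads = (p - np.eye(V)[target])[:, None] * h[None, :]

# alignment of the descent direction (-grad) with h, per word:
# > 0 means the embedding is pushed towards h, < 0 means pushed away
alignment = -(grads @ h)
```

Here `alignment[target]` comes out positive and every other entry negative, matching the pushed-towards / pushed-away picture above; since most words are non-targets for most hidden states, their embeddings accumulate updates in similar negative directions.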
MLE with Cosine Regularization

- propose MLE-CosReg, which adds the pairwise cosine similarity of word embeddings as an extra loss term, pushing the embeddings apart from each other (away from clustering into a narrow cone)
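A minimal sketch of such a cosine regularizer (my reading of the idea, not the authors' code; `gamma` and the function name are illustrative): sum the off-diagonal pairwise cosine similarities of the normalized embeddings and add the result, weighted by a hyperparameter, to the MLE loss. Minimizing it pushes embeddings away from pointing in the same direction.

```python
import numpy as np

def cos_reg(W, gamma=1.0):
    """Cosine regularization term: gamma times the sum of pairwise cosine
    similarities between word embeddings (rows of W). Added to the NLL loss,
    it penalizes embeddings that all point in similar directions."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    G = Wn @ Wn.T                                      # pairwise cosine matrix
    return gamma * (G.sum() - np.trace(G))             # exclude self-similarity

# embeddings squeezed into a narrow cone vs. spread around the origin
tight = np.array([[1.0, 0.01], [1.0, -0.01], [1.0, 0.02]])
spread = np.array([[1.0, 0.0], [-0.5, 0.9], [-0.5, -0.9]])
```

The narrow-cone configuration incurs a large positive penalty, while the spread-out one scores below zero, so gradient descent on NLL + cos_reg favors diversely distributed embeddings.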
Experimental Result

- on the WMT14 EnDe & DeEn tasks with Transformer Base, the method leads to an increase in BLEU score

- word embeddings are now distributed more uniformly around the origin
Personal Thoughts
- interesting phenomenon, and a simple solution
- paper is well-written
- not sure whether the impact will be visible in NMT outputs
Link : https://openreview.net/pdf?id=SkEYojRqtm
Authors : Gao et al. 2018