Abstract
representation degeneration == in NLG (Natural Language Generation) tasks trained with MLE (Maximum Likelihood Estimation), weight tying (sharing the word embedding and pre-softmax matrices), and a large training dataset, most of the learnt word embeddings concentrate into a narrow cone, which limits their representation power
- propose a novel regularization method to address this problem
- WMT14 EnDe: +1.08 BLEU with Transformer Base, +0.54 BLEU with Transformer Big
Details
Representation Degeneration

- word2vec embeddings (b) and softmax parameters learnt from a classification task on MNIST (c) are diversely distributed around the origin under a 2D SVD projection
- whereas the Transformer's word embeddings are concentrated in a narrow cone
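The 2D visualization above can be reproduced with a minimal numpy sketch (not code from the paper; the function name and toy data are illustrative): center the embedding matrix and project it onto its top-2 singular directions.

```python
import numpy as np

def project_2d(E):
    """Project word embeddings E (V x d) onto their top-2 singular
    directions, as done for the paper's scatter plots."""
    E_centered = E - E.mean(axis=0)                  # center around the origin
    U, S, Vt = np.linalg.svd(E_centered, full_matrices=False)
    return E_centered @ Vt[:2].T                     # (V, 2) coordinates

# toy check with random embeddings: 100 words, 16 dims -> 100 2D points
E = np.random.default_rng(0).normal(size=(100, 16))
coords = project_2d(E)
```

Plotting `coords` for a trained Transformer's embeddings would show the narrow-cone shape; for word2vec it shows points spread around the origin.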
Understanding the Problem
- since the word embedding is tied to the softmax layer:
- word representations should be widely distributed to represent different semantic meanings
- a more diversely distributed softmax is expected to yield a larger-margin result
- however, in practice the learnt word embeddings cluster into a narrow cone and the model suffers from limited expressiveness
- Cause (the paper's intuition):
Intuitively speaking, during the training process of a model with a likelihood loss, for any given hidden state, the embedding of the corresponding ground-truth word is pushed towards the direction of the hidden state to obtain a larger likelihood, while the embeddings of all other words are pushed towards the negative direction of the hidden state to obtain a smaller likelihood. Since in natural language most words occur with very low frequency, a word's embedding is pushed towards the negative directions of most hidden states, which vary drastically. As a result, the embeddings of most words in the vocabulary are pushed towards similar directions negatively correlated with most hidden states and thus cluster together in a local region of the embedding space.
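This intuition can be checked numerically: with tied weights, the gradient of the negative log-likelihood w.r.t. a word embedding w_i is (p_i − 1{i == target}) · h, so a gradient-descent step moves the target embedding towards the hidden state h and every other embedding away from it. A toy sketch (sizes and names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8                        # toy vocabulary size, embedding dim
W = rng.normal(size=(V, d))        # tied embedding / pre-softmax matrix
h = rng.normal(size=d)             # one hidden state
target = 2                         # index of the ground-truth word

logits = W @ h
p = np.exp(logits - logits.max())  # softmax (shifted for stability)
p /= p.sum()

# d(-log p_target)/d w_i = (p_i - 1{i == target}) * h
grads = (p - np.eye(V)[target])[:, None] * h[None, :]

# alignment of the descent direction (-grad) with h, per word:
# > 0 means the embedding is pushed towards h, < 0 means pushed away
alignment = -(grads @ h)
```

Here `alignment[target]` comes out positive and every other entry negative, matching the pushed-towards / pushed-away picture above; since most words are non-targets for most hidden states, their embeddings accumulate updates in similar negative directions.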
MLE with Cosine Regularization

- propose MLE-CosReg, which adds the pairwise cosine similarity of word embeddings as an extra loss term, pushing the embeddings apart from each other (away from clustering into a narrow cone)
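A minimal sketch of such a cosine regularizer (my reading of the idea, not the authors' code; `gamma` and the function name are illustrative): sum the off-diagonal pairwise cosine similarities of the normalized embeddings and add the result, weighted by a hyperparameter, to the MLE loss. Minimizing it pushes embeddings away from pointing in the same direction.

```python
import numpy as np

def cos_reg(W, gamma=1.0):
    """Cosine regularization term: gamma times the sum of pairwise cosine
    similarities between word embeddings (rows of W). Added to the NLL loss,
    it penalizes embeddings that all point in similar directions."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    G = Wn @ Wn.T                                      # pairwise cosine matrix
    return gamma * (G.sum() - np.trace(G))             # exclude self-similarity

# embeddings squeezed into a narrow cone vs. spread around the origin
tight = np.array([[1.0, 0.01], [1.0, -0.01], [1.0, 0.02]])
spread = np.array([[1.0, 0.0], [-0.5, 0.9], [-0.5, -0.9]])
```

The narrow-cone configuration incurs a large positive penalty, while the spread-out one scores below zero, so gradient descent on NLL + cos_reg favors diversely distributed embeddings.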
Experimental Result

- on the WMT14 EnDe & DeEn tasks with Transformer Base, the method leads to an increase in BLEU score

- word embeddings are now distributed more uniformly around the origin
Personal Thoughts
- interesting phenomenon, and a simple solution
- paper is well-written
- not sure whether the impact will be visible in NMT outputs
Link : https://openreview.net/pdf?id=SkEYojRqtm
Authors : Gao et al. 2018