Abstract
- introduces a new language representation model called BERT (Bidirectional Encoder Representations from Transformers)
- obtains SoTA on the GLUE benchmark, MultiNLI, and SQuAD v1.1 (almost all natural language inference tasks)
- successful transfer learning in natural language area
Details
Pre-Training (Transfer Learning)
- BERT is, at publication, the strongest pre-training model for natural language understanding tasks
- transfer learning has proven useful in many deep learning tasks, especially image tasks, where ImageNet pre-trained models allow faster training and better performance
- there are two approaches to utilizing transfer learning
feature-based approach feeds the extracted feature representations of the pre-trained model into the downstream task
fine-tuning approach continues training the pre-trained model, with a much lower learning rate and far fewer epochs
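The contrast between the two setups can be sketched as a toy illustration (all class and function names here are hypothetical, not the paper's code):

```python
class PretrainedEncoder:
    """Toy stand-in for a pre-trained model (hypothetical, not real BERT)."""

    def __init__(self):
        self.trainable = True

    def represent(self, words):
        # dummy feature vector; a real encoder returns contextual vectors
        return [float(len(w)) for w in words]


def feature_based_setup(encoder):
    """Freeze the encoder; only a new task-specific head gets trained."""
    encoder.trainable = False
    return encoder


def fine_tuning_setup(encoder, pretrain_lr=1e-3):
    """Keep training the whole encoder, but at a much lower learning rate."""
    encoder.trainable = True
    return pretrain_lr * 0.1
```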
ELMo vs OpenAI GPT vs BERT
- Classic pre-training model architectures in natural language tasks
ELMo : use concatenation of bidirectional LSTM representation
OpenAI GPT : use left-to-right Transformer model (Transformer decoder)
BERT : use bidirectional self-attention Transformer model (Transformer encoder)

BERT
Input Representation
- first token is always [CLS] (short for classification token)
- each token embedding is summed with a segment embedding (a learned embedding marking which sentence the token belongs to)
- since BERT handles multi-sentence input, the authors introduce the segment embedding
- positional embedding is also added
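The three embeddings are summed elementwise per position. A minimal sketch with toy random lookup tables (the dimension and tables here are made up; real BERT learns them as parameters):

```python
import random

def embed_input(tokens, segment_ids, dim=8, seed=0):
    """Sum token + segment + position embeddings per input position."""
    rng = random.Random(seed)

    def table(keys):
        # toy random lookup table; real embeddings are learned parameters
        return {k: [rng.uniform(-1, 1) for _ in range(dim)] for k in keys}

    tok_emb = table(set(tokens))
    seg_emb = table([0, 1])
    pos_emb = table(range(len(tokens)))
    return [
        [tok_emb[t][d] + seg_emb[s][d] + pos_emb[i][d] for d in range(dim)]
        for i, (t, s) in enumerate(zip(tokens, segment_ids))
    ]
```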

Pre-training BERT
- not only the model architecture, but the novel pre-training scheme is a big contribution
Task 1 : Masked LM
- traditional left-to-right models are trained with a left-to-right LM objective, predicting the next token
- BERT is instead trained on a masked LM: random input tokens are masked, and the model is trained to predict them
- 15% of tokens are selected. Of those, 80% are replaced with [MASK], 10% with a random word, and 10% are left as-is
- random masking forces the model to embed whole-sentence information in each token's representation (especially in the [CLS] token)
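The 80/10/10 selection rule can be sketched roughly as below (a simplified version; the real implementation works on WordPiece tokens and batches):

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, seed=0):
    """Pick ~15% of positions as prediction targets; of those,
    80% -> [MASK], 10% -> a random word, 10% -> left unchanged."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # original token where a prediction is required
    for i, tok in enumerate(tokens):
        if rng.random() < select_prob:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                masked[i] = "[MASK]"
            elif r < 0.9:
                masked[i] = rng.choice(vocab)
            # else: token stays as-is but is still a prediction target
    return masked, labels
```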
Task 2 : Next Sentence Prediction
- to align pre-training with downstream tasks, a second pre-training task is added
- input two sentences separated by [SEP] and predict whether the second actually follows the first
- reaches 97-98% accuracy (meaning the next sentence is easy to classify)
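Packing a sentence pair for NSP can be sketched as below (a simplified illustration; real pre-training also truncates pairs to a length budget):

```python
import random

def make_nsp_example(sent_a, sent_b, is_next):
    """Build [CLS] A [SEP] B [SEP] with segment ids and an IsNext label."""
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segments, 1 if is_next else 0

def sample_nsp_pair(doc, rng):
    """50%: true next sentence; 50%: a random sentence from the document.
    (Simplified: the random pick could accidentally be the true next one.)"""
    i = rng.randrange(len(doc) - 1)
    if rng.random() < 0.5:
        return make_nsp_example(doc[i], doc[i + 1], True)
    return make_nsp_example(doc[i], doc[rng.randrange(len(doc))], False)
```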
Fine-Tuning in BERT
Dataset (tasks)
- the GLUE benchmark is designed to limit evaluation bias; the evaluated tasks can be classified into the categories below
- sentence-pair classification task : MNLI, QQP, QNLI, STS-B, MRPC, RTE, SWAG
- single sentence classification task : SST-2, CoLA
- QnA task : SQuAD v1.1
- single sentence tagging task : CoNLL-2003 NER
Fine-Tuning
- sentence-pair classification and single-sentence classification tasks feed the final output of [CLS] into an additional FFN layer and softmax (a and b in the figure below)
- the QnA task learns two additional vectors (S for the start and E for the end). Each of S and E is dot-producted with the final representation of every paragraph token to classify which token starts and which ends the answer span (c in the figure below)
- the single-sentence tagging task feeds the final output of each token independently into an additional FFN layer and softmax; since each output token already carries context information, the head need not attend to nearby tokens, enabling non-autoregressive inference on NER (d in the figure below)

Performance
- SoTA on GLUE, SQuAD v1.1, and NER
- robust to training dataset size and task definitions




Ablation Study
What contributed the most?
- people may attribute the success to BERT being bidirectional, compared to OpenAI GPT and ELMo
- to show that the novel pre-training method was also significant, an ablation study is performed
No NSP leads to a drop in performance compared to BERT base, indicating the effect of the next-sentence-prediction pre-training task
LTR & No NSP drops performance further, showing the bidirectional architecture and the novel pre-training method contribute orthogonally

Effect on Model Size
- the larger the model, the better the scores

Effect of Training Steps
- huge data and long pre-training are necessary
- although pre-training takes longer than LTR, the gain is much higher
Feature-based vs Fine-Tuning
- how well does feature-based transfer learning work with BERT?
- competitive with fine-tuning, but simply using the last layer is not enough; concatenating the last four hidden layers works best
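For the feature-based route, the per-token feature is the concatenation of the last four layers' hidden vectors; a small sketch (a list of lists stands in for real tensors):

```python
def concat_last_four(layer_outputs):
    """layer_outputs[l][t] is token t's hidden vector at layer l.
    Returns one concatenated feature vector per token."""
    last4 = layer_outputs[-4:]
    n_tokens = len(last4[0])
    return [sum((layer[t] for layer in last4), []) for t in range(n_tokens)]
```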

Personal Thoughts
- well-engineered pre-training scheme; it aligns with downstream tasks and shows better performance. Adding both masked LM and next sentence prediction is a creative idea
- using [CLS] as the sentence representation was also impressive. The connection between BERT and downstream tasks (Figure 3) is well-designed
- this paper's contribution lies in designing novel pre-training methods and naturally linking the model to downstream tasks
- wonder how it could improve Machine Translation?
- for example in En-De: pre-train an English BERT and a German OpenAI GPT, then fine-tune using a parallel corpus?
Link : https://arxiv.org/pdf/1810.04805.pdf
Authors : Devlin et al 2018