diff --git a/BERT-pytorch b/BERT-pytorch
new file mode 160000
index 0000000..919adf1
--- /dev/null
+++ b/BERT-pytorch
@@ -0,0 +1 @@
+Subproject commit 919adf1ff7d050bb5ab2955caee00b7f994e7e94
diff --git a/README.md b/README.md
index ced3be7..5772319 100644
--- a/README.md
+++ b/README.md
@@ -1,120 +1,20 @@
-# BERT-pytorch
+# BERT-pytorch Study Notes
+At 2 a.m. in mid-February 2023 I am wrapping up my study of the BERT-pytorch project. This is the first open-source project I have worked through in a relatively careful, systematic way since registering a GitHub account; I started just before the winter break and have kept at it until now. Before moving on, a few parting words as a keepsake!
+## 1. Gains
+- Reading the code alongside the BERT paper, I now have a solid picture of what BERT really is: the dataset module (vocabulary construction, random token replacement, and random sentence-pair sampling), the modeling module (an encoder built from Transformer encoder blocks), and the trainer module (loss computation and gradient descent);
+- While studying the code I picked up basic git operations, GitHub working habits (all of my annotations were merged into the master branch), and common PyTorch API usage;
+- An open-source project is best studied together with its paper, so theory and practice reinforce each other; better still, feed it data and actually run it
-[![LICENSE](https://img.shields.io/github/license/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/blob/master/LICENSE)
-![GitHub issues](https://img.shields.io/github/issues/codertimo/BERT-pytorch.svg)
-[![GitHub stars](https://img.shields.io/github/stars/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/stargazers)
-[![CircleCI](https://circleci.com/gh/codertimo/BERT-pytorch.svg?style=shield)](https://circleci.com/gh/codertimo/BERT-pytorch)
-[![PyPI](https://img.shields.io/pypi/v/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/)
-[![PyPI - Status](https://img.shields.io/pypi/status/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/)
-[![Documentation Status](https://readthedocs.org/projects/bert-pytorch/badge/?version=latest)](https://bert-pytorch.readthedocs.io/en/latest/?badge=latest)
+## 2. Lessons
+- I read the code line by line and set up the bert-pytorch environment, but never fed it data to run and inspect the results, so I gained no parameter-tuning experience
+- I set no milestone schedule for this study, so it dragged on
+- For future open-source projects, always bring data and get the code running
+- I meant to write a proper README, but ran out of steam at the end.
-Pytorch implementation of Google AI's 2018 BERT, with simple annotation
+---
+# Notes on understanding BERT
-> BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
-> Paper URL : https://arxiv.org/abs/1810.04805
+## 2023-02-13: Today I discussed BERT's embedding part with Minghao, a colleague at Dahua: how is the step from a token to its initialized embedding vector implemented? He believed the initialized embedding vectors take part in training, but looking at the project again tonight, the embedding module here only performs the random initialization of each token's vector; the result then enters the attention module, where it is first linearly projected into query, key and value before the attention computation begins. On that reading, the embedding module is still just part of data preprocessing and does not take part in training;
+A second question is why the embedding can be randomly initialized at all. I think it is mainly because a token's index is itself arbitrary (first come, first served); neither the token index nor the initial embedding vector carries any semantic information, as long as the key-value mapping between them stays fixed. That also makes the first point self-consistent: if the initialized embedding were a model parameter updated during training, the determinism of that key-value mapping would be broken;
-## Introduction
-
-Google AI's BERT paper shows the amazing result on various NLP task (new 17 NLP tasks SOTA),
-including outperform the human F1 score on SQuAD v1.1 QA task.
-This paper proved that Transformer(self-attention) based encoder can be powerfully used as
-alternative of previous language model with proper language model training method.
-And more importantly, they showed us that this pre-trained language model can be transfer
-into any NLP task without making task specific model architecture.
-
-This amazing result would be record in NLP history,
-and I expect many further papers about BERT will be published very soon.
-
-This repo is implementation of BERT. Code is very simple and easy to understand fastly.
-Some of these codes are based on [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
-
-Currently this project is working on progress. And the code is not verified yet.
-
-## Installation
-```
-pip install bert-pytorch
-```
-
-## Quickstart
-
-**NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator**
-
-### 0. 
Prepare your corpus -``` -Welcome to the \t the jungle\n -I can stay \t here all night\n -``` - -or tokenized corpus (tokenization is not in package) -``` -Wel_ _come _to _the \t _the _jungle\n -_I _can _stay \t _here _all _night\n -``` - - -### 1. Building vocab based on your corpus -```shell -bert-vocab -c data/corpus.small -o data/vocab.small -``` - -### 2. Train your own BERT model -```shell -bert -c data/corpus.small -v data/vocab.small -o output/bert.model -``` - -## Language Model Pre-training - -In the paper, authors shows the new language model training methods, -which are "masked language model" and "predict next sentence". - - -### Masked Language Model - -> Original Paper : 3.3.1 Task #1: Masked LM - -``` -Input Sequence : The man went to [MASK] store with [MASK] dog -Target Sequence : the his -``` - -#### Rules: -Randomly 15% of input token will be changed into something, based on under sub-rules - -1. Randomly 80% of tokens, gonna be a `[MASK]` token -2. Randomly 10% of tokens, gonna be a `[RANDOM]` token(another word) -3. Randomly 10% of tokens, will be remain as same. But need to be predicted. - -### Predict Next Sentence - -> Original Paper : 3.3.2 Task #2: Next Sentence Prediction - -``` -Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP] -Label : Is Next - -Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP] -Label = NotNext -``` - -"Is this sentence can be continuously connected?" - - understanding the relationship, between two text sentences, which is -not directly captured by language modeling - -#### Rules: - -1. Randomly 50% of next sentence, gonna be continuous sentence. -2. Randomly 50% of next sentence, gonna be unrelated sentence. - - -## Author -Junseong Kim, Scatter Lab (codertimo@gmail.com / junseong.kim@scatterlab.co.kr) - -## License - -This project following Apache 2.0 License as written in LICENSE file - -Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors - -Copyright (c) 2018 Alexander Rush : [The Annotated Trasnformer](https://github.com/harvardnlp/annotated-transformer) diff --git a/README_back.md b/README_back.md new file mode 100644 index 0000000..ced3be7 --- /dev/null +++ b/README_back.md @@ -0,0 +1,120 @@ +# BERT-pytorch + +[![LICENSE](https://img.shields.io/github/license/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/blob/master/LICENSE) +![GitHub issues](https://img.shields.io/github/issues/codertimo/BERT-pytorch.svg) +[![GitHub stars](https://img.shields.io/github/stars/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/stargazers) +[![CircleCI](https://circleci.com/gh/codertimo/BERT-pytorch.svg?style=shield)](https://circleci.com/gh/codertimo/BERT-pytorch) +[![PyPI](https://img.shields.io/pypi/v/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/) +[![PyPI - Status](https://img.shields.io/pypi/status/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/) +[![Documentation Status](https://readthedocs.org/projects/bert-pytorch/badge/?version=latest)](https://bert-pytorch.readthedocs.io/en/latest/?badge=latest) + +Pytorch implementation of Google AI's 2018 BERT, with simple annotation + +> BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding +> Paper URL : https://arxiv.org/abs/1810.04805 + + +## Introduction + +Google AI's BERT paper shows the amazing result on various NLP task (new 17 NLP tasks SOTA), +including outperform the human F1 score on SQuAD v1.1 QA task. 
+This paper proved that Transformer(self-attention) based encoder can be powerfully used as +alternative of previous language model with proper language model training method. +And more importantly, they showed us that this pre-trained language model can be transfer +into any NLP task without making task specific model architecture. + +This amazing result would be record in NLP history, +and I expect many further papers about BERT will be published very soon. + +This repo is implementation of BERT. Code is very simple and easy to understand fastly. +Some of these codes are based on [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html) + +Currently this project is working on progress. And the code is not verified yet. + +## Installation +``` +pip install bert-pytorch +``` + +## Quickstart + +**NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator** + +### 0. Prepare your corpus +``` +Welcome to the \t the jungle\n +I can stay \t here all night\n +``` + +or tokenized corpus (tokenization is not in package) +``` +Wel_ _come _to _the \t _the _jungle\n +_I _can _stay \t _here _all _night\n +``` + + +### 1. Building vocab based on your corpus +```shell +bert-vocab -c data/corpus.small -o data/vocab.small +``` + +### 2. Train your own BERT model +```shell +bert -c data/corpus.small -v data/vocab.small -o output/bert.model +``` + +## Language Model Pre-training + +In the paper, authors shows the new language model training methods, +which are "masked language model" and "predict next sentence". + + +### Masked Language Model + +> Original Paper : 3.3.1 Task #1: Masked LM + +``` +Input Sequence : The man went to [MASK] store with [MASK] dog +Target Sequence : the his +``` + +#### Rules: +Randomly 15% of input token will be changed into something, based on under sub-rules + +1. Randomly 80% of tokens, gonna be a `[MASK]` token +2. Randomly 10% of tokens, gonna be a `[RANDOM]` token(another word) +3. Randomly 10% of tokens, will be remain as same. But need to be predicted. + +### Predict Next Sentence + +> Original Paper : 3.3.2 Task #2: Next Sentence Prediction + +``` +Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP] +Label : Is Next + +Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP] +Label = NotNext +``` + +"Is this sentence can be continuously connected?" + + understanding the relationship, between two text sentences, which is +not directly captured by language modeling + +#### Rules: + +1. Randomly 50% of next sentence, gonna be continuous sentence. +2. Randomly 50% of next sentence, gonna be unrelated sentence. 
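As a concrete illustration of the 50/50 rule just listed (which `BERTDataset.random_sent` implements later in this diff), here is a minimal standalone sketch. The `sample_sentence_pair` helper and the `corpus_pairs` list are invented for the example; only the 1 = IsNext / 0 = NotNext convention comes from the code.

```python
import random

def sample_sentence_pair(corpus_pairs):
    """corpus_pairs: list of (sentence_a, sentence_b) tuples, one per corpus line."""
    a, b = random.choice(corpus_pairs)
    if random.random() > 0.5:
        return a, b, 1                       # keep the true next sentence -> IsNext
    # otherwise replace B with the second half of some other random line -> NotNext
    random_b = random.choice(corpus_pairs)[1]
    return a, random_b, 0

pairs = [("Welcome to the", "the jungle"), ("I can stay", "here all night")]
print(sample_sentence_pair(pairs))
```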
+
+
+## Author
+Junseong Kim, Scatter Lab (codertimo@gmail.com / junseong.kim@scatterlab.co.kr)
+
+## License
+
+This project following Apache 2.0 License as written in LICENSE file
+
+Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors
+
+Copyright (c) 2018 Alexander Rush : [The Annotated Trasnformer](https://github.com/harvardnlp/annotated-transformer)
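The `dataset.py` changes that follow annotate `random_word`, which applies the 15% / 80-10-10 masking rule described in the README above. As a self-contained sketch of that rule (the function name, the `[MASK]` string and the toy vocabulary are illustrative, not the package's actual API):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]"):
    """Illustrative 15% / 80-10-10 masking, following the rules listed in the README."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < 0.15:          # 15% of positions are selected
            labels.append(tok)              # the original token becomes the prediction target
            r = random.random()
            if r < 0.8:                     # 80% -> [MASK]
                masked.append(mask_token)
            elif r < 0.9:                   # 10% -> a random token from the vocabulary
                masked.append(random.choice(vocab))
            else:                           # 10% -> keep the token, but still predict it
                masked.append(tok)
        else:
            masked.append(tok)
            labels.append(None)             # not selected: nothing to predict here
    return masked, labels

print(mask_tokens("the man went to the store with his dog".split(),
                  vocab=["the", "man", "store", "dog", "penguin"]))
```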
diff --git a/bert_pytorch/dataset/dataset.py b/bert_pytorch/dataset/dataset.py
index 7d787f3..7a28577 100644
--- a/bert_pytorch/dataset/dataset.py
+++ b/bert_pytorch/dataset/dataset.py
@@ -15,19 +15,21 @@ def __init__(self, corpus_path, vocab, seq_len, encoding="utf-8", corpus_lines=None,
         self.encoding = encoding

         with open(corpus_path, "r", encoding=encoding) as f:
-            if self.corpus_lines is None and not on_memory:
+            # after opening the corpus, handle the two cases below:
+            if self.corpus_lines is None and not on_memory:  # if the corpus is not loaded into memory, first count its lines
                 for _ in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines):
                     self.corpus_lines += 1

             if on_memory:
-                self.lines = [line[:-1].split("\t")
-                              for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]
-                self.corpus_lines = len(self.lines)
+                # load the whole corpus into memory, parsed into the list attribute self.lines
+                self.lines = [line[:-1].split('\t')
+                              for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]  # split each corpus line into two sentences at the \t character
+                self.corpus_lines = len(self.lines)  # number of lines in the corpus

-        if not on_memory:
+        if not on_memory:
             self.file = open(corpus_path, "r", encoding=encoding)
             self.random_file = open(corpus_path, "r", encoding=encoding)
-
+            # advance random_file by a random offset so negative samples come from elsewhere in the corpus; what exactly is this for?
             for _ in range(random.randint(self.corpus_lines if self.corpus_lines < 1000 else 1000)):
                 self.random_file.__next__()
@@ -35,12 +37,14 @@ def __len__(self):
         return self.corpus_lines

     def __getitem__(self, item):
-        t1, t2, is_next_label = self.random_sent(item)
-        t1_random, t1_label = self.random_word(t1)
-        t2_random, t2_label = self.random_word(t2)
+        # the magic method __getitem__ lets an instance be indexed by item, like a list
+        # every sample drawn from a BERTDataset instance goes through the Next Sentence step and the Masked LM step
+        t1, t2, is_next_label = self.random_sent(item)  # Next Sentence step
+        t1_random, t1_label = self.random_word(t1)  # Masked LM step; t1_label holds, per position, the vocab index of the masked token (see the Vocab class __init__ in vocab.py)
+        t2_random, t2_label = self.random_word(t2)

         # [CLS] tag = SOS tag, [SEP] tag = EOS tag
-        t1 = [self.vocab.sos_index] + t1_random + [self.vocab.eos_index]
+        t1 = [self.vocab.sos_index] + t1_random + [self.vocab.eos_index]  # Figure 2 of the paper
         t2 = t2_random + [self.vocab.eos_index]

         t1_label = [self.vocab.pad_index] + t1_label + [self.vocab.pad_index]
@@ -50,7 +54,7 @@ def __getitem__(self, item):
         bert_input = (t1 + t2)[:self.seq_len]
         bert_label = (t1_label + t2_label)[:self.seq_len]

-        padding = [self.vocab.pad_index for _ in range(self.seq_len - len(bert_input))]
+        padding = [self.vocab.pad_index for _ in range(self.seq_len - len(bert_input))]  # the gap between the maximum and actual length is the number of positions to pad
         bert_input.extend(padding), bert_label.extend(padding), segment_label.extend(padding)

         output = {"bert_input": bert_input,
@@ -61,12 +65,15 @@ def __getitem__(self, item):
         return {key: torch.tensor(value) for key, value in output.items()}

     def random_word(self, sentence):
+        # convert the sentence's tokens into their indices in the token-to-index vocabulary
         tokens = sentence.split()
-        output_label = []
+        output_label = []  # holds 0 for the 85% of positions left untouched, and otherwise the token's original vocab index before masking

         for i, token in enumerate(tokens):
             prob = random.random()
+            # BERT randomly selects 15% of the tokens for masking
             if prob < 0.15:
+                # for the selected 15% of tokens, roll the dice again
                 prob /= 0.15

                 # 80% randomly change token to mask token
@@ -77,26 +84,27 @@ def random_word(self, sentence):
                 elif prob < 0.9:
                     tokens[i] = random.randrange(len(self.vocab))

-                # 10% randomly change token to current token
+                # 10% keep the current token unchanged
                 else:
                     tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                 output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

             else:
-                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
+                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)  # tokens that are not masked are replaced by their real vocab index
+                # specifically, self.vocab.unk_index == 1 is the fallback when the token is missing from the stoi token-to-index dict
                 output_label.append(0)

         return tokens, output_label

     def random_sent(self, index):
-        t1, t2 = self.get_corpus_line(index)
-
-        # output_text, label(isNotNext:0, isNext:1)
+        t1, t2 = self.get_corpus_line(index)
+        # for sentences A and B, 50% of the time B is the actual next sentence that follows A (labeled as IsNext),
+        # and 50% of the time it is a random sentence from the corpus (labeled as NotNext)
         if random.random() > 0.5:
-            return t1, t2, 1
+            return t1, t2, 1  # 1 means IsNext
         else:
-            return t1, self.get_random_line(), 0
+            return t1, self.get_random_line(), 0  # 0 means NotNext

     def get_corpus_line(self, item):
         if self.on_memory:
@@ -122,4 +130,4 @@ def get_random_line(self):
         for _ in range(random.randint(self.corpus_lines if self.corpus_lines < 1000 else 1000)):
             self.random_file.__next__()
         line = self.random_file.__next__()
-        return line[:-1].split("\t")[1]
+        return line[:-1].split("\t")[1]
\ No newline at end of file
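To summarize what the annotated `__getitem__` above assembles, here is a compact sketch. The special-token indices `PAD=0`, `EOS=2`, `SOS=3` are assumed example values (the real ones live on the `Vocab` instance), and the 1/2 segment ids (with 0 for padding) follow the convention noted later in the SegmentEmbedding comment.

```python
PAD, EOS, SOS = 0, 2, 3   # assumed example indices, standing in for vocab.pad_index etc.

def build_example(t1_ids, t1_lbl, t2_ids, t2_lbl, seq_len=20):
    """Mirror of BERTDataset.__getitem__: add [CLS]/[SEP], build segment ids, truncate, pad."""
    t1 = [SOS] + t1_ids + [EOS]              # [CLS] ... [SEP]
    t2 = t2_ids + [EOS]                      # ... [SEP]
    t1_lbl = [PAD] + t1_lbl + [PAD]
    t2_lbl = t2_lbl + [PAD]

    segment = ([1] * len(t1) + [2] * len(t2))[:seq_len]
    bert_input = (t1 + t2)[:seq_len]
    bert_label = (t1_lbl + t2_lbl)[:seq_len]

    pad = [PAD] * (seq_len - len(bert_input))   # fill the remaining positions
    return bert_input + pad, bert_label + pad, segment + pad

print(build_example([7, 14, 5], [0, 0, 14], [5, 11], [0, 0]))
```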
diff --git a/bert_pytorch/dataset/vocab.py b/bert_pytorch/dataset/vocab.py
index f7346a7..08dbf60 100644
--- a/bert_pytorch/dataset/vocab.py
+++ b/bert_pytorch/dataset/vocab.py
@@ -33,6 +33,9 @@ def __init__(self, counter, max_size=None, min_freq=1, specials=['', '
             to zero vectors; can be any function that takes in a Tensor and
             returns a Tensor of the same size. Default: torch.Tensor.zero_
             vectors_cache: directory for cached vectors. Default: '.vector_cache'
+        Attributes:
+            self.itos is the list of all tokens in the vocabulary;
+            self.stoi is the dict mapping each token to its index in self.itos
         """
         self.freqs = counter
         counter = counter.copy()
@@ -124,13 +127,18 @@ def __init__(self, texts, max_size=None, min_freq=1):
             if isinstance(line, list):
                 words = line
             else:
+                # the original replace() call cannot strip "\t" and "\n" here, hence the change below (please ignore this comment)
                 words = line.replace("\n", "").replace("\t", "").split()
+                #words = line.replace('\\t', '').replace('\\n', '').split()

             for word in words:
                 counter[word] += 1
         super().__init__(counter, max_size=max_size, min_freq=min_freq)

     def to_seq(self, sentence, seq_len=None, with_eos=False, with_sos=False, with_len=False):
+        """Convert a sentence into the list of vocab indices of its tokens (via self.stoi), e.g.
+        sentence = 'Welcome to the the jungle' gives to_seq(sentence) == [7, 14, 5, 5, 11]
+        """
         if isinstance(sentence, str):
             sentence = sentence.split()

@@ -153,6 +161,9 @@ def to_seq(self, sentence, seq_len=None, with_eos=False, with_sos=False, with_len=False):
         return (seq, origin_seq_len) if with_len else seq

     def from_seq(self, seq, join=False, with_pad=False):
+        """Convert the index list returned by to_seq() back into the corresponding tokens, e.g.
+        seq = [7, 14, 5, 5, 11] gives from_seq(seq) == ['Welcome', 'to', 'the', 'the', 'jungle']
+        """
         words = [self.itos[idx]
                  if idx < len(self.itos)
                  else "<%d>" % idx
diff --git a/bert_pytorch/model/attention/single.py b/bert_pytorch/model/attention/single.py
index 701d2c2..394d35c 100644
--- a/bert_pytorch/model/attention/single.py
+++ b/bert_pytorch/model/attention/single.py
@@ -15,7 +15,9 @@ def forward(self, query, key, value, mask=None, dropout=None):
                  / math.sqrt(query.size(-1))

         if mask is not None:
-            scores = scores.masked_fill(mask == 0, -1e9)
+            # role of the mask in the Transformer: in the encoder it removes the influence of padding positions in the sequence;
+            # in the decoder it additionally hides "not yet visible" positions. Here it is clearly the former.
+            scores = scores.masked_fill(mask == 0, -1e9)  # note that mask and scores must be broadcastable

         p_attn = F.softmax(scores, dim=-1)
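A runnable miniature of the masking trick annotated in `single.py` above, showing that a broadcastable 0/1 mask plus `masked_fill(..., -1e9)` drives the attention weight of padding positions to (almost) zero; the shapes are arbitrary example values.

```python
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    """Scaled dot-product attention with the same masking trick as single.py."""
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
    if mask is not None:
        # positions where mask == 0 (padding) get a very negative score,
        # so softmax sends their attention weight to ~0
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn

q = k = v = torch.randn(1, 4, 8)                     # (batch, seq_len, d_k)
mask = torch.tensor([[1, 1, 1, 0]])                  # last position is padding
out, attn = attention(q, k, v, mask.unsqueeze(1))    # unsqueeze so the mask broadcasts over queries
print(attn[0].sum(dim=-1))                           # each row still sums to 1; the padding column is ~0
```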
diff --git a/bert_pytorch/model/embedding/position.py b/bert_pytorch/model/embedding/position.py
index d55c224..9f3420e 100644
--- a/bert_pytorch/model/embedding/position.py
+++ b/bert_pytorch/model/embedding/position.py
@@ -12,11 +12,17 @@ def __init__(self, d_model, max_len=512):
         pe = torch.zeros(max_len, d_model).float()
         pe.require_grad = False

-        position = torch.arange(0, max_len).float().unsqueeze(1)
-        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
+        position = torch.arange(0, max_len).float().unsqueeze(1)  # numerator of the encoding formula in the paper
+        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()  # denominator of the formula, computed as log first and then exp; presumably to reduce computation?

         pe[:, 0::2] = torch.sin(position * div_term)
-        pe[:, 1::2] = torch.cos(position * div_term)
+
+        # pe[:, 1::2].size(-1) is less than div_term.size(-1) when d_model is an odd number
+        if pe[:, 1::2].size(-1) >= div_term.size(-1):
+            pe[:, 1::2] = torch.cos(position * div_term)
+        else:
+            cos_len = pe[:, 1::2].size(-1)
+            pe[:, 1::2] = torch.cos(position * div_term[:cos_len])

         pe = pe.unsqueeze(0)
         self.register_buffer('pe', pe)
diff --git a/bert_pytorch/model/embedding/segment.py b/bert_pytorch/model/embedding/segment.py
index cdf84d5..6b89e6c 100644
--- a/bert_pytorch/model/embedding/segment.py
+++ b/bert_pytorch/model/embedding/segment.py
@@ -3,4 +3,9 @@

 class SegmentEmbedding(nn.Embedding):
     def __init__(self, embed_size=512):
+        """
+        Unlike TokenEmbedding, the first argument to __init__() below is 3 rather than vocab_size,
+        because the input to a SegmentEmbedding instance is segment_info, a tensor whose elements are only 0, 1 or 2,
+        so SegmentEmbedding only needs to initialize 3 vectors.
+        """
         super().__init__(3, embed_size, padding_idx=0)
diff --git a/bert_pytorch/model/embedding/token.py b/bert_pytorch/model/embedding/token.py
index 79b5187..d7a2a6c 100644
--- a/bert_pytorch/model/embedding/token.py
+++ b/bert_pytorch/model/embedding/token.py
@@ -2,5 +2,9 @@

 class TokenEmbedding(nn.Embedding):
+    """nn.Embedding is the parent class of TokenEmbedding.
+    nn.Embedding(vocab_size, embed_size) holds vocab_size vectors of dimension embed_size;
+    its forward(self, input: Tensor) -> Tensor uses the elements of input as indices and returns the corresponding vectors.
+    """
     def __init__(self, vocab_size, embed_size=512):
         super().__init__(vocab_size, embed_size, padding_idx=0)
diff --git a/bert_pytorch/model/language_model.py b/bert_pytorch/model/language_model.py
index 608f42a..7f66d2f 100644
--- a/bert_pytorch/model/language_model.py
+++ b/bert_pytorch/model/language_model.py
@@ -39,7 +39,9 @@ def __init__(self, hidden):
         self.softmax = nn.LogSoftmax(dim=-1)

     def forward(self, x):
-        return self.softmax(self.linear(x[:, 0]))
+        # the final hidden state of [CLS] is used as the sequence representation for classification tasks,
+        # where x[:, 0] is the [CLS] position
+        return self.softmax(self.linear(x[:, 0]))


 class MaskedLanguageModel(nn.Module):
@@ -54,8 +56,8 @@ def __init__(self, hidden, vocab_size):
         :param vocab_size: total vocab size
         """
         super().__init__()
-        self.linear = nn.Linear(hidden, vocab_size)
+        self.linear = nn.Linear(hidden, vocab_size)  # nn.Linear(hidden, vocab_size) maps the hidden-dimensional input to vocab_size dimensions
         self.softmax = nn.LogSoftmax(dim=-1)

     def forward(self, x):
-        return self.softmax(self.linear(x))
+        return self.softmax(self.linear(x))  # the last dimension of x is hidden; after self.linear() it becomes vocab_size
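To make the output shapes of the two heads in `language_model.py` explicit, a small standalone sketch (the hidden size, vocabulary size and sequence length are arbitrary example values):

```python
import torch
import torch.nn as nn

hidden, vocab_size, batch, seq_len = 256, 30000, 2, 20   # example sizes
x = torch.randn(batch, seq_len, hidden)                  # stand-in for the encoder output

# Next-sentence head: classify from the [CLS] position only
nsp = nn.Sequential(nn.Linear(hidden, 2), nn.LogSoftmax(dim=-1))
print(nsp(x[:, 0]).shape)          # torch.Size([2, 2])        -> log-probs for IsNext / NotNext

# Masked-LM head: predict a vocabulary distribution at every position
mlm = nn.Sequential(nn.Linear(hidden, vocab_size), nn.LogSoftmax(dim=-1))
print(mlm(x).shape)                # torch.Size([2, 20, 30000])
```

Because `NLLLoss` expects the class dimension second, `pretrain.py` below transposes the masked-LM output to `(batch, vocab_size, seq_len)` before computing `mask_loss`.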
diff --git a/bert_pytorch/trainer/pretrain.py b/bert_pytorch/trainer/pretrain.py
index 0b882dd..5072099 100644
--- a/bert_pytorch/trainer/pretrain.py
+++ b/bert_pytorch/trainer/pretrain.py
@@ -43,7 +43,8 @@ def __init__(self, bert: BERT, vocab_size: int,
         # This BERT model will be saved every epoch
         self.bert = bert
         # Initialize the BERT Language Model, with BERT model
-        self.model = BERTLM(bert, vocab_size).to(self.device)
+        self.model = BERTLM(bert, vocab_size).to(self.device)
+        # on top of the encoding produced by the BERT class, BERTLM returns the Masked LM and Next Sentence Prediction outputs

         # Distributed GPU training if CUDA can detect more than 1 GPU
         if with_cuda and torch.cuda.device_count() > 1:
@@ -100,12 +101,14 @@ def iteration(self, epoch, data_loader, train=True):
             # 1. forward the next_sentence_prediction and masked_lm model
             next_sent_output, mask_lm_output = self.model.forward(data["bert_input"], data["segment_label"])
-
+            # the token masks applied to data["bert_input"] and the is-next label of each sentence pair are both derived from the corpus itself, so this is unsupervised (self-supervised) learning
+            # data["bert_input"] and data["segment_label"] are the inputs for the masked LM and NSP tasks;
+            # data["bert_label"] and data["is_next"] are the corresponding labels
             # 2-1. NLL(negative log likelihood) loss of is_next classification result
             next_loss = self.criterion(next_sent_output, data["is_next"])

             # 2-2. NLLLoss of predicting masked token word
-            mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])
+            mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])  # both BERT pre-training objectives are essentially classification problems, hence NLLLoss()

             # 2-3. Adding next_loss and mask_loss : 3.4 Pre-training Procedure
             loss = next_loss + mask_loss
@@ -118,9 +121,9 @@ def iteration(self, epoch, data_loader, train=True):
             # next sentence prediction accuracy
             correct = next_sent_output.argmax(dim=-1).eq(data["is_next"]).sum().item()
-            avg_loss += loss.item()
-            total_correct += correct
-            total_element += data["is_next"].nelement()
+            avg_loss += loss.item()  # accumulate the loss of each iteration, used to compute the average loss
+            total_correct += correct  # accumulate the number of correct next-sentence predictions
+            total_element += data["is_next"].nelement()  # number of tensor elements, i.e. the product of its dimensions

             post_fix = {
                 "epoch": epoch,
diff --git a/data/corpus.txt b/data/corpus.txt
new file mode 100644
index 0000000..ed144f5
--- /dev/null
+++ b/data/corpus.txt
@@ -0,0 +1,2 @@
+Welcome to the the jungle
+I can stay here all night
diff --git a/data/vocab.small b/data/vocab.small
new file mode 100644
index 0000000..a4092d6
Binary files /dev/null and b/data/vocab.small differ
diff --git a/img/1.PNG b/img/1.PNG
new file mode 100644
index 0000000..5bc31ff
Binary files /dev/null and b/img/1.PNG differ
diff --git a/img/2.PNG b/img/2.PNG
new file mode 100644
index 0000000..cbd475b
Binary files /dev/null and b/img/2.PNG differ
diff --git a/test_bert.py b/test_bert.py
new file mode 100644
index 0000000..3496ae0
--- /dev/null
+++ b/test_bert.py
@@ -0,0 +1,123 @@
+import argparse
+
+from torch.utils.data import DataLoader
+
+from bert_pytorch.model import BERT
+from bert_pytorch.trainer import BERTTrainer
+from bert_pytorch.dataset import BERTDataset, WordVocab
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument("-c", "--train_dataset", required=True,
+                        type=str, help="train dataset for train bert")
+    parser.add_argument("-t", "--test_dataset", type=str,
+                        default=None, help="test set for evaluate train set")
+    parser.add_argument("-v", "--vocab_path", required=True,
+                        type=str, help="built vocab model path with bert-vocab")
+    parser.add_argument("-o", "--output_path", required=True,
+                        type=str, help="ex)output/bert.model")
+
+    parser.add_argument("-hs", "--hidden", type=int,
+                        default=256, help="hidden size of transformer model")
+    parser.add_argument("-l", "--layers", type=int,
+                        default=8, help="number of layers")
+    parser.add_argument("-a", "--attn_heads", type=int,
+                        default=8, help="number of attention heads")
+    parser.add_argument("-s", "--seq_len", type=int,
default=20, help="maximum sequence len") + + parser.add_argument("-b", "--batch_size", type=int, + default=64, help="number of batch_size") + parser.add_argument("-e", "--epochs", type=int, + default=10, help="number of epochs") + parser.add_argument("-w", "--num_workers", type=int, + default=5, help="dataloader worker size") + + parser.add_argument("--with_cuda", type=bool, default=True, + help="training with CUDA: true, or false") + parser.add_argument("--log_freq", type=int, default=10, + help="printing loss every n iter: setting n") + parser.add_argument("--corpus_lines", type=int, + default=None, help="total number of lines in corpus") + parser.add_argument("--cuda_devices", type=int, nargs='+', + default=None, help="CUDA device ids") + parser.add_argument("--on_memory", type=bool, default=True, + help="Loading on memory: true or false") + + parser.add_argument("--lr", type=float, default=1e-3, + help="learning rate of adam") + parser.add_argument("--adam_weight_decay", type=float, + default=0.01, help="weight_decay of adam") + parser.add_argument("--adam_beta1", type=float, + default=0.9, help="adam first beta value") + parser.add_argument("--adam_beta2", type=float, + default=0.999, help="adam first beta value") + + args = parser.parse_args() + + print("Loading Vocab", args.vocab_path) + vocab = WordVocab.load_vocab(args.vocab_path) + print("Vocab Size: ", len(vocab)) + + print("Loading Train Dataset", args.train_dataset) + train_dataset = BERTDataset(args.train_dataset, vocab, seq_len=args.seq_len, + corpus_lines=args.corpus_lines, on_memory=args.on_memory) + + print("Loading Test Dataset", args.test_dataset) + test_dataset = BERTDataset(args.test_dataset, vocab, seq_len=args.seq_len, on_memory=args.on_memory) \ + if args.test_dataset is not None else None + + print("Creating Dataloader") + train_data_loader = DataLoader( + train_dataset, batch_size=args.batch_size, num_workers=args.num_workers) + test_data_loader = DataLoader(test_dataset, batch_size=args.batch_size, num_workers=args.num_workers) \ + if test_dataset is not None else None + + print("Building BERT model") + bert = BERT(len(vocab), hidden=args.hidden, + n_layers=args.layers, attn_heads=args.attn_heads) + + print("Creating BERT Trainer") + trainer = BERTTrainer(bert, len(vocab), train_dataloader=train_data_loader, test_dataloader=test_data_loader, + lr=args.lr, betas=( + args.adam_beta1, args.adam_beta2), weight_decay=args.adam_weight_decay, + with_cuda=args.with_cuda, cuda_devices=args.cuda_devices, log_freq=args.log_freq) + + print("Training Start") + for epoch in range(args.epochs): + trainer.train(epoch) + trainer.save(epoch, args.output_path) + + if test_data_loader is not None: + trainer.test(epoch) diff --git a/test_bert.rar b/test_bert.rar new file mode 100644 index 0000000..e27ddce Binary files /dev/null and b/test_bert.rar differ