diff --git a/BERT-pytorch b/BERT-pytorch
new file mode 160000
index 0000000..919adf1
--- /dev/null
+++ b/BERT-pytorch
@@ -0,0 +1 @@
+Subproject commit 919adf1ff7d050bb5ab2955caee00b7f994e7e94
diff --git a/README.md b/README.md
index ced3be7..5772319 100644
--- a/README.md
+++ b/README.md
@@ -1,120 +1,20 @@
-# BERT-pytorch
+# BERT-pytorch Study Notes
+At 2 a.m. in mid-February 2023 I am wrapping up my study of the BERT-pytorch project. This is the first open-source project I have worked through in a relatively careful, systematic way since registering a GitHub account; I started just before the winter break and have kept at it until now. Before moving on, a few parting words as a keepsake!
+## 1. Gains
+- Reading the code alongside the BERT paper, I now have a solid picture of what BERT really is: the dataset module (vocabulary construction, random token replacement, and random sentence-pair sampling), the modeling module (an encoder built from Transformer encoder blocks), and the trainer module (loss computation and gradient descent);
+- While studying the code I picked up basic git operations, GitHub working habits (all of my annotations were merged into the master branch), and common PyTorch API usage;
+- An open-source project is best studied together with its paper, so theory and practice reinforce each other; better still, feed it data and actually run it
-[![LICENSE](https://img.shields.io/github/license/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/blob/master/LICENSE)
-![GitHub issues](https://img.shields.io/github/issues/codertimo/BERT-pytorch.svg)
-[![GitHub stars](https://img.shields.io/github/stars/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/stargazers)
-[![CircleCI](https://circleci.com/gh/codertimo/BERT-pytorch.svg?style=shield)](https://circleci.com/gh/codertimo/BERT-pytorch)
-[![PyPI](https://img.shields.io/pypi/v/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/)
-[![PyPI - Status](https://img.shields.io/pypi/status/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/)
-[![Documentation Status](https://readthedocs.org/projects/bert-pytorch/badge/?version=latest)](https://bert-pytorch.readthedocs.io/en/latest/?badge=latest)
+## 2. Lessons
+- I read the code line by line and set up the bert-pytorch environment, but never fed it data to run and inspect the results, so I gained no parameter-tuning experience
+- I set no milestone schedule for this study, so it dragged on
+- For future open-source projects, always bring data and get the code running
+- I meant to write a proper README, but ran out of steam at the end.
-Pytorch implementation of Google AI's 2018 BERT, with simple annotation
+---
+# Notes on understanding BERT
-> BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
-> Paper URL : https://arxiv.org/abs/1810.04805
+## 2023-02-13: Today I discussed BERT's embedding part with Minghao, a colleague at Dahua: how is the step from a token to its initialized embedding vector implemented? He believed the initialized embedding vectors take part in training, but looking at the project again tonight, the embedding module here only performs the random initialization of each token's vector; the result then enters the attention module, where it is first linearly projected into query, key and value before the attention computation begins. On that reading, the embedding module is still just part of data preprocessing and does not take part in training;
+A second question is why the embedding can be randomly initialized at all. I think it is mainly because a token's index is itself arbitrary (first come, first served); neither the token index nor the initial embedding vector carries any semantic information, as long as the key-value mapping between them stays fixed. That also makes the first point self-consistent: if the initialized embedding were a model parameter updated during training, the determinism of that key-value mapping would be broken;
-## Introduction
-
-Google AI's BERT paper shows the amazing result on various NLP task (new 17 NLP tasks SOTA),
-including outperform the human F1 score on SQuAD v1.1 QA task.
-This paper proved that Transformer(self-attention) based encoder can be powerfully used as
-alternative of previous language model with proper language model training method.
-And more importantly, they showed us that this pre-trained language model can be transfer
-into any NLP task without making task specific model architecture.
-
-This amazing result would be record in NLP history,
-and I expect many further papers about BERT will be published very soon.
-
-This repo is implementation of BERT. Code is very simple and easy to understand fastly.
-Some of these codes are based on [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html)
-
-Currently this project is working on progress. And the code is not verified yet.
-
-## Installation
-```
-pip install bert-pytorch
-```
-
-## Quickstart
-
-**NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator**
-
-### 0. 
Prepare your corpus -``` -Welcome to the \t the jungle\n -I can stay \t here all night\n -``` - -or tokenized corpus (tokenization is not in package) -``` -Wel_ _come _to _the \t _the _jungle\n -_I _can _stay \t _here _all _night\n -``` - - -### 1. Building vocab based on your corpus -```shell -bert-vocab -c data/corpus.small -o data/vocab.small -``` - -### 2. Train your own BERT model -```shell -bert -c data/corpus.small -v data/vocab.small -o output/bert.model -``` - -## Language Model Pre-training - -In the paper, authors shows the new language model training methods, -which are "masked language model" and "predict next sentence". - - -### Masked Language Model - -> Original Paper : 3.3.1 Task #1: Masked LM - -``` -Input Sequence : The man went to [MASK] store with [MASK] dog -Target Sequence : the his -``` - -#### Rules: -Randomly 15% of input token will be changed into something, based on under sub-rules - -1. Randomly 80% of tokens, gonna be a `[MASK]` token -2. Randomly 10% of tokens, gonna be a `[RANDOM]` token(another word) -3. Randomly 10% of tokens, will be remain as same. But need to be predicted. - -### Predict Next Sentence - -> Original Paper : 3.3.2 Task #2: Next Sentence Prediction - -``` -Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP] -Label : Is Next - -Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP] -Label = NotNext -``` - -"Is this sentence can be continuously connected?" - - understanding the relationship, between two text sentences, which is -not directly captured by language modeling - -#### Rules: - -1. Randomly 50% of next sentence, gonna be continuous sentence. -2. Randomly 50% of next sentence, gonna be unrelated sentence. - - -## Author -Junseong Kim, Scatter Lab (codertimo@gmail.com / junseong.kim@scatterlab.co.kr) - -## License - -This project following Apache 2.0 License as written in LICENSE file - -Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors - -Copyright (c) 2018 Alexander Rush : [The Annotated Trasnformer](https://github.com/harvardnlp/annotated-transformer) diff --git a/README_back.md b/README_back.md new file mode 100644 index 0000000..ced3be7 --- /dev/null +++ b/README_back.md @@ -0,0 +1,120 @@ +# BERT-pytorch + +[![LICENSE](https://img.shields.io/github/license/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/blob/master/LICENSE) +![GitHub issues](https://img.shields.io/github/issues/codertimo/BERT-pytorch.svg) +[![GitHub stars](https://img.shields.io/github/stars/codertimo/BERT-pytorch.svg)](https://github.com/codertimo/BERT-pytorch/stargazers) +[![CircleCI](https://circleci.com/gh/codertimo/BERT-pytorch.svg?style=shield)](https://circleci.com/gh/codertimo/BERT-pytorch) +[![PyPI](https://img.shields.io/pypi/v/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/) +[![PyPI - Status](https://img.shields.io/pypi/status/bert-pytorch.svg)](https://pypi.org/project/bert_pytorch/) +[![Documentation Status](https://readthedocs.org/projects/bert-pytorch/badge/?version=latest)](https://bert-pytorch.readthedocs.io/en/latest/?badge=latest) + +Pytorch implementation of Google AI's 2018 BERT, with simple annotation + +> BERT 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding +> Paper URL : https://arxiv.org/abs/1810.04805 + + +## Introduction + +Google AI's BERT paper shows the amazing result on various NLP task (new 17 NLP tasks SOTA), +including outperform the human F1 score on SQuAD v1.1 QA task. 
+This paper proved that Transformer(self-attention) based encoder can be powerfully used as +alternative of previous language model with proper language model training method. +And more importantly, they showed us that this pre-trained language model can be transfer +into any NLP task without making task specific model architecture. + +This amazing result would be record in NLP history, +and I expect many further papers about BERT will be published very soon. + +This repo is implementation of BERT. Code is very simple and easy to understand fastly. +Some of these codes are based on [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html) + +Currently this project is working on progress. And the code is not verified yet. + +## Installation +``` +pip install bert-pytorch +``` + +## Quickstart + +**NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator** + +### 0. Prepare your corpus +``` +Welcome to the \t the jungle\n +I can stay \t here all night\n +``` + +or tokenized corpus (tokenization is not in package) +``` +Wel_ _come _to _the \t _the _jungle\n +_I _can _stay \t _here _all _night\n +``` + + +### 1. Building vocab based on your corpus +```shell +bert-vocab -c data/corpus.small -o data/vocab.small +``` + +### 2. Train your own BERT model +```shell +bert -c data/corpus.small -v data/vocab.small -o output/bert.model +``` + +## Language Model Pre-training + +In the paper, authors shows the new language model training methods, +which are "masked language model" and "predict next sentence". + + +### Masked Language Model + +> Original Paper : 3.3.1 Task #1: Masked LM + +``` +Input Sequence : The man went to [MASK] store with [MASK] dog +Target Sequence : the his +``` + +#### Rules: +Randomly 15% of input token will be changed into something, based on under sub-rules + +1. Randomly 80% of tokens, gonna be a `[MASK]` token +2. Randomly 10% of tokens, gonna be a `[RANDOM]` token(another word) +3. Randomly 10% of tokens, will be remain as same. But need to be predicted. + +### Predict Next Sentence + +> Original Paper : 3.3.2 Task #2: Next Sentence Prediction + +``` +Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP] +Label : Is Next + +Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP] +Label = NotNext +``` + +"Is this sentence can be continuously connected?" + + understanding the relationship, between two text sentences, which is +not directly captured by language modeling + +#### Rules: + +1. Randomly 50% of next sentence, gonna be continuous sentence. +2. Randomly 50% of next sentence, gonna be unrelated sentence. 
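As a concrete illustration of the 50/50 rule just listed (which `BERTDataset.random_sent` implements later in this diff), here is a minimal standalone sketch. The `sample_sentence_pair` helper and the `corpus_pairs` list are invented for the example; only the 1 = IsNext / 0 = NotNext convention comes from the code.

```python
import random

def sample_sentence_pair(corpus_pairs):
    """corpus_pairs: list of (sentence_a, sentence_b) tuples, one per corpus line."""
    a, b = random.choice(corpus_pairs)
    if random.random() > 0.5:
        return a, b, 1                       # keep the true next sentence -> IsNext
    # otherwise replace B with the second half of some other random line -> NotNext
    random_b = random.choice(corpus_pairs)[1]
    return a, random_b, 0

pairs = [("Welcome to the", "the jungle"), ("I can stay", "here all night")]
print(sample_sentence_pair(pairs))
```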
+
+
+## Author
+Junseong Kim, Scatter Lab (codertimo@gmail.com / junseong.kim@scatterlab.co.kr)
+
+## License
+
+This project following Apache 2.0 License as written in LICENSE file
+
+Copyright 2018 Junseong Kim, Scatter Lab, respective BERT contributors
+
+Copyright (c) 2018 Alexander Rush : [The Annotated Trasnformer](https://github.com/harvardnlp/annotated-transformer)
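The `dataset.py` changes that follow annotate `random_word`, which applies the 15% / 80-10-10 masking rule described in the README above. As a self-contained sketch of that rule (the function name, the `[MASK]` string and the toy vocabulary are illustrative, not the package's actual API):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]"):
    """Illustrative 15% / 80-10-10 masking, following the rules listed in the README."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < 0.15:          # 15% of positions are selected
            labels.append(tok)              # the original token becomes the prediction target
            r = random.random()
            if r < 0.8:                     # 80% -> [MASK]
                masked.append(mask_token)
            elif r < 0.9:                   # 10% -> a random token from the vocabulary
                masked.append(random.choice(vocab))
            else:                           # 10% -> keep the token, but still predict it
                masked.append(tok)
        else:
            masked.append(tok)
            labels.append(None)             # not selected: nothing to predict here
    return masked, labels

print(mask_tokens("the man went to the store with his dog".split(),
                  vocab=["the", "man", "store", "dog", "penguin"]))
```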
diff --git a/bert_pytorch/dataset/dataset.py b/bert_pytorch/dataset/dataset.py
index 7d787f3..7a28577 100644
--- a/bert_pytorch/dataset/dataset.py
+++ b/bert_pytorch/dataset/dataset.py
@@ -15,19 +15,21 @@ def __init__(self, corpus_path, vocab, seq_len, encoding="utf-8", corpus_lines=None,
         self.encoding = encoding

         with open(corpus_path, "r", encoding=encoding) as f:
-            if self.corpus_lines is None and not on_memory:
+            # after opening the corpus, handle the two cases below:
+            if self.corpus_lines is None and not on_memory:  # if the corpus is not loaded into memory, first count its lines
                 for _ in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines):
                     self.corpus_lines += 1

             if on_memory:
-                self.lines = [line[:-1].split("\t")
-                              for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]
-                self.corpus_lines = len(self.lines)
+                # load the whole corpus into memory, parsed into the list attribute self.lines
+                self.lines = [line[:-1].split('\t')
+                              for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]  # split each corpus line into two sentences at the \t character
+                self.corpus_lines = len(self.lines)  # number of lines in the corpus

-        if not on_memory:
+        if not on_memory:
             self.file = open(corpus_path, "r", encoding=encoding)
             self.random_file = open(corpus_path, "r", encoding=encoding)
-
+            # advance random_file by a random offset so negative samples come from elsewhere in the corpus; what exactly is this for?
             for _ in range(random.randint(self.corpus_lines if self.corpus_lines < 1000 else 1000)):
                 self.random_file.__next__()
@@ -35,12 +37,14 @@ def __len__(self):
         return self.corpus_lines

     def __getitem__(self, item):
-        t1, t2, is_next_label = self.random_sent(item)
-        t1_random, t1_label = self.random_word(t1)
-        t2_random, t2_label = self.random_word(t2)
+        # the magic method __getitem__ lets an instance be indexed by item, like a list
+        # every sample drawn from a BERTDataset instance goes through the Next Sentence step and the Masked LM step
+        t1, t2, is_next_label = self.random_sent(item)  # Next Sentence step
+        t1_random, t1_label = self.random_word(t1)  # Masked LM step; t1_label holds, per position, the vocab index of the masked token (see the Vocab class __init__ in vocab.py)
+        t2_random, t2_label = self.random_word(t2)

         # [CLS] tag = SOS tag, [SEP] tag = EOS tag
-        t1 = [self.vocab.sos_index] + t1_random + [self.vocab.eos_index]
+        t1 = [self.vocab.sos_index] + t1_random + [self.vocab.eos_index]  # Figure 2 of the paper
         t2 = t2_random + [self.vocab.eos_index]

         t1_label = [self.vocab.pad_index] + t1_label + [self.vocab.pad_index]
@@ -50,7 +54,7 @@ def __getitem__(self, item):
         bert_input = (t1 + t2)[:self.seq_len]
         bert_label = (t1_label + t2_label)[:self.seq_len]

-        padding = [self.vocab.pad_index for _ in range(self.seq_len - len(bert_input))]
+        padding = [self.vocab.pad_index for _ in range(self.seq_len - len(bert_input))]  # the gap between the maximum and actual length is the number of positions to pad
         bert_input.extend(padding), bert_label.extend(padding), segment_label.extend(padding)

         output = {"bert_input": bert_input,
@@ -61,12 +65,15 @@ def __getitem__(self, item):
         return {key: torch.tensor(value) for key, value in output.items()}

     def random_word(self, sentence):
+        # convert the sentence's tokens into their indices in the token-to-index vocabulary
         tokens = sentence.split()
-        output_label = []
+        output_label = []  # holds 0 for the 85% of positions left untouched, and otherwise the token's original vocab index before masking

         for i, token in enumerate(tokens):
             prob = random.random()
+            # BERT randomly selects 15% of the tokens for masking
             if prob < 0.15:
+                # for the selected 15% of tokens, roll the dice again
                 prob /= 0.15

                 # 80% randomly change token to mask token
@@ -77,26 +84,27 @@ def random_word(self, sentence):
                 elif prob < 0.9:
                     tokens[i] = random.randrange(len(self.vocab))

-                # 10% randomly change token to current token
+                # 10% keep the current token unchanged
                 else:
                     tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                 output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

             else:
-                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
+                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)  # tokens that are not masked are replaced by their real vocab index
+                # specifically, self.vocab.unk_index == 1 is the fallback when the token is missing from the stoi token-to-index dict
                 output_label.append(0)

         return tokens, output_label

     def random_sent(self, index):
-        t1, t2 = self.get_corpus_line(index)
-
-        # output_text, label(isNotNext:0, isNext:1)
+        t1, t2 = self.get_corpus_line(index)
+        # for sentences A and B, 50% of the time B is the actual next sentence that follows A (labeled as IsNext),
+        # and 50% of the time it is a random sentence from the corpus (labeled as NotNext)
         if random.random() > 0.5:
-            return t1, t2, 1
+            return t1, t2, 1  # 1 means IsNext
         else:
-            return t1, self.get_random_line(), 0
+            return t1, self.get_random_line(), 0  # 0 means NotNext

     def get_corpus_line(self, item):
         if self.on_memory:
@@ -122,4 +130,4 @@ def get_random_line(self):
         for _ in range(random.randint(self.corpus_lines if self.corpus_lines < 1000 else 1000)):
             self.random_file.__next__()
         line = self.random_file.__next__()
-        return line[:-1].split("\t")[1]
+        return line[:-1].split("\t")[1]
\ No newline at end of file
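To summarize what the annotated `__getitem__` above assembles, here is a compact sketch. The special-token indices `PAD=0`, `EOS=2`, `SOS=3` are assumed example values (the real ones live on the `Vocab` instance), and the 1/2 segment ids (with 0 for padding) follow the convention noted later in the SegmentEmbedding comment.

```python
PAD, EOS, SOS = 0, 2, 3   # assumed example indices, standing in for vocab.pad_index etc.

def build_example(t1_ids, t1_lbl, t2_ids, t2_lbl, seq_len=20):
    """Mirror of BERTDataset.__getitem__: add [CLS]/[SEP], build segment ids, truncate, pad."""
    t1 = [SOS] + t1_ids + [EOS]              # [CLS] ... [SEP]
    t2 = t2_ids + [EOS]                      # ... [SEP]
    t1_lbl = [PAD] + t1_lbl + [PAD]
    t2_lbl = t2_lbl + [PAD]

    segment = ([1] * len(t1) + [2] * len(t2))[:seq_len]
    bert_input = (t1 + t2)[:seq_len]
    bert_label = (t1_lbl + t2_lbl)[:seq_len]

    pad = [PAD] * (seq_len - len(bert_input))   # fill the remaining positions
    return bert_input + pad, bert_label + pad, segment + pad

print(build_example([7, 14, 5], [0, 0, 14], [5, 11], [0, 0]))
```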
diff --git a/bert_pytorch/dataset/vocab.py b/bert_pytorch/dataset/vocab.py
index f7346a7..08dbf60 100644
--- a/bert_pytorch/dataset/vocab.py
+++ b/bert_pytorch/dataset/vocab.py
@@ -33,6 +33,9 @@ def __init__(self, counter, max_size=None, min_freq=1, specials=['', '
             to zero vectors; can be any function that takes in a Tensor and
             returns a Tensor of the same size. Default: torch.Tensor.zero_
             vectors_cache: directory for cached vectors. Default: '.vector_cache'
+        Attributes:
+            self.itos is the list of all tokens in the vocabulary;
+            self.stoi is the dict mapping each token to its index in self.itos
         """
         self.freqs = counter
         counter = counter.copy()
@@ -124,13 +127,18 @@ def __init__(self, texts, max_size=None, min_freq=1):
             if isinstance(line, list):
                 words = line
             else:
+                # the original replace() call cannot strip "\t" and "\n" here, hence the change below (please ignore this comment)
                 words = line.replace("\n", "").replace("\t", "").split()
+                #words = line.replace('\\t', '').replace('\\n', '').split()

             for word in words:
                 counter[word] += 1
         super().__init__(counter, max_size=max_size, min_freq=min_freq)

     def to_seq(self, sentence, seq_len=None, with_eos=False, with_sos=False, with_len=False):
+        """Convert a sentence into the list of vocab indices of its tokens (via self.stoi), e.g.
+        sentence = 'Welcome to the the jungle' gives to_seq(sentence) == [7, 14, 5, 5, 11]
+        """
         if isinstance(sentence, str):
             sentence = sentence.split()

@@ -153,6 +161,9 @@ def to_seq(self, sentence, seq_len=None, with_eos=False, with_sos=False, with_len=False):
         return (seq, origin_seq_len) if with_len else seq

     def from_seq(self, seq, join=False, with_pad=False):
+        """Convert the index list returned by to_seq() back into the corresponding tokens, e.g.
+        seq = [7, 14, 5, 5, 11] gives from_seq(seq) == ['Welcome', 'to', 'the', 'the', 'jungle']
+        """
         words = [self.itos[idx]
                  if idx < len(self.itos)
                  else "<%d>" % idx
diff --git a/bert_pytorch/model/attention/single.py b/bert_pytorch/model/attention/single.py
index 701d2c2..394d35c 100644
--- a/bert_pytorch/model/attention/single.py
+++ b/bert_pytorch/model/attention/single.py
@@ -15,7 +15,9 @@ def forward(self, query, key, value, mask=None, dropout=None):
                  / math.sqrt(query.size(-1))

         if mask is not None:
-            scores = scores.masked_fill(mask == 0, -1e9)
+            # role of the mask in the Transformer: in the encoder it removes the influence of padding positions in the sequence;
+            # in the decoder it additionally hides "not yet visible" positions. Here it is clearly the former.
+            scores = scores.masked_fill(mask == 0, -1e9)  # note that mask and scores must be broadcastable

         p_attn = F.softmax(scores, dim=-1)
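A runnable miniature of the masking trick annotated in `single.py` above, showing that a broadcastable 0/1 mask plus `masked_fill(..., -1e9)` drives the attention weight of padding positions to (almost) zero; the shapes are arbitrary example values.

```python
import math
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    """Scaled dot-product attention with the same masking trick as single.py."""
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
    if mask is not None:
        # positions where mask == 0 (padding) get a very negative score,
        # so softmax sends their attention weight to ~0
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim=-1)
    return torch.matmul(p_attn, value), p_attn

q = k = v = torch.randn(1, 4, 8)                     # (batch, seq_len, d_k)
mask = torch.tensor([[1, 1, 1, 0]])                  # last position is padding
out, attn = attention(q, k, v, mask.unsqueeze(1))    # unsqueeze so the mask broadcasts over queries
print(attn[0].sum(dim=-1))                           # each row still sums to 1; the padding column is ~0
```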
diff --git a/bert_pytorch/model/embedding/position.py b/bert_pytorch/model/embedding/position.py
index d55c224..9f3420e 100644
--- a/bert_pytorch/model/embedding/position.py
+++ b/bert_pytorch/model/embedding/position.py
@@ -12,11 +12,17 @@ def __init__(self, d_model, max_len=512):
         pe = torch.zeros(max_len, d_model).float()
         pe.require_grad = False

-        position = torch.arange(0, max_len).float().unsqueeze(1)
-        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()
+        position = torch.arange(0, max_len).float().unsqueeze(1)  # numerator of the encoding formula in the paper
+        div_term = (torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model)).exp()  # denominator of the formula, computed as log first and then exp; presumably to reduce computation?

         pe[:, 0::2] = torch.sin(position * div_term)
-        pe[:, 1::2] = torch.cos(position * div_term)
+
+        # pe[:, 1::2].size(-1) is less than div_term.size(-1) when d_model is an odd number
+        if pe[:, 1::2].size(-1) >= div_term.size(-1):
+            pe[:, 1::2] = torch.cos(position * div_term)
+        else:
+            cos_len = pe[:, 1::2].size(-1)
+            pe[:, 1::2] = torch.cos(position * div_term[:cos_len])

         pe = pe.unsqueeze(0)
         self.register_buffer('pe', pe)
diff --git a/bert_pytorch/model/embedding/segment.py b/bert_pytorch/model/embedding/segment.py
index cdf84d5..6b89e6c 100644
--- a/bert_pytorch/model/embedding/segment.py
+++ b/bert_pytorch/model/embedding/segment.py
@@ -3,4 +3,9 @@

 class SegmentEmbedding(nn.Embedding):
     def __init__(self, embed_size=512):
+        """
+        Unlike TokenEmbedding, the first argument to __init__() below is 3 rather than vocab_size,
+        because the input to a SegmentEmbedding instance is segment_info, a tensor whose elements are only 0, 1 or 2,
+        so SegmentEmbedding only needs to initialize 3 vectors.
+        """
         super().__init__(3, embed_size, padding_idx=0)
diff --git a/bert_pytorch/model/embedding/token.py b/bert_pytorch/model/embedding/token.py
index 79b5187..d7a2a6c 100644
--- a/bert_pytorch/model/embedding/token.py
+++ b/bert_pytorch/model/embedding/token.py
@@ -2,5 +2,9 @@

 class TokenEmbedding(nn.Embedding):
+    """nn.Embedding is the parent class of TokenEmbedding.
+    nn.Embedding(vocab_size, embed_size) holds vocab_size vectors of dimension embed_size;
+    its forward(self, input: Tensor) -> Tensor uses the elements of input as indices and returns the corresponding vectors.
+    """
     def __init__(self, vocab_size, embed_size=512):
         super().__init__(vocab_size, embed_size, padding_idx=0)
diff --git a/bert_pytorch/model/language_model.py b/bert_pytorch/model/language_model.py
index 608f42a..7f66d2f 100644
--- a/bert_pytorch/model/language_model.py
+++ b/bert_pytorch/model/language_model.py
@@ -39,7 +39,9 @@ def __init__(self, hidden):
         self.softmax = nn.LogSoftmax(dim=-1)

     def forward(self, x):
-        return self.softmax(self.linear(x[:, 0]))
+        # the final hidden state of [CLS] is used as the sequence representation for classification tasks,
+        # where x[:, 0] is the [CLS] position
+        return self.softmax(self.linear(x[:, 0]))


 class MaskedLanguageModel(nn.Module):
@@ -54,8 +56,8 @@ def __init__(self, hidden, vocab_size):
         :param vocab_size: total vocab size
         """
         super().__init__()
-        self.linear = nn.Linear(hidden, vocab_size)
+        self.linear = nn.Linear(hidden, vocab_size)  # nn.Linear(hidden, vocab_size) maps the hidden-dimensional input to vocab_size dimensions
         self.softmax = nn.LogSoftmax(dim=-1)

     def forward(self, x):
-        return self.softmax(self.linear(x))
+        return self.softmax(self.linear(x))  # the last dimension of x is hidden; after self.linear() it becomes vocab_size
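To make the output shapes of the two heads in `language_model.py` explicit, a small standalone sketch (the hidden size, vocabulary size and sequence length are arbitrary example values):

```python
import torch
import torch.nn as nn

hidden, vocab_size, batch, seq_len = 256, 30000, 2, 20   # example sizes
x = torch.randn(batch, seq_len, hidden)                  # stand-in for the encoder output

# Next-sentence head: classify from the [CLS] position only
nsp = nn.Sequential(nn.Linear(hidden, 2), nn.LogSoftmax(dim=-1))
print(nsp(x[:, 0]).shape)          # torch.Size([2, 2])        -> log-probs for IsNext / NotNext

# Masked-LM head: predict a vocabulary distribution at every position
mlm = nn.Sequential(nn.Linear(hidden, vocab_size), nn.LogSoftmax(dim=-1))
print(mlm(x).shape)                # torch.Size([2, 20, 30000])
```

Because `NLLLoss` expects the class dimension second, `pretrain.py` below transposes the masked-LM output to `(batch, vocab_size, seq_len)` before computing `mask_loss`.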
diff --git a/bert_pytorch/trainer/pretrain.py b/bert_pytorch/trainer/pretrain.py
index 0b882dd..5072099 100644
--- a/bert_pytorch/trainer/pretrain.py
+++ b/bert_pytorch/trainer/pretrain.py
@@ -43,7 +43,8 @@ def __init__(self, bert: BERT, vocab_size: int,
         # This BERT model will be saved every epoch
         self.bert = bert
         # Initialize the BERT Language Model, with BERT model
-        self.model = BERTLM(bert, vocab_size).to(self.device)
+        self.model = BERTLM(bert, vocab_size).to(self.device)
+        # on top of the encoding produced by the BERT class, BERTLM returns the Masked LM and Next Sentence Prediction outputs

         # Distributed GPU training if CUDA can detect more than 1 GPU
         if with_cuda and torch.cuda.device_count() > 1:
@@ -100,12 +101,14 @@ def iteration(self, epoch, data_loader, train=True):
             # 1. forward the next_sentence_prediction and masked_lm model
             next_sent_output, mask_lm_output = self.model.forward(data["bert_input"], data["segment_label"])
-
+            # the token masks applied to data["bert_input"] and the is-next label of each sentence pair are both derived from the corpus itself, so this is unsupervised (self-supervised) learning
+            # data["bert_input"] and data["segment_label"] are the inputs for the masked LM and NSP tasks;
+            # data["bert_label"] and data["is_next"] are the corresponding labels
             # 2-1. NLL(negative log likelihood) loss of is_next classification result
             next_loss = self.criterion(next_sent_output, data["is_next"])

             # 2-2. NLLLoss of predicting masked token word
-            mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])
+            mask_loss = self.criterion(mask_lm_output.transpose(1, 2), data["bert_label"])  # both BERT pre-training objectives are essentially classification problems, hence NLLLoss()

             # 2-3. Adding next_loss and mask_loss : 3.4 Pre-training Procedure
             loss = next_loss + mask_loss
@@ -118,9 +121,9 @@ def iteration(self, epoch, data_loader, train=True):
             # next sentence prediction accuracy
             correct = next_sent_output.argmax(dim=-1).eq(data["is_next"]).sum().item()
-            avg_loss += loss.item()
-            total_correct += correct
-            total_element += data["is_next"].nelement()
+            avg_loss += loss.item()  # accumulate the loss of each iteration, used to compute the average loss
+            total_correct += correct  # accumulate the number of correct next-sentence predictions
+            total_element += data["is_next"].nelement()  # number of tensor elements, i.e. the product of its dimensions

             post_fix = {
                 "epoch": epoch,
diff --git a/data/corpus.txt b/data/corpus.txt
new file mode 100644
index 0000000..ed144f5
--- /dev/null
+++ b/data/corpus.txt
@@ -0,0 +1,2 @@
+Welcome to the the jungle
+I can stay here all night
diff --git a/data/vocab.small b/data/vocab.small
new file mode 100644
index 0000000..a4092d6
Binary files /dev/null and b/data/vocab.small differ
diff --git a/img/1.PNG b/img/1.PNG
new file mode 100644
index 0000000..5bc31ff
Binary files /dev/null and b/img/1.PNG differ
diff --git a/img/2.PNG b/img/2.PNG
new file mode 100644
index 0000000..cbd475b
Binary files /dev/null and b/img/2.PNG differ
diff --git a/test_bert.py b/test_bert.py
new file mode 100644
index 0000000..3496ae0
--- /dev/null
+++ b/test_bert.py
@@ -0,0 +1,123 @@
+import argparse
+
+from torch.utils.data import DataLoader
+
+from bert_pytorch.model import BERT
+from bert_pytorch.trainer import BERTTrainer
+from bert_pytorch.dataset import BERTDataset, WordVocab
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+
+    parser.add_argument("-c", "--train_dataset", required=True,
+                        type=str, help="train dataset for train bert")
+    parser.add_argument("-t", "--test_dataset", type=str,
+                        default=None, help="test set for evaluate train set")
+    parser.add_argument("-v", "--vocab_path", required=True,
+                        type=str, help="built vocab model path with bert-vocab")
+    parser.add_argument("-o", "--output_path", required=True,
+                        type=str, help="ex)output/bert.model")
+
+    parser.add_argument("-hs", "--hidden", type=int,
+                        default=256, help="hidden size of transformer model")
+    parser.add_argument("-l", "--layers", type=int,
+                        default=8, help="number of layers")
+    parser.add_argument("-a", "--attn_heads", type=int,
+                        default=8, help="number of attention heads")
+    parser.add_argument("-s", "--seq_len", type=int,
default=20, help="maximum sequence len") + + parser.add_argument("-b", "--batch_size", type=int, + default=64, help="number of batch_size") + parser.add_argument("-e", "--epochs", type=int, + default=10, help="number of epochs") + parser.add_argument("-w", "--num_workers", type=int, + default=5, help="dataloader worker size") + + parser.add_argument("--with_cuda", type=bool, default=True, + help="training with CUDA: true, or false") + parser.add_argument("--log_freq", type=int, default=10, + help="printing loss every n iter: setting n") + parser.add_argument("--corpus_lines", type=int, + default=None, help="total number of lines in corpus") + parser.add_argument("--cuda_devices", type=int, nargs='+', + default=None, help="CUDA device ids") + parser.add_argument("--on_memory", type=bool, default=True, + help="Loading on memory: true or false") + + parser.add_argument("--lr", type=float, default=1e-3, + help="learning rate of adam") + parser.add_argument("--adam_weight_decay", type=float, + default=0.01, help="weight_decay of adam") + parser.add_argument("--adam_beta1", type=float, + default=0.9, help="adam first beta value") + parser.add_argument("--adam_beta2", type=float, + default=0.999, help="adam first beta value") + + args = parser.parse_args() + + print("Loading Vocab", args.vocab_path) + vocab = WordVocab.load_vocab(args.vocab_path) + print("Vocab Size: ", len(vocab)) + + print("Loading Train Dataset", args.train_dataset) + train_dataset = BERTDataset(args.train_dataset, vocab, seq_len=args.seq_len, + corpus_lines=args.corpus_lines, on_memory=args.on_memory) + + print("Loading Test Dataset", args.test_dataset) + test_dataset = BERTDataset(args.test_dataset, vocab, seq_len=args.seq_len, on_memory=args.on_memory) \ + if args.test_dataset is not None else None + + print("Creating Dataloader") + train_data_loader = DataLoader( + train_dataset, batch_size=args.batch_size, num_workers=args.num_workers) + test_data_loader = DataLoader(test_dataset, batch_size=args.batch_size, num_workers=args.num_workers) \ + if test_dataset is not None else None + + print("Building BERT model") + bert = BERT(len(vocab), hidden=args.hidden, + n_layers=args.layers, attn_heads=args.attn_heads) + + print("Creating BERT Trainer") + trainer = BERTTrainer(bert, len(vocab), train_dataloader=train_data_loader, test_dataloader=test_data_loader, + lr=args.lr, betas=( + args.adam_beta1, args.adam_beta2), weight_decay=args.adam_weight_decay, + with_cuda=args.with_cuda, cuda_devices=args.cuda_devices, log_freq=args.log_freq) + + print("Training Start") + for epoch in range(args.epochs): + trainer.train(epoch) + trainer.save(epoch, args.output_path) + + if test_data_loader is not None: + trainer.test(epoch) diff --git a/test_bert.rar b/test_bert.rar new file mode 100644 index 0000000..e27ddce Binary files /dev/null and b/test_bert.rar differ