- cner: datasets/cner
- CLUENER: datasets/cluener
note: the test set does not have labels; to evaluate test results, submit them to the official site: https://github.com/CLUEbenchmark/CLUENER
- conll_03_english: datasets/conll_03_english
note: this is the BIO version of the conll2003 dataset, obtained from https://github.com/Alibaba-NLP/CLNER
- ontonote: datasets/ontonote
note: the same dataset as used in "Better Feature Integration for Named Entity Recognition" (NAACL 2021); see SynLSTM-for-NER/data/ontonotes in the xuuuluuu/SynLSTM-for-NER repository
- ontonote4.0
note: this dataset is for Chinese NER and was obtained from https://github.com/ShannonAI/glyce
- other datasets:
conll2003: datasets/conll2003
note: this is the BIESO version of the conll2003 dataset; use it to test whether the model adapts to another labeling style.
BTC and GUM: not tested yet.
note: to use any dataset other than the first three (cner, CLUENER, conll_03_english), write another DataProcessor in ner_seq.py; a minimal sketch follows this list.
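A minimal sketch of such a processor, assuming it is added inside ner_seq.py where the DataProcessor base class (with its _read_text helper returning dicts of words/labels) and InputExample are already defined, as in BERT-NER-Pytorch; the file names and label set below are illustrative:

```python
import os

class BtcProcessor(DataProcessor):
    """Illustrative processor for a new `token label` dataset (e.g. BTC)."""

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_text(os.path.join(data_dir, "train.txt")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_text(os.path.join(data_dir, "dev.txt")), "dev")

    def get_labels(self):
        # Replace with the dataset's real tag set.
        return ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            guid = "%s-%s" % (set_type, i)
            examples.append(InputExample(guid=guid, text_a=line["words"],
                                         labels=line["labels"]))
        return examples
```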
- GPT2+Softmax (see the sketch below)
- GPT2+CRF
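The repository implements these in its models package; the following is only a minimal sketch of the GPT2+Softmax idea (per-token classification over GPT2 hidden states) with hypothetical class and argument names, not the repository's exact code. The GPT2+CRF variant would replace the cross-entropy loss with a CRF layer over the logits.

```python
import torch.nn as nn
from transformers import GPT2Model  # transformers >= 4.6.0

class GPT2SoftmaxSketch(nn.Module):
    """Hypothetical sketch: GPT2 encoder + linear layer + token-level softmax."""

    def __init__(self, model_name, num_labels):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained(model_name)
        self.classifier = nn.Linear(self.gpt2.config.hidden_size, num_labels)
        self.loss_fct = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, input_ids, attention_mask=None, labels=None):
        hidden = self.gpt2(input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(hidden)  # (batch, seq_len, num_labels)
        if labels is not None:
            # Flatten to (batch*seq_len, num_labels) vs (batch*seq_len,) for the loss.
            loss = self.loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
            return loss, logits
        return logits
```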
1. transformers 4.6.0 is used; it lives in models.transformers_master.
2. The transformers version used by the original project is still in models.transformers, but it is an older version and using it causes bugs.
- For the Chinese pretrained GPT2, the published "uer/gpt2-chinese-cluecorpussmall" model is used (loading snippet below).
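The checkpoint can be loaded directly from the Hugging Face hub; the UER checkpoint ships a BERT-style vocabulary, so its model card pairs it with BertTokenizer:

```python
from transformers import BertTokenizer, GPT2Model

# The UER Chinese GPT2 uses a BERT-style vocab, hence BertTokenizer.
tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
model = GPT2Model.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
```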
- PyTorch == 1.7.0
- CUDA == 9.0
- Python 3.6+
- transformers >= 4.6.0
- seqeval is used to compute the metrics (see the example below)
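For reference, seqeval computes entity-level scores directly from the tag sequences; a minimal example:

```python
from seqeval.metrics import classification_report, f1_score

# Entity-level evaluation over label sequences (one list per sentence).
y_true = [["B-LOC", "I-LOC", "O", "B-PER", "I-PER", "I-PER"]]
y_pred = [["B-LOC", "I-LOC", "O", "B-PER", "I-PER", "O"]]
print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
```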
Input format (BIOS tag scheme preferred): one character and its label per line, separated by a space. Sentences are separated by a blank line. The cner dataset labels are converted into the BIOS scheme in the DataProcessor. A minimal reader is sketched after the example.
美 B-LOC
国 I-LOC
的 O
华 B-PER
莱 I-PER
士 I-PER
我 O
跟 O
他 O
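A reader for this format might look like the following (illustrative, not the repository's DataProcessor):

```python
def read_bios_file(path):
    """Read `char label` lines; blank lines separate sentences."""
    sentences, words, labels = [], [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                    # blank line ends a sentence
                if words:
                    sentences.append((words, labels))
                    words, labels = [], []
            else:
                char, label = line.split()  # e.g. "美 B-LOC"
                words.append(char)
                labels.append(label)
    if words:                               # flush the last sentence
        sentences.append((words, labels))
    return sentences
```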
There are two prompt styles (a construction sketch follows the list):
- (m,m,0): construct the query as prompt+input+prompt+input and use the output hidden states of the latter input for classification.
- (m,length_of_max_sequence_length,0): construct the query as prompt+input+prompt and use the output hidden states of the latter prompt for classification.
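In token-id terms, the two layouts could be assembled roughly as below; build_query, prompt_ids, and input_ids are hypothetical names, and the repository's exact handling of padding and max length may differ:

```python
def build_query(prompt_ids, input_ids, style):
    """Hypothetical helper: concatenate prompt/input ids for the two styles."""
    if style == "m_m_0":  # (m,m,0)
        query = prompt_ids + input_ids + prompt_ids + input_ids
        # Classify the hidden states of the second copy of the input.
        cls_span = (len(query) - len(input_ids), len(query))
    else:  # (m,length_of_max_sequence_length,0)
        query = prompt_ids + input_ids + prompt_ids
        # Classify the hidden states of the second copy of the prompt.
        cls_span = (len(query) - len(prompt_ids), len(query))
    return query, cls_span
```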
- Modify the configuration information in run_ner_xxx.py; please use only run_ner_softmax.py.
- Modify the params in finetuning_argparse.py.
- Modify the prompt template by setting TEMPLATE_CLASSES in run_ner_xxx.py (an illustrative example follows this list). BART_for_ner.py cannot run for now.
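The actual contents of TEMPLATE_CLASSES are defined in run_ner_xxx.py; purely as an illustration of the pattern (the key '1' matches the template param used in the results below, but the values here are invented):

```python
# Purely illustrative: the real TEMPLATE_CLASSES in run_ner_xxx.py may differ.
TEMPLATE_CLASSES = {
    "1": "找出句子中的实体:",          # invented prompt text
    "2": "Find the named entities:",  # invented prompt text
}
prompt = TEMPLATE_CLASSES["1"]
```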
note: file structure of the model
├── prev_trained_model
│   └── bert_base
│       ├── pytorch_model.bin
│       ├── config.json
│       ├── vocab.txt
│       └── ......
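With that layout, a local checkpoint can be loaded by pointing transformers at the directory:

```python
from transformers import AutoModel, AutoTokenizer

# Reads config.json, vocab.txt, and pytorch_model.bin from the local directory.
tokenizer = AutoTokenizer.from_pretrained("prev_trained_model/bert_base")
model = AutoModel.from_pretrained("prev_trained_model/bert_base")
```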
Best results for GPT2:
- cner:
evaluation: acc: 0.94 - recall: 0.93 - f1: 0.93
params: learning_rate=5e-5, weight_decay=0.01, template='1', model_type='chinese_pretrained_gpt2'
- cluener:
evaluation: acc: 0.76 - recall: 0.74 - f1: 0.75
params: learning_rate=5e-5, weight_decay=0.01, template='1', model_type='chinese_pretrained_gpt2'
- conll2003:
evaluation: acc: 0.94 - recall: 0.93 - f1: 0.93
params: learning_rate=5e-5, weight_decay=0.01, template='1', model_type='gpt2'
- ontonote:
evaluation: acc: 0.85 - recall: 0.85 - f1: 0.85
params: learning_rate=1e-4, weight_decay=0.01, template='1', model_type='gpt2'
The other params are default values.