TextClassification_GPT

An implementation of GPT for text classification on patent data.

For a complete PyTorch implementation of GPT, please refer to: https://github.com/huggingface/pytorch-openai-transformer-lm

The original paper can be found here: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

You can download the OpenAI pre-trained weights by cloning Alec Radford's repo (https://github.com/openai/finetune-transformer-lm) and placing its model folder, which contains the pre-trained weights, in this repo.
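Before training, it can help to confirm that the model folder was copied correctly. The snippet below is a minimal sketch, not part of this repo: the function name `missing_weight_files` is hypothetical, and the list of file names is an assumption based on the layout of the upstream model folder, so adjust it if your copy differs.

```python
from pathlib import Path

# Assumed file names, based on the layout of the upstream `model` folder.
# Adjust this list if your copy of the folder differs.
EXPECTED_FILES = (
    ["encoder_bpe_40000.json", "vocab_40000.bpe", "params_shapes.json"]
    + [f"params_{i}.npy" for i in range(10)]
)


def missing_weight_files(model_dir="model"):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_weight_files()
    if missing:
        print("Missing pre-trained files:", ", ".join(missing))
    else:
        print("All expected pre-trained files found.")
```

Run it from the root of this repo after copying the folder; an empty result means every expected file is in place.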

I added comments throughout the code. If you find huggingface's original code hard to digest, I hope this repo is helpful.

The data can be downloaded from the USPTO for free. For sensitivity reasons, I only uploaded a demo dataset for reference.

I used 20,000+ patent abstracts (with almost no preprocessing) in my implementation, and validation accuracy reaches about 68% within four epochs of fine-tuning. Beyond 5 epochs, GPT overfits the data (e.g., 100% training accuracy after 9 epochs).
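One common way to avoid training past the point of overfitting is early stopping on validation accuracy. The sketch below is an assumption about how that could be done, not what train.py does: the `EarlyStopping` class and the `patience` parameter are hypothetical, and the accuracy values are made-up numbers for illustration only.

```python
class EarlyStopping:
    """Stop training once validation accuracy stops improving."""

    def __init__(self, patience=1):
        self.patience = patience          # epochs to wait without improvement
        self.best_acc = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_acc):
        """Record one epoch; return True when training should stop."""
        if val_acc > self.best_acc:
            self.best_acc = val_acc
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative numbers only, not results from this repo.
val_accs = [0.55, 0.62, 0.66, 0.68, 0.67, 0.66]
stopper = EarlyStopping(patience=2)
for epoch, acc in enumerate(val_accs, start=1):
    if stopper.step(epoch, acc):
        break

print(stopper.best_epoch, stopper.best_acc)  # 4 0.68
```

In a real training loop, the model checkpoint saved at `best_epoch` would be the one to keep.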

Another finding is that after some preprocessing of the text data (e.g., lemmatization), validation accuracy improved significantly, to 74%.
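To illustrate what lemmatization does to an abstract, here is a minimal lookup-based sketch. It is not the preprocessing used for the results above: the `LEMMAS` table and the `lemmatize` function are hypothetical, the table covers only a handful of words, and the lowercasing and punctuation stripping are extra assumptions. A real run would use a full lemmatizer from a library such as NLTK or spaCy.

```python
import re

# Hypothetical, tiny lemma table for illustration only.
LEMMAS = {
    "methods": "method",
    "comprises": "comprise",
    "comprising": "comprise",
    "devices": "device",
    "is": "be",
    "are": "be",
}


def lemmatize(text):
    """Lowercase, split into alphanumeric tokens, and map each to its lemma."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # Tokens missing from the table are passed through unchanged.
    return " ".join(LEMMAS.get(tok, tok) for tok in tokens)


print(lemmatize("The devices are comprising two methods."))
# the device be comprise two method
```

Mapping inflected forms to one lemma shrinks the effective vocabulary, which is one plausible reason the classifier generalizes better on a dataset of this size.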

Repository files:
Demo_data.csv
README.md
datasets.py
loss.py
model_pytorch.py
opt.py
test.md
text_utils.py
train.py
utils.py


zhujohn9604/TextClassification_GPT
