Patent CPC Code Classifier

A multi-label text classification project to predict Cooperative Patent Classification (CPC) codes from patent abstracts. Built using Hugging Face Transformers, fastai, and Blurr, this project covers the full pipeline—from scraping patent data to model training, optimization, and deployment via Hugging Face Spaces and Render.

🧠 Overview

The Patent CPC Code Classifier predicts multiple CPC codes for a given patent abstract, enabling structured classification of inventions. It leverages roberta-base, bert-base-uncased, and distilroberta-base, fine-tuned using fastai + Blurr, and supports model compression and fast inference via ONNX.

Key Highlights:

End-to-end pipeline for multi-label patent classification
Three transformer models trained: roberta-base, bert-base-uncased, distilroberta-base
ONNX quantization for efficient inference
Gradio app deployed on Hugging Face Spaces
Flask app deployed on Render using Hugging Face API

📂 Data Collection & Preprocessing

Workflow:

Download CSV files per search query containing patent URLs.
Scrape each patent to extract: Publication Number, Title, Abstract, and CPC Codes.
Raw data: 39,681 records → Processed dataset: 32,172 records.
After drop duplicates & missing values, 32,172 records remained.
Initially, 28,021 CPC codes were identified. After removing 27,892 rare CPC codes (less than 0.01% of total records), the final dataset contained 129 unique CPC codes, which were used for model prediction. Data used for model training: only abstracts and CPC codes.

Search Queries:

AI & ML Topics
Transformer Language Model	Federated Learning Privacy	Reinforcement Learning Policy
Generative Diffusion Model	Neural Network Compression	Explainable AI System
Computer Vision Segmentation	Anomaly Detection Time Series	Natural Language Understanding
Active Learning Data Selection	Edge AI Accelerator	Transfer Learning Pre-trained
Model Drift Monitoring	Graph Neural Network Embedding	Causal Inference AI
Hyperparameter Optimization Automated	Synthetic Data Generation	Adversarial Machine Learning
Deep Learning Compiler	Model Deployment Pipeline	-

CPC Code Format: [Section][Class][Subclass][MainGroup]/[Subgroup]

Examples:

G06N20/00 — Machine learning
G06N3/045 — Combinations of networks

-Data scraped from Google Patents using Selenium
-Cleaned dataset saved at data/processed/processed_patent_details.csv
-Raw data stored under data/downloads_patent_urls/ and data/scraped/

🏗️ Model Training & Evaluation

Three transformer models were trained for multi-label patent CPC classification:

roberta-base
bert-base-uncased
distilroberta-base

Why Blurr?

Simplifies transformer training within fastai
Seamless Hugging Face integration
Efficient support for multi-label classification

Evaluation Metrics:

Model	Micro F1	Macro F1	Size
distilroberta-base	0.217	0.063	315.8 MB
distilroberta-base (quantized)	0.222	0.065	81.3 MB
roberta-base	0.281	0.136	478.1 MB
roberta-base (quantized)	0.273	0.123	124.4 MB
bert-base-uncased	0.234	0.081	419.1 MB
bert-base-uncased (quantized)	0.124	0.054	109.9 MB

(Update metrics after final evaluation)

Model Selection: The model distilroberta-base (quantized) with the Better F1-score and for faster inference was chosen for deployment, with ONNX used for compression and fast inference.

📦 Deployment

The deployment section is placed immediately after model training. The project includes two deployment options:

1. Gradio App:

Located in the deployment/ folder.
Hosted on Hugging Face Spaces for interactive use.
Users can input patent abstracts and get predicted CPC codes.
Link: Gradio App

2. Flask App:

Located in the docs/ folder.
Uses the selected transformer model via Hugging Face API.
Hosted on Render for web access and integration.
Link: Flask App

⚙️ Installation

git clone https://github.com/faysalalmahmud/patent-cpc-code-classifier.git
cd patent-cpc-code-classifier
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

🛠️ Technologies

ML & NLP: PyTorch, Hugging Face Transformers, fastai, Blurr, ONNX Runtime Data Processing: Pandas, NumPy, Selenium Deployment & Hosting: Hugging Face, Render Tools: Jupyter Notebook, Git

📦 Project Structure

patent-cpc-code-classifier/
│
├── data/
│   ├── downloads_patent_urls/
│   │   ├── gp-search-20251001-102339.csv
│   │   ├── gp-search-20251001-102354.csv
│   │   ├── gp-search-20251001-102414.csv
│   │   ├── gp-search-20251001-102423.csv
│   │   ├── gp-search-20251001-102444.csv
│   │   ├── gp-search-20251001-102500.csv
│   │   ├── gp-search-20251001-102516.csv
│   │   ├── gp-search-20251001-102533.csv
│   │   ├── gp-search-20251001-102549.csv
│   │   ├── gp-search-20251001-102606.csv
│   │   ├── gp-search-20251001-102619.csv
│   │   ├── gp-search-20251001-102636.csv
│   │   ├── gp-search-20251001-102651.csv
│   │   ├── gp-search-20251001-102706.csv
│   │   ├── gp-search-20251001-102718.csv
│   │   ├── gp-search-20251001-102736.csv
│   │   ├── gp-search-20251001-102758.csv
│   │   ├── gp-search-20251001-102808.csv
│   │   ├── gp-search-20251001-102824.csv
│   │   └── gp-search-20251001-102842.csv
│   ├── processed/
│   │   └── processed_patent_details.csv
│   │
│   └── scraped/
│       ├── patent_details_20251001-102339.csv
│       ├── patent_details_20251001-102354.csv
│       ├── patent_details_20251001-102414.csv
│       ├── patent_details_20251001-102423.csv
│       ├── patent_details_20251001-102444.csv
│       ├── patent_details_20251001-102500.csv
│       ├── patent_details_20251001-102516.csv
│       ├── patent_details_20251001-102533.csv
│       ├── patent_details_20251001-102549.csv
│       ├── patent_details_20251001-102606.csv
│       ├── patent_details_20251001-102619.csv
│       ├── patent_details_20251001-102636.csv
│       ├── patent_details_20251001-102651.csv
│       ├── patent_details_20251001-102706.csv
│       ├── patent_details_20251001-102718.csv
│       ├── patent_details_20251001-102736.csv
│       ├── patent_details_20251001-102758.csv
│       ├── patent_details_20251001-102808.csv
│       ├── patent_details_20251001-102824.csv
│       └── patent_details_20251001-102842.csv
├── dataloaders/
│   └── README.md
├── deployment/
│   ├── app.py
│   ├── distilroberta-base-patent-cpc-classifier-quantized.onnx
│   ├── encode_revised_cpc_codes.json
│   ├── huggingface screenshot.png
│   ├── README.md
│   └── requirements.txt
├── docs/
│   ├── templates/
│   │   └── index.html
│   ├── app.py
│   ├── Procfile
│   └── requirements.txt
├── models/
│   └── README.md
├── src/
│   ├── onnx_inference.ipynb
│   ├── patent_cpc_code_classifer.ipynb
│   ├── patent_details_scraper.py
│   ├── patent_urls_scraper.py
│   └── process_data.py
├── .gitignore
├── LICENSE
└── README.md

🤝 Contributing

Fork the repository
Create a branch (feature/your-feature)
Commit changes and push
Submit a Pull Request

Contributions welcome: accuracy improvements, scraping extensions, better visualizations

📜 License

MIT License — see LICENSE

👤 Author

Faysal Al Mahmud GitHub: @faysalalmahmud Email: faysalalmahmud78@gmail.com

Acknowledgments

Google Patents
CPC system by USPTO & EPO
Hugging Face, fastai, and Blurr teams
Open-source ML community

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Patent CPC Code Classifier

🧠 Overview

📂 Data Collection & Preprocessing

🏗️ Model Training & Evaluation

📦 Deployment

⚙️ Installation

🛠️ Technologies

📦 Project Structure

🤝 Contributing

📜 License

👤 Author

Acknowledgments

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
data		data
dataloaders		dataloaders
deployment		deployment
docs		docs
models		models
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

License

faysalalmahmud/patent-cpc-code-classifier

Folders and files

Latest commit

History

Repository files navigation

Patent CPC Code Classifier

🧠 Overview

📂 Data Collection & Preprocessing

🏗️ Model Training & Evaluation

📦 Deployment

⚙️ Installation

🛠️ Technologies

📦 Project Structure

🤝 Contributing

📜 License

👤 Author

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages