pii-ner-en

Fine-tuned RoBERTa model for detecting Personally Identifiable Information (PII) and Payment Card Industry (PCI) data in English text.

Model: rm0013/roberta-pii-ner-en  |  Micro F1: 0.95  |  Entities: 54


Supported Entities

PII: PERSON_NAME EMAIL PHONE_NUMBER SSN ADDRESS DATE_OF_BIRTH DATE AGE USERNAME PASSWORD IP_ADDRESS URL API_KEY PASSPORT_NUMBER DRIVER_LICENSE ORGANIZATION DEVICE_ID VEHICLE_ID GPS_COORDINATES USERAGENT and more

PCI: CREDIT_CARD CREDIT_CARD_CVV CREDIT_CARD_EXPIRY PIN BANK_ACCOUNT BANK_ROUTING BITCOINADDRESS ETHEREUMADDRESS AMOUNT CURRENCY and more


Usage

Use the hosted model directly:

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="rm0013/roberta-pii-ner-en",
    aggregation_strategy="simple"
)

result = ner("Send the invoice to john.smith@acme.com, card 4111-1111-1111-1111 CVV 123.")
for entity in result:
    print(f"{entity['word']:30s}{entity['entity_group']} ({entity['score']:.2f})")
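Each aggregated result from the pipeline also carries `start` and `end` character offsets, so the detected spans can be redacted in the original string. The following is a minimal sketch of that idea, not part of this repository: `mask_entities` and its `min_score` threshold are hypothetical names introduced here, and the sample entity is hand-written in the shape the pipeline returns rather than real model output.

```python
def mask_entities(text, entities, min_score=0.5):
    """Replace each detected span with a [LABEL] placeholder.

    `entities` is a list of dicts in the shape returned by the pipeline with
    aggregation_strategy="simple": entity_group, score, word, start, end.
    `min_score` is an illustrative threshold, not a value from the project.
    """
    kept = [e for e in entities if e["score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hand-written sample for illustration; not real model output.
sample_text = "Mail john.smith@acme.com now"
sample_entities = [
    {"entity_group": "EMAIL", "score": 0.99, "word": "john.smith@acme.com",
     "start": 5, "end": 24},
]
print(mask_entities(sample_text, sample_entities))  # Mail [EMAIL] now
```

With the real pipeline, the same function could be called as `mask_entities(text, ner(text))`.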

Or run the inference API locally:

pii-serve --host 0.0.0.0 --port 8000

curl -X POST http://localhost:8000/detect \
  -H "Content-Type: application/json" \
  -d '{"text": "Card 4111-1111-1111-1111 CVV 123 expires 12/25", "mask": true}'
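The same request can be sent from Python with only the standard library. This is a sketch under assumptions: the helper names `build_detect_request` and `detect` are hypothetical, and the response is assumed to be JSON, since its schema is not documented here.

```python
import json
import urllib.request


def build_detect_request(text, mask=True, base_url="http://localhost:8000"):
    """Build the POST request that mirrors the curl call above."""
    payload = json.dumps({"text": text, "mask": mask}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/detect",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect(text, mask=True, base_url="http://localhost:8000", timeout=10):
    """Send the request and decode the body, assuming the server replies with JSON."""
    request = build_detect_request(text, mask, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Requires a running pii-serve instance on localhost:8000:
# print(detect("Card 4111-1111-1111-1111 CVV 123 expires 12/25"))
```

Keeping request construction separate from the network call makes the payload easy to inspect without a running server.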

Train from Scratch

1. Install

git clone https://github.com/rakmohan/pii-ner-en.git
cd pii-ner-en
pip install -e .

2. Prepare data

# Full dataset (~130k samples)
pii-prepare --output-dir data/processed \
  --max-hf-samples 100000 \
  --synthetic-samples 20000 \
  --pci-samples 10000

# Quick mode for local testing (~8k samples)
pii-prepare --quick

3. Train

# GPU / Google Colab
pii-train --config config/train_config.yaml

# Mac / CPU
pii-train --config config/train_config_quick.yaml

Training config, entity definitions, and label mappings are in config/.


Training Data


License

MIT — see LICENSE
