pii-ner-en

Fine-tuned RoBERTa model for detecting Personally Identifiable Information (PII) and Payment Card Industry (PCI) data in English text.

Model: rm0013/roberta-pii-ner-en | Micro F1: 0.95 | Entities: 54

Supported Entities

PII: PERSON_NAME, EMAIL, PHONE_NUMBER, SSN, ADDRESS, DATE_OF_BIRTH, DATE, AGE, USERNAME, PASSWORD, IP_ADDRESS, URL, API_KEY, PASSPORT_NUMBER, DRIVER_LICENSE, ORGANIZATION, DEVICE_ID, VEHICLE_ID, GPS_COORDINATES, USERAGENT, and more

PCI: CREDIT_CARD, CREDIT_CARD_CVV, CREDIT_CARD_EXPIRY, PIN, BANK_ACCOUNT, BANK_ROUTING, BITCOINADDRESS, ETHEREUMADDRESS, AMOUNT, CURRENCY, and more

Usage

Use the hosted model directly:

from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="rm0013/roberta-pii-ner-en",
    aggregation_strategy="simple"
)

result = ner("Send the invoice to john.smith@acme.com, card 4111-1111-1111-1111 CVV 123.")
for entity in result:
    print(f"{entity['word']:30s} → {entity['entity_group']} ({entity['score']:.2f})")
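
With aggregation_strategy="simple", each result from the transformers pipeline also carries start and end character offsets, which is enough to redact the original string. The following is a minimal sketch, not code from this repository: mask_entities and its min_score threshold are hypothetical names, and it assumes the detected spans do not overlap. The entity list is written by hand in the pipeline's output shape so the example runs without downloading the model.

```python
def mask_entities(text, entities, min_score=0.5):
    """Replace each detected span in `text` with a [LABEL] placeholder."""
    # Drop low-confidence detections (the 0.5 default is an arbitrary choice).
    kept = [e for e in entities if e["score"] >= min_score]
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        label = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + label + text[ent["end"]:]
    return text


# Hand-written stand-in for the output of ner(sample).
sample = "Mail john@acme.com now"
fake_entities = [
    {"entity_group": "EMAIL", "score": 0.99, "word": "john@acme.com", "start": 5, "end": 18},
]
print(mask_entities(sample, fake_entities))  # Mail [EMAIL] now
```

In real use, pass the text and the list returned by ner(text) instead of the hand-written entities.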

Or run the inference API locally:

pii-serve --host 0.0.0.0 --port 8000

curl -X POST http://localhost:8000/detect \
  -H "Content-Type: application/json" \
  -d '{"text": "Card 4111-1111-1111-1111 CVV 123 expires 12/25", "mask": true}'
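
The same request can be sent from Python. This is a sketch using only the standard library; build_detect_request and detect are hypothetical helper names, and the endpoint, port, and the text and mask fields are taken from the curl command above. The response schema is not documented in this README, so the sketch returns the decoded JSON as-is.

```python
import json
import urllib.request


def build_detect_request(text, mask=True, base_url="http://localhost:8000"):
    """Build the same POST request as the curl command."""
    payload = json.dumps({"text": text, "mask": mask}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/detect",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect(text, mask=True, base_url="http://localhost:8000", timeout=10):
    """Send the request and return the decoded JSON body, whatever its shape."""
    req = build_detect_request(text, mask, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With pii-serve running, detect("Card 4111-1111-1111-1111 CVV 123 expires 12/25") issues the equivalent call.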

Train from Scratch

1. Install

git clone https://github.com/rakmohan/pii-ner-en.git
cd pii-ner-en
pip install -e .

2. Prepare data

# Full dataset (~130k samples)
pii-prepare --output-dir data/processed \
  --max-hf-samples 100000 \
  --synthetic-samples 20000 \
  --pci-samples 10000

# Quick mode for local testing (~8k samples)
pii-prepare --quick

3. Train

# GPU / Google Colab
pii-train --config config/train_config.yaml

# Mac / CPU
pii-train --config config/train_config_quick.yaml

Training config, entity definitions, and label mappings are in config/.

Training Data

ai4privacy/pii-masking-200k — 200k annotated examples
Synthetic PII/PCI data generated with Faker
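
To illustrate what a synthetic PCI training example can look like, here is a small sketch. It does not reproduce the repository's generator: it uses Python's random module instead of Faker so that it runs without dependencies, and the template, the helper names, and the BIO tagging scheme are all assumptions. Only the CREDIT_CARD and CREDIT_CARD_CVV labels come from the entity list above.

```python
import random


def luhn_check_digit(partial):
    """Return the Luhn check digit for a string of digits."""
    total = 0
    # Walk right to left; the digit next to the check digit is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def fake_card_number(rng):
    """16-digit, Luhn-valid test number with a leading 4."""
    partial = "4" + "".join(str(rng.randint(0, 9)) for _ in range(14))
    return partial + luhn_check_digit(partial)


def make_sample(rng):
    """Fill a fixed template and tag each whitespace token with a BIO label."""
    fields = [
        ("Charge card", None),
        (fake_card_number(rng), "CREDIT_CARD"),
        ("CVV", None),
        (f"{rng.randint(0, 999):03d}", "CREDIT_CARD_CVV"),
    ]
    tokens, tags = [], []
    for value, label in fields:
        for i, tok in enumerate(value.split()):
            tokens.append(tok)
            if label is None:
                tags.append("O")
            else:
                tags.append(("B-" if i == 0 else "I-") + label)
    return tokens, tags


tokens, tags = make_sample(random.Random(0))
print(list(zip(tokens, tags)))
```

Generating the values and the labels together means the annotations are correct by construction, which is the main appeal of synthetic data for rare entity types.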

License

MIT — see LICENSE
