Fine-tuned RoBERTa model for detecting Personally Identifiable Information (PII) and Payment Card Industry (PCI) data in English text.
**Model:** `rm0013/roberta-pii-ner-en` | **Micro F1:** 0.95 | **Entities:** 54
**PII:** `PERSON_NAME`, `EMAIL`, `PHONE_NUMBER`, `SSN`, `ADDRESS`, `DATE_OF_BIRTH`, `DATE`, `AGE`, `USERNAME`, `PASSWORD`, `IP_ADDRESS`, `URL`, `API_KEY`, `PASSPORT_NUMBER`, `DRIVER_LICENSE`, `ORGANIZATION`, `DEVICE_ID`, `VEHICLE_ID`, `GPS_COORDINATES`, `USERAGENT`, and more

**PCI:** `CREDIT_CARD`, `CREDIT_CARD_CVV`, `CREDIT_CARD_EXPIRY`, `PIN`, `BANK_ACCOUNT`, `BANK_ROUTING`, `BITCOINADDRESS`, `ETHEREUMADDRESS`, `AMOUNT`, `CURRENCY`, and more
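Detected entities are typically used to redact text before it leaves a trusted boundary. Below is a minimal sketch of that step; it assumes span-annotated detector output (character `start`/`end` offsets plus an `entity_group` label, the shape an aggregated token-classification pipeline produces) and uses hardcoded sample detections rather than calling the model:

```python
# Redact detected spans in text, given detector output with character offsets.
# The detections below are hardcoded sample output (an assumption for this
# sketch), matching the shape of an aggregated pipeline result.

def mask_text(text, entities):
    """Replace each detected span with its label, working right-to-left so
    earlier offsets stay valid as the string changes length."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text

sample = "Email john.smith@acme.com, card 4111-1111-1111-1111."
detections = [
    {"start": 6, "end": 25, "entity_group": "EMAIL"},
    {"start": 32, "end": 51, "entity_group": "CREDIT_CARD"},
]

print(mask_text(sample, detections))  # Email [EMAIL], card [CREDIT_CARD].
```

Sorting in reverse order of `start` means earlier spans are untouched when a later span's replacement changes the string length.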
Use the hosted model directly:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="rm0013/roberta-pii-ner-en",
    aggregation_strategy="simple",
)

result = ner("Send the invoice to john.smith@acme.com, card 4111-1111-1111-1111 CVV 123.")
for entity in result:
    print(f"{entity['word']:30s} → {entity['entity_group']} ({entity['score']:.2f})")
```

Or run the inference API locally:
```shell
pii-serve --host 0.0.0.0 --port 8000

curl -X POST http://localhost:8000/detect \
  -H "Content-Type: application/json" \
  -d '{"text": "Card 4111-1111-1111-1111 CVV 123 expires 12/25", "mask": true}'
```

## 1. Install
```shell
git clone https://github.com/rakmohan/pii-ner-en.git
cd pii-ner-en
pip install -e .
```

## 2. Prepare data
```shell
# Full dataset (~130k samples)
pii-prepare --output-dir data/processed \
  --max-hf-samples 100000 \
  --synthetic-samples 20000 \
  --pci-samples 10000

# Quick mode for local testing (~8k samples)
pii-prepare --quick
```

## 3. Train
```shell
# GPU / Google Colab
pii-train --config config/train_config.yaml

# Mac / CPU
pii-train --config config/train_config_quick.yaml
```

Training config, entity definitions, and label mappings are in `config/`.
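For orientation: token-classification models of this kind usually encode the entity set with a BIO tagging scheme, so N entity types expand to 2·N + 1 labels. Here is a minimal sketch of building such a label mapping (the entity list is truncated for illustration; the model's actual scheme and label order are defined in `config/`):

```python
# Build BIO label mappings for a token-classification head.
# NOTE: truncated, illustrative entity list; the real model has 54 types.
entity_types = ["PERSON_NAME", "EMAIL", "CREDIT_CARD"]

# "O" for non-entity tokens, then a B-/I- pair per entity type.
labels = ["O"] + [f"{prefix}-{etype}" for etype in entity_types for prefix in ("B", "I")]

id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(labels)  # 2 * 3 + 1 = 7 labels
```

With the full 54-type entity set this scheme would yield 109 labels; `id2label`/`label2id` dictionaries in this shape are what a Hugging Face model config carries.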
## Data sources

- ai4privacy/pii-masking-200k — 200k annotated examples
- Synthetic PII/PCI data generated with Faker

## License

MIT — see LICENSE