State-of-the-art PII detection and redaction using the Sentinel model
Sentinel PII SDK is a Python library for identifying and redacting Personally Identifiable Information (PII) in text.
- High-accuracy PII detection (95%+ recall)
- Multiple handling modes: TAG, REDACT, or REPLACE
- Batch processing support
pip install sentinel-pii-sdk
With faker support for REPLACE mode:
pip install 'sentinel-pii-sdk[faker]'
To install from source:
git clone https://github.com/cernis-intelligence/sentinel-pii-sdk.git
cd sentinel-pii-sdk
pip install -e .
from sentinel_pii import SentinelPIIRedactor
# Initialize (model loads from HuggingFace on first use)
redactor = SentinelPIIRedactor()
# Detect PII in text
text = "My name is John Smith and my email is john@email.com"
result = redactor.redact_text(text)
print(result)
# Output: "My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]"
from sentinel_pii import SentinelPIIRedactor, PIIHandlingMode
redactor = SentinelPIIRedactor()
text = "Contact John Smith at john@email.com or call (555) 123-4567"
# TAG mode - Show PII categories
result = redactor.redact_text(text, mode=PIIHandlingMode.TAG)
print(result)
# "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"
# REDACT mode - Same as TAG
result = redactor.redact_text(text, mode=PIIHandlingMode.REDACT)
print(result)
# "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"
# REPLACE mode - Replace with fake data (requires faker)
result = redactor.redact_text(text, mode=PIIHandlingMode.REPLACE)
print(result)
# "Contact Jane Doe at jane.doe@example.com or call (555) 987-6543"
from sentinel_pii import detect_pii_batch, PIIHandlingMode
documents = [
    "My email is john@email.com",
    "Patient DOB: 1990-05-15, diagnosed with diabetes",
]
results = detect_pii_batch(documents, mode=PIIHandlingMode.TAG)
for result in results:
    print(result)
from sentinel_pii import clean_dataset, PIIHandlingMode
# Clean a JSONL dataset file
clean_dataset(
    input_filename="input_data.jsonl",
    output_filename="output_data.jsonl",
    mode=PIIHandlingMode.TAG,
)
The Sentinel model detects 20+ PII categories:
Identity: PERSON_NAME, USERNAME, AGE, GENDER, DEMOGRAPHIC_GROUP
Contact: EMAIL_ADDRESS, PHONE_NUMBER, STREET_ADDRESS, CITY, STATE, POSTCODE, COUNTRY
Dates: DATE, DATE_OF_BIRTH
ID Numbers: PERSONAL_ID, PASSPORT, DRIVERLICENSE
Financial: CREDIT_CARD_INFO, BANKING_NUMBER
Security: PASSWORD, SECURE_CREDENTIAL
Medical: MEDICAL_CONDITION
Other: ORGANIZATION_NAME, DOMAIN_NAME, NATIONALITY, RELIGIOUS_AFFILIATION
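Once text has been processed in TAG mode, it can be handy to see which of the categories above were found and how often. The helper below is a minimal sketch and is not part of the SDK: `count_pii_tags` and `TAG_PATTERN` are hypothetical names, and the bracketed `[CATEGORY]` format is assumed from the example outputs shown in this README.

```python
import re
from collections import Counter

# Matches bracketed category tags such as [EMAIL_ADDRESS].
# Assumption: TAG-mode output always uses upper-case names with underscores,
# as in the example outputs in this README.
TAG_PATTERN = re.compile(r"\[([A-Z_]+)\]")


def count_pii_tags(tagged_text: str) -> Counter:
    """Count how often each category tag appears in TAG-mode output."""
    return Counter(TAG_PATTERN.findall(tagged_text))


if __name__ == "__main__":
    sample = "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"
    for category, count in count_pii_tags(sample).items():
        print(f"{category}: {count}")
```

Note that any upper-case bracketed token already present in the input text would be counted as well, so treat the tally as an approximation.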
Main class for PII detection.
redactor = SentinelPIIRedactor(pii_categories=None)
Parameters:
pii_categories (optional): Custom PII categories string
Methods:
redact_text(text, mode=PIIHandlingMode.TAG, locale="en_US") - Process single text
detect_pii(documents, mode=PIIHandlingMode.TAG, locale="en_US", show_progress=True) - Process list of documents
detect_pii_batch(documents, mode=PIIHandlingMode.TAG, locale="en_US") - Batch processing
clean_dataset(input_filename, output_filename, mode=PIIHandlingMode.TAG, locale="en_US") - Clean JSONL files
Enum for handling modes:
PIIHandlingMode.TAG - Show PII categories in brackets
PIIHandlingMode.REDACT - Same as TAG
PIIHandlingMode.REPLACE - Replace with fake data (requires faker)
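If you need more control over JSONL cleaning than clean_dataset offers (for example a different field name), a small loop of your own can apply any redaction function to each record. This is a rough sketch, not the SDK's implementation: `clean_jsonl` and its `text_field` parameter are hypothetical, and it assumes every line is a JSON object with the text stored under a `"text"` key.

```python
import json
from typing import Callable


def clean_jsonl(
    input_filename: str,
    output_filename: str,
    redact: Callable[[str], str],
    text_field: str = "text",  # assumption: records keep their text here
) -> int:
    """Apply `redact` to one field of every JSONL record.

    Returns the number of records written.
    """
    written = 0
    with open(input_filename, encoding="utf-8") as src, \
            open(output_filename, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)  # assumption: each line is a JSON object
            if isinstance(record.get(text_field), str):
                record[text_field] = redact(record[text_field])
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

With the SDK installed, passing `redactor.redact_text` as the `redact` argument should plug the Sentinel model into this loop; records without the text field are copied through unchanged.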
- Model: cernis-intelligence/sentinel on HuggingFace
- Performance: 95%+ recall, ~100 docs/min on GPU
- License: Apache 2.0
- Python >= 3.9
- transformers >= 4.36.0
- torch >= 2.0.0
- accelerate >= 0.20.0
- tqdm >= 4.65.0
- faker >= 20.0.0 (optional, for REPLACE mode)
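To see quickly which of these requirements are present in the current environment, a short script using the standard library's `importlib.metadata` can help. This is only a sketch: `installed_versions` is a hypothetical helper, and it reports installed versions without comparing them against the minimums listed above.

```python
import sys
from importlib import metadata

# Package names taken from the requirements list in this README.
REQUIRED = ["transformers", "torch", "accelerate", "tqdm"]
OPTIONAL = ["faker"]  # only needed for REPLACE mode


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("Python >= 3.9:", sys.version_info >= (3, 9))
    for name, version in installed_versions(REQUIRED + OPTIONAL).items():
        print(f"{name}: {version or 'not installed'}")
```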
The examples/ directory contains working sample scripts:
# Basic single-text PII detection
python3.11 examples/basic_usage.py
# Process multiple documents at once
python3.11 examples/batch_processing.py
# Clean JSONL dataset files
python3.11 examples/dataset_cleaning.py
# Validate package structure (no model download)
python3.11 examples/test_all_examples.py
You can also use the included sample_data.jsonl for testing:
from sentinel_pii import clean_dataset, PIIHandlingMode
clean_dataset(
    "examples/sample_data.jsonl",
    "output.jsonl",
    mode=PIIHandlingMode.TAG,
)
Contributions welcome! Please submit a Pull Request.
Apache 2.0 License - see LICENSE file for details.
- HuggingFace: cernis-intelligence/sentinel
- Issues: GitHub Issues
- Built on IBM Granite 4.0
- Training data from AI4Privacy