Sentinel PII SDK

State-of-the-art PII detection and redaction using the Sentinel model

Sentinel PII SDK is a Python library for identifying and redacting Personally Identifiable Information (PII) in text.

Features

  • High-accuracy PII detection (95%+ recall)
  • Multiple handling modes: TAG, REDACT, or REPLACE
  • Batch processing support

Installation

From PyPI

pip install sentinel-pii-sdk

With faker support for REPLACE mode:

pip install 'sentinel-pii-sdk[faker]'
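
Because REPLACE mode depends on the optional faker package, a script may want to check for it before choosing a mode. The following is a minimal sketch using only the standard library. `faker_available` and `pick_mode_name` are hypothetical helpers, not part of the SDK, and the function returns a mode name as a string so that it runs without the SDK installed.

```python
import importlib.util


def faker_available() -> bool:
    """Return True if the optional 'faker' package can be imported."""
    return importlib.util.find_spec("faker") is not None


def pick_mode_name(prefer_replace: bool = True) -> str:
    """Choose a handling-mode name, falling back to TAG without faker.

    Hypothetical helper: returns the name of a handling mode
    ("REPLACE" or "TAG") rather than the SDK's enum member.
    """
    if prefer_replace and faker_available():
        return "REPLACE"
    return "TAG"


print(pick_mode_name())
```

With the SDK installed, the name could be mapped to an enum member with `PIIHandlingMode[pick_mode_name()]`, assuming PIIHandlingMode is a standard Python Enum.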

From Source

git clone https://github.com/cernis-intelligence/sentinel-pii-sdk.git
cd sentinel-pii-sdk
pip install -e .

Quick Start

from sentinel_pii import SentinelPIIRedactor

# Initialize (model loads from HuggingFace on first use)
redactor = SentinelPIIRedactor()

# Redact PII in text (the default TAG mode labels each span with its category)
text = "My name is John Smith and my email is john@email.com"
result = redactor.redact_text(text)
print(result)
# Output: "My name is [PERSON_NAME] and my email is [EMAIL_ADDRESS]"

Usage Examples

Basic PII Detection

from sentinel_pii import SentinelPIIRedactor, PIIHandlingMode

redactor = SentinelPIIRedactor()

text = "Contact John Smith at john@email.com or call (555) 123-4567"

# TAG mode - Show PII categories
result = redactor.redact_text(text, mode=PIIHandlingMode.TAG)
print(result)
# "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"

# REDACT mode - Same as TAG
result = redactor.redact_text(text, mode=PIIHandlingMode.REDACT)
print(result)
# "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"

# REPLACE mode - Replace with fake data (requires faker)
result = redactor.redact_text(text, mode=PIIHandlingMode.REPLACE)
print(result)
# e.g. "Contact Jane Doe at jane.doe@example.com or call (555) 987-6543" (fake values are generated, so actual output will vary)
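
To make the difference between tagging and replacing concrete, here is a toy illustration that applies already-known PII spans to a string. It is not the SDK's implementation: the span format, the `apply_spans` helper, and the hard-coded substitute values are all assumptions made for illustration, whereas the real library locates spans with the Sentinel model.

```python
from typing import Dict, List, Optional, Tuple

# A span is (start, end, category). This format is an assumption.
Span = Tuple[int, int, str]


def apply_spans(text: str, spans: List[Span],
                replacements: Optional[Dict[str, str]] = None) -> str:
    """Rewrite each span as [CATEGORY], or as a substitute value if given."""
    out = []
    cursor = 0
    for start, end, category in sorted(spans):
        out.append(text[cursor:start])            # untouched text before the span
        if replacements and category in replacements:
            out.append(replacements[category])    # REPLACE-style substitution
        else:
            out.append(f"[{category}]")           # TAG-style label
        cursor = end
    out.append(text[cursor:])                     # untouched tail
    return "".join(out)


text = "Contact John Smith at john@email.com"
name, email = "John Smith", "john@email.com"
spans = [
    (text.index(name), text.index(name) + len(name), "PERSON_NAME"),
    (text.index(email), text.index(email) + len(email), "EMAIL_ADDRESS"),
]

tagged = apply_spans(text, spans)
replaced = apply_spans(text, spans, {"PERSON_NAME": "Jane Doe",
                                     "EMAIL_ADDRESS": "jane.doe@example.com"})
print(tagged)    # Contact [PERSON_NAME] at [EMAIL_ADDRESS]
print(replaced)  # Contact Jane Doe at jane.doe@example.com
```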

Batch Processing

from sentinel_pii import detect_pii_batch, PIIHandlingMode

documents = [
    "My email is john@email.com",
    "Patient DOB: 1990-05-15, diagnosed with diabetes"
]

results = detect_pii_batch(documents, mode=PIIHandlingMode.TAG)
for result in results:
    print(result)

Dataset Cleaning

from sentinel_pii import clean_dataset, PIIHandlingMode

# Clean a JSONL dataset file
clean_dataset(
    input_filename="input_data.jsonl",
    output_filename="output_data.jsonl",
    mode=PIIHandlingMode.TAG
)
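
clean_dataset reads a JSONL file, that is, one JSON object per line. This README does not state which fields the function expects, so the sketch below assumes a hypothetical "text" field. Check examples/sample_data.jsonl for the actual schema before relying on it. `write_jsonl` and `read_jsonl` are illustrative helpers, not SDK functions.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, List


def write_jsonl(path: Path, texts: Iterable[str], field: str = "text") -> int:
    """Write one JSON object per line; `field` is an assumed key name."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for t in texts:
            fh.write(json.dumps({field: t}, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl(path: Path) -> List[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Round-trip two records through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "input_data.jsonl"
    written = write_jsonl(path, [
        "My email is john@email.com",
        "Patient DOB: 1990-05-15, diagnosed with diabetes",
    ])
    rows = read_jsonl(path)

print(written, rows[0])
```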

Supported PII Categories

The Sentinel model detects 20+ PII categories:

Identity: PERSON_NAME, USERNAME, AGE, GENDER, DEMOGRAPHIC_GROUP

Contact: EMAIL_ADDRESS, PHONE_NUMBER, STREET_ADDRESS, CITY, STATE, POSTCODE, COUNTRY

Dates: DATE, DATE_OF_BIRTH

ID Numbers: PERSONAL_ID, PASSPORT, DRIVERLICENSE

Financial: CREDIT_CARD_INFO, BANKING_NUMBER

Security: PASSWORD, SECURE_CREDENTIAL

Medical: MEDICAL_CONDITION

Other: ORGANIZATION_NAME, DOMAIN_NAME, NATIONALITY, RELIGIOUS_AFFILIATION
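
In TAG mode each detection appears in the output as a bracketed category name such as [PERSON_NAME], so a quick audit of which categories were found can be done on the tagged text alone. The sketch below uses a regular expression for that. `count_tags` is a hypothetical helper, and it will also count any bracketed uppercase token that was already present in the original text.

```python
import re
from collections import Counter

# Matches bracketed labels made of uppercase letters and underscores.
TAG_PATTERN = re.compile(r"\[([A-Z][A-Z_]*)\]")


def count_tags(tagged_text: str) -> Counter:
    """Count how often each bracketed category label occurs."""
    return Counter(TAG_PATTERN.findall(tagged_text))


sample = "Contact [PERSON_NAME] at [EMAIL_ADDRESS] or call [PHONE_NUMBER]"
counts = count_tags(sample)
for category, n in sorted(counts.items()):
    print(category, n)
```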

API Reference

SentinelPIIRedactor

Main class for PII detection.

redactor = SentinelPIIRedactor(pii_categories=None)

Parameters:

  • pii_categories (optional): String of custom PII categories; defaults to None

Methods:

  • redact_text(text, mode=PIIHandlingMode.TAG, locale="en_US") - Process single text
  • detect_pii(documents, mode=PIIHandlingMode.TAG, locale="en_US", show_progress=True) - Process list of documents

Utility Functions

  • detect_pii_batch(documents, mode=PIIHandlingMode.TAG, locale="en_US") - Batch processing
  • clean_dataset(input_filename, output_filename, mode=PIIHandlingMode.TAG, locale="en_US") - Clean JSONL files
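
For large inputs it can help to feed documents to a batch function in fixed-size chunks, for example to bound memory use or to checkpoint between chunks. The sketch below shows one way to do that. `chunked` and `process_in_chunks` are hypothetical helpers, `stand_in_batch_fn` is a placeholder for the real call, and the code assumes the batch function returns one result per input in the same order, which this README does not confirm.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(items), size):
        yield items[i:i + size]


def process_in_chunks(documents: Sequence[str],
                      batch_fn: Callable[[List[str]], List[str]],
                      size: int = 32) -> List[str]:
    """Apply `batch_fn` chunk by chunk and concatenate the results."""
    results: List[str] = []
    for chunk in chunked(documents, size):
        results.extend(batch_fn(list(chunk)))
    return results


def stand_in_batch_fn(docs: List[str]) -> List[str]:
    """Placeholder for a real batch call; just upper-cases its input."""
    return [d.upper() for d in docs]


out = process_in_chunks(["a", "b", "c"], stand_in_batch_fn, size=2)
print(out)  # ['A', 'B', 'C']
```

With the SDK installed, the placeholder could be swapped for something like `lambda docs: detect_pii_batch(docs, mode=PIIHandlingMode.TAG)`.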

PIIHandlingMode

Enum for handling modes:

  • PIIHandlingMode.TAG - Show PII categories in brackets
  • PIIHandlingMode.REDACT - Same as TAG
  • PIIHandlingMode.REPLACE - Replace with fake data (requires faker)

Model Information

  • Model: cernis-intelligence/sentinel on HuggingFace
  • Performance: 95%+ recall, ~100 docs/min on GPU
  • License: Apache 2.0
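
The ~100 docs/min figure gives a rough way to budget time for a job. The following is simple arithmetic, not a benchmark: `estimate_minutes` is a hypothetical helper, and real throughput will depend on hardware and document length.

```python
def estimate_minutes(num_docs: int, docs_per_minute: float = 100.0) -> float:
    """Estimate processing time in minutes at a constant throughput."""
    if docs_per_minute <= 0:
        raise ValueError("docs_per_minute must be positive")
    return num_docs / docs_per_minute


minutes = estimate_minutes(50_000)
print(f"{minutes:.0f} min, about {minutes / 60:.1f} h")
```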

Requirements

  • Python >= 3.9
  • transformers >= 4.36.0
  • torch >= 2.0.0
  • accelerate >= 0.20.0
  • tqdm >= 4.65.0
  • faker >= 20.0.0 (optional, for REPLACE mode)
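
The minimum versions above can be checked against the current environment with the standard library's importlib.metadata. This is a sketch only: `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical helpers, and the version parsing is deliberately simplified (it keeps only leading numeric components). For anything beyond a quick check, `packaging.version.Version` is the more robust choice.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions copied from the Requirements list.
MINIMUMS = {
    "transformers": "4.36.0",
    "torch": "2.0.0",
    "accelerate": "0.20.0",
    "tqdm": "4.65.0",
}


def parse_version(v: str) -> Tuple[int, ...]:
    """Keep leading numeric components, e.g. '2.1.0+cu118' -> (2, 1, 0)."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions after zero-padding both to the same length."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums: Dict[str, str] = MINIMUMS) -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None if it is not installed."""
    status: Dict[str, Optional[bool]] = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = None
        else:
            status[name] = meets_minimum(installed, minimum)
    return status


for package, ok in check_requirements().items():
    print(package, {True: "ok", False: "too old", None: "missing"}[ok])
```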

Examples

The examples/ directory contains working sample scripts:

# Basic single-text PII detection
python3 examples/basic_usage.py

# Process multiple documents at once
python3 examples/batch_processing.py

# Clean JSONL dataset files
python3 examples/dataset_cleaning.py

# Validate package structure (no model download)
python3 examples/test_all_examples.py

You can also use the included sample_data.jsonl for testing:

from sentinel_pii import clean_dataset, PIIHandlingMode

clean_dataset(
    "examples/sample_data.jsonl",
    "output.jsonl",
    mode=PIIHandlingMode.TAG
)

Contributing

Contributions are welcome. Please submit a pull request.

License

Apache 2.0 License - see LICENSE file for details.
