childmindresearch/anonymize-pii

Anonymize-PII

This repository ingests reports containing PII and applies an iterative anonymization procedure that flags the PII in each report and anonymizes it. The current version replaces all flagged PII with a <MASK> placeholder to ensure consistency.

Features

  • The base feature set uses NLP models and Presidio Analyzer to flag sensitive PII content. The base config models include spaCy, Stanza, and GLiNER.
  • Input documents are read as .json files (by default, the process looks for 'Reports.json' in the data/raw directory) in the format {'PatientID': 'Full body of report text to be anonymized'}
  • The process iterates through each report (key, value) pair using each of the default model configs [spacy, stanza, GLiNER] and generates 3 output files:
    • Anonymized_Rports.json: the anonymized reports, saved in the same format as the input document: {'PatientID': 'Full body of anonymized report'}
    • Iterator.json: a map of all entities identified, with entity type and confidence score. Format is {'PatientID': {'config_model_name': {'PII Flagged': [type, score]}}}
    • PII_Log.json: a map of all text replaced, with start/end indices relative to the source input document. Format is {'PatientID': {'entity_type': '', 'start': '', 'end': '', 'score': '', 'analysis_explanation': '', 'recognition_metadata': {'recognizer_name': '', 'recognizer_identifier': ''}}}
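The flow described in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not the repository's implementation: the regex detectors stand in for the spaCy, Stanza, and GLiNER configs, and the names `regex_detector`, `mask_spans`, and `anonymize_reports` are hypothetical. Every detector is run against the original text so that the logged offsets refer to the source document, and the log is kept as a list per report so that several entities can be recorded; the actual control flow and log layout may differ.

```python
import re
from typing import Callable, Dict, List, Tuple

# A finding is (start, end, entity_type, score); offsets index into the source text.
Finding = Tuple[int, int, str, float]
Detector = Callable[[str], List[Finding]]

MASK = "<MASK>"


def regex_detector(pattern: str, entity_type: str) -> Detector:
    """Hypothetical stand-in for a model config: flags every regex match."""
    compiled = re.compile(pattern)

    def detect(text: str) -> List[Finding]:
        return [(m.start(), m.end(), entity_type, 1.0) for m in compiled.finditer(text)]

    return detect


def mask_spans(text: str, findings: List[Finding], mask: str = MASK) -> str:
    """Replace flagged spans with the mask, merging overlapping spans first."""
    merged: List[List[int]] = []
    for start, end, _type, _score in sorted(findings):
        if merged and start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end in reversed(merged):
        text = text[:start] + mask + text[end:]
    return text


def anonymize_reports(reports: Dict[str, str], detectors: Dict[str, Detector]):
    """Return (anonymized reports, per-model entity map, replacement log)."""
    anonymized, iterator, pii_log = {}, {}, {}
    for patient_id, text in reports.items():
        findings: List[Finding] = []
        iterator[patient_id] = {}
        for name, detect in detectors.items():
            found = detect(text)
            iterator[patient_id][name] = {text[s:e]: [t, sc] for s, e, t, sc in found}
            findings.extend(found)
        anonymized[patient_id] = mask_spans(text, findings)
        pii_log[patient_id] = [
            {"entity_type": t, "start": s, "end": e, "score": sc}
            for s, e, t, sc in sorted(findings)
        ]
    return anonymized, iterator, pii_log


# Made-up sample data in the documented input shape.
reports = {"P001": "Contact Jane at jane@example.org or 555-0100."}
detectors = {
    "email_stub": regex_detector(r"[\w.]+@[\w.]+\.\w+", "EMAIL_ADDRESS"),
    "phone_stub": regex_detector(r"\b\d{3}-\d{4}\b", "PHONE_NUMBER"),
}
anonymized, iterator, pii_log = anonymize_reports(reports, detectors)
print(anonymized["P001"])
```

Replacing spans from right to left is what lets the log keep offsets relative to the source document: the text to the left of each replacement is never shifted before it is used.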

To Run

Install the dependency requirements in a virtual environment (venv)

Save the input report JSON document as /data/raw/Reports.json

To test run the anonymizer, you can copy the "Reports.json" file from the /tests/ directory into /data/raw/

From the /src/anonymize_pii directory, run main.py
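Before running, it can help to confirm that the input file has the expected shape. The sketch below is a hypothetical pre-flight check (`load_reports` is not part of the repository) and assumes only the documented format, a single JSON object mapping each PatientID to its report text. Note that a valid JSON file uses double quotes, even though the format above is written with single quotes. The example writes to a temporary directory instead of data/raw so it can be run anywhere.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_reports(path: Path) -> Dict[str, str]:
    """Load a reports file and check it maps PatientID strings to report text."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object mapping PatientID to report text")
    bad_keys = [key for key, value in data.items() if not isinstance(value, str)]
    if bad_keys:
        raise ValueError(f"report text is not a string for: {bad_keys}")
    return data


# Write a made-up sample file, then load it back through the check.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "Reports.json"
    sample_path.write_text(
        json.dumps({"P001": "Full body of report text to be anonymized"}),
        encoding="utf-8",
    )
    loaded = load_reports(sample_path)

print(sorted(loaded))
```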

References

https://microsoft.github.io/presidio/

https://spacy.io/

https://stanfordnlp.github.io/stanza/

https://github.com/urchade/GLiNER
