This research project focuses on utilizing Large Language Models (LLMs) to remove personally identifiable information (PII) from forum posts. The project explores the effectiveness of both OpenAI's GPT-4o model and Meta's LLama3.1 model (run locally) in de-identifying sensitive data.
- original_files: Contains the original data with PII.
- human_redacted_files: Contains the files that have been de-identified by humans.
- results:
- OpenAI_redacted_files: Files generated by the OpenAI GPT-4o model.
- Llama_redacted_files: Files generated by the LLama3.1 model (local).
- prompts.csv: A CSV file containing a list of prompts used during the de-identification process and for evaluation.
- de_identified_csv_generator.py: This script processes the original files and generates de-identified CSV files using either the OpenAI GPT-4 model or the LLaMA model accessed via Fireworks, based on user selection.
- de_identified_csv_evaluator.py: This script evaluates the reliability of the de-identification process by comparing the model outputs with human-redacted files. It calculates metrics such as accuracy, precision, recall, and Cohen's kappa.
Before getting started, ensure you have the following:
-
Python Environment: Python 3.6 or higher is recommended.
-
OpenAI API Key: Obtain an API key from OpenAI to access the GPT-4o model.
-
Llama 3.1 Model: Obtain LLama 3.1 8B model via Ollama.
-
Python Packages: Install the required packages using pip:
pip install openai pandas
-
Input Data:
- original_files: Place your original CSV files containing PII in this folder.
- human_redacted_files: Place the corresponding human-redacted CSV files in this folder.
Step 1: Download Ollama On Mac or Windows, go to the Ollama download page here and select your platform to download it, then double click the downloaded file to install Ollama.
Step 2: Download and test run Llama 3.1
On a terminal or console, run ollama pull llama3.1 to download the Llama 3.1 8b chat model in the 4-bit quantized format with size about 4.7 GB. For better results, you can download Llama 3.1 70b, ollama pull llama3.1:70b, but you will need around 128 of RAM to run it locally. You can check all the available models on the Ollama web page. If you decide to use a different model from Llama 3.1 8b you need to change the model name in the de_identified_csv_generator.py script
Here's how to initiate the project:
Step 1: Organize Data Place your original files with PII in the original_files folder and human redacted files in human_redacted_files folder. We have artificially created some sample files in the folder for your reference.
Step 2: Run the de_identified_csv_generator.py
Execute the de_identified_csv_generator script, providing the necessary input files and the output folder (which will be automatically created):
This script will process the files, remove PII, and generate an OpenAI de-identified CSV file within an OpenAI_redacted_files folder inside the results folder.
Step 3: Run the de_identified_csv_evaluator.py
To evaluate the accuracy of the de-identification process, run the De-identified CSV Evaluator script:
This script will analyze the de-identified CSV files in the results folder and update the metrics csv with accuracy, precision, recall, and kappa values.