De-identification-using-LLMs

Objective

This research project focuses on utilizing Large Language Models (LLMs) to remove personally identifiable information (PII) from forum posts. The project explores the effectiveness of both OpenAI's GPT-4o model and Meta's LLama3.1 model (run locally) in de-identifying sensitive data.

Project Structure

Data

original_files: Contains the original data with PII.
human_redacted_files: Contains the files that have been de-identified by humans.
results:
- OpenAI_redacted_files: Files generated by the OpenAI GPT-4o model.
- Llama_redacted_files: Files generated by the LLama3.1 model (local).
prompts.csv: A CSV file containing a list of prompts used during the de-identification process and for evaluation.

Code Files

de_identified_csv_generator.py: This script processes the original files and generates de-identified CSV files using either the OpenAI GPT-4 model or the LLaMA model accessed via Fireworks, based on user selection.
de_identified_csv_evaluator.py: This script evaluates the reliability of the de-identification process by comparing the model outputs with human-redacted files. It calculates metrics such as accuracy, precision, recall, and Cohen's kappa.

Prerequisites

Before getting started, ensure you have the following:

Python Environment: Python 3.6 or higher is recommended.
OpenAI API Key: Obtain an API key from OpenAI to access the GPT-4o model.
Llama 3.1 Model: Obtain LLama 3.1 8B model via Ollama.
Python Packages: Install the required packages using pip:
```
pip install openai pandas
```
Input Data:
- original_files: Place your original CSV files containing PII in this folder.
- human_redacted_files: Place the corresponding human-redacted CSV files in this folder.

Downloading and Installing Llama 3.1 8B

Step 1: Download Ollama On Mac or Windows, go to the Ollama download page here and select your platform to download it, then double click the downloaded file to install Ollama.

Step 2: Download and test run Llama 3.1 On a terminal or console, run ollama pull llama3.1 to download the Llama 3.1 8b chat model in the 4-bit quantized format with size about 4.7 GB. For better results, you can download Llama 3.1 70b, ollama pull llama3.1:70b, but you will need around 128 of RAM to run it locally. You can check all the available models on the Ollama web page. If you decide to use a different model from Llama 3.1 8b you need to change the model name in the de_identified_csv_generator.py script

Getting Started

Here's how to initiate the project:

Step 1: Organize Data Place your original files with PII in the original_files folder and human redacted files in human_redacted_files folder. We have artificially created some sample files in the folder for your reference.

Step 2: Run the de_identified_csv_generator.py Execute the de_identified_csv_generator script, providing the necessary input files and the output folder (which will be automatically created): This script will process the files, remove PII, and generate an OpenAI de-identified CSV file within an OpenAI_redacted_files folder inside the results folder.

Step 3: Run the de_identified_csv_evaluator.py To evaluate the accuracy of the de-identification process, run the De-identified CSV Evaluator script: This script will analyze the de-identified CSV files in the results folder and update the metrics csv with accuracy, precision, recall, and kappa values.

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
human_redacted_files		human_redacted_files
original_files		original_files
Prompts.csv		Prompts.csv
PromptsLlama.csv		PromptsLlama.csv
README.md		README.md
TestLlama.py		TestLlama.py
de_identified_csv_evaluator.py		de_identified_csv_evaluator.py
de_identified_csv_generator.py		de_identified_csv_generator.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

De-identification-using-LLMs

Objective

Project Structure

Data

Code Files

Prerequisites

Downloading and Installing Llama 3.1 8B

Getting Started

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

De-identification-using-LLMs

Objective

Project Structure

Data

Code Files

Prerequisites

Downloading and Installing Llama 3.1 8B

Getting Started

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages