A Python tool for anonymizing sensitive data in CSV and Excel files while preserving data structure and relationships. Perfect for creating test datasets, protecting privacy, and preparing data for sharing.
- 🔒 Smart Data Detection: Automatically identifies email, phone, name, SSN, address, date, ID, and numeric data
- 🎯 Consistent Mapping: Same input always produces same output (with seed)
- 📊 Multiple Formats: Supports CSV and Excel files
- 📋 Clipboard Support: Process data directly from Excel/Google Sheets
- ⚙️ Custom Rules: Override automatic detection with JSON configuration
- 🔄 Reproducible: Use seeds for consistent anonymization results
- Python 3.6+
Installing the requirements pulls in the pandas, faker, openpyxl, and colorama Python libraries:
pip install -r requirements.txt

# Anonymize a CSV file
python main.py data.csv
# Anonymize an Excel file
python main.py data.xlsx
# Specify output file
python main.py data.csv -o anonymized_output.csv

Important: Your input files must have column headers in row 1, with no empty rows above them. The program treats the first row as the column names.
# Copy data from Excel/Sheets, then run:
python main.py --clipboard

# Interactive mode - manually choose anonymization for each column
python main.py data.csv -i
python main.py data.xlsx -i
python main.py --clipboard -i

# Use seed for consistent anonymization
python main.py data.csv -s 12345

Create a JSON file (rules.json) to specify how each column should be anonymized:
{
"email_address": "email",
"phone_number": "phone",
"customer_name": "name",
"ssn": "ssn",
"address": "address",
"birth_date": "date",
"user_id": "id",
"salary": "float",
"internal_notes": "skip"
}

Then use it:
python main.py data.csv -r rules.json

| Type | Description | Example Output |
|---|---|---|
| `email` | Email addresses | john.doe@example.com → sarah.wilson@fake.com |
| `phone` | Phone numbers | (555) 123-4567 → (555) 987-6543 |
| `name` | Names (consistent mapping) | John Smith → Sarah Wilson |
| `ssn` | Social Security Numbers | 123-45-6789 → 987-65-4321 |
| `address` | Addresses | 123 Main St → 456 Oak Ave |
| `date` | Dates (±30 day offset) | 2023-01-15 → 2023-02-10 |
| `id` | IDs (randomized) | user123 → random7digit |
| `integer` | Integers (digit randomization) | 1234 → 5678 |
| `decimal` | Floats (±10% noise) | 1000.50 → 1050.25 |
| `skip` | Don't anonymize | Original value preserved |
| `generic` | Hash the data | any text → a1b2c3d4e5f6 |

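To make the `date` and `decimal` rows more concrete, here is a minimal sketch of how a ±30 day offset and ±10% noise could be produced. The function names `add_noise` and `shift_date` are hypothetical and are not taken from the tool's source; the actual implementation may differ.

```python
import random
from datetime import datetime, timedelta


def add_noise(value, rng, pct=0.10):
    """Scale a number by a random factor in [1 - pct, 1 + pct] (hypothetical helper)."""
    return round(value * (1 + rng.uniform(-pct, pct)), 2)


def shift_date(iso_date, rng, max_days=30):
    """Move a YYYY-MM-DD date by a random number of days in [-max_days, +max_days]."""
    original = datetime.strptime(iso_date, "%Y-%m-%d").date()
    shifted = original + timedelta(days=rng.randint(-max_days, max_days))
    return shifted.isoformat()


# A seeded generator gives the same values on every run.
rng = random.Random(12345)
noisy_salary = add_noise(1000.50, rng)
shifted_birthday = shift_date("2023-01-15", rng)
print(noisy_salary, shifted_birthday)
```

Passing the generator in explicitly keeps the helpers reproducible under a seed, which mirrors the `-s` option described above.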
python main.py [input_file] [options]
Options:
  -o, --output FILE     Output file (default: anonymized_<input>)
  -c, --clipboard       Process clipboard data
  -s, --seed INT        Random seed for reproducible results
  -r, --rules FILE      JSON file with column anonymization rules
  -i, --interactive     Manually process each column for anonymization
  -h, --help            Show help message

# Input: customer_data.csv
# Output: anonymized_customer_data.csv
python main.py customer_data.csv

python main.py sales_data.xlsx -o anonymized_sales.xlsx

- Copy data from Excel/Google Sheets
- Run: python main.py --clipboard
- Anonymized data is copied back to clipboard
- File is also saved as anonymized_data_YYYYMMDD_HHMMSS.csv

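The clipboard workflow can be pictured roughly as follows. This sketch assumes the copied cells arrive as tab-separated text (the usual form when copying from a spreadsheet) and uses only the standard library; `parse_clipboard_text` and `timestamped_name` are hypothetical helpers, and the tool itself may read the clipboard differently.

```python
import csv
import io
from datetime import datetime


def parse_clipboard_text(text):
    """Parse tab-separated text into a list of row dicts; row 1 is the header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def timestamped_name(now=None):
    """Build a file name of the form anonymized_data_YYYYMMDD_HHMMSS.csv."""
    now = now or datetime.now()
    return now.strftime("anonymized_data_%Y%m%d_%H%M%S.csv")


# Example text as it might look after copying two columns from a sheet.
copied = "name\temail\nJohn Smith\tjohn.doe@example.com\n"
rows = parse_clipboard_text(copied)
print(rows[0]["email"], timestamped_name())
```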
Interactive mode lets you manually choose how to anonymize each column:
# Using clipboard input:
python main.py --clipboard -i
# Using file input:
python main.py customer_data.csv -i
python main.py customer_data.xlsx -i

In interactive mode, the program will:
- Show you each column and its detected data type
- Ask if you want to change the anonymization method
- Let you choose from available anonymization types
- Apply your choices and process the file
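As a rough illustration of how a detected type and a user override might be combined, the sketch below guesses a type from the column name and lets an explicit rule win. The keyword patterns and the names `NAME_HINTS`, `detect_type`, and `resolve_types` are assumptions made for this example; the tool's real detection also looks at sample data and is not reproduced here.

```python
import re

# Hypothetical keyword table; order matters, so "email_address" resolves to email.
NAME_HINTS = [
    ("email", r"e-?mail"),
    ("phone", r"phone|mobile"),
    ("ssn", r"ssn|social"),
    ("date", r"date|dob"),
    ("address", r"address|street"),
    ("name", r"name"),
    ("id", r"(^|_)id$"),
]


def detect_type(column):
    """Guess an anonymization type from the column name alone."""
    lowered = column.lower()
    for type_name, pattern in NAME_HINTS:
        if re.search(pattern, lowered):
            return type_name
    return "generic"  # fall back to hashing when nothing matches


def resolve_types(columns, rules=None):
    """Use an explicit rule when one exists, otherwise the detected type."""
    rules = rules or {}
    return {col: rules.get(col, detect_type(col)) for col in columns}


columns = ["customer_email", "full_name", "employee_id"]
print(resolve_types(columns, {"employee_id": "skip"}))
```

The same override step would apply whether the choice comes from an interactive prompt or from a rules file.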
Create custom_rules.json:
{
"customer_email": "email",
"phone": "phone",
"full_name": "name",
"salary": "numeric",
"employee_id": "skip"
}

Run with custom rules:
python main.py employee_data.csv -r custom_rules.json -o safe_employee_data.csv

- Data Type Detection: The script analyzes column names and sample data to automatically detect data types
- Anonymization: Applies appropriate anonymization based on detected or specified type
- Consistency: Uses mapping cache to ensure same input always produces same output
- Preservation: Maintains data structure and statistical properties where possible
- Deterministic Hashing: IDs and generic data use SHA-256 hashing
- Realistic Fake Data: Uses Faker library for believable replacements
- Consistent Mapping: Same real name always maps to same fake name
- Statistical Preservation: Numeric data gets noise instead of complete replacement
- Date Relationships: Preserves relative timing with random offsets
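The mapping cache and deterministic hashing described above can be sketched as follows. This is a simplified stand-in rather than the tool's code: it draws replacement names from a small hard-coded list instead of Faker, and `ConsistentMapper` and `hash_value` are hypothetical names.

```python
import hashlib
import random

# Tiny stand-in name pools; the README says the real tool uses Faker.
FIRST_NAMES = ["Sarah", "Miguel", "Priya", "Tom", "Aiko"]
LAST_NAMES = ["Wilson", "Garcia", "Patel", "Brown", "Sato"]


class ConsistentMapper:
    """Cache replacements so the same input always gets the same output."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._cache = {}

    def map_name(self, real_name):
        # Only generate a replacement the first time a value is seen.
        if real_name not in self._cache:
            fake = self._rng.choice(FIRST_NAMES) + " " + self._rng.choice(LAST_NAMES)
            self._cache[real_name] = fake
        return self._cache[real_name]


def hash_value(value, length=12):
    """Return the first `length` hex characters of the SHA-256 digest."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:length]


mapper = ConsistentMapper(seed=12345)
first = mapper.map_name("John Smith")
second = mapper.map_name("John Smith")  # served from the cache
print(first == second, hash_value("user123"))
```

With pools this small, two different real names can receive the same fake name; a larger generator such as Faker makes that much less likely.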
"Unsupported file type"
- Ensure the file has a .csv, .xlsx, or .xls extension
- Check that the file is not corrupted
"Error reading clipboard"
- Make sure you've copied data from Excel/Sheets first
- Try copying a smaller dataset
Missing dependencies
pip install pandas faker openpyxl

- For large files (>100k rows), consider processing in chunks
- Use the skip type for columns that don't need anonymization
- Set a seed for reproducible results during testing
python main.py test_data.csv -i
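For the chunked-processing tip, one possible approach is to stream the CSV in fixed-size batches so that only one batch is held in memory at a time. The sketch below uses the standard `csv` module and a caller-supplied `transform` function; both helper names are hypothetical, and the tool is not documented as offering a chunked mode. With pandas, `read_csv(..., chunksize=N)` provides a comparable iterator.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def process_in_chunks(src, dst, transform, chunk_size=50000):
    """Read rows from src, apply transform to each, and write them to dst."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    total = 0
    for chunk in iter_chunks(reader, chunk_size):
        writer.writerows(transform(row) for row in chunk)
        total += len(chunk)
    return total


# In-memory example; real use would pass open file handles instead.
source = io.StringIO("id,name\n1,Ann\n2,Bob\n3,Cy\n")
target = io.StringIO()
count = process_in_chunks(source, target, lambda row: {**row, "name": "X"}, chunk_size=2)
print(count)
```

To keep replacements consistent across chunks, the `transform` function should share a single mapping cache for the whole run.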