An open-source research framework for automated sensitive data classification and adaptive privacy-preserving data transformations in modern data pipelines.
This project focuses on identifying sensitive data using metadata-driven machine learning techniques and dynamically applying privacy-preserving transformations based on data sensitivity, consumer identity, and usage context.
Organizations increasingly share data across internal teams, external partners, and reporting systems. However:
- Manual identification of sensitive data (PII, PHI, etc.) is error-prone and does not scale
- Static data-masking rules either degrade data utility or fail to meet privacy requirements
- Different consumers require different privacy guarantees for the same dataset
This repository addresses these challenges by providing a privacy-aware, automated, and adaptive data transformation framework.
- Automatically classify data fields into sensitivity categories (PII, PHI, Sensitive, Non-Sensitive)
- Leverage column names, column descriptions, and table metadata for classification
- Apply dynamic privacy-preserving transformations based on:
- Data sensitivity
- Consumer identity (internal analyst, external partner, reporting)
- Purpose of data usage
- Enable reproducible research through open-source implementation
- Metadata-driven machine learning models
- NLP-based feature extraction from schema metadata
- Multi-class classification:
- PII (Personally Identifiable Information)
- PHI (Protected Health Information)
- Sensitive
- Non-Sensitive
- Consumer-aware transformation policies
- Dynamic generation of transformations such as:
- Tokenization (internal analytics)
- Masking or hashing (reporting)
- Aggregation (external data sharing)
- Minimizes privacy risk while preserving analytical utility
- Supports configurable privacy policies
```
Metadata Ingestion (YAML files)
        ↓
Sensitive Data Classification (Rule-based + Optional ML)
        ↓
Privacy Policy & Consumer Context Engine
        ↓
Dynamic Transformation Generator
        ↓
Privacy-Preserving Data Output (CSV)
```
```
privacy-aware-data-transformation/
│
├── src/privacy_aware_transform/
│   ├── __init__.py        # Package exports
│   ├── metadata.py        # Metadata ingestion and synthetic generation
│   ├── classifier.py      # Sensitivity classification (rules + ML)
│   ├── policy.py          # Consumer policies and transformation rules
│   ├── transforms.py      # Transformation implementations (mask, hash, tokenize, aggregate)
│   ├── utils.py           # Utility functions
│   └── cli.py             # Command-line interface
│
├── table_structure/metadata/      # YAML metadata files for tables
│   ├── customers.yaml             # Customer data table metadata
│   ├── patient_records.yaml       # Patient data table metadata
│   └── sales_transactions.yaml    # Sales transaction table metadata
│
├── data/
│   └── synthetic/                 # Synthetic sample data (CSV files and transformed outputs)
│       ├── customers.csv
│       ├── patient_records.csv
│       ├── sales_transactions.csv
│       └── <consumer_type>/       # Transformed data by consumer type
│
├── examples/
│   └── example.py                 # Example script demonstrating the full pipeline
│
├── README.md
├── LICENSE
├── requirements.txt               # Python dependencies
└── .gitignore
```
- Synthetic schemas with labeled sensitivity classes
- Publicly available metadata schemas (healthcare, finance, retail)
- No real personal or confidential data is used
- Rule-based sensitive data detection
- Static masking and transformation approaches
- Precision, Recall, F1-score for sensitivity classification
- Transformation effectiveness
- Privacy risk reduction
- Data utility preservation
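As an illustration, per-class precision, recall, and F1 for the sensitivity classifier can be computed with a few lines of standard Python (a generic sketch, not part of the framework's code):

```python
def precision_recall_f1(y_true, y_pred, label):
    """Per-class precision/recall/F1 via one-vs-rest counting."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, `precision_recall_f1(["PII", "PII", "PHI"], ["PII", "PHI", "PHI"], "PII")` yields precision 1.0 and recall 0.5 for the PII class.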
- Fixed random seeds
- Configurable experiments
- Fully local execution (no AWS required)
This repository supports scholarly publications focusing on:
- Metadata-driven sensitive data classification
- Privacy-aware dynamic data transformations
- Privacy–utility tradeoff analysis
- Open-source reproducibility in data privacy research
The implementation aligns closely with the methodologies and experiments described in the associated journal articles.
If you use this framework in your research, please cite:
```bibtex
@software{privacy_aware_data_transformation,
  title  = {Privacy-Aware Data Transformation},
  author = {Thimmareddy, Avinash},
  year   = {2025},
  url    = {https://github.com/<your-username>/privacy-aware-data-transformation}
}
```
This project is licensed under the Apache License 2.0.
- No real personal data is used
- All datasets are synthetic or publicly available
- Intended for research and educational purposes
- Users are responsible for compliance with applicable data protection laws
```bash
git clone https://github.com/<your-username>/privacy-aware-data-transformation.git
cd privacy-aware-data-transformation
pip install -r requirements.txt
```

Run the end-to-end example:

```bash
python examples/example.py
```

This script:
- Generates sample metadata YAML files (customers, patient records, sales transactions)
- Creates synthetic test data (CSV files)
- Classifies sensitive columns using metadata-driven rules
- Applies privacy-preserving transformations for different consumer types
- Saves transformed outputs
Output: Check data/synthetic/ for original and transformed data organized by consumer type.
```bash
# Generate sample metadata files
python -m privacy_aware_transform.cli generate-samples --output-dir table_structure/metadata

# Classify sensitive columns and write a report
python -m privacy_aware_transform.cli classify --metadata-dir table_structure/metadata --output classification_report.txt

# Transform a dataset for a specific consumer type
python -m privacy_aware_transform.cli transform \
    --metadata-file table_structure/metadata/customers.yaml \
    --data-file data/synthetic/customers.csv \
    --consumer-type internal_analyst \
    --output data/synthetic/customers_transformed.csv

# List available consumer policies
python -m privacy_aware_transform.cli list-policies
```

The same pipeline is available as a Python API:

```python
from privacy_aware_transform.metadata import MetadataLoader
from privacy_aware_transform.classifier import SensitivityClassifier
from privacy_aware_transform.policy import PolicyEngine
from privacy_aware_transform.transforms import TransformationEngine
from privacy_aware_transform.utils import load_csv_data, apply_transformations_to_dataframe, save_csv_data

# Load metadata from YAML
loader = MetadataLoader('table_structure/metadata')
table_meta = loader.load_table_metadata('customers.yaml')

# Classify sensitive columns
classifier = SensitivityClassifier(use_ml=False)
classifications = classifier.classify_table(table_meta.columns)

# Load data and apply transformations
df = load_csv_data('data/synthetic/customers.csv')
policy_engine = PolicyEngine()
transformation_engine = TransformationEngine()

# Transform for internal analyst
transformed_df = apply_transformations_to_dataframe(
    df, table_meta, classifications,
    consumer_type='internal_analyst',
    transformation_engine=transformation_engine,
    policy_engine=policy_engine
)

save_csv_data(transformed_df, 'output.csv')
```

The framework includes an optional ML-based classifier that trains on your metadata to improve sensitivity classification accuracy.
```bash
python train_ml_classifier.py
```

This automatically:
- Scans all YAML files in `table_structure/metadata/`
- Extracts labeled training data from column names and descriptions
- Trains a Logistic Regression + TF-IDF model
- Saves the model to `models/sensitivity_classifier.pkl`
- Reports training accuracy
Output:

```
Training complete! Accuracy: 96.4% (27/28 correct)
Model saved to: models/sensitivity_classifier.pkl

Top learned features:
  1. transaction (importance: 0.3476)
  2. medication (importance: 0.3078)
  3. diagnosis (importance: 0.2902)
```
```python
from privacy_aware_transform.classifier import SensitivityClassifier

# Automatically loads the trained model
classifier = SensitivityClassifier(use_ml=True)

# Classify columns (now uses both rules and ML for better accuracy)
classifications = classifier.classify_table(table_meta.columns)
```

Compare the rule-based and ML classifiers:

```bash
python test_ml_classifier.py
```

This shows:
- Comparison between rule-based and ML predictions
- Classification agreement percentage
- Confidence scores for each method
The ML classifier uses a two-stage approach:

1. Rule-Based (Primary, Fast)
   - Pattern matching on column names/descriptions
   - Confidence: 0.70-0.90
   - Fast, interpretable, no training required
2. ML-Based (Secondary, Accurate)
   - Trained on your metadata files
   - Blended with the rules when rule confidence < 0.8
   - Confidence: 0.38-0.90
   - More accurate on edge cases
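The rule/ML blending step described above can be sketched as follows (hypothetical logic; `classifier.py` is the authoritative implementation):

```python
def blend_predictions(rule_label, rule_conf, ml_label, ml_conf, threshold=0.8):
    """Trust a confident rule hit outright; otherwise fall back to
    whichever of the two predictions is more confident."""
    if rule_conf >= threshold:
        return rule_label, rule_conf
    if ml_conf > rule_conf:
        return ml_label, ml_conf
    return rule_label, rule_conf
```

With a confident rule match (e.g. 0.9), the ML prediction is ignored; below the threshold, the more confident of the two wins.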
The ML model is incremental-friendly:

```bash
# 1. Add new YAML files to table_structure/metadata/
#    (e.g., employees.yaml, products.yaml, etc.)

# 2. Retrain
python train_ml_classifier.py

# 3. The model now improves with more metadata
```

Recommended growth path:
- Start: 28 samples (3 tables) → 96% accuracy
- Target: 50-75 samples (5-7 tables) → 97-98% accuracy
- Optimal: 100+ samples (10+ tables) → 98%+ accuracy
- Algorithm: Logistic Regression
- Features: TF-IDF on column name + description + data type
- Training Data: Automatically labeled from metadata patterns
- Serialization: Pickle (models/sensitivity_classifier.pkl)
- Update Frequency: Retrain when adding new metadata
See ML_TRAINING_GUIDE.md for:
- Feature engineering details
- Training data requirements
- Performance optimization
- Troubleshooting
- Best practices
Metadata is defined in YAML files located in `table_structure/metadata/`. Each YAML file represents one table.
Example: `customers.yaml`

```yaml
table_name: customers
database: main_db
description: "Customer personal information and contact details"
owner: "data_governance_team"
columns:
  - name: customer_id
    data_type: int
    description: "Unique customer identifier (primary key)"
    nullable: false
    is_key: true
    examples: ["1", "2", "3"]
  - name: first_name
    data_type: string
    description: "Customer first name (PII)"
    nullable: false
    is_key: false
    examples: ["John", "Jane"]
  - name: email
    data_type: string
    description: "Customer email address (PII)"
    nullable: true
    is_key: false
    examples: ["john@example.com", "jane@example.com"]
  - name: registration_date
    data_type: date
    description: "Account registration date (Non-Sensitive)"
    nullable: false
    is_key: false
    examples: ["2020-01-01", "2021-06-15"]
```

The framework automatically classifies columns into sensitivity levels based on metadata (column names and descriptions):
| Class | Definition | Examples |
|---|---|---|
| PII | Personally Identifiable Information | first_name, email, phone, ssn, address, dob |
| PHI | Protected Health Information | diagnosis, medication, patient_name, medical_record_number |
| Sensitive | Financial or location data | salary, amount, zip_code, city, credit_card |
| Non-Sensitive | Public or non-sensitive data | registration_date, status, product_name, visit_count |
Classification Method:
- Rule-based pattern matching on column names and descriptions (high precision, fast)
- Optional ML-based classification (LogisticRegression + TF-IDF) for training on labeled data
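The rule-based pass boils down to ordered keyword matching over column names and descriptions. A minimal sketch (hypothetical patterns; the full rule set lives in `classifier.py`):

```python
import re

# Ordered rules: the more specific class (PHI) is checked before the broader one (PII),
# so "patient_name" resolves to PHI rather than PII.
RULES = [
    ("PHI", re.compile(r"diagnosis|medication|patient|medical", re.I)),
    ("PII", re.compile(r"\bname\b|first_name|last_name|email|phone|ssn|address|dob", re.I)),
    ("Sensitive", re.compile(r"salary|amount|zip|city|credit", re.I)),
]

def classify_column(name: str, description: str = "") -> str:
    """Return the first matching sensitivity class, else Non-Sensitive."""
    text = f"{name} {description}"
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "Non-Sensitive"
```

The ordering is the key design choice: rules for stricter classes must fire first, since a column can match several patterns at once.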
The framework supports four consumer types with different privacy-utility tradeoffs:
| Sensitivity | Internal Analyst | External Partner | Reporting | Public |
|---|---|---|---|---|
| PII | Tokenize | Hash | Mask | Hash |
| PHI | Tokenize | Hash | Mask | Hash |
| Sensitive | Mask (keep ends) | Mask (full) | Aggregate | Aggregate |
| Non-Sensitive | Keep | Keep | Keep | Keep |
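The matrix above is essentially a lookup table. A minimal sketch of the mapping (illustrative only; `PolicyEngine` in `policy.py` is the real source of truth):

```python
# (sensitivity -> consumer -> transformation), mirroring the matrix above.
POLICY_MATRIX = {
    "PII":           {"internal_analyst": "tokenize", "external_partner": "hash",
                      "reporting": "mask", "public": "hash"},
    "PHI":           {"internal_analyst": "tokenize", "external_partner": "hash",
                      "reporting": "mask", "public": "hash"},
    "Sensitive":     {"internal_analyst": "mask_keep_ends", "external_partner": "mask_full",
                      "reporting": "aggregate", "public": "aggregate"},
    "Non-Sensitive": {"internal_analyst": "keep", "external_partner": "keep",
                      "reporting": "keep", "public": "keep"},
}

def lookup_rule(sensitivity: str, consumer: str) -> str:
    # Default to an irreversible transformation when a pairing is unknown,
    # so unexpected consumers never see raw values.
    return POLICY_MATRIX.get(sensitivity, {}).get(consumer, "hash")
```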
- Keep: Return data unchanged (pass-through)
- Mask: Replace characters with a mask character (e.g., `john@example.com` → `j**@****.com`)
- Hash: Apply SHA-256 or another cryptographic hash (irreversible)
- Tokenize: Consistent pseudonymization using keyed HMAC (deterministic, but not reversible without the key)
- Aggregate: Count or group data (for reporting purposes)
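The mask/hash/tokenize semantics can be illustrated with stdlib-only helpers (simplified sketches; the framework's transformers in `transforms.py` handle edge cases and configuration):

```python
import hashlib
import hmac

def hash_value(value: str) -> str:
    """Irreversible: unkeyed SHA-256, cannot be linked back by design."""
    return hashlib.sha256(value.encode()).hexdigest()

def tokenize_value(value: str, key: bytes) -> str:
    """Deterministic pseudonym: same input + key -> same token, so joins still work."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

def mask_keep_ends(value: str, keep: int = 1, mask_char: str = "*") -> str:
    """Keep the first and last `keep` characters, mask the interior."""
    if len(value) <= 2 * keep:
        return mask_char * len(value)
    return value[:keep] + mask_char * (len(value) - 2 * keep) + value[-keep:]
```

For instance, `mask_keep_ends("john")` gives `"j**n"`, and tokenizing the same email twice with the same key yields identical tokens, which is what preserves join-ability for internal analytics.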
- MetadataLoader: Load table metadata from YAML files
- SyntheticMetadataGenerator: Generate sample metadata for testing
- SensitivityClassifier: Rule-based + optional ML classification
- ClassificationResult: Result object with class, confidence, and reasoning
- PolicyEngine: Manages consumer policies and transformation rules
- ConsumerPolicy: Maps (sensitivity, consumer) → transformation rule
- TransformationEngine: Orchestrates transformations
- MaskingTransformer, HashingTransformer, TokenizationTransformer: Individual transformations
- Utility functions: CSV I/O, DataFrame transformations, reporting
After running examples/example.py, the framework generates transformed datasets:
```
data/synthetic/
├── customers.csv               (original)
├── patient_records.csv         (original)
├── sales_transactions.csv      (original)
├── internal_analyst/
│   ├── customers_transformed.csv
│   ├── patient_records_transformed.csv
│   └── sales_transactions_transformed.csv
└── external_partner/
    ├── customers_transformed.csv
    └── ...
```
- Aggregate transformations are framework-ready but not fully implemented
- ML-based classification requires manual training on labeled data
- No reversibility support (transformations are one-way by design)
- Limited to local execution (AWS or cloud integration not included)
- No audit logging of transformations
- No differential privacy support
- Differential privacy mechanisms (Laplace noise, etc.)
- Advanced aggregation strategies (grouping, binning, etc.)
- Pre-trained ML models for classification
- Integration with data lineage tracking
- Audit and compliance logging
- Real-time streaming data support
- Performance benchmarking on large datasets
- Privacy budget management and tracking
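For reference, the Laplace mechanism mentioned among the planned enhancements is small enough to sketch with the stdlib (textbook inverse-CDF sampling; not part of the current codebase):

```python
import math
import random

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng=random) -> float:
    """Add Laplace(0, sensitivity/epsilon) noise for epsilon-differential privacy."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5  # uniform in [-0.5, 0.5)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise
```

Smaller `epsilon` means a larger noise scale and stronger privacy; `sensitivity` is the maximum change one individual can cause in the true query answer.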
Contributions are welcome! Please feel free to submit issues or pull requests.
Q: How do I add my own table metadata?
A: Create a new YAML file in `table_structure/metadata/` following the format in `customers.yaml`.
Q: Can I use transformations reversibly?
A: No, transformations are intentionally one-way for privacy preservation.
Q: How does tokenization work?
A: Tokenization uses a secret key (HMAC-SHA256) to create deterministic pseudonyms.
Licensed under Apache License 2.0. See LICENSE file for details.
Last Updated: January 2025