
SynEval: Synthetic Data Evaluation Framework

SynEval is a comprehensive evaluation framework for assessing the quality of synthetic data. The framework provides quantitative scoring across four key dimensions:

  • Fidelity: Measures how well the synthetic data preserves the statistical properties and patterns of the original data
  • Utility: Evaluates the usefulness of synthetic data for downstream tasks
  • Diversity: Assesses the variety and uniqueness of the generated data
  • Privacy: Analyzes the privacy protection level of the synthetic data
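For intuition about what a fidelity-style score can look like, here is a toy, hypothetical column-level check (not SynEval's actual metric): the total variation distance between the category frequencies of a real and a synthetic column, where 0 means identical distributions and 1 means disjoint ones.

```python
from collections import Counter

def tv_distance(real_column, synth_column):
    """Total variation distance between two categorical samples.

    0.0 = identical category frequencies, 1.0 = completely disjoint.
    This is an illustrative stand-in, not SynEval's fidelity metric.
    """
    real_counts = Counter(real_column)
    synth_counts = Counter(synth_column)
    categories = set(real_counts) | set(synth_counts)
    n_real, n_synth = len(real_column), len(synth_column)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )

print(tv_distance(["a", "a", "b"], ["a", "a", "b"]))  # identical -> 0.0
print(tv_distance(["a", "a"], ["a", "b"]))            # partial overlap -> 0.5
```

A real fidelity evaluation aggregates many such per-column and cross-column comparisons; see the fidelity module documentation for what SynEval actually computes.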

Installation

  1. Clone the repository:
git clone https://github.com/privacy-enhancing-technologies/SynEval.git
cd SynEval
  2. Create and activate a conda virtual environment:
conda create -n syneval python=3.10
conda activate syneval
  3. Install dependencies:
pip install -r requirements.txt
  4. Download NLTK data (required for text processing):
python -m nltk.downloader punkt punkt_tab stopwords

Note: You may see dependency conflict warnings during installation. This is normal in environments like Google Colab or when other packages are already installed. For a clean installation without conflicts, consider using a virtual environment.

Quick Start

Command Line Usage

After installation, you can use SynEval from the command line. The main entry point for the framework is run.py. This script allows you to evaluate synthetic data against original data using various metrics.

The general command format is:

python run.py --synthetic <synthetic_data.csv> --original <original_data.csv> --metadata <metadata.json> [evaluation_flags] [--output <results.json>]

Required Arguments

  • --synthetic: Path to the synthetic data CSV file
  • --original: Path to the original data CSV file
  • --metadata: Path to the metadata JSON file

Evaluation Flags

You can select one or more evaluation dimensions to run:

  • --fidelity: Run fidelity evaluation
  • --utility: Run utility evaluation
  • --diversity: Run diversity evaluation
  • --privacy: Run privacy evaluation

Optional Arguments

  • --output: Path to save evaluation results in JSON format (default: artifacts/reports/evaluation_results.json). Results are also printed to stdout.
  • --plot: Generate plots for all evaluation metrics and save them to the artifacts/plots directory. Plots visualize key metrics from fidelity, utility, diversity, and privacy evaluations.
  • --html: Generate an HTML dashboard summarizing each evaluation dimension (saved to artifacts/html/syneval_dashboard.html by default).

Device Selection Arguments

  • --device: Device to use for computation (auto, cpu, cuda). Default: auto (automatically detect best available device)
  • --force-cpu: Force CPU usage even if GPU is available (overrides --device)
  • --gpu-memory-fraction: Fraction of GPU memory to use (0.0-1.0, default: 0.8)
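The precedence among these flags, as described above, can be sketched in plain Python (this is an illustration of the documented behavior, not SynEval's actual implementation; the real code would query the GPU runtime rather than take a cuda_available argument):

```python
def resolve_device(device="auto", force_cpu=False, cuda_available=False):
    """Resolve the compute device following the documented flag precedence.

    --force-cpu overrides --device; --device auto picks CUDA when available.
    """
    if force_cpu:
        return "cpu"                       # --force-cpu wins over everything
    if device == "auto":
        return "cuda" if cuda_available else "cpu"
    return device                          # explicit --device cpu / cuda

print(resolve_device())                                   # -> cpu
print(resolve_device("auto", cuda_available=True))        # -> cuda
print(resolve_device("cuda", force_cpu=True))             # -> cpu
```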

Example:

python run.py \
    --synthetic data/gpt4.1_synthetic.csv \
    --original data/real_10k.csv \
    --metadata data/metadata.json \
    --dimensions fidelity utility diversity privacy \
    --utility-input text \
    --utility-output rating \
    --output artifacts/reports/results.json \
    --plot \
    --html \
    --device auto

For GPU acceleration (if available):

python run.py \
    --synthetic data/gpt4.1_synthetic.csv \
    --original data/real_10k.csv \
    --metadata data/metadata.json \
    --dimensions fidelity utility diversity privacy \
    --utility-input text \
    --utility-output rating \
    --output artifacts/reports/results.json \
    --plot \
    --html \
    --device cuda \
    --gpu-memory-fraction 0.8

For CPU-only processing:

python run.py \
    --synthetic data/gpt4.1_synthetic.csv \
    --original data/real_10k.csv \
    --metadata data/metadata.json \
    --dimensions fidelity utility diversity privacy \
    --utility-input text \
    --utility-output rating \
    --output artifacts/reports/results.json \
    --plot \
    --html \
    --device cpu

All artifacts generated by run.py are stored under the artifacts/ directory:

  • artifacts/cache: cached intermediate computations
  • artifacts/plots: matplotlib/seaborn plots
  • artifacts/html: HTML dashboards (including privacy visualizations)
  • artifacts/reports: JSON summaries and other text outputs
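The JSON summary under artifacts/reports can be consumed programmatically. The sketch below writes a small stand-in report so it is self-contained; the key layout shown ("fidelity", "utility", etc. each mapping to a result object with a "score") is an assumption for illustration, and the real structure produced by run.py may differ.

```python
import json
from pathlib import Path

# Stand-in report so the example runs anywhere; real runs would read
# artifacts/reports/evaluation_results.json produced by run.py instead.
path = Path("evaluation_results_example.json")
path.write_text(json.dumps({
    "fidelity":  {"score": 0.91},
    "utility":   {"score": 0.84},
    "diversity": {"score": 0.77},
    "privacy":   {"score": 0.95},
}))

report = json.loads(path.read_text())
for dimension, result in report.items():
    print(f"{dimension}: {result['score']}")
```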

Metadata Format

The metadata file should be a JSON file that describes the structure of your data. It should include column names, types, dataset name, and primary key information. See data/metadata.json for a concrete example.
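To make the required fields concrete, the snippet below builds a hypothetical metadata file for a small review dataset. The exact schema (key names such as "columns" and "primary_key", and the per-column type labels) is an assumption for illustration only; data/metadata.json in the repository is the authoritative reference.

```python
import json

# Hypothetical metadata for a review dataset with a text column and a
# categorical rating; key names are illustrative, not the canonical schema.
metadata = {
    "dataset_name": "reviews_10k",
    "primary_key": "review_id",
    "columns": {
        "review_id": {"type": "id"},
        "text":      {"type": "text"},
        "rating":    {"type": "categorical"},
    },
}

with open("metadata_example.json", "w") as f:
    json.dump(metadata, f, indent=2)
```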

Evaluation Dimensions

Detailed documentation for each evaluation module is located under evaluation/descriptions/.

Additional Tools

The use_cases/ directory contains scenario-focused extensions that demonstrate how SynEval can be adapted to solve concrete problems beyond the core evaluation CLI (e.g., NER analysis, differential privacy dashboards).

Contributing

As we implement more evaluation metrics, this README will be updated with additional documentation for each component.

License

This project is licensed under the MIT License - see the LICENSE file for details.
