SynEval is a comprehensive evaluation framework for assessing the quality of synthetic data. The framework provides quantitative scoring across four key dimensions:
- Fidelity: Measures how well the synthetic data preserves the statistical properties and patterns of the original data
- Utility: Evaluates the usefulness of synthetic data for downstream tasks
- Diversity: Assesses the variety and uniqueness of the generated data
- Privacy: Analyzes the privacy protection level of the synthetic data
- Clone the repository:

  ```bash
  git clone https://github.com/privacy-enhancing-technologies/SynEval.git
  cd SynEval
  ```

- Create and activate a conda virtual environment:

  ```bash
  conda create -n syneval python=3.10
  conda activate syneval
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download NLTK data (required for text processing):

  ```bash
  python -m nltk.downloader punkt punkt_tab stopwords
  ```

Note: You may see dependency conflict warnings during installation. This is normal in environments like Google Colab or when other packages are already installed. For a clean installation without conflicts, consider using a virtual environment.
After installation, you can use SynEval from the command line. The main entry point for the framework is `run.py`. This script allows you to evaluate synthetic data against original data using various metrics.
The general command format is:
```bash
python run.py --synthetic <synthetic_data.csv> --original <original_data.csv> --metadata <metadata.json> [evaluation_flags] [--output <results.json>]
```

- `--synthetic`: Path to the synthetic data CSV file
- `--original`: Path to the original data CSV file
- `--metadata`: Path to the metadata JSON file
You can select one or more evaluation dimensions to run:
- `--fidelity`: Run fidelity evaluation
- `--utility`: Run utility evaluation
- `--diversity`: Run diversity evaluation
- `--privacy`: Run privacy evaluation
- `--output`: Path to save evaluation results in JSON format. If not specified, results are printed to stdout. (Default: `artifacts/reports/evaluation_results.json`)
- `--plot`: Generate plots for all evaluation metrics and save them to the `artifacts/plots` directory. Plots visualize key metrics from the fidelity, utility, diversity, and privacy evaluations.
- `--html`: Generate an HTML dashboard summarizing each evaluation dimension (saved to `artifacts/html/syneval_dashboard.html` by default).
- `--device`: Device to use for computation (`auto`, `cpu`, `cuda`). Default: `auto` (automatically detect the best available device)
- `--force-cpu`: Force CPU usage even if a GPU is available (overrides `--device`)
- `--gpu-memory-fraction`: Fraction of GPU memory to use (0.0-1.0, default: 0.8)
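The precedence among these flags can be sketched as follows. This is an illustrative model of the behavior described above (`--force-cpu` overriding `--device`, and `auto` falling back to CPU when no GPU is detected), not SynEval's actual implementation; the function name and signature are assumptions.

```python
def resolve_device(device="auto", force_cpu=False, cuda_available=False):
    """Sketch of the documented flag precedence:
    --force-cpu wins over everything; --device auto picks CUDA
    only when a GPU is actually available."""
    if force_cpu:
        return "cpu"
    if device == "auto":
        return "cuda" if cuda_available else "cpu"
    return device

print(resolve_device("auto", cuda_available=True))   # → cuda
print(resolve_device("cuda", force_cpu=True))        # → cpu
```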
Example:

```bash
python run.py \
    --synthetic data/gpt4.1_synthetic.csv \
    --original data/real_10k.csv \
    --metadata data/metadata.json \
    --dimensions fidelity utility diversity privacy \
    --utility-input text \
    --utility-output rating \
    --output artifacts/reports/results.json \
    --plot \
    --html \
    --device auto
```

For GPU acceleration (if available):

```bash
python run.py \
    --synthetic data/gpt4.1_synthetic.csv \
    --original data/real_10k.csv \
    --metadata data/metadata.json \
    --dimensions fidelity utility diversity privacy \
    --utility-input text \
    --utility-output rating \
    --output artifacts/reports/results.json \
    --plot \
    --html \
    --device cuda \
    --gpu-memory-fraction 0.8
```

For CPU-only processing:

```bash
python run.py \
    --synthetic data/gpt4.1_synthetic.csv \
    --original data/real_10k.csv \
    --metadata data/metadata.json \
    --dimensions fidelity utility diversity privacy \
    --utility-input text \
    --utility-output rating \
    --output artifacts/reports/results.json \
    --plot \
    --html \
    --device cpu
```

All artifacts generated by `run.py` are stored under the `artifacts/` directory:
- `artifacts/cache`: cached intermediate computations
- `artifacts/plots`: matplotlib/seaborn plots
- `artifacts/html`: HTML dashboards (including privacy visualizations)
- `artifacts/reports`: JSON summaries and other text outputs
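Once a run finishes, the JSON report can be inspected programmatically. The helper below only assumes the report is a JSON object; the section names shown in the example are assumptions based on the four evaluation dimensions, not a documented schema.

```python
import json
import tempfile

def summarize_results(path):
    """Load a SynEval results JSON and return its top-level sections.
    Which sections appear depends on the evaluation flags that were run."""
    with open(path) as f:
        results = json.load(f)
    return sorted(results.keys())

# Illustrative only: stand in for artifacts/reports/results.json with a
# temporary file containing hypothetical section names.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"fidelity": {}, "utility": {}, "diversity": {}, "privacy": {}}, f)
    report_path = f.name

print(summarize_results(report_path))  # → ['diversity', 'fidelity', 'privacy', 'utility']
```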
The metadata file should be a JSON file that describes the structure of your data. It should include column names, types, dataset name, and primary key information. See data/metadata.json for a concrete example.
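As a sketch of what such a file might contain, the snippet below builds a hypothetical metadata object covering the fields the description mentions (column names and types, dataset name, primary key) and runs a minimal sanity check. All field names and values here are illustrative assumptions; `data/metadata.json` in the repository remains the authoritative example.

```python
import json

# Hypothetical metadata: every field name below is an assumption for
# illustration, not SynEval's documented schema.
metadata = {
    "dataset_name": "product_reviews",
    "primary_key": "review_id",
    "columns": {
        "review_id": {"type": "id"},
        "text": {"type": "text"},
        "rating": {"type": "numerical"},
    },
}

def check_metadata(meta):
    """Minimal sanity check: required top-level fields exist and the
    primary key is one of the declared columns."""
    for key in ("dataset_name", "primary_key", "columns"):
        if key not in meta:
            raise ValueError(f"missing metadata field: {key}")
    if meta["primary_key"] not in meta["columns"]:
        raise ValueError("primary_key must be a declared column")
    return True

check_metadata(metadata)
print(json.dumps(metadata, indent=2))
```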
Detailed documentation for each evaluation module is located under `evaluation/descriptions/`.
The `use_cases/` directory contains scenario-focused extensions that demonstrate how SynEval can be adapted to solve concrete problems beyond the core evaluation CLI (e.g., NER analysis, differential privacy dashboards).
As we implement more evaluation metrics, this README will be updated with additional documentation for each component.
This project is licensed under the MIT License - see the LICENSE file for details.