RelGen is short for Relational Data Generation. It is a tool for generating relational data for use in databases. The pronunciation of "Rel" is close to "Real", which hints that the relational data produced by RelGen is intended to be realistic and reliable.
RelGen is a Python library for generating realistic relational data. It uses a variety of deep generative models and algorithms to learn the data distribution from real data and generate high-quality synthetic data. RelGen can be applied to database system testing, data publishing and cross-domain data flow, and data augmentation for machine learning.
Figure 1: RelGen Overall Architecture
- ✨ Supports multiple fields and scenarios. RelGen is suitable for a variety of scenarios, including private data release, data augmentation, database testing and so on.
- ✨ Advanced relational data generation models and algorithms. RelGen provides users with a variety of deep generative models to choose from, and uses effective relational data generation algorithms to generate high-quality relational data.
- ✨ Comprehensive quality evaluation for generated relational data. RelGen comprehensively evaluates the quality of generated relational data from multiple dimensions, and visualizes the difference between real relational data and generated relational data.
| Important Link | Notes |
|---|---|
| 📑Tutorial | Contains several examples of generating databases with RelGen |
| 📦Package | Contains the code implementation of the RelGen project |
| 📖Docs | The documentation of this project |
RelGen requires Python 3.7 or later. You can install RelGen using either of the following methods.

Install from pip:

pip install relgen

Install from source:

git clone https://github.com/ruc-datalab/RelGen.git && cd RelGen

pip install -r requirements.txt

In this section, you will learn how to use the RelGen package through a simple example. You will load a dataset with RelGen, construct a model for data synthesis, train the model, and generate data samples from it.
Load a demo dataset to get started. This dataset is a single table of census records. You can find the data in datasets/census.
Load metadata for the census dataset. Metadata usually contains descriptive information about the dataset, such as field names, types, associations, etc., and is used to help better understand and process the data.
from relgen.data.metadata import Metadata
metadata = Metadata()
metadata.load_from_json("datasets/census/metadata.json")

Load data for the census dataset.
import pandas as pd
data = {
"census": pd.read_csv("datasets/census/census.csv")
}

A brief introduction to the census dataset:
This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
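The extraction conditions above amount to a simple row filter. The following is a minimal sketch in plain Python, not part of RelGen. The toy records are made-up values keyed by the variable names used in the condition (AAGE, AGI, AFNLWGT, HRSWK), and the helper name `is_clean` is hypothetical; the columns in the actual census.csv may be named differently.

```python
# Toy records with made-up values, keyed by the variable names in the condition.
records = [
    {"AAGE": 39, "AGI": 77516, "AFNLWGT": 2174, "HRSWK": 40},
    {"AAGE": 15, "AGI": 500, "AFNLWGT": 10, "HRSWK": 20},     # fails AAGE > 16
    {"AAGE": 52, "AGI": 80, "AFNLWGT": 300, "HRSWK": 45},     # fails AGI > 100
    {"AAGE": 28, "AGI": 33000, "AFNLWGT": 1200, "HRSWK": 0},  # fails HRSWK > 0
]


def is_clean(record):
    """Return True when a record satisfies all four extraction conditions."""
    return (
        record["AAGE"] > 16
        and record["AGI"] > 100
        and record["AFNLWGT"] > 1
        and record["HRSWK"] > 0
    )


# Keep only the records that pass every condition.
clean_records = [r for r in records if is_clean(r)]
print(len(clean_records))
```

Only records that pass all four checks are kept, which mirrors the `&&` chain in the original condition.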
Combine the metadata with the actual data to create and process the dataset in preparation for the remaining steps.
from relgen.data.dataset import Dataset
dataset = Dataset(metadata)
dataset.fit(data)

Next, we can create a RelGen synthesizer, a tool designed to generate relational data. The synthesizer learns patterns from real datasets, capturing the structures and distributions present in the original data, and then reproduces those patterns to create new, synthetic relational datasets.
from relgen.synthesizer.arsynthesizer import MADESynthesizer
synthesizer = MADESynthesizer(dataset)
synthesizer.fit(data)

The synthesizer is now capable of generating relational data.
sampled_data = synthesizer.sample()

The RelGen library allows you to evaluate the generated relational data by comparing it to the real data. Let's start by creating an evaluator.
from relgen.evaluator import Evaluator
evaluator = Evaluator(data["census"], sampled_data["census"])

Show a comparison histogram of the data distributions of the real data and the generated data. Users can see whether the distributions of the selected features are consistent, and thus assess the performance of the generative model and the quality of the generated data.
evaluator.eval_histogram(columns=["age", "sex", "relationship"])

Show a comparison t-SNE plot of the data distributions of the real data and the generated data. The t-SNE plot helps the user observe the overall structural similarity between the generated data and the real data, and evaluate the effectiveness of the generative model.
evaluator.eval_tsne()

The code of Quick Start can be found in Quick Start.
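Alongside the visual comparisons in the evaluation step, a single number per column can be a useful sanity check. The following is a minimal sketch, not RelGen's evaluator: it computes the total variation distance between the category frequencies of one column in the real and generated data. The function name `total_variation_distance` and the toy values are hypothetical.

```python
from collections import Counter


def total_variation_distance(real_values, generated_values):
    """Half the L1 distance between two empirical categorical distributions.

    Returns 0.0 for identical frequencies and 1.0 for disjoint categories.
    """
    if not real_values or not generated_values:
        raise ValueError("both inputs must be non-empty")
    real_counts = Counter(real_values)
    gen_counts = Counter(generated_values)
    # Compare frequencies over every category seen in either sample.
    categories = set(real_counts) | set(gen_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real_values) - gen_counts[c] / len(generated_values))
        for c in categories
    )


# Toy column values (made up for illustration).
real_sex = ["Male", "Male", "Female", "Female"]
generated_sex = ["Male", "Male", "Male", "Female"]
tvd = total_variation_distance(real_sex, generated_sex)
print(tvd)
```

Assuming the real and sampled tables are pandas DataFrames, as the `Evaluator` call in Quick Start suggests, a column could be passed in as `data["census"]["sex"].tolist()`. Values closer to 0 indicate that the generated column's frequencies are closer to the real ones.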
You can try more examples in the tutorial. If you have any questions, please contact us.
If you find RelGen useful for your research or development, please cite the following paper: Tabular data synthesis with generative adversarial networks: design space and optimizations.
@article{liu2024tabular,
title={Tabular data synthesis with generative adversarial networks: design space and optimizations},
author={Liu, Tongyu and Fan, Ju and Li, Guoliang and Tang, Nan and Du, Xiaoyong},
journal={The VLDB Journal},
volume={33},
number={2},
pages={255--280},
year={2024},
publisher={Springer}
}



