GitHub - ruc-datalab/RelGen

RelGen, short for Relational Data Generation, is a tool for generating relational data for use in databases. The pronunciation of "Rel" is close to "Real," a nod to the goal of producing relational data that is authentic and reliable.

Overview

RelGen is a Python library for generating realistic relational data. It uses a variety of deep generative models and algorithms to learn the data distribution from real data and generate high-quality synthetic data. RelGen can be applied to database system testing, data publishing and cross-domain data flow, and data augmentation for machine learning.

Figure 1: RelGen Overall Architecture

Features

✨ Supports multiple fields and scenarios. RelGen is suitable for a variety of scenarios, including private data release, data augmentation, database testing and so on.
✨ Advanced relational data generation models and algorithms. RelGen provides users with a variety of deep generative models to choose from, and uses effective relational data generation algorithms to generate high-quality relational data.
✨ Comprehensive quality evaluation for generated relational data. RelGen comprehensively evaluates the quality of generated relational data from multiple dimensions, and visualizes the difference between real relational data and generated relational data.

Architecture of this project

Important Link	Notes
📑Tutorial	Contains several examples of generating databases with RelGen
📦Package	Contains the code implementation of the RelGen project
📖Docs	The documentation of this project

Installation

RelGen requires Python 3.7 or later. You can install relgen using either of the following methods.

Install from pip

pip install relgen

Install from source

git clone https://github.com/ruc-datalab/RelGen.git && cd RelGen
pip install -r requirements.txt
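After either install path, a quick import check can confirm that the package is visible to the interpreter. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical and not part of RelGen. For a source install it assumes Python is started from the repository root, since the commands above install the requirements but do not themselves install the package.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be located on the current import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "relgen" is the top-level package name used by the imports in this README.
    print("relgen importable:", is_installed("relgen"))
```

If this reports `False` after a source install, running Python from inside the cloned `RelGen` directory is the first thing to check.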

Quick-Start

In this section, you will learn how to use the RelGen package through a simple example. You will load a dataset with RelGen, construct a model for data synthesis, train the model, and generate data samples from it.

Loading Dataset

Load a demo dataset to get started. This dataset is a single table of census records. You can find the data in datasets/census.

Load metadata for the census dataset. Metadata usually contains descriptive information about the dataset, such as field names, types, associations, etc., and is used to help better understand and process the data.

from relgen.data.metadata import Metadata

metadata = Metadata()
metadata.load_from_json("datasets/census/metadata.json")

Load data for the census dataset.

import pandas as pd

data = {
    "census": pd.read_csv("datasets/census/census.csv")
}

About the census dataset

This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
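The extraction condition quoted above can be written as a pandas boolean mask. This is only an illustrative sketch: the rows are made-up toy values, the function name `extract_clean_records` is hypothetical, and the column names are the raw variable names from the condition, which may differ from the column names in the census.csv file shipped with RelGen.

```python
import pandas as pd

# Toy rows for illustration only; these values are made up and are not census records.
raw = pd.DataFrame(
    {
        "AAGE": [15, 30, 45, 52],
        "AGI": [500, 50, 20000, 31000],
        "AFNLWGT": [100, 200, 300, 0],
        "HRSWK": [10, 40, 40, 38],
    }
)


def extract_clean_records(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows satisfying (AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)."""
    mask = (
        (df["AAGE"] > 16)
        & (df["AGI"] > 100)
        & (df["AFNLWGT"] > 1)
        & (df["HRSWK"] > 0)
    )
    return df[mask].reset_index(drop=True)


clean = extract_clean_records(raw)
print(clean)
```

In the toy frame, only the third row satisfies all four conditions; each of the other rows fails exactly one of them.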

Combine the metadata with the actual data to create and preprocess the dataset in preparation for the remaining steps.

from relgen.data.dataset import Dataset

dataset = Dataset(metadata)
dataset.fit(data)

Generating Data

Next, we can create a RelGen synthesizer, a tool designed to generate relational data. The synthesizer learns the structures and distributions present in the real dataset, then reproduces those patterns to create new, synthetic relational data.

from relgen.synthesizer.arsynthesizer import MADESynthesizer

synthesizer = MADESynthesizer(dataset)
synthesizer.fit(data)

The synthesizer is now capable of generating relational data.

sampled_data = synthesizer.sample()
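Before evaluating, it can be useful to confirm that the sampled tables have the same columns as the real ones. The sketch below assumes, as the evaluation step's `sampled_data["census"]` suggests, that the sampled data is a mapping from table name to DataFrame. The helper `check_schema` is hypothetical and not part of RelGen, and the toy DataFrames stand in for the real and sampled tables.

```python
import pandas as pd


def check_schema(real: dict, sampled: dict) -> dict:
    """For each real table, list the columns that are missing from the sampled table."""
    report = {}
    for table_name, real_df in real.items():
        if table_name not in sampled:
            # The whole table is missing, so every column is missing.
            report[table_name] = list(real_df.columns)
            continue
        sampled_cols = set(sampled[table_name].columns)
        report[table_name] = [c for c in real_df.columns if c not in sampled_cols]
    return report


# Toy stand-ins for the real and sampled tables (values are made up).
real = {"census": pd.DataFrame({"age": [39, 50], "sex": ["Male", "Female"]})}
sampled = {"census": pd.DataFrame({"age": [41, 33, 27], "sex": ["Female", "Male", "Male"]})}

missing = check_schema(real, sampled)
print(missing)
```

An empty list for a table means every real column is present in the sampled table; the row counts may legitimately differ.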

Evaluating Data

The RelGen library allows you to evaluate the generated relational data by comparing it to the real data. Let's start by creating an evaluator.

from relgen.evaluator import Evaluator

evaluator = Evaluator(data["census"], sampled_data["census"])

Show a histogram comparing the distributions of the real data and the generated data. This lets you check visually whether the distributions of these key features are consistent, and thus assess the performance of the generative model and the quality of the generated data.

evaluator.eval_histogram(columns=["age", "sex", "relationship"])
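As a numeric complement to the histogram, the gap between two categorical distributions can be summarized with the total variation distance (half the sum of absolute differences between category frequencies). This is a generic sketch and not how `eval_histogram` is implemented; the function name and the toy values are assumptions for illustration.

```python
import pandas as pd


def total_variation_distance(real: pd.Series, generated: pd.Series) -> float:
    """Return 0.5 * sum(|p - q|) over all categories seen in either series."""
    p = real.value_counts(normalize=True)
    q = generated.value_counts(normalize=True)
    # Align both frequency tables on the union of categories, filling gaps with 0.
    categories = p.index.union(q.index)
    p = p.reindex(categories, fill_value=0.0)
    q = q.reindex(categories, fill_value=0.0)
    return float(0.5 * (p - q).abs().sum())


# Toy columns standing in for data["census"]["sex"] and sampled_data["census"]["sex"].
real_sex = pd.Series(["Male", "Male", "Female", "Female"])
gen_sex = pd.Series(["Male", "Male", "Male", "Female"])

tvd = total_variation_distance(real_sex, gen_sex)
print(f"total variation distance: {tvd:.2f}")
```

A value of 0 means the two columns have identical category frequencies, and 1 means they share no categories at all.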

Show a t-SNE plot comparing the distributions of the real data and the generated data. The t-SNE plot helps you observe the overall structural similarity between the generated data and the real data, and evaluate the effectiveness of the generative model.

evaluator.eval_tsne()

The code for this Quick-Start can be found in QuickStart.ipynb.

Try More Examples

You can try more examples in the tutorial directory. If you have any questions, please contact us.

Cite Us

If you find RelGen useful for your research or development, please cite the following paper: Tabular data synthesis with generative adversarial networks: design space and optimizations.

@article{liu2024tabular,
  title={Tabular data synthesis with generative adversarial networks: design space and optimizations},
  author={Liu, Tongyu and Fan, Ju and Li, Guoliang and Tang, Nan and Du, Xiaoyong},
  journal={The VLDB Journal},
  volume={33},
  number={2},
  pages={255--280},
  year={2024},
  publisher={Springer}
}

