ruc-datalab/RelGen

RelGen v0.1

RelGen is short for Relational Data Generation. It is a tool for generating relational data for use in databases. The pronunciation of "Rel" is close to "Real", which hints at the goal of the project: the relational data produced by RelGen should be realistic and reliable.

Overview

RelGen is a Python library for generating realistic relational data. It uses a variety of deep generative models and algorithms to learn the data distribution from real data and generate high-quality synthetic data. RelGen can be applied to database system testing, data publishing and cross-domain data flow, as well as data augmentation for machine learning.

Figure 1: RelGen Overall Architecture

Features

  • Supports multiple fields and scenarios. RelGen is suitable for a variety of scenarios, including private data release, data augmentation, database testing and so on.

  • Advanced relational data generation models and algorithms. RelGen provides users with a variety of deep generative models to choose from, and uses effective relational data generation algorithms to generate high-quality relational data.

  • Comprehensive quality evaluation for generated relational data. RelGen comprehensively evaluates the quality of generated relational data from multiple dimensions, and visualizes the difference between real relational data and generated relational data.

Architecture of this project

Important Link    Notes
📑 Tutorial       Contains several examples of generating databases using RelGen
📦 Package        Contains the code implementation of the RelGen project
📖 Docs           The documentation of this project

Installation

RelGen requires Python 3.7 or later. You can install RelGen using either of the following methods.

Install from pip

pip install relgen

Install from source

git clone https://github.com/ruc-datalab/RelGen.git && cd RelGen
pip install -r requirements.txt
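
As a quick sanity check around installation, the short sketch below reports whether the running interpreter meets the Python 3.7 requirement stated above and whether a package named `relgen` can be located. The helper name `check_environment` is hypothetical and not part of RelGen; it relies only on the standard library.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 7), package="relgen"):
    """Report whether the interpreter is new enough and the package can be found."""
    return {
        # sys.version_info compares element-wise against a plain tuple.
        "python_ok": sys.version_info >= min_version,
        # find_spec returns None when the top-level package cannot be located.
        "package_installed": importlib.util.find_spec(package) is not None,
    }


if __name__ == "__main__":
    print(check_environment())
```

If `package_installed` is reported as `False`, the package is not visible to the interpreter that ran the check, which often means it was installed into a different environment.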

Quick-Start

In this section, you will learn how to use the RelGen package through a simple example. You will load a dataset with RelGen, construct a model for data synthesis, train the model, and generate data samples from it.

Loading Dataset

Load a demo dataset to get started. This dataset is a single table describing the census. You can find this data in census.

Load metadata for the census dataset. Metadata usually contains descriptive information about the dataset, such as field names, types, associations, etc., and is used to help better understand and process the data.

from relgen.data.metadata import Metadata

metadata = Metadata()
metadata.load_from_json("datasets/census/metadata.json")
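
Before handing a metadata file to RelGen, it can help to look at what the file contains. The sketch below is a minimal, standard-library-only way to list the top-level keys of a JSON file, assuming the file holds a JSON object. The helper `top_level_summary` and the placeholder keys are hypothetical; they are made up for illustration and do not describe RelGen's actual metadata schema.

```python
import json
import os
import tempfile


def top_level_summary(path):
    """Return each top-level key of a JSON object file with the type name of its value."""
    with open(path, encoding="utf-8") as f:
        content = json.load(f)
    return {key: type(value).__name__ for key, value in content.items()}


# Hypothetical stand-in file so the sketch runs on its own; these keys are
# made up and are NOT RelGen's metadata schema.
placeholder = {"tables": {"census": {}}, "relationships": []}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "metadata.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump(placeholder, f)
    summary = top_level_summary(demo_path)

print(summary)
```

To inspect the real file, call `top_level_summary("datasets/census/metadata.json")` from the repository root.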

Load data for the census dataset.

import pandas as pd

data = {
    "census": pd.read_csv("datasets/census/census.csv")
}

A brief introduction to the census dataset

This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
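
The extraction condition quoted above can be read as a simple row filter. The sketch below expresses it as a Python predicate, using the field names exactly as they appear in the condition. The function name `keep_record` and the sample records are made up for illustration; reading `AAGE` as age and `HRSWK` as hours worked per week is an assumption, and the published census file may use different column names.

```python
def keep_record(record):
    """Apply the four extraction conditions quoted above to one record."""
    return (
        record["AAGE"] > 16
        and record["AGI"] > 100
        and record["AFNLWGT"] > 1
        and record["HRSWK"] > 0
    )


# Made-up records for illustration only; they are not taken from the dataset.
records = [
    {"AAGE": 39, "AGI": 40000, "AFNLWGT": 77516, "HRSWK": 40},  # passes all four checks
    {"AAGE": 15, "AGI": 5000, "AFNLWGT": 1200, "HRSWK": 10},    # fails AAGE > 16
    {"AAGE": 52, "AGI": 25000, "AFNLWGT": 1000, "HRSWK": 0},    # fails HRSWK > 0
]

kept = [r for r in records if keep_record(r)]
print(len(kept))
```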

Combine the metadata with the actual data to create and process the dataset in preparation for the remaining steps.

from relgen.data.dataset import Dataset

dataset = Dataset(metadata)
dataset.fit(data)

Generating Data

Next, we can create a RelGen synthesizer, which is an advanced tool specifically designed to generate relational data. This synthesizer works by analyzing and learning patterns from real datasets, capturing the intricate structures and distributions present in the original data. Once it has understood these patterns, it replicates them to create new, synthetic relational datasets.

from relgen.synthesizer.arsynthesizer import MADESynthesizer

synthesizer = MADESynthesizer(dataset)
synthesizer.fit(data)

The synthesizer is now capable of generating relational data.

sampled_data = synthesizer.sample()

Evaluating Data

The RelGen library allows you to evaluate the generated relational data by comparing it to the real data. Let's start by creating an evaluator.

from relgen.evaluator import Evaluator

evaluator = Evaluator(data["census"], sampled_data["census"])

Show comparison histograms of the data distributions of the real data and the generated data. Users can see whether the distributions of these key features are consistent, and thus assess the performance of the generative model and the quality of the generated data.

evaluator.eval_histogram(columns=["age", "sex", "relationship"])
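
To complement the visual comparison with a number, one option is to measure how far apart two histograms are. The sketch below computes the total variation distance between the relative frequencies of a categorical column in real and generated data, using only the standard library. It is an illustrative stand-in and not how RelGen's `Evaluator` works internally; the function names and toy columns are made up, and both inputs are assumed to be non-empty.

```python
from collections import Counter


def category_frequencies(values):
    """Map each distinct value to its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def total_variation_distance(real_values, generated_values):
    """Half the sum of absolute frequency differences; 0 means identical histograms."""
    real = category_frequencies(real_values)
    generated = category_frequencies(generated_values)
    categories = set(real) | set(generated)
    return 0.5 * sum(
        abs(real.get(c, 0.0) - generated.get(c, 0.0)) for c in categories
    )


# Made-up toy columns for illustration only.
real_sex = ["Male", "Male", "Female", "Female"]
generated_sex = ["Male", "Male", "Male", "Female"]

distance = total_variation_distance(real_sex, generated_sex)
print(distance)
```

The distance ranges from 0 (identical histograms) to 1 (no shared categories), so smaller values indicate that the generated column follows the real distribution more closely.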

Show a comparison t-SNE plot of the data distributions of the real data and the generated data. The t-SNE plot helps the user observe the overall structural similarity between the generated data and the real data, and evaluate the effectiveness of the generative model.

evaluator.eval_tsne()

The code for this Quick-Start can be found in Quick Start.

Try More Examples

You can try more examples in the tutorial. If you have any questions, please contact us.

Cite Us

If you find RelGen useful for your research or development, please cite the following paper: Tabular data synthesis with generative adversarial networks: design space and optimizations.

@article{liu2024tabular,
  title={Tabular data synthesis with generative adversarial networks: design space and optimizations},
  author={Liu, Tongyu and Fan, Ju and Li, Guoliang and Tang, Nan and Du, Xiaoyong},
  journal={The VLDB Journal},
  volume={33},
  number={2},
  pages={255--280},
  year={2024},
  publisher={Springer}
}
