golden-eggs-lab/faster

Faster: Functionally Augmented Semantic Types Generation with Foundation Models


Abstract

Semantic types are important for understanding data, but pure semantic labels are often insufficient for practical use. Columns with the same semantic type may differ in format, units, or representation, requiring substantial manual effort to normalize, validate, and transform values before integration across datasets. To address this gap, our prior work introduced Functional Semantic Types (FSTs), which encapsulate both semantic meaning and functional transformations as executable Python classes. While promising, the initial approach generated FSTs independently per table, resulting in redundant, loosely scoped types, and a synthetic ontology. In this work, we present Faster, a new system for generating accurate and robust FSTs using foundation models. Faster aggregates column semantics across tables, applies an iterative multi-stage strategy for consistent FST construction, and organizes types into a semantic forest that captures real-world hierarchies. Extensive experiments on large-scale datasets demonstrate that Faster substantially improves FST quality and structural coherence.


System Overview

[Figure: system overview]


Setup

1. Install Environment

We use conda for dependency management. Run the following, replacing your-env-name with the environment name defined in environment.yml:

conda env create -f environment.yml
conda activate your-env-name

2. Set Your OpenAI API Key

export OPENAI_API_KEY=your-api-key
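
A missing key typically only surfaces once the first LLM call fails, so it can help to check for it up front. The following is a minimal sketch, not part of this repository; the helper name `require_api_key` is hypothetical.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the given environment mapping, or raise a clear error.

    Hypothetical helper for illustration; it is not defined in this repository.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the pipelines.")
    return key
```

Calling `require_api_key()` at the start of a session fails fast if the variable was not exported in the current shell.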

Data Preparation

Organize your data under the assets/ directory like so:

assets/{data_universe_name}/tables/{product_name}/{table_name}.csv

Example:

assets/biodivtab/tables/product1/table1.csv

We provide a sample data universe, BiodivTab.

Note: For demonstration purposes, we made adaptations to this dataset, including renaming products and selecting a subset of tables.
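
To confirm that a data universe follows the layout above before building graphs, the directory can be scanned with a few lines of Python. This is a sketch under the assumption that every table is a .csv file exactly one level below tables/; the function `list_tables` is hypothetical and not part of this repository.

```python
from pathlib import Path


def list_tables(universe, assets_root="assets"):
    """Map each product name to the sorted table names (CSV stems) found under it.

    Assumes the layout assets/{data_universe_name}/tables/{product_name}/{table_name}.csv.
    Hypothetical helper for illustration only.
    """
    tables_dir = Path(assets_root) / universe / "tables"
    layout = {}
    for csv_path in sorted(tables_dir.glob("*/*.csv")):
        product = csv_path.parent.name
        layout.setdefault(product, []).append(csv_path.stem)
    return layout
```

For example, `list_tables("biodivtab")` returns a dictionary keyed by product name; an empty result suggests the tables are not where the pipelines expect them.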


Building Graphs

To build the graph using either the faster or fstogen pipeline:

# Faster
bash build_graph.sh biodivtab faster 1

# FSTO-Gen
bash build_graph.sh biodivtab fstogen 1

Graphs will be saved under:

assets/biodivtab/graph/

Building Semantic Forests

Once the Faster graph is built, construct the semantic forest:

bash build_forest.sh biodivtab 1

Forests are stored at:

assets/biodivtab/forest/
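
The graph and forest steps can also be chained from Python. The sketch below only wraps the shell commands shown in this README; the function names are hypothetical, the trailing argument ("1" in the examples) is passed through unchanged because its meaning is not documented here, and skipping the forest step for fstogen is an assumption based on the forest being described only for the Faster graph.

```python
import subprocess


def pipeline_commands(universe, pipeline="faster", trailing_arg="1"):
    """Return the README's shell-wrapper invocations as argument lists.

    Hypothetical helper: it mirrors the documented commands and adds no new flags.
    """
    if pipeline not in ("faster", "fstogen"):
        raise ValueError(f"unknown pipeline: {pipeline!r}")
    commands = [["bash", "build_graph.sh", universe, pipeline, trailing_arg]]
    # Assumption: the semantic forest is built only on top of a Faster graph.
    if pipeline == "faster":
        commands.append(["bash", "build_forest.sh", universe, trailing_arg])
    return commands


def run_pipeline(universe, pipeline="faster", trailing_arg="1"):
    """Run each command in order, stopping at the first failure."""
    for command in pipeline_commands(universe, pipeline, trailing_arg):
        subprocess.run(command, check=True)
```

Running `run_pipeline("biodivtab")` from the repository root is equivalent to invoking the two shell wrappers one after the other.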

Project Structure

Faster/
├── assets/                      
│   └── biodivtab/              
│       ├── forest/             # Output semantic forests
│       ├── graph/              # Output graphs
│       ├── tables/             # Input tables (organized by product)
│       ├── forest_stats.csv    # Stats log for forest generation
│       └── graph_stats.csv     # Stats log for graph generation
│
├── prompts/                    # Prompt templates for LLM interactions
│
├── build_graph.py              # Main script to launch graph building
├── build_graph.sh              # Shell wrapper for graph building
├── build_forest.py             # Main script to launch forest building
├── build_forest.sh             # Shell wrapper for forest building
│
├── pipeline_graph.py           # Graph generation pipeline logic
├── pipeline_forest.py          # Forest generation pipeline logic
├── creater.py                  # Graph creator classes to initialize pipeline
├── forest_utils.py             # Utilities for semantic forest
├── code_parsing.py             # Parsing and standardizing FSTs
├── prompt_utils.py             # Helpers for prompt crafting and token counting
├── ray_cmds.py                 # Ray-based parallel LLM calling functions
├── util.py                     # Data loading and preprocessing utils
├── semantic_type_base_classes.py
└── semantic_type_base_classes_gen.py

Logging and Results

Graph build stats:

assets/biodivtab/graph_stats.csv

Forest build stats:

assets/biodivtab/forest_stats.csv

Each record logs token usage, API calls, runtime, and run mode.
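
To get a quick total across runs, the stats files can be aggregated with the standard library. This is a sketch that assumes each file starts with a header row; the column names are not documented here, so the code sums every column whose values parse as numbers instead of naming specific fields. Both function names are hypothetical.

```python
import csv


def load_stats(path):
    """Read a stats CSV into a list of dicts, one per logged record.

    Assumes the first row is a header; this is not confirmed by the README.
    """
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def numeric_totals(rows):
    """Sum every column whose values parse as numbers; skip everything else."""
    totals = {}
    for row in rows:
        for column, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                continue  # non-numeric value, e.g. a textual run mode
            totals[column] = totals.get(column, 0.0) + number
    return totals
```

Passing the path of the graph or forest stats file to `load_stats` and the result to `numeric_totals` gives per-column sums such as total tokens or total runtime, depending on which columns the log actually contains.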


Data Availability

The data used in this study are available at the following link: Google Drive.


Citation

Coming soon: citation information will be added upon paper publication.


Acknowledgments

Some components in this repository are adapted from open-source contributions by Two Sigma Open Source, LLC under the Apache 2.0 License. See license headers in relevant files for details.
