Semantic types are important for understanding data, but pure semantic labels are often insufficient in practice. Columns with the same semantic type may differ in format, units, or representation, requiring substantial manual effort to normalize, validate, and transform values before they can be integrated across datasets. To close this gap, our prior work introduced Functional Semantic Types (FSTs), which encapsulate both semantic meaning and functional transformations as executable Python classes. While promising, that initial approach generated FSTs independently per table, resulting in redundant, loosely scoped types and a synthetic ontology. In this work, we present Faster, a new system for generating accurate and robust FSTs using foundation models. Faster aggregates column semantics across tables, applies an iterative multi-stage strategy for consistent FST construction, and organizes types into a semantic forest that captures real-world hierarchies. Extensive experiments on large-scale datasets demonstrate that Faster substantially improves FST quality and structural coherence.
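For intuition, the sketch below shows what an FST might look like as an executable Python class that couples a semantic label with validation and normalization logic. The class name, methods, and temperature example are purely illustrative, not Faster's actual base class (the real base classes live in semantic_type_base_classes.py):

```python
# Hypothetical FST sketch: a semantic label packaged with executable
# validation and normalization. Illustrative only; not the repo's real base class.
import re

class TemperatureCelsius:
    """Semantic type: temperature readings, normalized to degrees Celsius."""
    label = "temperature_celsius"

    def validate(self, value: str) -> bool:
        # Accept values such as "23.5", "23.5 C", or "74.3 F".
        return re.fullmatch(r"-?\d+(\.\d+)?\s*[CF]?", value.strip()) is not None

    def transform(self, value: str) -> float:
        # Normalize Fahrenheit readings to Celsius; pass Celsius through.
        s = value.strip()
        if s.endswith("F"):
            return (float(s[:-1]) - 32) * 5 / 9
        return float(s.rstrip("C").strip())
```

With such a type attached to a column, values in mixed units can be validated and brought into a single representation before integration across datasets.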
We use conda for dependency management. Run:

```bash
conda env create -f environment.yml
conda activate your-env-name
```

Then export your OpenAI API key:

```bash
export OPENAI_API_KEY=your-api-key
```
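Optionally, a quick sanity check (illustrative, not part of the repository) that the key is visible to Python before launching a pipeline:

```python
# Confirm the OpenAI API key is set in the current environment.
import os

key = os.environ.get("OPENAI_API_KEY")
if not key:
    raise SystemExit("OPENAI_API_KEY is not set; export it before running the pipelines.")
print(f"OPENAI_API_KEY detected (length {len(key)}).")
```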
Organize your data under the assets/ directory like so:

```
assets/{data_universe_name}/tables/{product_name}/{table_name}.csv
```

Example:

```
assets/biodivtab/tables/product1/table1.csv
```
We provide a sample data universe, BiodivTab. Note: for demonstration purposes, we adapted this dataset, renaming products and selecting a subset of tables.
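To sanity-check a universe's layout before building, a minimal sketch (this helper is illustrative and not part of the repository):

```python
# List every table in a data universe, assuming the
# assets/{universe}/tables/{product}/{table}.csv layout described above.
from pathlib import Path

def list_tables(universe: str, root: str = "assets") -> list[Path]:
    tables_dir = Path(root) / universe / "tables"
    if not tables_dir.is_dir():
        raise FileNotFoundError(f"Missing directory: {tables_dir}")
    # One subdirectory per product, each holding CSV tables.
    return sorted(tables_dir.glob("*/*.csv"))

for table in list_tables("biodivtab"):
    print(table)  # e.g. assets/biodivtab/tables/product1/table1.csv
```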
To build the graph using either the faster or fstogen pipeline:

```bash
# Faster
bash build_graph.sh biodivtab faster 1

# FSTO-Gen
bash build_graph.sh biodivtab fstogen 1
```

Graphs will be saved under:

```
assets/biodivtab/graph/
```
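After a run, a quick way to confirm that artifacts were written (illustrative; the file names and formats inside graph/ depend on the pipeline):

```python
# Count and preview the files produced by a graph build.
from pathlib import Path

graph_dir = Path("assets/biodivtab/graph")
files = sorted(p for p in graph_dir.rglob("*") if p.is_file()) if graph_dir.exists() else []
print(f"{len(files)} file(s) under {graph_dir}")
for p in files[:10]:
    print(p)
```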
Once the Faster graph is built, construct the semantic forest:

```bash
bash build_forest.sh biodivtab 1
```

Forests are stored at:

```
assets/biodivtab/forest/
```
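The on-disk format under forest/ is produced by pipeline_forest.py; the toy structure below only illustrates the parent-to-child shape a semantic forest captures (all type names are invented):

```python
# Toy semantic forest: roots are broad types, children are more specific ones.
forest = {
    "measurement": ["temperature", "humidity"],
    "location": ["country", "site"],
}

def walk(node: str, depth: int = 0) -> None:
    print("  " * depth + node)
    for child in forest.get(node, []):
        walk(child, depth + 1)

for root in forest:
    walk(root)
```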
Repository layout:

```
Faster/
├── assets/
│   └── biodivtab/
│       ├── forest/                   # Output semantic forests
│       ├── graph/                    # Output graphs
│       ├── tables/                   # Input tables (organized by product)
│       ├── forest_stats.csv          # Stats log for forest generation
│       └── graph_stats.csv           # Stats log for graph generation
│
├── prompts/                          # Prompt templates for LLM interactions
│
├── build_graph.py                    # Main script to launch graph building
├── build_graph.sh                    # Shell wrapper for graph building
├── build_forest.py                   # Main script to launch forest building
├── build_forest.sh                   # Shell wrapper for forest building
│
├── pipeline_graph.py                 # Graph generation pipeline logic
├── pipeline_forest.py                # Forest generation pipeline logic
├── creater.py                        # Graph creator classes to initialize pipeline
├── forest_utils.py                   # Utilities for the semantic forest
├── code_parsing.py                   # Parsing and standardizing FSTs
├── prompt_utils.py                   # Helpers for prompt crafting and token counting
├── ray_cmds.py                       # Ray-based parallel LLM calling functions
├── util.py                           # Data loading and preprocessing utils
├── semantic_type_base_classes.py
└── semantic_type_base_classes_gen.py
```
Graph build stats:

```
assets/{data_universe_name}/graph_stats.csv
```

Forest build stats:

```
assets/{data_universe_name}/forest_stats.csv
```
Each record logs token usage, API calls, runtime, and run mode.
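A minimal way to inspect these logs (illustrative; the column names in each CSV are whatever the pipelines write, so check the headers in your own runs):

```python
# Load and preview the graph/forest build logs for the biodivtab universe.
import pandas as pd

for stats_file in ["assets/biodivtab/graph_stats.csv",
                   "assets/biodivtab/forest_stats.csv"]:
    df = pd.read_csv(stats_file)
    print(f"{stats_file}: {len(df)} run(s) logged")
    print(df.tail())
```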
The data used in this study are available at the following link: Google Drive.
Citation information will be added upon paper publication.
Some components in this repository are adapted from open-source contributions by Two Sigma Open Source, LLC under the Apache 2.0 License. See license headers in relevant files for details.
