SemForest: Semantic-Aware Ontology Generation with Foundation Models

This repository contains the codebase for our research on automatic semantic forest ontology construction using Large Language Models (LLMs).

🧠 Abstract

Functional Semantic Types (FSTs) enrich column-level semantics by pairing type information with executable logic for data transformation and validation. However, to the best of our knowledge, the only existing FST generation method relies primarily on name-based merging, which yields flat, unstructured hierarchies that do not align with real-world semantic structures. We introduce SemForest, a framework that constructs a tree-structured semantic forest of FSTs. SemForest produces an ontology with interpretable semantic meaning by clustering related types in embedding space and using large language models to organize them into hierarchical trees. The resulting ontology improves interpretability and accelerates semantic retrieval through hierarchical navigation. Experiments on three public data universes demonstrate that SemForest improves retrieval recall while reducing search time compared to the existing baseline.

🖼️ System Overview

📦 Setup

🔧 1. Install Environment

We use conda for dependency management. Run:

conda env create -f environment.yml
conda activate semforest

🔑 2. Set Your OpenAI API Key

export OPENAI_API_KEY=your-api-key
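
A missing key usually only surfaces once the first API call fails, so it can help to check for it up front. The following is a minimal sketch, not part of the repository: `require_api_key` is a hypothetical helper name, and it only assumes the key is read from the `OPENAI_API_KEY` environment variable as exported above.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    This helper is hypothetical and is not defined in the SemForest codebase.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; run `export {name}=your-api-key` first"
        )
    return key
```

Calling `require_api_key()` before launching a long build fails fast with a readable message instead of partway through the run.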

📜 Data Preparation

Organize your data under the assets/ directory like so:

assets/{data_universe_name}/tables/{product_name}/{table_name}.csv

Example:

assets/biodivtab/tables/product1/table1.csv

We provide a sample data universe BiodivTab.

Note: For demonstration purposes, we made adaptations to this dataset, including renaming products and selecting a subset of tables.
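
To confirm that a data universe follows the layout above before building a forest, a short script can enumerate its tables. This is a sketch under assumptions: `list_tables` is a hypothetical helper, not a function from this repository, and it relies only on the `assets/{data_universe_name}/tables/{product_name}/{table_name}.csv` pattern described in this section.

```python
from pathlib import Path


def list_tables(universe, assets_root="assets"):
    """Map each product name to the sorted CSV table paths found under it.

    Hypothetical helper: it only inspects the directory layout and does not
    read or validate the CSV contents.
    """
    tables_dir = Path(assets_root) / universe / "tables"
    if not tables_dir.is_dir():
        return {}
    products = {}
    # One directory level per product, with CSV files directly inside it.
    for csv_path in sorted(tables_dir.glob("*/*.csv")):
        products.setdefault(csv_path.parent.name, []).append(csv_path)
    return products


# Example usage (assumes the sample universe is present):
# for product, tables in list_tables("biodivtab").items():
#     print(product, len(tables))
```

An empty result means the universe directory is missing or the tables are not nested one level below `tables/`.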

🌲 Building Semantic Forests

bash build_forest.sh biodivtab

Forests are stored at:

assets/biodivtab/forest/

📊 Benchmark Construction

To support recall-based evaluation and ensure reproducibility of semantic retrieval tasks, we release our own benchmark datasets for both joinability and concatenation evaluations.

benchmark/{data_universe_name}/
├── {data_universe_name}_source/   # Source data universe
├── {data_universe_name}_query/    # Query data universe
├── {data_universe_name}_join/     # Ground truth for joinability
└── {data_universe_name}_concat/   # Ground truth for concatenation

To download the benchmark data, run:

gdown 1w3PqXI8JSPjfYHUwiJUocYBYQhmeCb7A
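
For reference, recall-based evaluation compares the items a retrieval method returns against a ground-truth set. The sketch below shows a generic recall@k computation. It is not the evaluation script used in our experiments, and it assumes you have already loaded the ground truth into a collection of identifiers, since the file format of the ground-truth folders is not described here. `recall_at_k` is a hypothetical name.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear among the top-k retrieved items.

    retrieved: ranked list of identifiers (best match first).
    relevant:  iterable of ground-truth identifiers for the same query.
    Returns 0.0 when there are no relevant items, to avoid dividing by zero.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)


# Example with made-up identifiers:
# recall_at_k(["t1", "t2", "t3"], {"t1", "t3", "t4"}, k=3)
```

Averaging this value over all queries in the query data universe gives a single recall figure per method.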

🧱 Project Structure

SemanticForest/
├── assets/
├── figures/
├── prompts/
├── .gitignore
├── build_forest.py             # Main script to launch forest building
├── build_forest.sh             # Shell wrapper for forest building
├── code_parsing.py             # Parsing and standardizing FSTs
├── environment.yml             # Conda environment setup
├── forest_utils.py             # Utilities for semantic forest
├── pipeline_forest.py          # Forest generation pipeline logic
├── prompt_utils.py             # Helpers for prompt crafting and token counting
└── README.md                   # Project documentation

📊 Logging and Results

Graph build stats:

assets/graph_stats.csv

Each record logs token usage, API calls, and runtime.
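
To total these figures across runs, the stats file can be summed with the standard `csv` module. This is a minimal sketch: `summarize_stats` is a hypothetical helper, and the column names in the usage comment are placeholders, because the actual header of `graph_stats.csv` is not listed here. Check the file's first line and pass the real column names.

```python
import csv


def summarize_stats(path, columns):
    """Sum the named numeric columns across all records in a stats CSV.

    Missing or empty cells count as 0. Column names are supplied by the
    caller because the real header is not documented in this README.
    """
    totals = {name: 0.0 for name in columns}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for name in columns:
                totals[name] += float(row.get(name) or 0)
    return totals


# Example usage with placeholder column names (assumptions, not the real header):
# summarize_stats("assets/graph_stats.csv", ["tokens", "api_calls", "runtime_s"])
```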

📚 Citation

Coming soon — citation info will be added upon paper publication.

🙏 Acknowledgments

Some components in this repository are adapted from open-source contributions by Two Sigma Open Source, LLC under the Apache 2.0 License. See license headers in relevant files for details.
