ARCADE is a real-time multimodal data system that supports hybrid, continuous and semantic query processing across diverse data types, including text, image, vector, spatial, and relational data.
- Multimodal Support: Text, image, vector, spatial, and relational data.
- Unified Secondary Indexing: Efficient disk-based indexes for vector, spatial, and text data, integrated with LSM-tree storage.
- Hybrid Query Optimizer: Extends MySQLβs cost-based optimizer to support hybrid queries with joint ranking and filters over vector, spatial, text, and relational attributes.
- Continuous Queries: SYNC (time-based) and ASYNC (event-driven) continuous queries using incremental materialized views.
- Semantic Operators: Allows users to embed natural language intent directly into SQL, enabling filtering, transformation, extraction, joining, and ranking with AI-powered semantic analytics.
For details, please refer to the paper.
- [2025-09] π₯ ARCADE now supports a set of semantic operators for AI-powered query processing.
- RocksDB 9.8.0
- MySQL 8.0.32
- Boost (for MySQL build)
- FAISS (for vector features)
cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DWITH_SSL=system -DWITH_ZLIB=bundled -DMYSQL_MAINTAINER_MODE=0 -DENABLED_LOCAL_INFILE=1 -DENABLE_DTRACE=0 -DCMAKE_CXX_FLAGS="-march=native" -DFORCE_INSOURCE_BUILD=1 -DDOWNLOAD_BOOST=1 -DWITH_BOOST=../boost/ -DWITH_FB_VECTORDB=1 -DWITH_NEXT_SPATIALDB=1 -DWITH_SEMANTICDB=1
make -j8
make DESTDIR=[build_dir_path] install
Example my.cnf
[mysqld]
rocksdb
default-storage-engine=rocksdb
skip-innodb
default-tmp-storage-engine=MyISAM
log-bin
binlog-format=ROW
basedir=[build_dir_path]/usr/local/mysql
datadir=[build_dir_path]/usr/local/mysql/data
port=3306Set environment variables:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:[root_dir_path]/faiss/faiss
# API key required for semantic operators
export OPENAI_API_KEY=ββ Initialize and start MySQL:
bin/mysqld --defaults-file=[build_dir_path]/usr/local/mysql/my.cnf --initialize
bin/mysqld --defaults-file=[build_dir_path]/usr/local/mysql/my.cnf &
bin/mysql -u root -p --port=3306
Schema Definition
Tables are stored in separate column families using COMMENT 'cfname=...'.
CREATE TABLE `poi` (
`coordinate` point NOT NULL SRID 4326,
`id` int NOT NULL,
`text` text NOT NULL,
`text_embedding` json NOT NULL FB_VECTOR_DIMENSION 1536,
PRIMARY KEY (id) COMMENT 'cfname=cf1',
Spatial INDEX key1(coordinate) NEXT_SPATIAL_INDEX_TYPE 'global' COMMENT 'cfname=cf1',
INDEX key2(text_embedding) FB_VECTOR_INDEX_TYPE 'lsmidx' COMMENT 'cfname=cf1'
) ENGINE=ROCKSDB;Vector Index Centroids
Currently, Vector index requires loading pre-trained centroids.
Default: centroids_300.csv
To change centroids, edit:
spatial-x-db/rocksdb/table/block_based/block_based_table_factory.h
β function `SetIndexOptions()`
Loading Data
For test purpose load a small portion because semantic query is expensive!
SET GLOBAL local_infile = 1;
source load_poi.sql;
shutdown;(Reopen the DB to flush data to disk.)
ARCADE extends SQL with five semantic operators to enable natural language reasoning over your data. These operators leverage vector embeddings and LLM-backed processing to filter, extract, and rank results beyond exact keyword matching or simple similarity search.
SEMANTIC_FILTERβ Filter rows based on a natural language condition.SEMANTIC_MAPβ Apply transformation to each row.SEMANTIC_EXTRACTβ Extract structured information from unstructured text.SEMANTIC_JOINβ Join two tables based on semantic predicate over text columns.SEMANTIC_RANKβ Rank rows by semantic relevance to a query prompt.
Filter rows using natural language conditions:
-- Identify patient reports that mention diabetes
SELECT report_id, text
FROM patient_reports
WHERE SEMANTIC_FILTER_SINGLE_COL('Does {patient_reports.text} indicate the patient has diabetes?', text) = 1;Transform unstructured text into new values:
SELECT candidate_id,
SEMANTIC_MAP('Summarize candidate's key qualifications for the research scientist role from {resumes.content}', content) AS summary
FROM resumes;Extract structured values from text fields:
-- Extract datasets mentioned in machine learning papers
SELECT paper_id,
SEMANTIC_EXTRACT('List the datasets used in {papers.abstract}', abstract) AS datasets
FROM papers;Rank rows by semantic similarity to a natural language query:
-- Rank climate news articles by relevance to policy impacts
SELECT id, title,
SEMANTIC_RANK(content_embedding, 'Retrive news articles with climate change policy impact') AS score
FROM news_articles
ORDER BY score DESC
LIMIT 10;SEMANTIC_RANK use the openai text-embedding-3-small model by default.
Make sure the embedding column values is also generated with the same model.
-- Match user projects with relevant research papers
SELECT projects.id, papers.id
FROM projects
JOIN papers
ON SEMANTIC_JOIN('Is {projects.description} relevant to {papers.abstract}?', projects.description, papers.abstract) = 1;Hybrid Search Query
SELECT t.content
FROM tweets t
WHERE L2_Distance(t.embedding, LLM(@query_text)) < @threshold
AND t.content LIKE '%keyword%'
AND ST_Contains(t.coordinate, @region);Hybrid NN Query
SELECT t.content
FROM tweets t
WHERE t.time BETWEEN @start_time AND @end_time
ORDER BY ST_Distance(t.coordinate, @location)
+ lambda * VECTOR_L2(t.text_embedding, LLM(@query_text))
LIMIT k;Continuous Query
SELECT c.id, c.name, COUNT(*) as count
FROM tweets t
JOIN City c ON ST_Contains(t.coordinate, c.geom)
WHERE L2_DISTANCE(t.embedding, LLM(@query_text)) < @threshold
GROUP BY c.id
ORDER BY count DESC
SYNC 60 seconds;If you use ARCADE in academic work, please cite:
@misc{yang2025arcaderealtimedatahybrid,
title={ARCADE: A Real-Time Data System for Hybrid and Continuous Query Processing across Diverse Data Modalities},
author={Jingyi Yang and Songsong Mo and Jiachen Shi and Zihao Yu and Kunhao Shi and Xuchen Ding and Gao Cong},
year={2025},
eprint={2509.19757},
archivePrefix={arXiv},
primaryClass={cs.DB},
url={https://arxiv.org/abs/2509.19757},
}
For questions, please open a GitHub issue or contact: π§ [jyang028@ntu.edu.sg]