Answering natural language (NL) questions about tables, known as Tabular Question Answering (TQA), is crucial because it allows users to quickly and efficiently extract meaningful insights from structured data, effectively bridging the gap between human language and machine-readable formats. Many of these tables are derived from web sources or real-world scenarios, which require meticulous data preparation (or data prep) to ensure accurate responses. However, preparing such tables for NL questions introduces new requirements that extend beyond traditional data preparation. This question-aware data preparation involves specific tasks such as column augmentation and filtering tailored to particular questions, as well as question-aware value normalization or conversion, highlighting the need for a more nuanced approach. Because each of these tasks is unique, a single model (or agent) may not perform effectively across all scenarios. In this paper, we propose AutoPrep, a large language model (LLM)-based multi-agent framework that leverages the strengths of multiple agents, each specialized in a certain type of data prep, ensuring more accurate and contextually relevant responses. Given an NL question over a table, AutoPrep performs data prep through three key components:

- Planner: determines a logical plan, outlining a sequence of high-level operations.
- Programmer: translates the logical plan into a physical plan by generating the corresponding low-level code.
- Executor: executes the generated code to process the table.

To support this multi-agent framework, we design a novel Chain-of-Clauses reasoning mechanism for high-level operation suggestion, and a tool-augmented method for low-level code generation. Extensive experiments on real-world TQA datasets demonstrate that AutoPrep can significantly improve state-of-the-art TQA solutions through question-aware data preparation.
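For intuition only, below is a minimal sketch of how such a Planner → Programmer → Executor pipeline could be wired together. All names here (`call_llm`, `Planner`, `Programmer`, `Executor`) are illustrative assumptions, not AutoPrep's actual API; please refer to the source code for the real implementation.

```python
# Illustrative sketch of a Planner -> Programmer -> Executor flow (not AutoPrep's real API).
from typing import List

import pandas as pd


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., an OpenAI-compatible chat endpoint)."""
    raise NotImplementedError


class Planner:
    """Suggests a sequence of high-level data-prep operations for a given question."""

    def plan(self, question: str, table: pd.DataFrame) -> List[str]:
        prompt = (
            f"Suggest data-prep operations for the question: {question}\n"
            f"Table columns: {list(table.columns)}"
        )
        return call_llm(prompt).splitlines()


class Programmer:
    """Translates each high-level operation into an executable code snippet."""

    def program(self, operations: List[str], table: pd.DataFrame) -> List[str]:
        return [call_llm(f"Write pandas code over `df` for: {op}") for op in operations]


class Executor:
    """Runs the generated code snippets to transform the table step by step."""

    def execute(self, snippets: List[str], table: pd.DataFrame) -> pd.DataFrame:
        for code in snippets:
            scope = {"df": table, "pd": pd}
            exec(code, scope)  # each snippet is expected to update `df`
            table = scope["df"]
        return table
```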
```bash
conda create -n ap python=3.9.15
conda activate ap
pip install -r requirements.txt
```
- Download the datasets with token `tllm` and unzip them to any path. The constructed TransTQ dataset can be accessed via TransTQ with token `tllm`.
- Modify `DATA_PATH` in `global_values.py` to the root path of your downloaded datasets (see the illustrative example below).
- Create a key file named `keys.txt` in the root and put your API keys in it, one key per line (see the example below).
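For concreteness, the two configuration steps might look as follows. The dataset path and key values are placeholders, and the assumption that `DATA_PATH` is a plain string assignment in `global_values.py` is ours.

```python
# global_values.py (illustrative): point DATA_PATH at the folder where you unzipped the datasets.
DATA_PATH = "/home/user/autoprep_datasets"
```

```text
sk-xxxxxxxxxxxxxxxxxxxxxxxx
sk-yyyyyyyyyyyyyyyyyyyyyyyy
```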
Main Takeaway: AutoPrep improves the accuracy of all TQA baselines, especially code-generation baselines.
Main Takeaway: AutoPrep integrated with NL2SQL achieves SOTA performance on two TQA datasets across four different LLM backbones.
❗ Please refer to Developer Guides when committing.
If you find our work helpful, please cite as:
```bibtex
@article{fan2024autoprep,
  title={AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework},
  author={Fan, Meihao and Fan, Ju and Tang, Nan and Cao, Lei and Li, Guoliang and Du, Xiaoyong},
  journal={arXiv preprint arXiv:2412.10422},
  year={2024}
}
```


