Causal AI Scientist (CAIS) is an LLM-powered tool for generating data-driven answers to natural language causal queries. It takes three inputs: a natural language query (for example, "Does participating in a job training program lead to higher income?"), an accompanying dataset, and a description of that dataset. CAIS then frames a suitable causal estimation problem by selecting appropriate treatment and outcome variables. It selects a suitable method for causal effect estimation, implements it, runs diagnostic tests, and finally interprets the numerical results in the context of the original query.
This repo includes instructions on both using the tool to perform causal analysis on a dataset of interest and reproducing results from our paper.
Note: This repository is a work in progress and will be updated with additional instructions and files.
Prerequisites:
- Python 3.10 (create a new conda environment first)
- Required Python libraries (specified in requirements.txt)
Step 1: Copy the example configuration
cp .env.example .env
Step 2: Create Python 3.10 environment
# Create a new conda environment with Python 3.10
conda create -n cais python=3.10
conda activate cais
pip install -r requirements.txt
Step 3: Set up the cais library
pip install -e .
All datasets used to evaluate CAIS and the baseline models are available in the data/ directory. Specifically:
- all_data: Folder containing all CSV files from the QRData and real-world study collections.
- synthetic_data: Folder containing all CSV files corresponding to synthetic datasets.
- qr_info.csv: Metadata for QRData files. For each file, this includes the filename, description, causal query, reference causal effect, intended inference method, and additional remarks.
- real_info.csv: Metadata for the real-world datasets.
- synthetic_info.csv: Metadata for the synthetic datasets.
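Before launching a run, it can help to confirm that every data file named in a metadata CSV is actually present in the data folder. The following is a minimal sketch using only the Python standard library. The helper names `load_metadata` and `missing_data_files` are hypothetical and not part of CAIS, and because the exact column headers of the metadata files are not listed here, the file name column is passed in as a parameter; inspect the CSV header first to find it.

```python
import csv
from pathlib import Path


def load_metadata(metadata_path):
    """Read a metadata CSV into a list of dicts, one dict per row."""
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def missing_data_files(rows, data_dir, filename_column):
    """Return file names listed in the metadata that are absent from data_dir.

    filename_column is whichever header holds the data file name; it is an
    assumption supplied by the caller, not a documented CAIS column name.
    """
    data_dir = Path(data_dir)
    return [
        row[filename_column]
        for row in rows
        if not (data_dir / row[filename_column]).is_file()
    ]


if __name__ == "__main__":
    metadata_path = Path("data/qr_info.csv")
    if metadata_path.is_file():
        rows = load_metadata(metadata_path)
        # Print the real headers so the file name column can be identified.
        print(f"{len(rows)} rows; columns: {list(rows[0].keys()) if rows else []}")
```

Once the header is known, `missing_data_files(rows, "data/all_data", "<file name column>")` returns an empty list when the metadata and the data folder agree.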
To execute CAIS, run:
python main/run_cais.py \
--metadata_path {path_to_metadata} \
--data_dir {path_to_data_folder} \
--output_dir {output_folder} \
--output_name {output_filename} \
--llm_name {llm_name} \
--llm_provider {llm_provider}
Args:
- metadata_path (str): Path to the CSV file containing the queries, dataset descriptions, and data file names
- data_dir (str): Path to the folder containing the data in CSV format
- output_dir (str): Path to the folder where the output JSON results will be saved
- output_name (str): Name of the JSON file where the outputs will be saved
- llm_name (str): Name of the LLM to be used (e.g., 'gpt-4', 'claude-3')
- llm_provider (str): Name of the LLM service provider (e.g., 'openai', 'anthropic', 'together')
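To make the argument list concrete, here is a hedged sketch of how these six flags could be declared with `argparse`. It is an illustration only and not the actual parser in run_cais.py: the function name `build_parser` is hypothetical, and treating every flag as required is an assumption, since the defaults used by the real script are not stated here.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags (not the repo's code)."""
    parser = argparse.ArgumentParser(description="Sketch of the CAIS command-line flags")
    # Assumption: all flags are required; the real script may define defaults.
    parser.add_argument("--metadata_path", type=str, required=True,
                        help="CSV with queries, dataset descriptions, and data file names")
    parser.add_argument("--data_dir", type=str, required=True,
                        help="Folder containing the data in CSV format")
    parser.add_argument("--output_dir", type=str, required=True,
                        help="Folder where the output JSON results are saved")
    parser.add_argument("--output_name", type=str, required=True,
                        help="Name of the JSON file for the outputs")
    parser.add_argument("--llm_name", type=str, required=True,
                        help="Name of the LLM to use")
    parser.add_argument("--llm_provider", type=str, required=True,
                        help="Name of the LLM service provider")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
example_args = build_parser().parse_args([
    "--metadata_path", "data/qr_info.csv",
    "--data_dir", "data/all_data",
    "--output_dir", "output",
    "--output_name", "results_qr_4o",
    "--llm_name", "gpt-4o-mini",
    "--llm_provider", "openai",
])
print(vars(example_args))
```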
A specific example:
python main/run_cais.py \
--metadata_path "data/qr_info.csv" \
--data_dir "data/all_data" \
--output_dir "output" \
--output_name "results_qr_4o" \
--llm_name "gpt-4o-mini" \
--llm_provider "openai"
Will be updated soon
- Keep your .env file secure and never commit it to version control
Distributed under the MIT License. See LICENSE for more information.
