Skip to content

A Python package for generating synthetic datasets and collecting real-world datasets from various Structural Causal Model

License

Notifications You must be signed in to change notification settings

yonghanjung/CausalDGP

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CausalDGP: A Python Engine for Structural Causal Models

A powerful and flexible Python library for creating, manipulating, and generating data from Structural Causal Models (SCMs) and analyzing causal graphs.

License: MIT


📖 Overview

CausalDGP is a Python engine for causal inference research and experimentation. It provides a robust framework for defining complex causal systems, generating data from them, and analyzing their underlying graphical properties.

While the engine is a general-purpose tool, it also serves as a comprehensive benchmark, providing a standardized collection of data generating processes (DGPs) from the causal inference literature. It's designed to help researchers evaluate and compare estimators under challenging scenarios, including:

  • Back-door and front-door adjustment
  • Unobserved confounding and complex "napkin" graphs
  • Longitudinal models with time-varying treatments
  • Advanced identification strategies beyond simple adjustments

✨ Key Features

The power of CausalDGP comes from its modular design:

  • Flexible SCM Engine (scm.py): Define any SCM by specifying variables, their causal parents, and their functional relationships. The engine handles the recursive data generation process automatically.
  • Rich Graph Toolkit (graph.py): A suite of utility functions for causal graph analysis, including d-separation checks, finding ancestors, implementing do-calculus rules, and generating interactive visualizations.
  • Benchmark Suite (generator.py): A ready-to-use collection of famous and challenging SCMs from the literature, perfect for evaluating new methods.

For detailed API documentation, please see our docs/ folder (coming soon).


🚀 Quick Start: The Benchmark Application

The easiest way to see CausalDGP in action is through the included benchmark generators (generator.py). These pre-built SCMs from the literature demonstrate the capabilities of the engine.

Setup

  1. Make sure you have the necessary libraries installed:
    pip install networkx scipy matplotlib numpy pandas pyvis
  2. Install CausalDGP as an editable package to ensure all imports work correctly.

Example: Generating Data

You can import any SCM from the generator module and use it to create a dataset. Here’s how to generate 1,000 samples from the Kang & Schafer (2007) simulation:

# 1. Import a pre-built SCM generator from the CausalDGP package
from CausalDGP.generator import Kang_Schafer

# 2. Get the SCM object, treatment, and outcome variable names
scm, treatments, outcomes = Kang_Schafer(seednum=42)

# 3. Use the SCM object to generate a pandas DataFrame
sample_data = scm.generate_samples(num_samples=1000)

# 4. Display the first few rows of your new dataset
print(sample_data.head())

📊 Available Benchmark Models

CausalDGP includes a variety of pre-built SCMs from the causal inference literature. Below is a summary. For detailed mathematical descriptions and graph visualizations for each model, please see our Benchmark Details Documentation.

Model Function Brief Description Key Features & Notes
BD_SCM A model for the standard back-door adjustment User-defined covariate dimensions.
Kang_Schafer A classic model from Kang & Schafer (2007). -
CCDDHNR2018_IRM Interactive-Regression-Model (IRM) from Chernozhukov et al. (2018). High-dimensional confounders
CCDDHNR2018_PLR Partially-Linear-Regression (PLR) model from Chernozhukov et al. (2018). Continuous treatment, high-dimensional confounders.
SBD_SCM Standard sequential back-door adjustment. Time-varying covariates and treatments.
mSBD_SCM Multi-outcome sequential back-door adjustment. Multiple sequential outcomes.
luedtke_2017_sim1_scm Longitudinal model from Luedtke et al., (2017). Sequential data with time-varying treatments.
Canonical_FD_SCM Canonical front-door graph. No observed confounders of T and Y.
Fulcher_FD A model for the Front-Door criterion, used in Fulcher et al., (2020) Front-door with observed confounders.
FD_SCM A model for the Front-door with covariates Front-door model with high-dimensional confounders
Bhattacharya2022_Fig2b_SCM A model used in Bhattacharya et al., (2022) Figure 2b Solvable by advanced identification, not front/back-door.
Bhattacharya2022_Fig3_SCM A model used in Bhattacharya et al., (2022) Figure 3 -
Bhattacharya2022_Fig5_SCM A model used in Bhattacharya et al., (2022) Figure 5 -
Bhattacharya2022_Fig5_SCM A model used in Bhattacharya et al., (2022) Figure 5 -
ConeCloud_15_SCM A model for the 15-node Cone Cloud graph from Figure 3b of Raichev et al., 2024 Large graph with categorical variables.
Napkin_SCM The minimal "Napkin" graph. Irreducible ratio of probabilities estimand.
Napkin_SCM_dim Napkin graph with high-dimensional covariates. Irreducible ratio with high-dimensional covariates.
Napkin_FD_SCM Combines Napkin and Front-Door structures. Irreducible ratio with front-door adjustment.
Nested_Napkin_SCM Nested graph from Tian (2002) Thesis Estimand is a ratio of two sequential back-door formulas.
Double_Napkin_SCM Two stacked Napkin graphs. Estimand is a ratio-of-ratios.
Plan_ID_SCM A model from Figure Figure 1b of Jung et al., 2021 Not solvable by simple back-door or front-door.

📜 Citation

If you use CausalDGP in your research, please consider citing it:

@software{CausalDGP_2025,
  author = {Jung, Yonghan},
  title = {{CausalDGP: A Python Engine for Structural Causal Models}},
  url = {[https://github.com/yonghanjung/CausalDGP](https://github.com/yonghanjung/CausalDGP)},
  version = {0.1.0},
  year = {2025}
}

About

A Python package for generating synthetic datasets and collecting real-world datasets from various Structural Causal Model

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published