
DML-Auto vs DML-CatBoost: Causal Inference Benchmark

Research question: Does replacing EconML's auto first-stage with CatBoost regressors improve CATE estimation in Double Machine Learning?

All three models use identical CausalForestDML second stages. The only variable is the first-stage nuisance estimator used to partial out confounding:

| Model | model_y | model_t |
|---|---|---|
| DML-Auto | EconML auto-select (LassoCV / Ridge / Forest via CV) | same |
| DML-CatBoost | CatBoostRegressor (default params, 500 iterations) | same |
| DML-CatBoost-Tuned | CatBoostRegressor tuned via Optuna (50 trials, R-loss objective) | same |
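
For concreteness, a minimal sketch of how the three configurations could be wired up. This is illustrative only: the actual wrappers live in src/models/dml_models.py, make_model and its defaults are assumptions, and the tuned variant would swap in Optuna-selected CatBoost parameters (see the R-loss sketch under Design decisions).

from econml.dml import CausalForestDML
from catboost import CatBoostRegressor

def make_model(first_stage: str, seed: int = 42) -> CausalForestDML:
    """Identical second stage everywhere; only the first-stage nuisance models vary."""
    if first_stage == "auto":
        # EconML CV-selects the nuisance models (LassoCV / Ridge / Forest).
        model_y = model_t = "auto"
    elif first_stage == "catboost":
        model_y = CatBoostRegressor(iterations=500, verbose=0, random_seed=seed)
        model_t = CatBoostRegressor(iterations=500, verbose=0, random_seed=seed)
    else:
        raise ValueError(first_stage)
    return CausalForestDML(model_y=model_y, model_t=model_t, random_state=seed)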

All metrics are computed on a held-out test set (80/20 split) — models never see evaluation data during fitting.

Note on small datasets: DML-CatBoost-Tuned is only evaluated on datasets with n_train ≥ 600 (see CB_TUNE_MIN_N in config.py). Experiments showed that 50-trial Optuna tuning is unreliable below this threshold: R-loss estimates are too noisy for the TPE surrogate to converge, and CausalForestDML itself becomes numerically unstable (catastrophic R² on ACIC 2019, n≈320; no improvement over single-split on IHDP with LOO-like 15-fold CV). On small-n datasets the two-way comparison (Auto vs CatBoost) is more informative. See PLAN.md §"Small-n exclusion" for full analysis.
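A sketch of the gate for illustration: CB_TUNE_MIN_N is real (from src/config.py), but the model names and helper below are hypothetical.

CB_TUNE_MIN_N = 600  # from src/config.py

def models_for(n_train: int) -> list[str]:
    models = ["DML-Auto", "DML-CatBoost"]
    if n_train >= CB_TUNE_MIN_N:  # tuning is unreliable below this threshold
        models.append("DML-CatBoost-Tuned")
    return models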


Key Findings

Based on 1–100 replications × 13 datasets. Significance: paired Wilcoxon signed-rank test, α = 0.05.

  • DML-Auto wins on PEHE in 7 of 13 datasets: it dominates all IHDP variants, Nie & Wager A/B, ACIC 2019, and packt_data11. EconML's CV-selected first stage (typically LassoCV) handles small-n semi-synthetic data better than gradient boosting.
  • Tuned CatBoost wins on smooth DGPs: significantly better on nie_wager_C, nie_wager_D, and dowhy_nonlinear (PEHE 0.572 vs 0.677 for Auto). Tuning recovers most of the untuned CatBoost's deficit.
  • Tuned CatBoost dominates ATE bias in linear settings: on dowhy_linear, 0.081 vs 0.207 (CatBoost) vs 0.437 (Auto). The direct R-loss tuning objective translates well to ATE precision.
  • ACIC 2019 is Auto's strongest setting: roughly 2× lower PEHE (0.288 vs 0.614) and ATE bias (0.153 vs 0.332) than CatBoost. The small-n (~400 train), moderate-p (~22 features) regime favors linear first stages.
  • Speed trade-offs: DML-Auto takes 1–10 s/rep on most datasets (145 s on twins); DML-CatBoost, 2–6 s/rep; DML-CatBoost-Tuned, 45–390 s/rep. Tuning rarely justifies the 30–150× overhead over auto-selection.

Results

Mean ± std across replications. Bold = best among the three models.

PEHE ↓ (lower is better; Precision in Estimation of Heterogeneous Effects)

| Dataset | Reps | DML-Auto | DML-CatBoost | DML-CatBoost-Tuned | Winner |
|---|---|---|---|---|---|
| acic2019 | 8 | **0.288 ± 0.251** | 0.614 ± 0.657 | — ¹ | Auto ✓ |
| ihdp_B | 100 | **3.748 ± 6.081** | 3.846 ± 6.325 | — ¹ | Auto |
| ihdp_econml_A | 100 | **0.200 ± 0.066** | 0.320 ± 0.100 | — ¹ | Auto ✓ |
| ihdp_econml_B | 100 | **2.892 ± 2.490** | 2.936 ± 2.547 | — ¹ | Auto |
| nie_wager_A | 100 | **0.156 ± 0.020** | 0.168 ± 0.018 | 0.158 ± 0.017 | Auto ✓ |
| nie_wager_B | 100 | **0.294 ± 0.031** | 0.329 ± 0.034 | **0.294 ± 0.031** | Auto / Tuned |
| nie_wager_C | 100 | 0.108 ± 0.015 | 0.101 ± 0.016 | **0.095 ± 0.013** | Tuned ✓ |
| nie_wager_D | 100 | 0.681 ± 0.033 | 0.714 ± 0.035 | **0.672 ± 0.034** | Tuned ✓ |
| econml_partial_linear | 100 | **33.5 ± 47.7** | 39.4 ± 48.9 | 39.5 ± 48.8 | ~tie |
| dowhy_nonlinear | 100 | 0.677 ± 0.105 | 0.675 ± 0.095 | **0.572 ± 0.095** | Tuned ✓ |
| packt_data11 | 1 | **0.735** | 1.380 | 1.116 | Auto |
| packt_earnings | 1 | 532.6 | **459.3** | 464.7 | CatBoost |
| twins | 1 | 0.172 | 0.172 | **0.171** | |

ATE bias ↓

| Dataset | Reps | DML-Auto | DML-CatBoost | DML-CatBoost-Tuned |
|---|---|---|---|---|
| dowhy_linear | 100 | 0.437 ± 0.287 | 0.207 ± 0.096 | **0.081 ± 0.074** |
| acic2019 | 8 | **0.153 ± 0.129** | 0.332 ± 0.391 | — ¹ |
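
Both metrics follow the standard definitions (the repo's implementations live in src/evaluation/metrics.py); a self-contained sketch:

import numpy as np

def pehe(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """PEHE: root-mean-squared error of the per-unit CATE estimates."""
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))

def ate_bias(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """Absolute error of the estimated average treatment effect."""
    return float(abs(tau_hat.mean() - tau_true.mean()))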

Fit time (mean seconds/replication)

| Dataset | DML-Auto | DML-CatBoost | DML-CatBoost-Tuned |
|---|---|---|---|
| acic2019 | 2.7 | 4.3 | — ¹ |
| nie_wager_A–D | 8–10 | 2–3 | 112–134 |
| ihdp (all) | 0.8–1.0 | 2.0 | — ¹ |
| econml_partial_linear | 9.5 | 4.3 | 276 |
| twins | 145 | 20 | 388 |

¹ DML-CatBoost-Tuned not evaluated: n_train < 600. See note above and PLAN.md §"Small-n exclusion".


Datasets

| Dataset | n/rep | Reps | Ground truth | Source |
|---|---|---|---|---|
| IHDP-B (Hill 2011) | 598 | 100 | ITE + ATE | fredjo.com |
| IHDP-A (EconML DGP) | 598 | 100 | ITE + ATE | EconML |
| IHDP-B (EconML DGP) | 598 | 100 | ITE + ATE | EconML |
| Twins (NBER) | ~9100 | 1 | ITE + ATE | CEVAE repo |
| Nie & Wager A | 5000 | 100 | ITE + ATE | causalml |
| Nie & Wager B | 5000 | 100 | ITE + ATE | causalml |
| Nie & Wager C | 5000 | 100 | ITE + ATE | causalml |
| Nie & Wager D | 5000 | 100 | ITE + ATE | causalml |
| EconML partial linear | 2000 | 100 | ITE | EconML |
| DoWhy linear SCM | 5000 | 100 | ATE only | DoWhy |
| DoWhy nonlinear SCM | 5000 | 100 | ITE + ATE | DoWhy |
| Packt data_11 | 5000 | 1 | ITE + ATE | Packt GitHub |
| Packt earnings | ~900 | 1 | ITE + ATE | Packt GitHub |
| ACIC 2019 | ~400 | 8 | ITE + ATE | ACIC 2019 challenge |

Excluded: Jobs (no per-unit ITE, ATE unidentified without strong assumptions).
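
Every loader returns the same structure with a canonical train/test split. A sketch of the contract: base.py does define CausalDataset, but the exact field names below are assumptions for illustration.

from typing import NamedTuple, Optional
import numpy as np

class CausalDataset(NamedTuple):
    X_train: np.ndarray
    T_train: np.ndarray
    Y_train: np.ndarray
    X_test: np.ndarray
    T_test: np.ndarray
    Y_test: np.ndarray
    ite_test: Optional[np.ndarray]  # ground-truth ITEs; None for dowhy_linear (ATE only)
    true_ate: Optional[float]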


Quickstart

Option 1 — uv (recommended)

git clone https://github.com/your-username/DML-Auto-vs-Catboost
cd DML-Auto-vs-Catboost
uv sync                         # install exact locked dependencies

# 30-second demo (3 reps, no download needed)
uv run python src/run_comparison.py --dataset nie_wager_B --n_reps 3

# Full benchmark (downloads ~50 MB for IHDP + ACIC 2019)
uv run python src/run_comparison.py --dataset all --n_jobs -1

# Resume interrupted run
uv run python src/run_comparison.py --dataset all --n_jobs -1 --resume

# Analyse results
uv run jupyter lab notebooks/analysis.ipynb

Reproduce exact results

All randomness is controlled through a single seed in src/config.py:

GLOBAL_SEED: int = 42

Dependency versions are locked in uv.lock. To reproduce:

uv sync --frozen                # exact locked versions
uv run python src/run_comparison.py --dataset all

Run tests

uv run pytest                   # 97 tests, ~3.5 min
uv run pytest -m "not slow"     # fast subset (unit tests only)

Coverage report: results/coverage/index.html


Project structure

├── src/
│   ├── config.py               ← all seeds, paths, hyperparameters
│   ├── datasets/               ← 14 loaders with canonical train/test splits
│   │   ├── _http.py            ← shared download utility
│   │   ├── base.py             ← CausalDataset namedtuple
│   │   ├── ihdp.py             ← IHDP (fredjo.com + EconML)
│   │   ├── twins.py            ← Twins (NBER via CEVAE)
│   │   ├── jobs.py             ← Jobs / LaLonde (excluded: no ITE)
│   │   ├── nie_wager.py        ← Nie & Wager A–D
│   │   ├── econml_synthetic.py ← EconML DGPs
│   │   ├── dowhy_synthetic.py  ← DoWhy linear/nonlinear SCMs
│   │   ├── packt.py            ← Packt book datasets
│   │   └── acic2019.py         ← ACIC 2019 test datasets (8 reps)
│   ├── models/
│   │   ├── base.py             ← CausalModel Protocol
│   │   ├── dml_models.py       ← AutoDML, CatBoostDML, TunedCatBoostDML wrappers
│   │   └── __init__.py         ← MODEL_REGISTRY
│   ├── evaluation/
│   │   ├── metrics.py          ← PEHE, CATE R², policy risk, ATE bias
│   │   └── stats.py            ← Wilcoxon + Bonferroni significance tests
│   └── run_comparison.py       ← parallel CLI runner with cross-session resume
├── notebooks/
│   └── analysis.ipynb          ← full analysis with figures and LaTeX table
├── results/
│   ├── raw/                    ← per-dataset CSV results
│   ├── figures/                ← boxplots, heatmaps, bar charts
│   └── tables/                 ← comparison_table.csv + .tex
└── tests/                      ← 97 pytest tests

Design decisions

Why CausalForestDML for the second stage? It is a state-of-the-art DML implementation: honest causal forest, local centering, cross-fitting, and confidence intervals out of the box.
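
Typical usage of the second stage (standard EconML API; the data names are placeholders):

from econml.dml import CausalForestDML

est = CausalForestDML(model_y="auto", model_t="auto", random_state=42)
est.fit(Y_train, T_train, X=X_train)              # cross-fitted nuisances + honest forest
tau_hat = est.effect(X_test)                      # CATE estimates
lb, ub = est.effect_interval(X_test, alpha=0.05)  # out-of-the-box confidence intervals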

Why CatBoostRegressor for both stages (not Classifier)? The benchmark runs CausalForestDML in its continuous-treatment setup, where the first stage partials out treatment by regressing T on X to estimate E[T|X]; model_t must then be a regressor, making CatBoostRegressor the consistent choice for both nuisance stages. (With discrete_treatment=True, EconML would instead expect a classifier for model_t.)

Why Optuna R-loss as the tuning objective? Robinson's R-loss, mean((Ỹ − θ̂·T̃)²) with residuals Ỹ = Y − Ê[Y|X] and T̃ = T − Ê[T|X], directly optimises the criterion CausalForestDML minimises in its second stage. This avoids the disconnect between optimising RMSE on Y or T and the actual causal-forest objective.
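
A hedged sketch of the idea, using a constant-θ̂ simplification (the residual-on-residual least-squares slope); the repo's actual objective in src/models/dml_models.py may differ in details such as the search space and CV scheme.

import numpy as np
import optuna
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_predict

def r_loss(trial: optuna.Trial, X, T, Y, seed: int = 42) -> float:
    params = dict(
        iterations=trial.suggest_int("iterations", 100, 1000),
        depth=trial.suggest_int("depth", 3, 8),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        verbose=0, random_seed=seed,
    )
    # Cross-fitted residuals: Ỹ = Y − Ê[Y|X], T̃ = T − Ê[T|X]
    y_res = Y - cross_val_predict(CatBoostRegressor(**params), X, Y, cv=5)
    t_res = T - cross_val_predict(CatBoostRegressor(**params), X, T, cv=5)
    theta = (t_res @ y_res) / (t_res @ t_res)  # constant-effect slope
    return float(np.mean((y_res - theta * t_res) ** 2))

study = optuna.create_study(direction="minimize")
# study.optimize(lambda t: r_loss(t, X, T, Y), n_trials=50)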

Why test-set evaluation? In-sample CATE is trivially better for any flexible estimator. All reported numbers use held-out data — fixed 80/20 splits for all datasets.
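
The split itself is one line, seeded once (a sketch; the canonical splits are created inside the dataset loaders):

from sklearn.model_selection import train_test_split

X_tr, X_te, T_tr, T_te, Y_tr, Y_te = train_test_split(
    X, T, Y, test_size=0.2, random_state=42)  # GLOBAL_SEED from src/config.py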

Why Wilcoxon signed-rank test? PEHE distributions across replications are typically right-skewed and non-normal. Wilcoxon is a non-parametric paired test robust to outlier replications.
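
For example, with toy numbers standing in for per-replication PEHE (the repo's tests live in src/evaluation/stats.py):

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)
pehe_auto = rng.lognormal(-1.0, 0.3, size=100)           # right-skewed, like real PEHE
pehe_catboost = pehe_auto + rng.normal(0.02, 0.05, 100)  # paired, slightly worse

stat, p = wilcoxon(pehe_auto, pehe_catboost)  # paired signed-rank test
p_adj = min(1.0, p * 13)                      # Bonferroni across 13 datasets
print(f"raw p = {p:.4g}, Bonferroni-adjusted p = {p_adj:.4g}")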


Citation

@misc{dml-catboost-benchmark-2026,
  title  = {DML-Auto vs DML-CatBoost: A Causal Inference Benchmark},
  year   = {2026},
  url    = {https://github.com/your-username/DML-Auto-vs-Catboost}
}
