
DML-Auto vs DML-CatBoost: Causal Inference Benchmark

Research question: Does replacing EconML's auto first-stage with CatBoost regressors improve CATE estimation in Double Machine Learning?

All three models use identical CausalForestDML second stages. The only variable is the first-stage nuisance estimator used to partial out confounding:

| Model | model_y | model_t |
|---|---|---|
| DML-Auto | EconML auto-select (LassoCV / Ridge / Forest via CV) | same |
| DML-CatBoost | CatBoostRegressor (default params, 500 iterations) | same |
| DML-CatBoost-Tuned | CatBoostRegressor tuned via Optuna (50 trials, R-loss objective) | same |
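
For concreteness, a minimal sketch of how the three configurations could be wired up. This is illustrative only: the actual wrappers live in src/models/dml_models.py, make_model and its defaults are assumptions, and the tuned variant would swap in Optuna-selected CatBoost parameters (see the R-loss sketch under Design decisions).

from econml.dml import CausalForestDML
from catboost import CatBoostRegressor

def make_model(first_stage: str, seed: int = 42) -> CausalForestDML:
    """Identical second stage everywhere; only the first-stage nuisance models vary."""
    if first_stage == "auto":
        # EconML CV-selects the nuisance models (LassoCV / Ridge / Forest).
        model_y = model_t = "auto"
    elif first_stage == "catboost":
        model_y = CatBoostRegressor(iterations=500, verbose=0, random_seed=seed)
        model_t = CatBoostRegressor(iterations=500, verbose=0, random_seed=seed)
    else:
        raise ValueError(first_stage)
    return CausalForestDML(model_y=model_y, model_t=model_t, random_state=seed)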

All metrics are computed on a held-out test set (80/20 split) — models never see evaluation data during fitting.

Note on small datasets: DML-CatBoost-Tuned is only evaluated on datasets with n_train ≥ 600 (see CB_TUNE_MIN_N in config.py). Experiments showed that 50-trial Optuna tuning is unreliable below this threshold: R-loss estimates are too noisy for the TPE surrogate to converge, and CausalForestDML itself becomes numerically unstable (catastrophic R² on ACIC 2019, n≈320; no improvement over single-split on IHDP with LOO-like 15-fold CV). On small-n datasets the two-way comparison (Auto vs CatBoost) is more informative. See PLAN.md §"Small-n exclusion" for full analysis.
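A sketch of the gate for illustration: CB_TUNE_MIN_N is real (from src/config.py), but the model names and helper below are hypothetical.

CB_TUNE_MIN_N = 600  # from src/config.py

def models_for(n_train: int) -> list[str]:
    models = ["DML-Auto", "DML-CatBoost"]
    if n_train >= CB_TUNE_MIN_N:  # tuning is unreliable below this threshold
        models.append("DML-CatBoost-Tuned")
    return models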


Key Findings

Based on 1–100 replications × 13 datasets. Significance: paired Wilcoxon signed-rank test, α = 0.05.

  • DML-Auto wins on PEHE in 7 of 13 datasets: it dominates all IHDP variants, Nie & Wager A/B, ACIC 2019, and packt_data11. EconML's CV-selected first stage (typically LassoCV) handles small-n semi-synthetic data better than gradient boosting.
  • Tuned CatBoost wins on smooth DGPs: significantly better on nie_wager_C, nie_wager_D, and dowhy_nonlinear (PEHE 0.572 vs 0.677 for Auto). Tuning recovers most of the untuned CatBoost's deficit.
  • Tuned CatBoost dominates ATE bias in linear settings: on dowhy_linear, 0.081 vs 0.207 (CatBoost) vs 0.437 (Auto). The direct R-loss tuning objective translates well to ATE precision.
  • ACIC 2019 is Auto's strongest setting: roughly 2× lower PEHE (0.288 vs 0.614) and ATE bias (0.153 vs 0.332) than CatBoost. The small-n (~400 train), moderate-p (~22 features) regime favors linear first stages.
  • Speed trade-offs: DML-Auto takes 1–10 s/rep on most datasets (145 s on twins); DML-CatBoost, 2–6 s/rep; DML-CatBoost-Tuned, 45–390 s/rep. Tuning rarely justifies the 30–150× overhead over auto-selection.

Results

Mean ± std across replications. Bold = best among the three models.

PEHE ↓ (lower is better; Precision in Estimation of Heterogeneous Effects)

| Dataset | Reps | DML-Auto | DML-CatBoost | DML-CatBoost-Tuned | Winner |
|---|---|---|---|---|---|
| acic2019 | 8 | **0.288 ± 0.251** | 0.614 ± 0.657 | — ¹ | Auto ✓ |
| ihdp_B | 100 | **3.748 ± 6.081** | 3.846 ± 6.325 | — ¹ | Auto |
| ihdp_econml_A | 100 | **0.200 ± 0.066** | 0.320 ± 0.100 | — ¹ | Auto ✓ |
| ihdp_econml_B | 100 | **2.892 ± 2.490** | 2.936 ± 2.547 | — ¹ | Auto |
| nie_wager_A | 100 | **0.156 ± 0.020** | 0.168 ± 0.018 | 0.158 ± 0.017 | Auto ✓ |
| nie_wager_B | 100 | **0.294 ± 0.031** | 0.329 ± 0.034 | **0.294 ± 0.031** | Auto / Tuned |
| nie_wager_C | 100 | 0.108 ± 0.015 | 0.101 ± 0.016 | **0.095 ± 0.013** | Tuned ✓ |
| nie_wager_D | 100 | 0.681 ± 0.033 | 0.714 ± 0.035 | **0.672 ± 0.034** | Tuned ✓ |
| econml_partial_linear | 100 | **33.5 ± 47.7** | 39.4 ± 48.9 | 39.5 ± 48.8 | ~tie |
| dowhy_nonlinear | 100 | 0.677 ± 0.105 | 0.675 ± 0.095 | **0.572 ± 0.095** | Tuned ✓ |
| packt_data11 | 1 | **0.735** | 1.380 | 1.116 | Auto |
| packt_earnings | 1 | 532.6 | **459.3** | 464.7 | CatBoost |
| twins | 1 | 0.172 | 0.172 | **0.171** | |

ATE bias ↓

| Dataset | Reps | DML-Auto | DML-CatBoost | DML-CatBoost-Tuned |
|---|---|---|---|---|
| dowhy_linear | 100 | 0.437 ± 0.287 | 0.207 ± 0.096 | **0.081 ± 0.074** |
| acic2019 | 8 | **0.153 ± 0.129** | 0.332 ± 0.391 | — ¹ |
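
Both metrics follow the standard definitions (the repo's implementations live in src/evaluation/metrics.py); a self-contained sketch:

import numpy as np

def pehe(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """PEHE: root-mean-squared error of the per-unit CATE estimates."""
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))

def ate_bias(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """Absolute error of the estimated average treatment effect."""
    return float(abs(tau_hat.mean() - tau_true.mean()))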

Fit time (mean seconds/replication)

| Dataset | DML-Auto | DML-CatBoost | DML-CatBoost-Tuned |
|---|---|---|---|
| acic2019 | 2.7 | 4.3 | — ¹ |
| nie_wager_A–D | 8–10 | 2–3 | 112–134 |
| ihdp (all) | 0.8–1.0 | 2.0 | — ¹ |
| econml_partial_linear | 9.5 | 4.3 | 276 |
| twins | 145 | 20 | 388 |

¹ DML-CatBoost-Tuned not evaluated: n_train < 600. See note above and PLAN.md §"Small-n exclusion".


Datasets

| Dataset | n/rep | Reps | Ground truth | Source |
|---|---|---|---|---|
| IHDP-B (Hill 2011) | 598 | 100 | ITE + ATE | fredjo.com |
| IHDP-A (EconML DGP) | 598 | 100 | ITE + ATE | EconML |
| IHDP-B (EconML DGP) | 598 | 100 | ITE + ATE | EconML |
| Twins (NBER) | ~9100 | 1 | ITE + ATE | CEVAE repo |
| Nie & Wager A | 5000 | 100 | ITE + ATE | causalml |
| Nie & Wager B | 5000 | 100 | ITE + ATE | causalml |
| Nie & Wager C | 5000 | 100 | ITE + ATE | causalml |
| Nie & Wager D | 5000 | 100 | ITE + ATE | causalml |
| EconML partial linear | 2000 | 100 | ITE | EconML |
| DoWhy linear SCM | 5000 | 100 | ATE only | DoWhy |
| DoWhy nonlinear SCM | 5000 | 100 | ITE + ATE | DoWhy |
| Packt data_11 | 5000 | 1 | ITE + ATE | Packt GitHub |
| Packt earnings | ~900 | 1 | ITE + ATE | Packt GitHub |
| ACIC 2019 | ~400 | 8 | ITE + ATE | ACIC 2019 challenge |

Excluded: Jobs (no per-unit ITE, ATE unidentified without strong assumptions).
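
Every loader returns the same structure with a canonical train/test split. A sketch of the contract: base.py does define CausalDataset, but the exact field names below are assumptions for illustration.

from typing import NamedTuple, Optional
import numpy as np

class CausalDataset(NamedTuple):
    X_train: np.ndarray
    T_train: np.ndarray
    Y_train: np.ndarray
    X_test: np.ndarray
    T_test: np.ndarray
    Y_test: np.ndarray
    ite_test: Optional[np.ndarray]  # ground-truth ITEs; None for dowhy_linear (ATE only)
    true_ate: Optional[float]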


Quickstart

Option 1 — uv (recommended)

git clone https://github.com/your-username/DML-Auto-vs-Catboost
cd DML-Auto-vs-Catboost
uv sync                         # install exact locked dependencies

# 30-second demo (3 reps, no download needed)
uv run python src/run_comparison.py --dataset nie_wager_B --n_reps 3

# Full benchmark (downloads ~50 MB for IHDP + ACIC 2019)
uv run python src/run_comparison.py --dataset all --n_jobs -1

# Resume interrupted run
uv run python src/run_comparison.py --dataset all --n_jobs -1 --resume

# Analyse results
uv run jupyter lab notebooks/analysis.ipynb

Reproduce exact results

All randomness is controlled through a single seed in src/config.py:

GLOBAL_SEED: int = 42

Dependency versions are locked in uv.lock. To reproduce:

uv sync --frozen                # exact locked versions
uv run python src/run_comparison.py --dataset all

Run tests

uv run pytest                   # 97 tests, ~3.5 min
uv run pytest -m "not slow"     # fast subset (unit tests only)

Coverage report: results/coverage/index.html


Project structure

├── src/
│   ├── config.py               ← all seeds, paths, hyperparameters
│   ├── datasets/               ← 14 loaders with canonical train/test splits
│   │   ├── _http.py            ← shared download utility
│   │   ├── base.py             ← CausalDataset namedtuple
│   │   ├── ihdp.py             ← IHDP (fredjo.com + EconML)
│   │   ├── twins.py            ← Twins (NBER via CEVAE)
│   │   ├── jobs.py             ← Jobs / LaLonde (excluded: no ITE)
│   │   ├── nie_wager.py        ← Nie & Wager A–D
│   │   ├── econml_synthetic.py ← EconML DGPs
│   │   ├── dowhy_synthetic.py  ← DoWhy linear/nonlinear SCMs
│   │   ├── packt.py            ← Packt book datasets
│   │   └── acic2019.py         ← ACIC 2019 test datasets (8 reps)
│   ├── models/
│   │   ├── base.py             ← CausalModel Protocol
│   │   ├── dml_models.py       ← AutoDML, CatBoostDML, TunedCatBoostDML wrappers
│   │   └── __init__.py         ← MODEL_REGISTRY
│   ├── evaluation/
│   │   ├── metrics.py          ← PEHE, CATE R², policy risk, ATE bias
│   │   └── stats.py            ← Wilcoxon + Bonferroni significance tests
│   └── run_comparison.py       ← parallel CLI runner with cross-session resume
├── notebooks/
│   └── analysis.ipynb          ← full analysis with figures and LaTeX table
├── results/
│   ├── raw/                    ← per-dataset CSV results
│   ├── figures/                ← boxplots, heatmaps, bar charts
│   └── tables/                 ← comparison_table.csv + .tex
└── tests/                      ← 97 pytest tests

Design decisions

Why CausalForestDML for the second stage? It is a state-of-the-art DML implementation: honest causal forest, local centering, cross-fitting, and confidence intervals out of the box.
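
Typical usage of the second stage (standard EconML API; the data names are placeholders):

from econml.dml import CausalForestDML

est = CausalForestDML(model_y="auto", model_t="auto", random_state=42)
est.fit(Y_train, T_train, X=X_train)              # cross-fitted nuisances + honest forest
tau_hat = est.effect(X_test)                      # CATE estimates
lb, ub = est.effect_interval(X_test, alpha=0.05)  # out-of-the-box confidence intervals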

Why CatBoostRegressor for both stages (not Classifier)? The benchmark runs CausalForestDML in its continuous-treatment setup, where the first stage partials out treatment by regressing T on X to estimate E[T|X]; model_t must then be a regressor, making CatBoostRegressor the consistent choice for both nuisance stages. (With discrete_treatment=True, EconML would instead expect a classifier for model_t.)

Why Optuna R-loss as the tuning objective? Robinson's R-loss, mean((Ỹ − θ̂·T̃)²) with residuals Ỹ = Y − Ê[Y|X] and T̃ = T − Ê[T|X], directly optimises the criterion CausalForestDML minimises in its second stage. This avoids the disconnect between optimising RMSE on Y or T and the actual causal-forest objective.
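
A hedged sketch of the idea, using a constant-θ̂ simplification (the residual-on-residual least-squares slope); the repo's actual objective in src/models/dml_models.py may differ in details such as the search space and CV scheme.

import numpy as np
import optuna
from catboost import CatBoostRegressor
from sklearn.model_selection import cross_val_predict

def r_loss(trial: optuna.Trial, X, T, Y, seed: int = 42) -> float:
    params = dict(
        iterations=trial.suggest_int("iterations", 100, 1000),
        depth=trial.suggest_int("depth", 3, 8),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        verbose=0, random_seed=seed,
    )
    # Cross-fitted residuals: Ỹ = Y − Ê[Y|X], T̃ = T − Ê[T|X]
    y_res = Y - cross_val_predict(CatBoostRegressor(**params), X, Y, cv=5)
    t_res = T - cross_val_predict(CatBoostRegressor(**params), X, T, cv=5)
    theta = (t_res @ y_res) / (t_res @ t_res)  # constant-effect slope
    return float(np.mean((y_res - theta * t_res) ** 2))

study = optuna.create_study(direction="minimize")
# study.optimize(lambda t: r_loss(t, X, T, Y), n_trials=50)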

Why test-set evaluation? In-sample CATE is trivially better for any flexible estimator. All reported numbers use held-out data — fixed 80/20 splits for all datasets.
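
The split itself is one line, seeded once (a sketch; the canonical splits are created inside the dataset loaders):

from sklearn.model_selection import train_test_split

X_tr, X_te, T_tr, T_te, Y_tr, Y_te = train_test_split(
    X, T, Y, test_size=0.2, random_state=42)  # GLOBAL_SEED from src/config.py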

Why Wilcoxon signed-rank test? PEHE distributions across replications are typically right-skewed and non-normal. Wilcoxon is a non-parametric paired test robust to outlier replications.
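
For example, with toy numbers standing in for per-replication PEHE (the repo's tests live in src/evaluation/stats.py):

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)
pehe_auto = rng.lognormal(-1.0, 0.3, size=100)           # right-skewed, like real PEHE
pehe_catboost = pehe_auto + rng.normal(0.02, 0.05, 100)  # paired, slightly worse

stat, p = wilcoxon(pehe_auto, pehe_catboost)  # paired signed-rank test
p_adj = min(1.0, p * 13)                      # Bonferroni across 13 datasets
print(f"raw p = {p:.4g}, Bonferroni-adjusted p = {p_adj:.4g}")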


Citation

@misc{dml-catboost-benchmark-2026,
  title  = {DML-Auto vs DML-CatBoost: A Causal Inference Benchmark},
  year   = {2026},
  url    = {https://github.com/your-username/DML-Auto-vs-Catboost}
}
