Research question: Does replacing EconML's auto first-stage with CatBoost regressors improve CATE estimation in Double Machine Learning?
All three models use identical CausalForestDML second stages. The only variable is the
first-stage nuisance estimator used to partial out confounding:
| Model | model_y | model_t |
|---|---|---|
| DML-Auto | EconML auto-select (LassoCV / Ridge / Forest via CV) | same |
| DML-CatBoost | CatBoostRegressor (default params, 500 iterations) | same |
| DML-CatBoost-Tuned | CatBoostRegressor tuned via Optuna (50 trials, R-loss objective) | same |
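For reference, a minimal sketch of how the three configurations can be wired up (assumed EconML/CatBoost API usage; the project's actual wrappers are in src/models/dml_models.py):

```python
# Minimal sketch, not the project's exact implementation: the only difference
# between the three models is the first-stage nuisance learner.
from catboost import CatBoostRegressor
from econml.dml import CausalForestDML

# DML-Auto: leave model_y / model_t at EconML's default "auto" CV selection.
dml_auto = CausalForestDML(random_state=42)

# DML-CatBoost: default-parameter CatBoost regressors for both nuisances.
cb_params = dict(iterations=500, verbose=0, random_seed=42)
dml_catboost = CausalForestDML(
    model_y=CatBoostRegressor(**cb_params),
    model_t=CatBoostRegressor(**cb_params),
    random_state=42,
)

# DML-CatBoost-Tuned: same wiring, but CatBoost hyperparameters are chosen by a
# 50-trial Optuna study against the R-loss objective (see design notes below).
```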
All metrics are computed on a held-out test set (80/20 split) — models never see evaluation data during fitting.
Note on small datasets:
DML-CatBoost-Tuned is only evaluated on datasets with n_train ≥ 600 (see CB_TUNE_MIN_N in config.py). Experiments showed that 50-trial Optuna tuning is unreliable below this threshold: R-loss estimates are too noisy for the TPE surrogate to converge, and CausalForestDML itself becomes numerically unstable (catastrophic R² on ACIC 2019, n≈320; no improvement over single-split on IHDP with LOO-like 15-fold CV). On small-n datasets the two-way comparison (Auto vs CatBoost) is more informative. See PLAN.md §"Small-n exclusion" for full analysis.
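The threshold itself is a single constant; a sketch matching the names in the note above (surrounding configuration assumed):

```python
# src/config.py (illustrative excerpt; surrounding configuration assumed)
CB_TUNE_MIN_N: int = 600  # skip DML-CatBoost-Tuned when n_train is below this
```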
Based on 1–100 replications × 13 datasets. Significance: paired Wilcoxon signed-rank test, α = 0.05.
- DML-Auto wins on PEHE in 7 of 13 datasets — dominates all IHDP variants, Nie & Wager A/B, ACIC 2019, and packt_data11. EconML's CV-selected first stage (typically LassoCV) handles small-n semi-synthetic data better than gradient boosting.
- Tuned CatBoost wins on smooth DGPs — significantly better on nie_wager_C and dowhy_nonlinear (PEHE 0.572 vs 0.677 Auto). Tuning recovers most of the untuned CatBoost's deficit.
- Tuned CatBoost dominates ATE bias on linear settings — dowhy_linear: 0.081 vs 0.207 (CatBoost) vs 0.437 (Auto). The direct R-loss tuning objective translates well to ATE precision.
- ACIC 2019 is Auto's strongest setting — 2× lower PEHE (0.288 vs 0.614 CatBoost) and ATE bias (0.153 vs 0.332). The small-n (~400 train), moderate-p (~22 features) regime favors linear first stages.
- Speed trade-offs — DML-Auto: 1–10 s/rep. DML-CatBoost: 2–6 s/rep. DML-CatBoost-Tuned: 45–390 s/rep. Tuning rarely justifies the 30–150× overhead over auto-selection.
PEHE (lower is better), mean ± std across replications. The Winner column marks the best of the three models; ✓ = statistically significant (paired Wilcoxon, α = 0.05).
| Dataset | Reps | DML-Auto | DML-CatBoost | DML-CatBoost-Tuned | Winner |
|---|---|---|---|---|---|
| acic2019 | 8 | 0.288 ± 0.251 | 0.614 ± 0.657 | — ¹ | Auto ✓ |
| ihdp_B | 100 | 3.748 ± 6.081 | 3.846 ± 6.325 | — ¹ | Auto |
| ihdp_econml_A | 100 | 0.200 ± 0.066 | 0.320 ± 0.100 | — ¹ | Auto ✓ |
| ihdp_econml_B | 100 | 2.892 ± 2.490 | 2.936 ± 2.547 | — ¹ | Auto |
| nie_wager_A | 100 | 0.156 ± 0.020 | 0.168 ± 0.018 | 0.158 ± 0.017 | Auto ✓ |
| nie_wager_B | 100 | 0.294 ± 0.031 | 0.329 ± 0.034 | 0.294 ± 0.031 | Auto / Tuned |
| nie_wager_C | 100 | 0.108 ± 0.015 | 0.101 ± 0.016 | 0.095 ± 0.013 | Tuned ✓ |
| nie_wager_D | 100 | 0.681 ± 0.033 | 0.714 ± 0.035 | 0.672 ± 0.034 | Tuned ✓ |
| econml_partial_linear | 100 | 33.5 ± 47.7 | 39.4 ± 48.9 | 39.5 ± 48.8 | ~tie |
| dowhy_nonlinear | 100 | 0.677 ± 0.105 | 0.675 ± 0.095 | 0.572 ± 0.095 | Tuned ✓ |
| packt_data11 | 1 | 0.735 | 1.380 | 1.116 | Auto |
| packt_earnings | 1 | 532.6 | 459.3 | 464.7 | CatBoost |
| twins | 1 | 0.172 | 0.172 | 0.171 | — |
ATE bias (absolute error of the estimated ATE), mean ± std across replications.
| Dataset | Reps | DML-Auto | DML-CatBoost | DML-CatBoost-Tuned |
|---|---|---|---|---|
| dowhy_linear | 100 | 0.437 ± 0.287 | 0.207 ± 0.096 | 0.081 ± 0.074 |
| acic2019 | 8 | 0.153 ± 0.129 | 0.332 ± 0.391 | — ¹ |
Mean runtime per replication, in seconds.
| Dataset | DML-Auto | DML-CatBoost | DML-CatBoost-Tuned |
|---|---|---|---|
| acic2019 | 2.7 | 4.3 | — ¹ |
| nie_wager_A–D | 8–10 | 2–3 | 112–134 |
| ihdp (all) | 0.8–1.0 | 2.0 | — ¹ |
| econml_partial_linear | 9.5 | 4.3 | 276 |
| twins | 145 | 20 | 388 |
¹ DML-CatBoost-Tuned not evaluated: n_train < 600. See note above and PLAN.md §"Small-n exclusion".
| Dataset | n/rep | Reps | Ground truth | Source |
|---|---|---|---|---|
| IHDP-B (Hill 2011) | 598 | 100 | ITE + ATE | fredjo.com |
| IHDP-A (EconML DGP) | 598 | 100 | ITE + ATE | EconML |
| IHDP-B (EconML DGP) | 598 | 100 | ITE + ATE | EconML |
| Twins (NBER) | ~9100 | 1 | ITE + ATE | CEVAE repo |
| Nie & Wager A | 5000 | 100 | ITE + ATE | causalml |
| Nie & Wager B | 5000 | 100 | ITE + ATE | causalml |
| Nie & Wager C | 5000 | 100 | ITE + ATE | causalml |
| Nie & Wager D | 5000 | 100 | ITE + ATE | causalml |
| EconML partial linear | 2000 | 100 | ITE | EconML |
| DoWhy linear SCM | 5000 | 100 | ATE only | DoWhy |
| DoWhy nonlinear SCM | 5000 | 100 | ITE + ATE | DoWhy |
| Packt data_11 | 5000 | 1 | ITE + ATE | Packt GitHub |
| Packt earnings | ~900 | 1 | ITE + ATE | Packt GitHub |
| ACIC 2019 | ~400 | 8 | ITE + ATE | ACIC 2019 challenge |
Excluded: Jobs (no per-unit ITE, ATE unidentified without strong assumptions).
```bash
git clone https://github.com/your-username/DML-Auto-vs-Catboost
cd DML-Auto-vs-Catboost
uv sync  # install exact locked dependencies

# 30-second demo (3 reps, no download needed)
uv run python src/run_comparison.py --dataset nie_wager_B --n_reps 3

# Full benchmark (downloads ~50 MB for IHDP + ACIC 2019)
uv run python src/run_comparison.py --dataset all --n_jobs -1

# Resume interrupted run
uv run python src/run_comparison.py --dataset all --n_jobs -1 --resume

# Analyse results
uv run jupyter lab notebooks/analysis.ipynb
```

All randomness is controlled through a single seed in src/config.py:

```python
GLOBAL_SEED: int = 42
```

Dependency versions are locked in uv.lock. To reproduce:

```bash
uv sync --frozen  # exact locked versions
uv run python src/run_comparison.py --dataset all
```

```bash
uv run pytest                # 97 tests, ~3.5 min
uv run pytest -m "not slow"  # fast subset (unit tests only)
```

Coverage report: results/coverage/index.html
```
├── src/
│ ├── config.py ← all seeds, paths, hyperparameters
│ ├── datasets/ ← 14 loaders with canonical train/test splits
│ │ ├── _http.py ← shared download utility
│ │ ├── base.py ← CausalDataset namedtuple
│ │ ├── ihdp.py ← IHDP (fredjo.com + EconML)
│ │ ├── twins.py ← Twins (NBER via CEVAE)
│ │ ├── jobs.py ← Jobs / LaLonde (excluded: no ITE)
│ │ ├── nie_wager.py ← Nie & Wager A–D
│ │ ├── econml_synthetic.py ← EconML DGPs
│ │ ├── dowhy_synthetic.py ← DoWhy linear/nonlinear SCMs
│ │ ├── packt.py ← Packt book datasets
│ │ └── acic2019.py ← ACIC 2019 test datasets (8 reps)
│ ├── models/
│ │ ├── base.py ← CausalModel Protocol
│ │ ├── dml_models.py ← AutoDML, CatBoostDML, TunedCatBoostDML wrappers
│ │ └── __init__.py ← MODEL_REGISTRY
│ ├── evaluation/
│ │ ├── metrics.py ← PEHE, CATE R², policy risk, ATE bias
│ │ └── stats.py ← Wilcoxon + Bonferroni significance tests
│ └── run_comparison.py ← parallel CLI runner with cross-session resume
├── notebooks/
│ └── analysis.ipynb ← full analysis with figures and LaTeX table
├── results/
│ ├── raw/ ← per-dataset CSV results
│ ├── figures/ ← boxplots, heatmaps, bar charts
│ └── tables/ ← comparison_table.csv + .tex
└── tests/ ← 97 pytest tests
```
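The headline metrics in the tables above are computed in src/evaluation/metrics.py; a minimal sketch of the two most-cited ones, using their standard definitions (the project's exact implementation may differ):

```python
import numpy as np

def pehe(tau_hat: np.ndarray, tau_true: np.ndarray) -> float:
    """Precision in Estimation of Heterogeneous Effects: RMSE between estimated
    and true individual treatment effects (lower is better)."""
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))

def ate_bias(tau_hat: np.ndarray, ate_true: float) -> float:
    """Absolute error of the estimated average treatment effect."""
    return float(abs(tau_hat.mean() - ate_true))
```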
Why CausalForestDML as the causal forest? It is the current state-of-the-art DML implementation: honest causal forest, local centering, cross-fitting, and confidence intervals out of the box.
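As a usage sketch (variable names are illustrative, and the estimator is assumed to be constructed as in the configuration snippet above):

```python
# Point estimates and 95% confidence intervals for the CATE on held-out data.
est = dml_auto.fit(Y_train, T_train, X=X_train)
tau_hat = est.effect(X_test)                      # CATE point estimates
lo, hi = est.effect_interval(X_test, alpha=0.05)  # out-of-the-box intervals
```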
Why CatBoostRegressor for both stages (not Classifier)?
In the configuration used here, CausalForestDML residualises the treatment as E[T|X] via regression rather than through a propensity classifier for model_t, so a regressor for both nuisances is the consistent choice.
Why Optuna R-loss as the tuning objective? Robinson's R-loss = mean((Ỹ − θ̂·T̃)²), where Ỹ = Y − Ê[Y|X] and T̃ = T − Ê[T|X] are the residualised outcome and treatment, directly optimises the criterion CausalForestDML uses in its second stage. This avoids the disconnect between optimising RMSE on Y or T and the actual causal forest objective.
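A condensed sketch of such an objective (search space, fold count, and variable names are illustrative; the project's tuner lives in the TunedCatBoostDML wrapper):

```python
# Illustrative sketch only: a 50-trial Optuna study minimising Robinson's R-loss.
import numpy as np
import optuna
from catboost import CatBoostRegressor
from econml.dml import CausalForestDML
from sklearn.model_selection import cross_val_predict

def objective(trial: optuna.Trial, X, T, Y) -> float:
    params = dict(
        iterations=trial.suggest_int("iterations", 200, 800),
        depth=trial.suggest_int("depth", 4, 8),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        verbose=0,
    )
    # Out-of-fold residualisation of Y and T with the candidate nuisance models.
    y_res = Y - cross_val_predict(CatBoostRegressor(**params), X, Y, cv=5)
    t_res = T - cross_val_predict(CatBoostRegressor(**params), X, T, cv=5)
    # Fit the (fixed) second stage with the candidate first stage, then score
    # Robinson's R-loss: mean((Y_res - theta_hat(X) * T_res)^2).
    est = CausalForestDML(
        model_y=CatBoostRegressor(**params),
        model_t=CatBoostRegressor(**params),
        random_state=42,
    )
    est.fit(Y, T, X=X)
    theta_hat = est.effect(X).ravel()
    return float(np.mean((y_res - theta_hat * t_res) ** 2))

# study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
# study.optimize(lambda tr: objective(tr, X, T, Y), n_trials=50)
```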
Why test-set evaluation? In-sample CATE is trivially better for any flexible estimator. All reported numbers use held-out data — fixed 80/20 splits for all datasets.
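A sketch of how such a fixed split can be produced (the canonical splits themselves are defined in the dataset loaders under src/datasets/; names here are illustrative):

```python
# Fixed 80/20 split, seeded by GLOBAL_SEED so every model sees the same test set.
from sklearn.model_selection import train_test_split

X_tr, X_te, T_tr, T_te, Y_tr, Y_te = train_test_split(
    X, T, Y, test_size=0.2, random_state=42
)
```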
Why Wilcoxon signed-rank test? PEHE distributions across replications are typically right-skewed and non-normal. Wilcoxon is a non-parametric paired test robust to outlier replications.
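A minimal sketch of the per-dataset test (assumed scipy usage; the Bonferroni correction applied in src/evaluation/stats.py is omitted here):

```python
# Paired Wilcoxon signed-rank test on per-replication PEHE values.
from scipy.stats import wilcoxon

def is_significant(pehe_model_a, pehe_model_b, alpha=0.05) -> bool:
    """True if the paired difference in PEHE is significant at level alpha."""
    _, p_value = wilcoxon(pehe_model_a, pehe_model_b)
    return p_value < alpha
```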
```bibtex
@misc{dml-catboost-benchmark-2026,
title = {DML-Auto vs DML-CatBoost: A Causal Inference Benchmark},
year = {2026},
url = {https://github.com/your-username/DML-Auto-vs-Catboost}
}
```