5 changes: 5 additions & 0 deletions TODO.md
@@ -86,6 +86,11 @@ Deferred items from PR reviews that were not addressed before merge.
| `compute_synthetic_weights` backend algorithm mismatch: Rust path uses Frank-Wolfe (`_rust_synthetic_weights` in `utils.py:1184`); Python fallback uses projected gradient descent (`_compute_synthetic_weights_numpy` in `utils.py:1228`). Both solve the same constrained QP but converge to different simplex vertices on near-degenerate / extreme-scale inputs (e.g. `Y~1e9`, or near-singular `Y'Y`). Unified backend (one algorithm) would close the parity gap surfaced by audit finding #22. Two `@pytest.mark.xfail(strict=True)` tests in `tests/test_rust_backend.py::TestSyntheticWeightsBackendParity` baseline the divergence so we notice when/if the algorithms align. | `utils.py`, `rust/` | follow-up | Medium |
| TROP Rust vs Python grid-search divergence on rank-deficient Y: on two near-parallel control units, LOOCV grid-search ATT diverges ~6% between Rust (`trop_global.py:688`) and Python fallback (`trop_global.py:753`). Either grid-winner ties are broken differently or the per-λ solver reaches different stationary points under rank deficiency. Audit finding #23 flagged this surface. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_grid_search_rank_deficient_Y` baselines the gap. | `trop_global.py`, `rust/` | follow-up | Medium |
| TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. | `trop_global.py`, `rust/` | follow-up | Medium |
| `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
| `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium |
| `bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. | `diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Medium |
| `bias_corrected_local_linear`: support multi-eval grid (`neval > 1`) with cross-covariance (`covgrid=TRUE` branch of `lprobust.R:253-378`). Not needed for HAD but useful for multi-dose diagnostics. | `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Low |
| Clustered-DGP parity: Phase 1c's DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster bug in `lpbwselect.mse.dpi`'s pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test. | `benchmarks/R/generate_nprobust_lprobust_golden.R` | Phase 1c | Low |
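The strict-xfail "baseline the divergence" pattern the three backend-parity rows above rely on can be sketched as follows. This is a minimal illustration, not repo code: `backend_att`, `backends_agree`, and the ~6% gap are stand-ins for the real Rust/Python dispatch and the observed grid-search divergence.

```python
import math

# Stand-in for the real Rust/Python dispatch; the ~6% gap mimics the
# grid-search divergence described above (values are illustrative).
def backend_att(backend: str) -> float:
    return {"rust": 1.000, "python": 1.062}[backend]

def backends_agree(rel_tol: float = 1e-8) -> bool:
    return math.isclose(backend_att("rust"), backend_att("python"),
                        rel_tol=rel_tol)

# In the repo, an assertion like backends_agree() sits under
# @pytest.mark.xfail(strict=True): while the divergence persists the test
# xfails quietly, and the moment the backends align it XPASSes and fails
# the suite, flagging that the baseline (and the TODO row) can be retired.
```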

#### Performance

166 changes: 166 additions & 0 deletions benchmarks/R/generate_nprobust_lprobust_golden.R
@@ -0,0 +1,166 @@
# Generate nprobust lprobust golden values for the Phase 1c parity suite.
#
# This script calls nprobust::lprobust() at a single eval point with
# bwselect="mse-dpi" on five deterministic DGPs and records:
# - tau.cl, tau.bc (point estimates; lprobust's Estimate matrix labels the
#   conventional estimate tau.us, which is exported here as tau_cl)
# - se.cl, se.rb (standard errors; se.us is likewise exported as se_cl)
# - h, b (bandwidths chosen by the mse-dpi selector)
# - N (observations in the selected kernel window)
# - z = qnorm(1 - alpha/2)
# - ci_low, ci_high = tau.bc +/- z * se.rb
#
# DGPs 1-3 reuse the same seed + shape as benchmarks/R/generate_nprobust_golden.R
# so the selected (h, b) are identical; Phase 1c parity is therefore isolated
# to the point-estimate + variance computation. DGP 4 adds cluster IDs for
# cluster-robust SE parity; DGP 5 shifts the support to test a non-zero
# boundary (Design 1 continuous-near-d_lower).
#
# Usage:
# Rscript benchmarks/R/generate_nprobust_lprobust_golden.R
#
# Requirements:
# nprobust (CRAN), jsonlite
#
# Output:
# benchmarks/data/nprobust_lprobust_golden.json
#
# Phase 1c of the HeterogeneousAdoptionDiD implementation (de Chaisemartin,
# Ciccia, D'Haultfoeuille & Knau 2026, arXiv:2405.04465v6). Python tests at
# tests/test_bias_corrected_lprobust.py and tests/test_nprobust_port.py load
# this JSON and check agreement to tiered tolerances (1e-14 on tau_cl/se_cl,
# 1e-12 on tau_bc/se_rb, 1e-13 on CI bounds; see Phase 1c plan).

library(nprobust)
library(jsonlite)

stopifnot(packageVersion("nprobust") == "0.5.0")

extract_lprobust_single_eval <- function(d, y, eval_point = 0.0,
                                         kernel = "epa", vce = "nn",
                                         cluster = NULL, alpha = 0.05,
                                         h = NULL, b = NULL) {
  # If h (and optionally b) are passed, bypass the mse-dpi selector and
  # call lprobust() with those bandwidths directly. This is used for
  # clustered DGPs where nprobust's internal lpbwselect.mse.dpi hits a
  # singleton-cluster shape bug in lprobust.vce during the order-q+1/q+2
  # pilot fits. For unclustered DGPs, h=NULL triggers bwselect="mse-dpi".
  if (is.null(h)) {
    fit <- lprobust(y = y, x = d, eval = eval_point,
                    p = 1L, deriv = 0L, kernel = kernel,
                    bwselect = "mse-dpi", vce = vce, cluster = cluster,
                    bwcheck = 21L, bwregul = 1, nnmatch = 3L)
  } else {
    # When b is unspecified, nprobust defaults to b = h / rho with rho=1.
    fit <- lprobust(y = y, x = d, eval = eval_point,
                    p = 1L, deriv = 0L, kernel = kernel,
                    h = h, b = if (is.null(b)) h else b,
                    vce = vce, cluster = cluster,
                    bwcheck = 21L, nnmatch = 3L)
  }
  est <- fit$Estimate[1, ]
  z <- qnorm(1 - alpha / 2)
  ci_low <- as.numeric(est["tau.bc"] - z * est["se.rb"])
  ci_high <- as.numeric(est["tau.bc"] + z * est["se.rb"])

  list(
    eval_point = as.numeric(eval_point),
    h = as.numeric(est["h"]),
    b = as.numeric(est["b"]),
    n_used = as.integer(est["N"]),
    tau_cl = as.numeric(est["tau.us"]),
    tau_bc = as.numeric(est["tau.bc"]),
    se_cl = as.numeric(est["se.us"]),
    se_rb = as.numeric(est["se.rb"]),
    ci_low = ci_low,
    ci_high = ci_high,
    alpha = as.numeric(alpha),
    z = as.numeric(z)
  )
}

set.seed(20260419)

# DGP 1: d ~ Uniform(0, 1), y = d + d^2 + N(0, 0.5)
G <- 2000L
d1 <- runif(G, 0, 1)
y1 <- d1 + d1^2 + rnorm(G, 0, 0.5)

# DGP 2: d ~ Beta(2, 2), y = d + d^2 + N(0, 0.5) (f(0) vanishes at boundary)
d2 <- rbeta(G, 2, 2)
y2 <- d2 + d2^2 + rnorm(G, 0, 0.5)

# DGP 3: Half-normal d, y = 0.5 * d^2 + N(0, 1)
d3 <- abs(rnorm(G, 0, 1))
y3 <- 0.5 * d3^2 + rnorm(G, 0, 1)

# DGP 4: Uniform(0, 1) with 50 clusters of 40 obs (cluster-robust SE parity).
# Fewer, larger clusters avoid an nprobust-internal singleton-cluster shape
# bug in lprobust.vce that fires if a kernel window retains only one obs per
# cluster. 50 clusters x 40 obs => the mse-dpi pilot windows near the
# boundary keep enough obs per cluster to stay well-conditioned.
set.seed(20260420)
G4 <- 2000L
d4 <- runif(G4, 0, 1)
cluster4 <- rep(1:50, each = 40)[1:G4]
# Introduce within-cluster correlation in y via a cluster effect.
cluster_effect <- rnorm(50, 0, 0.3)[cluster4]
y4 <- d4 + d4^2 + cluster_effect + rnorm(G4, 0, 0.3)

# DGP 5: Uniform(0.2, 1.0) — Design 1 continuous-near-d_lower at
# boundary = d.min() > 0. Different seed to avoid aliasing DGP 1.
set.seed(20260421)
G5 <- 2000L
d5 <- runif(G5, 0.2, 1.0)
y5 <- (d5 - 0.2) + (d5 - 0.2)^2 + rnorm(G5, 0, 0.5)
eval5 <- min(d5) # Design 1 continuous: evaluate at the realized minimum.

golden <- list(
  metadata = list(
    nprobust_version = as.character(packageVersion("nprobust")),
    nprobust_sha = "36e4e532d2f7d23d4dc6e162575cca79e0927cda",
    seeds = list(dgp1 = 20260419L, dgp2 = 20260419L, dgp3 = 20260419L,
                 dgp4 = 20260420L, dgp5 = 20260421L),
    generator = "benchmarks/R/generate_nprobust_lprobust_golden.R",
    algorithm = paste("nprobust::lprobust(..., bwselect='mse-dpi') at a single",
                      "eval point, p=1, deriv=0, kernel='epa', vce='nn'",
                      "unless noted. The Python wrapper computes its own",
                      "z_{1-alpha/2} via scipy.stats.norm.ppf inside",
                      "safe_inference(); R's z is exported here for audit",
                      "so a reviewer can verify the two critical values",
                      "agree to machine precision.")
  ),
  dgp1 = c(list(n = G, d = d1, y = y1, kernel = "epa", vce = "nn",
                description = "Uniform(0,1), polynomial m(d) = d + d^2"),
           extract_lprobust_single_eval(d1, y1, kernel = "epa", vce = "nn")),
  dgp2 = c(list(n = G, d = d2, y = y2, kernel = "epa", vce = "nn",
                description = "Beta(2,2) - boundary density vanishes at 0"),
           extract_lprobust_single_eval(d2, y2, kernel = "epa", vce = "nn")),
  dgp3 = c(list(n = G, d = d3, y = y3, kernel = "epa", vce = "nn",
                description = "Half-normal d, quadratic m(d) with unit noise"),
           extract_lprobust_single_eval(d3, y3, kernel = "epa", vce = "nn")),
  dgp4 = c(list(n = G4, d = d4, y = y4, cluster = cluster4,
                kernel = "epa", vce = "nn",
                h_manual = 0.3, b_manual = 0.3,
                description = paste("Clustered (50 groups of 40) Uniform(0,1);",
                                    "manual h=b=0.3 to sidestep nprobust's",
                                    "singleton-cluster bug in the mse-dpi",
                                    "pilot fits.")),
           extract_lprobust_single_eval(d4, y4, kernel = "epa", vce = "nn",
                                        cluster = cluster4,
                                        h = 0.3, b = 0.3)),
  dgp5 = c(list(n = G5, d = d5, y = y5, eval_point_override = eval5,
                kernel = "epa", vce = "nn",
                description = "Uniform(0.2, 1.0), Design 1 boundary = d.min()"),
           extract_lprobust_single_eval(d5, y5, eval_point = eval5,
                                        kernel = "epa", vce = "nn"))
)

out_path <- "benchmarks/data/nprobust_lprobust_golden.json"
dir.create("benchmarks/data", recursive = TRUE, showWarnings = FALSE)
write_json(golden, out_path, auto_unbox = TRUE, pretty = TRUE, digits = 14)
cat("Golden values written to", out_path, "\n")
for (name in c("dgp1", "dgp2", "dgp3", "dgp4", "dgp5")) {
  cat(sprintf("%s: tau.bc = %.6f, se.rb = %.6f, h = %.6f, b = %.6f\n",
              name, golden[[name]]$tau_bc, golden[[name]]$se_rb,
              golden[[name]]$h, golden[[name]]$b))
}
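On the Python side, the tiered-tolerance comparison described in the script's header comment can be sketched like this. It is a hedged illustration: `check_parity` and the dict shapes are assumptions, the real assertions live in tests/test_bias_corrected_lprobust.py, and whether each tier is applied as a relative or absolute tolerance follows the Phase 1c plan.

```python
import math

# Tolerance tiers stated in the generator's header: conventional quantities
# are held to 1e-14, bias-corrected ones to 1e-12, CI bounds to 1e-13.
TOLERANCES = {
    "tau_cl": 1e-14, "se_cl": 1e-14,
    "tau_bc": 1e-12, "se_rb": 1e-12,
    "ci_low": 1e-13, "ci_high": 1e-13,
}

def check_parity(python_fit: dict, golden_entry: dict) -> None:
    # In the real suite, golden_entry would be one DGP record loaded from
    # benchmarks/data/nprobust_lprobust_golden.json; both dicts here are
    # illustrative stand-ins keyed by the fields the R script exports.
    for key, tol in TOLERANCES.items():
        assert math.isclose(python_fit[key], golden_entry[key],
                            rel_tol=tol, abs_tol=tol), key
```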
119 changes: 119 additions & 0 deletions benchmarks/data/nprobust_lprobust_golden.json

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions diff_diff/__init__.py
@@ -46,7 +46,9 @@
from diff_diff.local_linear import (
    KERNELS,
    BandwidthResult,
    BiasCorrectedFit,
    LocalLinearFit,
    bias_corrected_local_linear,
    epanechnikov_kernel,
    kernel_moments,
    local_linear_fit,
@@ -427,6 +429,9 @@
    # MSE-optimal bandwidth selector (Phase 1b for HeterogeneousAdoptionDiD)
    "BandwidthResult",
    "mse_optimal_bandwidth",
    # Bias-corrected local-linear (Phase 1c for HeterogeneousAdoptionDiD)
    "BiasCorrectedFit",
    "bias_corrected_local_linear",
    # Datasets
    "load_card_krueger",
    "load_castle_doctrine",
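The golden metadata notes that the Python wrapper derives its own z_{1-alpha/2} inside safe_inference() (via scipy.stats.norm.ppf) while R exports its qnorm value for audit. The check that the two critical values, and the resulting robust CI, agree can be sketched as follows; the stdlib `NormalDist` stands in for scipy here, and `robust_ci` is an illustrative name, not the repo's API.

```python
from statistics import NormalDist

def robust_ci(tau_bc: float, se_rb: float, alpha: float = 0.05):
    # z_{1 - alpha/2}; the repo uses scipy.stats.norm.ppf, which agrees
    # with NormalDist().inv_cdf to near machine precision.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Robust bias-corrected CI, mirroring the R-side export
    # ci_low/ci_high = tau.bc +/- qnorm(1 - alpha/2) * se.rb.
    return tau_bc - z * se_rb, tau_bc + z * se_rb
```

Comparing these bounds against the ci_low/ci_high the R script writes out is the reviewer audit the metadata describes.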