1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]

### Added
- **`HeterogeneousAdoptionDiD.fit(survey=..., weights=...)` on continuous-dose paths (Phase 4.5 survey support).** The `continuous_at_zero` (paper Design 1') and `continuous_near_d_lower` (Design 1 continuous-near-d̲) designs accept survey weights through two interchangeable kwargs: `weights=<array>` (pweight shortcut, weighted-robust SE from the CCT-2014 lprobust port) and `survey=SurveyDesign(weights, strata, psu, fpc)` (design-based inference via Binder-TSL variance using the existing `compute_survey_if_variance` helper at `diff_diff/survey.py:1802`). Point estimates match across both entry paths; SE diverges by design (pweight-only vs PSU-aggregated). `HeterogeneousAdoptionDiDResults.survey_metadata` is a repo-standard `SurveyMetadata` dataclass (weight_type / effective_n / design_effect / sum_weights / weight_range / n_strata / n_psu / df_survey); HAD-specific extras (`variance_formula` label, `effective_dose_mean`) are separate top-level result fields. `to_dict()` surfaces the full `SurveyMetadata` object plus `variance_formula` + `effective_dose_mean`; `summary()` renders `variance_formula`, `effective_n`, `effective_dose_mean`, and (when the survey= path is used) `df_survey`; `__repr__` surfaces `variance_formula` + `effective_dose_mean` when present. The HAD `mass_point` design and `aggregate="event_study"` path raise `NotImplementedError` under survey/weights (deferred to Phase 4.5 B: weighted 2SLS + event-study survey composition); the HAD pretests stay unweighted in this release (Phase 4.5 C). Parity ceiling acknowledged — no public weighted-CCF bias-corrected local-linear reference exists in any language; methodology confidence comes from (1) uniform-weights bit-parity at `atol=1e-14` on the full lprobust output struct, (2) cross-language weighted-OLS parity (manual R reference) at `atol=1e-12`, and (3) Monte Carlo oracle consistency on known-τ DGPs. 
`_nprobust_port.lprobust` gains `weights=` and `return_influence=` (used internally by the Binder-TSL path); `bias_corrected_local_linear` removes the Phase 1c `NotImplementedError` on `weights=` and forwards. Auto-bandwidth selection remains unweighted in this release — pass `h`/`b` explicitly for weight-aware bandwidths. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5 survey support)".
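The `SurveyMetadata` fields named above (`effective_n`, `design_effect`, `sum_weights`, `weight_range`) can be illustrated with a small self-contained sketch. This is not the repo's implementation: the helper name and the use of Kish's effective sample size (`n_eff = (Σw)² / Σw²`) with `deff = n / n_eff` are assumptions chosen for illustration.

```python
def kish_effective_n(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    sw = sum(weights)
    sw2 = sum(w * w for w in weights)
    return sw * sw / sw2

def summarize_weights(weights):
    # Sketch of the metadata fields the changelog entry names; the field
    # names mirror SurveyMetadata, but the exact formulas here (Kish
    # n_eff, deff = n / n_eff) are illustrative assumptions.
    n = len(weights)
    n_eff = kish_effective_n(weights)
    return {
        "effective_n": n_eff,
        "design_effect": n / n_eff,
        "sum_weights": sum(weights),
        "weight_range": (min(weights), max(weights)),
    }

meta = summarize_weights([1.0, 1.0, 2.0, 4.0])
```

With uniform weights `effective_n` equals `n` and `design_effect` is exactly 1.0, which is the same invariant the uniform-weights bit-parity test above relies on.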
- **`stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test` + `StuteJointResult`** (HeterogeneousAdoptionDiD Phase 3 follow-up). Joint Cramér-von Mises pretests across K horizons with shared-η Mammen wild bootstrap (preserves vector-valued empirical-process unit-level dependence per Delgado-Manteiga 2001 / Hlávka-Hušková 2020). The core `stute_joint_pretest` is residuals-in; two thin data-in wrappers construct per-horizon residuals for the two nulls the paper spells out: mean-independence (step 2 pre-trends, `OLS(Y_t − Y_base ~ 1)` per pre-period) and linearity (step 3 joint, `OLS(Y_t − Y_base ~ 1 + D)` per post-period). Sum-of-CvMs aggregation (`S_joint = Σ_k S_k`); per-horizon scale-invariant exact-linear short-circuit. Closes the paper Section 4.2 step-2 gap that Phase 3 `did_had_pretest_workflow` previously flagged with an "Assumption 7 pre-trends test NOT run" caveat. See `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Joint Stute tests" for algorithm, invariants, and scope exclusion of Eq 18 linear-trend detrending (deferred to Phase 4 Pierce-Schott replication).
- **`did_had_pretest_workflow(aggregate="event_study")`**: multi-period dispatch on balanced ≥3-period panels. Runs QUG at `F` + joint pre-trends Stute across earlier pre-periods + joint homogeneity-linearity Stute across post-periods. Step 2 closure requires ≥2 pre-periods; with only a single pre-period (the base `F-1`) `pretrends_joint=None` and the verdict flags the skip. Reuses the Phase 2b event-study panel validator (last-cohort auto-filter under staggered timing with `UserWarning`; `ValueError` when `first_treat_col=None` and the panel is staggered). The data-in wrappers `joint_pretrends_test` and `joint_homogeneity_test` also route through that same validator internally, so direct wrapper calls inherit the last-cohort filter and constant-post-dose invariant. `HADPretestReport` extended with `pretrends_joint`, `homogeneity_joint`, and `aggregate` fields; serialization methods (`summary`, `to_dict`, `to_dataframe`, `__repr__`) preserve the Phase 3 output bit-exactly on `aggregate="overall"` — no `aggregate` key, no header row, no schema drift — and only surface the new fields on `aggregate="event_study"`.
- **`target_parameter` block in BR/DR schemas (experimental; schema version bumped to 2.0)** — `BUSINESS_REPORT_SCHEMA_VERSION` and `DIAGNOSTIC_REPORT_SCHEMA_VERSION` bumped from `"1.0"` to `"2.0"` because the new `"no_scalar_by_design"` value on the `headline.status` / `headline_metric.status` enum (dCDH `trends_linear=True, L_max>=2` configuration) is a breaking change per the REPORTING.md stability policy. BusinessReport and DiagnosticReport now emit a top-level `target_parameter` block naming what the headline scalar actually represents for each of the 16 result classes. Closes BR/DR foundation gap #6 (target-parameter clarity). Fields: `name`, `definition`, `aggregation` (machine-readable dispatch tag), `headline_attribute` (raw result attribute), `reference` (citation pointer). BR's summary emits the short `name` right after the headline; DR's overall-interpretation paragraph does the same; both full reports carry a "## Target Parameter" section with the full definition. Per-estimator dispatch is sourced from REGISTRY.md and lives in the new `diff_diff/_reporting_helpers.py::describe_target_parameter`. A few branches read fit-time config (`EfficientDiDResults.pt_assumption`, `StackedDiDResults.clean_control`, `ChaisemartinDHaultfoeuilleResults.L_max` / `covariate_residuals` / `linear_trends_effects`); others emit a fixed tag (the fit-time `aggregate` kwarg on CS / Imputation / TwoStage / Wooldridge does not change the `overall_att` scalar — disambiguating horizon / group tables is tracked under gap #9). See `docs/methodology/REPORTING.md` "Target parameter" section.
7 changes: 5 additions & 2 deletions TODO.md
@@ -92,8 +92,11 @@ Deferred items from PR reviews that were not addressed before merge.
| Clustered-DGP parity: Phase 1c's DGP 4 uses manual `h=b=0.3` to sidestep an nprobust-internal singleton-cluster bug in `lpbwselect.mse.dpi`'s pilot fits. Once nprobust ships a fix (or we derive one independently), add a clustered-auto-bandwidth parity test. | `benchmarks/R/generate_nprobust_lprobust_golden.R` | Phase 1c | Low |
| `HeterogeneousAdoptionDiD` joint cross-horizon covariance on event study: per-horizon SEs use INDEPENDENT sandwiches in Phase 2b (paper-faithful pointwise CIs per Pierce-Schott Figure 2). A follow-up could derive an IF-based stacking of per-horizon scores for joint cross-horizon inference (needed for joint hypothesis tests across event-time horizons). Block-bootstrap is a reasonable alternative. | `diff_diff/had.py::_fit_event_study` | Phase 2b | Low |
| `HeterogeneousAdoptionDiD` event-study staggered-timing beyond last cohort: Phase 2b auto-filters staggered panels to the last cohort per paper Appendix B.2. Earlier-cohort treatment effects are not identified by HAD; redirecting to `ChaisemartinDHaultfoeuille` / `did_multiplegt_dyn` is the paper's prescription. A full staggered HAD would require a different identification path (out of paper scope). | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low |
| `HeterogeneousAdoptionDiD`: survey-design integration (`survey=SurveyDesign(...)`). Currently raises `NotImplementedError`. Requires Taylor-linearization of the β-scale rescaling and replicate-weight-compatible 2SLS variance on the mass-point path. | `diff_diff/had.py` | Phase 2a | Medium |
| `HeterogeneousAdoptionDiD`: `weights=` support. Deferred jointly with survey integration. nprobust's `lprobust` has no weight argument so the nonparametric continuous path needs a derivation; the 2SLS mass-point path needs weighted-sandwich parity. | `diff_diff/had.py` | Phase 2a | Medium |
| `HeterogeneousAdoptionDiD` Phase 4.5 B: `survey=` / `weights=` on `design="mass_point"` (weighted 2SLS + weighted-sandwich variance; the Wooldridge 2010 Ch. 12 weighted-IV sandwich has a Stata `ivregress ... [pweight=...]` + R `AER::ivreg(weights=...)` parity anchor). Also ships `aggregate="event_study"` + survey/weights via per-horizon IPW + shared PSU multiplier bootstrap across horizons. This PR (Phase 4.5 A) raises `NotImplementedError` on both paths. | `diff_diff/had.py::_fit_mass_point_2sls`, `diff_diff/had.py::_fit_event_study` | Phase 4.5 B | Medium |
| `HeterogeneousAdoptionDiD` Phase 4.5 C0: QUG-under-survey decision gate. `qug_test` uses a ratio of extreme order statistics `D_{(1)} / (D_{(2)} - D_{(1)})` — extreme-value theory under inverse-probability weighting is a research area, not a standard toolkit. Lit-review Guillou-Hall (2001), Chen-Chen (2004); likely outcome is `NotImplementedError` on `qug_test(..., weights=...)` with a clear pointer to the Stute/Yatchew/joint pretests as the survey-supported alternatives. | `diff_diff/had_pretests.py::qug_test` | Phase 4.5 C0 | Low |
| `HeterogeneousAdoptionDiD` Phase 4.5 C: pretests under survey (`stute_test`, `yatchew_hr_test`, `stute_joint_pretest`, `joint_pretrends_test`, `joint_homogeneity_test`, `did_had_pretest_workflow`). Rao-Wu rescaled bootstrap for the Stute-family (weighted η generation + PSU clustering in the bootstrap draw); weighted OLS residuals + weighted variance estimator for Yatchew. | `diff_diff/had_pretests.py` | Phase 4.5 C | Medium |
| `HeterogeneousAdoptionDiD` Phase 4.5: weight-aware auto-bandwidth MSE-DPI selector. Phase 4.5 A ships weighted `lprobust` with an unweighted DPI selector; users who want a weight-aware bandwidth must pass `h`/`b` explicitly. Extending `lpbwselect_mse_dpi` to propagate weights through density, second-derivative, and variance stages is ~300 LoC of methodology and was out of scope. | `diff_diff/_nprobust_port.py::lpbwselect_mse_dpi` | Phase 4.5 | Low |
| `HeterogeneousAdoptionDiD` Phase 4.5 C: replicate-weight SurveyDesigns (BRR / Fay / JK1 / JKn / SDR) on the continuous-dose paths. Phase 4.5 A raises `NotImplementedError` on replicate designs in `_aggregate_unit_resolved_survey`. Rao-Wu-style replicate bootstrap for HAD paths requires deriving the per-replicate weight-ratio rescaling for the local-linear intercept IF. | `diff_diff/had.py::_aggregate_unit_resolved_survey` | Phase 4.5 C | Low |
| `HeterogeneousAdoptionDiD` mass-point: `vcov_type in {"hc2", "hc2_bm"}` raises `NotImplementedError` pending a 2SLS-specific leverage derivation. The OLS leverage `x_i' (X'X)^{-1} x_i` is wrong for 2SLS; the correct finite-sample correction uses `x_i' (Z'X)^{-1} (...) (X'Z)^{-1} x_i`. Needs derivation plus an R / Stata (`ivreg2 small robust`) parity anchor. | `diff_diff/had.py::_fit_mass_point_2sls` | Phase 2a | Medium |
| `HeterogeneousAdoptionDiD` continuous paths: thread `cluster=` through `bias_corrected_local_linear` (Phase 1c's wrapper already supports cluster; Phase 2a ignores it with a `UserWarning` on the continuous path to keep scope tight). | `diff_diff/had.py`, `diff_diff/local_linear.py` | Phase 2a | Low |
| `HeterogeneousAdoptionDiD` Eq 18 linear-trend detrending (Pierce-Schott style): the joint-Stute infrastructure shipped in the Phase 3 follow-up supports pre-trends (mean-indep) and post-homogeneity (linearity) nulls. The Pierce-Schott application (paper Section 5.2) uses a LINEAR-TREND detrending of pre-period outcomes before the joint CvM — `Y_{g,t} - Y_{g,t_anchor} - (t - t_anchor)*(Y_{g,t_anchor} - Y_{g,t_anchor-1})` — reaching p=0.51 on US-China tariff data. Extends `joint_pretrends_test` with a detrending mode or a separate Eq 18-specific helper. Deferred to Phase 4 replication harness (where the published p=0.51 serves as the parity anchor). | `diff_diff/had_pretests.py::joint_pretrends_test` | Phase 4 | Medium |
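The Eq 18 detrending transform quoted in the last row above is simple enough to sketch directly. This is a minimal illustration of that formula for one group, assuming integer period indexing; the helper name is invented and the repo's eventual API may differ.

```python
def eq18_detrend(y_by_period, t, t_anchor):
    """Linear-trend detrending per the Eq 18 row above, for one group g:
    Y[t] - Y[t_anchor] - (t - t_anchor) * (Y[t_anchor] - Y[t_anchor - 1]).

    y_by_period: dict mapping integer period -> outcome for group g.
    The slope is the last pre-anchor increment, so a group on a perfectly
    linear pre-trend detrends to exactly zero at every t.
    """
    slope = y_by_period[t_anchor] - y_by_period[t_anchor - 1]
    return y_by_period[t] - y_by_period[t_anchor] - (t - t_anchor) * slope

# Perfectly linear outcomes detrend to zero:
y = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
# eq18_detrend(y, t=3, t_anchor=1) -> 4.0 - 2.0 - 2*(2.0 - 1.0) == 0.0
```

Feeding these detrended outcomes into the joint CvM in place of the raw `Y_t − Y_base` residuals is the extension the row defers to the Phase 4 Pierce-Schott replication.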
154 changes: 154 additions & 0 deletions benchmarks/R/generate_np_npreg_weighted_golden.R
@@ -0,0 +1,154 @@
# Generate cross-language weighted-OLS parity fixture for HAD Phase 4.5.
#
# Purpose: no public weighted-CCF reference exists for bias-corrected
# local-linear (nprobust::lprobust has no weight argument; np::npreg uses
# its own internal local-linear algorithm that does not reduce to a
# straightforward weighted OLS at the intercept). To validate the weighted
# kernel-composition machinery in diff_diff._nprobust_port cross-language,
# we record the intercept from an R implementation of the SAME formula
# (weighted-OLS with one-sided Epanechnikov kernel) that the Python port
# implements. Bit-parity at atol=1e-14 locks in numerical consistency
# across BLAS reductions.
#
# This is NOT third-party validation of the weighted-CCF methodology. It is
# a regression lock against R↔Python drift on the weighted-OLS formula
# itself. Methodology confidence under informative weights comes from:
# 1. Analytic derivation in docs/methodology/REGISTRY.md
# 2. Uniform-weights bit-parity: weights=np.ones ≡ unweighted at 1e-14
# 3. Monte Carlo oracle consistency (tests/test_had_mc.py)
#
# Usage:
# Rscript benchmarks/R/generate_np_npreg_weighted_golden.R
#
# Output:
# benchmarks/data/np_npreg_weighted_golden.json
#
# Phase 4.5 of HeterogeneousAdoptionDiD (de Chaisemartin et al. 2026).
# Python test loader: tests/test_np_npreg_weighted_parity.py.

library(jsonlite)

# -------------------------------------------------------------------------
# Weighted local-linear at a boundary: manual weighted OLS with Epa kernel.
# Matches diff_diff/local_linear.py::local_linear_fit exactly.
# -------------------------------------------------------------------------

weighted_local_linear <- function(d, y, weights, eval_point = 0.0, h = 0.3) {
# One-sided epanechnikov on [0, 1]: k(u) = (3/4)(1 - u^2), zero elsewhere.
u <- (d - eval_point) / h
kw <- ifelse(u >= 0 & u <= 1, 0.75 * (1 - u^2), 0)
# Combined weights: user weights * kernel weights.
combined <- kw * weights
# Active window (non-zero combined weight).
active <- combined > 0
if (sum(active) < 2) {
stop("Active window has fewer than 2 observations.")
}
# Weighted OLS of y ~ 1 + (d - eval_point), intercept is mu_hat at
# eval_point.
fit <- lm(y[active] ~ I(d[active] - eval_point), weights = combined[active])
mu_hat <- as.numeric(coef(fit)[1])
slope_hat <- as.numeric(coef(fit)[2])
list(
mu_hat = mu_hat,
slope = slope_hat,
n_active = as.integer(sum(active)),
h = h,
eval_point = eval_point
)
}

# -------------------------------------------------------------------------
# DGPs: deterministic seeds for reproducibility.
# -------------------------------------------------------------------------

set.seed(20260424)

dgp1 <- local({
G <- 500
d <- runif(G, 0, 1)
y <- 2 * d + 0.3 * d^2 + rnorm(G, sd = 0.25)
w <- rep(1.0, G)
list(d = d, y = y, w = w, eval_point = 0.0, h = 0.30,
description = "Uniform weights, G=500, boundary=0")
})

dgp2 <- local({
G <- 400
d <- runif(G, 0, 1)
y <- 2 * d + 0.3 * d^2 + rnorm(G, sd = 0.25)
w <- exp(-d * 2.0)
list(d = d, y = y, w = w, eval_point = 0.0, h = 0.25,
description = "Informative weights (exp decay from boundary), G=400")
})

dgp3 <- local({
G <- 200
d <- runif(G, 0, 1)
y <- 3 * d - d^2 + 0.5 * d^3 + rnorm(G, sd = 0.30)
w <- pmax(0.1, runif(G, 0.5, 1.5))
list(d = d, y = y, w = w, eval_point = 0.0, h = 0.35,
description = "Small G=200, nonlinear m(d), bounded heterogeneous weights")
})

dgp4 <- local({
G <- 400
d_lower <- 0.1
d <- runif(G, d_lower, 1)
y <- 2 * (d - d_lower) + 0.3 * (d - d_lower)^2 + rnorm(G, sd = 0.25)
w <- rep(1.0, G)
list(d = d - d_lower, y = y, w = w, eval_point = 0.0, h = 0.30,
description = "G=400, d_lower=0.1 shifted boundary=0 (Design 1 near-d_lower)",
d_lower = d_lower)
})

# -------------------------------------------------------------------------
# Run each DGP through the weighted local-linear reference.
# -------------------------------------------------------------------------

run_one <- function(name, dgp) {
cat(sprintf("Running %s: %s\n", name, dgp$description))
res <- weighted_local_linear(
d = dgp$d, y = dgp$y, weights = dgp$w,
eval_point = dgp$eval_point, h = dgp$h
)
list(
description = dgp$description,
n = length(dgp$d),
d = as.numeric(dgp$d),
y = as.numeric(dgp$y),
weights = as.numeric(dgp$w),
eval_point = as.numeric(res$eval_point),
h = as.numeric(res$h),
kernel = "epanechnikov",
n_active = res$n_active,
mu_hat = as.numeric(res$mu_hat),
slope = as.numeric(res$slope)
)
}

out <- list(
metadata = list(
r_version = paste(R.Version()$major, R.Version()$minor, sep = "."),
seed = 20260424L,
generator = "generate_np_npreg_weighted_golden.R",
algorithm = "manual weighted OLS with one-sided Epanechnikov kernel",
purpose = "HAD Phase 4.5 cross-language weighted-LL parity",
note = paste(
"Regression lock on the weighted kernel + weighted OLS formula",
"implemented in diff_diff.local_linear.local_linear_fit. Not a",
"third-party validation of weighted-CCF methodology; see REGISTRY",
"'Weighted extension (Phase 4.5)' for the parity-gap acknowledgement."
)
),
dgp1 = run_one("dgp1", dgp1),
dgp2 = run_one("dgp2", dgp2),
dgp3 = run_one("dgp3", dgp3),
dgp4 = run_one("dgp4", dgp4)
)

out_path <- "benchmarks/data/np_npreg_weighted_golden.json"
dir.create(dirname(out_path), recursive = TRUE, showWarnings = FALSE)

write_json(out, out_path, auto_unbox = TRUE, pretty = TRUE, digits = 14)
cat(sprintf("Wrote %s\n", out_path))