
HAD Phase 4.5: survey support on continuous-dose paths#359

Merged
igerber merged 12 commits into main from had-phase-4.5-survey-continuous
Apr 24, 2026

Conversation

@igerber (Owner) commented Apr 24, 2026

Summary

  • Ships survey=SurveyDesign(...) + weights=<array> on HeterogeneousAdoptionDiD.fit() for the two continuous-dose designs (continuous_at_zero, continuous_near_d_lower); see the usage sketch after this list. Mass-point, event-study, and HAD pretests remain NotImplementedError (deferred to Phase 4.5 B / C).
  • Weights thread through the Phase 1c lprobust port via pointwise kernel × user weight composition; SurveyDesign path composes PSU/strata/FPC variance via the existing compute_survey_if_variance Binder-TSL helper (no duplicate implementation).
  • Validation stack: uniform-weights bit-parity at atol=1e-14 on the full lprobust output struct, cross-language weighted-OLS parity against a manual R reference at atol=1e-12, and Monte Carlo oracle consistency on known-τ DGPs.
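
For orientation, a minimal usage sketch of the two entry points. Only the survey= / weights= keywords and the SurveyDesign fields come from this PR; the constructor call, the data argument, and the column names are illustrative.

```python
# Hedged sketch — everything except survey=/weights= and the SurveyDesign
# fields (weights, strata, psu, fpc) is illustrative, not the actual API.
from diff_diff import HeterogeneousAdoptionDiD, SurveyDesign  # assumed import path

est = HeterogeneousAdoptionDiD()  # construct as for the existing unweighted paths

# SurveyDesign path: Binder-TSL SE composed over PSUs within strata, with FPC.
res = est.fit(panel, survey=SurveyDesign(weights="w", strata="stratum",
                                         psu="psu", fpc="fpc"))

# pweight shortcut: identical ATT, weighted-robust SE instead of Binder-TSL.
res = est.fit(panel, weights=w_array)

# Passing both raises ValueError; the two knobs are mutually exclusive.
```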

Test plan

  • pytest tests/test_had.py tests/test_had_mc.py tests/test_nprobust_port.py tests/test_bias_corrected_lprobust.py tests/test_np_npreg_weighted_parity.py -m '' — all pass (469 tests; confirmed locally)
  • mypy diff_diff/had.py diff_diff/local_linear.py diff_diff/_nprobust_port.py — no new errors
  • ruff check diff_diff tests — clean on modified files
  • Rscript benchmarks/R/generate_np_npreg_weighted_golden.R regenerates the fixture without changes

Scope boundaries

Supported in this PR:

  • continuous_at_zero (paper Design 1', Eq 7) + continuous_near_d_lower (Design 1 near-d̲) under weights=<array> and survey=SurveyDesign(weights, strata, psu, fpc).
  • Uniform-weights bit-parity lock (regression test for the weighted extension).
  • Binder-TSL variance composition via compute_survey_if_variance / compute_replicate_if_variance (pre-existing helpers).
  • HeterogeneousAdoptionDiDResults.survey_metadata populated with method / variance-formula / weight-sum / effective-sample-size / n_strata / n_psu; closes the existing to_dict() gap at had.py:397-419.
  • Cross-language weighted-OLS parity fixture via benchmarks/R/generate_np_npreg_weighted_golden.R + tests/test_np_npreg_weighted_parity.py (13 tests).
  • MC oracle consistency tests (slow-gated, 4 tests) — the primary methodology anchor under informative weights.

Explicitly NOT in this PR (deferred with targeted NotImplementedError + phase pointer):

  • design="mass_point" under survey/weights (Phase 4.5 B — weighted 2SLS + sandwich).
  • aggregate="event_study" under survey/weights (Phase 4.5 B — per-horizon IPW + shared PSU multiplier bootstrap).
  • HAD pretests (qug_test, stute_test, yatchew_hr_test, joint Stute variants, did_had_pretest_workflow) under weights — Phase 4.5 C (Rao-Wu rescaled bootstrap).
  • Replicate-weight SurveyDesigns (BRR / Fay / JK1 / JKn / SDR) on the HAD continuous path — Phase 4.5 C.
  • Weight-aware auto-bandwidth MSE-DPI — pass h/b explicitly for weight-aware bandwidths; the auto path remains unweighted.

Parity ceiling

No public weighted-CCF bias-corrected local-linear reference exists in any language: nprobust::lprobust has no weight argument; np::npreg uses a proprietary local-linear algorithm that does not reduce to weighted OLS at the intercept; Stata rdrobust is RD-shaped, not HAD-shaped; and Julia Synthdid.jl is SDID-only. The Phase 1c atol=1e-12 R bit-parity bar is therefore NOT reachable on the weighted bias-corrected CI. This is documented explicitly in docs/methodology/REGISTRY.md §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5)".

🤖 Generated with Claude Code

igerber and others added 6 commits April 24, 2026 08:49
…local_linear

Extends `_nprobust_port.lprobust` with a `weights=` parameter that
multiplies into the kernel weights, propagating through design matrices,
Q.q bias-correction, variance matrices, and the HC2/HC3 hat-matrix
hii term. When `weights=None` (or `np.ones(N)`) the function is
bit-identical to the unweighted path (regression-tested at atol=1e-14).
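
A self-contained sketch of that composition, assuming a one-sided Epanechnikov boundary kernel (the kernel the later R fixture uses). This is plain weighted-OLS algebra for the intercept, not the port's actual internals, which also carry the Q.q bias-correction and variance pieces.

```python
import numpy as np

def epanechnikov(u):
    # one-sided boundary variant: support on [0, 1]
    return np.where((u >= 0) & (u <= 1), 0.75 * (1.0 - u**2), 0.0)

def weighted_local_linear_intercept(d, y, eval_point, h, w=None):
    """Boundary intercept of a local-linear fit with kernel x user weights."""
    w = np.ones_like(d, dtype=float) if w is None else np.asarray(w, dtype=float)
    W = epanechnikov((d - eval_point) / h) * w      # pointwise composition
    X = np.column_stack([np.ones_like(d), d - eval_point])
    XtW = X.T * W                                    # forms X'W row-wise
    beta = np.linalg.solve(XtW @ X, XtW @ y)         # (X'WX)^{-1} X'Wy
    return beta[0]  # w=None reduces to the unweighted path exactly
```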

Removes the NotImplementedError at `bias_corrected_local_linear` and
forwards weights to the internal lprobust call. Bandwidth selection
(auto mode) remains unweighted — the MSE-optimal DPI is not yet
weight-aware; users who want weight-aware bandwidths supply `h`/`b`
explicitly. Documented as a known gap for the REGISTRY subsection.

Parity ceiling acknowledged: no public weighted-CCF reference exists
(nprobust has no weight argument; rdrobust is RD-shaped). The
uniform-weights bit-parity test is the only bit-parity anchor; validation
under informative weights relies on MC oracle + np::npreg partial
parity, both deferred to subsequent commits in this PR.

Adds TestWeightedLprobust (11 tests) and TestWeightedBiasCorrectedLocalLinear
(7 tests): uniform-weights bit-parity at atol=1e-14 (classical + robust
variance, clustered + unclustered), weight-validator contracts (negative,
NaN, Inf, zero-sum, length-mismatch), zero-weight observations filter
from the active window, and an informative-weights mechanism-has-teeth
sanity check. All 91 existing Phase 1c parity tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Accepts `weights=<array>` and `survey=SurveyDesign(weights=...)` on
the `continuous_at_zero` and `continuous_near_d_lower` designs of
`HeterogeneousAdoptionDiD.fit()`. Routes the per-row weights through
a new `_aggregate_unit_weights` helper that validates constant-within-
unit (`feedback_no_silent_failures`) and produces a G-length per-unit
weight array, then threads it through the weighted `lprobust` in the
final continuous fit. Weighted population moments use `np.average`
so the unweighted path is bit-identical when `weights=None`.

API contract (guard sketch after this list):
- `weights=w` and `survey=SurveyDesign(weights=w)` are interchangeable
  (same per-unit aggregate, same result) — one knob per the merge-
  redundant-knobs principle, with the shortcut kept for ergonomics.
- `survey=` + `weights=` → ValueError (mutex).
- weights varying within unit → ValueError (prescribed rollup rule).
- negative / NaN / zero-sum weights → ValueError.
- `survey=<non-SurveyDesign-like>` → TypeError.
- mass_point + survey/weights → NotImplementedError (Phase 4.5 B).
- event_study + survey/weights → NotImplementedError (Phase 4.5 B).
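
A hedged sketch of the guard logic this contract implies; the helper name and messages are illustrative, not the repo's actual code.

```python
from diff_diff import SurveyDesign  # assumed import path

def _validate_weight_knobs(survey, weights, design, aggregate):
    """Illustrative guards only; the real checks live in fit()."""
    if survey is not None and weights is not None:
        raise ValueError("pass either survey= or weights=, not both")
    if survey is not None and not isinstance(survey, SurveyDesign):
        raise TypeError("survey= expects a SurveyDesign-like object")
    if (design == "mass_point" or aggregate == "event_study") and (
        survey is not None or weights is not None
    ):
        raise NotImplementedError("survey/weights here is deferred to Phase 4.5 B")
    # negative / NaN / zero-sum weights and weights varying within unit
    # likewise raise ValueError (not shown).
```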

Result object:
- `HeterogeneousAdoptionDiDResults.survey_metadata` now carries
  `{method, source, n_units_weighted, weight_sum, effective_sample_size}`
  under weighted fits (was always `None` before — closed the existing
  `to_dict()` gap at had.py:397-419).
- `to_dict()` emits `survey_metadata` key (`None` on unweighted).
- `summary()` renders a "Survey method" / "Effective sample size"
  block when present.

PSU/strata/FPC composition via `compute_survey_if_variance` is the
subsequent commit; this commit ships the pweight path with the
weighted-robust SE from Phase 1c, not full Binder TSL.

Tests: 16 new TestHADSurvey tests covering uniform-weights bit-parity
(both continuous designs), survey=weights= equivalence, validator
contract, deferred-path rejection, survey_metadata population, and
to_dict/summary surfacing. 213 existing HAD tests pass unchanged
(scaffolding test_weights_raises removed; test_survey_raises updated
to test_survey_bad_type_raises expecting TypeError on non-SurveyDesign
input).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires full ``survey=SurveyDesign(weights, strata, psu, fpc)`` support
on HAD's continuous-dose paths. When a SurveyDesign is supplied, the
standard error routes through Binder (1983) Taylor-series linearization:
per-unit influence function of the local-linear intercept aggregated
by PSU within strata, with FPC correction (``compute_survey_if_variance``
at ``survey.py:1802``). The estimator consumes the existing helper —
no duplicate implementation per ``feedback_psu_telescoping_tsl_vs_replicate``.

Core additions:
- ``LprobustResult.influence_function`` (Optional[ndarray]): per-obs IF
  of the classical intercept, populated only when ``return_influence=True``.
  Aligned to original caller ordering (inverse-permuted when vce='nn'
  sorts). Observations outside the active kernel window have IF=0.
  Bit-parity: ``weights=np.ones`` IF ≡ unweighted IF at atol=1e-14.
- ``BiasCorrectedFit.influence_function``: same IF forwarded from the
  internal lprobust call.
- ``_aggregate_unit_resolved_survey``: collapses a row-level
  ``ResolvedSurveyDesign`` to unit-level by requiring constant-within-
  unit for weights/strata/psu/fpc (same invariant as cluster columns);
  raises ``ValueError`` on inconsistency (``feedback_no_silent_failures``).
  Replicate-weight designs raise ``NotImplementedError`` (deferred to
  Phase 4.5 C, where Stute-family pretests ship the Rao-Wu bootstrap).
- ``_fit_continuous`` (HAD) accepts ``resolved_survey_unit``. When set,
  uses ``bc_fit.influence_function`` + ``compute_survey_if_variance``
  for the SE; otherwise falls back to ``bc_fit.se_robust / |den|``
  (pweight-only convention for ``weights=<array>``). Paper Eq 8 treats
  the denominator as fixed at the leading order (sqrt(G) rate vs the
  G^{2/5} rate of the nonparametric estimator). A variance sketch
  follows this list.
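
For reference, the stratified between-PSU variance that Binder-TSL composition conventionally computes, sketched under assumptions: the real compute_survey_if_variance signature at survey.py:1802 is not shown here, the fpc argument is treated as a per-stratum sampling fraction, and lonely-PSU handling is omitted.

```python
import numpy as np

def stratified_psu_variance(infl, strata, psu, fpc=None):
    """Textbook TSL sketch: sum over strata h of
    (1 - f_h) * n_h/(n_h - 1) * sum_i (z_hi - z_bar_h)^2,
    where z_hi is the PSU-total of the unit-level influence function."""
    total = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        z = np.array([infl[in_s & (psu == p)].sum() for p in np.unique(psu[in_s])])
        n_h = z.size                                  # lonely PSU (n_h == 1) not handled here
        f_h = 0.0 if fpc is None else float(fpc[in_s][0])  # assumed sampling fraction
        total += (1.0 - f_h) * (n_h / (n_h - 1)) * np.sum((z - z.mean()) ** 2)
    return total  # variance of the linearized estimator; SE = sqrt(total)
```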

API contract:
- ``weights=<array>``: lightweight pweight shortcut; weighted-robust SE.
- ``survey=SurveyDesign(weights='col')``: SurveyDesign path; Binder-TSL
  SE. Under no strata/PSU, reduces to (n/(n-1))-adjusted sum-of-IF² —
  asymptotically equivalent but numerically distinct from weights=.
- ATT is identical on both paths (same weighted lprobust); only the SE
  diverges (intended).
- ``HeterogeneousAdoptionDiDResults.survey_metadata`` now carries
  ``method`` ('pweight' vs 'survey_binder_tsl'), ``variance_formula``,
  ``n_strata``, ``n_psu`` diagnostics so the composition is
  introspectable without inspecting the estimator.

Tests: 5 new TestHADSurvey tests (21 total in that class) covering
SurveyDesign-with-strata SE divergence, PSU clustering SE effect,
strata-varies-within-unit ValueError, replicate-weights NotImplementedError
deferral, and survey_metadata recording the Binder-TSL method. All 325
existing tests (across test_had, test_bias_corrected_lprobust,
test_nprobust_port) pass unchanged.

Parity ceiling unchanged: Binder TSL composition is validated via the
existing dCDH/EfficientDiD usage of ``compute_survey_if_variance``, not
against a weighted-CCF public reference (none exists). Methodology
confidence for the bias-corrected CI under weights comes from the MC
oracle tests and np::npreg partial parity (subsequent commits).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships a bit-parity regression lock against R's manually-implemented
weighted-OLS with one-sided Epanechnikov kernel. The reference matches
``diff_diff/local_linear.py::local_linear_fit`` exactly in formula, so
parity holds at atol=1e-12 despite BLAS reduction-ordering differences — 4 DGPs
(uniform / informative / bounded-heterogeneous / shifted-boundary),
each validating intercept, slope, and active-window count.

Scope note: NOT third-party validation. No public weighted-CCF
bias-corrected local-linear reference exists in any language —
nprobust::lprobust has no weight argument, and np::npreg's local-linear
algorithm diverges from a straightforward weighted-OLS at the intercept
(preliminary exploration showed ~4-8% discrepancy vs manual weighted
lm). The fixture is a cross-language regression lock on the weighted
kernel + weighted-OLS formula, matching the R↔Python parity convention
used elsewhere in the benchmark suite. Methodology confidence under
informative weights comes from the uniform-weights bit-parity (already
shipped) + Monte Carlo oracle consistency (next commit).

Files:
- ``benchmarks/R/generate_np_npreg_weighted_golden.R``: new R generator
  with 4 deterministic DGPs. Uses ``jsonlite`` (already in
  ``requirements.R``), no new CRAN dependencies.
- ``benchmarks/data/np_npreg_weighted_golden.json``: tracked golden
  (~600 KB, consistent with other benchmark fixtures in this repo).
- ``tests/test_np_npreg_weighted_parity.py``: 13 parametrized tests
  (intercept / slope / n_active × 4 DGPs + 1 uniform-weights
  cross-check). Skip-guarded via ``pytest.skip`` per
  ``feedback_golden_file_pytest_skip``; the pattern is sketched after
  this list.
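
The skip-guarded golden-file pattern, sketched. The paths come from this commit; the local_linear_fit call shape, result fields, and JSON field names are assumptions.

```python
import json
import pathlib

import numpy as np
import pytest

from diff_diff.local_linear import local_linear_fit  # call shape assumed below

GOLDEN = pathlib.Path("benchmarks/data/np_npreg_weighted_golden.json")

def test_weighted_ols_parity_sketch():
    if not GOLDEN.exists():  # per feedback_golden_file_pytest_skip
        pytest.skip("golden fixture missing; run "
                    "benchmarks/R/generate_np_npreg_weighted_golden.R")
    for case in json.loads(GOLDEN.read_text()):  # JSON field names assumed
        fit = local_linear_fit(np.array(case["d"]), np.array(case["y"]),
                               weights=np.array(case["w"]))
        assert fit.intercept == pytest.approx(case["intercept"], abs=1e-12)
        assert fit.slope == pytest.approx(case["slope"], abs=1e-12)
```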

The filename ``np_npreg_weighted_golden.*`` is retained from the
planning phase (plan named the fixture after ``np::npreg``) even though
the final R implementation is a manual weighted-OLS, so the generator
name stays discoverable in the benchmark directory listing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… HAD

Adds slow-gated MC tests that validate the weighted-HAD methodology
under known-tau DGPs — the only real methodology anchor available
given the parity ceiling (no public weighted-CCF reference exists).

Four tests covering:
  1. Uniform weights recover β under SRS (regression baseline).
  2. Informative weights + inverse-probability pweights recover β under
     informative sampling (the actual weight-mechanism claim).
  3. 95% CI coverage ≥ 60% at n_reps=25 (loose bar that stays above
     chance while tolerating CCT-2014's known undercoverage at small G
     + reduced MC reps).
  4. Unweighted-under-informative-sampling bias vs weighted-under-same:
     the bias-reduction is the test teeth — if weighting failed to
     correct bias, this would flag it.

DGP: linear m(d) = β·D (WAS = β exactly, no quadratic confound) with
G=500, n_reps scaled via ``ci_params.bootstrap(200, min_n=25)``. Slow-
gated per repo convention; all four run in ~2.3s in pure-Python mode
so they're cheap enough to execute with -m slow on every CI run.
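
A stylized version of such a DGP. How informativeness enters and all constants are illustrative; the committed DGP in tests/test_had_mc.py may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
G, beta = 500, 0.7
D = rng.uniform(0.05, 1.0, size=G)              # continuous dose
dY = beta * D + rng.normal(scale=0.5, size=G)   # linear m(d) = beta*D, so WAS = beta
pi = 0.2 + 0.6 * D                              # dose-dependent inclusion probability
sampled = rng.uniform(size=G) < pi
w = 1.0 / pi[sampled]                           # inverse-probability pweights
# A weighted fit on (D[sampled], dY[sampled], w) should recover beta;
# the unweighted fit on the same informative sample is the bias benchmark (test 4).
```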

Per ``feedback_run_own_artifacts_before_done``: all four tests executed
green locally before commit, not just skip-guarded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
REGISTRY: adds a "Weighted extension (Phase 4.5 survey support)"
subsection under HAD Phase 1c anchor. Covers:
- Weighted kernel-weighted least squares derivation + kernel-composition
  formula W_combined = k((D - d̲)/h) · w_g and β-scale rescaling under
  weighted population moments (restated in the math sketch after this
  list).
- SurveyDesign composition via Binder (1983) TSL through the shared
  `compute_survey_if_variance` helper (no duplicate implementation per
  `feedback_psu_telescoping_tsl_vs_replicate`).
- Explicit "parity gap" Note flagging that no public weighted-CCF
  bias-corrected local-linear reference exists — the 1e-12 R bit-parity
  bar is unreachable on the weighted bias-corrected CI.
- Five `**Note:**` subsections documenting the methodology-confidence
  stack: uniform-weights bit-parity (1e-14), cross-language weighted-OLS
  parity (1e-12), MC oracle consistency, unweighted auto-bandwidth gap,
  and replicate-weight + mass-point + pretest deferrals.
- Explicit dispatch-matrix note: survey/weights supported on
  continuous_at_zero + continuous_near_d_lower only; mass_point,
  event_study, and HAD pretests deferred per phase sequencing.
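
Restating the two formulas from the first bullet in one place, under the assumption that the β-scale denominator matches the effective_dose_mean definition introduced later in this PR (with d̲ = 0 for continuous_at_zero):

```latex
W_g = k\!\left(\frac{D_g - \underline{d}}{h}\right) \cdot w_g,
\qquad
\hat\beta_w = \frac{\hat\tau_{bc}(\underline{d})}{\bar D_w},
\qquad
\bar D_w = \frac{\sum_g w_g\,(D_g - \underline{d})}{\sum_g w_g}
```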

CHANGELOG: adds an Unreleased "Added" entry covering the public API
surface (`survey=` / `weights=` on HAD.fit), the result-object metadata
(`survey_metadata`), the parity ceiling, the methodology-confidence
stack, and the Phase 4.5 B / C deferrals.

TODO: removes the two rows marking survey integration + weights support
as deferred (shipped); adds five new rows tracking the remaining
Phase 4.5 work (B: mass-point + event-study survey; C0: QUG-under-
survey decision; C: pretests under survey; weight-aware auto-bandwidth;
replicate-weight continuous-path support). Each row references a
specific file and phase so future PRs have a clear anchor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

Blocker

The new survey= continuous-HAD path reports inference for the wrong estimator scale: the ATT uses the bias-corrected tau_bc, but the survey SE is built from the classical local-linear influence function. That is silently wrong statistical output on the main new feature.

Executive Summary

  • The survey variance path uses the classical local-linear IF (V_Y_cl-aligned) while the returned ATT is built from tau_bc.
  • Survey p-values/CIs ignore df_survey and default to normal inference instead of finite-df survey t inference.
  • survey= does not reject SurveyDesign(weight_type="aweight"|"fweight"), so unsupported weight types are silently treated as sampling weights.
  • The large scope deferrals in this PR (weighted mass_point, weighted event_study, replicate weights, weight-aware DPI bandwidth) are properly documented in REGISTRY.md/TODO.md and are not blockers.

Methodology

  • P0 Wrong estimator-scale variance on the new survey path (diff_diff/_nprobust_port.py:L1007-L1016, diff_diff/_nprobust_port.py:L1347-L1382, diff_diff/had.py:L2843-L2856). Impact: att is formed from tau_bc, but compute_survey_if_variance() receives an IF explicitly documented as the classical tau.cl/mu_hat IF with the V_Y_cl self-check. The survey SE/CI therefore target a different estimator than the one returned. Concrete fix: expose and use a bias-corrected IF aligned with Q_q/res_b/V_Y_bc, or reject survey= on this path until that derivation exists.
  • P1 Survey inference ignores survey degrees of freedom (diff_diff/had.py:L2667-L2699, diff_diff/survey.py:L590-L627, diff_diff/utils.py:L175-L209). Impact: survey fits always use normal-theory inference; small-PSU designs get overstated precision, and any df_survey <= 0 case would miss the repo’s NaN-inference contract. Concrete fix: thread resolved_survey_unit.df_survey into safe_inference(..., df=...) and store it in HAD’s survey metadata.
  • P1 survey= silently accepts non-pweight designs (diff_diff/had.py:L2353-L2378, diff_diff/survey.py:L914-L947). Impact: SurveyDesign(weight_type="aweight"|"fweight") is reinterpreted as sampling/design weights in both the weighted kernel fit and Binder variance, which changes the estimand/inference silently. Concrete fix: reject weight_type != "pweight" up front on HAD’s survey path.

Code Quality

  • No additional material findings beyond the methodology issues above.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • P3 Informational: the deferrals for weighted mass_point, weighted event_study, replicate-weight HAD, and weight-aware DPI bandwidth are explicitly tracked (docs/methodology/REGISTRY.md:L2263-L2265, TODO.md:L95-L99). Impact: none for this review. Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • P2 The new HAD survey tests currently encode the wrong SE expectation and miss the key regression gaps (tests/test_had.py:L3303-L3324, tests/test_had.py:L3489-L3530). Impact: the suite would keep passing with the wrong estimator-scale survey SE and with missing df_survey/weight_type handling. Concrete fix: add SRS-equivalence coverage for survey=SurveyDesign(weights="w"), finite-df survey inference tests, and rejection tests for aweight/fweight.
  • P3 User-facing HAD docs are stale (diff_diff/had.py:L2264-L2268, diff_diff/had.py:L280-L282). Impact: the docstring still says survey/weights are deferred and survey_metadata is always None. Concrete fix: update the fit/result docstrings to describe the supported weighted continuous paths and their constraints.

Path to Approval

  1. Return a bias-corrected survey IF aligned with tau_bc/V_Y_bc, and use that in compute_survey_if_variance() for the HAD survey path.
  2. Pass resolved_survey_unit.df_survey to safe_inference and expose it in HAD result metadata.
  3. Reject SurveyDesign inputs whose weight_type is not pweight.
  4. Replace/add tests so the no-PSU/no-strata survey path matches the estimator-scale weighted variance, df_survey changes inference, and non-pweight survey designs fail fast.

P0 — bias-corrected survey IF: the survey variance path was building
the influence function from classical residuals (``res_h`` + ``R_p·W_h``,
aligned with ``V_Y_cl``) while the returned ATT uses the bias-corrected
``tau_bc``. Under compute_survey_if_variance this silently
under-estimated the survey SE by ignoring the CCT-2014 bias-correction
variance inflation. Fixed in ``_nprobust_port.lprobust``: IF now uses
``Q_q`` + ``res_b`` so ``sum(IF^2) == V_Y_bc[0,0]`` (verified in new
white-box test), matching the estimator scale of the ATT. Uniform-
weights bit-parity (weights=np.ones ≡ unweighted at 1e-14) preserved
across the new IF formula. The ``TestWeightedLprobust`` HC1 bit-parity
test still passes because under weights=ones the classical and
bias-corrected IFs differ only by the Q.q bias-correction term, which
is deterministic and cancels in the diff.
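
The white-box invariant, sketched; argument names other than weights= / return_influence= and the V_Y_bc field spelling are assumptions taken from this commit's prose.

```python
import numpy as np

from diff_diff._nprobust_port import lprobust  # weights=/return_influence= per this commit

# Argument names besides weights= / return_influence= are illustrative.
res = lprobust(d, dy, eval=0.0, h=h, b=b, weights=w, return_influence=True)

# Estimator-scale alignment: the IF reproduces the bias-corrected variance
# entry (V_Y_bc), not the classical V_Y_cl one.
np.testing.assert_allclose(np.sum(res.influence_function ** 2),
                           res.V_Y_bc[0, 0], rtol=1e-10)
```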

P1a — df_survey threading: survey fits previously used Normal-theory
inference regardless of PSU count. ``resolved_survey_unit.df_survey``
(n_psu − n_strata, or replicate QR rank − 1) now routes through
``safe_inference(..., df=...)`` on the survey path, producing t-based
p-values and CIs. Also surfaced in ``survey_metadata["df_survey"]`` for
introspection. Under small-PSU designs the t-critical exceeds the
Normal z-critical, widening the CI vs the prior (wrong) output. The
``weights=`` shortcut continues to use Normal inference since there's
no PSU structure to produce a finite df.
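
The inference change in miniature, assuming scipy for the critical values; safe_inference internals are not reproduced.

```python
from scipy import stats

def survey_ci(att: float, se: float, n_psu: int, n_strata: int, level: float = 0.95):
    """Hedged sketch: the survey= path uses Student-t with df = n_psu - n_strata."""
    df_survey = n_psu - n_strata
    t_crit = stats.t.ppf(0.5 + level / 2, df_survey)
    return att - t_crit * se, att + t_crit * se

# The weights= shortcut keeps Normal inference (no PSU structure, no finite df):
# t.ppf(0.975, df) > norm.ppf(0.975), so small-PSU survey CIs widen.
```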

P1b — reject non-pweight SurveyDesigns: HAD's weighted kernel composition
``W_combined = k((D-d̲)/h) · w`` implements inverse-probability weighting
semantics. ``SurveyDesign(weight_type="aweight"|"fweight")`` is now
rejected with ``NotImplementedError`` at fit-time — aweight (analytic)
implies a different inferential target (weighted regression, not
design-based inference), and fweight (frequency) implies observation
replication. Neither has been derived for HAD's continuous-dose path;
deferred as a follow-up.

P2 — test coverage: six new ``TestHADSurvey`` tests locking in the
three fixes above:
  - ``test_survey_if_uses_bias_corrected_scale``: white-box
    ``sum(IF^2) == V_Y_bc[0,0]`` under nonlinear DGP where
    V_Y_bc ≠ V_Y_cl (teeth).
  - ``test_survey_df_widens_ci_vs_normal`` +
    ``test_survey_df_threaded_into_inference_via_t_distribution``:
    assert df_survey surfaces in metadata and produces t-CI half-width
    matching ``t_crit(df) · se`` exactly.
  - ``test_survey_aweight_raises_not_implemented`` +
    ``test_survey_fweight_raises_not_implemented``: front-door rejection.
  - ``test_survey_no_psu_no_strata_se_matches_weights_hc1``: SRS
    equivalence — survey SE within 10-15% of weights-shortcut SE (the
    (n/(n-1))-style HC1 correction), ruling out a silent
    V_Y_cl-vs-V_Y_bc mismatch.

P3 — docstring refresh: the stale fit() docstring calling survey/weights
"Reserved for a follow-up PR" is replaced with the actual Phase 4.5 A
contract (pweight-only, constant-within-unit, replicate-weight deferred,
mass-point/event-study/pretests deferred). ``survey_metadata`` docstring
on ``HeterogeneousAdoptionDiDResults`` now enumerates the actual dict
keys and their semantics.

All 348 tests (across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite) pass after the cascade. Ruff clean on modified files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6125fda109df1d055f5e2412d840e0d58df704f2


Overall Assessment

⚠️ Needs changes

Static review only; I did not run the test suite because the sandbox Python environment is missing numpy.

Executive Summary

  • The previous review’s methodology blockers appear addressed: the survey IF now targets the bias-corrected scale (V_Y_bc), survey d.f. is threaded into safe_inference, and non-pweight survey designs are rejected fast. See diff_diff/_nprobust_port.py:L1359-L1393, diff_diff/had.py:L2390-L2408, diff_diff/had.py:L2710-L2755, tests/test_had.py:L3589-L3743.
  • P1 The new HAD survey path returns survey_metadata as a plain dict instead of the repo-standard SurveyMetadata object. Shared report/diagnostic consumers read that field via attribute access, so HAD survey results will silently lose survey-specific metadata outside had.py. See diff_diff/had.py:L280-L288, diff_diff/had.py:L2745-L2755, diff_diff/business_report.py:L606-L612, diff_diff/business_report.py:L815-L837, diff_diff/diagnostic_report.py:L1803-L1810, diff_diff/results.py:L14-L45, tests/test_survey.py:L610-L615.
  • P2 Weighted continuous fits still do not expose a clear result-level denominator contract: the estimator uses weighted population moments, but the public result still stores/displays only raw dose_mean / D_bar, and there is no weighted regression coverage for that metadata. See diff_diff/had.py:L249-L250, diff_diff/had.py:L2606, diff_diff/had.py:L2767, diff_diff/had.py:L2837-L2850, diff_diff/had.py:L348-L348, tests/test_had.py:L346-L350.
  • The documented scope deferrals for weighted mass_point, weighted event_study, replicate weights, and weight-aware DPI bandwidth are properly recorded in REGISTRY.md/TODO.md and are not blockers. See docs/methodology/REGISTRY.md:L2255-L2265, TODO.md:L95-L99.

Methodology

  • P3 No unmitigated methodology defects found in the changed estimator path. The earlier P0/P1 issues are fixed by the bias-corrected IF construction, survey-d.f. inference routing, and pweight-only guard. Impact: none. Concrete fix: none. References: diff_diff/_nprobust_port.py:L1359-L1393, diff_diff/had.py:L2390-L2408, diff_diff/had.py:L2710-L2755, tests/test_had.py:L3589-L3743.

Code Quality

No findings.

Performance

No findings.

Maintainability

  • P1 survey_metadata contract drift on HAD survey fits. The PR changes HeterogeneousAdoptionDiDResults.survey_metadata to a dict (diff_diff/had.py:L280-L288, diff_diff/had.py:L2745-L2755), but shared consumers expect the repo-standard SurveyMetadata object and use attribute access (diff_diff/business_report.py:L606-L612, diff_diff/business_report.py:L815-L837, diff_diff/diagnostic_report.py:L1803-L1810, diff_diff/results.py:L14-L45). Existing survey tests elsewhere in the repo also assert SurveyMetadata instances (tests/test_survey.py:L610-L615). Impact: if a HAD survey result is passed into shared reporting/diagnostic tooling, fields like df_survey, weight_type, effective_n, design_effect, n_strata, and n_psu will be silently dropped or mislabelled. Concrete fix: return the shared SurveyMetadata dataclass for HAD too (ideally via compute_survey_metadata(...)), and move HAD-specific extras such as method, source, and variance_formula to a separate field or an extended metadata object; add a regression test that a HAD survey fit preserves survey inference/report metadata through the generic report helpers.

Tech Debt

  • P3 Informational: the main deferred items introduced by this PR are properly tracked and documented, so they are not blockers here. Impact: none. Concrete fix: none required in this PR. References: docs/methodology/REGISTRY.md:L2255-L2265, TODO.md:L95-L99.

Security

No findings.

Documentation/Tests

  • P2 Weighted-fit result metadata and docs are still inconsistent with the shipped surface. The continuous weighted estimator now uses weighted population moments (diff_diff/had.py:L2837-L2850), but the public result still documents/stores only raw dose_mean / D_bar (diff_diff/had.py:L249-L250, diff_diff/had.py:L2606, diff_diff/had.py:L2767, diff_diff/had.py:L348-L348), and the new tests do not pin any weighted metadata contract (tests/test_had.py:L346-L350, tests/test_had.py:L3421-L3430). There are also stale in-code docs that still describe the pre-Phase-4.5 surface or wrong IF semantics (diff_diff/_nprobust_port.py:L34-L39, diff_diff/local_linear.py:L936-L941, diff_diff/had.py:L227-L233). Impact: users and future maintainers cannot reliably tell from the result object/docstrings which denominator and IF scale a weighted HAD fit actually used. Concrete fix: expose the effective weighted denominator explicitly (or rename/clarify dose_mean so it is obviously descriptive rather than estimator-defining), add weighted-fit assertions for it, and update the stale docstrings to match the shipped behavior.

Path to Approval

  1. Restore the repo-wide survey_metadata contract for HAD survey fits by returning a SurveyMetadata-compatible object, or update all shared consumers that read survey_metadata to accept the new mapping shape; add a regression test through the generic report/diagnostic path.
  2. Clarify the weighted denominator metadata on HAD results: either store the actual weighted denominator used by the estimator, or add a separate explicitly named field for it and document that dose_mean is raw-sample-only; add weighted tests for both continuous_at_zero and continuous_near_d_lower.

P1 — survey_metadata contract drift: the previous HAD survey fits
returned ``survey_metadata`` as a plain dict, but the repo-standard is
the :class:`diff_diff.survey.SurveyMetadata` dataclass read via
attribute access by BusinessReport, DiagnosticReport, and the shared
``results.py`` plumbing. Passing a HAD survey result through those
consumers would silently drop ``df_survey`` / ``effective_n`` /
``n_strata`` / ``n_psu`` under the dict shape. Fixed by routing both
entry paths (``survey=SurveyDesign(...)`` and ``weights=<array>``)
through :func:`compute_survey_metadata`; ``weights=`` constructs a
minimal ``ResolvedSurveyDesign`` (weights-only, no strata/PSU/FPC)
uniformly. HAD-specific extras (``variance_formula`` label, the
pweight vs Binder-TSL method tag) are moved to a new top-level field
``HeterogeneousAdoptionDiDResults.variance_formula`` (Optional[str])
rather than polluting the shared SurveyMetadata schema. New regression
test ``test_survey_metadata_is_surveymetadata_instance`` exercises
attribute access on both entry paths.

P2 — weighted denominator contract: the continuous paths use weighted
population moments internally but the public result still stored only
the raw ``dose_mean`` — a user reconstructing β-scale by hand would
have been off by the (weighted − raw) gap. Added
``HeterogeneousAdoptionDiDResults.effective_dose_mean`` (Optional[float])
that exposes the actual weighted denominator used by the beta-scale
rescaling: ``sum(w·D)/sum(w)`` for ``continuous_at_zero``,
``sum(w·(D−d_lower))/sum(w)`` for ``continuous_near_d_lower``. ``None``
on unweighted fits (use ``dose_mean`` there; the two coincide under
uniform weights and the duplicate field would be noise). Raw
``dose_mean`` preserved for backward compat; docstring clarifies the
distinction. Four new regression tests covering both continuous
designs + the uniform-weights coincidence + the unweighted
``None``-contract.

P3 — stale docstrings (``_nprobust_port.py``, ``local_linear.py``,
``had.py``): three docstring blocks still described the pre-Phase-4.5
surface ("weights= not supported" / "IF of the classical estimate"
/ "CCT-2014 robust SE divided by |den|"). Updated to describe the
shipped behavior: lprobust supports weights + return_influence on the
BIAS-CORRECTED scale, bias_corrected_local_linear's
``influence_function`` field aligns with ``V_Y_bc``, and HAD's ``se``
documentation now enumerates the three SE paths (unweighted, pweight
via weighted-robust, survey via Binder-TSL).

All 353 tests pass after refactor (across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d86084070c401ccbb11e5926b342faf16d4ec965


Overall Assessment

Looks good

Static review only; I could not run the test suite in this sandbox because Python here is missing numpy.

Executive Summary

  • The previous rerun blockers look addressed: the survey path now uses the bias-corrected IF, threads survey d.f. into safe_inference() for real survey= fits, and returns repo-standard SurveyMetadata instead of a plain dict. References: diff_diff/_nprobust_port.py:L1364-L1398, diff_diff/had.py:L2745-L2804, diff_diff/had.py:L288-L299.
  • I did not find an unmitigated methodology mismatch against the new registry entry for weighted continuous HAD. The scope exclusions for weighted mass_point, weighted event_study, replicate weights, and weight-aware DPI are explicitly documented/tracked, so they are not blockers. References: docs/methodology/REGISTRY.md:L2249-L2268, TODO.md:L95-L99.
  • P2 The weights= shortcut still populates survey-style df_survey metadata even though its own inference stays normal-theory. That makes summaries/reports describe some weights= fits as finite-df survey fits when they are not. References: diff_diff/had.py:L2745-L2804, diff_diff/had.py:L407-L413, diff_diff/business_report.py:L606-L612.
  • P2 The new weighted-result fields are only partially surfaced: effective_dose_mean and variance_formula are stored on the result object, but to_dict() omits them and summary() still only shows raw dose_mean. References: diff_diff/had.py:L341-L359, diff_diff/had.py:L443-L470, diff_diff/had.py:L381-L413, tests/test_had.py:L3438-L3446, tests/test_had.py:L3755-L3816.

Methodology

No findings.

The changed estimator path matches the Methodology Registry as written: kernel weights are composed pointwise with user weights, survey variance is built from the bias-corrected IF, non-pweight survey types are rejected, and the deferred paths are documented rather than silently shipped. References: diff_diff/_nprobust_port.py:L1220-L1297, diff_diff/_nprobust_port.py:L1364-L1398, diff_diff/had.py:L2424-L2455, diff_diff/had.py:L2964-L2979, docs/methodology/REGISTRY.md:L2255-L2268.

Code Quality

  • Severity P2 weights= fits advertise survey d.f. metadata that their own inference path does not use. On the weights= shortcut, fit() keeps df=None for safe_inference() (diff_diff/had.py:L2745-L2755), but then builds survey_metadata from a synthetic ResolvedSurveyDesign, which gives df_survey = n_obs - 1 and makes summary() / BusinessReport treat the fit as a finite-df survey result (diff_diff/had.py:L2781-L2804, diff_diff/had.py:L407-L413, diff_diff/business_report.py:L606-L612).
    Impact: user-facing summaries/reports can misdescribe the inference family for weights= results, even though the stored CI/p-value were computed with the normal reference.
    Concrete fix: on the weights= path, either clear survey-only fields such as df_survey (and ideally n_psu/n_strata) after compute_survey_metadata(), or teach report/summary consumers to gate survey-only behavior on variance_formula == "survey_binder_tsl" rather than survey_metadata is not None.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 Informational: the remaining weighted-scope exclusions are properly documented and tracked, so they are not blockers for this PR. References: TODO.md:L95-L99, docs/methodology/REGISTRY.md:L2266-L2268.
    Impact: none.
    Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

  • Severity P2 The new weighted-result metadata is only partially propagated. effective_dose_mean and variance_formula are added to HeterogeneousAdoptionDiDResults and populated in fit() (diff_diff/had.py:L341-L359, diff_diff/had.py:L2805-L2815), but to_dict() still omits both (diff_diff/had.py:L443-L470), and summary() still displays only raw dose_mean (diff_diff/had.py:L381-L413). The tests pin the object fields themselves but not the serialized/reporting surface (tests/test_had.py:L3438-L3446, tests/test_had.py:L3755-L3816).
    Impact: downstream machine consumers of to_dict() cannot recover the actual weighted denominator or the SE-family label, and the public surface is less complete than the new docs/changelog imply.
    Concrete fix: add variance_formula and effective_dose_mean to to_dict(), render the weighted denominator explicitly in summary() when present, and add regression tests for those serialized outputs.

P2a — weights=<array> shortcut clears survey-only SurveyMetadata fields:
the shortcut constructs a minimal ResolvedSurveyDesign (n_psu=G,
n_strata=1) to reuse compute_survey_metadata's code path, which
produces df_survey = G − 1. But the shortcut's own inference stays
Normal (safe_inference(..., df=None)), so reporting code that rendered
"Survey df: G−1" was misdescribing the fit as a finite-df survey
result. Fixed by explicitly clearing ``df_survey`` / ``n_psu`` /
``n_strata`` on the ``SurveyMetadata`` returned for the weights= path;
``weight_type`` / ``effective_n`` / ``design_effect`` / ``sum_weights``
/ ``weight_range`` stay populated since they describe the weighted
sample regardless of inference family. The ``survey=SurveyDesign(...)``
path is unchanged: all survey-only fields populated and threaded into
the t-CI. Two new regression tests lock each side of the contract.

P2b — to_dict() + summary() surface the new weighted-result fields:
``effective_dose_mean`` and ``variance_formula`` were stored on the
result object but not serialized or rendered, so downstream machine
consumers of ``to_dict()`` couldn't recover the weighted denominator
or the SE-family label (the reporting-surface mirror of the P2
finding). Added both keys to ``to_dict()`` and rendered the weighted
denominator explicitly in ``summary()`` when present. Three new tests:
one for to_dict under weights, one for the None-defaults under
unweighted, and one that ``summary()`` renders the weighted D̄ label.

All 358 tests pass across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 08afb1233631e4ce71e525e01f1f7f8c75e59a90


Overall Assessment

Looks good

Static review only; I could not run the test suite in this sandbox because Python here is missing numpy.

Executive Summary

  • This re-review looks materially better than the prior pass: the weights= shortcut now clears survey-only finite-df metadata, and variance_formula / effective_dose_mean now propagate through summary() and to_dict() (diff_diff/had.py:L407-L418, diff_diff/had.py:L448-L478, diff_diff/had.py:L2811-L2825).
  • The affected methods are the HAD continuous-dose estimators (continuous_at_zero, continuous_near_d_lower) plus the supporting Phase 1c local-linear port. I did not find an unmitigated P0/P1 methodology mismatch against the new registry entry (docs/methodology/REGISTRY.md:L2258-L2268).
  • Deferred scope is handled correctly: weighted mass_point, weighted event_study, replicate-weight survey designs, and weight-aware DPI bandwidth selection are explicitly documented and/or tracked rather than silently shipped (docs/methodology/REGISTRY.md:L2266-L2268, TODO.md:L95-L99).
  • P2 The survey= branch builds survey_metadata from normalized unit weights, not the original supplied weights. That makes sum_weights and weight_range inconsistent with compute_survey_metadata()’s contract and with the weights= shortcut, even though the estimator and SE themselves look fine (diff_diff/had.py:L2774-L2785, diff_diff/survey.py:L194-L205, diff_diff/survey.py:L719-L767, tests/test_survey.py:L1045-L1061).
  • P3 CHANGELOG.md overstates the new reporting surface: __repr__ still does not render the weighted-path metadata, and summary() does not show all of the fields named in the changelog (CHANGELOG.md:L11, diff_diff/had.py:L362-L418).

Methodology

No findings. The weighted continuous-HAD implementation matches the Methodology Registry as written: weights enter the local-linear fit via kernel-weight composition, survey inference uses Binder-style Taylor linearization on the bias-corrected IF, and unsupported branches are rejected or documented instead of silently approximated (diff_diff/_nprobust_port.py:L1219-L1297, diff_diff/_nprobust_port.py:L1364-L1412, diff_diff/had.py:L2408-L2462, diff_diff/had.py:L2979-L3002, docs/methodology/REGISTRY.md:L2258-L2268).

Code Quality

  • Severity P2 Impact: on the survey= path, survey_metadata.sum_weights and survey_metadata.weight_range are computed from normalized unit weights rather than the original supplied weights. SurveyDesign.resolve() normalizes pweight/aweight inputs to mean 1 (diff_diff/survey.py:L194-L205), while compute_survey_metadata() is explicitly defined to receive the raw pre-normalization weights and its tests assert raw-weight semantics for weight_range (diff_diff/survey.py:L719-L767, tests/test_survey.py:L1045-L1061). HAD currently passes resolved_survey_unit.weights into that helper (diff_diff/had.py:L2774-L2785), so the metadata can disagree with the weights= shortcut and with the helper contract. Concrete fix: preserve the pre-normalization per-unit weights when collapsing the survey design, pass those raw unit weights to compute_survey_metadata(), and add a regression test that survey= and weights= report the same sum_weights / weight_range for identical unit weights.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 Impact: none. The remaining weighted-scope gaps are properly deferred and tracked: weighted mass_point, weighted event_study, replicate-weight survey designs, survey-aware pretests, and weight-aware DPI bandwidths are all called out in REGISTRY.md and TODO.md, so they are not blockers for this PR (docs/methodology/REGISTRY.md:L2266-L2268, TODO.md:L95-L99). Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

  • Severity P3 Impact: CHANGELOG.md says the new survey metadata is rendered consistently by to_dict / summary / __repr__, but __repr__ still only shows att, se, design, and n_obs, and summary() only surfaces a subset of the advertised metadata (CHANGELOG.md:L11, diff_diff/had.py:L362-L418, diff_diff/had.py:L448-L478). This is documentation drift, not a numerical defect. Concrete fix: either narrow the changelog wording to match the actual rendered surface, or extend __repr__ / summary() to include the advertised metadata.

P2 — survey_metadata uses raw pre-normalization weights. On the
``survey=SurveyDesign(weights="col")`` path, the prior code passed
``resolved_survey_unit.weights`` (post-normalization, mean=1) into
``compute_survey_metadata``, producing ``sum_weights ≈ G`` and
normalized ``weight_range`` that drifted from both the ``weights=``
shortcut's values and the helper's documented contract (tests in
``tests/test_survey.py`` assert raw-weight semantics for
``weight_range``). Fixed by capturing the raw weight column from
``data[survey.weights]`` BEFORE ``SurveyDesign.resolve()`` rescales
pweights/aweights to mean=1, aggregating those raw values to unit-
level via the existing ``_aggregate_unit_weights`` helper, and passing
that raw per-unit array as the ``raw_weights`` argument. Both entry
paths (``weights=`` and ``survey=``) now report bit-identical
``sum_weights`` / ``weight_range`` / ``design_effect`` /
``effective_n`` for the same unit-level weights — verified by a new
regression test ``test_survey_metadata_raw_weights_match_shortcut``.

P3 — ``__repr__`` surfaces weighted-path metadata. CHANGELOG claimed
``to_dict`` / ``summary`` / ``__repr__`` render the new weighted
surface consistently, but ``__repr__`` was unchanged. Extended the
compact one-line form to include ``variance_formula`` and
``effective_dose_mean`` when present; unweighted fits keep the
original terse repr. Regression test
``test_repr_surfaces_weighted_fields_when_present`` locks both cases.

All 360 tests pass (across the six HAD-related test modules). Ruff
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 188021be767243c5ab08ff7a4e3156a1544f1d1b


Overall Assessment

Blocker

Static review only; I could not run the targeted tests in this sandbox because Python here is missing numpy.

Executive Summary

  • The affected methodology is the new weighted HAD continuous-dose path: continuous_at_zero, continuous_near_d_lower, and the supporting Phase 1c weighted bias_corrected_local_linear / _nprobust_port.lprobust route. Aside from one blocker below, the kernel-weight composition and Binder-TSL variance route match the new registry entry.
  • The prior metadata-scale issue appears fixed: the survey path now captures pre-normalization unit weights before SurveyDesign.resolve() and passes those raw unit weights into compute_survey_metadata().
  • P0: zero-weight units still drive auto design detection, d_lower, the mass-point threshold, and treated/control counts even though the weighted estimator later drops them. Standard survey subpopulation designs can therefore silently target the wrong HAD design or reject a valid weighted fit.
  • The documented Phase 4.5 deferrals are handled correctly: weighted mass_point, weighted event_study, replicate-weight designs, pretests under survey, and weight-aware DPI bandwidths are all explicitly documented/tracked rather than silently approximated.
  • The previous documentation finding is only partially resolved: summary() / __repr__ still do not render all of the survey metadata fields that CHANGELOG.md says they do.

Methodology

  • Severity P0 Impact: zero-weight observations are allowed and are the standard representation for survey subpopulations, but HAD still resolves the design on the full unit sample instead of the positive-weight support. fit() aggregates all units first, then runs _detect_design, d_lower resolution, the 2% mass-point check, and treated/control counts on that full d_arr; only later does _fit_continuous() switch to weighted averages and weighted kernel weights. That means a zero-weight unit at d.min() can silently flip a weighted sample from continuous_near_d_lower to continuous_at_zero, shift d_lower, misstate the control group, or trigger the wrong NotImplementedError. This is especially problematic for SurveyDesign.subpopulation(), which intentionally zeros excluded units instead of dropping them. diff_diff/survey.py:L422-L538, diff_diff/had.py:L2406-L2493, diff_diff/had.py:L2502-L2679, diff_diff/had.py:L2954-L3018, diff_diff/local_linear.py:L1104-L1165, diff_diff/local_linear.py:L1217-L1232
    Concrete fix: resolve HAD design logic on the positive-weight unit subset (weights_unit > 0) for auto design detection, d_lower, the mass-point threshold, and displayed counts. For auto-bandwidth/support checks, feed only that positive-weight subset into the Phase 1c validator/selector. If you still want full-design survey variance, re-expand the returned IF back to the full unit design with zeros before calling compute_survey_if_variance().

Aside from that blocker, the weighted kernel-composition math, Binder-TSL variance route, and the registry-noted deviations/deferrals are aligned with docs/methodology/REGISTRY.md:L2258-L2268.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining weighted-scope gaps are explicitly tracked and therefore mitigated for this PR: docs/methodology/REGISTRY.md:L2266-L2268, TODO.md:L95-L99.

Security

No findings.

Documentation/Tests

  • Severity P3 Impact: CHANGELOG.md still overstates the reporting surface. It says to_dict / summary / __repr__ render the survey metadata consistently, including weight sum, n_strata, and n_psu, but summary() currently prints only variance_formula, effective_n, the weighted denominator, and optionally df_survey, while __repr__ only adds variance_formula and effective_dose_mean. to_dict() also returns a SurveyMetadata object even though its docstring still says “dict or None.” CHANGELOG.md:L11-L11, diff_diff/had.py:L362-L487
    Concrete fix: either narrow the changelog/docstring wording to the fields actually rendered, or add the missing metadata fields to summary() / __repr__ and update the to_dict() docstring to match the implementation.

Path to Approval

  1. Make weighted HAD design resolution depend on the positive-weight unit support, not on all aggregated units.
  2. Ensure the Phase 1c support/bandwidth path on weighted fits also ignores zero-weight units, while preserving full-design IF expansion for survey variance if needed.
  3. Add regression tests using SurveyDesign.subpopulation() or an equivalent zero-weight unit at d.min() showing that weighted fits on the full design match the same fits run after physically dropping the zero-weight units for auto, continuous_near_d_lower, and the reported count fields.

P0 — zero-weight units must not drive HAD design resolution. Previously
``fit()`` aggregated all units first, then ran ``_detect_design``,
``d_lower`` resolution, the 2% mass-point threshold, and treated/control
counts on the full ``d_arr`` — so a zero-weight unit at ``d.min() = 0``
(the standard SurveyDesign.subpopulation encoding) could flip a
weighted sample from ``continuous_near_d_lower`` to
``continuous_at_zero``, shift ``d_lower``, misstate cohort counts, or
trigger the wrong ``NotImplementedError``. The weighted kernel already
dropped those units at fit time via lprobust's ``w > 0`` selector, so
the fit NUMBERS were correct; the DESIGN DECISION was contaminated.

Fixed by filtering ``d_arr`` / ``dy_arr`` / ``weights_unit`` /
``raw_weights_unit`` / ``resolved_survey_unit`` to
``weights_unit > 0`` immediately after survey/weights resolution and
before any design-resolution logic. New ``_filter_resolved_survey``
helper rebuilds the ResolvedSurveyDesign on the positive-weight subset
(recomputing ``n_strata`` / ``n_psu`` for compute_survey_if_variance).
A UserWarning fires when units are dropped so the behavior is
introspectable.
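
The shape of the filter, sketched with the variable names this commit uses. Note that the next review round narrows its use: the filtered copies end up driving design resolution only, with the full survey design preserved for variance.

```python
import warnings

import numpy as np

def restrict_to_positive_weights(d_arr, dy_arr, weights_unit):
    """Hedged sketch of the Round-5 filter: design resolution should only
    see units with positive weight (names follow this commit's text)."""
    positive = weights_unit > 0
    if not positive.all():
        warnings.warn(f"{int((~positive).sum())} zero-weight units excluded "
                      "from HAD design resolution", UserWarning)
    return d_arr[positive], dy_arr[positive], weights_unit[positive]
```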

Three new regression tests lock the fix:
- ``test_zero_weight_unit_at_d_min_does_not_flip_design``: full panel
  with zero-weight unit at ``d=0`` vs physically-dropped panel —
  design, d_lower, att, and se all match to 1e-10.
- ``test_zero_weight_filter_warns_user``: UserWarning emitted.
- ``test_zero_weight_counts_reflect_positive_subset``: ``n_obs`` on the
  result is the positive-weight unit count, not the full panel size.

P3 — CHANGELOG + ``to_dict()`` docstring accuracy. CHANGELOG claimed
``to_dict`` / ``summary`` / ``__repr__`` render the survey metadata
consistently including ``weight_sum`` / ``n_strata`` / ``n_psu``; in
practice each surface renders a different subset. Rewrote the CHANGELOG
entry to enumerate exactly what each surface prints, and rewrote
``HeterogeneousAdoptionDiDResults.to_dict()`` docstring to name the
weighted-path keys and clarify that ``survey_metadata`` is a
SurveyMetadata object (not a dict).

All 363 tests pass. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6bd07e7d11d0a918bcc39dff9efeef7bb1c4fb8e


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the targeted tests in this sandbox because numpy/pytest are unavailable here.

Executive Summary

  • The previous blocker on zero-weight units driving HAD design resolution appears fixed on the main estimator path: fit() now resolves design, d_lower, and count fields on the positive-weight subset.
  • P1: the new survey= path drops zero-weight units from the ResolvedSurveyDesign before Binder-TSL variance, which contradicts the library’s own subpopulation/domain-estimation contract and will misstate se, df_survey, n_psu, n_strata, and weight metadata for survey subpopulations.
  • P1: the new public bias_corrected_local_linear(weights=...) path still runs HAD support/mass-point/boundary validation on the full sample rather than the positive-weight support, so zero-weight boundary units can still flip the supported regime or trigger spurious off-support/mass-point failures.
  • The documented Phase 4.5 deferrals (mass_point, weighted event study, replicate weights, weight-aware DPI, weighted pretests) are properly tracked in TODO.md / REGISTRY.md and are not blockers.

Methodology

  • Severity P1
    Impact: survey subpopulation variance is currently computed on a filtered design, not the full design. SurveyDesign.subpopulation() and the registry both say the correct domain-analysis method is to keep zero-weight rows/PSUs/strata in the survey design for variance/df, but HAD now filters them out via _filter_resolved_survey() before calling compute_survey_if_variance(). That silently changes PSU counts, stratum centering, FPC application, and df_survey; a domain that leaves one PSU with positive in-domain mass and another PSU with only zero-weight observations can therefore get the wrong survey SE or even the wrong “lonely PSU” behavior. diff_diff/had.py:L2566-L2598, diff_diff/had.py:L3125-L3138, diff_diff/had.py:L1697-L1730, diff_diff/survey.py:L422-L432, docs/methodology/REGISTRY.md:L3083-L3098
    Concrete fix: keep zero-weight filtering only for design resolution / local-linear fitting, but preserve the full unit-level ResolvedSurveyDesign for survey variance and metadata. Re-expand the fitted IF back to full unit length with zeros outside the domain, then compute compute_survey_if_variance(), df_survey, and compute_survey_metadata() on that full design.

  • Severity P1
    Impact: the newly enabled bias_corrected_local_linear(weights=...) surface still validates support on unfiltered d. The function accepts nonnegative weights, but immediately calls _validate_had_inputs() on the full sample; that helper’s d.min(), mass-point, and Design-1′ support heuristics therefore still see zero-weight observations. A zero-weight unit at the boundary can still make a valid weighted domain fit look like Design 1′, or trigger a spurious mass-point/off-support rejection; in auto mode it also contaminates bandwidth selection because the selector runs before any positive-weight filtering. diff_diff/local_linear.py:L1063-L1105, diff_diff/local_linear.py:L593-L668
    Concrete fix: when weights is passed, form a positive-weight mask first and run _validate_had_inputs(), auto-bandwidth selection, and the final fit on that positive-weight support. If return_influence=True, expand the returned IF back to the original ordering with zeros for masked rows.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining weighted-scope exclusions are explicitly tracked in TODO.md and docs/methodology/REGISTRY.md, so I am not counting them as blockers.

Security

  • No findings.

Documentation/Tests

  • No separate findings beyond the missing regression coverage implied by the two P1 issues above.

Path to Approval

  1. Preserve the full survey design on survey= subpopulation fits: zero-pad the HAD continuous-path IF back to the full unit design and compute Binder-TSL variance / df_survey / survey metadata on that full design, not on the filtered positive-weight subset.
  2. Make bias_corrected_local_linear(weights=...) operate on positive-weight support for support validation and auto-bandwidth selection, then zero-pad IF output back to full length when requested.
  3. Add regression tests for both cases:
    1. SurveyDesign.subpopulation() with explicit strata/PSU where one PSU is all zero-weight in-domain, asserting full-design survey df/SE behavior.
    2. bias_corrected_local_linear(weights=...) with zero-weight units at d.min() / the boundary, comparing weighted results to physically dropping those zero-weight rows in both explicit-h/b and auto-bandwidth modes.

P1a — preserve full survey design for Binder-TSL variance. Round 5 filtered
the ResolvedSurveyDesign to the positive-weight subset for BOTH design
resolution AND survey variance composition. That's wrong under the
standard subpopulation / domain-estimation convention (keep the sampling
frame, zero the contributions) that ``SurveyDesign.subpopulation()`` +
``compute_survey_if_variance()`` both implement elsewhere in this repo.
Silent consequences: ``n_psu`` / ``n_strata`` / ``df_survey`` / FPC
application / lonely-PSU behavior would all have been computed on the
in-domain subset instead of the full design.

Refactored HAD ``fit()`` to preserve ``weights_unit_full`` /
``resolved_survey_unit_full`` / ``raw_weights_unit_full`` alongside the
filtered copies used ONLY for design resolution (``_detect_design`` /
``d_lower`` / mass-point threshold / cohort counts). The fit itself
(``_fit_continuous`` → ``bias_corrected_local_linear`` → ``lprobust``)
now receives the FULL unfiltered arrays; ``bias_corrected_local_linear``
filters internally (see P1b below) and zero-pads the IF back to full
ordering, so ``compute_survey_if_variance(if_full, resolved_full)``
preserves the sampling-frame structure. ``compute_survey_metadata`` and
``effective_dose_mean`` likewise consume the full arrays so
``n_psu`` / ``n_strata`` / ``sum_weights`` / ``weight_range`` reflect
the sampling frame. Dropped the now-unused
``_filter_resolved_survey`` helper.

P1b — ``bias_corrected_local_linear(weights=...)`` filters internally.
The public wrapper previously ran ``_validate_had_inputs`` / mass-point
threshold / Design 1' support heuristic + auto-bandwidth MSE-DPI
selector on the full ``d`` array, so zero-weight units at the boundary
or at ``d.min()`` could flip the detected regime or trigger spurious
off-support rejections on valid weighted domain fits. Now computes
``positive_mask = weights > 0`` upfront, validates weights (finite,
non-negative, positive sum — consistent with the port's ValueErrors),
filters ``d`` / ``y`` / ``weights`` / ``cluster`` to the positive-
weight subset BEFORE all downstream validation + selection + fit,
and zero-pads the returned IF back to the original ordering when
``return_influence=True``. ``n_total`` on the returned ``BiasCorrectedFit``
is the original length (matches the caller's input shape).

Four new regression tests:
- ``test_zero_weight_survey_metadata_preserves_full_design``: survey=
  path with PSU/strata + 25% zero-weight units reports full-frame n_psu
  (not in-domain subset); ATT bit-parity with physically-dropped fit.
- ``test_bias_corrected_local_linear_zero_weight_matches_filtered``:
  explicit-h/b wrapper with zero-weight unit at d=0 (would have flipped
  Design 1') matches dropped-sample fit at 1e-12; IF zero-padded to
  original length with IF[0]=0 at the zero-weight position.
- ``test_bias_corrected_local_linear_zero_weight_auto_bandwidth``:
  auto-bandwidth path (DPI selector also runs on positive-weight
  support) produces same h + same bias-corrected estimate as the
  dropped-sample fit.
- The existing Round 5 zero-weight tests (``n_obs`` reflects positive-
  weight subset; UserWarning; design-not-flipped) pass unchanged since
  the design-resolution filter is preserved.

All 366 tests pass (across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1998d75e2c4dcf55d72743e42be7b28302d65a56


Overall Assessment

Looks good

Static diff review only. I could not run pytest in this sandbox (the pytest module is unavailable there), but the modified Python files parsed cleanly via ast.parse().

Executive Summary

  • The two previously raised P1 issues appear resolved. HAD now filters zero-weight units only for design resolution while preserving the full resolved survey design for survey variance / df / metadata, and the continuous fit is still run against the full ordering so the IF can align to the full design. diff_diff/had.py
  • bias_corrected_local_linear(weights=...) now operates on positive-weight support before support validation and auto-bandwidth selection, then zero-pads the returned influence function back to the original ordering. That addresses the earlier zero-weight boundary/support failure mode. diff_diff/local_linear.py
  • The new survey variance path is aligned to the bias-corrected estimator scale. _nprobust_port.lprobust now exports a Q_q / res_b based IF matched to V_Y_bc, which is the right object for Binder-TSL composition on the weighted HAD continuous path. diff_diff/_nprobust_port.py
  • Remaining scope limits are properly documented rather than silently changing methodology: weighted auto-DPI bandwidths, weighted mass-point, weighted event-study, weighted pretests, and replicate-weight HAD are all called out in the registry and tracked in TODO.md. docs/methodology/REGISTRY.md TODO.md
  • I did not find any new unmitigated P0 or P1 issues in the changed code. The added regression coverage is well targeted at the prior review concerns. tests/test_had.py

Methodology

  • No findings. The weighted continuous-dose implementation matches the updated methodology registry: weighted beta-scale moments, full-design domain semantics for survey variance, bias-corrected IF scale, and explicit deferral of unsupported paths. docs/methodology/REGISTRY.md

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3
    Impact: Weighted mass_point, weighted aggregate="event_study", survey-weighted HAD pretests, replicate-weight HAD, and weight-aware DPI bandwidth selection remain deferred. These are explicitly tracked and documented, so they are not blockers for this PR. TODO.md docs/methodology/REGISTRY.md
    Concrete fix: None required in this PR.

Security

  • No findings.

Documentation/Tests

  • No findings.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 24, 2026
@igerber igerber merged commit 9b71f67 into main Apr 24, 2026
23 of 24 checks passed
@igerber igerber deleted the had-phase-4.5-survey-continuous branch April 24, 2026 17:46
igerber added a commit that referenced this pull request Apr 24, 2026
Rebased onto current main (17 commits clean — PR #355, #358, #359 all
merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced.
The earlier §4.10 RCS wording paired TripleDifference /
StaggeredTripleDifference together in the Explicit RCS support list,
but REGISTRY.md §StaggeredTripleDifference requires a balanced panel
and staggered_triple_diff.py:93-109 has no panel=False mode — fit()
rejects unbalanced/duplicate (unit, time) structure at
staggered_triple_diff.py:846-864.
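An illustrative sketch of that gate, assuming a long-format pandas DataFrame with hypothetical `unit` / `time` column names (the real checks live in staggered_triple_diff.py):

```python
import pandas as pd

def assert_balanced_panel(df: pd.DataFrame, unit: str = "unit", time: str = "time") -> None:
    # Reject duplicate (unit, time) rows: not a well-formed panel.
    if df.duplicated([unit, time]).any():
        raise ValueError("duplicate (unit, time) rows")
    # Reject unbalanced panels: every unit must cover the same period set.
    period_sets = df.groupby(unit)[time].apply(frozenset)
    if period_sets.nunique() != 1:
        raise ValueError("unbalanced panel: units observed in different period sets")
```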

- §4.10 Explicit RCS support: TripleDifference (two-period) only;
  StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added
  with a concrete "no panel=False mode" + "use TripleDifference for
  cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to
  the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the
  Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
R7 P0 (sup-t under-scaling on weights= shortcut): the per-unit IF
returned by _fit_continuous / _fit_mass_point_2sls is HC1-scaled per
the PR #359 convention — compute_survey_if_variance(psi,
trivial_resolved) ≈ V_HC1. Routing the weights= shortcut through the
unit-level ``resolved_survey=None`` branch of
_sup_t_multiplier_bootstrap normalized against raw sum(psi²) =
((n-1)/n) · V_HC1, producing silently too-narrow simultaneous bands.

Fix: when the weighted event-study + cband=True path runs, always
route the sup-t bootstrap through a ResolvedSurveyDesign. On the
weights= shortcut (no user-supplied survey), construct a synthetic
trivial resolved (pweight, no strata/psu/fpc, lonely_psu='remove')
so the centered + sqrt(n/(n-1))-corrected survey-aware branch fires.
The no-strata/no-PSU path inside that branch falls through to the
"single implicit stratum — demean across all PSUs, scale by
sqrt(n_psu/(n_psu-1))" block, which gives Var_xi(xi @ Psi_psu) ≈
V_HC1 as the IF scale convention requires.

Net effect: weights= shortcut and survey=SurveyDesign(weights=...)
now target the SAME variance family in the bootstrap (~atol=0.05
between the two quantiles at matching seeds, bounded by the
bc_fit.se_robust vs Binder-TSL per-horizon SE convergence
tolerance from PR #359). Previously the shortcut was under-scaled
by sqrt(n/(n-1)) relative to the analytical HC1 target.
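A minimal numeric sketch of the scaling argument, assuming a mean-zero,
HC1-scaled per-unit IF and iid mean-zero, unit-variance multipliers
(illustrative, not the repo's bootstrap helper):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
psi = rng.normal(size=n)
psi -= psi.mean()                      # per-unit IFs are mean-zero by construction

V_hc1 = n / (n - 1) * np.sum(psi**2)   # the IF scale convention's analytical target

# Naive multiplier bootstrap: for iid mean-zero unit-variance multipliers xi,
# Var_xi(xi @ psi) = sum(psi**2) exactly -> under-scaled by (n-1)/n.
var_naive = np.sum(psi**2)

# Survey-aware branch: demean across PSUs (a no-op here since psi is already
# centered), then scale by sqrt(n/(n-1)), which restores the HC1 target.
psi_c = (psi - psi.mean()) * np.sqrt(n / (n - 1))
var_fixed = np.sum(psi_c**2)

print(var_naive / V_hc1)   # (n-1)/n = 0.98 -> too-narrow bands
print(var_fixed / V_hc1)   # 1.0 -> matches V_HC1
```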

Regression tests (+2):
- test_weights_shortcut_mass_point_h1_cband_matches_normal: helper-
  level H=1 lock with mass-point HC1-scaled IF + synthetic trivial
  resolved. q → Phi^-1(0.975) ≈ 1.96 at atol=0.15 (MC noise at
  B=5000). The pre-fix under-scaling would have produced q ≈ 1.94
  (systematic drift outside MC noise).
- test_weights_shortcut_cband_matches_trivial_survey: weights=
  shortcut and survey=SurveyDesign(weights='w') event-study cband
  quantiles agree within atol=0.05 on the same DGP / seed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request Apr 27, 2026
…t-study survey composition + sup-t bootstrap

Closes the two Phase 4.5 A NotImplementedError gates on design='mass_point' +
weights/survey and aggregate='event_study' + weights/survey.

Weighted 2SLS in _fit_mass_point_2sls follows the Wooldridge 2010 Ch. 12
pweight convention (w² in HC1 meat, w·u in CR1 cluster score, weighted
bread Z'WX). HC1 and CR1 match estimatr::iv_robust bit-exactly at
atol=1e-10 (new cross-language golden). Per-unit IF on β̂-scale scales
so compute_survey_if_variance(psi, trivial) ≈ V_HC1 at atol=1e-10 (PR igerber#359
convention applied uniformly).
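A weighted-OLS analogue of that pweight convention (a sketch of the same
sandwich structure; the actual code applies it inside the 2SLS path, and
all names here are illustrative):

```python
import numpy as np

def weighted_hc1(X, y, w):
    """Weighted-OLS HC1 sandwich under the pweight convention (sketch)."""
    XtWX_inv = np.linalg.inv(X.T @ (X * w[:, None]))   # weighted bread (X'WX)^-1
    beta = XtWX_inv @ (X.T @ (w * y))
    u = y - X @ beta
    S = X * (w * u)[:, None]                           # rows: w_i * u_i * x_i'
    meat = S.T @ S                                     # sum_i w_i^2 u_i^2 x_i x_i'
    n, k = X.shape
    V = (n / (n - k)) * XtWX_inv @ meat @ XtWX_inv     # HC1 small-sample factor
    return beta, V
```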

Event-study path threads weights + survey through the per-horizon loop,
composing Binder-TSL variance per horizon and populating survey_metadata
+ variance_formula + effective_dose_mean (previously hardcoded None).

New _sup_t_multiplier_bootstrap helper reuses
generate_survey_multiplier_weights_batch / generate_bootstrap_weights_batch
from diff_diff.bootstrap_utils — no custom Rademacher draws, no (1/n)
prefactor. At H=1 reduces to Φ⁻¹(1-α/2) ≈ 1.96 (reduction-locked).

New __init__ kwargs: n_bootstrap=999, seed=None. New fit() kwarg:
cband=True. HeterogeneousAdoptionDiDEventStudyResults gains
survey_metadata + variance_formula + effective_dose_mean + cband_*
fields, surfaced through to_dict / to_dataframe / summary / __repr__.

Unweighted event-study output (att/se) bit-exactly preserved; cband
disabled on the unweighted path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>