
HAD Phase 4.5: survey support on continuous-dose paths#359

Merged
igerber merged 12 commits into main from had-phase-4.5-survey-continuous
Apr 24, 2026

Conversation

@igerber (Owner) commented Apr 24, 2026

Summary

  • Ships survey=SurveyDesign(...) + weights=<array> on HeterogeneousAdoptionDiD.fit() for the two continuous-dose designs (continuous_at_zero, continuous_near_d_lower); see the usage sketch after this list. Mass-point, event-study, and HAD pretests remain NotImplementedError (deferred to Phase 4.5 B / C).
  • Weights thread through the Phase 1c lprobust port via pointwise kernel × user weight composition; SurveyDesign path composes PSU/strata/FPC variance via the existing compute_survey_if_variance Binder-TSL helper (no duplicate implementation).
  • Validation stack: uniform-weights bit-parity at atol=1e-14 on the full lprobust output struct, cross-language weighted-OLS parity against a manual R reference at atol=1e-12, and Monte Carlo oracle consistency on known-τ DGPs.
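
For orientation, a minimal usage sketch of the two entry points. Only the survey= / weights= keywords and the SurveyDesign fields come from this PR; the constructor call, the data argument, and the column names are illustrative.

```python
# Hedged sketch — everything except survey=/weights= and the SurveyDesign
# fields (weights, strata, psu, fpc) is illustrative, not the actual API.
from diff_diff import HeterogeneousAdoptionDiD, SurveyDesign  # assumed import path

est = HeterogeneousAdoptionDiD()  # construct as for the existing unweighted paths

# SurveyDesign path: Binder-TSL SE composed over PSUs within strata, with FPC.
res = est.fit(panel, survey=SurveyDesign(weights="w", strata="stratum",
                                         psu="psu", fpc="fpc"))

# pweight shortcut: identical ATT, weighted-robust SE instead of Binder-TSL.
res = est.fit(panel, weights=w_array)

# Passing both raises ValueError; the two knobs are mutually exclusive.
```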

Test plan

  • pytest tests/test_had.py tests/test_had_mc.py tests/test_nprobust_port.py tests/test_bias_corrected_lprobust.py tests/test_np_npreg_weighted_parity.py -m '' — all pass (469 tests; confirmed locally)
  • mypy diff_diff/had.py diff_diff/local_linear.py diff_diff/_nprobust_port.py — no new errors
  • ruff check diff_diff tests — clean on modified files
  • Rscript benchmarks/R/generate_np_npreg_weighted_golden.R regenerates the fixture without changes

Scope boundaries

Supported in this PR:

  • continuous_at_zero (paper Design 1', Eq 7) + continuous_near_d_lower (Design 1 near-d̲) under weights=<array> and survey=SurveyDesign(weights, strata, psu, fpc).
  • Uniform-weights bit-parity lock (regression test for the weighted extension).
  • Binder-TSL variance composition via compute_survey_if_variance / compute_replicate_if_variance (pre-existing helpers).
  • HeterogeneousAdoptionDiDResults.survey_metadata populated with method / variance-formula / weight-sum / effective-sample-size / n_strata / n_psu; closes the existing to_dict() gap at had.py:397-419.
  • Cross-language weighted-OLS parity fixture via benchmarks/R/generate_np_npreg_weighted_golden.R + tests/test_np_npreg_weighted_parity.py (13 tests).
  • MC oracle consistency tests (slow-gated, 4 tests) — the primary methodology anchor under informative weights.

Explicitly NOT in this PR (deferred with targeted NotImplementedError + phase pointer):

  • design="mass_point" under survey/weights (Phase 4.5 B — weighted 2SLS + sandwich).
  • aggregate="event_study" under survey/weights (Phase 4.5 B — per-horizon IPW + shared PSU multiplier bootstrap).
  • HAD pretests (qug_test, stute_test, yatchew_hr_test, joint Stute variants, did_had_pretest_workflow) under weights — Phase 4.5 C (Rao-Wu rescaled bootstrap).
  • Replicate-weight SurveyDesigns (BRR / Fay / JK1 / JKn / SDR) on the HAD continuous path — Phase 4.5 C.
  • Weight-aware auto-bandwidth MSE-DPI — pass h/b explicitly for weight-aware bandwidths; the auto path remains unweighted.

Parity ceiling

No public weighted-CCF bias-corrected local-linear reference exists in any language: nprobust::lprobust has no weight argument; np::npreg uses a proprietary local-linear algorithm that does not reduce to weighted OLS at the intercept; Stata rdrobust is RD-shaped, not HAD-shaped; and Julia Synthdid.jl is SDID-only. The Phase 1c atol=1e-12 R bit-parity bar is therefore NOT reachable on the weighted bias-corrected CI. This is documented explicitly in docs/methodology/REGISTRY.md §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5)".

🤖 Generated with Claude Code

igerber and others added 6 commits April 24, 2026 08:49
…local_linear

Extends `_nprobust_port.lprobust` with a `weights=` parameter that
multiplies into the kernel weights, propagating through design matrices,
Q.q bias-correction, variance matrices, and the HC2/HC3 hat-matrix
hii term. When `weights=None` (or `np.ones(N)`) the function is
bit-identical to the unweighted path (regression-tested at atol=1e-14).
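
A self-contained sketch of that composition, assuming a one-sided Epanechnikov boundary kernel (the kernel the later R fixture uses). This is plain weighted-OLS algebra for the intercept, not the port's actual internals, which also carry the Q.q bias-correction and variance pieces.

```python
import numpy as np

def epanechnikov(u):
    # one-sided boundary variant: support on [0, 1]
    return np.where((u >= 0) & (u <= 1), 0.75 * (1.0 - u**2), 0.0)

def weighted_local_linear_intercept(d, y, eval_point, h, w=None):
    """Boundary intercept of a local-linear fit with kernel x user weights."""
    w = np.ones_like(d, dtype=float) if w is None else np.asarray(w, dtype=float)
    W = epanechnikov((d - eval_point) / h) * w      # pointwise composition
    X = np.column_stack([np.ones_like(d), d - eval_point])
    XtW = X.T * W                                    # forms X'W row-wise
    beta = np.linalg.solve(XtW @ X, XtW @ y)         # (X'WX)^{-1} X'Wy
    return beta[0]  # w=None reduces to the unweighted path exactly
```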

Removes the NotImplementedError at `bias_corrected_local_linear` and
forwards weights to the internal lprobust call. Bandwidth selection
(auto mode) remains unweighted — the MSE-optimal DPI is not yet
weight-aware; users who want weight-aware bandwidths supply `h`/`b`
explicitly. Documented as a known gap for the REGISTRY subsection.

Parity ceiling acknowledged: no public weighted-CCF reference exists
(nprobust has no weight argument; rdrobust is RD-shaped). The
uniform-weights bit-parity test is the only bit-parity anchor; validation
under informative weights relies on MC oracle + np::npreg partial
parity, both deferred to subsequent commits in this PR.

Adds TestWeightedLprobust (11 tests) and TestWeightedBiasCorrectedLocalLinear
(7 tests): uniform-weights bit-parity at atol=1e-14 (classical + robust
variance, clustered + unclustered), weight-validator contracts (negative,
NaN, Inf, zero-sum, length-mismatch), zero-weight observations filter
from the active window, and an informative-weights mechanism-has-teeth
sanity check. All 91 existing Phase 1c parity tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Accepts `weights=<array>` and `survey=SurveyDesign(weights=...)` on
the `continuous_at_zero` and `continuous_near_d_lower` designs of
`HeterogeneousAdoptionDiD.fit()`. Routes the per-row weights through
a new `_aggregate_unit_weights` helper that validates constant-within-
unit (`feedback_no_silent_failures`) and produces a G-length per-unit
weight array, then threads it through the weighted `lprobust` in the
final continuous fit. Weighted population moments use `np.average`
so the unweighted path is bit-identical when `weights=None`.

API contract (guard sketch after this list):
- `weights=w` and `survey=SurveyDesign(weights=w)` are interchangeable
  (same per-unit aggregate, same result) — one knob per the merge-
  redundant-knobs principle, with the shortcut kept for ergonomics.
- `survey=` + `weights=` → ValueError (mutex).
- weights varying within unit → ValueError (prescribed rollup rule).
- negative / NaN / zero-sum weights → ValueError.
- `survey=<non-SurveyDesign-like>` → TypeError.
- mass_point + survey/weights → NotImplementedError (Phase 4.5 B).
- event_study + survey/weights → NotImplementedError (Phase 4.5 B).
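
A hedged sketch of the guard logic this contract implies; the helper name and messages are illustrative, not the repo's actual code.

```python
from diff_diff import SurveyDesign  # assumed import path

def _validate_weight_knobs(survey, weights, design, aggregate):
    """Illustrative guards only; the real checks live in fit()."""
    if survey is not None and weights is not None:
        raise ValueError("pass either survey= or weights=, not both")
    if survey is not None and not isinstance(survey, SurveyDesign):
        raise TypeError("survey= expects a SurveyDesign-like object")
    if (design == "mass_point" or aggregate == "event_study") and (
        survey is not None or weights is not None
    ):
        raise NotImplementedError("survey/weights here is deferred to Phase 4.5 B")
    # negative / NaN / zero-sum weights and weights varying within unit
    # likewise raise ValueError (not shown).
```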

Result object:
- `HeterogeneousAdoptionDiDResults.survey_metadata` now carries
  `{method, source, n_units_weighted, weight_sum, effective_sample_size}`
  under weighted fits (was always `None` before — closed the existing
  `to_dict()` gap at had.py:397-419).
- `to_dict()` emits `survey_metadata` key (`None` on unweighted).
- `summary()` renders a "Survey method" / "Effective sample size"
  block when present.

PSU/strata/FPC composition via `compute_survey_if_variance` is the
subsequent commit; this commit ships the pweight path with the
weighted-robust SE from Phase 1c, not full Binder TSL.

Tests: 16 new TestHADSurvey tests covering uniform-weights bit-parity
(both continuous designs), survey=weights= equivalence, validator
contract, deferred-path rejection, survey_metadata population, and
to_dict/summary surfacing. 213 existing HAD tests pass unchanged
(scaffolding test_weights_raises removed; test_survey_raises updated
to test_survey_bad_type_raises expecting TypeError on non-SurveyDesign
input).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires full ``survey=SurveyDesign(weights, strata, psu, fpc)`` support
on HAD's continuous-dose paths. When a SurveyDesign is supplied, the
standard error routes through Binder (1983) Taylor-series linearization:
per-unit influence function of the local-linear intercept aggregated
by PSU within strata, with FPC correction (``compute_survey_if_variance``
at ``survey.py:1802``). The estimator consumes the existing helper —
no duplicate implementation per ``feedback_psu_telescoping_tsl_vs_replicate``.

Core additions:
- ``LprobustResult.influence_function`` (Optional[ndarray]): per-obs IF
  of the classical intercept, populated only when ``return_influence=True``.
  Aligned to original caller ordering (inverse-permuted when vce='nn'
  sorts). Observations outside the active kernel window have IF=0.
  Bit-parity: ``weights=np.ones`` IF ≡ unweighted IF at atol=1e-14.
- ``BiasCorrectedFit.influence_function``: same IF forwarded from the
  internal lprobust call.
- ``_aggregate_unit_resolved_survey``: collapses a row-level
  ``ResolvedSurveyDesign`` to unit-level by requiring constant-within-
  unit for weights/strata/psu/fpc (same invariant as cluster columns);
  raises ``ValueError`` on inconsistency (``feedback_no_silent_failures``).
  Replicate-weight designs raise ``NotImplementedError`` (deferred to
  Phase 4.5 C, where Stute-family pretests ship the Rao-Wu bootstrap).
- ``_fit_continuous`` (HAD) accepts ``resolved_survey_unit``. When set,
  uses ``bc_fit.influence_function`` + ``compute_survey_if_variance``
  for the SE; otherwise falls back to ``bc_fit.se_robust / |den|``
  (pweight-only convention for ``weights=<array>``). Paper Eq 8 treats
  the denominator as fixed at the leading order (sqrt(G) rate vs the
  G^{2/5} rate of the nonparametric estimator). A variance sketch
  follows this list.
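
For reference, the stratified between-PSU variance that Binder-TSL composition conventionally computes, sketched under assumptions: the real compute_survey_if_variance signature at survey.py:1802 is not shown here, the fpc argument is treated as a per-stratum sampling fraction, and lonely-PSU handling is omitted.

```python
import numpy as np

def stratified_psu_variance(infl, strata, psu, fpc=None):
    """Textbook TSL sketch: sum over strata h of
    (1 - f_h) * n_h/(n_h - 1) * sum_i (z_hi - z_bar_h)^2,
    where z_hi is the PSU-total of the unit-level influence function."""
    total = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        z = np.array([infl[in_s & (psu == p)].sum() for p in np.unique(psu[in_s])])
        n_h = z.size                                  # lonely PSU (n_h == 1) not handled here
        f_h = 0.0 if fpc is None else float(fpc[in_s][0])  # assumed sampling fraction
        total += (1.0 - f_h) * (n_h / (n_h - 1)) * np.sum((z - z.mean()) ** 2)
    return total  # variance of the linearized estimator; SE = sqrt(total)
```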

API contract:
- ``weights=<array>``: lightweight pweight shortcut; weighted-robust SE.
- ``survey=SurveyDesign(weights='col')``: SurveyDesign path; Binder-TSL
  SE. Under no strata/PSU, reduces to (n/(n-1))-adjusted sum-of-IF² —
  asymptotically equivalent but numerically distinct from weights=.
- ATT is identical on both paths (same weighted lprobust); only the SE
  diverges (intended).
- ``HeterogeneousAdoptionDiDResults.survey_metadata`` now carries
  ``method`` ('pweight' vs 'survey_binder_tsl'), ``variance_formula``,
  ``n_strata``, ``n_psu`` diagnostics so the composition is
  introspectable without inspecting the estimator.

Tests: 5 new TestHADSurvey tests (21 total in that class) covering
SurveyDesign-with-strata SE divergence, PSU clustering SE effect,
strata-varies-within-unit ValueError, replicate-weights NotImplementedError
deferral, and survey_metadata recording the Binder-TSL method. All 325
existing tests (across test_had, test_bias_corrected_lprobust,
test_nprobust_port) pass unchanged.

Parity ceiling unchanged: Binder TSL composition is validated via the
existing dCDH/EfficientDiD usage of ``compute_survey_if_variance``, not
against a weighted-CCF public reference (none exists). Methodology
confidence for the bias-corrected CI under weights comes from the MC
oracle tests and np::npreg partial parity (subsequent commits).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships a bit-parity regression lock against R's manually-implemented
weighted-OLS with one-sided Epanechnikov kernel. The reference matches
``diff_diff/local_linear.py::local_linear_fit`` exactly in formula, so
parity holds at atol=1e-12 despite BLAS reduction-ordering differences — 4 DGPs
(uniform / informative / bounded-heterogeneous / shifted-boundary),
each validating intercept, slope, and active-window count.

Scope note: NOT third-party validation. No public weighted-CCF
bias-corrected local-linear reference exists in any language —
nprobust::lprobust has no weight argument, and np::npreg's local-linear
algorithm diverges from a straightforward weighted-OLS at the intercept
(preliminary exploration showed ~4-8% discrepancy vs manual weighted
lm). The fixture is a cross-language regression lock on the weighted
kernel + weighted-OLS formula, matching the R↔Python parity convention
used elsewhere in the benchmark suite. Methodology confidence under
informative weights comes from the uniform-weights bit-parity (already
shipped) + Monte Carlo oracle consistency (next commit).

Files:
- ``benchmarks/R/generate_np_npreg_weighted_golden.R``: new R generator
  with 4 deterministic DGPs. Uses ``jsonlite`` (already in
  ``requirements.R``), no new CRAN dependencies.
- ``benchmarks/data/np_npreg_weighted_golden.json``: tracked golden
  (~600 KB, consistent with other benchmark fixtures in this repo).
- ``tests/test_np_npreg_weighted_parity.py``: 13 parametrized tests
  (intercept / slope / n_active × 4 DGPs + 1 uniform-weights
  cross-check). Skip-guarded via ``pytest.skip`` per
  ``feedback_golden_file_pytest_skip``; the pattern is sketched after
  this list.
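
The skip-guarded golden-file pattern, sketched. The paths come from this commit; the local_linear_fit call shape, result fields, and JSON field names are assumptions.

```python
import json
import pathlib

import numpy as np
import pytest

from diff_diff.local_linear import local_linear_fit  # call shape assumed below

GOLDEN = pathlib.Path("benchmarks/data/np_npreg_weighted_golden.json")

def test_weighted_ols_parity_sketch():
    if not GOLDEN.exists():  # per feedback_golden_file_pytest_skip
        pytest.skip("golden fixture missing; run "
                    "benchmarks/R/generate_np_npreg_weighted_golden.R")
    for case in json.loads(GOLDEN.read_text()):  # JSON field names assumed
        fit = local_linear_fit(np.array(case["d"]), np.array(case["y"]),
                               weights=np.array(case["w"]))
        assert fit.intercept == pytest.approx(case["intercept"], abs=1e-12)
        assert fit.slope == pytest.approx(case["slope"], abs=1e-12)
```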

The filename ``np_npreg_weighted_golden.*`` is retained from the
planning phase (plan named the fixture after ``np::npreg``) even though
the final R implementation is a manual weighted-OLS, so the generator
name stays discoverable in the benchmark directory listing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… HAD

Adds slow-gated MC tests that validate the weighted-HAD methodology
under known-tau DGPs — the only real methodology anchor available
given the parity ceiling (no public weighted-CCF reference exists).

Four tests covering:
  1. Uniform weights recover β under SRS (regression baseline).
  2. Informative weights + inverse-probability pweights recover β under
     informative sampling (the actual weight-mechanism claim).
  3. 95% CI coverage ≥ 60% at n_reps=25 (loose bar that stays above
     chance while tolerating CCT-2014's known undercoverage at small G
     + reduced MC reps).
  4. Unweighted-under-informative-sampling bias vs weighted-under-same:
     the bias-reduction is the test teeth — if weighting failed to
     correct bias, this would flag it.

DGP: linear m(d) = β·D (WAS = β exactly, no quadratic confound) with
G=500, n_reps scaled via ``ci_params.bootstrap(200, min_n=25)``. Slow-
gated per repo convention; all four run in ~2.3s in pure-Python mode
so they're cheap enough to execute with -m slow on every CI run.
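
A stylized version of such a DGP. How informativeness enters and all constants are illustrative; the committed DGP in tests/test_had_mc.py may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
G, beta = 500, 0.7
D = rng.uniform(0.05, 1.0, size=G)              # continuous dose
dY = beta * D + rng.normal(scale=0.5, size=G)   # linear m(d) = beta*D, so WAS = beta
pi = 0.2 + 0.6 * D                              # dose-dependent inclusion probability
sampled = rng.uniform(size=G) < pi
w = 1.0 / pi[sampled]                           # inverse-probability pweights
# A weighted fit on (D[sampled], dY[sampled], w) should recover beta;
# the unweighted fit on the same informative sample is the bias benchmark (test 4).
```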

Per ``feedback_run_own_artifacts_before_done``: all four tests executed
green locally before commit, not just skip-guarded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
REGISTRY: adds a "Weighted extension (Phase 4.5 survey support)"
subsection under HAD Phase 1c anchor. Covers:
- Weighted kernel-weighted least squares derivation + kernel-composition
  formula W_combined = k((D - d̲)/h) · w_g and β-scale rescaling under
  weighted population moments (restated in the math sketch after this
  list).
- SurveyDesign composition via Binder (1983) TSL through the shared
  `compute_survey_if_variance` helper (no duplicate implementation per
  `feedback_psu_telescoping_tsl_vs_replicate`).
- Explicit "parity gap" Note flagging that no public weighted-CCF
  bias-corrected local-linear reference exists — the 1e-12 R bit-parity
  bar is unreachable on the weighted bias-corrected CI.
- Five `**Note:**` subsections documenting the methodology-confidence
  stack: uniform-weights bit-parity (1e-14), cross-language weighted-OLS
  parity (1e-12), MC oracle consistency, unweighted auto-bandwidth gap,
  and replicate-weight + mass-point + pretest deferrals.
- Explicit dispatch-matrix note: survey/weights supported on
  continuous_at_zero + continuous_near_d_lower only; mass_point,
  event_study, and HAD pretests deferred per phase sequencing.
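
Restating the two formulas from the first bullet in one place, under the assumption that the β-scale denominator matches the effective_dose_mean definition introduced later in this PR (with d̲ = 0 for continuous_at_zero):

```latex
W_g = k\!\left(\frac{D_g - \underline{d}}{h}\right) \cdot w_g,
\qquad
\hat\beta_w = \frac{\hat\tau_{bc}(\underline{d})}{\bar D_w},
\qquad
\bar D_w = \frac{\sum_g w_g\,(D_g - \underline{d})}{\sum_g w_g}
```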

CHANGELOG: adds an Unreleased "Added" entry covering the public API
surface (`survey=` / `weights=` on HAD.fit), the result-object metadata
(`survey_metadata`), the parity ceiling, the methodology-confidence
stack, and the Phase 4.5 B / C deferrals.

TODO: removes the two rows marking survey integration + weights support
as deferred (shipped); adds five new rows tracking the remaining
Phase 4.5 work (B: mass-point + event-study survey; C0: QUG-under-
survey decision; C: pretests under survey; weight-aware auto-bandwidth;
replicate-weight continuous-path support). Each row references a
specific file and phase so future PRs have a clear anchor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

Blocker

The new survey= continuous-HAD path reports inference for the wrong estimator scale: the ATT uses the bias-corrected tau_bc, but the survey SE is built from the classical local-linear influence function. That is silently wrong statistical output on the main new feature.

Executive Summary

  • The survey variance path uses the classical local-linear IF (V_Y_cl-aligned) while the returned ATT is built from tau_bc.
  • Survey p-values/CIs ignore df_survey and default to normal inference instead of finite-df survey t inference.
  • survey= does not reject SurveyDesign(weight_type="aweight"|"fweight"), so unsupported weight types are silently treated as sampling weights.
  • The large scope deferrals in this PR (weighted mass_point, weighted event_study, replicate weights, weight-aware DPI bandwidth) are properly documented in REGISTRY.md/TODO.md and are not blockers.

Methodology

  • P0 Wrong estimator-scale variance on the new survey path (diff_diff/_nprobust_port.py:L1007-L1016, diff_diff/_nprobust_port.py:L1347-L1382, diff_diff/had.py:L2843-L2856). Impact: att is formed from tau_bc, but compute_survey_if_variance() receives an IF explicitly documented as the classical tau.cl/mu_hat IF with the V_Y_cl self-check. The survey SE/CI therefore target a different estimator than the one returned. Concrete fix: expose and use a bias-corrected IF aligned with Q_q/res_b/V_Y_bc, or reject survey= on this path until that derivation exists.
  • P1 Survey inference ignores survey degrees of freedom (diff_diff/had.py:L2667-L2699, diff_diff/survey.py:L590-L627, diff_diff/utils.py:L175-L209). Impact: survey fits always use normal-theory inference; small-PSU designs get overstated precision, and any df_survey <= 0 case would miss the repo’s NaN-inference contract. Concrete fix: thread resolved_survey_unit.df_survey into safe_inference(..., df=...) and store it in HAD’s survey metadata.
  • P1 survey= silently accepts non-pweight designs (diff_diff/had.py:L2353-L2378, diff_diff/survey.py:L914-L947). Impact: SurveyDesign(weight_type="aweight"|"fweight") is reinterpreted as sampling/design weights in both the weighted kernel fit and Binder variance, which changes the estimand/inference silently. Concrete fix: reject weight_type != "pweight" up front on HAD’s survey path.

Code Quality

  • No additional material findings beyond the methodology issues above.

Performance

  • No material findings.

Maintainability

  • No material findings.

Tech Debt

  • P3 Informational: the deferrals for weighted mass_point, weighted event_study, replicate-weight HAD, and weight-aware DPI bandwidth are explicitly tracked (docs/methodology/REGISTRY.md:L2263-L2265, TODO.md:L95-L99). Impact: none for this review. Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • P2 The new HAD survey tests currently encode the wrong SE expectation and miss the key regression gaps (tests/test_had.py:L3303-L3324, tests/test_had.py:L3489-L3530). Impact: the suite would keep passing with the wrong estimator-scale survey SE and with missing df_survey/weight_type handling. Concrete fix: add SRS-equivalence coverage for survey=SurveyDesign(weights="w"), finite-df survey inference tests, and rejection tests for aweight/fweight.
  • P3 User-facing HAD docs are stale (diff_diff/had.py:L2264-L2268, diff_diff/had.py:L280-L282). Impact: the docstring still says survey/weights are deferred and survey_metadata is always None. Concrete fix: update the fit/result docstrings to describe the supported weighted continuous paths and their constraints.

Path to Approval

  1. Return a bias-corrected survey IF aligned with tau_bc/V_Y_bc, and use that in compute_survey_if_variance() for the HAD survey path.
  2. Pass resolved_survey_unit.df_survey to safe_inference and expose it in HAD result metadata.
  3. Reject SurveyDesign inputs whose weight_type is not pweight.
  4. Replace/add tests so the no-PSU/no-strata survey path matches the estimator-scale weighted variance, df_survey changes inference, and non-pweight survey designs fail fast.

P0 — bias-corrected survey IF: the survey variance path was building
the influence function from classical residuals (``res_h`` + ``R_p·W_h``,
aligned with ``V_Y_cl``) while the returned ATT uses the bias-corrected
``tau_bc``. Under compute_survey_if_variance this silently
under-estimated the survey SE by ignoring the CCT-2014 bias-correction
variance inflation. Fixed in ``_nprobust_port.lprobust``: IF now uses
``Q_q`` + ``res_b`` so ``sum(IF^2) == V_Y_bc[0,0]`` (verified in new
white-box test), matching the estimator scale of the ATT. Uniform-
weights bit-parity (weights=np.ones ≡ unweighted at 1e-14) preserved
across the new IF formula. The ``TestWeightedLprobust`` HC1 bit-parity
test still passes because under weights=ones the classical and
bias-corrected IFs differ only by the Q.q bias-correction term, which
is deterministic and cancels in the diff.
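
The white-box invariant, sketched; argument names other than weights= / return_influence= and the V_Y_bc field spelling are assumptions taken from this commit's prose.

```python
import numpy as np

from diff_diff._nprobust_port import lprobust  # weights=/return_influence= per this commit

# Argument names besides weights= / return_influence= are illustrative.
res = lprobust(d, dy, eval=0.0, h=h, b=b, weights=w, return_influence=True)

# Estimator-scale alignment: the IF reproduces the bias-corrected variance
# entry (V_Y_bc), not the classical V_Y_cl one.
np.testing.assert_allclose(np.sum(res.influence_function ** 2),
                           res.V_Y_bc[0, 0], rtol=1e-10)
```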

P1a — df_survey threading: survey fits previously used Normal-theory
inference regardless of PSU count. ``resolved_survey_unit.df_survey``
(n_psu − n_strata, or replicate QR rank − 1) now routes through
``safe_inference(..., df=...)`` on the survey path, producing t-based
p-values and CIs. Also surfaced in ``survey_metadata["df_survey"]`` for
introspection. Under small-PSU designs the t-critical exceeds the
Normal z-critical, widening the CI vs the prior (wrong) output. The
``weights=`` shortcut continues to use Normal inference since there's
no PSU structure to produce a finite df.
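
The inference change in miniature, assuming scipy for the critical values; safe_inference internals are not reproduced.

```python
from scipy import stats

def survey_ci(att: float, se: float, n_psu: int, n_strata: int, level: float = 0.95):
    """Hedged sketch: the survey= path uses Student-t with df = n_psu - n_strata."""
    df_survey = n_psu - n_strata
    t_crit = stats.t.ppf(0.5 + level / 2, df_survey)
    return att - t_crit * se, att + t_crit * se

# The weights= shortcut keeps Normal inference (no PSU structure, no finite df):
# t.ppf(0.975, df) > norm.ppf(0.975), so small-PSU survey CIs widen.
```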

P1b — reject non-pweight SurveyDesigns: HAD's weighted kernel composition
``W_combined = k((D-d̲)/h) · w`` implements inverse-probability weighting
semantics. ``SurveyDesign(weight_type="aweight"|"fweight")`` is now
rejected with ``NotImplementedError`` at fit-time — aweight (analytic)
implies a different inferential target (weighted regression, not
design-based inference), and fweight (frequency) implies observation
replication. Neither has been derived for HAD's continuous-dose path;
deferred as a follow-up.

P2 — test coverage: six new ``TestHADSurvey`` tests locking in the
three fixes above:
  - ``test_survey_if_uses_bias_corrected_scale``: white-box
    ``sum(IF^2) == V_Y_bc[0,0]`` under nonlinear DGP where
    V_Y_bc ≠ V_Y_cl (teeth).
  - ``test_survey_df_widens_ci_vs_normal`` +
    ``test_survey_df_threaded_into_inference_via_t_distribution``:
    assert df_survey surfaces in metadata and produces t-CI half-width
    matching ``t_crit(df) · se`` exactly.
  - ``test_survey_aweight_raises_not_implemented`` +
    ``test_survey_fweight_raises_not_implemented``: front-door rejection.
  - ``test_survey_no_psu_no_strata_se_matches_weights_hc1``: SRS
    equivalence — survey SE within 10-15% of weights-shortcut SE (the
    (n/(n-1))-style HC1 correction), ruling out a silent
    V_Y_cl-vs-V_Y_bc mismatch.

P3 — docstring refresh: the stale fit() docstring calling survey/weights
"Reserved for a follow-up PR" is replaced with the actual Phase 4.5 A
contract (pweight-only, constant-within-unit, replicate-weight deferred,
mass-point/event-study/pretests deferred). ``survey_metadata`` docstring
on ``HeterogeneousAdoptionDiDResults`` now enumerates the actual dict
keys and their semantics.

All 348 tests (across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite) pass after the cascade. Ruff clean on modified files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6125fda109df1d055f5e2412d840e0d58df704f2


Overall Assessment

⚠️ Needs changes

Static review only; I did not run the test suite because the sandbox Python environment is missing numpy.

Executive Summary

  • The previous review’s methodology blockers appear addressed: the survey IF now targets the bias-corrected scale (V_Y_bc), survey d.f. is threaded into safe_inference, and non-pweight survey designs are rejected fast. See diff_diff/_nprobust_port.py:L1359-L1393, diff_diff/had.py:L2390-L2408, diff_diff/had.py:L2710-L2755, tests/test_had.py:L3589-L3743.
  • P1 The new HAD survey path returns survey_metadata as a plain dict instead of the repo-standard SurveyMetadata object. Shared report/diagnostic consumers read that field via attribute access, so HAD survey results will silently lose survey-specific metadata outside had.py. See diff_diff/had.py:L280-L288, diff_diff/had.py:L2745-L2755, diff_diff/business_report.py:L606-L612, diff_diff/business_report.py:L815-L837, diff_diff/diagnostic_report.py:L1803-L1810, diff_diff/results.py:L14-L45, tests/test_survey.py:L610-L615.
  • P2 Weighted continuous fits still do not expose a clear result-level denominator contract: the estimator uses weighted population moments, but the public result still stores/displays only raw dose_mean / D_bar, and there is no weighted regression coverage for that metadata. See diff_diff/had.py:L249-L250, diff_diff/had.py:L2606, diff_diff/had.py:L2767, diff_diff/had.py:L2837-L2850, diff_diff/had.py:L348-L348, tests/test_had.py:L346-L350.
  • The documented scope deferrals for weighted mass_point, weighted event_study, replicate weights, and weight-aware DPI bandwidth are properly recorded in REGISTRY.md/TODO.md and are not blockers. See docs/methodology/REGISTRY.md:L2255-L2265, TODO.md:L95-L99.

Methodology

  • P3 No unmitigated methodology defects found in the changed estimator path. The earlier P0/P1 issues are fixed by the bias-corrected IF construction, survey-d.f. inference routing, and pweight-only guard. Impact: none. Concrete fix: none. References: diff_diff/_nprobust_port.py:L1359-L1393, diff_diff/had.py:L2390-L2408, diff_diff/had.py:L2710-L2755, tests/test_had.py:L3589-L3743.

Code Quality

No findings.

Performance

No findings.

Maintainability

  • P1 survey_metadata contract drift on HAD survey fits. The PR changes HeterogeneousAdoptionDiDResults.survey_metadata to a dict (diff_diff/had.py:L280-L288, diff_diff/had.py:L2745-L2755), but shared consumers expect the repo-standard SurveyMetadata object and use attribute access (diff_diff/business_report.py:L606-L612, diff_diff/business_report.py:L815-L837, diff_diff/diagnostic_report.py:L1803-L1810, diff_diff/results.py:L14-L45). Existing survey tests elsewhere in the repo also assert SurveyMetadata instances (tests/test_survey.py:L610-L615). Impact: if a HAD survey result is passed into shared reporting/diagnostic tooling, fields like df_survey, weight_type, effective_n, design_effect, n_strata, and n_psu will be silently dropped or mislabelled. Concrete fix: return the shared SurveyMetadata dataclass for HAD too (ideally via compute_survey_metadata(...)), and move HAD-specific extras such as method, source, and variance_formula to a separate field or an extended metadata object; add a regression test that a HAD survey fit preserves survey inference/report metadata through the generic report helpers.

Tech Debt

  • P3 Informational: the main deferred items introduced by this PR are properly tracked and documented, so they are not blockers here. Impact: none. Concrete fix: none required in this PR. References: docs/methodology/REGISTRY.md:L2255-L2265, TODO.md:L95-L99.

Security

No findings.

Documentation/Tests

  • P2 Weighted-fit result metadata and docs are still inconsistent with the shipped surface. The continuous weighted estimator now uses weighted population moments (diff_diff/had.py:L2837-L2850), but the public result still documents/stores only raw dose_mean / D_bar (diff_diff/had.py:L249-L250, diff_diff/had.py:L2606, diff_diff/had.py:L2767, diff_diff/had.py:L348-L348), and the new tests do not pin any weighted metadata contract (tests/test_had.py:L346-L350, tests/test_had.py:L3421-L3430). There are also stale in-code docs that still describe the pre-Phase-4.5 surface or wrong IF semantics (diff_diff/_nprobust_port.py:L34-L39, diff_diff/local_linear.py:L936-L941, diff_diff/had.py:L227-L233). Impact: users and future maintainers cannot reliably tell from the result object/docstrings which denominator and IF scale a weighted HAD fit actually used. Concrete fix: expose the effective weighted denominator explicitly (or rename/clarify dose_mean so it is obviously descriptive rather than estimator-defining), add weighted-fit assertions for it, and update the stale docstrings to match the shipped behavior.

Path to Approval

  1. Restore the repo-wide survey_metadata contract for HAD survey fits by returning a SurveyMetadata-compatible object, or update all shared consumers that read survey_metadata to accept the new mapping shape; add a regression test through the generic report/diagnostic path.
  2. Clarify the weighted denominator metadata on HAD results: either store the actual weighted denominator used by the estimator, or add a separate explicitly named field for it and document that dose_mean is raw-sample-only; add weighted tests for both continuous_at_zero and continuous_near_d_lower.

P1 — survey_metadata contract drift: the previous HAD survey fits
returned ``survey_metadata`` as a plain dict, but the repo-standard is
the :class:`diff_diff.survey.SurveyMetadata` dataclass read via
attribute access by BusinessReport, DiagnosticReport, and the shared
``results.py`` plumbing. Passing a HAD survey result through those
consumers would silently drop ``df_survey`` / ``effective_n`` /
``n_strata`` / ``n_psu`` under the dict shape. Fixed by routing both
entry paths (``survey=SurveyDesign(...)`` and ``weights=<array>``)
through :func:`compute_survey_metadata`; ``weights=`` constructs a
minimal ``ResolvedSurveyDesign`` (weights-only, no strata/PSU/FPC)
uniformly. HAD-specific extras (``variance_formula`` label, the
pweight vs Binder-TSL method tag) are moved to a new top-level field
``HeterogeneousAdoptionDiDResults.variance_formula`` (Optional[str])
rather than polluting the shared SurveyMetadata schema. New regression
test ``test_survey_metadata_is_surveymetadata_instance`` exercises
attribute access on both entry paths.

P2 — weighted denominator contract: the continuous paths use weighted
population moments internally but the public result still stored only
the raw ``dose_mean`` — a user reconstructing β-scale by hand would
have been off by the (weighted − raw) gap. Added
``HeterogeneousAdoptionDiDResults.effective_dose_mean`` (Optional[float])
that exposes the actual weighted denominator used by the beta-scale
rescaling: ``sum(w·D)/sum(w)`` for ``continuous_at_zero``,
``sum(w·(D−d_lower))/sum(w)`` for ``continuous_near_d_lower``. ``None``
on unweighted fits (use ``dose_mean`` there; the two coincide under
uniform weights and the duplicate field would be noise). Raw
``dose_mean`` preserved for backward compat; docstring clarifies the
distinction. Four new regression tests covering both continuous
designs + the uniform-weights coincidence + the unweighted
``None``-contract.

P3 — stale docstrings (``_nprobust_port.py``, ``local_linear.py``,
``had.py``): three docstring blocks still described the pre-Phase-4.5
surface ("weights= not supported" / "IF of the classical estimate"
/ "CCT-2014 robust SE divided by |den|"). Updated to describe the
shipped behavior: lprobust supports weights + return_influence on the
BIAS-CORRECTED scale, bias_corrected_local_linear's
``influence_function`` field aligns with ``V_Y_bc``, and HAD's ``se``
documentation now enumerates the three SE paths (unweighted, pweight
via weighted-robust, survey via Binder-TSL).

All 353 tests pass after refactor (across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d86084070c401ccbb11e5926b342faf16d4ec965


Overall Assessment

Looks good

Static review only; I could not run the test suite in this sandbox because Python here is missing numpy.

Executive Summary

  • The previous rerun blockers look addressed: the survey path now uses the bias-corrected IF, threads survey d.f. into safe_inference() for real survey= fits, and returns repo-standard SurveyMetadata instead of a plain dict. References: diff_diff/_nprobust_port.py:L1364-L1398, diff_diff/had.py:L2745-L2804, diff_diff/had.py:L288-L299.
  • I did not find an unmitigated methodology mismatch against the new registry entry for weighted continuous HAD. The scope exclusions for weighted mass_point, weighted event_study, replicate weights, and weight-aware DPI are explicitly documented/tracked, so they are not blockers. References: docs/methodology/REGISTRY.md:L2249-L2268, TODO.md:L95-L99.
  • P2 The weights= shortcut still populates survey-style df_survey metadata even though its own inference stays normal-theory. That makes summaries/reports describe some weights= fits as finite-df survey fits when they are not. References: diff_diff/had.py:L2745-L2804, diff_diff/had.py:L407-L413, diff_diff/business_report.py:L606-L612.
  • P2 The new weighted-result fields are only partially surfaced: effective_dose_mean and variance_formula are stored on the result object, but to_dict() omits them and summary() still only shows raw dose_mean. References: diff_diff/had.py:L341-L359, diff_diff/had.py:L443-L470, diff_diff/had.py:L381-L413, tests/test_had.py:L3438-L3446, tests/test_had.py:L3755-L3816.

Methodology

No findings.

The changed estimator path matches the Methodology Registry as written: kernel weights are composed pointwise with user weights, survey variance is built from the bias-corrected IF, non-pweight survey types are rejected, and the deferred paths are documented rather than silently shipped. References: diff_diff/_nprobust_port.py:L1220-L1297, diff_diff/_nprobust_port.py:L1364-L1398, diff_diff/had.py:L2424-L2455, diff_diff/had.py:L2964-L2979, docs/methodology/REGISTRY.md:L2255-L2268.

Code Quality

  • Severity P2 weights= fits advertise survey d.f. metadata that their own inference path does not use. On the weights= shortcut, fit() keeps df=None for safe_inference() (diff_diff/had.py:L2745-L2755), but then builds survey_metadata from a synthetic ResolvedSurveyDesign, which gives df_survey = n_obs - 1 and makes summary() / BusinessReport treat the fit as a finite-df survey result (diff_diff/had.py:L2781-L2804, diff_diff/had.py:L407-L413, diff_diff/business_report.py:L606-L612).
    Impact: user-facing summaries/reports can misdescribe the inference family for weights= results, even though the stored CI/p-value were computed with the normal reference.
    Concrete fix: on the weights= path, either clear survey-only fields such as df_survey (and ideally n_psu/n_strata) after compute_survey_metadata(), or teach report/summary consumers to gate survey-only behavior on variance_formula == "survey_binder_tsl" rather than survey_metadata is not None.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 Informational: the remaining weighted-scope exclusions are properly documented and tracked, so they are not blockers for this PR. References: TODO.md:L95-L99, docs/methodology/REGISTRY.md:L2266-L2268.
    Impact: none.
    Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

  • Severity P2 The new weighted-result metadata is only partially propagated. effective_dose_mean and variance_formula are added to HeterogeneousAdoptionDiDResults and populated in fit() (diff_diff/had.py:L341-L359, diff_diff/had.py:L2805-L2815), but to_dict() still omits both (diff_diff/had.py:L443-L470), and summary() still displays only raw dose_mean (diff_diff/had.py:L381-L413). The tests pin the object fields themselves but not the serialized/reporting surface (tests/test_had.py:L3438-L3446, tests/test_had.py:L3755-L3816).
    Impact: downstream machine consumers of to_dict() cannot recover the actual weighted denominator or the SE-family label, and the public surface is less complete than the new docs/changelog imply.
    Concrete fix: add variance_formula and effective_dose_mean to to_dict(), render the weighted denominator explicitly in summary() when present, and add regression tests for those serialized outputs.

P2a — weights=<array> shortcut clears survey-only SurveyMetadata fields:
the shortcut constructs a minimal ResolvedSurveyDesign (n_psu=G,
n_strata=1) to reuse compute_survey_metadata's code path, which
produces df_survey = G − 1. But the shortcut's own inference stays
Normal (safe_inference(..., df=None)), so reporting code that rendered
"Survey df: G−1" was misdescribing the fit as a finite-df survey
result. Fixed by explicitly clearing ``df_survey`` / ``n_psu`` /
``n_strata`` on the ``SurveyMetadata`` returned for the weights= path;
``weight_type`` / ``effective_n`` / ``design_effect`` / ``sum_weights``
/ ``weight_range`` stay populated since they describe the weighted
sample regardless of inference family. The ``survey=SurveyDesign(...)``
path is unchanged: all survey-only fields populated and threaded into
the t-CI. Two new regression tests lock each side of the contract.

P2b — to_dict() + summary() surface the new weighted-result fields:
``effective_dose_mean`` and ``variance_formula`` were stored on the
result object but not serialized or rendered, so downstream machine
consumers of ``to_dict()`` couldn't recover the weighted denominator
or the SE-family label (the reporting-surface mirror of the P2
finding). Added both keys to ``to_dict()`` and rendered the weighted
denominator explicitly in ``summary()`` when present. Three new tests:
one for to_dict under weights, one for the None-defaults under
unweighted, and one that ``summary()`` renders the weighted D̄ label.

All 358 tests pass across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 08afb1233631e4ce71e525e01f1f7f8c75e59a90


Overall Assessment

Looks good

Static review only; I could not run the test suite in this sandbox because Python here is missing numpy.

Executive Summary

  • This re-review looks materially better than the prior pass: the weights= shortcut now clears survey-only finite-df metadata, and variance_formula / effective_dose_mean now propagate through summary() and to_dict() (diff_diff/had.py:L407-L418, diff_diff/had.py:L448-L478, diff_diff/had.py:L2811-L2825).
  • The affected methods are the HAD continuous-dose estimators (continuous_at_zero, continuous_near_d_lower) plus the supporting Phase 1c local-linear port. I did not find an unmitigated P0/P1 methodology mismatch against the new registry entry (docs/methodology/REGISTRY.md:L2258-L2268).
  • Deferred scope is handled correctly: weighted mass_point, weighted event_study, replicate-weight survey designs, and weight-aware DPI bandwidth selection are explicitly documented and/or tracked rather than silently shipped (docs/methodology/REGISTRY.md:L2266-L2268, TODO.md:L95-L99).
  • P2 The survey= branch builds survey_metadata from normalized unit weights, not the original supplied weights. That makes sum_weights and weight_range inconsistent with compute_survey_metadata()’s contract and with the weights= shortcut, even though the estimator and SE themselves look fine (diff_diff/had.py:L2774-L2785, diff_diff/survey.py:L194-L205, diff_diff/survey.py:L719-L767, tests/test_survey.py:L1045-L1061).
  • P3 CHANGELOG.md overstates the new reporting surface: __repr__ still does not render the weighted-path metadata, and summary() does not show all of the fields named in the changelog (CHANGELOG.md:L11, diff_diff/had.py:L362-L418).

Methodology

No findings. The weighted continuous-HAD implementation matches the Methodology Registry as written: weights enter the local-linear fit via kernel-weight composition, survey inference uses Binder-style Taylor linearization on the bias-corrected IF, and unsupported branches are rejected or documented instead of silently approximated (diff_diff/_nprobust_port.py:L1219-L1297, diff_diff/_nprobust_port.py:L1364-L1412, diff_diff/had.py:L2408-L2462, diff_diff/had.py:L2979-L3002, docs/methodology/REGISTRY.md:L2258-L2268).

Code Quality

  • Severity P2 Impact: on the survey= path, survey_metadata.sum_weights and survey_metadata.weight_range are computed from normalized unit weights rather than the original supplied weights. SurveyDesign.resolve() normalizes pweight/aweight inputs to mean 1 (diff_diff/survey.py:L194-L205), while compute_survey_metadata() is explicitly defined to receive the raw pre-normalization weights and its tests assert raw-weight semantics for weight_range (diff_diff/survey.py:L719-L767, tests/test_survey.py:L1045-L1061). HAD currently passes resolved_survey_unit.weights into that helper (diff_diff/had.py:L2774-L2785), so the metadata can disagree with the weights= shortcut and with the helper contract. Concrete fix: preserve the pre-normalization per-unit weights when collapsing the survey design, pass those raw unit weights to compute_survey_metadata(), and add a regression test that survey= and weights= report the same sum_weights / weight_range for identical unit weights.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 Impact: none. The remaining weighted-scope gaps are properly deferred and tracked: weighted mass_point, weighted event_study, replicate-weight survey designs, survey-aware pretests, and weight-aware DPI bandwidths are all called out in REGISTRY.md and TODO.md, so they are not blockers for this PR (docs/methodology/REGISTRY.md:L2266-L2268, TODO.md:L95-L99). Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

  • Severity P3 Impact: CHANGELOG.md says the new survey metadata is rendered consistently by to_dict / summary / __repr__, but __repr__ still only shows att, se, design, and n_obs, and summary() only surfaces a subset of the advertised metadata (CHANGELOG.md:L11, diff_diff/had.py:L362-L418, diff_diff/had.py:L448-L478). This is documentation drift, not a numerical defect. Concrete fix: either narrow the changelog wording to match the actual rendered surface, or extend __repr__ / summary() to include the advertised metadata.

P2 — survey_metadata uses raw pre-normalization weights. On the
``survey=SurveyDesign(weights="col")`` path, the prior code passed
``resolved_survey_unit.weights`` (post-normalization, mean=1) into
``compute_survey_metadata``, producing ``sum_weights ≈ G`` and
normalized ``weight_range`` that drifted from both the ``weights=``
shortcut's values and the helper's documented contract (tests in
``tests/test_survey.py`` assert raw-weight semantics for
``weight_range``). Fixed by capturing the raw weight column from
``data[survey.weights]`` BEFORE ``SurveyDesign.resolve()`` rescales
pweights/aweights to mean=1, aggregating those raw values to unit-
level via the existing ``_aggregate_unit_weights`` helper, and passing
that raw per-unit array as the ``raw_weights`` argument. Both entry
paths (``weights=`` and ``survey=``) now report bit-identical
``sum_weights`` / ``weight_range`` / ``design_effect`` /
``effective_n`` for the same unit-level weights — verified by a new
regression test ``test_survey_metadata_raw_weights_match_shortcut``.

P3 — ``__repr__`` surfaces weighted-path metadata. CHANGELOG claimed
``to_dict`` / ``summary`` / ``__repr__`` render the new weighted
surface consistently, but ``__repr__`` was unchanged. Extended the
compact one-line form to include ``variance_formula`` and
``effective_dose_mean`` when present; unweighted fits keep the
original terse repr. Regression test
``test_repr_surfaces_weighted_fields_when_present`` locks both cases.

All 360 tests pass (across the six HAD-related test modules). Ruff
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 188021be767243c5ab08ff7a4e3156a1544f1d1b


Overall Assessment

Blocker

Static review only; I could not run the targeted tests in this sandbox because Python here is missing numpy.

Executive Summary

  • The affected methodology is the new weighted HAD continuous-dose path: continuous_at_zero, continuous_near_d_lower, and the supporting Phase 1c weighted bias_corrected_local_linear / _nprobust_port.lprobust route. Aside from one blocker below, the kernel-weight composition and Binder-TSL variance route match the new registry entry.
  • The prior metadata-scale issue appears fixed: the survey path now captures pre-normalization unit weights before SurveyDesign.resolve() and passes those raw unit weights into compute_survey_metadata().
  • P0: zero-weight units still drive auto design detection, d_lower, the mass-point threshold, and treated/control counts even though the weighted estimator later drops them. Standard survey subpopulation designs can therefore silently target the wrong HAD design or reject a valid weighted fit.
  • The documented Phase 4.5 deferrals are handled correctly: weighted mass_point, weighted event_study, replicate-weight designs, pretests under survey, and weight-aware DPI bandwidths are all explicitly documented/tracked rather than silently approximated.
  • The previous documentation finding is only partially resolved: summary() / __repr__ still do not render all of the survey metadata fields that CHANGELOG.md says they do.

Methodology

  • Severity P0 Impact: zero-weight observations are allowed and are the standard representation for survey subpopulations, but HAD still resolves the design on the full unit sample instead of the positive-weight support. fit() aggregates all units first, then runs _detect_design, d_lower resolution, the 2% mass-point check, and treated/control counts on that full d_arr; only later does _fit_continuous() switch to weighted averages and weighted kernel weights. That means a zero-weight unit at d.min() can silently flip a weighted sample from continuous_near_d_lower to continuous_at_zero, shift d_lower, misstate the control group, or trigger the wrong NotImplementedError. This is especially problematic for SurveyDesign.subpopulation(), which intentionally zeros excluded units instead of dropping them. diff_diff/survey.py:L422-L538, diff_diff/had.py:L2406-L2493, diff_diff/had.py:L2502-L2679, diff_diff/had.py:L2954-L3018, diff_diff/local_linear.py:L1104-L1165, diff_diff/local_linear.py:L1217-L1232
    Concrete fix: resolve HAD design logic on the positive-weight unit subset (weights_unit > 0) for auto design detection, d_lower, the mass-point threshold, and displayed counts. For auto-bandwidth/support checks, feed only that positive-weight subset into the Phase 1c validator/selector. If you still want full-design survey variance, re-expand the returned IF back to the full unit design with zeros before calling compute_survey_if_variance().

Aside from that blocker, the weighted kernel-composition math, Binder-TSL variance route, and the registry-noted deviations/deferrals are aligned with docs/methodology/REGISTRY.md:L2258-L2268.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The remaining weighted-scope gaps are explicitly tracked and therefore mitigated for this PR: docs/methodology/REGISTRY.md:L2266-L2268, TODO.md:L95-L99.

Security

No findings.

Documentation/Tests

  • Severity P3 Impact: CHANGELOG.md still overstates the reporting surface. It says to_dict / summary / __repr__ render the survey metadata consistently, including weight sum, n_strata, and n_psu, but summary() currently prints only variance_formula, effective_n, the weighted denominator, and optionally df_survey, while __repr__ only adds variance_formula and effective_dose_mean. to_dict() also returns a SurveyMetadata object even though its docstring still says “dict or None.” CHANGELOG.md:L11-L11, diff_diff/had.py:L362-L487
    Concrete fix: either narrow the changelog/docstring wording to the fields actually rendered, or add the missing metadata fields to summary() / __repr__ and update the to_dict() docstring to match the implementation.

Path to Approval

  1. Make weighted HAD design resolution depend on the positive-weight unit support, not on all aggregated units.
  2. Ensure the Phase 1c support/bandwidth path on weighted fits also ignores zero-weight units, while preserving full-design IF expansion for survey variance if needed.
  3. Add regression tests using SurveyDesign.subpopulation() or an equivalent zero-weight unit at d.min() showing that weighted fits on the full design match the same fits run after physically dropping the zero-weight units for auto, continuous_near_d_lower, and the reported count fields.

P0 — zero-weight units must not drive HAD design resolution. Previously
``fit()`` aggregated all units first, then ran ``_detect_design``,
``d_lower`` resolution, the 2% mass-point threshold, and treated/control
counts on the full ``d_arr`` — so a zero-weight unit at ``d.min() = 0``
(the standard SurveyDesign.subpopulation encoding) could flip a
weighted sample from ``continuous_near_d_lower`` to
``continuous_at_zero``, shift ``d_lower``, misstate cohort counts, or
trigger the wrong ``NotImplementedError``. The weighted kernel already
dropped those units at fit time via lprobust's ``w > 0`` selector, so
the fit NUMBERS were correct; the DESIGN DECISION was contaminated.

Fixed by filtering ``d_arr`` / ``dy_arr`` / ``weights_unit`` /
``raw_weights_unit`` / ``resolved_survey_unit`` to
``weights_unit > 0`` immediately after survey/weights resolution and
before any design-resolution logic. New ``_filter_resolved_survey``
helper rebuilds the ResolvedSurveyDesign on the positive-weight subset
(recomputing ``n_strata`` / ``n_psu`` for compute_survey_if_variance).
A UserWarning fires when units are dropped so the behavior is
introspectable.
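
The shape of the filter, sketched with the variable names this commit uses. Note that the next review round narrows its use: the filtered copies end up driving design resolution only, with the full survey design preserved for variance.

```python
import warnings

import numpy as np

def restrict_to_positive_weights(d_arr, dy_arr, weights_unit):
    """Hedged sketch of the Round-5 filter: design resolution should only
    see units with positive weight (names follow this commit's text)."""
    positive = weights_unit > 0
    if not positive.all():
        warnings.warn(f"{int((~positive).sum())} zero-weight units excluded "
                      "from HAD design resolution", UserWarning)
    return d_arr[positive], dy_arr[positive], weights_unit[positive]
```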

Three new regression tests lock the fix:
- ``test_zero_weight_unit_at_d_min_does_not_flip_design``: full panel
  with zero-weight unit at ``d=0`` vs physically-dropped panel —
  design, d_lower, att, and se all match to 1e-10.
- ``test_zero_weight_filter_warns_user``: UserWarning emitted.
- ``test_zero_weight_counts_reflect_positive_subset``: ``n_obs`` on the
  result is the positive-weight unit count, not the full panel size.

P3 — CHANGELOG + ``to_dict()`` docstring accuracy. CHANGELOG claimed
``to_dict`` / ``summary`` / ``__repr__`` render the survey metadata
consistently including ``weight_sum`` / ``n_strata`` / ``n_psu``; in
practice each surface renders a different subset. Rewrote the CHANGELOG
entry to enumerate exactly what each surface prints, and rewrote
``HeterogeneousAdoptionDiDResults.to_dict()`` docstring to name the
weighted-path keys and clarify that ``survey_metadata`` is a
SurveyMetadata object (not a dict).

All 363 tests pass. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 6bd07e7d11d0a918bcc39dff9efeef7bb1c4fb8e


Overall Assessment

⚠️ Needs changes

Static review only. I could not run the targeted tests in this sandbox because numpy/pytest are unavailable here.

Executive Summary

  • The previous blocker on zero-weight units driving HAD design resolution appears fixed on the main estimator path: fit() now resolves design, d_lower, and count fields on the positive-weight subset.
  • P1: the new survey= path drops zero-weight units from the ResolvedSurveyDesign before Binder-TSL variance, which contradicts the library’s own subpopulation/domain-estimation contract and will misstate se, df_survey, n_psu, n_strata, and weight metadata for survey subpopulations.
  • P1: the new public bias_corrected_local_linear(weights=...) path still runs HAD support/mass-point/boundary validation on the full sample rather than the positive-weight support, so zero-weight boundary units can still flip the supported regime or trigger spurious off-support/mass-point failures.
  • The documented Phase 4.5 deferrals (mass_point, weighted event study, replicate weights, weight-aware DPI, weighted pretests) are properly tracked in TODO.md / REGISTRY.md and are not blockers.

Methodology

  • Severity P1
    Impact: survey subpopulation variance is currently computed on a filtered design, not the full design. SurveyDesign.subpopulation() and the registry both say the correct domain-analysis method is to keep zero-weight rows/PSUs/strata in the survey design for variance/df, but HAD now filters them out via _filter_resolved_survey() before calling compute_survey_if_variance(). That silently changes PSU counts, stratum centering, FPC application, and df_survey; a domain that leaves one PSU with positive in-domain mass and another PSU with only zero-weight observations can therefore get the wrong survey SE or even the wrong “lonely PSU” behavior. diff_diff/had.py:L2566-L2598, diff_diff/had.py:L3125-L3138, diff_diff/had.py:L1697-L1730, diff_diff/survey.py:L422-L432, docs/methodology/REGISTRY.md:L3083-L3098
    Concrete fix: keep zero-weight filtering only for design resolution / local-linear fitting, but preserve the full unit-level ResolvedSurveyDesign for survey variance and metadata. Re-expand the fitted IF back to full unit length with zeros outside the domain, then compute compute_survey_if_variance(), df_survey, and compute_survey_metadata() on that full design.

  • Severity P1
    Impact: the newly enabled bias_corrected_local_linear(weights=...) surface still validates support on unfiltered d. The function accepts nonnegative weights, but immediately calls _validate_had_inputs() on the full sample; that helper’s d.min(), mass-point, and Design-1′ support heuristics therefore still see zero-weight observations. A zero-weight unit at the boundary can still make a valid weighted domain fit look like Design 1′, or trigger a spurious mass-point/off-support rejection; in auto mode it also contaminates bandwidth selection because the selector runs before any positive-weight filtering. diff_diff/local_linear.py:L1063-L1105, diff_diff/local_linear.py:L593-L668
    Concrete fix: when weights is passed, form a positive-weight mask first and run _validate_had_inputs(), auto-bandwidth selection, and the final fit on that positive-weight support. If return_influence=True, expand the returned IF back to the original ordering with zeros for masked rows.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The remaining weighted-scope exclusions are explicitly tracked in TODO.md and docs/methodology/REGISTRY.md, so I am not counting them as blockers.

Security

  • No findings.

Documentation/Tests

  • No separate findings beyond the missing regression coverage implied by the two P1 issues above.

Path to Approval

  1. Preserve the full survey design on survey= subpopulation fits: zero-pad the HAD continuous-path IF back to the full unit design and compute Binder-TSL variance / df_survey / survey metadata on that full design, not on the filtered positive-weight subset.
  2. Make bias_corrected_local_linear(weights=...) operate on positive-weight support for support validation and auto-bandwidth selection, then zero-pad IF output back to full length when requested.
  3. Add regression tests for both cases:
    1. SurveyDesign.subpopulation() with explicit strata/PSU where one PSU is all zero-weight in-domain, asserting full-design survey df/SE behavior.
    2. bias_corrected_local_linear(weights=...) with zero-weight units at d.min() / the boundary, comparing weighted results to physically dropping those zero-weight rows in both explicit-h/b and auto-bandwidth modes.

P1a — preserve full survey design for Binder-TSL variance. Round 5 filtered
the ResolvedSurveyDesign to the positive-weight subset for BOTH design
resolution AND survey variance composition. That's wrong under the
standard subpopulation / domain-estimation convention (keep the sampling
frame, zero the contributions) that ``SurveyDesign.subpopulation()`` +
``compute_survey_if_variance()`` both implement elsewhere in this repo.
Silent consequences: ``n_psu`` / ``n_strata`` / ``df_survey`` / FPC
application / lonely-PSU behavior would all have been computed on the
in-domain subset instead of the full design.

Refactored HAD ``fit()`` to preserve ``weights_unit_full`` /
``resolved_survey_unit_full`` / ``raw_weights_unit_full`` alongside the
filtered copies used ONLY for design resolution (``_detect_design`` /
``d_lower`` / mass-point threshold / cohort counts). The fit itself
(``_fit_continuous`` → ``bias_corrected_local_linear`` → ``lprobust``)
now receives the FULL unfiltered arrays; ``bias_corrected_local_linear``
filters internally (see P1b below) and zero-pads the IF back to full
ordering, so ``compute_survey_if_variance(if_full, resolved_full)``
preserves the sampling-frame structure. ``compute_survey_metadata`` and
``effective_dose_mean`` likewise consume the full arrays so
``n_psu`` / ``n_strata`` / ``sum_weights`` / ``weight_range`` reflect
the sampling frame. Dropped the now-unused
``_filter_resolved_survey`` helper.

P1b — ``bias_corrected_local_linear(weights=...)`` filters internally.
The public wrapper previously ran ``_validate_had_inputs`` / mass-point
threshold / Design 1' support heuristic + auto-bandwidth MSE-DPI
selector on the full ``d`` array, so zero-weight units at the boundary
or at ``d.min()`` could flip the detected regime or trigger spurious
off-support rejections on valid weighted domain fits. Now computes
``positive_mask = weights > 0`` upfront, validates weights (finite,
non-negative, positive sum — consistent with the port's ValueErrors),
filters ``d`` / ``y`` / ``weights`` / ``cluster`` to the positive-
weight subset BEFORE all downstream validation + selection + fit,
and zero-pads the returned IF back to the original ordering when
``return_influence=True``. ``n_total`` on the returned ``BiasCorrectedFit``
is the original length (matches the caller's input shape).

Four new regression tests:
- ``test_zero_weight_survey_metadata_preserves_full_design``: survey=
  path with PSU/strata + 25% zero-weight units reports full-frame n_psu
  (not in-domain subset); ATT bit-parity with physically-dropped fit.
- ``test_bias_corrected_local_linear_zero_weight_matches_filtered``:
  explicit-h/b wrapper with zero-weight unit at d=0 (would have flipped
  Design 1') matches dropped-sample fit at 1e-12; IF zero-padded to
  original length with IF[0]=0 at the zero-weight position.
- ``test_bias_corrected_local_linear_zero_weight_auto_bandwidth``:
  auto-bandwidth path (DPI selector also runs on positive-weight
  support) produces same h + same bias-corrected estimate as the
  dropped-sample fit.
- The existing Round 5 zero-weight tests (``n_obs`` reflects positive-
  weight subset; UserWarning; design-not-flipped) pass unchanged since
  the design-resolution filter is preserved.

All 366 tests pass (across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1998d75e2c4dcf55d72743e42be7b28302d65a56


Overall Assessment

Looks good

Static diff review only. I could not run pytest in this sandbox (the pytest module is unavailable there), but the modified Python files parsed cleanly via ast.parse().

Executive Summary

  • The two previously raised P1 issues appear resolved. HAD now filters zero-weight units only for design resolution while preserving the full resolved survey design for survey variance / df / metadata, and the continuous fit is still run against the full ordering so the IF can align to the full design. diff_diff/had.py
  • bias_corrected_local_linear(weights=...) now operates on positive-weight support before support validation and auto-bandwidth selection, then zero-pads the returned influence function back to the original ordering. That addresses the earlier zero-weight boundary/support failure mode. diff_diff/local_linear.py
  • The new survey variance path is aligned to the bias-corrected estimator scale. _nprobust_port.lprobust now exports a Q_q / res_b based IF matched to V_Y_bc, which is the right object for Binder-TSL composition on the weighted HAD continuous path. diff_diff/_nprobust_port.py
  • Remaining scope limits are properly documented rather than silently changing methodology: weighted auto-DPI bandwidths, weighted mass-point, weighted event-study, weighted pretests, and replicate-weight HAD are all called out in the registry and tracked in TODO.md. docs/methodology/REGISTRY.md TODO.md
  • I did not find any new unmitigated P0 or P1 issues in the changed code. The added regression coverage is well targeted at the prior review concerns. tests/test_had.py

Methodology

  • No findings. The weighted continuous-dose implementation matches the updated methodology registry: weighted beta-scale moments, full-design domain semantics for survey variance, bias-corrected IF scale, and explicit deferral of unsupported paths. docs/methodology/REGISTRY.md

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3
    Impact: Weighted mass_point, weighted aggregate="event_study", survey-weighted HAD pretests, replicate-weight HAD, and weight-aware DPI bandwidth selection remain deferred. These are explicitly tracked and documented, so they are not blockers for this PR. TODO.md docs/methodology/REGISTRY.md
    Concrete fix: None required in this PR.

Security

  • No findings.

Documentation/Tests

  • No findings.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 24, 2026
@igerber igerber merged commit 9b71f67 into main Apr 24, 2026
23 of 24 checks passed
@igerber igerber deleted the had-phase-4.5-survey-continuous branch April 24, 2026 17:46
igerber added a commit that referenced this pull request Apr 24, 2026
Rebased onto current main (17 commits clean — PR #355, #358, #359 all
merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced.
The earlier §4.10 RCS wording paired TripleDifference /
StaggeredTripleDifference together in the Explicit RCS support list,
but REGISTRY.md §StaggeredTripleDifference requires a balanced panel
and staggered_triple_diff.py:93-109 has no panel=False mode — fit()
rejects unbalanced/duplicate (unit, time) structure at
staggered_triple_diff.py:846-864.
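An illustrative sketch of that gate, assuming a long-format pandas DataFrame with hypothetical `unit` / `time` column names (the real checks live in staggered_triple_diff.py):

```python
import pandas as pd

def assert_balanced_panel(df: pd.DataFrame, unit: str = "unit", time: str = "time") -> None:
    # Reject duplicate (unit, time) rows: not a well-formed panel.
    if df.duplicated([unit, time]).any():
        raise ValueError("duplicate (unit, time) rows")
    # Reject unbalanced panels: every unit must cover the same period set.
    period_sets = df.groupby(unit)[time].apply(frozenset)
    if period_sets.nunique() != 1:
        raise ValueError("unbalanced panel: units observed in different period sets")
```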

- §4.10 Explicit RCS support: TripleDifference (two-period) only;
  StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added
  with a concrete "no panel=False mode" + "use TripleDifference for
  cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to
  the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the
  Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
R7 P0 (sup-t under-scaling on weights= shortcut): the per-unit IF
returned by _fit_continuous / _fit_mass_point_2sls is HC1-scaled per
the PR #359 convention — compute_survey_if_variance(psi,
trivial_resolved) ≈ V_HC1. Routing the weights= shortcut through the
unit-level ``resolved_survey=None`` branch of
_sup_t_multiplier_bootstrap normalized against raw sum(psi²) =
((n-1)/n) · V_HC1, producing silently too-narrow simultaneous bands.

Fix: when the weighted event-study + cband=True path runs, always
route the sup-t bootstrap through a ResolvedSurveyDesign. On the
weights= shortcut (no user-supplied survey), construct a synthetic
trivial resolved (pweight, no strata/psu/fpc, lonely_psu='remove')
so the centered + sqrt(n/(n-1))-corrected survey-aware branch fires.
The no-strata/no-PSU path inside that branch falls through to the
"single implicit stratum — demean across all PSUs, scale by
sqrt(n_psu/(n_psu-1))" block, which gives Var_xi(xi @ Psi_psu) ≈
V_HC1 as the IF scale convention requires.

Net effect: weights= shortcut and survey=SurveyDesign(weights=...)
now target the SAME variance family in the bootstrap (~atol=0.05
between the two quantiles at matching seeds, bounded by the
bc_fit.se_robust vs Binder-TSL per-horizon SE convergence
tolerance from PR #359). Previously the shortcut was under-scaled
by sqrt(n/(n-1)) relative to the analytical HC1 target.
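A minimal numeric sketch of the scaling argument, assuming a mean-zero,
HC1-scaled per-unit IF and iid mean-zero, unit-variance multipliers
(illustrative, not the repo's bootstrap helper):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50
psi = rng.normal(size=n)
psi -= psi.mean()                      # per-unit IFs are mean-zero by construction

V_hc1 = n / (n - 1) * np.sum(psi**2)   # the IF scale convention's analytical target

# Naive multiplier bootstrap: for iid mean-zero unit-variance multipliers xi,
# Var_xi(xi @ psi) = sum(psi**2) exactly -> under-scaled by (n-1)/n.
var_naive = np.sum(psi**2)

# Survey-aware branch: demean across PSUs (a no-op here since psi is already
# centered), then scale by sqrt(n/(n-1)), which restores the HC1 target.
psi_c = (psi - psi.mean()) * np.sqrt(n / (n - 1))
var_fixed = np.sum(psi_c**2)

print(var_naive / V_hc1)   # (n-1)/n = 0.98 -> too-narrow bands
print(var_fixed / V_hc1)   # 1.0 -> matches V_HC1
```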

Regression tests (+2):
- test_weights_shortcut_mass_point_h1_cband_matches_normal: helper-
  level H=1 lock with mass-point HC1-scaled IF + synthetic trivial
  resolved. q → Phi^-1(0.975) ≈ 1.96 at atol=0.15 (MC noise at
  B=5000). The pre-fix under-scaling would have produced q ≈ 1.94
  (systematic drift outside MC noise).
- test_weights_shortcut_cband_matches_trivial_survey: weights=
  shortcut and survey=SurveyDesign(weights='w') event-study cband
  quantiles agree within atol=0.05 on the same DGP / seed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HanomicsIMF pushed a commit to HanomicsIMF/diff-diff that referenced this pull request Apr 27, 2026
…t-study survey composition + sup-t bootstrap

Closes the two Phase 4.5 A NotImplementedError gates on design='mass_point' +
weights/survey and aggregate='event_study' + weights/survey.

Weighted 2SLS in _fit_mass_point_2sls follows the Wooldridge 2010 Ch. 12
pweight convention (w² in HC1 meat, w·u in CR1 cluster score, weighted
bread Z'WX). HC1 and CR1 match estimatr::iv_robust bit-exactly at
atol=1e-10 (new cross-language golden). Per-unit IF on β̂-scale scales
so compute_survey_if_variance(psi, trivial) ≈ V_HC1 at atol=1e-10 (PR igerber#359
convention applied uniformly).
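A weighted-OLS analogue of that pweight convention (a sketch of the same
sandwich structure; the actual code applies it inside the 2SLS path, and
all names here are illustrative):

```python
import numpy as np

def weighted_hc1(X, y, w):
    """Weighted-OLS HC1 sandwich under the pweight convention (sketch)."""
    XtWX_inv = np.linalg.inv(X.T @ (X * w[:, None]))   # weighted bread (X'WX)^-1
    beta = XtWX_inv @ (X.T @ (w * y))
    u = y - X @ beta
    S = X * (w * u)[:, None]                           # rows: w_i * u_i * x_i'
    meat = S.T @ S                                     # sum_i w_i^2 u_i^2 x_i x_i'
    n, k = X.shape
    V = (n / (n - k)) * XtWX_inv @ meat @ XtWX_inv     # HC1 small-sample factor
    return beta, V
```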

Event-study path threads weights + survey through the per-horizon loop,
composing Binder-TSL variance per horizon and populating survey_metadata
+ variance_formula + effective_dose_mean (previously hardcoded None).

New _sup_t_multiplier_bootstrap helper reuses
generate_survey_multiplier_weights_batch / generate_bootstrap_weights_batch
from diff_diff.bootstrap_utils — no custom Rademacher draws, no (1/n)
prefactor. At H=1 reduces to Φ⁻¹(1-α/2) ≈ 1.96 (reduction-locked).

New __init__ kwargs: n_bootstrap=999, seed=None. New fit() kwarg:
cband=True. HeterogeneousAdoptionDiDEventStudyResults gains
survey_metadata + variance_formula + effective_dose_mean + cband_*
fields, surfaced through to_dict / to_dataframe / summary / __repr__.

Unweighted event-study output (att/se) bit-exactly preserved; cband
disabled on the unweighted path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>