HAD Phase 4.5: survey support on continuous-dose paths #359
…local_linear

Extends `_nprobust_port.lprobust` with a `weights=` parameter that multiplies into the kernel weights, propagating through the design matrices, the Q.q bias-correction, the variance matrices, and the HC2/HC3 hat-matrix hii term. When `weights=None` (or `np.ones(N)`) the function is bit-identical to the unweighted path (regression-tested at atol=1e-14).

Removes the NotImplementedError at `bias_corrected_local_linear` and forwards weights to the internal lprobust call.

Bandwidth selection (auto mode) remains unweighted — the MSE-optimal DPI is not yet weight-aware; users who want weight-aware bandwidths supply `h`/`b` explicitly. Documented as a known gap for the REGISTRY subsection.

Parity ceiling acknowledged: no public weighted-CCF reference exists (nprobust has no weight argument; rdrobust is RD-shaped). The uniform-weights bit-parity test is the only bit-parity anchor; validation under informative weights relies on the MC oracle + np::npreg partial parity, both deferred to subsequent commits in this PR.

Adds TestWeightedLprobust (11 tests) and TestWeightedBiasCorrectedLocalLinear (7 tests): uniform-weights bit-parity at atol=1e-14 (classical + robust variance, clustered + unclustered), weight-validator contracts (negative, NaN, Inf, zero-sum, length-mismatch), zero-weight observations filtered from the active window, and an informative-weights mechanism-has-teeth sanity check. All 91 existing Phase 1c parity tests pass unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
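The kernel-weight composition is the load-bearing idea above. A minimal sketch, assuming an Epanechnikov kernel and hypothetical names — illustrative only, not the port's actual internals:

```python
import numpy as np

def weighted_lprobust_core(d, y, c, h, w=None):
    """Sketch: user weights multiply pointwise into the kernel weights,
    and the combined weight flows into the HC2/HC3 leverage term hii."""
    if w is None:
        w = np.ones_like(d)                    # unweighted path = w=1 special case
    u = (d - c) / h
    k = np.where(np.abs(u) < 1, 0.75 * (1 - u**2), 0.0)  # Epanechnikov kernel
    wk = k * w                                 # combined weight: kernel * user weight
    act = wk > 0                               # zero-weight obs leave the window
    X = np.column_stack([np.ones(act.sum()), d[act] - c])
    XtWX_inv = np.linalg.inv((X.T * wk[act]) @ X)
    beta = XtWX_inv @ (X.T * wk[act]) @ y[act]
    hii = np.einsum("ij,jk,ik->i", X, XtWX_inv, X) * wk[act]  # weighted leverage
    return beta, hii
```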
Accepts `weights=<array>` and `survey=SurveyDesign(weights=...)` on
the `continuous_at_zero` and `continuous_near_d_lower` designs of
`HeterogeneousAdoptionDiD.fit()`. Routes the per-row weights through
a new `_aggregate_unit_weights` helper that validates constant-within-
unit (`feedback_no_silent_failures`) and produces a G-length per-unit
weight array, then threads it through the weighted `lprobust` in the
final continuous fit. Weighted population moments use `np.average`,
so the unweighted path is bit-identical when `weights=None`.
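A sketch of the constant-within-unit rollup the helper enforces (a hypothetical standalone version; the real `_aggregate_unit_weights` lives in had.py):

```python
import numpy as np

def aggregate_unit_weights(unit_ids, row_weights):
    """Collapse row-level weights to one weight per unit, refusing to
    guess when a unit's rows disagree (feedback_no_silent_failures)."""
    units = np.unique(unit_ids)
    out = np.empty(len(units))
    for j, u in enumerate(units):
        w_u = row_weights[unit_ids == u]
        if not np.all(w_u == w_u[0]):
            raise ValueError(f"weights vary within unit {u}; one weight per unit required")
        out[j] = w_u[0]
    return units, out
```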
API contract (usage sketch after the result-object notes):
- `weights=w` and `survey=SurveyDesign(weights=w)` are interchangeable
(same per-unit aggregate, same result) — one knob per the merge-
redundant-knobs principle, with the shortcut kept for ergonomics.
- `survey=` + `weights=` → ValueError (mutex).
- weights varying within unit → ValueError (prescribed rollup rule).
- negative / NaN / zero-sum weights → ValueError.
- `survey=<non-SurveyDesign-like>` → TypeError.
- mass_point + survey/weights → NotImplementedError (Phase 4.5 B).
- event_study + survey/weights → NotImplementedError (Phase 4.5 B).
Result object:
- `HeterogeneousAdoptionDiDResults.survey_metadata` now carries
`{method, source, n_units_weighted, weight_sum, effective_sample_size}`
under weighted fits (was always `None` before — closed the existing
`to_dict()` gap at had.py:397-419).
- `to_dict()` emits `survey_metadata` key (`None` on unweighted).
- `summary()` renders a "Survey method" / "Effective sample size"
block when present.
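A hedged usage sketch of the contract above — the `fit()` call shape, the `SurveyDesign` constructor signature, and the column name `"w"` are assumptions read off this description, not verified API:

```python
from diff_diff import HeterogeneousAdoptionDiD, SurveyDesign  # import path assumed

def weighted_fit_demo(panel, unit_w):
    had = HeterogeneousAdoptionDiD()
    res_w = had.fit(panel, weights=unit_w)                    # pweight shortcut
    res_s = had.fit(panel, survey=SurveyDesign(weights="w"))  # same per-unit aggregate
    try:
        had.fit(panel, weights=unit_w, survey=SurveyDesign(weights="w"))
    except ValueError:
        pass  # survey= + weights= is mutually exclusive by contract
    print(res_w.survey_metadata)               # {method, source, n_units_weighted, ...}
    print(res_w.to_dict()["survey_metadata"])  # serialized via to_dict()
    return res_w, res_s
```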
PSU/strata/FPC composition via `compute_survey_if_variance` lands in
the subsequent commit; this commit ships the pweight path with the
weighted-robust SE from Phase 1c, not full Binder TSL.
Tests: 16 new TestHADSurvey tests covering uniform-weights bit-parity
(both continuous designs), survey=weights= equivalence, validator
contract, deferred-path rejection, survey_metadata population, and
to_dict/summary surfacing. 213 existing HAD tests pass unchanged
(scaffolding test_weights_raises removed; test_survey_raises updated
to test_survey_bad_type_raises expecting TypeError on non-SurveyDesign
input).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires full ``survey=SurveyDesign(weights, strata, psu, fpc)`` support
on HAD's continuous-dose paths. When a SurveyDesign is supplied, the
standard error routes through Binder (1983) Taylor-series linearization:
per-unit influence function of the local-linear intercept aggregated
by PSU within strata, with FPC correction (``compute_survey_if_variance``
at ``survey.py:1802``). The estimator consumes the existing helper —
no duplicate implementation per ``feedback_psu_telescoping_tsl_vs_replicate``.
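A simplified stand-in for that composition — assuming with-replacement PSU sampling within strata, a scalar IF, and FPC expressed as a per-row sampling fraction; the shared helper at ``survey.py:1802`` is the authoritative implementation:

```python
import numpy as np

def binder_tsl_variance(psi, strata, psu, fpc=None):
    """Taylor-linearization sketch: PSU totals of the influence function,
    demeaned within stratum, scaled by n_h/(n_h - 1), with (1 - f_h) FPC."""
    var = 0.0
    for h in np.unique(strata):
        in_h = strata == h
        psus = np.unique(psu[in_h])
        n_h = len(psus)
        z = np.array([psi[in_h & (psu == p)].sum() for p in psus])  # PSU totals
        f_h = 0.0 if fpc is None else float(fpc[in_h][0])  # stratum sampling fraction
        if n_h > 1:
            var += (1 - f_h) * n_h / (n_h - 1) * np.sum((z - z.mean()) ** 2)
    return var
```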
Core additions:
- ``LprobustResult.influence_function`` (Optional[ndarray]): per-obs IF
of the classical intercept, populated only when ``return_influence=True``.
Aligned to original caller ordering (inverse-permuted when vce='nn'
sorts). Observations outside the active kernel window have IF=0.
Bit-parity: ``weights=np.ones`` IF ≡ unweighted IF at atol=1e-14.
- ``BiasCorrectedFit.influence_function``: same IF forwarded from the
internal lprobust call.
- ``_aggregate_unit_resolved_survey``: collapses a row-level
``ResolvedSurveyDesign`` to unit-level by requiring constant-within-
unit for weights/strata/psu/fpc (same invariant as cluster columns);
raises ``ValueError`` on inconsistency (``feedback_no_silent_failures``).
Replicate-weight designs raise ``NotImplementedError`` (deferred to
Phase 4.5 C, where Stute-family pretests ship the Rao-Wu bootstrap).
- ``_fit_continuous`` (HAD) accepts ``resolved_survey_unit``. When set,
uses ``bc_fit.influence_function`` + ``compute_survey_if_variance``
for the SE; otherwise falls back to ``bc_fit.se_robust / |den|``
(pweight-only convention for ``weights=<array>``). Paper Eq 8 treats
the denominator as fixed to leading order (the sqrt(G) rate vs the
G^{2/5} rate of the nonparametric estimator).
API contract:
- ``weights=<array>``: lightweight pweight shortcut; weighted-robust SE.
- ``survey=SurveyDesign(weights='col')``: SurveyDesign path; Binder-TSL
SE. Under no strata/PSU, reduces to (n/(n-1))-adjusted sum-of-IF² —
asymptotically equivalent but numerically distinct from weights=.
- ATT is identical on both paths (same weighted lprobust); only the SE
diverges (intended).
- ``HeterogeneousAdoptionDiDResults.survey_metadata`` now carries
``method`` ('pweight' vs 'survey_binder_tsl'), ``variance_formula``,
``n_strata``, ``n_psu`` diagnostics so the composition is
introspectable without inspecting the estimator.
Tests: 5 new TestHADSurvey tests (21 total in that class) covering
SurveyDesign-with-strata SE divergence, PSU clustering SE effect,
strata-varies-within-unit ValueError, replicate-weights NotImplementedError
deferral, and survey_metadata recording the Binder-TSL method. All 325
existing tests (across test_had, test_bias_corrected_lprobust,
test_nprobust_port) pass unchanged.
Parity ceiling unchanged: Binder TSL composition is validated via the
existing dCDH/EfficientDiD usage of ``compute_survey_if_variance``, not
against a weighted-CCF public reference (none exists). Methodology
confidence for the bias-corrected CI under weights comes from the MC
oracle tests and np::npreg partial parity (subsequent commits).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ships a bit-parity regression lock against R's manually-implemented weighted OLS with a one-sided Epanechnikov kernel. The reference matches ``diff_diff/local_linear.py::local_linear_fit`` exactly in formula, so parity holds at atol=1e-12 across BLAS reduction ordering — 4 DGPs (uniform / informative / bounded-heterogeneous / shifted-boundary), each validating intercept, slope, and active-window count.

Scope note: NOT third-party validation. No public weighted-CCF bias-corrected local-linear reference exists in any language — nprobust::lprobust has no weight argument, and np::npreg's local-linear algorithm diverges from a straightforward weighted OLS at the intercept (preliminary exploration showed a ~4-8% discrepancy vs a manual weighted lm). The fixture is a cross-language regression lock on the weighted kernel + weighted-OLS formula, matching the R↔Python parity convention used elsewhere in the benchmark suite. Methodology confidence under informative weights comes from the uniform-weights bit-parity (already shipped) + Monte Carlo oracle consistency (next commit).

Files:
- ``benchmarks/R/generate_np_npreg_weighted_golden.R``: new R generator with 4 deterministic DGPs. Uses ``jsonlite`` (already in ``requirements.R``), no new CRAN dependencies.
- ``benchmarks/data/np_npreg_weighted_golden.json``: tracked golden (~600 KB, consistent with other benchmark fixtures in this repo).
- ``tests/test_np_npreg_weighted_parity.py``: 13 parametrized tests (intercept / slope / n_active × 4 DGPs + 1 uniform-weights cross-check). Skip-guarded via ``pytest.skip`` per ``feedback_golden_file_pytest_skip``.

The filename ``np_npreg_weighted_golden.*`` is retained from the planning phase (the plan named the fixture after ``np::npreg``) even though the final R implementation is a manual weighted OLS, so the generator name stays discoverable in the benchmark directory listing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
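The formula the fixture locks, rendered in Python (a sketch; the one-sided window and the 3/4 kernel constant are read off the commit text — ``diff_diff/local_linear.py::local_linear_fit`` is the real reference):

```python
import numpy as np

def one_sided_epanechnikov_wls(d, y, w, h, d_lower=0.0):
    """Weighted OLS of y on (1, d - d_lower) with a one-sided
    Epanechnikov kernel over [d_lower, d_lower + h)."""
    u = (d - d_lower) / h
    k = np.where((u >= 0) & (u < 1), 0.75 * (1 - u**2), 0.0)  # one-sided kernel
    wk = k * w
    act = wk > 0
    X = np.column_stack([np.ones(act.sum()), (d - d_lower)[act]])
    XtW = X.T * wk[act]
    beta = np.linalg.solve(XtW @ X, XtW @ y[act])
    return beta[0], beta[1], int(act.sum())  # intercept, slope, n_active
```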
… HAD
Adds slow-gated MC tests that validate the weighted-HAD methodology
under known-tau DGPs — the only real methodology anchor available
given the parity ceiling (no public weighted-CCF reference exists).
Four tests covering:
1. Uniform weights recover β under SRS (regression baseline).
2. Informative weights + inverse-probability pweights recover β under
informative sampling (the actual weight-mechanism claim).
3. 95% CI coverage ≥ 60% at n_reps=25 (loose bar that stays above
chance while tolerating CCT-2014's known undercoverage at small G
+ reduced MC reps).
4. Unweighted-under-informative-sampling bias vs weighted-under-same:
the bias reduction is the test's teeth — if weighting failed to
correct the bias, this would flag it.
DGP: linear m(d) = β·D (WAS = β exactly, no quadratic confound) with
G=500, n_reps scaled via ``ci_params.bootstrap(200, min_n=25)``. Slow-
gated per repo convention; all four run in ~2.3s in pure-Python mode
so they're cheap enough to execute with -m slow on every CI run.
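A sketch of the informative-sampling DGP shape described above (parameter values are illustrative placeholders, not the tests' actual settings):

```python
import numpy as np

def informative_sampling_dgp(G=500, beta=1.0, seed=0):
    """Linear m(d) = beta*d, so the weighted average slope is beta exactly;
    inclusion probability depends on d, and pweights = 1/p undo the tilt."""
    rng = np.random.default_rng(seed)
    d = rng.uniform(0.0, 2.0, size=G)
    dy = beta * d + rng.normal(scale=0.5, size=G)   # unit-level outcome change
    p = 0.2 + 0.6 * (d / 2.0)                       # sampling tilted toward high doses
    sampled = rng.random(G) < p
    return d[sampled], dy[sampled], 1.0 / p[sampled]  # inverse-probability pweights
```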
Per ``feedback_run_own_artifacts_before_done``: all four tests executed
green locally before commit, not just skip-guarded.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
REGISTRY: adds a "Weighted extension (Phase 4.5 survey support)" subsection under the HAD Phase 1c anchor. Covers:
- Weighted kernel-weighted least squares derivation + kernel-composition formula W_combined = k((D - d̲)/h) · w_g and β-scale rescaling under weighted population moments.
- SurveyDesign composition via Binder (1983) TSL through the shared `compute_survey_if_variance` helper (no duplicate implementation per `feedback_psu_telescoping_tsl_vs_replicate`).
- Explicit "parity gap" Note flagging that no public weighted-CCF bias-corrected local-linear reference exists — the 1e-12 R bit-parity bar is unreachable on the weighted bias-corrected CI.
- Five `**Note:**` subsections documenting the methodology-confidence stack: uniform-weights bit-parity (1e-14), cross-language weighted-OLS parity (1e-12), MC oracle consistency, the unweighted auto-bandwidth gap, and the replicate-weight + mass-point + pretest deferrals.
- Explicit dispatch-matrix note: survey/weights supported on continuous_at_zero + continuous_near_d_lower only; mass_point, event_study, and HAD pretests deferred per phase sequencing.

CHANGELOG: adds an Unreleased "Added" entry covering the public API surface (`survey=` / `weights=` on HAD.fit), the result-object metadata (`survey_metadata`), the parity ceiling, the methodology-confidence stack, and the Phase 4.5 B / C deferrals.

TODO: removes the two rows marking survey integration + weights support as deferred (shipped); adds five new rows tracking the remaining Phase 4.5 work (B: mass-point + event-study survey; C0: QUG-under-survey decision; C: pretests under survey; weight-aware auto-bandwidth; replicate-weight continuous-path support). Each row references a specific file and phase so future PRs have a clear anchor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AI review — Overall Assessment: ⛔ Blocker
P0 — bias-corrected survey IF: the survey variance path was building
the influence function from classical residuals (``res_h`` + ``R_p·W_h``,
aligned with ``V_Y_cl``) while the returned ATT uses the bias-corrected
``tau_bc``. Under compute_survey_if_variance this silently
under-estimated the survey SE by ignoring the CCT-2014 bias-correction
variance inflation. Fixed in ``_nprobust_port.lprobust``: IF now uses
``Q_q`` + ``res_b`` so ``sum(IF^2) == V_Y_bc[0,0]`` (verified in new
white-box test), matching the estimator scale of the ATT. Uniform-
weights bit-parity (weights=np.ones ≡ unweighted at 1e-14) preserved
across the new IF formula. The ``TestWeightedLprobust`` HC1 bit-parity
test still passes because under weights=ones the classical vs bias-
corrected IF only differ by the Q.q bias-correction term, which is
deterministic and cancels in the diff.
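The invariant the fix enforces, as a white-box sketch (``influence_function`` and ``V_Y_bc`` are the names used above; ``fit`` stands in for the lprobust return struct):

```python
import numpy as np

def check_if_scale(fit, atol=1e-10):
    """The per-obs IF must reproduce the bias-corrected variance, not
    the classical one, since the returned ATT is tau_bc."""
    psi = fit.influence_function
    assert np.isclose(np.sum(psi ** 2), fit.V_Y_bc[0, 0], atol=atol)
```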
P1a — df_survey threading: survey fits previously used Normal-theory
inference regardless of PSU count. ``resolved_survey_unit.df_survey``
(n_psu − n_strata, or replicate QR rank − 1) now routes through
``safe_inference(..., df=...)`` on the survey path, producing t-based
p-values and CIs. Also surfaced in ``survey_metadata["df_survey"]`` for
introspection. Under small-PSU designs the t-critical exceeds the
Normal z-critical, widening the CI vs the prior (wrong) output. The
``weights=`` shortcut continues to use Normal inference since there's
no PSU structure to produce a finite df.
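The inference split, sketched (this is the df rule as described, not the repo's ``safe_inference`` code):

```python
from scipy import stats

def survey_ci(att, se, df_survey=None, alpha=0.05):
    """t-based CI when a finite survey df (n_psu - n_strata) exists;
    Normal otherwise (the weights= shortcut has no PSU structure)."""
    if df_survey is not None:
        crit = stats.t.ppf(1 - alpha / 2, df_survey)  # exceeds z at small PSU counts
    else:
        crit = stats.norm.ppf(1 - alpha / 2)
    return att - crit * se, att + crit * se
```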
P1b — reject non-pweight SurveyDesigns: HAD's weighted kernel composition
``W_combined = k((D-d̲)/h) · w`` implements inverse-probability weighting
semantics. ``SurveyDesign(weight_type="aweight"|"fweight")`` is now
rejected with ``NotImplementedError`` at fit-time — aweight (analytic)
implies a different inferential target (weighted regression, not
design-based inference), and fweight (frequency) implies observation
replication. Neither has been derived for HAD's continuous-dose path;
deferred as a follow-up.
P2 — test coverage: six new ``TestHADSurvey`` tests locking in the
three fixes above:
- ``test_survey_if_uses_bias_corrected_scale``: white-box
``sum(IF^2) == V_Y_bc[0,0]`` under nonlinear DGP where
V_Y_bc ≠ V_Y_cl (teeth).
- ``test_survey_df_widens_ci_vs_normal`` +
``test_survey_df_threaded_into_inference_via_t_distribution``:
assert df_survey surfaces in metadata and produces t-CI half-width
matching ``t_crit(df) · se`` exactly.
- ``test_survey_aweight_raises_not_implemented`` +
``test_survey_fweight_raises_not_implemented``: front-door rejection.
- ``test_survey_no_psu_no_strata_se_matches_weights_hc1``: SRS
equivalence — survey SE within 10-15% of weights-shortcut SE (the
(n/(n-1))-style HC1 correction), ruling out a silent
V_Y_cl-vs-V_Y_bc mismatch.
P3 — docstring refresh: the stale fit() docstring calling survey/weights
"Reserved for a follow-up PR" is replaced with the actual Phase 4.5 A
contract (pweight-only, constant-within-unit, replicate-weight deferred,
mass-point/event-study/pretests deferred). ``survey_metadata`` docstring
on ``HeterogeneousAdoptionDiDResults`` now enumerates the actual dict
keys and their semantics.
All 348 tests (across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite) pass after the cascade. Ruff clean on modified files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Overall Assessment: Static review only; I did not run the test suite because the sandbox Python environment is missing …
Code Quality: no findings. Performance: no findings. Security: no findings.
P1 — survey_metadata contract drift: the previous HAD survey fits
returned ``survey_metadata`` as a plain dict, but the repo-standard is
the :class:`diff_diff.survey.SurveyMetadata` dataclass read via
attribute access by BusinessReport, DiagnosticReport, and the shared
``results.py`` plumbing. Passing a HAD survey result through those
consumers would silently drop ``df_survey`` / ``effective_n`` /
``n_strata`` / ``n_psu`` under the dict shape. Fixed by routing both
entry paths (``survey=SurveyDesign(...)`` and ``weights=<array>``)
through :func:`compute_survey_metadata`; ``weights=`` constructs a
minimal ``ResolvedSurveyDesign`` (weights-only, no strata/PSU/FPC)
uniformly. HAD-specific extras (``variance_formula`` label, the
pweight vs Binder-TSL method tag) are moved to a new top-level field
``HeterogeneousAdoptionDiDResults.variance_formula`` (Optional[str])
rather than polluting the shared SurveyMetadata schema. New regression
test ``test_survey_metadata_is_surveymetadata_instance`` exercises
attribute access on both entry paths.
P2 — weighted denominator contract: the continuous paths use weighted
population moments internally but the public result still stored only
the raw ``dose_mean`` — a user reconstructing β-scale by hand would
have been off by the (weighted − raw) gap. Added
``HeterogeneousAdoptionDiDResults.effective_dose_mean`` (Optional[float])
that exposes the actual weighted denominator used by the beta-scale
rescaling: ``sum(w·D)/sum(w)`` for ``continuous_at_zero``,
``sum(w·(D−d_lower))/sum(w)`` for ``continuous_near_d_lower``. ``None``
on unweighted fits (use ``dose_mean`` there; the two coincide under
uniform weights and the duplicate field would be noise). Raw
``dose_mean`` preserved for backward compat; docstring clarifies the
distinction. Four new regression tests covering both continuous
designs + the uniform-weights coincidence + the unweighted
``None``-contract.
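The two denominators as worked one-liners (array values illustrative):

```python
import numpy as np

d = np.array([0.0, 0.5, 1.0, 2.0])   # per-unit doses
w = np.array([1.0, 2.0, 1.0, 4.0])   # per-unit weights
d_lower = 0.5

eff_at_zero = np.average(d, weights=w)            # sum(w*d)/sum(w)
eff_near_dl = np.average(d - d_lower, weights=w)  # sum(w*(d - d_lower))/sum(w)
# Under uniform weights both coincide with the raw dose_mean.
```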
P3 — stale docstrings (``_nprobust_port.py``, ``local_linear.py``,
``had.py``): three docstring blocks still described the pre-Phase-4.5
surface ("weights= not supported" / "IF of the classical estimate"
/ "CCT-2014 robust SE divided by |den|"). Updated to describe the
shipped behavior: lprobust supports weights + return_influence on the
BIAS-CORRECTED scale, bias_corrected_local_linear's
``influence_function`` field aligns with ``V_Y_bc``, and HAD's ``se``
documentation now enumerates the three SE paths (unweighted, pweight
via weighted-robust, survey via Binder-TSL).
All 353 tests pass after refactor (across test_had, test_nprobust_port,
test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the
slow MC suite). Ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. Static review only; I could not run the test suite in this sandbox because Python here is missing …
Methodology: no findings. The changed estimator path matches the Methodology Registry as written: kernel weights are composed pointwise with user weights, survey variance is built from the bias-corrected IF, non-…
Performance: no findings. Maintainability: no findings. Security: no findings.
P2a — weights=<array> shortcut clears survey-only SurveyMetadata fields: the shortcut constructs a minimal ResolvedSurveyDesign (n_psu=G, n_strata=1) to reuse compute_survey_metadata's code path, which produces df_survey = G − 1. But the shortcut's own inference stays Normal (safe_inference(..., df=None)), so reporting code that rendered "Survey df: G−1" was misdescribing the fit as a finite-df survey result. Fixed by explicitly clearing ``df_survey`` / ``n_psu`` / ``n_strata`` on the ``SurveyMetadata`` returned for the weights= path; ``weight_type`` / ``effective_n`` / ``design_effect`` / ``sum_weights`` / ``weight_range`` stay populated since they describe the weighted sample regardless of inference family. The ``survey=SurveyDesign(...)`` path is unchanged: all survey-only fields populated and threaded into the t-CI. Two new regression tests lock each side of the contract.

P2b — to_dict() + summary() surface the new weighted-result fields: ``effective_dose_mean`` and ``variance_formula`` were stored on the result object but not serialized or rendered, so downstream machine consumers of ``to_dict()`` couldn't recover the weighted denominator or the SE-family label (the reporting-surface mirror of the P2 finding). Added both keys to ``to_dict()`` and rendered the weighted denominator explicitly in ``summary()`` when present. Three new tests: one for to_dict under weights, one for the None-defaults under unweighted, and one asserting that ``summary()`` renders the weighted D̄ label.

All 358 tests pass across test_had, test_nprobust_port, test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the slow MC suite. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. Static review only; I could not run the test suite in this sandbox because Python here is missing …
Methodology: no findings. The weighted continuous-HAD implementation matches the Methodology Registry as written: weights enter the local-linear fit via kernel-weight composition, survey inference uses Binder-style Taylor linearization on the bias-corrected IF, and unsupported branches are rejected or documented instead of silently approximated (…
Performance: no findings. Maintainability: no findings. Security: no findings.
P2 — survey_metadata uses raw pre-normalization weights. On the ``survey=SurveyDesign(weights="col")`` path, the prior code passed ``resolved_survey_unit.weights`` (post-normalization, mean=1) into ``compute_survey_metadata``, producing ``sum_weights ≈ G`` and a normalized ``weight_range`` that drifted from both the ``weights=`` shortcut's values and the helper's documented contract (tests in ``tests/test_survey.py`` assert raw-weight semantics for ``weight_range``). Fixed by capturing the raw weight column from ``data[survey.weights]`` BEFORE ``SurveyDesign.resolve()`` rescales pweights/aweights to mean=1, aggregating those raw values to unit-level via the existing ``_aggregate_unit_weights`` helper, and passing that raw per-unit array as the ``raw_weights`` argument. Both entry paths (``weights=`` and ``survey=``) now report bit-identical ``sum_weights`` / ``weight_range`` / ``design_effect`` / ``effective_n`` for the same unit-level weights — verified by a new regression test ``test_survey_metadata_raw_weights_match_shortcut``.

P3 — ``__repr__`` surfaces weighted-path metadata. The CHANGELOG claimed ``to_dict`` / ``summary`` / ``__repr__`` render the new weighted surface consistently, but ``__repr__`` was unchanged. Extended the compact one-line form to include ``variance_formula`` and ``effective_dose_mean`` when present; unweighted fits keep the original terse repr. Regression test ``test_repr_surfaces_weighted_fields_when_present`` locks both cases.

All 360 tests pass (across the six HAD-related test modules). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ⛔ Blocker. Static review only; I could not run the targeted tests in this sandbox because Python here is missing …
Methodology: aside from that blocker, the weighted kernel-composition math, Binder-TSL variance route, and the registry-noted deviations/deferrals are aligned with …
Code Quality: no findings. Performance: no findings. Maintainability: no findings. Tech Debt: no findings — the remaining weighted-scope gaps are explicitly tracked and therefore mitigated for this PR. Security: no findings.
P0 — zero-weight units must not drive HAD design resolution. Previously ``fit()`` aggregated all units first, then ran ``_detect_design``, ``d_lower`` resolution, the 2% mass-point threshold, and treated/control counts on the full ``d_arr`` — so a zero-weight unit at ``d.min() = 0`` (the standard SurveyDesign.subpopulation encoding) could flip a weighted sample from ``continuous_near_d_lower`` to ``continuous_at_zero``, shift ``d_lower``, misstate cohort counts, or trigger the wrong ``NotImplementedError``. The weighted kernel already dropped those units at fit time via lprobust's ``w > 0`` selector, so the fit NUMBERS were correct; the DESIGN DECISION was contaminated.

Fixed by filtering ``d_arr`` / ``dy_arr`` / ``weights_unit`` / ``raw_weights_unit`` / ``resolved_survey_unit`` to ``weights_unit > 0`` immediately after survey/weights resolution and before any design-resolution logic. A new ``_filter_resolved_survey`` helper rebuilds the ResolvedSurveyDesign on the positive-weight subset (recomputing ``n_strata`` / ``n_psu`` for compute_survey_if_variance). A UserWarning fires when units are dropped so the behavior is introspectable.

Three new regression tests lock the fix:
- ``test_zero_weight_unit_at_d_min_does_not_flip_design``: full panel with a zero-weight unit at ``d=0`` vs a physically-dropped panel — design, d_lower, att, and se all match to 1e-10.
- ``test_zero_weight_filter_warns_user``: UserWarning emitted.
- ``test_zero_weight_counts_reflect_positive_subset``: ``n_obs`` on the result is the positive-weight unit count, not the full panel size.

P3 — CHANGELOG + ``to_dict()`` docstring accuracy. The CHANGELOG claimed ``to_dict`` / ``summary`` / ``__repr__`` render the survey metadata consistently, including ``weight_sum`` / ``n_strata`` / ``n_psu``; in practice each surface renders a different subset. Rewrote the CHANGELOG entry to enumerate exactly what each surface prints, and rewrote the ``HeterogeneousAdoptionDiDResults.to_dict()`` docstring to name the weighted-path keys and clarify that ``survey_metadata`` is a SurveyMetadata object (not a dict).

All 363 tests pass. Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
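The ordering fix in miniature — positive-weight filtering before any design detection (hypothetical standalone helper):

```python
import warnings
import numpy as np

def filter_positive_weight(d_arr, dy_arr, weights_unit):
    """Drop zero-weight units BEFORE design detection so a w=0 unit at
    d.min() cannot flip the detected design or shift d_lower."""
    keep = weights_unit > 0
    if not keep.all():
        warnings.warn(f"dropping {int((~keep).sum())} zero-weight units "
                      "before design resolution", UserWarning)
    return d_arr[keep], dy_arr[keep], weights_unit[keep]
```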
/ai-review
🔁 AI review rerun (requested by @igerber)
Overall Assessment: Static review only. I could not run the targeted tests in this sandbox because …
P1a — preserve the full survey design for Binder-TSL variance. Round 5 filtered the ResolvedSurveyDesign to the positive-weight subset for BOTH design resolution AND survey variance composition. That's wrong under the standard subpopulation / domain-estimation convention (keep the sampling frame, zero the contributions) that ``SurveyDesign.subpopulation()`` + ``compute_survey_if_variance()`` both implement elsewhere in this repo. Silent consequences: ``n_psu`` / ``n_strata`` / ``df_survey`` / FPC application / lonely-PSU behavior would all have been computed on the in-domain subset instead of the full design.

Refactored HAD ``fit()`` to preserve ``weights_unit_full`` / ``resolved_survey_unit_full`` / ``raw_weights_unit_full`` alongside the filtered copies used ONLY for design resolution (``_detect_design`` / ``d_lower`` / mass-point threshold / cohort counts). The fit itself (``_fit_continuous`` → ``bias_corrected_local_linear`` → ``lprobust``) now receives the FULL unfiltered arrays; ``bias_corrected_local_linear`` filters internally (see P1b below) and zero-pads the IF back to full ordering, so ``compute_survey_if_variance(if_full, resolved_full)`` preserves the sampling-frame structure. ``compute_survey_metadata`` and ``effective_dose_mean`` likewise consume the full arrays so ``n_psu`` / ``n_strata`` / ``sum_weights`` / ``weight_range`` reflect the sampling frame. Dropped the now-unused ``_filter_resolved_survey`` helper.

P1b — ``bias_corrected_local_linear(weights=...)`` filters internally. The public wrapper previously ran ``_validate_had_inputs`` / the mass-point threshold / the Design 1' support heuristic + auto-bandwidth MSE-DPI selector on the full ``d`` array, so zero-weight units at the boundary or at ``d.min()`` could flip the detected regime or trigger spurious off-support rejections on valid weighted domain fits. Now computes ``positive_mask = weights > 0`` upfront, validates weights (finite, non-negative, positive sum — consistent with the port's ValueErrors), filters ``d`` / ``y`` / ``weights`` / ``cluster`` to the positive-weight subset BEFORE all downstream validation + selection + fit, and zero-pads the returned IF back to the original ordering when ``return_influence=True``. ``n_total`` on the returned ``BiasCorrectedFit`` is the original length (matches the caller's input shape).

Four new regression tests:
- ``test_zero_weight_survey_metadata_preserves_full_design``: survey= path with PSU/strata + 25% zero-weight units reports full-frame n_psu (not the in-domain subset); ATT bit-parity with a physically-dropped fit.
- ``test_bias_corrected_local_linear_zero_weight_matches_filtered``: explicit-h/b wrapper with a zero-weight unit at d=0 (would have flipped Design 1') matches the dropped-sample fit at 1e-12; IF zero-padded to the original length with IF[0]=0 at the zero-weight position.
- ``test_bias_corrected_local_linear_zero_weight_auto_bandwidth``: the auto-bandwidth path (the DPI selector also runs on positive-weight support) produces the same h + same bias-corrected estimate as the dropped-sample fit.
- The existing Round 5 zero-weight tests (``n_obs`` reflects the positive-weight subset; UserWarning; design-not-flipped) pass unchanged since the design-resolution filter is preserved.

All 366 tests pass (across test_had, test_nprobust_port, test_bias_corrected_lprobust, test_np_npreg_weighted_parity, and the slow MC suite). Ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
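The domain-estimation convention in miniature — zero the out-of-domain IF contributions, never subset the frame (sketch; ``variance_fn`` stands in for the repo's Binder-TSL helper):

```python
import numpy as np

def domain_variance(psi_full, in_domain, variance_fn):
    """Keep the full sampling frame so n_psu / n_strata / df_survey stay
    design-true; out-of-domain units contribute exact zeros."""
    psi = np.where(in_domain, psi_full, 0.0)  # zero-pad, don't drop rows
    return variance_fn(psi)
```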
/ai-review
🔁 AI review rerun (requested by @igerber)
Overall Assessment: Looks good. Static diff review only; I could not run …
Rebased onto current main (17 commits clean — PRs #355, #358, #359 all merged since the last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced. The earlier §4.10 RCS wording paired TripleDifference / StaggeredTripleDifference together in the Explicit RCS support list, but REGISTRY.md §StaggeredTripleDifference requires a balanced panel and staggered_triple_diff.py:93-109 has no panel=False mode — fit() rejects unbalanced/duplicate (unit, time) structure at staggered_triple_diff.py:846-864.
- §4.10 Explicit RCS support: TripleDifference (two-period) only; StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added with a concrete "no panel=False mode" note + a "use TripleDifference for cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to the balance-sensitive gate.

Regression tests extended:
- The balanced-panel proximity check now covers StaggeredTripleDifference.
- The §4.10 section test asserts StaggeredTripleDifference appears in the Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R7 P0 (sup-t under-scaling on the weights= shortcut): the per-unit IF returned by _fit_continuous / _fit_mass_point_2sls is HC1-scaled per the PR #359 convention — compute_survey_if_variance(psi, trivial_resolved) ≈ V_HC1. Routing the weights= shortcut through the unit-level ``resolved_survey=None`` branch of _sup_t_multiplier_bootstrap normalized against the raw sum(psi²) = ((n-1)/n) · V_HC1, producing silently too-narrow simultaneous bands.

Fix: when the weighted event-study + cband=True path runs, always route the sup-t bootstrap through a ResolvedSurveyDesign. On the weights= shortcut (no user-supplied survey), construct a synthetic trivial resolved design (pweight, no strata/psu/fpc, lonely_psu='remove') so the centered + sqrt(n/(n-1))-corrected survey-aware branch fires. The no-strata/no-PSU path inside that branch falls through to the "single implicit stratum — demean across all PSUs, scale by sqrt(n_psu/(n_psu-1))" block, which gives Var_xi(xi @ Psi_psu) ≈ V_HC1 as the IF scale convention requires.

Net effect: the weights= shortcut and survey=SurveyDesign(weights=...) now target the SAME variance family in the bootstrap (~atol=0.05 between the two quantiles at matching seeds, bounded by the bc_fit.se_robust vs Binder-TSL per-horizon SE convergence tolerance from PR #359). Previously the shortcut was under-scaled by sqrt(n/(n-1)) relative to the analytical HC1 target.

Regression tests (+2):
- test_weights_shortcut_mass_point_h1_cband_matches_normal: helper-level H=1 lock with mass-point HC1-scaled IF + a synthetic trivial resolved design. q → Phi^-1(0.975) ≈ 1.96 at atol=0.15 (MC noise at B=5000). The pre-fix under-scaling would have produced q ≈ 1.94 (systematic drift outside MC noise).
- test_weights_shortcut_cband_matches_trivial_survey: weights= shortcut and survey=SurveyDesign(weights='w') event-study cband quantiles agree within atol=0.05 on the same DGP / seed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
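A minimal sup-t sketch consistent with the description — single implicit stratum, Rademacher multipliers, and the n/(n-1) factor that makes the bootstrap draws match an HC1-scaled analytical SE; at H=1 the quantile approaches Φ⁻¹(0.975) ≈ 1.96 (names and shapes are assumptions):

```python
import numpy as np

def sup_t_quantile(psi_psu, se, B=5000, alpha=0.05, seed=0):
    """psi_psu: (n_psu, H) per-PSU IF contributions; se: (H,) analytical
    SEs on the same HC1 / Binder-TSL scale."""
    rng = np.random.default_rng(seed)
    n, _ = psi_psu.shape
    psi_c = psi_psu - psi_psu.mean(axis=0)        # demean across all PSUs
    xi = rng.choice([-1.0, 1.0], size=(B, n))     # Rademacher multipliers
    draws = (xi @ psi_c) * np.sqrt(n / (n - 1))   # rescale onto the HC1 target
    return np.quantile(np.abs(draws / se).max(axis=1), 1 - alpha)
```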
…t-study survey composition + sup-t bootstrap

Closes the two Phase 4.5 A NotImplementedError gates on design='mass_point' + weights/survey and aggregate='event_study' + weights/survey.

Weighted 2SLS in _fit_mass_point_2sls follows the Wooldridge 2010 Ch. 12 pweight convention (w² in the HC1 meat, w·u in the CR1 cluster score, weighted bread Z'WX). HC1 and CR1 match estimatr::iv_robust bit-exactly at atol=1e-10 (new cross-language golden). The per-unit IF on the β̂-scale is scaled so compute_survey_if_variance(psi, trivial) ≈ V_HC1 at atol=1e-10 (the PR igerber#359 convention applied uniformly).

The event-study path threads weights + survey through the per-horizon loop, composing Binder-TSL variance per horizon and populating survey_metadata + variance_formula + effective_dose_mean (previously hardcoded None).

A new _sup_t_multiplier_bootstrap helper reuses generate_survey_multiplier_weights_batch / generate_bootstrap_weights_batch from diff_diff.bootstrap_utils — no custom Rademacher draws, no (1/n) prefactor. At H=1 it reduces to Φ⁻¹(1-α/2) ≈ 1.96 (reduction-locked).

New __init__ kwargs: n_bootstrap=999, seed=None. New fit() kwarg: cband=True. HeterogeneousAdoptionDiDEventStudyResults gains survey_metadata + variance_formula + effective_dose_mean + cband_* fields, surfaced through to_dict / to_dataframe / summary / __repr__. Unweighted event-study output (att/se) is bit-exactly preserved; cband is disabled on the unweighted path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
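A sketch of the pweight 2SLS sandwich named above (just-identified case: bread Z'WX, w²u² in the HC1 meat — illustrative, not ``_fit_mass_point_2sls`` itself):

```python
import numpy as np

def weighted_2sls_hc1(y, X, Z, w):
    """Just-identified weighted 2SLS with the pweight HC1 convention."""
    n, k = X.shape
    ZtWX = (Z.T * w) @ X                      # weighted bread
    beta = np.linalg.solve(ZtWX, (Z.T * w) @ y)
    u = y - X @ beta
    S = Z * (w * u)[:, None]                  # per-row score: w_i * u_i * z_i
    meat = S.T @ S                            # sums w^2 u^2 z z'
    bread_inv = np.linalg.inv(ZtWX)
    V = bread_inv @ meat @ bread_inv.T * n / (n - k)  # HC1 small-sample factor
    return beta, V
```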
Summary
- `survey=SurveyDesign(...)` + `weights=<array>` on `HeterogeneousAdoptionDiD.fit()` for the two continuous-dose designs (`continuous_at_zero`, `continuous_near_d_lower`); mass-point + event-study + HAD pretests remain `NotImplementedError` (deferred to Phase 4.5 B / C).
- Survey variance composed via the shared `compute_survey_if_variance` Binder-TSL helper (no duplicate implementation).
- Validation: uniform-weights bit-parity at `atol=1e-14` on the full lprobust output struct, cross-language weighted-OLS parity against a manual R reference at `atol=1e-12`, and Monte Carlo oracle consistency on known-τ DGPs.

Test plan
- `pytest tests/test_had.py tests/test_had_mc.py tests/test_nprobust_port.py tests/test_bias_corrected_lprobust.py tests/test_np_npreg_weighted_parity.py -m ''` — all pass (469 tests; confirmed locally)
- `mypy diff_diff/had.py diff_diff/local_linear.py diff_diff/_nprobust_port.py` — no new errors
- `ruff check diff_diff tests` — clean on modified files
- `Rscript benchmarks/R/generate_np_npreg_weighted_golden.R` — regenerates the fixture without changes

Scope boundaries
Supported in this PR:

- `continuous_at_zero` (paper Design 1', Eq 7) + `continuous_near_d_lower` (Design 1 near-d̲) under `weights=<array>` and `survey=SurveyDesign(weights, strata, psu, fpc)`.
- `compute_survey_if_variance` / `compute_replicate_if_variance` (pre-existing helpers).
- `HeterogeneousAdoptionDiDResults.survey_metadata` populated with method / variance-formula / weight-sum / effective-sample-size / n_strata / n_psu; closes the existing `to_dict()` gap at had.py:397-419.
- `benchmarks/R/generate_np_npreg_weighted_golden.R` + `tests/test_np_npreg_weighted_parity.py` (13 tests).

Explicitly NOT in this PR (deferred with a targeted `NotImplementedError` + phase pointer):

- `design="mass_point"` under survey/weights (Phase 4.5 B — weighted 2SLS + sandwich).
- `aggregate="event_study"` under survey/weights (Phase 4.5 B — per-horizon IPW + shared PSU multiplier bootstrap).
- HAD pretests (`qug_test`, `stute_test`, `yatchew_hr_test`, joint Stute variants, `did_had_pretest_workflow`) under weights — Phase 4.5 C (Rao-Wu rescaled bootstrap).
- Weight-aware auto-bandwidth: users supply `h`/`b` explicitly; the auto path remains unweighted.

Parity ceiling
No public weighted-CCF bias-corrected local-linear reference exists in any language — `nprobust::lprobust` has no weight argument, `np::npreg` uses a proprietary local-linear algorithm that does not reduce to weighted OLS at the intercept, Stata's `rdrobust` is RD-shaped (not HAD-shaped), and Julia's `Synthdid.jl` is SDID-only. The Phase 1c `atol=1e-12` R bit-parity bar is therefore NOT reachable on the weighted bias-corrected CI. This is documented explicitly in `docs/methodology/REGISTRY.md` §HeterogeneousAdoptionDiD "Weighted extension (Phase 4.5)".

🤖 Generated with Claude Code