diff --git a/CHANGELOG.md b/CHANGELOG.md index 3e40e919..dba852fa 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -22,7 +22,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Changed - **`did_had_pretest_workflow(aggregate="event_study")` verdict no longer emits the "paper step 2 deferred to Phase 3 follow-up" caveat** — the joint pre-trends Stute test closes that gap. The two-period `aggregate="overall"` path retains the existing caveat since the joint variant does not apply to single-pre-period panels. Downstream code that greps verdict strings for the Phase 3 caveat will see it suppressed on the event-study path. -- **SyntheticDiD bootstrap no longer supports survey designs** (capability regression). The removed fixed-weight bootstrap path was the only SDID variance method that supported strata/PSU/FPC (via Rao-Wu rescaled bootstrap); the new paper-faithful refit bootstrap rejects all survey designs (including pweight-only) with `NotImplementedError`. Pweight-only users can switch to `variance_method="placebo"` or `"jackknife"`. Strata/PSU/FPC users have no SDID variance option on this release. Composing Rao-Wu rescaled weights with Frank-Wolfe re-estimation requires a separate derivation (weighted FW solver); sketch and reusable scaffolding pointers are in `docs/methodology/REGISTRY.md` §SyntheticDiD and `TODO.md`. +- **SyntheticDiD bootstrap no longer supports survey designs** (capability regression in PR #351, **restored in PR #352** — see Added/Changed entries directly below). The removed fixed-weight bootstrap path was the only SDID variance method that supported strata/PSU/FPC (via Rao-Wu rescaled bootstrap); the PR #351 paper-faithful refit bootstrap initially rejected all survey designs (including pweight-only) with `NotImplementedError`. PR #352 restores the capability via a weighted-FW + Rao-Wu composition; the lock-out window applies only to the v3.2.x line that ships PR #351 alone (without PR #352). 
Composing Rao-Wu rescaled weights with Frank-Wolfe re-estimation: see `docs/methodology/REGISTRY.md` §SyntheticDiD `Note (survey + bootstrap composition)`. + +### Added (PR #352) +- **SDID `variance_method="bootstrap"` survey support restored** via a hybrid pairs-bootstrap + Rao-Wu rescaling composed with a weighted Frank-Wolfe kernel. Each bootstrap draw first performs the unit-level pairs-bootstrap resampling specified by Arkhangelsky et al. (2021) Algorithm 2 (`boot_idx = rng.choice(n_total)`), and *then* applies Rao-Wu rescaled per-unit weights (Rao & Wu 1988) sliced over the resampled units — NOT a standalone Rao-Wu bootstrap. New Rust kernel `sc_weight_fw_weighted` (and `_with_convergence` sibling) accepts a per-coordinate `reg_weights` argument so the FW objective becomes `min ||A·ω - b||² + ζ²·Σ_j reg_w[j]·ω[j]²`. New Python helpers `compute_sdid_unit_weights_survey` and `compute_time_weights_survey` thread per-control survey weights through the two-pass sparsify-refit dispatcher (column-scaling Y by `rw` for the loss, `reg_weights=rw` for the penalty on the unit-weights side; weighted column-centering + row-scaling Y by `sqrt(rw)` for the loss with uniform reg on the time-weights side). `_bootstrap_se` survey branch composes the per-draw `rw` (Rao-Wu rescaling for full designs, constant `w_control` for pweight-only fits) with the weighted-FW helpers, then composes `ω_eff = rw·ω/Σ(rw·ω)` for the SDID estimator. Coverage MC artifact extended with a `stratified_survey` DGP (BRFSS-style: N=40, strata=2, PSU=2/stratum); the bootstrap row's near-nominal calibration is the validation gate (target rejection ∈ [0.02, 0.10] at α=0.05). New regression tests across `test_methodology_sdid.py::TestBootstrapSE` (single-PSU short-circuit, full-design and pweight-only succeeds-tests, zero-treated-mass retry, deterministic Rao-Wu × boot_idx slice) and `test_survey_phase5.py::TestSyntheticDiDSurvey` (full-design ↔ pweight-only SE differs assertion). 
See REGISTRY.md §SyntheticDiD ``Note (survey + bootstrap composition)`` for the full objective and the argmin-set caveat. + +### Changed (PR #352) +- **SDID bootstrap SE values under survey fits now differ numerically from the v3.2.x line that shipped PR #351 alone**: the fit no longer raises `NotImplementedError`, and instead returns the weighted-FW + Rao-Wu SE. Non-survey fits are unaffected (the bootstrap dispatcher routes only the survey branch through the new `_survey` helpers; non-survey fits continue to call the existing `compute_sdid_unit_weights` / `compute_time_weights` and stay bit-identical at rel=1e-14 on the `_BASELINE["bootstrap"]` regression). SDID's `placebo` and `jackknife` paths still reject `strata/PSU/FPC` (separate methodology gap; tracked in TODO.md as a follow-up PR). ## [3.2.0] - 2026-04-19 diff --git a/TODO.md b/TODO.md index 0f15a8c6..3ff9d09b 100644 --- a/TODO.md +++ b/TODO.md @@ -104,7 +104,7 @@ Deferred items from PR reviews that were not addressed before merge. | `HeterogeneousAdoptionDiD` Phase 5: `practitioner_next_steps()` integration, tutorial notebook, and `llms.txt` updates (preserving UTF-8 fingerprint). | `diff_diff/practitioner.py`, `tutorials/`, `diff_diff/guides/` | Phase 2a | Low | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low | | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. 
RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium | -| **SDID + survey designs** (capability regression in this release; both pweight-only AND strata/PSU/FPC). The previous release's fixed-weight bootstrap accepted strata/PSU/FPC via Rao-Wu rescaled bootstrap; the new paper-faithful refit bootstrap rejects all survey designs because Rao-Wu composed with Frank-Wolfe re-estimation requires its own derivation. The follow-up needs a **weighted Frank-Wolfe** variant of `_sc_weight_fw` accepting per-unit weights in the loss and regularization (`Σ rw_i ω_i Y_i,pre` / `ζ² Σ rw_i ω_i²`), threaded through `compute_sdid_unit_weights` / `compute_time_weights`. Reusable scaffolding (`generate_rao_wu_weights`, split into `rw_control` / `rw_treated`, degenerate-retry, treated-mean weighting) is recoverable from the pre-rewrite `_bootstrap_se` body via `git show 91082e5:diff_diff/synthetic_did.py` (PR #351 "Replace SDID fixed-weight bootstrap with paper-faithful refit"). Compose-after-unweighted-FW does not work — silently reproduces the fixed-weight Rao-Wu behavior we removed. Validation: re-use the coverage MC harness with a stratified DGP, confirm near-nominal rejection rates against placebo-SE tracking. See REGISTRY.md §SyntheticDiD `Note (deferred survey + bootstrap composition)` for the sketch. | `synthetic_did.py::fit`, `synthetic_did.py::_bootstrap_se`, `utils.py::_sc_weight_fw` | follow-up | Medium | +| **SDID + placebo/jackknife + strata/PSU/FPC** (capability gap remaining after PR #352). 
PR #352 restored survey-bootstrap support via weighted Frank-Wolfe + Rao-Wu composition; the same composition for `placebo` (which permutes control indices) and `jackknife` (which leaves out one unit at a time) requires its own derivations: placebo's allocator needs a weighted permutation distribution that respects PSU clustering; jackknife needs PSU-level LOO + stratum aggregation. Both reuse the weighted-FW kernel from PR #352 (`_sc_weight_fw(reg_weights=)`); the genuinely new work is the per-method allocator. Tracked but no concrete sketch yet — defer until user demand surfaces. | `synthetic_did.py::_placebo_variance_se`, `synthetic_did.py::_jackknife_se` | follow-up | Low | | SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. 
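For orientation on the Rao-Wu side of the composition these rows refer to: the textbook Rao-Wu (1988) rescaled bootstrap, in the m_h = n_h − 1 special case with each unit its own PSU and no FPC, reduces to a per-stratum multiplicity rescaling. This sketch is not the library's `generate_rao_wu_weights` helper.

```python
import numpy as np

def rao_wu_rescaled_weights(base_w, strata, rng):
    """One Rao-Wu (1988) rescaled bootstrap draw, textbook special case:
    m_h = n_h - 1 PSUs resampled with replacement per stratum, no FPC,
    each unit its own PSU (assumes n_h >= 2 in every stratum):
        w*_hi = w_hi * (n_h / (n_h - 1)) * t_hi,
    where t_hi counts how often unit i was drawn."""
    w_star = np.zeros_like(base_w, dtype=float)
    for h in np.unique(strata):
        idx = np.flatnonzero(strata == h)
        n_h = len(idx)
        draws = rng.choice(n_h, size=n_h - 1, replace=True)
        t = np.bincount(draws, minlength=n_h)
        w_star[idx] = base_w[idx] * (n_h / (n_h - 1.0)) * t
    return w_star

rng = np.random.default_rng(1)
base_w = np.ones(8)
strata = np.array([0, 0, 0, 0, 1, 1, 1, 1])
w_star = rao_wu_rescaled_weights(base_w, strata, rng)
# With equal base weights, each stratum's total rescaled mass is
# preserved exactly: (n_h / (n_h - 1)) * sum(t) = n_h.
```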
| `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low | #### Performance diff --git a/benchmarks/data/sdid_coverage.json b/benchmarks/data/sdid_coverage.json index 43f86d0a..8f18a6e3 100644 --- a/benchmarks/data/sdid_coverage.json +++ b/benchmarks/data/sdid_coverage.json @@ -4,8 +4,8 @@ "n_bootstrap": 200, "library_version": "3.2.0", "backend": "rust", - "generated_at": "2026-04-22T20:48:18.361220+00:00", - "total_elapsed_sec": 2424.92, + "generated_at": "2026-04-24T13:01:54.876774+00:00", + "total_elapsed_sec": 2420.61, "methods": [ "placebo", "bootstrap", @@ -20,7 +20,8 @@ "dgps": { "balanced": "Balanced / exchangeable: N_co=20, N_tr=3, T_pre=8, T_post=4", "unbalanced": "Unbalanced: N_co=30, N_tr=8, heterogeneous unit-FE variance", - "aer63": "Arkhangelsky et al. (2021) AER \u00a76.3: N=100, N1=20, T=120, T1=5, rank=2, \u03c3=2" + "aer63": "Arkhangelsky et al. (2021) AER \u00a76.3: N=100, N1=20, T=120, T1=5, rank=2, \u03c3=2", + "stratified_survey": "BRFSS-style: N=40, strata=2, PSU=2/stratum, psu_re_sd=1.5 (PR #352)" }, "per_dgp": { "balanced": { @@ -42,9 +43,9 @@ "0.05": 0.078, "0.10": 0.116 }, - "mean_se": 0.21962976414466187, + "mean_se": 0.2195984748876297, "true_sd_tau_hat": 0.2093529148687405, - "se_over_truesd": 1.0490886371578094 + "se_over_truesd": 1.0489391801652868 }, "jackknife": { "n_successful_fits": 500, @@ -57,7 +58,7 @@ "true_sd_tau_hat": 0.2093529148687405, "se_over_truesd": 1.0756639338270981 }, - "_elapsed_sec": 78.62 + "_elapsed_sec": 71.24 }, "unbalanced": { "placebo": { @@ -78,9 +79,9 @@ "0.05": 0.038, "0.10": 0.08 }, - "mean_se": 0.15072674925763238, + "mean_se": 0.15070173940119225, "true_sd_tau_hat": 0.135562270427217, - "se_over_truesd": 1.1118635648593473 + "se_over_truesd": 1.1116790750572711 }, "jackknife": { "n_successful_fits": 500, @@ -93,7 +94,7 @@ "true_sd_tau_hat": 0.135562270427217, "se_over_truesd": 0.990639682456852 }, - "_elapsed_sec": 90.61 + "_elapsed_sec": 78.91 }, "aer63": { "placebo": { @@ -114,9 +115,9 
@@ "0.05": 0.04, "0.10": 0.078 }, - "mean_se": 0.28291769703671454, + "mean_se": 0.28265726432861016, "true_sd_tau_hat": 0.2696262336703088, - "se_over_truesd": 1.0492958833622181 + "se_over_truesd": 1.0483299806584672 }, "jackknife": { "n_successful_fits": 500, @@ -129,7 +130,43 @@ "true_sd_tau_hat": 0.2696262336703088, "se_over_truesd": 0.9015870263136688 }, - "_elapsed_sec": 2255.69 + "_elapsed_sec": 2237.29 + }, + "stratified_survey": { + "placebo": { + "n_successful_fits": 0, + "rejection_rate": { + "0.01": null, + "0.05": null, + "0.10": null + }, + "mean_se": null, + "true_sd_tau_hat": null, + "se_over_truesd": null + }, + "bootstrap": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.024, + "0.05": 0.058, + "0.10": 0.094 + }, + "mean_se": 0.5097482138251239, + "true_sd_tau_hat": 0.4512243070193919, + "se_over_truesd": 1.1297002530566618 + }, + "jackknife": { + "n_successful_fits": 0, + "rejection_rate": { + "0.01": null, + "0.05": null, + "0.10": null + }, + "mean_se": null, + "true_sd_tau_hat": null, + "se_over_truesd": null + }, + "_elapsed_sec": 16.48 } } } \ No newline at end of file diff --git a/benchmarks/python/coverage_sdid.py b/benchmarks/python/coverage_sdid.py index f4db999a..75460f1e 100644 --- a/benchmarks/python/coverage_sdid.py +++ b/benchmarks/python/coverage_sdid.py @@ -2,13 +2,21 @@ """Coverage Monte Carlo study for SDID variance methods. Generates null-panel Monte Carlo samples (no treatment effect) across -three representative DGPs, fits SyntheticDiD under each of the three +four representative DGPs (``balanced``, ``unbalanced``, ``aer63``, +``stratified_survey``), fits SyntheticDiD under each of the three variance methods (placebo, bootstrap, jackknife), and records rejection rates at α ∈ {0.01, 0.05, 0.10} plus the ratio of mean estimated SE to the empirical sampling SD of τ̂. 
+The ``stratified_survey`` DGP is bootstrap-only — placebo and jackknife +still reject full strata/PSU/FPC survey designs (tracked in ``TODO.md``), +so the harness skips those method × DGP cells via the per-DGP +``survey_design_factory`` in the ``DGPSpec`` registry (PR #352 R5 P3). + The output JSON underwrites the calibration table in -``docs/methodology/REGISTRY.md`` §SyntheticDiD. +``docs/methodology/REGISTRY.md`` §SyntheticDiD, including the +stratified-survey bootstrap calibration gate [0.02, 0.10] that validates +the hybrid pairs-bootstrap + Rao-Wu weighted-FW composition. Usage: # Full run (~15–40 min on M-series Mac with Rust backend; AER §6.3 refit @@ -64,7 +72,7 @@ def _get_backend_from_args() -> str: from diff_diff import HAS_RUST_BACKEND, SyntheticDiD # noqa: E402 ALL_METHODS = ("placebo", "bootstrap", "jackknife") -ALL_DGPS = ("balanced", "unbalanced", "aer63") +ALL_DGPS = ("balanced", "unbalanced", "aer63", "stratified_survey") ALPHAS = (0.01, 0.05, 0.10) @@ -74,6 +82,12 @@ class DGPSpec: description: str # generator returns (DataFrame, post_periods) generator: Callable[[int], Tuple[pd.DataFrame, List[int]]] + # Optional factory: returns (SurveyDesign, methods_supported_set) given a + # DataFrame. None for non-survey DGPs. The methods_supported set lets the + # harness skip the methods that raise NotImplementedError on this design + # (e.g., placebo/jackknife under strata/PSU/FPC). For non-survey DGPs all + # methods are supported. + survey_design_factory: Optional[Callable[[pd.DataFrame], Tuple[Any, Tuple[str, ...]]]] = None def _balanced_dgp(seed: int) -> Tuple[pd.DataFrame, List[int]]: @@ -169,6 +183,60 @@ def _aer63_dgp(seed: int) -> Tuple[pd.DataFrame, List[int]]: return pd.DataFrame(rows), list(range(n_pre, n_pre + n_post)) +def _stratified_survey_dgp(seed: int) -> Tuple[pd.DataFrame, List[int]]: + """BRFSS/ACS-style stratified survey panel, null treatment (PR #352). 
+ + N=40 (10 per PSU × 4 PSUs across 2 strata), T=12 (6 pre, 6 post), + moderate weight variation (Kish DEFF ≈ 1.4), psu_re_sd=1.5 (modest + ICC). Each unit is a respondent with constant per-unit survey + weight, stratum, and PSU columns. Used to validate the SDID + survey-bootstrap calibration: the bootstrap row should land near + nominal at α=0.05 (PR #352 §3c calibration gate, [0.02, 0.10]). + """ + from diff_diff.prep_dgp import generate_survey_did_data + df = generate_survey_did_data( + n_units=40, + n_periods=12, + cohort_periods=[7], + never_treated_frac=0.2, + treatment_effect=0.0, # null for coverage MC + n_strata=2, + psu_per_stratum=2, + fpc_per_stratum=200.0, + weight_variation="moderate", + psu_re_sd=1.5, + psu_period_factor=0.5, + seed=seed, + ) + # generate_survey_did_data emits per-observation 'treated' (post-only + # for treated units); SDID requires a unit-level ever-treated indicator + # (constant across time). Derive from 'first_treat' (cohort, 0 for + # never-treated). Periods are 1-indexed (prep_dgp.py L1211-L1212), so + # cohort 7 with n_periods=12 → post = [7, 8, 9, 10, 11, 12] (6 post + # periods). Derive from df["period"].max() so any change to n_periods + # propagates (PR #355 R13 P1 — the hard-coded range(7, 12) dropped + # period 12 into the pre window, contaminating calibration). + df = df.copy() + df["treated"] = (df["first_treat"] > 0).astype(int) + cohort_onset = 7 + period_max = int(df["period"].max()) + post_periods = list(range(cohort_onset, period_max + 1)) + return df, post_periods + + +def _stratified_survey_design(df: pd.DataFrame) -> Tuple[Any, Tuple[str, ...]]: + """Build the SurveyDesign for the stratified_survey DGP. + + Methods supported: bootstrap only — placebo / jackknife reject + strata/PSU/FPC at fit-time (separate methodology gap). 
+ """ + from diff_diff import SurveyDesign + return ( + SurveyDesign(weights="weight", strata="stratum", psu="psu", fpc="fpc"), + ("bootstrap",), + ) + + DGPS: Dict[str, DGPSpec] = { "balanced": DGPSpec( "balanced", @@ -185,6 +253,12 @@ def _aer63_dgp(seed: int) -> Tuple[pd.DataFrame, List[int]]: "Arkhangelsky et al. (2021) AER §6.3: N=100, N1=20, T=120, T1=5, rank=2, σ=2", _aer63_dgp, ), + "stratified_survey": DGPSpec( + "stratified_survey", + "BRFSS-style: N=40, strata=2, PSU=2/stratum, psu_re_sd=1.5 (PR #352)", + _stratified_survey_dgp, + survey_design_factory=_stratified_survey_design, + ), } @@ -194,23 +268,33 @@ def _fit_one( post_periods: List[int], n_bootstrap: int, seed: int, + survey_design: Optional[Any] = None, ) -> Tuple[Optional[float], Optional[float], Optional[float]]: - """Fit SDID and return (att, se, p_value); (None, None, None) on failure.""" + """Fit SDID and return (att, se, p_value); (None, None, None) on failure. + + For survey DGPs the harness passes a SurveyDesign via ``survey_design``; + fit() routes it through the bootstrap survey path (PR #352) when + method=='bootstrap'. The DGP's ``survey_design_factory`` declares which + methods are supported, so the caller skips unsupported methods entirely + rather than catching the resulting NotImplementedError here. 
+ """ try: with warnings.catch_warnings(): warnings.simplefilter("ignore") - r = SyntheticDiD( - variance_method=method, - n_bootstrap=n_bootstrap, - seed=seed, - ).fit( - df, + fit_kwargs = dict( outcome="outcome", treatment="treated", unit="unit", time="period", post_periods=post_periods, ) + if survey_design is not None: + fit_kwargs["survey_design"] = survey_design + r = SyntheticDiD( + variance_method=method, + n_bootstrap=n_bootstrap, + seed=seed, + ).fit(df, **fit_kwargs) att = float(r.att) if np.isfinite(r.att) else None se = float(r.se) if np.isfinite(r.se) else None p_value = float(r.p_value) if np.isfinite(r.p_value) else None @@ -224,24 +308,31 @@ def _summarize( ses: np.ndarray, p_values: np.ndarray, ) -> Dict[str, Any]: - """Reduce per-seed (att, se, p_value) arrays to summary stats.""" + """Reduce per-seed (att, se, p_value) arrays to summary stats. + + Unsupported / all-failed cells serialize as ``null`` rather than + ``float('nan')`` so the output is strict JSON (PR #355 R11 P3). 
+ """ finite = np.isfinite(atts) & np.isfinite(ses) & np.isfinite(p_values) n_successful = int(finite.sum()) if n_successful == 0: return { "n_successful_fits": 0, - "rejection_rate": {f"{a:.2f}": float("nan") for a in ALPHAS}, - "mean_se": float("nan"), - "true_sd_tau_hat": float("nan"), - "se_over_truesd": float("nan"), + "rejection_rate": {f"{a:.2f}": None for a in ALPHAS}, + "mean_se": None, + "true_sd_tau_hat": None, + "se_over_truesd": None, } atts_f = atts[finite] ses_f = ses[finite] ps_f = p_values[finite] rejection = {f"{a:.2f}": float(np.mean(ps_f < a)) for a in ALPHAS} mean_se = float(ses_f.mean()) - true_sd = float(atts_f.std(ddof=1)) if n_successful > 1 else float("nan") - ratio = float(mean_se / true_sd) if np.isfinite(true_sd) and true_sd > 0 else float("nan") + true_sd = float(atts_f.std(ddof=1)) if n_successful > 1 else None + if true_sd is not None and np.isfinite(true_sd) and true_sd > 0: + ratio: Optional[float] = float(mean_se / true_sd) + else: + ratio = None return { "n_successful_fits": n_successful, "rejection_rate": rejection, @@ -258,7 +349,14 @@ def _run_dgp( n_bootstrap: int, methods: Tuple[str, ...], ) -> Dict[str, Any]: - """Run all methods × n_seeds for one DGP. Returns summary dict.""" + """Run all methods × n_seeds for one DGP. Returns summary dict. + + For survey DGPs (``spec.survey_design_factory is not None``) the harness + constructs the SurveyDesign once per seed (it depends only on the column + names, not the DataFrame contents) and skips methods not in + ``supported_methods`` — those rows in the artifact have + ``n_successful_fits=0``. 
+ """ print(f"\n=== DGP: {name} ({spec.description}) ===", flush=True) # Preallocate per-method arrays @@ -269,8 +367,17 @@ def _run_dgp( start = time.time() for seed in range(n_seeds): df, post = spec.generator(seed) + if spec.survey_design_factory is not None: + survey_design, supported_methods = spec.survey_design_factory(df) + else: + survey_design = None + supported_methods = methods for method in methods: - att, se, p = _fit_one(method, df, post, n_bootstrap, seed) + if method not in supported_methods: + # Method-specific guard fires (e.g., placebo + strata). + # Leave NaN; the summary will report n_successful_fits=0. + continue + att, se, p = _fit_one(method, df, post, n_bootstrap, seed, survey_design) if att is not None: atts[method][seed] = att if se is not None: @@ -380,7 +487,12 @@ def main() -> None: out_path = Path(args.output) out_path.parent.mkdir(parents=True, exist_ok=True) with open(out_path, "w") as f: - json.dump(output, f, indent=2) + # ``allow_nan=False`` so any stray float('nan') / inf fails loudly + # instead of serializing as bare ``NaN`` / ``Infinity`` tokens + # (non-strict JSON that strict parsers reject; PR #355 R11 P3). + # ``_summarize`` returns ``None`` for unsupported / all-failed + # cells precisely so this gate doesn't trip. + json.dump(output, f, indent=2, allow_nan=False) print(f"\nWrote {out_path} ({out_path.stat().st_size / 1024:.1f} KB)", flush=True) diff --git a/benchmarks/python/test_coverage_sdid.py b/benchmarks/python/test_coverage_sdid.py new file mode 100644 index 00000000..63e356df --- /dev/null +++ b/benchmarks/python/test_coverage_sdid.py @@ -0,0 +1,75 @@ +"""Harness-level sanity tests for ``coverage_sdid.py``. + +Lives under ``benchmarks/`` rather than ``tests/`` because the targets +here (the private DGP helpers inside ``coverage_sdid.py``) are part of +the MC harness, not the shipped ``diff_diff`` package. 
CI's isolated-install job deliberately copies only ``tests/``; running the pytest collection here would just fail with ``ModuleNotFoundError`` on the ``benchmarks`` import. + +Invoke manually when about to regenerate the coverage MC artifact:: + + pytest benchmarks/python/test_coverage_sdid.py -v + +Or from the repo root:: + + python -m pytest benchmarks/python/test_coverage_sdid.py -v +""" + +from __future__ import annotations + +from pathlib import Path +import sys + +# Make the top-level ``benchmarks.python.coverage_sdid`` import resolvable +# when pytest is invoked from the repo root. The harness itself does a +# similar sys.path insert at module load; this mirror is only needed for +# test collection from this file. +_REPO_ROOT = Path(__file__).resolve().parent.parent.parent +if str(_REPO_ROOT) not in sys.path: + sys.path.insert(0, str(_REPO_ROOT)) + +from benchmarks.python import coverage_sdid # noqa: E402 + + +def test_stratified_survey_dgp_post_periods_cover_full_post_tail(): + """The ``stratified_survey`` coverage DGP must not drop any post period. + + Regression against PR #355 R13 P1: the harness previously hard-coded ``range(7, 12)`` as ``post_periods`` even though + ``generate_survey_did_data`` is 1-indexed and emits periods + 1..n_periods, so period 12 was silently included in the pre window. + That contaminated the ``stratified_survey × bootstrap`` calibration + row and every downstream REGISTRY claim that transcribes from the + artifact. The fix derives ``post_periods`` from + ``df["period"].max()``; this test fails fast if a future refactor + reintroduces the off-by-one. + """ + df, post_periods = coverage_sdid._stratified_survey_dgp(seed=0) + + # Contract: post_periods covers the full tail from cohort onset + # through df["period"].max(), with no gaps (unique + sorted + + # contiguous + maxed at df["period"].max()). 
+ assert len(post_periods) == len(set(post_periods)), ( + f"post_periods has duplicates: {post_periods}" + ) + assert post_periods == sorted(post_periods), ( + f"post_periods not sorted: {post_periods}" + ) + gaps = [ + (a, b) for a, b in zip(post_periods, post_periods[1:]) if b - a != 1 + ] + assert not gaps, ( + f"post_periods has gaps: {gaps} in {post_periods}" + ) + assert post_periods[-1] == int(df["period"].max()), ( + f"post_periods max {post_periods[-1]} != df[period].max() " + f"{int(df['period'].max())} — DGP drops the last post period " + "(off-by-one on 1-indexed generate_survey_did_data)." + ) + # Strong form: cohort onset is 7, n_periods=12 → [7,8,9,10,11,12]. + assert post_periods == [7, 8, 9, 10, 11, 12], ( + f"post_periods={post_periods} must equal [7,8,9,10,11,12] for " + "the documented 6-pre/6-post survey DGP; any other slice " + "changes the calibration interpretation in REGISTRY." + ) diff --git a/diff_diff/_backend.py b/diff_diff/_backend.py index d12d9a4c..f763f87c 100644 --- a/diff_diff/_backend.py +++ b/diff_diff/_backend.py @@ -35,6 +35,8 @@ compute_noise_level as _rust_compute_noise_level, sc_weight_fw as _rust_sc_weight_fw, sc_weight_fw_with_convergence as _rust_sc_weight_fw_with_convergence, + sc_weight_fw_weighted as _rust_sc_weight_fw_weighted, + sc_weight_fw_weighted_with_convergence as _rust_sc_weight_fw_weighted_with_convergence, # Diagnostics rust_backend_info as _rust_backend_info, ) @@ -59,6 +61,8 @@ _rust_compute_noise_level = None _rust_sc_weight_fw = None _rust_sc_weight_fw_with_convergence = None + _rust_sc_weight_fw_weighted = None + _rust_sc_weight_fw_weighted_with_convergence = None _rust_backend_info = None # Determine final backend based on environment variable and availability @@ -82,6 +86,8 @@ _rust_compute_noise_level = None _rust_sc_weight_fw = None _rust_sc_weight_fw_with_convergence = None + _rust_sc_weight_fw_weighted = None + _rust_sc_weight_fw_weighted_with_convergence = None _rust_backend_info = None elif 
_backend_env == "rust": # Force Rust mode - fail if not available @@ -131,4 +137,6 @@ def rust_backend_info(): "_rust_compute_noise_level", "_rust_sc_weight_fw", "_rust_sc_weight_fw_with_convergence", + "_rust_sc_weight_fw_weighted", + "_rust_sc_weight_fw_weighted_with_convergence", ] diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index 31d36ecc..80197648 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -1673,7 +1673,7 @@ sd_female, data_female = sd.subpopulation(data, mask=lambda df: df['sex'] == 'F' **Key features:** - Taylor Series Linearization (TSL) variance with strata + PSU + FPC - Replicate weight variance: BRR, Fay's BRR, JK1, JKn, SDR (13 of 16 estimators, including dCDH) -- Survey-aware bootstrap: multiplier at PSU (Hall-Mammen wild; dCDH, staggered) or Rao-Wu rescaled (SunAbraham, TROP). SyntheticDiD bootstrap is non-survey only: the paper-faithful refit path re-estimates weights via Frank-Wolfe per draw, and Rao-Wu + refit composition is not yet implemented (tracked in TODO.md) +- Survey-aware bootstrap: multiplier at PSU (Hall-Mammen wild; dCDH, staggered) or Rao-Wu rescaled (SunAbraham, SyntheticDiD, TROP). SyntheticDiD bootstrap composes Rao-Wu rescaled per-draw weights with the weighted Frank-Wolfe variant of `_sc_weight_fw` (PR #352): each draw solves `min ||A·diag(rw)·ω - b||² + ζ²·Σ rw_i ω_i²` and composes `ω_eff = rw·ω/Σ(rw·ω)` for the SDID estimator. Pweight-only fits use constant `rw = w_control`; full designs use Rao-Wu. 
SDID's placebo and jackknife paths still reject strata/PSU/FPC (separate methodology gap, tracked in TODO.md) - DEFF diagnostics, subpopulation analysis, weight trimming (`trim_weights`) - Repeated cross-sections: `CallawaySantAnna(panel=False)` - Compatibility matrix: see `docs/choosing_estimator.rst` Survey Design Support section diff --git a/diff_diff/survey.py b/diff_diff/survey.py index 3d951fb6..d7bf7121 100644 --- a/diff_diff/survey.py +++ b/diff_diff/survey.py @@ -1058,7 +1058,12 @@ def _extract_unit_survey_weights(data, unit_col, survey_design, unit_order): unit_col : str Unit identifier column name. survey_design : SurveyDesign - Survey design (uses ``weights`` column name). + Survey design. When ``survey_design.weights`` is a column name, + the weights are pulled from ``data``. When ``survey_design.weights + is None`` (a valid configuration — ``SurveyDesign.resolve()`` then + synthesizes ones), returns a vector of ones of length + ``len(unit_order)`` so downstream estimators can treat all units + as having unit survey weight 1. unit_order : array-like Ordered sequence of unit identifiers to align weights to. @@ -1067,6 +1072,14 @@ def _extract_unit_survey_weights(data, unit_col, survey_design, unit_order): np.ndarray Float64 array of unit-level weights, one per unit in ``unit_order``. """ + if survey_design.weights is None: + # SurveyDesign(weights=None, strata=..., psu=...) is a valid + # configuration — the design element supplies clustering / + # stratification without explicit per-unit weights. Synthesize + # uniform unit weights of 1 to match SurveyDesign.resolve()'s + # behavior (which emits ones when weights is None). Without this + # branch the groupby below would raise a KeyError on ``None``. 
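The documented behavior of `_extract_unit_survey_weights` — groupby-first alignment to `unit_order`, plus the `weights is None` ones branch — can be exercised with a simplified stand-in (not the library function):

```python
import numpy as np
import pandas as pd

def extract_unit_weights(data, unit_col, weights_col, unit_order):
    """Simplified stand-in for the documented helper: per-unit survey
    weights aligned to unit_order, with weights_col=None synthesizing
    uniform ones (mirroring SurveyDesign.resolve())."""
    if weights_col is None:
        # Design supplies strata/PSU without explicit weights; without
        # this branch the groupby below would KeyError on None.
        return np.ones(len(unit_order), dtype=np.float64)
    unit_w = data.groupby(unit_col)[weights_col].first()
    return np.array([unit_w[u] for u in unit_order], dtype=np.float64)

df = pd.DataFrame({
    "unit": ["a", "a", "b", "b"],
    "weight": [2.0, 2.0, 0.5, 0.5],   # unit-constant survey weights
})
w = extract_unit_weights(df, "unit", "weight", ["b", "a"])
ones = extract_unit_weights(df, "unit", None, ["b", "a"])
```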
+ return np.ones(len(unit_order), dtype=np.float64) unit_w = data.groupby(unit_col)[survey_design.weights].first() return np.array([unit_w[u] for u in unit_order], dtype=np.float64) diff --git a/diff_diff/synthetic_did.py b/diff_diff/synthetic_did.py index 55872d9a..7146ba19 100644 --- a/diff_diff/synthetic_did.py +++ b/diff_diff/synthetic_did.py @@ -12,12 +12,15 @@ from diff_diff.estimators import DifferenceInDifferences from diff_diff.linalg import solve_ols from diff_diff.results import SyntheticDiDResults, _SyntheticDiDFitSnapshot +from diff_diff.bootstrap_utils import generate_rao_wu_weights from diff_diff.utils import ( _compute_regularization, _sum_normalize, compute_sdid_estimator, compute_sdid_unit_weights, + compute_sdid_unit_weights_survey, compute_time_weights, + compute_time_weights_survey, safe_inference, validate_binary, ) @@ -60,9 +63,14 @@ class SyntheticDiD(DifferenceInDifferences): synthdid::vcov(method="bootstrap") (which rebinds ``attr(estimate, "opts")`` with ``update.omega=TRUE``, so the renormalized ω is only Frank-Wolfe initialization). Re-estimates ω̂_b and λ̂_b via two-pass sparsified - Frank-Wolfe on each bootstrap draw. Survey designs (including pweight-only) - raise NotImplementedError; Rao-Wu rescaled weights composed with Frank-Wolfe - re-estimation requires a separate derivation (tracked in TODO.md). + Frank-Wolfe on each bootstrap draw. **Survey support (PR #352):** + pweight-only fits use the constant per-control survey weight as ``rw``; + full-design fits (strata/PSU/FPC) use Rao-Wu rescaled weights per draw. + Both compose with the **weighted Frank-Wolfe** kernel + (``min ||A·diag(rw)·ω - b||² + ζ²·Σ rw_i ω_i²``); the FW returns ω on the + standard simplex, then ``ω_eff = rw·ω/Σ(rw·ω)`` is composed for the SDID + estimator. See REGISTRY.md §SyntheticDiD ``Note (survey + bootstrap + composition)`` for the argmin-set caveat. - "jackknife": Jackknife variance matching R's synthdid::vcov(method="jackknife"). 
Implements Algorithm 3 from Arkhangelsky et al. (2021). Deterministic (N_control + N_treated iterations), uses fixed weights (no re-estimation). @@ -238,15 +246,19 @@ def fit( # type: ignore[override] List of covariate column names. Covariates are residualized out before computing the SDID estimator. survey_design : SurveyDesign, optional - Survey design specification. Only pweight weight_type is supported. - ``variance_method='placebo'`` and ``variance_method='jackknife'`` - accept pweight-only surveys (composed via ``w_control`` / - ``w_treated``). ``variance_method='bootstrap'`` rejects all - survey designs (including pweight-only) and strata/PSU/FPC are - not supported by any variance method on this release — - composing Rao-Wu rescaled weights with paper-faithful - Frank-Wolfe re-estimation requires a separate derivation - (tracked in TODO.md, sketched in REGISTRY.md §SyntheticDiD). + Survey design specification. Only pweight weight_type is + supported. Support matrix (PR #352): + + method pweight-only strata/PSU/FPC + bootstrap ✓ weighted FW ✓ weighted FW + Rao-Wu + placebo ✓ ✗ NotImplementedError + jackknife ✓ ✗ NotImplementedError + + The bootstrap path composes Rao-Wu rescaled weights per draw + with the weighted-Frank-Wolfe kernel; see REGISTRY.md + §SyntheticDiD ``Note (survey + bootstrap composition)``. + ``placebo`` and ``jackknife`` still reject strata/PSU/FPC + (separate methodology gap tracked in TODO.md). Returns ------- @@ -258,11 +270,33 @@ def fit( # type: ignore[override] ------ ValueError If required parameters are missing, data validation fails, - or a non-pweight survey design is provided. + or a non-pweight survey design is provided. Under survey + designs, also raises when: + + - The total survey mass on either arm is zero + (``w_control.sum() == 0`` or ``w_treated.sum() == 0``). + Every unit on that arm would have weight 0, encoding an + unidentified target population (PR #355 R7 P1). 
+ - The composed effective-control mass + ``(unit_weights * w_control).sum()`` is zero. Frank-Wolfe + sparsifies ``unit_weights`` to exact zeros by design, so + even when at least one control has positive survey weight, + the FW solution may concentrate all mass on controls whose + survey weights are 0. Raising up front avoids a silent + ``0/0`` in the ``omega_eff`` normalization (PR #355 R12 P1). + - ``survey_design`` declares ``fpc`` with no explicit + ``psu=``. SDID Rao-Wu then treats each unit as its own + PSU, so ``fpc`` must be ``>=`` the number of units + (unstratified) or ``>=`` the per-stratum unit count + (stratified). Front-door checked after + ``collapse_survey_to_unit_level`` so the user sees a + targeted error instead of a bootstrap-exhaustion + failure (PR #355 R8 P1). NotImplementedError - If ``survey_design`` is provided with strata/PSU/FPC, or if - ``variance_method='bootstrap'`` is provided with any survey - design (including pweight-only). + If ``survey_design`` with strata/PSU/FPC is provided with + ``variance_method='placebo'`` or ``'jackknife'``. Bootstrap + + any survey design (pweight-only or full design) is + supported via PR #352's weighted-FW + Rao-Wu composition. """ # Validate inputs if outcome is None or treatment is None or unit is None or time is None: @@ -279,7 +313,6 @@ def fit( # type: ignore[override] # Resolve survey design from diff_diff.survey import ( - _extract_unit_survey_weights, _resolve_survey_for_fit, _validate_unit_constant_survey, ) @@ -288,17 +321,19 @@ def fit( # type: ignore[override] _resolve_survey_for_fit(survey_design, data, "analytical") ) # Reject replicate-weight designs — SyntheticDiD has no replicate- - # weight variance path. Full survey designs with strata/PSU/FPC are - # also not supported on any variance method in this release (see the - # guards below). Pweight-only works with variance_method='placebo' - # or 'jackknife'. + # weight variance path. 
Analytical (pweight / strata / PSU / FPC) + # designs are supported per the PR #352 matrix (bootstrap covers + # full design via weighted-FW + Rao-Wu; placebo / jackknife + # accept pweight-only, reject strata/PSU/FPC). if resolved_survey is not None and resolved_survey.uses_replicate_variance: raise NotImplementedError( - "SyntheticDiD does not support replicate-weight survey designs. " - "Only pweight-only survey weights are accepted, and only with " - "variance_method='placebo' or 'jackknife'. See " - "docs/methodology/REGISTRY.md §SyntheticDiD for the survey " - "support matrix." + "SyntheticDiD does not support replicate-weight survey " + "designs. Analytical survey designs are supported: " + "variance_method='bootstrap' accepts both pweight-only " + "and strata/PSU/FPC designs (PR #352), while " + "variance_method='placebo' and 'jackknife' accept " + "pweight-only. See docs/methodology/REGISTRY.md " + "§SyntheticDiD for the full survey support matrix." ) # Validate pweight only if resolved_survey is not None and resolved_survey.weight_type != "pweight": @@ -307,38 +342,31 @@ def fit( # type: ignore[override] f"Got '{resolved_survey.weight_type}'." ) - # Reject strata/PSU/FPC for any variance method. Previously the - # fixed-weight bootstrap accepted these via Rao-Wu rescaling, but that - # path was not paper-faithful and is removed; Rao-Wu composed with - # Frank-Wolfe re-estimation (paper-faithful) requires a separate - # derivation (tracked in TODO.md, sketched in REGISTRY.md). - if resolved_survey is not None and ( - resolved_survey.strata is not None - or resolved_survey.psu is not None - or resolved_survey.fpc is not None - ): - raise NotImplementedError( - "SyntheticDiD does not yet support survey designs with " - "strata/PSU/FPC. Pweight-only pseudo-population weights work " - "with variance_method='placebo' or 'jackknife'. 
Rao-Wu " - "rescaled weights composed with paper-faithful refit " - "bootstrap is tracked as a follow-up; see " - "docs/methodology/REGISTRY.md §SyntheticDiD for the sketch." + # Strata/PSU/FPC support matrix (PR #352): + # bootstrap → supported via weighted Frank-Wolfe + Rao-Wu rescaling + # (this PR; see _bootstrap_se Rao-Wu branch). + # placebo / jackknife → NotImplemented for full designs (separate + # methodology gap; resampling allocators differ between + # bootstrap pairs and placebo permutations / jackknife + # LOO). Tracked in TODO.md as a follow-up. + if ( + resolved_survey is not None + and ( + resolved_survey.strata is not None + or resolved_survey.psu is not None + or resolved_survey.fpc is not None ) - - # Reject bootstrap + any survey design (including pweight-only). The - # paper-faithful refit bootstrap re-estimates ω̂ and λ̂ via Frank-Wolfe - # on each bootstrap draw; composing that with Rao-Wu rescaled weights - # (or even a pweight composition in the weighted FW loss) requires a - # separate derivation that is not yet implemented. - if self.variance_method == "bootstrap" and resolved_survey is not None: + and self.variance_method in ("placebo", "jackknife") + ): raise NotImplementedError( - "SyntheticDiD with variance_method='bootstrap' does not yet " - "support survey designs (including pweight-only). Rao-Wu " - "rescaled weights composed with Frank-Wolfe re-estimation " - "requires a separate derivation (tracked in TODO.md). Use " - "variance_method='placebo' or 'jackknife' for pweight-only " - "surveys." + f"SyntheticDiD with variance_method='{self.variance_method}' " + "does not yet support survey designs with strata/PSU/FPC. " + "Pweight-only pseudo-population weights work with placebo / " + "jackknife. Strata/PSU/FPC support requires per-method " + "Rao-Wu / wild-bootstrap derivations on the placebo " + "allocator and the jackknife LOO mass; tracked in TODO.md " + "(SDID survey support follow-up). 
Use " + "variance_method='bootstrap' for full survey designs." ) # Validate treatment is binary @@ -411,16 +439,117 @@ def fit( # type: ignore[override] f"diff_diff.prep.balance_panel() to balance the panel first." ) - # Validate and extract survey weights. Strata/PSU/FPC are rejected - # upstream, so we only reach here with pweight-only surveys, which - # placebo and jackknife consume via ``w_control`` directly. + # Validate and extract survey weights. Pweight-only fits feed + # placebo / jackknife / bootstrap via ``w_control`` directly. + # Strata/PSU/FPC fits feed bootstrap via the unit-collapsed + # ``resolved_survey_unit`` (PR #352) which Rao-Wu rescaling + # consumes per draw. if resolved_survey is not None: _validate_unit_constant_survey(data, unit, survey_design) - w_treated = _extract_unit_survey_weights(data, unit, survey_design, treated_units) - w_control = _extract_unit_survey_weights(data, unit, survey_design, control_units) + # Collapse to unit level for the bootstrap survey path. The + # row order is [control_units..., treated_units...] so + # boot_rw[:n_control] / boot_rw[n_control:] line up with the + # bootstrap loop's column ordering. See + # `collapse_survey_to_unit_level` in diff_diff/survey.py. + # Use `data` (not `working_data`) for the groupby — survey + # design columns are unit-constant (validated above) and + # covariate residualization doesn't shuffle row order, so the + # collapse is invariant to which view we group on. + from diff_diff.survey import collapse_survey_to_unit_level + all_units_for_bootstrap = list(control_units) + list(treated_units) + resolved_survey_unit = collapse_survey_to_unit_level( + resolved_survey, data, unit, all_units_for_bootstrap, + ) + # Front-door FPC validation for implicit-PSU Rao-Wu (PR #355 + # R8 P1). When psu is None but fpc is set, + # ``generate_rao_wu_weights`` (bootstrap_utils.py L654-L655) + # treats each unit as its own PSU and rejects + # ``FPC < n_units`` per stratum mid-draw. 
``_bootstrap_se`` + # catches that ``ValueError`` and keeps retrying, so the user + # sees a generic bootstrap-exhaustion message instead of a + # targeted FPC/design error. Validate upstream so the user + # gets a clean error before the bootstrap loop even starts. + if ( + resolved_survey_unit.psu is None + and resolved_survey_unit.fpc is not None + ): + if resolved_survey_unit.strata is None: + n_units_total = len(resolved_survey_unit.weights) + fpc_val = float(resolved_survey_unit.fpc[0]) + if fpc_val < n_units_total: + raise ValueError( + f"FPC ({fpc_val}) is less than the number of " + f"units ({n_units_total}). With no explicit " + "psu= column, SDID Rao-Wu treats each unit as " + "its own PSU; FPC must be >= the number of " + "units. Declare an explicit psu= column or " + "increase FPC." + ) + else: + unique_strata = np.unique(resolved_survey_unit.strata) + for h in unique_strata: + mask_h = resolved_survey_unit.strata == h + n_h_units = int(mask_h.sum()) + fpc_h = float(resolved_survey_unit.fpc[mask_h][0]) + if fpc_h < n_h_units: + raise ValueError( + f"FPC ({fpc_h}) in stratum {h} is less than " + f"the number of units in that stratum " + f"({n_h_units}). With no explicit psu= " + "column, SDID Rao-Wu treats each unit as " + "its own PSU within strata; FPC must be " + ">= the per-stratum unit count. Declare an " + "explicit psu= column or increase FPC." + ) + # Source w_control / w_treated from resolved_survey_unit.weights + # rather than re-extracting raw panel columns. resolved_survey.weights + # is normalized to mean=1 by SurveyDesign.resolve() (survey.py L189- + # L203), so the weighted-FW bootstrap objective — which is NOT + # invariant to a global rescaling of rw — produces identical SE / + # p-value / CI under SurveyDesign(weights="w") vs "c*w" (PR #355 + # R4 P0). 
Placebo / jackknife paths also consume w_control / + # w_treated but are scale-invariant (np.average divides by sum; + # ω_eff normalization likewise), so switching to resolved weights + # doesn't change their numerics. + n_control_for_split = len(control_units) + w_control = resolved_survey_unit.weights[:n_control_for_split].astype( + np.float64 + ) + w_treated = resolved_survey_unit.weights[n_control_for_split:].astype( + np.float64 + ) + # Front-door positive-mass guard (PR #355 R7 P1). Survey weights + # are non-negative post-resolve() (survey.py L171-L176 rejects + # negatives), but all-zero mass on either arm is reachable — the + # user can assign unit survey weights of 0 to every treated or + # every control unit, which encodes an unidentified target + # population. The fit-time ATT formulas downstream + # (``np.average(..., weights=w_treated)`` around L551-L582 and + # ``omega_eff = unit_weights * w_control`` in the bootstrap / + # placebo / jackknife dispatchers) would otherwise hit 0/0 + # normalization or propagate NaNs silently. The bootstrap loop + # already has per-draw zero-mass retries for degenerate resamples + # (PR #355 R2 P0); this guard is the fit-time analogue. + if w_control.sum() <= 0: + raise ValueError( + "Survey-weighted control arm has zero total mass " + f"(sum of w_control = {w_control.sum():.3g}). " + "Every control unit has survey weight 0, so the target " + "population is unidentified. Drop units with zero weight, " + "or omit survey_design if unweighted estimation is intended." + ) + if w_treated.sum() <= 0: + raise ValueError( + "Survey-weighted treated arm has zero total mass " + f"(sum of w_treated = {w_treated.sum():.3g}). " + "Every treated unit has survey weight 0, so the target " + "population is unidentified. Drop units with zero weight, " + "or omit survey_design if unweighted estimation is intended." 
+ ) else: w_treated = None w_control = None + resolved_survey_unit = None # Residualize covariates if provided working_data = data.copy() @@ -534,7 +663,32 @@ def fit( # type: ignore[override] # population importance post-optimization. if w_control is not None: omega_eff = unit_weights * w_control - omega_eff = omega_eff / omega_eff.sum() + # Front-door effective-control guard (PR #355 R12 P1). The R7 P1 + # check (``w_control.sum() > 0`` at the raw level, synthetic_did.py + # L470-L499) is insufficient here: Frank-Wolfe sparsifies + # ``unit_weights`` to exact zeros by design, so even when at least + # one control has positive survey weight, FW may concentrate all + # mass on a subset whose survey weights are all 0. The composed + # vector ``unit_weights * w_control`` would then sum to 0, the + # downstream ``omega_eff / omega_eff.sum()`` would emit NaN, and + # the fit would return NaN ATT / SE silently. The analogous + # guards already exist for the bootstrap loop + # (``omega_scaled.sum() <= 0`` retry) and jackknife + # (``effective_control > 0`` support gate); this restores the + # contract at fit time. + omega_eff_sum = float(omega_eff.sum()) + if omega_eff_sum <= 0: + raise ValueError( + "SDID point estimate is unidentified: the Frank-Wolfe " + "solution concentrates all synthetic-control mass on " + "units with zero survey weight, so the composed " + "omega_eff = unit_weights * w_control sums to " + f"{omega_eff_sum:.3g}. Every control unit with positive " + "fit-time weight has survey weight 0. Drop zero-weight " + "controls, or omit survey_design if unweighted " + "estimation is intended." + ) + omega_eff = omega_eff / omega_eff_sum else: omega_eff = unit_weights @@ -593,8 +747,29 @@ def fit( # type: ignore[override] # values) so RNG streams stay aligned across scales. if self.variance_method == "bootstrap": # Paper-faithful pairs bootstrap (Algorithm 2 step 2): re-estimate - # ω̂_b and λ̂_b via Frank-Wolfe on each draw. 
Survey designs are - # rejected upstream in the survey guards. + # ω̂_b and λ̂_b via Frank-Wolfe on each draw. With survey designs + # the FW switches to the weighted-FW variant and Rao-Wu rescaling + # supplies per-draw weights (PR #352). Pweight-only designs use + # constant w_control across draws; full designs use Rao-Wu draws. + # Determine which survey path the bootstrap should use: + # - resolved_survey_unit + strata/PSU/FPC → Rao-Wu rescaling + # - pweight-only (resolved_survey_unit but no strata/PSU) → + # pass w_control/w_treated as constant rw per draw (the + # bootstrap branch sets `_pweight_only` from `w_control` + # when resolved_survey is None). + # - non-survey → pass nothing (legacy path). + full_design = ( + resolved_survey_unit is not None + and ( + resolved_survey_unit.strata is not None + or resolved_survey_unit.psu is not None + or resolved_survey_unit.fpc is not None + ) + ) + _boot_resolved_survey = resolved_survey_unit if full_design else None + _boot_w_control = w_control if not full_design else None + _boot_w_treated = w_treated if not full_design else None + se_n, bootstrap_estimates_n = self._bootstrap_se( Y_pre_control_n, Y_post_control_n, @@ -602,6 +777,9 @@ def fit( # type: ignore[override] Y_post_treated_n, unit_weights, time_weights, + w_control=_boot_w_control, + w_treated=_boot_w_treated, + resolved_survey=_boot_resolved_survey, zeta_omega_n=zeta_omega_n, zeta_lambda_n=zeta_lambda_n, min_decrease=min_decrease, @@ -842,6 +1020,10 @@ def _bootstrap_se( zeta_omega_n: float = 0.0, zeta_lambda_n: float = 0.0, min_decrease: float = 1e-5, + *, + w_control: Optional[np.ndarray] = None, + w_treated: Optional[np.ndarray] = None, + resolved_survey=None, ) -> Tuple[float, np.ndarray]: """Compute pairs-bootstrap standard error for SDID (Algorithm 2 step 2). @@ -854,56 +1036,72 @@ def _bootstrap_se( ``synthdid_estimate`` on each draw, so the renormalized ω serves only as Frank-Wolfe initialization. 
+ Survey-bootstrap (PR #352): if ``w_control``/``w_treated`` (pweight- + only) or ``resolved_survey`` (full design with strata/PSU/FPC) is + passed, switches to weighted-FW per draw. Per-draw rescaled weights + come from either the constant ``w_control`` (pweight-only) or + ``generate_rao_wu_weights(resolved_survey, rng)`` sliced over the + resampled units (full design). The weighted-FW solves + ``min Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - b_t)² + ζ²·Σ rw_i·ω_i²``; + downstream ``ω_eff = rw·ω / Σ(rw·ω)`` is composed before the SDID + estimator (see REGISTRY.md §SyntheticDiD ``Note (survey + bootstrap + composition)`` for the argmin-set caveat). + ``zeta_omega_n`` / ``zeta_lambda_n`` are the fit-time normalized-scale - regularization parameters (``ζ_ω / Y_scale``, ``ζ_λ / Y_scale``), used - unchanged on each bootstrap draw because the resampled panel shares - the original noise scale. - - Survey designs are rejected upstream in ``fit()``. The fit-time - ``unit_weights`` (ω̂) and ``time_weights`` (λ̂) are not reused as fixed - estimator weights — every draw re-estimates ω̂_b and λ̂_b from scratch - (using the normalized-scale zetas) — but they ARE used as Frank-Wolfe - warm-start initializations: ``_sum_normalize(unit_weights[boot_ctrl])`` - seeds ω̂_b's first pass, and ``time_weights`` seeds λ̂_b's. This - matches R's ``vcov.R::bootstrap_sample`` shape (see the warm-start - comment below, and the ``Deviation`` / warm-start discussion in - ``docs/methodology/REGISTRY.md`` §SyntheticDiD). The original ω / λ - remain on the result object as the fit-time weights. - - Retry-to-B: matches R's ``synthdid::bootstrap_sample`` while-loop. A - bounded attempt guard of ``20 * n_bootstrap`` prevents pathological- - input hangs; normal fits finish well inside this budget because - degenerate-draw probability scales as ``(N_co/N)^N + (N_tr/N)^N``. 
- Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed - inside the loop and aggregated into one summary warning after the - loop if the rate of draws with any non-convergence event exceeds 5% - of valid draws (counted once per draw, not once per solver call). + regularization parameters, used unchanged on each draw. + + ``unit_weights`` (fit-time ω) and ``time_weights`` (fit-time λ) are + not reused as fixed estimator weights — every draw re-estimates ω̂_b + and λ̂_b from scratch — but they ARE used as Frank-Wolfe warm-start + initializations: ``_sum_normalize(unit_weights[boot_ctrl])`` seeds + ω̂_b's first pass, ``time_weights`` seeds λ̂_b's. Matches R's + ``vcov.R::bootstrap_sample`` shape. + + Retry-to-B: matches R's ``synthdid::bootstrap_sample`` while-loop; + bounded attempt guard of ``20 * n_bootstrap``. Per-draw Frank-Wolfe + non-convergence is tallied via the ``return_convergence`` flag from + each helper and aggregated into one summary ``UserWarning`` if the + rate exceeds 5% of valid draws. """ - # unit_weights (fit-time ω) and time_weights (fit-time λ) are used - # as warm-start Frank-Wolfe initializations per bootstrap draw, - # matching R's ``vcov.R::bootstrap_sample`` which passes - # ``sum_normalize(weights$omega[sort(ind[ind <= N0])])`` and - # ``weights$lambda`` as the FW init via the rebound ``opts``. - rng = np.random.default_rng(self.seed) n_control = Y_pre_control.shape[1] n_treated = Y_pre_treated.shape[1] n_total = n_control + n_treated + # Survey-bootstrap dispatch (PR #352). _use_rao_wu fires for any + # survey design (the helper itself handles strata=None / psu=None + # by treating each obs as its own PSU); _pweight_only fires when + # we have constant per-control survey weights but no + # resolved_survey object (e.g., a future caller path). 
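The weighted-FW loss quoted in the `_bootstrap_se` docstring above can be cross-checked against the column-scaling formulation the helpers are described as using (toy arrays, not the library's kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
T_pre, n_ctrl = 6, 4
Y_pre = rng.normal(size=(T_pre, n_ctrl))     # control pre-period outcomes
b = rng.normal(size=T_pre)                   # treated pre-period mean
rw = np.array([0.5, 1.5, 0.0, 2.0])          # per-draw rescaled weights
omega = np.full(n_ctrl, 0.25)                # a point on the simplex
zeta = 0.3

# Loss exactly as in the docstring:
#   sum_t (sum_i rw_i * omega_i * Y[t, i] - b_t)^2
loss_direct = np.sum((Y_pre @ (rw * omega) - b) ** 2)

# Column-scaling formulation: scale Y's columns by rw and keep omega on
# the simplex — identical loss term. The penalty zeta^2 * sum rw_i*omega_i^2
# is NOT reproduced by column scaling alone, which is why reg_weights=rw
# exists as a separate kernel argument.
loss_scaled = np.sum(((Y_pre * rw) @ omega - b) ** 2)
assert np.isclose(loss_direct, loss_scaled)

penalty = zeta ** 2 * np.sum(rw * omega ** 2)
assert penalty >= 0.0
```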
+ _use_rao_wu = resolved_survey is not None + _pweight_only = (w_control is not None) and (resolved_survey is None) + + # Single-PSU short-circuit: unstratified design with <2 PSUs has no + # identified bootstrap distribution (resampling one PSU yields the + # same subset every draw). Returns NaN SE — same shape as PR #351's + # n_successful=0 raise but caught upstream as NaN. Recovered from + # 91082e5:diff_diff/synthetic_did.py. + if ( + _use_rao_wu + and resolved_survey.psu is not None + and resolved_survey.strata is None + ): + from numpy import unique as _unique + n_psu = len(_unique(resolved_survey.psu)) + if n_psu < 2: + return np.nan, np.array([]) + # Build full panel matrix: (n_pre+n_post, n_control+n_treated) Y_full = np.block([[Y_pre_control, Y_pre_treated], [Y_post_control, Y_post_treated]]) n_pre = Y_pre_control.shape[0] bootstrap_estimates: List[float] = [] - # Retry-until-B-valid semantic: matches R's synthdid::bootstrap_sample - # (`while (count < replications) { ...; if !is.na(est) count = count+1 }`) - # and paper Algorithm 2 (B bootstrap replicates). + # Retry-until-B-valid semantic: matches R's synthdid::bootstrap_sample. max_attempts = 20 * self.n_bootstrap attempts = 0 - # Tally draws with any Frank-Wolfe non-convergence; per-draw - # UserWarnings are suppressed inside the loop and aggregated into - # one summary warning after the loop (avoids 200+ warnings per fit). + # Tally draws with any Frank-Wolfe non-convergence; aggregated into + # one summary warning after the loop. fw_nonconvergence_count = 0 while len(bootstrap_estimates) < self.n_bootstrap and attempts < max_attempts: @@ -919,6 +1117,43 @@ def _bootstrap_se( continue try: + # Per-draw rescaled weights (PR #352 survey path). For + # Rao-Wu, generate_rao_wu_weights returns per-unit rescaled + # weights (resolved_survey is unit-collapsed upstream in + # fit(); see synthetic_did.py callers). For pweight-only, + # rw is the constant per-control survey weight. 
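For intuition on the per-draw weights this branch consumes, here is a deliberately simplified single-stratum, no-FPC Rao-Wu sketch (the classic n − 1 resample with multiplier m_i · n/(n − 1)); `generate_rao_wu_weights` may implement a more general rescaling, so treat this as illustrative only:

```python
import numpy as np

def rao_wu_toy(base_w, psu_ids, rng):
    """Toy Rao-Wu rescaled bootstrap draw for one stratum with n PSUs:
    resample n-1 PSUs with replacement, then rescale each unit's weight
    by (times its PSU was drawn) * n/(n-1). Units whose PSU is never
    drawn get weight 0 — which is why the loop above treats zero-mass
    draws as expected behavior and retries. Simplified no-FPC,
    single-stratum form; the library helper may differ in detail."""
    psus = np.unique(psu_ids)
    n = len(psus)
    draws = rng.choice(psus, size=n - 1, replace=True)
    counts = {p: np.sum(draws == p) for p in psus}
    mult = np.array([counts[p] * n / (n - 1) for p in psu_ids])
    return base_w * mult

w = np.ones(6)
psu = np.array([0, 0, 1, 1, 2, 2])
rw = rao_wu_toy(w, psu, np.random.default_rng(42))
assert rw.min() >= 0.0

# Expected-weight property: averaging over many draws recovers base_w,
# since E[m_i] = (n-1)/n cancels the n/(n-1) multiplier.
avg = np.mean(
    [rao_wu_toy(w, psu, np.random.default_rng(s)) for s in range(2000)],
    axis=0,
)
assert np.allclose(avg, w, atol=0.1)
```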
+ if _use_rao_wu: + boot_rw = generate_rao_wu_weights(resolved_survey, rng) + rw_control_full = boot_rw[:n_control] + rw_treated_full = boot_rw[n_control:] + rw_control_draw = rw_control_full[boot_idx[boot_is_control]] + rw_treated_draw = rw_treated_full[boot_idx[~boot_is_control] - n_control] + elif _pweight_only: + rw_control_draw = w_control[boot_idx[boot_is_control]] + rw_treated_draw = ( + w_treated[boot_idx[~boot_is_control] - n_control] + if w_treated is not None + else None + ) + else: + rw_control_draw = None + rw_treated_draw = None + + # Degenerate-retry under ANY survey path: if a draw zeros + # out the control or treated mass, retry. For Rao-Wu this + # is expected behavior (PSUs not drawn get weight 0). For + # pweight-only, zero-mass treated is reachable when at + # least one unit has zero survey weight AND the + # bootstrap resample happens to pick only zero-weight + # treated units — silently falling back to an unweighted + # mean would corrupt the bootstrap distribution because + # fit-time ATT uses the survey-weighted mean (PR #355 + # R2 P0). + if (_use_rao_wu or _pweight_only) and rw_treated_draw is not None and ( + rw_control_draw.sum() == 0 or rw_treated_draw.sum() == 0 + ): + continue + # Extract resampled outcome matrices Y_boot = Y_full[:, boot_idx] Y_boot_pre_c = Y_boot[:n_pre, boot_is_control] @@ -926,55 +1161,84 @@ def _bootstrap_se( Y_boot_pre_t = Y_boot[:n_pre, ~boot_is_control] Y_boot_post_t = Y_boot[n_pre:, ~boot_is_control] - # Unweighted treated-unit mean — survey weights are rejected - # upstream in fit(), so every path here is unweighted. - Y_boot_pre_t_mean = np.mean(Y_boot_pre_t, axis=1) - Y_boot_post_t_mean = np.mean(Y_boot_post_t, axis=1) - - # Warm-start Frank-Wolfe initialization matching R's - # ``vcov.R::bootstrap_sample`` shape: ω_init is the fit-time - # ω renormalized over the resampled controls (same - # ``sum_normalize`` operation R uses), and λ_init is the - # fit-time λ unchanged. 
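The slice bookkeeping above can be verified in isolation; the invariant is that each resampled column carries the per-unit weight of the unit it was copied from (stand-in indices and weights, not the estimator's data):

```python
import numpy as np

n_control, n_treated = 4, 2
n_total = n_control + n_treated
rng = np.random.default_rng(7)
boot_idx = rng.choice(n_total, size=n_total)   # pairs-bootstrap resample
boot_is_control = boot_idx < n_control

boot_rw = np.arange(n_total, dtype=float) + 1.0  # stand-in per-unit weights
rw_control_full = boot_rw[:n_control]
rw_treated_full = boot_rw[n_control:]

# The slices used in the bootstrap loop above:
rw_control_draw = rw_control_full[boot_idx[boot_is_control]]
rw_treated_draw = rw_treated_full[boot_idx[~boot_is_control] - n_control]

# Invariant: the draw weights equal the per-unit weights of the
# resampled units, in resample order, split by arm.
assert np.array_equal(rw_control_draw, boot_rw[boot_idx[boot_is_control]])
assert np.array_equal(rw_treated_draw, boot_rw[boot_idx[~boot_is_control]])
```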
R's ``synthdid_estimate`` with - # ``update.omega=TRUE`` / ``update.lambda=TRUE`` then runs - # Frank-Wolfe from those initializations. Without warm-start - # our FW first-pass (max_iter=100) may land in a different - # sparsification pattern than R's on problems where the - # 100-iter budget is tight (e.g., small ζ_λ on - # less-regularized time weights). Strictly-convex objective - # → warm and cold start converge to the same global minimum - # when FW is run to full convergence, but sparsification - # introduces path dependence on the 100-iter first pass. + # Treated-unit mean: survey-weighted if rw_treated_draw is + # set (PR #352), else unweighted. + if rw_treated_draw is not None and rw_treated_draw.sum() > 0: + Y_boot_pre_t_mean = np.average( + Y_boot_pre_t, axis=1, weights=rw_treated_draw, + ) + Y_boot_post_t_mean = np.average( + Y_boot_post_t, axis=1, weights=rw_treated_draw, + ) + else: + Y_boot_pre_t_mean = np.mean(Y_boot_pre_t, axis=1) + Y_boot_post_t_mean = np.mean(Y_boot_post_t, axis=1) + + # Warm-start init matching R's vcov.R::bootstrap_sample + # shape. Under survey-bootstrap, scale the renormalized + # init by rw before sum_normalize (matches the per-draw + # weighted-FW geometry). boot_control_idx = boot_idx[boot_is_control] - boot_omega_init = _sum_normalize(unit_weights[boot_control_idx]) + if rw_control_draw is not None: + boot_omega_init = _sum_normalize( + unit_weights[boot_control_idx] * rw_control_draw + ) + else: + boot_omega_init = _sum_normalize(unit_weights[boot_control_idx]) boot_lambda_init = time_weights # Algorithm 2 step 2: re-estimate ω̂_b and λ̂_b via two-pass - # sparsified Frank-Wolfe on the resampled panel, using the - # fit-time normalized-scale zeta and the warm-start inits. 
- # Pass ``return_convergence=True`` so the helpers thread a - # bool out of every FW pass (Rust and numpy both) instead of - # relying on ``warnings.catch_warnings`` — the Rust FW entry - # point is silent on ``max_iter`` exhaustion, so the - # warnings-based tally would always read zero under the - # default backend (see ``_sc_weight_fw_with_convergence`` in - # ``rust/src/weights.rs``). - boot_omega, omega_converged = compute_sdid_unit_weights( - Y_boot_pre_c, - Y_boot_pre_t_mean, - zeta_omega=zeta_omega_n, - min_decrease=min_decrease, - init_weights=boot_omega_init, - return_convergence=True, - ) - boot_lambda, lambda_converged = compute_time_weights( - Y_boot_pre_c, - Y_boot_post_c, - zeta_lambda=zeta_lambda_n, - min_decrease=min_decrease, - init_weights=boot_lambda_init, - return_convergence=True, - ) + # sparsified Frank-Wolfe. Survey paths use the weighted-FW + # helpers (PR #352 §1b); non-survey path unchanged. + if rw_control_draw is not None: + boot_omega, omega_converged = compute_sdid_unit_weights_survey( + Y_boot_pre_c, + Y_boot_pre_t_mean, + rw_control_draw, + zeta_omega=zeta_omega_n, + min_decrease=min_decrease, + init_weights=boot_omega_init, + return_convergence=True, + ) + boot_lambda, lambda_converged = compute_time_weights_survey( + Y_boot_pre_c, + Y_boot_post_c, + rw_control_draw, + zeta_lambda=zeta_lambda_n, + min_decrease=min_decrease, + init_weights=boot_lambda_init, + return_convergence=True, + ) + # Compose ω_eff = rw·ω / Σ(rw·ω) for the SDID + # estimator (expects simplex weights). REGISTRY.md + # §SyntheticDiD covers the argmin-set caveat: the FW + # minimizes the weighted objective on ω, NOT the + # standard objective on ω_eff — intentional design + # choice validated by the stratified coverage MC. 
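The composition into ω_eff described above, including the degenerate zero-total case the loop retries on, reduces to a few lines (standalone sketch; names hypothetical):

```python
import numpy as np

def compose_omega_eff(rw, omega):
    """Compose per-draw rescaled weights with the FW simplex solution:
    omega_eff = rw * omega / sum(rw * omega). Returns None for the
    degenerate all-mass-on-rw=0 case (the loop above `continue`s and
    redraws instead of dividing by zero)."""
    omega_scaled = rw * omega
    total = omega_scaled.sum()
    if total <= 0:
        return None
    return omega_scaled / total

rw = np.array([0.0, 2.0, 1.0])
omega = np.array([0.5, 0.25, 0.25])        # FW output on the simplex
omega_eff = compose_omega_eff(rw, omega)
assert np.isclose(omega_eff.sum(), 1.0)    # back on the simplex
assert omega_eff[0] == 0.0                 # zero-survey-weight control drops out

# Degenerate draw: FW put all mass on rw == 0 controls.
assert compose_omega_eff(np.array([0.0, 0.0, 1.0]),
                         np.array([0.6, 0.4, 0.0])) is None
```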
+ omega_scaled = rw_control_draw * boot_omega + total = omega_scaled.sum() + if total <= 0: + # Degenerate: all mass on rw=0 controls + continue + boot_omega = omega_scaled / total + else: + boot_omega, omega_converged = compute_sdid_unit_weights( + Y_boot_pre_c, + Y_boot_pre_t_mean, + zeta_omega=zeta_omega_n, + min_decrease=min_decrease, + init_weights=boot_omega_init, + return_convergence=True, + ) + boot_lambda, lambda_converged = compute_time_weights( + Y_boot_pre_c, + Y_boot_post_c, + zeta_lambda=zeta_lambda_n, + min_decrease=min_decrease, + init_weights=boot_lambda_init, + return_convergence=True, + ) + tau = compute_sdid_estimator( Y_boot_pre_c, Y_boot_post_c, @@ -986,14 +1250,8 @@ def _bootstrap_se( if np.isfinite(tau): bootstrap_estimates.append(float(tau)) # Count draws with ANY non-convergence (boolean per - # draw), not raw solver warnings — a single draw can - # emit up to three non-convergence events (ω - # pre-sparsify, ω main, λ). Increment the counter only - # after the finite-τ gate so the registry's "share of - # valid bootstrap draws" denominator matches the - # numerator (draws that failed the finite-τ gate are - # retried, so they shouldn't inflate the non- - # convergence rate). + # draw). Increment after the finite-τ gate so + # numerator and denominator (n_successful) match. if not (omega_converged and lambda_converged): fw_nonconvergence_count += 1 @@ -1133,14 +1391,13 @@ def _placebo_variance_se( # Ensure we have enough controls for the split n_pseudo_control = n_control - n_treated if n_pseudo_control < 1: - # Bootstrap rejects every survey design in this release, so - # steer survey users to jackknife (pweight-only only) or - # adding controls. Non-survey users can still fall back to - # bootstrap or jackknife. + # Fallback guidance. Placebo and jackknife reject strata/PSU/FPC, + # but bootstrap (PR #352) supports both pweight-only and + # full-design surveys, so it's always a valid fallback. 
fallback = ( - "variance_method='jackknife' or adding more control units " - "(strata/PSU/FPC are not yet supported by any SDID variance " - "method)" + "variance_method='bootstrap' (supports pweight-only and " + "strata/PSU/FPC survey designs), variance_method='jackknife' " + "(pweight-only only), or adding more control units" if w_control is not None else "variance_method='bootstrap', variance_method='jackknife', " "or adding more control units" @@ -1237,13 +1494,14 @@ def _placebo_variance_se( n_successful = len(placebo_estimates) if n_successful < 2: - # Same survey-awareness branch as the pre-replication guard - # above — bootstrap rejects every survey design in this - # release, so suggest jackknife for pweight-only fits. + # Same fallback guidance as the pre-replication guard above. + # Bootstrap (PR #352) supports pweight-only + strata/PSU/FPC + # survey designs, so it's always a valid fallback for survey + # users even when placebo fails. fallback = ( - "variance_method='jackknife' or increasing the number of " - "control units (strata/PSU/FPC are not yet supported by any " - "SDID variance method)" + "variance_method='bootstrap' (supports pweight-only and " + "strata/PSU/FPC survey designs), variance_method='jackknife' " + "(pweight-only only), or increasing the number of control units" if w_control is not None else "variance_method='bootstrap' or variance_method='jackknife' " "or increasing the number of control units" diff --git a/diff_diff/utils.py b/diff_diff/utils.py index bb547969..e095b181 100644 --- a/diff_diff/utils.py +++ b/diff_diff/utils.py @@ -22,6 +22,8 @@ _rust_compute_noise_level, _rust_sc_weight_fw, _rust_sc_weight_fw_with_convergence, + _rust_sc_weight_fw_weighted, + _rust_sc_weight_fw_weighted_with_convergence, ) # Numerical constants for optimization algorithms @@ -1306,6 +1308,7 @@ def _sc_weight_fw( min_decrease: float = 1e-5, max_iter: int = 10000, return_convergence: bool = False, + reg_weights: Optional[np.ndarray] = None, ): 
"""Compute synthetic control weights via Frank-Wolfe optimization. @@ -1314,6 +1317,12 @@ def _sc_weight_fw( min_{lambda on simplex} zeta^2 * ||lambda||^2 + (1/N) * ||A_centered @ lambda - b_centered||^2 + With ``reg_weights`` set, solves the weighted-regularization variant + used by SDID survey-bootstrap (PR #352):: + + min_{lambda on simplex} zeta^2 * sum_j reg_weights[j] * lambda[j]^2 + + (1/N) * ||A_centered @ lambda - b_centered||^2 + Parameters ---------- Y : np.ndarray @@ -1340,6 +1349,15 @@ def _sc_weight_fw( by SDID bootstrap to surface per-draw FW non-convergence explicitly instead of relying on ``warnings.catch_warnings`` (the default Rust FW entry point is silent on non-convergence). + reg_weights : np.ndarray, optional + Per-coordinate regularization weights of shape ``(T0,)``. When + set, switches to the weighted-regularization Rust kernel + (``sc_weight_fw_weighted`` / ``_with_convergence``) which solves + the SDID survey-bootstrap objective with ``ζ²·Σ rw·ω²`` in place + of the uniform ``ζ²·||ω||²``. The caller is responsible for any + column-scaling of ``Y`` to match the loss form. Default ``None`` + delegates to the unweighted kernel — preserves the legacy ABI for + all existing callers. Returns ------- @@ -1347,38 +1365,59 @@ def _sc_weight_fw( Weights of shape (T0,) on the simplex; with ``return_convergence=True``, additionally the convergence flag. """ + Y_c = np.ascontiguousarray(Y, dtype=np.float64) + init_c = ( + np.ascontiguousarray(init_weights, dtype=np.float64) + if init_weights is not None + else None + ) + rw_c = ( + np.ascontiguousarray(reg_weights, dtype=np.float64) + if reg_weights is not None + else None + ) + + if rw_c is not None: + # Validate reg_weights shape at the dispatcher so Rust and NumPy + # backends share a single failure surface. 
The Rust + # ``sc_weight_fw_weighted_internal`` silently falls back to the + # unweighted kernel on a length mismatch, while the NumPy + # implementation raises — dispatching without a shared upstream + # check would let callers get the wrong objective on the Rust + # path with no error (PR #355 R5 P2). + expected_t0 = Y_c.shape[1] - 1 + if rw_c.shape != (expected_t0,): + raise ValueError( + f"reg_weights shape {rw_c.shape} does not match expected " + f"({expected_t0},) — must equal Y.shape[1] - 1" + ) + if HAS_RUST_BACKEND: + if reg_weights is not None: + if return_convergence: + weights, converged = _rust_sc_weight_fw_weighted_with_convergence( + Y_c, zeta, intercept, init_c, min_decrease, max_iter, rw_c, + ) + return np.asarray(weights), converged + return np.asarray( + _rust_sc_weight_fw_weighted( + Y_c, zeta, intercept, init_c, min_decrease, max_iter, rw_c, + ) + ) if return_convergence: weights, converged = _rust_sc_weight_fw_with_convergence( - np.ascontiguousarray(Y, dtype=np.float64), - zeta, - intercept, - ( - np.ascontiguousarray(init_weights, dtype=np.float64) - if init_weights is not None - else None - ), - min_decrease, - max_iter, + Y_c, zeta, intercept, init_c, min_decrease, max_iter, ) return np.asarray(weights), converged return np.asarray( _rust_sc_weight_fw( - np.ascontiguousarray(Y, dtype=np.float64), - zeta, - intercept, - ( - np.ascontiguousarray(init_weights, dtype=np.float64) - if init_weights is not None - else None - ), - min_decrease, - max_iter, + Y_c, zeta, intercept, init_c, min_decrease, max_iter, ) ) return _sc_weight_fw_numpy( Y, zeta, intercept, init_weights, min_decrease, max_iter, return_convergence=return_convergence, + reg_weights=reg_weights, ) @@ -1390,12 +1429,19 @@ def _sc_weight_fw_numpy( min_decrease: float = 1e-5, max_iter: int = 10000, return_convergence: bool = False, + reg_weights: Optional[np.ndarray] = None, ): """Pure NumPy implementation of Frank-Wolfe SC weight solver. 
When ``return_convergence=True``, returns a tuple ``(weights, converged)`` and suppresses the default ``warn_if_not_converged`` side effect — the caller is responsible for deciding how to surface non-convergence. + + With ``reg_weights`` set, solves the weighted-regularization variant + (matches the Rust ``sc_weight_fw_weighted`` kernel; PR #352). The loss + term is unchanged; only the regularization becomes + ``ζ²·Σ_j reg_weights[j]·lam[j]²`` and the FW step uses the diag(rw)- + weighted simplex direction norm. """ T0 = Y.shape[1] - 1 N = Y.shape[0] @@ -1419,12 +1465,53 @@ def _sc_weight_fw_numpy( else: lam = np.ones(T0) / T0 + if reg_weights is not None: + rw = np.asarray(reg_weights, dtype=np.float64) + if rw.shape != (T0,): + raise ValueError( + f"reg_weights shape {rw.shape} does not match expected " + f"({T0},) — must equal A.shape[1]" + ) + else: + rw = None + vals = np.full(max_iter, np.nan) converged = False for t in range(max_iter): - lam = _fw_step(A, lam, b, eta) - err = Y @ np.append(lam, -1.0) - vals[t] = zeta**2 * np.sum(lam**2) + np.sum(err**2) / N + if rw is None: + lam = _fw_step(A, lam, b, eta) + err = Y @ np.append(lam, -1.0) + vals[t] = zeta**2 * np.sum(lam**2) + np.sum(err**2) / N + else: + # Weighted FW step with diag(rw) regularization. Mirrors the + # Rust sc_weight_fw_*_weighted derivation in rust/src/weights.rs. 
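As a standalone sanity check on the weighted step that follows (an illustrative NumPy sketch, not library code; it assumes the usual synthdid internal scaling `eta = N * zeta**2`): with `reg_weights` equal to ones the objective collapses to the uniform form, and one exact-line-search FW step toward the best vertex never increases the objective while staying on the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(8, 5))          # toy panel: rows = N, last column = target b
A, b = Y[:, :-1], Y[:, -1]
N, T0 = A.shape
zeta = 0.5
eta = N * zeta**2                    # assumed internal scaling (not from the diff)


def objective(lam, rw):
    # zeta^2 * sum_j rw[j] * lam[j]^2 + ||A @ lam - b||^2 / N
    err = A @ lam - b
    return zeta**2 * float(np.sum(rw * lam**2)) + float(np.sum(err**2)) / N


lam = np.full(T0, 1.0 / T0)          # cold-start point on the simplex
rw = rng.uniform(0.5, 2.0, size=T0)  # per-coordinate regularization weights

# rw = 1 recovers the unweighted objective exactly.
uniform_matches = bool(np.isclose(
    objective(lam, np.ones(T0)),
    zeta**2 * float(np.sum(lam**2)) + float(np.sum((A @ lam - b) ** 2)) / N,
))

# One weighted FW step: half-gradient, best vertex, clipped exact line search.
half_grad = A.T @ (A @ lam - b) + eta * rw * lam
i = int(np.argmin(half_grad))
d = -lam.copy()
d[i] += 1.0                          # direction toward vertex e_i
denom = float(np.sum((A @ d) ** 2)) + eta * float(np.sum(rw * d * d))
step = float(np.clip(-(half_grad @ d) / denom, 0.0, 1.0))
lam_new = lam + step * d             # convex combination: stays on the simplex

monotone = objective(lam_new, rw) <= objective(lam, rw) + 1e-12
simplex_ok = bool(np.isclose(lam_new.sum(), 1.0)) and bool(np.all(lam_new >= -1e-12))
```

Since step 0 is always feasible in the clipped line search, the per-step objective is non-increasing by construction, which is what the convergence criterion `vals[t-1] - vals[t] < min_decrease**2` relies on.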
+ ax_minus_b = A @ lam - b + half_grad = A.T @ ax_minus_b + eta * rw * lam + i = int(np.argmin(half_grad)) + d = -lam.copy() + d[i] += 1.0 + d_x_w_norm_sq = float(np.sum(rw * d * d)) + if d_x_w_norm_sq < 1e-24: + err = ax_minus_b + vals[t] = zeta**2 * float(np.sum(rw * lam * lam)) + float(np.sum(err**2)) / N + if t >= 1 and vals[t - 1] - vals[t] < min_decrease**2: + converged = True + break + continue + d_err_sq = float(np.sum((A @ d) ** 2)) + denom = d_err_sq + eta * d_x_w_norm_sq + if denom <= 0.0: + err = ax_minus_b + vals[t] = zeta**2 * float(np.sum(rw * lam * lam)) + float(np.sum(err**2)) / N + if t >= 1 and vals[t - 1] - vals[t] < min_decrease**2: + converged = True + break + continue + hg_dot_dx = float(half_grad @ d) + step = float(np.clip(-hg_dot_dx / denom, 0.0, 1.0)) + lam = lam + step * d + err = A @ lam - b + vals[t] = zeta**2 * float(np.sum(rw * lam * lam)) + float(np.sum(err**2)) / N if t >= 1 and vals[t - 1] - vals[t] < min_decrease**2: converged = True break @@ -1750,6 +1837,284 @@ def compute_sdid_unit_weights( return omega +# ============================================================================= +# Survey-weighted SDID FW helpers (PR #352 — internal, called from +# SyntheticDiD._bootstrap_se on per-draw survey-weighted refits) +# ============================================================================= + + +def compute_sdid_unit_weights_survey( + Y_pre_control: np.ndarray, + Y_pre_treated_mean: np.ndarray, + rw_control: np.ndarray, + zeta_omega: float, + intercept: bool = True, + min_decrease: float = 1e-5, + max_iter_pre_sparsify: int = 100, + max_iter: int = 10000, + init_weights: Optional[np.ndarray] = None, + return_convergence: bool = False, +): + """Survey-weighted SDID unit weights via two-pass weighted Frank-Wolfe. 
+ + Solves the weighted-FW objective (PR #352 §2.2):: + + min_{ω on simplex} + Σ_t (Σ_i rw_control[i]·ω[i]·Y_pre_control[t,i] - Y_pre_treated_mean[t])² + + ζ²·Σ_i rw_control[i]·ω[i]² + + Implementation: pre-scales each control column of Y_unit by + ``rw_control`` (so the loss term picks up the per-control linear + combination) and passes ``rw_control`` as ``reg_weights`` to + ``_sc_weight_fw`` (so the regularization picks up the per-ω scaling). + Two-pass sparsify-refit structure mirrors ``compute_sdid_unit_weights``. + + The returned ω is on the standard simplex. The caller (typically + ``SyntheticDiD._bootstrap_se``) is responsible for composing + ``ω_eff = rw_control·ω / Σ(rw_control·ω)`` for the downstream SDID + estimator, which expects a normalized weight vector. + + Parameters + ---------- + Y_pre_control : np.ndarray + Control outcomes in pre-treatment periods, shape (n_pre, n_control). + Y_pre_treated_mean : np.ndarray + Mean treated outcomes in pre-treatment periods, shape (n_pre,). + rw_control : np.ndarray + Per-control survey weights, shape (n_control,). Must be non-negative. + For pweight-only bootstrap this is the constant survey weight per + control unit; for Rao-Wu bootstrap this is the per-draw rescaled + weight (``generate_rao_wu_weights`` output sliced to control units). + zeta_omega : float + Regularization parameter (already normalized by Y_scale). + intercept : bool, default True + Column-center the optimization matrix. + min_decrease : float, default 1e-5 + Convergence criterion. + max_iter_pre_sparsify : int, default 100 + First-pass iteration cap before sparsification. + max_iter : int, default 10000 + Second-pass iteration cap. + init_weights : np.ndarray, optional + Warm-start weights for the first pass; shape (n_control,). + return_convergence : bool, default False + If True, returns ``(ω, converged)`` where converged is the AND of + both passes' convergence flags. 
+ + Returns + ------- + np.ndarray or Tuple[np.ndarray, bool] + ω on the simplex (NOT ω_eff). + """ + n_control = Y_pre_control.shape[1] + + if rw_control.shape != (n_control,): + raise ValueError( + f"rw_control shape {rw_control.shape} does not match expected " + f"({n_control},)" + ) + + if n_control == 0: + empty = np.asarray([]) + return (empty, True) if return_convergence else empty + if n_control == 1: + singleton = np.asarray([1.0]) + return (singleton, True) if return_convergence else singleton + + # Build the column-scaled Y matrix: each control column j is multiplied by + # rw_control[j], so A·ω in the loss equals Σ_j rw_j·ω_j·Y_j,pre. + rw = np.ascontiguousarray(rw_control, dtype=np.float64) + Y_scaled = np.column_stack([ + Y_pre_control * rw[np.newaxis, :], + Y_pre_treated_mean.reshape(-1, 1), + ]) + + if return_convergence: + omega, conv1 = _sc_weight_fw( + Y_scaled, + zeta=zeta_omega, + intercept=intercept, + init_weights=init_weights, + max_iter=max_iter_pre_sparsify, + min_decrease=min_decrease, + return_convergence=True, + reg_weights=rw, + ) + else: + omega = _sc_weight_fw( + Y_scaled, + zeta=zeta_omega, + intercept=intercept, + init_weights=init_weights, + max_iter=max_iter_pre_sparsify, + min_decrease=min_decrease, + reg_weights=rw, + ) + + omega = _sparsify(omega) + + if return_convergence: + omega, conv2 = _sc_weight_fw( + Y_scaled, + zeta=zeta_omega, + intercept=intercept, + init_weights=omega, + max_iter=max_iter, + min_decrease=min_decrease, + return_convergence=True, + reg_weights=rw, + ) + return omega, bool(conv1 and conv2) + + return _sc_weight_fw( + Y_scaled, + zeta=zeta_omega, + intercept=intercept, + init_weights=omega, + max_iter=max_iter, + min_decrease=min_decrease, + reg_weights=rw, + ) + + +def compute_time_weights_survey( + Y_pre_control: np.ndarray, + Y_post_control: np.ndarray, + rw_control: np.ndarray, + zeta_lambda: float, + intercept: bool = True, + min_decrease: float = 1e-5, + max_iter_pre_sparsify: int = 100, + 
max_iter: int = 10000, + init_weights: Optional[np.ndarray] = None, + return_convergence: bool = False, +): + """Survey-weighted SDID time weights via two-pass row-weighted FW. + + Solves the WLS-style time-weight objective (PR #352 §2.2):: + + min_{λ on simplex} + Σ_u rw_control[u]·(Σ_t λ[t]·Y_u,pre-centered[t] - Y_u,post_mean-centered)² + + ζ²·||λ||² + + Regularization stays uniform on λ (rw is per-control, λ is per-period — + no alignment for per-λ reg weighting). The loss term uses WLS-style + row weights; when ``intercept=True``, the column-centering step is + *also* survey-weighted (weighted mean across controls, weights + ``rw_control``) so the centered loss minimizes + ``Σ_u rw_u·(A_u·λ - b_u)²`` on the rw-centered matrix — equivalent + to the stated weighted objective. The Rust kernel then sees the + weighted-centered + sqrt(rw)-row-scaled matrix with + ``intercept=False`` (no additional unweighted centering). + + The returned λ is on the standard simplex. + + Parameters + ---------- + Y_pre_control : np.ndarray + Shape (n_pre, n_control). + Y_post_control : np.ndarray + Shape (n_post, n_control). + rw_control : np.ndarray + Shape (n_control,), non-negative. + zeta_lambda : float + Regularization parameter (already normalized by Y_scale). + Other parameters mirror ``compute_time_weights``. + + Returns + ------- + np.ndarray or Tuple[np.ndarray, bool] + λ on the simplex. + """ + n_pre = Y_pre_control.shape[0] + n_control = Y_pre_control.shape[1] + + if rw_control.shape != (n_control,): + raise ValueError( + f"rw_control shape {rw_control.shape} does not match expected " + f"({n_control},)" + ) + + if Y_post_control.shape[0] == 0: + raise ValueError( + "Y_post_control has no rows. At least one post-treatment period " + "is required for time weight computation." 
+ ) + + if n_pre <= 1: + lam_trivial = np.ones(n_pre) + return (lam_trivial, True) if return_convergence else lam_trivial + + # Build collapsed form like compute_time_weights: (N_co, T_pre+1) + post_means = np.mean(Y_post_control, axis=0) + Y_time = np.column_stack([Y_pre_control.T, post_means]) # (N_co, T_pre+1) + + # Column-center the (N_co, T_pre+1) matrix using the SURVEY-WEIGHTED + # mean across control units when ``intercept=True``. Plain + # ``intercept=True`` inside the FW kernel would use an unweighted + # column mean which does not correspond to the stated weighted-loss + # objective once ``rw_control`` varies. Perform the weighted + # centering here and pass ``intercept=False`` below so the kernel + # does not re-center on the row-scaled matrix. + rw_sum = float(np.sum(rw_control)) + if intercept and rw_sum > 0: + col_weighted_means = ( + (Y_time * rw_control[:, np.newaxis]).sum(axis=0) / rw_sum + ) + Y_time = Y_time - col_weighted_means[np.newaxis, :] + + # Row-scale by sqrt(rw): after weighted centering (if any), each + # control unit's contribution to the loss is weighted by + # ``rw_control[u]`` via the sqrt(rw) row scaling, which reproduces + # ``||diag(sqrt(rw))·(A·λ - b)||²`` = ``Σ_u rw_u·(A_u·λ - b_u)²``. + # Reg on λ stays uniform (no reg_weights). 
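The two claims in the comments above, that survey-weighted column-centering zeroes the weighted column means and that sqrt(rw) row-scaling reproduces the per-control weighted loss `Σ_u rw_u·(A_u·λ - b_u)²`, can be verified in isolation (illustrative sketch with assumed toy shapes, not library code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_control, n_pre = 6, 4
Y_time = rng.normal(size=(n_control, n_pre + 1))   # collapsed (N_co, T_pre+1)
rw = rng.uniform(0.5, 2.0, size=n_control)          # per-control survey weights

# Survey-weighted column-centering: weighted column means become ~0.
col_wmeans = (Y_time * rw[:, None]).sum(axis=0) / rw.sum()
Y_c = Y_time - col_wmeans[None, :]
wmeans_after = (Y_c * rw[:, None]).sum(axis=0) / rw.sum()
wmeans_max = float(np.max(np.abs(wmeans_after)))

# sqrt(rw) row-scaling reproduces the weighted loss for an arbitrary lam.
lam = np.full(n_pre, 1.0 / n_pre)
A, bcol = Y_c[:, :-1], Y_c[:, -1]
resid = A @ lam - bcol
weighted_loss = float(np.sum(rw * resid**2))        # sum_u rw_u * (A_u.lam - b_u)^2

Y_w = Y_c * np.sqrt(rw)[:, None]                    # the row scaling the kernel sees
resid_scaled = Y_w[:, :-1] @ lam - Y_w[:, -1]
rowscaled_loss = float(np.sum(resid_scaled**2))     # ||diag(sqrt(rw)) (A.lam - b)||^2
loss_gap = abs(weighted_loss - rowscaled_loss)
```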
+ sqrt_rw = np.sqrt(np.maximum(rw_control, 0.0)) + Y_weighted = Y_time * sqrt_rw[:, np.newaxis] + + if return_convergence: + lam, conv1 = _sc_weight_fw( + Y_weighted, + zeta=zeta_lambda, + intercept=False, # weighted centering already applied above + init_weights=init_weights, + min_decrease=min_decrease, + max_iter=max_iter_pre_sparsify, + return_convergence=True, + ) + else: + lam = _sc_weight_fw( + Y_weighted, + zeta=zeta_lambda, + intercept=False, # weighted centering already applied above + init_weights=init_weights, + min_decrease=min_decrease, + max_iter=max_iter_pre_sparsify, + ) + + lam = _sparsify(lam) + + if return_convergence: + lam, conv2 = _sc_weight_fw( + Y_weighted, + zeta=zeta_lambda, + intercept=False, # weighted centering already applied above + init_weights=lam, + min_decrease=min_decrease, + max_iter=max_iter, + return_convergence=True, + ) + return lam, bool(conv1 and conv2) + + return _sc_weight_fw( + Y_weighted, + zeta=zeta_lambda, + intercept=False, # weighted centering already applied above + init_weights=lam, + min_decrease=min_decrease, + max_iter=max_iter, + ) + + def compute_sdid_estimator( Y_pre_control: np.ndarray, Y_post_control: np.ndarray, diff --git a/docs/choosing_estimator.rst b/docs/choosing_estimator.rst index fde99cae..155a6bf0 100644 --- a/docs/choosing_estimator.rst +++ b/docs/choosing_estimator.rst @@ -783,10 +783,10 @@ estimation. The depth of support varies by estimator: - Full (analytical) - Multiplier at PSU * - ``SyntheticDiD`` - - pweight only (placebo / jackknife) - - -- - - -- + - pweight only + - Via bootstrap - -- + - Hybrid pairs-bootstrap + Rao-Wu rescaled (bootstrap only) * - ``TROP`` - pweight only - Via bootstrap @@ -807,20 +807,24 @@ estimation. 
The depth of support varies by estimator: - **Full**: All weight types (pweight/fweight/aweight) + strata/PSU/FPC + Taylor Series Linearization variance - **Full (pweight only)**: Full TSL with strata/PSU/FPC, but only ``pweight`` accepted (``fweight``/``aweight`` rejected because composition changes weight semantics) -- **Via bootstrap**: Strata/PSU/FPC supported only with bootstrap variance. ``TROP`` uses bootstrap by default. ``SyntheticDiD`` does **not** support strata/PSU/FPC on any variance method in this release — see the deferred-composition note in REGISTRY.md §SyntheticDiD. +- **Via bootstrap**: Strata/PSU/FPC supported only with bootstrap variance. ``TROP`` uses bootstrap by default. ``SyntheticDiD`` supports strata/PSU/FPC on ``variance_method='bootstrap'`` via a hybrid pairs-bootstrap + Rao-Wu rescaling composition (see the ``Note (survey + bootstrap composition)`` in REGISTRY.md §SyntheticDiD); ``placebo`` and ``jackknife`` remain pweight-only. - **pweight only** (Weights column): Only ``pweight`` accepted; ``fweight``/``aweight`` raise an error - **Diagnostic**: Weighted descriptive statistics only (no inference) - **--**: Not supported .. note:: - ``SyntheticDiD`` does not support strata/PSU/FPC on any variance method - in this release. Pweight-only survey weights work with - ``variance_method='placebo'`` or ``variance_method='jackknife'``; - ``variance_method='bootstrap'`` rejects all survey designs because - composing Rao-Wu rescaled weights with paper-faithful Frank-Wolfe - re-estimation requires a separate derivation (tracked in - ``TODO.md``). + ``SyntheticDiD`` supports survey designs on ``variance_method='bootstrap'`` + — both pweight-only and full strata/PSU/FPC — via a hybrid pairs-bootstrap + composed with per-draw Rao-Wu rescaled weights fed into a weighted + Frank-Wolfe re-estimation of ω and λ. See the + ``Note (survey + bootstrap composition)`` in REGISTRY.md §SyntheticDiD + for the objective form and argmin-set caveat. 
+ + ``variance_method='placebo'`` and ``variance_method='jackknife'`` remain + pweight-only — composing placebo permutations / leave-one-out with + Rao-Wu rescaling under the weighted objective is a separate derivation + (tracked in ``TODO.md``). For the full walkthrough with code examples, see the `survey tutorial `_. diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 224f2c8b..cce7a1f5 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -1505,7 +1505,7 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi R-parity rationale: `synthdid_estimate()` (synthdid.R) stores `update.omega = TRUE` in `attr(estimate, "opts")`, and `vcov.R::bootstrap_sample` rebinds those `opts` inside its `do.call` back into `synthdid_estimate`, so the renormalized ω passed via `weights$omega` is used as Frank-Wolfe initialization (the `sum_normalize` helper in R's source explicitly says so). The Python path threads the same warm-start via `compute_sdid_unit_weights(..., init_weights=...)` and `compute_time_weights(..., init_weights=...)`. The FW objective is strictly convex on the simplex (quadratic loss + ζ² ridge on simplex), so warm- and cold-start converge to the same global minimum given enough iterations; warm-start matters in practice because the 100-iter first pass then sparsification is path-dependent on draws where the pre-sparsify budget is tight. Cross-language SE parity at bit tolerance is not claimed — different BLAS / RNG paths — but the procedure matches R's default bootstrap shape at the algorithm level, and Python-only bit-identity on non-survey data is asserted via `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`. 
- Expected wall-clock ~5–30× slower per fit than placebo (panel-size dependent; Frank-Wolfe second-pass can hit its 10K-iter cap on larger panels; warm-start plumbing closes the gap vs cold-start, which would be closer to 10–30× on these DGPs). Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside the loop and aggregated into a single summary warning emitted after the loop when the share of valid bootstrap draws with any non-convergence event (counted once per draw — each draw runs Frank-Wolfe once for ω and once for λ, and any of those calls firing a non-convergence warning trips the draw) exceeds 5% of `n_successful`. Composed with any survey design (including pweight-only) this path raises `NotImplementedError` in the current release — see the survey-regression Note below for scope and the deferred-composition sketch. + Expected wall-clock ~5–30× slower per fit than placebo (panel-size dependent; Frank-Wolfe second-pass can hit its 10K-iter cap on larger panels; warm-start plumbing closes the gap vs cold-start, which would be closer to 10–30× on these DGPs). Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside the loop and aggregated into a single summary warning emitted after the loop when the share of valid bootstrap draws with any non-convergence event (counted once per draw — each draw runs Frank-Wolfe once for ω and once for λ, and any of those calls firing a non-convergence warning trips the draw) exceeds 5% of `n_successful`. Composed with survey designs (pweight-only OR strata/PSU/FPC) this path uses the weighted Frank-Wolfe kernel and the per-draw dispatch described in the "Note (survey + bootstrap composition)" below. - Alternative: Jackknife variance (matching R's `synthdid::vcov(method="jackknife")`) Implements Algorithm 3 from Arkhangelsky et al. 
(2021): @@ -1546,11 +1546,34 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi - **Jackknife with single nonzero-weight control**: Returns NaN SE. Leaving out the only effective control is not meaningful. - **Jackknife with non-finite LOO estimate**: Returns NaN SE. Unlike bootstrap/placebo, jackknife is deterministic and cannot skip failed iterations; NaN propagates through `var()` (matches R behavior). - **Jackknife with survey weights**: Guards on effective positive support (omega * w_control > 0 and w_treated > 0) after composition, not raw FW counts. Returns NaN SE if fewer than 2 effective controls or 2 positive-weight treated units. Per-iteration zero-sum guards return NaN for individual LOO iterations when remaining composed weights sum to zero. -- **Note (survey support):** weights are supported; strata/PSU/FPC are not. The pweight-only path is accepted for `variance_method="placebo"` and `variance_method="jackknife"` — treated-side means are survey-weighted (Frank-Wolfe target and ATT formula); control-side synthetic weights are composed with survey weights post-optimization (ω_eff = ω * w_co, renormalized). Frank-Wolfe optimization itself is unweighted — survey importance enters after trajectory-matching. Covariate residualization uses WLS with survey weights. **`variance_method="bootstrap"` rejects all survey designs (including pweight-only)** in this release; see the deferred-composition Note below. -- **Note (default variance_method deviation from R):** R's `synthdid::vcov()` defaults to `method="bootstrap"`; our `SyntheticDiD.__init__` defaults to `variance_method="placebo"`. 
This is an intentional library deviation for two reasons: (a) placebo is unconditionally available on pweight-only survey designs whereas refit bootstrap rejects every survey design in this release (see the deferred-composition Note below); (b) placebo avoids the ~5–30× per-draw Frank-Wolfe refit slowdown relative to the deleted fixed-weight path, which becomes prohibitive on large panels at B=200 replications. Users can opt into the R-matching default with `variance_method="bootstrap"`. Placebo (Algorithm 4) and bootstrap (Algorithm 2) both track nominal calibration in the committed coverage MC; see the calibration table below. -- **Note (deferred survey + bootstrap composition):** Rao-Wu rescaled bootstrap composed with paper-faithful refit is not yet implemented. Reusable scaffolding for the follow-up implementer: `generate_rao_wu_weights` (in `bootstrap_utils`, retained for use by other estimators), the split into `rw_control` / `rw_treated`, the degenerate-retry check, and the treated-mean weighting pattern are portable from the fixed-weight Rao-Wu branch this release removed (recoverable via `git show 91082e5:diff_diff/synthetic_did.py` near the pre-rewrite `_bootstrap_se` body (PR #351 "Replace SDID fixed-weight bootstrap with paper-faithful refit")). The genuinely new work is a **weighted Frank-Wolfe** variant of `_sc_weight_fw` that accepts per-unit weights in the loss and regularization (`Σ rw_i ω_i Y_i,pre` / `ζ² Σ rw_i ω_i²`), threaded through `compute_sdid_unit_weights` / `compute_time_weights`. Compose-after-unweighted-FW does NOT work — it silently reproduces the fixed-weight + Rao-Wu behavior we removed (which was never paper-faithful and never matched R's default vcov). Validation: re-use the coverage MC harness with a stratified DGP and confirm near-nominal rejection rates against placebo-SE tracking. Survey + SDID variance is therefore deferred capability — pre-existing strata/PSU/FPC users have no SDID variance method on this release. 
+- **Note (survey support matrix — PR #352):** + + | variance_method | pweight-only | strata/PSU/FPC | + |-----------------|:------------:|:--------------:| + | `bootstrap` | ✓ weighted FW | ✓ weighted FW + Rao-Wu rescaling | + | `placebo` | ✓ | ✗ NotImplementedError (separate gap) | + | `jackknife` | ✓ | ✗ NotImplementedError (separate gap) | + + **Pweight-only path** (placebo / jackknife / bootstrap): treated-side means are survey-weighted (Frank-Wolfe target and ATT formula); control-side synthetic weights are composed with survey weights post-optimization (ω_eff = ω * w_co, renormalized). Fit-time Frank-Wolfe is unweighted — survey importance enters after trajectory-matching. Covariate residualization uses WLS with survey weights. + + **Bootstrap survey path** (PR #352): for pweight-only the per-draw FW uses constant `rw = w_control`; for full design (strata/PSU/FPC) the per-draw `rw = generate_rao_wu_weights(resolved_survey, rng)` rescaling is composed with the same weighted-FW kernel. See "Note (survey + bootstrap composition)" below for the full objective and the argmin-set caveat. + + **Placebo / jackknife full-design rejection**: separate methodology gap. Placebo permutes control indices (the resampling unit is a control unit, not a PSU); jackknife leaves out one unit at a time. Both allocators need their own weighted derivations to compose with strata/PSU; tracked in TODO.md as a follow-up. +- **Note (default variance_method deviation from R):** R's `synthdid::vcov()` defaults to `method="bootstrap"`; our `SyntheticDiD.__init__` defaults to `variance_method="placebo"`. Library deviation rationale: (a) placebo is unconditionally available on non-survey and pweight-only survey fits (full strata/PSU/FPC designs are supported on bootstrap only); (b) placebo avoids the ~5–30× per-draw Frank-Wolfe refit slowdown. Users can opt into R's default with `variance_method="bootstrap"`.
Placebo (Algorithm 4) and bootstrap (Algorithm 2 step 2) both track nominal calibration in the committed coverage MC; see the calibration table below. +- **Note (survey + bootstrap composition — PR #352):** Restored capability. The bootstrap survey path solves the **weighted Frank-Wolfe** variant of `_sc_weight_fw` accepting per-unit weights in loss and regularization. For unit weights: + ``` + min_{ω simplex} Σ_t (Σ_i rw_i · ω_i · Y_i,pre[t] - treated_pre[t])² + ζ²·Σ_i rw_i · ω_i² + ``` + Implementation: column-scales A by `rw_control` (so the loss term reads `||A·diag(rw)·ω - b||²`) and passes `reg_weights=rw_control` to the weighted Rust kernel for the diag(rw) penalty. The FW returns ω on the standard simplex; `_bootstrap_se` composes `ω_eff = rw·ω / Σ(rw·ω)` for the downstream `compute_sdid_estimator` call (which expects a normalized weight vector). For time weights, the loss becomes per-row weighted (`||diag(√rw)·(A·λ - b)||²`) and regularization on λ stays uniform — λ is per-period, rw is per-control, no alignment for per-λ reg weighting. + + **Argmin-set caveat (deliberate):** this objective is NOT the same as directly minimizing the standard SDID loss on `ω_eff` (the scaling factor `(Σ rw·ω)²` enters the loss-on-θ reparameterization non-constantly). The chosen form mirrors the spirit of the pre-PR-#351 Rao-Wu fixed-weight composition (rescale + renormalize) but with ω re-estimated per draw under the weighted objective, so weight-estimation uncertainty propagates correctly. No external R/Julia parity anchor exists because neither package defines survey-weighted SDID FW; validation rests on the coverage MC calibration row below (`stratified_survey × bootstrap`, target rejection ∈ [0.02, 0.10] at α=0.05). + + **Per-draw dispatch:** + - pweight-only → `rw = w_control[boot_idx_control]` (constant per draw, no Rao-Wu). + - full design → `rw = generate_rao_wu_weights(resolved_survey_unit, rng)` per draw, sliced over the resampled units. 
Rao-Wu rescales weights by `(n_h/m_h)·r_hi` within each stratum; degenerate-retry on zero-mass control or treated draws. + - **single-PSU short-circuit**: unstratified single-PSU designs return NaN SE (resampling one PSU yields the same subset every draw — bootstrap distribution is unidentified). - **Note:** P-value computation is variance-method dependent. Placebo (Algorithm 4) uses the empirical null formula `max(mean(|placebo_effects| ≥ |att|), 1/(r+1))` because permuting control indices generates draws from the null distribution (centered on 0). Bootstrap (Algorithm 2) and jackknife (Algorithm 3) use the analytical p-value from `safe_inference(att, se)` (normal-theory): bootstrap draws are centered on `τ̂` (sampling distribution of the estimator) and jackknife pseudo-values are not null draws, so the empirical null formula is invalid for them. This matches R's `synthdid::vcov()` convention, where variance is returned and inference is normal-theory from the SE. -- **Note (coverage Monte Carlo calibration):** `benchmarks/data/sdid_coverage.json` carries empirical rejection rates across all three variance methods on 3 representative null-panel DGPs (500 seeds × B=200, regenerable via `benchmarks/python/coverage_sdid.py`). Under H0 the nominal rejection rate at each α equals α; rates substantially above α indicate anti-conservatism, rates below indicate over-coverage. +- **Note (coverage Monte Carlo calibration):** `benchmarks/data/sdid_coverage.json` carries empirical rejection rates across the three variance methods on 4 representative null-panel DGPs (500 seeds × B=200, regenerable via `benchmarks/python/coverage_sdid.py`). The fourth DGP (`stratified_survey`, added in PR #352) validates the survey-bootstrap calibration; bootstrap is the only method evaluated on it because placebo / jackknife reject strata/PSU/FPC at fit-time. 
Under H0 the nominal rejection rate at each α equals α; rates substantially above α indicate anti-conservatism, rates below indicate over-coverage. | DGP | method | α=0.01 | α=0.05 | α=0.10 | mean SE / true SD | |-----------------------------------------------------------|------------|--------|--------|--------|-------------------| @@ -1563,8 +1586,11 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi | AER §6.3 (N=100, N_tr=20, T=120, T_pre=115, rank=2, σ=2) | placebo | 0.018 | 0.058 | 0.086 | 0.99 | | AER §6.3 | bootstrap | 0.010 | 0.040 | 0.078 | 1.05 | | AER §6.3 | jackknife | 0.030 | 0.080 | 0.150 | 0.90 | + | stratified_survey (N=40, strata=2, PSU=2/stratum, ICC≈0.84) | bootstrap | 0.024 | 0.058 | 0.094 | 1.13 | + + Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three non-survey DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3) at α=0.05 (rejection 0.112 and 0.080 vs the 0.05 target). Arkhangelsky et al. (2021) §6.3 reports mixed jackknife evidence (98% coverage — slightly conservative — under iid, and 93% coverage — slightly anti-conservative — under AR(1) ρ=0.7), so the direction of our observation is consistent with the AR(1) branch of the paper's evidence rather than the iid branch. The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds. - Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative on the smaller panels (balanced, AER §6.3) at α=0.05 (rejection 0.112 and 0.080 vs the 0.05 target). Arkhangelsky et al. 
(2021) §6.3 reports mixed jackknife evidence (98% coverage — slightly conservative — under iid, and 93% coverage — slightly anti-conservative — under AR(1) ρ=0.7), so the direction of our observation is consistent with the AR(1) branch of the paper's evidence rather than the iid branch. The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds. + **`stratified_survey × bootstrap` (PR #352)**: validates the weighted-FW + Rao-Wu composition added in this PR. Rejection at α=0.05 is 0.058 (inside the calibration gate [0.02, 0.10] widened from a 2σ band to accommodate the high ICC ≈ 0.84 induced by `psu_re_sd=1.5` with only 4 PSUs total). `mean SE / true SD = 1.13` indicates the bootstrap is slightly conservative (overestimates the empirical sampling SD by ~13%) — the safer direction; expected under Rao-Wu rescaling with few PSUs because the per-draw weights inflate variance from the resampling structure on top of the fit-time uncertainty. Placebo and jackknife rows are `null` here because both methods reject strata/PSU/FPC at fit-time (tracked as a separate methodology gap in TODO.md). Bootstrap is the only available variance method for full-design SDID fits in this release. The schema smoke test is `TestCoverageMCArtifact::test_coverage_artifacts_present`; regenerate the JSON via `python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200 --output benchmarks/data/sdid_coverage.json` (~15–40 min on M-series Mac, Rust backend — warm-start convergence makes newer runs faster than the original cold-start one). 
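The `ω_eff = rw·ω / Σ(rw·ω)` composition used throughout this section can be sketched in a few lines (illustrative only; `compose_omega_eff` is a hypothetical helper name, not the library API):

```python
import numpy as np

rng = np.random.default_rng(2)
omega = rng.dirichlet(np.ones(5))            # weighted-FW output: point on the simplex
rw = rng.uniform(0.5, 2.0, size=5)           # per-draw Rao-Wu rescaled weights


def compose_omega_eff(omega, rw):
    # omega_eff = rw * omega / sum(rw * omega); None signals a zero-mass draw
    # (the bootstrap loop's degenerate-retry path).
    mass = float(np.sum(rw * omega))
    if mass <= 0.0:
        return None
    return rw * omega / mass


omega_eff = compose_omega_eff(omega, rw)
on_simplex = abs(float(omega_eff.sum()) - 1.0) < 1e-12 and bool((omega_eff >= 0).all())
# A constant rw cancels in the normalization, leaving omega unchanged,
# which is why the pweight-only branch can reuse the same composition.
constant_rw_noop = bool(np.allclose(compose_omega_eff(omega, np.full(5, 3.0)), omega))
zero_mass_retry = compose_omega_eff(omega, np.zeros(5)) is None
```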
@@ -1595,7 +1621,7 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi - [x] Sparsification: v[v <= max(v)/4] = 0; v = v/sum(v) - [x] Placebo SE formula: sqrt((r-1)/r) * sd(placebo_estimates) - [x] Placebo SE: re-estimates omega and lambda per replication (matching R's update.omega=TRUE, update.lambda=TRUE) -- [x] Bootstrap: paper-faithful Algorithm 2 step 2 — re-estimates ω̂_b and λ̂_b per draw via two-pass sparsified Frank-Wolfe on the resampled panel using the fit-time normalized-scale zeta. Matches R's default `synthdid::vcov(method="bootstrap")` (which rebinds `attr(estimate, "opts")` so the renormalized ω serves only as Frank-Wolfe initialization). Survey designs raise `NotImplementedError`; Rao-Wu + refit composition is tracked in TODO.md and sketched in the deferred-composition Note above. +- [x] Bootstrap: paper-faithful Algorithm 2 step 2 — re-estimates ω̂_b and λ̂_b per draw via two-pass sparsified Frank-Wolfe on the resampled panel using the fit-time normalized-scale zeta. Matches R's default `synthdid::vcov(method="bootstrap")` (which rebinds `attr(estimate, "opts")` so the renormalized ω serves only as Frank-Wolfe initialization). Survey designs (pweight-only AND strata/PSU/FPC) are supported via the weighted-FW + hybrid pairs-bootstrap + Rao-Wu rescaling composition described in the "Note (survey + bootstrap composition)" above (PR #352). - [x] Jackknife SE: fixed weights, LOO all units, formula `sqrt((n-1)/n * sum((u-ubar)^2))` - [x] Jackknife: NaN SE for single treated or single nonzero-weight control - [x] Jackknife: analytical p-value (not empirical) @@ -2923,13 +2949,15 @@ ContinuousDiD, EfficientDiD): Rescaled weight: `w*_i = w_i * (n_h / m_h) * r_hi` where `r_hi` = count of PSU *i* drawn. - **Note:** FPC enters through the resample size `m_h`, not as a post-hoc scaling factor. When `f_h >= 1` (census stratum), observations keep original weights (zero variance). 
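The rescaled-weight rule above can be sketched for a single stratum as follows; `rao_wu_rescale` and its argument layout are illustrative, not the library's API, and FPC is assumed to enter only through the resample size `m_h`, as the Note states:

```python
import numpy as np

def rao_wu_rescale(w, psu_of_obs, n_h, m_h, rng):
    """One Rao-Wu draw for a stratum with n_h PSUs and resample size m_h.

    w          : per-observation base weights
    psu_of_obs : PSU index (0..n_h-1) of each observation
    Returns w* = w * (n_h / m_h) * r_i, where r_i counts how many times
    observation i's PSU was drawn.
    """
    draws = rng.choice(n_h, size=m_h, replace=True)   # PSUs drawn with replacement
    r = np.bincount(draws, minlength=n_h)             # per-PSU draw counts
    return w * (n_h / m_h) * r[psu_of_obs]
```

Since every draw satisfies Σ_i r_i = m_h, equal-size PSUs with constant base weights preserve the stratum's total weight exactly; in general the total is preserved in expectation (E[r_i] = m_h / n_h).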
-- **Note:** SyntheticDiD is intentionally excluded from this list. Paper-faithful - refit bootstrap (Arkhangelsky et al. 2021 Algorithm 2 step 2) re-estimates ω̂ - and λ̂ via Frank-Wolfe on each draw; composing that with Rao-Wu rescaled - weights requires a weighted-FW derivation that is not yet implemented (sketch - in §SyntheticDiD survey-regression Note; tracked in TODO.md). The previous - SyntheticDiD Rao-Wu path composed fixed-ω with rescaled weights, which was - not paper-faithful and was removed. +- **Note:** SyntheticDiD joins this list via a **hybrid** pairs-bootstrap + + Rao-Wu rescaling (PR #352). Unlike SunAbraham / TROP, which use standalone + Rao-Wu (resample PSUs within strata and rescale weights), SDID first + performs unit-level pairs-bootstrap (`boot_idx = rng.choice(n_total)`) and + then slices Rao-Wu rescaled weights over the resampled units, passing them + into a **weighted Frank-Wolfe** re-estimation of ω̂ and λ̂ per draw. The + full objective and argmin-set caveat live in §SyntheticDiD "Note (survey + + bootstrap composition)"; the previous fixed-ω-and-rescaled-weights path + was removed in PR #351 and replaced with this weighted-FW derivation. - **Note:** Bootstrap paths support all three `lonely_psu` modes: `"remove"`, `"certainty"`, and `"adjust"`. For `"adjust"`, singleton PSUs from different strata are pooled into a combined pseudo-stratum and weights are generated for the pooled group. This is the diff --git a/docs/methodology/survey-theory.md b/docs/methodology/survey-theory.md index 8535b936..019cc4ca 100644 --- a/docs/methodology/survey-theory.md +++ b/docs/methodology/survey-theory.md @@ -724,15 +724,21 @@ Two bootstrap strategies interact with survey designs: - **Rao-Wu rescaled bootstrap** (SunAbraham, TROP): Draws PSUs with replacement within strata and rescales observation weights. Each draw - re-runs the full estimator on the resampled data. 
*SyntheticDiD is - intentionally excluded in this release:* the paper-faithful refit - bootstrap rejects every survey design because composing Rao-Wu rescaled - weights with Frank-Wolfe re-estimation requires a weighted-FW derivation - that is not yet implemented. Pweight-only SDID users should use - ``variance_method="placebo"`` or ``"jackknife"``; strata/PSU/FPC users - have no SDID variance option. See TODO.md and - ``docs/methodology/REGISTRY.md`` §SyntheticDiD for the deferred- - composition sketch. + re-runs the full estimator on the resampled data. +- **Hybrid pairs-bootstrap + Rao-Wu rescaling** (SyntheticDiD, PR #352): + SDID's full-design bootstrap is NOT a standalone Rao-Wu bootstrap. Each + draw first performs the unit-level pairs-bootstrap resampling that + Arkhangelsky et al. (2021) Algorithm 2 specifies (``boot_idx = rng.choice(n_total)``), + and *then* applies the Rao-Wu rescaled per-unit weights sliced over the + resampled units (``rw_control = rao_wu_rw[:n_control][boot_idx_control]``). + The weighted-Frank-Wolfe kernel then solves + ``min ||A·diag(rw)·ω - b||² + ζ²·Σ rw_i ω_i²`` on the resampled panel, + and ``ω_eff = rw·ω / Σ(rw·ω)`` is composed for the SDID estimator. + See REGISTRY.md §SyntheticDiD ``Note (survey + bootstrap composition)`` + for the full objective and the argmin-set caveat. SDID's `placebo` and + `jackknife` methods still reject strata/PSU/FPC (the placebo permutation + allocator and jackknife LOO mass need their own weighted derivations; + tracked in TODO.md as a follow-up). 
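The per-draw composition described in this bullet can be sketched as below; `hybrid_draw`, the `omega_solver` stub, and the controls-first row ordering are hypothetical stand-ins for the package's internals:

```python
import numpy as np

def hybrid_draw(n_total, n_control, rao_wu_rw, omega_solver, Y_pre, rng):
    # Step 1: unit-level pairs bootstrap (Arkhangelsky et al. 2021, Alg. 2).
    boot_idx = rng.choice(n_total, size=n_total, replace=True)
    # Step 2: slice the Rao-Wu rescaled weights over the resampled controls
    # (controls assumed to occupy rows 0..n_control-1 of the panel).
    boot_idx_control = boot_idx[boot_idx < n_control]
    rw = rao_wu_rw[:n_control][boot_idx_control]
    # Step 3: re-estimate omega on the resampled panel; omega_solver stands
    # in for the weighted Frank-Wolfe kernel.
    omega = omega_solver(Y_pre[boot_idx_control], rw)
    # Step 4: compose the effective unit weights for the SDID estimator.
    return rw * omega / np.sum(rw * omega)
```

The composed weights lie on the simplex by construction, matching the `ω_eff = rw·ω / Σ(rw·ω)` line above.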
--- diff --git a/docs/survey-roadmap.md b/docs/survey-roadmap.md index c92c0df3..db49b416 100644 --- a/docs/survey-roadmap.md +++ b/docs/survey-roadmap.md @@ -44,17 +44,21 @@ Weighted `solve_logit()` in `linalg.py` — survey weights enter IRLS as | Estimator | Survey Support | Notes | |-----------|----------------|-------| -| SyntheticDiD | pweight | Treated means survey-weighted; omega composed with control weights post-optimization | +| SyntheticDiD | pweight (placebo / jackknife / bootstrap); strata/PSU/FPC (bootstrap only via PR #352 weighted FW + Rao-Wu) | Treated means survey-weighted; omega composed with control weights post-optimization. Bootstrap survey path uses weighted-FW + Rao-Wu rescaling per draw | | TROP | pweight | Population-weighted ATT aggregation; model fitting unchanged | ### Phase 6: Advanced Features (v2.7.6) - **Survey-aware bootstrap** for bootstrap-using estimators: multiplier at PSU (CS, Imputation, TwoStage, Continuous, Efficient) - and Rao-Wu rescaled (SA, TROP). SyntheticDiD bootstrap no longer supports - survey designs: the previous fixed-weight + Rao-Wu path was not paper- - faithful and was removed, and composing Rao-Wu with paper-faithful refit - Frank-Wolfe requires a separate derivation (tracked in TODO.md) + and Rao-Wu rescaled (SA, SyntheticDiD, TROP). SyntheticDiD bootstrap + composes Rao-Wu rescaled per-draw weights with the **weighted Frank-Wolfe** + variant (PR #352): each draw solves the weighted objective + ``min ||A·diag(rw)·ω - b||² + ζ²·Σ rw_i ω_i²`` and composes + ``ω_eff = rw·ω/Σ(rw·ω)`` for the SDID estimator. See REGISTRY.md + §SyntheticDiD ``Note (survey + bootstrap composition)`` for the full + derivation. SDID's `placebo` and `jackknife` paths still reject + strata/PSU/FPC (separate methodology gap; tracked in TODO.md). - **Replicate weight variance**: BRR, Fay's BRR, JK1, JKn, SDR. 
12 of 16 estimators supported (not SyntheticDiD, TROP, BaconDecomposition, or WooldridgeDiD) - **DEFF diagnostics**: per-coefficient design effects vs SRS baseline @@ -219,8 +223,7 @@ the limitation and suggested alternative. | Estimator | Limitation | Alternative | |-----------|-----------|-------------| -| SyntheticDiD | Strata/PSU/FPC (any variance method) | No SDID variance option in this release. Pweight-only works with `variance_method='placebo'` or `'jackknife'`. Strata/PSU/FPC + SDID requires composing Rao-Wu rescaled weights with paper-faithful Frank-Wolfe re-estimation; sketch in REGISTRY.md §SyntheticDiD. | -| SyntheticDiD | `variance_method='bootstrap'` + any survey design (including pweight-only) | Use `variance_method='placebo'` or `'jackknife'` for pweight-only surveys. Refit bootstrap composed with survey weights requires the same weighted-FW derivation noted above. | +| SyntheticDiD | `variance_method='placebo'` or `'jackknife'` + strata/PSU/FPC | Use `variance_method='bootstrap'` for full-design surveys (PR #352 weighted-FW + Rao-Wu composition). Placebo's control-index permutation and jackknife's LOO allocator need their own weighted derivations on top of the weighted-FW kernel; tracked in TODO.md as a follow-up. | | SyntheticDiD | Replicate weights | Pre-existing limitation: no replicate-weight survey support on SDID. | | TROP | Replicate weights | Use strata/PSU/FPC design with Rao-Wu rescaled bootstrap | | BaconDecomposition | Replicate weights | Diagnostic only, no inference | diff --git a/docs/tutorials/16_survey_did.ipynb b/docs/tutorials/16_survey_did.ipynb index 06bb1b6c..790b6425 100644 --- a/docs/tutorials/16_survey_did.ipynb +++ b/docs/tutorials/16_survey_did.ipynb @@ -1087,7 +1087,7 @@ "cell_type": "markdown", "id": "cell-35-f1ef376c", "metadata": {}, - "source": "## 9. 
Which Estimators Support Survey Design?\n\n`diff-diff` supports survey design across all estimators, though the level of support varies:\n\n| Estimator | Weights | Strata/PSU/FPC (TSL) | Replicate Weights | Survey-Aware Bootstrap |\n|-----------|---------|---------------------|-------------------|------------------------|\n| **DifferenceInDifferences** | Full | Full | -- | -- |\n| **TwoWayFixedEffects** | Full | Full | -- | -- |\n| **MultiPeriodDiD** | Full | Full | -- | -- |\n| **CallawaySantAnna** | pweight only | Full | Full | Multiplier at PSU |\n| **TripleDifference** | pweight only | Full | Full (analytical) | -- |\n| **StaggeredTripleDifference** | pweight only | Full | Full | Multiplier at PSU |\n| **SunAbraham** | Full | Full | -- | Rao-Wu rescaled |\n| **StackedDiD** | pweight only | Full (pweight only) | -- | -- |\n| **ImputationDiD** | pweight only | Partial (no FPC) | -- | Multiplier at PSU |\n| **TwoStageDiD** | pweight only | Partial (no FPC) | -- | Multiplier at PSU |\n| **ContinuousDiD** | Full | Full | Full (analytical) | Multiplier at PSU |\n| **EfficientDiD** | Full | Full | Full (analytical) | Multiplier at PSU |\n| **SyntheticDiD** | pweight only (placebo / jackknife) | -- | -- | -- |\n| **TROP** | pweight only | -- | -- | Rao-Wu rescaled |\n| **BaconDecomposition** | Diagnostic | Diagnostic | -- | -- |\n\n**Legend:**\n- **Full**: All weight types (pweight/fweight/aweight) + strata/PSU/FPC + Taylor Series Linearization variance\n- **Full (pweight only)**: Full TSL support with strata/PSU/FPC, but only accepts `pweight` weight type (`fweight`/`aweight` rejected because Q-weight composition changes their semantics)\n- **Partial (no FPC)**: Weights + strata (for df) + PSU (for clustering); FPC raises `NotImplementedError`\n- **pweight only** (Weights column): Only `pweight` accepted; `fweight`/`aweight` raise an error\n- **pweight only** (TSL column): Sampling weights for point estimates; no strata/PSU/FPC design elements\n- **Diagnostic**: 
Weighted descriptive statistics only (no inference)\n- **--**: Not supported\n\n**Note on SyntheticDiD:** `variance_method=\"placebo\"` and `variance_method=\"jackknife\"` support pweight-only survey designs. `variance_method=\"bootstrap\"` rejects every survey design (including pweight-only) because the paper-faithful refit bootstrap composed with Rao-Wu rescaled weights requires a weighted-Frank-Wolfe derivation that is not yet implemented. Strata/PSU/FPC are not supported by any SDID variance method in this release. The weighted-FW + Rao-Wu composition follow-up is tracked in `TODO.md`; see `docs/methodology/REGISTRY.md` §SyntheticDiD for the deferred-composition sketch.\n\n**Note:** `EfficientDiD` supports `covariates` and `survey_design` simultaneously. The doubly-robust (DR) path threads survey weights through WLS outcome regression, weighted sieve propensity ratios, and survey-weighted kernel smoothing.\n\nFor full details, see `docs/survey-roadmap.md`." + "source": "## 9. Which Estimators Support Survey Design?\n\n`diff-diff` supports survey design across all estimators, though the level of support varies:\n\n| Estimator | Weights | Strata/PSU/FPC (TSL) | Replicate Weights | Survey-Aware Bootstrap |\n|-----------|---------|---------------------|-------------------|------------------------|\n| **DifferenceInDifferences** | Full | Full | -- | -- |\n| **TwoWayFixedEffects** | Full | Full | -- | -- |\n| **MultiPeriodDiD** | Full | Full | -- | -- |\n| **CallawaySantAnna** | pweight only | Full | Full | Multiplier at PSU |\n| **TripleDifference** | pweight only | Full | Full (analytical) | -- |\n| **StaggeredTripleDifference** | pweight only | Full | Full | Multiplier at PSU |\n| **SunAbraham** | Full | Full | -- | Rao-Wu rescaled |\n| **StackedDiD** | pweight only | Full (pweight only) | -- | -- |\n| **ImputationDiD** | pweight only | Partial (no FPC) | -- | Multiplier at PSU |\n| **TwoStageDiD** | pweight only | Partial (no FPC) | -- | Multiplier at PSU |\n| 
**ContinuousDiD** | Full | Full | Full (analytical) | Multiplier at PSU |\n| **EfficientDiD** | Full | Full | Full (analytical) | Multiplier at PSU |\n| **SyntheticDiD** | pweight only | bootstrap only (PR #352) | -- | Rao-Wu rescaled (bootstrap only) |\n| **TROP** | pweight only | -- | -- | Rao-Wu rescaled |\n| **BaconDecomposition** | Diagnostic | Diagnostic | -- | -- |\n\n**Legend:**\n- **Full**: All weight types (pweight/fweight/aweight) + strata/PSU/FPC + Taylor Series Linearization variance\n- **Full (pweight only)**: Full TSL support with strata/PSU/FPC, but only accepts `pweight` weight type (`fweight`/`aweight` rejected because Q-weight composition changes their semantics)\n- **Partial (no FPC)**: Weights + strata (for df) + PSU (for clustering); FPC raises `NotImplementedError`\n- **pweight only** (Weights column): Only `pweight` accepted; `fweight`/`aweight` raise an error\n- **pweight only** (TSL column): Sampling weights for point estimates; no strata/PSU/FPC design elements\n- **bootstrap only** (TSL column): Strata/PSU/FPC supported only on `variance_method=\"bootstrap\"` via the weighted Frank-Wolfe + Rao-Wu composition (PR #352); placebo and jackknife reject full designs\n- **Diagnostic**: Weighted descriptive statistics only (no inference)\n- **--**: Not supported\n\n**Note on SyntheticDiD (PR #352):** the bootstrap survey path composes per-draw Rao-Wu rescaled weights with a **weighted Frank-Wolfe** variant of `_sc_weight_fw`. Each draw solves `min ||A·diag(rw)·ω - b||² + ζ²·Σ rw_i ω_i²` and composes `ω_eff = rw·ω/Σ(rw·ω)` for the SDID estimator. Pweight-only fits use the constant per-control survey weight as `rw`; full designs use Rao-Wu rescaling per draw. SDID's `placebo` and `jackknife` methods still reject `strata/PSU/FPC` (a separate methodology gap — placebo permutes control indices, jackknife leaves out one unit at a time, both need their own weighted derivations; tracked in `TODO.md`). 
See `docs/methodology/REGISTRY.md` §SyntheticDiD `Note (survey + bootstrap composition)` for the full objective and the argmin-set caveat.\n\n**Note:** `EfficientDiD` supports `covariates` and `survey_design` simultaneously. The doubly-robust (DR) path threads survey weights through WLS outcome regression, weighted sieve propensity ratios, and survey-weighted kernel smoothing.\n\nFor full details, see `docs/survey-roadmap.md`." }, { "cell_type": "markdown", diff --git a/rust/src/lib.rs b/rust/src/lib.rs index 53844650..cc8fd13a 100644 --- a/rust/src/lib.rs +++ b/rust/src/lib.rs @@ -37,6 +37,8 @@ fn _rust_backend(m: &Bound<'_, PyModule>) -> PyResult<()> { m.add_function(wrap_pyfunction!(weights::compute_noise_level, m)?)?; m.add_function(wrap_pyfunction!(weights::sc_weight_fw, m)?)?; m.add_function(wrap_pyfunction!(weights::sc_weight_fw_with_convergence, m)?)?; + m.add_function(wrap_pyfunction!(weights::sc_weight_fw_weighted, m)?)?; + m.add_function(wrap_pyfunction!(weights::sc_weight_fw_weighted_with_convergence, m)?)?; // Linear algebra operations m.add_function(wrap_pyfunction!(linalg::solve_ols, m)?)?; diff --git a/rust/src/weights.rs b/rust/src/weights.rs index 4d39ea50..196999c7 100644 --- a/rust/src/weights.rs +++ b/rust/src/weights.rs @@ -356,6 +356,239 @@ fn sc_weight_fw_standard( converged } +/// Weighted Gram-path Frank-Wolfe loop for unit-weight survey-bootstrap. +/// +/// Identical to `sc_weight_fw_gram` except for the regularization term, which +/// uses a per-coordinate weight `reg_w[j]` so the objective becomes: +/// f(lam) = ||A·lam - b||² / N + ζ²·Σ_j reg_w[j] · lam[j]² +/// +/// Mathematical changes vs the unweighted loop (see PR #352 §2.2): +/// - half_grad[j] = ata_x[j] - atb[j] + eta · reg_w[j] · lam[j] +/// - d_x_norm_sq is replaced by `Σ_j reg_w[j] · d[j]²` (weighted quadratic +/// form), which simplifies to +/// `Σ_j reg_w[j]·lam[j]² + reg_w[i] - 2·reg_w[i]·lam[i]` +/// when `d = e_i - lam` (the FW direction toward vertex i). 
+///
+/// All control-flow and iteration semantics (incremental ata_x, periodic
+/// refresh, dual convergence checks, GRAM_REFRESH_INTERVAL) are preserved.
+fn sc_weight_fw_gram_weighted(
+    a: &ArrayView2<f64>,
+    b: &ArrayView1<f64>,
+    lam: &mut Array1<f64>,
+    reg_w: &ArrayView1<f64>,
+    eta: f64,
+    zeta: f64,
+    n: usize,
+    min_decrease_sq: f64,
+    max_iter: usize,
+) -> bool {
+    let t0 = lam.len();
+
+    let ata = a.t().dot(a);
+    let atb = a.t().dot(b);
+    let b_norm_sq = b.dot(b);
+
+    let mut ata_diag = Array1::zeros(t0);
+    for j in 0..t0 {
+        ata_diag[j] = ata[[j, j]];
+    }
+
+    let mut ata_x = ata.dot(lam);
+    let mut half_grad = Array1::zeros(t0);
+
+    let mut prev_val = f64::INFINITY;
+    let mut converged = false;
+
+    for t in 0..max_iter {
+        // Weighted regularization in the gradient
+        for j in 0..t0 {
+            half_grad[j] = ata_x[j] - atb[j] + eta * reg_w[j] * lam[j];
+        }
+
+        let i = argmin_f64(&half_grad);
+
+        // Weighted simplex direction norm: Σ_j reg_w[j]·d[j]² with d = e_i - lam
+        let lam_rw_norm_sq: f64 = lam.iter().zip(reg_w.iter()).map(|(&l, &w)| w * l * l).sum();
+        let d_x_w_norm_sq = lam_rw_norm_sq + reg_w[i] - 2.0 * reg_w[i] * lam[i];
+
+        // Weighted regularization in the objective
+        let lam_rw_norm_sq_for_obj = lam_rw_norm_sq;
+
+        if d_x_w_norm_sq < 1e-24 {
+            let xt_ata_x: f64 = ata_x.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum();
+            let atb_dot_lam: f64 = atb.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum();
+            let val = zeta * zeta * lam_rw_norm_sq_for_obj
+                + (xt_ata_x - 2.0 * atb_dot_lam + b_norm_sq) / n as f64;
+            if t >= 1 && prev_val - val < min_decrease_sq {
+                converged = true;
+                break;
+            }
+            prev_val = val;
+            continue;
+        }
+
+        let xt_ata_x: f64 = ata_x.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum();
+        let d_err_sq = ata_diag[i] - 2.0 * ata_x[i] + xt_ata_x;
+        let denom = d_err_sq + eta * d_x_w_norm_sq;
+        if denom <= 0.0 {
+            let atb_dot_lam: f64 = atb.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum();
+            let val = zeta * zeta * lam_rw_norm_sq_for_obj
+                + (xt_ata_x - 2.0 * atb_dot_lam + b_norm_sq) / n as f64;
+            if t >= 1 && prev_val - val < min_decrease_sq {
+                converged = true;
+                break;
+            }
+            prev_val = val;
+            continue;
+        }
+
+        let hg_dot_lam: f64 = half_grad.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum();
+        let hg_dot_dx = half_grad[i] - hg_dot_lam;
+        let step = (-hg_dot_dx / denom).max(0.0).min(1.0);
+
+        let one_minus_step = 1.0 - step;
+        for j in 0..t0 {
+            lam[j] *= one_minus_step;
+        }
+        lam[i] += step;
+
+        let ata_col_i = ata.column(i);
+        for j in 0..t0 {
+            ata_x[j] = one_minus_step * ata_x[j] + step * ata_col_i[j];
+        }
+        if t > 0 && t % GRAM_REFRESH_INTERVAL == 0 {
+            ata_x = ata.dot(lam as &Array1<f64>);
+        }
+
+        // Recompute weighted lam-norm after the in-place update
+        let lam_rw_norm_sq_new: f64 =
+            lam.iter().zip(reg_w.iter()).map(|(&l, &w)| w * l * l).sum();
+        let xt_ata_x: f64 = ata_x.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum();
+        let atb_dot_lam: f64 = atb.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum();
+        let val = zeta * zeta * lam_rw_norm_sq_new
+            + (xt_ata_x - 2.0 * atb_dot_lam + b_norm_sq) / n as f64;
+
+        if t >= 1 && prev_val - val < min_decrease_sq {
+            converged = true;
+            break;
+        }
+        prev_val = val;
+    }
+    converged
+}
+
+/// Weighted standard-path Frank-Wolfe loop for unit-weight survey-bootstrap.
+///
+/// Mirror of `sc_weight_fw_standard` with the same regularization-weight
+/// modifications described on `sc_weight_fw_gram_weighted`.
+fn sc_weight_fw_standard_weighted(
+    a: &ArrayView2<f64>,
+    b: &ArrayView1<f64>,
+    lam: &mut Array1<f64>,
+    reg_w: &ArrayView1<f64>,
+    eta: f64,
+    zeta: f64,
+    n: usize,
+    min_decrease_sq: f64,
+    max_iter: usize,
+) -> bool {
+    let t0 = lam.len();
+
+    let mut col_norms_sq = Array1::zeros(t0);
+    for j in 0..t0 {
+        let col = a.column(j);
+        col_norms_sq[j] = col.dot(&col);
+    }
+
+    let mut ax = a.dot(lam as &Array1<f64>);
+    let mut half_grad = Array1::zeros(t0);
+    let mut diff = Array1::zeros(n);
+
+    let mut prev_val = f64::INFINITY;
+    let mut converged = false;
+
+    for t in 0..max_iter {
+        // half_grad = A^T (Ax - b); add eta·reg_w·lam component-wise
+        diff.assign(&ax);
+        diff -= &*b;
+        general_mat_vec_mul(1.0, &a.t(), &diff, 0.0, &mut half_grad);
+        for j in 0..t0 {
+            half_grad[j] += eta * reg_w[j] * lam[j];
+        }
+
+        let i = argmin_f64(&half_grad);
+
+        let lam_rw_norm_sq: f64 = lam.iter().zip(reg_w.iter()).map(|(&l, &w)| w * l * l).sum();
+        let d_x_w_norm_sq = lam_rw_norm_sq + reg_w[i] - 2.0 * reg_w[i] * lam[i];
+
+        if d_x_w_norm_sq < 1e-24 {
+            let mut err_sq = 0.0;
+            for k in 0..n {
+                let e = ax[k] - b[k];
+                err_sq += e * e;
+            }
+            let val = zeta * zeta * lam_rw_norm_sq + err_sq / n as f64;
+            if t >= 1 && prev_val - val < min_decrease_sq {
+                converged = true;
+                break;
+            }
+            prev_val = val;
+            continue;
+        }
+
+        let col_i = a.column(i);
+        let col_dot_ax: f64 = col_i.iter().zip(ax.iter()).map(|(&a, &b)| a * b).sum();
+        let ax_dot_ax: f64 = ax.iter().map(|&v| v * v).sum();
+        let d_err_sq = col_norms_sq[i] - 2.0 * col_dot_ax + ax_dot_ax;
+
+        let denom = d_err_sq + eta * d_x_w_norm_sq;
+        if denom <= 0.0 {
+            let mut err_sq = 0.0;
+            for k in 0..n {
+                let e = ax[k] - b[k];
+                err_sq += e * e;
+            }
+            let val = zeta * zeta * lam_rw_norm_sq + err_sq / n as f64;
+            if t >= 1 && prev_val - val < min_decrease_sq {
+                converged = true;
+                break;
+            }
+            prev_val = val;
+            continue;
+        }
+
+        let hg_dot_lam: f64 = half_grad.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum();
+        let hg_dot_dx = half_grad[i] - hg_dot_lam;
+        let step = (-hg_dot_dx / denom).max(0.0).min(1.0);
+
+        let one_minus_step = 1.0 - step;
+        for j in 0..t0 {
+            lam[j] *= one_minus_step;
+        }
+        lam[i] += step;
+
+        for k in 0..n {
+            ax[k] = one_minus_step * ax[k] + step * col_i[k];
+        }
+
+        let mut err_sq = 0.0;
+        for k in 0..n {
+            let e = ax[k] - b[k];
+            err_sq += e * e;
+        }
+        let lam_rw_norm_sq_new: f64 =
+            lam.iter().zip(reg_w.iter()).map(|(&l, &w)| w * l * l).sum();
+        let val = zeta * zeta * lam_rw_norm_sq_new + err_sq / n as f64;
+
+        if t >= 1 && prev_val - val < min_decrease_sq {
+            converged = true;
+            break;
+        }
+        prev_val = val;
+    }
+    converged
+}
+
 /// Compute synthetic control weights via Frank-Wolfe optimization.
 ///
 /// Matches R's sc.weight.fw() from the synthdid package. Solves:
@@ -423,6 +656,83 @@ fn sc_weight_fw_internal(
     (lam, converged)
 }
 
+/// Weighted-regularization Frank-Wolfe dispatcher for SDID survey-bootstrap.
+///
+/// Identical pre-processing to `sc_weight_fw_internal` (column-centering for
+/// intercept, eta = N·ζ², init from uniform or warm-start) but dispatches to
+/// the `_weighted` loop variants which use `reg_weights` for per-coordinate
+/// regularization. Caller is responsible for supplying `reg_weights` of the
+/// same length as `lam` (T0).
+///
+/// When `reg_weights` is `None`, delegates to `sc_weight_fw_internal` —
+/// preserves the unweighted ABI for callers that share a generic dispatch
+/// site.
+fn sc_weight_fw_weighted_internal(
+    y: &ArrayView2<f64>,
+    zeta: f64,
+    intercept: bool,
+    init_weights: Option<&Array1<f64>>,
+    min_decrease: f64,
+    max_iter: usize,
+    reg_weights: Option<&Array1<f64>>,
+) -> (Array1<f64>, bool) {
+    // Default to the existing unweighted kernel when no reg weights provided.
+    let rw = match reg_weights {
+        Some(w) => w,
+        None => {
+            return sc_weight_fw_internal(y, zeta, intercept, init_weights, min_decrease, max_iter);
+        }
+    };
+
+    let t0 = y.ncols() - 1;
+    let n = y.nrows();
+
+    if t0 == 0 {
+        return (Array1::ones(1), true);
+    }
+
+    if rw.len() != t0 {
+        // Defensive: dimension mismatch — fall back to unweighted to avoid a
+        // panic from Rust-side callers. Python dispatch paths (both
+        // ``sc_weight_fw_weighted`` / ``sc_weight_fw_weighted_with_convergence``
+        // pyfunctions and ``diff_diff.utils._sc_weight_fw``) validate shapes
+        // and raise ``ValueError`` before reaching this branch (PR #355 R5 P2),
+        // so this fallback is unreachable from the Python API.
+        return sc_weight_fw_internal(y, zeta, intercept, init_weights, min_decrease, max_iter);
+    }
+
+    let y_owned: Array2<f64> = if intercept {
+        let col_means = y.mean_axis(Axis(0)).unwrap();
+        y - &col_means
+    } else {
+        y.to_owned()
+    };
+
+    let a = y_owned.slice(s![.., ..t0]);
+    let b = y_owned.column(t0);
+    let eta = n as f64 * zeta * zeta;
+
+    let mut lam = match init_weights {
+        Some(w) => w.clone(),
+        None => Array1::from_elem(t0, 1.0 / t0 as f64),
+    };
+
+    let min_decrease_sq = min_decrease * min_decrease;
+    let rw_view = rw.view();
+
+    let converged = if t0 < n {
+        sc_weight_fw_gram_weighted(
+            &a, &b, &mut lam, &rw_view, eta, zeta, n, min_decrease_sq, max_iter,
+        )
+    } else {
+        sc_weight_fw_standard_weighted(
+            &a, &b, &mut lam, &rw_view, eta, zeta, n, min_decrease_sq, max_iter,
+        )
+    };
+
+    (lam, converged)
+}
+
 /// Compute noise level from first-differences of control outcomes.
 ///
 /// Matches R's sd(apply(Y[1:N0, 1:T0], 1, diff)).
@@ -554,6 +864,106 @@ pub fn sc_weight_fw_with_convergence<'py>(
     Ok((result.to_pyarray(py), converged))
 }
 
+/// Weighted-regularization variant of `sc_weight_fw` for SDID survey-bootstrap.
+///
+/// Solves
+///     min_{ω on simplex} ||A·ω - b||² / N + ζ²·Σ_j reg_weights[j]·ω[j]²
+///
+/// where A is the first T0 columns of `y` (column-centered if `intercept` is
+/// true) and b is the last column. When `reg_weights` is `None`, delegates to
+/// the unweighted kernel — same numeric contract as `sc_weight_fw`.
+///
+/// Caller is responsible for any column-scaling of A by per-control survey
+/// weights to match the loss form
+///     Σ_t (Σ_i rw_i·ω_i·Y_i,pre[t] - b_t)²
+/// (i.e., pass `Y'` with first T0 columns column-scaled by `rw` and pass
+/// `reg_weights=rw`). See `compute_sdid_unit_weights_survey` in
+/// `diff_diff/utils.py` for the canonical caller.
+#[pyfunction]
+#[pyo3(signature = (y, zeta, intercept=true, init_weights=None, min_decrease=1e-5, max_iter=10000, reg_weights=None))]
+pub fn sc_weight_fw_weighted<'py>(
+    py: Python<'py>,
+    y: PyReadonlyArray2<'py, f64>,
+    zeta: f64,
+    intercept: bool,
+    init_weights: Option<PyReadonlyArray1<'py, f64>>,
+    min_decrease: f64,
+    max_iter: usize,
+    reg_weights: Option<PyReadonlyArray1<'py, f64>>,
+) -> PyResult<Bound<'py, PyArray1<f64>>> {
+    let y_arr = y.as_array();
+    let init = init_weights.map(|w| w.as_array().to_owned());
+    let rw = reg_weights.map(|w| w.as_array().to_owned());
+    // Validate reg_weights shape at the Python entry point so the Rust and
+    // NumPy backends share a single failure mode. The internal's defensive
+    // fallback to the unweighted kernel is kept for Rust-side callers but
+    // becomes unreachable from Python after this guard (PR #355 R5 P2).
+    if let Some(ref w) = rw {
+        let t0 = y_arr.ncols().saturating_sub(1);
+        if w.len() != t0 {
+            return Err(pyo3::exceptions::PyValueError::new_err(format!(
+                "reg_weights length {} does not match expected {} (Y.shape[1] - 1)",
+                w.len(),
+                t0
+            )));
+        }
+    }
+    let (result, _converged) = sc_weight_fw_weighted_internal(
+        &y_arr,
+        zeta,
+        intercept,
+        init.as_ref(),
+        min_decrease,
+        max_iter,
+        rw.as_ref(),
+    );
+    Ok(result.to_pyarray(py))
+}
+
+/// Weighted-regularization Frank-Wolfe with explicit convergence flag.
+///
+/// Identical numeric contract to `sc_weight_fw_weighted`; returns
+/// `(weights, converged)` so SDID `_bootstrap_se` can aggregate per-draw
+/// non-convergence into the existing summary `UserWarning` (PR #351 c0d089b
+/// shape).
+#[pyfunction]
+#[pyo3(signature = (y, zeta, intercept=true, init_weights=None, min_decrease=1e-5, max_iter=10000, reg_weights=None))]
+pub fn sc_weight_fw_weighted_with_convergence<'py>(
+    py: Python<'py>,
+    y: PyReadonlyArray2<'py, f64>,
+    zeta: f64,
+    intercept: bool,
+    init_weights: Option<PyReadonlyArray1<'py, f64>>,
+    min_decrease: f64,
+    max_iter: usize,
+    reg_weights: Option<PyReadonlyArray1<'py, f64>>,
+) -> PyResult<(Bound<'py, PyArray1<f64>>, bool)> {
+    let y_arr = y.as_array();
+    let init = init_weights.map(|w| w.as_array().to_owned());
+    let rw = reg_weights.map(|w| w.as_array().to_owned());
+    // See ``sc_weight_fw_weighted`` for why we validate here (PR #355 R5 P2).
+    if let Some(ref w) = rw {
+        let t0 = y_arr.ncols().saturating_sub(1);
+        if w.len() != t0 {
+            return Err(pyo3::exceptions::PyValueError::new_err(format!(
+                "reg_weights length {} does not match expected {} (Y.shape[1] - 1)",
+                w.len(),
+                t0
+            )));
+        }
+    }
+    let (result, converged) = sc_weight_fw_weighted_internal(
+        &y_arr,
+        zeta,
+        intercept,
+        init.as_ref(),
+        min_decrease,
+        max_iter,
+        rw.as_ref(),
+    );
+    Ok((result.to_pyarray(py), converged))
+}
+
 /// Compute SDID time weights via Frank-Wolfe optimization.
 ///
 /// Matches R's synthdid: sc.weight.fw(Yc[1:N0, ], zeta=zeta.lambda, intercept=TRUE)
@@ -1009,4 +1419,97 @@ mod tests {
         assert!((sum_std - 1.0).abs() < 1e-6, "Standard intercept=false: weights should sum to 1, got {}", sum_std);
         assert!(result_std.iter().all(|&w| w >= -1e-6), "Standard intercept=false: weights should be non-negative");
     }
+
+    // -------------------------------------------------------------------------
+    // Weighted FW kernel — survey-bootstrap support
+    // -------------------------------------------------------------------------
+
+    #[test]
+    fn test_weighted_fw_reg_weights_none_delegates() {
+        // reg_weights=None → must produce identical result to the unweighted
+        // kernel (bit-identity, since the weighted dispatcher delegates).
+        let vals: Vec<f64> = (0..120).map(|i| ((i * 7 + 3) % 97) as f64 / 97.0).collect();
+        let y = Array2::from_shape_vec((20, 6), vals).unwrap();
+
+        let (unweighted, conv_unweighted) =
+            sc_weight_fw_internal(&y.view(), 0.3, true, None, 1e-5, 10000);
+        let (weighted, conv_weighted) = sc_weight_fw_weighted_internal(
+            &y.view(), 0.3, true, None, 1e-5, 10000, None,
+        );
+
+        assert_eq!(conv_unweighted, conv_weighted, "convergence flags differ");
+        for j in 0..unweighted.len() {
+            assert!(
+                (unweighted[j] - weighted[j]).abs() < 1e-14,
+                "reg_weights=None must delegate to unweighted; mismatch at {}: {} vs {}",
+                j, unweighted[j], weighted[j]
+            );
+        }
+    }
+
+    #[test]
+    fn test_weighted_fw_uniform_reg_weights_matches_unweighted() {
+        // With reg_weights = ones(t0), the weighted regularization reduces to
+        // ζ²·Σ ω² which is exactly the unweighted reg. The two kernels should
+        // agree to within machine precision (the weighted loop recomputes
+        // lam_norm via the weighted code path each iteration so float
+        // ordering can introduce tiny ULP-scale drift — stay at rel=1e-12).
+        let vals: Vec<f64> = (0..120).map(|i| ((i * 11 + 5) % 73) as f64 / 73.0).collect();
+        let y = Array2::from_shape_vec((20, 6), vals).unwrap();
+        let rw = Array1::from_elem(5, 1.0); // t0 = ncols - 1 = 5
+
+        let (unweighted, _) = sc_weight_fw_internal(&y.view(), 0.3, true, None, 1e-7, 10000);
+        let (weighted, _) = sc_weight_fw_weighted_internal(
+            &y.view(), 0.3, true, None, 1e-7, 10000, Some(&rw),
+        );
+
+        for j in 0..unweighted.len() {
+            assert!(
+                (unweighted[j] - weighted[j]).abs() < 1e-12,
+                "uniform reg_weights must match unweighted at {}: {} vs {}",
+                j, unweighted[j], weighted[j]
+            );
+        }
+    }
+
+    #[test]
+    fn test_weighted_fw_simplex_invariants() {
+        // For arbitrary positive rw, the returned ω must lie on the standard
+        // simplex (sums to 1, non-negative). Exercise both gram (T0 < N) and
+        // standard (T0 >= N) paths.
+        let vals_gram: Vec<f64> =
+            (0..120).map(|i| ((i * 13 + 7) % 89) as f64 / 89.0).collect();
+        let y_gram = Array2::from_shape_vec((20, 6), vals_gram).unwrap();
+        let rw_gram = Array1::from_vec(vec![0.5, 1.0, 1.5, 2.0, 0.8]);
+
+        let (omega_gram, _) = sc_weight_fw_weighted_internal(
+            &y_gram.view(), 0.4, true, None, 1e-6, 10000, Some(&rw_gram),
+        );
+        let sum_gram: f64 = omega_gram.sum();
+        assert!(
+            (sum_gram - 1.0).abs() < 1e-6,
+            "gram weighted-FW: ω must sum to 1, got {}", sum_gram
+        );
+        assert!(
+            omega_gram.iter().all(|&w| w >= -1e-9),
+            "gram weighted-FW: ω must be non-negative"
+        );
+
+        let vals_std: Vec<f64> = (0..45).map(|i| ((i * 17 + 3) % 53) as f64 / 53.0).collect();
+        let y_std = Array2::from_shape_vec((5, 9), vals_std).unwrap();
+        let rw_std = Array1::from_vec(vec![1.0, 0.5, 1.5, 2.0, 0.7, 1.2, 0.9, 1.3]);
+
+        let (omega_std, _) = sc_weight_fw_weighted_internal(
+            &y_std.view(), 0.5, true, None, 1e-6, 10000, Some(&rw_std),
+        );
+        let sum_std: f64 = omega_std.sum();
+        assert!(
+            (sum_std - 1.0).abs() < 1e-6,
+            "standard weighted-FW: ω must sum to 1, got {}", sum_std
+        );
+        assert!(
+            omega_std.iter().all(|&w| w >= -1e-9),
+            "standard weighted-FW: ω must be non-negative"
+        );
+    }
 }
diff --git a/tests/test_methodology_sdid.py b/tests/test_methodology_sdid.py
index 064466a6..e43ebc8e 100644
--- a/tests/test_methodology_sdid.py
+++ b/tests/test_methodology_sdid.py
@@ -595,45 +595,59 @@ def test_bootstrap_se_tracks_placebo_se_exchangeable(self, ci_params):
             f"{r_placebo.se:.6f} on exchangeable DGP (rel diff {rel_diff:.4f})"
         )
 
-    def test_bootstrap_raises_on_pweight_survey(self):
-        """Survey + bootstrap raises NotImplementedError (pweight-only path).
+    def test_bootstrap_succeeds_on_pweight_survey(self):
+        """Survey + bootstrap succeeds (pweight-only) (PR #352).
 
-        Rao-Wu rescaled weights composed with paper-faithful Frank-Wolfe
-        re-estimation is a separate derivation that is not yet implemented;
-        the guard lives upstream in ``fit()`` before the bootstrap dispatcher.
+        The weighted-FW path treats per-control survey weights as
+        constant per draw (no Rao-Wu rescaling, since no PSU). ATT is
+        finite, SE is positive, variance_method is preserved.
         """
         from diff_diff.survey import SurveyDesign
 
         df = _make_panel(n_control=20, n_treated=3, seed=42)
         df["wt"] = 1.0
-        with pytest.raises(NotImplementedError, match="bootstrap"):
-            SyntheticDiD(
-                variance_method="bootstrap", n_bootstrap=50, seed=1
-            ).fit(
-                df, outcome="outcome", treatment="treated",
-                unit="unit", time="period",
-                post_periods=[5, 6, 7],
-                survey_design=SurveyDesign(weights="wt"),
-            )
+        result = SyntheticDiD(
+            variance_method="bootstrap", n_bootstrap=50, seed=1
+        ).fit(
+            df, outcome="outcome", treatment="treated",
+            unit="unit", time="period",
+            post_periods=[5, 6, 7],
+            survey_design=SurveyDesign(weights="wt"),
+        )
+        assert np.isfinite(result.att)
+        assert np.isfinite(result.se)
+        assert result.se > 0
+        assert result.variance_method == "bootstrap"
 
-    def test_bootstrap_raises_on_full_design_survey(self):
-        """Survey + bootstrap with strata/PSU raises NotImplementedError.
+ def test_bootstrap_succeeds_on_full_design_survey(self): + """Survey + bootstrap with strata/PSU succeeds via Rao-Wu (PR #352). - No SDID variance method currently supports strata/PSU/FPC; the guard - on line ~300 of ``fit()`` rejects the combination unconditionally. + Composes per-draw Rao-Wu rescaled weights with the weighted-FW + helpers (compute_sdid_unit_weights_survey / + compute_time_weights_survey). Asserts finite SE and survey_metadata + records the strata/PSU layout. """ from diff_diff.survey import SurveyDesign df = _make_panel(n_control=20, n_treated=3, seed=42) df["wt"] = 1.0 df["stratum"] = df["unit"] % 2 - with pytest.raises(NotImplementedError, match="strata"): - SyntheticDiD( - variance_method="bootstrap", n_bootstrap=50, seed=1 - ).fit( - df, outcome="outcome", treatment="treated", - unit="unit", time="period", - post_periods=[5, 6, 7], - survey_design=SurveyDesign(weights="wt", strata="stratum"), - ) + # Globally unique PSU labels (avoids needing nest=True). Each unit + # gets its own PSU id; stratum partitions the PSUs into two groups. + df["psu"] = df["unit"] + result = SyntheticDiD( + variance_method="bootstrap", n_bootstrap=50, seed=1 + ).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt", strata="stratum", psu="psu"), + ) + assert np.isfinite(result.att) + assert np.isfinite(result.se) + assert result.se > 0 + assert result.variance_method == "bootstrap" + assert result.survey_metadata is not None + assert result.survey_metadata.n_strata is not None + assert result.survey_metadata.n_psu is not None def test_bootstrap_summary_shows_replications(self, ci_params): """result.summary() shows "Bootstrap replications" line for bootstrap. 
@@ -657,6 +671,475 @@ def test_bootstrap_summary_shows_replications(self, ci_params):
         assert "Bootstrap replications" in summary
         assert str(n_boot) in summary
+    def test_bootstrap_pweight_only_retries_zero_treated_mass_draws(self):
+        """Pweight-only bootstrap: zero-mass treated draws must be retried
+        rather than silently falling back to an unweighted treated mean
+        (PR #355 R2 P0).
+
+        Regression: prior to R2, when a bootstrap draw's treated units all
+        had survey weight 0 — reachable when at least one treated unit has
+        pweight 0 AND the resample picks only those units — the code fell
+        through to ``np.mean(Y_boot_pre_t, axis=1)``. That silently
+        dropped survey weighting on that draw while the fit-time ATT uses
+        the survey-weighted treated mean, corrupting the bootstrap
+        distribution used for SE/p-value/CI.
+
+        Fix: extend the degenerate-retry to any survey branch where
+        ``rw_treated_draw.sum() == 0``. This test constructs a panel
+        where one of three treated units has weight 0; bootstrap must
+        still produce a valid finite SE, and the draws that hit the
+        zero-mass condition are retried rather than getting a wrong τ.
+        """
+        from diff_diff.survey import SurveyDesign
+        df = _make_panel(n_control=25, n_treated=3, seed=42)
+        # Give one treated unit weight 0 and the other two weight 1.
+        # Controls keep weight 1 (any positive value works). Treated units
+        # are 25, 26, 27; set 25 → weight 0, 26 and 27 → weight 1.
+ df["wt"] = 1.0 + df.loc[df["unit"] == 25, "wt"] = 0.0 + result = SyntheticDiD( + variance_method="bootstrap", n_bootstrap=100, seed=1 + ).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt"), + ) + assert np.isfinite(result.att), f"att not finite: {result.att}" + assert np.isfinite(result.se), f"se not finite: {result.se}" + assert result.se > 0, f"se={result.se} must be positive" + # Cross-surface: the variance_method label should still be + # bootstrap and the bootstrap replications line should render. + assert result.variance_method == "bootstrap" + + def test_bootstrap_full_design_without_explicit_weights(self): + """SurveyDesign(strata=..., psu=..., weights=None) fits successfully. + + Regression for PR #355 R1 code-quality finding: `SurveyDesign` allows + `weights=None` (resolve() synthesizes unit weights of 1), but the + SDID helper `_extract_unit_survey_weights` used to index + `survey_design.weights` directly and would fail before bootstrap + could run. The helper now returns ones for this configuration. + """ + from diff_diff.survey import SurveyDesign + df = _make_panel(n_control=20, n_treated=3, seed=42) + df["stratum"] = df["unit"] % 2 + df["psu"] = df["unit"] + result = SyntheticDiD( + variance_method="bootstrap", n_bootstrap=50, seed=1 + ).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(strata="stratum", psu="psu"), # weights=None + ) + assert np.isfinite(result.att) + assert np.isfinite(result.se) + assert result.se > 0 + assert result.variance_method == "bootstrap" + assert result.survey_metadata is not None + assert result.survey_metadata.n_strata is not None + assert result.survey_metadata.n_psu is not None + + def test_bootstrap_full_design_rao_wu_boot_idx_slice(self, monkeypatch): + """Full-design bootstrap slices Rao-Wu weights by ``boot_idx``. 
+ + Documented in REGISTRY.md §SyntheticDiD ``Note (survey + bootstrap + composition)``: the hybrid path first performs unit-level pairs- + bootstrap (``boot_idx = rng.choice(n_total)``) and THEN slices + the Rao-Wu rescaled weights over the resampled units. Monkeypatch + ``generate_rao_wu_weights`` to return a known vector and capture + the ``rw_control`` fed into ``compute_sdid_unit_weights_survey``; + assert the captured vector matches the expected slice + ``known_rw[:n_control][boot_idx[boot_is_control]]``. + + Regression against a subtle class of bug where either the slice + index arithmetic or the Rao-Wu call site could drift (e.g., + someone refactors ``resolved_survey_unit`` indexing to skip the + boot_idx slicing, or the rw-then-slice order gets swapped to + slice-then-rw). Both would silently produce wrong bootstrap SE. + """ + from diff_diff import utils as dd_utils + from diff_diff import synthetic_did as sdid_mod + from diff_diff.survey import SurveyDesign + + df = _make_panel(n_control=15, n_treated=3, seed=42) + df["wt"] = 1.0 + df["stratum"] = df["unit"] % 2 + df["psu"] = df["unit"] + + # Known Rao-Wu weight vector. Length = n_total = 18; distinct + # values per unit so a slice of the first n_control=15 positions + # by boot_idx[boot_is_control] is identifiable. 
+ n_total = 18 + known_rw = np.arange(1, n_total + 1, dtype=np.float64) + + def fake_rao_wu(resolved_survey, rng): + return known_rw.copy() + + monkeypatch.setattr(sdid_mod, "generate_rao_wu_weights", fake_rao_wu) + + captured: list = [] + + real_helper = dd_utils.compute_sdid_unit_weights_survey + + def capturing_helper(Y_pre_c, Y_pre_t_mean, rw, *args, **kwargs): + captured.append(np.array(rw, copy=True)) + return real_helper(Y_pre_c, Y_pre_t_mean, rw, *args, **kwargs) + + monkeypatch.setattr( + sdid_mod, "compute_sdid_unit_weights_survey", capturing_helper + ) + + bootstrap_seed = 1 + SyntheticDiD( + variance_method="bootstrap", n_bootstrap=10, seed=bootstrap_seed, + ).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt", strata="stratum", psu="psu"), + ) + + # Exact-equality check against a reproduced RNG stream (PR #355 R4 + # P3). The captured rw vectors must match known_rw[:n_control] + # sliced by boot_idx[boot_is_control] value-for-value. Reproducing + # the bootstrap's rng externally works because: + # - fake_rao_wu does NOT consume rng (just returns known_rw), + # so the only per-draw rng advance is ``rng.choice(n_total, ...)`` + # which yields boot_idx; + # - known_rw is strictly positive, so the zero-mass retry branch + # (synthetic_did.py ``_bootstrap_se``) never fires; + # - a 15/3 split makes the no-control and all-control retries + # vanishingly rare. + # An exact-equality regression catches the sibling bugs the old + # range check missed: permuted indices, deduplicated boot_idx, or + # substituted ``resolved_survey_unit.weights`` lookup in place of + # the known_rw slice — any of which would silently change + # bootstrap SE. 
+ n_control = 15 + rng_sim = np.random.default_rng(bootstrap_seed) + expected_slices = [] + while len(expected_slices) < len(captured): + boot_idx = rng_sim.choice(n_total, size=n_total, replace=True) + boot_is_control = boot_idx < n_control + n_co_b = int(boot_is_control.sum()) + if n_co_b == 0 or n_co_b == n_total: + continue + expected_slices.append(known_rw[:n_control][boot_idx[boot_is_control]]) + + assert len(captured) >= 1, "no FW calls captured — survey dispatch broken" + for i, (rw_captured, rw_expected) in enumerate( + zip(captured, expected_slices) + ): + np.testing.assert_array_equal( + rw_captured, + rw_expected, + err_msg=( + f"draw {i}: captured rw_control differs from expected " + f"known_rw[:n_control][boot_idx[boot_is_control]]. " + "Regression in hybrid pairs-bootstrap + Rao-Wu " + "slice ordering." + ), + ) + + def test_fit_raises_on_zero_total_treated_survey_mass(self): + """Fit-time positive-mass guard: zero treated survey mass raises. + + ``SurveyDesign.resolve()`` accepts non-negative unit weights + (``survey.py`` L171-L176), so a user can legitimately assign unit + survey weights of 0 to every treated unit — encoding an + unidentified target population. Without the front-door guard, the + fit-time survey-weighted ATT (``np.average(Y, weights=w_treated)``) + would hit ``0/0`` and silently propagate NaN into the bootstrap + loop, defeating the per-draw zero-mass retry (PR #355 R2 P0). + Regression against PR #355 R7 P1: the guard must fire before the + bootstrap is even dispatched. + """ + from diff_diff.survey import SurveyDesign + + df = _make_panel(n_control=10, n_treated=3, seed=42) + # Every treated unit gets weight 0; controls keep positive weight. 
+ df["wt"] = np.where(df["treated"] == 1, 0.0, 1.0) + with pytest.raises(ValueError, match=r"treated arm has zero total mass"): + SyntheticDiD(variance_method="bootstrap", n_bootstrap=20, seed=1).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt"), + ) + + def test_fit_raises_on_zero_total_control_survey_mass(self): + """Fit-time positive-mass guard: zero control survey mass raises. + + Mirror of the treated-arm case (PR #355 R7 P1). Downstream + ``omega_eff = unit_weights * w_control / (unit_weights * w_control).sum()`` + would hit 0/0; the guard front-doors. + """ + from diff_diff.survey import SurveyDesign + + df = _make_panel(n_control=10, n_treated=3, seed=42) + df["wt"] = np.where(df["treated"] == 0, 0.0, 1.0) + with pytest.raises(ValueError, match=r"control arm has zero total mass"): + SyntheticDiD(variance_method="bootstrap", n_bootstrap=20, seed=1).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt"), + ) + + def test_fit_raises_on_zero_treated_mass_under_full_design(self): + """Fit-time positive-mass guard fires under full strata/PSU/FPC too. + + The guard sources w_control / w_treated from the **resolved + unit-level** design (PR #355 R4 P0), so zero total treated mass + under a strata/PSU/FPC configuration must fire the same front-door + ValueError as the pweight-only case (PR #355 R7 P1). 
+ """ + from diff_diff.survey import SurveyDesign + + df = _make_panel(n_control=10, n_treated=3, seed=42) + df["wt"] = np.where(df["treated"] == 1, 0.0, 1.0) + df["stratum"] = df["unit"] % 2 + df["psu"] = df["unit"] + with pytest.raises(ValueError, match=r"treated arm has zero total mass"): + SyntheticDiD(variance_method="bootstrap", n_bootstrap=20, seed=1).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt", strata="stratum", psu="psu"), + ) + + def test_fit_raises_on_zero_effective_control_support(self, monkeypatch): + """Fit-time guard: FW mass concentrated on zero-weight controls raises. + + Zero survey weights are valid input (``survey.py`` L171-L176 + rejects only negatives), and ``compute_sdid_unit_weights`` + sparsifies ``unit_weights`` to exact zeros by design. If FW + concentrates on a control whose survey weight is 0, the composed + ``omega_eff = unit_weights * w_control`` has zero total mass and + the subsequent normalization hits ``0/0``, silently propagating + NaN into the ATT / SE / CI. The R7 P1 guard + (``w_control.sum() > 0``) is not sufficient here because at least + one control has positive weight — the degeneracy is in the + composed vector, not the raw survey mass. + + Regression against PR #355 R12 P1: the fit-time guard must raise + a targeted ``ValueError`` before the normalization step. + + We monkeypatch ``compute_sdid_unit_weights`` to return a + canonical sparse unit-weight vector concentrated on the first + control, bypassing FW convergence dynamics. This makes the test + about the guard contract (behavior when the degenerate + configuration is reached), not about how easy it is to force FW + into that configuration on synthetic data. 
+ """ + from diff_diff import synthetic_did as sdid_mod + from diff_diff.survey import SurveyDesign + + df = _make_panel(n_control=5, n_treated=2, seed=42) + # First control gets survey weight 0; others get weight 1. + unique_units = np.sort(df["unit"].unique()) + treated_ids = set(df.loc[df["treated"] == 1, "unit"].unique()) + control_ids = [u for u in unique_units if u not in treated_ids] + zero_weight_unit = control_ids[0] + df["wt"] = np.where(df["unit"] == zero_weight_unit, 0.0, 1.0).astype(float) + + # Force unit_weights to concentrate on control_ids[0]. The call + # sites in fit() use _create_outcome_matrices with control_units + # in sorted order (same order _make_panel produces), so index 0 + # of the unit-weights vector corresponds to control_ids[0]. + def sparse_unit_weights(Y_pre_control, *args, **kwargs): + n_ctrl = Y_pre_control.shape[1] + w = np.zeros(n_ctrl) + w[0] = 1.0 + return w + + monkeypatch.setattr( + sdid_mod, "compute_sdid_unit_weights", sparse_unit_weights + ) + + with pytest.raises(ValueError, match=r"unidentified|zero survey weight"): + SyntheticDiD(variance_method="placebo", n_bootstrap=20, seed=1).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt"), + ) + + def test_fit_raises_on_zero_effective_control_support_full_design(self, monkeypatch): + """Fit-time effective-control guard fires under strata/PSU/FPC too. + + Mirror of the pweight-only zero-effective-control case (PR #355 + R12 P1) under a full survey design. The guard is part of the + ``w_control is not None`` branch, which fires for both + pweight-only and strata/PSU/FPC fits. 
+ """ + from diff_diff import synthetic_did as sdid_mod + from diff_diff.survey import SurveyDesign + + df = _make_panel(n_control=5, n_treated=2, seed=42) + unique_units = np.sort(df["unit"].unique()) + treated_ids = set(df.loc[df["treated"] == 1, "unit"].unique()) + control_ids = [u for u in unique_units if u not in treated_ids] + zero_weight_unit = control_ids[0] + df["wt"] = np.where(df["unit"] == zero_weight_unit, 0.0, 1.0).astype(float) + df["stratum"] = df["unit"] % 2 + df["psu"] = df["unit"] + + def sparse_unit_weights(Y_pre_control, *args, **kwargs): + n_ctrl = Y_pre_control.shape[1] + w = np.zeros(n_ctrl) + w[0] = 1.0 + return w + + monkeypatch.setattr( + sdid_mod, "compute_sdid_unit_weights", sparse_unit_weights + ) + + with pytest.raises(ValueError, match=r"unidentified|zero survey weight"): + SyntheticDiD(variance_method="bootstrap", n_bootstrap=20, seed=1).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt", strata="stratum", psu="psu"), + ) + + def test_fit_raises_on_implicit_psu_fpc_below_unit_count_unstratified(self): + """Fit-time FPC validation fires when psu=None and FPC < n_units. + + When ``SurveyDesign(fpc=...)`` is declared without an explicit + ``psu=``, SDID Rao-Wu (via ``bootstrap_utils.generate_rao_wu_weights`` + L654-L655) treats each unit as its own PSU. The helper rejects + ``FPC < n_PSU`` mid-draw (``bootstrap_utils.py`` L684-L688); without + a front-door guard, every bootstrap draw raises, ``_bootstrap_se`` + swallows the ``ValueError`` in its retry loop, and the user sees a + generic bootstrap-exhaustion error. Regression against PR #355 R8 + P1: ``fit()`` must validate FPC vs unit count before dispatching + the bootstrap. + """ + from diff_diff.survey import SurveyDesign + + df = _make_panel(n_control=10, n_treated=3, seed=42) + df["wt"] = 1.0 + # 13 units total; FPC says population size is 5 — infeasible for + # 13 implicit PSUs. 
+ df["fpc_pop"] = 5.0 + with pytest.raises(ValueError, match=r"FPC.*less than the number of units"): + SyntheticDiD(variance_method="bootstrap", n_bootstrap=20, seed=1).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt", fpc="fpc_pop"), + ) + + def test_fit_raises_on_implicit_psu_fpc_below_stratum_unit_count(self): + """Fit-time FPC validation fires per stratum under implicit PSU. + + Mirror of the unstratified case but with strata present. Each + stratum's FPC must be >= its unit count (PR #355 R8 P1). + """ + from diff_diff.survey import SurveyDesign + + df = _make_panel(n_control=12, n_treated=4, seed=42) + df["wt"] = 1.0 + df["stratum"] = df["unit"] % 2 # 2 strata, ~8 units each + # FPC says 3 per stratum — infeasible for 8 implicit PSUs/stratum. + df["fpc_pop"] = 3.0 + with pytest.raises(ValueError, match=r"FPC.*less than the number of units in that stratum"): + SyntheticDiD(variance_method="bootstrap", n_bootstrap=20, seed=1).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign( + weights="wt", strata="stratum", fpc="fpc_pop", + ), + ) + + def test_bootstrap_scale_invariance_under_pweight_rescaling(self): + """Survey-bootstrap SE / p / CI are invariant to a global pweight rescaling. + + ``SurveyDesign.resolve()`` normalizes pweights/aweights to mean=1 + (survey.py L189-L203), which is the library's scale-invariance + contract for survey-weighted fits. This test fits the same SDID + panel under two SurveyDesigns — weights column ``"wt"`` vs a + 10x-rescaled copy ``"wt_scaled"`` — and asserts bootstrap SE, + p-value, and CI agree to machine-epsilon tolerance. + + Regression against PR #355 R4 P0: the initial PR #352 pweight-only + bootstrap branch bypassed the resolved (normalized) unit-level + weights and fed raw panel-column weights into the weighted-FW + objective. 
That objective is NOT invariant to a global rescale + of rw — the loss term scales as rw^2 (``A-tilde = A * diag(rw)``) + while the reg term scales as rw (``zeta^2 * sum rw * omega^2``) — + so any user who rescaled their pweight column (e.g. switched + units) would see silently different SEs. The fix + (synthetic_did.py ``fit()`` around the ``resolved_survey`` block) + sources ``w_control`` and ``w_treated`` from + ``resolved_survey_unit.weights`` (post-normalization) rather + than re-extracting raw weights via ``_extract_unit_survey_weights``. + Tolerance is machine-epsilon tight because floating-point multiply- + reduce ordering inside ``raw * (n / (c*raw_sum))`` vs + ``raw * (n / raw_sum)`` can drift by ~1 ULP; a raw-weight fallback + would produce differences on the order of 1 or larger. + """ + from diff_diff.survey import SurveyDesign + + df = _make_panel(n_control=12, n_treated=3, seed=42) + unique_units = np.sort(df["unit"].unique()) + unit_weights = np.linspace(0.5, 2.5, len(unique_units)) + wt_map = dict(zip(unique_units, unit_weights)) + df["wt"] = df["unit"].map(wt_map) + df["wt_scaled"] = df["wt"] * 10.0 + + kwargs = dict( + outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + ) + result_base = SyntheticDiD( + variance_method="bootstrap", n_bootstrap=50, seed=1, + ).fit(df, survey_design=SurveyDesign(weights="wt"), **kwargs) + result_scaled = SyntheticDiD( + variance_method="bootstrap", n_bootstrap=50, seed=1, + ).fit(df, survey_design=SurveyDesign(weights="wt_scaled"), **kwargs) + + assert np.isfinite(result_base.se) and result_base.se > 0 + np.testing.assert_allclose( + result_scaled.se, result_base.se, rtol=1e-13, atol=0, + err_msg="bootstrap SE is not invariant to pweight global rescaling", + ) + np.testing.assert_allclose( + result_scaled.p_value, result_base.p_value, rtol=1e-12, atol=1e-14, + err_msg="bootstrap p-value is not invariant to pweight global rescaling", + ) + np.testing.assert_allclose( + 
result_scaled.conf_int, result_base.conf_int, rtol=1e-13, atol=0, + err_msg="bootstrap CI is not invariant to pweight global rescaling", + ) + + def test_bootstrap_single_psu_returns_nan(self): + """Unstratified single-PSU survey design returns NaN SE (PR #352). + + Resampling one PSU yields the same subset every draw, so the + bootstrap distribution is degenerate and the SE is unidentified. + ``_bootstrap_se`` short-circuits to (NaN, []) for this case rather + than emitting a misleading SE estimate. Recovered from the pre-PR + #351 fixed-weight Rao-Wu branch (commit 91082e5). + """ + from diff_diff.survey import SurveyDesign + df = _make_panel(n_control=20, n_treated=3, seed=42) + df["wt"] = 1.0 + df["psu_single"] = 0 # all units in the same PSU; no strata + sdid = SyntheticDiD(variance_method="bootstrap", n_bootstrap=50, seed=1) + result = sdid.fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt", psu="psu_single"), + ) + assert np.isnan(result.se), \ + f"single-PSU unstratified bootstrap should return NaN SE, got {result.se}" + def test_bootstrap_fw_nonconvergence_warning_fires_under_rust(self, monkeypatch): """Aggregate FW non-convergence warning surfaces on the Rust backend. @@ -2949,12 +3432,14 @@ def test_cross_period_ramping_trend(self): class TestCoverageMCArtifact: """Schema smoke-check on ``benchmarks/data/sdid_coverage.json``. - The full Monte Carlo study (500 seeds × B=200 × 3 DGPs × 3 methods) - runs outside CI; its JSON output underwrites the calibration table in - REGISTRY.md §SyntheticDiD. This test verifies the artifact is present - and structured correctly. Per ``feedback_golden_file_pytest_skip.md``, - skip if missing — CI's isolated-install job copies only ``tests/``, - not ``benchmarks/``. 
+    The full Monte Carlo study (500 seeds × B=200 × 4 DGPs × 3 methods,
+    PR #352) runs outside CI; its JSON output underwrites the calibration
+    table in REGISTRY.md §SyntheticDiD. The 4th DGP (``stratified_survey``)
+    is bootstrap-only — placebo / jackknife reject strata/PSU/FPC at
+    fit-time. This test verifies the artifact is present and structured
+    correctly. Per ``feedback_golden_file_pytest_skip.md``, skip if
+    missing — CI's isolated-install job copies only ``tests/``, not
+    ``benchmarks/``.
     """
 
     def test_coverage_artifacts_present(self):
@@ -2991,7 +3476,7 @@ def test_coverage_artifacts_present(self):
                 "enum value are both gone."
             )
 
-        for dgp in ("balanced", "unbalanced", "aer63"):
+        for dgp in ("balanced", "unbalanced", "aer63", "stratified_survey"):
            assert dgp in payload["per_dgp"], f"missing DGP block: {dgp}"
            per_method = payload["per_dgp"][dgp]
            for method in ("placebo", "bootstrap", "jackknife"):
@@ -3008,3 +3493,29 @@
                 assert alpha_key in block["rejection_rate"], (
                     f"missing alpha {alpha_key} in {dgp}/{method} rejection_rate"
                 )
+
+        # PR #352: stratified_survey is bootstrap-only — placebo and
+        # jackknife reject strata/PSU/FPC at fit-time, so their blocks
+        # report n_successful_fits=0. Bootstrap should land at or near
+        # the full 500 successful fits; the assertion gates at ≥100 so
+        # rare degenerate-draw skips don't flake it. Rejection at α=0.05
+        # must sit inside the calibration gate [0.02, 0.10].
+        survey_block = payload["per_dgp"]["stratified_survey"]
+        assert survey_block["bootstrap"]["n_successful_fits"] >= 100, (
+            "stratified_survey bootstrap must have ≥100 successful fits; "
+            "the survey-bootstrap path is broken if this drops to 0."
+        )
+        rej_05 = survey_block["bootstrap"]["rejection_rate"]["0.05"]
+        assert 0.02 <= rej_05 <= 0.10, (
+            f"stratified_survey bootstrap α=0.05 rejection {rej_05} outside "
+            "calibration gate [0.02, 0.10]; weighted FW + Rao-Wu is "
+            "miscalibrated. See PR #352 §3c rollback protocol."
+ ) + assert survey_block["placebo"]["n_successful_fits"] == 0, ( + "stratified_survey placebo should have 0 successful fits " + "(strata/PSU/FPC raises NotImplementedError at fit-time)" + ) + assert survey_block["jackknife"]["n_successful_fits"] == 0, ( + "stratified_survey jackknife should have 0 successful fits " + "(strata/PSU/FPC raises NotImplementedError at fit-time)" + ) + diff --git a/tests/test_survey_phase5.py b/tests/test_survey_phase5.py index 09e6ed47..46e85373 100644 --- a/tests/test_survey_phase5.py +++ b/tests/test_survey_phase5.py @@ -176,25 +176,37 @@ def test_survey_metadata_fields(self, sdid_survey_data, survey_design_weights): assert sm.effective_n > 0 assert sm.design_effect > 0 - def test_full_design_bootstrap_raises(self, sdid_survey_data, survey_design_full): - """Full survey design (strata/PSU) with bootstrap raises NotImplementedError. - - The previous fixed-weight bootstrap accepted strata/PSU/FPC via Rao-Wu - rescaling, but that path was not paper-faithful and was removed. Rao-Wu - composed with paper-faithful refit Frank-Wolfe requires a separate - derivation (tracked in TODO.md, sketched in REGISTRY.md §SyntheticDiD). + def test_full_design_bootstrap_succeeds(self, sdid_survey_data, survey_design_full): + """Full survey design (strata/PSU) with bootstrap succeeds (PR #352). + + Restored capability: composes Rao-Wu rescaled weights with the + weighted-Frank-Wolfe variant per draw (see REGISTRY.md §SyntheticDiD + survey + bootstrap composition Note). 
Asserts: + - finite SE > 0 + - survey_metadata populated with n_strata / n_psu + - result.summary() round-trips without error (cross-surface guard) """ est = SyntheticDiD(variance_method="bootstrap", n_bootstrap=50, seed=42) - with pytest.raises(NotImplementedError, match="does not yet support"): - est.fit( - sdid_survey_data, - outcome="outcome", - treatment="treated", - unit="unit", - time="time", - post_periods=[6, 7, 8, 9], - survey_design=survey_design_full, - ) + result = est.fit( + sdid_survey_data, + outcome="outcome", + treatment="treated", + unit="unit", + time="time", + post_periods=[6, 7, 8, 9], + survey_design=survey_design_full, + ) + assert np.isfinite(result.att) + assert np.isfinite(result.se) + assert result.se > 0 + assert result.survey_metadata is not None + assert result.survey_metadata.n_strata is not None + assert result.survey_metadata.n_psu is not None + # summary() must render the bootstrap replications line and the + # survey design block without exception. + summary = result.summary() + assert "Survey Design" in summary + assert "Bootstrap replications" in summary def test_full_design_placebo_raises(self, sdid_survey_data, survey_design_full): """Placebo variance with full design raises NotImplementedError.""" @@ -231,14 +243,14 @@ def test_placebo_with_pweight_only_full_design_stripped_att_match( columns are physically dropped from the input DataFrame. Point estimates depend only on the pseudo-population weights, not on - the strata/PSU structure — full-design bootstrap previously exploited - that structure via Rao-Wu rescaling and is now rejected upstream (see - ``test_full_design_bootstrap_raises`` / - ``test_full_design_placebo_raises``). A silent pickup of ``stratum`` - or ``psu`` by the estimator (e.g., by name-matching a convention - column) would cause the two fits to diverge, so comparing a - DataFrame with those columns present against one with them dropped - is the real contract. + the strata/PSU structure. 
PR #352 restored bootstrap support for + strata/PSU (which inflates SE via Rao-Wu clustering) but the + placebo / jackknife methods still depend only on per-unit pweight + for the point estimate. A silent pickup of ``stratum`` or ``psu`` + by the estimator (e.g., by name-matching a convention column) + would cause the two fits to diverge, so comparing a DataFrame with + those columns present against one with them dropped is the real + contract. """ sd_pweight_only = SurveyDesign(weights="weight") est = SyntheticDiD(variance_method="placebo", n_bootstrap=100, seed=42) @@ -323,28 +335,74 @@ def test_summary_includes_survey(self, sdid_survey_data, survey_design_weights): assert "Survey Design" in summary assert "pweight" in summary - def test_bootstrap_with_pweight_only_raises( + def test_bootstrap_with_pweight_only_succeeds( self, sdid_survey_data, survey_design_weights ): - """variance_method='bootstrap' with any survey design raises NotImplementedError. + """variance_method='bootstrap' with pweight-only survey succeeds (PR #352). - Paper-faithful refit bootstrap re-estimates ω̂ and λ̂ via Frank-Wolfe on - each draw; composing that with Rao-Wu rescaled weights (or even a - pweight composition in a weighted FW loss) requires a separate - derivation. Pweight-only survey users must use - ``variance_method='placebo'`` or ``'jackknife'``. + Restored capability: the bootstrap loop dispatches to the + weighted-FW variant per draw with constant ``rw = w_control`` + (no Rao-Wu rescaling — pweight-only). Cross-surface guard: the + result still labels itself as bootstrap (variance_method allow-list + in results.py / business_report.py). 
""" est = SyntheticDiD(variance_method="bootstrap", n_bootstrap=50, seed=42) - with pytest.raises(NotImplementedError, match="does not yet support"): - est.fit( - sdid_survey_data, - outcome="outcome", - treatment="treated", - unit="unit", - time="time", - post_periods=[6, 7, 8, 9], - survey_design=survey_design_weights, - ) + result = est.fit( + sdid_survey_data, + outcome="outcome", + treatment="treated", + unit="unit", + time="time", + post_periods=[6, 7, 8, 9], + survey_design=survey_design_weights, + ) + assert np.isfinite(result.att) + assert np.isfinite(result.se) + assert result.se > 0 + assert result.variance_method == "bootstrap" + + def test_bootstrap_full_design_se_differs_from_pweight_only( + self, sdid_survey_data + ): + """Full-design bootstrap SE differs from pweight-only bootstrap SE. + + Resurrects the test_full_design_se_differs_from_weights_only + contract that PR #351 deleted (R3 cleanup). With Rao-Wu rescaling + (full design), per-draw weights are clustered by PSU within + strata, inflating the bootstrap variance vs the pweight-only path + which uses constant per-control weights. ATT point estimates + match (both compose ω_eff = ω·w_control / Σ post-fit) but SEs + diverge — that's the methodology contract. + """ + sd_pweight = SurveyDesign(weights="weight") + sd_full = SurveyDesign(weights="weight", strata="stratum", psu="psu") + + est_pw = SyntheticDiD(variance_method="bootstrap", n_bootstrap=100, seed=42) + result_pw = est_pw.fit( + sdid_survey_data, + outcome="outcome", + treatment="treated", + unit="unit", + time="time", + post_periods=[6, 7, 8, 9], + survey_design=sd_pweight, + ) + est_full = SyntheticDiD(variance_method="bootstrap", n_bootstrap=100, seed=42) + result_full = est_full.fit( + sdid_survey_data, + outcome="outcome", + treatment="treated", + unit="unit", + time="time", + post_periods=[6, 7, 8, 9], + survey_design=sd_full, + ) + # ATT point estimates match (same w_control, same post-fit + # composition). 
Tolerance tight (abs=1e-10): the bootstrap path + # re-estimates ω̂_b per draw but never touches the point fit. + assert result_pw.att == pytest.approx(result_full.att, abs=1e-10) + # SEs differ — Rao-Wu adds PSU clustering variance. + assert result_pw.se != pytest.approx(result_full.se, abs=1e-6) def test_jackknife_with_pweight_only(self, sdid_survey_data, survey_design_weights): """variance_method='jackknife' completes with pweight-only survey weights. diff --git a/tests/test_weighted_fw.py b/tests/test_weighted_fw.py new file mode 100644 index 00000000..9e102f7e --- /dev/null +++ b/tests/test_weighted_fw.py @@ -0,0 +1,407 @@ +"""Tests for the weighted Frank-Wolfe kernel and SDID survey-bootstrap helpers. + +Covers PR #352 (SDID survey-bootstrap via weighted FW + Rao-Wu composition): +- Rust kernel: ``sc_weight_fw_weighted`` and ``_with_convergence`` variants +- Python wrappers: ``_sc_weight_fw(reg_weights=...)`` dispatch and the + ``_sc_weight_fw_numpy`` weighted-reg fallback +- Survey helpers: ``compute_sdid_unit_weights_survey`` and + ``compute_time_weights_survey`` + +The unweighted contract is verified separately in ``test_rust_backend.py`` and +``test_methodology_sdid.py``; this file focuses on the weighted-reg path. 
+""" + +from __future__ import annotations + +import numpy as np +import pytest + +from diff_diff.utils import ( + _sc_weight_fw, + _sc_weight_fw_numpy, + compute_sdid_unit_weights, + compute_sdid_unit_weights_survey, + compute_time_weights, + compute_time_weights_survey, +) + + +@pytest.fixture +def small_panel(): + """8 pre-periods × 12 control units, deterministic seed.""" + rng = np.random.default_rng(42) + n_pre, n_post, n_control = 8, 4, 12 + return { + "Y_pre_control": rng.normal(size=(n_pre, n_control)), + "Y_post_control": rng.normal(size=(n_post, n_control)), + "Y_pre_treated_mean": rng.normal(size=n_pre), + "n_pre": n_pre, + "n_post": n_post, + "n_control": n_control, + } + + +# ============================================================================= +# Kernel-level: _sc_weight_fw with reg_weights +# ============================================================================= + + +class TestSCWeightFWWeighted: + """Verify the Python dispatch through to the weighted Rust/numpy kernels.""" + + def test_reg_weights_none_matches_unweighted(self, small_panel): + """reg_weights=None must be bit-identical to the unweighted call.""" + Y = np.column_stack([ + small_panel["Y_pre_control"], + small_panel["Y_pre_treated_mean"].reshape(-1, 1), + ]) + unweighted = _sc_weight_fw(Y, zeta=0.3, max_iter=10000, min_decrease=1e-7) + weighted_none = _sc_weight_fw( + Y, zeta=0.3, max_iter=10000, min_decrease=1e-7, reg_weights=None, + ) + np.testing.assert_allclose(weighted_none, unweighted, rtol=1e-14, atol=0) + + def test_uniform_reg_weights_matches_unweighted(self, small_panel): + """reg_weights = ones must collapse to the unweighted regularization. + + Mathematical identity: ζ²·Σ 1·ω² = ζ²·||ω||². The two paths can + differ by ULP-scale due to float ordering inside the loop, so allow + rel=1e-12 (looser than the rel=1e-14 used above, only because the + weighted loop reduces Σ rw·ω² while the unweighted loop uses + np.sum(ω²) — different reduction orders). 
+ """ + Y = np.column_stack([ + small_panel["Y_pre_control"], + small_panel["Y_pre_treated_mean"].reshape(-1, 1), + ]) + rw_uniform = np.ones(small_panel["n_control"]) + unweighted = _sc_weight_fw(Y, zeta=0.3, max_iter=10000, min_decrease=1e-7) + weighted_uniform = _sc_weight_fw( + Y, zeta=0.3, max_iter=10000, min_decrease=1e-7, + reg_weights=rw_uniform, + ) + np.testing.assert_allclose(weighted_uniform, unweighted, rtol=1e-12, atol=1e-13) + + def test_python_rust_parity_under_weighted_reg(self, small_panel): + """Pure-Python and Rust weighted FW agree at rtol=1e-9 / atol=1e-10. + + Different BLAS / reduction orders prevent bit-identity, but the + weighted objective is strictly convex (ζ > 0, rw > 0), so both + backends converge to the same unique minimizer on the simplex. + """ + Y = np.column_stack([ + small_panel["Y_pre_control"], + small_panel["Y_pre_treated_mean"].reshape(-1, 1), + ]) + rw = np.array([1.5, 0.5, 1.0, 2.0, 0.7, 1.3, 0.9, 1.8, 0.6, 1.1, 1.4, 0.8]) + rust_w = _sc_weight_fw( + Y, zeta=0.3, max_iter=20000, min_decrease=1e-9, reg_weights=rw, + ) + numpy_w = _sc_weight_fw_numpy( + Y, zeta=0.3, max_iter=20000, min_decrease=1e-9, reg_weights=rw, + ) + np.testing.assert_allclose(numpy_w, rust_w, rtol=1e-9, atol=1e-10) + + def test_simplex_invariants_under_arbitrary_rw(self, small_panel): + """For any positive rw, ω sums to 1 and is non-negative.""" + Y = np.column_stack([ + small_panel["Y_pre_control"], + small_panel["Y_pre_treated_mean"].reshape(-1, 1), + ]) + rw = np.array([0.3, 0.5, 1.0, 2.0, 1.5, 0.8, 1.2, 0.6, 1.7, 0.9, 1.4, 1.1]) + omega = _sc_weight_fw( + Y, zeta=0.4, max_iter=10000, min_decrease=1e-6, reg_weights=rw, + ) + assert omega.shape == (small_panel["n_control"],) + assert np.isclose(omega.sum(), 1.0, atol=1e-6), \ + f"weights must sum to 1, got {omega.sum()}" + assert np.all(omega >= -1e-9), "weights must be non-negative" + + def test_return_convergence_tuple_shape(self, small_panel): + """return_convergence=True returns (weights, bool) under weighted reg.""" + Y = 
np.column_stack([ + small_panel["Y_pre_control"], + small_panel["Y_pre_treated_mean"].reshape(-1, 1), + ]) + rw = np.linspace(0.5, 2.0, small_panel["n_control"]) + result = _sc_weight_fw( + Y, zeta=0.3, max_iter=10000, min_decrease=1e-6, + return_convergence=True, reg_weights=rw, + ) + assert isinstance(result, tuple) + assert len(result) == 2 + weights, converged = result + assert weights.shape == (small_panel["n_control"],) + assert isinstance(converged, bool) + + def test_reg_weights_length_mismatch_raises(self, small_panel): + """reg_weights length mismatch raises ValueError on both backends. + + Regression against PR #355 R5 P2: the Rust internal previously + silently fell back to the unweighted kernel when ``reg_weights.len() + != Y.shape[1] - 1``, while the NumPy path raised. The dispatcher + now validates shape upstream so both backends share the same + failure surface, and each Rust pyfunction also guards the entry + point against direct Rust-from-Python calls. Exercising the + wrong-shape call must raise ``ValueError`` with a message naming + the expected dimension. + """ + Y = np.column_stack([ + small_panel["Y_pre_control"], + small_panel["Y_pre_treated_mean"].reshape(-1, 1), + ]) + expected_t0 = Y.shape[1] - 1 + bad_rw = np.ones(expected_t0 + 3) + with pytest.raises(ValueError, match=r"reg_weights"): + _sc_weight_fw( + Y, zeta=0.3, max_iter=100, min_decrease=1e-6, + reg_weights=bad_rw, + ) + + +# ============================================================================= +# Survey helpers: compute_sdid_unit_weights_survey, compute_time_weights_survey +# ============================================================================= + + +class TestComputeSDIDUnitWeightsSurvey: + """Two-pass survey-weighted unit FW with diag(rw) regularization.""" + + def test_uniform_rw_matches_unweighted_helper_within_tolerance(self, small_panel): + """Uniform rw=1 produces ω close to the unweighted helper. 
+ + Note: this is NOT bit-identity — the weighted helper column-scales Y + by rw=ones (no-op) AND passes reg_weights=ones to the kernel, while + the unweighted helper passes reg_weights=None. The two kernels reach + the same simplex minimum but iterate through different float + reduction orders, so the FW iterates can land on adjacent vertices + when the simplex objective is nearly flat. Use rel=1e-6 (loose). + """ + Y_pre_c = small_panel["Y_pre_control"] + Y_pre_t = small_panel["Y_pre_treated_mean"] + unweighted = compute_sdid_unit_weights( + Y_pre_c, Y_pre_t, zeta_omega=0.3, min_decrease=1e-6, + ) + rw_uniform = np.ones(small_panel["n_control"]) + weighted = compute_sdid_unit_weights_survey( + Y_pre_c, Y_pre_t, rw_uniform, zeta_omega=0.3, min_decrease=1e-6, + ) + np.testing.assert_allclose(weighted, unweighted, rtol=1e-6, atol=1e-6) + + def test_simplex_invariants_under_arbitrary_rw(self, small_panel): + Y_pre_c = small_panel["Y_pre_control"] + Y_pre_t = small_panel["Y_pre_treated_mean"] + rng = np.random.default_rng(7) + rw = rng.uniform(0.3, 2.5, size=small_panel["n_control"]) + omega = compute_sdid_unit_weights_survey( + Y_pre_c, Y_pre_t, rw, zeta_omega=0.3, min_decrease=1e-6, + ) + assert omega.shape == (small_panel["n_control"],) + assert np.isclose(omega.sum(), 1.0, atol=1e-6) + assert np.all(omega >= -1e-9) + + def test_rw_shape_mismatch_raises(self, small_panel): + Y_pre_c = small_panel["Y_pre_control"] + Y_pre_t = small_panel["Y_pre_treated_mean"] + wrong_rw = np.ones(small_panel["n_control"] + 1) + with pytest.raises(ValueError, match="rw_control shape"): + compute_sdid_unit_weights_survey( + Y_pre_c, Y_pre_t, wrong_rw, zeta_omega=0.3, + ) + + def test_return_convergence_propagates_AND_of_passes(self, small_panel): + Y_pre_c = small_panel["Y_pre_control"] + Y_pre_t = small_panel["Y_pre_treated_mean"] + rw = np.linspace(0.5, 2.0, small_panel["n_control"]) + # Tight tolerance + few iterations to force non-convergence + omega, converged = 
compute_sdid_unit_weights_survey( + Y_pre_c, Y_pre_t, rw, zeta_omega=0.3, + min_decrease=1e-15, max_iter_pre_sparsify=5, max_iter=5, + return_convergence=True, + ) + assert isinstance(converged, bool) + # Confirm the result is still a valid simplex vector + assert np.isclose(omega.sum(), 1.0, atol=1e-6) + + +class TestComputeTimeWeightsSurvey: + """Two-pass row-weighted time FW with uniform regularization.""" + + def test_uniform_rw_matches_unweighted_helper_within_tolerance(self, small_panel): + """Uniform rw produces λ matching the unweighted helper. + + sqrt(1)=1 row-scaling is a no-op, so this is genuine equivalence + modulo FW iterate ordering. + """ + Y_pre_c = small_panel["Y_pre_control"] + Y_post_c = small_panel["Y_post_control"] + unweighted = compute_time_weights( + Y_pre_c, Y_post_c, zeta_lambda=0.05, min_decrease=1e-6, + ) + rw_uniform = np.ones(small_panel["n_control"]) + weighted = compute_time_weights_survey( + Y_pre_c, Y_post_c, rw_uniform, zeta_lambda=0.05, min_decrease=1e-6, + ) + np.testing.assert_allclose(weighted, unweighted, rtol=1e-6, atol=1e-6) + + def test_simplex_invariants_under_arbitrary_rw(self, small_panel): + Y_pre_c = small_panel["Y_pre_control"] + Y_post_c = small_panel["Y_post_control"] + rng = np.random.default_rng(11) + rw = rng.uniform(0.3, 2.5, size=small_panel["n_control"]) + lam = compute_time_weights_survey( + Y_pre_c, Y_post_c, rw, zeta_lambda=0.05, min_decrease=1e-6, + ) + assert lam.shape == (small_panel["n_pre"],) + assert np.isclose(lam.sum(), 1.0, atol=1e-6) + assert np.all(lam >= -1e-9) + + def test_rw_shape_mismatch_raises(self, small_panel): + Y_pre_c = small_panel["Y_pre_control"] + Y_post_c = small_panel["Y_post_control"] + wrong_rw = np.ones(small_panel["n_control"] + 1) + with pytest.raises(ValueError, match="rw_control shape"): + compute_time_weights_survey( + Y_pre_c, Y_post_c, wrong_rw, zeta_lambda=0.05, + ) + + def test_non_uniform_rw_beats_unweighted_centering_variant(self, small_panel): + """Non-uniform 
rw: the weighted-centering solution achieves strictly + lower weighted SSR than the (buggy) unweighted-centering variant. + + Verifies the PR #355 R1 fix — weighted centering + intercept=False + — actually solves the stated weighted loss + ``Σ_u rw_u·(A_u·λ - b_u)²``. Reproduces the unweighted-centering + pre-R1 path by hand (row-scale Y by sqrt(rw), then pass + intercept=True to the kernel so it centers on unweighted column + means) and asserts the correct path's weighted SSR is strictly + better. If R1's fix regresses (someone reverts back to + intercept=True after row-scaling), this test fails because the + two solutions become identical. + """ + Y_pre_c = small_panel["Y_pre_control"] + Y_post_c = small_panel["Y_post_control"] + n_co = small_panel["n_control"] + rng = np.random.default_rng(23) + rw = np.where(rng.uniform(size=n_co) < 0.25, 5.0, 0.5) + + # Correct path: what compute_time_weights_survey actually does. + lam_correct = compute_time_weights_survey( + Y_pre_c, Y_post_c, rw, + zeta_lambda=0.05, + min_decrease=1e-8, + max_iter=10000, + ) + + # Buggy variant: pre-R1 — row-scale by sqrt(rw) but let the kernel + # do UNWEIGHTED centering (intercept=True on the row-scaled matrix). + post_means = np.mean(Y_post_c, axis=0) + Y_time_raw = np.column_stack([Y_pre_c.T, post_means]) + sqrt_rw = np.sqrt(np.maximum(rw, 0.0)) + Y_weighted_unweighted_center = Y_time_raw * sqrt_rw[:, None] + lam_buggy = _sc_weight_fw( + Y_weighted_unweighted_center, zeta=0.05, intercept=True, + min_decrease=1e-8, max_iter=10000, + ) + # Sparsify + refit second pass to match the two-pass shape. + from diff_diff.utils import _sparsify + lam_buggy = _sparsify(lam_buggy) + lam_buggy = _sc_weight_fw( + Y_weighted_unweighted_center, zeta=0.05, intercept=True, + init_weights=lam_buggy, min_decrease=1e-8, max_iter=10000, + ) + + # Compute the canonical (weighted-centered) objective on both. 
+ wc_mean_pre = (Y_pre_c.T * rw[:, None]).sum(axis=0) / rw.sum() + wc_mean_post = (post_means * rw).sum() / rw.sum() + A_wc = Y_pre_c.T - wc_mean_pre + b_wc = post_means - wc_mean_post + + def weighted_ssr(lam_val: np.ndarray) -> float: + resid = A_wc @ lam_val - b_wc + return float(np.sum(rw * resid ** 2)) + + ssr_correct = weighted_ssr(lam_correct) + ssr_buggy = weighted_ssr(lam_buggy) + # Strict inequality: if the R1 fix regresses, the two λ's coincide + # and ssr_correct == ssr_buggy, which this assert must catch. + assert ssr_correct < ssr_buggy - 1e-8, ( + f"weighted-centering λ (SSR={ssr_correct:.4f}) must achieve " + f"strictly lower weighted SSR than the unweighted-centering " + f"variant (SSR={ssr_buggy:.4f}). PR #355 R1 regression: weighted " + "SSR is not being minimized by the survey λ helper." + ) + + def test_zero_rw_subset_handled(self, small_panel): + """rw with some zeros (Rao-Wu draws units to zero weight) still yields + a valid simplex λ — the FW simply drops those rows from the loss. + """ + Y_pre_c = small_panel["Y_pre_control"] + Y_post_c = small_panel["Y_post_control"] + rw = np.ones(small_panel["n_control"]) + rw[:3] = 0.0 # zero out first 3 controls (e.g., undrawn PSU) + lam = compute_time_weights_survey( + Y_pre_c, Y_post_c, rw, zeta_lambda=0.05, min_decrease=1e-6, + ) + assert lam.shape == (small_panel["n_pre"],) + assert np.isclose(lam.sum(), 1.0, atol=1e-6) + assert np.all(lam >= -1e-9) + + +# ============================================================================= +# Pure-Python vs Rust path parity through the survey helpers +# ============================================================================= + + +class TestSurveyHelperBackendParity: + """The two _survey helpers must produce the same result on Rust vs numpy.""" + + def test_unit_survey_python_rust_parity(self, small_panel, monkeypatch): + """Force pure-Python and compare to Rust output. + + Uses monkeypatch to override HAS_RUST_BACKEND inside utils, then + compares against the default Rust path. 
+ """ + from diff_diff import utils as dd_utils + + Y_pre_c = small_panel["Y_pre_control"] + Y_pre_t = small_panel["Y_pre_treated_mean"] + rng = np.random.default_rng(13) + rw = rng.uniform(0.5, 2.0, size=small_panel["n_control"]) + + # Rust path (default) + rust_omega = compute_sdid_unit_weights_survey( + Y_pre_c, Y_pre_t, rw, zeta_omega=0.3, + max_iter_pre_sparsify=200, max_iter=20000, min_decrease=1e-9, + ) + + # Force pure-Python path + monkeypatch.setattr(dd_utils, "HAS_RUST_BACKEND", False) + py_omega = compute_sdid_unit_weights_survey( + Y_pre_c, Y_pre_t, rw, zeta_omega=0.3, + max_iter_pre_sparsify=200, max_iter=20000, min_decrease=1e-9, + ) + + # Sparsify path is identical, weighted FW is strictly convex; the + # two backends should agree at FW convergence to numerical + # tolerance dominated by float reduction order. + np.testing.assert_allclose(py_omega, rust_omega, rtol=1e-7, atol=1e-7) + + def test_time_survey_python_rust_parity(self, small_panel, monkeypatch): + from diff_diff import utils as dd_utils + + Y_pre_c = small_panel["Y_pre_control"] + Y_post_c = small_panel["Y_post_control"] + rng = np.random.default_rng(17) + rw = rng.uniform(0.5, 2.0, size=small_panel["n_control"]) + + rust_lam = compute_time_weights_survey( + Y_pre_c, Y_post_c, rw, zeta_lambda=0.05, + max_iter_pre_sparsify=200, max_iter=20000, min_decrease=1e-9, + ) + + monkeypatch.setattr(dd_utils, "HAS_RUST_BACKEND", False) + py_lam = compute_time_weights_survey( + Y_pre_c, Y_post_c, rw, zeta_lambda=0.05, + max_iter_pre_sparsify=200, max_iter=20000, min_decrease=1e-9, + ) + + np.testing.assert_allclose(py_lam, rust_lam, rtol=1e-7, atol=1e-7)
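Reviewer note — the weighted objective exercised throughout this test file, `min_ω ||A·ω − b||² + ζ²·Σ_j rw_j·ω_j²` over the probability simplex, is compact enough to sketch outside the library. The toy below is illustrative only: the name `fw_weighted_sketch` and the exact-line-search step are this note's own choices, not the `sc_weight_fw_weighted` kernel. It shows why the simplex invariants asserted above hold by construction — every Frank-Wolfe iterate is a convex combination of simplex vertices.

```python
import numpy as np


def fw_weighted_sketch(A, b, rw, zeta=0.3, max_iter=5000, tol=1e-12):
    """Frank-Wolfe for  min_w ||A @ w - b||^2 + zeta^2 * sum(rw * w**2)
    over the simplex {w : w >= 0, sum(w) = 1}.  Illustrative toy only."""
    n = A.shape[1]
    w = np.full(n, 1.0 / n)          # start at the simplex barycenter
    prev_obj = np.inf
    for _ in range(max_iter):
        # Gradient of the weighted objective at the current iterate.
        grad = 2.0 * A.T @ (A @ w - b) + 2.0 * zeta**2 * rw * w
        # Linear minimization oracle over the simplex: a single vertex.
        s = np.zeros(n)
        s[int(np.argmin(grad))] = 1.0
        d = s - w
        # Exact line search: the objective is quadratic along d.
        Ad = A @ d
        den = 2.0 * (Ad @ Ad + zeta**2 * np.sum(rw * d * d))
        gamma = 0.0 if den <= 0.0 else float(np.clip(-(grad @ d) / den, 0.0, 1.0))
        w = w + gamma * d            # convex combination: stays on the simplex
        obj = float(np.sum((A @ w - b) ** 2) + zeta**2 * np.sum(rw * w * w))
        if prev_obj - obj < tol:
            break
        prev_obj = obj
    return w
```

With `rw = np.ones(n)` the penalty collapses to `ζ²·||w||²`, which is exactly the identity `test_uniform_reg_weights_matches_unweighted` relies on; a non-uniform `rw` shifts mass away from heavily penalized coordinates, mirroring the `reg_weights` behavior the kernel tests assert.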