diff --git a/CHANGELOG.md b/CHANGELOG.md index f7d75bee..58ab0324 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,12 +9,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - **`target_parameter` block in BR/DR schemas (experimental; schema version bumped to 2.0)** — `BUSINESS_REPORT_SCHEMA_VERSION` and `DIAGNOSTIC_REPORT_SCHEMA_VERSION` bumped from `"1.0"` to `"2.0"` because the new `"no_scalar_by_design"` value on the `headline.status` / `headline_metric.status` enum (dCDH `trends_linear=True, L_max>=2` configuration) is a breaking change per the REPORTING.md stability policy. BusinessReport and DiagnosticReport now emit a top-level `target_parameter` block naming what the headline scalar actually represents for each of the 16 result classes. Closes BR/DR foundation gap #6 (target-parameter clarity). Fields: `name`, `definition`, `aggregation` (machine-readable dispatch tag), `headline_attribute` (raw result attribute), `reference` (citation pointer). BR's summary emits the short `name` right after the headline; DR's overall-interpretation paragraph does the same; both full reports carry a "## Target Parameter" section with the full definition. Per-estimator dispatch is sourced from REGISTRY.md and lives in the new `diff_diff/_reporting_helpers.py::describe_target_parameter`. A few branches read fit-time config (`EfficientDiDResults.pt_assumption`, `StackedDiDResults.clean_control`, `ChaisemartinDHaultfoeuilleResults.L_max` / `covariate_residuals` / `linear_trends_effects`); others emit a fixed tag (the fit-time `aggregate` kwarg on CS / Imputation / TwoStage / Wooldridge does not change the `overall_att` scalar — disambiguating horizon / group tables is tracked under gap #9). See `docs/methodology/REPORTING.md` "Target parameter" section. 
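The `target_parameter` block can be pictured as a plain mapping. This is a hedged sketch only: the field names come from the changelog entry above, while the values shown (the tag string, the citation, the definition text) are hypothetical stand-ins for a CS-style overall-ATT fit, not literal library output.

```python
# Hypothetical shape of the top-level `target_parameter` block emitted by
# BusinessReport / DiagnosticReport. Field names per the changelog entry;
# values are illustrative, not copied from the library.
target_parameter = {
    "name": "overall ATT",
    "definition": (
        "Average of group-time ATTs over all post-treatment (g, t) cells, "
        "weighted by group size."
    ),
    "aggregation": "simple",              # machine-readable dispatch tag
    "headline_attribute": "overall_att",  # raw result attribute behind the headline
    "reference": "Callaway & Sant'Anna (2021)",
}

# Downstream consumers can branch on the dispatch tag instead of parsing
# the prose definition.
dispatch_tag = target_parameter["aggregation"]
```

The point of the split between `name` (short, inline after the headline) and `definition` (full "## Target Parameter" section) is that both render from the same block.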
+- SyntheticDiD coverage Monte Carlo calibration table added to `docs/methodology/REGISTRY.md` §SyntheticDiD — rejection rates at α ∈ {0.01, 0.05, 0.10} across `placebo` / `bootstrap` / `jackknife` on 3 representative DGPs (balanced / exchangeable, unbalanced, and Arkhangelsky et al. (2021) AER §6.3 non-exchangeable). Artifact at `benchmarks/data/sdid_coverage.json` (500 seeds × B=200), regenerable via `benchmarks/python/coverage_sdid.py`. ### Fixed -- SyntheticDiD `variance_method="bootstrap"` now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's `synthdid::vcov()` convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. `is_significant` and `significance_stars` are derived from `p_value` and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates and standard errors are unaffected. +- **SyntheticDiD `variance_method="bootstrap"` now runs the paper-faithful refit bootstrap** with R-default warm-start. On each pairs-bootstrap draw it re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe using the fit-time normalized-scale zeta (Arkhangelsky et al. (2021), Algorithm 2, step 2), matching the behavior of R's default `synthdid::vcov(method="bootstrap")` (which rebinds `attr(estimate, "opts")` so the renormalized ω serves as Frank-Wolfe initialization). The Python path threads that warm-start through `compute_sdid_unit_weights(..., init_weights=_sum_normalize(ω̂[boot_control_idx]))` and `compute_time_weights(..., init_weights=λ̂)` on each bootstrap draw. `compute_sdid_unit_weights` and `compute_time_weights` gain a new `init_weights` kwarg; when provided, the Rust top-level fast-path is skipped in favor of the Python two-pass dispatcher (whose inner FW calls still dispatch to Rust). Without this kwarg both helpers remain backward-compatible and keep the Rust fast-path. 
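  The per-draw warm-start plumbing can be sketched as follows. This is a minimal standalone sketch: `bootstrap_warm_starts` is a hypothetical helper name, and the actual Frank-Wolfe solves via `compute_sdid_unit_weights` / `compute_time_weights` are omitted.

  ```python
  import numpy as np

  def _sum_normalize(v):
      """Renormalize weights to sum to one; uniform fallback if the mass is
      zero (e.g. every resampled control carried ~zero fit-time weight)."""
      s = float(v.sum())
      return v / s if s > 0 else np.full(len(v), 1.0 / len(v))

  def bootstrap_warm_starts(omega_hat, lambda_hat, boot_idx, n_control):
      """Warm starts for one refit draw: fit-time omega restricted to the
      resampled controls and renormalized; fit-time lambda passed through
      unchanged (time periods are not resampled by the pairs bootstrap)."""
      boot_control_idx = np.sort(boot_idx[boot_idx < n_control])
      return _sum_normalize(omega_hat[boot_control_idx]), lambda_hat
  ```

  These two arrays would then be passed as `init_weights=` to the two weight solvers on each draw.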
The previous fixed-weight bootstrap path is removed entirely — it was not paper-faithful and, despite prior documentation claiming otherwise, also did not match R's default bootstrap (the previous R-parity test fixture invoked `synthdid_estimate(weights=...)` without rebinding `opts`, which silently runs the fixed-weight path, so the 1e-10 parity was between two paths both wrong in the same direction). The coverage Monte Carlo artifact added above quantifies the correctness fix on three representative null DGPs. **Users' existing `variance_method="bootstrap"` fits will return materially different SE / p-value / CI values on the next release** — same enum name, corrected semantics. Bootstrap is now ~5–30× slower per fit than the old fixed-weight shortcut (panel-size dependent; warm-start converges faster than cold-start, so the slowdown is less than the 10–100× prior estimate). The PR #349 follow-on bullets below (analytical p-value dispatch, sqrt((r-1)/r) SE formula, retry-to-B contract) all carry over to the refit path unchanged. +- SyntheticDiD `variance_method="bootstrap"` now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's `synthdid::vcov()` convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. `is_significant` and `significance_stars` are derived from `p_value` and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates are unaffected. - SyntheticDiD bootstrap SE formula applies the `sqrt((r-1)/r)` correction matching R's synthdid and the placebo SE formula. - SyntheticDiD bootstrap now retries degenerate resamples (all-control or all-treated, or non-finite `τ_b`) until exactly `n_bootstrap` valid replicates are accumulated, matching R's `synthdid::bootstrap_sample` and Arkhangelsky et al. (2021) Algorithm 2. Previously the Python path counted attempts (with degenerate draws silently dropped), producing fewer valid replicates than requested. 
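  The retry-to-B contract, the bounded-attempt guard, and the `sqrt((r-1)/r)` SE correction compose roughly like this. A sketch only, with a caller-supplied `estimate_tau` standing in for the per-draw refit; names and the guard structure are illustrative, not the `_bootstrap_se` implementation.

  ```python
  import numpy as np

  def bootstrap_draws(n_units, n0, n_bootstrap, rng, estimate_tau):
      """Accumulate exactly `n_bootstrap` valid replicates, retrying
      degenerate resamples (all-control, all-treated, or non-finite tau_b),
      with a bounded attempt budget so pathological inputs cannot hang."""
      taus = []
      attempts, max_attempts = 0, 20 * n_bootstrap
      while len(taus) < n_bootstrap:
          attempts += 1
          if attempts > max_attempts:
              raise RuntimeError("bootstrap could not accumulate valid replicates")
          idx = rng.integers(0, n_units, size=n_units)  # pairs bootstrap: resample units
          n_co = int(np.sum(idx < n0))
          if n_co == 0 or n_co == n_units:
              continue  # degenerate draw: retry, do not count it as a replicate
          tau_b = estimate_tau(idx)
          if np.isfinite(tau_b):
              taus.append(float(tau_b))
      r = len(taus)  # == n_bootstrap by construction
      se = np.sqrt((r - 1) / r) * np.std(taus, ddof=1)  # R's sd() uses ddof=1
      return np.asarray(taus), float(se)
  ```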
A bounded-attempt guard (`20 × n_bootstrap`) prevents pathological-input hangs. +### Changed +- **SyntheticDiD bootstrap no longer supports survey designs** (capability regression). The removed fixed-weight bootstrap path was the only SDID variance method that supported strata/PSU/FPC (via Rao-Wu rescaled bootstrap); the new paper-faithful refit bootstrap rejects all survey designs (including pweight-only) with `NotImplementedError`. Pweight-only users can switch to `variance_method="placebo"` or `"jackknife"`. Strata/PSU/FPC users have no SDID variance option on this release. Composing Rao-Wu rescaled weights with Frank-Wolfe re-estimation requires a separate derivation (weighted FW solver); sketch and reusable scaffolding pointers are in `docs/methodology/REGISTRY.md` §SyntheticDiD and `TODO.md`. + ## [3.2.0] - 2026-04-19 ### Added diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index a6e21e6d..0743ec81 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -501,13 +501,26 @@ variables appear to the left of the `|` separator. `zeta=1.0`). Regularization parameters `zeta_omega` and `zeta_lambda` are now computed automatically from the data noise level (N_tr * sigma^2) as specified in Appendix D of Arkhangelsky et al. (2021), matching R's default behavior. -4. **Bootstrap SE uses fixed weights matching R's `bootstrap_sample`** (was - re-estimating all weights). The bootstrap variance procedure now holds unit and time - weights fixed at their point estimates and only re-estimates the treatment effect, - matching the approach in R's `synthdid::bootstrap_sample()`. -5. **Default `variance_method` changed to `"placebo"`** matching R's default. The R - package uses placebo variance by default (`synthdid_estimate` returns an object whose - `vcov()` uses the placebo method); our default now matches. +4. 
**Bootstrap SE is paper-faithful refit (Algorithm 2 step 2), matching R's default + `synthdid::vcov(method="bootstrap")` including its warm-start shape.** On each + pairs-bootstrap draw, ω and λ are re-estimated via Frank-Wolfe on the resampled + panel using the fit-time normalized-scale zeta. The Frank-Wolfe first pass is + warm-started from the fit-time ω (renormalized over the resampled controls via + `_sum_normalize`) and the fit-time λ (unchanged), matching R's `bootstrap_sample` + which rebinds `attr(estimate, "opts")` so those weights serve as the FW + initialization per `update.omega=TRUE` / `update.lambda=TRUE`. + *(Historical note: an earlier release shipped a fixed-weight shortcut here + that matched neither the paper nor R's default vcov; that path was removed + in PR #351 along with its R-parity fixture, which had also been mis-anchored. + The same PR added the warm-start plumbing to `compute_sdid_unit_weights` / + `compute_time_weights` via new `init_weights=` kwargs.)* +5. **Default `variance_method` changed to `"placebo"`** — intentional deviation from + R's default (R's `synthdid::vcov()` defaults to `"bootstrap"`). The library default + is placebo for two reasons: (a) placebo is unconditionally available on pweight-only + survey designs, whereas refit bootstrap rejects every survey design in this release; + (b) placebo sidesteps the ~5–30× slowdown of per-draw Frank-Wolfe re-estimation in + refit bootstrap. See REGISTRY.md §SyntheticDiD `Note (default variance_method + deviation from R)` for details. 6. **Deprecated `lambda_reg` and `zeta` params; new params are `zeta_omega` and `zeta_lambda`**. The old parameters had unclear semantics and did not correspond to the paper's notation. The new parameters directly match the paper and R package diff --git a/TODO.md b/TODO.md index ab8ec9bc..87e0930f 100644 --- a/TODO.md +++ b/TODO.md @@ -100,7 +100,8 @@ Deferred items from PR reviews that were not addressed before merge. 
| `HeterogeneousAdoptionDiD` Phase 5: `practitioner_next_steps()` integration, tutorial notebook, and `llms.txt` updates (preserving UTF-8 fingerprint). | `diff_diff/practitioner.py`, `tutorials/`, `diff_diff/guides/` | Phase 2a | Low | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low | | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium | -| SyntheticDiD: ship paper-faithful refit bootstrap (Arkhangelsky et al. 2021 Algorithm 2, re-estimating ω and λ per draw) as an opt-in `bootstrap_weights="refit"` kwarg. Current bootstrap matches R's fixed-weight shortcut. | `synthetic_did.py::_bootstrap_se` | follow-up | Low | +| **SDID + survey designs** (capability regression in this release; both pweight-only AND strata/PSU/FPC). The previous release's fixed-weight bootstrap accepted strata/PSU/FPC via Rao-Wu rescaled bootstrap; the new paper-faithful refit bootstrap rejects all survey designs because Rao-Wu composed with Frank-Wolfe re-estimation requires its own derivation. 
The follow-up needs a **weighted Frank-Wolfe** variant of `_sc_weight_fw` accepting per-unit weights in the loss and regularization (`Σ rw_i ω_i Y_i,pre` / `ζ² Σ rw_i ω_i²`), threaded through `compute_sdid_unit_weights` / `compute_time_weights`. Reusable scaffolding (`generate_rao_wu_weights`, split into `rw_control` / `rw_treated`, degenerate-retry, treated-mean weighting) is recoverable from the pre-rewrite `_bootstrap_se` body via `git show 91082e5:diff_diff/synthetic_did.py` (PR #351 "Replace SDID fixed-weight bootstrap with paper-faithful refit"). Composing Rao-Wu weights after an unweighted FW solve does not work — it silently reproduces the fixed-weight Rao-Wu behavior we removed. Validation: reuse the coverage MC harness with a stratified DGP and confirm near-nominal rejection rates against placebo-SE tracking. See REGISTRY.md §SyntheticDiD `Note (deferred survey + bootstrap composition)` for the sketch. | `synthetic_did.py::fit`, `synthetic_did.py::_bootstrap_se`, `utils.py::_sc_weight_fw` | follow-up | Medium | +| SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low | #### Performance @@ -126,8 +127,7 @@ Deferred items from PR reviews that were not addressed before merge. 
| `EDiDBootstrapResults` cross-reference is ambiguous — class is exported from both `diff_diff` and `diff_diff.efficient_did_bootstrap`, producing 3 "more than one target found" warnings. Add `:noindex:` to one source or use full-path refs | `diff_diff/efficient_did_results.py`, `docs/api/efficient_did.rst` | — | Low | | Tracked Sphinx autosummary stubs in `docs/api/_autosummary/*.rst` are stale — every sphinx build regenerates them with new attributes (e.g., `coef_var`, `survey_metadata`) that have been added to result classes. Either commit a refresh or move the directory to `.gitignore` and treat as build output. Also 6 untracked stubs exist for newer estimators (`WooldridgeDiD`, `SimulationMDEResults`, etc.) that have never been committed. | `docs/api/_autosummary/` | — | Low | | HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` | — | Low | -| SyntheticDiD: coverage Monte Carlo study — empirical 95% CI coverage for placebo / fixed-bootstrap / jackknife on representative DGPs; document in REGISTRY.md to support the fixed-weight deviation label and calibrate user expectations. | `benchmarks/`, `docs/methodology/REGISTRY.md` | follow-up | Low | -| SyntheticDiD: rename internal `placebo_effects` variable to `null_or_bootstrap_effects` (or `variance_effects`). Misleading name across the bootstrap/placebo/jackknife dispatch paths; low-risk refactor. | `synthetic_did.py`, `synthetic_did_results.py` | follow-up | Low | +| SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). 
Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release. | `synthetic_did.py`, `results.py` | follow-up | Medium | --- diff --git a/benchmarks/R/generate_sdid_bootstrap_parity_fixture.R b/benchmarks/R/generate_sdid_bootstrap_parity_fixture.R deleted file mode 100644 index 4071c01d..00000000 --- a/benchmarks/R/generate_sdid_bootstrap_parity_fixture.R +++ /dev/null @@ -1,153 +0,0 @@ -#!/usr/bin/env Rscript -# Generate an R-parity fixture for SyntheticDiD bootstrap SE. -# -# The fixture pins B non-degenerate bootstrap indices and the resulting -# R-computed bootstrap SE so that the Python R-parity test can feed the -# same indices through `_bootstrap_se` and expect bit-identical SE. -# -# Usage: -# Rscript benchmarks/R/generate_sdid_bootstrap_parity_fixture.R -# -# Output: -# tests/data/sdid_bootstrap_indices_r.json - -library(synthdid) -library(jsonlite) - -# Panel data — must match TestJackknifeSERParity.Y_FLAT in -# tests/test_methodology_sdid.py (23 units × 8 periods = 184 values). 
-Y_flat <- c( - 12.459567808595292, 13.223481099962006, 13.658348196773856, - 13.844051055863837, 13.888854636247594, 14.997677893012806, - 14.494587375086788, 15.851128751231856, 10.527006629006900, - 11.317894498245712, 9.780141451338988, 10.635177418486473, - 11.007911133698329, 11.692547000930196, 11.532445341187122, - 10.646344091442769, 5.779122815714058, 5.265746845809725, - 4.828411925858962, 5.933107464969151, 6.926403492435262, - 7.566662873481445, 6.703831577045862, 7.090431451464497, - 6.703722507026075, 6.453676391630379, 7.301398891231049, - 7.726092498224848, 8.191225590595401, 7.669210641906834, - 8.526151391259425, 7.715169490073769, 8.005628186152748, - 7.523978158267692, 9.049143286687135, 9.434081283341134, - 9.450553333966674, 10.310163601090766, 9.867729569702721, - 9.846941461031360, 10.459939463684098, 11.887686682638062, - 11.249912950470762, 12.093459993478538, 12.226598684379407, - 11.973716581337246, 13.453499811673423, 13.287085704636093, - 10.317796666844943, 10.819165701226847, 10.824437736488752, - 9.582976251622744, 11.521962769964540, 11.495903971828724, - 12.072136575632017, 12.570433156881965, 12.435827624848123, - 13.750744970607428, 13.567397714461393, 14.218726703934166, - 14.459837938730677, 14.659912736018788, 14.077914185301429, - 14.854380461280002, 10.770274645112915, 11.275621916712160, - 12.137534572839927, 12.531125692916383, 12.678920118269170, - 12.304148175294246, 12.497145874675160, 14.103389828901550, - 10.560062989643855, 10.755394606294518, 10.518678427483797, - 11.721841324084256, 11.607272952190801, 11.924464521898100, - 12.782516039349641, 13.026729430318186, 12.546145790341205, - 13.409407032231695, 14.079787980063543, 13.128838312144593, - 13.553836458429620, 13.718363411441658, 13.854625752117343, - 14.924224028489123, 11.906891367097627, 12.128784222882244, - 11.404804355878456, 13.130649630134753, 12.173021974919472, - 12.859165585526416, 12.895280738363951, 13.345233593320895, - 10.435966548001499, 
10.663839793569295, 11.030422432974012, - 11.033668451079661, 11.324277503659044, 11.045836529045589, - 11.985219205566086, 12.220060940064094, 14.722723885094736, - 15.772410109968900, 15.256969467031452, 15.568564129971197, - 16.666133193788099, 16.405462433247578, 17.202870693537243, - 17.289652559976691, 7.760317864391456, 8.460282811921017, - 9.462415007659978, 9.956467084312777, 9.726218110324272, - 10.272688229133685, 11.134101608790994, 11.592584658589104, - 7.747112683063268, 8.706521663648207, 8.170907672905205, - 8.679537720718859, 8.962718814069811, 8.861932954235140, - 9.383430460745986, 9.891050023644237, 9.728955313568255, - 9.231765881057163, 9.555677785583788, 10.420693590160205, - 9.844078095298698, 10.651913064308546, 10.196489890710358, - 11.855847076501993, 9.218785934915712, 9.133582433258733, - 10.048827580363175, 9.952567508276010, 10.385962432276619, - 11.596546220044132, 11.164945662130776, 11.016817405176500, - 10.145044557120791, 10.921420538928436, 11.642624728800259, - 10.730067509380019, 11.753738913724906, 11.868862794274008, - 12.574196556067037, 12.311524695461632, 10.800710206252880, - 12.817967597577915, 12.705627126180516, 12.497850142478354, - 12.148734571851643, 13.494742486942219, 13.714835068828613, - 13.770060323710533, 10.010857300549947, 10.787315152039971, - 11.050238955584605, 11.063282099053561, 10.834793458278272, - 17.153286194944865, 17.380010096861866, 16.984758489324143, - 6.913302966281331, 6.938279687001069, 7.537129527669741, - 7.063822443245238, 7.531238453797332, 13.853711102827464, - 13.812711128345372, 14.204067444347162, 13.694867606609098, - 12.929992273442151, 14.397345491024691, 15.116119455987304, - 15.860226513457558, 19.442026093187646, 19.855029109494353, - 20.377546194927845 -) -stopifnot(length(Y_flat) == 184L) - -N0 <- 20L # controls -N1 <- 3L # treated -T0 <- 5L # pre -T1 <- 3L # post -N <- N0 + N1 -T <- T0 + T1 - -Y <- matrix(Y_flat, nrow = N, ncol = T, byrow = TRUE) - -# Fit once to obtain the 
omega / lambda that the bootstrap holds fixed -tau_hat <- synthdid_estimate(Y, N0, T0) -weights <- attr(tau_hat, "weights") - -# Bootstrap loop — record indices and compute tau_b with fixed weights, -# mimicking synthdid::vcov(method="bootstrap") per the package source. -# Retry on degenerate draws (no controls or no treated) so the fixture -# contains B non-degenerate rows and Python's `_bootstrap_indices` seam -# consumes them all without skipping. -set.seed(42L) -B <- 200L - -sum_normalize <- function(v) { - s <- sum(v) - if (s > 0) v / s else rep(1 / length(v), length(v)) -} - -indices_matrix <- matrix(0L, nrow = B, ncol = N) -tau_boot <- numeric(B) -b <- 1L -while (b <= B) { - ind <- sample(seq_len(N), replace = TRUE) - n_co_b <- sum(ind <= N0) - if (n_co_b == 0L || n_co_b == N) next # degenerate — retry - weights_boot <- weights - weights_boot$omega <- sum_normalize(weights$omega[sort(ind[ind <= N0])]) - tau_boot_b <- synthdid_estimate( - Y[sort(ind), ], sum(ind <= N0), T0, - weights = weights_boot - ) - indices_matrix[b, ] <- ind - tau_boot[b] <- as.numeric(tau_boot_b) - b <- b + 1L -} - -se <- sqrt((B - 1) / B) * sd(tau_boot) - -output_path <- file.path("tests", "data", "sdid_bootstrap_indices_r.json") -dir.create(dirname(output_path), recursive = TRUE, showWarnings = FALSE) -write_json( - list( - indices = indices_matrix, - seed = 42L, - n_bootstrap = B, - se = se, - att = as.numeric(tau_hat), - metadata = list( - r_version = R.version.string, - synthdid_version = as.character(packageVersion("synthdid")), - panel_N = N, - panel_T = T, - N0 = N0, T0 = T0 - ) - ), - output_path, - pretty = TRUE, - auto_unbox = TRUE, - digits = NA # preserve full float64 precision -) -cat(sprintf("Wrote %s (B=%d, SE=%.15g)\n", output_path, B, se)) diff --git a/benchmarks/data/sdid_coverage.json b/benchmarks/data/sdid_coverage.json new file mode 100644 index 00000000..43f86d0a --- /dev/null +++ b/benchmarks/data/sdid_coverage.json @@ -0,0 +1,135 @@ +{ + "metadata": { + 
"n_seeds": 500, + "n_bootstrap": 200, + "library_version": "3.2.0", + "backend": "rust", + "generated_at": "2026-04-22T20:48:18.361220+00:00", + "total_elapsed_sec": 2424.92, + "methods": [ + "placebo", + "bootstrap", + "jackknife" + ], + "alphas": [ + 0.01, + 0.05, + 0.1 + ] + }, + "dgps": { + "balanced": "Balanced / exchangeable: N_co=20, N_tr=3, T_pre=8, T_post=4", + "unbalanced": "Unbalanced: N_co=30, N_tr=8, heterogeneous unit-FE variance", + "aer63": "Arkhangelsky et al. (2021) AER \u00a76.3: N=100, N1=20, T=120, T1=5, rank=2, \u03c3=2" + }, + "per_dgp": { + "balanced": { + "placebo": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.016, + "0.05": 0.06, + "0.10": 0.086 + }, + "mean_se": 0.21883358887261328, + "true_sd_tau_hat": 0.2093529148687405, + "se_over_truesd": 1.0452856078446147 + }, + "bootstrap": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.028, + "0.05": 0.078, + "0.10": 0.116 + }, + "mean_se": 0.21962976414466187, + "true_sd_tau_hat": 0.2093529148687405, + "se_over_truesd": 1.0490886371578094 + }, + "jackknife": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.066, + "0.05": 0.112, + "0.10": 0.154 + }, + "mean_se": 0.225193379965879, + "true_sd_tau_hat": 0.2093529148687405, + "se_over_truesd": 1.0756639338270981 + }, + "_elapsed_sec": 78.62 + }, + "unbalanced": { + "placebo": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.006, + "0.05": 0.032, + "0.10": 0.07 + }, + "mean_se": 0.14578822995037113, + "true_sd_tau_hat": 0.135562270427217, + "se_over_truesd": 1.0754336696407312 + }, + "bootstrap": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.008, + "0.05": 0.038, + "0.10": 0.08 + }, + "mean_se": 0.15072674925763238, + "true_sd_tau_hat": 0.135562270427217, + "se_over_truesd": 1.1118635648593473 + }, + "jackknife": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.024, + "0.05": 0.076, + "0.10": 0.12 + }, + "mean_se": 0.13429336452914814, + 
"true_sd_tau_hat": 0.135562270427217, + "se_over_truesd": 0.990639682456852 + }, + "_elapsed_sec": 90.61 + }, + "aer63": { + "placebo": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.018, + "0.05": 0.058, + "0.10": 0.086 + }, + "mean_se": 0.2681237752550122, + "true_sd_tau_hat": 0.2696262336703088, + "se_over_truesd": 0.9944276252542481 + }, + "bootstrap": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.01, + "0.05": 0.04, + "0.10": 0.078 + }, + "mean_se": 0.28291769703671454, + "true_sd_tau_hat": 0.2696262336703088, + "se_over_truesd": 1.0492958833622181 + }, + "jackknife": { + "n_successful_fits": 500, + "rejection_rate": { + "0.01": 0.03, + "0.05": 0.08, + "0.10": 0.15 + }, + "mean_se": 0.24309151423096811, + "true_sd_tau_hat": 0.2696262336703088, + "se_over_truesd": 0.9015870263136688 + }, + "_elapsed_sec": 2255.69 + } + } +} \ No newline at end of file diff --git a/benchmarks/python/coverage_sdid.py b/benchmarks/python/coverage_sdid.py new file mode 100644 index 00000000..f4db999a --- /dev/null +++ b/benchmarks/python/coverage_sdid.py @@ -0,0 +1,388 @@ +#!/usr/bin/env python3 +"""Coverage Monte Carlo study for SDID variance methods. + +Generates null-panel Monte Carlo samples (no treatment effect) across +three representative DGPs, fits SyntheticDiD under each of the three +variance methods (placebo, bootstrap, jackknife), and records rejection +rates at α ∈ {0.01, 0.05, 0.10} plus the ratio of mean estimated SE to +the empirical sampling SD of τ̂. + +The output JSON underwrites the calibration table in +``docs/methodology/REGISTRY.md`` §SyntheticDiD. + +Usage: + # Full run (~15–40 min on M-series Mac with Rust backend; AER §6.3 refit + # is the long tail at ~37 min. Matches the wall-clock wording in + # REGISTRY.md §SyntheticDiD coverage MC note.) 
+ python benchmarks/python/coverage_sdid.py \\ + --n-seeds 500 --n-bootstrap 200 \\ + --output benchmarks/data/sdid_coverage.json + + # Single DGP (chunkable across sessions) + python benchmarks/python/coverage_sdid.py \\ + --dgps balanced --n-seeds 500 \\ + --output benchmarks/data/sdid_coverage_balanced.json + + # Quick smoke-check (~2 minutes at these settings) + python benchmarks/python/coverage_sdid.py \\ + --n-seeds 20 --n-bootstrap 40 \\ + --output /tmp/sdid_coverage_smoke.json +""" + +from __future__ import annotations + +import argparse +import json +import os +import sys +import time +import warnings +from dataclasses import dataclass +from datetime import datetime, timezone +from pathlib import Path +from typing import Any, Callable, Dict, List, Optional, Tuple + + +def _get_backend_from_args() -> str: + parser = argparse.ArgumentParser(add_help=False) + parser.add_argument("--backend", default="auto", choices=["auto", "python", "rust"]) + args, _ = parser.parse_known_args() + return args.backend + + +_requested_backend = _get_backend_from_args() +if _requested_backend in ("python", "rust"): + os.environ["DIFF_DIFF_BACKEND"] = _requested_backend + + +sys.path.insert(0, str(Path(__file__).parent.parent.parent)) + +import numpy as np # noqa: E402 +import pandas as pd # noqa: E402 + +import diff_diff # noqa: E402 — imports after env var gate the backend +from diff_diff import HAS_RUST_BACKEND, SyntheticDiD # noqa: E402 + +ALL_METHODS = ("placebo", "bootstrap", "jackknife") +ALL_DGPS = ("balanced", "unbalanced", "aer63") +ALPHAS = (0.01, 0.05, 0.10) + + +@dataclass +class DGPSpec: + name: str + description: str + # generator returns (DataFrame, post_periods) + generator: Callable[[int], Tuple[pd.DataFrame, List[int]]] + + +def _balanced_dgp(seed: int) -> Tuple[pd.DataFrame, List[int]]: + """Balanced / exchangeable panel with null effect. 
+ + Small enough to run quickly; controls are exchangeable by construction + (iid unit FE draws), supporting tight placebo-tracks-refit assertions. + """ + rng = np.random.default_rng(seed) + n_control, n_treated = 20, 3 + n_pre, n_post = 8, 4 + rows = [] + for unit in range(n_control + n_treated): + unit_fe = rng.normal(0, 1.5) + for t in range(n_pre + n_post): + y = 10.0 + unit_fe + 0.25 * t + rng.normal(0, 0.5) + rows.append( + { + "unit": unit, + "period": t, + "treated": int(unit >= n_control), + "outcome": y, + } + ) + return pd.DataFrame(rows), list(range(n_pre, n_pre + n_post)) + + +def _unbalanced_dgp(seed: int) -> Tuple[pd.DataFrame, List[int]]: + """Unbalanced panel: more treated, uneven unit-FE variance. + + Exercises a regime where placebo's ``n_control > n_treated`` constraint + is tight but still satisfied, and where refit's re-estimation has more + signal to distinguish from the fixed-weight shortcut. + """ + rng = np.random.default_rng(seed) + n_control, n_treated = 30, 8 + n_pre, n_post = 10, 6 + rows = [] + for unit in range(n_control + n_treated): + # Treated units have higher FE variance (non-exchangeable cohorts) + scale = 2.5 if unit >= n_control else 1.0 + unit_fe = rng.normal(0, scale) + for t in range(n_pre + n_post): + y = 10.0 + unit_fe + 0.20 * t + rng.normal(0, 0.6) + rows.append( + { + "unit": unit, + "period": t, + "treated": int(unit >= n_control), + "outcome": y, + } + ) + return pd.DataFrame(rows), list(range(n_pre, n_pre + n_post)) + + +def _aer63_dgp(seed: int) -> Tuple[pd.DataFrame, List[int]]: + """Arkhangelsky et al. (2021) AER §6.3 non-exchangeable DGP. + + N=100, N1=20, T=120, T1=5, rank=2 factor structure, iid N(0, σ²) + errors with σ=2. Set to τ=0 here (H0 calibration). The paper reports + 95% jackknife coverage at 98% on iid errors and 93% on AR(1) ρ=0.7; + we reproduce the iid setting. 
+ """ + rng = np.random.default_rng(seed) + n_control, n_treated = 80, 20 + n_pre, n_post = 115, 5 # T=120, T1=5 (post) + n_factors = 2 + sigma = 2.0 + + # Factor loadings per unit (rank-2). Treated units have shifted loadings, + # which is what makes the DGP non-exchangeable. + loadings_control = rng.normal(0, 1.0, size=(n_control, n_factors)) + loadings_treated = rng.normal(0.5, 1.0, size=(n_treated, n_factors)) + loadings = np.vstack([loadings_control, loadings_treated]) + factors = rng.normal(0, 1.0, size=(n_pre + n_post, n_factors)) + + unit_fe = rng.normal(0, 1.0, size=(n_control + n_treated,)) + time_fe = rng.normal(0, 0.3, size=(n_pre + n_post,)) + + rows = [] + for unit in range(n_control + n_treated): + for t in range(n_pre + n_post): + iif = float(loadings[unit] @ factors[t]) + y = unit_fe[unit] + time_fe[t] + iif + rng.normal(0, sigma) + rows.append( + { + "unit": unit, + "period": t, + "treated": int(unit >= n_control), + "outcome": y, + } + ) + return pd.DataFrame(rows), list(range(n_pre, n_pre + n_post)) + + +DGPS: Dict[str, DGPSpec] = { + "balanced": DGPSpec( + "balanced", + "Balanced / exchangeable: N_co=20, N_tr=3, T_pre=8, T_post=4", + _balanced_dgp, + ), + "unbalanced": DGPSpec( + "unbalanced", + "Unbalanced: N_co=30, N_tr=8, heterogeneous unit-FE variance", + _unbalanced_dgp, + ), + "aer63": DGPSpec( + "aer63", + "Arkhangelsky et al. 
(2021) AER §6.3: N=100, N1=20, T=120, T1=5, rank=2, σ=2", + _aer63_dgp, + ), +} + + +def _fit_one( + method: str, + df: pd.DataFrame, + post_periods: List[int], + n_bootstrap: int, + seed: int, +) -> Tuple[Optional[float], Optional[float], Optional[float]]: + """Fit SDID and return (att, se, p_value); (None, None, None) on failure.""" + try: + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + r = SyntheticDiD( + variance_method=method, + n_bootstrap=n_bootstrap, + seed=seed, + ).fit( + df, + outcome="outcome", + treatment="treated", + unit="unit", + time="period", + post_periods=post_periods, + ) + att = float(r.att) if np.isfinite(r.att) else None + se = float(r.se) if np.isfinite(r.se) else None + p_value = float(r.p_value) if np.isfinite(r.p_value) else None + return att, se, p_value + except (ValueError, RuntimeError, NotImplementedError): + return None, None, None + + +def _summarize( + atts: np.ndarray, + ses: np.ndarray, + p_values: np.ndarray, +) -> Dict[str, Any]: + """Reduce per-seed (att, se, p_value) arrays to summary stats.""" + finite = np.isfinite(atts) & np.isfinite(ses) & np.isfinite(p_values) + n_successful = int(finite.sum()) + if n_successful == 0: + return { + "n_successful_fits": 0, + "rejection_rate": {f"{a:.2f}": float("nan") for a in ALPHAS}, + "mean_se": float("nan"), + "true_sd_tau_hat": float("nan"), + "se_over_truesd": float("nan"), + } + atts_f = atts[finite] + ses_f = ses[finite] + ps_f = p_values[finite] + rejection = {f"{a:.2f}": float(np.mean(ps_f < a)) for a in ALPHAS} + mean_se = float(ses_f.mean()) + true_sd = float(atts_f.std(ddof=1)) if n_successful > 1 else float("nan") + ratio = float(mean_se / true_sd) if np.isfinite(true_sd) and true_sd > 0 else float("nan") + return { + "n_successful_fits": n_successful, + "rejection_rate": rejection, + "mean_se": mean_se, + "true_sd_tau_hat": true_sd, + "se_over_truesd": ratio, + } + + +def _run_dgp( + name: str, + spec: DGPSpec, + n_seeds: int, + n_bootstrap: int, + 
methods: Tuple[str, ...], +) -> Dict[str, Any]: + """Run all methods × n_seeds for one DGP. Returns summary dict.""" + print(f"\n=== DGP: {name} ({spec.description}) ===", flush=True) + + # Preallocate per-method arrays + atts = {m: np.full(n_seeds, np.nan) for m in methods} + ses = {m: np.full(n_seeds, np.nan) for m in methods} + pvs = {m: np.full(n_seeds, np.nan) for m in methods} + + start = time.time() + for seed in range(n_seeds): + df, post = spec.generator(seed) + for method in methods: + att, se, p = _fit_one(method, df, post, n_bootstrap, seed) + if att is not None: + atts[method][seed] = att + if se is not None: + ses[method][seed] = se + if p is not None: + pvs[method][seed] = p + if (seed + 1) % 10 == 0 or seed + 1 == n_seeds: + elapsed = time.time() - start + pace = (seed + 1) / max(elapsed, 1e-9) + eta = (n_seeds - seed - 1) / max(pace, 1e-9) + print( + f" seed {seed + 1}/{n_seeds} " f"(elapsed {elapsed:.1f}s, eta {eta:.1f}s)", + flush=True, + ) + + summaries: Dict[str, Any] = {m: _summarize(atts[m], ses[m], pvs[m]) for m in methods} + summaries["_elapsed_sec"] = round(time.time() - start, 2) + return summaries + + +def parse_args() -> argparse.Namespace: + p = argparse.ArgumentParser( + description=( + "SDID coverage Monte Carlo study — rejection rates across " + "placebo / bootstrap / jackknife on representative DGPs." 
+ ) + ) + p.add_argument("--n-seeds", type=int, default=500, help="MC replications per DGP (default 500)") + p.add_argument( + "--n-bootstrap", + type=int, + default=200, + help="Bootstrap replications per fit (default 200)", + ) + p.add_argument( + "--dgps", + type=str, + default=",".join(ALL_DGPS), + help=f"Comma-separated DGPs to run (default all: {','.join(ALL_DGPS)})", + ) + p.add_argument( + "--methods", + type=str, + default=",".join(ALL_METHODS), + help=f"Comma-separated variance methods (default all: {','.join(ALL_METHODS)})", + ) + p.add_argument("--output", type=str, required=True, help="Output JSON path") + p.add_argument( + "--backend", + default="auto", + choices=["auto", "python", "rust"], + help="Backend selection (parsed before diff_diff import)", + ) + return p.parse_args() + + +def main() -> None: + args = parse_args() + + dgps_requested = tuple(d.strip() for d in args.dgps.split(",") if d.strip()) + methods_requested = tuple(m.strip() for m in args.methods.split(",") if m.strip()) + + unknown_dgps = [d for d in dgps_requested if d not in DGPS] + if unknown_dgps: + sys.exit(f"Unknown DGP(s): {unknown_dgps}. Known: {list(DGPS)}") + unknown_methods = [m for m in methods_requested if m not in ALL_METHODS] + if unknown_methods: + sys.exit(f"Unknown method(s): {unknown_methods}. 
Known: {ALL_METHODS}") + + actual_backend = "rust" if HAS_RUST_BACKEND else "python" + lib_version = getattr(diff_diff, "__version__", "unknown") + + print( + f"SDID coverage MC: n_seeds={args.n_seeds}, n_bootstrap={args.n_bootstrap}, " + f"backend={actual_backend}, version={lib_version}", + flush=True, + ) + print(f"DGPs: {list(dgps_requested)}", flush=True) + print(f"Methods: {list(methods_requested)}", flush=True) + + overall_start = time.time() + per_dgp = {} + for dgp_name in dgps_requested: + per_dgp[dgp_name] = _run_dgp( + dgp_name, + DGPS[dgp_name], + args.n_seeds, + args.n_bootstrap, + methods_requested, + ) + + output = { + "metadata": { + "n_seeds": args.n_seeds, + "n_bootstrap": args.n_bootstrap, + "library_version": lib_version, + "backend": actual_backend, + "generated_at": datetime.now(timezone.utc).isoformat(), + "total_elapsed_sec": round(time.time() - overall_start, 2), + "methods": list(methods_requested), + "alphas": list(ALPHAS), + }, + "dgps": {name: DGPS[name].description for name in dgps_requested}, + "per_dgp": per_dgp, + } + + out_path = Path(args.output) + out_path.parent.mkdir(parents=True, exist_ok=True) + with open(out_path, "w") as f: + json.dump(output, f, indent=2) + print(f"\nWrote {out_path} ({out_path.stat().st_size / 1024:.1f} KB)", flush=True) + + +if __name__ == "__main__": + main() diff --git a/diff_diff/_backend.py b/diff_diff/_backend.py index cfcffe7f..d12d9a4c 100644 --- a/diff_diff/_backend.py +++ b/diff_diff/_backend.py @@ -34,6 +34,7 @@ compute_time_weights as _rust_compute_time_weights, compute_noise_level as _rust_compute_noise_level, sc_weight_fw as _rust_sc_weight_fw, + sc_weight_fw_with_convergence as _rust_sc_weight_fw_with_convergence, # Diagnostics rust_backend_info as _rust_backend_info, ) @@ -57,6 +58,7 @@ _rust_compute_time_weights = None _rust_compute_noise_level = None _rust_sc_weight_fw = None + _rust_sc_weight_fw_with_convergence = None _rust_backend_info = None # Determine final backend based on 
environment variable and availability @@ -79,6 +81,7 @@ _rust_compute_time_weights = None _rust_compute_noise_level = None _rust_sc_weight_fw = None + _rust_sc_weight_fw_with_convergence = None _rust_backend_info = None elif _backend_env == "rust": # Force Rust mode - fail if not available @@ -127,4 +130,5 @@ def rust_backend_info(): "_rust_compute_time_weights", "_rust_compute_noise_level", "_rust_sc_weight_fw", + "_rust_sc_weight_fw_with_convergence", ] diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index 5b648826..31d36ecc 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -468,7 +468,7 @@ SyntheticDiD( zeta_omega: float | None = None, # Unit weight regularization (auto-computed if None) zeta_lambda: float | None = None, # Time weight regularization (auto-computed if None) alpha: float = 0.05, - variance_method: str = "placebo", # "placebo" or "bootstrap" + variance_method: str = "placebo", # "placebo", "bootstrap", or "jackknife" n_bootstrap: int = 200, # Replications for variance estimation seed: int | None = None, ) @@ -1673,7 +1673,7 @@ sd_female, data_female = sd.subpopulation(data, mask=lambda df: df['sex'] == 'F' **Key features:** - Taylor Series Linearization (TSL) variance with strata + PSU + FPC - Replicate weight variance: BRR, Fay's BRR, JK1, JKn, SDR (13 of 16 estimators, including dCDH) -- Survey-aware bootstrap: multiplier at PSU (Hall-Mammen wild; dCDH, staggered) or Rao-Wu rescaled (SyntheticDiD, SunAbraham) +- Survey-aware bootstrap: multiplier at PSU (Hall-Mammen wild; dCDH, staggered) or Rao-Wu rescaled (SunAbraham, TROP). 
SyntheticDiD bootstrap is non-survey only: the paper-faithful refit path re-estimates weights via Frank-Wolfe per draw, and Rao-Wu + refit composition is not yet implemented (tracked in TODO.md) - DEFF diagnostics, subpopulation analysis, weight trimming (`trim_weights`) - Repeated cross-sections: `CallawaySantAnna(panel=False)` - Compatibility matrix: see `docs/choosing_estimator.rst` Survey Design Support section diff --git a/diff_diff/power.py b/diff_diff/power.py index e75efb9c..9e631f68 100644 --- a/diff_diff/power.py +++ b/diff_diff/power.py @@ -717,7 +717,8 @@ def _check_sdid_placebo_data( f"treated units, but the generated data has n_control={n_control}, " f"n_treated={n_treated}. Either adjust your data_generator so that " f"n_control > n_treated, or use " - f"SyntheticDiD(variance_method='bootstrap')." + f"SyntheticDiD(variance_method='bootstrap') (paper-faithful refit; " + f"~5-30x slower than placebo) or SyntheticDiD(variance_method='jackknife')." ) @@ -2045,7 +2046,9 @@ def simulate_power( f"treated units (got n_control={n_control}, " f"n_treated={effective_n_treated}). Either lower " f"treatment_fraction so that n_control > n_treated, or use " - f"SyntheticDiD(variance_method='bootstrap')." + f"SyntheticDiD(variance_method='bootstrap') (paper-faithful refit; " + f"~5-30x slower than placebo) or " + f"SyntheticDiD(variance_method='jackknife')." ) # Warn if staggered estimator settings don't match auto DGP diff --git a/diff_diff/results.py b/diff_diff/results.py index dedcc28f..ebdd1549 100644 --- a/diff_diff/results.py +++ b/diff_diff/results.py @@ -811,12 +811,17 @@ class SyntheticDiDResults: post_periods : list List of post-treatment period identifiers. variance_method : str - Method used for variance estimation: "bootstrap", "jackknife", or "placebo". + Method used for variance estimation: ``"bootstrap"`` (paper-faithful + pairs bootstrap re-estimating ω and λ via Frank-Wolfe on each draw; + Arkhangelsky et al. 
2021 Algorithm 2 step 2, and R's default + ``synthdid::vcov(method="bootstrap")``), ``"jackknife"``, or + ``"placebo"``. placebo_effects : np.ndarray, optional Method-specific per-iteration estimates: placebo treatment effects - (for "placebo"), bootstrap ATT estimates (for "bootstrap"), or - leave-one-out estimates (for "jackknife"). The ``variance_method`` - field disambiguates the contents. + (for ``"placebo"``), bootstrap ATT estimates with re-estimated + weights per draw (for ``"bootstrap"``), or leave-one-out estimates + (for ``"jackknife"``). The ``variance_method`` field disambiguates + the contents. synthetic_pre_trajectory : np.ndarray, optional Synthetic control trajectory in pre-treatment periods, shape ``(n_pre,)``. Equal to ``Y_pre_control @ omega_eff`` where diff --git a/diff_diff/synthetic_did.py b/diff_diff/synthetic_did.py index 10dc4c7b..55872d9a 100644 --- a/diff_diff/synthetic_did.py +++ b/diff_diff/synthetic_did.py @@ -50,15 +50,25 @@ class SyntheticDiD(DifferenceInDifferences): variance_method : str, default="placebo" Method for variance estimation: - "placebo": Placebo-based variance matching R's synthdid::vcov(method="placebo"). - Implements Algorithm 4 from Arkhangelsky et al. (2021). This is R's default. - - "bootstrap": Bootstrap at unit level with fixed weights matching R's - synthdid::vcov(method="bootstrap"). + Implements Algorithm 4 from Arkhangelsky et al. (2021). Library default + (R's default is ``"bootstrap"``; we default to placebo because it is + unconditionally available on pweight-only survey designs and avoids the + ~5–30× slowdown of the refit bootstrap). See REGISTRY.md §SyntheticDiD + ``Note (default variance_method deviation from R)`` for rationale. + - "bootstrap": Paper-faithful pairs bootstrap — Arkhangelsky et al. 
(2021) + Algorithm 2 step 2, also the behavior of R's default + synthdid::vcov(method="bootstrap") (which rebinds ``attr(estimate, "opts")`` + with ``update.omega=TRUE``, so the renormalized ω is only Frank-Wolfe + initialization). Re-estimates ω̂_b and λ̂_b via two-pass sparsified + Frank-Wolfe on each bootstrap draw. Survey designs (including pweight-only) + raise NotImplementedError; Rao-Wu rescaled weights composed with Frank-Wolfe + re-estimation requires a separate derivation (tracked in TODO.md). - "jackknife": Jackknife variance matching R's synthdid::vcov(method="jackknife"). Implements Algorithm 3 from Arkhangelsky et al. (2021). Deterministic (N_control + N_treated iterations), uses fixed weights (no re-estimation). The ``n_bootstrap`` parameter is ignored for this method. n_bootstrap : int, default=200 - Number of replications for variance estimation. Used for both: + Number of replications for variance estimation. Used for: - Bootstrap: Number of bootstrap samples - Placebo: Number of random permutations (matches R's `replications` argument) Ignored when ``variance_method="jackknife"``. @@ -168,23 +178,30 @@ def __init__( self.n_bootstrap = n_bootstrap self.seed = seed - # Validate n_bootstrap (irrelevant for jackknife, which is deterministic) - if n_bootstrap < 2 and variance_method != "jackknife": + self._validate_config() + + self._unit_weights = None + self._time_weights = None + + _VALID_VARIANCE_METHODS = ("bootstrap", "jackknife", "placebo") + + def _validate_config(self) -> None: + """Validate ``variance_method`` and ``n_bootstrap`` on the current state. + + Called from both ``__init__`` and ``set_params`` so updates via the + sklearn-style setter path enforce the same contract as construction. + """ + if self.variance_method not in self._VALID_VARIANCE_METHODS: raise ValueError( - f"n_bootstrap must be >= 2 (got {n_bootstrap}). At least 2 " - f"iterations are needed to estimate standard errors." 
+ f"variance_method must be one of {self._VALID_VARIANCE_METHODS}, " + f"got '{self.variance_method}'" ) - - # Validate variance_method - valid_methods = ("bootstrap", "jackknife", "placebo") - if variance_method not in valid_methods: + if self.n_bootstrap < 2 and self.variance_method != "jackknife": raise ValueError( - f"variance_method must be one of {valid_methods}, " f"got '{variance_method}'" + f"n_bootstrap must be >= 2 (got {self.n_bootstrap}). At least 2 " + f"iterations are needed to estimate standard errors." ) - self._unit_weights = None - self._time_weights = None - def fit( # type: ignore[override] self, data: pd.DataFrame, @@ -222,10 +239,14 @@ def fit( # type: ignore[override] out before computing the SDID estimator. survey_design : SurveyDesign, optional Survey design specification. Only pweight weight_type is supported. - Strata/PSU/FPC are supported via Rao-Wu rescaled bootstrap when - variance_method='bootstrap'. Non-bootstrap variance methods - (placebo, jackknife) do not support strata/PSU/FPC; use - variance_method='bootstrap' for full designs. + ``variance_method='placebo'`` and ``variance_method='jackknife'`` + accept pweight-only surveys (composed via ``w_control`` / + ``w_treated``). ``variance_method='bootstrap'`` rejects all + survey designs (including pweight-only) and strata/PSU/FPC are + not supported by any variance method on this release — + composing Rao-Wu rescaled weights with paper-faithful + Frank-Wolfe re-estimation requires a separate derivation + (tracked in TODO.md, sketched in REGISTRY.md §SyntheticDiD). Returns ------- @@ -238,6 +259,10 @@ def fit( # type: ignore[override] ValueError If required parameters are missing, data validation fails, or a non-pweight survey design is provided. + NotImplementedError + If ``survey_design`` is provided with strata/PSU/FPC, or if + ``variance_method='bootstrap'`` is provided with any survey + design (including pweight-only). 
""" # Validate inputs if outcome is None or treatment is None or unit is None or time is None: @@ -262,33 +287,58 @@ def fit( # type: ignore[override] resolved_survey, survey_weights, survey_weight_type, survey_metadata = ( _resolve_survey_for_fit(survey_design, data, "analytical") ) - # Reject replicate-weight designs — SyntheticDiD uses bootstrap variance + # Reject replicate-weight designs — SyntheticDiD has no replicate- + # weight variance path. Full survey designs with strata/PSU/FPC are + # also not supported on any variance method in this release (see the + # guards below). Pweight-only works with variance_method='placebo' + # or 'jackknife'. if resolved_survey is not None and resolved_survey.uses_replicate_variance: raise NotImplementedError( - "SyntheticDiD does not yet support replicate-weight survey " - "designs. Use a TSL-based survey design (strata/psu/fpc)." + "SyntheticDiD does not support replicate-weight survey designs. " + "Only pweight-only survey weights are accepted, and only with " + "variance_method='placebo' or 'jackknife'. See " + "docs/methodology/REGISTRY.md §SyntheticDiD for the survey " + "support matrix." ) - # Validate pweight only (strata/PSU/FPC are allowed for Rao-Wu bootstrap) + # Validate pweight only if resolved_survey is not None and resolved_survey.weight_type != "pweight": raise ValueError( "SyntheticDiD survey support requires weight_type='pweight'. " f"Got '{resolved_survey.weight_type}'." ) - # Reject non-bootstrap + full survey design (strata/PSU/FPC need Rao-Wu) - if ( - resolved_survey is not None - and ( - resolved_survey.strata is not None - or resolved_survey.psu is not None - or resolved_survey.fpc is not None - ) - and self.variance_method != "bootstrap" + # Reject strata/PSU/FPC for any variance method. 
Previously the + # fixed-weight bootstrap accepted these via Rao-Wu rescaling, but that + # path was not paper-faithful and is removed; Rao-Wu composed with + # Frank-Wolfe re-estimation (paper-faithful) requires a separate + # derivation (tracked in TODO.md, sketched in REGISTRY.md). + if resolved_survey is not None and ( + resolved_survey.strata is not None + or resolved_survey.psu is not None + or resolved_survey.fpc is not None ): raise NotImplementedError( - f"SyntheticDiD with variance_method='{self.variance_method}' does not " - "support strata/PSU/FPC. " - "Use variance_method='bootstrap' for full survey design support." + "SyntheticDiD does not yet support survey designs with " + "strata/PSU/FPC. Pweight-only pseudo-population weights work " + "with variance_method='placebo' or 'jackknife'. Rao-Wu " + "rescaled weights composed with paper-faithful refit " + "bootstrap is tracked as a follow-up; see " + "docs/methodology/REGISTRY.md §SyntheticDiD for the sketch." + ) + + # Reject bootstrap + any survey design (including pweight-only). The + # paper-faithful refit bootstrap re-estimates ω̂ and λ̂ via Frank-Wolfe + # on each bootstrap draw; composing that with Rao-Wu rescaled weights + # (or even a pweight composition in the weighted FW loss) requires a + # separate derivation that is not yet implemented. + if self.variance_method == "bootstrap" and resolved_survey is not None: + raise NotImplementedError( + "SyntheticDiD with variance_method='bootstrap' does not yet " + "support survey designs (including pweight-only). Rao-Wu " + "rescaled weights composed with Frank-Wolfe re-estimation " + "requires a separate derivation (tracked in TODO.md). Use " + "variance_method='placebo' or 'jackknife' for pweight-only " + "surveys." ) # Validate treatment is binary @@ -361,29 +411,13 @@ def fit( # type: ignore[override] f"diff_diff.prep.balance_panel() to balance the panel first." 
) - # Validate and extract survey weights - # Build unit-level ResolvedSurveyDesign for Rao-Wu bootstrap when - # strata/PSU/FPC are present (survey columns are unit-constant). - _unit_resolved_survey = None + # Validate and extract survey weights. Strata/PSU/FPC are rejected + # upstream, so we only reach here with pweight-only surveys, which + # placebo and jackknife consume via ``w_control`` directly. if resolved_survey is not None: _validate_unit_constant_survey(data, unit, survey_design) w_treated = _extract_unit_survey_weights(data, unit, survey_design, treated_units) w_control = _extract_unit_survey_weights(data, unit, survey_design, control_units) - - # Build unit-level resolved survey for Rao-Wu bootstrap - _has_design = ( - resolved_survey.strata is not None - or resolved_survey.psu is not None - or resolved_survey.fpc is not None - ) - if _has_design: - _unit_resolved_survey = self._build_unit_resolved_survey( - data, - unit, - survey_design, - control_units, - treated_units, - ) else: w_treated = None w_control = None @@ -558,6 +592,9 @@ def fit( # type: ignore[override] # Variance procedures resample / permute indices (independent of Y # values) so RNG streams stay aligned across scales. if self.variance_method == "bootstrap": + # Paper-faithful pairs bootstrap (Algorithm 2 step 2): re-estimate + # ω̂_b and λ̂_b via Frank-Wolfe on each draw. Survey designs are + # rejected upstream in the survey guards. 
se_n, bootstrap_estimates_n = self._bootstrap_se( Y_pre_control_n, Y_post_control_n, @@ -565,9 +602,9 @@ def fit( # type: ignore[override] Y_post_treated_n, unit_weights, time_weights, - w_treated=w_treated, - w_control=w_control, - resolved_survey=_unit_resolved_survey, + zeta_omega_n=zeta_omega_n, + zeta_lambda_n=zeta_lambda_n, + min_decrease=min_decrease, ) se = se_n * Y_scale placebo_effects = np.asarray(bootstrap_estimates_n) * Y_scale @@ -794,63 +831,6 @@ def _residualize_covariates( return data - @staticmethod - def _build_unit_resolved_survey(data, unit_col, survey_design, control_units, treated_units): - """Build a unit-level ResolvedSurveyDesign for Rao-Wu bootstrap. - - Extracts one row per unit (survey columns are unit-constant) in - control-then-treated order matching the panel matrix columns. - """ - from diff_diff.linalg import _factorize_cluster_ids - from diff_diff.survey import ResolvedSurveyDesign - - all_units = list(control_units) + list(treated_units) - # Take first row per unit in the specified order - first_rows = data.groupby(unit_col).first().loc[all_units] - n_units = len(all_units) - - # Weights (normalized pweights, mean=1) - if survey_design.weights is not None: - raw_w = first_rows[survey_design.weights].values.astype(np.float64) - weights = raw_w * (n_units / np.sum(raw_w)) - else: - weights = np.ones(n_units, dtype=np.float64) - - # Strata - strata_arr = None - n_strata = 0 - if survey_design.strata is not None: - strata_arr = _factorize_cluster_ids(first_rows[survey_design.strata].values) - n_strata = len(np.unique(strata_arr)) - - # PSU - psu_arr = None - n_psu = 0 - if survey_design.psu is not None: - psu_raw = first_rows[survey_design.psu].values - if survey_design.nest and strata_arr is not None: - combined = np.array([f"{s}_{p}" for s, p in zip(strata_arr, psu_raw)]) - psu_arr = _factorize_cluster_ids(combined) - else: - psu_arr = _factorize_cluster_ids(psu_raw) - n_psu = len(np.unique(psu_arr)) - - # FPC - fpc_arr = None - if 
survey_design.fpc is not None: - fpc_arr = first_rows[survey_design.fpc].values.astype(np.float64) - - return ResolvedSurveyDesign( - weights=weights, - weight_type=survey_design.weight_type, - strata=strata_arr, - psu=psu_arr, - fpc=fpc_arr, - n_strata=n_strata, - n_psu=n_psu, - lonely_psu=survey_design.lonely_psu, - ) - def _bootstrap_se( self, Y_pre_control: np.ndarray, @@ -859,31 +839,51 @@ def _bootstrap_se( Y_post_treated: np.ndarray, unit_weights: np.ndarray, time_weights: np.ndarray, - w_treated=None, - w_control=None, - resolved_survey=None, - _bootstrap_indices: Optional[np.ndarray] = None, + zeta_omega_n: float = 0.0, + zeta_lambda_n: float = 0.0, + min_decrease: float = 1e-5, ) -> Tuple[float, np.ndarray]: - """Compute bootstrap standard error matching R's synthdid bootstrap_sample. - - Resamples all units (control + treated) with replacement, renormalizes - original unit weights for the resampled controls, and computes the - SDID estimator with **fixed** weights (no re-estimation). - - When ``resolved_survey`` is provided (unit-level ResolvedSurveyDesign - with strata/PSU/FPC), uses Rao-Wu rescaled bootstrap instead of the - simple pairs bootstrap. The Rao-Wu weights are per-unit rescaled - survey weights; they composite with SDID unit weights the same way - pweights do in the weights-only path. - - ``_bootstrap_indices`` is a test-only seam: when provided (shape - ``(n_bootstrap, n_total)``), the pairs-bootstrap branch uses row ``b`` - of the array instead of ``rng.choice``. Used by the R-parity test to - feed pre-computed R indices; ignored by the Rao-Wu branch. - - This matches R's ``synthdid::vcov(method="bootstrap")``. + """Compute pairs-bootstrap standard error for SDID (Algorithm 2 step 2). + + Paper-faithful refit bootstrap: resamples all units (control + treated) + with replacement, then re-estimates ω̂_b and λ̂_b via two-pass sparsified + Frank-Wolfe on each resampled panel (Arkhangelsky et al. 2021 + Algorithm 2 step 2). 
Also matches R's default + ``synthdid::vcov(method="bootstrap")`` behavior: R rebinds + ``attr(estimate, "opts")`` (including ``update.omega=TRUE``) back into + ``synthdid_estimate`` on each draw, so the renormalized ω serves only + as Frank-Wolfe initialization. + + ``zeta_omega_n`` / ``zeta_lambda_n`` are the fit-time normalized-scale + regularization parameters (``ζ_ω / Y_scale``, ``ζ_λ / Y_scale``), used + unchanged on each bootstrap draw because the resampled panel shares + the original noise scale. + + Survey designs are rejected upstream in ``fit()``. The fit-time + ``unit_weights`` (ω̂) and ``time_weights`` (λ̂) are not reused as fixed + estimator weights — every draw re-estimates ω̂_b and λ̂_b from scratch + (using the normalized-scale zetas) — but they ARE used as Frank-Wolfe + warm-start initializations: ``_sum_normalize(unit_weights[boot_ctrl])`` + seeds ω̂_b's first pass, and ``time_weights`` seeds λ̂_b's. This + matches R's ``vcov.R::bootstrap_sample`` shape (see the warm-start + comment below, and the ``Deviation`` / warm-start discussion in + ``docs/methodology/REGISTRY.md`` §SyntheticDiD). The original ω / λ + remain on the result object as the fit-time weights. + + Retry-to-B: matches R's ``synthdid::bootstrap_sample`` while-loop. A + bounded attempt guard of ``20 * n_bootstrap`` prevents pathological- + input hangs; normal fits finish well inside this budget because + degenerate-draw probability scales as ``(N_co/N)^N + (N_tr/N)^N``. + Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed + inside the loop and aggregated into one summary warning after the + loop if the rate of draws with any non-convergence event exceeds 5% + of valid draws (counted once per draw, not once per solver call). 
""" - from diff_diff.bootstrap_utils import generate_rao_wu_weights + # unit_weights (fit-time ω) and time_weights (fit-time λ) are used + # as warm-start Frank-Wolfe initializations per bootstrap draw, + # matching R's ``vcov.R::bootstrap_sample`` which passes + # ``sum_normalize(weights$omega[sort(ind[ind <= N0])])`` and + # ``weights$lambda`` as the FW init via the rebound ``opts``. rng = np.random.default_rng(self.seed) n_control = Y_pre_control.shape[1] @@ -894,156 +894,111 @@ def _bootstrap_se( Y_full = np.block([[Y_pre_control, Y_pre_treated], [Y_post_control, Y_post_treated]]) n_pre = Y_pre_control.shape[0] - # Determine whether to use Rao-Wu (full design) or pairs bootstrap - _use_rao_wu = resolved_survey is not None - - # Check for unidentified variance (single unstratified PSU) - if ( - _use_rao_wu - and resolved_survey.psu is not None - and resolved_survey.n_psu < 2 - and resolved_survey.strata is None - ): - return np.nan, np.array([]) - bootstrap_estimates: List[float] = [] # Retry-until-B-valid semantic: matches R's synthdid::bootstrap_sample # (`while (count < replications) { ...; if !is.na(est) count = count+1 }`) - # and paper Algorithm 2 (B bootstrap replicates). A bounded attempt - # guard (20x) prevents pathological-input hangs; normal fits finish - # well inside this budget because degenerate-draw probability scales - # as (N_co/N)^N + (N_tr/N)^N, which is small for any non-trivial - # N_co/N_tr split. + # and paper Algorithm 2 (B bootstrap replicates). max_attempts = 20 * self.n_bootstrap attempts = 0 - # Fixture-driven path uses deterministic index iteration instead of - # retry: R's generator already encodes retry semantics into the stored - # B x N indices, so every row is pre-validated. - fixture_row = 0 + # Tally draws with any Frank-Wolfe non-convergence; per-draw + # UserWarnings are suppressed inside the loop and aggregated into + # one summary warning after the loop (avoids 200+ warnings per fit). 
+ fw_nonconvergence_count = 0 while len(bootstrap_estimates) < self.n_bootstrap and attempts < max_attempts: attempts += 1 - if _use_rao_wu: - # --- Rao-Wu rescaled bootstrap path --- - # generate_rao_wu_weights returns per-unit rescaled survey - # weights (shape n_total). Units whose PSU was not drawn - # get weight 0, effectively dropping them. - try: - boot_rw = generate_rao_wu_weights(resolved_survey, rng) - - rw_control = boot_rw[:n_control] - rw_treated = boot_rw[n_control:] - - # Retry if all control or all treated weights are zero - if rw_control.sum() == 0 or rw_treated.sum() == 0: - continue - - # Composite SDID unit weights with Rao-Wu rescaled weights - boot_omega_eff = unit_weights * rw_control - if boot_omega_eff.sum() > 0: - boot_omega_eff = boot_omega_eff / boot_omega_eff.sum() - else: - continue - - # Treated mean weighted by Rao-Wu weights - Y_boot_pre_t_mean = np.average( - Y_pre_treated, - axis=1, - weights=rw_treated, - ) - Y_boot_post_t_mean = np.average( - Y_post_treated, - axis=1, - weights=rw_treated, - ) + boot_idx = rng.choice(n_total, size=n_total, replace=True) - tau = compute_sdid_estimator( - Y_pre_control, - Y_post_control, - Y_boot_pre_t_mean, - Y_boot_post_t_mean, - boot_omega_eff, - time_weights, - ) - if np.isfinite(tau): - bootstrap_estimates.append(float(tau)) + # Split resampled units into control vs treated + boot_is_control = boot_idx < n_control + n_co_b = int(boot_is_control.sum()) - except (ValueError, LinAlgError): - continue - else: - # --- Standard pairs bootstrap path (weights-only or no survey) --- - # Resample ALL units with replacement (or use pre-computed - # indices from the test-only _bootstrap_indices seam). 
- if _bootstrap_indices is not None: - if fixture_row >= len(_bootstrap_indices): - break - boot_idx = np.asarray(_bootstrap_indices[fixture_row], dtype=np.int64) - fixture_row += 1 - else: - boot_idx = rng.choice(n_total, size=n_total, replace=True) + # Retry if no control or no treated units in bootstrap sample + if n_co_b == 0 or n_co_b == n_total: + continue - # Identify which resampled units are control vs treated - boot_is_control = boot_idx < n_control + try: + # Extract resampled outcome matrices + Y_boot = Y_full[:, boot_idx] + Y_boot_pre_c = Y_boot[:n_pre, boot_is_control] + Y_boot_post_c = Y_boot[n_pre:, boot_is_control] + Y_boot_pre_t = Y_boot[:n_pre, ~boot_is_control] + Y_boot_post_t = Y_boot[n_pre:, ~boot_is_control] + + # Unweighted treated-unit mean — survey weights are rejected + # upstream in fit(), so every path here is unweighted. + Y_boot_pre_t_mean = np.mean(Y_boot_pre_t, axis=1) + Y_boot_post_t_mean = np.mean(Y_boot_post_t, axis=1) + + # Warm-start Frank-Wolfe initialization matching R's + # ``vcov.R::bootstrap_sample`` shape: ω_init is the fit-time + # ω renormalized over the resampled controls (same + # ``sum_normalize`` operation R uses), and λ_init is the + # fit-time λ unchanged. R's ``synthdid_estimate`` with + # ``update.omega=TRUE`` / ``update.lambda=TRUE`` then runs + # Frank-Wolfe from those initializations. Without warm-start + # our FW first-pass (max_iter=100) may land in a different + # sparsification pattern than R's on problems where the + # 100-iter budget is tight (e.g., small ζ_λ on + # less-regularized time weights). Strictly-convex objective + # → warm and cold start converge to the same global minimum + # when FW is run to full convergence, but sparsification + # introduces path dependence on the 100-iter first pass. 
boot_control_idx = boot_idx[boot_is_control] - boot_treated_idx = boot_idx[~boot_is_control] - - # Retry if no control or no treated units in bootstrap sample - if len(boot_control_idx) == 0 or len(boot_treated_idx) == 0: - continue - - try: - # Renormalize original unit weights for the resampled controls - boot_omega = _sum_normalize(unit_weights[boot_control_idx]) - - # Compose with control survey weights if present - if w_control is not None: - boot_w_c = w_control[boot_idx[boot_is_control]] - boot_omega_eff = boot_omega * boot_w_c - boot_omega_eff = boot_omega_eff / boot_omega_eff.sum() - else: - boot_omega_eff = boot_omega - - # Extract resampled outcome matrices - Y_boot = Y_full[:, boot_idx] - Y_boot_pre_c = Y_boot[:n_pre, boot_is_control] - Y_boot_post_c = Y_boot[n_pre:, boot_is_control] - Y_boot_pre_t = Y_boot[:n_pre, ~boot_is_control] - Y_boot_post_t = Y_boot[n_pre:, ~boot_is_control] - - # Compute ATT with FIXED weights (do NOT re-estimate). - # boot_idx[~boot_is_control] maps to original index space; - # subtract n_control to index into w_treated. Duplicate draws - # carry identical weights -> alignment is safe. 
- if w_treated is not None: - boot_w_t = w_treated[boot_idx[~boot_is_control] - n_control] - Y_boot_pre_t_mean = np.average( - Y_boot_pre_t, - axis=1, - weights=boot_w_t, - ) - Y_boot_post_t_mean = np.average( - Y_boot_post_t, - axis=1, - weights=boot_w_t, - ) - else: - Y_boot_pre_t_mean = np.mean(Y_boot_pre_t, axis=1) - Y_boot_post_t_mean = np.mean(Y_boot_post_t, axis=1) - - tau = compute_sdid_estimator( - Y_boot_pre_c, - Y_boot_post_c, - Y_boot_pre_t_mean, - Y_boot_post_t_mean, - boot_omega_eff, - time_weights, - ) - if np.isfinite(tau): - bootstrap_estimates.append(float(tau)) - - except (ValueError, LinAlgError): - continue + boot_omega_init = _sum_normalize(unit_weights[boot_control_idx]) + boot_lambda_init = time_weights + + # Algorithm 2 step 2: re-estimate ω̂_b and λ̂_b via two-pass + # sparsified Frank-Wolfe on the resampled panel, using the + # fit-time normalized-scale zeta and the warm-start inits. + # Pass ``return_convergence=True`` so the helpers thread a + # bool out of every FW pass (Rust and numpy both) instead of + # relying on ``warnings.catch_warnings`` — the Rust FW entry + # point is silent on ``max_iter`` exhaustion, so the + # warnings-based tally would always read zero under the + # default backend (see ``_sc_weight_fw_with_convergence`` in + # ``rust/src/weights.rs``). 
+ boot_omega, omega_converged = compute_sdid_unit_weights( + Y_boot_pre_c, + Y_boot_pre_t_mean, + zeta_omega=zeta_omega_n, + min_decrease=min_decrease, + init_weights=boot_omega_init, + return_convergence=True, + ) + boot_lambda, lambda_converged = compute_time_weights( + Y_boot_pre_c, + Y_boot_post_c, + zeta_lambda=zeta_lambda_n, + min_decrease=min_decrease, + init_weights=boot_lambda_init, + return_convergence=True, + ) + tau = compute_sdid_estimator( + Y_boot_pre_c, + Y_boot_post_c, + Y_boot_pre_t_mean, + Y_boot_post_t_mean, + boot_omega, + boot_lambda, + ) + if np.isfinite(tau): + bootstrap_estimates.append(float(tau)) + # Count draws with ANY non-convergence (boolean per + # draw), not raw solver warnings — a single draw can + # emit up to three non-convergence events (ω + # pre-sparsify, ω main, λ). Increment the counter only + # after the finite-τ gate so the registry's "share of + # valid bootstrap draws" denominator matches the + # numerator (draws that failed the finite-τ gate are + # retried, so they shouldn't inflate the non- + # convergence rate). + if not (omega_converged and lambda_converged): + fw_nonconvergence_count += 1 + + except (ValueError, LinAlgError): + continue bootstrap_estimates = np.array(bootstrap_estimates) @@ -1087,6 +1042,22 @@ def _bootstrap_se( stacklevel=2, ) + # Aggregate Frank-Wolfe non-convergence across bootstrap draws. Per-draw + # convergence warnings from compute_sdid_unit_weights / compute_time_weights + # are suppressed inside the loop; emit one summary here if the rate + # exceeds the same 5% threshold used for retry exhaustion. + if fw_nonconvergence_count > 0.05 * max(n_successful, 1): + warnings.warn( + f"Frank-Wolfe did not converge on {fw_nonconvergence_count} of " + f"{n_successful} valid bootstrap draws " + f"(variance_method='bootstrap'). 
SE is still reported from " + f"the final iterate of each draw, but non-convergent draws may be " + f"noisier; consider relaxing min_decrease or increasing pre-period " + f"length if regularization is already moderate.", + UserWarning, + stacklevel=2, + ) + # SE formula matches R's synthdid::vcov(method="bootstrap"): # sqrt((r-1)/r) * sd(ddof=1), equivalent to Arkhangelsky et al. (2021) # Algorithm 2's σ̂² = (1/r) Σ (τ_b - τ̄)². Uses n_successful for @@ -1162,10 +1133,22 @@ def _placebo_variance_se( # Ensure we have enough controls for the split n_pseudo_control = n_control - n_treated if n_pseudo_control < 1: + # Bootstrap rejects every survey design in this release, so + # steer survey users to jackknife (pweight-only) or + # adding controls. Non-survey users can still fall back to + # bootstrap or jackknife. + fallback = ( + "variance_method='jackknife' or adding more control units " + "(strata/PSU/FPC are not yet supported by any SDID variance " + "method)" + if w_control is not None + else "variance_method='bootstrap', variance_method='jackknife', " + "or adding more control units" + ) warnings.warn( f"Not enough control units ({n_control}) for placebo variance " f"estimation with {n_treated} treated units. " - f"Consider using variance_method='bootstrap'.", + f"Consider using {fallback}.", UserWarning, stacklevel=3, ) @@ -1254,11 +1237,21 @@ def _placebo_variance_se( n_successful = len(placebo_estimates) if n_successful < 2: + # Same survey-awareness branch as the pre-replication guard + # above — bootstrap rejects every survey design in this + # release, so suggest jackknife for pweight-only fits.
+ fallback = ( + "variance_method='jackknife' or increasing the number of " + "control units (strata/PSU/FPC are not yet supported by any " + "SDID variance method)" + if w_control is not None + else "variance_method='bootstrap' or variance_method='jackknife' " + "or increasing the number of control units" + ) warnings.warn( f"Only {n_successful} placebo replications completed successfully. " f"Standard error cannot be estimated reliably. " - f"Consider using variance_method='bootstrap' or increasing " - f"the number of control units.", + f"Consider using {fallback}.", UserWarning, stacklevel=3, ) @@ -1484,19 +1477,36 @@ def get_params(self) -> Dict[str, Any]: } def set_params(self, **params) -> "SyntheticDiD": - """Set estimator parameters.""" + """Set estimator parameters. + + Applies updates transactionally: if ``_validate_config()`` rejects the + post-update state, the instance is rolled back to the pre-call values + so a raised ``ValueError`` leaves the object consistent with its + pre-call configuration. + """ # Deprecated parameter names — emit warning and ignore _deprecated = {"lambda_reg", "zeta"} - for key, value in params.items(): - if key in _deprecated: - warnings.warn( - f"{key} is deprecated and ignored. Use zeta_omega/zeta_lambda " - f"instead. Will be removed in v4.0.0.", - DeprecationWarning, - stacklevel=2, - ) - elif hasattr(self, key): - setattr(self, key, value) - else: - raise ValueError(f"Unknown parameter: {key}") + # Snapshot original values for transactional rollback on validation failure. + _rollback: Dict[str, Any] = {} + for key in params: + if key not in _deprecated and hasattr(self, key): + _rollback[key] = getattr(self, key) + try: + for key, value in params.items(): + if key in _deprecated: + warnings.warn( + f"{key} is deprecated and ignored. Use zeta_omega/zeta_lambda " + f"instead. 
Will be removed in v4.0.0.", + DeprecationWarning, + stacklevel=2, + ) + elif hasattr(self, key): + setattr(self, key, value) + else: + raise ValueError(f"Unknown parameter: {key}") + self._validate_config() + except (ValueError, TypeError): + for key, prev in _rollback.items(): + setattr(self, key, prev) + raise return self diff --git a/diff_diff/utils.py b/diff_diff/utils.py index bff19e97..bb547969 100644 --- a/diff_diff/utils.py +++ b/diff_diff/utils.py @@ -21,6 +21,7 @@ _rust_compute_time_weights, _rust_compute_noise_level, _rust_sc_weight_fw, + _rust_sc_weight_fw_with_convergence, ) # Numerical constants for optimization algorithms @@ -1304,7 +1305,8 @@ def _sc_weight_fw( init_weights: Optional[np.ndarray] = None, min_decrease: float = 1e-5, max_iter: int = 10000, -) -> np.ndarray: + return_convergence: bool = False, +): """Compute synthetic control weights via Frank-Wolfe optimization. Matches R's ``sc.weight.fw()`` from the synthdid package. Solves:: @@ -1329,13 +1331,37 @@ def _sc_weight_fw( should pass the data-dependent value for best results. max_iter : int, default 10000 Maximum number of iterations. Matches R's default. + return_convergence : bool, default False + If True, returns a tuple ``(weights, converged)`` where + ``converged`` is ``True`` iff the min-decrease criterion fired + rather than ``max_iter`` being reached. Dispatches to the Rust + ``sc_weight_fw_with_convergence`` entry point when available, and + to ``_sc_weight_fw_numpy(return_convergence=True)`` otherwise. Used + by SDID bootstrap to surface per-draw FW non-convergence + explicitly instead of relying on ``warnings.catch_warnings`` (the + default Rust FW entry point is silent on non-convergence). Returns ------- - np.ndarray - Weights of shape (T0,) on the simplex. + np.ndarray or Tuple[np.ndarray, bool] + Weights of shape (T0,) on the simplex; with + ``return_convergence=True``, additionally the convergence flag. 
""" if HAS_RUST_BACKEND: + if return_convergence: + weights, converged = _rust_sc_weight_fw_with_convergence( + np.ascontiguousarray(Y, dtype=np.float64), + zeta, + intercept, + ( + np.ascontiguousarray(init_weights, dtype=np.float64) + if init_weights is not None + else None + ), + min_decrease, + max_iter, + ) + return np.asarray(weights), converged return np.asarray( _rust_sc_weight_fw( np.ascontiguousarray(Y, dtype=np.float64), @@ -1350,7 +1376,10 @@ def _sc_weight_fw( max_iter, ) ) - return _sc_weight_fw_numpy(Y, zeta, intercept, init_weights, min_decrease, max_iter) + return _sc_weight_fw_numpy( + Y, zeta, intercept, init_weights, min_decrease, max_iter, + return_convergence=return_convergence, + ) def _sc_weight_fw_numpy( @@ -1360,13 +1389,22 @@ def _sc_weight_fw_numpy( init_weights: Optional[np.ndarray] = None, min_decrease: float = 1e-5, max_iter: int = 10000, -) -> np.ndarray: - """Pure NumPy implementation of Frank-Wolfe SC weight solver.""" + return_convergence: bool = False, +): + """Pure NumPy implementation of Frank-Wolfe SC weight solver. + + When ``return_convergence=True``, returns a tuple ``(weights, converged)`` + and suppresses the default ``warn_if_not_converged`` side effect — the + caller is responsible for deciding how to surface non-convergence. 
+ """ T0 = Y.shape[1] - 1 N = Y.shape[0] if T0 <= 0: - return np.ones(max(T0, 1)) + lam_trivial = np.ones(max(T0, 1)) + if return_convergence: + return lam_trivial, True + return lam_trivial # Column-center if using intercept (matches R's intercept=TRUE default) if intercept: @@ -1390,6 +1428,8 @@ def _sc_weight_fw_numpy( if t >= 1 and vals[t - 1] - vals[t] < min_decrease**2: converged = True break + if return_convergence: + return lam, converged warn_if_not_converged(converged, "Frank-Wolfe SC weight solver", max_iter, min_decrease) return lam @@ -1427,7 +1467,9 @@ def compute_time_weights( min_decrease: float = 1e-5, max_iter_pre_sparsify: int = 100, max_iter: int = 10000, -) -> np.ndarray: + init_weights: Optional[np.ndarray] = None, + return_convergence: bool = False, +): """Compute SDID time weights via Frank-Wolfe optimization. Matches R's ``synthdid::sc.weight.fw(Yc[1:N0, ], zeta=zeta.lambda, @@ -1452,11 +1494,32 @@ def compute_time_weights( max_iter : int, default 10000 Maximum iterations for second pass (after sparsification). Matches R's default. + init_weights : np.ndarray, optional + Warm-start weights for the first Frank-Wolfe pass, shape ``(n_pre,)``. + If None (default), the solver starts from uniform, matching the + top-level ``synthdid_estimate(update.lambda=TRUE)`` path. When + provided, the Rust fast-path is skipped in favor of the Python + two-pass dispatcher so the first-pass init can be threaded + through; this matches R's ``synthdid::bootstrap_sample`` shape + (which passes ``weights$lambda`` as FW init per draw). Used by + ``SyntheticDiD._bootstrap_se`` on the refit loop. + return_convergence : bool, default False + If True, returns a tuple ``(weights, converged)`` where ``converged`` + is the AND of the first-pass and second-pass convergence flags from + the underlying ``_sc_weight_fw`` calls (True iff the min-decrease + criterion fired on BOTH passes; False if either hit ``max_iter``). 
+ Setting this flag also forces the Python two-pass dispatcher even + when ``init_weights`` is None, because the Rust top-level fast-path + is silent on non-convergence. Used by SDID bootstrap to surface + per-draw FW non-convergence explicitly; standalone callers can + leave this at the default to preserve the legacy ABI. Returns ------- - np.ndarray - Time weights of shape (n_pre,) on the simplex. + np.ndarray or Tuple[np.ndarray, bool] + Time weights of shape (n_pre,) on the simplex. With + ``return_convergence=True``, additionally the two-pass convergence + flag (as described above). """ if Y_post_control.shape[0] == 0: raise ValueError( @@ -1464,7 +1527,10 @@ def compute_time_weights( "is required for time weight computation." ) - if HAS_RUST_BACKEND: + # When the caller asks for convergence tracking, skip the Rust top-level + # fast-path even if init_weights is None — that entry point bypasses the + # Python two-pass dispatcher and is silent on FW non-convergence. + if HAS_RUST_BACKEND and init_weights is None and not return_convergence: return np.asarray( _rust_compute_time_weights( np.ascontiguousarray(Y_pre_control, dtype=np.float64), @@ -1480,25 +1546,55 @@ def compute_time_weights( n_pre = Y_pre_control.shape[0] if n_pre <= 1: - return np.ones(n_pre) + lam_trivial = np.ones(n_pre) + if return_convergence: + return lam_trivial, True + return lam_trivial # Build collapsed form: (N_co, T_pre + 1), last col = per-control post mean post_means = np.mean(Y_post_control, axis=0) # (N_co,) Y_time = np.column_stack([Y_pre_control.T, post_means]) # (N_co, T_pre+1) - # First pass: limited iterations (matching R's max.iter.pre.sparsify) - lam = _sc_weight_fw( - Y_time, - zeta=zeta_lambda, - intercept=intercept, - min_decrease=min_decrease, - max_iter=max_iter_pre_sparsify, - ) + # First pass: limited iterations (matching R's max.iter.pre.sparsify). 
+ # init_weights is either None (uniform start) or the caller-supplied + # warm-start; the inner _sc_weight_fw still dispatches to Rust for the + # 100-iter run, so we only pay a Python-level dispatch overhead. + if return_convergence: + lam, conv1 = _sc_weight_fw( + Y_time, + zeta=zeta_lambda, + intercept=intercept, + init_weights=init_weights, + min_decrease=min_decrease, + max_iter=max_iter_pre_sparsify, + return_convergence=True, + ) + else: + lam = _sc_weight_fw( + Y_time, + zeta=zeta_lambda, + intercept=intercept, + init_weights=init_weights, + min_decrease=min_decrease, + max_iter=max_iter_pre_sparsify, + ) # Sparsify: zero out small weights, renormalize (R's sparsify_function) lam = _sparsify(lam) # Second pass: from sparsified initialization (matching R's max.iter) + if return_convergence: + lam, conv2 = _sc_weight_fw( + Y_time, + zeta=zeta_lambda, + intercept=intercept, + init_weights=lam, + min_decrease=min_decrease, + max_iter=max_iter, + return_convergence=True, + ) + return lam, bool(conv1 and conv2) + lam = _sc_weight_fw( Y_time, zeta=zeta_lambda, @@ -1519,7 +1615,9 @@ def compute_sdid_unit_weights( min_decrease: float = 1e-5, max_iter_pre_sparsify: int = 100, max_iter: int = 10000, -) -> np.ndarray: + init_weights: Optional[np.ndarray] = None, + return_convergence: bool = False, +): """Compute SDID unit weights via Frank-Wolfe with two-pass sparsification. Matches R's ``synthdid::sc.weight.fw(t(Yc[, 1:T0]), zeta=zeta.omega, @@ -1541,20 +1639,50 @@ def compute_sdid_unit_weights( Iterations for first pass (before sparsification). max_iter : int, default 10000 Iterations for second pass (after sparsification). Matches R's default. + init_weights : np.ndarray, optional + Warm-start weights for the first Frank-Wolfe pass, shape + ``(n_control,)``. If None (default), the solver starts from + uniform — matching the top-level ``synthdid_estimate(update.omega=TRUE)`` + path. 
When provided, the Rust fast-path is skipped in favor of the + Python two-pass dispatcher so the first-pass init can be threaded + through; this matches R's ``synthdid::bootstrap_sample`` shape + (which passes ``sum_normalize(weights$omega[...])`` as FW init per + draw). Used by ``SyntheticDiD._bootstrap_se`` on the refit loop. + return_convergence : bool, default False + If True, returns a tuple ``(weights, converged)`` where ``converged`` + is the AND of the first-pass and second-pass convergence flags from + the underlying ``_sc_weight_fw`` calls (True iff the min-decrease + criterion fired on BOTH passes; False if either hit ``max_iter``). + Setting this flag also forces the Python two-pass dispatcher even + when ``init_weights`` is None, because the Rust top-level fast-path + is silent on non-convergence. Used by SDID bootstrap to surface + per-draw FW non-convergence explicitly; standalone callers can + leave this at the default to preserve the legacy ABI. Returns ------- - np.ndarray - Unit weights of shape (n_control,) on the simplex. + np.ndarray or Tuple[np.ndarray, bool] + Unit weights of shape (n_control,) on the simplex. With + ``return_convergence=True``, additionally the two-pass convergence + flag (as described above). """ n_control = Y_pre_control.shape[1] if n_control == 0: - return np.asarray([]) + empty = np.asarray([]) + if return_convergence: + return empty, True + return empty if n_control == 1: - return np.asarray([1.0]) - - if HAS_RUST_BACKEND: + singleton = np.asarray([1.0]) + if return_convergence: + return singleton, True + return singleton + + # When the caller asks for convergence tracking, skip the Rust top-level + # fast-path even if init_weights is None — that entry point bypasses the + # Python two-pass dispatcher and is silent on FW non-convergence. 
+ if HAS_RUST_BACKEND and init_weights is None and not return_convergence: return np.asarray( _rust_sdid_unit_weights( np.ascontiguousarray(Y_pre_control, dtype=np.float64), @@ -1570,19 +1698,46 @@ def compute_sdid_unit_weights( # Build collapsed form: (T_pre, N_co + 1), last col = treated pre means Y_unit = np.column_stack([Y_pre_control, Y_pre_treated_mean.reshape(-1, 1)]) - # First pass: limited iterations - omega = _sc_weight_fw( - Y_unit, - zeta=zeta_omega, - intercept=intercept, - max_iter=max_iter_pre_sparsify, - min_decrease=min_decrease, - ) + # First pass: limited iterations. init_weights is either None (uniform + # start) or the caller-supplied warm-start; the inner _sc_weight_fw + # still dispatches to Rust for the 100-iter run, so we only pay a + # Python-level dispatch overhead. + if return_convergence: + omega, conv1 = _sc_weight_fw( + Y_unit, + zeta=zeta_omega, + intercept=intercept, + init_weights=init_weights, + max_iter=max_iter_pre_sparsify, + min_decrease=min_decrease, + return_convergence=True, + ) + else: + omega = _sc_weight_fw( + Y_unit, + zeta=zeta_omega, + intercept=intercept, + init_weights=init_weights, + max_iter=max_iter_pre_sparsify, + min_decrease=min_decrease, + ) # Sparsify: zero out weights <= max/4, renormalize omega = _sparsify(omega) # Second pass: from sparsified initialization + if return_convergence: + omega, conv2 = _sc_weight_fw( + Y_unit, + zeta=zeta_omega, + intercept=intercept, + init_weights=omega, + max_iter=max_iter, + min_decrease=min_decrease, + return_convergence=True, + ) + return omega, bool(conv1 and conv2) + omega = _sc_weight_fw( Y_unit, zeta=zeta_omega, diff --git a/docs/choosing_estimator.rst b/docs/choosing_estimator.rst index a79e74ee..fde99cae 100644 --- a/docs/choosing_estimator.rst +++ b/docs/choosing_estimator.rst @@ -610,8 +610,8 @@ differences helps interpret results and choose appropriate inference. - Analytical (influence function) - Uses influence-function SEs with WIF adjustment by default. 
Set ``n_bootstrap=999`` for multiplier bootstrap inference (weight types: ``rademacher``, ``mammen``, ``webb``). * - ``SyntheticDiD`` - - Placebo or bootstrap - - Default uses placebo-based variance (``variance_method="placebo"``). Set ``variance_method="bootstrap"`` for bootstrap inference. Both methods use ``n_bootstrap`` replications (default 200). + - Placebo, paper-faithful refit bootstrap, or jackknife + - Default uses placebo-based variance (``variance_method="placebo"``). Set ``variance_method="bootstrap"`` for paper-faithful Algorithm 2 bootstrap (re-estimates ω and λ via Frank-Wolfe per draw; ~5–30× slower than placebo, panel-size dependent). Both methods use ``n_bootstrap`` replications (default 200). ``variance_method="jackknife"`` is also available. * - ``ContinuousDiD`` - Analytical (influence function) - Uses influence-function-based SEs by default. Use ``n_bootstrap=199`` (or higher) for multiplier bootstrap inference with proper CIs. @@ -783,10 +783,10 @@ estimation. The depth of support varies by estimator: - Full (analytical) - Multiplier at PSU * - ``SyntheticDiD`` - - pweight only - - Via bootstrap + - pweight only (placebo / jackknife) + - -- + - -- - -- - - Rao-Wu rescaled * - ``TROP`` - pweight only - Via bootstrap @@ -807,16 +807,20 @@ estimation. The depth of support varies by estimator: - **Full**: All weight types (pweight/fweight/aweight) + strata/PSU/FPC + Taylor Series Linearization variance - **Full (pweight only)**: Full TSL with strata/PSU/FPC, but only ``pweight`` accepted (``fweight``/``aweight`` rejected because composition changes weight semantics) -- **Via bootstrap**: Strata/PSU/FPC supported only with bootstrap variance. ``SyntheticDiD`` requires ``variance_method='bootstrap'``; ``TROP`` uses bootstrap by default. ``SyntheticDiD`` placebo does not support strata/PSU/FPC. +- **Via bootstrap**: Strata/PSU/FPC supported only with bootstrap variance. ``TROP`` uses bootstrap by default. 
``SyntheticDiD`` does **not** support strata/PSU/FPC on any variance method in this release — see the deferred-composition note in REGISTRY.md §SyntheticDiD. - **pweight only** (Weights column): Only ``pweight`` accepted; ``fweight``/``aweight`` raise an error - **Diagnostic**: Weighted descriptive statistics only (no inference) - **--**: Not supported .. note:: - ``SyntheticDiD`` with ``variance_method='placebo'`` does not support - strata/PSU/FPC. Use ``variance_method='bootstrap'`` for full survey - design support. + ``SyntheticDiD`` does not support strata/PSU/FPC on any variance method + in this release. Pweight-only survey weights work with + ``variance_method='placebo'`` or ``variance_method='jackknife'``; + ``variance_method='bootstrap'`` rejects all survey designs because + composing Rao-Wu rescaled weights with paper-faithful Frank-Wolfe + re-estimation requires a separate derivation (tracked in + ``TODO.md``). For the full walkthrough with code examples, see the `survey tutorial `_. diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index ae629889..c82e0ce6 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -1494,13 +1494,18 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi This matches R's `synthdid::vcov(method="placebo")` which passes `update.omega=TRUE, update.lambda=TRUE` via `opts`. -- Alternative: Bootstrap at unit level (matching R's `synthdid::vcov(method="bootstrap")`) - 1. Resample ALL units (control + treated) with replacement - 2. Identify which resampled units are control vs treated - 3. Renormalize original unit weights for resampled controls: `ω_boot = _sum_normalize(ω[boot_control_idx])` - 4. Time weights λ remain **unchanged** from original estimation (fixed weights) - 5. Compute SDID estimate with renormalized ω and original λ - 6. 
`SE = sqrt((r-1)/r) × sd(bootstrap_estimates, ddof=1)` where `r = n_successful`, matching R's `vcov.R` and the placebo formula above (equivalent to Arkhangelsky et al. (2021) Algorithm 2's `σ̂² = (1/r) Σ (τ_b − τ̄)²`) +- Alternative: Bootstrap at unit level — paper-faithful refit (`variance_method="bootstrap"`) + Arkhangelsky et al. (2021) Algorithm 2 step 2 verbatim, and also the behavior of R's default `synthdid::vcov(method="bootstrap")`. On each pairs-bootstrap draw: + 1. Resample ALL units (control + treated) with replacement. + 2. Identify which resampled units are control vs treated. + 3. Re-estimate `ω̂_b = compute_sdid_unit_weights(Y_boot_pre_c, Y_boot_pre_t_mean, zeta_omega=ζ_ω / Y_scale, init_weights=_sum_normalize(ω̂[boot_control_idx]))` — two-pass sparsified Frank-Wolfe, **warm-started from the fit-time ω renormalized over the resampled controls** (matching R's `sum_normalize(weights$omega[sort(ind[ind <= N0])])` shape). + 4. Re-estimate `λ̂_b = compute_time_weights(Y_boot_pre_c, Y_boot_post_c, zeta_lambda=ζ_λ / Y_scale, init_weights=λ̂)` — **warm-started from the fit-time λ unchanged** (matching R's `weights.boot$lambda = weights$lambda` shape). + 5. Compute SDID estimate with refit `ω̂_b` and `λ̂_b`. + 6. `SE = sqrt((r-1)/r) × sd(bootstrap_estimates, ddof=1)` where `r = n_successful` (equivalent to the paper's `σ̂² = (1/r) Σ (τ_b − τ̄)²`). + + R-parity rationale: `synthdid_estimate()` (synthdid.R) stores `update.omega = TRUE` in `attr(estimate, "opts")`, and `vcov.R::bootstrap_sample` rebinds those `opts` inside its `do.call` back into `synthdid_estimate`, so the renormalized ω passed via `weights$omega` is used as Frank-Wolfe initialization (the `sum_normalize` helper in R's source explicitly says so). The Python path threads the same warm-start via `compute_sdid_unit_weights(..., init_weights=...)` and `compute_time_weights(..., init_weights=...)`. 
The FW objective is strictly convex on the simplex (quadratic loss + ζ² ridge on simplex), so warm- and cold-start converge to the same global minimum given enough iterations; warm-start matters in practice because the 100-iter first pass then sparsification is path-dependent on draws where the pre-sparsify budget is tight. Cross-language SE parity at bit tolerance is not claimed — different BLAS / RNG paths — but the procedure matches R's default bootstrap shape at the algorithm level, and Python-only bit-identity on non-survey data is asserted via `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`. + + Expected wall-clock ~5–30× slower per fit than placebo (panel-size dependent; Frank-Wolfe second-pass can hit its 10K-iter cap on larger panels; warm-start plumbing closes the gap vs cold-start, which would be closer to 10–30× on these DGPs). Per-draw Frank-Wolfe non-convergence UserWarnings are suppressed inside the loop and aggregated into a single summary warning emitted after the loop when the share of valid bootstrap draws with any non-convergence event (counted once per draw — each draw runs Frank-Wolfe once for ω and once for λ, and any of those calls firing a non-convergence warning trips the draw) exceeds 5% of `n_successful`. Composed with any survey design (including pweight-only) this path raises `NotImplementedError` in the current release — see the survey-regression Note below for scope and the deferred-composition sketch. - Alternative: Jackknife variance (matching R's `synthdid::vcov(method="jackknife")`) Implements Algorithm 3 from Arkhangelsky et al. (2021): @@ -1520,7 +1525,7 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi P-value: analytical (normal distribution), not empirical. *Edge cases:* -- **Frank-Wolfe non-convergence**: Returns current weights after max_iter iterations when the convergence check `vals[t-1] - vals[t] < min_decrease²` never triggers early exit. 
The numpy-backend path (`_sc_weight_fw_numpy`) emits a `UserWarning` via `diff_diff.utils.warn_if_not_converged` in that case; the Rust-backend path silently returns the final iterate (Rust-side signature change required to thread convergence status — tracked as an axis-G backend-parity follow-up). +- **Frank-Wolfe non-convergence**: Returns current weights after max_iter iterations when the convergence check `vals[t-1] - vals[t] < min_decrease²` never triggers early exit. On `variance_method="bootstrap"` specifically, `_bootstrap_se` uses the `sc_weight_fw_with_convergence` Rust entry point (or `_sc_weight_fw_numpy(return_convergence=True)` on the pure-Python backend) to thread a convergence bool out of every per-draw FW pass; draws with any non-convergence are tallied and, if the rate exceeds 5% of valid draws, aggregated into a single `UserWarning` after the loop (avoids 200+ warnings per fit). Standalone calls to `_sc_weight_fw` / `compute_sdid_unit_weights` / `compute_time_weights` that do not pass `return_convergence=True` follow the legacy behavior: the numpy path emits a per-call `UserWarning` via `diff_diff.utils.warn_if_not_converged` and the Rust path silently returns the final iterate. - **`_sparsify` all-zero input**: If `max(v) <= 0`, returns uniform weights `ones(len(v)) / len(v)`. - **Single control unit**: `compute_sdid_unit_weights` returns `[1.0]` immediately (short-circuit before Frank-Wolfe). - **Zero control units**: `compute_sdid_unit_weights` returns empty array `[]`. @@ -1541,10 +1546,29 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi - **Jackknife with single nonzero-weight control**: Returns NaN SE. Leaving out the only effective control is not meaningful. - **Jackknife with non-finite LOO estimate**: Returns NaN SE. Unlike bootstrap/placebo, jackknife is deterministic and cannot skip failed iterations; NaN propagates through `var()` (matches R behavior). 
- **Jackknife with survey weights**: Guards on effective positive support (omega * w_control > 0 and w_treated > 0) after composition, not raw FW counts. Returns NaN SE if fewer than 2 effective controls or 2 positive-weight treated units. Per-iteration zero-sum guards return NaN for individual LOO iterations when remaining composed weights sum to zero. -- **Note:** Survey support: weights, strata, PSU, and FPC are all supported. Full-design surveys use Rao-Wu rescaled bootstrap (Phase 6); non-bootstrap variance methods (`variance_method="placebo"` or `"jackknife"`) require weights-only (strata/PSU/FPC require bootstrap). Both sides weighted per WLS regression interpretation: treated-side means are survey-weighted (Frank-Wolfe target and ATT formula); control-side synthetic weights are composed with survey weights post-optimization (ω_eff = ω * w_co, renormalized). Frank-Wolfe optimization itself is unweighted — survey importance enters after trajectory-matching. Covariate residualization uses WLS with survey weights. Placebo, jackknife, and bootstrap SE preserve survey weights on both sides. -- **Note:** P-value computation is variance-method dependent. Placebo (Algorithm 4) uses the empirical null formula `max(mean(|placebo_effects| ≥ |att|), 1/(r+1))` because permuting control indices generates draws from the null distribution (centered on 0). Bootstrap (Algorithm 2) and jackknife (Algorithm 3) use the analytical p-value from `safe_inference(att, se)` (normal-theory): bootstrap draws are centered on `τ̂` (sampling distribution of the estimator) and jackknife pseudo-values are not null draws, so the empirical null formula is invalid for both. This matches R's `synthdid::vcov()` convention, where variance is returned and inference is normal-theory from the SE. 
-- **Note (fixed-weight bootstrap calibration):** the `variance_method="bootstrap"` path is anti-conservative under H0 because it holds ω and λ fixed rather than re-estimating per draw (see the fixed-weight Note below). Monte Carlo on the default panel fixture (`_make_panel(att=0.0)`, n=500 seeds, B=200) returns median p-value ≈ 0.38 and empirical rejection rates of {α=0.01: 0.12, α=0.05: 0.18, α=0.10: 0.25, α=0.20: 0.34}. That's roughly 3.7× the nominal 5% rate — substantial over-rejection, consistent with the SE under-estimate from ignoring weight-estimation uncertainty. Users who need nominally-calibrated inference should prefer `variance_method="placebo"` (Algorithm 4 re-estimates both ω and λ per draw) or track the paper-faithful refit-bootstrap follow-up in TODO.md. The characterization regression test is `TestPValueSemantics::test_bootstrap_p_value_null_calibration` (slow). -- **Note (R-parity test scope):** `test_bootstrap_se_matches_r` pins R's bootstrap index matrix via a JSON fixture (`tests/data/sdid_bootstrap_indices_r.json`, regenerated by `benchmarks/R/generate_sdid_bootstrap_parity_fixture.R`) and feeds it through the Python bootstrap loop via a hidden `_bootstrap_indices` kwarg. That eliminates the cross-language RNG gap (Python's PCG64 ≠ R's Mersenne Twister) and isolates the deterministic math — per-draw SDID estimator, weight renormalization, SE aggregation — which matches at 1e-10 tolerance. The test does not verify that independently-seeded R and Python bootstraps produce equal SE at any finite B; they won't. +- **Note (survey support):** weights are supported; strata/PSU/FPC are not. The pweight-only path is accepted for `variance_method="placebo"` and `variance_method="jackknife"` — treated-side means are survey-weighted (Frank-Wolfe target and ATT formula); control-side synthetic weights are composed with survey weights post-optimization (ω_eff = ω * w_co, renormalized). 
Frank-Wolfe optimization itself is unweighted — survey importance enters after trajectory-matching. Covariate residualization uses WLS with survey weights. **`variance_method="bootstrap"` rejects all survey designs (including pweight-only)** in this release; see the deferred-composition Note below. +- **Note (default variance_method deviation from R):** R's `synthdid::vcov()` defaults to `method="bootstrap"`; our `SyntheticDiD.__init__` defaults to `variance_method="placebo"`. This is an intentional library deviation for two reasons: (a) placebo is unconditionally available on pweight-only survey designs whereas refit bootstrap rejects every survey design in this release (see the deferred-composition Note below); (b) placebo avoids the ~5–30× per-draw Frank-Wolfe refit slowdown relative to the deleted fixed-weight path, which becomes prohibitive on large panels at B=200 replications. Users can opt into the R-matching default with `variance_method="bootstrap"`. Placebo (Algorithm 4) and bootstrap (Algorithm 2) both track nominal calibration in the committed coverage MC; see the calibration table below. +- **Note (deferred survey + bootstrap composition):** Rao-Wu rescaled bootstrap composed with paper-faithful refit is not yet implemented. Reusable scaffolding for the follow-up implementer: `generate_rao_wu_weights` (in `bootstrap_utils`, retained for use by other estimators), the split into `rw_control` / `rw_treated`, the degenerate-retry check, and the treated-mean weighting pattern are portable from the fixed-weight Rao-Wu branch this release removed (recoverable via `git show 91082e5:diff_diff/synthetic_did.py` near the pre-rewrite `_bootstrap_se` body (PR #351 "Replace SDID fixed-weight bootstrap with paper-faithful refit")). 
The genuinely new work is a **weighted Frank-Wolfe** variant of `_sc_weight_fw` that accepts per-unit weights in the loss and regularization (`Σ rw_i ω_i Y_i,pre` / `ζ² Σ rw_i ω_i²`), threaded through `compute_sdid_unit_weights` / `compute_time_weights`. Compose-after-unweighted-FW does NOT work — it silently reproduces the fixed-weight + Rao-Wu behavior we removed (which was never paper-faithful and never matched R's default vcov). Validation: reuse the coverage MC harness with a stratified DGP and confirm near-nominal rejection rates against placebo-SE tracking. Survey + SDID variance is therefore a deferred capability — pre-existing strata/PSU/FPC users have no SDID variance method in this release. +- **Note:** P-value computation is variance-method dependent. Placebo (Algorithm 4) uses the empirical null formula `max(mean(|placebo_effects| ≥ |att|), 1/(r+1))` because permuting control indices generates draws from the null distribution (centered on 0). Bootstrap (Algorithm 2) and jackknife (Algorithm 3) use the analytical p-value from `safe_inference(att, se)` (normal-theory): bootstrap draws are centered on `τ̂` (sampling distribution of the estimator) and jackknife pseudo-values are not null draws, so the empirical null formula is invalid for them. This matches R's `synthdid::vcov()` convention, where variance is returned and inference is normal-theory from the SE. +- **Note (coverage Monte Carlo calibration):** `benchmarks/data/sdid_coverage.json` carries empirical rejection rates across all three variance methods on 3 representative null-panel DGPs (500 seeds × B=200, regenerable via `benchmarks/python/coverage_sdid.py`). Under H0 the expected rejection rate at each α equals the nominal level α; rates substantially above α indicate anti-conservatism, rates below indicate over-coverage.
+ + | DGP | method | α=0.01 | α=0.05 | α=0.10 | mean SE / true SD | + |-----------------------------------------------------------|------------|--------|--------|--------|-------------------| + | balanced (N_co=20, N_tr=3, T_pre=8, T_post=4) | placebo | 0.016 | 0.060 | 0.086 | 1.05 | + | balanced | bootstrap | 0.028 | 0.078 | 0.116 | 1.05 | + | balanced | jackknife | 0.066 | 0.112 | 0.154 | 1.08 | + | unbalanced (N_co=30, N_tr=8, heterogeneous unit-FE) | placebo | 0.006 | 0.032 | 0.070 | 1.08 | + | unbalanced | bootstrap | 0.008 | 0.038 | 0.080 | 1.11 | + | unbalanced | jackknife | 0.024 | 0.076 | 0.120 | 0.99 | + | AER §6.3 (N=100, N_tr=20, T=120, T_pre=115, rank=2, σ=2) | placebo | 0.018 | 0.058 | 0.086 | 0.99 | + | AER §6.3 | bootstrap | 0.010 | 0.040 | 0.078 | 1.05 | + | AER §6.3 | jackknife | 0.030 | 0.080 | 0.150 | 0.90 | + + Reading: **`bootstrap` (paper-faithful refit)** and **`placebo`** both track nominal calibration across all three DGPs (rates within Monte Carlo noise at 500 seeds; 2σ MC band ≈ 0.02–0.05 at p ≈ 0.05–0.10). **`jackknife`** is slightly anti-conservative at α=0.05 on the balanced and AER §6.3 DGPs (rejection 0.112 and 0.080 vs the 0.05 target). Arkhangelsky et al. (2021) §6.3 reports mixed jackknife evidence (98% coverage — slightly conservative — under iid, and 93% coverage — slightly anti-conservative — under AR(1) ρ=0.7), so the direction of our observation is consistent with the AR(1) branch of the paper's evidence rather than the iid branch. The `mean SE / true SD` column compares mean estimated SE to the empirical sampling SD of τ̂ across seeds. + + The schema smoke test is `TestCoverageMCArtifact::test_coverage_artifacts_present`; regenerate the JSON via `python benchmarks/python/coverage_sdid.py --n-seeds 500 --n-bootstrap 200 --output benchmarks/data/sdid_coverage.json` (~15–40 min on M-series Mac, Rust backend — warm-start convergence makes newer runs faster than the original cold-start one).
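The rejection-rate cells above reduce to a simple tally over per-seed p-values plus a binomial noise band. A minimal sketch of that computation (illustrative only — names here are hypothetical; the real harness lives in `benchmarks/python/coverage_sdid.py`):

```python
import numpy as np

def rejection_rates(p_values, alphas=(0.01, 0.05, 0.10)):
    """Empirical rejection rate at each alpha, plus a 2-sigma binomial
    Monte Carlo band around the nominal level (sqrt(a*(1-a)/n))."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    out = {}
    for a in alphas:
        rate = float(np.mean(p < a))          # fraction of seeds rejecting at level a
        band = 2.0 * np.sqrt(a * (1.0 - a) / n)  # 2-sigma MC noise at nominal a
        out[a] = (rate, band)
    return out

# Under H0 with well-calibrated p-values, each rate lands within the band of a.
rates = rejection_rates(np.linspace(0.001, 1.0, 500))  # stand-in for 500 per-seed p-values
```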
+ + **Artifact cadence (documentation-grade, not pinned-regression-grade):** this file is a documentation substrate — the table above transcribes from it, and the schema test just checks structure. It is **not** numerically pinned by any regression test, because MC noise at 500 seeds × B=200 (2σ ≈ 0.02–0.05 per cell) makes tight bounds fragile and loose bounds uninformative. The runtime characterization test that guards the one regression class worth catching (dispatch breakage → rejection rate ≈0 or ≈0.5, or SE-collapse) is `TestPValueSemantics::test_bootstrap_p_value_null_dispersion` (slow; calibration-agnostic dispersion + loose rejection-rate band). Regenerate the JSON when a methodology change materially shifts per-draw numerics — SE formula, new variance method, FW solver swap, paper-procedure correction. **Do not** regenerate on refactors that preserve the FW global optimum (warm-start vs cold-start, backend migration, pure renames, docstring fixes) — those stay within MC noise of the committed numbers. Per-seed bit-identity on the captured fixture is the cheaper, stricter check: see `TestScaleEquivariance::test_baseline_parity_small_scale[bootstrap]` at `rel=1e-14`. - **Note:** Internal Y normalization. Before weight optimization, estimation, and the variance procedures, `fit()` centers Y by `mean(Y_pre_control)` and scales by `std(Y_pre_control)`; `Y_scale` falls back to `1.0` when std is non-finite or below `1e-12 * max(|mean|, 1)`. Auto-regularization and `noise_level` are computed on normalized Y; user-supplied `zeta_omega` / `zeta_lambda` are divided by `Y_scale` internally for Frank-Wolfe. τ, SE, CI, the placebo/bootstrap/jackknife effect vectors, `results_.noise_level`, and `results_.zeta_omega` / `results_.zeta_lambda` are all reported on the user's original outcome scale (user-supplied zetas are echoed back exactly to avoid float roundoff).
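The normalization constants described above can be sketched as follows (a minimal illustration of the documented rule, not the library's actual `fit()` internals):

```python
import numpy as np

def y_norm_constants(Y_pre_control):
    """Center/scale constants derived from controls' pre-period only,
    so the reference is unaffected by treatment. Falls back to scale
    1.0 when the std is non-finite or degenerate."""
    y = np.asarray(Y_pre_control, dtype=float)
    y_shift = y.mean()
    y_scale = y.std()
    # degenerate-scale guard: non-finite or below 1e-12 * max(|mean|, 1)
    if not np.isfinite(y_scale) or y_scale < 1e-12 * max(abs(y_shift), 1.0):
        y_scale = 1.0
    return y_shift, y_scale

# Estimating on (Y - y_shift) / y_scale and rescaling tau by y_scale is
# the no-op the note describes, but it avoids catastrophic cancellation
# when outcomes span millions-to-billions.
```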
Mathematically a no-op — τ is location-invariant and scale-equivariant, and FW weights are invariant under `(Y, ζ) → (Y/s, ζ/s)` — but prevents catastrophic cancellation in the SDID double-difference when outcomes span millions-to-billions (see synth-inference/synthdid#71 for the R-package version of this issue). Normalization constants are derived from controls' pre-period only so the reference is unaffected by treatment. `in_time_placebo()` and `sensitivity_to_zeta_omega()` reuse the exact same `Y_shift` / `Y_scale` captured on the fit snapshot: they normalize the re-sliced arrays before re-running Frank-Wolfe, pass `zeta / Y_scale` to the weight solvers, and rescale the returned `att` and `pre_fit_rmse` by `Y_scale` before reporting; unit-weight diagnostics (`max_unit_weight`, `effective_n`) are scale-invariant and reported directly. *Validation diagnostics (post-fit methods on `SyntheticDiDResults`):* @@ -1571,7 +1595,7 @@ Convergence criterion: stop when objective decrease < min_decrease² (default mi - [x] Sparsification: v[v <= max(v)/4] = 0; v = v/sum(v) - [x] Placebo SE formula: sqrt((r-1)/r) * sd(placebo_estimates) - [x] Placebo SE: re-estimates omega and lambda per replication (matching R's update.omega=TRUE, update.lambda=TRUE) -- [x] Bootstrap: fixed weights (original lambda unchanged, omega renormalized for resampled controls). This matches R's `synthdid::vcov(method="bootstrap")` but is a shortcut relative to Arkhangelsky et al. (2021) Algorithm 2 step 2, which specifies re-estimating ω̂_b and λ̂_b per draw; fixed-weight bootstrap underestimates SE relative to the paper's procedure by ignoring weight-estimation uncertainty. Use `variance_method="placebo"` (Algorithm 4) for paper-faithful inference. +- [x] Bootstrap: paper-faithful Algorithm 2 step 2 — re-estimates ω̂_b and λ̂_b per draw via two-pass sparsified Frank-Wolfe on the resampled panel using the fit-time normalized-scale zeta. 
Matches R's default `synthdid::vcov(method="bootstrap")` (which rebinds `attr(estimate, "opts")` so the renormalized ω serves only as Frank-Wolfe initialization). Survey designs raise `NotImplementedError`; Rao-Wu + refit composition is tracked in TODO.md and sketched in the deferred-composition Note above. - [x] Jackknife SE: fixed weights, LOO all units, formula `sqrt((n-1)/n * sum((u-ubar)^2))` - [x] Jackknife: NaN SE for single treated or single nonzero-weight control - [x] Jackknife: analytical p-value (not empirical) @@ -2673,7 +2697,7 @@ should be a deliberate user choice. | SunAbraham | Cluster-robust + delta method | Pairs bootstrap | | ImputationDiD | Conservative clustered (Thm 3) | Multiplier bootstrap (library extension; percentile CIs and empirical p-values, consistent with CS/SA) | | TwoStageDiD | GMM sandwich (Newey & McFadden 1994) | Multiplier bootstrap on GMM influence function | -| SyntheticDiD | Placebo variance (Alg 4) | Unit-level bootstrap (fixed weights) | +| SyntheticDiD | Placebo variance (Alg 4) | Unit-level pairs bootstrap (paper-faithful refit, Alg 2 step 2); jackknife (Alg 3) | | TripleDifference | Influence function (all methods) | SE = std(IF) / sqrt(n) | | StackedDiD | Cluster-robust (unit) | Cluster at unit × sub-experiment | | TROP | Block bootstrap | — | @@ -2863,7 +2887,7 @@ ContinuousDiD, EfficientDiD): `ATT_boot[b] = ATT + w_b^T @ psi_psu` where `psi_psu` are PSU-aggregated IF sums. - **Note:** When no strata/PSU/FPC, degenerates to standard unit-level multiplier bootstrap. -**Rao-Wu Rescaled Bootstrap** (SunAbraham, SyntheticDiD, TROP): +**Rao-Wu Rescaled Bootstrap** (SunAbraham, TROP): - **Reference**: Rao & Wu (1988) "Resampling Inference with Complex Survey Data", JASA 83(401); Rao, Wu & Yue (1992) "Some Recent Work on Resampling Methods for @@ -2873,12 +2897,19 @@ ContinuousDiD, EfficientDiD): Rescaled weight: `w*_i = w_i * (n_h / m_h) * r_hi` where `r_hi` = count of PSU *i* drawn. 
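The rescaling rule above can be sketched per stratum. This is an illustrative stand-alone version (function and variable names are hypothetical; the library's implementation is `generate_rao_wu_weights` in `bootstrap_utils`):

```python
import numpy as np

def rao_wu_rescale(w, psu_ids, m_h, rng):
    """One Rao-Wu bootstrap draw for a single stratum: sample m_h PSUs
    with replacement from the stratum's n_h PSUs, then rescale
    w*_i = w_i * (n_h / m_h) * r_hi, where r_hi counts how many times
    observation i's PSU was drawn."""
    psus = np.unique(psu_ids)          # sorted unique PSU labels
    n_h = psus.size
    draw = rng.choice(psus, size=m_h, replace=True)
    r = np.array([(draw == p).sum() for p in psus], dtype=float)
    r_per_obs = r[np.searchsorted(psus, psu_ids)]  # broadcast PSU counts to observations
    return w * (n_h / m_h) * r_per_obs

# In expectation E[r_hi] = m_h / n_h, so E[w*_i] = w_i; a census stratum
# (f_h >= 1) would instead keep original weights (zero variance).
rng = np.random.default_rng(0)
w_star = rao_wu_rescale(np.ones(6), np.array([1, 1, 2, 2, 3, 3]), m_h=2, rng=rng)
```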
- **Note:** FPC enters through the resample size `m_h`, not as a post-hoc scaling factor. When `f_h >= 1` (census stratum), observations keep original weights (zero variance). +- **Note:** SyntheticDiD is intentionally excluded from this list. Paper-faithful + refit bootstrap (Arkhangelsky et al. 2021 Algorithm 2 step 2) re-estimates ω̂ + and λ̂ via Frank-Wolfe on each draw; composing that with Rao-Wu rescaled + weights requires a weighted-FW derivation that is not yet implemented (sketch + in §SyntheticDiD survey-regression Note; tracked in TODO.md). The previous + SyntheticDiD Rao-Wu path composed fixed-ω with rescaled weights, which was + not paper-faithful and was removed. - **Note:** Bootstrap paths support all three `lonely_psu` modes: `"remove"`, `"certainty"`, and `"adjust"`. For `"adjust"`, singleton PSUs from different strata are pooled into a combined pseudo-stratum and weights are generated for the pooled group. This is the bootstrap analogue of the TSL "adjust" behavior (centering around the global mean). Applies to both multiplier bootstrap (CallawaySantAnna, ImputationDiD, TwoStageDiD, - ContinuousDiD, EfficientDiD) and Rao-Wu bootstrap (SunAbraham, SyntheticDiD, TROP). + ContinuousDiD, EfficientDiD) and Rao-Wu bootstrap (SunAbraham, TROP). FPC scaling is skipped for pooled singletons (conservative). When only one singleton stratum exists total, pooling is not possible — the singleton contributes zero bootstrap variance (same as `remove`), with a `UserWarning` emitted. This is a library-specific diff --git a/docs/methodology/survey-theory.md b/docs/methodology/survey-theory.md index fefb179e..8535b936 100644 --- a/docs/methodology/survey-theory.md +++ b/docs/methodology/survey-theory.md @@ -722,9 +722,17 @@ Two bootstrap strategies interact with survey designs: Generates multiplier weights at the PSU level within strata, with FPC scaling. Each bootstrap draw reweights the IF values. 
-- **Rao-Wu rescaled bootstrap** (SunAbraham, SyntheticDiD, TROP): Draws PSUs with replacement within strata and rescales observation weights. Each draw - re-runs the full estimator on the resampled data. + re-runs the full estimator on the resampled data. *SyntheticDiD is + intentionally excluded in this release:* the paper-faithful refit + bootstrap rejects every survey design because composing Rao-Wu rescaled + weights with Frank-Wolfe re-estimation requires a weighted-FW derivation + that is not yet implemented. Pweight-only SDID users should use + ``variance_method="placebo"`` or ``"jackknife"``; strata/PSU/FPC users + have no SDID variance option. See TODO.md and + ``docs/methodology/REGISTRY.md`` §SyntheticDiD for the + deferred-composition sketch. --- diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md index 3eeb89d5..38d41a48 100644 --- a/docs/performance-scenarios.md +++ b/docs/performance-scenarios.md @@ -234,14 +234,21 @@ serves a different purpose: R-parity accuracy). They complement it. ```python SyntheticDiD(variance_method="jackknife", n_bootstrap=0).fit(...) # then also variance_method="bootstrap", n_bootstrap=200 for comparison + # NOTE: bootstrap is now paper-faithful refit (re-estimates ω and λ via + # Frank-Wolfe per draw); ~5–30× slower than placebo (panel-size dependent; + # warm-start plumbing reduces the slowdown relative to cold-start). Plan + # accordingly when timing. ```
+ `variance_method="bootstrap"`, `n_bootstrap=200` for SE comparison + (paper-faithful refit; expect order-of-magnitude longer wall-clock + than jackknife on this scale); (3) `results.in_time_placebo()`; + (4) `results.get_loo_effects_df()`; (5) + `results.sensitivity_to_zeta_omega()`; (6) + `results.get_weight_concentration()`. The bootstrap refit and the + jackknife loop are now both significant time sinks; + `sensitivity_to_zeta_omega` also refits. - **Source anchor.** `docs/tutorials/18_geo_experiments.ipynb`, Arkhangelsky et al. (2021), Mercado Libre geo-experiment writeup (medium.com/mercadolibre-tech), Meta GeoLift methodology docs diff --git a/docs/survey-roadmap.md b/docs/survey-roadmap.md index 49b3ab05..c92c0df3 100644 --- a/docs/survey-roadmap.md +++ b/docs/survey-roadmap.md @@ -49,9 +49,12 @@ Weighted `solve_logit()` in `linalg.py` — survey weights enter IRLS as ### Phase 6: Advanced Features (v2.7.6) -- **Survey-aware bootstrap** for all 8 bootstrap-using estimators: +- **Survey-aware bootstrap** for bootstrap-using estimators: multiplier at PSU (CS, Imputation, TwoStage, Continuous, Efficient) - and Rao-Wu rescaled (SA, SyntheticDiD, TROP) + and Rao-Wu rescaled (SA, TROP). SyntheticDiD bootstrap no longer supports + survey designs: the previous fixed-weight + Rao-Wu path was not + paper-faithful and was removed, and composing Rao-Wu with paper-faithful refit + Frank-Wolfe requires a separate derivation (tracked in TODO.md) - **Replicate weight variance**: BRR, Fay's BRR, JK1, JKn, SDR. 12 of 16 estimators supported (not SyntheticDiD, TROP, BaconDecomposition, or WooldridgeDiD) - **DEFF diagnostics**: per-coefficient design effects vs SRS baseline @@ -216,10 +219,11 @@ the limitation and suggested alternative.
| Estimator | Limitation | Alternative | |-----------|-----------|-------------| -| SyntheticDiD | Replicate weights | Use strata/PSU/FPC design with Rao-Wu rescaled bootstrap | +| SyntheticDiD | Strata/PSU/FPC (any variance method) | No SDID variance option in this release. Pweight-only works with `variance_method='placebo'` or `'jackknife'`. Strata/PSU/FPC + SDID requires composing Rao-Wu rescaled weights with paper-faithful Frank-Wolfe re-estimation; sketch in REGISTRY.md §SyntheticDiD. | +| SyntheticDiD | `variance_method='bootstrap'` + any survey design (including pweight-only) | Use `variance_method='placebo'` or `'jackknife'` for pweight-only surveys. Refit bootstrap composed with survey weights requires the same weighted-FW derivation noted above. | +| SyntheticDiD | Replicate weights | Pre-existing limitation: no replicate-weight survey support on SDID. | | TROP | Replicate weights | Use strata/PSU/FPC design with Rao-Wu rescaled bootstrap | | BaconDecomposition | Replicate weights | Diagnostic only, no inference | -| SyntheticDiD | `variance_method='placebo'` + strata/PSU/FPC | Use `variance_method='bootstrap'` | | ImputationDiD | `pretrends=True` + replicate weights | Use analytical survey design instead | | ImputationDiD | `pretrend_test()` + replicate weights | Use analytical survey design instead | | DiD, TWFE | `inference='wild_bootstrap'` + `survey_design` | Use analytical survey inference (default) | diff --git a/docs/tutorials/03_synthetic_did.ipynb b/docs/tutorials/03_synthetic_did.ipynb index 80b9d2ea..fd74871c 100644 --- a/docs/tutorials/03_synthetic_did.ipynb +++ b/docs/tutorials/03_synthetic_did.ipynb @@ -398,14 +398,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## 7. Inference Methods\n", - "\n", - "SDID supports two inference methods:\n", - "\n", - "1. **Placebo** (`variance_method=\"placebo\"`, default): Placebo-based variance using Algorithm 4 from Arkhangelsky et al. (2021). This matches R's default.\n", - "2. 
**Bootstrap** (`variance_method=\"bootstrap\"`): Unit-level bootstrap with fixed weights matching R's `bootstrap_sample`" - ] + "source": "## 7. Inference Methods\n\nSDID supports three inference methods:\n\n1. **Placebo** (`variance_method=\"placebo\"`, default): Placebo-based variance using Algorithm 4 from Arkhangelsky et al. (2021). Library default (R's default is bootstrap — we deviate because placebo is unconditionally available on pweight-only survey designs and sidesteps the refit bootstrap slowdown).\n2. **Bootstrap** (`variance_method=\"bootstrap\"`): Paper-faithful pairs bootstrap (Algorithm 2 step 2) — re-estimates ω and λ via Frank-Wolfe on each draw. Matches R's default `synthdid::vcov(method=\"bootstrap\")` behavior. Expect ~5–30× slower per fit than placebo (panel-size dependent).\n3. **Jackknife** (`variance_method=\"jackknife\"`): Algorithm 3 — fixed-weight leave-one-out. Deterministic; no bootstrap replications." }, { "cell_type": "code", @@ -606,23 +599,7 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## Summary\n", - "\n", - "Key takeaways for Synthetic DiD:\n", - "\n", - "1. **Best use cases**: Few treated units, many controls, long pre-period\n", - "2. **Unit weights**: Identify which controls are most similar to treated (Frank-Wolfe with sparsification)\n", - "3. **Time weights**: Determine which pre-periods are most informative (Frank-Wolfe on collapsed form)\n", - "4. **Pre-treatment fit**: Lower RMSE indicates better synthetic match\n", - "5. **Inference options**:\n", - " - Placebo (`variance_method=\"placebo\"`, default): Placebo-based variance from controls (matches R)\n", - " - Bootstrap (`variance_method=\"bootstrap\"`): Unit-level bootstrap with fixed weights\n", - "6. **Regularization**: Auto-computed from data noise level by default. Override with `zeta_omega`/`zeta_lambda`.\n", - "\n", - "Reference:\n", - "- Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). 
Synthetic difference-in-differences. American Economic Review, 111(12), 4088-4118." - ] + "source": "## Summary\n\nKey takeaways for Synthetic DiD:\n\n1. **Best use cases**: Few treated units, many controls, long pre-period\n2. **Unit weights**: Identify which controls are most similar to treated (Frank-Wolfe with sparsification)\n3. **Time weights**: Determine which pre-periods are most informative (Frank-Wolfe on collapsed form)\n4. **Pre-treatment fit**: Lower RMSE indicates better synthetic match\n5. **Inference options**:\n - Placebo (`variance_method=\"placebo\"`, default): Placebo-based variance from controls. Library default (R's default is bootstrap; we deviate for survey availability + perf).\n - Bootstrap (`variance_method=\"bootstrap\"`): Paper-faithful pairs bootstrap re-estimating ω and λ via Frank-Wolfe per draw (Algorithm 2 step 2; matches R's default `vcov`). ~5–30× slower than placebo.\n - Jackknife (`variance_method=\"jackknife\"`): Algorithm 3 — fixed-weight leave-one-out.\n6. **Regularization**: Auto-computed from data noise level by default. Override with `zeta_omega`/`zeta_lambda`.\n\nReference:\n- Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic difference-in-differences. American Economic Review, 111(12), 4088-4118." } ], "metadata": { @@ -632,4 +609,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/docs/tutorials/16_survey_did.ipynb b/docs/tutorials/16_survey_did.ipynb index 511a7384..06bb1b6c 100644 --- a/docs/tutorials/16_survey_did.ipynb +++ b/docs/tutorials/16_survey_did.ipynb @@ -195,7 +195,7 @@ "id": "cell-05-90139e87", "metadata": {}, "source": [ - "**About the normalization warning:** You'll see `pweight weights normalized to mean=1` throughout this tutorial. Survey weights are inverse selection probabilities -- they rarely have mean=1 out of the box. The library rescales them internally so that weighted estimators are numerically stable. 
This is standard practice (Lumley 2004, \u00a72.2). The warning confirms rescaling occurred; it is not an error." + "**About the normalization warning:** You'll see `pweight weights normalized to mean=1` throughout this tutorial. Survey weights are inverse selection probabilities -- they rarely have mean=1 out of the box. The library rescales them internally so that weighted estimators are numerically stable. This is standard practice (Lumley 2004, §2.2). The warning confirms rescaling occurred; it is not an error." ] }, { @@ -1087,42 +1087,7 @@ "cell_type": "markdown", "id": "cell-35-f1ef376c", "metadata": {}, - "source": [ - "## 9. Which Estimators Support Survey Design?\n", - "\n", - "`diff-diff` supports survey design across all estimators, though the level of support varies:\n", - "\n", - "| Estimator | Weights | Strata/PSU/FPC (TSL) | Replicate Weights | Survey-Aware Bootstrap |\n", - "|-----------|---------|---------------------|-------------------|------------------------|\n", - "| **DifferenceInDifferences** | Full | Full | -- | -- |\n", - "| **TwoWayFixedEffects** | Full | Full | -- | -- |\n", - "| **MultiPeriodDiD** | Full | Full | -- | -- |\n", - "| **CallawaySantAnna** | pweight only | Full | Full | Multiplier at PSU |\n", - "| **TripleDifference** | pweight only | Full | Full (analytical) | -- |\n", - "| **StaggeredTripleDifference** | pweight only | Full | Full | Multiplier at PSU |\n", - "| **SunAbraham** | Full | Full | -- | Rao-Wu rescaled |\n", - "| **StackedDiD** | pweight only | Full (pweight only) | -- | -- |\n", - "| **ImputationDiD** | pweight only | Partial (no FPC) | -- | Multiplier at PSU |\n", - "| **TwoStageDiD** | pweight only | Partial (no FPC) | -- | Multiplier at PSU |\n", - "| **ContinuousDiD** | Full | Full | Full (analytical) | Multiplier at PSU |\n", - "| **EfficientDiD** | Full | Full | Full (analytical) | Multiplier at PSU |\n", - "| **SyntheticDiD** | pweight only | -- | -- | Rao-Wu rescaled |\n", - "| **TROP** | pweight only 
| -- | -- | Rao-Wu rescaled |\n", - "| **BaconDecomposition** | Diagnostic | Diagnostic | -- | -- |\n", - "\n", - "**Legend:**\n", - "- **Full**: All weight types (pweight/fweight/aweight) + strata/PSU/FPC + Taylor Series Linearization variance\n", - "- **Full (pweight only)**: Full TSL support with strata/PSU/FPC, but only accepts `pweight` weight type (`fweight`/`aweight` rejected because Q-weight composition changes their semantics)\n", - "- **Partial (no FPC)**: Weights + strata (for df) + PSU (for clustering); FPC raises `NotImplementedError`\n", - "- **pweight only** (Weights column): Only `pweight` accepted; `fweight`/`aweight` raise an error\n", - "- **pweight only** (TSL column): Sampling weights for point estimates; no strata/PSU/FPC design elements\n", - "- **Diagnostic**: Weighted descriptive statistics only (no inference)\n", - "- **--**: Not supported\n", - "\n", - "**Note:** `EfficientDiD` supports `covariates` and `survey_design` simultaneously. The doubly-robust (DR) path threads survey weights through WLS outcome regression, weighted sieve propensity ratios, and survey-weighted kernel smoothing.\n", - "\n", - "For full details, see `docs/survey-roadmap.md`." - ] + "source": "## 9. 
Which Estimators Support Survey Design?\n\n`diff-diff` supports survey design across all estimators, though the level of support varies:\n\n| Estimator | Weights | Strata/PSU/FPC (TSL) | Replicate Weights | Survey-Aware Bootstrap |\n|-----------|---------|---------------------|-------------------|------------------------|\n| **DifferenceInDifferences** | Full | Full | -- | -- |\n| **TwoWayFixedEffects** | Full | Full | -- | -- |\n| **MultiPeriodDiD** | Full | Full | -- | -- |\n| **CallawaySantAnna** | pweight only | Full | Full | Multiplier at PSU |\n| **TripleDifference** | pweight only | Full | Full (analytical) | -- |\n| **StaggeredTripleDifference** | pweight only | Full | Full | Multiplier at PSU |\n| **SunAbraham** | Full | Full | -- | Rao-Wu rescaled |\n| **StackedDiD** | pweight only | Full (pweight only) | -- | -- |\n| **ImputationDiD** | pweight only | Partial (no FPC) | -- | Multiplier at PSU |\n| **TwoStageDiD** | pweight only | Partial (no FPC) | -- | Multiplier at PSU |\n| **ContinuousDiD** | Full | Full | Full (analytical) | Multiplier at PSU |\n| **EfficientDiD** | Full | Full | Full (analytical) | Multiplier at PSU |\n| **SyntheticDiD** | pweight only (placebo / jackknife) | -- | -- | -- |\n| **TROP** | pweight only | -- | -- | Rao-Wu rescaled |\n| **BaconDecomposition** | Diagnostic | Diagnostic | -- | -- |\n\n**Legend:**\n- **Full**: All weight types (pweight/fweight/aweight) + strata/PSU/FPC + Taylor Series Linearization variance\n- **Full (pweight only)**: Full TSL support with strata/PSU/FPC, but only accepts `pweight` weight type (`fweight`/`aweight` rejected because Q-weight composition changes their semantics)\n- **Partial (no FPC)**: Weights + strata (for df) + PSU (for clustering); FPC raises `NotImplementedError`\n- **pweight only** (Weights column): Only `pweight` accepted; `fweight`/`aweight` raise an error\n- **pweight only** (TSL column): Sampling weights for point estimates; no strata/PSU/FPC design elements\n- **Diagnostic**: 
Weighted descriptive statistics only (no inference)\n- **--**: Not supported\n\n**Note on SyntheticDiD:** `variance_method=\"placebo\"` and `variance_method=\"jackknife\"` support pweight-only survey designs. `variance_method=\"bootstrap\"` rejects every survey design (including pweight-only) because the paper-faithful refit bootstrap composed with Rao-Wu rescaled weights requires a weighted-Frank-Wolfe derivation that is not yet implemented. Strata/PSU/FPC are not supported by any SDID variance method in this release. The weighted-FW + Rao-Wu composition follow-up is tracked in `TODO.md`; see `docs/methodology/REGISTRY.md` §SyntheticDiD for the deferred-composition sketch.\n\n**Note:** `EfficientDiD` supports `covariates` and `survey_design` simultaneously. The doubly-robust (DR) path threads survey weights through WLS outcome regression, weighted sieve propensity ratios, and survey-weighted kernel smoothing.\n\nFor full details, see `docs/survey-roadmap.md`." }, { "cell_type": "markdown", @@ -1137,15 +1102,15 @@ "\n", "**Policy background.** The Affordable Care Act's dependent coverage provision, effective\n", "September 2010, allowed young adults to remain on their parents' health insurance until age 26.\n", - "This created a natural experiment: adults aged 19\u201325 gained coverage access (treatment group),\n", - "while adults aged 27\u201334 \u2014 similar demographics but ineligible \u2014 serve as controls. This is one\n", + "This created a natural experiment: adults aged 19–25 gained coverage access (treatment group),\n", + "while adults aged 27–34 — similar demographics but ineligible — serve as controls. This is one\n", "of the most widely studied DiD natural experiments in health economics.\n", "\n", "> Antwi, Y.A., Moriya, A.S. & Simon, K. (2013). 
\"Effects of Federal Policy to Insure Young\n", "> Adults: Evidence from the 2010 Affordable Care Act's Dependent-Coverage Mandate.\"\n", - "> *American Economic Journal: Economic Policy* 5(4): 1\u201328.\n", + "> *American Economic Journal: Economic Policy* 5(4): 1–28.\n", "\n", - "We use two NHANES cycles: **2007\u20132008** (pre-ACA) and **2015\u20132016** (post-ACA), with health\n", + "We use two NHANES cycles: **2007–2008** (pre-ACA) and **2015–2016** (post-ACA), with health\n", "insurance coverage as the outcome (binary, modeled as a linear probability model). NHANES uses\n", "a complex multi-stage probability sampling design with **masked pseudo-strata** (`SDMVSTRA`),\n", "**masked pseudo-PSUs** (`SDMVPSU`), and **exam weights** (`WTMEC2YR`). Because PSU IDs are\n", @@ -1351,7 +1316,7 @@ "source": [ "With real survey data, **both the point estimate and standard error change** when we account\n", "for the survey design. The ATT shifts from 0.065 (unweighted) to 0.097 (weighted) because\n", - "NHANES uses unequal selection probabilities \u2014 the weighted estimate is population-representative,\n", + "NHANES uses unequal selection probabilities — the weighted estimate is population-representative,\n", "while the unweighted one over- or under-represents certain demographic groups. The SE also\n", "increases because NHANES clusters individuals within PSUs, and people from the same geographic\n", "area have correlated insurance status.\n", @@ -1359,13 +1324,13 @@ "The survey-corrected estimate of **~9.7 percentage points** suggests that the ACA provisions\n", "meaningfully increased insurance coverage among young adults. 
This is consistent with the\n", "published literature: studies measuring only the 2010 dependent coverage mandate find\n", - "3\u20136 pp effects (Antwi et al., 2013; Sommers, 2012), while studies spanning the full\n", - "ACA implementation \u2014 including the 2014 marketplace, individual mandate, and Medicaid\n", - "expansion \u2014 find 8\u201313 pp (Kaestner et al., 2017; Courtemanche et al., 2017). Our\n", - "2007\u201308 vs. 2015\u201316 window captures all of these provisions, placing the 9.7 pp\n", + "3–6 pp effects (Antwi et al., 2013; Sommers, 2012), while studies spanning the full\n", + "ACA implementation — including the 2014 marketplace, individual mandate, and Medicaid\n", + "expansion — find 8–13 pp (Kaestner et al., 2017; Courtemanche et al., 2017). Our\n", + "2007–08 vs. 2015–16 window captures all of these provisions, placing the 9.7 pp\n", "estimate squarely in the expected range.\n", "\n", - "The survey degrees of freedom (31 = n_PSU \u2212 n_strata) reflect the actual number\n", + "The survey degrees of freedom (31 = n_PSU − n_strata) reflect the actual number\n", "of independent sampling units, not the number of individuals. 
This is why the\n", "confidence interval [0.006, 0.187] is wide despite nearly 3,000 observations.\n", "\n", diff --git a/docs/tutorials/18_geo_experiments.ipynb b/docs/tutorials/18_geo_experiments.ipynb index 95e6a73a..3a660470 100644 --- a/docs/tutorials/18_geo_experiments.ipynb +++ b/docs/tutorials/18_geo_experiments.ipynb @@ -54,10 +54,10 @@ "id": "t18-cell-006", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:10.031700Z", - "iopub.status.busy": "2026-04-22T01:35:10.031610Z", - "iopub.status.idle": "2026-04-22T01:35:11.884393Z", - "shell.execute_reply": "2026-04-22T01:35:11.884085Z" + "iopub.execute_input": "2026-04-22T21:17:39.827552Z", + "iopub.status.busy": "2026-04-22T21:17:39.827479Z", + "iopub.status.idle": "2026-04-22T21:17:40.857211Z", + "shell.execute_reply": "2026-04-22T21:17:40.856867Z" } }, "outputs": [], @@ -106,10 +106,10 @@ "id": "t18-cell-009", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:11.885866Z", - "iopub.status.busy": "2026-04-22T01:35:11.885703Z", - "iopub.status.idle": "2026-04-22T01:35:11.896385Z", - "shell.execute_reply": "2026-04-22T01:35:11.896150Z" + "iopub.execute_input": "2026-04-22T21:17:40.858677Z", + "iopub.status.busy": "2026-04-22T21:17:40.858527Z", + "iopub.status.idle": "2026-04-22T21:17:40.868516Z", + "shell.execute_reply": "2026-04-22T21:17:40.868282Z" } }, "outputs": [ @@ -161,10 +161,10 @@ "id": "t18-cell-010", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:11.897470Z", - "iopub.status.busy": "2026-04-22T01:35:11.897377Z", - "iopub.status.idle": "2026-04-22T01:35:11.902271Z", - "shell.execute_reply": "2026-04-22T01:35:11.902055Z" + "iopub.execute_input": "2026-04-22T21:17:40.869577Z", + "iopub.status.busy": "2026-04-22T21:17:40.869495Z", + "iopub.status.idle": "2026-04-22T21:17:40.874320Z", + "shell.execute_reply": "2026-04-22T21:17:40.874100Z" } }, "outputs": [ @@ -259,10 +259,10 @@ "id": "t18-cell-011", "metadata": { "execution": { - 
"iopub.execute_input": "2026-04-22T01:35:11.903302Z", - "iopub.status.busy": "2026-04-22T01:35:11.903226Z", - "iopub.status.idle": "2026-04-22T01:35:11.911482Z", - "shell.execute_reply": "2026-04-22T01:35:11.911257Z" + "iopub.execute_input": "2026-04-22T21:17:40.875296Z", + "iopub.status.busy": "2026-04-22T21:17:40.875214Z", + "iopub.status.idle": "2026-04-22T21:17:40.882826Z", + "shell.execute_reply": "2026-04-22T21:17:40.882608Z" } }, "outputs": [ @@ -379,10 +379,10 @@ "id": "t18-cell-012", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:11.912471Z", - "iopub.status.busy": "2026-04-22T01:35:11.912393Z", - "iopub.status.idle": "2026-04-22T01:35:12.026053Z", - "shell.execute_reply": "2026-04-22T01:35:12.025785Z" + "iopub.execute_input": "2026-04-22T21:17:40.883777Z", + "iopub.status.busy": "2026-04-22T21:17:40.883709Z", + "iopub.status.idle": "2026-04-22T21:17:40.988502Z", + "shell.execute_reply": "2026-04-22T21:17:40.988246Z" } }, "outputs": [ @@ -455,24 +455,13 @@ "id": "t18-cell-016", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:12.027139Z", - "iopub.status.busy": "2026-04-22T01:35:12.027054Z", - "iopub.status.idle": "2026-04-22T01:35:51.265252Z", - "shell.execute_reply": "2026-04-22T01:35:51.264927Z" + "iopub.execute_input": "2026-04-22T21:17:40.989623Z", + "iopub.status.busy": "2026-04-22T21:17:40.989544Z", + "iopub.status.idle": "2026-04-22T21:17:41.228821Z", + "shell.execute_reply": "2026-04-22T21:17:41.228493Z" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/igerber/diff-diff-sdid-bootstrap/diff_diff/utils.py:1455: UserWarning: Frank-Wolfe SC weight solver did not converge in 100 iterations (tol=1.543597599298811e-05). 
Results may be inaccurate.\n", - " return _sc_weight_fw_numpy(Y, zeta, intercept, init_weights, min_decrease, max_iter)\n", - "/Users/igerber/diff-diff-sdid-bootstrap/diff_diff/utils.py:1455: UserWarning: Frank-Wolfe SC weight solver did not converge in 10000 iterations (tol=1.543597599298811e-05). Results may be inaccurate.\n", - " return _sc_weight_fw_numpy(Y, zeta, intercept, init_weights, min_decrease, max_iter)\n" - ] - } - ], + "outputs": [], "source": [ "# Note: variance_method, n_bootstrap, and seed are CONSTRUCTOR kwargs.\n", "# They are not arguments to .fit().\n", @@ -493,10 +482,10 @@ "id": "t18-cell-017", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:51.266633Z", - "iopub.status.busy": "2026-04-22T01:35:51.266542Z", - "iopub.status.idle": "2026-04-22T01:35:51.268408Z", - "shell.execute_reply": "2026-04-22T01:35:51.268164Z" + "iopub.execute_input": "2026-04-22T21:17:41.230093Z", + "iopub.status.busy": "2026-04-22T21:17:41.230011Z", + "iopub.status.idle": "2026-04-22T21:17:41.231757Z", + "shell.execute_reply": "2026-04-22T21:17:41.231511Z" } }, "outputs": [ @@ -581,10 +570,10 @@ "id": "t18-cell-021", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:51.269469Z", - "iopub.status.busy": "2026-04-22T01:35:51.269395Z", - "iopub.status.idle": "2026-04-22T01:35:51.273224Z", - "shell.execute_reply": "2026-04-22T01:35:51.273017Z" + "iopub.execute_input": "2026-04-22T21:17:41.232798Z", + "iopub.status.busy": "2026-04-22T21:17:41.232728Z", + "iopub.status.idle": "2026-04-22T21:17:41.235765Z", + "shell.execute_reply": "2026-04-22T21:17:41.235563Z" } }, "outputs": [ @@ -624,10 +613,10 @@ "id": "t18-cell-022", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:51.274212Z", - "iopub.status.busy": "2026-04-22T01:35:51.274139Z", - "iopub.status.idle": "2026-04-22T01:35:51.333261Z", - "shell.execute_reply": "2026-04-22T01:35:51.332993Z" + "iopub.execute_input": "2026-04-22T21:17:41.236762Z", + 
"iopub.status.busy": "2026-04-22T21:17:41.236687Z", + "iopub.status.idle": "2026-04-22T21:17:41.294144Z", + "shell.execute_reply": "2026-04-22T21:17:41.293879Z" } }, "outputs": [ @@ -663,10 +652,10 @@ "id": "t18-cell-023", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:51.334442Z", - "iopub.status.busy": "2026-04-22T01:35:51.334344Z", - "iopub.status.idle": "2026-04-22T01:35:51.379849Z", - "shell.execute_reply": "2026-04-22T01:35:51.379557Z" + "iopub.execute_input": "2026-04-22T21:17:41.295247Z", + "iopub.status.busy": "2026-04-22T21:17:41.295156Z", + "iopub.status.idle": "2026-04-22T21:17:41.340650Z", + "shell.execute_reply": "2026-04-22T21:17:41.340382Z" } }, "outputs": [ @@ -721,10 +710,10 @@ "id": "t18-cell-024", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:51.380969Z", - "iopub.status.busy": "2026-04-22T01:35:51.380878Z", - "iopub.status.idle": "2026-04-22T01:35:51.382633Z", - "shell.execute_reply": "2026-04-22T01:35:51.382397Z" + "iopub.execute_input": "2026-04-22T21:17:41.341768Z", + "iopub.status.busy": "2026-04-22T21:17:41.341675Z", + "iopub.status.idle": "2026-04-22T21:17:41.343483Z", + "shell.execute_reply": "2026-04-22T21:17:41.343271Z" } }, "outputs": [ @@ -756,10 +745,10 @@ "id": "t18-cell-025", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:51.383625Z", - "iopub.status.busy": "2026-04-22T01:35:51.383550Z", - "iopub.status.idle": "2026-04-22T01:35:51.437856Z", - "shell.execute_reply": "2026-04-22T01:35:51.437576Z" + "iopub.execute_input": "2026-04-22T21:17:41.344422Z", + "iopub.status.busy": "2026-04-22T21:17:41.344352Z", + "iopub.status.idle": "2026-04-22T21:17:41.398064Z", + "shell.execute_reply": "2026-04-22T21:17:41.397777Z" } }, "outputs": [ @@ -822,10 +811,10 @@ "id": "t18-cell-026", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:51.439084Z", - "iopub.status.busy": "2026-04-22T01:35:51.438979Z", - "iopub.status.idle": 
"2026-04-22T01:35:51.441226Z", - "shell.execute_reply": "2026-04-22T01:35:51.441007Z" + "iopub.execute_input": "2026-04-22T21:17:41.399142Z", + "iopub.status.busy": "2026-04-22T21:17:41.399060Z", + "iopub.status.idle": "2026-04-22T21:17:41.401344Z", + "shell.execute_reply": "2026-04-22T21:17:41.401110Z" } }, "outputs": [ @@ -865,16 +854,7 @@ "cell_type": "markdown", "id": "t18-cell-028", "metadata": {}, - "source": [ - "diff-diff's `SyntheticDiD` supports two standard error methods, and the difference between them is *what gets re-estimated per replication*:\n", - "\n", - "- **Placebo SE** (default): permutes which control units are pretended to be \"treated\", then **re-estimates both the unit weights and the time weights** (Frank-Wolfe) on each permutation and recomputes SDiD. The standard deviation of those placebo effects is the SE. This is Algorithm 4 in Arkhangelsky et al. (2021) and matches R's `synthdid::vcov(method=\"placebo\")`.\n", - "- **Bootstrap SE**: resamples all units with replacement, **renormalizes the *original* unit weights** for the resampled controls (no re-estimation), keeps the **time weights fixed** at their original values, and recomputes the SDiD estimate with those fixed weights. Matches R's `synthdid::vcov(method=\"bootstrap\")`.\n", - "\n", - "The contrast is the point: placebo re-estimates everything per replication and so reflects the full uncertainty in the weighting procedure; bootstrap freezes the weights and only varies the resampled-unit composition. They typically agree to within a few percent, which is the cross-check you want.\n", - "\n", - "Both methods are configured on the `SyntheticDiD` *constructor*, not on `.fit()`. Use placebo by default (it's the published method); switch to bootstrap if you want a quick cross-check." 
- ] + "source": "diff-diff's `SyntheticDiD` supports three standard error methods, and the difference between the two paper-based ones is *what gets resampled* per replication:\n\n- **Placebo SE** (default): permutes which control units are pretended to be \"treated\", then **re-estimates both the unit weights and the time weights** (Frank-Wolfe) on each permutation and recomputes SDiD. The standard deviation of those placebo effects is the SE. This is Algorithm 4 in Arkhangelsky et al. (2021) and matches R's `synthdid::vcov(method=\"placebo\")`.\n- **Bootstrap SE**: pairs-bootstrap resampling of all units with replacement, then **re-estimates both the unit weights and the time weights** via Frank-Wolfe on each resampled panel and recomputes SDiD. This is Algorithm 2 step 2 in Arkhangelsky et al. (2021) and matches R's default `synthdid::vcov(method=\"bootstrap\")` behavior (which rebinds `attr(estimate, \"opts\")` so the renormalized ω is only Frank-Wolfe initialization). Expect ~5–30× slower per fit than placebo (panel-size dependent).\n- **Jackknife SE**: deterministic Algorithm 3 — fixed-weight leave-one-out across all units. Faster than bootstrap; mildly anti-conservative on smaller panels.\n\nBoth bootstrap and placebo re-estimate the weights per replication, so each reflects the full uncertainty in the weighting procedure. They differ in *how* they resample: placebo permutes the control-vs-treated assignment, bootstrap draws with replacement. On exchangeable DGPs the two SEs typically track each other; on small panels with non-exchangeable factor structure (like the marketing geo-experiment here), they can differ in magnitude while still agreeing on significance and CI direction.\n\nAll three methods are configured on the `SyntheticDiD` *constructor*, not on `.fit()`. 
Use placebo by default (it's the library default; R's default is bootstrap); switch to bootstrap if you want a cross-check from a different resampling protocol; switch to jackknife if you need a deterministic, fast alternative." }, { "cell_type": "code", @@ -882,23 +862,13 @@ "id": "t18-cell-029", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:51.442285Z", - "iopub.status.busy": "2026-04-22T01:35:51.442190Z", - "iopub.status.idle": "2026-04-22T01:35:51.925944Z", - "shell.execute_reply": "2026-04-22T01:35:51.925669Z" + "iopub.execute_input": "2026-04-22T21:17:41.402325Z", + "iopub.status.busy": "2026-04-22T21:17:41.402255Z", + "iopub.status.idle": "2026-04-22T21:17:41.662155Z", + "shell.execute_reply": "2026-04-22T21:17:41.661887Z" } }, "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/Users/igerber/diff-diff-sdid-bootstrap/diff_diff/utils.py:1455: UserWarning: Frank-Wolfe SC weight solver did not converge in 100 iterations (tol=1.543597599298811e-05). Results may be inaccurate.\n", - " return _sc_weight_fw_numpy(Y, zeta, intercept, init_weights, min_decrease, max_iter)\n", - "/Users/igerber/diff-diff-sdid-bootstrap/diff_diff/utils.py:1455: UserWarning: Frank-Wolfe SC weight solver did not converge in 10000 iterations (tol=1.543597599298811e-05). 
Results may be inaccurate.\n", - " return _sc_weight_fw_numpy(Y, zeta, intercept, init_weights, min_decrease, max_iter)\n" - ] - }, { "data": { "text/html": [ @@ -940,9 +910,9 @@ " 1\n", " Bootstrap\n", " 311.85\n", - " 4.26\n", - " 303.50\n", - " 320.21\n", + " 4.44\n", + " 303.15\n", + " 320.56\n", " \n", " \n", "\n", @@ -951,7 +921,7 @@ "text/plain": [ " Method ATT SE CI_low CI_high\n", "0 Placebo (default) 311.85 7.31 297.52 326.19\n", - "1 Bootstrap 311.85 4.26 303.50 320.21" + "1 Bootstrap 311.85 4.44 303.15 320.56" ] }, "execution_count": 14, @@ -987,10 +957,10 @@ "id": "t18-cell-030", "metadata": { "execution": { - "iopub.execute_input": "2026-04-22T01:35:51.927040Z", - "iopub.status.busy": "2026-04-22T01:35:51.926952Z", - "iopub.status.idle": "2026-04-22T01:35:51.928720Z", - "shell.execute_reply": "2026-04-22T01:35:51.928495Z" + "iopub.execute_input": "2026-04-22T21:17:41.663265Z", + "iopub.status.busy": "2026-04-22T21:17:41.663187Z", + "iopub.status.idle": "2026-04-22T21:17:41.664937Z", + "shell.execute_reply": "2026-04-22T21:17:41.664722Z" } }, "outputs": [ @@ -1089,7 +1059,7 @@ ">\n", "> **Sample size and design.** 5 pilot markets, 75 control markets. 12 weeks of weekly data: 6 weeks pre-launch, 6 weeks post-launch. Outcome: weekly conversions per market. Method: Synthetic Difference-in-Differences (Arkhangelsky et al. 2021), the canonical generalization of synthetic control to multi-treated panel settings. The 75 control markets serve as the donor pool that SDiD reweights to construct a counterfactual specific to the 5 pilot markets.\n", ">\n", - "> **Validity evidence.** The synthetic control's pre-treatment fit RMSE is well below the standard deviation of treated pre-period outcomes (the library would warn otherwise), which means the weighted blend of donor markets tracks the treated pre-trend closely. The placebo standard error matches the published Arkhangelsky et al. 
(2021) method, and we cross-checked with bootstrap inference (Section 5) — the two methods agree. The estimate is statistically significant at the p < 0.01 level, and the 95% CI cleanly covers the true treatment effect on the synthetic data we used to demonstrate the workflow.\n", + "> **Validity evidence.** The synthetic control's pre-treatment fit RMSE is well below the standard deviation of treated pre-period outcomes (the library would warn otherwise), which means the weighted blend of donor markets tracks the treated pre-trend closely. The placebo standard error matches the published Arkhangelsky et al. (2021) method, and we cross-checked with paper-faithful refit bootstrap inference (Section 5) — both methods agree on the point estimate (311.85) and on the result's significance, with bootstrap producing a narrower CI than placebo on this small panel (5 treated × 6 pre-periods, factor-model heterogeneity). The estimate is statistically significant under both inference methods, and the placebo 95% CI cleanly covers the true treatment effect on the synthetic data we used to demonstrate the workflow.\n", ">\n", "> **What \"312 conversions per market per week\" means in business terms.** Across 5 pilot markets and 6 weeks, that's roughly 9,400 incremental conversions attributable to the campaign in this small pilot. 
Translate to your own revenue-per-conversion to compare against the pilot's campaign spend, then use the per-market lift estimate to project what a broader rollout would deliver.\n", ">\n", @@ -1115,4 +1085,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/rust/src/lib.rs b/rust/src/lib.rs index b6586665..53844650 100644 --- a/rust/src/lib.rs +++ b/rust/src/lib.rs @@ -36,6 +36,7 @@ fn _rust_backend(m: &Bound<'_, PyModule>) -> PyResult<()> { m.add_function(wrap_pyfunction!(weights::compute_time_weights, m)?)?; m.add_function(wrap_pyfunction!(weights::compute_noise_level, m)?)?; m.add_function(wrap_pyfunction!(weights::sc_weight_fw, m)?)?; + m.add_function(wrap_pyfunction!(weights::sc_weight_fw_with_convergence, m)?)?; // Linear algebra operations m.add_function(wrap_pyfunction!(linalg::solve_ols, m)?)?; diff --git a/rust/src/weights.rs b/rust/src/weights.rs index 8e5c26c8..4d39ea50 100644 --- a/rust/src/weights.rs +++ b/rust/src/weights.rs @@ -127,7 +127,7 @@ fn sc_weight_fw_gram( n: usize, min_decrease_sq: f64, max_iter: usize, -) { +) -> bool { let t0 = lam.len(); // Precompute Gram matrix and related quantities — O(N×T0²) once @@ -148,6 +148,7 @@ fn sc_weight_fw_gram( let mut half_grad = Array1::zeros(t0); let mut prev_val = f64::INFINITY; + let mut converged = false; for t in 0..max_iter { // Step 1: half_grad[j] = ata_x[j] - atb[j] + eta * lam[j] @@ -168,6 +169,7 @@ fn sc_weight_fw_gram( let atb_dot_lam: f64 = atb.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum(); let val = zeta * zeta * lam_norm_sq + (xt_ata_x - 2.0 * atb_dot_lam + b_norm_sq) / n as f64; if t >= 1 && prev_val - val < min_decrease_sq { + converged = true; break; } prev_val = val; @@ -183,6 +185,7 @@ fn sc_weight_fw_gram( let atb_dot_lam: f64 = atb.iter().zip(lam.iter()).map(|(&a, &b)| a * b).sum(); let val = zeta * zeta * lam_norm_sq + (xt_ata_x - 2.0 * atb_dot_lam + b_norm_sq) / n as f64; if t >= 1 && prev_val - val < min_decrease_sq { + converged = 
true; break; } prev_val = val; @@ -222,10 +225,12 @@ // Step 10: Convergence check if t >= 1 && prev_val - val < min_decrease_sq { + converged = true; break; } prev_val = val; } + converged } /// Allocation-free standard Frank-Wolfe loop for the T0 >= N case (unit weights). @@ -243,7 +248,7 @@ n: usize, min_decrease_sq: f64, max_iter: usize, -) { +) -> bool { let t0 = lam.len(); // Precompute column norms: col_norms_sq[j] = ||A[:,j]||² @@ -259,6 +264,7 @@ let mut diff = Array1::zeros(n); // Reusable buffer for ax - b let mut prev_val = f64::INFINITY; + let mut converged = false; for t in 0..max_iter { // Step 1-2: Compute half_grad = A^T @ (ax - b) + eta * lam @@ -284,6 +290,7 @@ } let val = zeta * zeta * lam_norm_sq + err_sq / n as f64; if t >= 1 && prev_val - val < min_decrease_sq { + converged = true; break; } prev_val = val; @@ -306,6 +313,7 @@ } let val = zeta * zeta * lam_norm_sq + err_sq / n as f64; if t >= 1 && prev_val - val < min_decrease_sq { + converged = true; break; } prev_val = val; @@ -340,10 +348,12 @@ let val = zeta * zeta * lam_norm_sq + err_sq / n as f64; if t >= 1 && prev_val - val < min_decrease_sq { + converged = true; break; } prev_val = val; } + converged } /// Compute synthetic control weights via Frank-Wolfe optimization. @@ -373,12 +383,13 @@ fn sc_weight_fw_internal( init_weights: Option<&Array1<f64>>, min_decrease: f64, max_iter: usize, -) -> Array1<f64> { +) -> (Array1<f64>, bool) { let t0 = y.ncols() - 1; let n = y.nrows(); if t0 == 0 { - return Array1::ones(1); + // Degenerate case: no weights to optimize; treat as trivially converged.
+ return (Array1::ones(1), true); } // Column-center if using intercept — owned Array2 for the centered case @@ -401,15 +412,15 @@ fn sc_weight_fw_internal( let min_decrease_sq = min_decrease * min_decrease; // Dispatch to optimized loop based on problem dimensions - if t0 < n { + let converged = if t0 < n { // Gram path: precompute A^T@A for O(T0) per iteration - sc_weight_fw_gram(&a, &b, &mut lam, eta, zeta, n, min_decrease_sq, max_iter); + sc_weight_fw_gram(&a, &b, &mut lam, eta, zeta, n, min_decrease_sq, max_iter) } else { // Standard path: allocation-free with 1 GEMV per iteration - sc_weight_fw_standard(&a, &b, &mut lam, eta, zeta, n, min_decrease_sq, max_iter); - } + sc_weight_fw_standard(&a, &b, &mut lam, eta, zeta, n, min_decrease_sq, max_iter) + }; - lam + (lam, converged) } /// Compute noise level from first-differences of control outcomes. @@ -493,7 +504,7 @@ pub fn sc_weight_fw<'py>( let v = w.as_array(); v.to_owned() }); - let result = sc_weight_fw_internal( + let (result, _converged) = sc_weight_fw_internal( &y_arr, zeta, intercept, @@ -504,6 +515,45 @@ pub fn sc_weight_fw<'py>( Ok(result.to_pyarray(py)) } +/// Compute synthetic control weights via Frank-Wolfe optimization, returning +/// a convergence flag alongside the weight vector. +/// +/// Identical numeric contract to `sc_weight_fw`; the returned tuple's second +/// element is `true` iff the solver's min-decrease criterion fired (rather +/// than `max_iter` being reached). Callers that need to surface FW non- +/// convergence explicitly (e.g., SDID bootstrap aggregate warnings) should +/// use this function instead of `sc_weight_fw`, because the top-level Rust +/// FW entry point is otherwise silent on non-convergence. +/// +/// # Returns +/// Tuple of `(weights, converged)`. 
+#[pyfunction] +#[pyo3(signature = (y, zeta, intercept=true, init_weights=None, min_decrease=1e-5, max_iter=10000))] +pub fn sc_weight_fw_with_convergence<'py>( + py: Python<'py>, + y: PyReadonlyArray2<'py, f64>, + zeta: f64, + intercept: bool, + init_weights: Option<PyReadonlyArray1<'py, f64>>, + min_decrease: f64, + max_iter: usize, +) -> PyResult<(Bound<'py, PyArray1<f64>>, bool)> { + let y_arr = y.as_array(); + let init = init_weights.map(|w| { + let v = w.as_array(); + v.to_owned() + }); + let (result, converged) = sc_weight_fw_internal( + &y_arr, + zeta, + intercept, + init.as_ref(), + min_decrease, + max_iter, + ); + Ok((result.to_pyarray(py), converged)) +} + /// Compute SDID time weights via Frank-Wolfe optimization. /// /// Matches R's synthdid: sc.weight.fw(Yc[1:N0, ], zeta=zeta.lambda, intercept=TRUE) @@ -572,14 +622,19 @@ pub(crate) fn compute_time_weights_internal( } // Two-pass sparsification (matching R's default sparsify=sparsify_function) - // First pass: limited iterations - let lam = sc_weight_fw_internal(&y_time.view(), zeta_lambda, intercept, None, min_decrease, max_iter_pre_sparsify); + // First pass: limited iterations. This entry point discards the inner + // convergence flag — Python callers that need convergence tracking use + // `compute_time_weights` (Python wrapper in utils.py) with + // `return_convergence=True`, which runs the two-pass in Python against + // `sc_weight_fw_with_convergence`. + let (lam, _) = sc_weight_fw_internal(&y_time.view(), zeta_lambda, intercept, None, min_decrease, max_iter_pre_sparsify); // Sparsify let lam_sparse = sparsify_internal(&lam); // Second pass: from sparsified initialization - sc_weight_fw_internal(&y_time.view(), zeta_lambda, intercept, Some(&lam_sparse), min_decrease, max_iter) + let (lam2, _) = sc_weight_fw_internal(&y_time.view(), zeta_lambda, intercept, Some(&lam_sparse), min_decrease, max_iter); + lam2 } /// Compute SDID unit weights via Frank-Wolfe with two-pass sparsification.
@@ -646,8 +701,9 @@ pub(crate) fn compute_sdid_unit_weights_internal( y_unit[[t, n_control]] = y_pre_treated_mean[t]; } - // First pass: limited iterations - let omega = sc_weight_fw_internal( + // First pass: limited iterations. See note in compute_time_weights_internal + // about convergence-tracking contract. + let (omega, _) = sc_weight_fw_internal( &y_unit.view(), zeta_omega, intercept, None, min_decrease, max_iter_pre_sparsify, ); @@ -655,9 +711,10 @@ let omega = sparsify_internal(&omega); // Second pass: from sparsified initialization - sc_weight_fw_internal( + let (omega2, _) = sc_weight_fw_internal( &y_unit.view(), zeta_omega, intercept, Some(&omega), min_decrease, max_iter, - ) + ); + omega2 } #[cfg(test)] mod tests { @@ -752,7 +809,7 @@ fn test_fw_weights_on_simplex() { // Simple 3x3 problem: 2 pre-periods + 1 target column let y = array![[1.0, 2.0, 1.5], [3.0, 4.0, 3.5], [5.0, 6.0, 5.5]]; - let result = sc_weight_fw_internal(&y.view(), 0.1, true, None, 1e-3, 100); + let (result, _converged) = sc_weight_fw_internal(&y.view(), 0.1, true, None, 1e-3, 100); let sum: f64 = result.sum(); assert!((sum - 1.0).abs() < 1e-6, "FW weights should sum to 1, got {}", sum); assert!(result.iter().all(|&w| w >= -1e-6), "FW weights should be non-negative"); @@ -843,7 +900,7 @@ let vals: Vec<f64> = (0..45).map(|i| ((i * 13 + 5) % 53) as f64 / 53.0).collect(); let y = Array2::from_shape_vec((5, 9), vals).unwrap(); - let result = sc_weight_fw_internal(&y.view(), 0.5, true, None, 1e-5, 10000); + let (result, _converged) = sc_weight_fw_internal(&y.view(), 0.5, true, None, 1e-5, 10000); // Verify valid simplex weights let sum: f64 = result.sum(); @@ -863,7 +920,7 @@ let y = Array2::from_shape_vec((20, 10), vals).unwrap(); // Run with enough iterations to exercise the refresh mechanism - let result = sc_weight_fw_internal(&y.view(), 0.1, true, None, 1e-8, 1000); + let (result, _converged) = 
sc_weight_fw_internal(&y.view(), 0.1, true, None, 1e-8, 1000); // Verify valid result (convergence with correct weights) let sum: f64 = result.sum(); @@ -881,7 +938,7 @@ let vals: Vec<f64> = (0..36).map(|i| ((i * 11 + 7) % 41) as f64 / 41.0).collect(); let y = Array2::from_shape_vec((6, 6), vals).unwrap(); - let result = sc_weight_fw_internal(&y.view(), 0.2, true, None, 1e-5, 10000); + let (result, _converged) = sc_weight_fw_internal(&y.view(), 0.2, true, None, 1e-5, 10000); let sum: f64 = result.sum(); assert!((sum - 1.0).abs() < 1e-6, "Weights should sum to 1, got {}", sum); @@ -939,7 +996,7 @@ // Gram path: N=15, T0=4 (T0 < N) let vals_gram: Vec<f64> = (0..75).map(|i| ((i * 3 + 1) % 37) as f64 / 37.0).collect(); let y_gram = Array2::from_shape_vec((15, 5), vals_gram).unwrap(); - let result_gram = sc_weight_fw_internal(&y_gram.view(), 0.3, false, None, 1e-5, 10000); + let (result_gram, _converged_gram) = sc_weight_fw_internal(&y_gram.view(), 0.3, false, None, 1e-5, 10000); let sum_gram: f64 = result_gram.sum(); assert!((sum_gram - 1.0).abs() < 1e-6, "Gram intercept=false: weights should sum to 1, got {}", sum_gram); assert!(result_gram.iter().all(|&w| w >= -1e-6), "Gram intercept=false: weights should be non-negative"); @@ -947,7 +1004,7 @@ // Standard path: N=4, T0=10 (T0 >= N) let vals_std: Vec<f64> = (0..44).map(|i| ((i * 5 + 2) % 29) as f64 / 29.0).collect(); let y_std = Array2::from_shape_vec((4, 11), vals_std).unwrap(); - let result_std = sc_weight_fw_internal(&y_std.view(), 0.3, false, None, 1e-5, 10000); + let (result_std, _converged_std) = sc_weight_fw_internal(&y_std.view(), 0.3, false, None, 1e-5, 10000); let sum_std: f64 = result_std.sum(); assert!((sum_std - 1.0).abs() < 1e-6, "Standard intercept=false: weights should sum to 1, got {}", sum_std); assert!(result_std.iter().all(|&w| w >= -1e-6), "Standard intercept=false: weights should be non-negative"); diff --git a/tests/data/sdid_bootstrap_indices_r.json 
b/tests/data/sdid_bootstrap_indices_r.json deleted file mode 100644 index b8449773..00000000 --- a/tests/data/sdid_bootstrap_indices_r.json +++ /dev/null @@ -1,216 +0,0 @@ -{ - "indices": [ - [8, 3, 1, 10, 11, 15, 22, 8, 4, 4, 22, 18, 13, 5, 4, 2, 18, 3, 23, 17, 21, 18, 6], - [6, 2, 20, 3, 22, 21, 2, 23, 6, 10, 8, 5, 1, 17, 7, 4, 13, 10, 9, 12, 20, 9, 11], - [3, 16, 5, 14, 5, 22, 2, 18, 17, 18, 5, 14, 8, 23, 8, 21, 4, 4, 7, 10, 18, 13, 21], - [22, 19, 16, 22, 17, 12, 9, 21, 11, 16, 8, 23, 6, 1, 13, 14, 5, 9, 23, 16, 11, 10, 10], - [1, 19, 17, 14, 14, 15, 6, 15, 11, 5, 16, 17, 22, 17, 18, 2, 6, 22, 6, 6, 20, 15, 2], - [10, 23, 12, 9, 1, 2, 2, 11, 14, 2, 2, 10, 22, 5, 10, 17, 17, 5, 3, 14, 14, 1, 5], - [3, 22, 13, 11, 20, 17, 20, 4, 20, 11, 17, 9, 17, 13, 17, 20, 1, 17, 1, 2, 8, 16, 3], - [5, 13, 12, 1, 19, 5, 15, 6, 23, 16, 10, 7, 9, 9, 3, 18, 21, 15, 15, 20, 16, 22, 1], - [6, 11, 9, 22, 22, 22, 20, 23, 22, 13, 5, 11, 15, 21, 2, 6, 1, 14, 16, 17, 11, 18, 23], - [2, 10, 4, 20, 15, 7, 17, 5, 22, 20, 18, 9, 11, 6, 10, 9, 23, 19, 9, 11, 19, 21, 4], - [10, 22, 22, 17, 21, 12, 21, 6, 15, 9, 10, 18, 23, 7, 14, 12, 21, 19, 10, 5, 2, 3, 2], - [14, 16, 4, 21, 6, 20, 21, 20, 11, 1, 14, 1, 4, 13, 6, 22, 12, 7, 12, 5, 18, 7, 13], - [22, 2, 3, 1, 15, 20, 11, 9, 22, 10, 7, 19, 18, 12, 10, 9, 19, 15, 14, 23, 3, 22, 20], - [10, 2, 23, 17, 18, 19, 14, 17, 10, 5, 13, 18, 19, 19, 10, 16, 2, 4, 7, 1, 14, 20, 22], - [19, 12, 19, 6, 12, 10, 7, 19, 2, 6, 9, 11, 6, 5, 14, 10, 15, 1, 9, 10, 17, 12, 22], - [1, 11, 4, 3, 21, 6, 5, 10, 20, 2, 23, 4, 20, 9, 14, 14, 3, 18, 10, 1, 8, 20, 21], - [5, 6, 14, 4, 9, 8, 23, 16, 12, 10, 14, 23, 11, 23, 22, 2, 16, 14, 15, 17, 6, 23, 19], - [10, 13, 6, 1, 11, 13, 10, 13, 15, 17, 15, 8, 4, 6, 10, 20, 16, 12, 16, 19, 3, 21, 7], - [3, 10, 21, 10, 7, 19, 13, 10, 23, 14, 11, 13, 3, 10, 19, 2, 7, 20, 21, 21, 11, 11, 1], - [21, 6, 23, 10, 9, 12, 13, 18, 15, 21, 1, 11, 8, 13, 14, 2, 6, 13, 3, 13, 16, 19, 16], - [13, 13, 8, 23, 13, 6, 6, 8, 5, 23, 9, 10, 11, 23, 1, 
22, 15, 4, 7, 11, 10, 8, 3], - [13, 20, 23, 16, 8, 3, 9, 16, 11, 13, 7, 18, 7, 21, 5, 20, 18, 7, 17, 3, 16, 8, 19], - [18, 10, 15, 12, 3, 20, 16, 22, 10, 2, 8, 1, 13, 7, 11, 11, 13, 5, 2, 15, 5, 14, 5], - [17, 18, 22, 17, 18, 7, 13, 2, 12, 3, 20, 8, 20, 23, 14, 1, 14, 21, 18, 2, 5, 10, 12], - [3, 18, 18, 18, 15, 1, 8, 19, 8, 20, 19, 2, 21, 20, 19, 11, 15, 18, 8, 10, 2, 22, 4], - [23, 17, 12, 23, 20, 19, 7, 9, 19, 21, 11, 11, 12, 17, 2, 20, 7, 17, 4, 13, 19, 1, 17], - [4, 2, 21, 8, 3, 13, 2, 23, 4, 9, 11, 10, 14, 1, 1, 14, 20, 21, 11, 9, 19, 5, 21], - [2, 9, 3, 12, 2, 16, 18, 22, 17, 20, 8, 8, 19, 15, 6, 3, 1, 15, 4, 9, 1, 11, 6], - [20, 14, 9, 6, 12, 22, 17, 11, 9, 18, 11, 19, 1, 11, 17, 17, 8, 11, 18, 1, 23, 6, 2], - [14, 5, 1, 23, 18, 4, 15, 7, 14, 10, 3, 7, 5, 6, 5, 21, 18, 9, 3, 3, 9, 13, 7], - [10, 17, 2, 20, 17, 12, 6, 22, 19, 10, 16, 1, 10, 7, 2, 4, 20, 8, 15, 14, 17, 5, 17], - [17, 16, 20, 6, 3, 14, 9, 12, 23, 9, 14, 3, 4, 3, 7, 1, 5, 22, 4, 8, 12, 12, 12], - [15, 9, 3, 3, 9, 7, 23, 12, 13, 17, 15, 15, 20, 3, 13, 19, 2, 22, 5, 9, 18, 1, 4], - [18, 5, 3, 10, 7, 14, 10, 23, 10, 21, 9, 7, 6, 20, 8, 5, 19, 20, 11, 17, 7, 10, 12], - [22, 4, 20, 3, 1, 16, 5, 19, 17, 11, 1, 1, 18, 3, 5, 7, 6, 16, 13, 1, 19, 18, 10], - [14, 10, 23, 2, 16, 9, 17, 2, 15, 23, 8, 3, 2, 17, 22, 3, 11, 5, 22, 1, 3, 14, 6], - [16, 10, 8, 23, 22, 9, 3, 2, 6, 2, 16, 9, 5, 17, 12, 20, 4, 6, 2, 3, 18, 17, 18], - [11, 8, 7, 21, 20, 22, 23, 1, 19, 16, 8, 20, 18, 21, 10, 10, 19, 22, 13, 23, 12, 8, 21], - [6, 22, 7, 7, 5, 1, 19, 11, 2, 18, 23, 10, 6, 21, 4, 12, 21, 6, 17, 14, 1, 11, 20], - [19, 5, 3, 13, 19, 21, 18, 20, 14, 2, 1, 22, 22, 14, 7, 15, 23, 6, 12, 20, 11, 10, 17], - [10, 19, 10, 9, 22, 8, 19, 7, 23, 16, 8, 23, 21, 16, 21, 11, 10, 20, 18, 16, 8, 21, 5], - [23, 18, 15, 2, 23, 1, 10, 22, 9, 1, 17, 13, 6, 12, 19, 1, 11, 16, 14, 19, 21, 23, 5], - [8, 21, 19, 16, 12, 15, 7, 1, 15, 8, 6, 11, 2, 1, 20, 5, 5, 11, 17, 17, 17, 23, 19], - [13, 18, 10, 3, 17, 21, 10, 1, 11, 13, 14, 13, 7, 8, 4, 
14, 5, 20, 19, 23, 14, 19, 15], - [16, 3, 2, 6, 7, 10, 13, 10, 10, 3, 20, 17, 18, 17, 10, 10, 8, 23, 13, 8, 20, 5, 17], - [9, 8, 19, 13, 1, 12, 1, 6, 19, 13, 19, 5, 23, 22, 13, 14, 14, 22, 20, 12, 20, 8, 11], - [14, 2, 14, 10, 2, 21, 10, 3, 2, 10, 15, 19, 12, 18, 22, 18, 11, 9, 8, 9, 8, 3, 8], - [10, 10, 2, 11, 21, 10, 16, 20, 2, 10, 22, 14, 19, 4, 16, 15, 23, 11, 19, 10, 2, 5, 19], - [23, 11, 22, 18, 20, 11, 10, 5, 13, 21, 10, 17, 23, 3, 14, 10, 2, 22, 5, 12, 15, 6, 1], - [23, 13, 9, 14, 12, 18, 7, 16, 12, 12, 6, 19, 18, 2, 8, 21, 1, 15, 18, 14, 1, 6, 19], - [4, 21, 1, 1, 1, 15, 18, 21, 11, 1, 10, 20, 9, 5, 6, 11, 5, 19, 6, 8, 16, 17, 12], - [10, 8, 8, 21, 7, 14, 15, 15, 11, 10, 19, 18, 15, 15, 19, 13, 20, 23, 23, 3, 10, 16, 21], - [13, 18, 1, 9, 11, 5, 14, 3, 5, 11, 14, 16, 5, 8, 6, 21, 4, 5, 12, 18, 7, 13, 10], - [7, 3, 18, 21, 11, 5, 16, 13, 4, 19, 15, 8, 3, 23, 9, 23, 13, 5, 12, 4, 10, 3, 7], - [6, 17, 20, 10, 20, 12, 20, 3, 18, 3, 15, 12, 15, 3, 20, 23, 6, 10, 6, 11, 1, 10, 3], - [13, 5, 10, 16, 14, 1, 1, 11, 14, 1, 20, 3, 23, 3, 23, 17, 19, 19, 23, 22, 14, 1, 6], - [6, 14, 17, 20, 20, 8, 9, 4, 7, 4, 15, 7, 15, 16, 13, 17, 15, 1, 23, 19, 8, 16, 14], - [11, 10, 15, 18, 2, 8, 16, 23, 23, 5, 14, 7, 5, 12, 6, 21, 17, 21, 16, 23, 5, 6, 5], - [1, 22, 20, 21, 18, 21, 19, 1, 6, 11, 12, 3, 15, 14, 1, 17, 3, 1, 13, 2, 13, 11, 6], - [23, 6, 22, 7, 12, 22, 1, 19, 13, 3, 8, 22, 23, 15, 5, 11, 13, 15, 14, 19, 3, 1, 13], - [9, 17, 16, 17, 7, 12, 11, 12, 11, 5, 13, 21, 21, 21, 5, 19, 10, 15, 2, 13, 12, 4, 7], - [13, 19, 13, 16, 10, 1, 16, 9, 3, 1, 12, 7, 15, 9, 2, 14, 21, 12, 18, 12, 22, 11, 9], - [3, 9, 1, 14, 19, 12, 22, 14, 12, 20, 16, 23, 22, 4, 13, 5, 4, 20, 1, 13, 9, 9, 22], - [13, 13, 14, 9, 2, 9, 15, 12, 18, 12, 15, 16, 23, 12, 12, 8, 18, 11, 18, 13, 15, 7, 4], - [21, 21, 17, 4, 7, 23, 5, 18, 13, 12, 20, 14, 23, 6, 2, 23, 7, 14, 5, 8, 22, 18, 3], - [12, 19, 13, 16, 5, 5, 19, 8, 12, 2, 1, 19, 12, 15, 23, 1, 13, 2, 3, 1, 6, 9, 19], - [3, 1, 5, 12, 6, 6, 14, 20, 13, 19, 
23, 2, 17, 15, 3, 3, 17, 4, 1, 8, 18, 17, 22], - [21, 12, 2, 18, 23, 20, 7, 6, 2, 3, 11, 6, 9, 3, 20, 23, 23, 2, 10, 12, 20, 12, 10], - [20, 4, 17, 15, 23, 12, 20, 23, 12, 2, 21, 6, 3, 12, 19, 17, 16, 14, 10, 8, 6, 15, 1], - [7, 13, 5, 17, 17, 19, 6, 6, 19, 10, 15, 17, 12, 2, 22, 2, 2, 4, 8, 5, 16, 19, 2], - [5, 19, 10, 5, 13, 3, 16, 12, 4, 4, 4, 17, 1, 18, 4, 8, 10, 10, 2, 21, 19, 3, 20], - [20, 3, 19, 18, 16, 13, 14, 15, 10, 5, 2, 14, 7, 16, 4, 23, 18, 9, 13, 1, 18, 12, 17], - [4, 12, 7, 2, 12, 7, 14, 17, 13, 8, 17, 2, 5, 4, 8, 7, 8, 15, 11, 22, 20, 11, 21], - [18, 12, 23, 19, 3, 12, 9, 7, 2, 10, 6, 20, 12, 5, 20, 17, 20, 6, 2, 3, 12, 16, 11], - [20, 9, 18, 7, 13, 10, 15, 12, 9, 14, 2, 1, 14, 22, 13, 15, 3, 19, 13, 17, 6, 6, 8], - [19, 12, 9, 23, 14, 6, 14, 2, 19, 23, 9, 8, 13, 11, 14, 6, 1, 19, 21, 8, 13, 11, 7], - [12, 9, 15, 9, 21, 11, 15, 11, 22, 21, 12, 14, 21, 9, 11, 1, 8, 11, 12, 11, 2, 3, 15], - [3, 9, 9, 20, 2, 6, 5, 17, 23, 23, 20, 9, 11, 23, 3, 21, 9, 9, 11, 2, 23, 8, 20], - [3, 17, 15, 16, 1, 3, 13, 12, 15, 21, 6, 17, 16, 3, 7, 21, 2, 7, 14, 13, 8, 18, 7], - [12, 23, 20, 10, 19, 20, 13, 15, 18, 15, 14, 15, 10, 10, 20, 21, 6, 6, 9, 6, 7, 4, 21], - [1, 3, 10, 2, 4, 21, 8, 2, 12, 12, 22, 16, 12, 15, 10, 19, 11, 7, 3, 22, 8, 7, 10], - [19, 19, 3, 23, 1, 20, 18, 3, 2, 16, 7, 1, 3, 4, 20, 5, 4, 6, 14, 14, 7, 2, 16], - [4, 17, 14, 13, 19, 5, 7, 11, 9, 21, 11, 12, 22, 4, 2, 22, 20, 16, 17, 18, 2, 3, 21], - [9, 14, 14, 2, 20, 1, 23, 12, 13, 5, 1, 8, 8, 23, 11, 3, 14, 21, 1, 23, 9, 7, 1], - [23, 23, 3, 1, 14, 1, 6, 12, 13, 5, 4, 12, 4, 23, 13, 11, 14, 22, 18, 10, 2, 21, 8], - [7, 17, 14, 3, 2, 22, 23, 3, 3, 18, 5, 22, 19, 18, 22, 10, 16, 17, 14, 21, 20, 2, 3], - [22, 22, 15, 14, 10, 3, 19, 6, 8, 1, 4, 1, 13, 2, 7, 15, 20, 5, 23, 5, 20, 1, 12], - [19, 21, 15, 11, 6, 14, 20, 10, 23, 20, 18, 9, 5, 18, 22, 3, 1, 22, 1, 9, 20, 7, 11], - [5, 21, 23, 12, 18, 20, 17, 18, 1, 23, 3, 8, 12, 10, 12, 22, 20, 15, 18, 7, 7, 2, 12], - [19, 19, 12, 23, 22, 14, 15, 12, 12, 9, 2, 
21, 22, 23, 8, 17, 1, 15, 6, 5, 5, 13, 23], - [5, 14, 15, 5, 9, 19, 18, 20, 7, 11, 19, 12, 1, 9, 21, 17, 19, 18, 4, 23, 12, 19, 9], - [12, 10, 18, 23, 21, 7, 10, 6, 22, 3, 14, 9, 17, 17, 12, 4, 8, 9, 7, 5, 7, 13, 18], - [11, 16, 3, 1, 22, 14, 15, 9, 4, 7, 3, 15, 21, 7, 20, 4, 12, 9, 4, 16, 4, 7, 8], - [12, 5, 21, 14, 2, 10, 13, 21, 7, 20, 13, 9, 13, 21, 15, 6, 2, 14, 17, 20, 18, 15, 11], - [21, 11, 15, 6, 12, 22, 10, 3, 14, 17, 11, 9, 20, 21, 13, 14, 5, 6, 21, 8, 21, 9, 10], - [8, 1, 2, 13, 9, 23, 8, 23, 10, 13, 1, 13, 19, 10, 8, 16, 17, 4, 10, 9, 13, 19, 8], - [8, 18, 14, 16, 16, 16, 13, 19, 4, 21, 4, 10, 5, 12, 4, 11, 13, 21, 22, 2, 17, 14, 6], - [11, 1, 23, 10, 21, 12, 13, 23, 17, 23, 8, 21, 8, 17, 17, 20, 2, 23, 21, 23, 7, 13, 11], - [22, 16, 3, 7, 16, 11, 15, 20, 1, 18, 12, 5, 8, 20, 10, 18, 17, 17, 1, 14, 9, 19, 17], - [2, 12, 3, 16, 20, 12, 20, 19, 9, 2, 6, 18, 6, 13, 16, 19, 11, 22, 22, 8, 23, 20, 19], - [3, 23, 12, 21, 18, 21, 8, 2, 20, 23, 15, 16, 12, 13, 15, 10, 4, 5, 15, 10, 22, 23, 15], - [11, 13, 21, 22, 21, 9, 14, 8, 17, 7, 18, 20, 7, 3, 12, 6, 14, 4, 22, 21, 15, 13, 7], - [20, 7, 2, 18, 11, 20, 11, 10, 16, 7, 15, 23, 4, 12, 5, 13, 5, 20, 20, 14, 1, 10, 5], - [14, 3, 11, 11, 5, 4, 15, 16, 9, 4, 3, 1, 4, 21, 11, 19, 4, 18, 23, 3, 13, 4, 18], - [17, 23, 21, 12, 15, 19, 18, 3, 9, 20, 20, 13, 1, 1, 7, 4, 19, 22, 12, 12, 18, 9, 4], - [13, 14, 13, 14, 17, 7, 7, 7, 18, 12, 22, 2, 5, 11, 13, 4, 12, 22, 17, 10, 23, 16, 9], - [11, 11, 12, 1, 9, 4, 22, 2, 11, 8, 17, 21, 8, 15, 4, 19, 20, 17, 7, 15, 12, 5, 3], - [16, 21, 21, 20, 20, 14, 15, 11, 21, 11, 5, 11, 18, 19, 14, 7, 13, 4, 6, 16, 5, 9, 6], - [12, 17, 10, 3, 6, 2, 22, 3, 16, 15, 15, 12, 17, 2, 7, 8, 16, 10, 3, 12, 11, 5, 20], - [3, 9, 18, 12, 5, 14, 11, 3, 22, 14, 16, 19, 1, 7, 23, 20, 2, 21, 6, 5, 22, 18, 7], - [22, 14, 22, 4, 13, 21, 1, 20, 8, 19, 20, 7, 10, 9, 7, 5, 15, 9, 3, 2, 17, 21, 6], - [19, 13, 19, 16, 11, 13, 9, 5, 3, 14, 23, 11, 7, 17, 3, 18, 10, 14, 14, 14, 9, 5, 17], - [1, 15, 16, 18, 16, 7, 
21, 16, 16, 8, 9, 4, 23, 17, 10, 4, 11, 3, 17, 6, 11, 11, 13], - [5, 8, 2, 5, 20, 2, 17, 16, 4, 23, 17, 21, 20, 10, 1, 17, 11, 16, 23, 14, 16, 10, 3], - [15, 7, 6, 12, 21, 20, 10, 19, 3, 10, 23, 23, 9, 1, 14, 13, 13, 19, 19, 16, 21, 22, 3], - [16, 7, 2, 7, 7, 22, 7, 20, 10, 18, 4, 11, 14, 22, 18, 15, 10, 2, 23, 1, 11, 4, 2], - [2, 23, 14, 20, 9, 15, 8, 1, 19, 9, 10, 7, 19, 22, 18, 3, 15, 4, 16, 13, 14, 20, 17], - [18, 11, 12, 2, 4, 19, 17, 4, 4, 10, 22, 14, 9, 23, 10, 11, 5, 22, 9, 1, 8, 4, 21], - [11, 5, 3, 20, 5, 8, 15, 14, 18, 13, 3, 16, 3, 20, 9, 11, 10, 8, 20, 21, 14, 8, 3], - [16, 9, 9, 16, 1, 14, 3, 7, 8, 5, 22, 4, 9, 17, 17, 5, 11, 16, 11, 17, 13, 8, 13], - [10, 14, 5, 22, 21, 21, 16, 14, 12, 9, 16, 2, 13, 20, 13, 19, 20, 22, 5, 12, 4, 1, 21], - [17, 3, 18, 4, 5, 16, 9, 19, 9, 2, 2, 20, 8, 12, 10, 4, 3, 22, 8, 14, 20, 16, 2], - [23, 15, 4, 15, 18, 15, 8, 5, 17, 12, 1, 7, 6, 20, 23, 14, 16, 21, 21, 21, 6, 14, 19], - [1, 16, 14, 17, 1, 10, 15, 20, 11, 4, 20, 10, 8, 2, 6, 1, 3, 13, 22, 1, 15, 21, 13], - [15, 20, 14, 2, 18, 4, 14, 13, 14, 16, 21, 20, 8, 13, 4, 5, 12, 17, 23, 21, 23, 4, 14], - [16, 19, 5, 4, 1, 11, 7, 16, 9, 12, 17, 21, 1, 22, 13, 6, 2, 5, 15, 4, 8, 19, 4], - [9, 7, 1, 23, 2, 13, 15, 11, 22, 18, 10, 10, 4, 5, 23, 18, 13, 20, 18, 19, 19, 20, 3], - [20, 10, 23, 14, 14, 13, 1, 11, 3, 14, 10, 11, 5, 20, 19, 17, 7, 8, 8, 16, 18, 1, 23], - [6, 22, 7, 8, 20, 16, 22, 20, 11, 16, 1, 3, 7, 16, 4, 22, 14, 11, 4, 9, 4, 13, 17], - [5, 19, 14, 20, 3, 2, 23, 16, 9, 23, 7, 20, 22, 5, 23, 22, 12, 2, 12, 18, 7, 20, 6], - [22, 20, 18, 17, 11, 16, 17, 6, 1, 11, 12, 7, 22, 3, 20, 3, 6, 16, 19, 18, 10, 2, 12], - [19, 20, 18, 11, 19, 20, 17, 5, 6, 8, 13, 22, 18, 6, 8, 4, 1, 16, 13, 5, 12, 23, 15], - [18, 18, 12, 18, 1, 19, 8, 8, 22, 10, 20, 5, 22, 14, 11, 21, 6, 21, 22, 14, 19, 22, 14], - [21, 22, 7, 4, 20, 5, 18, 22, 5, 3, 10, 17, 20, 10, 23, 19, 21, 4, 7, 21, 10, 23, 8], - [15, 4, 23, 23, 13, 14, 6, 15, 8, 8, 5, 12, 6, 13, 9, 4, 4, 10, 8, 23, 9, 18, 13], - [23, 15, 
21, 19, 13, 6, 20, 19, 15, 16, 19, 8, 18, 22, 13, 20, 21, 12, 18, 18, 6, 7, 1], - [7, 18, 23, 5, 21, 5, 14, 15, 2, 8, 19, 22, 7, 11, 8, 15, 8, 9, 16, 17, 20, 5, 6], - [17, 14, 16, 1, 21, 9, 21, 7, 5, 18, 14, 23, 15, 1, 22, 4, 4, 20, 11, 14, 9, 4, 11], - [1, 22, 14, 16, 7, 7, 9, 5, 12, 21, 3, 2, 12, 10, 14, 18, 12, 14, 6, 19, 4, 15, 15], - [16, 23, 4, 10, 23, 23, 17, 13, 2, 12, 20, 23, 3, 9, 5, 10, 23, 22, 10, 12, 7, 7, 2], - [16, 1, 20, 19, 9, 10, 8, 15, 11, 17, 3, 6, 20, 5, 19, 12, 12, 18, 23, 4, 2, 14, 1], - [9, 18, 11, 17, 1, 11, 15, 2, 12, 14, 15, 1, 3, 22, 18, 5, 5, 8, 10, 10, 4, 2, 4], - [2, 9, 2, 16, 23, 19, 7, 8, 5, 18, 15, 1, 20, 12, 4, 8, 5, 13, 11, 21, 6, 11, 16], - [7, 6, 1, 20, 4, 16, 2, 15, 8, 14, 11, 23, 6, 15, 18, 18, 8, 11, 14, 4, 3, 23, 10], - [12, 21, 9, 1, 2, 3, 15, 7, 5, 21, 16, 8, 12, 12, 13, 16, 12, 6, 10, 13, 10, 19, 3], - [22, 13, 16, 7, 18, 4, 14, 8, 4, 4, 22, 5, 21, 12, 15, 23, 18, 4, 12, 22, 4, 7, 7], - [12, 5, 12, 4, 14, 14, 14, 12, 2, 6, 9, 22, 9, 4, 4, 5, 19, 4, 17, 9, 16, 1, 5], - [21, 10, 23, 23, 5, 19, 5, 18, 18, 19, 13, 16, 21, 8, 19, 15, 7, 16, 12, 9, 12, 18, 15], - [5, 20, 10, 3, 13, 18, 21, 16, 12, 2, 16, 12, 2, 5, 6, 15, 22, 12, 3, 9, 12, 21, 12], - [9, 9, 23, 22, 2, 21, 12, 2, 18, 23, 16, 14, 11, 23, 5, 2, 19, 5, 8, 3, 21, 17, 22], - [21, 2, 11, 18, 12, 23, 1, 16, 3, 19, 11, 22, 14, 6, 1, 5, 20, 14, 19, 8, 7, 12, 12], - [8, 6, 7, 13, 10, 13, 18, 3, 1, 22, 5, 13, 13, 3, 6, 4, 10, 16, 18, 12, 9, 8, 7], - [19, 12, 18, 17, 17, 4, 17, 1, 10, 1, 2, 22, 11, 3, 17, 10, 20, 16, 20, 10, 8, 11, 23], - [1, 21, 20, 16, 17, 15, 12, 21, 8, 6, 18, 5, 20, 12, 7, 14, 6, 20, 16, 21, 11, 3, 9], - [12, 23, 3, 4, 2, 3, 2, 14, 18, 19, 8, 13, 17, 16, 1, 7, 17, 5, 8, 1, 7, 2, 6], - [3, 5, 17, 17, 14, 21, 23, 12, 19, 9, 17, 22, 2, 5, 4, 22, 19, 20, 18, 2, 8, 10, 16], - [17, 21, 18, 11, 17, 19, 6, 16, 21, 4, 23, 4, 16, 7, 11, 14, 7, 11, 15, 6, 4, 15, 1], - [18, 2, 14, 11, 9, 21, 2, 6, 4, 12, 6, 22, 18, 1, 20, 22, 17, 19, 4, 9, 11, 13, 11], - [21, 6, 5, 
19, 8, 5, 6, 6, 4, 16, 18, 19, 17, 15, 11, 2, 12, 2, 4, 10, 22, 15, 5], - [13, 15, 16, 8, 14, 22, 23, 15, 23, 6, 2, 1, 12, 5, 1, 19, 7, 6, 3, 6, 20, 14, 5], - [11, 2, 1, 9, 17, 16, 8, 8, 1, 13, 14, 23, 5, 8, 23, 8, 11, 1, 3, 22, 12, 11, 15], - [3, 9, 14, 1, 2, 22, 1, 8, 11, 2, 15, 18, 23, 6, 15, 5, 23, 1, 10, 9, 23, 1, 4], - [2, 3, 18, 17, 9, 14, 15, 14, 4, 15, 14, 11, 14, 18, 2, 15, 14, 17, 4, 18, 8, 4, 23], - [8, 15, 17, 22, 9, 7, 18, 5, 17, 6, 22, 10, 9, 16, 4, 6, 8, 13, 7, 10, 12, 13, 2], - [1, 23, 10, 11, 13, 15, 7, 16, 18, 15, 21, 1, 21, 11, 7, 7, 12, 4, 23, 1, 3, 2, 23], - [7, 15, 5, 1, 11, 22, 20, 16, 8, 15, 23, 7, 14, 18, 19, 19, 16, 17, 7, 5, 7, 20, 20], - [4, 7, 11, 5, 15, 10, 6, 20, 22, 17, 13, 18, 11, 17, 13, 2, 22, 9, 1, 8, 10, 11, 6], - [1, 17, 16, 4, 5, 4, 9, 1, 17, 10, 9, 17, 1, 1, 17, 19, 15, 5, 16, 15, 21, 2, 22], - [12, 2, 7, 11, 23, 5, 19, 21, 13, 17, 14, 20, 4, 12, 6, 15, 15, 10, 10, 15, 8, 1, 10], - [17, 2, 12, 19, 12, 16, 22, 21, 8, 9, 2, 15, 18, 21, 2, 22, 17, 2, 1, 11, 3, 13, 15], - [20, 6, 2, 4, 2, 18, 20, 5, 3, 16, 22, 1, 1, 8, 21, 14, 2, 18, 4, 4, 7, 6, 14], - [15, 6, 16, 5, 23, 6, 7, 20, 19, 21, 17, 7, 11, 16, 18, 6, 8, 16, 16, 23, 20, 18, 19], - [23, 19, 21, 13, 7, 17, 13, 20, 10, 8, 8, 5, 13, 18, 3, 1, 2, 23, 6, 1, 20, 9, 22], - [7, 7, 15, 7, 21, 5, 5, 10, 8, 21, 22, 17, 2, 22, 10, 23, 8, 10, 17, 2, 12, 3, 8], - [22, 8, 7, 21, 12, 22, 23, 6, 22, 7, 3, 7, 8, 16, 12, 5, 7, 16, 19, 16, 6, 21, 7], - [21, 3, 20, 20, 22, 21, 16, 17, 9, 20, 9, 4, 13, 6, 3, 16, 14, 5, 3, 10, 23, 3, 12], - [6, 12, 18, 12, 18, 14, 21, 9, 2, 17, 12, 14, 2, 3, 20, 18, 7, 20, 16, 1, 10, 10, 11], - [7, 11, 7, 9, 1, 9, 3, 13, 8, 2, 3, 13, 22, 2, 11, 16, 7, 17, 6, 3, 1, 6, 9], - [4, 16, 10, 7, 13, 6, 13, 2, 5, 5, 13, 16, 16, 3, 10, 22, 18, 4, 19, 4, 9, 11, 20], - [15, 13, 11, 11, 21, 9, 2, 17, 16, 21, 17, 16, 11, 10, 6, 10, 18, 15, 22, 9, 18, 22, 23], - [7, 19, 6, 15, 13, 15, 23, 11, 13, 9, 8, 2, 23, 11, 11, 18, 1, 8, 17, 18, 21, 3, 9], - [19, 20, 13, 16, 21, 8, 3, 
3, 15, 9, 12, 17, 5, 11, 14, 12, 20, 15, 13, 21, 10, 15, 20], - [10, 20, 22, 9, 11, 19, 11, 3, 7, 14, 7, 23, 13, 5, 9, 11, 13, 5, 12, 12, 16, 15, 6], - [14, 17, 23, 20, 5, 4, 16, 4, 21, 11, 3, 13, 16, 11, 14, 17, 11, 20, 15, 9, 14, 14, 4], - [1, 5, 22, 12, 1, 18, 23, 19, 12, 3, 21, 1, 4, 10, 12, 15, 21, 3, 22, 5, 1, 8, 10], - [7, 21, 10, 14, 21, 17, 1, 10, 2, 16, 13, 13, 13, 19, 10, 1, 17, 23, 11, 17, 8, 16, 16], - [1, 3, 20, 1, 9, 16, 1, 2, 23, 4, 22, 10, 17, 18, 22, 14, 16, 15, 12, 11, 18, 10, 11], - [6, 1, 6, 19, 21, 5, 1, 10, 16, 20, 9, 6, 6, 13, 5, 11, 18, 15, 20, 15, 7, 17, 2], - [1, 8, 1, 12, 16, 17, 7, 6, 22, 4, 19, 1, 5, 5, 13, 13, 9, 8, 8, 14, 23, 9, 2], - [16, 17, 21, 15, 1, 22, 8, 2, 23, 11, 5, 10, 8, 5, 8, 18, 21, 6, 19, 12, 17, 4, 7], - [16, 22, 8, 17, 1, 22, 19, 5, 5, 2, 4, 10, 11, 13, 21, 23, 20, 11, 10, 21, 19, 13, 5], - [8, 15, 3, 19, 21, 5, 5, 5, 20, 19, 16, 2, 14, 20, 14, 9, 12, 23, 14, 10, 12, 9, 9], - [1, 3, 17, 4, 20, 11, 16, 16, 18, 22, 11, 1, 20, 4, 22, 5, 14, 14, 22, 2, 10, 2, 22], - [8, 21, 1, 5, 20, 18, 15, 21, 11, 6, 11, 11, 4, 10, 11, 3, 11, 4, 13, 4, 18, 12, 15], - [17, 20, 21, 10, 18, 15, 20, 13, 21, 2, 20, 2, 11, 14, 3, 18, 20, 20, 16, 7, 5, 21, 16], - [21, 11, 20, 12, 16, 6, 14, 19, 3, 11, 22, 21, 4, 22, 7, 5, 15, 5, 8, 6, 17, 9, 15], - [17, 23, 21, 5, 15, 11, 14, 10, 18, 8, 2, 20, 5, 19, 10, 23, 10, 1, 16, 13, 9, 12, 1], - [4, 22, 18, 6, 23, 2, 15, 15, 5, 17, 8, 16, 8, 9, 22, 6, 23, 3, 15, 16, 2, 21, 16], - [11, 2, 20, 7, 15, 7, 5, 5, 13, 1, 11, 7, 6, 8, 13, 16, 8, 8, 22, 5, 10, 11, 12], - [16, 5, 19, 20, 16, 5, 12, 17, 23, 20, 20, 14, 13, 21, 7, 5, 13, 22, 16, 4, 19, 19, 20] - ], - "seed": 42, - "n_bootstrap": 200, - "se": 0.498347838909041, - "att": 4.98084886006093, - "metadata": { - "r_version": "R version 4.5.2 (2025-10-31)", - "synthdid_version": "0.0.9", - "panel_N": 23, - "panel_T": 8, - "N0": 20, - "T0": 5 - } -} diff --git a/tests/test_business_report.py b/tests/test_business_report.py index 66895513..276c75c6 100644 --- 
a/tests/test_business_report.py +++ b/tests/test_business_report.py @@ -4653,3 +4653,50 @@ class MultiPeriodDiDResults: "Explicit precomputed['sensitivity'] must win over " "honest_did_results shorthand. Got sens block: " + repr(sens) ) + + +class TestSyntheticDiDBootstrapInferenceLabel: + """Cross-surface guard: ``variance_method='bootstrap'`` must be + recognized by the BR inference-label allow-list at + ``business_report.py:602``. + + Without the allow-list, BR's alpha-override fallback would emit + ``"analytical (native degrees of freedom)"`` as the inference label + for SDID bootstrap fits — a user-visible regression that silently + mis-describes the fit's variance method. + """ + + def test_bootstrap_fit_uses_bootstrap_label_on_alpha_override(self): + import warnings as _warnings + + from diff_diff import SyntheticDiD, generate_factor_data + from diff_diff.business_report import BusinessReport + + fdf = generate_factor_data(n_units=25, n_pre=8, n_post=4, n_treated=4, seed=11) + with _warnings.catch_warnings(): + _warnings.simplefilter("ignore", UserWarning) + fit = SyntheticDiD(variance_method="bootstrap", n_bootstrap=50, seed=1).fit( + fdf, + outcome="outcome", + unit="unit", + time="period", + treatment="treat", + ) + assert fit.variance_method == "bootstrap" + + # Alpha mismatch (fit at 0.05, BR requesting 0.10) activates the + # inference-label branch at business_report.py:593-638. 
+ br = BusinessReport(fit, alpha=0.10, auto_diagnostics=False) + caveats = br.caveats() + override_caveats = [c for c in caveats if c.get("topic") == "alpha_override_preserved"] + assert ( + len(override_caveats) == 1 + ), f"expected one alpha_override_preserved caveat, got {caveats}" + message = override_caveats[0].get("message", "") + assert "bootstrap variance" in message, ( + f"expected 'bootstrap variance' in caveat message, " f"got: {message!r}" + ) + assert "analytical (native degrees of freedom)" not in message, ( + f"allow-list regression: bootstrap fits must not fall through to the " + f"analytical label. caveat message: {message!r}" + ) diff --git a/tests/test_methodology_sdid.py b/tests/test_methodology_sdid.py index ee003e44..064466a6 100644 --- a/tests/test_methodology_sdid.py +++ b/tests/test_methodology_sdid.py @@ -498,7 +498,13 @@ def test_placebo_se_formula(self): class TestBootstrapSE: - """Verify bootstrap SE with fixed weights.""" + """Verify the paper-faithful refit pairs bootstrap. + + ``variance_method="bootstrap"`` re-estimates ω̂_b and λ̂_b via two-pass + sparsified Frank-Wolfe on each bootstrap draw (Arkhangelsky et al. 2021 + Algorithm 2 step 2, and R's default ``synthdid::vcov(method="bootstrap")``). + Survey designs are rejected upstream in ``fit()``. + """ def test_bootstrap_se_positive(self, ci_params): """Bootstrap SE should be positive.""" @@ -515,6 +521,7 @@ def test_bootstrap_se_positive(self, ci_params): assert results.se > 0 assert results.variance_method == "bootstrap" + assert results.n_bootstrap == n_boot def test_bootstrap_with_zeta_overrides(self, ci_params): """Bootstrap SE should work with user-specified zeta overrides.""" @@ -535,6 +542,178 @@ def test_bootstrap_with_zeta_overrides(self, ci_params): assert results.variance_method == "bootstrap" assert results.se > 0 + def test_bootstrap_se_tracks_placebo_se_exchangeable(self, ci_params): + """Bootstrap SE tracks placebo SE under control-pool exchangeability. 
+ + Placebo (Algorithm 4) already re-estimates ω and λ per permutation, + so under exchangeability of the control pool it should produce a + similar variance to the paper-faithful refit bootstrap. Divergence + flags either a refit implementation bug or a genuine exchangeability + violation in the DGP. + + Skipped under pure-Python mode: ``ci_params.bootstrap(min_n=...)`` + caps ``min_n`` at 49 to keep pure-Python CI fast (see + ``tests/conftest.py:210``), but the 0.40 tolerance is calibrated + for B∈[100, 200] — at B=49 MC noise on the bootstrap SE can push + rel-diff beyond 0.40 without any correctness issue (B=100/200 + runs converge to rel-diff ≈ 0.27 on the same seed). The + Rust-backed matrix jobs (macOS/Linux x86/Linux ARM/Windows × 3 + Python versions) exercise the regression guard at the designed + B=200, so the contract is still covered for the default user + install path. + """ + from diff_diff import utils as dd_utils + + if not dd_utils.HAS_RUST_BACKEND: + pytest.skip( + "Pure-Python mode caps ci_params.bootstrap min_n at 49, " + "but the 0.40 tolerance requires B≥100. Rust-backend CI " + "jobs exercise this regression guard at B=200." + ) + df = _make_panel(n_control=20, n_treated=3, seed=42) + n_boot = ci_params.bootstrap(200, min_n=100) + with warnings.catch_warnings(): + warnings.simplefilter("ignore", UserWarning) + r_placebo = SyntheticDiD( + variance_method="placebo", n_bootstrap=n_boot, seed=1 + ).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", post_periods=[5, 6, 7], + ) + r_boot = SyntheticDiD( + variance_method="bootstrap", n_bootstrap=n_boot, seed=1 + ).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", post_periods=[5, 6, 7], + ) + rel_diff = abs(r_boot.se - r_placebo.se) / r_placebo.se + # Tolerance chosen for B=100–200 with MC noise floor ~7–10% on the + # SE ratio; 0.40 leaves headroom without hiding order-of-magnitude + # regressions. 
+ assert rel_diff < 0.40, ( + f"bootstrap SE {r_boot.se:.6f} does not track placebo SE " + f"{r_placebo.se:.6f} on exchangeable DGP (rel diff {rel_diff:.4f})" + ) + + def test_bootstrap_raises_on_pweight_survey(self): + """Survey + bootstrap raises NotImplementedError (pweight-only path). + + Composing Rao-Wu rescaled weights with paper-faithful Frank-Wolfe + re-estimation requires a separate derivation that is not yet implemented; + the guard lives upstream in ``fit()`` before the bootstrap dispatcher. + """ + from diff_diff.survey import SurveyDesign + df = _make_panel(n_control=20, n_treated=3, seed=42) + df["wt"] = 1.0 + with pytest.raises(NotImplementedError, match="bootstrap"): + SyntheticDiD( + variance_method="bootstrap", n_bootstrap=50, seed=1 + ).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt"), + ) + + def test_bootstrap_raises_on_full_design_survey(self): + """Survey + bootstrap with strata/PSU raises NotImplementedError. + + No SDID variance method currently supports strata/PSU/FPC; the guard + on line ~300 of ``fit()`` rejects the combination unconditionally. + """ + from diff_diff.survey import SurveyDesign + df = _make_panel(n_control=20, n_treated=3, seed=42) + df["wt"] = 1.0 + df["stratum"] = df["unit"] % 2 + with pytest.raises(NotImplementedError, match="strata"): + SyntheticDiD( + variance_method="bootstrap", n_bootstrap=50, seed=1 + ).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + survey_design=SurveyDesign(weights="wt", strata="stratum"), + ) + + def test_bootstrap_summary_shows_replications(self, ci_params): + """result.summary() shows "Bootstrap replications" line for bootstrap. + + Cross-surface guard: the result-class gating at ``results.py:960`` + must keep ``"bootstrap"`` in its allow-list so the replications row + renders for bootstrap fits. 
+ """ + df = _make_panel(n_control=20, n_treated=3, seed=42) + n_boot = ci_params.bootstrap(50) + with warnings.catch_warnings(): + warnings.simplefilter("ignore", UserWarning) + r = SyntheticDiD( + variance_method="bootstrap", n_bootstrap=n_boot, seed=1 + ).fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=[5, 6, 7], + ) + summary = r.summary() + assert "Bootstrap replications" in summary + assert str(n_boot) in summary + + def test_bootstrap_fw_nonconvergence_warning_fires_under_rust(self, monkeypatch): + """Aggregate FW non-convergence warning surfaces on the Rust backend. + + Regression against the silent-failure mode previously masked by the + ``warnings.catch_warnings`` tally: the Rust FW solver is silent on + ``max_iter`` exhaustion, so a purely warnings-based per-draw count + read zero even when the Rust solver did not converge. After wiring + ``return_convergence=True`` through + ``compute_sdid_unit_weights`` / ``compute_time_weights`` via the new + ``sc_weight_fw_with_convergence`` Rust entry point, the helpers + thread an explicit bool per pass and the aggregate warning above 5% + of valid draws fires. + + We force every Rust FW call to report ``converged=False`` by monkey- + patching the helper; under the fixed wiring, the aggregate warning + must fire. Prior to the fix, this same monkeypatch would have had + no effect on the tally. 
+ """ + from diff_diff import utils as dd_utils + + if not dd_utils.HAS_RUST_BACKEND: + pytest.skip("Test targets the Rust backend specifically.") + + real_rust = dd_utils._rust_sc_weight_fw_with_convergence + + def _always_not_converged(Y, zeta, intercept, init, min_decrease, max_iter): + weights, _ = real_rust(Y, zeta, intercept, init, min_decrease, max_iter) + return weights, False + + monkeypatch.setattr( + dd_utils, "_rust_sc_weight_fw_with_convergence", _always_not_converged + ) + + df = _make_panel(n_control=20, n_treated=3, seed=42) + sdid = SyntheticDiD(variance_method="bootstrap", n_bootstrap=50, seed=42) + + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + sdid.fit( + df, outcome="outcome", treatment="treated", + unit="unit", time="period", + post_periods=list(range(5, 8)), + ) + + fw_warnings = [ + w for w in caught + if issubclass(w.category, UserWarning) + and "Frank-Wolfe did not converge" in str(w.message) + ] + assert len(fw_warnings) >= 1, ( + "Expected the aggregate Frank-Wolfe non-convergence warning " + "to fire when every FW call reports converged=False under " + "the Rust backend. Got no such warning — the Rust FW path is " + "silent on non-convergence and the bootstrap loop is not " + "surfacing it." + ) + # ============================================================================= # Jackknife SE @@ -872,60 +1051,6 @@ def test_jackknife_se_matches_r(self, r_panel_df): ) assert abs(results.se - self.R_JACKKNIFE_SE) < 1e-10 - def test_bootstrap_se_matches_r(self, r_panel_df): - """Bootstrap SE should match R's vcov(method='bootstrap') given the - same bootstrap indices. - - Scope of parity: RNG streams differ between Python (PCG64) and R - (Mersenne Twister), so a shared integer `seed` value draws - different resamples in each language. 
The fixture pins R's B × N - index matrix and the test feeds it through the Python bootstrap - loop via the `_bootstrap_indices` seam, so both implementations - traverse the *same* resamples. What the 1e-10 match verifies is - the deterministic math downstream of the indices — per-draw - estimator (weight renormalization + SDID formula) and SE - aggregation (`sqrt((r-1)/r) × sd(ddof=1)`). It does NOT verify - that independently-seeded runs of the two bootstraps agree at any - finite B; that would require a shared RNG stream or a Monte- - Carlo-tolerance comparison at large B, both out of scope here. - """ - import json - import pathlib - - fixture = pathlib.Path(__file__).parent / "data" / "sdid_bootstrap_indices_r.json" - if not fixture.exists(): - pytest.skip( - f"Missing R-parity fixture {fixture}; regenerate via " - "`Rscript benchmarks/R/generate_sdid_bootstrap_parity_fixture.R`." - ) - payload = json.loads(fixture.read_text()) - # R indices are 1-based; convert to 0-based for numpy. - indices = np.asarray(payload["indices"], dtype=np.int64) - 1 - r_bootstrap_se = float(payload["se"]) - - n_bootstrap = indices.shape[0] - sdid = SyntheticDiD( - variance_method="bootstrap", - n_bootstrap=n_bootstrap, - seed=42, - ) - # Route the pinned indices through the hidden _bootstrap_indices seam - # on _bootstrap_se. Patch the bound method at the class level so the - # sdid.fit() call picks it up. 
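[Editor's note: the aggregation convention cited in this removed parity test — `sqrt((r-1)/r) × sd(ddof=1)` over r draws, R's `synthdid::vcov()` shape — algebraically reduces to the plain ddof=0 standard deviation of the draws. A minimal sketch; `bootstrap_se` is a hypothetical helper and the `draws` values are illustrative, not library API or real output:]

```python
import numpy as np

def bootstrap_se(draws):
    """Aggregate per-draw ATT estimates into a bootstrap SE.

    Uses the sqrt((r-1)/r) * sd(ddof=1) convention cited in the removed
    R-parity test. The two factors cancel the Bessel correction, so this
    is identical to np.std(draws) with the default ddof=0.
    """
    r = len(draws)
    return float(np.sqrt((r - 1) / r) * np.std(draws, ddof=1))

draws = np.array([4.9, 5.1, 5.0, 4.8, 5.2])  # illustrative per-draw ATTs
se = bootstrap_se(draws)
assert abs(se - np.std(draws)) < 1e-12  # same as the ddof=0 sd
```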
- orig = SyntheticDiD._bootstrap_se - - def _patched(self, *args, **kwargs): - kwargs["_bootstrap_indices"] = indices - return orig(self, *args, **kwargs) - - with patch.object(SyntheticDiD, "_bootstrap_se", _patched): - results = sdid.fit( - r_panel_df, outcome="outcome", treatment="treated", - unit="unit", time="time", - post_periods=[5, 6, 7], - ) - assert abs(results.se - r_bootstrap_se) < 1e-10 - # ============================================================================= # Edge Cases @@ -1147,6 +1272,46 @@ def test_set_params_unknown_raises(self): with pytest.raises(ValueError, match="Unknown parameter"): sdid.set_params(nonexistent_param=1.0) + def test_set_params_rejects_invalid_variance_method(self): + """set_params with invalid variance_method raises ValueError. + + Parity with __init__: the setter path must enforce the same enum + contract, not silently accept arbitrary strings. + """ + sdid = SyntheticDiD() + with pytest.raises(ValueError, match="variance_method must be one of"): + sdid.set_params(variance_method="not_a_method") + + def test_set_params_rejects_incoherent_n_bootstrap(self): + """set_params(variance_method='bootstrap', n_bootstrap=1) raises. + + Parity with __init__: n_bootstrap >= 2 required unless jackknife. + """ + sdid = SyntheticDiD() + with pytest.raises(ValueError, match="n_bootstrap must be >= 2"): + sdid.set_params(variance_method="bootstrap", n_bootstrap=1) + + def test_set_params_allows_n_bootstrap_one_for_jackknife(self): + """Jackknife is deterministic; n_bootstrap is ignored and need not be >= 2.""" + sdid = SyntheticDiD() + sdid.set_params(variance_method="jackknife", n_bootstrap=1) + assert sdid.variance_method == "jackknife" + assert sdid.n_bootstrap == 1 + + def test_set_params_rolls_back_on_validation_failure(self): + """On validation failure, ``set_params`` restores the pre-call state. 
+ + Guards against leaving the instance in a partially-mutated invalid + state if one of multiple simultaneous updates fails validation. + """ + sdid = SyntheticDiD(variance_method="placebo", n_bootstrap=200) + with pytest.raises(ValueError): + sdid.set_params(variance_method="bootstrap", n_bootstrap=1) + # Pre-call values preserved despite mid-sequence setattr on the + # (now-partially-applied) instance. + assert sdid.variance_method == "placebo" + assert sdid.n_bootstrap == 200 + class TestDeprecatedParams: """Test deprecated parameter handling in __init__.""" @@ -2239,7 +2404,17 @@ class TestScaleEquivariance: # drift the fix is not a true no-op on normal data and review is warranted. _BASELINE = { "placebo": (4.603349837478791, 0.29385822261006445, 0.004975124378109453, 200), - "bootstrap": (4.603349837478791, 0.16272527384941657, 4.707563471218442e-176, 200), + # bootstrap = paper-faithful refit with R-default warm-start: FW is + # initialized with ``sum_normalize(unit_weights[boot_control_idx])`` + # for ω and with the fit-time ``time_weights`` for λ on each draw, + # matching R's ``vcov.R::bootstrap_sample`` opts-rebind shape. + # Drift from the cold-start capture (0.21424970…) is confined to + # a handful of bootstrap draws where the 100-iter pre-sparsify pass + # converged to a different sparsification pattern under uniform + # init; strict-convexity of the FW objective means the converged + # answer is unique, so the warm-start matches R more faithfully on + # problems where the pre-sparsify budget is tight. 
+ "bootstrap": (4.6033498374787865, 0.21427381053829253, 2.2215821875845446e-102, 200), "jackknife": (4.603349837478791, 0.19908075946622925, 2.716551077849484e-118, 23), } @@ -2268,16 +2443,24 @@ def _fit(data, variance_method, *, seed=1, n_bootstrap=200): @pytest.mark.parametrize("variance_method", ["placebo", "bootstrap", "jackknife"]) def test_baseline_parity_small_scale(self, variance_method): - """Existing-fixture results match pre-fix literals — guards against - drift; a true no-op should hit float epsilon relative to baseline.""" + """Existing-fixture results match captured literals at bit-identity + tolerance. + + The bootstrap refactor that removed the fixed-weight path reuses the + same rng.choice sequence and the same compute_sdid_{unit,time}_weights + + compute_sdid_estimator call chain that the refit branch already ran + under the previous enum name. Any drift above machine epsilon means + the numerical path changed, not just a rename — investigate before + accepting. Placebo / jackknife are untouched by that refactor. + """ att0, se0, p0, n0 = self._BASELINE[variance_method] data = _make_panel(seed=42) with warnings.catch_warnings(): warnings.simplefilter("ignore", UserWarning) r = self._fit(data, variance_method) - assert r.att == pytest.approx(att0, rel=1e-8) - assert r.se == pytest.approx(se0, rel=1e-8) - assert r.p_value == pytest.approx(p0, rel=1e-8) + assert r.att == pytest.approx(att0, rel=1e-14) + assert r.se == pytest.approx(se0, rel=1e-14) + assert r.p_value == pytest.approx(p0, rel=1e-14) assert len(r.placebo_effects) == n0 @pytest.mark.parametrize("variance_method", ["placebo", "bootstrap", "jackknife"]) @@ -2437,26 +2620,28 @@ def test_bootstrap_p_value_detects_large_effect(self): ) @pytest.mark.slow - def test_bootstrap_p_value_null_calibration(self): - """Bootstrap p-values on null data must be spread (not clustered) - and reject within a plausible band for the fixed-weight regime. 
- - Semantic: this is a characterization test, not a nominal- - calibration assertion. Fixed-weight bootstrap deviates from - Arkhangelsky et al. (2021) Algorithm 2 by ignoring weight- - estimation uncertainty, which biases SE downward and over- - rejects under H0. On this DGP at n=500 seeds the empirical - rejection rate at α=0.05 runs ~0.18 (≈3.7× nominal) — see the - SyntheticDiD calibration note in REGISTRY.md. - - Assertions are wide enough to accommodate Monte Carlo noise at - n=100 seeds (rejection rate SE ≈ 0.04 under fixed-weight) and - remain valid if future calibration improves toward nominal: - - - rejection rate > α = 0.05: catches the pre-fix dispatch bug - where p clustered at ~0.5 on every seed (rejection rate → 0). - - rejection rate < 0.5: upper sanity bound — catches new - catastrophic miscalibration (e.g. SE collapsing to 0). + def test_bootstrap_p_value_null_dispersion(self): + """Bootstrap p-values on null data must be dispersed and fall in a + loose calibration-agnostic band — regression guard against the + pre-fix p-clustering dispatch bug and against SE-collapse. + + Semantic: this is a calibration-agnostic regression test, not a + nominal-calibration assertion. Under paper-faithful refit + bootstrap (Arkhangelsky et al. 2021 Algorithm 2 step 2) the + REGISTRY-cited coverage MC puts α=0.05 rejection near nominal on + this DGP (≈0.08 at 500 seeds × B=200); the previous fixed-weight + path over-rejected at ~0.18. We deliberately do NOT assert + directionally on rejection rate because the reviewer's P2 was + that the prior lower bound (`> 0.05`) biased the test toward + anti-conservative behavior. + + Assertions (calibration-agnostic): + + - ``np.std(p_values) > 0.10``: dispersion floor — catches the + pre-fix dispatch bug where p clustered at ~0.5 on every seed + (std → 0). + - ``0.01 <= rejection_rate <= 0.40``: loose band — catches both + SE-collapse (rate → 1) and SE-explosion (rate → 0). 
""" p_values = [] for seed in range(100): @@ -2474,14 +2659,20 @@ def test_bootstrap_p_value_null_calibration(self): p_values.append(r.p_value) p_arr = np.asarray(p_values) assert len(p_arr) >= 90, f"only {len(p_arr)}/100 fits produced finite p-values" - rejection_rate = float(np.mean(p_arr < 0.05)) - assert rejection_rate > 0.05, ( - f"rejection rate {rejection_rate:.3f} <= 0.05 — p-values likely " - "clustered (dispatch-bug regression)" + + p_std = float(np.std(p_arr)) + assert p_std > 0.10, ( + f"p-value std {p_std:.3f} <= 0.10 — p-values too tightly " + "clustered (pre-fix dispatch bug regression; fixed-weight→" + "refit porting bug would manifest as p≈0.5 on every seed)" ) - assert rejection_rate < 0.5, ( - f"rejection rate {rejection_rate:.3f} >= 0.5 — catastrophic " - "miscalibration (SE → 0 regression?)" + + rejection_rate = float(np.mean(p_arr < 0.05)) + assert 0.01 <= rejection_rate <= 0.40, ( + f"rejection rate {rejection_rate:.3f} outside loose " + "calibration-agnostic band [0.01, 0.40] — likely SE-collapse " + "or SE-explosion regression (compare against the REGISTRY " + "coverage MC table for the nominal band on this DGP)" ) @@ -2748,3 +2939,72 @@ def test_cross_period_ramping_trend(self): sens = r.sensitivity_to_zeta_omega() assert np.all(np.isfinite(sens["att"])) assert np.all(np.isfinite(sens["pre_fit_rmse"])) + + +# ============================================================================= +# Coverage MC calibration artifact (generated by benchmarks/python/coverage_sdid.py) +# ============================================================================= + + +class TestCoverageMCArtifact: + """Schema smoke-check on ``benchmarks/data/sdid_coverage.json``. + + The full Monte Carlo study (500 seeds × B=200 × 3 DGPs × 3 methods) + runs outside CI; its JSON output underwrites the calibration table in + REGISTRY.md §SyntheticDiD. This test verifies the artifact is present + and structured correctly. 
Per ``feedback_golden_file_pytest_skip.md``, + skip if missing — CI's isolated-install job copies only ``tests/``, + not ``benchmarks/``. + """ + + def test_coverage_artifacts_present(self): + import json + import pathlib + + artifact = ( + pathlib.Path(__file__).parent.parent + / "benchmarks" / "data" / "sdid_coverage.json" + ) + if not artifact.exists(): + pytest.skip( + f"Missing coverage MC artifact {artifact}; regenerate via " + "`python benchmarks/python/coverage_sdid.py --n-seeds 500 " + "--n-bootstrap 200 --output benchmarks/data/sdid_coverage.json`." + ) + payload = json.loads(artifact.read_text()) + + for key in ("metadata", "dgps", "per_dgp"): + assert key in payload, f"missing top-level key: {key}" + + meta = payload["metadata"] + for key in ("n_seeds", "n_bootstrap", "library_version", "backend", + "generated_at", "methods", "alphas"): + assert key in meta, f"missing metadata key: {key}" + assert meta["n_seeds"] >= 100, ( + f"n_seeds={meta['n_seeds']} too small; the REGISTRY calibration " + "table cites 500-seed rates — regenerate with documented settings." + ) + assert set(meta["methods"]) == {"placebo", "bootstrap", "jackknife"}, ( + "coverage artifact methods list must be exactly the 3 supported " + "SDID variance methods (placebo / bootstrap / jackknife); the old " + "fixed-weight 'bootstrap' and the additive 'bootstrap_refit' " + "enum values are both gone." 
+ ) + + for dgp in ("balanced", "unbalanced", "aer63"): + assert dgp in payload["per_dgp"], f"missing DGP block: {dgp}" + per_method = payload["per_dgp"][dgp] + for method in ("placebo", "bootstrap", "jackknife"): + assert method in per_method, ( + f"missing method block {method!r} under DGP {dgp!r}" + ) + block = per_method[method] + for field in ("rejection_rate", "mean_se", "true_sd_tau_hat", + "se_over_truesd", "n_successful_fits"): + assert field in block, ( + f"missing field {field!r} in {dgp}/{method}" + ) + for alpha_key in ("0.01", "0.05", "0.10"): + assert alpha_key in block["rejection_rate"], ( + f"missing alpha {alpha_key} in {dgp}/{method} rejection_rate" + ) diff --git a/tests/test_survey_phase5.py b/tests/test_survey_phase5.py index 6250261c..09e6ed47 100644 --- a/tests/test_survey_phase5.py +++ b/tests/test_survey_phase5.py @@ -176,29 +176,30 @@ def test_survey_metadata_fields(self, sdid_survey_data, survey_design_weights): assert sm.effective_n > 0 assert sm.design_effect > 0 - def test_full_design_bootstrap_smoke(self, sdid_survey_data, survey_design_full): - """Full survey design (strata/PSU) works with bootstrap variance.""" + def test_full_design_bootstrap_raises(self, sdid_survey_data, survey_design_full): + """Full survey design (strata/PSU) with bootstrap raises NotImplementedError. + + The previous fixed-weight bootstrap accepted strata/PSU/FPC via Rao-Wu + rescaling, but that path was not paper-faithful and was removed. Rao-Wu + composed with paper-faithful refit Frank-Wolfe requires a separate + derivation (tracked in TODO.md, sketched in REGISTRY.md §SyntheticDiD). 
+ """ est = SyntheticDiD(variance_method="bootstrap", n_bootstrap=50, seed=42) - result = est.fit( - sdid_survey_data, - outcome="outcome", - treatment="treated", - unit="unit", - time="time", - post_periods=[6, 7, 8, 9], - survey_design=survey_design_full, - ) - assert np.isfinite(result.att) - assert np.isfinite(result.se) - assert result.se > 0 - assert result.survey_metadata is not None - assert result.survey_metadata.n_strata is not None - assert result.survey_metadata.n_psu is not None + with pytest.raises(NotImplementedError, match="does not yet support"): + est.fit( + sdid_survey_data, + outcome="outcome", + treatment="treated", + unit="unit", + time="time", + post_periods=[6, 7, 8, 9], + survey_design=survey_design_full, + ) def test_full_design_placebo_raises(self, sdid_survey_data, survey_design_full): """Placebo variance with full design raises NotImplementedError.""" est = SyntheticDiD(variance_method="placebo", n_bootstrap=50, seed=42) - with pytest.raises(NotImplementedError, match="does not support strata/PSU/FPC"): + with pytest.raises(NotImplementedError, match="does not yet support survey designs with strata/PSU/FPC"): est.fit( sdid_survey_data, outcome="outcome", @@ -212,7 +213,7 @@ def test_full_design_placebo_raises(self, sdid_survey_data, survey_design_full): def test_full_design_jackknife_raises(self, sdid_survey_data, survey_design_full): """Jackknife variance with full design raises NotImplementedError.""" est = SyntheticDiD(variance_method="jackknife", seed=42) - with pytest.raises(NotImplementedError, match="does not support strata/PSU/FPC"): + with pytest.raises(NotImplementedError, match="does not yet support survey designs with strata/PSU/FPC"): est.fit( sdid_survey_data, outcome="outcome", @@ -223,34 +224,50 @@ def test_full_design_jackknife_raises(self, sdid_survey_data, survey_design_full survey_design=survey_design_full, ) - def test_full_design_se_differs_from_weights_only(self, sdid_survey_data): - """Rao-Wu bootstrap SE differs 
from pweight-only bootstrap SE.""" - sd_w = SurveyDesign(weights="weight") - sd_full = SurveyDesign(weights="weight", strata="stratum", psu="psu") - est = SyntheticDiD(variance_method="bootstrap", n_bootstrap=100, seed=42) - - result_w = est.fit( + def test_placebo_with_pweight_only_full_design_stripped_att_match( + self, sdid_survey_data + ): + """Placebo ATT with pweight-only is unchanged when stratum/psu + columns are physically dropped from the input DataFrame. + + Point estimates depend only on the pseudo-population weights, not on + the strata/PSU structure — full-design bootstrap previously exploited + that structure via Rao-Wu rescaling and is now rejected upstream (see + ``test_full_design_bootstrap_raises`` / + ``test_full_design_placebo_raises``). A silent pickup of ``stratum`` + or ``psu`` by the estimator (e.g., by name-matching a convention + column) would cause the two fits to diverge, so comparing a + DataFrame with those columns present against one with them dropped + is the real contract. 
+ """ + sd_pweight_only = SurveyDesign(weights="weight") + est = SyntheticDiD(variance_method="placebo", n_bootstrap=100, seed=42) + + result_with_cols = est.fit( sdid_survey_data, outcome="outcome", treatment="treated", unit="unit", time="time", post_periods=[6, 7, 8, 9], - survey_design=sd_w, + survey_design=sd_pweight_only, ) - result_full = est.fit( - sdid_survey_data, + sdid_survey_data_stripped = sdid_survey_data.drop(columns=["stratum", "psu"]) + result_stripped = est.fit( + sdid_survey_data_stripped, outcome="outcome", treatment="treated", unit="unit", time="time", post_periods=[6, 7, 8, 9], - survey_design=sd_full, + survey_design=sd_pweight_only, ) - # ATT point estimates should be the same (same weights) - assert result_full.att == pytest.approx(result_w.att, abs=1e-10) - # SEs should differ (different bootstrap scheme) - assert result_full.se != pytest.approx(result_w.se, abs=1e-6) + assert np.isfinite(result_with_cols.att) + assert np.isfinite(result_with_cols.se) + assert result_with_cols.se > 0 + # ATT depends only on pweight — silent pickup of stratum/psu would + # make the with-columns fit differ from the stripped fit. + assert result_with_cols.att == pytest.approx(result_stripped.att, abs=1e-12) def test_fweight_aweight_raises(self, sdid_survey_data): """Non-pweight raises ValueError.""" @@ -306,9 +323,36 @@ def test_summary_includes_survey(self, sdid_survey_data, survey_design_weights): assert "Survey Design" in summary assert "pweight" in summary - def test_bootstrap_with_survey(self, sdid_survey_data, survey_design_weights): - """variance_method='bootstrap' completes with survey weights.""" + def test_bootstrap_with_pweight_only_raises( + self, sdid_survey_data, survey_design_weights + ): + """variance_method='bootstrap' with any survey design raises NotImplementedError. 
+ + Paper-faithful refit bootstrap re-estimates ω̂ and λ̂ via Frank-Wolfe on + each draw; composing that with Rao-Wu rescaled weights (or even a + pweight composition in a weighted FW loss) requires a separate + derivation. Pweight-only survey users must use + ``variance_method='placebo'`` or ``'jackknife'``. + """ est = SyntheticDiD(variance_method="bootstrap", n_bootstrap=50, seed=42) + with pytest.raises(NotImplementedError, match="does not yet support"): + est.fit( + sdid_survey_data, + outcome="outcome", + treatment="treated", + unit="unit", + time="time", + post_periods=[6, 7, 8, 9], + survey_design=survey_design_weights, + ) + + def test_jackknife_with_pweight_only(self, sdid_survey_data, survey_design_weights): + """variance_method='jackknife' completes with pweight-only survey weights. + + Positive coverage for the pweight-only + jackknife path that replaces + the removed pweight-only + bootstrap case. + """ + est = SyntheticDiD(variance_method="jackknife", seed=42) result = est.fit( sdid_survey_data, outcome="outcome",