igerber · igerber · Apr 23, 2026 · Apr 22, 2026 · Apr 22, 2026 · Apr 22, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,12 +9,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Added
 - **`target_parameter` block in BR/DR schemas (experimental; schema version bumped to 2.0)** — `BUSINESS_REPORT_SCHEMA_VERSION` and `DIAGNOSTIC_REPORT_SCHEMA_VERSION` bumped from `"1.0"` to `"2.0"` because the new `"no_scalar_by_design"` value on the `headline.status` / `headline_metric.status` enum (dCDH `trends_linear=True, L_max>=2` configuration) is a breaking change per the REPORTING.md stability policy. BusinessReport and DiagnosticReport now emit a top-level `target_parameter` block naming what the headline scalar actually represents for each of the 16 result classes. Closes BR/DR foundation gap #6 (target-parameter clarity). Fields: `name`, `definition`, `aggregation` (machine-readable dispatch tag), `headline_attribute` (raw result attribute), `reference` (citation pointer). BR's summary emits the short `name` right after the headline; DR's overall-interpretation paragraph does the same; both full reports carry a "## Target Parameter" section with the full definition. Per-estimator dispatch is sourced from REGISTRY.md and lives in the new `diff_diff/_reporting_helpers.py::describe_target_parameter`. A few branches read fit-time config (`EfficientDiDResults.pt_assumption`, `StackedDiDResults.clean_control`, `ChaisemartinDHaultfoeuilleResults.L_max` / `covariate_residuals` / `linear_trends_effects`); others emit a fixed tag (the fit-time `aggregate` kwarg on CS / Imputation / TwoStage / Wooldridge does not change the `overall_att` scalar — disambiguating horizon / group tables is tracked under gap #9). See `docs/methodology/REPORTING.md` "Target parameter" section.
+- SyntheticDiD coverage Monte Carlo calibration table added to `docs/methodology/REGISTRY.md` §SyntheticDiD — rejection rates at α ∈ {0.01, 0.05, 0.10} across `placebo` / `bootstrap` / `jackknife` on 3 representative DGPs (balanced / exchangeable, unbalanced, and Arkhangelsky et al. (2021) AER §6.3 non-exchangeable). Artifact at `benchmarks/data/sdid_coverage.json` (500 seeds × B=200), regenerable via `benchmarks/python/coverage_sdid.py`.
 
 ### Fixed
-- SyntheticDiD `variance_method="bootstrap"` now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's `synthdid::vcov()` convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. `is_significant` and `significance_stars` are derived from `p_value` and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates and standard errors are unaffected.
+- **SyntheticDiD `variance_method="bootstrap"` now runs the paper-faithful refit bootstrap** with R-default warm-start. Re-estimates ω̂_b and λ̂_b via two-pass sparsified Frank-Wolfe on each pairs-bootstrap draw using the fit-time normalized-scale zeta (Arkhangelsky et al. (2021) Algorithm 2 step 2), matching the behavior of R's default `synthdid::vcov(method="bootstrap")` (which rebinds `attr(estimate, "opts")` so the renormalized ω serves as Frank-Wolfe initialization). The Python path threads that warm-start through `compute_sdid_unit_weights(..., init_weights=_sum_normalize(ω̂[boot_control_idx]))` and `compute_time_weights(..., init_weights=λ̂)` on each bootstrap draw. `compute_sdid_unit_weights` and `compute_time_weights` gain a new `init_weights` kwarg; when provided, the Rust top-level fast-path is skipped in favor of the Python two-pass dispatcher (whose inner FW calls still dispatch to Rust). Without this kwarg both helpers remain backward-compatible and keep the Rust fast-path. The previous fixed-weight bootstrap path is removed entirely: it was not paper-faithful and, despite prior documentation claiming otherwise, also did not match R's default bootstrap (the previous R-parity test fixture invoked `synthdid_estimate(weights=...)` without rebinding `opts`, which silently runs fixed-weight, so the 1e-10 parity was between two paths both wrong in the same direction). Coverage MC at the new artifact above quantifies the correctness fix on 3 representative null DGPs. **Users' existing `variance_method="bootstrap"` fits will return materially different SE / p-value / CI values in the next release** (same enum name, corrected semantics). Bootstrap is now ~5–30× slower per fit than the old fixed-weight shortcut (panel-size dependent; warm-start converges faster than cold-start, so the slowdown is less than the prior 10–100× estimate). The PR #349 follow-on bullets below (analytical p-value dispatch, sqrt((r-1)/r) SE formula, retry-to-B contract) all carry over to the refit path unchanged.
+- SyntheticDiD `variance_method="bootstrap"` now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's `synthdid::vcov()` convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. `is_significant` and `significance_stars` are derived from `p_value` and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates are unaffected.
 - SyntheticDiD bootstrap SE formula applies the `sqrt((r-1)/r)` correction matching R's synthdid and the placebo SE formula.
 - SyntheticDiD bootstrap now retries degenerate resamples (all-control or all-treated, or non-finite `τ_b`) until exactly `n_bootstrap` valid replicates are accumulated, matching R's `synthdid::bootstrap_sample` and Arkhangelsky et al. (2021) Algorithm 2. Previously the Python path counted attempts (with degenerate draws silently dropped), producing fewer valid replicates than requested. A bounded-attempt guard (`20 × n_bootstrap`) prevents pathological-input hangs.
 
+### Changed
+- **SyntheticDiD bootstrap no longer supports survey designs** (capability regression). The removed fixed-weight bootstrap path was the only SDID variance method that supported strata/PSU/FPC (via Rao-Wu rescaled bootstrap); the new paper-faithful refit bootstrap rejects all survey designs (including pweight-only) with `NotImplementedError`. Pweight-only users can switch to `variance_method="placebo"` or `"jackknife"`. Strata/PSU/FPC users have no SDID variance option in this release. Composing Rao-Wu rescaled weights with Frank-Wolfe re-estimation requires a separate derivation (weighted FW solver); sketch and reusable scaffolding pointers are in `docs/methodology/REGISTRY.md` §SyntheticDiD and `TODO.md`.
+
 ## [3.2.0] - 2026-04-19
 
 ### Added

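For readers of the CHANGELOG entries above, the retry-to-B contract, the `sqrt((r-1)/r)` SE formula, and the analytical normal-theory p-value can be sketched together as follows. This is a minimal illustration, not the library's `_bootstrap_se`: the names `bootstrap_se`, `normal_p_value`, and the `estimate_tau` callback are hypothetical, and the sketch assumes the correction factor multiplies the sample standard deviation (n-1 denominator) of the replicates. In the refit bootstrap, the callback would be where ω and λ are re-estimated on each draw.

```python
import math
import random
import statistics


def normal_p_value(estimate, se):
    """Two-sided p-value from the normal approximation, using the bootstrap SE."""
    if not (math.isfinite(se) and se > 0):
        return float("nan")
    z = abs(estimate / se)
    # 2 * (1 - Phi(z)) written via the complementary error function.
    return math.erfc(z / math.sqrt(2.0))


def bootstrap_se(is_treated, estimate_tau, n_bootstrap, seed=0):
    """Pairs-bootstrap SE that retries degenerate draws until n_bootstrap are valid.

    is_treated   -- list of bools, one per unit
    estimate_tau -- hypothetical callback: list of resampled unit indices -> tau_b
    """
    if n_bootstrap < 2:
        raise ValueError("n_bootstrap must be at least 2")
    rng = random.Random(seed)
    n_units = len(is_treated)
    max_attempts = 20 * n_bootstrap  # bounded-attempt guard against hangs
    taus = []
    attempts = 0
    while len(taus) < n_bootstrap:
        if attempts >= max_attempts:
            raise RuntimeError("too many degenerate bootstrap draws")
        attempts += 1
        # Resample units (not observations) with replacement.
        draw = [rng.randrange(n_units) for _ in range(n_units)]
        n_treated = sum(is_treated[i] for i in draw)
        if n_treated == 0 or n_treated == n_units:
            continue  # all-control or all-treated draw: retry
        tau_b = estimate_tau(draw)
        if not math.isfinite(tau_b):
            continue  # non-finite replicate: retry
        taus.append(tau_b)
    r = len(taus)
    # Assumption: the correction scales the sample standard deviation.
    se = math.sqrt((r - 1) / r) * statistics.stdev(taus)
    return se, taus
```

Counting valid replicates rather than attempts is what keeps the effective `r` equal to the requested `n_bootstrap`; the attempt cap only matters for inputs where almost every draw is degenerate.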
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
@@ -501,13 +501,26 @@ variables appear to the left of the `|` separator.
    `zeta=1.0`). Regularization parameters `zeta_omega` and `zeta_lambda` are now
    computed automatically from the data noise level (N_tr * sigma^2) as specified in
    Appendix D of Arkhangelsky et al. (2021), matching R's default behavior.
-4. **Bootstrap SE uses fixed weights matching R's `bootstrap_sample`** (was
-   re-estimating all weights). The bootstrap variance procedure now holds unit and time
-   weights fixed at their point estimates and only re-estimates the treatment effect,
-   matching the approach in R's `synthdid::bootstrap_sample()`.
-5. **Default `variance_method` changed to `"placebo"`** matching R's default. The R
-   package uses placebo variance by default (`synthdid_estimate` returns an object whose
-   `vcov()` uses the placebo method); our default now matches.
+4. **Bootstrap SE is paper-faithful refit (Algorithm 2 step 2), matching R's default
+   `synthdid::vcov(method="bootstrap")` including its warm-start shape.** On each
+   pairs-bootstrap draw, ω and λ are re-estimated via Frank-Wolfe on the resampled
+   panel using the fit-time normalized-scale zeta. The Frank-Wolfe first pass is
+   warm-started from the fit-time ω (renormalized over the resampled controls via
+   `_sum_normalize`) and the fit-time λ (unchanged), matching R's `bootstrap_sample`
+   which rebinds `attr(estimate, "opts")` so those weights serve as the FW
+   initialization per `update.omega=TRUE` / `update.lambda=TRUE`.
+   *(Historical note: an earlier release shipped a fixed-weight shortcut here
+   that matched neither the paper nor R's default vcov; that path was removed
+   in PR #351 along with its R-parity fixture, which had also been mis-anchored.
+   The same PR added the warm-start plumbing to `compute_sdid_unit_weights` /
+   `compute_time_weights` via new `init_weights=` kwargs.)*
+5. **Default `variance_method` changed to `"placebo"`** — intentional deviation from
+   R's default (R's `synthdid::vcov()` defaults to `"bootstrap"`). The library default
+   is placebo for two reasons: (a) placebo is unconditionally available on pweight-only
+   survey designs, whereas refit bootstrap rejects every survey design in this release;
+   (b) placebo sidesteps the ~5–30× slowdown of per-draw Frank-Wolfe re-estimation in
+   refit bootstrap. See REGISTRY.md §SyntheticDiD `Note (default variance_method
+   deviation from R)` for details.
 6. **Deprecated `lambda_reg` and `zeta` params; new params are `zeta_omega` and
    `zeta_lambda`**. The old parameters had unclear semantics and did not correspond to
    the paper's notation. The new parameters directly match the paper and R package

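The warm-start step in item 4 above (fit-time ω renormalized over the resampled controls) can be illustrated with a small sketch. The function names `sum_normalize` and `warm_start_unit_weights` are hypothetical stand-ins for the library's private `_sum_normalize` helper, plain lists replace arrays, and the uniform fallback for an all-zero selection is an assumption rather than something this diff confirms.

```python
def sum_normalize(weights):
    """Rescale non-negative weights to sum to one.

    Assumption: if the selected weights sum to zero, fall back to
    uniform weights so the Frank-Wolfe start stays on the simplex.
    """
    total = sum(weights)
    if total == 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def warm_start_unit_weights(omega_hat, boot_control_idx):
    """Build the Frank-Wolfe initialization for one bootstrap draw.

    omega_hat        -- fit-time unit weights over the original controls
    boot_control_idx -- indices of controls in the draw (repeats allowed)
    """
    selected = [omega_hat[i] for i in boot_control_idx]
    return sum_normalize(selected)


# A draw that picks control 0 twice and control 2 once.
init = warm_start_unit_weights([0.5, 0.3, 0.2, 0.0], [0, 0, 2])
```

Because a pairs-bootstrap draw can repeat or omit controls, the selected weights generally no longer sum to one, which is why the renormalization is needed before they can serve as a starting point. The time weights λ need no such step, since resampling units leaves the time dimension unchanged.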
diff --git a/TODO.md b/TODO.md
@@ -100,7 +100,8 @@ Deferred items from PR reviews that were not addressed before merge.
 | `HeterogeneousAdoptionDiD` Phase 5: `practitioner_next_steps()` integration, tutorial notebook, and `llms.txt` updates (preserving UTF-8 fingerprint). | `diff_diff/practitioner.py`, `tutorials/`, `diff_diff/guides/` | Phase 2a | Low |
 | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low |
 | `HeterogeneousAdoptionDiD` repeated-cross-section support: paper Section 2 defines HAD on panel OR repeated cross-section, but Phase 2a is panel-only. RCS inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator with the generic "unit(s) do not appear in both periods" error. A follow-up PR will add an RCS identification path based on pre/post cell means (rather than unit-level first differences), with its own validator and a distinct `data_mode` / API surface. | `diff_diff/had.py::_validate_had_panel`, `diff_diff/had.py::_aggregate_first_difference` | Phase 2a | Medium |
-| SyntheticDiD: ship paper-faithful refit bootstrap (Arkhangelsky et al. 2021 Algorithm 2, re-estimating ω and λ per draw) as an opt-in `bootstrap_weights="refit"` kwarg. Current bootstrap matches R's fixed-weight shortcut. | `synthetic_did.py::_bootstrap_se` | follow-up | Low |
+| **SDID + survey designs** (capability regression in this release; both pweight-only AND strata/PSU/FPC). The previous release's fixed-weight bootstrap accepted strata/PSU/FPC via Rao-Wu rescaled bootstrap; the new paper-faithful refit bootstrap rejects all survey designs because Rao-Wu composed with Frank-Wolfe re-estimation requires its own derivation. The follow-up needs a **weighted Frank-Wolfe** variant of `_sc_weight_fw` accepting per-unit weights in the loss and regularization (`Σ rw_i ω_i Y_i,pre` / `ζ² Σ rw_i ω_i²`), threaded through `compute_sdid_unit_weights` / `compute_time_weights`. Reusable scaffolding (`generate_rao_wu_weights`, split into `rw_control` / `rw_treated`, degenerate-retry, treated-mean weighting) is recoverable from the pre-rewrite `_bootstrap_se` body via `git show 91082e5:diff_diff/synthetic_did.py` (PR #351 "Replace SDID fixed-weight bootstrap with paper-faithful refit"). Compose-after-unweighted-FW does not work — silently reproduces the fixed-weight Rao-Wu behavior we removed. Validation: re-use the coverage MC harness with a stratified DGP, confirm near-nominal rejection rates against placebo-SE tracking. See REGISTRY.md §SyntheticDiD `Note (deferred survey + bootstrap composition)` for the sketch. | `synthetic_did.py::fit`, `synthetic_did.py::_bootstrap_se`, `utils.py::_sc_weight_fw` | follow-up | Medium |
+| SyntheticDiD: bootstrap cross-language parity anchor against R's default `synthdid::vcov(method="bootstrap")` (refit; rebinds `opts` per draw) or Julia `Synthdid.jl::src/vcov.jl::bootstrap_se` (refit by construction). Same-library validation (placebo-SE tracking, AER §6.3 MC truth) is in place; a cross-language anchor is desirable to bolster the methodology contract. Julia is the cleanest target — minimal wrapping work and refit-native vcov. Tolerance target: 1e-6 on Monte Carlo samples (different BLAS + RNG paths preclude 1e-10). The R-parity fixture from the previous release was deleted because it pinned the now-removed fixed-weight path. | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low |
 
 #### Performance
 
@@ -126,8 +127,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | `EDiDBootstrapResults` cross-reference is ambiguous — class is exported from both `diff_diff` and `diff_diff.efficient_did_bootstrap`, producing 3 "more than one target found" warnings. Add `:noindex:` to one source or use full-path refs | `diff_diff/efficient_did_results.py`, `docs/api/efficient_did.rst` | — | Low |
 | Tracked Sphinx autosummary stubs in `docs/api/_autosummary/*.rst` are stale — every sphinx build regenerates them with new attributes (e.g., `coef_var`, `survey_metadata`) that have been added to result classes. Either commit a refresh or move the directory to `.gitignore` and treat as build output. Also 6 untracked stubs exist for newer estimators (`WooldridgeDiD`, `SimulationMDEResults`, etc.) that have never been committed. | `docs/api/_autosummary/` | — | Low |
 | HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` | — | Low |
-| SyntheticDiD: coverage Monte Carlo study — empirical 95% CI coverage for placebo / fixed-bootstrap / jackknife on representative DGPs; document in REGISTRY.md to support the fixed-weight deviation label and calibrate user expectations. | `benchmarks/`, `docs/methodology/REGISTRY.md` | follow-up | Low |
-| SyntheticDiD: rename internal `placebo_effects` variable to `null_or_bootstrap_effects` (or `variance_effects`). Misleading name across the bootstrap/placebo/jackknife dispatch paths; low-risk refactor. | `synthetic_did.py`, `synthetic_did_results.py` | follow-up | Low |
+| SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release. | `synthetic_did.py`, `results.py` | follow-up | Medium |
 
 ---