Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -16,6 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- SyntheticDiD `variance_method="bootstrap"` now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's `synthdid::vcov()` convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. `is_significant` and `significance_stars` are derived from `p_value` and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates are unaffected.
- SyntheticDiD bootstrap SE formula applies the `sqrt((r-1)/r)` correction matching R's synthdid and the placebo SE formula.
- SyntheticDiD bootstrap now retries degenerate resamples (all-control or all-treated, or non-finite `τ_b`) until exactly `n_bootstrap` valid replicates are accumulated, matching R's `synthdid::bootstrap_sample` and Arkhangelsky et al. (2021) Algorithm 2. Previously the Python path counted attempts (with degenerate draws silently dropped), producing fewer valid replicates than requested. A bounded-attempt guard (`20 × n_bootstrap`) prevents pathological-input hangs.
- **TROP global bootstrap SE backend parity under fixed seed** — Rust and Python backends now produce bit-identical bootstrap SE under the same `seed`. Previously Rust's `bootstrap_trop_variance_global` seeded `rand_xoshiro::Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), producing ~28% SE divergence on tiny panels under `seed=42`. Fixed by extracting a shared `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` that pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. Closes silent-failures audit finding #23 (bootstrap half; grid-search half closed in PR #348). Local-method TROP also adopts the Python-canonical index contract for the RNG layer, but separately-discovered backend divergences (Rust normalizes weight-matrix outer product, Python `_compute_observation_weights` reads stale `_precomputed` cache) prevent local-method bit-identity SE; tracked as a follow-up in `TODO.md`.
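The normal-theory convention described in the bootstrap p-value entry above, together with the `sqrt((r-1)/r)` SE correction, can be sketched in a few lines (standalone illustration; `bootstrap_normal_p_value` is a hypothetical name, not a package helper):

```python
import math

import numpy as np


def bootstrap_normal_p_value(tau_hat, boot_estimates):
    """SE and two-sided p-value from bootstrap draws, synthdid::vcov-style."""
    r = len(boot_estimates)
    # sqrt((r-1)/r) times the sample sd (ddof=1) is algebraically the
    # population (ddof=0) standard deviation of the replicates.
    se = float(np.std(boot_estimates, ddof=1) * math.sqrt((r - 1) / r))
    z = tau_hat / se
    # Refer |z| to a standard normal: p = 2 * (1 - Phi(|z|)).
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return se, p
```

The replicates enter only through `se`; no empirical null distribution is formed from the draws themselves.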

### Changed
- **SyntheticDiD bootstrap no longer supports survey designs** (capability regression). The removed fixed-weight bootstrap path was the only SDID variance method that supported strata/PSU/FPC (via Rao-Wu rescaled bootstrap); the new paper-faithful refit bootstrap rejects all survey designs (including pweight-only) with `NotImplementedError`. Pweight-only users can switch to `variance_method="placebo"` or `"jackknife"`. Strata/PSU/FPC users have no SDID variance option on this release. Composing Rao-Wu rescaled weights with Frank-Wolfe re-estimation requires a separate derivation (weighted FW solver); sketch and reusable scaffolding pointers are in `docs/methodology/REGISTRY.md` §SyntheticDiD and `TODO.md`.
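For context on the entry above: the Rao-Wu rescaling itself is mechanical — the open derivation is composing these replicate weights with Frank-Wolfe re-estimation. A single-stratum sketch of the rescaling step, with illustrative names (this is not the removed implementation):

```python
import numpy as np


def rao_wu_rescaled_weights(base_w, n_psus, m=None, rng=None):
    """Rao-Wu rescaled bootstrap replicate weights for one stratum.

    Resamples m PSUs with replacement from the n sampled PSUs, then
    rescales the base design weights so each replicate remains
    design-unbiased. Standard choice is m = n - 1.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = n_psus
    m = n - 1 if m is None else m
    # counts[i] = how many times PSU i was drawn in this replicate.
    counts = np.bincount(rng.integers(0, n, size=m), minlength=n)
    lam = np.sqrt(m / (n - 1))
    # Per-PSU rescale factor; nonnegative for any m <= n - 1.
    factor = 1.0 - lam + lam * (n / m) * counts
    return base_w * factor
```

With m = n - 1 the factors sum to n, so a replicate preserves the stratum's total weight; what has no off-the-shelf answer is re-running the Frank-Wolfe unit/time-weight solvers under these rescaled weights.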
3 changes: 2 additions & 1 deletion TODO.md
@@ -83,7 +83,8 @@ Deferred items from PR reviews that were not addressed before merge.
| Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
| Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. | `benchmarks/R/generate_clubsandwich_golden.R` | Phase 1a | Medium |
| `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low |
| TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. | `trop_global.py`, `rust/` | follow-up | Medium |
| TROP local-method Rust vs Python bootstrap SE divergence beyond RNG: with stratified indices now Python-canonical (closes RNG axis-H), local-method SE still diverges by 10-20% on tiny panels. Two downstream causes: (a) Rust `compute_weight_matrix` (`rust/src/trop.rs:573-606`) normalizes time_weights and unit_weights to sum to 1 before the outer product; Python `_compute_observation_weights` (`diff_diff/trop_local.py:489`) does not normalize. (b) Python `_compute_observation_weights` reads `self._precomputed["Y"]`, `["D"]`, `["time_dist_matrix"]` (lines 423-458) — original-panel cache — rather than the bootstrap-sample data passed via the function arguments, so unit-distance computation in the Python fallback uses stale data. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local` baselines the gap across `seeds=[0, 42, 12345]`. Affects local-method TROP backend parity broadly, not just bootstrap. | `trop_local.py`, `rust/src/trop.rs` | follow-up | Medium |
| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #NNN): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
| `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
| `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium |
| `bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. | `diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Medium |
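The "same fix shape" referenced in the multiplier-bootstrap row above — pre-generate the weight batch in Python via numpy and hand the array to Rust — might look like the following sketch (function name and call shape are illustrative, not the library's API):

```python
import numpy as np


def generate_multiplier_weights(seed, n_bootstrap, n_clusters, dist="rademacher"):
    """Pre-generate a (n_bootstrap, n_clusters) multiplier-weight batch.

    One generator drives the whole batch, so a Rust consumer receiving
    this array needs no RNG of its own and both backends agree bit-for-bit.
    """
    rng = np.random.default_rng(seed)
    if dist == "rademacher":
        # +1 / -1 with probability 1/2 each.
        return rng.choice(np.array([1.0, -1.0]), size=(n_bootstrap, n_clusters))
    if dist == "mammen":
        # Mammen (1993) two-point distribution: mean 0, variance 1.
        phi = (1 + np.sqrt(5)) / 2
        vals = np.array([1 - phi, phi])
        probs = np.array([phi / np.sqrt(5), 1 - phi / np.sqrt(5)])
        return rng.choice(vals, size=(n_bootstrap, n_clusters), p=probs)
    if dist == "webb":
        # Webb's six-point distribution.
        vals = np.array([-np.sqrt(1.5), -1.0, -np.sqrt(0.5),
                         np.sqrt(0.5), 1.0, np.sqrt(1.5)])
        return rng.choice(vals, size=(n_bootstrap, n_clusters))
    raise ValueError(f"unknown distribution: {dist}")
```

Any parity audit then reduces to checking that the Rust path consumes exactly this array instead of seeding `Xoshiro256PlusPlus` per row.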
41 changes: 41 additions & 0 deletions diff_diff/bootstrap_utils.py
@@ -24,9 +24,50 @@
"compute_effect_bootstrap_stats",
"compute_effect_bootstrap_stats_batch",
"warn_bootstrap_failure_rate",
"stratified_bootstrap_indices",
]


def stratified_bootstrap_indices(
rng: np.random.Generator,
n_control: int,
n_treated: int,
n_bootstrap: int,
) -> Tuple[np.ndarray, np.ndarray]:
"""Generate stratified bootstrap sample indices.

Draws controls first, then treated, per replicate. A single ``rng``
advances through all replicates so Python and Rust consumers share an
identical bytestream under the same seed.

Parameters
----------
rng : np.random.Generator
Seeded numpy generator. Consumed positionally; caller owns seeding.
n_control : int
Size of the control pool. Indices drawn in ``[0, n_control)``.
n_treated : int
Size of the treated pool. Indices drawn in ``[0, n_treated)``.
n_bootstrap : int
Number of bootstrap replicates.

Returns
-------
(control_indices, treated_indices) : tuple of np.ndarray
Shapes ``(n_bootstrap, n_control)`` and ``(n_bootstrap, n_treated)``,
dtype ``int64``. Each row is one replicate's sample-with-replacement
indices. Callers map these back to unit identities via their own pool.
"""
control_idx = np.empty((n_bootstrap, n_control), dtype=np.int64)
treated_idx = np.empty((n_bootstrap, n_treated), dtype=np.int64)
for b in range(n_bootstrap):
if n_control > 0:
control_idx[b] = rng.choice(n_control, size=n_control, replace=True)
if n_treated > 0:
treated_idx[b] = rng.choice(n_treated, size=n_treated, replace=True)
return control_idx, treated_idx


def warn_bootstrap_failure_rate(
n_success: int,
n_attempted: int,
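As a standalone illustration of the contract `stratified_bootstrap_indices` establishes (the helper body is copied inline so the snippet runs outside the package):

```python
import numpy as np


def stratified_bootstrap_indices(rng, n_control, n_treated, n_bootstrap):
    # Inline copy of the helper added above: controls drawn first, then
    # treated, one replicate per row, one generator for the whole batch.
    control_idx = np.empty((n_bootstrap, n_control), dtype=np.int64)
    treated_idx = np.empty((n_bootstrap, n_treated), dtype=np.int64)
    for b in range(n_bootstrap):
        if n_control > 0:
            control_idx[b] = rng.choice(n_control, size=n_control, replace=True)
        if n_treated > 0:
            treated_idx[b] = rng.choice(n_treated, size=n_treated, replace=True)
    return control_idx, treated_idx


# Same seed -> bit-identical index arrays, regardless of which backend
# (Python fallback or Rust via PyO3) later consumes them.
c1, t1 = stratified_bootstrap_indices(np.random.default_rng(42), 6, 2, 100)
c2, t2 = stratified_bootstrap_indices(np.random.default_rng(42), 6, 2, 100)
assert np.array_equal(c1, c2) and np.array_equal(t1, t2)
```

Mapping a row back to unit identities is the caller's job (`control_units[control_idx[b]]`), which keeps the helper free of any panel-specific dtype.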
54 changes: 29 additions & 25 deletions diff_diff/trop_global.py
@@ -24,7 +24,10 @@
_rust_bootstrap_trop_variance_global,
_rust_loocv_grid_search_global,
)
from diff_diff.bootstrap_utils import warn_bootstrap_failure_rate
from diff_diff.bootstrap_utils import (
stratified_bootstrap_indices,
warn_bootstrap_failure_rate,
)
from diff_diff.trop_local import _soft_threshold_svd, _validate_and_pivot_treatment
from diff_diff.trop_results import TROPResults
from diff_diff.utils import safe_inference, warn_if_not_converged
@@ -961,6 +964,21 @@ def _bootstrap_variance_global(
survey_design,
)

# Stratified bootstrap pools (shared by Rust and Python paths)
unit_ever_treated = data.groupby(unit)[treatment].max()
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index.tolist())
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index.tolist())
n_treated_units = len(treated_units)
n_control_units = len(control_units)

# Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG).
# Both backends consume these indices so SE is identical under the same seed
# (silent-failures finding #23, bootstrap half).
rng = np.random.default_rng(self.seed)
control_idx, treated_idx = stratified_bootstrap_indices(
rng, n_control_units, n_treated_units, self.n_bootstrap
)

# Try Rust backend for parallel bootstrap (5-15x speedup)
# Only used for pweight-only designs (no strata/PSU/FPC)
if HAS_RUST_BACKEND and _rust_bootstrap_trop_variance_global is not None:
@@ -991,7 +1009,8 @@ def _bootstrap_variance_global(
self.n_bootstrap,
self.max_iter,
self.tol,
self.seed if self.seed is not None else 0,
control_idx,
treated_idx,
unit_weight_arr,
)

@@ -1015,32 +1034,17 @@ def _bootstrap_variance_global(
stacklevel=2,
)

# Python fallback implementation
rng = np.random.default_rng(self.seed)

# Stratified bootstrap sampling
unit_ever_treated = data.groupby(unit)[treatment].max()
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index.tolist())
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index.tolist())

n_treated_units = len(treated_units)
n_control_units = len(control_units)

# Python fallback: consume the same indices the Rust branch would have used.
bootstrap_estimates_list: List[float] = []
nonconverg_tracker: List[int] = []

for _ in range(self.n_bootstrap):
# Stratified sampling
if n_control_units > 0:
sampled_control = rng.choice(control_units, size=n_control_units, replace=True)
else:
sampled_control = np.array([], dtype=object)

if n_treated_units > 0:
sampled_treated = rng.choice(treated_units, size=n_treated_units, replace=True)
else:
sampled_treated = np.array([], dtype=object)

for b in range(self.n_bootstrap):
sampled_control = (
control_units[control_idx[b]] if n_control_units > 0 else np.array([], dtype=object)
)
sampled_treated = (
treated_units[treated_idx[b]] if n_treated_units > 0 else np.array([], dtype=object)
)
sampled_units = np.concatenate([sampled_control, sampled_treated])

# Create bootstrap sample
69 changes: 40 additions & 29 deletions diff_diff/trop_local.py
@@ -24,7 +24,10 @@
_rust_bootstrap_trop_variance,
_rust_unit_distance_matrix,
)
from diff_diff.bootstrap_utils import warn_bootstrap_failure_rate
from diff_diff.bootstrap_utils import (
stratified_bootstrap_indices,
warn_bootstrap_failure_rate,
)
from diff_diff.trop_results import _PrecomputedStructures
from diff_diff.utils import warn_if_not_converged

@@ -957,6 +960,28 @@ def _bootstrap_variance(
survey_design,
)

# Stratified bootstrap pools (shared by Rust and Python paths).
# Paper's Algorithm 3 (page 27) specifies sampling N_0 control rows
# and N_1 treated rows separately to preserve treatment ratio.
unit_ever_treated = data.groupby(unit)[treatment].max()
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index)
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index)
n_treated_units = len(treated_units)
n_control_units = len(control_units)

# Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG).
# This aligns the RNG layer between backends. For the global method the RNG
# was the only divergence, so global SE is now bit-identical under the same
# seed. For the local method two downstream divergences remain (Rust weight-
# matrix normalization + Python `_compute_observation_weights` reading the
# stale `_precomputed` cache) — tracked in TODO.md; until those land, local
# bootstrap SE still differs across backends. Silent-failures finding #23
# (bootstrap half) closed for global; local follow-up queued.
rng = np.random.default_rng(self.seed)
control_idx, treated_idx = stratified_bootstrap_indices(
rng, n_control_units, n_treated_units, self.n_bootstrap
)

# Try Rust backend for parallel bootstrap (5-15x speedup)
# Only used for pweight-only designs (no strata/PSU/FPC)
if (
@@ -981,7 +1006,8 @@
self.n_bootstrap,
self.max_iter,
self.tol,
self.seed if self.seed is not None else 0,
control_idx,
treated_idx,
unit_weight_arr,
)

@@ -1003,36 +1029,21 @@
stacklevel=2,
)

# Python implementation (fallback)
rng = np.random.default_rng(self.seed)

# Issue D fix: Stratified bootstrap sampling
# Paper's Algorithm 3 (page 27) specifies sampling N_0 control rows
# and N_1 treated rows separately to preserve treatment ratio
unit_ever_treated = data.groupby(unit)[treatment].max()
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index)
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index)

n_treated_units = len(treated_units)
n_control_units = len(control_units)

# Python fallback: consume the same indices the Rust branch would have used.
bootstrap_estimates_list = []
nonconverg_tracker: List[int] = []

for _ in range(self.n_bootstrap):
# Stratified sampling: sample control and treated units separately
# This preserves the treatment ratio in each bootstrap sample
if n_control_units > 0:
sampled_control = rng.choice(control_units, size=n_control_units, replace=True)
else:
sampled_control = np.array([], dtype=control_units.dtype)

if n_treated_units > 0:
sampled_treated = rng.choice(treated_units, size=n_treated_units, replace=True)
else:
sampled_treated = np.array([], dtype=treated_units.dtype)

# Combine stratified samples
for b in range(self.n_bootstrap):
sampled_control = (
control_units[control_idx[b]]
if n_control_units > 0
else np.array([], dtype=control_units.dtype)
)
sampled_treated = (
treated_units[treated_idx[b]]
if n_treated_units > 0
else np.array([], dtype=treated_units.dtype)
)
sampled_units = np.concatenate([sampled_control, sampled_treated])

# Create bootstrap sample with unique unit IDs