Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -16,6 +16,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- SyntheticDiD `variance_method="bootstrap"` now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's `synthdid::vcov()` convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. `is_significant` and `significance_stars` are derived from `p_value` and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates are unaffected.
- SyntheticDiD bootstrap SE formula applies the `sqrt((r-1)/r)` correction matching R's synthdid and the placebo SE formula.
- SyntheticDiD bootstrap now retries degenerate resamples (all-control or all-treated, or non-finite `τ_b`) until exactly `n_bootstrap` valid replicates are accumulated, matching R's `synthdid::bootstrap_sample` and Arkhangelsky et al. (2021) Algorithm 2. Previously the Python path counted attempts (with degenerate draws silently dropped), producing fewer valid replicates than requested. A bounded-attempt guard (`20 × n_bootstrap`) prevents pathological-input hangs.
- **TROP global bootstrap SE backend parity under fixed seed** — Rust and Python backends now produce bit-identical bootstrap SE under the same `seed`. Previously Rust's `bootstrap_trop_variance_global` seeded `rand_xoshiro::Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), producing ~28% SE divergence on tiny panels under `seed=42`. Fixed by extracting a shared `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` that pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. Closes silent-failures audit finding #23 (bootstrap half; grid-search half closed in PR #348). Local-method TROP also adopts the Python-canonical index contract for the RNG layer, but separately-discovered backend divergences (Rust normalizes weight-matrix outer product, Python `_compute_observation_weights` reads stale `_precomputed` cache) prevent local-method bit-identity SE; tracked as a follow-up in `TODO.md`.
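The normal-theory convention described in the bootstrap p-value entry above, together with the `sqrt((r-1)/r)` SE correction, can be sketched in a few lines (standalone illustration; `bootstrap_normal_p_value` is a hypothetical name, not a package helper):

```python
import math

import numpy as np


def bootstrap_normal_p_value(tau_hat, boot_estimates):
    """SE and two-sided p-value from bootstrap draws, synthdid::vcov-style."""
    r = len(boot_estimates)
    # sqrt((r-1)/r) times the sample sd (ddof=1) is algebraically the
    # population (ddof=0) standard deviation of the replicates.
    se = float(np.std(boot_estimates, ddof=1) * math.sqrt((r - 1) / r))
    z = tau_hat / se
    # Refer |z| to a standard normal: p = 2 * (1 - Phi(|z|)).
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return se, p
```

The replicates enter only through `se`; no empirical null distribution is formed from the draws themselves.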

### Changed
- **SyntheticDiD bootstrap no longer supports survey designs** (capability regression). The removed fixed-weight bootstrap path was the only SDID variance method that supported strata/PSU/FPC (via Rao-Wu rescaled bootstrap); the new paper-faithful refit bootstrap rejects all survey designs (including pweight-only) with `NotImplementedError`. Pweight-only users can switch to `variance_method="placebo"` or `"jackknife"`. Strata/PSU/FPC users have no SDID variance option on this release. Composing Rao-Wu rescaled weights with Frank-Wolfe re-estimation requires a separate derivation (weighted FW solver); sketch and reusable scaffolding pointers are in `docs/methodology/REGISTRY.md` §SyntheticDiD and `TODO.md`.
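For context on the entry above: the Rao-Wu rescaling itself is mechanical — the open derivation is composing these replicate weights with Frank-Wolfe re-estimation. A single-stratum sketch of the rescaling step, with illustrative names (this is not the removed implementation):

```python
import numpy as np


def rao_wu_rescaled_weights(base_w, n_psus, m=None, rng=None):
    """Rao-Wu rescaled bootstrap replicate weights for one stratum.

    Resamples m PSUs with replacement from the n sampled PSUs, then
    rescales the base design weights so each replicate remains
    design-unbiased. Standard choice is m = n - 1.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = n_psus
    m = n - 1 if m is None else m
    # counts[i] = how many times PSU i was drawn in this replicate.
    counts = np.bincount(rng.integers(0, n, size=m), minlength=n)
    lam = np.sqrt(m / (n - 1))
    # Per-PSU rescale factor; nonnegative for any m <= n - 1.
    factor = 1.0 - lam + lam * (n / m) * counts
    return base_w * factor
```

With m = n - 1 the factors sum to n, so a replicate preserves the stratum's total weight; what has no off-the-shelf answer is re-running the Frank-Wolfe unit/time-weight solvers under these rescaled weights.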
3 changes: 2 additions & 1 deletion TODO.md
@@ -83,7 +83,8 @@ Deferred items from PR reviews that were not addressed before merge.
| Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
| Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. | `benchmarks/R/generate_clubsandwich_golden.R` | Phase 1a | Medium |
| `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low |
| TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. | `trop_global.py`, `rust/` | follow-up | Medium |
| TROP local-method Rust vs Python bootstrap SE divergence beyond RNG: with stratified indices now Python-canonical (closes RNG axis-H), local-method SE still diverges by 10-20% on tiny panels. Two downstream causes: (a) Rust `compute_weight_matrix` (`rust/src/trop.rs:573-606`) normalizes time_weights and unit_weights to sum to 1 before the outer product; Python `_compute_observation_weights` (`diff_diff/trop_local.py:489`) does not normalize. (b) Python `_compute_observation_weights` reads `self._precomputed["Y"]`, `["D"]`, `["time_dist_matrix"]` (lines 423-458) — original-panel cache — rather than the bootstrap-sample data passed via the function arguments, so unit-distance computation in the Python fallback uses stale data. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local` baselines the gap across `seeds=[0, 42, 12345]`. Affects local-method TROP backend parity broadly, not just bootstrap. | `trop_local.py`, `rust/src/trop.rs` | follow-up | Medium |
| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #NNN): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
| `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
| `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium |
| `bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. | `diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Medium |
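The "same fix shape" referenced in the multiplier-bootstrap row above — pre-generate the weight batch in Python via numpy and hand the array to Rust — might look like the following sketch (function name and call shape are illustrative, not the library's API):

```python
import numpy as np


def generate_multiplier_weights(seed, n_bootstrap, n_clusters, dist="rademacher"):
    """Pre-generate a (n_bootstrap, n_clusters) multiplier-weight batch.

    One generator drives the whole batch, so a Rust consumer receiving
    this array needs no RNG of its own and both backends agree bit-for-bit.
    """
    rng = np.random.default_rng(seed)
    if dist == "rademacher":
        # +1 / -1 with probability 1/2 each.
        return rng.choice(np.array([1.0, -1.0]), size=(n_bootstrap, n_clusters))
    if dist == "mammen":
        # Mammen (1993) two-point distribution: mean 0, variance 1.
        phi = (1 + np.sqrt(5)) / 2
        vals = np.array([1 - phi, phi])
        probs = np.array([phi / np.sqrt(5), 1 - phi / np.sqrt(5)])
        return rng.choice(vals, size=(n_bootstrap, n_clusters), p=probs)
    if dist == "webb":
        # Webb's six-point distribution.
        vals = np.array([-np.sqrt(1.5), -1.0, -np.sqrt(0.5),
                         np.sqrt(0.5), 1.0, np.sqrt(1.5)])
        return rng.choice(vals, size=(n_bootstrap, n_clusters))
    raise ValueError(f"unknown distribution: {dist}")
```

Any parity audit then reduces to checking that the Rust path consumes exactly this array instead of seeding `Xoshiro256PlusPlus` per row.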
41 changes: 41 additions & 0 deletions diff_diff/bootstrap_utils.py
@@ -24,9 +24,50 @@
"compute_effect_bootstrap_stats",
"compute_effect_bootstrap_stats_batch",
"warn_bootstrap_failure_rate",
"stratified_bootstrap_indices",
]


def stratified_bootstrap_indices(
rng: np.random.Generator,
n_control: int,
n_treated: int,
n_bootstrap: int,
) -> Tuple[np.ndarray, np.ndarray]:
"""Generate stratified bootstrap sample indices.

Draws controls first, then treated, per replicate. A single ``rng``
advances through all replicates so Python and Rust consumers share an
identical bytestream under the same seed.

Parameters
----------
rng : np.random.Generator
Seeded numpy generator. Consumed positionally; caller owns seeding.
n_control : int
Size of the control pool. Indices drawn in ``[0, n_control)``.
n_treated : int
Size of the treated pool. Indices drawn in ``[0, n_treated)``.
n_bootstrap : int
Number of bootstrap replicates.

Returns
-------
(control_indices, treated_indices) : tuple of np.ndarray
Shapes ``(n_bootstrap, n_control)`` and ``(n_bootstrap, n_treated)``,
dtype ``int64``. Each row is one replicate's sample-with-replacement
indices. Callers map these back to unit identities via their own pool.
"""
control_idx = np.empty((n_bootstrap, n_control), dtype=np.int64)
treated_idx = np.empty((n_bootstrap, n_treated), dtype=np.int64)
for b in range(n_bootstrap):
if n_control > 0:
control_idx[b] = rng.choice(n_control, size=n_control, replace=True)
if n_treated > 0:
treated_idx[b] = rng.choice(n_treated, size=n_treated, replace=True)
return control_idx, treated_idx


def warn_bootstrap_failure_rate(
n_success: int,
n_attempted: int,
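As a standalone illustration of the contract `stratified_bootstrap_indices` establishes (the helper body is copied inline so the snippet runs outside the package):

```python
import numpy as np


def stratified_bootstrap_indices(rng, n_control, n_treated, n_bootstrap):
    # Inline copy of the helper added above: controls drawn first, then
    # treated, one replicate per row, one generator for the whole batch.
    control_idx = np.empty((n_bootstrap, n_control), dtype=np.int64)
    treated_idx = np.empty((n_bootstrap, n_treated), dtype=np.int64)
    for b in range(n_bootstrap):
        if n_control > 0:
            control_idx[b] = rng.choice(n_control, size=n_control, replace=True)
        if n_treated > 0:
            treated_idx[b] = rng.choice(n_treated, size=n_treated, replace=True)
    return control_idx, treated_idx


# Same seed -> bit-identical index arrays, regardless of which backend
# (Python fallback or Rust via PyO3) later consumes them.
c1, t1 = stratified_bootstrap_indices(np.random.default_rng(42), 6, 2, 100)
c2, t2 = stratified_bootstrap_indices(np.random.default_rng(42), 6, 2, 100)
assert np.array_equal(c1, c2) and np.array_equal(t1, t2)
```

Mapping a row back to unit identities is the caller's job (`control_units[control_idx[b]]`), which keeps the helper free of any panel-specific dtype.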
54 changes: 29 additions & 25 deletions diff_diff/trop_global.py
@@ -24,7 +24,10 @@
_rust_bootstrap_trop_variance_global,
_rust_loocv_grid_search_global,
)
from diff_diff.bootstrap_utils import warn_bootstrap_failure_rate
from diff_diff.bootstrap_utils import (
stratified_bootstrap_indices,
warn_bootstrap_failure_rate,
)
from diff_diff.trop_local import _soft_threshold_svd, _validate_and_pivot_treatment
from diff_diff.trop_results import TROPResults
from diff_diff.utils import safe_inference, warn_if_not_converged
@@ -961,6 +964,21 @@ def _bootstrap_variance_global(
survey_design,
)

# Stratified bootstrap pools (shared by Rust and Python paths)
unit_ever_treated = data.groupby(unit)[treatment].max()
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index.tolist())
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index.tolist())
n_treated_units = len(treated_units)
n_control_units = len(control_units)

# Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG).
# Both backends consume these indices so SE is identical under the same seed
# (silent-failures finding #23, bootstrap half).
rng = np.random.default_rng(self.seed)
control_idx, treated_idx = stratified_bootstrap_indices(
rng, n_control_units, n_treated_units, self.n_bootstrap
)

# Try Rust backend for parallel bootstrap (5-15x speedup)
# Only used for pweight-only designs (no strata/PSU/FPC)
if HAS_RUST_BACKEND and _rust_bootstrap_trop_variance_global is not None:
@@ -991,7 +1009,8 @@ def _bootstrap_variance_global(
self.n_bootstrap,
self.max_iter,
self.tol,
self.seed if self.seed is not None else 0,
control_idx,
treated_idx,
unit_weight_arr,
)

@@ -1015,32 +1034,17 @@ def _bootstrap_variance_global(
stacklevel=2,
)

# Python fallback implementation
rng = np.random.default_rng(self.seed)

# Stratified bootstrap sampling
unit_ever_treated = data.groupby(unit)[treatment].max()
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index.tolist())
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index.tolist())

n_treated_units = len(treated_units)
n_control_units = len(control_units)

# Python fallback: consume the same indices the Rust branch would have used.
bootstrap_estimates_list: List[float] = []
nonconverg_tracker: List[int] = []

for _ in range(self.n_bootstrap):
# Stratified sampling
if n_control_units > 0:
sampled_control = rng.choice(control_units, size=n_control_units, replace=True)
else:
sampled_control = np.array([], dtype=object)

if n_treated_units > 0:
sampled_treated = rng.choice(treated_units, size=n_treated_units, replace=True)
else:
sampled_treated = np.array([], dtype=object)

for b in range(self.n_bootstrap):
sampled_control = (
control_units[control_idx[b]] if n_control_units > 0 else np.array([], dtype=object)
)
sampled_treated = (
treated_units[treated_idx[b]] if n_treated_units > 0 else np.array([], dtype=object)
)
sampled_units = np.concatenate([sampled_control, sampled_treated])

# Create bootstrap sample
69 changes: 40 additions & 29 deletions diff_diff/trop_local.py
@@ -24,7 +24,10 @@
_rust_bootstrap_trop_variance,
_rust_unit_distance_matrix,
)
from diff_diff.bootstrap_utils import warn_bootstrap_failure_rate
from diff_diff.bootstrap_utils import (
stratified_bootstrap_indices,
warn_bootstrap_failure_rate,
)
from diff_diff.trop_results import _PrecomputedStructures
from diff_diff.utils import warn_if_not_converged

@@ -957,6 +960,28 @@ def _bootstrap_variance(
survey_design,
)

# Stratified bootstrap pools (shared by Rust and Python paths).
# Paper's Algorithm 3 (page 27) specifies sampling N_0 control rows
# and N_1 treated rows separately to preserve treatment ratio.
unit_ever_treated = data.groupby(unit)[treatment].max()
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index)
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index)
n_treated_units = len(treated_units)
n_control_units = len(control_units)

# Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG).
# This aligns the RNG layer between backends. For the global method the RNG
# was the only divergence, so global SE is now bit-identical under the same
# seed. For the local method two downstream divergences remain (Rust weight-
# matrix normalization + Python `_compute_observation_weights` reading the
# stale `_precomputed` cache) — tracked in TODO.md; until those land, local
# bootstrap SE still differs across backends. Silent-failures finding #23
# (bootstrap half) closed for global; local follow-up queued.
rng = np.random.default_rng(self.seed)
control_idx, treated_idx = stratified_bootstrap_indices(
rng, n_control_units, n_treated_units, self.n_bootstrap
)

# Try Rust backend for parallel bootstrap (5-15x speedup)
# Only used for pweight-only designs (no strata/PSU/FPC)
if (
@@ -981,7 +1006,8 @@
self.n_bootstrap,
self.max_iter,
self.tol,
self.seed if self.seed is not None else 0,
control_idx,
treated_idx,
unit_weight_arr,
)

@@ -1003,36 +1029,21 @@
stacklevel=2,
)

# Python implementation (fallback)
rng = np.random.default_rng(self.seed)

# Issue D fix: Stratified bootstrap sampling
# Paper's Algorithm 3 (page 27) specifies sampling N_0 control rows
# and N_1 treated rows separately to preserve treatment ratio
unit_ever_treated = data.groupby(unit)[treatment].max()
treated_units = np.array(unit_ever_treated[unit_ever_treated == 1].index)
control_units = np.array(unit_ever_treated[unit_ever_treated == 0].index)

n_treated_units = len(treated_units)
n_control_units = len(control_units)

# Python fallback: consume the same indices the Rust branch would have used.
bootstrap_estimates_list = []
nonconverg_tracker: List[int] = []

for _ in range(self.n_bootstrap):
# Stratified sampling: sample control and treated units separately
# This preserves the treatment ratio in each bootstrap sample
if n_control_units > 0:
sampled_control = rng.choice(control_units, size=n_control_units, replace=True)
else:
sampled_control = np.array([], dtype=control_units.dtype)

if n_treated_units > 0:
sampled_treated = rng.choice(treated_units, size=n_treated_units, replace=True)
else:
sampled_treated = np.array([], dtype=treated_units.dtype)

# Combine stratified samples
for b in range(self.n_bootstrap):
sampled_control = (
control_units[control_idx[b]]
if n_control_units > 0
else np.array([], dtype=control_units.dtype)
)
sampled_treated = (
treated_units[treated_idx[b]]
if n_treated_units > 0
else np.array([], dtype=treated_units.dtype)
)
sampled_units = np.concatenate([sampled_control, sampled_treated])

# Create bootstrap sample with unique unit IDs