Fix TROP bootstrap SE backend divergence under fixed seed by igerber · Pull Request #354 · igerber/diff-diff

igerber · 2026-04-24T00:26:47Z

Summary

Rust and Python TROP backends previously produced different bootstrap SE for the same seed. On a tiny correlated panel under seed=42, the gap was ~28% of SE because Rust seeded rand_xoshiro::Xoshiro256PlusPlus per replicate while Python consumed numpy.random.default_rng (PCG64), so identical seeds mapped to different bytestreams.

This PR canonicalizes on numpy. A new stratified_bootstrap_indices helper in diff_diff/bootstrap_utils.py pre-generates per-replicate (control, treated) positional index arrays from a numpy Generator; both TROP bootstrap Rust functions now accept control_indices/treated_indices as i64 arrays in place of seed: u64. Parallelism preserved; sampling law (stratified: controls then treated, with replacement) unchanged.

Global-method bootstrap SE is backend-invariant under the same seed to machine precision. test_bootstrap_seed_reproducibility flipped from xfail(strict=True) to passing assert_allclose(atol=rtol=1e-14) and parametrized over [0, 42, 12345].
Companion local-method test added, currently xfail(strict=True) with a new finding: aligning the RNG surfaced two separate local-method backend divergences (weight-matrix normalization in compute_weight_matrix is Rust-only; Python _compute_observation_weights reads self._precomputed["Y"] instead of the bootstrap-sample Y). Both tracked as follow-up rows in TODO.md and will land in a separate methodology PR.
Closes the bootstrap half of silent-failures audit finding Address code review feedback for CallawaySantAnna covariates #23 (grid-search half closed in PR Unify Rust TROP inner solver to SVD (close finding #23 grid-search divergence) #348).

Methodology references

Method: Triply Robust Panel (TROP) bootstrap SE computation.
Paper: Athey, S., Imbens, G. W., Qu, Z., & Viviano, D. (2025). Triply Robust Panel Estimators. Algorithm 3 (stratified unit-level resample).
Intentional deviations: None. Sampling law is unchanged — this PR only rearranges which side (Python vs Rust) advances the RNG state, not what the bootstrap resamples. The numpy.random.default_rng (PCG64) is treated as canonical per the repo convention that numpy/scipy paths are the reference implementation for cross-backend parity.

Validation

Tests added: TestStratifiedBootstrapIndices (7 tests: shape/dtype, value range, determinism, prefix invariance, hard-coded value pin for default_rng(42), empty-pool edges) in tests/test_bootstrap_utils.py. test_bootstrap_seed_reproducibility_local parametrized over three seeds in tests/test_rust_backend.py.
Tests updated: test_bootstrap_seed_reproducibility flipped from xfail to parametrized passing regression guard (atol=rtol=1e-14, seeds [0, 42, 12345]). Four existing direct-call tests (test_bootstrap_variance_shape, test_bootstrap_reproducibility, and their global counterparts) updated to construct stratified index arrays via the new helper instead of passing seed.
Verification: full tests/test_rust_backend.py passes (83 passed + 3 xfailed — the local-method seed-parity tests documenting the new finding). tests/test_trop.py -k bootstrap passes (13 tests) under both Rust and pure-Python (DIFF_DIFF_BACKEND=python) backends.

Security / privacy

No secrets, credentials, or PII touched. All changes are in source, tests, and documentation files within the repo.

🤖 Generated with Claude Code

Rust and Python TROP backends produced different bootstrap standard errors for the same `seed` value. On a tiny correlated panel under `seed=42` the gap was ~28% of SE: Rust seeded `rand_xoshiro:: Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), so identical seeds mapped to different bytestreams. Canonicalize on numpy. New `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` pre-generates per-replicate (control, treated) positional index arrays from a numpy `Generator` and hands them to both backends through the PyO3 surface — both Rust bootstrap functions (`bootstrap_trop_variance_global`, `bootstrap_trop_variance`) now accept `control_indices` and `treated_indices` as `i64` arrays in place of `seed: u64`. Parallelism is preserved. Sampling law (stratified: controls then treated, with replacement) is unchanged. Global-method SE is now backend-invariant under the same seed to machine precision: the prior `xfail(strict=True)` in `test_bootstrap_seed_reproducibility` is flipped to a passing `assert_allclose(atol=rtol=1e-14)` and parametrized over `[0, 42, 12345]`. A companion `test_bootstrap_seed_reproducibility_local` is added for the local-method bootstrap. It is currently `xfail(strict=True)` because aligning the RNG exposed two separate local-method backend divergences beyond this PR's scope: Rust's `compute_weight_matrix` normalizes time and unit weights to sum to 1, while Python's `_compute_observation_weights` does not; and the Python fallback's `_compute_observation_weights(_precomputed branch)` reads the original-panel cached `Y`/`D` instead of the bootstrap-sample arguments. Both are tracked as follow-up rows in `TODO.md` with file:line pointers and will land in a separate methodology PR. Closes the bootstrap half of silent-failures audit finding #23 (the grid-search half closed in PR #348). Reference: Athey, Imbens, Qu & Viviano (2025), "Triply Robust Panel Estimators", Algorithm 3. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-24T00:46:23Z

/ai-review

github-actions · 2026-04-24T00:54:07Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 225674cda82436b710b404d327ac16113890a965

Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: P1 input-validation gap in the new Rust bootstrap API.

Executive Summary

Affected methods: TROP bootstrap SE for method="global" and the RNG plumbing used by method="local". The changed global path still matches Algorithm 3’s stratified block-bootstrap law: resample N0 control units and N1 treated units with replacement, then recompute TROP on each bootstrap draw. citeturn0view0
I did not find a new undocumented methodology, weighting, or variance/SE defect in the global fix.
The new Rust control_indices / treated_indices surface validates shape but not element range, so malformed arrays are handled by Rust bounds behavior instead of a clean ValueError.
The remaining local-method backend divergence is surfaced rather than hidden (tests/test_rust_backend.py:L2335-L2356, TODO.md:L86-L87), but some inline comments still overstate what this PR accomplished.

Methodology

No unmitigated findings. The shared-index implementation preserves the paper/registry bootstrap contract for the changed global path rather than changing the resampling law. diff_diff/bootstrap_utils.py:L31-L68, diff_diff/trop_global.py:L967-L1045, rust/src/trop.rs:L1777-L1841, docs/methodology/REGISTRY.md:L2000-L2027. citeturn0view0

Code Quality

P1 Impact: the new raw-index Rust entry points only check array shape. Negative or out-of-range entries are cast with as usize and then indexed directly via original_control_units[idx] / original_treated_units[idx], so bad inputs fail inside Rust instead of producing a deterministic validation error. This is new in the PR because the old interface accepted only seed. Concrete fix: before entering the parallel loop, validate every element of control_indices and treated_indices in both Rust functions and raise PyValueError when any index is < 0 or >= its pool size. rust/src/trop.rs:L971-L1005, rust/src/trop.rs:L1791-L1840

Performance

No material findings in the changed path.

Maintainability

No additional findings beyond the documentation mismatch below.

Tech Debt

P3 Impact: the PR properly records the newly surfaced local-method Rust/Python divergence as follow-up work instead of burying it. That is useful tracking, but parity for method="local" remains open. Concrete fix: implement the follow-up described in TODO.md:L86-L87 and then convert tests/test_rust_backend.py:L2335-L2356 from strict xfail to a passing regression test.

Security

No findings.

Documentation/Tests

P3 Impact: the local-path comments are inconsistent with the actual state of the code. diff_diff/trop_local.py:L972-L974 says shared indices make SE identical under the same seed, but the new strict xfail and TODO entry explicitly say local SE parity is still unresolved. Concrete fix: change that comment to say the PR aligns only the RNG layer for local bootstrap; full SE parity still needs the follow-up in TODO.md:L86-L87.
P3 Impact: the Rust function header comments still describe a removed seed argument even though the PyO3 signatures now take control_indices / treated_indices. Concrete fix: update the arg docs to match the new API. rust/src/trop.rs:L912-L927, rust/src/trop.rs:L1737-L1752
P3 Impact: the new tests cover helper invariants and happy-path reproducibility, but not malformed raw-index rejection for the new Rust surface. Concrete fix: add direct tests for both bootstrap_trop_variance and bootstrap_trop_variance_global asserting clean errors on one negative index and one too-large index. tests/test_bootstrap_utils.py:L207-L269, tests/test_rust_backend.py:L979-L1028, tests/test_rust_backend.py:L1213-L1253

Path to Approval

Add explicit range validation for control_indices and treated_indices in both Rust bootstrap entry points, returning PyValueError for negative or out-of-pool values.
Add direct regression tests covering malformed index arrays for both local and global Rust bootstrap functions.
Fix the local-path comments and Rust arg docs so they match the actual post-PR state: global seed parity is fixed; local seed parity is not yet fixed; the Rust API now accepts index arrays rather than seed.

P1 — Rust `bootstrap_trop_variance` and `bootstrap_trop_variance_global` now validate every element of `control_indices` and `treated_indices` before entering the parallel loop. Negative values and values `>= pool size` raise `PyValueError` with the offending index, rather than panicking on a negative-to-`usize` cast or indexing `original_*_units` out of bounds inside the rayon closure. P3 — `rust/src/trop.rs` function header docs rewritten for both bootstrap entry points. Removed the stale `seed` argument; added description of `control_indices` and `treated_indices` shape/dtype/ range, and documented the Python-canonical RNG contract via `diff_diff.bootstrap_utils.stratified_bootstrap_indices`. P3 — `diff_diff/trop_local.py` comment on the pre-generate block rewritten to state that local-method SE is NOT yet bit-identical across backends (only the RNG layer is aligned; two downstream divergences remain in `compute_weight_matrix` normalization and the Python `_compute_observation_weights` stale-cache path; tracked in `TODO.md`). P3 — New `tests/test_rust_backend.py` coverage: `TestTROPRustBackend::test_bootstrap_rejects_negative_index`, `test_bootstrap_rejects_out_of_range_index`, and the corresponding `TestTROPGlobalRustBackend` pair each assert `ValueError` with the expected regex for malformed inputs on both local and global bootstrap functions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-24T01:02:21Z

/ai-review

github-actions · 2026-04-24T01:07:21Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 5ce5b8f05cac9b4be4fc8a0d7ad3cb428d599750

Overall Assessment

✅ Looks good

Highest unmitigated severity: P3 (tracked tech debt only).

Executive Summary

Affected methods: TROP bootstrap SE for method="global" and the RNG plumbing used by method="local".
Methodology check passed: the PR preserves Algorithm 3’s stratified unit-level block bootstrap and only changes where the RNG stream is produced, not the resampling law or variance formula.
The prior P1 from the last review is resolved: both Rust bootstrap entry points now reject malformed index arrays before entering the parallel loop, and there are direct negative / out-of-range regression tests for both local and global paths.
Global backend seed parity is now regression-guarded across three seeds; the remaining local-method backend divergence is explicitly strict-xfailed and tracked in TODO.md rather than overstated.
I did not find a new NaN/Inf inference bug, control-pool bug, or undocumented methodology deviation in the changed diff.
I could not execute pytest in this environment because the pytest executable is unavailable.

Methodology

No findings. The changed path still matches the TROP registry requirement of a block bootstrap preserving panel structure, with stratified resampling of control and treated unit pools; canonicalizing on NumPy-generated index arrays is an implementation choice, not a methodology change. References: docs/methodology/REGISTRY.md:L2000-L2027, diff_diff/bootstrap_utils.py:L31-L68, diff_diff/trop_global.py:L967-L1048, diff_diff/trop_local.py:L963-L1047.

Code Quality

No findings. The previous malformed-index API gap is closed by explicit shape/range validation in both Rust entry points, with matching direct tests for negative and out-of-range indices. References: rust/src/trop.rs:L980-L1022, rust/src/trop.rs:L1838-L1880, tests/test_rust_backend.py:L1030-L1076, tests/test_rust_backend.py:L1303-L1341.

Performance

No findings. Pre-generating integer index matrices adds some O(n_bootstrap × (N0 + N1)) work/memory, but it removes duplicated sampling logic and keeps the expensive estimation loop parallelized in Rust; nothing here looks material relative to TROP bootstrap cost.

Maintainability

No findings. Centralizing stratified resampling in one helper improves consistency across Python and Rust consumers and reduces future parity drift. References: diff_diff/bootstrap_utils.py:L31-L68, diff_diff/trop_global.py:L974-L1015, diff_diff/trop_local.py:L972-L1012.

Tech Debt

P3 Impact: local-method Rust/Python bootstrap SE parity is still not fixed after RNG alignment because of separate downstream mismatches in weight normalization and Python cache usage, but this PR now surfaces and tracks that follow-up instead of claiming it is solved. Concrete fix: resolve the two tracked local-method discrepancies, then convert the strict xfail parity test into a passing regression. References: TODO.md:L86-L87, diff_diff/trop_local.py:L423-L458, rust/src/trop.rs:L573-L606, tests/test_rust_backend.py:L2423-L2474.

Security

No findings.

Documentation/Tests

No findings. The earlier doc/comment mismatch is fixed: the local-path comment now says only the global path is bit-identical after RNG alignment, and the Rust API docs now describe control_indices / treated_indices instead of seed. The new helper invariants and malformed-index rejection cases are also covered in tests. References: diff_diff/trop_local.py:L972-L979, rust/src/trop.rs:L900-L936, rust/src/trop.rs:L1770-L1800, tests/test_bootstrap_utils.py:L199-L269, tests/test_rust_backend.py:L2380-L2474.

…+ Python cache-fallthrough Closes the local-method half of silent-failures audit finding #23 (RNG half closed in PR #354; grid-search half in PR #348). Two methodology fixes, both isolated to the local-method path — global is unaffected. 1. Rust weight-matrix normalization removed ------------------------------------------ `rust/src/trop.rs::compute_weight_matrix` no longer divides `time_weights` and `unit_weights` by their respective sums before the outer product. The paper's Equation 2/3 (Athey, Imbens, Qu, Viviano 2025) and REGISTRY.md Requirements checklist (`[x] Unit weights: exp(-λ_unit × distance) (unnormalized, matching Eq. 2)`) both specify raw-exponential weights; Python's `_compute_observation_weights` was already REGISTRY-compliant. Rust's normalization inflated the effective nuclear-norm penalty relative to the data-fit term, changing the regularization trade-off. User-visible effect: Rust local-method ATT values may shift for fits with `lambda_nn < infinity`. For `lambda_nn = infinity` (factor model disabled) outputs are unchanged — uniform weight scaling leaves the minimum-norm WLS argmin invariant. Rust LOOCV-selected lambdas may also shift on that boundary; both backends now converge on the same selection. Affects both local-method Rust call sites (LOOCV at trop.rs:459, bootstrap at trop.rs:1096). 2. Python `_compute_observation_weights` cache-fallthrough removed --------------------------------------------------------------- Removed the `if self._precomputed is not None:` branch that silently substituted `self._precomputed["Y"]` / `["D"]` / `["time_dist_matrix"]` (original-panel cache populated during main fit) for the function-argument `Y, D`. Under bootstrap, `_fit_with_fixed_lambda` computes fresh `Y, D` from the resampled `boot_data` and passes them in; the helper was discarding those and recomputing unit distances from the original panel, so Python's local bootstrap resampled units but reused stale unit-distance weights. Rust's bootstrap was already correct (always consumed `y_boot, d_boot`). Test changes ------------ - `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity:: test_bootstrap_seed_reproducibility_local`: flipped from `xfail(strict=True)` to passing `assert_allclose` at `atol=1e-5` across seeds `[0, 42, 12345]`. Residual ~1e-7 gap is Rust `estimate_model` vs numpy `lstsq` roundoff that accumulates differently across per-replicate bootstrap fits; follow-up TODO row tracks unifying Rust to the `solve_wls_svd` path (same SVD helper the global-method uses since PR #348) for sub-1e-14 parity. - New `test_local_method_main_fit_parity`: parametrized over `(lambda_nn=inf, atol=1e-14)` and `(lambda_nn=0.1, atol=1e-10)`; asserts `atol=1e-14` bit-identity for the main-fit ATT at `lambda_nn=inf` (the regression guard for the normalization fix) and `atol=1e-10` for the finite-`lambda_nn` FISTA path. Verification ------------ Targeted regression sweep — all green: - 9 `TestTROPRustEdgeCaseParity` tests (grid-search + global bootstrap × 3 seeds + local bootstrap × 3 seeds + local main-fit × 2 regimes) - Full `test_rust_backend.py` suite: 92 passed - Full `test_trop.py` suite under Rust backend: 120 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

P3 — the PR #354 [Unreleased] Fixed entry (line 21) said local-method bit-identity SE remained blocked by the Rust-normalization and Python cache-fallthrough divergences and was "tracked as a follow-up in TODO.md." With the two TROP-local Fixed entries that this PR adds (lines 22-27) closing exactly those divergences, the PR #354 tail sentence is now internally inconsistent with the surrounding entries. Rewritten to say the RNG half of finding #23 is closed here (bootstrap contract), grid-search half was closed in PR #348, and the local- method methodology half is closed by the two Fixed entries that follow in the same release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber force-pushed the fix/trop-bootstrap-rng-parity branch from b10e46c to 225674c Compare April 24, 2026 00:45

igerber added the ready-for-ci Triggers CI test workflows label Apr 24, 2026

igerber merged commit 08bcbfe into main Apr 24, 2026
23 of 24 checks passed

igerber deleted the fix/trop-bootstrap-rng-parity branch April 24, 2026 10:56

igerber mentioned this pull request Apr 24, 2026

Fix TROP local-method backend parity: drop Rust weight normalization + Python cache-fallthrough #358

Merged

This was referenced Apr 24, 2026

Unify multiplier-bootstrap RNG: drop Rust weight helper, numpy canonical #362

Closed

Release 3.3.0: HAD estimator, profile_panel, dCDH by_path, SDID survey complete #368

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix TROP bootstrap SE backend divergence under fixed seed#354

Fix TROP bootstrap SE backend divergence under fixed seed#354
igerber merged 2 commits intomainfrom
fix/trop-bootstrap-rng-parity

igerber commented Apr 24, 2026

Uh oh!

igerber commented Apr 24, 2026

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

igerber commented Apr 24, 2026

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Apr 24, 2026

Summary

Methodology references

Validation

Security / privacy

Uh oh!

igerber commented Apr 24, 2026

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

igerber commented Apr 24, 2026

Uh oh!

github-actions Bot commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant