Skip to content

Fix TROP bootstrap SE backend divergence under fixed seed#354

Merged
igerber merged 2 commits intomainfrom
fix/trop-bootstrap-rng-parity
Apr 24, 2026
Merged

Fix TROP bootstrap SE backend divergence under fixed seed#354
igerber merged 2 commits intomainfrom
fix/trop-bootstrap-rng-parity

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Apr 24, 2026

Summary

Rust and Python TROP backends previously produced different bootstrap SE for the same seed. On a tiny correlated panel under seed=42, the gap was ~28% of SE because Rust seeded rand_xoshiro::Xoshiro256PlusPlus per replicate while Python consumed numpy.random.default_rng (PCG64), so identical seeds mapped to different bytestreams.

This PR canonicalizes on numpy. A new stratified_bootstrap_indices helper in diff_diff/bootstrap_utils.py pre-generates per-replicate (control, treated) positional index arrays from a numpy Generator; both TROP bootstrap Rust functions now accept control_indices/treated_indices as i64 arrays in place of seed: u64. Parallelism preserved; sampling law (stratified: controls then treated, with replacement) unchanged.

  • Global-method bootstrap SE is backend-invariant under the same seed to machine precision. test_bootstrap_seed_reproducibility flipped from xfail(strict=True) to passing assert_allclose(atol=rtol=1e-14) and parametrized over [0, 42, 12345].
  • Companion local-method test added, currently xfail(strict=True) with a new finding: aligning the RNG surfaced two separate local-method backend divergences (weight-matrix normalization in compute_weight_matrix is Rust-only; Python _compute_observation_weights reads self._precomputed["Y"] instead of the bootstrap-sample Y). Both tracked as follow-up rows in TODO.md and will land in a separate methodology PR.
  • Closes the bootstrap half of silent-failures audit finding Address code review feedback for CallawaySantAnna covariates #23 (grid-search half closed in PR Unify Rust TROP inner solver to SVD (close finding #23 grid-search divergence) #348).

Methodology references

  • Method: Triply Robust Panel (TROP) bootstrap SE computation.
  • Paper: Athey, S., Imbens, G. W., Qu, Z., & Viviano, D. (2025). Triply Robust Panel Estimators. Algorithm 3 (stratified unit-level resample).
  • Intentional deviations: None. Sampling law is unchanged — this PR only rearranges which side (Python vs Rust) advances the RNG state, not what the bootstrap resamples. The numpy.random.default_rng (PCG64) is treated as canonical per the repo convention that numpy/scipy paths are the reference implementation for cross-backend parity.

Validation

  • Tests added: TestStratifiedBootstrapIndices (7 tests: shape/dtype, value range, determinism, prefix invariance, hard-coded value pin for default_rng(42), empty-pool edges) in tests/test_bootstrap_utils.py. test_bootstrap_seed_reproducibility_local parametrized over three seeds in tests/test_rust_backend.py.
  • Tests updated: test_bootstrap_seed_reproducibility flipped from xfail to parametrized passing regression guard (atol=rtol=1e-14, seeds [0, 42, 12345]). Four existing direct-call tests (test_bootstrap_variance_shape, test_bootstrap_reproducibility, and their global counterparts) updated to construct stratified index arrays via the new helper instead of passing seed.
  • Verification: full tests/test_rust_backend.py passes (83 passed + 3 xfailed — the local-method seed-parity tests documenting the new finding). tests/test_trop.py -k bootstrap passes (13 tests) under both Rust and pure-Python (DIFF_DIFF_BACKEND=python) backends.

Security / privacy

  • No secrets, credentials, or PII touched. All changes are in source, tests, and documentation files within the repo.

🤖 Generated with Claude Code

Rust and Python TROP backends produced different bootstrap standard
errors for the same `seed` value. On a tiny correlated panel under
`seed=42` the gap was ~28% of SE: Rust seeded `rand_xoshiro::
Xoshiro256PlusPlus` per replicate while Python's fallback consumed
`numpy.random.default_rng` (PCG64), so identical seeds mapped to
different bytestreams.

Canonicalize on numpy. New `stratified_bootstrap_indices` helper in
`diff_diff/bootstrap_utils.py` pre-generates per-replicate
(control, treated) positional index arrays from a numpy `Generator`
and hands them to both backends through the PyO3 surface — both
Rust bootstrap functions (`bootstrap_trop_variance_global`,
`bootstrap_trop_variance`) now accept `control_indices` and
`treated_indices` as `i64` arrays in place of `seed: u64`. Parallelism
is preserved. Sampling law (stratified: controls then treated, with
replacement) is unchanged.

Global-method SE is now backend-invariant under the same seed to
machine precision: the prior `xfail(strict=True)` in
`test_bootstrap_seed_reproducibility` is flipped to a passing
`assert_allclose(atol=rtol=1e-14)` and parametrized over
`[0, 42, 12345]`.

A companion `test_bootstrap_seed_reproducibility_local` is added for
the local-method bootstrap. It is currently `xfail(strict=True)`
because aligning the RNG exposed two separate local-method backend
divergences beyond this PR's scope: Rust's `compute_weight_matrix`
normalizes time and unit weights to sum to 1, while Python's
`_compute_observation_weights` does not; and the Python fallback's
`_compute_observation_weights(_precomputed branch)` reads the
original-panel cached `Y`/`D` instead of the bootstrap-sample
arguments. Both are tracked as follow-up rows in `TODO.md` with
file:line pointers and will land in a separate methodology PR.

Closes the bootstrap half of silent-failures audit finding #23 (the
grid-search half closed in PR #348). Reference: Athey, Imbens, Qu &
Viviano (2025), "Triply Robust Panel Estimators", Algorithm 3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the fix/trop-bootstrap-rng-parity branch from b10e46c to 225674c Compare April 24, 2026 00:45
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 225674cda82436b710b404d327ac16113890a965


Overall Assessment

⚠️ Needs changes

Highest unmitigated finding: P1 input-validation gap in the new Rust bootstrap API.

Executive Summary

  • Affected methods: TROP bootstrap SE for method="global" and the RNG plumbing used by method="local". The changed global path still matches Algorithm 3’s stratified block-bootstrap law: resample N0 control units and N1 treated units with replacement, then recompute TROP on each bootstrap draw. citeturn0view0
  • I did not find a new undocumented methodology, weighting, or variance/SE defect in the global fix.
  • The new Rust control_indices / treated_indices surface validates shape but not element range, so malformed arrays are handled by Rust bounds behavior instead of a clean ValueError.
  • The remaining local-method backend divergence is surfaced rather than hidden (tests/test_rust_backend.py:L2335-L2356, TODO.md:L86-L87), but some inline comments still overstate what this PR accomplished.

Methodology

  • No unmitigated findings. The shared-index implementation preserves the paper/registry bootstrap contract for the changed global path rather than changing the resampling law. diff_diff/bootstrap_utils.py:L31-L68, diff_diff/trop_global.py:L967-L1045, rust/src/trop.rs:L1777-L1841, docs/methodology/REGISTRY.md:L2000-L2027. citeturn0view0

Code Quality

  • P1 Impact: the new raw-index Rust entry points only check array shape. Negative or out-of-range entries are cast with as usize and then indexed directly via original_control_units[idx] / original_treated_units[idx], so bad inputs fail inside Rust instead of producing a deterministic validation error. This is new in the PR because the old interface accepted only seed. Concrete fix: before entering the parallel loop, validate every element of control_indices and treated_indices in both Rust functions and raise PyValueError when any index is < 0 or >= its pool size. rust/src/trop.rs:L971-L1005, rust/src/trop.rs:L1791-L1840

Performance

  • No material findings in the changed path.

Maintainability

  • No additional findings beyond the documentation mismatch below.

Tech Debt

  • P3 Impact: the PR properly records the newly surfaced local-method Rust/Python divergence as follow-up work instead of burying it. That is useful tracking, but parity for method="local" remains open. Concrete fix: implement the follow-up described in TODO.md:L86-L87 and then convert tests/test_rust_backend.py:L2335-L2356 from strict xfail to a passing regression test.

Security

  • No findings.

Documentation/Tests

  • P3 Impact: the local-path comments are inconsistent with the actual state of the code. diff_diff/trop_local.py:L972-L974 says shared indices make SE identical under the same seed, but the new strict xfail and TODO entry explicitly say local SE parity is still unresolved. Concrete fix: change that comment to say the PR aligns only the RNG layer for local bootstrap; full SE parity still needs the follow-up in TODO.md:L86-L87.
  • P3 Impact: the Rust function header comments still describe a removed seed argument even though the PyO3 signatures now take control_indices / treated_indices. Concrete fix: update the arg docs to match the new API. rust/src/trop.rs:L912-L927, rust/src/trop.rs:L1737-L1752
  • P3 Impact: the new tests cover helper invariants and happy-path reproducibility, but not malformed raw-index rejection for the new Rust surface. Concrete fix: add direct tests for both bootstrap_trop_variance and bootstrap_trop_variance_global asserting clean errors on one negative index and one too-large index. tests/test_bootstrap_utils.py:L207-L269, tests/test_rust_backend.py:L979-L1028, tests/test_rust_backend.py:L1213-L1253

Path to Approval

  1. Add explicit range validation for control_indices and treated_indices in both Rust bootstrap entry points, returning PyValueError for negative or out-of-pool values.
  2. Add direct regression tests covering malformed index arrays for both local and global Rust bootstrap functions.
  3. Fix the local-path comments and Rust arg docs so they match the actual post-PR state: global seed parity is fixed; local seed parity is not yet fixed; the Rust API now accepts index arrays rather than seed.

P1 — Rust `bootstrap_trop_variance` and `bootstrap_trop_variance_global`
now validate every element of `control_indices` and `treated_indices`
before entering the parallel loop. Negative values and values `>= pool
size` raise `PyValueError` with the offending index, rather than
panicking on a negative-to-`usize` cast or indexing `original_*_units`
out of bounds inside the rayon closure.

P3 — `rust/src/trop.rs` function header docs rewritten for both
bootstrap entry points. Removed the stale `seed` argument; added
description of `control_indices` and `treated_indices` shape/dtype/
range, and documented the Python-canonical RNG contract via
`diff_diff.bootstrap_utils.stratified_bootstrap_indices`.

P3 — `diff_diff/trop_local.py` comment on the pre-generate block
rewritten to state that local-method SE is NOT yet bit-identical
across backends (only the RNG layer is aligned; two downstream
divergences remain in `compute_weight_matrix` normalization and the
Python `_compute_observation_weights` stale-cache path; tracked in
`TODO.md`).

P3 — New `tests/test_rust_backend.py` coverage:
`TestTROPRustBackend::test_bootstrap_rejects_negative_index`,
`test_bootstrap_rejects_out_of_range_index`, and the corresponding
`TestTROPGlobalRustBackend` pair each assert `ValueError` with the
expected regex for malformed inputs on both local and global
bootstrap functions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 5ce5b8f05cac9b4be4fc8a0d7ad3cb428d599750


Overall Assessment

✅ Looks good

Highest unmitigated severity: P3 (tracked tech debt only).

Executive Summary

  • Affected methods: TROP bootstrap SE for method="global" and the RNG plumbing used by method="local".
  • Methodology check passed: the PR preserves Algorithm 3’s stratified unit-level block bootstrap and only changes where the RNG stream is produced, not the resampling law or variance formula.
  • The prior P1 from the last review is resolved: both Rust bootstrap entry points now reject malformed index arrays before entering the parallel loop, and there are direct negative / out-of-range regression tests for both local and global paths.
  • Global backend seed parity is now regression-guarded across three seeds; the remaining local-method backend divergence is explicitly strict-xfailed and tracked in TODO.md rather than overstated.
  • I did not find a new NaN/Inf inference bug, control-pool bug, or undocumented methodology deviation in the changed diff.
  • I could not execute pytest in this environment because the pytest executable is unavailable.

Methodology

  • No findings. The changed path still matches the TROP registry requirement of a block bootstrap preserving panel structure, with stratified resampling of control and treated unit pools; canonicalizing on NumPy-generated index arrays is an implementation choice, not a methodology change. References: docs/methodology/REGISTRY.md:L2000-L2027, diff_diff/bootstrap_utils.py:L31-L68, diff_diff/trop_global.py:L967-L1048, diff_diff/trop_local.py:L963-L1047.

Code Quality

  • No findings. The previous malformed-index API gap is closed by explicit shape/range validation in both Rust entry points, with matching direct tests for negative and out-of-range indices. References: rust/src/trop.rs:L980-L1022, rust/src/trop.rs:L1838-L1880, tests/test_rust_backend.py:L1030-L1076, tests/test_rust_backend.py:L1303-L1341.

Performance

  • No findings. Pre-generating integer index matrices adds some O(n_bootstrap × (N0 + N1)) work/memory, but it removes duplicated sampling logic and keeps the expensive estimation loop parallelized in Rust; nothing here looks material relative to TROP bootstrap cost.

Maintainability

  • No findings. Centralizing stratified resampling in one helper improves consistency across Python and Rust consumers and reduces future parity drift. References: diff_diff/bootstrap_utils.py:L31-L68, diff_diff/trop_global.py:L974-L1015, diff_diff/trop_local.py:L972-L1012.

Tech Debt

  • P3 Impact: local-method Rust/Python bootstrap SE parity is still not fixed after RNG alignment because of separate downstream mismatches in weight normalization and Python cache usage, but this PR now surfaces and tracks that follow-up instead of claiming it is solved. Concrete fix: resolve the two tracked local-method discrepancies, then convert the strict xfail parity test into a passing regression. References: TODO.md:L86-L87, diff_diff/trop_local.py:L423-L458, rust/src/trop.rs:L573-L606, tests/test_rust_backend.py:L2423-L2474.

Security

  • No findings.

Documentation/Tests

  • No findings. The earlier doc/comment mismatch is fixed: the local-path comment now says only the global path is bit-identical after RNG alignment, and the Rust API docs now describe control_indices / treated_indices instead of seed. The new helper invariants and malformed-index rejection cases are also covered in tests. References: diff_diff/trop_local.py:L972-L979, rust/src/trop.rs:L900-L936, rust/src/trop.rs:L1770-L1800, tests/test_bootstrap_utils.py:L199-L269, tests/test_rust_backend.py:L2380-L2474.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 24, 2026
@igerber igerber merged commit 08bcbfe into main Apr 24, 2026
23 of 24 checks passed
@igerber igerber deleted the fix/trop-bootstrap-rng-parity branch April 24, 2026 10:56
igerber added a commit that referenced this pull request Apr 24, 2026
…+ Python cache-fallthrough

Closes the local-method half of silent-failures audit finding #23
(RNG half closed in PR #354; grid-search half in PR #348). Two
methodology fixes, both isolated to the local-method path — global
is unaffected.

1. Rust weight-matrix normalization removed
   ------------------------------------------
   `rust/src/trop.rs::compute_weight_matrix` no longer divides
   `time_weights` and `unit_weights` by their respective sums before
   the outer product. The paper's Equation 2/3 (Athey, Imbens, Qu,
   Viviano 2025) and REGISTRY.md Requirements checklist
   (`[x] Unit weights: exp(-λ_unit × distance) (unnormalized,
   matching Eq. 2)`) both specify raw-exponential weights; Python's
   `_compute_observation_weights` was already REGISTRY-compliant.
   Rust's normalization inflated the effective nuclear-norm penalty
   relative to the data-fit term, changing the regularization
   trade-off. User-visible effect: Rust local-method ATT values may
   shift for fits with `lambda_nn < infinity`. For
   `lambda_nn = infinity` (factor model disabled) outputs are
   unchanged — uniform weight scaling leaves the minimum-norm WLS
   argmin invariant. Rust LOOCV-selected lambdas may also shift on
   that boundary; both backends now converge on the same selection.
   Affects both local-method Rust call sites (LOOCV at trop.rs:459,
   bootstrap at trop.rs:1096).

2. Python `_compute_observation_weights` cache-fallthrough removed
   ---------------------------------------------------------------
   Removed the `if self._precomputed is not None:` branch that
   silently substituted `self._precomputed["Y"]` / `["D"]` /
   `["time_dist_matrix"]` (original-panel cache populated during
   main fit) for the function-argument `Y, D`. Under bootstrap,
   `_fit_with_fixed_lambda` computes fresh `Y, D` from the resampled
   `boot_data` and passes them in; the helper was discarding those
   and recomputing unit distances from the original panel, so
   Python's local bootstrap resampled units but reused stale
   unit-distance weights. Rust's bootstrap was already correct
   (always consumed `y_boot, d_boot`).

Test changes
------------
- `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::
  test_bootstrap_seed_reproducibility_local`: flipped from
  `xfail(strict=True)` to passing `assert_allclose` at `atol=1e-5`
  across seeds `[0, 42, 12345]`. Residual ~1e-7 gap is Rust
  `estimate_model` vs numpy `lstsq` roundoff that accumulates
  differently across per-replicate bootstrap fits; follow-up TODO
  row tracks unifying Rust to the `solve_wls_svd` path (same SVD
  helper the global-method uses since PR #348) for sub-1e-14
  parity.
- New `test_local_method_main_fit_parity`: parametrized over
  `(lambda_nn=inf, atol=1e-14)` and `(lambda_nn=0.1, atol=1e-10)`;
  asserts `atol=1e-14` bit-identity for the main-fit ATT at
  `lambda_nn=inf` (the regression guard for the normalization fix)
  and `atol=1e-10` for the finite-`lambda_nn` FISTA path.

Verification
------------
Targeted regression sweep — all green:
- 9 `TestTROPRustEdgeCaseParity` tests (grid-search + global
  bootstrap × 3 seeds + local bootstrap × 3 seeds + local main-fit
  × 2 regimes)
- Full `test_rust_backend.py` suite: 92 passed
- Full `test_trop.py` suite under Rust backend: 120 passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
P3 — the PR #354 [Unreleased] Fixed entry (line 21) said local-method
bit-identity SE remained blocked by the Rust-normalization and Python
cache-fallthrough divergences and was "tracked as a follow-up in
TODO.md." With the two TROP-local Fixed entries that this PR adds
(lines 22-27) closing exactly those divergences, the PR #354 tail
sentence is now internally inconsistent with the surrounding entries.
Rewritten to say the RNG half of finding #23 is closed here (bootstrap
contract), grid-search half was closed in PR #348, and the local-
method methodology half is closed by the two Fixed entries that
follow in the same release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant