diff --git a/CHANGELOG.md b/CHANGELOG.md index 3e40e919..d83b93fe 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -18,7 +18,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - SyntheticDiD `variance_method="bootstrap"` now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's `synthdid::vcov()` convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. `is_significant` and `significance_stars` are derived from `p_value` and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates are unaffected. - SyntheticDiD bootstrap SE formula applies the `sqrt((r-1)/r)` correction matching R's synthdid and the placebo SE formula. - SyntheticDiD bootstrap now retries degenerate resamples (all-control or all-treated, or non-finite `τ_b`) until exactly `n_bootstrap` valid replicates are accumulated, matching R's `synthdid::bootstrap_sample` and Arkhangelsky et al. (2021) Algorithm 2. Previously the Python path counted attempts (with degenerate draws silently dropped), producing fewer valid replicates than requested. A bounded-attempt guard (`20 × n_bootstrap`) prevents pathological-input hangs. -- **TROP global bootstrap SE backend parity under fixed seed** — Rust and Python backends now produce bit-identical bootstrap SE under the same `seed`. Previously Rust's `bootstrap_trop_variance_global` seeded `rand_xoshiro::Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), producing ~28% SE divergence on tiny panels under `seed=42`. Fixed by extracting a shared `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` that pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. 
Closes silent-failures audit finding #23 (bootstrap half; grid-search half closed in PR #348). Local-method TROP also adopts the Python-canonical index contract for the RNG layer, but separately-discovered backend divergences (Rust normalizes weight-matrix outer product, Python `_compute_observation_weights` reads stale `_precomputed` cache) prevent local-method bit-identity SE; tracked as a follow-up in `TODO.md`. +- **TROP global bootstrap SE backend parity under fixed seed** — Rust and Python backends now produce bit-identical bootstrap SE under the same `seed`. Previously Rust's `bootstrap_trop_variance_global` seeded `rand_xoshiro::Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), producing ~28% SE divergence on tiny panels under `seed=42`. Fixed by extracting a shared `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` that pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. Closes the bootstrap-RNG portion of silent-failures audit finding #23 (the grid-search portion closed in PR #348; the local-method methodology portion is closed by the two Fixed entries below). Local-method TROP also adopts the Python-canonical index contract for the RNG layer here. +- **TROP local-method Rust weight-matrix no longer normalized** — `rust/src/trop.rs::compute_weight_matrix` no longer divides time-weights or unit-weights by their respective sums before the outer product. The paper's Equation 2/3 (Athey, Imbens, Qu, Viviano 2025) and REGISTRY.md Requirements checklist (line 2037: `[x] Unit weights: exp(-λ_unit × distance) (unnormalized, matching Eq. 2)`) both specify raw-exponential weights; Python's `_compute_observation_weights` was already REGISTRY-compliant. 
**User-visible effect**: Rust local-method ATT values may shift for any fit with `lambda_nn < infinity` — normalizing the weight-matrix inflated the effective nuclear-norm penalty relative to the data-fit term, changing the regularization trade-off. For `lambda_nn = infinity` (factor model disabled) outputs are unchanged because uniform weight scaling leaves the minimum-norm WLS argmin invariant. Rust LOOCV-selected lambdas may also shift on this boundary; both backends now converge on the same REGISTRY-compliant selection. +- **TROP local-method Python `_compute_observation_weights` now uses the function-argument `Y, D` and treats all non-target units as donors** — two coupled changes that bring Python structurally in line with Rust and the paper's Eq. 2/3: + 1. Removed the `if self._precomputed is not None:` branch that silently substituted `self._precomputed["Y"]` / `["D"]` / `["time_dist_matrix"]` (original-panel cache populated during main fit) for the function-argument `Y, D`. Under bootstrap, `_fit_with_fixed_lambda` computes fresh `Y, D` from the resampled `boot_data` and passes them in; the helper was discarding those and recomputing unit distances from the original panel, so Python's local bootstrap resampled units but reused stale unit-distance weights. Rust's bootstrap was already correct (always consumed `y_boot, d_boot`). + 2. Removed the `valid_control_at_t = D[t, :] == 0` target-period donor gate that zeroed `ω_j` for any unit `j` treated at the target period (other than the target unit itself). Per REGISTRY Eq. 2/3 and Rust's `compute_weight_matrix`, `ω_j = exp(-λ_unit × dist(j, i))` for all `j ≠ i`; treated-cell exclusion happens via the `(1 − W_{js})` factor applied inside `_estimate_model`. Same-cohort donors now contribute via their pre-treatment rows. 
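The Eq. 2/3 weight construction both backends now share can be sketched in numpy (a minimal illustration, not the library's code; `unit_dist` is a hypothetical stand-in for the per-observation distances that `_compute_unit_distance_for_obs` produces):

```python
import numpy as np

def observation_weights_sketch(t, i, lambda_time, lambda_unit, unit_dist, n_periods):
    """Raw-exponential Eq. 2/3 weights for target observation (i, t).

    unit_dist[j] stands in for dist(j, i) with the target period excluded.
    No target-period gate and no normalization: treated-cell exclusion
    happens later via the (1 - W_js) control mask during estimation.
    """
    # Time weights: theta_s = exp(-lambda_time * |t - s|), unnormalized
    theta = np.exp(-lambda_time * np.abs(np.arange(n_periods) - t))
    # Unit weights: omega_j = exp(-lambda_unit * dist(j, i)) for all j != i
    omega = np.exp(-lambda_unit * unit_dist)
    omega[i] = 1.0  # target unit's own weight; masked out downstream
    # Outer product gives the (n_periods x n_units) observation-weight matrix
    return np.outer(theta, omega)
```

Because neither `theta` nor `omega` is divided by its sum, scaling `lambda_unit` changes the data-fit-versus-penalty trade-off exactly as the entry above describes for finite `lambda_nn`.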
Empirically the main-fit ATT is unchanged on tested fixtures because same-cohort pre-treatment observations are exactly absorbed by their own unit fixed effect `alpha_j` without propagating into `mu`, `beta`, or other units' parameters — so this change is structural alignment rather than a numerical shift in output. Users on same-cohort panels with very few controls may still see tiny differences in edge cases; the new `test_local_method_same_cohort_donor_parity` regression guards the aligned behavior. + Together with the normalization fix above, TROP local-method backend parity on the main-fit ATT is regime-dependent: `atol=rtol=1e-14` for `lambda_nn=inf` (no nuclear-norm regularization, uniform weight scaling leaves the WLS argmin invariant) and `atol=1e-10` for finite `lambda_nn` (FISTA inner loop + BLAS reduction ordering introduce sub-1e-10 roundoff across Rust `faer` vs numpy paths). Bootstrap SE parity is asserted at `atol=1e-5` to accommodate ~1e-7 roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq` that accumulates across per-replicate fits; sub-1e-14 bootstrap parity is tracked as a follow-up in `TODO.md` under "unify Rust local-method solver path". Closes silent-failures audit finding #23 (local-method portion; the RNG portion closed in PR #354 and the grid-search portion in PR #348). ### Changed - **`did_had_pretest_workflow(aggregate="event_study")` verdict no longer emits the "paper step 2 deferred to Phase 3 follow-up" caveat** — the joint pre-trends Stute test closes that gap. The two-period `aggregate="overall"` path retains the existing caveat since the joint variant does not apply to single-pre-period panels. Downstream code that greps verdict strings for the Phase 3 caveat will see it suppressed on the event-study path. diff --git a/TODO.md b/TODO.md index 0f15a8c6..3c4853d6 100644 --- a/TODO.md +++ b/TODO.md @@ -83,8 +83,8 @@ Deferred items from PR reviews that were not addressed before merge. 
| Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium | | Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. | `benchmarks/R/generate_clubsandwich_golden.R` | Phase 1a | Medium | | `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low | -| TROP local-method Rust vs Python bootstrap SE divergence beyond RNG: with stratified indices now Python-canonical (closes RNG axis-H), local-method SE still diverges by 10-20% on tiny panels. Two downstream causes: (a) Rust `compute_weight_matrix` (`rust/src/trop.rs:573-606`) normalizes time_weights and unit_weights to sum to 1 before the outer product; Python `_compute_observation_weights` (`diff_diff/trop_local.py:489`) does not normalize. (b) Python `_compute_observation_weights` reads `self._precomputed["Y"]`, `["D"]`, `["time_dist_matrix"]` (lines 423-458) — original-panel cache — rather than the bootstrap-sample data passed via the function arguments, so unit-distance computation in the Python fallback uses stale data. 
`@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local` baselines the gap across `seeds=[0, 42, 12345]`. Affects local-method TROP backend parity broadly, not just bootstrap. | `trop_local.py`, `rust/src/trop.rs` | follow-up | Medium | -| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #NNN): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium | +| Unify Rust local-method `estimate_model` solver path to `solve_wls_svd` (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (`tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local`) passes at `atol=1e-5` — the residual ~1e-7 gap is roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq`, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (`atol=1e-14` for `lambda_nn=inf`, `atol=1e-10` for finite `lambda_nn` — see `test_local_method_main_fit_parity`); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug. 
| `rust/src/trop.rs::estimate_model`, `rust/src/linalg.rs::solve_wls_svd` | follow-up | Low | +| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium | | `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low | | `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium | | `bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. 
nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. | `diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Medium | diff --git a/diff_diff/trop_local.py b/diff_diff/trop_local.py index 39cce9fc..6efa3d7a 100644 --- a/diff_diff/trop_local.py +++ b/diff_diff/trop_local.py @@ -385,13 +385,18 @@ def _compute_observation_weights( - Time weights theta_s^{i,t} = exp(-lambda_time * |t - s|) - Unit weights omega_j^{i,t} = exp(-lambda_unit * dist_unit_{-t}(j, i)) - IMPORTANT (Issue A fix): The paper's objective sums over ALL observations - where (1 - W_js) is non-zero, which includes pre-treatment observations of - eventually-treated units since W_js = 0 for those. This method computes - weights for ALL units where D[t, j] = 0 at the target period, not just - never-treated units. - - Uses pre-computed structures when available for efficiency. + Weights are assigned for every unit ``j != i`` (distance-based, per + Eq. 2/3). Treated-cell exclusion is handled by the `(1 - W_{js})` + factor applied inside ``_estimate_model`` via the control mask, not + by gating ``ω_j`` on ``D[t, j]``. Same-cohort donors therefore + contribute via their pre-treatment rows, and future-cohort donors + contribute via rows where both units are still untreated. + + Always computes from the function-argument ``Y, D``; does not read + ``self._precomputed``. Under bootstrap the caller passes resampled + ``Y, D``, and a prior version of this method silently fell through to + the original-panel cache via a ``_precomputed`` branch, producing + stale unit distances. Parameters ---------- @@ -420,61 +425,24 @@ def _compute_observation_weights( np.ndarray Weight matrix (n_periods x n_units) for observation (i, t). 
""" - # Use pre-computed structures when available - if self._precomputed is not None: - # Time weights from pre-computed time distance matrix - # time_dist_matrix[t, s] = |t - s| - time_weights = np.exp(-lambda_time * self._precomputed["time_dist_matrix"][t, :]) - - # Unit weights - computed for ALL units where D[t, j] = 0 - # (Issue A fix: includes pre-treatment obs of eventually-treated units) - unit_weights = np.zeros(n_units) - D_stored = self._precomputed["D"] - Y_stored = self._precomputed["Y"] - - # Valid control units at time t: D[t, j] == 0 - valid_control_at_t = D_stored[t, :] == 0 - - if lambda_unit == 0: - # Uniform weights when lambda_unit = 0 - # All units not treated at time t get weight 1 - unit_weights[valid_control_at_t] = 1.0 - else: - # Use observation-specific distances with target period excluded - # (Issue B fix: compute exact per-observation distance) - for j in range(n_units): - if valid_control_at_t[j] and j != i: - # Compute distance excluding target period t - dist = self._compute_unit_distance_for_obs(Y_stored, D_stored, j, i, t) - if np.isinf(dist): - unit_weights[j] = 0.0 - else: - unit_weights[j] = np.exp(-lambda_unit * dist) - - # Treated unit i gets weight 1 - unit_weights[i] = 1.0 - - # Weight matrix: outer product (n_periods x n_units) - return np.outer(time_weights, unit_weights) - - # Fallback: compute from scratch (used in bootstrap) # Time distance: |t - s| following paper's Equation 3 (page 7) dist_time = np.abs(np.arange(n_periods) - t) time_weights = np.exp(-lambda_time * dist_time) - # Unit weights - computed for ALL units where D[t, j] = 0 - # (Issue A fix: includes pre-treatment obs of eventually-treated units) + # Unit weights ω_j = exp(-λ_unit × dist(j, i)) for all j ≠ i per Eq. 2/3. 
+ # No target-period gate: same-cohort donors enter with distance-based + # weight (their pre-treatment rows contribute via theta_s * omega_j; + # their post-treatment cells are zeroed by the control mask (1-D_{js}) + # applied inside ``_estimate_model``). Matches Rust's compute_weight_matrix. unit_weights = np.zeros(n_units) - # Valid control units at time t: D[t, j] == 0 - valid_control_at_t = D[t, :] == 0 - if lambda_unit == 0: - # Uniform weights when lambda_unit = 0 - unit_weights[valid_control_at_t] = 1.0 + # Uniform weights when lambda_unit = 0 — all units get 1. + # Control masking in _estimate_model handles treated-cell exclusion. + unit_weights[:] = 1.0 else: for j in range(n_units): - if valid_control_at_t[j] and j != i: + if j != i: # Compute distance excluding target period t (Issue B fix) dist = self._compute_unit_distance_for_obs(Y, D, j, i, t) if np.isinf(dist): @@ -482,8 +450,8 @@ def _compute_observation_weights( else: unit_weights[j] = np.exp(-lambda_unit * dist) - # Treated unit i gets weight 1 (or could be omitted since we fit on controls) - # We include treated unit's own observation for model fitting + # Target unit gets weight 1 (will be masked out in estimation via + # the control mask, matching the paper's (1-W_{js}) factor). unit_weights[i] = 1.0 # Weight matrix: outer product (n_periods x n_units) @@ -970,13 +938,11 @@ def _bootstrap_variance( n_control_units = len(control_units) # Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG). - # This aligns the RNG layer between backends. For the global method the RNG - # was the only divergence, so global SE is now bit-identical under the same - # seed. For the local method two downstream divergences remain (Rust weight- - # matrix normalization + Python `_compute_observation_weights` reading the - # stale `_precomputed` cache) — tracked in TODO.md; until those land, local - # bootstrap SE still differs across backends. 
Silent-failures finding #23 - # (bootstrap half) closed for global; local follow-up queued. + # Aligns the RNG layer between backends. Combined with the Rust weight- + # matrix de-normalization and the Python `_compute_observation_weights` + # cache-fallthrough removal (also shipped with this parity work), local- + # method Rust and Python produce matching bootstrap SE up to solver-path + # roundoff (~1e-7); asserted at atol=1e-5 in the parity regression guard. rng = np.random.default_rng(self.seed) control_idx, treated_idx = stratified_bootstrap_indices( rng, n_control_units, n_treated_units, self.n_bootstrap ) diff --git a/rust/src/trop.rs b/rust/src/trop.rs index ac98b44f..28cc2392 100644 --- a/rust/src/trop.rs +++ b/rust/src/trop.rs @@ -548,9 +548,11 @@ fn compute_unit_distance_for_obs( /// Unit weights: ω_j = exp(-λ_unit × dist(j, i)) /// /// Paper alignment notes: -/// - ALL units get weights (not just those untreated at target period) +/// - ALL units with `j != target_unit` get distance-based weights +/// (same-cohort donors contribute via their pre-treatment rows) /// - The (1 - D_js) masking in the loss naturally excludes treated cells -/// - Weights are normalized to sum to 1 (probability weights) +/// via the control mask applied inside `estimate_model` +/// - Weights are unnormalized raw exponentials per REGISTRY Eq. 2/3 /// - Distance excludes target period t per Equation 3 #[allow(clippy::too_many_arguments)] fn compute_weight_matrix( @@ -565,19 +567,15 @@ time_dist: &ArrayView2<i64>, ) -> Array2<f64> { // Time weights for this target period: θ_s = exp(-λ_time × |t - s|) - let mut time_weights: Array1<f64> = Array1::from_shape_fn(n_periods, |s| { + // Unnormalized per REGISTRY Eq. 2/3. 
+ let time_weights: Array1<f64> = Array1::from_shape_fn(n_periods, |s| { let dist = time_dist[[target_period, s]] as f64; (-lambda_time * dist).exp() }); - // Normalize time weights to sum to 1 - let time_sum: f64 = time_weights.sum(); - if time_sum > 0.0 { - time_weights /= time_sum; - } - // Unit weights: ω_j = exp(-λ_unit × dist(j, i)) - // Paper alignment: compute for ALL units, let control masking handle exclusion + // Paper alignment: compute for ALL units, let control masking handle exclusion. + // Unnormalized per REGISTRY Eq. 2/3. let mut unit_weights = Array1::<f64>::zeros(n_units); if lambda_unit == 0.0 { @@ -600,14 +598,7 @@ // Target unit gets weight 1 (will be masked out in estimation anyway) unit_weights[target_unit] = 1.0; - // Normalize unit weights to sum to 1 - let unit_sum: f64 = unit_weights.sum(); - if unit_sum > 0.0 { - unit_weights /= unit_sum; - } - - // Outer product: W[t, i] = time_weights[t] * unit_weights[i] - // Result is normalized since both components sum to 1 + // Outer product: W[t, i] = θ_s × ω_j (raw exponentials, unnormalized) let mut weight_matrix = Array2::<f64>::zeros((n_periods, n_units)); for t in 0..n_periods { for i in 0..n_units { diff --git a/tests/test_rust_backend.py b/tests/test_rust_backend.py index 1ecc8af8..48d1f0e5 100644 --- a/tests/test_rust_backend.py +++ b/tests/test_rust_backend.py @@ -2420,28 +2420,37 @@ def test_bootstrap_seed_reproducibility(self, seed): f"Rust={res_rust.se:.16f}, Python={res_py.se:.16f}", ) - @pytest.mark.xfail( - strict=True, - reason="Local-method TROP has additional Rust-vs-Python backend " - "divergences beyond RNG (weight-matrix normalization in " - "`compute_weight_matrix` is Rust-only; Python " - "`_compute_observation_weights` reads `self._precomputed['Y']` " - "instead of the bootstrap-sample `Y`). 
RNG parity infrastructure " - "is in place (both backends now consume the same Python-canonical " - "stratified indices), but post-resampling computations diverge so " - "SE is not yet bit-identical. Tracked as a separate finding in " - "TODO.md. This xfail baselines the gap so we notice when both " - "downstream divergences are closed.", + @pytest.mark.parametrize( + "seed,lambda_nn", + [ + (0, np.inf), + (42, np.inf), + (12345, np.inf), + (42, 0.1), + ], ) - @pytest.mark.parametrize("seed", [0, 42, 12345]) - def test_bootstrap_seed_reproducibility_local(self, seed): - """Bootstrap SE parity under a fixed seed (local method). - - Currently xfail(strict=True) — see decorator. The local-method - path has additional backend divergences beyond RNG that this PR - does not address. Companion to the global-method parity test - above; once the downstream divergences close, this test will - XPASS and should be promoted to a regression guard. + def test_bootstrap_seed_reproducibility_local(self, seed, lambda_nn): + """Backend-invariant bootstrap SE parity for the local method. + + Post-methodology-alignment regression guard covering both the + ``lambda_nn=inf`` regime (no-lowrank path, closed by the Python + ``_precomputed`` cache-fallthrough removal) and the finite + ``lambda_nn`` regime (with-lowrank FISTA path, closed by the Rust + weight-matrix normalization removal). With the RNG fix from + PR #354 plus both methodology fixes landed here, local-method + Rust and Python bootstraps consume bit-identical stratified + indices AND bit-identical raw-exponential weights. Main-fit ATT + matches to regime-dependent tolerance (see + ``test_local_method_main_fit_parity``), + but per-replicate bootstrap fits route through Rust's + ``estimate_model`` vs numpy's ``lstsq``, which use different + matrix factorization paths and accumulate different BLAS + roundoff. 
Empirically the residual gap is ~1e-7 relative; + asserted at ``atol=1e-5`` which is ~100x the observed gap and + comfortable across CI runner variance. + + Follow-up to tighten to ``atol=1e-14``: unify Rust + ``estimate_model`` to use ``solve_wls_svd`` (the same SVD path + used by global-method since PR #348). Tracked in ``TODO.md``. """ import sys from unittest.mock import patch @@ -2453,7 +2462,7 @@ def test_bootstrap_seed_reproducibility_local(self, seed): method="local", lambda_time_grid=[1.0], lambda_unit_grid=[1.0], - lambda_nn_grid=[np.inf], + lambda_nn_grid=[lambda_nn], n_bootstrap=10, seed=seed, ) @@ -2468,11 +2477,144 @@ def test_bootstrap_seed_reproducibility_local(self, seed): res_py = trop_py.fit(df.copy(), "outcome", "treated", "unit", "time") np.testing.assert_allclose( - res_rust.se, res_py.se, atol=1e-14, rtol=1e-14, - err_msg=f"Local-method bootstrap SE divergence under seed={seed}: " + res_rust.se, res_py.se, atol=1e-5, rtol=1e-5, + err_msg=f"Local-method bootstrap SE divergence under " + f"seed={seed}, lambda_nn={lambda_nn}: " f"Rust={res_rust.se:.16f}, Python={res_py.se:.16f}", ) + @pytest.mark.parametrize( + "lambda_nn,tol", + [ + (np.inf, 1e-14), + (0.1, 1e-10), + ], + ) + def test_local_method_main_fit_parity(self, lambda_nn, tol): + """Backend-invariant ATT parity for the local-method main fit. + + Companion to the bootstrap seed-parity test above. Exercises both + regimes: ``lambda_nn=inf`` (no-lowrank, bit-identical minimum-norm + WLS argmin under aligned raw-exponential weights) and a finite + ``lambda_nn`` (with-lowrank, FISTA inner loop; tolerance relaxed + to ``1e-10`` because FISTA iteration ordering and BLAS reduction + ordering introduce sub-1e-10 noise across Rust faer and numpy BLAS + paths). + + Regression guard for the normalization fix and cache-fallthrough + fix landed in this PR. 
Before the fix, Rust ATT diverged from + Python ATT by roughly 10% at finite ``lambda_nn`` and not at all + at ``lambda_nn=inf``; after the fix both regimes match to tolerance. + + Uses multi-candidate lambda grids so LOOCV selection exercises + Rust's `compute_weight_matrix` (the surface the normalization fix + changed). Patches both the LOOCV dispatch in ``diff_diff.trop`` + and the bootstrap dispatch in ``diff_diff.trop_local`` on the + Python side so the comparison is Rust-LOOCV-and-fit vs + Python-LOOCV-and-fit end-to-end. + """ + import sys + from unittest.mock import patch + + from diff_diff import TROP + + df = self._make_correlated_panel(n_units=6, n_periods=6, n_treated=2) + # Multi-candidate grids so LOOCV selection isn't trivial; the Rust + # weight-normalization fix changes per-lambda LOOCV scores and thus + # potentially the selected lambda. + trop_params = dict( + method="local", + lambda_time_grid=[0.1, 1.0, 10.0], + lambda_unit_grid=[0.1, 1.0, 10.0], + lambda_nn_grid=[lambda_nn], + n_bootstrap=2, # minimum allowed; we assert ATT, not SE + seed=42, + ) + + trop_rust = TROP(**trop_params) + res_rust = trop_rust.fit(df.copy(), "outcome", "treated", "unit", "time") + + trop_module = sys.modules["diff_diff.trop"] + trop_local_module = sys.modules["diff_diff.trop_local"] + with patch.object(trop_module, "HAS_RUST_BACKEND", False), \ + patch.object(trop_module, "_rust_loocv_grid_search", None), \ + patch.object(trop_local_module, "HAS_RUST_BACKEND", False), \ + patch.object(trop_local_module, "_rust_bootstrap_trop_variance", None): + trop_py = TROP(**trop_params) + res_py = trop_py.fit(df.copy(), "outcome", "treated", "unit", "time") + + np.testing.assert_allclose( + res_rust.att, res_py.att, atol=tol, rtol=tol, + err_msg=f"Local-method ATT divergence at lambda_nn={lambda_nn}: " + f"Rust={res_rust.att:.16f}, Python={res_py.att:.16f}", + ) + + @pytest.mark.parametrize("lambda_nn,tol", [(np.inf, 1e-14), (0.1, 1e-10)]) + def 
test_local_method_same_cohort_donor_parity(self, lambda_nn, tol): + """Backend-invariant ATT when multiple units share the same treatment cohort. + + Isolates the ``D[t, j] == 1`` target-period case that the prior + ``_compute_observation_weights`` gate silently dropped: three treated + units all starting at ``t=5``. Under the paper's Eq. 2/3, ``ω_j`` is + distance-based for all ``j ≠ i`` (same-cohort donors included); their + pre-treatment rows contribute via ``θ_s · ω_j`` and post-treatment + cells are zeroed by the control mask ``(1 - W_{js})``. Python now + matches this convention (gate removed). This regression asserts that + the main-fit ATT matches to tolerance across backends on a fixture + where the gate would previously have excluded donors' pre-treatment + information. + """ + import sys + from unittest.mock import patch + + import pandas as pd + + from diff_diff import TROP + + # 3 treated units + 3 controls; all treated units share cohort at t=5. + # Pre-treatment trajectories are distinct so same-cohort donors carry + # non-trivial information; without the fix Python and Rust would have + # different effective donor pools at each treated-observation target. 
+ rng = np.random.default_rng(7) + rows = [] + for i in range(6): + is_treated = i < 3 + base = rng.normal(0, 1, 8) + for t in range(8): + y = 3.0 + i * 0.2 + 0.5 * t + base[t] + treated = 1 if (is_treated and t >= 5) else 0 + if treated: + y += 1.5 + rows.append({"unit": i, "time": t, "outcome": y, "treated": treated}) + df = pd.DataFrame(rows) + + trop_params = dict( + method="local", + lambda_time_grid=[0.1, 1.0, 10.0], + lambda_unit_grid=[0.1, 1.0, 10.0], + lambda_nn_grid=[lambda_nn], + n_bootstrap=2, + seed=42, + ) + + trop_rust = TROP(**trop_params) + res_rust = trop_rust.fit(df.copy(), "outcome", "treated", "unit", "time") + + trop_module = sys.modules["diff_diff.trop"] + trop_local_module = sys.modules["diff_diff.trop_local"] + with patch.object(trop_module, "HAS_RUST_BACKEND", False), \ + patch.object(trop_module, "_rust_loocv_grid_search", None), \ + patch.object(trop_local_module, "HAS_RUST_BACKEND", False), \ + patch.object(trop_local_module, "_rust_bootstrap_trop_variance", None): + trop_py = TROP(**trop_params) + res_py = trop_py.fit(df.copy(), "outcome", "treated", "unit", "time") + + np.testing.assert_allclose( + res_rust.att, res_py.att, atol=tol, rtol=tol, + err_msg=f"Same-cohort donor ATT divergence at lambda_nn={lambda_nn}: " + f"Rust={res_rust.att:.16f}, Python={res_py.att:.16f}", + ) + class TestFallbackWhenNoRust: """Test that pure Python fallback works when Rust is unavailable."""
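For reference, the Python-canonical sampling law the shared helper implements (inferred from the `stratified_bootstrap_indices(rng, n_control_units, n_treated_units, self.n_bootstrap)` call site in the diff) can be sketched as follows; this is an illustration of the contract, not the shipped `diff_diff/bootstrap_utils.py` code:

```python
import numpy as np

def stratified_bootstrap_indices_sketch(rng, n_control, n_treated, n_bootstrap):
    """Pre-generate per-replicate stratified resample indices.

    Controls are drawn first, then treated, each with replacement within
    its own stratum, so every replicate retains at least one unit on each
    side. Pre-generating all indices in numpy and handing the integer
    arrays to either backend is what makes the bootstrap SE seed-stable
    across implementations.
    """
    # One row of within-stratum indices per bootstrap replicate
    control_idx = rng.integers(0, n_control, size=(n_bootstrap, n_control))
    treated_idx = rng.integers(0, n_treated, size=(n_bootstrap, n_treated))
    return control_idx, treated_idx

# Usage with the same PCG64 stream the Python fallback seeds:
rng = np.random.default_rng(42)
control_idx, treated_idx = stratified_bootstrap_indices_sketch(rng, 8, 2, 100)
```

Draw order matters for reproducibility: consuming the control draws before the treated draws from a single `default_rng(seed)` stream is part of the contract both backends must honor.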