Merged
7 changes: 6 additions & 1 deletion CHANGELOG.md
@@ -18,7 +18,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- SyntheticDiD `variance_method="bootstrap"` now computes p-values from the analytical normal-theory formula using the bootstrap SE (matching R's `synthdid::vcov()` convention), rather than an empirical null-distribution formula that is not valid for bootstrap draws. `is_significant` and `significance_stars` are derived from `p_value` and will also change for bootstrap fits. Placebo and jackknife are unchanged. Point estimates are unaffected.
- SyntheticDiD bootstrap SE formula applies the `sqrt((r-1)/r)` correction matching R's synthdid and the placebo SE formula.
- SyntheticDiD bootstrap now retries degenerate resamples (all-control or all-treated, or non-finite `τ_b`) until exactly `n_bootstrap` valid replicates are accumulated, matching R's `synthdid::bootstrap_sample` and Arkhangelsky et al. (2021) Algorithm 2. Previously the Python path counted attempts (with degenerate draws silently dropped), producing fewer valid replicates than requested. A bounded-attempt guard (`20 × n_bootstrap`) prevents pathological-input hangs.
- - **TROP global bootstrap SE backend parity under fixed seed** — Rust and Python backends now produce bit-identical bootstrap SE under the same `seed`. Previously Rust's `bootstrap_trop_variance_global` seeded `rand_xoshiro::Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), producing ~28% SE divergence on tiny panels under `seed=42`. Fixed by extracting a shared `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` that pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. Closes silent-failures audit finding #23 (bootstrap half; grid-search half closed in PR #348). Local-method TROP also adopts the Python-canonical index contract for the RNG layer, but separately-discovered backend divergences (Rust normalizes weight-matrix outer product, Python `_compute_observation_weights` reads stale `_precomputed` cache) prevent local-method bit-identity SE; tracked as a follow-up in `TODO.md`.
+ - **TROP global bootstrap SE backend parity under fixed seed** — Rust and Python backends now produce bit-identical bootstrap SE under the same `seed`. Previously Rust's `bootstrap_trop_variance_global` seeded `rand_xoshiro::Xoshiro256PlusPlus` per replicate while Python's fallback consumed `numpy.random.default_rng` (PCG64), producing ~28% SE divergence on tiny panels under `seed=42`. Fixed by extracting a shared `stratified_bootstrap_indices` helper in `diff_diff/bootstrap_utils.py` that pre-generates per-replicate stratified sample indices via numpy on the Python side; both backends consume the same integer arrays through the PyO3 surface. Sampling law (stratified: controls then treated, with replacement) is unchanged. Closes the bootstrap-RNG half of silent-failures audit finding #23 (grid-search half closed in PR #348; local-method methodology half closed by the two Fixed entries below). Local-method TROP also adopts the Python-canonical index contract for the RNG layer here.
+ - **TROP local-method Rust weight-matrix no longer normalized** — `rust/src/trop.rs::compute_weight_matrix` no longer divides time-weights or unit-weights by their respective sums before the outer product. The paper's Equation 2/3 (Athey, Imbens, Qu, Viviano 2025) and REGISTRY.md Requirements checklist (line 2037: `[x] Unit weights: exp(-λ_unit × distance) (unnormalized, matching Eq. 2)`) both specify raw-exponential weights; Python's `_compute_observation_weights` was already REGISTRY-compliant. **User-visible effect**: Rust local-method ATT values may shift for any fit with `lambda_nn < infinity` — normalizing the weight-matrix inflated the effective nuclear-norm penalty relative to the data-fit term, changing the regularization trade-off. For `lambda_nn = infinity` (factor model disabled) outputs are unchanged because uniform weight scaling leaves the minimum-norm WLS argmin invariant. Rust LOOCV-selected lambdas may also shift on this boundary; both backends now converge on the same REGISTRY-compliant selection.
+ - **TROP local-method Python `_compute_observation_weights` now uses the function-argument `Y, D` and treats all non-target units as donors** — two coupled changes that bring Python structurally in line with Rust and the paper's Eq. 2/3:
+ 1. Removed the `if self._precomputed is not None:` branch that silently substituted `self._precomputed["Y"]` / `["D"]` / `["time_dist_matrix"]` (original-panel cache populated during main fit) for the function-argument `Y, D`. Under bootstrap, `_fit_with_fixed_lambda` computes fresh `Y, D` from the resampled `boot_data` and passes them in; the helper was discarding those and recomputing unit distances from the original panel, so Python's local bootstrap resampled units but reused stale unit-distance weights. Rust's bootstrap was already correct (always consumed `y_boot, d_boot`).
+ 2. Removed the `valid_control_at_t = D[t, :] == 0` target-period donor gate that zeroed `ω_j` for any unit `j` treated at the target period (other than the target unit itself). Per REGISTRY Eq. 2/3 and Rust's `compute_weight_matrix`, `ω_j = exp(-λ_unit × dist(j, i))` for all `j ≠ i`; treated-cell exclusion happens via the `(1 − W_{js})` factor applied inside `_estimate_model`. Same-cohort donors now contribute via their pre-treatment rows. Empirically the main-fit ATT is unchanged on tested fixtures because same-cohort pre-treatment observations are exactly absorbed by their own unit fixed effect `alpha_j` without propagating into `mu`, `beta`, or other units' parameters — so this change is structural alignment rather than a numerical shift in output. Users on same-cohort panels with very few controls may still see tiny differences in edge cases; the new `test_local_method_same_cohort_donor_parity` regression guards the aligned behavior.
+ Together with the normalization fix above, TROP local-method backend parity on the main-fit ATT is regime-dependent: `atol=rtol=1e-14` for `lambda_nn=inf` (no nuclear-norm regularization, uniform weight scaling leaves the WLS argmin invariant) and `atol=1e-10` for finite `lambda_nn` (FISTA inner loop + BLAS reduction ordering introduce sub-1e-10 roundoff across Rust `faer` vs numpy paths). Bootstrap SE parity is asserted at `atol=1e-5` to accommodate ~1e-7 roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq` that accumulates across per-replicate fits; sub-1e-14 bootstrap parity is tracked as a follow-up in `TODO.md` under "unify Rust local-method solver path". Closes silent-failures audit finding #23 (local-method half; the RNG half closed in PR #354 and the grid-search half in PR #348).
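
Editorial note: the unnormalized Eq. 2/3 weight construction described in these entries can be sketched in a few lines of numpy. The function name, signature, and `unit_dists` input below are illustrative only (they are not the library's API); the point is that both `theta` and `omega` stay raw exponentials, with no division by their sums, so the data-fit term keeps its scale relative to the nuclear-norm penalty.

```python
import numpy as np

def weight_matrix_sketch(t, i, lambda_time, lambda_unit, unit_dists, n_periods):
    """Illustrative Eq. 2/3 observation weights for target cell (i, t).

    unit_dists[j] holds a precomputed distance from unit j to target
    unit i (np.inf when no comparable overlap exists). Weights stay raw
    exponentials: no normalization before the outer product.
    """
    # Time weights theta_s = exp(-lambda_time * |t - s|)
    theta = np.exp(-lambda_time * np.abs(np.arange(n_periods) - t))
    if lambda_unit == 0:
        # Uniform unit weights; also avoids 0 * inf = nan for infinite distances
        omega = np.ones(len(unit_dists))
    else:
        # exp(-inf) evaluates to 0.0, so infinitely distant units drop out
        omega = np.exp(-lambda_unit * np.asarray(unit_dists, dtype=float))
    omega[i] = 1.0  # target unit keeps weight 1; its treated cells are
                    # masked later via the (1 - W_js) control-mask factor
    return np.outer(theta, omega)  # (n_periods, n_units), unnormalized
```

Dividing this matrix by a constant would leave the minimum-norm WLS argmin unchanged, which is why the Rust normalization only shifted results for finite `lambda_nn`, where the data-fit term competes with the nuclear-norm penalty.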

### Changed
- **`did_had_pretest_workflow(aggregate="event_study")` verdict no longer emits the "paper step 2 deferred to Phase 3 follow-up" caveat** — the joint pre-trends Stute test closes that gap. The two-period `aggregate="overall"` path retains the existing caveat since the joint variant does not apply to single-pre-period panels. Downstream code that greps verdict strings for the Phase 3 caveat will see it suppressed on the event-study path.
Expand Down
4 changes: 2 additions & 2 deletions TODO.md
@@ -83,8 +83,8 @@ Deferred items from PR reviews that were not addressed before merge.
| Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
| Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. | `benchmarks/R/generate_clubsandwich_golden.R` | Phase 1a | Medium |
| `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low |
- | TROP local-method Rust vs Python bootstrap SE divergence beyond RNG: with stratified indices now Python-canonical (closes RNG axis-H), local-method SE still diverges by 10-20% on tiny panels. Two downstream causes: (a) Rust `compute_weight_matrix` (`rust/src/trop.rs:573-606`) normalizes time_weights and unit_weights to sum to 1 before the outer product; Python `_compute_observation_weights` (`diff_diff/trop_local.py:489`) does not normalize. (b) Python `_compute_observation_weights` reads `self._precomputed["Y"]`, `["D"]`, `["time_dist_matrix"]` (lines 423-458) — original-panel cache — rather than the bootstrap-sample data passed via the function arguments, so unit-distance computation in the Python fallback uses stale data. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local` baselines the gap across `seeds=[0, 42, 12345]`. Affects local-method TROP backend parity broadly, not just bootstrap. | `trop_local.py`, `rust/src/trop.rs` | follow-up | Medium |
- | Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #NNN): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
+ | Unify Rust local-method `estimate_model` solver path to `solve_wls_svd` (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (`tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local`) passes at `atol=1e-5` — the residual ~1e-7 gap is roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq`, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (`atol=1e-14` for `lambda_nn=inf`, `atol=1e-10` for finite `lambda_nn` — see `test_local_method_main_fit_parity`); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug. | `rust/src/trop.rs::estimate_model`, `rust/src/linalg.rs::solve_wls_svd` | follow-up | Low |
+ | Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
| `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
| `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium |
| `bias_corrected_local_linear`: support `weights=` once survey-design adaptation lands. nprobust's `lprobust` has no weight argument so there is no parity anchor; derivation needed. | `diff_diff/local_linear.py`, `diff_diff/_nprobust_port.py::lprobust` | Phase 1c | Medium |
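
Editorial note: the "pre-generate indices in Python, feed both backends" fix shape referenced above can be sketched as follows. Only the call shape (`stratified_bootstrap_indices(rng, n_control_units, n_treated_units, n_bootstrap)` returning `(control_idx, treated_idx)`) appears in this diff; the batch layout and internals below are assumptions for illustration.

```python
import numpy as np

def stratified_bootstrap_indices(rng, n_control, n_treated, n_boot):
    # Sketch of the shared helper's contract (internals assumed):
    # with-replacement draws within each stratum, controls before treated,
    # one row per replicate, so both backends consume identical integer
    # arrays generated from the same numpy PCG64 stream.
    control_idx = np.empty((n_boot, n_control), dtype=np.int64)
    treated_idx = np.empty((n_boot, n_treated), dtype=np.int64)
    for b in range(n_boot):
        control_idx[b] = rng.integers(0, n_control, size=n_control)
        treated_idx[b] = rng.integers(0, n_treated, size=n_treated)
    return control_idx, treated_idx

# Same seed, same arrays: this determinism is what makes the Rust and
# Python bootstrap SE comparable bit-for-bit at the RNG layer.
a = stratified_bootstrap_indices(np.random.default_rng(42), 5, 3, 100)
b = stratified_bootstrap_indices(np.random.default_rng(42), 5, 3, 100)
assert all(np.array_equal(x, y) for x, y in zip(a, b))
```

The same shape would apply to the multiplier-bootstrap row above: generate Rademacher/Mammen/Webb weights in numpy, then hand the finished arrays to Rust through PyO3 instead of seeding a second RNG family.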
90 changes: 28 additions & 62 deletions diff_diff/trop_local.py
@@ -385,13 +385,18 @@ def _compute_observation_weights(
- Time weights theta_s^{i,t} = exp(-lambda_time * |t - s|)
- Unit weights omega_j^{i,t} = exp(-lambda_unit * dist_unit_{-t}(j, i))

- IMPORTANT (Issue A fix): The paper's objective sums over ALL observations
- where (1 - W_js) is non-zero, which includes pre-treatment observations of
- eventually-treated units since W_js = 0 for those. This method computes
- weights for ALL units where D[t, j] = 0 at the target period, not just
- never-treated units.
-
- Uses pre-computed structures when available for efficiency.
+ Weights are assigned for every unit ``j != i`` (distance-based, per
+ Eq. 2/3). Treated-cell exclusion is handled by the `(1 - W_{js})`
+ factor applied inside ``_estimate_model`` via the control mask, not
+ by gating ``ω_j`` on ``D[t, j]``. Same-cohort donors therefore
+ contribute via their pre-treatment rows, and future-cohort donors
+ contribute via rows where both units are still untreated.
+
+ Always computes from the function-argument ``Y, D``; does not read
+ ``self._precomputed``. Under bootstrap the caller passes resampled
+ ``Y, D``, and a prior version of this method silently fell through to
+ the original-panel cache via a ``_precomputed`` branch, producing
+ stale unit distances.

Parameters
----------
@@ -420,70 +425,33 @@
np.ndarray
Weight matrix (n_periods x n_units) for observation (i, t).
"""
- # Use pre-computed structures when available
- if self._precomputed is not None:
- # Time weights from pre-computed time distance matrix
- # time_dist_matrix[t, s] = |t - s|
- time_weights = np.exp(-lambda_time * self._precomputed["time_dist_matrix"][t, :])
-
- # Unit weights - computed for ALL units where D[t, j] = 0
- # (Issue A fix: includes pre-treatment obs of eventually-treated units)
- unit_weights = np.zeros(n_units)
- D_stored = self._precomputed["D"]
- Y_stored = self._precomputed["Y"]
-
- # Valid control units at time t: D[t, j] == 0
- valid_control_at_t = D_stored[t, :] == 0
-
- if lambda_unit == 0:
- # Uniform weights when lambda_unit = 0
- # All units not treated at time t get weight 1
- unit_weights[valid_control_at_t] = 1.0
- else:
- # Use observation-specific distances with target period excluded
- # (Issue B fix: compute exact per-observation distance)
- for j in range(n_units):
- if valid_control_at_t[j] and j != i:
- # Compute distance excluding target period t
- dist = self._compute_unit_distance_for_obs(Y_stored, D_stored, j, i, t)
- if np.isinf(dist):
- unit_weights[j] = 0.0
- else:
- unit_weights[j] = np.exp(-lambda_unit * dist)
-
- # Treated unit i gets weight 1
- unit_weights[i] = 1.0
-
- # Weight matrix: outer product (n_periods x n_units)
- return np.outer(time_weights, unit_weights)
-
- # Fallback: compute from scratch (used in bootstrap)
# Time distance: |t - s| following paper's Equation 3 (page 7)
dist_time = np.abs(np.arange(n_periods) - t)
time_weights = np.exp(-lambda_time * dist_time)

- # Unit weights - computed for ALL units where D[t, j] = 0
- # (Issue A fix: includes pre-treatment obs of eventually-treated units)
+ # Unit weights ω_j = exp(-λ_unit × dist(j, i)) for all j ≠ i per Eq. 2/3.
+ # No target-period gate: same-cohort donors enter with distance-based
+ # weight (their pre-treatment rows contribute via theta_s * omega_j;
+ # their post-treatment cells are zeroed by the control mask (1-D_{js})
+ # applied inside ``_estimate_model``). Matches Rust's compute_weight_matrix.
unit_weights = np.zeros(n_units)

- # Valid control units at time t: D[t, j] == 0
- valid_control_at_t = D[t, :] == 0
-
if lambda_unit == 0:
- # Uniform weights when lambda_unit = 0
- unit_weights[valid_control_at_t] = 1.0
+ # Uniform weights when lambda_unit = 0 — all units get 1.
+ # Control masking in _estimate_model handles treated-cell exclusion.
+ unit_weights[:] = 1.0
else:
for j in range(n_units):
- if valid_control_at_t[j] and j != i:
+ if j != i:
# Compute distance excluding target period t (Issue B fix)
dist = self._compute_unit_distance_for_obs(Y, D, j, i, t)
if np.isinf(dist):
unit_weights[j] = 0.0
else:
unit_weights[j] = np.exp(-lambda_unit * dist)

- # Treated unit i gets weight 1 (or could be omitted since we fit on controls)
- # We include treated unit's own observation for model fitting
+ # Target unit gets weight 1 (will be masked out in estimation via
+ # the control mask, matching the paper's (1-W_{js}) factor).
unit_weights[i] = 1.0

# Weight matrix: outer product (n_periods x n_units)
@@ -970,13 +938,11 @@ def _bootstrap_variance(
n_control_units = len(control_units)

# Pre-generate stratified bootstrap indices via numpy (Python-canonical RNG).
- # This aligns the RNG layer between backends. For the global method the RNG
- # was the only divergence, so global SE is now bit-identical under the same
- # seed. For the local method two downstream divergences remain (Rust weight-
- # matrix normalization + Python `_compute_observation_weights` reading the
- # stale `_precomputed` cache) — tracked in TODO.md; until those land, local
- # bootstrap SE still differs across backends. Silent-failures finding #23
- # (bootstrap half) closed for global; local follow-up queued.
+ # Aligns the RNG layer between backends. Combined with the Rust weight-
+ # matrix de-normalization and the Python `_compute_observation_weights`
+ # cache-fallthrough removal (also shipped with this parity work), local-
+ # method Rust and Python produce matching bootstrap SE up to solver-path
+ # roundoff (~1e-7); asserted at atol=1e-5 in the parity regression guard.
rng = np.random.default_rng(self.seed)
control_idx, treated_idx = stratified_bootstrap_indices(
rng, n_control_units, n_treated_units, self.n_bootstrap