
Fix TROP local-method backend parity: drop Rust weight normalization + Python cache-fallthrough #358

Merged
igerber merged 6 commits into main from fix/trop-local-method-parity
Apr 24, 2026

@igerber igerber commented Apr 24, 2026

Summary

Closes the local-method half of silent-failures audit finding #23 (RNG half closed in PR #354; grid-search half in PR #348). Two methodology fixes isolated to the local-method path — global is unaffected.

  • Rust compute_weight_matrix no longer normalizes time or unit weight vectors before the outer product. REGISTRY.md (Requirements checklist line 2037: [x] Unit weights: exp(-λ_unit × distance) (unnormalized, matching Eq. 2)) and the paper's Eq. 2/3 both specify raw-exponential weights; Python's _compute_observation_weights was already REGISTRY-compliant. User-visible effect: Rust local-method ATT values may shift for fits with lambda_nn < infinity — normalization inflated the effective nuclear-norm penalty relative to the data-fit term. For lambda_nn = infinity outputs are unchanged (uniform weight scaling leaves the minimum-norm WLS argmin invariant). Rust LOOCV-selected lambdas may also shift; both backends now converge on the same selection. Affects both local-method Rust call sites (LOOCV at trop.rs:459, bootstrap at trop.rs:1096).
  • Python _compute_observation_weights no longer reads stale _precomputed cache. Removed the if self._precomputed is not None: branch that silently substituted self._precomputed['Y'] / ['D'] / ['time_dist_matrix'] (original-panel cache populated during main fit) for the function-argument Y, D. Under bootstrap, _fit_with_fixed_lambda passes fresh Y, D from boot_data; the helper was discarding those and recomputing unit distances from the original panel, so Python's local bootstrap resampled units but reused stale unit-distance weights. Rust's bootstrap was already correct.
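A minimal NumPy sketch of why normalization matters only when a penalty is present. It is not the library's code: `exp_weights` and `weighted_ls` are hypothetical helpers, and a ridge penalty stands in for the nuclear-norm term purely for illustration. Normalizing the time and unit weights multiplies every observation weight by the same constant, which leaves an unpenalized weighted least-squares solution unchanged but makes a fixed penalty relatively heavier.

```python
import numpy as np

def exp_weights(time_dist, unit_dist, lambda_time, lambda_unit, normalize=False):
    """Outer product of exponential time and unit weights (hypothetical helper)."""
    theta = np.exp(-lambda_time * np.asarray(time_dist, dtype=float))
    omega = np.exp(-lambda_unit * np.asarray(unit_dist, dtype=float))
    if normalize:
        # The pre-fix Rust behavior described above: scale each vector to sum to 1.
        theta = theta / theta.sum()
        omega = omega / omega.sum()
    return np.outer(theta, omega)

def weighted_ls(X, y, w, penalty=0.0):
    """Minimize sum_i w_i (y_i - x_i'b)^2 + penalty * ||b||^2.

    The ridge term is an illustrative stand-in for a nuclear-norm penalty.
    """
    gram = X.T @ (w[:, None] * X) + penalty * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ (w * y))

rng = np.random.default_rng(0)
T, N = 6, 5
X = rng.normal(size=(T * N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=T * N)
time_dist = np.arange(T)[::-1]           # distance to the last period
unit_dist = rng.uniform(0.0, 2.0, size=N)

w_raw = exp_weights(time_dist, unit_dist, 0.5, 0.5).ravel()
w_norm = exp_weights(time_dist, unit_dist, 0.5, 0.5, normalize=True).ravel()

# No penalty: uniform rescaling of the weights leaves the argmin unchanged.
b_raw = weighted_ls(X, y, w_raw)
b_norm = weighted_ls(X, y, w_norm)

# With a penalty: the same penalty weighs more against the down-scaled data-fit term.
b_raw_pen = weighted_ls(X, y, w_raw, penalty=1.0)
b_norm_pen = weighted_ls(X, y, w_norm, penalty=1.0)
```

In this toy setup the unpenalized solutions agree to roundoff, while the penalized solution under normalized weights is shrunk more strongly, which mirrors the `lambda_nn = infinity` versus `lambda_nn < infinity` distinction in the bullet above.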

Methodology references

  • Method: Triply Robust Panel (TROP) local-method weight-matrix construction (Equation 2/3).
  • Paper: Athey, S., Imbens, G. W., Qu, Z., & Viviano, D. (2025). Triply Robust Panel Estimators, arXiv:2508.21536.
  • Intentional deviations: None. This PR brings Rust into REGISTRY compliance (REGISTRY.md:2037 explicitly checks unnormalized weights); the Python cache-fallthrough was a silent-failure bug classified under audit finding #23.

Validation

  • New tests added:
    • TestTROPRustEdgeCaseParity::test_local_method_main_fit_parity: parametrized over (lambda_nn=inf, atol=1e-14) and (lambda_nn=0.1, atol=1e-10). The inf regime asserts main-fit ATT agreement within atol=1e-14 (effectively bit-identical) as the regression guard for the normalization fix; the finite-lambda_nn regime asserts agreement within atol=1e-10.
  • Tests flipped:
    • TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local flipped from xfail(strict=True) to passing assert_allclose(atol=1e-5) across seeds [0, 42, 12345]. Residual ~1e-7 gap is Rust estimate_model vs numpy lstsq roundoff across per-replicate bootstrap fits; follow-up TODO row tracks unifying Rust to the solve_wls_svd SVD path (same helper global-method uses since PR #348) for sub-1e-14 parity.
  • Verification:
    • 9 TestTROPRustEdgeCaseParity tests pass (grid-search + global bootstrap × 3 seeds + local bootstrap × 3 seeds + local main-fit × 2 regimes).
    • Full tests/test_rust_backend.py: 92 passed.
    • Full tests/test_trop.py under Rust backend: 120 passed.
    • Local + bootstrap + methodology + NaN subset (36 tests) under pure-Python (DIFF_DIFF_BACKEND=python): 36 passed, 0 failed — confirms the _compute_observation_weights cache-branch removal is behavior-preserving for main-fit (cache data was the original panel, so cache and fallback were mathematically equivalent) and the bootstrap correction is not regressed by any hard-pinned Python test.
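The stale-cache failure mode can be sketched in a few lines. Everything below is a simplified assumption rather than the library's implementation: `WeightHelper`, `unit_weights_stale`, `unit_weights`, and the RMS-gap `unit_distances` are hypothetical names chosen to show why the cache branch was harmless on the main fit yet wrong under bootstrap.

```python
import numpy as np

def unit_distances(Y, D, target):
    """RMS outcome gap to the target unit over periods where both are untreated.

    Simplified stand-in for the estimator's unit-distance computation.
    """
    n_units = Y.shape[1]
    dist = np.zeros(n_units)
    for j in range(n_units):
        both_untreated = (D[:, target] == 0) & (D[:, j] == 0)
        gap = Y[both_untreated, j] - Y[both_untreated, target]
        dist[j] = np.sqrt(np.mean(gap ** 2)) if gap.size else np.inf
    return dist

class WeightHelper:
    """Hypothetical helper with a cache populated from the original panel."""

    def __init__(self, Y, D):
        self._precomputed = {"Y": Y, "D": D}

    def unit_weights_stale(self, Y, D, target, lambda_unit):
        # Pre-fix pattern: the cache silently replaces the function arguments.
        if self._precomputed is not None:
            Y, D = self._precomputed["Y"], self._precomputed["D"]
        return np.exp(-lambda_unit * unit_distances(Y, D, target))

    def unit_weights(self, Y, D, target, lambda_unit):
        # Post-fix pattern: always use the panel that was passed in.
        return np.exp(-lambda_unit * unit_distances(Y, D, target))

rng = np.random.default_rng(1)
Y = rng.normal(size=(5, 4))
D = np.zeros((5, 4), dtype=int)
D[4, 0] = 1                              # unit 0 is treated in the last period

helper = WeightHelper(Y, D)
idx = np.array([0, 2, 2, 1])             # one bootstrap draw of units
Y_boot, D_boot = Y[:, idx], D[:, idx]

main_stale = helper.unit_weights_stale(Y, D, 0, 0.5)
main_fixed = helper.unit_weights(Y, D, 0, 0.5)
boot_stale = helper.unit_weights_stale(Y_boot, D_boot, 0, 0.5)
boot_fixed = helper.unit_weights(Y_boot, D_boot, 0, 0.5)
```

On the original panel the two code paths coincide, which is why main-fit tests could not catch the bug. On the resampled panel the stale path returns the original-panel weights regardless of which units were drawn.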

Security / privacy

  • No secrets, credentials, or PII touched. Changes are in source, tests, and documentation within the repo.

🤖 Generated with Claude Code

@github-actions

Overall Assessment

⚠️ Needs changes — the two advertised TROP fixes look directionally correct, but this PR also removes the shipped HAD multi-period pretest path and still leaves a material local-method TROP weighting mismatch between Python and Rust. (ar5iv.org)

Executive Summary

  • The Rust de-normalization and the Python cache-fallthrough removal are both consistent with the current TROP methodology transcription: Eq. (2) uses the weights directly inside the loss, and Eq. (3) defines exponential time/unit weights rather than a probability-normalized outer product. (ar5iv.org)
  • This PR also deletes the shipped HAD joint-pretest/event-study workflow surface from code, public exports, docs, and tests, reintroducing the multi-period Assumption 7 gap that the current workflow closes.
  • TROP local-method backend parity is still incomplete: Python still zeroes unit weights for units already treated at the target period, while Rust does not.
  • The new local parity tests overstate closure of the audit finding because they do not isolate that remaining same-cohort/same-period weighting case.

Methodology

  • Severity: P1. The PR removes stute_joint_pretest, joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow(aggregate="event_study") from the implementation and public API (diff_diff/had_pretests.py L2632-L2826, diff_diff/__init__.py L63-L75 and L459-L473), and deletes the regression suite that currently guards that behavior (tests/test_had_pretests.py L1130-L2781). The HAD paper’s decision rule is to use TWFE only if a pre-trends test and a linearity test are not rejected; removing the multi-period pre-trends path reintroduces the missing assumption-check gap for the exact panels where the current library had already closed it. Impact: multi-period HAD users lose a shipped methodology path and a public API, not just an optional convenience wrapper. Concrete fix: remove the HAD rollback hunks from this PR, or move them into a separate, explicit deprecation/reversion PR with a Registry **Note:** and a versioned API change. (ar5iv.org)

  • Severity: P1. [Newly identified] The claimed local-method backend parity is still incomplete. Python’s _compute_observation_weights() only assigns positive ω_j when D[t, j] == 0 at the target period (diff_diff/trop_local.py L431-L449), while Rust’s compute_weight_matrix() assigns weights to all units and relies on later (1-W_js) masking (rust/src/trop.rs L545-L603). The current Registry transcription of Eq. (2)-(3) matches the Rust rule: ω_j^{i,t} weights units, and treated cells are excluded through (1-W_js), not through an extra D[t,j]==0 gate (docs/methodology/REGISTRY.md L1944-L1970). Impact: same-period-treated units’ pre-treatment observations are still silently dropped in Python but retained in Rust, so ATT/SE parity and control-group composition remain backend-dependent on some panels. Concrete fix: remove the valid_control_at_t gate from Python, make both backends implement the same Eq.-(2) rule, and add a regression where another unit first treated in the same period has the informative pre-treatment path. (ar5iv.org)

  • Severity: P3. The two stated TROP fixes themselves are methodologically sound: removing Rust-side normalization aligns the loss with Eq. (2)-(3), and removing the stale _precomputed fallthrough fixes a real bootstrap-data bug in the Python path. Impact: no blocker in those hunks themselves. Concrete fix: none. (ar5iv.org)

Code Quality

  • Severity: P2. This PR mixes an unrelated HAD API/method rollback into a TROP backend-parity fix (CHANGELOG.md, TODO.md, diff_diff/__init__.py, diff_diff/had_pretests.py, tests/test_had_pretests.py). Impact: it obscures review scope and makes it harder to tell whether a methodology regression was intentional. Concrete fix: split the HAD revert into its own PR or drop it from this one.

Performance

  • No findings.

Maintainability

  • No additional findings beyond the scope-mixing item above.

Tech Debt

  • No findings. The open items above are methodology/correctness issues and should not be deferred into TODO.md.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. The new local parity tests claim “bit-identical raw-exponential weights” and only assert ATT/SE closeness on one small fixture (tests/test_rust_backend.py L2423-L2501), but they do not isolate the remaining D[t,j]==1 target-period gating difference in Python. Impact: the test suite can pass while a real backend-dependent control-set rule remains. Concrete fix: add a targeted parity regression that directly exercises a same-cohort treated donor and fails if Python drops its pre-treatment history.

Path to Approval

  1. Remove the HAD rollback from this PR, or split it into a separate, explicit deprecation/reversion PR with a Registry **Note:** and changelog Removed entry.
  2. Align Python local-method unit weighting with Eq. (2): do not zero ω_j based on D[t,j]; let untreated-cell masking and the distance formula determine admissible observations.
  3. Add a backend-parity regression with at least two units first treated in the same period and distinct pre-treatment histories; assert both main-fit ATT and bootstrap SE parity on that fixture.
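The difference between the target-period donor gate and plain treated-cell masking can be illustrated on a toy treatment matrix. This is a sketch under assumptions: `observation_weights` and its `gate_donors` flag are hypothetical, the θ and ω values are arbitrary, and how the target unit's own weight is handled is not addressed here.

```python
import numpy as np

def observation_weights(D, theta, omega, target_unit, target_period, gate_donors=False):
    """Observation weights theta_s * omega_j * (1 - W_js) on a (T, N) panel.

    With gate_donors=True, omega_j is additionally zeroed for every other unit
    treated at the target period (the gate flagged in the review above).
    """
    omega = np.asarray(omega, dtype=float).copy()
    if gate_donors:
        treated_at_t = D[target_period, :] == 1
        treated_at_t[target_unit] = False    # the gate applies to other units only
        omega[treated_at_t] = 0.0
    return np.outer(theta, omega) * (1 - D)  # treated cells are always masked out

T, N = 6, 4
D = np.zeros((T, N), dtype=int)
D[4:, [0, 1]] = 1                            # units 0 and 1 share a cohort (treated from t=4)

theta = np.exp(-0.3 * np.abs(np.arange(T) - 4))
omega = np.exp(-0.5 * np.array([0.0, 0.4, 0.9, 1.3]))

W_gated = observation_weights(D, theta, omega, target_unit=0, target_period=4, gate_donors=True)
W_masked = observation_weights(D, theta, omega, target_unit=0, target_period=4)
```

For the same-cohort unit 1, the gated version discards its entire pre-treatment history, while the masked version keeps those rows and drops only its treated cells. Never-treated units receive identical weights either way, so a fixture without a same-cohort donor cannot distinguish the two rules.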

igerber and others added 2 commits April 24, 2026 09:01
…+ Python cache-fallthrough

Closes the local-method half of silent-failures audit finding #23
(RNG half closed in PR #354; grid-search half in PR #348). Two
methodology fixes, both isolated to the local-method path — global
is unaffected.

1. Rust weight-matrix normalization removed
   ------------------------------------------
   `rust/src/trop.rs::compute_weight_matrix` no longer divides
   `time_weights` and `unit_weights` by their respective sums before
   the outer product. The paper's Equation 2/3 (Athey, Imbens, Qu,
   Viviano 2025) and REGISTRY.md Requirements checklist
   (`[x] Unit weights: exp(-λ_unit × distance) (unnormalized,
   matching Eq. 2)`) both specify raw-exponential weights; Python's
   `_compute_observation_weights` was already REGISTRY-compliant.
   Rust's normalization inflated the effective nuclear-norm penalty
   relative to the data-fit term, changing the regularization
   trade-off. User-visible effect: Rust local-method ATT values may
   shift for fits with `lambda_nn < infinity`. For
   `lambda_nn = infinity` (factor model disabled) outputs are
   unchanged — uniform weight scaling leaves the minimum-norm WLS
   argmin invariant. Rust LOOCV-selected lambdas may also shift on
   that boundary; both backends now converge on the same selection.
   Affects both local-method Rust call sites (LOOCV at trop.rs:459,
   bootstrap at trop.rs:1096).

2. Python `_compute_observation_weights` cache-fallthrough removed
   ---------------------------------------------------------------
   Removed the `if self._precomputed is not None:` branch that
   silently substituted `self._precomputed["Y"]` / `["D"]` /
   `["time_dist_matrix"]` (original-panel cache populated during
   main fit) for the function-argument `Y, D`. Under bootstrap,
   `_fit_with_fixed_lambda` computes fresh `Y, D` from the resampled
   `boot_data` and passes them in; the helper was discarding those
   and recomputing unit distances from the original panel, so
   Python's local bootstrap resampled units but reused stale
   unit-distance weights. Rust's bootstrap was already correct
   (always consumed `y_boot, d_boot`).

Test changes
------------
- `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::
  test_bootstrap_seed_reproducibility_local`: flipped from
  `xfail(strict=True)` to passing `assert_allclose` at `atol=1e-5`
  across seeds `[0, 42, 12345]`. Residual ~1e-7 gap is Rust
  `estimate_model` vs numpy `lstsq` roundoff that accumulates
  differently across per-replicate bootstrap fits; follow-up TODO
  row tracks unifying Rust to the `solve_wls_svd` path (same SVD
  helper the global-method uses since PR #348) for sub-1e-14
  parity.
- New `test_local_method_main_fit_parity`: parametrized over
  `(lambda_nn=inf, atol=1e-14)` and `(lambda_nn=0.1, atol=1e-10)`;
  asserts `atol=1e-14` bit-identity for the main-fit ATT at
  `lambda_nn=inf` (the regression guard for the normalization fix)
  and `atol=1e-10` for the finite-`lambda_nn` FISTA path.

Verification
------------
Targeted regression sweep — all green:
- 9 `TestTROPRustEdgeCaseParity` tests (grid-search + global
  bootstrap × 3 seeds + local bootstrap × 3 seeds + local main-fit
  × 2 regimes)
- Full `test_rust_backend.py` suite: 92 passed
- Full `test_trop.py` suite under Rust backend: 120 passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ression

Closes the second P1 from the review. Python `_compute_observation_weights`
had an extra `valid_control_at_t = D[t, :] == 0` gate that zeroed ω_j for
units treated at the target period (other than the target unit itself).
Rust's `compute_weight_matrix` has no such gate — per the paper's Eq. 2/3
and `docs/methodology/REGISTRY.md` TROP section, `ω_j = exp(-λ_unit ×
dist(j, i))` is distance-based for all `j ≠ i` and the treated-cell
exclusion is the `(1 - W_{js})` factor applied inside `_estimate_model`
via the control mask, not an extra target-period unit-weight gate.

On the tested fixtures, removing the gate leaves the ATT point
estimate unchanged: same-cohort donors' pre-treatment rows are absorbed
by their own unit fixed effect `alpha_j` without propagating into
`mu`, `beta`, or other units' parameters, so adding them to the fit
changes which rows are scored but not the solution the fit converges
to. Verified: the flipped bootstrap-seed parity test, the main-fit
parity test at `lambda_nn=inf` (`atol=1e-14`) and at `lambda_nn=0.1`
(`atol=1e-10`), and the new same-cohort regression test (below) all
pass before and after the gate removal. The change is structural
alignment with the paper and Rust, not a numerical behavior shift.

Test addition
-------------
`TestTROPRustEdgeCaseParity::test_local_method_same_cohort_donor_parity`
isolates the scenario the gate used to handle differently from Rust: a
fixture with three treated units sharing one cohort (all treated at
`t=5`) and three controls. Before the gate was removed, Python's and
Rust's same-target-period donors diverged in which rows contributed to
the fit; the tests prove the ATT point estimate was never affected
(pre-treatment rows absorbed by `alpha_j`), and now both backends also
agree structurally. Parametrized over the same regime split as the
main-fit parity test (`lambda_nn=inf` → `atol=1e-14`, `lambda_nn=0.1`
→ `atol=1e-10`).

Note on the other P1 in the review (HAD rollback claim): that finding
was a phantom caused by a stale branch base — PR #353 (HAD joint Stute
pretest) landed on `origin/main` between this branch's cut and the
review run, so the PR diff against current `origin/main` appeared to
"delete" the PR #353 additions. Resolved by rebasing onto the updated
`origin/main` before this push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber force-pushed the fix/trop-local-method-parity branch from 5ee90a3 to f300033 on April 24, 2026 13:16

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f3000331525ac26c554884c738b1172866b13666


Overall Assessment

Looks good — the prior P1 methodology findings appear resolved in this re-review, and I did not find any unmitigated P0/P1 issues in the changed files.

Executive Summary

  • The core local-method TROP fixes now align with the intended methodology: Rust no longer renormalizes θ/ω, and Python no longer drops same-cohort donors or reuses stale cached Y/D during bootstrap.
  • The previous same-cohort donor parity concern is addressed by both the code change in diff_diff/trop_local.py:L431-L457 and the new targeted regression in tests/test_rust_backend.py:L2526-L2585.
  • The prior unrelated HAD rollback concern is no longer present in this diff.
  • One non-blocking test gap remains: local bootstrap parity is only asserted at lambda_nn=np.inf, so the finite-lambda_nn regime that motivated the Rust normalization fix is still only indirectly guarded.
  • One minor documentation mismatch remains: a few comments/docstrings/changelog lines still describe the old normalized/gated behavior or overstate “bit-identical” parity.

Methodology

  • No findings. The changed weighting logic now matches the Methodology Registry’s transcription of TROP Eq. 2/3 in docs/methodology/REGISTRY.md:L1944-L1969,L2035-L2037: raw exponential time/unit weights, donor inclusion for all j != i, and treated-cell exclusion via the control mask rather than a D[t,j]==0 donor gate. That is consistent with the cited TROP paper’s local objective and weighting formulation. diff_diff/trop_local.py:L431-L457, diff_diff/trop_local.py:L617-L633, rust/src/trop.rs:L545-L607, rust/src/trop.rs:L632-L653.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3 | Impact: several changed docs/comments still describe superseded behavior. diff_diff/trop_local.py:L388-L392 still says weights are computed only for units with D[t,j]==0; rust/src/trop.rs:L550-L554 still says weights are normalized; CHANGELOG.md:L22-L23 overstates main-fit ATT as uniformly 1e-14 bit-identical even though tests/test_rust_backend.py:L2473-L2524 allows 1e-10 for finite lambda_nn. This is not a runtime defect, but it leaves the methodology contract internally inconsistent. | Concrete fix: update those comments/docstrings/changelog lines to say weights are unnormalized for all j != i, treated cells are excluded by masking, and finite-lambda_nn parity is approximate rather than universally bit-identical.

Tech Debt

  • No findings. The remaining local-bootstrap solver-path roundoff gap is properly tracked in TODO.md:L86, so it is mitigated under the review policy.

Security

  • No findings.

Documentation/Tests

  • Severity: P2 | Impact: tests/test_rust_backend.py:L2448-L2469 asserts local bootstrap SE parity only with lambda_nn_grid=[np.inf]. That directly covers the stale-cache bootstrap fix, but it does not directly exercise the finite-lambda_nn regime where the Rust weight-normalization change matters most (CHANGELOG.md:L22-L23). A future finite-lambda_nn local-bootstrap regression could therefore slip through this suite. | Concrete fix: add one fixed-seed local bootstrap parity test with lambda_nn=0.1 at the same 1e-5 tolerance, ideally on the same-cohort donor fixture in tests/test_rust_backend.py:L2526-L2585.
  • Validation note: I could not execute the targeted pytest selection locally because this environment does not have pytest installed (pytest: command not found; python -m pytest: No module named pytest).
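Because the thread quotes several absolute tolerances (`atol=1e-14`, `1e-10`, `1e-5`), it is worth recalling how `numpy.testing.assert_allclose` combines them with `rtol`: the check is `|actual - desired| <= atol + rtol * |desired|`, and `rtol` defaults to `1e-7`. The sketch below uses made-up values (`att_a`, `att_b`) and a hypothetical `passes` wrapper; the thread does not show whether the tests set `rtol` explicitly, so this only illustrates the semantics.

```python
import numpy as np

def passes(actual, desired, **tol):
    """Return True if numpy.testing.assert_allclose accepts the pair."""
    try:
        np.testing.assert_allclose(actual, desired, **tol)
    except AssertionError:
        return False
    return True

att_a = 1.0            # illustrative values, not results from this PR
att_b = 1.0 + 1e-9

default_rtol_ok = passes(att_b, att_a, atol=1e-14)            # rtol stays at 1e-7
strict_ok = passes(att_b, att_a, rtol=0, atol=1e-14)          # atol alone decides
loose_ok = passes(att_b, att_a, rtol=0, atol=1e-5)
```

With the default `rtol`, a difference of 1e-9 on a value of order one is accepted even though `atol=1e-14` was requested. Setting `rtol=0` (or an equally tight `rtol`) is what makes a small `atol` binding.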

… bootstrap case

P3 — update stale comments/docstrings/changelog that still described the
old pre-fix behavior:
- `diff_diff/trop_local.py` `_compute_observation_weights` docstring
  (L388-L392) rewritten: weights are assigned for every `j != i` with
  distance-based ω_j; treated-cell exclusion happens via the `(1-W_{js})`
  factor inside `_estimate_model` via the control mask, not via a
  target-period donor gate.
- `rust/src/trop.rs` `compute_weight_matrix` paper-alignment comments
  (L550-L554) rewritten: weights are unnormalized raw exponentials per
  REGISTRY Eq. 2/3; ALL donor units with `j != target_unit` get
  distance-based weights; treated-cell exclusion happens via the control
  mask inside `estimate_model`.
- `CHANGELOG.md` local-method parity statement rewritten: main-fit ATT
  parity is `atol=rtol=1e-14` for `lambda_nn=inf` and `atol=1e-10` for
  finite `lambda_nn`, not uniformly "bit-identical".

P2 — extend `test_bootstrap_seed_reproducibility_local` to also cover
finite `lambda_nn=0.1`. Previously parametrized only on seed with
`lambda_nn=inf`; the finite regime is where the Rust weight-
normalization change has its biggest numerical effect, so adding one
`(seed=42, lambda_nn=0.1)` case closes the coverage gap the reviewer
flagged. Assertion tolerance unchanged (`atol=1e-5`). All 4 parametrized
cases pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 962d5d58b4f740f6a7bf764cc696a76fb607394b


Overall Assessment

Looks good — the estimator/math changes in this PR appear methodology-correct, the prior finite-lambda_nn bootstrap test gap is closed, and I did not find any unmitigated P0/P1 issues in the changed implementation.

Executive Summary

  • The core TROP local-method fixes are aligned with both the paper and the Methodology Registry: Eq. 2 uses θ_s^{i,t} ω_j^{i,t} (1-W_js) masking in the loss, and Eq. 3 defines θ/ω as raw exponential weights rather than normalized probabilities. The updated Python and Rust implementations now match that contract. diff_diff/trop_local.py:L388-L458, rust/src/trop.rs:L550-L609, docs/methodology/REGISTRY.md:L1944-L1968, docs/methodology/REGISTRY.md:L2000-L2001, docs/methodology/REGISTRY.md:L2035-L2037. (arxiv.org)
  • The previous non-blocking test gap called out in the last review is addressed: local bootstrap parity now includes a finite-lambda_nn=0.1 case. tests/test_rust_backend.py:L2423-L2484
  • I did not find a remaining methodology deviation, missing assumption check, or inference/SE bug in the changed estimator code.
  • One non-blocking P2 remains: the new main-fit ATT “parity” tests do not actually disable the Rust LOOCV selector in diff_diff.trop, and with singleton grids they do not exercise the LOOCV-selection path that the Rust normalization fix changes.
  • One minor P3 remains: stale status prose in the code/TODO still describes the old local-method divergences as unresolved and overstates 1e-14 main-fit parity.

Methodology

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3 | Impact: stale status text remains in the changed files. diff_diff/trop_local.py:L940-L947 still says the local path has two unresolved backend divergences and that local SE parity is still a queued follow-up, even though those are the exact fixes this PR lands. TODO.md:L86-L86 also says main-fit ATT is “bit-identical at atol=1e-14”, which conflicts with the finite-lambda_nn tolerance in tests/test_rust_backend.py:L2486-L2490 and the regime-dependent wording in CHANGELOG.md:L22-L23. This is not a runtime defect, but it leaves the repo’s status metadata internally inconsistent.
    Concrete fix: update/remove the stale bootstrap comment in diff_diff/trop_local.py and rewrite the TODO row so it tracks only the remaining bootstrap solver-path roundoff gap without claiming universal 1e-14 main-fit parity.

Tech Debt

  • No unmitigated findings. The remaining Rust local bootstrap solver-path tightening is already tracked in TODO.md:L86-L86, so it is appropriately non-blocking under the review policy.

Security

  • No findings.

Documentation/Tests

  • Severity: P2 | Impact: the new main-fit ATT parity tests at tests/test_rust_backend.py:L2493-L2537 and tests/test_rust_backend.py:L2540-L2600 do not actually compare Rust vs Python end-to-end on the local LOOCV path. They patch diff_diff.trop_local.HAS_RUST_BACKEND, but the local LOOCV dispatch under review lives in diff_diff/trop.py:L572-L583. They also use singleton grids ([1.0], [1.0], [lambda_nn]), so even a correctly patched Python branch would not make LOOCV selection differ. As written, these tests do not verify the Rust local-method normalization fix on the main-fit/LOOCV path they claim to guard, so a regression there could still pass.
    Concrete fix: in the “Python” branch patch diff_diff.trop (HAS_RUST_BACKEND=False, _rust_loocv_grid_search=None) rather than diff_diff.trop_local, and use at least one multi-candidate grid where LOOCV selection matters; alternatively add a direct Rust-vs-Python local LOOCV parity test that compares selected λ or final ATT on the same fixture.

Validation note: I could not run the targeted pytest selection locally because this environment does not have pytest installed (python -m pytest: No module named pytest).
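The patch-target issue above is an instance of a general Python behavior: each module holds its own binding of an imported flag, so patching the name in one module does not change what another module sees. The sketch below reproduces that with stand-in modules built at runtime. `make_module`, `fit_path`, and the two module objects are hypothetical and only mimic the `diff_diff.trop` / `diff_diff.trop_local` split described in the review.

```python
import types
from unittest import mock

def make_module(name, has_rust):
    """Build a stand-in module that holds its own copy of the backend flag."""
    mod = types.ModuleType(name)
    mod.HAS_RUST_BACKEND = has_rust
    # Looks the flag up on its own module at call time, as module-level code would.
    mod.backend = lambda: "rust" if mod.HAS_RUST_BACKEND else "python"
    return mod

trop = make_module("trop", True)               # stands in for the LOOCV dispatch module
trop_local = make_module("trop_local", True)   # stands in for the per-observation weights module

def fit_path():
    """Report which backend each stage would use."""
    return {"loocv": trop.backend(), "weights": trop_local.backend()}

# Patching only one module leaves the other stage on Rust.
with mock.patch.object(trop_local, "HAS_RUST_BACKEND", False):
    partial = fit_path()

# Patching both modules forces the Python path end to end.
with mock.patch.object(trop, "HAS_RUST_BACKEND", False), \
        mock.patch.object(trop_local, "HAS_RUST_BACKEND", False):
    full = fit_path()

restored = fit_path()
```

The first patch leaves the LOOCV stage on the Rust path, so a "Python" branch built that way would compare Rust against Rust for that stage. Patching the name where it is looked up, in every module that dispatches on it, is the usual remedy.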

… status text

P2 — the new main-fit parity tests patched `diff_diff.trop_local.
HAS_RUST_BACKEND` but the LOOCV dispatch lives in `diff_diff.trop` at
line 572-583, so the "Python" branch was still running Rust LOOCV.
Combined with singleton lambda grids (`[1.0]`, `[1.0]`, `[lambda_nn]`),
the tests didn't actually exercise the Rust `compute_weight_matrix`
surface the normalization fix changed. Fixed in both
`test_local_method_main_fit_parity` and
`test_local_method_same_cohort_donor_parity`:
- Patch `diff_diff.trop.HAS_RUST_BACKEND` and
  `diff_diff.trop._rust_loocv_grid_search` in addition to the existing
  `diff_diff.trop_local` patches, so the Python branch forces Python
  LOOCV + Python per-observation weights end-to-end.
- Expand grids to `[0.1, 1.0, 10.0]` (3x3) so LOOCV selection is
  non-trivial and actually scores candidates via the backend under test.

P3 — stale status text updated:
- `diff_diff/trop_local.py` bootstrap-index-generation comment (L940-L947)
  no longer says local SE parity is a queued follow-up; now correctly
  describes the full RNG + normalization + cache-fallthrough alignment
  and the bootstrap-SE tolerance regime.
- `TODO.md` solver-path follow-up row no longer says main-fit ATT is
  uniformly `1e-14` bit-identical; rewritten to reflect the regime-
  dependent tolerance (`1e-14` for `lambda_nn=inf`, `1e-10` for finite
  `lambda_nn`) that's asserted in the test suite.

Verification: all 12 `TestTROPRustEdgeCaseParity` tests pass under the
tightened patching + multi-candidate grids.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
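Why singleton grids could not exercise LOOCV selection can be shown with a toy grid search. The scoring functions below are invented stand-ins for two backends that disagree; `select_lambdas`, `score_a`, and `score_b` are hypothetical names, not part of the library.

```python
from itertools import product

def select_lambdas(score, time_grid, unit_grid, nn_grid):
    """Return the (lambda_time, lambda_unit, lambda_nn) triple with the lowest score."""
    return min(product(time_grid, unit_grid, nn_grid), key=lambda cand: score(*cand))

def score_a(lambda_time, lambda_unit, lambda_nn):
    # Toy criterion that prefers lambda_time near 1.0.
    return (lambda_time - 1.0) ** 2 + (lambda_unit - 1.0) ** 2 + (lambda_nn - 0.1) ** 2

def score_b(lambda_time, lambda_unit, lambda_nn):
    # A deliberately different toy criterion that prefers lambda_time near 10.0.
    return (lambda_time - 10.0) ** 2 + (lambda_unit - 1.0) ** 2 + (lambda_nn - 0.1) ** 2

singleton = ([1.0], [1.0], [0.1])
multi = ([0.1, 1.0, 10.0], [0.1, 1.0, 10.0], [0.1, 1.0, 10.0])

single_a = select_lambdas(score_a, *singleton)
single_b = select_lambdas(score_b, *singleton)
multi_a = select_lambdas(score_a, *multi)
multi_b = select_lambdas(score_b, *multi)
```

With one candidate per grid, both criteria return the same triple no matter how they score it, so a parity test on the selected lambdas passes trivially. Only the multi-candidate grid reveals that the two criteria disagree.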

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a5182c609d89ebcd660205cfaa91100ed6d52a06


Overall Assessment

Looks good

Executive Summary

  • The prior re-review test-scope concern is addressed: test_local_method_main_fit_parity now patches diff_diff.trop._rust_loocv_grid_search and uses multi-candidate grids, so it actually exercises the corrected local LOOCV path end-to-end (tests/test_rust_backend.py:L2493-L2550, diff_diff/trop.py:L573-L593).
  • I did not find an unmitigated methodology defect in the changed estimator code. The local weight construction now matches the TROP contract encoded in docs/methodology/REGISTRY.md and cross-checks cleanly against the cited TROP paper (docs/methodology/REGISTRY.md:L1944-L1968, docs/methodology/REGISTRY.md:L2000-L2037, diff_diff/trop_local.py:L369-L460, rust/src/trop.rs:L545-L609). (ideas.repec.org)
  • The residual local bootstrap parity gap is correctly treated as implementation-path roundoff and is explicitly tracked in TODO.md:L86-L86; not a blocker.
  • One minor P3 remains: CHANGELOG.md:L22-L23 under-describes the Python local-method behavior change by mentioning cache fallthrough but not the removed same-cohort donor gate.
  • Validation note: I could not re-run the targeted tests here because the sandbox’s default Python lacks pytest and project deps such as numpy.

Methodology

  • No findings. The changed local-method math is consistent with the registered TROP objective: raw exponential unit/time weights, treated-cell exclusion via the estimation mask, and bootstrap refits using the resampled panel state (docs/methodology/REGISTRY.md:L1944-L1968, docs/methodology/REGISTRY.md:L2000-L2037, diff_diff/trop_local.py:L369-L460, rust/src/trop.rs:L545-L609, diff_diff/trop_local.py:L1238-L1339). (ideas.repec.org)

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 (tracked in TODO.md) | Impact: local bootstrap SE parity still relies on atol=1e-5 because Rust estimate_model and NumPy lstsq use different solver paths, but this is already tracked and does not imply wrong numbers. References: tests/test_rust_backend.py:L2432-L2484, TODO.md:L86-L86. Concrete fix: finish the queued rust/src/trop.rs::estimate_model -> rust/src/linalg.rs::solve_wls_svd unification if sub-1e-14 parity remains a goal.
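As a rough illustration of how two algebraically equivalent solver paths can be compared under bootstrap, the sketch below solves the same weighted least-squares problem via `numpy.linalg.lstsq` and via the normal equations, then compares bootstrap standard errors of one coefficient. It does not claim to reproduce what Rust's `estimate_model` does; `wls_lstsq`, `wls_normal`, `bootstrap_se`, and the simulated data are all assumptions for illustration.

```python
import numpy as np

def wls_lstsq(X, y, w):
    """Weighted least squares via an SVD-based least-squares solve."""
    sw = np.sqrt(w)
    return np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

def wls_normal(X, y, w):
    """The same problem solved through the normal equations."""
    return np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

def bootstrap_se(solver, X, y, w, n_boot=200, seed=42):
    """Bootstrap standard error of the first coefficient, resampling rows."""
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws[b] = solver(X[idx], y[idx], w[idx])[0]
    return draws.std(ddof=1)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(scale=0.2, size=40)
w = np.exp(-0.5 * rng.uniform(0.0, 2.0, size=40))

coef_gap = np.max(np.abs(wls_lstsq(X, y, w) - wls_normal(X, y, w)))
se_gap = abs(bootstrap_se(wls_lstsq, X, y, w) - bootstrap_se(wls_normal, X, y, w))
```

On a well-conditioned toy problem like this one both gaps are tiny. The size of the gap in a real fit depends on conditioning and on the solver details, which is the motivation for routing both backends through a single solver if tighter parity is wanted.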

Security

  • No findings.

Documentation/Tests

  • Severity: P3 | Impact: CHANGELOG.md:L22-L23 says the Python local-method change is cache-fallthrough removal, but the diff also removes the target-period D[t,j]==0 donor gate in _compute_observation_weights, and the new regression test explicitly treats same-cohort donor inclusion as meaningful behavior (diff_diff/trop_local.py:L432-L455, tests/test_rust_backend.py:L2552-L2616). Users reading the release notes may miss that Python local main-fit ATT can also shift on same-cohort panels. Concrete fix: expand the TROP local-method changelog entry to mention same-cohort donor inclusion / control-mask semantics alongside the cache fix.

…onor gate removal

P3 — CHANGELOG entry previously described only the `_precomputed`
cache-fallthrough fix; the diff also removes the target-period
`D[t,j]==0` donor gate in `_compute_observation_weights`, which the
new `test_local_method_same_cohort_donor_parity` regression explicitly
exercises. Release-notes readers could miss that Python local main-fit
behavior is now structurally aligned with Rust + paper on same-cohort
panels.

Rewrote the Python-side Fixed bullet into two numbered sub-items
(cache-fallthrough + donor-gate removal) with a note that the main-fit
ATT is empirically unchanged on tested fixtures because same-cohort
pre-treatment rows are exactly absorbed by their own unit fixed effect
`alpha_j` without propagating into shared parameters — so this is
structural alignment, not a numerical shift. The same-cohort
regression test name is referenced so readers can find the guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b6417dca22570f068a8c71b947691e05b05b84e1


Overall Assessment

✅ Looks good

Executive Summary

  • The prior re-review test-scope concern is addressed: tests/test_rust_backend.py:L2493-L2550 now disables the Python-side Rust LOOCV fast path and uses multi-candidate grids, so it exercises the corrected local LOOCV path in diff_diff/trop.py:L572-L593.
  • Affected method: TROP method="local". The registry and the paper’s estimator definition both use raw exponential θ_s/ω_j weights and apply treated-cell exclusion inside the loss, so the Rust de-normalization and Python donor/cache fixes are methodology-consistent (docs/methodology/REGISTRY.md:L1947-L1973, docs/methodology/REGISTRY.md:L2038-L2040, diff_diff/trop_local.py:L320-L460, rust/src/trop.rs:L509-L609).
  • The stale-cache bootstrap bug appears fully addressed: Python now refits on each boot_data draw (diff_diff/trop_local.py:L1002-L1038, diff_diff/trop_local.py:L1270-L1339), and Rust already rebuilds y_boot, d_boot, and control_mask_boot per replicate (rust/src/trop.rs:L1015-L1125).
  • The previous P3 about missing donor-gate release-note coverage is resolved, but a new minor release-note inconsistency remains at CHANGELOG.md:L22-L27.
  • Residual local bootstrap parity looseness is properly tracked as low-severity tech debt in TODO.md:L86-L86, not an untracked correctness problem.

Methodology

  • No findings. The changed local-method paths now match the registered and paper-defined estimator contract: Eq. 2 weights the loss by θ_s^{i,t} ω_j^{i,t} (1-W_js), and Eq. 3 defines θ and ω as raw exponential decays with distance computed excluding the target period over both-untreated rows (docs/methodology/REGISTRY.md:L1947-L1973, docs/methodology/REGISTRY.md:L2038-L2040, diff_diff/trop_local.py:L320-L460, diff_diff/trop_local.py:L576-L708, rust/src/trop.rs:L509-L609, rust/src/trop.rs:L623-L787).
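
For readers who want the weighting contract in concrete form, the following is a minimal sketch, not the code in diff_diff/trop_local.py or rust/src/trop.rs. The function names `raw_exponential_weights` and `weighted_loss` are hypothetical, the distances are taken as given inputs, and the nuclear-norm penalty and fixed effects are omitted. It shows only the two points the finding relies on: θ and ω are raw exponentials with no normalization, and treated cells are excluded by the (1-W) factor inside the loss rather than by editing the weights.

```python
import numpy as np

def raw_exponential_weights(time_dist, unit_dist, lambda_time, lambda_unit):
    """Return (theta, omega) as raw exponential decays, with no normalization."""
    theta = np.exp(-lambda_time * np.asarray(time_dist, dtype=float))  # shape (T,)
    omega = np.exp(-lambda_unit * np.asarray(unit_dist, dtype=float))  # shape (N,)
    return theta, omega

def weighted_loss(Y, fitted, theta, omega, W):
    """Weighted squared loss; treated cells (W == 1) drop out via (1 - W)."""
    cell_weights = np.outer(theta, omega)  # shape (T, N): theta_s * omega_j
    return float(np.sum(cell_weights * (1 - W) * (Y - fitted) ** 2))

# Toy 2x2 panel (illustrative numbers only): one treated cell at (t=1, j=1).
theta, omega = raw_exponential_weights(
    [0, 1], [0.0, 2.0], lambda_time=0.5, lambda_unit=1.0
)
W = np.array([[0, 0], [0, 1]])
loss = weighted_loss(np.ones((2, 2)), np.zeros((2, 2)), theta, omega, W)
```

In this toy case the treated cell contributes nothing to `loss`, while `theta` and `omega` keep their raw scale (neither sums to 1).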

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 (tracked in TODO.md) | Impact: local-method Rust/Python bootstrap parity still needs atol=1e-5 because per-replicate fits use different solver paths (rust/src/trop.rs::estimate_model vs NumPy lstsq), but the remaining gap is documented as roundoff-only rather than incorrect SE computation (tests/test_rust_backend.py:L2432-L2484, TODO.md:L86-L86). Concrete fix: implement the queued rust/src/trop.rs::estimate_model to rust/src/linalg.rs::solve_wls_svd unification.
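
As a rough illustration of why two correct solver paths can agree only to a tolerance, here is a small sketch on synthetic data. It assumes nothing about the internals of `estimate_model`: the normal-equations solve below is just a stand-in for "a different solver path", compared against NumPy's `lstsq` on the same weighted least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w = np.exp(-rng.uniform(size=50))  # positive observation weights

# Path A: scale rows by sqrt(w) and call the SVD-based least-squares solver.
sw = np.sqrt(w)
beta_lstsq, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)

# Path B: form and solve the weighted normal equations directly.
beta_normal = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))

# Both paths target the same minimizer; any gap is floating-point roundoff.
max_gap = float(np.max(np.abs(beta_lstsq - beta_normal)))
```

A parity test over such paths would typically use a tolerance-based comparison such as `np.testing.assert_allclose(beta_lstsq, beta_normal, atol=1e-5)` rather than exact equality.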

Security

  • No findings.

Documentation/Tests

  • Severity: P3 | Impact: CHANGELOG.md:L22-L27 is internally inconsistent. CHANGELOG.md:L22 still says the old local-method divergences (Rust normalization + Python stale cache) remain an open follow-up, but CHANGELOG.md:L23-L27 says those fixes ship in this PR and only solver-path roundoff remains. That can mislead release-note readers about whether finding #23 is actually closed. Concrete fix: delete or rewrite the trailing sentence of CHANGELOG.md:L22 so it points only to the residual roundoff follow-up in TODO.md:L86-L86.
  • No additional test-scope finding. tests/test_rust_backend.py:L2432-L2616 now covers the previously missing local-method surfaces: bootstrap parity, end-to-end LOOCV/main-fit parity, and same-cohort donor parity.
  • Validation note: I could not execute the pytest suite here because /usr/bin/python does not have the pytest module.

P3 — the PR #354 [Unreleased] Fixed entry (line 21) said local-method
bit-identity SE remained blocked by the Rust-normalization and Python
cache-fallthrough divergences and was "tracked as a follow-up in
TODO.md." With the two TROP-local Fixed entries that this PR adds
(lines 22-27) closing exactly those divergences, the PR #354 tail
sentence is now internally inconsistent with the surrounding entries.
Rewritten to say the RNG half of finding #23 is closed here (bootstrap
contract), grid-search half was closed in PR #348, and the local-
method methodology half is closed by the two Fixed entries that
follow in the same release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: af6ac60a8d8f8b688f6a60febc089a23937ea773


Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings in the changed local-method TROP paths. The prior re-review changelog inconsistency appears resolved.

Executive Summary

  • Cross-checking the changed estimator logic against the current TROP paper version (arXiv v3 dated February 11, 2026) and the registry shows the PR now matches the estimator contract: Eq. (2) excludes treated cells through (1-W_js) inside the loss, and Eq. (3) defines θ and ω as unnormalized exponential weights. (arxiv.org)
  • diff_diff/trop_local.py:L369-L460 fixes the Python stale-cache fallthrough by rebuilding weights from the passed Y, D, and the bootstrap fixed-lambda path at diff_diff/trop_local.py:L1283-L1327 now uses those resampled matrices end-to-end.
  • rust/src/trop.rs:L550-L609 removes the undocumented Rust-only normalization, bringing the local Rust backend into registry compliance.
  • The previous changelog issue is fixed: CHANGELOG.md:L22-L27 and TODO.md:L86-L86 now consistently describe the remaining local bootstrap difference as roundoff-only solver-path tech debt.
  • Added coverage in tests/test_rust_backend.py:L2432-L2615 exercises local bootstrap parity, LOOCV/main-fit parity across lambda_nn ∈ {∞, 0.1}, and same-cohort donor parity.
  • Validation note: I could not run pytest here because /usr/bin/python does not have the pytest module.
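
Why normalizing the weights matters only when a penalty is active can be sketched with a simpler analogue. The example below is an assumption-laden illustration, not the TROP estimator: it swaps the nuclear-norm penalty for a ridge penalty, and `weighted_ridge` is a hypothetical helper. The unpenalized fit stands in for the case where the penalized component is switched off.

```python
import numpy as np

def weighted_ridge(X, y, w, penalty):
    """Minimize sum_i w_i * (y_i - x_i @ b)**2 + penalty * ||b||^2."""
    A = X.T @ (w[:, None] * X) + penalty * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ (w * y))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=40)

w_raw = np.exp(-rng.uniform(0.0, 2.0, size=40))  # raw exponential weights
w_norm = w_raw / w_raw.sum()                     # normalized to sum to 1

# No penalty: rescaling every weight by a constant leaves the argmin unchanged.
beta_raw_unpen = weighted_ridge(X, y, w_raw, 0.0)
beta_norm_unpen = weighted_ridge(X, y, w_norm, 0.0)

# With a penalty: the same rescaling changes the fit ...
beta_raw_pen = weighted_ridge(X, y, w_raw, 0.1)
beta_norm_pen = weighted_ridge(X, y, w_norm, 0.1)

# ... exactly as if the penalty had been multiplied by sum(w_raw).
beta_raw_equiv = weighted_ridge(X, y, w_raw, 0.1 * w_raw.sum())
```

In this analogue, dividing the weights by their sum is equivalent to keeping the raw weights and multiplying the penalty by that sum, so normalization inflates the effective penalty whenever the raw weights sum to more than 1.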

Methodology

  • No findings. Using the current paper text and docs/methodology/REGISTRY.md:L1948-L1973 plus L2038-L2040, the changed Python and Rust implementations now align with the registered method: treated-cell exclusion happens in the loss/control mask, not by zeroing donor weights at the target period, and the weights themselves are raw exponentials rather than normalized probability weights. This removes, rather than introduces, an undocumented methodology deviation. (arxiv.org)
  • No findings. I do not see a remaining bootstrap-data mismatch in scope: Python now computes local weights from the resampled matrices (diff_diff/trop_local.py:L395-L460, L1283-L1327), and Rust was already rebuilding y_boot/d_boot before weight construction (rust/src/trop.rs:L1034-L1109).
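
The stale-cache pattern described above can be reduced to a few lines. This is a hypothetical reconstruction for illustration only: `StaleCacheHelper`, `unit_distances`, and both method names are invented here, and the distance is a plain RMS difference without the both-untreated masking the real helper applies.

```python
import numpy as np

def unit_distances(Y, target):
    """RMS difference between each unit's series and the target unit's series."""
    return np.sqrt(np.mean((Y - Y[:, [target]]) ** 2, axis=0))

class StaleCacheHelper:
    def __init__(self, Y_original):
        # Cache populated once from the original panel during the main fit.
        self._precomputed = {"Y": Y_original}

    def weights_buggy(self, Y, target, lambda_unit):
        if self._precomputed is not None:
            Y = self._precomputed["Y"]  # silently discards the Y argument
        return np.exp(-lambda_unit * unit_distances(Y, target))

    def weights_fixed(self, Y, target, lambda_unit):
        # Always recompute from the matrix that was actually passed in.
        return np.exp(-lambda_unit * unit_distances(Y, target))

# Original panel: T=4 periods, N=3 units with constant levels 0, 1, 3.
Y_orig = np.tile(np.array([0.0, 1.0, 3.0]), (4, 1))
Y_boot = Y_orig[:, [0, 0, 2]]  # bootstrap draw: unit 0 twice, then unit 2

helper = StaleCacheHelper(Y_orig)
w_buggy = helper.weights_buggy(Y_boot, target=0, lambda_unit=1.0)
w_fixed = helper.weights_fixed(Y_boot, target=0, lambda_unit=1.0)
```

With the cache branch in place, the weights for the bootstrap draw are those of the original panel; without it, the duplicated unit 0 correctly gets weight 1 in both of its columns.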

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3 (tracked in TODO.md) | Impact: local-method Rust/Python bootstrap parity still uses atol=1e-5 because Rust estimate_model and NumPy lstsq follow different solver paths, but the remaining gap is documented as roundoff-only rather than incorrect SE computation. Concrete fix: complete the tracked rust/src/trop.rs::estimate_model to rust/src/linalg.rs::solve_wls_svd unification noted at TODO.md:L86-L86.

Security

  • No findings.

Documentation/Tests

  • No findings in scope. The added tests are appropriate for the changed surfaces (tests/test_rust_backend.py:L2432-L2615), and the previous changelog inconsistency has been cleaned up in CHANGELOG.md:L22-L27.
  • Validation note: test execution was not possible in this environment because pytest is unavailable.

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 24, 2026
@igerber igerber merged commit f894506 into main Apr 24, 2026
23 of 24 checks passed
@igerber igerber deleted the fix/trop-local-method-parity branch April 24, 2026 16:40
igerber added a commit that referenced this pull request Apr 24, 2026
Rebased onto current main (17 commits clean — PR #355, #358, #359 all
merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced.
The earlier §4.10 RCS wording paired TripleDifference /
StaggeredTripleDifference together in the Explicit RCS support list,
but REGISTRY.md §StaggeredTripleDifference requires a balanced panel
and staggered_triple_diff.py:93-109 has no panel=False mode — fit()
rejects unbalanced/duplicate (unit, time) structure at
staggered_triple_diff.py:846-864.

- §4.10 Explicit RCS support: TripleDifference (two-period) only;
  StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added
  with a concrete "no panel=False mode" + "use TripleDifference for
  cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to
  the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the
  Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>