Conversation
|
Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…+ Python cache-fallthrough Closes the local-method half of silent-failures audit finding #23 (RNG half closed in PR #354; grid-search half in PR #348). Two methodology fixes, both isolated to the local-method path — global is unaffected. 1. Rust weight-matrix normalization removed ------------------------------------------ `rust/src/trop.rs::compute_weight_matrix` no longer divides `time_weights` and `unit_weights` by their respective sums before the outer product. The paper's Equation 2/3 (Athey, Imbens, Qu, Viviano 2025) and REGISTRY.md Requirements checklist (`[x] Unit weights: exp(-λ_unit × distance) (unnormalized, matching Eq. 2)`) both specify raw-exponential weights; Python's `_compute_observation_weights` was already REGISTRY-compliant. Rust's normalization inflated the effective nuclear-norm penalty relative to the data-fit term, changing the regularization trade-off. User-visible effect: Rust local-method ATT values may shift for fits with `lambda_nn < infinity`. For `lambda_nn = infinity` (factor model disabled) outputs are unchanged — uniform weight scaling leaves the minimum-norm WLS argmin invariant. Rust LOOCV-selected lambdas may also shift on that boundary; both backends now converge on the same selection. Affects both local-method Rust call sites (LOOCV at trop.rs:459, bootstrap at trop.rs:1096). 2. Python `_compute_observation_weights` cache-fallthrough removed --------------------------------------------------------------- Removed the `if self._precomputed is not None:` branch that silently substituted `self._precomputed["Y"]` / `["D"]` / `["time_dist_matrix"]` (original-panel cache populated during main fit) for the function-argument `Y, D`. Under bootstrap, `_fit_with_fixed_lambda` computes fresh `Y, D` from the resampled `boot_data` and passes them in; the helper was discarding those and recomputing unit distances from the original panel, so Python's local bootstrap resampled units but reused stale unit-distance weights. Rust's bootstrap was already correct (always consumed `y_boot, d_boot`). Test changes ------------ - `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity:: test_bootstrap_seed_reproducibility_local`: flipped from `xfail(strict=True)` to passing `assert_allclose` at `atol=1e-5` across seeds `[0, 42, 12345]`. Residual ~1e-7 gap is Rust `estimate_model` vs numpy `lstsq` roundoff that accumulates differently across per-replicate bootstrap fits; follow-up TODO row tracks unifying Rust to the `solve_wls_svd` path (same SVD helper the global-method uses since PR #348) for sub-1e-14 parity. - New `test_local_method_main_fit_parity`: parametrized over `(lambda_nn=inf, atol=1e-14)` and `(lambda_nn=0.1, atol=1e-10)`; asserts `atol=1e-14` bit-identity for the main-fit ATT at `lambda_nn=inf` (the regression guard for the normalization fix) and `atol=1e-10` for the finite-`lambda_nn` FISTA path. Verification ------------ Targeted regression sweep — all green: - 9 `TestTROPRustEdgeCaseParity` tests (grid-search + global bootstrap × 3 seeds + local bootstrap × 3 seeds + local main-fit × 2 regimes) - Full `test_rust_backend.py` suite: 92 passed - Full `test_trop.py` suite under Rust backend: 120 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ression
Closes the second P1 from the review. Python `_compute_observation_weights`
had an extra `valid_control_at_t = D[t, :] == 0` gate that zeroed ω_j for
units treated at the target period (other than the target unit itself).
Rust's `compute_weight_matrix` has no such gate — per the paper's Eq. 2/3
and `docs/methodology/REGISTRY.md` TROP section, `ω_j = exp(-λ_unit ×
dist(j, i))` is distance-based for all `j ≠ i` and the treated-cell
exclusion is the `(1 - W_{js})` factor applied inside `_estimate_model`
via the control mask, not an extra target-period unit-weight gate.
The empirical impact of removing the gate is zero on the ATT point
estimate: same-cohort donors' pre-treatment rows are exactly absorbed
by their own unit fixed effect `alpha_j` without propagating into
`mu`, `beta`, or other units' parameters — adding them to the fit
changes which rows are scored but not the solution the fit converges
to. Verified: the flipped bootstrap-seed parity test, the main-fit
parity test at `lambda_nn=inf` (`atol=1e-14`) and at `lambda_nn=0.1`
(`atol=1e-10`), and the new same-cohort regression test (below) all
pass before and after the gate removal. The change is structural
alignment with the paper and Rust, not a numerical behavior shift.
Test addition
-------------
`TestTROPRustEdgeCaseParity::test_local_method_same_cohort_donor_parity`
isolates the scenario the gate used to handle differently from Rust: a
fixture with three treated units sharing one cohort (all treated at
`t=5`) and three controls. Before the gate was removed, Python's and
Rust's same-target-period donors diverged in which rows contributed to
the fit; the tests prove the ATT point estimate was never affected
(pre-treatment rows absorbed by `alpha_j`), and now both backends also
agree structurally. Parametrized over the same regime split as the
main-fit parity test (`lambda_nn=inf` → `atol=1e-14`, `lambda_nn=0.1`
→ `atol=1e-10`).
Note on the other P1 in the review (HAD rollback claim): that finding
was a phantom caused by a stale branch base — PR #353 (HAD joint Stute
pretest) landed on `origin/main` between this branch's cut and the
review run, so the PR diff against current `origin/main` appeared to
"delete" the PR #353 additions. Resolved by rebasing onto the updated
`origin/main` before this push.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5ee90a3 to
f300033
Compare
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — the prior P1 methodology findings appear resolved in this re-review, and I did not find any unmitigated P0/P1 issues in the changed files. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
… bootstrap case
P3 — update stale comments/docstrings/changelog that still described the
old pre-fix behavior:
- `diff_diff/trop_local.py` `_compute_observation_weights` docstring
(L388-L392) rewritten: weights are assigned for every `j != i` with
distance-based ω_j; treated-cell exclusion happens via the `(1-W_{js})`
factor inside `_estimate_model` via the control mask, not via a
target-period donor gate.
- `rust/src/trop.rs` `compute_weight_matrix` paper-alignment comments
(L550-L554) rewritten: weights are unnormalized raw exponentials per
REGISTRY Eq. 2/3; ALL donor units with `j != target_unit` get
distance-based weights; treated-cell exclusion happens via the control
mask inside `estimate_model`.
- `CHANGELOG.md` local-method parity statement rewritten: main-fit ATT
parity is `atol=rtol=1e-14` for `lambda_nn=inf` and `atol=1e-10` for
finite `lambda_nn`, not uniformly "bit-identical".
P2 — extend `test_bootstrap_seed_reproducibility_local` to also cover
finite `lambda_nn=0.1`. Previously parametrized only on seed with
`lambda_nn=inf`; the finite regime is where the Rust weight-
normalization change has its biggest numerical effect, so adding one
`(seed=42, lambda_nn=0.1)` case closes the coverage gap the reviewer
flagged. Assertion tolerance unchanged (`atol=1e-5`). All 4 parametrized
cases pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — the estimator/math changes in this PR appear methodology-correct, the prior finite- Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Validation note: I could not run the targeted pytest selection locally because this environment does not have |
… status text P2 — the new main-fit parity tests patched `diff_diff.trop_local. HAS_RUST_BACKEND` but the LOOCV dispatch lives in `diff_diff.trop` at line 572-583, so the "Python" branch was still running Rust LOOCV. Combined with singleton lambda grids (`[1.0]`, `[1.0]`, `[lambda_nn]`), the tests didn't actually exercise the Rust `compute_weight_matrix` surface the normalization fix changed. Fixed in both `test_local_method_main_fit_parity` and `test_local_method_same_cohort_donor_parity`: - Patch `diff_diff.trop.HAS_RUST_BACKEND` and `diff_diff.trop._rust_loocv_grid_search` in addition to the existing `diff_diff.trop_local` patches, so the Python branch forces Python LOOCV + Python per-observation weights end-to-end. - Expand grids to `[0.1, 1.0, 10.0]` (3x3) so LOOCV selection is non-trivial and actually scores candidates via the backend under test. P3 — stale status text updated: - `diff_diff/trop_local.py` bootstrap-index-generation comment (L940-L947) no longer says local SE parity is a queued follow-up; now correctly describes the full RNG + normalization + cache-fallthrough alignment and the bootstrap-SE tolerance regime. - `TODO.md` solver-path follow-up row no longer says main-fit ATT is uniformly `1e-14` bit-identical; rewritten to reflect the regime- dependent tolerance (`1e-14` for `lambda_nn=inf`, `1e-10` for finite `lambda_nn`) that's asserted in the test suite. Verification: all 12 `TestTROPRustEdgeCaseParity` tests pass under the tightened patching + multi-candidate grids. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
…onor gate removal P3 — CHANGELOG entry previously described only the `_precomputed` cache-fallthrough fix; the diff also removes the target-period `D[t,j]==0` donor gate in `_compute_observation_weights`, which the new `test_local_method_same_cohort_donor_parity` regression explicitly exercises. Release-notes readers could miss that Python local main-fit behavior is now structurally aligned with Rust + paper on same-cohort panels. Rewrote the Python-side Fixed bullet into two numbered sub-items (cache-fallthrough + donor-gate removal) with a note that the main-fit ATT is empirically unchanged on tested fixtures because same-cohort pre-treatment rows are exactly absorbed by their own unit fixed effect `alpha_j` without propagating into shared parameters — so this is structural alignment, not a numerical shift. The same-cohort regression test name is referenced so readers can find the guard. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
P3 — the PR #354 [Unreleased] Fixed entry (line 21) said local-method bit-identity SE remained blocked by the Rust-normalization and Python cache-fallthrough divergences and was "tracked as a follow-up in TODO.md." With the two TROP-local Fixed entries that this PR adds (lines 22-27) closing exactly those divergences, the PR #354 tail sentence is now internally inconsistent with the surrounding entries. Rewritten to say the RNG half of finding #23 is closed here (bootstrap contract), grid-search half was closed in PR #348, and the local- method methodology half is closed by the two Fixed entries that follow in the same release. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good No unmitigated P0/P1 findings in the changed local-method TROP paths. The prior re-review changelog inconsistency appears resolved. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
Rebased onto current main (17 commits clean — PR #355, #358, #359 all merged since last rebase). StaggeredTripleDifference corrected as panel-only + balance-enforced. The earlier §4.10 RCS wording paired TripleDifference / StaggeredTripleDifference together in the Explicit RCS support list, but REGISTRY.md §StaggeredTripleDifference requires a balanced panel and staggered_triple_diff.py:93-109 has no panel=False mode — fit() rejects unbalanced/duplicate (unit, time) structure at staggered_triple_diff.py:846-864. - §4.10 Explicit RCS support: TripleDifference (two-period) only; StaggeredTripleDifference removed from the supported set. - §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added with a concrete "no panel=False mode" + "use TripleDifference for cross-sectional DDD" pointer. - §3 Balanced-panel eligibility: StaggeredTripleDifference added to the balance-sensitive gate. Regression tests extended: - Balanced-panel proximity check now covers StaggeredTripleDifference. - §4.10 section test asserts StaggeredTripleDifference appears in the Explicitly rejected block and NOT in the Explicit RCS support block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (17 commits clean — PR #355, #358, #359 all merged since last rebase). StaggeredTripleDifference corrected as panel-only + balance-enforced. The earlier §4.10 RCS wording paired TripleDifference / StaggeredTripleDifference together in the Explicit RCS support list, but REGISTRY.md §StaggeredTripleDifference requires a balanced panel and staggered_triple_diff.py:93-109 has no panel=False mode — fit() rejects unbalanced/duplicate (unit, time) structure at staggered_triple_diff.py:846-864. - §4.10 Explicit RCS support: TripleDifference (two-period) only; StaggeredTripleDifference removed from the supported set. - §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added with a concrete "no panel=False mode" + "use TripleDifference for cross-sectional DDD" pointer. - §3 Balanced-panel eligibility: StaggeredTripleDifference added to the balance-sensitive gate. Regression tests extended: - Balanced-panel proximity check now covers StaggeredTripleDifference. - §4.10 section test asserts StaggeredTripleDifference appears in the Explicitly rejected block and NOT in the Explicit RCS support block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes the local-method half of silent-failures audit finding #23 (RNG half closed in PR #354; grid-search half in PR #348). Two methodology fixes isolated to the local-method path — global is unaffected.
compute_weight_matrixno longer normalizes time or unit weight vectors before the outer product. REGISTRY.md (Requirements checklist line 2037:[x] Unit weights: exp(-λ_unit × distance) (unnormalized, matching Eq. 2)) and the paper's Eq. 2/3 both specify raw-exponential weights; Python's_compute_observation_weightswas already REGISTRY-compliant. User-visible effect: Rust local-method ATT values may shift for fits withlambda_nn < infinity— normalization inflated the effective nuclear-norm penalty relative to the data-fit term. Forlambda_nn = infinityoutputs are unchanged (uniform weight scaling leaves the minimum-norm WLS argmin invariant). Rust LOOCV-selected lambdas may also shift; both backends now converge on the same selection. Affects both local-method Rust call sites (LOOCV attrop.rs:459, bootstrap attrop.rs:1096)._compute_observation_weightsno longer reads stale_precomputedcache. Removed theif self._precomputed is not None:branch that silently substitutedself._precomputed['Y'] / ['D'] / ['time_dist_matrix'](original-panel cache populated during main fit) for the function-argumentY, D. Under bootstrap,_fit_with_fixed_lambdapasses freshY, Dfromboot_data; the helper was discarding those and recomputing unit distances from the original panel, so Python's local bootstrap resampled units but reused stale unit-distance weights. Rust's bootstrap was already correct.Methodology references
Validation
TestTROPRustEdgeCaseParity::test_local_method_main_fit_parity— parametrized over(lambda_nn=inf, atol=1e-14)and(lambda_nn=0.1, atol=1e-10). Theinfregime asserts bit-identity main-fit ATT as the regression guard for the normalization fix.TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_localflipped fromxfail(strict=True)to passingassert_allclose(atol=1e-5)across seeds[0, 42, 12345]. Residual ~1e-7 gap is Rustestimate_modelvs numpylstsqroundoff across per-replicate bootstrap fits; follow-up TODO row tracks unifying Rust to thesolve_wls_svdSVD path (same helper global-method uses since PR Unify Rust TROP inner solver to SVD (close finding #23 grid-search divergence) #348) for sub-1e-14 parity.TestTROPRustEdgeCaseParitytests pass (grid-search + global bootstrap × 3 seeds + local bootstrap × 3 seeds + local main-fit × 2 regimes).tests/test_rust_backend.py: 92 passed.tests/test_trop.pyunder Rust backend: 120 passed.DIFF_DIFF_BACKEND=python): 36 passed, 0 failed — confirms the_compute_observation_weightscache-branch removal is behavior-preserving for main-fit (cache data was the original panel, so cache and fallback were mathematically equivalent) and the bootstrap correction is not regressed by any hard-pinned Python test.Security / privacy
🤖 Generated with Claude Code