From e59610bf0bd7a3771c3146134bb4e8f5f7c060fd Mon Sep 17 00:00:00 2001 From: igerber Date: Mon, 20 Apr 2026 20:03:19 -0400 Subject: [PATCH 1/2] Unify Rust TROP inner solver to SVD; close finding #23 grid-search divergence MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the grid-search half of silent-failures finding #23 (TODO row 87). The `xfail(strict=True)` regression `test_grid_search_rank_deficient_Y` baselined a ~6% ATT divergence between Rust and Python on two near-parallel control units. Root cause: Rust's `solve_joint_no_lowrank` used iterative block coordinate descent (50 iter, tol=1e-8) while Python used SVD-based minimum-norm least squares. On rank-deficient Y the two solvers converge to different stationary points of the same objective. Python is canonical (SVD / minimum-norm least squares per Golub & Van Loan). Rust's iterative solver was a speed optimization, not a methodology choice. Port the Rust inner TWFE step to SVD-based WLS that mirrors Python's `np.linalg.lstsq(rcond=None)` step-for-step, with numpy-compatible `rcond = eps * max(n, k)`. Changes - rust/src/linalg.rs: promote ndarray_to_faer to pub(crate) so trop.rs can reuse it. - rust/src/trop.rs: new module-private solve_wls_svd helper — thin-SVD + rcond truncation, matches numpy's minimum-norm semantics. Rewrite solve_joint_no_lowrank body to flatten y/weights row-major, build the [intercept | unit_dummies[1..] | time_dummies[1..]] design matrix, apply sqrt-weights, and solve via solve_wls_svd. Function signature unchanged — all 4 call sites (LOOCV, FISTA TWFE step x2, bootstrap) benefit transitively. - tests/test_rust_backend.py: remove @pytest.mark.xfail from test_grid_search_rank_deficient_Y; the gap is closed. Bootstrap-seed test retains its xfail (row 87 RNG mismatch, out of scope). 
- docs/methodology/REGISTRY.md: update the TROP Global Estimation bullet at the existing `np.linalg.lstsq` line to note Rust and Python now both use SVD-based minimum-norm WLS with numpy-compatible rcond. - TODO.md: delete row 87 (grid-search divergence entry). Verification - maturin develop --release --features accelerate: clean build, no warnings. - pytest tests/test_rust_backend.py::TestTROPRustEdgeCaseParity: grid-search test now passes; bootstrap-seed test correctly xfails. - pytest tests/test_rust_backend.py -k TROP -m '': 23 passed, 1 xfailed, no regressions. - pytest tests/test_trop.py: 83 passed, 37 deselected (slow). - TestTROPGlobalRustVsNumpy (incl. lambda_nn=0 low-rank FISTA path): 8 passed — FISTA TWFE step unchanged in behavior on well-conditioned data. - grep for other 'for _ in 0..50' coordinate-descent patterns in rust/src/*.rs: none found. Non-goals - No changes to row 87 (bootstrap RNG mismatch — Rust rand crate vs numpy default_rng ~28% SE gap on seed=42). Separate PR. - No changes to linalg.rs::solve_ols (rcond=1e-7 is load-bearing for MultiPeriodDiD / DiD / TWFE). - No public API changes. Co-Authored-By: Claude Opus 4.7 (1M context) --- TODO.md | 1 - docs/methodology/REGISTRY.md | 4 + rust/src/linalg.rs | 2 +- rust/src/trop.rs | 215 ++++++++++++++++++++++------------- tests/test_rust_backend.py | 22 ++-- 5 files changed, 148 insertions(+), 96 deletions(-) diff --git a/TODO.md b/TODO.md index 5893cffa..94612417 100644 --- a/TODO.md +++ b/TODO.md @@ -83,7 +83,6 @@ Deferred items from PR reviews that were not addressed before merge. | Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium | | Regenerate `benchmarks/data/clubsandwich_cr2_golden.json` from R (`Rscript benchmarks/R/generate_clubsandwich_golden.R`). 
Current JSON has `source: python_self_reference` as a stability anchor until an authoritative R run. | `benchmarks/R/generate_clubsandwich_golden.R` | Phase 1a | Medium | | `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low | -| TROP Rust vs Python grid-search divergence on rank-deficient Y: on two near-parallel control units, LOOCV grid-search ATT diverges ~6% between Rust (`trop_global.py:688`) and Python fallback (`trop_global.py:753`). Either grid-winner ties are broken differently or the per-λ solver reaches different stationary points under rank deficiency. Audit finding #23 flagged this surface. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_grid_search_rank_deficient_Y` baselines the gap. | `trop_global.py`, `rust/` | follow-up | Medium | | TROP Rust vs Python bootstrap SE divergence under fixed seed: `seed=42` on a tiny panel produces ~28% bootstrap-SE gap. Root cause: Rust bootstrap uses its own RNG (`rand` crate) while Python uses `numpy.random.default_rng`; same seed value maps to different bytestreams across backends. Audit axis-H (RNG/seed) adjacent. `@pytest.mark.xfail(strict=True)` in `tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility` baselines the gap. Unifying RNG (threading a numpy-generated seed-sequence into Rust, or porting Python to ChaCha) would close it. 
| `trop_global.py`, `rust/` | follow-up | Medium | | `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low | | `bias_corrected_local_linear`: expose `vce in {"hc0", "hc1", "hc2", "hc3"}` on the public wrapper once R parity goldens exist (currently raises `NotImplementedError`). The port-level `lprobust` and `lprobust_res` already support all four; expanding the public surface requires a golden generator for each hc mode and a decision on hc2/hc3 q-fit leverage (R reuses p-fit `hii` for q-fit residuals; whether to match that or stage-match deserves a derivation before the wrapper advertises CCT-2014 conformance). | `diff_diff/local_linear.py::bias_corrected_local_linear`, `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Medium | diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index dcc1e209..4db62fb4 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2059,6 +2059,10 @@ Treatment effects are **heterogeneous** per-observation values. ATT is their mea 1. **Without low-rank (λ_nn = ∞)**: Standard weighted least squares - Build design matrix with unit/time dummies (no treatment indicator) - Solve via np.linalg.lstsq for (μ, α, β) using (1-W)-masked weights + - Both the Python fallback and the Rust acceleration path use SVD-based + minimum-norm least squares with numpy-compatible rcond = eps × max(n, k), + so they return the canonical minimum-norm solution on rank-deficient Y + (e.g., two near-parallel control units) 2. 
**With low-rank (finite λ_nn)**: Alternating minimization - Alternate between: diff --git a/rust/src/linalg.rs b/rust/src/linalg.rs index 8d4c5854..002b2b2f 100644 --- a/rust/src/linalg.rs +++ b/rust/src/linalg.rs @@ -277,7 +277,7 @@ fn compute_robust_vcov_internal( } /// Convert ndarray Array2 to faer Mat -fn ndarray_to_faer(arr: &Array2<f64>) -> faer::Mat<f64> { +pub(crate) fn ndarray_to_faer(arr: &Array2<f64>) -> faer::Mat<f64> { let nrows = arr.nrows(); let ncols = arr.ncols(); faer::Mat::from_fn(nrows, ncols, |i, j| arr[[i, j]]) diff --git a/rust/src/trop.rs b/rust/src/trop.rs index 4257d4a8..b1f52e5c 100644 --- a/rust/src/trop.rs +++ b/rust/src/trop.rs @@ -14,6 +14,8 @@ use numpy::{PyArray1, PyArray2, PyReadonlyArray1, PyReadonlyArray2, ToPyArray}; use pyo3::prelude::*; use rayon::prelude::*; +use crate::linalg::ndarray_to_faer; + /// Minimum chunk size for parallel distance computation. /// Reduces scheduling overhead for small matrices. const MIN_CHUNK_SIZE: usize = 16; @@ -1233,111 +1235,164 @@ fn solve_joint_no_lowrank( y: &ArrayView2<f64>, delta: &ArrayView2<f64>, ) -> Option<(f64, Array1<f64>, Array1<f64>)> { + // SVD-based minimum-norm weighted least-squares fit — mirrors Python's + // `_solve_global_no_lowrank` at `diff_diff/trop_global.py:340-412` + // step-for-step so Rust and Python paths produce the same canonical + // solution on rank-deficient Y (silent-failures finding #23). + // + // Model: Y_it = μ + α_i + β_t + ε_it, with α_0 = β_0 = 0 for + // identification. Weights: δ_it. Flatten row-major with + // idx = t * n_units + i (matches Python's Y.flatten() C-order). 
let n_periods = y.nrows(); let n_units = y.ncols(); + let n_obs = n_periods * n_units; + let n_params = 1 + (n_units - 1) + (n_periods - 1); - // We solve using normal equations with the design matrix structure - // Rather than build full X matrix, use block structure for efficiency - // - // The model: Y_it = μ + α_i + β_t + ε_it - // With identification: α_0 = β_0 = 0 - - // Compute weighted sums needed for normal equations + // Flatten y + weights with NaN masking — matches trop_global.py:354-360. + let mut y_flat = Array1::<f64>::zeros(n_obs); + let mut w_flat = Array1::<f64>::zeros(n_obs); let mut sum_w = 0.0; - let mut sum_wy = 0.0; - - // Per-unit and per-period weighted sums - let mut sum_w_by_unit = Array1::<f64>::zeros(n_units); - let mut sum_wy_by_unit = Array1::<f64>::zeros(n_units); - let mut sum_w_by_period = Array1::<f64>::zeros(n_periods); - let mut sum_wy_by_period = Array1::<f64>::zeros(n_periods); - for t in 0..n_periods { for i in 0..n_units { - // NaN outcomes get zero weight (not imputed to 0.0 with active weight) - let w = if y[[t, i]].is_finite() { delta[[t, i]] } else { 0.0 }; - let y_ti = if y[[t, i]].is_finite() { y[[t, i]] } else { 0.0 }; - + let idx = t * n_units + i; + let y_ti = y[[t, i]]; + let w_ti = delta[[t, i]]; + let valid = y_ti.is_finite() && w_ti.is_finite(); + let w = if valid { w_ti.max(0.0) } else { 0.0 }; + y_flat[idx] = if valid { y_ti } else { 0.0 }; + w_flat[idx] = w; sum_w += w; - sum_wy += w * y_ti; - - sum_w_by_unit[i] += w; - sum_wy_by_unit[i] += w * y_ti; - sum_w_by_period[t] += w; - sum_wy_by_period[t] += w * y_ti; } } + // All-zero weights short-circuit — matches trop_global.py:366. if sum_w < 1e-10 { return None; } - // Use iterative approach: alternate between (alpha, beta) and mu - // until convergence (simpler than full normal equations) - let mut mu = sum_wy / sum_w; + // Build design matrix X = [intercept | unit_dummies[1..] | time_dummies[1..]] + // — matches trop_global.py:374-385. 
Explicit nested loops so the + // index correspondence with Python is unambiguous. + let mut x = Array2::<f64>::zeros((n_obs, n_params)); + for t in 0..n_periods { + for i in 0..n_units { + let idx = t * n_units + i; + x[[idx, 0]] = 1.0; // intercept + if i >= 1 { + x[[idx, i]] = 1.0; // unit dummy (unit 0 dropped) + } + if t >= 1 { + x[[idx, (n_units - 1) + t]] = 1.0; // time dummy (period 0 dropped) + } + } + } + + // Apply sqrt-weights: X_w = X * sqrt(w)[:, None], y_w = y * sqrt(w). + // Matches trop_global.py:387-389. + let sqrt_w: Array1<f64> = w_flat.mapv(|w| w.sqrt()); + for r in 0..n_obs { + let s = sqrt_w[r]; + for c in 0..n_params { + x[[r, c]] *= s; + } + y_flat[r] *= s; + } + + // Solve via SVD with numpy-compatible rcond truncation. + let coeffs = solve_wls_svd(&x.view(), &y_flat.view())?; + + // Unpack: mu = coeffs[0], alpha[1..] = coeffs[1..n_units], + // beta[1..] = coeffs[n_units..]. Matches trop_global.py:406-410. + let mu = coeffs[0]; let mut alpha = Array1::<f64>::zeros(n_units); + for i in 1..n_units { + alpha[i] = coeffs[i]; + } let mut beta = Array1::<f64>::zeros(n_periods); + for t in 1..n_periods { + beta[t] = coeffs[(n_units - 1) + t]; + } - for _ in 0..50 { - let mu_old = mu; - let alpha_old = alpha.clone(); - let beta_old = beta.clone(); + Some((mu, alpha, beta)) +} - // Update alpha (fixing beta, mu) - for i in 1..n_units { // α_0 = 0 for identification - if sum_w_by_unit[i] > 1e-10 { - let mut num = 0.0; - for t in 0..n_periods { - // NaN outcomes get zero weight - let w = if y[[t, i]].is_finite() { delta[[t, i]] } else { 0.0 }; - let y_ti = if y[[t, i]].is_finite() { y[[t, i]] } else { 0.0 }; - num += w * (y_ti - mu - beta[t]); - } - alpha[i] = num / sum_w_by_unit[i]; - } - } +/// Minimum-norm least-squares solution via faer thin SVD with rcond truncation. +/// +/// Mirrors `np.linalg.lstsq(A, b, rcond=None)` from numpy: singular values +/// below `rcond * max(S)` with `rcond = eps * max(n_rows, n_cols)` are +/// treated as zero. 
On rank-deficient A this returns the unique +/// minimum-norm least-squares solution. +/// +/// This helper intentionally does NOT reuse `rust/src/linalg.rs::solve_ols` +/// because `solve_ols` hard-codes `rcond = 1e-7` (R's `lm()` default) which +/// would truncate singular values that numpy's default keeps. TROP's +/// canonical Python path is numpy-compatible; Rust must match. +/// +/// Returns `None` only when the SVD itself fails (rare on finite inputs); +/// the caller (LOOCV / FISTA / bootstrap) interprets `None` as an +/// unsuccessful fit. +fn solve_wls_svd(a: &ArrayView2<f64>, b: &ArrayView1<f64>) -> Option<Array1<f64>> { + let n_rows = a.nrows(); + let n_cols = a.ncols(); + let a_owned = a.to_owned(); + let b_owned = b.to_owned(); - // Update beta (fixing alpha, mu) - for t in 1..n_periods { // β_0 = 0 for identification - if sum_w_by_period[t] > 1e-10 { - let mut num = 0.0; - for i in 0..n_units { - // NaN outcomes get zero weight - let w = if y[[t, i]].is_finite() { delta[[t, i]] } else { 0.0 }; - let y_ti = if y[[t, i]].is_finite() { y[[t, i]] } else { 0.0 }; - num += w * (y_ti - mu - alpha[i]); - } - beta[t] = num / sum_w_by_period[t]; - } + // Convert ndarray -> faer for SVD. + let a_faer = ndarray_to_faer(&a_owned); + + let svd = a_faer.thin_svd().ok()?; + + let u_faer = svd.U(); + let s_diag = svd.S(); + let s_col = s_diag.column_vector(); + let v_faer = svd.V(); + + // Extract U (n_rows x min(n,k)) back to ndarray. + let u_rows = u_faer.nrows(); + let u_cols = u_faer.ncols(); + let mut u = Array2::<f64>::zeros((u_rows, u_cols)); + for i in 0..u_rows { + for j in 0..u_cols { + u[[i, j]] = u_faer[(i, j)]; } + } - // Update mu (fixing alpha, beta) - let mut num_mu = 0.0; - for t in 0..n_periods { - for i in 0..n_units { - // NaN outcomes get zero weight - let w = if y[[t, i]].is_finite() { delta[[t, i]] } else { 0.0 }; - let y_ti = if y[[t, i]].is_finite() { y[[t, i]] } else { 0.0 }; - num_mu += w * (y_ti - alpha[i] - beta[t]); - } + // Extract singular values. 
+ let s_len = s_col.nrows(); + let mut s = Array1::<f64>::zeros(s_len); + for i in 0..s_len { + s[i] = s_col[i]; + } + + // Extract V (k x min(n,k)) back to ndarray. faer's V is not V^T. + let v_rows = v_faer.nrows(); + let v_cols = v_faer.ncols(); + let mut v = Array2::<f64>::zeros((v_rows, v_cols)); + for i in 0..v_rows { + for j in 0..v_cols { + v[[i, j]] = v_faer[(i, j)]; } - mu = num_mu / sum_w; - - // Check convergence across ALL parameters (not just mu) - let mu_diff = (mu - mu_old).abs(); - let alpha_diff = alpha.iter().zip(alpha_old.iter()) - .map(|(a, b)| (a - b).abs()) - .fold(0.0_f64, f64::max); - let beta_diff = beta.iter().zip(beta_old.iter()) - .map(|(a, b)| (a - b).abs()) - .fold(0.0_f64, f64::max); - let max_diff = mu_diff.max(alpha_diff).max(beta_diff); - if max_diff < 1e-8 { - break; + } + + // numpy rcond = eps * max(n_rows, n_cols); truncate s[i] <= rcond * max(s). + let rcond = f64::EPSILON * (n_rows.max(n_cols) as f64); + let s_max = s.iter().cloned().fold(0.0_f64, f64::max); + let threshold = s_max * rcond; + + // Compute β = V * S^{-1}_truncated * U^T * y. + let uty = u.t().dot(&b_owned); // (min(n,k),) + let mut s_inv_uty = Array1::<f64>::zeros(s_len); + for i in 0..s_len { + if s[i] > threshold { + s_inv_uty[i] = uty[i] / s[i]; } + // else: leave 0 — this is the pseudo-inverse / minimum-norm step + // that also covers Python's `except LinAlgError: pinv(...)` fallback + // tier, since faer thin_svd is numerically robust on finite inputs. } + let coeffs = v.dot(&s_inv_uty); - Some((mu, alpha, beta)) + Some(coeffs) } /// Solve global TWFE + low-rank via alternating minimization (no tau). 
diff --git a/tests/test_rust_backend.py b/tests/test_rust_backend.py index 360c152d..6b2359ba 100644 --- a/tests/test_rust_backend.py +++ b/tests/test_rust_backend.py @@ -2204,23 +2204,17 @@ def _make_correlated_panel(n_units=6, n_periods=5, n_treated=2): data.append({"unit": i, "time": t, "outcome": y, "treated": treated}) return pd.DataFrame(data) - @pytest.mark.xfail( - strict=True, - reason="TROP Rust grid-search at trop_global.py:688 and Python fallback " - "at trop_global.py:753 diverge on rank-deficient Y: empirical ATT " - "gap ~6% on two near-parallel control units. Either (a) grid-winner " - "ties break differently between backends, or (b) the per-λ solver " - "itself reaches different stationary points under rank deficiency. " - "Finding #23 flagged this exact surface as the Phase-2 gap. Rust " - "vs Python unification is a P1 follow-up (TODO.md). This xfail " - "baselines the divergence so we notice when/if the backends align.", - ) def test_grid_search_rank_deficient_Y(self): """Grid-search ATT parity on rank-deficient Y. - Known to fail: two near-parallel control units produce ~6% ATT - divergence between Rust and Python LOOCV grid-search. See xfail - reason for follow-up plan. + Finding #23 / TODO row 87 regression guard. Previously a ~6% + ATT divergence on two near-parallel control units because the + Rust inner solver used iterative block coordinate descent while + the Python fallback used SVD-based minimum-norm least squares. + Fixed by porting the Rust inner solver to an SVD-based WLS path + (numpy-compatible rcond = eps*max(n,k)) that mirrors Python's + `np.linalg.lstsq(rcond=None)` step-for-step. This test asserts + the backends now agree at atol=1e-6 on rank-deficient Y. 
""" import sys from unittest.mock import patch From bbbcc4db0e7d325d706ad6cc2178c838e4571a01 Mon Sep 17 00:00:00 2001 From: igerber Date: Mon, 20 Apr 2026 20:30:55 -0400 Subject: [PATCH 2/2] Address PR #348 CI review P3: drop stale TODO row-87 reference in test docstring After this PR deletes the old row 87 from TODO.md, row 87 now points to a different item. Replace the row-number breadcrumb with "Silent-failures audit Finding #23 (grid-search half)" which is stable across future TODO.md reshuffles. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/test_rust_backend.py | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/tests/test_rust_backend.py b/tests/test_rust_backend.py index 6b2359ba..228565d6 100644 --- a/tests/test_rust_backend.py +++ b/tests/test_rust_backend.py @@ -2207,12 +2207,13 @@ def _make_correlated_panel(n_units=6, n_periods=5, n_treated=2): def test_grid_search_rank_deficient_Y(self): """Grid-search ATT parity on rank-deficient Y. - Finding #23 / TODO row 87 regression guard. Previously a ~6% - ATT divergence on two near-parallel control units because the - Rust inner solver used iterative block coordinate descent while - the Python fallback used SVD-based minimum-norm least squares. - Fixed by porting the Rust inner solver to an SVD-based WLS path - (numpy-compatible rcond = eps*max(n,k)) that mirrors Python's + Silent-failures audit Finding #23 (grid-search half) regression + guard. Previously a ~6% ATT divergence on two near-parallel + control units because the Rust inner solver used iterative block + coordinate descent while the Python fallback used SVD-based + minimum-norm least squares. Fixed by porting the Rust inner + solver to an SVD-based WLS path (numpy-compatible + rcond = eps*max(n,k)) that mirrors Python's `np.linalg.lstsq(rcond=None)` step-for-step. This test asserts the backends now agree at atol=1e-6 on rank-deficient Y. """
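Reviewer aid, not part of the patch: the numpy semantics the Rust `solve_wls_svd` helper mirrors can be checked standalone. A minimal Python sketch with assumed toy panel sizes (3 units, 4 periods) and a hypothetical all-zero-weight unit to force rank deficiency after sqrt-weighting; the design-matrix layout and the rcond = eps * max(n, k) truncation rule follow the commit message.

```python
import numpy as np

n_units, n_periods = 3, 4
n_obs = n_units * n_periods
n_params = 1 + (n_units - 1) + (n_periods - 1)

# [intercept | unit_dummies[1..] | time_dummies[1..]], flattened
# row-major with idx = t * n_units + i (unit 0 / period 0 dropped).
X = np.zeros((n_obs, n_params))
for t in range(n_periods):
    for i in range(n_units):
        idx = t * n_units + i
        X[idx, 0] = 1.0
        if i >= 1:
            X[idx, i] = 1.0
        if t >= 1:
            X[idx, (n_units - 1) + t] = 1.0

rng = np.random.default_rng(0)
y = rng.normal(size=n_obs)

# Zero weight on unit 2 kills its dummy column after sqrt-weighting,
# making X_w rank-deficient: the case where solver choice matters.
w = np.ones(n_obs)
for t in range(n_periods):
    w[t * n_units + 2] = 0.0
sw = np.sqrt(w)
Xw, yw = X * sw[:, None], y * sw

# Reference: numpy minimum-norm least squares (rcond=None default).
beta_np, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

# Manual thin-SVD solve with the same rcond = eps * max(n, k) cutoff.
U, s, Vt = np.linalg.svd(Xw, full_matrices=False)
threshold = np.finfo(float).eps * max(Xw.shape) * s.max()
keep = s > threshold
beta_svd = Vt.T[:, keep] @ ((U[:, keep].T @ yw) / s[keep])

assert np.allclose(beta_np, beta_svd, atol=1e-10)
assert abs(beta_np[2]) < 1e-8  # minimum-norm zeroes the dead dummy column
```

Both solves agree, and both set the coefficient on the zero column to zero rather than an arbitrary stationary-point value, which is the property the SVD port relies on to close the grid-search divergence.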