From 888b5d1f9f275932411e0ecfae443c3289834f08 Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 19 Apr 2026 07:52:38 -0400 Subject: [PATCH] Exclude flaky wall-clock timing tests from default CI MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI was hitting false-positive failures on tests that assert wall-clock elapsed time, e.g. TestCallawaySantAnnaSEAccuracy.test_timing_performance failing on PR #328 with "Estimation took 0.103s, expected <0.1s" — a 3% overshoot that reflects runner noise (BLAS path variation, neighbor VM contention, cold caches), not a real regression. The existing "20x margin" comment acknowledged the problem but no fixed threshold can absorb CI wall-clock variance. Mark the two CS timing tests @pytest.mark.slow so they're excluded from default CI (existing addopts already uses `-m 'not slow'`). Tests remain runnable on-demand via `pytest -m slow` for local benchmarking — same pattern the TROP suite uses per CLAUDE.md. Tests marked: - test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance (method-level marker — rest of the SE-correctness class still runs) - test_se_accuracy.py::TestPerformanceRegression (class-level marker — all three parametrized timing cases move out of default CI together) Left unchanged: test_methodology_honest_did.py::test_m0_short_circuit uses wall-clock as a proxy for "short-circuit path taken" — that's a correctness signal, not a performance signal. Marking it slow would remove the check. Added a TODO.md entry to replace it with a mock/spy in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) --- TODO.md | 1 + tests/test_se_accuracy.py | 17 +++++++++++++++-- 2 files changed, 16 insertions(+), 2 deletions(-) diff --git a/TODO.md b/TODO.md index a27d34a6..a7e1ca00 100644 --- a/TODO.md +++ b/TODO.md @@ -98,6 +98,7 @@ Deferred items from PR reviews that were not addressed before merge. | Sphinx autodoc fails to import 3 result members: `DiDResults.ci`, `MultiPeriodDiDResults.att`, `CallawaySantAnnaResults.aggregate` — investigate whether these are renamed/removed or just unresolvable from autosummary template | `docs/api/results.rst`, `docs/api/staggered.rst` | — | Medium | | `EDiDBootstrapResults` cross-reference is ambiguous — class is exported from both `diff_diff` and `diff_diff.efficient_did_bootstrap`, producing 3 "more than one target found" warnings. Add `:noindex:` to one source or use full-path refs | `diff_diff/efficient_did_results.py`, `docs/api/efficient_did.rst` | — | Low | | Tracked Sphinx autosummary stubs in `docs/api/_autosummary/*.rst` are stale — every sphinx build regenerates them with new attributes (e.g., `coef_var`, `survey_metadata`) that have been added to result classes. Either commit a refresh or move the directory to `.gitignore` and treat as build output. Also 6 untracked stubs exist for newer estimators (`WooldridgeDiD`, `SimulationMDEResults`, etc.) that have never been committed. | `docs/api/_autosummary/` | — | Low | +| HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` | — | Low | --- diff --git a/tests/test_se_accuracy.py b/tests/test_se_accuracy.py index 88b5038f..55de69f3 100644 --- a/tests/test_se_accuracy.py +++ b/tests/test_se_accuracy.py @@ -252,12 +252,18 @@ def test_se_vs_r_benchmark(self): assert se_diff_pct < 0.01, \ f"SE differs from R by {se_diff_pct:.4f}%, expected <0.01%" + @pytest.mark.slow def test_timing_performance(self, cs_results): """ Ensure estimation timing doesn't regress. Baseline: ~0.005s for 200 units x 8 periods (small scale) - Threshold: <0.1s (20x margin for CI variance) + Threshold: <0.1s. + + Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time + on shared runners is noisy (BLAS path variation, neighbor VM + contention, cold caches) and produces false positives. Run locally + with ``pytest -m slow`` for ad-hoc performance sanity checks. """ _, elapsed = cs_results @@ -398,8 +404,15 @@ def test_influence_function_normalization(self): f"Python SE {se_py:.4f} doesn't match standard {se_standard:.4f}" +@pytest.mark.slow class TestPerformanceRegression: - """Tests to prevent performance regression.""" + """Tests to prevent performance regression. + + Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time on + shared runners is noisy (BLAS path variation, neighbor VM contention, + cold caches) and produces false positives. Run locally with + ``pytest -m slow`` for ad-hoc performance sanity checks. + """ @pytest.mark.parametrize("n_units,max_time", [ (100, 0.15), # Small: <150ms (CI runners need headroom)