From 888b5d1f9f275932411e0ecfae443c3289834f08 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 07:52:38 -0400
Subject: [PATCH] Exclude flaky wall-clock timing tests from default CI
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI was hitting false-positive failures on tests that assert wall-clock
elapsed time, e.g. TestCallawaySantAnnaSEAccuracy.test_timing_performance
failing on PR #328 with "Estimation took 0.103s, expected <0.1s" — a 3%
overshoot that reflects runner noise (BLAS path variation, neighbor VM
contention, cold caches), not a real regression. The existing "20x margin"
comment acknowledged the problem but no fixed threshold can absorb CI
wall-clock variance.

Mark the two CS timing tests @pytest.mark.slow so they're excluded from
default CI (existing addopts already uses `-m 'not slow'`). Tests remain
runnable on-demand via `pytest -m slow` for local benchmarking — same
pattern the TROP suite uses per CLAUDE.md.

Tests marked:
- test_se_accuracy.py::TestCallawaySantAnnaSEAccuracy::test_timing_performance
  (method-level marker — rest of the SE-correctness class still runs)
- test_se_accuracy.py::TestPerformanceRegression (class-level marker — all
  three parametrized timing cases move out of default CI together)

Left unchanged: test_methodology_honest_did.py::test_m0_short_circuit uses
wall-clock as a proxy for "short-circuit path taken" — that's a correctness
signal, not a performance signal. Marking it slow would remove the check.
Added a TODO.md entry to replace it with a mock/spy in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 TODO.md                   |  1 +
 tests/test_se_accuracy.py | 17 +++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/TODO.md b/TODO.md
index a27d34a6..a7e1ca00 100644
--- a/TODO.md
+++ b/TODO.md
@@ -98,6 +98,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | Sphinx autodoc fails to import 3 result members: `DiDResults.ci`, `MultiPeriodDiDResults.att`, `CallawaySantAnnaResults.aggregate` — investigate whether these are renamed/removed or just unresolvable from autosummary template | `docs/api/results.rst`, `docs/api/staggered.rst` | — | Medium |
 | `EDiDBootstrapResults` cross-reference is ambiguous — class is exported from both `diff_diff` and `diff_diff.efficient_did_bootstrap`, producing 3 "more than one target found" warnings. Add `:noindex:` to one source or use full-path refs | `diff_diff/efficient_did_results.py`, `docs/api/efficient_did.rst` | — | Low |
 | Tracked Sphinx autosummary stubs in `docs/api/_autosummary/*.rst` are stale — every sphinx build regenerates them with new attributes (e.g., `coef_var`, `survey_metadata`) that have been added to result classes. Either commit a refresh or move the directory to `.gitignore` and treat as build output. Also 6 untracked stubs exist for newer estimators (`WooldridgeDiD`, `SimulationMDEResults`, etc.) that have never been committed. | `docs/api/_autosummary/` | — | Low |
+| HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` | — | Low |
 
 ---
 
diff --git a/tests/test_se_accuracy.py b/tests/test_se_accuracy.py
index 88b5038f..55de69f3 100644
--- a/tests/test_se_accuracy.py
+++ b/tests/test_se_accuracy.py
@@ -252,12 +252,18 @@ def test_se_vs_r_benchmark(self):
         assert se_diff_pct < 0.01, \
             f"SE differs from R by {se_diff_pct:.4f}%, expected <0.01%"
 
+    @pytest.mark.slow
     def test_timing_performance(self, cs_results):
         """
         Ensure estimation timing doesn't regress.
 
         Baseline: ~0.005s for 200 units x 8 periods (small scale)
-        Threshold: <0.1s (20x margin for CI variance)
+        Threshold: <0.1s.
+
+        Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time
+        on shared runners is noisy (BLAS path variation, neighbor VM
+        contention, cold caches) and produces false positives. Run locally
+        with ``pytest -m slow`` for ad-hoc performance sanity checks.
         """
         _, elapsed = cs_results
 
@@ -398,8 +404,15 @@ def test_influence_function_normalization(self):
             f"Python SE {se_py:.4f} doesn't match standard {se_standard:.4f}"
 
 
+@pytest.mark.slow
 class TestPerformanceRegression:
-    """Tests to prevent performance regression."""
+    """Tests to prevent performance regression.
+
+    Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time on
+    shared runners is noisy (BLAS path variation, neighbor VM contention,
+    cold caches) and produces false positives. Run locally with
+    ``pytest -m slow`` for ad-hoc performance sanity checks.
+    """
 
     @pytest.mark.parametrize("n_units,max_time", [
         (100, 0.15),   # Small: <150ms (CI runners need headroom)