diff --git a/TODO.md b/TODO.md
index a27d34a6..a7e1ca00 100644
--- a/TODO.md
+++ b/TODO.md
@@ -98,6 +98,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | Sphinx autodoc fails to import 3 result members: `DiDResults.ci`, `MultiPeriodDiDResults.att`, `CallawaySantAnnaResults.aggregate` — investigate whether these are renamed/removed or just unresolvable from the autosummary template | `docs/api/results.rst`, `docs/api/staggered.rst` | — | Medium |
 | `EDiDBootstrapResults` cross-reference is ambiguous — the class is exported from both `diff_diff` and `diff_diff.efficient_did_bootstrap`, producing 3 "more than one target found" warnings. Add `:noindex:` to one source or use full-path refs | `diff_diff/efficient_did_results.py`, `docs/api/efficient_did.rst` | — | Low |
 | Tracked Sphinx autosummary stubs in `docs/api/_autosummary/*.rst` are stale — every Sphinx build regenerates them with new attributes (e.g., `coef_var`, `survey_metadata`) that have been added to result classes. Either commit a refresh or move the directory to `.gitignore` and treat it as build output. Also, 6 untracked stubs exist for newer estimators (`WooldridgeDiD`, `SimulationMDEResults`, etc.) that have never been committed. | `docs/api/_autosummary/` | — | Low |
+| HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` | — | Low |

---

diff --git a/tests/test_se_accuracy.py b/tests/test_se_accuracy.py
index 88b5038f..55de69f3 100644
--- a/tests/test_se_accuracy.py
+++ b/tests/test_se_accuracy.py
@@ -252,12 +252,18 @@ def test_se_vs_r_benchmark(self):
         assert se_diff_pct < 0.01, \
             f"SE differs from R by {se_diff_pct:.4f}%, expected <0.01%"
 
+    @pytest.mark.slow
     def test_timing_performance(self, cs_results):
         """
         Ensure estimation timing doesn't regress.
 
         Baseline: ~0.005s for 200 units x 8 periods (small scale)
-        Threshold: <0.1s (20x margin for CI variance)
+        Threshold: <0.1s.
+
+        Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time
+        on shared runners is noisy (BLAS path variation, neighbor VM
+        contention, cold caches) and produces false positives. Run locally
+        with ``pytest -m slow`` for ad-hoc performance sanity checks.
         """
         _, elapsed = cs_results
@@ -398,8 +404,15 @@ def test_influence_function_normalization(self):
             f"Python SE {se_py:.4f} doesn't match standard {se_standard:.4f}"
 
 
+@pytest.mark.slow
 class TestPerformanceRegression:
-    """Tests to prevent performance regression."""
+    """Tests to prevent performance regression.
+
+    Excluded from default CI via ``@pytest.mark.slow`` — wall-clock time on
+    shared runners is noisy (BLAS path variation, neighbor VM contention,
+    cold caches) and produces false positives. Run locally with
+    ``pytest -m slow`` for ad-hoc performance sanity checks.
+    """
 
     @pytest.mark.parametrize("n_units,max_time", [
         (100, 0.15),  # Small: <150ms (CI runners need headroom)
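For `pytest -m slow` to select these tests without an "unknown marker" warning under `--strict-markers`, the marker needs to be registered; a minimal sketch (assuming the repo uses `pytest.ini` rather than `pyproject.toml`, which isn't shown here):

```ini
# pytest.ini (or the [tool.pytest.ini_options] table in pyproject.toml)
[pytest]
markers =
    slow: performance checks excluded from default CI; run with `pytest -m slow`
```

The default CI invocation would then deselect them with `pytest -m "not slow"`.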
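The "mock/spy the optimizer" fix suggested in the TODO entry could look roughly like the sketch below. Note that `compute_bounds` and `_solve_lp` are hypothetical stand-ins (the real HonestDiD entry point and optimizer names aren't shown in this diff); the point is the pattern: wrap the expensive path with a spy and assert it was never invoked, rather than inferring the short circuit from elapsed wall-clock time.

```python
# Sketch: replace the timing proxy with a direct correctness signal.
# `compute_bounds` / `_solve_lp` are hypothetical stand-ins for the real
# HonestDiD entry point and optimizer; adapt the patch target accordingly.
from unittest import mock


def _solve_lp(betahat, m):
    """Stand-in for the expensive optimizer path."""
    return min(betahat) - m, max(betahat) + m


def compute_bounds(betahat, m):
    """Stand-in estimator: short-circuits to a closed form when M == 0."""
    if m == 0:
        # Short-circuit path: no optimizer call needed.
        return min(betahat), max(betahat)
    return _solve_lp(betahat, m)


def test_m0_short_circuit_via_spy():
    betahat = [0.1, 0.3, 0.2]
    # Spy on the optimizer instead of timing the call: this fails if the
    # short circuit is removed, regardless of how fast the CI runner is.
    with mock.patch(f"{__name__}._solve_lp", wraps=_solve_lp) as spy:
        lo, hi = compute_bounds(betahat, m=0)
    spy.assert_not_called()
    assert (lo, hi) == (0.1, 0.3)
```

A state flag set on the short-circuit branch would work equally well and avoids patching, at the cost of test-only state in the estimator.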