From b4b975ef5f40b954d6dba88770c3a3cd8ecf9dfd Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 22 Apr 2026 13:35:02 -0400 Subject: [PATCH 1/9] HAD Phase 3: pre-test diagnostics (qug_test, stute_test, yatchew_hr_test, workflow) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ships paper Section 4 (de Chaisemartin et al. 2026, arXiv:2405.04465v6) pre-test diagnostics for the HeterogeneousAdoptionDiD estimator in a new module `diff_diff/had_pretests.py`: - `qug_test()` — order-statistic ratio test of H_0: d_lower = 0 (Theorem 4). Closed-form T = D_(1)/(D_(2)-D_(1)); asymptotic p-value 1/(1+T) under the Exp(1)/Exp(1) limit law; reject if T > 1/alpha - 1. Zero-filter with warning; tie-break returns all-NaN inference. - `stute_test()` — Cramer-von Mises cusum test with Mammen wild bootstrap (Appendix D). Statistic S = (1/G^2) sum(cumsum^2); literal per-iteration OLS refit matching paper Appendix D Algorithm. n_bootstrap >= 99 front-door validated; G < 10 returns NaN; G > 100_000 warns and points to yatchew_hr_test. - `yatchew_hr_test()` — heteroskedasticity-robust variance-ratio linearity test (Theorem 7 / Equation 29). sigma2_diff normalizes by 2G paper-literal (NOT 2(G-1)); tight hand-computed parity at atol=1e-12. - `did_had_pretest_workflow()` — composite runs all three on a two-period panel and returns HADPretestReport with priority-ordered verdict string. Deferred to Phase 3 follow-up patches: - Joint Equation 18 cross-horizon Stute pre-trend test (paper formula needs PDF extraction; tracked in TODO.md) - Appendix D vectorized matrix form for Stute bootstrap (functionally identical, ~2x faster; shipped literal-refit to match paper text) - Tight R-package numerical parity (bootstrap seed semantics differ across numpy/R) 47 tests across 9 classes cover rejection rates, closed-form parity, tie-breaks, NaN propagation, JSON serialization, seed reproducibility, sample-size gates, and verdict logic. 
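For orientation, the Theorem 4 closed form summarized above fits in a few lines. The sketch below is illustrative only — it is not the shipped `qug_test` (which adds input validation, zero-filter warnings, tie-break handling, and a result dataclass); `qug_sketch` is an ad hoc name:

```python
import numpy as np

def qug_sketch(d, alpha=0.05):
    """Illustrative QUG order-statistic ratio test of H_0: d_lower = 0."""
    d = np.sort(np.asarray(d, dtype=float))
    d = d[d > 0]                        # zero doses are out of scope
    D1, D2 = d[0], d[1]                 # two smallest positive doses
    T = D1 / (D2 - D1)                  # order-statistic ratio
    p = 1.0 / (1.0 + T)                 # Exp(1)/Exp(1) limit law: F(t) = t/(1+t)
    return T, p, bool(T > 1.0 / alpha - 1.0)
```

For example, doses `[0.5, 0.6, 1.0]` give `T = 5` and `p = 1/6`, failing to reject at `alpha = 0.05` (critical value `1/0.05 - 1 = 19`).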
Co-Authored-By: Claude Opus 4.7 (1M context) --- TODO.md | 6 +- diff_diff/__init__.py | 19 + diff_diff/had_pretests.py | 1065 +++++++++++++++++ docs/methodology/REGISTRY.md | 23 +- .../papers/dechaisemartin-2026-review.md | 9 +- tests/test_had_pretests.py | 716 +++++++++++ 6 files changed, 1824 insertions(+), 14 deletions(-) create mode 100644 diff_diff/had_pretests.py create mode 100644 tests/test_had_pretests.py diff --git a/TODO.md b/TODO.md index ab8ec9bc..0ec19d0d 100644 --- a/TODO.md +++ b/TODO.md @@ -95,7 +95,11 @@ Deferred items from PR reviews that were not addressed before merge. | `HeterogeneousAdoptionDiD`: `weights=` support. Deferred jointly with survey integration. nprobust's `lprobust` has no weight argument so the nonparametric continuous path needs a derivation; the 2SLS mass-point path needs weighted-sandwich parity. | `diff_diff/had.py` | Phase 2a | Medium | | `HeterogeneousAdoptionDiD` mass-point: `vcov_type in {"hc2", "hc2_bm"}` raises `NotImplementedError` pending a 2SLS-specific leverage derivation. The OLS leverage `x_i' (X'X)^{-1} x_i` is wrong for 2SLS; the correct finite-sample correction uses `x_i' (Z'X)^{-1} (...) (X'Z)^{-1} x_i`. Needs derivation plus an R / Stata (`ivreg2 small robust`) parity anchor. | `diff_diff/had.py::_fit_mass_point_2sls` | Phase 2a | Medium | | `HeterogeneousAdoptionDiD` continuous paths: thread `cluster=` through `bias_corrected_local_linear` (Phase 1c's wrapper already supports cluster; Phase 2a ignores it with a `UserWarning` on the continuous path to keep scope tight). | `diff_diff/had.py`, `diff_diff/local_linear.py` | Phase 2a | Low | -| `HeterogeneousAdoptionDiD` Phase 3: `qug_test()`, `stute_test()`, `yatchew_hr_test()` pre-test diagnostics (paper Section 3.3). Composite helper `did_had_pretest_workflow()`. Not part of Phase 2a scope. 
| `diff_diff/had.py`, new module | Phase 2a | Medium | +| `HeterogeneousAdoptionDiD` Phase 3 joint Equation 18 cross-horizon Stute test: paper's step 2 of the four-step pre-testing workflow tests joint pre-trends via a stacked-residual CvM across pre-period placebos. Phase 3 shipped the single-horizon Stute in `did_had_pretest_workflow()`; the joint variant needs the exact stacked-residual formula extracted from the paper PDF (not reproduced in `dechaisemartin-2026-review.md`). Follow-up patch. | `diff_diff/had_pretests.py` | Phase 3 | Medium | +| `HeterogeneousAdoptionDiD` Phase 3 Stute performance: Appendix D vectorized matrix form replaces the per-iteration OLS refit with a single precomputed `M = I - X(X'X)^{-1}X'` applied to `eps * eta`. Functionally identical, ~2x faster. Shipped literal-refit form in Phase 3 to match paper text and keep reviewer surface small. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | +| `HeterogeneousAdoptionDiD` Phase 3 R-parity: Phase 3 ships coverage-rate validation on synthetic DGPs (not tight point parity against `chaisemartin::stute_test` / `yatchew_test`). Tight numerical parity requires aligning bootstrap seed semantics and `B` across numpy/R and is deferred. | `tests/test_had_pretests.py` | Phase 3 | Low | +| `HeterogeneousAdoptionDiD` Phase 3 nprobust bandwidth for Stute: some Stute variants on continuous regressors use nprobust-style optimal bandwidth selection. Phase 3 uses OLS residuals from a 2-parameter linear fit (no bandwidth selection). nprobust integration is a future enhancement; not in paper scope. | `diff_diff/had_pretests.py::stute_test` | Phase 3 | Low | +| `HeterogeneousAdoptionDiD` Phase 3 multi-period workflow dispatch: `did_had_pretest_workflow` accepts two-period overall-path panels only. Multi-period users pre-slice to `(F-1, F)` before calling. A follow-up could add `aggregate="event_study"`-like dispatch for joint pre-trend diagnostics alongside Equation 18. 
| `diff_diff/had_pretests.py::did_had_pretest_workflow` | Phase 3 | Low | | `HeterogeneousAdoptionDiD` Phase 4: Pierce-Schott (2016) replication harness; reproduce paper Figure 2 values and Table 1 coverage rates. | `benchmarks/`, `tests/` | Phase 2a | Low | | `HeterogeneousAdoptionDiD` Phase 5: `practitioner_next_steps()` integration, tutorial notebook, and `llms.txt` updates (preserving UTF-8 fingerprint). | `diff_diff/practitioner.py`, `tutorials/`, `diff_diff/guides/` | Phase 2a | Low | | `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b REJECTS panels where `D_{g,t}` varies within a unit for `t >= F` (the aggregation uses `D_{g, F}` as the single regressor for all horizons, paper Appendix B.2 constant-dose convention). A follow-up PR could add a time-varying-dose estimator for these panels; current behavior is front-door rejection with a redirect to `ChaisemartinDHaultfoeuille`. | `diff_diff/had.py::_validate_had_panel_event_study` | Phase 2b | Low | diff --git a/diff_diff/__init__.py b/diff_diff/__init__.py index 7c04bf50..3dcb8d26 100644 --- a/diff_diff/__init__.py +++ b/diff_diff/__init__.py @@ -60,6 +60,16 @@ HeterogeneousAdoptionDiDEventStudyResults, HeterogeneousAdoptionDiDResults, ) +from diff_diff.had_pretests import ( + HADPretestReport, + QUGTestResults, + StuteTestResults, + YatchewTestResults, + did_had_pretest_workflow, + qug_test, + stute_test, + yatchew_hr_test, +) from diff_diff.estimators import ( DifferenceInDifferences, MultiPeriodDiD, @@ -442,6 +452,15 @@ "HeterogeneousAdoptionDiDResults", "HeterogeneousAdoptionDiDEventStudyResults", "HAD", + # HeterogeneousAdoptionDiD pre-test diagnostics (Phase 3) + "qug_test", + "stute_test", + "yatchew_hr_test", + "did_had_pretest_workflow", + "QUGTestResults", + "StuteTestResults", + "YatchewTestResults", + "HADPretestReport", # Datasets "load_card_krueger", "load_castle_doctrine", diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py new file mode 100644 index 
00000000..d5c21dd9 --- /dev/null +++ b/diff_diff/had_pretests.py @@ -0,0 +1,1065 @@ +"""Pre-test diagnostics for the HeterogeneousAdoptionDiD estimator. + +Paper Section 4 (de Chaisemartin, Ciccia, D'Haultfoeuille, Knau 2026, +arXiv:2405.04465v6) prescribes three pre-test diagnostics that practitioners +should run BEFORE trusting HAD estimates: + +1. :func:`qug_test` - order-statistic ratio test of the support infimum + ``H_0: d_lower = 0`` (paper Theorem 4). Closed-form, tuning-free. +2. :func:`stute_test` - Cramer-von Mises cusum test of linearity of + ``E[ΔY | D_2]`` with Mammen (1993) wild bootstrap p-value (paper + Appendix D). +3. :func:`yatchew_hr_test` - heteroskedasticity-robust variance-ratio + linearity test (paper Theorem 7 / Equation 29). Feasible at + ``G >= 100k``. + +Paper Section 4.2-4.3: if all three fail-to-reject, the TWFE estimator +(``beta_fe``) is valid; rejection guides users to alternative estimators. + +The composite :func:`did_had_pretest_workflow` runs all three in sequence on +a two-period HAD panel and returns a :class:`HADPretestReport` with a +text verdict. + +Equation 18 (joint cross-horizon Stute over pre-period placebos) is deferred +to a Phase 3 follow-up patch pending formula extraction from the paper PDF; +see ``docs/methodology/REGISTRY.md`` and ``TODO.md``. 
+""" + +from __future__ import annotations + +import warnings +from dataclasses import dataclass +from typing import Any, Dict, Optional + +import numpy as np +import pandas as pd +from scipy import stats + +from diff_diff.had import ( + _aggregate_first_difference, + _json_safe_scalar, + _validate_had_panel, +) +from diff_diff.utils import _generate_mammen_weights + +__all__ = [ + "QUGTestResults", + "StuteTestResults", + "YatchewTestResults", + "HADPretestReport", + "qug_test", + "stute_test", + "yatchew_hr_test", + "did_had_pretest_workflow", +] + + +_MIN_G_QUG = 2 +_MIN_G_STUTE = 10 +_MIN_G_YATCHEW = 3 +_MIN_N_BOOTSTRAP = 99 +_STUTE_LARGE_G_THRESHOLD = 100_000 + + +# ============================================================================= +# Result dataclasses +# ============================================================================= + + +@dataclass +class QUGTestResults: + """Result of :func:`qug_test` (paper Theorem 4). + + The QUG test rejects ``H_0: d_lower = 0`` when the order-statistic + ratio ``T = D_{(1)} / (D_{(2)} - D_{(1)})`` exceeds ``1/alpha - 1``. + Under the null, the asymptotic limit law of ``T`` is the ratio of two + independent Exp(1) random variables, with CDF ``F(t) = t / (1 + t)``, + so ``p_value = 1 / (1 + T)``. + + Attributes + ---------- + t_stat : float + ``D_{(1)} / (D_{(2)} - D_{(1)})``. NaN when fewer than 2 non-zero + observations remain or when the two smallest doses tie. + p_value : float + ``1 / (1 + t_stat)`` under the null. NaN when ``t_stat`` is NaN. + reject : bool + ``True`` iff ``t_stat > critical_value``. ``False`` on NaN statistic. + alpha : float + Significance level used. + critical_value : float + ``1 / alpha - 1``. Populated even when the statistic is NaN so + downstream readers can inspect the decision threshold. + n_obs : int + Number of observations after filtering to ``d > 0``. + n_excluded_zero : int + Number of zero-dose observations excluded from the sample. 
+ d_order_1 : float + Smallest positive dose ``D_{(1)}``. NaN when ``n_obs < 2``. + d_order_2 : float + Second-smallest positive dose ``D_{(2)}``. NaN when ``n_obs < 2``. + """ + + t_stat: float + p_value: float + reject: bool + alpha: float + critical_value: float + n_obs: int + n_excluded_zero: int + d_order_1: float + d_order_2: float + + def __repr__(self) -> str: + return ( + f"QUGTestResults(t_stat={self.t_stat:.4f}, p_value={self.p_value:.4f}, " + f"reject={self.reject}, alpha={self.alpha}, n_obs={self.n_obs})" + ) + + def summary(self) -> str: + """Formatted summary table.""" + width = 64 + lines = [ + "=" * width, + "QUG null test (H_0: d_lower = 0)".center(width), + "=" * width, + f"{'Statistic T:':<30} {self.t_stat:>20.4f}", + f"{'p-value:':<30} {self.p_value:>20.4f}", + f"{'Critical value (1/alpha-1):':<30} {self.critical_value:>20.4f}", + f"{'Reject H_0:':<30} {str(self.reject):>20}", + f"{'alpha:':<30} {self.alpha:>20.4f}", + f"{'Observations:':<30} {self.n_obs:>20}", + f"{'Excluded (d == 0):':<30} {self.n_excluded_zero:>20}", + f"{'D_(1):':<30} {self.d_order_1:>20.4f}", + f"{'D_(2):':<30} {self.d_order_2:>20.4f}", + "=" * width, + ] + return "\n".join(lines) + + def print_summary(self) -> None: + """Print the summary to stdout.""" + print(self.summary()) + + def to_dict(self) -> Dict[str, Any]: + """Return results as a JSON-safe dict.""" + return { + "test": "qug", + "t_stat": _json_safe_scalar(self.t_stat), + "p_value": _json_safe_scalar(self.p_value), + "reject": bool(self.reject), + "alpha": float(self.alpha), + "critical_value": _json_safe_scalar(self.critical_value), + "n_obs": int(self.n_obs), + "n_excluded_zero": int(self.n_excluded_zero), + "d_order_1": _json_safe_scalar(self.d_order_1), + "d_order_2": _json_safe_scalar(self.d_order_2), + } + + def to_dataframe(self) -> pd.DataFrame: + """Return a one-row DataFrame of the result dict.""" + return pd.DataFrame([self.to_dict()]) + + +@dataclass +class StuteTestResults: + """Result of 
:func:`stute_test` (paper Appendix D). + + The Stute test rejects the null that ``E[ΔY | D_2]`` is linear in + ``D_2`` (paper Assumption 8) when the sorted-residual CvM statistic + ``S = (1/G^2) Σ (Σ_{h=1}^g eps_{(h)})^2`` exceeds the Mammen wild + bootstrap ``1 - alpha`` quantile. + + Attributes + ---------- + cvm_stat : float + CvM statistic. NaN when ``G < 10`` (below the threshold the + statistic is not well-calibrated). + p_value : float + Bootstrap p-value ``(1 + sum(S_b >= S)) / (B + 1)``. NaN when + the statistic is NaN. + reject : bool + ``True`` iff ``p_value <= alpha``. ``False`` on NaN. + alpha : float + Significance level used. + n_bootstrap : int + Number of Mammen wild bootstrap replications. + n_obs : int + Number of observations. + seed : int or None + Seed passed to ``np.random.default_rng``. ``None`` when unseeded. + """ + + cvm_stat: float + p_value: float + reject: bool + alpha: float + n_bootstrap: int + n_obs: int + seed: Optional[int] + + def __repr__(self) -> str: + return ( + f"StuteTestResults(cvm_stat={self.cvm_stat:.4f}, " + f"p_value={self.p_value:.4f}, reject={self.reject}, " + f"alpha={self.alpha}, n_bootstrap={self.n_bootstrap}, " + f"n_obs={self.n_obs})" + ) + + def summary(self) -> str: + """Formatted summary table.""" + width = 64 + lines = [ + "=" * width, + "Stute CvM linearity test (H_0: linear E[dY|D])".center(width), + "=" * width, + f"{'CvM statistic:':<30} {self.cvm_stat:>20.4f}", + f"{'Bootstrap p-value:':<30} {self.p_value:>20.4f}", + f"{'Reject H_0:':<30} {str(self.reject):>20}", + f"{'alpha:':<30} {self.alpha:>20.4f}", + f"{'Bootstrap replications:':<30} {self.n_bootstrap:>20}", + f"{'Observations:':<30} {self.n_obs:>20}", + f"{'Seed:':<30} {str(self.seed):>20}", + "=" * width, + ] + return "\n".join(lines) + + def print_summary(self) -> None: + """Print the summary to stdout.""" + print(self.summary()) + + def to_dict(self) -> Dict[str, Any]: + """Return results as a JSON-safe dict.""" + return { + "test": "stute", + 
"cvm_stat": _json_safe_scalar(self.cvm_stat), + "p_value": _json_safe_scalar(self.p_value), + "reject": bool(self.reject), + "alpha": float(self.alpha), + "n_bootstrap": int(self.n_bootstrap), + "n_obs": int(self.n_obs), + "seed": None if self.seed is None else int(self.seed), + } + + def to_dataframe(self) -> pd.DataFrame: + """Return a one-row DataFrame of the result dict.""" + return pd.DataFrame([self.to_dict()]) + + +@dataclass +class YatchewTestResults: + """Result of :func:`yatchew_hr_test` (paper Theorem 7 / Equation 29). + + Heteroskedasticity-robust test of the same linearity null as + :func:`stute_test`, but using Yatchew's difference-based variance + estimator. The test statistic + ``T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W`` + is asymptotically N(0, 1) under H_0; rejection uses the one-sided + standard-normal critical value. + + Attributes + ---------- + t_stat_hr : float + Test statistic ``T_hr`` from paper Equation 29. NaN when + ``G < 3``. + p_value : float + ``1 - Phi(T_hr)``. NaN when the statistic is NaN. + reject : bool + ``True`` iff ``T_hr >= critical_value``. ``False`` on NaN. + alpha : float + Significance level used. + critical_value : float + One-sided standard-normal critical value ``z_{1 - alpha}``. + sigma2_lin : float + Residual variance from OLS of ``dy`` on ``d``. + sigma2_diff : float + Yatchew differencing variance + ``(1 / (2G)) * sum((dy_{(g)} - dy_{(g-1)})^2)`` - divisor is ``2G`` + (paper-literal), NOT ``2(G-1)``. + sigma2_W : float + Heteroskedasticity-robust scale + ``sqrt((1 / (G-1)) * sum(eps_{(g)}^2 * eps_{(g-1)}^2))``. + n_obs : int + Number of observations. 
+ """ + + t_stat_hr: float + p_value: float + reject: bool + alpha: float + critical_value: float + sigma2_lin: float + sigma2_diff: float + sigma2_W: float + n_obs: int + + def __repr__(self) -> str: + return ( + f"YatchewTestResults(t_stat_hr={self.t_stat_hr:.4f}, " + f"p_value={self.p_value:.4f}, reject={self.reject}, " + f"alpha={self.alpha}, n_obs={self.n_obs})" + ) + + def summary(self) -> str: + """Formatted summary table.""" + width = 64 + lines = [ + "=" * width, + "Yatchew-HR linearity test (H_0: linear E[dY|D])".center(width), + "=" * width, + f"{'T_hr statistic:':<30} {self.t_stat_hr:>20.4f}", + f"{'p-value:':<30} {self.p_value:>20.4f}", + f"{'Critical value (1-sided z):':<30} {self.critical_value:>20.4f}", + f"{'Reject H_0:':<30} {str(self.reject):>20}", + f"{'alpha:':<30} {self.alpha:>20.4f}", + f"{'sigma^2_lin (OLS):':<30} {self.sigma2_lin:>20.4f}", + f"{'sigma^2_diff (Yatchew):':<30} {self.sigma2_diff:>20.4f}", + f"{'sigma^2_W (HR scale):':<30} {self.sigma2_W:>20.4f}", + f"{'Observations:':<30} {self.n_obs:>20}", + "=" * width, + ] + return "\n".join(lines) + + def print_summary(self) -> None: + """Print the summary to stdout.""" + print(self.summary()) + + def to_dict(self) -> Dict[str, Any]: + """Return results as a JSON-safe dict.""" + return { + "test": "yatchew_hr", + "t_stat_hr": _json_safe_scalar(self.t_stat_hr), + "p_value": _json_safe_scalar(self.p_value), + "reject": bool(self.reject), + "alpha": float(self.alpha), + "critical_value": _json_safe_scalar(self.critical_value), + "sigma2_lin": _json_safe_scalar(self.sigma2_lin), + "sigma2_diff": _json_safe_scalar(self.sigma2_diff), + "sigma2_W": _json_safe_scalar(self.sigma2_W), + "n_obs": int(self.n_obs), + } + + def to_dataframe(self) -> pd.DataFrame: + """Return a one-row DataFrame of the result dict.""" + return pd.DataFrame([self.to_dict()]) + + +@dataclass +class HADPretestReport: + """Composite output of :func:`did_had_pretest_workflow`. 
+ + Bundles the three individual tests with an overall verdict string. + + Attributes + ---------- + qug : QUGTestResults + stute : StuteTestResults + yatchew : YatchewTestResults + all_pass : bool + ``True`` iff none of the three tests rejects. When ``True``, + paper Section 4.2-4.3 states TWFE (``beta_fe``) is valid under + Assumptions 5-8. + verdict : str + Human-readable classification. Priority-ordered first-match: + + 1. Any NaN p-value -> ``"inconclusive - {names} NaN"`` + 2. None reject -> ``"TWFE safe under Section 4 assumptions"`` + 3. Otherwise -> bundled string naming each rejected assumption: + ``"support infimum rejected - continuous_at_zero design + invalid (QUG)"`` and/or ``"linearity rejected - heterogeneity + bias ({Stute[,Yatchew]})"``. + alpha : float + Significance level shared across tests. + n_obs : int + Unit count after aggregation to the two-period first-difference. + """ + + qug: QUGTestResults + stute: StuteTestResults + yatchew: YatchewTestResults + all_pass: bool + verdict: str + alpha: float + n_obs: int + + def __repr__(self) -> str: + return ( + f"HADPretestReport(all_pass={self.all_pass}, " + f"verdict={self.verdict!r}, n_obs={self.n_obs})" + ) + + def summary(self) -> str: + """Formatted summary of all three tests and the verdict.""" + width = 72 + parts = [ + "=" * width, + "HAD pre-test workflow".center(width), + "=" * width, + self.qug.summary(), + "", + self.stute.summary(), + "", + self.yatchew.summary(), + "", + "=" * width, + f"{'All pass:':<30} {str(self.all_pass):>40}", + f"Verdict: {self.verdict}", + "=" * width, + ] + return "\n".join(parts) + + def print_summary(self) -> None: + """Print the summary to stdout.""" + print(self.summary()) + + def to_dict(self) -> Dict[str, Any]: + """Return a JSON-safe nested dict of the full report.""" + return { + "qug": self.qug.to_dict(), + "stute": self.stute.to_dict(), + "yatchew": self.yatchew.to_dict(), + "all_pass": bool(self.all_pass), + "verdict": str(self.verdict), + "alpha": 
float(self.alpha), + "n_obs": int(self.n_obs), + } + + def to_dataframe(self) -> pd.DataFrame: + """Return a tidy 3-row DataFrame (one row per test). + + Columns (in order): ``[test, statistic_name, statistic_value, + p_value, reject, alpha, n_obs]``. + """ + rows = [ + { + "test": "qug", + "statistic_name": "t_stat", + "statistic_value": _json_safe_scalar(self.qug.t_stat), + "p_value": _json_safe_scalar(self.qug.p_value), + "reject": bool(self.qug.reject), + "alpha": float(self.qug.alpha), + "n_obs": int(self.qug.n_obs), + }, + { + "test": "stute", + "statistic_name": "cvm_stat", + "statistic_value": _json_safe_scalar(self.stute.cvm_stat), + "p_value": _json_safe_scalar(self.stute.p_value), + "reject": bool(self.stute.reject), + "alpha": float(self.stute.alpha), + "n_obs": int(self.stute.n_obs), + }, + { + "test": "yatchew_hr", + "statistic_name": "t_stat_hr", + "statistic_value": _json_safe_scalar(self.yatchew.t_stat_hr), + "p_value": _json_safe_scalar(self.yatchew.p_value), + "reject": bool(self.yatchew.reject), + "alpha": float(self.yatchew.alpha), + "n_obs": int(self.yatchew.n_obs), + }, + ] + cols = [ + "test", + "statistic_name", + "statistic_value", + "p_value", + "reject", + "alpha", + "n_obs", + ] + return pd.DataFrame(rows).reindex(columns=cols) + + +# ============================================================================= +# Private helpers +# ============================================================================= + + +def _validate_1d_numeric(arr: np.ndarray, name: str) -> np.ndarray: + """Return ``arr`` as a 1D float ndarray or raise ``ValueError``.""" + a = np.asarray(arr) + if a.ndim != 1: + raise ValueError(f"{name} must be 1-dimensional, got shape {a.shape}.") + a = a.astype(np.float64, copy=False) + if np.isnan(a).any(): + raise ValueError(f"{name} contains NaN values.") + if not np.isfinite(a).all(): + raise ValueError(f"{name} contains non-finite values (inf).") + return a + + +def _fit_ols_intercept_slope(d: np.ndarray, dy: 
np.ndarray) -> "tuple[float, float, np.ndarray]": + """Fit ``dy = a + b*d + eps`` via closed-form OLS. + + Returns ``(a_hat, b_hat, residuals)`` where ``residuals`` has the + same length as ``d`` in the ORIGINAL input order (not sorted). + """ + d_mean = d.mean() + dy_mean = dy.mean() + d_dev = d - d_mean + var_d = np.dot(d_dev, d_dev) + if var_d <= 0.0: + # Degenerate case: all dose values equal. Slope undefined. + # Caller is responsible for gating before we reach here; if we + # do reach here, return (mean(dy), 0, dy - mean(dy)). + return float(dy_mean), 0.0, dy - dy_mean + b_hat = float(np.dot(d_dev, dy - dy_mean) / var_d) + a_hat = float(dy_mean - b_hat * d_mean) + residuals = dy - a_hat - b_hat * d + return a_hat, b_hat, residuals + + +def _cvm_statistic(eps_sorted: np.ndarray) -> float: + """Compute the sorted-residual Cramer-von Mises statistic. + + Equivalent to paper's ``S = Σ (g/G)^2 · ((1/g) Σ_{h=1}^g eps_(h))^2``: + + (g/G)^2 * (C_g / g)^2 = C_g^2 / G^2 + + where ``C_g = Σ_{h=1}^g eps_(h)`` is the cumulative residual sum. + Implementation uses the simplified form for numerical stability. + + Parameters + ---------- + eps_sorted : np.ndarray, shape (G,) + Residuals sorted by the associated regressor ``d``. + + Returns + ------- + float + ``S = (1 / G^2) * sum(cumsum(eps_sorted)^2)``. + """ + G = eps_sorted.shape[0] + cumsum = np.cumsum(eps_sorted) + return float(np.sum(cumsum * cumsum) / (G * G)) + + +def _compose_verdict( + qug: QUGTestResults, stute: StuteTestResults, yatchew: YatchewTestResults +) -> str: + """Build the :class:`HADPretestReport` verdict string. + + Priority-ordered first-match: + + 1. Any NaN p-value -> ``"inconclusive - {names} NaN"`` + 2. None reject -> ``"TWFE safe under Section 4 assumptions"`` + 3. Otherwise bundle each rejection reason. 
+ """ + nan_tests = [ + name + for name, res in ( + ("QUG", qug), + ("Stute", stute), + ("Yatchew", yatchew), + ) + if not np.isfinite(res.p_value) + ] + if nan_tests: + return f"inconclusive - {', '.join(nan_tests)} NaN" + + any_reject = qug.reject or stute.reject or yatchew.reject + if not any_reject: + return "TWFE safe under Section 4 assumptions" + + reasons = [] + if qug.reject: + reasons.append("support infimum rejected - continuous_at_zero design invalid (QUG)") + if stute.reject or yatchew.reject: + which = ",".join(name for name, r in (("Stute", stute), ("Yatchew", yatchew)) if r.reject) + reasons.append(f"linearity rejected - heterogeneity bias ({which})") + return "; ".join(reasons) + + +# ============================================================================= +# Public test functions +# ============================================================================= + + +def qug_test(d: np.ndarray, alpha: float = 0.05) -> QUGTestResults: + """Run the QUG null test for the support infimum (paper Theorem 4). + + Tests ``H_0: d_lower = 0`` using the order-statistic ratio + ``T = D_{(1)} / (D_{(2)} - D_{(1)})``, rejecting when ``T > 1/alpha - 1``. + Under the null, the asymptotic limit law of ``T`` is the ratio of two + independent Exp(1) variables with CDF ``F(t) = t / (1 + t)``, so the + one-sided p-value is ``1 / (1 + T)``. + + Zero-dose observations are filtered out (the test targets the infimum + of the treated support). A ``UserWarning`` is emitted naming the + exclusion count. When fewer than two positive doses remain, the test + returns all-NaN inference with ``reject=False``. + + Parameters + ---------- + d : np.ndarray, shape (G,) + Post-period dose vector. Must be 1D numeric and contain no NaN. + alpha : float, default 0.05 + One-sided significance level. Must satisfy ``0 < alpha < 1``. + + Returns + ------- + QUGTestResults + Result dataclass with ``t_stat``, ``p_value``, ``reject``, and + sample metadata. 
+ + Raises + ------ + ValueError + If ``d`` is not 1D numeric or contains NaN, or if ``alpha`` is + not in ``(0, 1)``. + + Notes + ----- + Tie-break: when ``D_{(1)} == D_{(2)}`` the statistic is undefined. + The test returns ``t_stat=NaN, p_value=NaN, reject=False`` with a + ``UserWarning`` rather than raising. + + References + ---------- + de Chaisemartin, Ciccia, D'Haultfoeuille, Knau (2026, arXiv:2405.04465v6), + Theorem 4 and Section 4.2. + """ + if not (0.0 < alpha < 1.0): + raise ValueError(f"alpha must satisfy 0 < alpha < 1, got {alpha}.") + + d_arr = _validate_1d_numeric(d, "d") + critical_value = 1.0 / alpha - 1.0 + + mask = d_arr > 0 + d_nz = d_arr[mask] + n_excluded = int(d_arr.shape[0] - d_nz.shape[0]) + if n_excluded > 0: + warnings.warn( + f"qug_test: excluded {n_excluded} observation(s) with d == 0 " + f"(the QUG null test targets the infimum of the treated-dose " + f"support; zero-dose observations are not in scope).", + UserWarning, + stacklevel=2, + ) + + n_obs = int(d_nz.shape[0]) + if n_obs < _MIN_G_QUG: + warnings.warn( + f"qug_test: only {n_obs} positive-dose observation(s); need " + f"at least {_MIN_G_QUG}. Returning NaN result.", + UserWarning, + stacklevel=2, + ) + return QUGTestResults( + t_stat=float("nan"), + p_value=float("nan"), + reject=False, + alpha=alpha, + critical_value=critical_value, + n_obs=n_obs, + n_excluded_zero=n_excluded, + d_order_1=float("nan"), + d_order_2=float("nan"), + ) + + d_sorted = np.sort(d_nz) + D1 = float(d_sorted[0]) + D2 = float(d_sorted[1]) + + if D2 == D1: + warnings.warn( + "qug_test: D_(1) == D_(2); the test statistic is undefined " + "(ties at the minimum positive dose). 
Returning NaN result.", + UserWarning, + stacklevel=2, + ) + return QUGTestResults( + t_stat=float("nan"), + p_value=float("nan"), + reject=False, + alpha=alpha, + critical_value=critical_value, + n_obs=n_obs, + n_excluded_zero=n_excluded, + d_order_1=D1, + d_order_2=D2, + ) + + t_stat = D1 / (D2 - D1) + p_value = 1.0 / (1.0 + t_stat) + reject = t_stat > critical_value + + return QUGTestResults( + t_stat=float(t_stat), + p_value=float(p_value), + reject=bool(reject), + alpha=alpha, + critical_value=critical_value, + n_obs=n_obs, + n_excluded_zero=n_excluded, + d_order_1=D1, + d_order_2=D2, + ) + + +def stute_test( + d: np.ndarray, + dy: np.ndarray, + alpha: float = 0.05, + n_bootstrap: int = 999, + seed: Optional[int] = None, +) -> StuteTestResults: + """Run the Stute Cramer-von Mises linearity test (paper Appendix D). + + Tests ``H_0: E[ΔY | D_2]`` is linear in ``D_2`` (paper Assumption 8). + The test statistic is the sorted-residual cusum CvM + + S = (1 / G^2) * sum_{g=1}^G (sum_{h=1}^g eps_(h))^2 + + where ``eps_(h)`` is the ``h``-th OLS residual after sorting by ``d``. + The p-value is the bootstrap tail probability + ``(1 + sum(S_b >= S)) / (B + 1)`` under the Mammen (1993) two-point + wild bootstrap; each bootstrap iteration refits OLS on + ``dy_b = a_hat + b_hat * d + eps * eta`` with multiplier weights ``eta``. + + Parameters + ---------- + d, dy : np.ndarray, shape (G,) + Dose and first-difference outcome vectors. + alpha : float, default 0.05 + Significance level. Must satisfy ``0 < alpha < 1``. + n_bootstrap : int, default 999 + Number of Mammen wild bootstrap replications. Must be ``>= 99`` + (below which the discretised p-value grid is too coarse). + seed : int or None, default None + Seed for ``np.random.default_rng``. Pass an integer for + reproducible results. 
+ + Returns + ------- + StuteTestResults + + Raises + ------ + ValueError + If ``d`` / ``dy`` are not 1D numeric, contain NaN, have unequal + lengths, or if ``alpha`` is outside ``(0, 1)``, or if + ``n_bootstrap < 99``. + + Notes + ----- + Sample-size gate: below ``G = 10`` the CvM statistic is not + well-calibrated. In that case the function emits ``UserWarning`` and + returns all-NaN inference rather than raising. + + Large-G warning: at ``G > 100_000`` the per-iteration refit dominates + runtime; the function emits a ``UserWarning`` pointing users to + :func:`yatchew_hr_test`. Memory usage remains ``O(G)`` regardless + (no G x G matrix). + + References + ---------- + Stute, W. (1997). Nonparametric model checks for regression. Annals + of Statistics 25, 613-641. + Mammen, E. (1993). Bootstrap and wild bootstrap for high-dimensional + linear models. Annals of Statistics 21, 255-285. + de Chaisemartin et al. (2026), Appendix D. + """ + if not (0.0 < alpha < 1.0): + raise ValueError(f"alpha must satisfy 0 < alpha < 1, got {alpha}.") + if n_bootstrap < _MIN_N_BOOTSTRAP: + raise ValueError( + f"n_bootstrap must be >= {_MIN_N_BOOTSTRAP} (below this the " + f"discretised p-value grid is too coarse to be meaningful). " + f"Got n_bootstrap={n_bootstrap}." + ) + + d_arr = _validate_1d_numeric(d, "d") + dy_arr = _validate_1d_numeric(dy, "dy") + if d_arr.shape[0] != dy_arr.shape[0]: + raise ValueError( + f"d and dy must have the same length; got d.shape={d_arr.shape}, " + f"dy.shape={dy_arr.shape}." + ) + + G = int(d_arr.shape[0]) + if G < _MIN_G_STUTE: + warnings.warn( + f"stute_test: G = {G} is below the minimum {_MIN_G_STUTE} for " + f"the CvM statistic to be well-calibrated. 
Returning NaN result.", + UserWarning, + stacklevel=2, + ) + return StuteTestResults( + cvm_stat=float("nan"), + p_value=float("nan"), + reject=False, + alpha=alpha, + n_bootstrap=int(n_bootstrap), + n_obs=G, + seed=seed, + ) + if G > _STUTE_LARGE_G_THRESHOLD: + warnings.warn( + f"stute_test: G = {G} exceeds {_STUTE_LARGE_G_THRESHOLD}; each " + f"bootstrap replication refits OLS in O(G), so the " + f"{n_bootstrap}-replication loop may take tens of seconds or " + f"more. Consider yatchew_hr_test() instead (paper Theorem 7 " + f"recommends Yatchew-HR at large G).", + UserWarning, + stacklevel=2, + ) + + a_hat, b_hat, eps = _fit_ols_intercept_slope(d_arr, dy_arr) + # Degenerate case: zero dose variation. Residual variance under the + # linear null is just Var(dy), but with a constant regressor the + # CvM statistic carries no information; emit NaN. + if np.var(d_arr) <= 0.0 or np.var(eps) <= 0.0: + warnings.warn( + "stute_test: degenerate inputs (constant d or zero residual " + "variance). Returning NaN result.", + UserWarning, + stacklevel=2, + ) + return StuteTestResults( + cvm_stat=float("nan"), + p_value=float("nan"), + reject=False, + alpha=alpha, + n_bootstrap=int(n_bootstrap), + n_obs=G, + seed=seed, + ) + + idx = np.argsort(d_arr, kind="stable") + S = _cvm_statistic(eps[idx]) + + rng = np.random.default_rng(seed) + bootstrap_S = np.empty(n_bootstrap, dtype=np.float64) + fitted = a_hat + b_hat * d_arr # baseline fitted values under H_0 + for b in range(n_bootstrap): + eta = _generate_mammen_weights(G, rng) + dy_b = fitted + eps * eta + _, _, eps_b = _fit_ols_intercept_slope(d_arr, dy_b) + bootstrap_S[b] = _cvm_statistic(eps_b[idx]) + + p_value = float((1.0 + float(np.sum(bootstrap_S >= S))) / (n_bootstrap + 1.0)) + reject = p_value <= alpha + + return StuteTestResults( + cvm_stat=float(S), + p_value=p_value, + reject=bool(reject), + alpha=alpha, + n_bootstrap=int(n_bootstrap), + n_obs=G, + seed=seed, + ) + + +def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha:
float = 0.05) -> YatchewTestResults: + """Run the Yatchew heteroskedasticity-robust linearity test. + + Tests ``H_0: E[ΔY | D_2]`` is linear in ``D_2`` (paper Assumption 8, + Theorem 7) via the variance-ratio statistic + + T_hr = sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W + + where + + sigma2_lin = (1/G) * sum(eps^2) # OLS residuals + sigma2_diff = (1/(2G)) * sum((dy_{(g)} - dy_{(g-1)})^2) # Yatchew differencing + sigma2_W = sqrt((1/(G-1)) * sum(eps_{(g)}^2 * eps_{(g-1)}^2)) + + and ``_{(g)}`` denotes sort by ``d``. Rejection uses the one-sided + standard-normal critical value ``z_{1-alpha}``. + + Parameters + ---------- + d, dy : np.ndarray, shape (G,) + Dose and first-difference outcome vectors. + alpha : float, default 0.05 + One-sided significance level. + + Returns + ------- + YatchewTestResults + + Raises + ------ + ValueError + If ``d`` / ``dy`` are not 1D numeric, contain NaN, have unequal + lengths, or if ``alpha`` is outside ``(0, 1)``. + + Notes + ----- + Sample-size gate: below ``G = 3`` the difference-variance estimator + is undefined; the function emits ``UserWarning`` and returns NaN + rather than raising. + + Dose ties: stable sort (``np.argsort(d, kind="stable")``) preserves + the input order among tied ``d`` values. The paper does not specify + a tie-break policy; stability suffices. + + References + ---------- + Yatchew, A. (1997). An elementary estimator of the partial linear + model. Economics Letters 57, 135-143. + de Chaisemartin et al. (2026), Theorem 7 / Equation 29. + """ + if not (0.0 < alpha < 1.0): + raise ValueError(f"alpha must satisfy 0 < alpha < 1, got {alpha}.") + + d_arr = _validate_1d_numeric(d, "d") + dy_arr = _validate_1d_numeric(dy, "dy") + if d_arr.shape[0] != dy_arr.shape[0]: + raise ValueError( + f"d and dy must have the same length; got d.shape={d_arr.shape}, " + f"dy.shape={dy_arr.shape}." 
+ ) + + G = int(d_arr.shape[0]) + critical_value = float(stats.norm.ppf(1.0 - alpha)) + + if G < _MIN_G_YATCHEW: + warnings.warn( + f"yatchew_hr_test: G = {G} is below the minimum {_MIN_G_YATCHEW} " + f"(need at least 2 sorted differences). Returning NaN result.", + UserWarning, + stacklevel=2, + ) + return YatchewTestResults( + t_stat_hr=float("nan"), + p_value=float("nan"), + reject=False, + alpha=alpha, + critical_value=critical_value, + sigma2_lin=float("nan"), + sigma2_diff=float("nan"), + sigma2_W=float("nan"), + n_obs=G, + ) + + _, _, eps = _fit_ols_intercept_slope(d_arr, dy_arr) + sigma2_lin = float(np.mean(eps * eps)) + + idx = np.argsort(d_arr, kind="stable") + dy_s = dy_arr[idx] + eps_s = eps[idx] + + diff_dy = np.diff(dy_s) # length G - 1 + # Paper-literal divisor: 2G (NOT 2(G-1)). This matches paper review + # line 168: sigma2_diff := (1/(2G)) * sum((dy_{(g)} - dy_{(g-1)})^2). + sigma2_diff = float(np.sum(diff_dy * diff_dy) / (2.0 * G)) + + # sigma4_W = (1/(G-1)) * sum(eps_(g)^2 * eps_(g-1)^2) using np.mean + # which divides by the length of the input (G-1 here). Matches paper + # review line 171. + sigma4_W = float(np.mean(eps_s[1:] ** 2 * eps_s[:-1] ** 2)) + if sigma4_W <= 0.0: + warnings.warn( + "yatchew_hr_test: sigma4_W <= 0 (zero residual variance). 
" "Returning NaN result.", + UserWarning, + stacklevel=2, + ) + return YatchewTestResults( + t_stat_hr=float("nan"), + p_value=float("nan"), + reject=False, + alpha=alpha, + critical_value=critical_value, + sigma2_lin=sigma2_lin, + sigma2_diff=sigma2_diff, + sigma2_W=0.0, + n_obs=G, + ) + sigma2_W = float(np.sqrt(sigma4_W)) + + t_stat_hr = float(np.sqrt(G) * (sigma2_lin - sigma2_diff) / sigma2_W) + p_value = float(1.0 - stats.norm.cdf(t_stat_hr)) + reject = t_stat_hr >= critical_value + + return YatchewTestResults( + t_stat_hr=t_stat_hr, + p_value=p_value, + reject=bool(reject), + alpha=alpha, + critical_value=critical_value, + sigma2_lin=sigma2_lin, + sigma2_diff=sigma2_diff, + sigma2_W=sigma2_W, + n_obs=G, + ) + + +def did_had_pretest_workflow( + data: pd.DataFrame, + outcome_col: str, + dose_col: str, + time_col: str, + unit_col: str, + first_treat_col: Optional[str] = None, + alpha: float = 0.05, + n_bootstrap: int = 999, + seed: Optional[int] = None, +) -> HADPretestReport: + """Run all three HAD pre-tests on a two-period panel and return a + composite report (paper Section 4.2-4.3). + + The workflow reduces the panel to unit-level first differences using + the Phase 2a validator + aggregator, then calls :func:`qug_test`, + :func:`stute_test`, and :func:`yatchew_hr_test` with shared ``alpha`` + and a single-source seed passthrough (``seed`` is forwarded to + :func:`stute_test` only; the other two are deterministic). + + Parameters + ---------- + data : pd.DataFrame + Balanced two-period HAD panel. The dose column must be 0 for all + units at the pre-period (HAD no-unit-untreated pre-period + contract). + outcome_col, dose_col, time_col, unit_col : str + Column names. + first_treat_col : str or None, default None + Optional first-treatment-period column for cross-validation + (see :func:`HeterogeneousAdoptionDiD.fit`). + alpha : float, default 0.05 + n_bootstrap : int, default 999 + Replication count for :func:`stute_test`. 
+ seed : int or None, default None + Seed forwarded to :func:`stute_test` only. + + Returns + ------- + HADPretestReport + + Notes + ----- + Phase 3 scope is two-period overall-path only. For multi-period + panels, slice to ``(F - 1, F)`` before calling. A future patch will + add a multi-period dispatch and the joint Equation 18 pre-trend + Stute test. + + References + ---------- + de Chaisemartin et al. (2026), Section 4.2-4.3, Theorem 4, Appendix D, + Theorem 7. + """ + t_pre, t_post = _validate_had_panel( + data, outcome_col, dose_col, time_col, unit_col, first_treat_col + ) + d_arr, dy_arr, _, _ = _aggregate_first_difference( + data, outcome_col, dose_col, time_col, unit_col, t_pre, t_post, None + ) + + qug_res = qug_test(d_arr, alpha=alpha) + stute_res = stute_test(d_arr, dy_arr, alpha=alpha, n_bootstrap=n_bootstrap, seed=seed) + yatchew_res = yatchew_hr_test(d_arr, dy_arr, alpha=alpha) + + all_pass = not (qug_res.reject or stute_res.reject or yatchew_res.reject) + verdict = _compose_verdict(qug_res, stute_res, yatchew_res) + + return HADPretestReport( + qug=qug_res, + stute=stute_res, + yatchew=yatchew_res, + all_pass=all_pass, + verdict=verdict, + alpha=alpha, + n_obs=int(d_arr.shape[0]), + ) diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index ae629889..92e0ddd8 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2267,7 +2267,7 @@ so `δ_0` is recovered by OLS of `ΔY` on `X` and `D_2 * X`; Average Slope is `( 3. Report the estimate as WAS_{d̲} under Assumption 6 or as the sign-identifying quantity under Assumption 5. *Algorithm variant - QUG null test (Theorem 4, Section 3.3):* -Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in `diff_diff/diagnostics.py` as `qug_test()`. +Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in `diff_diff/had_pretests.py` as `qug_test()`. 1. Sort `D_{2,g}` ascending to obtain order statistics `D_{2,(1)} ≤ D_{2,(2)} ≤ ... 
≤ D_{2,(G)}`. 2. Compute test statistic `T := D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`. 3. Reject `H_0` if `T > 1/α - 1`. @@ -2276,7 +2276,7 @@ Tuning-parameter-free test of `H_0: d̲ = 0` versus `H_1: d̲ > 0`. Shipped in ` - **Note:** Implementation is `O(G)` via `np.partition`; no sort required. *Algorithm variant - TWFE linearity test via Stute (1997) Cramér-von Mises with wild bootstrap (Section 4.3, Appendix D):* -Shipped in `diff_diff/diagnostics.py` as `stute_test()`. Tests whether `E(ΔY | D_2)` is linear, the testable implication of TWFE's homogeneity assumption (Assumption 8) in HADs. +Shipped in `diff_diff/had_pretests.py` as `stute_test()`. Tests whether `E(ΔY | D_2)` is linear, the testable implication of TWFE's homogeneity assumption (Assumption 8) in HADs. 1. Fit linear regression of `ΔY_g` on constant and `D_{g,2}`; collect residuals `ε̂_{lin,g}`. 2. Form cusum process `c_G(d) := G^{-1/2} Σ_{g=1}^G 1{D_{g,2} ≤ d} · ε̂_{lin,g}`. 3. Compute Cramér-von Mises statistic `S := (1/G) Σ_{g=1}^G c_G²(D_{g,2})`. Equivalently, after sorting by `D_{g,2}`: `S = Σ_{g=1}^G (g/G)² · ((1/g) Σ_{h=1}^g ε̂_{lin,(h)})²`. @@ -2291,7 +2291,7 @@ Shipped in `diff_diff/diagnostics.py` as `stute_test()`. Tests whether `E(ΔY | - **Note:** Default `n_bootstrap = 499` is a diff-diff choice; the paper does not prescribe. *Algorithm variant - Yatchew (1997) heteroskedasticity-robust linearity test (Appendix E, Theorem 7):* -Shipped in `diff_diff/diagnostics.py` as `yatchew_hr_test()`. Alternative to Stute when `G` is large or heteroskedasticity is suspected. +Shipped in `diff_diff/had_pretests.py` as `yatchew_hr_test()`. Alternative to Stute when `G` is large or heteroskedasticity is suspected. 1. Sort `(D_{g,2}, ΔY_g)` by `D_{g,2}`. 2. Compute difference-based variance estimator: `σ̂²_{diff} := (1/(2G)) Σ_{g=2}^G [(Y_{2,(g)} - Y_{1,(g)}) - (Y_{2,(g-1)} - Y_{1,(g-1)})]²`. 3. Fit linear regression; compute residual variance `σ̂²_{lin}`. 
@@ -2301,12 +2301,14 @@ Shipped in `diff_diff/diagnostics.py` as `yatchew_hr_test()`. Alternative to Stu 7. Inference on `β̂_{fe}` conditional on accepting the linearity test is asymptotically valid (Theorem 7, Point 1; citing de Chaisemartin and D'Haultfœuille 2024 arXiv:2407.03725). *Four-step pre-testing workflow (Section 4.2-4.3):* -Shipped as `did_had_pretest_workflow()` and surfaced via `practitioner_next_steps()`. The paper's decision rule for TWFE reliability in HADs: +Shipped as `did_had_pretest_workflow()` in Phase 3. The paper's decision rule for TWFE reliability in HADs: 1. Test the null of a QUG (`H_0: d̲ = 0`) using `qug_test()`. 2. Run a pre-trends test of Assumption 7 (requires a pre-period `t=0`). 3. Test that `E(ΔY | D_2)` is linear (`stute_test` or `yatchew_hr_test`). 4. If NONE of the three is rejected, `β̂_{fe}` from TWFE may be used to estimate the treatment effect. +**Phase 3 delivery:** `did_had_pretest_workflow()` runs steps 1 + 3 (QUG + Stute + Yatchew) on a two-period panel and returns a verdict. Step 2 (pre-trends test via Equation 18 joint cross-horizon Stute on pre-period placebos) is deferred to a Phase 3 follow-up patch; see `TODO.md`. The `practitioner_next_steps()` integration is queued for Phase 5. + **Reference implementation(s):** - R: `did_had` (de Chaisemartin, Ciccia, D'Haultfœuille, Knau 2024a); `stute_test` (2024c); `yatchew_test` (Online Appendix, Table 3). - Stata: `did_had` (2024b); `stute_test` (2024d); `yatchew_test`. Also `twowayfeweights` (de Chaisemartin, D'Haultfœuille, Deeb 2019) for negative-weight diagnostics. @@ -2332,16 +2334,19 @@ Shipped as `did_had_pretest_workflow()` and surfaced via `practitioner_next_step - **Note (CI endpoints):** Because the continuous-path `att` is `(mean(ΔY) - τ̂_bc) / den`, the beta-scale CI endpoints reverse relative to the Phase 1c boundary-limit CI: `CI_lower(β̂) = (mean(ΔY) - CI_upper(τ̂_bc)) / den` and `CI_upper(β̂) = (mean(ΔY) - CI_lower(τ̂_bc)) / den`. 
The `HeterogeneousAdoptionDiD.fit()` implementation computes `att ± z · se` directly via `safe_inference`, which handles the reversal naturally from the transformed point estimate. - **Note (Phase 2a/2b scope):** Phase 2a ships the single-period `aggregate="overall"` path; Phase 2b lifts `aggregate="event_study"` (Appendix B.2 multi-period extension) which returns a `HeterogeneousAdoptionDiDEventStudyResults` with per-event-time WAS estimates and pointwise CIs. `survey=` and `weights=` kwargs raise `NotImplementedError` pointing to the follow-up survey-integration PR. - **Note (panel-only):** The paper (Section 2) defines HAD on *panel or repeated cross-section* data, but both the overall and event-study paths ship a panel-only implementation: `HeterogeneousAdoptionDiD.fit()` requires a balanced panel with a unit identifier so that unit-level first differences `ΔY_{g,t} = Y_{g,t} - Y_{g,t_anchor}` can be formed. Repeated-cross-section inputs (disjoint unit IDs between periods) are rejected by the balanced-panel validator. RCS support is queued for a follow-up PR (tracked in `TODO.md`); it will need a separate identification path based on pre/post cell means rather than unit-level differences. -- [x] Phase 2b: Multi-period event-study extension (Appendix B.2). `aggregate="event_study"` produces per-event-time WAS estimates using a uniform `F-1` baseline (`ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}` for every horizon), reusing the three Phase 2a design paths on per-horizon first differences. Pre-period placebos included for `e <= -2` (the anchor `e = -1` is skipped since `ΔY = 0` trivially). Post-period estimates for `e >= 0`. The joint Stute test (Equation 18) across pre-periods is a SEPARATE diagnostic deferred to Phase 3 (pre-test diagnostics). +- [x] Phase 2b: Multi-period event-study extension (Appendix B.2). 
`aggregate="event_study"` produces per-event-time WAS estimates using a uniform `F-1` baseline (`ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}` for every horizon), reusing the three Phase 2a design paths on per-horizon first differences. Pre-period placebos included for `e <= -2` (the anchor `e = -1` is skipped since `ΔY = 0` trivially). Post-period estimates for `e >= 0`. The joint Stute test (Equation 18) across pre-periods is a SEPARATE diagnostic deferred to a **Phase 3 follow-up patch** (Phase 3 ships the single-horizon Stute test; the joint stacked-residual variant is tracked in `TODO.md`). - **Note (Phase 2b last-cohort filter):** When `first_treat_col` indicates more than one nonzero cohort, the panel is auto-filtered to the last-treatment cohort (`F_last = max(cohorts)`) **plus never-treated units** (`first_treat = 0`), with a `UserWarning` naming kept/dropped unit counts and dropped cohort labels. Paper Appendix B.2 is explicit that HAD "may be used only for the LAST treatment cohort in a staggered design"; the auto-filter implements this prescription, retaining never-treated units per the paper's "there must be an untreated group, at least till the period where the last cohort gets treated" requirement. Only earlier-cohort units (with `first_treat > 0` and `< F_last`) are dropped — never-treated units satisfy the dose invariant at every period (`D = 0` throughout) and preserve Design 1' identifiability (boundary at `0`) when last-cohort doses are uniformly positive. When `first_treat_col` is omitted on a >2-period panel, the validator infers each unit's first-positive-dose period from the dose path; if multiple distinct first-positive-dose cohorts are detected, the estimator raises a front-door `ValueError` directing users to pass `first_treat_col` (which activates the auto-filter) or use `ChaisemartinDHaultfoeuille` for full staggered support — there is no silent acceptance of staggered panels without cohort metadata. 
Common-adoption panels (single first-positive-dose cohort, or only never-treated + one cohort) pass through unchanged with `F` inferred from the dose invariant, and require dose contiguity (pre-periods < post-periods in natural ordering). Non-contiguous dose sequences (e.g., reverse treatment) raise with a pointer to `ChaisemartinDHaultfoeuille`. - **Note (Phase 2b constant-dose requirement):** The event-study aggregation uses `D_{g, F}` (first-treatment-period dose) as the single regressor for every event-time horizon, per paper Appendix B.2's "once treated, stay treated with the same dose" convention. The validator REJECTS panels where a unit has time-varying dose across post-treatment periods (`D_{g, t} != D_{g, F}` for any `t >= F` within-unit, beyond float tolerance) with a front-door `ValueError`, directing users with genuinely time-varying post-treatment doses to `ChaisemartinDHaultfoeuille` (`did_multiplegt_dyn`). Silent acceptance would misattribute later-horizon treatment-effect heterogeneity to the period-F dose. A follow-up PR could implement a time-varying-dose estimator; tracked in `TODO.md`. - **Note (Phase 2b per-horizon SE):** Each event-time horizon uses an INDEPENDENT sandwich computed on that horizon's first differences: continuous paths use the CCT-2014 robust SE from Phase 1c divided by `|den|`; mass-point path uses the structural-residual 2SLS sandwich from Phase 2a. This produces pointwise CIs per horizon, matching the paper's Pierce-Schott application (Section 5.2, Figure 2: "nonparametric pointwise CIs"). Joint cross-horizon covariance (IF-based stacking or block bootstrap) is NOT computed — the paper does not derive it and all reported CIs are pointwise. Follow-up PRs may add joint covariance for cross-horizon hypothesis tests; current tracking in `TODO.md`. - **Note (Phase 2b baseline convention):** All event-time horizons use a uniform `F-1` anchor: `ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}` for every `t`. 
This is consistent with the paper's Garrett-et-al. application (Section 5.1: "outcome `Y_{g,t} - Y_{g,2001}`" where `F = 2002`), simplifies event-time indexing (`e = t - F` so `e = -1` is the anchor, skipped), and keeps the implementation symmetric for pre- and post-period horizons. The paper review text's asymmetric "`Y_{g,t} - Y_{g,1}` for pre" / "`Y_{g,t} - Y_{g,F-1}` for post" phrasing is covered by the uniform convention since both give the same placebo interpretation under parallel trends (the paper's own applications use the uniform anchor). - **Note (Phase 2b result class):** `aggregate="event_study"` returns a new `HeterogeneousAdoptionDiDEventStudyResults` dataclass (distinct from the single-period `HeterogeneousAdoptionDiDResults`) with per-horizon arrays (`event_times`, `att`, `se`, `t_stat`, `p_value`, `conf_int_low`, `conf_int_high`, `n_obs_per_horizon`) and shared metadata. `to_dataframe()` returns a tidy per-horizon DataFrame; `to_dict()` returns a dict with list-of-per-horizon fields. The static return-type annotation on `fit()` is `HeterogeneousAdoptionDiDResults` (the common case); callers passing `aggregate="event_study"` should annotate their variable as `HeterogeneousAdoptionDiDEventStudyResults` for type checkers. -- [ ] Phase 3: `qug_test()` (`T = D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`, rejection `{T > 1/α - 1}`). -- [ ] Phase 3: `stute_test()` Cramér-von Mises with Mammen wild bootstrap. -- [ ] Phase 3: `yatchew_hr_test()` heteroskedasticity-robust linearity test. -- [ ] Phase 3: `did_had_pretest_workflow()` composite helper. + - **Note (Phase 3 pretest workflow):** Phase 3 ships the Section 4 pre-test diagnostics in a new module `diff_diff/had_pretests.py` (separate from the 2,800-line `had.py` estimator module). 
The three tests — `qug_test`, `stute_test`, `yatchew_hr_test` — accept raw `(d, dy)` arrays and return their own result dataclasses (`QUGTestResults`, `StuteTestResults`, `YatchewTestResults`); the composite `did_had_pretest_workflow` accepts a two-period panel and returns `HADPretestReport` with a priority-ordered verdict string. The `data`-only workflow signature (no `result=`-consuming variant) avoids coupling pretests to the `BiasCorrectedFit` internal state, which does not expose residuals. Below-minimum sample sizes emit `UserWarning` and return all-NaN inference (rather than raising `ValueError`) for library consistency with the `safe_inference` convention; input-shape violations (non-1D, NaN-containing, mismatched length) still raise. No multiple-testing adjustment (Bonferroni/Holm) is applied — the paper does not prescribe one. The joint Equation 18 cross-horizon Stute test (step 2 of the paper's four-step workflow) is deferred to a Phase 3 follow-up patch; the single-horizon Stute in this PR is the standard Assumption 8 linearity test (paper Section 4.3, Appendix D). + - **Note (Phase 3 Stute bootstrap):** The Stute CvM statistic is implemented via the simplified form `S = (1/G²) · Σ cumsum²`, which is algebraically equivalent to the paper's stated `Σ(g/G)² · ((1/g) Σ eps_{(h)})²` form. Bootstrap follows the paper's Appendix D Algorithm literally: each iteration refits OLS on `dy_b = a_hat + b_hat · d + eps · eta` (null fitted values plus residuals multiplied by Mammen weights). The Appendix-D vectorized matrix form (precomputing `M = I - X(X'X)⁻¹X'` once and applying it to `eps · eta` in each iteration) is functionally identical and ~2× faster; it is deferred as a performance follow-up to keep the Phase 3 reviewer surface small. Bootstrap reproducibility uses `np.random.default_rng(seed)` (library convention, matching `wild_bootstrap_se`); `seed=42` in tests for bitwise stability. 
+ - **Note (Phase 3 Yatchew normalizer):** `σ̂²_diff` in `yatchew_hr_test` divides by `2G` (paper-literal from Theorem 7), NOT by `2(G-1)`. A unit test hand-computes `σ̂²_diff` at `G=4` on deterministic inputs and asserts the `2G`-normalizer form at `atol=1e-12` so any later regression to the (asymptotically equivalent but finite-sample different) `2(G-1)` form would fail the test. `σ̂⁴_W` uses `np.mean(eps_{(g)}² · eps_{(g-1)}²)` which divides by `G-1` via `np.mean` length, matching the paper's Theorem 7 / Equation 29 normalization. +- [x] Phase 3: `qug_test()` (`T = D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`, rejection `{T > 1/α - 1}`). One-sided asymptotic p-value `1 / (1 + T)` under the Exp(1)/Exp(1) limit law (Theorem 4). Zero-dose observations filtered upfront (with `UserWarning`); `D_{(1)} == D_{(2)}` tie returns all-NaN inference (conservative). Closed-form tight parity tested at `atol=1e-12`. +- [x] Phase 3: `stute_test()` Cramér-von Mises with Mammen wild bootstrap. Statistic `S = (1/G^2) Σ (cumsum_g)^2` (algebraically equivalent to paper's `Σ(g/G)^2 · ((1/g) Σ eps_{(h)})^2`). Bootstrap follows paper Appendix D Algorithm literal (per-iteration OLS refit). `n_bootstrap=999` default, `n_bootstrap >= 99` validated. `G < 10` returns NaN; `G > 100_000` emits a `UserWarning` pointing to `yatchew_hr_test`. Appendix-D vectorized matrix form deferred as a performance follow-up (tracked in `TODO.md`). +- [x] Phase 3: `yatchew_hr_test()` heteroskedasticity-robust linearity test. Test statistic `T_hr = sqrt(G) · (σ̂²_lin - σ̂²_diff) / σ̂²_W` from paper Equation 29. Normalizer `σ̂²_diff` divides by `2G` (paper-literal Theorem 7), NOT `2(G-1)`; hand-computed tight parity asserted at `atol=1e-12`. One-sided standard-normal critical value. `G < 3` returns NaN. +- [x] Phase 3: `did_had_pretest_workflow()` composite helper. 
Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string: `"TWFE safe under Section 4 assumptions"` when all fail-to-reject, `"inconclusive - {tests} NaN"` when any statistic is NaN, else a bundled string naming each rejected assumption. Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up. - [ ] Phase 4: Pierce-Schott (2016) replication harness reproduces Figure 2 values. - [ ] Phase 4: Full DGP 1/2/3 coverage-rate reproduction from Table 1. - [ ] Phase 5: `practitioner_next_steps()` integration for HAD results. diff --git a/docs/methodology/papers/dechaisemartin-2026-review.md b/docs/methodology/papers/dechaisemartin-2026-review.md index 0dce9765..15c62eee 100644 --- a/docs/methodology/papers/dechaisemartin-2026-review.md +++ b/docs/methodology/papers/dechaisemartin-2026-review.md @@ -183,14 +183,15 @@ Alternative to Stute when `G` is large or heteroskedasticity is suspected. - [ ] Separate code paths for Design 1' (`d̲ = 0`), Design 1 mass-point (`d̲ > 0` discrete), and Design 1 continuous-near-`d̲`. - [ ] Local-linear regression backend (kernel weights, bandwidth selector). - [ ] Integration with bias-corrected CI from Calonico-Cattaneo-Farrell. -- [ ] QUG null test (`T = D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`, rejection region `{T > 1/α - 1}`). -- [ ] Stute Cramér-von Mises test with Mammen wild bootstrap. -- [ ] Yatchew heteroskedasticity-robust linearity test. +- [x] QUG null test (`T = D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`, rejection region `{T > 1/α - 1}`). **Phase 3 implementation (2026-04):** `qug_test()` in `diff_diff/had_pretests.py`. Asymptotic p-value `1/(1+T)` under Exp(1)/Exp(1) limit law. 
Zero-dose observations filtered upfront with `UserWarning`; tie-break `D_{(1)} == D_{(2)}` returns all-NaN inference. Tight closed-form parity at `atol=1e-12`. +- [x] Stute Cramér-von Mises test with Mammen wild bootstrap. **Phase 3 implementation (2026-04):** `stute_test()` in `diff_diff/had_pretests.py`. Literal per-iteration OLS refit per paper Appendix D Algorithm. `n_bootstrap=999` default, `n_bootstrap >= 99` validated. Single-horizon only; joint Equation 18 cross-horizon Stute (Paper Section 5.2 Pierce-Schott application) deferred to a Phase 3 follow-up patch. +- [x] Yatchew heteroskedasticity-robust linearity test. **Phase 3 implementation (2026-04):** `yatchew_hr_test()` in `diff_diff/had_pretests.py`. Test statistic `T_hr = sqrt(G)·(σ²_lin - σ²_diff)/σ²_W` from paper Equation 29. `σ²_diff` normalizes by `2G` (paper-literal), NOT `2(G-1)` (finite-sample equivalent but tests pin the paper-literal form). Standard-normal critical value, one-sided. +- [x] Composite workflow `did_had_pretest_workflow()` (paper Section 4.2-4.3). **Phase 3 implementation (2026-04):** Two-period panel entry point runs all three tests and returns `HADPretestReport` with priority-ordered verdict string. The paper's step 2 (pre-trends test of Assumption 7) requires Equation 18 and is deferred to the Phase 3 follow-up patch. - [ ] Warnings for staggered treatment timing (direct users to existing `ChaisemartinDHaultfoeuille` in diff-diff). - [ ] Warnings for extensive-margin effects / positive mass of untreated (not fatal; suggests running existing DiD). - [ ] Documentation of non-testability of Assumptions 5 and 6. - [x] Multi-period event-study extension (Appendix B.2). **Phase 2b implementation (2026-04):** `aggregate="event_study"` returns per-event-time WAS estimates using uniform `F-1` anchor. Staggered timing auto-filtered to last cohort with `UserWarning` per Appendix B.2 prescription. 
Pointwise CIs per horizon (no joint cross-horizon covariance; matches paper's Pierce-Schott Figure 2). Pre-period placebos at `e <= -2`; the anchor `e = -1` is skipped since `ΔY = 0` there by construction. -- [ ] Joint Stute test (Equation 18) across pre-periods. Deferred to Phase 3 (pre-test diagnostics). +- [ ] Joint Stute test (Equation 18) across pre-periods. Deferred to a **Phase 3 follow-up patch** — Phase 3 (2026-04) shipped the single-horizon Stute test but not the joint cross-horizon variant; the exact stacked-residual formula needs extraction from the paper PDF (not reproduced in this review). Tracked in `TODO.md`. --- diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py new file mode 100644 index 00000000..0d3acd88 --- /dev/null +++ b/tests/test_had_pretests.py @@ -0,0 +1,716 @@ +"""Tests for HAD Phase 3 pre-test diagnostics (had_pretests.py).""" + +from __future__ import annotations + +import json +import warnings + +import numpy as np +import pandas as pd +import pytest + +from diff_diff import ( + QUGTestResults, + StuteTestResults, + YatchewTestResults, + did_had_pretest_workflow, + qug_test, + stute_test, + yatchew_hr_test, +) +from diff_diff.had_pretests import _compose_verdict + +# ============================================================================= +# Helpers +# ============================================================================= + + +def _make_two_period_panel( + G: int, + d: np.ndarray, + dy: np.ndarray, + seed: int = 42, +) -> pd.DataFrame: + """Construct a two-period panel satisfying the HAD overall-path contract. + + - ``time == 0``: ``d = 0`` for all units (HAD no-unit-untreated pre). + - ``time == 1``: dose ``d`` per unit, outcome ``y_pre + dy``. 
+ """ + rng = np.random.default_rng(seed) + y_pre = rng.normal(0.0, 1.0, size=G) + pre = pd.DataFrame({"unit": np.arange(G), "time": 0, "y": y_pre, "d": np.zeros(G)}) + post = pd.DataFrame({"unit": np.arange(G), "time": 1, "y": y_pre + dy, "d": d}) + return pd.concat([pre, post], ignore_index=True) + + +def _linear_dgp(G: int, beta: float = 2.0, sigma: float = 0.3, seed: int = 42): + """Return ``(d, dy)`` for a linear DGP ``dy = beta * d + eps``.""" + rng = np.random.default_rng(seed) + d = rng.uniform(0.0, 1.0, size=G) + dy = beta * d + rng.normal(0.0, sigma, size=G) + return d, dy + + +def _quadratic_dgp( + G: int, beta: float = 2.0, gamma: float = 5.0, sigma: float = 0.3, seed: int = 42 +): + """Return ``(d, dy)`` for a quadratic DGP ``dy = beta*d + gamma*d^2 + eps``.""" + rng = np.random.default_rng(seed) + d = rng.uniform(0.1, 1.0, size=G) + dy = beta * d + gamma * d * d + rng.normal(0.0, sigma, size=G) + return d, dy + + +# ============================================================================= +# QUG test +# ============================================================================= + + +class TestQUGTest: + """Tests for :func:`qug_test`.""" + + def test_tight_parity_closed_form(self): + """Commit criterion 4: [0.2, 0.5, 0.9] -> T = 0.2/0.3 at atol=1e-12.""" + d = np.array([0.2, 0.5, 0.9]) + r = qug_test(d, alpha=0.05) + expected_T = 0.2 / (0.5 - 0.2) + expected_p = 1.0 / (1.0 + expected_T) + np.testing.assert_allclose(r.t_stat, expected_T, atol=1e-12) + np.testing.assert_allclose(r.p_value, expected_p, atol=1e-12) + assert r.critical_value == pytest.approx(1.0 / 0.05 - 1.0, rel=1e-12) + + def test_tie_break(self): + """Commit criterion 3: D_(1) == D_(2) -> NaN, reject=False.""" + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + r = qug_test(np.array([0.3, 0.3, 0.5])) + assert len(caught) == 1 + assert "D_(1) == D_(2)" in str(caught[0].message) + assert np.isnan(r.t_stat) + assert np.isnan(r.p_value) + assert 
r.reject is False + assert r.d_order_1 == 0.3 + assert r.d_order_2 == 0.3 + + def test_zero_filter_emits_warning(self): + """Zero-dose observations are excluded with a UserWarning.""" + d = np.array([0.0, 0.0, 0.3, 0.5, 0.9]) + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + r = qug_test(d) + messages = [str(w.message) for w in caught] + assert any("excluded 2" in m for m in messages) + assert r.n_obs == 3 + assert r.n_excluded_zero == 2 + # Should reduce to [0.3, 0.5, 0.9]: T = 0.3 / 0.2 = 1.5 + np.testing.assert_allclose(r.t_stat, 0.3 / 0.2, atol=1e-12) + + def test_rejects_on_shifted_beta(self): + """Commit criterion 2 (single-realization): shifted-Beta -> reject.""" + rng = np.random.default_rng(42) + # Support is [0.2, 1.2]: infimum is well away from zero. + d = 0.2 + rng.beta(2, 2, size=200) + r = qug_test(d, alpha=0.05) + # T = D_(1)/(D_(2) - D_(1)). D_(1) ~ 0.2; gap ~ 0.001 at G=200. + # So T is large -> reject. + assert r.reject is True + assert r.p_value < 0.05 + + def test_does_not_reject_on_uniform_support_zero(self): + """Commit criterion 1 (single-realization): Uniform(0,1) -> usually no reject.""" + rng = np.random.default_rng(42) + d = rng.uniform(0.0, 1.0, size=500) + r = qug_test(d, alpha=0.05) + # Uniform support has infimum at 0 (asymptotic). For G=500, + # D_(1) is small but D_(2) - D_(1) also small -> T ~ O(1). + # Not a guaranteed non-reject at single seed, but this seed passes. + assert r.p_value > 0.05 + assert r.reject is False + + def test_reject_rate_bounded_by_alpha_on_uniform_null(self): + """Commit criterion 1: asymptotic size alpha is an UPPER bound. + + Paper Theorem 4 establishes ``lim sup_{G->inf} P(reject | H_0) = alpha`` + (one-sided asymptotic size bound). At finite G the test is often + conservative (actual rate below alpha). We verify the asymptotic + size control: reject rate MUST NOT exceed alpha by more than 0.03. 
+ """ + alpha = 0.05 + n_reps = 200 + rejections = 0 + rng = np.random.default_rng(12345) + for _ in range(n_reps): + d = rng.uniform(0.0, 1.0, size=500) + r = qug_test(d, alpha=alpha) + if r.reject: + rejections += 1 + rate = rejections / n_reps + assert rate <= alpha + 0.03, ( + f"QUG reject rate {rate:.3f} exceeds alpha={alpha} by more than " + f"0.03; asymptotic size control violated." + ) + + def test_invalid_alpha_raises(self): + with pytest.raises(ValueError, match="alpha must satisfy"): + qug_test(np.array([0.1, 0.5]), alpha=0.0) + with pytest.raises(ValueError, match="alpha must satisfy"): + qug_test(np.array([0.1, 0.5]), alpha=1.0) + + def test_nan_input_raises(self): + with pytest.raises(ValueError, match="contains NaN"): + qug_test(np.array([0.1, np.nan, 0.5])) + + def test_2d_input_raises(self): + with pytest.raises(ValueError, match="must be 1-dimensional"): + qug_test(np.array([[0.1, 0.5], [0.3, 0.9]])) + + +# ============================================================================= +# Stute test +# ============================================================================= + + +class TestStuteTest: + """Tests for :func:`stute_test`.""" + + def test_does_not_reject_linear_dgp(self): + """Commit criterion 5 (single-realization): linear -> no reject.""" + d, dy = _linear_dgp(G=200, beta=2.0, sigma=0.3, seed=42) + r = stute_test(d, dy, alpha=0.05, n_bootstrap=199, seed=42) + assert r.reject is False + assert r.p_value > 0.05 + + def test_rejects_quadratic_dgp(self): + """Commit criterion 6: quadratic DGP -> reject.""" + d, dy = _quadratic_dgp(G=200, beta=2.0, gamma=5.0, sigma=0.3, seed=42) + r = stute_test(d, dy, alpha=0.05, n_bootstrap=199, seed=42) + assert r.reject is True + assert r.p_value < 0.05 + + def test_cvm_statistic_manual_equivalence(self): + """Verify the simplified CvM form matches the paper's stated form. + + Paper: S = Σ(g/G)^2 · ((1/g) Σ eps_(h))^2 + Code: S = (1/G^2) Σ (cumsum_g)^2 + These are algebraically identical. 
+ """ + from diff_diff.had_pretests import _cvm_statistic + + eps = np.array([1.0, -0.5, 0.3, -0.2, 0.1]) + G = len(eps) + + # Paper form + cumsum = np.cumsum(eps) + paper_S = 0.0 + for g in range(1, G + 1): + c_g = cumsum[g - 1] + paper_S += (g / G) ** 2 * (c_g / g) ** 2 + + # Simplified form (what the code uses) + code_S = _cvm_statistic(eps) + + np.testing.assert_allclose(code_S, paper_S, atol=1e-14) + + def test_n_bootstrap_below_99_raises(self): + """Commit criterion 8: n_bootstrap < 99 -> ValueError.""" + d, dy = _linear_dgp(G=50) + with pytest.raises(ValueError, match=r"n_bootstrap must be >= 99"): + stute_test(d, dy, n_bootstrap=50) + + def test_mismatched_lengths_raise(self): + with pytest.raises(ValueError, match="same length"): + stute_test(np.array([0.1, 0.2, 0.3]), np.array([1.0, 2.0])) + + def test_nan_input_raises(self): + d, dy = _linear_dgp(G=50) + dy_nan = dy.copy() + dy_nan[5] = np.nan + with pytest.raises(ValueError, match="contains NaN"): + stute_test(d, dy_nan) + + +# ============================================================================= +# Yatchew-HR test +# ============================================================================= + + +class TestYatchewHRTest: + """Tests for :func:`yatchew_hr_test`.""" + + def test_does_not_reject_linear_dgp(self): + """Commit criterion 9 (single-realization): linear -> no reject.""" + d, dy = _linear_dgp(G=200, beta=2.0, sigma=0.3, seed=42) + r = yatchew_hr_test(d, dy, alpha=0.05) + assert r.reject is False + + def test_rejects_quadratic_dgp(self): + """Commit criterion 9 (single-realization): quadratic -> reject.""" + d, dy = _quadratic_dgp(G=200, beta=2.0, gamma=5.0, sigma=0.3, seed=42) + r = yatchew_hr_test(d, dy, alpha=0.05) + assert r.reject is True + assert r.t_stat_hr > r.critical_value + + def test_sigma2_diff_normalizer_is_2G_paper_literal(self): + """Commit criterion 10: sigma2_diff divides by 2G (paper-literal). 
+ + Hand-compute sigma2_diff on a known input and verify the + implementation divides by 2G, NOT 2(G-1). + """ + # G = 4 deterministic inputs; sort by d first. + d = np.array([0.1, 0.2, 0.3, 0.4]) + dy = np.array([1.0, 3.0, 2.0, 5.0]) + r = yatchew_hr_test(d, dy, alpha=0.05) + + # Sorted by d: (already sorted) + # diff_dy = [dy[1]-dy[0], dy[2]-dy[1], dy[3]-dy[2]] + # = [2, -1, 3]; squared = [4, 1, 9]; sum = 14 + # Paper-literal: sigma2_diff = 14 / (2 * 4) = 1.75 + # WRONG form (using G-1=3): 14 / (2 * 3) = 2.333... + sum_sq_diff = 14.0 + G = 4 + expected_paper_literal = sum_sq_diff / (2.0 * G) + expected_wrong_form = sum_sq_diff / (2.0 * (G - 1)) + np.testing.assert_allclose(r.sigma2_diff, expected_paper_literal, atol=1e-12) + assert ( + abs(r.sigma2_diff - expected_wrong_form) > 1e-6 + ), "sigma2_diff should divide by 2G (paper-literal), NOT 2(G-1)." + + def test_invalid_alpha_raises(self): + d, dy = _linear_dgp(G=50) + with pytest.raises(ValueError, match="alpha must satisfy"): + yatchew_hr_test(d, dy, alpha=-0.1) + + +# ============================================================================= +# Composite workflow +# ============================================================================= + + +class TestCompositeWorkflow: + """Tests for :func:`did_had_pretest_workflow`.""" + + def test_all_pass_on_linear_no_pretrend(self): + """Commit criterion 11: linear DGP -> all_pass, 'TWFE safe' verdict.""" + d, dy = _linear_dgp(G=200, beta=2.0, sigma=0.3, seed=42) + panel = _make_two_period_panel(200, d, dy, seed=42) + report = did_had_pretest_workflow( + panel, + outcome_col="y", + dose_col="d", + time_col="time", + unit_col="unit", + alpha=0.05, + n_bootstrap=199, + seed=42, + ) + assert report.all_pass is True + assert report.verdict == "TWFE safe under Section 4 assumptions" + assert report.n_obs == 200 + assert isinstance(report.qug, QUGTestResults) + assert isinstance(report.stute, StuteTestResults) + assert isinstance(report.yatchew, YatchewTestResults) + + 
def test_rejects_on_quadratic_plus_shifted_support(self): + """Commit criterion 12: quadratic + shifted support -> both reject.""" + rng = np.random.default_rng(42) + G = 200 + # Shifted-Beta support [0.2, 1.2]: QUG should reject. + d = 0.2 + rng.beta(2, 2, size=G) + # Quadratic dose response: linearity tests should reject. + dy = 2.0 * d + 5.0 * d * d + rng.normal(0, 0.3, size=G) + panel = _make_two_period_panel(G, d, dy, seed=42) + report = did_had_pretest_workflow( + panel, + outcome_col="y", + dose_col="d", + time_col="time", + unit_col="unit", + alpha=0.05, + n_bootstrap=199, + seed=42, + ) + assert report.all_pass is False + assert "support infimum rejected" in report.verdict + assert "linearity rejected" in report.verdict + # At least QUG and one of {Stute, Yatchew} rejected. + assert report.qug.reject is True + assert report.stute.reject or report.yatchew.reject + + def test_workflow_seed_controls_stute_only(self): + """Workflow seed forwards to Stute only; re-run with same seed is reproducible.""" + d, dy = _linear_dgp(G=100, beta=2.0, sigma=0.3, seed=42) + panel = _make_two_period_panel(100, d, dy, seed=42) + report_a = did_had_pretest_workflow( + panel, "y", "d", "time", "unit", n_bootstrap=199, seed=42 + ) + report_b = did_had_pretest_workflow( + panel, "y", "d", "time", "unit", n_bootstrap=199, seed=42 + ) + assert report_a.stute.p_value == report_b.stute.p_value + assert report_a.qug.t_stat == report_b.qug.t_stat + assert report_a.yatchew.t_stat_hr == report_b.yatchew.t_stat_hr + + +# ============================================================================= +# NaN propagation +# ============================================================================= + + +class TestNaNPropagation: + """All three tests NaN-propagate on degenerate inputs; verdict is inconclusive.""" + + def test_constant_dy_trivially_satisfies_linearity(self): + """Commit criterion 13 (reframed): constant dy is trivially linear + in d; Stute and Yatchew should fail-to-reject 
with large p-values. + + Constant dy means OLS residuals are mathematically zero and the CvM + cusum is zero; the bootstrap yields all-tied ``S_b = 0``. The test + correctly fails-to-reject (linearity holds). NaN propagation is + covered by :meth:`test_qug_tie_propagates_to_composite` below. + """ + G = 50 + d = np.linspace(0.1, 1.0, G) + dy = np.full(G, 3.0) + panel = _make_two_period_panel(G, d, dy, seed=42) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + report = did_had_pretest_workflow( + panel, "y", "d", "time", "unit", n_bootstrap=199, seed=42 + ) + assert report.stute.reject is False + assert report.yatchew.reject is False + assert report.stute.p_value > 0.5 + assert report.yatchew.p_value > 0.5 + + def test_qug_tie_propagates_to_composite(self): + """QUG tie in composite -> verdict mentions QUG NaN.""" + G = 20 + # All units share the same positive dose -> D_(1) == D_(2) tie. + d = np.full(G, 0.5) + dy = np.linspace(0.0, 2.0, G) + panel = _make_two_period_panel(G, d, dy, seed=42) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + report = did_had_pretest_workflow( + panel, "y", "d", "time", "unit", n_bootstrap=199, seed=42 + ) + assert np.isnan(report.qug.p_value) + assert "QUG" in report.verdict + assert report.verdict.startswith("inconclusive") + + def test_stute_constant_d_returns_nan(self): + G = 50 + d = np.full(G, 0.5) + dy = np.linspace(0.0, 2.0, G) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + r = stute_test(d, dy, n_bootstrap=199, seed=42) + assert np.isnan(r.cvm_stat) + assert r.reject is False + + +# ============================================================================= +# JSON / DataFrame serialization +# ============================================================================= + + +class TestJSONSerialization: + """to_dict() is JSON-safe; to_dataframe() returns correct schema.""" + + def test_qug_to_dict_is_json_safe(self): + d = np.array([0.2, 0.5, 0.9]) + r = 
qug_test(d) + json_str = json.dumps(r.to_dict()) + round_tripped = json.loads(json_str) + assert round_tripped["test"] == "qug" + assert round_tripped["reject"] is False + + def test_stute_to_dict_is_json_safe(self): + d, dy = _linear_dgp(G=50) + r = stute_test(d, dy, n_bootstrap=199, seed=42) + json.dumps(r.to_dict()) # must not raise + + def test_yatchew_to_dict_is_json_safe(self): + d, dy = _linear_dgp(G=50) + r = yatchew_hr_test(d, dy) + json.dumps(r.to_dict()) # must not raise + + def test_report_to_dict_is_json_safe(self): + """Commit criterion 16: json.dumps(report.to_dict()) succeeds.""" + d, dy = _linear_dgp(G=50) + panel = _make_two_period_panel(50, d, dy, seed=42) + report = did_had_pretest_workflow(panel, "y", "d", "time", "unit", n_bootstrap=199, seed=42) + json_str = json.dumps(report.to_dict()) + round_tripped = json.loads(json_str) + assert "qug" in round_tripped + assert "stute" in round_tripped + assert "yatchew" in round_tripped + assert "verdict" in round_tripped + + def test_report_to_dataframe_schema(self): + """Commit criterion 17: 3-row frame with specified columns in order.""" + d, dy = _linear_dgp(G=50) + panel = _make_two_period_panel(50, d, dy, seed=42) + report = did_had_pretest_workflow(panel, "y", "d", "time", "unit", n_bootstrap=199, seed=42) + df = report.to_dataframe() + expected_cols = [ + "test", + "statistic_name", + "statistic_value", + "p_value", + "reject", + "alpha", + "n_obs", + ] + assert list(df.columns) == expected_cols + assert list(df["test"]) == ["qug", "stute", "yatchew_hr"] + assert list(df["statistic_name"]) == ["t_stat", "cvm_stat", "t_stat_hr"] + assert len(df) == 3 + + def test_per_test_to_dataframe_is_single_row(self): + """Commit criterion 18: per-test to_dataframe returns a 1-row frame.""" + d, dy = _linear_dgp(G=50) + r_stute = stute_test(d, dy, n_bootstrap=199, seed=42) + r_yatchew = yatchew_hr_test(d, dy) + r_qug = qug_test(d) + assert len(r_stute.to_dataframe()) == 1 + assert len(r_yatchew.to_dataframe()) 
== 1 + assert len(r_qug.to_dataframe()) == 1 + assert r_stute.to_dataframe()["test"].iloc[0] == "stute" + + +# ============================================================================= +# Seed reproducibility +# ============================================================================= + + +class TestSeedReproducibility: + """Commit criterion 7: stute_test(seed=42) bitwise-reproducible.""" + + def test_stute_seed_reproducible(self): + d, dy = _linear_dgp(G=100, seed=1) + r_a = stute_test(d, dy, n_bootstrap=199, seed=42) + r_b = stute_test(d, dy, n_bootstrap=199, seed=42) + assert r_a.cvm_stat == r_b.cvm_stat + assert r_a.p_value == r_b.p_value # bitwise + + def test_stute_different_seeds_produce_different_pvals(self): + d, dy = _linear_dgp(G=100, seed=1) + r_a = stute_test(d, dy, n_bootstrap=199, seed=42) + r_b = stute_test(d, dy, n_bootstrap=199, seed=43) + # CvM stat is seed-independent; only p-value varies across bootstraps. + assert r_a.cvm_stat == r_b.cvm_stat + # P-values should differ at B = 199 (grid = 1/200). 
+ assert r_a.p_value != r_b.p_value + + +# ============================================================================= +# Sample-size gates +# ============================================================================= + + +class TestSampleSizeGates: + """Below-minimum sample sizes emit UserWarning + NaN result (no raise).""" + + def test_qug_below_min_returns_nan(self): + """G=1 (after filter) -> NaN, warning, reject=False.""" + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + r = qug_test(np.array([0.5])) + msgs = [str(w.message) for w in caught] + assert any("at least 2" in m for m in msgs) + assert np.isnan(r.t_stat) + assert r.reject is False + + def test_qug_all_zero_returns_nan(self): + """All d=0 -> filter removes everything -> NaN.""" + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + r = qug_test(np.zeros(10)) + assert np.isnan(r.t_stat) + assert r.n_obs == 0 + assert r.n_excluded_zero == 10 + + def test_stute_below_min_returns_nan(self): + """Commit criterion 14: stute_test(G=8) -> NaN, warning, no raise.""" + d, dy = _linear_dgp(G=8) + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + r = stute_test(d, dy, n_bootstrap=199, seed=42) + msgs = [str(w.message) for w in caught] + assert any("below the minimum" in m for m in msgs) + assert np.isnan(r.cvm_stat) + assert r.reject is False + + def test_yatchew_below_min_returns_nan(self): + """yatchew_hr_test(G=2) -> NaN, warning.""" + d = np.array([0.1, 0.5]) + dy = np.array([1.0, 2.0]) + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + r = yatchew_hr_test(d, dy) + msgs = [str(w.message) for w in caught] + assert any("below the minimum" in m for m in msgs) + assert np.isnan(r.t_stat_hr) + assert r.reject is False + + +# ============================================================================= +# _compose_verdict priority logic +# 
============================================================================= + + +def _mk_qug(reject: bool, p: float = 0.5) -> QUGTestResults: + return QUGTestResults( + t_stat=10.0 if reject else 0.5, + p_value=p, + reject=reject, + alpha=0.05, + critical_value=19.0, + n_obs=50, + n_excluded_zero=0, + d_order_1=0.1, + d_order_2=0.2, + ) + + +def _mk_stute(reject: bool, p: float = 0.5) -> StuteTestResults: + return StuteTestResults( + cvm_stat=0.5, + p_value=p, + reject=reject, + alpha=0.05, + n_bootstrap=999, + n_obs=50, + seed=42, + ) + + +def _mk_yatchew(reject: bool, p: float = 0.5) -> YatchewTestResults: + return YatchewTestResults( + t_stat_hr=3.0 if reject else -0.5, + p_value=p, + reject=reject, + alpha=0.05, + critical_value=1.645, + sigma2_lin=1.0, + sigma2_diff=1.0, + sigma2_W=1.0, + n_obs=50, + ) + + +class TestComposeVerdictLogic: + """Commit criterion 15: 5 priority-order tests.""" + + def test_nan_in_any_test_inconclusive(self): + """(a) Any NaN p-value -> inconclusive.""" + q = _mk_qug(reject=False) + s = _mk_stute(reject=False, p=float("nan")) + y = _mk_yatchew(reject=False) + verdict = _compose_verdict(q, s, y) + assert verdict.startswith("inconclusive") + assert "Stute" in verdict + + def test_none_reject_twfe_safe(self): + """(b) None reject -> 'TWFE safe under Section 4 assumptions'.""" + q = _mk_qug(reject=False) + s = _mk_stute(reject=False) + y = _mk_yatchew(reject=False) + verdict = _compose_verdict(q, s, y) + assert verdict == "TWFE safe under Section 4 assumptions" + + def test_only_qug_rejects(self): + """(c) Only QUG rejects -> QUG-only message.""" + q = _mk_qug(reject=True) + s = _mk_stute(reject=False) + y = _mk_yatchew(reject=False) + verdict = _compose_verdict(q, s, y) + assert "support infimum rejected" in verdict + assert "linearity rejected" not in verdict + + def test_only_linearity_rejects(self): + """(d) Only linearity rejects -> linearity-only message.""" + q = _mk_qug(reject=False) + s = _mk_stute(reject=True) + y = 
_mk_yatchew(reject=False) + verdict = _compose_verdict(q, s, y) + assert "linearity rejected" in verdict + assert "Stute" in verdict + assert "Yatchew" not in verdict + assert "support infimum rejected" not in verdict + + def test_all_reject_bundles_reasons(self): + """(e) All reject -> bundled message naming each.""" + q = _mk_qug(reject=True) + s = _mk_stute(reject=True) + y = _mk_yatchew(reject=True) + verdict = _compose_verdict(q, s, y) + assert "support infimum rejected" in verdict + assert "linearity rejected" in verdict + assert "Stute" in verdict + assert "Yatchew" in verdict + assert "; " in verdict # bundled with separator + + +# ============================================================================= +# Top-level import surface +# ============================================================================= + + +def test_phase_3_imports_at_top_level(): + """Commit criterion 19: all Phase 3 names importable from `diff_diff`.""" + import diff_diff + + expected = [ + "HADPretestReport", + "QUGTestResults", + "StuteTestResults", + "YatchewTestResults", + "did_had_pretest_workflow", + "qug_test", + "stute_test", + "yatchew_hr_test", + ] + for name in expected: + assert hasattr(diff_diff, name), f"diff_diff.{name} missing from public API" + assert name in diff_diff.__all__, f"{name} missing from diff_diff.__all__" + + +# ============================================================================= +# summary() smoke +# ============================================================================= + + +class TestSummary: + """summary() / print_summary() produce non-empty strings for all classes.""" + + def test_qug_summary_non_empty(self): + r = qug_test(np.array([0.2, 0.5, 0.9])) + s = r.summary() + assert len(s) > 0 + assert "QUG" in s + + def test_stute_summary_non_empty(self): + d, dy = _linear_dgp(G=50) + r = stute_test(d, dy, n_bootstrap=199, seed=42) + s = r.summary() + assert len(s) > 0 + assert "Stute" in s or "CvM" in s + + def 
test_yatchew_summary_non_empty(self): + d, dy = _linear_dgp(G=50) + r = yatchew_hr_test(d, dy) + s = r.summary() + assert len(s) > 0 + assert "Yatchew" in s + + def test_report_summary_bundles_all(self): + d, dy = _linear_dgp(G=50) + panel = _make_two_period_panel(50, d, dy, seed=42) + report = did_had_pretest_workflow(panel, "y", "d", "time", "unit", n_bootstrap=199, seed=42) + s = report.summary() + assert "QUG" in s + assert "Stute" in s or "CvM" in s + assert "Yatchew" in s + assert "Verdict:" in s From 14953e3c90e582c8bbbd157a6584a37e238933ae Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 22 Apr 2026 13:39:00 -0400 Subject: [PATCH 2/9] Address /ai-review-local P1: workflow verdict must flag Assumption 7 gap AI review found that did_had_pretest_workflow's fail-to-reject verdict ("TWFE safe under Section 4 assumptions") was a methodology overclaim: Phase 3 ships steps 1 + 3 of the paper's four-step workflow but not step 2 (Assumption 7 pre-trends test via Equation 18), so the verdict could mislead users into believing the full paper-prescribed workflow had passed. Fix: - _compose_verdict fail-to-reject branch now returns "QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)" - HADPretestReport docstring adds a prominent partial-workflow caveat on the verdict and all_pass fields. - did_had_pretest_workflow docstring renames intent from "run the paper's workflow" to "run a PARTIAL workflow (steps 1 and 3 only)". - Tests: renamed + updated to assert the Assumption 7 caveat appears in the verdict and that "TWFE safe" does NOT appear. - REGISTRY.md updated to document the partial-workflow semantic and revised verdict string in both the Note block and checkbox row. Also addresses P2 from the review: replaced positional `None` in the _aggregate_first_difference call with keyword argument cluster_col=None plus an inline comment naming the intent. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/had_pretests.py | 61 +++++++++++++++++++++++++++++------- docs/methodology/REGISTRY.md | 4 +-- tests/test_had_pretests.py | 25 +++++++++++---- 3 files changed, 70 insertions(+), 20 deletions(-) diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index d5c21dd9..52424dd6 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -349,20 +349,32 @@ class HADPretestReport: Bundles the three individual tests with an overall verdict string. + .. important:: + This report reflects a **partial** workflow: Phase 3 ships paper + Sections 4.2-4.3 steps 1 (QUG) and 3 (linearity via Stute + + Yatchew-HR), but **NOT** step 2 (Assumption 7 pre-trends test via + Equation 18). Even when ``all_pass`` is ``True``, the paper's + four-step certification for TWFE validity is incomplete — pre- + trends testing is a separate diagnostic that must be run via the + user's own event-study / placebo analysis until the Phase 3 + follow-up patch lands the joint Equation 18 Stute test. + Attributes ---------- qug : QUGTestResults stute : StuteTestResults yatchew : YatchewTestResults all_pass : bool - ``True`` iff none of the three tests rejects. When ``True``, - paper Section 4.2-4.3 states TWFE (``beta_fe``) is valid under - Assumptions 5-8. + ``True`` iff none of the three IMPLEMENTED tests rejects. This is + a PARTIAL indicator: it does not certify Assumption 7 (pre-trends), + which is not tested by Phase 3. verdict : str Human-readable classification. Priority-ordered first-match: 1. Any NaN p-value -> ``"inconclusive - {names} NaN"`` - 2. None reject -> ``"TWFE safe under Section 4 assumptions"`` + 2. None reject -> ``"QUG and linearity diagnostics fail-to-reject; + Assumption 7 pre-trends test NOT run (paper step 2 deferred to + Phase 3 follow-up)"`` 3. 
Otherwise -> bundled string naming each rejected assumption: ``"support infimum rejected - continuous_at_zero design invalid (QUG)"`` and/or ``"linearity rejected - heterogeneity @@ -559,7 +571,11 @@ def _compose_verdict( any_reject = qug.reject or stute.reject or yatchew.reject if not any_reject: - return "TWFE safe under Section 4 assumptions" + return ( + "QUG and linearity diagnostics fail-to-reject; " + "Assumption 7 pre-trends test NOT run " + "(paper step 2 deferred to Phase 3 follow-up)" + ) reasons = [] if qug.reject: @@ -998,14 +1014,28 @@ def did_had_pretest_workflow( n_bootstrap: int = 999, seed: Optional[int] = None, ) -> HADPretestReport: - """Run all three HAD pre-tests on a two-period panel and return a - composite report (paper Section 4.2-4.3). + """Run a PARTIAL HAD pre-test workflow on a two-period panel + (paper Section 4.2-4.3, steps 1 and 3 only; step 2 deferred). + + Phase 3 scope runs: + + - Step 1: :func:`qug_test` (``H_0: d_lower = 0``, Theorem 4). + - Step 3: :func:`stute_test` and :func:`yatchew_hr_test` (linearity + of ``E[ΔY | D_2]``, Assumption 8). + + Phase 3 does **NOT** run step 2 (Assumption 7 pre-trends test via + paper Equation 18); that joint cross-horizon Stute variant is + deferred to a follow-up patch. Users should continue to perform their + own pre-trends / placebo analysis until the follow-up ships. The + returned :class:`HADPretestReport` verdict explicitly flags the + Assumption 7 gap when all implemented diagnostics fail-to-reject, so + callers do not receive a misleading "TWFE safe" signal. The workflow reduces the panel to unit-level first differences using - the Phase 2a validator + aggregator, then calls :func:`qug_test`, - :func:`stute_test`, and :func:`yatchew_hr_test` with shared ``alpha`` - and a single-source seed passthrough (``seed`` is forwarded to - :func:`stute_test` only; the other two are deterministic). 
+ the Phase 2a validator + aggregator, then calls the three tests with + shared ``alpha`` and a single-source seed passthrough (``seed`` is + forwarded to :func:`stute_test` only; QUG and Yatchew are + deterministic). Parameters ---------- @@ -1044,7 +1074,14 @@ def did_had_pretest_workflow( data, outcome_col, dose_col, time_col, unit_col, first_treat_col ) d_arr, dy_arr, _, _ = _aggregate_first_difference( - data, outcome_col, dose_col, time_col, unit_col, t_pre, t_post, None + data, + outcome_col, + dose_col, + time_col, + unit_col, + t_pre, + t_post, + cluster_col=None, # pretests do not use cluster-robust SE ) qug_res = qug_test(d_arr, alpha=alpha) diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 92e0ddd8..1f2f4098 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2340,13 +2340,13 @@ Shipped as `did_had_pretest_workflow()` in Phase 3. The paper's decision rule fo - **Note (Phase 2b per-horizon SE):** Each event-time horizon uses an INDEPENDENT sandwich computed on that horizon's first differences: continuous paths use the CCT-2014 robust SE from Phase 1c divided by `|den|`; mass-point path uses the structural-residual 2SLS sandwich from Phase 2a. This produces pointwise CIs per horizon, matching the paper's Pierce-Schott application (Section 5.2, Figure 2: "nonparametric pointwise CIs"). Joint cross-horizon covariance (IF-based stacking or block bootstrap) is NOT computed — the paper does not derive it and all reported CIs are pointwise. Follow-up PRs may add joint covariance for cross-horizon hypothesis tests; current tracking in `TODO.md`. - **Note (Phase 2b baseline convention):** All event-time horizons use a uniform `F-1` anchor: `ΔY_{g,t} = Y_{g,t} - Y_{g,F-1}` for every `t`. This is consistent with the paper's Garrett-et-al. 
application (Section 5.1: "outcome `Y_{g,t} - Y_{g,2001}`" where `F = 2002`), simplifies event-time indexing (`e = t - F` so `e = -1` is the anchor, skipped), and keeps the implementation symmetric for pre- and post-period horizons. The paper review text's asymmetric "`Y_{g,t} - Y_{g,1}` for pre" / "`Y_{g,t} - Y_{g,F-1}` for post" phrasing is covered by the uniform convention since both give the same placebo interpretation under parallel trends (the paper's own applications use the uniform anchor). - **Note (Phase 2b result class):** `aggregate="event_study"` returns a new `HeterogeneousAdoptionDiDEventStudyResults` dataclass (distinct from the single-period `HeterogeneousAdoptionDiDResults`) with per-horizon arrays (`event_times`, `att`, `se`, `t_stat`, `p_value`, `conf_int_low`, `conf_int_high`, `n_obs_per_horizon`) and shared metadata. `to_dataframe()` returns a tidy per-horizon DataFrame; `to_dict()` returns a dict with list-of-per-horizon fields. The static return-type annotation on `fit()` is `HeterogeneousAdoptionDiDResults` (the common case); callers passing `aggregate="event_study"` should annotate their variable as `HeterogeneousAdoptionDiDEventStudyResults` for type checkers. - - **Note (Phase 3 pretest workflow):** Phase 3 ships the Section 4 pre-test diagnostics in a new module `diff_diff/had_pretests.py` (separate from the 2,800-line `had.py` estimator module). The three tests — `qug_test`, `stute_test`, `yatchew_hr_test` — accept raw `(d, dy)` arrays and return their own result dataclasses (`QUGTestResults`, `StuteTestResults`, `YatchewTestResults`); the composite `did_had_pretest_workflow` accepts a two-period panel and returns `HADPretestReport` with a priority-ordered verdict string. The `data`-only workflow signature (no `result=`-consuming variant) avoids coupling pretests to the `BiasCorrectedFit` internal state, which does not expose residuals. 
Below-minimum sample sizes emit `UserWarning` and return all-NaN inference (no `ValueError` raise) for library consistency with the `safe_inference` convention; input-shape violations (non-1D, NaN-containing, mismatched length) still raise. No multiple-testing adjustment (Bonferroni/Holm) is applied — the paper does not prescribe one. The joint Equation 18 cross-horizon Stute test (step 2 of the paper's four-step workflow) is deferred to a Phase 3 follow-up patch; the single-horizon Stute in this PR is the standard Assumption 8 linearity test (paper Section 4.3, Appendix D). + - **Note (Phase 3 pretest workflow):** Phase 3 ships the Section 4 pre-test diagnostics in a new module `diff_diff/had_pretests.py` (separate from the 2,800-line `had.py` estimator module). The three tests — `qug_test`, `stute_test`, `yatchew_hr_test` — accept raw `(d, dy)` arrays and return their own result dataclasses (`QUGTestResults`, `StuteTestResults`, `YatchewTestResults`); the composite `did_had_pretest_workflow` accepts a two-period panel and returns `HADPretestReport` with a priority-ordered verdict string. The `data`-only workflow signature (no `result=`-consuming variant) avoids coupling pretests to the `BiasCorrectedFit` internal state, which does not expose residuals. Below-minimum sample sizes emit `UserWarning` and return all-NaN inference (no `ValueError` raise) for library consistency with the `safe_inference` convention; input-shape violations (non-1D, NaN-containing, mismatched length) still raise. No multiple-testing adjustment (Bonferroni/Holm) is applied — the paper does not prescribe one. **Partial-workflow semantic:** the composite runs paper steps 1 (QUG) + 3 (linearity via Stute + Yatchew-HR) only; step 2 (Assumption 7 pre-trends test via Equation 18) is deferred to a Phase 3 follow-up patch. 
When all three implemented tests fail-to-reject, the verdict explicitly flags the Assumption 7 gap (`"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`) rather than claiming unconditional TWFE safety, so callers do not receive a misleading certification signal. The `HADPretestReport` docstring reinforces this caveat. Users should continue to perform their own pre-trends / placebo analysis until the follow-up ships the joint Equation 18 Stute variant. - **Note (Phase 3 Stute bootstrap):** The Stute CvM statistic is implemented via the simplified-form `S = (1/G²) · Σ cumsum²`, which is algebraically equivalent to the paper's stated `Σ(g/G)² · ((1/g) Σ eps_{(h)})²` form. Bootstrap follows the paper's Appendix D Algorithm literally: each iteration refits OLS on `dy_b = a_hat + b_hat · d + eps · eta` (null DGP multiplied by Mammen weights). The Appendix-D vectorized matrix form (precomputing `M = I - X(X'X)⁻¹X'` once and applying it to `eps · eta` in each iteration) is functionally identical and ~2× faster; it is deferred as a performance follow-up to keep the Phase 3 reviewer surface small. Bootstrap reproducibility uses `np.random.default_rng(seed)` (library convention, matching `wild_bootstrap_se`); `seed=42` in tests for bitwise stability. - **Note (Phase 3 Yatchew normalizer):** `σ̂²_diff` in `yatchew_hr_test` divides by `2G` (paper-literal from Theorem 7), NOT by `2(G-1)`. A unit test hand-computes `σ̂²_diff` at `G=4` on deterministic inputs and asserts the `2G`-normalizer form at `atol=1e-12` so any later regression to the (asymptotically equivalent but finite-sample different) `2(G-1)` form would fail the test. `σ̂⁴_W` uses `np.mean(eps_{(g)}² · eps_{(g-1)}²)` which divides by `G-1` via `np.mean` length, matching the paper's Theorem 7 / Equation 29 normalization. - [x] Phase 3: `qug_test()` (`T = D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`, rejection `{T > 1/α - 1}`). 
One-sided asymptotic p-value `1 / (1 + T)` under the Exp(1)/Exp(1) limit law (Theorem 4). Zero-dose observations filtered upfront (with `UserWarning`); `D_{(1)} == D_{(2)}` tie returns all-NaN inference (conservative). Closed-form tight parity tested at `atol=1e-12`. - [x] Phase 3: `stute_test()` Cramér-von Mises with Mammen wild bootstrap. Statistic `S = (1/G^2) Σ (cumsum_g)^2` (algebraically equivalent to paper's `Σ(g/G)^2 · ((1/g) Σ eps_{(h)})^2`). Bootstrap follows paper Appendix D Algorithm literal (per-iteration OLS refit). `n_bootstrap=999` default, `n_bootstrap >= 99` validated. `G < 10` returns NaN; `G > 100_000` emits a `UserWarning` pointing to `yatchew_hr_test`. Appendix-D vectorized matrix form deferred as a performance follow-up (tracked in `TODO.md`). - [x] Phase 3: `yatchew_hr_test()` heteroskedasticity-robust linearity test. Test statistic `T_hr = sqrt(G) · (σ̂²_lin - σ̂²_diff) / σ̂²_W` from paper Equation 29. Normalizer `σ̂²_diff` divides by `2G` (paper-literal Theorem 7), NOT `2(G-1)`; hand-computed tight parity asserted at `atol=1e-12`. One-sided standard-normal critical value. `G < 3` returns NaN. -- [x] Phase 3: `did_had_pretest_workflow()` composite helper. Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string: `"TWFE safe under Section 4 assumptions"` when all fail-to-reject, `"inconclusive - {tests} NaN"` when any statistic is NaN, else a bundled string naming each rejected assumption. Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up. +- [x] Phase 3: `did_had_pretest_workflow()` composite helper. 
Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three IMPLEMENTED tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string. Because Phase 3 ships steps 1 + 3 of the paper's four-step workflow but **not** step 2 (Assumption 7 pre-trends test via Equation 18), the fail-to-reject verdict explicitly flags the Assumption 7 gap rather than claiming unconditional TWFE safety: `"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. Other verdict forms: `"inconclusive - {tests} NaN"` when any statistic is NaN; bundled rejection-reason string naming each failed assumption otherwise. Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up. - [ ] Phase 4: Pierce-Schott (2016) replication harness reproduces Figure 2 values. - [ ] Phase 4: Full DGP 1/2/3 coverage-rate reproduction from Table 1. - [ ] Phase 5: `practitioner_next_steps()` integration for HAD results. 
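The QUG decision rule catalogued in the checklist above reduces to three arithmetic steps. A minimal standalone sketch for illustration (the `qug_sketch` helper is hypothetical, not the shipped `qug_test`; it assumes the dose vector has already been zero-filtered and contains no ties):

```python
import numpy as np

def qug_sketch(d, alpha=0.05):
    # Order-statistic ratio test of H_0: d_lower = 0 (paper Theorem 4).
    # T = D_(1) / (D_(2) - D_(1)); asymptotic p-value 1 / (1 + T);
    # reject iff T > 1/alpha - 1 (equivalently, p < alpha).
    d_sorted = np.sort(np.asarray(d, dtype=float))
    d1, d2 = d_sorted[0], d_sorted[1]
    t_stat = d1 / (d2 - d1)
    p_value = 1.0 / (1.0 + t_stat)
    return float(t_stat), float(p_value), bool(t_stat > 1.0 / alpha - 1.0)

# The zero-filter example from the tests, reduced to [0.3, 0.5, 0.9]:
t, p, reject = qug_sketch([0.3, 0.5, 0.9])
# T = 0.3 / 0.2 = 1.5; p = 1 / 2.5 = 0.4; no rejection at alpha = 0.05
```

At `alpha = 0.05` the rejection threshold is `1/0.05 - 1 = 19`, matching the `critical_value=19.0` used in the `_mk_qug` test factory.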
diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py index 0d3acd88..00919845 100644 --- a/tests/test_had_pretests.py +++ b/tests/test_had_pretests.py @@ -289,8 +289,9 @@ def test_invalid_alpha_raises(self): class TestCompositeWorkflow: """Tests for :func:`did_had_pretest_workflow`.""" - def test_all_pass_on_linear_no_pretrend(self): - """Commit criterion 11: linear DGP -> all_pass, 'TWFE safe' verdict.""" + def test_all_pass_on_linear_flags_assumption7_gap(self): + """Commit criterion 11: linear DGP -> all_pass, verdict flags + Assumption 7 pre-trends gap per Phase 3 partial-workflow scope.""" d, dy = _linear_dgp(G=200, beta=2.0, sigma=0.3, seed=42) panel = _make_two_period_panel(200, d, dy, seed=42) report = did_had_pretest_workflow( @@ -304,7 +305,14 @@ def test_all_pass_on_linear_no_pretrend(self): seed=42, ) assert report.all_pass is True - assert report.verdict == "TWFE safe under Section 4 assumptions" + # Partial-workflow verdict: explicitly names the Assumption 7 gap + # so callers do not receive a misleading "TWFE safe" signal. + assert "QUG and linearity diagnostics fail-to-reject" in report.verdict + assert "Assumption 7" in report.verdict + assert "pre-trends" in report.verdict + assert "paper step 2 deferred" in report.verdict + # Verdict must NOT claim unconditional TWFE safety. 
+ assert "TWFE safe" not in report.verdict assert report.n_obs == 200 assert isinstance(report.qug, QUGTestResults) assert isinstance(report.stute, StuteTestResults) @@ -612,13 +620,18 @@ def test_nan_in_any_test_inconclusive(self): assert verdict.startswith("inconclusive") assert "Stute" in verdict - def test_none_reject_twfe_safe(self): - """(b) None reject -> 'TWFE safe under Section 4 assumptions'.""" + def test_none_reject_flags_assumption7_gap(self): + """(b) None reject -> verdict flags the Assumption 7 pre-trends gap + rather than claiming unconditional TWFE safety.""" q = _mk_qug(reject=False) s = _mk_stute(reject=False) y = _mk_yatchew(reject=False) verdict = _compose_verdict(q, s, y) - assert verdict == "TWFE safe under Section 4 assumptions" + assert "QUG and linearity diagnostics fail-to-reject" in verdict + assert "Assumption 7" in verdict + assert "pre-trends" in verdict + assert "paper step 2 deferred" in verdict + assert "TWFE safe" not in verdict def test_only_qug_rejects(self): """(c) Only QUG rejects -> QUG-only message.""" From 9d92485b9696ca9c4df5082ada678787906a1d89 Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 22 Apr 2026 13:40:49 -0400 Subject: [PATCH 3/9] Address /ai-review-local P3s: module docstring + _compose_verdict docstring + string concat MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Re-review came back ✅ after the P1 fix; remaining findings were all P3 documentation/cleanup: - Module docstring now explicitly frames Phase 3 as shipping steps 1 + 3 of the paper's four-step workflow (not "all three fail-to-reject -> TWFE valid", which could read as unconditional certification). - `_compose_verdict` docstring synchronized with the partial-workflow verdict text (previously said "TWFE safe under Section 4 assumptions"; now says "QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run"). 
- Collapsed adjacent-string-literal in `yatchew_hr_test` sigma4_W<=0 warning into a single string. No behavioral changes. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/had_pretests.py | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index 52424dd6..67c41f05 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -1,8 +1,9 @@ """Pre-test diagnostics for the HeterogeneousAdoptionDiD estimator. Paper Section 4 (de Chaisemartin, Ciccia, D'Haultfoeuille, Knau 2026, -arXiv:2405.04465v6) prescribes three pre-test diagnostics that practitioners -should run BEFORE trusting HAD estimates: +arXiv:2405.04465v6) prescribes a four-step pre-testing workflow for TWFE +validity in HADs. Phase 3 ships steps 1 and 3 of that workflow (step 2 is +deferred): 1. :func:`qug_test` - order-statistic ratio test of the support infimum ``H_0: d_lower = 0`` (paper Theorem 4). Closed-form, tuning-free. @@ -13,16 +14,16 @@ linearity test (paper Theorem 7 / Equation 29). Feasible at ``G >= 100k``. -Paper Section 4.2-4.3: if all three fail-to-reject, the TWFE estimator -(``beta_fe``) is valid; rejection guides users to alternative estimators. +The composite :func:`did_had_pretest_workflow` runs the three implemented +tests in sequence on a two-period HAD panel and returns a +:class:`HADPretestReport` with a partial-workflow verdict. When all three +fail-to-reject, the verdict explicitly flags that **the paper's step 2 +pre-trends test (Assumption 7) is NOT run** — callers do not receive an +unconditional "TWFE safe" signal; the Assumption 7 check must be performed +separately (e.g., via an event-study / placebo analysis) until the Phase 3 +follow-up patch lands the joint Equation 18 cross-horizon Stute variant. 
-The composite :func:`did_had_pretest_workflow` runs all three in sequence on -a two-period HAD panel and returns a :class:`HADPretestReport` with a -text verdict. - -Equation 18 (joint cross-horizon Stute over pre-period placebos) is deferred -to a Phase 3 follow-up patch pending formula extraction from the paper PDF; -see ``docs/methodology/REGISTRY.md`` and ``TODO.md``. +See ``docs/methodology/REGISTRY.md`` and ``TODO.md`` for the deferred items. """ from __future__ import annotations @@ -554,8 +555,12 @@ def _compose_verdict( Priority-ordered first-match: 1. Any NaN p-value -> ``"inconclusive - {names} NaN"`` - 2. None reject -> ``"TWFE safe under Section 4 assumptions"`` - 3. Otherwise bundle each rejection reason. + 2. None of the implemented tests reject -> partial-workflow verdict + flagging the Assumption 7 / pre-trends gap (paper step 2 deferred): + ``"QUG and linearity diagnostics fail-to-reject; Assumption 7 + pre-trends test NOT run (paper step 2 deferred to Phase 3 + follow-up)"``. + 3. Otherwise bundle each rejection reason naming the failed assumption. """ nan_tests = [ name @@ -969,7 +974,7 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch sigma4_W = float(np.mean(eps_s[1:] ** 2 * eps_s[:-1] ** 2)) if sigma4_W <= 0.0: warnings.warn( - "yatchew_hr_test: sigma4_W <= 0 (zero residual variance). " "Returning NaN result.", + "yatchew_hr_test: sigma4_W <= 0 (zero residual variance); returning NaN result.", UserWarning, stacklevel=2, ) From 10a10792675f9171e8083335b6bfd9662755d15e Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 22 Apr 2026 15:05:59 -0400 Subject: [PATCH 4/9] Address PR #352 CI review round 1 (3 P1 + 2 P2 + 2 P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P1 #1 — Stute tie-safe CvM: Paper defines c_G(d) = Σ 1{D ≤ d} · eps with c_G(D_g) evaluated AT each observation's dose, so tied observations share the post-tie cumulative sum. 
My naive cumsum over sorted residuals produced partial within-tie sums that were row-order-dependent. Fix: after cumsum, replace within-tie-block values with the block's last cumsum via np.unique + np.repeat. `_cvm_statistic` now accepts `d_sorted` and collapses tie blocks before squaring. Regression test `test_cvm_statistic_tie_safe_order_invariance` pins order-invariance on duplicate doses at atol=1e-14; `test_stute_order_invariance_with_duplicate_doses` validates the end-to-end stute_test contract. P1 #2 — Exact-linear fit must fail-to-reject (not return NaN): For dy = a + b·d exact, Assumption 8 holds exactly and the correct outcome is p=1, reject=False. My previous var(eps)<=0 check routed this to NaN. Fix: dropped var(eps) degeneracy branch from stute_test (the bootstrap naturally produces p=1 when eps=0 exactly). Added a scale-relative short-circuit (sum(eps²) ≤ 1e-24 · sum(dy²)) in both stute_test and yatchew_hr_test so FP noise (eps ~ 1e-16 from IEEE arithmetic on dy = 1 + 2*d) doesn't defeat the short-circuit by producing non-zero but tiny OLS residuals. Yatchew exact-linear now returns (t_stat_hr=-inf, p=1, reject=False) rather than NaN. Regressions: TestStuteTest.test_exact_linear_returns_p1_not_nan, TestYatchewHRTest.test_exact_linear_returns_p1_not_nan. P1 #3 — HADPretestReport.all_pass contract: Previously `all_pass = not (reject or reject or reject)` could be True while `verdict` said "inconclusive - X NaN". Fix: gate all_pass on every constituent p-value being finite AND no test rejecting. Updated docstring. Regression: TestCompositeWorkflow.test_all_pass_false_when_any_test_nan. P2 #1 — QUG negative-dose guard: HAD doses must be non-negative (paper Section 2). The raw qug_test API was silently folding d < 0 rows into the n_excluded_zero counter (filter was `d > 0`). Fix: front-door ValueError on any d < 0. Regression: TestQUGTest.test_negative_dose_raises. P3 #1 — QUG np.partition: REGISTRY claims O(G) via np.partition. Code was using np.sort. 
Switched qug_test to np.partition(d_nz, 1), which guarantees partitioned[0] ≤ partitioned[1] = D_{(2)}, i.e., partitioned[0] = D_{(1)}. Tight closed-form parity at atol=1e-12 still holds. P3 #2 — REGISTRY n_bootstrap default: REGISTRY said "Default n_bootstrap = 499" but code ships 999. Updated REGISTRY to match code and added a note about the n_bootstrap >= 99 front-door validation. Test count: 47 -> 53. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/had_pretests.py | 168 ++++++++++++++++++++++++++++------- docs/methodology/REGISTRY.md | 2 +- tests/test_had_pretests.py | 109 +++++++++++++++++++++-- 3 files changed, 241 insertions(+), 38 deletions(-) diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index 67c41f05..11400ebb 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -61,6 +61,15 @@ _MIN_N_BOOTSTRAP = 99 _STUTE_LARGE_G_THRESHOLD = 100_000 +# Scale-relative tolerance for detecting a numerically exact linear OLS fit, +# i.e., ``sum(eps^2) / sum(dy^2) < _EXACT_LINEAR_RELATIVE_TOL`` means the +# OLS residuals are IEEE-arithmetic noise rather than signal. Under exact +# Assumption 8, residuals are mathematically zero; in practice FP round-off +# leaves eps on the order of machine-epsilon (~1e-16). Squared that is +# ~1e-32. The threshold ~1e-24 leaves ~10^8 accumulated FP operations of +# margin so genuinely-noisy data is never mis-classified. +_EXACT_LINEAR_RELATIVE_TOL = 1e-24 + # ============================================================================= # Result dataclasses @@ -366,9 +375,14 @@ class HADPretestReport: stute : StuteTestResults yatchew : YatchewTestResults all_pass : bool - ``True`` iff none of the three IMPLEMENTED tests rejects. This is - a PARTIAL indicator: it does not certify Assumption 7 (pre-trends), - which is not tested by Phase 3. + ``True`` iff every constituent ``p_value`` is finite AND no test + rejects. 
``False`` whenever any test produced a NaN p-value + (inconclusive) OR any test rejected. This gating prevents + downstream callers from mistaking an "inconclusive" report for a + "pass" by inspecting ``all_pass`` in isolation. Even when + ``all_pass`` is ``True``, the report is a PARTIAL indicator: it + does not certify Assumption 7 (pre-trends), which is not tested + by Phase 3. verdict : str Human-readable classification. Priority-ordered first-match: @@ -522,29 +536,50 @@ def _fit_ols_intercept_slope(d: np.ndarray, dy: np.ndarray) -> "tuple[float, flo return a_hat, b_hat, residuals -def _cvm_statistic(eps_sorted: np.ndarray) -> float: - """Compute the sorted-residual Cramer-von Mises statistic. +def _cvm_statistic(eps_sorted: np.ndarray, d_sorted: np.ndarray) -> float: + """Compute the tie-safe Cramer-von Mises cusum statistic. + + Paper definition (Appendix D): - Equivalent to paper's ``S = Σ (g/G)^2 · ((1/g) Σ_{h=1}^g eps_(h))^2``: + c_G(d) := G^{-1/2} * sum_g 1{D_g <= d} * eps_g + S := (1/G) * sum_g c_G^2(D_g) = (1/G^2) * sum_g (C_g)^2 - (g/G)^2 * (C_g / g)^2 = C_g^2 / G^2 + where ``C_g = sum_{h : D_h <= D_g} eps_h`` is the cumulative residual + sum up to and including ALL observations with dose <= D_g. This + definition is tie-safe: at a tied dose value ``D_g == D_{g+1}``, both + c_G(D_g) and c_G(D_{g+1}) include all tie-block members, so the + cumulative sum used at each tied observation is the cumulative sum + through the END of the tie block. - where ``C_g = Σ_{h=1}^g eps_(h)`` is the cumulative residual sum. - Implementation uses the simplified form for numerical stability. + A naive per-observation ``cumsum`` on sorted residuals violates this + at tie blocks (each tied observation sees a partial within-block + cumulative sum). This implementation collapses each tie block to the + post-tie cumulative sum before squaring, matching the paper definition. 
Parameters ---------- eps_sorted : np.ndarray, shape (G,) - Residuals sorted by the associated regressor ``d``. + Residuals sorted by ``d_sorted``. + d_sorted : np.ndarray, shape (G,) + Regressor values sorted ascending. Must be sorted consistently + with ``eps_sorted``. Returns ------- float - ``S = (1 / G^2) * sum(cumsum(eps_sorted)^2)``. + ``S = (1 / G^2) * sum_g C_g^2``. """ G = eps_sorted.shape[0] cumsum = np.cumsum(eps_sorted) - return float(np.sum(cumsum * cumsum) / (G * G)) + # Tie-safe correction: replace within-tie-block values with the + # cumulative sum at the END of each tie block. np.unique on the + # already-sorted regressor gives per-unique-value counts; the last + # index of each tie block is `cumsum(counts) - 1`, and np.repeat + # expands that back to per-observation. + _, counts = np.unique(d_sorted, return_counts=True) + tie_end_idx = np.cumsum(counts) - 1 + cumsum_tie_safe = np.repeat(cumsum[tie_end_idx], counts) + return float(np.sum(cumsum_tie_safe * cumsum_tie_safe) / (G * G)) def _compose_verdict( @@ -646,6 +681,17 @@ def qug_test(d: np.ndarray, alpha: float = 0.05) -> QUGTestResults: d_arr = _validate_1d_numeric(d, "d") critical_value = 1.0 / alpha - 1.0 + # HAD support restriction: doses must be non-negative (paper Section 2). + # Reject negative doses at the front door rather than silently filtering + # them into the zero-exclusion counter. + if (d_arr < 0).any(): + n_neg = int((d_arr < 0).sum()) + raise ValueError( + f"qug_test: d contains {n_neg} negative value(s); HAD doses " + f"must be non-negative (paper Section 2). Check your dose " + f"column or pre-process before calling qug_test." 
+ ) + mask = d_arr > 0 d_nz = d_arr[mask] n_excluded = int(d_arr.shape[0] - d_nz.shape[0]) @@ -678,9 +724,13 @@ def qug_test(d: np.ndarray, alpha: float = 0.05) -> QUGTestResults: d_order_2=float("nan"), ) - d_sorted = np.sort(d_nz) - D1 = float(d_sorted[0]) - D2 = float(d_sorted[1]) + # Use np.partition for O(G) extraction of the two smallest positive + # doses (faster than full O(G log G) sort). For k=1, np.partition + # guarantees partitioned[0] <= partitioned[1] = D_{(2)} (the 2nd-smallest), + # which implies partitioned[0] = D_{(1)} (the minimum). + partitioned = np.partition(d_nz, 1) + D1 = float(partitioned[0]) + D2 = float(partitioned[1]) if D2 == D1: warnings.warn( @@ -827,13 +877,13 @@ def stute_test( ) a_hat, b_hat, eps = _fit_ols_intercept_slope(d_arr, dy_arr) - # Degenerate case: zero dose variation. Residual variance under the - # linear null is just Var(dy), but the CvM under "no signal" regressor - # carries no information; emit NaN. - if np.var(d_arr) <= 0.0 or np.var(eps) <= 0.0: + # Genuine degeneracy: zero dose variation. The CvM cusum is defined + # against the regressor, and constant d carries no signal to test + # linearity against - emit NaN. + if np.var(d_arr) <= 0.0: warnings.warn( - "stute_test: degenerate inputs (constant d or zero residual " - "variance). Returning NaN result.", + "stute_test: constant d (zero dose variation); the Stute " + "linearity test requires regressor variation. Returning NaN result.", UserWarning, stacklevel=2, ) @@ -847,8 +897,26 @@ def stute_test( seed=seed, ) + # Numerically exact linear fit: Assumption 8 holds to IEEE precision, + # so the Stute CvM statistic is formally 0 and every bootstrap draw is + # also 0. Short-circuit to p=1 to avoid FP-noise-driven bootstrap + # comparisons where cvm_stat and S_b are both at machine-epsilon scale. 
+ eps_norm_sq = float(np.sum(eps * eps)) + dy_norm_sq = float(np.sum(dy_arr * dy_arr)) + if eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * max(dy_norm_sq, 1.0): + return StuteTestResults( + cvm_stat=0.0, + p_value=1.0, + reject=False, + alpha=alpha, + n_bootstrap=int(n_bootstrap), + n_obs=G, + seed=seed, + ) + idx = np.argsort(d_arr, kind="stable") - S = _cvm_statistic(eps[idx]) + d_sorted = d_arr[idx] + S = _cvm_statistic(eps[idx], d_sorted) rng = np.random.default_rng(seed) bootstrap_S = np.empty(n_bootstrap, dtype=np.float64) @@ -857,7 +925,7 @@ def stute_test( eta = _generate_mammen_weights(G, rng) dy_b = fitted + eps * eta _, _, eps_b = _fit_ols_intercept_slope(d_arr, dy_b) - bootstrap_S[b] = _cvm_statistic(eps_b[idx]) + bootstrap_S[b] = _cvm_statistic(eps_b[idx], d_sorted) p_value = float((1.0 + float(np.sum(bootstrap_S >= S))) / (n_bootstrap + 1.0)) reject = p_value <= alpha @@ -959,6 +1027,31 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch _, _, eps = _fit_ols_intercept_slope(d_arr, dy_arr) sigma2_lin = float(np.mean(eps * eps)) + # Numerically exact linear fit: same short-circuit as `stute_test`. + # Assumption 8 holds to IEEE precision; the Yatchew statistic is + # formally -inf (finite-negative numerator over zero denominator), + # which maps to p = 1 under the one-sided standard-normal critical + # value. Short-circuit so FP noise in ``sigma4_W`` cannot produce a + # spuriously large finite ``T_hr``. + eps_norm_sq = float(np.sum(eps * eps)) + dy_norm_sq = float(np.sum(dy_arr * dy_arr)) + if eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * max(dy_norm_sq, 1.0): + # For reporting, compute sigma2_diff on the sorted dy (finite, + # well-defined even in the exact-linear case). 
+ idx_early = np.argsort(d_arr, kind="stable") + sigma2_diff_exact = float(np.sum(np.diff(dy_arr[idx_early]) ** 2) / (2.0 * G)) + return YatchewTestResults( + t_stat_hr=float("-inf"), + p_value=1.0, + reject=False, + alpha=alpha, + critical_value=critical_value, + sigma2_lin=sigma2_lin, + sigma2_diff=sigma2_diff_exact, + sigma2_W=0.0, + n_obs=G, + ) + idx = np.argsort(d_arr, kind="stable") dy_s = dy_arr[idx] eps_s = eps[idx] @@ -973,14 +1066,16 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch # review line 171. sigma4_W = float(np.mean(eps_s[1:] ** 2 * eps_s[:-1] ** 2)) if sigma4_W <= 0.0: - warnings.warn( - "yatchew_hr_test: sigma4_W <= 0 (zero residual variance); returning NaN result.", - UserWarning, - stacklevel=2, - ) + # sigma4_W = 0 implies OLS residuals are zero (or effectively zero), + # i.e. the linear fit reproduces dy exactly. Under H_0 (linearity), + # Assumption 8 holds exactly; the Yatchew statistic is formally -inf + # (finite-negative numerator over zero denominator), which corresponds + # to p-value = 1 (fail-to-reject). Return p = 1, reject = False with + # t_stat_hr = -inf as the formal-limit value. This matches the + # Stute-bootstrap behavior on the same exact-linear input. return YatchewTestResults( - t_stat_hr=float("nan"), - p_value=float("nan"), + t_stat_hr=float("-inf"), + p_value=1.0, reject=False, alpha=alpha, critical_value=critical_value, @@ -1093,7 +1188,18 @@ def did_had_pretest_workflow( stute_res = stute_test(d_arr, dy_arr, alpha=alpha, n_bootstrap=n_bootstrap, seed=seed) yatchew_res = yatchew_hr_test(d_arr, dy_arr, alpha=alpha) - all_pass = not (qug_res.reject or stute_res.reject or yatchew_res.reject) + # `all_pass` must be conclusive: require every constituent p-value to be + # finite AND no test to reject. A NaN p-value means the test is + # inconclusive, not a pass, so `all_pass = False` on any NaN result + # (even when none of the tests set `reject = True`). 
This keeps + # `all_pass` semantically aligned with the verdict string. + finite_all = ( + np.isfinite(qug_res.p_value) + and np.isfinite(stute_res.p_value) + and np.isfinite(yatchew_res.p_value) + ) + any_reject = qug_res.reject or stute_res.reject or yatchew_res.reject + all_pass = bool(finite_all and not any_reject) verdict = _compose_verdict(qug_res, stute_res, yatchew_res) return HADPretestReport( diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 1f2f4098..36e778b1 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2288,7 +2288,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_test()`. Tests whether `E(ΔY | - Repeat B times; the p-value is the fraction of `S*` exceeding `S`. 5. Properties (page 26): asymptotic size, consistency under any fixed alternative, non-trivial local power at rate `G^{-1/2}`. 6. Vectorized implementation (Appendix D): with `L` a `G × G` lower-triangular matrix of ones, `S = (1/G²) · 1ᵀ (L · E)^{∘2}`. Bootstrap uses a `G × G` realization matrix `H` of Mammen weights; memory-bounded at `G ≈ 100,000`. -- **Note:** Default `n_bootstrap = 499` is a diff-diff choice; the paper does not prescribe. +- **Note:** Default `n_bootstrap = 999` is a diff-diff choice; the paper does not prescribe. A front-door `ValueError` is raised for `n_bootstrap < 99` (below which the discretised bootstrap p-value grid `1/(B+1)` is too coarse to be meaningful). *Algorithm variant - Yatchew (1997) heteroskedasticity-robust linearity test (Appendix E, Theorem 7):* Shipped in `diff_diff/had_pretests.py` as `yatchew_hr_test()`. Alternative to Stute when `G` is large or heteroskedasticity is suspected. 
diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py index 00919845..b9ff8d1f 100644 --- a/tests/test_had_pretests.py +++ b/tests/test_had_pretests.py @@ -164,6 +164,13 @@ def test_2d_input_raises(self): with pytest.raises(ValueError, match="must be 1-dimensional"): qug_test(np.array([[0.1, 0.5], [0.3, 0.9]])) + def test_negative_dose_raises(self): + """P2 fix: HAD doses must be non-negative; qug_test rejects negatives + at the front door rather than silently folding them into the + zero-exclusion counter.""" + with pytest.raises(ValueError, match="negative value"): + qug_test(np.array([0.1, -0.3, 0.5, 0.9])) + # ============================================================================= # Stute test @@ -188,15 +195,17 @@ def test_rejects_quadratic_dgp(self): assert r.p_value < 0.05 def test_cvm_statistic_manual_equivalence(self): - """Verify the simplified CvM form matches the paper's stated form. + """Verify the simplified CvM form matches the paper's stated form + in the no-ties case. Paper: S = Σ(g/G)^2 · ((1/g) Σ eps_(h))^2 - Code: S = (1/G^2) Σ (cumsum_g)^2 - These are algebraically identical. + Code (no ties): S = (1/G^2) Σ (cumsum_g)^2 + These are algebraically identical when all dose values are unique. """ from diff_diff.had_pretests import _cvm_statistic eps = np.array([1.0, -0.5, 0.3, -0.2, 0.1]) + d_sorted = np.array([0.1, 0.2, 0.3, 0.4, 0.5]) # all unique -> no ties G = len(eps) # Paper form @@ -206,17 +215,69 @@ def test_cvm_statistic_manual_equivalence(self): c_g = cumsum[g - 1] paper_S += (g / G) ** 2 * (c_g / g) ** 2 - # Simplified form (what the code uses) - code_S = _cvm_statistic(eps) - + code_S = _cvm_statistic(eps, d_sorted) np.testing.assert_allclose(code_S, paper_S, atol=1e-14) + def test_cvm_statistic_tie_safe_order_invariance(self): + """CRITICAL P1 fix: tie-safe CvM is order-invariant within tie blocks. 
+ + Under the paper definition, at a tied dose ``D_g == D_{g+1}``, the + cusum ``c_G`` evaluated at both tied observations includes ALL + tie-block members. So the CvM statistic must not depend on the + within-tie ordering of residuals. A naive per-observation cumsum + (without tie-block collapse) would give different S values when + the within-tie residual order is permuted. + """ + from diff_diff.had_pretests import _cvm_statistic + + # 6 observations with two tie blocks: d = [0.1, 0.1, 0.1, 0.5, 0.5, 0.9] + d_sorted = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.9]) + eps_a = np.array([1.0, -0.5, 0.3, 0.7, -0.2, 0.1]) + # Permute within-tie residual order (positions 0-2 and 3-4 permuted): + eps_b = np.array([0.3, 1.0, -0.5, -0.2, 0.7, 0.1]) + S_a = _cvm_statistic(eps_a, d_sorted) + S_b = _cvm_statistic(eps_b, d_sorted) + np.testing.assert_allclose(S_a, S_b, atol=1e-14) + + def test_stute_order_invariance_with_duplicate_doses(self): + """End-to-end P1 fix: stute_test on duplicate-dose inputs is invariant + to within-tie row ordering (tie-safe CvM contract propagates).""" + G = 40 + # Build d with ties and a matched dy + rng = np.random.default_rng(42) + d_unique = np.array([0.1, 0.3, 0.5, 0.8]) + d = np.repeat(d_unique, G // len(d_unique)) + dy = 2.0 * d + rng.normal(0.0, 0.1, size=G) + # Permute order within each tie block + perm = np.arange(G) + rng_perm = np.random.default_rng(123) + for start in range(0, G, G // len(d_unique)): + block = perm[start : start + G // len(d_unique)] + rng_perm.shuffle(block) + perm[start : start + G // len(d_unique)] = block + r_a = stute_test(d, dy, n_bootstrap=199, seed=42) + r_b = stute_test(d[perm], dy[perm], n_bootstrap=199, seed=42) + # Observed cvm_stat must be bit-identical (tie-safe). 
+ np.testing.assert_allclose(r_a.cvm_stat, r_b.cvm_stat, atol=1e-14) + def test_n_bootstrap_below_99_raises(self): """Commit criterion 8: n_bootstrap < 99 -> ValueError.""" d, dy = _linear_dgp(G=50) with pytest.raises(ValueError, match=r"n_bootstrap must be >= 99"): stute_test(d, dy, n_bootstrap=50) + def test_exact_linear_returns_p1_not_nan(self): + """P1 fix: exact linear fit (eps=0) must fail-to-reject with p=1, + NOT return NaN. Assumption 8 holds exactly in this case.""" + G = 50 + d = np.linspace(0.1, 1.0, G) + dy = 1.0 + 2.0 * d # exact linear, residuals are zero + r = stute_test(d, dy, n_bootstrap=199, seed=42) + assert np.isfinite(r.p_value) + assert r.p_value == 1.0 # all bootstrap S_b tied at 0 with observed S + assert r.reject is False + assert r.cvm_stat == 0.0 + def test_mismatched_lengths_raise(self): with pytest.raises(ValueError, match="same length"): stute_test(np.array([0.1, 0.2, 0.3]), np.array([1.0, 2.0])) @@ -280,6 +341,21 @@ def test_invalid_alpha_raises(self): with pytest.raises(ValueError, match="alpha must satisfy"): yatchew_hr_test(d, dy, alpha=-0.1) + def test_exact_linear_returns_p1_not_nan(self): + """P1 fix: exact linear fit (eps=0) must fail-to-reject with p=1, + NOT NaN. Yatchew statistic is formally -inf (finite-negative numerator + over zero denominator), which corresponds to p=1 under the one-sided + standard-normal critical value.""" + G = 50 + d = np.linspace(0.1, 1.0, G) + dy = 1.0 + 2.0 * d # exact linear, residuals are zero + r = yatchew_hr_test(d, dy) + assert np.isfinite(r.p_value) + assert r.p_value == 1.0 + assert r.reject is False + # t_stat_hr is formally -inf for the exact-linear limit. 
+ assert r.t_stat_hr == float("-inf") + # ============================================================================= # Composite workflow @@ -344,6 +420,27 @@ def test_rejects_on_quadratic_plus_shifted_support(self): assert report.qug.reject is True assert report.stute.reject or report.yatchew.reject + def test_all_pass_false_when_any_test_nan(self): + """P1 fix: all_pass must NOT be True when any constituent test is + NaN. Previously all_pass was computed from `not reject` only, so a + NaN-statistic test (reject=False by convention) incorrectly tripped + all_pass=True while verdict was "inconclusive".""" + G = 40 + # Constant dose triggers QUG tie (D_(1) == D_(2)) -> NaN. + d = np.full(G, 0.5) + dy = np.linspace(0.0, 2.0, G) + panel = _make_two_period_panel(G, d, dy, seed=42) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + report = did_had_pretest_workflow( + panel, "y", "d", "time", "unit", n_bootstrap=199, seed=42 + ) + # At least one test produces NaN, so `all_pass` MUST be False even + # though none of the tests set reject=True. + assert np.isnan(report.qug.p_value) + assert report.all_pass is False + assert report.verdict.startswith("inconclusive") + def test_workflow_seed_controls_stute_only(self): """Workflow seed forwards to Stute only; re-run with same seed is reproducible.""" d, dy = _linear_dgp(G=100, beta=2.0, sigma=0.3, seed=42) From 6cd4805354428d25c826eee6f7831f1d2fba65f1 Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 22 Apr 2026 15:23:58 -0400 Subject: [PATCH 5/9] Address PR #352 CI review round 2 (1 P0 + 1 P1 + 1 P2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P0 — Exact-linear shortcut not translation-invariant: R1's scale-relative shortcut compared sum(eps^2) against sum(dy^2), but sum(dy^2) is inflated by additive shifts in dy (e.g. dy = 1e12 + 2*d + noise gives sum(dy^2) ~ 5e25, and threshold becomes ~50, which is easily tripped by ordinary OLS SSE on unit-scale noise). 
This silently forced p=1.0 on genuinely noisy data. Fix: compare against CENTERED total sum of squares sum((dy - dybar)^2), which equals (1 - R^2) * TSS and is translation-invariant by construction (the mean subtraction absorbs any additive shift). Applied to both stute_test and yatchew_hr_test. Module-level constant docstring updated to explain why centering is required. Regressions: TestStuteTest.test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data, TestYatchewHRTest.test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data. Exact numerical equivalence under dy -> dy + 1e12 is not attainable (FP cancellation at 1e12 scale costs ~4 digits), but the SHORTCUT behavior and decision MUST match — the tests assert both. P1 — Yatchew missing tie-handling guard: R1's tie-safe CvM fixed Stute, but yatchew_hr_test still silently computed adjacent differences on stable-sorted duplicates. Under ties, within-tie row ordering changes sigma2_diff and sigma4_W, so the statistic depends on arbitrary input row order rather than the data. Fix: front-door UserWarning + all-NaN result when `np.unique(d).size < G`. Users with tied doses (mass-point designs, discretised registers) are directed to stute_test, whose tie-safe CvM handles ties correctly. This also subsumes the constant-d case (all tied) without needing a separate check. Regressions: TestYatchewHRTest.test_duplicate_doses_return_nan, TestYatchewHRTest.test_constant_d_returns_nan. P2 — REGISTRY documentation: Added two Note blocks to the HAD section: - **Note (Phase 3 Yatchew tie policy)**: justifies the NaN-on-duplicate policy and directs users to Stute for tied-dose designs. - **Note (Phase 3 exact-linear short-circuit)**: documents the translation-invariance requirement and the centered-TSS scaling. Test count: 53 -> 57. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/had_pretests.py | 57 ++++++++++++++++++++---- docs/methodology/REGISTRY.md | 2 + tests/test_had_pretests.py | 84 ++++++++++++++++++++++++++++++++++++ 3 files changed, 134 insertions(+), 9 deletions(-) diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index 11400ebb..4f864e9b 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -61,13 +61,17 @@ _MIN_N_BOOTSTRAP = 99 _STUTE_LARGE_G_THRESHOLD = 100_000 -# Scale-relative tolerance for detecting a numerically exact linear OLS fit, -# i.e., ``sum(eps^2) / sum(dy^2) < _EXACT_LINEAR_RELATIVE_TOL`` means the -# OLS residuals are IEEE-arithmetic noise rather than signal. Under exact +# Scale-relative tolerance for detecting a numerically exact linear OLS fit. +# The ratio SSR / TSS = sum(eps^2) / sum((dy - dybar)^2) equals 1 - R^2 +# and is TRANSLATION-INVARIANT under additive shifts in dy. Under exact # Assumption 8, residuals are mathematically zero; in practice FP round-off # leaves eps on the order of machine-epsilon (~1e-16). Squared that is # ~1e-32. The threshold ~1e-24 leaves ~10^8 accumulated FP operations of -# margin so genuinely-noisy data is never mis-classified. +# margin so genuinely-noisy data is never mis-classified. We compare +# against centered TSS rather than raw sum(dy^2) because scaling against +# the raw magnitude is NOT translation-invariant: adding a large constant +# to dy would inflate sum(dy^2) and spuriously trip the branch on noisy +# data. _EXACT_LINEAR_RELATIVE_TOL = 1e-24 @@ -901,9 +905,10 @@ def stute_test( # so the Stute CvM statistic is formally 0 and every bootstrap draw is # also 0. Short-circuit to p=1 to avoid FP-noise-driven bootstrap # comparisons where cvm_stat and S_b are both at machine-epsilon scale. + # Comparison is against CENTERED TSS (translation-invariant). 
eps_norm_sq = float(np.sum(eps * eps)) - dy_norm_sq = float(np.sum(dy_arr * dy_arr)) - if eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * max(dy_norm_sq, 1.0): + dy_centered_sq = float(np.sum((dy_arr - dy_arr.mean()) ** 2)) + if eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * max(dy_centered_sq, 1.0): return StuteTestResults( cvm_stat=0.0, p_value=1.0, @@ -1024,6 +1029,39 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch n_obs=G, ) + # Tie / constant-dose front-door guard. Yatchew's difference-based + # variance estimator uses adjacent differences of dy sorted by d; + # under tied doses the within-tie ordering is arbitrary (stable sort + # falls back to input row order), so the statistic becomes + # non-methodological and order-dependent. Reject at the front door + # with a UserWarning + NaN result rather than silently permuting. + # Mass-point designs and other tied-dose panels should use + # `stute_test` instead (its tie-safe CvM handles ties correctly). + n_unique_d = int(np.unique(d_arr).shape[0]) + if n_unique_d < G: + n_dups = G - n_unique_d + warnings.warn( + f"yatchew_hr_test: d contains {n_dups} duplicate value(s) " + f"(only {n_unique_d} distinct dose values out of G={G}); " + f"the difference-based variance estimator is not well-defined " + f"under ties because adjacent-difference statistics depend on " + f"arbitrary within-tie row ordering. Use stute_test() instead " + f"(its tie-safe CvM handles ties correctly). 
Returning NaN result.", + UserWarning, + stacklevel=2, + ) + return YatchewTestResults( + t_stat_hr=float("nan"), + p_value=float("nan"), + reject=False, + alpha=alpha, + critical_value=critical_value, + sigma2_lin=float("nan"), + sigma2_diff=float("nan"), + sigma2_W=float("nan"), + n_obs=G, + ) + _, _, eps = _fit_ols_intercept_slope(d_arr, dy_arr) sigma2_lin = float(np.mean(eps * eps)) @@ -1032,10 +1070,11 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch # formally -inf (finite-negative numerator over zero denominator), # which maps to p = 1 under the one-sided standard-normal critical # value. Short-circuit so FP noise in ``sigma4_W`` cannot produce a - # spuriously large finite ``T_hr``. + # spuriously large finite ``T_hr``. Comparison is against CENTERED + # TSS (translation-invariant). eps_norm_sq = float(np.sum(eps * eps)) - dy_norm_sq = float(np.sum(dy_arr * dy_arr)) - if eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * max(dy_norm_sq, 1.0): + dy_centered_sq = float(np.sum((dy_arr - dy_arr.mean()) ** 2)) + if eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * max(dy_centered_sq, 1.0): # For reporting, compute sigma2_diff on the sorted dy (finite, # well-defined even in the exact-linear case). idx_early = np.argsort(d_arr, kind="stable") diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 36e778b1..cab9a058 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2343,6 +2343,8 @@ Shipped as `did_had_pretest_workflow()` in Phase 3. The paper's decision rule fo - **Note (Phase 3 pretest workflow):** Phase 3 ships the Section 4 pre-test diagnostics in a new module `diff_diff/had_pretests.py` (separate from the 2,800-line `had.py` estimator module). 
The three tests — `qug_test`, `stute_test`, `yatchew_hr_test` — accept raw `(d, dy)` arrays and return their own result dataclasses (`QUGTestResults`, `StuteTestResults`, `YatchewTestResults`); the composite `did_had_pretest_workflow` accepts a two-period panel and returns `HADPretestReport` with a priority-ordered verdict string. The `data`-only workflow signature (no `result=`-consuming variant) avoids coupling pretests to the `BiasCorrectedFit` internal state, which does not expose residuals. Below-minimum sample sizes emit `UserWarning` and return all-NaN inference (no `ValueError` raise) for library consistency with the `safe_inference` convention; input-shape violations (non-1D, NaN-containing, mismatched length) still raise. No multiple-testing adjustment (Bonferroni/Holm) is applied — the paper does not prescribe one. **Partial-workflow semantic:** the composite runs paper steps 1 (QUG) + 3 (linearity via Stute + Yatchew-HR) only; step 2 (Assumption 7 pre-trends test via Equation 18) is deferred to a Phase 3 follow-up patch. When all three implemented tests fail-to-reject, the verdict explicitly flags the Assumption 7 gap (`"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`) rather than claiming unconditional TWFE safety, so callers do not receive a misleading certification signal. The `HADPretestReport` docstring reinforces this caveat. Users should continue to perform their own pre-trends / placebo analysis until the follow-up ships the joint Equation 18 Stute variant. - **Note (Phase 3 Stute bootstrap):** The Stute CvM statistic is implemented via the simplified-form `S = (1/G²) · Σ cumsum²`, which is algebraically equivalent to the paper's stated `Σ(g/G)² · ((1/g) Σ eps_{(h)})²` form. Bootstrap follows the paper's Appendix D Algorithm literally: each iteration refits OLS on `dy_b = a_hat + b_hat · d + eps · eta` (null DGP multiplied by Mammen weights). 
The Appendix-D vectorized matrix form (precomputing `M = I - X(X'X)⁻¹X'` once and applying it to `eps · eta` in each iteration) is functionally identical and ~2× faster; it is deferred as a performance follow-up to keep the Phase 3 reviewer surface small. Bootstrap reproducibility uses `np.random.default_rng(seed)` (library convention, matching `wild_bootstrap_se`); `seed=42` in tests for bitwise stability. - **Note (Phase 3 Yatchew normalizer):** `σ̂²_diff` in `yatchew_hr_test` divides by `2G` (paper-literal from Theorem 7), NOT by `2(G-1)`. A unit test hand-computes `σ̂²_diff` at `G=4` on deterministic inputs and asserts the `2G`-normalizer form at `atol=1e-12` so any later regression to the (asymptotically equivalent but finite-sample different) `2(G-1)` form would fail the test. `σ̂⁴_W` uses `np.mean(eps_{(g)}² · eps_{(g-1)}²)` which divides by `G-1` via `np.mean` length, matching the paper's Theorem 7 / Equation 29 normalization. + - **Note (Phase 3 Yatchew tie policy):** `yatchew_hr_test` REJECTS duplicate dose values with a `UserWarning` + all-NaN result at the front door. The difference-based variance estimator `σ̂²_diff` and the heteroskedasticity-robust scale `σ̂⁴_W` both use adjacent differences (of sorted dy and of adjacent squared residuals, respectively); under tied doses the within-tie row ordering is arbitrary (stable sort falls back to input order), so the statistic becomes non-methodological and order-dependent. Callers with tied doses (mass-point designs, discretised dose registers) should use `stute_test` instead — its tie-safe Cramer-von Mises statistic collapses tie blocks to the post-tie cumulative sum and is provably order-invariant under within-tie permutations. A regression test on `stute_test` asserts bit-identical `cvm_stat` across tied-dose permutations at `atol=1e-14`; a matching regression on `yatchew_hr_test` asserts the NaN+warning behavior on duplicated and constant doses. 
This tie policy is a Phase 3 diff-diff choice (the paper describes Yatchew only as "sort by D" and does not specify a tie rule); an alternative implementation could document a justified tie-resolution convention and back it with permutation tests, but at the cost of a methodology surface the paper does not cover. + - **Note (Phase 3 exact-linear short-circuit):** Both `stute_test` and `yatchew_hr_test` detect numerically-exact linear fits (OLS residuals below machine precision relative to the signal) and short-circuit to `p=1.0, reject=False` without running the bootstrap / computing `T_hr`. The detection compares `sum(eps^2)` against the CENTERED total sum of squares `sum((dy - dybar)^2)` (ratio `<= 1e-24`), which is equivalent to `1 - R^2` and is TRANSLATION-INVARIANT under additive shifts in dy. Comparing against uncentered `sum(dy^2)` would NOT be translation-invariant: adding a large constant to dy inflates `sum(dy^2)` and can spuriously trip the short-circuit on genuinely noisy data. Regression tests pin (a) the exact-linear fail-to-reject behavior for `dy = a + b*d`, (b) translation-invariance under `dy -> dy + 1e12` (the short-circuit must not fire on noisy data regardless of the constant offset). - [x] Phase 3: `qug_test()` (`T = D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`, rejection `{T > 1/α - 1}`). One-sided asymptotic p-value `1 / (1 + T)` under the Exp(1)/Exp(1) limit law (Theorem 4). Zero-dose observations filtered upfront (with `UserWarning`); `D_{(1)} == D_{(2)}` tie returns all-NaN inference (conservative). Closed-form tight parity tested at `atol=1e-12`. - [x] Phase 3: `stute_test()` Cramér-von Mises with Mammen wild bootstrap. Statistic `S = (1/G^2) Σ (cumsum_g)^2` (algebraically equivalent to paper's `Σ(g/G)^2 · ((1/g) Σ eps_{(h)})^2`). Bootstrap follows paper Appendix D Algorithm literal (per-iteration OLS refit). `n_bootstrap=999` default, `n_bootstrap >= 99` validated. 
`G < 10` returns NaN; `G > 100_000` emits a `UserWarning` pointing to `yatchew_hr_test`. Appendix-D vectorized matrix form deferred as a performance follow-up (tracked in `TODO.md`). - [x] Phase 3: `yatchew_hr_test()` heteroskedasticity-robust linearity test. Test statistic `T_hr = sqrt(G) · (σ̂²_lin - σ̂²_diff) / σ̂²_W` from paper Equation 29. Normalizer `σ̂²_diff` divides by `2G` (paper-literal Theorem 7), NOT `2(G-1)`; hand-computed tight parity asserted at `atol=1e-12`. One-sided standard-normal critical value. `G < 3` returns NaN. diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py index b9ff8d1f..dd891851 100644 --- a/tests/test_had_pretests.py +++ b/tests/test_had_pretests.py @@ -278,6 +278,39 @@ def test_exact_linear_returns_p1_not_nan(self): assert r.reject is False assert r.cvm_stat == 0.0 + def test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data(self): + """P0 fix: the exact-linear short-circuit must NOT fire on noisy + data after an additive shift. Previously the short-circuit scaled + against raw ``sum(dy^2)``, which is NOT translation-invariant: + adding a large constant to dy inflated ``sum(dy^2)`` and + spuriously tripped the "exact linear" branch on genuinely noisy + data. Fix scales against ``sum((dy - dybar)^2)`` (centered TSS, + translation-invariant). + + Exact numerical equivalence of the two bootstrap runs is NOT + expected (FP cancellation at 1e12 scale costs ~4 digits of + precision), but the short-circuit behavior and decision MUST + match. + """ + rng = np.random.default_rng(42) + G = 50 + d = rng.uniform(0.0, 1.0, size=G) + dy = 2.0 * d + rng.normal(0.0, 1.0, size=G) + dy_shifted = dy + 1e12 + + r_raw = stute_test(d, dy, n_bootstrap=199, seed=42) + r_shift = stute_test(d, dy_shifted, n_bootstrap=199, seed=42) + + # Neither run may hit the exact-linear short-circuit (both have + # unit-scale noise; neither should register as effectively-zero + # residuals relative to centered TSS). 
+ assert r_raw.cvm_stat > 0.0, "shortcut fired on raw noisy data" + assert r_shift.cvm_stat > 0.0, "shortcut fired on shifted noisy data" + # Decision must be preserved across translation. + assert r_raw.reject == r_shift.reject + # P-values match at moderate tolerance (FP cancellation limits precision). + np.testing.assert_allclose(r_raw.p_value, r_shift.p_value, rtol=5e-3) + def test_mismatched_lengths_raise(self): with pytest.raises(ValueError, match="same length"): stute_test(np.array([0.1, 0.2, 0.3]), np.array([1.0, 2.0])) @@ -356,6 +389,57 @@ def test_exact_linear_returns_p1_not_nan(self): # t_stat_hr is formally -inf for the exact-linear limit. assert r.t_stat_hr == float("-inf") + def test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data(self): + """P0 fix: the Yatchew exact-linear short-circuit must NOT fire + on noisy data after an additive shift (same rationale as Stute + fix: scale-ratio is translation-invariant only against centered + TSS).""" + rng = np.random.default_rng(42) + G = 50 + d = rng.uniform(0.0, 1.0, size=G) + dy = 2.0 * d + rng.normal(0.0, 1.0, size=G) + dy_shifted = dy + 1e12 + + r_raw = yatchew_hr_test(d, dy) + r_shift = yatchew_hr_test(d, dy_shifted) + + # Neither run may hit the short-circuit (t_stat_hr would be -inf). + assert np.isfinite(r_raw.t_stat_hr), "shortcut fired on raw noisy data" + assert np.isfinite(r_shift.t_stat_hr), "shortcut fired on shifted noisy data" + # Decision preserved across translation. + assert r_raw.reject == r_shift.reject + # T-stats match at moderate tolerance (FP cancellation at 1e12 scale). + np.testing.assert_allclose(r_raw.t_stat_hr, r_shift.t_stat_hr, rtol=1e-3) + + def test_duplicate_doses_return_nan(self): + """P1 fix: yatchew_hr_test rejects duplicate doses with UserWarning + + NaN result. 
Adjacent differences depend on within-tie row order, + which is non-methodological; users with tied doses should use + stute_test (tie-safe CvM).""" + d = np.array([0.1, 0.2, 0.2, 0.5, 0.8]) # one duplicate at 0.2 + dy = np.array([1.0, 2.0, 2.5, 3.0, 4.0]) + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + r = yatchew_hr_test(d, dy) + msgs = [str(w.message) for w in caught] + assert any("duplicate" in m for m in msgs) + assert np.isnan(r.t_stat_hr) + assert np.isnan(r.p_value) + assert r.reject is False + + def test_constant_d_returns_nan(self): + """P1 fix: yatchew_hr_test rejects constant d (degenerate case of + duplicate doses: all observations tied).""" + d = np.full(10, 0.5) + dy = np.linspace(0.0, 2.0, 10) + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + r = yatchew_hr_test(d, dy) + msgs = [str(w.message) for w in caught] + assert any("duplicate" in m for m in msgs) + assert np.isnan(r.t_stat_hr) + assert r.reject is False + # ============================================================================= # Composite workflow From de20d51f122c2b271b88b167fa30321c7717f009 Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 22 Apr 2026 15:46:58 -0400 Subject: [PATCH 6/9] Address PR #352 CI review round 3 (1 P0 + 1 P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P0 — Yatchew sigma4_W=0 branch was not sign-aware: R2's safety branch unconditionally returned (t_stat_hr=-inf, p=1, reject=False) whenever sigma4_W <= 0, which is mathematically wrong for non-exact-linear inputs where the adjacent-residual-product sum happens to vanish. Reviewer's counterexample: d=[1,2,3,4,5], dy=[1,0,-2,0,1]. OLS slope=0, residuals=dy itself. sigma2_lin=1.2, sigma2_diff=1.0, sigma4_W=0 (each adjacent residual product has a zero factor). The formal statistic is sqrt(5)*(1.2-1.0)/0 = +inf under the one-sided critical value, so the correct decision is reject=True. 
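The reviewer's arithmetic can be checked directly. A standalone sketch, with illustrative variable names mirroring the paper's quantities rather than the module's internals:

```python
import numpy as np

d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dy = np.array([1.0, 0.0, -2.0, 0.0, 1.0])
G = d.size

# OLS with intercept: for this input the slope and intercept are both
# exactly 0, so the residuals equal dy itself.
slope = np.sum((d - d.mean()) * (dy - dy.mean())) / np.sum((d - d.mean()) ** 2)
intercept = dy.mean() - slope * d.mean()
eps = dy - (intercept + slope * d)

sigma2_lin = np.mean(eps ** 2)                    # (1 + 0 + 4 + 0 + 1) / 5 = 1.2
sigma2_diff = np.sum(np.diff(dy) ** 2) / (2 * G)  # (1 + 4 + 4 + 1) / 10 = 1.0
sigma4_W = np.mean(eps[1:] ** 2 * eps[:-1] ** 2)  # every adjacent product has a zero factor

assert sigma2_lin == 1.2 and sigma2_diff == 1.0 and sigma4_W == 0.0
# Numerator sigma2_lin - sigma2_diff = 0.2 > 0 over a zero denominator:
# the formal statistic sqrt(5) * 0.2 / 0 is +inf, so reject=True is correct.
```

(`np.diff(dy)` suffices for sigma2_diff here because d is already sorted.)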
Previously the branch mapped this to p=1, silently flipping the rejection.

Fix: after the exact-linear short-circuit has already caught the residuals-are-zero case, the sigma4_W=0 branch is now sign-aware on the numerator:

- numerator > 0 (sigma2_lin > sigma2_diff) -> T_hr=+inf, p=0, reject=True
- numerator < 0 -> T_hr=-inf, p=1, reject=False
- numerator = 0 -> T_hr=NaN, p=NaN, reject=False (genuinely indeterminate)

All three branches now emit a UserWarning so users see the degenerate denominator rather than silently consuming an infinite statistic.

Regression: TestYatchewHRTest.test_sigma4_W_zero_with_positive_numerator_rejects pins the reviewer's counterexample: asserts t_stat_hr=+inf, p=0, reject=True (NOT p=1, reject=False), with sigma2_lin=1.2 and sigma2_diff=1.0 preserved for inspection.

P3 — yatchew_hr_test docstring stale on tie policy: R2 added the duplicate-dose NaN guard in code and REGISTRY, but the public docstring still said "stable sort preserves input order among tied d values." Updated "Notes" section to describe: (a) duplicate-dose rejection with redirect to stute_test, (b) the translation-invariant exact-linear short-circuit, and (c) the sign-aware sigma4_W=0 branch.

Test count: 57 -> 58.

Co-Authored-By: Claude Opus 4.7 (1M context)
--- diff_diff/had_pretests.py | 79 +++++++++++++++++++++++++++------- tests/test_had_pretests.py | 37 ++++++++++++++++++ 2 files changed, 103 insertions(+), 13 deletions(-)
diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index 4f864e9b..71db5963 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -986,9 +986,37 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch is undefined; the function emits ``UserWarning`` and returns NaN rather than raising. - Dose ties: stable sort (``np.argsort(d, kind="stable")``) preserves - the input order among tied ``d`` values. The paper does not specify - a tie-break policy; stability suffices. 
+ Dose ties: REJECTED with ``UserWarning`` + all-NaN result. The + difference-based variance estimator ``sigma2_diff`` and the + heteroskedasticity-robust scale ``sigma4_W`` both use adjacent + differences of quantities sorted by ``d``; under tied doses the + within-tie row ordering is arbitrary (stable sort falls back to input + order) so the statistic becomes order-dependent rather than + data-dependent. Callers with tied doses (mass-point designs, + discretised dose registers) should use :func:`stute_test` instead - + its tie-safe Cramer-von Mises statistic collapses tie blocks to the + post-tie cumulative sum and is provably order-invariant under + within-tie permutations. + + Exact-linear short-circuit: when the OLS residual sum-of-squares is + below IEEE precision relative to the centered total sum of squares + (``sum(eps^2) <= 1e-24 * sum((dy - dybar)^2)``, i.e. essentially + ``1 - R^2 == 0``), the test short-circuits to ``t_stat_hr=-inf, + p_value=1.0, reject=False`` - Assumption 8 holds exactly, the formal + statistic is ``-inf`` under the one-sided critical value, and the + correct decision is fail-to-reject. This shortcut is translation- + invariant because the comparison is against centered TSS (not raw + ``sum(dy^2)``). + + Degenerate ``sigma4_W = 0`` with non-zero residuals: when the + adjacent-residual-product sum vanishes AFTER the exact-linear + shortcut is bypassed (e.g. residuals alternate zero/non-zero after + sorting), the formal statistic is ``+inf`` or ``-inf`` depending on + the sign of the numerator ``sigma2_lin - sigma2_diff``. The function + returns the sign-aware limit (``p=0, reject=True`` for positive + numerator; ``p=1, reject=False`` for negative; ``NaN`` for zero) + with a ``UserWarning``, rather than unconditionally mapping this to + ``p=1`` (which would flip a legitimate rejection). References ---------- @@ -1105,17 +1133,42 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch # review line 171. 
sigma4_W = float(np.mean(eps_s[1:] ** 2 * eps_s[:-1] ** 2)) if sigma4_W <= 0.0: - # sigma4_W = 0 implies OLS residuals are zero (or effectively zero), - # i.e. the linear fit reproduces dy exactly. Under H_0 (linearity), - # Assumption 8 holds exactly; the Yatchew statistic is formally -inf - # (finite-negative numerator over zero denominator), which corresponds - # to p-value = 1 (fail-to-reject). Return p = 1, reject = False with - # t_stat_hr = -inf as the formal-limit value. This matches the - # Stute-bootstrap behavior on the same exact-linear input. + # sigma4_W = 0 AFTER the exact-linear short-circuit means OLS + # residuals are NOT zero (the shortcut already caught that case) + # but every adjacent pair of sorted squared residuals contains a + # zero (e.g. residuals alternate zero / nonzero after sort). + # The formal test statistic is ±inf depending on the sign of the + # numerator ``sigma2_lin - sigma2_diff``; mapping every such case + # to p=1 (as an earlier revision did) can flip a legitimate + # rejection into a fail-to-reject. + warnings.warn( + f"yatchew_hr_test: sigma4_W = 0 with non-zero residuals " + f"(sigma2_lin = {sigma2_lin:.6g}, sigma2_diff = {sigma2_diff:.6g}); " + f"the formal test statistic is infinite. Returning the " + f"sign-aware limit decision.", + UserWarning, + stacklevel=2, + ) + numerator = sigma2_lin - sigma2_diff + if numerator > 0.0: + # T_hr -> +inf: reject (far into right tail). + t_stat_hr_val = float("inf") + p_value_val = 0.0 + reject_val = True + elif numerator < 0.0: + # T_hr -> -inf: fail-to-reject. + t_stat_hr_val = float("-inf") + p_value_val = 1.0 + reject_val = False + else: + # 0/0: genuinely indeterminate. 
+ t_stat_hr_val = float("nan") + p_value_val = float("nan") + reject_val = False return YatchewTestResults( - t_stat_hr=float("-inf"), - p_value=1.0, - reject=False, + t_stat_hr=t_stat_hr_val, + p_value=p_value_val, + reject=reject_val, alpha=alpha, critical_value=critical_value, sigma2_lin=sigma2_lin, diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py index dd891851..d8f6afdb 100644 --- a/tests/test_had_pretests.py +++ b/tests/test_had_pretests.py @@ -440,6 +440,43 @@ def test_constant_d_returns_nan(self): assert np.isnan(r.t_stat_hr) assert r.reject is False + def test_sigma4_W_zero_with_positive_numerator_rejects(self): + """P0 fix: when sigma4_W = 0 AFTER the exact-linear shortcut + (i.e., residuals are NOT all zero but adjacent-residual-product + sums vanish), the Yatchew statistic is formally +inf or -inf + depending on numerator sign. A unit-dose, non-exact-linear + input where residuals alternate zero/non-zero must NOT be + silently mapped to p=1 (fail-to-reject). + + Counterexample from CI R3 reviewer: d=[1,2,3,4,5], + dy=[1,0,-2,0,1]. OLS slope = 0, residuals = dy itself, + sigma2_lin = 1.2, sigma2_diff = 1.0, sigma4_W = 0 (each + adjacent residual product includes a zero factor), and the + formal statistic is sqrt(5) * 0.2 / 0 = +inf -> reject=True. + Previously the sigma4_W <= 0 branch unconditionally returned + p=1, which flipped a legitimate rejection. + """ + d = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) + dy = np.array([1.0, 0.0, -2.0, 0.0, 1.0]) + with warnings.catch_warnings(record=True) as caught: + warnings.simplefilter("always") + r = yatchew_hr_test(d, dy, alpha=0.05) + msgs = [str(w.message) for w in caught] + # Must NOT silently fail-to-reject. + assert not (r.p_value == 1.0 and r.reject is False), ( + "sigma4_W=0 with positive numerator was silently mapped to " + "p=1, reject=False; expected +inf reject path." + ) + # Positive numerator -> +inf statistic, p=0, reject=True. 
+ assert r.t_stat_hr == float("inf") + assert r.p_value == 0.0 + assert r.reject is True + # The caller should have been warned. + assert any("sigma4_W = 0" in m for m in msgs) + # Sanity: sigma2_lin and sigma2_diff preserved for inspection. + np.testing.assert_allclose(r.sigma2_lin, 1.2, atol=1e-12) + np.testing.assert_allclose(r.sigma2_diff, 1.0, atol=1e-12) + # ============================================================================= # Composite workflow

From 85a43f4b3d5c9c686cd6155fc35dba732f51edbc Mon Sep 17 00:00:00 2001
From: igerber
Date: Wed, 22 Apr 2026 16:27:42 -0400
Subject: [PATCH 7/9] Address PR #352 CI review round 4 (1 P3 + 1 P2)
MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit

R4 verdict was "Looks good" overall — no P0/P1 blockers remain. Fixes the P3 + P2 proactively to avoid tied-dose workflow confusion.

R4 P3 / P2 — Workflow must follow paper's "Stute OR Yatchew" step 3: R2's fix made yatchew_hr_test return NaN on duplicate doses (per paper methodology). But the composite workflow then forced "inconclusive" on any NaN child result, so common QUG-style panels with repeated d=0 (reviewer cites an application with 12 zero-dose units) were too conservatively classified. Per paper Section 4.2-4.3 step 3, linearity can be tested with Stute OR Yatchew; a conclusive Stute alone suffices.

Fix:

- _compose_verdict now treats "QUG NaN" and "BOTH Stute + Yatchew NaN" as the only inconclusive cases. A conclusive Stute + NaN Yatchew (or vice versa) proceeds to the fail-to-reject / rejection logic with a "(Yatchew NaN - skipped)" suffix on the fail-to-reject verdict.
- all_pass gating: requires QUG conclusive AND at least ONE of Stute / Yatchew conclusive, plus no conclusive test rejects. Follows the paper's "OR" wording.
- HADPretestReport docstring updated to reflect the paper-faithful all_pass semantic.
- REGISTRY Yatchew tie-policy Note extended with a paragraph describing the workflow fallback.
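The "Stute OR Yatchew" gating described above reduces to three boolean checks. A minimal sketch with assumed parameter names (scalar p-values and reject flags, not the module's result dataclasses):

```python
import math

def all_pass(qug_p, qug_rej, stute_p, stute_rej, yatchew_p, yatchew_rej):
    # Step 1 (QUG) must be conclusive; step 3 accepts EITHER linearity
    # test; no conclusive test may reject. NaN-p tests carry
    # reject=False by convention, so a plain OR over the reject flags
    # counts only the conclusive rejections.
    qug_ok = math.isfinite(qug_p)
    linearity_ok = math.isfinite(stute_p) or math.isfinite(yatchew_p)
    any_reject = qug_rej or stute_rej or yatchew_rej
    return qug_ok and linearity_ok and not any_reject

nan = float("nan")
# Tied-dose panel: Yatchew NaN but Stute conclusive -> still a pass.
assert all_pass(0.4, False, 0.6, False, nan, False)
# Both linearity tests NaN -> inconclusive, not a pass.
assert not all_pass(0.4, False, nan, False, nan, False)
# A conclusive rejection anywhere blocks the pass.
assert not all_pass(0.01, True, 0.6, False, 0.5, False)
```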
Tests:

- Split TestComposeVerdictLogic.test_nan_in_any_test_inconclusive into test_qug_nan_is_inconclusive + test_both_linearity_nan_is_inconclusive, plus two new tests covering "Yatchew NaN alone does not force inconclusive" and "Stute NaN alone does not force inconclusive".
- New TestCompositeWorkflow.test_workflow_handles_tied_zero_doses_via_stute_fallback regression: 60-unit panel with 20 never-treated (d=0 ties) plus 40 treated with continuous doses. Yatchew NaN, Stute + QUG conclusive, composite verdict must be conclusive (not "inconclusive") with a "Yatchew NaN - skipped" note and all_pass=True.

Test count: 58 -> 62.

Co-Authored-By: Claude Opus 4.7 (1M context)
--- diff_diff/had_pretests.py | 110 +++++++++++++++++++++-------------- docs/methodology/REGISTRY.md | 2 +- tests/test_had_pretests.py | 79 +++++++++++++++++++++++-- 3 files changed, 141 insertions(+), 50 deletions(-)
diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index 71db5963..92521438 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -379,14 +379,15 @@ class HADPretestReport: stute : StuteTestResults yatchew : YatchewTestResults all_pass : bool - ``True`` iff every constituent ``p_value`` is finite AND no test - rejects. ``False`` whenever any test produced a NaN p-value - (inconclusive) OR any test rejected. This gating prevents - downstream callers from mistaking an "inconclusive" report for a - "pass" by inspecting ``all_pass`` in isolation. Even when - ``all_pass`` is ``True``, the report is a PARTIAL indicator: it - does not certify Assumption 7 (pre-trends), which is not tested - by Phase 3. + ``True`` iff (a) QUG is conclusive (step 1), (b) at least ONE of + Stute / Yatchew is conclusive (step 3 - paper's "Stute or + Yatchew" wording), AND (c) no conclusive test rejects. This + gating follows the paper's four-step workflow exactly: step 3 + accepts either linearity test, so a conclusive Stute is + sufficient even when Yatchew returns NaN (e.g. 
tied-dose + panels). Even when ``all_pass`` is ``True``, the report is a + PARTIAL indicator: it does not certify Assumption 7 (pre-trends), + which is not tested by Phase 3. verdict : str Human-readable classification. Priority-ordered first-match: @@ -591,41 +592,57 @@ def _compose_verdict( ) -> str: """Build the :class:`HADPretestReport` verdict string. + Paper Section 4.2-4.3 specifies a four-step workflow; Phase 3 ships + step 1 (QUG) and step 3 (linearity, via ``stute_test`` OR + ``yatchew_hr_test``). The linearity step accepts either test, so a + conclusive Stute result alone suffices even when Yatchew is NaN + (e.g. tied doses, which Yatchew rejects by contract). + Priority-ordered first-match: - 1. Any NaN p-value -> ``"inconclusive - {names} NaN"`` - 2. None of the implemented tests reject -> partial-workflow verdict - flagging the Assumption 7 / pre-trends gap (paper step 2 deferred): - ``"QUG and linearity diagnostics fail-to-reject; Assumption 7 - pre-trends test NOT run (paper step 2 deferred to Phase 3 - follow-up)"``. - 3. Otherwise bundle each rejection reason naming the failed assumption. + 1. QUG NaN -> ``"inconclusive - QUG NaN"`` (step 1 required). + 2. BOTH Stute AND Yatchew NaN -> ``"inconclusive - both Stute and + Yatchew linearity tests NaN"`` (step 3 requires at least one). + 3. Otherwise, count rejections from CONCLUSIVE tests only: + 3a. None of the conclusive tests reject -> partial-workflow + fail-to-reject verdict flagging the Assumption 7 gap, plus + a ``" (Yatchew NaN - skipped)"`` suffix when applicable. + 3b. At least one conclusive test rejects -> bundle each + rejection reason naming the failed assumption. 
""" - nan_tests = [ - name - for name, res in ( - ("QUG", qug), - ("Stute", stute), - ("Yatchew", yatchew), - ) - if not np.isfinite(res.p_value) - ] - if nan_tests: - return f"inconclusive - {', '.join(nan_tests)} NaN" - - any_reject = qug.reject or stute.reject or yatchew.reject - if not any_reject: + qug_ok = bool(np.isfinite(qug.p_value)) + stute_ok = bool(np.isfinite(stute.p_value)) + yatchew_ok = bool(np.isfinite(yatchew.p_value)) + + if not qug_ok: + return "inconclusive - QUG NaN" + if not stute_ok and not yatchew_ok: + return "inconclusive - both Stute and Yatchew linearity tests NaN" + + qug_rej = qug.reject + stute_rej = bool(stute_ok and stute.reject) + yatchew_rej = bool(yatchew_ok and yatchew.reject) + + if not (qug_rej or stute_rej or yatchew_rej): + skipped = [] + if not stute_ok: + skipped.append("Stute NaN") + if not yatchew_ok: + skipped.append("Yatchew NaN") + skip_note = f" ({'; '.join(skipped)} - skipped)" if skipped else "" return ( - "QUG and linearity diagnostics fail-to-reject; " - "Assumption 7 pre-trends test NOT run " + "QUG and linearity diagnostics fail-to-reject" + f"{skip_note}; Assumption 7 pre-trends test NOT run " "(paper step 2 deferred to Phase 3 follow-up)" ) reasons = [] - if qug.reject: + if qug_rej: reasons.append("support infimum rejected - continuous_at_zero design invalid (QUG)") - if stute.reject or yatchew.reject: - which = ",".join(name for name, r in (("Stute", stute), ("Yatchew", yatchew)) if r.reject) + if stute_rej or yatchew_rej: + which = ",".join( + name for name, rejected in (("Stute", stute_rej), ("Yatchew", yatchew_rej)) if rejected + ) reasons.append(f"linearity rejected - heterogeneity bias ({which})") return "; ".join(reasons) @@ -1280,18 +1297,21 @@ def did_had_pretest_workflow( stute_res = stute_test(d_arr, dy_arr, alpha=alpha, n_bootstrap=n_bootstrap, seed=seed) yatchew_res = yatchew_hr_test(d_arr, dy_arr, alpha=alpha) - # `all_pass` must be conclusive: require every constituent p-value to be - # 
finite AND no test to reject. A NaN p-value means the test is - # inconclusive, not a pass, so `all_pass = False` on any NaN result - # (even when none of the tests set `reject = True`). This keeps - # `all_pass` semantically aligned with the verdict string. - finite_all = ( - np.isfinite(qug_res.p_value) - and np.isfinite(stute_res.p_value) - and np.isfinite(yatchew_res.p_value) - ) + # `all_pass` must be conclusive under the paper's four-step workflow + # (step 1 QUG + step 3 linearity via Stute OR Yatchew): + # - QUG must produce a finite p-value (step 1 is required). + # - At least ONE of Stute / Yatchew must produce a finite p-value + # (step 3 accepts either; the paper's wording is "Stute OR + # Yatchew"). This accommodates common QUG-style panels with + # repeated d=0 units, where Yatchew's duplicate-dose guard trips + # but Stute's tie-safe CvM still produces a conclusive result. + # - No conclusive test may reject. NaN-p tests have reject=False by + # convention, so the OR across `.reject` naturally counts only + # the conclusive rejections. + qug_conclusive = bool(np.isfinite(qug_res.p_value)) + linearity_conclusive = bool(np.isfinite(stute_res.p_value) or np.isfinite(yatchew_res.p_value)) any_reject = qug_res.reject or stute_res.reject or yatchew_res.reject - all_pass = bool(finite_all and not any_reject) + all_pass = bool(qug_conclusive and linearity_conclusive and not any_reject) verdict = _compose_verdict(qug_res, stute_res, yatchew_res) return HADPretestReport( diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index cab9a058..2abd188b 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2343,7 +2343,7 @@ Shipped as `did_had_pretest_workflow()` in Phase 3. The paper's decision rule fo - **Note (Phase 3 pretest workflow):** Phase 3 ships the Section 4 pre-test diagnostics in a new module `diff_diff/had_pretests.py` (separate from the 2,800-line `had.py` estimator module). 
The three tests — `qug_test`, `stute_test`, `yatchew_hr_test` — accept raw `(d, dy)` arrays and return their own result dataclasses (`QUGTestResults`, `StuteTestResults`, `YatchewTestResults`); the composite `did_had_pretest_workflow` accepts a two-period panel and returns `HADPretestReport` with a priority-ordered verdict string. The `data`-only workflow signature (no `result=`-consuming variant) avoids coupling pretests to the `BiasCorrectedFit` internal state, which does not expose residuals. Below-minimum sample sizes emit `UserWarning` and return all-NaN inference (no `ValueError` is raised) for library consistency with the `safe_inference` convention; input-shape violations (non-1D, NaN-containing, mismatched length) still raise. No multiple-testing adjustment (Bonferroni/Holm) is applied — the paper does not prescribe one. **Partial-workflow semantics:** the composite runs paper steps 1 (QUG) + 3 (linearity via Stute + Yatchew-HR) only; step 2 (Assumption 7 pre-trends test via Equation 18) is deferred to a Phase 3 follow-up patch. When all three implemented tests fail-to-reject, the verdict explicitly flags the Assumption 7 gap (`"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`) rather than claiming unconditional TWFE safety, so callers do not receive a misleading certification signal. The `HADPretestReport` docstring reinforces this caveat. Users should continue to perform their own pre-trends / placebo analysis until the follow-up ships the joint Equation 18 Stute variant. - **Note (Phase 3 Stute bootstrap):** The Stute CvM statistic is implemented via the simplified form `S = (1/G²) · Σ cumsum²`, which is algebraically equivalent to the paper's stated `Σ(g/G)² · ((1/g) Σ eps_{(h)})²` form. Bootstrap follows the paper's Appendix D Algorithm literally: each iteration refits OLS on `dy_b = a_hat + b_hat · d + eps · eta` (null DGP multiplied by Mammen weights).
The Appendix-D vectorized matrix form (precomputing `M = I - X(X'X)⁻¹X'` once and applying it to `eps · eta` in each iteration) is functionally identical and ~2× faster; it is deferred as a performance follow-up to keep the Phase 3 reviewer surface small. Bootstrap reproducibility uses `np.random.default_rng(seed)` (library convention, matching `wild_bootstrap_se`); `seed=42` in tests for bitwise stability. - **Note (Phase 3 Yatchew normalizer):** `σ̂²_diff` in `yatchew_hr_test` divides by `2G` (paper-literal from Theorem 7), NOT by `2(G-1)`. A unit test hand-computes `σ̂²_diff` at `G=4` on deterministic inputs and asserts the `2G`-normalizer form at `atol=1e-12` so any later regression to the (asymptotically equivalent but finite-sample different) `2(G-1)` form would fail the test. `σ̂⁴_W` uses `np.mean(eps_{(g)}² · eps_{(g-1)}²)`, which divides by the `G-1` adjacent products because `np.mean` normalizes by the array length, matching the paper's Theorem 7 / Equation 29 normalization. - - **Note (Phase 3 Yatchew tie policy):** `yatchew_hr_test` REJECTS duplicate dose values with a `UserWarning` + all-NaN result at the front door. The difference-based variance estimator `σ̂²_diff` and the heteroskedasticity-robust scale `σ̂⁴_W` both use adjacent differences (of sorted dy and of adjacent squared residuals, respectively); under tied doses the within-tie row ordering is arbitrary (stable sort falls back to input order), so the statistic becomes non-methodological and order-dependent. Callers with tied doses (mass-point designs, discretised dose registers) should use `stute_test` instead — its tie-safe Cramer-von Mises statistic collapses tie blocks to the post-tie cumulative sum and is provably order-invariant under within-tie permutations. A regression test on `stute_test` asserts bit-identical `cvm_stat` across tied-dose permutations at `atol=1e-14`; a matching regression on `yatchew_hr_test` asserts the NaN+warning behavior on duplicated and constant doses.
This tie policy is a Phase 3 diff-diff choice (the paper describes Yatchew only as "sort by D" and does not specify a tie rule); an alternative implementation could document a justified tie-resolution convention and back it with permutation tests, but at the cost of a methodology surface the paper does not cover. + - **Note (Phase 3 Yatchew tie policy):** `yatchew_hr_test` REJECTS duplicate dose values with a `UserWarning` + all-NaN result at the front door. The difference-based variance estimator `σ̂²_diff` and the heteroskedasticity-robust scale `σ̂⁴_W` both use adjacent differences (of sorted dy and of adjacent squared residuals, respectively); under tied doses the within-tie row ordering is arbitrary (stable sort falls back to input order), so the statistic becomes non-methodological and order-dependent. Callers with tied doses (mass-point designs, discretised dose registers) should use `stute_test` instead — its tie-safe Cramer-von Mises statistic collapses tie blocks to the post-tie cumulative sum and is provably order-invariant under within-tie permutations. A regression test on `stute_test` asserts bit-identical `cvm_stat` across tied-dose permutations at `atol=1e-14`; a matching regression on `yatchew_hr_test` asserts the NaN+warning behavior on duplicated and constant doses. This tie policy is a Phase 3 diff-diff choice (the paper describes Yatchew only as "sort by D" and does not specify a tie rule); an alternative implementation could document a justified tie-resolution convention and back it with permutation tests, but at the cost of a methodology surface the paper does not cover. **Workflow fallback:** `did_had_pretest_workflow` follows the paper's "Stute OR Yatchew" step-3 wording — a conclusive Stute (which is tie-safe) is sufficient to adjudicate linearity even when Yatchew returns NaN on tied-dose panels. The composite verdict then reads "QUG and linearity diagnostics fail-to-reject (Yatchew NaN - skipped); ..." 
rather than forcing the whole report to "inconclusive", and `all_pass` is `True` when QUG + at least one conclusive linearity test fail-to-reject. - **Note (Phase 3 exact-linear short-circuit):** Both `stute_test` and `yatchew_hr_test` detect numerically-exact linear fits (OLS residuals below machine precision relative to the signal) and short-circuit to `p=1.0, reject=False` without running the bootstrap / computing `T_hr`. The detection compares `sum(eps^2)` against the CENTERED total sum of squares `sum((dy - dybar)^2)` (ratio `<= 1e-24`), which is equivalent to `1 - R^2` and is TRANSLATION-INVARIANT under additive shifts in dy. Comparing against uncentered `sum(dy^2)` would NOT be translation-invariant: adding a large constant to dy inflates `sum(dy^2)` and can spuriously trip the short-circuit on genuinely noisy data. Regression tests pin (a) the exact-linear fail-to-reject behavior for `dy = a + b*d`, (b) translation-invariance under `dy -> dy + 1e12` (the short-circuit must not fire on noisy data regardless of the constant offset). - [x] Phase 3: `qug_test()` (`T = D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`, rejection `{T > 1/α - 1}`). One-sided asymptotic p-value `1 / (1 + T)` under the Exp(1)/Exp(1) limit law (Theorem 4). Zero-dose observations filtered upfront (with `UserWarning`); `D_{(1)} == D_{(2)}` tie returns all-NaN inference (conservative). Closed-form tight parity tested at `atol=1e-12`. - [x] Phase 3: `stute_test()` Cramér-von Mises with Mammen wild bootstrap. Statistic `S = (1/G^2) Σ (cumsum_g)^2` (algebraically equivalent to paper's `Σ(g/G)^2 · ((1/g) Σ eps_{(h)})^2`). Bootstrap follows paper Appendix D Algorithm literal (per-iteration OLS refit). `n_bootstrap=999` default, `n_bootstrap >= 99` validated. `G < 10` returns NaN; `G > 100_000` emits a `UserWarning` pointing to `yatchew_hr_test`. Appendix-D vectorized matrix form deferred as a performance follow-up (tracked in `TODO.md`). 
diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py index d8f6afdb..7400d92f 100644 --- a/tests/test_had_pretests.py +++ b/tests/test_had_pretests.py @@ -541,6 +541,44 @@ def test_rejects_on_quadratic_plus_shifted_support(self): assert report.qug.reject is True assert report.stute.reject or report.yatchew.reject + def test_workflow_handles_tied_zero_doses_via_stute_fallback(self): + """R4 P2 fix: on a common QUG-style panel with repeated d=0 + (never-treated comparison group), Yatchew returns NaN (duplicate + doses) but Stute handles ties correctly. The composite workflow + must adjudicate the linearity step via Stute ALONE per the paper's + "Stute or Yatchew" step-3 wording, producing a conclusive verdict + with a Yatchew-skipped note rather than forcing the whole report + to inconclusive. + + Paper review line 298 cites an application with 12 zero-dose + units, so this panel shape is not exotic. + """ + rng = np.random.default_rng(42) + G = 60 + # 20 never-treated (d=0 at post), 40 treated with random positive. + d = np.zeros(G) + d[20:] = rng.uniform(0.1, 1.0, size=40) + dy = 2.0 * d + rng.normal(0.0, 0.3, size=G) + panel = _make_two_period_panel(G, d, dy, seed=42) + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + report = did_had_pretest_workflow( + panel, "y", "d", "time", "unit", n_bootstrap=199, seed=42 + ) + # Yatchew must NaN (ties from the 20 zero doses). + assert np.isnan(report.yatchew.p_value) + assert report.yatchew.reject is False + # QUG and Stute must be conclusive. + assert np.isfinite(report.qug.p_value) + assert np.isfinite(report.stute.p_value) + # Composite must be conclusive (NOT inconclusive). + assert "inconclusive" not in report.verdict + # Verdict should note Yatchew was skipped. + assert "Yatchew NaN - skipped" in report.verdict + # With a linear DGP, all_pass should be True despite Yatchew NaN + # (step 3 handled by Stute alone, per paper's "Stute OR Yatchew"). 
+ assert report.all_pass is True + def test_all_pass_false_when_any_test_nan(self): """P1 fix: all_pass must NOT be True when any constituent test is NaN. Previously all_pass was computed from `not reject` only, so a @@ -829,14 +867,47 @@ def _mk_yatchew(reject: bool, p: float = 0.5) -> YatchewTestResults: class TestComposeVerdictLogic: """Commit criterion 15: 5 priority-order tests.""" - def test_nan_in_any_test_inconclusive(self): - """(a) Any NaN p-value -> inconclusive.""" + def test_qug_nan_is_inconclusive(self): + """(a1) QUG NaN forces inconclusive (step 1 is required).""" + q = _mk_qug(reject=False, p=float("nan")) + s = _mk_stute(reject=False) + y = _mk_yatchew(reject=False) + verdict = _compose_verdict(q, s, y) + assert verdict == "inconclusive - QUG NaN" + + def test_both_linearity_nan_is_inconclusive(self): + """(a2) When BOTH Stute and Yatchew are NaN, the linearity step + cannot be adjudicated and the verdict is inconclusive.""" + q = _mk_qug(reject=False) + s = _mk_stute(reject=False, p=float("nan")) + y = _mk_yatchew(reject=False, p=float("nan")) + verdict = _compose_verdict(q, s, y) + assert verdict == "inconclusive - both Stute and Yatchew linearity tests NaN" + + def test_yatchew_nan_alone_does_not_force_inconclusive(self): + """CRITICAL R4 P3 fix: the paper says step 3 uses "Stute OR + Yatchew". A conclusive Stute must be sufficient even when Yatchew + returns NaN (e.g. tied-dose panels). 
Verdict should be + fail-to-reject with a "(Yatchew NaN - skipped)" suffix.""" + q = _mk_qug(reject=False) + s = _mk_stute(reject=False) + y = _mk_yatchew(reject=False, p=float("nan")) + verdict = _compose_verdict(q, s, y) + assert "inconclusive" not in verdict + assert verdict.startswith("QUG and linearity diagnostics fail-to-reject") + assert "Yatchew NaN - skipped" in verdict + assert "Assumption 7" in verdict + + def test_stute_nan_alone_does_not_force_inconclusive(self): + """Mirror of the above: Yatchew conclusive + Stute NaN should + still adjudicate the linearity step via Yatchew alone.""" q = _mk_qug(reject=False) s = _mk_stute(reject=False, p=float("nan")) y = _mk_yatchew(reject=False) verdict = _compose_verdict(q, s, y) - assert verdict.startswith("inconclusive") - assert "Stute" in verdict + assert "inconclusive" not in verdict + assert verdict.startswith("QUG and linearity diagnostics fail-to-reject") + assert "Stute NaN - skipped" in verdict def test_none_reject_flags_assumption7_gap(self): """(b) None reject -> verdict flags the Assumption 7 pre-trends gap From ad19c0208afc2e530894b6bb1247c009ee96becf Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 22 Apr 2026 16:40:58 -0400 Subject: [PATCH 8/9] Address PR #352 CI review round 5 (1 P0 + 1 P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit P0 — Exact-linear shortcut not scale-invariant: R2's fix changed the denominator from sum(dy^2) to centered TSS, which made the guard TRANSLATION-invariant. But the code used `max(centered_TSS, 1.0)` as a defensive floor, which BROKE scale invariance: multiplying dy by a small constant (e.g. dy * 1e-12) made centered_TSS ~ 1e-24 but the floor held the threshold at 1.0, firing the shortcut on noisy data that should NOT trigger it. Fix: remove the `max(..., 1.0)` floor and handle the `centered_TSS == 0` case (constant dy, trivially linear) in a dedicated branch above the relative comparison. 
The remaining path is purely `eps^2 <= tol * dy_centered^2`, which is both translation- AND scale-invariant by construction (R^2 is dimensionless). Applied to both `stute_test` and `yatchew_hr_test`. Module-level constant docstring updated to document the invariance and the zero-TSS split. Regressions: - TestStuteTest.test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data - TestYatchewHRTest.test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data Both multiply a noisy `dy` by 1e-12 and assert the shortcut does NOT fire and decisions match the unscaled input. P3 — Stale verdict docs: R4's Stute-OR-Yatchew linearity-step fallback changed `_compose_verdict` semantics, but the `HADPretestReport.verdict` docstring still listed "Any NaN p-value -> inconclusive" as rule 1, and the REGISTRY Phase 3 checkbox for the workflow still said "`inconclusive - {tests} NaN` when any statistic is NaN". Fix: both docs updated to reflect the paper-faithful semantics: inconclusive only when QUG is NaN or when BOTH Stute and Yatchew are NaN; otherwise the fail-to-reject verdict carries a "(Yatchew NaN - skipped)" (or Stute) suffix when one linearity test was NaN but the other was conclusive. REGISTRY note also documents the `all_pass` gating (QUG conclusive AND at least one of Stute/Yatchew conclusive AND no conclusive test rejects). Test count: 62 -> 64. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/had_pretests.py | 85 +++++++++++++++++++++++++----------- docs/methodology/REGISTRY.md | 2 +- tests/test_had_pretests.py | 74 +++++++++++++++++++++++-------- 3 files changed, 116 insertions(+), 45 deletions(-) diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index 92521438..8d4fea95 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -61,17 +61,24 @@ _MIN_N_BOOTSTRAP = 99 _STUTE_LARGE_G_THRESHOLD = 100_000 -# Scale-relative tolerance for detecting a numerically exact linear OLS fit. 
+# Scale-invariant tolerance for detecting a numerically exact linear OLS fit. # The ratio SSR / TSS = sum(eps^2) / sum((dy - dybar)^2) equals 1 - R^2 -# and is TRANSLATION-INVARIANT under additive shifts in dy. Under exact -# Assumption 8, residuals are mathematically zero; in practice FP round-off -# leaves eps on the order of machine-epsilon (~1e-16). Squared that is -# ~1e-32. The threshold ~1e-24 leaves ~10^8 accumulated FP operations of -# margin so genuinely-noisy data is never mis-classified. We compare -# against centered TSS rather than raw sum(dy^2) because scaling against -# the raw magnitude is NOT translation-invariant: adding a large constant -# to dy would inflate sum(dy^2) and spuriously trip the branch on noisy -# data. +# and is BOTH TRANSLATION-INVARIANT (centering absorbs additive shifts) +# and SCALE-INVARIANT (the ratio is dimensionless under multiplicative +# rescaling of dy). Under exact Assumption 8, residuals are mathematically +# zero; in practice FP round-off leaves eps on the order of machine-epsilon +# (~1e-16). Squared that is ~1e-32. The threshold ~1e-24 leaves ~10^8 +# accumulated FP operations of margin so genuinely-noisy data is never +# mis-classified. +# +# IMPORTANT: the comparison is purely ``eps^2 <= tol * dy_centered^2`` with +# NO additive floor (e.g. ``max(dy_centered^2, 1.0)`` would break scale +# invariance - scaling dy by 1e-12 would make dy_centered^2 ~ 1e-24 but +# the floor would hold the threshold at 1.0, firing the shortcut on +# noisy data that should not trigger it). The ``dy_centered^2 == 0`` +# edge case (constant dy) is handled by a separate branch above the +# relative comparison, so the relative form is only applied when the +# denominator is genuinely positive. _EXACT_LINEAR_RELATIVE_TOL = 1e-24 @@ -389,16 +396,27 @@ class HADPretestReport: PARTIAL indicator: it does not certify Assumption 7 (pre-trends), which is not tested by Phase 3. verdict : str - Human-readable classification. 
Priority-ordered first-match: - - 1. Any NaN p-value -> ``"inconclusive - {names} NaN"`` - 2. None reject -> ``"QUG and linearity diagnostics fail-to-reject; - Assumption 7 pre-trends test NOT run (paper step 2 deferred to - Phase 3 follow-up)"`` - 3. Otherwise -> bundled string naming each rejected assumption: - ``"support infimum rejected - continuous_at_zero design - invalid (QUG)"`` and/or ``"linearity rejected - heterogeneity - bias ({Stute[,Yatchew]})"``. + Human-readable classification. The paper's step 3 accepts either + Stute OR Yatchew, so a conclusive Stute alone can adjudicate + linearity even if Yatchew is NaN (e.g. tied-dose panels), and + vice versa. Priority-ordered first-match: + + 1. QUG NaN -> ``"inconclusive - QUG NaN"`` (step 1 is required). + 2. BOTH Stute AND Yatchew NaN -> ``"inconclusive - both Stute + and Yatchew linearity tests NaN"`` (step 3 requires at least + one). + 3. None of the CONCLUSIVE tests reject -> partial-workflow + fail-to-reject verdict. Format: ``"QUG and linearity + diagnostics fail-to-reject[ (Yatchew NaN - skipped)]; + Assumption 7 pre-trends test NOT run (paper step 2 deferred + to Phase 3 follow-up)"``. The ``" (... - skipped)"`` suffix + appears when Stute or Yatchew was NaN but the other was + conclusive. + 4. At least one conclusive test rejects -> bundled string + naming each failed assumption: ``"support infimum rejected + - continuous_at_zero design invalid (QUG)"`` and/or + ``"linearity rejected - heterogeneity bias + ({Stute[,Yatchew]})"``. alpha : float Significance level shared across tests. n_obs : int @@ -922,10 +940,24 @@ def stute_test( # so the Stute CvM statistic is formally 0 and every bootstrap draw is # also 0. Short-circuit to p=1 to avoid FP-noise-driven bootstrap # comparisons where cvm_stat and S_b are both at machine-epsilon scale. - # Comparison is against CENTERED TSS (translation-invariant). 
+ # Comparison is purely relative against CENTERED TSS: both translation- + # invariant (centering absorbs additive shifts) and scale-invariant + # (ratio is dimensionless under multiplicative dy rescaling). eps_norm_sq = float(np.sum(eps * eps)) dy_centered_sq = float(np.sum((dy_arr - dy_arr.mean()) ** 2)) - if eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * max(dy_centered_sq, 1.0): + if dy_centered_sq <= 0.0: + # Constant dy (zero centered TSS): trivially linear in d. + # Return p = 1 without running the bootstrap. + return StuteTestResults( + cvm_stat=0.0, + p_value=1.0, + reject=False, + alpha=alpha, + n_bootstrap=int(n_bootstrap), + n_obs=G, + seed=seed, + ) + if eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * dy_centered_sq: return StuteTestResults( cvm_stat=0.0, p_value=1.0, @@ -1115,11 +1147,14 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch # formally -inf (finite-negative numerator over zero denominator), # which maps to p = 1 under the one-sided standard-normal critical # value. Short-circuit so FP noise in ``sigma4_W`` cannot produce a - # spuriously large finite ``T_hr``. Comparison is against CENTERED - # TSS (translation-invariant). + # spuriously large finite ``T_hr``. Comparison is purely relative + # against CENTERED TSS - translation- AND scale-invariant. eps_norm_sq = float(np.sum(eps * eps)) dy_centered_sq = float(np.sum((dy_arr - dy_arr.mean()) ** 2)) - if eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * max(dy_centered_sq, 1.0): + if dy_centered_sq <= 0.0 or eps_norm_sq <= _EXACT_LINEAR_RELATIVE_TOL * dy_centered_sq: + # Exact-linear branch. Covers two cases: + # - dy_centered_sq == 0: dy is constant (trivially linear). + # - relative SSR below IEEE precision: near-exact OLS fit. # For reporting, compute sigma2_diff on the sorted dy (finite, # well-defined even in the exact-linear case). 
idx_early = np.argsort(d_arr, kind="stable") diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 2abd188b..6a824d94 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2348,7 +2348,7 @@ Shipped as `did_had_pretest_workflow()` in Phase 3. The paper's decision rule fo - [x] Phase 3: `qug_test()` (`T = D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`, rejection `{T > 1/α - 1}`). One-sided asymptotic p-value `1 / (1 + T)` under the Exp(1)/Exp(1) limit law (Theorem 4). Zero-dose observations filtered upfront (with `UserWarning`); `D_{(1)} == D_{(2)}` tie returns all-NaN inference (conservative). Closed-form tight parity tested at `atol=1e-12`. - [x] Phase 3: `stute_test()` Cramér-von Mises with Mammen wild bootstrap. Statistic `S = (1/G^2) Σ (cumsum_g)^2` (algebraically equivalent to paper's `Σ(g/G)^2 · ((1/g) Σ eps_{(h)})^2`). Bootstrap follows paper Appendix D Algorithm literal (per-iteration OLS refit). `n_bootstrap=999` default, `n_bootstrap >= 99` validated. `G < 10` returns NaN; `G > 100_000` emits a `UserWarning` pointing to `yatchew_hr_test`. Appendix-D vectorized matrix form deferred as a performance follow-up (tracked in `TODO.md`). - [x] Phase 3: `yatchew_hr_test()` heteroskedasticity-robust linearity test. Test statistic `T_hr = sqrt(G) · (σ̂²_lin - σ̂²_diff) / σ̂²_W` from paper Equation 29. Normalizer `σ̂²_diff` divides by `2G` (paper-literal Theorem 7), NOT `2(G-1)`; hand-computed tight parity asserted at `atol=1e-12`. One-sided standard-normal critical value. `G < 3` returns NaN. -- [x] Phase 3: `did_had_pretest_workflow()` composite helper. Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three IMPLEMENTED tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string. 
Because Phase 3 ships steps 1 + 3 of the paper's four-step workflow but **not** step 2 (Assumption 7 pre-trends test via Equation 18), the fail-to-reject verdict explicitly flags the Assumption 7 gap rather than claiming unconditional TWFE safety: `"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. Other verdict forms: `"inconclusive - {tests} NaN"` when any statistic is NaN; bundled rejection-reason string naming each failed assumption otherwise. Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up. +- [x] Phase 3: `did_had_pretest_workflow()` composite helper. Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three IMPLEMENTED tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string. Because Phase 3 ships steps 1 + 3 of the paper's four-step workflow but **not** step 2 (Assumption 7 pre-trends test via Equation 18), the fail-to-reject verdict explicitly flags the Assumption 7 gap rather than claiming unconditional TWFE safety: `"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. Verdict semantics follow the paper's "Stute OR Yatchew" step-3 wording: `"inconclusive - QUG NaN"` only when step 1 fails; `"inconclusive - both Stute and Yatchew linearity tests NaN"` only when step 3 has no conclusive linearity test; otherwise the fail-to-reject verdict may carry a `"(Yatchew NaN - skipped)"` (or Stute) suffix when one linearity test is NaN but the other is conclusive. Bundled rejection-reason strings naming each failed assumption otherwise. `all_pass` is `True` iff QUG is conclusive AND at least one of Stute/Yatchew is conclusive AND no conclusive test rejects. 
Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up. - [ ] Phase 4: Pierce-Schott (2016) replication harness reproduces Figure 2 values. - [ ] Phase 4: Full DGP 1/2/3 coverage-rate reproduction from Table 1. - [ ] Phase 5: `practitioner_next_steps()` integration for HAD results. diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py index 7400d92f..fc32c58f 100644 --- a/tests/test_had_pretests.py +++ b/tests/test_had_pretests.py @@ -279,13 +279,9 @@ def test_exact_linear_returns_p1_not_nan(self): assert r.cvm_stat == 0.0 def test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data(self): - """P0 fix: the exact-linear short-circuit must NOT fire on noisy - data after an additive shift. Previously the short-circuit scaled - against raw ``sum(dy^2)``, which is NOT translation-invariant: - adding a large constant to dy inflated ``sum(dy^2)`` and - spuriously tripped the "exact linear" branch on genuinely noisy - data. Fix scales against ``sum((dy - dybar)^2)`` (centered TSS, - translation-invariant). + """P0 fix (R2): the exact-linear short-circuit must NOT fire on + noisy data after an additive shift. The fix scales against + ``sum((dy - dybar)^2)`` (centered TSS, translation-invariant). Exact numerical equivalence of the two bootstrap runs is NOT expected (FP cancellation at 1e12 scale costs ~4 digits of @@ -301,16 +297,37 @@ def test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data(self): r_raw = stute_test(d, dy, n_bootstrap=199, seed=42) r_shift = stute_test(d, dy_shifted, n_bootstrap=199, seed=42) - # Neither run may hit the exact-linear short-circuit (both have - # unit-scale noise; neither should register as effectively-zero - # residuals relative to centered TSS). assert r_raw.cvm_stat > 0.0, "shortcut fired on raw noisy data" assert r_shift.cvm_stat > 0.0, "shortcut fired on shifted noisy data" - # Decision must be preserved across translation. 
assert r_raw.reject == r_shift.reject - # P-values match at moderate tolerance (FP cancellation limits precision). np.testing.assert_allclose(r_raw.p_value, r_shift.p_value, rtol=5e-3) + def test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data(self): + """P0 fix (R5): the exact-linear short-circuit must NOT fire on + noisy data after a MULTIPLICATIVE rescaling. An earlier revision + used ``max(centered_TSS, 1.0)`` as the denominator in the guard, + which broke scale invariance: scaling ``dy`` by ``1e-12`` would + make ``centered_TSS ~ 1e-24`` but the floor would hold the + threshold at 1.0, firing the shortcut on noisy data that should + NOT trigger it. Fix uses a purely relative comparison with a + separate branch for the ``centered_TSS == 0`` edge case. + """ + rng = np.random.default_rng(42) + G = 50 + d = rng.uniform(0.0, 1.0, size=G) + dy = 2.0 * d + rng.normal(0.0, 1.0, size=G) + dy_scaled = dy * 1e-12 # multiplicative rescale + + r_raw = stute_test(d, dy, n_bootstrap=199, seed=42) + r_scaled = stute_test(d, dy_scaled, n_bootstrap=199, seed=42) + + # Neither run may hit the short-circuit. + assert r_raw.cvm_stat > 0.0, "shortcut fired on raw noisy data" + assert r_scaled.cvm_stat > 0.0, "shortcut fired on scaled noisy data" + # Decision must be preserved: rescaling dy cannot change the + # rejection outcome (the statistic is scale-invariant in dy). 
+ assert r_raw.reject == r_scaled.reject + def test_mismatched_lengths_raise(self): with pytest.raises(ValueError, match="same length"): stute_test(np.array([0.1, 0.2, 0.3]), np.array([1.0, 2.0])) @@ -390,10 +407,8 @@ def test_exact_linear_returns_p1_not_nan(self): assert r.t_stat_hr == float("-inf") def test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data(self): - """P0 fix: the Yatchew exact-linear short-circuit must NOT fire - on noisy data after an additive shift (same rationale as Stute - fix: scale-ratio is translation-invariant only against centered - TSS).""" + """P0 fix (R2): Yatchew exact-linear short-circuit is translation- + invariant (comparison against centered TSS, not raw sum(dy^2)).""" rng = np.random.default_rng(42) G = 50 d = rng.uniform(0.0, 1.0, size=G) @@ -403,14 +418,35 @@ def test_exact_linear_shortcut_does_not_fire_on_shifted_noisy_data(self): r_raw = yatchew_hr_test(d, dy) r_shift = yatchew_hr_test(d, dy_shifted) - # Neither run may hit the short-circuit (t_stat_hr would be -inf). assert np.isfinite(r_raw.t_stat_hr), "shortcut fired on raw noisy data" assert np.isfinite(r_shift.t_stat_hr), "shortcut fired on shifted noisy data" - # Decision preserved across translation. assert r_raw.reject == r_shift.reject - # T-stats match at moderate tolerance (FP cancellation at 1e12 scale). np.testing.assert_allclose(r_raw.t_stat_hr, r_shift.t_stat_hr, rtol=1e-3) + def test_exact_linear_shortcut_does_not_fire_on_rescaled_noisy_data(self): + """P0 fix (R5): Yatchew exact-linear short-circuit is scale- + invariant. An earlier revision used a ``max(centered_TSS, 1.0)`` + floor that broke scale invariance under multiplicative rescaling + of dy (e.g. ``dy * 1e-12``). 
Fix removes the floor and handles + the zero-centered-TSS case in a separate branch.""" + rng = np.random.default_rng(42) + G = 50 + d = rng.uniform(0.0, 1.0, size=G) + dy = 2.0 * d + rng.normal(0.0, 1.0, size=G) + dy_scaled = dy * 1e-12 + + r_raw = yatchew_hr_test(d, dy) + r_scaled = yatchew_hr_test(d, dy_scaled) + + # Neither run may short-circuit to t_stat_hr = -inf. + assert np.isfinite(r_raw.t_stat_hr), "shortcut fired on raw noisy data" + assert np.isfinite(r_scaled.t_stat_hr), "shortcut fired on scaled noisy data" + # Decision preserved: Yatchew is scale-invariant in dy. + assert r_raw.reject == r_scaled.reject + # T_hr is scale-invariant: sigma2_lin and sigma2_diff both scale + # by c^2, sigma2_W scales by c^2, so the ratio is unchanged. + np.testing.assert_allclose(r_raw.t_stat_hr, r_scaled.t_stat_hr, rtol=1e-3) + def test_duplicate_doses_return_nan(self): """P1 fix: yatchew_hr_test rejects duplicate doses with UserWarning + NaN result. Adjacent differences depend on within-tie row order, From b7e0d8270bbe4db943cb487c49d251cbf1e322d0 Mon Sep 17 00:00:00 2001 From: igerber Date: Wed, 22 Apr 2026 17:17:05 -0400 Subject: [PATCH 9/9] Address PR #352 CI review round 6 (2 P1 + 1 P2) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit R6 P1 #1 — _compose_verdict hides conclusive rejections behind "inconclusive": The R4 logic returned "inconclusive - QUG NaN" or "inconclusive - both Stute and Yatchew linearity tests NaN" BEFORE checking whether any conclusive test had rejected. The reviewer's example: G=2 with QUG rejecting at alpha=0.05 and Stute/Yatchew NaN by sample-size gates — the workflow emitted "inconclusive - both linearity NaN", hiding a real assumption failure. The paper's rule is one-way: TWFE is admissible only if NO test rejects. A conclusive rejection therefore dominates unresolved-step notes. Fix: reorder _compose_verdict: 1. Collect rejections from conclusive tests first. 
If any, that is the primary verdict, and unresolved-step notes are APPENDED via "; additional steps unresolved: ..." rather than replacing the rejection. 2. Only when NO conclusive rejection exists AND a required step is unresolved do we return a pure "inconclusive - ..." verdict. 3. Otherwise fall through to the partial-workflow fail-to-reject verdict (with "(Yatchew NaN - skipped)" suffix if applicable). Regressions: - TestComposeVerdictLogic.test_qug_reject_with_both_linearity_nan_surfaces_rejection - TestComposeVerdictLogic.test_linearity_reject_with_qug_nan_surfaces_rejection - TestComposeVerdictLogic.test_all_three_reject_with_qug_nan_keeps_conclusive_rejections R6 P1 #2 — Raw stute_test / yatchew_hr_test accept negative doses: qug_test and _validate_had_panel both front-door-reject d < 0 (paper Section 2 HAD support restriction), but the new linearity helpers only validated shape + NaN. Negative doses are outside the method's stated scope and could silently produce conclusive-looking output. Fix: mirror the negative-dose guard. Both stute_test and yatchew_hr_test now raise ValueError on any d < 0 with a message directing users to pre-process or check the dose column. Docstrings updated to list the new contract in the Raises section. Regressions: - TestNegativeDoseGuardsOnLinearityTests.test_stute_negative_dose_raises - TestNegativeDoseGuardsOnLinearityTests.test_yatchew_negative_dose_raises R6 P2 — Docstrings / REGISTRY sync: HADPretestReport.verdict docstring rewritten to describe the new "rejection-first, unresolved-suffix" priority. REGISTRY Phase 3 workflow checkbox updated to document the conclusive-rejection-not- hidden semantics plus the non-negative-dose contract. Test count: 64 -> 69. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/had_pretests.py | 167 +++++++++++++++++++++++------------ docs/methodology/REGISTRY.md | 2 +- tests/test_had_pretests.py | 60 +++++++++++++ 3 files changed, 172 insertions(+), 57 deletions(-) diff --git a/diff_diff/had_pretests.py b/diff_diff/had_pretests.py index 8d4fea95..fff55aea 100644 --- a/diff_diff/had_pretests.py +++ b/diff_diff/had_pretests.py @@ -396,27 +396,36 @@ class HADPretestReport: PARTIAL indicator: it does not certify Assumption 7 (pre-trends), which is not tested by Phase 3. verdict : str - Human-readable classification. The paper's step 3 accepts either - Stute OR Yatchew, so a conclusive Stute alone can adjudicate - linearity even if Yatchew is NaN (e.g. tied-dose panels), and - vice versa. Priority-ordered first-match: - - 1. QUG NaN -> ``"inconclusive - QUG NaN"`` (step 1 is required). - 2. BOTH Stute AND Yatchew NaN -> ``"inconclusive - both Stute - and Yatchew linearity tests NaN"`` (step 3 requires at least - one). - 3. None of the CONCLUSIVE tests reject -> partial-workflow - fail-to-reject verdict. Format: ``"QUG and linearity - diagnostics fail-to-reject[ (Yatchew NaN - skipped)]; - Assumption 7 pre-trends test NOT run (paper step 2 deferred - to Phase 3 follow-up)"``. The ``" (... - skipped)"`` suffix - appears when Stute or Yatchew was NaN but the other was - conclusive. - 4. At least one conclusive test rejects -> bundled string - naming each failed assumption: ``"support infimum rejected - - continuous_at_zero design invalid (QUG)"`` and/or - ``"linearity rejected - heterogeneity bias - ({Stute[,Yatchew]})"``. + Human-readable classification. Paper rule: TWFE is admissible + only if NONE of the implemented tests rejects. A conclusive + rejection must therefore never be hidden by a purely-inconclusive + verdict just because another step happens to be NaN. + + Priority: + + 1. 
If any CONCLUSIVE test rejected, that is the primary verdict: + a bundled string naming each failed assumption, + ``"support infimum rejected - continuous_at_zero design + invalid (QUG)"`` and/or ``"linearity rejected - heterogeneity + bias ({Stute[,Yatchew]})"``. If another step is unresolved + (QUG NaN, or BOTH linearity tests NaN), an + ``"; additional steps unresolved: ..."`` suffix is APPENDED + rather than replacing the rejection. + 2. If no conclusive rejection but a required step is unresolved, + the verdict is ``"inconclusive - QUG NaN"`` when step 1 is + the only unresolved piece, ``"inconclusive - both Stute and + Yatchew linearity tests NaN"`` when step 3 lacks any + conclusive linearity test, or ``"inconclusive - QUG NaN; + both Stute and Yatchew linearity tests NaN"`` when both are + unresolved. + 3. Otherwise (all required steps conclusive and none reject), + the partial-workflow fail-to-reject verdict: + ``"QUG and linearity diagnostics fail-to-reject[ (Yatchew + NaN - skipped)]; Assumption 7 pre-trends test NOT run (paper + step 2 deferred to Phase 3 follow-up)"``. The + ``" (... - skipped)"`` suffix appears when Stute OR Yatchew + was NaN but the other was conclusive (step 3 resolved via + the paper's "Stute OR Yatchew" wording). alpha : float Significance level shared across tests. n_obs : int @@ -616,43 +625,36 @@ def _compose_verdict( conclusive Stute result alone suffices even when Yatchew is NaN (e.g. tied doses, which Yatchew rejects by contract). - Priority-ordered first-match: - - 1. QUG NaN -> ``"inconclusive - QUG NaN"`` (step 1 required). - 2. BOTH Stute AND Yatchew NaN -> ``"inconclusive - both Stute and - Yatchew linearity tests NaN"`` (step 3 requires at least one). - 3. Otherwise, count rejections from CONCLUSIVE tests only: - 3a. None of the conclusive tests reject -> partial-workflow - fail-to-reject verdict flagging the Assumption 7 gap, plus - a ``" (Yatchew NaN - skipped)"`` suffix when applicable. - 3b. 
At least one conclusive test rejects -> bundle each - rejection reason naming the failed assumption. + Paper logic: TWFE is admissible only if NONE of the implemented + tests rejects. A conclusive rejection must therefore never be hidden + by a purely-inconclusive verdict just because another step is NaN - + it is reported as the primary outcome and any unresolved steps are + appended as a suffix. + + Priority: + + 1. Collect all rejection reasons from CONCLUSIVE tests. If any + conclusive test rejected, that is the primary verdict. Unresolved + steps (QUG NaN, or BOTH linearity tests NaN) are appended as + ``"; additional steps unresolved: ..."`` rather than replacing + the rejection. + 2. If no conclusive test rejected but a required step is unresolved, + return a pure ``"inconclusive - ..."`` verdict naming the + unresolved step(s). + 3. Otherwise (all required steps conclusive and none reject), + return the partial-workflow fail-to-reject verdict flagging the + Assumption 7 gap, with a ``" (Yatchew NaN - skipped)"`` suffix + when ONE linearity test was NaN and the other was conclusive. """ qug_ok = bool(np.isfinite(qug.p_value)) stute_ok = bool(np.isfinite(stute.p_value)) yatchew_ok = bool(np.isfinite(yatchew.p_value)) - if not qug_ok: - return "inconclusive - QUG NaN" - if not stute_ok and not yatchew_ok: - return "inconclusive - both Stute and Yatchew linearity tests NaN" - - qug_rej = qug.reject - stute_rej = bool(stute_ok and stute.reject) - yatchew_rej = bool(yatchew_ok and yatchew.reject) - - if not (qug_rej or stute_rej or yatchew_rej): - skipped = [] - if not stute_ok: - skipped.append("Stute NaN") - if not yatchew_ok: - skipped.append("Yatchew NaN") - skip_note = f" ({'; '.join(skipped)} - skipped)" if skipped else "" - return ( - "QUG and linearity diagnostics fail-to-reject" - f"{skip_note}; Assumption 7 pre-trends test NOT run " - "(paper step 2 deferred to Phase 3 follow-up)" - ) + # Rejections from conclusive tests only. 
NaN-p tests have reject=False + # by convention, so the ``ok and reject`` guard is defensive. + qug_rej = qug_ok and qug.reject + stute_rej = stute_ok and stute.reject + yatchew_rej = yatchew_ok and yatchew.reject reasons = [] if qug_rej: @@ -662,7 +664,40 @@ def _compose_verdict( name for name, rejected in (("Stute", stute_rej), ("Yatchew", yatchew_rej)) if rejected ) reasons.append(f"linearity rejected - heterogeneity bias ({which})") - return "; ".join(reasons) + + # Unresolved steps: QUG is required; step 3 requires at least one + # conclusive linearity test. + unresolved = [] + if not qug_ok: + unresolved.append("QUG NaN") + if not stute_ok and not yatchew_ok: + unresolved.append("both Stute and Yatchew linearity tests NaN") + + if reasons: + # A conclusive rejection is the primary outcome. Append any + # unresolved-step note rather than replacing the rejection. + verdict = "; ".join(reasons) + if unresolved: + verdict += "; additional steps unresolved: " + "; ".join(unresolved) + return verdict + + if unresolved: + return "inconclusive - " + "; ".join(unresolved) + + # All required steps conclusive, none reject. Note any single skipped + # linearity test (the OTHER linearity test was conclusive and + # fail-to-reject, so step 3 IS resolved). 
+ skipped = [] + if not stute_ok: + skipped.append("Stute NaN") + if not yatchew_ok: + skipped.append("Yatchew NaN") + skip_note = f" ({'; '.join(skipped)} - skipped)" if skipped else "" + return ( + "QUG and linearity diagnostics fail-to-reject" + f"{skip_note}; Assumption 7 pre-trends test NOT run " + "(paper step 2 deferred to Phase 3 follow-up)" + ) # ============================================================================= @@ -848,7 +883,8 @@ def stute_test( ------ ValueError If ``d`` / ``dy`` are not 1D numeric, contain NaN, have unequal - lengths, or if ``alpha`` is outside ``(0, 1)``, or if + lengths, if any ``d`` value is negative (paper Section 2 HAD + support restriction), if ``alpha`` is outside ``(0, 1)``, or if ``n_bootstrap < 99``. Notes @@ -886,6 +922,15 @@ def stute_test( f"d and dy must have the same length; got d.shape={d_arr.shape}, " f"dy.shape={dy_arr.shape}." ) + # HAD support restriction (paper Section 2): doses must be non-negative. + # Mirror the front-door guard from qug_test / _validate_had_panel. + if (d_arr < 0).any(): + n_neg = int((d_arr < 0).sum()) + raise ValueError( + f"stute_test: d contains {n_neg} negative value(s); HAD doses " + f"must be non-negative (paper Section 2). Check your dose " + f"column or pre-process before calling stute_test." + ) G = int(d_arr.shape[0]) if G < _MIN_G_STUTE: @@ -1027,7 +1072,8 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch ------ ValueError If ``d`` / ``dy`` are not 1D numeric, contain NaN, have unequal - lengths, or if ``alpha`` is outside ``(0, 1)``. + lengths, if any ``d`` value is negative (paper Section 2 HAD + support restriction), or if ``alpha`` is outside ``(0, 1)``. Notes ----- @@ -1083,6 +1129,15 @@ def yatchew_hr_test(d: np.ndarray, dy: np.ndarray, alpha: float = 0.05) -> Yatch f"d and dy must have the same length; got d.shape={d_arr.shape}, " f"dy.shape={dy_arr.shape}." 
) + # HAD support restriction (paper Section 2): doses must be non-negative. + # Mirror the front-door guard from qug_test / _validate_had_panel. + if (d_arr < 0).any(): + n_neg = int((d_arr < 0).sum()) + raise ValueError( + f"yatchew_hr_test: d contains {n_neg} negative value(s); HAD " + f"doses must be non-negative (paper Section 2). Check your " + f"dose column or pre-process before calling yatchew_hr_test." + ) G = int(d_arr.shape[0]) critical_value = float(stats.norm.ppf(1.0 - alpha)) diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 6a824d94..6ad71b04 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2348,7 +2348,7 @@ Shipped as `did_had_pretest_workflow()` in Phase 3. The paper's decision rule fo - [x] Phase 3: `qug_test()` (`T = D_{2,(1)} / (D_{2,(2)} - D_{2,(1)})`, rejection `{T > 1/α - 1}`). One-sided asymptotic p-value `1 / (1 + T)` under the Exp(1)/Exp(1) limit law (Theorem 4). Zero-dose observations filtered upfront (with `UserWarning`); `D_{(1)} == D_{(2)}` tie returns all-NaN inference (conservative). Closed-form tight parity tested at `atol=1e-12`. - [x] Phase 3: `stute_test()` Cramér-von Mises with Mammen wild bootstrap. Statistic `S = (1/G^2) Σ (cumsum_g)^2` (algebraically equivalent to paper's `Σ(g/G)^2 · ((1/g) Σ eps_{(h)})^2`). Bootstrap follows paper Appendix D Algorithm literal (per-iteration OLS refit). `n_bootstrap=999` default, `n_bootstrap >= 99` validated. `G < 10` returns NaN; `G > 100_000` emits a `UserWarning` pointing to `yatchew_hr_test`. Appendix-D vectorized matrix form deferred as a performance follow-up (tracked in `TODO.md`). - [x] Phase 3: `yatchew_hr_test()` heteroskedasticity-robust linearity test. Test statistic `T_hr = sqrt(G) · (σ̂²_lin - σ̂²_diff) / σ̂²_W` from paper Equation 29. Normalizer `σ̂²_diff` divides by `2G` (paper-literal Theorem 7), NOT `2(G-1)`; hand-computed tight parity asserted at `atol=1e-12`. One-sided standard-normal critical value. 
`G < 3` returns NaN. -- [x] Phase 3: `did_had_pretest_workflow()` composite helper. Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three IMPLEMENTED tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string. Because Phase 3 ships steps 1 + 3 of the paper's four-step workflow but **not** step 2 (Assumption 7 pre-trends test via Equation 18), the fail-to-reject verdict explicitly flags the Assumption 7 gap rather than claiming unconditional TWFE safety: `"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. Verdict semantics follow the paper's "Stute OR Yatchew" step-3 wording: `"inconclusive - QUG NaN"` only when step 1 fails; `"inconclusive - both Stute and Yatchew linearity tests NaN"` only when step 3 has no conclusive linearity test; otherwise the fail-to-reject verdict may carry a `"(Yatchew NaN - skipped)"` (or Stute) suffix when one linearity test is NaN but the other is conclusive. Bundled rejection-reason strings naming each failed assumption otherwise. `all_pass` is `True` iff QUG is conclusive AND at least one of Stute/Yatchew is conclusive AND no conclusive test rejects. Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up. +- [x] Phase 3: `did_had_pretest_workflow()` composite helper. Two-period `data`-only entry point (Phase 2a overall-path dispatch); reduces panel via `_aggregate_first_difference` and runs all three IMPLEMENTED tests at a shared `alpha`. `seed` forwards to `stute_test` only (QUG and Yatchew are deterministic). Returns `HADPretestReport` with priority-ordered verdict string. 
Because Phase 3 ships steps 1 + 3 of the paper's four-step workflow but **not** step 2 (Assumption 7 pre-trends test via Equation 18), the fail-to-reject verdict explicitly flags the Assumption 7 gap rather than claiming unconditional TWFE safety: `"QUG and linearity diagnostics fail-to-reject; Assumption 7 pre-trends test NOT run (paper step 2 deferred to Phase 3 follow-up)"`. Verdict priority follows the paper's one-way rule (TWFE admissible only if NO test rejects): **conclusive rejections are the primary verdict and are NEVER hidden by inconclusive status** — any unresolved-step note is appended via `"; additional steps unresolved: ..."` rather than replacing the rejection. The pure `"inconclusive - QUG NaN"` / `"inconclusive - both Stute and Yatchew linearity tests NaN"` forms only fire when NO conclusive test rejects AND a required step is unresolved. The partial-workflow fail-to-reject verdict may carry a `"(Yatchew NaN - skipped)"` (or Stute) suffix when one linearity test is NaN but the other is conclusive (step 3 resolved via the paper's "Stute OR Yatchew" wording). Bundled rejection-reason strings name each failed assumption in the conclusive-rejection case. `all_pass` is `True` iff QUG is conclusive AND at least one of Stute/Yatchew is conclusive AND no conclusive test rejects. **Non-negative-dose contract**: all three raw linearity helpers (`qug_test`, `stute_test`, `yatchew_hr_test`) raise a front-door `ValueError` on any `d < 0`, mirroring the `_validate_had_panel` guard (paper Section 2 HAD support restriction). Multi-period panels pre-slice to `(F-1, F)` before calling; joint-horizon dispatch deferred to Phase 3 follow-up. - [ ] Phase 4: Pierce-Schott (2016) replication harness reproduces Figure 2 values. - [ ] Phase 4: Full DGP 1/2/3 coverage-rate reproduction from Table 1. - [ ] Phase 5: `practitioner_next_steps()` integration for HAD results. 
diff --git a/tests/test_had_pretests.py b/tests/test_had_pretests.py index fc32c58f..a47e5991 100644 --- a/tests/test_had_pretests.py +++ b/tests/test_had_pretests.py @@ -172,6 +172,23 @@ def test_negative_dose_raises(self): qug_test(np.array([0.1, -0.3, 0.5, 0.9])) +class TestNegativeDoseGuardsOnLinearityTests: + """R6 P1 fix: stute_test and yatchew_hr_test must enforce the same + HAD non-negative-dose contract as qug_test and the panel validator.""" + + def test_stute_negative_dose_raises(self): + d = np.linspace(-0.2, 0.8, 30) # contains negatives + dy = 2.0 * d + np.random.default_rng(42).normal(0.0, 0.1, size=30) + with pytest.raises(ValueError, match="negative value"): + stute_test(d, dy, n_bootstrap=199, seed=42) + + def test_yatchew_negative_dose_raises(self): + d = np.linspace(-0.2, 0.8, 30) + dy = 2.0 * d + np.random.default_rng(42).normal(0.0, 0.1, size=30) + with pytest.raises(ValueError, match="negative value"): + yatchew_hr_test(d, dy) + + # ============================================================================= # Stute test # ============================================================================= @@ -945,6 +962,49 @@ def test_stute_nan_alone_does_not_force_inconclusive(self): assert verdict.startswith("QUG and linearity diagnostics fail-to-reject") assert "Stute NaN - skipped" in verdict + def test_qug_reject_with_both_linearity_nan_surfaces_rejection(self): + """R6 P1 fix: a conclusive QUG rejection must NOT be hidden by + "inconclusive" just because BOTH linearity tests are NaN. Paper + rule is one-way: TWFE is admissible only if NO test rejects, so + any conclusive rejection dominates unresolved-step notes. + """ + q = _mk_qug(reject=True) + s = _mk_stute(reject=False, p=float("nan")) + y = _mk_yatchew(reject=False, p=float("nan")) + verdict = _compose_verdict(q, s, y) + # Rejection must be surfaced as the primary verdict. 
+ assert "support infimum rejected" in verdict + assert not verdict.startswith("inconclusive") + # Unresolved linearity step is appended, not replacing the rejection. + assert "additional steps unresolved" in verdict + assert "both Stute and Yatchew" in verdict + + def test_linearity_reject_with_qug_nan_surfaces_rejection(self): + """R6 P1 fix: a conclusive linearity rejection (Stute or Yatchew) + must NOT be hidden by "inconclusive - QUG NaN".""" + q = _mk_qug(reject=False, p=float("nan")) + s = _mk_stute(reject=True) + y = _mk_yatchew(reject=False) + verdict = _compose_verdict(q, s, y) + assert "linearity rejected" in verdict + assert "Stute" in verdict + assert not verdict.startswith("inconclusive") + assert "additional steps unresolved" in verdict + assert "QUG NaN" in verdict + + def test_all_three_reject_with_qug_nan_keeps_conclusive_rejections(self): + """Mixed case: Stute and Yatchew both conclusively reject, QUG + unresolved. All conclusive rejections must be reported.""" + q = _mk_qug(reject=False, p=float("nan")) + s = _mk_stute(reject=True) + y = _mk_yatchew(reject=True) + verdict = _compose_verdict(q, s, y) + assert "linearity rejected" in verdict + assert "Stute" in verdict + assert "Yatchew" in verdict + assert "QUG NaN" in verdict + assert not verdict.startswith("inconclusive") + def test_none_reject_flags_assumption7_gap(self): """(b) None reject -> verdict flags the Assumption 7 pre-trends gap rather than claiming unconditional TWFE safety."""