diff --git a/CHANGELOG.md b/CHANGELOG.md index ca42006f..0519b306 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Added +- **`BusinessReport` and `DiagnosticReport` (experimental preview)** - practitioner-ready output layer. `BusinessReport(results, ...)` produces plain-English narrative summaries (`.summary()`, `.full_report()`, `.export_markdown()`, `.to_dict()`) from any of the 16 fitted result types. `DiagnosticReport(results, ...)` orchestrates the existing diagnostic battery (parallel trends, pre-trends power, HonestDiD sensitivity, Goodman-Bacon, heterogeneity, design-effect, EPV) plus estimator-native diagnostics for SyntheticDiD (`pre_treatment_fit`, weight concentration, in-time placebo, zeta sensitivity) and TROP (factor-model fit metrics). Both classes expose an AI-legible `to_dict()` schema (single source of truth; prose renders from the dict). BR auto-constructs DR by default so summaries mention pre-trends, robustness, and design-effect findings in one call. See `docs/methodology/REPORTING.md` for methodology deviations including the no-traffic-light-gates decision, pre-trends verdict thresholds (0.05 / 0.30), and power-aware phrasing driven by `compute_pretrends_power`. **Both schemas are marked experimental in this release** - wording, verdict thresholds, and schema shape will change; do not anchor downstream tooling on them yet. + ### Changed - Add Zenodo DOI badge to README; upgrade the BibTeX citation block with the concept DOI (`10.5281/zenodo.19646175`) and list author as Isaac Gerber (matching `CITATION.cff`). Add `doi:` and `identifiers:` entries (concept + versioned) to `CITATION.cff`. DOI was minted by Zenodo when v3.1.3 was released. diff --git a/README.md b/README.md index 4c8bbd88..ebd80e8b 100644 --- a/README.md +++ b/README.md @@ -93,6 +93,38 @@ Measuring campaign lift? Evaluating a product launch? diff-diff handles the caus - **[Brand awareness survey tutorial](docs/tutorials/17_brand_awareness_survey.ipynb)** - Full example with complex survey design, brand funnel analysis, and staggered rollouts - **Have BRFSS/ACS/CPS individual records?** Use [`aggregate_survey()`](docs/api/prep.rst) to roll respondent-level microdata into a geographic-period panel with inverse-variance precision weights. The returned second-stage design uses analytic weights (`aweight`), so it works directly with `DifferenceInDifferences`, `TwoWayFixedEffects`, `MultiPeriodDiD`, `SunAbraham`, `ContinuousDiD`, and `EfficientDiD` (estimators marked **Full** in the [survey support matrix](docs/choosing_estimator.rst)) +### Experimental preview: `BusinessReport` and `DiagnosticReport` + +diff-diff ships two preview classes, `BusinessReport` and `DiagnosticReport`, that produce plain-English output and a structured `to_dict()` schema from any fitted result. **Both are experimental in this release** — wording, verdict thresholds, and schema shape will change as the library learns from real practitioner usage. Do not anchor downstream tooling on the schema yet; the experimental flag is noted in the CHANGELOG. 
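+
+If you only want the diagnostic battery without the stakeholder narrative, `DiagnosticReport` can be run on its own. A minimal sketch (it assumes a fitted `cs` result like the one built in the `BusinessReport` example that follows; `run_all()` returns a results object whose `.schema` dict is the experimental preview surface described above):
+
+```python
+from diff_diff import DiagnosticReport
+
+dr = DiagnosticReport(
+    cs,                         # any fitted diff-diff result
+    # Optional: panel + column names enable the data-dependent checks,
+    # mirroring the BusinessReport passthrough shown below.
+    data=df, outcome="revenue", unit="store", time="month",
+    first_treat="first_treat",
+)
+dr_results = dr.run_all()
+print(dr_results.schema.get("overall_interpretation", ""))
+```
+
+The main path is `BusinessReport`, which wraps the same fitted result and narrates those diagnostics for stakeholders: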
+
+```python
+from diff_diff import CallawaySantAnna, BusinessReport
+
+cs = CallawaySantAnna(base_period="universal").fit(
+    df, outcome="revenue", unit="store", time="month",
+    first_treat="first_treat", aggregate="event_study",
+)
+report = BusinessReport(
+    cs,
+    outcome_label="Revenue per store",
+    outcome_unit="$",
+    business_question="Did the loyalty program lift revenue?",
+    treatment_label="the loyalty program",
+    # Optional: pass the panel + column names so the auto-constructed
+    # DiagnosticReport can run data-dependent checks (2x2 pre-trends,
+    # Goodman-Bacon decomposition, EfficientDiD Hausman pretest).
+    # Without these the auto path still runs but skips those checks.
+    data=df,
+    outcome="revenue",
+    unit="store",
+    time="month",
+    first_treat="first_treat",
+)
+print(report.summary())
+```
+
+`BusinessReport` auto-constructs a `DiagnosticReport` so the summary mentions pre-trends, sensitivity, and design-effect findings in one call. Methodology (phrasing rules, verdict thresholds, schema stability) is documented in [docs/methodology/REPORTING.md](docs/methodology/REPORTING.md). Feedback on wording, applicability, and missing diagnostics is welcome — this is the part of the library most likely to evolve in the next few releases.
+
 Already know DiD? The [academic quickstart](docs/quickstart.rst) and [estimator guide](docs/choosing_estimator.rst) cover the full technical details.
 
 ## Features
diff --git a/ROADMAP.md b/ROADMAP.md
index c7abe633..84d13139 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -57,6 +57,7 @@ See [Survey Design Support](docs/choosing_estimator.rst#survey-design-support) f
 Major landings since the prior roadmap revision. See [CHANGELOG.md](CHANGELOG.md) for the full history.
 
+- **`BusinessReport` and `DiagnosticReport`** - practitioner-ready output layer. Plain-English stakeholder summaries + unified diagnostic runner with an AI-legible `to_dict()` schema (experimental in this release). `BusinessReport` auto-constructs `DiagnosticReport` by default so summaries mention pre-trends, robustness, and design-effect findings in one call. Estimator-native validation surfaces are routed per estimator: SyntheticDiD uses `pre_treatment_fit` / `in_time_placebo` / `sensitivity_to_zeta_omega`; EfficientDiD uses its native `hausman_pretest`; TROP exposes factor-model fit metrics. See `docs/methodology/REPORTING.md` for methodology deviations including no-traffic-light gates, pre-trends verdict thresholds, and power-aware phrasing.
 - **ChaisemartinDHaultfoeuille (dCDH)** - full feature set: `DID_M` contemporaneous-switch, multi-horizon `DID_l` event study, analytical SE, multiplier bootstrap, TWFE decomposition diagnostic, dynamic placebos, normalized estimator, cost-benefit aggregate, sup-t bands, covariate adjustment (`DID^X`), group-specific linear trends (`DID^{fd}`), state-set-specific trends, heterogeneity testing, non-binary treatment, HonestDiD integration, and survey support (TSL + pweight).
 - **SyntheticDiD jackknife variance** (`variance_method='jackknife'`) with survey-weighted jackknife.
 - **SyntheticDiD validation diagnostics**.
@@ -78,8 +79,7 @@ Queued work, ordered by expected leverage. Each item is its own PR. Ordering is
 ### Practitioner-ready output
 
-- **`BusinessReport` class.** Plain-English summaries of any estimator's results with markdown export. Optional rich formatting via a `[reporting]` extra; core remains numpy/pandas/scipy only. Turns raw coefficients into stakeholder-ready artifacts.
-- **`DiagnosticReport` with context-aware `practitioner_next_steps()`.** Unified diagnostic runner that bundles parallel-trends, placebo, HonestDiD, Bacon decomposition, DEFF, EPV, and power diagnostics into one plain-English report. `practitioner_next_steps()` substitutes actual column names from fitted results instead of generic placeholders.
+- **Context-aware `practitioner_next_steps()`.** Substitutes actual column names from fitted results instead of generic placeholders, so next-step guidance is executable rather than illustrative. (Standalone follow-up to the `BusinessReport` / `DiagnosticReport` landing above; tracked under the AI-Agent Track too.)
 
 ### Practitioner tutorials
diff --git a/diff_diff/__init__.py b/diff_diff/__init__.py
index cdee4ec9..c129f694 100644
--- a/diff_diff/__init__.py
+++ b/diff_diff/__init__.py
@@ -202,6 +202,16 @@
     plot_synth_weights,
 )
 from diff_diff.practitioner import practitioner_next_steps
+from diff_diff.business_report import (
+    BUSINESS_REPORT_SCHEMA_VERSION,
+    BusinessContext,
+    BusinessReport,
+)
+from diff_diff.diagnostic_report import (
+    DIAGNOSTIC_REPORT_SCHEMA_VERSION,
+    DiagnosticReport,
+    DiagnosticReportResults,
+)
 from diff_diff._guides_api import get_llm_guide
 from diff_diff.datasets import (
     clear_cache,
@@ -405,6 +415,12 @@
     "clear_cache",
     # Practitioner guidance
     "practitioner_next_steps",
+    "BusinessReport",
+    "BusinessContext",
+    "BUSINESS_REPORT_SCHEMA_VERSION",
+    "DiagnosticReport",
+    "DiagnosticReportResults",
+    "DIAGNOSTIC_REPORT_SCHEMA_VERSION",
     # LLM guide accessor
     "get_llm_guide",
 ]
diff --git a/diff_diff/business_report.py b/diff_diff/business_report.py
new file mode 100644
index 00000000..2445251f
--- /dev/null
+++ b/diff_diff/business_report.py
@@ -0,0 +1,2458 @@
+"""
+BusinessReport — plain-English stakeholder narrative from any diff-diff result.
+
+Wraps any of the 16 fitted result types and produces:
+
+- ``summary()``: a short paragraph block suitable for an email or Slack message.
+- ``full_report()``: a multi-section markdown report with headline, assumptions,
+  pre-trends, main result, robustness, sample, and an optional academic appendix.
+- ``to_dict()``: an AI-legible structured schema (single source of truth —
+  prose is rendered from this dict, not templated alongside it).
+
+Design principles:
+
+- Plain English, not academic jargon. The library ships this in addition to, not
+  in place of, the estimator's existing ``results.summary()`` academic output.
+- No estimator fitting and no variance re-derivation. Every effect, SE, p-value,
+  CI, and sensitivity bound is either read from ``results`` or produced by an
+  existing diff-diff utility. The report layer does compose a few cross-period
+  summaries from per-period inputs already on the result (joint-Wald / Bonferroni
+  pre-trends p-value, MDV-to-ATT ratio, heterogeneity dispersion over
+  post-treatment effects); see ``docs/methodology/REPORTING.md`` for the full
+  enumeration.
+- Optional business context via keyword args (``outcome_label``, ``outcome_unit``,
+  ``business_question``, ``treatment_label``). Without them, BusinessReport uses
+  generic fallbacks — the zero-config path works.
+- Diagnostic integration is implicit by default: ``BusinessReport(results)``
+  auto-constructs a ``DiagnosticReport`` so the summary can mention pre-trends,
+  robustness, and design-effect findings. Pass ``auto_diagnostics=False`` or an
+  explicit ``diagnostics=`` object to override.
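+
+Minimal usage sketch (``res`` here stands in for any fitted diff-diff result;
+this is the zero-config path with generic labels and an auto-constructed
+``DiagnosticReport``)::
+
+    from diff_diff import BusinessReport
+
+    report = BusinessReport(res)
+    print(report.summary())       # short plain-English paragraph block
+    print(report.full_report())   # multi-section markdown report
+    schema = report.to_dict()     # structured schema (experimental)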
+ +Methodology deviations (no traffic-light gates, pre-trends verdict thresholds, +power-aware phrasing, unit-translation policy, schema stability) are documented +in ``docs/methodology/REPORTING.md``. The ``to_dict()`` schema is marked +experimental in v3.2. +""" + +from __future__ import annotations + +import re +from dataclasses import dataclass +from typing import Any, Dict, FrozenSet, List, Optional, Union + +import numpy as np + +from diff_diff.diagnostic_report import DiagnosticReport, DiagnosticReportResults + +BUSINESS_REPORT_SCHEMA_VERSION = "1.0" + +__all__ = [ + "BusinessReport", + "BusinessContext", + "BUSINESS_REPORT_SCHEMA_VERSION", +] + +# Recognized ``outcome_unit`` values mapped to a coarse "kind" used by the +# formatter. Unrecognized strings are accepted and rendered verbatim without +# arithmetic translation (``unit_kind = "unknown"``). +_UNIT_KINDS: Dict[str, str] = { + "$": "currency", + "usd": "currency", + "%": "percent", + "pp": "percentage_points", + "percentage_points": "percentage_points", + "percent": "percent", + "log_points": "log_points", + "log": "log_points", + "count": "count", + "users": "count", +} + + +@dataclass(frozen=True) +class BusinessContext: + """Frozen bundle of business-framing metadata used when rendering prose. + + Populated from ``BusinessReport`` constructor kwargs. Falls back to + neutral labels when fields are not supplied. + """ + + outcome_label: str + outcome_unit: Optional[str] + outcome_direction: Optional[str] + business_question: Optional[str] + treatment_label: str + alpha: float + + +class BusinessReport: + """Produce a stakeholder-ready narrative from any diff-diff results object. + + Parameters + ---------- + results : Any + A fitted diff-diff results object. Any of the 16 result types is + accepted. ``BaconDecompositionResults`` is not a valid input — Bacon + is a diagnostic, not an estimator; use ``DiagnosticReport`` for that. + outcome_label : str, optional + Stakeholder-friendly outcome name (e.g. ``"Revenue per user"``). + outcome_unit : str, optional + Unit label: ``"$"`` / ``"%"`` / ``"pp"`` / ``"log_points"`` / ``"count"`` + (recognized for formatting) or any free-form string (used verbatim + without arithmetic translation). + outcome_direction : str, optional + ``"higher_is_better"`` or ``"lower_is_better"``. Drives whether the + effect is described as "lift" / "drag" rather than just "increase" / + "decrease". + business_question : str, optional + Question the analysis answers (prepended to the summary). + treatment_label : str, optional + Stakeholder-friendly treatment name (e.g. ``"the campaign"``). + alpha : float, optional + Significance level. Defaults to ``results.alpha`` when not supplied. + Single knob: drives both CI level and significance phrasing. + honest_did_results : HonestDiDResults or SensitivityResults, optional + Pre-computed sensitivity result. When supplied, this is forwarded to + the internal ``DiagnosticReport`` so sensitivity is not re-computed. + auto_diagnostics : bool, default True + When ``True`` and ``diagnostics`` is ``None``, auto-construct a + ``DiagnosticReport``. Set ``False`` to skip diagnostics entirely. + diagnostics : DiagnosticReport or DiagnosticReportResults, optional + Explicit diagnostics object. Takes precedence over ``auto_diagnostics``. + include_appendix : bool, default True + Whether ``full_report()`` appends the estimator's academic + ``results.summary()`` output under a "Technical Appendix" section. 
+ data, outcome, treatment, unit, time, first_treat : optional + Raw panel + column names forwarded to the auto-constructed + ``DiagnosticReport`` so data-dependent checks (2x2 PT on simple + DiD, Bacon-from-scratch, EfficientDiD Hausman pretest) can run. + survey_design : SurveyDesign, optional + The ``SurveyDesign`` object used to fit a survey-weighted + estimator. Forwarded to the auto-constructed ``DiagnosticReport`` + for fit-faithful Goodman-Bacon replay. When the fit carries + ``survey_metadata`` but ``survey_design`` is not supplied, Bacon + is skipped with an explicit reason rather than replaying an + unweighted decomposition for a design that does not match the + estimate. The simple 2x2 parallel-trends helper + (``utils.check_parallel_trends``) has no survey-aware variant; + on a survey-backed ``DiDResults`` it is skipped unconditionally + regardless of ``survey_design``. Supply + ``precomputed={'parallel_trends': ...}`` with a survey-aware + pretest to opt in. See ``docs/methodology/REPORTING.md``. + precomputed : dict, optional + Pre-computed diagnostic objects forwarded to the auto- + constructed ``DiagnosticReport`` (same keys as + ``DiagnosticReport(precomputed=...)``): ``"parallel_trends"``, + ``"sensitivity"``, ``"pretrends_power"``, ``"bacon"``. DR + validates keys and rejects estimator-incompatible entries + (e.g., HonestDiD bounds or generic PT on SDiD / TROP). + ``honest_did_results`` remains a shorthand for ``sensitivity``; + an explicit ``precomputed['sensitivity']`` wins on conflict. + """ + + def __init__( + self, + results: Any, + *, + outcome_label: Optional[str] = None, + outcome_unit: Optional[str] = None, + outcome_direction: Optional[str] = None, + business_question: Optional[str] = None, + treatment_label: Optional[str] = None, + alpha: Optional[float] = None, + honest_did_results: Optional[Any] = None, + auto_diagnostics: bool = True, + diagnostics: Optional[Union[DiagnosticReport, DiagnosticReportResults]] = None, + include_appendix: bool = True, + data: Optional[Any] = None, + outcome: Optional[str] = None, + treatment: Optional[str] = None, + unit: Optional[str] = None, + time: Optional[str] = None, + first_treat: Optional[str] = None, + survey_design: Optional[Any] = None, + precomputed: Optional[Dict[str, Any]] = None, + ): + if type(results).__name__ == "BaconDecompositionResults": + raise TypeError( + "BaconDecompositionResults is a diagnostic, not an estimator; " + "wrap the underlying estimator with BusinessReport and pass the " + "Bacon object to DiagnosticReport(precomputed={'bacon': ...})." + ) + + if diagnostics is not None and not isinstance( + diagnostics, (DiagnosticReport, DiagnosticReportResults) + ): + raise TypeError( + "diagnostics= must be a DiagnosticReport or " + "DiagnosticReportResults instance; " + f"got {type(diagnostics).__name__}." + ) + + # Estimator-aware validation for ``honest_did_results``. SDiD / + # TROP route robustness to ``estimator_native_diagnostics`` + # (SDiD: ``in_time_placebo``, ``sensitivity_to_zeta_omega``; + # TROP: factor-model fit metrics) and do not accept HonestDiD + # bounds because they are methodology-incompatible with the + # documented native-routing contract in REPORTING.md. Reject + # the passthrough here so it doesn't silently forward to the + # auto-constructed ``DiagnosticReport`` (which now also + # rejects it at construction time — round-21 P1 CI review on + # PR #318). 
+ if honest_did_results is not None and type(results).__name__ in { + "SyntheticDiDResults", + "TROPResults", + }: + raise ValueError( + f"{type(results).__name__} routes robustness to " + "``estimator_native_diagnostics`` — ``honest_did_results`` " + "is not accepted on this estimator because HonestDiD " + "bounds are methodology-incompatible with the native " + "routing documented in REPORTING.md. Use the result " + "object's native diagnostics " + "(SDiD: ``in_time_placebo()``, ``sensitivity_to_zeta_omega()``, " + "``pre_treatment_fit``; TROP: ``effective_rank``, " + "``loocv_score``) — BusinessReport surfaces these " + "automatically under ``estimator_native_diagnostics``." + ) + + # Round-44 P1 CI review on PR #318: mirror the SDiD/TROP + # rejection pattern for ``CallawaySantAnna`` fits with + # ``base_period != "universal"``. HonestDiD Rambachan-Roth + # bounds are not valid for interpretation on the consecutive- + # comparison pre-period surface produced by ``varying`` base, + # so narrating precomputed sensitivity (whether passed as + # ``honest_did_results`` or ``precomputed['sensitivity']``) + # alongside a displayed varying-base fit mixes provenance the + # bounds don't support. DR enforces the same guard at + # construction; BR duplicates the check so the error fires + # before the auto-DR is built, matching the existing + # SDiD/TROP UX. REGISTRY.md §CallawaySantAnna line 410, + # §HonestDiD line 2458. + _cs_with_varying_base = type(results).__name__ == "CallawaySantAnnaResults" and ( + getattr(results, "base_period", "universal") != "universal" + ) + if _cs_with_varying_base: + _rejected_inputs: List[str] = [] + if honest_did_results is not None: + _rejected_inputs.append("honest_did_results") + if precomputed is not None and "sensitivity" in precomputed: + _rejected_inputs.append("precomputed['sensitivity']") + if _rejected_inputs: + _base_period = getattr(results, "base_period", "universal") + raise ValueError( + f"CallawaySantAnnaResults with " + f"``base_period={_base_period!r}`` cannot be " + "summarized alongside a precomputed HonestDiD " + "sensitivity object. The Rambachan-Roth bounds are " + "not valid for interpretation on the consecutive-" + "comparison pre-period surface this base yields " + "(REGISTRY.md §CallawaySantAnna / §HonestDiD). " + "Rejected inputs: " + ", ".join(_rejected_inputs) + ". " + "Re-fit the main estimator with " + "``CallawaySantAnna(base_period='universal')`` " + "before passing precomputed sensitivity, or drop " + "the sensitivity passthrough to let BR skip the " + "section with a methodology-critical reason." + ) + + self._results = results + self._honest_did_results = honest_did_results + self._auto_diagnostics = auto_diagnostics + self._diagnostics_arg = diagnostics + self._include_appendix = include_appendix + # Raw-data passthrough so the auto-constructed DR can run + # data-dependent checks (2x2 PT on simple DiD, Bacon-from- + # scratch on staggered estimators, EfficientDiD Hausman + # pretest). Without these, the auto path silently skips those + # checks (round-12 CI review on PR #318). + self._dr_data = data + self._dr_outcome = outcome + self._dr_treatment = treatment + self._dr_unit = unit + self._dr_time = time + self._dr_first_treat = first_treat + # Round-40 P1 CI review on PR #318: survey-backed fits need + # the ``SurveyDesign`` threaded through to the auto-constructed + # DR so Bacon decomposition is fit-faithful and the 2x2 PT + # skip path triggers for DiDResults with ``survey_metadata``. 
+ # Without this passthrough, the auto path silently replays an + # unweighted decomposition / PT verdict for a weighted fit. + self._dr_survey_design = survey_design + # Round-43 P2 CI review on PR #318: BR docs and docstrings + # advertised a ``precomputed={'parallel_trends': ...}`` opt-in + # for survey-aware 2x2 PT and other escape hatches, but BR did + # not actually accept a ``precomputed=`` kwarg — the auto path + # only synthesized ``{"sensitivity": honest_did_results}``, so + # callers following the BR docs hit a ``TypeError`` on + # ``__init__``. Accept the passthrough here and forward every + # key to the auto-constructed DR (which owns validation against + # its implemented-key set and estimator-aware rejection rules). + # ``honest_did_results`` still feeds into ``sensitivity`` as a + # convenience; an explicit ``precomputed['sensitivity']`` wins + # on conflict. + self._dr_precomputed: Dict[str, Any] = dict(precomputed or {}) + # Round-43 P2 CI review on PR #318: mirror DR's eager key + # validation so users get the "unsupported key" error at BR + # construction rather than lazily when the DR is built inside + # ``to_dict()``. Kept in sync with ``DiagnosticReport``'s + # ``_supported_precomputed`` set; the cheapest way to avoid + # drift would be to import the set, but DR currently scopes it + # locally to ``__init__`` so mirror the literal here with a + # pointer comment. + _br_supported_precomputed = { + "parallel_trends", + "sensitivity", + "pretrends_power", + "bacon", + } + _br_unsupported = set(self._dr_precomputed) - _br_supported_precomputed + if _br_unsupported: + raise ValueError( + "precomputed= contains keys that are not implemented: " + f"{sorted(_br_unsupported)}. Supported keys: " + f"{sorted(_br_supported_precomputed)}. ``design_effect``, " + "``heterogeneity``, and ``epv`` are read directly from the " + "fitted result and do not accept precomputed overrides." 
+ ) + + resolved_alpha = alpha if alpha is not None else getattr(results, "alpha", 0.05) + self._context = BusinessContext( + outcome_label=outcome_label or "the outcome", + outcome_unit=outcome_unit, + outcome_direction=outcome_direction, + business_question=business_question, + treatment_label=treatment_label or "the treatment", + alpha=float(resolved_alpha), + ) + + self._cached_schema: Optional[Dict[str, Any]] = None + + # -- Public API --------------------------------------------------------- + + def to_dict(self) -> Dict[str, Any]: + """Return the AI-legible structured schema (single source of truth).""" + if self._cached_schema is None: + self._cached_schema = self._build_schema() + return self._cached_schema + + def to_json(self, *, indent: int = 2) -> str: + """Return ``to_dict()`` serialized as JSON.""" + import json + + return json.dumps(self.to_dict(), indent=indent) + + def summary(self) -> str: + """Return a short plain-English paragraph block (6-10 sentences).""" + return _render_summary(self.to_dict()) + + def full_report(self) -> str: + """Return a structured multi-section markdown report.""" + base = _render_full_report(self.to_dict()) + if self._include_appendix: + try: + appendix = self._results.summary() + except Exception: # noqa: BLE001 + appendix = None + if appendix: + base = base + "\n\n## Technical Appendix\n\n```\n" + str(appendix) + "\n```\n" + return base + + def export_markdown(self) -> str: + """Alias for ``full_report()`` (discoverability).""" + return self.full_report() + + def headline(self) -> str: + """Return just the headline sentence.""" + return _render_headline_sentence(self.to_dict()) + + def caveats(self) -> List[Dict[str, str]]: + """Return the list of structured caveats (severity + topic + message).""" + return list(self.to_dict().get("caveats", [])) + + def __repr__(self) -> str: + estimator = type(self._results).__name__ + headline = self.to_dict().get("headline") or {} + val = headline.get("effect") + if isinstance(val, (int, float)) and np.isfinite(val): + return f"BusinessReport(results={estimator}, effect={val:.3g})" + return f"BusinessReport(results={estimator})" + + def __str__(self) -> str: + return self.summary() + + # -- Implementation detail --------------------------------------------- + + def _resolve_diagnostics(self) -> Optional[DiagnosticReportResults]: + """Return the DiagnosticReportResults to embed, or ``None`` if skipped.""" + if self._diagnostics_arg is not None: + if isinstance(self._diagnostics_arg, DiagnosticReportResults): + return self._diagnostics_arg + if isinstance(self._diagnostics_arg, DiagnosticReport): + return self._diagnostics_arg.run_all() + raise TypeError("diagnostics= must be a DiagnosticReport or DiagnosticReportResults") + if not self._auto_diagnostics: + return None + # Round-43 P2 CI review on PR #318: forward the user's + # ``precomputed`` dict through to DR. ``honest_did_results`` + # stays a convenience shortcut for ``sensitivity`` only; an + # explicit ``precomputed['sensitivity']`` from the caller + # wins. DR handles key validation (rejects unsupported keys + # and estimator-incompatible sensitivities / parallel_trends + # entries) so BR just merges and forwards. 
+ precomputed: Dict[str, Any] = dict(self._dr_precomputed) + if self._honest_did_results is not None: + precomputed.setdefault("sensitivity", self._honest_did_results) + dr = DiagnosticReport( + self._results, + alpha=self._context.alpha, + precomputed=precomputed or None, + outcome_label=self._context.outcome_label, + treatment_label=self._context.treatment_label, + data=self._dr_data, + outcome=self._dr_outcome, + treatment=self._dr_treatment, + unit=self._dr_unit, + time=self._dr_time, + first_treat=self._dr_first_treat, + survey_design=self._dr_survey_design, + ) + return dr.run_all() + + def _build_schema(self) -> Dict[str, Any]: + """Assemble the structured schema. + + Pulls validation content (PT, sensitivity, Bacon, DEFF, EPV, ...) from + the internal ``DiagnosticReport``; extracts the stakeholder-facing + headline and sample metadata from the fitted result itself. + """ + estimator_name = type(self._results).__name__ + diagnostics_results = self._resolve_diagnostics() + dr_schema: Optional[Dict[str, Any]] = ( + diagnostics_results.schema if diagnostics_results is not None else None + ) + + headline = self._extract_headline(dr_schema) + sample = self._extract_sample() + heterogeneity = _lift_heterogeneity(dr_schema) + pre_trends = _lift_pre_trends(dr_schema) + sensitivity = _lift_sensitivity(dr_schema) + robustness = _lift_robustness(dr_schema) + assumption = _apply_anticipation_to_assumption( + _describe_assumption(estimator_name, self._results), + self._results, + ) + next_steps = (dr_schema or {}).get("next_steps", []) + caveats = _build_caveats(self._results, headline, sample, dr_schema) + references = _references_for(estimator_name) + + if diagnostics_results is None: + diagnostics_block: Dict[str, Any] = { + "status": "skipped", + "reason": "auto_diagnostics=False", + } + else: + diagnostics_block = { + "status": "ran", + "schema": dr_schema, + "overall_interpretation": ( + dr_schema.get("overall_interpretation", "") if dr_schema is not None else "" + ), + } + + return { + "schema_version": BUSINESS_REPORT_SCHEMA_VERSION, + "estimator": { + "class_name": estimator_name, + "display_name": estimator_name, + }, + "context": { + "outcome_label": self._context.outcome_label, + "outcome_unit": self._context.outcome_unit, + "outcome_direction": self._context.outcome_direction, + "business_question": self._context.business_question, + "treatment_label": self._context.treatment_label, + "alpha": self._context.alpha, + }, + "headline": headline, + "assumption": assumption, + "pre_trends": pre_trends, + "sensitivity": sensitivity, + "sample": sample, + "heterogeneity": heterogeneity, + "robustness": robustness, + "diagnostics": diagnostics_block, + "next_steps": next_steps, + "caveats": caveats, + "references": references, + } + + def _extract_headline(self, dr_schema: Optional[Dict[str, Any]]) -> Dict[str, Any]: + """Extract the headline effect + CI + p-value from the result.""" + r = self._results + # Delegate the attribute-alias lookup to the shared helper in the + # diagnostic_report module so BR and DR agree on which fields a + # result class exposes for its headline (including + # ``ContinuousDiDResults`` which uses ``overall_att_se`` / + # ``overall_att_p_value`` / ``overall_att_conf_int``). 
+ from diff_diff.diagnostic_report import _extract_scalar_headline + + extracted = _extract_scalar_headline(r, fallback_alpha=self._context.alpha) + att: Optional[float] = None + se: Optional[float] = None + p: Optional[float] = None + ci: Optional[List[float]] = None + alpha = self._context.alpha + result_alpha: Optional[float] = None + if extracted is not None: + _name, att, se, p, ci, result_alpha = extracted + + # On any alpha mismatch, preserve the fitted CI at its native + # level. A faithful CI cannot be recomputed from point estimate + # and SE alone without reproducing the fit's inference contract + # (finite-df t-quantile, percentile bootstrap, wild cluster + # bootstrap, survey replicate quantile, rank-deficient + # undefined-df, etc.), and the 16 result classes do not expose + # a uniform descriptor for that. Two separate alpha values: + # ``display_alpha`` drives ``ci_level`` so the displayed CI + # label matches the preserved bounds; the caller's requested + # alpha drives the significance phrasing (``is_significant`` / + # ``near_threshold``). A caveat records the override. + display_alpha = alpha + phrasing_alpha = alpha + alpha_was_honored = True + alpha_override_caveat: Optional[str] = None + if ( + result_alpha is not None + and not np.isclose(alpha, result_alpha) + and att is not None + and se is not None + ): + inference_method = getattr(r, "inference_method", "analytical") + if inference_method == "wild_bootstrap": + inference_label = "wild cluster bootstrap" + elif ( + inference_method == "bootstrap" or getattr(r, "bootstrap_results", None) is not None + ): + inference_label = "bootstrap" + elif getattr(r, "bootstrap_distribution", None) is not None: + inference_label = "bootstrap" + elif getattr(r, "variance_method", None) in {"bootstrap", "jackknife", "placebo"}: + variance_method = getattr(r, "variance_method", None) + inference_label = f"{variance_method} variance" + else: + df_survey = getattr( + r, + "df_survey", + getattr(getattr(r, "survey_metadata", None), "df_survey", None), + ) + if isinstance(df_survey, (int, float)) and df_survey > 0: + inference_label = "finite-df survey" + elif isinstance(df_survey, (int, float)) and df_survey == 0: + # Rank-deficient replicate design: the fit deliberately + # left inference undefined. Preserve (NaN bounds remain NaN). + inference_label = "undefined-df (replicate-weight)" + else: + # Ordinary analytical fit with a finite but unexposed + # ``df`` (``DifferenceInDifferences`` / ``MultiPeriodDiD`` + # / most staggered estimators / TROP). We cannot + # reproduce the t-quantile without the fit's ``df``. + inference_label = "analytical (native degrees of freedom)" + + display_alpha = float(result_alpha) + alpha_was_honored = False + alpha_override_caveat = ( + f"Requested alpha ({phrasing_alpha:.2f}) was not honored " + f"for the confidence interval because this fit uses " + f"{inference_label} inference; the displayed CI remains " + f"at the fit's native level " + f"({int(round((1.0 - result_alpha) * 100))}%). The " + f"significance phrasing still uses the requested alpha." 
+ ) + + unit = self._context.outcome_unit + unit_kind = _UNIT_KINDS.get(unit.lower() if unit else "", "unknown") + sign = ( + "positive" + if (att is not None and att > 0) + else ( + "negative" + if (att is not None and att < 0) + else ("null" if att == 0 else "undefined") + ) + ) + if att is None or not np.isfinite(att): + sign = "undefined" + ci_level = int(round((1.0 - display_alpha) * 100)) + is_significant = ( + p is not None and np.isfinite(p) and p < phrasing_alpha if p is not None else False + ) + near_threshold = ( + p is not None + and np.isfinite(p) + and (phrasing_alpha - 0.01) < p < (phrasing_alpha + 0.001) + ) + # Use DR-computed breakdown_M if available for quick reference. + breakdown_M: Optional[float] = None + if dr_schema: + sens_section = dr_schema.get("sensitivity") or {} + if sens_section.get("status") == "ran": + breakdown_M = sens_section.get("breakdown_M") + + return { + "effect": att, + "se": se, + "ci_lower": ci[0] if ci else None, + "ci_upper": ci[1] if ci else None, + "alpha_was_honored": alpha_was_honored, + "alpha_override_caveat": alpha_override_caveat, + "ci_level": ci_level, + "p_value": p, + "is_significant": is_significant, + "near_significance_threshold": near_threshold, + "unit": unit, + "unit_kind": unit_kind, + "sign": sign, + "breakdown_M": breakdown_M, + } + + def _extract_sample(self) -> Dict[str, Any]: + """Extract sample metadata from the fitted result.""" + r = self._results + survey = self._extract_survey_block() + n_treated = _safe_int(getattr(r, "n_treated", getattr(r, "n_treated_units", None))) + n_control_units = _safe_int(getattr(r, "n_control", getattr(r, "n_control_units", None))) + + # Control-group semantics. For estimators that expose a + # ``control_group`` kwarg (CS, EfficientDiD, ContinuousDiD, + # StaggeredTripleDiff, ...), the meaning of ``n_control_units`` + # depends on it. When the mode is "not-yet-treated" (dynamic + # comparison set), the fixed tally stored on the result is only + # the fully-untreated subset — the actual comparison set varies + # by (g, t) cell. Label the exposed count accordingly so prose + # surfaces the dynamic context instead of misreporting + # "0 control" (round-13 / round-17 / round-18 CI review). + # + # Canonicalize both ``"not_yet_treated"`` (CS / EfficientDiD / + # ContinuousDiD / Wooldridge) and ``"notyettreated"`` + # (StaggeredTripleDiff) as the same dynamic mode. + # + # Per-estimator fixed-subset field: + # * CS / SA / Imputation / TwoStage / EfficientDiD / + # dCDH / ContinuousDiD — ``n_control_units`` is the + # never-treated tally; surface as ``n_never_treated``. + # * StaggeredTripleDiff — ``n_control_units`` is a composite + # total; the fixed subset is ``n_never_enabled`` (stored + # separately on the result). + # * Wooldridge — ``n_control_units`` is total eligible + # comparisons (never-treated + future-treated) and does not + # map to a never-treated count. Keep on the fixed-count + # path even in dynamic mode. + # * Stacked — ``n_control_units`` is "distinct control units + # across the trimmed set" (stacked_did_results.py L59-62). + # Under ``clean_control="not_yet_treated"``, the trimmed + # set uses the rule ``A_s > a + kappa_post`` which admits + # future-treated controls; it is NOT a never-treated tally + # and cannot be relabeled as ``n_never_treated``. Keep + # Stacked on the fixed-count path (round-21 P1 CI review + # on PR #318 flagged the earlier relabeling as a + # semantic-contract violation). 
+ control_group = _control_group_choice(r) + name = type(r).__name__ + n_never_treated: Optional[int] = None + n_never_enabled: Optional[int] = None + n_control: Optional[int] = n_control_units + _never_treated_count_contract = name in { + "CallawaySantAnnaResults", + "SunAbrahamResults", + "ImputationDiDResults", + "TwoStageDiDResults", + "EfficientDiDResults", + "ChaisemartinDHaultfoeuilleResults", + "ContinuousDiDResults", + } + _canonical_control = ( + control_group.replace("_", "").lower() if isinstance(control_group, str) else None + ) + # Stacked has two dynamic (sub-experiment-specific) modes: + # ``not_yet_treated`` (A_s > a + kappa_post) and ``strict`` + # (A_s > a + kappa_post + kappa_pre). Only ``never_treated`` + # (A_s = infinity) is a fixed never-treated pool. Round-22 P1 + # CI review on PR #318 flagged that ``strict`` was being + # misrendered as a fixed control design. + is_stacked_dynamic = name == "StackedDiDResults" and _canonical_control in { + "notyettreated", + "strict", + } + is_dynamic_control = _canonical_control == "notyettreated" or is_stacked_dynamic + # StaggeredTripleDiff comparison-group contract: + # ``n_control_units`` is a composite total that also includes + # the eligibility-denied / larger-cohort cells. Regardless of + # the ``control_group`` mode the valid fixed comparison is the + # never-enabled cohort (``staggered_triple_diff.py:384``, + # REGISTRY.md §StaggeredTripleDifference line 1730). Round-37 + # P1 CI review on PR #318: under ``control_group="never_treated"`` + # (i.e., ``_canonical_control == "nevertreated"``) the composite + # total was being narrated as "control". Surface + # ``n_never_enabled`` instead on both the ``nevertreated`` and + # the dynamic ``notyettreated`` modes. + if name == "StaggeredTripleDiffResults" and _canonical_control == "nevertreated": + n_never_enabled = _safe_int(getattr(r, "n_never_enabled", None)) + n_control = None + if is_dynamic_control: + if name == "StaggeredTripleDiffResults": + n_never_enabled = _safe_int(getattr(r, "n_never_enabled", None)) + n_control = None + elif name == "StackedDiDResults": + # ``n_control_units`` is "distinct control units across + # the trimmed set" (stacked_did_results.py L59-62) which + # includes future-treated controls by construction under + # both dynamic modes. Do NOT relabel as + # ``n_never_treated``; instead surface the count under + # ``n_distinct_controls_trimmed`` (sub-experiment- + # specific context) and clear ``n_control`` so the + # report does not narrate a fixed control pool. + n_control = None + elif _never_treated_count_contract: + n_never_treated = n_control_units + n_control = None + + # Panel-vs-RCS count semantics. CallawaySantAnnaResults stores + # treated/control counts as OBSERVATIONS (not units) when the + # fit used ``panel=False`` — ``staggered_results.py L183-L184`` + # renders those counts as "obs:" rather than "units:". BR + # previously labeled them as "units" / "present in the panel", + # which misstates the sample composition for repeated cross- + # section fits. Carry the flag into the schema so rendering can + # branch. Round-28 P2 CI review on PR #318. 
+ count_unit = "observations" if getattr(r, "panel", True) is False else "units" + + sample_block: Dict[str, Any] = { + "n_obs": _safe_int(getattr(r, "n_obs", None)), + "n_treated": n_treated, + "n_control": n_control, + "n_never_treated": n_never_treated, + "control_group": control_group if isinstance(control_group, str) else None, + "dynamic_control": is_dynamic_control, + "n_periods": _safe_int(getattr(r, "n_periods", None)), + "pre_periods": _safe_list_len(getattr(r, "pre_periods", None)), + "post_periods": _safe_list_len(getattr(r, "post_periods", None)), + "count_unit": count_unit, + "survey": survey, + } + if n_never_enabled is not None: + sample_block["n_never_enabled"] = n_never_enabled + # Stacked-specific: surface the distinct-control-units tally on a + # dedicated key so agents see the sub-experiment-specific + # comparison count without misreading it as a never-treated + # subset (round-21 / round-22 CI review). + if name == "StackedDiDResults": + sample_block["n_distinct_controls_trimmed"] = n_control_units + return sample_block + + def _extract_survey_block(self) -> Optional[Dict[str, Any]]: + sm = getattr(self._results, "survey_metadata", None) + if sm is None: + return None + deff = _safe_float(getattr(sm, "design_effect", None)) + return { + "weight_type": getattr(sm, "weight_type", None), + "effective_n": _safe_float(getattr(sm, "effective_n", None)), + "design_effect": deff, + # Round-43 P2 CI review on PR #318: the ``is_trivial`` + # upper bound matches DR's ``_check_design_effect`` and + # REPORTING.md's ``trivial`` band definition + # ``0.95 <= deff < 1.05`` (half-open). The prior closed + # interval ``<= 1.05`` produced ``is_trivial=True`` at + # exactly ``deff == 1.05`` while the DR schema emitted + # ``band_label="slightly_reduces"`` for the same value, + # suppressing BR's non-trivial prose at that boundary. 
+ "is_trivial": deff is not None and 0.95 <= deff < 1.05, + "n_strata": _safe_int(getattr(sm, "n_strata", None)), + "n_psu": _safe_int(getattr(sm, "n_psu", None)), + "df_survey": _safe_int(getattr(sm, "df_survey", None)), + "replicate_method": getattr(sm, "replicate_method", None), + } + + +# --------------------------------------------------------------------------- +# Schema helpers (module-private) +# --------------------------------------------------------------------------- +def _safe_float(val: Any) -> Optional[float]: + if val is None: + return None + try: + return float(val) + except (TypeError, ValueError): + return None + + +def _safe_int(val: Any) -> Optional[int]: + if val is None: + return None + try: + return int(val) + except (TypeError, ValueError): + return None + + +def _safe_ci(ci: Any) -> Optional[List[float]]: + if ci is None: + return None + try: + lo, hi = ci + except (TypeError, ValueError): + return None + lo_f = _safe_float(lo) + hi_f = _safe_float(hi) + if lo_f is None or hi_f is None: + return None + return [lo_f, hi_f] + + +def _safe_list_len(val: Any) -> Optional[int]: + if val is None: + return None + try: + return int(len(val)) + except TypeError: + return None + + +def _lift_pre_trends(dr: Optional[Dict[str, Any]]) -> Dict[str, Any]: + """Pull pre-trends + power into a single BR-facing block.""" + if dr is None: + return {"status": "skipped", "reason": "auto_diagnostics=False"} + pt = dr.get("parallel_trends") or {} + pp = dr.get("pretrends_power") or {} + if pt.get("status") != "ran": + return { + "status": pt.get("status", "not_run"), + "reason": pt.get("reason"), + } + return { + "status": "computed", + "method": pt.get("method"), + "joint_p_value": pt.get("joint_p_value"), + "verdict": pt.get("verdict"), + "n_pre_periods": pt.get("n_pre_periods"), + # Preserve DR's inconclusive-PT provenance on the BR schema so + # downstream consumers (and BR's own summary renderer) see the + # undefined-row count and DR's detailed reason without having + # to re-consult the DR schema (round-39 P3 CI review on PR + # #318). These fields are populated only when + # ``verdict == "inconclusive"`` per ``_pt_event_study``'s + # inconclusive branch (``diagnostic_report.py:999``). + "n_dropped_undefined": pt.get("n_dropped_undefined"), + "reason": pt.get("reason"), + # Carry the denominator df through when the survey F-reference + # branch was used so BR consumers can flag the finite-sample + # correction without re-consulting the DR schema (round-28 P3 + # CI review on PR #318). + "df_denom": pt.get("df_denom"), + "power_status": pp.get("status"), + # Dedicated reason field so schema consumers see the fallback + # explanation when ``compute_pretrends_power`` cannot run + # (``status in {"skipped", "error", "not_applicable"}``). + # REPORTING.md lines 118-125 promise this provenance; round-29 + # P3 CI review on PR #318 flagged that only the enum status was + # being exposed and the reason was dropped at the lift boundary. + # ``power_status`` stays the machine-readable enum; ``power_reason`` + # carries the plain-English explanation. + "power_reason": pp.get("reason"), + "power_tier": pp.get("tier"), + "mdv": pp.get("mdv"), + "mdv_share_of_att": pp.get("mdv_share_of_att"), + # Carry the covariance-source annotation through so BR can hedge the + # power-tier phrasing when compute_pretrends_power silently used a + # diagonal fallback despite event_study_vcov being available. 
+ "power_covariance_source": pp.get("covariance_source"), + } + + +def _lift_sensitivity(dr: Optional[Dict[str, Any]]) -> Dict[str, Any]: + if dr is None: + return {"status": "skipped", "reason": "auto_diagnostics=False"} + sens = dr.get("sensitivity") or {} + if sens.get("status") != "ran": + # Preserve ``method`` through to the BR schema so downstream + # consumers can distinguish a native-routed skip + # (``method="estimator_native"`` for SDiD / TROP, where + # robustness is covered by the native battery) from a + # methodology-blocked skip (e.g., CS with + # ``base_period='varying'``). Without it, agents reading the BR + # schema alone cannot tell these cases apart and would have to + # re-consult the DR schema to disambiguate. + return { + "status": sens.get("status", "not_run"), + "reason": sens.get("reason"), + "method": sens.get("method"), + } + return { + "status": "computed", + "method": sens.get("method"), + "breakdown_M": sens.get("breakdown_M"), + "conclusion": sens.get("conclusion"), + "grid": sens.get("grid"), + } + + +def _lift_heterogeneity(dr: Optional[Dict[str, Any]]) -> Dict[str, Any]: + """Return the heterogeneity section of the BR schema. + + Round-31 P2 CI review on PR #318: the lift previously returned + ``None`` on any non-``ran`` path, which broke the schema contract + that every top-level BR key resolves to a dict with a ``status`` + field. Downstream consumers had to special-case this one section. + Now returns a dict-shaped ``{"status": ..., "reason": ...}`` block + mirroring DR's own status enum so ``schema["heterogeneity"] + ["status"]`` is always readable. + """ + if dr is None: + return {"status": "skipped", "reason": "auto_diagnostics=False"} + het = dr.get("heterogeneity") or {} + status = het.get("status") + if status != "ran": + return { + "status": status or "not_run", + "reason": het.get("reason"), + } + return { + "status": "ran", + "source": het.get("source"), + "n_effects": het.get("n_effects"), + "min": het.get("min"), + "max": het.get("max"), + "cv": het.get("cv"), + "sign_consistent": het.get("sign_consistent"), + } + + +def _lift_robustness(dr: Optional[Dict[str, Any]]) -> Dict[str, Any]: + if dr is None: + return {"status": "skipped", "reason": "auto_diagnostics=False"} + bacon = dr.get("bacon") or {} + native = dr.get("estimator_native_diagnostics") or {} + return { + "bacon": { + "status": bacon.get("status"), + "forbidden_weight": bacon.get("forbidden_weight"), + "verdict": bacon.get("verdict"), + }, + "estimator_native": { + "status": native.get("status"), + "pre_treatment_fit": native.get("pre_treatment_fit"), + }, + } + + +def _anticipation_periods(results: Any) -> int: + """Return the non-negative anticipation-period count from a result, or 0. + + Helper for ``_describe_assumption``. Anticipation-capable estimators + (MultiPeriodDiD, CS, SA, ImputationDiD, TwoStageDiD, Stacked, EfficientDiD, + StaggeredTripleDiff, ContinuousDiD, Wooldridge) expose ``anticipation`` + as an int defaulting to ``0``. + """ + a = getattr(results, "anticipation", 0) + try: + k = int(a) + except (TypeError, ValueError): + return 0 + return k if k > 0 else 0 + + +def _control_group_choice(results: Any) -> Optional[str]: + """Return the control-group choice string for a fitted result, normalized + across estimator-specific attribute names. + + Most anticipation-capable estimators expose the control-group choice as + ``results.control_group``. 
``StackedDiDResults`` exposes the same choice + as ``clean_control`` (the public Wing-Freedman-Hollingsworth-2024 kwarg + name). Without this alias, a StackedDiD fit with + ``clean_control="not_yet_treated"`` would surface as ``control_group=None`` + in the business-report schema, and the dynamic-control branch in + ``_extract_sample`` would never fire. + """ + cg = getattr(results, "control_group", None) + if isinstance(cg, str): + return cg + if type(results).__name__ == "StackedDiDResults": + clean = getattr(results, "clean_control", None) + if isinstance(clean, str): + return clean + return None + + +_STRICT_NO_ANTICIPATION_PATTERNS = ( + # Ordered from most specific to least specific so the first match + # wins on strings that could match multiple patterns. Matches are + # case-sensitive because every occurrence in ``_describe_assumption`` + # is a fixed canonical phrase. + ", plus no anticipation", + "plus no anticipation", + " Also assumes no anticipation (Assumption NA), overlap " + "(Assumption O), and absorbing / irreversible treatment.", + " Also assumes no anticipation.", + "Also assumes no anticipation.", + " and no anticipation", +) + + +def _strip_strict_no_anticipation(desc: str) -> str: + """Remove any strict no-anticipation phrasing from ``desc``. + + Several base assumption descriptions in ``_describe_assumption`` + hard-code a strict "plus no anticipation" / "Also assumes no + anticipation" clause (CS / SA / Imputation / TwoStage / Wooldridge + generic, StackedDiD sub-experiment, EfficientDiD PT-Post, EfficientDiD + PT-All, ContinuousDiD, TripleDifference, SyntheticDiD, TROP, dCDH, + and the fallback unconditional branch). When a fit actually allows + anticipation the helper must REPLACE that wording, not append a + contradictory clause on top of it. Round-30 P1 CI review on PR #318. + """ + if not desc: + return desc + out = desc + for pattern in _STRICT_NO_ANTICIPATION_PATTERNS: + out = out.replace(pattern, "") + # Collapse any doubled whitespace or dangling punctuation left by + # the removal (e.g., "cohorts, with..." -> "cohorts, with..."; + # "cohorts . " -> "cohorts."). + out = re.sub(r"\s+\.", ".", out) + out = re.sub(r"\s+,", ",", out) + out = re.sub(r" {2,}", " ", out) + return out.strip() + + +def _apply_anticipation_to_assumption(block: Dict[str, Any], results: Any) -> Dict[str, Any]: + """If the fit used ``anticipation > 0``, flip ``no_anticipation`` off, + strip any strict no-anticipation wording from the base description, + and append an anticipation-aware clause. + + Round-17 CI review flagged the strict "plus no anticipation" language + on anticipation-enabled fits. Per REGISTRY.md §CallawaySantAnna lines + 355-395 and the matching sections for SA / MultiPeriod / Wooldridge / + EfficientDiD, a fit with ``anticipation=k`` shifts the effective + treatment boundary by ``k`` pre-periods; the identifying assumption + becomes "no treatment effects earlier than ``k`` periods before the + treatment start" rather than strict no-anticipation. Round-30 CI + review caught that the previous implementation only appended — the + resulting prose said both "strict no-anticipation holds" and + "anticipation is allowed" in the same paragraph. 
+ """ + k = _anticipation_periods(results) + if k <= 0: + return block + block = dict(block) # don't mutate the caller's dict + block["no_anticipation"] = False + block["anticipation_periods"] = k + period_word = "period" if k == 1 else "periods" + clause = ( + f" Anticipation is allowed for the {k} {period_word} immediately " + "before treatment: the identifying contract requires no treatment " + f"effects earlier than {k} {period_word} before the treatment " + "start (not strict no-anticipation)." + ) + desc = block.get("description", "") + if isinstance(desc, str): + block["description"] = _strip_strict_no_anticipation(desc) + clause + return block + + +def _describe_assumption(estimator_name: str, results: Any = None) -> Dict[str, Any]: + """Return the identifying-assumption block for an estimator.""" + if estimator_name in { + "SyntheticDiDResults", + }: + return { + "parallel_trends_variant": "weighted_pt", + "no_anticipation": True, + "description": ( + "Synthetic-Difference-in-Differences identifies the ATT under a " + "weighted parallel-trends analogue: the synthetic control is " + "chosen to match the treated group's pre-period trajectory." + ), + } + if estimator_name in {"TROPResults"}: + return { + "parallel_trends_variant": "factor_model", + "no_anticipation": True, + "description": ( + "TROP uses low-rank factor-model identification rather than a " + "parallel-trends assumption; unobserved heterogeneity is " + "captured through latent factor loadings." + ), + } + if estimator_name == "ContinuousDiDResults": + # Callaway, Goodman-Bacon & Sant'Anna (2024), two-level PT: + # REGISTRY.md §ContinuousDiD > Identification. + return { + "parallel_trends_variant": "dose_pt_or_strong_pt", + "no_anticipation": True, + "description": ( + "ContinuousDiD identifies dose-specific treatment effects " + "under two possible parallel-trends conditions (Callaway, " + "Goodman-Bacon & Sant'Anna 2024). Parallel Trends (PT) " + "assumes untreated potential outcome paths are equal across " + "all dose groups and the untreated group (conditional on " + "dose), identifying ATT(d|d) and the binarized ATT^loc but " + "NOT ATT(d), ACRT, or cross-dose comparisons. Strong " + "Parallel Trends (SPT) additionally rules out selection " + "into dose on the basis of treatment effects and is " + "required to identify the dose-response curve ATT(d), " + "marginal effect ACRT(d), and cross-dose contrasts." + ), + } + if estimator_name in {"TripleDifferenceResults", "StaggeredTripleDiffResults"}: + # Ortiz-Villavicencio & Sant'Anna (2025) — identification is the + # triple-difference cancellation across the 2x2x2 cells, not + # ordinary DiD parallel trends; see REGISTRY.md §TripleDifference + # and §StaggeredTripleDifference. + return { + "parallel_trends_variant": "triple_difference_cancellation", + "no_anticipation": True, + "description": ( + "Triple-difference identification relies on the DDD " + "decomposition (Ortiz-Villavicencio & Sant'Anna 2025): " + "the ATT is recovered from `DDD = DiD_A + DiD_B - DiD_C` " + "across the Group x Period x Eligibility (or Treatment) " + "cells, which differences out group-specific and " + "period-specific unobservables without requiring separate " + "parallel trends to hold between each cell pair. 
The " + "identifying restriction is therefore weaker than ordinary " + "DiD parallel trends but assumes that the residual " + "unobservable component is additively separable across the " + "three dimensions; practical overlap and common-support " + "conditions still apply on the propensity score when " + "covariates are used." + ), + } + if estimator_name == "ChaisemartinDHaultfoeuilleResults": + # de Chaisemartin & D'Haultfoeuille (2020, 2024) — identification is + # transition-based across (joiner, leaver, stable-control) cells + # around each switching period, not a group-time ATT parallel- + # trends restriction. Writing up dCDH as "parallel trends across + # treatment cohorts" was flagged as a source-faithfulness bug in + # PR #318 review; REGISTRY.md §ChaisemartinDHaultfoeuille is + # explicit about the transition-set construction. + # + # Phase-3 features (``controls``, ``trends_linear``, + # ``heterogeneity``) each modify the identifying contract and + # change the estimand from ``DID_l`` to ``DID^X_l`` / + # ``DID^{fd}_l`` / the heterogeneity-test variant. When active, + # append an explicit clause so the description does not + # misrepresent the identifying assumption (the reviewer has + # flagged several parallel source-faithfulness gaps elsewhere + # — explicitly surfacing Phase-3 config matches the per-estimator + # walkthrough pattern). + base_description = ( + "Identification is transition-based (de Chaisemartin & " + "D'Haultfoeuille 2020; dynamic companion 2024). At each " + "switching period, the estimator contrasts joiners " + "(D:0->1), leavers (D:1->0), and stable-treated / " + "stable-untreated control cells that share the same " + "treatment state across adjacent periods, yielding the " + "contemporaneous ``DID_M`` and per-horizon ``DID_l`` / " + "``DID_{g,l}`` building blocks. The identifying " + "restriction is parallel trends within each transition's " + "stable-control cell (not a single group-time ATT PT " + "condition across all cohorts) plus no anticipation; " + "with non-binary treatment the stable-control match is " + "additionally on exact baseline dose ``D_{g,1}``. " + "Reversible treatment is natively supported, unlike the " + "absorbing-treatment designs that rely on a fixed " + "treatment-onset cohort." + ) + has_controls = ( + results is not None and getattr(results, "covariate_residuals", None) is not None + ) + has_trends = ( + results is not None and getattr(results, "linear_trends_effects", None) is not None + ) + has_heterogeneity = ( + results is not None and getattr(results, "heterogeneity_effects", None) is not None + ) + active_parts: List[str] = [] + if has_controls and has_trends: + active_parts.append( + "the estimand is ``DID^{X,fd}_l`` (covariate-residualized " + "first-differences), and identification holds conditional on " + "the covariates entering the first-stage regression and " + "allowing group-specific linear trends" + ) + elif has_controls: + active_parts.append( + "the estimand is ``DID^X_l``, and identification holds " + "conditional on the covariates entering the first-stage " + "residualization" + ) + elif has_trends: + active_parts.append( + "the estimand is ``DID^{fd}_l`` (first-differenced) and the " + "identifying restriction is relaxed to allow group-specific " + "linear pre-trends" + ) + if has_heterogeneity: + active_parts.append("heterogeneity tests ``beta^{het}_l`` are reported per horizon") + if active_parts: + phase3_clause = " Phase-3 configuration: " + "; ".join(active_parts) + "." 
+ base_description = base_description + phase3_clause + return { + "parallel_trends_variant": "transition_based", + "no_anticipation": True, + "description": base_description, + } + if estimator_name == "EfficientDiDResults": + # Chen, Sant'Anna & Xie (2025) — identification is parameterized + # by ``pt_assumption`` ("all" vs "post"). PT-All is the stronger + # regime (PT across all groups/periods, over-identified — paper + # Lemma 2.1), PT-Post the weaker (PT only in post-treatment, + # just-identified reduction to single-baseline DiD per Corollary + # 3.2). Also read ``control_group`` when present (not_yet_treated + # vs last_cohort) to be source-faithful to REGISTRY.md §EfficientDiD + # lines 736-738 and 907. + pt_assumption = getattr(results, "pt_assumption", "all") + control_group = getattr(results, "control_group", None) + # The estimator only accepts ``control_group`` values of + # ``"never_treated"`` (the default) or ``"last_cohort"``. When + # ``last_cohort`` is used, the latest treatment cohort is + # reclassified as a pseudo-never-treated comparison and time + # periods at/after its onset are dropped; describing such a fit + # with generic never-treated language would misstate the + # identifying setup (see REGISTRY.md §EfficientDiD line 908). + is_last_cohort = control_group == "last_cohort" + if pt_assumption == "post": + variant = "pt_post" + if is_last_cohort: + control_clause = ( + "the comparison group is the latest treated cohort " + "reclassified as pseudo-never-treated (periods " + "at/after that cohort's treatment start are " + "dropped)" + ) + else: + control_clause = "the comparison group is never-treated" + description = ( + "Identification under PT-Post (Chen, Sant'Anna & Xie " + "2025): parallel trends holds only in post-treatment " + "periods, " + control_clause + ", and the baseline is period g-1 only. This is the " + "weaker of the two regimes — just-identified and " + "reducing to standard single-baseline DiD (Corollary " + "3.2). Also assumes no anticipation (Assumption NA), " + "overlap (Assumption O), and absorbing / irreversible " + "treatment." + ) + else: + variant = "pt_all" + if is_last_cohort: + baseline_clause = ( + "using the latest treated cohort as a pseudo-never-" + "treated comparison (periods at/after that cohort's " + "treatment start are dropped); any earlier cohort " + "and any pre-treatment period can serve as baseline" + ) + else: + baseline_clause = ( + "using never-treated units as comparison; any " + "not-yet-treated cohort and any pre-treatment period " + "can serve as baseline" + ) + description = ( + "Identification under PT-All (Chen, Sant'Anna & Xie " + "2025): parallel trends holds for all groups and all " + "periods, " + + baseline_clause + + ". The estimator is over-identified (Lemma 2.1), and " + "the paper's optimal combination weights are applied. " + "Also assumes no anticipation (Assumption NA), overlap " + "(Assumption O), and absorbing / irreversible " + "treatment. The Hausman PT-All vs PT-Post pretest " + "(operating on the post-treatment event-study vector " + "ES(e), Theorem A.1) checks whether the stronger " + "PT-All regime is tenable." 
+ ) + block: Dict[str, Any] = { + "parallel_trends_variant": variant, + "no_anticipation": True, + "description": description, + } + if isinstance(control_group, str): + block["control_group"] = control_group + return block + if estimator_name == "StackedDiDResults": + # Wing, Freedman & Hollingsworth (2024) — identification is + # sub-experiment common trends plus the IC1 (event window fits + # within the data range) and IC2 (clean controls exist for the + # event) inclusion conditions, NOT the generic "group-time ATT + # parallel trends" clause used for CS / SA / etc. (round-22 P1 + # CI review on PR #318). The active ``clean_control`` rule + # determines which units qualify as valid controls for each + # adoption event. REGISTRY.md §StackedDiD lines 1189-1193 + # (identification) and 1234-1256 (clean-control rules). + clean_control = getattr(results, "clean_control", None) + if clean_control == "never_treated": + control_clause = ( + "controls are restricted to units that are never treated " + "over the panel (``A_s = infinity``)" + ) + elif clean_control == "strict": + control_clause = ( + "controls for event ``a`` are units satisfying the strict " + "rule ``A_s > a + kappa_post + kappa_pre`` (strictly " + "untreated across the full pre- and post-event window)" + ) + else: + # Default: "not_yet_treated" — A_s > a + kappa_post. + control_clause = ( + "controls for event ``a`` are units satisfying ``A_s > a + " + "kappa_post`` (not yet treated through the end of the " + "event's post-window, so future-treated units can serve " + "as controls for earlier events)" + ) + block: Dict[str, Any] = { + "parallel_trends_variant": "stacked_sub_experiment", + "no_anticipation": True, + "description": ( + "Identification under Stacked DiD (Wing, Freedman & " + "Hollingsworth 2024): within each stacked sub-experiment " + "parallel trends holds between the treated cohort and the " + "corresponding clean-control set over the event window " + "``[-kappa_pre, +kappa_post]``; " + + control_clause + + ". Sub-experiments are restricted by IC1 (the event " + "window fits within the available time range) and IC2 " + "(at least one clean control exists). The aggregate ATT is " + "a weighted sum over sub-experiments, so the common-trends " + "assumption is sub-experiment-specific, not a single " + "panel-wide group-time ATT condition. Also assumes no " + "anticipation." + ), + } + if isinstance(clean_control, str): + block["control_group"] = clean_control + block["clean_control"] = clean_control + return block + if estimator_name == "ImputationDiDResults": + # Borusyak, Jaravel & Spiess (2024) — identification is through + # an untreated-potential-outcome model: unit+time FE (optionally + # plus covariates) fitted on untreated observations only + # (``Omega_0``) deliver the counterfactual ``Y_it(0)``, and the + # treatment effect ``tau_it`` is the residual on treated + # observations. Writing this as generic "group-time ATT + # parallel trends" misstates the identifying model — the + # restriction is on the UNTREATED outcome's additive FE + # structure, not on cohort-time ATT equality. REGISTRY.md + # §ImputationDiD lines 1000-1013 and Assumption 1 (parallel + # trends) + Assumption 2 (no anticipation on untreated + # observations). Round-42 P1 CI review on PR #318 flagged this + # source-faithfulness gap. 
+ return { + "parallel_trends_variant": "untreated_outcome_fe_model", + "no_anticipation": True, + "description": ( + "Identification under Imputation DiD (Borusyak, Jaravel " + "& Spiess 2024): the untreated potential outcome " + "``Y_it(0)`` follows an additive unit+time fixed-effects " + "model ``Y_it(0) = alpha_i + beta_t [+ X'_it * delta] + " + "epsilon_it``. Step 1 estimates those FE on untreated " + "observations only (``Omega_0`` = never-treated plus " + "not-yet-treated cells); Step 2 imputes the " + "counterfactual for treated observations from the " + "fitted FE; Step 3 aggregates ``tau_hat_it = Y_it - " + "Y_hat_it(0)`` with researcher-chosen weights. The " + "identifying restriction is therefore parallel trends " + "of the UNTREATED outcome model (Assumption 1) — " + "``E[Y_it(0)] = alpha_i + beta_t``, holding across all " + "observations — rather than equality of cohort-time " + "ATTs. Also assumes no anticipation on untreated " + "observations (Assumption 2) and absorbing treatment." + ), + } + if estimator_name == "TwoStageDiDResults": + # Gardner (2022) — identification is the same as BJS + # ImputationDiD (point estimates are algebraically equivalent + # per REGISTRY.md §TwoStageDiD line 1130): unit+time FE + # estimated on untreated observations only deliver the + # untreated potential-outcome trajectory; Stage 2 regresses + # the resulting residuals on treatment indicators. Writing + # this as generic "group-time ATT parallel trends" loses the + # load-bearing detail that Stage 1 operates only on untreated + # cells. REGISTRY.md §TwoStageDiD lines 1113-1128 and + # Assumption (same as ImputationDiD). Round-42 P1 CI review on + # PR #318 flagged this source-faithfulness gap. + return { + "parallel_trends_variant": "untreated_outcome_fe_model", + "no_anticipation": True, + "description": ( + "Identification under Two-Stage DiD (Gardner 2022): " + "Stage 1 fits unit + time fixed effects on untreated " + "observations only (``Omega_0``), residualizing the " + "outcome as ``y_tilde_it = Y_it - alpha_hat_i - " + "beta_hat_t``; Stage 2 regresses residualized outcomes " + "on the treatment indicator across treated observations " + "to recover the ATT. The point estimates are " + "algebraically equivalent to Borusyak-Jaravel-Spiess " + "imputation (both rely on the same untreated-outcome FE " + "model to construct the counterfactual). The " + "identifying restriction is therefore parallel trends " + "of the UNTREATED outcome: ``E[Y_it(0)] = alpha_i + " + "beta_t`` for all observations (not a group-time ATT " + "equality across cohorts). Also assumes no anticipation " + "(``Y_it = Y_it(0)`` for all untreated observations) " + "and absorbing / irreversible treatment." + ), + } + if estimator_name in { + "CallawaySantAnnaResults", + "SunAbrahamResults", + "WooldridgeDiDResults", + }: + return { + "parallel_trends_variant": "conditional_or_group_time", + "no_anticipation": True, + "description": ( + "Identification relies on parallel trends across treatment " + "cohorts and time periods (group-time ATT), plus no " + "anticipation." + ), + } + return { + "parallel_trends_variant": "unconditional", + "no_anticipation": True, + "description": ( + "Identification relies on the standard DiD parallel-trends " + "assumption plus no anticipation of treatment by either group." 
+ ), + } + + +def _build_caveats( + _results: Any, + headline: Dict[str, Any], + sample: Dict[str, Any], + dr_schema: Optional[Dict[str, Any]], +) -> List[Dict[str, Any]]: + """Assemble the plain-English caveats list for the headline schema.""" + caveats: List[Dict[str, Any]] = [] + + # NaN ATT is the highest-severity caveat. + if headline.get("sign") == "undefined": + caveats.append( + { + "severity": "warning", + "topic": "estimation_failure", + "message": ( + "Estimation produced a non-finite effect. Inspect data " + "preparation and model specification before interpreting." + ), + } + ) + + # Alpha override could not be honored (bootstrap / finite-df inference). + alpha_override_msg = headline.get("alpha_override_caveat") + if isinstance(alpha_override_msg, str) and alpha_override_msg: + caveats.append( + { + "severity": "info", + "topic": "alpha_override_preserved", + "message": alpha_override_msg, + } + ) + + # Near-threshold p-value. + if headline.get("near_significance_threshold"): + caveats.append( + { + "severity": "info", + "topic": "near_significance", + "message": ( + "The p-value is close to the conventional significance " + "threshold; small changes to the sample or specification " + "could move it either way." + ), + } + ) + + # Few treated units. + nt = sample.get("n_treated") + if nt is not None and nt <= 3: + caveats.append( + { + "severity": "warning", + "topic": "few_treated", + "message": ( + f"Only {nt} treated units in this fit; standard errors " + "rely on large-cluster asymptotics and may be unreliable. " + "Consider SyntheticDiD or an exact-permutation inference " + "alternative." + ), + } + ) + + # Non-trivial design effect. + survey = sample.get("survey") + if survey and not survey.get("is_trivial"): + deff = survey.get("design_effect") + eff_n = survey.get("effective_n") + if isinstance(deff, (int, float)) and deff >= 5.0: + caveats.append( + { + "severity": "warning", + "topic": "design_effect", + "message": ( + f"Very large survey design effect (DEFF = {deff:.2g}). " + "Inspect the weight distribution and consider weight " + "trimming if driven by outlier weights." + ), + } + ) + elif isinstance(deff, (int, float)) and deff >= 1.5: + if isinstance(eff_n, (int, float)): + caveats.append( + { + "severity": "info", + "topic": "design_effect", + "message": ( + f"Survey design reduces effective sample size: " + f"DEFF = {deff:.2g}; effective n = {eff_n:.0f}." + ), + } + ) + + # Bacon forbidden comparisons. + # Round-45 P1 CI review on PR #318: Goodman-Bacon is a + # decomposition of TWFE weights (see ``bacon.py`` header and + # Goodman-Bacon 2021). On fits already produced by a + # heterogeneity-robust estimator (CS / SA / BJS / Gardner / + # Wooldridge / EfficientDiD / Stacked / dCDH / TripleDifference / + # StaggeredTripleDiff / SDiD / TROP), a high forbidden-weight share + # says "TWFE would have been materially biased on this rollout", + # not "the displayed estimator needs to be replaced" — the + # displayed estimator is already robust to the heterogeneity that + # Bacon flags. DR partly preserves this with "if not already in + # use" prose; BR must carry the same distinction through to the + # caveat. The TWFE-style estimators whose results route through + # Bacon and for which the "switch to a robust estimator" + # recommendation is load-bearing are the DiDResults-type fits; all + # other result classes are already robust. 
+ _TWFE_STYLE_RESULTS: FrozenSet[str] = frozenset( + {"DiDResults", "MultiPeriodDiDResults", "TwoWayFixedEffectsResults"} + ) + if dr_schema: + bacon = dr_schema.get("bacon") or {} + if bacon.get("status") == "ran": + fw = bacon.get("forbidden_weight") + if isinstance(fw, (int, float)) and fw > 0.10: + _estimator_name = type(_results).__name__ + if _estimator_name in _TWFE_STYLE_RESULTS: + bacon_message = ( + f"Goodman-Bacon decomposition places {fw:.0%} " + "of implicit TWFE weight on 'forbidden' " + "later-vs-earlier comparisons. TWFE may be " + "materially biased under heterogeneous effects. " + "Re-estimate with a heterogeneity-robust " + "estimator (CS / SA / BJS / Gardner)." + ) + else: + bacon_message = ( + f"Goodman-Bacon decomposition places {fw:.0%} " + "of TWFE weight on 'forbidden' later-vs-earlier " + "comparisons. A TWFE benchmark on this rollout " + "would be materially biased under heterogeneous " + "effects; the displayed estimator is already " + "heterogeneity-robust, so this is a statement " + "about the rollout design (avoid reporting TWFE " + "alongside this fit), not about the current " + "result's validity." + ) + caveats.append( + { + "severity": "warning", + "topic": "bacon_contamination", + "message": bacon_message, + } + ) + + # Fragile sensitivity. + sens = dr_schema.get("sensitivity") or {} + if sens.get("status") == "ran": + bkd = sens.get("breakdown_M") + if isinstance(bkd, (int, float)) and bkd < 0.5: + caveats.append( + { + "severity": "warning", + "topic": "sensitivity_fragility", + "message": ( + f"HonestDiD breakdown value is {bkd:.2g}: the " + "result's confidence interval includes zero " + "once parallel-trends violations reach less than " + "half the observed pre-period variation. Treat " + "the headline as tentative." + ), + } + ) + + # Sensitivity was skipped for methodology reasons (e.g., CS fit with + # ``base_period='varying'`` — HonestDiD bounds are not interpretable + # there). Surface the reason as a warning-severity caveat so readers + # do not assume the headline is robust across the R-R grid. + # + # Exception (round-20 P2 CI review on PR #318): SDiD and TROP route + # robustness to ``estimator_native_diagnostics`` and mark the HonestDiD + # sensitivity block ``status="skipped", method="estimator_native"``. + # Surfacing "sensitivity was not run" as a warning contradicts the + # documented native-routing contract when the native battery actually + # ran. Suppress the warning and point readers at the native block + # instead. + if sens.get("status") == "skipped": + reason = sens.get("reason") + method = sens.get("method") + native = dr_schema.get("estimator_native_diagnostics") or {} + native_ran = native.get("status") == "ran" + if method == "estimator_native" and native_ran: + caveats.append( + { + "severity": "info", + "topic": "sensitivity_native_routed", + "message": ( + "HonestDiD was not run for this estimator. Robustness " + "is covered by the estimator-native sensitivity " + "diagnostics reported under " + "``estimator_native_diagnostics``." + ), + } + ) + elif isinstance(reason, str) and reason: + caveats.append( + { + "severity": "warning", + "topic": "sensitivity_skipped", + "message": ("HonestDiD sensitivity was not run on this fit. " + reason), + } + ) + + # Non-fatal warnings captured from delegated diagnostics + # (e.g., HonestDiD's bootstrap diag-covariance fallback, dropped + # non-consecutive horizons on dCDH). 
DR already records these in + # ``schema["warnings"]``; mirror the methodology-critical ones + # into BR's caveat list so summary/full-report prose can surface + # them without readers having to inspect the DR schema. + for msg in dr_schema.get("warnings", []) or []: + if not isinstance(msg, str) or not msg: + continue + # Skip alpha-override and design-effect messages already + # covered by dedicated caveats above. + lower = msg.lower() + if "sensitivity:" in lower or "pretrends_power:" in lower: + caveats.append( + { + "severity": "info", + "topic": "diagnostic_warning", + "message": msg, + } + ) + + # Unit mismatch caveat (log_points + unit override). + unit_kind = headline.get("unit_kind") + if unit_kind == "log_points": + caveats.append( + { + "severity": "info", + "topic": "unit_policy", + "message": ( + "The effect is reported in log-points as estimated; " + "BusinessReport does not arithmetically translate log-points " + "to percent or level changes. For small effects, log-points " + "approximate percentage changes." + ), + } + ) + return caveats + + +def _pt_method_subject(method: Optional[str]) -> str: + """Return a source-faithful sentence subject for the PT verdict prose. + + The ``parallel_trends.method`` field distinguishes between the + 2x2 slope-difference check, the pre-period event-study Wald / + Bonferroni variants, EfficientDiD's Hausman PT-All vs PT-Post + pretest, SDiD's weighted pre-treatment fit, and TROP's factor- + model identification. Generic "pre-treatment event-study" wording + is wrong for the first and third cases. See round-8 CI review on + PR #318 and REGISTRY.md §EfficientDiD (Hausman pretest). + """ + if method == "slope_difference": + return "The pre-period slope-difference test" + if method == "hausman": + return "The Hausman PT-All vs PT-Post pretest" + if method in { + "joint_wald", + "joint_wald_event_study", + "joint_wald_no_vcov", + "bonferroni", + # Survey-aware event-study PT variants use an F reference + # distribution with denominator df = ``survey_metadata.df_survey`` + # (round-27 P1 fix, documented in REPORTING.md). The subject + # remains the pre-period event-study coefficients; prose elsewhere + # flags the finite-sample correction via ``df_denom``. + "joint_wald_survey", + "joint_wald_event_study_survey", + }: + return "Pre-treatment event-study coefficients" + if method == "synthetic_fit": + return "The synthetic-control pre-treatment fit" + if method == "factor": + return "The factor-model pre-treatment fit" + return "Pre-treatment data" + + +def _pt_method_stat_label(method: Optional[str]) -> Optional[str]: + """Return the joint-statistic label appropriate to the PT method. + + Returns ``"joint p"`` for Wald / Bonferroni paths (including the + survey-aware F-reference variants, which remain joint tests on the + pre-period coefficient vector — only the reference distribution + changes), ``"p"`` for the 2x2 slope-difference and Hausman paths + (single-statistic tests), and ``None`` for design-enforced paths + that have no p-value. 
+ """ + if method in { + "joint_wald", + "joint_wald_event_study", + "joint_wald_no_vcov", + "bonferroni", + "joint_wald_survey", + "joint_wald_event_study_survey", + }: + return "joint p" + if method in {"slope_difference", "hausman"}: + return "p" + if method in {"synthetic_fit", "factor"}: + return None + return "joint p" + + +def _references_for(estimator_name: str) -> List[Dict[str, str]]: + """Map the estimator to the appropriate citation references.""" + base = [ + { + "role": "sensitivity", + "citation": ( + "Rambachan, A., & Roth, J. (2023). A More Credible Approach " + "to Parallel Trends. Review of Economic Studies." + ), + }, + { + "role": "workflow", + "citation": ( + "Baker, A. C., Callaway, B., Cunningham, S., Goodman-Bacon, A., " + "& Sant'Anna, P. H. C. (2025). Difference-in-Differences " + "Designs: A Practitioner's Guide." + ), + }, + ] + estimator_refs = { + "CallawaySantAnnaResults": { + "role": "estimator", + "citation": ( + "Callaway, B., & Sant'Anna, P. H. C. (2021). " + "Difference-in-Differences with multiple time periods. " + "Journal of Econometrics." + ), + }, + "SyntheticDiDResults": { + "role": "estimator", + "citation": ( + "Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., " + "& Wager, S. (2021). Synthetic Difference in Differences." + ), + }, + "SunAbrahamResults": { + "role": "estimator", + "citation": ( + "Sun, L., & Abraham, S. (2021). Estimating dynamic treatment " + "effects in event studies. Journal of Econometrics." + ), + }, + "ImputationDiDResults": { + "role": "estimator", + "citation": ( + "Borusyak, K., Jaravel, X., & Spiess, J. (2024). " "Revisiting event-study designs." + ), + }, + "EfficientDiDResults": { + "role": "estimator", + "citation": ( + "Chen, X., Sant'Anna, P. H. C., & Xie, H. (2025). " + "Efficient Estimation of Treatment Effects in Staggered " + "DiD Designs." + ), + }, + "ChaisemartinDHaultfoeuilleResults": { + "role": "estimator", + "citation": ( + "de Chaisemartin, C., & D'Haultfœuille, X. (2020). " + "Two-way fixed effects estimators with heterogeneous " + "treatment effects. American Economic Review." + ), + }, + } + if estimator_name in estimator_refs: + return [estimator_refs[estimator_name]] + base + return base + + +# --------------------------------------------------------------------------- +# Prose rendering +# --------------------------------------------------------------------------- +def _format_value(value: Optional[float], unit: Optional[str], unit_kind: str) -> str: + """Format a numeric effect with its unit. No arithmetic translation.""" + if value is None or not np.isfinite(value): + return "undefined" + if unit_kind == "currency": + sign = "-" if value < 0 else "" + return f"{sign}${abs(value):,.2f}" + if unit_kind == "percent": + return f"{value:.2f}%" + if unit_kind == "percentage_points": + return f"{value:.2f} pp" + if unit_kind == "log_points": + return f"{value:.3g} log-points" + if unit_kind == "count": + return f"{value:,.0f}" + # unknown / free-form + if unit: + return f"{value:.3g} {unit}" + return f"{value:.3g}" + + +def _significance_phrase(p: Optional[float], alpha: float) -> str: + """Return a plain-English significance phrase. 
+ + Tiers per ``docs/methodology/REPORTING.md``: + * p < 0.001: "strongly supported by the data" + * 0.001 <= p < 0.01: "well-supported" + * 0.01 <= p < alpha: "statistically significant at the X% level" + * alpha <= p < 0.10: CI-includes-zero language + * p >= 0.10: consistent-with-no-effect language + """ + if p is None or not np.isfinite(p): + return "statistical significance cannot be assessed (p-value unavailable)" + ci_level = int(round((1.0 - alpha) * 100)) + if p < 0.001: + return "the direction of the effect is strongly supported by the data" + if p < 0.01: + return "the direction of the effect is well-supported by the data" + if p < alpha: + return f"the effect is statistically significant at the {ci_level}% level" + if p < 0.10: + return ( + "the confidence interval includes zero; the direction is suggestive " + "but not statistically significant" + ) + return "the confidence interval includes zero; the data are consistent with no effect" + + +def _direction_verb(effect: float, outcome_direction: Optional[str]) -> str: + """Return a direction-aware verb for the headline sentence. + + When ``outcome_direction`` is unset we use neutral change verbs + (``increased`` / ``decreased``). When it is supplied, we additionally + flavor the verb with a value-laden connotation so the stakeholder can + read off whether the estimated effect points in the desired direction: + + - ``higher_is_better``: positive effect -> "lifted"; negative -> "reduced" + - ``lower_is_better``: positive effect -> "worsened"; negative -> "improved" + - None: positive -> "increased"; negative -> "decreased" + """ + if effect == 0: + return "did not change" + if outcome_direction == "higher_is_better": + return "lifted" if effect > 0 else "reduced" + if outcome_direction == "lower_is_better": + return "worsened" if effect > 0 else "improved" + return "increased" if effect > 0 else "decreased" + + +def _render_headline_sentence(schema: Dict[str, Any]) -> str: + """Render the headline sentence from the schema. + + Uses the absolute value in the magnitude slot when the verb already + conveys direction ("decreased ... by $0.14" rather than "decreased ... + by -$0.14"). CI bounds are rendered at their natural signed values. + When ``outcome_direction`` is supplied, the verb picks up a value-laden + connotation ("lifted" / "reduced" vs neutral "increased" / "decreased"). + """ + ctx = schema.get("context", {}) + h = schema.get("headline", {}) + effect = h.get("effect") + outcome = ctx.get("outcome_label", "the outcome") + treatment = ctx.get("treatment_label", "the treatment") + outcome_direction = ctx.get("outcome_direction") + unit = h.get("unit") + unit_kind = h.get("unit_kind", "unknown") + + if effect is None or not np.isfinite(effect): + return ( + f"We were unable to produce a finite estimate of {treatment}'s " + f"effect on {outcome}. Inspect the data and model specification." + ) + + verb = _direction_verb(effect, outcome_direction) + magnitude = _format_value(abs(effect), unit, unit_kind) + lo = h.get("ci_lower") + hi = h.get("ci_upper") + # Round-37 P1 CI review on PR #318: on a finite point estimate + # whose CI bounds are NaN (undefined inference — survey-df + # collapse, zero effective clusters, etc.), the previous isinstance + # check passed because ``NaN`` is a ``float`` and the sentence + # rendered ``(... 95% CI: undefined to undefined)``. Gate on + # ``np.isfinite`` like DR's own headline renderer already does; + # add an explicit inference-unavailable trailer instead of the + # broken CI clause. 
+ ci_str = "" + ci_finite = ( + isinstance(lo, (int, float)) + and isinstance(hi, (int, float)) + and np.isfinite(lo) + and np.isfinite(hi) + ) + if ci_finite: + lo_s = _format_value(lo, unit, unit_kind) + hi_s = _format_value(hi, unit, unit_kind) + ci_str = f" ({h.get('ci_level', 95)}% CI: {lo_s} to {hi_s})" + elif isinstance(lo, (int, float)) or isinstance(hi, (int, float)): + # At least one bound was supplied but not finite -> inference + # undefined. Replace the CI clause with an explicit marker so + # downstream prose does not claim a confidence interval that + # is not actually available. + ci_str = " (inference unavailable: confidence interval is undefined for this fit)" + by_clause = f" by {magnitude}" if effect != 0 else "" + return f"{treatment.capitalize()} {verb} {outcome}{by_clause}{ci_str}." + + +def _render_summary(schema: Dict[str, Any]) -> str: + """Render the short-form stakeholder summary paragraph.""" + sentences: List[str] = [] + ctx = schema.get("context", {}) + question = ctx.get("business_question") + if question: + sentences.append(f"Question: {question}") + + # Headline sentence with significance phrase. + sentences.append(_render_headline_sentence(schema)) + h = schema.get("headline", {}) + p = h.get("p_value") + alpha = ctx.get("alpha", 0.05) + if p is not None and np.isfinite(p): + sig = _significance_phrase(p, alpha) + sentences.append(f"Statistically, {sig}.") + if h.get("near_significance_threshold"): + sentences.append( + "The p-value is close to the conventional threshold; " + "small changes to the sample could move it either way." + ) + + # Pre-trends + power-aware phrasing. + pt = schema.get("pre_trends", {}) or {} + if pt.get("status") == "computed": + jp = pt.get("joint_p_value") + verdict = pt.get("verdict") + # ``tier`` already incorporates the diagonal-fallback downgrade — + # ``DiagnosticReport._check_pretrends_power`` applies it centrally + # so every report surface (BR summary, BR full_report, BR schema, + # DR summary) reads the same adjusted value (round-14 CI review). + tier = pt.get("power_tier") + method = pt.get("method") + subject = _pt_method_subject(method) + stat_label = _pt_method_stat_label(method) + jp_phrase = ( + f" ({stat_label} = {jp:.3g})" if isinstance(jp, (int, float)) and stat_label else "" + ) + # Only point to "the sensitivity analysis below" when a + # sensitivity block actually ran. For estimators that route to + # native diagnostics (SDiD / TROP) or fits where sensitivity was + # skipped / not applicable, the clause would mislead (round-12 + # CI review on PR #318). + sens_ran = (schema.get("sensitivity", {}) or {}).get("status") == "computed" + sens_tail_major = " pending the sensitivity analysis below" if sens_ran else "" + sens_tail_alongside = " alongside the sensitivity analysis below" if sens_ran else "" + sens_tail_see_bounded = ( + " See the sensitivity analysis below for bounded-violation guarantees." + if sens_ran + else "" + ) + sens_tail_see_reliable = " See the sensitivity analysis below." if sens_ran else "" + if verdict == "clear_violation": + sentences.append( + f"{subject} clearly reject parallel trends{jp_phrase}; the " + "headline should be treated as tentative" + sens_tail_major + "." + ) + elif verdict == "some_evidence_against": + sentences.append( + f"{subject} show some evidence against parallel trends" + f"{jp_phrase}; interpret the headline" + + (sens_tail_alongside if sens_ran else " with caution") + + "." 
+ ) + elif verdict == "no_detected_violation": + if tier == "well_powered": + sentences.append( + f"{subject} are consistent with parallel trends, and " + "the test is well-powered (the minimum-detectable " + "violation is small relative to the estimated effect)." + ) + elif tier == "moderately_powered": + sentences.append( + f"{subject} do not reject parallel trends; the test is " + "moderately informative." + sens_tail_see_bounded + ) + else: + sentences.append( + f"{subject} do not reject parallel trends, but the test " + "has limited power — a non-rejection does not prove the " + "assumption." + sens_tail_see_reliable + ) + elif verdict == "design_enforced_pt": + sentences.append( + "The synthetic control is designed to match the treated " + "group's pre-period trajectory (SDiD's weighted-parallel-" + "trends analogue)." + ) + elif verdict == "inconclusive": + # Round-35 P1 CI review on PR #318: a ``verdict=="inconclusive"`` + # state means one or more pre-period coefficients had + # undefined inference (zero SE, NaN p-value) and the joint + # test cannot be formed. BR previously omitted the sentence + # entirely, so stakeholder prose silently skipped the + # identifying-assumption diagnostic. Name the state + # explicitly and quote the undefined-row count when + # available. + n_dropped = pt.get("n_dropped_undefined") + if isinstance(n_dropped, int) and n_dropped > 0: + rows_word = "row" if n_dropped == 1 else "rows" + sentences.append( + f"The pre-trends test is inconclusive on this fit: " + f"{n_dropped} pre-period {rows_word} had undefined " + "inference (zero / negative SE or a non-finite " + "per-period p-value), so the joint test cannot be " + "formed. Treat parallel trends as unassessed rather " + "than supported." + ) + else: + sentences.append( + "The pre-trends test is inconclusive on this fit: " + "pre-period inference was undefined, so the joint " + "test cannot be formed. Treat parallel trends as " + "unassessed rather than supported." + ) + + # Sensitivity. A ``single_M_precomputed`` sensitivity block has + # ``breakdown_M=None`` by construction because only one M was evaluated; + # narrate it as a point check, NOT as grid-wide robustness. + sens = schema.get("sensitivity", {}) or {} + if sens.get("status") == "computed": + bkd = sens.get("breakdown_M") + conclusion = sens.get("conclusion") + if conclusion == "single_M_precomputed": + grid_points = sens.get("grid") or [] + point = grid_points[0] if grid_points else {} + m_val = point.get("M") + robust = point.get("robust_to_zero") + if isinstance(m_val, (int, float)): + if robust: + sentences.append( + f"HonestDiD (single point checked): at M = {m_val:.2g}, " + f"the robust confidence interval excludes zero. This is " + f"a point check, not a breakdown analysis — run " + f"HonestDiD.sensitivity() across a grid of M values " + f"for a full robustness claim." + ) + else: + sentences.append( + f"HonestDiD (single point checked): at M = {m_val:.2g}, " + f"the robust confidence interval includes zero. Run " + f"HonestDiD.sensitivity() across a grid to find the " + f"breakdown value." + ) + elif bkd is None: + sentences.append( + "HonestDiD: the result remains significant across the " + "full grid — robust to plausible parallel-trends violations." + ) + elif isinstance(bkd, (int, float)) and bkd >= 1.0: + sentences.append( + f"HonestDiD: the result remains significant under " + f"parallel-trends violations up to {bkd:.2g}x the observed " + f"pre-period variation." 
+ ) + elif isinstance(bkd, (int, float)): + sentences.append( + f"HonestDiD: the result is fragile — the confidence interval " + f"includes zero once violations reach {bkd:.2g}x the " + f"pre-period variation." + ) + + # Sample sentence. For fits with a dynamic comparison set (CS / + # ContinuousDiD / StaggeredTripleDiff / EfficientDiD / + # StackedDiD under ``clean_control in {"not_yet_treated", + # "strict"}``) the fixed control count is suppressed because the + # comparison group varies by cohort/sub-experiment; narrate the + # mode explicitly rather than misreporting a fixed-subset tally as + # "control" (rounds 13 / 17 / 18 / 22 CI review). + sample = schema.get("sample", {}) or {} + # ``schema["estimator"]`` is a dict with ``class_name``; unwrap it + # for the per-estimator dynamic-control phrasing branch below. + estimator_block = schema.get("estimator") or {} + estimator = estimator_block.get("class_name") if isinstance(estimator_block, dict) else None + n_obs = sample.get("n_obs") + n_t = sample.get("n_treated") + n_c = sample.get("n_control") + n_nt = sample.get("n_never_treated") + n_ne = sample.get("n_never_enabled") + is_dynamic = sample.get("dynamic_control") + cg = sample.get("control_group") + # Panel-vs-RCS count-unit label. For repeated cross-section fits + # (``panel=False`` on CallawaySantAnna), treated / never-treated + # tallies are observation counts, not unit counts. Keep the + # "N treated" phrasing (the N is still correct), but adjust the + # never-treated clause so it does not claim "units present in + # the panel" for an RCS sample. + count_unit = sample.get("count_unit", "units") + ne_unit_word = "observations" if count_unit == "observations" else "units" + if isinstance(n_obs, int): + if isinstance(n_t, int) and isinstance(n_c, int): + sentences.append(f"Sample: {n_obs:,} observations ({n_t:,} treated, {n_c:,} control).") + elif is_dynamic and isinstance(n_t, int): + if isinstance(n_ne, int) and n_ne > 0: + subset_clause = f"; {n_ne:,} never-enabled {ne_unit_word} are also present" + elif isinstance(n_nt, int) and n_nt > 0: + subset_clause = f"; {n_nt:,} never-treated {ne_unit_word} are also present" + else: + subset_clause = "" + # Estimator-specific dynamic-comparison phrasing. StackedDiD + # uses sub-experiment-specific clean controls (IC1/IC2 + # trimming) rather than a not-yet-treated rollout; the + # generic phrasing misstates the identification setup. + if estimator == "StackedDiDResults": + cc_label = cg if isinstance(cg, str) else "clean_control" + n_distinct = sample.get("n_distinct_controls_trimmed") + distinct_clause = ( + f" across {n_distinct:,} distinct control units in the trimmed stack" + if isinstance(n_distinct, int) + else "" + ) + sentences.append( + f"Sample: {n_obs:,} observations ({n_t:,} treated) with a " + f"sub-experiment-specific clean-control comparison " + f"(``clean_control='{cc_label}'``): each adoption event is " + f"compared against the units satisfying the rule relative " + f"to that event's window, not a single fixed control " + f"group{distinct_clause}{subset_clause}." + ) + else: + sentences.append( + f"Sample: {n_obs:,} observations ({n_t:,} treated) with a " + "dynamic not-yet-treated comparison group (the control set " + f"varies by cohort and period){subset_clause}." 
+ ) + elif ( + estimator == "StaggeredTripleDiffResults" + and isinstance(n_t, int) + and isinstance(n_ne, int) + and n_ne > 0 + ): + # Round-38 P2 CI review on PR #318: StaggeredTripleDiff + # under fixed ``control_group="never_treated"`` had the + # schema moved to ``n_never_enabled`` (round-37) but the + # renderers fell through to the generic + # ``Sample: N observations.`` sentence because the + # ``is_dynamic_control`` branch didn't fire. REGISTRY.md + # §StaggeredTripleDifference line 1730 names the + # never-enabled cohort as the valid fixed comparison on + # this path; the prose must say so. + sentences.append( + f"Sample: {n_obs:,} observations ({n_t:,} treated, " f"{n_ne:,} never-enabled)." + ) + else: + sentences.append(f"Sample: {n_obs:,} observations.") + survey = sample.get("survey") + if survey and not survey.get("is_trivial"): + deff = survey.get("design_effect") + eff_n = survey.get("effective_n") + if isinstance(deff, (int, float)) and isinstance(eff_n, (int, float)): + # Round-35 P2 CI review on PR #318: ``deff < 0.95`` is a + # precision-improving design (effective N is LARGER than + # nominal N). Narrating that as "reduces effective sample + # size" is directionally wrong. Branch on the sign of + # the departure from 1. + if deff < 1.0: + sentences.append( + f"Survey design improves effective sample size to " + f"~{eff_n:,.0f} (DEFF = {deff:.2g})." + ) + else: + sentences.append( + f"Survey design reduces effective sample size to " + f"~{eff_n:,.0f} (DEFF = {deff:.2g})." + ) + + # Highest-severity caveat (if any). + caveats = schema.get("caveats", []) + warning_caveats = [c for c in caveats if c.get("severity") == "warning"] + if warning_caveats: + top = warning_caveats[0] + sentences.append(f"Caveat: {top.get('message')}") + + return " ".join(s for s in sentences if s) + + +def _render_full_report(schema: Dict[str, Any]) -> str: + """Render the structured multi-section markdown report.""" + ctx = schema.get("context", {}) + h = schema.get("headline", {}) + sample = schema.get("sample", {}) + pt = schema.get("pre_trends", {}) or {} + sens = schema.get("sensitivity", {}) or {} + assumption = schema.get("assumption", {}) + het = schema.get("heterogeneity") + caveats = schema.get("caveats", []) + references = schema.get("references", []) + next_steps = schema.get("next_steps", []) + + lines: List[str] = [] + lines.append(f"# Business Report: {ctx.get('outcome_label', 'Outcome')}") + lines.append("") + if ctx.get("business_question"): + lines.append(f"**Question**: {ctx['business_question']}") + lines.append("") + lines.append(f"**Estimator**: `{schema.get('estimator', {}).get('class_name')}`") + lines.append("") + + # Headline + lines.append("## Headline") + lines.append("") + lines.append(_render_headline_sentence(schema)) + p = h.get("p_value") + alpha = ctx.get("alpha", 0.05) + if isinstance(p, (int, float)): + lines.append("") + lines.append(f"Statistically, {_significance_phrase(p, alpha)}.") + lines.append("") + + # Identifying assumption + lines.append("## Identifying Assumption") + lines.append("") + lines.append(assumption.get("description", "") or "Standard DiD parallel-trends assumption.") + lines.append("") + + # Pre-trends + lines.append("## Pre-Trends") + lines.append("") + if pt.get("status") == "computed": + jp = pt.get("joint_p_value") + verdict = pt.get("verdict") + tier = pt.get("power_tier") + # Use the method-aware statistic label the summary path already + # uses: "joint p" for Wald / Bonferroni event-study, "p" for + # slope-difference / Hausman 
single-statistic tests, and None + # for design-enforced SDiD / TROP paths where there is no + # p-value at all. Round-25 P2 CI review on PR #318 flagged the + # hard-coded "joint p" wording as misdescribing 2x2 / Hausman + # fits and inventing a nonexistent p-value for SDiD / TROP. + method = pt.get("method") + stat_label = _pt_method_stat_label(method) + if stat_label and isinstance(jp, (int, float)): + lines.append(f"- Verdict: `{verdict}` ({stat_label} = {jp:.3g})") + elif stat_label: + lines.append(f"- Verdict: `{verdict}` ({stat_label} unavailable)") + else: + lines.append(f"- Verdict: `{verdict}`") + if tier: + lines.append(f"- Power tier: `{tier}`") + mdv = pt.get("mdv") + ratio = pt.get("mdv_share_of_att") + if isinstance(mdv, (int, float)): + lines.append(f"- Minimum detectable violation (MDV): {mdv:.3g}") + if isinstance(ratio, (int, float)): + lines.append(f"- MDV / |ATT|: {ratio:.2g}") + else: + lines.append(f"- Pre-trends not computed: {pt.get('reason', 'unavailable')}") + lines.append("") + + # Sensitivity. A single-M HonestDiDResults passthrough has + # breakdown_M=None by construction because only one M was evaluated; + # the "robust across full grid" phrasing is reserved for genuine + # grid-over-M SensitivityResults. + lines.append("## Sensitivity (HonestDiD)") + lines.append("") + if sens.get("status") == "computed": + bkd = sens.get("breakdown_M") + concl = sens.get("conclusion") + lines.append(f"- Method: `{sens.get('method')}`") + if concl == "single_M_precomputed": + grid_points = sens.get("grid") or [] + point = grid_points[0] if grid_points else {} + m_val = point.get("M") + robust = point.get("robust_to_zero") + if isinstance(m_val, (int, float)): + lines.append(f"- Single point checked: M = {m_val:.3g}") + lines.append( + f"- Robust CI at M = {m_val:.3g}: " + f"{'excludes zero' if robust else 'includes zero'}" + ) + lines.append( + "- Run `HonestDiD.sensitivity()` across a grid of M " + "values to find the breakdown value." + ) + else: + lines.append("- Single-M passthrough (breakdown not available)") + elif isinstance(bkd, (int, float)): + lines.append(f"- Breakdown M: {bkd:.3g}") + else: + lines.append("- Breakdown M: robust across full grid (no breakdown)") + lines.append(f"- Conclusion: `{concl}`") + else: + lines.append(f"- Sensitivity not computed: {sens.get('reason', 'unavailable')}") + lines.append("") + + # Sample + lines.append("## Sample") + lines.append("") + if isinstance(sample.get("n_obs"), int): + lines.append(f"- Observations: {sample['n_obs']:,}") + if isinstance(sample.get("n_treated"), int): + lines.append(f"- Treated: {sample['n_treated']:,}") + # ``n_control`` is only populated for estimators whose control set + # is a fixed tally. For dynamic modes (CS / ContinuousDiD / + # StaggeredTripleDiff / EfficientDiD / StackedDiD under + # ``clean_control in {"not_yet_treated", "strict"}``) the comparison + # group is dynamic per cohort/sub-experiment; report the estimator- + # specific fixed subset (``n_never_enabled`` for triple-difference; + # ``n_never_treated`` elsewhere; ``n_distinct_controls_trimmed`` for + # Stacked) when available, then name the dynamic-comparison mode + # explicitly. + estimator_block = schema.get("estimator") or {} + estimator_name = ( + estimator_block.get("class_name") if isinstance(estimator_block, dict) else None + ) + cg = sample.get("control_group") + # Panel-vs-RCS count-unit label for the full report. 
Mirrors the + # summary path: CallawaySantAnna's ``panel=False`` mode stores + # counts as observations, not units (round-28 P2). + md_count_unit = sample.get("count_unit", "units") + md_ne_unit_word = "observations" if md_count_unit == "observations" else "units" + md_sample_location = ( + "in the repeated cross-section sample" + if md_count_unit == "observations" + else "in the panel" + ) + if isinstance(sample.get("n_control"), int): + lines.append(f"- Control: {sample['n_control']:,}") + elif ( + estimator_name == "StaggeredTripleDiffResults" + and isinstance(sample.get("n_never_enabled"), int) + and sample["n_never_enabled"] > 0 + and not sample.get("dynamic_control") + ): + # Round-38 P2 CI review on PR #318: fixed + # ``control_group="never_treated"`` on StaggeredTripleDiff + # clears ``n_control`` (composite total) and populates + # ``n_never_enabled`` (the valid fixed comparison cohort per + # REGISTRY.md line 1730). The full report must render that + # fixed count — the dynamic-control branch below would not + # fire on this path. + lines.append( + f"- Never-enabled units (fixed comparison cohort): " f"{sample['n_never_enabled']:,}" + ) + elif sample.get("dynamic_control"): + if isinstance(sample.get("n_never_enabled"), int) and sample["n_never_enabled"] > 0: + lines.append( + f"- Never-enabled {md_ne_unit_word} present " + f"{md_sample_location}: {sample['n_never_enabled']:,}" + ) + elif isinstance(sample.get("n_never_treated"), int) and sample["n_never_treated"] > 0: + lines.append( + f"- Never-treated {md_ne_unit_word} present " + f"{md_sample_location}: {sample['n_never_treated']:,}" + ) + if estimator_name == "StackedDiDResults": + n_distinct = sample.get("n_distinct_controls_trimmed") + if isinstance(n_distinct, int): + lines.append(f"- Distinct control units in trimmed stack: {n_distinct:,}") + cc_label = cg if isinstance(cg, str) else "clean_control" + lines.append( + f"- Comparison group: sub-experiment-specific clean controls " + f"(``clean_control='{cc_label}'``; each adoption event is " + "compared against units satisfying the rule relative to that " + "event's window, not a single fixed control group)" + ) + else: + lines.append( + "- Comparison group: dynamic not-yet-treated units " + "(varies by cohort and period; no fixed control count)" + ) + survey = sample.get("survey") + if survey: + if survey.get("is_trivial"): + lines.append("- Survey design: trivial DEFF (~1.0)") + else: + deff = survey.get("design_effect") + eff_n = survey.get("effective_n") + if isinstance(deff, (int, float)): + lines.append(f"- Survey DEFF: {deff:.2g}") + if isinstance(eff_n, (int, float)): + lines.append(f"- Effective N: {eff_n:,.0f}") + lines.append("") + + # Heterogeneity — only render the populated section when the check + # actually ran. Round-32 P2 CI review on PR #318: round-31 changed + # ``_lift_heterogeneity`` to always return a dict (stable schema + # contract), but the renderer's ``if het:`` truthiness guard then + # entered the block on every fit and printed ``Source: None``, + # ``N effects: None``, etc. Gate on the ``status`` enum instead. 
+ if isinstance(het, dict) and het.get("status") == "ran": + lines.append("## Heterogeneity") + lines.append("") + lines.append(f"- Source: `{het.get('source')}`") + lines.append(f"- N effects: {het.get('n_effects')}") + mn = het.get("min") + mx = het.get("max") + if isinstance(mn, (int, float)) and isinstance(mx, (int, float)): + lines.append(f"- Range: {mn:.3g} to {mx:.3g}") + cv = het.get("cv") + if isinstance(cv, (int, float)): + lines.append(f"- CV: {cv:.3g}") + lines.append(f"- Sign consistent: {het.get('sign_consistent')}") + lines.append("") + + # Caveats + if caveats: + lines.append("## Caveats") + lines.append("") + for c in caveats: + sev = c.get("severity", "info") + lines.append(f"- **{sev.upper()}** — {c.get('message')}") + lines.append("") + + # Next steps + if next_steps: + lines.append("## Next Steps") + lines.append("") + for s in next_steps: + if s.get("label"): + lines.append(f"- {s['label']}") + if s.get("why"): + lines.append(f" - _why_: {s['why']}") + lines.append("") + + # References + if references: + lines.append("## References") + lines.append("") + for ref in references: + lines.append(f"- {ref.get('citation')}") + lines.append("") + + return "\n".join(lines) diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py new file mode 100644 index 00000000..0fe798a9 --- /dev/null +++ b/diff_diff/diagnostic_report.py @@ -0,0 +1,3219 @@ +""" +DiagnosticReport — unified, plain-English validity assessment for diff-diff results. + +Orchestrates the library's existing diagnostic functions (parallel trends, +pre-trends power, HonestDiD sensitivity, Goodman-Bacon, design-effect +diagnostics, EPV, heterogeneity, and estimator-native checks for SDiD/TROP) +into a single report with a stable AI-legible schema. + +Design principles: + +- No hard pass/fail gates. Severity is conveyed by natural-language phrasing, + not a traffic-light enum. See ``docs/methodology/REPORTING.md``. +- No estimator fitting and no variance re-derivation from raw data. Every + effect, SE, p-value, CI, and sensitivity bound is either read from + ``results`` or produced by an existing diff-diff utility. May call + ``check_parallel_trends`` / ``bacon_decompose`` / + ``EfficientDiD.hausman_pretest`` when the caller supplies the panel + + column kwargs. Report-layer cross-period aggregations (joint-Wald / + Bonferroni pre-trends p-value, heterogeneity dispersion over + post-treatment effects) are enumerated in + ``docs/methodology/REPORTING.md``. +- Lazy evaluation. ``DiagnosticReport(results, ...)`` is free; ``run_all()`` + triggers compute and caches. +- Never prove a null. Pre-trends phrasing uses power information from + ``compute_pretrends_power`` to distinguish well-powered from underpowered + non-violations. + +The ``to_dict()`` surface is an AI-legible contract. See the schema reference +in ``docs/methodology/REPORTING.md`` and the ``DIAGNOSTIC_REPORT_SCHEMA_VERSION`` +constant below. The schema is marked experimental in v3.2. 
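+
+Illustrative usage (a sketch, not a stable contract): ``cs`` stands for any
+fitted diff-diff result and the column names are placeholders; the
+``DiagnosticReport`` constructor below documents the full keyword set::
+
+    report = DiagnosticReport(
+        cs,                        # e.g. a fitted CallawaySantAnnaResults
+        data=df,                   # optional panel; enables data-dependent checks
+        outcome="revenue", unit="store", time="month",
+        first_treat="first_treat",
+    )
+    report.run_all()               # triggers compute and caches
+    schema = report.to_dict()      # AI-legible schema (experimental)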
+""" + +from __future__ import annotations + +from dataclasses import dataclass, field +from typing import Any, Dict, FrozenSet, List, Optional, Tuple + +import numpy as np +import pandas as pd + +DIAGNOSTIC_REPORT_SCHEMA_VERSION = "1.0" + +__all__ = [ + "DiagnosticReport", + "DiagnosticReportResults", + "DIAGNOSTIC_REPORT_SCHEMA_VERSION", +] + + +# --------------------------------------------------------------------------- +# Canonical check names and per-type applicability +# --------------------------------------------------------------------------- +# The set of check names that ``DiagnosticReport`` supports. +_CHECK_NAMES: Tuple[str, ...] = ( + "parallel_trends", + "pretrends_power", + "sensitivity", + "bacon", + "design_effect", + "heterogeneity", + "epv", + "estimator_native", + "placebo", +) + +# Type-level applicability: which checks are *ever* applicable for each of the +# 16 result types. Instance-level applicability further filters by whether +# required attributes are present (e.g. ``survey_metadata`` for DEFF) and by +# whether the user disabled a check via ``run_*=False``. +# See ``docs/methodology/REPORTING.md`` for the full matrix and rationale. +# +# Implementation note: The keys are result-class names looked up via +# ``type(results).__name__``. This string-based dispatch mirrors the +# ``_HANDLERS`` pattern in ``diff_diff/practitioner.py`` and avoids circular +# imports across the 16 result modules. Renaming or aliasing any result class +# requires updating both this table and ``_PT_METHOD`` below; the +# applicability-matrix test parametrized over all result types serves as the +# regression guard. +# ``pretrends_power`` is restricted to the result families for which +# ``compute_pretrends_power`` has an explicit adapter — see +# ``diff_diff/pretrends.py`` around the result-type dispatch. Expanding +# beyond this set (Imputation / Stacked / TwoStage / EfficientDiD / +# StaggeredTripleDiff / Wooldridge / dCDH) would cause the helper to +# raise ``TypeError("Unsupported results type ...")`` and mark the check +# as ``error``, so the narrower set is the right contract. +# +# ``sensitivity`` is restricted to families with a ``HonestDiD`` +# adapter: MultiPeriod, CS, dCDH (via ``placebo_event_study``). SDiD +# and TROP use their own native paths (``estimator_native``) instead +# of HonestDiD. 
+_APPLICABILITY: Dict[str, FrozenSet[str]] = { + "DiDResults": frozenset({"parallel_trends", "design_effect"}), + "MultiPeriodDiDResults": frozenset( + {"parallel_trends", "pretrends_power", "sensitivity", "bacon", "design_effect"} + ), + "CallawaySantAnnaResults": frozenset( + { + "parallel_trends", + "pretrends_power", + "sensitivity", + "bacon", + "design_effect", + "heterogeneity", + "epv", + } + ), + "SunAbrahamResults": frozenset( + { + "parallel_trends", + "pretrends_power", + "bacon", + "design_effect", + "heterogeneity", + } + ), + "ImputationDiDResults": frozenset( + { + "parallel_trends", + "bacon", + "design_effect", + "heterogeneity", + } + ), + "TwoStageDiDResults": frozenset( + { + "parallel_trends", + "bacon", + "design_effect", + "heterogeneity", + } + ), + "StackedDiDResults": frozenset( + { + "parallel_trends", + "bacon", + "design_effect", + "heterogeneity", + } + ), + "SyntheticDiDResults": frozenset( + {"parallel_trends", "sensitivity", "design_effect", "estimator_native"} + ), + "TROPResults": frozenset( + # TROP identification is factor-model-based, not parallel-trends- + # based: the estimator native ``_pt_factor()`` handler returns + # ``status="not_applicable"``, and REPORTING.md routes TROP PT + # to factor-model diagnostics instead. Exposing PT in + # ``applicable_checks`` advertised a handler that never runs — + # round-28 P2 CI review on PR #318 flagged the contract mismatch + # for callers who gate workflows on ``applicable_checks``. + { + "sensitivity", + "design_effect", + "heterogeneity", + "estimator_native", + } + ), + "EfficientDiDResults": frozenset( + { + "parallel_trends", + "bacon", + "design_effect", + "heterogeneity", + "epv", + } + ), + "ContinuousDiDResults": frozenset({"design_effect", "heterogeneity"}), + "TripleDifferenceResults": frozenset({"design_effect", "epv"}), + "StaggeredTripleDiffResults": frozenset({"parallel_trends", "design_effect"}), + "WooldridgeDiDResults": frozenset( + { + "parallel_trends", + "bacon", + "design_effect", + "heterogeneity", + } + ), + "ChaisemartinDHaultfoeuilleResults": frozenset( + { + "parallel_trends", + "sensitivity", + "bacon", + "design_effect", + } + ), + "BaconDecompositionResults": frozenset({"bacon"}), +} + +# Per-type parallel-trends method. The PT check dispatches internally on this. +# Values: +# "two_x_two" — uses utils.check_parallel_trends (requires ``data``) +# "event_study" — joint Wald on pre-period event-study coefficients +# "hausman" — EfficientDiD.hausman_pretest (native PT-All vs PT-Post) +# "synthetic_fit" — SDiD weighted pre-treatment fit (surfaces pre_treatment_fit) +# "factor" — TROP factor-model identification (no PT; renders "N/A" prose) +_PT_METHOD: Dict[str, str] = { + "DiDResults": "two_x_two", + "MultiPeriodDiDResults": "event_study", + "CallawaySantAnnaResults": "event_study", + "SunAbrahamResults": "event_study", + "ImputationDiDResults": "event_study", + "TwoStageDiDResults": "event_study", + "StackedDiDResults": "event_study", + "EfficientDiDResults": "hausman", + "ContinuousDiDResults": "event_study", + "StaggeredTripleDiffResults": "event_study", + "WooldridgeDiDResults": "event_study", + "ChaisemartinDHaultfoeuilleResults": "event_study", + "SyntheticDiDResults": "synthetic_fit", + "TROPResults": "factor", +} + + +@dataclass(frozen=True) +class DiagnosticReportResults: + """Frozen container holding the outcome of a ``DiagnosticReport.run_all()`` call. + + Attributes + ---------- + schema : dict + The AI-legible structured schema (also returned by ``to_dict()``). 
+ interpretation : str + The ``overall_interpretation`` paragraph synthesizing findings across + checks. + applicable_checks : tuple of str + The names of checks that applied to this estimator + options. + skipped_checks : dict of str -> str + Mapping from skipped-check name to plain-English reason. + warnings : tuple of str + Warnings captured while running the underlying diagnostic functions. + """ + + schema: Dict[str, Any] + interpretation: str + applicable_checks: Tuple[str, ...] + skipped_checks: Dict[str, str] = field(default_factory=dict) + warnings: Tuple[str, ...] = () + + +class DiagnosticReport: + """Run the standard diff-diff diagnostic battery on a fitted result. + + Parameters + ---------- + results : Any + A fitted diff-diff results object (e.g. ``CallawaySantAnnaResults``, + ``DiDResults``, ``SyntheticDiDResults``). Any of the 16 result types + in the library is accepted. + data : pandas.DataFrame, optional + The underlying panel. Required for checks that need raw data + (2x2 parallel-trends check on ``DiDResults``; Bacon-from-scratch when + ``results`` is not itself a Bacon fit; the opt-in placebo battery). + outcome, treatment, time, unit, first_treat : str, optional + Column names identifying the panel structure. + pre_periods, post_periods : list, optional + Explicit pre- and post-treatment period labels. + run_parallel_trends, run_sensitivity, run_placebo, run_bacon, + run_design_effect, run_heterogeneity, run_epv, run_pretrends_power : bool + Per-check opt-in flags. ``run_placebo`` defaults to ``False`` (opt-in, + expensive, currently not implemented — placebo key remains reserved + as ``skipped`` in the schema). All other checks default to ``True`` + and are further gated by estimator-type and instance-level + applicability (see ``docs/methodology/REPORTING.md``). + sensitivity_M_grid : tuple of float, default (0.5, 1.0, 1.5, 2.0) + Grid of M values passed to ``HonestDiD.sensitivity``. Yields a + ``SensitivityResults`` object with ``breakdown_M`` populated. + sensitivity_method : str, default "relative_magnitude" + HonestDiD restriction type. + alpha : float, default 0.05 + Significance level used across checks. + survey_design : SurveyDesign, optional + The ``SurveyDesign`` object used to fit a survey-weighted + estimator. Required for fit-faithful replay of Goodman-Bacon on a + survey-backed fit; threaded to ``bacon_decompose(survey_design=...)``. + When the fit carries ``survey_metadata`` but ``survey_design`` is + not supplied, Bacon is skipped with an explicit reason rather than + replaying an unweighted decomposition for a design that does not + match the estimate. The simple 2x2 parallel-trends helper + (``utils.check_parallel_trends``) has no survey-aware variant; + on a survey-backed ``DiDResults`` it is skipped unconditionally + regardless of ``survey_design``. Supply + ``precomputed={'parallel_trends': ...}`` with a survey-aware + pretest to opt in. See ``docs/methodology/REPORTING.md``. + precomputed : dict, optional + Map of check name to a pre-computed result object. Accepted keys + (this is the full implemented list; unsupported keys raise + ``ValueError``): + + - ``"parallel_trends"`` — a dict returned by + ``utils.check_parallel_trends`` (adapted into the schema shape). + - ``"sensitivity"`` — a ``SensitivityResults`` (grid) or + ``HonestDiDResults`` (single-M) object; used verbatim and no + ``HonestDiD.sensitivity_analysis`` call is made. + - ``"pretrends_power"`` — a ``PreTrendsPowerResults`` object. 
+ - ``"bacon"`` — a ``BaconDecompositionResults`` object. + + Other sections (``design_effect``, ``heterogeneity``, ``epv``) are + read directly from the fitted result object and do not currently + accept precomputed values — there is no expensive call to bypass. + ``placebo`` is reserved in the schema but opt-in / deferred in MVP. + outcome_label, treatment_label : str, optional + Plain-English labels used in prose rendering. + """ + + def __init__( + self, + results: Any, + *, + data: Optional[pd.DataFrame] = None, + outcome: Optional[str] = None, + treatment: Optional[str] = None, + time: Optional[str] = None, + unit: Optional[str] = None, + first_treat: Optional[str] = None, + pre_periods: Optional[List[Any]] = None, + post_periods: Optional[List[Any]] = None, + run_parallel_trends: bool = True, + run_sensitivity: bool = True, + run_placebo: bool = False, + run_bacon: bool = True, + run_design_effect: bool = True, + run_heterogeneity: bool = True, + run_epv: bool = True, + run_pretrends_power: bool = True, + sensitivity_M_grid: Tuple[float, ...] = (0.5, 1.0, 1.5, 2.0), + sensitivity_method: str = "relative_magnitude", + alpha: float = 0.05, + survey_design: Optional[Any] = None, + precomputed: Optional[Dict[str, Any]] = None, + outcome_label: Optional[str] = None, + treatment_label: Optional[str] = None, + ): + self._results = results + self._data = data + self._outcome = outcome + self._treatment = treatment + self._time = time + self._unit = unit + self._first_treat = first_treat + self._pre_periods = pre_periods + self._post_periods = post_periods + self._run_flags: Dict[str, bool] = { + "parallel_trends": run_parallel_trends, + "pretrends_power": run_pretrends_power, + "sensitivity": run_sensitivity, + "bacon": run_bacon, + "design_effect": run_design_effect, + "heterogeneity": run_heterogeneity, + "epv": run_epv, + "placebo": run_placebo, + "estimator_native": True, + } + self._sensitivity_M_grid = tuple(sensitivity_M_grid) + self._sensitivity_method = sensitivity_method + self._alpha = float(alpha) + # Round-40 P1 CI review on PR #318: survey-backed fits need the + # ``SurveyDesign`` object threaded through to ``bacon_decompose`` + # for a fit-faithful Goodman-Bacon replay, and the unweighted + # 2x2 parallel-trends helper (``utils.check_parallel_trends``) + # cannot be called on a survey-weighted DiDResults without + # silently reporting an unweighted verdict for a weighted fit. + # When the fit carries ``survey_metadata`` but the caller did + # not supply ``survey_design``, both checks skip with an + # explicit reason instead of replaying a different design than + # the estimate. See REPORTING.md "Survey-backed fits". + self._survey_design = survey_design + self._precomputed = dict(precomputed or {}) + # Validate precomputed keys against the actually-implemented passthrough + # set so advertised contracts do not silently diverge from behavior. + _supported_precomputed = {"parallel_trends", "sensitivity", "pretrends_power", "bacon"} + _unsupported = set(self._precomputed) - _supported_precomputed + if _unsupported: + raise ValueError( + "precomputed= contains keys that are not implemented: " + f"{sorted(_unsupported)}. Supported keys: " + f"{sorted(_supported_precomputed)}. ``design_effect``, " + "``heterogeneity``, and ``epv`` are read directly from the " + "fitted result and do not accept precomputed overrides." + ) + + # Estimator-aware precomputed validation. 
SDiD / TROP route + # robustness to ``estimator_native_diagnostics`` (SDiD: weighted + # pre-treatment fit, in-time placebo, zeta-omega sensitivity; + # TROP: factor-model fit metrics), and TROP PT is not applicable + # (factor-model identification, not PT). Accepting generic + # HonestDiD / parallel-trends precomputed inputs on these + # estimators would surface methodology-incompatible diagnostics + # through the generic report sections — the opposite of the + # native-routing contract documented in REPORTING.md. + # Round-21 P1 CI review on PR #318 flagged this bypass. + _result_name = type(self._results).__name__ + _native_routed_names = {"SyntheticDiDResults", "TROPResults"} + if _result_name in _native_routed_names: + _incompatible_keys = [] + if "sensitivity" in self._precomputed: + _incompatible_keys.append("sensitivity") + if "parallel_trends" in self._precomputed: + _incompatible_keys.append("parallel_trends") + # Round-32 P1 CI review on PR #318: ``pretrends_power`` is a + # Roth-style power analysis on pre-period event-study + # coefficients under the PT identifying contract. SDiD's PT + # analogue is design-enforced pre-treatment fit and TROP uses + # factor-model identification (PT not applicable); surfacing + # a Roth-style power tier on either would bypass the native- + # routing contract. Round-21's guard covered ``sensitivity`` + # and ``parallel_trends`` but not ``pretrends_power``, so the + # round-31 ``_compute_applicable_checks`` broadening exposed + # it. + if "pretrends_power" in self._precomputed: + _incompatible_keys.append("pretrends_power") + if _incompatible_keys: + raise ValueError( + f"{_result_name} routes robustness and pre-trends " + "diagnostics to ``estimator_native_diagnostics`` — " + "generic HonestDiD, parallel-trends, and pre-trends " + "power precomputed passthroughs are methodology-" + "incompatible with this estimator. Rejected " + f"precomputed keys: {sorted(_incompatible_keys)}. " + "Use the native diagnostics on the result object " + "(SDiD: ``in_time_placebo``, ``sensitivity_to_zeta_omega``, " + "``pre_treatment_fit``; TROP: ``effective_rank``, " + "``loocv_score``) — DR surfaces these automatically." + ) + + # Round-44 P1 CI review on PR #318: mirror the SDiD/TROP + # __init__ rejection pattern for ``CallawaySantAnna`` with + # ``base_period != "universal"``. HonestDiD bounds are not + # valid for interpretation on consecutive-comparison + # (``base_period='varying'``) pre-period surfaces (REGISTRY.md + # §CallawaySantAnna line 410 plus §HonestDiD line 2458). + # ``precomputed["sensitivity"]`` would otherwise bypass the + # applicability-gate guard (which already existed for the auto + # path) and let BR/DR narrate the Rambachan-Roth bounds as + # ordinary robustness on a displayed fit whose interpretation + # does not match the bounds' provenance. Reject at + # construction so users get the error up-front rather than a + # late skip in the schema. + if _result_name == "CallawaySantAnnaResults" and "sensitivity" in self._precomputed: + _base_period = getattr(self._results, "base_period", "universal") + if _base_period != "universal": + raise ValueError( + "precomputed['sensitivity'] on " + "CallawaySantAnnaResults requires " + "``base_period='universal'`` on the displayed fit — " + "HonestDiD Rambachan-Roth bounds are not valid for " + "interpretation on the consecutive-comparison " + "pre-period surface produced by " + f"``base_period={_base_period!r}``. 
Narrating the " + "bounds as robustness alongside a varying-base fit " + "mixes provenance the bounds don't support. Re-fit " + "the main estimator with " + "``CallawaySantAnna(base_period='universal')`` " + "before passing precomputed sensitivity." + ) + + self._outcome_label = outcome_label + self._treatment_label = treatment_label + self._cached: Optional[DiagnosticReportResults] = None + + # -- Public API --------------------------------------------------------- + + def run_all(self) -> DiagnosticReportResults: + """Run all applicable diagnostics. Idempotent; caches on first call.""" + if self._cached is None: + self._cached = self._execute() + return self._cached + + def to_dict(self) -> Dict[str, Any]: + """Return the AI-legible structured schema.""" + return self.run_all().schema + + def summary(self) -> str: + """Return a short plain-English paragraph.""" + return self.run_all().interpretation + + def full_report(self) -> str: + """Return the multi-section markdown report.""" + return _render_dr_full_report(self.run_all()) + + def export_markdown(self) -> str: + """Alias for ``full_report()``.""" + return self.full_report() + + def to_dataframe(self) -> pd.DataFrame: + """Return one row per check with status and headline metric.""" + schema = self.to_dict() + rows = [] + for check in _CHECK_NAMES: + section_key = "estimator_native_diagnostics" if check == "estimator_native" else check + section = schema.get(section_key, {}) + rows.append( + { + "check": check, + "status": section.get("status"), + "headline": _check_headline(check, section), + "reason": section.get("reason"), + } + ) + return pd.DataFrame(rows) + + @property + def applicable_checks(self) -> Tuple[str, ...]: + """Names of checks that will run, given estimator + instance + options. + + No compute is triggered; this reflects only the applicability matrix + filtered by instance state (survey_metadata, epv_diagnostics, vcov) + and the user's ``run_*`` flags. + """ + return tuple(sorted(self._compute_applicable_checks()[0])) + + @property + def skipped_checks(self) -> Dict[str, str]: + """Mapping of skipped check -> plain-English reason. Requires ``run_all()``.""" + return dict(self.run_all().skipped_checks) + + # -- Implementation detail --------------------------------------------- + + def _compute_applicable_checks(self) -> Tuple[set, Dict[str, str]]: + """Compute the applicable-check set + per-check skipped reasons. + + Returns + ------- + applicable : set of str + Checks that will run. + skipped : dict + Mapping from check name -> plain-English reason for any check + that is type-applicable but skipped for this instance or by user + opt-out. Checks that are not type-applicable for this estimator + are omitted from both sets (not surfaced as "skipped"). + """ + type_name = type(self._results).__name__ + type_level = set(_APPLICABILITY.get(type_name, frozenset())) + # A precomputed passthrough is a caller-supplied override, not + # a claim about estimator-native applicability. Round-31 P1 CI + # review on PR #318: when a caller passes + # ``precomputed["sensitivity"] = ...`` on an estimator family + # whose ``_APPLICABILITY`` row lacks ``"sensitivity"`` (SA, + # Imputation, TwoStage, Stacked, EfficientDiD, Wooldridge, + # TripleDifference, StaggeredTripleDiff, ContinuousDiD, plain + # DiD), the gate previously filtered the section out silently + # and the supplied result disappeared from the schema. 
SDiD + # and TROP are still rejected up front in ``__init__`` + # (round-21) because their native-routing contract makes + # HonestDiD methodology-incompatible; those never reach here. + # For every other estimator, an explicit passthrough wins + # over the default applicability matrix. + type_level = type_level | set(self._precomputed) + applicable: set = set() + skipped: Dict[str, str] = {} + + for check in type_level: + # Per-check user opt-out + if not self._run_flags.get(check, True): + skipped[check] = f"run_{check}=False (user opted out)" + continue + # Instance-level gating — skipped when the caller supplied + # a precomputed override (the per-check ``_instance_skip_reason`` + # branches already return None for precomputed keys, but this + # short-circuit makes the override contract explicit and + # survives any future gate additions). + if check in self._precomputed: + applicable.add(check) + continue + reason = self._instance_skip_reason(check) + if reason is not None: + skipped[check] = reason + continue + applicable.add(check) + + # Placebo is reserved for every result type in MVP so the schema + # shape is stable: ``schema["placebo"]["status"] == "skipped"`` + # always holds regardless of estimator. The opt-in execution path + # is deferred to a follow-up; ``REPORTING.md`` documents this. + skipped.setdefault( + "placebo", + "Placebo battery runs on opt-in only; not yet implemented in MVP. " + "Reserved in the schema for forward compatibility.", + ) + + return applicable, skipped + + def _instance_skip_reason(self, check: str) -> Optional[str]: + """Return a plain-English reason this check cannot run on this instance, or None.""" + r = self._results + name = type(r).__name__ + if check == "design_effect": + if getattr(r, "survey_metadata", None) is None: + return "No survey design attached to results.survey_metadata." + return None + if check == "epv": + if getattr(r, "epv_diagnostics", None) is None: + return "Estimator did not produce results.epv_diagnostics for this fit." + return None + if check == "parallel_trends": + # Precomputed parallel-trends always unlocks this check. The + # EfficientDiD Hausman skip message already points users at + # ``precomputed={'parallel_trends': ...}`` when replay fails + # (DR / survey fits), so applicability must honor the + # override before the replay-gate below fires. Round-22 P1 + # CI review on PR #318 flagged that PT precomputed was + # advertised but skipped before use. + if "parallel_trends" in self._precomputed: + return None + method = _PT_METHOD.get(name) + if method == "two_x_two": + # Mirror the full argument contract of ``_pt_two_x_two``: + # the runner needs ``data`` AND all three column names to + # call ``check_parallel_trends``. Gating only on ``data`` + # (as before) left ``applicable_checks`` overstated when + # one of the column kwargs was missing (round-11 CI + # review on PR #318). + two_x_two_missing = [ + arg + for arg, val in ( + ("data", self._data), + ("outcome", self._outcome), + ("time", self._time), + ("treatment", self._treatment), + ) + if val is None + ] + if two_x_two_missing: + return ( + "2x2 parallel-trends check needs raw panel data + " + "outcome / time / treatment column names. Missing: " + + ", ".join(two_x_two_missing) + + "." + ) + # Round-40 P1 CI review on PR #318: the simple 2x2 helper + # ``utils.check_parallel_trends`` is unweighted — it has + # no ``survey_design`` parameter and cannot faithfully + # diagnose the pre-period trajectory of a survey- + # weighted DiDResults. 
Rather than silently emitting + # an unweighted verdict alongside the weighted estimate, + # skip with an explicit reason. Users can supply + # ``precomputed={'parallel_trends': ...}`` with a + # survey-aware pretest result if they have one. + if getattr(r, "survey_metadata", None) is not None: + return ( + "Original fit used a survey design; the simple " + "2x2 parallel-trends check (``utils." + "check_parallel_trends``) is unweighted and " + "would diagnose a different design than the " + "weighted estimate. Supply a survey-aware " + "pretest via " + "``precomputed={'parallel_trends': ...}`` to " + "opt in." + ) + if method == "event_study": + pre_coefs, n_dropped_undefined = _collect_pre_period_coefs(r) + # Round-42 P1 CI review on PR #318: the all-undefined + # pre-period case (every pre-row dropped for ``se <= 0`` + # / non-finite inference) is the twin of the partial- + # undefined case from round-33. It must route to the + # inconclusive runner rather than skip, so the explicit + # ``method="inconclusive"`` / ``n_dropped_undefined`` + # provenance is surfaced through DR's schema and BR's + # summary emits the "inconclusive" identifying- + # assumption warning rather than silently dropping PT. + if not pre_coefs and n_dropped_undefined == 0: + return ( + "No pre-period event-study coefficients are exposed on " + "this fit. For staggered estimators, re-fit with " + "aggregate='event_study' to populate event-study output." + ) + # vcov is optional for the Bonferroni fallback. + if method == "hausman": + # EfficientDiD's Hausman pretest requires the raw panel + # to refit under PT-All and PT-Post. Gate at applicability + # rather than letting ``_pt_hausman`` skip at runtime, so + # ``applicable_checks`` and ``completed_steps`` reflect + # reality. + hausman_missing = [ + arg + for arg, val in ( + ("data", self._data), + ("outcome", self._outcome), + ("unit", self._unit), + ("time", self._time), + ("first_treat", self._first_treat), + ) + if val is None + ] + if hausman_missing: + return ( + "EfficientDiD.hausman_pretest needs raw panel data; " + "pass data + outcome + unit + time + first_treat to " + "DiagnosticReport. Missing: " + ", ".join(hausman_missing) + "." + ) + # Fit-faithful guard: DR / survey fits cannot be replayed + # under defaults, so skip with an explicit reason rather + # than rerunning a different design. + if getattr(r, "estimation_path", "nocov") != "nocov": + return ( + "Original EfficientDiD fit used the doubly-robust " + "covariate path; ``covariates`` is not stored on " + "the result, so the Hausman pretest cannot be " + "faithfully replayed." + ) + if getattr(r, "survey_metadata", None) is not None: + return ( + "Original EfficientDiD fit used a survey design; " + "replaying the Hausman pretest would require the " + "full ``SurveyDesign`` object." + ) + return None + if check == "pretrends_power": + # ``compute_pretrends_power`` handles CS / SA / ImputationDiD + # event-study results by reading ``event_study_effects`` + # directly, so we accept either a top-level ``vcov`` OR a + # populated event-study surface. Precomputed overrides also + # bypass this gate. 
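+            # Caller-side sketch of the override this gate honors (values are
+            # illustrative; they mirror the defaults used by
+            # ``_check_pretrends_power`` below):
+            #
+            #     pp = compute_pretrends_power(results, alpha=0.05,
+            #                                  target_power=0.80)
+            #     DiagnosticReport(results, precomputed={"pretrends_power": pp})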
+ if "pretrends_power" in self._precomputed: + return None + has_vcov = getattr(r, "vcov", None) is not None + has_event_vcov = getattr(r, "event_study_vcov", None) is not None + has_event_es = getattr(r, "event_study_effects", None) is not None + if not (has_vcov or has_event_vcov or has_event_es): + return ( + "Pre-trends power needs either results.vcov or " + "event_study_effects (from aggregate='event_study' on " + "staggered estimators); neither available." + ) + pre_coefs, _ = _collect_pre_period_coefs(r) + if len(pre_coefs) < 2: + return "Pre-trends power needs >= 2 pre-treatment periods." + return None + if check == "sensitivity": + # Native SDiD/TROP paths substitute for HonestDiD. + if name in {"SyntheticDiDResults", "TROPResults"}: + return None + # Round-44 P1 CI review on PR #318: the CS varying-base + # guard MUST fire before the precomputed early-return. + # Previously, ``precomputed["sensitivity"]`` unlocked this + # check unconditionally, letting BR/DR narrate the + # Rambachan-Roth bounds as ordinary robustness even though + # HonestDiD explicitly warns those bounds are not valid + # for interpretation on consecutive-comparison + # (``base_period='varying'``) pre-period surfaces + # (REGISTRY.md §CallawaySantAnna line 410, §HonestDiD line + # 2458). The previous skip message also mis-pointed users + # at ``precomputed`` as the opt-in; that path now routes + # through the same guard, so the correct remediation is to + # re-fit the main estimator with ``base_period='universal'`` + # or to consult HonestDiD outside the report layer. + if name == "CallawaySantAnnaResults": + base_period = getattr(r, "base_period", "universal") + if base_period != "universal": + return ( + "HonestDiD on CallawaySantAnna requires " + "``base_period='universal'`` for valid interpretation " + "(Rambachan-Roth bounds are not comparable across the " + "consecutive pre-period comparisons produced by " + f"``base_period={base_period!r}``). Re-fit with " + "``CallawaySantAnna(base_period='universal')``; " + "``precomputed={'sensitivity': ...}`` is rejected here " + "because the precomputed bounds would be narrated as " + "robustness for a displayed fit whose pre-period " + "surface has a different interpretation than the one " + "the bounds were computed against." + ) + # Precomputed sensitivity unlocks this check for every + # other estimator (SDiD/TROP were already rejected at DR + # __init__; CS varying-base is gated above). The CS + # guard above runs on the *displayed fit*, not on the + # provenance of the precomputed bounds; it protects + # against narrating bounds whose interpretation is + # incompatible with the fit being summarized. + if "sensitivity" in self._precomputed: + return None + # dCDH uses ``placebo_event_study`` as its pre-period surface, + # which HonestDiD consumes via a dedicated branch. Accept the + # fit when that attribute is populated. + if name == "ChaisemartinDHaultfoeuilleResults": + pes = getattr(r, "placebo_event_study", None) + if pes is None: + return ( + "HonestDiD on dCDH requires results.placebo_event_study " + "(re-fit with a placebo-producing configuration)." + ) + return None + # MultiPeriod / CS path: ``HonestDiD.sensitivity_analysis`` + # consumes ``event_study_effects`` plus either ``vcov`` + + # ``interaction_indices`` (MultiPeriod) or ``event_study_vcov`` + # + ``event_study_vcov_index`` (CS), with a per-SE diagonal + # fallback otherwise. 
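+            # The availability rule below mirrors the pretrends_power gate
+            # above, except HonestDiD only needs >= 1 pre-period coefficient
+            # (the power calculation needs >= 2).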
+ has_vcov = getattr(r, "vcov", None) is not None + has_event_vcov = getattr(r, "event_study_vcov", None) is not None + has_event_es = getattr(r, "event_study_effects", None) is not None + if not (has_vcov or has_event_vcov or has_event_es): + return ( + "HonestDiD needs either results.vcov, event_study_vcov, " + "or event_study_effects; none available." + ) + pre_coefs, _ = _collect_pre_period_coefs(r) + if len(pre_coefs) < 1: + return "HonestDiD requires at least one pre-period coefficient." + return None + if check == "bacon": + # Precomputed Bacon always unlocks this check. Users with an + # already-computed ``BaconDecompositionResults`` (e.g., run + # separately against a stored panel that isn't available at + # report time) need the passthrough to land on the Bacon + # runner instead of being skipped for missing column kwargs. + # Round-22 P1 CI review on PR #318 flagged that Bacon + # precomputed was advertised but skipped before use. + if "bacon" in self._precomputed: + return None + # ``BaconDecompositionResults`` carries the decomposition + # directly; no data/column kwargs needed. + if name == "BaconDecompositionResults": + return None + # Otherwise mirror the full argument contract of + # ``_check_bacon`` / ``bacon_decompose``: the runner needs + # ``data``, ``first_treat``, and the ``outcome`` / ``time`` / + # ``unit`` column names. Gating on only ``data`` + + # ``first_treat`` (as before) left ``applicable_checks`` + # overstated when a column kwarg was missing (round-11 CI + # review on PR #318). + bacon_missing = [ + arg + for arg, val in ( + ("data", self._data), + ("outcome", self._outcome), + ("time", self._time), + ("unit", self._unit), + ("first_treat", self._first_treat), + ) + if val is None + ] + if bacon_missing: + return ( + "Bacon decomposition needs panel data + outcome / time " + "/ unit / first_treat column names. Missing: " + ", ".join(bacon_missing) + "." + ) + # Round-40 P1 CI review on PR #318: ``bacon_decompose`` + # supports a ``survey_design`` kwarg for survey-weighted + # decomposition. When the fitted result carries + # ``survey_metadata`` but the caller did not supply a + # ``survey_design`` object, replaying with defaults would + # produce an unweighted decomposition for a different + # design than the weighted estimate. Skip with an explicit + # reason; users can pass ``survey_design=`` on + # ``DiagnosticReport`` / ``BusinessReport`` or supply + # ``precomputed={'bacon': ...}`` with a survey-aware + # decomposition. + if getattr(r, "survey_metadata", None) is not None and self._survey_design is None: + return ( + "Original fit used a survey design; Goodman-Bacon " + "replay under defaults would produce an unweighted " + "decomposition for a different design than the " + "weighted estimate. Pass ``survey_design=`` " + "on DiagnosticReport / BusinessReport, or supply " + "``precomputed={'bacon': ...}`` with a survey-aware " + "decomposition." + ) + return None + if check == "heterogeneity": + # Needs multiple group or event-study effects. Use len() rather than + # truthiness because some estimators expose these as DataFrames, + # which raise on bool() conversion. 
+            for attr in (
+                "group_effects",
+                "event_study_effects",
+                "treatment_effects",  # TROP per-(unit, time)
+                "group_time_effects",  # CS default aggregation
+                "period_effects",  # MultiPeriod
+            ):
+                val = getattr(r, attr, None)
+                if val is None:
+                    continue
+                try:
+                    if len(val) > 0:
+                        return None
+                except TypeError:
+                    continue
+            return "No group/event-study effects available to compute heterogeneity."
+        if check == "estimator_native":
+            if name not in {"SyntheticDiDResults", "TROPResults"}:
+                return f"{name} does not expose native validation methods."
+            return None
+        return None
+
+    def _execute(self) -> DiagnosticReportResults:
+        """Run the diagnostic battery and assemble the schema."""
+        applicable, skipped = self._compute_applicable_checks()
+
+        # Seed every schema section with a placeholder: applicable checks start
+        # as "not_run" and are overwritten by their runners below; skipped and
+        # not-applicable checks get their final status here.
+        sections: Dict[str, Dict[str, Any]] = {}
+        for check in _CHECK_NAMES:
+            if check in applicable:
+                sections[check] = {"status": "not_run", "reason": "pending implementation"}
+            elif check in skipped:
+                sections[check] = {"status": "skipped", "reason": skipped[check]}
+            else:
+                sections[check] = {
+                    "status": "not_applicable",
+                    "reason": f"{check} is not applicable to " f"{type(self._results).__name__}.",
+                }
+
+        # Run the checks that are applicable. Each returns a schema-section dict
+        # that replaces the placeholder above.
+        if "parallel_trends" in applicable:
+            sections["parallel_trends"] = self._check_parallel_trends()
+        if "pretrends_power" in applicable:
+            sections["pretrends_power"] = self._check_pretrends_power()
+        if "sensitivity" in applicable:
+            sections["sensitivity"] = self._check_sensitivity()
+        if "bacon" in applicable:
+            sections["bacon"] = self._check_bacon()
+        if "design_effect" in applicable:
+            sections["design_effect"] = self._check_design_effect()
+        if "heterogeneity" in applicable:
+            sections["heterogeneity"] = self._check_heterogeneity()
+        if "epv" in applicable:
+            sections["epv"] = self._check_epv()
+        if "estimator_native" in applicable:
+            sections["estimator_native"] = self._check_estimator_native()
+
+        # Friendlier not-applicable reason for estimators without native
+        # validation surfaces (SDiD / TROP native diagnostics run through
+        # ``_check_estimator_native`` above).
+        if "estimator_native" not in applicable and "estimator_native" not in skipped:
+            sections["estimator_native"] = {
+                "status": "not_applicable",
+                "reason": f"{type(self._results).__name__} does not expose native "
+                "validation methods beyond what's captured above.",
+            }
+
+        # Headline metric — best-effort across estimator types.
+        headline = self._extract_headline_metric()
+
+        # Pull suggested next steps from the practitioner workflow.
+        next_steps = self._collect_next_steps(sections)
+
+        # Populate schema-level warnings for every section that ended in "error",
+        # so users and agents do not have to scan each section dict to discover
+        # that a diagnostic failed. Preserves provenance per the "no silent
+        # failures" convention.
+        top_warnings: List[str] = []
+        for check in _CHECK_NAMES:
+            # ``sections`` is keyed by check name; the
+            # "estimator_native_diagnostics" remapping happens only when the
+            # schema dict is assembled below.
+            section = sections.get(check, {})
+            if section.get("status") == "error":
+                reason = section.get("reason") or "diagnostic raised an exception"
+                top_warnings.append(f"{check}: {reason}")
+            # Surface non-fatal warnings captured by delegated diagnostics
+            # (e.g., HonestDiD's "base_period='varying' is not valid for
+            # interpretation" on CallawaySantAnna, or the diag-covariance
+            # fallback on bootstrap-fitted CS). These rode up on each
+            # section's ``warnings`` field and must not be swallowed.
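+            # Each top-level entry takes the form "<check>: <message>", e.g.
+            # (wording illustrative) "sensitivity: using diagonal covariance
+            # fallback for bootstrap-fitted CallawaySantAnna".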
+ section_warnings = section.get("warnings") + if isinstance(section_warnings, (list, tuple)): + for msg in section_warnings: + if msg is None: + continue + top_warnings.append(f"{check}: {msg}") + # Some sections (e.g., sensitivity skipped for varying-base CS) + # also surface methodology-critical context via ``reason`` even + # though ``status != "error"``. We do not duplicate those here + # — the section's own status/reason is the authoritative record. + + schema: Dict[str, Any] = { + "schema_version": DIAGNOSTIC_REPORT_SCHEMA_VERSION, + "estimator": type(self._results).__name__, + "headline_metric": headline, + "parallel_trends": sections["parallel_trends"], + "pretrends_power": sections["pretrends_power"], + "sensitivity": sections["sensitivity"], + "placebo": sections["placebo"], + "bacon": sections["bacon"], + "design_effect": sections["design_effect"], + "heterogeneity": sections["heterogeneity"], + "epv": sections["epv"], + "estimator_native_diagnostics": sections["estimator_native"], + "skipped": {k: v for k, v in skipped.items()}, + "warnings": top_warnings, + "overall_interpretation": "", + "next_steps": next_steps, + } + interpretation = _render_overall_interpretation(schema, self._context_labels()) + schema["overall_interpretation"] = interpretation + + return DiagnosticReportResults( + schema=schema, + interpretation=interpretation, + applicable_checks=tuple(sorted(applicable)), + skipped_checks=skipped, + warnings=tuple(top_warnings), + ) + + def _context_labels(self) -> Dict[str, str]: + """Return plain-English labels used in prose rendering.""" + return { + "outcome_label": self._outcome_label or "the outcome", + "treatment_label": self._treatment_label or "the treatment", + } + + def _collect_next_steps(self, sections: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]: + """Pull and filter practitioner_next_steps, marking DR-covered steps complete. + + A step is marked complete only when its DR section actually ran + (``status == "ran"``). The previous implementation marked steps + complete based on membership in the applicability set, which + overstated completion for checks that were applicable but skipped + at runtime (e.g., Hausman on a DR / survey fit; sensitivity on + varying-base CS). + """ + try: + from diff_diff.practitioner import practitioner_next_steps + + def _ran(key: str) -> bool: + return sections.get(key, {}).get("status") == "ran" + + completed = [] + if _ran("parallel_trends"): + completed.append("parallel_trends") + if _ran("sensitivity"): + completed.append("sensitivity") + # SDiD / TROP route their sensitivity analogue through + # ``estimator_native_diagnostics`` rather than HonestDiD. When + # that native block ran, the Baker step-6 sensitivity check + # has effectively been performed; treating the sensitivity + # section as not-run would have ``next_steps`` redundantly + # recommend a check the report already executed (round-19 + # CI review on PR #318). 
+ result_name = type(self._results).__name__ + if result_name in {"SyntheticDiDResults", "TROPResults"} and _ran("estimator_native"): + if "sensitivity" not in completed: + completed.append("sensitivity") + if _ran("heterogeneity"): + completed.append("heterogeneity") + ns = practitioner_next_steps( + self._results, + completed_steps=completed, + verbose=False, + ) + return [ + { + "label": s.get("label"), + "why": s.get("why"), + "code": s.get("code"), + "priority": s.get("priority"), + "baker_step": s.get("baker_step"), + } + for s in ns.get("next_steps", [])[:5] + ] + except Exception: # noqa: BLE001 + return [] + + # -- Per-check runners -------------------------------------------------- + + def _check_parallel_trends(self) -> Dict[str, Any]: + """Run the parallel-trends check. Dispatches on PT method for this type.""" + if "parallel_trends" in self._precomputed: + return self._format_precomputed_pt(self._precomputed["parallel_trends"]) + + method = _PT_METHOD.get(type(self._results).__name__) + if method == "two_x_two": + return self._pt_two_x_two() + if method == "event_study": + return self._pt_event_study() + if method == "hausman": + return self._pt_hausman() + if method == "synthetic_fit": + return self._pt_synthetic_fit() + if method == "factor": + return self._pt_factor() + return { + "status": "not_applicable", + "reason": f"No parallel-trends method registered for " + f"{type(self._results).__name__}.", + } + + def _pt_two_x_two(self) -> Dict[str, Any]: + """Simple two-period PT check via ``utils.check_parallel_trends``.""" + from diff_diff.utils import check_parallel_trends + + if self._data is None or self._outcome is None or self._time is None: + return { + "status": "skipped", + "reason": "Requires data=, outcome=, time=, and a treatment-group " + "column; not supplied.", + } + treatment_group = self._treatment + if treatment_group is None: + return { + "status": "skipped", + "reason": "Requires treatment= identifying the " + "treated-group indicator; not supplied.", + } + # Round-40 P1 CI review on PR #318: defense-in-depth. The + # instance-level applicability gate should have already returned + # a skip reason when ``results.survey_metadata`` is non-None and + # no precomputed PT was supplied, but ``_pt_two_x_two`` is also + # reachable directly from ``_check_parallel_trends`` if future + # callers add method dispatch overrides. Guard at the runner + # too to prevent ``utils.check_parallel_trends`` from emitting + # an unweighted verdict for a weighted fit. + if getattr(self._results, "survey_metadata", None) is not None: + return { + "status": "skipped", + "reason": ( + "Original fit used a survey design; the simple 2x2 " + "parallel-trends helper (``utils.check_parallel_trends``) " + "is unweighted and cannot faithfully diagnose a " + "survey-weighted DiDResults. Supply a survey-aware " + "pretest via ``precomputed={'parallel_trends': ...}`` " + "to opt in." 
+ ), + } + try: + raw = check_parallel_trends( + self._data, + outcome=self._outcome, + time=self._time, + treatment_group=treatment_group, + pre_periods=self._pre_periods, + ) + except Exception as exc: # noqa: BLE001 + return { + "status": "error", + "reason": f"check_parallel_trends raised {type(exc).__name__}: {exc}", + } + p_value = _to_python_float(raw.get("p_value")) + return { + "status": "ran", + "method": "slope_difference", + "joint_p_value": p_value, + "treated_trend": _to_python_float(raw.get("treated_trend")), + "control_trend": _to_python_float(raw.get("control_trend")), + "trend_difference": _to_python_float(raw.get("trend_difference")), + "t_statistic": _to_python_float(raw.get("t_statistic")), + "verdict": _pt_verdict(p_value), + } + + def _pt_event_study(self) -> Dict[str, Any]: + """Event-study joint Wald (or Bonferroni fallback) on pre-period coefficients. + + Works with either ``pre_period_effects`` (``MultiPeriodDiDResults`` style, + dict of ``PeriodEffect`` objects) or ``event_study_effects`` (CS / SA / + ImputationDiD style, dict of dicts with ``effect``/``se``/``p_value`` keys). + """ + r = self._results + pre_coefs, n_dropped_undefined = _collect_pre_period_coefs(r) + # Round-33 P0 / Round-42 P1 CI review on PR #318: undefined- + # inference rows must drive an explicit ``inconclusive`` PT + # result rather than either (a) silently shrinking the + # Bonferroni family on the remaining subset and publishing a + # finite joint p-value (R33, mixed-partial case), or (b) + # routing through the empty-coefs ``skipped`` path when every + # pre-row was rejected (R42, all-undefined case). Both violate + # the ``safe_inference`` contract: ``se <= 0`` / non-finite + # effect or SE yields NaN downstream per ``utils.py`` line + # 175, REGISTRY.md line 197. The inconclusive block preserves + # the undefined-row count on the schema so BR's summary can + # quote it and stakeholders see an explicit "PT could not be + # assessed" warning rather than a silent PT-absent narrative. + if n_dropped_undefined > 0: + return { + "status": "ran", + "method": "inconclusive", + "joint_p_value": None, + "test_statistic": None, + "df": len(pre_coefs), + "n_pre_periods": len(pre_coefs), + "n_dropped_undefined": n_dropped_undefined, + "verdict": "inconclusive", + "reason": ( + f"{n_dropped_undefined} pre-period coefficient(s) " + "have undefined inference (non-finite effect / SE or " + "SE <= 0). Per the safe-inference contract " + "(``utils.py`` line 175, REGISTRY.md line 197), this " + "yields NaN downstream; the joint PT test is " + "inconclusive on this fit. Re-fit with a different " + "variance method (bootstrap / cluster) if the " + "affected rows are a small number of cohorts, or " + "investigate why the per-period SE collapsed." + ), + } + if not pre_coefs: + return { + "status": "skipped", + "reason": "No pre-period event-study coefficients available.", + } + interaction_indices = getattr(r, "interaction_indices", None) + vcov = getattr(r, "vcov", None) + + # pre_coefs is a sorted list of (key, effect, se, p_value) tuples. + per_period = [ + { + "period": _to_python_scalar(k), + "coef": _to_python_float(eff), + "se": _to_python_float(se), + "p_value": _to_python_float(p), + } + for (k, eff, se, p) in pre_coefs + ] + + joint_p: Optional[float] = None + test_statistic: Optional[float] = None + df = len(pre_coefs) + method = "bonferroni" + # Joint-Wald pathway is taken only when EVERY pre-period key is present + # in the relevant index mapping (required len == df guard below). 
This + # protects against estimators whose event-study keys use a different + # namespace than the vcov indexing: if any key is missing, we fall back + # to Bonferroni rather than risk indexing into the wrong vcov rows. + # The schema's ``method`` field exposes which path ran so agents and + # tests can distinguish the two unambiguously. + # + # Two covariance sources are supported: + # 1. ``interaction_indices`` + ``vcov`` — the MultiPeriodDiDResults + # convention, where ``vcov`` is the full regression covariance + # matrix and ``interaction_indices`` maps period labels to rows. + # 2. ``event_study_vcov_index`` + ``event_study_vcov`` — the + # CallawaySantAnnaResults convention, where the event-study + # covariance is stored separately from the full regression vcov. + vcov_for_wald: Optional[Any] = None + idx_map_for_wald: Optional[Any] = None + vcov_method_tag = "joint_wald" + if vcov is not None and interaction_indices is not None: + vcov_for_wald = vcov + idx_map_for_wald = interaction_indices + else: + es_vcov = getattr(r, "event_study_vcov", None) + es_vcov_index = getattr(r, "event_study_vcov_index", None) + if es_vcov is not None and es_vcov_index is not None: + vcov_for_wald = es_vcov + # ``event_study_vcov_index`` is an ordered list of relative-time + # keys; convert it into a dict mapping key -> position. + try: + idx_map_for_wald = {k: i for i, k in enumerate(es_vcov_index)} + vcov_method_tag = "joint_wald_event_study" + except TypeError: + idx_map_for_wald = None + df_denom: Optional[float] = None + if vcov_for_wald is not None and idx_map_for_wald is not None and df > 0: + try: + keys_in_vcov = [k for (k, _, _, _) in pre_coefs if k in idx_map_for_wald] + if len(keys_in_vcov) == df: + idx = [idx_map_for_wald[k] for k in keys_in_vcov] + beta_map = {k: eff for (k, eff, _, _) in pre_coefs} + beta = np.array([beta_map[k] for k in keys_in_vcov], dtype=float) + v_sub = np.asarray(vcov_for_wald)[np.ix_(idx, idx)] + stat = float(beta @ np.linalg.solve(v_sub, beta)) + + # Round-27 P1 CI review on PR #318: survey-backed + # fits carry a finite ``df_survey`` on + # ``survey_metadata``; using the chi-square reference + # distribution on those produces overconfident + # p-values because it ignores the finite-sample + # correction the design-based SE already reflects. + # When a finite denominator df is available, compute + # ``F = W / k`` (numerator df = k pre-periods) against + # an F(k, df_survey) reference. Reserve the chi-square + # path for fits with no finite-df information. + sm = getattr(r, "survey_metadata", None) + df_survey_raw = getattr(sm, "df_survey", None) if sm is not None else None + df_survey: Optional[float] = None + if df_survey_raw is not None: + try: + df_survey_val = float(df_survey_raw) + if np.isfinite(df_survey_val) and df_survey_val > 0: + df_survey = df_survey_val + except (TypeError, ValueError): + df_survey = None + + if df_survey is not None: + from scipy.stats import f as f_dist + + f_stat = stat / df + joint_p = float(1.0 - f_dist.cdf(f_stat, dfn=df, dfd=df_survey)) + test_statistic = stat + method = f"{vcov_method_tag}_survey" + df_denom = df_survey + else: + from scipy.stats import chi2 + + joint_p = float(1.0 - chi2.cdf(stat, df=df)) + test_statistic = stat + method = vcov_method_tag + except Exception: # noqa: BLE001 + joint_p = None + test_statistic = None + method = "bonferroni" + + if joint_p is None: + # Bonferroni fallback is only valid when EVERY retained pre- + # period contributes a finite p-value. 
Otherwise we would + # silently shrink the test family (e.g., replicate-weight + # survey fits where ``safe_inference`` returns NaN p-values + # for rows whose effective survey df collapsed — the row's + # ``effect`` / ``se`` is still finite, so the ``se > 0`` + # collector filter lets it through, but a Bonferroni + # computed on the remaining subset publishes a finite joint + # p-value that BR lifts into "consistent with parallel + # trends" prose). Round-34 P0 CI review on PR #318 flagged + # that the round-33 guard only caught the ``se <= 0`` case + # and missed this. + # + # Strategy: if any retained pre-period has non-finite + # ``p_value``, emit an explicit inconclusive PT block with + # a visible count/reason. Otherwise run Bonferroni on the + # full family as documented in REPORTING.md. + nan_p_count = sum( + 1 + for p in per_period + if not (isinstance(p["p_value"], (int, float)) and np.isfinite(p["p_value"])) + ) + if nan_p_count > 0: + return { + "status": "ran", + "method": "inconclusive", + "joint_p_value": None, + "test_statistic": None, + "df": len(pre_coefs), + "n_pre_periods": len(pre_coefs), + "n_dropped_undefined": nan_p_count, + "per_period": per_period, + "verdict": "inconclusive", + "reason": ( + f"{nan_p_count} retained pre-period coefficient(s) " + "have non-finite per-period p-value (undefined " + "inference per the ``safe_inference`` contract — " + "e.g., replicate-weight survey fits where effective " + "df collapsed). Bonferroni on the remaining subset " + "would silently shrink the test family; the joint " + "PT test is inconclusive on this fit. Inspect the " + "per_period block for the undefined rows." + ), + } + ps = [p["p_value"] for p in per_period] + if ps: + joint_p = min(1.0, min(ps) * len(ps)) + + out = { + "status": "ran", + "method": method, + "joint_p_value": joint_p, + "test_statistic": test_statistic, + "df": df, + "n_pre_periods": df, + "per_period": per_period, + "verdict": _pt_verdict(joint_p), + } + # Expose the denominator df when the survey F-path was used so + # BR / DR prose can flag the finite-sample correction rather than + # silently presenting a chi-square-style result. + if df_denom is not None: + out["df_denom"] = df_denom + return out + + def _check_pretrends_power(self) -> Dict[str, Any]: + """Compute pre-trends power (MDV) via ``compute_pretrends_power``. + + Feeds the ``mdv_share_of_att`` ratio used by ``BusinessReport`` to select + the power-aware phrasing tier for the ``no_detected_violation`` verdict. + """ + if "pretrends_power" in self._precomputed: + return self._format_precomputed_pretrends_power(self._precomputed["pretrends_power"]) + + from diff_diff.pretrends import compute_pretrends_power + + try: + pp = compute_pretrends_power( + self._results, + alpha=self._alpha, + target_power=0.80, + violation_type="linear", + ) + except Exception as exc: # noqa: BLE001 + return { + "status": "error", + "reason": f"compute_pretrends_power raised " f"{type(exc).__name__}: {exc}", + } + + # Build the schema section and compute the MDV/|ATT| ratio for BR. 
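+        # ``mdv`` is the minimum detectable violation: the smallest (linear)
+        # pre-trend violation the pre-test would detect at ``target_power``.
+        # ``mdv_share_of_att`` = MDV / |ATT| expresses that as a share of the
+        # headline effect; a smaller share means a better-powered pre-test.
+        # ``_power_tier`` maps the share to the phrasing tier (cutoffs live in
+        # that helper).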
+        headline_metric = self._extract_headline_metric()
+        att = headline_metric.get("value") if headline_metric else None
+        mdv = _to_python_float(getattr(pp, "mdv", None))
+        ratio: Optional[float] = None
+        if (
+            mdv is not None
+            and att is not None
+            and np.isfinite(att)
+            and abs(att) > 0
+            and np.isfinite(mdv)
+        ):
+            ratio = mdv / abs(att)
+
+        cov_source = self._infer_cov_source(self._results)
+        tier = _apply_diag_fallback_downgrade(_power_tier(ratio), cov_source)
+        return {
+            "status": "ran",
+            "method": "compute_pretrends_power",
+            "violation_type": getattr(pp, "violation_type", "linear"),
+            "alpha": _to_python_float(getattr(pp, "alpha", self._alpha)),
+            "target_power": _to_python_float(getattr(pp, "target_power", 0.80)),
+            "mdv": mdv,
+            "mdv_share_of_att": ratio,
+            # Power is reported at ``violation_magnitude`` — the M that
+            # the helper actually evaluated (defaults to the MDV when
+            # the caller passed ``M=None``). Schema consumers should
+            # read ``violation_magnitude`` alongside the power value.
+            "violation_magnitude": _to_python_float(getattr(pp, "violation_magnitude", None)),
+            "power_at_violation_magnitude": _to_python_float(getattr(pp, "power", None)),
+            "n_pre_periods": int(getattr(pp, "n_pre_periods", 0) or 0),
+            "tier": tier,
+            "covariance_source": cov_source,
+        }
+
+    def _format_precomputed_pretrends_power(self, obj: Any) -> Dict[str, Any]:
+        """Adapt a pre-computed ``PreTrendsPowerResults`` to the schema shape.
+
+        Round-20 P1 CI review on PR #318: this path must mirror the
+        covariance-source annotation and diagonal-fallback downgrade that
+        ``_check_pretrends_power`` applies on the default path. Otherwise
+        the same fit passed through ``precomputed={"pretrends_power": ...}``
+        can be labeled ``well_powered`` while the default path reports
+        ``moderately_powered`` (per REPORTING.md's conservative deviation
+        for CS / SA / ImputationDiD event-study fits with full
+        ``event_study_vcov`` available but unused). Resolve the source
+        fit via ``obj.original_results`` first (which ``compute_pretrends_power``
+        populates at construction time), falling back to ``self._results``.
+        """
+        mdv = _to_python_float(getattr(obj, "mdv", None))
+        hm = self._extract_headline_metric()
+        att = hm.get("value") if hm else None
+        ratio: Optional[float] = None
+        # Mirror the finite-value guards applied on the default path in
+        # ``_check_pretrends_power`` (including the ``np.isfinite(mdv)`` check)
+        # so both paths classify identical inputs identically.
+        if (
+            mdv is not None
+            and att is not None
+            and np.isfinite(att)
+            and abs(att) > 0
+            and np.isfinite(mdv)
+        ):
+            ratio = mdv / abs(att)
+        source_fit = getattr(obj, "original_results", None) or self._results
+        cov_source = self._infer_cov_source(source_fit)
+        tier = _apply_diag_fallback_downgrade(_power_tier(ratio), cov_source)
+        return {
+            "status": "ran",
+            "method": "precomputed",
+            "violation_type": getattr(obj, "violation_type", "linear"),
+            "alpha": _to_python_float(getattr(obj, "alpha", self._alpha)),
+            "target_power": _to_python_float(getattr(obj, "target_power", 0.80)),
+            "mdv": mdv,
+            "mdv_share_of_att": ratio,
+            "violation_magnitude": _to_python_float(getattr(obj, "violation_magnitude", None)),
+            "power_at_violation_magnitude": _to_python_float(getattr(obj, "power", None)),
+            "n_pre_periods": int(getattr(obj, "n_pre_periods", 0) or 0),
+            "tier": tier,
+            "covariance_source": cov_source,
+            "precomputed": True,
+        }
+
+    @staticmethod
+    def _infer_cov_source(source_fit: Any) -> str:
+        """Classify whether ``compute_pretrends_power`` had access to the
+        full pre-period covariance on ``source_fit``.
+
+        CS / SA / ImputationDiD / EfficientDiD / Stacked / etc. 
currently + fall back to ``np.diag(ses**2)`` inside ``pretrends.py``, even when + ``event_study_vcov`` is populated on the result; the returned + ``PreTrendsPowerResults.vcov`` therefore ignores off-diagonal pre- + period correlations. Annotating the source explicitly lets BR + downgrade the tier conservatively. + """ + is_event_study_type = type(source_fit).__name__ in { + "CallawaySantAnnaResults", + "SunAbrahamResults", + "ImputationDiDResults", + "StackedDiDResults", + "StaggeredTripleDiffResults", + "WooldridgeDiDResults", + "ChaisemartinDHaultfoeuilleResults", + "EfficientDiDResults", + "TwoStageDiDResults", + } + has_full_es_vcov = ( + getattr(source_fit, "event_study_vcov", None) is not None + and getattr(source_fit, "event_study_vcov_index", None) is not None + ) + if is_event_study_type and has_full_es_vcov: + return "diag_fallback_available_full_vcov_unused" + if is_event_study_type: + return "diag_fallback" + return "full_pre_period_vcov" + + def _check_sensitivity(self) -> Dict[str, Any]: + """Run HonestDiD over the M grid. Uses ``SensitivityResults.breakdown_M``. + + The standard path calls ``HonestDiD(method=..., M_grid=...).sensitivity_analysis()``. + SDiD and TROP route to estimator-native sensitivity in + ``estimator_native_diagnostics`` and emit a pointer here. + """ + if "sensitivity" in self._precomputed: + return self._format_precomputed_sensitivity(self._precomputed["sensitivity"]) + + name = type(self._results).__name__ + if name == "SyntheticDiDResults": + return { + "status": "skipped", + "reason": ( + "SyntheticDiD uses native sensitivity analogues " + "(``in_time_placebo``, ``sensitivity_to_zeta_omega``) " + "rather than HonestDiD; see " + "``estimator_native_diagnostics``." + ), + "method": "estimator_native", + } + if name == "TROPResults": + return { + "status": "skipped", + "reason": ( + "TROP identification is factor-model-based; HonestDiD " + "bounds do not apply. Use the factor-model fit metrics " + "(effective rank, LOOCV score, selected lambdas) in " + "``estimator_native_diagnostics`` as the analogue." + ), + "method": "estimator_native", + } + + # Varying-base CS gate: handled at ``_instance_skip_reason``, so + # this code path is not reached for a varying-base CS fit unless + # the user passed ``precomputed={'sensitivity': ...}`` (handled + # above). Kept here as a comment anchor; see _instance_skip_reason. + + import warnings as _warnings + + try: + from typing import cast + + from diff_diff.honest_did import HonestDiD + + # Capture any non-fatal UserWarnings HonestDiD emits (bootstrap + # diag-covariance fallback on CS, library-extension note on + # dCDH, dropped non-consecutive horizons, etc.) so BR/DR do not + # silently narrate sensitivity as clean when the helper + # flagged caveats. The try/except below still handles fatal + # errors; captured warnings ride on the returned dict. + with _warnings.catch_warnings(record=True) as caught: + _warnings.simplefilter("always") + # The sensitivity_method string is validated at runtime by + # HonestDiD; the Literal annotation is for static typing only. 
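+                # Informal reading when ``method="relative_magnitude"``: each M
+                # bounds how large post-treatment deviations from parallel
+                # trends may be relative to the worst deviation observed in the
+                # pre-period (M=1.0 means "no worse than the pre-trends"); the
+                # default grid is (0.5, 1.0, 1.5, 2.0).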
+ honest = HonestDiD( + method=cast(Any, self._sensitivity_method), + alpha=self._alpha, + ) + sens = honest.sensitivity_analysis( + self._results, + M_grid=list(self._sensitivity_M_grid), + ) + except Exception as exc: # noqa: BLE001 + return { + "status": "error", + "method": self._sensitivity_method, + "reason": f"HonestDiD.sensitivity_analysis raised " f"{type(exc).__name__}: {exc}", + } + + captured = [str(w.message) for w in caught if issubclass(w.category, Warning)] + formatted = self._format_sensitivity_results(sens) + if captured: + formatted["warnings"] = captured + return formatted + + def _format_sensitivity_results(self, sens: Any) -> Dict[str, Any]: + grid = [] + raw_M = getattr(sens, "M_values", None) + raw_cis = getattr(sens, "robust_cis", None) + raw_bounds = getattr(sens, "bounds", None) + M_values: List[Any] = list(raw_M) if raw_M is not None else [] + cis: List[Any] = list(raw_cis) if raw_cis is not None else [] + bounds: List[Any] = list(raw_bounds) if raw_bounds is not None else [] + for i, M in enumerate(M_values): + ci = cis[i] if i < len(cis) else (None, None) + bd = bounds[i] if i < len(bounds) else (None, None) + lo = _to_python_float(ci[0]) + hi = _to_python_float(ci[1]) + robust_to_zero = lo is not None and hi is not None and (lo > 0 or hi < 0) + grid.append( + { + "M": _to_python_float(M), + "ci_lower": lo, + "ci_upper": hi, + "bound_lower": _to_python_float(bd[0]), + "bound_upper": _to_python_float(bd[1]), + "robust_to_zero": robust_to_zero, + } + ) + bkd = _to_python_float(getattr(sens, "breakdown_M", None)) + if bkd is None: + conclusion = "robust_over_grid" + elif bkd >= 1.0: + conclusion = f"robust_to_M_{bkd:.2f}" + else: + conclusion = "fragile" + return { + "status": "ran", + "method": getattr(sens, "method", self._sensitivity_method), + "grid": grid, + "breakdown_M": bkd, + "original_estimate": _to_python_float(getattr(sens, "original_estimate", None)), + "original_se": _to_python_float(getattr(sens, "original_se", None)), + "conclusion": conclusion, + } + + def _format_precomputed_sensitivity(self, obj: Any) -> Dict[str, Any]: + """Accept either ``SensitivityResults`` (grid) or ``HonestDiDResults`` (single M). + + The single-M branch preserves ``original_estimate`` and + ``original_se`` for parity with the grid branch — both + ``SensitivityResults`` and ``HonestDiDResults`` carry these fields, + and downstream tooling that reads the schema should see a + consistent shape regardless of which object was passed. (The + grid path surfaces them via ``_format_sensitivity_results``.) + """ + if hasattr(obj, "M_values") and hasattr(obj, "breakdown_M"): + formatted = self._format_sensitivity_results(obj) + formatted["precomputed"] = True + return formatted + # Single-M HonestDiDResults: adapt with no breakdown_M. 
+ ci_lb = _to_python_float(getattr(obj, "ci_lb", None)) + ci_ub = _to_python_float(getattr(obj, "ci_ub", None)) + return { + "status": "ran", + "method": getattr(obj, "method", self._sensitivity_method), + "grid": [ + { + "M": _to_python_float(getattr(obj, "M", None)), + "ci_lower": ci_lb, + "ci_upper": ci_ub, + "bound_lower": _to_python_float(getattr(obj, "lb", None)), + "bound_upper": _to_python_float(getattr(obj, "ub", None)), + "robust_to_zero": ( + ci_lb is not None and ci_ub is not None and (ci_lb > 0 or ci_ub < 0) + ), + } + ], + "breakdown_M": None, + "original_estimate": _to_python_float(getattr(obj, "original_estimate", None)), + "original_se": _to_python_float(getattr(obj, "original_se", None)), + "conclusion": "single_M_precomputed", + "precomputed": True, + } + + def _check_bacon(self) -> Dict[str, Any]: + """Surface Bacon decomposition: read-out when applicable, else skip. + + If ``results`` is itself a ``BaconDecompositionResults``, read fields. + If ``data`` + ``first_treat`` are supplied, call ``bacon_decompose``. + Otherwise, skip with a helpful reason. + """ + if "bacon" in self._precomputed: + return self._format_bacon(self._precomputed["bacon"]) + + r = self._results + name = type(r).__name__ + if name == "BaconDecompositionResults": + return self._format_bacon(r) + + data = self._data + outcome = self._outcome + unit = self._unit + time = self._time + first_treat = self._first_treat + if data is None or outcome is None or unit is None or time is None or first_treat is None: + return { + "status": "skipped", + "reason": "Bacon decomposition requires data + outcome + unit + time " + "+ first_treat on DiagnosticReport; not all supplied.", + } + # Round-40 P1 CI review on PR #318: defense-in-depth. The + # instance-level applicability gate should have already returned + # a skip when the result carries ``survey_metadata`` but no + # ``survey_design`` is available to thread through. Guard at + # the runner too in case a future caller bypasses the gate. + if getattr(r, "survey_metadata", None) is not None and self._survey_design is None: + return { + "status": "skipped", + "reason": ( + "Original fit used a survey design; Goodman-Bacon " + "replay under defaults would produce an unweighted " + "decomposition for a different design than the " + "weighted estimate. Pass ``survey_design=`` " + "on DiagnosticReport / BusinessReport, or supply " + "``precomputed={'bacon': ...}`` with a survey-aware " + "decomposition." 
+ ), + } + + try: + from diff_diff.bacon import bacon_decompose + + bacon = bacon_decompose( + data, + outcome=outcome, + unit=unit, + time=time, + first_treat=first_treat, + survey_design=self._survey_design, + ) + except Exception as exc: # noqa: BLE001 + return { + "status": "error", + "reason": f"bacon_decompose raised {type(exc).__name__}: {exc}", + } + return self._format_bacon(bacon) + + def _format_bacon(self, bacon: Any) -> Dict[str, Any]: + treated_vs_never = _to_python_float(getattr(bacon, "total_weight_treated_vs_never", None)) + earlier_vs_later = _to_python_float(getattr(bacon, "total_weight_earlier_vs_later", None)) + later_vs_earlier = _to_python_float(getattr(bacon, "total_weight_later_vs_earlier", None)) + twfe = _to_python_float(getattr(bacon, "twfe_estimate", None)) + forbidden = later_vs_earlier if later_vs_earlier is not None else 0.0 + if forbidden > 0.10: + verdict = "materially_contaminated" + elif forbidden > 0.01: + verdict = "minor_forbidden_weight" + else: + verdict = "clean" + return { + "status": "ran", + "twfe_estimate": twfe, + "weight_by_type": { + "treated_vs_never": treated_vs_never, + "earlier_vs_later": earlier_vs_later, + "later_vs_earlier": later_vs_earlier, + }, + "forbidden_weight": later_vs_earlier, + "verdict": verdict, + "n_timing_groups": _to_python_scalar(getattr(bacon, "n_timing_groups", None)), + } + + def _check_design_effect(self) -> Dict[str, Any]: + """Read survey design-effect from ``results.survey_metadata``. + + Emits a plain-English ``band_label`` alongside the numeric + fields so downstream prose can classify the correction without + re-deriving the threshold rule. REPORTING.md describes the + band breakpoints (round-32 P2 CI review on PR #318 flagged + that the docs advertised the label but the implementation was + only emitting the numeric fields plus ``is_trivial``). + + Bands (per REPORTING.md): + * ``deff < 0.95`` -> ``"improves_precision"`` (effective N + is LARGER than nominal N — a precision-improving design; + round-35 split this out from the old ``trivial`` bucket); + * ``0.95 <= deff < 1.05`` -> ``"trivial"`` (effectively no + effect on inference); + * ``1.05 <= deff < 2`` -> ``"slightly_reduces"``; + * ``2 <= deff < 5`` -> ``"materially_reduces"``; + * ``deff >= 5`` -> ``"large_warning"``. + ``None`` deff (or non-finite) -> ``band_label=None`` (no + classification). + """ + sm = getattr(self._results, "survey_metadata", None) + if sm is None: + return { + "status": "skipped", + "reason": "No survey_metadata attached to results.", + } + deff = _to_python_float(getattr(sm, "design_effect", None)) + eff_n = _to_python_float(getattr(sm, "effective_n", None)) + # Round-35 P2 CI review on PR #318: ``is_trivial`` used to be + # ``0.95 <= deff <= 1.05`` while ``band_label`` treated + # anything ``< 1.05`` as trivial. On a precision-improving + # design (``deff < 0.95``) BR's summary keyed off + # ``not is_trivial`` and narrated "Survey design reduces + # effective sample size", which is directionally wrong — the + # effective N is LARGER than the nominal N. Split the band + # into a dedicated ``improves_precision`` label for + # ``deff < 0.95`` and keep ``is_trivial`` restricted to the + # tight "effectively no effect" window so the schema + # carries the precision-improving signal explicitly. 
+ # + # Round-43 P2 CI review on PR #318: the ``is_trivial`` upper + # bound used ``<= 1.05`` (closed interval) but REPORTING.md + # defines the ``trivial`` band as ``0.95 <= deff < 1.05`` + # (half-open) and ``slightly_reduces`` as ``1.05 <= deff < 2``. + # At exactly ``deff == 1.05`` the schema emitted + # ``band_label="slightly_reduces"`` while also setting + # ``is_trivial=True``, suppressing the non-trivial prose that + # the documented threshold says should fire. Align the + # ``is_trivial`` bound with the band-label bound. + is_trivial = deff is not None and 0.95 <= deff < 1.05 + if deff is None or not np.isfinite(deff): + band_label: Optional[str] = None + elif deff < 0.95: + band_label = "improves_precision" + elif deff < 1.05: + band_label = "trivial" + elif deff < 2.0: + band_label = "slightly_reduces" + elif deff < 5.0: + band_label = "materially_reduces" + else: + band_label = "large_warning" + return { + "status": "ran", + "deff": deff, + "effective_n": eff_n, + "weight_type": getattr(sm, "weight_type", None), + "n_strata": _to_python_scalar(getattr(sm, "n_strata", None)), + "n_psu": _to_python_scalar(getattr(sm, "n_psu", None)), + "df_survey": _to_python_scalar(getattr(sm, "df_survey", None)), + "replicate_method": getattr(sm, "replicate_method", None), + "is_trivial": is_trivial, + "band_label": band_label, + } + + def _check_heterogeneity(self) -> Dict[str, Any]: + """Compute effect-stability metrics (CV, range, sign consistency).""" + effects = self._collect_effect_scalars() + if not effects: + return { + "status": "skipped", + "reason": "No group / event-study / period effects available.", + } + vals = np.array(effects, dtype=float) + finite = vals[np.isfinite(vals)] + if finite.size == 0: + return { + "status": "skipped", + "reason": "All effect values are non-finite.", + } + mean = float(np.mean(finite)) + sd = float(np.std(finite, ddof=1)) if finite.size > 1 else 0.0 + mn = float(np.min(finite)) + mx = float(np.max(finite)) + cv = sd / abs(mean) if abs(mean) > 0.1 * sd and abs(mean) > 0 else None + sign_consistent = bool(np.all(finite >= 0) or np.all(finite <= 0)) + return { + "status": "ran", + "source": self._heterogeneity_source(), + "n_effects": int(finite.size), + "min": mn, + "max": mx, + "mean": mean, + "sd": sd, + "range": mx - mn, + "cv": cv, + "sign_consistent": sign_consistent, + } + + def _check_epv(self) -> Dict[str, Any]: + """Read EPV diagnostics from ``results.epv_diagnostics``. + + The diff-diff convention (see ``diff_diff/staggered.py`` around the + low-EPV summary warning) is that ``epv_diagnostics`` is a dict keyed + by cell identifier (e.g. ``(g, t)`` for staggered) whose values are + per-cell dicts with ``is_low`` (bool) and ``epv`` (float). The + threshold lives on ``results.epv_threshold`` (default 10) rather + than being hardcoded. 
+ """ + r = self._results + epv = getattr(r, "epv_diagnostics", None) + if epv is None: + return { + "status": "skipped", + "reason": "Estimator did not produce results.epv_diagnostics for this fit.", + } + threshold = _to_python_float(getattr(r, "epv_threshold", 10)) or 10.0 + + if isinstance(epv, dict): + low_cells = [k for k, v in epv.items() if isinstance(v, dict) and v.get("is_low")] + epv_floats: List[float] = [] + for v in epv.values(): + if not isinstance(v, dict): + continue + raw = v.get("epv") + if raw is None: + continue + converted = _to_python_float(raw) + if converted is not None: + epv_floats.append(converted) + min_epv: Optional[float] = min(epv_floats) if epv_floats else None + return { + "status": "ran", + "threshold": threshold, + "n_cells_low": len(low_cells), + "n_cells_total": len(epv), + "min_epv": min_epv, + "affected_cohorts": [_to_python_scalar(c) for c in low_cells], + } + + # Legacy object-shaped fallback (not currently emitted by the library + # but kept so custom subclasses that mirror the old shape still work). + low_cells_attr = getattr(epv, "low_epv_cells", None) or [] + return { + "status": "ran", + "threshold": threshold, + "n_cells_low": int(len(low_cells_attr)), + "n_cells_total": _to_python_scalar(getattr(epv, "n_cells_total", None)), + "min_epv": _to_python_float(getattr(epv, "min_epv", None)), + "affected_cohorts": [_to_python_scalar(c) for c in low_cells_attr], + } + + def _check_estimator_native(self) -> Dict[str, Any]: + """SDiD / TROP native validation surfaces. + + SDiD: ``pre_treatment_fit`` (weighted-PT analogue), weight + concentration (``get_weight_concentration``), ``in_time_placebo`` + (placebo-timing sweep), and ``sensitivity_to_zeta_omega`` + (regularization sensitivity). + + TROP: factor-model fit metrics (``effective_rank``, ``loocv_score``, + selected ``lambda_*``). + """ + r = self._results + name = type(r).__name__ + if name == "SyntheticDiDResults": + return self._sdid_native(r) + if name == "TROPResults": + return self._trop_native(r) + return { + "status": "not_applicable", + "reason": f"{name} does not expose native validation methods.", + } + + def _sdid_native(self, r: Any) -> Dict[str, Any]: + """Populate SDiD-native diagnostics section.""" + out: Dict[str, Any] = {"status": "ran", "estimator": "SyntheticDiD"} + out["pre_treatment_fit"] = _to_python_float(getattr(r, "pre_treatment_fit", None)) + # Weight concentration via the public method on SyntheticDiDResults. + try: + wc = r.get_weight_concentration(top_k=5) + out["weight_concentration"] = { + "effective_n": _to_python_float(wc.get("effective_n")), + "herfindahl": _to_python_float(wc.get("herfindahl")), + "top_k": _to_python_scalar(wc.get("top_k")), + "top_k_share": _to_python_float(wc.get("top_k_share")), + } + except Exception as exc: # noqa: BLE001 + out["weight_concentration"] = { + "status": "error", + "reason": f"get_weight_concentration raised " f"{type(exc).__name__}: {exc}", + } + # In-time placebo — runs only when the fit snapshot is available. 
+ try: + placebo_df = r.in_time_placebo() + out["in_time_placebo"] = { + "n_placebos": int(len(placebo_df)), + "max_abs_effect": _to_python_float( + placebo_df["att"].abs().max() if len(placebo_df) > 0 else None + ), + "mean_abs_effect": _to_python_float( + placebo_df["att"].abs().mean() if len(placebo_df) > 0 else None + ), + } + except Exception as exc: # noqa: BLE001 + out["in_time_placebo"] = { + "status": "skipped", + "reason": f"in_time_placebo unavailable: " f"{type(exc).__name__}: {exc}", + } + # Zeta-omega sensitivity. + try: + zeta_df = r.sensitivity_to_zeta_omega() + atts = zeta_df["att"].astype(float).tolist() if len(zeta_df) > 0 else [] + out["zeta_sensitivity"] = { + "grid": [ + { + "multiplier": _to_python_float(row.get("multiplier")), + "att": _to_python_float(row.get("att")), + "pre_fit_rmse": _to_python_float(row.get("pre_fit_rmse")), + "effective_n": _to_python_float(row.get("effective_n")), + } + for row in zeta_df.to_dict(orient="records") + ], + "att_range": ([min(atts), max(atts)] if atts else None), + } + except Exception as exc: # noqa: BLE001 + out["zeta_sensitivity"] = { + "status": "skipped", + "reason": f"sensitivity_to_zeta_omega unavailable: " f"{type(exc).__name__}: {exc}", + } + return out + + def _trop_native(self, r: Any) -> Dict[str, Any]: + """Populate TROP-native factor-model diagnostics section.""" + return { + "status": "ran", + "estimator": "TROP", + "factor_model": { + "effective_rank": _to_python_float(getattr(r, "effective_rank", None)), + "loocv_score": _to_python_float(getattr(r, "loocv_score", None)), + "lambda_time": _to_python_float(getattr(r, "lambda_time", None)), + "lambda_unit": _to_python_float(getattr(r, "lambda_unit", None)), + "lambda_nn": _to_python_float(getattr(r, "lambda_nn", None)), + "n_pre_periods": _to_python_scalar(getattr(r, "n_pre_periods", None)), + "n_post_periods": _to_python_scalar(getattr(r, "n_post_periods", None)), + }, + } + + # -- Heterogeneity helpers -------------------------------------------- + + def _collect_effect_scalars(self) -> List[float]: + """Collect scalar **post-treatment** effect values across group / event- + study / TROP sources. + + Pre-period coefficients (placebos and normalization constraints) + and synthetic reference-marker rows are explicitly excluded — + mixing them into the heterogeneity dispersion / sign-consistency + summary silently redefines the estimand, which the round-6 CI + review flagged on PR #318. + + Returns an empty list if no recognized effect container yields + any post-treatment entries. + """ + r = self._results + # 1. group_effects: per-cohort post-treatment ATT(g) by construction. + ge = getattr(r, "group_effects", None) + if ge is not None: + return self._scalars_from_mapping(ge) + # 2. MultiPeriodDiDResults: use the ``post_period_effects`` property + # (post-treatment only) instead of ``period_effects`` (which mixes + # pre- and post-treatment coefficients). + ppe = getattr(r, "post_period_effects", None) + if ppe is not None: + return self._scalars_from_mapping(ppe) + # 3. event_study_effects: dict keyed by relative time -> dict with + # 'effect'. Filter to **post-treatment** horizons (rel_time >= 0), + # exclude reference markers (``n_groups == 0`` on CS/SA; + # ``n_obs == 0`` on Stacked/TwoStage/Imputation/EfficientDiD), and + # exclude entries with non-finite effect. 
+ es = getattr(r, "event_study_effects", None) + if es is not None: + # Anticipation-aware post-treatment cutoff: include horizons + # from the anticipation window onward (where treatment- + # affected effects can live) per REGISTRY.md §CallawaySantAnna + # lines 355-395; round-15 CI review flagged the prior + # ``rel >= 0`` rule as excluding anticipation-window effects + # from the heterogeneity dispersion summary. + post_cutoff = _pre_post_boundary(r) + post_only: List[float] = [] + try: + items = list(es.items()) + except Exception: # noqa: BLE001 + items = [] + for key, entry in items: + try: + rel = int(key) + except (TypeError, ValueError): + # Non-integer keys — unknown shape; skip conservatively + # rather than mixing into the dispersion summary. + continue + if rel < post_cutoff: + continue + if isinstance(entry, dict): + if entry.get("n_groups") == 0 or entry.get("n_obs") == 0: + continue + eff = _extract_scalar_effect(entry) + if eff is None or not np.isfinite(eff): + continue + post_only.append(eff) + return post_only + # 4. TROP: treatment_effects dict keyed by (unit, time) -> float. + # TROP produces counterfactual deltas only at observed points for + # treated units (the factor-model construction), so these are + # post-treatment by design. + te = getattr(r, "treatment_effects", None) + if te is not None: + return self._scalars_from_mapping(te) + # 5. CS default aggregation: group_time_effects dict keyed by + # (g, t) -> dict. Filter to t >= g (post-treatment cells); the + # pre-treatment cells (t < g) are identification-deviation + # placebos, not effect heterogeneity. + gte = getattr(r, "group_time_effects", None) + if gte is not None: + post_cells: List[float] = [] + try: + items = list(gte.items()) + except Exception: # noqa: BLE001 + items = [] + for key, entry in items: + g_t = None + if isinstance(key, tuple) and len(key) == 2: + g_t = key + else: + g_val = ( + getattr(entry, "group", None) + if not isinstance(entry, dict) + else entry.get("group") + ) + t_val = ( + getattr(entry, "time", None) + if not isinstance(entry, dict) + else entry.get("time") + ) + if g_val is not None and t_val is not None: + g_t = (g_val, t_val) + if g_t is not None: + try: + g_num = float(g_t[0]) + t_num = float(g_t[1]) + # Estimator-specific post cutoff. CS / + # EfficientDiD / SA treat ``t >= g - anticipation`` + # as treatment-affected (anticipation window is + # post-announcement). Wooldridge aggregation is + # documented as ``t >= g`` with the anticipation + # window rendered as placebos, not post- + # treatment effects (REGISTRY.md §Wooldridge + # lines 1351-1352). Round-16 CI review flagged + # the blanket anticipation shift as Wooldridge- + # unfaithful. 
+ if type(r).__name__ == "WooldridgeDiDResults": + anticipation = 0 + else: + anticipation = getattr(r, "anticipation", 0) or 0 + try: + anticipation = int(anticipation) + except (TypeError, ValueError): + anticipation = 0 + if t_num < g_num - anticipation: + continue + except (TypeError, ValueError): + pass + eff = _extract_scalar_effect(entry) + if eff is None or not np.isfinite(eff): + continue + post_cells.append(eff) + return post_cells + return [] + + @staticmethod + def _scalars_from_mapping(mapping: Any) -> List[float]: + """Extract scalar effect values from various result-mapping shapes.""" + out: List[float] = [] + values: List[Any] + values_fn = getattr(mapping, "values", None) + if callable(values_fn): + try: + values = list(values_fn()) + except Exception: # noqa: BLE001 + return [] + else: + try: + values = list(mapping) # type: ignore[arg-type] + except Exception: # noqa: BLE001 + return [] + for val in values: + eff = _extract_scalar_effect(val) + if eff is not None: + out.append(eff) + return out + + def _heterogeneity_source(self) -> str: + """Name the attribute that produced the scalars (for the schema). + + Mirrors the dispatch order in ``_collect_effect_scalars`` and + reports the actual post-treatment surface consumed (e.g., + ``post_period_effects`` rather than ``period_effects`` on + ``MultiPeriodDiDResults``, and ``event_study_effects_post`` to + make it clear pre-period / reference-marker rows were filtered). + """ + r = self._results + if getattr(r, "group_effects", None) is not None: + return "group_effects" + if getattr(r, "post_period_effects", None) is not None: + return "post_period_effects" + if getattr(r, "event_study_effects", None) is not None: + return "event_study_effects_post" + if getattr(r, "treatment_effects", None) is not None: + return "treatment_effects" + if getattr(r, "group_time_effects", None) is not None: + return "group_time_effects_post" + return "unknown" + + def _pt_hausman(self) -> Dict[str, Any]: + """EfficientDiD native PT check via ``EfficientDiD.hausman_pretest``. + + This is the correct PT check for EfficientDiD (PT-All vs PT-Post); the + generic event-study approach is inappropriate for this estimator per + ``practitioner._parallel_trends_step`` guidance. + """ + data = self._data + outcome = self._outcome + unit = self._unit + time = self._time + first_treat = self._first_treat + missing = [ + name + for name, val in ( + ("data", data), + ("outcome", outcome), + ("unit", unit), + ("time", time), + ("first_treat", first_treat), + ) + if val is None + ] + if ( + missing + or data is None + or outcome is None + or unit is None + or time is None + or first_treat is None + ): + return { + "status": "skipped", + "method": "hausman", + "reason": ( + "EfficientDiD.hausman_pretest requires data + outcome + unit + " + f"time + first_treat kwargs on DiagnosticReport; missing: " + f"{', '.join(missing)}." + ), + } + + # Fit-faithful guard. ``EfficientDiDResults`` exposes + # ``control_group``, ``anticipation``, and ``estimation_path`` + # (``"nocov"`` or ``"dr"``) plus ``survey_metadata``, but not the + # ``covariates`` list, ``cluster`` column, or nuisance kwargs + # needed to replay a DR / clustered / survey-weighted fit. If + # the original fit used any of those paths, rerunning the + # pretest under defaults would diagnose a different design than + # the estimate being summarized. Skip with an explicit reason + # instead of silently fibbing. 
+ r = self._results + estimation_path = getattr(r, "estimation_path", "nocov") + has_survey = getattr(r, "survey_metadata", None) is not None + if estimation_path != "nocov" or has_survey: + reasons: List[str] = [] + if estimation_path == "dr": + reasons.append( + "the original fit used the doubly-robust path with " + "covariates (``covariates`` list is not stored on " + "``EfficientDiDResults``)" + ) + if has_survey: + reasons.append( + "the original fit used a survey design (replay would " + "require the full ``SurveyDesign`` object)" + ) + return { + "status": "skipped", + "method": "hausman", + "reason": ( + "Cannot faithfully replay the Hausman pretest: " + + "; ".join(reasons) + + ". Rerunning the pretest under defaults would " + "diagnose a different design than the estimate. " + "Rerun ``EfficientDiD.hausman_pretest(...)`` " + "manually with the original fit's kwargs or pass " + "``precomputed={'parallel_trends': ...}`` if you have " + "a pretest result." + ), + } + + # Propagate settings we can read off the result. On the + # ``nocov`` / no-survey path we just gated to, the design + # kwargs that matter for fit-faithful replay are + # ``control_group``, ``anticipation``, and — when the fit was + # clustered — ``cluster``. ``EfficientDiDResults`` persists the + # cluster column so a clustered Hausman statistic is reported + # for a clustered fit rather than a silently-unclustered one. + hausman_kwargs: Dict[str, Any] = {} + fit_control_group = getattr(r, "control_group", None) + if isinstance(fit_control_group, str): + hausman_kwargs["control_group"] = fit_control_group + fit_anticipation = getattr(r, "anticipation", None) + if isinstance(fit_anticipation, (int, float)) and np.isfinite(fit_anticipation): + hausman_kwargs["anticipation"] = int(fit_anticipation) + fit_cluster = getattr(r, "cluster", None) + if isinstance(fit_cluster, str) and fit_cluster: + hausman_kwargs["cluster"] = fit_cluster + + try: + from diff_diff.efficient_did import EfficientDiD + + pt = EfficientDiD.hausman_pretest( + data, + outcome=outcome, + unit=unit, + time=time, + first_treat=first_treat, + alpha=self._alpha, + **hausman_kwargs, + ) + except Exception as exc: # noqa: BLE001 + return { + "status": "error", + "method": "hausman", + "reason": f"hausman_pretest raised {type(exc).__name__}: {exc}", + } + + p_value = _to_python_float(getattr(pt, "p_value", None)) + # ``HausmanPretestResult`` exposes ``statistic`` (not + # ``test_statistic``); keep a fallback in case a precomputed + # passthrough object uses the alternate name. + test_stat = _to_python_float(getattr(pt, "statistic", getattr(pt, "test_statistic", None))) + return { + "status": "ran", + "method": "hausman", + "joint_p_value": p_value, + "test_statistic": test_stat, + "df": _to_python_scalar(getattr(pt, "df", None)), + "verdict": _pt_verdict(p_value), + } + + def _pt_synthetic_fit(self) -> Dict[str, Any]: + """SDiD weighted pre-treatment-fit PT analogue. + + SDiD's design-enforced fit quality substitutes for a standard PT test: + the synthetic control is explicitly constructed to match the treated + group's pre-period trajectory, so small ``pre_treatment_fit`` RMSE + means the weighted-PT analogue is satisfied. 
+ """ + r = self._results + fit = _to_python_float(getattr(r, "pre_treatment_fit", None)) + if fit is None: + return { + "status": "skipped", + "method": "synthetic_fit", + "reason": "SyntheticDiDResults.pre_treatment_fit is not populated " "on this fit.", + } + # Proxy verdict: unlike a classical PT p-value, this is a fit-quality + # metric. Classify conservatively — phrasing in BR will explain that + # this is SDiD's design-enforced analogue, not a PT hypothesis test. + return { + "status": "ran", + "method": "synthetic_fit", + "pre_treatment_fit_rmse": fit, + "verdict": "design_enforced_pt", + } + + def _pt_factor(self) -> Dict[str, Any]: + """TROP has no PT concept — its identification is factor-model-based.""" + return { + "status": "not_applicable", + "reason": "TROP uses factor-model identification; parallel trends is " + "not applicable. See estimator_native_diagnostics for the " + "factor-model fit metrics.", + "method": "factor", + } + + def _format_precomputed_pt(self, obj: Any) -> Dict[str, Any]: + """Adapt a pre-computed parallel-trends result to the schema shape. + + Accepted inputs (round-23 P1 CI review on PR #318): + * A dict from ``utils.check_parallel_trends`` with ``p_value`` + (2x2 PT shape) — ``joint_p_value`` inherits from ``p_value`` + when only the 2x2 key is supplied. + * A schema-shaped dict with ``joint_p_value`` and optional + ``test_statistic`` / ``df`` / ``method`` (the same shape + ``to_dict()["parallel_trends"]`` emits on the default path), + so a PT block from one DR run can be replayed into another. + * A native result object exposing ``p_value`` (or + ``joint_p_value``) plus optional ``statistic`` / + ``test_statistic`` and ``df`` — in particular, EfficientDiD's + ``HausmanPretestResult``, which is what the ``_pt_hausman`` + skip message points users toward when replay fails on a + non-nocov / survey fit. + + Previously the formatter rejected non-dict inputs outright and + only read ``p_value``, so ``HausmanPretestResult`` could not be + passed through at all and a schema-shaped dict silently lost its + ``joint_p_value`` / ``test_statistic`` / ``df`` fields. + """ + + def _read(name: str) -> Any: + if isinstance(obj, dict): + return obj.get(name) + return getattr(obj, name, None) + + # Accept joint_p_value preferentially, but fall back to the 2x2 + # ``p_value`` key so ``utils.check_parallel_trends`` dicts still + # work as before. + raw_p = _read("joint_p_value") + if raw_p is None: + raw_p = _read("p_value") + p_value = _to_python_float(raw_p) + + # ``HausmanPretestResult`` exposes ``statistic``; schema-shaped + # dicts and the default DR path both use ``test_statistic``. + raw_stat = _read("test_statistic") + if raw_stat is None: + raw_stat = _read("statistic") + test_statistic = _to_python_float(raw_stat) + + df = _to_python_scalar(_read("df")) + + # Method inference (round-26 P2 CI review on PR #318). Downstream + # BR / DR prose keys off ``method`` to pick the right subject and + # statistic label (``"joint p"`` for event-study Wald / + # Bonferroni, ``"p"`` for the 2x2 slope-difference and Hausman + # single-statistic tests, no label for design-enforced paths). + # Defaulting to ``"precomputed"`` made raw 2x2 dicts and native + # Hausman objects render with the wrong subject ("Pre-treatment + # data") and label ("joint p"). 
Infer from the distinguishing + # fields when ``method`` is not explicit: + # * ``HausmanPretestResult`` / shape: has ``statistic``, plus + # at least one of ``att_all`` / ``att_post`` / ``recommendation`` + # (disambiguates from the schema-shaped dict which may also + # carry ``test_statistic`` but does not carry the Hausman- + # specific companion fields). + # * ``utils.check_parallel_trends`` 2x2 dict: carries + # ``trend_difference`` / ``treated_trend`` / ``control_trend`` + # as its distinguishing fields. + method = _read("method") + if method is None: + hausman_markers = _read("statistic") is not None and any( + _read(tag) is not None + for tag in ("att_all", "att_post", "recommendation", "reject") + ) + slope_markers = any( + _read(tag) is not None + for tag in ("trend_difference", "treated_trend", "control_trend") + ) + if hausman_markers: + method = "hausman" + elif slope_markers: + method = "slope_difference" + else: + method = "precomputed" + + # If no recognized p-value field was supplied at all, surface an + # error rather than silently producing ``joint_p_value=None``. + # Stay permissive about dict shapes — absence of ``test_statistic`` + # or ``df`` is fine (2x2 PT has neither), but a complete absence + # of a p-value / joint-p-value means the input is not a PT result. + if raw_p is None: + return { + "status": "error", + "method": method, + "reason": ( + "precomputed['parallel_trends'] must expose either " + "``joint_p_value`` (schema shape / HausmanPretestResult) or " + "``p_value`` (check_parallel_trends 2x2 shape). Got an object " + "with neither: pass a dict with one of those keys, or a " + "native result object (e.g., HausmanPretestResult) exposing " + "``p_value``." + ), + } + + out: Dict[str, Any] = { + "status": "ran", + "method": method, + "joint_p_value": p_value, + "verdict": _pt_verdict(p_value), + "precomputed": True, + } + if test_statistic is not None: + out["test_statistic"] = test_statistic + if df is not None: + out["df"] = df + # Preserve the survey-F denominator df when replaying a schema- + # shaped PT block from the default path (round-28 P3 CI review + # on PR #318). Without this, the finite-sample correction + # recorded on the source block is silently dropped at replay. + df_denom = _to_python_float(_read("df_denom")) + if df_denom is not None: + out["df_denom"] = df_denom + return out + + # -- Headline metric extraction ---------------------------------------- + + def _extract_headline_metric(self) -> Optional[Dict[str, Any]]: + """Best-effort extraction of the scalar headline metric from the result.""" + extracted = _extract_scalar_headline(self._results, fallback_alpha=self._alpha) + if extracted is None: + return None + name, value, se, p, ci, alpha = extracted + return { + "name": name, + "value": value, + "se": se, + "p_value": p, + "conf_int": ci, + "alpha": alpha, + } + + +# --------------------------------------------------------------------------- +# Helpers (module-private) +# --------------------------------------------------------------------------- +def _extract_scalar_headline( + results: Any, + fallback_alpha: float = 0.05, +) -> Optional[ + Tuple[ + str, + Optional[float], + Optional[float], + Optional[float], + Optional[List[float]], + Optional[float], + ] +]: + """Extract ``(name, value, se, p_value, conf_int, alpha)`` from a fitted result. + + Centralizes the scalar-headline mapping shared by both ``BusinessReport`` + and ``DiagnosticReport`` so schema drift (e.g. 
``ContinuousDiDResults`` + using ``overall_att_se`` / ``overall_att_p_value`` / + ``overall_att_conf_int`` instead of the ``overall_att`` stem) is handled + in one place. + + Each row in the attribute-alias table below is tried in priority order. + The first point-estimate attribute that resolves to a non-None value + wins; the companion SE / p-value / CI attributes are then resolved from + the same row, taking the first alias that exists on the result object. + """ + # (name, [se aliases], [p-value aliases], [ci aliases]) + alias_table: List[Tuple[str, List[str], List[str], List[str]]] = [ + # Staggered / multi-period aggregations + ( + "overall_att", + ["overall_se", "overall_att_se"], + ["overall_p_value", "overall_att_p_value"], + ["overall_conf_int", "overall_att_conf_int"], + ), + # MultiPeriodDiDResults + ("avg_att", ["avg_se"], ["avg_p_value"], ["avg_conf_int"]), + # Simple DiDResults / SyntheticDiDResults / TROPResults / TripleDifferenceResults + ("att", ["se"], ["p_value"], ["conf_int"]), + ] + for name, se_aliases, p_aliases, ci_aliases in alias_table: + val = getattr(results, name, None) + if val is None: + continue + se = next( + ( + _to_python_float(getattr(results, a, None)) + for a in se_aliases + if getattr(results, a, None) is not None + ), + None, + ) + p = next( + ( + _to_python_float(getattr(results, a, None)) + for a in p_aliases + if getattr(results, a, None) is not None + ), + None, + ) + ci = next( + ( + _to_python_ci(getattr(results, a, None)) + for a in ci_aliases + if getattr(results, a, None) is not None + ), + None, + ) + alpha = _to_python_float(getattr(results, "alpha", fallback_alpha)) + return (name, _to_python_float(val), se, p, ci, alpha) + return None + + +def _extract_scalar_effect(val: Any) -> Optional[float]: + """Pull a scalar effect out of the many shapes results expose. + + Handles: ``PeriodEffect`` / ``GroupTimeEffect`` objects (``.effect`` + or ``.att`` attr), dicts with an ``"effect"`` or ``"att"`` key, and + bare scalars. Wooldridge stores ``att`` in its ``group_time_effects`` + / ``group_effects`` / ``event_study_effects`` payloads rather than + ``effect`` (round-16 CI review on PR #318). + """ + if isinstance(val, dict): + eff = val.get("effect") + if eff is None: + eff = val.get("att") + if eff is None: + return None + try: + return float(eff) + except (TypeError, ValueError): + return None + eff_attr = getattr(val, "effect", None) + if eff_attr is None: + eff_attr = getattr(val, "att", None) + if eff_attr is not None: + try: + return float(eff_attr) + except (TypeError, ValueError): + return None + try: + return float(val) + except (TypeError, ValueError): + return None + + +def _power_tier(ratio: Optional[float]) -> str: + """Map ``mdv / |att|`` to a phrasing tier used by ``BusinessReport``. + + Tiers per ``docs/methodology/REPORTING.md``: + * ``well_powered``: ratio < 0.25 + * ``moderately_powered``: 0.25 <= ratio < 1.0 + * ``underpowered``: ratio >= 1.0 + * ``unknown``: ratio is None or non-finite + """ + if ratio is None or not np.isfinite(ratio): + return "unknown" + if ratio < 0.25: + return "well_powered" + if ratio < 1.0: + return "moderately_powered" + return "underpowered" + + +def _apply_diag_fallback_downgrade(tier: str, cov_source: str) -> str: + """Conservatively downgrade ``well_powered`` to ``moderately_powered`` + when ``compute_pretrends_power`` used the diagonal-SE approximation + while the full ``event_study_vcov`` was available on the source fit. 
+ + REPORTING.md's conservative deviation: off-diagonal pre-period + correlations are ignored under the diagonal fallback, so a + ``well_powered`` verdict can overstate the real informativeness of + the pre-test. The downgrade applies at every DR path + (``_check_pretrends_power`` and ``_format_precomputed_pretrends_power``) + so BR ``summary()`` / ``full_report()`` / ``to_dict()`` and DR + ``summary()`` all read the same adjusted tier. Round-14 CI review + flagged per-surface divergence; round-20 flagged that the precomputed + adapter bypassed the downgrade entirely. + """ + if tier == "well_powered" and cov_source == "diag_fallback_available_full_vcov_unused": + return "moderately_powered" + return tier + + +def _pre_post_boundary(results: Any) -> int: + """Return the relative-time cutoff that separates true pre-period + horizons from treatment (and post-treatment) horizons. + + Horizons ``rel < _pre_post_boundary(results)`` are true pre-period + coefficients suitable for PT tests and pre-trends power. Horizons + ``rel >= _pre_post_boundary(results)`` include the anticipation + window and post-treatment effects — these are the "affected by + treatment (or anticipated treatment)" horizons, and are what + heterogeneity dispersion should summarize. + + For anticipation-aware staggered estimators (CS, SA, EfficientDiD, + etc., per REGISTRY.md §CallawaySantAnna lines 355-395), a fit with + ``anticipation=k`` moves the identification boundary to + ``e = -1 - k`` and treats ``e ∈ [-k, -1]`` as the anticipation + window. True pre-periods are ``e < -k``. Returns ``-anticipation`` + (non-positive integer) in that case, falling back to ``0`` (the + standard ``e < 0`` boundary) when no anticipation field is exposed. + + Round-15 CI review on PR #318 flagged the hard-coded ``rel < 0`` + rule as a methodology mismatch on anticipation fits. + + Estimator-specific override: Wooldridge aggregation keeps + ``t >= g`` and treats anticipation-window cells as placebos, not + post-treatment effects (REGISTRY.md §Wooldridge lines 1351-1352). + The boundary for ``WooldridgeDiDResults`` is therefore ``0`` + regardless of the ``anticipation`` value stored on the result. + """ + if type(results).__name__ == "WooldridgeDiDResults": + return 0 + anticipation = getattr(results, "anticipation", 0) + try: + k = int(anticipation) + except (TypeError, ValueError): + return 0 + if not np.isfinite(k) or k < 0: + return 0 + return -k + + +def _collect_pre_period_coefs( + results: Any, +) -> Tuple[List[Tuple[Any, float, float, Optional[float]]], int]: + """Return ``(sorted list of (key, effect, se, p_value), n_dropped_undefined)`` + for pre-period coefficients. + + Handles three shapes: + * ``pre_period_effects``: dict-of-``PeriodEffect`` on ``MultiPeriodDiDResults``. + * ``event_study_effects``: dict-of-dict (with ``effect`` / ``se`` / ``p_value`` keys) + on the staggered estimators (CS / SA / ImputationDiD / Stacked / EDiD / etc.). + Pre-period entries are those with negative relative-time keys. + * ``placebo_event_study``: dict-of-dict on + ``ChaisemartinDHaultfoeuilleResults`` — dCDH's dynamic placebos + ``DID^{pl}_l`` are the estimator's pre-period analogue. + + Filtering rules (critical for methodology-safe PT tests): + + * Entries marked as reference markers (``n_groups == 0`` on CS / SA or + ``n_obs == 0`` on Stacked / TwoStage / Imputation event-study shape) + are excluded. 
These are synthetic ``effect=0, se=NaN`` rows injected + for universal-base normalization and are NOT counted in + ``n_dropped_undefined`` — they never represented a real pre-period. + * Entries whose ``effect`` or ``se`` is non-finite (NaN / inf) or whose + ``se <= 0`` are excluded as undefined inference (``safe_inference`` + contract, ``utils.py:175``). These ARE real pre-periods whose + inference is undefined, so they contribute to + ``n_dropped_undefined``. Round-33 P0 CI review on PR #318 flagged + that the Bonferroni fallback silently shrank the test family when + this happened, turning partially-undefined PT surfaces into clean + stakeholder-facing verdicts. Callers (``_pt_event_study``) use + ``n_dropped_undefined`` to force an inconclusive verdict rather + than silently shrinking. + + Returns ``([], 0)`` when none of the three sources provides valid + pre-period entries. + """ + results_list: List[Tuple[Any, float, float, Optional[float]]] = [] + n_dropped_undefined = 0 + pre = getattr(results, "pre_period_effects", None) + # dCDH exposes pre-period placebos via ``placebo_event_study``; the + # round-6 CI review flagged that routing dCDH through the generic + # ``event_study_effects`` path produced empty pre-coef lists and + # silently skipped the PT check. + dcdh_placebo = getattr(results, "placebo_event_study", None) + if pre: + for k, pe in pre.items(): + eff = getattr(pe, "effect", None) + se = getattr(pe, "se", None) + p = getattr(pe, "p_value", None) + if eff is None or se is None: + n_dropped_undefined += 1 + continue + try: + eff_f = float(eff) + se_f = float(se) + except (TypeError, ValueError): + n_dropped_undefined += 1 + continue + if not (np.isfinite(eff_f) and np.isfinite(se_f) and se_f > 0): + n_dropped_undefined += 1 + continue + results_list.append((k, eff_f, se_f, _to_python_float(p))) + elif dcdh_placebo: + # dCDH placebo horizons are the pre-period surface. + for k, entry in dcdh_placebo.items(): + if not isinstance(entry, dict): + continue + eff = entry.get("effect") + se = entry.get("se") + p = entry.get("p_value") + if eff is None or se is None: + n_dropped_undefined += 1 + continue + try: + eff_f = float(eff) + se_f = float(se) + except (TypeError, ValueError): + n_dropped_undefined += 1 + continue + if not (np.isfinite(eff_f) and np.isfinite(se_f) and se_f > 0): + n_dropped_undefined += 1 + continue + results_list.append((k, eff_f, se_f, _to_python_float(p))) + else: + # Anticipation-aware cutoff: for CS/SA/EfficientDiD fits with + # ``anticipation=k``, treat horizons ``e ∈ [-k, -1]`` as the + # anticipation window (not true pre-periods) and only use + # ``e < -k`` for PT tests. + pre_cutoff = _pre_post_boundary(results) + es = getattr(results, "event_study_effects", None) or {} + for k, entry in es.items(): + # Pre-period relative-time keys are negative (convention: e=-1, -2, ...). + try: + rel = int(k) + except (TypeError, ValueError): + continue + if rel >= pre_cutoff: + continue + if not isinstance(entry, dict): + continue + # Drop universal-base reference markers. These are synthetic, + # not a real pre-period, so they do not count toward + # ``n_dropped_undefined``. + if entry.get("n_groups") == 0 or entry.get("n_obs") == 0: + continue + # Wooldridge stores ``att`` rather than ``effect`` in its + # event-study payloads; accept either (round-16 CI review). 
+ eff = entry.get("effect") + if eff is None: + eff = entry.get("att") + se = entry.get("se") + p = entry.get("p_value") + if eff is None or se is None: + n_dropped_undefined += 1 + continue + try: + eff_f = float(eff) + se_f = float(se) + except (TypeError, ValueError): + n_dropped_undefined += 1 + continue + if not (np.isfinite(eff_f) and np.isfinite(se_f) and se_f > 0): + n_dropped_undefined += 1 + continue + results_list.append((k, eff_f, se_f, _to_python_float(p))) + results_list.sort(key=lambda t: t[0] if isinstance(t[0], (int, float)) else str(t[0])) + return results_list, n_dropped_undefined + + +def _pt_verdict(p: Optional[float]) -> str: + """Map a pre-trends joint p-value to the three-bin verdict enum. + + Verdicts per ``docs/methodology/REPORTING.md``: + - p >= 0.30 -> ``no_detected_violation`` (phrasing hedges on power + unless DR also reports that the test is well-powered via + ``compute_pretrends_power``). + - 0.05 <= p < 0.30 -> ``some_evidence_against``. + - p < 0.05 -> ``clear_violation``. + """ + if p is None or not np.isfinite(p): + return "inconclusive" + if p < 0.05: + return "clear_violation" + if p < 0.30: + return "some_evidence_against" + return "no_detected_violation" + + +def _to_python_float(value: Any) -> Optional[float]: + """Convert numpy scalars to built-in ``float``; preserve None; return None on failure.""" + if value is None: + return None + try: + f = float(value) + except (TypeError, ValueError): + return None + return f + + +def _to_python_scalar(value: Any) -> Any: + """Convert numpy scalars to built-in Python types where possible; pass through otherwise.""" + if isinstance(value, np.generic): + return value.item() + return value + + +def _to_python_ci(ci: Any) -> Optional[List[float]]: + """Convert a 2-tuple CI to ``[float, float]``; return None when malformed.""" + if ci is None: + return None + try: + lo, hi = ci + except (TypeError, ValueError): + return None + lo_f = _to_python_float(lo) + hi_f = _to_python_float(hi) + if lo_f is None or hi_f is None: + return None + return [lo_f, hi_f] + + +# --------------------------------------------------------------------------- +# Prose rendering helpers +# --------------------------------------------------------------------------- +def _check_headline(check: str, section: Dict[str, Any]) -> Optional[Any]: + """Return the most descriptive scalar for the per-check row in to_dataframe().""" + if section.get("status") != "ran": + return None + if check == "parallel_trends": + return section.get("joint_p_value") + if check == "pretrends_power": + return section.get("mdv_share_of_att") + if check == "sensitivity": + return section.get("breakdown_M") + if check == "bacon": + return section.get("forbidden_weight") + if check == "design_effect": + return section.get("deff") + if check == "heterogeneity": + return section.get("cv") + if check == "epv": + return section.get("min_epv") + if check == "estimator_native": + return section.get("pre_treatment_fit") + return None + + +def _pt_subject_phrase(method: Optional[str]) -> str: + """Return a source-faithful subject for DR's PT verdict sentence. + + Round-8 CI review: the generic "pre-treatment event-study + coefficients" wording mis-describes the 2x2 slope-difference check + (``method="slope_difference"``) and EfficientDiD's Hausman PT-All + vs PT-Post pretest (``method="hausman"``). See REGISTRY.md + §EfficientDiD line 907 for the Hausman test's operating vector. 
+ """ + if method == "slope_difference": + return "The pre-period slope-difference test" + if method == "hausman": + return "The Hausman PT-All vs PT-Post pretest" + if method in { + "joint_wald", + "joint_wald_event_study", + "joint_wald_no_vcov", + "bonferroni", + # Survey-aware event-study PT variants use an F(k, df_survey) + # reference rather than chi-square(k); the subject is still the + # pre-period event-study coefficient vector — only the + # reference distribution changes (round-28 / round-29 CI + # review on PR #318). Recognizing the ``_survey`` suffix here + # lets DR prose match the BR prose and the REPORTING.md + # contract. + "joint_wald_survey", + "joint_wald_event_study_survey", + }: + return "Pre-treatment event-study coefficients" + if method == "synthetic_fit": + return "The synthetic-control pre-treatment fit" + if method == "factor": + return "The factor-model pre-treatment fit" + return "Pre-treatment data" + + +def _pt_stat_label(method: Optional[str]) -> Optional[str]: + """Label for the joint-statistic p-value in the PT prose. + + Wald / Bonferroni paths take a joint p-value (``joint p``); the 2x2 + slope-difference and Hausman paths are single-statistic tests + (``p``). Design-enforced paths return ``None`` so the sentence + omits a statistic. Survey F-reference variants remain joint tests + on the pre-period coefficient vector and keep the ``joint p`` + label — the correction is a different reference distribution, not + a different test. + """ + if method in { + "joint_wald", + "joint_wald_event_study", + "joint_wald_no_vcov", + "bonferroni", + "joint_wald_survey", + "joint_wald_event_study_survey", + }: + return "joint p" + if method in {"slope_difference", "hausman"}: + return "p" + if method in {"synthetic_fit", "factor"}: + return None + return "joint p" + + +def _render_overall_interpretation(schema: Dict[str, Any], labels: Dict[str, str]) -> str: + """Synthesize a plain-English paragraph across DR checks. + + The paragraph names the headline effect, the dominant validity concern + (typically parallel trends or sensitivity), secondary caveats + (heterogeneity, design effect, Bacon), and one concrete next action. + Never produces traffic-light verdicts — severity is conveyed by natural + language per ``docs/methodology/REPORTING.md``. + """ + sentences: List[str] = [] + headline = schema.get("headline_metric") or {} + est = schema.get("estimator", "the estimator") + outcome = labels.get("outcome_label", "the outcome") + treatment = labels.get("treatment_label", "the treatment") + + # Sentence 1: headline. + # Round-36 P0 CI review on PR #318: a non-finite headline value + # (NaN ATT from a failed fit, e.g., rank-deficient design matrix or + # zero effective sample) previously passed the ``val is not None`` + # guard because ``NaN is not None``. Since ``NaN > 0`` and + # ``NaN < 0`` are both false, the directional branch fell through + # to "did not change" and the sentence rendered as "did not change + # ... by nan (p = nan, 95% CI: nan to nan)". BR's equivalent + # headline renderer already gates on ``np.isfinite(value)`` and + # emits an estimation-failure sentence; DR now mirrors that. 
+ val = headline.get("value") if isinstance(headline, dict) else None + ci = headline.get("conf_int") if isinstance(headline, dict) else None + p = headline.get("p_value") if isinstance(headline, dict) else None + val_finite = isinstance(val, (int, float)) and np.isfinite(val) + if val is not None and not val_finite: + sentences.append( + f"On {est}, {treatment}'s effect on {outcome} is non-finite " + "(the estimation did not produce a usable point estimate). " + "Inspect the fit for rank deficiency, zero effective sample, " + "or a survey-design collapse before interpreting." + ) + elif val_finite: + direction = "increased" if val > 0 else "decreased" if val < 0 else "did not change" + # Use the headline's own alpha rather than hardcoding 95 so prose + # stays consistent with the rendered interval when alpha != 0.05. + headline_alpha = headline.get("alpha") if isinstance(headline, dict) else None + if isinstance(headline_alpha, (int, float)) and 0 < headline_alpha < 1: + ci_level = int(round((1.0 - headline_alpha) * 100)) + else: + ci_level = 95 + ci_finite = ( + isinstance(ci, (list, tuple)) + and len(ci) == 2 + and all(isinstance(v, (int, float)) and np.isfinite(v) for v in ci) + ) + ci_str = f" ({ci_level}% CI: {ci[0]:.3g} to {ci[1]:.3g})" if ci_finite else "" + p_str = f", p = {p:.3g}" if isinstance(p, (int, float)) and np.isfinite(p) else "" + sentences.append( + f"On {est}, {treatment} {direction} {outcome} by {val:.3g}{ci_str}{p_str}." + ) + + # Sentence 2: parallel trends + power (method-aware prose per the + # round-8 CI review on PR #318; PT method can be slope_difference + # (2x2), joint_wald / bonferroni (event study), hausman (EfficientDiD + # PT-All vs PT-Post), synthetic_fit (SDiD), or factor (TROP), and the + # generic "event-study coefficients" wording is wrong for the + # 2x2 and Hausman paths). + pt = schema.get("parallel_trends") or {} + pp = schema.get("pretrends_power") or {} + # Only point to "the sensitivity analysis below" when a sensitivity + # block actually ran. For estimators routing to native diagnostics + # (SDiD / TROP) or fits where sensitivity was skipped / not + # applicable, the clause would be misleading (round-12 CI review). + sens_ran = (schema.get("sensitivity") or {}).get("status") == "ran" + if pt.get("status") == "ran": + verdict = pt.get("verdict") + jp = pt.get("joint_p_value") + method = pt.get("method") + subject = _pt_subject_phrase(method) + stat_label = _pt_stat_label(method) + jp_str = ( + f" ({stat_label} = {jp:.3g})" if isinstance(jp, (int, float)) and stat_label else "" + ) + sens_tail_pending = " pending sensitivity analysis" if sens_ran else "" + sens_tail_alongside = ( + " Interpret the headline alongside the sensitivity analysis below." if sens_ran else "" + ) + sens_tail_bounded = ( + " See the sensitivity analysis below for bounded-violation guarantees." + if sens_ran + else "" + ) + sens_tail_reliable = ( + " See the HonestDiD sensitivity analysis below for a more reliable signal." + if sens_ran + else "" + ) + if verdict == "clear_violation": + sentences.append( + f"{subject} clearly reject parallel trends{jp_str}. The " + "headline estimate should be treated as tentative" + sens_tail_pending + "." + ) + elif verdict == "some_evidence_against": + sentences.append( + f"{subject} show some evidence against parallel trends" + f"{jp_str}." 
+ sens_tail_alongside + ) + elif verdict == "no_detected_violation": + tier = pp.get("tier") if pp.get("status") == "ran" else "unknown" + if tier == "well_powered": + sentences.append( + f"{subject} are consistent with parallel trends" + f"{jp_str} and the test is well-powered (MDV is a small " + "share of the estimated effect), so a material pre-trend " + "would likely have been detected." + ) + elif tier == "moderately_powered": + sentences.append( + f"{subject} do not reject parallel trends" + f"{jp_str}; the test is moderately informative." + sens_tail_bounded + ) + else: + sentences.append( + f"{subject} do not reject parallel trends" + f"{jp_str}, but the test has limited power — a non-rejection " + "does not prove the assumption." + sens_tail_reliable + ) + elif verdict == "design_enforced_pt": + rmse = pt.get("pre_treatment_fit_rmse") + sentences.append( + f"The synthetic control matches the treated group's " + f"pre-period trajectory with RMSE = " + f"{rmse:.3g} (SDiD's design-enforced analogue of parallel " + f"trends)." + if isinstance(rmse, (int, float)) + else "SDiD's synthetic control is designed to satisfy the " + "weighted parallel-trends analogue." + ) + elif verdict == "inconclusive": + # Round-35 P1 CI review on PR #318: DR summary / overall + # interpretation must surface the inconclusive state + # explicitly rather than omitting the PT sentence. A missing + # sentence was indistinguishable from "PT did not run", and + # stakeholders reading the summary could not tell that the + # joint test had been attempted but yielded undefined + # inference. + n_dropped = pt.get("n_dropped_undefined") + if isinstance(n_dropped, int) and n_dropped > 0: + rows_word = "row" if n_dropped == 1 else "rows" + sentences.append( + f"Pre-trends is inconclusive on this fit: " + f"{n_dropped} pre-period {rows_word} had undefined " + "inference (zero / negative SE or a non-finite " + "per-period p-value), so the joint test cannot be " + "formed. Treat parallel trends as unassessed." + ) + else: + sentences.append( + "Pre-trends is inconclusive on this fit: pre-period " + "inference was undefined, so the joint test cannot " + "be formed. Treat parallel trends as unassessed." + ) + + # Sentence 3: sensitivity. The "robust across the grid" phrasing is reserved + # for genuine SensitivityResults grids; a precomputed single-M HonestDiDResults + # is narrated as a point check ("at M=") even though breakdown_M is None. + sens = schema.get("sensitivity") or {} + if sens.get("status") == "ran": + bkd = sens.get("breakdown_M") + conclusion = sens.get("conclusion") + if conclusion == "single_M_precomputed": + grid = sens.get("grid") or [] + point = grid[0] if grid else {} + m_val = point.get("M") + robust = point.get("robust_to_zero") + if isinstance(m_val, (int, float)): + if robust: + sentences.append( + f"HonestDiD sensitivity (single point checked): " + f"at M = {m_val:.2g}, the robust CI excludes zero. " + f"This is a point check, not a grid — use " + f"HonestDiD.sensitivity() for a breakdown value." + ) + else: + sentences.append( + f"HonestDiD sensitivity (single point checked): " + f"at M = {m_val:.2g}, the robust CI includes zero. " + f"Run HonestDiD.sensitivity() across a grid to find " + f"the breakdown value." + ) + elif bkd is None: + sentences.append( + "The effect remains significant across the entire HonestDiD " + "grid — robust to plausible parallel-trends violations." 
+ ) + elif isinstance(bkd, (int, float)) and bkd >= 1.0: + sentences.append( + f"HonestDiD sensitivity: the result remains significant under " + f"parallel-trends violations up to {bkd:.2g}x the observed " + f"pre-period variation." + ) + else: + sentences.append( + f"HonestDiD sensitivity: the result is fragile — the " + f"confidence interval includes zero once violations reach " + f"{bkd:.2g}x the pre-period variation." + if isinstance(bkd, (int, float)) + else "" + ) + + # Sentence 4: one secondary caveat if present. + bacon = schema.get("bacon") or {} + if bacon.get("status") == "ran" and bacon.get("verdict") == "materially_contaminated": + fw = bacon.get("forbidden_weight") + if isinstance(fw, (int, float)): + sentences.append( + f"Goodman-Bacon decomposition flags {fw:.0%} of TWFE weight on " + f"'forbidden' later-vs-earlier comparisons — consider a " + f"heterogeneity-robust estimator (CS / SA / BJS / Gardner) if " + f"not already in use." + ) + deff = schema.get("design_effect") or {} + if deff.get("status") == "ran" and not deff.get("is_trivial"): + d = deff.get("deff") + eff_n = deff.get("effective_n") + if isinstance(d, (int, float)) and d >= 1.05: + eff_str = f", effective n = {eff_n:.0f}" if isinstance(eff_n, (int, float)) else "" + sentences.append( + f"Survey design effect is {d:.2g} (variance inflation relative " + f"to simple random sampling{eff_str})." + ) + + # Sentence 5: next step + next_steps = schema.get("next_steps") or [] + if next_steps: + top = next_steps[0] + if top.get("label"): + sentences.append(f"Next step: {top['label']}.") + + if not sentences: + return "" + return " ".join(s for s in sentences if s) + + +def _render_dr_full_report(results: "DiagnosticReportResults") -> str: + """Render a markdown report from a populated ``DiagnosticReportResults``.""" + schema = results.schema + lines: List[str] = [] + lines.append("# Diagnostic Report") + lines.append("") + lines.append(f"**Estimator**: `{schema.get('estimator')}`") + headline = schema.get("headline_metric") + if headline: + lines.append( + f"**Headline**: {headline.get('name')} = " + f"{headline.get('value')} " + f"(SE {headline.get('se')}, p = {headline.get('p_value')})" + ) + lines.append("") + lines.append("## Overall Interpretation") + lines.append("") + lines.append(schema.get("overall_interpretation", "") or "_No synthesis available._") + lines.append("") + + section_order = [ + ("Parallel trends", "parallel_trends"), + ("Pre-trends power", "pretrends_power"), + ("HonestDiD sensitivity", "sensitivity"), + ("Goodman-Bacon decomposition", "bacon"), + ("Effect-stability / heterogeneity", "heterogeneity"), + ("Survey design effect", "design_effect"), + ("Propensity-score EPV", "epv"), + ("Estimator-native diagnostics", "estimator_native_diagnostics"), + ("Placebo battery", "placebo"), + ] + for title, key in section_order: + section = schema.get(key) or {} + status = section.get("status", "not_run") + lines.append(f"## {title}") + lines.append(f"- status: `{status}`") + if status == "skipped" or status == "not_applicable": + reason = section.get("reason") + if reason: + lines.append(f"- reason: {reason}") + else: + for k, v in section.items(): + if k in ("status", "reason"): + continue + if isinstance(v, (dict, list)): + continue + lines.append(f"- {k}: `{v}`") + lines.append("") + + if schema.get("next_steps"): + lines.append("## Next Steps") + for s in schema["next_steps"]: + if s.get("label"): + lines.append(f"- {s['label']}") + if s.get("why"): + lines.append(f" - why: {s['why']}") + return 
"\n".join(lines) diff --git a/diff_diff/efficient_did.py b/diff_diff/efficient_did.py index 138c6790..4dcb93e4 100644 --- a/diff_diff/efficient_did.py +++ b/diff_diff/efficient_did.py @@ -372,9 +372,12 @@ def fit( # Store survey df for safe_inference calls (t-distribution with survey df) self._survey_df = survey_metadata.df_survey if survey_metadata is not None else None # Guard: replicate design with undefined df → NaN inference - if (self._survey_df is None and resolved_survey is not None - and hasattr(resolved_survey, 'uses_replicate_variance') - and resolved_survey.uses_replicate_variance): + if ( + self._survey_df is None + and resolved_survey is not None + and hasattr(resolved_survey, "uses_replicate_variance") + and resolved_survey.uses_replicate_variance + ): self._survey_df = 0 # Bootstrap + survey supported via PSU-level multiplier bootstrap. @@ -510,14 +513,18 @@ def fit( n_strata_u = len(np.unique(unit_strata)) if unit_strata is not None else 0 n_psu_u = len(np.unique(unit_psu)) if unit_psu is not None else 0 self._unit_resolved_survey = resolved_survey.subset_to_units( - row_idx, unit_weights_s, unit_strata, unit_psu, unit_fpc, - n_strata_u, n_psu_u, + row_idx, + unit_weights_s, + unit_strata, + unit_psu, + unit_fpc, + n_strata_u, + n_psu_u, ) # Use unit-level df (not panel-level) for t-distribution self._survey_df = self._unit_resolved_survey.df_survey # Re-apply replicate guard: undefined df → NaN inference - if (self._survey_df is None - and self._unit_resolved_survey.uses_replicate_variance): + if self._survey_df is None and self._unit_resolved_survey.uses_replicate_variance: self._survey_df = 0 else: self._unit_resolved_survey = None @@ -717,10 +724,14 @@ def fit( # Filter out comparison pairs with zero survey weight if unit_level_weights is not None and pairs: pairs = [ - (gp, tpre) for gp, tpre in pairs - if np.sum(unit_level_weights[ - never_treated_mask if np.isinf(gp) else cohort_masks[gp] - ]) > 0 + (gp, tpre) + for gp, tpre in pairs + if np.sum( + unit_level_weights[ + never_treated_mask if np.isinf(gp) else cohort_masks[gp] + ] + ) + > 0 ] if not pairs: @@ -1081,6 +1092,7 @@ def fit( efficient_weights=stored_weights if stored_weights else None, omega_condition_numbers=stored_cond if stored_cond else None, control_group=self.control_group, + cluster=self.cluster, influence_functions=eif_by_gt if store_eif else None, bootstrap_results=bootstrap_results, estimation_path="dr" if use_covariates else "nocov", @@ -1108,8 +1120,11 @@ def _recompute_unit_survey_metadata(self, panel_metadata): ) # Propagate effective replicate df if available # (but not the df=0 sentinel — keep metadata as None for undefined df) - if (self._survey_df is not None and self._survey_df != 0 - and meta.df_survey != self._survey_df): + if ( + self._survey_df is not None + and self._survey_df != 0 + and meta.df_survey != self._survey_df + ): meta.df_survey = self._survey_df return meta return panel_metadata @@ -1129,7 +1144,9 @@ def _compute_survey_eif_se(self, eif_vals: np.ndarray) -> float: # Score-scale IFs to match TSL bread: psi = w * eif / sum(w) w = self._unit_resolved_survey.weights psi_scaled = w * eif_vals / w.sum() - variance, n_valid = compute_replicate_if_variance(psi_scaled, self._unit_resolved_survey) + variance, n_valid = compute_replicate_if_variance( + psi_scaled, self._unit_resolved_survey + ) # Update survey df to reflect effective replicate count if n_valid < self._unit_resolved_survey.n_replicates: self._survey_df = n_valid - 1 if n_valid > 1 else None @@ -1271,7 +1288,11 @@ 
def _aggregate_overall( # WIF correction: accounts for uncertainty in cohort-size weights wif = self._compute_wif_contribution( - keepers, effects, unit_cohorts, cohort_fractions, n_units, + keepers, + effects, + unit_cohorts, + cohort_fractions, + n_units, unit_weights=self._unit_level_weights, ) # Compute SE: survey path uses score-level psi to avoid double-weighting @@ -1282,19 +1303,17 @@ def _aggregate_overall( total_w = float(np.sum(uw)) psi_total = uw * agg_eif / total_w + wif / total_w - if (hasattr(self._unit_resolved_survey, 'uses_replicate_variance') - and self._unit_resolved_survey.uses_replicate_variance): + if ( + hasattr(self._unit_resolved_survey, "uses_replicate_variance") + and self._unit_resolved_survey.uses_replicate_variance + ): from diff_diff.survey import compute_replicate_if_variance - variance, _ = compute_replicate_if_variance( - psi_total, self._unit_resolved_survey - ) + variance, _ = compute_replicate_if_variance(psi_total, self._unit_resolved_survey) else: from diff_diff.survey import compute_survey_if_variance - variance = compute_survey_if_variance( - psi_total, self._unit_resolved_survey - ) + variance = compute_survey_if_variance(psi_total, self._unit_resolved_survey) se = float(np.sqrt(max(variance, 0.0))) if np.isfinite(variance) else np.nan else: agg_eif_total = agg_eif + wif @@ -1389,7 +1408,11 @@ def _aggregate_event_study( es_keepers = [(g, t) for (g, t) in gt_pairs] es_effects = effs wif_e = self._compute_wif_contribution( - es_keepers, es_effects, unit_cohorts, cohort_fractions, n_units, + es_keepers, + es_effects, + unit_cohorts, + cohort_fractions, + n_units, unit_weights=self._unit_level_weights, ) @@ -1398,8 +1421,10 @@ def _aggregate_event_study( total_w = float(np.sum(uw)) psi_total = uw * agg_eif / total_w + wif_e / total_w - if (hasattr(self._unit_resolved_survey, 'uses_replicate_variance') - and self._unit_resolved_survey.uses_replicate_variance): + if ( + hasattr(self._unit_resolved_survey, "uses_replicate_variance") + and self._unit_resolved_survey.uses_replicate_variance + ): from diff_diff.survey import compute_replicate_if_variance variance, _ = compute_replicate_if_variance( @@ -1408,9 +1433,7 @@ def _aggregate_event_study( else: from diff_diff.survey import compute_survey_if_variance - variance = compute_survey_if_variance( - psi_total, self._unit_resolved_survey - ) + variance = compute_survey_if_variance(psi_total, self._unit_resolved_survey) agg_se = float(np.sqrt(max(variance, 0.0))) if np.isfinite(variance) else np.nan else: agg_eif = agg_eif + wif_e diff --git a/diff_diff/efficient_did_results.py b/diff_diff/efficient_did_results.py index 7f86f5a9..13123ea8 100644 --- a/diff_diff/efficient_did_results.py +++ b/diff_diff/efficient_did_results.py @@ -149,6 +149,12 @@ class EfficientDiDResults: default=None, repr=False ) control_group: str = "never_treated" + # Cluster column used at fit time (None for unclustered fits). Persisted + # so downstream diagnostics — notably ``DiagnosticReport._pt_hausman`` — + # can replay the Hausman PT-All vs PT-Post pretest under the same + # clustering as the original estimate rather than silently producing + # unclustered p-values for a clustered fit. 
+ cluster: Optional[str] = None influence_functions: Optional[Dict[Tuple[Any, Any], "np.ndarray"]] = field( default=None, repr=False ) diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt index f5d79c30..40bb5ae2 100644 --- a/diff_diff/guides/llms-full.txt +++ b/diff_diff/guides/llms-full.txt @@ -1741,3 +1741,126 @@ DIFF_DIFF_BACKEND=rust pytest # Force Rust (fail if unavailable) | Efficiency-optimal estimation | `EfficientDiD` | | Corrective weighting for stacked regressions | `StackedDiD` | | Robustness to parallel trends violations | `HonestDiD` | + +## BusinessReport + +Plain-English stakeholder narrative from any of the 16 fitted result types. +Renders `summary()` (short paragraph), `full_report()` (multi-section +markdown), and `to_dict()` (stable AI-legible schema — single source of +truth; prose renders from the dict). + +```python +from diff_diff import BusinessReport + +report = BusinessReport( + results, + outcome_label="Revenue per user", + outcome_unit="$", # "$" / "%" / "pp" / "log_points" / "count" recognized + outcome_direction="higher_is_better", + business_question="Did the campaign lift revenue?", + treatment_label="the campaign", + alpha=0.05, # single knob: drives both CI level and phrasing threshold + auto_diagnostics=True, # default; auto-constructs DiagnosticReport +) + +print(report.summary()) # 6-10 sentence paragraph +print(report.full_report()) # structured markdown +report.to_dict() # AI-legible schema; stable top-level keys +``` + +Constructor rejects `BaconDecompositionResults` with a helpful TypeError +(Bacon is a diagnostic, not an estimator; wrap the underlying estimator +and pass the Bacon object to `DiagnosticReport(precomputed={'bacon': ...})`). + +Schema top-level keys (all always present; missing content uses a +`{"status": "skipped", "reason": "..."}` shape rather than being absent): + +- `schema_version`, `estimator`, `context` +- `headline`, `assumption`, `pre_trends`, `sensitivity` +- `sample`, `heterogeneity`, `robustness`, `diagnostics` +- `next_steps`, `caveats`, `references` + +Status enum values: `ran | skipped | error | not_applicable | not_run | computed`. + +## DiagnosticReport + +Unified diagnostic runner orchestrating `check_parallel_trends`, +`compute_pretrends_power`, `HonestDiD.sensitivity`, `bacon_decompose`, +plus estimator-native surfaces for SyntheticDiD (`pre_treatment_fit`, +`get_weight_concentration`, `in_time_placebo`, `sensitivity_to_zeta_omega`) +and TROP (factor-model metrics). EfficientDiD PT uses the native +`hausman_pretest`. The `design_effect` section is read-only: it +echoes `survey_metadata.design_effect` / `effective_n` from the +fitted result along with a plain-English band label. The +`epv` section is similarly read-only, reporting from +`results.epv_diagnostics` plus `results.epv_threshold`. 
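+
+If a Goodman-Bacon decomposition was already computed separately with
+custom arguments, hand it to the runner via `precomputed` instead of
+recomputing it. A minimal sketch, with `df` / `results` standing in for
+the panel and fitted result used in the example below:
+
+```python
+from diff_diff import DiagnosticReport, bacon_decompose
+
+# Reuse a decomposition computed once with custom arguments.
+bacon = bacon_decompose(df, outcome="outcome", unit="unit",
+                        time="period", first_treat="first_treat")
+dr = DiagnosticReport(results, precomputed={"bacon": bacon})
+```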
+ +```python +from diff_diff import DiagnosticReport + +dr = DiagnosticReport( + results, + data=df, # optional; needed for 2x2 PT, Bacon-from-scratch + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + alpha=0.05, + # Opt-outs (all default True except placebo) + run_parallel_trends=True, + run_sensitivity=True, + run_placebo=False, # opt-in; not implemented in MVP + run_bacon=True, + run_design_effect=True, + run_heterogeneity=True, + run_epv=True, + run_pretrends_power=True, # drives power-aware PT phrasing + sensitivity_M_grid=(0.5, 1.0, 1.5, 2.0), + sensitivity_method="relative_magnitude", + # Escape hatch for users who ran a diagnostic with custom args: + precomputed={"sensitivity": my_honest_did_results}, +) + +dr.run_all() # triggers compute, caches +print(dr.summary()) # overall-interpretation paragraph +dr.to_dict() # AI-legible schema +dr.to_dataframe() # one row per check +dr.applicable_checks # tuple of checks that will run for this estimator +dr.skipped_checks # dict of {check: plain-English reason} +``` + +Schema top-level keys: `schema_version, estimator, headline_metric, +parallel_trends, pretrends_power, sensitivity, placebo, bacon, +design_effect, heterogeneity, epv, estimator_native_diagnostics, +skipped, warnings, overall_interpretation, next_steps`. + +### Verdicts and tiers + +Pre-trends verdict (three bins, documented in `docs/methodology/REPORTING.md`): + +- `joint_p >= 0.30` -> `no_detected_violation` +- `0.05 <= joint_p < 0.30` -> `some_evidence_against` +- `joint_p < 0.05` -> `clear_violation` + +Power tier (drives BR phrasing for the `no_detected_violation` verdict): + +- `mdv / |att| < 0.25` -> `well_powered` +- `0.25 <= mdv / |att| < 1.0` -> `moderately_powered` +- `mdv / |att| >= 1.0` -> `underpowered` +- power not runnable -> `unknown` (BR falls back to underpowered phrasing) + +### Methodology notes + +BR and DR do no estimator fitting and do not re-derive variance from +raw data — every effect, SE, p-value, CI, and sensitivity bound is +read from the fitted result or produced by an existing diff-diff +utility (may call `check_parallel_trends`, `bacon_decompose`, or +`EfficientDiD.hausman_pretest` when the panel + column kwargs are +supplied). The `design_effect` section is read-only: it echoes +`survey_metadata.design_effect` / `effective_n` from the fitted +result rather than calling `compute_deff_diagnostics`. Report-layer +cross-period aggregations are enumerated in +`docs/methodology/REPORTING.md`. Both schemas are experimental in the +current release; see that document for phrasing rules, the +no-traffic-light decision, unit-translation policy, and schema +stability policy. diff --git a/diff_diff/guides/llms-practitioner.txt b/diff_diff/guides/llms-practitioner.txt index 6680d800..a8a78008 100644 --- a/diff_diff/guides/llms-practitioner.txt +++ b/diff_diff/guides/llms-practitioner.txt @@ -439,6 +439,50 @@ Your analysis report MUST include all of the following: - [ ] Comparison across at least 2-3 estimators - [ ] Estimates with and without covariates (REQUIRED) +### One-call reporting via BusinessReport + DiagnosticReport + +The `DiagnosticReport` class orchestrates Steps 3 (parallel trends), 6 +(sensitivity), and 7 (heterogeneity) in a single call and produces +plain-English output. Pair with `BusinessReport` for a +stakeholder-ready narrative: + +```python +from diff_diff import BusinessReport, DiagnosticReport + +# Optional: run diagnostics explicitly so you can inspect the structure. 
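+# (``cs_result`` is any fitted staggered result from the workflow above,
+# e.g. a fitted CallawaySantAnna; ``data`` is the panel it was fit on.)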
+dr = DiagnosticReport(cs_result, data=data, outcome='y', unit='id', + time='t', first_treat='g') +dr.run_all() +print(dr.summary()) # overall interpretation paragraph +dr.to_dict() # AI-legible structured schema + +# Or let BusinessReport auto-construct a DiagnosticReport and render the +# full stakeholder narrative in one call. Pass ``data`` + the column +# names so data-dependent checks (2x2 PT, Goodman-Bacon, EfficientDiD +# Hausman pretest) actually run — without them the auto path still +# produces a report but skips those checks with an explicit reason. +br = BusinessReport( + cs_result, + outcome_label='Revenue per user', + outcome_unit='$', + business_question='Did the campaign lift revenue?', + treatment_label='the campaign', + data=data, + outcome='y', + unit='id', + time='t', + first_treat='g', +) +print(br.summary()) # short paragraph block +print(br.full_report()) # structured markdown +``` + +`DiagnosticReport` uses power-aware phrasing: when a pre-trends test +does not reject, the summary reflects whether the test is well-powered +(via `compute_pretrends_power`), rather than defaulting to "parallel +trends hold". See `docs/methodology/REPORTING.md` for the full verdict +and tier rules. + ### Runtime guidance ```python from diff_diff import practitioner_next_steps diff --git a/diff_diff/imputation.py b/diff_diff/imputation.py index 49a40333..b5454c6f 100644 --- a/diff_diff/imputation.py +++ b/diff_diff/imputation.py @@ -857,6 +857,7 @@ def _refit_imp(w_r): n_treated_units=n_treated_units, n_control_units=n_control_units, alpha=self.alpha, + anticipation=self.anticipation, bootstrap_results=bootstrap_results, _estimator_ref=self, survey_metadata=survey_metadata, diff --git a/diff_diff/imputation_results.py b/diff_diff/imputation_results.py index 870b7271..e7f7613c 100644 --- a/diff_diff/imputation_results.py +++ b/diff_diff/imputation_results.py @@ -135,6 +135,7 @@ class ImputationDiDResults: n_treated_units: int n_control_units: int alpha: float = 0.05 + anticipation: int = 0 pretrend_results: Optional[Dict[str, Any]] = field(default=None, repr=False) bootstrap_results: Optional[ImputationBootstrapResults] = field(default=None, repr=False) # Internal: stores data needed for pretrend_test() diff --git a/diff_diff/practitioner.py b/diff_diff/practitioner.py index c58a75f6..cd1d4235 100644 --- a/diff_diff/practitioner.py +++ b/diff_diff/practitioner.py @@ -388,7 +388,12 @@ def _handle_sa(results: Any): "# sa_alt = SunAbraham(control_group='not_yet_treated')" ), priority="medium", - step_name="sensitivity", + # DR's sensitivity section runs HonestDiD, not specification + # variation; tagging this as ``sensitivity`` caused + # ``_collect_next_steps`` to suppress it after HonestDiD ran. + # Use ``specification_comparison`` so the recommendation + # persists alongside a completed HonestDiD sensitivity check. + step_name="specification_comparison", ), _step( baker_step=7, @@ -431,7 +436,10 @@ def _handle_imputation(results: Any): "# Leave-one-cohort-out sensitivity analysis" ), priority="medium", - step_name="sensitivity", + # See note on SA handler: DR completes ``sensitivity`` when + # HonestDiD runs, which is unrelated to this specification- + # variation recommendation. Tag separately. 
+ step_name="specification_comparison", ), _robustness_compare_step("CS, SA, or Gardner"), _covariates_step(), @@ -457,7 +465,10 @@ def _handle_two_stage(results: Any): "# Leave-one-cohort-out sensitivity analysis" ), priority="medium", - step_name="sensitivity", + # See note on SA handler: DR completes ``sensitivity`` when + # HonestDiD runs, which is unrelated to this specification- + # variation recommendation. Tag separately. + step_name="specification_comparison", ), _robustness_compare_step("CS, BJS, or SA"), _covariates_step(), @@ -482,7 +493,10 @@ def _handle_stacked(results: Any): "# stacked_alt = StackedDiD(clean_control='not_yet_treated')" ), priority="medium", - step_name="sensitivity", + # See note on SA handler: DR completes ``sensitivity`` when + # HonestDiD runs, which does not replay ``clean_control`` + # variation. Tag separately. + step_name="specification_comparison", ), _step( baker_step=7, @@ -556,7 +570,16 @@ def _handle_synthetic(results: Any): " 'with positive effective support.')" ), priority="medium", - step_name="sensitivity", + # DR's SyntheticDiD native battery covers pre-treatment fit, + # weight concentration, in-time placebo, and zeta-omega + # sensitivity, but NOT the jackknife LOO workflow (which + # requires a separate ``variance_method='jackknife'`` fit + # via ``get_loo_effects_df``). Tagging this recommendation + # as ``sensitivity`` caused ``_collect_next_steps`` to + # suppress it as soon as the native block ran, even though + # the jackknife was never executed. Round-24 P2 CI review + # on PR #318; same class as round-20 Hausman mistag. + step_name="loo_jackknife", ), _step( baker_step=6, @@ -624,7 +647,12 @@ def _handle_trop(results: Any): "# Leave-one-out: drop each treated unit and re-estimate" ), priority="medium", - step_name="sensitivity", + # TROP's estimator-native diagnostics surface factor-model fit + # metrics, not in-time or in-space placebos; DR does not run + # placebos on TROP. Tag separately from ``sensitivity`` so the + # recommendation persists after DR marks the TROP native + # battery complete. + step_name="placebo", ), _robustness_compare_step("SyntheticDiD or CS"), ] @@ -648,7 +676,12 @@ def _handle_efficient(results: Any): "# edid_alt = EfficientDiD(control_group='last_cohort')" ), priority="medium", - step_name="sensitivity", + # See note on SA handler: DR completes ``sensitivity`` when + # HonestDiD runs, which does not re-estimate with an + # alternative control_group. Tag separately so this + # recommendation persists alongside a completed HonestDiD + # block. + step_name="specification_comparison", ), _step( baker_step=7, @@ -663,7 +696,16 @@ def _handle_efficient(results: Any): "pretest = EfficientDiD.hausman_pretest(\n" " data, outcome='y', unit='id', time='t', first_treat='g')" ), - step_name="heterogeneity", + # The Hausman pretest is a parallel-trends diagnostic per + # REGISTRY.md §EfficientDiD: it tests whether the stronger + # PT-All regime is tenable relative to PT-Post. ``DiagnosticReport`` + # treats a ran Hausman block as ``parallel_trends`` completion + # (``_check_pt_hausman``), so tagging this practitioner step as + # ``parallel_trends`` keeps ``_collect_next_steps()`` from + # recommending a check the report already executed. Round-20 P2 + # CI review on PR #318 flagged the earlier ``heterogeneity`` tag + # as a mismatched-step-name bug. 
+ step_name="parallel_trends", ), _robustness_compare_step("CS, SA, or BJS"), _covariates_step(), diff --git a/diff_diff/pretrends.py b/diff_diff/pretrends.py index abf4cd29..b249cef6 100644 --- a/diff_diff/pretrends.py +++ b/diff_diff/pretrends.py @@ -613,12 +613,30 @@ def _extract_pre_period_params( "Re-run with aggregate='event_study'." ) - # Get pre-period effects (negative relative times) - # Filter out normalization constraints (n_groups=0) and non-finite SEs + # Get pre-period effects. Anticipation-aware cutoff per + # REGISTRY.md §CallawaySantAnna lines 355-395: with + # ``anticipation=k``, true pre-periods are ``t < -k``; + # ``t ∈ [-k, -1]`` is the anticipation window and must + # not be used for pre-trends power. Filter out + # normalization constraints (n_groups=0) and non-finite + # SEs as well. + _ant = getattr(results, "anticipation", 0) or 0 + try: + _ant = int(_ant) + except (TypeError, ValueError): + _ant = 0 + _pre_cutoff = -_ant + # ``safe_inference`` treats ``se <= 0`` as undefined + # inference; filter the same way here so pre-trends + # power never silently includes rows whose per-period + # SE collapsed (round-33 P0 CI review on PR #318). pre_effects = { t: data for t, data in results.event_study_effects.items() - if t < 0 and data.get("n_groups", 1) > 0 and np.isfinite(data.get("se", np.nan)) + if t < _pre_cutoff + and data.get("n_groups", 1) > 0 + and np.isfinite(data.get("se", np.nan)) + and float(data.get("se", 0.0)) > 0 } if not pre_effects: @@ -640,12 +658,22 @@ def _extract_pre_period_params( from diff_diff.sun_abraham import SunAbrahamResults if isinstance(results, SunAbrahamResults): - # Get pre-period effects (negative relative times) - # Filter out normalization constraints (n_groups=0) and non-finite SEs + # Same anticipation-aware pre-period cutoff as + # CallawaySantAnna above. + _ant = getattr(results, "anticipation", 0) or 0 + try: + _ant = int(_ant) + except (TypeError, ValueError): + _ant = 0 + _pre_cutoff = -_ant + # Mirror the ``se > 0`` filter applied on the CS branch. 
pre_effects = { t: data for t, data in results.event_study_effects.items() - if t < 0 and data.get("n_groups", 1) > 0 and np.isfinite(data.get("se", np.nan)) + if t < _pre_cutoff + and data.get("n_groups", 1) > 0 + and np.isfinite(data.get("se", np.nan)) + and float(data.get("se", 0.0)) > 0 } if not pre_effects: diff --git a/diff_diff/stacked_did.py b/diff_diff/stacked_did.py index 7c610b5c..69e4ffd2 100644 --- a/diff_diff/stacked_did.py +++ b/diff_diff/stacked_did.py @@ -593,6 +593,7 @@ def _refit_stacked(w_r): weighting=self.weighting, clean_control=self.clean_control, alpha=self.alpha, + anticipation=self.anticipation, survey_metadata=survey_metadata, ) diff --git a/diff_diff/stacked_did_results.py b/diff_diff/stacked_did_results.py index 7a45a316..fb5bfb96 100644 --- a/diff_diff/stacked_did_results.py +++ b/diff_diff/stacked_did_results.py @@ -93,6 +93,7 @@ class StackedDiDResults: weighting: str = "aggregate" clean_control: str = "not_yet_treated" alpha: float = 0.05 + anticipation: int = 0 # Survey design metadata (SurveyMetadata instance from diff_diff.survey) survey_metadata: Optional[Any] = field(default=None) diff --git a/diff_diff/staggered.py b/diff_diff/staggered.py index 2004f2e7..ae711e12 100644 --- a/diff_diff/staggered.py +++ b/diff_diff/staggered.py @@ -2001,6 +2001,7 @@ def fit( alpha=self.alpha, control_group=self.control_group, base_period=self.base_period, + anticipation=self.anticipation, event_study_effects=event_study_effects, group_effects=group_effects, bootstrap_results=bootstrap_results, diff --git a/diff_diff/staggered_results.py b/diff_diff/staggered_results.py index 2cb31d60..9c8f5275 100644 --- a/diff_diff/staggered_results.py +++ b/diff_diff/staggered_results.py @@ -111,6 +111,14 @@ class CallawaySantAnnaResults: alpha: float = 0.05 control_group: str = "never_treated" base_period: str = "varying" + # Anticipation periods (``k``) used at fit time. Persisted on the + # result so downstream diagnostics (``BusinessReport`` / + # ``DiagnosticReport`` / ``compute_pretrends_power``) can classify + # pre-period vs anticipation-window coefficients without re- + # plumbing the kwarg through every call site. See REGISTRY.md + # §CallawaySantAnna lines 355-395 for the shifted-boundary + # contract. 
+ anticipation: int = 0 panel: bool = True event_study_effects: Optional[Dict[int, Dict[str, Any]]] = field(default=None) group_effects: Optional[Dict[Any, Dict[str, Any]]] = field(default=None) diff --git a/diff_diff/staggered_triple_diff.py b/diff_diff/staggered_triple_diff.py index 758d518b..08c6131f 100644 --- a/diff_diff/staggered_triple_diff.py +++ b/diff_diff/staggered_triple_diff.py @@ -136,8 +136,7 @@ def __init__( raise ValueError(f"epv_threshold must be > 0, got {epv_threshold}") if pscore_fallback not in ["error", "unconditional"]: raise ValueError( - f"pscore_fallback must be 'error' or 'unconditional', " - f"got '{pscore_fallback}'" + f"pscore_fallback must be 'error' or 'unconditional', " f"got '{pscore_fallback}'" ) self.estimation_method = estimation_method @@ -707,6 +706,7 @@ def fit( alpha=self.alpha, control_group=self.control_group, base_period=self.base_period, + anticipation=self.anticipation, estimation_method=self.estimation_method, event_study_effects=event_study_effects, group_effects=group_effects, @@ -1379,10 +1379,7 @@ def _compute_pscore( beta_clean = np.where(np.isfinite(beta_logistic), beta_logistic, 0.0) pscore_cache[pscore_key] = (beta_clean, diag) except (np.linalg.LinAlgError, ValueError): - if ( - self.pscore_fallback == "error" - or self.rank_deficient_action == "error" - ): + if self.pscore_fallback == "error" or self.rank_deficient_action == "error": raise ctx = f" for {context_label}" if context_label else "" warnings.warn( diff --git a/diff_diff/staggered_triple_diff_results.py b/diff_diff/staggered_triple_diff_results.py index bc664d4a..6ffc0738 100644 --- a/diff_diff/staggered_triple_diff_results.py +++ b/diff_diff/staggered_triple_diff_results.py @@ -74,6 +74,11 @@ class StaggeredTripleDiffResults: alpha: float = 0.05 control_group: str = "notyettreated" base_period: str = "varying" + # Anticipation periods (``k``) used at fit time. Persisted so + # downstream diagnostics in ``BusinessReport`` / ``DiagnosticReport`` + # can render the anticipation-aware assumption block and + # horizon-classification cutoffs accurately on real fits. + anticipation: int = 0 estimation_method: str = "dr" event_study_effects: Optional[Dict[int, Dict[str, Any]]] = field(default=None) group_effects: Optional[Dict[Any, Dict[str, Any]]] = field(default=None) diff --git a/diff_diff/sun_abraham.py b/diff_diff/sun_abraham.py index bb79052f..f3c78f8e 100644 --- a/diff_diff/sun_abraham.py +++ b/diff_diff/sun_abraham.py @@ -79,6 +79,12 @@ class SunAbrahamResults: n_control_units: int alpha: float = 0.05 control_group: str = "never_treated" + # Anticipation periods (``k``) used at fit time. Persisted so + # downstream diagnostics (``BusinessReport`` / ``DiagnosticReport`` + # / ``compute_pretrends_power``) can classify pre-period vs + # anticipation-window coefficients without re-plumbing the kwarg + # through every caller. 
+ anticipation: int = 0 bootstrap_results: Optional["SABootstrapResults"] = field(default=None, repr=False) cohort_effects: Optional[Dict[Tuple[Any, int], Dict[str, Any]]] = field( default=None, repr=False @@ -893,6 +899,7 @@ def _refit_sa_cohort(w_r): n_control_units=n_control_units, alpha=self.alpha, control_group=self.control_group, + anticipation=self.anticipation, bootstrap_results=bootstrap_results, cohort_effects=cohort_effects_storage, survey_metadata=survey_metadata, diff --git a/diff_diff/two_stage.py b/diff_diff/two_stage.py index 6a385bd9..dc86e438 100644 --- a/diff_diff/two_stage.py +++ b/diff_diff/two_stage.py @@ -841,6 +841,7 @@ def _refit_ts(w_r): n_treated_units=n_treated_units, n_control_units=n_control_units, alpha=self.alpha, + anticipation=self.anticipation, bootstrap_results=bootstrap_results, survey_metadata=survey_metadata, ) diff --git a/diff_diff/two_stage_results.py b/diff_diff/two_stage_results.py index 6097f05a..d7cf7c8c 100644 --- a/diff_diff/two_stage_results.py +++ b/diff_diff/two_stage_results.py @@ -136,6 +136,7 @@ class TwoStageDiDResults: n_treated_units: int n_control_units: int alpha: float = 0.05 + anticipation: int = 0 bootstrap_results: Optional[TwoStageBootstrapResults] = field(default=None, repr=False) # Survey design metadata (SurveyMetadata instance from diff_diff.survey) survey_metadata: Optional[Any] = field(default=None, repr=False) diff --git a/docs/api/business_report.rst b/docs/api/business_report.rst new file mode 100644 index 00000000..3017dbf2 --- /dev/null +++ b/docs/api/business_report.rst @@ -0,0 +1,92 @@ +BusinessReport +============== + +``BusinessReport`` wraps any fitted diff-diff result object and produces +stakeholder-ready output: + +- ``summary()`` — a short paragraph block suitable for an email or Slack. +- ``full_report()`` — a structured multi-section markdown report. +- ``to_dict()`` — a stable AI-legible structured schema (single source + of truth; prose renders from this dict). + +By default, BusinessReport constructs an internal ``DiagnosticReport`` +to surface pre-trends, sensitivity, and other validity checks as part +of the narrative. Pass ``auto_diagnostics=False`` to skip this, or +``diagnostics=`` to supply an explicit one. + +Pre-computed diagnostics can be forwarded directly to the auto- +constructed ``DiagnosticReport`` via +``precomputed={'parallel_trends': ...}``, +``precomputed={'sensitivity': ...}``, +``precomputed={'pretrends_power': ...}``, or +``precomputed={'bacon': ...}`` — same keys as +``DiagnosticReport(precomputed=...)``. DR validates keys and rejects +estimator-incompatible entries. + +Data-dependent checks (2x2 parallel trends on simple DiD, +Goodman-Bacon decomposition on staggered estimators, the EfficientDiD +Hausman PT-All vs PT-Post pretest) require the raw panel + column +names. Pass ``data``, ``outcome``, ``treatment``, ``unit``, ``time``, +and/or ``first_treat`` to ``BusinessReport`` and they are forwarded +to the auto-constructed ``DiagnosticReport``. Without these kwargs, +those specific checks are skipped with an explicit reason while the +rest of the report still renders. + +For survey-weighted fits (any result carrying +``survey_metadata``) pass the original ``SurveyDesign`` via +``survey_design=``. It is threaded through to +``bacon_decompose`` for a fit-faithful Goodman-Bacon replay. 
When +``survey_metadata`` is set but ``survey_design`` is not supplied, +Bacon is skipped with an explicit reason so the report never emits +an unweighted decomposition for a design that differs from the +estimate. The simple 2x2 parallel-trends helper has no survey-aware +variant and is skipped unconditionally on a survey-backed +``DiDResults`` regardless of ``survey_design``; supply +``precomputed={'parallel_trends': ...}`` with a survey-aware +pretest to opt in. + +Methodology deviations (no traffic-light gates, pre-trends verdict +thresholds, power-aware phrasing, unit-translation policy, schema +stability) are documented in :doc:`../methodology/REPORTING`. + +Example +------- + +.. code-block:: python + + from diff_diff import CallawaySantAnna, BusinessReport + + cs = CallawaySantAnna(base_period="universal").fit( + df, outcome="revenue", unit="store", time="period", + first_treat="first_treat", aggregate="event_study", + ) + report = BusinessReport( + cs, + outcome_label="Revenue per store", + outcome_unit="$", + business_question="Did the loyalty program lift revenue?", + treatment_label="the loyalty program", + # Optional: panel + column names so auto diagnostics can run the + # data-dependent checks (2x2 PT, Goodman-Bacon, EfficientDiD + # Hausman). Without these the auto path still runs and just + # skips those checks. + data=df, + outcome="revenue", + unit="store", + time="period", + first_treat="first_treat", + ) + print(report.summary()) + +API +--- + +.. autoclass:: diff_diff.BusinessReport + :members: + :show-inheritance: + +.. autoclass:: diff_diff.BusinessContext + :members: + :show-inheritance: + +.. autodata:: diff_diff.BUSINESS_REPORT_SCHEMA_VERSION diff --git a/docs/api/diagnostic_report.rst b/docs/api/diagnostic_report.rst new file mode 100644 index 00000000..0a72e2ca --- /dev/null +++ b/docs/api/diagnostic_report.rst @@ -0,0 +1,77 @@ +DiagnosticReport +================ + +``DiagnosticReport`` orchestrates the library's existing diagnostic +functions (parallel trends, pre-trends power, HonestDiD sensitivity, +Goodman-Bacon, design-effect, EPV, heterogeneity, and estimator-native +checks for SyntheticDiD and TROP) into a single report with a stable +AI-legible schema. + +Construction is free; ``run_all()`` triggers the compute and caches. +A second call to ``to_dict()`` or ``summary()`` reuses the cached +result. + +Methodology deviations (no traffic-light gates, opt-in placebo +battery, estimator-native diagnostic routing, power-aware phrasing +threshold) are documented in :doc:`../methodology/REPORTING`. + +Data-dependent checks (2x2 parallel trends on simple DiD, +Goodman-Bacon decomposition on staggered estimators, the EfficientDiD +Hausman PT-All vs PT-Post pretest) require the raw panel + column +names. Pass ``data``, ``outcome``, ``treatment``, ``unit``, ``time``, +and/or ``first_treat`` and they feed the runners. Without these +kwargs, those specific checks are skipped with an explicit reason +while the rest of the battery still runs. + +For survey-weighted fits (any result carrying +``survey_metadata``) pass the original ``SurveyDesign`` via +``survey_design=``. It is threaded through to +``bacon_decompose`` for a fit-faithful Goodman-Bacon replay. When +``survey_metadata`` is set but ``survey_design`` is not supplied, +Bacon is skipped with an explicit reason so the report never emits +an unweighted decomposition for a design that differs from the +estimate; alternatively supply +``precomputed={'bacon': }`` with a +survey-aware result. 
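+
+A minimal sketch of the survey path (``design`` stands for the
+``SurveyDesign`` used at fit time; ``results`` / ``df`` are the fitted
+result and panel):
+
+.. code-block:: python
+
+    dr = DiagnosticReport(
+        results,
+        data=df,
+        outcome="outcome",
+        unit="unit",
+        time="period",
+        first_treat="first_treat",
+        survey_design=design,  # same design as the weighted estimate
+    )
+    dr.run_all()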
+ +The simple 2x2 parallel-trends helper has no survey-aware variant +and is skipped unconditionally on a survey-backed ``DiDResults`` +regardless of ``survey_design`` — the helper cannot consume the +design even when it is available. Supply +``precomputed={'parallel_trends': }`` with a survey-aware +pretest result to opt in. + +Example +------- + +.. code-block:: python + + from diff_diff import CallawaySantAnna, DiagnosticReport + + cs = CallawaySantAnna(base_period="universal").fit( + df, outcome="outcome", unit="unit", time="period", + first_treat="first_treat", aggregate="event_study", + ) + dr = DiagnosticReport( + cs, + data=df, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + print(dr.summary()) + dr.to_dataframe() # one row per check + +API +--- + +.. autoclass:: diff_diff.DiagnosticReport + :members: + :show-inheritance: + +.. autoclass:: diff_diff.DiagnosticReportResults + :members: + :show-inheritance: + +.. autodata:: diff_diff.DIAGNOSTIC_REPORT_SCHEMA_VERSION diff --git a/docs/api/index.rst b/docs/api/index.rst index 3d08dc98..da128317 100644 --- a/docs/api/index.rst +++ b/docs/api/index.rst @@ -253,6 +253,15 @@ Diagnostics & Inference power pretrends +Reporting +~~~~~~~~~ + +.. toctree:: + :maxdepth: 2 + + business_report + diagnostic_report + Results & Visualization ~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/doc-deps.yaml b/docs/doc-deps.yaml index aee0b9d0..dfb1181b 100644 --- a/docs/doc-deps.yaml +++ b/docs/doc-deps.yaml @@ -482,6 +482,36 @@ sources: - path: docs/tutorials/07_pretrends_power.ipynb type: tutorial + diff_diff/business_report.py: + drift_risk: medium + docs: + - path: docs/methodology/REPORTING.md + type: methodology + note: "Phrasing rules, pre-trends verdict thresholds, unit-translation policy, schema stability." + - path: docs/api/business_report.rst + type: api_reference + - path: README.md + section: "BusinessReport" + type: user_guide + - path: diff_diff/guides/llms-full.txt + section: "BusinessReport" + type: user_guide + + diff_diff/diagnostic_report.py: + drift_risk: medium + docs: + - path: docs/methodology/REPORTING.md + type: methodology + note: "Applicability matrix, opt-in placebo rationale, native-diagnostic routing, no-traffic-lights decision." + - path: docs/api/diagnostic_report.rst + type: api_reference + - path: README.md + section: "DiagnosticReport" + type: user_guide + - path: diff_diff/guides/llms-full.txt + section: "DiagnosticReport" + type: user_guide + diff_diff/power.py: drift_risk: low docs: diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 64d2e63a..775f323f 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2844,6 +2844,19 @@ The 8-step workflow in `diff_diff/guides/llms-practitioner.txt` is adapted from --- +# Reporting + +BusinessReport and DiagnosticReport are the practitioner-ready output +layer. Their methodology (phrasing rules, pre-trends verdict +thresholds, power-aware phrasing, unit-translation policy, schema +stability, no-traffic-light-gates decision, estimator-native diagnostic +routing) is recorded in a dedicated file to keep this registry +estimator-focused: + +- See [`REPORTING.md`](./REPORTING.md). 
+ +--- + # Version History - **v1.3** (2026-03-26): Added Replicate Weight Variance, DEFF Diagnostics, diff --git a/docs/methodology/REPORTING.md b/docs/methodology/REPORTING.md new file mode 100644 index 00000000..fd10cfda --- /dev/null +++ b/docs/methodology/REPORTING.md @@ -0,0 +1,251 @@ +# Reporting + +This document records the methodology choices embedded in +`BusinessReport` and `DiagnosticReport` — the convenience layer that +produces plain-English stakeholder narratives from any diff-diff result. + +Methodology for estimators lives in `REGISTRY.md`. This file is the +single source for reporting-layer decisions; `REGISTRY.md` cross-links +here rather than duplicating content. + +## Module + +- `diff_diff/business_report.py` — `BusinessReport`, `BusinessContext`. +- `diff_diff/diagnostic_report.py` — `DiagnosticReport`, + `DiagnosticReportResults`. + +Both modules dispatch by `type(results).__name__` lookup to avoid +circular imports across the 16 result classes. They do no estimator +fitting and do not re-derive any variance from raw data; every effect, +SE, p-value, CI, and sensitivity bound is either read from the fitted +result or produced by an existing diff-diff utility +(`compute_honest_did`, `HonestDiD.sensitivity`, `bacon_decompose`, +`check_parallel_trends`, `compute_pretrends_power`). When the caller +passes the raw panel + column kwargs, `DiagnosticReport` may call +those utilities on the supplied data (2x2 PT via +`check_parallel_trends`, Goodman-Bacon decomposition via +`bacon_decompose`, and the EfficientDiD Hausman PT-All vs PT-Post +pretest via `EfficientDiD.hausman_pretest`). + +The `design_effect` section of `DiagnosticReport.to_dict()` is a +read-only surface: it echoes `survey_metadata.design_effect` and +`effective_n` from the fitted result along with a `band_label` enum +classifying the deviation from 1. The enum values are: + +- `"improves_precision"` for `deff < 0.95` (effective N is LARGER + than nominal N — a precision-improving design); +- `"trivial"` for `0.95 <= deff < 1.05` (effectively no effect on + inference); +- `"slightly_reduces"` for `1.05 <= deff < 2`; +- `"materially_reduces"` for `2 <= deff < 5`; +- `"large_warning"` for `deff >= 5`; +- `None` when `deff` is missing or non-finite. + +The section does not call `compute_deff_diagnostics` (that helper +needs per-fit internals the result objects do not expose). The report layer **does** compose a few +cross-period summary statistics from per-period inputs already +produced by the estimator — specifically the joint-Wald / Bonferroni +pre-trends p-value from pre-period event-study coefficients (see +`_pt_event_study`), the MDV-to-ATT ratio for power-tier selection, +and the heterogeneity dispersion block (CV / range / sign- +consistency over post-treatment group / event-study / group-time +effects, pre-period and reference-marker rows excluded). These are +reporting-layer aggregations of inputs already in the result object, +not new inference. + +## Design deviations + +- **Note:** No hard pass/fail gates. `DiagnosticReport` does not produce + a traffic-light verdict. Severity is conveyed through natural-language + phrasing ("robust", "fragile", "material share"). This is an explicit + deviation from the strategy document's Gap 4 ("traffic-light + assessment (green/yellow/red)"); the choice is motivated by the + well-known risk of naive thresholds producing false confidence. A + `ConservativeThresholds` opt-in layer remains available as a future + addition if practitioner demand materialises. 
+ +- **Note:** Placebo battery is opt-in (`run_placebo=False` by default). + `run_all_placebo_tests` on a typical panel (500 permutations times one + DiD fit per permutation) adds tens of seconds of latency, which would + be surprising as the default on a convenience wrapper. The schema + reserves the `"placebo"` key; it is always rendered with + `{"status": "skipped", "reason": "..."}` in MVP so agents parsing the + schema see a stable shape. + +- **Note:** `DiagnosticReport` does not call `check_parallel_trends` on + event-study or staggered result objects. `check_parallel_trends` in + `diff_diff/utils.py` assumes a single binary treatment with universal + pre-periods; for staggered and event-study designs, DR reads the + pre-period event-study coefficients directly and constructs a joint + Wald statistic (or Bonferroni fallback when `vcov` is missing). This + mirrors the guidance in `practitioner._parallel_trends_step(staggered=True)`. + +- **Note:** Survey-design threading for fit-faithful Bacon replay. + `DiagnosticReport(survey_design=...)` and + `BusinessReport(survey_design=...)` accept the original + `SurveyDesign` object and forward it to + `bacon_decompose(survey_design=...)` so the Goodman-Bacon + decomposition is computed under the same design as the weighted + estimate. When `survey_metadata` is set but `survey_design` is not + supplied, Bacon skips with an explicit reason rather than replaying + an unweighted decomposition for a design that differs from the + weighted estimate; users can alternatively pass + `precomputed={'bacon': ...}` with a survey-aware result. + + The simple 2x2 parallel-trends helper (`utils.check_parallel_trends`) + has no survey-aware variant. On a survey-backed `DiDResults` the + check is skipped **unconditionally**, regardless of whether + `survey_design` is supplied, because the helper cannot consume the + design even when it is available. Users must pass + `precomputed={'parallel_trends': ...}` with a survey-aware pretest + result to opt in. Event-study PT on staggered estimators is + unaffected — it reads the weighted pre-period coefficients directly + off the fitted result and uses the finite-df reference described + below, so no second replay is needed. + +- **Note:** Survey finite-df PT policy. When the fitted result carries + a finite `survey_metadata.df_survey`, `_pt_event_study` computes + `F = W / k` (numerator df = k pre-period coefficients) against an + F(k, df_survey) reference distribution rather than chi-square(k). + The design-based SE already reflects the effective sample size, so + the chi-square reference would systematically over-reject under the + finite-sample correction the SE captures. The schema surfaces the + survey branch via the `method` suffix `_survey` + (e.g., `joint_wald_survey`, `joint_wald_event_study_survey`) and + exposes the denominator df as `df_denom`, so BR / DR prose can flag + the finite-sample correction rather than silently presenting a + chi-square-style result. Non-finite `df_survey` (NaN / inf / + non-positive) falls back to the chi-square path. + +- **Note:** Estimator-native validation surfaces are surfaced rather + than duplicated. `SyntheticDiDResults` routes parallel-trends to + `pre_treatment_fit` (the RMSE of the synthetic-control fit on the + pre-period), and routes sensitivity to `in_time_placebo()` + + `sensitivity_to_zeta_omega()`. `TROPResults` surfaces factor-model + diagnostics (`effective_rank`, `loocv_score`, selected `lambda_*`) + under `estimator_native_diagnostics`. 
`EfficientDiDResults` PT runs + through `EfficientDiD.hausman_pretest` (the estimator's native + PT-All vs PT-Post check). + +- **Note:** Pre-trends verdict is a three-bin heuristic, not a field + convention. DR maps the joint p-value as follows: + + - `joint_p >= 0.30` → `no_detected_violation`. + - `0.05 <= joint_p < 0.30` → `some_evidence_against`. + - `joint_p < 0.05` → `clear_violation`. + + These thresholds are diff-diff heuristics. The 0.30 upper bound draws + on equivalence-testing intuition (Rambachan & Roth 2023 discuss the + limitations of pre-tests). The `no_detected_violation` label + deliberately avoids "parallel trends hold" language — the test did + not detect a violation, but pre-trends tests are commonly + underpowered. See the power-aware phrasing rule below. + +- **Note:** Power-aware phrasing for `no_detected_violation`. DR calls + `compute_pretrends_power(results, violation_type='linear', + alpha=alpha, target_power=0.80)` for the estimator families that + ship a `compute_pretrends_power` adapter: `MultiPeriodDiDResults`, + `CallawaySantAnnaResults`, and `SunAbrahamResults` (see + `_APPLICABILITY["pretrends_power"]` in + `diff_diff/diagnostic_report.py`). Other staggered families with + event-study output (`ImputationDiDResults`, `TwoStageDiDResults`, + `StackedDiDResults`, `EfficientDiDResults`, + `StaggeredTripleDiffResults`, `WooldridgeDiDResults`, + `ChaisemartinDHaultfoeuilleResults`) do not yet have a power + adapter and therefore render the `no_detected_violation` tier as + `underpowered` with the fallback reason recorded in + `schema["pre_trends"]["power_reason"]` (plain-English explanation) + while `schema["pre_trends"]["power_status"]` carries the + machine-readable enum (`"ran"` / `"skipped"` / `"error"` / + `"not_applicable"`). BusinessReport then reads + `mdv_share_of_att = mdv / abs(att)` and selects a tier: + + - `< 0.25` → `well_powered` — "the test has 80% power to + detect a violation of magnitude M, which is only X% of the + estimated effect; if a material pre-trend existed, this test would + likely have caught it." + - `>= 0.25 and < 1.0` → `moderately_powered` — "the test + is informative but not definitive; see the sensitivity analysis + below for bounded-violation guarantees." + - `>= 1.0` → `underpowered` — "the test has limited + power — a non-rejection does not prove the assumption. See + the HonestDiD sensitivity analysis below for a more reliable + signal." + - Power analysis not runnable → fall back to `underpowered` + phrasing; the fallback reason is recorded in + `schema["pre_trends"]["power_reason"]` (plain-English explanation; + `power_status` carries the enum). + + Rationale: always-hedging phrasing under-sells well-designed + studies; always-confident phrasing over-sells underpowered ones. + The library already ships `compute_pretrends_power()`, so using it + is the honest default rather than hedging every non-violation. + +- **Note:** Diagonal-covariance fallback for staggered-estimator power. + `compute_pretrends_power()` currently drops to `np.diag(ses**2)` for + CS / SA / ImputationDiD / Stacked / etc. even when the full + `event_study_vcov` is attached on the result. The + `DiagnosticReport.pretrends_power` block records + `covariance_source: "diag_fallback_available_full_vcov_unused"` in + that case, and `BusinessReport` downgrades a `well_powered` tier to + `moderately_powered` before rendering prose. 
This is a known + conservative deviation from the documented "use the full pre-period + covariance" position — it prevents the diagonal approximation from + producing an overly optimistic "well-powered" claim when correlated + pre-period errors could tighten the MDV. The right long-term fix is + to teach `compute_pretrends_power()` to consume `event_study_vcov` + and `event_study_vcov_index`; until that lands this downgrade stays. + +- **Note:** Unit-translation policy. BusinessReport does not + arithmetically translate log-points to percents or level effects to + log-points. The estimate is rendered in the scale the estimator + produced; `outcome_unit="log_points"` emits an informational + caveat. The policy avoids guessing the underlying model (no + estimator in the library currently exports both log and level + coefficients), which would be unsafe in the presence of non-linear + link functions (Poisson QMLE, logit). + +- **Note:** Single-knob `alpha` with preserved-native-CI fallback. + BusinessReport exposes only `alpha` (defaults to `results.alpha`); + there is no separate `significance_threshold` parameter. When the + requested `alpha` matches the fit's native level, it drives both the + CI level (`(1 - alpha) * 100`% interval) and the phrasing tier + threshold ("statistically significant at the (1 - alpha) * 100% + level"). When the requested `alpha` differs from the fit's native + level (e.g., the user asks for `alpha=0.10` on a result fit with + `alpha=0.05`), BusinessReport does NOT recompute the CI at the + requested level, because the stored CI is the only quantile the + underlying estimator supplied (bootstrap distributions and + finite-df analytical variances are not always retained on the + result). Instead, the schema preserves the fit's native CI (with its + original level) and uses the requested `alpha` only for the + significance-phrasing threshold, and emits an + `alpha_override_preserved` caveat describing the mismatch. This is + the conservative choice: it avoids silently recomputing CIs under + assumptions the estimator may not support. + +- **Note:** Schema stability policy for the AI-legible `to_dict()` + surface. New top-level keys count as additive (no version bump); new + values in any `status` enum count as breaking (agents doing + exhaustive pattern match will break on unknown enums); renames and + removals count as breaking. The `BUSINESS_REPORT_SCHEMA_VERSION` + and `DIAGNOSTIC_REPORT_SCHEMA_VERSION` constants bump independently. + The v3.2 CHANGELOG marks both schemas experimental so users do not + anchor tooling on them prematurely; a formal deprecation policy will + land within two subsequent PRs. + +## Reference implementation(s) + +The phrasing rules follow the guidance in: + +- Baker, A. C., Callaway, B., Cunningham, S., Goodman-Bacon, A., & + Sant'Anna, P. H. C. (2025). *Difference-in-Differences Designs: A + Practitioner's Guide.* (The 8-step workflow enforced through + `diff_diff/practitioner.py`.) +- Rambachan, A., & Roth, J. (2023). *A More Credible Approach to + Parallel Trends.* Review of Economic Studies. (HonestDiD sensitivity; + the pre-test power caveat directly shaped the three-tier power + phrasing.) +- Roth, J. (2022). *Pretest with Caution: Event-study Estimates after + Testing for Parallel Trends.* American Economic Review: Insights. + (Motivates the power-aware phrasing tiers.) 
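+
+## Illustrative sketch: verdict and power-tier mapping
+
+The three-bin pre-trends verdict and the power tiers described under
+Design deviations reduce to simple threshold rules. The sketch below is
+for exposition only; the function names are not library internals, and
+the diagonal-covariance downgrade described above is omitted.
+
+```python
+import math
+
+
+def pretrends_verdict(joint_p: float) -> str:
+    # Three-bin heuristic from the pre-trends verdict note above.
+    if joint_p < 0.05:
+        return "clear_violation"
+    if joint_p < 0.30:
+        return "some_evidence_against"
+    return "no_detected_violation"
+
+
+def power_tier(mdv: float, att: float) -> str:
+    # Tier from mdv_share_of_att = mdv / |att|. "unknown" maps to the
+    # underpowered phrasing in BusinessReport.
+    if not (math.isfinite(mdv) and math.isfinite(att)) or att == 0.0:
+        return "unknown"
+    share = mdv / abs(att)
+    if share < 0.25:
+        return "well_powered"
+    if share < 1.0:
+        return "moderately_powered"
+    return "underpowered"
+```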
diff --git a/tests/test_business_report.py b/tests/test_business_report.py new file mode 100644 index 00000000..cda3e5a7 --- /dev/null +++ b/tests/test_business_report.py @@ -0,0 +1,4436 @@ +"""Tests for ``diff_diff.business_report.BusinessReport``. + +Covers the expanded test list from the approved plan: +- Schema contract across result types. +- JSON round-trip. +- BR-DR integration (auto, explicit, False). +- ``honest_did_results=`` passthrough (no re-computation). +- Unit-label behavior (pp vs $ differ; column-name fallback). +- Log-points unit policy (no arithmetic translation; informational caveat). +- Significance-chasing guard boundary. +- Pre-trends verdict thresholds (three bins routed through BR phrasing). +- Power-aware phrasing (three tiers + underpowered fallback). +- NaN ATT surfaces a caveat and does not crash. +- ``include_appendix`` toggle. +- ``BusinessReport(BaconDecompositionResults)`` raises TypeError. +- Survey metadata passthrough to schema + phrasing. +- Single-knob alpha drives both CI level and phrasing. +""" + +from __future__ import annotations + +import json +import warnings +from unittest.mock import patch + +import numpy as np +import pytest + +import diff_diff as dd +from diff_diff import ( + BusinessContext, + BusinessReport, + CallawaySantAnna, + DiagnosticReport, + DifferenceInDifferences, + MultiPeriodDiD, + SyntheticDiD, + bacon_decompose, + generate_did_data, + generate_factor_data, + generate_staggered_data, +) +from diff_diff.business_report import BUSINESS_REPORT_SCHEMA_VERSION + +warnings.filterwarnings("ignore") + +_BR_TOP_LEVEL_KEYS = { + "schema_version", + "estimator", + "context", + "headline", + "assumption", + "pre_trends", + "sensitivity", + "sample", + "heterogeneity", + "robustness", + "diagnostics", + "next_steps", + "caveats", + "references", +} + + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- +@pytest.fixture(scope="module") +def did_fit(): + df = generate_did_data(n_units=80, n_periods=4, treatment_effect=1.5, seed=7) + did = DifferenceInDifferences().fit(df, outcome="outcome", treatment="treated", time="post") + return did, df + + +@pytest.fixture(scope="module") +def event_study_fit(): + df = generate_did_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + es = MultiPeriodDiD().fit( + df, + outcome="outcome", + treatment="treated", + time="period", + unit="unit", + reference_period=3, + ) + return es, df + + +@pytest.fixture(scope="module") +def cs_fit(): + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + # base_period='universal' so DR's sensitivity check can run without + # hitting the round-5 methodology-critical skip (Rambachan-Roth bounds + # are not interpretable on consecutive-comparison pre-periods). 
+ cs = CallawaySantAnna(base_period="universal").fit( + sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + aggregate="event_study", + ) + return cs, sdf + + +@pytest.fixture(scope="module") +def sdid_fit(): + fdf = generate_factor_data(n_units=25, n_pre=8, n_post=4, n_treated=4, seed=11) + sdid = SyntheticDiD().fit(fdf, outcome="outcome", unit="unit", time="period", treatment="treat") + return sdid, fdf + + +@pytest.fixture(scope="module") +def edid_fit(): + from diff_diff import EfficientDiD + + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + edid = EfficientDiD().fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + return edid, sdf + + +# --------------------------------------------------------------------------- +# Schema contract +# --------------------------------------------------------------------------- +class TestSchemaContract: + def test_top_level_keys(self, event_study_fit): + fit, _ = event_study_fit + br = BusinessReport(fit, auto_diagnostics=False) + assert set(br.to_dict().keys()) == _BR_TOP_LEVEL_KEYS + + def test_schema_version(self, event_study_fit): + fit, _ = event_study_fit + assert ( + BusinessReport(fit, auto_diagnostics=False).to_dict()["schema_version"] + == BUSINESS_REPORT_SCHEMA_VERSION + ) + + def test_json_round_trip(self, cs_fit): + fit, _ = cs_fit + br = BusinessReport( + fit, + outcome_label="sales", + outcome_unit="$", + treatment_label="the policy", + ) + dumped = json.dumps(br.to_dict()) + assert len(dumped) > 0 + assert json.loads(dumped)["schema_version"] == BUSINESS_REPORT_SCHEMA_VERSION + + def test_json_round_trip_sdid(self, sdid_fit): + fit, _ = sdid_fit + br = BusinessReport(fit, outcome_label="revenue", outcome_unit="$") + dumped = json.dumps(br.to_dict()) + assert len(dumped) > 0 + + +# --------------------------------------------------------------------------- +# BR ↔ DR integration +# --------------------------------------------------------------------------- +class TestDiagnosticsIntegration: + def test_auto_diagnostics_true_populates_diagnostics_block(self, event_study_fit): + fit, _ = event_study_fit + br = BusinessReport(fit, auto_diagnostics=True) + d = br.to_dict() + assert d["diagnostics"]["status"] == "ran" + assert "schema" in d["diagnostics"] + + def test_auto_diagnostics_false_skips(self, event_study_fit): + fit, _ = event_study_fit + br = BusinessReport(fit, auto_diagnostics=False) + d = br.to_dict() + assert d["diagnostics"]["status"] == "skipped" + assert "auto_diagnostics=False" in d["diagnostics"]["reason"] + + def test_explicit_diagnostics_results_takes_precedence(self, event_study_fit): + fit, _ = event_study_fit + dr = DiagnosticReport(fit) + dr_results = dr.run_all() + br = BusinessReport(fit, diagnostics=dr_results) + d = br.to_dict() + assert d["diagnostics"]["status"] == "ran" + # Same dict identity shows the supplied results were used verbatim. 
+ assert d["diagnostics"]["schema"] is dr_results.schema + + def test_explicit_diagnostics_report_runs(self, event_study_fit): + fit, _ = event_study_fit + dr = DiagnosticReport(fit) + br = BusinessReport(fit, diagnostics=dr) + assert br.to_dict()["diagnostics"]["status"] == "ran" + + def test_diagnostics_wrong_type_raises(self, event_study_fit): + fit, _ = event_study_fit + with pytest.raises(TypeError): + BusinessReport(fit, diagnostics="not a DR") # type: ignore[arg-type] + + +# --------------------------------------------------------------------------- +# HonestDiD passthrough +# --------------------------------------------------------------------------- +class TestHonestDiDPassthrough: + def test_supplied_sensitivity_is_not_recomputed(self, event_study_fit): + fit, _ = event_study_fit + + class _FakeSens: + M_values = np.array([0.5, 1.0]) + bounds = [(0.1, 2.0), (-0.2, 2.5)] + robust_cis = [(0.05, 2.1), (-0.3, 2.6)] + breakdown_M = 1.5 + method = "relative_magnitude" + original_estimate = 1.0 + original_se = 0.2 + alpha = 0.05 + + fake = _FakeSens() + with patch("diff_diff.honest_did.HonestDiD.sensitivity_analysis") as mock: + br = BusinessReport(fit, honest_did_results=fake) + schema = br.to_dict() + mock.assert_not_called() + sens = schema["sensitivity"] + assert sens["status"] == "computed" + assert sens["breakdown_M"] == 1.5 + + +# --------------------------------------------------------------------------- +# Unit labels and policy +# --------------------------------------------------------------------------- +class TestUnitLabels: + def test_dollar_unit_formats_currency(self, cs_fit): + fit, _ = cs_fit + br = BusinessReport(fit, outcome_label="sales", outcome_unit="$", auto_diagnostics=False) + headline = br.headline() + assert "$" in headline + + def test_pp_unit_formats_percentage_points(self, cs_fit): + fit, _ = cs_fit + br = BusinessReport( + fit, outcome_label="awareness", outcome_unit="pp", auto_diagnostics=False + ) + headline = br.headline() + assert "pp" in headline + + def test_zero_config_falls_back_to_generic_label(self, cs_fit): + fit, _ = cs_fit + br = BusinessReport(fit, auto_diagnostics=False) + d = br.to_dict() + assert d["context"]["outcome_label"] == "the outcome" + assert d["context"]["treatment_label"] == "the treatment" + + def test_log_points_emits_unit_policy_caveat(self, cs_fit): + fit, _ = cs_fit + br = BusinessReport(fit, outcome_unit="log_points", auto_diagnostics=False) + caveats = br.caveats() + topics = {c.get("topic") for c in caveats} + assert "unit_policy" in topics + + +# --------------------------------------------------------------------------- +# Significance phrasing +# --------------------------------------------------------------------------- +class TestOutcomeDirection: + """outcome_direction selects value-laden vs neutral verbs.""" + + def test_higher_is_better_positive_effect_uses_lifted(self, cs_fit): + fit, _ = cs_fit + br = BusinessReport( + fit, + outcome_label="sales", + outcome_unit="$", + outcome_direction="higher_is_better", + treatment_label="the policy", + auto_diagnostics=False, + ) + headline = br.headline() + assert "lifted" in headline + assert "increased" not in headline + + def test_lower_is_better_positive_effect_uses_worsened(self, cs_fit): + fit, _ = cs_fit # CS has a positive effect on this seed + br = BusinessReport( + fit, + outcome_label="churn", + outcome_unit="%", + outcome_direction="lower_is_better", + treatment_label="the change", + auto_diagnostics=False, + ) + headline = br.headline() + assert "worsened" in 
headline + + def test_direction_none_uses_neutral_verb(self, cs_fit): + fit, _ = cs_fit + br = BusinessReport( + fit, + outcome_label="sales", + outcome_unit="$", + auto_diagnostics=False, + ) + headline = br.headline() + assert "increased" in headline + assert "lifted" not in headline + + +class TestWarningsPassthrough: + """Broad exception handling still records provenance in schema.warnings.""" + + def test_diagnostic_error_surfaces_as_top_level_warning(self, event_study_fit): + fit, _ = event_study_fit + + def _raise(*args, **kwargs): + raise RuntimeError("synthetic test failure") + + with patch("diff_diff.honest_did.HonestDiD.sensitivity_analysis", side_effect=_raise): + br = BusinessReport(fit, auto_diagnostics=True) + schema = br.to_dict() + inner = schema["diagnostics"]["schema"] + # The error is recorded at the section level... + assert inner["sensitivity"]["status"] == "error" + # ...AND surfaced at the top level for quick scanning. + assert any("sensitivity:" in w for w in inner["warnings"]) + assert any("synthetic test failure" in w for w in inner["warnings"]) + + +class TestSignificancePhrasing: + def test_high_significance_produces_strong_language(self, cs_fit): + """CS on this seed has p ~ 1e-56 (very strong) -> 'strongly supported'.""" + fit, _ = cs_fit + br = BusinessReport(fit, outcome_label="sales", outcome_unit="$") + summary = br.summary() + assert "strongly supported" in summary + + def test_near_threshold_caveat(self, event_study_fit): + """Fabricate a p-value near 0.05 to exercise the significance-chasing guard.""" + fit, _ = event_study_fit + # Monkey-patch the result to land p_value in (0.04, 0.051). + original = fit.avg_p_value + try: + fit.avg_p_value = 0.045 + br = BusinessReport(fit, auto_diagnostics=False) + caveats = br.caveats() + topics = {c.get("topic") for c in caveats} + assert "near_significance" in topics + finally: + fit.avg_p_value = original + + def test_far_from_threshold_no_near_caveat(self, event_study_fit): + fit, _ = event_study_fit + original = fit.avg_p_value + try: + fit.avg_p_value = 0.010 + br = BusinessReport(fit, auto_diagnostics=False) + topics = {c.get("topic") for c in br.caveats()} + assert "near_significance" not in topics + finally: + fit.avg_p_value = original + + +# --------------------------------------------------------------------------- +# Pre-trends verdict + power tier phrasing +# --------------------------------------------------------------------------- +class TestPreTrendsVerdictPhrasing: + """Verdict and tier should flow through into schema AND phrasing.""" + + def test_verdict_and_tier_surface_in_schema(self, event_study_fit): + fit, _ = event_study_fit + br = BusinessReport(fit, auto_diagnostics=True) + pt = br.to_dict()["pre_trends"] + # This fixture has a clear violation and an underpowered test — both set. + assert pt["status"] == "computed" + assert pt["verdict"] in { + "no_detected_violation", + "some_evidence_against", + "clear_violation", + } + + def test_clear_violation_phrased_tentatively(self, event_study_fit): + fit, _ = event_study_fit + br = BusinessReport(fit, auto_diagnostics=True) + if br.to_dict()["pre_trends"].get("verdict") == "clear_violation": + summary = br.summary() + assert "tentative" in summary or "reject parallel trends" in summary + + def test_underpowered_phrasing_uses_hedge_language(self, cs_fit): + """CS fit on this seed typically produces 'no_detected_violation' + underpowered.""" + fit, sdf = cs_fit + # Force the CS fit through our BR pipeline. 
+ br = BusinessReport( + fit, + outcome_label="sales", + outcome_unit="$", + diagnostics=DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ), + ) + pt = br.to_dict()["pre_trends"] + if pt.get("verdict") == "no_detected_violation": + summary = br.summary() + # One of the three tier-specific phrases should appear. + assert ( + "limited power" in summary + or "moderately informative" in summary + or "well-powered" in summary + or "likely have been detected" in summary + ) + + +# --------------------------------------------------------------------------- +# NaN ATT +# --------------------------------------------------------------------------- +class TestNaNATT: + def test_nan_att_produces_caveat_and_does_not_crash(self, event_study_fit): + fit, _ = event_study_fit + original = fit.avg_att + try: + fit.avg_att = float("nan") + br = BusinessReport(fit, auto_diagnostics=False) + summary = br.summary() + caveats = br.caveats() + assert isinstance(summary, str) + assert any(c.get("topic") == "estimation_failure" for c in caveats) + assert br.to_dict()["headline"]["sign"] == "undefined" + finally: + fit.avg_att = original + + +# --------------------------------------------------------------------------- +# include_appendix toggle +# --------------------------------------------------------------------------- +class TestAppendix: + def test_include_appendix_true_embeds_summary(self, event_study_fit): + fit, _ = event_study_fit + br = BusinessReport(fit, auto_diagnostics=False, include_appendix=True) + md = br.full_report() + assert "## Technical Appendix" in md + + def test_include_appendix_false_omits(self, event_study_fit): + fit, _ = event_study_fit + br = BusinessReport(fit, auto_diagnostics=False, include_appendix=False) + md = br.full_report() + assert "## Technical Appendix" not in md + + +# --------------------------------------------------------------------------- +# BaconDecompositionResults +# --------------------------------------------------------------------------- +class TestBaconTypeError: + def test_br_on_bacon_raises(self): + sdf = generate_staggered_data(n_units=30, n_periods=6, treatment_effect=1.5, seed=7) + bacon = bacon_decompose( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + with pytest.raises(TypeError, match="BaconDecompositionResults is a diagnostic"): + BusinessReport(bacon) + + +# --------------------------------------------------------------------------- +# Survey metadata passthrough +# --------------------------------------------------------------------------- +class TestSurveyPassthrough: + def test_survey_absent_yields_null_survey_block(self, cs_fit): + fit, _ = cs_fit + br = BusinessReport(fit, auto_diagnostics=False) + d = br.to_dict() + assert d["sample"]["survey"] is None + + def test_survey_present_populates_block(self, event_study_fit): + """Synthetically attach a survey_metadata shim and verify BR surfaces it.""" + fit, _ = event_study_fit + + class _ShimMeta: + weight_type = "pweight" + effective_n = 120.0 + design_effect = 2.5 + sum_weights = 200.0 + n_strata = 8 + n_psu = 20 + df_survey = 18 + replicate_method = None + + original = fit.survey_metadata + try: + fit.survey_metadata = _ShimMeta() + br = BusinessReport(fit, auto_diagnostics=False) + survey = br.to_dict()["sample"]["survey"] + assert survey is not None + assert survey["weight_type"] == "pweight" + assert survey["design_effect"] == 2.5 + assert survey["is_trivial"] is False + + summary = 
br.summary() + # When DEFF >= 1.5 we inject a caveat or a summary sentence. + assert ( + "design effect" in summary.lower() + or "effective sample size" in summary.lower() + or any(c.get("topic") == "design_effect" for c in br.caveats()) + ) + finally: + fit.survey_metadata = original + + +# --------------------------------------------------------------------------- +# Single-knob alpha +# --------------------------------------------------------------------------- +class TestAlphaKnob: + def test_alpha_equal_to_result_alpha_drives_ci_level(self, event_study_fit): + """When caller's alpha matches the fit's native alpha, ``ci_level`` + reflects that alpha (e.g., alpha=0.05 -> 95% CI).""" + fit, _ = event_study_fit + br = BusinessReport(fit, alpha=0.05, auto_diagnostics=False) + assert br.to_dict()["headline"]["ci_level"] == 95 + + def test_alpha_mismatch_preserves_fitted_ci_at_native_level(self, event_study_fit): + """Round-7 regression: a caller alpha that differs from the fit's + native alpha must NOT recompute a z-based CI (the fit used t-based + inference with a finite ``df`` that BR cannot reproduce from + ``(att, se)`` alone). The displayed CI stays at the fit's native + level, while significance phrasing uses the caller's alpha. A + caveat records the override. + """ + import math + + fit, _ = event_study_fit + br95 = BusinessReport(fit, alpha=0.05, auto_diagnostics=False) + br90 = BusinessReport(fit, alpha=0.10, auto_diagnostics=False) + h95 = br95.to_dict()["headline"] + h90 = br90.to_dict()["headline"] + if h95["effect"] is not None and math.isfinite(h95["effect"]): + # Bounds must match between the two: the alpha=0.10 call + # preserves the fit's 95% CI rather than recomputing a 90% z-CI. + assert h90["ci_lower"] == pytest.approx(h95["ci_lower"]) + assert h90["ci_upper"] == pytest.approx(h95["ci_upper"]) + # ``ci_level`` stays at the fit's native level in both cases. + assert h95["ci_level"] == 95 + assert h90["ci_level"] == 95 + # Override is surfaced as an info-level caveat. + topics = {c.get("topic") for c in br90.caveats()} + assert "alpha_override_preserved" in topics, ( + "Alpha mismatch must surface a caveat documenting the preserved " + "native CI level; topics seen: " + str(topics) + ) + + +class TestAlphaOverrideBootstrapAndFiniteDF: + """Alpha override preserves the fitted CI on any inference contract + that cannot be reproduced from point-estimate + SE alone (bootstrap / + wild cluster bootstrap / percentile / jackknife / placebo / finite-df + survey / undefined-d.f. replicate / analytical t-quantile). The + displayed CI stays at the fit's native level; significance phrasing + still uses the caller's alpha; an informational caveat records the + override. + """ + + class _BootstrapResultStub: + """Minimal stub shaped like a bootstrap-inferred result.""" + + def __init__(self): + self.att = 1.0 + self.se = 0.5 + self.p_value = 0.04 + # Original 95% CI from the bootstrap distribution. + self.conf_int = (0.05, 1.95) + self.alpha = 0.05 + self.n_obs = 100 + self.n_treated = 40 + self.n_control = 60 + self.inference_method = "bootstrap" + self.survey_metadata = None + # Presence of a bootstrap distribution triggers the preserve path. 
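+            # A minimal sketch (per the assertions in the test below, not a
+            # documented contract) of the headline block expected when the
+            # caller asks for alpha=0.10 against this 95% bootstrap fit:
+            #   {"ci_level": 95, "ci_lower": 0.05, "ci_upper": 1.95, ...}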
+ import numpy as np + + self.bootstrap_distribution = np.random.default_rng(0).normal(1.0, 0.5, 200) + + def test_bootstrap_fit_preserves_fitted_ci_on_alpha_mismatch(self): + stub = self._BootstrapResultStub() + br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + # Native fit was at 95%; requested 90% should NOT be reflected in the label. + assert h["ci_level"] == 95, ( + "Bootstrap fit must preserve fitted CI level (95) when caller " + f"requests a different alpha; got {h['ci_level']}" + ) + # Bounds should match the stored bootstrap interval, not a normal-z + # recomputation at 90%. + assert h["ci_lower"] == pytest.approx(0.05) + assert h["ci_upper"] == pytest.approx(1.95) + # A caveat records the override. + caveat_topics = {c.get("topic") for c in br.caveats()} + assert "alpha_override_preserved" in caveat_topics + + class _FiniteDfSurveyStub: + def __init__(self): + from types import SimpleNamespace + + self.att = 2.0 + self.se = 0.4 + self.p_value = 0.001 + self.conf_int = (1.22, 2.78) # 95% via survey t-quantile + self.alpha = 0.05 + self.n_obs = 120 + self.n_treated = 50 + self.n_control = 70 + self.inference_method = "analytical" + # Finite survey d.f. triggers the preserve path — normal approx + # would widen / narrow incorrectly. + self.survey_metadata = SimpleNamespace( + weight_type="pweight", + effective_n=110.0, + design_effect=1.2, + sum_weights=120.0, + n_strata=4, + n_psu=12, + df_survey=8, + replicate_method=None, + ) + + def test_finite_df_fit_preserves_fitted_ci_on_alpha_mismatch(self): + stub = self._FiniteDfSurveyStub() + br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + assert h["ci_level"] == 95 + assert h["ci_lower"] == pytest.approx(1.22) + assert h["ci_upper"] == pytest.approx(2.78) + caveat_topics = {c.get("topic") for c in br.caveats()} + assert "alpha_override_preserved" in caveat_topics + + +class TestWildBootstrapAlphaOverride: + """Regression for the round-4 P0 finding that ``inference='wild_bootstrap'`` + results were falling through to a normal-approximation recomputation.""" + + def test_wild_bootstrap_preserves_fitted_ci(self): + class _WildBootstrapStub: + def __init__(self): + self.att = 1.0 + self.se = 0.5 + self.p_value = 0.04 + # 95% CI produced by the wild cluster bootstrap surface. + self.conf_int = (0.10, 1.90) + self.alpha = 0.05 + self.n_obs = 100 + self.n_treated = 40 + self.n_control = 60 + self.inference_method = "wild_bootstrap" + self.survey_metadata = None + # Wild-boot fits don't necessarily carry a raw distribution; + # the inference_method string alone must be enough. + self.bootstrap_distribution = None + + stub = _WildBootstrapStub() + br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + assert h["ci_level"] == 95, ( + "Wild cluster bootstrap must preserve fitted CI level on alpha " + f"mismatch; got {h['ci_level']}" + ) + assert h["ci_lower"] == pytest.approx(0.10) + assert h["ci_upper"] == pytest.approx(1.90) + caveats = br.caveats() + assert any(c.get("topic") == "alpha_override_preserved" for c in caveats) + # Caveat message should call out wild cluster bootstrap specifically. 
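+        # Caveats are read as a list of dicts; this test relies only on their
+        # "topic" and "message" keys, so we pull the preserved-CI caveat's
+        # message and check its wording below.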
+ preserved_msg = next( + c["message"] for c in caveats if c.get("topic") == "alpha_override_preserved" + ) + assert "wild cluster bootstrap" in preserved_msg + + +class TestAssumptionBlockSourceFaithful: + """Regression for the round-4 P1 finding that ``_describe_assumption`` + was producing generic DiD PT text for ContinuousDiD, TripleDifference, + and StaggeredTripleDifference — all of which have different identifying + logic per the Methodology Registry.""" + + def _stub(self, class_name): + cls = type(class_name, (), {}) + obj = cls() + obj.att = 1.0 + obj.se = 0.1 + obj.p_value = 0.001 + obj.conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 40 + obj.n_control = 60 + obj.survey_metadata = None + obj.event_study_effects = None + obj.inference_method = "analytical" + return obj + + def test_continuous_did_assumption_uses_two_level_pt(self): + br = BusinessReport(self._stub("ContinuousDiDResults"), auto_diagnostics=False) + assumption = br.to_dict()["assumption"] + assert assumption["parallel_trends_variant"] == "dose_pt_or_strong_pt" + desc = assumption["description"] + # Registry-backed language: PT vs Strong PT + ACRT mention. + assert "Strong Parallel Trends" in desc or "SPT" in desc + assert "ATT(d" in desc or "ACRT" in desc + assert "Callaway" in desc # attribution to CGBS 2024 + + def test_triple_difference_assumption_uses_ddd_decomposition(self): + class TripleDifferenceResults: + pass + + obj = TripleDifferenceResults() + obj.att = 1.0 + obj.se = 0.1 + obj.p_value = 0.001 + obj.conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 40 + obj.n_control = 60 + obj.survey_metadata = None + obj.inference_method = "analytical" + + br = BusinessReport(obj, auto_diagnostics=False) + assumption = br.to_dict()["assumption"] + assert assumption["parallel_trends_variant"] == "triple_difference_cancellation" + desc = assumption["description"] + assert "DDD" in desc + assert "Ortiz-Villavicencio" in desc or "2025" in desc + + def test_staggered_triple_diff_assumption_uses_ddd_not_generic_pt(self): + class StaggeredTripleDiffResults: + pass + + obj = StaggeredTripleDiffResults() + obj.overall_att = 1.0 + obj.overall_se = 0.1 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 40 + obj.n_control = 60 + obj.survey_metadata = None + obj.event_study_effects = None + obj.inference_method = "analytical" + + br = BusinessReport(obj, auto_diagnostics=False) + assumption = br.to_dict()["assumption"] + assert assumption["parallel_trends_variant"] == "triple_difference_cancellation" + desc = assumption["description"] + assert "triple-difference" in desc.lower() or "DDD" in desc + # Must NOT be the generic group-time PT text. + assert "group-time ATT" not in desc + + def test_imputation_did_assumption_uses_untreated_fe_model(self): + """Round-42 P1 regression: BJS (2024) identifies through the + untreated-outcome FE model (Step 1 estimates FE on ``Omega_0`` + = never-treated + not-yet-treated observations, Assumption 1 + parallel trends applies to ``E[Y_it(0)]``). The old generic + "group-time ATT" wording misstated this: the identifying + restriction is on the UNTREATED outcome's additive FE + structure, not on cohort-time ATT equality. REGISTRY.md + §ImputationDiD lines 1000-1013 and Assumption 1/2. 
+ """ + + class ImputationDiDResults: + pass + + obj = ImputationDiDResults() + obj.overall_att = 1.0 + obj.overall_se = 0.1 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 40 + obj.n_control = 60 + obj.survey_metadata = None + obj.event_study_effects = None + obj.inference_method = "analytical" + obj.anticipation = 0 + + br = BusinessReport(obj, auto_diagnostics=False) + assumption = br.to_dict()["assumption"] + assert assumption["parallel_trends_variant"] == "untreated_outcome_fe_model" + desc = assumption["description"] + # Registry-backed: Borusyak-Jaravel-Spiess attribution. + assert "Borusyak" in desc or "BJS" in desc or "2024" in desc + # Load-bearing source detail: untreated-observation FE model. + assert "untreated" in desc.lower() + assert "Omega_0" in desc or "fixed effect" in desc.lower() + # Must NOT render the pre-R42 generic group-time-ATT template + # that grouped BJS in with CS / SA. + assert ( + "parallel trends across treatment cohorts and time periods (group-time ATT)" not in desc + ), ( + "ImputationDiD identifies via untreated-outcome FE modelling " + "(BJS 2024 Assumption 1), not generic group-time ATT PT. The " + f"assumption description must not use the pre-R42 template. Got: {desc!r}" + ) + + def test_two_stage_did_assumption_uses_untreated_fe_model(self): + """Round-42 P1 regression: Gardner (2022) two-stage DiD shares + BJS's untreated-outcome FE identification (REGISTRY.md explicitly + states "Parallel trends (same as ImputationDiD)" and the point + estimates are algebraically equivalent). Stage 1 fits FE on + untreated observations, Stage 2 residualizes treated observations. + The old generic "group-time ATT" wording dropped the untreated- + subset detail. REGISTRY.md §TwoStageDiD lines 1113-1128. + """ + + class TwoStageDiDResults: + pass + + obj = TwoStageDiDResults() + obj.overall_att = 1.0 + obj.overall_se = 0.1 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 40 + obj.n_control = 60 + obj.survey_metadata = None + obj.event_study_effects = None + obj.inference_method = "analytical" + obj.anticipation = 0 + + br = BusinessReport(obj, auto_diagnostics=False) + assumption = br.to_dict()["assumption"] + assert assumption["parallel_trends_variant"] == "untreated_outcome_fe_model" + desc = assumption["description"] + # Registry-backed: Gardner 2022 attribution. + assert "Gardner" in desc or "2022" in desc + # Load-bearing: Stage 1 operates on untreated observations. + assert "untreated" in desc.lower() + assert "Stage 1" in desc or "stage 1" in desc.lower() + # Must mention the two-stage procedure. + assert "two-stage" in desc.lower() or "Two-Stage" in desc + # Must NOT render the pre-R42 generic group-time-ATT template + # that grouped Gardner in with CS / SA. + assert ( + "parallel trends across treatment cohorts and time periods (group-time ATT)" not in desc + ), ( + "TwoStageDiD identifies via the same untreated-outcome FE " + "model as ImputationDiD (Gardner 2022); the assumption " + f"description must not use the pre-R42 template. Got: {desc!r}" + ) + + +class TestEfficientDiDAssumptionPtAllPtPost: + """Round-8 regression: EfficientDiD has two distinct PT regimes + (PT-All and PT-Post, per Chen-Sant'Anna-Xie 2025 Corollary 3.2 and + Lemma 2.1). The old generic group-time PT text was source-unfaithful; + the assumption block must now read ``results.pt_assumption`` and + branch on it. 
+ """ + + def _stub(self, pt_assumption: str, control_group: str = "never_treated"): + """Build an EfficientDiD-shaped stub. ``control_group`` defaults to + ``"never_treated"`` (the estimator's actual default); the only other + accepted value is ``"last_cohort"`` (pseudo-never-treated). The + earlier ``"not_yet_treated"`` default was invalid for this estimator + and was flagged in round-10 CI review.""" + + class EfficientDiDResults: + pass + + stub = EfficientDiDResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.event_study_effects = None + stub.inference_method = "analytical" + stub.pt_assumption = pt_assumption + stub.control_group = control_group + return stub + + def test_pt_all_uses_pt_all_language(self): + br = BusinessReport(self._stub("all"), auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a["parallel_trends_variant"] == "pt_all" + assert "PT-All" in a["description"] + assert "Hausman" in a["description"] + # Must NOT be the old generic group-time PT text. + assert "group-time ATT" not in a["description"] + + def test_pt_post_uses_pt_post_language(self): + br = BusinessReport(self._stub("post"), auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a["parallel_trends_variant"] == "pt_post" + assert "PT-Post" in a["description"] + assert "Corollary 3.2" in a["description"] or "single-baseline" in a["description"] + + def test_pt_post_never_treated_names_never_treated(self): + """Default control_group: description must say never-treated.""" + br = BusinessReport(self._stub("post", "never_treated"), auto_diagnostics=False) + desc = br.to_dict()["assumption"]["description"] + assert "never-treated" in desc + assert "latest treated cohort" not in desc + + def test_pt_post_last_cohort_branch_describes_pseudo_control(self): + """Round-10 regression: ``control_group='last_cohort'`` must not be + narrated with generic never-treated language. The description must + describe the pseudo-never-treated latest-cohort design (REGISTRY.md + §EfficientDiD line 908).""" + br = BusinessReport(self._stub("post", "last_cohort"), auto_diagnostics=False) + desc = br.to_dict()["assumption"]["description"] + assert "latest treated cohort" in desc + assert "pseudo-never-treated" in desc + assert "dropped" in desc + + def test_pt_all_last_cohort_branch_describes_pseudo_control(self): + br = BusinessReport(self._stub("all", "last_cohort"), auto_diagnostics=False) + desc = br.to_dict()["assumption"]["description"] + assert "latest treated cohort" in desc + assert "pseudo-never-treated" in desc + + def test_control_group_is_reflected_in_block(self): + br = BusinessReport(self._stub("all", "last_cohort"), auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a.get("control_group") == "last_cohort" + + +class TestMethodAwarePTProse: + """Round-8 regression: BR and DR summary prose must branch on the + ``parallel_trends.method`` field. Generic "pre-treatment event-study + coefficients" wording is wrong for the 2x2 ``slope_difference`` path + and for EfficientDiD's ``hausman`` PT-All vs PT-Post pretest. + """ + + def test_br_summary_uses_slope_difference_wording_for_simple_did(self): + """Use a stub DR schema with a known slope_difference verdict so + the test is deterministic across pre-period counts. 
The real + 2x2 fit can produce NaN verdicts when there is only one + pre-period, so we don't rely on a real DR here.""" + + class DiDResults: + pass + + stub = DiDResults() + stub.att = 1.0 + stub.se = 0.2 + stub.p_value = 0.001 + stub.conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.inference_method = "analytical" + + # Hand-crafted DR schema with ``method = "slope_difference"``. + from diff_diff.diagnostic_report import DiagnosticReportResults + + fake_schema = { + "schema_version": "1.0", + "estimator": "DiDResults", + "headline_metric": {"name": "att", "value": 1.0}, + "parallel_trends": { + "status": "ran", + "method": "slope_difference", + "joint_p_value": 0.40, + "verdict": "no_detected_violation", + }, + "pretrends_power": {"status": "not_applicable"}, + "sensitivity": {"status": "not_applicable"}, + "placebo": {"status": "skipped", "reason": "opt-in"}, + "bacon": {"status": "not_applicable"}, + "design_effect": {"status": "not_applicable"}, + "heterogeneity": {"status": "not_applicable"}, + "epv": {"status": "not_applicable"}, + "estimator_native_diagnostics": {"status": "not_applicable"}, + "skipped": {}, + "warnings": [], + "overall_interpretation": "", + "next_steps": [], + } + fake_dr_results = DiagnosticReportResults( + schema=fake_schema, + interpretation="", + applicable_checks=("parallel_trends",), + skipped_checks={}, + warnings=(), + ) + br = BusinessReport(stub, diagnostics=fake_dr_results) + summary = br.summary() + pt_method = br.to_dict()["pre_trends"].get("method") + assert pt_method == "slope_difference" + # Must NOT use the generic event-study wording. + assert "event-study coefficients" not in summary + # Must use the slope-difference subject phrase. + assert "slope-difference" in summary + + def test_dr_summary_uses_hausman_wording_for_efficient_did(self, edid_fit): + from diff_diff import DiagnosticReport + + fit, sdf = edid_fit + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + summary = dr.summary() + pt = dr.to_dict()["parallel_trends"] + # EfficientDiD's PT check routes through hausman_pretest. + assert pt.get("method") == "hausman" + # The generic event-study wording must not appear for this path. + assert "event-study coefficients" not in summary + + +class TestFullReportMethodAwarePTLabel: + """Round-25 P2 CI review on PR #318: ``BusinessReport.full_report()`` + previously hard-coded ``joint p = ...`` in the Pre-Trends section, + which mislabels the 2x2 ``slope_difference`` and EfficientDiD + ``hausman`` single-statistic tests and invents a nonexistent + ``joint p`` label for design-enforced SDiD / TROP paths that have + no p-value at all. The markdown path must use the same + method-aware label helper the summary path already uses + (``_pt_method_stat_label``). + """ + + @staticmethod + def _stub_result_with_method(method: str): + from diff_diff.diagnostic_report import DiagnosticReportResults + + class DiDResults: + pass + + stub = DiDResults() + stub.att = 1.0 + stub.se = 0.2 + stub.p_value = 0.001 + stub.conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.inference_method = "analytical" + + pt_block: dict = { + "status": "ran", + "method": method, + "verdict": "no_detected_violation", + } + # SDiD's synthetic_fit path has no p-value by design; the other + # methods do. 
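+        # A sketch of the two pre-trends block shapes this helper emits,
+        # grounded in the branch below rather than the library's schema docs:
+        #   slope_difference / hausman -> {..., "joint_p_value": 0.40}
+        #   synthetic_fit              -> {...}   (no p-value key at all)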
+ if method != "synthetic_fit": + pt_block["joint_p_value"] = 0.40 + + fake_schema = { + "schema_version": "1.0", + "estimator": "DiDResults", + "headline_metric": {"name": "att", "value": 1.0}, + "parallel_trends": pt_block, + "pretrends_power": {"status": "not_applicable"}, + "sensitivity": {"status": "not_applicable"}, + "placebo": {"status": "skipped", "reason": "opt-in"}, + "bacon": {"status": "not_applicable"}, + "design_effect": {"status": "not_applicable"}, + "heterogeneity": {"status": "not_applicable"}, + "epv": {"status": "not_applicable"}, + "estimator_native_diagnostics": {"status": "not_applicable"}, + "skipped": {}, + "warnings": [], + "overall_interpretation": "", + "next_steps": [], + } + fake_dr = DiagnosticReportResults( + schema=fake_schema, + interpretation="", + applicable_checks=("parallel_trends",), + skipped_checks={}, + warnings=(), + ) + return stub, fake_dr + + def _pt_section(self, md: str) -> str: + # The Pre-Trends section is delimited by the next ``##`` heading. + after = md.split("## Pre-Trends", 1)[1] + return after.split("\n## ", 1)[0] + + def test_full_report_slope_difference_uses_single_p_label(self): + stub, fake_dr = self._stub_result_with_method("slope_difference") + md = BusinessReport(stub, diagnostics=fake_dr).full_report() + section = self._pt_section(md) + assert "joint p" not in section, ( + f"2x2 slope_difference is a single-statistic test and must " + f"not be labeled ``joint p`` in the markdown. Got: {section!r}" + ) + # The single-statistic label ``p = ...`` must be present. + assert "p = 0.4" in section + + def test_full_report_hausman_uses_single_p_label(self): + stub, fake_dr = self._stub_result_with_method("hausman") + section = self._pt_section(BusinessReport(stub, diagnostics=fake_dr).full_report()) + assert "joint p" not in section, ( + f"EfficientDiD Hausman is a single-statistic test and must " + f"not be labeled ``joint p`` in the markdown. Got: {section!r}" + ) + assert "p = 0.4" in section + + def test_full_report_synthetic_fit_omits_p_label(self): + stub, fake_dr = self._stub_result_with_method("synthetic_fit") + section = self._pt_section(BusinessReport(stub, diagnostics=fake_dr).full_report()) + # No p-value of any kind for design-enforced SDiD PT analogue. + assert "joint p" not in section + assert "p = " not in section + # Verdict must still render. + assert "Verdict:" in section + + +class TestHausmanPretestPropagatesFitDesign: + """Round-9 regression: ``_pt_hausman`` must propagate the fitted + result's ``control_group`` and ``anticipation`` into + ``EfficientDiD.hausman_pretest`` so the pretest diagnoses the same + design as the estimate being summarized. Rerunning with defaults + would silently change the identification regime. + """ + + def _real_edid_fit(self): + from diff_diff import EfficientDiD + + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + edid = EfficientDiD().fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + # Force non-default design knobs on the result so the regression + # exercises propagation even when the constructor used defaults. 
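+        # These attribute writes stand in for a fit that was actually
+        # configured with non-default design knobs; the assertions below
+        # require the Hausman pretest to receive exactly these values
+        # instead of silently replaying the defaults.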
+ edid.control_group = "last_cohort" + edid.anticipation = 1 + return edid, sdf + + def test_hausman_pretest_receives_control_group_and_anticipation(self): + from diff_diff import DiagnosticReport + + fit, sdf = self._real_edid_fit() + captured: dict = {} + + def _fake_hausman(*args, **kwargs): + captured.update(kwargs) + + class _Result: + statistic = 0.0 + p_value = 0.5 + df = 1 + + return _Result() + + with patch( + "diff_diff.efficient_did.EfficientDiD.hausman_pretest", + side_effect=_fake_hausman, + ): + DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ).run_all() + + assert ( + captured.get("control_group") == "last_cohort" + ), f"control_group must propagate from the fit; got {captured}" + assert ( + captured.get("anticipation") == 1 + ), f"anticipation must propagate from the fit; got {captured}" + + +class TestHausmanFitFaithfulSkip: + """Round-10 regression: DR / survey-weighted EfficientDiD fits cannot + replay the Hausman pretest from ``(data, outcome, unit, time, + first_treat)`` alone because the result does not expose ``covariates``, + ``cluster``, nuisance kwargs, or the full survey design. DR must skip + with an explicit reason rather than rerunning defaults. + """ + + def _make_fit(self, *, estimation_path="nocov", survey_metadata=None): + from diff_diff import EfficientDiD + + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + edid = EfficientDiD().fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + edid.estimation_path = estimation_path + edid.survey_metadata = survey_metadata + return edid, sdf + + def test_dr_covariate_path_skipped_with_reason(self): + from diff_diff import DiagnosticReport + + fit, sdf = self._make_fit(estimation_path="dr") + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + assert "parallel_trends" not in dr.applicable_checks + reason = dr.skipped_checks.get("parallel_trends", "") + assert "doubly-robust" in reason + + def test_survey_weighted_fit_skipped_with_reason(self): + from types import SimpleNamespace + + from diff_diff import DiagnosticReport + + fake_survey = SimpleNamespace( + weight_type="pweight", + effective_n=80.0, + design_effect=1.25, + sum_weights=100.0, + n_strata=None, + n_psu=None, + df_survey=40, + ) + fit, sdf = self._make_fit(survey_metadata=fake_survey) + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + assert "parallel_trends" not in dr.applicable_checks + reason = dr.skipped_checks.get("parallel_trends", "") + assert "survey design" in reason + + +class TestHausmanPretestPropagatesCluster: + """Round-11 regression: ``EfficientDiDResults`` now persists the + ``cluster`` column used at fit time, and ``_pt_hausman`` forwards + it to ``EfficientDiD.hausman_pretest``. Without this, clustered + fits would be replayed under unclustered inference, silently + publishing an H statistic / p-value for the wrong design. + """ + + def test_hausman_pretest_receives_cluster_kwarg(self): + import pandas as pd + + from diff_diff import DiagnosticReport, EfficientDiD + + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + # Add a cluster column (e.g., region) to the panel. 
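+        # ``unit % 10`` yields ten pseudo-regions, giving a non-trivial
+        # cluster structure without touching treatment assignment; the fit
+        # below must persist this column and forward it to the pretest
+        # (asserted at the end of the test).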
+ sdf = pd.DataFrame(sdf).copy() + sdf["cluster_col"] = sdf["unit"] % 10 + + edid = EfficientDiD(cluster="cluster_col").fit( + sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + # Confirm persistence landed. + assert getattr(edid, "cluster", None) == "cluster_col" + + captured: dict = {} + + def _fake_hausman(*args, **kwargs): + captured.update(kwargs) + + class _Result: + statistic = 0.0 + p_value = 0.5 + df = 1 + + return _Result() + + with patch( + "diff_diff.efficient_did.EfficientDiD.hausman_pretest", + side_effect=_fake_hausman, + ): + DiagnosticReport( + edid, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ).run_all() + + assert ( + captured.get("cluster") == "cluster_col" + ), f"cluster column must propagate from fit to Hausman pretest; got {captured}" + + +class TestAnticipationPersistsOnRealResults: + """Round-19 P1 regression: ``CallawaySantAnnaResults``, + ``SunAbrahamResults``, and ``StaggeredTripleDiffResults`` must + persist the ``anticipation`` field so the anticipation-aware + reporting code (round-15/17) actually fires on real fits. Stub- + only regressions had hidden that the result constructors were + dropping the value. + """ + + def test_cs_fit_persists_anticipation(self): + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + cs = CallawaySantAnna(base_period="universal", anticipation=1).fit( + sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + aggregate="event_study", + ) + assert getattr(cs, "anticipation", None) == 1 + br = BusinessReport(cs, auto_diagnostics=False) + a = br.to_dict()["assumption"] + # Round-17 assumption-aware block now fires on a real fit. + assert a["no_anticipation"] is False + assert a["anticipation_periods"] == 1 + assert "not strict no-anticipation" in a["description"] + + def test_sun_abraham_fit_persists_anticipation(self): + from diff_diff import SunAbraham + + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + sa = SunAbraham(anticipation=1).fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + assert getattr(sa, "anticipation", None) == 1 + br = BusinessReport(sa, auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a["no_anticipation"] is False + assert a["anticipation_periods"] == 1 + + def test_imputation_fit_persists_anticipation(self): + from diff_diff import ImputationDiD + + sdf = generate_staggered_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + im = ImputationDiD(anticipation=1).fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + assert getattr(im, "anticipation", None) == 1 + br = BusinessReport(im, auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a["no_anticipation"] is False + assert a["anticipation_periods"] == 1 + assert "not strict no-anticipation" in a["description"] + + def test_two_stage_fit_persists_anticipation(self): + from diff_diff import TwoStageDiD + + sdf = generate_staggered_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + ts = TwoStageDiD(anticipation=2).fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + assert getattr(ts, "anticipation", None) == 2 + br = BusinessReport(ts, auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a["no_anticipation"] is False + assert a["anticipation_periods"] == 2 + assert "2 periods" in 
a["description"] + + def test_stacked_fit_persists_anticipation(self): + from diff_diff import StackedDiD + + sdf = generate_staggered_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + st = StackedDiD(anticipation=1).fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + assert getattr(st, "anticipation", None) == 1 + br = BusinessReport(st, auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a["no_anticipation"] is False + assert a["anticipation_periods"] == 1 + + +class TestInconclusivePTProvenancePreservedOnBRSchema: + """Round-39 P3 CI review on PR #318: DR's ``_pt_event_study`` emits + ``n_dropped_undefined`` and a detailed ``reason`` on the + inconclusive PT block (undefined pre-period inference — NaN + per-period p-value or zero / negative SE). BR's ``_lift_pre_trends`` + was dropping both fields at the lift boundary, so the BR schema + and BR's summary renderer lost the provenance DR had already + computed. Preserve both so BR consumers see the exact count of + undefined rows and the same reason without re-consulting the DR + schema. + """ + + def test_n_dropped_undefined_and_reason_land_on_br_pre_trends(self): + class StackedDiDResults: + pass + + obj = StackedDiDResults() + obj.overall_att = 1.0 + obj.overall_se = 0.2 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 400 + obj.n_treated_units = 100 + obj.n_control_units = 300 + obj.survey_metadata = None + obj.event_study_effects = { + -2: {"effect": 0.1, "se": 0.2, "p_value": 0.62, "n_obs": 400}, + -1: {"effect": 0.05, "se": 0.3, "p_value": float("nan"), "n_obs": 400}, + } + + br = BusinessReport(obj) + pt = br.to_dict()["pre_trends"] + # Status and verdict reflect the inconclusive outcome. + assert pt["verdict"] == "inconclusive" + # The provenance fields are present on the BR schema. + assert pt["n_dropped_undefined"] == 1 + assert isinstance(pt.get("reason"), str) and pt["reason"] + # And the summary renderer quotes the count (the existing + # inconclusive branch in ``_render_summary`` reads + # ``pt.get("n_dropped_undefined")``; before this fix that lookup + # returned ``None`` because the lift had dropped it). + summary = br.summary() + assert "1 pre-period row had undefined inference" in summary + + +class TestStaggeredTripleDiffNeverTreatedFixedComparison: + """Round-37 P1 CI review on PR #318: ``StaggeredTripleDiffResults`` + stores ``n_control_units`` as a composite total that also includes + the eligibility-denied cohorts. The valid fixed comparison under + ``control_group="never_treated"`` is the never-enabled cohort + (``staggered_triple_diff.py:384``, REGISTRY.md §StaggeredTripleDifference + line 1730). BR was previously narrating the composite total as + "control" on the ``nevertreated`` mode; the fix surfaces + ``n_never_enabled`` as the fixed comparison count on that path + too (the dynamic ``notyettreated`` path was already correct). 
+ """ + + @staticmethod + def _stub(control_group: str): + class StaggeredTripleDiffResults: + pass + + stub = StaggeredTripleDiffResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 800 + stub.n_treated = 100 + stub.n_control_units = 500 # composite total + stub.n_never_enabled = 300 # fixed never-enabled subset + stub.event_study_effects = None + stub.survey_metadata = None + stub.control_group = control_group + return stub + + def test_never_treated_mode_surfaces_never_enabled_not_composite_total(self): + sample = BusinessReport(self._stub("never_treated"), auto_diagnostics=False).to_dict()[ + "sample" + ] + # Composite total must not be surfaced as the fixed control + # count on the ``nevertreated`` path. + assert sample["n_control"] is None, ( + f"n_control must not carry the composite n_control_units " + f"total on StaggeredTripleDiff(control_group='never_treated'); " + f"got sample={sample!r}" + ) + assert sample["n_never_enabled"] == 300 + + def test_never_treated_mode_summary_renders_never_enabled_count(self): + """Round-38 P3 strengthened regression: the summary must + POSITIVELY surface the valid fixed comparison cohort + (``300 never-enabled``), not merely avoid the wrong + ``500 control`` phrasing. + """ + import re + + summary = BusinessReport(self._stub("never_treated"), auto_diagnostics=False).summary() + # Old wrong phrasing absent. + assert not re.search(r"\b500\s+control", summary), summary + # New fixed cohort present. + assert "300 never-enabled" in summary, ( + f"BR summary must render the valid fixed never-enabled " + f"comparison cohort on StaggeredTripleDiff(control_group=" + f"'never_treated'); got: {summary!r}" + ) + # And the generic no-comparison fallback must not fire. + assert "Sample: 800 observations." not in summary + + def test_never_treated_mode_full_report_renders_never_enabled_count(self): + md = BusinessReport(self._stub("never_treated"), auto_diagnostics=False).full_report() + sample_section = md.split("## Sample", 1)[1].split("\n## ", 1)[0] + assert "never-enabled" in sample_section.lower() + assert "300" in sample_section + # No bare "- Control: 500" line (composite total) should appear + # on this path. + assert "- Control: 500" not in sample_section + + +class TestBRHeadlineOmitsBrokenCIOnUndefinedInference: + """Round-37 P1 CI review on PR #318: ``_extract_headline`` preserves + the fit's native CI even when it is undefined (e.g., survey-df + collapse produces finite ATT but NaN CI endpoints). The renderer + previously gated on ``isinstance(lo, (int, float))``, which accepts + ``NaN`` (a float) and rendered ``95% CI: undefined to undefined``. + Gate on ``np.isfinite`` instead, and emit an explicit + "inference unavailable" trailer when at least one bound is + non-finite. DR's own headline renderer already handled this + correctly (round-36 fix). + """ + + @staticmethod + def _stub_nan_ci(): + class DiDResults: + pass + + stub = DiDResults() + stub.att = 1.0 + stub.se = float("nan") + stub.t_stat = float("nan") + stub.p_value = float("nan") + stub.conf_int = (float("nan"), float("nan")) + stub.alpha = 0.05 + stub.n_obs = 200 + stub.n_treated = 100 + stub.n_control = 100 + stub.survey_metadata = None + return stub + + def test_summary_does_not_render_undefined_ci_interval(self): + summary = BusinessReport(self._stub_nan_ci(), auto_diagnostics=False).summary() + lower = summary.lower() + # Must not render the broken CI interval fragment. 
+ assert "undefined to undefined" not in lower, summary + assert "95% ci: nan" not in lower + # Must explicitly flag that inference is unavailable. + assert "inference unavailable" in lower + + def test_full_report_does_not_render_undefined_ci_interval(self): + md = BusinessReport(self._stub_nan_ci(), auto_diagnostics=False).full_report() + lower = md.lower() + assert "undefined to undefined" not in lower + assert "95% ci: nan" not in lower + assert "inference unavailable" in lower + + +class TestStackedCleanControlSurfacesInSampleBlock: + """Pre-emptive audit regression: ``StackedDiD`` exposes its control- + group choice as ``clean_control`` (the public Wing-Freedman- + Hollingsworth-2024 kwarg name), not ``control_group``. The BR sample + block must normalize the key so downstream agents see a consistent + ``control_group`` field across estimators. + + ``n_control_units`` on ``StackedDiDResults`` is documented as + "distinct control units across the trimmed set" (stacked_did_results + L59-62). Under ``clean_control="not_yet_treated"`` the trimmed set + admits future-treated controls by construction, so the count is + NOT a never-treated tally and must not be relabeled as + ``n_never_treated`` — round-21 P1 CI review on PR #318 flagged the + prior relabeling as a semantic-contract violation because it can + fabricate never-treated support that does not exist (e.g., in an + all-eventually-treated panel). + """ + + def test_stacked_not_yet_treated_surfaces_as_dynamic_without_never_treated_relabel(self): + """``clean_control='not_yet_treated'`` is a dynamic, sub- + experiment-specific comparison set (``A_s > a + kappa_post``); + ``n_control`` is cleared (not a fixed tally), ``n_never_treated`` + is NOT relabeled, and the distinct-controls tally is surfaced + under the dedicated ``n_distinct_controls_trimmed`` key. + """ + from diff_diff import StackedDiD + + sdf = generate_staggered_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + st = StackedDiD(clean_control="not_yet_treated").fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + assert getattr(st, "clean_control", None) == "not_yet_treated" + sample = BusinessReport(st, auto_diagnostics=False).to_dict()["sample"] + assert sample["control_group"] == "not_yet_treated" + assert sample["dynamic_control"] is True + assert sample["n_never_treated"] is None, ( + "StackedDiDResults.n_control_units is the distinct-control-" + "units tally of the trimmed set (includes future-treated " + "controls); it must not be surfaced as n_never_treated." + ) + # Round-22 correction: ``n_control`` must be cleared under + # dynamic modes so the report does not narrate a fixed control + # tally. The underlying count is surfaced under the dedicated + # Stacked key. + assert sample["n_control"] is None + assert sample["n_distinct_controls_trimmed"] == int(st.n_control_units) + + def test_stacked_strict_clean_control_surfaces_as_dynamic(self): + """``clean_control='strict'`` (``A_s > a + kappa_post + kappa_pre``) + is also a sub-experiment-specific rule — stricter than + ``not_yet_treated`` but still NOT a fixed never-treated pool + (round-22 P1 CI review on PR #318). 
+ """ + from diff_diff import StackedDiD + + sdf = generate_staggered_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + st = StackedDiD(clean_control="strict").fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + sample = BusinessReport(st, auto_diagnostics=False).to_dict()["sample"] + assert sample["control_group"] == "strict" + assert sample["dynamic_control"] is True, ( + "clean_control='strict' is sub-experiment-specific (rule " + "A_s > a + kappa_post + kappa_pre) and must be marked dynamic " + "so the report does not claim a fixed never-treated control " + "pool." + ) + assert sample["n_control"] is None + assert sample["n_never_treated"] is None + + def test_stacked_never_treated_surfaces_as_fixed_control(self): + from diff_diff import StackedDiD + + sdf = generate_staggered_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + st = StackedDiD(clean_control="never_treated").fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + sample = BusinessReport(st, auto_diagnostics=False).to_dict()["sample"] + assert sample["control_group"] == "never_treated" + assert sample["dynamic_control"] is False + + def test_stacked_all_eventually_treated_panel_does_not_fabricate_never_treated(self): + """All-eventually-treated stacked panel with + ``clean_control="not_yet_treated"`` must not claim any + never-treated units, because every unit is eventually treated + (the round-21 reviewer example). + """ + + from diff_diff import StackedDiD + + # Every unit is eventually treated (no never-treated). + # Multiple cohorts so Stacked has something to stack against. + sdf = generate_staggered_data( + n_units=80, + n_periods=10, + never_treated_frac=0.0, + treatment_effect=1.5, + seed=7, + ) + # Sanity: the fixture has no never-treated units. + assert sdf[sdf["first_treat"] == 0].empty + + st = StackedDiD(clean_control="not_yet_treated", kappa_pre=1, kappa_post=1).fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + sample = BusinessReport(st, auto_diagnostics=False).to_dict()["sample"] + assert sample["n_never_treated"] is None, ( + "All-eventually-treated panel under clean_control='not_yet_treated' " + "must not surface any never-treated count; the trimmed stack " + "contains only future-treated controls." + ) + + +class TestStackedDiDAssumptionBlock: + """Round-22 P1 regression: ``StackedDiDResults`` must get a + dedicated assumption description reflecting Wing-Freedman- + Hollingsworth (2024) identification — sub-experiment common trends + plus IC1 (event window fits) and IC2 (clean controls exist) — not + the generic "group-time ATT" clause used for CS / SA / etc. The + active ``clean_control`` rule must be named in the description. 
+ """ + + @staticmethod + def _stub(clean_control: str): + class StackedDiDResults: + pass + + stub = StackedDiDResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 400 + stub.n_treated = 50 + stub.n_control_units = 300 + stub.survey_metadata = None + stub.event_study_effects = None + stub.clean_control = clean_control + return stub + + def test_not_yet_treated_names_subexperiment_contract(self): + br = BusinessReport(self._stub("not_yet_treated"), auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a["parallel_trends_variant"] == "stacked_sub_experiment" + desc = a["description"] + assert "Wing, Freedman & Hollingsworth 2024" in desc + assert "sub-experiment" in desc + assert "IC1" in desc and "IC2" in desc + assert "A_s > a + kappa_post" in desc + assert "not_yet_treated" not in desc or "``A_s > a + kappa_post``" in desc + # The active clean_control is carried on the block explicitly for + # consumers that want structured access. + assert a["clean_control"] == "not_yet_treated" + + def test_strict_names_strict_rule(self): + desc = BusinessReport(self._stub("strict"), auto_diagnostics=False).to_dict()["assumption"][ + "description" + ] + assert "A_s > a + kappa_post + kappa_pre" in desc + + def test_never_treated_names_fixed_pool(self): + desc = BusinessReport(self._stub("never_treated"), auto_diagnostics=False).to_dict()[ + "assumption" + ]["description"] + assert "never treated" in desc.lower() + assert "A_s = infinity" in desc + + +class TestStackedRenderingNarratesDynamicControl: + """Round-22 P1 regression: BR ``summary()`` / ``full_report()`` must + narrate Stacked dynamic clean-control designs as sub-experiment- + specific comparisons, not as fixed "N treated / M control" samples. + Previously the ``n_control`` branch fired first and misrendered both + ``clean_control='not_yet_treated'`` and ``'strict'``. + """ + + def test_summary_does_not_narrate_stacked_dynamic_as_fixed_control(self): + from diff_diff import StackedDiD + + sdf = generate_staggered_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + st = StackedDiD(clean_control="not_yet_treated").fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + summary = BusinessReport(st, auto_diagnostics=False).summary() + # Must NOT render a "X treated, Y control" clause (that narration + # implies a fixed comparison pool). + import re + + assert not re.search(r"\d[\d,]*\s+treated,\s+\d[\d,]*\s+control", summary), ( + f"Stacked with dynamic clean-control must not be narrated " + f"as fixed treated/control counts. Got: {summary!r}" + ) + # Must narrate the sub-experiment-specific clean-control contract. + assert "sub-experiment-specific clean-control" in summary + assert "clean_control='not_yet_treated'" in summary + + def test_full_report_names_sub_experiment_comparison_for_stacked_strict(self): + from diff_diff import StackedDiD + + sdf = generate_staggered_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + st = StackedDiD(clean_control="strict").fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + md = BusinessReport(st, auto_diagnostics=False).full_report() + # Must NOT emit a bare "Control: N" line. + assert ( + "- Control:" not in md or "- Control: " not in md.split("## Sample")[1].split("##")[0] + ), ( + "Stacked with dynamic clean-control must not render a fixed " + "'- Control: N' line in the Sample section." 
+ ) + assert "sub-experiment-specific clean controls" in md + assert "clean_control='strict'" in md + + +class TestDCDHPhase3AssumptionClause: + """Pre-emptive audit regression: ``ChaisemartinDHaultfoeuilleResults`` + populates ``covariate_residuals`` when ``controls`` is set in fit, + ``linear_trends_effects`` when ``trends_linear=True``, and + ``heterogeneity_effects`` when ``heterogeneity`` is set. Each change + modifies the identifying contract and the estimand label + (``DID^X_l`` / ``DID^{fd}_l`` / ``DID^{X,fd}_l``). The BR assumption + description must surface the active configuration so the prose does + not misrepresent the identifying assumption on a Phase-3 fit. + """ + + def test_dcdh_base_case_has_no_phase3_clause(self): + from diff_diff.business_report import _describe_assumption + + class Stub: + covariate_residuals = None + linear_trends_effects = None + heterogeneity_effects = None + + block = _describe_assumption("ChaisemartinDHaultfoeuilleResults", Stub()) + assert "Phase-3 configuration" not in block["description"] + + def test_dcdh_controls_only_surfaces_did_x(self): + import pandas as pd + + from diff_diff.business_report import _describe_assumption + + class Stub: + covariate_residuals = pd.DataFrame({"theta_hat": [0.1]}) + linear_trends_effects = None + heterogeneity_effects = None + + desc = _describe_assumption("ChaisemartinDHaultfoeuilleResults", Stub())["description"] + assert "Phase-3 configuration" in desc + assert "DID^X_l" in desc + assert "first-stage residualization" in desc + assert "DID^{fd}_l" not in desc + + def test_dcdh_trends_linear_only_surfaces_did_fd(self): + from diff_diff.business_report import _describe_assumption + + class Stub: + covariate_residuals = None + linear_trends_effects = {1: {"effect": 0.1}} + heterogeneity_effects = None + + desc = _describe_assumption("ChaisemartinDHaultfoeuilleResults", Stub())["description"] + assert "Phase-3 configuration" in desc + assert "DID^{fd}_l" in desc + assert "group-specific linear pre-trends" in desc + + def test_dcdh_controls_and_trends_surfaces_combined_estimand(self): + import pandas as pd + + from diff_diff.business_report import _describe_assumption + + class Stub: + covariate_residuals = pd.DataFrame({"theta_hat": [0.1]}) + linear_trends_effects = {1: {"effect": 0.1}} + heterogeneity_effects = {1: {}} + + desc = _describe_assumption("ChaisemartinDHaultfoeuilleResults", Stub())["description"] + assert "DID^{X,fd}_l" in desc + assert "heterogeneity tests" in desc + assert "beta^{het}_l" in desc + + +class TestAnticipationStripsStrictNoAnticipationClause: + """Round-30 P1 CI review on PR #318: ``_apply_anticipation_to_assumption`` + previously only appended an anticipation clause. Several base + descriptions already say "plus no anticipation" or "Also assumes + no anticipation", so an anticipation-enabled fit would render + self-contradictory prose: the strict clause AND the relaxed one in + the same paragraph. The helper now strips the strict phrasing + before appending. These regressions cover every anticipation- + capable estimator base description that previously carried such + wording. 
+ """ + + _STRICT_PATTERNS = ( + "plus no anticipation", + "Also assumes no anticipation", + ) + + @staticmethod + def _stub(class_name: str, **extras): + stub_cls = type(class_name, (), {}) + stub = stub_cls() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 400 + stub.n_treated = 100 + stub.n_control = 300 + stub.survey_metadata = None + stub.event_study_effects = None + stub.anticipation = 2 + for k, v in extras.items(): + setattr(stub, k, v) + return stub + + def _assert_no_strict_contract(self, description: str): + assert isinstance(description, str) and description + for pat in self._STRICT_PATTERNS: + assert pat not in description, ( + f"Anticipation-enabled fit description must not carry " + f"the strict phrase {pat!r}. Got: {description!r}" + ) + # Must still say anticipation is allowed (relaxed contract). + assert "Anticipation is allowed" in description + assert "not strict no-anticipation" in description + + def test_generic_group_time_strips_strict_clause(self): + # Generic CS/SA/Imputation/TwoStage/Wooldridge branch. + stub = self._stub("CallawaySantAnnaResults") + block = BusinessReport(stub, auto_diagnostics=False).to_dict()["assumption"] + assert block["no_anticipation"] is False + assert block["anticipation_periods"] == 2 + self._assert_no_strict_contract(block["description"]) + + def test_efficient_did_pt_all_strips_strict_clause(self): + stub = self._stub("EfficientDiDResults", pt_assumption="all") + block = BusinessReport(stub, auto_diagnostics=False).to_dict()["assumption"] + self._assert_no_strict_contract(block["description"]) + # PT-All identifying content should still be present. + assert "PT-All" in block["description"] + + def test_efficient_did_pt_post_strips_strict_clause(self): + stub = self._stub("EfficientDiDResults", pt_assumption="post") + block = BusinessReport(stub, auto_diagnostics=False).to_dict()["assumption"] + self._assert_no_strict_contract(block["description"]) + assert "PT-Post" in block["description"] + + def test_stacked_did_strips_strict_clause(self): + stub = self._stub("StackedDiDResults", clean_control="not_yet_treated") + block = BusinessReport(stub, auto_diagnostics=False).to_dict()["assumption"] + self._assert_no_strict_contract(block["description"]) + # Stacked sub-experiment identifying content preserved. + assert "IC1" in block["description"] and "IC2" in block["description"] + + def test_rendered_full_report_has_no_strict_contract_for_anticipation(self): + """Integration: the rendered markdown's Identifying Assumption + section must also be free of the strict phrase on an + anticipation-enabled fit. + """ + stub = self._stub("CallawaySantAnnaResults") + md = BusinessReport(stub, auto_diagnostics=False).full_report() + assumption_section = md.split("## Identifying Assumption", 1)[1].split("\n## ", 1)[0] + for pat in self._STRICT_PATTERNS: + assert pat not in assumption_section, ( + f"Rendered assumption section must not carry the strict " + f"phrase {pat!r} under anticipation > 0. Got: " + f"{assumption_section!r}" + ) + assert "Anticipation is allowed" in assumption_section + + +class TestAnticipationAwareAssumptionBlock: + """Round-17 P1 regression: ``_describe_assumption`` must drop the + strict "plus no anticipation" language when the fit allows + ``anticipation > 0``. 
REGISTRY.md §CallawaySantAnna lines 355-395 + (and the matching SA / MultiPeriod / Wooldridge / EfficientDiD + sections) treat anticipation as a relaxation of the strict no- + anticipation assumption: no treatment effects earlier than ``k`` + periods before treatment, not none at all. + """ + + def test_cs_with_anticipation_sets_no_anticipation_false(self): + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.event_study_effects = None + stub.anticipation = 2 + + br = BusinessReport(stub, auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert ( + a["no_anticipation"] is False + ), f"anticipation=2 must flip no_anticipation off; got {a}" + assert a["anticipation_periods"] == 2 + assert "2 periods" in a["description"] + assert "not strict no-anticipation" in a["description"] + + def test_efficient_did_with_anticipation_flips_no_anticipation_off(self): + class EfficientDiDResults: + pass + + stub = EfficientDiDResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.event_study_effects = None + stub.pt_assumption = "all" + stub.control_group = "never_treated" + stub.anticipation = 1 + + br = BusinessReport(stub, auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a["no_anticipation"] is False + assert a["anticipation_periods"] == 1 + assert "1 period" in a["description"] + + def test_anticipation_zero_preserves_strict_no_anticipation(self): + """Default (``anticipation=0``) keeps the strict text.""" + + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.event_study_effects = None + stub.anticipation = 0 + + br = BusinessReport(stub, auto_diagnostics=False) + a = br.to_dict()["assumption"] + assert a["no_anticipation"] is True + assert "anticipation_periods" not in a + assert "not strict no-anticipation" not in a["description"] + + +class TestContinuousDiDDynamicControlSample: + """Round-18 P1 regression: ContinuousDiD with + ``control_group="not_yet_treated"`` must take the dynamic-control + path in ``to_dict()``, ``summary()``, and ``full_report()``. The + stored ``n_control_units`` is only the fully-untreated ``D=0`` + tally; the actual comparison set includes future-treated cohorts + beyond the anticipation window. + """ + + def test_continuous_did_not_yet_treated_surfaces_dynamic_mode(self): + class ContinuousDiDResults: + pass + + stub = ContinuousDiDResults() + stub.overall_att = 1.0 + stub.overall_att_se = 0.2 + stub.overall_att_p_value = 0.001 + stub.overall_att_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 120 + stub.n_treated = 50 + stub.n_control = 70 # D=0 (never-treated) count only. 
+ stub.survey_metadata = None + stub.control_group = "not_yet_treated" + + br = BusinessReport(stub, auto_diagnostics=False) + sample = br.to_dict()["sample"] + assert sample["n_control"] is None + assert sample["n_never_treated"] == 70 + assert sample["dynamic_control"] is True + + summary = br.summary() + assert " control)" not in summary + assert "dynamic not-yet-treated" in summary + + full = br.full_report() + assert "- Control: 70" not in full + assert "dynamic not-yet-treated" in full + + +class TestStaggeredTripleDiffDynamicControlSample: + """Round-18 P1 regression: StaggeredTripleDifference with + ``control_group="notyettreated"`` (no underscore per the estimator + contract) must also take the dynamic-control path. Its fixed + subset is ``n_never_enabled`` (separate field) rather than a + never-treated count. + """ + + def test_notyettreated_surfaces_n_never_enabled(self): + class StaggeredTripleDiffResults: + pass + + stub = StaggeredTripleDiffResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 200 + stub.n_treated = 80 + stub.n_control = 120 # Composite total (ignored in this mode). + stub.n_never_enabled = 30 # Fixed subset exposed in this mode. + stub.survey_metadata = None + stub.event_study_effects = None + stub.inference_method = "analytical" + stub.control_group = "notyettreated" # No underscore. + + br = BusinessReport(stub, auto_diagnostics=False) + sample = br.to_dict()["sample"] + assert sample["n_control"] is None + assert sample["dynamic_control"] is True + assert sample["n_never_enabled"] == 30 + assert sample["n_never_treated"] is None, ( + "StaggeredTripleDiff must expose n_never_enabled, not " "n_never_treated" + ) + + summary = br.summary() + assert " control)" not in summary + assert "dynamic not-yet-treated" in summary + assert "30 never-enabled" in summary + + full = br.full_report() + assert "- Control:" not in full + assert "Never-enabled units present in the panel: 30" in full + + +class TestWooldridgeSampleNotYetTreatedSemantics: + """Round-17 P1 regression: Wooldridge's ``n_control_units`` is the + total eligible comparison set (never-treated plus future-treated + units that contribute valid not-yet-treated comparisons). BR must + NOT reinterpret that count as ``n_never_treated`` for Wooldridge, + which would overstate never-treated availability. CS / SA / + ImputationDiD / etc. retain the existing reinterpretation because + their contracts define ``n_control_units`` as never-treated only. + """ + + def test_wooldridge_not_yet_treated_keeps_fixed_n_control(self): + class WooldridgeDiDResults: + pass + + stub = WooldridgeDiDResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 # Total eligible, NOT never-treated only. + stub.survey_metadata = None + stub.control_group = "not_yet_treated" + + br = BusinessReport(stub, auto_diagnostics=False) + sample = br.to_dict()["sample"] + assert sample["n_control"] == 60, ( + "Wooldridge n_control_units is total eligible controls; " + "must not be hidden behind not_yet_treated reinterpretation" + ) + assert sample["n_never_treated"] is None + + def test_cs_not_yet_treated_still_reinterprets(self): + """CS retains the existing behavior: the fixed ``n_control`` is + suppressed and ``n_never_treated`` surfaces the never-treated + count. 
Regression from round 13.""" + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + cs = CallawaySantAnna(base_period="universal", control_group="not_yet_treated").fit( + sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + aggregate="event_study", + ) + br = BusinessReport(cs, auto_diagnostics=False) + sample = br.to_dict()["sample"] + assert sample["n_control"] is None + assert sample["n_never_treated"] == getattr(cs, "n_control_units", None) + + +class TestWooldridgeResultsRouting: + """Round-16 P1 regression: the collectors must accept + ``WooldridgeDiDResults`` payloads, which use ``att`` (not + ``effect``). Without this, PT and heterogeneity silently skip on + Wooldridge fits. Also, Wooldridge aggregation keeps ``t >= g`` and + ignores the ``anticipation`` shift used by CS / SA / EfficientDiD + (REGISTRY.md §Wooldridge lines 1351-1352). + """ + + def _wooldridge_stub(self, *, anticipation: int = 0): + class WooldridgeDiDResults: + pass + + stub = WooldridgeDiDResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.anticipation = anticipation + # Event study: Wooldridge payloads use ``att`` not ``effect``. + stub.event_study_effects = { + -2: {"att": -0.05, "se": 0.1, "p_value": 0.62}, + -1: {"att": 0.04, "se": 0.1, "p_value": 0.69}, + 0: {"att": 1.00, "se": 0.1, "p_value": 0.001}, + 1: {"att": 1.20, "se": 0.1, "p_value": 0.001}, + 2: {"att": 1.40, "se": 0.1, "p_value": 0.001}, + } + return stub + + def test_pre_period_collector_reads_att_payload(self): + from diff_diff.diagnostic_report import _collect_pre_period_coefs + + stub = self._wooldridge_stub() + pre, _ = _collect_pre_period_coefs(stub) + keys = sorted(row[0] for row in pre) + assert keys == [ + -2, + -1, + ], f"pre-period collector must read Wooldridge ``att`` payloads; got {keys}" + effects = {row[0]: row[1] for row in pre} + assert effects[-2] == pytest.approx(-0.05) + assert effects[-1] == pytest.approx(0.04) + + def test_heterogeneity_reads_att_payload(self): + from diff_diff import DiagnosticReport + + stub = self._wooldridge_stub() + dr = DiagnosticReport(stub) + effects = sorted(dr._collect_effect_scalars()) + # Event-study post-only: rel >= 0 → {1.00, 1.20, 1.40}. + assert effects == pytest.approx([1.00, 1.20, 1.40]) + + def test_wooldridge_ignores_anticipation_shift_on_pre_periods(self): + from diff_diff.diagnostic_report import _collect_pre_period_coefs + + stub = self._wooldridge_stub(anticipation=1) + pre, _ = _collect_pre_period_coefs(stub) + keys = sorted(row[0] for row in pre) + # Wooldridge keeps rel < 0 regardless of anticipation. + assert keys == [-2, -1] + + def test_wooldridge_ignores_anticipation_shift_on_heterogeneity(self): + from diff_diff import DiagnosticReport + + stub = self._wooldridge_stub(anticipation=1) + dr = DiagnosticReport(stub) + effects = sorted(dr._collect_effect_scalars()) + # Anticipation window (rel=-1) must not leak into the post set + # for Wooldridge even with anticipation=1. 
+ assert effects == pytest.approx([1.00, 1.20, 1.40]) + + +class TestAnticipationAwareHorizonClassification: + """Round-15 P1 regression: on anticipation-aware fits (CS / SA / + EfficientDiD with ``anticipation > 0``), the report layer must + classify horizons using the shifted boundary: + + - True pre-periods (PT + pre-trends power): ``rel < -anticipation``. + - Treatment-affected horizons (heterogeneity dispersion): + ``rel >= -anticipation`` (anticipation window is post-announcement). + + Prior code hard-coded ``rel < 0`` / ``rel >= 0`` and could include + anticipation-window coefficients as "pre" in PT / power while + excluding them as "post" in heterogeneity. REGISTRY.md + §CallawaySantAnna lines 355-395 documents the shifted-boundary rule. + """ + + def _cs_stub_with_anticipation(self, *, anticipation: int = 1): + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.base_period = "universal" + stub.anticipation = anticipation + stub.event_study_effects = { + -3: {"effect": -0.05, "se": 0.1, "p_value": 0.62, "n_groups": 15}, + -2: {"effect": 0.04, "se": 0.1, "p_value": 0.69, "n_groups": 15}, + -1: {"effect": 0.80, "se": 0.1, "p_value": 0.01, "n_groups": 15}, + 0: {"effect": 1.00, "se": 0.1, "p_value": 0.001, "n_groups": 15}, + 1: {"effect": 1.20, "se": 0.1, "p_value": 0.001, "n_groups": 12}, + 2: {"effect": 1.40, "se": 0.1, "p_value": 0.001, "n_groups": 10}, + } + return stub + + def test_pre_period_collector_excludes_anticipation_window(self): + from diff_diff.diagnostic_report import _collect_pre_period_coefs + + stub = self._cs_stub_with_anticipation(anticipation=1) + pre, _ = _collect_pre_period_coefs(stub) + keys = sorted(row[0] for row in pre) + # Anticipation window (rel=-1) must be excluded; only -3, -2 remain. + assert keys == [-3, -2], ( + f"pre-period collector must exclude the anticipation " f"window; got {keys}" + ) + + def test_heterogeneity_includes_anticipation_window(self): + from diff_diff import DiagnosticReport + + stub = self._cs_stub_with_anticipation(anticipation=1) + dr = DiagnosticReport(stub) + effects = sorted(dr._collect_effect_scalars()) + # rel ∈ {-1, 0, 1, 2} → {0.80, 1.00, 1.20, 1.40}. + assert effects == pytest.approx([0.80, 1.00, 1.20, 1.40]) + + def test_anticipation_zero_preserves_old_behavior(self): + from diff_diff import DiagnosticReport + from diff_diff.diagnostic_report import _collect_pre_period_coefs + + stub = self._cs_stub_with_anticipation(anticipation=0) + pre, _ = _collect_pre_period_coefs(stub) + assert sorted(row[0] for row in pre) == [-3, -2, -1] + + dr = DiagnosticReport(stub) + effects = sorted(dr._collect_effect_scalars()) + # Only non-negative horizons: 1.00, 1.20, 1.40. + assert effects == pytest.approx([1.00, 1.20, 1.40]) + + +class TestDiagFallbackDowngradeAppliedCentrally: + """Round-14 regression: when ``compute_pretrends_power`` fell back to + a diagonal-SE approximation while the full ``event_study_vcov`` was + available, the ``well_powered`` tier must be downgraded to + ``moderately_powered`` on **every** report surface (BR summary, BR + full_report, BR schema, DR summary), not just inside one of them. + Centralize the downgrade in ``_check_pretrends_power`` so every + consumer reads the same adjusted tier. REPORTING.md lines 126-139. 
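+    The first test below pushes a hand-crafted, already-downgraded DR
+    schema through BR (``mdv_share_of_att`` held under the 0.25
+    well-powered cutoff, so only the cov-source flag can move the
+    tier); the second exercises the downgrade end-to-end on a real
+    CS fit.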
+ """ + + def test_br_schema_tier_is_downgraded(self): + """Smoke-check that the centralized downgrade lands in the DR + schema when ``covariance_source`` is the flagged fallback value.""" + # Build a hand-crafted DR schema exactly as the centralized + # downgrade would emit it — mdv ratio < 0.25 (so the pre- + # downgrade tier is ``well_powered``), cov_source is the + # diag-fallback-with-full-vcov-available sentinel. + from diff_diff.diagnostic_report import DiagnosticReportResults + + schema = { + "schema_version": "1.0", + "estimator": "CallawaySantAnnaResults", + "headline_metric": {"name": "overall_att", "value": 1.0}, + "parallel_trends": { + "status": "ran", + "method": "joint_wald_event_study", + "joint_p_value": 0.40, + "verdict": "no_detected_violation", + }, + "pretrends_power": { + "status": "ran", + "method": "compute_pretrends_power", + "mdv": 0.10, + "mdv_share_of_att": 0.10, + # Central downgrade: tier already reflects the cov-source. + "tier": "moderately_powered", + "covariance_source": "diag_fallback_available_full_vcov_unused", + }, + "sensitivity": {"status": "not_applicable"}, + "placebo": {"status": "skipped", "reason": "opt-in"}, + "bacon": {"status": "not_applicable"}, + "design_effect": {"status": "not_applicable"}, + "heterogeneity": {"status": "not_applicable"}, + "epv": {"status": "not_applicable"}, + "estimator_native_diagnostics": {"status": "not_applicable"}, + "skipped": {}, + "warnings": [], + "overall_interpretation": "", + "next_steps": [], + } + + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + + dr_results = DiagnosticReportResults( + schema=schema, + interpretation="", + applicable_checks=("parallel_trends", "pretrends_power"), + skipped_checks={}, + warnings=(), + ) + br = BusinessReport(stub, diagnostics=dr_results) + br_schema = br.to_dict() + pt_block = br_schema["pre_trends"] + assert pt_block["power_tier"] == "moderately_powered" + # All three prose surfaces must reflect the downgraded tier — + # none should render the well-powered phrasing ("likely have + # been detected" / well-powered adjective). + summary = br.summary() + full = br.full_report() + for text in (summary, full): + assert "well-powered" not in text.lower() + assert "likely have" not in text + # Positive check: moderately-informative phrasing appears in BR + # prose and BR's overall-interpretation pass-through. 
+ assert ( + "moderately informative" in summary + or "moderately informative" in full + or "moderately-informative" in summary + ) + + def test_center_downgrade_fires_on_real_cs_fit(self, cs_fit): + """On a real CS fit the central downgrade should land in the DR + schema when the helper used the diagonal fallback — no separate + BR-side downgrade is needed.""" + from diff_diff import DiagnosticReport + + fit, sdf = cs_fit + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + pp = dr.to_dict()["pretrends_power"] + if pp.get("status") != "ran": + pytest.skip("pretrends_power did not run on this fixture") + cov = pp.get("covariance_source") + if cov != "diag_fallback_available_full_vcov_unused": + pytest.skip( + "fixture did not trigger the diag_fallback_available path; " "nothing to downgrade" + ) + # When the flagged cov_source fires, tier must never be + # ``well_powered`` — centralized downgrade guarantees this. + assert pp["tier"] != "well_powered" + + +class TestCSNotYetTreatedControlGroupSemantics: + """Round-13 P1 regression: ``BusinessReport`` must not relabel + ``n_control_units`` as generic "control" for a + ``CallawaySantAnna(control_group='not_yet_treated')`` fit — that + field counts only never-treated units, while the actual comparison + group is the dynamic not-yet-treated set at each (g, t) cell. + """ + + def test_not_yet_treated_fit_does_not_render_misleading_control_count(self): + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + # Fit with the dynamic not-yet-treated comparison mode. + cs = CallawaySantAnna(base_period="universal", control_group="not_yet_treated").fit( + sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + aggregate="event_study", + ) + br = BusinessReport(cs, auto_diagnostics=False) + sample = br.to_dict()["sample"] + + # Fixed ``n_control`` must NOT be populated — the comparison set + # is dynamic per (g, t), not a fixed unit tally. + assert ( + sample["n_control"] is None + ), f"n_control must be None for not_yet_treated; got {sample['n_control']}" + # The new fields surface the real semantics. + assert sample["control_group"] == "not_yet_treated" + assert sample["n_never_treated"] == getattr(cs, "n_control_units", None) + + # Both summary and full_report must describe the dynamic + # comparison group rather than asserting a misleading "control" + # count. + summary = br.summary() + # No "(N treated, N control)" phrasing on this path. + assert " control)" not in summary + assert "not-yet-treated" in summary or "dynamic" in summary + + full = br.full_report() + assert "- Control:" not in full or "not-yet-treated" in full + assert "dynamic not-yet-treated" in full or "not-yet-treated" in full + + def test_never_treated_fit_still_shows_fixed_control_count(self, cs_fit): + """Default path (``control_group='never_treated'``) keeps the + fixed ``n_control`` tally so existing prose is unchanged.""" + fit, _ = cs_fit # default is never_treated + br = BusinessReport(fit, auto_diagnostics=False) + sample = br.to_dict()["sample"] + assert isinstance(sample["n_control"], int) + assert sample["control_group"] == "never_treated" + + +class TestBRDataKwargsPassthroughToAutoDR: + """Round-12 regression: ``BusinessReport`` now accepts + ``data`` / ``outcome`` / ``treatment`` / ``unit`` / ``time`` / + ``first_treat`` kwargs and forwards them to the auto-constructed + ``DiagnosticReport``. 
Without this, data-dependent checks (2x2 PT, + Bacon, EfficientDiD Hausman) are silently skipped on the zero- + config auto path even though the README markets one-call + diagnostics from a fitted result. + """ + + def test_did_fit_gets_2x2_pt_via_passthrough(self, did_fit): + fit, df = did_fit + br = BusinessReport( + fit, + data=df, + outcome="outcome", + treatment="treated", + time="post", + ) + # Auto-DR received the kwargs and ran the 2x2 PT check. + dr_schema = br.to_dict()["diagnostics"]["schema"] + assert dr_schema["parallel_trends"]["status"] == "ran" + assert dr_schema["parallel_trends"]["method"] == "slope_difference" + + def test_cs_fit_gets_bacon_via_passthrough(self, cs_fit): + fit, sdf = cs_fit + br = BusinessReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + dr_schema = br.to_dict()["diagnostics"]["schema"] + # Bacon needs data + outcome + time + unit + first_treat; before + # the passthrough, the auto path skipped because only the + # estimator result was available. + assert dr_schema["bacon"]["status"] == "ran" + + def test_no_passthrough_still_works_and_skips_gracefully(self, did_fit): + """Zero-config auto path must still produce a valid report; it + just skips data-dependent checks.""" + fit, _ = did_fit + br = BusinessReport(fit) # no data kwargs + dr_schema = br.to_dict()["diagnostics"]["schema"] + # PT needs data for 2x2 and was gated out of applicable — section + # is "skipped" rather than "ran". + assert dr_schema["parallel_trends"]["status"] in {"skipped", "not_applicable"} + + +class TestSensitivityProseGuarding: + """Round-12 regression: BR / DR summary prose must not promise a + "sensitivity analysis below" sentence when no sensitivity block + actually ran (e.g., SDiD / TROP routed to native diagnostics, + single-M precomputed passthrough rendered separately, skipped + sensitivity for varying-base CS). + """ + + def test_br_sdid_does_not_mention_sensitivity_below(self, sdid_fit): + fit, _ = sdid_fit + summary = BusinessReport(fit).summary() + # SDiD routes to estimator-native diagnostics, not HonestDiD. + # The PT verdict for SDiD is ``design_enforced_pt`` which does + # not append any "see sensitivity below" clause, so the prose + # should not mention it. + assert "sensitivity analysis below" not in summary + + def test_dr_trop_does_not_mention_sensitivity_below(self, sdid_fit): + # SDiD and TROP both skip HonestDiD. Use SDiD as proxy here + # since it already has a fixture; the same guard covers TROP. + from diff_diff import DiagnosticReport + + fit, _ = sdid_fit + summary = DiagnosticReport(fit).summary() + assert "sensitivity analysis below" not in summary + + +class TestSDiDTROPSkippedSensitivityCaveatSuppressed: + """Round-20 P2 regression on PR #318: ``DiagnosticReport`` marks the + HonestDiD sensitivity block ``status="skipped", method="estimator_native"`` + for SDiD / TROP because robustness is routed to the native diagnostics + (``in_time_placebo``, ``sensitivity_to_zeta_omega``, factor-model + metrics) under ``estimator_native_diagnostics``. ``BusinessReport`` + must not surface "HonestDiD sensitivity was not run" as a warning + caveat when the native battery actually ran, because that contradicts + the documented native-routing contract and misleads the reader into + thinking robustness was skipped. 
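+    When the native battery ran, the report should instead carry an
+    info-severity ``sensitivity_native_routed`` caveat that points the
+    reader at the estimator-native block, as asserted below.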
+ """ + + def test_sdid_native_routed_suppresses_skipped_caveat(self, sdid_fit): + from diff_diff import DiagnosticReport + + fit, _ = sdid_fit + br = BusinessReport(fit) + schema = br.to_dict() + + # BR's lifted ``sensitivity`` block only carries status/reason; the + # ``method`` field lives on the DR schema, which BR reads internally + # to decide caveat suppression. Confirm the DR-side shape separately. + assert schema["sensitivity"]["status"] == "skipped" + dr_schema = DiagnosticReport(fit).to_dict() + assert dr_schema["sensitivity"]["status"] == "skipped" + assert dr_schema["sensitivity"]["method"] == "estimator_native" + native_ran = dr_schema["estimator_native_diagnostics"].get("status") == "ran" + + caveat_topics = [c.get("topic") for c in schema.get("caveats", [])] + if native_ran: + # The fix: no "sensitivity_skipped" warning; instead an info + # caveat pointing at the native block. + assert "sensitivity_skipped" not in caveat_topics + assert "sensitivity_native_routed" in caveat_topics + native_msg = next( + c for c in schema["caveats"] if c.get("topic") == "sensitivity_native_routed" + ) + assert native_msg["severity"] == "info" + assert "estimator-native" in native_msg["message"].lower() + else: + # When the native battery did not produce a ran block, the + # legacy warning behavior is still correct — SDiD users should + # know HonestDiD was not attempted. + assert ( + "sensitivity_skipped" in caveat_topics + or "sensitivity_native_routed" in caveat_topics + ) + + +class TestEfficientDiDHausmanStepTaggedAsParallelTrends: + """Round-20 P2 regression on PR #318: the EfficientDiD practitioner + workflow step "Run Hausman pretest (PT-All vs PT-Post)" must be + tagged ``_step_name="parallel_trends"``, not ``"heterogeneity"``, so + that ``DiagnosticReport._collect_next_steps()`` — which treats a ran + Hausman block as parallel-trends completion — correctly suppresses the + step from the "next steps" list when the report already executed it. + REGISTRY.md §EfficientDiD (lines 895-908) classifies the Hausman + pretest as a parallel-trends diagnostic, so the fix aligns the + practitioner tag with the identification-layer classification. + """ + + def test_hausman_step_is_tagged_parallel_trends(self): + """``practitioner_next_steps`` strips ``_step_name`` from the + returned steps, so we exercise the tagging via the + ``completed_steps=["parallel_trends"]`` filter contract: a + correctly-tagged Hausman step is removed from the output; a + mistagged step remains. + """ + from diff_diff.practitioner import practitioner_next_steps + + class EfficientDiDResults: + pass + + stub = EfficientDiDResults() + stub.overall_att = 1.0 + stub.overall_se = 0.3 + stub.overall_p_value = 0.01 + stub.overall_conf_int = (0.4, 1.6) + stub.alpha = 0.05 + stub.n_obs = 500 + stub.n_treated = 200 + stub.n_control = 300 + stub.survey_metadata = None + stub.event_study_effects = None + stub.pt_assumption = "all" + + # Without any completed steps, the Hausman pretest is included. + baseline = practitioner_next_steps(stub, verbose=False)["next_steps"] + hausman_in_baseline = any("Hausman pretest" in s.get("label", "") for s in baseline) + assert hausman_in_baseline, "EfficientDiD workflow must include the Hausman pretest step" + + # After marking ``parallel_trends`` complete (which DR does when + # ``_check_pt_hausman`` runs), the Hausman step must be filtered + # out. Before the round-20 retag it was tagged as + # ``heterogeneity`` and survived this filter — that is the bug. 
+ filtered = practitioner_next_steps( + stub, completed_steps=["parallel_trends"], verbose=False + )["next_steps"] + assert not any("Hausman pretest" in s.get("label", "") for s in filtered), ( + "Hausman step must be tagged as 'parallel_trends' (REGISTRY.md " + "§EfficientDiD classifies it as a PT diagnostic) so that " + "DR's _collect_next_steps() suppresses it after running the same " + "check. Still present after completed_steps=['parallel_trends'] " + "filter, meaning the tag is wrong." + ) + + +class TestSpecificationComparisonStepTagPersistsAfterSensitivityRuns: + """Pre-emptive audit regression: several practitioner handlers + previously tagged their "compare specifications" / "vary control + group" step as ``_step_name="sensitivity"``. DR marks ``sensitivity`` + complete when HonestDiD runs — which is orthogonal to the + specification-variation recommendation — so these steps were + incorrectly suppressed from ``next_steps`` after a fit with + HonestDiD sensitivity. Retag as ``specification_comparison`` so the + recommendations persist alongside a completed HonestDiD block. Same + class of mistag as the round-20 Hausman finding (which was about + ``heterogeneity`` vs ``parallel_trends``). + """ + + @staticmethod + def _build_stub(class_name: str, **extras): + stub_cls = type(class_name, (), {}) + stub = stub_cls() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 400 + stub.n_treated = 100 + stub.n_control = 300 + stub.survey_metadata = None + stub.event_study_effects = None + for k, v in extras.items(): + setattr(stub, k, v) + return stub + + @staticmethod + def _step_labels_after_completed(stub, completed): + from diff_diff.practitioner import practitioner_next_steps + + return [ + s.get("label", "") + for s in practitioner_next_steps(stub, completed_steps=completed, verbose=False)[ + "next_steps" + ] + ] + + def test_sa_specification_falsification_persists_after_sensitivity_runs(self): + stub = self._build_stub("SunAbrahamResults") + labels = self._step_labels_after_completed(stub, completed=["sensitivity"]) + assert any("Specification-based falsification" in lab for lab in labels), ( + "SA's 'Specification-based falsification' step must persist " + "after DR marks sensitivity complete — HonestDiD does not run " + "control_group / anticipation variation." + ) + + def test_imputation_specification_falsification_persists_after_sensitivity_runs(self): + stub = self._build_stub("ImputationDiDResults") + labels = self._step_labels_after_completed(stub, completed=["sensitivity"]) + assert any("Specification-based falsification" in lab for lab in labels) + + def test_two_stage_specification_falsification_persists_after_sensitivity_runs(self): + stub = self._build_stub("TwoStageDiDResults") + labels = self._step_labels_after_completed(stub, completed=["sensitivity"]) + assert any("Specification-based falsification" in lab for lab in labels) + + def test_stacked_clean_control_variation_persists_after_sensitivity_runs(self): + stub = self._build_stub("StackedDiDResults") + labels = self._step_labels_after_completed(stub, completed=["sensitivity"]) + assert any("Vary clean control" in lab for lab in labels), ( + "StackedDiD's 'Vary clean control definition' step must " + "persist after DR marks sensitivity complete — HonestDiD does " + "not replay clean_control variation." 
+ ) + + def test_efficient_compare_control_groups_persists_after_sensitivity_runs(self): + stub = self._build_stub("EfficientDiDResults", pt_assumption="all") + labels = self._step_labels_after_completed(stub, completed=["sensitivity"]) + assert any("Compare control group definitions" in lab for lab in labels), ( + "EfficientDiD's 'Compare control group definitions' step " + "must persist after DR marks sensitivity complete — HonestDiD " + "does not re-estimate with alternative control_group." + ) + + +class TestCSRepeatedCrossSectionCountLabels: + """Round-28 P2 CI review on PR #318: ``CallawaySantAnna(panel=False)`` + stores treated / control counts as OBSERVATIONS, not units + (``staggered_results.py L183-L184`` renders them as "obs:" in that + mode). BR previously labeled them as "units" / "present in the + panel", which misstates the sample composition on repeated-cross- + section fits. The schema now carries a ``count_unit`` flag and the + rendering branches on it. + """ + + @staticmethod + def _stub(panel: bool): + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 1000 + stub.n_treated_units = 200 + stub.n_control_units = 800 + stub.survey_metadata = None + stub.event_study_effects = None + stub.control_group = "not_yet_treated" + stub.panel = panel + return stub + + def test_schema_exposes_count_unit(self): + for panel, expected in [(True, "units"), (False, "observations")]: + sample = BusinessReport(self._stub(panel), auto_diagnostics=False).to_dict()["sample"] + assert sample["count_unit"] == expected + + def test_panel_true_renders_unit_wording(self): + br = BusinessReport(self._stub(panel=True), auto_diagnostics=False) + summary = br.summary() + md = br.full_report() + assert "never-treated units" in summary + assert "present in the panel" in md + assert "repeated cross-section sample" not in md + + def test_panel_false_renders_rcs_wording(self): + br = BusinessReport(self._stub(panel=False), auto_diagnostics=False) + summary = br.summary() + md = br.full_report() + # RCS-specific wording in both surfaces. + assert "never-treated observations" in summary + assert "repeated cross-section sample" in md + # No misleading "units" or "panel" claims. + assert "never-treated units" not in summary + assert "present in the panel" not in md + + +class TestTROPApplicableChecksExcludesParallelTrends: + """Round-28 P2 CI review on PR #318: TROP identification is + factor-model-based; its native PT handler returns + ``status="not_applicable"``. Advertising ``parallel_trends`` in + ``DiagnosticReport.applicable_checks`` for TROP was a contract + mismatch for callers using that set to gate workflows or UI. + """ + + def test_trop_applicable_checks_omits_parallel_trends(self): + from diff_diff import DiagnosticReport + + class TROPResults: + pass + + stub = TROPResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.alpha = 0.05 + stub.n_obs = 100 + + dr = DiagnosticReport(stub) + assert "parallel_trends" not in dr.applicable_checks, ( + "TROP PT routes to factor-model diagnostics and is " + "not_applicable; it must not appear in applicable_checks." 
+ ) + + +class TestSurveyPTProsePropagation: + """Round-28 P3 CI review on PR #318: the survey F-reference PT + variants (``joint_wald_survey``, ``joint_wald_event_study_survey``) + must carry through BR's method-aware label helpers so prose uses + "joint p" (not the fall-through default) and preserves the + ``df_denom`` provenance in the BR schema. + """ + + def test_lift_pre_trends_preserves_df_denom(self): + from diff_diff.business_report import _lift_pre_trends + + fake_dr = { + "parallel_trends": { + "status": "ran", + "method": "joint_wald_event_study_survey", + "joint_p_value": 0.35, + "df_denom": 30.0, + "n_pre_periods": 3, + "verdict": "no_detected_violation", + }, + "pretrends_power": {"status": "not_applicable"}, + } + lifted = _lift_pre_trends(fake_dr) + assert lifted["method"] == "joint_wald_event_study_survey" + assert lifted["df_denom"] == 30.0 + + def test_lift_pre_trends_exposes_power_reason(self): + """Round-29 P3 regression: when ``compute_pretrends_power`` cannot + run, REPORTING.md lines 118-125 promise the fallback reason is + recorded in the BR pre-trends block. Previously only the enum + status surfaced and the reason was dropped at the lift + boundary; the new ``power_reason`` field carries the + plain-English explanation alongside the existing enum + ``power_status``. + """ + from diff_diff.business_report import _lift_pre_trends + + fake_dr = { + "parallel_trends": { + "status": "ran", + "method": "joint_wald_event_study", + "joint_p_value": 0.35, + "n_pre_periods": 3, + "verdict": "no_detected_violation", + }, + "pretrends_power": { + "status": "not_applicable", + "reason": ( + "StackedDiDResults does not yet have a " "compute_pretrends_power adapter." + ), + }, + } + lifted = _lift_pre_trends(fake_dr) + # Machine-readable status preserved. + assert lifted["power_status"] == "not_applicable" + # Plain-English reason now exposed on the schema. + assert lifted["power_reason"] == ( + "StackedDiDResults does not yet have a " "compute_pretrends_power adapter." + ) + + def test_survey_pt_method_stat_label_uses_joint_p(self): + from diff_diff.business_report import ( + _pt_method_stat_label, + _pt_method_subject, + ) + + for method in ("joint_wald_survey", "joint_wald_event_study_survey"): + assert _pt_method_stat_label(method) == "joint p", ( + f"Survey PT variant {method!r} must map to 'joint p' " + f"(the joint test remains; only the reference " + f"distribution changes)." + ) + assert _pt_method_subject(method) == "Pre-treatment event-study coefficients", ( + f"Survey PT variant {method!r} must use the event-study " + f"subject phrase, not the generic fall-through." + ) + + +class TestSDiDJackknifeStepPersistsAfterNativeSensitivity: + """Round-24 P2 CI review on PR #318: the SyntheticDiD practitioner + step "Leave-one-out influence (jackknife)" must persist after + ``DiagnosticReport`` marks ``sensitivity`` complete via the SDiD + native battery (pre-treatment fit, weight concentration, + ``in_time_placebo``, ``sensitivity_to_zeta_omega``). DR does NOT + run the jackknife LOO workflow — ``get_loo_effects_df`` requires a + separate ``variance_method='jackknife'`` fit — so suppressing the + recommendation when the native block fires overstates what the + report has already executed. Same class as round-20 Hausman and + pre-emptive TROP-placebo retags: step_name was coarser than DR's + actual coverage. 
+ """ + + def test_sdid_jackknife_step_persists_via_practitioner_filter(self): + """Unit-level: ``practitioner_next_steps`` with + ``completed_steps=["sensitivity"]`` still surfaces the jackknife + recommendation because it is now tagged ``loo_jackknife``. + """ + from diff_diff.practitioner import practitioner_next_steps + + class SyntheticDiDResults: + pass + + stub = SyntheticDiDResults() + stub.att = 1.0 + stub.se = 0.2 + stub.p_value = 0.001 + stub.conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 200 + stub.n_treated = 20 + stub.n_control = 180 + stub.survey_metadata = None + stub.event_study_effects = None + + labels = [ + s.get("label", "") + for s in practitioner_next_steps(stub, completed_steps=["sensitivity"], verbose=False)[ + "next_steps" + ] + ] + assert any("Leave-one-out influence (jackknife)" in lab for lab in labels), ( + "SDiD jackknife recommendation must persist after DR marks " + "sensitivity complete — the SDiD native battery does not run " + "the jackknife LOO workflow (requires a separate " + "variance_method='jackknife' fit)." + ) + + def test_sdid_jackknife_step_persists_in_dr_next_steps(self, sdid_fit): + """Integration: ``DiagnosticReport(...).to_dict()["next_steps"]`` + preserves the jackknife recommendation when only the default + native SDiD diagnostics ran. + """ + from diff_diff import DiagnosticReport + + fit, _ = sdid_fit + next_steps = DiagnosticReport(fit).to_dict()["next_steps"] + labels = [s.get("label", "") for s in next_steps] + assert any("Leave-one-out influence (jackknife)" in lab for lab in labels), ( + "DR next_steps must preserve the SDiD jackknife recommendation " + "when the SDiD native battery ran but the jackknife workflow " + f"did not. Got labels: {labels}" + ) + + +class TestTROPInTimePlaceboStepTaggedAsPlacebo: + """Pre-emptive audit regression: the TROP practitioner workflow + step "In-time or in-space placebo" was previously tagged + ``_step_name="sensitivity"``. TROP's estimator-native diagnostics + surface factor-model fit metrics (``effective_rank``, ``loocv_score``, + selected lambdas) — not placebos — and + ``DiagnosticReport._collect_next_steps`` marks ``sensitivity`` complete + for SDiD / TROP when the native battery runs. That suppressed the + TROP placebo recommendation unjustly. Retag as ``placebo`` so it + persists. + """ + + def test_trop_placebo_step_persists_after_native_sensitivity_completion(self): + from diff_diff.practitioner import practitioner_next_steps + + class TROPResults: + pass + + stub = TROPResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 400 + stub.n_treated = 40 + stub.n_control = 360 + stub.survey_metadata = None + stub.event_study_effects = None + + labels = [ + s.get("label", "") + for s in practitioner_next_steps(stub, completed_steps=["sensitivity"], verbose=False)[ + "next_steps" + ] + ] + assert any("In-time or in-space placebo" in lab for lab in labels), ( + "TROP's placebo recommendation must persist after DR marks " + "sensitivity complete (SDiD/TROP native battery) — factor-" + "model diagnostics are not a placebo substitute." 
+ ) + + +class TestPrecomputedSensitivityHonoredOnAllCompatibleEstimators: + """Round-31 P1 CI review on PR #318: ``DiagnosticReport(precomputed= + {"sensitivity": ...})`` and ``BusinessReport(honest_did_results=...)`` + were silently dropped on estimator families whose ``_APPLICABILITY`` + row lacked ``"sensitivity"`` — SA, Imputation, TwoStage, Stacked, + EfficientDiD, Wooldridge, TripleDifference, StaggeredTripleDiff, + ContinuousDiD, and plain DiD. The applicability gate filtered the + section out before the supplied object reached the runner, so the + schema rendered ``sensitivity: {"status": "not_applicable"}`` and + the user never learned their robustness result had been ignored. + + The gate now honors an explicit passthrough regardless of the + default ``_APPLICABILITY`` matrix. SDiD / TROP are still rejected + up front in ``__init__`` (round-21) because their native-routing + contract is methodology-incompatible with HonestDiD. + """ + + @staticmethod + def _fake_grid_sens(): + from types import SimpleNamespace + + return SimpleNamespace( + M_values=[0.5, 1.0, 1.5], + bounds=[(0.1, 2.0), (-0.2, 2.5), (-0.5, 3.0)], + robust_cis=[(0.05, 2.1), (-0.3, 2.6), (-0.6, 3.1)], + breakdown_M=1.25, + method="relative_magnitude", + original_estimate=1.0, + original_se=0.2, + alpha=0.05, + ) + + @staticmethod + def _stub(class_name: str, **extras): + + # For estimator types that have fits, we'd use real fits; but + # several of these need specific setup. Stub with minimal + # required fields — the gate fix operates on the applicability + # set and the sensitivity runner short-circuits on the + # precomputed key without touching result internals. + stub_cls = type(class_name, (), {}) + stub = stub_cls() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.att = 1.0 + stub.se = 0.2 + stub.p_value = 0.001 + stub.conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 500 + stub.n_treated = 200 + stub.n_control = 300 + stub.survey_metadata = None + stub.event_study_effects = None + for k, v in extras.items(): + setattr(stub, k, v) + return stub + + def test_dr_precomputed_sensitivity_honored_on_sun_abraham(self): + from diff_diff import DiagnosticReport + + stub = self._stub("SunAbrahamResults") + dr = DiagnosticReport(stub, precomputed={"sensitivity": self._fake_grid_sens()}) + sens = dr.to_dict()["sensitivity"] + assert sens["status"] == "ran", ( + f"precomputed sensitivity on SunAbrahamResults must be honored; " f"got {sens!r}" + ) + assert sens.get("precomputed") is True + assert sens["breakdown_M"] == 1.25 + + def test_dr_precomputed_sensitivity_honored_on_efficient_did(self): + from diff_diff import DiagnosticReport + + stub = self._stub("EfficientDiDResults", pt_assumption="all") + dr = DiagnosticReport(stub, precomputed={"sensitivity": self._fake_grid_sens()}) + sens = dr.to_dict()["sensitivity"] + assert sens["status"] == "ran" + assert sens.get("precomputed") is True + + def test_dr_precomputed_sensitivity_honored_on_plain_did(self): + from diff_diff import DiagnosticReport + + stub = self._stub("DiDResults") + dr = DiagnosticReport(stub, precomputed={"sensitivity": self._fake_grid_sens()}) + sens = dr.to_dict()["sensitivity"] + assert sens["status"] == "ran" + + def test_br_honest_did_results_honored_on_imputation(self): + stub = self._stub("ImputationDiDResults") + br = BusinessReport(stub, honest_did_results=self._fake_grid_sens()) + sens = br.to_dict()["sensitivity"] + assert sens["status"] == "computed", ( + 
f"honest_did_results on ImputationDiDResults must be honored " f"by BR; got {sens!r}" + ) + assert sens["breakdown_M"] == 1.25 + + +class TestHeterogeneityLiftAlwaysReturnsDict: + """Round-31 P2 CI review on PR #318: ``_lift_heterogeneity`` used to + return ``None`` whenever the DR heterogeneity section didn't + successfully run, so the BR schema stored a raw ``None`` at + ``schema["heterogeneity"]``. The rest of the schema promises dict- + shaped ``{"status": ..., "reason": ...}`` blocks on every top- + level key; this one broke the contract and forced downstream + consumers to special-case it. + """ + + def test_lift_none_dr_returns_dict(self): + from diff_diff.business_report import _lift_heterogeneity + + block = _lift_heterogeneity(None) + assert isinstance(block, dict) + assert block["status"] == "skipped" + assert "auto_diagnostics" in (block.get("reason") or "") + + def test_lift_skipped_dr_section_returns_dict_with_status(self): + from diff_diff.business_report import _lift_heterogeneity + + block = _lift_heterogeneity( + { + "heterogeneity": { + "status": "skipped", + "reason": "No group_effects or event_study_effects on result.", + } + } + ) + assert block["status"] == "skipped" + assert "No group_effects" in block["reason"] + + def test_lift_not_applicable_dr_section_returns_dict(self): + from diff_diff.business_report import _lift_heterogeneity + + block = _lift_heterogeneity( + { + "heterogeneity": { + "status": "not_applicable", + "reason": "TripleDifferenceResults is a 2-period design.", + } + } + ) + assert block["status"] == "not_applicable" + assert block["reason"] + + def test_br_schema_heterogeneity_is_always_dict(self): + """End-to-end: a fit whose heterogeneity did not run still + exposes a dict-shaped block at ``schema["heterogeneity"]`` + rather than a raw ``None``. + """ + + class DiDResults: + pass + + stub = DiDResults() + stub.att = 1.0 + stub.se = 0.2 + stub.p_value = 0.001 + stub.conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 200 + stub.n_treated = 100 + stub.n_control = 100 + stub.survey_metadata = None + + het = BusinessReport(stub, auto_diagnostics=True).to_dict()["heterogeneity"] + assert isinstance(het, dict), ( + f"schema['heterogeneity'] must be a dict (the stable-schema " + f"contract); got {type(het).__name__}: {het!r}" + ) + assert "status" in het + + +class TestSDiDTROPRejectPrecomputedPretrendsPower: + """Round-32 P1 CI review on PR #318: round-21 rejected + ``precomputed["sensitivity"]`` / ``precomputed["parallel_trends"]`` + on SDiD / TROP because the native-routing contract makes those + methodology-incompatible. Round-31's broadening of the + applicability gate exposed a parallel hole — ``precomputed[ + "pretrends_power"]`` was not in the rejection set, so a Roth- + style power verdict could surface on a report whose PT is + design-enforced (SDiD) or factor-model (TROP). The guard now + rejects all three precomputed keys uniformly on the native- + routed estimator families. 
+ """ + + @staticmethod + def _dummy_power_object(): + from types import SimpleNamespace + + return SimpleNamespace( + mdv=0.1, + violation_type="linear", + alpha=0.05, + target_power=0.80, + violation_magnitude=0.1, + power=0.80, + n_pre_periods=2, + ) + + def test_dr_rejects_precomputed_pretrends_power_on_sdid(self, sdid_fit): + from diff_diff import DiagnosticReport + + fit, _ = sdid_fit + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + DiagnosticReport(fit, precomputed={"pretrends_power": self._dummy_power_object()}) + + def test_dr_rejects_precomputed_pretrends_power_on_trop(self): + from diff_diff import DiagnosticReport + + class TROPResults: + pass + + stub = TROPResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.alpha = 0.05 + stub.n_obs = 100 + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + DiagnosticReport(stub, precomputed={"pretrends_power": self._dummy_power_object()}) + + +class TestHeterogeneityOmittedFromFullReportWhenNotRan: + """Round-32 P2 CI review on PR #318: round-31 made + ``_lift_heterogeneity`` always return a dict (stable schema + contract), but the full-report renderer's ``if het:`` truthiness + guard then entered the Heterogeneity section on every fit and + printed ``Source: None`` / ``N effects: None`` / ``Sign + consistent: None``. Renderer now gates on ``status == "ran"``. + """ + + def test_full_report_omits_heterogeneity_section_when_skipped(self): + class DiDResults: + pass + + stub = DiDResults() + stub.att = 1.0 + stub.se = 0.2 + stub.p_value = 0.001 + stub.conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 200 + stub.n_treated = 100 + stub.n_control = 100 + stub.survey_metadata = None + + md = BusinessReport(stub, auto_diagnostics=True).full_report() + # The section header is only emitted when status == "ran". + # Plain DiD does not have heterogeneity in its applicability + # row, so the section should NOT appear. + assert "## Heterogeneity" not in md, ( + f"Heterogeneity section must be omitted when it did not " + f"run; rendering ``Source: None`` / ``N effects: None`` " + f"is worse than omitting. Got markdown:\n{md}" + ) + # Specifically, none of the placeholder ``None`` lines may + # appear anywhere in the rendered report. + assert "Source: `None`" not in md + assert "N effects: None" not in md + assert "Sign consistent: None" not in md + + +class TestDesignEffectBandLabel: + """Round-32 P2 CI review on PR #318: REPORTING.md promises a + plain-English band label on the ``design_effect`` section, but the + implementation only emitted numeric fields plus ``is_trivial``. + Add a stable ``band_label`` enum aligned with the REPORTING.md + threshold rule. 
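+    Per the tests below, the bands are ``[0.95, 1.05)`` → ``trivial``,
+    up to 2 → ``slightly_reduces``, up to 5 → ``materially_reduces``,
+    and ``>= 5`` → ``large_warning``.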
+ """ + + @staticmethod + def _stub_with_deff(deff: float): + from types import SimpleNamespace + + from diff_diff import DiagnosticReport + + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 500 + stub.n_treated = 100 + stub.n_control_units = 400 + stub.event_study_effects = None + stub.survey_metadata = SimpleNamespace( + design_effect=deff, + effective_n=500.0 / max(deff, 1e-9), + weight_type="pweight", + n_strata=None, + n_psu=None, + df_survey=None, + replicate_method=None, + ) + return DiagnosticReport(stub).to_dict()["design_effect"] + + def test_trivial_band_under_1_05(self): + assert self._stub_with_deff(1.01)["band_label"] == "trivial" + + def test_slightly_reduces_band_under_2(self): + assert self._stub_with_deff(1.5)["band_label"] == "slightly_reduces" + + def test_materially_reduces_band_under_5(self): + assert self._stub_with_deff(3.2)["band_label"] == "materially_reduces" + + def test_large_warning_band_at_or_above_5(self): + assert self._stub_with_deff(7.5)["band_label"] == "large_warning" + + def test_deff_exactly_1_05_is_slightly_reduces_not_trivial(self): + """Round-43 P2 regression: REPORTING.md defines the ``trivial`` + band as ``0.95 <= deff < 1.05`` (half-open) and + ``slightly_reduces`` as starting at ``1.05``. The prior code used + ``is_trivial = 0.95 <= deff <= 1.05`` (closed), producing a + schema that labeled exactly ``deff == 1.05`` as + ``band_label="slightly_reduces"`` AND ``is_trivial=True`` — + internally inconsistent, and the ``is_trivial=True`` flag + suppressed the non-trivial prose that the documented threshold + says should fire. + """ + # DR schema: band_label="slightly_reduces" and is_trivial=False. + block = self._stub_with_deff(1.05) + assert block["band_label"] == "slightly_reduces", ( + f"DEFF==1.05 must land in the ``slightly_reduces`` band per " + f"REPORTING.md (half-open threshold). Got: {block!r}" + ) + assert block["is_trivial"] is False, ( + f"DEFF==1.05 must NOT be classified as trivial (half-open " + f"threshold). Got is_trivial={block['is_trivial']!r}" + ) + + # BR's sample-block ``is_trivial`` must match. + from types import SimpleNamespace + + from diff_diff import BusinessReport + + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 500 + stub.n_treated = 100 + stub.n_control_units = 400 + stub.event_study_effects = None + stub.survey_metadata = SimpleNamespace( + design_effect=1.05, + effective_n=500.0 / 1.05, + weight_type="pweight", + n_strata=None, + n_psu=None, + df_survey=None, + replicate_method=None, + ) + br = BusinessReport(stub, auto_diagnostics=False) + sample = br.to_dict()["sample"] + assert sample["survey"] is not None + assert sample["survey"]["is_trivial"] is False, ( + f"BR sample-block ``is_trivial`` at DEFF==1.05 must match " + f"DR's half-open threshold. Got: {sample['survey']!r}" + ) + + def test_deff_just_under_1_05_is_trivial(self): + """Round-43 P2 regression: the lower-bound adjacent point + ``deff == 1.049`` is still inside the half-open ``trivial`` + band ``[0.95, 1.05)``. 
+ """ + block = self._stub_with_deff(1.049) + assert block["band_label"] == "trivial" + assert block["is_trivial"] is True + + +class TestSDiDTROPRejectIncompatiblePrecomputedInputs: + """Round-21 P1 CI review on PR #318: ``precomputed={"sensitivity": + ...}`` and ``BusinessReport(honest_did_results=...)`` previously + short-circuited the SDiD / TROP native-routing guards, letting the + generic report sections surface methodology-incompatible HonestDiD + or generic PT diagnostics on estimators that route robustness to + ``estimator_native_diagnostics``. DR / BR must now reject those + passthroughs with a clear error pointing users at the native + diagnostics on the result object. + """ + + @staticmethod + def _dummy_sens_object(): + from types import SimpleNamespace + + return SimpleNamespace( + M_values=[0.5, 1.0], + bounds=[(0.1, 2.0), (-0.2, 2.5)], + robust_cis=[(0.05, 2.1), (-0.3, 2.6)], + breakdown_M=0.75, + method="relative_magnitude", + original_estimate=1.0, + original_se=0.2, + alpha=0.05, + ) + + @staticmethod + def _dummy_pt_object(): + from types import SimpleNamespace + + return SimpleNamespace(joint_p_value=0.2, n_pre_periods=3, method="event_study") + + def test_dr_rejects_precomputed_sensitivity_on_sdid(self, sdid_fit): + from diff_diff import DiagnosticReport + + fit, _ = sdid_fit + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + DiagnosticReport(fit, precomputed={"sensitivity": self._dummy_sens_object()}) + + def test_dr_rejects_precomputed_parallel_trends_on_sdid(self, sdid_fit): + from diff_diff import DiagnosticReport + + fit, _ = sdid_fit + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + DiagnosticReport(fit, precomputed={"parallel_trends": self._dummy_pt_object()}) + + def test_br_rejects_honest_did_results_on_sdid(self, sdid_fit): + fit, _ = sdid_fit + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + BusinessReport(fit, honest_did_results=self._dummy_sens_object()) + + def test_dr_rejects_precomputed_sensitivity_on_trop(self): + """TROP construction is expensive; use a stub with the right name.""" + from diff_diff import DiagnosticReport + + class TROPResults: + pass + + stub = TROPResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.alpha = 0.05 + stub.n_obs = 100 + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + DiagnosticReport(stub, precomputed={"sensitivity": self._dummy_sens_object()}) + + def test_dr_rejects_precomputed_parallel_trends_on_trop(self): + from diff_diff import DiagnosticReport + + class TROPResults: + pass + + stub = TROPResults() + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.alpha = 0.05 + stub.n_obs = 100 + with pytest.raises(ValueError, match="estimator_native_diagnostics"): + DiagnosticReport(stub, precomputed={"parallel_trends": self._dummy_pt_object()}) + + def test_dr_still_accepts_precomputed_on_compatible_estimators(self, cs_fit): + """CS remains a valid passthrough target — the guardrail is + estimator-specific, not a blanket ban. + """ + from diff_diff import DiagnosticReport + + fit, _ = cs_fit + # Should not raise. + DiagnosticReport(fit, precomputed={"sensitivity": self._dummy_sens_object()}) + + def test_br_still_accepts_honest_did_results_on_compatible_estimators(self, cs_fit): + fit, _ = cs_fit + # Should not raise. 
+ BusinessReport(fit, honest_did_results=self._dummy_sens_object()) + + +class TestBRLiftSensitivityPreservesMethodOnSkip: + """Pre-emptive audit regression: ``_lift_sensitivity`` previously + dropped the ``method`` field from BR's ``sensitivity`` block when + ``status != "ran"``. That forced BR-schema consumers to re-consult + the DR schema to distinguish native-routed skips + (``method="estimator_native"`` for SDiD / TROP, where robustness is + covered by the native battery) from methodology-blocked skips (e.g., + CS with ``base_period='varying'``). Preserving the field keeps BR + self-describing. + """ + + def test_sdid_br_schema_exposes_native_method_on_sensitivity_skip(self, sdid_fit): + fit, _ = sdid_fit + sens_block = BusinessReport(fit).to_dict()["sensitivity"] + assert sens_block["status"] == "skipped" + # The round-20 DR fix set method="estimator_native"; BR must pass + # it through so an agent consuming BR alone can tell this is a + # native-routed skip. + assert sens_block.get("method") == "estimator_native", ( + "BR's sensitivity block must preserve method='estimator_native' " + "when DR emitted it; otherwise downstream agents cannot " + f"distinguish native routing from methodology blocks. Got: {sens_block}" + ) + + +class TestHausmanTestStatisticPopulated: + """Round-10 P3 regression: ``HausmanPretestResult`` exposes + ``statistic`` (not ``test_statistic``); the DR schema was previously + reading the wrong attribute and losing the H statistic.""" + + def test_test_statistic_field_is_populated_on_success(self, edid_fit): + from diff_diff import DiagnosticReport + + fit, sdf = edid_fit + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + pt = dr.to_dict()["parallel_trends"] + if pt["status"] == "ran": + # Method-specific: only Hausman exposes a test_statistic. + assert pt["method"] == "hausman" + ts = pt["test_statistic"] + assert ( + ts is not None and isinstance(ts, float) and np.isfinite(ts) + ), f"Hausman H statistic must be populated on success; got {ts}" + + +class TestFullReportSingleM: + """Regression: ``full_report()`` must not claim full-grid robustness for a + single-M HonestDiDResults passthrough. 
The summary path was fixed earlier;
+    the structured-markdown path had the same bug and now mirrors it."""
+
+    @staticmethod
+    def _fake_single_m(M=1.5, ci_lb=1.0, ci_ub=3.0):
+        from types import SimpleNamespace
+
+        return SimpleNamespace(
+            M=M,
+            lb=ci_lb,
+            ub=ci_ub,
+            ci_lb=ci_lb,
+            ci_ub=ci_ub,
+            method="relative_magnitude",
+            alpha=0.05,
+        )
+
+    def test_full_report_does_not_claim_full_grid_for_single_m(self, event_study_fit):
+        fit, _ = event_study_fit
+        br = BusinessReport(fit, honest_did_results=self._fake_single_m())
+        md = br.full_report()
+        assert "robust across full grid" not in md
+        assert "Single point checked" in md or "single point" in md.lower()
+
+
+# ---------------------------------------------------------------------------
+# Summary + full_report work across estimators
+# ---------------------------------------------------------------------------
+class TestAcrossEstimators:
+    def test_summary_nonempty_for_all(self, did_fit, event_study_fit, cs_fit, sdid_fit):
+        for fit, _ in (did_fit, event_study_fit, cs_fit, sdid_fit):
+            br = BusinessReport(fit, auto_diagnostics=False)
+            s = br.summary()
+            assert isinstance(s, str)
+            assert len(s) > 0
+
+
+# ---------------------------------------------------------------------------
+# Public API exposure
+# ---------------------------------------------------------------------------
+def test_public_api_exports():
+    for name in ("BusinessReport", "BusinessContext", "BUSINESS_REPORT_SCHEMA_VERSION"):
+        assert hasattr(dd, name)
+
+
+def test_repr_includes_estimator_and_effect(cs_fit):
+    fit, _ = cs_fit
+    r = repr(BusinessReport(fit, auto_diagnostics=False))
+    assert "CallawaySantAnnaResults" in r
+
+
+def test_business_context_is_frozen_dataclass():
+    ctx = BusinessContext(
+        outcome_label="x",
+        outcome_unit=None,
+        outcome_direction=None,
+        business_question=None,
+        treatment_label="y",
+        alpha=0.05,
+    )
+    # A frozen dataclass raises dataclasses.FrozenInstanceError, which
+    # subclasses AttributeError, so the narrower check is sufficient and
+    # does not pass vacuously on unrelated exceptions.
+    with pytest.raises(AttributeError):
+        ctx.alpha = 0.10  # type: ignore[misc]
+
+
+def test_str_equals_summary(cs_fit):
+    fit, _ = cs_fit
+    br = BusinessReport(fit, auto_diagnostics=False)
+    assert str(br) == br.summary()
+
+
+class TestBootstrapResultsAndNBootstrapDetection:
+    """Regression for the round-5 P0 finding that ``_extract_headline``
+    only preserved native CI surfaces when a result advertised
+    ``inference_method`` / ``bootstrap_distribution`` / ``variance_method``
+    / ``df_survey``.
+
+    Several staggered / continuous / dCDH result classes copy bootstrap-
+    derived se/p/conf_int into their top-level fields at fit time and
+    expose the bootstrap only via a ``bootstrap_results`` sub-object or
+    an ``n_bootstrap > 0`` attribute. An ``alpha`` override on such a
+    fit would silently swap a percentile/multiplier bootstrap CI for a
+    normal-approximation one. BR must now detect either marker and
+    preserve the fitted CI at its native level.
+    """
+
+    def _base_stub(self):
+        stub = type("Stub", (), {})()
+        stub.att = 1.0
+        stub.se = 0.5
+        stub.p_value = 0.04
+        stub.conf_int = (0.05, 1.95)
+        stub.alpha = 0.05
+        stub.n_obs = 100
+        stub.n_treated = 40
+        stub.n_control = 60
+        stub.survey_metadata = None
+        # Crucially NOT exposing inference_method / bootstrap_distribution
+        # / variance_method / df_survey: exactly the surface the reviewer
+        # flagged as silently falling through.
+ return stub + + def test_bootstrap_results_object_alone_preserves_fit_ci(self): + stub = self._base_stub() + stub.bootstrap_results = type("BootSub", (), {"n_bootstrap": 199})() + br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + assert h["ci_level"] == 95, ( + "Result carrying bootstrap_results must preserve fitted CI " + "level on alpha mismatch; got " + str(h["ci_level"]) + ) + assert h["ci_lower"] == pytest.approx(0.05) + assert h["ci_upper"] == pytest.approx(1.95) + topics = {c.get("topic") for c in br.caveats()} + assert "alpha_override_preserved" in topics + + def test_n_bootstrap_positive_alone_preserves_fit_ci(self): + """ContinuousDiDResults-style: ``n_bootstrap`` field, no bootstrap_results.""" + stub = self._base_stub() + stub.n_bootstrap = 499 + br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + assert h["ci_level"] == 95 + assert h["ci_lower"] == pytest.approx(0.05) + assert h["ci_upper"] == pytest.approx(1.95) + topics = {c.get("topic") for c in br.caveats()} + assert "alpha_override_preserved" in topics + + def test_n_bootstrap_zero_still_preserves_on_alpha_mismatch(self): + """Analytic fits (``n_bootstrap = 0``) also preserve the fitted CI + on alpha mismatch — BR cannot reproduce a ``DiDResults`` / + ``MultiPeriodDiDResults`` / TROP t-quantile CI without the fit's + finite ``df``, which is not exposed uniformly. Round-7 regression. + """ + stub = self._base_stub() + stub.n_bootstrap = 0 + br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + # Analytic fit's native 95% CI is preserved at 95% on 90% override. + assert h["ci_level"] == 95 + assert h["ci_lower"] == pytest.approx(0.05) + assert h["ci_upper"] == pytest.approx(1.95) + topics = {c.get("topic") for c in br.caveats()} + assert "alpha_override_preserved" in topics + + def test_dcdh_shaped_bootstrap_stub_preserves_fit_ci(self): + """dCDH copies bootstrap se/p/conf_int into top-level fields without + ``inference_method``. The reviewer called this out specifically.""" + + class ChaisemartinDHaultfoeuilleResults: # name-keyed dispatch + pass + + stub = ChaisemartinDHaultfoeuilleResults() + stub.att = 1.5 + stub.se = 0.4 + stub.p_value = 0.02 + stub.conf_int = (0.72, 2.28) + stub.alpha = 0.05 + stub.n_obs = 200 + stub.n_treated = 80 + stub.n_control = 120 + stub.survey_metadata = None + stub.event_study_effects = None + stub.placebo_event_study = None + # dCDH carries bootstrap via a sub-object; top-level fields are + # the bootstrap-derived values, not analytic. + stub.bootstrap_results = type("DCDHBoot", (), {"n_bootstrap": 499})() + + br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + assert h["ci_level"] == 95 + assert h["ci_lower"] == pytest.approx(0.72) + assert h["ci_upper"] == pytest.approx(2.28) + + +class TestAnalyticalFiniteDfAlphaOverride: + """Round-7 regressions for the P0 finding that + ``_extract_headline`` was recomputing a normal-z CI on alpha + mismatch for analytical fits whose native inference used a finite + ``df`` (``DifferenceInDifferences`` / ``MultiPeriodDiD`` / TROP) + that BR cannot reproduce from ``(att, se)`` alone. The fix is to + always preserve the fitted CI on alpha mismatch. 
+ """ + + def test_analytical_did_result_preserves_native_ci(self): + from diff_diff import DifferenceInDifferences, generate_did_data + + df = generate_did_data(n_units=80, n_periods=4, treatment_effect=1.5, seed=7) + fit = DifferenceInDifferences().fit(df, outcome="outcome", treatment="treated", time="post") + native_lo, native_hi = fit.conf_int + + br = BusinessReport(fit, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + # Native 95% CI preserved — no z-based recomputation. + assert h["ci_level"] == 95 + assert h["ci_lower"] == pytest.approx(native_lo) + assert h["ci_upper"] == pytest.approx(native_hi) + topics = {c.get("topic") for c in br.caveats()} + assert "alpha_override_preserved" in topics + + def test_multiperiod_preserves_native_ci_on_alpha_override(self): + from diff_diff import MultiPeriodDiD, generate_did_data + + df = generate_did_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + fit = MultiPeriodDiD().fit( + df, + outcome="outcome", + treatment="treated", + time="period", + unit="unit", + reference_period=3, + ) + native_lo, native_hi = fit.avg_conf_int + + br = BusinessReport(fit, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + assert h["ci_level"] == 95 + assert h["ci_lower"] == pytest.approx(native_lo) + assert h["ci_upper"] == pytest.approx(native_hi) + + def test_undefined_df_survey_stub_does_not_invent_finite_ci(self): + """When the fit's native inference returned NaN (rank-deficient + replicate design: ``df_survey = 0``), BR must not recompute a + finite interval — the NaN signal must propagate through.""" + from types import SimpleNamespace + + class _UndefinedDfStub: + pass + + stub = _UndefinedDfStub() + stub.att = 1.0 + stub.se = float("nan") + stub.p_value = float("nan") + stub.conf_int = (float("nan"), float("nan")) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.inference_method = "analytical" + stub.survey_metadata = SimpleNamespace( + weight_type="replicate", + replicate_method="JK1", + effective_n=80.0, + design_effect=1.25, + sum_weights=100.0, + n_strata=None, + n_psu=None, + df_survey=0, + ) + + br = BusinessReport(stub, alpha=0.10, auto_diagnostics=False) + h = br.to_dict()["headline"] + # NaN bounds must propagate — BR must not invent a finite CI. + lo, hi = h["ci_lower"], h["ci_upper"] + assert lo is None or not np.isfinite(lo), f"ci_lower should be NaN/None, got {lo}" + assert hi is None or not np.isfinite(hi), f"ci_upper should be NaN/None, got {hi}" + + +class TestDCDHAssumptionTransitionBased: + """Regression for the round-5 P1 finding that + ``ChaisemartinDHaultfoeuilleResults`` was narrated with generic group- + time PT text instead of source-backed transition-based identification. + """ + + def test_dcdh_uses_transition_based_language(self): + class ChaisemartinDHaultfoeuilleResults: + pass + + obj = ChaisemartinDHaultfoeuilleResults() + obj.att = 1.0 + obj.se = 0.1 + obj.p_value = 0.001 + obj.conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 40 + obj.n_control = 60 + obj.survey_metadata = None + obj.event_study_effects = None + obj.placebo_event_study = None + obj.inference_method = "analytical" + + br = BusinessReport(obj, auto_diagnostics=False) + assumption = br.to_dict()["assumption"] + assert assumption["parallel_trends_variant"] == "transition_based" + desc = assumption["description"] + # Source-faithful: joiners/leavers/stable-control, dCDH paper attribution. 
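+        # Descriptive note (editor-added comment): the checks below are
+        # substring-based rather than exact-match, so the narrative wording
+        # can evolve without breaking this regression; the switcher
+        # vocabulary and the paper attribution are the pinned contract.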
+ assert "joiner" in desc.lower() + assert "leaver" in desc.lower() + assert "Chaisemartin" in desc or "D'Haultfoeuille" in desc + # Must NOT open with the generic group-time PT framing. The text + # may reference it inside a contrast clause ("not a single + # group-time ATT PT"), which is fine and intended. + assert not desc.startswith("Identification relies on parallel trends") + + +class TestCSVaryingBaseSensitivitySkipped: + """Regression for the round-5 P1 finding that DR would narrate HonestDiD + bounds as robust sensitivity for a CallawaySantAnna fit with + ``base_period='varying'`` (the CS default). The HonestDiD helper + explicitly warns that those bounds are not valid for interpretation; + DR must preemptively skip and surface the reason.""" + + def test_cs_varying_base_skips_sensitivity_with_reason(self): + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + stub.overall_att = 1.0 + stub.overall_se = 0.3 + stub.overall_p_value = 0.01 + stub.overall_conf_int = (0.4, 1.6) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.event_study_effects = None + stub.event_study_vcov = None + stub.event_study_vcov_index = None + stub.vcov = None + stub.interaction_indices = None + stub.base_period = "varying" + stub.inference_method = "analytical" + + from diff_diff import DiagnosticReport + + dr = DiagnosticReport(stub).run_all() + sens = dr.schema["sensitivity"] + assert sens["status"] == "skipped" + reason = sens["reason"] + assert "base_period" in reason and "universal" in reason + # And BR must surface this as a warning-severity caveat. + br = BusinessReport(stub, diagnostics=dr) + caveats = br.caveats() + topics = {c.get("topic") for c in caveats} + assert "sensitivity_skipped" in topics, ( + "BR must surface varying-base sensitivity skip as a caveat; " f"got topics {topics}" + ) + + +class TestCSVaryingBaseSensitivityRejectsPrecomputed: + """Round-44 P1 CI review on PR #318: ``precomputed['sensitivity']`` + (and BR's ``honest_did_results`` shorthand) previously bypassed the + varying-base CS guard — the applicability gate's precomputed early- + return unlocked the sensitivity section and BR/DR narrated the + Rambachan-Roth bounds as ordinary robustness on a displayed fit + whose consecutive-comparison pre-period surface has a different + interpretation than the bounds' provenance (REGISTRY.md + §CallawaySantAnna line 410, §HonestDiD line 2458). 
DR and BR must + reject at construction.""" + + def _cs_varying_base_stub(self): + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + stub.overall_att = 1.0 + stub.overall_se = 0.3 + stub.overall_p_value = 0.01 + stub.overall_conf_int = (0.4, 1.6) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.event_study_effects = None + stub.event_study_vcov = None + stub.event_study_vcov_index = None + stub.vcov = None + stub.interaction_indices = None + stub.base_period = "varying" + stub.inference_method = "analytical" + return stub + + def _dummy_sens_object(self): + from types import SimpleNamespace + + return SimpleNamespace( + M_values=[0.5, 1.0], + bounds=[(0.1, 2.0), (-0.2, 2.5)], + robust_cis=[(0.05, 2.1), (-0.3, 2.6)], + breakdown_M=0.75, + method="relative_magnitude", + original_estimate=1.0, + original_se=0.2, + alpha=0.05, + ) + + def test_dr_rejects_precomputed_sensitivity_on_varying_base_cs(self): + from diff_diff import DiagnosticReport + + stub = self._cs_varying_base_stub() + with pytest.raises(ValueError, match="base_period='universal'"): + DiagnosticReport(stub, precomputed={"sensitivity": self._dummy_sens_object()}) + + def test_dr_allows_precomputed_sensitivity_on_universal_base_cs(self): + """Universal-base CS + precomputed sensitivity is the supported + path — must not be rejected.""" + from diff_diff import DiagnosticReport + + stub = self._cs_varying_base_stub() + stub.base_period = "universal" + # Should not raise. + DiagnosticReport(stub, precomputed={"sensitivity": self._dummy_sens_object()}) + + def test_br_rejects_precomputed_sensitivity_on_varying_base_cs(self): + stub = self._cs_varying_base_stub() + with pytest.raises(ValueError, match="base_period='varying'"): + BusinessReport( + stub, + precomputed={"sensitivity": self._dummy_sens_object()}, + ) + + def test_br_rejects_honest_did_results_on_varying_base_cs(self): + """BR's ``honest_did_results`` shorthand must hit the same + rejection — it becomes ``precomputed['sensitivity']`` under the + hood, and the methodology problem is identical.""" + stub = self._cs_varying_base_stub() + with pytest.raises(ValueError, match="honest_did_results"): + BusinessReport( + stub, + honest_did_results=self._dummy_sens_object(), + ) + + def test_br_rejects_both_passthrough_inputs_names_them(self): + """When both passthrough inputs are supplied, the error must + name both so the user knows every input that was rejected.""" + stub = self._cs_varying_base_stub() + with pytest.raises(ValueError) as excinfo: + BusinessReport( + stub, + honest_did_results=self._dummy_sens_object(), + precomputed={"sensitivity": self._dummy_sens_object()}, + ) + msg = str(excinfo.value) + assert "honest_did_results" in msg + assert "precomputed['sensitivity']" in msg + + +class TestBaconCaveatEstimatorAware: + """Round-45 P1 CI review on PR #318: Goodman-Bacon decomposes TWFE + weights. On fits already produced by a heterogeneity-robust + estimator (CS / SA / BJS / Gardner / Wooldridge / EfficientDiD / + Stacked / dCDH / TripleDifference / StaggeredTripleDiff / SDiD / + TROP), a high forbidden-weight share says "TWFE would have been + materially biased on this rollout", not "the displayed estimator + needs to be replaced" — the displayed fit is already robust. + + BR's caveat must be estimator-aware: keep the "switch to a robust + estimator" recommendation for TWFE-style fits only. 
+ """ + + @staticmethod + def _bacon_schema_with_high_forbidden_weight(): + """Build a fake ``DiagnosticReportResults`` whose schema carries + a Bacon block with ``forbidden_weight > 0.10`` so BR's caveat + builder fires on the Bacon branch.""" + from diff_diff.diagnostic_report import DiagnosticReportResults + + schema = { + "schema_version": "1.0", + "estimator": {"class_name": "Stub", "display_name": "Stub"}, + "headline_metric": {}, + "parallel_trends": {"status": "skipped", "reason": "stub"}, + "pretrends_power": {"status": "skipped", "reason": "stub"}, + "sensitivity": {"status": "skipped", "reason": "stub"}, + "placebo": {"status": "skipped", "reason": "stub"}, + "bacon": { + "status": "ran", + "twfe_estimate": 1.2, + "weight_by_type": { + "treated_vs_never": 0.5, + "earlier_vs_later": 0.1, + "later_vs_earlier": 0.4, + }, + "forbidden_weight": 0.4, # > 0.10 threshold + "verdict": "materially_contaminated", + "n_timing_groups": 3, + }, + "design_effect": {"status": "skipped", "reason": "stub"}, + "heterogeneity": {"status": "skipped", "reason": "stub"}, + "epv": {"status": "skipped", "reason": "stub"}, + "estimator_native_diagnostics": {"status": "not_applicable"}, + "skipped": {}, + "warnings": [], + "overall_interpretation": "", + "next_steps": [], + } + return DiagnosticReportResults( + schema=schema, + interpretation="", + applicable_checks=("bacon",), + skipped_checks={}, + warnings=(), + ) + + @staticmethod + def _make_cs_like_stub(class_name: str): + cls = type(class_name, (), {}) + obj = cls() + obj.overall_att = 1.0 + obj.overall_se = 0.2 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 500 + obj.n_treated = 100 + obj.n_control_units = 400 + obj.survey_metadata = None + obj.event_study_effects = None + obj.inference_method = "analytical" + return obj + + @staticmethod + def _make_twfe_style_stub(): + class MultiPeriodDiDResults: + pass + + obj = MultiPeriodDiDResults() + obj.avg_att = 1.0 + obj.avg_se = 0.2 + obj.avg_p_value = 0.001 + obj.avg_conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 500 + obj.n_treated = 100 + obj.n_control = 400 + obj.survey_metadata = None + obj.pre_period_effects = None + obj.inference_method = "analytical" + return obj + + def test_cs_like_fit_does_not_recommend_switching_estimators(self): + """On an already-robust CS-style fit with high forbidden + Bacon weight, BR must not recommend switching to a robust + estimator — the displayed fit IS already robust.""" + stub = self._make_cs_like_stub("CallawaySantAnnaResults") + dr = self._bacon_schema_with_high_forbidden_weight() + br = BusinessReport(stub, diagnostics=dr) + caveats = br.caveats() + bacon_caveats = [c for c in caveats if c.get("topic") == "bacon_contamination"] + assert len(bacon_caveats) == 1, ( + f"High forbidden-weight Bacon must surface a caveat. " f"Got caveats: {caveats!r}" + ) + msg = bacon_caveats[0]["message"].lower() + # Must NOT tell the user to switch estimators. + assert "re-estimate with a heterogeneity-robust" not in msg, ( + f"CS is already heterogeneity-robust; must not recommend " + f"switching. Got message: {msg!r}" + ) + # Must frame this as a TWFE benchmark problem / rollout-design + # statement, not a displayed-fit-validity problem. + assert ( + "heterogeneity-robust" in msg + or "already" in msg + or "twfe benchmark" in msg + or "rollout design" in msg + ), f"CS Bacon caveat must reframe as rollout/TWFE issue. Got: {msg!r}" + # And the full-report rendering must reflect the softer wording. 
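+        # Descriptive note (editor-added comment): the caveat-dict check
+        # above covers the structured output; the check below covers the
+        # rendered markdown, so both consumer surfaces are pinned.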
+ md = br.full_report() + assert "Re-estimate with a heterogeneity-robust estimator" not in md, md + + def test_other_robust_estimators_also_avoid_switch_recommendation(self): + """Spot-check the same rule holds for multiple + heterogeneity-robust estimators on the Bacon path.""" + for class_name in ( + "SunAbrahamResults", + "ImputationDiDResults", + "TwoStageDiDResults", + "StackedDiDResults", + "WooldridgeDiDResults", + "ChaisemartinDHaultfoeuilleResults", + "EfficientDiDResults", + ): + stub = self._make_cs_like_stub(class_name) + dr = self._bacon_schema_with_high_forbidden_weight() + br = BusinessReport(stub, diagnostics=dr) + msgs = [c["message"] for c in br.caveats() if c.get("topic") == "bacon_contamination"] + assert msgs, f"{class_name}: Bacon caveat must fire" + assert ( + "Re-estimate with a heterogeneity-robust estimator" not in msgs[0] + ), f"{class_name} is already robust; must not recommend switching. Got: {msgs[0]!r}" + + def test_twfe_style_fit_keeps_switch_recommendation(self): + """The switch-to-robust recommendation is load-bearing for + genuinely TWFE-style fits and must be preserved there.""" + stub = self._make_twfe_style_stub() + dr = self._bacon_schema_with_high_forbidden_weight() + br = BusinessReport(stub, diagnostics=dr) + caveats = br.caveats() + bacon_caveats = [c for c in caveats if c.get("topic") == "bacon_contamination"] + assert len(bacon_caveats) == 1 + msg = bacon_caveats[0]["message"] + # TWFE-style fit: keep the explicit switch recommendation. + assert "Re-estimate with a heterogeneity-robust estimator" in msg, ( + f"MultiPeriodDiDResults (TWFE event-study) must keep the " + f"switch-to-robust recommendation. Got: {msg!r}" + ) + + +class TestBusinessReportSurveyDesignPassthrough: + """Round-40 P1 CI review on PR #318: ``BusinessReport`` must accept + ``survey_design`` and forward it to the auto-constructed + ``DiagnosticReport``, so Bacon replay on survey-backed fits is + fit-faithful and the simple 2x2 PT path skips with an explicit + reason rather than reporting an unweighted verdict for a weighted + estimate.""" + + def _did_with_survey(self): + from types import SimpleNamespace + + class DiDResults: + pass + + obj = DiDResults() + obj.att = 1.0 + obj.se = 0.2 + obj.t_stat = 5.0 + obj.p_value = 0.001 + obj.conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 400 + obj.n_treated = 100 + obj.n_control = 300 + obj.survey_metadata = SimpleNamespace( + design_effect=1.25, + effective_n=320.0, + weight_type="pweight", + n_strata=None, + n_psu=None, + df_survey=20.0, + replicate_method=None, + ) + obj.inference_method = "analytical" + return obj + + def _staggered_stub_with_survey(self): + from types import SimpleNamespace + + class CallawaySantAnnaResults: + pass + + obj = CallawaySantAnnaResults() + obj.overall_att = 1.0 + obj.overall_se = 0.2 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 600 + obj.n_treated = 200 + obj.n_control_units = 400 + obj.survey_metadata = SimpleNamespace( + design_effect=1.5, + effective_n=400.0, + weight_type="pweight", + n_strata=None, + n_psu=None, + df_survey=30.0, + replicate_method=None, + ) + obj.event_study_effects = None + return obj + + def test_survey_backed_did_br_rolls_up_pt_skip(self): + """BR's auto-constructed DR must skip the 2x2 PT helper on a + survey-backed DiDResults. 
BR's schema then surfaces the + skipped PT block with the survey-design reason (no unweighted + verdict leaks into the narrative).""" + import pandas as pd + + panel = pd.DataFrame( + { + "outcome": [1.0, 2.0, 1.1, 2.2], + "post": [0, 1, 0, 1], + "treated": [0, 0, 1, 1], + } + ) + obj = self._did_with_survey() + br = BusinessReport( + obj, + outcome_label="Revenue", + outcome_unit="$", + data=panel, + outcome="outcome", + time="post", + treatment="treated", + ) + schema = br.to_dict() + diag = schema.get("diagnostics", {}) + dr_schema = diag.get("schema", {}) if isinstance(diag, dict) else {} + pt_block = dr_schema.get("parallel_trends", {}) if isinstance(dr_schema, dict) else {} + # Round-40 schema: parallel_trends skipped with a survey-design + # reason rather than emitting an unweighted verdict. BR's auto + # path must honor the skip. + assert pt_block.get("status") == "skipped" + reason = (pt_block.get("reason") or "").lower() + assert "survey design" in reason + + def test_survey_backed_staggered_br_forwards_survey_design_to_bacon(self): + """BR must forward ``survey_design`` to the auto-constructed + DR, which in turn threads it to ``bacon_decompose``. Verify via + ``unittest.mock.patch`` that the kwarg reaches the decomposer. + """ + from unittest.mock import MagicMock, patch + + import pandas as pd + + panel = pd.DataFrame( + { + "outcome": [1.0, 2.0, 1.1, 2.2, 1.2, 2.3, 1.3, 2.4], + "unit": [1, 1, 2, 2, 3, 3, 4, 4], + "period": [1, 2, 1, 2, 1, 2, 1, 2], + "first_treat": [0, 0, 0, 0, 2, 2, 2, 2], + } + ) + obj = self._staggered_stub_with_survey() + sentinel_design = object() + fake_decomp = MagicMock() + fake_decomp.total_weight_treated_vs_never = 0.9 + fake_decomp.total_weight_earlier_vs_later = 0.05 + fake_decomp.total_weight_later_vs_earlier = 0.05 + fake_decomp.twfe_estimate = 1.1 + fake_decomp.n_timing_groups = 2 + with patch("diff_diff.bacon.bacon_decompose", return_value=fake_decomp) as m: + br = BusinessReport( + obj, + data=panel, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + survey_design=sentinel_design, + ) + br.to_dict() # trigger DR build + assert m.called, "bacon_decompose was not called" + _, kwargs = m.call_args + assert kwargs.get("survey_design") is sentinel_design + + def test_survey_backed_staggered_br_skips_bacon_without_survey_design(self): + """Without ``survey_design``, BR's DR must skip Bacon with the + survey-design reason (fit-faithful replay requires it).""" + import pandas as pd + + panel = pd.DataFrame( + { + "outcome": [1.0, 2.0, 1.1, 2.2, 1.2, 2.3, 1.3, 2.4], + "unit": [1, 1, 2, 2, 3, 3, 4, 4], + "period": [1, 2, 1, 2, 1, 2, 1, 2], + "first_treat": [0, 0, 0, 0, 2, 2, 2, 2], + } + ) + obj = self._staggered_stub_with_survey() + br = BusinessReport( + obj, + data=panel, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + # survey_design intentionally omitted + ) + schema = br.to_dict() + diag = schema.get("diagnostics", {}) + dr_schema = diag.get("schema", {}) if isinstance(diag, dict) else {} + bacon_block = dr_schema.get("bacon", {}) if isinstance(dr_schema, dict) else {} + assert bacon_block.get("status") == "skipped" + reason = (bacon_block.get("reason") or "").lower() + assert "survey design" in reason + + +class TestBusinessReportPrecomputedPassthrough: + """Round-43 P2 CI review on PR #318: ``BusinessReport`` must accept + the ``precomputed=`` dict that its docs advertise and forward every + key to the auto-constructed ``DiagnosticReport``. 
DR owns key + validation (rejects unsupported keys and estimator-incompatible + entries).""" + + def _did_with_survey(self): + from types import SimpleNamespace + + class DiDResults: + pass + + obj = DiDResults() + obj.att = 1.0 + obj.se = 0.2 + obj.t_stat = 5.0 + obj.p_value = 0.001 + obj.conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 400 + obj.n_treated = 100 + obj.n_control = 300 + obj.survey_metadata = SimpleNamespace( + design_effect=1.25, + effective_n=320.0, + weight_type="pweight", + n_strata=None, + n_psu=None, + df_survey=20.0, + replicate_method=None, + ) + obj.inference_method = "analytical" + return obj + + def test_br_accepts_precomputed_parallel_trends_on_survey_did(self): + """The BR docs advertise + ``precomputed={'parallel_trends': ...}`` as the opt-in for + survey-backed 2x2 PT. That contract must actually be + reachable from BR's constructor, not just DR's.""" + obj = self._did_with_survey() + precomputed_pt = { + "p_value": 0.42, + "treated_trend": 0.05, + "control_trend": 0.04, + "trend_difference": 0.01, + "t_statistic": 0.8, + } + br = BusinessReport( + obj, + outcome_label="Revenue", + outcome_unit="$", + precomputed={"parallel_trends": precomputed_pt}, + ) + schema = br.to_dict() + dr_schema = schema["diagnostics"]["schema"] + pt = dr_schema["parallel_trends"] + assert pt["status"] == "ran", ( + "BR precomputed PT passthrough must unlock the otherwise-" + f"skipped survey-backed 2x2 path. Got PT block: {pt!r}" + ) + assert pt["joint_p_value"] == pytest.approx(0.42) + + def test_br_rejects_unsupported_precomputed_key(self): + """DR validates the precomputed key set; BR must raise the same + error rather than silently dropping unsupported keys.""" + obj = self._did_with_survey() + with pytest.raises(ValueError, match="precomputed="): + BusinessReport(obj, precomputed={"unknown_check": object()}) + + def test_br_explicit_precomputed_sensitivity_wins_over_honest_did_results(self): + """When both ``honest_did_results`` (shorthand) and explicit + ``precomputed['sensitivity']`` are supplied, the explicit + passthrough wins. 
Documented contract: honest_did_results is a + convenience that only kicks in when no explicit sensitivity is + present.""" + from types import SimpleNamespace + + class MultiPeriodDiDResults: + pass + + obj = MultiPeriodDiDResults() + obj.avg_att = 1.0 + obj.avg_se = 0.1 + obj.avg_p_value = 0.001 + obj.avg_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 40 + obj.n_control = 60 + obj.survey_metadata = None + obj.pre_period_effects = {} + obj.vcov = None + obj.interaction_indices = None + obj.event_study_vcov = None + obj.event_study_vcov_index = None + + explicit_sens = SimpleNamespace( + M_values=[0.5, 1.0], + bounds=[(0.1, 2.0), (-0.2, 2.5)], + robust_cis=[(0.05, 2.1), (-0.3, 2.6)], + breakdown_M=0.87, # sentinel — the explicit one's breakdown + method="relative_magnitude", + original_estimate=1.0, + original_se=0.1, + alpha=0.05, + ) + shorthand_sens = SimpleNamespace( + M_values=[0.5, 1.0], + bounds=[(0.1, 2.0), (-0.2, 2.5)], + robust_cis=[(0.05, 2.1), (-0.3, 2.6)], + breakdown_M=0.33, # different sentinel + method="relative_magnitude", + original_estimate=1.0, + original_se=0.1, + alpha=0.05, + ) + br = BusinessReport( + obj, + honest_did_results=shorthand_sens, + precomputed={"sensitivity": explicit_sens}, + ) + dr_schema = br.to_dict()["diagnostics"]["schema"] + sens = dr_schema["sensitivity"] + assert sens.get("breakdown_M") == pytest.approx(0.87), ( + "Explicit precomputed['sensitivity'] must win over " + "honest_did_results shorthand. Got sens block: " + repr(sens) + ) diff --git a/tests/test_diagnostic_report.py b/tests/test_diagnostic_report.py new file mode 100644 index 00000000..2f3800a6 --- /dev/null +++ b/tests/test_diagnostic_report.py @@ -0,0 +1,2324 @@ +"""Tests for ``diff_diff.diagnostic_report.DiagnosticReport``. + +Covers: +- Schema contract: every top-level key always present, stable enum values. +- Applicability matrix: per-estimator ``applicable_checks`` property. +- JSON round-trip. +- ``precomputed=`` passthrough (sensitivity). +- Pre-trends verdict thresholds (three bins). +- Power-aware tier thresholds (three bins + fallback). +- DEFF reads from ``survey_metadata`` when present. +- EfficientDiD ``hausman_pretest`` pathway. +- SDiD / TROP native diagnostics. +- Error-doesn't-break-report (diagnostic raises -> section records error). 
+""" + +from __future__ import annotations + +import json +import warnings +from unittest.mock import patch + +import numpy as np +import pytest + +import diff_diff as dd +from diff_diff import ( + CallawaySantAnna, + DiagnosticReport, + DiagnosticReportResults, + DifferenceInDifferences, + EfficientDiD, + MultiPeriodDiD, + SyntheticDiD, + generate_did_data, + generate_factor_data, + generate_staggered_data, +) +from diff_diff.diagnostic_report import ( + DIAGNOSTIC_REPORT_SCHEMA_VERSION, + _power_tier, + _pt_verdict, +) + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- +_TOP_LEVEL_KEYS = { + "schema_version", + "estimator", + "headline_metric", + "parallel_trends", + "pretrends_power", + "sensitivity", + "placebo", + "bacon", + "design_effect", + "heterogeneity", + "epv", + "estimator_native_diagnostics", + "skipped", + "warnings", + "overall_interpretation", + "next_steps", +} + +_STATUS_ENUM = { + "ran", + "skipped", + "error", + "not_applicable", + "not_run", + "computed", +} + + +@pytest.fixture(scope="module") +def did_fit(): + warnings.filterwarnings("ignore") + df = generate_did_data(n_units=80, n_periods=4, treatment_effect=1.5, seed=7) + did = DifferenceInDifferences().fit(df, outcome="outcome", treatment="treated", time="post") + return did, df + + +@pytest.fixture(scope="module") +def multi_period_fit(): + warnings.filterwarnings("ignore") + df = generate_did_data(n_units=80, n_periods=8, treatment_effect=1.5, seed=7) + es = MultiPeriodDiD().fit( + df, + outcome="outcome", + treatment="treated", + time="period", + unit="unit", + reference_period=3, + ) + return es, df + + +@pytest.fixture(scope="module") +def cs_fit(): + warnings.filterwarnings("ignore") + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + # Use base_period='universal' so HonestDiD sensitivity can run on this + # fixture. CS's default is 'varying', which DR now skips with a + # methodology-critical reason (Rambachan-Roth bounds are not valid for + # interpretation on consecutive-comparison pre-periods). See the + # round-5 CI review on PR #318. + cs = CallawaySantAnna(base_period="universal").fit( + sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + aggregate="event_study", + ) + return cs, sdf + + +@pytest.fixture(scope="module") +def edid_fit(): + warnings.filterwarnings("ignore") + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + edid = EfficientDiD().fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + return edid, sdf + + +@pytest.fixture(scope="module") +def sdid_fit(): + warnings.filterwarnings("ignore") + fdf = generate_factor_data(n_units=25, n_pre=8, n_post=4, n_treated=4, seed=11) + sdid = SyntheticDiD().fit(fdf, outcome="outcome", unit="unit", time="period", treatment="treat") + return sdid, fdf + + +# --------------------------------------------------------------------------- +# Schema contract +# --------------------------------------------------------------------------- +class TestSchemaContract: + """The AI-legible schema is the public promise. 
These tests lock it down.""" + + def test_every_top_level_key_present_did(self, did_fit): + fit, df = did_fit + dr = DiagnosticReport(fit, data=df, outcome="outcome", treatment="treated", time="post") + schema = dr.to_dict() + assert set(schema.keys()) == _TOP_LEVEL_KEYS + + def test_every_top_level_key_present_multiperiod(self, multi_period_fit): + fit, _ = multi_period_fit + schema = DiagnosticReport(fit).to_dict() + assert set(schema.keys()) == _TOP_LEVEL_KEYS + + def test_every_top_level_key_present_cs(self, cs_fit): + fit, sdf = cs_fit + schema = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ).to_dict() + assert set(schema.keys()) == _TOP_LEVEL_KEYS + + def test_every_top_level_key_present_sdid(self, sdid_fit): + fit, _ = sdid_fit + schema = DiagnosticReport(fit).to_dict() + assert set(schema.keys()) == _TOP_LEVEL_KEYS + + def test_schema_version_constant(self, multi_period_fit): + fit, _ = multi_period_fit + schema = DiagnosticReport(fit).to_dict() + assert schema["schema_version"] == DIAGNOSTIC_REPORT_SCHEMA_VERSION + assert DIAGNOSTIC_REPORT_SCHEMA_VERSION == "1.0" + + def test_all_statuses_use_closed_enum(self, cs_fit): + fit, sdf = cs_fit + schema = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ).to_dict() + for key in [ + "parallel_trends", + "pretrends_power", + "sensitivity", + "placebo", + "bacon", + "design_effect", + "heterogeneity", + "epv", + "estimator_native_diagnostics", + ]: + section = schema.get(key) + assert isinstance(section, dict), f"{key} missing" + assert ( + section.get("status") in _STATUS_ENUM + ), f"{key}.status = {section.get('status')!r} not in {_STATUS_ENUM}" + + def test_json_round_trip_multiperiod(self, multi_period_fit): + fit, _ = multi_period_fit + dr = DiagnosticReport(fit) + dumped = json.dumps(dr.to_dict()) + assert len(dumped) > 0 + round = json.loads(dumped) + assert round["schema_version"] == DIAGNOSTIC_REPORT_SCHEMA_VERSION + + def test_json_round_trip_sdid(self, sdid_fit): + fit, _ = sdid_fit + dumped = json.dumps(DiagnosticReport(fit).to_dict()) + assert len(dumped) > 0 + + +# --------------------------------------------------------------------------- +# Applicability matrix +# --------------------------------------------------------------------------- +class TestApplicabilityMatrix: + """Per-estimator applicability set filtered by instance state + options.""" + + def test_did_without_data_skips_pt(self, did_fit): + fit, _ = did_fit + dr = DiagnosticReport(fit) # no data + assert "parallel_trends" not in dr.applicable_checks + assert "parallel_trends" in dr.skipped_checks + reason = dr.skipped_checks["parallel_trends"] + assert "data" in reason.lower() + + def test_did_with_data_runs_pt(self, did_fit): + fit, df = did_fit + dr = DiagnosticReport(fit, data=df, outcome="outcome", treatment="treated", time="post") + assert "parallel_trends" in dr.applicable_checks + + def test_did_with_data_but_no_column_kwargs_skips_pt(self, did_fit): + """Round-11 regression: ``applicable_checks`` must match the + runner's full argument contract. 
2x2 PT needs ``data`` AND + ``outcome`` / ``time`` / ``treatment`` — not just ``data``.""" + fit, df = did_fit + dr = DiagnosticReport(fit, data=df) # missing column kwargs + assert "parallel_trends" not in dr.applicable_checks + reason = dr.skipped_checks["parallel_trends"] + assert "outcome" in reason + assert "time" in reason + assert "treatment" in reason + + def test_bacon_applicability_requires_all_column_kwargs(self, cs_fit): + """Round-11 regression: Bacon needs the full ``outcome`` / ``time`` + / ``unit`` / ``first_treat`` contract from ``bacon_decompose``.""" + fit, sdf = cs_fit + dr = DiagnosticReport( + fit, + data=sdf, + first_treat="first_treat", + # intentionally omit outcome / time / unit + ) + assert "bacon" not in dr.applicable_checks + reason = dr.skipped_checks["bacon"] + assert "outcome" in reason or "time" in reason or "unit" in reason + + def test_multiperiod_runs_pt_and_power_and_sensitivity(self, multi_period_fit): + fit, _ = multi_period_fit + dr = DiagnosticReport(fit) + applicable = set(dr.applicable_checks) + assert "parallel_trends" in applicable + assert "pretrends_power" in applicable + assert "sensitivity" in applicable + + def test_cs_runs_heterogeneity(self, cs_fit): + fit, sdf = cs_fit + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + applicable = set(dr.applicable_checks) + assert "heterogeneity" in applicable + assert "bacon" in applicable + assert "parallel_trends" in applicable + + def test_sdid_has_estimator_native(self, sdid_fit): + fit, _ = sdid_fit + dr = DiagnosticReport(fit) + assert "estimator_native" in dr.applicable_checks + + def test_run_opt_outs_move_checks_to_skipped(self, multi_period_fit): + fit, _ = multi_period_fit + dr = DiagnosticReport(fit, run_sensitivity=False) + assert "sensitivity" not in dr.applicable_checks + assert dr.skipped_checks["sensitivity"].startswith("run_sensitivity=False") + + def test_placebo_is_reserved_and_skipped(self, did_fit): + """Placebo is always in _CHECK_NAMES, always skipped in MVP. + + Round-26 P3: tightened from ``status in {"skipped", + "not_applicable"}`` to exact ``status == "skipped"`` because + both REPORTING.md §MVP scope and the implementation + (``_compute_applicable_checks`` always seeds ``"placebo"`` into + ``skipped``) now pin the MVP contract to a single value. + """ + fit, df = did_fit + dr = DiagnosticReport(fit, data=df, outcome="outcome", treatment="treated", time="post") + placebo_section = dr.to_dict()["placebo"] + assert placebo_section["status"] == "skipped" + assert isinstance(placebo_section.get("reason"), str) and placebo_section["reason"] + + +# --------------------------------------------------------------------------- +# Precomputed passthrough +# --------------------------------------------------------------------------- +class TestPrecomputed: + def test_precomputed_sensitivity_is_used_verbatim(self, multi_period_fit): + fit, _ = multi_period_fit + + # Construct a minimal SensitivityResults-shaped object the formatter recognizes. 
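+        # Descriptive note (editor-added comment): class attributes stand in
+        # for the SensitivityResults fields; the patch below then verifies
+        # HonestDiD.sensitivity_analysis is never invoked when this object
+        # arrives via precomputed=.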
+ class _FakeSens: + M_values = np.array([0.5, 1.0]) + bounds = [(0.1, 2.0), (-0.2, 2.5)] + robust_cis = [(0.05, 2.1), (-0.3, 2.6)] + breakdown_M = 0.75 + method = "relative_magnitude" + original_estimate = 1.0 + original_se = 0.2 + alpha = 0.05 + + fake = _FakeSens() + with patch("diff_diff.honest_did.HonestDiD.sensitivity_analysis") as mock: + dr = DiagnosticReport(fit, precomputed={"sensitivity": fake}) + dr.to_dict() + mock.assert_not_called() + schema = dr.to_dict() + assert schema["sensitivity"]["status"] == "ran" + assert schema["sensitivity"]["breakdown_M"] == 0.75 + + def test_precomputed_pretrends_power_parity_with_default_path(self, cs_fit): + """Round-20 P1 regression: ``precomputed={"pretrends_power": ...}`` + must apply the same covariance-source annotation and conservative + diagonal-fallback downgrade as ``_check_pretrends_power``. Otherwise + the same fit can be labeled ``well_powered`` through the precomputed + path and ``moderately_powered`` through the default path. + """ + from diff_diff.pretrends import compute_pretrends_power + + fit, data = cs_fit + + # Precompute the power result from the same fit. The compute function + # populates ``original_results`` on the output so DR's precomputed + # adapter can inspect the source fit's event_study_vcov. + pp = compute_pretrends_power(fit, alpha=0.05, target_power=0.80, violation_type="linear") + assert getattr(pp, "original_results", None) is fit + + dr_default = DiagnosticReport(fit, data=data).to_dict() + dr_precomputed = DiagnosticReport( + fit, data=data, precomputed={"pretrends_power": pp} + ).to_dict() + + default_block = dr_default["pretrends_power"] + precomp_block = dr_precomputed["pretrends_power"] + + # Both paths are "ran"; the precomputed path flags itself with + # ``precomputed=True`` while the default path sets ``method= + # compute_pretrends_power``. + assert default_block["status"] == "ran" + assert precomp_block["status"] == "ran" + assert precomp_block.get("precomputed") is True + + # Tier and covariance_source must agree across paths so downstream + # BR prose does not diverge based on which path produced the block. + assert default_block["tier"] == precomp_block["tier"] + assert default_block["covariance_source"] == precomp_block["covariance_source"] + + def test_precomputed_pretrends_power_downgrades_when_full_vcov_unused(self): + """Stub-based regression: when the source fit has both + ``event_study_vcov`` and ``event_study_vcov_index`` populated but + the diagonal fallback was used, the precomputed adapter must emit + ``covariance_source='diag_fallback_available_full_vcov_unused'`` and + downgrade a ``well_powered`` tier to ``moderately_powered`` — just + like the default compute path. Complements the live-fit parity test + by exercising the tier-bumping edge explicitly. + """ + + # Minimal CS-shaped stub with full vcov flagged. 
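+        # Descriptive note (editor-added comment): both event_study_vcov and
+        # event_study_vcov_index are populated on the source-fit stub, which
+        # is what lets the precomputed adapter conclude a full covariance
+        # matrix was available but went unused.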
+ class _CSStub: + overall_att = 1.0 + overall_se = 0.25 + overall_t_stat = 4.0 + overall_p_value = 0.001 + overall_conf_int = (0.5, 1.5) + alpha = 0.05 + n_obs = 400 + n_treated = 80 + n_control = 320 + survey_metadata = None + event_study_effects = None + event_study_vcov = np.eye(3) + event_study_vcov_index = {-2: 0, -1: 1, 0: 2} + + stub = _CSStub() + stub.__class__.__name__ = "CallawaySantAnnaResults" + + class _PPStub: + mdv = 0.1 # |ATT| = 1.0 -> ratio = 0.1 -> well_powered before downgrade + violation_type = "linear" + alpha = 0.05 + target_power = 0.80 + violation_magnitude = 0.1 + power = 0.80 + n_pre_periods = 2 + original_results = stub + + dr = DiagnosticReport(stub, precomputed={"pretrends_power": _PPStub()}) + block = dr.to_dict()["pretrends_power"] + assert block["status"] == "ran" + assert block["covariance_source"] == "diag_fallback_available_full_vcov_unused" + # Downgrade must apply: pre-tier is well_powered, post-tier is moderately_powered. + assert block["tier"] == "moderately_powered" + + def test_precomputed_parallel_trends_bypasses_applicability_gate(self, cs_fit): + """Round-22 P1 regression: ``precomputed["parallel_trends"]`` was + documented as supported but ``_instance_skip_reason`` skipped the + PT check on applicability grounds (missing raw panel / columns + for the event-study replay, non-replayable EfficientDiD fits, + etc.) BEFORE the precomputed runner could fire. The fix + short-circuits the gate when the precomputed key is present so + advertised passthroughs actually land on the runner. + """ + fit, _ = cs_fit + precomputed_pt = { + "status": "ran", + "method": "event_study", + "joint_p_value": 0.42, + "n_pre_periods": 3, + "verdict": "no_detected_violation", + } + + # Without passing ``data`` + column kwargs, the applicability + # gate would previously have marked PT as skipped. With the + # precomputed override, it must land on the formatter instead. + dr = DiagnosticReport(fit, precomputed={"parallel_trends": precomputed_pt}) + pt_block = dr.to_dict()["parallel_trends"] + assert pt_block["status"] == "ran", ( + f"precomputed parallel_trends must bypass the applicability gate. " + f"Got status={pt_block.get('status')}, reason={pt_block.get('reason')}" + ) + + def test_precomputed_parallel_trends_preserves_schema_shaped_joint_p(self, cs_fit): + """Round-23 P1 regression: schema-shaped PT dicts with + ``joint_p_value`` (the key emitted by the default DR path and + the shape users are most likely to replay from one DR to + another) must land on ``joint_p_value`` in the output, not + silently fall through to ``None``. Prior formatter read only + ``p_value``, so a dict with ``joint_p_value=0.42`` was + degraded to ``joint_p_value=None`` / ``verdict="inconclusive"``. + """ + fit, _ = cs_fit + dr = DiagnosticReport( + fit, + precomputed={ + "parallel_trends": { + "joint_p_value": 0.42, + "test_statistic": 5.6, + "df": 3, + "method": "hausman", + } + }, + ) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + assert pt["method"] == "hausman" + assert ( + pt["joint_p_value"] == 0.42 + ), f"joint_p_value must survive formatting; got {pt.get('joint_p_value')}" + assert pt["test_statistic"] == 5.6 + assert pt["df"] == 3 + # Verdict must be derived from the surviving p-value, not None. 
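+        # Descriptive note (editor-added comment): the assertion below is
+        # deliberately loose (any verdict other than "inconclusive" passes),
+        # pinning p-value propagation without pinning the verdict bin itself.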
+ assert pt["verdict"] != "inconclusive" + + def test_precomputed_parallel_trends_accepts_native_hausman_result(self, cs_fit): + """Round-23 P1 regression: ``_pt_hausman`` tells users with + non-replayable EfficientDiD fits to pass a precomputed pretest + result, but the formatter previously rejected non-dict inputs + outright. The ``HausmanPretestResult`` dataclass — the exact + object ``EfficientDiD.hausman_pretest(...)`` returns — must + now pass through with ``statistic`` / ``p_value`` / ``df`` + preserved on the schema. + """ + from types import SimpleNamespace + + fit, _ = cs_fit + # Mirror HausmanPretestResult: the key fields are ``statistic``, + # ``p_value``, ``df``. Uses SimpleNamespace so the test does + # not need EfficientDiD's construction path. + hausman = SimpleNamespace( + statistic=7.2, + p_value=0.065, + df=3, + reject=False, + alpha=0.05, + att_all=1.0, + att_post=1.05, + recommendation="pt_all", + ) + dr = DiagnosticReport(fit, precomputed={"parallel_trends": hausman}) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran", ( + f"Native HausmanPretestResult must be accepted; got " + f"status={pt.get('status')}, reason={pt.get('reason')}" + ) + assert pt["joint_p_value"] == 0.065 + # ``statistic`` on the source object maps to ``test_statistic`` + # in the emitted schema (matches the default ``_pt_hausman`` + # path that also exposes it as ``test_statistic``). + assert pt["test_statistic"] == 7.2 + assert pt["df"] == 3 + + def test_precomputed_pt_infers_slope_difference_method_for_raw_2x2_dict(self, cs_fit): + """Round-26 P2 regression: a raw ``utils.check_parallel_trends()`` + dict (no ``method`` key, has ``trend_difference`` / p_value) must + be recognized as the slope-difference 2x2 path and render with + the single-statistic ``p`` label, not the generic ``joint p`` + wording that ``"precomputed"`` falls through to. + """ + from diff_diff import BusinessReport + from diff_diff.diagnostic_report import DiagnosticReportResults + + fit, _ = cs_fit + raw_2x2 = { + "treated_trend": 0.1, + "treated_trend_se": 0.05, + "control_trend": 0.08, + "control_trend_se": 0.04, + "trend_difference": 0.02, + "trend_difference_se": 0.06, + "t_statistic": 0.33, + "p_value": 0.40, + "parallel_trends_plausible": True, + } + dr = DiagnosticReport(fit, precomputed={"parallel_trends": raw_2x2}) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + assert pt["method"] == "slope_difference", ( + f"Raw check_parallel_trends dict must infer " + f"method='slope_difference'; got {pt.get('method')!r}" + ) + + # Markdown prose must use the single-statistic ``p`` label + # (not ``joint p``, which is Wald / Bonferroni-specific). + br_dr = DiagnosticReportResults( + schema=dr.to_dict(), + interpretation="", + applicable_checks=("parallel_trends",), + skipped_checks={}, + warnings=(), + ) + md = BusinessReport(fit, diagnostics=br_dr).full_report() + pt_section = md.split("## Pre-Trends", 1)[1].split("\n## ", 1)[0] + assert "joint p" not in pt_section + assert "p = 0.4" in pt_section + + def test_precomputed_pt_infers_hausman_method_for_native_object(self, cs_fit): + """Round-26 P2 regression: a native Hausman-like object without + an explicit ``method`` tag (``HausmanPretestResult`` shape: + ``statistic`` + ``att_all`` / ``att_post`` / ``recommendation``) + must be recognized as the Hausman path and render with the + single-statistic ``p`` label, not ``joint p``. 
+ """ + from types import SimpleNamespace + + from diff_diff import BusinessReport + from diff_diff.diagnostic_report import DiagnosticReportResults + + fit, _ = cs_fit + hausman_like = SimpleNamespace( + statistic=4.5, + p_value=0.21, + df=3, + reject=False, + alpha=0.05, + att_all=1.0, + att_post=1.1, + recommendation="pt_all", + # Note: no ``method`` attribute — tests the inference path. + ) + dr = DiagnosticReport(fit, precomputed={"parallel_trends": hausman_like}) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + assert pt["method"] == "hausman", ( + f"Native Hausman-like object must infer method='hausman'; " f"got {pt.get('method')!r}" + ) + assert pt["test_statistic"] == 4.5 + assert pt["joint_p_value"] == 0.21 + + # Markdown prose must use the single-statistic ``p`` label. + br_dr = DiagnosticReportResults( + schema=dr.to_dict(), + interpretation="", + applicable_checks=("parallel_trends",), + skipped_checks={}, + warnings=(), + ) + md = BusinessReport(fit, diagnostics=br_dr).full_report() + pt_section = md.split("## Pre-Trends", 1)[1].split("\n## ", 1)[0] + assert "joint p" not in pt_section + assert "p = 0.21" in pt_section + + def test_precomputed_pt_explicit_method_wins_over_inference(self, cs_fit): + """Explicit ``method`` in the input must never be overridden by + the heuristic inference (defensive: e.g., a user passes a + schema-shaped dict labeled ``method='event_study'`` where the + ``trend_difference`` markers would otherwise suggest + slope_difference). + """ + fit, _ = cs_fit + spoofed = { + "method": "event_study", + "joint_p_value": 0.42, + "trend_difference": 0.02, # would otherwise trigger slope_difference inference + } + dr = DiagnosticReport(fit, precomputed={"parallel_trends": spoofed}) + assert dr.to_dict()["parallel_trends"]["method"] == "event_study" + + def test_precomputed_parallel_trends_rejects_input_without_p_value(self, cs_fit): + """Inputs without any recognized p-value field (neither + ``joint_p_value`` nor ``p_value``) must surface a clear error, + not silently land on ``joint_p_value=None``. Keeps the formatter + permissive about absent ``test_statistic`` / ``df`` (2x2 PT has + neither) while catching obviously-wrong inputs. + """ + fit, _ = cs_fit + dr = DiagnosticReport(fit, precomputed={"parallel_trends": {"method": "event_study"}}) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "error" + assert "joint_p_value" in pt["reason"] or "p_value" in pt["reason"] + + def test_precomputed_bacon_bypasses_applicability_gate(self, cs_fit): + """Round-22 P1 regression: ``precomputed["bacon"]`` was + documented as supported but ``_instance_skip_reason`` skipped + Bacon on applicability grounds (``data`` / column kwargs missing) + before the runner could fire. Users with an already-computed + ``BaconDecompositionResults`` must be able to pass it through + without re-supplying the raw panel. + """ + from types import SimpleNamespace + + fit, _ = cs_fit + precomputed_bacon = SimpleNamespace( + weights=None, + att=1.2, + comparison_types={}, + total_weight_later_vs_earlier=0.02, + ) + + dr = DiagnosticReport(fit, precomputed={"bacon": precomputed_bacon}) + bacon_block = dr.to_dict()["bacon"] + assert bacon_block["status"] == "ran", ( + f"precomputed bacon must bypass the applicability gate. 
" + f"Got status={bacon_block.get('status')}, reason={bacon_block.get('reason')}" + ) + + def test_precomputed_single_m_sensitivity_exposes_original_estimate_and_se(self, cs_fit): + """Pre-emptive audit regression: ``_format_precomputed_sensitivity`` + used to drop ``original_estimate`` and ``original_se`` on the + single-M ``HonestDiDResults`` branch, even though both + ``SensitivityResults`` and ``HonestDiDResults`` carry those fields. + The grid branch surfaces them via ``_format_sensitivity_results``, + so dropping them on the single-M branch made the schema shape + dependent on which object type the user passed. Parity fix: the + single-M branch now carries the same fields. + """ + from types import SimpleNamespace + + fit, _ = cs_fit + single_m = SimpleNamespace( + lb=0.3, + ub=1.8, + ci_lb=0.15, + ci_ub=1.95, + M=1.0, + method="relative_magnitude", + original_estimate=1.05, + original_se=0.22, + alpha=0.05, + ) + + block = DiagnosticReport(fit, precomputed={"sensitivity": single_m}).to_dict()[ + "sensitivity" + ] + assert block["status"] == "ran" + assert block["conclusion"] == "single_M_precomputed" + # Parity with the grid branch: these fields must be present and + # reflect the passed object's values. + assert block["original_estimate"] == 1.05 + assert block["original_se"] == 0.22 + + +# --------------------------------------------------------------------------- +# Verdict / tier helpers +# --------------------------------------------------------------------------- +class TestJointWaldAlignment: + """Cover the event-study PT joint-Wald vs Bonferroni fallback paths. + + These tests address the correctness-sensitive codepath in + ``_pt_event_study`` where pre-period coefficient keys must align with + ``interaction_indices`` before the joint Wald statistic can be indexed + into the right vcov rows/columns. When alignment fails, the code must + fall back to Bonferroni rather than compute a Wald statistic on the + wrong rows. + """ + + @staticmethod + def _stub_result(pre_effects, interaction_indices, vcov, **extra): + """Build a minimal MultiPeriodDiDResults-shaped stub for PT tests. + + ``pre_effects`` is an iterable of ``(period_key, effect, se, p_value)`` + tuples. Returns an object whose class name is ``MultiPeriodDiDResults`` + so DR's name-keyed dispatch routes it to the event-study PT path. 
+ """ + from types import SimpleNamespace + + pre_map = { + k: SimpleNamespace(effect=eff, se=se, p_value=p) for (k, eff, se, p) in pre_effects + } + + class MultiPeriodDiDResults: # noqa: D401 — test stub that mimics the real class name + pass + + obj = MultiPeriodDiDResults() + obj.pre_period_effects = pre_map + obj.interaction_indices = interaction_indices + obj.vcov = np.asarray(vcov, dtype=float) if vcov is not None else None + obj.avg_att = 1.0 + obj.avg_se = 0.1 + obj.avg_p_value = 0.001 + obj.avg_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 50 + obj.n_control = 50 + obj.survey_metadata = None + for k, v in extra.items(): + setattr(obj, k, v) + return obj + + def test_joint_wald_runs_when_keys_align(self): + """With aligned pre_effects + interaction_indices + vcov, Wald runs + and the computed chi-squared statistic matches the closed form.""" + pre = [(-3, 0.0, 0.5, 0.99), (-2, 0.0, 0.5, 0.99), (-1, 0.0, 0.5, 0.99)] + interaction_indices = {-3: 0, -2: 1, -1: 2, 0: 3} # maps period -> vcov row + vcov = np.diag([0.25, 0.25, 0.25, 0.25]) # SE = 0.5 for each pre-period + stub = self._stub_result(pre, interaction_indices, vcov) + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + assert ( + pt["method"] == "joint_wald" + ), f"Expected joint_wald with aligned keys; got {pt.get('method')}" + # beta=0 across all periods -> test_statistic = 0 -> p = 1.0 + assert pt["test_statistic"] == pytest.approx(0.0) + assert pt["joint_p_value"] == pytest.approx(1.0) + assert pt["df"] == 3 + + def test_joint_wald_computes_expected_statistic(self): + """Verify the Wald statistic matches a known closed-form value.""" + # beta = [1.0, -0.5, 0.2]; vcov diagonal with variances [0.25, 0.25, 0.16] + # -> test_statistic = 1.0^2/0.25 + 0.5^2/0.25 + 0.2^2/0.16 + # = 4.0 + 1.0 + 0.25 = 5.25 + pre = [(-3, 1.0, 0.5, 0.04), (-2, -0.5, 0.5, 0.30), (-1, 0.2, 0.4, 0.61)] + interaction_indices = {-3: 0, -2: 1, -1: 2} + vcov = np.diag([0.25, 0.25, 0.16]) + stub = self._stub_result(pre, interaction_indices, vcov) + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + assert pt["method"] == "joint_wald" + assert pt["test_statistic"] == pytest.approx(5.25, rel=1e-6) + + def test_falls_back_to_bonferroni_without_interaction_indices(self): + pre = [(-2, 1.0, 0.5, 0.04), (-1, 0.2, 0.5, 0.69)] + stub = self._stub_result(pre, interaction_indices=None, vcov=np.diag([0.25, 0.25])) + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + assert pt["method"] == "bonferroni", ( + "Missing interaction_indices must force Bonferroni fallback, " + "never attempt a Wald statistic on misaligned rows." 
+ ) + # Bonferroni: min(per-period p) * n = 0.04 * 2 = 0.08 (< 1) + assert pt["joint_p_value"] == pytest.approx(0.08, rel=1e-6) + + def test_falls_back_to_bonferroni_when_keys_misaligned(self): + """pre_effects has keys [-2, -1] but interaction_indices uses [2019, 2020].""" + pre = [(-2, 1.0, 0.5, 0.04), (-1, 0.2, 0.5, 0.69)] + interaction_indices = {2019: 0, 2020: 1} # deliberately different namespace + vcov = np.diag([0.25, 0.25]) + stub = self._stub_result(pre, interaction_indices, vcov) + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + assert pt["method"] == "bonferroni", ( + "Misaligned interaction_indices must force Bonferroni fallback — " + "the len(keys_in_vcov) == df guard should prevent the Wald path." + ) + + def test_falls_back_to_bonferroni_when_vcov_missing(self): + pre = [(-2, 1.0, 0.5, 0.04), (-1, 0.2, 0.5, 0.69)] + interaction_indices = {-2: 0, -1: 1} + stub = self._stub_result(pre, interaction_indices, vcov=None) + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + assert pt["method"] == "bonferroni" + + def test_joint_wald_uses_F_reference_when_survey_df_is_finite(self): + """Round-27 P1 regression: event-study PT on a survey-backed fit + must use an F reference distribution with denominator df = + ``survey_metadata.df_survey`` rather than the chi-square + reference. Chi-square over-rejects under a finite-sample + correction; the design-based SE already reflects the effective + sample size and the PT test must match. + """ + from types import SimpleNamespace + + from scipy.stats import chi2 + from scipy.stats import f as f_dist + + # Same fixture as ``test_joint_wald_runs_when_keys_align`` but with + # a survey_metadata carrying a finite df_survey. + pre = [(-3, 1.0, 1.0, 0.32), (-2, 1.0, 1.0, 0.32), (-1, 1.0, 1.0, 0.32)] + interaction_indices = {-3: 0, -2: 1, -1: 2, 0: 3} + vcov = np.eye(4) + stub = self._stub_result( + pre, + interaction_indices, + vcov, + survey_metadata=SimpleNamespace(df_survey=20.0), + ) + + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + + # With beta = [1,1,1] and V = I, the Wald statistic is 3.0. + assert pt["status"] == "ran" + assert pt["test_statistic"] == pytest.approx(3.0, rel=1e-6) + assert pt["df"] == 3 + + # Method tag surfaces the survey branch so BR / DR prose can + # flag the finite-sample correction. Denominator df is exposed + # on the schema for downstream consumers. + assert pt["method"].endswith("_survey") + assert pt["df_denom"] == pytest.approx(20.0) + + # F statistic = W / k = 3.0 / 3 = 1.0; survey p-value uses + # F(3, 20) instead of chi-square(3). + expected_p_survey = float(1.0 - f_dist.cdf(1.0, dfn=3, dfd=20.0)) + expected_p_chi2 = float(1.0 - chi2.cdf(3.0, df=3)) + assert pt["joint_p_value"] == pytest.approx(expected_p_survey, rel=1e-6) + # Chi-square would be noticeably more confident (smaller p) than + # F under finite df; confirm the survey path isn't degenerating + # back to chi-square. + assert expected_p_survey > expected_p_chi2 + + def test_precomputed_survey_pt_replay_preserves_df_denom(self, cs_fit): + """Round-28 P3 regression: a schema-shaped PT block carrying the + survey ``df_denom`` and ``_survey`` method suffix must round-trip + through ``precomputed={"parallel_trends": ...}`` without losing + the finite-sample provenance. 
Previously ``_format_precomputed_pt`` + dropped ``df_denom``, so replaying a survey-aware DR block + silently demoted it to a chi-square-style passthrough. + """ + fit, _ = cs_fit + survey_pt = { + "method": "joint_wald_event_study_survey", + "joint_p_value": 0.18, + "test_statistic": 5.2, + "df": 3, + "df_denom": 20.0, + } + dr = DiagnosticReport(fit, precomputed={"parallel_trends": survey_pt}) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + assert pt["method"] == "joint_wald_event_study_survey" + assert pt["df_denom"] == 20.0 + assert pt["df"] == 3 + + def test_dr_prose_uses_event_study_subject_for_survey_pt(self): + """Round-29 P3 regression: DR's own ``_pt_subject_phrase`` / + ``_pt_stat_label`` helpers previously didn't recognize the + ``_survey`` variants, so summary / full_report prose fell + through to the generic "Pre-treatment data" wording — BR's + helpers were fixed last round but DR's were not. The survey + variants must render with the event-study subject and the + ``joint p`` label; the F-reference correction is a different + reference distribution, not a different test. + """ + from diff_diff.diagnostic_report import ( + _pt_stat_label, + _pt_subject_phrase, + ) + + for method in ( + "joint_wald_survey", + "joint_wald_event_study_survey", + ): + assert _pt_subject_phrase(method) == "Pre-treatment event-study coefficients", ( + f"DR subject for {method!r} must match the non-survey " + f"event-study phrasing; got " + f"{_pt_subject_phrase(method)!r}" + ) + assert _pt_stat_label(method) == "joint p" + + def test_joint_wald_ignores_non_finite_survey_df(self): + """If ``df_survey`` is NaN / inf / non-positive, fall back to + chi-square (no finite-sample correction available). + """ + from types import SimpleNamespace + + pre = [(-3, 1.0, 1.0, 0.32), (-2, 1.0, 1.0, 0.32), (-1, 1.0, 1.0, 0.32)] + interaction_indices = {-3: 0, -2: 1, -1: 2, 0: 3} + vcov = np.eye(4) + stub = self._stub_result( + pre, + interaction_indices, + vcov, + survey_metadata=SimpleNamespace(df_survey=float("nan")), + ) + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + # Non-finite df_survey must not taint the method tag. + assert not pt["method"].endswith("_survey") + assert "df_denom" not in pt + + +class TestNarrowedApplicabilityAndPlaceboSchema: + """Regressions for the round-3 CI-review findings. + + * ``pretrends_power`` and ``sensitivity`` are now restricted to the + result families that their backing helpers actually support, so + default reports no longer land in ``error`` for SA / Imputation / + Stacked / EfficientDiD / StaggeredTripleDiff / Wooldridge. + * ``placebo`` is always ``status="skipped"`` in MVP regardless of + estimator, matching the ``REPORTING.md`` contract. 
+ """ + + def test_placebo_is_always_skipped_not_not_applicable(self, did_fit): + fit, df = did_fit + dr = DiagnosticReport(fit, data=df, outcome="outcome", treatment="treated", time="post") + placebo = dr.to_dict()["placebo"] + assert placebo["status"] == "skipped", ( + f"placebo must always be status='skipped' per REPORTING.md; " + f"got {placebo['status']!r}" + ) + + def test_placebo_skipped_for_multiperiod_fit(self, multi_period_fit): + fit, _ = multi_period_fit + placebo = DiagnosticReport(fit).to_dict()["placebo"] + assert placebo["status"] == "skipped" + + def test_placebo_skipped_for_sdid_fit(self, sdid_fit): + fit, _ = sdid_fit + placebo = DiagnosticReport(fit).to_dict()["placebo"] + assert placebo["status"] == "skipped" + + def test_sun_abraham_sensitivity_not_applicable(self): + """SA is not in HonestDiD's adapter list; DR must not try to run it.""" + import warnings + + from diff_diff import SunAbraham, generate_staggered_data + + warnings.filterwarnings("ignore") + sdf = generate_staggered_data(n_units=100, n_periods=6, treatment_effect=1.5, seed=7) + fit = SunAbraham().fit( + sdf, outcome="outcome", unit="unit", time="period", first_treat="first_treat" + ) + dr = DiagnosticReport(fit) + applicable = set(dr.applicable_checks) + sensitivity = dr.to_dict()["sensitivity"] + assert "sensitivity" not in applicable, ( + "SunAbrahamResults has no HonestDiD adapter; sensitivity must not " + "be marked applicable" + ) + assert sensitivity["status"] == "not_applicable" + + def test_n_obs_zero_reference_marker_filtered(self): + """Stacked / TwoStage / Imputation reference markers use n_obs=0 + (not n_groups=0). ``_collect_pre_period_coefs`` must filter both.""" + import numpy as np + + from diff_diff.diagnostic_report import _collect_pre_period_coefs + + class StackedDiDResults: + pass + + obj = StackedDiDResults() + obj.event_study_effects = { + -2: {"effect": 0.1, "se": 0.3, "p_value": 0.74, "n_obs": 50}, + -1: { + "effect": 0.0, + "se": np.nan, + "p_value": np.nan, + "n_obs": 0, # synthetic reference marker + }, + 0: {"effect": 1.5, "se": 0.2, "p_value": 0.0001, "n_obs": 50}, + } + coefs, _ = _collect_pre_period_coefs(obj) + keys = [k for (k, _, _, _) in coefs] + assert -1 not in keys, "n_obs==0 row must be filtered out" + assert -2 in keys + + +class TestReferenceMarkerAndNaNFiltering: + """Regression for the P0 finding that reference markers + NaN pre-periods + were being swept into Bonferroni / Wald PT as real evidence. + + Universal-base CS / SA / ImputationDiD / Stacked event-study output + injects a synthetic reference-period row (``effect=0``, ``se=NaN``, + ``p_value=NaN``, ``n_groups=0``). Treating that row as valid + pre-period evidence would inflate the Bonferroni denominator and + collapse all-NaN fallbacks to a false-clean verdict. + """ + + @staticmethod + def _cs_stub_with_reference_marker(): + import numpy as np + + class CallawaySantAnnaResults: + pass + + obj = CallawaySantAnnaResults() + obj.overall_att = 1.0 + obj.overall_se = 0.1 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 200 + obj.n_treated = 40 + obj.n_control = 160 + obj.survey_metadata = None + # Two real pre-period rows + one universal-base reference marker (n_groups=0). 
+        obj.event_study_effects = {
+            -3: {"effect": 0.1, "se": 0.3, "p_value": 0.74, "n_groups": 5},
+            -2: {"effect": -0.2, "se": 0.3, "p_value": 0.51, "n_groups": 5},
+            -1: {
+                "effect": 0.0,
+                "se": np.nan,
+                "p_value": np.nan,
+                "conf_int": (np.nan, np.nan),
+                "n_groups": 0,
+            },
+            0: {"effect": 1.5, "se": 0.2, "p_value": 0.0001, "n_groups": 5},
+        }
+        obj.vcov = None
+        obj.interaction_indices = None
+        obj.event_study_vcov = None
+        obj.event_study_vcov_index = None
+        return obj
+
+    def test_reference_marker_excluded_from_pt_collection(self):
+        from diff_diff.diagnostic_report import _collect_pre_period_coefs
+
+        obj = self._cs_stub_with_reference_marker()
+        coefs, _ = _collect_pre_period_coefs(obj)
+        keys = [k for (k, _, _, _) in coefs]
+        assert -1 not in keys, (
+            "Universal-base reference marker (n_groups=0) must not appear "
+            "as a valid pre-period coefficient"
+        )
+        assert -3 in keys and -2 in keys
+        # Every returned SE must be finite.
+        for _k, _eff, se, _p in coefs:
+            assert np.isfinite(se), f"Non-finite SE leaked through: {se}"
+
+    def test_all_nan_pre_periods_do_not_produce_clean_verdict(self):
+        """If *every* pre-period row is a reference marker / NaN, the PT
+        check must return inconclusive / skipped — never a clean p_value=1.0.
+        """
+        import numpy as np
+
+        class CallawaySantAnnaResults:
+            pass
+
+        obj = CallawaySantAnnaResults()
+        obj.overall_att = 1.0
+        obj.overall_se = 0.1
+        obj.overall_p_value = 0.001
+        obj.overall_conf_int = (0.8, 1.2)
+        obj.alpha = 0.05
+        obj.n_obs = 200
+        obj.n_treated = 40
+        obj.n_control = 160
+        obj.survey_metadata = None
+        obj.event_study_effects = {
+            -1: {
+                "effect": 0.0,
+                "se": np.nan,
+                "p_value": np.nan,
+                "n_groups": 0,
+            },
+            0: {"effect": 1.5, "se": 0.2, "p_value": 0.0001, "n_groups": 5},
+        }
+        obj.vcov = None
+        obj.interaction_indices = None
+        obj.event_study_vcov = None
+        obj.event_study_vcov_index = None
+        dr = DiagnosticReport(obj, run_sensitivity=False, run_bacon=False)
+        pt = dr.to_dict()["parallel_trends"]
+        # All pre-period rows were reference markers → no valid data → skipped.
+        assert pt["status"] == "skipped"
+        # Verdict must not falsely say "no detected violation" when the only
+        # "data" was a reference marker.
+        assert pt.get("verdict") != "no_detected_violation"
+
+    def test_undefined_pre_period_inference_yields_inconclusive_not_shrunken_bonferroni(self):
+        """Round-33 P0 regression: when any pre-period has undefined
+        inference (non-finite effect / SE or ``se <= 0``), the Bonferroni
+        fallback must NOT silently shrink the test family on the
+        remaining subset and publish a clean joint p-value. Per the
+        ``safe_inference`` contract (``utils.py`` line 175), undefined
+        SE yields NaN downstream; the joint PT test must be explicitly
+        inconclusive so BR prose does not render a stakeholder-facing
+        "parallel trends hold" verdict from a partially-undefined
+        pre-period surface.
+        """
+        from types import SimpleNamespace
+
+        import numpy as np
+
+        class MultiPeriodDiDResults:
+            pass
+
+        obj = MultiPeriodDiDResults()
+        # One valid row + one row whose SE is zero and whose p-value is NaN:
+        # undefined inference per the ``safe_inference`` contract, so this row
+        # must be dropped and the joint PT test reported as inconclusive rather
+        # than re-scaled over the remaining valid subset.
+ obj.pre_period_effects = { + -2: SimpleNamespace(effect=1.0, se=0.5, p_value=0.04), + -1: SimpleNamespace(effect=0.5, se=0.0, p_value=np.nan), + } + obj.vcov = None + obj.interaction_indices = None + obj.event_study_vcov = None + obj.event_study_vcov_index = None + obj.avg_att = 1.0 + obj.avg_se = 0.1 + obj.avg_p_value = 0.001 + obj.avg_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 50 + obj.n_control = 50 + obj.survey_metadata = None + + dr = DiagnosticReport(obj, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + + # Method flagged inconclusive; joint_p None; verdict inconclusive. + assert pt["method"] == "inconclusive" + assert pt["joint_p_value"] is None + assert pt["verdict"] == "inconclusive" + # Metadata records how many pre-periods were dropped and why. + assert pt["n_dropped_undefined"] == 1 + assert "undefined inference" in pt["reason"] + + def test_nan_headline_yields_estimation_failure_prose_not_did_not_change(self): + """Round-36 P0 regression: a non-finite headline effect + (``NaN`` ATT from a failed fit) previously passed the ``val is + not None`` guard in ``_render_overall_interpretation``. Since + ``NaN > 0`` and ``NaN < 0`` are both false, the directional + branch fell through to "did not change" and rendered + "did not change ... by nan (p = nan, 95% CI: nan to nan)" — + misleading stakeholder prose on a failed fit. + + Both ``DiagnosticReport.summary()`` and + ``to_dict()["overall_interpretation"]`` must now emit an + explicit estimation-failure sentence instead. + """ + + class DiDResults: + pass + + stub = DiDResults() + stub.att = float("nan") + stub.se = float("nan") + stub.t_stat = float("nan") + stub.p_value = float("nan") + stub.conf_int = (float("nan"), float("nan")) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 50 + stub.n_control = 50 + stub.survey_metadata = None + + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + summary = dr.summary() + interp = dr.to_dict()["overall_interpretation"] + + for label, prose in [("summary", summary), ("overall_interpretation", interp)]: + lower = prose.lower() + # Must NOT render directional / numeric prose on a NaN fit. + assert ( + "did not change" not in lower + ), f"{label} rendered 'did not change' on a NaN fit; got: {prose!r}" + assert ( + "nan" not in lower + ), f"{label} rendered 'nan' in the stakeholder-facing prose; got: {prose!r}" + assert "by nan" not in lower + assert "ci: nan" not in lower + # Must name the non-finite state explicitly. + assert ( + "non-finite" in lower or "did not produce" in lower + ), f"{label} must emit an estimation-failure sentence; got: {prose!r}" + + def test_summary_prose_surfaces_inconclusive_pt_explicitly(self): + """Round-35 P1 regression: when pre-trends is inconclusive + (undefined pre-period inference), both ``BusinessReport.summary()`` + and ``DiagnosticReport.summary()`` must emit explicit inconclusive + prose — not merely omit the PT sentence. A missing sentence was + indistinguishable from "PT did not run" and would silently drop + the identifying-assumption diagnostic from stakeholder output. 
+ """ + from diff_diff import BusinessReport + + class StackedDiDResults: + pass + + obj = StackedDiDResults() + obj.overall_att = 1.0 + obj.overall_se = 0.2 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 400 + obj.n_treated_units = 100 + obj.n_control_units = 300 + obj.survey_metadata = None + obj.event_study_effects = { + -2: {"effect": 0.1, "se": 0.2, "p_value": 0.62, "n_obs": 400}, + -1: {"effect": 0.05, "se": 0.3, "p_value": float("nan"), "n_obs": 400}, + } + + dr_summary = DiagnosticReport(obj, run_sensitivity=False, run_bacon=False).summary() + br_summary = BusinessReport(obj).summary() + + # Both summaries must explicitly name the inconclusive state. + for label, prose in [("DR", dr_summary), ("BR", br_summary)]: + assert "inconclusive" in prose.lower(), ( + f"{label}.summary() must surface the inconclusive PT " + f"state explicitly; got: {prose!r}" + ) + # And must not offer false-clean "do not reject" wording. + assert "do not reject parallel trends" not in prose.lower() + assert "consistent with parallel trends" not in prose.lower() + + def test_design_effect_deff_below_95_uses_improves_precision_wording(self): + """Round-35 P2 regression: ``deff < 0.95`` is a precision- + improving survey design — effective N is LARGER than nominal + N. DR emits ``band_label="improves_precision"`` and BR narrates + "improves effective sample size" instead of "reduces". + """ + from types import SimpleNamespace + + from diff_diff import BusinessReport + + class CallawaySantAnnaResults: + pass + + obj = CallawaySantAnnaResults() + obj.overall_att = 1.0 + obj.overall_se = 0.2 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 500 + obj.n_treated = 100 + obj.n_control_units = 400 + obj.event_study_effects = None + obj.survey_metadata = SimpleNamespace( + design_effect=0.80, + effective_n=625.0, + weight_type="pweight", + n_strata=None, + n_psu=None, + df_survey=None, + replicate_method=None, + ) + + # Schema: band_label surfaces the precision-improving state. + deff_block = DiagnosticReport(obj).to_dict()["design_effect"] + assert deff_block["band_label"] == "improves_precision" + + # Prose: BR says "improves", not "reduces". + summary = BusinessReport(obj).summary().lower() + assert "improves effective sample size" in summary + assert "reduces effective sample size" not in summary + + def test_finite_se_nan_p_value_yields_inconclusive_on_bonferroni_only_surface(self): + """Round-34 P0 regression: replicate-weight survey fits can emit + event-study rows with finite ``effect`` / ``se`` but + ``p_value=NaN`` when ``safe_inference`` sees ``df <= 0`` — the + design-based SE is still defined but inference fields collapse + to NaN per ``utils.py`` line 175. The round-33 collector filter + (``se > 0``) lets such rows through; the Bonferroni fallback + previously excluded NaN p-values and scaled by the reduced + family, producing a clean joint PT verdict that BR rendered as + "do not reject parallel trends" prose. + + Use a ``StackedDiDResults`` stub (Bonferroni-only surface: no + ``vcov`` / ``event_study_vcov``) with one finite-inference row + and one finite-SE / NaN-p row, and assert DR emits inconclusive. 
+ """ + from diff_diff import BusinessReport + + class StackedDiDResults: + pass + + obj = StackedDiDResults() + obj.overall_att = 1.0 + obj.overall_se = 0.2 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 400 + obj.n_treated_units = 100 + obj.n_control_units = 300 + obj.survey_metadata = None + obj.event_study_effects = { + -2: {"effect": 0.1, "se": 0.2, "p_value": 0.62, "n_obs": 400}, + # Finite SE but NaN p-value — models the replicate-weight + # collapsed-df case. Previously stayed in the family but + # was dropped from the Bonferroni denominator. + -1: {"effect": 0.05, "se": 0.3, "p_value": float("nan"), "n_obs": 400}, + } + + dr = DiagnosticReport(obj, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + + assert pt["method"] == "inconclusive", ( + f"Bonferroni-only surface with NaN per-period p-value must " + f"return inconclusive; got method={pt.get('method')!r} with " + f"joint_p={pt.get('joint_p_value')!r}" + ) + assert pt["verdict"] == "inconclusive" + assert pt["joint_p_value"] is None + assert pt["n_dropped_undefined"] == 1 + + # And BR must not turn that into "do not reject" / "consistent + # with parallel trends" wording. + br_summary = BusinessReport(obj).summary().lower() + assert "do not reject parallel trends" not in br_summary + assert "consistent with parallel trends" not in br_summary + + def test_zero_se_pre_period_yields_inconclusive(self): + """Round-33 P0 regression: a pre-period row whose SE is + zero/negative is undefined inference per the ``safe_inference`` + contract and must push the event-study PT to inconclusive. + """ + from types import SimpleNamespace + + class MultiPeriodDiDResults: + pass + + obj = MultiPeriodDiDResults() + obj.pre_period_effects = { + -2: SimpleNamespace(effect=1.0, se=0.5, p_value=0.04), + -1: SimpleNamespace(effect=0.5, se=0.0, p_value=0.99), + } + obj.vcov = None + obj.interaction_indices = None + obj.event_study_vcov = None + obj.event_study_vcov_index = None + obj.avg_att = 1.0 + obj.avg_se = 0.1 + obj.avg_p_value = 0.001 + obj.avg_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 100 + obj.n_treated = 50 + obj.n_control = 50 + obj.survey_metadata = None + + pt = DiagnosticReport(obj, run_sensitivity=False, run_bacon=False).to_dict()[ + "parallel_trends" + ] + assert pt["verdict"] == "inconclusive" + assert pt["method"] == "inconclusive" + assert pt["n_dropped_undefined"] >= 1 + + def test_all_pre_periods_undefined_yields_inconclusive_not_skipped(self): + """Round-42 P1 regression: the twin of the partially-undefined + case. When every pre-period row is dropped by the collector + for undefined inference (all ``se <= 0`` or non-finite effect/SE), + ``_collect_pre_period_coefs`` returns ``([], n_dropped_undefined > 0)``. + The prior behavior routed through the empty-coefs ``skipped`` + path ("No pre-period event-study coefficients available"), + which let BR drop the identifying-assumption warning and render + a silent-PT-absent narrative. That violates the inconclusive + contract documented in REPORTING.md: when any pre-row is + dropped for undefined inference, the joint PT test is + inconclusive, not skipped. 
+ """ + from diff_diff import BusinessReport + + class StackedDiDResults: + pass + + obj = StackedDiDResults() + obj.overall_att = 1.0 + obj.overall_se = 0.2 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 400 + obj.n_treated_units = 100 + obj.n_control_units = 300 + obj.survey_metadata = None + # All pre-rows have ``se == 0`` — undefined inference per the + # safe-inference contract (``utils.py:175``). The collector's + # ``se > 0`` filter drops all of them, leaving pre_coefs=[] + # with n_dropped_undefined=2 (the R42 all-undefined case). + obj.event_study_effects = { + -2: { + "effect": 0.1, + "se": 0.0, + "p_value": 1.0, + "n_obs": 400, + }, + -1: { + "effect": 0.05, + "se": 0.0, + "p_value": 1.0, + "n_obs": 400, + }, + } + + dr = DiagnosticReport(obj, run_sensitivity=False, run_bacon=False) + # Applicability gate: PT must be marked applicable (runs as + # inconclusive), not skipped with "no coefficients available". + assert "parallel_trends" in dr.applicable_checks, ( + "All-undefined pre-period case must keep PT applicable so " + "the inconclusive runner can emit the explicit " + "n_dropped_undefined provenance. Current skipped reasons: " + f"{dr.skipped_checks}" + ) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran", pt + assert pt["method"] == "inconclusive", ( + f"All-undefined pre-period family must route to the " + f"inconclusive runner, not 'skipped'. Got status=" + f"{pt.get('status')!r}, method={pt.get('method')!r}, " + f"reason={pt.get('reason')!r}" + ) + assert pt["verdict"] == "inconclusive" + assert pt["joint_p_value"] is None + # All-undefined: n_dropped_undefined equals attempted pre-period + # count (2 rows here), and the valid subset is empty. + assert pt["n_dropped_undefined"] == 2 + assert pt["n_pre_periods"] == 0 + + # BR must surface this as an inconclusive identifying- + # assumption warning, not silently omit PT. The "inconclusive" + # verdict phrasing is the load-bearing contract for + # stakeholders. + br_summary = BusinessReport(obj).summary().lower() + assert "inconclusive" in br_summary, ( + f"All-undefined PT must surface 'inconclusive' in BR " f"summary. Got: {br_summary!r}" + ) + # And must not claim PT was untested / no-coefs. + assert "no pre-period event-study coefficients" not in br_summary + assert "consistent with parallel trends" not in br_summary + + def test_pretrends_power_adapter_filters_zero_se_cs(self): + """Round-33 P0 regression: CS / SA ``compute_pretrends_power`` + adapters also use the ``se > 0`` filter alongside + ``np.isfinite(se)`` so the power analysis never includes rows + whose per-period SE collapsed. + """ + + import numpy as np + + from diff_diff.pretrends import compute_pretrends_power + from diff_diff.staggered import CallawaySantAnnaResults + + obj = object.__new__(CallawaySantAnnaResults) + obj.anticipation = 0 + # Three pre-periods: two valid, one with zero SE. The valid + # two are enough to run power analysis; the zero-SE row must + # NOT slip into the `ses` vector and divide-by-zero. 
+ obj.event_study_effects = { + -3: {"effect": 0.1, "se": 0.2, "p_value": 0.7, "n_groups": 1}, + -2: {"effect": 0.0, "se": 0.0, "p_value": float("nan"), "n_groups": 1}, + -1: {"effect": 0.0, "se": 0.2, "p_value": 0.99, "n_groups": 1}, + 0: {"effect": 1.0, "se": 0.2, "p_value": 0.0, "n_groups": 1}, + } + obj.overall_att = 1.0 + obj.alpha = 0.05 + + pp = compute_pretrends_power(obj, alpha=0.05, target_power=0.80, violation_type="linear") + # Zero-SE row must not appear in pre_period_ses. + assert len(pp.pre_period_ses) == 2 + assert np.all(pp.pre_period_ses > 0) + + +class TestPrecomputedValidation: + """Regression for the P1 finding that ``precomputed=`` silently accepted + keys that were never implemented. Unsupported keys now raise.""" + + def test_unsupported_precomputed_key_raises(self, multi_period_fit): + fit, _ = multi_period_fit + with pytest.raises(ValueError, match="not implemented"): + DiagnosticReport(fit, precomputed={"design_effect": object()}) + + def test_supported_precomputed_keys_accepted(self, multi_period_fit): + fit, _ = multi_period_fit + # The four implemented keys should not raise at construction. + DiagnosticReport(fit, precomputed={"parallel_trends": {"p_value": 0.5}}) + + def test_mixed_supported_and_unsupported_raises(self, multi_period_fit): + fit, _ = multi_period_fit + with pytest.raises(ValueError, match="epv"): + DiagnosticReport(fit, precomputed={"sensitivity": None, "epv": object()}) + + +class TestSingleMSensitivityPrecomputed: + """Single-M HonestDiDResults must NOT be narrated as full-grid robustness. + + Regression for the P0 CI-review finding that ``conclusion='single_M_precomputed'`` + was being swallowed because both renderers checked ``breakdown_M is None`` and + fell through to the "robust across the full grid" phrasing. + """ + + def _fake_single_m(self, M=1.5, ci_lb=1.0, ci_ub=3.0): + from types import SimpleNamespace + + return SimpleNamespace( + M=M, + lb=ci_lb, + ub=ci_ub, + ci_lb=ci_lb, + ci_ub=ci_ub, + method="relative_magnitude", + alpha=0.05, + ) + + def test_dr_schema_preserves_single_m_marker(self, multi_period_fit): + fit, _ = multi_period_fit + dr = DiagnosticReport(fit, precomputed={"sensitivity": self._fake_single_m()}) + sens = dr.to_dict()["sensitivity"] + assert sens["status"] == "ran" + assert sens["conclusion"] == "single_M_precomputed" + assert sens["breakdown_M"] is None + assert len(sens["grid"]) == 1 + + def test_dr_summary_does_not_claim_full_grid_robustness(self, multi_period_fit): + fit, _ = multi_period_fit + dr = DiagnosticReport(fit, precomputed={"sensitivity": self._fake_single_m()}) + summary = dr.summary() + assert "across the entire HonestDiD grid" not in summary + assert "robust across the grid" not in summary + # It should narrate the single-M check honestly. + assert "single point checked" in summary + assert "not a breakdown" in summary or "not a grid" in summary + + def test_br_summary_does_not_claim_full_grid_robustness(self, multi_period_fit): + """BR via honest_did_results= passthrough must not oversell a point check.""" + from diff_diff import BusinessReport + + fit, _ = multi_period_fit + br = BusinessReport(fit, honest_did_results=self._fake_single_m()) + summary = br.summary() + assert "full grid" not in summary + assert "single point checked" in summary + + +class TestEPVDictBacked: + """EPV diagnostics on fits that use the dict-of-dicts convention. 
+ + Regression for the P0 CI-review finding that ``_check_epv`` assumed + ``low_epv_cells`` / ``min_epv`` attributes but the library stores + ``epv_diagnostics`` as ``{(g, t): {"is_low": ..., "epv": ...}}``. + """ + + def _make_cs_stub(self, epv_diag, threshold=10.0): + class CallawaySantAnnaResults: + pass + + obj = CallawaySantAnnaResults() + obj.overall_att = 1.0 + obj.overall_se = 0.1 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 200 + obj.n_treated = 40 + obj.n_control = 160 + obj.survey_metadata = None + obj.event_study_effects = None + obj.epv_diagnostics = epv_diag + obj.epv_threshold = threshold + return obj + + def test_low_epv_cells_counted_from_is_low_flag(self): + epv = { + (2020, 1): {"is_low": True, "epv": 4.5}, + (2020, 2): {"is_low": False, "epv": 18.0}, + (2021, 1): {"is_low": True, "epv": 2.0}, + (2021, 2): {"is_low": False, "epv": 22.0}, + } + stub = self._make_cs_stub(epv, threshold=10.0) + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + section = dr.to_dict()["epv"] + assert section["status"] == "ran" + assert section["n_cells_low"] == 2 + assert section["n_cells_total"] == 4 + assert section["min_epv"] == pytest.approx(2.0) + assert section["threshold"] == pytest.approx(10.0) + + def test_no_low_cells_reports_clean(self): + epv = {(2020, 1): {"is_low": False, "epv": 15.0}} + stub = self._make_cs_stub(epv, threshold=10.0) + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + section = dr.to_dict()["epv"] + assert section["n_cells_low"] == 0 + assert section["min_epv"] == pytest.approx(15.0) + + def test_threshold_read_from_results_not_hardcoded(self): + """Pass a non-default epv_threshold and confirm DR echoes it.""" + epv = {(2020, 1): {"is_low": True, "epv": 7.0}} + stub = self._make_cs_stub(epv, threshold=8.5) + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + assert dr.to_dict()["epv"]["threshold"] == pytest.approx(8.5) + + +class TestCSEventStudyVCovSupport: + """CS sensitivity + pretrends_power must not be skipped for absence of results.vcov. + + Regression for the P1 CI-review finding that the applicability gate required + ``results.vcov`` but CS exposes ``event_study_vcov`` / ``event_study_vcov_index``. + """ + + def test_cs_sensitivity_runs_on_aggregated_fit(self, cs_fit): + fit, sdf = cs_fit + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + assert ( + "sensitivity" in dr.applicable_checks + ), "CS fit with event_study aggregation must not skip sensitivity" + sens = dr.to_dict()["sensitivity"] + # It may run successfully or emit an error depending on data shape, + # but it must NOT be skipped for "results.vcov not available". + assert sens["status"] in {"ran", "error"}, sens + + def test_cs_pretrends_power_runs_on_aggregated_fit(self, cs_fit): + fit, sdf = cs_fit + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + assert ( + "pretrends_power" in dr.applicable_checks + ), "CS fit with event_study aggregation must not skip pretrends_power" + + +class TestCSJointWaldViaEventStudyVCov: + """CS PT should use joint_wald via event_study_vcov when interaction_indices is absent. + + Regression for the P1 CI-review finding that CS always fell back to Bonferroni + even though ``event_study_vcov`` + ``event_study_vcov_index`` were available. 
+ """ + + def _make_cs_stub_with_es_vcov(self): + class CallawaySantAnnaResults: + pass + + obj = CallawaySantAnnaResults() + obj.overall_att = 1.0 + obj.overall_se = 0.1 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.8, 1.2) + obj.alpha = 0.05 + obj.n_obs = 200 + obj.n_treated = 40 + obj.n_control = 160 + obj.survey_metadata = None + # Pre-period event-study entries with known coefficients + vcov. + obj.event_study_effects = { + -3: {"effect": 0.5, "se": 0.5, "p_value": 0.32}, + -2: {"effect": -0.5, "se": 0.5, "p_value": 0.32}, + -1: {"effect": 0.2, "se": 0.4, "p_value": 0.62}, + 0: {"effect": 2.0, "se": 0.3, "p_value": 0.0001}, + 1: {"effect": 2.5, "se": 0.3, "p_value": 0.0001}, + } + obj.event_study_vcov = np.diag([0.25, 0.25, 0.16, 0.09, 0.09]) + obj.event_study_vcov_index = [-3, -2, -1, 0, 1] + obj.vcov = None # CS convention + obj.interaction_indices = None + return obj + + def test_cs_pt_uses_event_study_vcov_wald(self): + stub = self._make_cs_stub_with_es_vcov() + dr = DiagnosticReport(stub, run_sensitivity=False, run_bacon=False) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + assert ( + pt["method"] == "joint_wald_event_study" + ), f"Expected event-study-backed Wald; got method={pt.get('method')!r}" + # Closed-form: 0.5^2/0.25 + (-0.5)^2/0.25 + 0.2^2/0.16 = 1 + 1 + 0.25 = 2.25 + assert pt["test_statistic"] == pytest.approx(2.25, rel=1e-6) + assert pt["df"] == 3 + + +class TestContinuousDiDHeadline: + """ContinuousDiDResults exposes overall_att_se/p_value/conf_int, not overall_se/… + + Regression for the P1 CI-review finding that both report classes missed + ContinuousDiDResults inference fields. + """ + + def test_extract_scalar_headline_resolves_continuous_did_aliases(self): + from diff_diff.diagnostic_report import _extract_scalar_headline + + class ContinuousDiDResults: + pass + + obj = ContinuousDiDResults() + obj.overall_att = 2.5 + obj.overall_att_se = 0.4 + obj.overall_att_p_value = 0.00001 + obj.overall_att_conf_int = (1.7, 3.3) + obj.alpha = 0.05 + + result = _extract_scalar_headline(obj) + assert result is not None + name, value, se, p, ci, alpha = result + assert name == "overall_att" + assert value == pytest.approx(2.5) + assert se == pytest.approx(0.4) + assert p == pytest.approx(0.00001) + assert ci == [pytest.approx(1.7), pytest.approx(3.3)] + assert alpha == pytest.approx(0.05) + + +class TestVerdictsAndTiers: + def test_pt_verdict_three_bins(self): + assert _pt_verdict(0.001) == "clear_violation" + assert _pt_verdict(0.049) == "clear_violation" + assert _pt_verdict(0.10) == "some_evidence_against" + assert _pt_verdict(0.29) == "some_evidence_against" + assert _pt_verdict(0.30) == "no_detected_violation" + assert _pt_verdict(0.99) == "no_detected_violation" + assert _pt_verdict(None) == "inconclusive" + assert _pt_verdict(float("nan")) == "inconclusive" + + def test_power_tier_three_bins_plus_unknown(self): + assert _power_tier(0.1) == "well_powered" + assert _power_tier(0.24) == "well_powered" + assert _power_tier(0.25) == "moderately_powered" + assert _power_tier(0.99) == "moderately_powered" + assert _power_tier(1.0) == "underpowered" + assert _power_tier(5.0) == "underpowered" + assert _power_tier(None) == "unknown" + assert _power_tier(float("nan")) == "unknown" + + +# --------------------------------------------------------------------------- +# EfficientDiD hausman pathway +# --------------------------------------------------------------------------- +class TestEfficientDiDHausman: + def 
test_hausman_pretest_runs_with_data_kwargs(self, edid_fit): + fit, sdf = edid_fit + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + assert pt["method"] == "hausman" + + def test_hausman_skipped_without_data_kwargs(self, edid_fit): + """Without the raw panel kwargs, PT is now gated at the + applicability level (round-10 CI review) — no method field on + the skip section, but ``applicable_checks`` excludes + ``parallel_trends`` and ``skipped_checks`` names it with the + missing-kwargs reason.""" + fit, _ = edid_fit + dr = DiagnosticReport(fit) + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "skipped" + assert "parallel_trends" not in dr.applicable_checks + assert "parallel_trends" in dr.skipped_checks + assert "hausman_pretest" in dr.skipped_checks["parallel_trends"] + + +# --------------------------------------------------------------------------- +# SDiD native +# --------------------------------------------------------------------------- +class TestSDiDNative: + def test_sdid_pt_uses_synthetic_fit_method(self, sdid_fit): + fit, _ = sdid_fit + pt = DiagnosticReport(fit).to_dict()["parallel_trends"] + assert pt["method"] == "synthetic_fit" + assert pt["verdict"] == "design_enforced_pt" + assert isinstance(pt.get("pre_treatment_fit_rmse"), float) + + def test_sdid_native_section_populated(self, sdid_fit): + fit, _ = sdid_fit + native = DiagnosticReport(fit).to_dict()["estimator_native_diagnostics"] + assert native["status"] == "ran" + assert native["estimator"] == "SyntheticDiD" + assert "weight_concentration" in native + assert "in_time_placebo" in native + assert "zeta_sensitivity" in native + + def test_sdid_does_not_call_honest_did(self, sdid_fit): + """HonestDiD sensitivity should NOT run on SDiD (native path used instead).""" + fit, _ = sdid_fit + with patch("diff_diff.honest_did.HonestDiD.sensitivity_analysis") as mock: + DiagnosticReport(fit).to_dict() + mock.assert_not_called() + + +# --------------------------------------------------------------------------- +# Error handling +# --------------------------------------------------------------------------- +class TestErrorHandling: + def test_sensitivity_error_does_not_break_report(self, multi_period_fit): + """A failing diagnostic records its error in the section; the report still renders.""" + fit, _ = multi_period_fit + + def _raise(*args, **kwargs): + raise RuntimeError("synthetic test failure") + + with patch("diff_diff.honest_did.HonestDiD.sensitivity_analysis", side_effect=_raise): + dr = DiagnosticReport(fit) + schema = dr.to_dict() + sens = schema["sensitivity"] + assert sens["status"] == "error" + assert "synthetic test failure" in sens["reason"] + # Other sections still ran. 
+ assert schema["parallel_trends"]["status"] == "ran" + + +# --------------------------------------------------------------------------- +# Overall prose +# --------------------------------------------------------------------------- +class TestOverallInterpretation: + def test_overall_interpretation_nonempty_for_fit(self, cs_fit): + fit, sdf = cs_fit + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + prose = dr.summary() + assert isinstance(prose, str) + assert len(prose) > 50 # a real paragraph + + def test_full_report_has_headers(self, cs_fit): + fit, sdf = cs_fit + dr = DiagnosticReport( + fit, + data=sdf, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + md = dr.full_report() + assert "# Diagnostic Report" in md + assert "## Overall Interpretation" in md + assert "## Parallel trends" in md + assert "## HonestDiD sensitivity" in md + + +# --------------------------------------------------------------------------- +# Public result class +# --------------------------------------------------------------------------- +class TestDiagnosticReportResults: + def test_run_all_returns_dataclass(self, multi_period_fit): + fit, _ = multi_period_fit + dr = DiagnosticReport(fit) + results = dr.run_all() + assert isinstance(results, DiagnosticReportResults) + assert isinstance(results.applicable_checks, tuple) + assert isinstance(results.schema, dict) + + def test_run_all_is_idempotent(self, multi_period_fit): + fit, _ = multi_period_fit + dr = DiagnosticReport(fit) + a = dr.run_all() + b = dr.run_all() + assert a is b # cached + + +class TestDCDHParallelTrendsViaPlaceboEventStudy: + """Regression for the round-6 P1 finding that dCDH was advertised as + PT-applicable but ``_collect_pre_period_coefs`` never read + ``placebo_event_study``, so the PT check was silently skipped even + on fits with valid placebo horizons. + """ + + def _stub(self, with_placebo: bool): + class ChaisemartinDHaultfoeuilleResults: + pass + + stub = ChaisemartinDHaultfoeuilleResults() + stub.att = 1.0 + stub.overall_att = 1.0 + stub.overall_se = 0.2 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (0.6, 1.4) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.event_study_effects = None + if with_placebo: + stub.placebo_event_study = { + -3: { + "effect": 0.05, + "se": 0.1, + "p_value": 0.62, + "conf_int": (-0.15, 0.25), + "n_obs": 40, + }, + -2: { + "effect": -0.08, + "se": 0.09, + "p_value": 0.38, + "conf_int": (-0.26, 0.10), + "n_obs": 45, + }, + -1: { + "effect": 0.04, + "se": 0.10, + "p_value": 0.69, + "conf_int": (-0.16, 0.24), + "n_obs": 50, + }, + } + else: + stub.placebo_event_study = None + return stub + + def test_pt_check_reads_placebo_event_study(self): + stub = self._stub(with_placebo=True) + dr = DiagnosticReport(stub).run_all() + pt = dr.schema["parallel_trends"] + assert ( + pt["status"] == "ran" + ), f"dCDH PT check must run on a fit with placebo_event_study; got {pt}" + # Per-period rows should come from the placebo keys (negative horizons). 
+ per_period = pt.get("per_period") or pt.get("periods") or [] + assert per_period, "PT output must include per-period rows" + periods = [row.get("period") for row in per_period] + assert all( + isinstance(p, int) and p < 0 for p in periods + ), f"dCDH PT must use negative placebo horizons; got {periods}" + + def test_pt_check_skips_when_no_placebo_event_study(self): + stub = self._stub(with_placebo=False) + dr = DiagnosticReport(stub).run_all() + pt = dr.schema["parallel_trends"] + assert ( + pt["status"] == "skipped" + ), f"dCDH PT must skip when placebo_event_study is missing; got {pt}" + + +class TestHeterogeneityPostTreatmentOnly: + """Regression for the round-6 P1 finding that ``_check_heterogeneity`` + was mixing pre- and post-treatment coefficients into the CV / range / + sign-consistency summary. + """ + + def test_collector_prefers_post_period_effects_over_period_effects(self): + """On a MultiPeriod-shaped stub, ``_collect_effect_scalars`` must read + ``post_period_effects`` (post-treatment only), not ``period_effects`` + (which mixes pre- and post-treatment coefficients). If the pre-period + value leaked in, sign_consistency would flip and the range would span + a much larger interval.""" + from diff_diff.diagnostic_report import DiagnosticReport + + class MultiPeriodDiDResults: + pass + + stub = MultiPeriodDiDResults() + pe_pre = type("PeriodEffect", (), {"effect": -1.0, "se": 0.2})() + pe_post_1 = type("PeriodEffect", (), {"effect": 1.0, "se": 0.2})() + pe_post_2 = type("PeriodEffect", (), {"effect": 3.0, "se": 0.2})() + stub.period_effects = {-1: pe_pre, 0: pe_post_1, 1: pe_post_2} + stub.post_period_effects = {0: pe_post_1, 1: pe_post_2} + stub.pre_period_effects = {-1: pe_pre} + stub.avg_att = 2.0 + stub.avg_se = 0.1 + stub.avg_p_value = 0.001 + stub.avg_conf_int = (1.8, 2.2) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + + # Bypass the applicability-matrix gate by constructing the report + # object and calling the extractor directly: the fix is in the + # extractor, and MultiPeriod's applicability matrix may or may + # not include heterogeneity at any given release. + dr = DiagnosticReport(stub) + effects = sorted(dr._collect_effect_scalars()) + assert effects == [1.0, 3.0], ( + f"Extractor must return only post-treatment effects " + f"(no pre-period -1.0); got {effects}" + ) + assert dr._heterogeneity_source() == "post_period_effects" + + def test_event_study_filters_pre_period_and_reference_markers(self): + class CallawaySantAnnaResults: + pass + + stub = CallawaySantAnnaResults() + # Event study: pre horizons (rel<0), reference marker (n_groups=0), + # non-finite row, and two valid post rows. 
+ stub.event_study_effects = { + -2: {"effect": -3.0, "se": 0.2, "n_groups": 15}, + -1: {"effect": 0.0, "se": float("nan"), "n_groups": 0}, # reference marker + 0: {"effect": 1.0, "se": 0.2, "n_groups": 15}, + 1: {"effect": 2.0, "se": 0.2, "n_groups": 12}, + 2: {"effect": float("nan"), "se": 0.2, "n_groups": 5}, # non-finite + } + stub.overall_att = 1.5 + stub.overall_se = 0.1 + stub.overall_p_value = 0.001 + stub.overall_conf_int = (1.3, 1.7) + stub.alpha = 0.05 + stub.n_obs = 100 + stub.n_treated = 40 + stub.n_control = 60 + stub.survey_metadata = None + stub.base_period = "universal" + + dr = DiagnosticReport( + stub, + run_parallel_trends=False, + run_sensitivity=False, + run_bacon=False, + ).run_all() + het = dr.schema["heterogeneity"] + assert het["status"] == "ran" + assert het["source"] == "event_study_effects_post" + # Only rel>=0, finite, non-reference rows: {1.0, 2.0}. + assert het["n_effects"] == 2 + assert het["min"] == pytest.approx(1.0) + assert het["max"] == pytest.approx(2.0) + assert het["sign_consistent"] is True + + +# --------------------------------------------------------------------------- +# Round-40 P1: survey-design threading for fit-faithful replay +# --------------------------------------------------------------------------- +class TestSurveyDesignThreading: + """Round-40 P1 CI review on PR #318: when a fitted result carries + ``survey_metadata``, Goodman-Bacon and the simple 2x2 PT helper + cannot be faithfully replayed without the original ``SurveyDesign``. + + DR must: + * accept a ``survey_design`` kwarg; + * thread it to ``bacon_decompose(survey_design=...)`` when the + user supplies it; + * skip Bacon with an explicit reason when ``survey_metadata`` is + set but ``survey_design`` is not supplied; + * skip the simple 2x2 PT check with an explicit reason on + survey-backed ``DiDResults`` (the helper has no + ``survey_design`` parameter). + """ + + def _did_with_survey(self): + from types import SimpleNamespace + + class DiDResults: + pass + + obj = DiDResults() + obj.att = 1.0 + obj.se = 0.2 + obj.t_stat = 5.0 + obj.p_value = 0.001 + obj.conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 400 + obj.n_treated = 100 + obj.n_control = 300 + obj.survey_metadata = SimpleNamespace( + design_effect=1.25, + effective_n=320.0, + weight_type="pweight", + n_strata=None, + n_psu=None, + df_survey=20.0, + replicate_method=None, + ) + obj.inference_method = "analytical" + return obj + + def _staggered_stub_with_survey(self): + """Lightweight CS-like stub carrying survey_metadata for Bacon gating.""" + from types import SimpleNamespace + + class CallawaySantAnnaResults: + pass + + obj = CallawaySantAnnaResults() + obj.overall_att = 1.0 + obj.overall_se = 0.2 + obj.overall_p_value = 0.001 + obj.overall_conf_int = (0.6, 1.4) + obj.alpha = 0.05 + obj.n_obs = 600 + obj.n_treated = 200 + obj.n_control_units = 400 + obj.survey_metadata = SimpleNamespace( + design_effect=1.5, + effective_n=400.0, + weight_type="pweight", + n_strata=None, + n_psu=None, + df_survey=30.0, + replicate_method=None, + ) + obj.event_study_effects = None + return obj + + def test_survey_backed_did_skips_2x2_pt_with_reason(self): + """Survey-backed ``DiDResults`` must skip the 2x2 PT helper + (``utils.check_parallel_trends`` is unweighted) and produce a + skip reason naming the survey-design replay requirement. 
+ """ + obj = self._did_with_survey() + import pandas as pd + + panel = pd.DataFrame( + { + "outcome": [1.0, 2.0, 1.1, 2.2], + "post": [0, 1, 0, 1], + "treated": [0, 0, 1, 1], + } + ) + dr = DiagnosticReport( + obj, + data=panel, + outcome="outcome", + time="post", + treatment="treated", + ) + assert "parallel_trends" not in dr.applicable_checks + reason = dr.skipped_checks["parallel_trends"] + assert "survey design" in reason.lower() + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "skipped" + + def test_survey_backed_did_skips_2x2_pt_even_when_survey_design_supplied(self): + """Round-41 P3 regression: supplying ``survey_design`` does NOT + unlock the simple 2x2 PT helper. ``utils.check_parallel_trends`` + has no survey-aware variant, so the helper cannot consume the + design even when it is available; the check is skipped + unconditionally on a survey-backed ``DiDResults`` and the skip + reason must point the user at the precomputed-PT opt-in rather + than imply that ``survey_design`` would have helped. + """ + import pandas as pd + + obj = self._did_with_survey() + panel = pd.DataFrame( + { + "outcome": [1.0, 2.0, 1.1, 2.2], + "post": [0, 1, 0, 1], + "treated": [0, 0, 1, 1], + } + ) + sentinel_design = object() + dr = DiagnosticReport( + obj, + data=panel, + outcome="outcome", + time="post", + treatment="treated", + survey_design=sentinel_design, + ) + # Supplying survey_design does not unlock 2x2 PT. + assert "parallel_trends" not in dr.applicable_checks + reason = dr.skipped_checks["parallel_trends"] + # Reason must point at the precomputed-PT opt-in and must not + # claim ``survey_design`` fixes this path. + assert "precomputed" in reason.lower() + assert "parallel_trends" in reason.lower() + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "skipped" + + def test_survey_backed_did_with_precomputed_pt_runs(self): + """When the user supplies ``precomputed={'parallel_trends': ...}`` + on a survey-backed DiDResults, DR must honor the override rather + than skip with the survey-design reason. + """ + obj = self._did_with_survey() + precomputed_pt = { + "p_value": 0.42, + "treated_trend": 0.05, + "control_trend": 0.04, + "trend_difference": 0.01, + "t_statistic": 0.8, + } + dr = DiagnosticReport( + obj, + precomputed={"parallel_trends": precomputed_pt}, + ) + assert "parallel_trends" in dr.applicable_checks + pt = dr.to_dict()["parallel_trends"] + assert pt["status"] == "ran" + + def test_survey_backed_staggered_skips_bacon_without_survey_design(self): + """CS-like survey-backed fit: Bacon replay must skip with a + reason naming the survey-design requirement rather than produce + an unweighted decomposition for a weighted estimate. 
+ """ + obj = self._staggered_stub_with_survey() + import pandas as pd + + panel = pd.DataFrame( + { + "outcome": [1.0, 2.0, 1.1, 2.2, 1.2, 2.3, 1.3, 2.4], + "unit": [1, 1, 2, 2, 3, 3, 4, 4], + "period": [1, 2, 1, 2, 1, 2, 1, 2], + "first_treat": [0, 0, 0, 0, 2, 2, 2, 2], + } + ) + dr = DiagnosticReport( + obj, + data=panel, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + ) + assert "bacon" not in dr.applicable_checks + reason = dr.skipped_checks["bacon"] + assert "survey design" in reason.lower() + assert "survey_design" in reason or "SurveyDesign" in reason + bacon = dr.to_dict()["bacon"] + assert bacon["status"] == "skipped" + + def test_survey_backed_staggered_threads_survey_design_to_bacon(self): + """When ``survey_design`` is supplied, Bacon applicability flips + back to runnable and ``bacon_decompose`` is invoked with the + survey design. Assert via ``unittest.mock.patch`` that the + kwarg is forwarded. + """ + from unittest.mock import MagicMock, patch + + obj = self._staggered_stub_with_survey() + import pandas as pd + + panel = pd.DataFrame( + { + "outcome": [1.0, 2.0, 1.1, 2.2, 1.2, 2.3, 1.3, 2.4], + "unit": [1, 1, 2, 2, 3, 3, 4, 4], + "period": [1, 2, 1, 2, 1, 2, 1, 2], + "first_treat": [0, 0, 0, 0, 2, 2, 2, 2], + } + ) + + sentinel_design = object() + fake_decomp = MagicMock() + fake_decomp.total_weight_treated_vs_never = 0.9 + fake_decomp.total_weight_earlier_vs_later = 0.05 + fake_decomp.total_weight_later_vs_earlier = 0.05 + fake_decomp.twfe_estimate = 1.1 + fake_decomp.n_timing_groups = 2 + + with patch("diff_diff.bacon.bacon_decompose", return_value=fake_decomp) as m: + dr = DiagnosticReport( + obj, + data=panel, + outcome="outcome", + unit="unit", + time="period", + first_treat="first_treat", + survey_design=sentinel_design, + ) + # Applicability gate passes since survey_design is supplied. + assert "bacon" in dr.applicable_checks + bacon = dr.to_dict()["bacon"] + assert bacon["status"] == "ran" + # The survey_design must be threaded through to + # bacon_decompose as a kwarg so the replayed decomposition + # matches the fitted design. + assert m.called, "bacon_decompose was not called" + _, kwargs = m.call_args + assert kwargs.get("survey_design") is sentinel_design + + +# --------------------------------------------------------------------------- +# Public API exposure +# --------------------------------------------------------------------------- +def test_public_api_exports(): + for name in ("DiagnosticReport", "DiagnosticReportResults", "DIAGNOSTIC_REPORT_SCHEMA_VERSION"): + assert hasattr(dd, name), f"diff_diff must export {name}"