diff --git a/CHANGELOG.md b/CHANGELOG.md index aaa2d43d..d7474d25 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **`did_had_pretest_workflow(aggregate="event_study")`**: multi-period dispatch on balanced ≥3-period panels. Runs QUG at `F` + joint pre-trends Stute across earlier pre-periods + joint homogeneity-linearity Stute across post-periods. Step 2 closure requires ≥2 pre-periods; with only a single pre-period (the base `F-1`) `pretrends_joint=None` and the verdict flags the skip. Reuses the Phase 2b event-study panel validator (last-cohort auto-filter under staggered timing with `UserWarning`; `ValueError` when `first_treat_col=None` and the panel is staggered). The data-in wrappers `joint_pretrends_test` and `joint_homogeneity_test` also route through that same validator internally, so direct wrapper calls inherit the last-cohort filter and constant-post-dose invariant. `HADPretestReport` extended with `pretrends_joint`, `homogeneity_joint`, and `aggregate` fields; serialization methods (`summary`, `to_dict`, `to_dataframe`, `__repr__`) preserve the Phase 3 output bit-exactly on `aggregate="overall"` — no `aggregate` key, no header row, no schema drift — and only surface the new fields on `aggregate="event_study"`. - **`ChaisemartinDHaultfoeuille.by_path`** — per-path event-study disaggregation, mirroring R `did_multiplegt_dyn(..., by_path=k)`. Passing `by_path=k` (positive int) to the estimator reports separate `DID_{path,l}` + SE + inference for the top-k most common observed treatment paths in the window `[F_g-1, F_g-1+L_max]`, answering the practitioner question "is a single pulse enough, or do you need sustained exposure?" across paths like `(0,1,0,0)` vs `(0,1,1,0)` vs `(0,1,1,1)`. The per-path SE follows the joiners-only / leavers-only IF precedent (switcher-side contribution zeroed for non-path groups; control pool and cohort structure unchanged; plug-in SE with path-specific divisor). Requires `drop_larger_lower=False` (multi-switch groups are the object of interest) and `L_max >= 1`. Binary treatment only in this release; combinations with `controls`, `trends_linear`, `trends_nonparam`, `heterogeneity`, `design2`, `honest_did`, `survey_design`, and `n_bootstrap > 0` raise `NotImplementedError` and are deferred to follow-up PRs. Results expose `results.path_effects: Dict[Tuple[int, ...], Dict[str, Any]]` and `results.to_dataframe(level="by_path")`; the summary grows a "Treatment-Path Disaggregation" block. Ties in path frequency are broken lexicographically on the path tuple for deterministic ranking. Overflow (`by_path > n_observed_paths`) returns all observed paths with a `UserWarning`. See `docs/methodology/REGISTRY.md` §ChaisemartinDHaultfoeuille `Note (Phase 3 by_path per-path event-study disaggregation)` for the full contract. - **R-parity for `ChaisemartinDHaultfoeuille.by_path`** against `DIDmultiplegtDYN 2.3.3`. Two new scenarios in `benchmarks/data/dcdh_dynr_golden_values.json` generated from `did_multiplegt_dyn(..., by_path=k)`: `mixed_single_switch_by_path` (2 paths, `by_path=2`) and `multi_path_reversible_by_path` (4 observed paths, `by_path=3`, via a new deterministic multi-path DGP pattern in the R generator). Per-path point estimates and per-path switcher counts match R exactly; per-path SE matches within the Phase 2 multi-horizon SE envelope (observed rtol ≤ 10.2% on the 2-path scenario, ≤ 4.2% on the 4-path scenario). 
Parity tests live at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPath`, matching paths by tuple label via set-equality (robust to R's undocumented frequency-tie tiebreak) and cross-checking per-path switcher counts before SE comparison. **Deviation documented:** cross-path cohort sharing — our full-panel cohort-centered plug-in vs R's per-path re-run diverges materially when a `(D_{g,1}, F_g, S_g)` cohort spans multiple observed paths; the two coincide when every cohort is single-path. The parity scenarios are constructed to keep cohorts single-path (scenario 13 by design, scenario 14 via path-assignment-deterministic-on-F_g). See `docs/methodology/REGISTRY.md` §ChaisemartinDHaultfoeuille `Note (Phase 3 by_path...)` for the full write-up. +- **`profile_panel()` utility + `llms-autonomous.txt` reference guide (agent-facing)** — new `diff_diff.profile_panel(df, *, unit, time, treatment, outcome)` returns a frozen `PanelProfile` dataclass of structural facts (panel balance, treatment-type classification — `"binary_absorbing"` / `"binary_non_absorbing"` / `"continuous"` / `"categorical"`, cohort structure, outcome characteristics, and a `tuple[Alert, ...]` of factual observations). `.to_dict()` returns a JSON-serializable view. Paired with a new bundled `"autonomous"` variant on `get_llm_guide()` — `get_llm_guide("autonomous")` returns a reference-shaped guide (distinct from the existing workflow-prose `"practitioner"` variant) with §1 audience disclaimer, §2 `PanelProfile` field reference, §3 embedded 17-estimator × 9-design-feature support matrix, §4 per-design-feature reasoning citing Baker et al. (2025) and Roth / Sant'Anna (2023), §5 post-fit validation index, §6 BR/DR schema reference, §7 citations, §8 intentional omissions. Both pieces are bundled inside the wheel (no GitHub / RTD dependency at runtime); `diff_diff/__init__.py` module docstring leads with an agent-entry block listing `profile_panel`, `get_llm_guide("autonomous")`, `get_llm_guide("practitioner")`, and `BusinessReport` so `help(diff_diff)` surfaces them. Descriptive, not opinionated — `profile_panel` alerts never recommend a specific estimator, and the guide enumerates trade-offs rather than dispatching. Exports: `profile_panel`, `PanelProfile`, `Alert` from top-level `diff_diff`. - **`target_parameter` block in BR/DR schemas (experimental; schema version bumped to 2.0)** — `BUSINESS_REPORT_SCHEMA_VERSION` and `DIAGNOSTIC_REPORT_SCHEMA_VERSION` bumped from `"1.0"` to `"2.0"` because the new `"no_scalar_by_design"` value on the `headline.status` / `headline_metric.status` enum (dCDH `trends_linear=True, L_max>=2` configuration) is a breaking change per the REPORTING.md stability policy. BusinessReport and DiagnosticReport now emit a top-level `target_parameter` block naming what the headline scalar actually represents for each of the 16 result classes. Closes BR/DR foundation gap #6 (target-parameter clarity). Fields: `name`, `definition`, `aggregation` (machine-readable dispatch tag), `headline_attribute` (raw result attribute), `reference` (citation pointer). BR's summary emits the short `name` right after the headline; DR's overall-interpretation paragraph does the same; both full reports carry a "## Target Parameter" section with the full definition. Per-estimator dispatch is sourced from REGISTRY.md and lives in the new `diff_diff/_reporting_helpers.py::describe_target_parameter`. 
A few branches read fit-time config (`EfficientDiDResults.pt_assumption`, `StackedDiDResults.clean_control`, `ChaisemartinDHaultfoeuilleResults.L_max` / `covariate_residuals` / `linear_trends_effects`); others emit a fixed tag (the fit-time `aggregate` kwarg on CS / Imputation / TwoStage / Wooldridge does not change the `overall_att` scalar — disambiguating horizon / group tables is tracked under gap #9). See `docs/methodology/REPORTING.md` "Target parameter" section. - SyntheticDiD coverage Monte Carlo calibration table added to `docs/methodology/REGISTRY.md` §SyntheticDiD — rejection rates at α ∈ {0.01, 0.05, 0.10} across `placebo` / `bootstrap` / `jackknife` on 3 representative DGPs (balanced / exchangeable, unbalanced, and Arkhangelsky et al. (2021) AER §6.3 non-exchangeable). Artifact at `benchmarks/data/sdid_coverage.json` (500 seeds × B=200), regenerable via `benchmarks/python/coverage_sdid.py`. diff --git a/ROADMAP.md b/ROADMAP.md index 65a4b119..fcc02f22 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -137,15 +137,17 @@ Long-running program, framed as "building toward" rather than with discrete ship - Baker et al. (2025) 8-step workflow enforcement in `diff_diff/practitioner.py`. - `practitioner_next_steps()` context-aware guidance. -- Runtime LLM guides via `get_llm_guide(...)` (`llms.txt`, `llms-full.txt`, `llms-practitioner.txt`), bundled in the wheel. +- Runtime LLM guides via `get_llm_guide(...)` (`llms.txt`, `llms-full.txt`, `llms-practitioner.txt`, `llms-autonomous.txt`), bundled in the wheel. +- `profile_panel(df, ...)` returns a `PanelProfile` dataclass of structural facts about the panel - factual, not opinionated. Pairs with the `"autonomous"` guide variant (reference-shaped: estimator-support matrix + per-design-feature reasoning) so agents describe the data and then consult a bundled reference rather than calling a deterministic recommender. +- Package docstring leads with a "For AI agents" entry block so `help(diff_diff)` surfaces the agent entry points automatically. - Silent-operation warnings so agents and humans see the same signals at the same time. **Next blocks toward the vision.** -- **BusinessReport / DiagnosticReport** (in Shipping Next) - the output form the vision assumes. +- **Post-hoc mismatch detection in BR/DR output** - surfaces structured warnings like "you fit TWFE on staggered data with 37% forbidden-comparison weights" when the profile and the fitted estimator disagree. Safety net, not a pre-emptive rules engine. +- **Structured `sanity_checks` block in BR/DR** - machine-legible pass / warn / fail signals (pretrends, power, forbidden-comparisons, event-study cleanliness, placebo, sensitivity) so agents can dispatch on a stable schema rather than parsing prose. - **Context-aware `practitioner_next_steps()`** that substitutes actual column names - turns guidance into executable recommendations. -- **AI-legible diagnostic surfaces** - once BusinessReport ships, a structured JSON counterpart that agents can parse without screen-scraping human text. -- **Scenario-to-estimator selection guidance** - agent-facing extension of `docs/practitioner_decision_tree.rst` that returns a specific estimator choice plus rationale for a given scenario description. +- **Unified `assess_*` verb** across estimator native-diagnostic methods for a single discoverable convention. - **End-to-end scenario walkthrough templates** - reusable orchestration recipes an agent can adapt from data ingest through business-ready output.
--- diff --git a/diff_diff/__init__.py b/diff_diff/__init__.py index 4a9b93c4..0247eecd 100644 --- a/diff_diff/__init__.py +++ b/diff_diff/__init__.py @@ -4,14 +4,20 @@ This library provides sklearn-like estimators for causal inference using the difference-in-differences methodology. -For rigorous analysis, follow the 8-step practitioner workflow based -on Baker et al. (2025). After estimation, call -``practitioner_next_steps(results)`` for context-aware guidance on -remaining diagnostic steps. +For AI agents: -AI agents: call ``diff_diff.get_llm_guide()`` for a complete API reference. -Use ``get_llm_guide("practitioner")`` for the 8-step workflow or -``get_llm_guide("full")`` for comprehensive documentation. + 1. Describe your data: ``diff_diff.profile_panel(df, unit=..., time=..., + treatment=..., outcome=...)`` + 2. Consult the reference: ``diff_diff.get_llm_guide("autonomous")`` + (estimator-support matrix + reasoning) + 3. Follow the workflow: ``diff_diff.get_llm_guide("practitioner")`` + (Baker et al. (2025) 8-step recipe) + 4. Report results: ``diff_diff.BusinessReport(results)`` + (structured agent-legible output) + +For a comprehensive API reference call ``diff_diff.get_llm_guide("full")``; +``practitioner_next_steps(results)`` returns context-aware guidance after +any estimator's ``fit()``. """ # Import backend detection from dedicated module (avoids circular imports) @@ -244,6 +250,7 @@ DiagnosticReportResults, ) from diff_diff._guides_api import get_llm_guide +from diff_diff.profile import Alert, PanelProfile, profile_panel from diff_diff.datasets import ( clear_cache, list_datasets, @@ -487,6 +494,10 @@ "DiagnosticReport", "DiagnosticReportResults", "DIAGNOSTIC_REPORT_SCHEMA_VERSION", + # Panel profiling (agent-facing pre-fit describe utility) + "profile_panel", + "PanelProfile", + "Alert", # LLM guide accessor "get_llm_guide", ] diff --git a/diff_diff/_guides_api.py b/diff_diff/_guides_api.py index 5a00ed77..503c74ba 100644 --- a/diff_diff/_guides_api.py +++ b/diff_diff/_guides_api.py @@ -1,4 +1,5 @@ """Runtime accessor for bundled LLM guide files.""" + from __future__ import annotations from importlib.resources import files @@ -7,6 +8,7 @@ "concise": "llms.txt", "full": "llms-full.txt", "practitioner": "llms-practitioner.txt", + "autonomous": "llms-autonomous.txt", } @@ -21,6 +23,10 @@ def get_llm_guide(variant: str = "concise") -> str: - ``"concise"`` -- compact API reference (llms.txt) - ``"full"`` -- complete API documentation (llms-full.txt) - ``"practitioner"`` -- 8-step practitioner workflow (llms-practitioner.txt) + - ``"autonomous"`` -- reference guide for AI-agent use: estimator-support + matrix, per-design-feature reasoning, post-fit validation index, and + BR/DR schema (llms-autonomous.txt). Pair with + :func:`diff_diff.profile_panel` for pre-fit data description. Returns ------- @@ -42,7 +48,5 @@ def get_llm_guide(variant: str = "concise") -> str: filename = _VARIANT_TO_FILE[variant] except (KeyError, TypeError): valid = ", ".join(repr(k) for k in _VARIANT_TO_FILE) - raise ValueError( - f"Unknown guide variant {variant!r}. Valid options: {valid}." - ) from None + raise ValueError(f"Unknown guide variant {variant!r}. 
Valid options: {valid}.") from None return files("diff_diff.guides").joinpath(filename).read_text(encoding="utf-8") diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt new file mode 100644 index 00000000..b571911e --- /dev/null +++ b/diff_diff/guides/llms-autonomous.txt @@ -0,0 +1,844 @@ +# diff-diff: Autonomous-agent reference guide + +This guide is reference material for AI agents using diff-diff without +human-in-the-loop supervision. It catalogs the library's estimators, names +the design features each supports, explains how to read the +`profile_panel()` output, and points at post-fit validation utilities and +report schemas. + +It is a reference, not a decision tree. Multiple estimators usually fit a +given panel; choosing between them involves trade-offs the cited literature +discusses and that this guide does not pretend to resolve. + +**Pair this guide with:** +- `get_llm_guide("practitioner")` - the Baker et al. (2025) 8-step validation + workflow in workflow-prose form. +- `get_llm_guide("full")` - comprehensive API documentation for every public + function and class. +- `profile_panel(df, unit=..., time=..., treatment=..., outcome=...)` - the + pre-fit describe utility whose output fields this guide's sections §2 and + §4 reason about. + + +## Table of contents + +- §1. What this guide is (and is not) +- §2. PanelProfile field reference +- §3. Estimator-support matrix +- §4. Estimator-choice reasoning by design feature +- §5. Post-fit validation utilities +- §6. How to read BusinessReport / DiagnosticReport output +- §7. Glossary + citations +- §8. Intentional omissions + + +## §1. What this guide is (and is not) + +**What it is.** A reference you consult after running `profile_panel()` and +before calling any estimator's `fit()`. The matrix in §3 and the per-design- +feature discussions in §4 tell you which estimators are well-suited to the +panel shape reported by the profile; the post-fit index in §5 tells you +which diagnostics apply once you have a fitted result. + +**What it is not.** A deterministic recommender. No function in diff-diff +returns "pick estimator X." This guide does not either. When several +estimators fit a design, it enumerates them and names the trade-offs. The +agent is responsible for weighing those trade-offs (often with the cited +references in §7) and justifying the choice in the final write-up. + +**Why this shape.** A rules-engine recommender would lock in a policy that +ages poorly as new estimators land and as the applied-econometrics +literature evolves. Static reference material + descriptive profiling is +less brittle: when a new estimator is added it gets a row in §3 and a +paragraph in §4, without rewriting a dispatcher. + + +## §2. PanelProfile field reference + +`profile_panel(df, unit=..., time=..., treatment=..., outcome=...)` returns +a frozen `PanelProfile` dataclass. Call `.to_dict()` for a JSON-serializable +view. Every field below appears as a top-level key in that dict. + +### Panel structure + +- **`n_units: int`** - count of distinct values in the `unit` column. +- **`n_periods: int`** - count of distinct values in the `time` column. +- **`n_obs: int`** - total rows in the panel. +- **`is_balanced: bool`** - true iff every distinct `(unit, time)` cell + appears at least once in the panel (i.e. the unique `(unit, time)` + support equals `n_units * n_periods`). Duplicate rows do not affect + balance but are surfaced via the `duplicate_unit_time_rows` alert. 
+- **`observation_coverage: float`** - ratio of unique `(unit, time)` + keys to `n_units * n_periods`, always in `[0, 1]` (duplicates do not + inflate). A value below `0.70` also triggers the + `panel_highly_unbalanced` alert. + +### Treatment variation + +- **`treatment_type: str`** - classification of the treatment column. + Exactly one of: + - `"binary_absorbing"`: observed non-NaN values are a subset of + {0, 1} (one or two distinct values, covering all-zero and all-one + panels as valid degenerate cases) and each unit's treatment + sequence (ordered by `time`) is weakly monotone non-decreasing. + The canonical DiD setting. + - `"binary_non_absorbing"`: values a subset of {0, 1} with at least + two distinct values observed, where at least one unit switches + from 1 back to 0. Only `ChaisemartinDHaultfoeuille` handles this + natively; the other absorbing-only estimators would misapply. + - `"continuous"`: numeric with more than two distinct values, or a + two-valued numeric column whose values are not in {0, 1} (e.g., + a dose, a discrete-integer partial-adoption score). Use + `ContinuousDiD` or `HeterogeneousAdoptionDiD`. + - `"categorical"`: non-numeric dtype (object / category), or a + column that is entirely NaN. Often indicates a treatment arm. + Encode each arm as a binary indicator and fit separately, or + use a multi-treatment workflow outside the current estimator + suite. + + Bool-dtype treatment columns (`True` / `False`) are classified the + same way as numeric `{0, 1}`: the library's binary estimators + validate on value support rather than dtype, so `True` and `False` + behave like `1` and `0` for absorbing / non-absorbing classification. +- **`is_staggered: bool`** - true iff treatment is `binary_absorbing` and + at least two distinct first-treatment periods are observed. Drives the + choice between classic DiD/TWFE and staggered-robust estimators. +- **`n_cohorts: int`** - for `binary_absorbing`, the number of distinct + first-treatment periods (cohorts). Zero for other `treatment_type` + values. +- **`cohort_sizes: Mapping[Any, int]`** - map from first-treatment period + to cohort size (number of units adopting at that time). Empty for + non-absorbing / continuous / categorical treatments. +- **`has_never_treated: bool`** - at least one unit has `treatment == 0` + in every observed non-NaN row (applies to both binary and continuous + treatment columns; for continuous this flags zero-dose control units). + Required by `SyntheticDiD`, `SunAbraham`, `EfficientDiD` under both + `assumption="PT-All"` and `assumption="PT-Post"` (unless + `control_group="last_cohort"` is passed), and `ContinuousDiD` + (which requires `P(D=0) > 0` - Remark 3.1 lowest-dose-as-control + is not yet implemented). Preferred-but-optional by + `CallawaySantAnna` and `ChaisemartinDHaultfoeuille`. Always `False` + for `"categorical"`. +- **`has_always_treated: bool`** - at least one binary-treatment + unit has `treatment == 1` in every observed non-NaN row (no + pre-treatment information for that unit in the DiD sense). + Binary-only semantics: for `"continuous"` panels this field is + always `False` because pre-treatment periods are determined by the + `first_treat` column supplied to `ContinuousDiD.fit()`, not by + whether the dose is positive - a unit with a constant positive dose + can still have well-defined pre-treatment periods. Always `False` + for `"categorical"` too. 
+- **`treatment_varies_within_unit: bool`** - at least one unit has more + than one distinct non-NaN treatment value across its observed rows. + For binary panels this is normally `True` (pre vs. post the adoption + period), and for continuous panels this flags time-varying dose. + `ContinuousDiD.fit()` requires this to be `False` (dose must be + time-invariant per unit, per Callaway et al. 2024); a `True` value on + a continuous panel rules the estimator out. Always `False` for + `"categorical"`. + +### Timing + +- **`first_treatment_period: Optional[Any]`** - earliest first-treatment + period observed (for `binary_absorbing`); `None` otherwise. +- **`last_treatment_period: Optional[Any]`** - latest first-treatment + period observed; `None` otherwise. +- **`min_pre_periods: Optional[int]`** - across treated units, the + smallest number of observed pre-treatment periods (each treated + unit's observed `(unit, time)` support is counted independently, so + this reflects the least-supported treated unit on unbalanced panels). + Low values (< 3) fire the `short_pre_panel` alert and limit power + for parallel-trends tests. +- **`min_post_periods: Optional[int]`** - across treated units, the + smallest number of observed post-treatment periods; same per-unit + support semantics as above. Low values limit event-study dynamics. + +### Outcome + +- **`outcome_dtype: str`** - the pandas dtype name (e.g. `"float64"`, + `"int64"`, `"bool"`). +- **`outcome_is_binary: bool`** - outcome has exactly two distinct + non-NaN values, both in {0, 1}. For binary outcomes the linear + parallel-trends assumption is restrictive; consider the logit/log-odds + alternative in the Roth/Sant'Anna (2023) survey. +- **`outcome_has_zeros: bool`** - any non-NaN outcome equals zero. + Relevant for log-transform diagnostics. +- **`outcome_has_negatives: bool`** - any non-NaN outcome is negative. + Relevant for log-transform diagnostics. +- **`outcome_missing_fraction: float`** - share of rows where the + outcome column is NaN, in `[0, 1]`. +- **`outcome_summary: Mapping[str, float]`** - `{min, max, mean, std}` + computed with NaN-skipping; empty for non-numeric outcomes. + +### Alerts + +`alerts: tuple[Alert, ...]` is a list of factual observations. Each +`Alert` has `code`, `severity` (`"info"` or `"warn"`), `message`, and +`observed` (the numerical or boolean value that tripped the alert). + +The v1 alert catalogue is listed below. Alerts never name a specific +estimator. Severity `"warn"` means the observation is likely relevant to +estimator choice or to the interpretation of diagnostics; `"info"` means +it is descriptive context. 
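+A minimal usage sketch tying the §2 fields together (the column names
+`firm`, `year`, `treated`, and `revenue` are illustrative, not required
+names; the alert catalogue itself follows in the table below):
+
+```python
+import pandas as pd
+
+from diff_diff import profile_panel
+
+# Toy balanced two-unit, three-period panel.
+df = pd.DataFrame({
+    "firm":    [1, 1, 1, 2, 2, 2],
+    "year":    [2000, 2001, 2002, 2000, 2001, 2002],
+    "treated": [0, 1, 1, 0, 0, 0],
+    "revenue": [1.0, 1.4, 1.5, 0.9, 1.0, 1.1],
+})
+
+profile = profile_panel(
+    df, unit="firm", time="year", treatment="treated", outcome="revenue"
+)
+
+profile.treatment_type   # "binary_absorbing": values in {0, 1}, monotone per unit
+profile.is_balanced      # True: every (unit, time) cell is observed
+
+# Alerts are factual observations; surface the "warn"-severity ones.
+for alert in profile.alerts:
+    if alert.severity == "warn":
+        print(alert.code, "->", alert.message, "| observed:", alert.observed)
+
+payload = profile.to_dict()   # JSON-serializable view of every field above
+```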
+ +| Alert code | Severity | Fires when | +|---|---|---| +| `missing_id_rows_dropped` | warn | rows with NaN `unit` or `time` were dropped before computing structural facts | +| `duplicate_unit_time_rows` | warn | panel contains more than one row per (unit, time) | +| `min_cohort_size_below_10` | warn | smallest cohort has fewer than 10 units | +| `only_one_cohort` | info | all treated units adopt simultaneously | +| `short_pre_panel` | warn | `min_pre_periods < 3` | +| `short_post_panel` | info | `min_post_periods < 3` | +| `no_never_treated` | info | every unit is eventually treated | +| `has_always_treated_units` | info | some units are treated in every observed period | +| `all_units_treated_simultaneously` | info | single cohort and no never-treated group | +| `panel_highly_unbalanced` | warn | `observation_coverage < 0.70` | +| `only_two_periods` | info | `n_periods == 2` | +| `outcome_looks_binary_but_dtype_float` | info | outcome takes {0, 1} values but is stored as float | + + +## §3. Estimator-support matrix + +Rows are estimator classes exported from `diff_diff`. Columns are design +features derivable from `PanelProfile`. Cells: `✓` supported; `✗` not +supported / out of scope; `warn` supported but with documented caveats; +`partial` supported subject to restrictions discussed in §4. + +| Estimator | binary absorbing | staggered | continuous | triple-diff | never-treated required | covariate adjustment | few-treated (synthetic) | heterogeneous adoption | clustered SE | +|---|---|---|---|---|---|---|---|---|---| +| `DifferenceInDifferences` | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `MultiPeriodDiD` | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `TwoWayFixedEffects` | ✓ | warn | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `CallawaySantAnna` | ✓ | ✓ | ✗ | ✗ | partial | ✓ | ✗ | ✗ | ✓ | +| `SunAbraham` | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | +| `ChaisemartinDHaultfoeuille` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `ImputationDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `TwoStageDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `StackedDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `WooldridgeDiD` (ETWFE) | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `EfficientDiD` | ✓ | ✓ | ✗ | ✗ | partial | ✓ | ✗ | ✗ | ✓ | +| `SyntheticDiD` | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | partial | +| `TROP` | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | partial | +| `TripleDifference` | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `StaggeredTripleDifference` | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `ContinuousDiD` | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | +| `HeterogeneousAdoptionDiD` | ✗ | partial | partial | ✗ | ✗ | ✗ | ✗ | ✓ | warn | + +**Footnotes.** +- `TwoWayFixedEffects` + staggered: fits but mixes positive and negative + cohort-weights that violate the ATT interpretation; consult + `BaconDecomposition` to quantify. Prefer any staggered-robust + estimator (CS, SA, dCDH, Imputation, TwoStage, ETWFE) for a staggered + design. +- `CallawaySantAnna` + never-treated: the "never-treated" control group + is one option; "not-yet-treated" is the other. Pick via the + `control_group` argument. If `has_never_treated == False`, use + `control_group="not_yet_treated"`. +- `EfficientDiD` + never-treated: both `assumption="PT-All"` and + `assumption="PT-Post"` require actual never-treated units - PT-Post + is the weaker parallel-trends assumption but still uses never-treated + as the comparison group (REGISTRY.md `EfficientDiD` "Parallel Trends + -- two variants"). 
To admit an all-eventually-treated panel, pass + `control_group="last_cohort"` to reclassify the latest treatment + cohort as a pseudo-never-treated control and trim post-treatment + periods at/after its adoption. The `EfficientDiD.hausman_pretest` + classmethod picks between `PT-All` and `PT-Post` on panels that do + have never-treated units. +- `SyntheticDiD` + staggered: not supported. `fit()` raises + `ValueError` on within-unit treatment variation; SDiD requires block + treatment (all treated units adopt at the same time). For staggered + designs use a cohort-level fit loop externally or pick a + staggered-robust estimator above. +- `TROP` staggered support: treatment is an absorbing-state indicator, + so staggered adoption is handled via the D matrix. TROP `fit()` has + no covariate surface; its local method uses every unit untreated at + period `t` as the donor pool (not a never-treated-only set). +- `HeterogeneousAdoptionDiD` covariate adjustment: identification with + covariates (paper Appendix B.1, Equation 19) is deferred to future + work; `fit(covariates=...)` is not yet implemented. +- `HeterogeneousAdoptionDiD` clustered SE: `cluster=` is honored on the + mass-point / CR1 path; on the continuous nonparametric paths the + kwarg emits a `UserWarning` and is ignored (Phase 2a scope). Use + `bias_corrected_local_linear` directly for cluster-robust inference + on the nonparametric path. +- `HeterogeneousAdoptionDiD` continuous: supports partial-adoption + intensity as a continuous first-stage variable; not a pure + dose-response estimator - use `ContinuousDiD` for that. +- `HeterogeneousAdoptionDiD` staggered support is `partial`, not + general. Paper Appendix B.2 restricts staggered use to the + **last treatment cohort plus never-treated units**. With + `aggregate="event_study"` and a `first_treat_col` kwarg, + `fit()` auto-filters to `F_last = max(cohorts)` and emits a + `UserWarning` naming kept/dropped counts; earlier-cohort units + are dropped. Without `first_treat_col`, a multi-cohort panel + raises `ValueError`. For full staggered support that retains + every cohort, use `ChaisemartinDHaultfoeuille` instead. + +**Balanced-panel eligibility.** The following estimators require +exactly one observation per `(unit, time)` cell with every unit +observed in every period: `ContinuousDiD`, `EfficientDiD`, +`SyntheticDiD`, `HeterogeneousAdoptionDiD`, +`StaggeredTripleDifference`. Gate these on BOTH +`PanelProfile.is_balanced == True` AND the absence of the +`duplicate_unit_time_rows` alert (`is_balanced` is computed from the +unique-key support and stays `True` when duplicates exist; the +alert is the separate signal for duplicates). Treat both +conditions as hard gates: `EfficientDiD` and +`HeterogeneousAdoptionDiD` raise `ValueError` at `fit()` on +duplicate cells, and `ContinuousDiD`'s precompute path resolves +duplicates with last-row-wins (silent overwrite that can change +the estimand). If either condition fails, pre-process with +`diff_diff.prep.balance_panel()` and a +`drop_duplicates([unit, time])` pass, or pick a balance-tolerant +estimator from the remaining rows (CS/SA/dCDH/Imputation/TwoStage/ +Stacked/ETWFE all accept unbalanced input, with some caveats in +their own docs). + + +## §4. Estimator-choice reasoning by design feature + +Each subsection names a design feature and lists estimators applicable to +it with the most important trade-offs. Multiple paths are always +explicit; no subsection says "pick estimator X." 
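+As a concrete reading of the §3 balanced-panel eligibility paragraph, the
+sketch below gates on both facts it names. The helper name is illustrative
+and is not part of the library:
+
+```python
+from diff_diff import PanelProfile
+
+
+def balanced_panel_gate(profile: PanelProfile) -> bool:
+    """True only when both §3 conditions hold for one-row-per-cell estimators.
+
+    `is_balanced` alone is not enough: it is computed on the unique
+    (unit, time) support and stays True when duplicate rows exist, so the
+    `duplicate_unit_time_rows` alert is checked separately.
+    """
+    alert_codes = {alert.code for alert in profile.alerts}
+    return profile.is_balanced and "duplicate_unit_time_rows" not in alert_codes
+```
+
+If the gate fails, §3's remedy applies: pre-process with
+`diff_diff.prep.balance_panel()` plus a `drop_duplicates([unit, time])` pass,
+or stay on the balance-tolerant rows of the matrix.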
+ +### §4.1 Classic 2×2 DiD (binary absorbing, two periods, no staggering) + +When `treatment_type == "binary_absorbing"`, `n_periods == 2`, and +`is_staggered == False`, the classic Card-and-Krueger 2×2 design applies. +Most estimators in the library produce the same point estimate in this +case; the choice between them is mostly about output shape: + +- `DifferenceInDifferences` for a minimal results object. +- `TwoWayFixedEffects` if you want the equivalent two-way-FE regression + output (coefficient table, VCV, etc.). Identical to DiD in the 2×2 + case. +- `TripleDifference` if a second comparison dimension is available + (DDD) - see §4.6. + +### §4.2 Multi-period single-cohort (event-study without staggering) + +When `is_staggered == False` and `n_periods > 2`, event-study dynamics +can be estimated but cohort-mixing bias is moot: + +- `MultiPeriodDiD` - per-period effect, standard event-study plot. +- `TwoWayFixedEffects` with event-time dummies - similar output, no + forbidden comparisons because there is only one cohort. + +### §4.3 Staggered adoption (multi-cohort binary absorbing) + +When `is_staggered == True`, classic TWFE mixes positive- and +negative-weighted cohort comparisons (Goodman-Bacon 2021, +de Chaisemartin & d'Haultfoeuille 2020). Use one of the staggered-robust +estimators: + +- `CallawaySantAnna` - group-time ATTs aggregated to ES / overall / cohort + dimensions. Flexible control-group choice (never-treated vs. + not-yet-treated). Covariate adjustment via doubly-robust (DR), IPW, + or regression-adjustment (RA). +- `SunAbraham` - interaction-weighted estimator; closely tied to + two-way-FE output, computationally cheap, produces event-time + coefficients. Requires a never-treated cohort (`fit` raises a + `ValueError` when none exists). +- `ChaisemartinDHaultfoeuille` - DID_M / DID_l estimators robust to + non-absorbing / reversible treatment (see §4.5). Interference / + between-unit spillovers are not supported natively - SUTVA is + assumed like every other DiD estimator in the suite. +- `ImputationDiD` (Borusyak, Jaravel, Spiess) - imputation-based, + efficient under homoskedasticity, produces an imputation-based + residual at the observation level. +- `TwoStageDiD` (Gardner) - two-stage residualize-then-regress. +- `StackedDiD` - stacked event-study regressions, one subpanel per + cohort. Conservative interpretation. +- `WooldridgeDiD` (ETWFE) - extended-TWFE with cohort-by-time-by- + covariates interactions; heterogeneous covariate-by-cohort effects. +- `EfficientDiD` (Chen, Sant'Anna, Xie 2025) - asymptotically efficient + under either `PT-All` or `PT-Post`; use `EfficientDiD.hausman_pretest` + to pick. Requires a balanced panel (`PanelProfile.is_balanced == + True`); `fit()` raises `ValueError` on unbalanced input. + +Diagnostic: `bacon_decompose(df, ...)` shows the weight allocation of a +TWFE fit to 2×2 comparison types. Forbidden-comparison weight > 10% is a +strong signal that the TWFE estimate is biased. + +### §4.4 No never-treated group + +When `has_never_treated == False`: + +- `SyntheticDiD` requires a never-treated donor pool - not applicable. +- `TROP` does not require a strict never-treated partition: its donor + pool is every unit untreated at the current period `t` (via the + absorbing D matrix). When every unit is eventually treated TROP can + still fit, with the donor pool shrinking over time - check the + pre-treatment coverage of the factor-model fit in the results + diagnostics. 
+- `EfficientDiD` requires never-treated comparisons under both + `assumption="PT-All"` and `assumption="PT-Post"`. To admit an + all-treated panel, pass `control_group="last_cohort"` to use the + latest treatment cohort as a pseudo-never-treated control + (post-treatment periods at/after that cohort's adoption are + trimmed). Distinct from CallawaySantAnna's `not_yet_treated` + option. +- `ContinuousDiD` requires zero-dose control units (`P(D=0) > 0`). + Remark 3.1 of the paper (lowest-dose-as-control) is not yet + implemented; `fit()` raises `ValueError` when no `D=0` units exist. +- `CallawaySantAnna` - use `control_group="not_yet_treated"` to use + not-yet-treated units as the control pool. +- `ChaisemartinDHaultfoeuille` - constructs switchers vs. non-switchers + directly; no never-treated requirement. +- TWFE / `MultiPeriodDiD` / `ImputationDiD` / `TwoStageDiD` / + `StackedDiD` / `WooldridgeDiD` - use the last-treated or untreated- + until-late units as implicit controls; estimators do not error, but + consider whether the implicit control structure is what you want. + +### §4.5 Non-absorbing binary treatment (treatment switches back to 0) + +When `treatment_type == "binary_non_absorbing"`: + +- `ChaisemartinDHaultfoeuille` is the only estimator in the library + that treats this natively. Switcher / non-switcher comparisons are + its primitive object. +- Other estimators assume absorbing treatment and will produce + estimates whose interpretation is unclear. Do not use them without + a well-argued reason. + +### §4.6 Triple-difference design (DDD) + +When a second cross-cutting comparison axis exists (e.g., policy hits +some states and some demographic subgroups within states): + +- `TripleDifference` - classic two-period DDD. +- `StaggeredTripleDifference` - staggered DDD, robust to cohort-mixing. + +Triple-difference is not automatically detected by `profile_panel`; +it requires the caller to identify the third comparison axis. If a +`group` covariate in the panel drives differential exposure, DDD is +worth considering. + +### §4.7 Continuous / dose-response treatment + +When `treatment_type == "continuous"`: + +- `ContinuousDiD` (Callaway, Goodman-Bacon, Sant'Anna 2024) - + continuous / dose-response treatment. **Three eligibility + prerequisites**: (a) zero-dose control units must exist + (`P(D=0) > 0`) because Remark 3.1 (lowest-dose-as-control) is not + yet implemented, (b) dose must be time-invariant per unit (rule out + panels where `PanelProfile.treatment_varies_within_unit == True`), + and (c) the panel must be balanced (`PanelProfile.is_balanced == + True`). `fit()` raises `ValueError` in any of the three cases. Note that + staggered adoption IS supported natively (adoption timing is + expressed via the `first_treat` column, not via within-unit dose + variation). The estimator exposes several dose-indexed targets that + require different assumptions: `ATT(d|d)` (effect of dose `d` on + units that received `d`) and `ATT^{loc}` (binarized overall ATT) + are identified under Parallel Trends; `ATT(d)` (full dose-response + curve), `ACRT(d)` (marginal effect, i.e. the average causal + response), and `ACRT^{glob}` require the stronger Strong Parallel + Trends assumption. The BR headline scalar is the overall ATT; ACR + and dose-response tables are available in the result object. + Supports B-spline basis construction. +- `HeterogeneousAdoptionDiD` - partial-adoption intensity, with a + scalar first-stage adoption summary. Useful when adoption is + graded rather than binary. 
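+The three `ContinuousDiD` prerequisites above map directly onto
+`PanelProfile` fields. A descriptive sketch (the helper is illustrative,
+not a library function, and it reports facts rather than picking an
+estimator):
+
+```python
+from diff_diff import PanelProfile
+
+
+def continuous_did_prerequisites(profile: PanelProfile) -> dict:
+    """Report the §4.7 eligibility facts for a continuous-treatment panel."""
+    alert_codes = {alert.code for alert in profile.alerts}
+    return {
+        # (a) zero-dose control units exist (P(D=0) > 0); for continuous
+        #     treatment, `has_never_treated` flags exactly this.
+        "zero_dose_controls": profile.has_never_treated,
+        # (b) dose is time-invariant within each unit.
+        "time_invariant_dose": not profile.treatment_varies_within_unit,
+        # (c) balanced panel, with the §3 duplicate-row caveat (duplicates
+        #     are otherwise resolved last-row-wins, silently).
+        "balanced_no_duplicates": (
+            profile.is_balanced and "duplicate_unit_time_rows" not in alert_codes
+        ),
+    }
+```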
+ +### §4.8 Few treated units (one or a handful) + +When few treated units exist (not a separate `PanelProfile` field yet, +but derivable from `cohort_sizes` + `has_never_treated`): + +- `SyntheticDiD` - synthetic-control-meets-DiD. Requires never-treated + donors and sufficient pre-treatment periods (Arkhangelsky et al. 2021). + Block treatment only: all treated units must adopt at the same time. + Requires a balanced panel (`PanelProfile.is_balanced == True`); + `fit()` raises `ValueError` and points at `balance_panel()`. +- `TROP` - factor-model-based generalized synthetic control. Uses every + unit untreated at period `t` as the donor pool (via the absorbing-state + D matrix); supports staggered adoption and more complex factor + structures. No covariate-adjustment surface on `fit()`. + +Classical DiD estimators will still produce estimates, but inference is +unreliable with very small treated groups; cluster-robust SE relies on +the number of clusters, not the number of treated units. Bootstrap +methods in the library are preferred. + +### §4.9 Heterogeneous adoption intensity + +When adoption varies in strength across units (partial-adoption settings, +intensity of exposure differs): + +- `HeterogeneousAdoptionDiD` - requires a balanced panel + (`PanelProfile.is_balanced == True`; `fit()` raises `ValueError` + when any unit is missing a period). Targets a Weighted Average Slope (WAS) + on single-period Heterogeneous Adoption Designs where no genuinely + untreated group exists (paper Equation 2 / Theorem 1). The + `target_parameter` attribute on the results object is literally + `"WAS"` for Design 1' and `"WAS_d_lower"` for Design 1 with lower-dose + comparison under Assumption 6. `fit(aggregate="overall")` (Phase 2a) + returns a single scalar WAS; `fit(aggregate="event_study")` (Phase + 2b) returns per-event-time WAS estimates. `did_had_pretest_workflow()` + runs the paper's three-step TWFE-suitability battery: (1) QUG null + via `qug_test`, (2) Assumption 7 pre-trends via `stute_test` / + `stute_joint_pretest` (event-study path only; the two-period overall + path flags this step as deferred), and (3) linearity of + `E[ΔY | D_2]` via `stute_test` / `yatchew_hr_test`. Assumption 3 + (uniform continuity / no extensive-margin jump) is not testable; the + pre-test battery does not and cannot validate it. Not ATT-shaped; do + not relabel the headline as ATT in report text. + + **Staggered-timing scope is last-cohort-only (Appendix B.2).** + HAD's staggered support is the `partial` cell in §3: on a + multi-cohort panel passed to `aggregate="event_study"`, `fit()` + auto-filters to the last treatment cohort (`F_last = + max(cohorts)`) plus never-treated units and emits a + `UserWarning` naming kept/dropped counts; earlier treated + cohorts are dropped. The `first_treat_col` kwarg is + **required** for the auto-filter to activate; without it a + multi-cohort panel raises `ValueError` pointing the caller at + `ChaisemartinDHaultfoeuille` for full staggered support. The + resulting estimand is a **last-cohort-only WAS**, not a + multi-cohort average — report it as such. + +### §4.10 Repeated cross-sections (no panel structure) + +`profile_panel` assumes long-format panel data. When the same units are +not observed across time (true repeated cross-sections), only the +estimators whose documented contract explicitly admits RCS are +applicable. 
Do not route RCS data to any other estimator in the suite - +most of them are panel-only by construction and will either raise at +fit time or estimate under a misspecified identifying assumption. + +Explicit RCS support in this library: + +- `CallawaySantAnna(panel=False)` - repeated-cross-section mode per + REGISTRY.md §CallawaySantAnna; use this variant on RCS data. +- `TripleDifference` - DDD cross-sectional use cases are documented + in `docs/choosing_estimator.rst`; the two-period DDD estimator does + not require within-unit tracking when the third comparison axis + carries the identification. The staggered DDD variant is panel-only + and listed separately below. + +Explicitly rejected for RCS (panel-only): + +- `EfficientDiD` - REGISTRY notes "does not handle ... repeated + cross-sections." +- `HeterogeneousAdoptionDiD` - panel-only (requires a balanced panel + with per-unit adoption timing). +- `SyntheticDiD` - requires balanced panel with per-unit donor matching. +- `ContinuousDiD` - requires balanced panel with per-unit constant + dose. +- `StaggeredTripleDifference` - panel-only; `fit()` has no + `panel=False` mode and rejects duplicate / unbalanced + `(unit, time)` structure. For cross-sectional DDD data use + `TripleDifference` instead. + +Treat other estimators in this guide as panel-only unless their own +docs explicitly say otherwise. When routing, also: + +- Cluster SE on the unit proxy (state, region) rather than the + individual cross-section respondent. +- Confirm the treatment assignment is at the cluster level, not at + the individual-respondent level, before interpreting the estimate + as a group-time ATT. + + +## §5. Post-fit validation utilities + +After any `fit()`, the Baker et al. (2025) 8-step workflow recommends a +diagnostic sequence. The library exposes utilities covering each step. +Consult `get_llm_guide("practitioner")` for the workflow-prose form; this +section is the API-reference index. + +### Parallel-trends and pre-trends + +- `check_parallel_trends(df, ...)` - exported from `diff_diff`. + Regression-based visual-plus-numeric test on pre-treatment periods. + Returns a structured result with p-value and per-period coefficients. +- `check_parallel_trends_robust(df, ...)` - Roth (2022) power-adjusted + version; adds a "believable-magnitude" check against a power curve. +- `equivalence_test_trends(df, ...)` - Bilinski-Hatfield-style + equivalence test (alternative framing of the PT test). +- `compute_pretrends_power(results, ...)` - standalone power analysis + for the PT test; takes a fitted `MultiPeriodDiDResults` (or + compatible event-study results object), not raw DataFrame. Useful + when `min_pre_periods` is small. + +### Sensitivity / robustness + +- `compute_honest_did(results, ...)` - Rambachan-Roth (2023) honest DiD. + Quantifies the sensitivity of ATT to parallel-trends violations. + Outputs sensitivity bounds under smoothness restrictions. +- `compute_pretrends_power(results, ...)` - complementary tool for + power-aware pre-trends interpretation (same fitted-results-first + signature as above). + +### Placebo tests + +- `run_placebo_test(df, ...)` - generic placebo runner. +- `run_all_placebo_tests(df, ...)` - batch runner over predefined + placebos. +- `placebo_timing_test(df, ...)` - false placebo-treatment time. +- `placebo_group_test(df, ...)` - placebo treatment-group assignment. +- `permutation_test(df, ...)` - Fisher-style exact permutation. +- `leave_one_out_test(df, ...)` - refit dropping one unit at a time. 
+ +### Estimator-native diagnostics + +Some estimators expose diagnostics as methods on the result object: + +- `SyntheticDiDResults.in_time_placebo()` - placebo treatment applied + in a pre-treatment period. +- `SyntheticDiDResults.sensitivity_to_zeta_omega()` - regularization- + hyperparameter sensitivity. +- `SyntheticDiDResults.get_weight_concentration()` - donor-weight + concentration summary. +- `CallawaySantAnna.diagnose_propensity(df, ...)` - propensity-score + overlap check when using DR / IPW controls. +- `EfficientDiD.hausman_pretest(df, ...)` - chooses between `PT-All` and + `PT-Post` for `EfficientDiD`. +- `did_had_pretest_workflow(df, ...)` - bundled QUG / Stute / Yatchew- + Härdle pre-test battery for `HeterogeneousAdoptionDiD`. + +### Decomposition and weight auditing + +- `bacon_decompose(df, ...)` - Goodman-Bacon (2021) TWFE weight + decomposition. Returns a `BaconDecompositionResults` with the weight + on forbidden (later-vs-earlier) comparisons. Run before interpreting + any TWFE staggered fit. + +### Event-study plotting + +- `plot_event_study(results, ...)` +- `plot_group_effects(results, ...)` +- `plot_group_time_heatmap(results, ...)` +- `plot_staircase(results, ...)` +- `plot_honest_event_study(honest_results, ...)` - takes a + `HonestDiDResults` returned by `compute_honest_did`, not a fit + result directly. +- `plot_sensitivity(sensitivity_results, ...)` - takes a + `SensitivityResults` object (the result of honest-DiD sensitivity + analysis), not a fit result directly. +- `plot_synth_weights(results, ...)` +- `plot_dose_response(results, ...)` +- `plot_power_curve(...)` + +Event-study plots are also a diagnostic - pre-treatment coefficients +close to zero support parallel trends. + + +## §6. How to read BusinessReport / DiagnosticReport output + +`BusinessReport(results)` and `DiagnosticReport(results)` are experimental +in the 3.2 line. Their schema is versioned (`BUSINESS_REPORT_SCHEMA_VERSION` +and `DIAGNOSTIC_REPORT_SCHEMA_VERSION`, both `"2.0"` at time of writing) +and expected to evolve. Treat `.to_dict()` output as the agent-legible +contract; the prose renderers (`summary()`, `full_report()`) are derived +from it. + +### BusinessReport `to_dict()` schema (v2.0) + +Top-level keys emitted by `BusinessReport.to_dict()` +(source: `diff_diff/business_report.py`): + +- `schema_version: str` - `BUSINESS_REPORT_SCHEMA_VERSION`, e.g. `"2.0"`. +- `estimator: dict` - `class_name` (the fitted result class) and a + human-friendly `display_name`. +- `context: dict` - the `BusinessContext` bundle: `outcome_label`, + `outcome_unit`, `outcome_direction`, `business_question`, + `treatment_label`, `alpha`. +- `headline: dict` - the main point estimate plus framing fields. +- `target_parameter: dict` - what the headline scalar represents. + Fields: `name` (e.g. `"ATT"`, `"DID_M"`, `"dose-response"`, + `"WAS"`), `definition` (plain-English description), `aggregation` + (machine tag), `headline_attribute` (raw result attribute), and + `reference` (REGISTRY.md citation string). +- `assumption: dict` - named assumptions relied on (parallel trends, + no anticipation, SUTVA, ...). Note: singular `"assumption"`, not + `"assumptions"`. +- `pre_trends: dict` - pre-trends test result with verdict string + (e.g. `"clean"`, `"inconclusive"`, `"violated"`), p-value, and + power assessment if available. Note: underscore-split + `"pre_trends"`. +- `sensitivity: dict` - HonestDiD sensitivity summary when available. +- `sample: dict` - sample size and coverage details. 
Note: bare + `"sample"`, not `"sample_summary"`. +- `heterogeneity: dict` - heterogeneity summary if applicable. +- `robustness: dict` - placebo / robustness summaries if available. +- `diagnostics: dict` - a wrapper around the auto-constructed + `DiagnosticReport`. Always has a `status` field: `"skipped"` with a + `reason` when `auto_diagnostics=False`, otherwise `"ran"` with the + full DR `to_dict()` payload under `diagnostics["schema"]` and a + mirrored `overall_interpretation` string. Parse `schema` (not + `diagnostics` directly) to access the DR sections documented below. +- `next_steps: list[dict]` - Baker et al. next-step guidance from + `practitioner_next_steps`. +- `caveats: list[str]` - free-text caveats generated from failed + checks. +- `references: list[dict]` - citations relevant to the estimator. + +### DiagnosticReport `to_dict()` schema (v2.0) + +Top-level keys (source: `diff_diff/diagnostic_report.py`): + +- `schema_version: str` - `DIAGNOSTIC_REPORT_SCHEMA_VERSION`. +- `estimator: str` - the fitted result class name. +- `headline_metric: dict` - the main scalar the report headlines. +- `target_parameter: dict` - same shape as the BR field above. +- `parallel_trends: dict` - PT test result. +- `pretrends_power: dict` - power-aware pre-trends assessment when + applicable. +- `sensitivity: dict` - HonestDiD sensitivity summary. +- `placebo: dict` - placebo-test results. +- `bacon: dict` - Goodman-Bacon decomposition when applicable. +- `design_effect: dict` - survey / clustering design-effect summary. +- `heterogeneity: dict` - group-time heterogeneity summary. +- `epv: dict` - events-per-variable / sample-adequacy. +- `estimator_native_diagnostics: dict` - estimator-specific + diagnostics (e.g. SDiD weight concentration, TROP factor-model + fit). +- `skipped: dict` - checks skipped on this estimator type, with the + reason. +- `warnings: list[str]` - top-level aggregated warnings. +- `overall_interpretation: str` - rendered prose summary of the + sections. +- `next_steps: list[dict]` - same shape as the BR field. + +Each section value is a dict. Parse it in two layers: + +1. `status: str` — execution state, not qualitative interpretation. + The values actually emitted by `DiagnosticReport.to_dict()` are: + `"ran"` (section executed), `"not_applicable"` (check does not + apply to this estimator or design), `"not_run"` (implementation + pending), `"no_scalar_by_design"` (for estimators that return a + table instead of a scalar headline, e.g. dCDH with + `trends_linear=True, L_max>=2`), and `"skipped"` (auto-diagnostics + disabled or the section was short-circuited at top level). +2. `verdict: str` (only present when `status == "ran"`) — qualitative + interpretation of the executed check. Candidate values include + `"clean"`, `"inconclusive"`, `"violated"`, and section-specific + labels. + +`reason: str` is an optional free-text explanation that usually +accompanies non-`"ran"` statuses; it may also appear on `"ran"` +sections as supplementary context. The rest of each section dict is +section-specific payload (e.g. p-values, coefficients, cohort tables). + +Forthcoming schema additions (not yet shipped): a top-level +`sanity_checks` block (machine-legible pass/warn/fail summary) and a +`mismatch_warnings` list (post-hoc estimator-mismatch detection) are +queued for a later wave. Treat their current absence as expected. + + +## §7. Glossary + citations + +**ATT**: Average Treatment Effect on the Treated. The target parameter +of most DiD estimators. 
+ +**Parallel trends**: counterfactual trends in treated and control +outcomes would have moved together absent treatment. Untestable directly; +pre-treatment dynamics are a necessary (not sufficient) indicator. + +**No anticipation**: units do not respond to treatment before it occurs. +If plausible, test via pre-treatment event-study coefficients. + +**SUTVA**: Stable Unit Treatment Value Assumption. Rules out spillovers +and interference between units. + +**Forbidden comparison**: in TWFE, a comparison where already-treated +units serve as controls for later-treated units. Weights are negative +and the resulting estimate can flip sign vs. the true ATT. + +**Cohort / treatment timing**: first-treatment period for an +absorbing-treatment unit. Units sharing a cohort share an adoption date. + +**Staggered adoption**: two or more cohorts present in the panel. + +**Doubly-robust (DR) / IPW / RA**: three covariate-adjustment strategies +in `CallawaySantAnna`. DR is consistent if either the propensity model +or the outcome model is correctly specified. + +### Primary references + +- **Baker, Andrew, Brantly Callaway, Scott Cunningham, Andrew + Goodman-Bacon, and Pedro H. C. Sant'Anna (2025).** "Difference-in- + Differences Designs: A Practitioner's Guide." arXiv:2503.13323. + The 8-step workflow and best-practice framing. Ships as + `get_llm_guide("practitioner")`. +- **Roth, Jonathan, Pedro H. C. Sant'Anna, Alyssa Bilinski, and John + Poe (2023).** "What's Trending in Difference-in-Differences? A + Synthesis of the Recent Econometrics Literature." Journal of + Econometrics 235(2): 2218-2244. Canonical-assumption framing; + classification of estimator relaxations. +- **Goodman-Bacon, Andrew (2021).** "Difference-in-Differences with + Variation in Treatment Timing." Journal of Econometrics + 225(2): 254-277. TWFE weight decomposition; + `bacon_decompose` implements this. +- **Callaway, Brantly, and Pedro H. C. Sant'Anna (2021).** + "Difference-in-Differences with Multiple Time Periods." Journal of + Econometrics 225(2): 200-230. Group-time ATT. +- **Sun, Liyang, and Sarah Abraham (2021).** "Estimating Dynamic + Treatment Effects in Event Studies with Heterogeneous Treatment + Effects." Journal of Econometrics 225(2): 175-199. IW estimator. +- **de Chaisemartin, Clément, and Xavier d'Haultfoeuille (2020).** + "Two-Way Fixed Effects Estimators with Heterogeneous Treatment + Effects." American Economic Review 110(9): 2964-2996. DID_M + estimator. +- **Borusyak, Kirill, Xavier Jaravel, and Jann Spiess (2024).** + "Revisiting Event-Study Designs: Robust and Efficient Estimation." + Review of Economic Studies 91(6): 3253-3285. Imputation estimator. +- **Gardner, John (2022).** "Two-Stage Differences in Differences." + arXiv:2207.05943. Two-stage estimator. +- **Wooldridge, Jeffrey M. (2021).** "Two-Way Fixed Effects, the Two- + Way Mundlak Regression, and Difference-in-Differences Estimators." + ETWFE formulation. +- **Arkhangelsky, Dmitry, Susan Athey, David Hirshberg, Guido Imbens, + and Stefan Wager (2021).** "Synthetic Difference-in-Differences." + American Economic Review 111(12): 4088-4118. SDiD estimator. +- **Rambachan, Ashesh, and Jonathan Roth (2023).** "A More Credible + Approach to Parallel Trends." Review of Economic Studies + 90(5): 2555-2591. HonestDiD sensitivity. +- **Bilinski, Alyssa, and Laura A. Hatfield (2019).** "Nothing to See + Here? Non-Inferiority Approaches to Parallel Trends and Other + Model Assumptions." arXiv:1805.03273. Equivalence test. +- **Sant'Anna, Pedro H. 
C., and Jun Zhao (2020).** "Doubly Robust + Difference-in-Differences Estimators." Journal of Econometrics + 219(1): 101-122. DR adjustment. +- **Chen, Xiaohong, Pedro H. C. Sant'Anna, and Haitian Xie (2025).** + "Efficient Difference-in-Differences and Event Study Estimators." + Primary source for the `EfficientDiD` estimator (PT-All / PT-Post + framing and efficient combination weights). +- **Callaway, Brantly, Andrew Goodman-Bacon, and Pedro H. C. + Sant'Anna (2024).** "Difference-in-Differences with a Continuous + Treatment." Primary source for `ContinuousDiD`; introduces the + Parallel Trends vs Strong Parallel Trends distinction underlying + `ATT(d|d)`, `ATT(d)`, `ACRT(d)`, and `ACRT^{glob}`. + +### Online resources + +- **psantanna.com/did-resources** - practitioner checklist + reading + list maintained by Pedro Sant'Anna. +- **bcallaway11.github.io/did** - `did` R package tutorials + (Callaway-Sant'Anna). + + +## §8. Intentional omissions + +This guide does **not**: + +- Recommend a specific estimator for a specific dataset. When multiple + estimators fit, §4 lists them and names the trade-offs; the choice is + the agent's. +- Enumerate every possible design edge case. The literature cited in §7 + covers them; this guide is a navigation aid, not a substitute. +- Promise forward-compatibility of the BR / DR schema or the alert + catalogue. Treat these as experimental until the 12-item foundation- + gap list closes. +- Replace `bacon_decompose()`, `compute_honest_did()`, or any of the + estimator-native diagnostics. Post-fit validation is mandatory, not + optional, and belongs in the final write-up. +- Cover methods outside diff-diff's estimator suite (e.g., instrumental + variables, regression discontinuity, synthetic control for a single + treated unit). When those apply, point the user at dedicated + libraries. + +**If in doubt, consult the primary references in §7 and use +`get_llm_guide("practitioner")` for the Baker et al. workflow.** diff --git a/diff_diff/profile.py b/diff_diff/profile.py new file mode 100644 index 00000000..b343f9d4 --- /dev/null +++ b/diff_diff/profile.py @@ -0,0 +1,714 @@ +"""Descriptive panel-profiling utility for agent-facing use. + +``profile_panel()`` inspects a DiD panel and returns a :class:`PanelProfile` +dataclass of structural facts — panel balance, treatment-type classification, +outcome characteristics, and a list of factual :class:`Alert` observations. + +This module is descriptive, not opinionated. Alerts report what is (e.g. +"smallest cohort has 7 units"), never what to do about it. Estimator +selection is the caller's responsibility; consult +``diff_diff.get_llm_guide("autonomous")`` for the estimator-support matrix +and per-design-feature reasoning. +""" + +from __future__ import annotations + +from dataclasses import dataclass +from typing import Any, Dict, List, Mapping, Optional, Tuple, cast + +import numpy as np +import pandas as pd + +_OBSERVATION_COVERAGE_THRESHOLD = 0.70 +_MIN_COHORT_SIZE_THRESHOLD = 10 +_SHORT_PRE_PANEL_THRESHOLD = 3 +_SHORT_POST_PANEL_THRESHOLD = 3 + + +@dataclass(frozen=True) +class Alert: + """A factual observation about a panel. + + ``severity`` is ``"info"`` (descriptive) or ``"warn"`` (descriptive and + likely relevant to the caller's estimator choice). Alerts never + recommend a specific estimator. + """ + + code: str + severity: str + message: str + observed: Any + + +@dataclass(frozen=True) +class PanelProfile: + """Structural facts about a DiD panel. + + Returned by :func:`profile_panel`. 
Mirrors the ``BusinessContext`` + frozen-dataclass pattern. Consume ``.to_dict()`` for a JSON-serializable + representation and reason against the bundled + ``llms-autonomous.txt`` guide. + """ + + n_units: int + n_periods: int + n_obs: int + is_balanced: bool # every (unit, time) cell appears at least once + observation_coverage: float # unique (unit, time) keys / (n_units * n_periods) + + treatment_type: str + is_staggered: bool + n_cohorts: int + cohort_sizes: Mapping[Any, int] + has_never_treated: bool + has_always_treated: bool + treatment_varies_within_unit: bool + + first_treatment_period: Optional[Any] + last_treatment_period: Optional[Any] + min_pre_periods: Optional[int] + min_post_periods: Optional[int] + + outcome_dtype: str + outcome_is_binary: bool + outcome_has_zeros: bool + outcome_has_negatives: bool + outcome_missing_fraction: float + outcome_summary: Mapping[str, float] + + alerts: Tuple[Alert, ...] + + def to_dict(self) -> Dict[str, Any]: + """Return a JSON-serializable dict representation of the profile.""" + return { + "n_units": self.n_units, + "n_periods": self.n_periods, + "n_obs": self.n_obs, + "is_balanced": self.is_balanced, + "observation_coverage": self.observation_coverage, + "treatment_type": self.treatment_type, + "is_staggered": self.is_staggered, + "n_cohorts": self.n_cohorts, + "cohort_sizes": {_jsonable_key(k): int(v) for k, v in self.cohort_sizes.items()}, + "has_never_treated": self.has_never_treated, + "has_always_treated": self.has_always_treated, + "treatment_varies_within_unit": self.treatment_varies_within_unit, + "first_treatment_period": _jsonable(self.first_treatment_period), + "last_treatment_period": _jsonable(self.last_treatment_period), + "min_pre_periods": self.min_pre_periods, + "min_post_periods": self.min_post_periods, + "outcome_dtype": self.outcome_dtype, + "outcome_is_binary": self.outcome_is_binary, + "outcome_has_zeros": self.outcome_has_zeros, + "outcome_has_negatives": self.outcome_has_negatives, + "outcome_missing_fraction": self.outcome_missing_fraction, + "outcome_summary": {k: float(v) for k, v in self.outcome_summary.items()}, + "alerts": [ + { + "code": a.code, + "severity": a.severity, + "message": a.message, + "observed": _jsonable(a.observed), + } + for a in self.alerts + ], + } + + +def profile_panel( + df: pd.DataFrame, + *, + unit: str, + time: str, + treatment: str, + outcome: str, +) -> PanelProfile: + """Describe the structure of a DiD panel. + + Reports structural facts — balance, treatment-type classification, + outcome characteristics, factual alerts. Descriptive, not opinionated: + the profile says what is, never what to do about it. Estimator + selection is up to the caller. + + Parameters + ---------- + df : pandas.DataFrame + Long-format panel data containing the four named columns. + unit : str + Column identifying the cross-sectional unit. + time : str + Column identifying the time period. + treatment : str + Column holding the treatment indicator or dose. See Notes for the + classification rules. + outcome : str + Column holding the outcome variable. + + Returns + ------- + PanelProfile + Frozen dataclass. Call ``.to_dict()`` for a JSON-serializable view. + + Raises + ------ + ValueError + If any of the four column names is not present in ``df``. + + Examples + -------- + >>> import pandas as pd + >>> from diff_diff import profile_panel + >>> df = pd.DataFrame({ + ... "u": [1, 1, 2, 2], + ... "t": [0, 1, 0, 1], + ... "tr": [0, 0, 1, 1], + ... "y": [0.1, 0.2, 0.1, 0.9], + ... 
}) + >>> profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + >>> profile.is_balanced + True + >>> profile.treatment_type + 'binary_absorbing' + + Notes + ----- + Classification rules for ``treatment_type``: + + - ``"binary_absorbing"``: numeric treatment whose observed non-NaN + values are a subset of :math:`\\{0, 1\\}` (one or two distinct + values) AND each unit's treatment sequence (ordered by ``time``) + is weakly monotone non-decreasing. All-zero and all-one panels + are valid degenerate cases. + - ``"binary_non_absorbing"``: values a subset of :math:`\\{0, 1\\}` + with at least two distinct values observed, where at least one + unit switches from 1 back to 0. + - ``"continuous"``: numeric treatment with more than two distinct + values, or a 2-valued numeric whose values are not in + :math:`\\{0, 1\\}` (matches the ``ContinuousDiD`` convention). + - ``"categorical"``: non-numeric dtype (object / category) or a + column that is entirely NaN. + + Bool-dtype columns (``True`` / ``False``) are classified the same + way as numeric ``{0, 1}``: the library's binary estimators validate + on value support via :func:`diff_diff.utils.validate_binary`, so + ``True`` / ``False`` behave like ``1`` / ``0`` for absorbing / + non-absorbing classification. + + ``has_never_treated`` is computed across both binary and + continuous numeric treatment types: some unit has ``treatment == + 0`` in every observed non-NaN row. For binary this flags the + clean-control group; for continuous this flags zero-dose controls + (required by ``ContinuousDiD``). Always ``False`` for + ``"categorical"``. + + ``has_always_treated`` has binary-only semantics: some unit has + ``treatment == 1`` in every observed non-NaN row (no pre-treatment + information in the DiD sense). For ``"continuous"`` and + ``"categorical"`` treatment this field is always ``False`` + regardless of dose positivity — pre-treatment periods on + continuous DiD are determined by the separate ``first_treat`` + column passed to ``ContinuousDiD.fit``, not by whether the dose + is strictly positive. + + Rows with ``NaN`` in ``unit`` or ``time`` are dropped up front and + surfaced via the ``missing_id_rows_dropped`` alert; all subsequent + structural facts are computed on the non-missing subset, so + ``observation_coverage`` is always in ``[0, 1]``. Duplicate + ``(unit, time)`` rows are surfaced separately via the + ``duplicate_unit_time_rows`` alert. + + The profile does not recommend an estimator. Consult + ``diff_diff.get_llm_guide("autonomous")`` for the estimator-support + matrix and per-design-feature reasoning. + """ + _validate_columns(df, unit=unit, time=time, treatment=treatment, outcome=outcome) + + input_row_count = int(len(df)) + if input_row_count == 0: + raise ValueError("profile_panel: DataFrame is empty; at least one row is required.") + + missing_id_mask = cast(pd.Series, df[[unit, time]].isna().any(axis=1)) + n_rows_with_missing_id = int(missing_id_mask.sum()) + if n_rows_with_missing_id > 0: + df = df.loc[~missing_id_mask] + n_obs = int(len(df)) + if n_obs == 0: + raise ValueError( + f"profile_panel: no rows remain after dropping " + f"{n_rows_with_missing_id} row(s) with missing unit or time " + "identifier; at least one valid row is required." 
+ ) + + n_units = int(df[unit].nunique()) + n_periods = int(df[time].nunique()) + n_unique_keys = int(df[[unit, time]].drop_duplicates().shape[0]) + denom = n_units * n_periods + observation_coverage = float(n_unique_keys / denom) if denom > 0 else 0.0 + is_balanced = n_unique_keys == denom + n_duplicate_rows = n_obs - n_unique_keys + + ( + treatment_type, + is_staggered, + cohort_sizes, + has_never_treated, + has_always_treated, + first_tp, + last_tp, + ) = _classify_treatment(df, unit=unit, time=time, treatment=treatment) + + if pd.api.types.is_numeric_dtype(df[treatment]) or pd.api.types.is_bool_dtype(df[treatment]): + per_unit_distinct = df.groupby(unit)[treatment].nunique(dropna=True) + treatment_varies_within_unit = bool((per_unit_distinct > 1).any()) + else: + treatment_varies_within_unit = False + + min_pre, min_post = _compute_pre_post( + df, + unit=unit, + time=time, + treatment=treatment, + treatment_type=treatment_type, + ) + + outcome_col = cast(pd.Series, df[outcome]) + outcome_dtype = str(outcome_col.dtype) + valid = cast(pd.Series, outcome_col.dropna()) + outcome_missing_fraction = ( + float(1.0 - len(valid) / len(outcome_col)) if len(outcome_col) > 0 else 0.0 + ) + outcome_is_binary, outcome_has_zeros, outcome_has_negatives = _classify_outcome(valid) + outcome_summary = _summarize_outcome(valid) + + dtype_kind = getattr(outcome_col.dtype, "kind", "O") + alerts = _compute_alerts( + n_periods=n_periods, + observation_coverage=observation_coverage, + cohort_sizes=cohort_sizes, + has_never_treated=has_never_treated, + has_always_treated=has_always_treated, + min_pre_periods=min_pre, + min_post_periods=min_post, + outcome_is_binary=outcome_is_binary, + outcome_dtype_kind=dtype_kind, + n_duplicate_rows=n_duplicate_rows, + n_rows_with_missing_id=n_rows_with_missing_id, + ) + + return PanelProfile( + n_units=n_units, + n_periods=n_periods, + n_obs=n_obs, + is_balanced=is_balanced, + observation_coverage=observation_coverage, + treatment_type=treatment_type, + is_staggered=is_staggered, + n_cohorts=len(cohort_sizes), + cohort_sizes=cohort_sizes, + has_never_treated=has_never_treated, + has_always_treated=has_always_treated, + treatment_varies_within_unit=treatment_varies_within_unit, + first_treatment_period=first_tp, + last_treatment_period=last_tp, + min_pre_periods=min_pre, + min_post_periods=min_post, + outcome_dtype=outcome_dtype, + outcome_is_binary=outcome_is_binary, + outcome_has_zeros=outcome_has_zeros, + outcome_has_negatives=outcome_has_negatives, + outcome_missing_fraction=outcome_missing_fraction, + outcome_summary=outcome_summary, + alerts=tuple(alerts), + ) + + +def _validate_columns(df: pd.DataFrame, **cols: str) -> None: + missing = [(role, name) for role, name in cols.items() if name not in df.columns] + if missing: + pairs = ", ".join(f"{role}={name!r}" for role, name in missing) + raise ValueError( + f"profile_panel: column(s) not found in DataFrame: {pairs}. " + f"Provided columns: {list(df.columns)}" + ) + + +def _classify_treatment( + df: pd.DataFrame, + *, + unit: str, + time: str, + treatment: str, +) -> Tuple[ + str, + bool, + Dict[Any, int], + bool, + bool, + Optional[Any], + Optional[Any], +]: + """Return (type, is_staggered, cohort_sizes, has_never, has_always, first_tp, last_tp).""" + col = df[treatment] + is_numeric = pd.api.types.is_numeric_dtype(col) + is_bool = pd.api.types.is_bool_dtype(col) + + # Bool-dtype treatment columns are treated as binary 0/1 inputs. 
+ # The library's binary estimators validate value support via + # `validate_binary`, which accepts bool because True/False coerce + # to 1/0 numerically. Classifying bool columns as "categorical" + # here would route a valid binary design away from the supported + # estimator set. + if (not is_numeric) and (not is_bool): + return ("categorical", False, {}, False, False, None, None) + + distinct = col.dropna().unique() + n_distinct = len(distinct) + values_set = set(distinct.tolist()) + if n_distinct == 0: + return ("categorical", False, {}, False, False, None, None) + + # has_never_treated has a single well-defined meaning across binary + # and continuous numeric treatment: some unit has treatment == 0 in + # every observed non-NaN row. For binary this is the clean-control + # group; for continuous this is the zero-dose control required by + # ContinuousDiD (P(D=0) > 0). + unit_max = df.groupby(unit)[treatment].max().to_numpy() + unit_min = df.groupby(unit)[treatment].min().to_numpy() + has_never_treated = bool(np.any(unit_max == 0)) + + is_binary_valued = values_set <= {0, 1, 0.0, 1.0} + # has_always_treated has binary-only semantics: "unit is treated in + # every observed period" = unit_min == 1 on a binary panel (no + # pre-treatment information). For continuous panels, positive dose + # throughout does not mean "always treated in the DiD sense" + # (pre-treatment periods are determined by `first_treat`, not by + # whether the dose is positive), so this field is False for + # continuous / categorical types. + has_always_treated = is_binary_valued and bool(np.any(unit_min == 1)) + + if not is_binary_valued: + return ( + "continuous", + False, + {}, + has_never_treated, + has_always_treated, + None, + None, + ) + + sorted_df = df.sort_values([unit, time]) + + # Monotonicity check on the observed non-NaN subsequence per unit. + # A path like [0, 1, NaN, 0] must be detected as non-absorbing: the + # non-NaN subsequence [0, 1, 0] violates weak monotonicity. + is_absorbing = True + for _, group in sorted_df.groupby(unit, sort=False): + vals = group[treatment].to_numpy() + mask = ~pd.isna(vals) + # Cast to int so np.diff on a bool-dtype column performs + # arithmetic (1 - 0 = 1, 0 - 1 = -1) rather than XOR (which + # would mask a True -> False transition). + observed = vals[mask].astype(np.int64, copy=False) + if len(observed) >= 2 and bool(np.any(np.diff(observed) < 0)): + is_absorbing = False + break + + if not is_absorbing: + return ( + "binary_non_absorbing", + False, + {}, + has_never_treated, + has_always_treated, + None, + None, + ) + + first_treat = sorted_df[sorted_df[treatment] == 1].groupby(unit, sort=False)[time].min() + cohort_counts = first_treat.value_counts().sort_index() + cohort_sizes: Dict[Any, int] = {k: int(v) for k, v in cohort_counts.items()} + first_tp = min(cohort_sizes) if cohort_sizes else None + last_tp = max(cohort_sizes) if cohort_sizes else None + is_staggered = len(cohort_sizes) >= 2 + + return ( + "binary_absorbing", + is_staggered, + cohort_sizes, + has_never_treated, + has_always_treated, + first_tp, + last_tp, + ) + + +def _compute_pre_post( + df: pd.DataFrame, + *, + unit: str, + time: str, + treatment: str, + treatment_type: str, +) -> Tuple[Optional[int], Optional[int]]: + """Return (min_pre, min_post) across treated units using each unit's + observed (unit, time) support. 
On unbalanced panels this correctly + reflects the actual pre/post exposure of the least-supported treated + unit, rather than the global panel period set which could overstate + exposure and suppress short-panel alerts. + """ + if treatment_type != "binary_absorbing": + return None, None + + support = df[[unit, time]].drop_duplicates() + sorted_df = df.sort_values([unit, time]) + first_treat_per_unit = ( + sorted_df[sorted_df[treatment] == 1].groupby(unit, sort=False)[time].min() + ) + if first_treat_per_unit.empty: + return None, None + + pre_counts: List[int] = [] + post_counts: List[int] = [] + treated_units = first_treat_per_unit.index.tolist() + for u in treated_units: + c_u = first_treat_per_unit.loc[u] + unit_periods = support.loc[support[unit] == u, time] + pre_counts.append(int((unit_periods < c_u).sum())) + post_counts.append(int((unit_periods >= c_u).sum())) + + return int(min(pre_counts)), int(min(post_counts)) + + +def _classify_outcome(valid: pd.Series) -> Tuple[bool, bool, bool]: + n_distinct = valid.nunique(dropna=False) + if n_distinct == 0: + return False, False, False + + is_numeric = pd.api.types.is_numeric_dtype(valid) + if is_numeric: + distinct_set = set(valid.unique().tolist()) + is_binary = n_distinct == 2 and (distinct_set <= {0, 1} or distinct_set <= {0.0, 1.0}) + has_zeros = bool((valid == 0).any()) + has_negatives = bool((valid < 0).any()) + return is_binary, has_zeros, has_negatives + + return False, False, False + + +def _summarize_outcome(valid: pd.Series) -> Dict[str, float]: + if len(valid) == 0 or not pd.api.types.is_numeric_dtype(valid): + return {} + return { + "min": float(valid.min()), + "max": float(valid.max()), + "mean": float(valid.mean()), + "std": float(valid.std(ddof=1)) if len(valid) > 1 else 0.0, + } + + +def _compute_alerts( + *, + n_periods: int, + observation_coverage: float, + cohort_sizes: Mapping[Any, int], + has_never_treated: bool, + has_always_treated: bool, + min_pre_periods: Optional[int], + min_post_periods: Optional[int], + outcome_is_binary: bool, + outcome_dtype_kind: str, + n_duplicate_rows: int, + n_rows_with_missing_id: int, +) -> List[Alert]: + alerts: List[Alert] = [] + + if n_rows_with_missing_id > 0: + alerts.append( + Alert( + code="missing_id_rows_dropped", + severity="warn", + message=( + f"Dropped {n_rows_with_missing_id} row(s) with missing " + "unit or time identifier; structural facts are computed " + "from the non-missing subset." + ), + observed=int(n_rows_with_missing_id), + ) + ) + + if n_duplicate_rows > 0: + alerts.append( + Alert( + code="duplicate_unit_time_rows", + severity="warn", + message=( + f"Found {n_duplicate_rows} duplicate (unit, time) row(s); " + "balance and coverage are computed from the unique support." + ), + observed=int(n_duplicate_rows), + ) + ) + + if cohort_sizes: + smallest = min(cohort_sizes.values()) + if smallest < _MIN_COHORT_SIZE_THRESHOLD: + alerts.append( + Alert( + code="min_cohort_size_below_10", + severity="warn", + message=( + f"Smallest cohort has {smallest} units; " + "cohort-level inference will be noisy." 
+ ), + observed=int(smallest), + ) + ) + if len(cohort_sizes) == 1: + alerts.append( + Alert( + code="only_one_cohort", + severity="info", + message=("All treated units adopt at the same time " "(non-staggered design)."), + observed=1, + ) + ) + if not has_never_treated: + alerts.append( + Alert( + code="all_units_treated_simultaneously", + severity="info", + message=( + "Every unit is treated and every treated unit " + "adopts in the same period; no untreated " + "comparison group exists in the panel." + ), + observed=None, + ) + ) + + if min_pre_periods is not None and min_pre_periods < _SHORT_PRE_PANEL_THRESHOLD: + alerts.append( + Alert( + code="short_pre_panel", + severity="warn", + message=( + f"Minimum pre-treatment periods across treated units is " + f"{min_pre_periods}; parallel-trends and event-study " + "diagnostics have limited power." + ), + observed=int(min_pre_periods), + ) + ) + if min_post_periods is not None and min_post_periods < _SHORT_POST_PANEL_THRESHOLD: + alerts.append( + Alert( + code="short_post_panel", + severity="info", + message=( + f"Minimum post-treatment periods across treated units is " + f"{min_post_periods}; dynamic-effect estimation is " + "limited." + ), + observed=int(min_post_periods), + ) + ) + + if cohort_sizes and not has_never_treated: + alerts.append( + Alert( + code="no_never_treated", + severity="info", + message=( + "No never-treated comparison units; every unit in the " + "panel is eventually treated." + ), + observed=False, + ) + ) + + if has_always_treated: + alerts.append( + Alert( + code="has_always_treated_units", + severity="info", + message=( + "Some units are treated in every observed period; they " + "provide no pre-treatment information." + ), + observed=True, + ) + ) + + if observation_coverage < _OBSERVATION_COVERAGE_THRESHOLD: + alerts.append( + Alert( + code="panel_highly_unbalanced", + severity="warn", + message=( + f"Observation coverage is {observation_coverage:.1%}; " + "panel is highly unbalanced." 
+ ), + observed=float(observation_coverage), + ) + ) + + if n_periods == 2: + alerts.append( + Alert( + code="only_two_periods", + severity="info", + message="Only two time periods are observed (2x2 design).", + observed=2, + ) + ) + + if outcome_is_binary and outcome_dtype_kind == "f": + alerts.append( + Alert( + code="outcome_looks_binary_but_dtype_float", + severity="info", + message=("Outcome takes values in {0, 1} but is stored with a " "float dtype."), + observed=None, + ) + ) + + return alerts + + +def _jsonable(x: Any) -> Any: + """Coerce a value to a JSON-serializable primitive.""" + if x is None: + return None + if isinstance(x, bool): + return bool(x) + if isinstance(x, (int, float, str)): + return x + if isinstance(x, np.bool_): + return bool(x) + if isinstance(x, np.integer): + return int(x) + if isinstance(x, np.floating): + return float(x) + if isinstance(x, (pd.Timestamp, np.datetime64)): + return str(x) + if isinstance(x, dict): + return {_jsonable_key(k): _jsonable(v) for k, v in x.items()} + if isinstance(x, (list, tuple)): + return [_jsonable(v) for v in x] + return str(x) + + +def _jsonable_key(k: Any) -> Any: + """Coerce a mapping key to a JSON-compatible primitive.""" + if isinstance(k, bool): + return bool(k) + if isinstance(k, (int, float, str)): + return k + if isinstance(k, np.bool_): + return bool(k) + if isinstance(k, np.integer): + return int(k) + if isinstance(k, np.floating): + return float(k) + return str(k) diff --git a/tests/test_guides.py b/tests/test_guides.py index bc0abe83..2d08871d 100644 --- a/tests/test_guides.py +++ b/tests/test_guides.py @@ -1,4 +1,5 @@ """Tests for the bundled LLM guide accessor.""" + import importlib.resources import pytest @@ -7,7 +8,7 @@ from diff_diff._guides_api import _VARIANT_TO_FILE -@pytest.mark.parametrize("variant", ["concise", "full", "practitioner"]) +@pytest.mark.parametrize("variant", ["concise", "full", "practitioner", "autonomous"]) def test_all_variants_load(variant): text = get_llm_guide(variant) assert isinstance(text, str) @@ -19,9 +20,10 @@ def test_default_is_concise(): def test_full_is_largest(): - lengths = {v: len(get_llm_guide(v)) for v in ("concise", "full", "practitioner")} + lengths = {v: len(get_llm_guide(v)) for v in ("concise", "full", "practitioner", "autonomous")} assert lengths["full"] > lengths["concise"] assert lengths["full"] > lengths["practitioner"] + assert lengths["full"] > lengths["autonomous"] def test_content_stability_practitioner_workflow(): @@ -32,6 +34,22 @@ def test_content_stability_self_reference_after_rewrite(): assert "get_llm_guide" in get_llm_guide("concise") +def test_content_stability_autonomous_fingerprints(): + text = get_llm_guide("autonomous") + assert "profile_panel" in text + assert "estimator-support matrix" in text.lower() + + +def test_autonomous_contains_intact_estimator_matrix(): + # Section 3 is a markdown table with 10 data columns + the estimator + # name column -> rows have at least 11 pipe characters. This guards + # against the matrix being accidentally deleted or truncated. + text = get_llm_guide("autonomous") + assert any( + line.count("|") >= 11 for line in text.splitlines() + ), "Section 3 estimator-support matrix appears to be missing or truncated." 
+ + def test_wheel_content_matches_package_resource(): for variant, filename in _VARIANT_TO_FILE.items(): on_disk = ( diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py new file mode 100644 index 00000000..b1d7a9f5 --- /dev/null +++ b/tests/test_profile_panel.py @@ -0,0 +1,868 @@ +"""Tests for ``diff_diff.profile_panel`` and the ``PanelProfile`` dataclass.""" + +from __future__ import annotations + +import dataclasses +import json +from typing import Any, Dict, Iterable, Optional + +import numpy as np +import pandas as pd +import pytest + +from diff_diff import PanelProfile, profile_panel +from diff_diff.profile import Alert + + +def _make_panel( + *, + n_units: int, + periods: Iterable[int], + first_treat: Optional[Dict[int, int]] = None, + outcome_fn: Any = None, +) -> pd.DataFrame: + """Build a balanced long panel with optional per-unit first-treatment timing. + + ``first_treat`` maps unit -> first treatment period (inclusive). Units not + in the mapping are never-treated. + """ + first_treat = first_treat or {} + rows = [] + rng = np.random.default_rng(0) + for u in range(1, n_units + 1): + for t in periods: + tr = 1 if (u in first_treat and t >= first_treat[u]) else 0 + if outcome_fn is not None: + y = outcome_fn(u, t, tr, rng) + else: + y = float(u) + 0.1 * t + 0.5 * tr + rows.append({"u": u, "t": t, "tr": tr, "y": y}) + return pd.DataFrame(rows) + + +def _alert_codes(profile: PanelProfile) -> set[str]: + return {a.code for a in profile.alerts} + + +def test_balanced_binary_2x2(): + first_treat = {u: 1 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=[0, 1], first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.is_staggered is False + assert profile.has_never_treated is True + assert profile.n_units == 20 + assert profile.n_periods == 2 + assert profile.is_balanced is True + + +def test_staggered_multi_cohort(): + first_treat: Dict[int, int] = {} + first_treat.update({u: 3 for u in range(1, 11)}) + first_treat.update({u: 5 for u in range(11, 21)}) + first_treat.update({u: 7 for u in range(21, 31)}) + df = _make_panel(n_units=40, periods=range(1, 9), first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.is_staggered is True + assert profile.n_cohorts == 3 + assert profile.cohort_sizes == {3: 10, 5: 10, 7: 10} + assert profile.first_treatment_period == 3 + assert profile.last_treatment_period == 7 + assert profile.has_never_treated is True + + +def test_binary_non_absorbing_switcher(): + rows = [] + rng = np.random.default_rng(0) + for u in range(1, 21): + treat_seq = [0, 1, 1, 0, 0] if u > 10 else [0, 0, 0, 0, 0] + for t, tr in enumerate(treat_seq): + rows.append({"u": u, "t": t, "tr": tr, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_non_absorbing" + assert profile.cohort_sizes == {} + assert profile.is_staggered is False + assert profile.has_never_treated is True + + +def test_continuous_treatment(): + rng = np.random.default_rng(0) + rows = [] + for u in range(1, 41): + dose = float(rng.uniform(0, 5)) + for t in range(4): + rows.append({"u": u, "t": t, "tr": dose, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + 
assert profile.treatment_type == "continuous" + assert profile.cohort_sizes == {} + assert profile.is_staggered is False + # Each unit has a constant dose across all periods → time-invariant. + assert profile.treatment_varies_within_unit is False + + +def test_continuous_treatment_with_time_varying_dose(): + """Time-varying dose must be flagged so agents routed to + ContinuousDiD do not hit the fit-time "dose must be time-invariant" + ValueError. treatment_varies_within_unit == True signals the + incompatibility.""" + rng = np.random.default_rng(0) + rows = [] + for u in range(1, 21): + for t in range(4): + dose = float(rng.uniform(0, 5)) + rows.append({"u": u, "t": t, "tr": dose, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "continuous" + assert profile.treatment_varies_within_unit is True + + +def test_binary_absorbing_varies_within_unit(): + """Binary-absorbing panels have within-unit treatment variation by + construction (0 pre, 1 post). The field is True.""" + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_varies_within_unit is True + + +def test_continuous_positive_dose_does_not_fire_has_always_treated(): + """Valid ContinuousDiD panels have units with a constant positive + dose across all periods AND well-defined pre-treatment periods + (via a separate `first_treat` column). `has_always_treated` has + binary-only semantics, so it must be False on continuous panels + regardless of dose positivity. Previously the field conflated + "positive dose throughout" with "always treated in the DiD sense", + which fired the misleading `has_always_treated_units` alert on + valid continuous-DiD panels.""" + rng = np.random.default_rng(0) + rows = [] + for u in range(1, 21): + dose = 0.0 if u <= 5 else 2.5 + for t in range(4): + rows.append({"u": u, "t": t, "tr": dose, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "continuous" + assert profile.has_never_treated is True + assert profile.has_always_treated is False, ( + "has_always_treated must be False on continuous panels regardless " + "of dose positivity (binary-only semantics)" + ) + assert "has_always_treated_units" not in _alert_codes(profile) + + +def test_bool_dtype_treatment_is_binary_absorbing(): + """Bool-dtype treatment columns (True/False) must classify the same + way as numeric {0, 1}. The library's binary estimators validate on + value support via `validate_binary`, which accepts bool because + True/False coerce to 1/0 numerically. 
Classifying bool as + "categorical" would silently route valid binary DiD panels away + from the supported estimator set.""" + first_treat = {u: 2 for u in range(11, 21)} + rows = [] + for u in range(1, 21): + for t in range(4): + treated = u in first_treat and t >= first_treat[u] + rows.append({"u": u, "t": t, "tr": bool(treated), "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + assert df["tr"].dtype == bool + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.has_never_treated is True + assert profile.has_always_treated is False + assert profile.treatment_varies_within_unit is True + assert profile.cohort_sizes == {2: 10} + + +def test_bool_dtype_non_absorbing(): + """Reversible 0 -> 1 -> 0 treatment expressed as a bool column must + classify as binary_non_absorbing, same as numeric.""" + rows = [] + for u in range(1, 11): + seq = [False, True, True, False, False] if u > 5 else [False] * 5 + for t, tr in enumerate(seq): + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + assert df["tr"].dtype == bool + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_non_absorbing" + assert profile.has_never_treated is True + + +def test_categorical_treatment_object_dtype(): + rows = [] + for u in range(1, 11): + arm = "A" if u <= 5 else "B" + for t in range(4): + rows.append({"u": u, "t": t, "tr": arm, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "categorical" + assert profile.has_never_treated is False + assert profile.has_always_treated is False + + +def test_no_never_treated_alert(): + first_treat = {u: 2 for u in range(1, 21)} + df = _make_panel(n_units=20, periods=range(0, 5), first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.has_never_treated is False + codes = _alert_codes(profile) + assert "no_never_treated" in codes + + +def test_has_always_treated_alert(): + rows = [] + for u in range(1, 21): + for t in range(5): + tr = 1 if u <= 5 else (1 if t >= 3 else 0) + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.has_always_treated is True + codes = _alert_codes(profile) + assert "has_always_treated_units" in codes + + +def test_unbalanced_panel_below_threshold(): + first_treat = {u: 3 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 5), first_treat=first_treat) + df = df.iloc[::3].reset_index(drop=True) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.is_balanced is False + assert profile.observation_coverage < 0.70 + codes = _alert_codes(profile) + assert "panel_highly_unbalanced" in codes + + +def test_binary_outcome_float_dtype_alert(): + first_treat = {u: 2 for u in range(11, 31)} + df = _make_panel( + n_units=30, + periods=range(0, 4), + first_treat=first_treat, + outcome_fn=lambda u, t, tr, rng: float(tr), + ) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.outcome_is_binary is True + assert profile.outcome_dtype == "float64" + codes = _alert_codes(profile) + assert "outcome_looks_binary_but_dtype_float" in codes + + +def 
test_outcome_missing_fraction_computed(): + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + df.loc[0:9, "y"] = np.nan + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert 0.0 < profile.outcome_missing_fraction < 1.0 + assert profile.outcome_missing_fraction == pytest.approx(10 / len(df)) + + +def test_short_pre_panel_alert(): + first_treat = {u: 1 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=[0, 1, 2, 3], first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.min_pre_periods == 1 + codes = _alert_codes(profile) + assert "short_pre_panel" in codes + + +def test_missing_column_raises_value_error(): + df = pd.DataFrame({"u": [1, 2], "t": [0, 1], "y": [0.0, 1.0]}) + with pytest.raises(ValueError, match="treatment"): + profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + + +def test_panel_profile_is_frozen(): + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + with pytest.raises(dataclasses.FrozenInstanceError): + profile.n_units = 999 # type: ignore[misc] + + +def test_to_dict_is_json_serializable(): + first_treat = {u: 3 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 6), first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + payload = profile.to_dict() + as_json = json.dumps(payload) + roundtripped = json.loads(as_json) + assert roundtripped["treatment_type"] == "binary_absorbing" + assert set(roundtripped.keys()) >= { + "n_units", + "n_periods", + "n_obs", + "is_balanced", + "observation_coverage", + "treatment_type", + "is_staggered", + "n_cohorts", + "cohort_sizes", + "has_never_treated", + "has_always_treated", + "treatment_varies_within_unit", + "first_treatment_period", + "last_treatment_period", + "min_pre_periods", + "min_post_periods", + "outcome_dtype", + "outcome_is_binary", + "outcome_has_zeros", + "outcome_has_negatives", + "outcome_missing_fraction", + "outcome_summary", + "alerts", + } + + +def test_alerts_are_factual_no_recommender_language(): + first_treat = {u: 1 for u in range(11, 21)} + df = _make_panel(n_units=12, periods=[0, 1, 2, 3], first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + forbidden_substrings = ( + "recommend", + "should use", + "use estimator", + "we suggest", + "you should", + ) + for alert in profile.alerts: + lowered = alert.message.lower() + for phrase in forbidden_substrings: + assert phrase not in lowered, ( + f"alert {alert.code!r} contains recommender-adjacent phrase " + f"{phrase!r} in message: {alert.message!r}" + ) + + +def test_alert_dataclass_is_frozen(): + a = Alert(code="x", severity="info", message="m", observed=None) + with pytest.raises(dataclasses.FrozenInstanceError): + a.code = "y" # type: ignore[misc] + + +def test_all_zero_treatment_is_binary_absorbing(): + """Degenerate binary: no unit is ever treated. 
Must classify as binary, + not continuous, so the documented taxonomy matches the implementation.""" + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=None) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.has_never_treated is True + assert profile.has_always_treated is False + assert profile.cohort_sizes == {} + assert profile.n_cohorts == 0 + + +def test_all_one_treatment_is_binary_absorbing_always_treated(): + """Degenerate binary: every unit treated in every period. Must classify as + binary_absorbing with has_always_treated=True.""" + rows = [] + for u in range(1, 21): + for t in range(4): + rows.append({"u": u, "t": t, "tr": 1, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.has_never_treated is False + assert profile.has_always_treated is True + codes = _alert_codes(profile) + assert "has_always_treated_units" in codes + + +def test_binary_with_nans_only_zeros_observed_is_binary(): + """Binary panel with some NaNs and only 0 observed among non-NaN values — + still classify as binary, not continuous.""" + rows = [] + for u in range(1, 11): + for t in range(4): + tr = 0 if (u + t) % 2 == 0 else np.nan + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + + +def test_all_nan_treatment_is_categorical(): + """Treatment column entirely NaN — classify as categorical (no info).""" + rows = [] + for u in range(1, 11): + for t in range(4): + rows.append({"u": u, "t": t, "tr": np.nan, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "categorical" + + +def test_top_level_import_surface(): + """profile_panel, PanelProfile, and Alert must be importable from the + top-level namespace so `help(diff_diff)` points at real symbols.""" + import diff_diff + + assert callable(diff_diff.profile_panel) + assert diff_diff.PanelProfile.__name__ == "PanelProfile" + assert diff_diff.Alert.__name__ == "Alert" + for name in ("profile_panel", "PanelProfile", "Alert"): + assert name in diff_diff.__all__, f"{name} missing from __all__" + + +def test_duplicate_unit_time_rows_do_not_inflate_coverage(): + """Duplicate (unit, time) rows must not make a panel look balanced. 
+ observation_coverage must stay in [0, 1] and derive from the unique + (unit, time) support, and the duplicate_unit_time_rows alert fires.""" + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + df_dup = pd.concat([df, df.iloc[:5].copy()], ignore_index=True) + profile = profile_panel(df_dup, unit="u", time="t", treatment="tr", outcome="y") + assert profile.is_balanced is True + assert 0.0 <= profile.observation_coverage <= 1.0 + assert "duplicate_unit_time_rows" in _alert_codes(profile) + + df_missing_cell = df.drop(df.index[0]).reset_index(drop=True) + df_dup_missing = pd.concat( + [df_missing_cell, df_missing_cell.iloc[:5].copy()], ignore_index=True + ) + profile2 = profile_panel(df_dup_missing, unit="u", time="t", treatment="tr", outcome="y") + assert profile2.is_balanced is False + assert profile2.observation_coverage < 1.0 + assert "duplicate_unit_time_rows" in _alert_codes(profile2) + + +def test_reversal_through_nan_is_binary_non_absorbing(): + """A 0 -> 1 -> NaN -> 0 path must be detected as non-absorbing: the + observed non-NaN subsequence violates weak monotonicity. Previously a + NaN-inclusive diff could report False monotonicity violation.""" + rows = [] + for u in range(1, 11): + treat_seq = [0, 1, np.nan, 0] + for t, tr in enumerate(treat_seq): + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_non_absorbing" + + +def test_continuous_zero_dose_controls_flag_has_never_treated(): + """Continuous treatment with some zero-dose units must flag + has_never_treated=True. Previously continuous panels hardcoded + has_never_treated=False regardless of control availability. + has_always_treated has binary-only semantics and must remain + False on continuous panels regardless of dose positivity.""" + rows = [] + rng = np.random.default_rng(0) + for u in range(1, 21): + dose = 0.0 if u <= 5 else float(rng.uniform(0.5, 3.0)) + for t in range(4): + rows.append({"u": u, "t": t, "tr": dose, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "continuous" + assert profile.has_never_treated is True + assert profile.has_always_treated is False + + +def test_guide_api_strings_resolve_against_public_api(): + """Sanity-check that every estimator referenced in the autonomous guide + exists in the public API, plus the `hausman_pretest` classmethod location + and the `not_yet_treated` control-group string. 
Guards against guide + drift that the CI reviewer has previously flagged.""" + import diff_diff + from diff_diff import get_llm_guide + + text = get_llm_guide("autonomous") + + for name in ( + "DifferenceInDifferences", + "MultiPeriodDiD", + "TwoWayFixedEffects", + "CallawaySantAnna", + "SunAbraham", + "ChaisemartinDHaultfoeuille", + "ImputationDiD", + "TwoStageDiD", + "StackedDiD", + "WooldridgeDiD", + "EfficientDiD", + "SyntheticDiD", + "TROP", + "TripleDifference", + "StaggeredTripleDifference", + "ContinuousDiD", + "HeterogeneousAdoptionDiD", + ): + assert name in text, f"estimator {name!r} missing from guide" + assert hasattr(diff_diff, name), f"{name!r} in guide but not exported" + + assert hasattr( + diff_diff.EfficientDiD, "hausman_pretest" + ), "EfficientDiD.hausman_pretest classmethod missing from the public API" + + assert "EfficientDiD.hausman_pretest" in text + assert "Hausman.hausman_pretest" not in text + + assert 'control_group="not_yet_treated"' in text + assert "notyettreated" not in text + + # HAD targets WAS / WAS_d_lower, not ATT; event-study is per-event- + # time, not per-cohort. Guard against the guide drifting back to + # ATT-shaped / per-cohort phrasing. + assert "Weighted Average Slope (WAS)" in text + assert "WAS_d_lower" in text + assert "per-cohort Pierce-Schott" not in text + + # EfficientDiD has three paths when no never-treated exists: + # PT-Post, PT-All, or control_group="last_cohort". The guide must + # mention last_cohort in the no-never-treated section so agents do + # not rule out the supported path. + assert 'control_group="last_cohort"' in text + + # SunAbraham requires a never-treated cohort; the fit path raises a + # ValueError when none exists. Guard the matrix / prose contract so + # the guide cannot drift back to claiming SunAbraham is optional. + sun_abraham_row = next( + line for line in text.splitlines() if "`SunAbraham`" in line and "|" in line + ) + cells = [cell.strip() for cell in sun_abraham_row.strip("|").split("|")] + # Column order: estimator, binary_absorbing, staggered, continuous, + # triple-diff, never-treated-required, covariate, few-treated, + # heterogeneous-adoption, clustered-SE. + assert cells[5] == "✓", ( + "SunAbraham matrix row must mark never-treated-required=✓ " f"(row: {sun_abraham_row!r})" + ) + + # HAD Assumption 3 is not testable per REGISTRY.md; the guide must + # not claim otherwise. + assert "Assumption 3" in text # mentioned as untestable, not as validated + assert "validate Assumptions 3 and 7" not in text + assert "not testable" in text + + # EfficientDiD requires never-treated under BOTH assumption="PT-All" + # and assumption="PT-Post" — PT-Post is not a "drop the requirement" + # escape hatch. Only control_group="last_cohort" admits all-treated + # panels. Guard against guide drift back to the incorrect wording. + assert "PT-Post is the weaker" in text or "both" in text.lower() + # The old claim "switch to `assumption=\"PT-Post\"` to drop" must + # not reappear in any form. + assert 'switch to `assumption="PT-Post"` to drop' not in text + + # Matrix covariate cells: SyntheticDiD accepts fit(covariates=...) + # and residualizes the outcome; ContinuousDiD.fit has no covariate + # surface. Guard the matrix rows against drift. 
+ sdid_row = next(line for line in text.splitlines() if "`SyntheticDiD`" in line and "|" in line) + sdid_cells = [c.strip() for c in sdid_row.strip("|").split("|")] + assert sdid_cells[6] in ("✓", "partial"), ( + "SyntheticDiD covariate-adjustment cell must be ✓ or partial " + f"(residualization path exists); got {sdid_cells[6]!r}" + ) + cdid_row = next(line for line in text.splitlines() if "`ContinuousDiD`" in line and "|" in line) + cdid_cells = [c.strip() for c in cdid_row.strip("|").split("|")] + assert cdid_cells[6] == "✗", ( + "ContinuousDiD covariate-adjustment cell must be ✗ " + f"(no covariate surface on fit()); got {cdid_cells[6]!r}" + ) + + # §5 API signatures: compute_pretrends_power takes a fitted results + # object (not df), plot_sensitivity takes SensitivityResults, + # plot_honest_event_study takes HonestDiDResults. Guard against + # drift back to the df-first / results-only signatures. + assert "`compute_pretrends_power(results" in text + assert "`plot_sensitivity(sensitivity_results" in text + assert "`plot_honest_event_study(honest_results" in text + + # §6 BR/DR schema alignment. The emitted top-level keys are + # singular / underscored ("assumption", "pre_trends", "sample"), + # not the plural / run-together variants. DiagnosticReport emits + # sections at the top level (not nested under a "checks" dict) + # and uses "estimator" (the string class name) / "headline_metric" + # / "estimator_native_diagnostics". Guard each real key and + # forbid the obsolete ones. + for real_key in ( + "`assumption: dict`", + "`pre_trends: dict`", + "`sample: dict`", + "`headline_metric: dict`", + "`estimator_native_diagnostics: dict`", + "`overall_interpretation: str`", + ): + assert real_key in text, f"BR/DR §6 missing real key: {real_key}" + for obsolete_key in ( + "`assumptions: dict`", + "`pretrends: dict`", + "`main_result: dict`", + "`sample_summary: dict`", + "`estimator_type: str`", + "`checks: dict`", + ): + assert obsolete_key not in text, f"BR/DR §6 still lists obsolete key: {obsolete_key}" + + # BR `diagnostics` is a wrapper (status + schema/reason + possibly + # overall_interpretation), not the DR payload directly. Guard the + # wrapper wording so the guide does not drift back to telling + # agents to parse BR["diagnostics"] as the DR schema. + assert 'diagnostics["schema"]' in text + # target_parameter includes a `reference` field per + # describe_target_parameter(); guard its documentation. + assert "`reference` (REGISTRY.md citation string)" in text + + # Methodology source attribution: EfficientDiD is Chen, Sant'Anna, + # Xie (2025), not Arkhangelsky-Imbens. ContinuousDiD is Callaway, + # Goodman-Bacon, Sant'Anna (2024). Guard both attributions in the + # §4 prose and the §7 citation list. + assert "Chen, Sant'Anna, Xie 2025" in text + assert "(Arkhangelsky-Imbens)" not in text + assert "Callaway, Goodman-Bacon, Sant'Anna 2024" in text + # ContinuousDiD prose must distinguish the PT vs SPT identified + # targets rather than collapsing everything into "ACR". + assert "ATT(d|d)" in text + assert "ACRT" in text + assert "Strong Parallel Trends" in text + + # ContinuousDiD requires zero-dose (P(D=0) > 0) because Remark 3.1 + # lowest-dose-as-control is unimplemented; matrix col 5 must be ✓. 
+ assert cdid_cells[5] == "✓", ( + "ContinuousDiD matrix row must mark never-treated-required=✓ " + f"(P(D=0) > 0 required per Remark 3.1); got {cdid_cells[5]!r}" + ) + assert "P(D=0) > 0" in text + + # ContinuousDiD DOES support staggered adoption natively (via the + # `first_treat` column). Matrix column 2 (staggered) must be ✓. + assert cdid_cells[2] == "✓", ( + "ContinuousDiD matrix row must mark staggered=✓ " + "(adoption timing via first_treat is supported); " + f"got {cdid_cells[2]!r}" + ) + + # ContinuousDiD also requires dose to be time-invariant per unit; + # this is the second eligibility prerequisite the guide must spell + # out. Guide text must mention the invariant explicitly AND the + # `treatment_varies_within_unit` field used to detect it. + assert "time-invariant" in text + assert "treatment_varies_within_unit" in text + + # DR §6 section statuses: execution-state vocabulary must include + # the actual emitted values ("ran", "not_applicable", "not_run", + # "no_scalar_by_design", "skipped"), and `verdict` must be + # documented separately from `status`. Guard against drift back + # to the pass/warn/inconclusive-as-status framing. + for real_status in ( + '"ran"', + '"not_applicable"', + '"not_run"', + '"no_scalar_by_design"', + ): + assert real_status in text, f"DR §6 section-status vocabulary must document {real_status}" + # `status` must not be described as "pass/warn/inconclusive" — + # those belong under `verdict`. + assert '`"pass"` / `"warn"` / `"inconclusive"`' not in text + assert "verdict" in text.lower() + + # Balanced-panel eligibility: ContinuousDiD, EfficientDiD, + # SyntheticDiD, and HeterogeneousAdoptionDiD all hard-reject + # unbalanced panels at fit() time. The guide must surface this + # so agents gate these estimators on PanelProfile.is_balanced + # before selecting them. + assert "is_balanced" in text, ( + "Guide must mention PanelProfile.is_balanced as an eligibility " + "check for balance-sensitive estimators" + ) + for estimator in ( + "ContinuousDiD", + "EfficientDiD", + "SyntheticDiD", + "HeterogeneousAdoptionDiD", + "StaggeredTripleDifference", + ): + idx = 0 + found = False + while idx < len(text): + loc = text.find(estimator, idx) + if loc < 0: + break + window = text[max(0, loc - 400) : loc + 400] + if "balanced" in window.lower() or "is_balanced" in window: + found = True + break + idx = loc + 1 + assert found, ( + f"Guide must mention a balanced-panel constraint near the " + f"{estimator!r} bullet / row (hard-rejects unbalanced panels " + "at fit time)" + ) + + # HeterogeneousAdoptionDiD staggered support is `partial` and + # specifically last-cohort-only (Appendix B.2): with first_treat_col + # supplied, fit() auto-filters to F_last + never-treated; without + # first_treat_col, a multi-cohort panel raises. Guide must surface + # this explicitly so agents don't route a general staggered panel + # to HAD expecting a multi-cohort estimand.
+ assert "last-cohort-only" in text or "last cohort" in text.lower(), ( + "Guide must name the last-cohort-only restriction on HAD " + "staggered support (Appendix B.2)" + ) + assert "first_treat_col" in text, ( + "Guide must mention that first_treat_col is required to activate " + "HAD's staggered last-cohort auto-filter" + ) + assert "ChaisemartinDHaultfoeuille" in text, ( + "Guide must point at ChaisemartinDHaultfoeuille as the fallback " + "for full staggered support" + ) + + # Balanced-panel gate is incomplete with `is_balanced` alone because + # duplicate (unit, time) rows don't flip is_balanced. Guide must + # require BOTH is_balanced == True AND absence of the + # duplicate_unit_time_rows alert before routing to the duplicate- + # intolerant estimators (ContinuousDiD silently overwrites + # duplicates via last-row-wins; EfficientDiD/HAD raise). + assert "duplicate_unit_time_rows" in text, ( + "Guide must name the duplicate_unit_time_rows alert as part of " + "the balanced-panel eligibility gate" + ) + assert "BOTH" in text or "both" in text, ( + "Guide must require BOTH is_balanced and absence of the " + "duplicate_unit_time_rows alert before routing to duplicate-" + "intolerant estimators" + ) + + # ChaisemartinDHaultfoeuille handles non-absorbing / reversible + # treatment; SUTVA is still assumed (no native interference or + # spillover support per REGISTRY.md). Guard against the guide + # drifting back to advertising dCDH as "robust to spillover + # designs" or similar. + for phrase in ( + "robust to spillover", + "interference-robust", + "supports spillover", + "and to spillover", + ): + assert phrase not in text, ( + f"Guide must not advertise unsupported dCDH capability " + f"{phrase!r}: SUTVA is assumed across the estimator suite." + ) + + # Repeated-cross-section (§4.10) must not claim broad + # applicability. The documented RCS-capable estimators are + # CallawaySantAnna(panel=False), TripleDifference, and + # StaggeredTripleDifference; EfficientDiD and + # HeterogeneousAdoptionDiD explicitly reject RCS per REGISTRY.md. + assert "most estimators remain applicable" not in text, ( + "§4.10 must not claim broad RCS applicability; only the " + "explicitly documented RCS-capable subset is applicable." + ) + assert "panel=False" in text, ( + "§4.10 must point at CallawaySantAnna(panel=False) as the " "explicit RCS mode" + ) + # The section must explicitly name at least one panel-only + # estimator as rejected for RCS, so agents do not silently route + # RCS data to it. + rcs_section_start = text.find("§4.10 Repeated cross-sections") + assert rcs_section_start >= 0 + rcs_section = text[rcs_section_start : rcs_section_start + 2500] + for panel_only in ( + "EfficientDiD", + "HeterogeneousAdoptionDiD", + "StaggeredTripleDifference", + ): + assert panel_only in rcs_section, ( + f"§4.10 must explicitly name {panel_only!r} as panel-only " + "so RCS data is not routed to it" + ) + + # The explicit RCS-capable bullet list must NOT put + # StaggeredTripleDifference next to the RCS-support language. + # The estimator has no panel=False mode and fit() rejects + # unbalanced input; only TripleDifference (non-staggered) is + # cross-sectional-DDD-capable. 
+ explicit_support_block = text.find("Explicit RCS support", rcs_section_start) + rejected_block = text.find("Explicitly rejected for RCS", rcs_section_start) + assert 0 <= explicit_support_block < rejected_block, ( + "§4.10 must separate an Explicit RCS support list from the " "Explicitly rejected list" + ) + explicit_segment = text[explicit_support_block:rejected_block] + assert "StaggeredTripleDifference" not in explicit_segment, ( + "StaggeredTripleDifference must NOT appear in the Explicit RCS " + "support list — it is panel-only and balance-enforced." + ) + + +def test_min_pre_post_use_per_unit_observed_support(): + """On an unbalanced panel where one treated unit is missing its + earliest pre-period, min_pre_periods must reflect that unit's actual + observed support. Previously _compute_pre_post used the global period + set, which could hide short-panel cases and suppress the short_pre_panel + alert.""" + rows = [] + for u in range(1, 21): + first_treat = 3 + for t in range(0, 6): + if u == 1 and t <= 1: + continue + tr = 1 if t >= first_treat else 0 + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.min_pre_periods == 1 + assert "short_pre_panel" in _alert_codes(profile) + + +def test_missing_unit_or_time_ids_are_dropped_consistently(): + """NaN values in unit or time must not push observation_coverage above + 1.0. `nunique()` drops NaN while `drop_duplicates()` keeps NaN as a + distinct key, which previously produced coverage > 1 silently. The + fix drops NaN-id rows up front, emits the missing_id_rows_dropped + alert, and computes all structural facts on the non-missing subset.""" + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + df_with_missing = df.copy() + df_with_missing.loc[[0, 1, 2], "u"] = np.nan + df_with_missing.loc[[5, 6], "t"] = np.nan + profile = profile_panel(df_with_missing, unit="u", time="t", treatment="tr", outcome="y") + assert 0.0 <= profile.observation_coverage <= 1.0 + codes = _alert_codes(profile) + assert "missing_id_rows_dropped" in codes + drop_alert = next(a for a in profile.alerts if a.code == "missing_id_rows_dropped") + assert drop_alert.observed == 5 + + +def test_row_with_both_ids_missing_counted_once(): + """A row with BOTH unit and time NaN must count as one dropped row, + not two. 
Previously `isna().sum()` summed the two columns and + double-counted rows missing both identifiers.""" + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + df_both_missing = df.copy() + df_both_missing.loc[0, "u"] = np.nan + df_both_missing.loc[0, "t"] = np.nan + profile = profile_panel(df_both_missing, unit="u", time="t", treatment="tr", outcome="y") + drop_alert = next(a for a in profile.alerts if a.code == "missing_id_rows_dropped") + assert drop_alert.observed == 1 + + +def test_empty_dataframe_raises_value_error(): + """Direct empty input must raise, not silently return a 'balanced' + profile with zero units/periods.""" + df = pd.DataFrame({"u": [], "t": [], "tr": [], "y": []}) + with pytest.raises(ValueError, match="empty"): + profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + + +def test_empty_after_id_drop_raises_value_error(): + """If every row has a missing unit or time identifier, the panel is + empty after the drop; raise rather than returning is_balanced=True + on zero rows.""" + df = pd.DataFrame( + { + "u": [np.nan, np.nan], + "t": [0, 1], + "tr": [0, 1], + "y": [0.1, 0.2], + } + ) + with pytest.raises(ValueError, match="no rows remain"): + profile_panel(df, unit="u", time="t", treatment="tr", outcome="y")