From af7390fdeafb3750f2bd7920f452a6a711476a32 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 07:24:37 -0400 Subject: [PATCH 01/18] Add profile_panel() + llms-autonomous.txt agent-facing pair New `diff_diff.profile_panel(df, *, unit, time, treatment, outcome)` returns a frozen `PanelProfile` dataclass of structural facts about a DiD panel (balance, treatment-type classification, cohort structure, outcome characteristics, factual alerts). `.to_dict()` returns a JSON-serializable view. Descriptive only -- alerts never name a specific estimator; estimator selection is up to the caller. Paired with a new bundled `"autonomous"` variant on `get_llm_guide()` at `diff_diff/guides/llms-autonomous.txt`. Reference-shaped (distinct from the workflow-prose `"practitioner"` variant): `PanelProfile` field reference, 17-estimator x 9-design-feature support matrix, per-design-feature reasoning citing Baker et al. (2025) and Roth/Sant'Anna (2023), post-fit validation index, BR/DR schema reference, and an explicit "no deterministic recommendations" disclaimer. `diff_diff/__init__.py` module docstring leads with a "For AI agents:" entry block so `help(diff_diff)` surfaces the four-step sequence (profile, reference, workflow, report) at import time. Exports `profile_panel`, `PanelProfile`, `Alert` from the top-level namespace. Both files are bundled inside the wheel and sdist (no GitHub/RTD dependency at runtime) via the existing `diff_diff/guides/*.txt` glob in `pyproject.toml`. 
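Illustrative sketch of two of the balance facts the profile reports, re-computed from their documented formulas (`is_balanced`, `observation_coverage = n_obs / (n_units * n_periods)`) over a toy long-format panel. This is hand-rolled example arithmetic, not the library implementation:

```python
# Illustrative only -- re-computes two PanelProfile balance facts from
# their documented formulas over a toy panel. Not diff_diff code.
rows = [  # (unit, time) pairs: 3 units x 2 periods, one observation missing
    ("a", 1), ("a", 2),
    ("b", 1), ("b", 2),
    ("c", 1),
]
n_units = len({u for u, _ in rows})       # distinct units -> 3
n_periods = len({t for _, t in rows})     # distinct periods -> 2
n_obs = len(rows)                         # total rows -> 5
observation_coverage = n_obs / (n_units * n_periods)  # 5 / 6
is_balanced = n_obs == n_units * n_periods            # False
```

Coverage here is about 0.83, so the documented `panel_highly_unbalanced` alert (threshold 0.70) would not fire on this toy panel.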
Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 1 + ROADMAP.md | 10 +- diff_diff/__init__.py | 25 +- diff_diff/_guides_api.py | 10 +- diff_diff/guides/llms-autonomous.txt | 594 +++++++++++++++++++++++++++ diff_diff/profile.py | 575 ++++++++++++++++++++++++++ tests/test_guides.py | 22 +- tests/test_profile_panel.py | 254 ++++++++++++ 8 files changed, 1475 insertions(+), 16 deletions(-) create mode 100644 diff_diff/guides/llms-autonomous.txt create mode 100644 diff_diff/profile.py create mode 100644 tests/test_profile_panel.py diff --git a/CHANGELOG.md b/CHANGELOG.md index aaa2d43d..d7474d25 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **`did_had_pretest_workflow(aggregate="event_study")`**: multi-period dispatch on balanced ≥3-period panels. Runs QUG at `F` + joint pre-trends Stute across earlier pre-periods + joint homogeneity-linearity Stute across post-periods. Step 2 closure requires ≥2 pre-periods; with only a single pre-period (the base `F-1`) `pretrends_joint=None` and the verdict flags the skip. Reuses the Phase 2b event-study panel validator (last-cohort auto-filter under staggered timing with `UserWarning`; `ValueError` when `first_treat_col=None` and the panel is staggered). The data-in wrappers `joint_pretrends_test` and `joint_homogeneity_test` also route through that same validator internally, so direct wrapper calls inherit the last-cohort filter and constant-post-dose invariant. `HADPretestReport` extended with `pretrends_joint`, `homogeneity_joint`, and `aggregate` fields; serialization methods (`summary`, `to_dict`, `to_dataframe`, `__repr__`) preserve the Phase 3 output bit-exactly on `aggregate="overall"` — no `aggregate` key, no header row, no schema drift — and only surface the new fields on `aggregate="event_study"`. 
- **`ChaisemartinDHaultfoeuille.by_path`** — per-path event-study disaggregation, mirroring R `did_multiplegt_dyn(..., by_path=k)`. Passing `by_path=k` (positive int) to the estimator reports separate `DID_{path,l}` + SE + inference for the top-k most common observed treatment paths in the window `[F_g-1, F_g-1+L_max]`, answering the practitioner question "is a single pulse enough, or do you need sustained exposure?" across paths like `(0,1,0,0)` vs `(0,1,1,0)` vs `(0,1,1,1)`. The per-path SE follows the joiners-only / leavers-only IF precedent (switcher-side contribution zeroed for non-path groups; control pool and cohort structure unchanged; plug-in SE with path-specific divisor). Requires `drop_larger_lower=False` (multi-switch groups are the object of interest) and `L_max >= 1`. Binary treatment only in this release; combinations with `controls`, `trends_linear`, `trends_nonparam`, `heterogeneity`, `design2`, `honest_did`, `survey_design`, and `n_bootstrap > 0` raise `NotImplementedError` and are deferred to follow-up PRs. Results expose `results.path_effects: Dict[Tuple[int, ...], Dict[str, Any]]` and `results.to_dataframe(level="by_path")`; the summary grows a "Treatment-Path Disaggregation" block. Ties in path frequency are broken lexicographically on the path tuple for deterministic ranking. Overflow (`by_path > n_observed_paths`) returns all observed paths with a `UserWarning`. See `docs/methodology/REGISTRY.md` §ChaisemartinDHaultfoeuille `Note (Phase 3 by_path per-path event-study disaggregation)` for the full contract. - **R-parity for `ChaisemartinDHaultfoeuille.by_path`** against `DIDmultiplegtDYN 2.3.3`. Two new scenarios in `benchmarks/data/dcdh_dynr_golden_values.json` generated from `did_multiplegt_dyn(..., by_path=k)`: `mixed_single_switch_by_path` (2 paths, `by_path=2`) and `multi_path_reversible_by_path` (4 observed paths, `by_path=3`, via a new deterministic multi-path DGP pattern in the R generator). 
Per-path point estimates and per-path switcher counts match R exactly; per-path SE matches within the Phase 2 multi-horizon SE envelope (observed rtol ≤ 10.2% on the 2-path scenario, ≤ 4.2% on the 4-path scenario). Parity tests live at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPath`, matching paths by tuple label via set-equality (robust to R's undocumented frequency-tie tiebreak) and cross-checking per-path switcher counts before SE comparison. **Deviation documented:** cross-path cohort sharing — our full-panel cohort-centered plug-in vs R's per-path re-run diverges materially when a `(D_{g,1}, F_g, S_g)` cohort spans multiple observed paths; the two coincide when every cohort is single-path. The parity scenarios are constructed to keep cohorts single-path (scenario 13 by design, scenario 14 via path-assignment-deterministic-on-F_g). See `docs/methodology/REGISTRY.md` §ChaisemartinDHaultfoeuille `Note (Phase 3 by_path...)` for the full write-up. +- **`profile_panel()` utility + `llms-autonomous.txt` reference guide (agent-facing)** — new `diff_diff.profile_panel(df, *, unit, time, treatment, outcome)` returns a frozen `PanelProfile` dataclass of structural facts (panel balance, treatment-type classification — `"binary_absorbing"` / `"binary_non_absorbing"` / `"continuous"` / `"categorical"`, cohort structure, outcome characteristics, and a `tuple[Alert, ...]` of factual observations). `.to_dict()` returns a JSON-serializable view. Paired with a new bundled `"autonomous"` variant on `get_llm_guide()` — `get_llm_guide("autonomous")` returns a reference-shaped guide (distinct from the existing workflow-prose `"practitioner"` variant) with §1 audience disclaimer, §2 `PanelProfile` field reference, §3 embedded 17-estimator × 9-design-feature support matrix, §4 per-design-feature reasoning citing Baker et al. (2025) and Roth / Sant'Anna (2023), §5 post-fit validation index, §6 BR/DR schema reference, §7 citations, §8 intentional omissions. 
Both pieces are bundled inside the wheel (no GitHub / RTD dependency at runtime); `diff_diff/__init__.py` module docstring leads with an agent-entry block listing `profile_panel`, `get_llm_guide("autonomous")`, `get_llm_guide("practitioner")`, and `BusinessReport` so `help(diff_diff)` surfaces them. Descriptive, not opinionated — `profile_panel` alerts never recommend a specific estimator, and the guide enumerates trade-offs rather than dispatching. Exports: `profile_panel`, `PanelProfile`, `Alert` from top-level `diff_diff`. - **`target_parameter` block in BR/DR schemas (experimental; schema version bumped to 2.0)** — `BUSINESS_REPORT_SCHEMA_VERSION` and `DIAGNOSTIC_REPORT_SCHEMA_VERSION` bumped from `"1.0"` to `"2.0"` because the new `"no_scalar_by_design"` value on the `headline.status` / `headline_metric.status` enum (dCDH `trends_linear=True, L_max>=2` configuration) is a breaking change per the REPORTING.md stability policy. BusinessReport and DiagnosticReport now emit a top-level `target_parameter` block naming what the headline scalar actually represents for each of the 16 result classes. Closes BR/DR foundation gap #6 (target-parameter clarity). Fields: `name`, `definition`, `aggregation` (machine-readable dispatch tag), `headline_attribute` (raw result attribute), `reference` (citation pointer). BR's summary emits the short `name` right after the headline; DR's overall-interpretation paragraph does the same; both full reports carry a "## Target Parameter" section with the full definition. Per-estimator dispatch is sourced from REGISTRY.md and lives in the new `diff_diff/_reporting_helpers.py::describe_target_parameter`. 
A few branches read fit-time config (`EfficientDiDResults.pt_assumption`, `StackedDiDResults.clean_control`, `ChaisemartinDHaultfoeuilleResults.L_max` / `covariate_residuals` / `linear_trends_effects`); others emit a fixed tag (the fit-time `aggregate` kwarg on CS / Imputation / TwoStage / Wooldridge does not change the `overall_att` scalar — disambiguating horizon / group tables is tracked under gap #9). See `docs/methodology/REPORTING.md` "Target parameter" section. - SyntheticDiD coverage Monte Carlo calibration table added to `docs/methodology/REGISTRY.md` §SyntheticDiD — rejection rates at α ∈ {0.01, 0.05, 0.10} across `placebo` / `bootstrap` / `jackknife` on 3 representative DGPs (balanced / exchangeable, unbalanced, and Arkhangelsky et al. (2021) AER §6.3 non-exchangeable). Artifact at `benchmarks/data/sdid_coverage.json` (500 seeds × B=200), regenerable via `benchmarks/python/coverage_sdid.py`. diff --git a/ROADMAP.md b/ROADMAP.md index 65a4b119..fcc02f22 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -137,15 +137,17 @@ Long-running program, framed as "building toward" rather than with discrete ship - Baker et al. (2025) 8-step workflow enforcement in `diff_diff/practitioner.py`. - `practitioner_next_steps()` context-aware guidance. -- Runtime LLM guides via `get_llm_guide(...)` (`llms.txt`, `llms-full.txt`, `llms-practitioner.txt`), bundled in the wheel. +- Runtime LLM guides via `get_llm_guide(...)` (`llms.txt`, `llms-full.txt`, `llms-practitioner.txt`, `llms-autonomous.txt`), bundled in the wheel. +- `profile_panel(df, ...)` returns a `PanelProfile` dataclass of structural facts about the panel - factual, not opinionated. Pairs with the `"autonomous"` guide variant (reference-shaped: estimator-support matrix + per-design-feature reasoning) so agents describe the data then consult a bundled reference rather than calling a deterministic recommender. 
+- Package docstring leads with a "For AI agents" entry block so `help(diff_diff)` surfaces the agent entry points automatically. - Silent-operation warnings so agents and humans see the same signals at the same time. **Next blocks toward the vision.** -- **BusinessReport / DiagnosticReport** (in Shipping Next) - the output form the vision assumes. +- **Post-hoc mismatch detection in BR/DR output** - surfaces structured warnings like "you fit TWFE on staggered data with 37% forbidden-comparison weights" when the profile and the fitted estimator disagree. Safety net, not a pre-emptive rules engine. +- **Structured `sanity_checks` block in BR/DR** - machine-legible pass / warn / fail signals (pretrends, power, forbidden-comparisons, event-study cleanliness, placebo, sensitivity) so agents can dispatch on a stable schema rather than parsing prose. - **Context-aware `practitioner_next_steps()`** that substitutes actual column names - turns guidance into executable recommendations. -- **AI-legible diagnostic surfaces** - once BusinessReport ships, a structured JSON counterpart that agents can parse without screen-scraping human text. -- **Scenario-to-estimator selection guidance** - agent-facing extension of `docs/practitioner_decision_tree.rst` that returns a specific estimator choice plus rationale for a given scenario description. +- **Unified `assess_*` verb** across estimator native-diagnostic methods for a single discoverable convention. - **End-to-end scenario walkthrough templates** - reusable orchestration recipes an agent can adapt from data ingest through business-ready output. --- diff --git a/diff_diff/__init__.py b/diff_diff/__init__.py index 4a9b93c4..0247eecd 100644 --- a/diff_diff/__init__.py +++ b/diff_diff/__init__.py @@ -4,14 +4,20 @@ This library provides sklearn-like estimators for causal inference using the difference-in-differences methodology. -For rigorous analysis, follow the 8-step practitioner workflow based -on Baker et al. (2025).
After estimation, call -``practitioner_next_steps(results)`` for context-aware guidance on -remaining diagnostic steps. +For AI agents: -AI agents: call ``diff_diff.get_llm_guide()`` for a complete API reference. -Use ``get_llm_guide("practitioner")`` for the 8-step workflow or -``get_llm_guide("full")`` for comprehensive documentation. + 1. Describe your data: ``diff_diff.profile_panel(df, unit=..., time=..., + treatment=..., outcome=...)`` + 2. Consult the reference: ``diff_diff.get_llm_guide("autonomous")`` + (estimator-support matrix + reasoning) + 3. Follow the workflow: ``diff_diff.get_llm_guide("practitioner")`` + (Baker et al. (2025) 8-step recipe) + 4. Report results: ``diff_diff.BusinessReport(results)`` + (structured agent-legible output) + +For a comprehensive API reference call ``diff_diff.get_llm_guide("full")``; +``practitioner_next_steps(results)`` returns context-aware guidance after +any estimator's ``fit()``. """ # Import backend detection from dedicated module (avoids circular imports) @@ -244,6 +250,7 @@ DiagnosticReportResults, ) from diff_diff._guides_api import get_llm_guide +from diff_diff.profile import Alert, PanelProfile, profile_panel from diff_diff.datasets import ( clear_cache, list_datasets, @@ -487,6 +494,10 @@ "DiagnosticReport", "DiagnosticReportResults", "DIAGNOSTIC_REPORT_SCHEMA_VERSION", + # Panel profiling (agent-facing pre-fit describe utility) + "profile_panel", + "PanelProfile", + "Alert", # LLM guide accessor "get_llm_guide", ] diff --git a/diff_diff/_guides_api.py b/diff_diff/_guides_api.py index 5a00ed77..503c74ba 100644 --- a/diff_diff/_guides_api.py +++ b/diff_diff/_guides_api.py @@ -1,4 +1,5 @@ """Runtime accessor for bundled LLM guide files.""" + from __future__ import annotations from importlib.resources import files @@ -7,6 +8,7 @@ "concise": "llms.txt", "full": "llms-full.txt", "practitioner": "llms-practitioner.txt", + "autonomous": "llms-autonomous.txt", } @@ -21,6 +23,10 @@ def get_llm_guide(variant: str = 
"concise") -> str: - ``"concise"`` -- compact API reference (llms.txt) - ``"full"`` -- complete API documentation (llms-full.txt) - ``"practitioner"`` -- 8-step practitioner workflow (llms-practitioner.txt) + - ``"autonomous"`` -- reference guide for AI-agent use: estimator-support + matrix, per-design-feature reasoning, post-fit validation index, and + BR/DR schema (llms-autonomous.txt). Pair with + :func:`diff_diff.profile_panel` for pre-fit data description. Returns ------- @@ -42,7 +48,5 @@ def get_llm_guide(variant: str = "concise") -> str: filename = _VARIANT_TO_FILE[variant] except (KeyError, TypeError): valid = ", ".join(repr(k) for k in _VARIANT_TO_FILE) - raise ValueError( - f"Unknown guide variant {variant!r}. Valid options: {valid}." - ) from None + raise ValueError(f"Unknown guide variant {variant!r}. Valid options: {valid}.") from None return files("diff_diff.guides").joinpath(filename).read_text(encoding="utf-8") diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt new file mode 100644 index 00000000..fcefd79e --- /dev/null +++ b/diff_diff/guides/llms-autonomous.txt @@ -0,0 +1,594 @@ +# diff-diff: Autonomous-agent reference guide + +This guide is reference material for AI agents using diff-diff without +human-in-the-loop supervision. It catalogs the library's estimators, names +the design features each supports, explains how to read the +`profile_panel()` output, and points at post-fit validation utilities and +report schemas. + +It is a reference, not a decision tree. Multiple estimators usually fit a +given panel; choosing between them involves trade-offs the cited literature +discusses and that this guide does not pretend to resolve. + +**Pair this guide with:** +- `get_llm_guide("practitioner")` - the Baker et al. (2025) 8-step validation + workflow in workflow-prose form. +- `get_llm_guide("full")` - comprehensive API documentation for every public + function and class. 
+- `profile_panel(df, unit=..., time=..., treatment=..., outcome=...)` - the + pre-fit describe utility whose output fields this guide's sections §2 and + §4 reason about. + + +## Table of contents + +- §1. What this guide is (and is not) +- §2. PanelProfile field reference +- §3. Estimator-support matrix +- §4. Estimator-choice reasoning by design feature +- §5. Post-fit validation utilities +- §6. How to read BusinessReport / DiagnosticReport output +- §7. Glossary + citations +- §8. Intentional omissions + + +## §1. What this guide is (and is not) + +**What it is.** A reference you consult after running `profile_panel()` and +before calling any estimator's `fit()`. The matrix in §3 and the per-design- +feature discussions in §4 tell you which estimators are well-suited to the +panel shape reported by the profile; the post-fit index in §5 tells you +which diagnostics apply once you have a fitted result. + +**What it is not.** A deterministic recommender. No function in diff-diff +returns "pick estimator X." This guide does not either. When several +estimators fit a design, it enumerates them and names the trade-offs. The +agent is responsible for weighing those trade-offs (often with the cited +references in §7) and justifying the choice in the final write-up. + +**Why this shape.** A rules-engine recommender would lock in a policy that +ages poorly as new estimators land and as the applied-econometrics +literature evolves. Static reference material + descriptive profiling is +less brittle: when a new estimator is added it gets a row in §3 and a +paragraph in §4, without rewriting a dispatcher. + + +## §2. PanelProfile field reference + +`profile_panel(df, unit=..., time=..., treatment=..., outcome=...)` returns +a frozen `PanelProfile` dataclass. Call `.to_dict()` for a JSON-serializable +view. Every field below appears as a top-level key in that dict. + +### Panel structure + +- **`n_units: int`** - count of distinct values in the `unit` column. 
+- **`n_periods: int`** - count of distinct values in the `time` column. +- **`n_obs: int`** - total rows in the panel. +- **`is_balanced: bool`** - true iff `n_obs == n_units * n_periods`, i.e. + every unit is observed in every period. +- **`observation_coverage: float`** - `n_obs / (n_units * n_periods)` in + `[0, 1]`. A value below `0.70` also triggers the + `panel_highly_unbalanced` alert. + +### Treatment variation + +- **`treatment_type: str`** - classification of the treatment column. + Exactly one of: + - `"binary_absorbing"`: numeric with values in {0, 1}; each unit's + treatment sequence (ordered by `time`) is weakly monotone + non-decreasing. The canonical DiD setting. + - `"binary_non_absorbing"`: values in {0, 1} but at least one unit + switches from 1 back to 0. Only `ChaisemartinDHaultfoeuille` handles + this natively; the other absorbing-only estimators would misapply. + - `"continuous"`: numeric with more than two distinct values (e.g., a + dose, a discrete-integer partial-adoption score). Use + `ContinuousDiD` or `HeterogeneousAdoptionDiD`. + - `"categorical"`: non-numeric dtype (object / category) or bool dtype. + Often indicates a treatment arm. Encode each arm as a binary + indicator and fit separately, or use a multi-treatment workflow + outside the current estimator suite. +- **`is_staggered: bool`** - true iff treatment is `binary_absorbing` and + at least two distinct first-treatment periods are observed. Drives the + choice between classic DiD/TWFE and staggered-robust estimators. +- **`n_cohorts: int`** - for `binary_absorbing`, the number of distinct + first-treatment periods (cohorts). Zero for other `treatment_type` + values. +- **`cohort_sizes: Mapping[Any, int]`** - map from first-treatment period + to cohort size (number of units adopting at that time). Empty for + non-absorbing / continuous / categorical treatments. +- **`has_never_treated: bool`** - at least one unit has `treatment == 0` + in every observed period. 
Required by `SyntheticDiD` and + `EfficientDiD` with `assumption="PT-All"`; preferred-but-optional by + `CallawaySantAnna` and `ChaisemartinDHaultfoeuille`. +- **`has_always_treated: bool`** - at least one unit has + `treatment == 1` in every observed period. Always-treated units + provide no pre-treatment identification and are dropped by most + estimators. + +### Timing + +- **`first_treatment_period: Optional[Any]`** - earliest first-treatment + period observed (for `binary_absorbing`); `None` otherwise. +- **`last_treatment_period: Optional[Any]`** - latest first-treatment + period observed; `None` otherwise. +- **`min_pre_periods: Optional[int]`** - across cohorts, the smallest + number of pre-treatment periods. Low values (< 3) fire the + `short_pre_panel` alert and limit power for parallel-trends tests. +- **`min_post_periods: Optional[int]`** - across cohorts, the smallest + number of post-treatment periods. Low values limit event-study + dynamics. + +### Outcome + +- **`outcome_dtype: str`** - the pandas dtype name (e.g. `"float64"`, + `"int64"`, `"bool"`). +- **`outcome_is_binary: bool`** - outcome has exactly two distinct + non-NaN values, both in {0, 1}. For binary outcomes the linear + parallel-trends assumption is restrictive; consider the logit/log-odds + alternative in the Roth/Sant'Anna (2023) survey. +- **`outcome_has_zeros: bool`** - any non-NaN outcome equals zero. + Relevant for log-transform diagnostics. +- **`outcome_has_negatives: bool`** - any non-NaN outcome is negative. + Relevant for log-transform diagnostics. +- **`outcome_missing_fraction: float`** - share of rows where the + outcome column is NaN, in `[0, 1]`. +- **`outcome_summary: Mapping[str, float]`** - `{min, max, mean, std}` + computed with NaN-skipping; empty for non-numeric outcomes. + +### Alerts + +`alerts: tuple[Alert, ...]` is a tuple of factual observations.
Each +`Alert` has `code`, `severity` (`"info"` or `"warn"`), `message`, and +`observed` (the numerical or boolean value that tripped the alert). + +The v1 alert catalogue is listed below. Alerts never name a specific +estimator. Severity `"warn"` means the observation is likely relevant to +estimator choice or to the interpretation of diagnostics; `"info"` means +it is descriptive context. + +| Alert code | Severity | Fires when | +|---|---|---| +| `min_cohort_size_below_10` | warn | smallest cohort has fewer than 10 units | +| `only_one_cohort` | info | all treated units adopt simultaneously | +| `short_pre_panel` | warn | `min_pre_periods < 3` | +| `short_post_panel` | info | `min_post_periods < 3` | +| `no_never_treated` | info | every unit is eventually treated | +| `has_always_treated_units` | info | some units are treated in every observed period | +| `all_units_treated_simultaneously` | info | single cohort and no never-treated group | +| `panel_highly_unbalanced` | warn | `observation_coverage < 0.70` | +| `only_two_periods` | info | `n_periods == 2` | +| `outcome_looks_binary_but_dtype_float` | info | outcome takes {0, 1} values but is stored as float | + + +## §3. Estimator-support matrix + +Rows are estimator classes exported from `diff_diff`. Columns are design +features derivable from `PanelProfile`. Cells: `✓` supported; `✗` not +supported / out of scope; `warn` supported but with documented caveats; +`partial` supported subject to restrictions discussed in §4. 
+ +| Estimator | binary absorbing | staggered | continuous | triple-diff | never-treated required | covariate adjustment | few-treated (synthetic) | heterogeneous adoption | clustered SE | +|---|---|---|---|---|---|---|---|---|---| +| `DifferenceInDifferences` | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `MultiPeriodDiD` | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `TwoWayFixedEffects` | ✓ | warn | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `CallawaySantAnna` | ✓ | ✓ | ✗ | ✗ | partial | ✓ | ✗ | ✗ | ✓ | +| `SunAbraham` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `ChaisemartinDHaultfoeuille` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `ImputationDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `TwoStageDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `StackedDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `WooldridgeDiD` (ETWFE) | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `EfficientDiD` | ✓ | ✓ | ✗ | ✗ | partial | ✓ | ✗ | ✗ | ✓ | +| `SyntheticDiD` | ✓ | partial | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | partial | +| `TROP` | ✓ | partial | ✗ | ✗ | ✓ | partial | ✓ | ✗ | partial | +| `TripleDifference` | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `StaggeredTripleDifference` | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `ContinuousDiD` | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `HeterogeneousAdoptionDiD` | ✗ | partial | partial | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | + +**Footnotes.** +- `TwoWayFixedEffects` + staggered: fits but mixes positive and negative + cohort-weights that violate the ATT interpretation; consult + `BaconDecomposition` to quantify. Prefer any staggered-robust + estimator (CS, SA, dCDH, Imputation, TwoStage, ETWFE) for a staggered + design. +- `CallawaySantAnna` + never-treated: the "never-treated" control group + is one option; "not-yet-treated" is the other. Pick via the + `control_group` argument. If `has_never_treated == False`, use + `control_group="notyettreated"`. +- `EfficientDiD` + never-treated: `assumption="PT-All"` requires it; + `assumption="PT-Post"` does not. 
The `Hausman.hausman_pretest` + classmethod picks between them using a formal test. +- `SyntheticDiD` + staggered: native support is limited; the + single-event estimator is the canonical case. Multi-event extensions + use a cohort-level fit loop - check the estimator's docstring. +- `TROP` covariate adjustment: supported through the global-method + path, not the local-method path. +- `HeterogeneousAdoptionDiD` continuous: supports partial-adoption + intensity as a covariate-like continuous variable; not a pure + dose-response estimator - use `ContinuousDiD` for that. + + +## §4. Estimator-choice reasoning by design feature + +Each subsection names a design feature and lists estimators applicable to +it with the most important trade-offs. Multiple paths are always +explicit; no subsection says "pick estimator X." + +### §4.1 Classic 2×2 DiD (binary absorbing, two periods, no staggering) + +When `treatment_type == "binary_absorbing"`, `n_periods == 2`, and +`is_staggered == False`, the classic Card-and-Krueger 2×2 design applies. +Most estimators in the library produce the same point estimate in this +case; the choice between them is mostly about output shape: + +- `DifferenceInDifferences` for a minimal results object. +- `TwoWayFixedEffects` if you want the equivalent two-way-FE regression + output (coefficient table, VCV, etc.). Identical to DiD in the 2×2 + case. +- `TripleDifference` if a second comparison dimension is available + (DDD) - see §4.6. + +### §4.2 Multi-period single-cohort (event-study without staggering) + +When `is_staggered == False` and `n_periods > 2`, event-study dynamics +can be estimated but cohort-mixing bias is moot: + +- `MultiPeriodDiD` - per-period effect, standard event-study plot. +- `TwoWayFixedEffects` with event-time dummies - similar output, no + forbidden comparisons because there is only one cohort. 
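The 2×2 arithmetic that §4.1 describes can be sketched directly with toy cell means (hand-rolled example numbers, not a library call):

```python
# Toy sketch of the classic 2x2 DiD arithmetic from section 4.1: the
# estimate is the treated group's pre-to-post change minus the control
# group's pre-to-post change. Numbers are hypothetical.
means = {  # (group, period) -> mean outcome
    ("treated", "pre"): 10.0, ("treated", "post"): 15.0,
    ("control", "pre"): 8.0,  ("control", "post"): 11.0,
}
treated_change = means[("treated", "post")] - means[("treated", "pre")]  # 5.0
control_change = means[("control", "post")] - means[("control", "pre")]  # 3.0
did_estimate = treated_change - control_change                           # 2.0
```

In this 2×2 case the library's `DifferenceInDifferences` and `TwoWayFixedEffects` point estimates coincide with this double difference; the estimators add the inference and output machinery.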
+ +### §4.3 Staggered adoption (multi-cohort binary absorbing) + +When `is_staggered == True`, classic TWFE mixes positive- and +negative-weighted cohort comparisons (Goodman-Bacon 2021, +de Chaisemartin & d'Haultfoeuille 2020). Use one of the staggered-robust +estimators: + +- `CallawaySantAnna` - group-time ATTs aggregated to ES / overall / cohort + dimensions. Flexible control-group choice (never-treated vs. + not-yet-treated). Covariate adjustment via doubly-robust (DR), IPW, + or regression-adjustment (RA). +- `SunAbraham` - interaction-weighted estimator; closely tied to + two-way-FE output, computationally cheap, produces event-time + coefficients. +- `ChaisemartinDHaultfoeuille` - DID_M / DID_l estimators robust to + non-absorbing treatment (see §4.5) and to spillover designs. +- `ImputationDiD` (Borusyak, Jaravel, Spiess) - imputation-based, + efficient under homoskedasticity, produces an imputation-based + residual at the observation level. +- `TwoStageDiD` (Gardner) - two-stage residualize-then-regress. +- `StackedDiD` - stacked event-study regressions, one subpanel per + cohort. Conservative interpretation. +- `WooldridgeDiD` (ETWFE) - extended-TWFE with cohort-by-time-by- + covariates interactions; heterogeneous covariate-by-cohort effects. +- `EfficientDiD` (Arkhangelsky-Imbens) - asymptotically efficient under + either `PT-All` or `PT-Post`; use `Hausman.hausman_pretest` to pick. + +Diagnostic: `bacon_decompose(df, ...)` shows the weight allocation of a +TWFE fit to 2×2 comparison types. Forbidden-comparison weight > 10% is a +strong signal that the TWFE estimate is biased. + +### §4.4 No never-treated group + +When `has_never_treated == False`: + +- `SyntheticDiD` requires a never-treated donor pool - not applicable. +- `TROP` needs never-treated units for the factor-model fit - not + applicable in the standard setup. +- `EfficientDiD` with `assumption="PT-All"` requires it - use + `assumption="PT-Post"` instead. 
+- `CallawaySantAnna` - use `control_group="notyettreated"` to use + not-yet-treated units as the control pool. +- `ChaisemartinDHaultfoeuille` - constructs switchers vs. non-switchers + directly; no never-treated requirement. +- TWFE / `MultiPeriodDiD` / `ImputationDiD` / `TwoStageDiD` / + `StackedDiD` / `WooldridgeDiD` - use the last-treated or untreated- + until-late units as implicit controls; estimators do not error, but + consider whether the implicit control structure is what you want. + +### §4.5 Non-absorbing binary treatment (treatment switches back to 0) + +When `treatment_type == "binary_non_absorbing"`: + +- `ChaisemartinDHaultfoeuille` is the only estimator in the library + that treats this natively. Switcher / non-switcher comparisons are + its primitive object. +- Other estimators assume absorbing treatment and will produce + estimates whose interpretation is unclear. Do not use them without + a well-argued reason. + +### §4.6 Triple-difference design (DDD) + +When a second cross-cutting comparison axis exists (e.g., policy hits +some states and some demographic subgroups within states): + +- `TripleDifference` - classic two-period DDD. +- `StaggeredTripleDifference` - staggered DDD, robust to cohort-mixing. + +Triple-difference is not automatically detected by `profile_panel`; +it requires the caller to identify the third comparison axis. If a +`group` covariate in the panel drives differential exposure, DDD is +worth considering. + +### §4.7 Continuous / dose-response treatment + +When `treatment_type == "continuous"`: + +- `ContinuousDiD` - dose-response treatment, estimates average + causal response (ACR). Supports B-spline bandwidth selection. +- `HeterogeneousAdoptionDiD` - partial-adoption intensity, with a + scalar first-stage adoption summary. Useful when adoption is + graded rather than binary. 
+ +### §4.8 Few treated units (one or a handful) + +When few treated units exist (not a separate `PanelProfile` field yet, +but derivable from `cohort_sizes` + `has_never_treated`): + +- `SyntheticDiD` - synthetic-control-meets-DiD. Requires never-treated + donors and sufficient pre-treatment periods (Arkhangelsky et al. 2021). +- `TROP` - factor-model-based generalized synthetic control. Similar + donor-pool requirements; supports more complex factor structures. + +Classical DiD estimators will still produce estimates, but inference is +unreliable with very small treated groups; cluster-robust SE relies on +the number of clusters, not the number of treated units. Bootstrap +methods in the library are preferred. + +### §4.9 Heterogeneous adoption intensity + +When adoption varies in strength across units (partial-adoption settings, +intensity of exposure differs): + +- `HeterogeneousAdoptionDiD` - Phase 3 tools include Stute / Yatchew- + Härdle / QUG pre-tests for the first-stage model. The estimator's + workflow returns an aggregated ATT plus a per-cohort Pierce-Schott- + style validation. + +### §4.10 Repeated cross-sections (no panel structure) + +`profile_panel` assumes long-format panel data. When the same units are +not observed across time (true repeated cross-sections), most estimators +remain applicable but: + +- Clustered SE must cluster on the unit proxy (state, region) rather + than individual. +- The `CallawaySantAnna` estimator has an explicit repeated-cross- + section mode; see its `panel` kwarg. + + +## §5. Post-fit validation utilities + +After any `fit()`, the Baker et al. (2025) 8-step workflow recommends a +diagnostic sequence. The library exposes utilities covering each step. +Consult `get_llm_guide("practitioner")` for the workflow-prose form; this +section is the API-reference index. + +### Parallel-trends and pre-trends + +- `check_parallel_trends(df, ...)` - exported from `diff_diff`. 
+ Regression-based visual-plus-numeric test on pre-treatment periods. + Returns a structured result with p-value and per-period coefficients. +- `check_parallel_trends_robust(df, ...)` - Roth (2022) power-adjusted + version; adds a "believable-magnitude" check against a power curve. +- `equivalence_test_trends(df, ...)` - Bilinski-Hatfield-style + equivalence test (alternative framing of the PT test). +- `compute_pretrends_power(df, ...)` - standalone power analysis for the + PT test, useful when `min_pre_periods` is small. + +### Sensitivity / robustness + +- `compute_honest_did(results, ...)` - Rambachan-Roth (2023) honest DiD. + Quantifies the sensitivity of ATT to parallel-trends violations. + Outputs sensitivity bounds under smoothness restrictions. +- `compute_pretrends_power(...)` - complementary tool for power-aware + pre-trends interpretation. + +### Placebo tests + +- `run_placebo_test(df, ...)` - generic placebo runner. +- `run_all_placebo_tests(df, ...)` - batch runner over predefined + placebos. +- `placebo_timing_test(df, ...)` - false placebo-treatment time. +- `placebo_group_test(df, ...)` - placebo treatment-group assignment. +- `permutation_test(df, ...)` - Fisher-style exact permutation. +- `leave_one_out_test(df, ...)` - refit dropping one unit at a time. + +### Estimator-native diagnostics + +Some estimators expose diagnostics as methods on the result object: + +- `SyntheticDiDResults.in_time_placebo()` - placebo treatment applied + in a pre-treatment period. +- `SyntheticDiDResults.sensitivity_to_zeta_omega()` - regularization- + hyperparameter sensitivity. +- `SyntheticDiDResults.get_weight_concentration()` - donor-weight + concentration summary. +- `CallawaySantAnna.diagnose_propensity(df, ...)` - propensity-score + overlap check when using DR / IPW controls. +- `Hausman.hausman_pretest(df, ...)` - chooses between `PT-All` and + `PT-Post` for `EfficientDiD`. 
+- `did_had_pretest_workflow(df, ...)` - bundled QUG / Stute / Yatchew- + Härdle pre-test battery for `HeterogeneousAdoptionDiD`. + +### Decomposition and weight auditing + +- `bacon_decompose(df, ...)` - Goodman-Bacon (2021) TWFE weight + decomposition. Returns a `BaconDecompositionResults` with the weight + on forbidden (later-vs-earlier) comparisons. Run before interpreting + any TWFE staggered fit. + +### Event-study plotting + +- `plot_event_study(results, ...)` +- `plot_group_effects(results, ...)` +- `plot_group_time_heatmap(results, ...)` +- `plot_staircase(results, ...)` +- `plot_honest_event_study(results, ...)` +- `plot_sensitivity(results, ...)` +- `plot_synth_weights(results, ...)` +- `plot_dose_response(results, ...)` +- `plot_power_curve(...)` + +Event-study plots are also a diagnostic - pre-treatment coefficients +close to zero support parallel trends. + + +## §6. How to read BusinessReport / DiagnosticReport output + +`BusinessReport(results)` and `DiagnosticReport(results)` are experimental +in the 3.2 line. Their schema is versioned (`BUSINESS_REPORT_SCHEMA_VERSION` +and `DIAGNOSTIC_REPORT_SCHEMA_VERSION`, both `"2.0"` at time of writing) +and expected to evolve. Treat `.to_dict()` output as the agent-legible +contract; the prose renderers (`summary()`, `full_report()`) are derived +from it. + +### BusinessReport `to_dict()` schema (v2.0) + +Top-level keys: + +- `schema_version: str` - e.g. `"2.0"`. +- `target_parameter: dict` - what the headline scalar represents. + Fields: `name` (e.g. `"ATT"`, `"DID_M"`, `"dose-response"`), + `definition` (human-readable explanation), `aggregation` (machine + tag: `"att_overall"`, `"did_m"`, `"did_or_twfe"`, ...), + `headline_attribute` (the raw result attribute the headline renders + from), `reference` (REGISTRY.md citation string). +- `headline: dict` - the main point estimate plus framing. +- `assumptions: dict` - named assumptions relied on (parallel trends, + no anticipation, SUTVA, ...). 
+- `pretrends: dict` - pre-trends test result with verdict string + (e.g. `"clean"`, `"inconclusive"`, `"violated"`), p-value, power + assessment if available. +- `main_result: dict` - point estimate, SE, CI, significance. +- `robustness: dict` - sensitivity and placebo summaries if available. +- `sample_summary: dict` - sample size and coverage details. +- `caveats: list[str]` - free-text caveats generated from failed + checks. + +### DiagnosticReport `to_dict()` schema (v2.0) + +- `schema_version: str`. +- `estimator_type: str` - the result class name. +- `checks: dict` - per-check status. Candidate keys: + `parallel_trends`, `pretrends_power`, `sensitivity`, `bacon`, + `design_effect`, `heterogeneity`, `epv`, `estimator_native`, + `placebo`. Each value is a dict describing what was run and its + outcome. Not all checks apply to all estimators; see the + applicability matrix in `diff_diff.diagnostic_report`. + +### Forthcoming schema additions (not yet shipped) + +- `sanity_checks: dict` - machine-legible pass / warn / fail summary + (forthcoming Wave). +- `mismatch_warnings: list[dict]` - post-hoc estimator-mismatch + detection (forthcoming Wave). + + +## §7. Glossary + citations + +**ATT**: Average Treatment Effect on the Treated. The target parameter +of most DiD estimators. + +**Parallel trends**: counterfactual trends in treated and control +outcomes would have moved together absent treatment. Untestable directly; +pre-treatment dynamics are a necessary (not sufficient) indicator. + +**No anticipation**: units do not respond to treatment before it occurs. +If plausible, test via pre-treatment event-study coefficients. + +**SUTVA**: Stable Unit Treatment Value Assumption. Rules out spillovers +and interference between units. + +**Forbidden comparison**: in TWFE, a comparison where already-treated +units serve as controls for later-treated units. Weights are negative +and the resulting estimate can flip sign vs. the true ATT. 
+ +**Cohort / treatment timing**: first-treatment period for an +absorbing-treatment unit. Units sharing a cohort share an adoption date. + +**Staggered adoption**: two or more cohorts present in the panel. + +**Doubly-robust (DR) / IPW / RA**: three covariate-adjustment strategies +in `CallawaySantAnna`. DR is consistent if either the propensity model +or the outcome model is correctly specified. + +### Primary references + +- **Baker, Andrew, Brantly Callaway, Scott Cunningham, Andrew + Goodman-Bacon, and Pedro H. C. Sant'Anna (2025).** "Difference-in- + Differences Designs: A Practitioner's Guide." arXiv:2503.13323. + The 8-step workflow and best-practice framing. Ships as + `get_llm_guide("practitioner")`. +- **Roth, Jonathan, Pedro H. C. Sant'Anna, Alyssa Bilinski, and John + Poe (2023).** "What's Trending in Difference-in-Differences? A + Synthesis of the Recent Econometrics Literature." Journal of + Econometrics 235(2): 2218-2244. Canonical-assumption framing; + classification of estimator relaxations. +- **Goodman-Bacon, Andrew (2021).** "Difference-in-Differences with + Variation in Treatment Timing." Journal of Econometrics + 225(2): 254-277. TWFE weight decomposition; + `bacon_decompose` implements this. +- **Callaway, Brantly, and Pedro H. C. Sant'Anna (2021).** + "Difference-in-Differences with Multiple Time Periods." Journal of + Econometrics 225(2): 200-230. Group-time ATT. +- **Sun, Liyang, and Sarah Abraham (2021).** "Estimating Dynamic + Treatment Effects in Event Studies with Heterogeneous Treatment + Effects." Journal of Econometrics 225(2): 175-199. IW estimator. +- **de Chaisemartin, Clément, and Xavier d'Haultfoeuille (2020).** + "Two-Way Fixed Effects Estimators with Heterogeneous Treatment + Effects." American Economic Review 110(9): 2964-2996. DID_M + estimator. +- **Borusyak, Kirill, Xavier Jaravel, and Jann Spiess (2024).** + "Revisiting Event-Study Designs: Robust and Efficient Estimation." + Review of Economic Studies 91(6): 3253-3285. 
Imputation estimator. +- **Gardner, John (2022).** "Two-Stage Differences in Differences." + arXiv:2207.05943. Two-stage estimator. +- **Wooldridge, Jeffrey M. (2021).** "Two-Way Fixed Effects, the Two- + Way Mundlak Regression, and Difference-in-Differences Estimators." + ETWFE formulation. +- **Arkhangelsky, Dmitry, Susan Athey, David Hirshberg, Guido Imbens, + and Stefan Wager (2021).** "Synthetic Difference-in-Differences." + American Economic Review 111(12): 4088-4118. SDiD estimator. +- **Rambachan, Ashesh, and Jonathan Roth (2023).** "A More Credible + Approach to Parallel Trends." Review of Economic Studies + 90(5): 2555-2591. HonestDiD sensitivity. +- **Bilinski, Alyssa, and Laura A. Hatfield (2019).** "Nothing to See + Here? Non-Inferiority Approaches to Parallel Trends and Other + Model Assumptions." arXiv:1805.03273. Equivalence test. +- **Sant'Anna, Pedro H. C., and Jun Zhao (2020).** "Doubly Robust + Difference-in-Differences Estimators." Journal of Econometrics + 219(1): 101-122. DR adjustment. + +### Online resources + +- **psantanna.com/did-resources** - practitioner checklist + reading + list maintained by Pedro Sant'Anna. +- **bcallaway11.github.io/did** - `did` R package tutorials + (Callaway-Sant'Anna). + + +## §8. Intentional omissions + +This guide does **not**: + +- Recommend a specific estimator for a specific dataset. When multiple + estimators fit, §4 lists them and names the trade-offs; the choice is + the agent's. +- Enumerate every possible design edge case. The literature cited in §7 + covers them; this guide is a navigation aid, not a substitute. +- Promise forward-compatibility of the BR / DR schema or the alert + catalogue. Treat these as experimental until the 12-item foundation- + gap list closes. +- Replace `bacon_decompose()`, `compute_honest_did()`, or any of the + estimator-native diagnostics. Post-fit validation is mandatory, not + optional, and belongs in the final write-up. 
+- Cover methods outside diff-diff's estimator suite (e.g., instrumental + variables, regression discontinuity, synthetic control for a single + treated unit). When those apply, point the user at dedicated + libraries. + +**If in doubt, consult the primary references in §7 and use +`get_llm_guide("practitioner")` for the Baker et al. workflow.** diff --git a/diff_diff/profile.py b/diff_diff/profile.py new file mode 100644 index 00000000..869fb396 --- /dev/null +++ b/diff_diff/profile.py @@ -0,0 +1,575 @@ +"""Descriptive panel-profiling utility for agent-facing use. + +``profile_panel()`` inspects a DiD panel and returns a :class:`PanelProfile` +dataclass of structural facts — panel balance, treatment-type classification, +outcome characteristics, and a list of factual :class:`Alert` observations. + +This module is descriptive, not opinionated. Alerts report what is (e.g. +"smallest cohort has 7 units"), never what to do about it. Estimator +selection is the caller's responsibility; consult +``diff_diff.get_llm_guide("autonomous")`` for the estimator-support matrix +and per-design-feature reasoning. +""" + +from __future__ import annotations + +from dataclasses import dataclass +from typing import Any, Dict, List, Mapping, Optional, Tuple, cast + +import numpy as np +import pandas as pd + +_OBSERVATION_COVERAGE_THRESHOLD = 0.70 +_MIN_COHORT_SIZE_THRESHOLD = 10 +_SHORT_PRE_PANEL_THRESHOLD = 3 +_SHORT_POST_PANEL_THRESHOLD = 3 + + +@dataclass(frozen=True) +class Alert: + """A factual observation about a panel. + + ``severity`` is ``"info"`` (descriptive) or ``"warn"`` (descriptive and + likely relevant to the caller's estimator choice). Alerts never + recommend a specific estimator. + """ + + code: str + severity: str + message: str + observed: Any + + +@dataclass(frozen=True) +class PanelProfile: + """Structural facts about a DiD panel. + + Returned by :func:`profile_panel`. Mirrors the ``BusinessContext`` + frozen-dataclass pattern. 
Consume ``.to_dict()`` for a JSON-serializable + representation and reason against the bundled + ``llms-autonomous.txt`` guide. + """ + + n_units: int + n_periods: int + n_obs: int + is_balanced: bool + observation_coverage: float + + treatment_type: str + is_staggered: bool + n_cohorts: int + cohort_sizes: Mapping[Any, int] + has_never_treated: bool + has_always_treated: bool + + first_treatment_period: Optional[Any] + last_treatment_period: Optional[Any] + min_pre_periods: Optional[int] + min_post_periods: Optional[int] + + outcome_dtype: str + outcome_is_binary: bool + outcome_has_zeros: bool + outcome_has_negatives: bool + outcome_missing_fraction: float + outcome_summary: Mapping[str, float] + + alerts: Tuple[Alert, ...] + + def to_dict(self) -> Dict[str, Any]: + """Return a JSON-serializable dict representation of the profile.""" + return { + "n_units": self.n_units, + "n_periods": self.n_periods, + "n_obs": self.n_obs, + "is_balanced": self.is_balanced, + "observation_coverage": self.observation_coverage, + "treatment_type": self.treatment_type, + "is_staggered": self.is_staggered, + "n_cohorts": self.n_cohorts, + "cohort_sizes": {_jsonable_key(k): int(v) for k, v in self.cohort_sizes.items()}, + "has_never_treated": self.has_never_treated, + "has_always_treated": self.has_always_treated, + "first_treatment_period": _jsonable(self.first_treatment_period), + "last_treatment_period": _jsonable(self.last_treatment_period), + "min_pre_periods": self.min_pre_periods, + "min_post_periods": self.min_post_periods, + "outcome_dtype": self.outcome_dtype, + "outcome_is_binary": self.outcome_is_binary, + "outcome_has_zeros": self.outcome_has_zeros, + "outcome_has_negatives": self.outcome_has_negatives, + "outcome_missing_fraction": self.outcome_missing_fraction, + "outcome_summary": {k: float(v) for k, v in self.outcome_summary.items()}, + "alerts": [ + { + "code": a.code, + "severity": a.severity, + "message": a.message, + "observed": _jsonable(a.observed), + } + for a 
in self.alerts + ], + } + + +def profile_panel( + df: pd.DataFrame, + *, + unit: str, + time: str, + treatment: str, + outcome: str, +) -> PanelProfile: + """Describe the structure of a DiD panel. + + Reports structural facts — balance, treatment-type classification, + outcome characteristics, factual alerts. Descriptive, not opinionated: + the profile says what is, never what to do about it. Estimator + selection is up to the caller. + + Parameters + ---------- + df : pandas.DataFrame + Long-format panel data containing the four named columns. + unit : str + Column identifying the cross-sectional unit. + time : str + Column identifying the time period. + treatment : str + Column holding the treatment indicator or dose. See Notes for the + classification rules. + outcome : str + Column holding the outcome variable. + + Returns + ------- + PanelProfile + Frozen dataclass. Call ``.to_dict()`` for a JSON-serializable view. + + Raises + ------ + ValueError + If any of the four column names is not present in ``df``. + + Examples + -------- + >>> import pandas as pd + >>> from diff_diff import profile_panel + >>> df = pd.DataFrame({ + ... "u": [1, 1, 2, 2], + ... "t": [0, 1, 0, 1], + ... "tr": [0, 0, 1, 1], + ... "y": [0.1, 0.2, 0.1, 0.9], + ... }) + >>> profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + >>> profile.is_balanced + True + >>> profile.treatment_type + 'binary_absorbing' + + Notes + ----- + Classification rules for ``treatment_type``: + + - ``"binary_absorbing"``: numeric treatment taking values in :math:`\\{0, 1\\}` + where each unit's treatment sequence (ordered by ``time``) is weakly + monotone non-decreasing. + - ``"binary_non_absorbing"``: values in :math:`\\{0, 1\\}` but at least + one unit switches from 1 back to 0. + - ``"continuous"``: numeric treatment with more than two distinct + values (matches the ``ContinuousDiD`` convention). + - ``"categorical"``: non-numeric dtype (object / category) or a + boolean-dtype column. 
+ + The profile does not recommend an estimator. Consult + ``diff_diff.get_llm_guide("autonomous")`` for the estimator-support + matrix and per-design-feature reasoning. + """ + _validate_columns(df, unit=unit, time=time, treatment=treatment, outcome=outcome) + + n_units = int(df[unit].nunique()) + n_periods = int(df[time].nunique()) + n_obs = int(len(df)) + denom = n_units * n_periods + observation_coverage = float(n_obs / denom) if denom > 0 else 0.0 + is_balanced = n_obs == denom + + ( + treatment_type, + is_staggered, + cohort_sizes, + has_never_treated, + has_always_treated, + first_tp, + last_tp, + ) = _classify_treatment(df, unit=unit, time=time, treatment=treatment) + + min_pre, min_post = _compute_pre_post( + df, + unit=unit, + time=time, + treatment=treatment, + treatment_type=treatment_type, + ) + + outcome_col = cast(pd.Series, df[outcome]) + outcome_dtype = str(outcome_col.dtype) + valid = cast(pd.Series, outcome_col.dropna()) + outcome_missing_fraction = ( + float(1.0 - len(valid) / len(outcome_col)) if len(outcome_col) > 0 else 0.0 + ) + outcome_is_binary, outcome_has_zeros, outcome_has_negatives = _classify_outcome(valid) + outcome_summary = _summarize_outcome(valid) + + dtype_kind = getattr(outcome_col.dtype, "kind", "O") + alerts = _compute_alerts( + n_periods=n_periods, + observation_coverage=observation_coverage, + cohort_sizes=cohort_sizes, + has_never_treated=has_never_treated, + has_always_treated=has_always_treated, + min_pre_periods=min_pre, + min_post_periods=min_post, + outcome_is_binary=outcome_is_binary, + outcome_dtype_kind=dtype_kind, + ) + + return PanelProfile( + n_units=n_units, + n_periods=n_periods, + n_obs=n_obs, + is_balanced=is_balanced, + observation_coverage=observation_coverage, + treatment_type=treatment_type, + is_staggered=is_staggered, + n_cohorts=len(cohort_sizes), + cohort_sizes=cohort_sizes, + has_never_treated=has_never_treated, + has_always_treated=has_always_treated, + first_treatment_period=first_tp, + 
last_treatment_period=last_tp, + min_pre_periods=min_pre, + min_post_periods=min_post, + outcome_dtype=outcome_dtype, + outcome_is_binary=outcome_is_binary, + outcome_has_zeros=outcome_has_zeros, + outcome_has_negatives=outcome_has_negatives, + outcome_missing_fraction=outcome_missing_fraction, + outcome_summary=outcome_summary, + alerts=tuple(alerts), + ) + + +def _validate_columns(df: pd.DataFrame, **cols: str) -> None: + missing = [(role, name) for role, name in cols.items() if name not in df.columns] + if missing: + pairs = ", ".join(f"{role}={name!r}" for role, name in missing) + raise ValueError( + f"profile_panel: column(s) not found in DataFrame: {pairs}. " + f"Provided columns: {list(df.columns)}" + ) + + +def _classify_treatment( + df: pd.DataFrame, + *, + unit: str, + time: str, + treatment: str, +) -> Tuple[ + str, + bool, + Dict[Any, int], + bool, + bool, + Optional[Any], + Optional[Any], +]: + """Return (type, is_staggered, cohort_sizes, has_never, has_always, first_tp, last_tp).""" + col = df[treatment] + is_numeric = pd.api.types.is_numeric_dtype(col) + is_bool = pd.api.types.is_bool_dtype(col) + + if (not is_numeric) or is_bool: + return ("categorical", False, {}, False, False, None, None) + + distinct = col.dropna().unique() + n_distinct = len(distinct) + values_set = set(distinct.tolist()) + is_binary_valued = n_distinct == 2 and values_set <= {0, 1, 0.0, 1.0} + + if not is_binary_valued: + return ("continuous", False, {}, False, False, None, None) + + sorted_df = df.sort_values([unit, time]) + + is_absorbing = True + for _, group in sorted_df.groupby(unit, sort=False): + vals = group[treatment].to_numpy() + if len(vals) >= 2 and bool(np.any(np.diff(vals) < 0)): + is_absorbing = False + break + + unit_treatment_max = df.groupby(unit)[treatment].max().to_numpy() + unit_treatment_min = df.groupby(unit)[treatment].min().to_numpy() + has_never_treated = bool(np.any(unit_treatment_max == 0)) + has_always_treated = bool(np.any(unit_treatment_min == 1)) 
+ + if not is_absorbing: + return ( + "binary_non_absorbing", + False, + {}, + has_never_treated, + has_always_treated, + None, + None, + ) + + first_treat = sorted_df[sorted_df[treatment] == 1].groupby(unit, sort=False)[time].min() + cohort_counts = first_treat.value_counts().sort_index() + cohort_sizes: Dict[Any, int] = {k: int(v) for k, v in cohort_counts.items()} + first_tp = min(cohort_sizes) if cohort_sizes else None + last_tp = max(cohort_sizes) if cohort_sizes else None + is_staggered = len(cohort_sizes) >= 2 + + return ( + "binary_absorbing", + is_staggered, + cohort_sizes, + has_never_treated, + has_always_treated, + first_tp, + last_tp, + ) + + +def _compute_pre_post( + df: pd.DataFrame, + *, + unit: str, + time: str, + treatment: str, + treatment_type: str, +) -> Tuple[Optional[int], Optional[int]]: + if treatment_type != "binary_absorbing": + return None, None + + all_periods = sorted(df[time].unique().tolist()) + sorted_df = df.sort_values([unit, time]) + first_treat_per_unit = ( + sorted_df[sorted_df[treatment] == 1].groupby(unit, sort=False)[time].min() + ) + cohort_values = first_treat_per_unit.unique().tolist() + if not cohort_values: + return None, None + + min_pre = min(sum(1 for p in all_periods if p < c) for c in cohort_values) + min_post = min(sum(1 for p in all_periods if p >= c) for c in cohort_values) + return int(min_pre), int(min_post) + + +def _classify_outcome(valid: pd.Series) -> Tuple[bool, bool, bool]: + n_distinct = valid.nunique(dropna=False) + if n_distinct == 0: + return False, False, False + + is_numeric = pd.api.types.is_numeric_dtype(valid) + if is_numeric: + distinct_set = set(valid.unique().tolist()) + is_binary = n_distinct == 2 and (distinct_set <= {0, 1} or distinct_set <= {0.0, 1.0}) + has_zeros = bool((valid == 0).any()) + has_negatives = bool((valid < 0).any()) + return is_binary, has_zeros, has_negatives + + return False, False, False + + +def _summarize_outcome(valid: pd.Series) -> Dict[str, float]: + if len(valid) 
== 0 or not pd.api.types.is_numeric_dtype(valid): + return {} + return { + "min": float(valid.min()), + "max": float(valid.max()), + "mean": float(valid.mean()), + "std": float(valid.std(ddof=1)) if len(valid) > 1 else 0.0, + } + + +def _compute_alerts( + *, + n_periods: int, + observation_coverage: float, + cohort_sizes: Mapping[Any, int], + has_never_treated: bool, + has_always_treated: bool, + min_pre_periods: Optional[int], + min_post_periods: Optional[int], + outcome_is_binary: bool, + outcome_dtype_kind: str, +) -> List[Alert]: + alerts: List[Alert] = [] + + if cohort_sizes: + smallest = min(cohort_sizes.values()) + if smallest < _MIN_COHORT_SIZE_THRESHOLD: + alerts.append( + Alert( + code="min_cohort_size_below_10", + severity="warn", + message=( + f"Smallest cohort has {smallest} units; " + "cohort-level inference will be noisy." + ), + observed=int(smallest), + ) + ) + if len(cohort_sizes) == 1: + alerts.append( + Alert( + code="only_one_cohort", + severity="info", + message=("All treated units adopt at the same time " "(non-staggered design)."), + observed=1, + ) + ) + if not has_never_treated: + alerts.append( + Alert( + code="all_units_treated_simultaneously", + severity="info", + message=( + "Every unit is treated and every treated unit " + "adopts in the same period; no untreated " + "comparison group exists in the panel." + ), + observed=None, + ) + ) + + if min_pre_periods is not None and min_pre_periods < _SHORT_PRE_PANEL_THRESHOLD: + alerts.append( + Alert( + code="short_pre_panel", + severity="warn", + message=( + f"Minimum pre-treatment periods across cohorts is " + f"{min_pre_periods}; parallel-trends and event-study " + "diagnostics have limited power." 
+ ), + observed=int(min_pre_periods), + ) + ) + if min_post_periods is not None and min_post_periods < _SHORT_POST_PANEL_THRESHOLD: + alerts.append( + Alert( + code="short_post_panel", + severity="info", + message=( + f"Minimum post-treatment periods across cohorts is " + f"{min_post_periods}; dynamic-effect estimation is " + "limited." + ), + observed=int(min_post_periods), + ) + ) + + if cohort_sizes and not has_never_treated: + alerts.append( + Alert( + code="no_never_treated", + severity="info", + message=( + "No never-treated comparison units; every unit in the " + "panel is eventually treated." + ), + observed=False, + ) + ) + + if has_always_treated: + alerts.append( + Alert( + code="has_always_treated_units", + severity="info", + message=( + "Some units are treated in every observed period; they " + "provide no pre-treatment information." + ), + observed=True, + ) + ) + + if observation_coverage < _OBSERVATION_COVERAGE_THRESHOLD: + alerts.append( + Alert( + code="panel_highly_unbalanced", + severity="warn", + message=( + f"Observation coverage is {observation_coverage:.1%}; " + "panel is highly unbalanced." 
+ ), + observed=float(observation_coverage), + ) + ) + + if n_periods == 2: + alerts.append( + Alert( + code="only_two_periods", + severity="info", + message="Only two time periods are observed (2x2 design).", + observed=2, + ) + ) + + if outcome_is_binary and outcome_dtype_kind == "f": + alerts.append( + Alert( + code="outcome_looks_binary_but_dtype_float", + severity="info", + message=("Outcome takes values in {0, 1} but is stored with a " "float dtype."), + observed=None, + ) + ) + + return alerts + + +def _jsonable(x: Any) -> Any: + """Coerce a value to a JSON-serializable primitive.""" + if x is None: + return None + if isinstance(x, bool): + return bool(x) + if isinstance(x, (int, float, str)): + return x + if isinstance(x, np.bool_): + return bool(x) + if isinstance(x, np.integer): + return int(x) + if isinstance(x, np.floating): + return float(x) + if isinstance(x, (pd.Timestamp, np.datetime64)): + return str(x) + if isinstance(x, dict): + return {_jsonable_key(k): _jsonable(v) for k, v in x.items()} + if isinstance(x, (list, tuple)): + return [_jsonable(v) for v in x] + return str(x) + + +def _jsonable_key(k: Any) -> Any: + """Coerce a mapping key to a JSON-compatible primitive.""" + if isinstance(k, bool): + return bool(k) + if isinstance(k, (int, float, str)): + return k + if isinstance(k, np.bool_): + return bool(k) + if isinstance(k, np.integer): + return int(k) + if isinstance(k, np.floating): + return float(k) + return str(k) diff --git a/tests/test_guides.py b/tests/test_guides.py index bc0abe83..2d08871d 100644 --- a/tests/test_guides.py +++ b/tests/test_guides.py @@ -1,4 +1,5 @@ """Tests for the bundled LLM guide accessor.""" + import importlib.resources import pytest @@ -7,7 +8,7 @@ from diff_diff._guides_api import _VARIANT_TO_FILE -@pytest.mark.parametrize("variant", ["concise", "full", "practitioner"]) +@pytest.mark.parametrize("variant", ["concise", "full", "practitioner", "autonomous"]) def test_all_variants_load(variant): text = 
get_llm_guide(variant)
     assert isinstance(text, str)
@@ -19,9 +20,10 @@ def test_default_is_concise():
 
 
 def test_full_is_largest():
-    lengths = {v: len(get_llm_guide(v)) for v in ("concise", "full", "practitioner")}
+    lengths = {v: len(get_llm_guide(v)) for v in ("concise", "full", "practitioner", "autonomous")}
     assert lengths["full"] > lengths["concise"]
     assert lengths["full"] > lengths["practitioner"]
+    assert lengths["full"] > lengths["autonomous"]
 
 
 def test_content_stability_practitioner_workflow():
@@ -32,6 +34,22 @@ def test_content_stability_self_reference_after_rewrite():
     assert "get_llm_guide" in get_llm_guide("concise")
 
 
+def test_content_stability_autonomous_fingerprints():
+    text = get_llm_guide("autonomous")
+    assert "profile_panel" in text
+    assert "estimator-support matrix" in text.lower()
+
+
+def test_autonomous_contains_intact_estimator_matrix():
+    # Section 3 is a markdown table with 9 design-feature columns plus the
+    # estimator-name column -> rows have at least 11 pipe characters
+    # (10 columns framed by leading and trailing pipes). This guards
+    # against the matrix being accidentally deleted or truncated.
+    text = get_llm_guide("autonomous")
+    assert any(
+        line.count("|") >= 11 for line in text.splitlines()
+    ), "Section 3 estimator-support matrix appears to be missing or truncated."
+ + def test_wheel_content_matches_package_resource(): for variant, filename in _VARIANT_TO_FILE.items(): on_disk = ( diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py new file mode 100644 index 00000000..3599f30a --- /dev/null +++ b/tests/test_profile_panel.py @@ -0,0 +1,254 @@ +"""Tests for ``diff_diff.profile_panel`` and the ``PanelProfile`` dataclass.""" + +from __future__ import annotations + +import dataclasses +import json +from typing import Any, Dict, Iterable, Optional + +import numpy as np +import pandas as pd +import pytest + +from diff_diff import PanelProfile, profile_panel +from diff_diff.profile import Alert + + +def _make_panel( + *, + n_units: int, + periods: Iterable[int], + first_treat: Optional[Dict[int, int]] = None, + outcome_fn: Any = None, +) -> pd.DataFrame: + """Build a balanced long panel with optional per-unit first-treatment timing. + + ``first_treat`` maps unit -> first treatment period (inclusive). Units not + in the mapping are never-treated. 
+ """ + first_treat = first_treat or {} + rows = [] + rng = np.random.default_rng(0) + for u in range(1, n_units + 1): + for t in periods: + tr = 1 if (u in first_treat and t >= first_treat[u]) else 0 + if outcome_fn is not None: + y = outcome_fn(u, t, tr, rng) + else: + y = float(u) + 0.1 * t + 0.5 * tr + rows.append({"u": u, "t": t, "tr": tr, "y": y}) + return pd.DataFrame(rows) + + +def _alert_codes(profile: PanelProfile) -> set[str]: + return {a.code for a in profile.alerts} + + +def test_balanced_binary_2x2(): + first_treat = {u: 1 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=[0, 1], first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.is_staggered is False + assert profile.has_never_treated is True + assert profile.n_units == 20 + assert profile.n_periods == 2 + assert profile.is_balanced is True + + +def test_staggered_multi_cohort(): + first_treat: Dict[int, int] = {} + first_treat.update({u: 3 for u in range(1, 11)}) + first_treat.update({u: 5 for u in range(11, 21)}) + first_treat.update({u: 7 for u in range(21, 31)}) + df = _make_panel(n_units=40, periods=range(1, 9), first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.is_staggered is True + assert profile.n_cohorts == 3 + assert profile.cohort_sizes == {3: 10, 5: 10, 7: 10} + assert profile.first_treatment_period == 3 + assert profile.last_treatment_period == 7 + assert profile.has_never_treated is True + + +def test_binary_non_absorbing_switcher(): + rows = [] + rng = np.random.default_rng(0) + for u in range(1, 21): + treat_seq = [0, 1, 1, 0, 0] if u > 10 else [0, 0, 0, 0, 0] + for t, tr in enumerate(treat_seq): + rows.append({"u": u, "t": t, "tr": tr, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", 
time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_non_absorbing" + assert profile.cohort_sizes == {} + assert profile.is_staggered is False + assert profile.has_never_treated is True + + +def test_continuous_treatment(): + rng = np.random.default_rng(0) + rows = [] + for u in range(1, 41): + dose = float(rng.uniform(0, 5)) + for t in range(4): + rows.append({"u": u, "t": t, "tr": dose, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "continuous" + assert profile.cohort_sizes == {} + assert profile.is_staggered is False + + +def test_categorical_treatment_object_dtype(): + rows = [] + for u in range(1, 11): + arm = "A" if u <= 5 else "B" + for t in range(4): + rows.append({"u": u, "t": t, "tr": arm, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "categorical" + assert profile.has_never_treated is False + assert profile.has_always_treated is False + + +def test_no_never_treated_alert(): + first_treat = {u: 2 for u in range(1, 21)} + df = _make_panel(n_units=20, periods=range(0, 5), first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.has_never_treated is False + codes = _alert_codes(profile) + assert "no_never_treated" in codes + + +def test_has_always_treated_alert(): + rows = [] + for u in range(1, 21): + for t in range(5): + tr = 1 if u <= 5 else (1 if t >= 3 else 0) + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.has_always_treated is True + codes = _alert_codes(profile) + assert "has_always_treated_units" in codes + + +def test_unbalanced_panel_below_threshold(): + first_treat = {u: 3 for u in 
range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 5), first_treat=first_treat) + df = df.iloc[::3].reset_index(drop=True) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.is_balanced is False + assert profile.observation_coverage < 0.70 + codes = _alert_codes(profile) + assert "panel_highly_unbalanced" in codes + + +def test_binary_outcome_float_dtype_alert(): + first_treat = {u: 2 for u in range(11, 31)} + df = _make_panel( + n_units=30, + periods=range(0, 4), + first_treat=first_treat, + outcome_fn=lambda u, t, tr, rng: float(tr), + ) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.outcome_is_binary is True + assert profile.outcome_dtype == "float64" + codes = _alert_codes(profile) + assert "outcome_looks_binary_but_dtype_float" in codes + + +def test_outcome_missing_fraction_computed(): + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + df.loc[0:9, "y"] = np.nan + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert 0.0 < profile.outcome_missing_fraction < 1.0 + assert profile.outcome_missing_fraction == pytest.approx(10 / len(df)) + + +def test_short_pre_panel_alert(): + first_treat = {u: 1 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=[0, 1, 2, 3], first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.min_pre_periods == 1 + codes = _alert_codes(profile) + assert "short_pre_panel" in codes + + +def test_missing_column_raises_value_error(): + df = pd.DataFrame({"u": [1, 2], "t": [0, 1], "y": [0.0, 1.0]}) + with pytest.raises(ValueError, match="treatment"): + profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + + +def test_panel_profile_is_frozen(): + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), 
first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + with pytest.raises(dataclasses.FrozenInstanceError): + profile.n_units = 999 # type: ignore[misc] + + +def test_to_dict_is_json_serializable(): + first_treat = {u: 3 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 6), first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + payload = profile.to_dict() + as_json = json.dumps(payload) + roundtripped = json.loads(as_json) + assert roundtripped["treatment_type"] == "binary_absorbing" + assert set(roundtripped.keys()) >= { + "n_units", + "n_periods", + "n_obs", + "is_balanced", + "observation_coverage", + "treatment_type", + "is_staggered", + "n_cohorts", + "cohort_sizes", + "has_never_treated", + "has_always_treated", + "first_treatment_period", + "last_treatment_period", + "min_pre_periods", + "min_post_periods", + "outcome_dtype", + "outcome_is_binary", + "outcome_has_zeros", + "outcome_has_negatives", + "outcome_missing_fraction", + "outcome_summary", + "alerts", + } + + +def test_alerts_are_factual_no_recommender_language(): + first_treat = {u: 1 for u in range(11, 21)} + df = _make_panel(n_units=12, periods=[0, 1, 2, 3], first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + forbidden_substrings = ( + "recommend", + "should use", + "use estimator", + "we suggest", + "you should", + ) + for alert in profile.alerts: + lowered = alert.message.lower() + for phrase in forbidden_substrings: + assert phrase not in lowered, ( + f"alert {alert.code!r} contains recommender-adjacent phrase " + f"{phrase!r} in message: {alert.message!r}" + ) + + +def test_alert_dataclass_is_frozen(): + a = Alert(code="x", severity="info", message="m", observed=None) + with pytest.raises(dataclasses.FrozenInstanceError): + a.code = "y" # type: ignore[misc] From 0bc776bacd06503cd572e6f61c1415532876c758 Mon Sep 17 
00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 07:28:22 -0400 Subject: [PATCH 02/18] Fix profile_panel() binary detection for degenerate panels MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The treatment classifier required exactly two observed distinct values to treat a column as binary. Panels that are entirely never-treated (values = {0}) or entirely always-treated (values = {1}) were falling through to "continuous", contradicting the documented taxonomy which defines "binary_absorbing" as "values in {0, 1}". Rule is now values_set <= {0, 1, 0.0, 1.0} with at least one observed value; entirely-NaN treatment columns fall through to "categorical" rather than "continuous". Docstring and the autonomous guide section §2 reference are updated to match. New regression tests: - all-zero treatment panel -> binary_absorbing, has_never_treated - all-one treatment panel -> binary_absorbing, has_always_treated - binary with NaNs, only zeros observed -> binary_absorbing - all-NaN treatment -> categorical - top-level import surface (profile_panel / PanelProfile / Alert) Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 29 +++++++------ diff_diff/profile.py | 28 ++++++++---- tests/test_profile_panel.py | 64 ++++++++++++++++++++++++++++ 3 files changed, 100 insertions(+), 21 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index fcefd79e..d237d770 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -74,19 +74,24 @@ view. Every field below appears as a top-level key in that dict. - **`treatment_type: str`** - classification of the treatment column. Exactly one of: - - `"binary_absorbing"`: numeric with values in {0, 1}; each unit's - treatment sequence (ordered by `time`) is weakly monotone - non-decreasing. The canonical DiD setting. 
- - `"binary_non_absorbing"`: values in {0, 1} but at least one unit - switches from 1 back to 0. Only `ChaisemartinDHaultfoeuille` handles - this natively; the other absorbing-only estimators would misapply. - - `"continuous"`: numeric with more than two distinct values (e.g., a - dose, a discrete-integer partial-adoption score). Use + - `"binary_absorbing"`: observed non-NaN values are a subset of + {0, 1} (one or two distinct values, covering all-zero and all-one + panels as valid degenerate cases) and each unit's treatment + sequence (ordered by `time`) is weakly monotone non-decreasing. + The canonical DiD setting. + - `"binary_non_absorbing"`: values a subset of {0, 1} with at least + two distinct values observed, where at least one unit switches + from 1 back to 0. Only `ChaisemartinDHaultfoeuille` handles this + natively; the other absorbing-only estimators would misapply. + - `"continuous"`: numeric with more than two distinct values, or a + two-valued numeric column whose values are not in {0, 1} (e.g., + a dose, a discrete-integer partial-adoption score). Use `ContinuousDiD` or `HeterogeneousAdoptionDiD`. - - `"categorical"`: non-numeric dtype (object / category) or bool dtype. - Often indicates a treatment arm. Encode each arm as a binary - indicator and fit separately, or use a multi-treatment workflow - outside the current estimator suite. + - `"categorical"`: non-numeric dtype (object / category), a bool + dtype column, or a column that is entirely NaN. Often indicates + a treatment arm. Encode each arm as a binary indicator and fit + separately, or use a multi-treatment workflow outside the + current estimator suite. - **`is_staggered: bool`** - true iff treatment is `binary_absorbing` and at least two distinct first-treatment periods are observed. Drives the choice between classic DiD/TWFE and staggered-robust estimators. 
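The taxonomy above maps onto a small standalone classifier. The following is an illustrative sketch of those rules only (the function name `classify_treatment` is hypothetical; the shipped `_classify_treatment` in `diff_diff/profile.py` additionally returns cohort structure and the never-/always-treated flags):

```python
import numpy as np
import pandas as pd

def classify_treatment(df: pd.DataFrame, unit: str, time: str, treatment: str) -> str:
    """Illustrative sketch of the treatment_type taxonomy described above."""
    col = df[treatment]
    # Non-numeric or bool dtype -> "categorical" (treatment arms, flags).
    if pd.api.types.is_bool_dtype(col) or not pd.api.types.is_numeric_dtype(col):
        return "categorical"
    observed = col.dropna().unique()
    if len(observed) == 0:
        return "categorical"  # entirely-NaN column carries no information
    if not set(observed.tolist()) <= {0, 1}:
        return "continuous"  # doses, partial-adoption scores, ...
    # Binary values: absorbing iff each unit's observed (non-NaN) sequence,
    # ordered by time, is weakly monotone non-decreasing.
    for _, g in df.sort_values([unit, time]).groupby(unit):
        vals = g[treatment].dropna().to_numpy()
        if len(vals) >= 2 and np.any(np.diff(vals) < 0):
            return "binary_non_absorbing"
    return "binary_absorbing"
```

Note that the `{0, 1}` membership test also covers `0.0`/`1.0`, since Python sets compare equal ints and floats as the same element.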
diff --git a/diff_diff/profile.py b/diff_diff/profile.py index 869fb396..193efb0e 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -172,15 +172,23 @@ def profile_panel( ----- Classification rules for ``treatment_type``: - - ``"binary_absorbing"``: numeric treatment taking values in :math:`\\{0, 1\\}` - where each unit's treatment sequence (ordered by ``time``) is weakly - monotone non-decreasing. - - ``"binary_non_absorbing"``: values in :math:`\\{0, 1\\}` but at least - one unit switches from 1 back to 0. + - ``"binary_absorbing"``: numeric treatment whose observed non-NaN + values are a subset of :math:`\\{0, 1\\}` (one or two distinct + values) AND each unit's treatment sequence (ordered by ``time``) + is weakly monotone non-decreasing. All-zero and all-one panels + are valid degenerate cases. + - ``"binary_non_absorbing"``: values a subset of :math:`\\{0, 1\\}` + with at least two distinct values observed, where at least one + unit switches from 1 back to 0. - ``"continuous"``: numeric treatment with more than two distinct - values (matches the ``ContinuousDiD`` convention). - - ``"categorical"``: non-numeric dtype (object / category) or a - boolean-dtype column. + values, or a 2-valued numeric whose values are not in + :math:`\\{0, 1\\}` (matches the ``ContinuousDiD`` convention). + - ``"categorical"``: non-numeric dtype (object / category), a + boolean-dtype column, or a column that is entirely NaN. + + Boolean-dtype columns are intentionally classified as + ``"categorical"``; cast to ``int`` if you want binary-treatment + profiling. The profile does not recommend an estimator. 
Consult ``diff_diff.get_llm_guide("autonomous")`` for the estimator-support @@ -297,7 +305,9 @@ def _classify_treatment( distinct = col.dropna().unique() n_distinct = len(distinct) values_set = set(distinct.tolist()) - is_binary_valued = n_distinct == 2 and values_set <= {0, 1, 0.0, 1.0} + if n_distinct == 0: + return ("categorical", False, {}, False, False, None, None) + is_binary_valued = values_set <= {0, 1, 0.0, 1.0} if not is_binary_valued: return ("continuous", False, {}, False, False, None, None) diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 3599f30a..ab1f1336 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -252,3 +252,67 @@ def test_alert_dataclass_is_frozen(): a = Alert(code="x", severity="info", message="m", observed=None) with pytest.raises(dataclasses.FrozenInstanceError): a.code = "y" # type: ignore[misc] + + +def test_all_zero_treatment_is_binary_absorbing(): + """Degenerate binary: no unit is ever treated. Must classify as binary, + not continuous, so the documented taxonomy matches the implementation.""" + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=None) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.has_never_treated is True + assert profile.has_always_treated is False + assert profile.cohort_sizes == {} + assert profile.n_cohorts == 0 + + +def test_all_one_treatment_is_binary_absorbing_always_treated(): + """Degenerate binary: every unit treated in every period. 
Must classify as + binary_absorbing with has_always_treated=True.""" + rows = [] + for u in range(1, 21): + for t in range(4): + rows.append({"u": u, "t": t, "tr": 1, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.has_never_treated is False + assert profile.has_always_treated is True + codes = _alert_codes(profile) + assert "has_always_treated_units" in codes + + +def test_binary_with_nans_only_zeros_observed_is_binary(): + """Binary panel with some NaNs and only 0 observed among non-NaN values — + still classify as binary, not continuous.""" + rows = [] + for u in range(1, 11): + for t in range(4): + tr = 0 if (u + t) % 2 == 0 else np.nan + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + + +def test_all_nan_treatment_is_categorical(): + """Treatment column entirely NaN — classify as categorical (no info).""" + rows = [] + for u in range(1, 11): + for t in range(4): + rows.append({"u": u, "t": t, "tr": np.nan, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "categorical" + + +def test_top_level_import_surface(): + """profile_panel, PanelProfile, and Alert must be importable from the + top-level namespace so `help(diff_diff)` points at real symbols.""" + import diff_diff + + assert callable(diff_diff.profile_panel) + assert diff_diff.PanelProfile.__name__ == "PanelProfile" + assert diff_diff.Alert.__name__ == "Alert" + for name in ("profile_panel", "PanelProfile", "Alert"): + assert name in diff_diff.__all__, f"{name} missing from __all__" From 0c7ba05b833d01431c9895b635636330bb44e48a Mon Sep 17 00:00:00 2001 From: igerber 
Date: Fri, 24 Apr 2026 08:05:29 -0400 Subject: [PATCH 03/18] Address PR #356 CI review (3 P1 code + 2 P1 guide) profile_panel(): - Balance / coverage: compute from unique (unit, time) support rather than raw row count. Duplicated rows no longer inflate coverage above 1 or mask missing cells. Adds duplicate_unit_time_rows warn alert. - Non-absorbing detection: drop NaN before the monotonicity diff so a path like [0, 1, NaN, 0] is correctly classified binary_non_absorbing rather than binary_absorbing. - has_never_treated / has_always_treated: compute generically across numeric treatment types (not only binary). Continuous panels with zero-dose control units now surface has_never_treated=True; docstring and guide field reference updated. llms-autonomous.txt: - SyntheticDiD: staggered changed from "partial" to "strict block" (fit raises on within-unit variation); footnote rewritten. - TROP: staggered support marked (via absorbing D matrix); never-treated no longer required (donor pool is every unit untreated at period t); fit() covariate surface removed (there is none). - HeterogeneousAdoptionDiD: covariate adjustment changed to "not supported" (Appendix B.1 future work per REGISTRY.md); clustered SE annotated "warn" on continuous paths (cluster= kwarg is ignored with UserWarning, CR1 only on mass-point path). - Replace "notyettreated" with "not_yet_treated" throughout (matches staggered.py / sun_abraham.py public API). - Replace "Hausman.hausman_pretest" with "EfficientDiD.hausman_pretest" (the classmethod lives on EfficientDiD, not a Hausman namespace). 
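The unique-support rule in the first bullet can be sketched in isolation (the helper name `panel_coverage` is assumed for illustration; the real `profile_panel` folds this computation into the `PanelProfile` fields):

```python
import pandas as pd

def panel_coverage(df: pd.DataFrame, unit: str, time: str) -> dict:
    """Balance and coverage from the unique (unit, time) support, so
    duplicated rows can never push coverage above 1 or hide missing cells."""
    n_units = df[unit].nunique()
    n_periods = df[time].nunique()
    n_unique = len(df[[unit, time]].drop_duplicates())
    denom = n_units * n_periods
    return {
        "observation_coverage": n_unique / denom if denom else 0.0,  # always in [0, 1]
        "is_balanced": n_unique == denom,
        "n_duplicate_rows": len(df) - n_unique,  # > 0 would fire the warn alert
    }
```

Duplicating a row in an otherwise balanced 2x2 panel leaves coverage at 1.0 and balance intact while surfacing one duplicate row.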
Tests: - test_duplicate_unit_time_rows_do_not_inflate_coverage - test_reversal_through_nan_is_binary_non_absorbing - test_continuous_zero_dose_controls_flag_has_never_treated - test_guide_api_strings_resolve_against_public_api (verifies every estimator name, hausman_pretest path, and control_group spelling in the bundled guide resolves against diff_diff's public API; guards against future drift) Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 79 +++++++++++++++-------- diff_diff/profile.py | 72 +++++++++++++++++---- tests/test_profile_panel.py | 96 ++++++++++++++++++++++++++++ 3 files changed, 208 insertions(+), 39 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index d237d770..9d72c614 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -64,10 +64,13 @@ view. Every field below appears as a top-level key in that dict. - **`n_units: int`** - count of distinct values in the `unit` column. - **`n_periods: int`** - count of distinct values in the `time` column. - **`n_obs: int`** - total rows in the panel. -- **`is_balanced: bool`** - true iff `n_obs == n_units * n_periods`, i.e. - every unit is observed in every period. -- **`observation_coverage: float`** - `n_obs / (n_units * n_periods)` in - `[0, 1]`. A value below `0.70` also triggers the +- **`is_balanced: bool`** - true iff every distinct `(unit, time)` cell + appears at least once in the panel (i.e. the unique `(unit, time)` + support equals `n_units * n_periods`). Duplicate rows do not affect + balance but are surfaced via the `duplicate_unit_time_rows` alert. +- **`observation_coverage: float`** - ratio of unique `(unit, time)` + keys to `n_units * n_periods`, always in `[0, 1]` (duplicates do not + inflate). A value below `0.70` also triggers the `panel_highly_unbalanced` alert. ### Treatment variation @@ -102,13 +105,15 @@ view. 
Every field below appears as a top-level key in that dict. to cohort size (number of units adopting at that time). Empty for non-absorbing / continuous / categorical treatments. - **`has_never_treated: bool`** - at least one unit has `treatment == 0` - in every observed period. Required by `SyntheticDiD` and - `EfficientDiD` with `assumption="PT-All"`; preferred-but-optional by - `CallawaySantAnna` and `ChaisemartinDHaultfoeuille`. + in every observed non-NaN row (applies to both binary and continuous + treatment columns; for continuous this flags zero-dose control units). + Required by `SyntheticDiD` and `EfficientDiD` with `assumption="PT-All"`; + preferred-but-optional by `CallawaySantAnna` and + `ChaisemartinDHaultfoeuille`. Always `False` for `"categorical"`. - **`has_always_treated: bool`** - at least one unit has - `treatment == 1` in every observed period. Always-treated units + strictly-positive treatment in every observed non-NaN row. Such units provide no pre-treatment identification and are dropped by most - estimators. + estimators. Always `False` for `"categorical"`. ### Timing @@ -153,6 +158,7 @@ it is descriptive context. 
| Alert code | Severity | Fires when | |---|---|---| +| `duplicate_unit_time_rows` | warn | panel contains more than one row per (unit, time) | | `min_cohort_size_below_10` | warn | smallest cohort has fewer than 10 units | | `only_one_cohort` | info | all treated units adopt simultaneously | | `short_pre_panel` | warn | `min_pre_periods < 3` | @@ -185,12 +191,12 @@ supported / out of scope; `warn` supported but with documented caveats; | `StackedDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | | `WooldridgeDiD` (ETWFE) | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | | `EfficientDiD` | ✓ | ✓ | ✗ | ✗ | partial | ✓ | ✗ | ✗ | ✓ | -| `SyntheticDiD` | ✓ | partial | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | partial | -| `TROP` | ✓ | partial | ✗ | ✗ | ✓ | partial | ✓ | ✗ | partial | +| `SyntheticDiD` | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | partial | +| `TROP` | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | partial | | `TripleDifference` | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | | `StaggeredTripleDifference` | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | | `ContinuousDiD` | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | -| `HeterogeneousAdoptionDiD` | ✗ | partial | partial | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | +| `HeterogeneousAdoptionDiD` | ✗ | partial | partial | ✗ | ✗ | ✗ | ✗ | ✓ | warn | **Footnotes.** - `TwoWayFixedEffects` + staggered: fits but mixes positive and negative @@ -201,17 +207,29 @@ supported / out of scope; `warn` supported but with documented caveats; - `CallawaySantAnna` + never-treated: the "never-treated" control group is one option; "not-yet-treated" is the other. Pick via the `control_group` argument. If `has_never_treated == False`, use - `control_group="notyettreated"`. + `control_group="not_yet_treated"`. - `EfficientDiD` + never-treated: `assumption="PT-All"` requires it; - `assumption="PT-Post"` does not. The `Hausman.hausman_pretest` + `assumption="PT-Post"` does not. The `EfficientDiD.hausman_pretest` classmethod picks between them using a formal test. 
-- `SyntheticDiD` + staggered: native support is limited; the - single-event estimator is the canonical case. Multi-event extensions - use a cohort-level fit loop - check the estimator's docstring. -- `TROP` covariate adjustment: supported through the global-method - path, not the local-method path. +- `SyntheticDiD` + staggered: not supported. `fit()` raises + `ValueError` on within-unit treatment variation; SDiD requires block + treatment (all treated units adopt at the same time). For staggered + designs use a cohort-level fit loop externally or pick a + staggered-robust estimator above. +- `TROP` staggered support: treatment is an absorbing-state indicator, + so staggered adoption is handled via the D matrix. TROP `fit()` has + no covariate surface; its local method uses every unit untreated at + period `t` as the donor pool (not a never-treated-only set). +- `HeterogeneousAdoptionDiD` covariate adjustment: identification with + covariates (paper Appendix B.1, Equation 19) is deferred to future + work; `fit(covariates=...)` is not yet implemented. +- `HeterogeneousAdoptionDiD` clustered SE: `cluster=` is honored on the + mass-point / CR1 path; on the continuous nonparametric paths the + kwarg emits a `UserWarning` and is ignored (Phase 2a scope). Use + `bias_corrected_local_linear` directly for cluster-robust inference + on the nonparametric path. - `HeterogeneousAdoptionDiD` continuous: supports partial-adoption - intensity as a covariate-like continuous variable; not a pure + intensity as a continuous first-stage variable; not a pure dose-response estimator - use `ContinuousDiD` for that. @@ -269,7 +287,7 @@ estimators: - `WooldridgeDiD` (ETWFE) - extended-TWFE with cohort-by-time-by- covariates interactions; heterogeneous covariate-by-cohort effects. - `EfficientDiD` (Arkhangelsky-Imbens) - asymptotically efficient under - either `PT-All` or `PT-Post`; use `Hausman.hausman_pretest` to pick. 
+ either `PT-All` or `PT-Post`; use `EfficientDiD.hausman_pretest` to pick. Diagnostic: `bacon_decompose(df, ...)` shows the weight allocation of a TWFE fit to 2×2 comparison types. Forbidden-comparison weight > 10% is a @@ -280,11 +298,15 @@ strong signal that the TWFE estimate is biased. When `has_never_treated == False`: - `SyntheticDiD` requires a never-treated donor pool - not applicable. -- `TROP` needs never-treated units for the factor-model fit - not - applicable in the standard setup. +- `TROP` does not require a strict never-treated partition: its donor + pool is every unit untreated at the current period `t` (via the + absorbing D matrix). When every unit is eventually treated TROP can + still fit, with the donor pool shrinking over time - check the + pre-treatment coverage of the factor-model fit in the results + diagnostics. - `EfficientDiD` with `assumption="PT-All"` requires it - use `assumption="PT-Post"` instead. -- `CallawaySantAnna` - use `control_group="notyettreated"` to use +- `CallawaySantAnna` - use `control_group="not_yet_treated"` to use not-yet-treated units as the control pool. - `ChaisemartinDHaultfoeuille` - constructs switchers vs. non-switchers directly; no never-treated requirement. @@ -334,8 +356,11 @@ but derivable from `cohort_sizes` + `has_never_treated`): - `SyntheticDiD` - synthetic-control-meets-DiD. Requires never-treated donors and sufficient pre-treatment periods (Arkhangelsky et al. 2021). -- `TROP` - factor-model-based generalized synthetic control. Similar - donor-pool requirements; supports more complex factor structures. + Block treatment only: all treated units must adopt at the same time. +- `TROP` - factor-model-based generalized synthetic control. Uses every + unit untreated at period `t` as the donor pool (via the absorbing-state + D matrix); supports staggered adoption and more complex factor + structures. No covariate-adjustment surface on `fit()`. 
Classical DiD estimators will still produce estimates, but inference is unreliable with very small treated groups; cluster-robust SE relies on @@ -413,7 +438,7 @@ Some estimators expose diagnostics as methods on the result object: concentration summary. - `CallawaySantAnna.diagnose_propensity(df, ...)` - propensity-score overlap check when using DR / IPW controls. -- `Hausman.hausman_pretest(df, ...)` - chooses between `PT-All` and +- `EfficientDiD.hausman_pretest(df, ...)` - chooses between `PT-All` and `PT-Post` for `EfficientDiD`. - `did_had_pretest_workflow(df, ...)` - bundled QUG / Stute / Yatchew- Härdle pre-test battery for `HeterogeneousAdoptionDiD`. diff --git a/diff_diff/profile.py b/diff_diff/profile.py index 193efb0e..f05e6dde 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -53,8 +53,8 @@ class PanelProfile: n_units: int n_periods: int n_obs: int - is_balanced: bool - observation_coverage: float + is_balanced: bool # every (unit, time) cell appears at least once + observation_coverage: float # unique (unit, time) keys / (n_units * n_periods) treatment_type: str is_staggered: bool @@ -190,6 +190,20 @@ def profile_panel( ``"categorical"``; cast to ``int`` if you want binary-treatment profiling. + ``has_never_treated`` and ``has_always_treated`` are computed + generically across numeric treatment types (both binary and + continuous). ``has_never_treated`` fires when some unit has + ``treatment == 0`` in every observed non-NaN row; for continuous + panels this flags zero-dose controls. ``has_always_treated`` fires + when some unit has strictly-positive treatment in every observed + non-NaN row. Both are always ``False`` for ``"categorical"``. + + Duplicate ``(unit, time)`` rows are surfaced via the + ``duplicate_unit_time_rows`` alert; ``is_balanced`` and + ``observation_coverage`` are computed from the unique ``(unit, + time)`` support, so ``observation_coverage`` is always in + ``[0, 1]``. + The profile does not recommend an estimator. 
Consult ``diff_diff.get_llm_guide("autonomous")`` for the estimator-support matrix and per-design-feature reasoning. @@ -199,9 +213,11 @@ def profile_panel( n_units = int(df[unit].nunique()) n_periods = int(df[time].nunique()) n_obs = int(len(df)) + n_unique_keys = int(df[[unit, time]].drop_duplicates().shape[0]) denom = n_units * n_periods - observation_coverage = float(n_obs / denom) if denom > 0 else 0.0 - is_balanced = n_obs == denom + observation_coverage = float(n_unique_keys / denom) if denom > 0 else 0.0 + is_balanced = n_unique_keys == denom + n_duplicate_rows = n_obs - n_unique_keys ( treatment_type, @@ -241,6 +257,7 @@ def profile_panel( min_post_periods=min_post, outcome_is_binary=outcome_is_binary, outcome_dtype_kind=dtype_kind, + n_duplicate_rows=n_duplicate_rows, ) return PanelProfile( @@ -307,25 +324,42 @@ def _classify_treatment( values_set = set(distinct.tolist()) if n_distinct == 0: return ("categorical", False, {}, False, False, None, None) - is_binary_valued = values_set <= {0, 1, 0.0, 1.0} + # Generic never-/always-treated semantics (applies to both binary + # and continuous numeric treatment): "never-treated" means the unit + # has treatment == 0 in every observed non-NaN row; "always-treated" + # means treatment > 0 in every observed non-NaN row. + unit_max = df.groupby(unit)[treatment].max().to_numpy() + unit_min = df.groupby(unit)[treatment].min().to_numpy() + has_never_treated = bool(np.any(unit_max == 0)) + has_always_treated = bool(np.any(unit_min > 0)) + + is_binary_valued = values_set <= {0, 1, 0.0, 1.0} if not is_binary_valued: - return ("continuous", False, {}, False, False, None, None) + return ( + "continuous", + False, + {}, + has_never_treated, + has_always_treated, + None, + None, + ) sorted_df = df.sort_values([unit, time]) + # Monotonicity check on the observed non-NaN subsequence per unit. + # A path like [0, 1, NaN, 0] must be detected as non-absorbing: the + # non-NaN subsequence [0, 1, 0] violates weak monotonicity. 
is_absorbing = True for _, group in sorted_df.groupby(unit, sort=False): vals = group[treatment].to_numpy() - if len(vals) >= 2 and bool(np.any(np.diff(vals) < 0)): + mask = ~pd.isna(vals) + observed = vals[mask] + if len(observed) >= 2 and bool(np.any(np.diff(observed) < 0)): is_absorbing = False break - unit_treatment_max = df.groupby(unit)[treatment].max().to_numpy() - unit_treatment_min = df.groupby(unit)[treatment].min().to_numpy() - has_never_treated = bool(np.any(unit_treatment_max == 0)) - has_always_treated = bool(np.any(unit_treatment_min == 1)) - if not is_absorbing: return ( "binary_non_absorbing", @@ -418,9 +452,23 @@ def _compute_alerts( min_post_periods: Optional[int], outcome_is_binary: bool, outcome_dtype_kind: str, + n_duplicate_rows: int, ) -> List[Alert]: alerts: List[Alert] = [] + if n_duplicate_rows > 0: + alerts.append( + Alert( + code="duplicate_unit_time_rows", + severity="warn", + message=( + f"Found {n_duplicate_rows} duplicate (unit, time) row(s); " + "balance and coverage are computed from the unique support." + ), + observed=int(n_duplicate_rows), + ) + ) + if cohort_sizes: smallest = min(cohort_sizes.values()) if smallest < _MIN_COHORT_SIZE_THRESHOLD: diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index ab1f1336..ae52a7bf 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -316,3 +316,99 @@ def test_top_level_import_surface(): assert diff_diff.Alert.__name__ == "Alert" for name in ("profile_panel", "PanelProfile", "Alert"): assert name in diff_diff.__all__, f"{name} missing from __all__" + + +def test_duplicate_unit_time_rows_do_not_inflate_coverage(): + """Duplicate (unit, time) rows must not make a panel look balanced. 
+ observation_coverage must stay in [0, 1] and derive from the unique + (unit, time) support, and the duplicate_unit_time_rows alert fires.""" + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + df_dup = pd.concat([df, df.iloc[:5].copy()], ignore_index=True) + profile = profile_panel(df_dup, unit="u", time="t", treatment="tr", outcome="y") + assert profile.is_balanced is True + assert 0.0 <= profile.observation_coverage <= 1.0 + assert "duplicate_unit_time_rows" in _alert_codes(profile) + + df_missing_cell = df.drop(df.index[0]).reset_index(drop=True) + df_dup_missing = pd.concat( + [df_missing_cell, df_missing_cell.iloc[:5].copy()], ignore_index=True + ) + profile2 = profile_panel(df_dup_missing, unit="u", time="t", treatment="tr", outcome="y") + assert profile2.is_balanced is False + assert profile2.observation_coverage < 1.0 + assert "duplicate_unit_time_rows" in _alert_codes(profile2) + + +def test_reversal_through_nan_is_binary_non_absorbing(): + """A 0 -> 1 -> NaN -> 0 path must be detected as non-absorbing: the + observed non-NaN subsequence violates weak monotonicity. Previously the + NaN-inclusive diff produced NaN comparisons (NaN < 0 is False), silently + masking the violation.""" + rows = [] + for u in range(1, 11): + treat_seq = [0, 1, np.nan, 0] + for t, tr in enumerate(treat_seq): + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_non_absorbing" + + +def test_continuous_zero_dose_controls_flag_has_never_treated(): + """Continuous treatment with some zero-dose units must flag + has_never_treated=True.
Previously continuous panels hardcoded + has_never_treated=False regardless of control availability.""" + rows = [] + rng = np.random.default_rng(0) + for u in range(1, 21): + dose = 0.0 if u <= 5 else float(rng.uniform(0.5, 3.0)) + for t in range(4): + rows.append({"u": u, "t": t, "tr": dose, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "continuous" + assert profile.has_never_treated is True + assert profile.has_always_treated is True + + +def test_guide_api_strings_resolve_against_public_api(): + """Sanity-check that every estimator referenced in the autonomous guide + exists in the public API, plus the `hausman_pretest` classmethod location + and the `not_yet_treated` control-group string. Guards against guide + drift that the CI reviewer has previously flagged.""" + import diff_diff + from diff_diff import get_llm_guide + + text = get_llm_guide("autonomous") + + for name in ( + "DifferenceInDifferences", + "MultiPeriodDiD", + "TwoWayFixedEffects", + "CallawaySantAnna", + "SunAbraham", + "ChaisemartinDHaultfoeuille", + "ImputationDiD", + "TwoStageDiD", + "StackedDiD", + "WooldridgeDiD", + "EfficientDiD", + "SyntheticDiD", + "TROP", + "TripleDifference", + "StaggeredTripleDifference", + "ContinuousDiD", + "HeterogeneousAdoptionDiD", + ): + assert name in text, f"estimator {name!r} missing from guide" + assert hasattr(diff_diff, name), f"{name!r} in guide but not exported" + + assert hasattr( + diff_diff.EfficientDiD, "hausman_pretest" + ), "EfficientDiD.hausman_pretest classmethod missing from the public API" + + assert "EfficientDiD.hausman_pretest" in text + assert "Hausman.hausman_pretest" not in text + + assert 'control_group="not_yet_treated"' in text + assert "notyettreated" not in text From 109c83a027945bc3de5cb9e2e2f98e17086a59b8 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 08:38:48 -0400 Subject: [PATCH 04/18] Address PR 
#356 CI review round 2 (2 P1 guide + 1 P1 code) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Guide: - Rewrite HAD section (§4.9, §3 footnotes) to use WAS / WAS_d_lower terminology (paper Equation 2 / Theorem 1). HAD does not target ATT and its event-study output is per-event-time, not per-cohort. The fitted target_parameter attribute is literally "WAS" for Design 1' and "WAS_d_lower" for Design 1 / Assumption 6. - Add control_group="last_cohort" to the EfficientDiD guidance in §4.4 (no-never-treated paths) and the corresponding footnote. Previously the guide steered agents only toward assumption="PT-Post" when a never-treated cohort was absent, ruling out the supported pseudo-control path. Distinct from CallawaySantAnna's "not_yet_treated". profile_panel(): - Missing identifier rows are dropped up front when unit or time contains NaN. Previously n_units / n_periods used nunique() (NaN-dropping) while the unique (unit, time) support used drop_duplicates() (NaN-preserving), producing observation_coverage > 1 on filtered data with missing IDs. All structural facts now compute on the non-missing subset; a new missing_id_rows_dropped warn alert surfaces the drop count so the behavior is not silent. Tests: - test_missing_unit_or_time_ids_are_dropped_consistently asserts observation_coverage stays in [0, 1] and the alert fires with the correct drop count. - test_guide_api_strings_resolve_against_public_api extended with assertions for WAS / WAS_d_lower phrasing, the absence of the old "per-cohort Pierce-Schott" wording, and control_group="last_cohort".
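[Reviewer note: the nunique()/drop_duplicates() asymmetry this commit fixes reproduces in a few lines of standalone pandas. The frame below is a toy stand-in, not library code, and the column names "u"/"t" are illustrative.]

```python
import numpy as np
import pandas as pd

# Two units x two periods, plus one row whose unit id is missing.
df = pd.DataFrame({"u": [1, 1, 2, 2, np.nan], "t": [0, 1, 0, 1, 0]})

n_units = df["u"].nunique()    # nunique() drops NaN by default -> 2
n_periods = df["t"].nunique()  # -> 2

# drop_duplicates() keeps the NaN row as a distinct (unit, time) key -> 5
n_unique_keys = df[["u", "t"]].drop_duplicates().shape[0]

coverage = n_unique_keys / (n_units * n_periods)  # 5 / 4 = 1.25, i.e. > 1
```

Dropping NaN-id rows before computing any of the counts, as the patch does, makes both sides see the same support and pins coverage back into [0, 1].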
Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 35 +++++++++++++++++++++------- diff_diff/profile.py | 33 +++++++++++++++++++++----- tests/test_profile_panel.py | 32 +++++++++++++++++++++++++ 3 files changed, 85 insertions(+), 15 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 9d72c614..bd9af235 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -158,6 +158,7 @@ it is descriptive context. | Alert code | Severity | Fires when | |---|---|---| +| `missing_id_rows_dropped` | warn | rows with NaN `unit` or `time` were dropped before computing structural facts | | `duplicate_unit_time_rows` | warn | panel contains more than one row per (unit, time) | | `min_cohort_size_below_10` | warn | smallest cohort has fewer than 10 units | | `only_one_cohort` | info | all treated units adopt simultaneously | @@ -208,9 +209,13 @@ supported / out of scope; `warn` supported but with documented caveats; is one option; "not-yet-treated" is the other. Pick via the `control_group` argument. If `has_never_treated == False`, use `control_group="not_yet_treated"`. -- `EfficientDiD` + never-treated: `assumption="PT-All"` requires it; - `assumption="PT-Post"` does not. The `EfficientDiD.hausman_pretest` - classmethod picks between them using a formal test. +- `EfficientDiD` + never-treated: three paths handle an absent + never-treated cohort. `assumption="PT-All"` requires never-treated; + switch to `assumption="PT-Post"` to drop that requirement, or pass + `control_group="last_cohort"` to reclassify the latest treatment + cohort as a pseudo-never-treated control (and trim post-treatment + periods at/after its adoption). The `EfficientDiD.hausman_pretest` + classmethod picks between `PT-All` and `PT-Post` using a formal test. - `SyntheticDiD` + staggered: not supported. 
`fit()` raises `ValueError` on within-unit treatment variation; SDiD requires block treatment (all treated units adopt at the same time). For staggered @@ -304,8 +309,12 @@ When `has_never_treated == False`: still fit, with the donor pool shrinking over time - check the pre-treatment coverage of the factor-model fit in the results diagnostics. -- `EfficientDiD` with `assumption="PT-All"` requires it - use - `assumption="PT-Post"` instead. +- `EfficientDiD` with `assumption="PT-All"` requires it - switch to + `assumption="PT-Post"` to drop that requirement, or pass + `control_group="last_cohort"` to use the latest treatment cohort as a + pseudo-never-treated control (post-treatment periods at/after that + cohort's adoption are trimmed). Distinct from CallawaySantAnna's + `not_yet_treated` option. - `CallawaySantAnna` - use `control_group="not_yet_treated"` to use not-yet-treated units as the control pool. - `ChaisemartinDHaultfoeuille` - constructs switchers vs. non-switchers @@ -372,10 +381,18 @@ methods in the library are preferred. When adoption varies in strength across units (partial-adoption settings, intensity of exposure differs): -- `HeterogeneousAdoptionDiD` - Phase 3 tools include Stute / Yatchew- - Härdle / QUG pre-tests for the first-stage model. The estimator's - workflow returns an aggregated ATT plus a per-cohort Pierce-Schott- - style validation. +- `HeterogeneousAdoptionDiD` - targets a Weighted Average Slope (WAS) + on single-period Heterogeneous Adoption Designs where no genuinely + untreated group exists (paper Equation 2 / Theorem 1). The + `target_parameter` attribute on the results object is literally + `"WAS"` for Design 1' and `"WAS_d_lower"` for Design 1 with lower-dose + comparison under Assumption 6. `fit(aggregate="overall")` (Phase 2a) + returns a single scalar WAS; `fit(aggregate="event_study")` (Phase + 2b) returns per-event-time WAS estimates. 
Phase 3 pre-tests + (`qug_test`, `stute_test`, `stute_joint_pretest`, `yatchew_hr_test`) + and the `did_had_pretest_workflow()` bundle validate Assumptions 3 + and 7. Not ATT-shaped; do not relabel the headline as ATT in report + text. ### §4.10 Repeated cross-sections (no panel structure) diff --git a/diff_diff/profile.py b/diff_diff/profile.py index f05e6dde..70bf1db4 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -198,11 +198,12 @@ def profile_panel( when some unit has strictly-positive treatment in every observed non-NaN row. Both are always ``False`` for ``"categorical"``. - Duplicate ``(unit, time)`` rows are surfaced via the - ``duplicate_unit_time_rows`` alert; ``is_balanced`` and - ``observation_coverage`` are computed from the unique ``(unit, - time)`` support, so ``observation_coverage`` is always in - ``[0, 1]``. + Rows with ``NaN`` in ``unit`` or ``time`` are dropped up front and + surfaced via the ``missing_id_rows_dropped`` alert; all subsequent + structural facts are computed on the non-missing subset, so + ``observation_coverage`` is always in ``[0, 1]``. Duplicate + ``(unit, time)`` rows are surfaced separately via the + ``duplicate_unit_time_rows`` alert. The profile does not recommend an estimator. 
Consult ``diff_diff.get_llm_guide("autonomous")`` for the estimator-support @@ -210,9 +211,13 @@ def profile_panel( """ _validate_columns(df, unit=unit, time=time, treatment=treatment, outcome=outcome) + n_rows_with_missing_id = int(df[unit].isna().sum() + df[time].isna().sum()) + if n_rows_with_missing_id > 0: + df = df.dropna(subset=[unit, time]) + n_obs = int(len(df)) + n_units = int(df[unit].nunique()) n_periods = int(df[time].nunique()) - n_obs = int(len(df)) n_unique_keys = int(df[[unit, time]].drop_duplicates().shape[0]) denom = n_units * n_periods observation_coverage = float(n_unique_keys / denom) if denom > 0 else 0.0 @@ -258,6 +263,7 @@ def profile_panel( outcome_is_binary=outcome_is_binary, outcome_dtype_kind=dtype_kind, n_duplicate_rows=n_duplicate_rows, + n_rows_with_missing_id=n_rows_with_missing_id, ) return PanelProfile( @@ -453,9 +459,24 @@ def _compute_alerts( outcome_is_binary: bool, outcome_dtype_kind: str, n_duplicate_rows: int, + n_rows_with_missing_id: int, ) -> List[Alert]: alerts: List[Alert] = [] + if n_rows_with_missing_id > 0: + alerts.append( + Alert( + code="missing_id_rows_dropped", + severity="warn", + message=( + f"Dropped {n_rows_with_missing_id} row(s) with missing " + "unit or time identifier; structural facts are computed " + "from the non-missing subset." + ), + observed=int(n_rows_with_missing_id), + ) + ) + if n_duplicate_rows > 0: alerts.append( Alert( diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index ae52a7bf..ef2cc1b3 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -412,3 +412,35 @@ def test_guide_api_strings_resolve_against_public_api(): assert 'control_group="not_yet_treated"' in text assert "notyettreated" not in text + + # HAD targets WAS / WAS_d_lower, not ATT; event-study is per-event- + # time, not per-cohort. Guard against the guide drifting back to + # ATT-shaped / per-cohort phrasing. 
+ assert "Weighted Average Slope (WAS)" in text + assert "WAS_d_lower" in text + assert "per-cohort Pierce-Schott" not in text + + # EfficientDiD has three paths when no never-treated exists: + # PT-Post, PT-All, or control_group="last_cohort". The guide must + # mention last_cohort in the no-never-treated section so agents do + # not rule out the supported path. + assert 'control_group="last_cohort"' in text + + +def test_missing_unit_or_time_ids_are_dropped_consistently(): + """NaN values in unit or time must not push observation_coverage above + 1.0. `nunique()` drops NaN while `drop_duplicates()` keeps NaN as a + distinct key, which previously produced coverage > 1 silently. The + fix drops NaN-id rows up front, emits the missing_id_rows_dropped + alert, and computes all structural facts on the non-missing subset.""" + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + df_with_missing = df.copy() + df_with_missing.loc[[0, 1, 2], "u"] = np.nan + df_with_missing.loc[[5, 6], "t"] = np.nan + profile = profile_panel(df_with_missing, unit="u", time="t", treatment="tr", outcome="y") + assert 0.0 <= profile.observation_coverage <= 1.0 + codes = _alert_codes(profile) + assert "missing_id_rows_dropped" in codes + drop_alert = next(a for a in profile.alerts if a.code == "missing_id_rows_dropped") + assert drop_alert.observed == 5 From 58640814f0b06c146f0d608bfdef1523bb3cf16e Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 09:03:08 -0400 Subject: [PATCH 05/18] Address PR #356 CI review round 3 (2 P1 guide + 1 P1 + 1 P2 code) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Guide: - SunAbraham matrix row: never-treated-required changed from ✗ to ✓. The fit path raises ValueError when no never-treated units exist (sun_abraham.py:566-567); the registry documents this requirement. §4.3 prose updated to state the requirement explicitly. 
- HAD §4.9 (and §3 Phase 3 description): correct the diagnostic-claim wording. The pre-test battery validates the QUG null (step 1), Assumption 7 pre-trends (step 2, event-study only), and linearity of E[ΔY|D_2] (step 3). Assumption 3 (uniform continuity / no extensive-margin jump) is explicitly "not testable" per REGISTRY.md L2181, so the guide must not claim Phase 3 validates it. profile_panel(): - Empty-panel guard: raise ValueError when the input DataFrame is empty (direct) or when no rows remain after dropping missing (unit, time) identifiers. Previously both cases returned is_balanced=True / observation_coverage=0 on zero valid rows. - Fix missing-id counting: n_rows_with_missing_id was the SUM of unit-NaN and time-NaN (double-counting rows missing both). Now computed from a row-wise isna().any(axis=1) mask so every dropped row is counted exactly once. Tests: - test_row_with_both_ids_missing_counted_once - test_empty_dataframe_raises_value_error - test_empty_after_id_drop_raises_value_error - test_guide_api_strings_resolve_against_public_api extended: SunAbraham matrix-row cell assertion (never-treated-required=✓), "Assumption 3" presence, absence of "validate Assumptions 3 and 7", presence of "not testable". 
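[Reviewer note: the double-counting fix is plain pandas and easy to check in isolation. The frame below is a toy stand-in with hypothetical column names, not library code.]

```python
import numpy as np
import pandas as pd

# Row 0 is missing BOTH identifiers; row 2 is missing only "u".
df = pd.DataFrame({"u": [np.nan, 2.0, np.nan], "t": [np.nan, 1.0, 2.0]})

# Summing per-column NaN counts double-counts row 0: 2 + 1 = 3.
double_counted = int(df["u"].isna().sum() + df["t"].isna().sum())

# A row-wise mask counts each droppable row exactly once: 2.
missing_id_mask = df[["u", "t"]].isna().any(axis=1)
n_rows_with_missing_id = int(missing_id_mask.sum())
```

Reusing the same mask for both the count and the drop (`df.loc[~missing_id_mask]`) also guarantees the alert's `observed` value matches the rows actually removed.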
Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 19 +++++---- diff_diff/profile.py | 15 ++++++- tests/test_profile_panel.py | 58 ++++++++++++++++++++++++++++ 3 files changed, 83 insertions(+), 9 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index bd9af235..fa0e0112 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -185,7 +185,7 @@ supported / out of scope; `warn` supported but with documented caveats; | `MultiPeriodDiD` | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | | `TwoWayFixedEffects` | ✓ | warn | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | | `CallawaySantAnna` | ✓ | ✓ | ✗ | ✗ | partial | ✓ | ✗ | ✗ | ✓ | -| `SunAbraham` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `SunAbraham` | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | | `ChaisemartinDHaultfoeuille` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | | `ImputationDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | | `TwoStageDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | @@ -280,7 +280,8 @@ estimators: or regression-adjustment (RA). - `SunAbraham` - interaction-weighted estimator; closely tied to two-way-FE output, computationally cheap, produces event-time - coefficients. + coefficients. Requires a never-treated cohort (`fit` raises a + `ValueError` when none exists). - `ChaisemartinDHaultfoeuille` - DID_M / DID_l estimators robust to non-absorbing treatment (see §4.5) and to spillover designs. - `ImputationDiD` (Borusyak, Jaravel, Spiess) - imputation-based, @@ -388,11 +389,15 @@ intensity of exposure differs): `"WAS"` for Design 1' and `"WAS_d_lower"` for Design 1 with lower-dose comparison under Assumption 6. `fit(aggregate="overall")` (Phase 2a) returns a single scalar WAS; `fit(aggregate="event_study")` (Phase - 2b) returns per-event-time WAS estimates. Phase 3 pre-tests - (`qug_test`, `stute_test`, `stute_joint_pretest`, `yatchew_hr_test`) - and the `did_had_pretest_workflow()` bundle validate Assumptions 3 - and 7. 
Not ATT-shaped; do not relabel the headline as ATT in report - text. + 2b) returns per-event-time WAS estimates. `did_had_pretest_workflow()` + runs the paper's three-step TWFE-suitability battery: (1) QUG null + via `qug_test`, (2) Assumption 7 pre-trends via `stute_test` / + `stute_joint_pretest` (event-study path only; the two-period overall + path flags this step as deferred), and (3) linearity of + `E[ΔY | D_2]` via `stute_test` / `yatchew_hr_test`. Assumption 3 + (uniform continuity / no extensive-margin jump) is not testable; the + pre-test battery does not and cannot validate it. Not ATT-shaped; do + not relabel the headline as ATT in report text. ### §4.10 Repeated cross-sections (no panel structure) diff --git a/diff_diff/profile.py b/diff_diff/profile.py index 70bf1db4..47dd41be 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -211,10 +211,21 @@ def profile_panel( """ _validate_columns(df, unit=unit, time=time, treatment=treatment, outcome=outcome) - n_rows_with_missing_id = int(df[unit].isna().sum() + df[time].isna().sum()) + input_row_count = int(len(df)) + if input_row_count == 0: + raise ValueError("profile_panel: DataFrame is empty; at least one row is required.") + + missing_id_mask = cast(pd.Series, df[[unit, time]].isna().any(axis=1)) + n_rows_with_missing_id = int(missing_id_mask.sum()) if n_rows_with_missing_id > 0: - df = df.dropna(subset=[unit, time]) + df = df.loc[~missing_id_mask] n_obs = int(len(df)) + if n_obs == 0: + raise ValueError( + f"profile_panel: no rows remain after dropping " + f"{n_rows_with_missing_id} row(s) with missing unit or time " + "identifier; at least one valid row is required." 
+ ) n_units = int(df[unit].nunique()) n_periods = int(df[time].nunique()) diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index ef2cc1b3..fc44210e 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -426,6 +426,26 @@ def test_guide_api_strings_resolve_against_public_api(): # not rule out the supported path. assert 'control_group="last_cohort"' in text + # SunAbraham requires a never-treated cohort; the fit path raises a + # ValueError when none exists. Guard the matrix / prose contract so + # the guide cannot drift back to claiming SunAbraham is optional. + sun_abraham_row = next( + line for line in text.splitlines() if "`SunAbraham`" in line and "|" in line + ) + cells = [cell.strip() for cell in sun_abraham_row.strip("|").split("|")] + # Column order: estimator, binary_absorbing, staggered, continuous, + # triple-diff, never-treated-required, covariate, few-treated, + # heterogeneous-adoption, clustered-SE. + assert cells[5] == "✓", ( + "SunAbraham matrix row must mark never-treated-required=✓ " f"(row: {sun_abraham_row!r})" + ) + + # HAD Assumption 3 is not testable per REGISTRY.md; the guide must + # not claim otherwise. + assert "Assumption 3" in text # mentioned as untestable, not as validated + assert "validate Assumptions 3 and 7" not in text + assert "not testable" in text + def test_missing_unit_or_time_ids_are_dropped_consistently(): """NaN values in unit or time must not push observation_coverage above @@ -444,3 +464,41 @@ def test_missing_unit_or_time_ids_are_dropped_consistently(): assert "missing_id_rows_dropped" in codes drop_alert = next(a for a in profile.alerts if a.code == "missing_id_rows_dropped") assert drop_alert.observed == 5 + + +def test_row_with_both_ids_missing_counted_once(): + """A row with BOTH unit and time NaN must count as one dropped row, + not two. 
Previously `isna().sum()` summed the two columns and + double-counted rows missing both identifiers.""" + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + df_both_missing = df.copy() + df_both_missing.loc[0, "u"] = np.nan + df_both_missing.loc[0, "t"] = np.nan + profile = profile_panel(df_both_missing, unit="u", time="t", treatment="tr", outcome="y") + drop_alert = next(a for a in profile.alerts if a.code == "missing_id_rows_dropped") + assert drop_alert.observed == 1 + + +def test_empty_dataframe_raises_value_error(): + """Direct empty input must raise, not silently return a 'balanced' + profile with zero units/periods.""" + df = pd.DataFrame({"u": [], "t": [], "tr": [], "y": []}) + with pytest.raises(ValueError, match="empty"): + profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + + +def test_empty_after_id_drop_raises_value_error(): + """If every row has a missing unit or time identifier, the panel is + empty after the drop; raise rather than returning is_balanced=True + on zero rows.""" + df = pd.DataFrame( + { + "u": [np.nan, np.nan], + "t": [0, 1], + "tr": [0, 1], + "y": [0.1, 0.2], + } + ) + with pytest.raises(ValueError, match="no rows remain"): + profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") From c05c52f6f87e876bfb1a38e2fe3b1fd7e66db9e2 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 09:16:05 -0400 Subject: [PATCH 06/18] Address PR #356 CI review round 4 (2 P1 guide + 2 P2 code/docs) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Guide: - Matrix covariate cells corrected. SyntheticDiD column 6 flipped ✗ → ✓: fit() accepts covariates= and residualizes the outcome (synthetic_did.py:213-239, 425-436). ContinuousDiD column 6 flipped ✓ → ✗: the fit() signature has no covariate argument and REGISTRY lists covariate adjustment as deferred. - EfficientDiD wording corrected. 
The registry says PT-Post still uses never-treated as the comparison group (REGISTRY.md L756-758); "switch to PT-Post" does not drop the requirement. Rewrote has_never_treated field description, the §3 footnote, and §4.4 prose to state that PT-All and PT-Post both require never-treated units and control_group="last_cohort" is the only supported path for an all-eventually-treated panel. profile_panel(): - _compute_pre_post now uses each treated unit's observed (unit, time) support rather than the global panel period set. On unbalanced panels this correctly captures the least-supported treated unit's pre/post exposure and fires short_pre_panel / short_post_panel when appropriate; the previous global-periods approach could overstate exposure and suppress the alert. Guide §5 API signatures corrected: - compute_pretrends_power takes a fitted results object, not a raw DataFrame (pretrends.py:1048). - plot_sensitivity takes a SensitivityResults object (visualization/_diagnostic.py:12-14). - plot_honest_event_study takes a HonestDiDResults object (visualization/_event_study.py:9). Tests: - test_min_pre_post_use_per_unit_observed_support (unbalanced treated unit missing early pre-period fires short_pre_panel). - test_guide_api_strings_resolve_against_public_api extended with SyntheticDiD/ContinuousDiD covariate-cell matrix assertions, EfficientDiD both-PT-variants-require-never-treated wording assertion, and §5 API-signature fingerprints. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 59 +++++++++++++++++----------- diff_diff/profile.py | 24 ++++++++--- tests/test_profile_panel.py | 53 +++++++++++++++++++++++++ 3 files changed, 107 insertions(+), 29 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index fa0e0112..036ef9e6 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -107,9 +107,11 @@ view. 
Every field below appears as a top-level key in that dict. - **`has_never_treated: bool`** - at least one unit has `treatment == 0` in every observed non-NaN row (applies to both binary and continuous treatment columns; for continuous this flags zero-dose control units). - Required by `SyntheticDiD` and `EfficientDiD` with `assumption="PT-All"`; - preferred-but-optional by `CallawaySantAnna` and - `ChaisemartinDHaultfoeuille`. Always `False` for `"categorical"`. + Required by `SyntheticDiD`, `SunAbraham`, and `EfficientDiD` under + both `assumption="PT-All"` and `assumption="PT-Post"` unless + `control_group="last_cohort"` is passed; preferred-but-optional by + `CallawaySantAnna` and `ChaisemartinDHaultfoeuille`. Always `False` + for `"categorical"`. - **`has_always_treated: bool`** - at least one unit has strictly-positive treatment in every observed non-NaN row. Such units provide no pre-treatment identification and are dropped by most @@ -192,11 +194,11 @@ supported / out of scope; `warn` supported but with documented caveats; | `StackedDiD` | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | | `WooldridgeDiD` (ETWFE) | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | | `EfficientDiD` | ✓ | ✓ | ✗ | ✗ | partial | ✓ | ✗ | ✗ | ✓ | -| `SyntheticDiD` | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | partial | +| `SyntheticDiD` | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | partial | | `TROP` | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | partial | | `TripleDifference` | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | | `StaggeredTripleDifference` | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | -| `ContinuousDiD` | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | +| `ContinuousDiD` | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | | `HeterogeneousAdoptionDiD` | ✗ | partial | partial | ✗ | ✗ | ✗ | ✗ | ✓ | warn | **Footnotes.** @@ -209,13 +211,16 @@ supported / out of scope; `warn` supported but with documented caveats; is one option; "not-yet-treated" is the other. Pick via the `control_group` argument. If `has_never_treated == False`, use `control_group="not_yet_treated"`. 
-- `EfficientDiD` + never-treated: three paths handle an absent - never-treated cohort. `assumption="PT-All"` requires never-treated; - switch to `assumption="PT-Post"` to drop that requirement, or pass +- `EfficientDiD` + never-treated: both `assumption="PT-All"` and + `assumption="PT-Post"` require actual never-treated units - PT-Post + is the weaker parallel-trends assumption but still uses never-treated + as the comparison group (REGISTRY.md `EfficientDiD` "Parallel Trends + -- two variants"). To admit an all-eventually-treated panel, pass `control_group="last_cohort"` to reclassify the latest treatment - cohort as a pseudo-never-treated control (and trim post-treatment - periods at/after its adoption). The `EfficientDiD.hausman_pretest` - classmethod picks between `PT-All` and `PT-Post` using a formal test. + cohort as a pseudo-never-treated control and trim post-treatment + periods at/after its adoption. The `EfficientDiD.hausman_pretest` + classmethod picks between `PT-All` and `PT-Post` on panels that do + have never-treated units. - `SyntheticDiD` + staggered: not supported. `fit()` raises `ValueError` on within-unit treatment variation; SDiD requires block treatment (all treated units adopt at the same time). For staggered @@ -310,12 +315,13 @@ When `has_never_treated == False`: still fit, with the donor pool shrinking over time - check the pre-treatment coverage of the factor-model fit in the results diagnostics. -- `EfficientDiD` with `assumption="PT-All"` requires it - switch to - `assumption="PT-Post"` to drop that requirement, or pass - `control_group="last_cohort"` to use the latest treatment cohort as a - pseudo-never-treated control (post-treatment periods at/after that - cohort's adoption are trimmed). Distinct from CallawaySantAnna's - `not_yet_treated` option. +- `EfficientDiD` requires never-treated comparisons under both + `assumption="PT-All"` and `assumption="PT-Post"`. 
To admit an + all-treated panel, pass `control_group="last_cohort"` to use the + latest treatment cohort as a pseudo-never-treated control + (post-treatment periods at/after that cohort's adoption are + trimmed). Distinct from CallawaySantAnna's `not_yet_treated` + option. - `CallawaySantAnna` - use `control_group="not_yet_treated"` to use not-yet-treated units as the control pool. - `ChaisemartinDHaultfoeuille` - constructs switchers vs. non-switchers @@ -427,16 +433,19 @@ section is the API-reference index. version; adds a "believable-magnitude" check against a power curve. - `equivalence_test_trends(df, ...)` - Bilinski-Hatfield-style equivalence test (alternative framing of the PT test). -- `compute_pretrends_power(df, ...)` - standalone power analysis for the - PT test, useful when `min_pre_periods` is small. +- `compute_pretrends_power(results, ...)` - standalone power analysis + for the PT test; takes a fitted `MultiPeriodDiDResults` (or + compatible event-study results object), not raw DataFrame. Useful + when `min_pre_periods` is small. ### Sensitivity / robustness - `compute_honest_did(results, ...)` - Rambachan-Roth (2023) honest DiD. Quantifies the sensitivity of ATT to parallel-trends violations. Outputs sensitivity bounds under smoothness restrictions. -- `compute_pretrends_power(...)` - complementary tool for power-aware - pre-trends interpretation. +- `compute_pretrends_power(results, ...)` - complementary tool for + power-aware pre-trends interpretation (same fitted-results-first + signature as above). ### Placebo tests @@ -478,8 +487,12 @@ Some estimators expose diagnostics as methods on the result object: - `plot_group_effects(results, ...)` - `plot_group_time_heatmap(results, ...)` - `plot_staircase(results, ...)` -- `plot_honest_event_study(results, ...)` -- `plot_sensitivity(results, ...)` +- `plot_honest_event_study(honest_results, ...)` - takes a + `HonestDiDResults` returned by `compute_honest_did`, not a fit + result directly. 
+- `plot_sensitivity(sensitivity_results, ...)` - takes a + `SensitivityResults` object (the result of honest-DiD sensitivity + analysis), not a fit result directly. - `plot_synth_weights(results, ...)` - `plot_dose_response(results, ...)` - `plot_power_curve(...)` diff --git a/diff_diff/profile.py b/diff_diff/profile.py index 47dd41be..81812894 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -414,21 +414,33 @@ def _compute_pre_post( treatment: str, treatment_type: str, ) -> Tuple[Optional[int], Optional[int]]: + """Return (min_pre, min_post) across treated units using each unit's + observed (unit, time) support. On unbalanced panels this correctly + reflects the actual pre/post exposure of the least-supported treated + unit, rather than the global panel period set which could overstate + exposure and suppress short-panel alerts. + """ if treatment_type != "binary_absorbing": return None, None - all_periods = sorted(df[time].unique().tolist()) + support = df[[unit, time]].drop_duplicates() sorted_df = df.sort_values([unit, time]) first_treat_per_unit = ( sorted_df[sorted_df[treatment] == 1].groupby(unit, sort=False)[time].min() ) - cohort_values = first_treat_per_unit.unique().tolist() - if not cohort_values: + if first_treat_per_unit.empty: return None, None - min_pre = min(sum(1 for p in all_periods if p < c) for c in cohort_values) - min_post = min(sum(1 for p in all_periods if p >= c) for c in cohort_values) - return int(min_pre), int(min_post) + pre_counts: List[int] = [] + post_counts: List[int] = [] + treated_units = first_treat_per_unit.index.tolist() + for u in treated_units: + c_u = first_treat_per_unit.loc[u] + unit_periods = support.loc[support[unit] == u, time] + pre_counts.append(int((unit_periods < c_u).sum())) + post_counts.append(int((unit_periods >= c_u).sum())) + + return int(min(pre_counts)), int(min(post_counts)) def _classify_outcome(valid: pd.Series) -> Tuple[bool, bool, bool]: diff --git a/tests/test_profile_panel.py 
b/tests/test_profile_panel.py index fc44210e..051bfffa 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -446,6 +446,59 @@ def test_guide_api_strings_resolve_against_public_api(): assert "validate Assumptions 3 and 7" not in text assert "not testable" in text + # EfficientDiD requires never-treated under BOTH assumption="PT-All" + # and assumption="PT-Post" — PT-Post is not a "drop the requirement" + # escape hatch. Only control_group="last_cohort" admits all-treated + # panels. Guard against guide drift back to the incorrect wording. + assert "PT-Post is the weaker" in text or "both" in text.lower() + # The old claim "switch to `assumption=\"PT-Post\"` to drop" must + # not reappear in any form. + assert 'switch to `assumption="PT-Post"` to drop' not in text + + # Matrix covariate cells: SyntheticDiD accepts fit(covariates=...) + # and residualizes the outcome; ContinuousDiD.fit has no covariate + # surface. Guard the matrix rows against drift. + sdid_row = next(line for line in text.splitlines() if "`SyntheticDiD`" in line and "|" in line) + sdid_cells = [c.strip() for c in sdid_row.strip("|").split("|")] + assert sdid_cells[6] in ("✓", "partial"), ( + "SyntheticDiD covariate-adjustment cell must be ✓ or partial " + f"(residualization path exists); got {sdid_cells[6]!r}" + ) + cdid_row = next(line for line in text.splitlines() if "`ContinuousDiD`" in line and "|" in line) + cdid_cells = [c.strip() for c in cdid_row.strip("|").split("|")] + assert cdid_cells[6] == "✗", ( + "ContinuousDiD covariate-adjustment cell must be ✗ " + f"(no covariate surface on fit()); got {cdid_cells[6]!r}" + ) + + # §5 API signatures: compute_pretrends_power takes a fitted results + # object (not df), plot_sensitivity takes SensitivityResults, + # plot_honest_event_study takes HonestDiDResults. Guard against + # drift back to the df-first / results-only signatures. 
+ assert "`compute_pretrends_power(results" in text + assert "`plot_sensitivity(sensitivity_results" in text + assert "`plot_honest_event_study(honest_results" in text + + +def test_min_pre_post_use_per_unit_observed_support(): + """On an unbalanced panel where one treated unit is missing its + earliest pre-period, min_pre_periods must reflect that unit's actual + observed support. Previously _compute_pre_post used the global period + set, which could hide short-panel cases and suppress the short_pre_panel + alert.""" + rows = [] + for u in range(1, 21): + first_treat = 3 + for t in range(0, 6): + if u == 1 and t <= 1: + continue + tr = 1 if t >= first_treat else 0 + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.min_pre_periods == 1 + assert "short_pre_panel" in _alert_codes(profile) + def test_missing_unit_or_time_ids_are_dropped_consistently(): """NaN values in unit or time must not push observation_coverage above From 45067415ef518b81d714eda4878b34935d6e5575 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 09:26:07 -0400 Subject: [PATCH 07/18] Address PR #356 CI review round 5 (2 P2 docs) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Guide §6 rewritten to match the actual BR/DR to_dict() schemas emitted at diff_diff/business_report.py:524-549 and diff_diff/diagnostic_report.py: 1013-1031. BR top-level keys are "assumption" (singular), "pre_trends" (underscored), "sample" (bare); the new documentation also adds "estimator", "context", "heterogeneity", "diagnostics", "next_steps", "references". DR keys are "estimator" (string), "headline_metric", "estimator_native_diagnostics", "skipped", "warnings", "overall_interpretation", "next_steps"; the old "checks: dict" nesting never existed and "estimator_type" was never the key name. 
The "forthcoming schema additions" block is dropped since it documented an un-shipped surface. Prose + alert wording aligned with the per-treated-unit implementation: - Guide §2: min_pre_periods / min_post_periods described as "across treated units" (each unit's observed (unit, time) support counted independently on unbalanced panels). - profile.py: short_pre_panel / short_post_panel alert messages use "across treated units" to match the computation. Tests: - test_guide_api_strings_resolve_against_public_api extended with BR/DR §6 schema assertions. Guards every real top-level key documented (assumption, pre_trends, sample, headline_metric, estimator_native_diagnostics, overall_interpretation) and forbids the previously documented obsolete keys (assumptions, pretrends, main_result, sample_summary, estimator_type, checks). Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 97 +++++++++++++++++++--------- diff_diff/profile.py | 4 +- tests/test_profile_panel.py | 26 ++++++++ 3 files changed, 95 insertions(+), 32 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 036ef9e6..b70b0cf7 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -123,12 +123,15 @@ view. Every field below appears as a top-level key in that dict. period observed (for `binary_absorbing`); `None` otherwise. - **`last_treatment_period: Optional[Any]`** - latest first-treatment period observed; `None` otherwise. -- **`min_pre_periods: Optional[int]`** - across cohorts, the smallest - number of pre-treatment periods. Low values (< 3) fire the - `short_pre_panel` alert and limit power for parallel-trends tests. -- **`min_post_periods: Optional[int]`** - across cohorts, the smallest - number of post-treatment periods. Low values limit event-study - dynamics. 
+- **`min_pre_periods: Optional[int]`** - across treated units, the + smallest number of observed pre-treatment periods (each treated + unit's observed `(unit, time)` support is counted independently, so + this reflects the least-supported treated unit on unbalanced panels). + Low values (< 3) fire the `short_pre_panel` alert and limit power + for parallel-trends tests. +- **`min_post_periods: Optional[int]`** - across treated units, the + smallest number of observed post-treatment periods; same per-unit + support semantics as above. Low values limit event-study dynamics. ### Outcome @@ -512,37 +515,71 @@ from it. ### BusinessReport `to_dict()` schema (v2.0) -Top-level keys: +Top-level keys emitted by `BusinessReport.to_dict()` +(source: `diff_diff/business_report.py`): -- `schema_version: str` - e.g. `"2.0"`. +- `schema_version: str` - `BUSINESS_REPORT_SCHEMA_VERSION`, e.g. `"2.0"`. +- `estimator: dict` - `class_name` (the fitted result class) and a + human-friendly `display_name`. +- `context: dict` - the `BusinessContext` bundle: `outcome_label`, + `outcome_unit`, `outcome_direction`, `business_question`, + `treatment_label`, `alpha`. +- `headline: dict` - the main point estimate plus framing fields. - `target_parameter: dict` - what the headline scalar represents. - Fields: `name` (e.g. `"ATT"`, `"DID_M"`, `"dose-response"`), - `definition` (human-readable explanation), `aggregation` (machine - tag: `"att_overall"`, `"did_m"`, `"did_or_twfe"`, ...), - `headline_attribute` (the raw result attribute the headline renders - from), `reference` (REGISTRY.md citation string). -- `headline: dict` - the main point estimate plus framing. -- `assumptions: dict` - named assumptions relied on (parallel trends, - no anticipation, SUTVA, ...). -- `pretrends: dict` - pre-trends test result with verdict string - (e.g. `"clean"`, `"inconclusive"`, `"violated"`), p-value, power - assessment if available. -- `main_result: dict` - point estimate, SE, CI, significance. 
-- `robustness: dict` - sensitivity and placebo summaries if available. -- `sample_summary: dict` - sample size and coverage details. + Fields: `name` (e.g. `"ATT"`, `"DID_M"`, `"dose-response"`, + `"WAS"`), `definition`, `aggregation` (machine tag), and + `headline_attribute` (raw result attribute). +- `assumption: dict` - named assumptions relied on (parallel trends, + no anticipation, SUTVA, ...). Note: singular `"assumption"`, not + `"assumptions"`. +- `pre_trends: dict` - pre-trends test result with verdict string + (e.g. `"clean"`, `"inconclusive"`, `"violated"`), p-value, and + power assessment if available. Note: underscore-split + `"pre_trends"`. +- `sensitivity: dict` - HonestDiD sensitivity summary when available. +- `sample: dict` - sample size and coverage details. Note: bare + `"sample"`, not `"sample_summary"`. +- `heterogeneity: dict` - heterogeneity summary if applicable. +- `robustness: dict` - placebo / robustness summaries if available. +- `diagnostics: dict` - when the auto-constructed `DiagnosticReport` + is present, this is its full `to_dict()` schema nested under the + BR. +- `next_steps: list[dict]` - Baker et al. next-step guidance from + `practitioner_next_steps`. - `caveats: list[str]` - free-text caveats generated from failed checks. +- `references: list[dict]` - citations relevant to the estimator. ### DiagnosticReport `to_dict()` schema (v2.0) -- `schema_version: str`. -- `estimator_type: str` - the result class name. -- `checks: dict` - per-check status. Candidate keys: - `parallel_trends`, `pretrends_power`, `sensitivity`, `bacon`, - `design_effect`, `heterogeneity`, `epv`, `estimator_native`, - `placebo`. Each value is a dict describing what was run and its - outcome. Not all checks apply to all estimators; see the - applicability matrix in `diff_diff.diagnostic_report`. +Top-level keys (source: `diff_diff/diagnostic_report.py`): + +- `schema_version: str` - `DIAGNOSTIC_REPORT_SCHEMA_VERSION`. 
+- `estimator: str` - the fitted result class name. +- `headline_metric: dict` - the main scalar the report headlines. +- `target_parameter: dict` - same shape as the BR field above. +- `parallel_trends: dict` - PT test result. +- `pretrends_power: dict` - power-aware pre-trends assessment when + applicable. +- `sensitivity: dict` - HonestDiD sensitivity summary. +- `placebo: dict` - placebo-test results. +- `bacon: dict` - Goodman-Bacon decomposition when applicable. +- `design_effect: dict` - survey / clustering design-effect summary. +- `heterogeneity: dict` - group-time heterogeneity summary. +- `epv: dict` - events-per-variable / sample-adequacy. +- `estimator_native_diagnostics: dict` - estimator-specific + diagnostics (e.g. SDiD weight concentration, TROP factor-model + fit). +- `skipped: dict` - checks skipped on this estimator type, with the + reason. +- `warnings: list[str]` - top-level aggregated warnings. +- `overall_interpretation: str` - rendered prose summary of the + sections. +- `next_steps: list[dict]` - same shape as the BR field. + +Each section value is a dict with at minimum `status` +(`"pass"` / `"warn"` / `"inconclusive"` / `"error"` / `"skipped"`), +a `reason`, and section-specific payload fields. ### Forthcoming schema additions (not yet shipped) diff --git a/diff_diff/profile.py b/diff_diff/profile.py index 81812894..98827b51 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -556,7 +556,7 @@ def _compute_alerts( code="short_pre_panel", severity="warn", message=( - f"Minimum pre-treatment periods across cohorts is " + f"Minimum pre-treatment periods across treated units is " f"{min_pre_periods}; parallel-trends and event-study " "diagnostics have limited power." 
), @@ -569,7 +569,7 @@ def _compute_alerts( code="short_post_panel", severity="info", message=( - f"Minimum post-treatment periods across cohorts is " + f"Minimum post-treatment periods across treated units is " f"{min_post_periods}; dynamic-effect estimation is " "limited." ), diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 051bfffa..780b2a7c 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -479,6 +479,32 @@ def test_guide_api_strings_resolve_against_public_api(): assert "`plot_sensitivity(sensitivity_results" in text assert "`plot_honest_event_study(honest_results" in text + # §6 BR/DR schema alignment. The emitted top-level keys are + # singular / underscored ("assumption", "pre_trends", "sample"), + # not the plural / run-together variants. DiagnosticReport emits + # sections at the top level (not nested under a "checks" dict) + # and uses "estimator" (the string class name) / "headline_metric" + # / "estimator_native_diagnostics". Guard each real key and + # forbid the obsolete ones. 
+ for real_key in ( + "`assumption: dict`", + "`pre_trends: dict`", + "`sample: dict`", + "`headline_metric: dict`", + "`estimator_native_diagnostics: dict`", + "`overall_interpretation: str`", + ): + assert real_key in text, f"BR/DR §6 missing real key: {real_key}" + for obsolete_key in ( + "`assumptions: dict`", + "`pretrends: dict`", + "`main_result: dict`", + "`sample_summary: dict`", + "`estimator_type: str`", + "`checks: dict`", + ): + assert obsolete_key not in text, f"BR/DR §6 still lists obsolete key: {obsolete_key}" + def test_min_pre_post_use_per_unit_observed_support(): """On an unbalanced panel where one treated unit is missing its From 046e35c8fb98e5fefbd62e991f2387e8abee099e Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 09:46:59 -0400 Subject: [PATCH 08/18] Address PR #356 CI review round 6 (3 P2 methodology / docs) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Guide methodology attribution corrected: - EfficientDiD primary source is Chen, Sant'Anna, Xie (2025) per REGISTRY.md L746-749, not Arkhangelsky-Imbens. §4.3 bullet and the §7 citation list updated. - ContinuousDiD synopsis (§4.7) now distinguishes the PT-identified targets (`ATT(d|d)`, `ATT^{loc}`) from the SPT-required targets (`ATT(d)`, `ACRT(d)`, `ACRT^{glob}`) per REGISTRY.md L682-706, with the Callaway, Goodman-Bacon, Sant'Anna (2024) citation added to §7. Guide §6 BR schema corrections: - `diagnostics` wrapper documented accurately: always has `status`, with `schema` + `overall_interpretation` on the "ran" path or `reason` on the "skipped" path. Previously the guide said the key held the DR payload directly. - `target_parameter.reference` field documented alongside the other fields per `describe_target_parameter()` in diff_diff/_reporting_helpers.py:34-42. Regression tests extended: - `diagnostics["schema"]` wrapper wording guarded (and absence of plain "DR payload under diagnostics"). 
- `target_parameter.reference` key documentation guarded. - EfficientDiD attribution: `Chen, Sant'Anna, Xie 2025` present, `(Arkhangelsky-Imbens)` absent. - ContinuousDiD targets: `ATT(d|d)`, `ACRT`, `Strong Parallel Trends` all present. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 41 ++++++++++++++++++++++------ tests/test_profile_panel.py | 22 +++++++++++++++ 2 files changed, 54 insertions(+), 9 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index b70b0cf7..3fef0c83 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -300,8 +300,9 @@ estimators: cohort. Conservative interpretation. - `WooldridgeDiD` (ETWFE) - extended-TWFE with cohort-by-time-by- covariates interactions; heterogeneous covariate-by-cohort effects. -- `EfficientDiD` (Arkhangelsky-Imbens) - asymptotically efficient under - either `PT-All` or `PT-Post`; use `EfficientDiD.hausman_pretest` to pick. +- `EfficientDiD` (Chen, Sant'Anna, Xie 2025) - asymptotically efficient + under either `PT-All` or `PT-Post`; use `EfficientDiD.hausman_pretest` + to pick. Diagnostic: `bacon_decompose(df, ...)` shows the weight allocation of a TWFE fit to 2×2 comparison types. Forbidden-comparison weight > 10% is a @@ -362,8 +363,17 @@ worth considering. When `treatment_type == "continuous"`: -- `ContinuousDiD` - dose-response treatment, estimates average - causal response (ACR). Supports B-spline bandwidth selection. +- `ContinuousDiD` (Callaway, Goodman-Bacon, Sant'Anna 2024) - + continuous / dose-response treatment. The estimator exposes + several dose-indexed targets that require different assumptions: + `ATT(d|d)` (effect of dose `d` on units that received `d`) and + `ATT^{loc}` (binarized overall ATT) are identified under Parallel + Trends; `ATT(d)` (full dose-response curve), `ACRT(d)` (marginal + effect, i.e. 
the average causal response), and `ACRT^{glob}` + require the stronger Strong Parallel Trends assumption. The BR + headline scalar is the overall ATT; ACR and dose-response tables + are available in the result object. Supports B-spline basis + construction. - `HeterogeneousAdoptionDiD` - partial-adoption intensity, with a scalar first-stage adoption summary. Useful when adoption is graded rather than binary. @@ -527,8 +537,9 @@ Top-level keys emitted by `BusinessReport.to_dict()` - `headline: dict` - the main point estimate plus framing fields. - `target_parameter: dict` - what the headline scalar represents. Fields: `name` (e.g. `"ATT"`, `"DID_M"`, `"dose-response"`, - `"WAS"`), `definition`, `aggregation` (machine tag), and - `headline_attribute` (raw result attribute). + `"WAS"`), `definition` (plain-English description), `aggregation` + (machine tag), `headline_attribute` (raw result attribute), and + `reference` (REGISTRY.md citation string). - `assumption: dict` - named assumptions relied on (parallel trends, no anticipation, SUTVA, ...). Note: singular `"assumption"`, not `"assumptions"`. @@ -541,9 +552,12 @@ Top-level keys emitted by `BusinessReport.to_dict()` `"sample"`, not `"sample_summary"`. - `heterogeneity: dict` - heterogeneity summary if applicable. - `robustness: dict` - placebo / robustness summaries if available. -- `diagnostics: dict` - when the auto-constructed `DiagnosticReport` - is present, this is its full `to_dict()` schema nested under the - BR. +- `diagnostics: dict` - a wrapper around the auto-constructed + `DiagnosticReport`. Always has a `status` field: `"skipped"` with a + `reason` when `auto_diagnostics=False`, otherwise `"ran"` with the + full DR `to_dict()` payload under `diagnostics["schema"]` and a + mirrored `overall_interpretation` string. Parse `schema` (not + `diagnostics` directly) to access the DR sections documented below. - `next_steps: list[dict]` - Baker et al. next-step guidance from `practitioner_next_steps`. 
- `caveats: list[str]` - free-text caveats generated from failed @@ -663,6 +677,15 @@ or the outcome model is correctly specified. - **Sant'Anna, Pedro H. C., and Jun Zhao (2020).** "Doubly Robust Difference-in-Differences Estimators." Journal of Econometrics 219(1): 101-122. DR adjustment. +- **Chen, Xiaohong, Pedro H. C. Sant'Anna, and Haitian Xie (2025).** + "Efficient Difference-in-Differences and Event Study Estimators." + Primary source for the `EfficientDiD` estimator (PT-All / PT-Post + framing and efficient combination weights). +- **Callaway, Brantly, Andrew Goodman-Bacon, and Pedro H. C. + Sant'Anna (2024).** "Difference-in-Differences with a Continuous + Treatment." Primary source for `ContinuousDiD`; introduces the + Parallel Trends vs Strong Parallel Trends distinction underlying + `ATT(d|d)`, `ATT(d)`, `ACRT(d)`, and `ACRT^{glob}`. ### Online resources diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 780b2a7c..84424b28 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -505,6 +505,28 @@ def test_guide_api_strings_resolve_against_public_api(): ): assert obsolete_key not in text, f"BR/DR §6 still lists obsolete key: {obsolete_key}" + # BR `diagnostics` is a wrapper (status + schema/reason + possibly + # overall_interpretation), not the DR payload directly. Guard the + # wrapper wording so the guide does not drift back to telling + # agents to parse BR["diagnostics"] as the DR schema. + assert 'diagnostics["schema"]' in text + # target_parameter includes a `reference` field per + # describe_target_parameter(); guard its documentation. + assert "`reference` (REGISTRY.md citation string)" in text + + # Methodology source attribution: EfficientDiD is Chen, Sant'Anna, + # Xie (2025), not Arkhangelsky-Imbens. ContinuousDiD is Callaway, + # Goodman-Bacon, Sant'Anna (2024). Guard both attributions in the + # §4 prose and the §7 citation list. 
+ assert "Chen, Sant'Anna, Xie 2025" in text + assert "(Arkhangelsky-Imbens)" not in text + assert "Callaway, Goodman-Bacon, Sant'Anna 2024" in text + # ContinuousDiD prose must distinguish the PT vs SPT identified + # targets rather than collapsing everything into "ACR". + assert "ATT(d|d)" in text + assert "ACRT" in text + assert "Strong Parallel Trends" in text + def test_min_pre_post_use_per_unit_observed_support(): """On an unbalanced panel where one treated unit is missing its From eea7aa53797a8cebdbdcd84e8e5553b5adb372a6 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 09:58:12 -0400 Subject: [PATCH 09/18] Address PR #356 CI review round 7 (1 P1 guide + 1 P2 docs) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ContinuousDiD eligibility (P1): ContinuousDiD.fit() requires zero-dose control units (P(D=0) > 0) because Remark 3.1 lowest-dose-as-control is not yet implemented (REGISTRY.md L714; continuous_did.py:349-360). Guide corrections: - Matrix row: never-treated-required ✗ → ✓. - has_never_treated field doc: add ContinuousDiD to the required-by list with the Remark 3.1 note. - §4.4 no-never-treated paths: add ContinuousDiD bullet naming the ValueError and pointing at the unimplemented Remark 3.1 path. - §4.7 ContinuousDiD bullet: prepend the "Requires zero-dose control units" callout. DiagnosticReport §6 section-status semantics (P2): the real schema uses `status` for execution state ("ran", "not_applicable", "not_run", "no_scalar_by_design", "skipped") and `verdict` as a separate field for qualitative interpretation (e.g. "clean", "inconclusive", "violated"). Previously §6 conflated these: it listed pass/warn/inconclusive as status values and omitted the execution statuses. Rewrote the section-status paragraph to document the two layers explicitly; moved the forthcoming-additions note into the same block. Tests: - ContinuousDiD matrix-row cell assertion (col 5 = ✓). 
- "P(D=0) > 0" substring fingerprint guards the Remark 3.1 wording. - DR §6 status vocabulary: guards every real status the implementation emits ("ran", "not_applicable", "not_run", "no_scalar_by_design") and forbids the obsolete pass/warn/inconclusive-as-status framing; `verdict` must be mentioned as a distinct concept. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 69 ++++++++++++++++++---------- tests/test_profile_panel.py | 25 ++++++++++ 2 files changed, 70 insertions(+), 24 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 3fef0c83..b5f6af62 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -107,9 +107,11 @@ view. Every field below appears as a top-level key in that dict. - **`has_never_treated: bool`** - at least one unit has `treatment == 0` in every observed non-NaN row (applies to both binary and continuous treatment columns; for continuous this flags zero-dose control units). - Required by `SyntheticDiD`, `SunAbraham`, and `EfficientDiD` under - both `assumption="PT-All"` and `assumption="PT-Post"` unless - `control_group="last_cohort"` is passed; preferred-but-optional by + Required by `SyntheticDiD`, `SunAbraham`, `EfficientDiD` under both + `assumption="PT-All"` and `assumption="PT-Post"` (unless + `control_group="last_cohort"` is passed), and `ContinuousDiD` + (which requires `P(D=0) > 0` - Remark 3.1 lowest-dose-as-control + is not yet implemented). Preferred-but-optional by `CallawaySantAnna` and `ChaisemartinDHaultfoeuille`. Always `False` for `"categorical"`. 
- **`has_always_treated: bool`** - at least one unit has @@ -201,7 +203,7 @@ supported / out of scope; `warn` supported but with documented caveats; | `TROP` | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | partial | | `TripleDifference` | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | | `StaggeredTripleDifference` | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | -| `ContinuousDiD` | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | +| `ContinuousDiD` | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | | `HeterogeneousAdoptionDiD` | ✗ | partial | partial | ✗ | ✗ | ✗ | ✗ | ✓ | warn | **Footnotes.** @@ -326,6 +328,9 @@ When `has_never_treated == False`: (post-treatment periods at/after that cohort's adoption are trimmed). Distinct from CallawaySantAnna's `not_yet_treated` option. +- `ContinuousDiD` requires zero-dose control units (`P(D=0) > 0`). + Remark 3.1 of the paper (lowest-dose-as-control) is not yet + implemented; `fit()` raises `ValueError` when no `D=0` units exist. - `CallawaySantAnna` - use `control_group="not_yet_treated"` to use not-yet-treated units as the control pool. - `ChaisemartinDHaultfoeuille` - constructs switchers vs. non-switchers @@ -364,16 +369,18 @@ worth considering. When `treatment_type == "continuous"`: - `ContinuousDiD` (Callaway, Goodman-Bacon, Sant'Anna 2024) - - continuous / dose-response treatment. The estimator exposes - several dose-indexed targets that require different assumptions: - `ATT(d|d)` (effect of dose `d` on units that received `d`) and - `ATT^{loc}` (binarized overall ATT) are identified under Parallel - Trends; `ATT(d)` (full dose-response curve), `ACRT(d)` (marginal - effect, i.e. the average causal response), and `ACRT^{glob}` - require the stronger Strong Parallel Trends assumption. The BR - headline scalar is the overall ATT; ACR and dose-response tables - are available in the result object. Supports B-spline basis - construction. + continuous / dose-response treatment. 
**Requires zero-dose control + units (`P(D=0) > 0`)**: `fit()` raises `ValueError` when no `D=0` + rows are present, because Remark 3.1 (lowest-dose-as-control) is + not yet implemented. The estimator exposes several dose-indexed + targets that require different assumptions: `ATT(d|d)` (effect of + dose `d` on units that received `d`) and `ATT^{loc}` (binarized + overall ATT) are identified under Parallel Trends; `ATT(d)` (full + dose-response curve), `ACRT(d)` (marginal effect, i.e. the average + causal response), and `ACRT^{glob}` require the stronger Strong + Parallel Trends assumption. The BR headline scalar is the overall + ATT; ACR and dose-response tables are available in the result + object. Supports B-spline basis construction. - `HeterogeneousAdoptionDiD` - partial-adoption intensity, with a scalar first-stage adoption summary. Useful when adoption is graded rather than binary. @@ -591,16 +598,30 @@ Top-level keys (source: `diff_diff/diagnostic_report.py`): sections. - `next_steps: list[dict]` - same shape as the BR field. -Each section value is a dict with at minimum `status` -(`"pass"` / `"warn"` / `"inconclusive"` / `"error"` / `"skipped"`), -a `reason`, and section-specific payload fields. - -### Forthcoming schema additions (not yet shipped) - -- `sanity_checks: dict` - machine-legible pass / warn / fail summary - (forthcoming Wave). -- `mismatch_warnings: list[dict]` - post-hoc estimator-mismatch - detection (forthcoming Wave). +Each section value is a dict. Parse it in two layers: + +1. `status: str` — execution state, not qualitative interpretation. + The values actually emitted by `DiagnosticReport.to_dict()` are: + `"ran"` (section executed), `"not_applicable"` (check does not + apply to this estimator or design), `"not_run"` (implementation + pending), `"no_scalar_by_design"` (for estimators that return a + table instead of a scalar headline, e.g. 
dCDH with + `trends_linear=True, L_max>=2`), and `"skipped"` (auto-diagnostics + disabled or the section was short-circuited at top level). +2. `verdict: str` (only present when `status == "ran"`) — qualitative + interpretation of the executed check. Candidate values include + `"clean"`, `"inconclusive"`, `"violated"`, and section-specific + labels. + +`reason: str` is an optional free-text explanation that usually +accompanies non-`"ran"` statuses; it may also appear on `"ran"` +sections as supplementary context. The rest of each section dict is +section-specific payload (e.g. p-values, coefficients, cohort tables). + +Forthcoming schema additions (not yet shipped): a top-level +`sanity_checks` block (machine-legible pass/warn/fail summary) and a +`mismatch_warnings` list (post-hoc estimator-mismatch detection) are +queued for a later wave. Treat their current absence as expected. ## §7. Glossary + citations diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 84424b28..053e3ba3 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -527,6 +527,31 @@ def test_guide_api_strings_resolve_against_public_api(): assert "ACRT" in text assert "Strong Parallel Trends" in text + # ContinuousDiD requires zero-dose (P(D=0) > 0) because Remark 3.1 + # lowest-dose-as-control is unimplemented; matrix col 5 must be ✓. + assert cdid_cells[5] == "✓", ( + "ContinuousDiD matrix row must mark never-treated-required=✓ " + f"(P(D=0) > 0 required per Remark 3.1); got {cdid_cells[5]!r}" + ) + assert "P(D=0) > 0" in text or "P(D=0) > 0" in text + + # DR §6 section statuses: execution-state vocabulary must include + # the actual emitted values ("ran", "not_applicable", "not_run", + # "no_scalar_by_design", "skipped"), and `verdict` must be + # documented separately from `status`. Guard against drift back + # to the pass/warn/inconclusive-as-status framing. 
+ for real_status in ( + '"ran"', + '"not_applicable"', + '"not_run"', + '"no_scalar_by_design"', + ): + assert real_status in text, f"DR §6 section-status vocabulary must document {real_status}" + # `status` must not be described as "pass/warn/inconclusive" — + # those belong under `verdict`. + assert '`"pass"` / `"warn"` / `"inconclusive"`' not in text + assert "verdict" in text.lower() + def test_min_pre_post_use_per_unit_observed_support(): """On an unbalanced panel where one treated unit is missing its From e6c5b5708973e38d76ee494f7260af0af47df5b0 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 10:08:36 -0400 Subject: [PATCH 10/18] Address PR #356 CI review round 8 (2 P1 methodology + 1 P2 tests) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit ContinuousDiD staggered support (P1 #1): the matrix marked staggered=✗, but the method natively supports staggered adoption via the `first_treat` column (continuous_did.py:159-169, 919-925; REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓. Time-invariant dose requirement (P1 #2): ContinuousDiD.fit() requires dose to be time-invariant per unit (continuous_did.py:222-228; docs/methodology/continuous-did.md:L70-75), but profile_panel() did not expose this so time-varying-dose continuous panels were routed to ContinuousDiD only to hard-fail at fit time. Added `PanelProfile.treatment_varies_within_unit: bool` — True iff any unit has more than one distinct non-NaN treatment value across its observed rows. Computed unconditionally for numeric (non-bool) treatment columns; False for categorical. `to_dict()` exposes it. Guide §2 documents the field, §4.7 ContinuousDiD bullet lists two eligibility prerequisites: P(D=0) > 0 AND treatment_varies_within_unit == False. Tests (P2): - test_continuous_treatment_with_time_varying_dose: random-per-row continuous panel -> treatment_varies_within_unit=True. 
- test_continuous_treatment (existing): constant-per-unit dose -> treatment_varies_within_unit=False. - test_binary_absorbing_varies_within_unit: binary absorbing panel always True by construction. - Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓; guide mentions "time-invariant" and "treatment_varies_within_unit". - to_dict JSON round-trip set extended with the new key. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 39 ++++++++++++++++-------- diff_diff/profile.py | 11 +++++++ tests/test_profile_panel.py | 44 ++++++++++++++++++++++++++++ 3 files changed, 81 insertions(+), 13 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index b5f6af62..a88e1e18 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -118,6 +118,14 @@ view. Every field below appears as a top-level key in that dict. strictly-positive treatment in every observed non-NaN row. Such units provide no pre-treatment identification and are dropped by most estimators. Always `False` for `"categorical"`. +- **`treatment_varies_within_unit: bool`** - at least one unit has more + than one distinct non-NaN treatment value across its observed rows. + For binary panels this is normally `True` (pre vs. post the adoption + period), and for continuous panels this flags time-varying dose. + `ContinuousDiD.fit()` requires this to be `False` (dose must be + time-invariant per unit, per Callaway et al. 2024); a `True` value on + a continuous panel rules the estimator out. Always `False` for + `"categorical"`. 
### Timing @@ -203,7 +211,7 @@ supported / out of scope; `warn` supported but with documented caveats; | `TROP` | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | partial | | `TripleDifference` | ✓ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | | `StaggeredTripleDifference` | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | -| `ContinuousDiD` | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | +| `ContinuousDiD` | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | | `HeterogeneousAdoptionDiD` | ✗ | partial | partial | ✗ | ✗ | ✗ | ✗ | ✓ | warn | **Footnotes.** @@ -369,18 +377,23 @@ worth considering. When `treatment_type == "continuous"`: - `ContinuousDiD` (Callaway, Goodman-Bacon, Sant'Anna 2024) - - continuous / dose-response treatment. **Requires zero-dose control - units (`P(D=0) > 0`)**: `fit()` raises `ValueError` when no `D=0` - rows are present, because Remark 3.1 (lowest-dose-as-control) is - not yet implemented. The estimator exposes several dose-indexed - targets that require different assumptions: `ATT(d|d)` (effect of - dose `d` on units that received `d`) and `ATT^{loc}` (binarized - overall ATT) are identified under Parallel Trends; `ATT(d)` (full - dose-response curve), `ACRT(d)` (marginal effect, i.e. the average - causal response), and `ACRT^{glob}` require the stronger Strong - Parallel Trends assumption. The BR headline scalar is the overall - ATT; ACR and dose-response tables are available in the result - object. Supports B-spline basis construction. + continuous / dose-response treatment. **Two eligibility + prerequisites**: (a) zero-dose control units must exist + (`P(D=0) > 0`) because Remark 3.1 (lowest-dose-as-control) is not + yet implemented, and (b) dose must be time-invariant per unit (rule + out panels where `PanelProfile.treatment_varies_within_unit == + True`). `fit()` raises `ValueError` in either case. Note that + staggered adoption IS supported natively (adoption timing is + expressed via the `first_treat` column, not via within-unit dose + variation). 
The estimator exposes several dose-indexed targets that + require different assumptions: `ATT(d|d)` (effect of dose `d` on + units that received `d`) and `ATT^{loc}` (binarized overall ATT) + are identified under Parallel Trends; `ATT(d)` (full dose-response + curve), `ACRT(d)` (marginal effect, i.e. the average causal + response), and `ACRT^{glob}` require the stronger Strong Parallel + Trends assumption. The BR headline scalar is the overall ATT; ACR + and dose-response tables are available in the result object. + Supports B-spline basis construction. - `HeterogeneousAdoptionDiD` - partial-adoption intensity, with a scalar first-stage adoption summary. Useful when adoption is graded rather than binary. diff --git a/diff_diff/profile.py b/diff_diff/profile.py index 98827b51..7766a6f5 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -62,6 +62,7 @@ class PanelProfile: cohort_sizes: Mapping[Any, int] has_never_treated: bool has_always_treated: bool + treatment_varies_within_unit: bool first_treatment_period: Optional[Any] last_treatment_period: Optional[Any] @@ -91,6 +92,7 @@ def to_dict(self) -> Dict[str, Any]: "cohort_sizes": {_jsonable_key(k): int(v) for k, v in self.cohort_sizes.items()}, "has_never_treated": self.has_never_treated, "has_always_treated": self.has_always_treated, + "treatment_varies_within_unit": self.treatment_varies_within_unit, "first_treatment_period": _jsonable(self.first_treatment_period), "last_treatment_period": _jsonable(self.last_treatment_period), "min_pre_periods": self.min_pre_periods, @@ -245,6 +247,14 @@ def profile_panel( last_tp, ) = _classify_treatment(df, unit=unit, time=time, treatment=treatment) + if pd.api.types.is_numeric_dtype(df[treatment]) and not pd.api.types.is_bool_dtype( + df[treatment] + ): + per_unit_distinct = df.groupby(unit)[treatment].nunique(dropna=True) + treatment_varies_within_unit = bool((per_unit_distinct > 1).any()) + else: + treatment_varies_within_unit = False + min_pre, min_post = 
_compute_pre_post( df, unit=unit, @@ -289,6 +299,7 @@ def profile_panel( cohort_sizes=cohort_sizes, has_never_treated=has_never_treated, has_always_treated=has_always_treated, + treatment_varies_within_unit=treatment_varies_within_unit, first_treatment_period=first_tp, last_treatment_period=last_tp, min_pre_periods=min_pre, diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 053e3ba3..15457fc7 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -99,6 +99,34 @@ def test_continuous_treatment(): assert profile.treatment_type == "continuous" assert profile.cohort_sizes == {} assert profile.is_staggered is False + # Each unit has a constant dose across all periods → time-invariant. + assert profile.treatment_varies_within_unit is False + + +def test_continuous_treatment_with_time_varying_dose(): + """Time-varying dose must be flagged so agents routed to + ContinuousDiD do not hit the fit-time "dose must be time-invariant" + ValueError. treatment_varies_within_unit == True signals the + incompatibility.""" + rng = np.random.default_rng(0) + rows = [] + for u in range(1, 21): + for t in range(4): + dose = float(rng.uniform(0, 5)) + rows.append({"u": u, "t": t, "tr": dose, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "continuous" + assert profile.treatment_varies_within_unit is True + + +def test_binary_absorbing_varies_within_unit(): + """Binary-absorbing panels have within-unit treatment variation by + construction (0 pre, 1 post). 
The field is True.""" + first_treat = {u: 2 for u in range(11, 21)} + df = _make_panel(n_units=20, periods=range(0, 4), first_treat=first_treat) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_varies_within_unit is True def test_categorical_treatment_object_dtype(): @@ -214,6 +242,7 @@ def test_to_dict_is_json_serializable(): "cohort_sizes", "has_never_treated", "has_always_treated", + "treatment_varies_within_unit", "first_treatment_period", "last_treatment_period", "min_pre_periods", @@ -535,6 +564,21 @@ def test_guide_api_strings_resolve_against_public_api(): ) assert "P(D=0) > 0" in text + # ContinuousDiD DOES support staggered adoption natively (via the + # `first_treat` column). Matrix column 2 (staggered) must be ✓. + assert cdid_cells[2] == "✓", ( + "ContinuousDiD matrix row must mark staggered=✓ " + "(adoption timing via first_treat is supported); " + f"got {cdid_cells[2]!r}" + ) + + # ContinuousDiD also requires dose to be time-invariant per unit; + # this is the second eligibility prerequisite the guide must spell + # out. Guide text must mention the invariant explicitly AND the + # `treatment_varies_within_unit` field used to detect it. 
+ assert "time-invariant" in text + assert "treatment_varies_within_unit" in text + # DR §6 section statuses: execution-state vocabulary must include # the actual emitted values ("ran", "not_applicable", "not_run", # "no_scalar_by_design", "skipped"), and `verdict` must be From 57d42a005caa29d8b5b2ecc20761c0285efe77d9 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 10:49:44 -0400 Subject: [PATCH 11/18] Address PR #356 CI review round 9 (1 P1 + 1 P2 semantic) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rebased onto current main (resolved CHANGELOG.md conflict: the by_path bullet from PR #355 and the profile_panel/autonomous-guide bullet from this PR now live side-by-side under [Unreleased]). has_always_treated now has binary-only semantics: - For binary treatment (absorbing or non-absorbing): unit_min == 1 means the unit is treated in every observed period (no pre-treatment information in the DiD sense). - For continuous treatment: always False. Pre-treatment periods on continuous DiD are determined by the separate `first_treat` column supplied to `ContinuousDiD.fit`, not by whether the dose is positive. A unit with a constant positive dose can still have well-defined pre-treatment periods, so flagging it as "always-treated / no pre-treatment information" was factually wrong and triggered the misleading `has_always_treated_units` alert on valid continuous panels. - Categorical: False by construction. Guide §2 has_always_treated field doc updated to state the binary-only semantics explicitly, with a note about `first_treat`. Tests: - New: test_continuous_positive_dose_does_not_fire_has_always_treated asserts has_always_treated=False AND the alert does not fire on a constant-positive-dose continuous panel. - Existing test_continuous_zero_dose_controls_flag_has_never_treated updated: has_always_treated expected to be False (was True under the old semantics). 
Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 13 +++++++---- diff_diff/profile.py | 19 ++++++++++++----- tests/test_profile_panel.py | 32 ++++++++++++++++++++++++++-- 3 files changed, 53 insertions(+), 11 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index a88e1e18..90650169 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -114,10 +114,15 @@ view. Every field below appears as a top-level key in that dict. is not yet implemented). Preferred-but-optional by `CallawaySantAnna` and `ChaisemartinDHaultfoeuille`. Always `False` for `"categorical"`. -- **`has_always_treated: bool`** - at least one unit has - strictly-positive treatment in every observed non-NaN row. Such units - provide no pre-treatment identification and are dropped by most - estimators. Always `False` for `"categorical"`. +- **`has_always_treated: bool`** - at least one binary-treatment + unit has `treatment == 1` in every observed non-NaN row (no + pre-treatment information for that unit in the DiD sense). + Binary-only semantics: for `"continuous"` panels this field is + always `False` because pre-treatment periods are determined by the + `first_treat` column supplied to `ContinuousDiD.fit()`, not by + whether the dose is positive - a unit with a constant positive dose + can still have well-defined pre-treatment periods. Always `False` + for `"categorical"` too. - **`treatment_varies_within_unit: bool`** - at least one unit has more than one distinct non-NaN treatment value across its observed rows. For binary panels this is normally `True` (pre vs. 
post the adoption diff --git a/diff_diff/profile.py b/diff_diff/profile.py index 7766a6f5..0a1d15cb 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -353,16 +353,25 @@ def _classify_treatment( if n_distinct == 0: return ("categorical", False, {}, False, False, None, None) - # Generic never-/always-treated semantics (applies to both binary - # and continuous numeric treatment): "never-treated" means the unit - # has treatment == 0 in every observed non-NaN row; "always-treated" - # means treatment > 0 in every observed non-NaN row. + # has_never_treated has a single well-defined meaning across binary + # and continuous numeric treatment: some unit has treatment == 0 in + # every observed non-NaN row. For binary this is the clean-control + # group; for continuous this is the zero-dose control required by + # ContinuousDiD (P(D=0) > 0). unit_max = df.groupby(unit)[treatment].max().to_numpy() unit_min = df.groupby(unit)[treatment].min().to_numpy() has_never_treated = bool(np.any(unit_max == 0)) - has_always_treated = bool(np.any(unit_min > 0)) is_binary_valued = values_set <= {0, 1, 0.0, 1.0} + # has_always_treated has binary-only semantics: "unit is treated in + # every observed period" = unit_min == 1 on a binary panel (no + # pre-treatment information). For continuous panels, positive dose + # throughout does not mean "always treated in the DiD sense" + # (pre-treatment periods are determined by `first_treat`, not by + # whether the dose is positive), so this field is False for + # continuous / categorical types. 
+ has_always_treated = is_binary_valued and bool(np.any(unit_min == 1)) + if not is_binary_valued: return ( "continuous", diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 15457fc7..f1d98183 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -129,6 +129,32 @@ def test_binary_absorbing_varies_within_unit(): assert profile.treatment_varies_within_unit is True +def test_continuous_positive_dose_does_not_fire_has_always_treated(): + """Valid ContinuousDiD panels have units with a constant positive + dose across all periods AND well-defined pre-treatment periods + (via a separate `first_treat` column). `has_always_treated` has + binary-only semantics, so it must be False on continuous panels + regardless of dose positivity. Previously the field conflated + "positive dose throughout" with "always treated in the DiD sense", + which fired the misleading `has_always_treated_units` alert on + valid continuous-DiD panels.""" + rng = np.random.default_rng(0) + rows = [] + for u in range(1, 21): + dose = 0.0 if u <= 5 else 2.5 + for t in range(4): + rows.append({"u": u, "t": t, "tr": dose, "y": rng.normal()}) + df = pd.DataFrame(rows) + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "continuous" + assert profile.has_never_treated is True + assert profile.has_always_treated is False, ( + "has_always_treated must be False on continuous panels regardless " + "of dose positivity (binary-only semantics)" + ) + assert "has_always_treated_units" not in _alert_codes(profile) + + def test_categorical_treatment_object_dtype(): rows = [] for u in range(1, 11): @@ -386,7 +412,9 @@ def test_reversal_through_nan_is_binary_non_absorbing(): def test_continuous_zero_dose_controls_flag_has_never_treated(): """Continuous treatment with some zero-dose units must flag has_never_treated=True. 
Previously continuous panels hardcoded - has_never_treated=False regardless of control availability.""" + has_never_treated=False regardless of control availability. + has_always_treated has binary-only semantics and must remain + False on continuous panels regardless of dose positivity.""" rows = [] rng = np.random.default_rng(0) for u in range(1, 21): @@ -397,7 +425,7 @@ def test_continuous_zero_dose_controls_flag_has_never_treated(): profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") assert profile.treatment_type == "continuous" assert profile.has_never_treated is True - assert profile.has_always_treated is True + assert profile.has_always_treated is False def test_guide_api_strings_resolve_against_public_api(): From a65b5fa66ab082b1e82ab64b0b57a666d42cc1b4 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 11:03:10 -0400 Subject: [PATCH 12/18] Address PR #356 CI review round 10 (1 P1 + 1 P2 + 1 P3) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Balanced-panel eligibility (P1): ContinuousDiD, EfficientDiD, SyntheticDiD, and HeterogeneousAdoptionDiD all hard-reject unbalanced panels at fit() time (continuous_did.py:329-338, efficient_did.py: 407-414, synthetic_did.py:399-412, had.py:1173-1188; REGISTRY.md cross-refs). Guide updates surface this: - New "Balanced-panel eligibility" block after §3 matrix footnotes names the four affected estimators and points at `PanelProfile.is_balanced == True` as the gate. Directs users with unbalanced panels to `diff_diff.prep.balance_panel()` or to a balance-tolerant estimator. - §4 per-estimator bullets for all four estimators prepend or append the balanced-panel requirement with the specific fit() error the caller would otherwise hit. - ContinuousDiD §4.7 bullet now lists THREE eligibility prerequisites (zero-dose controls, time-invariant dose, balanced panel) where it previously listed two. 
Docstring (P3): profile_panel() docstring notes block updated to match the binary-only has_always_treated semantics shipped in round 9. The old wording claimed the field fired on "strictly positive treatment in every observed non-NaN row" across numeric types, which no longer matches the implementation. Tests (P2): - Semantic guide test asserts `is_balanced` is mentioned in the guide and each of the four balance-sensitive estimators appears within 400 characters of a "balanced" / "is_balanced" marker, so future edits cannot silently drop the eligibility gate from any of them. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 28 ++++++++++++++++++------ diff_diff/profile.py | 22 +++++++++++++------ tests/test_profile_panel.py | 32 ++++++++++++++++++++++++++++ 3 files changed, 69 insertions(+), 13 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 90650169..4c0be2d0 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -260,6 +260,16 @@ supported / out of scope; `warn` supported but with documented caveats; intensity as a continuous first-stage variable; not a pure dose-response estimator - use `ContinuousDiD` for that. +**Balanced-panel eligibility.** The following estimators hard-reject +unbalanced panels (each raises `ValueError` at `fit()` when a unit is +missing any period): `ContinuousDiD`, `EfficientDiD`, `SyntheticDiD`, +`HeterogeneousAdoptionDiD`. Gate these on +`PanelProfile.is_balanced == True`; if `False`, pre-process with +`diff_diff.prep.balance_panel()` or pick a balance-tolerant +estimator from the remaining rows (CS/SA/dCDH/Imputation/TwoStage/ +Stacked/ETWFE all accept unbalanced input, with some caveats in their +own docs). + ## §4. Estimator-choice reasoning by design feature @@ -317,7 +327,8 @@ estimators: covariates interactions; heterogeneous covariate-by-cohort effects. 
- `EfficientDiD` (Chen, Sant'Anna, Xie 2025) - asymptotically efficient under either `PT-All` or `PT-Post`; use `EfficientDiD.hausman_pretest` - to pick. + to pick. Requires a balanced panel (`PanelProfile.is_balanced == + True`); `fit()` raises `ValueError` on unbalanced input. Diagnostic: `bacon_decompose(df, ...)` shows the weight allocation of a TWFE fit to 2×2 comparison types. Forbidden-comparison weight > 10% is a @@ -382,12 +393,13 @@ worth considering. When `treatment_type == "continuous"`: - `ContinuousDiD` (Callaway, Goodman-Bacon, Sant'Anna 2024) - - continuous / dose-response treatment. **Two eligibility + continuous / dose-response treatment. **Three eligibility prerequisites**: (a) zero-dose control units must exist (`P(D=0) > 0`) because Remark 3.1 (lowest-dose-as-control) is not - yet implemented, and (b) dose must be time-invariant per unit (rule - out panels where `PanelProfile.treatment_varies_within_unit == - True`). `fit()` raises `ValueError` in either case. Note that + yet implemented, (b) dose must be time-invariant per unit (rule out + panels where `PanelProfile.treatment_varies_within_unit == True`), + and (c) the panel must be balanced (`PanelProfile.is_balanced == + True`). `fit()` raises `ValueError` in any of the three cases. Note that staggered adoption IS supported natively (adoption timing is expressed via the `first_treat` column, not via within-unit dose variation). The estimator exposes several dose-indexed targets that @@ -411,6 +423,8 @@ but derivable from `cohort_sizes` + `has_never_treated`): - `SyntheticDiD` - synthetic-control-meets-DiD. Requires never-treated donors and sufficient pre-treatment periods (Arkhangelsky et al. 2021). Block treatment only: all treated units must adopt at the same time. + Requires a balanced panel (`PanelProfile.is_balanced == True`); + `fit()` raises `ValueError` and points at `balance_panel()`. - `TROP` - factor-model-based generalized synthetic control. 
Uses every unit untreated at period `t` as the donor pool (via the absorbing-state D matrix); supports staggered adoption and more complex factor @@ -426,7 +440,9 @@ methods in the library are preferred. When adoption varies in strength across units (partial-adoption settings, intensity of exposure differs): -- `HeterogeneousAdoptionDiD` - targets a Weighted Average Slope (WAS) +- `HeterogeneousAdoptionDiD` - requires a balanced panel + (`PanelProfile.is_balanced == True`; `fit()` raises `ValueError` + when any unit is missing a period). Targets a Weighted Average Slope (WAS) on single-period Heterogeneous Adoption Designs where no genuinely untreated group exists (paper Equation 2 / Theorem 1). The `target_parameter` attribute on the results object is literally diff --git a/diff_diff/profile.py b/diff_diff/profile.py index 0a1d15cb..9a24a345 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -192,13 +192,21 @@ def profile_panel( ``"categorical"``; cast to ``int`` if you want binary-treatment profiling. - ``has_never_treated`` and ``has_always_treated`` are computed - generically across numeric treatment types (both binary and - continuous). ``has_never_treated`` fires when some unit has - ``treatment == 0`` in every observed non-NaN row; for continuous - panels this flags zero-dose controls. ``has_always_treated`` fires - when some unit has strictly-positive treatment in every observed - non-NaN row. Both are always ``False`` for ``"categorical"``. + ``has_never_treated`` is computed across both binary and + continuous numeric treatment types: some unit has ``treatment == + 0`` in every observed non-NaN row. For binary this flags the + clean-control group; for continuous this flags zero-dose controls + (required by ``ContinuousDiD``). Always ``False`` for + ``"categorical"``. + + ``has_always_treated`` has binary-only semantics: some unit has + ``treatment == 1`` in every observed non-NaN row (no pre-treatment + information in the DiD sense). 
For ``"continuous"`` and + ``"categorical"`` treatment this field is always ``False`` + regardless of dose positivity — pre-treatment periods on + continuous DiD are determined by the separate ``first_treat`` + column passed to ``ContinuousDiD.fit``, not by whether the dose + is strictly positive. Rows with ``NaN`` in ``unit`` or ``time`` are dropped up front and surfaced via the ``missing_id_rows_dropped`` alert; all subsequent diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index f1d98183..70b4c851 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -624,6 +624,38 @@ def test_guide_api_strings_resolve_against_public_api(): assert '`"pass"` / `"warn"` / `"inconclusive"`' not in text assert "verdict" in text.lower() + # Balanced-panel eligibility: ContinuousDiD, EfficientDiD, + # SyntheticDiD, and HeterogeneousAdoptionDiD all hard-reject + # unbalanced panels at fit() time. The guide must surface this + # so agents gate these estimators on PanelProfile.is_balanced + # before selecting them. 
+ assert "is_balanced" in text, ( + "Guide must mention PanelProfile.is_balanced as an eligibility " + "check for balance-sensitive estimators" + ) + for estimator in ( + "ContinuousDiD", + "EfficientDiD", + "SyntheticDiD", + "HeterogeneousAdoptionDiD", + ): + idx = 0 + found = False + while idx < len(text): + loc = text.find(estimator, idx) + if loc < 0: + break + window = text[max(0, loc - 400) : loc + 400] + if "balanced" in window.lower() or "is_balanced" in window: + found = True + break + idx = loc + 1 + assert found, ( + f"Guide must mention a balanced-panel constraint near the " + f"{estimator!r} bullet / row (hard-rejects unbalanced panels " + "at fit time)" + ) + def test_min_pre_post_use_per_unit_observed_support(): """On an unbalanced panel where one treated unit is missing its From 9a95d2f634d1395825f83c3d5b495cdb8544e7f4 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 11:13:03 -0400 Subject: [PATCH 13/18] Address PR #356 CI review round 11 (1 P1 guide + 1 P2 test) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit HeterogeneousAdoptionDiD staggered support is `partial` (per §3 matrix), but the restriction was never unpacked. Per REGISTRY.md L2281 and had.py:1100-1288, Appendix B.2 limits staggered HAD to the **last treatment cohort plus never-treated units**. With `first_treat_col` supplied, `fit(aggregate="event_study")` auto-filters to F_last and emits a UserWarning naming kept/dropped counts; earlier-cohort units are dropped. Without `first_treat_col` on a multi-cohort panel, fit() raises a front-door ValueError pointing at ChaisemartinDHaultfoeuille for full staggered support. Guide updates: - New §3 footnote on the HAD `partial` cell spelling out the last- cohort-only restriction, the `first_treat_col` requirement for the auto-filter, and the ChaisemartinDHaultfoeuille fallback. 
- §4.9 HAD bullet appended with a "Staggered-timing scope is last- cohort-only" paragraph carrying the same contract plus the "last-cohort-only WAS" estimand clarification. Tests: - Semantic guide test asserts "last-cohort-only" (or "last cohort") wording, "first_treat_col" token, and ChaisemartinDHaultfoeuille as the fallback are all present so future guide edits cannot silently drop the Appendix B.2 disclosure. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 22 ++++++++++++++++++++++ tests/test_profile_panel.py | 19 +++++++++++++++++++ 2 files changed, 41 insertions(+) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 4c0be2d0..894b0f1c 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -259,6 +259,15 @@ supported / out of scope; `warn` supported but with documented caveats; - `HeterogeneousAdoptionDiD` continuous: supports partial-adoption intensity as a continuous first-stage variable; not a pure dose-response estimator - use `ContinuousDiD` for that. +- `HeterogeneousAdoptionDiD` staggered support is `partial`, not + general. Paper Appendix B.2 restricts staggered use to the + **last treatment cohort plus never-treated units**. With + `aggregate="event_study"` and a `first_treat_col` kwarg, + `fit()` auto-filters to `F_last = max(cohorts)` and emits a + `UserWarning` naming kept/dropped counts; earlier-cohort units + are dropped. Without `first_treat_col`, a multi-cohort panel + raises `ValueError`. For full staggered support that retains + every cohort, use `ChaisemartinDHaultfoeuille` instead. **Balanced-panel eligibility.** The following estimators hard-reject unbalanced panels (each raises `ValueError` at `fit()` when a unit is @@ -459,6 +468,19 @@ intensity of exposure differs): pre-test battery does not and cannot validate it. Not ATT-shaped; do not relabel the headline as ATT in report text. 
+ **Staggered-timing scope is last-cohort-only (Appendix B.2).** + HAD's staggered support is the `partial` cell in §3: on a + multi-cohort panel passed to `aggregate="event_study"`, `fit()` + auto-filters to the last treatment cohort (`F_last = + max(cohorts)`) plus never-treated units and emits a + `UserWarning` naming kept/dropped counts; earlier treated + cohorts are dropped. The `first_treat_col` kwarg is + **required** for the auto-filter to activate; without it a + multi-cohort panel raises `ValueError` pointing the caller at + `ChaisemartinDHaultfoeuille` for full staggered support. The + resulting estimand is a **last-cohort-only WAS**, not a + multi-cohort average — report it as such. + ### §4.10 Repeated cross-sections (no panel structure) `profile_panel` assumes long-format panel data. When the same units are diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 70b4c851..6d1c85d5 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -656,6 +656,25 @@ def test_guide_api_strings_resolve_against_public_api(): "at fit time)" ) + # HeterogeneousAdoptionDiD staggered support is `partial` and + # specifically last-cohort-only (Appendix B.2): with first_treat_col + # supplied, fit() auto-filters to F_last + never-treated; without + # first_treat_col, a multi-cohort panel raises. Guide must surface + # this explicitly so agents don't route a general staggered panel + # to HAD expecting a multi-cohort estimand. 
+ assert "last-cohort-only" in text or "last cohort" in text.lower(), ( + "Guide must name the last-cohort-only restriction on HAD " + "staggered support (Appendix B.2)" + ) + assert "first_treat_col" in text, ( + "Guide must mention that first_treat_col is required to activate " + "HAD's staggered last-cohort auto-filter" + ) + assert "ChaisemartinDHaultfoeuille" in text, ( + "Guide must point at ChaisemartinDHaultfoeuille as the fallback " + "for full staggered support" + ) + def test_min_pre_post_use_per_unit_observed_support(): """On an unbalanced panel where one treated unit is missing its From 610b8aa975c6514eeb8c23cd4b19dd03ba8525bf Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 11:26:10 -0400 Subject: [PATCH 14/18] Address PR #356 CI review round 12 (1 P2 guide) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Balanced-panel eligibility gate tightened. `PanelProfile.is_balanced` is computed from the unique `(unit, time)` support, so it stays `True` even when duplicate rows exist — `duplicate_unit_time_rows` is a separate alert for that case. But ContinuousDiD / EfficientDiD / HeterogeneousAdoptionDiD all require exactly one observation per cell: EfficientDiD and HAD raise ValueError on duplicates at fit() time, and ContinuousDiD's precompute path silently resolves duplicates via last-row-wins (which can change the estimand without warning). Guide §3 balanced-panel-eligibility block now requires BOTH `is_balanced == True` AND absence of the `duplicate_unit_time_rows` alert before routing to these estimators, with the specific failure mode (raise vs silent overwrite) named per-estimator and a concrete two-step fix (`balance_panel()` + `drop_duplicates([unit, time])`). 
Tests: extended the semantic guide test to assert the guide mentions `duplicate_unit_time_rows` and uses "BOTH/both" wording around the is_balanced gate, so future edits cannot silently drop the duplicate- row half of the eligibility requirement. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 25 +++++++++++++++++-------- tests/test_profile_panel.py | 16 ++++++++++++++++ 2 files changed, 33 insertions(+), 8 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 894b0f1c..88080844 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -269,15 +269,24 @@ supported / out of scope; `warn` supported but with documented caveats; raises `ValueError`. For full staggered support that retains every cohort, use `ChaisemartinDHaultfoeuille` instead. -**Balanced-panel eligibility.** The following estimators hard-reject -unbalanced panels (each raises `ValueError` at `fit()` when a unit is -missing any period): `ContinuousDiD`, `EfficientDiD`, `SyntheticDiD`, -`HeterogeneousAdoptionDiD`. Gate these on -`PanelProfile.is_balanced == True`; if `False`, pre-process with -`diff_diff.prep.balance_panel()` or pick a balance-tolerant +**Balanced-panel eligibility.** The following estimators require +exactly one observation per `(unit, time)` cell with every unit +observed in every period: `ContinuousDiD`, `EfficientDiD`, +`SyntheticDiD`, `HeterogeneousAdoptionDiD`. Gate these on BOTH +`PanelProfile.is_balanced == True` AND the absence of the +`duplicate_unit_time_rows` alert (`is_balanced` is computed from the +unique-key support and stays `True` when duplicates exist; the +alert is the separate signal for duplicates). 
Treat both +conditions as hard gates: `EfficientDiD` and +`HeterogeneousAdoptionDiD` raise `ValueError` at `fit()` on +duplicate cells, and `ContinuousDiD`'s precompute path resolves +duplicates with last-row-wins (silent overwrite that can change +the estimand). If either condition fails, pre-process with +`diff_diff.prep.balance_panel()` and a +`drop_duplicates([unit, time])` pass, or pick a balance-tolerant estimator from the remaining rows (CS/SA/dCDH/Imputation/TwoStage/ -Stacked/ETWFE all accept unbalanced input, with some caveats in their -own docs). +Stacked/ETWFE all accept unbalanced input, with some caveats in +their own docs). ## §4. Estimator-choice reasoning by design feature diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 6d1c85d5..6fee23e1 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -675,6 +675,22 @@ def test_guide_api_strings_resolve_against_public_api(): "for full staggered support" ) + # Balanced-panel gate is incomplete with `is_balanced` alone because + # duplicate (unit, time) rows don't flip is_balanced. Guide must + # require BOTH is_balanced == True AND absence of the + # duplicate_unit_time_rows alert before routing to the duplicate- + # intolerant estimators (ContinuousDiD silently overwrites + # duplicates via last-row-wins; EfficientDiD/HAD raise). 
+ assert "duplicate_unit_time_rows" in text, ( + "Guide must name the duplicate_unit_time_rows alert as part of " + "the balanced-panel eligibility gate" + ) + assert "BOTH" in text or "both" in text, ( + "Guide must require BOTH is_balanced and absence of the " + "duplicate_unit_time_rows alert before routing to duplicate-" + "intolerant estimators" + ) + def test_min_pre_post_use_per_unit_observed_support(): """On an unbalanced panel where one treated unit is missing its From 2ba101076c34e0a9d20ff1e7c2024f5878d93643 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 12:06:37 -0400 Subject: [PATCH 15/18] Address PR #356 CI review round 13 (1 P1 guide + code) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Bool-dtype treatment columns are now classified the same way as numeric {0, 1} rather than as "categorical". The library's binary estimators validate value support via `validate_binary` (utils.py: 49-67), which accepts bool because True/False coerce to 1/0 numerically. Classifying bool as categorical silently routed valid binary DiD panels away from the supported estimator set. Changes: - _classify_treatment() no longer early-returns "categorical" for bool dtype. The downstream absorbing/non-absorbing logic handles bool by casting to int before np.diff (raw bool diff is XOR, which would mask a True -> False transition). - treatment_varies_within_unit now includes bool-dtype columns (was previously hardcoded False for bool). - Guide §2 removes the "bool = categorical" rule and adds an explicit "bool is binary" note with a pointer to validate_binary as the reason. - profile_panel() docstring mirrors the same update. Tests: - test_bool_dtype_treatment_is_binary_absorbing: staggered-style bool panel with never-treated cohort -> binary_absorbing, correct has_never_treated / treatment_varies_within_unit / cohort_sizes. 
- test_bool_dtype_non_absorbing: reversible False -> True -> False bool panel -> binary_non_absorbing. Guards the int-cast before np.diff so future refactors don't regress to bool XOR semantics. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 15 +++++++---- diff_diff/profile.py | 29 +++++++++++++-------- tests/test_profile_panel.py | 38 ++++++++++++++++++++++++++++ 3 files changed, 67 insertions(+), 15 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 88080844..4024acf8 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -90,11 +90,16 @@ view. Every field below appears as a top-level key in that dict. two-valued numeric column whose values are not in {0, 1} (e.g., a dose, a discrete-integer partial-adoption score). Use `ContinuousDiD` or `HeterogeneousAdoptionDiD`. - - `"categorical"`: non-numeric dtype (object / category), a bool - dtype column, or a column that is entirely NaN. Often indicates - a treatment arm. Encode each arm as a binary indicator and fit - separately, or use a multi-treatment workflow outside the - current estimator suite. + - `"categorical"`: non-numeric dtype (object / category), or a + column that is entirely NaN. Often indicates a treatment arm. + Encode each arm as a binary indicator and fit separately, or + use a multi-treatment workflow outside the current estimator + suite. + + Bool-dtype treatment columns (`True` / `False`) are classified the + same way as numeric `{0, 1}`: the library's binary estimators + validate on value support rather than dtype, so `True` and `False` + behave like `1` and `0` for absorbing / non-absorbing classification. - **`is_staggered: bool`** - true iff treatment is `binary_absorbing` and at least two distinct first-treatment periods are observed. Drives the choice between classic DiD/TWFE and staggered-robust estimators. 
diff --git a/diff_diff/profile.py b/diff_diff/profile.py index 9a24a345..b343f9d4 100644 --- a/diff_diff/profile.py +++ b/diff_diff/profile.py @@ -185,12 +185,14 @@ def profile_panel( - ``"continuous"``: numeric treatment with more than two distinct values, or a 2-valued numeric whose values are not in :math:`\\{0, 1\\}` (matches the ``ContinuousDiD`` convention). - - ``"categorical"``: non-numeric dtype (object / category), a - boolean-dtype column, or a column that is entirely NaN. + - ``"categorical"``: non-numeric dtype (object / category) or a + column that is entirely NaN. - Boolean-dtype columns are intentionally classified as - ``"categorical"``; cast to ``int`` if you want binary-treatment - profiling. + Bool-dtype columns (``True`` / ``False``) are classified the same + way as numeric ``{0, 1}``: the library's binary estimators validate + on value support via :func:`diff_diff.utils.validate_binary`, so + ``True`` / ``False`` behave like ``1`` / ``0`` for absorbing / + non-absorbing classification. ``has_never_treated`` is computed across both binary and continuous numeric treatment types: some unit has ``treatment == @@ -255,9 +257,7 @@ def profile_panel( last_tp, ) = _classify_treatment(df, unit=unit, time=time, treatment=treatment) - if pd.api.types.is_numeric_dtype(df[treatment]) and not pd.api.types.is_bool_dtype( - df[treatment] - ): + if pd.api.types.is_numeric_dtype(df[treatment]) or pd.api.types.is_bool_dtype(df[treatment]): per_unit_distinct = df.groupby(unit)[treatment].nunique(dropna=True) treatment_varies_within_unit = bool((per_unit_distinct > 1).any()) else: @@ -352,7 +352,13 @@ def _classify_treatment( is_numeric = pd.api.types.is_numeric_dtype(col) is_bool = pd.api.types.is_bool_dtype(col) - if (not is_numeric) or is_bool: + # Bool-dtype treatment columns are treated as binary 0/1 inputs. 
+ # The library's binary estimators validate value support via + # `validate_binary`, which accepts bool because True/False coerce + # to 1/0 numerically. Classifying bool columns as "categorical" + # here would route a valid binary design away from the supported + # estimator set. + if (not is_numeric) and (not is_bool): return ("categorical", False, {}, False, False, None, None) distinct = col.dropna().unique() @@ -400,7 +406,10 @@ def _classify_treatment( for _, group in sorted_df.groupby(unit, sort=False): vals = group[treatment].to_numpy() mask = ~pd.isna(vals) - observed = vals[mask] + # Cast to int so np.diff on a bool-dtype column performs + # arithmetic (1 - 0 = 1, 0 - 1 = -1) rather than XOR (which + # would mask a True -> False transition). + observed = vals[mask].astype(np.int64, copy=False) if len(observed) >= 2 and bool(np.any(np.diff(observed) < 0)): is_absorbing = False break diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 6fee23e1..3b42884d 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -155,6 +155,44 @@ def test_continuous_positive_dose_does_not_fire_has_always_treated(): assert "has_always_treated_units" not in _alert_codes(profile) +def test_bool_dtype_treatment_is_binary_absorbing(): + """Bool-dtype treatment columns (True/False) must classify the same + way as numeric {0, 1}. The library's binary estimators validate on + value support via `validate_binary`, which accepts bool because + True/False coerce to 1/0 numerically. 
Classifying bool as + "categorical" would silently route valid binary DiD panels away + from the supported estimator set.""" + first_treat = {u: 2 for u in range(11, 21)} + rows = [] + for u in range(1, 21): + for t in range(4): + treated = u in first_treat and t >= first_treat[u] + rows.append({"u": u, "t": t, "tr": bool(treated), "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + assert df["tr"].dtype == bool + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_absorbing" + assert profile.has_never_treated is True + assert profile.has_always_treated is False + assert profile.treatment_varies_within_unit is True + assert profile.cohort_sizes == {2: 10} + + +def test_bool_dtype_non_absorbing(): + """Reversible 0 -> 1 -> 0 treatment expressed as a bool column must + classify as binary_non_absorbing, same as numeric.""" + rows = [] + for u in range(1, 11): + seq = [False, True, True, False, False] if u > 5 else [False] * 5 + for t, tr in enumerate(seq): + rows.append({"u": u, "t": t, "tr": tr, "y": float(u) + 0.1 * t}) + df = pd.DataFrame(rows) + assert df["tr"].dtype == bool + profile = profile_panel(df, unit="u", time="t", treatment="tr", outcome="y") + assert profile.treatment_type == "binary_non_absorbing" + assert profile.has_never_treated is True + + def test_categorical_treatment_object_dtype(): rows = [] for u in range(1, 11): From 889b24a7787357294210a79b07959ea9605c7b27 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 12:24:51 -0400 Subject: [PATCH 16/18] Address PR #356 CI review round 14 (1 P1 guide) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit dCDH "spillover designs" claim removed. 
The earlier §4.3 bullet said ChaisemartinDHaultfoeuille was "robust to non-absorbing treatment and to spillover designs," but REGISTRY.md §ChaisemartinDHaultfoeuille only documents support for non-absorbing / reversible treatment — the estimator assumes SUTVA like every other DiD estimator in the suite. The same guide's §7 glossary already defines SUTVA as ruling out spillovers, so the §4.3 wording was internally inconsistent and would have misrouted interference designs to dCDH. Rewrote the §4.3 bullet to state the actually-supported contract (non-absorbing / reversible treatment via DID_M / DID_l) and added an explicit note that interference / between-unit spillovers are not supported natively. Regression test: `tests/test_profile_panel.py` extended with a forbidden-phrase check that fails if the autonomous guide re-advertises dCDH as "robust to spillover", "interference-robust", "supports spillover", or "and to spillover". Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 4 +++- tests/test_profile_panel.py | 16 ++++++++++++++++ 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 4024acf8..1f96f72e 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -339,7 +339,9 @@ estimators: coefficients. Requires a never-treated cohort (`fit` raises a `ValueError` when none exists). - `ChaisemartinDHaultfoeuille` - DID_M / DID_l estimators robust to - non-absorbing treatment (see §4.5) and to spillover designs. + non-absorbing / reversible treatment (see §4.5). Interference / + between-unit spillovers are not supported natively - SUTVA is + assumed like every other DiD estimator in the suite. - `ImputationDiD` (Borusyak, Jaravel, Spiess) - imputation-based, efficient under homoskedasticity, produces an imputation-based residual at the observation level. 
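The regression guard this commit describes reduces to a forbidden-phrase scan over the guide text. A minimal sketch (the `guide_text` stand-in mirrors the corrected §4.3 wording; in the shipped test the text is loaded from the bundled guide file, and the variable names here are illustrative):

```python
# Stand-in for the loaded autonomous guide; the wording matches the
# corrected §4.3 bullet: non-absorbing support stated, spillovers
# explicitly disclaimed.
guide_text = (
    "ChaisemartinDHaultfoeuille - DID_M / DID_l estimators robust to "
    "non-absorbing / reversible treatment. Interference / between-unit "
    "spillovers are not supported natively - SUTVA is assumed like "
    "every other DiD estimator in the suite."
)

# Fail loudly if the guide ever drifts back to advertising an
# unsupported capability.
FORBIDDEN = (
    "robust to spillover",
    "interference-robust",
    "supports spillover",
    "and to spillover",
)
violations = [phrase for phrase in FORBIDDEN if phrase in guide_text]
assert violations == [], f"guide re-advertises unsupported dCDH capability: {violations}"
```

The guard is deliberately phrase-based rather than structural: it catches any future rewording that re-pairs dCDH with spillover language, not just a byte-exact restoration of the old bullet.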
diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index 3b42884d..aae6e438 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -729,6 +729,22 @@ def test_guide_api_strings_resolve_against_public_api(): "intolerant estimators" ) + # ChaisemartinDHaultfoeuille handles non-absorbing / reversible + # treatment; SUTVA is still assumed (no native interference or + # spillover support per REGISTRY.md). Guard against the guide + # drifting back to advertising dCDH as "robust to spillover + # designs" or similar. + for phrase in ( + "robust to spillover", + "interference-robust", + "supports spillover", + "and to spillover", + ): + assert phrase not in text, ( + f"Guide must not advertise unsupported dCDH capability " + f"{phrase!r}: SUTVA is assumed across the estimator suite." + ) + def test_min_pre_post_use_per_unit_observed_support(): """On an unbalanced panel where one treated unit is missing its From 9d5f8f946f79fa2eab5219d5b0cbf2ffe5dc5e81 Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 13:33:18 -0400 Subject: [PATCH 17/18] Address PR #356 CI review round 15 (1 P1 guide) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit §4.10 repeated-cross-section wording tightened. The earlier text ("most estimators remain applicable") overstated the RCS support surface: only CallawaySantAnna(panel=False), TripleDifference, and StaggeredTripleDifference have documented RCS-capable contracts (REGISTRY.md §CallawaySantAnna; docs/choosing_estimator.rst DDD cross-sectional use cases). EfficientDiD and HeterogeneousAdoptionDiD explicitly reject RCS per REGISTRY.md. SyntheticDiD and ContinuousDiD are panel-only by construction. Under the old wording an autonomous agent could silently route RCS data to a panel-only estimator. Rewrote §4.10 into three explicit blocks: - "Explicit RCS support": the three RCS-capable estimators. 
- "Explicitly rejected for RCS (panel-only)": names EfficientDiD, HeterogeneousAdoptionDiD, SyntheticDiD, ContinuousDiD. - "Treat other estimators in this guide as panel-only unless their own docs explicitly say otherwise." Kept the clustered-SE note and added a cluster-vs-respondent treatment-assignment check. Regression test: asserts the guide does not contain the "most estimators remain applicable" phrase, names `panel=False` as the explicit CS RCS mode, and that the §4.10 section explicitly lists both EfficientDiD and HeterogeneousAdoptionDiD as panel-only rejected. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 40 +++++++++++++++++++++++----- tests/test_profile_panel.py | 24 +++++++++++++++++ 2 files changed, 57 insertions(+), 7 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 1f96f72e..0aabf0a7 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -500,13 +500,39 @@ intensity of exposure differs): ### §4.10 Repeated cross-sections (no panel structure) `profile_panel` assumes long-format panel data. When the same units are -not observed across time (true repeated cross-sections), most estimators -remain applicable but: - -- Clustered SE must cluster on the unit proxy (state, region) rather - than individual. -- The `CallawaySantAnna` estimator has an explicit repeated-cross- - section mode; see its `panel` kwarg. +not observed across time (true repeated cross-sections), only the +estimators whose documented contract explicitly admits RCS are +applicable. Do not route RCS data to any other estimator in the suite - +most of them are panel-only by construction and will either raise at +fit time or estimate under a misspecified identifying assumption. + +Explicit RCS support in this library: + +- `CallawaySantAnna(panel=False)` - repeated-cross-section mode per + REGISTRY.md §CallawaySantAnna; use this variant on RCS data. 
+- `TripleDifference` / `StaggeredTripleDifference` - DDD cross-sectional + use cases are documented in `docs/choosing_estimator.rst`; the DDD + estimators do not require within-unit tracking when the third + comparison axis carries the identification. + +Explicitly rejected for RCS (panel-only): + +- `EfficientDiD` - REGISTRY notes "does not handle ... repeated + cross-sections." +- `HeterogeneousAdoptionDiD` - panel-only (requires a balanced panel + with per-unit adoption timing). +- `SyntheticDiD` - requires balanced panel with per-unit donor matching. +- `ContinuousDiD` - requires balanced panel with per-unit constant + dose. + +Treat other estimators in this guide as panel-only unless their own +docs explicitly say otherwise. When routing, also: + +- Cluster SE on the unit proxy (state, region) rather than the + individual cross-section respondent. +- Confirm the treatment assignment is at the cluster level, not at + the individual-respondent level, before interpreting the estimate + as a group-time ATT. ## §5. Post-fit validation utilities diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index aae6e438..d2845a88 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -745,6 +745,30 @@ def test_guide_api_strings_resolve_against_public_api(): f"{phrase!r}: SUTVA is assumed across the estimator suite." ) + # Repeated-cross-section (§4.10) must not claim broad + # applicability. The documented RCS-capable estimators are + # CallawaySantAnna(panel=False), TripleDifference, and + # StaggeredTripleDifference; EfficientDiD and + # HeterogeneousAdoptionDiD explicitly reject RCS per REGISTRY.md. + assert "most estimators remain applicable" not in text, ( + "§4.10 must not claim broad RCS applicability; only the " + "explicitly documented RCS-capable subset is applicable." 
+ ) + assert "panel=False" in text, ( + "§4.10 must point at CallawaySantAnna(panel=False) as the " "explicit RCS mode" + ) + # The section must explicitly name at least one panel-only + # estimator as rejected for RCS, so agents do not silently route + # RCS data to it. + rcs_section_start = text.find("§4.10 Repeated cross-sections") + assert rcs_section_start >= 0 + rcs_section = text[rcs_section_start : rcs_section_start + 2500] + for panel_only in ("EfficientDiD", "HeterogeneousAdoptionDiD"): + assert panel_only in rcs_section, ( + f"§4.10 must explicitly name {panel_only!r} as panel-only " + "so RCS data is not routed to it" + ) + def test_min_pre_post_use_per_unit_observed_support(): """On an unbalanced panel where one treated unit is missing its From ef6b53d5947941ee571232ce879a6ce536720b1e Mon Sep 17 00:00:00 2001 From: igerber Date: Fri, 24 Apr 2026 13:48:52 -0400 Subject: [PATCH 18/18] Address PR #356 CI review round 16 (1 P1 guide) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Rebased onto current main (17 commits clean — PR #355, #358, #359 all merged since last rebase). StaggeredTripleDifference corrected as panel-only + balance-enforced. The earlier §4.10 RCS wording paired TripleDifference / StaggeredTripleDifference together in the Explicit RCS support list, but REGISTRY.md §StaggeredTripleDifference requires a balanced panel and staggered_triple_diff.py:93-109 has no panel=False mode — fit() rejects unbalanced/duplicate (unit, time) structure at staggered_triple_diff.py:846-864. - §4.10 Explicit RCS support: TripleDifference (two-period) only; StaggeredTripleDifference removed from the supported set. - §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added with a concrete "no panel=False mode" + "use TripleDifference for cross-sectional DDD" pointer. - §3 Balanced-panel eligibility: StaggeredTripleDifference added to the balance-sensitive gate. 
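The §3 BOTH-gate that StaggeredTripleDifference now joins can be sketched standalone in pandas (illustrative names, not the library's implementation): balance computed on the unique `(unit, time)` support stays `True` when a duplicate row exists, so eligibility for balance-sensitive estimators needs the duplicate check as a second condition.

```python
import pandas as pd

# Panel with a duplicated (unit=2, time=1) row.
df = pd.DataFrame(
    {
        "unit": [1, 1, 2, 2, 2],
        "time": [0, 1, 0, 1, 1],
        "y": [1.0, 1.1, 2.0, 2.1, 2.1],
    }
)

# Balance computed on the unique-key support: every unit observed in
# every period -> True, even though a duplicate row exists.
support = df.drop_duplicates(["unit", "time"])
is_balanced = len(support) == support["unit"].nunique() * support["time"].nunique()

# The duplicate condition is a separate check on the raw rows.
has_duplicate_unit_time_rows = bool(df.duplicated(["unit", "time"]).any())

# Gate balance-sensitive estimators on BOTH conditions.
eligible_for_balance_sensitive = is_balanced and not has_duplicate_unit_time_rows

assert is_balanced is True
assert has_duplicate_unit_time_rows is True
assert eligible_for_balance_sensitive is False
```

Gating on `is_balanced` alone would pass this panel through to an estimator that requires exactly one observation per cell, which is precisely the misroute the guide's BOTH wording exists to prevent.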
Regression tests extended: - Balanced-panel proximity check now covers StaggeredTripleDifference. - §4.10 section test asserts StaggeredTripleDifference appears in the Explicitly rejected block and NOT in the Explicit RCS support block. Co-Authored-By: Claude Opus 4.7 (1M context) --- diff_diff/guides/llms-autonomous.txt | 16 +++++++++++----- tests/test_profile_panel.py | 23 ++++++++++++++++++++++- 2 files changed, 33 insertions(+), 6 deletions(-) diff --git a/diff_diff/guides/llms-autonomous.txt b/diff_diff/guides/llms-autonomous.txt index 0aabf0a7..b571911e 100644 --- a/diff_diff/guides/llms-autonomous.txt +++ b/diff_diff/guides/llms-autonomous.txt @@ -277,7 +277,8 @@ supported / out of scope; `warn` supported but with documented caveats; **Balanced-panel eligibility.** The following estimators require exactly one observation per `(unit, time)` cell with every unit observed in every period: `ContinuousDiD`, `EfficientDiD`, -`SyntheticDiD`, `HeterogeneousAdoptionDiD`. Gate these on BOTH +`SyntheticDiD`, `HeterogeneousAdoptionDiD`, +`StaggeredTripleDifference`. Gate these on BOTH `PanelProfile.is_balanced == True` AND the absence of the `duplicate_unit_time_rows` alert (`is_balanced` is computed from the unique-key support and stays `True` when duplicates exist; the @@ -510,10 +511,11 @@ Explicit RCS support in this library: - `CallawaySantAnna(panel=False)` - repeated-cross-section mode per REGISTRY.md §CallawaySantAnna; use this variant on RCS data. -- `TripleDifference` / `StaggeredTripleDifference` - DDD cross-sectional - use cases are documented in `docs/choosing_estimator.rst`; the DDD - estimators do not require within-unit tracking when the third - comparison axis carries the identification. +- `TripleDifference` - DDD cross-sectional use cases are documented + in `docs/choosing_estimator.rst`; the two-period DDD estimator does + not require within-unit tracking when the third comparison axis + carries the identification. 
The staggered DDD variant is panel-only + and listed separately below. Explicitly rejected for RCS (panel-only): @@ -524,6 +526,10 @@ Explicitly rejected for RCS (panel-only): - `SyntheticDiD` - requires balanced panel with per-unit donor matching. - `ContinuousDiD` - requires balanced panel with per-unit constant dose. +- `StaggeredTripleDifference` - panel-only; `fit()` has no + `panel=False` mode and rejects duplicate / unbalanced + `(unit, time)` structure. For cross-sectional DDD data use + `TripleDifference` instead. Treat other estimators in this guide as panel-only unless their own docs explicitly say otherwise. When routing, also: diff --git a/tests/test_profile_panel.py b/tests/test_profile_panel.py index d2845a88..b1d7a9f5 100644 --- a/tests/test_profile_panel.py +++ b/tests/test_profile_panel.py @@ -676,6 +676,7 @@ def test_guide_api_strings_resolve_against_public_api(): "EfficientDiD", "SyntheticDiD", "HeterogeneousAdoptionDiD", + "StaggeredTripleDifference", ): idx = 0 found = False @@ -763,12 +764,32 @@ def test_guide_api_strings_resolve_against_public_api(): rcs_section_start = text.find("§4.10 Repeated cross-sections") assert rcs_section_start >= 0 rcs_section = text[rcs_section_start : rcs_section_start + 2500] - for panel_only in ("EfficientDiD", "HeterogeneousAdoptionDiD"): + for panel_only in ( + "EfficientDiD", + "HeterogeneousAdoptionDiD", + "StaggeredTripleDifference", + ): assert panel_only in rcs_section, ( f"§4.10 must explicitly name {panel_only!r} as panel-only " "so RCS data is not routed to it" ) + # The explicit RCS-capable bullet list must NOT put + # StaggeredTripleDifference next to the RCS-support language. + # The estimator has no panel=False mode and fit() rejects + # unbalanced input; only TripleDifference (non-staggered) is + # cross-sectional-DDD-capable. 
+ explicit_support_block = text.find("Explicit RCS support", rcs_section_start) + rejected_block = text.find("Explicitly rejected for RCS", rcs_section_start) + assert 0 <= explicit_support_block < rejected_block, ( + "§4.10 must separate an Explicit RCS support list from the " "Explicitly rejected list" + ) + explicit_segment = text[explicit_support_block:rejected_block] + assert "StaggeredTripleDifference" not in explicit_segment, ( + "StaggeredTripleDifference must NOT appear in the Explicit RCS " + "support list — it is panel-only and balance-enforced." + ) + def test_min_pre_post_use_per_unit_observed_support(): """On an unbalanced panel where one treated unit is missing its