Add profile_panel() + autonomous LLM guide for agent-facing workflows #356
🔁 AI review rerun (requested by @igerber). Overall Assessment: The PR adds a new agent-facing methodology surface, but the new autonomous guide is not consistent with
Performance: no findings. Maintainability: no additional findings beyond the guide/registry drift already noted. Security: no findings.
profile_panel():
- Balance / coverage: compute from unique (unit, time) support rather than raw row count. Duplicated rows no longer inflate coverage above 1 or mask missing cells. Adds duplicate_unit_time_rows warn alert.
- Non-absorbing detection: drop NaN before the monotonicity diff so a path like [0, 1, NaN, 0] is correctly classified binary_non_absorbing rather than binary_absorbing.
- has_never_treated / has_always_treated: compute generically across numeric treatment types (not only binary). Continuous panels with zero-dose control units now surface has_never_treated=True; docstring and guide field reference updated.
llms-autonomous.txt:
- SyntheticDiD: staggered changed from "partial" to "strict block" (fit raises on within-unit variation); footnote rewritten.
- TROP: staggered support marked (via absorbing D matrix); never-treated no longer required (donor pool is every unit untreated at period t); fit() covariate surface removed (there is none).
- HeterogeneousAdoptionDiD: covariate adjustment changed to "not supported" (Appendix B.1 future work per REGISTRY.md); clustered SE annotated "warn" on continuous paths (cluster= kwarg is ignored with UserWarning, CR1 only on mass-point path).
- Replace "notyettreated" with "not_yet_treated" throughout (matches staggered.py / sun_abraham.py public API).
- Replace "Hausman.hausman_pretest" with "EfficientDiD.hausman_pretest" (the classmethod lives on EfficientDiD, not a Hausman namespace).
Tests:
- test_duplicate_unit_time_rows_do_not_inflate_coverage
- test_reversal_through_nan_is_binary_non_absorbing
- test_continuous_zero_dose_controls_flag_has_never_treated
- test_guide_api_strings_resolve_against_public_api (verifies every estimator name, hausman_pretest path, and control_group spelling in the bundled guide resolves against diff_diff's public API; guards against future drift)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
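The two profile_panel() fixes above can be sketched in plain pandas/numpy. This is an illustrative sketch, not the library's actual internals; the helper names below are hypothetical.

```python
import numpy as np
import pandas as pd

def coverage_from_unique_support(df, unit, time):
    """Coverage computed on unique (unit, time) cells, so duplicated
    rows cannot push it above 1 or mask missing cells."""
    cells = df[[unit, time]].drop_duplicates()
    n_units = cells[unit].nunique()
    n_periods = cells[time].nunique()
    return len(cells) / (n_units * n_periods)

def is_non_absorbing(path):
    """Drop NaN before the monotonicity diff, so [0, 1, NaN, 0] is
    seen as a genuine 1 -> 0 reversal rather than absorbing."""
    vals = np.asarray(path, dtype=float)
    vals = vals[~np.isnan(vals)]
    return bool((np.diff(vals) < 0).any())

# Duplicated (unit=1, time=1) row counted once: 3 cells / 4 possible.
df = pd.DataFrame({"unit": [1, 1, 1, 2], "time": [0, 1, 1, 0]})
print(coverage_from_unique_support(df, "unit", "time"))  # 0.75
print(is_non_absorbing([0, 1, np.nan, 0]))               # True
```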
Guide:
- Rewrite HAD section (§4.9, §3 footnotes) to use WAS / WAS_d_lower terminology (paper Equation 2 / Theorem 1). HAD does not target ATT and its event-study output is per-event-time, not per-cohort. The fitted target_parameter attribute is literally "WAS" for Design 1' and "WAS_d_lower" for Design 1 / Assumption 6.
- Add control_group="last_cohort" to the EfficientDiD guidance in §4.4 (no-never-treated paths) and the corresponding footnote. Previously the guide steered agents only toward assumption="PT-Post" when a never-treated cohort was absent, ruling out the supported pseudo-control path. Distinct from CallawaySantAnna's "not_yet_treated".
profile_panel():
- Missing identifier rows are dropped up front when unit or time contains NaN. Previously n_units / n_periods used nunique() (NaN-dropping) while the unique (unit, time) support used drop_duplicates() (NaN-preserving), producing observation_coverage > 1 on filtered data with missing IDs. All structural facts now compute on the non-missing subset; a new missing_id_rows_dropped warn alert surfaces the drop count so the behavior is not silent.
Tests:
- test_missing_unit_or_time_ids_are_dropped_consistently asserts observation_coverage stays in [0, 1] and the alert fires with the correct drop count.
- test_guide_api_strings_resolve_against_public_api extended with assertions for WAS / WAS_d_lower phrasing, the absence of the old "per-cohort Pierce-Schott" wording, and control_group="last_cohort".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
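The nunique()/drop_duplicates() mismatch described above is easy to reproduce in raw pandas. A minimal sketch (illustrative column names, not the library's code):

```python
import numpy as np
import pandas as pd

# nunique() drops NaN, while drop_duplicates() keeps the NaN-keyed row,
# so mixing the two can yield observation_coverage > 1.
df = pd.DataFrame({"unit": [1, 2, np.nan], "time": [0, 0, 0]})

n_units = df["unit"].nunique()                         # 2 (NaN dropped)
n_periods = df["time"].nunique()                       # 1
n_cells = len(df[["unit", "time"]].drop_duplicates())  # 3 (NaN row kept)
print(n_cells / (n_units * n_periods))                 # 1.5 > 1

# The fix: drop rows with missing identifiers up front, then compute
# every structural fact on the same non-missing subset.
clean = df.dropna(subset=["unit", "time"])
n_dropped = len(df) - len(clean)  # surfaced as missing_id_rows_dropped
coverage = len(clean[["unit", "time"]].drop_duplicates()) / (
    clean["unit"].nunique() * clean["time"].nunique()
)
print(n_dropped, coverage)  # 1 1.0
```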
🔁 AI review rerun (requested by @igerber): Performance/Maintainability/Security: no findings. Documentation/Tests: no separate unmitigated findings beyond the tracked guide-validation gap above. Static review only.
Guide:
- SunAbraham matrix row: never-treated-required changed from ✗ to ✓. The fit path raises ValueError when no never-treated units exist (sun_abraham.py:566-567); the registry documents this requirement. §4.3 prose updated to state the requirement explicitly.
- HAD §4.9 (and §3 Phase 3 description): correct the diagnostic-claim wording. The pre-test battery validates the QUG null (step 1), Assumption 7 pre-trends (step 2, event-study only), and linearity of E[ΔY|D_2] (step 3). Assumption 3 (uniform continuity / no extensive-margin jump) is explicitly "not testable" per REGISTRY.md L2181, so the guide must not claim Phase 3 validates it.
profile_panel():
- Empty-panel guard: raise ValueError when the input DataFrame is empty (direct) or when no rows remain after dropping missing (unit, time) identifiers. Previously both cases returned is_balanced=True / observation_coverage=0 on zero valid rows.
- Fix missing-id counting: n_rows_with_missing_id was the SUM of unit-NaN and time-NaN (double-counting rows missing both). Now computed from a row-wise isna().any(axis=1) mask so every dropped row is counted exactly once.
Tests:
- test_row_with_both_ids_missing_counted_once
- test_empty_dataframe_raises_value_error
- test_empty_after_id_drop_raises_value_error
- test_guide_api_strings_resolve_against_public_api extended: SunAbraham matrix-row cell assertion (never-treated-required=✓), "Assumption 3" presence, absence of "validate Assumptions 3 and 7", presence of "not testable".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
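The counting fix above can be sketched directly: a row missing BOTH identifiers must be counted once, not twice. Illustrative sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": [1, np.nan, np.nan],
    "time": [0, 1, np.nan],   # last row is missing both identifiers
})

# Old (buggy) approach: summing per-column NaN counts double-counts
# the row missing both unit and time.
double_counted = df["unit"].isna().sum() + df["time"].isna().sum()  # 3

# Fixed: one row-wise mask, so each dropped row is counted exactly once.
missing_mask = df[["unit", "time"]].isna().any(axis=1)
n_rows_with_missing_id = int(missing_mask.sum())                    # 2

print(double_counted, n_rows_with_missing_id)
```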
🔁 AI review rerun (requested by @igerber): Performance/Maintainability/Security: no findings.
Guide:
- Matrix covariate cells corrected. SyntheticDiD column 6 flipped ✗ → ✓: fit() accepts covariates= and residualizes the outcome (synthetic_did.py:213-239, 425-436). ContinuousDiD column 6 flipped ✓ → ✗: the fit() signature has no covariate argument and REGISTRY lists covariate adjustment as deferred.
- EfficientDiD wording corrected. The registry says PT-Post still uses never-treated as the comparison group (REGISTRY.md L756-758); "switch to PT-Post" does not drop the requirement. Rewrote has_never_treated field description, the §3 footnote, and §4.4 prose to state that PT-All and PT-Post both require never-treated units and control_group="last_cohort" is the only supported path for an all-eventually-treated panel.
profile_panel():
- _compute_pre_post now uses each treated unit's observed (unit, time) support rather than the global panel period set. On unbalanced panels this correctly captures the least-supported treated unit's pre/post exposure and fires short_pre_panel / short_post_panel when appropriate; the previous global-periods approach could overstate exposure and suppress the alert.
Guide §5 API signatures corrected:
- compute_pretrends_power takes a fitted results object, not a raw DataFrame (pretrends.py:1048).
- plot_sensitivity takes a SensitivityResults object (visualization/_diagnostic.py:12-14).
- plot_honest_event_study takes a HonestDiDResults object (visualization/_event_study.py:9).
Tests:
- test_min_pre_post_use_per_unit_observed_support (unbalanced treated unit missing early pre-period fires short_pre_panel).
- test_guide_api_strings_resolve_against_public_api extended with SyntheticDiD/ContinuousDiD covariate-cell matrix assertions, EfficientDiD both-PT-variants-require-never-treated wording assertion, and §5 API-signature fingerprints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
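The per-unit pre/post computation can be sketched as follows. This is an illustrative sketch, not _compute_pre_post itself; the first_treat column and helper name are hypothetical.

```python
import pandas as pd

def min_pre_post(df, unit, time, first_treat):
    """Count each treated unit's pre/post periods from its OWN observed
    rows, then take the minimum across treated units. A global-periods
    approach would overstate exposure for units missing rows."""
    pre, post = [], []
    for _, g in df[df[first_treat].notna()].groupby(unit):
        ft = g[first_treat].iloc[0]
        periods = g[time]                 # this unit's observed support only
        pre.append(int((periods < ft).sum()))
        post.append(int((periods >= ft).sum()))
    return min(pre), min(post)

# Unit 2 is missing the early pre-period, so min pre-exposure is 1,
# not the 2 a global-periods computation would report.
df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2],
    "time": [0, 1, 2, 1, 2],
    "first_treat": [2, 2, 2, 2, 2],
})
print(min_pre_post(df, "unit", "time", "first_treat"))  # (1, 1)
```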
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good.
Guide §6 rewritten to match the actual BR/DR to_dict() schemas emitted at diff_diff/business_report.py:524-549 and diff_diff/diagnostic_report.py:1013-1031. BR top-level keys are "assumption" (singular), "pre_trends" (underscored), "sample" (bare); the new documentation also adds "estimator", "context", "heterogeneity", "diagnostics", "next_steps", "references". DR keys are "estimator" (string), "headline_metric", "estimator_native_diagnostics", "skipped", "warnings", "overall_interpretation", "next_steps"; the old "checks: dict" nesting never existed and "estimator_type" was never the key name. The "forthcoming schema additions" block is dropped since it documented an un-shipped surface.
Prose + alert wording aligned with the per-treated-unit implementation:
- Guide §2: min_pre_periods / min_post_periods described as "across treated units" (each unit's observed (unit, time) support counted independently on unbalanced panels).
- profile.py: short_pre_panel / short_post_panel alert messages use "across treated units" to match the computation.
Tests:
- test_guide_api_strings_resolve_against_public_api extended with BR/DR §6 schema assertions. Guards every real top-level key documented (assumption, pre_trends, sample, headline_metric, estimator_native_diagnostics, overall_interpretation) and forbids the previously documented obsolete keys (assumptions, pretrends, main_result, sample_summary, estimator_type, checks).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber). Overall Assessment: ✅ Looks good — no unmitigated P0/P1 findings.
Guide methodology attribution corrected:
- EfficientDiD primary source is Chen, Sant'Anna, Xie (2025) per
REGISTRY.md L746-749, not Arkhangelsky-Imbens. §4.3 bullet and the
§7 citation list updated.
- ContinuousDiD synopsis (§4.7) now distinguishes the PT-identified
targets (`ATT(d|d)`, `ATT^{loc}`) from the SPT-required targets
(`ATT(d)`, `ACRT(d)`, `ACRT^{glob}`) per REGISTRY.md L682-706, with
the Callaway, Goodman-Bacon, Sant'Anna (2024) citation added to §7.
Guide §6 BR schema corrections:
- `diagnostics` wrapper documented accurately: always has `status`,
with `schema` + `overall_interpretation` on the "ran" path or
`reason` on the "skipped" path. Previously the guide said the key
held the DR payload directly.
- `target_parameter.reference` field documented alongside the other
fields per `describe_target_parameter()` in
diff_diff/_reporting_helpers.py:34-42.
Regression tests extended:
- `diagnostics["schema"]` wrapper wording guarded (and absence of
plain "DR payload under diagnostics").
- `target_parameter.reference` key documentation guarded.
- EfficientDiD attribution: `Chen, Sant'Anna, Xie 2025` present,
`(Arkhangelsky-Imbens)` absent.
- ContinuousDiD targets: `ATT(d|d)`, `ACRT`, `Strong Parallel Trends`
all present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
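The documented `diagnostics` wrapper can be pinned down as a shape check. A minimal sketch in plain Python: the key names follow the schema described above, but the validator function and sample payloads are hypothetical, not library code.

```python
def valid_diagnostics_wrapper(d: dict) -> bool:
    """Always requires `status`; the "skipped" path carries `reason`,
    the "ran" path carries `schema` + `overall_interpretation`."""
    if "status" not in d:
        return False
    if d["status"] == "skipped":
        return "reason" in d
    return "schema" in d and "overall_interpretation" in d

ran = {"status": "ran", "schema": {"estimator": "EfficientDiD"},
       "overall_interpretation": "clean"}
skipped = {"status": "skipped", "reason": "no pre-periods"}
bare = {"schema": {}}  # invalid: the old docs implied the DR payload sat here directly

print(valid_diagnostics_wrapper(ran),
      valid_diagnostics_wrapper(skipped),
      valid_diagnostics_wrapper(bare))  # True True False
```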
🔁 AI review rerun (requested by @igerber): Code Quality/Performance/Maintainability/Security: no findings.
ContinuousDiD eligibility (P1): ContinuousDiD.fit() requires zero-dose
control units (P(D=0) > 0) because Remark 3.1 lowest-dose-as-control
is not yet implemented (REGISTRY.md L714; continuous_did.py:349-360).
Guide corrections:
- Matrix row: never-treated-required ✗ → ✓.
- has_never_treated field doc: add ContinuousDiD to the required-by
list with the Remark 3.1 note.
- §4.4 no-never-treated paths: add ContinuousDiD bullet naming the
ValueError and pointing at the unimplemented Remark 3.1 path.
- §4.7 ContinuousDiD bullet: prepend the "Requires zero-dose control
units" callout.
DiagnosticReport §6 section-status semantics (P2): the real schema
uses `status` for execution state ("ran", "not_applicable", "not_run",
"no_scalar_by_design", "skipped") and `verdict` as a separate field
for qualitative interpretation (e.g. "clean", "inconclusive",
"violated"). Previously §6 conflated these: it listed
pass/warn/inconclusive as status values and omitted the execution
statuses. Rewrote the section-status paragraph to document the two
layers explicitly; moved the forthcoming-additions note into the
same block.
Tests:
- ContinuousDiD matrix-row cell assertion (col 5 = ✓).
- "P(D=0) > 0" substring fingerprint guards the Remark 3.1 wording.
- DR §6 status vocabulary: guards every real status the implementation
emits ("ran", "not_applicable", "not_run", "no_scalar_by_design")
and forbids the obsolete pass/warn/inconclusive-as-status framing;
`verdict` must be mentioned as a distinct concept.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
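The two-layer semantics can be captured with a small vocabulary sketch. Plain Python, illustrative only: the vocabularies come from the description above, the section payloads and helper are made up.

```python
# status = execution state; verdict = qualitative interpretation.
# The old §6 wording conflated the two layers.
EXECUTION_STATUSES = {"ran", "not_applicable", "not_run",
                      "no_scalar_by_design", "skipped"}
VERDICTS = {"clean", "inconclusive", "violated"}

def section_ok(section: dict) -> bool:
    status_ok = section.get("status") in EXECUTION_STATUSES
    verdict_ok = "verdict" not in section or section["verdict"] in VERDICTS
    return status_ok and verdict_ok

print(section_ok({"status": "ran", "verdict": "inconclusive"}))  # True
print(section_ok({"status": "inconclusive"}))  # False: verdict used as status
```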
ContinuousDiD staggered support (P1 #1): the matrix marked staggered=✗, but the method natively supports staggered adoption via the `first_treat` column (continuous_did.py:159-169, 919-925; REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.
Time-invariant dose requirement (P1 #2): ContinuousDiD.fit() requires dose to be time-invariant per unit (continuous_did.py:222-228; docs/methodology/continuous-did.md:L70-75), but profile_panel() did not expose this, so time-varying-dose continuous panels were routed to ContinuousDiD only to hard-fail at fit time. Added `PanelProfile.treatment_varies_within_unit: bool` — True iff any unit has more than one distinct non-NaN treatment value across its observed rows. Computed unconditionally for numeric (non-bool) treatment columns; False for categorical. `to_dict()` exposes it. Guide §2 documents the field; §4.7 ContinuousDiD bullet lists two eligibility prerequisites: P(D=0) > 0 AND treatment_varies_within_unit == False.
Tests (P2):
- test_continuous_treatment_with_time_varying_dose: random-per-row continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose -> treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓; guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
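The new field's contract can be sketched in raw pandas. Illustrative sketch under the stated contract; the function below is not the library's implementation, and the column names are hypothetical.

```python
import pandas as pd

def treatment_varies_within_unit(df, unit, treatment) -> bool:
    """True iff any unit has more than one distinct non-NaN treatment
    value across its observed rows; categorical/bool stay False here."""
    col = df[treatment]
    if not pd.api.types.is_numeric_dtype(col) or pd.api.types.is_bool_dtype(col):
        return False
    per_unit = df.groupby(unit)[treatment].nunique(dropna=True)
    return bool((per_unit > 1).any())

# Constant-per-unit dose: eligible for ContinuousDiD.
constant = pd.DataFrame({"unit": [1, 1, 2, 2], "dose": [0.5, 0.5, 0.0, 0.0]})
# Time-varying dose: would hard-fail at ContinuousDiD.fit() time.
varying = pd.DataFrame({"unit": [1, 1, 2, 2], "dose": [0.5, 0.7, 0.0, 0.0]})

print(treatment_varies_within_unit(constant, "unit", "dose"))  # False
print(treatment_varies_within_unit(varying, "unit", "dose"))   # True
```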
HeterogeneousAdoptionDiD staggered support is `partial` (per §3 matrix), but the restriction was never unpacked. Per REGISTRY.md L2281 and had.py:1100-1288, Appendix B.2 limits staggered HAD to the **last treatment cohort plus never-treated units**. With `first_treat_col` supplied, `fit(aggregate="event_study")` auto-filters to F_last and emits a UserWarning naming kept/dropped counts; earlier-cohort units are dropped. Without `first_treat_col` on a multi-cohort panel, fit() raises a front-door ValueError pointing at ChaisemartinDHaultfoeuille for full staggered support.
Guide updates:
- New §3 footnote on the HAD `partial` cell spelling out the last-cohort-only restriction, the `first_treat_col` requirement for the auto-filter, and the ChaisemartinDHaultfoeuille fallback.
- §4.9 HAD bullet appended with a "Staggered-timing scope is last-cohort-only" paragraph carrying the same contract plus the "last-cohort-only WAS" estimand clarification.
Tests:
- Semantic guide test asserts "last-cohort-only" (or "last cohort") wording, "first_treat_col" token, and ChaisemartinDHaultfoeuille as the fallback are all present, so future guide edits cannot silently drop the Appendix B.2 disclosure.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Balanced-panel eligibility gate tightened. `PanelProfile.is_balanced` is computed from the unique `(unit, time)` support, so it stays `True` even when duplicate rows exist — `duplicate_unit_time_rows` is a separate alert for that case. But ContinuousDiD / EfficientDiD / HeterogeneousAdoptionDiD all require exactly one observation per cell: EfficientDiD and HAD raise ValueError on duplicates at fit() time, and ContinuousDiD's precompute path silently resolves duplicates via last-row-wins (which can change the estimand without warning).
Guide §3 balanced-panel-eligibility block now requires BOTH `is_balanced == True` AND absence of the `duplicate_unit_time_rows` alert before routing to these estimators, with the specific failure mode (raise vs silent overwrite) named per-estimator and a concrete two-step fix (`balance_panel()` + `drop_duplicates([unit, time])`).
Tests: extended the semantic guide test to assert the guide mentions `duplicate_unit_time_rows` and uses "BOTH/both" wording around the is_balanced gate, so future edits cannot silently drop the duplicate-row half of the eligibility requirement.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
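The tightened gate and the pandas half of the two-step fix can be sketched as follows. Illustrative only: the alert-dict shape and helper names below are assumptions, not the library's actual to_dict() output.

```python
import pandas as pd

def eligible_for_one_obs_per_cell(profile_dict: dict) -> bool:
    """Route to ContinuousDiD / EfficientDiD / HAD only when BOTH hold:
    the panel is balanced AND no duplicate (unit, time) rows alert fired."""
    has_dupes = any(
        a.get("code") == "duplicate_unit_time_rows"
        for a in profile_dict.get("alerts", [])
    )
    return bool(profile_dict.get("is_balanced")) and not has_dupes

def dedupe_cells(df, unit, time):
    """The raw-pandas half of the two-step fix: one row per (unit, time)
    cell (keeps the first occurrence of each duplicated cell)."""
    return df.drop_duplicates(subset=[unit, time])

df = pd.DataFrame({"unit": [1, 1, 2], "time": [0, 0, 0], "y": [1.0, 2.0, 3.0]})
print(len(dedupe_cells(df, "unit", "time")))  # 2: duplicated (1, 0) cell collapsed
print(eligible_for_one_obs_per_cell(
    {"is_balanced": True,
     "alerts": [{"code": "duplicate_unit_time_rows"}]}))  # False despite balance
```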
Bool-dtype treatment columns are now classified the same way as
numeric {0, 1} rather than as "categorical". The library's binary
estimators validate value support via `validate_binary` (utils.py:49-67), which accepts bool because True/False coerce to 1/0
numerically. Classifying bool as categorical silently routed valid
binary DiD panels away from the supported estimator set.
Changes:
- _classify_treatment() no longer early-returns "categorical" for
bool dtype. The downstream absorbing/non-absorbing logic handles
bool by casting to int before np.diff (raw bool diff is XOR, which
would mask a True -> False transition).
- treatment_varies_within_unit now includes bool-dtype columns (was
previously hardcoded False for bool).
- Guide §2 removes the "bool = categorical" rule and adds an
explicit "bool is binary" note with a pointer to validate_binary
as the reason.
- profile_panel() docstring mirrors the same update.
Tests:
- test_bool_dtype_treatment_is_binary_absorbing: staggered-style
bool panel with never-treated cohort -> binary_absorbing, correct
has_never_treated / treatment_varies_within_unit / cohort_sizes.
- test_bool_dtype_non_absorbing: reversible False -> True -> False
bool panel -> binary_non_absorbing. Guards the int-cast before
np.diff so future refactors don't regress to bool XOR semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
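The int-cast rule above can be sketched directly: classify a bool treatment path by casting to int before diffing, so a True -> False reversal shows up as a negative step. Illustrative sketch; the classifier name is hypothetical.

```python
import numpy as np

def classify_bool_path(path) -> str:
    """Cast bool to int BEFORE np.diff; diffing raw bools would not
    preserve the sign that distinguishes adoption from reversal."""
    vals = np.asarray(path).astype(int)
    steps = np.diff(vals)
    return "binary_non_absorbing" if (steps < 0).any() else "binary_absorbing"

print(classify_bool_path([False, False, True, True]))  # absorbing adoption
print(classify_bool_path([False, True, False]))        # reversal detected
```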
dCDH "spillover designs" claim removed. The earlier §4.3 bullet said ChaisemartinDHaultfoeuille was "robust to non-absorbing treatment and to spillover designs," but REGISTRY.md §ChaisemartinDHaultfoeuille only documents support for non-absorbing / reversible treatment — the estimator assumes SUTVA like every other DiD estimator in the suite. The same guide's §7 glossary already defines SUTVA as ruling out spillovers, so the §4.3 wording was internally inconsistent and would have misrouted interference designs to dCDH.
Rewrote the §4.3 bullet to state the actually-supported contract (non-absorbing / reversible treatment via DID_M / DID_l) and added an explicit note that interference / between-unit spillovers are not supported natively.
Regression test: `tests/test_profile_panel.py` extended with a forbidden-phrase check that fails if the autonomous guide re-advertises dCDH as "robust to spillover", "interference-robust", "supports spillover", or "and to spillover".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§4.10 repeated-cross-section wording tightened. The earlier text
("most estimators remain applicable") overstated the RCS support
surface: only CallawaySantAnna(panel=False), TripleDifference, and
StaggeredTripleDifference have documented RCS-capable contracts
(REGISTRY.md §CallawaySantAnna; docs/choosing_estimator.rst DDD
cross-sectional use cases). EfficientDiD and HeterogeneousAdoptionDiD
explicitly reject RCS per REGISTRY.md. SyntheticDiD and ContinuousDiD
are panel-only by construction. Under the old wording an autonomous
agent could silently route RCS data to a panel-only estimator.
Rewrote §4.10 into three explicit blocks:
- "Explicit RCS support": the three RCS-capable estimators.
- "Explicitly rejected for RCS (panel-only)": names EfficientDiD,
HeterogeneousAdoptionDiD, SyntheticDiD, ContinuousDiD.
- "Treat other estimators in this guide as panel-only unless their
own docs explicitly say otherwise."
Kept the clustered-SE note and added a cluster-vs-respondent
treatment-assignment check.
Regression test: asserts the guide does not contain the "most
estimators remain applicable" phrase, names `panel=False` as the
explicit CS RCS mode, and that the §4.10 section explicitly lists
both EfficientDiD and HeterogeneousAdoptionDiD as panel-only
rejected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (17 commits clean — PR #355, #358, #359 all merged since last rebase).
StaggeredTripleDifference corrected as panel-only + balance-enforced. The earlier §4.10 RCS wording paired TripleDifference / StaggeredTripleDifference together in the Explicit RCS support list, but REGISTRY.md §StaggeredTripleDifference requires a balanced panel and staggered_triple_diff.py:93-109 has no panel=False mode — fit() rejects unbalanced/duplicate (unit, time) structure at staggered_triple_diff.py:846-864.
- §4.10 Explicit RCS support: TripleDifference (two-period) only; StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added with a concrete "no panel=False mode" + "use TripleDifference for cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to the balance-sensitive gate.
Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the Explicitly rejected block and NOT in the Explicit RCS support block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `diff_diff.profile_panel(df, *, unit, time, treatment, outcome)` returns a frozen `PanelProfile` dataclass of structural facts about a DiD panel (balance, treatment-type classification, cohort structure, outcome characteristics, factual alerts). `.to_dict()` returns a JSON-serializable view. Descriptive only: alerts never name a specific estimator; estimator selection is up to the caller.
Paired with a new bundled `"autonomous"` variant on `get_llm_guide()` at `diff_diff/guides/llms-autonomous.txt`. Reference-shaped (distinct from the workflow-prose `"practitioner"` variant): `PanelProfile` field reference, 17-estimator x 9-design-feature support matrix, per-design-feature reasoning citing Baker et al. (2025) and Roth/Sant'Anna (2023), post-fit validation index, BR/DR schema reference, and an explicit "no deterministic recommendations" disclaimer.
`diff_diff/__init__.py` module docstring leads with a "For AI agents:" entry block so `help(diff_diff)` surfaces the four-step sequence (profile, reference, workflow, report) at import time. Exports `profile_panel`, `PanelProfile`, `Alert` from the top-level namespace. Both files are bundled inside the wheel and sdist (no GitHub/RTD dependency at runtime) via the existing `diff_diff/guides/*.txt` glob in `pyproject.toml`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The treatment classifier required exactly two observed distinct values
to treat a column as binary. Panels that are entirely never-treated
(values = {0}) or entirely always-treated (values = {1}) were falling
through to "continuous", contradicting the documented taxonomy which
defines "binary_absorbing" as "values in {0, 1}".
Rule is now values_set <= {0, 1, 0.0, 1.0} with at least one observed
value; entirely-NaN treatment columns fall through to "categorical"
rather than "continuous". Docstring and the autonomous guide section
§2 reference are updated to match.
New regression tests:
- all-zero treatment panel -> binary_absorbing, has_never_treated
- all-one treatment panel -> binary_absorbing, has_always_treated
- binary with NaNs, only zeros observed -> binary_absorbing
- all-NaN treatment -> categorical
- top-level import surface (profile_panel / PanelProfile / Alert)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
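The corrected rule can be sketched in a few lines. Illustrative only, not the library's actual _classify_treatment: binary means the observed value support is a subset of {0, 1} with at least one non-NaN value, and an all-NaN column falls through to "categorical".

```python
import numpy as np
import pandas as pd

def classify_binary_support(treatment) -> str:
    """values_set <= {0, 1} with at least one observed value => binary;
    no observed values at all => categorical."""
    observed = set(pd.Series(treatment).dropna().unique())
    if not observed:
        return "categorical"
    if observed <= {0, 1, 0.0, 1.0}:
        return "binary"  # absorbing vs non-absorbing decided downstream
    return "other"

print(classify_binary_support([0, 0, 0]))         # all never-treated
print(classify_binary_support([1, 1, 1]))         # all always-treated
print(classify_binary_support([0, np.nan, 0]))    # only zeros observed
print(classify_binary_support([np.nan, np.nan]))  # all-NaN
```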
profile_panel(): - Balance / coverage: compute from unique (unit, time) support rather than raw row count. Duplicated rows no longer inflate coverage above 1 or mask missing cells. Adds duplicate_unit_time_rows warn alert. - Non-absorbing detection: drop NaN before the monotonicity diff so a path like [0, 1, NaN, 0] is correctly classified binary_non_absorbing rather than binary_absorbing. - has_never_treated / has_always_treated: compute generically across numeric treatment types (not only binary). Continuous panels with zero-dose control units now surface has_never_treated=True; docstring and guide field reference updated. llms-autonomous.txt: - SyntheticDiD: staggered changed from "partial" to "strict block" (fit raises on within-unit variation); footnote rewritten. - TROP: staggered support marked (via absorbing D matrix); never-treated no longer required (donor pool is every unit untreated at period t); fit() covariate surface removed (there is none). - HeterogeneousAdoptionDiD: covariate adjustment changed to "not supported" (Appendix B.1 future work per REGISTRY.md); clustered SE annotated "warn" on continuous paths (cluster= kwarg is ignored with UserWarning, CR1 only on mass-point path). - Replace "notyettreated" with "not_yet_treated" throughout (matches staggered.py / sun_abraham.py public API). - Replace "Hausman.hausman_pretest" with "EfficientDiD.hausman_pretest" (the classmethod lives on EfficientDiD, not a Hausman namespace). Tests: - test_duplicate_unit_time_rows_do_not_inflate_coverage - test_reversal_through_nan_is_binary_non_absorbing - test_continuous_zero_dose_controls_flag_has_never_treated - test_guide_api_strings_resolve_against_public_api (verifies every estimator name, hausman_pretest path, and control_group spelling in the bundled guide resolves against diff_diff's public API; guards against future drift) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guide: - Rewrite HAD section (§4.9, §3 footnotes) to use WAS / WAS_d_lower terminology (paper Equation 2 / Theorem 1). HAD does not target ATT and its event-study output is per-event-time, not per-cohort. The fitted target_parameter attribute is literally "WAS" for Design 1' and "WAS_d_lower" for Design 1 / Assumption 6. - Add control_group="last_cohort" to the EfficientDiD guidance in §4.4 (no-never-treated paths) and the corresponding footnote. Previously the guide steered agents only toward assumption="PT-Post" when a never-treated cohort was absent, ruling out the supported pseudo- control path. Distinct from CallawaySantAnna's "not_yet_treated". profile_panel(): - Missing identifier rows are dropped up front when unit or time contains NaN. Previously n_units / n_periods used nunique() (NaN- dropping) while the unique (unit, time) support used drop_duplicates() (NaN-preserving), producing observation_coverage > 1 on filtered data with missing IDs. All structural facts now compute on the non-missing subset; a new missing_id_rows_dropped warn alert surfaces the drop count so the behavior is not silent. Tests: - test_missing_unit_or_time_ids_are_dropped_consistently asserts observation_coverage stays in [0, 1] and the alert fires with the correct drop count. - test_guide_api_strings_resolve_against_public_api extended with assertions for WAS / WAS_d_lower phrasing, the absence of the old "per-cohort Pierce-Schott" wording, and control_group="last_cohort". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guide:
- SunAbraham matrix row: never-treated-required changed from ✗ to ✓. The fit path raises ValueError when no never-treated units exist (sun_abraham.py:566-567); the registry documents this requirement. §4.3 prose updated to state the requirement explicitly.
- HAD §4.9 (and §3 Phase 3 description): correct the diagnostic-claim wording. The pre-test battery validates the QUG null (step 1), Assumption 7 pre-trends (step 2, event-study only), and linearity of E[ΔY|D_2] (step 3). Assumption 3 (uniform continuity / no extensive-margin jump) is explicitly "not testable" per REGISTRY.md L2181, so the guide must not claim Phase 3 validates it.

profile_panel():
- Empty-panel guard: raise ValueError when the input DataFrame is empty (direct) or when no rows remain after dropping missing (unit, time) identifiers. Previously both cases returned is_balanced=True / observation_coverage=0 on zero valid rows.
- Fix missing-id counting: n_rows_with_missing_id was the SUM of unit-NaN and time-NaN (double-counting rows missing both). Now computed from a row-wise isna().any(axis=1) mask so every dropped row is counted exactly once.

Tests:
- test_row_with_both_ids_missing_counted_once
- test_empty_dataframe_raises_value_error
- test_empty_after_id_drop_raises_value_error
- test_guide_api_strings_resolve_against_public_api extended: SunAbraham matrix-row cell assertion (never-treated-required=✓), "Assumption 3" presence, absence of "validate Assumptions 3 and 7", presence of "not testable".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
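The double-counting fix for missing-id rows can be illustrated on a toy frame (column names hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [1, np.nan, np.nan, 4],
    "t":  [0, 1,      np.nan, np.nan],
})

# Summing per-column NaN counts double-counts the row missing both ids:
double_counted = int(df["id"].isna().sum() + df["t"].isna().sum())  # 4

# A row-wise mask counts every dropped row exactly once:
mask = df[["id", "t"]].isna().any(axis=1)
n_rows_with_missing_id = int(mask.sum())  # 3
clean = df[~mask]                         # only the fully-identified row survives
```

Three rows are actually unusable here, so the per-column sum of 4 would over-report the drop count by one.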
Guide:
- Matrix covariate cells corrected. SyntheticDiD column 6 flipped ✗ → ✓: fit() accepts covariates= and residualizes the outcome (synthetic_did.py:213-239, 425-436). ContinuousDiD column 6 flipped ✓ → ✗: the fit() signature has no covariate argument and REGISTRY lists covariate adjustment as deferred.
- EfficientDiD wording corrected. The registry says PT-Post still uses never-treated as the comparison group (REGISTRY.md L756-758); "switch to PT-Post" does not drop the requirement. Rewrote the has_never_treated field description, the §3 footnote, and §4.4 prose to state that PT-All and PT-Post both require never-treated units and control_group="last_cohort" is the only supported path for an all-eventually-treated panel.

profile_panel():
- _compute_pre_post now uses each treated unit's observed (unit, time) support rather than the global panel period set. On unbalanced panels this correctly captures the least-supported treated unit's pre/post exposure and fires short_pre_panel / short_post_panel when appropriate; the previous global-periods approach could overstate exposure and suppress the alert.

Guide §5 API signatures corrected:
- compute_pretrends_power takes a fitted results object, not a raw DataFrame (pretrends.py:1048).
- plot_sensitivity takes a SensitivityResults object (visualization/_diagnostic.py:12-14).
- plot_honest_event_study takes a HonestDiDResults object (visualization/_event_study.py:9).

Tests:
- test_min_pre_post_use_per_unit_observed_support (unbalanced treated unit missing early pre-period fires short_pre_panel).
- test_guide_api_strings_resolve_against_public_api extended with SyntheticDiD/ContinuousDiD covariate-cell matrix assertions, EfficientDiD both-PT-variants-require-never-treated wording assertion, and §5 API-signature fingerprints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
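The per-unit pre/post computation can be sketched as follows (a simplified stand-in for _compute_pre_post; function and column names are hypothetical):

```python
import pandas as pd

def min_pre_post(df, unit, time, first_treat):
    """Min pre/post period counts across treated units, using each
    unit's own observed (unit, time) support rather than the global
    panel period set."""
    pre, post = [], []
    for _, g in df.groupby(unit):
        g0 = g[first_treat].iloc[0]
        if pd.isna(g0):
            continue  # never-treated units don't bound treated exposure
        times = g[time].drop_duplicates()
        pre.append(int((times < g0).sum()))
        post.append(int((times >= g0).sum()))
    return min(pre), min(post)

# Unit 2 is only observed from t=2 on, so it has a single pre-period;
# a global-period-set computation would credit it with three.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "t":  [0, 1, 2, 3, 2, 3],
    "g":  [2, 2, 2, 2, 3, 3],  # first-treated period per unit
})
print(min_pre_post(df, "id", "t", "g"))  # (1, 1)
```

On this unbalanced panel the per-unit support correctly reports min_pre = 1, which is the kind of value that should trigger a short_pre_panel alert.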
Guide §6 rewritten to match the actual BR/DR to_dict() schemas emitted at diff_diff/business_report.py:524-549 and diff_diff/diagnostic_report.py:1013-1031.
- BR top-level keys are "assumption" (singular), "pre_trends" (underscored), "sample" (bare); the new documentation also adds "estimator", "context", "heterogeneity", "diagnostics", "next_steps", "references".
- DR keys are "estimator" (string), "headline_metric", "estimator_native_diagnostics", "skipped", "warnings", "overall_interpretation", "next_steps"; the old "checks: dict" nesting never existed and "estimator_type" was never the key name.
- The "forthcoming schema additions" block is dropped since it documented an un-shipped surface.

Prose + alert wording aligned with the per-treated-unit implementation:
- Guide §2: min_pre_periods / min_post_periods described as "across treated units" (each unit's observed (unit, time) support counted independently on unbalanced panels).
- profile.py: short_pre_panel / short_post_panel alert messages use "across treated units" to match the computation.

Tests:
- test_guide_api_strings_resolve_against_public_api extended with BR/DR §6 schema assertions. Guards every real top-level key documented (assumption, pre_trends, sample, headline_metric, estimator_native_diagnostics, overall_interpretation) and forbids the previously documented obsolete keys (assumptions, pretrends, main_result, sample_summary, estimator_type, checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
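The corrected key surfaces can be spot-checked offline the way the regression test does (toy key sets only, values elided; these are not the real report objects):

```python
# Top-level keys the rewritten §6 documents for BR and DR to_dict()
br_keys = {"assumption", "pre_trends", "sample", "estimator", "context",
           "heterogeneity", "diagnostics", "next_steps", "references"}
dr_keys = {"estimator", "headline_metric", "estimator_native_diagnostics",
           "skipped", "warnings", "overall_interpretation", "next_steps"}

# Keys the old guide documented that never existed in the emitted schemas
obsolete = {"assumptions", "pretrends", "main_result", "sample_summary",
            "estimator_type", "checks"}

assert br_keys.isdisjoint(obsolete)
assert dr_keys.isdisjoint(obsolete)
```

An agent consuming to_dict() output should key on the documented names above rather than the obsolete singular/plural variants.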
Guide methodology attribution corrected:
- EfficientDiD primary source is Chen, Sant'Anna, Xie (2025) per
REGISTRY.md L746-749, not Arkhangelsky-Imbens. §4.3 bullet and the
§7 citation list updated.
- ContinuousDiD synopsis (§4.7) now distinguishes the PT-identified
targets (`ATT(d|d)`, `ATT^{loc}`) from the SPT-required targets
(`ATT(d)`, `ACRT(d)`, `ACRT^{glob}`) per REGISTRY.md L682-706, with
the Callaway, Goodman-Bacon, Sant'Anna (2024) citation added to §7.
Guide §6 BR schema corrections:
- `diagnostics` wrapper documented accurately: always has `status`,
with `schema` + `overall_interpretation` on the "ran" path or
`reason` on the "skipped" path. Previously the guide said the key
held the DR payload directly.
- `target_parameter.reference` field documented alongside the other
fields per `describe_target_parameter()` in
diff_diff/_reporting_helpers.py:34-42.
Regression tests extended:
- `diagnostics["schema"]` wrapper wording guarded (and absence of
plain "DR payload under diagnostics").
- `target_parameter.reference` key documentation guarded.
- EfficientDiD attribution: `Chen, Sant'Anna, Xie 2025` present,
`(Arkhangelsky-Imbens)` absent.
- ContinuousDiD targets: `ATT(d|d)`, `ACRT`, `Strong Parallel Trends`
all present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContinuousDiD eligibility (P1): ContinuousDiD.fit() requires zero-dose
control units (P(D=0) > 0) because Remark 3.1 lowest-dose-as-control
is not yet implemented (REGISTRY.md L714; continuous_did.py:349-360).
Guide corrections:
- Matrix row: never-treated-required ✗ → ✓.
- has_never_treated field doc: add ContinuousDiD to the required-by
list with the Remark 3.1 note.
- §4.4 no-never-treated paths: add ContinuousDiD bullet naming the
ValueError and pointing at the unimplemented Remark 3.1 path.
- §4.7 ContinuousDiD bullet: prepend the "Requires zero-dose control
units" callout.
DiagnosticReport §6 section-status semantics (P2): the real schema
uses `status` for execution state ("ran", "not_applicable", "not_run",
"no_scalar_by_design", "skipped") and `verdict` as a separate field
for qualitative interpretation (e.g. "clean", "inconclusive",
"violated"). Previously §6 conflated these: it listed
pass/warn/inconclusive as status values and omitted the execution
statuses. Rewrote the section-status paragraph to document the two
layers explicitly; moved the forthcoming-additions note into the
same block.
Tests:
- ContinuousDiD matrix-row cell assertion (col 5 = ✓).
- "P(D=0) > 0" substring fingerprint guards the Remark 3.1 wording.
- DR §6 status vocabulary: guards every real status the implementation
emits ("ran", "not_applicable", "not_run", "no_scalar_by_design")
and forbids the obsolete pass/warn/inconclusive-as-status framing;
`verdict` must be mentioned as a distinct concept.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContinuousDiD staggered support (P1 #1): the matrix marked staggered=✗, but the method natively supports staggered adoption via the `first_treat` column (continuous_did.py:159-169, 919-925; REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.

Time-invariant dose requirement (P1 #2): ContinuousDiD.fit() requires dose to be time-invariant per unit (continuous_did.py:222-228; docs/methodology/continuous-did.md:L70-75), but profile_panel() did not expose this, so time-varying-dose continuous panels were routed to ContinuousDiD only to hard-fail at fit time. Added `PanelProfile.treatment_varies_within_unit: bool` — True iff any unit has more than one distinct non-NaN treatment value across its observed rows. Computed unconditionally for numeric (non-bool) treatment columns; False for categorical. `to_dict()` exposes it. Guide §2 documents the field; §4.7 ContinuousDiD bullet lists two eligibility prerequisites: P(D=0) > 0 AND treatment_varies_within_unit == False.

Tests (P2):
- test_continuous_treatment_with_time_varying_dose: random-per-row continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose -> treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓; guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
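The documented semantics of the new flag can be sketched in a few lines (hypothetical helper mirroring the field's definition):

```python
import pandas as pd

def treatment_varies_within_unit(df, unit, treatment) -> bool:
    """True iff any unit has more than one distinct non-NaN treatment
    value across its observed rows."""
    per_unit = df.groupby(unit)[treatment].nunique(dropna=True)
    return bool((per_unit > 1).any())

# Constant-per-unit dose -> False (eligible on this axis);
# a dose that moves within a unit -> True (would hard-fail at fit time).
const = pd.DataFrame({"id": [1, 1, 2, 2], "d": [0.5, 0.5, 0.0, 0.0]})
vary  = pd.DataFrame({"id": [1, 1, 2, 2], "d": [0.5, 0.7, 0.0, 0.0]})
print(treatment_varies_within_unit(const, "id", "d"))  # False
print(treatment_varies_within_unit(vary,  "id", "d"))  # True
```

Note that a binary absorbing panel is True by construction (every treated unit moves 0 → 1), which is why the flag only gates the continuous-dose routing decision.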
Rebased onto current main (resolved CHANGELOG.md conflict: the by_path bullet from PR #355 and the profile_panel/autonomous-guide bullet from this PR now live side-by-side under [Unreleased]).

has_always_treated now has binary-only semantics:
- For binary treatment (absorbing or non-absorbing): unit_min == 1 means the unit is treated in every observed period (no pre-treatment information in the DiD sense).
- For continuous treatment: always False. Pre-treatment periods on continuous DiD are determined by the separate `first_treat` column supplied to `ContinuousDiD.fit`, not by whether the dose is positive. A unit with a constant positive dose can still have well-defined pre-treatment periods, so flagging it as "always-treated / no pre-treatment information" was factually wrong and triggered the misleading `has_always_treated_units` alert on valid continuous panels.
- Categorical: False by construction.

Guide §2 has_always_treated field doc updated to state the binary-only semantics explicitly, with a note about `first_treat`.

Tests:
- New: test_continuous_positive_dose_does_not_fire_has_always_treated asserts has_always_treated=False AND the alert does not fire on a constant-positive-dose continuous panel.
- Existing test_continuous_zero_dose_controls_flag_has_never_treated updated: has_always_treated expected to be False (was True under the old semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
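The binary-only rule can be sketched as a small predicate (function and treatment-type labels are illustrative stand-ins for the internal logic):

```python
import pandas as pd

def has_always_treated(df, unit, treatment, treatment_type) -> bool:
    """Binary-only semantics: a binary unit whose minimum observed
    treatment is 1 has no pre-treatment periods. Continuous and
    categorical treatments always return False — their pre-periods
    come from a separate first_treat column, not the dose sign."""
    if treatment_type not in ("binary_absorbing", "binary_non_absorbing"):
        return False
    unit_min = df.groupby(unit)[treatment].min()
    return bool((unit_min == 1).any())

binary = pd.DataFrame({"id": [1, 1, 2, 2], "w": [1, 1, 0, 1]})  # unit 1 always on
dose   = pd.DataFrame({"id": [1, 1, 2, 2], "d": [0.8, 0.8, 0.0, 0.0]})
print(has_always_treated(binary, "id", "w", "binary_absorbing"))  # True
print(has_always_treated(dose,   "id", "d", "continuous"))        # False
```

Under the old numeric-generic semantics the constant-positive-dose unit in `dose` would have fired the alert even though its pre-periods are perfectly well defined via `first_treat`.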
Balanced-panel eligibility (P1): ContinuousDiD, EfficientDiD, SyntheticDiD, and HeterogeneousAdoptionDiD all hard-reject unbalanced panels at fit() time (continuous_did.py:329-338, efficient_did.py:407-414, synthetic_did.py:399-412, had.py:1173-1188; REGISTRY.md cross-refs). Guide updates surface this:
- New "Balanced-panel eligibility" block after §3 matrix footnotes names the four affected estimators and points at `PanelProfile.is_balanced == True` as the gate. Directs users with unbalanced panels to `diff_diff.prep.balance_panel()` or to a balance-tolerant estimator.
- §4 per-estimator bullets for all four estimators prepend or append the balanced-panel requirement with the specific fit() error the caller would otherwise hit.
- ContinuousDiD §4.7 bullet now lists THREE eligibility prerequisites (zero-dose controls, time-invariant dose, balanced panel) where it previously listed two.

Docstring (P3): profile_panel() docstring notes block updated to match the binary-only has_always_treated semantics shipped in round 9. The old wording claimed the field fired on "strictly positive treatment in every observed non-NaN row" across numeric types, which no longer matches the implementation.

Tests (P2):
- Semantic guide test asserts `is_balanced` is mentioned in the guide and each of the four balance-sensitive estimators appears within 400 characters of a "balanced" / "is_balanced" marker, so future edits cannot silently drop the eligibility gate from any of them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
HeterogeneousAdoptionDiD staggered support is `partial` (per §3 matrix), but the restriction was never unpacked. Per REGISTRY.md L2281 and had.py:1100-1288, Appendix B.2 limits staggered HAD to the **last treatment cohort plus never-treated units**. With `first_treat_col` supplied, `fit(aggregate="event_study")` auto-filters to F_last and emits a UserWarning naming kept/dropped counts; earlier-cohort units are dropped. Without `first_treat_col` on a multi-cohort panel, fit() raises a front-door ValueError pointing at ChaisemartinDHaultfoeuille for full staggered support.

Guide updates:
- New §3 footnote on the HAD `partial` cell spelling out the last-cohort-only restriction, the `first_treat_col` requirement for the auto-filter, and the ChaisemartinDHaultfoeuille fallback.
- §4.9 HAD bullet appended with a "Staggered-timing scope is last-cohort-only" paragraph carrying the same contract plus the "last-cohort-only WAS" estimand clarification.

Tests:
- Semantic guide test asserts "last-cohort-only" (or "last cohort") wording, "first_treat_col" token, and ChaisemartinDHaultfoeuille as the fallback are all present so future guide edits cannot silently drop the Appendix B.2 disclosure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Balanced-panel eligibility gate tightened. `PanelProfile.is_balanced` is computed from the unique `(unit, time)` support, so it stays `True` even when duplicate rows exist — `duplicate_unit_time_rows` is a separate alert for that case. But ContinuousDiD / EfficientDiD / HeterogeneousAdoptionDiD all require exactly one observation per cell: EfficientDiD and HAD raise ValueError on duplicates at fit() time, and ContinuousDiD's precompute path silently resolves duplicates via last-row-wins (which can change the estimand without warning).

Guide §3 balanced-panel-eligibility block now requires BOTH `is_balanced == True` AND absence of the `duplicate_unit_time_rows` alert before routing to these estimators, with the specific failure mode (raise vs silent overwrite) named per-estimator and a concrete two-step fix (`balance_panel()` + `drop_duplicates([unit, time])`).

Tests: extended the semantic guide test to assert the guide mentions `duplicate_unit_time_rows` and uses "BOTH/both" wording around the is_balanced gate, so future edits cannot silently drop the duplicate-row half of the eligibility requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
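Why both conditions are needed can be demonstrated on a minimal panel (hypothetical pre-flight helper; profile_panel() reports the same two facts):

```python
import pandas as pd

def gate_balance_sensitive(df, unit, time):
    """Return (is_balanced, n_duplicate_cells): the unique (unit, time)
    support can fill the full grid even while duplicate rows exist, so
    the two checks are independent and BOTH must pass."""
    dup_rows = int(df.duplicated(subset=[unit, time]).sum())
    support = df[[unit, time]].drop_duplicates()
    is_balanced = len(support) == df[unit].nunique() * df[time].nunique()
    return is_balanced, dup_rows

# Cell (1, 1) appears twice, yet the unique support covers the 2x2 grid:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2], "t": [0, 1, 1, 0, 1]})
balanced, dups = gate_balance_sensitive(df, "id", "t")
print(balanced, dups)  # True 1
```

Here `is_balanced` alone would wave the panel through, while the duplicate cell would trip a ValueError (EfficientDiD / HAD) or a silent last-row-wins overwrite (ContinuousDiD); the documented fix is `df.drop_duplicates([unit, time])` followed by the library's balancing helper.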
Bool-dtype treatment columns are now classified the same way as
numeric {0, 1} rather than as "categorical". The library's binary
estimators validate value support via `validate_binary` (utils.py:
49-67), which accepts bool because True/False coerce to 1/0
numerically. Classifying bool as categorical silently routed valid
binary DiD panels away from the supported estimator set.
Changes:
- _classify_treatment() no longer early-returns "categorical" for
bool dtype. The downstream absorbing/non-absorbing logic handles
bool by casting to int before np.diff (raw bool diff is XOR, which
would mask a True -> False transition).
- treatment_varies_within_unit now includes bool-dtype columns (was
previously hardcoded False for bool).
- Guide §2 removes the "bool = categorical" rule and adds an
explicit "bool is binary" note with a pointer to validate_binary
as the reason.
- profile_panel() docstring mirrors the same update.
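The int-cast-before-diff point can be illustrated directly (toy series; the commit's observation is that raw bool diff has XOR semantics, which loses the sign of the transition):

```python
import numpy as np
import pandas as pd

path = pd.Series([False, True, False], dtype=bool)  # a True -> False reversal

# Raw bool diff behaves like XOR: both transitions register identically,
# so the direction of the True -> False step is lost. Casting to int
# first restores signed steps, where -1 marks a reversal.
steps = np.diff(path.to_numpy().astype(int))
print(steps.tolist())           # [1, -1]
print(bool((steps < 0).any()))  # True -> classified binary_non_absorbing
```

Without the cast, the reversal at the end of `path` would be indistinguishable from a fresh adoption, regressing to binary_absorbing.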
Tests:
- test_bool_dtype_treatment_is_binary_absorbing: staggered-style
bool panel with never-treated cohort -> binary_absorbing, correct
has_never_treated / treatment_varies_within_unit / cohort_sizes.
- test_bool_dtype_non_absorbing: reversible False -> True -> False
bool panel -> binary_non_absorbing. Guards the int-cast before
np.diff so future refactors don't regress to bool XOR semantics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dCDH "spillover designs" claim removed. The earlier §4.3 bullet said ChaisemartinDHaultfoeuille was "robust to non-absorbing treatment and to spillover designs," but REGISTRY.md §ChaisemartinDHaultfoeuille only documents support for non-absorbing / reversible treatment — the estimator assumes SUTVA like every other DiD estimator in the suite. The same guide's §7 glossary already defines SUTVA as ruling out spillovers, so the §4.3 wording was internally inconsistent and would have misrouted interference designs to dCDH.

Rewrote the §4.3 bullet to state the actually-supported contract (non-absorbing / reversible treatment via DID_M / DID_l) and added an explicit note that interference / between-unit spillovers are not supported natively.

Regression test: `tests/test_profile_panel.py` extended with a forbidden-phrase check that fails if the autonomous guide re-advertises dCDH as "robust to spillover", "interference-robust", "supports spillover", or "and to spillover".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§4.10 repeated-cross-section wording tightened. The earlier text
("most estimators remain applicable") overstated the RCS support
surface: only CallawaySantAnna(panel=False), TripleDifference, and
StaggeredTripleDifference have documented RCS-capable contracts
(REGISTRY.md §CallawaySantAnna; docs/choosing_estimator.rst DDD
cross-sectional use cases). EfficientDiD and HeterogeneousAdoptionDiD
explicitly reject RCS per REGISTRY.md. SyntheticDiD and ContinuousDiD
are panel-only by construction. Under the old wording an autonomous
agent could silently route RCS data to a panel-only estimator.
Rewrote §4.10 into three explicit blocks:
- "Explicit RCS support": the three RCS-capable estimators.
- "Explicitly rejected for RCS (panel-only)": names EfficientDiD,
HeterogeneousAdoptionDiD, SyntheticDiD, ContinuousDiD.
- "Treat other estimators in this guide as panel-only unless their
own docs explicitly say otherwise."
Kept the clustered-SE note and added a cluster-vs-respondent
treatment-assignment check.
Regression test: asserts the guide does not contain the "most
estimators remain applicable" phrase, names `panel=False` as the
explicit CS RCS mode, and that the §4.10 section explicitly lists
both EfficientDiD and HeterogeneousAdoptionDiD as panel-only
rejected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (17 commits clean — PR #355, #358, #359 all merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced. The earlier §4.10 RCS wording paired TripleDifference / StaggeredTripleDifference together in the Explicit RCS support list, but REGISTRY.md §StaggeredTripleDifference requires a balanced panel and staggered_triple_diff.py:93-109 has no panel=False mode — fit() rejects unbalanced/duplicate (unit, time) structure at staggered_triple_diff.py:846-864.
- §4.10 Explicit RCS support: TripleDifference (two-period) only; StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added with a concrete "no panel=False mode" + "use TripleDifference for cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e3b22dc to ef6b53d
…ck get_loo_effects_df on survey jackknife
P0 (Methodology — survey jackknife silently skipping undefined LOO):
The Rust & Rao (1996) stratified jackknife formula `SE² =
Σ_h (1-f_h)·(n_h-1)/n_h·Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)²` requires every
PSU-LOO `τ̂_{(h,j)}` to be defined. The previous implementation
silently skipped PSUs whose deletion removed all treated units (or
zeroed control ω_eff mass, or raised in the estimator) while still
applying the full `(n_h-1)/n_h` factor, under-scaling variance on
designs where treated units pack into a single PSU.
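The Rust & Rao formula above can be recomputed by hand from per-stratum LOO estimates, which is what the rewritten self-consistency test does. A sketch with made-up numbers (helper name and inputs are illustrative, not the library's internal API):

```python
import numpy as np

def rust_rao_se(loo_by_stratum, f_by_stratum):
    """SE^2 = sum_h (1 - f_h) * (n_h - 1) / n_h
                    * sum_{j in h} (tau_hj - taubar_h)^2.
    Every PSU-LOO tau_hj in a contributing stratum (n_h >= 2) must be
    defined; an undefined replicate means SE = NaN, not a silent skip."""
    var = 0.0
    for h, taus in loo_by_stratum.items():
        taus = np.asarray(taus, dtype=float)
        n_h = len(taus)
        if n_h < 2:
            continue  # lonely-PSU stratum: canonical silent skip
        if not np.isfinite(taus).all():
            return float("nan")  # undefined replicate -> undefined SE
        var += ((1 - f_by_stratum[h]) * (n_h - 1) / n_h
                * ((taus - taus.mean()) ** 2).sum())
    return var ** 0.5

loo = {0: [1.9, 2.1, 2.0], 1: [2.2, 1.8]}  # tau_hat_(h,j) per stratum h
f = {0: 0.0, 1: 0.0}                        # sampling fractions f_h
print(rust_rao_se(loo, f))
print(rust_rao_se({0: [1.0, float("nan")]}, {0: 0.0}))  # nan
```

The silent-skip bug amounted to dropping an undefined tau_hj from the inner sum while keeping the full (n_h - 1)/n_h factor, which only shrinks the variance.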
Fix: `_jackknife_se_survey` now tracks any undefined replicate in a
contributing stratum (n_h ≥ 2) and short-circuits to `SE=NaN` with a
targeted `UserWarning` naming the stratum / PSU / reason (deletion
removes all treated, kept ω_eff zero, kept treated survey mass zero,
estimator raised, estimator returned non-finite). Partial LOOs are
still returned in `placebo_effects` for debugging; users needing a
variance estimator that accommodates PSU-deletion infeasibility
should use `variance_method="bootstrap"`. Silent stratum-level skip
for `n_h < 2` is preserved (canonical lonely-PSU handling matching
R `survey::svyjkn`).
New regression `test_jackknife_full_design_undefined_replicate_returns_nan`
exercises the fix on the original `sdid_survey_data_full_design`
fixture (treated all in stratum 0 PSU 0 → LOO PSU 0 removes all
treated) and asserts both the `UserWarning` match and `np.isnan(se)`.
The existing jackknife tests that asserted finite SE now use a new
`sdid_survey_data_jk_well_formed` fixture where treated units are
spread across two PSUs within stratum 0 (so every LOO leaves ≥1
treated). The self-consistency test
(`test_jackknife_full_design_stratum_aggregation_formula_magnitude`)
was rewritten from a weak finite-positive check to a real
recomputation of the Rust & Rao formula on the returned 6-entry
`placebo_effects` array, asserting `result.se == pytest.approx(
expected, rel=1e-12)`.
Coverage MC (`benchmarks/data/sdid_coverage.json`) is unchanged:
the `stratified_survey` DGP spreads its 32 treated units across
PSUs 2 and 3 within stratum 1 and PSUs 0 and 1 within stratum 0,
so every LOO is defined there too. The previously-reported jackknife
anti-conservatism (α=0.05 rejection = 0.45, SE/trueSD = 0.46) is
the documented few-PSU limitation (1 effective DoF per stratum
with `n_h = 2`), not the P0 silent-skip bug.
P1 (Code Quality — get_loo_effects_df on survey jackknife):
`SyntheticDiDResults.get_loo_effects_df()` assumes a length-N
unit-indexed `placebo_effects` array (first n_control are control-
LOO, next n_treated are treated-LOO). Survey-jackknife fits return
a flat PSU-level replicate array of variable length; joining onto
the fit-time `control_unit_ids + treated_unit_ids` would mislabel
PSU replicates as unit-level effects.
Fix: `get_loo_effects_df()` now raises `NotImplementedError` with a
targeted message pointing to `result.placebo_effects` for the raw
PSU-level array and REGISTRY §SyntheticDiD "Note (survey + jackknife
composition)" for the aggregation formula. New regression
`test_get_loo_effects_df_raises_on_survey_jackknife` asserts the
raise on a survey fit. Non-survey and pweight-only jackknife fits
continue to use `get_loo_effects_df()` as before (unit-level LOO).
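The shape mismatch that motivates the raise can be sketched with a stand-in results object (class and flag names are hypothetical, not the library's real attributes):

```python
class FakeResults:
    """Stand-in for SyntheticDiDResults; is_survey_jackknife is a
    hypothetical flag for illustration only."""
    def __init__(self, is_survey_jackknife: bool):
        self.is_survey_jackknife = is_survey_jackknife

def get_loo_effects_df(result):
    # A unit-indexed LOO frame only makes sense when replicates align
    # one-to-one with control/treated unit ids; survey-jackknife fits
    # return a variable-length PSU-level replicate array instead.
    if result.is_survey_jackknife:
        raise NotImplementedError(
            "survey-jackknife replicates are PSU-level; inspect "
            "result.placebo_effects directly"
        )
    return "unit-level LOO frame"  # the real method builds a DataFrame

raised = False
try:
    get_loo_effects_df(FakeResults(True))
except NotImplementedError:
    raised = True
```

Raising loudly here is the safer contract: joining PSU replicates onto unit ids would silently mislabel rows rather than fail.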
P3 (Documentation — stale default variance_method note):
`docs/methodology/REGISTRY.md:L1569` default-variance-method note
rewritten to reflect that all three variance methods now support
full survey designs (removing "full design supported on bootstrap
only" language) and to recommend bootstrap specifically on surveys
with few PSUs per stratum.
Branch also rebased onto current origin/main to pick up PR #356
(agent-profile-panel) and PR #361 — the R1 Maintainability finding
about "unrelated API deletions" was a stale-base-drift artifact
(my branch was created before #356 merged). After rebase the diff
against main shows only SDID-survey changes.
Verification
------------
pytest tests/test_survey_phase5.py
tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
→ 87 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary

- `diff_diff.profile_panel(df, *, unit, time, treatment, outcome)` returning a frozen `PanelProfile` dataclass of structural facts (panel balance, treatment-type classification, cohort structure, outcome characteristics, factual `Alert` tuple). `.to_dict()` returns a JSON-serializable view. Descriptive only — alerts never name an estimator.
- `get_llm_guide("autonomous")` variant at `diff_diff/guides/llms-autonomous.txt`. Reference-shaped (distinct from the workflow-prose `"practitioner"` variant): `PanelProfile` field reference (§2), 17-estimator × 9-design-feature support matrix (§3), per-design-feature reasoning (§4), post-fit validation index (§5), BR/DR schema reference (§6), citations (§7), intentional-omissions disclaimer (§8).
- `diff_diff/__init__.py` module docstring so `help(diff_diff)` leads with the four-step sequence (describe / consult reference / follow workflow / report).

Both the utility and the guide are bundled inside the wheel and sdist via the existing `diff_diff/guides/*.txt` glob, so agents can access them at runtime without a GitHub or ReadTheDocs dependency.

Test plan

- `pytest tests/test_profile_panel.py -v` — 21 unit tests covering panel shapes (balanced / unbalanced / staggered / never-treated / always-treated / degenerate all-zero and all-one / all-NaN treatment), treatment-type classification, alerts, frozen-immutability, JSON round-trip, top-level import surface, factual-not-recommender message check.
- `pytest tests/test_guides.py -v` — auto-parametrized variant loader now covers `"autonomous"`; content-stability assertions check for `profile_panel` + `estimator-support matrix` substrings and an intact markdown table (pipe-count ≥ 11).
- `black diff_diff tests && ruff check diff_diff/profile.py tests/test_profile_panel.py && mypy diff_diff/profile.py` — all clean.
- `python -c "import diff_diff; help(diff_diff)"` — agent-oriented block appears at the top.
- `python -c "import diff_diff; print(diff_diff.get_llm_guide('autonomous')[:400])"` — bundled guide loads.
- `python -m build --sdist && tar -tzf dist/diff_diff-*.tar.gz | grep -E "llms-autonomous|profile.py"` — both files ship in the sdist.

Design notes

- `profile_*` / `describe_*` / `detect_*` only — never `recommend_*` / `choose_*` / `select_*`.
- `profile_panel` v1 scope is core only (panel structure, treatment variation, outcome characteristics, alerts). Covariate profiling, survey-design profiling (weights / strata / PSU), and duplicate-key detection are deferred to follow-on PRs.
- `practitioner_next_steps()` and `llms-practitioner.txt` workflow guide — this PR covers the pre-fit describe step and the estimator-choice reference; the practitioner variant remains the canonical post-fit validation workflow.

🤖 Generated with Claude Code