
Add profile_panel() + autonomous LLM guide for agent-facing workflows #356

Merged
igerber merged 18 commits into main from agent-profile-panel
Apr 24, 2026

Conversation

@igerber (Owner) commented Apr 24, 2026

Summary

  • Adds diff_diff.profile_panel(df, *, unit, time, treatment, outcome) returning a frozen PanelProfile dataclass of structural facts (panel balance, treatment-type classification, cohort structure, outcome characteristics, factual Alert tuple). .to_dict() returns a JSON-serializable view. Descriptive only — alerts never name an estimator.
  • Adds get_llm_guide("autonomous") variant at diff_diff/guides/llms-autonomous.txt. Reference-shaped (distinct from the workflow-prose "practitioner" variant): PanelProfile field reference (§2), 17-estimator × 9-design-feature support matrix (§3), per-design-feature reasoning (§4), post-fit validation index (§5), BR/DR schema reference (§6), citations (§7), intentional-omissions disclaimer (§8).
  • Prepends a "For AI agents:" entry block to the diff_diff/__init__.py module docstring so help(diff_diff) leads with the four-step sequence (describe / consult reference / follow workflow / report).

Both the utility and the guide are bundled inside the wheel and sdist via the existing diff_diff/guides/*.txt glob, so agents can access them at runtime without a GitHub or ReadTheDocs dependency.
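
A minimal usage sketch of the two surfaces this PR adds; the DataFrame and its column names are illustrative, not part of the API, and the snippet has not been run against this branch.

```python
import pandas as pd
import diff_diff

# Tiny illustrative panel: two units, two periods, one unit treated in the
# second period. Column names here are arbitrary.
df = pd.DataFrame({
    "id":      [1, 1, 2, 2],
    "year":    [2019, 2020, 2019, 2020],
    "treated": [0, 1, 0, 0],
    "y":       [1.0, 1.4, 0.9, 1.1],
})

# Pre-fit describe step: structural facts only, no estimator recommendation.
profile = diff_diff.profile_panel(
    df, unit="id", time="year", treatment="treated", outcome="y"
)
print(profile.to_dict())  # JSON-serializable view of the frozen PanelProfile

# Estimator-choice reference bundled with the package.
print(diff_diff.get_llm_guide("autonomous")[:400])
```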

Test plan

  • pytest tests/test_profile_panel.py -v — 21 unit tests covering panel shapes (balanced / unbalanced / staggered / never-treated / always-treated / degenerate all-zero and all-one / all-NaN treatment), treatment-type classification, alerts, frozen-immutability, JSON round-trip, top-level import surface, factual-not-recommender message check.
  • pytest tests/test_guides.py -v — auto-parametrized variant loader now covers "autonomous"; content-stability assertions check for profile_panel + estimator-support matrix substrings and an intact markdown table (pipe-count ≥ 11).
  • black diff_diff tests && ruff check diff_diff/profile.py tests/test_profile_panel.py && mypy diff_diff/profile.py — all clean.
  • python -c "import diff_diff; help(diff_diff)" — agent-oriented block appears at the top.
  • python -c "import diff_diff; print(diff_diff.get_llm_guide('autonomous')[:400])" — bundled guide loads.
  • python -m build --sdist && tar -tzf dist/diff_diff-*.tar.gz | grep -E "llms-autonomous|profile.py" — both files ship in the sdist.

Design notes

  • No deterministic recommender API. The utility reports facts; the guide enumerates trade-offs per design feature; estimator selection is the agent's responsibility. Verb discipline: profile_* / describe_* / detect_* only — never recommend_* / choose_* / select_*.
  • profile_panel v1 scope is core only (panel structure, treatment variation, outcome characteristics, alerts). Covariate profiling, survey-design profiling (weights / strata / PSU), and duplicate-key detection are deferred to follow-on PRs.
  • Both surfaces pair with the existing practitioner_next_steps() and llms-practitioner.txt workflow guide — this PR covers the pre-fit describe step and the estimator-choice reference; the practitioner variant remains the canonical post-fit validation workflow.

🤖 Generated with Claude Code

@igerber force-pushed the agent-profile-panel branch from c1f141a to d4a3dc1 on April 24, 2026 at 11:44
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d4a3dc1af238e35478bc8d665ec65aa9cc2edc3c


Overall Assessment

⚠️ Needs changes

The PR adds a new agent-facing methodology surface, but the new autonomous guide is not consistent with docs/methodology/REGISTRY.md / estimator docstrings, and profile_panel() has silent structural-classification bugs on realistic edge cases. Static review only: I could not run the new tests here because the local Python environment is missing pandas.

Executive Summary

  • The new autonomous guide misstates estimator support for SyntheticDiD, TROP, and HeterogeneousAdoptionDiD.
  • The guide also contains API strings that do not exist as written, including control_group="notyettreated" and Hausman.hausman_pretest.
  • profile_panel() can report duplicated unit-time panels as balanced because it uses raw row count instead of unique (unit, time) support.
  • profile_panel()’s monotonicity check includes NaN, so a 1 -> NaN -> 0 treatment path can be mislabeled binary_absorbing.
  • profile_panel() suppresses has_never_treated on continuous-dose panels even when zero-dose control units exist.
  • The new tests mostly check guide fingerprints/table shape and miss the edge cases above.

Methodology

  • P1 — diff_diff/guides/llms-autonomous.txt:L175-L215,L247-L338 is not faithful to the registry/docstrings for multiple affected methods. SyntheticDiD is marked staggered=partial, but the registry/code require block treatment and raise on within-unit treatment variation (docs/methodology/REGISTRY.md:L1421-L1425,L1542-L1544, diff_diff/synthetic_did.py:L376-L388). TROP is described as needing never-treated units and partial covariate support, but the local method uses all untreated-at-t units and fit() has no covariate surface (docs/methodology/REGISTRY.md:L1911-L1923,L1930-L1933,L2042-L2046, diff_diff/trop.py:L370-L412, diff_diff/trop_local.py:L388-L412). Impact: the new autonomous workflow can reject valid estimators or suggest unsupported ones. Concrete fix: rebuild the SyntheticDiD / TROP rows and footnotes directly from REGISTRY.md.
  • P1 — diff_diff/guides/llms-autonomous.txt:L193-L215,L201-L207,L287-L288,L416-L417 also misstates HeterogeneousAdoptionDiD and public API names. The guide marks HAD as supporting covariate adjustment / clustered SE, but Appendix B.1 covariates are explicitly future work and estimator-level cluster= is ignored on the continuous paths (docs/methodology/REGISTRY.md:L2172-L2173,L2231-L2239, diff_diff/had.py:L1832-L1837,L2363-L2370). It also uses control_group="notyettreated" instead of not_yet_treated (diff_diff/staggered.py:L139-L142,L304-L324, diff_diff/sun_abraham.py:L333-L336,L424-L434) and references Hausman.hausman_pretest instead of EfficientDiD.hausman_pretest (diff_diff/efficient_did.py:L1534-L1572). Impact: guide-driven code can fail at runtime or misstate identification/inference. Concrete fix: correct the capability flags and API tokens, then add a guide smoke test that resolves every referenced enum/classmethod against the public API (sketched below).
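
One minimal shape such a guide smoke test could take (the test eventually shipped in this thread is named test_guide_api_strings_resolve_against_public_api; the regex and scope below are assumptions for illustration):

```python
import re

import diff_diff

def test_guide_estimator_tokens_resolve():
    guide = diff_diff.get_llm_guide("autonomous")
    # Any CamelCase token the guide uses as "<Name>.fit(" or
    # "<Name>.hausman_pretest" must exist as a public diff_diff attribute,
    # which would catch tokens like "Hausman.hausman_pretest".
    tokens = set(re.findall(r"\b([A-Z][A-Za-z0-9]+)\.(?:fit\(|hausman_pretest)", guide))
    missing = sorted(t for t in tokens if not hasattr(diff_diff, t))
    assert not missing, f"Guide references unknown API names: {missing}"
```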

Code Quality

  • P1 — profile_panel() derives is_balanced and observation_coverage from raw row count alone (diff_diff/profile.py:L199-L205), while the guide describes these as support-level panel structure (diff_diff/guides/llms-autonomous.txt:L64-L70). With duplicated (unit, time) keys, a panel can report is_balanced=True despite a missing cell, and observation_coverage can exceed the documented [0,1] range. Impact: silent incorrect structural facts in a new public API. Concrete fix: detect duplicate (unit, time) keys explicitly, compute balance/coverage from unique support, and add a duplicate-key regression test (see the sketch after this list).
  • P1 — _classify_treatment() promises monotonicity over the observed non-NaN treatment path (diff_diff/profile.py:L175-L182) but diffs the raw sorted series including NaN (diff_diff/profile.py:L315-L320). A path like 0, 1, NaN, 0 will therefore be classified binary_absorbing instead of binary_non_absorbing. Impact: reversible-treatment panels can be silently routed toward absorbing-only estimators. Concrete fix: drop NaN before the monotonicity check and add a test for 1 -> NaN -> 0.
  • P1 — profile_panel() hardcodes has_never_treated=False for continuous numeric treatment because continuous inputs exit early from _classify_treatment() (diff_diff/profile.py:L305-L313), but the public field definition is generic (diff_diff/guides/llms-autonomous.txt:L104-L107) and ContinuousDiD exposes control_group="never_treated" (diff_diff/continuous_did.py:L59-L60). Impact: continuous-dose panels with zero-dose controls lose a load-bearing control-pool fact. Concrete fix: compute has_never_treated across numeric treatment types, or narrow the documented field contract to binary-only semantics and update the guide/tests consistently.
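
A hedged sketch of the first two fixes above, assuming a pandas representation; this is not the shipped diff_diff/profile.py code:

```python
import numpy as np
import pandas as pd

def unique_support_balance(df: pd.DataFrame, unit: str, time: str):
    """Balance and coverage from the unique (unit, time) support, not raw rows."""
    keys = df[[unit, time]]
    has_duplicate_keys = bool(keys.duplicated().any())
    support = keys.drop_duplicates()
    n_cells = support[unit].nunique() * support[time].nunique()
    coverage = len(support) / n_cells if n_cells else 0.0  # bounded above by 1
    return len(support) == n_cells, coverage, has_duplicate_keys

def is_absorbing(treatment_path: pd.Series) -> bool:
    """Monotone non-decreasing check on the non-NaN path, so 0, 1, NaN, 0 fails."""
    vals = treatment_path.dropna().to_numpy()
    return bool(np.all(np.diff(vals) >= 0))
```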

Performance

No findings.

Maintainability

No additional findings beyond the guide/registry drift already noted.

Tech Debt

  • P3 — The repo already tracks weak CI coverage for .txt AI guides in TODO.md:L127-L128. The current drift is a concrete example of that debt. Impact: informational only because it is already tracked. Concrete fix: either keep it tracked or close it with semantic guide assertions.

Security

No findings.

Documentation/Tests

  • P2 — The new tests do not cover the failing cases above: tests/test_guides.py:L37-L50 only checks fingerprints/table shape, and tests/test_profile_panel.py:L90-L101,L285-L295 do not exercise duplicate keys, 1 -> NaN -> 0 reversals, or continuous zero-dose controls. Impact: the current P1 issues have no regression protection. Concrete fix: add those exact cases to tests/test_guides.py / tests/test_profile_panel.py.

Path to Approval

  1. Reconcile diff_diff/guides/llms-autonomous.txt with docs/methodology/REGISTRY.md and public APIs: fix the SyntheticDiD, TROP, and HeterogeneousAdoptionDiD rows/notes, change notyettreated to not_yet_treated, and replace Hausman.hausman_pretest with EfficientDiD.hausman_pretest.
  2. Harden profile_panel() against duplicated (unit, time) keys: reject or alert on duplicates, and compute balance/coverage from unique support so observation_coverage <= 1.
  3. Fix treatment/control edge cases in profile_panel(): ignore NaN when checking absorbing monotonicity, and compute has_never_treated correctly for continuous zero-dose panels (or narrow the public contract consistently).
  4. Add regression tests for the corrected guide/API references and the new profile_panel() edge cases so this surface cannot drift again.

igerber added a commit that referenced this pull request Apr 24, 2026
profile_panel():
- Balance / coverage: compute from unique (unit, time) support rather
  than raw row count. Duplicated rows no longer inflate coverage above
  1 or mask missing cells. Adds duplicate_unit_time_rows warn alert.
- Non-absorbing detection: drop NaN before the monotonicity diff so a
  path like [0, 1, NaN, 0] is correctly classified binary_non_absorbing
  rather than binary_absorbing.
- has_never_treated / has_always_treated: compute generically across
  numeric treatment types (not only binary). Continuous panels with
  zero-dose control units now surface has_never_treated=True; docstring
  and guide field reference updated.

llms-autonomous.txt:
- SyntheticDiD: staggered changed from "partial" to "strict block"
  (fit raises on within-unit variation); footnote rewritten.
- TROP: staggered support marked (via absorbing D matrix); never-treated
  no longer required (donor pool is every unit untreated at period t);
  fit() covariate surface removed (there is none).
- HeterogeneousAdoptionDiD: covariate adjustment changed to "not
  supported" (Appendix B.1 future work per REGISTRY.md); clustered SE
  annotated "warn" on continuous paths (cluster= kwarg is ignored with
  UserWarning, CR1 only on mass-point path).
- Replace "notyettreated" with "not_yet_treated" throughout (matches
  staggered.py / sun_abraham.py public API).
- Replace "Hausman.hausman_pretest" with "EfficientDiD.hausman_pretest"
  (the classmethod lives on EfficientDiD, not a Hausman namespace).

Tests:
- test_duplicate_unit_time_rows_do_not_inflate_coverage
- test_reversal_through_nan_is_binary_non_absorbing
- test_continuous_zero_dose_controls_flag_has_never_treated
- test_guide_api_strings_resolve_against_public_api (verifies every
  estimator name, hausman_pretest path, and control_group spelling in
  the bundled guide resolves against diff_diff's public API; guards
  against future drift)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a506c0a55ad850043a77339d32716203887ea344


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review issues around the autonomous guide’s SyntheticDiD/TROP/HeterogeneousAdoptionDiD capability rows, the bad API tokens, and the earlier profile_panel() duplicate/NaN/continuous-control bugs appear addressed.
  • The autonomous guide still misstates HeterogeneousAdoptionDiD’s estimand and output shape: the method targets WAS / WAS_d_lower, not ATT, and the event-study path is per-event-time, not “per-cohort.”
  • The same guide still underspecifies EfficientDiD when there is no never-treated group: control_group="last_cohort" is a supported path, but §4.4 steers agents only to PT-Post.
  • profile_panel() can still emit impossible structural facts when unit or time contains missing IDs because its numerator and denominator treat NaNs inconsistently; observation_coverage can exceed the documented [0,1] range.
  • Static review only: I could not run the new tests locally because this environment lacks pytest and pandas.

Methodology

  • Severity P1. Impact: diff_diff/guides/llms-autonomous.txt:L375-L378 says HeterogeneousAdoptionDiD “returns an aggregated ATT plus a per-cohort Pierce-Schott-style validation,” but the registry defines the target as WAS / WAS_d_lower (docs/methodology/REGISTRY.md:L2189-L2200), and the estimator docs/results class describe a single-period WAS or per-event-time WAS outputs, not an ATT nor a per-cohort validation surface (diff_diff/had.py:L210-L245, diff_diff/had.py:L1757-L1770, diff_diff/had.py:L2014-L2020). This can make agent-facing reports mislabel the estimand and look for nonexistent outputs. Concrete fix: rewrite §4.9 to use WAS terminology and distinguish aggregate="overall" from aggregate="event_study" plus did_had_pretest_workflow().
  • Severity P1. Impact: diff_diff/guides/llms-autonomous.txt:L211-L213 and L307-L310 imply that, without never-treated units, EfficientDiD should fall back to PT-Post. The implementation and registry also support control_group="last_cohort" as a pseudo-control path (diff_diff/efficient_did.py:L161-L167, L459-L483; docs/methodology/REGISTRY.md:L928-L929). The guide can therefore incorrectly rule out a supported identification strategy on all-treated staggered panels. Concrete fix: update the footnote and §4.4 to mention control_group="last_cohort" explicitly and avoid presenting PT-Post as the only alternative.

Code Quality

  • Severity P1. Impact: diff_diff/profile.py:L213-L219 computes n_units/n_periods with nunique() (dropping NaNs) but computes unique (unit,time) support with drop_duplicates() (still counting NaN keys). That can produce observation_coverage > 1 and violate the public [0,1] contract documented in diff_diff/profile.py:L201-L205 and diff_diff/guides/llms-autonomous.txt:L71-L74. This is a silent correctness issue in a new public API whenever filtered data contain missing IDs. Concrete fix: reject missing unit/time values up front, or compute both numerator and denominator on the same non-missing key subset and surface a dedicated alert.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • Severity P3. Impact: the repo already tracks weak CI coverage for .txt AI guides in TODO.md:L128. That remains informational only. Concrete fix: keep that item open until the guide gains semantic assertions, not just fingerprint/table-shape checks.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: tests/test_guides.py:L37-L50 only validates guide fingerprints/table presence, and tests/test_profile_panel.py:L321-L414 covers the prior duplicate/NaN/continuous-control regressions but not missing unit/time IDs. The current P1 guide/profile issues therefore have no regression protection. Concrete fix: add tests that assert HAD is described as WAS / WAS_d_lower, that the EfficientDiD no-never-treated guidance mentions control_group="last_cohort", and that profile_panel() rejects or consistently handles missing identifier rows.

Path to Approval

  1. Fix the HAD section of llms-autonomous.txt so it describes WAS / WAS_d_lower and the actual overall/event-study outputs instead of ATT/per-cohort validation.
  2. Fix the EfficientDiD no-never-treated guidance to include the supported control_group="last_cohort" path.
  3. Harden profile_panel() against missing unit/time IDs so observation_coverage cannot exceed 1 and balance metrics remain well-defined.
  4. Add regression tests for the corrected guide semantics and the missing-identifier profile_panel() case.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide:
- Rewrite HAD section (§4.9, §3 footnotes) to use WAS / WAS_d_lower
  terminology (paper Equation 2 / Theorem 1). HAD does not target ATT
  and its event-study output is per-event-time, not per-cohort. The
  fitted target_parameter attribute is literally "WAS" for Design 1'
  and "WAS_d_lower" for Design 1 / Assumption 6.
- Add control_group="last_cohort" to the EfficientDiD guidance in §4.4
  (no-never-treated paths) and the corresponding footnote. Previously
  the guide steered agents only toward assumption="PT-Post" when a
  never-treated cohort was absent, ruling out the supported pseudo-
  control path. Distinct from CallawaySantAnna's "not_yet_treated".

profile_panel():
- Missing identifier rows are dropped up front when unit or time
  contains NaN. Previously n_units / n_periods used nunique() (NaN-
  dropping) while the unique (unit, time) support used
  drop_duplicates() (NaN-preserving), producing observation_coverage
  > 1 on filtered data with missing IDs. All structural facts now
  compute on the non-missing subset; a new missing_id_rows_dropped
  warn alert surfaces the drop count so the behavior is not silent.

Tests:
- test_missing_unit_or_time_ids_are_dropped_consistently asserts
  observation_coverage stays in [0, 1] and the alert fires with the
  correct drop count.
- test_guide_api_strings_resolve_against_public_api extended with
  assertions for WAS / WAS_d_lower phrasing, the absence of the old
  "per-cohort Pierce-Schott" wording, and control_group="last_cohort".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ba063b112c5fedf25b46818787eeedff736aae13


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review issues around HAD WAS wording, EfficientDiD(control_group="last_cohort"), and profile_panel()’s observation_coverage > 1 bug appear addressed.
  • The new autonomous guide still has a methodology mismatch: it marks SunAbraham as not requiring never-treated units, but both the registry and the fit path still require them.
  • The HAD guide section now uses WAS/WAS_d_lower, but it overstates what the Phase 3 diagnostics establish by saying they validate Assumptions 3 and 7; Assumption 3 is explicitly untestable in the registry.
  • profile_panel() still lacks empty-data handling after dropping missing ID rows (and on direct empty input), so it can return a “balanced” profile with zero valid observations.
  • Static review only: this environment does not have pytest or pandas, so I could not execute the added tests.

Methodology

  • Severity P1. Location: diff_diff/guides/llms-autonomous.txt:L187-L188, docs/methodology/REGISTRY.md:L939-L942, docs/methodology/REGISTRY.md:L991-L993, diff_diff/sun_abraham.py:L563-L567. Impact: the support matrix says SunAbraham does not require never-treated units, but the registry still documents never-treated controls as required and the implementation raises when none exist before it branches on control_group. An agent following the matrix can select an estimator the library will reject on all-treated staggered panels. Concrete fix: change the matrix cell to show that never-treated units are required for SunAbraham, and add a semantic guide test for that contract.
  • Severity P1. Location: diff_diff/guides/llms-autonomous.txt:L391-L394, docs/methodology/REGISTRY.md:L2178-L2181, docs/methodology/REGISTRY.md:L2328-L2336, diff_diff/had_pretests.py:L1-L50. Impact: the guide says the HAD Phase 3 diagnostics “validate Assumptions 3 and 7,” but the registry is explicit that Assumption 3 (uniform continuity / no extensive-margin jump) is not testable. The shipped diagnostics cover QUG, Assumption 7 pre-trends on the event-study path, and linearity/TWFE suitability; they do not validate Assumption 3. This can lead agent-facing reports to overclaim identification support. Concrete fix: rewrite §4.9 so it limits the diagnostic claim to QUG, Assumption 7 pre-trends, and linearity/TWFE checks, with Assumption 3 left explicitly untestable.

Code Quality

  • Severity P1. Location: diff_diff/profile.py:L214-L225. Impact: after dropping rows with missing unit/time, profile_panel() keeps going even when no valid rows remain. That yields n_units = 0, n_periods = 0, observation_coverage = 0.0, and is_balanced = True (0 == 0) for an empty sample. Per the review checklist, this is missing empty-result handling in a new code path and can silently hand downstream agents a nonexistent-but-“balanced” panel. Concrete fix: add a front-door empty-panel guard after the ID drop (and for direct-empty input), preferably raising ValueError with the row counts removed; add regression tests for both df.empty and empty-after-drop.
  • Severity P2. Location: diff_diff/profile.py:L214-L216. Impact: n_rows_with_missing_id is computed as unit.isna().sum() + time.isna().sum(), so a row with both IDs missing is counted twice even though only one row is dropped. The missing_id_rows_dropped alert can therefore overstate the number of dropped rows and violate the “factual observations” contract of the new API. Concrete fix: count dropped rows with a row-wise isna().any(axis=1) mask, and add a regression test where one row has both unit and time missing.
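
A sketch of the missing-identifier handling proposed in the two items above (illustrative only, not the shipped implementation):

```python
import pandas as pd

def drop_missing_ids(df: pd.DataFrame, unit: str, time: str):
    """Drop rows with a missing unit or time id, counting each dropped row once."""
    missing = df[[unit, time]].isna().any(axis=1)  # a row missing both ids counts once
    n_dropped = int(missing.sum())
    kept = df.loc[~missing]
    if kept.empty:
        raise ValueError(
            f"No rows with non-missing unit/time ids remain ({n_dropped} rows dropped)."
        )
    return kept, n_dropped
```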

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Location: tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L374-L446, TODO.md:L128. Impact: the guide tests are still mostly fingerprint/shape checks, which is why the remaining guide-semantic mismatches slipped through. This is already tracked as AI-guide validation debt, so it is informational rather than blocking. Concrete fix: keep the TODO open; when fixing the guide, add assertions for SunAbraham’s never-treated requirement and the HAD diagnostic claim.

Security

No findings.

Documentation/Tests

No separate unmitigated findings beyond the tracked guide-validation gap above. Static review only: pytest and pandas were unavailable here, so I could not run the newly added tests.

Path to Approval

  1. Fix the autonomous-guide matrix so SunAbraham is marked as requiring never-treated units, and align any nearby prose with that contract.
  2. Rewrite HAD §4.9 so the diagnostic discussion does not claim Assumption 3 is testable; limit it to QUG, Assumption 7 pre-trends, and linearity/TWFE checks.
  3. Add empty-panel handling to profile_panel() for direct-empty input and empty-after-ID-drop input, so it cannot return is_balanced=True on zero valid rows.
  4. Fix missing_id_rows_dropped to count dropped rows rather than missing ID cells, and add regression tests for both the empty-after-drop case and a row with both IDs missing.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide:
- SunAbraham matrix row: never-treated-required changed from ✗ to ✓.
  The fit path raises ValueError when no never-treated units exist
  (sun_abraham.py:566-567); the registry documents this requirement.
  §4.3 prose updated to state the requirement explicitly.
- HAD §4.9 (and §3 Phase 3 description): correct the diagnostic-claim
  wording. The pre-test battery validates the QUG null (step 1),
  Assumption 7 pre-trends (step 2, event-study only), and linearity
  of E[ΔY|D_2] (step 3). Assumption 3 (uniform continuity / no
  extensive-margin jump) is explicitly "not testable" per REGISTRY.md
  L2181, so the guide must not claim Phase 3 validates it.

profile_panel():
- Empty-panel guard: raise ValueError when the input DataFrame is
  empty (direct) or when no rows remain after dropping missing
  (unit, time) identifiers. Previously both cases returned
  is_balanced=True / observation_coverage=0 on zero valid rows.
- Fix missing-id counting: n_rows_with_missing_id was the SUM of
  unit-NaN and time-NaN (double-counting rows missing both). Now
  computed from a row-wise isna().any(axis=1) mask so every dropped
  row is counted exactly once.

Tests:
- test_row_with_both_ids_missing_counted_once
- test_empty_dataframe_raises_value_error
- test_empty_after_id_drop_raises_value_error
- test_guide_api_strings_resolve_against_public_api extended:
  SunAbraham matrix-row cell assertion (never-treated-required=✓),
  "Assumption 3" presence, absence of "validate Assumptions 3 and 7",
  presence of "not testable".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c7ef29a0c4864c4253047a6c3ad2f2367ee15972


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review items on SunAbraham’s never-treated requirement, the HAD Assumption 3 wording, empty-after-drop handling in profile_panel(), and double-counted missing-ID rows appear resolved.
  • The autonomous guide still misstates the EfficientDiD contract: it says assumption="PT-Post" removes the never-treated requirement, but the registry and fit path still require actual never-treated units unless control_group="last_cohort" is used.
  • The support matrix still has wrong covariate-adjustment cells: SyntheticDiD is marked unsupported even though fit(..., covariates=...) residualizes outcomes, while ContinuousDiD is marked supported even though covariates are still deferred.
  • profile_panel() overstates min_pre_periods / min_post_periods on unbalanced panels by counting global calendar periods rather than observed cohort support, which can suppress the short-panel alerts.
  • §5 of the autonomous guide has a few runtime-facing API signature mistakes (compute_pretrends_power, plot_sensitivity, plot_honest_event_study).
  • Static review only: this environment lacked pytest and core deps like numpy, so I could not run the added tests or import the package.

Methodology

  • Severity P1. Location: diff_diff/guides/llms-autonomous.txt:L107-L110, diff_diff/guides/llms-autonomous.txt:L212-L218, diff_diff/guides/llms-autonomous.txt:L313-L317, docs/methodology/REGISTRY.md:L756-L758, docs/methodology/REGISTRY.md:L870-L871, diff_diff/efficient_did.py:L161-L167, diff_diff/efficient_did.py:L458-L484. Impact: the new guide tells agents that assumption="PT-Post" drops the never-treated requirement. That is not the current library contract: PT-Post still uses never-treated comparisons, and all-treated panels only become admissible through control_group="last_cohort". An agent following the guide can select EfficientDiD on an all-eventually-treated panel and hit a front-door ValueError, or misstate the identifying setup in its report. Concrete fix: rewrite the has_never_treated field description, the matrix footnote, and §4.4 so they say EfficientDiD needs never-treated controls under both PT variants unless control_group="last_cohort" is explicitly chosen; add a semantic guide test for that contract.
  • Severity P1. Location: diff_diff/guides/llms-autonomous.txt:L195-L199, diff_diff/synthetic_did.py:L213-L239, diff_diff/synthetic_did.py:L425-L436, docs/methodology/REGISTRY.md:L1549-L1551, docs/methodology/REGISTRY.md:L736-L737, diff_diff/continuous_did.py:L159-L169. Impact: the support matrix mislabels covariate support in both directions: SyntheticDiD is shown as ✗ even though the public API accepts covariates= and residualizes the outcome, while ContinuousDiD is shown as ✓ even though the registry still lists covariates as deferred and the public fit() surface has no covariate argument. That will wrongly eliminate SDiD from valid workflows and advertise a nonexistent ContinuousDiD feature. Concrete fix: change the SyntheticDiD covariate-adjustment cell to ✓ (or partial with a footnote if you want to distinguish residualization from DR/IPW-style adjustment) and change ContinuousDiD to ✗; add a semantic matrix test that checks these rows against the public fit signatures / registry checklist.

Code Quality

  • Severity P2. Location: diff_diff/profile.py:L409-L431, diff_diff/profile.py:L541-L565, diff_diff/guides/llms-autonomous.txt:L124-L129. Impact: min_pre_periods and min_post_periods are computed from the global set of panel periods, not from the treated cohorts’ observed support. On unbalanced panels this can overstate available pre/post exposure and suppress short_pre_panel / short_post_panel, even though the guide presents those fields as factual cohort-level support. Concrete fix: compute pre/post counts from observed (unit, time) support for the treated cohorts after deduplicating keys, and add a regression test where treated cohorts are missing early pre-periods or late post-periods.
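
A sketch of the per-treated-unit reading of this fix (the thread later debates unit-level versus cohort-level semantics); column handling and the treatment > 0 convention are assumptions:

```python
import pandas as pd

def min_pre_post(df: pd.DataFrame, unit: str, time: str, treatment: str):
    """Min pre/post period counts over treated units' observed (unit, time) support."""
    support = df[[unit, time, treatment]].drop_duplicates(subset=[unit, time])
    pre, post = [], []
    for _, g in support.groupby(unit):
        treated_times = g.loc[g[treatment] > 0, time]
        if treated_times.empty:
            continue  # never-treated units do not bound treated exposure
        first = treated_times.min()
        pre.append(int((g[time] < first).sum()))
        post.append(int((g[time] >= first).sum()))
    return (min(pre) if pre else 0, min(post) if post else 0)
```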

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Location: tests/test_guides.py:L37-L50, TODO.md:L128-L128. Impact: the current guide tests still only assert fingerprints / table shape, which is why the remaining semantic matrix drift was not caught. This is already tracked as .txt AI guides being outside stronger CI validation, so it is informational rather than blocking. Concrete fix: keep the TODO open; when fixing the guide, add semantic assertions for the EfficientDiD no-never-treated contract and the SyntheticDiD / ContinuousDiD covariate-support cells.

Security

No findings.

Documentation/Tests

  • Severity P2. Location: diff_diff/guides/llms-autonomous.txt:L430-L438, diff_diff/guides/llms-autonomous.txt:L481-L482, diff_diff/pretrends.py:L1048-L1055, diff_diff/visualization/_diagnostic.py:L12-L14, diff_diff/visualization/_event_study.py:L787-L789. Impact: §5 gives several wrong runtime signatures for agent use: compute_pretrends_power takes a fitted results object, not raw df; plot_sensitivity takes SensitivityResults; plot_honest_event_study takes HonestDiDResults. Agents using the new reference verbatim will call the wrong APIs and fail at runtime. Concrete fix: update those entries to the actual first-argument types and add a semantic guide test that checks the guide text against the public function signatures.

Path to Approval

  1. Fix the autonomous guide’s EfficientDiD wording so PT-Post does not claim to remove the never-treated requirement; document control_group="last_cohort" as the supported no-never-treated path, and add a regression test for that contract.
  2. Correct the support matrix’s covariate-adjustment cells for SyntheticDiD and ContinuousDiD, and add a semantic regression test that checks those rows against the public API / methodology registry.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide:
- Matrix covariate cells corrected. SyntheticDiD column 6 flipped
  ✗ → ✓: fit() accepts covariates= and residualizes the outcome
  (synthetic_did.py:213-239, 425-436). ContinuousDiD column 6 flipped
  ✓ → ✗: the fit() signature has no covariate argument and REGISTRY
  lists covariate adjustment as deferred.
- EfficientDiD wording corrected. The registry says PT-Post still uses
  never-treated as the comparison group (REGISTRY.md L756-758);
  "switch to PT-Post" does not drop the requirement. Rewrote
  has_never_treated field description, the §3 footnote, and §4.4
  prose to state that PT-All and PT-Post both require never-treated
  units and control_group="last_cohort" is the only supported path
  for an all-eventually-treated panel.

profile_panel():
- _compute_pre_post now uses each treated unit's observed (unit, time)
  support rather than the global panel period set. On unbalanced
  panels this correctly captures the least-supported treated unit's
  pre/post exposure and fires short_pre_panel / short_post_panel when
  appropriate; the previous global-periods approach could overstate
  exposure and suppress the alert.

Guide §5 API signatures corrected:
- compute_pretrends_power takes a fitted results object, not a raw
  DataFrame (pretrends.py:1048).
- plot_sensitivity takes a SensitivityResults object
  (visualization/_diagnostic.py:12-14).
- plot_honest_event_study takes a HonestDiDResults object
  (visualization/_event_study.py:9).

Tests:
- test_min_pre_post_use_per_unit_observed_support (unbalanced treated
  unit missing early pre-period fires short_pre_panel).
- test_guide_api_strings_resolve_against_public_api extended with
  SyntheticDiD/ContinuousDiD covariate-cell matrix assertions,
  EfficientDiD both-PT-variants-require-never-treated wording
  assertion, and §5 API-signature fingerprints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3590ddd75a615e34301f334c067d9a66b6c3d028


Overall Assessment

Looks good

Executive Summary

  • The prior P1 methodology findings from the last re-review appear resolved: the autonomous guide now matches source on EfficientDiD’s never-treated contract, SunAbraham’s never-treated requirement, the SyntheticDiD/ContinuousDiD covariate-support cells, HAD Assumption 3 wording, and the §5 function signatures.
  • profile_panel() also closes the earlier empty-after-drop, missing-ID, duplicate-support, NaN-through-monotonicity, and min-pre/min-post bugs, with targeted regression coverage.
  • Remaining issue: §6 of diff_diff/guides/llms-autonomous.txt does not match the actual BusinessReport / DiagnosticReport to_dict() schema, so an agent using it as a parsing contract will look for keys that do not exist.
  • Remaining issue: min_pre_periods / min_post_periods are now implemented and tested as least-supported treated-unit counts, but the guide and emitted alert text still describe them as cohort-level counts.
  • Static review only: this environment lacks pytest, numpy, and pandas, so I could not run the added tests or import the package.

Methodology

  • No unmitigated findings. The prior methodology issues in diff_diff/guides/llms-autonomous.txt appear addressed, and the new semantic checks in tests/test_profile_panel.py:L374-L480 cover the corrected contracts.

Code Quality

  • Severity P2. Impact: profile_panel() now computes min_pre_periods / min_post_periods from the least-supported treated unit, but both the guide and the runtime alert text still say these are “across cohorts.” That makes the public contract misleading: agents can read a unit-level missingness signal as a cohort-level design limitation. Concrete fix: either change the prose to “across treated units” everywhere, or change the computation back to genuine cohort-level support if that is the intended contract. Locations: diff_diff/guides/llms-autonomous.txt:L126-L130, diff_diff/profile.py:L417-L443, diff_diff/profile.py:L553-L576, tests/test_profile_panel.py:L483-L500.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: .txt AI guides are still only lightly validated in CI, so semantic drift in agent-facing guides remains easy to miss. Concrete fix: keep the existing TODO open and extend semantic tests to cover the BR/DR schema section, not just estimator names, matrix cells, and signature snippets. Locations: TODO.md:L121-L129, tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L374-L480.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: §6 of the autonomous guide misdocuments the actual BR/DR to_dict() schema. It advertises BusinessReport keys such as assumptions, pretrends, main_result, and sample_summary, but BusinessReport.to_dict() actually returns keys like assumption, pre_trends, sample, diagnostics, and references. It also says DiagnosticReport has estimator_type plus a nested checks dict, while the implementation returns estimator and separate top-level sections such as parallel_trends, pretrends_power, sensitivity, bacon, and estimator_native_diagnostics. An autonomous agent following the guide as a literal parsing contract will fail on real output. Concrete fix: rewrite §6 to match the actual schemas emitted by BusinessReport.to_dict() and DiagnosticReport.to_dict(), then add a semantic test that compares the documented key set to the real builders. Locations: diff_diff/guides/llms-autonomous.txt:L513-L545, diff_diff/business_report.py:L524-L549, diff_diff/diagnostic_report.py:L1013-L1031.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide §6 rewritten to match the actual BR/DR to_dict() schemas emitted
at diff_diff/business_report.py:524-549 and diff_diff/diagnostic_report.py:
1013-1031. BR top-level keys are "assumption" (singular), "pre_trends"
(underscored), "sample" (bare); the new documentation also adds
"estimator", "context", "heterogeneity", "diagnostics", "next_steps",
"references". DR keys are "estimator" (string), "headline_metric",
"estimator_native_diagnostics", "skipped", "warnings",
"overall_interpretation", "next_steps"; the old "checks: dict" nesting
never existed and "estimator_type" was never the key name. The
"forthcoming schema additions" block is dropped since it documented an
un-shipped surface.

Prose + alert wording aligned with the per-treated-unit implementation:
- Guide §2: min_pre_periods / min_post_periods described as "across
  treated units" (each unit's observed (unit, time) support counted
  independently on unbalanced panels).
- profile.py: short_pre_panel / short_post_panel alert messages use
  "across treated units" to match the computation.

Tests:
- test_guide_api_strings_resolve_against_public_api extended with
  BR/DR §6 schema assertions. Guards every real top-level key
  documented (assumption, pre_trends, sample, headline_metric,
  estimator_native_diagnostics, overall_interpretation) and forbids
  the previously documented obsolete keys (assumptions, pretrends,
  main_result, sample_summary, estimator_type, checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 38a8f6f6dcb071f6f626572a7c3f887019debe2b


Overall Assessment

Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The prior re-review issues on min_pre_periods / min_post_periods semantics and the major BR/DR top-level key mismatches are mostly resolved.
  • I did not find any P0/P1 correctness issues in profile_panel(); the new tests cover the earlier empty-after-drop, duplicate-support, NaN-through-monotonicity, and per-unit pre/post-support cases.
  • Severity P2: the new autonomous guide still has methodology-reference drift. EfficientDiD is attributed to the wrong source, and the ContinuousDiD synopsis collapses the PT/SPT estimand distinction into “ACR.”
  • Severity P2: §6 still misstates the BusinessReport.to_dict() contract. diagnostics is a wrapper block, not the raw DiagnosticReport schema, and target_parameter.reference is undocumented.
  • Static review only: this environment does not have pytest, and import-time numpy is also unavailable, so I could not execute the added tests.

Methodology

  • Severity P2. Impact: ContinuousDiD is described as “estimates average causal response (ACR),” but the registry and BR/DR target-parameter dispatch distinguish ATT(d|d) / overall ATT under PT from ACRT under SPT. An autonomous agent following §4.7 can mislabel the estimand it reports. Concrete fix: rewrite the ContinuousDiD bullet to say the estimator exposes dose-response ATT and ACR outputs, with overall ATT as the headline scalar and ACR requiring the stronger SPT regime; add the CGBS 2024 citation to §7. Locations: diff_diff/guides/llms-autonomous.txt:L361-L369, docs/methodology/REGISTRY.md:L682-L706, diff_diff/_reporting_helpers.py:L332-L347.
  • Severity P2. Impact: EfficientDiD is labeled “Arkhangelsky-Imbens,” but the registry’s primary source is Chen, Sant’Anna, and Xie (2025). That breaks source-material traceability in an agent-facing methodology guide. Concrete fix: change the §4.3 attribution and add the Chen-Sant’Anna–Xie reference to §7 alongside the other estimator papers. Locations: diff_diff/guides/llms-autonomous.txt:L303-L304, docs/methodology/REGISTRY.md:L746-L749.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: guide validation is still mostly string/fingerprint based, so schema drift in .txt AI guides can survive even with the new tests. Concrete fix: keep the existing TODO open and extend the guide tests to compare §6 against the actual BR/DR builders, not just key presence. Locations: TODO.md:L121-L129, tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L482-L506.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: §6 still tells agents that BusinessReport["diagnostics"] is the nested DiagnosticReport.to_dict() payload, but the implementation wraps that payload under diagnostics["schema"] and adds wrapper fields like status and overall_interpretation; the guide also omits the target_parameter.reference field that describe_target_parameter() guarantees. Agents using the guide as a literal parsing contract will look in the wrong place or miss fields. Concrete fix: document the exact wrapper shape for diagnostics, include target_parameter.reference, and add a semantic test that compares the documented schema bullets to real BusinessReport.to_dict() output. Locations: diff_diff/guides/llms-autonomous.txt:L516-L546, diff_diff/business_report.py:L510-L549, diff_diff/_reporting_helpers.py:L34-L42, tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L482-L506.
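
An illustrative parsing sketch of the wrapper shape as described in this thread (key names are taken from the review and the follow-up commit, not verified independently against the library):

```python
from typing import Any

def read_diagnostics(report: dict[str, Any]) -> dict[str, Any]:
    """Pull the diagnostics wrapper out of a BusinessReport.to_dict() payload."""
    diag = report.get("diagnostics", {})
    out: dict[str, Any] = {"status": diag.get("status")}
    if diag.get("status") == "ran":
        out["schema"] = diag.get("schema", {})  # nested DiagnosticReport payload
        out["overall_interpretation"] = diag.get("overall_interpretation")
    else:
        out["reason"] = diag.get("reason")  # populated on the skipped path
    out["target_parameter_reference"] = report.get("target_parameter", {}).get("reference")
    return out
```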

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide methodology attribution corrected:
- EfficientDiD primary source is Chen, Sant'Anna, Xie (2025) per
  REGISTRY.md L746-749, not Arkhangelsky-Imbens. §4.3 bullet and the
  §7 citation list updated.
- ContinuousDiD synopsis (§4.7) now distinguishes the PT-identified
  targets (`ATT(d|d)`, `ATT^{loc}`) from the SPT-required targets
  (`ATT(d)`, `ACRT(d)`, `ACRT^{glob}`) per REGISTRY.md L682-706, with
  the Callaway, Goodman-Bacon, Sant'Anna (2024) citation added to §7.

Guide §6 BR schema corrections:
- `diagnostics` wrapper documented accurately: always has `status`,
  with `schema` + `overall_interpretation` on the "ran" path or
  `reason` on the "skipped" path. Previously the guide said the key
  held the DR payload directly.
- `target_parameter.reference` field documented alongside the other
  fields per `describe_target_parameter()` in
  diff_diff/_reporting_helpers.py:34-42.

Regression tests extended:
- `diagnostics["schema"]` wrapper wording guarded (and absence of
  plain "DR payload under diagnostics").
- `target_parameter.reference` key documentation guarded.
- EfficientDiD attribution: `Chen, Sant'Anna, Xie 2025` present,
  `(Arkhangelsky-Imbens)` absent.
- ContinuousDiD targets: `ATT(d|d)`, `ACRT`, `Strong Parallel Trends`
  all present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bd4e1be562f79f91e31af629cf95663748de5402


Overall Assessment

⚠️ Needs changes — one unmitigated P1 remains in the new autonomous guide’s methodology matrix.

Executive Summary

  • The prior re-review issues on EfficientDiD attribution, ContinuousDiD PT/SPT wording, and the BR/DR wrapper/reference-field schema drift look addressed.
  • P1: the new autonomous guide still misstates ContinuousDiD eligibility. It says never-treated / zero-dose controls are not required, but both REGISTRY.md and ContinuousDiD.fit() require P(D=0) > 0 because the lowest-dose-as-control path is not implemented.
  • P2: §6 of the guide still misdocuments DiagnosticReport.to_dict() section status semantics. Real sections use execution-style status="ran" plus a separate verdict; reason is optional, and additional statuses like not_applicable, not_run, and no_scalar_by_design exist.
  • P3 (tracked): guide validation is still mostly string/fingerprint based, which is why the remaining guide drift slipped through.
  • Static review only: I could not execute the added tests here because import-time numpy is unavailable in this environment.

Methodology

  • Severity P1. Impact: Method affected: ContinuousDiD. The agent-facing support matrix marks ContinuousDiD as never-treated required = ✗, and the surrounding has_never_treated / continuous-treatment prose also omits the zero-dose-control requirement. That conflicts with the Methodology Registry and the implementation, which both require P(D=0) > 0 because Remark 3.1 (lowest-dose-as-control) is not implemented. In the advertised profile_panel() -> get_llm_guide("autonomous") workflow, an agent can therefore treat an all-eventually-treated continuous-dose panel as eligible for ContinuousDiD even though fit() rejects it. Concrete fix: change the matrix cell to ✓, add ContinuousDiD to the has_never_treated guidance, and state explicitly in §4.7 that zero-dose / never-treated controls are required until Remark 3.1 is implemented. Locations: diff_diff/guides/llms-autonomous.txt:L107-L114, diff_diff/guides/llms-autonomous.txt:L187-L205, diff_diff/guides/llms-autonomous.txt:L366-L376, docs/methodology/REGISTRY.md:L714-L716, diff_diff/continuous_did.py:L349-L360, tests/test_continuous_did.py:L1279-L1295.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: the new guide checks are still mostly presence/fingerprint assertions, so semantic schema/methodology drift in .txt guides can survive CI. This is already tracked as non-blocking guide-validation debt. Concrete fix: keep the TODO open and extend tests to compare specific guide claims against live serializer/estimator behavior, not just substrings and table shape. Locations: TODO.md:L121-L129, tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L458-L515.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: §6 says every DiagnosticReport section has statuses like pass/warn/inconclusive plus a reason, but the actual schema uses status="ran" for executed sections, keeps interpretation in a separate verdict, and also emits not_applicable, not_run, and no_scalar_by_design. An agent treating the guide as a literal parsing contract will mis-handle real BR/DR output. Concrete fix: rewrite §6 so status is documented as execution state, move qualitative interpretation to verdict, make reason optional, and list the additional statuses that actually occur. Locations: diff_diff/guides/llms-autonomous.txt:L594-L596, diff_diff/diagnostic_report.py:L890-L900, diff_diff/diagnostic_report.py:L968-L978, diff_diff/diagnostic_report.py:L1178-L1185, diff_diff/diagnostic_report.py:L1769-L1777, diff_diff/diagnostic_report.py:L1969-L2028.
  • I could not run tests/test_profile_panel.py or tests/test_guides.py in this environment because numpy is not installed, so the verification above is static-only.

Path to Approval

  1. Fix the ContinuousDiD methodology guidance in diff_diff/guides/llms-autonomous.txt: mark zero-dose / never-treated controls as required in the matrix and explain the unimplemented Remark 3.1 path in the continuous-treatment prose.
  2. Correct §6 of diff_diff/guides/llms-autonomous.txt to match the real DiagnosticReport.to_dict() contract: status="ran" for executed sections, verdict for qualitative interpretation, optional reason, and the extra statuses currently emitted.
  3. Add semantic tests that assert those two guide claims against live code paths, rather than only checking substrings/table shape.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
ContinuousDiD eligibility (P1): ContinuousDiD.fit() requires zero-dose
control units (P(D=0) > 0) because Remark 3.1 lowest-dose-as-control
is not yet implemented (REGISTRY.md L714; continuous_did.py:349-360).
Guide corrections:
- Matrix row: never-treated-required ✗ → ✓.
- has_never_treated field doc: add ContinuousDiD to the required-by
  list with the Remark 3.1 note.
- §4.4 no-never-treated paths: add ContinuousDiD bullet naming the
  ValueError and pointing at the unimplemented Remark 3.1 path.
- §4.7 ContinuousDiD bullet: prepend the "Requires zero-dose control
  units" callout.

DiagnosticReport §6 section-status semantics (P2): the real schema
uses `status` for execution state ("ran", "not_applicable", "not_run",
"no_scalar_by_design", "skipped") and `verdict` as a separate field
for qualitative interpretation (e.g. "clean", "inconclusive",
"violated"). Previously §6 conflated these: it listed
pass/warn/inconclusive as status values and omitted the execution
statuses. Rewrote the section-status paragraph to document the two
layers explicitly; moved the forthcoming-additions note into the
same block.

Tests:
- ContinuousDiD matrix-row cell assertion (col 5 = ✓).
- "P(D=0) > 0" substring fingerprint guards the Remark 3.1 wording.
- DR §6 status vocabulary: guards every real status the implementation
  emits ("ran", "not_applicable", "not_run", "no_scalar_by_design")
  and forbids the obsolete pass/warn/inconclusive-as-status framing;
  `verdict` must be mentioned as a distinct concept.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b361c6f07129bdce9fe3b7f154022f2b039a2994


Overall Assessment

⚠️ Needs changes. The prior re-review issues on ContinuousDiD zero-dose controls and DiagnosticReport status/verdict semantics look addressed in diff_diff/guides/llms-autonomous.txt:L107-L114 and diff_diff/guides/llms-autonomous.txt:L603-L618, but there are still unmitigated P1 methodology issues in the new autonomous guide/profile workflow.

Executive Summary

  • The earlier ContinuousDiD control-group mismatch and BR/DR schema drift appear fixed.
  • Severity P1 [Newly identified]: the estimator-support matrix still says ContinuousDiD does not support staggered adoption, which contradicts the Methodology Registry and the implementation.
  • Severity P1 [Newly identified]: the new profile_panel() + guide flow still omits ContinuousDiD's unit-constant-dose requirement, so time-varying-dose panels are profiled as generic "continuous" and pointed toward an estimator that hard-rejects them.
  • Severity P2: the new semantic guide test does not assert either of those ContinuousDiD eligibility rules, so this drift can pass CI.
  • Verification was static only; pandas and pytest are not available in this environment.

Methodology

  • Severity P1 [Newly identified]. Impact: Method affected: ContinuousDiD. The matrix marks ContinuousDiD as staggered = ✗ at diff_diff/guides/llms-autonomous.txt:L206-L206, but the source material explicitly defines multi-period staggered notation and staggered (g,t) aggregation for the method in docs/methodology/continuous-did.md:L61-L75 and docs/methodology/continuous-did.md:L257-L260, and the registry documents staggered-adoption equations at docs/methodology/REGISTRY.md:L788-L825. The implementation also uses first_treat cohorts and a not-yet-treated control mask at diff_diff/continuous_did.py:L159-L169 and diff_diff/continuous_did.py:L919-L925. In the advertised profile_panel() -> get_llm_guide("autonomous") workflow, an agent can therefore wrongly exclude a supported estimator for staggered continuous-dose panels. Concrete fix: change the ContinuousDiD staggered cell from ✗ to ✓ (or partial only if you add a precise qualifier), and explain that continuous-treatment timing is driven by first_treat metadata rather than PanelProfile.is_staggered.
  • Severity P1 [Newly identified]. Impact: Method affected: ContinuousDiD. The new workflow still over-admits invalid continuous-treatment designs. profile_panel classifies any numeric non-binary treatment as "continuous" (diff_diff/profile.py:L339-L364), and the guide's field reference and §4.7 point users toward ContinuousDiD / HeterogeneousAdoptionDiD for that case (diff_diff/guides/llms-autonomous.txt:L89-L92, diff_diff/guides/llms-autonomous.txt:L367-L383). But the method definition requires dose to be time-invariant within unit in docs/methodology/continuous-did.md:L72-L75, and ContinuousDiD.fit() hard-fails when that is violated at diff_diff/continuous_did.py:L222-L228. That means a time-varying-dose panel is currently profiled as eligible and only rejected later at fit time. Concrete fix: add a PanelProfile field or alert for within-unit treatment variation and use it in §2/§4.7; at minimum, state explicitly that ContinuousDiD requires time-invariant dose and remove the unconditional routing from the generic "continuous" definition.
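
A sketch of the within-unit-dose signal asked for here; the follow-up commit adds PanelProfile.treatment_varies_within_unit with essentially this definition, though the code below is illustrative rather than the shipped implementation:

```python
import pandas as pd

def treatment_varies_within_unit(df: pd.DataFrame, unit: str, treatment: str) -> bool:
    """True iff any unit has more than one distinct non-NaN treatment value."""
    return bool(df.groupby(unit)[treatment].nunique(dropna=True).gt(1).any())
```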

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new semantic guard in tests/test_profile_panel.py:L458-L536 checks ContinuousDiD's covariate and zero-dose-control cells, but it never asserts the staggered-support cell or the time-invariant-dose contract. That is why both methodology issues above can still ship without CI noise. Concrete fix: extend test_guide_api_strings_resolve_against_public_api() to assert cdid_cells[2] and require explicit guide text about unit-constant dose, or add a live-behavior test that compares the guide against ContinuousDiD.fit() on a varying-dose panel.
  • Verification note: this review was static only because pandas and pytest are unavailable in this environment.

Path to Approval

  1. Fix the ContinuousDiD row in diff_diff/guides/llms-autonomous.txt so staggered adoption is not shown as unsupported, and clarify how continuous-treatment timing should be inferred when PanelProfile.is_staggered is binary-only.
  2. Surface the unit-constant-dose requirement in the new autonomous workflow. The best fix is a new PanelProfile signal plus matching §2/§4.7 text; the minimum fix is explicit prose that varying-dose panels are not eligible for ContinuousDiD.
  3. Add semantic tests for both rules so future guide edits cannot regress.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
ContinuousDiD staggered support (P1 #1): the matrix marked
staggered=✗, but the method natively supports staggered adoption via
the `first_treat` column (continuous_did.py:159-169, 919-925;
REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.

Time-invariant dose requirement (P1 #2): ContinuousDiD.fit() requires
dose to be time-invariant per unit (continuous_did.py:222-228;
docs/methodology/continuous-did.md:L70-75), but profile_panel() did
not expose this so time-varying-dose continuous panels were routed to
ContinuousDiD only to hard-fail at fit time.

Added `PanelProfile.treatment_varies_within_unit: bool` — True iff
any unit has more than one distinct non-NaN treatment value across
its observed rows. Computed unconditionally for numeric (non-bool)
treatment columns; False for categorical. `to_dict()` exposes it.
Guide §2 documents the field, §4.7 ContinuousDiD bullet lists two
eligibility prerequisites: P(D=0) > 0 AND
treatment_varies_within_unit == False.

Tests (P2):
- test_continuous_treatment_with_time_varying_dose: random-per-row
  continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose ->
  treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel
  always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓;
  guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: aaa486390b35fe6998e7d8826ef72636a4af8653


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior non-ContinuousDiD guide drift appears mostly addressed; the remaining blockers are still on the ContinuousDiD path.
  • Severity P1: the autonomous support matrix still says ContinuousDiD does not support staggered adoption, which contradicts the methodology docs and the estimator’s (g,t) implementation.
  • Severity P1: the new profile_panel() + guide workflow still does not surface ContinuousDiD’s unit-constant-dose requirement, and it reuses binary-exposure semantics for continuous dose columns, so valid continuous-DiD panels can be mislabeled as having “always-treated” units while invalid varying-dose panels are not screened pre-fit.
  • Severity P2: the new tests still do not assert either ContinuousDiD rule, and one test now hard-codes the incorrect continuous-panel has_always_treated behavior.
  • Verification was static only; this environment is missing pytest and package imports fail because numpy is unavailable.

Methodology

  • Severity P1 [Previously raised, still unresolved]. Impact: Method affected: ContinuousDiD. The estimator-support matrix still marks ContinuousDiD as staggered = ✗ in diff_diff/guides/llms-autonomous.txt:L189-L207 (row at :L206), but the source material explicitly defines multi-period staggered notation and staggered (g,t) aggregation in docs/methodology/continuous-did.md:L61-L75 and docs/methodology/continuous-did.md:L257-L260, and the registry/implementation both reflect that multi-period design (docs/methodology/REGISTRY.md:L728-L733, diff_diff/continuous_did.py:L394-L423, diff_diff/continuous_did.py:L912-L925). In the advertised agent workflow, this wrongly excludes a supported estimator for staggered continuous-dose panels. Concrete fix: change the ContinuousDiD staggered cell to ✓ and add a short footnote that continuous-treatment timing comes from the separate first_treat metadata, not from PanelProfile.is_staggered.

  • Severity P1 [Previously raised, still unresolved]. Impact: Method affected: ContinuousDiD. The new workflow still omits the method’s unit-constant-dose assumption and also misstates continuous-dose semantics. profile_panel() classifies every non-binary numeric treatment as generic "continuous" and documents has_always_treated as a generic numeric-treatment property in diff_diff/profile.py:L183-L199 and diff_diff/profile.py:L355-L364; the guide then routes that class to ContinuousDiD based only on the zero-dose-control requirement in diff_diff/guides/llms-autonomous.txt:L89-L120 and diff_diff/guides/llms-autonomous.txt:L367-L383. But ContinuousDiD.fit() takes separate first_treat and dose inputs and hard-rejects any unit whose dose varies within unit in diff_diff/continuous_did.py:L183-L186 and diff_diff/continuous_did.py:L222-L228, consistent with docs/methodology/continuous-did.md:L61-L75. As written, the profile can over-admit invalid varying-dose panels and also emit the false has_always_treated_units alert on valid continuous-DiD panels solely because dose is positive in pre-treatment rows (diff_diff/profile.py:L593-L604). Concrete fix: add an explicit within-unit-dose signal/alert to PanelProfile and .to_dict(), document it in §2/§4.7, and stop equating “positive dose in every observed row” with “treated in every observed period” for continuous designs.
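For concreteness, a tiny pandas sketch of why a positive dose in every row does not mean "always treated" when timing comes from a separate first_treat-style column; the column names here are illustrative, not the library's API:

```python
import pandas as pd

# Constant positive dose but well-defined pre-treatment periods via first_treat.
panel = pd.DataFrame({
    "unit": [1, 1, 1, 1],
    "time": [2000, 2001, 2002, 2003],
    "dose": [0.5, 0.5, 0.5, 0.5],        # positive in every observed row
    "first_treat": [2002, 2002, 2002, 2002],
})
pre = panel["time"] < panel["first_treat"]
print(pre.tolist())  # [True, True, False, False] -> pre-periods exist despite dose > 0
```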

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new findings, though the two P1 issues above are not tracked in TODO.md and therefore remain unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new regression tests still do not lock down the two ContinuousDiD contracts above. tests/test_profile_panel.py:L374-L550 checks many guide semantics but stops at cdid_cells[5] and never asserts the staggered-support cell or any time-invariant-dose/profile signal; tests/test_guides.py:L37-L50 only checks for fingerprints/table presence; and tests/test_profile_panel.py:L357-L371 explicitly expects has_always_treated is True on a valid continuous-dose panel, which bakes the wrong semantics into CI. Concrete fix: extend the semantic guide test to assert the ContinuousDiD staggered cell and explicit unit-constant-dose language, and add profile tests showing that continuous panels expose the within-unit-dose contract without firing the “always treated / no pre-treatment information” alert merely because dose > 0 in pre-treatment rows.

Path to Approval

  1. Fix the ContinuousDiD row in diff_diff/guides/llms-autonomous.txt so staggered adoption is shown as supported, with a note that timing is supplied via first_treat.
  2. Surface the unit-constant-dose requirement in the new profile_panel()/guide workflow, and remove or re-scope the continuous-panel has_always_treated semantics so they do not claim valid dose panels lack pre-treatment information.
  3. Add semantic tests for both rules: ContinuousDiD staggered support, and the within-unit-dose contract together with the corrected continuous-panel alert behavior.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
profile_panel():
- Balance / coverage: compute from unique (unit, time) support rather
  than raw row count. Duplicated rows no longer inflate coverage above
  1 or mask missing cells. Adds duplicate_unit_time_rows warn alert.
- Non-absorbing detection: drop NaN before the monotonicity diff so a
  path like [0, 1, NaN, 0] is correctly classified binary_non_absorbing
  rather than binary_absorbing.
- has_never_treated / has_always_treated: compute generically across
  numeric treatment types (not only binary). Continuous panels with
  zero-dose control units now surface has_never_treated=True; docstring
  and guide field reference updated.

llms-autonomous.txt:
- SyntheticDiD: staggered changed from "partial" to "strict block"
  (fit raises on within-unit variation); footnote rewritten.
- TROP: staggered support marked ✓ (via absorbing D matrix); never-treated
  no longer required (donor pool is every unit untreated at period t);
  fit() covariate surface removed (there is none).
- HeterogeneousAdoptionDiD: covariate adjustment changed to "not
  supported" (Appendix B.1 future work per REGISTRY.md); clustered SE
  annotated "warn" on continuous paths (cluster= kwarg is ignored with
  UserWarning, CR1 only on mass-point path).
- Replace "notyettreated" with "not_yet_treated" throughout (matches
  staggered.py / sun_abraham.py public API).
- Replace "Hausman.hausman_pretest" with "EfficientDiD.hausman_pretest"
  (the classmethod lives on EfficientDiD, not a Hausman namespace).

Tests:
- test_duplicate_unit_time_rows_do_not_inflate_coverage
- test_reversal_through_nan_is_binary_non_absorbing
- test_continuous_zero_dose_controls_flag_has_never_treated
- test_guide_api_strings_resolve_against_public_api (verifies every
  estimator name, hausman_pretest path, and control_group spelling in
  the bundled guide resolves against diff_diff's public API; guards
  against future drift)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
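A standalone sketch of the coverage rule described in the commit above, computed from the unique (unit, time) support so duplicated rows cannot push coverage past 1 (plain pandas, not the library's internal code):

```python
import pandas as pd

def observation_coverage(df: pd.DataFrame, unit: str, time: str) -> float:
    """Share of the unit x period grid covered by at least one observation."""
    cells = df[[unit, time]].dropna().drop_duplicates()          # unique support
    n_units, n_periods = cells[unit].nunique(), cells[time].nunique()
    return len(cells) / (n_units * n_periods)

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2],                 # unit 1's 2001 row is duplicated
    "time": [2000, 2001, 2001, 2000, 2001],
})
print(observation_coverage(panel, "unit", "time"))  # 1.0, not inflated above 1 by the duplicate
```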
igerber added a commit that referenced this pull request Apr 24, 2026
Guide:
- Rewrite HAD section (§4.9, §3 footnotes) to use WAS / WAS_d_lower
  terminology (paper Equation 2 / Theorem 1). HAD does not target ATT
  and its event-study output is per-event-time, not per-cohort. The
  fitted target_parameter attribute is literally "WAS" for Design 1'
  and "WAS_d_lower" for Design 1 / Assumption 6.
- Add control_group="last_cohort" to the EfficientDiD guidance in §4.4
  (no-never-treated paths) and the corresponding footnote. Previously
  the guide steered agents only toward assumption="PT-Post" when a
  never-treated cohort was absent, ruling out the supported pseudo-
  control path. Distinct from CallawaySantAnna's "not_yet_treated".

profile_panel():
- Missing identifier rows are dropped up front when unit or time
  contains NaN. Previously n_units / n_periods used nunique() (NaN-
  dropping) while the unique (unit, time) support used
  drop_duplicates() (NaN-preserving), producing observation_coverage
  > 1 on filtered data with missing IDs. All structural facts now
  compute on the non-missing subset; a new missing_id_rows_dropped
  warn alert surfaces the drop count so the behavior is not silent.

Tests:
- test_missing_unit_or_time_ids_are_dropped_consistently asserts
  observation_coverage stays in [0, 1] and the alert fires with the
  correct drop count.
- test_guide_api_strings_resolve_against_public_api extended with
  assertions for WAS / WAS_d_lower phrasing, the absence of the old
  "per-cohort Pierce-Schott" wording, and control_group="last_cohort".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
HeterogeneousAdoptionDiD staggered support is `partial` (per §3
matrix), but the restriction was never unpacked. Per REGISTRY.md L2281
and had.py:1100-1288, Appendix B.2 limits staggered HAD to the **last
treatment cohort plus never-treated units**. With `first_treat_col`
supplied, `fit(aggregate="event_study")` auto-filters to F_last and
emits a UserWarning naming kept/dropped counts; earlier-cohort units
are dropped. Without `first_treat_col` on a multi-cohort panel, fit()
raises a front-door ValueError pointing at ChaisemartinDHaultfoeuille
for full staggered support.

Guide updates:
- New §3 footnote on the HAD `partial` cell spelling out the last-
  cohort-only restriction, the `first_treat_col` requirement for the
  auto-filter, and the ChaisemartinDHaultfoeuille fallback.
- §4.9 HAD bullet appended with a "Staggered-timing scope is last-
  cohort-only" paragraph carrying the same contract plus the
  "last-cohort-only WAS" estimand clarification.

Tests:
- Semantic guide test asserts "last-cohort-only" (or "last cohort")
  wording, "first_treat_col" token, and ChaisemartinDHaultfoeuille as
  the fallback are all present so future guide edits cannot silently
  drop the Appendix B.2 disclosure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Balanced-panel eligibility gate tightened. `PanelProfile.is_balanced`
is computed from the unique `(unit, time)` support, so it stays `True`
even when duplicate rows exist — `duplicate_unit_time_rows` is a
separate alert for that case. But ContinuousDiD / EfficientDiD /
HeterogeneousAdoptionDiD all require exactly one observation per cell:
EfficientDiD and HAD raise ValueError on duplicates at fit() time, and
ContinuousDiD's precompute path silently resolves duplicates via
last-row-wins (which can change the estimand without warning).

Guide §3 balanced-panel-eligibility block now requires BOTH
`is_balanced == True` AND absence of the `duplicate_unit_time_rows`
alert before routing to these estimators, with the specific failure
mode (raise vs silent overwrite) named per-estimator and a concrete
two-step fix (`balance_panel()` + `drop_duplicates([unit, time])`).

Tests: extended the semantic guide test to assert the guide mentions
`duplicate_unit_time_rows` and uses "BOTH/both" wording around the
is_balanced gate, so future edits cannot silently drop the duplicate-
row half of the eligibility requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
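A self-contained pandas sketch of the two-part gate described above: no duplicate (unit, time) cells AND a full unit x period grid. The real gate is PanelProfile.is_balanced plus the duplicate_unit_time_rows alert, so this is only an illustration:

```python
import pandas as pd

def balanced_and_deduplicated(df, unit, time):
    """Both halves of the gate: one row per (unit, time) cell AND a full grid."""
    no_dups = not df.duplicated(subset=[unit, time]).any()
    cells = df[[unit, time]].drop_duplicates()
    full_grid = len(cells) == cells[unit].nunique() * cells[time].nunique()
    return no_dups and full_grid

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2],                 # duplicate (1, 2001) cell, missing (2, 2001)
    "time": [2000, 2001, 2001, 2000],
    "y":    [1.0, 1.2, 1.3, 0.9],
})
print(balanced_and_deduplicated(panel, "unit", "time"))     # False
deduped = panel.drop_duplicates(subset=["unit", "time"])    # step 1 of the two-step fix
print(balanced_and_deduplicated(deduped, "unit", "time"))   # still False: grid not yet balanced
```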
igerber added a commit that referenced this pull request Apr 24, 2026
Bool-dtype treatment columns are now classified the same way as
numeric {0, 1} rather than as "categorical". The library's binary
estimators validate value support via `validate_binary` (utils.py:
49-67), which accepts bool because True/False coerce to 1/0
numerically. Classifying bool as categorical silently routed valid
binary DiD panels away from the supported estimator set.

Changes:
- _classify_treatment() no longer early-returns "categorical" for
  bool dtype. The downstream absorbing/non-absorbing logic handles
  bool by casting to int before np.diff (raw bool diff is XOR, which
  would mask a True -> False transition).
- treatment_varies_within_unit now includes bool-dtype columns (was
  previously hardcoded False for bool).
- Guide §2 removes the "bool = categorical" rule and adds an
  explicit "bool is binary" note with a pointer to validate_binary
  as the reason.
- profile_panel() docstring mirrors the same update.

Tests:
- test_bool_dtype_treatment_is_binary_absorbing: staggered-style
  bool panel with never-treated cohort -> binary_absorbing, correct
  has_never_treated / treatment_varies_within_unit / cohort_sizes.
- test_bool_dtype_non_absorbing: reversible False -> True -> False
  bool panel -> binary_non_absorbing. Guards the int-cast before
  np.diff so future refactors don't regress to bool XOR semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
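The bool-vs-int np.diff behavior the cast guards against can be reproduced in a few lines:

```python
import numpy as np

path = np.array([False, True, False])        # a reversal: True -> False at the end
print(np.diff(path))                          # [ True  True ]  (bool diff is XOR)
print(np.diff(path.astype(int)))              # [ 1 -1 ]        (the reversal is visible)

# A monotonicity test on the raw bool diff would miss the reversal:
print(bool((np.diff(path) >= 0).all()))              # True  -- looks absorbing
print(bool((np.diff(path.astype(int)) >= 0).all()))  # False -- correctly non-absorbing
```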
igerber added a commit that referenced this pull request Apr 24, 2026
dCDH "spillover designs" claim removed. The earlier §4.3 bullet said
ChaisemartinDHaultfoeuille was "robust to non-absorbing treatment and
to spillover designs," but REGISTRY.md §ChaisemartinDHaultfoeuille
only documents support for non-absorbing / reversible treatment — the
estimator assumes SUTVA like every other DiD estimator in the suite.
The same guide's §7 glossary already defines SUTVA as ruling out
spillovers, so the §4.3 wording was internally inconsistent and would
have misrouted interference designs to dCDH.

Rewrote the §4.3 bullet to state the actually-supported contract
(non-absorbing / reversible treatment via DID_M / DID_l) and added
an explicit note that interference / between-unit spillovers are not
supported natively.

Regression test: `tests/test_profile_panel.py` extended with a
forbidden-phrase check that fails if the autonomous guide re-advertises
dCDH as "robust to spillover", "interference-robust", "supports
spillover", or "and to spillover".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
§4.10 repeated-cross-section wording tightened. The earlier text
("most estimators remain applicable") overstated the RCS support
surface: only CallawaySantAnna(panel=False), TripleDifference, and
StaggeredTripleDifference have documented RCS-capable contracts
(REGISTRY.md §CallawaySantAnna; docs/choosing_estimator.rst DDD
cross-sectional use cases). EfficientDiD and HeterogeneousAdoptionDiD
explicitly reject RCS per REGISTRY.md. SyntheticDiD and ContinuousDiD
are panel-only by construction. Under the old wording an autonomous
agent could silently route RCS data to a panel-only estimator.

Rewrote §4.10 into three explicit blocks:
- "Explicit RCS support": the three RCS-capable estimators.
- "Explicitly rejected for RCS (panel-only)": names EfficientDiD,
  HeterogeneousAdoptionDiD, SyntheticDiD, ContinuousDiD.
- "Treat other estimators in this guide as panel-only unless their
  own docs explicitly say otherwise."

Kept the clustered-SE note and added a cluster-vs-respondent
treatment-assignment check.

Regression test: asserts the guide does not contain the "most
estimators remain applicable" phrase, names `panel=False` as the
explicit CS RCS mode, and that the §4.10 section explicitly lists
both EfficientDiD and HeterogeneousAdoptionDiD as panel-only
rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Rebased onto current main (17 commits clean — PR #355, #358, #359 all
merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced.
The earlier §4.10 RCS wording paired TripleDifference /
StaggeredTripleDifference together in the Explicit RCS support list,
but REGISTRY.md §StaggeredTripleDifference requires a balanced panel
and staggered_triple_diff.py:93-109 has no panel=False mode — fit()
rejects unbalanced/duplicate (unit, time) structure at
staggered_triple_diff.py:846-864.

- §4.10 Explicit RCS support: TripleDifference (two-period) only;
  StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added
  with a concrete "no panel=False mode" + "use TripleDifference for
  cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to
  the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the
  Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e3b22dcf50ed559df7310afe848ef6a30e0a62b5


Overall Assessment
✅ Looks good

Executive Summary

  • The prior P1 on StaggeredTripleDifference appears resolved: the autonomous guide now treats it as balanced-panel-only and explicitly excludes it from the repeated-cross-section subset in llms-autonomous.txt:223, llms-autonomous.txt:277, and llms-autonomous.txt:510.
  • I did not find a new undocumented methodology deviation in the changed guide/profile surfaces after cross-checking against docs/methodology/REGISTRY.md and the relevant estimator contracts.
  • profile_panel() covers the main new edge cases introduced here: missing unit/time IDs, duplicate (unit, time) keys, degenerate binary-treatment panels, continuous zero-dose controls, and non-absorbing paths with missing gaps.
  • The new semantic guide tests now pin the corrected StaggeredTripleDifference balance/RCS contract in test_profile_panel.py:680, test_profile_panel.py:717, and test_profile_panel.py:749.
  • Static review only: pytest and numpy are unavailable in this environment, so I could not run the test suite or import diff_diff.

Methodology
No findings. The re-review focus item is addressed: the earlier StaggeredTripleDifference RCS/balanced-panel mismatch is fixed in the new guide and regression-covered in the new tests.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings. I did not identify a new deferrable item that needs TODO.md tracking.

Security
No findings.

Documentation/Tests
No findings. The added tests materially improve regression protection for the agent-facing guide/profile contract, but I was only able to do static review in this environment.

@igerber igerber added the ready-for-ci label (Triggers CI test workflows) on Apr 24, 2026
igerber and others added 18 commits April 24, 2026 15:41
New `diff_diff.profile_panel(df, *, unit, time, treatment, outcome)`
returns a frozen `PanelProfile` dataclass of structural facts about a
DiD panel (balance, treatment-type classification, cohort structure,
outcome characteristics, factual alerts). `.to_dict()` returns a
JSON-serializable view. Descriptive only -- alerts never name a
specific estimator; estimator selection is up to the caller.

Paired with a new bundled `"autonomous"` variant on `get_llm_guide()`
at `diff_diff/guides/llms-autonomous.txt`. Reference-shaped (distinct
from the workflow-prose `"practitioner"` variant): `PanelProfile`
field reference, 17-estimator x 9-design-feature support matrix,
per-design-feature reasoning citing Baker et al. (2025) and
Roth/Sant'Anna (2023), post-fit validation index, BR/DR schema
reference, and an explicit "no deterministic recommendations"
disclaimer.

`diff_diff/__init__.py` module docstring leads with a "For AI agents:"
entry block so `help(diff_diff)` surfaces the four-step sequence
(profile, reference, workflow, report) at import time. Exports
`profile_panel`, `PanelProfile`, `Alert` from the top-level namespace.

Both files are bundled inside the wheel and sdist (no GitHub/RTD
dependency at runtime) via the existing `diff_diff/guides/*.txt` glob
in `pyproject.toml`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
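A minimal usage sketch of the surface this commit introduces; the toy column names and data are made up, while the call signature and the two fields accessed are the ones documented in this PR:

```python
import json
import pandas as pd
import diff_diff

# Toy staggered panel: units 1-2 adopt in 2002, unit 3 is never treated.
panel = pd.DataFrame({
    "firm":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "year":    [2000, 2001, 2002] * 3,
    "treated": [0, 0, 1, 0, 0, 1, 0, 0, 0],
    "revenue": [1.0, 1.1, 1.4, 0.9, 1.0, 1.3, 1.2, 1.2, 1.3],
})

prof = diff_diff.profile_panel(
    panel, unit="firm", time="year", treatment="treated", outcome="revenue"
)
print(prof.is_balanced, prof.has_never_treated)    # structural facts, no estimator named
print(json.dumps(prof.to_dict(), indent=2)[:200])  # JSON-serializable view for an agent
```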
The treatment classifier required exactly two observed distinct values
to treat a column as binary. Panels that are entirely never-treated
(values = {0}) or entirely always-treated (values = {1}) were falling
through to "continuous", contradicting the documented taxonomy which
defines "binary_absorbing" as "values in {0, 1}".

Rule is now values_set <= {0, 1, 0.0, 1.0} with at least one observed
value; entirely-NaN treatment columns fall through to "categorical"
rather than "continuous". Docstring and the autonomous guide section
§2 reference are updated to match.

New regression tests:
- all-zero treatment panel -> binary_absorbing, has_never_treated
- all-one treatment panel -> binary_absorbing, has_always_treated
- binary with NaNs, only zeros observed -> binary_absorbing
- all-NaN treatment -> categorical
- top-level import surface (profile_panel / PanelProfile / Alert)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
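An illustrative reduction of the value-set rule above; the real classifier also distinguishes absorbing from non-absorbing paths and handles bool dtype (covered in a later commit):

```python
import numpy as np
import pandas as pd

def classify_numeric_treatment(col: pd.Series) -> str:
    """Sketch of the value-set rule only: binary vs continuous vs categorical."""
    observed = set(col.dropna().unique())
    if not observed:
        return "categorical"                 # entirely-NaN column
    if observed <= {0, 1, 0.0, 1.0}:
        return "binary"                      # all-zero and all-one panels included
    return "continuous"

print(classify_numeric_treatment(pd.Series([0, 0, 0])))         # binary (never-treated panel)
print(classify_numeric_treatment(pd.Series([1.0, 1.0])))        # binary (always-treated panel)
print(classify_numeric_treatment(pd.Series([np.nan, np.nan])))  # categorical
print(classify_numeric_treatment(pd.Series([0.0, 0.4, 1.0])))   # continuous
```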
profile_panel():
- Balance / coverage: compute from unique (unit, time) support rather
  than raw row count. Duplicated rows no longer inflate coverage above
  1 or mask missing cells. Adds duplicate_unit_time_rows warn alert.
- Non-absorbing detection: drop NaN before the monotonicity diff so a
  path like [0, 1, NaN, 0] is correctly classified binary_non_absorbing
  rather than binary_absorbing.
- has_never_treated / has_always_treated: compute generically across
  numeric treatment types (not only binary). Continuous panels with
  zero-dose control units now surface has_never_treated=True; docstring
  and guide field reference updated.

llms-autonomous.txt:
- SyntheticDiD: staggered changed from "partial" to "strict block"
  (fit raises on within-unit variation); footnote rewritten.
- TROP: staggered support marked ✓ (via absorbing D matrix); never-treated
  no longer required (donor pool is every unit untreated at period t);
  fit() covariate surface removed (there is none).
- HeterogeneousAdoptionDiD: covariate adjustment changed to "not
  supported" (Appendix B.1 future work per REGISTRY.md); clustered SE
  annotated "warn" on continuous paths (cluster= kwarg is ignored with
  UserWarning, CR1 only on mass-point path).
- Replace "notyettreated" with "not_yet_treated" throughout (matches
  staggered.py / sun_abraham.py public API).
- Replace "Hausman.hausman_pretest" with "EfficientDiD.hausman_pretest"
  (the classmethod lives on EfficientDiD, not a Hausman namespace).

Tests:
- test_duplicate_unit_time_rows_do_not_inflate_coverage
- test_reversal_through_nan_is_binary_non_absorbing
- test_continuous_zero_dose_controls_flag_has_never_treated
- test_guide_api_strings_resolve_against_public_api (verifies every
  estimator name, hausman_pretest path, and control_group spelling in
  the bundled guide resolves against diff_diff's public API; guards
  against future drift)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guide:
- Rewrite HAD section (§4.9, §3 footnotes) to use WAS / WAS_d_lower
  terminology (paper Equation 2 / Theorem 1). HAD does not target ATT
  and its event-study output is per-event-time, not per-cohort. The
  fitted target_parameter attribute is literally "WAS" for Design 1'
  and "WAS_d_lower" for Design 1 / Assumption 6.
- Add control_group="last_cohort" to the EfficientDiD guidance in §4.4
  (no-never-treated paths) and the corresponding footnote. Previously
  the guide steered agents only toward assumption="PT-Post" when a
  never-treated cohort was absent, ruling out the supported pseudo-
  control path. Distinct from CallawaySantAnna's "not_yet_treated".

profile_panel():
- Missing identifier rows are dropped up front when unit or time
  contains NaN. Previously n_units / n_periods used nunique() (NaN-
  dropping) while the unique (unit, time) support used
  drop_duplicates() (NaN-preserving), producing observation_coverage
  > 1 on filtered data with missing IDs. All structural facts now
  compute on the non-missing subset; a new missing_id_rows_dropped
  warn alert surfaces the drop count so the behavior is not silent.

Tests:
- test_missing_unit_or_time_ids_are_dropped_consistently asserts
  observation_coverage stays in [0, 1] and the alert fires with the
  correct drop count.
- test_guide_api_strings_resolve_against_public_api extended with
  assertions for WAS / WAS_d_lower phrasing, the absence of the old
  "per-cohort Pierce-Schott" wording, and control_group="last_cohort".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guide:
- SunAbraham matrix row: never-treated-required changed from ✗ to ✓.
  The fit path raises ValueError when no never-treated units exist
  (sun_abraham.py:566-567); the registry documents this requirement.
  §4.3 prose updated to state the requirement explicitly.
- HAD §4.9 (and §3 Phase 3 description): correct the diagnostic-claim
  wording. The pre-test battery validates the QUG null (step 1),
  Assumption 7 pre-trends (step 2, event-study only), and linearity
  of E[ΔY|D_2] (step 3). Assumption 3 (uniform continuity / no
  extensive-margin jump) is explicitly "not testable" per REGISTRY.md
  L2181, so the guide must not claim Phase 3 validates it.

profile_panel():
- Empty-panel guard: raise ValueError when the input DataFrame is
  empty (direct) or when no rows remain after dropping missing
  (unit, time) identifiers. Previously both cases returned
  is_balanced=True / observation_coverage=0 on zero valid rows.
- Fix missing-id counting: n_rows_with_missing_id was the SUM of
  unit-NaN and time-NaN (double-counting rows missing both). Now
  computed from a row-wise isna().any(axis=1) mask so every dropped
  row is counted exactly once.

Tests:
- test_row_with_both_ids_missing_counted_once
- test_empty_dataframe_raises_value_error
- test_empty_after_id_drop_raises_value_error
- test_guide_api_strings_resolve_against_public_api extended:
  SunAbraham matrix-row cell assertion (never-treated-required=✓),
  "Assumption 3" presence, absence of "validate Assumptions 3 and 7",
  presence of "not testable".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
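The double-counting fix described above is easy to see on a toy frame (plain pandas, illustrative only):

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "unit": [1, np.nan, np.nan, 2],
    "time": [2000, 2001, np.nan, np.nan],   # row 2 is missing BOTH identifiers
})

# Naive sum double-counts the row missing both ids:
print(panel["unit"].isna().sum() + panel["time"].isna().sum())   # 4

# Row-wise mask counts every dropped row exactly once:
mask = panel[["unit", "time"]].isna().any(axis=1)
print(int(mask.sum()))                                            # 3
print(panel.loc[~mask])                                           # rows kept for profiling
```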
Guide:
- Matrix covariate cells corrected. SyntheticDiD column 6 flipped
  ✗ → ✓: fit() accepts covariates= and residualizes the outcome
  (synthetic_did.py:213-239, 425-436). ContinuousDiD column 6 flipped
  ✓ → ✗: the fit() signature has no covariate argument and REGISTRY
  lists covariate adjustment as deferred.
- EfficientDiD wording corrected. The registry says PT-Post still uses
  never-treated as the comparison group (REGISTRY.md L756-758);
  "switch to PT-Post" does not drop the requirement. Rewrote
  has_never_treated field description, the §3 footnote, and §4.4
  prose to state that PT-All and PT-Post both require never-treated
  units and control_group="last_cohort" is the only supported path
  for an all-eventually-treated panel.

profile_panel():
- _compute_pre_post now uses each treated unit's observed (unit, time)
  support rather than the global panel period set. On unbalanced
  panels this correctly captures the least-supported treated unit's
  pre/post exposure and fires short_pre_panel / short_post_panel when
  appropriate; the previous global-periods approach could overstate
  exposure and suppress the alert.

Guide §5 API signatures corrected:
- compute_pretrends_power takes a fitted results object, not a raw
  DataFrame (pretrends.py:1048).
- plot_sensitivity takes a SensitivityResults object
  (visualization/_diagnostic.py:12-14).
- plot_honest_event_study takes a HonestDiDResults object
  (visualization/_event_study.py:9).

Tests:
- test_min_pre_post_use_per_unit_observed_support (unbalanced treated
  unit missing early pre-period fires short_pre_panel).
- test_guide_api_strings_resolve_against_public_api extended with
  SyntheticDiD/ContinuousDiD covariate-cell matrix assertions,
  EfficientDiD both-PT-variants-require-never-treated wording
  assertion, and §5 API-signature fingerprints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
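A rough sketch of per-treated-unit pre/post counting on each unit's own observed support, as the commit above describes; the exact counting conventions (strictly-before vs inclusive of the first treated period) are illustrative, not the library's definition:

```python
import pandas as pd

def min_pre_post(df, unit, time, treatment):
    """Min pre/post exposure across treated units, on each unit's own observed periods."""
    pre_counts, post_counts = [], []
    for _, g in df.groupby(unit):
        treated_times = g.loc[g[treatment] > 0, time]
        if treated_times.empty:
            continue                              # never-treated units do not enter
        first = treated_times.min()
        pre_counts.append((g[time] < first).sum())
        post_counts.append((g[time] >= first).sum())
    return int(min(pre_counts)), int(min(post_counts))

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2],                      # unit 2 is missing its early pre-period
    "time": [2000, 2001, 2002, 2001, 2002],
    "d":    [0, 0, 1, 0, 1],
})
print(min_pre_post(panel, "unit", "time", "d"))   # (1, 1): unit 2 has only one pre-period
```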
Guide §6 rewritten to match the actual BR/DR to_dict() schemas emitted
at diff_diff/business_report.py:524-549 and diff_diff/diagnostic_report.py:
1013-1031. BR top-level keys are "assumption" (singular), "pre_trends"
(underscored), "sample" (bare); the new documentation also adds
"estimator", "context", "heterogeneity", "diagnostics", "next_steps",
"references". DR keys are "estimator" (string), "headline_metric",
"estimator_native_diagnostics", "skipped", "warnings",
"overall_interpretation", "next_steps"; the old "checks: dict" nesting
never existed and "estimator_type" was never the key name. The
"forthcoming schema additions" block is dropped since it documented an
un-shipped surface.

Prose + alert wording aligned with the per-treated-unit implementation:
- Guide §2: min_pre_periods / min_post_periods described as "across
  treated units" (each unit's observed (unit, time) support counted
  independently on unbalanced panels).
- profile.py: short_pre_panel / short_post_panel alert messages use
  "across treated units" to match the computation.

Tests:
- test_guide_api_strings_resolve_against_public_api extended with
  BR/DR §6 schema assertions. Guards every real top-level key
  documented (assumption, pre_trends, sample, headline_metric,
  estimator_native_diagnostics, overall_interpretation) and forbids
  the previously documented obsolete keys (assumptions, pretrends,
  main_result, sample_summary, estimator_type, checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guide methodology attribution corrected:
- EfficientDiD primary source is Chen, Sant'Anna, Xie (2025) per
  REGISTRY.md L746-749, not Arkhangelsky-Imbens. §4.3 bullet and the
  §7 citation list updated.
- ContinuousDiD synopsis (§4.7) now distinguishes the PT-identified
  targets (`ATT(d|d)`, `ATT^{loc}`) from the SPT-required targets
  (`ATT(d)`, `ACRT(d)`, `ACRT^{glob}`) per REGISTRY.md L682-706, with
  the Callaway, Goodman-Bacon, Sant'Anna (2024) citation added to §7.

Guide §6 BR schema corrections:
- `diagnostics` wrapper documented accurately: always has `status`,
  with `schema` + `overall_interpretation` on the "ran" path or
  `reason` on the "skipped" path. Previously the guide said the key
  held the DR payload directly.
- `target_parameter.reference` field documented alongside the other
  fields per `describe_target_parameter()` in
  diff_diff/_reporting_helpers.py:34-42.

Regression tests extended:
- `diagnostics["schema"]` wrapper wording guarded (and absence of
  plain "DR payload under diagnostics").
- `target_parameter.reference` key documentation guarded.
- EfficientDiD attribution: `Chen, Sant'Anna, Xie 2025` present,
  `(Arkhangelsky-Imbens)` absent.
- ContinuousDiD targets: `ATT(d|d)`, `ACRT`, `Strong Parallel Trends`
  all present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContinuousDiD eligibility (P1): ContinuousDiD.fit() requires zero-dose
control units (P(D=0) > 0) because Remark 3.1 lowest-dose-as-control
is not yet implemented (REGISTRY.md L714; continuous_did.py:349-360).
Guide corrections:
- Matrix row: never-treated-required ✗ → ✓.
- has_never_treated field doc: add ContinuousDiD to the required-by
  list with the Remark 3.1 note.
- §4.4 no-never-treated paths: add ContinuousDiD bullet naming the
  ValueError and pointing at the unimplemented Remark 3.1 path.
- §4.7 ContinuousDiD bullet: prepend the "Requires zero-dose control
  units" callout.

DiagnosticReport §6 section-status semantics (P2): the real schema
uses `status` for execution state ("ran", "not_applicable", "not_run",
"no_scalar_by_design", "skipped") and `verdict` as a separate field
for qualitative interpretation (e.g. "clean", "inconclusive",
"violated"). Previously §6 conflated these: it listed
pass/warn/inconclusive as status values and omitted the execution
statuses. Rewrote the section-status paragraph to document the two
layers explicitly; moved the forthcoming-additions note into the
same block.

Tests:
- ContinuousDiD matrix-row cell assertion (col 5 = ✓).
- "P(D=0) > 0" substring fingerprint guards the Remark 3.1 wording.
- DR §6 status vocabulary: guards every real status the implementation
  emits ("ran", "not_applicable", "not_run", "no_scalar_by_design")
  and forbids the obsolete pass/warn/inconclusive-as-status framing;
  `verdict` must be mentioned as a distinct concept.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContinuousDiD staggered support (P1 #1): the matrix marked
staggered=✗, but the method natively supports staggered adoption via
the `first_treat` column (continuous_did.py:159-169, 919-925;
REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.

Time-invariant dose requirement (P1 #2): ContinuousDiD.fit() requires
dose to be time-invariant per unit (continuous_did.py:222-228;
docs/methodology/continuous-did.md:L70-75), but profile_panel() did
not expose this so time-varying-dose continuous panels were routed to
ContinuousDiD only to hard-fail at fit time.

Added `PanelProfile.treatment_varies_within_unit: bool` — True iff
any unit has more than one distinct non-NaN treatment value across
its observed rows. Computed unconditionally for numeric (non-bool)
treatment columns; False for categorical. `to_dict()` exposes it.
Guide §2 documents the field, §4.7 ContinuousDiD bullet lists two
eligibility prerequisites: P(D=0) > 0 AND
treatment_varies_within_unit == False.

Tests (P2):
- test_continuous_treatment_with_time_varying_dose: random-per-row
  continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose ->
  treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel
  always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓;
  guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (resolved CHANGELOG.md conflict: the by_path
bullet from PR #355 and the profile_panel/autonomous-guide bullet from
this PR now live side-by-side under [Unreleased]).

has_always_treated now has binary-only semantics:
- For binary treatment (absorbing or non-absorbing): unit_min == 1
  means the unit is treated in every observed period (no
  pre-treatment information in the DiD sense).
- For continuous treatment: always False. Pre-treatment periods on
  continuous DiD are determined by the separate `first_treat` column
  supplied to `ContinuousDiD.fit`, not by whether the dose is
  positive. A unit with a constant positive dose can still have
  well-defined pre-treatment periods, so flagging it as
  "always-treated / no pre-treatment information" was factually wrong
  and triggered the misleading `has_always_treated_units` alert on
  valid continuous panels.
- Categorical: False by construction.

Guide §2 has_always_treated field doc updated to state the
binary-only semantics explicitly, with a note about `first_treat`.

Tests:
- New: test_continuous_positive_dose_does_not_fire_has_always_treated
  asserts has_always_treated=False AND the alert does not fire on a
  constant-positive-dose continuous panel.
- Existing test_continuous_zero_dose_controls_flag_has_never_treated
  updated: has_always_treated expected to be False (was True under
  the old semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Balanced-panel eligibility (P1): ContinuousDiD, EfficientDiD,
SyntheticDiD, and HeterogeneousAdoptionDiD all hard-reject unbalanced
panels at fit() time (continuous_did.py:329-338, efficient_did.py:
407-414, synthetic_did.py:399-412, had.py:1173-1188; REGISTRY.md
cross-refs). Guide updates surface this:

- New "Balanced-panel eligibility" block after §3 matrix footnotes
  names the four affected estimators and points at
  `PanelProfile.is_balanced == True` as the gate. Directs users with
  unbalanced panels to `diff_diff.prep.balance_panel()` or to a
  balance-tolerant estimator.
- §4 per-estimator bullets for all four estimators prepend or append
  the balanced-panel requirement with the specific fit() error the
  caller would otherwise hit.
- ContinuousDiD §4.7 bullet now lists THREE eligibility prerequisites
  (zero-dose controls, time-invariant dose, balanced panel) where it
  previously listed two.

Docstring (P3): profile_panel() docstring notes block updated to
match the binary-only has_always_treated semantics shipped in round
9. The old wording claimed the field fired on "strictly positive
treatment in every observed non-NaN row" across numeric types, which
no longer matches the implementation.

Tests (P2):
- Semantic guide test asserts `is_balanced` is mentioned in the guide
  and each of the four balance-sensitive estimators appears within
  400 characters of a "balanced" / "is_balanced" marker, so future
  edits cannot silently drop the eligibility gate from any of them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
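Putting the three prerequisites together, an agent-side eligibility check might look like the sketch below. The field names follow the PanelProfile reference in this PR; the `alerts` attribute name and the string-based alert match are assumptions for illustration, not a documented API:

```python
# Illustrative only. Field names (is_balanced, has_never_treated,
# treatment_varies_within_unit) follow the PanelProfile reference in this PR;
# the `alerts` attribute name and the string match below are assumptions --
# adapt them to the real Alert structure.
def continuous_did_eligible(prof) -> bool:
    has_zero_dose_controls = prof.has_never_treated          # P(D=0) > 0
    dose_constant_per_unit = not prof.treatment_varies_within_unit
    no_duplicate_cells = not any(
        "duplicate_unit_time_rows" in str(a) for a in prof.alerts
    )
    return (
        prof.is_balanced
        and no_duplicate_cells
        and has_zero_dose_controls
        and dose_constant_per_unit
    )
```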
HeterogeneousAdoptionDiD staggered support is `partial` (per §3
matrix), but the restriction was never unpacked. Per REGISTRY.md L2281
and had.py:1100-1288, Appendix B.2 limits staggered HAD to the **last
treatment cohort plus never-treated units**. With `first_treat_col`
supplied, `fit(aggregate="event_study")` auto-filters to F_last and
emits a UserWarning naming kept/dropped counts; earlier-cohort units
are dropped. Without `first_treat_col` on a multi-cohort panel, fit()
raises a front-door ValueError pointing at ChaisemartinDHaultfoeuille
for full staggered support.

Guide updates:
- New §3 footnote on the HAD `partial` cell spelling out the last-
  cohort-only restriction, the `first_treat_col` requirement for the
  auto-filter, and the ChaisemartinDHaultfoeuille fallback.
- §4.9 HAD bullet appended with a "Staggered-timing scope is last-
  cohort-only" paragraph carrying the same contract plus the
  "last-cohort-only WAS" estimand clarification.

Tests:
- Semantic guide test asserts "last-cohort-only" (or "last cohort")
  wording, "first_treat_col" token, and ChaisemartinDHaultfoeuille as
  the fallback are all present so future guide edits cannot silently
  drop the Appendix B.2 disclosure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Balanced-panel eligibility gate tightened. `PanelProfile.is_balanced`
is computed from the unique `(unit, time)` support, so it stays `True`
even when duplicate rows exist — `duplicate_unit_time_rows` is a
separate alert for that case. But ContinuousDiD / EfficientDiD /
HeterogeneousAdoptionDiD all require exactly one observation per cell:
EfficientDiD and HAD raise ValueError on duplicates at fit() time, and
ContinuousDiD's precompute path silently resolves duplicates via
last-row-wins (which can change the estimand without warning).

Guide §3 balanced-panel-eligibility block now requires BOTH
`is_balanced == True` AND absence of the `duplicate_unit_time_rows`
alert before routing to these estimators, with the specific failure
mode (raise vs silent overwrite) named per-estimator and a concrete
two-step fix (`balance_panel()` + `drop_duplicates([unit, time])`).

Tests: extended the semantic guide test to assert the guide mentions
`duplicate_unit_time_rows` and uses "BOTH/both" wording around the
is_balanced gate, so future edits cannot silently drop the duplicate-
row half of the eligibility requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bool-dtype treatment columns are now classified the same way as
numeric {0, 1} rather than as "categorical". The library's binary
estimators validate value support via `validate_binary` (utils.py:
49-67), which accepts bool because True/False coerce to 1/0
numerically. Classifying bool as categorical silently routed valid
binary DiD panels away from the supported estimator set.

Changes:
- _classify_treatment() no longer early-returns "categorical" for
  bool dtype. The downstream absorbing/non-absorbing logic handles
  bool by casting to int before np.diff (raw bool diff is XOR, which
  would mask a True -> False transition).
- treatment_varies_within_unit now includes bool-dtype columns (was
  previously hardcoded False for bool).
- Guide §2 removes the "bool = categorical" rule and adds an
  explicit "bool is binary" note with a pointer to validate_binary
  as the reason.
- profile_panel() docstring mirrors the same update.

Tests:
- test_bool_dtype_treatment_is_binary_absorbing: staggered-style
  bool panel with never-treated cohort -> binary_absorbing, correct
  has_never_treated / treatment_varies_within_unit / cohort_sizes.
- test_bool_dtype_non_absorbing: reversible False -> True -> False
  bool panel -> binary_non_absorbing. Guards the int-cast before
  np.diff so future refactors don't regress to bool XOR semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dCDH "spillover designs" claim removed. The earlier §4.3 bullet said
ChaisemartinDHaultfoeuille was "robust to non-absorbing treatment and
to spillover designs," but REGISTRY.md §ChaisemartinDHaultfoeuille
only documents support for non-absorbing / reversible treatment — the
estimator assumes SUTVA like every other DiD estimator in the suite.
The same guide's §7 glossary already defines SUTVA as ruling out
spillovers, so the §4.3 wording was internally inconsistent and would
have misrouted interference designs to dCDH.

Rewrote the §4.3 bullet to state the actually-supported contract
(non-absorbing / reversible treatment via DID_M / DID_l) and added
an explicit note that interference / between-unit spillovers are not
supported natively.

Regression test: `tests/test_profile_panel.py` extended with a
forbidden-phrase check that fails if the autonomous guide re-advertises
dCDH as "robust to spillover", "interference-robust", "supports
spillover", or "and to spillover".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§4.10 repeated-cross-section wording tightened. The earlier text
("most estimators remain applicable") overstated the RCS support
surface: only CallawaySantAnna(panel=False), TripleDifference, and
StaggeredTripleDifference have documented RCS-capable contracts
(REGISTRY.md §CallawaySantAnna; docs/choosing_estimator.rst DDD
cross-sectional use cases). EfficientDiD and HeterogeneousAdoptionDiD
explicitly reject RCS per REGISTRY.md. SyntheticDiD and ContinuousDiD
are panel-only by construction. Under the old wording an autonomous
agent could silently route RCS data to a panel-only estimator.

Rewrote §4.10 into three explicit blocks:
- "Explicit RCS support": the three RCS-capable estimators.
- "Explicitly rejected for RCS (panel-only)": names EfficientDiD,
  HeterogeneousAdoptionDiD, SyntheticDiD, ContinuousDiD.
- "Treat other estimators in this guide as panel-only unless their
  own docs explicitly say otherwise."

Kept the clustered-SE note and added a cluster-vs-respondent
treatment-assignment check.

Regression test: asserts the guide does not contain the "most
estimators remain applicable" phrase, names `panel=False` as the
explicit CS RCS mode, and that the §4.10 section explicitly lists
both EfficientDiD and HeterogeneousAdoptionDiD as panel-only
rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (17 commits clean — PR #355, #358, #359 all
merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced.
The earlier §4.10 RCS wording paired TripleDifference /
StaggeredTripleDifference together in the Explicit RCS support list,
but REGISTRY.md §StaggeredTripleDifference requires a balanced panel
and staggered_triple_diff.py:93-109 has no panel=False mode — fit()
rejects unbalanced/duplicate (unit, time) structure at
staggered_triple_diff.py:846-864.

- §4.10 Explicit RCS support: TripleDifference (two-period) only;
  StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added
  with a concrete "no panel=False mode" + "use TripleDifference for
  cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to
  the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the
  Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the agent-profile-panel branch from e3b22dc to ef6b53d on April 24, 2026 19:41
@igerber igerber merged commit 4852b34 into main Apr 24, 2026
19 checks passed
@igerber igerber deleted the agent-profile-panel branch April 24, 2026 20:50
igerber added a commit that referenced this pull request Apr 24, 2026
…ck get_loo_effects_df on survey jackknife

P0 (Methodology — survey jackknife silently skipping undefined LOO):
The Rust & Rao (1996) stratified jackknife formula `SE² =
Σ_h (1-f_h)·(n_h-1)/n_h·Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)²` requires every
PSU-LOO `τ̂_{(h,j)}` to be defined. The previous implementation
silently skipped PSUs whose deletion removed all treated units (or
zeroed control ω_eff mass, or raised in the estimator) while still
applying the full `(n_h-1)/n_h` factor, under-scaling variance on
designs where treated units pack into a single PSU.

Fix: `_jackknife_se_survey` now tracks any undefined replicate in a
contributing stratum (n_h ≥ 2) and short-circuits to `SE=NaN` with a
targeted `UserWarning` naming the stratum / PSU / reason (deletion
removes all treated, kept ω_eff zero, kept treated survey mass zero,
estimator raised, estimator returned non-finite). Partial LOOs are
still returned in `placebo_effects` for debugging; users needing a
variance estimator that accommodates PSU-deletion infeasibility
should use `variance_method="bootstrap"`. Silent stratum-level skip
for `n_h < 2` is preserved (canonical lonely-PSU handling matching
R `survey::svyjkn`).
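The quoted formula can be checked with a short numpy sketch (a standalone illustration, not the library's `_jackknife_se_survey`), including the two behaviors described here: NaN when any replicate in a contributing stratum is undefined, and silent skip of lonely-PSU strata:

```python
import numpy as np

def stratified_jackknife_se(loo_by_stratum, fpc_by_stratum=None):
    """Rust & Rao (1996) stratified jackknife SE from leave-one-PSU-out estimates.

    loo_by_stratum: {stratum h: sequence of tau_hat_(h, j) over PSUs j in h}.
    fpc_by_stratum: optional {stratum h: f_h} finite-population corrections (default 0).
    """
    var = 0.0
    for h, loo in loo_by_stratum.items():
        loo = np.asarray(loo, dtype=float)
        n_h = len(loo)
        if n_h < 2:
            continue                       # lonely-PSU stratum: silently skipped
        if not np.isfinite(loo).all():
            return float("nan")            # undefined replicate in a contributing stratum
        f_h = (fpc_by_stratum or {}).get(h, 0.0)
        var += (1.0 - f_h) * (n_h - 1) / n_h * np.sum((loo - loo.mean()) ** 2)
    return float(np.sqrt(var))

print(stratified_jackknife_se({0: [1.0, 1.4, 0.9], 1: [2.0, np.nan]}))  # nan
print(stratified_jackknife_se({0: [1.0, 1.4, 0.9], 1: [2.0, 2.2]}))     # ~0.321
```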

New regression `test_jackknife_full_design_undefined_replicate_returns_nan`
exercises the fix on the original `sdid_survey_data_full_design`
fixture (treated all in stratum 0 PSU 0 → LOO PSU 0 removes all
treated) and asserts both the `UserWarning` match and `np.isnan(se)`.

The existing jackknife tests that asserted finite SE now use a new
`sdid_survey_data_jk_well_formed` fixture where treated units are
spread across two PSUs within stratum 0 (so every LOO leaves ≥1
treated). The self-consistency test
(`test_jackknife_full_design_stratum_aggregation_formula_magnitude`)
was rewritten from a weak finite-positive check to a real
recomputation of the Rust & Rao formula on the returned 6-entry
`placebo_effects` array, asserting `result.se == pytest.approx(
expected, rel=1e-12)`.

Coverage MC (`benchmarks/data/sdid_coverage.json`) is unchanged:
the `stratified_survey` DGP spreads its 32 treated units across
PSUs 2 and 3 within stratum 1 and PSUs 0 and 1 within stratum 0,
so every LOO is defined there too. The previously-reported jackknife
anti-conservatism (α=0.05 rejection = 0.45, SE/trueSD = 0.46) is
the documented few-PSU limitation (1 effective DoF per stratum
with `n_h = 2`), not the P0 silent-skip bug.

P1 (Code Quality — get_loo_effects_df on survey jackknife):
`SyntheticDiDResults.get_loo_effects_df()` assumes a length-N
unit-indexed `placebo_effects` array (first n_control are control-
LOO, next n_treated are treated-LOO). Survey-jackknife fits return
a flat PSU-level replicate array of variable length; joining onto
the fit-time `control_unit_ids + treated_unit_ids` would mislabel
PSU replicates as unit-level effects.

Fix: `get_loo_effects_df()` now raises `NotImplementedError` with a
targeted message pointing to `result.placebo_effects` for the raw
PSU-level array and REGISTRY §SyntheticDiD "Note (survey + jackknife
composition)" for the aggregation formula. New regression
`test_get_loo_effects_df_raises_on_survey_jackknife` asserts the
raise on a survey fit. Non-survey and pweight-only jackknife fits
continue to use `get_loo_effects_df()` as before (unit-level LOO).

P3 (Documentation — stale default variance_method note):
`docs/methodology/REGISTRY.md:L1569` default-variance-method note
rewritten to reflect that all three variance methods now support
full survey designs (removing "full design supported on bootstrap
only" language) and to recommend bootstrap specifically on surveys
with few PSUs per stratum.

Branch also rebased onto current origin/main to pick up PR #356
(agent-profile-panel) and PR #361 — the R1 Maintainability finding
about "unrelated API deletions" was a stale-base-drift artifact
(my branch was created before #356 merged). After rebase the diff
against main shows only SDID-survey changes.

Verification
------------
pytest tests/test_survey_phase5.py
       tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
→ 87 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>