
Add profile_panel() + autonomous LLM guide for agent-facing workflows #356

Merged
igerber merged 18 commits into main from agent-profile-panel
Apr 24, 2026

Conversation

@igerber (Owner) commented Apr 24, 2026

Summary

  • Adds diff_diff.profile_panel(df, *, unit, time, treatment, outcome) returning a frozen PanelProfile dataclass of structural facts (panel balance, treatment-type classification, cohort structure, outcome characteristics, factual Alert tuple). .to_dict() returns a JSON-serializable view. Descriptive only — alerts never name an estimator.
  • Adds get_llm_guide("autonomous") variant at diff_diff/guides/llms-autonomous.txt. Reference-shaped (distinct from the workflow-prose "practitioner" variant): PanelProfile field reference (§2), 17-estimator × 9-design-feature support matrix (§3), per-design-feature reasoning (§4), post-fit validation index (§5), BR/DR schema reference (§6), citations (§7), intentional-omissions disclaimer (§8).
  • Prepends a "For AI agents:" entry block to the diff_diff/__init__.py module docstring so help(diff_diff) leads with the four-step sequence (describe / consult reference / follow workflow / report).

Both the utility and the guide are bundled inside the wheel and sdist via the existing diff_diff/guides/*.txt glob, so agents can access them at runtime without a GitHub or ReadTheDocs dependency.
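
A minimal usage sketch of the two surfaces this PR adds; the DataFrame and its column names are illustrative, not part of the API, and the snippet has not been run against this branch.

```python
import pandas as pd
import diff_diff

# Tiny illustrative panel: two units, two periods, one unit treated in the
# second period. Column names here are arbitrary.
df = pd.DataFrame({
    "id":      [1, 1, 2, 2],
    "year":    [2019, 2020, 2019, 2020],
    "treated": [0, 1, 0, 0],
    "y":       [1.0, 1.4, 0.9, 1.1],
})

# Pre-fit describe step: structural facts only, no estimator recommendation.
profile = diff_diff.profile_panel(
    df, unit="id", time="year", treatment="treated", outcome="y"
)
print(profile.to_dict())  # JSON-serializable view of the frozen PanelProfile

# Estimator-choice reference bundled with the package.
print(diff_diff.get_llm_guide("autonomous")[:400])
```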

Test plan

  • pytest tests/test_profile_panel.py -v — 21 unit tests covering panel shapes (balanced / unbalanced / staggered / never-treated / always-treated / degenerate all-zero and all-one / all-NaN treatment), treatment-type classification, alerts, frozen-immutability, JSON round-trip, top-level import surface, factual-not-recommender message check.
  • pytest tests/test_guides.py -v — auto-parametrized variant loader now covers "autonomous"; content-stability assertions check for profile_panel + estimator-support matrix substrings and an intact markdown table (pipe-count ≥ 11).
  • black diff_diff tests && ruff check diff_diff/profile.py tests/test_profile_panel.py && mypy diff_diff/profile.py — all clean.
  • python -c "import diff_diff; help(diff_diff)" — agent-oriented block appears at the top.
  • python -c "import diff_diff; print(diff_diff.get_llm_guide('autonomous')[:400])" — bundled guide loads.
  • python -m build --sdist && tar -tzf dist/diff_diff-*.tar.gz | grep -E "llms-autonomous|profile.py" — both files ship in the sdist.

Design notes

  • No deterministic recommender API. The utility reports facts; the guide enumerates trade-offs per design feature; estimator selection is the agent's responsibility. Verb discipline: profile_* / describe_* / detect_* only — never recommend_* / choose_* / select_*.
  • profile_panel v1 scope is core only (panel structure, treatment variation, outcome characteristics, alerts). Covariate profiling, survey-design profiling (weights / strata / PSU), and duplicate-key detection are deferred to follow-on PRs.
  • Both surfaces pair with the existing practitioner_next_steps() and llms-practitioner.txt workflow guide — this PR covers the pre-fit describe step and the estimator-choice reference; the practitioner variant remains the canonical post-fit validation workflow.

🤖 Generated with Claude Code

@igerber force-pushed the agent-profile-panel branch from c1f141a to d4a3dc1 on April 24, 2026 at 11:44
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: d4a3dc1af238e35478bc8d665ec65aa9cc2edc3c


Overall Assessment

⚠️ Needs changes

The PR adds a new agent-facing methodology surface, but the new autonomous guide is not consistent with docs/methodology/REGISTRY.md / estimator docstrings, and profile_panel() has silent structural-classification bugs on realistic edge cases. Static review only: I could not run the new tests here because the local Python environment is missing pandas.

Executive Summary

  • The new autonomous guide misstates estimator support for SyntheticDiD, TROP, and HeterogeneousAdoptionDiD.
  • The guide also contains API strings that do not exist as written, including control_group="notyettreated" and Hausman.hausman_pretest.
  • profile_panel() can report duplicated unit-time panels as balanced because it uses raw row count instead of unique (unit, time) support.
  • profile_panel()’s monotonicity check includes NaN, so a 1 -> NaN -> 0 treatment path can be mislabeled binary_absorbing.
  • profile_panel() suppresses has_never_treated on continuous-dose panels even when zero-dose control units exist.
  • The new tests mostly check guide fingerprints/table shape and miss the edge cases above.

Methodology

  • P1 — diff_diff/guides/llms-autonomous.txt:L175-L215,L247-L338 is not faithful to the registry/docstrings for multiple affected methods. SyntheticDiD is marked staggered=partial, but the registry/code require block treatment and raise on within-unit treatment variation (docs/methodology/REGISTRY.md:L1421-L1425,L1542-L1544, diff_diff/synthetic_did.py:L376-L388). TROP is described as needing never-treated units and partial covariate support, but the local method uses all untreated-at-t units and fit() has no covariate surface (docs/methodology/REGISTRY.md:L1911-L1923,L1930-L1933,L2042-L2046, diff_diff/trop.py:L370-L412, diff_diff/trop_local.py:L388-L412). Impact: the new autonomous workflow can reject valid estimators or suggest unsupported ones. Concrete fix: rebuild the SyntheticDiD / TROP rows and footnotes directly from REGISTRY.md.
  • P1 — diff_diff/guides/llms-autonomous.txt:L193-L215,L201-L207,L287-L288,L416-L417 also misstates HeterogeneousAdoptionDiD and public API names. The guide marks HAD as supporting covariate adjustment / clustered SE, but Appendix B.1 covariates are explicitly future work and estimator-level cluster= is ignored on the continuous paths (docs/methodology/REGISTRY.md:L2172-L2173,L2231-L2239, diff_diff/had.py:L1832-L1837,L2363-L2370). It also uses control_group="notyettreated" instead of not_yet_treated (diff_diff/staggered.py:L139-L142,L304-L324, diff_diff/sun_abraham.py:L333-L336,L424-L434) and references Hausman.hausman_pretest instead of EfficientDiD.hausman_pretest (diff_diff/efficient_did.py:L1534-L1572). Impact: guide-driven code can fail at runtime or misstate identification/inference. Concrete fix: correct the capability flags and API tokens, then add a guide smoke test that resolves every referenced enum/classmethod against the public API (sketched below).
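
One minimal shape such a guide smoke test could take (the test eventually shipped in this thread is named test_guide_api_strings_resolve_against_public_api; the regex and scope below are assumptions for illustration):

```python
import re

import diff_diff

def test_guide_estimator_tokens_resolve():
    guide = diff_diff.get_llm_guide("autonomous")
    # Any CamelCase token the guide uses as "<Name>.fit(" or
    # "<Name>.hausman_pretest" must exist as a public diff_diff attribute,
    # which would catch tokens like "Hausman.hausman_pretest".
    tokens = set(re.findall(r"\b([A-Z][A-Za-z0-9]+)\.(?:fit\(|hausman_pretest)", guide))
    missing = sorted(t for t in tokens if not hasattr(diff_diff, t))
    assert not missing, f"Guide references unknown API names: {missing}"
```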

Code Quality

  • P1 — profile_panel() derives is_balanced and observation_coverage from raw row count alone (diff_diff/profile.py:L199-L205), while the guide describes these as support-level panel structure (diff_diff/guides/llms-autonomous.txt:L64-L70). With duplicated (unit, time) keys, a panel can report is_balanced=True despite a missing cell, and observation_coverage can exceed the documented [0,1] range. Impact: silent incorrect structural facts in a new public API. Concrete fix: detect duplicate (unit, time) keys explicitly, compute balance/coverage from unique support, and add a duplicate-key regression test (see the sketch after this list).
  • P1 — _classify_treatment() promises monotonicity over the observed non-NaN treatment path (diff_diff/profile.py:L175-L182) but diffs the raw sorted series including NaN (diff_diff/profile.py:L315-L320). A path like 0, 1, NaN, 0 will therefore be classified binary_absorbing instead of binary_non_absorbing. Impact: reversible-treatment panels can be silently routed toward absorbing-only estimators. Concrete fix: drop NaN before the monotonicity check and add a test for 1 -> NaN -> 0.
  • P1 — profile_panel() hardcodes has_never_treated=False for continuous numeric treatment because continuous inputs exit early from _classify_treatment() (diff_diff/profile.py:L305-L313), but the public field definition is generic (diff_diff/guides/llms-autonomous.txt:L104-L107) and ContinuousDiD exposes control_group="never_treated" (diff_diff/continuous_did.py:L59-L60). Impact: continuous-dose panels with zero-dose controls lose a load-bearing control-pool fact. Concrete fix: compute has_never_treated across numeric treatment types, or narrow the documented field contract to binary-only semantics and update the guide/tests consistently.
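
A hedged sketch of the first two fixes above, assuming a pandas representation; this is not the shipped diff_diff/profile.py code:

```python
import numpy as np
import pandas as pd

def unique_support_balance(df: pd.DataFrame, unit: str, time: str):
    """Balance and coverage from the unique (unit, time) support, not raw rows."""
    keys = df[[unit, time]]
    has_duplicate_keys = bool(keys.duplicated().any())
    support = keys.drop_duplicates()
    n_cells = support[unit].nunique() * support[time].nunique()
    coverage = len(support) / n_cells if n_cells else 0.0  # bounded above by 1
    return len(support) == n_cells, coverage, has_duplicate_keys

def is_absorbing(treatment_path: pd.Series) -> bool:
    """Monotone non-decreasing check on the non-NaN path, so 0, 1, NaN, 0 fails."""
    vals = treatment_path.dropna().to_numpy()
    return bool(np.all(np.diff(vals) >= 0))
```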

Performance

No findings.

Maintainability

No additional findings beyond the guide/registry drift already noted.

Tech Debt

  • P3 — The repo already tracks weak CI coverage for .txt AI guides in TODO.md:L127-L128. The current drift is a concrete example of that debt. Impact: informational only because it is already tracked. Concrete fix: either keep it tracked or close it with semantic guide assertions.

Security

No findings.

Documentation/Tests

  • P2 — The new tests do not cover the failing cases above: tests/test_guides.py:L37-L50 only checks fingerprints/table shape, and tests/test_profile_panel.py:L90-L101,L285-L295 do not exercise duplicate keys, 1 -> NaN -> 0 reversals, or continuous zero-dose controls. Impact: the current P1 issues have no regression protection. Concrete fix: add those exact cases to tests/test_guides.py / tests/test_profile_panel.py.

Path to Approval

  1. Reconcile diff_diff/guides/llms-autonomous.txt with docs/methodology/REGISTRY.md and public APIs: fix the SyntheticDiD, TROP, and HeterogeneousAdoptionDiD rows/notes, change notyettreated to not_yet_treated, and replace Hausman.hausman_pretest with EfficientDiD.hausman_pretest.
  2. Harden profile_panel() against duplicated (unit, time) keys: reject or alert on duplicates, and compute balance/coverage from unique support so observation_coverage <= 1.
  3. Fix treatment/control edge cases in profile_panel(): ignore NaN when checking absorbing monotonicity, and compute has_never_treated correctly for continuous zero-dose panels (or narrow the public contract consistently).
  4. Add regression tests for the corrected guide/API references and the new profile_panel() edge cases so this surface cannot drift again.

igerber added a commit that referenced this pull request Apr 24, 2026
profile_panel():
- Balance / coverage: compute from unique (unit, time) support rather
  than raw row count. Duplicated rows no longer inflate coverage above
  1 or mask missing cells. Adds duplicate_unit_time_rows warn alert.
- Non-absorbing detection: drop NaN before the monotonicity diff so a
  path like [0, 1, NaN, 0] is correctly classified binary_non_absorbing
  rather than binary_absorbing.
- has_never_treated / has_always_treated: compute generically across
  numeric treatment types (not only binary). Continuous panels with
  zero-dose control units now surface has_never_treated=True; docstring
  and guide field reference updated.

llms-autonomous.txt:
- SyntheticDiD: staggered changed from "partial" to "strict block"
  (fit raises on within-unit variation); footnote rewritten.
- TROP: staggered support marked (via absorbing D matrix); never-treated
  no longer required (donor pool is every unit untreated at period t);
  fit() covariate surface removed (there is none).
- HeterogeneousAdoptionDiD: covariate adjustment changed to "not
  supported" (Appendix B.1 future work per REGISTRY.md); clustered SE
  annotated "warn" on continuous paths (cluster= kwarg is ignored with
  UserWarning, CR1 only on mass-point path).
- Replace "notyettreated" with "not_yet_treated" throughout (matches
  staggered.py / sun_abraham.py public API).
- Replace "Hausman.hausman_pretest" with "EfficientDiD.hausman_pretest"
  (the classmethod lives on EfficientDiD, not a Hausman namespace).

Tests:
- test_duplicate_unit_time_rows_do_not_inflate_coverage
- test_reversal_through_nan_is_binary_non_absorbing
- test_continuous_zero_dose_controls_flag_has_never_treated
- test_guide_api_strings_resolve_against_public_api (verifies every
  estimator name, hausman_pretest path, and control_group spelling in
  the bundled guide resolves against diff_diff's public API; guards
  against future drift)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a506c0a55ad850043a77339d32716203887ea344


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review issues around the autonomous guide’s SyntheticDiD/TROP/HeterogeneousAdoptionDiD capability rows, the bad API tokens, and the earlier profile_panel() duplicate/NaN/continuous-control bugs appear addressed.
  • The autonomous guide still misstates HeterogeneousAdoptionDiD’s estimand and output shape: the method targets WAS / WAS_d_lower, not ATT, and the event-study path is per-event-time, not “per-cohort.”
  • The same guide still underspecifies EfficientDiD when there is no never-treated group: control_group="last_cohort" is a supported path, but §4.4 steers agents only to PT-Post.
  • profile_panel() can still emit impossible structural facts when unit or time contains missing IDs because its numerator and denominator treat NaNs inconsistently; observation_coverage can exceed the documented [0,1] range.
  • Static review only: I could not run the new tests locally because this environment lacks pytest and pandas.

Methodology

  • Severity P1. Impact: diff_diff/guides/llms-autonomous.txt:L375-L378 says HeterogeneousAdoptionDiD “returns an aggregated ATT plus a per-cohort Pierce-Schott-style validation,” but the registry defines the target as WAS / WAS_d_lower (docs/methodology/REGISTRY.md:L2189-L2200), and the estimator docs/results class describe a single-period WAS or per-event-time WAS outputs, not an ATT nor a per-cohort validation surface (diff_diff/had.py:L210-L245, diff_diff/had.py:L1757-L1770, diff_diff/had.py:L2014-L2020). This can make agent-facing reports mislabel the estimand and look for nonexistent outputs. Concrete fix: rewrite §4.9 to use WAS terminology and distinguish aggregate="overall" from aggregate="event_study" plus did_had_pretest_workflow().
  • Severity P1. Impact: diff_diff/guides/llms-autonomous.txt:L211-L213 and L307-L310 imply that, without never-treated units, EfficientDiD should fall back to PT-Post. The implementation and registry also support control_group="last_cohort" as a pseudo-control path (diff_diff/efficient_did.py:L161-L167, L459-L483; docs/methodology/REGISTRY.md:L928-L929). The guide can therefore incorrectly rule out a supported identification strategy on all-treated staggered panels. Concrete fix: update the footnote and §4.4 to mention control_group="last_cohort" explicitly and avoid presenting PT-Post as the only alternative.

Code Quality

  • Severity P1. Impact: diff_diff/profile.py:L213-L219 computes n_units/n_periods with nunique() (dropping NaNs) but computes unique (unit,time) support with drop_duplicates() (still counting NaN keys). That can produce observation_coverage > 1 and violate the public [0,1] contract documented in diff_diff/profile.py:L201-L205 and diff_diff/guides/llms-autonomous.txt:L71-L74. This is a silent correctness issue in a new public API whenever filtered data contain missing IDs. Concrete fix: reject missing unit/time values up front, or compute both numerator and denominator on the same non-missing key subset and surface a dedicated alert.

Performance

  • No findings.

Maintainability

  • No additional findings.

Tech Debt

  • Severity P3. Impact: the repo already tracks weak CI coverage for .txt AI guides in TODO.md:L128. That remains informational only. Concrete fix: keep that item open until the guide gains semantic assertions, not just fingerprint/table-shape checks.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: tests/test_guides.py:L37-L50 only validates guide fingerprints/table presence, and tests/test_profile_panel.py:L321-L414 covers the prior duplicate/NaN/continuous-control regressions but not missing unit/time IDs. The current P1 guide/profile issues therefore have no regression protection. Concrete fix: add tests that assert HAD is described as WAS / WAS_d_lower, that the EfficientDiD no-never-treated guidance mentions control_group="last_cohort", and that profile_panel() rejects or consistently handles missing identifier rows.

Path to Approval

  1. Fix the HAD section of llms-autonomous.txt so it describes WAS / WAS_d_lower and the actual overall/event-study outputs instead of ATT/per-cohort validation.
  2. Fix the EfficientDiD no-never-treated guidance to include the supported control_group="last_cohort" path.
  3. Harden profile_panel() against missing unit/time IDs so observation_coverage cannot exceed 1 and balance metrics remain well-defined.
  4. Add regression tests for the corrected guide semantics and the missing-identifier profile_panel() case.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide:
- Rewrite HAD section (§4.9, §3 footnotes) to use WAS / WAS_d_lower
  terminology (paper Equation 2 / Theorem 1). HAD does not target ATT
  and its event-study output is per-event-time, not per-cohort. The
  fitted target_parameter attribute is literally "WAS" for Design 1'
  and "WAS_d_lower" for Design 1 / Assumption 6.
- Add control_group="last_cohort" to the EfficientDiD guidance in §4.4
  (no-never-treated paths) and the corresponding footnote. Previously
  the guide steered agents only toward assumption="PT-Post" when a
  never-treated cohort was absent, ruling out the supported pseudo-
  control path. Distinct from CallawaySantAnna's "not_yet_treated".

profile_panel():
- Missing identifier rows are dropped up front when unit or time
  contains NaN. Previously n_units / n_periods used nunique() (NaN-
  dropping) while the unique (unit, time) support used
  drop_duplicates() (NaN-preserving), producing observation_coverage
  > 1 on filtered data with missing IDs. All structural facts now
  compute on the non-missing subset; a new missing_id_rows_dropped
  warn alert surfaces the drop count so the behavior is not silent.

Tests:
- test_missing_unit_or_time_ids_are_dropped_consistently asserts
  observation_coverage stays in [0, 1] and the alert fires with the
  correct drop count.
- test_guide_api_strings_resolve_against_public_api extended with
  assertions for WAS / WAS_d_lower phrasing, the absence of the old
  "per-cohort Pierce-Schott" wording, and control_group="last_cohort".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ba063b112c5fedf25b46818787eeedff736aae13


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review issues around HAD WAS wording, EfficientDiD(control_group="last_cohort"), and profile_panel()’s observation_coverage > 1 bug appear addressed.
  • The new autonomous guide still has a methodology mismatch: it marks SunAbraham as not requiring never-treated units, but both the registry and the fit path still require them.
  • The HAD guide section now uses WAS/WAS_d_lower, but it overstates what the Phase 3 diagnostics establish by saying they validate Assumptions 3 and 7; Assumption 3 is explicitly untestable in the registry.
  • profile_panel() still lacks empty-data handling after dropping missing ID rows (and on direct empty input), so it can return a “balanced” profile with zero valid observations.
  • Static review only: this environment does not have pytest or pandas, so I could not execute the added tests.

Methodology

  • Severity P1. Location: diff_diff/guides/llms-autonomous.txt:L187-L188, docs/methodology/REGISTRY.md:L939-L942, docs/methodology/REGISTRY.md:L991-L993, diff_diff/sun_abraham.py:L563-L567. Impact: the support matrix says SunAbraham does not require never-treated units, but the registry still documents never-treated controls as required and the implementation raises when none exist before it branches on control_group. An agent following the matrix can select an estimator the library will reject on all-treated staggered panels. Concrete fix: change the matrix cell to show that never-treated units are required for SunAbraham, and add a semantic guide test for that contract.
  • Severity P1. Location: diff_diff/guides/llms-autonomous.txt:L391-L394, docs/methodology/REGISTRY.md:L2178-L2181, docs/methodology/REGISTRY.md:L2328-L2336, diff_diff/had_pretests.py:L1-L50. Impact: the guide says the HAD Phase 3 diagnostics “validate Assumptions 3 and 7,” but the registry is explicit that Assumption 3 (uniform continuity / no extensive-margin jump) is not testable. The shipped diagnostics cover QUG, Assumption 7 pre-trends on the event-study path, and linearity/TWFE suitability; they do not validate Assumption 3. This can lead agent-facing reports to overclaim identification support. Concrete fix: rewrite §4.9 so it limits the diagnostic claim to QUG, Assumption 7 pre-trends, and linearity/TWFE checks, with Assumption 3 left explicitly untestable.

Code Quality

  • Severity P1. Location: diff_diff/profile.py:L214-L225. Impact: after dropping rows with missing unit/time, profile_panel() keeps going even when no valid rows remain. That yields n_units = 0, n_periods = 0, observation_coverage = 0.0, and is_balanced = True (0 == 0) for an empty sample. Per the review checklist, this is missing empty-result handling in a new code path and can silently hand downstream agents a nonexistent-but-“balanced” panel. Concrete fix: add a front-door empty-panel guard after the ID drop (and for direct-empty input), preferably raising ValueError with the row counts removed; add regression tests for both df.empty and empty-after-drop.
  • Severity P2. Location: diff_diff/profile.py:L214-L216. Impact: n_rows_with_missing_id is computed as unit.isna().sum() + time.isna().sum(), so a row with both IDs missing is counted twice even though only one row is dropped. The missing_id_rows_dropped alert can therefore overstate the number of dropped rows and violate the “factual observations” contract of the new API. Concrete fix: count dropped rows with a row-wise isna().any(axis=1) mask, and add a regression test where one row has both unit and time missing.
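
A sketch of the missing-identifier handling proposed in the two items above (illustrative only, not the shipped implementation):

```python
import pandas as pd

def drop_missing_ids(df: pd.DataFrame, unit: str, time: str):
    """Drop rows with a missing unit or time id, counting each dropped row once."""
    missing = df[[unit, time]].isna().any(axis=1)  # a row missing both ids counts once
    n_dropped = int(missing.sum())
    kept = df.loc[~missing]
    if kept.empty:
        raise ValueError(
            f"No rows with non-missing unit/time ids remain ({n_dropped} rows dropped)."
        )
    return kept, n_dropped
```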

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Location: tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L374-L446, TODO.md:L128. Impact: the guide tests are still mostly fingerprint/shape checks, which is why the remaining guide-semantic mismatches slipped through. This is already tracked as AI-guide validation debt, so it is informational rather than blocking. Concrete fix: keep the TODO open; when fixing the guide, add assertions for SunAbraham’s never-treated requirement and the HAD diagnostic claim.

Security

No findings.

Documentation/Tests

No separate unmitigated findings beyond the tracked guide-validation gap above. Static review only: pytest and pandas were unavailable here, so I could not run the newly added tests.

Path to Approval

  1. Fix the autonomous-guide matrix so SunAbraham is marked as requiring never-treated units, and align any nearby prose with that contract.
  2. Rewrite HAD §4.9 so the diagnostic discussion does not claim Assumption 3 is testable; limit it to QUG, Assumption 7 pre-trends, and linearity/TWFE checks.
  3. Add empty-panel handling to profile_panel() for direct-empty input and empty-after-ID-drop input, so it cannot return is_balanced=True on zero valid rows.
  4. Fix missing_id_rows_dropped to count dropped rows rather than missing ID cells, and add regression tests for both the empty-after-drop case and a row with both IDs missing.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide:
- SunAbraham matrix row: never-treated-required changed from ✗ to ✓.
  The fit path raises ValueError when no never-treated units exist
  (sun_abraham.py:566-567); the registry documents this requirement.
  §4.3 prose updated to state the requirement explicitly.
- HAD §4.9 (and §3 Phase 3 description): correct the diagnostic-claim
  wording. The pre-test battery validates the QUG null (step 1),
  Assumption 7 pre-trends (step 2, event-study only), and linearity
  of E[ΔY|D_2] (step 3). Assumption 3 (uniform continuity / no
  extensive-margin jump) is explicitly "not testable" per REGISTRY.md
  L2181, so the guide must not claim Phase 3 validates it.

profile_panel():
- Empty-panel guard: raise ValueError when the input DataFrame is
  empty (direct) or when no rows remain after dropping missing
  (unit, time) identifiers. Previously both cases returned
  is_balanced=True / observation_coverage=0 on zero valid rows.
- Fix missing-id counting: n_rows_with_missing_id was the SUM of
  unit-NaN and time-NaN (double-counting rows missing both). Now
  computed from a row-wise isna().any(axis=1) mask so every dropped
  row is counted exactly once.

Tests:
- test_row_with_both_ids_missing_counted_once
- test_empty_dataframe_raises_value_error
- test_empty_after_id_drop_raises_value_error
- test_guide_api_strings_resolve_against_public_api extended:
  SunAbraham matrix-row cell assertion (never-treated-required=✓),
  "Assumption 3" presence, absence of "validate Assumptions 3 and 7",
  presence of "not testable".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: c7ef29a0c4864c4253047a6c3ad2f2367ee15972


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior re-review items on SunAbraham’s never-treated requirement, the HAD Assumption 3 wording, empty-after-drop handling in profile_panel(), and double-counted missing-ID rows appear resolved.
  • The autonomous guide still misstates the EfficientDiD contract: it says assumption="PT-Post" removes the never-treated requirement, but the registry and fit path still require actual never-treated units unless control_group="last_cohort" is used.
  • The support matrix still has wrong covariate-adjustment cells: SyntheticDiD is marked unsupported even though fit(..., covariates=...) residualizes outcomes, while ContinuousDiD is marked supported even though covariates are still deferred.
  • profile_panel() overstates min_pre_periods / min_post_periods on unbalanced panels by counting global calendar periods rather than observed cohort support, which can suppress the short-panel alerts.
  • §5 of the autonomous guide has a few runtime-facing API signature mistakes (compute_pretrends_power, plot_sensitivity, plot_honest_event_study).
  • Static review only: this environment lacked pytest and core deps like numpy, so I could not run the added tests or import the package.

Methodology

  • Severity P1. Location: diff_diff/guides/llms-autonomous.txt:L107-L110, diff_diff/guides/llms-autonomous.txt:L212-L218, diff_diff/guides/llms-autonomous.txt:L313-L317, docs/methodology/REGISTRY.md:L756-L758, docs/methodology/REGISTRY.md:L870-L871, diff_diff/efficient_did.py:L161-L167, diff_diff/efficient_did.py:L458-L484. Impact: the new guide tells agents that assumption="PT-Post" drops the never-treated requirement. That is not the current library contract: PT-Post still uses never-treated comparisons, and all-treated panels only become admissible through control_group="last_cohort". An agent following the guide can select EfficientDiD on an all-eventually-treated panel and hit a front-door ValueError, or misstate the identifying setup in its report. Concrete fix: rewrite the has_never_treated field description, the matrix footnote, and §4.4 so they say EfficientDiD needs never-treated controls under both PT variants unless control_group="last_cohort" is explicitly chosen; add a semantic guide test for that contract.
  • Severity P1. Location: diff_diff/guides/llms-autonomous.txt:L195-L199, diff_diff/synthetic_did.py:L213-L239, diff_diff/synthetic_did.py:L425-L436, docs/methodology/REGISTRY.md:L1549-L1551, docs/methodology/REGISTRY.md:L736-L737, diff_diff/continuous_did.py:L159-L169. Impact: the support matrix mislabels covariate support in both directions: SyntheticDiD is shown as ✗ even though the public API accepts covariates= and residualizes the outcome, while ContinuousDiD is shown as ✓ even though the registry still lists covariates as deferred and the public fit() surface has no covariate argument. That will wrongly eliminate SDiD from valid workflows and advertise a nonexistent ContinuousDiD feature. Concrete fix: change the SyntheticDiD covariate-adjustment cell to ✓ (or partial with a footnote if you want to distinguish residualization from DR/IPW-style adjustment) and change ContinuousDiD to ✗; add a semantic matrix test that checks these rows against the public fit signatures / registry checklist.

Code Quality

  • Severity P2. Location: diff_diff/profile.py:L409-L431, diff_diff/profile.py:L541-L565, diff_diff/guides/llms-autonomous.txt:L124-L129. Impact: min_pre_periods and min_post_periods are computed from the global set of panel periods, not from the treated cohorts’ observed support. On unbalanced panels this can overstate available pre/post exposure and suppress short_pre_panel / short_post_panel, even though the guide presents those fields as factual cohort-level support. Concrete fix: compute pre/post counts from observed (unit, time) support for the treated cohorts after deduplicating keys, and add a regression test where treated cohorts are missing early pre-periods or late post-periods.
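
A sketch of the per-treated-unit reading of this fix (the thread later debates unit-level versus cohort-level semantics); column handling and the treatment > 0 convention are assumptions:

```python
import pandas as pd

def min_pre_post(df: pd.DataFrame, unit: str, time: str, treatment: str):
    """Min pre/post period counts over treated units' observed (unit, time) support."""
    support = df[[unit, time, treatment]].drop_duplicates(subset=[unit, time])
    pre, post = [], []
    for _, g in support.groupby(unit):
        treated_times = g.loc[g[treatment] > 0, time]
        if treated_times.empty:
            continue  # never-treated units do not bound treated exposure
        first = treated_times.min()
        pre.append(int((g[time] < first).sum()))
        post.append(int((g[time] >= first).sum()))
    return (min(pre) if pre else 0, min(post) if post else 0)
```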

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Location: tests/test_guides.py:L37-L50, TODO.md:L128-L128. Impact: the current guide tests still only assert fingerprints / table shape, which is why the remaining semantic matrix drift was not caught. This is already tracked as .txt AI guides being outside stronger CI validation, so it is informational rather than blocking. Concrete fix: keep the TODO open; when fixing the guide, add semantic assertions for the EfficientDiD no-never-treated contract and the SyntheticDiD / ContinuousDiD covariate-support cells.

Security

No findings.

Documentation/Tests

  • Severity P2. Location: diff_diff/guides/llms-autonomous.txt:L430-L438, diff_diff/guides/llms-autonomous.txt:L481-L482, diff_diff/pretrends.py:L1048-L1055, diff_diff/visualization/_diagnostic.py:L12-L14, diff_diff/visualization/_event_study.py:L787-L789. Impact: §5 gives several wrong runtime signatures for agent use: compute_pretrends_power takes a fitted results object, not raw df; plot_sensitivity takes SensitivityResults; plot_honest_event_study takes HonestDiDResults. Agents using the new reference verbatim will call the wrong APIs and fail at runtime. Concrete fix: update those entries to the actual first-argument types and add a semantic guide test that checks the guide text against the public function signatures.

Path to Approval

  1. Fix the autonomous guide’s EfficientDiD wording so PT-Post does not claim to remove the never-treated requirement; document control_group="last_cohort" as the supported no-never-treated path, and add a regression test for that contract.
  2. Correct the support matrix’s covariate-adjustment cells for SyntheticDiD and ContinuousDiD, and add a semantic regression test that checks those rows against the public API / methodology registry.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide:
- Matrix covariate cells corrected. SyntheticDiD column 6 flipped
  ✗ → ✓: fit() accepts covariates= and residualizes the outcome
  (synthetic_did.py:213-239, 425-436). ContinuousDiD column 6 flipped
  ✓ → ✗: the fit() signature has no covariate argument and REGISTRY
  lists covariate adjustment as deferred.
- EfficientDiD wording corrected. The registry says PT-Post still uses
  never-treated as the comparison group (REGISTRY.md L756-758);
  "switch to PT-Post" does not drop the requirement. Rewrote
  has_never_treated field description, the §3 footnote, and §4.4
  prose to state that PT-All and PT-Post both require never-treated
  units and control_group="last_cohort" is the only supported path
  for an all-eventually-treated panel.

profile_panel():
- _compute_pre_post now uses each treated unit's observed (unit, time)
  support rather than the global panel period set. On unbalanced
  panels this correctly captures the least-supported treated unit's
  pre/post exposure and fires short_pre_panel / short_post_panel when
  appropriate; the previous global-periods approach could overstate
  exposure and suppress the alert.

Guide §5 API signatures corrected:
- compute_pretrends_power takes a fitted results object, not a raw
  DataFrame (pretrends.py:1048).
- plot_sensitivity takes a SensitivityResults object
  (visualization/_diagnostic.py:12-14).
- plot_honest_event_study takes a HonestDiDResults object
  (visualization/_event_study.py:9).

Tests:
- test_min_pre_post_use_per_unit_observed_support (unbalanced treated
  unit missing early pre-period fires short_pre_panel).
- test_guide_api_strings_resolve_against_public_api extended with
  SyntheticDiD/ContinuousDiD covariate-cell matrix assertions,
  EfficientDiD both-PT-variants-require-never-treated wording
  assertion, and §5 API-signature fingerprints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 3590ddd75a615e34301f334c067d9a66b6c3d028


Overall Assessment

Looks good

Executive Summary

  • The prior P1 methodology findings from the last re-review appear resolved: the autonomous guide now matches source on EfficientDiD’s never-treated contract, SunAbraham’s never-treated requirement, the SyntheticDiD/ContinuousDiD covariate-support cells, HAD Assumption 3 wording, and the §5 function signatures.
  • profile_panel() also closes the earlier empty-after-drop, missing-ID, duplicate-support, NaN-through-monotonicity, and min-pre/min-post bugs, with targeted regression coverage.
  • Remaining issue: §6 of diff_diff/guides/llms-autonomous.txt does not match the actual BusinessReport / DiagnosticReport to_dict() schema, so an agent using it as a parsing contract will look for keys that do not exist.
  • Remaining issue: min_pre_periods / min_post_periods are now implemented and tested as least-supported treated-unit counts, but the guide and emitted alert text still describe them as cohort-level counts.
  • Static review only: this environment lacks pytest, numpy, and pandas, so I could not run the added tests or import the package.

Methodology

  • No unmitigated findings. The prior methodology issues in diff_diff/guides/llms-autonomous.txt appear addressed, and the new semantic checks in tests/test_profile_panel.py:L374-L480 cover the corrected contracts.

Code Quality

  • Severity P2. Impact: profile_panel() now computes min_pre_periods / min_post_periods from the least-supported treated unit, but both the guide and the runtime alert text still say these are “across cohorts.” That makes the public contract misleading: agents can read a unit-level missingness signal as a cohort-level design limitation. Concrete fix: either change the prose to “across treated units” everywhere, or change the computation back to genuine cohort-level support if that is the intended contract. Locations: diff_diff/guides/llms-autonomous.txt:L126-L130, diff_diff/profile.py:L417-L443, diff_diff/profile.py:L553-L576, tests/test_profile_panel.py:L483-L500.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: .txt AI guides are still only lightly validated in CI, so semantic drift in agent-facing guides remains easy to miss. Concrete fix: keep the existing TODO open and extend semantic tests to cover the BR/DR schema section, not just estimator names, matrix cells, and signature snippets. Locations: TODO.md:L121-L129, tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L374-L480.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: §6 of the autonomous guide misdocuments the actual BR/DR to_dict() schema. It advertises BusinessReport keys such as assumptions, pretrends, main_result, and sample_summary, but BusinessReport.to_dict() actually returns keys like assumption, pre_trends, sample, diagnostics, and references. It also says DiagnosticReport has estimator_type plus a nested checks dict, while the implementation returns estimator and separate top-level sections such as parallel_trends, pretrends_power, sensitivity, bacon, and estimator_native_diagnostics. An autonomous agent following the guide as a literal parsing contract will fail on real output. Concrete fix: rewrite §6 to match the actual schemas emitted by BusinessReport.to_dict() and DiagnosticReport.to_dict(), then add a semantic test that compares the documented key set to the real builders. Locations: diff_diff/guides/llms-autonomous.txt:L513-L545, diff_diff/business_report.py:L524-L549, diff_diff/diagnostic_report.py:L1013-L1031.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide §6 rewritten to match the actual BR/DR to_dict() schemas emitted
at diff_diff/business_report.py:524-549 and diff_diff/diagnostic_report.py:
1013-1031. BR top-level keys are "assumption" (singular), "pre_trends"
(underscored), "sample" (bare); the new documentation also adds
"estimator", "context", "heterogeneity", "diagnostics", "next_steps",
"references". DR keys are "estimator" (string), "headline_metric",
"estimator_native_diagnostics", "skipped", "warnings",
"overall_interpretation", "next_steps"; the old "checks: dict" nesting
never existed and "estimator_type" was never the key name. The
"forthcoming schema additions" block is dropped since it documented an
un-shipped surface.

Prose + alert wording aligned with the per-treated-unit implementation:
- Guide §2: min_pre_periods / min_post_periods described as "across
  treated units" (each unit's observed (unit, time) support counted
  independently on unbalanced panels).
- profile.py: short_pre_panel / short_post_panel alert messages use
  "across treated units" to match the computation.

Tests:
- test_guide_api_strings_resolve_against_public_api extended with
  BR/DR §6 schema assertions. Guards every real top-level key
  documented (assumption, pre_trends, sample, headline_metric,
  estimator_native_diagnostics, overall_interpretation) and forbids
  the previously documented obsolete keys (assumptions, pretrends,
  main_result, sample_summary, estimator_type, checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 38a8f6f6dcb071f6f626572a7c3f887019debe2b


Overall Assessment

Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The prior re-review issues on min_pre_periods / min_post_periods semantics and the major BR/DR top-level key mismatches are mostly resolved.
  • I did not find any P0/P1 correctness issues in profile_panel(); the new tests cover the earlier empty-after-drop, duplicate-support, NaN-through-monotonicity, and per-unit pre/post-support cases.
  • Severity P2: the new autonomous guide still has methodology-reference drift. EfficientDiD is attributed to the wrong source, and the ContinuousDiD synopsis collapses the PT/SPT estimand distinction into “ACR.”
  • Severity P2: §6 still misstates the BusinessReport.to_dict() contract. diagnostics is a wrapper block, not the raw DiagnosticReport schema, and target_parameter.reference is undocumented.
  • Static review only: this environment does not have pytest, and import-time numpy is also unavailable, so I could not execute the added tests.

Methodology

  • Severity P2. Impact: ContinuousDiD is described as “estimates average causal response (ACR),” but the registry and BR/DR target-parameter dispatch distinguish ATT(d|d) / overall ATT under PT from ACRT under SPT. An autonomous agent following §4.7 can mislabel the estimand it reports. Concrete fix: rewrite the ContinuousDiD bullet to say the estimator exposes dose-response ATT and ACR outputs, with overall ATT as the headline scalar and ACR requiring the stronger SPT regime; add the CGBS 2024 citation to §7. Locations: diff_diff/guides/llms-autonomous.txt:L361-L369, docs/methodology/REGISTRY.md:L682-L706, diff_diff/_reporting_helpers.py:L332-L347.
  • Severity P2. Impact: EfficientDiD is labeled “Arkhangelsky-Imbens,” but the registry’s primary source is Chen, Sant’Anna, and Xie (2025). That breaks source-material traceability in an agent-facing methodology guide. Concrete fix: change the §4.3 attribution and add the Chen-Sant’Anna–Xie reference to §7 alongside the other estimator papers. Locations: diff_diff/guides/llms-autonomous.txt:L303-L304, docs/methodology/REGISTRY.md:L746-L749.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: guide validation is still mostly string/fingerprint based, so schema drift in .txt AI guides can survive even with the new tests. Concrete fix: keep the existing TODO open and extend the guide tests to compare §6 against the actual BR/DR builders, not just key presence. Locations: TODO.md:L121-L129, tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L482-L506.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: §6 still tells agents that BusinessReport["diagnostics"] is the nested DiagnosticReport.to_dict() payload, but the implementation wraps that payload under diagnostics["schema"] and adds wrapper fields like status and overall_interpretation; the guide also omits the target_parameter.reference field that describe_target_parameter() guarantees. Agents using the guide as a literal parsing contract will look in the wrong place or miss fields. Concrete fix: document the exact wrapper shape for diagnostics, include target_parameter.reference, and add a semantic test that compares the documented schema bullets to real BusinessReport.to_dict() output. Locations: diff_diff/guides/llms-autonomous.txt:L516-L546, diff_diff/business_report.py:L510-L549, diff_diff/_reporting_helpers.py:L34-L42, tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L482-L506.
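
An illustrative parsing sketch of the wrapper shape as described in this thread (key names are taken from the review and the follow-up commit, not verified independently against the library):

```python
from typing import Any

def read_diagnostics(report: dict[str, Any]) -> dict[str, Any]:
    """Pull the diagnostics wrapper out of a BusinessReport.to_dict() payload."""
    diag = report.get("diagnostics", {})
    out: dict[str, Any] = {"status": diag.get("status")}
    if diag.get("status") == "ran":
        out["schema"] = diag.get("schema", {})  # nested DiagnosticReport payload
        out["overall_interpretation"] = diag.get("overall_interpretation")
    else:
        out["reason"] = diag.get("reason")  # populated on the skipped path
    out["target_parameter_reference"] = report.get("target_parameter", {}).get("reference")
    return out
```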

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
Guide methodology attribution corrected:
- EfficientDiD primary source is Chen, Sant'Anna, Xie (2025) per
  REGISTRY.md L746-749, not Arkhangelsky-Imbens. §4.3 bullet and the
  §7 citation list updated.
- ContinuousDiD synopsis (§4.7) now distinguishes the PT-identified
  targets (`ATT(d|d)`, `ATT^{loc}`) from the SPT-required targets
  (`ATT(d)`, `ACRT(d)`, `ACRT^{glob}`) per REGISTRY.md L682-706, with
  the Callaway, Goodman-Bacon, Sant'Anna (2024) citation added to §7.

Guide §6 BR schema corrections:
- `diagnostics` wrapper documented accurately: always has `status`,
  with `schema` + `overall_interpretation` on the "ran" path or
  `reason` on the "skipped" path. Previously the guide said the key
  held the DR payload directly.
- `target_parameter.reference` field documented alongside the other
  fields per `describe_target_parameter()` in
  diff_diff/_reporting_helpers.py:34-42.

Regression tests extended:
- `diagnostics["schema"]` wrapper wording guarded (and absence of
  plain "DR payload under diagnostics").
- `target_parameter.reference` key documentation guarded.
- EfficientDiD attribution: `Chen, Sant'Anna, Xie 2025` present,
  `(Arkhangelsky-Imbens)` absent.
- ContinuousDiD targets: `ATT(d|d)`, `ACRT`, `Strong Parallel Trends`
  all present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bd4e1be562f79f91e31af629cf95663748de5402


Overall Assessment

⚠️ Needs changes — one unmitigated P1 remains in the new autonomous guide’s methodology matrix.

Executive Summary

  • The prior re-review issues on EfficientDiD attribution, ContinuousDiD PT/SPT wording, and the BR/DR wrapper/reference-field schema drift look addressed.
  • P1: the new autonomous guide still misstates ContinuousDiD eligibility. It says never-treated / zero-dose controls are not required, but both REGISTRY.md and ContinuousDiD.fit() require P(D=0) > 0 because the lowest-dose-as-control path is not implemented.
  • P2: §6 of the guide still misdocuments DiagnosticReport.to_dict() section status semantics. Real sections use execution-style status="ran" plus a separate verdict; reason is optional, and additional statuses like not_applicable, not_run, and no_scalar_by_design exist.
  • P3 (tracked): guide validation is still mostly string/fingerprint based, which is why the remaining guide drift slipped through.
  • Static review only: I could not execute the added tests here because import-time numpy is unavailable in this environment.

Methodology

  • Severity P1. Impact: Method affected: ContinuousDiD. The agent-facing support matrix marks ContinuousDiD as never-treated required = ✗, and the surrounding has_never_treated / continuous-treatment prose also omits the zero-dose-control requirement. That conflicts with the Methodology Registry and the implementation, which both require P(D=0) > 0 because Remark 3.1 (lowest-dose-as-control) is not implemented. In the advertised profile_panel() -> get_llm_guide("autonomous") workflow, an agent can therefore treat an all-eventually-treated continuous-dose panel as eligible for ContinuousDiD even though fit() rejects it. Concrete fix: change the matrix cell to ✓, add ContinuousDiD to the has_never_treated guidance, and state explicitly in §4.7 that zero-dose / never-treated controls are required until Remark 3.1 is implemented. Locations: diff_diff/guides/llms-autonomous.txt:L107-L114, diff_diff/guides/llms-autonomous.txt:L187-L205, diff_diff/guides/llms-autonomous.txt:L366-L376, docs/methodology/REGISTRY.md:L714-L716, diff_diff/continuous_did.py:L349-L360, tests/test_continuous_did.py:L1279-L1295.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity P3 (tracked in TODO.md). Impact: the new guide checks are still mostly presence/fingerprint assertions, so semantic schema/methodology drift in .txt guides can survive CI. This is already tracked as non-blocking guide-validation debt. Concrete fix: keep the TODO open and extend tests to compare specific guide claims against live serializer/estimator behavior, not just substrings and table shape. Locations: TODO.md:L121-L129, tests/test_guides.py:L37-L50, tests/test_profile_panel.py:L458-L515.

Security

No findings.

Documentation/Tests

  • Severity P2. Impact: §6 says every DiagnosticReport section has statuses like pass/warn/inconclusive plus a reason, but the actual schema uses status="ran" for executed sections, keeps interpretation in a separate verdict, and also emits not_applicable, not_run, and no_scalar_by_design. An agent treating the guide as a literal parsing contract will mis-handle real BR/DR output. Concrete fix: rewrite §6 so status is documented as execution state, move qualitative interpretation to verdict, make reason optional, and list the additional statuses that actually occur. Locations: diff_diff/guides/llms-autonomous.txt:L594-L596, diff_diff/diagnostic_report.py:L890-L900, diff_diff/diagnostic_report.py:L968-L978, diff_diff/diagnostic_report.py:L1178-L1185, diff_diff/diagnostic_report.py:L1769-L1777, diff_diff/diagnostic_report.py:L1969-L2028.
  • I could not run tests/test_profile_panel.py or tests/test_guides.py in this environment because numpy is not installed, so the verification above is static-only.

Path to Approval

  1. Fix the ContinuousDiD methodology guidance in diff_diff/guides/llms-autonomous.txt: mark zero-dose / never-treated controls as required in the matrix and explain the unimplemented Remark 3.1 path in the continuous-treatment prose.
  2. Correct §6 of diff_diff/guides/llms-autonomous.txt to match the real DiagnosticReport.to_dict() contract: status="ran" for executed sections, verdict for qualitative interpretation, optional reason, and the extra statuses currently emitted.
  3. Add semantic tests that assert those two guide claims against live code paths, rather than only checking substrings/table shape.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
ContinuousDiD eligibility (P1): ContinuousDiD.fit() requires zero-dose
control units (P(D=0) > 0) because Remark 3.1 lowest-dose-as-control
is not yet implemented (REGISTRY.md L714; continuous_did.py:349-360).
Guide corrections:
- Matrix row: never-treated-required ✗ → ✓.
- has_never_treated field doc: add ContinuousDiD to the required-by
  list with the Remark 3.1 note.
- §4.4 no-never-treated paths: add ContinuousDiD bullet naming the
  ValueError and pointing at the unimplemented Remark 3.1 path.
- §4.7 ContinuousDiD bullet: prepend the "Requires zero-dose control
  units" callout.

DiagnosticReport §6 section-status semantics (P2): the real schema
uses `status` for execution state ("ran", "not_applicable", "not_run",
"no_scalar_by_design", "skipped") and `verdict` as a separate field
for qualitative interpretation (e.g. "clean", "inconclusive",
"violated"). Previously §6 conflated these: it listed
pass/warn/inconclusive as status values and omitted the execution
statuses. Rewrote the section-status paragraph to document the two
layers explicitly; moved the forthcoming-additions note into the
same block.

Tests:
- ContinuousDiD matrix-row cell assertion (col 5 = ✓).
- "P(D=0) > 0" substring fingerprint guards the Remark 3.1 wording.
- DR §6 status vocabulary: guards every real status the implementation
  emits ("ran", "not_applicable", "not_run", "no_scalar_by_design")
  and forbids the obsolete pass/warn/inconclusive-as-status framing;
  `verdict` must be mentioned as a distinct concept.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b361c6f07129bdce9fe3b7f154022f2b039a2994


Overall Assessment

⚠️ Needs changes. The prior re-review issues on ContinuousDiD zero-dose controls and DiagnosticReport status/verdict semantics look addressed in diff_diff/guides/llms-autonomous.txt:L107-L114 and diff_diff/guides/llms-autonomous.txt:L603-L618, but there are still unmitigated P1 methodology issues in the new autonomous guide/profile workflow.

Executive Summary

  • The earlier ContinuousDiD control-group mismatch and BR/DR schema drift appear fixed.
  • Severity P1 [Newly identified]: the estimator-support matrix still says ContinuousDiD does not support staggered adoption, which contradicts the Methodology Registry and the implementation.
  • Severity P1 [Newly identified]: the new profile_panel() + guide flow still omits ContinuousDiD's unit-constant-dose requirement, so time-varying-dose panels are profiled as generic "continuous" and pointed toward an estimator that hard-rejects them.
  • Severity P2: the new semantic guide test does not assert either of those ContinuousDiD eligibility rules, so this drift can pass CI.
  • Verification was static only; pandas and pytest are not available in this environment.

Methodology

  • Severity P1 [Newly identified]. Impact: Method affected: ContinuousDiD. The matrix marks ContinuousDiD as staggered = ✗ at diff_diff/guides/llms-autonomous.txt:L206-L206, but the source material explicitly defines multi-period staggered notation and staggered (g,t) aggregation for the method in docs/methodology/continuous-did.md:L61-L75 and docs/methodology/continuous-did.md:L257-L260, and the registry documents staggered-adoption equations at docs/methodology/REGISTRY.md:L788-L825. The implementation also uses first_treat cohorts and a not-yet-treated control mask at diff_diff/continuous_did.py:L159-L169 and diff_diff/continuous_did.py:L919-L925. In the advertised profile_panel() -> get_llm_guide("autonomous") workflow, an agent can therefore wrongly exclude a supported estimator for staggered continuous-dose panels. Concrete fix: change the ContinuousDiD staggered cell from ✗ to ✓ (or partial only if you add a precise qualifier), and explain that continuous-treatment timing is driven by first_treat metadata rather than PanelProfile.is_staggered.
  • Severity P1 [Newly identified]. Impact: Method affected: ContinuousDiD. The new workflow still over-admits invalid continuous-treatment designs. profile_panel classifies any numeric non-binary treatment as "continuous" (diff_diff/profile.py:L339-L364), and the guide's field reference and §4.7 point users toward ContinuousDiD / HeterogeneousAdoptionDiD for that case (diff_diff/guides/llms-autonomous.txt:L89-L92, diff_diff/guides/llms-autonomous.txt:L367-L383). But the method definition requires dose to be time-invariant within unit in docs/methodology/continuous-did.md:L72-L75, and ContinuousDiD.fit() hard-fails when that is violated at diff_diff/continuous_did.py:L222-L228. That means a time-varying-dose panel is currently profiled as eligible and only rejected later at fit time. Concrete fix: add a PanelProfile field or alert for within-unit treatment variation and use it in §2/§4.7; at minimum, state explicitly that ContinuousDiD requires time-invariant dose and remove the unconditional routing from the generic "continuous" definition.
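
A sketch of the within-unit-dose signal asked for here; the follow-up commit adds PanelProfile.treatment_varies_within_unit with essentially this definition, though the code below is illustrative rather than the shipped implementation:

```python
import pandas as pd

def treatment_varies_within_unit(df: pd.DataFrame, unit: str, treatment: str) -> bool:
    """True iff any unit has more than one distinct non-NaN treatment value."""
    return bool(df.groupby(unit)[treatment].nunique(dropna=True).gt(1).any())
```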

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new semantic guard in tests/test_profile_panel.py:L458-L536 checks ContinuousDiD's covariate and zero-dose-control cells, but it never asserts the staggered-support cell or the time-invariant-dose contract. That is why both methodology issues above can still ship without CI noise. Concrete fix: extend test_guide_api_strings_resolve_against_public_api() to assert cdid_cells[2] and require explicit guide text about unit-constant dose, or add a live-behavior test that compares the guide against ContinuousDiD.fit() on a varying-dose panel.
  • Verification note: this review was static only because pandas and pytest are unavailable in this environment.

Path to Approval

  1. Fix the ContinuousDiD row in diff_diff/guides/llms-autonomous.txt so staggered adoption is not shown as unsupported, and clarify how continuous-treatment timing should be inferred when PanelProfile.is_staggered is binary-only.
  2. Surface the unit-constant-dose requirement in the new autonomous workflow. The best fix is a new PanelProfile signal plus matching §2/§4.7 text; the minimum fix is explicit prose that varying-dose panels are not eligible for ContinuousDiD.
  3. Add semantic tests for both rules so future guide edits cannot regress.

@igerber (Owner, Author) commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
ContinuousDiD staggered support (P1 #1): the matrix marked
staggered=✗, but the method natively supports staggered adoption via
the `first_treat` column (continuous_did.py:159-169, 919-925;
REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.

Time-invariant dose requirement (P1 #2): ContinuousDiD.fit() requires
dose to be time-invariant per unit (continuous_did.py:222-228;
docs/methodology/continuous-did.md:L70-75), but profile_panel() did
not expose this so time-varying-dose continuous panels were routed to
ContinuousDiD only to hard-fail at fit time.

Added `PanelProfile.treatment_varies_within_unit: bool` — True iff
any unit has more than one distinct non-NaN treatment value across
its observed rows. Computed unconditionally for numeric (non-bool)
treatment columns; False for categorical. `to_dict()` exposes it.
Guide §2 documents the field, §4.7 ContinuousDiD bullet lists two
eligibility prerequisites: P(D=0) > 0 AND
treatment_varies_within_unit == False.

Tests (P2):
- test_continuous_treatment_with_time_varying_dose: random-per-row
  continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose ->
  treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel
  always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓;
  guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: aaa486390b35fe6998e7d8826ef72636a4af8653


Overall Assessment

⚠️ Needs changes

Executive Summary

  • The prior non-ContinuousDiD guide drift appears mostly addressed; the remaining blockers are still on the ContinuousDiD path.
  • Severity P1: the autonomous support matrix still says ContinuousDiD does not support staggered adoption, which contradicts the methodology docs and the estimator’s (g,t) implementation.
  • Severity P1: the new profile_panel() + guide workflow still does not surface ContinuousDiD’s unit-constant-dose requirement, and it reuses binary-exposure semantics for continuous dose columns, so valid continuous-DiD panels can be mislabeled as having “always-treated” units while invalid varying-dose panels are not screened pre-fit.
  • Severity P2: the new tests still do not assert either ContinuousDiD rule, and one test now hard-codes the incorrect continuous-panel has_always_treated behavior.
  • Verification was static only; this environment is missing pytest and package imports fail because numpy is unavailable.

Methodology

  • Severity P1 [Previously raised, still unresolved]. Impact: Method affected: ContinuousDiD. The estimator-support matrix still marks ContinuousDiD as staggered = ✗ in diff_diff/guides/llms-autonomous.txt:L189-L207 (row at :L206), but the source material explicitly defines multi-period staggered notation and staggered (g,t) aggregation in docs/methodology/continuous-did.md:L61-L75 and docs/methodology/continuous-did.md:L257-L260, and the registry/implementation both reflect that multi-period design (docs/methodology/REGISTRY.md:L728-L733, diff_diff/continuous_did.py:L394-L423, diff_diff/continuous_did.py:L912-L925). In the advertised agent workflow, this wrongly excludes a supported estimator for staggered continuous-dose panels. Concrete fix: change the ContinuousDiD staggered cell to ✓ and add a short footnote that continuous-treatment timing comes from the separate first_treat metadata, not from PanelProfile.is_staggered.

  • Severity P1 [Previously raised, still unresolved]. Impact: Method affected: ContinuousDiD. The new workflow still omits the method’s unit-constant-dose assumption and also misstates continuous-dose semantics. profile_panel() classifies every non-binary numeric treatment as generic "continuous" and documents has_always_treated as a generic numeric-treatment property in diff_diff/profile.py:L183-L199 and diff_diff/profile.py:L355-L364; the guide then routes that class to ContinuousDiD based only on the zero-dose-control requirement in diff_diff/guides/llms-autonomous.txt:L89-L120 and diff_diff/guides/llms-autonomous.txt:L367-L383. But ContinuousDiD.fit() takes separate first_treat and dose inputs and hard-rejects any unit whose dose varies within unit in diff_diff/continuous_did.py:L183-L186 and diff_diff/continuous_did.py:L222-L228, consistent with docs/methodology/continuous-did.md:L61-L75. As written, the profile can over-admit invalid varying-dose panels and also emit the false has_always_treated_units alert on valid continuous-DiD panels solely because dose is positive in pre-treatment rows (diff_diff/profile.py:L593-L604). Concrete fix: add an explicit within-unit-dose signal/alert to PanelProfile and .to_dict(), document it in §2/§4.7, and stop equating “positive dose in every observed row” with “treated in every observed period” for continuous designs.
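For concreteness, a tiny pandas sketch of why a positive dose in every row does not mean "always treated" when timing comes from a separate first_treat-style column; the column names here are illustrative, not the library's API:

```python
import pandas as pd

# Constant positive dose but well-defined pre-treatment periods via first_treat.
panel = pd.DataFrame({
    "unit": [1, 1, 1, 1],
    "time": [2000, 2001, 2002, 2003],
    "dose": [0.5, 0.5, 0.5, 0.5],        # positive in every observed row
    "first_treat": [2002, 2002, 2002, 2002],
})
pre = panel["time"] < panel["first_treat"]
print(pre.tolist())  # [True, True, False, False] -> pre-periods exist despite dose > 0
```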

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new findings, though the two P1 issues above are not tracked in TODO.md and therefore remain unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity P2. Impact: the new regression tests still do not lock down the two ContinuousDiD contracts above. tests/test_profile_panel.py:L374-L550 checks many guide semantics but stops at cdid_cells[5] and never asserts the staggered-support cell or any time-invariant-dose/profile signal; tests/test_guides.py:L37-L50 only checks for fingerprints/table presence; and tests/test_profile_panel.py:L357-L371 explicitly expects has_always_treated is True on a valid continuous-dose panel, which bakes the wrong semantics into CI. Concrete fix: extend the semantic guide test to assert the ContinuousDiD staggered cell and explicit unit-constant-dose language, and add profile tests showing that continuous panels expose the within-unit-dose contract without firing the “always treated / no pre-treatment information” alert merely because dose > 0 in pre-treatment rows.

Path to Approval

  1. Fix the ContinuousDiD row in diff_diff/guides/llms-autonomous.txt so staggered adoption is shown as supported, with a note that timing is supplied via first_treat.
  2. Surface the unit-constant-dose requirement in the new profile_panel()/guide workflow, and remove or re-scope the continuous-panel has_always_treated semantics so they do not claim valid dose panels lack pre-treatment information.
  3. Add semantic tests for both rules: ContinuousDiD staggered support, and the within-unit-dose contract together with the corrected continuous-panel alert behavior.

@igerber
Owner Author

igerber commented Apr 24, 2026

/ai-review

igerber added a commit that referenced this pull request Apr 24, 2026
profile_panel():
- Balance / coverage: compute from unique (unit, time) support rather
  than raw row count. Duplicated rows no longer inflate coverage above
  1 or mask missing cells. Adds duplicate_unit_time_rows warn alert.
- Non-absorbing detection: drop NaN before the monotonicity diff so a
  path like [0, 1, NaN, 0] is correctly classified binary_non_absorbing
  rather than binary_absorbing.
- has_never_treated / has_always_treated: compute generically across
  numeric treatment types (not only binary). Continuous panels with
  zero-dose control units now surface has_never_treated=True; docstring
  and guide field reference updated.

llms-autonomous.txt:
- SyntheticDiD: staggered changed from "partial" to "strict block"
  (fit raises on within-unit variation); footnote rewritten.
- TROP: staggered support marked ✓ (via absorbing D matrix); never-treated
  no longer required (donor pool is every unit untreated at period t);
  fit() covariate surface removed (there is none).
- HeterogeneousAdoptionDiD: covariate adjustment changed to "not
  supported" (Appendix B.1 future work per REGISTRY.md); clustered SE
  annotated "warn" on continuous paths (cluster= kwarg is ignored with
  UserWarning, CR1 only on mass-point path).
- Replace "notyettreated" with "not_yet_treated" throughout (matches
  staggered.py / sun_abraham.py public API).
- Replace "Hausman.hausman_pretest" with "EfficientDiD.hausman_pretest"
  (the classmethod lives on EfficientDiD, not a Hausman namespace).

Tests:
- test_duplicate_unit_time_rows_do_not_inflate_coverage
- test_reversal_through_nan_is_binary_non_absorbing
- test_continuous_zero_dose_controls_flag_has_never_treated
- test_guide_api_strings_resolve_against_public_api (verifies every
  estimator name, hausman_pretest path, and control_group spelling in
  the bundled guide resolves against diff_diff's public API; guards
  against future drift)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
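A standalone sketch of the coverage rule described in the commit above, computed from the unique (unit, time) support so duplicated rows cannot push coverage past 1 (plain pandas, not the library's internal code):

```python
import pandas as pd

def observation_coverage(df: pd.DataFrame, unit: str, time: str) -> float:
    """Share of the unit x period grid covered by at least one observation."""
    cells = df[[unit, time]].dropna().drop_duplicates()          # unique support
    n_units, n_periods = cells[unit].nunique(), cells[time].nunique()
    return len(cells) / (n_units * n_periods)

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2],                 # unit 1's 2001 row is duplicated
    "time": [2000, 2001, 2001, 2000, 2001],
})
print(observation_coverage(panel, "unit", "time"))  # 1.0, not inflated above 1 by the duplicate
```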
igerber added a commit that referenced this pull request Apr 24, 2026
Guide:
- Rewrite HAD section (§4.9, §3 footnotes) to use WAS / WAS_d_lower
  terminology (paper Equation 2 / Theorem 1). HAD does not target ATT
  and its event-study output is per-event-time, not per-cohort. The
  fitted target_parameter attribute is literally "WAS" for Design 1'
  and "WAS_d_lower" for Design 1 / Assumption 6.
- Add control_group="last_cohort" to the EfficientDiD guidance in §4.4
  (no-never-treated paths) and the corresponding footnote. Previously
  the guide steered agents only toward assumption="PT-Post" when a
  never-treated cohort was absent, ruling out the supported pseudo-
  control path. Distinct from CallawaySantAnna's "not_yet_treated".

profile_panel():
- Missing identifier rows are dropped up front when unit or time
  contains NaN. Previously n_units / n_periods used nunique() (NaN-
  dropping) while the unique (unit, time) support used
  drop_duplicates() (NaN-preserving), producing observation_coverage
  > 1 on filtered data with missing IDs. All structural facts now
  compute on the non-missing subset; a new missing_id_rows_dropped
  warn alert surfaces the drop count so the behavior is not silent.

Tests:
- test_missing_unit_or_time_ids_are_dropped_consistently asserts
  observation_coverage stays in [0, 1] and the alert fires with the
  correct drop count.
- test_guide_api_strings_resolve_against_public_api extended with
  assertions for WAS / WAS_d_lower phrasing, the absence of the old
  "per-cohort Pierce-Schott" wording, and control_group="last_cohort".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
HeterogeneousAdoptionDiD staggered support is `partial` (per §3
matrix), but the restriction was never unpacked. Per REGISTRY.md L2281
and had.py:1100-1288, Appendix B.2 limits staggered HAD to the **last
treatment cohort plus never-treated units**. With `first_treat_col`
supplied, `fit(aggregate="event_study")` auto-filters to F_last and
emits a UserWarning naming kept/dropped counts; earlier-cohort units
are dropped. Without `first_treat_col` on a multi-cohort panel, fit()
raises a front-door ValueError pointing at ChaisemartinDHaultfoeuille
for full staggered support.

Guide updates:
- New §3 footnote on the HAD `partial` cell spelling out the last-
  cohort-only restriction, the `first_treat_col` requirement for the
  auto-filter, and the ChaisemartinDHaultfoeuille fallback.
- §4.9 HAD bullet appended with a "Staggered-timing scope is last-
  cohort-only" paragraph carrying the same contract plus the
  "last-cohort-only WAS" estimand clarification.

Tests:
- Semantic guide test asserts "last-cohort-only" (or "last cohort")
  wording, "first_treat_col" token, and ChaisemartinDHaultfoeuille as
  the fallback are all present so future guide edits cannot silently
  drop the Appendix B.2 disclosure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Balanced-panel eligibility gate tightened. `PanelProfile.is_balanced`
is computed from the unique `(unit, time)` support, so it stays `True`
even when duplicate rows exist — `duplicate_unit_time_rows` is a
separate alert for that case. But ContinuousDiD / EfficientDiD /
HeterogeneousAdoptionDiD all require exactly one observation per cell:
EfficientDiD and HAD raise ValueError on duplicates at fit() time, and
ContinuousDiD's precompute path silently resolves duplicates via
last-row-wins (which can change the estimand without warning).

Guide §3 balanced-panel-eligibility block now requires BOTH
`is_balanced == True` AND absence of the `duplicate_unit_time_rows`
alert before routing to these estimators, with the specific failure
mode (raise vs silent overwrite) named per-estimator and a concrete
two-step fix (`balance_panel()` + `drop_duplicates([unit, time])`).

Tests: extended the semantic guide test to assert the guide mentions
`duplicate_unit_time_rows` and uses "BOTH/both" wording around the
is_balanced gate, so future edits cannot silently drop the duplicate-
row half of the eligibility requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
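A self-contained pandas sketch of the two-part gate described above: no duplicate (unit, time) cells AND a full unit x period grid. The real gate is PanelProfile.is_balanced plus the duplicate_unit_time_rows alert, so this is only an illustration:

```python
import pandas as pd

def balanced_and_deduplicated(df, unit, time):
    """Both halves of the gate: one row per (unit, time) cell AND a full grid."""
    no_dups = not df.duplicated(subset=[unit, time]).any()
    cells = df[[unit, time]].drop_duplicates()
    full_grid = len(cells) == cells[unit].nunique() * cells[time].nunique()
    return no_dups and full_grid

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2],                 # duplicate (1, 2001) cell, missing (2, 2001)
    "time": [2000, 2001, 2001, 2000],
    "y":    [1.0, 1.2, 1.3, 0.9],
})
print(balanced_and_deduplicated(panel, "unit", "time"))     # False
deduped = panel.drop_duplicates(subset=["unit", "time"])    # step 1 of the two-step fix
print(balanced_and_deduplicated(deduped, "unit", "time"))   # still False: grid not yet balanced
```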
igerber added a commit that referenced this pull request Apr 24, 2026
Bool-dtype treatment columns are now classified the same way as
numeric {0, 1} rather than as "categorical". The library's binary
estimators validate value support via `validate_binary` (utils.py:
49-67), which accepts bool because True/False coerce to 1/0
numerically. Classifying bool as categorical silently routed valid
binary DiD panels away from the supported estimator set.

Changes:
- _classify_treatment() no longer early-returns "categorical" for
  bool dtype. The downstream absorbing/non-absorbing logic handles
  bool by casting to int before np.diff (raw bool diff is XOR, which
  would mask a True -> False transition).
- treatment_varies_within_unit now includes bool-dtype columns (was
  previously hardcoded False for bool).
- Guide §2 removes the "bool = categorical" rule and adds an
  explicit "bool is binary" note with a pointer to validate_binary
  as the reason.
- profile_panel() docstring mirrors the same update.

Tests:
- test_bool_dtype_treatment_is_binary_absorbing: staggered-style
  bool panel with never-treated cohort -> binary_absorbing, correct
  has_never_treated / treatment_varies_within_unit / cohort_sizes.
- test_bool_dtype_non_absorbing: reversible False -> True -> False
  bool panel -> binary_non_absorbing. Guards the int-cast before
  np.diff so future refactors don't regress to bool XOR semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
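The bool-vs-int np.diff behavior the cast guards against can be reproduced in a few lines:

```python
import numpy as np

path = np.array([False, True, False])        # a reversal: True -> False at the end
print(np.diff(path))                          # [ True  True ]  (bool diff is XOR)
print(np.diff(path.astype(int)))              # [ 1 -1 ]        (the reversal is visible)

# A monotonicity test on the raw bool diff would miss the reversal:
print(bool((np.diff(path) >= 0).all()))              # True  -- looks absorbing
print(bool((np.diff(path.astype(int)) >= 0).all()))  # False -- correctly non-absorbing
```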
igerber added a commit that referenced this pull request Apr 24, 2026
dCDH "spillover designs" claim removed. The earlier §4.3 bullet said
ChaisemartinDHaultfoeuille was "robust to non-absorbing treatment and
to spillover designs," but REGISTRY.md §ChaisemartinDHaultfoeuille
only documents support for non-absorbing / reversible treatment — the
estimator assumes SUTVA like every other DiD estimator in the suite.
The same guide's §7 glossary already defines SUTVA as ruling out
spillovers, so the §4.3 wording was internally inconsistent and would
have misrouted interference designs to dCDH.

Rewrote the §4.3 bullet to state the actually-supported contract
(non-absorbing / reversible treatment via DID_M / DID_l) and added
an explicit note that interference / between-unit spillovers are not
supported natively.

Regression test: `tests/test_profile_panel.py` extended with a
forbidden-phrase check that fails if the autonomous guide re-advertises
dCDH as "robust to spillover", "interference-robust", "supports
spillover", or "and to spillover".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
§4.10 repeated-cross-section wording tightened. The earlier text
("most estimators remain applicable") overstated the RCS support
surface: only CallawaySantAnna(panel=False), TripleDifference, and
StaggeredTripleDifference have documented RCS-capable contracts
(REGISTRY.md §CallawaySantAnna; docs/choosing_estimator.rst DDD
cross-sectional use cases). EfficientDiD and HeterogeneousAdoptionDiD
explicitly reject RCS per REGISTRY.md. SyntheticDiD and ContinuousDiD
are panel-only by construction. Under the old wording an autonomous
agent could silently route RCS data to a panel-only estimator.

Rewrote §4.10 into three explicit blocks:
- "Explicit RCS support": the three RCS-capable estimators.
- "Explicitly rejected for RCS (panel-only)": names EfficientDiD,
  HeterogeneousAdoptionDiD, SyntheticDiD, ContinuousDiD.
- "Treat other estimators in this guide as panel-only unless their
  own docs explicitly say otherwise."

Kept the clustered-SE note and added a cluster-vs-respondent
treatment-assignment check.

Regression test: asserts the guide does not contain the "most
estimators remain applicable" phrase, names `panel=False` as the
explicit CS RCS mode, and that the §4.10 section explicitly lists
both EfficientDiD and HeterogeneousAdoptionDiD as panel-only
rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber added a commit that referenced this pull request Apr 24, 2026
Rebased onto current main (17 commits clean — PR #355, #358, #359 all
merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced.
The earlier §4.10 RCS wording paired TripleDifference /
StaggeredTripleDifference together in the Explicit RCS support list,
but REGISTRY.md §StaggeredTripleDifference requires a balanced panel
and staggered_triple_diff.py:93-109 has no panel=False mode — fit()
rejects unbalanced/duplicate (unit, time) structure at
staggered_triple_diff.py:846-864.

- §4.10 Explicit RCS support: TripleDifference (two-period) only;
  StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added
  with a concrete "no panel=False mode" + "use TripleDifference for
  cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to
  the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the
  Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: e3b22dcf50ed559df7310afe848ef6a30e0a62b5


Overall Assessment
✅ Looks good

Executive Summary

  • The prior P1 on StaggeredTripleDifference appears resolved: the autonomous guide now treats it as balanced-panel-only and explicitly excludes it from the repeated-cross-section subset in llms-autonomous.txt:223, llms-autonomous.txt:277, and llms-autonomous.txt:510.
  • I did not find a new undocumented methodology deviation in the changed guide/profile surfaces after cross-checking against docs/methodology/REGISTRY.md and the relevant estimator contracts.
  • profile_panel() covers the main new edge cases introduced here: missing unit/time IDs, duplicate (unit, time) keys, degenerate binary-treatment panels, continuous zero-dose controls, and non-absorbing paths with missing gaps.
  • The new semantic guide tests now pin the corrected StaggeredTripleDifference balance/RCS contract in test_profile_panel.py:680, test_profile_panel.py:717, and test_profile_panel.py:749.
  • Static review only: pytest and numpy are unavailable in this environment, so I could not run the test suite or import diff_diff.

Methodology
No findings. The re-review focus item is addressed: the earlier StaggeredTripleDifference RCS/balanced-panel mismatch is fixed in the new guide and regression-covered in the new tests.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings.

Tech Debt
No findings. I did not identify a new deferrable item that needs TODO.md tracking.

Security
No findings.

Documentation/Tests
No findings. The added tests materially improve regression protection for the agent-facing guide/profile contract, but I was only able to do static review in this environment.

@igerber igerber added the ready-for-ci label (Triggers CI test workflows) on Apr 24, 2026
igerber and others added 18 commits April 24, 2026 15:41
New `diff_diff.profile_panel(df, *, unit, time, treatment, outcome)`
returns a frozen `PanelProfile` dataclass of structural facts about a
DiD panel (balance, treatment-type classification, cohort structure,
outcome characteristics, factual alerts). `.to_dict()` returns a
JSON-serializable view. Descriptive only -- alerts never name a
specific estimator; estimator selection is up to the caller.

Paired with a new bundled `"autonomous"` variant on `get_llm_guide()`
at `diff_diff/guides/llms-autonomous.txt`. Reference-shaped (distinct
from the workflow-prose `"practitioner"` variant): `PanelProfile`
field reference, 17-estimator x 9-design-feature support matrix,
per-design-feature reasoning citing Baker et al. (2025) and
Roth/Sant'Anna (2023), post-fit validation index, BR/DR schema
reference, and an explicit "no deterministic recommendations"
disclaimer.

`diff_diff/__init__.py` module docstring leads with a "For AI agents:"
entry block so `help(diff_diff)` surfaces the four-step sequence
(profile, reference, workflow, report) at import time. Exports
`profile_panel`, `PanelProfile`, `Alert` from the top-level namespace.

Both files are bundled inside the wheel and sdist (no GitHub/RTD
dependency at runtime) via the existing `diff_diff/guides/*.txt` glob
in `pyproject.toml`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
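A minimal usage sketch of the surface this commit introduces; the toy column names and data are made up, while the call signature and the two fields accessed are the ones documented in this PR:

```python
import json
import pandas as pd
import diff_diff

# Toy staggered panel: units 1-2 adopt in 2002, unit 3 is never treated.
panel = pd.DataFrame({
    "firm":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "year":    [2000, 2001, 2002] * 3,
    "treated": [0, 0, 1, 0, 0, 1, 0, 0, 0],
    "revenue": [1.0, 1.1, 1.4, 0.9, 1.0, 1.3, 1.2, 1.2, 1.3],
})

prof = diff_diff.profile_panel(
    panel, unit="firm", time="year", treatment="treated", outcome="revenue"
)
print(prof.is_balanced, prof.has_never_treated)    # structural facts, no estimator named
print(json.dumps(prof.to_dict(), indent=2)[:200])  # JSON-serializable view for an agent
```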
The treatment classifier required exactly two observed distinct values
to treat a column as binary. Panels that are entirely never-treated
(values = {0}) or entirely always-treated (values = {1}) were falling
through to "continuous", contradicting the documented taxonomy which
defines "binary_absorbing" as "values in {0, 1}".

Rule is now values_set <= {0, 1, 0.0, 1.0} with at least one observed
value; entirely-NaN treatment columns fall through to "categorical"
rather than "continuous". Docstring and the autonomous guide section
§2 reference are updated to match.

New regression tests:
- all-zero treatment panel -> binary_absorbing, has_never_treated
- all-one treatment panel -> binary_absorbing, has_always_treated
- binary with NaNs, only zeros observed -> binary_absorbing
- all-NaN treatment -> categorical
- top-level import surface (profile_panel / PanelProfile / Alert)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
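An illustrative reduction of the value-set rule above; the real classifier also distinguishes absorbing from non-absorbing paths and handles bool dtype (covered in a later commit):

```python
import numpy as np
import pandas as pd

def classify_numeric_treatment(col: pd.Series) -> str:
    """Sketch of the value-set rule only: binary vs continuous vs categorical."""
    observed = set(col.dropna().unique())
    if not observed:
        return "categorical"                 # entirely-NaN column
    if observed <= {0, 1, 0.0, 1.0}:
        return "binary"                      # all-zero and all-one panels included
    return "continuous"

print(classify_numeric_treatment(pd.Series([0, 0, 0])))         # binary (never-treated panel)
print(classify_numeric_treatment(pd.Series([1.0, 1.0])))        # binary (always-treated panel)
print(classify_numeric_treatment(pd.Series([np.nan, np.nan])))  # categorical
print(classify_numeric_treatment(pd.Series([0.0, 0.4, 1.0])))   # continuous
```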
profile_panel():
- Balance / coverage: compute from unique (unit, time) support rather
  than raw row count. Duplicated rows no longer inflate coverage above
  1 or mask missing cells. Adds duplicate_unit_time_rows warn alert.
- Non-absorbing detection: drop NaN before the monotonicity diff so a
  path like [0, 1, NaN, 0] is correctly classified binary_non_absorbing
  rather than binary_absorbing.
- has_never_treated / has_always_treated: compute generically across
  numeric treatment types (not only binary). Continuous panels with
  zero-dose control units now surface has_never_treated=True; docstring
  and guide field reference updated.

llms-autonomous.txt:
- SyntheticDiD: staggered changed from "partial" to "strict block"
  (fit raises on within-unit variation); footnote rewritten.
- TROP: staggered support marked ✓ (via absorbing D matrix); never-treated
  no longer required (donor pool is every unit untreated at period t);
  fit() covariate surface removed (there is none).
- HeterogeneousAdoptionDiD: covariate adjustment changed to "not
  supported" (Appendix B.1 future work per REGISTRY.md); clustered SE
  annotated "warn" on continuous paths (cluster= kwarg is ignored with
  UserWarning, CR1 only on mass-point path).
- Replace "notyettreated" with "not_yet_treated" throughout (matches
  staggered.py / sun_abraham.py public API).
- Replace "Hausman.hausman_pretest" with "EfficientDiD.hausman_pretest"
  (the classmethod lives on EfficientDiD, not a Hausman namespace).

Tests:
- test_duplicate_unit_time_rows_do_not_inflate_coverage
- test_reversal_through_nan_is_binary_non_absorbing
- test_continuous_zero_dose_controls_flag_has_never_treated
- test_guide_api_strings_resolve_against_public_api (verifies every
  estimator name, hausman_pretest path, and control_group spelling in
  the bundled guide resolves against diff_diff's public API; guards
  against future drift)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guide:
- Rewrite HAD section (§4.9, §3 footnotes) to use WAS / WAS_d_lower
  terminology (paper Equation 2 / Theorem 1). HAD does not target ATT
  and its event-study output is per-event-time, not per-cohort. The
  fitted target_parameter attribute is literally "WAS" for Design 1'
  and "WAS_d_lower" for Design 1 / Assumption 6.
- Add control_group="last_cohort" to the EfficientDiD guidance in §4.4
  (no-never-treated paths) and the corresponding footnote. Previously
  the guide steered agents only toward assumption="PT-Post" when a
  never-treated cohort was absent, ruling out the supported pseudo-
  control path. Distinct from CallawaySantAnna's "not_yet_treated".

profile_panel():
- Missing identifier rows are dropped up front when unit or time
  contains NaN. Previously n_units / n_periods used nunique() (NaN-
  dropping) while the unique (unit, time) support used
  drop_duplicates() (NaN-preserving), producing observation_coverage
  > 1 on filtered data with missing IDs. All structural facts now
  compute on the non-missing subset; a new missing_id_rows_dropped
  warn alert surfaces the drop count so the behavior is not silent.

Tests:
- test_missing_unit_or_time_ids_are_dropped_consistently asserts
  observation_coverage stays in [0, 1] and the alert fires with the
  correct drop count.
- test_guide_api_strings_resolve_against_public_api extended with
  assertions for WAS / WAS_d_lower phrasing, the absence of the old
  "per-cohort Pierce-Schott" wording, and control_group="last_cohort".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guide:
- SunAbraham matrix row: never-treated-required changed from ✗ to ✓.
  The fit path raises ValueError when no never-treated units exist
  (sun_abraham.py:566-567); the registry documents this requirement.
  §4.3 prose updated to state the requirement explicitly.
- HAD §4.9 (and §3 Phase 3 description): correct the diagnostic-claim
  wording. The pre-test battery validates the QUG null (step 1),
  Assumption 7 pre-trends (step 2, event-study only), and linearity
  of E[ΔY|D_2] (step 3). Assumption 3 (uniform continuity / no
  extensive-margin jump) is explicitly "not testable" per REGISTRY.md
  L2181, so the guide must not claim Phase 3 validates it.

profile_panel():
- Empty-panel guard: raise ValueError when the input DataFrame is
  empty (direct) or when no rows remain after dropping missing
  (unit, time) identifiers. Previously both cases returned
  is_balanced=True / observation_coverage=0 on zero valid rows.
- Fix missing-id counting: n_rows_with_missing_id was the SUM of
  unit-NaN and time-NaN (double-counting rows missing both). Now
  computed from a row-wise isna().any(axis=1) mask so every dropped
  row is counted exactly once.

Tests:
- test_row_with_both_ids_missing_counted_once
- test_empty_dataframe_raises_value_error
- test_empty_after_id_drop_raises_value_error
- test_guide_api_strings_resolve_against_public_api extended:
  SunAbraham matrix-row cell assertion (never-treated-required=✓),
  "Assumption 3" presence, absence of "validate Assumptions 3 and 7",
  presence of "not testable".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
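The double-counting fix described above is easy to see on a toy frame (plain pandas, illustrative only):

```python
import numpy as np
import pandas as pd

panel = pd.DataFrame({
    "unit": [1, np.nan, np.nan, 2],
    "time": [2000, 2001, np.nan, np.nan],   # row 2 is missing BOTH identifiers
})

# Naive sum double-counts the row missing both ids:
print(panel["unit"].isna().sum() + panel["time"].isna().sum())   # 4

# Row-wise mask counts every dropped row exactly once:
mask = panel[["unit", "time"]].isna().any(axis=1)
print(int(mask.sum()))                                            # 3
print(panel.loc[~mask])                                           # rows kept for profiling
```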
Guide:
- Matrix covariate cells corrected. SyntheticDiD column 6 flipped
  ✗ → ✓: fit() accepts covariates= and residualizes the outcome
  (synthetic_did.py:213-239, 425-436). ContinuousDiD column 6 flipped
  ✓ → ✗: the fit() signature has no covariate argument and REGISTRY
  lists covariate adjustment as deferred.
- EfficientDiD wording corrected. The registry says PT-Post still uses
  never-treated as the comparison group (REGISTRY.md L756-758);
  "switch to PT-Post" does not drop the requirement. Rewrote
  has_never_treated field description, the §3 footnote, and §4.4
  prose to state that PT-All and PT-Post both require never-treated
  units and control_group="last_cohort" is the only supported path
  for an all-eventually-treated panel.

profile_panel():
- _compute_pre_post now uses each treated unit's observed (unit, time)
  support rather than the global panel period set. On unbalanced
  panels this correctly captures the least-supported treated unit's
  pre/post exposure and fires short_pre_panel / short_post_panel when
  appropriate; the previous global-periods approach could overstate
  exposure and suppress the alert.

Guide §5 API signatures corrected:
- compute_pretrends_power takes a fitted results object, not a raw
  DataFrame (pretrends.py:1048).
- plot_sensitivity takes a SensitivityResults object
  (visualization/_diagnostic.py:12-14).
- plot_honest_event_study takes a HonestDiDResults object
  (visualization/_event_study.py:9).

Tests:
- test_min_pre_post_use_per_unit_observed_support (unbalanced treated
  unit missing early pre-period fires short_pre_panel).
- test_guide_api_strings_resolve_against_public_api extended with
  SyntheticDiD/ContinuousDiD covariate-cell matrix assertions,
  EfficientDiD both-PT-variants-require-never-treated wording
  assertion, and §5 API-signature fingerprints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
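A rough sketch of per-treated-unit pre/post counting on each unit's own observed support, as the commit above describes; the exact counting conventions (strictly-before vs inclusive of the first treated period) are illustrative, not the library's definition:

```python
import pandas as pd

def min_pre_post(df, unit, time, treatment):
    """Min pre/post exposure across treated units, on each unit's own observed periods."""
    pre_counts, post_counts = [], []
    for _, g in df.groupby(unit):
        treated_times = g.loc[g[treatment] > 0, time]
        if treated_times.empty:
            continue                              # never-treated units do not enter
        first = treated_times.min()
        pre_counts.append((g[time] < first).sum())
        post_counts.append((g[time] >= first).sum())
    return int(min(pre_counts)), int(min(post_counts))

panel = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2],                      # unit 2 is missing its early pre-period
    "time": [2000, 2001, 2002, 2001, 2002],
    "d":    [0, 0, 1, 0, 1],
})
print(min_pre_post(panel, "unit", "time", "d"))   # (1, 1): unit 2 has only one pre-period
```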
Guide §6 rewritten to match the actual BR/DR to_dict() schemas emitted
at diff_diff/business_report.py:524-549 and diff_diff/diagnostic_report.py:
1013-1031. BR top-level keys are "assumption" (singular), "pre_trends"
(underscored), "sample" (bare); the new documentation also adds
"estimator", "context", "heterogeneity", "diagnostics", "next_steps",
"references". DR keys are "estimator" (string), "headline_metric",
"estimator_native_diagnostics", "skipped", "warnings",
"overall_interpretation", "next_steps"; the old "checks: dict" nesting
never existed and "estimator_type" was never the key name. The
"forthcoming schema additions" block is dropped since it documented an
un-shipped surface.

Prose + alert wording aligned with the per-treated-unit implementation:
- Guide §2: min_pre_periods / min_post_periods described as "across
  treated units" (each unit's observed (unit, time) support counted
  independently on unbalanced panels).
- profile.py: short_pre_panel / short_post_panel alert messages use
  "across treated units" to match the computation.

Tests:
- test_guide_api_strings_resolve_against_public_api extended with
  BR/DR §6 schema assertions. Guards every real top-level key
  documented (assumption, pre_trends, sample, headline_metric,
  estimator_native_diagnostics, overall_interpretation) and forbids
  the previously documented obsolete keys (assumptions, pretrends,
  main_result, sample_summary, estimator_type, checks).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guide methodology attribution corrected:
- EfficientDiD primary source is Chen, Sant'Anna, Xie (2025) per
  REGISTRY.md L746-749, not Arkhangelsky-Imbens. §4.3 bullet and the
  §7 citation list updated.
- ContinuousDiD synopsis (§4.7) now distinguishes the PT-identified
  targets (`ATT(d|d)`, `ATT^{loc}`) from the SPT-required targets
  (`ATT(d)`, `ACRT(d)`, `ACRT^{glob}`) per REGISTRY.md L682-706, with
  the Callaway, Goodman-Bacon, Sant'Anna (2024) citation added to §7.

Guide §6 BR schema corrections:
- `diagnostics` wrapper documented accurately: always has `status`,
  with `schema` + `overall_interpretation` on the "ran" path or
  `reason` on the "skipped" path. Previously the guide said the key
  held the DR payload directly.
- `target_parameter.reference` field documented alongside the other
  fields per `describe_target_parameter()` in
  diff_diff/_reporting_helpers.py:34-42.

Regression tests extended:
- `diagnostics["schema"]` wrapper wording guarded (and absence of
  plain "DR payload under diagnostics").
- `target_parameter.reference` key documentation guarded.
- EfficientDiD attribution: `Chen, Sant'Anna, Xie 2025` present,
  `(Arkhangelsky-Imbens)` absent.
- ContinuousDiD targets: `ATT(d|d)`, `ACRT`, `Strong Parallel Trends`
  all present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContinuousDiD eligibility (P1): ContinuousDiD.fit() requires zero-dose
control units (P(D=0) > 0) because Remark 3.1 lowest-dose-as-control
is not yet implemented (REGISTRY.md L714; continuous_did.py:349-360).
Guide corrections:
- Matrix row: never-treated-required ✗ → ✓.
- has_never_treated field doc: add ContinuousDiD to the required-by
  list with the Remark 3.1 note.
- §4.4 no-never-treated paths: add ContinuousDiD bullet naming the
  ValueError and pointing at the unimplemented Remark 3.1 path.
- §4.7 ContinuousDiD bullet: prepend the "Requires zero-dose control
  units" callout.

DiagnosticReport §6 section-status semantics (P2): the real schema
uses `status` for execution state ("ran", "not_applicable", "not_run",
"no_scalar_by_design", "skipped") and `verdict` as a separate field
for qualitative interpretation (e.g. "clean", "inconclusive",
"violated"). Previously §6 conflated these: it listed
pass/warn/inconclusive as status values and omitted the execution
statuses. Rewrote the section-status paragraph to document the two
layers explicitly; moved the forthcoming-additions note into the
same block.

Tests:
- ContinuousDiD matrix-row cell assertion (col 5 = ✓).
- "P(D=0) > 0" substring fingerprint guards the Remark 3.1 wording.
- DR §6 status vocabulary: guards every real status the implementation
  emits ("ran", "not_applicable", "not_run", "no_scalar_by_design")
  and forbids the obsolete pass/warn/inconclusive-as-status framing;
  `verdict` must be mentioned as a distinct concept.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ContinuousDiD staggered support (P1 #1): the matrix marked
staggered=✗, but the method natively supports staggered adoption via
the `first_treat` column (continuous_did.py:159-169, 919-925;
REGISTRY.md L788-825). Matrix cell flipped ✗ → ✓.

Time-invariant dose requirement (P1 #2): ContinuousDiD.fit() requires
dose to be time-invariant per unit (continuous_did.py:222-228;
docs/methodology/continuous-did.md:L70-75), but profile_panel() did
not expose this so time-varying-dose continuous panels were routed to
ContinuousDiD only to hard-fail at fit time.

Added `PanelProfile.treatment_varies_within_unit: bool` — True iff
any unit has more than one distinct non-NaN treatment value across
its observed rows. Computed unconditionally for numeric (non-bool)
treatment columns; False for categorical. `to_dict()` exposes it.
Guide §2 documents the field, §4.7 ContinuousDiD bullet lists two
eligibility prerequisites: P(D=0) > 0 AND
treatment_varies_within_unit == False.

Tests (P2):
- test_continuous_treatment_with_time_varying_dose: random-per-row
  continuous panel -> treatment_varies_within_unit=True.
- test_continuous_treatment (existing): constant-per-unit dose ->
  treatment_varies_within_unit=False.
- test_binary_absorbing_varies_within_unit: binary absorbing panel
  always True by construction.
- Guide-resolution test: ContinuousDiD matrix col 2 (staggered) = ✓;
  guide mentions "time-invariant" and "treatment_varies_within_unit".
- to_dict JSON round-trip set extended with the new key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (resolved CHANGELOG.md conflict: the by_path
bullet from PR #355 and the profile_panel/autonomous-guide bullet from
this PR now live side-by-side under [Unreleased]).

has_always_treated now has binary-only semantics:
- For binary treatment (absorbing or non-absorbing): unit_min == 1
  means the unit is treated in every observed period (no
  pre-treatment information in the DiD sense).
- For continuous treatment: always False. Pre-treatment periods on
  continuous DiD are determined by the separate `first_treat` column
  supplied to `ContinuousDiD.fit`, not by whether the dose is
  positive. A unit with a constant positive dose can still have
  well-defined pre-treatment periods, so flagging it as
  "always-treated / no pre-treatment information" was factually wrong
  and triggered the misleading `has_always_treated_units` alert on
  valid continuous panels.
- Categorical: False by construction.

Guide §2 has_always_treated field doc updated to state the
binary-only semantics explicitly, with a note about `first_treat`.

Tests:
- New: test_continuous_positive_dose_does_not_fire_has_always_treated
  asserts has_always_treated=False AND the alert does not fire on a
  constant-positive-dose continuous panel.
- Existing test_continuous_zero_dose_controls_flag_has_never_treated
  updated: has_always_treated expected to be False (was True under
  the old semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Balanced-panel eligibility (P1): ContinuousDiD, EfficientDiD,
SyntheticDiD, and HeterogeneousAdoptionDiD all hard-reject unbalanced
panels at fit() time (continuous_did.py:329-338, efficient_did.py:
407-414, synthetic_did.py:399-412, had.py:1173-1188; REGISTRY.md
cross-refs). Guide updates surface this:

- New "Balanced-panel eligibility" block after §3 matrix footnotes
  names the four affected estimators and points at
  `PanelProfile.is_balanced == True` as the gate. Directs users with
  unbalanced panels to `diff_diff.prep.balance_panel()` or to a
  balance-tolerant estimator.
- §4 per-estimator bullets for all four estimators prepend or append
  the balanced-panel requirement with the specific fit() error the
  caller would otherwise hit.
- ContinuousDiD §4.7 bullet now lists THREE eligibility prerequisites
  (zero-dose controls, time-invariant dose, balanced panel) where it
  previously listed two.

Docstring (P3): profile_panel() docstring notes block updated to
match the binary-only has_always_treated semantics shipped in round
9. The old wording claimed the field fired on "strictly positive
treatment in every observed non-NaN row" across numeric types, which
no longer matches the implementation.

Tests (P2):
- Semantic guide test asserts `is_balanced` is mentioned in the guide
  and each of the four balance-sensitive estimators appears within
  400 characters of a "balanced" / "is_balanced" marker, so future
  edits cannot silently drop the eligibility gate from any of them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
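Putting the three prerequisites together, an agent-side eligibility check might look like the sketch below. The field names follow the PanelProfile reference in this PR; the `alerts` attribute name and the string-based alert match are assumptions for illustration, not a documented API:

```python
# Illustrative only. Field names (is_balanced, has_never_treated,
# treatment_varies_within_unit) follow the PanelProfile reference in this PR;
# the `alerts` attribute name and the string match below are assumptions --
# adapt them to the real Alert structure.
def continuous_did_eligible(prof) -> bool:
    has_zero_dose_controls = prof.has_never_treated          # P(D=0) > 0
    dose_constant_per_unit = not prof.treatment_varies_within_unit
    no_duplicate_cells = not any(
        "duplicate_unit_time_rows" in str(a) for a in prof.alerts
    )
    return (
        prof.is_balanced
        and no_duplicate_cells
        and has_zero_dose_controls
        and dose_constant_per_unit
    )
```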
HeterogeneousAdoptionDiD staggered support is `partial` (per §3
matrix), but the restriction was never unpacked. Per REGISTRY.md L2281
and had.py:1100-1288, Appendix B.2 limits staggered HAD to the **last
treatment cohort plus never-treated units**. With `first_treat_col`
supplied, `fit(aggregate="event_study")` auto-filters to F_last and
emits a UserWarning naming kept/dropped counts; earlier-cohort units
are dropped. Without `first_treat_col` on a multi-cohort panel, fit()
raises a front-door ValueError pointing at ChaisemartinDHaultfoeuille
for full staggered support.

Guide updates:
- New §3 footnote on the HAD `partial` cell spelling out the last-
  cohort-only restriction, the `first_treat_col` requirement for the
  auto-filter, and the ChaisemartinDHaultfoeuille fallback.
- §4.9 HAD bullet appended with a "Staggered-timing scope is last-
  cohort-only" paragraph carrying the same contract plus the
  "last-cohort-only WAS" estimand clarification.

Tests:
- Semantic guide test asserts "last-cohort-only" (or "last cohort")
  wording, "first_treat_col" token, and ChaisemartinDHaultfoeuille as
  the fallback are all present so future guide edits cannot silently
  drop the Appendix B.2 disclosure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Balanced-panel eligibility gate tightened. `PanelProfile.is_balanced`
is computed from the unique `(unit, time)` support, so it stays `True`
even when duplicate rows exist — `duplicate_unit_time_rows` is a
separate alert for that case. But ContinuousDiD / EfficientDiD /
HeterogeneousAdoptionDiD all require exactly one observation per cell:
EfficientDiD and HAD raise ValueError on duplicates at fit() time, and
ContinuousDiD's precompute path silently resolves duplicates via
last-row-wins (which can change the estimand without warning).

Guide §3 balanced-panel-eligibility block now requires BOTH
`is_balanced == True` AND absence of the `duplicate_unit_time_rows`
alert before routing to these estimators, with the specific failure
mode (raise vs silent overwrite) named per-estimator and a concrete
two-step fix (`balance_panel()` + `drop_duplicates([unit, time])`).

Tests: extended the semantic guide test to assert the guide mentions
`duplicate_unit_time_rows` and uses "BOTH/both" wording around the
is_balanced gate, so future edits cannot silently drop the duplicate-
row half of the eligibility requirement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bool-dtype treatment columns are now classified the same way as
numeric {0, 1} rather than as "categorical". The library's binary
estimators validate value support via `validate_binary` (utils.py:
49-67), which accepts bool because True/False coerce to 1/0
numerically. Classifying bool as categorical silently routed valid
binary DiD panels away from the supported estimator set.

Changes:
- _classify_treatment() no longer early-returns "categorical" for
  bool dtype. The downstream absorbing/non-absorbing logic handles
  bool by casting to int before np.diff (raw bool diff is XOR, which
  would mask a True -> False transition).
- treatment_varies_within_unit now includes bool-dtype columns (was
  previously hardcoded False for bool).
- Guide §2 removes the "bool = categorical" rule and adds an
  explicit "bool is binary" note with a pointer to validate_binary
  as the reason.
- profile_panel() docstring mirrors the same update.

Tests:
- test_bool_dtype_treatment_is_binary_absorbing: staggered-style
  bool panel with never-treated cohort -> binary_absorbing, correct
  has_never_treated / treatment_varies_within_unit / cohort_sizes.
- test_bool_dtype_non_absorbing: reversible False -> True -> False
  bool panel -> binary_non_absorbing. Guards the int-cast before
  np.diff so future refactors don't regress to bool XOR semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dCDH "spillover designs" claim removed. The earlier §4.3 bullet said
ChaisemartinDHaultfoeuille was "robust to non-absorbing treatment and
to spillover designs," but REGISTRY.md §ChaisemartinDHaultfoeuille
only documents support for non-absorbing / reversible treatment — the
estimator assumes SUTVA like every other DiD estimator in the suite.
The same guide's §7 glossary already defines SUTVA as ruling out
spillovers, so the §4.3 wording was internally inconsistent and would
have misrouted interference designs to dCDH.

Rewrote the §4.3 bullet to state the actually-supported contract
(non-absorbing / reversible treatment via DID_M / DID_l) and added
an explicit note that interference / between-unit spillovers are not
supported natively.

Regression test: `tests/test_profile_panel.py` extended with a
forbidden-phrase check that fails if the autonomous guide re-advertises
dCDH as "robust to spillover", "interference-robust", "supports
spillover", or "and to spillover".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§4.10 repeated-cross-section wording tightened. The earlier text
("most estimators remain applicable") overstated the RCS support
surface: only CallawaySantAnna(panel=False), TripleDifference, and
StaggeredTripleDifference have documented RCS-capable contracts
(REGISTRY.md §CallawaySantAnna; docs/choosing_estimator.rst DDD
cross-sectional use cases). EfficientDiD and HeterogeneousAdoptionDiD
explicitly reject RCS per REGISTRY.md. SyntheticDiD and ContinuousDiD
are panel-only by construction. Under the old wording an autonomous
agent could silently route RCS data to a panel-only estimator.

Rewrote §4.10 into three explicit blocks:
- "Explicit RCS support": the three RCS-capable estimators.
- "Explicitly rejected for RCS (panel-only)": names EfficientDiD,
  HeterogeneousAdoptionDiD, SyntheticDiD, ContinuousDiD.
- "Treat other estimators in this guide as panel-only unless their
  own docs explicitly say otherwise."

Kept the clustered-SE note and added a cluster-vs-respondent
treatment-assignment check.

Regression test: asserts the guide does not contain the "most
estimators remain applicable" phrase, names `panel=False` as the
explicit CS RCS mode, and that the §4.10 section explicitly lists
both EfficientDiD and HeterogeneousAdoptionDiD as panel-only
rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rebased onto current main (17 commits clean — PR #355, #358, #359 all
merged since last rebase).

StaggeredTripleDifference corrected as panel-only + balance-enforced.
The earlier §4.10 RCS wording paired TripleDifference /
StaggeredTripleDifference together in the Explicit RCS support list,
but REGISTRY.md §StaggeredTripleDifference requires a balanced panel
and staggered_triple_diff.py:93-109 has no panel=False mode — fit()
rejects unbalanced/duplicate (unit, time) structure at
staggered_triple_diff.py:846-864.

- §4.10 Explicit RCS support: TripleDifference (two-period) only;
  StaggeredTripleDifference removed from the supported set.
- §4.10 Explicitly rejected for RCS: StaggeredTripleDifference added
  with a concrete "no panel=False mode" + "use TripleDifference for
  cross-sectional DDD" pointer.
- §3 Balanced-panel eligibility: StaggeredTripleDifference added to
  the balance-sensitive gate.

Regression tests extended:
- Balanced-panel proximity check now covers StaggeredTripleDifference.
- §4.10 section test asserts StaggeredTripleDifference appears in the
  Explicitly rejected block and NOT in the Explicit RCS support block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the agent-profile-panel branch from e3b22dc to ef6b53d on April 24, 2026 19:41
@igerber igerber merged commit 4852b34 into main Apr 24, 2026
19 checks passed
@igerber igerber deleted the agent-profile-panel branch April 24, 2026 20:50
igerber added a commit that referenced this pull request Apr 24, 2026
…ck get_loo_effects_df on survey jackknife

P0 (Methodology — survey jackknife silently skipping undefined LOO):
The Rust & Rao (1996) stratified jackknife formula `SE² =
Σ_h (1-f_h)·(n_h-1)/n_h·Σ_{j∈h}(τ̂_{(h,j)} - τ̄_h)²` requires every
PSU-LOO `τ̂_{(h,j)}` to be defined. The previous implementation
silently skipped PSUs whose deletion removed all treated units (or
zeroed control ω_eff mass, or raised in the estimator) while still
applying the full `(n_h-1)/n_h` factor, under-scaling variance on
designs where treated units pack into a single PSU.

Fix: `_jackknife_se_survey` now tracks any undefined replicate in a
contributing stratum (n_h ≥ 2) and short-circuits to `SE=NaN` with a
targeted `UserWarning` naming the stratum / PSU / reason (deletion
removes all treated, kept ω_eff zero, kept treated survey mass zero,
estimator raised, estimator returned non-finite). Partial LOOs are
still returned in `placebo_effects` for debugging; users needing a
variance estimator that accommodates PSU-deletion infeasibility
should use `variance_method="bootstrap"`. Silent stratum-level skip
for `n_h < 2` is preserved (canonical lonely-PSU handling matching
R `survey::svyjkn`).
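The quoted formula can be checked with a short numpy sketch (a standalone illustration, not the library's `_jackknife_se_survey`), including the two behaviors described here: NaN when any replicate in a contributing stratum is undefined, and silent skip of lonely-PSU strata:

```python
import numpy as np

def stratified_jackknife_se(loo_by_stratum, fpc_by_stratum=None):
    """Rust & Rao (1996) stratified jackknife SE from leave-one-PSU-out estimates.

    loo_by_stratum: {stratum h: sequence of tau_hat_(h, j) over PSUs j in h}.
    fpc_by_stratum: optional {stratum h: f_h} finite-population corrections (default 0).
    """
    var = 0.0
    for h, loo in loo_by_stratum.items():
        loo = np.asarray(loo, dtype=float)
        n_h = len(loo)
        if n_h < 2:
            continue                       # lonely-PSU stratum: silently skipped
        if not np.isfinite(loo).all():
            return float("nan")            # undefined replicate in a contributing stratum
        f_h = (fpc_by_stratum or {}).get(h, 0.0)
        var += (1.0 - f_h) * (n_h - 1) / n_h * np.sum((loo - loo.mean()) ** 2)
    return float(np.sqrt(var))

print(stratified_jackknife_se({0: [1.0, 1.4, 0.9], 1: [2.0, np.nan]}))  # nan
print(stratified_jackknife_se({0: [1.0, 1.4, 0.9], 1: [2.0, 2.2]}))     # ~0.321
```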

New regression `test_jackknife_full_design_undefined_replicate_returns_nan`
exercises the fix on the original `sdid_survey_data_full_design`
fixture (treated all in stratum 0 PSU 0 → LOO PSU 0 removes all
treated) and asserts both the `UserWarning` match and `np.isnan(se)`.

The existing jackknife tests that asserted finite SE now use a new
`sdid_survey_data_jk_well_formed` fixture where treated units are
spread across two PSUs within stratum 0 (so every LOO leaves ≥1
treated). The self-consistency test
(`test_jackknife_full_design_stratum_aggregation_formula_magnitude`)
was rewritten from a weak finite-positive check to a real
recomputation of the Rust & Rao formula on the returned 6-entry
`placebo_effects` array, asserting `result.se == pytest.approx(
expected, rel=1e-12)`.

Coverage MC (`benchmarks/data/sdid_coverage.json`) is unchanged:
the `stratified_survey` DGP spreads its 32 treated units across
PSUs 2 and 3 within stratum 1 and PSUs 0 and 1 within stratum 0,
so every LOO is defined there too. The previously-reported jackknife
anti-conservatism (α=0.05 rejection = 0.45, SE/trueSD = 0.46) is
the documented few-PSU limitation (1 effective DoF per stratum
with `n_h = 2`), not the P0 silent-skip bug.

P1 (Code Quality — get_loo_effects_df on survey jackknife):
`SyntheticDiDResults.get_loo_effects_df()` assumes a length-N
unit-indexed `placebo_effects` array (first n_control are control-
LOO, next n_treated are treated-LOO). Survey-jackknife fits return
a flat PSU-level replicate array of variable length; joining onto
the fit-time `control_unit_ids + treated_unit_ids` would mislabel
PSU replicates as unit-level effects.

Fix: `get_loo_effects_df()` now raises `NotImplementedError` with a
targeted message pointing to `result.placebo_effects` for the raw
PSU-level array and REGISTRY §SyntheticDiD "Note (survey + jackknife
composition)" for the aggregation formula. New regression
`test_get_loo_effects_df_raises_on_survey_jackknife` asserts the
raise on a survey fit. Non-survey and pweight-only jackknife fits
continue to use `get_loo_effects_df()` as before (unit-level LOO).

P3 (Documentation — stale default variance_method note):
`docs/methodology/REGISTRY.md:L1569` default-variance-method note
rewritten to reflect that all three variance methods now support
full survey designs (removing "full design supported on bootstrap
only" language) and to recommend bootstrap specifically on surveys
with few PSUs per stratum.

Branch also rebased onto current origin/main to pick up PR #356
(agent-profile-panel) and PR #361 — the R1 Maintainability finding
about "unrelated API deletions" was a stale-base-drift artifact
(my branch was created before #356 merged). After rebase the diff
against main shows only SDID-survey changes.

Verification
------------
pytest tests/test_survey_phase5.py
       tests/test_methodology_sdid.py::{TestBootstrapSE,TestPlaceboSE,TestJackknifeSE,TestCoverageMCArtifact}
→ 87 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>