
Add SyntheticDiD validation diagnostics#309

Merged
igerber merged 7 commits into main from sdid-validation on Apr 17, 2026
Conversation

@igerber
Owner

@igerber igerber commented Apr 16, 2026

Summary

  • New post-fit diagnostics on SyntheticDiDResults: get_loo_effects_df(), get_weight_concentration(), in_time_placebo(), sensitivity_to_zeta_omega(), plus synthetic/treated pre and post trajectories retained as public fields.
  • Private _SyntheticDiDFitSnapshot on results (Y matrices, unit IDs, period lists, survey weights) so re-estimation diagnostics work from the results object without requiring the user to re-supply panel data. Arrays are marked read-only after fit.
  • Jackknife LOO entries in placebo_effects can now be mapped back to user-facing unit identities; get_loo_effects_df() returns the joined view sorted by influence.
  • _handle_synthetic in practitioner.py rewritten to reference the new callables instead of pseudo-code comments.
  • REGISTRY.md gains a Validation diagnostics subsection documenting purpose, formulas, and edge cases for each diagnostic; doc-deps.yaml links results.py to the methodology entry.
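The concentration metrics named above (effective N, Herfindahl index, top-k share) can be sketched in standalone NumPy. This is an illustrative sketch of the standard formulas, not the library's implementation; the function name and return shape are made up for the example.

```python
import numpy as np

def weight_concentration(weights, top_k=3):
    # Normalize to shares, then compute the standard concentration metrics:
    # Herfindahl = sum of squared shares; effective N = inverse Herfindahl;
    # top-k share = mass carried by the k largest units.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    herfindahl = float(np.sum(w ** 2))
    effective_n = 1.0 / herfindahl
    top_k_share = float(np.sort(w)[::-1][:top_k].sum())
    return {"herfindahl": herfindahl,
            "effective_n": effective_n,
            "top_k_share": top_k_share}

# Uniform weights over 4 units: Herfindahl 0.25, effective N 4.0.
print(weight_concentration([0.25, 0.25, 0.25, 0.25], top_k=2))
```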

Methodology references (required if estimator / math changes)

  • Method name(s): SyntheticDiD (post-fit diagnostics; core estimator and variance math unchanged).
  • Paper / source link(s): Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). Synthetic Difference-in-Differences. American Economic Review, 111(12), 4088-4118.
  • Any intentional deviations from the source (and why):
    • in_time_placebo() reuses self.zeta_omega / self.zeta_lambda by default instead of recomputing auto-zeta per fake window (matches R synthdid convention of treating regularization as a property of the original fit). Labelled **Note:** in REGISTRY.md.
    • sensitivity_to_zeta_omega() holds the original Frank-Wolfe time weights fixed to isolate sensitivity to zeta_omega; zeta_lambda sensitivity is not exposed. Labelled **Note:** in REGISTRY.md.

Validation

  • Tests added/updated: tests/test_methodology_sdid.py gains 33 new tests across TestFitSnapshot, TestTrajectories, TestLooEffectsDf, TestWeightConcentration, TestInTimePlacebo, TestSensitivityToZetaOmega, TestPractitionerSdidReferences - covering snapshot shapes and read-only contract, unweighted and survey-weighted trajectories, LOO DataFrame correctness (positional mapping, NaN propagation, ValueError on non-jackknife variance), concentration metrics under composed survey weights, negative top_k validation, in-time placebo default sweep feasibility, empty-default schema preservation, explicit-list override, zeta refit regression at multiplier=1.0, and the exact documented default grid values.
  • Backtest / simulation / notebook evidence (if applicable): N/A - diagnostics only; no estimator math changes.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

igerber and others added 4 commits April 16, 2026 07:09
Frames the problem space for improving SyntheticDiD validation
diagnostics - the gap between getting an estimate and being confident
it's reliable. Captures current state, opportunities from the
jackknife work, and what a complete validation workflow looks like.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds five post-fit diagnostics so practitioners can follow a credible
synthetic control validation workflow without reverse-engineering
internals:

- get_loo_effects_df(): joins jackknife LOO pseudo-values to unit IDs
- get_weight_concentration(): effective N, Herfindahl, top-k share
- in_time_placebo(): re-fit on shifted fake treatment periods (sweeps
  all feasible pre-periods by default; reuses self.zeta_omega per R
  synthdid convention)
- sensitivity_to_zeta_omega(): ATT across a log-spaced zeta grid,
  time weights held fixed
- Trajectories: synthetic_pre/post and treated_pre/post retained on
  results for plotting and custom fit metrics

Retains a private _SyntheticDiDFitSnapshot (Y matrices, unit IDs,
period lists, survey weights) on results to support re-estimation.
Threads unit IDs through the existing jackknife loops so placebo_effects
entries can be mapped back to user-facing identities.

Rewrites _handle_synthetic in practitioner.py to reference these
callables instead of pseudo-code comments. Adds REGISTRY.md subsection
documenting purpose, formulas, and edge cases for each diagnostic.

Tests: 28 new tests in test_methodology_sdid.py covering snapshot
contents, trajectory correctness, LOO ID mapping, concentration
metrics (including survey composition), placebo sweep feasibility,
zeta-sensitivity regression at multiplier=1.0, and practitioner
snippet syntax.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-only

- in_time_placebo() returns a DataFrame with the documented columns even
  when no feasible fake treatment periods exist (e.g., n_pre < 3). Prior
  behavior returned a schema-less empty frame, breaking the documented
  contract and downstream code.
- get_weight_concentration() validates top_k >= 0 and raises ValueError
  on negative input. Prior behavior silently coerced to 0.
- _SyntheticDiDFitSnapshot marks survey weight arrays (w_control,
  w_treated) read-only alongside the outcome matrices to prevent
  accidental mutation by diagnostic methods.

Tests: covers the empty-default-placebo schema case (n_pre=2) and the
negative-top_k ValueError path.
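The read-only contract described above relies on NumPy's per-array writeable flag; a minimal sketch (the array here is a stand-in, not the actual snapshot fields):

```python
import numpy as np

# Freeze an array after construction so later code cannot mutate it in place.
snapshot = np.arange(6, dtype=float).reshape(2, 3)
snapshot.flags.writeable = False

try:
    snapshot[0, 0] = 99.0
except ValueError as exc:
    # NumPy raises ValueError on assignment to a read-only array.
    print("mutation blocked:", exc)
```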

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rage

- Reworded the sensitivity_to_zeta_omega() default-grid description in
  both docstring and REGISTRY.md. Prior phrasing ("16x log-spaced range")
  could be read as "16 points log-spaced"; actual default is a 5-point
  grid with multipliers (0.25, 0.5, 1.0, 2.0, 4.0) spanning 16x.
- Publicly exposed trajectory arrays and time_weights_array are now
  marked read-only after fit, matching the snapshot's treatment.
- Added test asserting the exact default zeta grid values (not just
  length), so doc/implementation drift is caught.
- Added survey-weighted trajectory test: confirms treated_pre/post use
  survey-weighted mean and synthetic_pre uses composed omega_eff.
- Added test asserting the public arrays are read-only.
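The grid clarification above can be made concrete: the default is five multipliers whose range spans 16x, not sixteen log-spaced points. A sketch with a made-up base zeta:

```python
import numpy as np

base_zeta = 0.8  # hypothetical fitted zeta_omega; any positive value works
multipliers = np.array([0.25, 0.5, 1.0, 2.0, 4.0])
zeta_grid = base_zeta * multipliers

# Five points; the ratio of the largest to smallest grid value is 16.
print(len(zeta_grid), zeta_grid.max() / zeta_grid.min())
```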

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions


Overall Assessment

⚠️ Needs changes

Executive Summary

  • Affected methods: SyntheticDiDResults.get_loo_effects_df(), get_weight_concentration(), in_time_placebo(), and sensitivity_to_zeta_omega(). I did not find an unmitigated estimator, weighting, or SE-methodology regression in the changed SyntheticDiD fit path.
  • P1: sensitivity_to_zeta_omega() does not handle empty grids correctly. zeta_grid=[] or multipliers=() returns a zero-column DataFrame, not the documented schema.
  • The two math-adjacent diagnostic deviations are documented in REGISTRY.md, so they are P3 informational only.
  • New tests cover many happy paths and several edge cases, but they miss the empty-grid branch that would have caught the P1 above.
  • This was a static review only; I could not execute pytest in this environment because numpy/pandas are unavailable.

Methodology

Code Quality

  • Severity: P1. Impact: sensitivity_to_zeta_omega() promises a result with columns zeta_omega, att, pre_fit_rmse, max_unit_weight, and effective_n, but when the resolved grid is empty it returns pd.DataFrame(rows) with rows=[], i.e. a zero-column frame. Any downstream caller that branches on an empty grid but still expects the documented schema will break. Location: diff_diff/results.py:1290 through the return at diff_diff/results.py:1361; documented contract at docs/methodology/REGISTRY.md:1524. Concrete fix: predeclare columns = ["zeta_omega", "att", "pre_fit_rmse", "max_unit_weight", "effective_n"] and return pd.DataFrame(columns=columns) when zeta_values is empty; construct non-empty results with the same column list.
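The failure mode in this finding is easy to reproduce in plain pandas: constructing a DataFrame from an empty rows list yields a zero-column frame, while predeclaring the columns preserves the schema. A sketch using the five documented column names:

```python
import pandas as pd

columns = ["zeta_omega", "att", "pre_fit_rmse", "max_unit_weight", "effective_n"]

# An empty rows list produces a frame with no columns at all ...
bad = pd.DataFrame([])
# ... while predeclaring columns keeps the documented schema at zero rows.
good = pd.DataFrame([], columns=columns)

print(list(bad.columns))   # no columns survive
print(good.shape)          # zero rows, five columns
```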

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new sensitivity tests cover only non-empty grids, so the empty-result regression above is untested. See tests/test_methodology_sdid.py:1862 through tests/test_methodology_sdid.py:1928. Concrete fix: add regressions for res.sensitivity_to_zeta_omega(zeta_grid=[]) and res.sensitivity_to_zeta_omega(multipliers=()), asserting zero rows with the exact documented column set.
  • Validation note: I could not run the test suite locally because this review environment is missing numpy and pandas.

Path to Approval

  1. In sensitivity_to_zeta_omega(), preserve the documented output schema when the resolved zeta grid is empty.
  2. Add regression tests for both zeta_grid=[] and multipliers=() that assert shape == (0, 5) and the exact expected column names.

sensitivity_to_zeta_omega() returned a zero-column DataFrame when the
resolved grid was empty (zeta_grid=[] or multipliers=()), breaking the
documented 5-column contract and any downstream caller that branched on
an empty grid while still expecting the schema.

Pre-declares the columns list and short-circuits with a typed empty
DataFrame when zeta_values is empty; the non-empty return path also
passes the columns list explicitly so both paths produce identical
schemas.

Tests cover both entry points that can produce an empty grid.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 0843481739c6b6bd31511a7a18f5fece5ac6e2e3


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review: the prior P1 on empty sensitivity_to_zeta_omega() outputs is fixed in diff_diff/results.py:1314 and covered by tests/test_methodology_sdid.py:1930 and tests/test_methodology_sdid.py:1942.
  • Affected methodology is limited to post-fit SyntheticDiD diagnostics and retained fit state; I did not find a new ATT, weighting, or variance regression in the estimator path. The only behavior deviations I found are the two explicitly documented registry notes for in_time_placebo() and sensitivity_to_zeta_omega(). (synth-inference.github.io)
  • P2: the new practitioner jackknife snippet can still raise on valid variance_method="jackknife" results when LOO estimates are unavailable, because it checks only results.variance_method before calling get_loo_effects_df().
  • P2: _fit_snapshot now retains outcome matrices, unit IDs, and survey weights on every SyntheticDiDResults, which increases accidental data-retention / serialization risk.
  • Validation note: this was a static review only; pytest is unavailable here, and the environment also lacks numpy and pandas.

Methodology

Code Quality

  • Severity: P2. Impact: the shipped practitioner snippet at diff_diff/practitioner.py:548 only gates on results.variance_method == 'jackknife' before calling get_loo_effects_df(), but SyntheticDiD legitimately returns empty jackknife output for single-treated or single-effective-control designs at diff_diff/synthetic_did.py:1247 and diff_diff/synthetic_did.py:1267, and get_loo_effects_df() raises in that state at diff_diff/results.py:1018. Concrete fix: guard on LOO availability (results.placebo_effects is not None or getattr(results, "_loo_unit_ids", None) is not None) or catch ValueError and print a graceful fallback message instead. This is a real edge case, not a theoretical one; the official synthdid variance docs also call out single-treated designs as special handling. (synth-inference.github.io)
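The guarded-call pattern suggested in this finding can be sketched as follows. The `results` object and its attributes here are stand-ins for illustration, not the actual diff_diff classes:

```python
# Do not assume variance_method == "jackknife" implies LOO output exists;
# check availability first and fall back gracefully.
def loo_summary(results):
    if getattr(results, "_loo_unit_ids", None) is None:
        return "LOO effects unavailable: jackknife produced no per-unit estimates."
    return results.get_loo_effects_df()

class FakeResults:
    # Simulates a single-treated design: jackknife ran but left no LOO output.
    variance_method = "jackknife"
    _loo_unit_ids = None

print(loo_summary(FakeResults()))
```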

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new findings.

Security

  • Severity: P2. Impact: the new _SyntheticDiDFitSnapshot retains outcome matrices, unit IDs, and survey weights on the results object at diff_diff/results.py:653, and every fit now attaches it unconditionally at diff_diff/synthetic_did.py:593 and diff_diff/synthetic_did.py:618. That makes SyntheticDiDResults materially more sensitive than before; generic pickling or dataclasses.asdict() will now carry underlying panel state even though the new method errors describe serialization as excluding the snapshot at diff_diff/results.py:1099 and diff_diff/results.py:1276. Concrete fix: add an explicit serialization guard that drops _fit_snapshot, or make snapshot retention opt-in and document the privacy trade-off in the public API.

Documentation/Tests

  • Severity: P2. Impact: the new practitioner tests only assert that the snippets reference existing methods and parse as Python at tests/test_methodology_sdid.py:1958 and tests/test_methodology_sdid.py:1985; they do not exercise the jackknife-unavailable branch that would have caught the practitioner failure above. Concrete fix: add a regression on a one-treated-unit SyntheticDiD(variance_method="jackknife") result and assert the generated guidance produces a graceful fallback message instead of invoking get_loo_effects_df().
  • Validation note: I could not run the tests here because pytest is missing and the environment also lacks numpy and pandas.

- Practitioner LOO snippet now guards on `_loo_unit_ids` availability,
  not just variance_method. Single-treated-unit and single-effective-
  control designs legitimately return empty jackknife output; a user
  copy-pasting the prior snippet would hit a ValueError from
  get_loo_effects_df(). The else branch now describes the actual
  requirements.
- SyntheticDiDResults.__getstate__ drops _fit_snapshot on pickle so
  generic pickle.dumps() no longer carries outcome matrices, unit IDs,
  or survey weights to wherever the pickled bytes are sent. The live
  session is unaffected; unpickled results raise the existing "re-fit
  to enable" message from in_time_placebo / sensitivity_to_zeta_omega.

Tests cover:
- Snippet degrades gracefully on single-treated jackknife fit
  (exec()'s the snippet, asserts the "LOO not available" message).
- Pickle round-trip drops the snapshot and leaves public fields intact.
- in_time_placebo / sensitivity_to_zeta_omega raise after unpickle.
- __getstate__ does not mutate the live instance's snapshot.
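The `__getstate__` pattern described above can be sketched with a minimal class; the names are illustrative, not the actual diff_diff classes:

```python
import pickle

class Results:
    def __init__(self, att, snapshot):
        self.att = att
        self._fit_snapshot = snapshot  # heavy retained state (hypothetical)

    def __getstate__(self):
        # Copy __dict__ so the live instance keeps its snapshot; only the
        # pickled bytes lose it.
        state = self.__dict__.copy()
        state["_fit_snapshot"] = None
        return state

res = Results(att=1.5, snapshot={"Y": [[1, 2], [3, 4]]})
restored = pickle.loads(pickle.dumps(res))

print(res._fit_snapshot is not None)   # live object unaffected
print(restored._fit_snapshot is None)  # pickled copy dropped the snapshot
print(restored.att)                    # public fields survive
```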

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: a5c080c6370acb17cee734934093d0f3e12f83f2


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review focus: the prior practitioner failure is fixed. The jackknife guidance now checks LOO availability before calling get_loo_effects_df(), and there is a regression covering the single-treated-unit case in diff_diff/practitioner.py:L539-L557 and tests/test_methodology_sdid.py:L1998-L2024.
  • The methodology impact is limited to post-fit SyntheticDiDResults diagnostics (get_loo_effects_df, trajectories, get_weight_concentration, in_time_placebo, sensitivity_to_zeta_omega). I did not find a regression in the ATT or SE paths, which still run through the existing SDID estimator and variance code in diff_diff/synthetic_did.py:L474-L647.
  • The only behavior deviations I found are the two explicitly documented registry notes: in_time_placebo() reuses the original fit’s zetas, and sensitivity_to_zeta_omega() holds time weights fixed. Those are documented in docs/methodology/REGISTRY.md:L1521-L1526, so they are informational rather than defects.
  • Residual P2: the new pickle guard is real, but _fit_snapshot is still a declared dataclass field. That means pickle is protected while other dataclass-recursive serialization paths can still carry the retained panel matrices, unit IDs, and survey weights from diff_diff/results.py:L780-L810 and diff_diff/synthetic_did.py:L593-L647.
  • P3 informational: the checked-in autosummary stub for SyntheticDiDResults still omits the new public diagnostics/trajectory fields in docs/api/_autosummary/diff_diff.SyntheticDiDResults.rst:L14-L53. That is already tracked as stale autosummary tech debt in TODO.md:L88-L99.
  • Validation note: this was a static review only. The environment here is missing numpy, pandas, and pytest, so I could not execute the test suite.

Methodology

  • Severity: P3 informational. Impact: the affected methods are the new post-fit SDID diagnostics and retained trajectory/snapshot state, not the core estimator or variance formulas. The fit path still computes ATT and inference through the existing SDID estimator, placebo/bootstrap/jackknife branches, and safe_inference() in diff_diff/synthetic_did.py:L474-L573, and the in-repo SDID references still point to Arkhangelsky et al. (2021) / Algorithms 3 and 4 in diff_diff/synthetic_did.py:L26-L58 and diff_diff/synthetic_did.py:L1189-L1382. The two deviations I found are both documented in the Methodology Registry: zeta reuse in in_time_placebo() and fixed time weights in sensitivity_to_zeta_omega() at docs/methodology/REGISTRY.md:L1521-L1526. Concrete fix: none.

Code Quality

  • Severity: P3 informational. Impact: the previous re-review code-quality concern around practitioner guidance is addressed. _handle_synthetic() no longer assumes variance_method='jackknife' implies LOO output is available, and the new regression executes the snippet against a valid single-treated-unit jackknife result without raising in diff_diff/practitioner.py:L539-L557 and tests/test_methodology_sdid.py:L1998-L2024. Concrete fix: none.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No new untracked tech-debt findings.

Security

  • Severity: P2. Impact: _fit_snapshot is now excluded from pickle via __getstate__(), but it remains a declared dataclass field on SyntheticDiDResults in diff_diff/results.py:L780-L810, and every fit still attaches the full retained panel snapshot in diff_diff/synthetic_did.py:L593-L647. That means pickle round-trips are safer, but generic dataclass-recursive serializers still have a straightforward path to copy the stored outcome matrices, unit IDs, and survey weights. The new tests only cover pickle behavior in tests/test_methodology_sdid.py:L2027-L2078. Concrete fix: remove _fit_snapshot from the dataclass field list and attach it as a plain private attribute after constructing SyntheticDiDResults, or provide a dedicated safe-export path and explicitly document dataclasses.asdict()-style serialization as unsupported for these results.

Documentation/Tests

  • Severity: P3 informational (tracked in TODO.md). Impact: the tracked autosummary page for SyntheticDiDResults still lists only the pre-existing methods/attributes in docs/api/_autosummary/diff_diff.SyntheticDiDResults.rst:L14-L53, so the checked-in API docs lag the new public diagnostics unless the stubs are refreshed. This matches the already-tracked stale-autosummary item in TODO.md:L88-L99, so it is not a blocker. Concrete fix: regenerate docs/api/_autosummary/* or stop versioning those stubs and treat them as build output.

Previous change added __getstate__ to block pickle-based serialization of
the retained panel state, but these fields were still declared on the
SyntheticDiDResults dataclass, so dataclass-recursive serializers
(dataclasses.asdict, dataclasses.fields, dataclasses.replace) could still
reach outcome matrices, unit IDs, and survey weights.

Fix by moving _fit_snapshot, _loo_unit_ids, and _loo_roles out of the
dataclass field list entirely. They are now plain instance attributes
initialized to None in __post_init__ and populated by SyntheticDiD.fit()
after result construction. The dataclass machinery no longer sees them.

__getstate__ is kept as defense-in-depth for pickle: plain instance
attributes still land in __dict__ and would otherwise pickle normally.

Tests: dataclass_fields() and asdict() do not surface the internal state,
plus existing pickle / LOO / in_time_placebo / sensitivity tests still
pass against the refactored attachment pattern.
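The distinction this commit relies on can be sketched directly: state attached as a plain instance attribute (rather than a declared dataclass field) is invisible to `dataclasses.fields()` and `dataclasses.asdict()`. Names are illustrative only:

```python
import dataclasses

@dataclasses.dataclass
class Results:
    att: float

    def __post_init__(self):
        # Plain instance attribute, not a dataclass field.
        self._fit_snapshot = None

res = Results(att=2.0)
res._fit_snapshot = {"Y": [[1.0, 2.0]]}  # attached after construction

# Dataclass machinery sees only the declared field.
print([f.name for f in dataclasses.fields(res)])
print(dataclasses.asdict(res))
```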

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber
Owner Author

igerber commented Apr 16, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: f0c9e5c8e63412418a07c0b041cb37a9cf5c0bad


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review focus: the prior practitioner failure is fixed. _handle_synthetic() now guards the jackknife LOO snippet on _loo_unit_ids instead of unconditionally calling get_loo_effects_df() in diff_diff/practitioner.py#L548, with regression coverage in tests/test_methodology_sdid.py#L1998.
  • The methodology impact is limited to post-fit SDID diagnostics on SyntheticDiDResults; the ATT and SE paths still run through the existing estimator, variance methods, and safe_inference() in diff_diff/synthetic_did.py#L474 and diff_diff/synthetic_did.py#L1193. That is consistent with the SDID paper and the official synthdid variance/diagnostic workflow. (aeaweb.org)
  • The prior serializer/privacy concern is addressed: _fit_snapshot and _loo_* are no longer dataclass fields, __getstate__() strips the fit snapshot on pickle, and regressions cover pickle, dataclasses.fields(), and dataclasses.asdict() in diff_diff/results.py#L786, diff_diff/results.py#L802, and tests/test_methodology_sdid.py#L2027.
  • The only remaining item I found is informational: the two methodology deviations are explicitly documented in docs/methodology/REGISTRY.md#L1521, and the stale autosummary stub for SyntheticDiDResults remains tracked in TODO.md#L88.
  • Validation note: this was a static review only. I could not run pytest here because the environment is missing pytest and numpy.

Methodology

  • Severity: P3 informational. Impact: the affected methods are post-fit diagnostics only (get_loo_effects_df(), trajectories, get_weight_concentration(), in_time_placebo(), sensitivity_to_zeta_omega()), not estimator or variance changes; the underlying SDID estimator and variance surface remain aligned with Arkhangelsky et al. and official synthdid, and official synthdid already exposes weight, trajectory, and placebo-inspection surfaces. The only deviations I found are the REGISTRY-noted zeta reuse in docs/methodology/REGISTRY.md#L1521 and fixed time weights in docs/methodology/REGISTRY.md#L1524, so this is non-blocking. Concrete fix: none. (aeaweb.org)

Code Quality

Performance

  • Severity: none. Impact: no performance regression stood out in the changed code. The new re-estimation helpers are opt-in result methods, and the main fit() path only adds snapshot/trajectory retention in diff_diff/synthetic_did.py#L593. Concrete fix: none.

Maintainability

Tech Debt

  • Severity: none. Impact: no new untracked tech-debt finding. Concrete fix: none.

Security

Documentation/Tests

@igerber igerber added the ready-for-ci Triggers CI test workflows label Apr 16, 2026
@igerber igerber merged commit 3b1fe6b into main Apr 17, 2026
23 of 24 checks passed
@igerber igerber deleted the sdid-validation branch April 17, 2026 00:43
igerber added a commit that referenced this pull request Apr 17, 2026
…ides

PR #309 added four validation diagnostics to SyntheticDiDResults:
`get_weight_concentration()`, `get_loo_effects_df()`, `in_time_placebo()`,
and `sensitivity_to_zeta_omega()`. Because this PR is the first place the
guides ship inside the wheel, we want them faithful to main's API at the
moment of merge.

- `llms-full.txt`: add the four methods to the SyntheticDiDResults
  Methods line and a short "Validation diagnostics" subsection describing
  each.
- `llms-practitioner.txt`: split the former `SyntheticDiD/TROP` bullet so
  SyntheticDiD now points at the built-in helpers (with the jackknife
  caveat for LOO); TROP keeps the generic guidance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>