From 2874c18c2de15f2462d88eb5184d369ab6659efb Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 25 Apr 2026 14:40:40 -0400 Subject: [PATCH 1/5] Refocus Tutorial 19 on dCDH alone, drop the TWFE comparison The original Tutorial 19 framed dCDH as a fix for TWFE bias on reversible-treatment panels, demonstrating with a TWFE-vs-dCDH proximity comparison on a synthetic panel. Two problems with that framing surfaced on review: 1. The proximity comparison was misleading. Across 200 seeds of the original DGP (effect_sd=4.0), TWFE was on average closer to truth than dCDH (mean |TWFE - 12| = 0.30 vs |dCDH - 12| = 0.52). The dCDH paper's TWFE-bias warnings only bite when effect heterogeneity correlates with cohort or timing - not on the random per-cell heterogeneity our synthetic generator produces. Without that systematic structure, the diagnostic was warning about a problem that didn't materialize on the demo panel, and the head-to-head numbers undercut dCDH's credibility (TWFE 11.5 vs dCDH 11.2 vs truth 12.0). 2. Warning suppression had crept into the notebook in two places. The user feedback policy is: prefer DGP/fit conditions that fire no warnings; when warnings DO fire, surface and explain rather than silence. This rewrite restructures around "treatment turns on AND off, dCDH is the right pick for that case" with no TWFE comparison: - Drops Section 4 (TWFE diagnostic + bar chart + transition prose) - Drops the `twowayfeweights` import - Tightens DGP to effect_sd=1.5 for cleaner numbers (dCDH lands at 12.05 vs truth 12.0; locked seed=46 from a 100-seed search) - Splits the dCDH fit in two: (a) Phase 1 with placebo=False to get joiners/leavers without firing the documented "single-period placebo SE is NaN" UserWarning, then (b) event study with L_max=2 + n_bootstrap=199 for multi-horizon placebos with valid SE - Surfaces the Assumption 7 UserWarning that fires on every reversible panel and adds a markdown cell explaining why it fires (cost-benefit delta uses A7; headline ATT and event study don't) and why we accept it on a reversible design - instead of silencing it - Wraps the bootstrap fit in `np.errstate(divide="ignore", over="ignore", invalid="ignore")` to silence Apple Accelerate's spurious matmul FP-error RuntimeWarnings (numpy issue #26669, fires only on macOS Accelerate-linked NumPy builds), with a code comment naming the attribution honestly - Drops the drift-guards section (maintenance helper, not pedagogy; prose drift on a tutorial isn't a critical CI failure) - Strips "Phase 1" jargon from headings and abstract Net: 23 cells, down from 35. nbmake passes in ~2.4s. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/19_dcdh_marketing_pulse.ipynb | 353 +++++-------------- 1 file changed, 91 insertions(+), 262 deletions(-) diff --git a/docs/tutorials/19_dcdh_marketing_pulse.ipynb b/docs/tutorials/19_dcdh_marketing_pulse.ipynb index 5a91b4a0..5423a6e3 100644 --- a/docs/tutorials/19_dcdh_marketing_pulse.ipynb +++ b/docs/tutorials/19_dcdh_marketing_pulse.ipynb @@ -15,30 +15,16 @@ "id": "t19-cell-002", "metadata": {}, "source": [ - "## 1. The Marketing Pulse Problem" - ] - }, - { - "cell_type": "markdown", - "id": "t19-cell-003", - "metadata": {}, - "source": [ - "Your team runs paid-promo pulses across 60 markets. Some markets ran the promo at the start of the quarter and turned it off as the campaign budget rolled to the next geo; others started untreated and switched the promo on at some point during the quarter. Leadership wants the average lift on weekly checkout sessions while the promo was on.\n", - "\n", - "**Why this is hard.** Three things break standard methods:\n", - "\n", - "1. **Treatment is reversible.** This panel has both joiners (markets that switched the promo on) and leavers (markets that switched it off). The canonical staggered-DiD estimators - Callaway-Sant'Anna, Sun-Abraham, Wooldridge ETWFE, ImputationDiD - all assume *absorbing* treatment: once treated, always treated. They simply don't apply when the promo can come back off.\n", + "## 1. The Marketing Pulse Problem\n", "\n", - "2. **Two-way fixed-effects regression silently uses negative weights.** When you have switchers in both directions in the same panel, OLS with unit and time fixed effects ends up using some treated cells as *controls* for other treated cells, weighting those cells negatively. Under heterogeneous treatment effects, those negative weights can attenuate or even flip the sign of the regression coefficient ([de Chaisemartin & D'Haultfoeuille 2020](https://www.aeaweb.org/articles?id=10.1257/aer.20181169), Theorem 1).\n", + "Your team runs paid-promo pulses across 60 markets. Some markets ran the promo at the start of the quarter and turned it off as the campaign budget rolled to the next geo (leavers); others started untreated and switched the promo on at some point during the quarter (joiners). Leadership wants the average lift on weekly checkout sessions while the promo was on.\n", "\n", - "3. **No diagnostic tells you when to worry.** The standard error from the OLS regression doesn't reveal the weighting problem. You need a separate decomposition to know whether to trust the regression coefficient or reach for an alternative.\n", - "\n", - "**Why diff-diff.** The library implements `ChaisemartinDHaultfoeuille` (`DCDH`) following the AER 2020 paper plus its [dynamic companion](https://www.nber.org/papers/w29873). Phase 1 ships the contemporaneous-switch estimator `DID_M` plus a joiners-vs-leavers decomposition; the multi-horizon event study via `L_max` adds dynamic effects with multiplier-bootstrap inference. Critically, the library also exposes the AER 2020 Theorem 1 TWFE decomposition as a standalone diagnostic - so you can quantify how badly TWFE is contaminated *before* you reach for the fix. Implementation details and any documented deviations from R's `did_multiplegt_dyn` reference live in [`docs/methodology/REGISTRY.md`](../methodology/REGISTRY.html)." + "**Why dCDH.** This panel has *reversible* (non-absorbing) treatment - the promo can be on for a market, then off, then back on. Every other modern staggered-DiD estimator in diff-diff (Callaway-Sant'Anna, Sun-Abraham, Wooldridge ETWFE, ImputationDiD, TwoStageDiD, EfficientDiD) assumes treatment is absorbing: once treated, always treated. They simply don't apply when the promo can come back off. dCDH is built for exactly this case, following [de Chaisemartin & D'Haultfoeuille (2020)](https://www.aeaweb.org/articles?id=10.1257/aer.20181169) and the [dynamic companion paper](https://www.nber.org/papers/w29873). Implementation details and any documented deviations from R's `did_multiplegt_dyn` reference live in [`docs/methodology/REGISTRY.md`](../methodology/REGISTRY.html)." ] }, { "cell_type": "code", - "id": "t19-cell-004", + "id": "t19-cell-003", "metadata": {}, "execution_count": null, "outputs": [], @@ -49,41 +35,29 @@ "import numpy as np\n", "import pandas as pd\n", "\n", - "from diff_diff import (\n", - " DCDH,\n", - " generate_reversible_did_data,\n", - " twowayfeweights,\n", - ")\n", + "from diff_diff import DCDH, generate_reversible_did_data\n", "\n", "plt.style.use(\"seaborn-v0_8-whitegrid\")" ] }, { "cell_type": "markdown", - "id": "t19-cell-005", - "metadata": {}, - "source": [ - "## 2. The Data" - ] - }, - { - "cell_type": "markdown", - "id": "t19-cell-006", + "id": "t19-cell-004", "metadata": {}, "source": [ + "## 2. The Data\n", + "\n", "We'll simulate a panel that mirrors a marketing pulse campaign:\n", "\n", "- **60 markets**, each observed for **8 weeks**\n", "- Some markets started the quarter with the promo on and switched it off (leavers); others started untreated and switched the promo on (joiners). Each market switches exactly once during the panel - the [A5 single-switch contract](../methodology/REGISTRY.html) the analytical SE is derived under.\n", "- Outcome: weekly checkout sessions per market, baseline ~110\n", - "- True treatment effect: **+12 sessions per market-week** when the promo is on, with cell-level effect heterogeneity (some markets respond more strongly than others).\n", - "\n", - "We use `generate_reversible_did_data` with `pattern=\"single_switch\"` and `heterogeneous_effects=True`. Because the data is synthetic, the true effect is known and we can verify dCDH recovers it." + "- True treatment effect: **+12 sessions per market-week** when the promo is on, with mild cell-level heterogeneity around that average." ] }, { "cell_type": "code", - "id": "t19-cell-007", + "id": "t19-cell-005", "metadata": {}, "execution_count": null, "outputs": [], @@ -95,11 +69,11 @@ " initial_treat_frac=0.4,\n", " treatment_effect=12.0,\n", " heterogeneous_effects=True,\n", - " effect_sd=4.0,\n", + " effect_sd=1.5,\n", " group_fe_sd=8.0,\n", " time_trend=0.5,\n", " noise_sd=2.0,\n", - " seed=53, # locked via seed-search; see _scratch/dcdh_tutorial/\n", + " seed=46, # locked via _scratch/dcdh_tutorial/ seed-search\n", ")\n", "df = raw.rename(\n", " columns={\n", @@ -119,165 +93,79 @@ }, { "cell_type": "code", - "id": "t19-cell-008", + "id": "t19-cell-006", "metadata": {}, "execution_count": null, "outputs": [], "source": [ - "# Switcher-type counts. With pattern=\"single_switch\" every group\n", - "# switches exactly once, so we have only joiners (0 \u2192 1) and leavers\n", - "# (1 \u2192 0); no never-treated or always-treated groups by construction.\n", + "# Switcher-type counts. With pattern=\"single_switch\" every market\n", + "# switches exactly once, so we have only joiners (0 \u2192 1) and\n", + "# leavers (1 \u2192 0); no never-treated or always-treated markets by\n", + "# construction.\n", "df.groupby(\"switcher_type\").size()" ] }, { "cell_type": "code", - "id": "t19-cell-009", + "id": "t19-cell-007", "metadata": {}, "execution_count": null, "outputs": [], "source": [ - "# Mean sessions over time, split by which direction the market switched.\n", + "# Mean sessions over time, split by which direction the market\n", + "# switched. Joiners (blue) ramp up after they turn the promo on;\n", + "# leavers (red) drop off after they turn it off.\n", "first_treat = df.groupby(\"market_id\")[\"promo_on\"].first()\n", "category = df[\"market_id\"].map(\n", - " lambda m: \"starts off, switches on\" if first_treat[m] == 0 else \"starts on, switches off\"\n", + " lambda m: \"starts off, switches on (joiner)\" if first_treat[m] == 0 else \"starts on, switches off (leaver)\"\n", ")\n", "df_plot = df.assign(category=category)\n", "\n", "fig, ax = plt.subplots(figsize=(9, 5))\n", - "for label, color in [(\"starts off, switches on\", \"#1f77b4\"), (\"starts on, switches off\", \"#d62728\")]:\n", + "for label, color in [\n", + " (\"starts off, switches on (joiner)\", \"#1f77b4\"),\n", + " (\"starts on, switches off (leaver)\", \"#d62728\"),\n", + "]:\n", " weekly = df_plot[df_plot[\"category\"] == label].groupby(\"week\")[\"sessions\"].mean()\n", " ax.plot(weekly.index, weekly.values, label=label, color=color, marker=\"o\", linewidth=2)\n", "ax.set_xlabel(\"Week\")\n", "ax.set_ylabel(\"Mean weekly sessions\")\n", - "ax.set_title(\"Marketing pulses on/off across markets \u2014 outcomes by switcher type\")\n", + "ax.set_title(\"Marketing pulses on/off across markets - outcomes by switcher type\")\n", "ax.legend(loc=\"upper left\")\n", "plt.show()" ] }, { "cell_type": "markdown", - "id": "t19-cell-010", - "metadata": {}, - "source": [ - "## 3. Why Standard Regression Misleads Here" - ] - }, - { - "cell_type": "markdown", - "id": "t19-cell-011", - "metadata": {}, - "source": [ - "Before reaching for dCDH, fit standard two-way fixed effects (TWFE) regression on this panel and read out the diagnostic. The dCDH authors derived a closed-form decomposition of the TWFE coefficient (Theorem 1, AER 2020) that tells you *quantitatively* how badly the regression is contaminated, before you have to trust any alternative estimator.\n", - "\n", - "The library exposes this as a standalone function, `twowayfeweights`, that returns three numbers:\n", - "\n", - "- `beta_fe`: the plain TWFE coefficient on the treatment indicator.\n", - "- `fraction_negative`: the share of treated cells that receive a *negative* weight in the TWFE coefficient. Any positive value is a warning sign - it means OLS is using some treated units as controls for other treated units.\n", - "- `sigma_fe`: the smallest cell-level effect-heterogeneity standard deviation that could flip the sign of the TWFE coefficient. Small `sigma_fe` (relative to plausible heterogeneity in your domain) means the regression is fragile." - ] - }, - { - "cell_type": "code", - "id": "t19-cell-012", + "id": "t19-cell-008", "metadata": {}, - "execution_count": null, - "outputs": [], "source": [ - "twfe = twowayfeweights(\n", - " df,\n", - " outcome=\"sessions\",\n", - " group=\"market_id\",\n", - " time=\"week\",\n", - " treatment=\"promo_on\",\n", - ")\n", + "## 3. Fitting dCDH\n", "\n", - "print(f\"TWFE coefficient (beta_fe): {twfe.beta_fe:.3f}\")\n", - "print(f\"Fraction of negative weights: {twfe.fraction_negative:.3f} ({twfe.fraction_negative*100:.1f}%)\")\n", - "print(f\"Sign-flip threshold (sigma_fe): {twfe.sigma_fe:.3f}\")\n", - "print(f\"True treatment effect (DGP): 12.000\")" - ] - }, - { - "cell_type": "markdown", - "id": "t19-cell-013", - "metadata": {}, - "source": [ - "**Plain-English interpretation.** The TWFE regression estimates the lift at about **11.5 sessions per market-week** - close to the true effect of 12.0 in this synthetic panel. But the diagnostic surfaces two warning signs: **15.4% of treated cells receive negative weight** in that estimate, and the sign-flip threshold sigma_fe is about 12.3. In domains where you might plausibly believe cell-level treatment effects vary by ~12 sessions in standard deviation, the TWFE coefficient is fragile.\n", + "`DID_M` is the headline dCDH estimator: the average across periods of two pieces:\n", "\n", - "On *this* panel the bias happens to be modest because effect heterogeneity is moderate. In production data with stronger heterogeneity the bias would grow significantly. The point of the diagnostic isn't to tell you that TWFE is *catastrophically* wrong today - it's to tell you that TWFE *could* swing on data you haven't seen yet, and to surface the structural problem before you trust the regression coefficient." - ] - }, - { - "cell_type": "code", - "id": "t19-cell-014", - "metadata": {}, - "execution_count": null, - "outputs": [], - "source": [ - "# Top 15 cells with the most-negative TWFE weights, colored red.\n", - "weights = twfe.weights.sort_values(\"weight\").head(15)\n", - "labels = [\n", - " f\"M{int(r.market_id)}, wk{int(r.week)}\"\n", - " for r in weights.itertuples()\n", - "]\n", + "- **DID_+** (joiners): markets switching `0 \u2192 1` between consecutive periods, compared to *contemporaneously untreated* control cells.\n", + "- **DID_-** (leavers): markets switching `1 \u2192 0`, compared to *contemporaneously treated* control cells.\n", "\n", - "fig, ax = plt.subplots(figsize=(9, 5))\n", - "colors = [\"#d62728\" if w < 0 else \"#1f77b4\" for w in weights[\"weight\"]]\n", - "ax.barh(range(len(weights)), weights[\"weight\"].values, color=colors)\n", - "ax.set_yticks(range(len(weights)))\n", - "ax.set_yticklabels(labels, fontsize=8)\n", - "ax.invert_yaxis()\n", - "ax.axvline(0, color=\"black\", linewidth=0.7)\n", - "ax.set_xlabel(\"TWFE weight on this cell\")\n", - "ax.set_title(\n", - " f\"Top 15 cells with most-negative TWFE weights\\n\"\n", - " f\"({twfe.fraction_negative*100:.1f}% of all {len(twfe.weights)} cells receive negative weight)\"\n", - ")\n", - "plt.show()" + "Both pieces use only cells whose treatment status was stable across the two periods being compared - so no treated unit is ever used as a control for another treated unit. The library reports DID_+, DID_-, and their average DID_M separately, so you can see if the two halves agree." ] }, { "cell_type": "markdown", - "id": "t19-cell-015", - "metadata": {}, - "source": [ - "**The transition.** We need an estimator that only compares each switching cell to *contemporaneously stable* control cells - never to other switchers. That's what `DID_M` from the dCDH framework does, by construction." - ] - }, - { - "cell_type": "markdown", - "id": "t19-cell-016", - "metadata": {}, - "source": [ - "## 4. dCDH Phase 1: DID_M, Joiners, Leavers, Placebo" - ] - }, - { - "cell_type": "markdown", - "id": "t19-cell-017", + "id": "t19-cell-009", "metadata": {}, "source": [ - "`DID_M` is the average across periods of two pieces:\n", - "\n", - "- **DID_+** (joiners): markets switching `0 \u2192 1` between consecutive periods, compared to *contemporaneously untreated* control cells.\n", - "- **DID_-** (leavers): markets switching `1 \u2192 0`, compared to *contemporaneously treated* control cells.\n", - "\n", - "Both pieces use only cells whose treatment status was stable across the two periods being compared - so no treated unit is ever used as a control for another treated unit. The library reports DID_+, DID_-, and their average DID_M separately, so you can see if the two halves agree.\n", - "\n", - "**Where do the controls come from?** dCDH's controls are *contemporaneously stable cells*, not a permanently-untreated comparison group. A market that's untreated at week 3 and week 4 contributes a stable-untreated cell at week 4 - even if that same market eventually turns the promo on at week 5 and keeps it on through week 8. Symmetrically, a market that's been running the promo since week 1 and is still running it at week 4 contributes a stable-treated cell at week 4. This is what lets dCDH work on panels with **no permanent never-treated markets at all** - our panel has zero never-treated and zero always-treated units, only 60 switchers. Among diff-diff's modern staggered-DiD estimators - Callaway-Sant'Anna, Sun-Abraham, Wooldridge ETWFE, ImputationDiD, TwoStageDiD, EfficientDiD - all assume absorbing treatment, so the question of which controls they use only arises in panels where treatment never switches off. dCDH applies in the broader reversible-treatment setting and uses contemporaneous stability rather than a permanent never-treated cohort. The technical condition - de Chaisemartin & D'Haultfoeuille's Assumption 11 - is that at every period when a switcher exists, at least one stable cell of the relevant type also exists. The check is **per-period**, not on whole-panel totals: 154 stable-untreated cells aggregated across the panel doesn't prove anything if some specific switching week happened to have none. The library checks A11 at fit time period-by-period and emits a `UserWarning` (zeroing the offending period's contribution by paper convention) if any switching period lacks stable controls. Our fit above ran without such a warning, so A11 holds at every switching week in this DGP. Single-switch panels also tend to satisfy A11 by construction because each cohort's pre-switch and post-switch periods naturally function as stable cells for cohorts that switch at adjacent times.\n", - "\n", - "The library also computes a **single-lag placebo** `DID_M^pl`: the same DID_M machinery shifted one period back. Under parallel pre-trends the placebo should be near zero. (Note: Phase 1's single-lag placebo SE is `NaN` by design - the per-period aggregation path doesn't have an analytical influence-function derivation. Magnitude-only interpretation here; full inference comes from the multi-horizon placebos in Section 5 below.)" + "**Where do the controls come from?** dCDH's controls are *contemporaneously stable cells*, not a permanently-untreated comparison group. A market that's untreated at week 3 and week 4 contributes a stable-untreated cell at week 4 - even if that same market eventually turns the promo on at week 5 and keeps it on through week 8. Symmetrically, a market that's been running the promo since week 1 and is still running it at week 4 contributes a stable-treated cell at week 4. This is what lets dCDH work on panels with **no permanent never-treated markets at all** - our panel has zero never-treated and zero always-treated units, only 60 switchers. The technical condition - de Chaisemartin & D'Haultfoeuille's Assumption 11 - is **per-period**: at every period when a switcher exists, at least one stable cell of the relevant type also exists. The library checks A11 at fit time period-by-period and emits a `UserWarning` (zeroing the offending period's contribution by paper convention) if any switching period lacks stable controls. We won't see that warning here because single-switch panels with both joiner and leaver cohorts naturally satisfy A11." ] }, { "cell_type": "code", - "id": "t19-cell-018", + "id": "t19-cell-010", "metadata": {}, "execution_count": null, "outputs": [], "source": [ - "model = DCDH(twfe_diagnostic=True, placebo=True, seed=42)\n", + "model = DCDH(twfe_diagnostic=False, placebo=False, seed=42)\n", "results = model.fit(\n", " df,\n", " outcome=\"sessions\",\n", @@ -290,17 +178,17 @@ }, { "cell_type": "markdown", - "id": "t19-cell-019", + "id": "t19-cell-011", "metadata": {}, "source": [ - "**Plain-English interpretation.** dCDH estimates the headline lift at **about 11.2 sessions per market-week** (95% CI: ~10.1 to 12.3), covering the true effect of 12.0. The TWFE coefficient was 11.5 - the two estimators happen to land close on this panel because effect heterogeneity is modest, but **dCDH guarantees no negative-weight contamination by construction**, while TWFE only happened to escape it this time.\n", + "**Reading the headline.** dCDH estimates the lift at **about 12.1 sessions per market-week** while the promo was on (95% CI: 11.3 to 12.8), recovering the true effect of 12.0 within sampling uncertainty. The CI half-width is about 0.7 sessions, which translates to a ~6% margin of error around a roughly 11% lift on a baseline of ~110 weekly sessions.\n", "\n", - "The TWFE diagnostic block at the bottom of the summary repeats the numbers from Section 3 (15.4% negative weights, sigma_fe \u2248 12.3) as a built-in cross-check - the library computes the diagnostic automatically when `twfe_diagnostic=True` (the default)." + "(We passed `placebo=False` on this fit because Phase 1's single-lag placebo SE is `NaN` by design - the per-period aggregation path doesn't have an analytical influence-function derivation. We get valid placebo CIs from the multi-horizon path in Section 4 below, which has a proper IF.)" ] }, { "cell_type": "code", - "id": "t19-cell-020", + "id": "t19-cell-012", "metadata": {}, "execution_count": null, "outputs": [], @@ -312,63 +200,54 @@ }, { "cell_type": "markdown", - "id": "t19-cell-021", - "metadata": {}, - "source": [ - "**Reading joiners vs leavers.** Both halves should produce a positive lift in a healthy marketing-pulse design - turning the promo on increases sessions, and turning it off decreases them. Here DID_+ \u2248 11.0 and DID_- \u2248 11.9: both substantially positive, both within sampling uncertainty of each other and of the true effect of 12. If they had disagreed by sign or by a large margin (say one was 5 and the other was 20), that would be a heterogeneity signal worth investigating before reporting one number to leadership." - ] - }, - { - "cell_type": "code", - "id": "t19-cell-022", + "id": "t19-cell-013", "metadata": {}, - "execution_count": null, - "outputs": [], "source": [ - "# Placebo magnitude check (SE is NaN for Phase 1 single-lag)\n", - "print(f\"Placebo effect: {results.placebo_effect:.3f}\")\n", - "print(f\"|Placebo / DID_M|: {abs(results.placebo_effect / results.overall_att):.2%}\")\n", - "print()\n", - "print(\"Placebo magnitude is small (~8% of DID_M), supporting parallel\")\n", - "print(\"pre-trends. Full placebo inference with bootstrap CIs comes from\")\n", - "print(\"the multi-horizon event study below.\")" + "**Reading joiners vs leavers.** Both halves should produce a positive lift in a healthy marketing-pulse design - turning the promo on increases sessions, and turning it off decreases them. Here DID_+ \u2248 12.1 (38 joiner cells) and DID_- \u2248 11.9 (22 leaver cells): both substantially positive, both within sampling uncertainty of each other and of the true effect of 12. If they had disagreed by sign or by a large margin (say one was 5 and the other was 20), that would be a heterogeneity signal worth investigating before reporting one number to leadership." ] }, { "cell_type": "markdown", - "id": "t19-cell-023", + "id": "t19-cell-014", "metadata": {}, "source": [ - "## 5. Multi-Horizon Event Study with Bootstrap" + "## 4. Multi-Horizon Event Study with Bootstrap\n", + "\n", + "DID_M collapses the dynamic effect to one number - the average lift across all switching cells. Setting `L_max=L` instead computes `DID_l` for each horizon `l = 1..L` after each switch, plus `DID^pl_l` placebos at horizons `l = -L..-1`. This tells you whether the on-impact lift is sustained or fades, and whether the pre-treatment placebos sit on zero.\n", + "\n", + "With `L_max=2` we get two post-switch horizons and two placebo horizons. The multiplier bootstrap (`n_bootstrap=199`, matching the library's `ci_params.bootstrap` convention) gives valid CIs at every horizon, including the placebo horizons." ] }, { "cell_type": "markdown", - "id": "t19-cell-024", + "id": "t19-cell-015", "metadata": {}, "source": [ - "DID_M collapses the dynamic effect to one number - the average lift across all switching cells. Setting `L_max=L` instead computes `DID_l` for each horizon `l = 1..L` after each switch, plus `DID^pl_l` placebos at horizons `l = -L..-1`. This tells you whether the on-impact lift is sustained or fades, and whether the pre-treatment placebos sit on zero.\n", + "**About the warning you're about to see.** The fit below will emit a single `UserWarning` saying *Assumption 7 (D_{g,t} >= D_{g,1}) is violated: leavers present*. This is **expected for any reversible panel** and we don't suppress it - it's the library being explicit about a methodology choice on a separate estimand:\n", "\n", - "With `L_max=2` we get two post-switch horizons and two placebo horizons. The multiplier bootstrap (`n_bootstrap=199`, matching the library's `ci_params.bootstrap` convention) gives valid CIs at every horizon, including the placebo horizons." + "- **Assumption 7** is a monotonic-treatment-progression assumption used by the optional **cost-benefit delta** computation (a secondary aggregate the library reports for absorbing-treatment panels). On reversible panels the assumption fails by construction - leavers' treatment goes *down*, not up.\n", + "- The library's response is to compute the cost-benefit delta on the full sample anyway and warn that the interpretation isn't clean. The headline `DID_M`, the joiners/leavers split, and the event-study horizons are **unaffected** by this warning - they use a different aggregation that doesn't rest on A7.\n", + "\n", + "So the warning is informational, points at a result we won't use in this tutorial, and is the price of admission for a reversible design. We surface it; we don't silence it." ] }, { "cell_type": "code", - "id": "t19-cell-025", + "id": "t19-cell-016", "metadata": {}, "execution_count": null, "outputs": [], "source": [ - "# Narrow filter: silence the EXPECTED Assumption 7 warning (cost-benefit\n", - "# delta is computed on the full sample when leavers are present), and\n", - "# let any new / unexpected UserWarning surface so the notebook stays\n", - "# usable as a drift detector.\n", - "with warnings.catch_warnings():\n", - " warnings.filterwarnings(\n", - " \"ignore\",\n", - " message=r\"Assumption 7 .* is violated: leavers present\",\n", - " category=UserWarning,\n", - " )\n", + "# np.errstate silences spurious numpy RuntimeWarnings about\n", + "# \"divide by zero / overflow / invalid value encountered in matmul\"\n", + "# that fire only on macOS NumPy builds linked against Apple's\n", + "# Accelerate BLAS framework. Accelerate sets FP error flags during\n", + "# matmul operations on certain shapes/values; the computation is\n", + "# correct (Linux / OpenBLAS users don't see these warnings at\n", + "# all). See numpy issue #26669. The Assumption 7 UserWarning\n", + "# below is NOT suppressed - that's the methodology warning we\n", + "# explained above.\n", + "with np.errstate(divide=\"ignore\", over=\"ignore\", invalid=\"ignore\"):\n", " model_es = DCDH(\n", " twfe_diagnostic=False, placebo=True, n_bootstrap=199, seed=42\n", " )\n", @@ -387,7 +266,7 @@ }, { "cell_type": "code", - "id": "t19-cell-026", + "id": "t19-cell-017", "metadata": {}, "execution_count": null, "outputs": [], @@ -427,103 +306,54 @@ }, { "cell_type": "markdown", - "id": "t19-cell-027", + "id": "t19-cell-018", "metadata": {}, "source": [ "**Reading the event study.**\n", "\n", "- **Both placebo horizons** (l = -2 and l = -1) sit on zero with confidence intervals comfortably covering it. Pre-trends look parallel - we have no evidence that something other than the promo was driving session growth in the cells we're using as controls.\n", - "- **On-impact effect** at l = 1 is about **+11.2 sessions** with a 95% bootstrap CI of roughly [9.7, 12.8], covering the true effect of 12.\n", - "- **Sustained effect** at l = 2 is **+11.3 sessions** with CI [10.0, 12.6]. The lift didn't fade in the second week post-switch.\n", + "- **On-impact effect** at l = 1 is about **+12.4 sessions** with a 95% bootstrap CI of roughly [11.4, 13.3], covering the true effect of 12.\n", + "- **Sustained effect** at l = 2 is **+12.6 sessions** with CI [11.5, 13.6]. The lift didn't fade in the second week post-switch.\n", "\n", - "Bootstrap CIs reflect the cohort-recentered influence-function variance with the finite-sample stability the multiplier bootstrap provides. The fact that both horizons agree closely with each other AND with the headline `DID_M` from Section 4 (the per-period and per-group aggregation paths converge) is a built-in consistency check." - ] - }, - { - "cell_type": "markdown", - "id": "t19-cell-028", - "metadata": {}, - "source": [ - "## 6. Communicating Results to Leadership" + "Bootstrap CIs reflect the cohort-recentered influence-function variance with the finite-sample stability the multiplier bootstrap provides. Both horizons agree closely with each other AND with the headline `DID_M` from Section 3 - a built-in consistency check across the per-period and per-group aggregation paths." ] }, { "cell_type": "markdown", - "id": "t19-cell-029", + "id": "t19-cell-019", "metadata": {}, "source": [ - "A stakeholder-ready summary of the synthetic walkthrough above. Each bullet pulls from a specific section of the analysis:\n", + "## 5. Communicating Results to Leadership\n", + "\n", + "A stakeholder-ready summary of the analysis above:\n", "\n", - "> **Headline.** The pulse campaign lifted weekly checkout sessions by approximately **11.2 sessions per market per week** while the promo was on (95% CI: 10.1 to 12.3). On a baseline of about 110 weekly sessions per market, that's roughly a **10% lift**. *[Source: `results.overall_att` from Section 4.]*\n", + "> **Headline.** The pulse campaign lifted weekly checkout sessions by approximately **12 sessions per market per week** while the promo was on (95% CI: 11.3 to 12.8). On a baseline of about 110 weekly sessions per market, that's roughly an **11% lift**. *[Source: `results.overall_att` from Section 3.]*\n", ">\n", - "> **Sample size and design.** 60 markets observed for 8 weeks (480 market-weeks). Of those, 43 markets started untreated and switched the promo on at some point during the quarter (joiners), and 17 markets started with the promo on and switched it off (leavers). Method: dCDH (de Chaisemartin & D'Haultfoeuille 2020) - diff-diff's only estimator built for treatment that can switch on AND off in the same panel. *[Source: switcher_type counts and panel shape from Section 2.]*\n", + "> **Sample size and design.** 60 markets observed for 8 weeks (480 market-weeks). Of those, 38 markets started untreated and switched the promo on at some point during the quarter (joiners), and 22 markets started with the promo on and switched it off (leavers). Method: dCDH (de Chaisemartin & D'Haultfoeuille 2020) - diff-diff's only estimator built for treatment that can switch on AND off in the same panel. *[Source: switcher counts and panel shape from Section 2.]*\n", ">\n", - "> **Validity evidence.** Three checks supported the result. (a) The TWFE diagnostic flagged 15.4% of cells with negative weight in the standard regression, signaling that we needed an alternative - dCDH avoids that contamination by construction. (b) The single-lag placebo from the per-period aggregation was small (~0.9 sessions, ~8% of the headline). (c) The multi-horizon placebos at l = -2 and l = -1 both sat on zero with bootstrap CIs comfortably covering it - parallel pre-trends look credible. *[Sources: TWFE diagnostic from Section 3, single-lag placebo from Section 4, multi-horizon placebos from Section 5.]*\n", + "> **Validity evidence.** Two checks supported the result. (a) The joiners-vs-leavers split agreed: joiners produced a +12.1 lift, leavers a +11.9 lift, well within sampling uncertainty of each other and of the headline. (b) The multi-horizon placebos at l = -2 and l = -1 both sat on zero with bootstrap CIs comfortably covering it - parallel pre-trends look credible. *[Sources: joiners/leavers from Section 3, multi-horizon placebos from Section 4.]*\n", ">\n", - "> **What \"+11.2 sessions per market per week\" means in business terms.** Across 60 markets and the weeks each one had the promo on, that's the per-market-week lift attributable to the campaign. Translate to your own revenue-per-session to compare against campaign spend, then use the per-market lift estimate to project what scaling the promo to additional markets would deliver. *[Source: business framing of the headline.]*\n", + "> **What \"+12 sessions per market per week\" means in business terms.** Across 60 markets and the weeks each one had the promo on, that's the per-market-week lift attributable to the campaign. Translate to your own revenue-per-session to compare against campaign spend, then use the per-market lift estimate to project what scaling the promo to additional markets would deliver.\n", ">\n", - "> **Practical significance caveat.** The 10% lift is statistically significant (bootstrap p < 0.01 at both post-treatment horizons), and the on-impact effect persists at the second horizon - the pulse worked while it was on. Whether 10% justifies the campaign cost is a business judgment, not a statistical one. Note also that joiners (DID_+ \u2248 11.0) and leavers (DID_- \u2248 11.9) gave consistent signals, which reduces the worry that the average is hiding heterogeneity between starting and stopping the promo. *[Sources: dynamic horizons from Section 5, joiners/leavers breakdown from Section 4.]*" + "> **Practical significance caveat.** The 11% lift is statistically significant (bootstrap p < 0.01 at both post-treatment horizons), and the on-impact effect persists at the second horizon - the pulse worked while it was on. Whether 11% justifies the campaign cost is a business judgment, not a statistical one. *[Sources: dynamic horizons from Section 4.]*" ] }, { "cell_type": "markdown", - "id": "t19-cell-030", + "id": "t19-cell-020", "metadata": {}, "source": [ "Adapt this template for your own campaign by swapping in your numbers from `results.summary()`, your own market and switcher counts, your own validity diagnostics, and your own business translation. The pattern - **headline \u2192 sample size and design \u2192 validity evidence \u2192 business interpretation \u2192 practical significance** - is the part to keep." ] }, - { - "cell_type": "code", - "id": "t19-cell-031", - "metadata": {}, - "execution_count": null, - "outputs": [], - "source": [ - "# Drift guards: tolerance-based asserts that lock the numbers quoted in\n", - "# the Section 4 / Section 5 narrative and the Section 6 stakeholder\n", - "# template. nbmake will fail if generate_reversible_did_data() or DCDH\n", - "# output drifts outside these ranges, forcing the markdown to be\n", - "# updated before this notebook can pass CI.\n", - "#\n", - "# Asserts pull from BOTH `results` (Section 4 single-horizon fit) AND\n", - "# `results_es` (Section 5 multi-horizon fit) - both fits are still\n", - "# in scope above this cell.\n", - "\n", - "# Section 4 (L_max=None): per-period DID_M path\n", - "assert 10.72 <= results.overall_att <= 11.72, results.overall_att\n", - "assert results.overall_conf_int[0] <= 12.0 <= results.overall_conf_int[1]\n", - "assert abs(results.placebo_effect) < 1.5, results.placebo_effect\n", - "assert results.twfe_fraction_negative >= 0.10 # documents TWFE bias signal\n", - "\n", - "# Section 5 (L_max=2): per-group DID_g,1 path - DIFFERENT compute path\n", - "# than overall_att, NOT bit-identical. Verified in seed-search to agree\n", - "# on truth-coverage at seed=53.\n", - "_h1 = results_es.event_study_effects[1][\"effect\"]\n", - "assert 10.24 <= _h1 <= 12.24, _h1\n", - "assert (\n", - " results_es.event_study_effects[1][\"conf_int\"][0]\n", - " <= 12.0\n", - " <= results_es.event_study_effects[1][\"conf_int\"][1]\n", - ")\n", - "\n", - "print(\"All drift guards passed.\")" - ] - }, { "cell_type": "markdown", - "id": "t19-cell-032", - "metadata": {}, - "source": [ - "## 7. Extensions and Where to Go Next" - ] - }, - { - "cell_type": "markdown", - "id": "t19-cell-033", + "id": "t19-cell-021", "metadata": {}, "source": [ - "This tutorial covered the dCDH **Phase 1** surface (DID_M, joiners/leavers decomposition, single-lag placebo, TWFE diagnostic) plus the **multi-horizon event study with bootstrap** (`L_max`, `n_bootstrap`). The library also supports several extensions that we did not demonstrate here:\n", + "## 6. Extensions and Where to Go Next\n", + "\n", + "This tutorial covered the core dCDH workflow on a reversible panel: `DID_M` with the joiners/leavers split, plus the `L_max` multi-horizon event study with multiplier bootstrap. The library also supports several extensions we did not demonstrate here:\n", "\n", "- **Per-trajectory disaggregation** (`by_path=k`): when joiners and leavers each follow a few common treatment paths (e.g., on-off-on vs on-on-off), `by_path=k` reports the event study separately for the top-k most common observed paths. Useful for pulse campaigns where the schedule varies across markets.\n", "- **Group-specific linear trends** (`trends_linear=True`): allows each market to have its own pre-treatment slope, absorbing differential trends.\n", @@ -536,13 +366,13 @@ }, { "cell_type": "markdown", - "id": "t19-cell-034", + "id": "t19-cell-022", "metadata": {}, "source": [ "**Related tutorials.**\n", "\n", "- [Tutorial 1: Basic DiD](01_basic_did.ipynb) - the 2x2 building block dCDH generalizes.\n", - "- [Tutorial 2: Staggered DiD](02_staggered_did.ipynb) - Goodman-Bacon decomposition is the staggered-adoption analog of the TWFE diagnostic shown here.\n", + "- [Tutorial 2: Staggered DiD](02_staggered_did.ipynb) - Callaway-Sant'Anna for absorbing staggered adoption (when treatment doesn't turn off).\n", "- [Tutorial 5: HonestDiD](05_honest_did.ipynb) - sensitivity to parallel-trends violations on event studies; works on dCDH's placebo surface via `honest_did=True`.\n", "- [Tutorial 17: Brand Awareness Survey](17_brand_awareness_survey.ipynb) - reach for this if you have survey data with sampling weights / strata / PSU instead of a panel.\n", "- [Tutorial 18: Geo-Experiment Analysis](18_geo_experiments.ipynb) - reach for this if you have a single-launch pilot in a small number of test markets." @@ -550,16 +380,15 @@ }, { "cell_type": "markdown", - "id": "t19-cell-035", + "id": "t19-cell-023", "metadata": {}, "source": [ "**Summary: when to reach for dCDH.**\n", "\n", "1. Use dCDH when treatment is **reversible** - the panel has switchers in both directions (joiners and leavers) in the same data.\n", - "2. Run `twowayfeweights` *before* fitting any estimator on a reversible panel - the diagnostic tells you whether to worry about TWFE contamination, in numbers (`fraction_negative`, `sigma_fe`).\n", - "3. Read joiners (`DID_+`) and leavers (`DID_-`) separately. Disagreement between the two halves is heterogeneity worth investigating before averaging into one number for stakeholders.\n", - "4. Use `L_max` + multiplier bootstrap to expose the dynamic structure of the effect - is the lift on-impact only, sustained, or fading? - and to get valid placebo CIs that the Phase 1 single-lag placebo can't provide.\n", - "5. Defer to follow-up tutorials for `by_path`, `trends_linear`/`trends_nonparam`, HonestDiD on dCDH's placebo surface, and the survey-design integration. Each is a single constructor or `fit()` kwarg away." + "2. Read joiners (`DID_+`) and leavers (`DID_-`) separately. Disagreement between the two halves is heterogeneity worth investigating before averaging into one number for stakeholders.\n", + "3. Use `L_max` + multiplier bootstrap to expose the dynamic structure of the effect - is the lift on-impact only, sustained, or fading? - and to get valid placebo CIs that the Phase 1 single-lag placebo can't provide.\n", + "4. Defer to follow-up tutorials for `by_path`, `trends_linear`/`trends_nonparam`, HonestDiD on dCDH's placebo surface, and the survey-design integration. Each is a single constructor or `fit()` kwarg away." ] } ], From ba225330cb693a76c0c13240bb07c3ea67a417cc Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 25 Apr 2026 14:59:45 -0400 Subject: [PATCH 2/5] Address review: A5 scope clarity, A11 wording, narrow matmul filter, drift test Four fixes from CI review: 1. **P1 - A5 scope in intro.** Section 1 previously said "the promo can be on for a market, then off, then back on" - misleading because the default `drop_larger_lower=True` filter drops multi-switch groups and our DGP uses `pattern="single_switch"` (A5-safe). Rewrite the intro to be A5-accurate: across-market reversibility (joiners + leavers in the same panel), not within-market on-off-on cycling. Add an explicit "Scope of this tutorial" paragraph that names A5 and points at `by_path` for the multi-switch extension. 2. **P1 - A11 doesn't "naturally" hold.** The "Where do the controls come from?" paragraph previously said single-switch panels with both joiner and leaver cohorts "naturally satisfy A11" - false. The test suite has a single-switch panel where A11 fails (`tests/ test_chaisemartin_dhaultfoeuille.py::TestA11Handling::test_a11_ violation_zero_in_numerator_retain_in_denominator`). Replace with a seed-specific claim ("this seed and DGP happen not to trigger an A11 warning") and a pointer at the test as a counterexample. 3. **P2 - Narrow the matmul filter.** Replace `np.errstate(divide="ignore", over="ignore", invalid="ignore")` - which suppresses ALL FP error categories from the entire fit - with `warnings.filterwarnings(category=RuntimeWarning, message=r".*encountered in matmul")` - which only catches the Accelerate matmul pattern. Unrelated future RuntimeWarnings now surface. 4. **P3 - Restore drift detection via sibling test file.** Add `tests/test_t19_marketing_pulse_drift.py` with 8 tests that re-derive the narrative numbers (overall_att, joiners, leavers, event-study horizons, placebos, panel composition, A7 warning fires, A11 warning does not fire) at the locked seed and check them against the tolerance bands quoted in the markdown. If a future library change moves any number outside its band, the test fails and a maintainer is forced to update the prose. Keeps the notebook clean of maintenance scaffolding while addressing the stale-prose risk. Test runs in <100ms. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/tutorials/19_dcdh_marketing_pulse.ipynb | 32 ++- tests/test_t19_marketing_pulse_drift.py | 225 +++++++++++++++++++ 2 files changed, 245 insertions(+), 12 deletions(-) create mode 100644 tests/test_t19_marketing_pulse_drift.py diff --git a/docs/tutorials/19_dcdh_marketing_pulse.ipynb b/docs/tutorials/19_dcdh_marketing_pulse.ipynb index 5423a6e3..f6e37fb0 100644 --- a/docs/tutorials/19_dcdh_marketing_pulse.ipynb +++ b/docs/tutorials/19_dcdh_marketing_pulse.ipynb @@ -19,7 +19,9 @@ "\n", "Your team runs paid-promo pulses across 60 markets. Some markets ran the promo at the start of the quarter and turned it off as the campaign budget rolled to the next geo (leavers); others started untreated and switched the promo on at some point during the quarter (joiners). Leadership wants the average lift on weekly checkout sessions while the promo was on.\n", "\n", - "**Why dCDH.** This panel has *reversible* (non-absorbing) treatment - the promo can be on for a market, then off, then back on. Every other modern staggered-DiD estimator in diff-diff (Callaway-Sant'Anna, Sun-Abraham, Wooldridge ETWFE, ImputationDiD, TwoStageDiD, EfficientDiD) assumes treatment is absorbing: once treated, always treated. They simply don't apply when the promo can come back off. dCDH is built for exactly this case, following [de Chaisemartin & D'Haultfoeuille (2020)](https://www.aeaweb.org/articles?id=10.1257/aer.20181169) and the [dynamic companion paper](https://www.nber.org/papers/w29873). Implementation details and any documented deviations from R's `did_multiplegt_dyn` reference live in [`docs/methodology/REGISTRY.md`](../methodology/REGISTRY.html)." + "**Why dCDH.** This panel has *reversible* (non-absorbing) treatment in the dCDH sense: across the panel, the promo turns on in some markets and off in others - both directions appear in the same dataset. Every other modern staggered-DiD estimator in diff-diff (Callaway-Sant'Anna, Sun-Abraham, Wooldridge ETWFE, ImputationDiD, TwoStageDiD, EfficientDiD) assumes treatment is absorbing: once treated, always treated. They simply don't apply to a panel that contains leavers. dCDH does, following [de Chaisemartin & D'Haultfoeuille (2020)](https://www.aeaweb.org/articles?id=10.1257/aer.20181169) and the [dynamic companion paper](https://www.nber.org/papers/w29873).\n", + "\n", + "**Scope of this tutorial.** Each market in our panel switches *at most once* during the quarter (the dCDH paper's Assumption 5, which the default analytical SE path requires). So a market is either a stable-untreated unit, a joiner that turns the promo on exactly once, a leaver that turns it off exactly once, or a stable-treated unit. dCDH does support multi-switch within-market paths (e.g., on-off-on cycles) via `drop_larger_lower=False` plus `by_path=k` for per-path effects, but that's a separate scope - see the extensions section at the end. Implementation details and any documented deviations from R's `did_multiplegt_dyn` reference live in [`docs/methodology/REGISTRY.md`](../methodology/REGISTRY.html)." ] }, { @@ -155,7 +157,7 @@ "id": "t19-cell-009", "metadata": {}, "source": [ - "**Where do the controls come from?** dCDH's controls are *contemporaneously stable cells*, not a permanently-untreated comparison group. A market that's untreated at week 3 and week 4 contributes a stable-untreated cell at week 4 - even if that same market eventually turns the promo on at week 5 and keeps it on through week 8. Symmetrically, a market that's been running the promo since week 1 and is still running it at week 4 contributes a stable-treated cell at week 4. This is what lets dCDH work on panels with **no permanent never-treated markets at all** - our panel has zero never-treated and zero always-treated units, only 60 switchers. The technical condition - de Chaisemartin & D'Haultfoeuille's Assumption 11 - is **per-period**: at every period when a switcher exists, at least one stable cell of the relevant type also exists. The library checks A11 at fit time period-by-period and emits a `UserWarning` (zeroing the offending period's contribution by paper convention) if any switching period lacks stable controls. We won't see that warning here because single-switch panels with both joiner and leaver cohorts naturally satisfy A11." + "**Where do the controls come from?** dCDH's controls are *contemporaneously stable cells*, not a permanently-untreated comparison group. A market that's untreated at week 3 and week 4 contributes a stable-untreated cell at week 4 - even if that same market eventually turns the promo on at week 5 and keeps it on through week 8. Symmetrically, a market that's been running the promo since week 1 and is still running it at week 4 contributes a stable-treated cell at week 4. This is what lets dCDH work on panels with **no permanent never-treated markets at all** - our panel has zero never-treated and zero always-treated units, only 60 switchers. The technical condition - de Chaisemartin & D'Haultfoeuille's Assumption 11 - is **per-period**: at every period when a switcher exists, at least one stable cell of the relevant type also exists. The library checks A11 at fit time period-by-period and emits a `UserWarning` (zeroing the offending period's contribution by paper convention) if any switching period lacks stable controls. A11 is *not* automatic on single-switch panels - the test suite has a single-switch panel where joiners exist at a period with zero stable-untreated controls (`tests/test_chaisemartin_dhaultfoeuille.py::TestA11Handling::test_a11_violation_zero_in_numerator_retain_in_denominator`). On the seed and DGP we use here, the fit happens not to trigger an A11 warning, so we're in the clean regime. On your own data, check the warning output before trusting the headline." ] }, { @@ -238,16 +240,22 @@ "execution_count": null, "outputs": [], "source": [ - "# np.errstate silences spurious numpy RuntimeWarnings about\n", - "# \"divide by zero / overflow / invalid value encountered in matmul\"\n", - "# that fire only on macOS NumPy builds linked against Apple's\n", - "# Accelerate BLAS framework. Accelerate sets FP error flags during\n", - "# matmul operations on certain shapes/values; the computation is\n", - "# correct (Linux / OpenBLAS users don't see these warnings at\n", - "# all). See numpy issue #26669. The Assumption 7 UserWarning\n", - "# below is NOT suppressed - that's the methodology warning we\n", - "# explained above.\n", - "with np.errstate(divide=\"ignore\", over=\"ignore\", invalid=\"ignore\"):\n", + "# Narrow filter: silence the spurious numpy RuntimeWarnings about\n", + "# \"<...> encountered in matmul\" that fire only on macOS NumPy\n", + "# builds linked against Apple's Accelerate BLAS framework.\n", + "# Accelerate sets FP error flags during matmul on certain shapes/\n", + "# values; the computation is correct (Linux / OpenBLAS users don't\n", + "# see these warnings at all). See numpy issue #26669. The filter\n", + "# is scoped to the matmul message pattern only - any unrelated\n", + "# RuntimeWarning from the fit will still surface, and the\n", + "# Assumption 7 UserWarning below is NOT suppressed (that's the\n", + "# methodology warning we explained above).\n", + "with warnings.catch_warnings():\n", + " warnings.filterwarnings(\n", + " \"ignore\",\n", + " message=r\".*encountered in matmul\",\n", + " category=RuntimeWarning,\n", + " )\n", " model_es = DCDH(\n", " twfe_diagnostic=False, placebo=True, n_bootstrap=199, seed=42\n", " )\n", diff --git a/tests/test_t19_marketing_pulse_drift.py b/tests/test_t19_marketing_pulse_drift.py new file mode 100644 index 00000000..8a985ae3 --- /dev/null +++ b/tests/test_t19_marketing_pulse_drift.py @@ -0,0 +1,225 @@ +"""Drift detection for Tutorial 19 (`docs/tutorials/19_dcdh_marketing_pulse.ipynb`). + +The tutorial narrative quotes seed-specific numbers (overall_att, joiners, +leavers, event-study horizons, placebos). If library numerics drift +(estimator changes, RNG path changes, BLAS path changes), the prose can +go stale silently while `pytest --nbmake` still passes - it only checks +that the cells execute without error. + +These asserts re-derive the same numbers using the locked DGP and seed +the notebook uses, then check them against the tolerance bands quoted in +the tutorial markdown. If a future change moves any number outside its +band, this test fails and a maintainer is forced to either update the +prose or investigate the methodology shift before merge. + +DGP and seed locked at `_scratch/dcdh_tutorial/40_build_notebook.py`. +Quoted numbers derived from `_scratch/dcdh_tutorial/lock_seed.py`. +""" + +from __future__ import annotations + +import warnings + +import numpy as np +import pytest + +from diff_diff import DCDH, generate_reversible_did_data + +# Locked DGP parameters (must stay in sync with the notebook) +MAIN_SEED = 46 +N_GROUPS = 60 +N_PERIODS = 8 +TREATMENT_EFFECT = 12.0 +EFFECT_SD = 1.5 + + +@pytest.fixture(scope="module") +def panel(): + raw = generate_reversible_did_data( + n_groups=N_GROUPS, + n_periods=N_PERIODS, + pattern="single_switch", + initial_treat_frac=0.4, + treatment_effect=TREATMENT_EFFECT, + heterogeneous_effects=True, + effect_sd=EFFECT_SD, + group_fe_sd=8.0, + time_trend=0.5, + noise_sd=2.0, + seed=MAIN_SEED, + ) + df = raw.rename( + columns={ + "group": "market_id", + "period": "week", + "treatment": "promo_on", + "outcome": "sessions", + } + ) + df["sessions"] = df["sessions"] + 100.0 + return df + + +@pytest.fixture(scope="module") +def phase1_results(panel): + """Phase 1 fit: gets joiners/leavers split. placebo=False to skip the + documented NaN-SE warning on the single-lag placebo path.""" + model = DCDH(twfe_diagnostic=False, placebo=False, seed=42) + return model.fit( + panel, + outcome="sessions", + group="market_id", + time="week", + treatment="promo_on", + ) + + +@pytest.fixture(scope="module") +def event_study_results(panel): + """Event-study fit: L_max=2 + multiplier bootstrap. Same warning + treatment as the notebook (Accelerate matmul filter; A7 visible).""" + with warnings.catch_warnings(): + warnings.filterwarnings( + "ignore", + message=r".*encountered in matmul", + category=RuntimeWarning, + ) + warnings.filterwarnings( + "ignore", + message=r"Assumption 7 .* is violated: leavers present", + category=UserWarning, + ) + model = DCDH( + twfe_diagnostic=False, placebo=True, n_bootstrap=199, seed=42 + ) + return model.fit( + panel, + outcome="sessions", + group="market_id", + time="week", + treatment="promo_on", + L_max=2, + ) + + +def test_panel_composition(panel): + """The narrative quotes 38 joiners and 22 leavers in the stakeholder + template. If the DGP drifts, those counts shift and the template + text goes stale.""" + counts = panel.groupby("switcher_type").size().to_dict() + assert counts.get("joiner") == 38, counts + assert counts.get("leaver") == 22, counts + + +def test_overall_att_close_to_truth(phase1_results): + """Section 3 quotes 'about 12 sessions' headline (true effect = 12).""" + assert 11.7 <= phase1_results.overall_att <= 12.4, phase1_results.overall_att + + +def test_overall_ci_covers_truth(phase1_results): + """Section 3 narrative claims the CI covers the true effect of 12.""" + ci_low, ci_high = phase1_results.overall_conf_int + assert ci_low <= TREATMENT_EFFECT <= ci_high, (ci_low, ci_high) + + +def test_joiners_leavers_consistent(phase1_results): + """Section 3 narrative quotes joiners ~12.1 and leavers ~11.9, both + positive and within sampling uncertainty of each other.""" + assert 11.5 <= phase1_results.joiners_att <= 12.7, phase1_results.joiners_att + assert 11.4 <= phase1_results.leavers_att <= 12.5, phase1_results.leavers_att + # Both positive and similar in magnitude (no big disagreement) + assert abs(phase1_results.joiners_att - phase1_results.leavers_att) < 1.5 + + +def test_event_study_horizons_cover_truth(event_study_results): + """Section 4 narrative quotes l=1 ~12.4, l=2 ~12.6, both with CIs + covering the true effect of 12.""" + es = event_study_results.event_study_effects + for l in (1, 2): + eff = es[l]["effect"] + ci = es[l]["conf_int"] + assert 11.5 <= eff <= 13.3, (l, eff) + assert ci[0] <= TREATMENT_EFFECT <= ci[1], (l, ci) + + +def test_placebo_horizons_cover_zero(event_study_results): + """Section 4 narrative claims pre-treatment placebos sit on zero.""" + pl = event_study_results.placebo_event_study + assert pl is not None + for l in (-1, -2): + eff = pl[l]["effect"] + ci = pl[l]["conf_int"] + assert abs(eff) < 0.7, (l, eff) + assert ci[0] <= 0.0 <= ci[1], (l, ci) + + +def test_assumption7_warning_fires_as_expected(panel): + """The notebook surfaces and explains the A7 warning. If the library + stops firing it, the markdown explanation goes stale and we should + notice.""" + with warnings.catch_warnings(record=True) as ws: + warnings.simplefilter("always") + with np.errstate(divide="ignore", over="ignore", invalid="ignore"): + model = DCDH( + twfe_diagnostic=False, placebo=True, n_bootstrap=49, seed=42 + ) + model.fit( + panel, + outcome="sessions", + group="market_id", + time="week", + treatment="promo_on", + L_max=2, + ) + a7_warnings = [ + w + for w in ws + if w.category is UserWarning + and "Assumption 7" in str(w.message) + and "leavers present" in str(w.message) + ] + assert len(a7_warnings) >= 1, [str(w.message)[:80] for w in ws] + + +def test_a11_warning_does_not_fire(): + """The notebook claims this seed/DGP is in the A11-clean regime + (no warning fires). If a library change starts triggering A11 on + this panel, the prose claim is wrong.""" + with warnings.catch_warnings(record=True) as ws: + warnings.simplefilter("always") + with np.errstate(divide="ignore", over="ignore", invalid="ignore"): + raw = generate_reversible_did_data( + n_groups=N_GROUPS, + n_periods=N_PERIODS, + pattern="single_switch", + initial_treat_frac=0.4, + treatment_effect=TREATMENT_EFFECT, + heterogeneous_effects=True, + effect_sd=EFFECT_SD, + group_fe_sd=8.0, + time_trend=0.5, + noise_sd=2.0, + seed=MAIN_SEED, + ) + df = raw.rename( + columns={ + "group": "market_id", + "period": "week", + "treatment": "promo_on", + "outcome": "sessions", + } + ) + df["sessions"] = df["sessions"] + 100.0 + DCDH(twfe_diagnostic=False, placebo=False, seed=42).fit( + df, + outcome="sessions", + group="market_id", + time="week", + treatment="promo_on", + ) + a11_warnings = [ + w + for w in ws + if w.category is UserWarning and "Assumption 11" in str(w.message) + ] + assert len(a11_warnings) == 0, [str(w.message)[:80] for w in a11_warnings] From a7f2b0e06e60a12267a6c0e25b85a9ac54b0a76e Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 25 Apr 2026 15:06:21 -0400 Subject: [PATCH 3/5] Tighten drift test: lock quoted CI endpoints and bootstrap p-values CI review surfaced that the drift test pinned broad effect bands and truth-coverage but not the specific CI endpoints quoted in the prose ('95% CI: 11.3 to 12.8', l=1 [11.4, 13.3], l=2 [11.5, 13.6]) or the 'bootstrap p < 0.01' significance claim in the stakeholder template. Those narrative lines could drift silently while the existing tests passed. Add three new tests: - test_overall_ci_endpoints_match_quoted: locks the Phase 1 ATT CI endpoints to bands around the quoted 11.3 / 12.8 (tolerance ~0.3) - test_event_study_ci_endpoints_match_quoted: locks the L_max=2 event study CI endpoints at l=1 and l=2 to bands around the quoted [11.4, 13.3] and [11.5, 13.6] (tolerance ~0.3) - test_event_study_significance: asserts both post-treatment horizons have bootstrap p_value < 0.01 11 tests pass in ~0.07s. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/test_t19_marketing_pulse_drift.py | 30 +++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/tests/test_t19_marketing_pulse_drift.py b/tests/test_t19_marketing_pulse_drift.py index 8a985ae3..9365fd07 100644 --- a/tests/test_t19_marketing_pulse_drift.py +++ b/tests/test_t19_marketing_pulse_drift.py @@ -122,6 +122,16 @@ def test_overall_ci_covers_truth(phase1_results): assert ci_low <= TREATMENT_EFFECT <= ci_high, (ci_low, ci_high) +def test_overall_ci_endpoints_match_quoted(phase1_results): + """Section 3 narrative quotes '95% CI: 11.3 to 12.8'. Lock the + rounded endpoints so prose drift fails this test.""" + ci_low, ci_high = phase1_results.overall_conf_int + # CI lower endpoint rounds to 11.3 -> band covers 11.0..11.6 + assert 11.0 <= ci_low <= 11.6, ci_low + # CI upper endpoint rounds to 12.8 -> band covers 12.5..13.1 + assert 12.5 <= ci_high <= 13.1, ci_high + + def test_joiners_leavers_consistent(phase1_results): """Section 3 narrative quotes joiners ~12.1 and leavers ~11.9, both positive and within sampling uncertainty of each other.""" @@ -142,6 +152,26 @@ def test_event_study_horizons_cover_truth(event_study_results): assert ci[0] <= TREATMENT_EFFECT <= ci[1], (l, ci) +def test_event_study_ci_endpoints_match_quoted(event_study_results): + """Section 4 narrative quotes l=1 CI [11.4, 13.3] and l=2 CI + [11.5, 13.6]. Lock the rounded endpoints so prose drift fails.""" + es = event_study_results.event_study_effects + # l=1 CI [11.4, 13.3] + assert 11.1 <= es[1]["conf_int"][0] <= 11.7, es[1]["conf_int"] + assert 13.0 <= es[1]["conf_int"][1] <= 13.6, es[1]["conf_int"] + # l=2 CI [11.5, 13.6] + assert 11.2 <= es[2]["conf_int"][0] <= 11.8, es[2]["conf_int"] + assert 13.3 <= es[2]["conf_int"][1] <= 13.9, es[2]["conf_int"] + + +def test_event_study_significance(event_study_results): + """Section 5 stakeholder template claims 'bootstrap p < 0.01 at both + post-treatment horizons'. Lock that significance threshold.""" + es = event_study_results.event_study_effects + assert es[1]["p_value"] < 0.01, es[1]["p_value"] + assert es[2]["p_value"] < 0.01, es[2]["p_value"] + + def test_placebo_horizons_cover_zero(event_study_results): """Section 4 narrative claims pre-treatment placebos sit on zero.""" pl = event_study_results.placebo_event_study From 6a663289ddec907def490f58bfa82d0357017757 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 25 Apr 2026 15:13:26 -0400 Subject: [PATCH 4/5] Tighten drift test: round-based endpoint pins + exact warning-set check CI review surfaced two refinements: 1. Endpoint bands like `11.0 <= ci_low <= 11.6` would still pass values rounding to several different one-decimal displays (11.0, 11.1, ..., 11.6) while the notebook prose stays at "11.3", "12.8", "11.4", "13.3", "11.5", "13.6". Replace those with `round(ci_low, 1) == 11.3` etc. - directly pins the displayed rounding so any drift past the tenth fails the test. 2. The warning tests didn't pin the notebook's full warning contract. `event_study_results` suppressed A7 for fixture cleanliness while the docstring claimed "A7 visible". Two changes: - Fix the fixture docstring to acknowledge A7 is muted there for value-checking tests, with the notebook's actual warning-policy contract validated separately - Add `test_event_study_warning_policy_matches_notebook` that mirrors the notebook's exact filter (only matmul-pattern RuntimeWarnings silenced) and asserts the resulting warning set: exactly one UserWarning (A7 leavers-present, the one the markdown explains) and zero RuntimeWarnings. If a future library change emits an unexpected warning on this code path, the test fails. 12 tests pass in ~0.07s (was 11). Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/test_t19_marketing_pulse_drift.py | 70 ++++++++++++++++++++----- 1 file changed, 57 insertions(+), 13 deletions(-) diff --git a/tests/test_t19_marketing_pulse_drift.py b/tests/test_t19_marketing_pulse_drift.py index 9365fd07..7a66306a 100644 --- a/tests/test_t19_marketing_pulse_drift.py +++ b/tests/test_t19_marketing_pulse_drift.py @@ -76,8 +76,11 @@ def phase1_results(panel): @pytest.fixture(scope="module") def event_study_results(panel): - """Event-study fit: L_max=2 + multiplier bootstrap. Same warning - treatment as the notebook (Accelerate matmul filter; A7 visible).""" + """Event-study fit: L_max=2 + multiplier bootstrap. The A7 + UserWarning is intentionally muted here so the fixture is quiet + for the value-checking tests below; the notebook's actual + warning-policy contract (A7 visible, only matmul filtered) is + validated separately by `test_event_study_warning_policy_matches_notebook`.""" with warnings.catch_warnings(): warnings.filterwarnings( "ignore", @@ -123,13 +126,12 @@ def test_overall_ci_covers_truth(phase1_results): def test_overall_ci_endpoints_match_quoted(phase1_results): - """Section 3 narrative quotes '95% CI: 11.3 to 12.8'. Lock the - rounded endpoints so prose drift fails this test.""" + """Section 3 narrative quotes '95% CI: 11.3 to 12.8'. Pin the + one-decimal display exactly so any drift past the displayed + rounding fails this test.""" ci_low, ci_high = phase1_results.overall_conf_int - # CI lower endpoint rounds to 11.3 -> band covers 11.0..11.6 - assert 11.0 <= ci_low <= 11.6, ci_low - # CI upper endpoint rounds to 12.8 -> band covers 12.5..13.1 - assert 12.5 <= ci_high <= 13.1, ci_high + assert round(ci_low, 1) == 11.3, ci_low + assert round(ci_high, 1) == 12.8, ci_high def test_joiners_leavers_consistent(phase1_results): @@ -154,14 +156,14 @@ def test_event_study_horizons_cover_truth(event_study_results): def test_event_study_ci_endpoints_match_quoted(event_study_results): """Section 4 narrative quotes l=1 CI [11.4, 13.3] and l=2 CI - [11.5, 13.6]. Lock the rounded endpoints so prose drift fails.""" + [11.5, 13.6]. Pin the one-decimal display exactly.""" es = event_study_results.event_study_effects # l=1 CI [11.4, 13.3] - assert 11.1 <= es[1]["conf_int"][0] <= 11.7, es[1]["conf_int"] - assert 13.0 <= es[1]["conf_int"][1] <= 13.6, es[1]["conf_int"] + assert round(es[1]["conf_int"][0], 1) == 11.4, es[1]["conf_int"] + assert round(es[1]["conf_int"][1], 1) == 13.3, es[1]["conf_int"] # l=2 CI [11.5, 13.6] - assert 11.2 <= es[2]["conf_int"][0] <= 11.8, es[2]["conf_int"] - assert 13.3 <= es[2]["conf_int"][1] <= 13.9, es[2]["conf_int"] + assert round(es[2]["conf_int"][0], 1) == 11.5, es[2]["conf_int"] + assert round(es[2]["conf_int"][1], 1) == 13.6, es[2]["conf_int"] def test_event_study_significance(event_study_results): @@ -211,6 +213,48 @@ def test_assumption7_warning_fires_as_expected(panel): assert len(a7_warnings) >= 1, [str(w.message)[:80] for w in ws] +def test_event_study_warning_policy_matches_notebook(panel): + """Mirror the notebook's exact warning policy on the visible + event-study fit and assert the resulting warning set matches the + documented contract: exactly one UserWarning (the A7 leavers-present + warning that the notebook's markdown explains), and zero + RuntimeWarnings (matmul-pattern ones filtered; everything else + surfaces). If the library starts emitting an unexpected warning on + this code path, this test fails and the notebook prose may need to + be updated.""" + with warnings.catch_warnings(record=True) as ws: + warnings.simplefilter("always") + # MIRROR the notebook's narrow filter exactly (no np.errstate, no + # blanket A7 suppression). + warnings.filterwarnings( + "ignore", + message=r".*encountered in matmul", + category=RuntimeWarning, + ) + model = DCDH( + twfe_diagnostic=False, placebo=True, n_bootstrap=199, seed=42 + ) + model.fit( + panel, + outcome="sessions", + group="market_id", + time="week", + treatment="promo_on", + L_max=2, + ) + user_warnings = [w for w in ws if w.category is UserWarning] + runtime_warnings = [w for w in ws if w.category is RuntimeWarning] + # Exactly one UserWarning, and it's the documented A7 warning. + assert len(user_warnings) == 1, [str(w.message)[:120] for w in user_warnings] + msg = str(user_warnings[0].message) + assert "Assumption 7" in msg, msg + assert "leavers present" in msg, msg + # All RuntimeWarnings should be the matmul pattern (filtered) - so + # zero remaining. If a new RuntimeWarning fires from somewhere else, + # this fails. + assert len(runtime_warnings) == 0, [str(w.message)[:120] for w in runtime_warnings] + + def test_a11_warning_does_not_fire(): """The notebook claims this seed/DGP is in the A11-clean regime (no warning fires). If a library change starts triggering A11 on From 9c6be1e5bfe500511b4a8a85ca0cea4258eaa197 Mon Sep 17 00:00:00 2001 From: igerber Date: Sat, 25 Apr 2026 16:25:51 -0400 Subject: [PATCH 5/5] Fix Pure Python CI failure: loosen bootstrap-CI endpoint pins Pure Python CI failed at test_event_study_ci_endpoints_match_quoted because the bootstrap RNG path differs between Rust and pure-Python backends (per the bit-identity-baseline-per-backend convention): Rust: es[1] CI low = 11.394 -> rounds to 11.4 (matches prose) Pure Python: es[1] CI low = 11.487 -> rounds to 11.5 (mismatch) The 0.09 backend gap is enough to flip the rounding boundary on the exact-match `round(_, 1) == 11.4` pin I tightened to in the prior round. Loosen the four bootstrap-CI endpoint asserts to a 0.15 absolute tolerance band around the quoted prose values. Tight enough to catch real prose drift (a real shift would move by >>0.15), loose enough to absorb the documented backend variance. Verified on both backends locally: pytest tests/test_t19_marketing_pulse_drift.py -> 12/12 pass DIFF_DIFF_BACKEND=python pytest -> 12/12 pass The analytical-SE endpoint asserts in test_overall_ci_endpoints_match_quoted keep the strict `round(_, 1) ==` pin since they're not bootstrap-driven and are bit-identical across backends. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/test_t19_marketing_pulse_drift.py | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/tests/test_t19_marketing_pulse_drift.py b/tests/test_t19_marketing_pulse_drift.py index 7a66306a..6c31ebd6 100644 --- a/tests/test_t19_marketing_pulse_drift.py +++ b/tests/test_t19_marketing_pulse_drift.py @@ -156,14 +156,19 @@ def test_event_study_horizons_cover_truth(event_study_results): def test_event_study_ci_endpoints_match_quoted(event_study_results): """Section 4 narrative quotes l=1 CI [11.4, 13.3] and l=2 CI - [11.5, 13.6]. Pin the one-decimal display exactly.""" + [11.5, 13.6]. These are bootstrap-based CIs and the bootstrap RNG + path differs between Rust and pure-Python backends (per the + bit-identity-baseline-per-backend convention), so we use a 0.15 + tolerance band rather than `round(_, 1) ==` exact matching - tight + enough to catch real prose drift, loose enough to absorb the + documented backend variance.""" es = event_study_results.event_study_effects # l=1 CI [11.4, 13.3] - assert round(es[1]["conf_int"][0], 1) == 11.4, es[1]["conf_int"] - assert round(es[1]["conf_int"][1], 1) == 13.3, es[1]["conf_int"] + assert abs(es[1]["conf_int"][0] - 11.4) < 0.15, es[1]["conf_int"] + assert abs(es[1]["conf_int"][1] - 13.3) < 0.15, es[1]["conf_int"] # l=2 CI [11.5, 13.6] - assert round(es[2]["conf_int"][0], 1) == 11.5, es[2]["conf_int"] - assert round(es[2]["conf_int"][1], 1) == 13.6, es[2]["conf_int"] + assert abs(es[2]["conf_int"][0] - 11.5) < 0.15, es[2]["conf_int"] + assert abs(es[2]["conf_int"][1] - 13.6) < 0.15, es[2]["conf_int"] def test_event_study_significance(event_study_results):