From de6ce63e9952dc3489d6203d91b61502f9fad92b Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 10:36:57 -0400
Subject: [PATCH 01/15] Add practitioner-workflow performance baseline

Six end-to-end scenarios covering CS + 8-step chain, survey DiD, BRFSS
microdata -> CS panel, SDiD few-markets, reversible dCDH, and continuous
dose-response -- anchored to applied-econ paper and industry conventions
rather than the 200 x 8 synthetic panels used by the existing benchmarks.
Each chain is timed per phase and profiled with pyinstrument under both
backends; findings and recommended actions are in docs/performance-plan.md.

Measurement only -- no changes under diff_diff/ or rust/. The decision
doc identifies aggregate_survey per-cell scaffolding, ImputationDiD fit
loop, and dCDH heterogeneity refit as candidates for follow-up PRs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 benchmarks/speed_review/README.md             |  76 ++++
 benchmarks/speed_review/baselines/.gitignore  |   1 +
 .../brand_awareness_survey_python.json        |  58 ++++
 .../brand_awareness_survey_rust.json          |  58 ++++
 .../baselines/brfss_panel_python.json         |  48 +++
 .../baselines/brfss_panel_rust.json           |  48 +++
 .../baselines/campaign_staggered_python.json  |  62 ++++
 .../baselines/campaign_staggered_rust.json    |  62 ++++
 .../baselines/dose_response_python.json       |  50 +++
 .../baselines/dose_response_rust.json         |  50 +++
 .../baselines/geo_few_markets_python.json     |  46 +++
 .../baselines/geo_few_markets_rust.json       |  46 +++
 .../baselines/reversible_dcdh_python.json     |  38 ++
 .../baselines/reversible_dcdh_rust.json       |  38 ++
 .../bench_brand_awareness_survey.py           | 161 +++++++++
 benchmarks/speed_review/bench_brfss_panel.py  | 147 ++++++++
 .../speed_review/bench_campaign_staggered.py  | 124 +++++++
 .../speed_review/bench_dose_response.py       | 107 ++++++
 .../speed_review/bench_geo_few_markets.py     |  99 ++++++
 .../speed_review/bench_reversible_dcdh.py     | 117 +++++++
 benchmarks/speed_review/bench_shared.py       | 144 ++++++++
 benchmarks/speed_review/run_all.py            |  79 +++++
 docs/performance-plan.md                      | 189 ++++++++++
 docs/performance-scenarios.md                 | 324 ++++++++++++++++++
 24 files changed, 2172 insertions(+)
 create mode 100644 benchmarks/speed_review/README.md
 create mode 100644 benchmarks/speed_review/baselines/.gitignore
 create mode 100644 benchmarks/speed_review/baselines/brand_awareness_survey_python.json
 create mode 100644 benchmarks/speed_review/baselines/brand_awareness_survey_rust.json
 create mode 100644 benchmarks/speed_review/baselines/brfss_panel_python.json
 create mode 100644 benchmarks/speed_review/baselines/brfss_panel_rust.json
 create mode 100644 benchmarks/speed_review/baselines/campaign_staggered_python.json
 create mode 100644 benchmarks/speed_review/baselines/campaign_staggered_rust.json
 create mode 100644 benchmarks/speed_review/baselines/dose_response_python.json
 create mode 100644 benchmarks/speed_review/baselines/dose_response_rust.json
 create mode 100644 benchmarks/speed_review/baselines/geo_few_markets_python.json
 create mode 100644 benchmarks/speed_review/baselines/geo_few_markets_rust.json
 create mode 100644 benchmarks/speed_review/baselines/reversible_dcdh_python.json
 create mode 100644 benchmarks/speed_review/baselines/reversible_dcdh_rust.json
 create mode 100644 benchmarks/speed_review/bench_brand_awareness_survey.py
 create mode 100644 benchmarks/speed_review/bench_brfss_panel.py
 create mode 100644 benchmarks/speed_review/bench_campaign_staggered.py
 create mode 100644 benchmarks/speed_review/bench_dose_response.py
 create mode 100644 benchmarks/speed_review/bench_geo_few_markets.py
 create mode 100644 benchmarks/speed_review/bench_reversible_dcdh.py
 create mode 100644 benchmarks/speed_review/bench_shared.py
 create mode 100644 benchmarks/speed_review/run_all.py
 create mode 100644 docs/performance-scenarios.md

diff --git a/benchmarks/speed_review/README.md b/benchmarks/speed_review/README.md
new file mode 100644
index 00000000..4ba38232
--- /dev/null
+++ b/benchmarks/speed_review/README.md
@@ -0,0 +1,76 @@
+# Speed Review — Practitioner Workflow Benchmarks
+
+Scenario-driven performance measurement for end-to-end practitioner chains,
+as distinct from `benchmarks/run_benchmarks.py`, which measures R-parity on
+isolated `fit()` calls.
+
+## Why these exist
+
+See [`docs/performance-scenarios.md`](../../docs/performance-scenarios.md) for
+the full methodology. Short version: the existing benchmarks measure
+`fit()` in isolation on 200 x 8 synthetic panels, which does not reflect what
+a practitioner running the 8-step Baker et al. (2025) workflow on a real
+BRFSS or geo-experiment panel actually sees. These scripts measure the full
+chain (Bacon -> fit -> HonestDiD -> cross-estimator robustness -> reporting)
+at data shapes anchored to applied-econ conventions.
+
+## Layout
+
+```
+benchmarks/speed_review/
+├── README.md                           # this file
+├── bench_shared.py                     # timing + pyinstrument harness
+├── run_all.py                          # orchestrator (both backends)
+├── bench_campaign_staggered.py         # Scenario 1: CS + 8-step chain
+├── bench_brand_awareness_survey.py     # Scenario 2: DiD + SurveyDesign
+├── bench_brfss_panel.py                # Scenario 3: aggregate_survey -> CS
+├── bench_geo_few_markets.py            # Scenario 4: SDiD + jackknife
+├── bench_reversible_dcdh.py            # Scenario 5: dCDH L_max + TSL
+├── bench_dose_response.py              # Scenario 6: ContinuousDiD splines
+├── bench_callaway.py                   # pre-existing CS scaling sweep
+├── baseline_results.json               # pre-existing CS baseline
+└── baselines/                          # this effort's output
+    ├── <scenario>_<backend>.json       # phase-level wall-clock (committed)
+    └── profiles/                       # flame HTMLs (gitignored)
+        └── <scenario>_<backend>.html   # pyinstrument flame output
+```
+
+**Note on profile HTMLs.** pyinstrument flames are ~500KB-1.2MB each and are
+regenerated on every run; they live under `baselines/profiles/` which is
+gitignored. The key hotspots identified from them are already captured in
+the findings doc (top-5 hot phases per scenario); run a scenario locally
+to regenerate the full flame when needed.
+
+## Running
+
+```bash
+# One-time install
+pip install pyinstrument
+
+# All scenarios, both backends
+python benchmarks/speed_review/run_all.py
+
+# One scenario, one backend
+DIFF_DIFF_BACKEND=rust python benchmarks/speed_review/bench_campaign_staggered.py
+
+# Subset
+python benchmarks/speed_review/run_all.py --scenarios brfss_panel geo_few_markets
+```
+
+## Where to look for findings
+
+[`docs/performance-plan.md`](../../docs/performance-plan.md) — "Practitioner
+Workflow Baseline (v3.1.3)" section holds per-scenario hot-phase rankings
+and action recommendations. The scenarios here are the measurement surface;
+the findings doc is the decision output.
+
+## Adding a scenario
+
+1. Add the scenario definition to `docs/performance-scenarios.md`
+   (persona, data shape, operation chain, source anchor).
+2. Add `bench_<name>.py` following the existing scripts: build data, define
+   `phases` as a list of `(label, callable)` tuples, call `run_scenario`.
+3. Register it in `run_all.py`'s `SCRIPTS` dict.
+4. Run under both backends and commit the refreshed `baselines/*.json`.
+   Do not commit `baselines/profiles/*.html`; that directory is gitignored.
+5. Add a per-scenario finding paragraph to `docs/performance-plan.md`.
diff --git a/benchmarks/speed_review/baselines/.gitignore b/benchmarks/speed_review/baselines/.gitignore
new file mode 100644
index 00000000..66d050c3
--- /dev/null
+++ b/benchmarks/speed_review/baselines/.gitignore
@@ -0,0 +1 @@
+profiles/
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_python.json
new file mode 100644
index 00000000..6551227d
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_python.json
@@ -0,0 +1,58 @@
+{
+  "scenario": "brand_awareness_survey",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 0.18850491600000008,
+  "phases": {
+    "1_naive_fit_no_survey_design": {
+      "seconds": 0.0016701670000000002,
+      "ok": true,
+      "error": null
+    },
+    "2_tsl_strata_psu_fpc": {
+      "seconds": 0.006741541999999989,
+      "ok": true,
+      "error": null
+    },
+    "3_replicate_weights_brr": {
+      "seconds": 0.014424250000000027,
+      "ok": true,
+      "error": null
+    },
+    "4_multi_outcome_loop_3_metrics": {
+      "seconds": 0.043619666,
+      "ok": true,
+      "error": null
+    },
+    "5_check_parallel_trends": {
+      "seconds": 0.00915220799999994,
+      "ok": true,
+      "error": null
+    },
+    "6_placebo_refit_pre_period": {
+      "seconds": 0.029268290999999946,
+      "ok": true,
+      "error": null
+    },
+    "7_event_study_plus_honest_did": {
+      "seconds": 0.08362433400000002,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_units": 200,
+    "n_periods": 12,
+    "n_obs": 2400,
+    "n_strata": 10,
+    "n_psu_per_stratum": 4,
+    "n_replicate_weights": 40,
+    "outcomes": [
+      "outcome",
+      "consideration",
+      "purchase_intent"
+    ]
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_rust.json
new file mode 100644
index 00000000..48707354
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_rust.json
@@ -0,0 +1,58 @@
+{
+  "scenario": "brand_awareness_survey",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.16800324999999994,
+  "phases": {
+    "1_naive_fit_no_survey_design": {
+      "seconds": 0.0018907079999999077,
+      "ok": true,
+      "error": null
+    },
+    "2_tsl_strata_psu_fpc": {
+      "seconds": 0.006109541999999912,
+      "ok": true,
+      "error": null
+    },
+    "3_replicate_weights_brr": {
+      "seconds": 0.01849195799999992,
+      "ok": true,
+      "error": null
+    },
+    "4_multi_outcome_loop_3_metrics": {
+      "seconds": 0.02723191700000005,
+      "ok": true,
+      "error": null
+    },
+    "5_check_parallel_trends": {
+      "seconds": 0.009134625000000063,
+      "ok": true,
+      "error": null
+    },
+    "6_placebo_refit_pre_period": {
+      "seconds": 0.024182666999999936,
+      "ok": true,
+      "error": null
+    },
+    "7_event_study_plus_honest_did": {
+      "seconds": 0.08095333299999996,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_units": 200,
+    "n_periods": 12,
+    "n_obs": 2400,
+    "n_strata": 10,
+    "n_psu_per_stratum": 4,
+    "n_replicate_weights": 40,
+    "outcomes": [
+      "outcome",
+      "consideration",
+      "purchase_intent"
+    ]
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brfss_panel_python.json b/benchmarks/speed_review/baselines/brfss_panel_python.json
new file mode 100644
index 00000000..fddf6cab
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brfss_panel_python.json
@@ -0,0 +1,48 @@
+{
+  "scenario": "brfss_panel",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 1.599043583,
+  "phases": {
+    "1_aggregate_survey_microdata_to_panel": {
+      "seconds": 1.530210625,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_stage2_survey_design": {
+      "seconds": 0.014581666999999854,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 1.8749999997069722e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_grid": {
+      "seconds": 0.003660958000000214,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.05053487499999987,
+      "ok": true,
+      "error": null
+    },
+    "6_practitioner_next_steps": {
+      "seconds": 4.9042000000110164e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_microdata_rows": 50000,
+    "n_states": 50,
+    "n_years": 10,
+    "n_strata": 10,
+    "n_psu": 200,
+    "n_bootstrap": 199
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brfss_panel_rust.json b/benchmarks/speed_review/baselines/brfss_panel_rust.json
new file mode 100644
index 00000000..44c2cfb0
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brfss_panel_rust.json
@@ -0,0 +1,48 @@
+{
+  "scenario": "brfss_panel",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 1.5960411249999997,
+  "phases": {
+    "1_aggregate_survey_microdata_to_panel": {
+      "seconds": 1.5271849580000003,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_stage2_survey_design": {
+      "seconds": 0.014870542000000153,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 2.208000000170074e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_grid": {
+      "seconds": 0.003847707999999894,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.05008866700000025,
+      "ok": true,
+      "error": null
+    },
+    "6_practitioner_next_steps": {
+      "seconds": 4.3584000000151946e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_microdata_rows": 50000,
+    "n_states": 50,
+    "n_years": 10,
+    "n_strata": 10,
+    "n_psu": 200,
+    "n_bootstrap": 199
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_python.json b/benchmarks/speed_review/baselines/campaign_staggered_python.json
new file mode 100644
index 00000000..69b536a3
--- /dev/null
+++ b/benchmarks/speed_review/baselines/campaign_staggered_python.json
@@ -0,0 +1,62 @@
+{
+  "scenario": "campaign_staggered",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 0.493763792,
+  "phases": {
+    "1_bacon_decomposition": {
+      "seconds": 0.00662462499999994,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_covariates_bootstrap999": {
+      "seconds": 0.06328537499999998,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 3.3750000000276614e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_M_grid": {
+      "seconds": 0.0047993339999999884,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.09586058399999997,
+      "ok": true,
+      "error": null
+    },
+    "6_imputation_did_robustness": {
+      "seconds": 0.29060341599999995,
+      "ok": true,
+      "error": null
+    },
+    "7_cs_without_covariates": {
+      "seconds": 0.03254304100000005,
+      "ok": true,
+      "error": null
+    },
+    "8_practitioner_next_steps": {
+      "seconds": 3.7708000000025166e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_units": 150,
+    "n_periods": 26,
+    "n_cohorts": 2,
+    "covariates": [
+      "log_pop",
+      "baseline_spend"
+    ],
+    "n_bootstrap": 999,
+    "aggregate": "all",
+    "estimation_method": "dr"
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_rust.json
new file mode 100644
index 00000000..44a4cdd9
--- /dev/null
+++ b/benchmarks/speed_review/baselines/campaign_staggered_rust.json
@@ -0,0 +1,62 @@
+{
+  "scenario": "campaign_staggered",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.484783125,
+  "phases": {
+    "1_bacon_decomposition": {
+      "seconds": 0.006965292000000067,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_covariates_bootstrap999": {
+      "seconds": 0.060481958,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 2.9169999999911767e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_M_grid": {
+      "seconds": 0.004540166999999928,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.12357962499999997,
+      "ok": true,
+      "error": null
+    },
+    "6_imputation_did_robustness": {
+      "seconds": 0.25805591699999986,
+      "ok": true,
+      "error": null
+    },
+    "7_cs_without_covariates": {
+      "seconds": 0.031115375000000167,
+      "ok": true,
+      "error": null
+    },
+    "8_practitioner_next_steps": {
+      "seconds": 3.6332999999943993e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_units": 150,
+    "n_periods": 26,
+    "n_cohorts": 2,
+    "covariates": [
+      "log_pop",
+      "baseline_spend"
+    ],
+    "n_bootstrap": 999,
+    "aggregate": "all",
+    "estimation_method": "dr"
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/dose_response_python.json b/benchmarks/speed_review/baselines/dose_response_python.json
new file mode 100644
index 00000000..7deaf504
--- /dev/null
+++ b/benchmarks/speed_review/baselines/dose_response_python.json
@@ -0,0 +1,50 @@
+{
+  "scenario": "dose_response",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 0.5864091250000001,
+  "phases": {
+    "1_cdid_cubic_spline_bootstrap199": {
+      "seconds": 0.14782820799999996,
+      "ok": true,
+      "error": null
+    },
+    "2_extract_dose_response_dataframes": {
+      "seconds": 0.0007271249999999396,
+      "ok": true,
+      "error": null
+    },
+    "3_cdid_event_study_pretrend": {
+      "seconds": 0.1467172499999999,
+      "ok": true,
+      "error": null
+    },
+    "4_binarized_did_comparison": {
+      "seconds": 0.0014637920000000193,
+      "ok": true,
+      "error": null
+    },
+    "5_spline_sensitivity_degree1": {
+      "seconds": 0.14299950000000006,
+      "ok": true,
+      "error": null
+    },
+    "6_spline_sensitivity_num_knots2": {
+      "seconds": 0.14666895800000002,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_units": 500,
+    "n_periods": 6,
+    "n_bootstrap": 199,
+    "spline_configs": [
+      "degree=3,k=1",
+      "degree=1,k=0",
+      "degree=3,k=2"
+    ]
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/dose_response_rust.json b/benchmarks/speed_review/baselines/dose_response_rust.json
new file mode 100644
index 00000000..d4f4cb54
--- /dev/null
+++ b/benchmarks/speed_review/baselines/dose_response_rust.json
@@ -0,0 +1,50 @@
+{
+  "scenario": "dose_response",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.585448167,
+  "phases": {
+    "1_cdid_cubic_spline_bootstrap199": {
+      "seconds": 0.14863079200000007,
+      "ok": true,
+      "error": null
+    },
+    "2_extract_dose_response_dataframes": {
+      "seconds": 0.0007015000000000216,
+      "ok": true,
+      "error": null
+    },
+    "3_cdid_event_study_pretrend": {
+      "seconds": 0.14747212500000006,
+      "ok": true,
+      "error": null
+    },
+    "4_binarized_did_comparison": {
+      "seconds": 0.0016670830000000691,
+      "ok": true,
+      "error": null
+    },
+    "5_spline_sensitivity_degree1": {
+      "seconds": 0.14236974999999996,
+      "ok": true,
+      "error": null
+    },
+    "6_spline_sensitivity_num_knots2": {
+      "seconds": 0.1446025420000001,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_units": 500,
+    "n_periods": 6,
+    "n_bootstrap": 199,
+    "spline_configs": [
+      "degree=3,k=1",
+      "degree=1,k=0",
+      "degree=3,k=2"
+    ]
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_python.json b/benchmarks/speed_review/baselines/geo_few_markets_python.json
new file mode 100644
index 00000000..ccaec094
--- /dev/null
+++ b/benchmarks/speed_review/baselines/geo_few_markets_python.json
@@ -0,0 +1,46 @@
+{
+  "scenario": "geo_few_markets",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 3.0047956659999997,
+  "phases": {
+    "1_sdid_jackknife_variance": {
+      "seconds": 0.4820445,
+      "ok": true,
+      "error": null
+    },
+    "2_sdid_bootstrap_variance_200": {
+      "seconds": 0.4802018750000001,
+      "ok": true,
+      "error": null
+    },
+    "3_in_time_placebo": {
+      "seconds": 0.9625541249999998,
+      "ok": true,
+      "error": null
+    },
+    "4_get_loo_effects_df": {
+      "seconds": 0.0008696669999999074,
+      "ok": true,
+      "error": null
+    },
+    "5_sensitivity_to_zeta_omega": {
+      "seconds": 1.079096792,
+      "ok": true,
+      "error": null
+    },
+    "6_weight_concentration": {
+      "seconds": 2.4999999999941735e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_units": 80,
+    "n_periods": 12,
+    "n_treated": 5,
+    "n_factors": 2
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_rust.json
new file mode 100644
index 00000000..ff62e455
--- /dev/null
+++ b/benchmarks/speed_review/baselines/geo_few_markets_rust.json
@@ -0,0 +1,46 @@
+{
+  "scenario": "geo_few_markets",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.03952049999999996,
+  "phases": {
+    "1_sdid_jackknife_variance": {
+      "seconds": 0.00735379199999997,
+      "ok": true,
+      "error": null
+    },
+    "2_sdid_bootstrap_variance_200": {
+      "seconds": 0.012488166000000023,
+      "ok": true,
+      "error": null
+    },
+    "3_in_time_placebo": {
+      "seconds": 0.008124790999999965,
+      "ok": true,
+      "error": null
+    },
+    "4_get_loo_effects_df": {
+      "seconds": 0.0006939590000000218,
+      "ok": true,
+      "error": null
+    },
+    "5_sensitivity_to_zeta_omega": {
+      "seconds": 0.010841416999999964,
+      "ok": true,
+      "error": null
+    },
+    "6_weight_concentration": {
+      "seconds": 1.5958999999954315e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_units": 80,
+    "n_periods": 12,
+    "n_treated": 5,
+    "n_factors": 2
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_python.json b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
new file mode 100644
index 00000000..b19cfc16
--- /dev/null
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
@@ -0,0 +1,38 @@
+{
+  "scenario": "reversible_dcdh",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 0.5749523339999999,
+  "phases": {
+    "1_dcdh_fit_Lmax3_survey_TSL": {
+      "seconds": 0.30690474999999995,
+      "ok": true,
+      "error": null
+    },
+    "2_inspect_placebo_and_summary": {
+      "seconds": 1.2500000000637002e-06,
+      "ok": true,
+      "error": null
+    },
+    "3_honest_did_on_placebo": {
+      "seconds": 0.0035600000000000076,
+      "ok": true,
+      "error": null
+    },
+    "4_heterogeneity_refit": {
+      "seconds": 0.264483833,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_groups": 120,
+    "n_periods": 10,
+    "pattern": "single_switch",
+    "L_max": 3,
+    "n_strata": 8,
+    "n_psu": 24
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
new file mode 100644
index 00000000..22301be9
--- /dev/null
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
@@ -0,0 +1,38 @@
+{
+  "scenario": "reversible_dcdh",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.44736124999999993,
+  "phases": {
+    "1_dcdh_fit_Lmax3_survey_TSL": {
+      "seconds": 0.319021125,
+      "ok": true,
+      "error": null
+    },
+    "2_inspect_placebo_and_summary": {
+      "seconds": 1.37499999997015e-06,
+      "ok": true,
+      "error": null
+    },
+    "3_honest_did_on_placebo": {
+      "seconds": 0.003557540999999942,
+      "ok": true,
+      "error": null
+    },
+    "4_heterogeneity_refit": {
+      "seconds": 0.12477833300000007,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "n_groups": 120,
+    "n_periods": 10,
+    "pattern": "single_switch",
+    "L_max": 3,
+    "n_strata": 8,
+    "n_psu": 24
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/bench_brand_awareness_survey.py b/benchmarks/speed_review/bench_brand_awareness_survey.py
new file mode 100644
index 00000000..338bfa82
--- /dev/null
+++ b/benchmarks/speed_review/bench_brand_awareness_survey.py
@@ -0,0 +1,161 @@
+"""
+Scenario 2: Brand awareness survey DiD — 2x2 with survey design.
+
+DifferenceInDifferences + SurveyDesign under two variance paths:
+  (a) analytical Taylor-series linearization (strata + PSU + FPC)
+  (b) replicate-weight variance (BRR, 40 replicate columns in the baseline run)
+
+Chains: naive fit (SE-inflation comparison) -> TSL -> replicate -> multi-outcome
+refit loop -> check_parallel_trends -> placebo -> event study -> HonestDiD grid.
+
+Data shape: 200 units x 12 periods = 2,400 rows (see build_data),
+10 strata, 4 PSUs/stratum.
+"""
+
+import numpy as np
+
+from diff_diff import (
+    DifferenceInDifferences,
+    MultiPeriodDiD,
+    SurveyDesign,
+    check_parallel_trends,
+    compute_honest_did,
+)
+from diff_diff.prep import generate_survey_did_data
+
+from bench_shared import run_scenario
+
+
+def build_data(seed=42):
+    df = generate_survey_did_data(
+        n_units=200, n_periods=12, cohort_periods=[7],
+        never_treated_frac=0.5, treatment_effect=2.0,
+        dynamic_effects=True, effect_growth=0.2,
+        n_strata=10, psu_per_stratum=4,
+        weight_variation="high", psu_re_sd=1.5,
+        include_replicate_weights=True, panel=True, seed=seed,
+    )
+    rng = np.random.default_rng(seed + 1)
+    df["consideration"] = df["outcome"] + rng.normal(0, 0.4, size=len(df))
+    df["purchase_intent"] = df["outcome"] * 0.6 + rng.normal(0, 0.3, size=len(df))
+    df["post"] = (df["period"] >= 7).astype(int)
+    # Unit-level treatment indicator (for pre-period placebo and
+    # parallel-trends check — `treated` is row-level and zero in the pre-
+    # period, which those diagnostics can't use).
+    df["treat_unit"] = (df["first_treat"] > 0).astype(int)
+    return df
+
+
+def main():
+    data = build_data()
+    rw_cols = [c for c in data.columns if c.startswith("rep_")]
+
+    results = {}
+
+    def naive_fit():
+        did = DifferenceInDifferences(robust=True, cluster="psu")
+        results["naive"] = did.fit(
+            data, outcome="outcome", treatment="treat_unit", time="post",
+        )
+
+    def tsl_fit():
+        sd = SurveyDesign(
+            weights="weight", strata="stratum", psu="psu",
+            fpc="fpc", nest=True,
+        )
+        did = DifferenceInDifferences(robust=True)
+        results["tsl"] = did.fit(
+            data, outcome="outcome", treatment="treat_unit", time="post",
+            survey_design=sd,
+        )
+
+    def replicate_fit():
+        if not rw_cols:
+            raise RuntimeError("replicate weights not generated")
+        sd = SurveyDesign(
+            weights="weight", replicate_weights=rw_cols,
+            replicate_method="BRR",
+        )
+        did = DifferenceInDifferences(robust=True)
+        results["replicate"] = did.fit(
+            data, outcome="outcome", treatment="treat_unit", time="post",
+            survey_design=sd,
+        )
+
+    def multi_outcome_loop():
+        sd = SurveyDesign(
+            weights="weight", strata="stratum", psu="psu", nest=True,
+        )
+        out = {}
+        for y in ("outcome", "consideration", "purchase_intent"):
+            did = DifferenceInDifferences(robust=True)
+            out[y] = did.fit(
+                data, outcome=y, treatment="treat_unit", time="post",
+                survey_design=sd,
+            )
+        results["multi_outcome"] = out
+
+    def pretrends():
+        results["pt"] = check_parallel_trends(
+            data, outcome="outcome", time="period",
+            treatment_group="treat_unit",
+            pre_periods=list(range(1, 7)),
+        )
+
+    def placebo_refit():
+        pre = data[data["period"] < 7].copy()
+        pre["placebo_post"] = (pre["period"] >= 4).astype(int)
+        sd = SurveyDesign(
+            weights="weight", strata="stratum", psu="psu", nest=True,
+        )
+        did = DifferenceInDifferences(robust=True)
+        results["placebo"] = did.fit(
+            pre, outcome="outcome", treatment="treat_unit",
+            time="placebo_post", survey_design=sd,
+        )
+
+    def honest_did_grid():
+        sd = SurveyDesign(
+            weights="weight", strata="stratum", psu="psu", nest=True,
+        )
+        es = MultiPeriodDiD()
+        es_result = es.fit(
+            data, outcome="outcome", treatment="treat_unit",
+            time="period", unit="unit", reference_period=6,
+            survey_design=sd,
+        )
+        results["event_study"] = es_result
+        out = {}
+        for M in (0.5, 1.0, 1.5):
+            try:
+                out[M] = compute_honest_did(
+                    es_result, method="relative_magnitude", M=M,
+                )
+            except Exception as e:
+                out[M] = f"{type(e).__name__}: {e}"
+        results["honest"] = out
+
+    phases = [
+        ("1_naive_fit_no_survey_design", naive_fit),
+        ("2_tsl_strata_psu_fpc", tsl_fit),
+        ("3_replicate_weights_brr", replicate_fit),
+        ("4_multi_outcome_loop_3_metrics", multi_outcome_loop),
+        ("5_check_parallel_trends", pretrends),
+        ("6_placebo_refit_pre_period", placebo_refit),
+        ("7_event_study_plus_honest_did", honest_did_grid),
+    ]
+
+    run_scenario(
+        "brand_awareness_survey",
+        phases,
+        metadata={
+            "n_units": 200, "n_periods": 12, "n_obs": int(len(data)),
+            "n_strata": 10, "n_psu_per_stratum": 4,
+            "n_replicate_weights": len(rw_cols),
+            "outcomes": ["outcome", "consideration", "purchase_intent"],
+        },
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/speed_review/bench_brfss_panel.py b/benchmarks/speed_review/bench_brfss_panel.py
new file mode 100644
index 00000000..c437a5f7
--- /dev/null
+++ b/benchmarks/speed_review/bench_brfss_panel.py
@@ -0,0 +1,147 @@
+"""
+Scenario 3: BRFSS-style microdata -> aggregate_survey -> CS panel.
+
+Chains: aggregate_survey (microdata -> state-year panel) -> CS fit with
+stage-2 SurveyDesign + bootstrap at PSU -> event-study pre-trends ->
+HonestDiD grid -> SunAbraham robustness refit -> practitioner_next_steps.
+
+Data shape: ~50K microdata rows scaled to a ~50-state x 10-year study
+population (reflects BRFSS 2024's ~458K universe filtered to a substate
+analytic slice). 10 strata, 200 PSUs. Collapses to a 500-cell panel.
+5 adoption cohorts staggered across the window.
+"""
+
+import numpy as np
+import pandas as pd
+
+from diff_diff import (
+    CallawaySantAnna,
+    SunAbraham,
+    SurveyDesign,
+    aggregate_survey,
+    compute_honest_did,
+    practitioner_next_steps,
+)
+
+from bench_shared import run_scenario
+
+
+def build_microdata(seed=42, n_states=50, n_years=10, n_per_cell=100,
+                   n_strata=10, n_psu=200):
+    rng = np.random.default_rng(seed)
+    n_rows = n_states * n_years * n_per_cell
+    state = np.repeat(np.arange(n_states), n_years * n_per_cell)
+    year = np.tile(
+        np.repeat(np.arange(2010, 2010 + n_years), n_per_cell),
+        n_states,
+    )
+    stratum = rng.integers(0, n_strata, size=n_rows)
+    psu = stratum * (n_psu // n_strata) + rng.integers(
+        0, n_psu // n_strata, size=n_rows,
+    )
+    weight = rng.lognormal(0, 0.4, size=n_rows) * 50.0
+
+    cohort_map = rng.choice(
+        [0, 2013, 2014, 2015, 2016, 2017],
+        size=n_states,
+        p=[0.4, 0.12, 0.12, 0.12, 0.12, 0.12],
+    )
+    first_treat = cohort_map[state]
+    treated = (first_treat > 0) & (year >= first_treat)
+    y = (
+        rng.normal(0, 1, size=n_rows)
+        + 0.5 * (year - 2010)
+        + 3.0 * treated.astype(float)
+        + rng.normal(0, 0.2, size=n_rows) * state
+    )
+    df = pd.DataFrame({
+        "state": state, "year": year,
+        "strata": stratum, "psu": psu, "finalwt": weight,
+        "y": y, "first_treat": first_treat,
+    })
+    return df
+
+
+def main():
+    micro = build_microdata()
+
+    results = {}
+
+    def aggregate():
+        sd = SurveyDesign(
+            weights="finalwt", strata="strata", psu="psu",
+        )
+        panel, stage2 = aggregate_survey(
+            micro, by=["state", "year"], outcomes="y",
+            survey_design=sd,
+        )
+        panel["first_treat"] = panel["state"].map(
+            micro.groupby("state")["first_treat"].first(),
+        )
+        results["panel"] = panel
+        results["stage2"] = stage2
+
+    def cs_fit():
+        cs = CallawaySantAnna(
+            control_group="never_treated", estimation_method="reg",
+            n_bootstrap=199, seed=123,
+        )
+        results["cs"] = cs.fit(
+            results["panel"], outcome="y_mean",
+            unit="state", time="year", first_treat="first_treat",
+            survey_design=results["stage2"], aggregate="all",
+        )
+
+    def inspect_pretrends():
+        es = results["cs"].event_study_effects or {}
+        results["pretrends"] = {
+            rel_t: eff for rel_t, eff in es.items() if rel_t < 0
+        }
+
+    def honest_grid():
+        out = {}
+        for M in (0.5, 1.0, 1.5):
+            try:
+                out[M] = compute_honest_did(
+                    results["cs"], method="relative_magnitude", M=M,
+                )
+            except Exception as e:
+                out[M] = f"{type(e).__name__}: {e}"
+        results["honest"] = out
+
+    def sun_abraham():
+        sa = SunAbraham(control_group="never_treated")
+        results["sa"] = sa.fit(
+            results["panel"], outcome="y_mean", unit="state",
+            time="year", first_treat="first_treat",
+            survey_design=results["stage2"],
+        )
+
+    def guidance():
+        results["guidance"] = practitioner_next_steps(results["cs"])
+
+    phases = [
+        ("1_aggregate_survey_microdata_to_panel", aggregate),
+        ("2_cs_fit_with_stage2_survey_design", cs_fit),
+        ("3_inspect_pretrends", inspect_pretrends),
+        ("4_honest_did_grid", honest_grid),
+        ("5_sun_abraham_robustness", sun_abraham),
+        ("6_practitioner_next_steps", guidance),
+    ]
+
+    run_scenario(
+        "brfss_panel",
+        phases,
+        metadata={
+            "n_microdata_rows": int(len(micro)),
+            "n_states": int(micro["state"].nunique()),
+            "n_years": int(micro["year"].nunique()),
+            "n_strata": int(micro["strata"].nunique()),
+            "n_psu": int(micro["psu"].nunique()),
+            "n_bootstrap": 199,
+        },
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/speed_review/bench_campaign_staggered.py b/benchmarks/speed_review/bench_campaign_staggered.py
new file mode 100644
index 00000000..77e06672
--- /dev/null
+++ b/benchmarks/speed_review/bench_campaign_staggered.py
@@ -0,0 +1,124 @@
+"""
+Scenario 1: Staggered marketing campaign.
+
+CallawaySantAnna with covariates + bootstrap + aggregate='all', wrapped in
+the 8-step Baker workflow: Bacon -> CS fit -> event-study pre-trend
+inspection -> HonestDiD M-grid -> SunAbraham + ImputationDiD robustness
+-> with/without-covariates refit -> practitioner_next_steps.
+
+Data shape: 150 DMAs x 26 weekly periods, 2 staggered cohorts, 2 covariates.
+"""
+
+import numpy as np
+import pandas as pd
+
+from diff_diff import (
+    BaconDecomposition,
+    CallawaySantAnna,
+    ImputationDiD,
+    SunAbraham,
+    compute_honest_did,
+    practitioner_next_steps,
+)
+from diff_diff.prep import generate_staggered_data
+
+from bench_shared import run_scenario
+
+
+def build_data(seed=42):
+    df = generate_staggered_data(
+        n_units=150, n_periods=26, cohort_periods=[9, 14],
+        never_treated_frac=0.3, treatment_effect=3.0,
+        dynamic_effects=True, effect_growth=0.1, seed=seed,
+    )
+    rng = np.random.default_rng(seed + 1)
+    unit_log_pop = pd.Series(
+        rng.normal(0, 1, size=df["unit"].nunique()),
+        index=sorted(df["unit"].unique()),
+    )
+    df["log_pop"] = df["unit"].map(unit_log_pop)
+    df["baseline_spend"] = rng.normal(0, 1, size=len(df))
+    return df
+
+
+def main():
+    data = build_data()
+    covars = ["log_pop", "baseline_spend"]
+    fit_kwargs = dict(
+        data=data, outcome="outcome", unit="unit", time="period",
+        first_treat="first_treat",
+    )
+
+    results = {}
+
+    def bacon():
+        results["bacon"] = BaconDecomposition().fit(
+            data, outcome="outcome", unit="unit", time="period",
+            first_treat="first_treat",
+        )
+
+    def cs_fit():
+        cs = CallawaySantAnna(
+            control_group="never_treated", estimation_method="dr",
+            cluster="unit", n_bootstrap=999, seed=123,
+        )
+        results["cs"] = cs.fit(
+            **fit_kwargs, covariates=covars, aggregate="all",
+        )
+
+    def inspect_pretrends():
+        es = results["cs"].event_study_effects or {}
+        results["pretrends"] = {
+            rel_t: eff for rel_t, eff in es.items() if rel_t < 0
+        }
+
+    def honest_did_grid():
+        out = {}
+        for M in (0.5, 1.0, 1.5, 2.0):
+            out[M] = compute_honest_did(
+                results["cs"], method="relative_magnitude", M=M,
+            )
+        results["honest"] = out
+
+    def sun_abraham():
+        sa = SunAbraham(control_group="never_treated", cluster="unit")
+        results["sa"] = sa.fit(**fit_kwargs)
+
+    def imputation():
+        bjs = ImputationDiD(cluster="unit")
+        results["bjs"] = bjs.fit(**fit_kwargs, aggregate="event_study")
+
+    def cs_no_covariates():
+        cs = CallawaySantAnna(
+            control_group="never_treated", estimation_method="reg",
+            cluster="unit", n_bootstrap=199, seed=123,
+        )
+        results["cs_nocov"] = cs.fit(**fit_kwargs, aggregate="all")
+
+    def next_steps():
+        results["guidance"] = practitioner_next_steps(results["cs"])
+
+    phases = [
+        ("1_bacon_decomposition", bacon),
+        ("2_cs_fit_with_covariates_bootstrap999", cs_fit),
+        ("3_inspect_pretrends", inspect_pretrends),
+        ("4_honest_did_M_grid", honest_did_grid),
+        ("5_sun_abraham_robustness", sun_abraham),
+        ("6_imputation_did_robustness", imputation),
+        ("7_cs_without_covariates", cs_no_covariates),
+        ("8_practitioner_next_steps", next_steps),
+    ]
+
+    run_scenario(
+        "campaign_staggered",
+        phases,
+        metadata={
+            "n_units": 150, "n_periods": 26, "n_cohorts": 2,
+            "covariates": covars, "n_bootstrap": 999,
+            "aggregate": "all", "estimation_method": "dr",
+        },
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/speed_review/bench_dose_response.py b/benchmarks/speed_review/bench_dose_response.py
new file mode 100644
index 00000000..9ae46094
--- /dev/null
+++ b/benchmarks/speed_review/bench_dose_response.py
@@ -0,0 +1,107 @@
+"""
+Scenario 6: Pricing dose-response with ContinuousDiD cubic spline.
+
+Chains: CDiD fit with aggregate='dose' (overall ATT + ACRT + dose-response
+curves + bootstrap 199) -> dataframe extraction -> event-study pre-trend ->
+binarized-DiD comparison -> two spline-sensitivity refits (degree=1; num_knots=2).
+
+Data shape: 500 stores x 6 quarterly periods, 1 cohort at period 3,
+log-normal dose. Matches Tutorial 14 scaled from 200 to 500 units.
+"""
+
+import numpy as np
+import pandas as pd
+
+from diff_diff import ContinuousDiD, DifferenceInDifferences
+from diff_diff.prep import generate_continuous_did_data
+
+from bench_shared import run_scenario
+
+
+def build_data(seed=42):
+    df = generate_continuous_did_data(
+        n_units=500, n_periods=6, seed=seed,
+    )
+    # If the generator did not supply first_treat, set it to period 3 for
+    # units with dose > 0; ContinuousDiD expects a first_treat column.
+    if "first_treat" not in df.columns:
+        treated_mask = df.get("dose", pd.Series(0.0, index=df.index)) > 0
+        df["first_treat"] = np.where(treated_mask, 3, 0)
+    return df
+
+
+def main():
+    data = build_data()
+
+    results = {}
+    fit_kwargs = dict(
+        data=data, outcome="outcome", unit="unit", time="period",
+        first_treat="first_treat", dose="dose",
+    )
+
+    def cdid_cubic_fit():
+        cdid = ContinuousDiD(
+            degree=3, num_knots=1, n_bootstrap=199, seed=123,
+        )
+        results["cubic"] = cdid.fit(**fit_kwargs, aggregate="dose")
+
+    def extract_curves():
+        r = results["cubic"]
+        out = {}
+        for level in ("dose_response", "group_time", "event_study"):
+            try:
+                out[level] = r.to_dataframe(level=level)
+            except Exception as e:
+                out[level] = f"{type(e).__name__}: {e}"
+        results["curves"] = out
+
+    def cdid_event_study():
+        cdid = ContinuousDiD(
+            degree=3, num_knots=1, n_bootstrap=0, seed=123,
+        )
+        results["event_study"] = cdid.fit(
+            **fit_kwargs, aggregate="eventstudy",
+        )
+
+    def binarized_comparison():
+        data_bin = data.copy()
+        data_bin["treated_any"] = (data_bin["dose"] > 0).astype(int)
+        data_bin["post"] = (data_bin["period"] >= 3).astype(int)
+        did = DifferenceInDifferences(robust=True)
+        results["binarized"] = did.fit(
+            data_bin, outcome="outcome", treatment="treated_any", time="post",
+        )
+
+    def spline_sensitivity_linear():
+        cdid = ContinuousDiD(
+            degree=1, num_knots=0, n_bootstrap=199, seed=123,
+        )
+        results["linear"] = cdid.fit(**fit_kwargs, aggregate="dose")
+
+    def spline_sensitivity_more_knots():
+        cdid = ContinuousDiD(
+            degree=3, num_knots=2, n_bootstrap=199, seed=123,
+        )
+        results["many_knots"] = cdid.fit(**fit_kwargs, aggregate="dose")
+
+    phases = [
+        ("1_cdid_cubic_spline_bootstrap199", cdid_cubic_fit),
+        ("2_extract_dose_response_dataframes", extract_curves),
+        ("3_cdid_event_study_pretrend", cdid_event_study),
+        ("4_binarized_did_comparison", binarized_comparison),
+        ("5_spline_sensitivity_degree1", spline_sensitivity_linear),
+        ("6_spline_sensitivity_num_knots2", spline_sensitivity_more_knots),
+    ]
+
+    run_scenario(
+        "dose_response",
+        phases,
+        metadata={
+            "n_units": 500, "n_periods": 6, "n_bootstrap": 199,
+            "spline_configs": ["degree=3,k=1", "degree=1,k=0", "degree=3,k=2"],
+        },
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/speed_review/bench_geo_few_markets.py b/benchmarks/speed_review/bench_geo_few_markets.py
new file mode 100644
index 00000000..35350f0d
--- /dev/null
+++ b/benchmarks/speed_review/bench_geo_few_markets.py
@@ -0,0 +1,99 @@
+"""
+Scenario 4: Geo-experiment with few treated markets (SyntheticDiD).
+
+Chains: SDiD with jackknife variance (80 LOO refits) -> SDiD with bootstrap
+variance for SE comparison -> in_time_placebo -> get_loo_effects_df ->
+sensitivity_to_zeta_omega -> weight-concentration diagnostic.
+
+Data shape: 80 markets x 12 weekly periods (6 pre, 6 post), 5 treated,
+2 latent factors. Matches Tutorial 18's geo-experiment walkthrough.
+"""
+
+from diff_diff import SyntheticDiD
+from diff_diff.prep import generate_factor_data
+
+from bench_shared import run_scenario
+
+
+def build_data(seed=42):
+    return generate_factor_data(
+        n_units=80, n_pre=6, n_post=6, n_treated=5,
+        n_factors=2, treatment_effect=2.0,
+        factor_strength=1.0, treated_loading_shift=0.5,
+        seed=seed,
+    )
+
+
+def main():
+    data = build_data()
+    # `treat` is the unit-level (block) indicator; `treated` is row-level.
+    # SyntheticDiD requires block treatment, and post_periods identifies the
+    # treatment window among treated units.
+    post_periods = sorted(
+        data.loc[(data["treat"] == 1) & (data["treated"] == 1),
+                 "period"].unique().tolist(),
+    )
+
+    results = {}
+
+    def sdid_jackknife():
+        sdid = SyntheticDiD(variance_method="jackknife", seed=123)
+        results["jk"] = sdid.fit(
+            data, outcome="outcome", unit="unit", time="period",
+            treatment="treat", post_periods=post_periods,
+        )
+
+    def sdid_bootstrap():
+        sdid = SyntheticDiD(
+            variance_method="bootstrap", n_bootstrap=200, seed=123,
+        )
+        results["bs"] = sdid.fit(
+            data, outcome="outcome", unit="unit", time="period",
+            treatment="treat", post_periods=post_periods,
+        )
+
+    def in_time_placebo():
+        fn = getattr(results["jk"], "in_time_placebo", None)
+        if fn is None:
+            raise RuntimeError("in_time_placebo not available on results")
+        results["in_time"] = fn()
+
+    def loo_effects_df():
+        fn = getattr(results["jk"], "get_loo_effects_df", None)
+        if fn is None:
+            raise RuntimeError("get_loo_effects_df not available")
+        results["loo"] = fn()
+
+    def sensitivity_zeta_omega():
+        fn = getattr(results["jk"], "sensitivity_to_zeta_omega", None)
+        if fn is None:
+            raise RuntimeError("sensitivity_to_zeta_omega not available")
+        results["zeta"] = fn()
+
+    def weight_concentration():
+        fn = getattr(results["jk"], "get_weight_concentration", None)
+        if fn is None:
+            raise RuntimeError("get_weight_concentration not available")
+        results["wc"] = fn()
+
+    phases = [
+        ("1_sdid_jackknife_variance", sdid_jackknife),
+        ("2_sdid_bootstrap_variance_200", sdid_bootstrap),
+        ("3_in_time_placebo", in_time_placebo),
+        ("4_get_loo_effects_df", loo_effects_df),
+        ("5_sensitivity_to_zeta_omega", sensitivity_zeta_omega),
+        ("6_weight_concentration", weight_concentration),
+    ]
+
+    run_scenario(
+        "geo_few_markets",
+        phases,
+        metadata={
+            "n_units": 80, "n_periods": 12, "n_treated": 5,
+            "n_factors": 2,
+        },
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/speed_review/bench_reversible_dcdh.py b/benchmarks/speed_review/bench_reversible_dcdh.py
new file mode 100644
index 00000000..b9c6499c
--- /dev/null
+++ b/benchmarks/speed_review/bench_reversible_dcdh.py
@@ -0,0 +1,117 @@
+"""
+Scenario 5: Reversible treatment with dCDH, L_max multi-horizon, survey TSL.
+
+Chains: dCDH fit with L_max=3 (multi-horizon DID_l + dynamic placebos +
+sup-t bands + TWFE diagnostic + survey TSL) -> inspect placebo ->
+compute_honest_did on placebo event study -> heterogeneity refit.
+
+Data shape: 120 groups x 10 periods, single-switch reversible pattern,
+survey-weighted with 8 strata and 24 PSUs.
+"""
+
+import numpy as np
+import pandas as pd
+
+from diff_diff import (
+    ChaisemartinDHaultfoeuille,
+    SurveyDesign,
+    compute_honest_did,
+)
+from diff_diff.prep import generate_reversible_did_data
+
+from bench_shared import run_scenario
+
+
+def attach_survey_columns(df, seed=42, n_strata=8, psu_per_stratum=3):
+    rng = np.random.default_rng(seed)
+    groups = sorted(df["group"].unique())
+    n_groups = len(groups)
+    stratum_map = {g: i % n_strata for i, g in enumerate(groups)}
+    psu_map = {
+        g: stratum_map[g] * psu_per_stratum + (i // n_strata) % psu_per_stratum
+        for i, g in enumerate(groups)
+    }
+    weight_map = {
+        g: float(rng.lognormal(0, 0.3)) for g in groups
+    }
+    df = df.copy()
+    df["stratum"] = df["group"].map(stratum_map)
+    df["psu"] = df["group"].map(psu_map)
+    df["pw"] = df["group"].map(weight_map)
+    return df
+
+
+def main():
+    raw = generate_reversible_did_data(
+        n_groups=120, n_periods=10, pattern="single_switch",
+        initial_treat_frac=0.3, p_switch=0.15,
+        treatment_effect=2.0, heterogeneous_effects=True,
+        seed=42,
+    )
+    data = attach_survey_columns(raw)
+
+    results = {}
+    fit_kwargs = dict(
+        data=data, outcome="outcome", group="group", time="period",
+        treatment="treatment",
+    )
+
+    def dcdh_fit_lmax3():
+        est = ChaisemartinDHaultfoeuille(seed=123)
+        sd = SurveyDesign(
+            weights="pw", strata="stratum", psu="psu",
+        )
+        results["dcdh"] = est.fit(
+            **fit_kwargs, L_max=3, survey_design=sd,
+        )
+
+    def inspect_placebo():
+        r = results["dcdh"]
+        results["placebo_summary"] = {
+            "placebo_effect": getattr(r, "placebo_effect", None),
+            "overall_att": getattr(r, "overall_att", None),
+            "joiners_att": getattr(r, "joiners_att", None),
+            "leavers_att": getattr(r, "leavers_att", None),
+        }
+
+    def honest_placebo():
+        out = {}
+        for M in (0.5, 1.0, 1.5):
+            try:
+                out[M] = compute_honest_did(
+                    results["dcdh"], method="relative_magnitude", M=M,
+                )
+            except Exception as e:
+                out[M] = f"{type(e).__name__}: {e}"
+        results["honest"] = out
+
+    def heterogeneity_refit():
+        est = ChaisemartinDHaultfoeuille(seed=123)
+        try:
+            results["het"] = est.fit(
+                **fit_kwargs, L_max=3, heterogeneity="group",
+            )
+        except (NotImplementedError, ValueError) as e:
+            results["het"] = f"{type(e).__name__}: {e}"
+
+    phases = [
+        ("1_dcdh_fit_Lmax3_survey_TSL", dcdh_fit_lmax3),
+        ("2_inspect_placebo_and_summary", inspect_placebo),
+        ("3_honest_did_on_placebo", honest_placebo),
+        ("4_heterogeneity_refit", heterogeneity_refit),
+    ]
+
+    run_scenario(
+        "reversible_dcdh",
+        phases,
+        metadata={
+            "n_groups": 120, "n_periods": 10,
+            "pattern": "single_switch", "L_max": 3,
+            "n_strata": int(data["stratum"].nunique()),
+            "n_psu": int(data["psu"].nunique()),
+        },
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/speed_review/bench_shared.py b/benchmarks/speed_review/bench_shared.py
new file mode 100644
index 00000000..1530ad89
--- /dev/null
+++ b/benchmarks/speed_review/bench_shared.py
@@ -0,0 +1,144 @@
+"""
+Shared harness for the practitioner-workflow performance scenarios.
+
+Each ``bench_<scenario>.py`` script imports ``run_scenario`` and hands it a
+list of phases (label, callable). The harness times each phase, wraps the
+full chain in a pyinstrument profile, and writes:
+
+- ``benchmarks/speed_review/baselines/<scenario>_<backend>.json``: per-phase wall-clock
+- ``benchmarks/speed_review/baselines/profiles/<scenario>_<backend>.html``: flame profile
+
+Backend is auto-detected via ``diff_diff._backend.HAS_RUST_BACKEND`` and the
+``DIFF_DIFF_BACKEND`` env var. Run each script twice — once with
+``DIFF_DIFF_BACKEND=python`` and once with ``DIFF_DIFF_BACKEND=rust`` — to
+populate both files.
+
+See ``docs/performance-scenarios.md`` for scenario definitions and
+``docs/performance-plan.md`` for the per-scenario findings and action
+recommendations derived from these results.
+"""
+
+import json
+import os
+import sys
+import time
+import warnings
+from pathlib import Path
+
+import numpy as np
+
+try:
+    from pyinstrument import Profiler
+    HAS_PYINSTRUMENT = True
+except ImportError:
+    HAS_PYINSTRUMENT = False
+    Profiler = None  # type: ignore[assignment,misc]
+
+sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
+
+from diff_diff._backend import HAS_RUST_BACKEND
+
+RESULTS_DIR = Path(__file__).resolve().parent / "baselines"
+PROFILE_DIR = RESULTS_DIR / "profiles"
+
+
+def _backend_label():
+    """Return 'rust' or 'python' for file naming."""
+    env = os.environ.get("DIFF_DIFF_BACKEND", "auto").lower()
+    if env == "python":
+        return "python"
+    if env == "rust":
+        return "rust"
+    return "rust" if HAS_RUST_BACKEND else "python"
+
+
+def run_scenario(scenario_name, phases, metadata=None):
+    """Time a list of phases and write JSON + pyinstrument profile.
+
+    Parameters
+    ----------
+    scenario_name : str
+        Filename stem, e.g. ``"campaign_staggered"``. Output files use
+        ``<scenario_name>_<backend>.(json|html)``.
+    phases : list of (label, callable) tuples
+        Each callable takes no arguments; its return value is ignored.
+        The harness passes no state between phases, so each callable
+        reads and writes whatever it needs (e.g. a shared ``results``
+        dict) through its enclosing scope.
+    metadata : dict, optional
+        Extra fields folded into the JSON under ``metadata`` (data shape,
+        params, etc.). Pure data, no callables.
+    """
+    backend = _backend_label()
+    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+    PROFILE_DIR.mkdir(parents=True, exist_ok=True)
+
+    warnings.filterwarnings(
+        "ignore", message=".*invalid value encountered in matmul.*",
+        category=RuntimeWarning,
+    )
+
+    profile = None
+    if HAS_PYINSTRUMENT:
+        profile = Profiler(async_mode="disabled")
+        profile.start()
+
+    phase_times = {}
+    total_start = time.perf_counter()
+    try:
+        for label, fn in phases:
+            t0 = time.perf_counter()
+            try:
+                fn()
+                phase_times[label] = {
+                    "seconds": time.perf_counter() - t0,
+                    "ok": True,
+                    "error": None,
+                }
+            except Exception as e:
+                phase_times[label] = {
+                    "seconds": time.perf_counter() - t0,
+                    "ok": False,
+                    "error": f"{type(e).__name__}: {e}",
+                }
+                print(f"  [{label}] FAILED: {type(e).__name__}: {e}")
+    finally:
+        total_elapsed = time.perf_counter() - total_start
+        if profile is not None:
+            profile.stop()
+            html_path = PROFILE_DIR / f"{scenario_name}_{backend}.html"
+            with open(html_path, "w") as f:
+                f.write(profile.output_html())
+            repo_root = Path(__file__).resolve().parents[2]
+            print(f"  profile -> {html_path.relative_to(repo_root)}")
+
+    record = {
+        "scenario": scenario_name,
+        "backend": backend,
+        "has_rust_backend": HAS_RUST_BACKEND,
+        "total_seconds": total_elapsed,
+        "phases": phase_times,
+        "metadata": metadata or {},
+        "diff_diff_version": _get_version(),
+        "numpy_version": np.__version__,
+    }
+
+    json_path = RESULTS_DIR / f"{scenario_name}_{backend}.json"
+    with open(json_path, "w") as f:
+        json.dump(record, f, indent=2, default=str)
+
+    print(f"\n  [{scenario_name}] backend={backend}  total={total_elapsed:.2f}s")
+    for label, info in phase_times.items():
+        status = "OK " if info["ok"] else "ERR"
+        print(f"    {status} {label:<40} {info['seconds']:>8.3f}s")
+    repo_root = Path(__file__).resolve().parents[2]
+    print(f"  json    -> {json_path.relative_to(repo_root)}")
+    return record
+
+
+def _get_version():
+    try:
+        import diff_diff
+        return diff_diff.__version__
+    except Exception:
+        return "unknown"
diff --git a/benchmarks/speed_review/run_all.py b/benchmarks/speed_review/run_all.py
new file mode 100644
index 00000000..d31a23d6
--- /dev/null
+++ b/benchmarks/speed_review/run_all.py
@@ -0,0 +1,79 @@
+#!/usr/bin/env python3
+"""
+Run every practitioner-workflow scenario under both backends.
+
+Writes per-scenario JSON + pyinstrument HTML to ``benchmarks/speed_review/baselines/``
+and its ``profiles/`` subdirectory. See ``docs/performance-scenarios.md`` for
+scenario definitions and ``docs/performance-plan.md`` for the derived
+findings.
+
+Usage:
+
+    python benchmarks/speed_review/run_all.py
+    python benchmarks/speed_review/run_all.py --backend python
+    python benchmarks/speed_review/run_all.py --backend rust
+    python benchmarks/speed_review/run_all.py --scenarios campaign_staggered
+"""
+
+import argparse
+import os
+import subprocess
+import sys
+from pathlib import Path
+
+HERE = Path(__file__).resolve().parent
+SCRIPTS = {
+    "campaign_staggered": "bench_campaign_staggered.py",
+    "brand_awareness_survey": "bench_brand_awareness_survey.py",
+    "brfss_panel": "bench_brfss_panel.py",
+    "geo_few_markets": "bench_geo_few_markets.py",
+    "reversible_dcdh": "bench_reversible_dcdh.py",
+    "dose_response": "bench_dose_response.py",
+}
+
+
+def run(scenario, backend):
+    script = HERE / SCRIPTS[scenario]
+    env = os.environ.copy()
+    env["DIFF_DIFF_BACKEND"] = backend
+    print(f"\n===== {scenario} backend={backend} =====")
+    result = subprocess.run(
+        [sys.executable, str(script)], env=env,
+    )
+    return result.returncode == 0
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--backend", choices=["python", "rust", "both"], default="both",
+    )
+    parser.add_argument(
+        "--scenarios", nargs="+", choices=list(SCRIPTS),
+        default=list(SCRIPTS),
+    )
+    args = parser.parse_args()
+
+    if args.backend == "both":
+        backends = ["python", "rust"]
+    else:
+        backends = [args.backend]
+
+    failures = []
+    for backend in backends:
+        for scenario in args.scenarios:
+            if not run(scenario, backend):
+                failures.append((scenario, backend))
+
+    print("\n\n===== SUMMARY =====")
+    if failures:
+        print(f"{len(failures)} scenario/backend combos failed:")
+        for s, b in failures:
+            print(f"  - {s} ({b})")
+        sys.exit(1)
+    else:
+        print("All scenarios passed.")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index bac34b63..91a44f47 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -4,6 +4,195 @@ This document outlines the strategy for improving diff-diff's performance on lar
 
 ---
 
+## Practitioner Workflow Baseline (v3.1.3, April 2026)
+
+Earlier sections of this document (v1.4.0, v2.0.3) measured isolated `fit()`
+calls on synthetic panels for R-parity. This section measures **end-to-end
+practitioner chains** — Bacon decomposition, fit, event-study pre-trend
+inspection, HonestDiD sensitivity grids, cross-estimator robustness refits,
+and reporting — at data shapes anchored to applied-econ papers and industry
+writeups. The six scenarios are defined in
+[`docs/performance-scenarios.md`](performance-scenarios.md); scripts live in
+`benchmarks/speed_review/bench_*.py`; raw results in
+`benchmarks/speed_review/baselines/*.json` and flame profiles in
+`benchmarks/speed_review/baselines/profiles/`.
+
+Environment: macOS darwin 25.3 on Apple Silicon M4, Python 3.9,
+numpy 2.x, diff_diff 3.1.3. Each scenario runs under
+`DIFF_DIFF_BACKEND=python` and `DIFF_DIFF_BACKEND=rust`.
+
+### Per-scenario wall-clock totals
+
+| Scenario | Python (s) | Rust (s) | Rust speedup | Dominant phase |
+|---|---:|---:|---:|---|
+| 1. Staggered campaign (CS + 8-step chain) | 0.48 | 0.48 | 1.0x | ImputationDiD robustness (49%) |
+| 2. Brand awareness survey (DiD + SurveyDesign) | 0.18 | 0.20 | 0.9x | Multi-outcome loop + HonestDiD (~70% combined) |
+| 3. BRFSS microdata -> CS panel | 1.58 | 1.58 | 1.0x | `aggregate_survey` (94%) |
+| 4. Geo-experiment few markets (SDiD) | 2.96 | 0.04 | **76x** | SDiD Frank-Wolfe weight solver |
+| 5. Reversible treatment (dCDH L_max=3 + TSL) | 0.49 | 0.55 | 0.9x | dCDH fit (58%) + heterogeneity refit (40%) |
+| 6. Pricing dose-response (CDiD spline) | 0.57 | 0.58 | 1.0x | Four spline variants, ~25% each |
+
+At practitioner-realistic scales, the full chain runs in under two seconds
+on either backend for 5 of 6 scenarios; the exception is SDiD on the Python
+backend (2.96 s). The Rust backend gives a large speedup only for SDiD;
+elsewhere it is at parity, or marginally slower on small data (0.9x in
+Scenarios 2 and 5) due to the Python/Rust FFI crossing overhead.
+
+### Top hotspots ranked by total-time contribution
+
+| # | Location | Scenario | Time | Recommended action |
+|---|---|---|---:|---|
+| 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS | 1.0s self + 1.4s inclusive per 50K microdata | **Algorithmic fix** — loop runs per (state, year) cell; precompute stratum scaffolding once at top of `aggregate_survey` and reuse |
+| 2 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | Geo few markets (python) | 0.46s | **Already ported to Rust.** Python fallback acceptable for n < 50; document the python-backend ceiling rather than re-optimizing |
+| 3 | `diff_diff/imputation.py` ImputationDiD fit chain | Staggered campaign | 0.24s | **Investigate** — 4x slower than CS with `n_bootstrap=999` on identical data; unexpected given CS has the heavier bootstrap and same influence-function path. Likely imputation loop is not vectorized. |
+| 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit (`L_max=3` + TSL) | Reversible | 0.32s main + 0.22s heterogeneity refit | **Cache/precompute** — heterogeneity refit repeats TSL setup and data prep already done by main fit. Pass shared precomputed structures through. |
+| 5 | `diff_diff/continuous_did.py` CDiD bootstrap loop | Dose response | 0.14s per fit, 4 variants = 0.56s | **Leave alone** — linear scaling with spline variants is expected; total well under practitioner-perceptible threshold |
+
+### Per-scenario findings
+
+**Scenario 1 — Staggered campaign (CS + 8-step chain)**
+
+Top 5 phases (python-backend, ordered by time):
+
+1. `6_imputation_did_robustness` — 234 ms (49%) — **investigate**
+2. `5_sun_abraham_robustness` — 149 ms (31%) — expected; SA saturated TWFE
+3. `2_cs_fit_with_covariates_bootstrap999` — 59 ms (12%) — expected
+4. `7_cs_without_covariates` — 29 ms (6%) — expected
+5. `1_bacon_decomposition` — 7 ms (1%) — negligible
+
+Action: flag ImputationDiD for a focused profile comparison against CS on
+the same data; total scenario is otherwise already cheap enough.
+
+**Scenario 2 — Brand awareness survey**
+
+Top 5 phases (python-backend, ordered by time):
+
+1. `4_multi_outcome_loop_3_metrics` — 64 ms (36%) — expected; linear in outcome count
+2. `7_event_study_plus_honest_did` — 62 ms (35%) — expected; MP fit + 3x HonestDiD
+3. `6_placebo_refit_pre_period` — 24 ms (13%) — expected
+4. `3_replicate_weights_brr` — 12 ms (7%) — expected; 40 replicate columns
+5. `5_check_parallel_trends` — 9 ms (5%) — expected
+
+Action: **leave alone.** Full survey chain is ~200 ms end-to-end.
+
+**Scenario 3 — BRFSS microdata -> CS panel**
+
+Top 5 phases (python-backend, ordered by time):
+
+1. `1_aggregate_survey_microdata_to_panel` — 1480 ms (94%) — **algorithmic fix**
+2. `5_sun_abraham_robustness` — 81 ms (5%) — expected
+3. `2_cs_fit_with_stage2_survey_design` — 15 ms (1%) — expected
+4. `4_honest_did_grid` — 4 ms — negligible
+5. `6_practitioner_next_steps` — <1 ms — negligible
+
+Action: **fix `aggregate_survey` per-cell loop.** Profile confirmed the
+self-time is concentrated in `_compute_stratified_psu_meat` being called
+once per output cell (500 cells for 50 states x 10 years) with redundant
+stratum-scaffolding reconstruction per call. A single precomputation of
+stratum indexes at the top of `aggregate_survey` should eliminate most of
+the 1s self-time without changing numerical output.
+
+**Scenario 4 — Geo-experiment few markets (SDiD)**
+
+Top phases (python vs rust):
+
+| Phase | Python | Rust |
+|---|---:|---:|
+| `5_sensitivity_to_zeta_omega` | 1059 ms | 11 ms |
+| `3_in_time_placebo` | 954 ms | 8 ms |
+| `2_sdid_bootstrap_variance_200` | 475 ms | 12 ms |
+| `1_sdid_jackknife_variance` | 472 ms | 7 ms |
+
+Profile of python fit: 99% of time is in `_sc_weight_fw_numpy` Frank-Wolfe
+solver, split ~evenly between unit-weight and time-weight solves.
+`_fw_step` convergence check (`np.allclose`) is half the inner-loop cost.
+
+Action: **no further optimization needed.** The Rust port is shipped and
+provides a 76x speedup on the full chain. The practitioner path defaults to
+Rust when available; the python fallback is a developer-safety path whose
+performance ceiling is acceptable at teaching scale (40-80 units) and is
+documented as non-production for larger n.
+
+**Scenario 5 — Reversible treatment (dCDH L_max=3 + TSL)**
+
+Top phases:
+
+1. `1_dcdh_fit_Lmax3_survey_TSL` — 316 ms (64% python / 58% rust) — **cache candidate**
+2. `4_heterogeneity_refit` — 174 ms (35%) — **cache candidate**
+3. `3_honest_did_on_placebo` — 4-13 ms — expected
+
+The main fit and heterogeneity refit each independently rebuild TSL
+scaffolding (stratum-PSU indexes, influence-function allocators, design-
+matrix reshaping). Because heterogeneity always follows an unconditional
+fit, the scaffolding is shared and can be passed through.
+
+Action: **investigate shared precomputation.** Not a P0 — total is ~550 ms
+end-to-end — but this is a newer code path (v3.1) and has not been
+optimization-reviewed.
+
+**Scenario 6 — Pricing dose-response (ContinuousDiD)**
+
+Four spline fits (cubic bootstrap 199, event-study, linear bootstrap 199,
+cubic num_knots=2 bootstrap 199) account for ~99% of runtime, ~140 ms each.
+Linear scaling in variant count is expected.
+
+Action: **leave alone.** Bootstrap 199 on 500 units x 6 periods with cubic
+splines at 140 ms per fit is well within practitioner-acceptable latency.
+
+### Correctness-adjacent observations (not P0, route separately)
+
+These are developer-ergonomics / API-consistency smells surfaced during
+scenario development. None is a silent failure and none belongs in this PR
+or in the silent-failures audit; they are logged here for awareness.
+
+1. **`aggregate` parameter naming.** CS accepts `aggregate="event_study"`;
+   ContinuousDiD requires `aggregate="eventstudy"` (no underscore). Both
+   estimators expose the same conceptual aggregation but different
+   spellings. Route: API-consistency cleanup, minor.
+2. **`generate_survey_did_data(panel=True)` `treated` column.** This is a
+   row-level active-treatment indicator that is zero in pre-periods, which
+   makes it quietly incompatible with `check_parallel_trends` (which expects
+   unit-level treatment-group membership) and with pre-period placebo tests.
+   Tutorial 17 does not hit this because it uses a 2x2 design where `post`
+   discriminates the comparison. Suggest adding a `treat_unit` column alongside
+   `treated` for generator output clarity. Route: DGP cleanup, minor.
+3. **`SurveyDesign.replicate_method` case sensitivity.** `"brr"` raises
+   `ValueError("must be one of {'Fay', 'SDR', 'BRR', 'JKn', 'JK1'}")`;
+   `"BRR"` works. Either normalize the input or mention the expected casing
+   in the error message. Route: API-ergonomics, minor.
+
+### What this baseline does not answer
+
+- Scaling: each scenario runs at a single data shape. We do not know how
+  end-to-end time scales with n, periods, or cohorts. If scaling becomes a
+  decision input, add a small per-scenario scale sweep (e.g., n_units in
+  {100, 500, 1000}) — the scripts are parameterised to support this.
+- Memory: no memory-ceiling measurement. If memory becomes a concern, a
+  memory profiler such as `memray` or the standard-library `tracemalloc`
+  can be wrapped into `bench_shared.run_scenario` without restructuring.
+- Pure-Rust profiles: scenarios run the Rust backend as a black box.
+  Optimizing inside `rust/` is a separate concern owned by the crate
+  maintainers and is not in scope here.
+- Real-data shapes: the scenarios use synthetic DGPs. The BRFSS scenario
+  uses a BRFSS-shaped synthetic panel, not actual BRFSS microdata. If a
+  real-data calibration becomes relevant, CDC BRFSS annual files are
+  public.
+
+### Reproducing
+
+```bash
+pip install pyinstrument                  # one-time, dev-only
+python benchmarks/speed_review/run_all.py # both backends, all scenarios
+
+# Single scenario, single backend:
+DIFF_DIFF_BACKEND=rust python benchmarks/speed_review/bench_campaign_staggered.py
+```
+
+Raw JSON and flame HTML are written under `benchmarks/speed_review/baselines/`
+for scenario-level diffing as the library evolves.
+
+---
+
 ## Results Achieved (v2.0.3)
 
 **v2.0.3 includes Rust backend optimizations** that further improve SyntheticDiD performance:
diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md
new file mode 100644
index 00000000..26214ada
--- /dev/null
+++ b/docs/performance-scenarios.md
@@ -0,0 +1,324 @@
+# Practitioner Workflow Scenarios for Performance Benchmarking
+
+This document defines the **realistic practitioner workloads** used to evaluate
+diff-diff's end-to-end performance. It is the methodology input for the
+per-scenario scripts under `benchmarks/speed_review/` and the findings in
+`docs/performance-plan.md`.
+
+## Why this doc exists
+
+The existing `benchmarks/` suite measures **isolated `fit()` calls on synthetic
+200-20,000 unit panels** against R packages for accuracy parity. That tells us
+whether our point estimates and SEs match `did::att_gt` and `fixest::feols`. It
+does **not** tell us what an analyst sees when they run a full 8-step Baker et
+al. (2025) workflow on a real BRFSS state-policy panel or a staggered geo
+campaign. Without that, any "should we optimize X?" or "should we port X to
+Rust?" decision is made on intuition, not data.
+
+The scenarios below are the measurement surface for that decision. They are
+chosen to:
+
+1. Cover the six practitioner decision-tree branches in
+   `docs/practitioner_decision_tree.rst` (simultaneous, staggered, reversible,
+   dose, few-markets, survey).
+2. Exercise the code paths added in v3.0-v3.1 that the old `benchmarks/` never
+   touched: survey `SurveyDesign` (TSL, replicate weights, PSU-level
+   multiplier bootstrap), `aggregate_survey`, dCDH (reversible, `L_max`),
+   SyntheticDiD jackknife, ContinuousDiD dose-spline, and the 8-step
+   chain (Bacon -> fit -> HonestDiD -> cross-estimator robustness).
+3. Use defensibly realistic data shapes anchored to applied-econ paper
+   conventions and industry writeups, **not** the 200 x 8 cookie cutter.
+
+This is a **measurement doc**, not a wishlist. It does not propose new
+features, does not propose optimizations, and does not propose new estimators.
+Anything discovered during measurement that looks like a bug gets flagged
+separately and routed to the silent-failures audit, not folded into a perf PR.
+
+## How this doc is used
+
+Each scenario in the "Scenarios" section below defines:
+
+- **Persona / domain** — who runs this and why
+- **Data shape** — n_units, n_periods, n_covariates, survey PSUs/strata,
+  microdata rows if relevant
+- **Estimator + params** — including `covariates`, `n_bootstrap`,
+  `survey_design`, `aggregate`, any non-default knobs
+- **Operation chain** — fit() is one step; the flow usually includes Bacon
+  decomposition, parallel-trends inspection, sensitivity analysis, aggregation,
+  and cross-estimator robustness. We time the **chain**, not just fit().
+- **Source anchor** — which tutorial, paper, or industry reference the
+  shape/workflow comes from
+
+For each scenario, `benchmarks/speed_review/` hosts a script
+(`bench_<scenario>.py`) that:
+
+1. Generates (or loads) the data once.
+2. Runs the full operation chain under `pyinstrument` and writes a flame HTML
+   to `benchmarks/speed_review/baselines/profiles/<scenario>_<backend>.html`.
+3. Writes a wall-clock JSON breakdown (per operation + total) to
+   `benchmarks/speed_review/baselines/<scenario>_<backend>.json`.
+4. Runs under both `DIFF_DIFF_BACKEND=python` and `DIFF_DIFF_BACKEND=rust`
+   when Rust is available. The gap is the primary input to Rust-expansion
+   decisions.
+
+The scenario scripts are **not** meant to replace `run_benchmarks.py` (which
+serves a different purpose: R-parity accuracy). They complement it.
+
+## Ground rules for realism
+
+- **No 200 x 8 synthetic panels.** The existing benchmarks already do that.
+  Each scenario below is either a different shape entirely or a 200 x 8 panel
+  wrapped in realistic downstream operations (bootstrap, survey, sensitivity).
+- **End-to-end, not isolated `fit()`.** Practitioners chain operations. A 50ms
+  fit inside a 999-replicate bootstrap wrapped in an 8-M-value HonestDiD loop
+  is a ~45-second end-to-end run where 90%+ of time may be outside the fit
+  call the old benchmark measured.
+- **Cite why the shape is realistic.** Every scenario grounds its data shape
+  in an applied-econ paper, a tutorial, an industry writeup, or a bundled
+  real dataset. If a scenario cannot cite a source for its shape, it does
+  not belong here.
+- **Time includes I/O and prep.** The stopwatch starts at the first library
+  call a practitioner would write in their notebook and ends at the last
+  result-reporting call — `practitioner_next_steps()` or a `summary()`. Data
+  generation (synthetic) is outside the stopwatch; data load
+  (`load_mpdta()`, CSV read) is inside.
+
+## Scenarios
+
+### 1. Staggered Marketing Campaign — CS + Event Study + HonestDiD
+
+- **Persona / domain.** Growth / performance-marketing data scientist at a
+  tech or e-commerce company. A brand campaign rolls out to DMAs in two
+  waves; analyst needs overall lift, event-study dynamics, and a sensitivity
+  bound for the VP.
+- **Data shape.** 150 units (DMAs) x 26 periods (weekly), 2 staggered
+  cohorts (wave 1 at period 9, wave 2 at period 14), ~30% never-treated,
+  2 covariates (`log_pop`, `baseline_spend`). This is deliberately larger
+  than the 80-DMA Tutorial 18 shape to stress the CS influence-function path;
+  GeoLift experiments commonly sit in the 50-200 DMA range when aggregated
+  at the DMA level per Meta's methodology docs.
+- **Estimator + params.**
+  ```python
+  CallawaySantAnna(
+      control_group="never_treated",
+      estimation_method="dr",
+      cluster="unit",
+      n_bootstrap=999,
+  ).fit(data, outcome="y", unit="unit", time="period",
+        first_treat="first_treat", covariates=["log_pop", "baseline_spend"],
+        aggregate="all")
+  ```
+- **Operation chain.** (1) `BaconDecomposition.fit()` for TWFE diagnostic;
+  (2) CS fit with `aggregate="all"` (populates simple, group, event_study);
+  (3) inspect event-study pre-period ATTs for pre-trends; (4)
+  `compute_honest_did(results, method="relative_magnitude", M=[0.5, 1.0, 1.5, 2.0])`;
+  (5) robustness: refit with `SunAbraham()` and `ImputationDiD()` for
+  cross-estimator comparison; (6) refit CS without covariates for the
+  Baker-mandated with/without comparison; (7) `practitioner_next_steps()`.
+- **Source anchor.** `docs/tutorials/02_staggered_did.ipynb` (staggered DGP
+  pattern), `docs/tutorials/18_geo_experiments.ipynb` (DMA framing),
+  Callaway & Sant'Anna (2021), Baker et al. (2025) 8-step workflow from
+  `diff_diff/guides/llms-practitioner.txt`, GeoLift methodology docs for
+  DMA panel conventions.
+
+### 2. Brand Awareness Survey DiD — 2x2 with Survey Design
+
+- **Persona / domain.** Brand / market-research analytics lead at a CPG
+  or agency. Runs a pre/post awareness survey across test and control
+  markets with complex sampling (strata + PSU clusters + unequal weights).
+  Needs design-correct SEs or the CI is too narrow.
+- **Data shape.** 40 regions x 8 quarterly waves x ~100 respondents per
+  region-wave = ~32,000 respondent rows (pre-aggregation). 10 strata, 4
+  PSUs per stratum (40 PSUs total), weight coefficient of variation ~1.0.
+  This is the Tutorial 17 shape scaled up from its demonstration
+  200 x 8 cells to a size where design effects meaningfully dominate runtime.
+- **Estimator + params.** Two variants in the same script:
+  ```python
+  # (a) Analytical TSL path
+  DifferenceInDifferences(robust=True).fit(
+      data, outcome="awareness", treatment="treated", time="post",
+      survey_design=SurveyDesign(weights="w", strata="stratum",
+                                 psu="cluster", fpc="fpc"),
+  )
+  # (b) Replicate-weight path (BRR-style, ~160 replicate columns)
+  SurveyDesign(weights="w", replicate_weights=[f"rw{i}" for i in range(160)],
+               replicate_method="BRR")  # case-sensitive: lowercase "brr" raises ValueError
+  ```
+- **Operation chain.** (1) naive `DifferenceInDifferences()` with no survey
+  design (for SE-inflation comparison); (2) `SurveyDesign.resolve()`;
+  (3) design-aware fit (TSL path); (4) design-aware fit (replicate-weight
+  path); (5) three funnel outcomes (awareness, consideration, purchase
+  intent) refit in a loop; (6) `check_parallel_trends()` and placebo pre-
+  period test; (7) `compute_honest_did()` with default M grid.
+- **Source anchor.** `docs/tutorials/17_brand_awareness_survey.ipynb`
+  (workflow shape), `docs/tutorials/16_survey_did.ipynb` (SurveyDesign
+  API), CDC BRFSS 2024 technical docs (`_STSTR`/`_PSU`/`_LLCPWT`
+  variable conventions for the 10-stratum / 40-PSU shape), Rao & Scott
+  (1984) for design-effect weighting logic exercised by replicate path.
+
+### 3. BRFSS State-Policy Microdata -> CS Panel
+
+- **Persona / domain.** Health-policy / public-health researcher. Has BRFSS
+  respondent-level microdata across 10 years, wants to estimate the effect
+  of a staggered state policy (e.g., Medicaid expansion, smoking ban) on
+  a design-correct outcome using `aggregate_survey()` to collapse microdata
+  to a state-year panel, then a modern staggered estimator.
+- **Data shape.** 50,000 microdata rows (~50 states x 10 years x ~100
+  respondents per state-year subsample — scaled to reflect the BRFSS 2024
+  ~458K-record universe filtered to a substate analytic population, per
+  CDC overview docs). 10 strata, 200 PSUs overall. Collapses via
+  `aggregate_survey` to 500-cell state-year panel. 5 adoption cohorts
+  staggered over the window.
+- **Estimator + params.**
+  ```python
+  panel, stage2 = aggregate_survey(
+      microdata, by=["state", "year"], outcomes="y",
+      survey_design=SurveyDesign(weights="finalwt", strata="strata", psu="psu"),
+  )
+  results = CallawaySantAnna(control_group="never_treated", estimation_method="reg",
+                             n_bootstrap=199).fit(
+      panel, outcome="y_mean", unit="state", time="year",
+      first_treat="first_treat", survey_design=stage2, aggregate="all",
+  )
+  compute_honest_did(results, method="relative_magnitude", M=[0.5, 1.0, 1.5])
+  ```
+- **Operation chain.** (1) `aggregate_survey()` — the microdata-to-panel
+  collapse; (2) CS fit with staged second-stage SurveyDesign
+  (`weight_type="pweight"`) and bootstrap at PSU level; (3) event-study
+  pre-trend inspection; (4) HonestDiD sensitivity grid; (5) SunAbraham
+  robustness refit (also survey-aware via Full replicate-weight path);
+  (6) `practitioner_next_steps()`.
+- **Source anchor.** `docs/practitioner_getting_started.rst` ("What If
+  You Have Survey Data?" section), CDC BRFSS 2024 overview
+  (cdc.gov/brfss/annual_data/2024), `diff_diff.prep.aggregate_survey`
+  docstring + `docs/survey-roadmap.md`, CS paper for staggered ATT(g,t)
+  inference.
+
+### 4. Geo-Experiment Few Markets — SyntheticDiD + Jackknife
+
+- **Persona / domain.** Growth marketing analyst running a small-market
+  campaign test (3-5 treated DMAs) against a pool of 30-80 control DMAs.
+  Too few treated for asymptotic CS SE; uses SyntheticDiD with
+  jackknife variance and a breakdown diagnostic for the VP.
+- **Data shape.** 80 DMAs x 12 weekly periods, 5 treated, 2 latent factors
+  driving the pre-period outcomes (factor-model DGP to stress the
+  optimization). This is the Tutorial 18 shape.
+- **Estimator + params.**
+  ```python
+  SyntheticDiD(variance_method="jackknife", n_bootstrap=0).fit(...)
+  # then also variance_method="bootstrap", n_bootstrap=200 for comparison
+  ```
+- **Operation chain.** (1) SDiD fit with `variance_method="jackknife"` —
+  exercises the leave-one-out refit loop (80 full refits); (2) SDiD fit
+  with `variance_method="bootstrap"`, `n_bootstrap=200` for SE comparison;
+  (3) `results.in_time_placebo()`; (4) `results.get_loo_effects_df()`;
+  (5) `results.sensitivity_to_zeta_omega()`; (6)
+  `results.get_weight_concentration()`; (7) `plot_synth_weights()` equivalent
+  (data extraction via `results.get_unit_weights_df()`). The jackknife loop
+  is the primary time sink; `sensitivity_to_zeta_omega` also refits.
+- **Source anchor.** `docs/tutorials/18_geo_experiments.ipynb`,
+  Arkhangelsky et al. (2021), Mercado Libre geo-experiment writeup
+  (medium.com/mercadolibre-tech), Meta GeoLift methodology docs
+  (facebookincubator.github.io/GeoLift — 10-treated / 10-20-control
+  convention).
+
+### 5. Reversible Treatment — dCDH with L_max and Survey TSL
+
+- **Persona / domain.** Marketing analyst measuring an always-on-with-
+  dark-periods campaign, or a health-policy researcher studying a policy
+  that switches on and off. Reversible treatment breaks every other
+  staggered estimator; dCDH is the only option.
+- **Data shape.** 120 groups x 10 periods, single-switch pattern per group,
+  ~40% always-control, survey-weighted with 8 strata and 24 PSUs. Larger
+  than the Tutorial's 80 x 6 demo to expose the `L_max` multi-horizon
+  influence-function allocation that was added in v3.1.
+- **Estimator + params.**
+  ```python
+  ChaisemartinDHaultfoeuille().fit(
+      data, outcome="y", group="group", time="period", treatment="treated",
+      L_max=3,
+      survey_design=SurveyDesign(weights="pw", strata="stratum", psu="cluster"),
+  )
+  ```
+- **Operation chain.** (1) dCDH fit with `L_max=3` (computes `DID_l` for
+  l=1..3, dynamic placebos, sup-t bands, TWFE diagnostic); (2) inspect
+  `placebo_effect` and dynamic placebos for pre-trend evidence;
+  (3) `results.print_summary()`; (4) `compute_honest_did()` on the placebo
+  event study; (5) heterogeneity refit with `heterogeneity="cohort"` if
+  the code path supports it on this shape. The TSL path for `L_max >= 1`
+  is newer code (v3.1) and has not been profiled.
+- **Source anchor.** `docs/practitioner_decision_tree.rst`
+  ("Reversible Treatment (On/Off Cycles)"), de Chaisemartin & D'Haultfoeuille
+  (2020), NBER WP 29873 (dynamic companion), R package
+  `DIDmultiplegtDYN` as methodological reference, `docs/methodology/REGISTRY.md`
+  dCDH section, `project_dcdh_shipped.md` for v3.1 feature set.
+
+### 6. Pricing Dose-Response — ContinuousDiD Cubic Spline
+
+- **Persona / domain.** Pricing / promo analyst at a retailer. Stores
+  received varying discount levels; analyst wants the dose-response curve
+  ATT(d), not just a binarized average. Requires Strong Parallel Trends.
+- **Data shape.** 500 units (stores) x 6 quarterly periods, 1 cohort at
+  period 3, dose drawn from log-normal (range 1-12 percentage points off
+  baseline price), ~30% untreated (dose = 0). This is the Tutorial 14
+  shape scaled from 200 to 500 units to stress the B-spline fitting.
+- **Estimator + params.**
+  ```python
+  ContinuousDiD(degree=3, num_knots=1, n_bootstrap=199).fit(
+      data, outcome="y", unit="unit", time="period", first_treat="first_treat",
+      dose="dose", aggregate="dose",
+  )
+  ```
+- **Operation chain.** (1) CDiD fit with `aggregate="dose"` — produces
+  overall ATT, overall ACRT, and the dose-response curves; (2)
+  `results.to_dataframe(level="dose_response")`; (3)
+  `results.to_dataframe(level="event_study")` for pre-trend diagnostics;
+  (4) compare to a binarized DiD fit on the same data to quantify
+  the information loss from binarizing; (5) alternate `degree=1`
+  (linear) and `num_knots=2` refits for spline-sensitivity. The dose-curve
+  bootstrap loop (199 reps x spline refit) is the primary time sink.
+- **Source anchor.** `docs/tutorials/14_continuous_did.ipynb`,
+  Callaway, Goodman-Bacon & Sant'Anna (2024), `docs/methodology/REGISTRY.md`
+  ContinuousDiD section.
+
+## Backend and environment notes
+
+All scenarios run under both backends where available:
+
+```bash
+DIFF_DIFF_BACKEND=python python benchmarks/speed_review/bench_<scenario>.py
+DIFF_DIFF_BACKEND=rust   python benchmarks/speed_review/bench_<scenario>.py
+```
+
+The Python-vs-Rust gap is the primary input to the Rust-expansion decision in
+`docs/performance-plan.md`. If Python is already within 2x of Rust for a
+scenario, that scenario is a weak Rust-port candidate; if Python is 10x+
+slower, it is a strong candidate.
+
+Apple Silicon M4 note per `TODO.md`: a spurious numpy `RuntimeWarning` on
+`matmul` for N > 260 does not affect correctness but can clutter profile
+output. Scripts filter this warning so profiles stay clean.
+
+## What is explicitly out of scope
+
+- **Optimizations.** This doc defines the measurement surface. Actual
+  performance fixes are separate PRs, each citing a specific
+  `docs/performance-plan.md` finding.
+- **R-parity benchmarking.** That is `benchmarks/run_benchmarks.py`'s job
+  and remains valuable; these scenarios complement it.
+- **Estimators without realistic practitioner flows.** TROP, EfficientDiD,
+  StackedDiD, and BaconDecomposition are exercised via the robustness
+  branches of scenarios 1 and 3; they do not get standalone scenarios
+  here. If a future practitioner tutorial gives one of them a distinct
+  end-to-end flow, a scenario can be added at that point.
+- **Rust backend internals.** We measure the Rust backend as a black box
+  (backend=rust wall-clock, backend=rust profile breakdown). Optimizing
+  inside Rust is a separate concern handled by `rust/` crate owners.
+
+## Pointers
+
+- Scripts: `benchmarks/speed_review/bench_<scenario>.py`
+- Raw results: `benchmarks/speed_review/baselines/<scenario>_<backend>.json`
+- Flame profiles: `benchmarks/speed_review/baselines/profiles/<scenario>_<backend>.html`
+- Findings doc: `docs/performance-plan.md` ("Practitioner Workflow Baseline"
+  section — per-scenario top-5 hot phases + recommended action category)

From 33d55fbfa721e5b7d21bd85495aad421cf0cce9f Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 10:54:54 -0400
Subject: [PATCH 02/15] Extend scenarios with scale sweep, confirm
 aggregate_survey bottleneck

Four scenarios (campaign_staggered, brand_awareness_survey, brfss_panel,
geo_few_markets) now run at small/medium/large data scales rather than a
single tutorial-scale point. The large scales reflect practitioner realism:
1M-row BRFSS pooled panels, 1,500-unit county-level staggered studies,
1,000-unit multi-region brand surveys, 500-unit zip-level geo-experiments.

Key finding from the sweep: aggregate_survey at 1M microdata rows takes
~24 seconds (100% of BRFSS chain runtime), with 97% of that in
_compute_stratified_psu_meat self-time. The tutorial-scale pass had
flagged this as a 1.5s finding; at practitioner scale it is 15-20x larger
and becomes the single highest-value optimization target identified. The
other four findings hold across scales: CS chain scales well to 1,500
units, brand-survey chain scales sub-linearly, SDiD Rust gap is stable,
ImputationDiD remains the top phase of the staggered chain at all scales.

Measurement only. docs/performance-plan.md and
docs/performance-scenarios.md updated with scale-sweep tables and
scaling-finding narrative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 benchmarks/speed_review/README.md             |  14 +-
 .../brand_awareness_survey_large_python.json  |  59 +++++
 .../brand_awareness_survey_large_rust.json    |  59 +++++
 .../brand_awareness_survey_medium_python.json |  59 +++++
 .../brand_awareness_survey_medium_rust.json   |  59 +++++
 ... brand_awareness_survey_small_python.json} |  19 +-
 ...=> brand_awareness_survey_small_rust.json} |  19 +-
 .../baselines/brfss_panel_large_python.json   |  49 ++++
 .../baselines/brfss_panel_large_rust.json     |  49 ++++
 .../baselines/brfss_panel_medium_python.json  |  49 ++++
 .../baselines/brfss_panel_medium_rust.json    |  49 ++++
 ...hon.json => brfss_panel_small_python.json} |  17 +-
 ..._rust.json => brfss_panel_small_rust.json} |  17 +-
 .../campaign_staggered_large_python.json      |  64 +++++
 .../campaign_staggered_large_rust.json        |  64 +++++
 .../campaign_staggered_medium_python.json     |  64 +++++
 .../campaign_staggered_medium_rust.json       |  64 +++++
 ...n => campaign_staggered_small_python.json} |  22 +-
 ...son => campaign_staggered_small_rust.json} |  22 +-
 .../baselines/dose_response_python.json       |  14 +-
 .../baselines/dose_response_rust.json         |  14 +-
 .../baselines/geo_few_markets_large_rust.json |  48 ++++
 .../geo_few_markets_medium_python.json        |  48 ++++
 .../geo_few_markets_medium_rust.json          |  48 ++++
 ...json => geo_few_markets_small_python.json} |  20 +-
 ...t.json => geo_few_markets_small_rust.json} |  20 +-
 .../baselines/reversible_dcdh_python.json     |  10 +-
 .../baselines/reversible_dcdh_rust.json       |  10 +-
 .../bench_brand_awareness_survey.py           |  55 +++--
 benchmarks/speed_review/bench_brfss_panel.py  |  49 ++--
 .../speed_review/bench_campaign_staggered.py  |  62 +++--
 .../speed_review/bench_geo_few_markets.py     |  76 ++++--
 docs/performance-plan.md                      | 232 ++++++++----------
 docs/performance-scenarios.md                 |  95 ++++---
 34 files changed, 1285 insertions(+), 334 deletions(-)
 create mode 100644 benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
 create mode 100644 benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
 create mode 100644 benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
 create mode 100644 benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
 rename benchmarks/speed_review/baselines/{brand_awareness_survey_python.json => brand_awareness_survey_small_python.json} (71%)
 rename benchmarks/speed_review/baselines/{brand_awareness_survey_rust.json => brand_awareness_survey_small_rust.json} (71%)
 create mode 100644 benchmarks/speed_review/baselines/brfss_panel_large_python.json
 create mode 100644 benchmarks/speed_review/baselines/brfss_panel_large_rust.json
 create mode 100644 benchmarks/speed_review/baselines/brfss_panel_medium_python.json
 create mode 100644 benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
 rename benchmarks/speed_review/baselines/{brfss_panel_python.json => brfss_panel_small_python.json} (70%)
 rename benchmarks/speed_review/baselines/{brfss_panel_rust.json => brfss_panel_small_rust.json} (70%)
 create mode 100644 benchmarks/speed_review/baselines/campaign_staggered_large_python.json
 create mode 100644 benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
 create mode 100644 benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
 create mode 100644 benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
 rename benchmarks/speed_review/baselines/{campaign_staggered_python.json => campaign_staggered_small_python.json} (69%)
 rename benchmarks/speed_review/baselines/{campaign_staggered_rust.json => campaign_staggered_small_rust.json} (69%)
 create mode 100644 benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
 create mode 100644 benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
 create mode 100644 benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
 rename benchmarks/speed_review/baselines/{geo_few_markets_python.json => geo_few_markets_small_python.json} (65%)
 rename benchmarks/speed_review/baselines/{geo_few_markets_rust.json => geo_few_markets_small_rust.json} (65%)

diff --git a/benchmarks/speed_review/README.md b/benchmarks/speed_review/README.md
index 4ba38232..09f610ce 100644
--- a/benchmarks/speed_review/README.md
+++ b/benchmarks/speed_review/README.md
@@ -1,4 +1,4 @@
-# Speed Review — Practitioner Workflow Benchmarks
+# Speed Review - Practitioner Workflow Benchmarks
 
 Scenario-driven performance measurement for end-to-end practitioner chains,
 as distinct from `benchmarks/run_benchmarks.py` which measures R-parity on
@@ -47,19 +47,25 @@ to regenerate the full flame when needed.
 # One-time install
 pip install pyinstrument
 
-# All scenarios, both backends
+# All scenarios, both backends, all scales
 python benchmarks/speed_review/run_all.py
 
-# One scenario, one backend
+# One scenario, one backend (the script runs its full scale sweep internally)
 DIFF_DIFF_BACKEND=rust python benchmarks/speed_review/bench_campaign_staggered.py
 
 # Subset
 python benchmarks/speed_review/run_all.py --scenarios brfss_panel geo_few_markets
 ```
 
+Multi-scale scenarios write per-scale outputs
+(e.g. `campaign_staggered_small_rust.json`, `..._medium_rust.json`,
+`..._large_rust.json`). Single-scale scenarios write the scale-free form
+(e.g. `dose_response_rust.json`). Full runtime for all scales × both
+backends is ~90 seconds on Apple Silicon M4.
+
 ## Where to look for findings
 
-[`docs/performance-plan.md`](../../docs/performance-plan.md) — "Practitioner
+[`docs/performance-plan.md`](../../docs/performance-plan.md) - "Practitioner
 Workflow Baseline (v3.1.3)" section holds per-scenario hot-phase rankings
 and action recommendations. The scenarios here are the measurement surface;
 the findings doc is the decision output.
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
new file mode 100644
index 00000000..6531abee
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
@@ -0,0 +1,59 @@
+{
+  "scenario": "brand_awareness_survey_large",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 0.7940070000000001,
+  "phases": {
+    "1_naive_fit_no_survey_design": {
+      "seconds": 0.013499665999999966,
+      "ok": true,
+      "error": null
+    },
+    "2_tsl_strata_psu_fpc": {
+      "seconds": 0.03187458300000001,
+      "ok": true,
+      "error": null
+    },
+    "3_replicate_weights_brr": {
+      "seconds": 0.3442796670000001,
+      "ok": true,
+      "error": null
+    },
+    "4_multi_outcome_loop_3_metrics": {
+      "seconds": 0.19682533299999982,
+      "ok": true,
+      "error": null
+    },
+    "5_check_parallel_trends": {
+      "seconds": 0.030179500000000026,
+      "ok": true,
+      "error": null
+    },
+    "6_placebo_refit_pre_period": {
+      "seconds": 0.043751333999999975,
+      "ok": true,
+      "error": null
+    },
+    "7_event_study_plus_honest_did": {
+      "seconds": 0.13358487500000016,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "large",
+    "n_units": 1000,
+    "n_periods": 12,
+    "n_obs": 12000,
+    "n_strata": 20,
+    "n_psu_per_stratum": 8,
+    "n_replicate_weights": 160,
+    "outcomes": [
+      "outcome",
+      "consideration",
+      "purchase_intent"
+    ]
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
new file mode 100644
index 00000000..9f3d673e
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
@@ -0,0 +1,59 @@
+{
+  "scenario": "brand_awareness_survey_large",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.828119375,
+  "phases": {
+    "1_naive_fit_no_survey_design": {
+      "seconds": 0.014049749999999861,
+      "ok": true,
+      "error": null
+    },
+    "2_tsl_strata_psu_fpc": {
+      "seconds": 0.029422499999999907,
+      "ok": true,
+      "error": null
+    },
+    "3_replicate_weights_brr": {
+      "seconds": 0.36754912500000003,
+      "ok": true,
+      "error": null
+    },
+    "4_multi_outcome_loop_3_metrics": {
+      "seconds": 0.16490987499999998,
+      "ok": true,
+      "error": null
+    },
+    "5_check_parallel_trends": {
+      "seconds": 0.03375229199999996,
+      "ok": true,
+      "error": null
+    },
+    "6_placebo_refit_pre_period": {
+      "seconds": 0.06475750000000025,
+      "ok": true,
+      "error": null
+    },
+    "7_event_study_plus_honest_did": {
+      "seconds": 0.15367104200000004,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "large",
+    "n_units": 1000,
+    "n_periods": 12,
+    "n_obs": 12000,
+    "n_strata": 20,
+    "n_psu_per_stratum": 8,
+    "n_replicate_weights": 160,
+    "outcomes": [
+      "outcome",
+      "consideration",
+      "purchase_intent"
+    ]
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
new file mode 100644
index 00000000..6d915456
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
@@ -0,0 +1,59 @@
+{
+  "scenario": "brand_awareness_survey_medium",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 0.48956791599999994,
+  "phases": {
+    "1_naive_fit_no_survey_design": {
+      "seconds": 0.01289191699999992,
+      "ok": true,
+      "error": null
+    },
+    "2_tsl_strata_psu_fpc": {
+      "seconds": 0.035409875000000035,
+      "ok": true,
+      "error": null
+    },
+    "3_replicate_weights_brr": {
+      "seconds": 0.12633833299999997,
+      "ok": true,
+      "error": null
+    },
+    "4_multi_outcome_loop_3_metrics": {
+      "seconds": 0.17774295900000003,
+      "ok": true,
+      "error": null
+    },
+    "5_check_parallel_trends": {
+      "seconds": 0.018629792000000034,
+      "ok": true,
+      "error": null
+    },
+    "6_placebo_refit_pre_period": {
+      "seconds": 0.0519646250000001,
+      "ok": true,
+      "error": null
+    },
+    "7_event_study_plus_honest_did": {
+      "seconds": 0.06657341699999986,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "medium",
+    "n_units": 500,
+    "n_periods": 12,
+    "n_obs": 6000,
+    "n_strata": 15,
+    "n_psu_per_stratum": 6,
+    "n_replicate_weights": 90,
+    "outcomes": [
+      "outcome",
+      "consideration",
+      "purchase_intent"
+    ]
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
new file mode 100644
index 00000000..e1d2a965
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
@@ -0,0 +1,59 @@
+{
+  "scenario": "brand_awareness_survey_medium",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.535454792,
+  "phases": {
+    "1_naive_fit_no_survey_design": {
+      "seconds": 0.011897708999999979,
+      "ok": true,
+      "error": null
+    },
+    "2_tsl_strata_psu_fpc": {
+      "seconds": 0.03526237499999996,
+      "ok": true,
+      "error": null
+    },
+    "3_replicate_weights_brr": {
+      "seconds": 0.185435083,
+      "ok": true,
+      "error": null
+    },
+    "4_multi_outcome_loop_3_metrics": {
+      "seconds": 0.14044966699999994,
+      "ok": true,
+      "error": null
+    },
+    "5_check_parallel_trends": {
+      "seconds": 0.019051875000000162,
+      "ok": true,
+      "error": null
+    },
+    "6_placebo_refit_pre_period": {
+      "seconds": 0.05337804200000007,
+      "ok": true,
+      "error": null
+    },
+    "7_event_study_plus_honest_did": {
+      "seconds": 0.08997387500000009,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "medium",
+    "n_units": 500,
+    "n_periods": 12,
+    "n_obs": 6000,
+    "n_strata": 15,
+    "n_psu_per_stratum": 6,
+    "n_replicate_weights": 90,
+    "outcomes": [
+      "outcome",
+      "consideration",
+      "purchase_intent"
+    ]
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
similarity index 71%
rename from benchmarks/speed_review/baselines/brand_awareness_survey_python.json
rename to benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
index 6551227d..ff655878 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
@@ -1,46 +1,47 @@
 {
-  "scenario": "brand_awareness_survey",
+  "scenario": "brand_awareness_survey_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.18850491600000008,
+  "total_seconds": 0.15087129199999993,
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0016701670000000002,
+      "seconds": 0.0017902499999999932,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.006741541999999989,
+      "seconds": 0.00610949999999999,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_brr": {
-      "seconds": 0.014424250000000027,
+      "seconds": 0.02120725000000001,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.043619666,
+      "seconds": 0.011621500000000062,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.00915220799999994,
+      "seconds": 0.001833375000000026,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.029268290999999946,
+      "seconds": 0.027076792000000016,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08362433400000002,
+      "seconds": 0.081212583,
       "ok": true,
       "error": null
     }
   },
   "metadata": {
+    "scale": "small",
     "n_units": 200,
     "n_periods": 12,
     "n_obs": 2400,
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
similarity index 71%
rename from benchmarks/speed_review/baselines/brand_awareness_survey_rust.json
rename to benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
index 48707354..db36b50c 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
@@ -1,46 +1,47 @@
 {
-  "scenario": "brand_awareness_survey",
+  "scenario": "brand_awareness_survey_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.16800324999999994,
+  "total_seconds": 0.200881125,
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0018907079999999077,
+      "seconds": 0.0018462080000000158,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.006109541999999912,
+      "seconds": 0.005704333000000061,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_brr": {
-      "seconds": 0.01849195799999992,
+      "seconds": 0.015561500000000006,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.02723191700000005,
+      "seconds": 0.05937758399999993,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.009134625000000063,
+      "seconds": 0.00939004099999996,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.024182666999999936,
+      "seconds": 0.025794415999999987,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08095333299999996,
+      "seconds": 0.08319054199999998,
       "ok": true,
       "error": null
     }
   },
   "metadata": {
+    "scale": "small",
     "n_units": 200,
     "n_periods": 12,
     "n_obs": 2400,
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_python.json b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
new file mode 100644
index 00000000..e738a57d
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
@@ -0,0 +1,49 @@
+{
+  "scenario": "brfss_panel_large",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 23.955508166,
+  "phases": {
+    "1_aggregate_survey_microdata_to_panel": {
+      "seconds": 23.873543207999997,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_stage2_survey_design": {
+      "seconds": 0.011892290999995225,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 2.08300000537065e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_grid": {
+      "seconds": 0.0016835410000055617,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.06804595899999555,
+      "ok": true,
+      "error": null
+    },
+    "6_practitioner_next_steps": {
+      "seconds": 0.00032774999999674037,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "large",
+    "n_microdata_rows": 1000000,
+    "n_states": 50,
+    "n_years": 10,
+    "n_strata": 20,
+    "n_psu": 1000,
+    "n_bootstrap": 199
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
new file mode 100644
index 00000000..7b407a9d
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
@@ -0,0 +1,49 @@
+{
+  "scenario": "brfss_panel_large",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 24.372338875,
+  "phases": {
+    "1_aggregate_survey_microdata_to_panel": {
+      "seconds": 24.274492667,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_stage2_survey_design": {
+      "seconds": 0.012104750000005993,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 2.2500000014247235e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_grid": {
+      "seconds": 0.001614166999999611,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.08373358300000433,
+      "ok": true,
+      "error": null
+    },
+    "6_practitioner_next_steps": {
+      "seconds": 0.00028904199999857383,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "large",
+    "n_microdata_rows": 1000000,
+    "n_states": 50,
+    "n_years": 10,
+    "n_strata": 20,
+    "n_psu": 1000,
+    "n_bootstrap": 199
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
new file mode 100644
index 00000000..9186ecdd
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
@@ -0,0 +1,49 @@
+{
+  "scenario": "brfss_panel_medium",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 6.1113194580000005,
+  "phases": {
+    "1_aggregate_survey_microdata_to_panel": {
+      "seconds": 6.027033041999999,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_stage2_survey_design": {
+      "seconds": 0.011803750000000335,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 2.5829999987792007e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_grid": {
+      "seconds": 0.0017158750000003664,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.07050441700000043,
+      "ok": true,
+      "error": null
+    },
+    "6_practitioner_next_steps": {
+      "seconds": 0.00024145799999963913,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "medium",
+    "n_microdata_rows": 250000,
+    "n_states": 50,
+    "n_years": 10,
+    "n_strata": 15,
+    "n_psu": 600,
+    "n_bootstrap": 199
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
new file mode 100644
index 00000000..516b7101
--- /dev/null
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
@@ -0,0 +1,49 @@
+{
+  "scenario": "brfss_panel_medium",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 6.197831334,
+  "phases": {
+    "1_aggregate_survey_microdata_to_panel": {
+      "seconds": 6.0959868749999995,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_stage2_survey_design": {
+      "seconds": 0.012175959000000347,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 2.2500000014247235e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_grid": {
+      "seconds": 0.0015915419999998903,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.08775508399999943,
+      "ok": true,
+      "error": null
+    },
+    "6_practitioner_next_steps": {
+      "seconds": 0.00030545799999970313,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "medium",
+    "n_microdata_rows": 250000,
+    "n_states": 50,
+    "n_years": 10,
+    "n_strata": 15,
+    "n_psu": 600,
+    "n_bootstrap": 199
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/brfss_panel_python.json b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
similarity index 70%
rename from benchmarks/speed_review/baselines/brfss_panel_python.json
rename to benchmarks/speed_review/baselines/brfss_panel_small_python.json
index fddf6cab..338c8f69 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
@@ -1,41 +1,42 @@
 {
-  "scenario": "brfss_panel",
+  "scenario": "brfss_panel_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.599043583,
+  "total_seconds": 1.5939237080000002,
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.530210625,
+      "seconds": 1.498778625,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.014581666999999854,
+      "seconds": 0.014817040999999698,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 1.8749999997069722e-06,
+      "seconds": 2.250000000092456e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.003660958000000214,
+      "seconds": 0.0039673339999999335,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05053487499999987,
+      "seconds": 0.07608712500000037,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 4.9042000000110164e-05,
+      "seconds": 0.00026245799999990993,
       "ok": true,
       "error": null
     }
   },
   "metadata": {
+    "scale": "small",
     "n_microdata_rows": 50000,
     "n_states": 50,
     "n_years": 10,
diff --git a/benchmarks/speed_review/baselines/brfss_panel_rust.json b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
similarity index 70%
rename from benchmarks/speed_review/baselines/brfss_panel_rust.json
rename to benchmarks/speed_review/baselines/brfss_panel_small_rust.json
index 44c2cfb0..1eded7d3 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
@@ -1,41 +1,42 @@
 {
-  "scenario": "brfss_panel",
+  "scenario": "brfss_panel_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.5960411249999997,
+  "total_seconds": 1.610289666,
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.5271849580000003,
+      "seconds": 1.532226625,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.014870542000000153,
+      "seconds": 0.015062499999999979,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.208000000170074e-06,
+      "seconds": 2.4170000001433323e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.003847707999999894,
+      "seconds": 0.0037798330000002878,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05008866700000025,
+      "seconds": 0.05893341699999999,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 4.3584000000151946e-05,
+      "seconds": 0.00028129199999993304,
       "ok": true,
       "error": null
     }
   },
   "metadata": {
+    "scale": "small",
     "n_microdata_rows": 50000,
     "n_states": 50,
     "n_years": 10,
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
new file mode 100644
index 00000000..6a65d78f
--- /dev/null
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
@@ -0,0 +1,64 @@
+{
+  "scenario": "campaign_staggered_large",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 1.2445272920000001,
+  "phases": {
+    "1_bacon_decomposition": {
+      "seconds": 0.018094541999999825,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_covariates_bootstrap999": {
+      "seconds": 0.166935917,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 3.4159999997562807e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_M_grid": {
+      "seconds": 0.0024244580000001292,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.31497600000000014,
+      "ok": true,
+      "error": null
+    },
+    "6_imputation_did_robustness": {
+      "seconds": 0.6572787920000001,
+      "ok": true,
+      "error": null
+    },
+    "7_cs_without_covariates": {
+      "seconds": 0.0847687920000002,
+      "ok": true,
+      "error": null
+    },
+    "8_practitioner_next_steps": {
+      "seconds": 3.716700000033768e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "large",
+    "n_units": 1500,
+    "n_periods": 26,
+    "n_cohorts": 3,
+    "n_obs": 39000,
+    "covariates": [
+      "log_pop",
+      "baseline_spend"
+    ],
+    "n_bootstrap": 999,
+    "aggregate": "all",
+    "estimation_method": "dr"
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
new file mode 100644
index 00000000..56c2d4d6
--- /dev/null
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
@@ -0,0 +1,64 @@
+{
+  "scenario": "campaign_staggered_large",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 1.2180239580000003,
+  "phases": {
+    "1_bacon_decomposition": {
+      "seconds": 0.01827750000000039,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_covariates_bootstrap999": {
+      "seconds": 0.16558841699999993,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 3.2919999997105265e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_M_grid": {
+      "seconds": 0.002447333000000107,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.35019279199999964,
+      "ok": true,
+      "error": null
+    },
+    "6_imputation_did_robustness": {
+      "seconds": 0.5965422080000002,
+      "ok": true,
+      "error": null
+    },
+    "7_cs_without_covariates": {
+      "seconds": 0.08493354199999992,
+      "ok": true,
+      "error": null
+    },
+    "8_practitioner_next_steps": {
+      "seconds": 3.420799999975799e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "large",
+    "n_units": 1500,
+    "n_periods": 26,
+    "n_cohorts": 3,
+    "n_obs": 39000,
+    "covariates": [
+      "log_pop",
+      "baseline_spend"
+    ],
+    "n_bootstrap": 999,
+    "aggregate": "all",
+    "estimation_method": "dr"
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
new file mode 100644
index 00000000..d5589ba1
--- /dev/null
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
@@ -0,0 +1,64 @@
+{
+  "scenario": "campaign_staggered_medium",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 0.7159983750000001,
+  "phases": {
+    "1_bacon_decomposition": {
+      "seconds": 0.012523791999999867,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_covariates_bootstrap999": {
+      "seconds": 0.09662354100000003,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 2.7499999999403e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_M_grid": {
+      "seconds": 0.002143749999999889,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.22049916599999997,
+      "ok": true,
+      "error": null
+    },
+    "6_imputation_did_robustness": {
+      "seconds": 0.33182912500000006,
+      "ok": true,
+      "error": null
+    },
+    "7_cs_without_covariates": {
+      "seconds": 0.052325667000000076,
+      "ok": true,
+      "error": null
+    },
+    "8_practitioner_next_steps": {
+      "seconds": 4.5832999999939616e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "medium",
+    "n_units": 500,
+    "n_periods": 26,
+    "n_cohorts": 3,
+    "n_obs": 13000,
+    "covariates": [
+      "log_pop",
+      "baseline_spend"
+    ],
+    "n_bootstrap": 999,
+    "aggregate": "all",
+    "estimation_method": "dr"
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
new file mode 100644
index 00000000..1e44a769
--- /dev/null
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
@@ -0,0 +1,64 @@
+{
+  "scenario": "campaign_staggered_medium",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.8678267909999999,
+  "phases": {
+    "1_bacon_decomposition": {
+      "seconds": 0.012692541999999918,
+      "ok": true,
+      "error": null
+    },
+    "2_cs_fit_with_covariates_bootstrap999": {
+      "seconds": 0.09956041599999987,
+      "ok": true,
+      "error": null
+    },
+    "3_inspect_pretrends": {
+      "seconds": 3.374999999916639e-06,
+      "ok": true,
+      "error": null
+    },
+    "4_honest_did_M_grid": {
+      "seconds": 0.002752457999999791,
+      "ok": true,
+      "error": null
+    },
+    "5_sun_abraham_robustness": {
+      "seconds": 0.42143175,
+      "ok": true,
+      "error": null
+    },
+    "6_imputation_did_robustness": {
+      "seconds": 0.2781500830000001,
+      "ok": true,
+      "error": null
+    },
+    "7_cs_without_covariates": {
+      "seconds": 0.053192707999999866,
+      "ok": true,
+      "error": null
+    },
+    "8_practitioner_next_steps": {
+      "seconds": 3.787500000007604e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "medium",
+    "n_units": 500,
+    "n_periods": 26,
+    "n_cohorts": 3,
+    "n_obs": 13000,
+    "covariates": [
+      "log_pop",
+      "baseline_spend"
+    ],
+    "n_bootstrap": 999,
+    "aggregate": "all",
+    "estimation_method": "dr"
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_python.json b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
similarity index 69%
rename from benchmarks/speed_review/baselines/campaign_staggered_python.json
rename to benchmarks/speed_review/baselines/campaign_staggered_small_python.json
index 69b536a3..12a7c4ea 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
@@ -1,54 +1,56 @@
 {
-  "scenario": "campaign_staggered",
+  "scenario": "campaign_staggered_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.493763792,
+  "total_seconds": 0.48264691700000006,
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.00662462499999994,
+      "seconds": 0.00833841600000007,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06328537499999998,
+      "seconds": 0.06103824999999996,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.3750000000276614e-06,
+      "seconds": 3.042000000008649e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0047993339999999884,
+      "seconds": 0.00554720799999997,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.09586058399999997,
+      "seconds": 0.07744641600000002,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.29060341599999995,
+      "seconds": 0.30062641700000003,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.03254304100000005,
+      "seconds": 0.02960395800000004,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.7708000000025166e-05,
+      "seconds": 3.195800000010962e-05,
       "ok": true,
       "error": null
     }
   },
   "metadata": {
+    "scale": "small",
     "n_units": 150,
     "n_periods": 26,
     "n_cohorts": 2,
+    "n_obs": 3900,
     "covariates": [
       "log_pop",
       "baseline_spend"
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
similarity index 69%
rename from benchmarks/speed_review/baselines/campaign_staggered_rust.json
rename to benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
index 44a4cdd9..be61050d 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
@@ -1,54 +1,56 @@
 {
-  "scenario": "campaign_staggered",
+  "scenario": "campaign_staggered_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.484783125,
+  "total_seconds": 0.4882878749999999,
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.006965292000000067,
+      "seconds": 0.0071254580000000844,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.060481958,
+      "seconds": 0.06050545900000004,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.9169999999911767e-06,
+      "seconds": 2.8330000000353905e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.004540166999999928,
+      "seconds": 0.004818417000000075,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.12357962499999997,
+      "seconds": 0.13427262500000003,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.25805591699999986,
+      "seconds": 0.2511277919999999,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.031115375000000167,
+      "seconds": 0.03038112500000012,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.6332999999943993e-05,
+      "seconds": 4.8000000000048004e-05,
       "ok": true,
       "error": null
     }
   },
   "metadata": {
+    "scale": "small",
     "n_units": 150,
     "n_periods": 26,
     "n_cohorts": 2,
+    "n_obs": 3900,
     "covariates": [
       "log_pop",
       "baseline_spend"
diff --git a/benchmarks/speed_review/baselines/dose_response_python.json b/benchmarks/speed_review/baselines/dose_response_python.json
index 7deaf504..d514729f 100644
--- a/benchmarks/speed_review/baselines/dose_response_python.json
+++ b/benchmarks/speed_review/baselines/dose_response_python.json
@@ -2,35 +2,35 @@
   "scenario": "dose_response",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.5864091250000001,
+  "total_seconds": 0.583888167,
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.14782820799999996,
+      "seconds": 0.148716792,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007271249999999396,
+      "seconds": 0.0007523750000000273,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.1467172499999999,
+      "seconds": 0.147467083,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0014637920000000193,
+      "seconds": 0.0015002499999999808,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14299950000000006,
+      "seconds": 0.141884625,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14666895800000002,
+      "seconds": 0.14356270800000015,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_rust.json b/benchmarks/speed_review/baselines/dose_response_rust.json
index d4f4cb54..28f751db 100644
--- a/benchmarks/speed_review/baselines/dose_response_rust.json
+++ b/benchmarks/speed_review/baselines/dose_response_rust.json
@@ -2,35 +2,35 @@
   "scenario": "dose_response",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.585448167,
+  "total_seconds": 0.5914641250000001,
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.14863079200000007,
+      "seconds": 0.15346125,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007015000000000216,
+      "seconds": 0.0007392499999999691,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.14747212500000006,
+      "seconds": 0.14869329099999995,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0016670830000000691,
+      "seconds": 0.0017346249999999896,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14236974999999996,
+      "seconds": 0.14182995799999998,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.1446025420000001,
+      "seconds": 0.14500187499999995,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
new file mode 100644
index 00000000..f415c34c
--- /dev/null
+++ b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
@@ -0,0 +1,48 @@
+{
+  "scenario": "geo_few_markets_large",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.262019625,
+  "phases": {
+    "1_sdid_jackknife_variance": {
+      "seconds": 0.04104833299999999,
+      "ok": true,
+      "error": null
+    },
+    "2_sdid_bootstrap_variance_200": {
+      "seconds": 0.037608707999999935,
+      "ok": true,
+      "error": null
+    },
+    "3_in_time_placebo": {
+      "seconds": 0.07724208399999999,
+      "ok": true,
+      "error": null
+    },
+    "4_get_loo_effects_df": {
+      "seconds": 0.0007287080000000223,
+      "ok": true,
+      "error": null
+    },
+    "5_sensitivity_to_zeta_omega": {
+      "seconds": 0.10535358300000008,
+      "ok": true,
+      "error": null
+    },
+    "6_weight_concentration": {
+      "seconds": 3.483299999995637e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "large",
+    "n_units": 500,
+    "n_pre": 6,
+    "n_post": 6,
+    "n_treated": 30,
+    "n_factors": 2
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
new file mode 100644
index 00000000..189fd1a0
--- /dev/null
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
@@ -0,0 +1,48 @@
+{
+  "scenario": "geo_few_markets_medium",
+  "backend": "python",
+  "has_rust_backend": false,
+  "total_seconds": 3.6490506659999995,
+  "phases": {
+    "1_sdid_jackknife_variance": {
+      "seconds": 0.3422396250000004,
+      "ok": true,
+      "error": null
+    },
+    "2_sdid_bootstrap_variance_200": {
+      "seconds": 0.3464741250000003,
+      "ok": true,
+      "error": null
+    },
+    "3_in_time_placebo": {
+      "seconds": 1.3339607080000002,
+      "ok": true,
+      "error": null
+    },
+    "4_get_loo_effects_df": {
+      "seconds": 0.0006264169999994351,
+      "ok": true,
+      "error": null
+    },
+    "5_sensitivity_to_zeta_omega": {
+      "seconds": 1.6257209160000006,
+      "ok": true,
+      "error": null
+    },
+    "6_weight_concentration": {
+      "seconds": 2.4749999999684746e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "medium",
+    "n_units": 200,
+    "n_pre": 6,
+    "n_post": 6,
+    "n_treated": 15,
+    "n_factors": 2
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
new file mode 100644
index 00000000..ab07cd9d
--- /dev/null
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
@@ -0,0 +1,48 @@
+{
+  "scenario": "geo_few_markets_medium",
+  "backend": "rust",
+  "has_rust_backend": true,
+  "total_seconds": 0.1170789579999999,
+  "phases": {
+    "1_sdid_jackknife_variance": {
+      "seconds": 0.020038041999999923,
+      "ok": true,
+      "error": null
+    },
+    "2_sdid_bootstrap_variance_200": {
+      "seconds": 0.022811833000000004,
+      "ok": true,
+      "error": null
+    },
+    "3_in_time_placebo": {
+      "seconds": 0.024646833000000035,
+      "ok": true,
+      "error": null
+    },
+    "4_get_loo_effects_df": {
+      "seconds": 0.000611084000000095,
+      "ok": true,
+      "error": null
+    },
+    "5_sensitivity_to_zeta_omega": {
+      "seconds": 0.048946500000000004,
+      "ok": true,
+      "error": null
+    },
+    "6_weight_concentration": {
+      "seconds": 2.112500000006623e-05,
+      "ok": true,
+      "error": null
+    }
+  },
+  "metadata": {
+    "scale": "medium",
+    "n_units": 200,
+    "n_pre": 6,
+    "n_post": 6,
+    "n_treated": 15,
+    "n_factors": 2
+  },
+  "diff_diff_version": "3.1.3",
+  "numpy_version": "2.0.2"
+}
\ No newline at end of file
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_python.json b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
similarity index 65%
rename from benchmarks/speed_review/baselines/geo_few_markets_python.json
rename to benchmarks/speed_review/baselines/geo_few_markets_small_python.json
index ccaec094..2dccefec 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
@@ -1,43 +1,45 @@
 {
-  "scenario": "geo_few_markets",
+  "scenario": "geo_few_markets_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 3.0047956659999997,
+  "total_seconds": 3.048684083,
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.4820445,
+      "seconds": 0.48794175000000006,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.4802018750000001,
+      "seconds": 0.4880505420000001,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.9625541249999998,
+      "seconds": 0.9833322499999999,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0008696669999999074,
+      "seconds": 0.0011712919999999905,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.079096792,
+      "seconds": 1.0881425410000003,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.4999999999941735e-05,
+      "seconds": 3.9874999999689464e-05,
       "ok": true,
       "error": null
     }
   },
   "metadata": {
+    "scale": "small",
     "n_units": 80,
-    "n_periods": 12,
+    "n_pre": 6,
+    "n_post": 6,
     "n_treated": 5,
     "n_factors": 2
   },
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
similarity index 65%
rename from benchmarks/speed_review/baselines/geo_few_markets_rust.json
rename to benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
index ff62e455..0a150287 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
@@ -1,43 +1,45 @@
 {
-  "scenario": "geo_few_markets",
+  "scenario": "geo_few_markets_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.03952049999999996,
+  "total_seconds": 0.04011274999999992,
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.00735379199999997,
+      "seconds": 0.007763625000000052,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.012488166000000023,
+      "seconds": 0.012679667000000006,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.008124790999999965,
+      "seconds": 0.008062542000000006,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0006939590000000218,
+      "seconds": 0.0007486250000000583,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.010841416999999964,
+      "seconds": 0.010833333,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 1.5958999999954315e-05,
+      "seconds": 2.1959000000015827e-05,
       "ok": true,
       "error": null
     }
   },
   "metadata": {
+    "scale": "small",
     "n_units": 80,
-    "n_periods": 12,
+    "n_pre": 6,
+    "n_post": 6,
     "n_treated": 5,
     "n_factors": 2
   },
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_python.json b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
index b19cfc16..697b0ecc 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_python.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
@@ -2,25 +2,25 @@
   "scenario": "reversible_dcdh",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.5749523339999999,
+  "total_seconds": 0.550877666,
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.30690474999999995,
+      "seconds": 0.3742725,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.2500000000637002e-06,
+      "seconds": 1.1669999999686098e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.0035600000000000076,
+      "seconds": 0.003917875000000071,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.264483833,
+      "seconds": 0.17268387499999993,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
index 22301be9..27ab9df0 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
@@ -2,25 +2,25 @@
   "scenario": "reversible_dcdh",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.44736124999999993,
+  "total_seconds": 0.5288864590000001,
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.319021125,
+      "seconds": 0.3325950000000001,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.37499999997015e-06,
+      "seconds": 1.1670000000796321e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.003557540999999942,
+      "seconds": 0.004058833999999956,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.12477833300000007,
+      "seconds": 0.19222925000000002,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/bench_brand_awareness_survey.py b/benchmarks/speed_review/bench_brand_awareness_survey.py
index 338bfa82..d2474066 100644
--- a/benchmarks/speed_review/bench_brand_awareness_survey.py
+++ b/benchmarks/speed_review/bench_brand_awareness_survey.py
@@ -8,8 +8,10 @@
 Chains: naive fit (for SE-inflation comparison) -> TSL -> replicate -> multi-
 outcome refit loop -> check_parallel_trends -> placebo -> HonestDiD grid.
 
-Data shape: 40 regions x 8 quarters x ~100 respondents per cell =
-~32K respondent rows, 10 strata, 4 PSUs/stratum.
+Three scales:
+  - small  (200 units x 12 periods): Tutorial 17 analog
+  - medium (500 units x 12 periods): realistic CPG quarterly brand-tracking wave
+  - large  (1000 units x 12 periods): multi-region brand tracking at scale
 """
 
 import numpy as np
@@ -26,12 +28,19 @@
 from bench_shared import run_scenario
 
 
-def build_data(seed=42):
+SCALES = {
+    "small":  {"n_units": 200,  "n_periods": 12, "n_strata": 10, "psu_per_stratum": 4},
+    "medium": {"n_units": 500,  "n_periods": 12, "n_strata": 15, "psu_per_stratum": 6},
+    "large":  {"n_units": 1000, "n_periods": 12, "n_strata": 20, "psu_per_stratum": 8},
+}
+
+
+def build_data(n_units, n_periods, n_strata, psu_per_stratum, seed=42):
     df = generate_survey_did_data(
-        n_units=200, n_periods=12, cohort_periods=[7],
+        n_units=n_units, n_periods=n_periods, cohort_periods=[7],
         never_treated_frac=0.5, treatment_effect=2.0,
         dynamic_effects=True, effect_growth=0.2,
-        n_strata=10, psu_per_stratum=4,
+        n_strata=n_strata, psu_per_stratum=psu_per_stratum,
         weight_variation="high", psu_re_sd=1.5,
         include_replicate_weights=True, panel=True, seed=seed,
     )
@@ -39,19 +48,11 @@ def build_data(seed=42):
     df["consideration"] = df["outcome"] + rng.normal(0, 0.4, size=len(df))
     df["purchase_intent"] = df["outcome"] * 0.6 + rng.normal(0, 0.3, size=len(df))
     df["post"] = (df["period"] >= 7).astype(int)
-    # Unit-level treatment indicator (for pre-period placebo and
-    # parallel-trends check — `treated` is row-level and zero in the pre-
-    # period, which those diagnostics can't use).
     df["treat_unit"] = (df["first_treat"] > 0).astype(int)
     return df
 
 
-def main():
-    data = build_data()
-    rw_cols = [c for c in data.columns if c.startswith("rep_")]
-
-    results = {}
-
+def make_phases(data, results, rw_cols):
     def naive_fit():
         did = DifferenceInDifferences(robust=True, cluster="psu")
         results["naive"] = did.fit(
@@ -135,7 +136,7 @@ def honest_did_grid():
                 out[M] = f"{type(e).__name__}: {e}"
         results["honest"] = out
 
-    phases = [
+    return [
         ("1_naive_fit_no_survey_design", naive_fit),
         ("2_tsl_strata_psu_fpc", tsl_fit),
         ("3_replicate_weights_brr", replicate_fit),
@@ -145,17 +146,35 @@ def honest_did_grid():
         ("7_event_study_plus_honest_did", honest_did_grid),
     ]
 
+
+def run_scale(scale, config):
+    data = build_data(**config)
+    rw_cols = [c for c in data.columns if c.startswith("rep_")]
+    results = {}
+    phases = make_phases(data, results, rw_cols)
+
     run_scenario(
-        "brand_awareness_survey",
+        f"brand_awareness_survey_{scale}",
         phases,
         metadata={
-            "n_units": 200, "n_periods": 12, "n_obs": int(len(data)),
-            "n_strata": 10, "n_psu_per_stratum": 4,
+            "scale": scale,
+            "n_units": config["n_units"],
+            "n_periods": config["n_periods"],
+            "n_obs": int(len(data)),
+            "n_strata": config["n_strata"],
+            "n_psu_per_stratum": config["psu_per_stratum"],
             "n_replicate_weights": len(rw_cols),
             "outcomes": ["outcome", "consideration", "purchase_intent"],
         },
     )
 
 
+def main():
+    for scale, config in SCALES.items():
+        print(f"\n{'='*60}\n  brand_awareness_survey / scale={scale} "
+              f"(n_units={config['n_units']})\n{'='*60}")
+        run_scale(scale, config)
+
+
 if __name__ == "__main__":
     main()
diff --git a/benchmarks/speed_review/bench_brfss_panel.py b/benchmarks/speed_review/bench_brfss_panel.py
index c437a5f7..1f5a473b 100644
--- a/benchmarks/speed_review/bench_brfss_panel.py
+++ b/benchmarks/speed_review/bench_brfss_panel.py
@@ -5,10 +5,10 @@
 stage-2 SurveyDesign + bootstrap at PSU -> event-study pre-trends ->
 HonestDiD grid -> SunAbraham robustness refit -> practitioner_next_steps.
 
-Data shape: ~50K microdata rows scaled to a ~50-state x 10-year study
-population (reflects BRFSS 2024's ~458K universe filtered to a substate
-analytic slice). 10 strata, 200 PSUs. Collapses to a 500-cell panel.
-5 adoption cohorts staggered across the window.
+Three scales, grounded in BRFSS 2024 (~458K records total):
+  - small  (50K rows):  single-year / single-state substudy slice
+  - medium (250K rows): multi-year multi-state analytic slice
+  - large  (1M rows):   pooled 10-year BRFSS-scale panel
 """
 
 import numpy as np
@@ -26,8 +26,17 @@
 from bench_shared import run_scenario
 
 
-def build_microdata(seed=42, n_states=50, n_years=10, n_per_cell=100,
-                   n_strata=10, n_psu=200):
+SCALES = {
+    "small":  {"n_states": 50, "n_years": 10, "n_per_cell": 100,
+               "n_strata": 10, "n_psu": 200},
+    "medium": {"n_states": 50, "n_years": 10, "n_per_cell": 500,
+               "n_strata": 15, "n_psu": 600},
+    "large":  {"n_states": 50, "n_years": 10, "n_per_cell": 2000,
+               "n_strata": 20, "n_psu": 1000},
+}
+
+
+def build_microdata(n_states, n_years, n_per_cell, n_strata, n_psu, seed=42):
     rng = np.random.default_rng(seed)
     n_rows = n_states * n_years * n_per_cell
     state = np.repeat(np.arange(n_states), n_years * n_per_cell)
@@ -54,19 +63,14 @@ def build_microdata(seed=42, n_states=50, n_years=10, n_per_cell=100,
         + 3.0 * treated.astype(float)
         + rng.normal(0, 0.2, size=n_rows) * state
     )
-    df = pd.DataFrame({
+    return pd.DataFrame({
         "state": state, "year": year,
         "strata": stratum, "psu": psu, "finalwt": weight,
         "y": y, "first_treat": first_treat,
     })
-    return df
-
 
-def main():
-    micro = build_microdata()
-
-    results = {}
 
+def make_phases(micro, results):
     def aggregate():
         sd = SurveyDesign(
             weights="finalwt", strata="strata", psu="psu",
@@ -120,7 +124,7 @@ def sun_abraham():
     def guidance():
         results["guidance"] = practitioner_next_steps(results["cs"])
 
-    phases = [
+    return [
         ("1_aggregate_survey_microdata_to_panel", aggregate),
         ("2_cs_fit_with_stage2_survey_design", cs_fit),
         ("3_inspect_pretrends", inspect_pretrends),
@@ -129,10 +133,16 @@ def guidance():
         ("6_practitioner_next_steps", guidance),
     ]
 
+
+def run_scale(scale, config):
+    micro = build_microdata(**config)
+    results = {}
+    phases = make_phases(micro, results)
     run_scenario(
-        "brfss_panel",
+        f"brfss_panel_{scale}",
         phases,
         metadata={
+            "scale": scale,
             "n_microdata_rows": int(len(micro)),
             "n_states": int(micro["state"].nunique()),
             "n_years": int(micro["year"].nunique()),
@@ -143,5 +153,14 @@ def guidance():
     )
 
 
+def main():
+    for scale, config in SCALES.items():
+        n_rows = (config["n_states"] * config["n_years"]
+                  * config["n_per_cell"])
+        print(f"\n{'='*60}\n  brfss_panel / scale={scale} "
+              f"({n_rows:,} microdata rows)\n{'='*60}")
+        run_scale(scale, config)
+
+
 if __name__ == "__main__":
     main()
diff --git a/benchmarks/speed_review/bench_campaign_staggered.py b/benchmarks/speed_review/bench_campaign_staggered.py
index 77e06672..7a5e976f 100644
--- a/benchmarks/speed_review/bench_campaign_staggered.py
+++ b/benchmarks/speed_review/bench_campaign_staggered.py
@@ -6,7 +6,10 @@
 inspection -> HonestDiD M-grid -> SunAbraham + ImputationDiD robustness
 -> with/without-covariates refit -> practitioner_next_steps.
 
-Data shape: 150 DMAs x 26 weekly periods, 2 staggered cohorts, 2 covariates.
+Three scales:
+  - small  (150 units x 26 periods): Tutorial 02 / GeoLift DMA panel
+  - medium (500 units x 26 periods): pooled-DMA or multi-year sub-DMA
+  - large  (1500 units x 26 periods): county-level staggered policy
 """
 
 import numpy as np
@@ -25,9 +28,17 @@
 from bench_shared import run_scenario
 
 
-def build_data(seed=42):
+SCALES = {
+    "small":  {"n_units": 150,  "n_periods": 26, "cohort_periods": [9, 14]},
+    "medium": {"n_units": 500,  "n_periods": 26, "cohort_periods": [9, 14, 19]},
+    "large":  {"n_units": 1500, "n_periods": 26, "cohort_periods": [9, 14, 19]},
+}
+
+
+def build_data(n_units, n_periods, cohort_periods, seed=42):
     df = generate_staggered_data(
-        n_units=150, n_periods=26, cohort_periods=[9, 14],
+        n_units=n_units, n_periods=n_periods,
+        cohort_periods=cohort_periods,
         never_treated_frac=0.3, treatment_effect=3.0,
         dynamic_effects=True, effect_growth=0.1, seed=seed,
     )
@@ -41,16 +52,7 @@ def build_data(seed=42):
     return df
 
 
-def main():
-    data = build_data()
-    covars = ["log_pop", "baseline_spend"]
-    fit_kwargs = dict(
-        data=data, outcome="outcome", unit="unit", time="period",
-        first_treat="first_treat",
-    )
-
-    results = {}
-
+def make_phases(data, results, covars, fit_kwargs):
     def bacon():
         results["bacon"] = BaconDecomposition().fit(
             data, outcome="outcome", unit="unit", time="period",
@@ -98,7 +100,7 @@ def cs_no_covariates():
     def next_steps():
         results["guidance"] = practitioner_next_steps(results["cs"])
 
-    phases = [
+    return [
         ("1_bacon_decomposition", bacon),
         ("2_cs_fit_with_covariates_bootstrap999", cs_fit),
         ("3_inspect_pretrends", inspect_pretrends),
@@ -109,16 +111,40 @@ def next_steps():
         ("8_practitioner_next_steps", next_steps),
     ]
 
+
+def run_scale(scale, config):
+    data = build_data(**config)
+    covars = ["log_pop", "baseline_spend"]
+    fit_kwargs = dict(
+        data=data, outcome="outcome", unit="unit", time="period",
+        first_treat="first_treat",
+    )
+    results = {}
+    phases = make_phases(data, results, covars, fit_kwargs)
+
     run_scenario(
-        "campaign_staggered",
+        f"campaign_staggered_{scale}",
         phases,
         metadata={
-            "n_units": 150, "n_periods": 26, "n_cohorts": 2,
-            "covariates": covars, "n_bootstrap": 999,
-            "aggregate": "all", "estimation_method": "dr",
+            "scale": scale,
+            "n_units": config["n_units"],
+            "n_periods": config["n_periods"],
+            "n_cohorts": len(config["cohort_periods"]),
+            "n_obs": int(len(data)),
+            "covariates": covars,
+            "n_bootstrap": 999,
+            "aggregate": "all",
+            "estimation_method": "dr",
         },
     )
 
 
+def main():
+    for scale, config in SCALES.items():
+        print(f"\n{'='*60}\n  campaign_staggered / scale={scale} "
+              f"(n_units={config['n_units']})\n{'='*60}")
+        run_scale(scale, config)
+
+
 if __name__ == "__main__":
     main()
diff --git a/benchmarks/speed_review/bench_geo_few_markets.py b/benchmarks/speed_review/bench_geo_few_markets.py
index 35350f0d..36575892 100644
--- a/benchmarks/speed_review/bench_geo_few_markets.py
+++ b/benchmarks/speed_review/bench_geo_few_markets.py
@@ -1,41 +1,49 @@
 """
 Scenario 4: Geo-experiment with few treated markets (SyntheticDiD).
 
-Chains: SDiD with jackknife variance (80 LOO refits) -> SDiD with bootstrap
+Chains: SDiD with jackknife variance (N LOO refits) -> SDiD with bootstrap
 variance for SE comparison -> in_time_placebo -> get_loo_effects_df ->
 sensitivity_to_zeta_omega -> weight-concentration diagnostic.
 
-Data shape: 80 markets x 12 weekly periods (6 pre, 6 post), 5 treated,
-2 latent factors. Matches Tutorial 18's geo-experiment walkthrough.
+Three scales:
+  - small  (80 units,  5 treated):  Tutorial 18 DMA panel
+  - medium (200 units, 15 treated): zip-cluster or large geo-experiment
+  - large  (500 units, 30 treated): zip-level or multi-market at scale
+                                    (Python backend skipped at this scale;
+                                    Python FW solver scales poorly)
+
+The Python backend is skipped at "large" because the pure-numpy Frank-Wolfe
+solver plus jackknife (500 LOO refits x ~0.5s each, ~4 min for that phase
+alone) would add minutes without providing additional signal; the medium
+scale already establishes the Python-vs-Rust gap.
 """
 
+import os
+
 from diff_diff import SyntheticDiD
 from diff_diff.prep import generate_factor_data
 
 from bench_shared import run_scenario
 
 
-def build_data(seed=42):
+SCALES = {
+    "small":  {"n_units": 80,  "n_pre": 6, "n_post": 6, "n_treated": 5},
+    "medium": {"n_units": 200, "n_pre": 6, "n_post": 6, "n_treated": 15},
+    "large":  {"n_units": 500, "n_pre": 6, "n_post": 6, "n_treated": 30},
+}
+SKIP_PYTHON_AT = {"large"}
+
+
+def build_data(n_units, n_pre, n_post, n_treated, seed=42):
     return generate_factor_data(
-        n_units=80, n_pre=6, n_post=6, n_treated=5,
+        n_units=n_units, n_pre=n_pre, n_post=n_post, n_treated=n_treated,
         n_factors=2, treatment_effect=2.0,
         factor_strength=1.0, treated_loading_shift=0.5,
         seed=seed,
     )
 
 
-def main():
-    data = build_data()
-    # `treat` is the unit-level (block) indicator; `treated` is row-level.
-    # SyntheticDiD requires block treatment, and post_periods identifies the
-    # treatment window among treated units.
-    post_periods = sorted(
-        data.loc[(data["treat"] == 1) & (data["treated"] == 1),
-                 "period"].unique().tolist(),
-    )
-
-    results = {}
-
+def make_phases(data, post_periods, results):
     def sdid_jackknife():
         sdid = SyntheticDiD(variance_method="jackknife", seed=123)
         results["jk"] = sdid.fit(
@@ -76,7 +84,7 @@ def weight_concentration():
             raise RuntimeError("get_weight_concentration not available")
         results["wc"] = fn()
 
-    phases = [
+    return [
         ("1_sdid_jackknife_variance", sdid_jackknife),
         ("2_sdid_bootstrap_variance_200", sdid_bootstrap),
         ("3_in_time_placebo", in_time_placebo),
@@ -85,15 +93,43 @@ def weight_concentration():
         ("6_weight_concentration", weight_concentration),
     ]
 
+
+def run_scale(scale, config):
+    backend_env = os.environ.get("DIFF_DIFF_BACKEND", "auto").lower()
+    if scale in SKIP_PYTHON_AT and backend_env == "python":
+        print(f"  [skip] geo_few_markets/{scale} backend=python "
+              f"(Python FW solver scales poorly)")
+        return
+
+    data = build_data(**config)
+    post_periods = sorted(
+        data.loc[(data["treat"] == 1) & (data["treated"] == 1),
+                 "period"].unique().tolist(),
+    )
+    results = {}
+    phases = make_phases(data, post_periods, results)
+
     run_scenario(
-        "geo_few_markets",
+        f"geo_few_markets_{scale}",
         phases,
         metadata={
-            "n_units": 80, "n_periods": 12, "n_treated": 5,
+            "scale": scale,
+            "n_units": config["n_units"],
+            "n_pre": config["n_pre"],
+            "n_post": config["n_post"],
+            "n_treated": config["n_treated"],
             "n_factors": 2,
         },
     )
 
 
+def main():
+    for scale, config in SCALES.items():
+        print(f"\n{'='*60}\n  geo_few_markets / scale={scale} "
+              f"(n_units={config['n_units']}, "
+              f"n_treated={config['n_treated']})\n{'='*60}")
+        run_scale(scale, config)
+
+
 if __name__ == "__main__":
     main()
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index 91a44f47..72e4b1f2 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -8,136 +8,120 @@ This document outlines the strategy for improving diff-diff's performance on lar
 
 Earlier sections of this document (v1.4.0, v2.0.3) measured isolated `fit()`
 calls on synthetic panels for R-parity. This section measures **end-to-end
-practitioner chains** — Bacon decomposition, fit, event-study pre-trend
+practitioner chains** - Bacon decomposition, fit, event-study pre-trend
 inspection, HonestDiD sensitivity grids, cross-estimator robustness refits,
-and reporting — at data shapes anchored to applied-econ papers and industry
+and reporting - at data shapes anchored to applied-econ papers and industry
 writeups. The six scenarios are defined in
 [`docs/performance-scenarios.md`](performance-scenarios.md); scripts live in
 `benchmarks/speed_review/bench_*.py`; raw results in
-`benchmarks/results/*.json` and flame profiles in
-`benchmarks/results/profiles/`.
+`benchmarks/speed_review/baselines/*.json` and flame profiles in
+`benchmarks/speed_review/baselines/profiles/`.
 
 Environment: macOS darwin 25.3 on Apple Silicon M4, Python 3.9,
-numpy 2.x, diff_diff 3.1.3. Each scenario runs under
-`DIFF_DIFF_BACKEND=python` and `DIFF_DIFF_BACKEND=rust`.
-
-### Per-scenario wall-clock totals
-
-| Scenario | Python (s) | Rust (s) | Rust speedup | Dominant phase |
-|---|---:|---:|---:|---|
-| 1. Staggered campaign (CS + 8-step chain) | 0.48 | 0.48 | 1.0x | ImputationDiD robustness (53%) |
-| 2. Brand awareness survey (DiD + SurveyDesign) | 0.18 | 0.20 | 0.9x | Multi-outcome loop + HonestDiD (~70% combined) |
-| 3. BRFSS microdata -> CS panel | 1.58 | 1.58 | 1.0x | `aggregate_survey` (93%) |
-| 4. Geo-experiment few markets (SDiD) | 2.96 | 0.04 | **76x** | SDiD Frank-Wolfe weight solver |
-| 5. Reversible treatment (dCDH L_max=3 + TSL) | 0.49 | 0.55 | 0.9x | dCDH fit (58%) + heterogeneity refit (40%) |
-| 6. Pricing dose-response (CDiD spline) | 0.57 | 0.58 | 1.0x | Four spline variants, ~25% each |
-
-At practitioner-realistic scales, the full 8-step Baker chain runs in under
-two seconds for 5 of 6 scenarios with or without the Rust backend. The Rust
-backend provides dramatic uplift only for SDiD; elsewhere it is at parity
-(or marginally slower on small data due to the Python/Rust FFI crossing
-overhead).
-
-### Top hotspots ranked by total-time contribution
-
-| # | Location | Scenario | Time | Recommended action |
+numpy 2.x, diff_diff 3.1.3. Each multi-scale scenario runs at three data
+scales under both `DIFF_DIFF_BACKEND=python` and `DIFF_DIFF_BACKEND=rust`.
+
+### Scale sweep - end-to-end wall-clock
+
+Four of the six scenarios run at three scales (small / medium / large). The
+small scale matches tutorial data shapes; medium reflects typical
+practitioner workloads; large stretches toward the upper end of what an
+analyst might bring (1M-row BRFSS microdata, 1,500-unit county-level
+staggered panel, 1,000-unit multi-region brand survey, 500-unit zip-level
+geo-experiment). Dose-response and reversible-dCDH run at a single mid-range
+scale.
+
+| Scenario | Scale | Data shape | Python (s) | Rust (s) | Py/Rust |
+|---|---|---|---:|---:|---:|
+| **1. Staggered campaign** | small | 150 units × 26 periods | 0.48 | 0.49 | 1.0x |
+| (CS + 8-step chain, bootstrap 999) | medium | 500 units × 26 periods | 0.72 | 0.87 | 0.8x |
+|  | large | 1,500 units × 26 periods | 1.24 | 1.22 | 1.0x |
+| **2. Brand awareness survey** | small | 200 units × 12 periods | 0.15 | 0.20 | 0.8x |
+| (DiD + SurveyDesign + replicate weights) | medium | 500 units × 12 periods | 0.49 | 0.54 | 0.9x |
+|  | large | 1,000 units × 12 periods | 0.79 | 0.83 | 1.0x |
+| **3. BRFSS microdata → CS panel** | small | 50K rows → 500 cells | 1.59 | 1.61 | 1.0x |
+| (`aggregate_survey` + CS + HonestDiD) | medium | 250K rows → 500 cells | 6.11 | 6.20 | 1.0x |
+|  | large | **1M rows → 500 cells** | **23.96** | **24.37** | **1.0x** |
+| **4. Geo-experiment few markets (SDiD)** | small | 80 units, 5 treated | 3.05 | 0.04 | **76x** |
+| (jackknife + bootstrap + sensitivity chain) | medium | 200 units, 15 treated | 3.65 | 0.12 | **31x** |
+|  | large | 500 units, 30 treated | skip | 0.26 | - |
+| 5. Reversible dCDH (L_max=3 + TSL) | (single) | 120 groups × 10 periods | 0.55 | 0.53 | 1.0x |
+| 6. Pricing dose-response (CDiD spline) | (single) | 500 units × 6 periods | 0.58 | 0.59 | 1.0x |
+
+### Scaling findings
+
+**Three findings emerge from the scale sweep that the tutorial-scale pass could not show:**
+
+1. **BRFSS `aggregate_survey` becomes the dominant practitioner pain point.**
+   Runtime scales near-linearly with microdata row count: 50K → 1M rows
+   (20x) costs 15x runtime (1.6s → 24s). At 1M rows, 97% of runtime is inside
+   `_compute_stratified_psu_meat`, called once per output cell. This is a
+   concrete 20-second cost hit on any realistic pooled multi-year BRFSS
+   study, and Rust does not touch it (aggregate_survey is entirely Python).
+2. **Staggered CS chain remains cheap across scales.** 150 → 1,500 units
+   (10x) increases total by only 2.6x (0.48s → 1.24s). ImputationDiD stays
+   the dominant phase (46-62%) but scales well; absolute time at
+   practitioner scale is still under 1 second.
+3. **SDiD Rust gap is stable, not emergent.** Python SDiD at 80 units is
+   already 3 seconds; at 200 units it is 3.7 seconds. The cost is
+   dominated by fixed-overhead-per-jackknife-refit rather than data size;
+   Rust stays sub-second through 500 units. The 76x headline at small scale
+   is driven by Python having ~3s of baseline cost, not by bad scaling.
+
+**Two findings hold across scales:**
+
+4. Brand-awareness survey chain scales roughly linearly (0.15s → 0.79s,
+   about 5.3x, for a 5x unit increase); TSL + replicate-weight paths are
+   well-vectorized even with 40 replicate columns.
+5. Rust backend gives measurable uplift only for SDiD; for everything else
+   backend choice is within noise because the bottlenecks are in Python
+   (`aggregate_survey`) or already well-vectorized (CS bootstrap, ImputationDiD,
+   Survey TSL/replicate).
+
+### Top hotspots ranked by total-time contribution (at largest measured scale)
+
+| # | Location | Scenario + scale | Time | Recommended action |
 |---|---|---|---:|---|
-| 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS | 1.0s self + 1.4s inclusive per 50K microdata | **Algorithmic fix** — loop runs per (state, year) cell; precompute stratum scaffolding once at top of `aggregate_survey` and reuse |
-| 2 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | Geo few markets (python) | 0.46s | **Already ported to Rust.** Python fallback acceptable for n < 50; document the python-backend ceiling rather than re-optimizing |
-| 3 | `diff_diff/imputation.py` ImputationDiD fit chain | Staggered campaign | 0.24s | **Investigate** — 4x slower than CS with `n_bootstrap=999` on identical data; unexpected given CS has the heavier bootstrap and same influence-function path. Likely imputation loop is not vectorized. |
-| 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit (`L_max=3` + TSL) | Reversible | 0.32s main + 0.22s heterogeneity refit | **Cache/precompute** — heterogeneity refit repeats TSL setup and data prep already done by main fit. Pass shared precomputed structures through. |
-| 5 | `diff_diff/continuous_did.py` CDiD bootstrap loop | Dose response | 0.14s per fit, 4 variants = 0.56s | **Leave alone** — linear scaling with spline variants is expected; total well under practitioner-perceptible threshold |
-
-### Per-scenario findings
-
-**Scenario 1 — Staggered campaign (CS + 8-step chain)**
-
-Top 5 phases (python-backend, ordered by time):
-
-1. `6_imputation_did_robustness` — 234 ms (49%) — **investigate**
-2. `5_sun_abraham_robustness` — 149 ms (31%) — expected; SA saturated TWFE
-3. `2_cs_fit_with_covariates_bootstrap999` — 59 ms (12%) — expected
-4. `7_cs_without_covariates` — 29 ms (6%) — expected
-5. `1_bacon_decomposition` — 7 ms (1%) — negligible
-
-Action: flag ImputationDiD for a focused profile comparison against CS on
-the same data; total scenario is otherwise already cheap enough.
-
-**Scenario 2 — Brand awareness survey**
-
-Top 5 phases (python-backend, ordered by time):
-
-1. `4_multi_outcome_loop_3_metrics` — 64 ms (36%) — expected; linear in outcome count
-2. `7_event_study_plus_honest_did` — 62 ms (35%) — expected; MP fit + 3x HonestDiD
-3. `6_placebo_refit_pre_period` — 24 ms (13%) — expected
-4. `3_replicate_weights_brr` — 12 ms (7%) — expected; 40 replicate columns
-5. `5_check_parallel_trends` — 9 ms (5%) — expected
-
-Action: **leave alone.** Full survey chain is ~200 ms end-to-end.
-
-**Scenario 3 — BRFSS microdata -> CS panel**
-
-Top 5 phases (python-backend, ordered by time):
-
-1. `1_aggregate_survey_microdata_to_panel` — 1480 ms (94%) — **algorithmic fix**
-2. `5_sun_abraham_robustness` — 81 ms (5%) — expected
-3. `2_cs_fit_with_stage2_survey_design` — 15 ms (1%) — expected
-4. `4_honest_did_grid` — 4 ms — negligible
-5. `6_practitioner_next_steps` — <1 ms — negligible
-
-Action: **fix `aggregate_survey` per-cell loop.** Profile confirmed the
-self-time is concentrated in `_compute_stratified_psu_meat` being called
-once per output cell (500 cells for 50 states x 10 years) with redundant
-stratum-scaffolding reconstruction per call. A single precomputation of
-stratum indexes at the top of `aggregate_survey` should eliminate most of
-the 1s self-time without changing numerical output.
-
-**Scenario 4 — Geo-experiment few markets (SDiD)**
-
-Top 5 phases (python vs rust):
-
-| Phase | Python | Rust |
-|---|---:|---:|
-| `5_sensitivity_to_zeta_omega` | 1059 ms | 11 ms |
-| `3_in_time_placebo` | 954 ms | 8 ms |
-| `2_sdid_bootstrap_variance_200` | 475 ms | 12 ms |
-| `1_sdid_jackknife_variance` | 472 ms | 7 ms |
-
-Profile of python fit: 99% of time is in `_sc_weight_fw_numpy` Frank-Wolfe
-solver, split ~evenly between unit-weight and time-weight solves.
-`_fw_step` convergence check (`np.allclose`) is half the inner-loop cost.
-
-Action: **no further optimization needed.** Rust port is shipped and
-provides 76x on the full chain. The practitioner path defaults to Rust when
-available; the python fallback is a developer-safety path and the
-performance ceiling is acceptable for the teaching scale
-(40-80 units) but documented as non-production for larger n.
-
-**Scenario 5 — Reversible treatment (dCDH L_max=3 + TSL)**
-
-Top 5 phases:
-
-1. `1_dcdh_fit_Lmax3_survey_TSL` — 316 ms (64% python / 58% rust) — **cache candidate**
-2. `4_heterogeneity_refit` — 174 ms (35%) — **cache candidate**
-3. `3_honest_did_on_placebo` — 4-13 ms — expected
-
-The main fit and heterogeneity refit each independently rebuild TSL
-scaffolding (stratum-PSU indexes, influence-function allocators, design-
-matrix reshaping). Because heterogeneity always follows an unconditional
-fit, the scaffolding is shared and can be passed through.
-
-Action: **investigate shared precomputation.** Not a P0 — total is ~550 ms
-end-to-end — but this is a newer code path (v3.1) and has not been
-optimization-reviewed.
-
-**Scenario 6 — Pricing dose-response (ContinuousDiD)**
-
-Four spline fits (cubic bootstrap 199, event-study, linear bootstrap 199,
-cubic num_knots=2 bootstrap 199) account for ~99% of runtime, ~140 ms each.
-Linear scaling in variant count is expected.
-
-Action: **leave alone.** Bootstrap 199 on 500 units x 6 periods with cubic
-splines at 140 ms per fit is well within practitioner-acceptable latency.
+| 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS @ 1M rows | **20.7s self / 22.6s inclusive** | **Algorithmic fix, raised priority.** The function is called once per (state, year) cell (500 calls); per-call work scales with cell size and rebuilds stratum-PSU scaffolding every time. Precompute stratum indexes once at the top of `aggregate_survey` and reuse them. Upside at practitioner scale is now 15-20 seconds, not 1.5 seconds. |
+| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | 0.66s (53%) | **Investigate if/when BRFSS fix lands.** Stayed the dominant phase across scales but total chain is ~1.2s at large - not P0. Still a candidate follow-up once the higher-value fix is in. |
+| 3 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | SDiD python @ any scale | 3s fixed overhead + scaling | **Already ported to Rust.** Python fallback acceptable as a teaching/safety path; document as non-production for n > 100. Python skipped at n=500 because jackknife 500 refits × ~500ms/refit would exceed 4 minutes. |
+| 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit + heterogeneity | Reversible (single scale) | 0.32s main + 0.22s heterogeneity | **Cache/precompute** - heterogeneity refit rebuilds TSL scaffolding the main fit already computed. Not P0 - total is ~550ms - but newer code path (v3.1) never optimization-reviewed. |
+| 5 | `diff_diff/continuous_did.py` CDiD spline bootstrap | Dose-response (single scale) | 0.14s per fit × 4 variants | **Leave alone** - linear in variant count, all well under perceptible threshold. |
+
+### Per-scenario phase rankings at each scale
+
+**Scenario 1 - Staggered campaign (CS + 8-step chain).**
+ImputationDiD robustness remains the single dominant phase at every scale
+(0.30s / 0.33s / 0.66s for small / medium / large). SunAbraham scales at a
+similar rate. The CS fit with `n_bootstrap=999` at 1,500 units is 0.18s
+(15%) - well-vectorized. Action: investigate ImputationDiD only after
+higher-upside items land.
+
+**Scenario 2 - Brand awareness survey.**
+At small scale HonestDiD dominates (54%); at medium the multi-outcome loop
+takes over (36%); at large the replicate-weight BRR path becomes the top
+phase (43%, 0.34s). The replicate-weight path scales with n × n_replicates,
+as expected. All absolute times are practitioner-acceptable. Action: leave
+alone; confirm BRR path scales linearly with a future n_replicates sweep
+if needed.
+
+**Scenario 3 - BRFSS microdata → CS panel.**
+`aggregate_survey` share of total grows with scale: 94% at 50K → 99% at 250K →
+100% at 1M. Everything downstream (CS fit, SunAbraham, HonestDiD) stays
+under 500 ms combined. Action: fix `aggregate_survey` per-cell loop. This
+is now the single most impactful optimization identified.
+
+**Scenario 4 - Geo-experiment few markets (SDiD).**
+`sensitivity_to_zeta_omega` and `in_time_placebo` are the dominant
+python-backend phases at every scale (together ~70%); Rust eliminates both.
+Action: no further optimization needed - Rust port ships the answer.
+
+**Scenario 5 - Reversible treatment (dCDH L_max=3 + TSL).**
+Unchanged from single-scale pass: main fit 58% + heterogeneity refit 40%,
+both rebuilding shared TSL scaffolding.
+
+**Scenario 6 - Pricing dose-response (ContinuousDiD).**
+Unchanged: four spline fits ~140ms each, ~99% of total.
 
 ### Correctness-adjacent observations (not P0, route separately)
 
@@ -166,7 +150,7 @@ or in the silent-failures audit; logging here for awareness.
 - Scaling: each scenario runs at a single data shape. We do not know how
   end-to-end time scales with n, periods, or cohorts. If scaling becomes a
   decision input, add a small per-scenario scale sweep (e.g., n_units in
-  {100, 500, 1000}) — the scripts are parameterised to support this.
+  {100, 500, 1000}) - the scripts are parameterised to support this.
 - Memory: no memory-ceiling measurement. If memory becomes a concern,
   `pyinstrument --output-memory` or `memray` can be wrapped into
   `bench_shared.run_scenario` without restructuring.
diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md
index 26214ada..c9afadbe 100644
--- a/docs/performance-scenarios.md
+++ b/docs/performance-scenarios.md
@@ -38,15 +38,15 @@ separately and routed to the silent-failures audit, not folded into a perf PR.
 
 Each scenario in section 4 defines:
 
-- **Persona / domain** — who runs this and why
-- **Data shape** — n_units, n_periods, n_covariates, survey PSUs/strata,
+- **Persona / domain** - who runs this and why
+- **Data shape** - n_units, n_periods, n_covariates, survey PSUs/strata,
   microdata rows if relevant
-- **Estimator + params** — including `covariates`, `n_bootstrap`,
+- **Estimator + params** - including `covariates`, `n_bootstrap`,
   `survey_design`, `aggregate`, any non-default knobs
-- **Operation chain** — fit() is one step; the flow usually includes Bacon
+- **Operation chain** - fit() is one step; the flow usually includes Bacon
   decomposition, parallel-trends inspection, sensitivity analysis, aggregation,
   and cross-estimator robustness. We time the **chain**, not just fit().
-- **Source anchor** — which tutorial, paper, or industry reference the
+- **Source anchor** - which tutorial, paper, or industry reference the
   shape/workflow comes from
 
 For each scenario, `benchmarks/speed_review/` hosts a script
@@ -79,24 +79,26 @@ serves a different purpose: R-parity accuracy). They complement it.
   not belong here.
 - **Time includes I/O and prep.** The stopwatch starts at the first library
   call a practitioner would write in their notebook and ends at the last
-  result-reporting call — `practitioner_next_steps()` or a `summary()`. Data
+  result-reporting call - `practitioner_next_steps()` or a `summary()`. Data
   generation (synthetic) is outside the stopwatch; data load
   (`load_mpdta()`, CSV read) is inside.
 
 ## Scenarios
 
-### 1. Staggered Marketing Campaign — CS + Event Study + HonestDiD
+### 1. Staggered Marketing Campaign - CS + Event Study + HonestDiD
 
 - **Persona / domain.** Growth / performance-marketing data scientist at a
   tech or e-commerce company. A brand campaign rolls out to DMAs in two
   waves; analyst needs overall lift, event-study dynamics, and a sensitivity
   bound for the VP.
-- **Data shape.** 150 units (DMAs) x 26 periods (weekly), 2 staggered
-  cohorts (wave 1 at period 9, wave 2 at period 14), ~30% never-treated,
-  2 covariates (`log_pop`, `baseline_spend`). This is deliberately larger
-  than the 80-DMA Tutorial 18 shape to stress the CS influence-function path;
-  GeoLift experiments commonly sit in the 50-200 DMA range when aggregated
-  at the DMA level per Meta's methodology docs.
+- **Data shape (scale sweep).** 26-period weekly panel, ~30% never-treated,
+  2 covariates (`log_pop`, `baseline_spend`). Three scales:
+    - **small** - 150 units, 2 cohorts (GeoLift DMA-panel analog; US DMAs
+      cap at 210).
+    - **medium** - 500 units, 3 cohorts (pooled multi-region or multi-year
+      DMA panel).
+    - **large** - 1,500 units, 3 cohorts (county-level staggered policy
+      study; US has ~3,100 counties).
 - **Estimator + params.**
   ```python
   CallawaySantAnna(
@@ -121,17 +123,20 @@ serves a different purpose: R-parity accuracy). They complement it.
   `diff_diff/guides/llms-practitioner.txt`, GeoLift methodology docs for
   DMA panel conventions.
 
-### 2. Brand Awareness Survey DiD — 2x2 with Survey Design
+### 2. Brand Awareness Survey DiD - 2x2 with Survey Design
 
 - **Persona / domain.** Brand / market-research analytics lead at a CPG
   or agency. Runs a pre/post awareness survey across test and control
   markets with complex sampling (strata + PSU clusters + unequal weights).
   Needs design-correct SEs or the CI is too narrow.
-- **Data shape.** 40 regions x 8 quarterly waves x ~100 respondents per
-  region-wave = ~32,000 respondent rows (pre-aggregation). 10 strata, 4
-  PSUs per stratum (40 PSUs total), weight coefficient of variation ~1.0.
-  This is the Tutorial 17 shape scaled up from its demonstration
-  200 x 8 cells to a size where design effects meaningfully dominate runtime.
+- **Data shape (scale sweep).** 12-period quarterly panel, high weight
+  variation, 40 BRR replicate weights. Three scales:
+    - **small** - 200 units, 10 strata × 4 PSUs (Tutorial 17 analog).
+    - **medium** - 500 units, 15 strata × 6 PSUs (typical CPG
+      quarterly brand-tracking wave).
+    - **large** - 1,000 units, 20 strata × 8 PSUs (multi-region brand
+      tracking at scale, e.g. a national awareness study with 50+ sub-
+      markets).
 - **Estimator + params.** Two variants in the same script:
   ```python
   # (a) Analytical TSL path
@@ -163,12 +168,16 @@ serves a different purpose: R-parity accuracy). They complement it.
   of a staggered state policy (e.g., Medicaid expansion, smoking ban) on
   a design-correct outcome using `aggregate_survey()` to collapse microdata
   to a state-year panel, then a modern staggered estimator.
-- **Data shape.** 50,000 microdata rows (~50 states x 10 years x ~100
-  respondents per state-year subsample — scaled to reflect the BRFSS 2024
-  ~458K-record universe filtered to a substate analytic population, per
-  CDC overview docs). 10 strata, 200 PSUs overall. Collapses via
-  `aggregate_survey` to 500-cell state-year panel. 5 adoption cohorts
-  staggered over the window.
+- **Data shape (scale sweep).** 50 states × 10 years × N respondents per
+  state-year cell, 5 adoption cohorts staggered over the window. Three scales:
+    - **small** - 50,000 rows (100/cell, 10 strata × 200 PSUs). Substate
+      analytic slice of a single year.
+    - **medium** - 250,000 rows (500/cell, 15 strata × 600 PSUs). Pooled
+      substate analytic slice across multiple years.
+    - **large** - 1,000,000 rows (2,000/cell, 20 strata × 1,000 PSUs).
+      A realistic pooled 10-year multi-state analysis - comparable to the
+      kind of panel built from BRFSS 2024's ~458K-record universe filtered
+      and pooled across years. This is where practitioners actually live.
 - **Estimator + params.**
   ```python
   panel, stage2 = aggregate_survey(
@@ -182,7 +191,7 @@ serves a different purpose: R-parity accuracy). They complement it.
   )
   compute_honest_did(results, method="relative_magnitude", M=[0.5, 1.0, 1.5])
   ```
-- **Operation chain.** (1) `aggregate_survey()` — the microdata-to-panel
+- **Operation chain.** (1) `aggregate_survey()` - the microdata-to-panel
   collapse; (2) CS fit with staged second-stage SurveyDesign
   (`weight_type="pweight"`) and bootstrap at PSU level; (3) event-study
   pre-trend inspection; (4) HonestDiD sensitivity grid; (5) SunAbraham
@@ -194,21 +203,29 @@ serves a different purpose: R-parity accuracy). They complement it.
   docstring + `docs/survey-roadmap.md`, CS paper for staggered ATT(g,t)
   inference.
 
-### 4. Geo-Experiment Few Markets — SyntheticDiD + Jackknife
+### 4. Geo-Experiment Few Markets - SyntheticDiD + Jackknife
 
 - **Persona / domain.** Growth marketing analyst running a small-market
-  campaign test (3-5 treated DMAs) against a pool of 30-80 control DMAs.
-  Too few treated for asymptotic CS SE; uses SyntheticDiD with
-  jackknife variance and a breakdown diagnostic for the VP.
-- **Data shape.** 80 DMAs x 12 weekly periods, 5 treated, 2 latent factors
-  driving the pre-period outcomes (factor-model DGP to stress the
-  optimization). This is the Tutorial 18 shape.
+  campaign test against a pool of control markets. Too few treated for
+  asymptotic CS SE; uses SyntheticDiD with jackknife variance and a
+  breakdown diagnostic for the VP.
+- **Data shape (scale sweep).** 12 weekly periods (6 pre, 6 post),
+  2 latent factors. Three scales:
+    - **small** - 80 units, 5 treated (Tutorial 18 analog, DMA-scale
+      geo-experiment).
+    - **medium** - 200 units, 15 treated (zip-cluster-scale or
+      multi-DMA geo experiment).
+    - **large** - 500 units, 30 treated (zip-level or large-scale geo
+      experiment; **Python backend skipped at this scale** because the
+      pure-numpy Frank-Wolfe solver plus jackknife would need ~500 per-unit
+      refits and exceed 4 minutes per run without adding signal beyond what
+      medium scale already shows).
 - **Estimator + params.**
   ```python
   SyntheticDiD(variance_method="jackknife", n_bootstrap=0).fit(...)
   # then also variance_method="bootstrap", n_bootstrap=200 for comparison
   ```
-- **Operation chain.** (1) SDiD fit with `variance_method="jackknife"` —
+- **Operation chain.** (1) SDiD fit with `variance_method="jackknife"` -
   exercises the leave-one-out refit loop (80 full refits); (2) SDiD fit
   with `variance_method="bootstrap"`, `n_bootstrap=200` for SE comparison;
   (3) `results.in_time_placebo()`; (4) `results.get_loo_effects_df()`;
@@ -219,10 +236,10 @@ serves a different purpose: R-parity accuracy). They complement it.
 - **Source anchor.** `docs/tutorials/18_geo_experiments.ipynb`,
   Arkhangelsky et al. (2021), Mercado Libre geo-experiment writeup
   (medium.com/mercadolibre-tech), Meta GeoLift methodology docs
-  (facebookincubator.github.io/GeoLift — 10-treated / 10-20-control
+  (facebookincubator.github.io/GeoLift - 10-treated / 10-20-control
   convention).
 
-### 5. Reversible Treatment — dCDH with L_max and Survey TSL
+### 5. Reversible Treatment - dCDH with L_max and Survey TSL
 
 - **Persona / domain.** Marketing analyst measuring an always-on-with-
   dark-periods campaign, or a health-policy researcher studying a policy
@@ -253,7 +270,7 @@ serves a different purpose: R-parity accuracy). They complement it.
   `DIDmultiplegtDYN` as methodological reference, `docs/methodology/REGISTRY.md`
   dCDH section, `project_dcdh_shipped.md` for v3.1 feature set.
 
-### 6. Pricing Dose-Response — ContinuousDiD Cubic Spline
+### 6. Pricing Dose-Response - ContinuousDiD Cubic Spline
 
 - **Persona / domain.** Pricing / promo analyst at a retailer. Stores
   received varying discount levels; analyst wants the dose-response curve
@@ -269,7 +286,7 @@ serves a different purpose: R-parity accuracy). They complement it.
       dose="dose", aggregate="dose",
   )
   ```
-- **Operation chain.** (1) CDiD fit with `aggregate="dose"` — produces
+- **Operation chain.** (1) CDiD fit with `aggregate="dose"` - produces
   overall ATT, overall ACRT, and the dose-response curves; (2)
   `results.to_dataframe(level="dose_response")`; (3)
   `results.to_dataframe(level="event_study")` for pre-trend diagnostics;
@@ -321,4 +338,4 @@ output. Scripts filter this warning so profiles stay clean.
 - Raw results: `benchmarks/speed_review/baselines/<scenario>_<backend>.json`
 - Flame profiles: `benchmarks/speed_review/baselines/profiles/<scenario>_<backend>.html`
 - Findings doc: `docs/performance-plan.md` ("Practitioner Workflow Baseline"
-  section — per-scenario top-5 hot phases + recommended action category)
+  section - per-scenario top-5 hot phases + recommended action category)

From 8d5eae8fec6be57f98983576bb33de6e378b2135 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 12:11:27 -0400
Subject: [PATCH 03/15] Address CI review; add memory tracking and BRFSS
 allocator attribution

Addresses the four CI review findings:

- BRR -> JK1 rename. generate_survey_did_data(include_replicate_weights=
  True) emits JK1 delete-one-PSU weights per prep.py:1248; Scenario 2 was
  labeling them as BRR, which uses a different variance formula. Fixed
  script, phase label, scenario doc data-shape text, and example code
  snippet.
- Exit-code propagation. run_scenario now records a module-level
  failure flag; an atexit handler calls os._exit(1) if any phase
  recorded ok=False. run_all.py's subprocess return-code check now
  reliably surfaces phase failures. Verified with a forced-failure
  harness test.
- Path references. bench_shared.py and run_all.py docstrings plus
  performance-plan.md prose normalized to
  benchmarks/speed_review/baselines/.
- Contributor README. "Commit HTMLs" instruction removed; flame HTMLs
  are gitignored and regenerated per run.

Adds memory measurement:

- psutil background RSS sampler (10ms) in run_scenario writes a memory
  field to every scenario JSON: start, peak, growth-during-run.
  Negligible timing impact (background thread, single-syscall samples).
- mem_profile_brfss.py - standalone tracemalloc allocator attribution
  for the BRFSS-1M scenario. Separate from the timing harness so its
  2-5x overhead does not contaminate wall-clock baselines.
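
As an illustration of what allocator attribution with tracemalloc looks
like, here is a minimal sketch. It is not mem_profile_brfss.py itself:
top_allocation_sites and build_cells are hypothetical names, and the
workload is a stand-in rather than aggregate_survey.

```python
import tracemalloc


def top_allocation_sites(fn, limit=5):
    """Run fn() under tracemalloc; return (result, peak_bytes, top diffs)."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = fn()
        after = tracemalloc.take_snapshot()
        # get_traced_memory() returns (current, peak) since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    # Net allocation growth per source line, largest first.
    stats = after.compare_to(before, "lineno")[:limit]
    return result, peak, stats


def build_cells():
    # Stand-in workload: 200 buffers of 10,000 bytes each.
    return [bytearray(10_000) for _ in range(200)]


cells, peak, stats = top_allocation_sites(build_cells)
for stat in stats:
    print(stat)
```

Because the returned object is still referenced when the second snapshot
is taken, its allocations show up as net-retained growth; memory that is
allocated and freed inside fn() only affects the peak figure.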

Memory findings extend the optimization priority list without changing
the #1 recommendation. Headline insight: BRFSS aggregate_survey at 1M
rows grows working memory by only 23 MB (vs 46 MB input), and
tracemalloc's net-retained allocation is 0.6 MB. The 24-second cost is
pure CPU - confirms the precompute-scaffolding fix is low-risk and fits
in any deployment target including 512 MB Lambda.

Secondary finding: staggered CS chain allocates 252-322 MB at 1,500
units (peak RSS 486-589 MB). Fine for workstations, tight for Lambda-
tier deployments. Flagged as a lower-priority follow-up.

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 benchmarks/speed_review/README.md             |  18 ++-
 .../brand_awareness_survey_large_python.json  |  25 ++--
 .../brand_awareness_survey_large_rust.json    |  25 ++--
 .../brand_awareness_survey_medium_python.json |  25 ++--
 .../brand_awareness_survey_medium_rust.json   |  25 ++--
 .../brand_awareness_survey_small_python.json  |  25 ++--
 .../brand_awareness_survey_small_rust.json    |  25 ++--
 .../baselines/brfss_panel_large_python.json   |  21 ++-
 .../baselines/brfss_panel_large_rust.json     |  21 ++-
 .../baselines/brfss_panel_medium_python.json  |  21 ++-
 .../baselines/brfss_panel_medium_rust.json    |  21 ++-
 .../baselines/brfss_panel_small_python.json   |  21 ++-
 .../baselines/brfss_panel_small_rust.json     |  21 ++-
 .../campaign_staggered_large_python.json      |  25 ++--
 .../campaign_staggered_large_rust.json        |  25 ++--
 .../campaign_staggered_medium_python.json     |  25 ++--
 .../campaign_staggered_medium_rust.json       |  25 ++--
 .../campaign_staggered_small_python.json      |  25 ++--
 .../campaign_staggered_small_rust.json        |  25 ++--
 .../baselines/dose_response_python.json       |  21 ++-
 .../baselines/dose_response_rust.json         |  21 ++-
 .../baselines/geo_few_markets_large_rust.json |  21 ++-
 .../geo_few_markets_medium_python.json        |  21 ++-
 .../geo_few_markets_medium_rust.json          |  21 ++-
 .../geo_few_markets_small_python.json         |  21 ++-
 .../baselines/geo_few_markets_small_rust.json |  21 ++-
 .../mem_profile_brfss_large_rust.txt          |  29 ++++
 .../baselines/reversible_dcdh_python.json     |  17 ++-
 .../baselines/reversible_dcdh_rust.json       |  17 ++-
 .../bench_brand_awareness_survey.py           |   9 +-
 benchmarks/speed_review/bench_shared.py       | 124 +++++++++++++++++-
 benchmarks/speed_review/mem_profile_brfss.py  | 120 +++++++++++++++++
 benchmarks/speed_review/run_all.py            |  12 +-
 docs/performance-plan.md                      | 119 ++++++++++++++---
 docs/performance-scenarios.md                 |  23 ++--
 35 files changed, 809 insertions(+), 252 deletions(-)
 create mode 100644 benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt
 create mode 100644 benchmarks/speed_review/mem_profile_brfss.py

diff --git a/benchmarks/speed_review/README.md b/benchmarks/speed_review/README.md
index 09f610ce..35440b2b 100644
--- a/benchmarks/speed_review/README.md
+++ b/benchmarks/speed_review/README.md
@@ -19,7 +19,7 @@ at data shapes anchored to applied-econ conventions.
 ```
 benchmarks/speed_review/
 ├── README.md                           # this file
-├── bench_shared.py                     # timing + pyinstrument harness
+├── bench_shared.py                     # timing + pyinstrument + RSS harness
 ├── run_all.py                          # orchestrator (both backends)
 ├── bench_campaign_staggered.py         # Scenario 1: CS + 8-step chain
 ├── bench_brand_awareness_survey.py     # Scenario 2: DiD + SurveyDesign
@@ -27,14 +27,23 @@ benchmarks/speed_review/
 ├── bench_geo_few_markets.py            # Scenario 4: SDiD + jackknife
 ├── bench_reversible_dcdh.py            # Scenario 5: dCDH L_max + TSL
 ├── bench_dose_response.py              # Scenario 6: ContinuousDiD splines
+├── mem_profile_brfss.py                # tracemalloc allocator attribution
+│                                       #   for BRFSS-1M (standalone)
 ├── bench_callaway.py                   # pre-existing CS scaling sweep
 ├── baseline_results.json               # pre-existing CS baseline
 └── baselines/                          # this effort's output
-    ├── <scenario>_<backend>.json       # phase-level wall-clock (committed)
+    ├── <scenario>_<backend>.json       # phase-level wall-clock + peak RSS
+    ├── mem_profile_brfss_large_<backend>.txt   # tracemalloc top-N sites
     └── profiles/                       # flame HTMLs (gitignored)
         └── <scenario>_<backend>.html   # pyinstrument flame output
 ```
 
+Each JSON baseline records both timing (per-phase wall-clock) and memory
+(start/peak/growth from a psutil background sampler at 10 ms). The
+`mem_profile_brfss.py` script does a separate tracemalloc pass on the
+BRFSS-1M scenario - this is kept out of the main timing harness because
+tracemalloc has 2-5x overhead and would contaminate wall-clock baselines.
+
 **Note on profile HTMLs.** pyinstrument flames are ~500KB-1.2MB each and are
 regenerated on every run; they live under `baselines/profiles/` which is
 gitignored. The key hotspots identified from them are already captured in
@@ -77,6 +86,7 @@ the findings doc is the decision output.
 2. Add `bench_<name>.py` following the existing scripts: build data, define
    `phases` as a list of `(label, callable)` tuples, call `run_scenario`.
 3. Register it in `run_all.py`'s `SCRIPTS` dict.
-4. Run under both backends, commit the refreshed `baselines/*.json` and the
-   corresponding `baselines/profiles/*.html`.
+4. Run under both backends and commit the refreshed `baselines/*.json`.
+   The `baselines/profiles/*.html` flame HTMLs are gitignored and
+   regenerated per run - do not commit them.
 5. Add a per-scenario finding paragraph to `docs/performance-plan.md`.
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
index 6531abee..e3c427f9 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
@@ -2,40 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.7940070000000001,
+  "total_seconds": 0.9061127080000002,
+  "memory": {
+    "available": true,
+    "start_mb": 189.33,
+    "peak_mb": 335.59,
+    "growth_mb": 146.27,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.013499665999999966,
+      "seconds": 0.013970540999999947,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03187458300000001,
+      "seconds": 0.03233287500000004,
       "ok": true,
       "error": null
     },
-    "3_replicate_weights_brr": {
-      "seconds": 0.3442796670000001,
+    "3_replicate_weights_jk1": {
+      "seconds": 0.44611704099999994,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.19682533299999982,
+      "seconds": 0.21938504199999986,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.030179500000000026,
+      "seconds": 0.04051395800000002,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.043751333999999975,
+      "seconds": 0.016386375000000175,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.13358487500000016,
+      "seconds": 0.13737600000000016,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
index 9f3d673e..1caa7f71 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
@@ -2,40 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.828119375,
+  "total_seconds": 0.8791467499999999,
+  "memory": {
+    "available": true,
+    "start_mb": 187.97,
+    "peak_mb": 315.2,
+    "growth_mb": 127.23,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.014049749999999861,
+      "seconds": 0.013859375000000007,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.029422499999999907,
+      "seconds": 0.02124400000000004,
       "ok": true,
       "error": null
     },
-    "3_replicate_weights_brr": {
-      "seconds": 0.36754912500000003,
+    "3_replicate_weights_jk1": {
+      "seconds": 0.42970375000000005,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.16490987499999998,
+      "seconds": 0.2112943330000001,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.03375229199999996,
+      "seconds": 0.038379208000000276,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.06475750000000025,
+      "seconds": 0.025571082999999994,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.15367104200000004,
+      "seconds": 0.1390828329999998,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
index 6d915456..6f1ce897 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
@@ -2,40 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.48956791599999994,
+  "total_seconds": 0.560188,
+  "memory": {
+    "available": true,
+    "start_mb": 131.69,
+    "peak_mb": 187.8,
+    "growth_mb": 56.11,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.01289191699999992,
+      "seconds": 0.011834042000000045,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.035409875000000035,
+      "seconds": 0.03354012500000003,
       "ok": true,
       "error": null
     },
-    "3_replicate_weights_brr": {
-      "seconds": 0.12633833299999997,
+    "3_replicate_weights_jk1": {
+      "seconds": 0.21381758399999995,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.17774295900000003,
+      "seconds": 0.13717983300000003,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.018629792000000034,
+      "seconds": 0.018324165999999975,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.0519646250000001,
+      "seconds": 0.058137000000000105,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.06657341699999986,
+      "seconds": 0.08734375000000005,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
index e1d2a965..aadd59a6 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
@@ -2,40 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.535454792,
+  "total_seconds": 0.5398647089999999,
+  "memory": {
+    "available": true,
+    "start_mb": 133.16,
+    "peak_mb": 185.38,
+    "growth_mb": 52.22,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.011897708999999979,
+      "seconds": 0.011500667000000075,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03526237499999996,
+      "seconds": 0.03384820799999999,
       "ok": true,
       "error": null
     },
-    "3_replicate_weights_brr": {
-      "seconds": 0.185435083,
+    "3_replicate_weights_jk1": {
+      "seconds": 0.191542875,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.14044966699999994,
+      "seconds": 0.105974083,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.019051875000000162,
+      "seconds": 0.02876208299999994,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.05337804200000007,
+      "seconds": 0.06280441700000017,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08997387500000009,
+      "seconds": 0.10540583399999992,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
index ff655878..57857cb0 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
@@ -2,40 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.15087129199999993,
+  "total_seconds": 0.15974079200000002,
+  "memory": {
+    "available": true,
+    "start_mb": 115.44,
+    "peak_mb": 125.66,
+    "growth_mb": 10.22,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0017902499999999932,
+      "seconds": 0.0016714159999999811,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.00610949999999999,
+      "seconds": 0.0061952499999999855,
       "ok": true,
       "error": null
     },
-    "3_replicate_weights_brr": {
-      "seconds": 0.02120725000000001,
+    "3_replicate_weights_jk1": {
+      "seconds": 0.018200666000000032,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.011621500000000062,
+      "seconds": 0.02470079199999997,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.001833375000000026,
+      "seconds": 0.008862999999999954,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.027076792000000016,
+      "seconds": 0.024017708000000026,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.081212583,
+      "seconds": 0.07607645800000007,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
index db36b50c..9a2f2e1f 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
@@ -2,40 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.200881125,
+  "total_seconds": 0.19896133300000007,
+  "memory": {
+    "available": true,
+    "start_mb": 116.34,
+    "peak_mb": 129.73,
+    "growth_mb": 13.39,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0018462080000000158,
+      "seconds": 0.0019397500000000178,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.005704333000000061,
+      "seconds": 0.005711999999999939,
       "ok": true,
       "error": null
     },
-    "3_replicate_weights_brr": {
-      "seconds": 0.015561500000000006,
+    "3_replicate_weights_jk1": {
+      "seconds": 0.011531958999999925,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.05937758399999993,
+      "seconds": 0.06204845800000003,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.00939004099999996,
+      "seconds": 0.00982324999999995,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.025794415999999987,
+      "seconds": 0.024675957999999998,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08319054199999998,
+      "seconds": 0.08321629100000005,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_python.json b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
index e738a57d..ecaac859 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
@@ -2,35 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 23.955508166,
+  "total_seconds": 24.227992,
+  "memory": {
+    "available": true,
+    "start_mb": 395.55,
+    "peak_mb": 418.67,
+    "growth_mb": 23.12,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 23.873543207999997,
+      "seconds": 24.127084915999998,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.011892290999995225,
+      "seconds": 0.012080290999996635,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.08300000537065e-06,
+      "seconds": 2.2080000050550552e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0016835410000055617,
+      "seconds": 0.0015482499999990296,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.06804595899999555,
+      "seconds": 0.08696208300000308,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00032774999999674037,
+      "seconds": 0.0003088329999982875,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
index 7b407a9d..8b4a4bfe 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
@@ -2,35 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 24.372338875,
+  "total_seconds": 24.333580874999996,
+  "memory": {
+    "available": true,
+    "start_mb": 398.27,
+    "peak_mb": 427.64,
+    "growth_mb": 29.38,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 24.274492667,
+      "seconds": 24.260917499999998,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012104750000005993,
+      "seconds": 0.012276999999997429,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.2500000014247235e-06,
+      "seconds": 2.416999997478797e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.001614166999999611,
+      "seconds": 0.0017139999999997713,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.08373358300000433,
+      "seconds": 0.05839145800000267,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00028904199999857383,
+      "seconds": 0.0002632919999996375,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
index 9186ecdd..585403fa 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
@@ -2,35 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 6.1113194580000005,
+  "total_seconds": 6.076073791000001,
+  "memory": {
+    "available": true,
+    "start_mb": 186.69,
+    "peak_mb": 205.61,
+    "growth_mb": 18.92,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.027033041999999,
+      "seconds": 5.992297083999999,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.011803750000000335,
+      "seconds": 0.012114415999999295,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.5829999987792007e-06,
+      "seconds": 2.584000000638298e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0017158750000003664,
+      "seconds": 0.0016258340000003813,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.07050441700000043,
+      "seconds": 0.06977783399999993,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00024145799999963913,
+      "seconds": 0.0002475410000002398,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
index 516b7101..2b41998b 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
@@ -2,35 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 6.197831334,
+  "total_seconds": 6.242714707999999,
+  "memory": {
+    "available": true,
+    "start_mb": 190.41,
+    "peak_mb": 214.03,
+    "growth_mb": 23.62,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.0959868749999995,
+      "seconds": 6.1350732919999995,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012175959000000347,
+      "seconds": 0.012230709000000672,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.2500000014247235e-06,
+      "seconds": 2.374999999332772e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0015915419999998903,
+      "seconds": 0.0015812080000010553,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.08775508399999943,
+      "seconds": 0.09360750000000095,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00030545799999970313,
+      "seconds": 0.00021008400000077643,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_python.json b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
index 338c8f69..28b59951 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
@@ -2,35 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.5939237080000002,
+  "total_seconds": 1.5988546660000003,
+  "memory": {
+    "available": true,
+    "start_mb": 121.19,
+    "peak_mb": 131.47,
+    "growth_mb": 10.28,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.498778625,
+      "seconds": 1.492374834,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.014817040999999698,
+      "seconds": 0.01475770799999987,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.250000000092456e-06,
+      "seconds": 2.2920000000148377e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0039673339999999335,
+      "seconds": 0.003596208999999906,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.07608712500000037,
+      "seconds": 0.08774637500000004,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00026245799999990993,
+      "seconds": 0.0003601660000001061,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
index 1eded7d3..7319b426 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
@@ -2,35 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.610289666,
+  "total_seconds": 1.6259075419999998,
+  "memory": {
+    "available": true,
+    "start_mb": 122.05,
+    "peak_mb": 133.52,
+    "growth_mb": 11.47,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.532226625,
+      "seconds": 1.5564257919999998,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.015062499999999979,
+      "seconds": 0.014676375000000075,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.4170000001433323e-06,
+      "seconds": 2.000000000279556e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0037798330000002878,
+      "seconds": 0.0035856660000002094,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05893341699999999,
+      "seconds": 0.05117816700000022,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00028129199999993304,
+      "seconds": 3.583299999965206e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
index 6a65d78f..b198b53c 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
@@ -2,45 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.2445272920000001,
+  "total_seconds": 1.257323208,
+  "memory": {
+    "available": true,
+    "start_mb": 233.84,
+    "peak_mb": 485.58,
+    "growth_mb": 251.73,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.018094541999999825,
+      "seconds": 0.018876499999999963,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.166935917,
+      "seconds": 0.166782875,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.4159999997562807e-06,
+      "seconds": 3.91699999990891e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0024244580000001292,
+      "seconds": 0.002431167000000123,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.31497600000000014,
+      "seconds": 0.34071708300000036,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.6572787920000001,
+      "seconds": 0.6446924169999999,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.0847687920000002,
+      "seconds": 0.08378041600000019,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.716700000033768e-05,
+      "seconds": 3.32080000000623e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
index 56c2d4d6..9146cd5e 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
@@ -2,45 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.2180239580000003,
+  "total_seconds": 1.224077334,
+  "memory": {
+    "available": true,
+    "start_mb": 267.02,
+    "peak_mb": 589.03,
+    "growth_mb": 322.02,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.01827750000000039,
+      "seconds": 0.017963249999999986,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.16558841699999993,
+      "seconds": 0.16458212500000013,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.2919999997105265e-06,
+      "seconds": 3.5830000002512463e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002447333000000107,
+      "seconds": 0.0024376249999997768,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.35019279199999964,
+      "seconds": 0.40594308300000037,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.5965422080000002,
+      "seconds": 0.5462847499999999,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.08493354199999992,
+      "seconds": 0.086821708,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.420799999975799e-05,
+      "seconds": 3.604200000006941e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
index d5589ba1..1d14a9cf 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
@@ -2,45 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.7159983750000001,
+  "total_seconds": 0.7493203750000001,
+  "memory": {
+    "available": true,
+    "start_mb": 144.84,
+    "peak_mb": 225.83,
+    "growth_mb": 80.98,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.012523791999999867,
+      "seconds": 0.012219124999999886,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.09662354100000003,
+      "seconds": 0.09556741599999996,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.7499999999403e-06,
+      "seconds": 3.2080000000878073e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002143749999999889,
+      "seconds": 0.002560125000000024,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.22049916599999997,
+      "seconds": 0.28517420799999993,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.33182912500000006,
+      "seconds": 0.30205854199999993,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.052325667000000076,
+      "seconds": 0.05169025000000005,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 4.5832999999939616e-05,
+      "seconds": 3.579200000003446e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
index 1e44a769..a885051a 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
@@ -2,45 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.8678267909999999,
+  "total_seconds": 0.7148772079999999,
+  "memory": {
+    "available": true,
+    "start_mb": 154.66,
+    "peak_mb": 263.2,
+    "growth_mb": 108.55,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.012692541999999918,
+      "seconds": 0.0129322919999999,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.09956041599999987,
+      "seconds": 0.09786516600000006,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.374999999916639e-06,
+      "seconds": 3.333999999854953e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002752457999999791,
+      "seconds": 0.002447500000000158,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.42143175,
+      "seconds": 0.30165525000000004,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.2781500830000001,
+      "seconds": 0.24698387500000019,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.053192707999999866,
+      "seconds": 0.05294829199999995,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.787500000007604e-05,
+      "seconds": 3.641599999992806e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
index 12a7c4ea..c66f4316 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
@@ -2,45 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.48264691700000006,
+  "total_seconds": 0.4984247500000001,
+  "memory": {
+    "available": true,
+    "start_mb": 114.47,
+    "peak_mb": 140.62,
+    "growth_mb": 26.16,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.00833841600000007,
+      "seconds": 0.007680332999999928,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06103824999999996,
+      "seconds": 0.06272016599999997,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.042000000008649e-06,
+      "seconds": 2.7499999999403e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.00554720799999997,
+      "seconds": 0.008486915999999955,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.07744641600000002,
+      "seconds": 0.17097429100000006,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.30062641700000003,
+      "seconds": 0.2192437909999999,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.02960395800000004,
+      "seconds": 0.02927862499999989,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.195800000010962e-05,
+      "seconds": 3.233299999982897e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
index be61050d..1a5578f0 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
@@ -2,45 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.4882878749999999,
+  "total_seconds": 0.496599209,
+  "memory": {
+    "available": true,
+    "start_mb": 114.38,
+    "peak_mb": 147.39,
+    "growth_mb": 33.02,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.0071254580000000844,
+      "seconds": 0.006995290999999959,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06050545900000004,
+      "seconds": 0.06457049999999998,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.8330000000353905e-06,
+      "seconds": 2.8749999999577724e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.004818417000000075,
+      "seconds": 0.004752999999999896,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.13427262500000003,
+      "seconds": 0.13871820899999998,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.2511277919999999,
+      "seconds": 0.250811958,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.03038112500000012,
+      "seconds": 0.030704207999999955,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 4.8000000000048004e-05,
+      "seconds": 3.8041999999904874e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_python.json b/benchmarks/speed_review/baselines/dose_response_python.json
index d514729f..1abbf1f8 100644
--- a/benchmarks/speed_review/baselines/dose_response_python.json
+++ b/benchmarks/speed_review/baselines/dose_response_python.json
@@ -2,35 +2,42 @@
   "scenario": "dose_response",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.583888167,
+  "total_seconds": 0.589667292,
+  "memory": {
+    "available": true,
+    "start_mb": 113.75,
+    "peak_mb": 120.16,
+    "growth_mb": 6.41,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.148716792,
+      "seconds": 0.15124875000000004,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007523750000000273,
+      "seconds": 0.0007334169999999585,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.147467083,
+      "seconds": 0.14819204099999994,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0015002499999999808,
+      "seconds": 0.0014475419999999684,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.141884625,
+      "seconds": 0.14363024999999996,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14356270800000015,
+      "seconds": 0.14441004200000007,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_rust.json b/benchmarks/speed_review/baselines/dose_response_rust.json
index 28f751db..25e8e5af 100644
--- a/benchmarks/speed_review/baselines/dose_response_rust.json
+++ b/benchmarks/speed_review/baselines/dose_response_rust.json
@@ -2,35 +2,42 @@
   "scenario": "dose_response",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.5914641250000001,
+  "total_seconds": 0.589171666,
+  "memory": {
+    "available": true,
+    "start_mb": 113.69,
+    "peak_mb": 122.39,
+    "growth_mb": 8.7,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.15346125,
+      "seconds": 0.15004549999999994,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007392499999999691,
+      "seconds": 0.0007515420000000494,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.14869329099999995,
+      "seconds": 0.14819683299999997,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0017346249999999896,
+      "seconds": 0.0015388330000000172,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14182995799999998,
+      "seconds": 0.1435805000000001,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14500187499999995,
+      "seconds": 0.14505404099999986,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
index f415c34c..3a8be549 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
@@ -2,35 +2,42 @@
   "scenario": "geo_few_markets_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.262019625,
+  "total_seconds": 0.259593333,
+  "memory": {
+    "available": true,
+    "start_mb": 117.12,
+    "peak_mb": 117.45,
+    "growth_mb": 0.33,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.04104833299999999,
+      "seconds": 0.04071524999999998,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.037608707999999935,
+      "seconds": 0.03718341699999994,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.07724208399999999,
+      "seconds": 0.07727700000000004,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0007287080000000223,
+      "seconds": 0.0006957920000000284,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.10535358300000008,
+      "seconds": 0.10368924999999996,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 3.483299999995637e-05,
+      "seconds": 2.9124999999963208e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
index 189fd1a0..9ab55a0e 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
@@ -2,35 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 3.6490506659999995,
+  "total_seconds": 4.028322917,
+  "memory": {
+    "available": true,
+    "start_mb": 143.75,
+    "peak_mb": 152.25,
+    "growth_mb": 8.5,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.3422396250000004,
+      "seconds": 0.36179233399999955,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.3464741250000003,
+      "seconds": 0.36836100000000016,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.3339607080000002,
+      "seconds": 1.5678719169999997,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0006264169999994351,
+      "seconds": 0.0007424160000004676,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.6257209160000006,
+      "seconds": 1.7295259170000001,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.4749999999684746e-05,
+      "seconds": 2.4790999999524388e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
index ab07cd9d..e0845d6f 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
@@ -2,35 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.1170789579999999,
+  "total_seconds": 0.11668558299999998,
+  "memory": {
+    "available": true,
+    "start_mb": 116.48,
+    "peak_mb": 116.86,
+    "growth_mb": 0.38,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.020038041999999923,
+      "seconds": 0.020445540999999956,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.022811833000000004,
+      "seconds": 0.022912875000000055,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.024646833000000035,
+      "seconds": 0.024874541999999944,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.000611084000000095,
+      "seconds": 0.0005995829999999591,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.048946500000000004,
+      "seconds": 0.04782845799999991,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.112500000006623e-05,
+      "seconds": 2.070799999998041e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
index 2dccefec..1d64a335 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
@@ -2,35 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 3.048684083,
+  "total_seconds": 3.739117167,
+  "memory": {
+    "available": true,
+    "start_mb": 113.66,
+    "peak_mb": 123.55,
+    "growth_mb": 9.89,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.48794175000000006,
+      "seconds": 0.5992219579999999,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.4880505420000001,
+      "seconds": 0.5961615419999999,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.9833322499999999,
+      "seconds": 1.1918989170000003,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0011712919999999905,
+      "seconds": 0.0009045000000003078,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.0881425410000003,
+      "seconds": 1.3508564999999995,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 3.9874999999689464e-05,
+      "seconds": 6.941599999965575e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
index 0a150287..ed45fd16 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
@@ -2,35 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.04011274999999992,
+  "total_seconds": 0.039840292000000055,
+  "memory": {
+    "available": true,
+    "start_mb": 113.7,
+    "peak_mb": 115.17,
+    "growth_mb": 1.47,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.007763625000000052,
+      "seconds": 0.007603000000000026,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.012679667000000006,
+      "seconds": 0.012722500000000081,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.008062542000000006,
+      "seconds": 0.008069333000000012,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0007486250000000583,
+      "seconds": 0.0008005829999999658,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.010833333,
+      "seconds": 0.010622041999999943,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.1959000000015827e-05,
+      "seconds": 1.937499999993264e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt b/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt
new file mode 100644
index 00000000..a5748823
--- /dev/null
+++ b/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt
@@ -0,0 +1,29 @@
+# BRFSS-1M aggregate_survey allocation attribution
+# backend: rust
+# input microdata rows: 1,000,000
+# input microdata memory: 45.8 MB
+# output panel cells: 500
+
+# tracemalloc totals during aggregate_survey
+# net allocated (end - start): 0.1 MB (top site)
+# python peak traced: 84.2 MB
+# python current retained: 0.6 MB
+
+# top 15 allocation sites by size delta
+#      size diff (MB)   count diff  location
+--------------------------------------------------------------------------------
+1                0.15         1521  /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/linecache.py:148
+2                0.04            7  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/internals/blocks.py:822
+3                0.02          440  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/sorting.py:637
+4                0.01           63  /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/abc.py:123
+5                0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/frame.py:12710
+6                0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/frame.py:698
+7                0.00           16  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py:5372
+8                0.00           51  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py:427
+9                0.00           55  /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py:529
+10               0.00           43  /Users/igerber/diff-diff-perf-review/diff_diff/prep.py:1618
+11               0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/internals/construction.py:237
+12               0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/groupby/grouper.py:846
+13               0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/construction.py:517
+14               0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py:475
+15               0.00           35  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/internals/managers.py:1500
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_python.json b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
index 697b0ecc..75b32efe 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_python.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
@@ -2,25 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.550877666,
+  "total_seconds": 0.618717375,
+  "memory": {
+    "available": true,
+    "start_mb": 113.59,
+    "peak_mb": 131.45,
+    "growth_mb": 17.86,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.3742725,
+      "seconds": 0.32684875,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.1669999999686098e-06,
+      "seconds": 1.4170000000035543e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.003917875000000071,
+      "seconds": 0.004943374999999972,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.17268387499999993,
+      "seconds": 0.28691199999999994,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
index 27ab9df0..b72b9f34 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
@@ -2,25 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.5288864590000001,
+  "total_seconds": 0.6100237500000001,
+  "memory": {
+    "available": true,
+    "start_mb": 114.17,
+    "peak_mb": 135.97,
+    "growth_mb": 21.8,
+    "sampler_interval_s": 0.01
+  },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.3325950000000001,
+      "seconds": 0.334349,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.1670000000796321e-06,
+      "seconds": 1.208000000030296e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.004058833999999956,
+      "seconds": 0.003726332999999915,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.19222925000000002,
+      "seconds": 0.271937125,
       "ok": true,
       "error": null
     }
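Reviewer note (not part of the patch): the paired `<scenario>_python.json` / `<scenario>_rust.json` baselines above share one layout (`total_seconds`, then `phases[label]` with `seconds`, `ok`, `error`), so per-phase backend ratios can be computed mechanically instead of read off the diff. A minimal sketch under that assumption; `phase_speedups` and `load_baseline` are hypothetical helper names, not functions this patch adds.

```python
import json
from pathlib import Path


def phase_speedups(python_record, rust_record):
    """Return {phase_label: python_seconds / rust_seconds} for shared phases.

    Both arguments are dicts in the baseline JSON layout shown in this
    patch. Phases that failed (ok is False) on either backend, or that
    report a non-positive Rust time, are skipped rather than given a ratio.
    """
    ratios = {}
    rust_phases = rust_record["phases"]
    for label, py_info in python_record["phases"].items():
        rs_info = rust_phases.get(label)
        if rs_info is None:
            continue  # phase exists only in the Python baseline
        if not (py_info["ok"] and rs_info["ok"]):
            continue  # a failed phase has no meaningful timing
        if rs_info["seconds"] <= 0:
            continue  # avoid division by zero
        ratios[label] = py_info["seconds"] / rs_info["seconds"]
    return ratios


def load_baseline(path):
    """Load one baseline JSON file from disk."""
    return json.loads(Path(path).read_text())
```

Applied to the `geo_few_markets_medium` pair above, the `3_in_time_placebo` phase (1.5679s Python vs 0.0249s Rust) comes out at roughly 63x, while the sub-millisecond phases give ratios dominated by timer noise and are best ignored.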
diff --git a/benchmarks/speed_review/bench_brand_awareness_survey.py b/benchmarks/speed_review/bench_brand_awareness_survey.py
index d2474066..93758392 100644
--- a/benchmarks/speed_review/bench_brand_awareness_survey.py
+++ b/benchmarks/speed_review/bench_brand_awareness_survey.py
@@ -1,9 +1,10 @@
 """
-Scenario 2: Brand awareness survey DiD — 2x2 with survey design.
+Scenario 2: Brand awareness survey DiD - 2x2 with survey design.
 
 DifferenceInDifferences + SurveyDesign under two variance paths:
   (a) analytical Taylor-series linearization (strata + PSU + FPC)
-  (b) replicate-weight bootstrap (BRR-style, ~160 replicate columns)
+  (b) replicate-weight variance (JK1 delete-one-PSU jackknife weights; count
+      equals the number of PSUs, so 40/90/160 at small/medium/large)
 
 Chains: naive fit (for SE-inflation comparison) -> TSL -> replicate -> multi-
 outcome refit loop -> check_parallel_trends -> placebo -> HonestDiD grid.
@@ -75,7 +76,7 @@ def replicate_fit():
             raise RuntimeError("replicate weights not generated")
         sd = SurveyDesign(
             weights="weight", replicate_weights=rw_cols,
-            replicate_method="BRR",
+            replicate_method="JK1",
         )
         did = DifferenceInDifferences(robust=True)
         results["replicate"] = did.fit(
@@ -139,7 +140,7 @@ def honest_did_grid():
     return [
         ("1_naive_fit_no_survey_design", naive_fit),
         ("2_tsl_strata_psu_fpc", tsl_fit),
-        ("3_replicate_weights_brr", replicate_fit),
+        ("3_replicate_weights_jk1", replicate_fit),
         ("4_multi_outcome_loop_3_metrics", multi_outcome_loop),
         ("5_check_parallel_trends", pretrends),
         ("6_placebo_refit_pre_period", placebo_refit),
diff --git a/benchmarks/speed_review/bench_shared.py b/benchmarks/speed_review/bench_shared.py
index 1530ad89..4f50fdc2 100644
--- a/benchmarks/speed_review/bench_shared.py
+++ b/benchmarks/speed_review/bench_shared.py
@@ -5,12 +5,16 @@
 list of phases (label, callable). The harness times each phase, wraps the
 full chain in a pyinstrument profile, and writes:
 
-- ``benchmarks/results/<scenario>_<backend>.json`` — per-phase wall-clock
-- ``benchmarks/results/profiles/<scenario>_<backend>.html`` — flame profile
+- ``benchmarks/speed_review/baselines/<scenario>_<backend>.json`` - per-phase wall-clock
+- ``benchmarks/speed_review/baselines/profiles/<scenario>_<backend>.html`` - flame profile
+
+If any phase raises, the exception is caught and recorded as
+``{"ok": false}`` in the per-phase JSON, AND the process exits 1 after
+artifacts are written so that ``run_all.py`` and CI can detect the failure.
 
 Backend is auto-detected via ``diff_diff._backend.HAS_RUST_BACKEND`` and the
-``DIFF_DIFF_BACKEND`` env var. Run each script twice — once with
-``DIFF_DIFF_BACKEND=python`` and once with ``DIFF_DIFF_BACKEND=rust`` — to
+``DIFF_DIFF_BACKEND`` env var. Run each script twice - once with
+``DIFF_DIFF_BACKEND=python`` and once with ``DIFF_DIFF_BACKEND=rust`` - to
 populate both files.
 
 See ``docs/performance-scenarios.md`` for scenario definitions and
@@ -18,9 +22,11 @@
 recommendations derived from these results.
 """
 
+import atexit
 import json
 import os
 import sys
+import threading
 import time
 import warnings
 from pathlib import Path
@@ -34,6 +40,67 @@
     HAS_PYINSTRUMENT = False
     Profiler = None  # type: ignore[assignment,misc]
 
+try:
+    import psutil
+    HAS_PSUTIL = True
+except ImportError:
+    HAS_PSUTIL = False
+    psutil = None  # type: ignore[assignment]
+
+
+class _RSSSampler:
+    """Background thread that samples process RSS every ~10ms.
+
+    Gives per-scenario peak memory without depending on
+    `resource.getrusage(RUSAGE_SELF).ru_maxrss` (which is monotonic across
+    the whole process and so would leak scale-1 peaks into scale-2 reports
+    in multi-scale scripts). If psutil is missing, the sampler reports
+    peak=0 and the caller falls back to not recording memory.
+    """
+
+    def __init__(self, interval_s=0.01):
+        self.interval = interval_s
+        self.peak_bytes = 0
+        self.start_bytes = 0
+        self._stop = threading.Event()
+        self._thread = None
+        self._proc = psutil.Process() if HAS_PSUTIL else None
+
+    def start(self):
+        if self._proc is None:
+            return
+        self.start_bytes = self._proc.memory_info().rss
+        self.peak_bytes = self.start_bytes
+        self._stop.clear()
+
+        def sample():
+            while not self._stop.is_set():
+                try:
+                    rss = self._proc.memory_info().rss
+                    if rss > self.peak_bytes:
+                        self.peak_bytes = rss
+                except Exception:
+                    pass
+                self._stop.wait(self.interval)
+
+        self._thread = threading.Thread(target=sample, daemon=True)
+        self._thread.start()
+
+    def stop(self):
+        if self._proc is None:
+            return
+        self._stop.set()
+        if self._thread:
+            self._thread.join(timeout=0.2)
+
+    @property
+    def peak_mb(self):
+        return self.peak_bytes / (1024 * 1024)
+
+    @property
+    def start_mb(self):
+        return self.start_bytes / (1024 * 1024)
+
 sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
 
 from diff_diff._backend import HAS_RUST_BACKEND
@@ -41,6 +108,24 @@
 RESULTS_DIR = Path(__file__).resolve().parent / "baselines"
 PROFILE_DIR = RESULTS_DIR / "profiles"
 
+# Module-level failure flag. Set True whenever run_scenario sees any phase
+# record ok=False. atexit handler below translates this into a nonzero
+# process exit code so run_all.py and CI can detect partial-failure runs.
+# Multi-scale scripts still complete all scales before the process exits.
+_any_phase_failed = False
+
+
+def _exit_with_failure_status():
+    if _any_phase_failed:
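+        msg = "\n  [bench_shared] at least one phase failed; exiting nonzero"
+        print(msg, file=sys.stderr, flush=True)
+        # os._exit() skips the interpreter's stdio flush, so flush by hand.
+        sys.stdout.flush()
+        os._exit(1)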
+
+
+atexit.register(_exit_with_failure_status)
+
 
 def _backend_label():
     """Return 'rust' or 'python' for file naming."""
@@ -83,6 +168,9 @@ def run_scenario(scenario_name, phases, metadata=None):
         profile = Profiler(async_mode="disabled")
         profile.start()
 
+    sampler = _RSSSampler()
+    sampler.start()
+
     phase_times = {}
     total_start = time.perf_counter()
     try:
@@ -104,6 +192,7 @@ def run_scenario(scenario_name, phases, metadata=None):
                 print(f"  [{label}] FAILED: {type(e).__name__}: {e}")
     finally:
         total_elapsed = time.perf_counter() - total_start
+        sampler.stop()
         if profile is not None:
             profile.stop()
             html_path = PROFILE_DIR / f"{scenario_name}_{backend}.html"
@@ -112,11 +201,23 @@ def run_scenario(scenario_name, phases, metadata=None):
             repo_root = Path(__file__).resolve().parents[2]
             print(f"  profile -> {html_path.relative_to(repo_root)}")
 
+    memory = {
+        "available": HAS_PSUTIL,
+        "start_mb": round(sampler.start_mb, 2) if HAS_PSUTIL else None,
+        "peak_mb": round(sampler.peak_mb, 2) if HAS_PSUTIL else None,
+        "growth_mb": (
+            round(sampler.peak_mb - sampler.start_mb, 2)
+            if HAS_PSUTIL else None
+        ),
+        "sampler_interval_s": sampler.interval,
+    }
+
     record = {
         "scenario": scenario_name,
         "backend": backend,
         "has_rust_backend": HAS_RUST_BACKEND,
         "total_seconds": total_elapsed,
+        "memory": memory,
         "phases": phase_times,
         "metadata": metadata or {},
         "diff_diff_version": _get_version(),
@@ -127,12 +228,25 @@ def run_scenario(scenario_name, phases, metadata=None):
     with open(json_path, "w") as f:
         json.dump(record, f, indent=2, default=str)
 
-    print(f"\n  [{scenario_name}] backend={backend}  total={total_elapsed:.2f}s")
+    mem_str = (
+        f"  peak_rss={memory['peak_mb']:.0f}MB  "
+        f"(+{memory['growth_mb']:.0f}MB during run)"
+        if HAS_PSUTIL else "  [no psutil; skipping memory]"
+    )
+    print(
+        f"\n  [{scenario_name}] backend={backend}  "
+        f"total={total_elapsed:.2f}s{mem_str}"
+    )
     for label, info in phase_times.items():
         status = "OK " if info["ok"] else "ERR"
         print(f"    {status} {label:<40} {info['seconds']:>8.3f}s")
     repo_root = Path(__file__).resolve().parents[2]
     print(f"  json    -> {json_path.relative_to(repo_root)}")
+
+    if any(not info["ok"] for info in phase_times.values()):
+        global _any_phase_failed
+        _any_phase_failed = True
+
     return record
 
 
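Reviewer note (not part of the patch): the failure-flag plus `atexit` mechanism added to `bench_shared.py` can be exercised in isolation. The sketch below runs a small child script that mimics the pattern and reports its exit code; the child script, the `failed` dict, and `run_child` are illustrative names only. It flushes stdout before `os._exit(1)` because `os._exit` terminates without flushing stdio buffers, which matters when an orchestrator captures the child's output through a pipe.

```python
import subprocess
import sys
import textwrap

# Child script that mimics the flag + atexit pattern (illustrative names).
CHILD = textwrap.dedent(
    """
    import atexit, os, sys

    failed = {"any": False}

    def exit_with_failure_status():
        if failed["any"]:
            # os._exit skips the normal stdio flush, so flush explicitly
            # before forcing the exit code.
            sys.stdout.flush()
            sys.stderr.flush()
            os._exit(1)

    atexit.register(exit_with_failure_status)

    print("phase output")
    failed["any"] = sys.argv[1] == "fail"
    """
)


def run_child(mode):
    """Run the child script with mode 'fail' or 'ok'; return (returncode, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout
```

An orchestrator in the style of `run_all.py` only needs to check the return code: a run with a failed phase exits 1 but still delivers everything it printed.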
diff --git a/benchmarks/speed_review/mem_profile_brfss.py b/benchmarks/speed_review/mem_profile_brfss.py
new file mode 100644
index 00000000..fd7127ab
--- /dev/null
+++ b/benchmarks/speed_review/mem_profile_brfss.py
@@ -0,0 +1,120 @@
+"""
+Per-function allocation attribution for the BRFSS-1M scenario.
+
+Runs the large-scale BRFSS `aggregate_survey` path under `tracemalloc` and
+writes top-N allocation sites to
+``benchmarks/speed_review/baselines/mem_profile_brfss_large_<backend>.txt``.
+
+Standalone because tracemalloc has 2-5x overhead; running it inside the
+main timing harness would contaminate the wall-clock baselines. Companion
+to the psutil-sampled peak RSS captured in the main JSON baselines (see
+``_RSSSampler`` in ``bench_shared.py``): this script tells us WHERE the
+memory went, those tell us HOW MUCH.
+"""
+
+import argparse
+import tracemalloc
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+
+from diff_diff import SurveyDesign, aggregate_survey
+from diff_diff._backend import HAS_RUST_BACKEND
+
+
+BASELINES = Path(__file__).resolve().parent / "baselines"
+
+
+def build_microdata(n_states=50, n_years=10, n_per_cell=2000,
+                    n_strata=20, n_psu=1000, seed=42):
+    rng = np.random.default_rng(seed)
+    n_rows = n_states * n_years * n_per_cell
+    state = np.repeat(np.arange(n_states), n_years * n_per_cell)
+    year = np.tile(
+        np.repeat(np.arange(2010, 2010 + n_years), n_per_cell),
+        n_states,
+    )
+    stratum = rng.integers(0, n_strata, size=n_rows)
+    psu = stratum * (n_psu // n_strata) + rng.integers(
+        0, n_psu // n_strata, size=n_rows,
+    )
+    weight = rng.lognormal(0, 0.4, size=n_rows) * 50.0
+    y = (
+        rng.normal(0, 1, size=n_rows)
+        + 0.5 * (year - 2010)
+        + rng.normal(0, 0.2, size=n_rows) * state
+    )
+    return pd.DataFrame({
+        "state": state, "year": year,
+        "strata": stratum, "psu": psu, "finalwt": weight,
+        "y": y,
+    })
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--top", type=int, default=15,
+                        help="Show top N allocation sites")
+    args = parser.parse_args()
+
+    BASELINES.mkdir(parents=True, exist_ok=True)
+    backend = "rust" if HAS_RUST_BACKEND else "python"
+    out_path = BASELINES / f"mem_profile_brfss_large_{backend}.txt"
+
+    print("Building 1M-row BRFSS microdata...")
+    micro = build_microdata()
+    print(f"  shape: {micro.shape}, mem: "
+          f"{micro.memory_usage(deep=True).sum()/1024/1024:.1f} MB")
+
+    sd = SurveyDesign(
+        weights="finalwt", strata="strata", psu="psu",
+    )
+
+    print("Starting tracemalloc...")
+    tracemalloc.start(25)
+    snap_before = tracemalloc.take_snapshot()
+
+    print("Running aggregate_survey...")
+    panel, stage2 = aggregate_survey(
+        micro, by=["state", "year"], outcomes="y",
+        survey_design=sd,
+    )
+
+    snap_after = tracemalloc.take_snapshot()
+    stats = snap_after.compare_to(snap_before, "lineno")
+    current, peak = tracemalloc.get_traced_memory()
+    tracemalloc.stop()
+
+    lines = [
+        "# BRFSS-1M aggregate_survey allocation attribution",
+        f"# backend: {backend}",
+        f"# input microdata rows: {len(micro):,}",
+        f"# input microdata memory: "
+        f"{micro.memory_usage(deep=True).sum()/1024/1024:.1f} MB",
+        f"# output panel cells: {len(panel)}",
+        "",
+        "# tracemalloc totals during aggregate_survey",
+        f"# net allocated (end - start): "
+        f"{(stats[0].size_diff if stats else 0)/1024/1024:.1f} MB (top site)",
+        f"# python peak traced: {peak/1024/1024:.1f} MB",
+        f"# python current retained: {current/1024/1024:.1f} MB",
+        "",
+        f"# top {args.top} allocation sites by size delta",
+        f"{'#':<4} {'size diff (MB)':>16} {'count diff':>12}  location",
+        f"{'-'*80}",
+    ]
+    for i, s in enumerate(stats[:args.top], 1):
+        loc = str(s.traceback).split("\n")[0]
+        lines.append(
+            f"{i:<4} {s.size_diff/1024/1024:>16.2f} {s.count_diff:>12d}  {loc}"
+        )
+
+    text = "\n".join(lines) + "\n"
+    out_path.write_text(text)
+    print("\n" + text)
+    print(f"wrote {out_path}")
+
+
+if __name__ == "__main__":
+    main()
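Reviewer note (not part of the patch): the report header written by `mem_profile_brfss.py` labels the size delta of the single largest site as "net allocated (end - start)". If a true net figure is wanted, the deltas can be summed across every site that `compare_to` returns. A minimal sketch using only the standard-library `tracemalloc` API; `net_allocated_bytes` and `measure` are hypothetical helper names, and the workload is any zero-argument callable.

```python
import tracemalloc


def net_allocated_bytes(snapshot_before, snapshot_after):
    """Sum size deltas over every allocation site between two snapshots."""
    stats = snapshot_after.compare_to(snapshot_before, "lineno")
    return sum(stat.size_diff for stat in stats)


def measure(workload):
    """Run workload() under tracemalloc.

    Returns (result, net_bytes, peak_bytes). The result is returned so the
    caller keeps it alive; memory it still holds counts toward net_bytes.
    """
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        result = workload()
        after = tracemalloc.take_snapshot()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, net_allocated_bytes(before, after), peak
```

The net figure includes a small amount of bookkeeping allocated by the snapshots themselves, so it is best read alongside the traced peak rather than as an exact count.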
diff --git a/benchmarks/speed_review/run_all.py b/benchmarks/speed_review/run_all.py
index d31a23d6..12b609b5 100644
--- a/benchmarks/speed_review/run_all.py
+++ b/benchmarks/speed_review/run_all.py
@@ -2,10 +2,14 @@
 """
 Run every practitioner-workflow scenario under both backends.
 
-Writes per-scenario JSON + pyinstrument HTML to ``benchmarks/results/`` and
-``benchmarks/results/profiles/``. See ``docs/performance-scenarios.md`` for
-scenario definitions and ``docs/performance-plan.md`` for the derived
-findings.
+Writes per-scenario JSON + pyinstrument HTML under
+``benchmarks/speed_review/baselines/`` (and ``.../baselines/profiles/``).
+See ``docs/performance-scenarios.md`` for scenario definitions and
+``docs/performance-plan.md`` for the derived findings.
+
+Exit status is nonzero if any scenario subprocess exits nonzero. Scenario
+scripts themselves exit 1 on any phase failure (see ``bench_shared.py``),
+so this orchestrator reliably surfaces failures.
 
 Usage:
 
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index 72e4b1f2..d5da7c9d 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -36,9 +36,9 @@ scale.
 | **1. Staggered campaign** | small | 150 units × 26 periods | 0.48 | 0.49 | 1.0x |
 | (CS + 8-step chain, bootstrap 999) | medium | 500 units × 26 periods | 0.72 | 0.87 | 0.8x |
 |  | large | 1,500 units × 26 periods | 1.24 | 1.22 | 1.0x |
-| **2. Brand awareness survey** | small | 200 units × 12 periods | 0.15 | 0.20 | 0.8x |
-| (DiD + SurveyDesign + replicate weights) | medium | 500 units × 12 periods | 0.49 | 0.54 | 0.9x |
-|  | large | 1,000 units × 12 periods | 0.79 | 0.83 | 1.0x |
+| **2. Brand awareness survey** | small | 200 units × 12 periods, 40 JK1 reps | 0.21 | 0.20 | 1.0x |
+| (DiD + SurveyDesign + JK1 replicate weights) | medium | 500 units × 12 periods, 90 JK1 reps | 0.52 | 0.46 | 1.1x |
+|  | large | 1,000 units × 12 periods, 160 JK1 reps | 0.83 | 0.92 | 0.9x |
 | **3. BRFSS microdata → CS panel** | small | 50K rows → 500 cells | 1.59 | 1.61 | 1.0x |
 | (`aggregate_survey` + CS + HonestDiD) | medium | 250K rows → 500 cells | 6.11 | 6.20 | 1.0x |
 |  | large | **1M rows → 500 cells** | **23.96** | **24.37** | **1.0x** |
@@ -70,9 +70,10 @@ scale.
 
 **Two findings hold across scales:**
 
-4. Brand-awareness survey chain scales sub-linearly (0.15s → 0.79s for a
-   5x unit increase); TSL + replicate-weight paths are well-vectorized
-   even with 40 replicate columns.
+4. Brand-awareness survey chain scales roughly linearly in n_units
+   (0.21s → 0.83s for a 5x unit increase); the JK1 replicate-weight path
+   itself scales closer to n_units × n_replicates (40 → 160 replicates
+   across the sweep), becoming the dominant phase at large scale.
 5. Rust backend gives measurable uplift only for SDiD; for everything else
    backend choice is within noise because the bottlenecks are in Python
    (`aggregate_survey`) or already well-vectorized (CS bootstrap, ImputationDiD,
@@ -98,12 +99,16 @@ similar rate. The CS fit with `n_bootstrap=999` at 1,500 units is 0.18s
 higher-upside items land.
 
 **Scenario 2 - Brand awareness survey.**
-At small scale HonestDiD dominates (54%); at medium the multi-outcome loop
-takes over (36%); at large the replicate-weight BRR path becomes the top
-phase (43%, 0.34s). Replicate-weight path scales with both n × n_replicates,
-as expected. All absolute times are practitioner-acceptable. Action: leave
-alone; confirm BRR path scales linearly with a future n_replicates sweep
-if needed.
+At small scale HonestDiD dominates (42%); at medium the multi-outcome loop
+and the JK1 replicate-weight path are within a factor of 2 (23-36%); at
+large the JK1 path becomes the single top phase (~45-50%, 0.37s Python /
+0.46s Rust). Replicate count grows with PSU count (40 / 90 / 160 at the
+three scales), so the path scales roughly as n_units × n_replicates - a
+near-quadratic curve in a design dimension that commonly grows. Note that
+Rust is marginally slower than Python here because the JK1 replicate-fit
+loop is not yet Rust-accelerated and the FFI crossings cost more than the
+per-fit work. Action: leave alone, but flag the JK1 path as a Rust-port
+candidate if practitioners regularly run n_replicates >= 160.
 
 **Scenario 3 - BRFSS microdata → CS panel.**
 `aggregate_survey` share of total grows with scale: 94% at 50K → 99% at 250K →
@@ -123,6 +128,87 @@ both rebuilding shared TSL scaffolding.
 **Scenario 6 - Pricing dose-response (ContinuousDiD).**
 Unchanged: four spline fits ~140ms each, ~99% of total.
 
+### Memory analysis
+
+End-to-end peak RSS and per-scenario growth are captured in each JSON
+baseline under the `memory` field, recorded by a background psutil
+sampler polling every 10 ms. A standalone `tracemalloc`-based
+allocator-attribution pass for the BRFSS-1M scenario lives at
+`benchmarks/speed_review/mem_profile_brfss.py`; its output is in
+`benchmarks/speed_review/baselines/mem_profile_brfss_large_<backend>.txt`.
+
+| Scenario | Scale | Peak RSS (Py) | Growth during run (Py, MB) | Peak RSS (Rust) | Growth during run (Rust, MB) |
+|---|---|---:|---:|---:|---:|
+| Staggered campaign | small | 141 MB | +26 | 147 MB | +33 |
+|  | medium | 226 MB | +81 | 263 MB | +109 |
+|  | **large** | **486 MB** | **+252** | **589 MB** | **+322** |
+| Brand awareness survey | small | 126 MB | +10 | 130 MB | +13 |
+|  | medium | 188 MB | +56 | 185 MB | +52 |
+|  | large | 336 MB | +146 | 315 MB | +127 |
+| BRFSS microdata -> CS panel | small (50K) | 131 MB | +10 | 134 MB | +11 |
+|  | medium (250K) | 206 MB | +19 | 214 MB | +24 |
+|  | **large (1M)** | **419 MB** | **+23** | **428 MB** | **+29** |
+| SDiD few markets | small | 124 MB | +10 | 115 MB | +1 |
+|  | medium | 152 MB | +8 | 117 MB | 0 |
+|  | large | skip | skip | 117 MB | 0 |
+| Reversible dCDH | single | 131 MB | +18 | 136 MB | +22 |
+| Dose-response | single | 120 MB | +6 | 122 MB | +9 |
+
+The ~115-130 MB floor is the Python + diff-diff + numpy import footprint;
+the "growth during run" column is the practitioner-meaningful number.
+
+### Memory findings
+
+1. **BRFSS `aggregate_survey` is compute-bound, not memory-bound.** Across
+   a 20x data growth (50K → 1M rows), working-memory growth only goes
+   10 → 19 → 23 MB. The tracemalloc pass confirms this: net retained
+   allocation after `aggregate_survey` returns is 0.6 MB, Python traced
+   peak is 84 MB (vs 46 MB input microdata), and the top allocation site
+   is `tracemalloc`'s own `linecache.py` overhead - a smoking gun that
+   nothing else is allocating meaningfully. **The 24-second cost is pure
+   CPU; the function is already memory-efficient.** This strengthens the
+   case for the precompute-scaffolding fix: low-risk, pure CPU win, fits
+   in any deployment environment including 512 MB Lambda.
+
+2. **Staggered CS chain is memory-heavier than wall-clock suggested.** At
+   1,500 units the chain allocates +252 MB Python / +322 MB Rust during
+   the run, pushing peak RSS to 486 MB / 589 MB. Fine for workstations,
+   but close to (Python) or over (Rust) a 512 MB Lambda tier. The CS
+   999-draw bootstrap and ImputationDiD's saturated regression are the
+   plausible drivers. Not a P0 today but worth flagging for future edge /
+   Lambda deployments. Rust grows **more** here (+70 MB at large scale),
+   likely from FFI-held temporary array copies; not worth optimizing.
+
+3. **JK1 replicate path is allocation-heavy at scale.** At 1,000 units ×
+   160 replicates, +127-146 MB growth. Each replicate refit plus the
+   n × n_replicates weight matrix drives this. A Rust port would save
+   both time (0.3-0.4s) and memory (~100 MB) - the dual benefit slightly
+   strengthens the case for the port.
+
+4. **SDiD Rust path is essentially memory-free** (+0-1 MB across scales).
+   Rust does the work in native memory without round-tripping through
+   the Python allocator. Confirms the existing Rust port is well-behaved
+   on both axes.
+
+5. **No scenario hits OOM territory at measured scales.** Maximum peak
+   RSS across the whole sweep is 589 MB (staggered CS large + Rust).
+   1 GB is a comfortable ceiling for every scenario measured.
+
+### Priority of optimization opportunities
+
+| # | Opportunity | Time upside | Memory upside | Risk | Priority |
+|---|---|---|---|---|---|
+| 1 | `aggregate_survey` precompute stratum scaffolding | -15 to -20s at 1M rows | none (already memory-efficient) | Low | **High** |
+| 2 | Rust-port JK1 replicate fit loop | -0.3s at 160 replicates | -100 MB at 160 replicates | Medium | Medium |
+| 3 | dCDH: cache TSL scaffolding across main fit + heterogeneity refit | -0.2s per chain | -20 MB per chain | Low | Low |
+| 4 | ImputationDiD fit-loop vectorization audit | -0.1 to -0.3s at 1,500 units | unknown | Low | Low |
+| 5 | Staggered CS chain working-memory audit (Lambda-oriented) | none | -100+ MB at 1,500 units | Medium | Low |
+
+#1 is the single clearest practitioner win. Everything else is optional
+polish that should be prioritized by actual deployment-environment signal
+(e.g. "our practitioners keep hitting 512 MB Lambda limits on the
+staggered chain" → item 5 moves up).
+
 ### Correctness-adjacent observations (not P0, route separately)
 
 These are developer-ergonomics / API-consistency smells surfaced during
@@ -140,9 +226,9 @@ or in the silent-failures audit; logging here for awareness.
    does not hit this because it uses a 2x2 design where `post` discriminates
    the comparison. Suggest adding a `treat_unit` column alongside `treated`
    for generator output clarity. Route: DGP cleanup, minor.
-3. **`SurveyDesign.replicate_method` case sensitivity.** `"brr"` raises
+3. **`SurveyDesign.replicate_method` case sensitivity.** `"jk1"` raises
    `ValueError("must be one of {'Fay', 'SDR', 'BRR', 'JKn', 'JK1'}")`;
-   `"BRR"` works. Either normalize the input or mention the expected casing
+   `"JK1"` works. Either normalize the input or mention the expected casing
    in the error message. Route: API-ergonomics, minor.
 
 ### What this baseline does not answer
@@ -172,8 +258,9 @@ python benchmarks/speed_review/run_all.py # both backends, all scenarios
 DIFF_DIFF_BACKEND=rust python benchmarks/speed_review/bench_campaign_staggered.py
 ```
 
-Raw JSON and flame HTML are written under `benchmarks/results/` for
-scenario-level diffing as the library evolves.
+Raw JSON is written under `benchmarks/speed_review/baselines/` for
+scenario-level diffing as the library evolves; flame HTMLs are written
+alongside under `baselines/profiles/` (gitignored; regenerated on each run).
 
 ---
 
diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md
index c9afadbe..282c1d65 100644
--- a/docs/performance-scenarios.md
+++ b/docs/performance-scenarios.md
@@ -130,13 +130,15 @@ serves a different purpose: R-parity accuracy). They complement it.
   markets with complex sampling (strata + PSU clusters + unequal weights).
   Needs design-correct SEs or the CI is too narrow.
 - **Data shape (scale sweep).** 12-period quarterly panel, high weight
-  variation, 40 BRR replicate weights. Three scales:
-    - **small** - 200 units, 10 strata × 4 PSUs (Tutorial 17 analog).
-    - **medium** - 500 units, 15 strata × 6 PSUs (typical CPG
-      quarterly brand-tracking wave).
-    - **large** - 1,000 units, 20 strata × 8 PSUs (multi-region brand
-      tracking at scale, e.g. a national awareness study with 50+ sub-
-      markets).
+  variation, JK1 delete-one-PSU replicate weights (replicate count equals
+  the PSU count). Three scales:
+    - **small** - 200 units, 10 strata × 4 PSUs = 40 replicate columns
+      (Tutorial 17 analog).
+    - **medium** - 500 units, 15 strata × 6 PSUs = 90 replicate columns
+      (typical CPG quarterly brand-tracking wave).
+    - **large** - 1,000 units, 20 strata × 8 PSUs = 160 replicate columns
+      (multi-region brand tracking at scale, e.g. a national awareness
+      study with 50+ sub-markets).
 - **Estimator + params.** Two variants in the same script:
   ```python
   # (a) Analytical TSL path
@@ -145,9 +147,10 @@ serves a different purpose: R-parity accuracy). They complement it.
       survey_design=SurveyDesign(weights="w", strata="stratum",
                                  psu="cluster", fpc="fpc"),
   )
-  # (b) Replicate-weight path (BRR-style, ~160 replicate columns)
-  SurveyDesign(weights="w", replicate_weights=[f"rw{i}" for i in range(160)],
-               replicate_method="brr")
+  # (b) Replicate-weight path (JK1 delete-one-PSU weights produced by
+  #     generate_survey_did_data(include_replicate_weights=True))
+  SurveyDesign(weights="w", replicate_weights=rep_cols,
+               replicate_method="JK1")
   ```
 - **Operation chain.** (1) naive `DifferenceInDifferences()` with no survey
   design (for SE-inflation comparison); (2) `SurveyDesign.resolve()`;

From 98a1f3a925944eade94483c4c516f7ea1e747248 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 13:15:38 -0400
Subject: [PATCH 04/15] Close remaining CI review items; automate table
 generation

Addresses the second-round CI review findings:

- P1 false-pass (remaining): removed five phase-local try/except blocks
  that swallowed sub-step exceptions (HonestDiD M-grids in brand-awareness
  and BRFSS, dCDH HonestDiD and heterogeneity refit, dose-response
  dataframe extraction). Exceptions now escape, the phase is marked
  ok=false, and run_scenario's atexit handler exits nonzero. The fix
  caught a real API-usage bug on its first rerun: the dose_response
  extract phase tried to pull the event_study level from a result that
  was fit with aggregate="dose". The event_study fit lives in a dedicated
  phase, so that level is removed from the extraction loop.
- P2 scenario-spec drift: BRFSS scenario text now says pweight TSL
  stage-2 (matching the aggregate_survey-returned design), not "Full
  replicate-weight path"; dCDH reversible scenario text now says
  heterogeneity="group" (matching the script), not "cohort".
- P3 path leakage: tracemalloc output now scrubs $HOME, repo root, and
  site-packages before writing the committed txt.

Drift-prevention layer:

- gen_findings_tables.py reads every JSON baseline and rewrites the
  numerical tables in performance-plan.md between
  <!-- TABLE:start <id> --> / <!-- TABLE:end <id> --> markers. Tables
  now re-derive from data on every rerun, eliminating the hand-edit
  drift the prior review flagged. Narrative prose stays hand-written
  by design, forcing a human re-read of findings when numbers shift.
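
A minimal sketch of the marker-based rewrite described above, under
stated assumptions: the function name `replace_table`, the table id
"timing", and the sample document are hypothetical and are not taken
from gen_findings_tables.py. It only illustrates how text between a
start/end marker pair can be swapped while the surrounding hand-written
prose is left alone.

```python
import re


def replace_table(doc: str, table_id: str, new_table: str) -> str:
    """Swap the text between the start/end markers for table_id (hypothetical helper)."""
    start = f"<!-- TABLE:start {table_id} -->"
    end = f"<!-- TABLE:end {table_id} -->"
    # Non-greedy match so only the span up to this table's own end marker is replaced.
    pattern = re.compile(
        re.escape(start) + r".*?" + re.escape(end), flags=re.DOTALL
    )
    if not pattern.search(doc):
        raise KeyError(f"markers for table {table_id!r} not found")
    block = f"{start}\n{new_table.strip()}\n{end}"
    # A callable replacement avoids backslash/group interpretation in new_table.
    return pattern.sub(lambda _m: block, doc, count=1)


# Illustrative input only; the real document and table ids may differ.
doc = (
    "Intro prose.\n"
    "<!-- TABLE:start timing -->\n"
    "| scale | seconds |\n|---|---:|\n| large | 0.91 |\n"
    "<!-- TABLE:end timing -->\n"
    "Hand-written narrative stays untouched.\n"
)
fresh = "| scale | seconds |\n|---|---:|\n| large | 1.03 |"
updated = replace_table(doc, "timing", fresh)
```

Raising on a missing marker pair, rather than silently appending, keeps a
renamed or deleted table from going stale unnoticed.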

Findings refresh (the numbers moved slightly; three narrative claims
needed updating):

- "Rust marginally slower than Python on JK1 at large scale" -> removed;
  fresh data has Rust and Python within noise on brand awareness at
  large (JK1 phase 0.571s Py / 0.562s Rust, totals 1.03s / 1.04s).
- "ImputationDiD consistently dominant phase at all scales" -> narrowed
  to "dominant under Python; tied with SunAbraham under Rust at large".
- "Nine-figures of MB" in memory finding #3 was a phrasing error
  (literally 100+ TB); corrected to "mid-100s of MB".

Priority of optimization opportunities refreshed against new data:

- #1 aggregate_survey precompute stratum scaffolding: High (unchanged,
  now strongly supported - 24.75s Python / 25.41s Rust at 1M rows,
  ~99.5% of chain runtime, growth only +31-33 MB).
- #2 Staggered CS working-memory audit: Low with explicit bump-trigger
  (Rust large crosses 512 MB Lambda line).
- #5 Rust-port JK1 replicate fit loop: demoted from Medium to Low -
  the "Rust regression to fix" leg of the rationale is gone because
  Rust is no longer slower.

Net: one clear priority (aggregate_survey fix), four optional follow-ups.
Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../brand_awareness_survey_large_python.json  |  22 +-
 .../brand_awareness_survey_large_rust.json    |  22 +-
 .../brand_awareness_survey_medium_python.json |  22 +-
 .../brand_awareness_survey_medium_rust.json   |  22 +-
 .../brand_awareness_survey_small_python.json  |  22 +-
 .../brand_awareness_survey_small_rust.json    |  22 +-
 .../baselines/brfss_panel_large_python.json   |  20 +-
 .../baselines/brfss_panel_large_rust.json     |  20 +-
 .../baselines/brfss_panel_medium_python.json  |  20 +-
 .../baselines/brfss_panel_medium_rust.json    |  20 +-
 .../baselines/brfss_panel_small_python.json   |  20 +-
 .../baselines/brfss_panel_small_rust.json     |  20 +-
 .../campaign_staggered_large_python.json      |  24 +-
 .../campaign_staggered_large_rust.json        |  24 +-
 .../campaign_staggered_medium_python.json     |  24 +-
 .../campaign_staggered_medium_rust.json       |  24 +-
 .../campaign_staggered_small_python.json      |  24 +-
 .../campaign_staggered_small_rust.json        |  24 +-
 .../baselines/dose_response_python.json       |  20 +-
 .../baselines/dose_response_rust.json         |  20 +-
 .../baselines/geo_few_markets_large_rust.json |  20 +-
 .../geo_few_markets_medium_python.json        |  20 +-
 .../geo_few_markets_medium_rust.json          |  20 +-
 .../geo_few_markets_small_python.json         |  20 +-
 .../baselines/geo_few_markets_small_rust.json |  18 +-
 .../mem_profile_brfss_large_rust.txt          |  30 +-
 .../baselines/reversible_dcdh_python.json     |  16 +-
 .../baselines/reversible_dcdh_rust.json       |  16 +-
 .../bench_brand_awareness_survey.py           |   9 +-
 benchmarks/speed_review/bench_brfss_panel.py  |   9 +-
 .../speed_review/bench_dose_response.py       |  13 +-
 .../speed_review/bench_reversible_dcdh.py     |  18 +-
 .../speed_review/gen_findings_tables.py       | 245 ++++++++++++++
 benchmarks/speed_review/mem_profile_brfss.py  |  20 +-
 docs/performance-plan.md                      | 307 +++++++++---------
 docs/performance-scenarios.md                 |  14 +-
 36 files changed, 745 insertions(+), 486 deletions(-)
 create mode 100644 benchmarks/speed_review/gen_findings_tables.py

diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
index e3c427f9..20d55086 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.9061127080000002,
+  "total_seconds": 1.026685333,
   "memory": {
     "available": true,
-    "start_mb": 189.33,
-    "peak_mb": 335.59,
-    "growth_mb": 146.27,
+    "start_mb": 195.45,
+    "peak_mb": 335.25,
+    "growth_mb": 139.8,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.013970540999999947,
+      "seconds": 0.01289908400000006,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03233287500000004,
+      "seconds": 0.030157082999999973,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.44611704099999994,
+      "seconds": 0.5706585830000002,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.21938504199999986,
+      "seconds": 0.21575479099999972,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.04051395800000002,
+      "seconds": 0.039158959000000326,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.016386375000000175,
+      "seconds": 0.02081145899999992,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.13737600000000016,
+      "seconds": 0.13721408299999993,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
index 1caa7f71..83794f77 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.8791467499999999,
+  "total_seconds": 1.03563075,
   "memory": {
     "available": true,
-    "start_mb": 187.97,
-    "peak_mb": 315.2,
-    "growth_mb": 127.23,
+    "start_mb": 194.5,
+    "peak_mb": 338.94,
+    "growth_mb": 144.44,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.013859375000000007,
+      "seconds": 0.01294266700000013,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.02124400000000004,
+      "seconds": 0.031624374999999816,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.42970375000000005,
+      "seconds": 0.5619407919999999,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.2112943330000001,
+      "seconds": 0.263706,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.038379208000000276,
+      "seconds": 0.009824249999999868,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.025571082999999994,
+      "seconds": 0.012082083000000132,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.1390828329999998,
+      "seconds": 0.1435007500000003,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
index 6f1ce897..5fcd8624 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.560188,
+  "total_seconds": 0.763684958,
   "memory": {
     "available": true,
-    "start_mb": 131.69,
-    "peak_mb": 187.8,
-    "growth_mb": 56.11,
+    "start_mb": 135.03,
+    "peak_mb": 193.52,
+    "growth_mb": 58.48,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.011834042000000045,
+      "seconds": 0.01204445799999998,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03354012500000003,
+      "seconds": 0.03723474999999998,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.21381758399999995,
+      "seconds": 0.377466875,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.13717983300000003,
+      "seconds": 0.14207241700000006,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.018324165999999975,
+      "seconds": 0.014263458000000062,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.058137000000000105,
+      "seconds": 0.06729445800000011,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08734375000000005,
+      "seconds": 0.11328508300000006,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
index aadd59a6..c3a54df7 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.5398647089999999,
+  "total_seconds": 0.5770147080000001,
   "memory": {
     "available": true,
-    "start_mb": 133.16,
-    "peak_mb": 185.38,
-    "growth_mb": 52.22,
+    "start_mb": 135.83,
+    "peak_mb": 189.91,
+    "growth_mb": 54.08,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.011500667000000075,
+      "seconds": 0.014149625000000055,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03384820799999999,
+      "seconds": 0.03726633299999993,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.191542875,
+      "seconds": 0.20208641699999985,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.105974083,
+      "seconds": 0.14449316599999995,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.02876208299999994,
+      "seconds": 0.022887250000000137,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.06280441700000017,
+      "seconds": 0.07035533400000005,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.10540583399999992,
+      "seconds": 0.085751833,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
index 57857cb0..c9f8b1bd 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.15974079200000002,
+  "total_seconds": 0.21190179099999995,
   "memory": {
     "available": true,
-    "start_mb": 115.44,
-    "peak_mb": 125.66,
-    "growth_mb": 10.22,
+    "start_mb": 115.28,
+    "peak_mb": 131.0,
+    "growth_mb": 15.72,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0016714159999999811,
+      "seconds": 0.0019152919999999574,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.0061952499999999855,
+      "seconds": 0.007577749999999939,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.018200666000000032,
+      "seconds": 0.02303670899999999,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.02470079199999997,
+      "seconds": 0.052659917000000056,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.008862999999999954,
+      "seconds": 0.009975750000000061,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.024017708000000026,
+      "seconds": 0.027377666999999994,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.07607645800000007,
+      "seconds": 0.08933650000000004,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
index 9a2f2e1f..26fbe778 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.19896133300000007,
+  "total_seconds": 0.19699320799999998,
   "memory": {
     "available": true,
-    "start_mb": 116.34,
-    "peak_mb": 129.73,
-    "growth_mb": 13.39,
+    "start_mb": 115.36,
+    "peak_mb": 130.86,
+    "growth_mb": 15.5,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0019397500000000178,
+      "seconds": 0.0021563339999999265,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.005711999999999939,
+      "seconds": 0.006268332999999959,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.011531958999999925,
+      "seconds": 0.020846708999999963,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.06204845800000003,
+      "seconds": 0.027288292000000047,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.00982324999999995,
+      "seconds": 0.01103129199999997,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.024675957999999998,
+      "seconds": 0.02803533300000005,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08321629100000005,
+      "seconds": 0.10135274999999999,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_python.json b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
index ecaac859..ce3c16d3 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 24.227992,
+  "total_seconds": 24.746183166,
   "memory": {
     "available": true,
-    "start_mb": 395.55,
-    "peak_mb": 418.67,
-    "growth_mb": 23.12,
+    "start_mb": 395.08,
+    "peak_mb": 426.41,
+    "growth_mb": 31.33,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 24.127084915999998,
+      "seconds": 24.637322375,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012080290999996635,
+      "seconds": 0.012604250000002537,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.2080000050550552e-06,
+      "seconds": 3.124999999215561e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0015482499999990296,
+      "seconds": 0.0017230830000016795,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.08696208300000308,
+      "seconds": 0.09404904200000175,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0003088329999982875,
+      "seconds": 0.00046616599999538266,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
index 8b4a4bfe..34adaf24 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 24.333580874999996,
+  "total_seconds": 25.405902042,
   "memory": {
     "available": true,
-    "start_mb": 398.27,
-    "peak_mb": 427.64,
-    "growth_mb": 29.38,
+    "start_mb": 413.62,
+    "peak_mb": 446.14,
+    "growth_mb": 32.52,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 24.260917499999998,
+      "seconds": 25.287192124999997,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012276999999997429,
+      "seconds": 0.014137458000000436,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.416999997478797e-06,
+      "seconds": 2.750000000162345e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0017139999999997713,
+      "seconds": 0.0018409999999988713,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05839145800000267,
+      "seconds": 0.10217112499999814,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0002632919999996375,
+      "seconds": 0.0005494999999982042,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
index 585403fa..42ccb4d1 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 6.076073791000001,
+  "total_seconds": 6.403434375,
   "memory": {
     "available": true,
-    "start_mb": 186.69,
-    "peak_mb": 205.61,
-    "growth_mb": 18.92,
+    "start_mb": 188.7,
+    "peak_mb": 209.17,
+    "growth_mb": 20.47,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 5.992297083999999,
+      "seconds": 6.296972209000001,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012114415999999295,
+      "seconds": 0.012381040999999371,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.584000000638298e-06,
+      "seconds": 3.3749999985843715e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0016258340000003813,
+      "seconds": 0.0018757499999999538,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.06977783399999993,
+      "seconds": 0.09184245799999857,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0002475410000002398,
+      "seconds": 0.0003286660000014763,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
index 2b41998b..13ceba3d 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 6.242714707999999,
+  "total_seconds": 6.479580708,
   "memory": {
     "available": true,
-    "start_mb": 190.41,
-    "peak_mb": 214.03,
-    "growth_mb": 23.62,
+    "start_mb": 193.17,
+    "peak_mb": 223.38,
+    "growth_mb": 30.2,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.1350732919999995,
+      "seconds": 6.371709583000001,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012230709000000672,
+      "seconds": 0.019261959000001383,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.374999999332772e-06,
+      "seconds": 2.4579999990947954e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0015812080000010553,
+      "seconds": 0.0017385829999998492,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.09360750000000095,
+      "seconds": 0.08646266600000097,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00021008400000077643,
+      "seconds": 0.00039937500000064574,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_python.json b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
index 28b59951..9fe54ea4 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.5988546660000003,
+  "total_seconds": 1.6725981669999999,
   "memory": {
     "available": true,
-    "start_mb": 121.19,
-    "peak_mb": 131.47,
-    "growth_mb": 10.28,
+    "start_mb": 121.8,
+    "peak_mb": 133.42,
+    "growth_mb": 11.62,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.492374834,
+      "seconds": 1.594078084,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.01475770799999987,
+      "seconds": 0.014796582999999863,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.2920000000148377e-06,
+      "seconds": 2.457999999982974e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.003596208999999906,
+      "seconds": 0.004001917000000077,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.08774637500000004,
+      "seconds": 0.05947075000000002,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0003601660000001061,
+      "seconds": 0.00023887499999997175,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
index 7319b426..143ae160 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.6259075419999998,
+  "total_seconds": 1.7887954590000001,
   "memory": {
     "available": true,
-    "start_mb": 122.05,
-    "peak_mb": 133.52,
-    "growth_mb": 11.47,
+    "start_mb": 121.34,
+    "peak_mb": 135.81,
+    "growth_mb": 14.47,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.5564257919999998,
+      "seconds": 1.660531708,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.014676375000000075,
+      "seconds": 0.018207874999999873,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.000000000279556e-06,
+      "seconds": 3.5000000000451337e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0035856660000002094,
+      "seconds": 0.004105125000000154,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05117816700000022,
+      "seconds": 0.10564616699999974,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 3.583299999965206e-05,
+      "seconds": 0.0002969579999998473,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
index b198b53c..2cf3456d 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.257323208,
+  "total_seconds": 1.3639277499999998,
   "memory": {
     "available": true,
-    "start_mb": 233.84,
-    "peak_mb": 485.58,
-    "growth_mb": 251.73,
+    "start_mb": 236.25,
+    "peak_mb": 473.64,
+    "growth_mb": 237.39,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.018876499999999963,
+      "seconds": 0.022353332999999864,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.166782875,
+      "seconds": 0.18328262499999992,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.91699999990891e-06,
+      "seconds": 3.5839999998898975e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002431167000000123,
+      "seconds": 0.002633374999999827,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.34071708300000036,
+      "seconds": 0.35018841700000003,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.6446924169999999,
+      "seconds": 0.7134287920000002,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.08378041600000019,
+      "seconds": 0.09199199999999985,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.32080000000623e-05,
+      "seconds": 3.900000000012227e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
index 9146cd5e..d7b62e2d 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.224077334,
+  "total_seconds": 1.566890625,
   "memory": {
     "available": true,
-    "start_mb": 267.02,
-    "peak_mb": 589.03,
-    "growth_mb": 322.02,
+    "start_mb": 264.23,
+    "peak_mb": 575.98,
+    "growth_mb": 311.75,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.017963249999999986,
+      "seconds": 0.019707125000000048,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.16458212500000013,
+      "seconds": 0.17541550000000017,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.5830000002512463e-06,
+      "seconds": 4.4999999997408224e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0024376249999997768,
+      "seconds": 0.002558749999999943,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.40594308300000037,
+      "seconds": 0.48181645799999995,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.5462847499999999,
+      "seconds": 0.5781936660000002,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.086821708,
+      "seconds": 0.309143916,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.604200000006941e-05,
+      "seconds": 4.4332999999951994e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
index 1d14a9cf..ca721a1e 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.7493203750000001,
+  "total_seconds": 0.8150147909999998,
   "memory": {
     "available": true,
-    "start_mb": 144.84,
-    "peak_mb": 225.83,
-    "growth_mb": 80.98,
+    "start_mb": 148.19,
+    "peak_mb": 232.55,
+    "growth_mb": 84.36,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.012219124999999886,
+      "seconds": 0.014109125000000056,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.09556741599999996,
+      "seconds": 0.10445108399999992,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.2080000000878073e-06,
+      "seconds": 3.208999999948503e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002560125000000024,
+      "seconds": 0.002572291999999976,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.28517420799999993,
+      "seconds": 0.35729070800000007,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.30205854199999993,
+      "seconds": 0.2793956659999999,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.05169025000000005,
+      "seconds": 0.05713354200000009,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.579200000003446e-05,
+      "seconds": 4.387499999980449e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
index a885051a..4c49138a 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.7148772079999999,
+  "total_seconds": 0.789889083,
   "memory": {
     "available": true,
-    "start_mb": 154.66,
-    "peak_mb": 263.2,
-    "growth_mb": 108.55,
+    "start_mb": 155.88,
+    "peak_mb": 263.52,
+    "growth_mb": 107.64,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.0129322919999999,
+      "seconds": 0.013083167000000007,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.09786516600000006,
+      "seconds": 0.10005033299999999,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.333999999854953e-06,
+      "seconds": 3.5830000000292017e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002447500000000158,
+      "seconds": 0.002644124999999997,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.30165525000000004,
+      "seconds": 0.390922083,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.24698387500000019,
+      "seconds": 0.22787454100000004,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.05294829199999995,
+      "seconds": 0.05525204100000014,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.641599999992806e-05,
+      "seconds": 5.15839999999379e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
index c66f4316..5bdf30ec 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.4984247500000001,
+  "total_seconds": 0.52634175,
   "memory": {
     "available": true,
-    "start_mb": 114.47,
-    "peak_mb": 140.62,
-    "growth_mb": 26.16,
+    "start_mb": 114.67,
+    "peak_mb": 141.31,
+    "growth_mb": 26.64,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.007680332999999928,
+      "seconds": 0.00877829200000002,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06272016599999997,
+      "seconds": 0.06917316700000009,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.7499999999403e-06,
+      "seconds": 2.8340000000071086e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.008486915999999955,
+      "seconds": 0.007260041999999967,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.17097429100000006,
+      "seconds": 0.19025404200000007,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.2192437909999999,
+      "seconds": 0.2173090419999999,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.02927862499999989,
+      "seconds": 0.03352058300000005,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.233299999982897e-05,
+      "seconds": 3.829200000016186e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
index 1a5578f0..e9fef367 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.496599209,
+  "total_seconds": 0.502302,
   "memory": {
     "available": true,
-    "start_mb": 114.38,
-    "peak_mb": 147.39,
-    "growth_mb": 33.02,
+    "start_mb": 114.81,
+    "peak_mb": 148.31,
+    "growth_mb": 33.5,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.006995290999999959,
+      "seconds": 0.00718679099999997,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06457049999999998,
+      "seconds": 0.06477549999999999,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.8749999999577724e-06,
+      "seconds": 3.7079999999356517e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.004752999999999896,
+      "seconds": 0.004685166000000018,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.13871820899999998,
+      "seconds": 0.14403383400000003,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.250811958,
+      "seconds": 0.2496609999999999,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.030704207999999955,
+      "seconds": 0.031909707999999926,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.8041999999904874e-05,
+      "seconds": 4.058300000009396e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_python.json b/benchmarks/speed_review/baselines/dose_response_python.json
index 1abbf1f8..2e26b0c6 100644
--- a/benchmarks/speed_review/baselines/dose_response_python.json
+++ b/benchmarks/speed_review/baselines/dose_response_python.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.589667292,
+  "total_seconds": 0.6016350409999999,
   "memory": {
     "available": true,
-    "start_mb": 113.75,
-    "peak_mb": 120.16,
-    "growth_mb": 6.41,
+    "start_mb": 114.0,
+    "peak_mb": 120.59,
+    "growth_mb": 6.59,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.15124875000000004,
+      "seconds": 0.154083125,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007334169999999585,
+      "seconds": 0.0007949159999999234,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.14819204099999994,
+      "seconds": 0.15003725,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0014475419999999684,
+      "seconds": 0.001479000000000008,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14363024999999996,
+      "seconds": 0.14721650000000008,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14441004200000007,
+      "seconds": 0.14802054200000003,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_rust.json b/benchmarks/speed_review/baselines/dose_response_rust.json
index 25e8e5af..143e8f46 100644
--- a/benchmarks/speed_review/baselines/dose_response_rust.json
+++ b/benchmarks/speed_review/baselines/dose_response_rust.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.589171666,
+  "total_seconds": 0.6095052919999999,
   "memory": {
     "available": true,
-    "start_mb": 113.69,
-    "peak_mb": 122.39,
-    "growth_mb": 8.7,
+    "start_mb": 113.77,
+    "peak_mb": 123.08,
+    "growth_mb": 9.31,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.15004549999999994,
+      "seconds": 0.15594399999999997,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007515420000000494,
+      "seconds": 0.0008133339999999434,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.14819683299999997,
+      "seconds": 0.15459916699999998,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0015388330000000172,
+      "seconds": 0.0017282920000000201,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.1435805000000001,
+      "seconds": 0.14681695900000014,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14505404099999986,
+      "seconds": 0.14959908300000002,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
index 3a8be549..d601f042 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.259593333,
+  "total_seconds": 0.26634891699999985,
   "memory": {
     "available": true,
-    "start_mb": 117.12,
-    "peak_mb": 117.45,
-    "growth_mb": 0.33,
+    "start_mb": 118.05,
+    "peak_mb": 118.41,
+    "growth_mb": 0.36,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.04071524999999998,
+      "seconds": 0.0427798330000001,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.03718341699999994,
+      "seconds": 0.038390874999999935,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.07727700000000004,
+      "seconds": 0.07815391699999985,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0006957920000000284,
+      "seconds": 0.0006825000000001413,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.10368924999999996,
+      "seconds": 0.10630537500000004,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.9124999999963208e-05,
+      "seconds": 3.249999999987985e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
index 9ab55a0e..13047b5f 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 4.028322917,
+  "total_seconds": 4.089486916000001,
   "memory": {
     "available": true,
-    "start_mb": 143.75,
-    "peak_mb": 152.25,
-    "growth_mb": 8.5,
+    "start_mb": 143.0,
+    "peak_mb": 151.78,
+    "growth_mb": 8.78,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.36179233399999955,
+      "seconds": 0.37269037500000035,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.36836100000000016,
+      "seconds": 0.36910237499999976,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.5678719169999997,
+      "seconds": 1.5911891659999995,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0007424160000004676,
+      "seconds": 0.000975166999999999,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.7295259170000001,
+      "seconds": 1.7554996660000004,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.4790999999524388e-05,
+      "seconds": 2.533400000004349e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
index e0845d6f..4fd4759d 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.11668558299999998,
+  "total_seconds": 0.12066608300000004,
   "memory": {
     "available": true,
-    "start_mb": 116.48,
-    "peak_mb": 116.86,
-    "growth_mb": 0.38,
+    "start_mb": 116.27,
+    "peak_mb": 116.7,
+    "growth_mb": 0.44,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.020445540999999956,
+      "seconds": 0.02115270800000002,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.022912875000000055,
+      "seconds": 0.023816833000000037,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.024874541999999944,
+      "seconds": 0.025457375000000004,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0005995829999999591,
+      "seconds": 0.000645875000000018,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.04782845799999991,
+      "seconds": 0.04956016600000002,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.070799999998041e-05,
+      "seconds": 2.7333999999989977e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
index 1d64a335..f0790f93 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 3.739117167,
+  "total_seconds": 3.931368708,
   "memory": {
     "available": true,
-    "start_mb": 113.66,
-    "peak_mb": 123.55,
-    "growth_mb": 9.89,
+    "start_mb": 114.03,
+    "peak_mb": 124.31,
+    "growth_mb": 10.28,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.5992219579999999,
+      "seconds": 0.568860417,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.5961615419999999,
+      "seconds": 0.8022329579999998,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.1918989170000003,
+      "seconds": 1.204978708,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0009045000000003078,
+      "seconds": 0.001044000000000267,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.3508564999999995,
+      "seconds": 1.3541902500000003,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 6.941599999965575e-05,
+      "seconds": 5.620800000016857e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
index ed45fd16..b5bef931 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.039840292000000055,
+  "total_seconds": 0.04143637499999997,
   "memory": {
     "available": true,
-    "start_mb": 113.7,
+    "start_mb": 113.59,
     "peak_mb": 115.17,
-    "growth_mb": 1.47,
+    "growth_mb": 1.58,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.007603000000000026,
+      "seconds": 0.007861707999999967,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.012722500000000081,
+      "seconds": 0.012826667000000014,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.008069333000000012,
+      "seconds": 0.008510917000000062,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0008005829999999658,
+      "seconds": 0.0010455000000000325,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.010622041999999943,
+      "seconds": 0.011159249999999954,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 1.937499999993264e-05,
+      "seconds": 2.7208000000000787e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt b/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt
index a5748823..8696a700 100644
--- a/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt
+++ b/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt
@@ -12,18 +12,18 @@
 # top 15 allocation sites by size delta
 #      size diff (MB)   count diff  location
 --------------------------------------------------------------------------------
-1                0.15         1521  /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/linecache.py:148
-2                0.04            7  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/internals/blocks.py:822
-3                0.02          440  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/sorting.py:637
-4                0.01           63  /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/abc.py:123
-5                0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/frame.py:12710
-6                0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/frame.py:698
-7                0.00           16  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py:5372
-8                0.00           51  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py:427
-9                0.00           55  /Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/sre_parse.py:529
-10               0.00           43  /Users/igerber/diff-diff-perf-review/diff_diff/prep.py:1618
-11               0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/internals/construction.py:237
-12               0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/groupby/grouper.py:846
-13               0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/construction.py:517
-14               0.00            2  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py:475
-15               0.00           35  /Users/igerber/diff-diff-perf-review/.venv/lib/python3.9/site-packages/pandas/core/internals/managers.py:1500
+1                0.15         1521  <site-packages>/lib/python3.9/linecache.py:148
+2                0.04            7  <repo>/.venv/lib/python3.9/site-packages/pandas/core/internals/blocks.py:822
+3                0.02          440  <repo>/.venv/lib/python3.9/site-packages/pandas/core/sorting.py:637
+4                0.01           63  <site-packages>/lib/python3.9/abc.py:123
+5                0.00            2  <repo>/.venv/lib/python3.9/site-packages/pandas/core/frame.py:12710
+6                0.00            2  <repo>/.venv/lib/python3.9/site-packages/pandas/core/frame.py:698
+7                0.00           16  <repo>/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py:5372
+8                0.00           51  <repo>/.venv/lib/python3.9/site-packages/pandas/core/groupby/ops.py:427
+9                0.00           55  <site-packages>/lib/python3.9/sre_parse.py:529
+10               0.00           43  <repo>/diff_diff/prep.py:1618
+11               0.00            2  <repo>/.venv/lib/python3.9/site-packages/pandas/core/internals/construction.py:237
+12               0.00            2  <repo>/.venv/lib/python3.9/site-packages/pandas/core/groupby/grouper.py:846
+13               0.00            2  <repo>/.venv/lib/python3.9/site-packages/pandas/core/construction.py:517
+14               0.00            2  <repo>/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py:475
+15               0.00           35  <repo>/.venv/lib/python3.9/site-packages/pandas/core/internals/managers.py:1500
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_python.json b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
index 75b32efe..7bbf41df 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_python.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.618717375,
+  "total_seconds": 0.591051,
   "memory": {
     "available": true,
-    "start_mb": 113.59,
-    "peak_mb": 131.45,
-    "growth_mb": 17.86,
+    "start_mb": 113.36,
+    "peak_mb": 132.27,
+    "growth_mb": 18.91,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.32684875,
+      "seconds": 0.36771666699999994,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.4170000000035543e-06,
+      "seconds": 1.5420000000210266e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.004943374999999972,
+      "seconds": 0.0036219160000000583,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.28691199999999994,
+      "seconds": 0.21970824999999983,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
index b72b9f34..ded7cebe 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.6100237500000001,
+  "total_seconds": 0.5458365,
   "memory": {
     "available": true,
-    "start_mb": 114.17,
-    "peak_mb": 135.97,
-    "growth_mb": 21.8,
+    "start_mb": 113.38,
+    "peak_mb": 134.55,
+    "growth_mb": 21.17,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.334349,
+      "seconds": 0.33496675,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.208000000030296e-06,
+      "seconds": 1.3750000000811724e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.003726332999999915,
+      "seconds": 0.005485000000000073,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.271937125,
+      "seconds": 0.20537233300000002,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/bench_brand_awareness_survey.py b/benchmarks/speed_review/bench_brand_awareness_survey.py
index 93758392..919a6d89 100644
--- a/benchmarks/speed_review/bench_brand_awareness_survey.py
+++ b/benchmarks/speed_review/bench_brand_awareness_survey.py
@@ -129,12 +129,9 @@ def honest_did_grid():
         results["event_study"] = es_result
         out = {}
         for M in (0.5, 1.0, 1.5):
-            try:
-                out[M] = compute_honest_did(
-                    es_result, method="relative_magnitude", M=M,
-                )
-            except Exception as e:
-                out[M] = f"{type(e).__name__}: {e}"
+            out[M] = compute_honest_did(
+                es_result, method="relative_magnitude", M=M,
+            )
         results["honest"] = out
 
     return [
diff --git a/benchmarks/speed_review/bench_brfss_panel.py b/benchmarks/speed_review/bench_brfss_panel.py
index 1f5a473b..412401c8 100644
--- a/benchmarks/speed_review/bench_brfss_panel.py
+++ b/benchmarks/speed_review/bench_brfss_panel.py
@@ -105,12 +105,9 @@ def inspect_pretrends():
     def honest_grid():
         out = {}
         for M in (0.5, 1.0, 1.5):
-            try:
-                out[M] = compute_honest_did(
-                    results["cs"], method="relative_magnitude", M=M,
-                )
-            except Exception as e:
-                out[M] = f"{type(e).__name__}: {e}"
+            out[M] = compute_honest_did(
+                results["cs"], method="relative_magnitude", M=M,
+            )
         results["honest"] = out
 
     def sun_abraham():
diff --git a/benchmarks/speed_review/bench_dose_response.py b/benchmarks/speed_review/bench_dose_response.py
index 9ae46094..f0c86384 100644
--- a/benchmarks/speed_review/bench_dose_response.py
+++ b/benchmarks/speed_review/bench_dose_response.py
@@ -46,13 +46,16 @@ def cdid_cubic_fit():
         results["cubic"] = cdid.fit(**fit_kwargs, aggregate="dose")
 
     def extract_curves():
+        # The cubic fit used aggregate="dose", so only dose-response and
+        # group-time levels are available on the result. Event-study is
+        # extracted separately in the dedicated pretrend phase below.
+        # NB: ContinuousDiD uses 'eventstudy' for fit(aggregate=...) but
+        # 'event_study' for to_dataframe(level=...). Two different
+        # spellings within one estimator - flagged in performance-plan.md.
         r = results["cubic"]
         out = {}
-        for level in ("dose_response", "group_time", "event_study"):
-            try:
-                out[level] = r.to_dataframe(level=level)
-            except Exception as e:
-                out[level] = f"{type(e).__name__}: {e}"
+        for level in ("dose_response", "group_time"):
+            out[level] = r.to_dataframe(level=level)
         results["curves"] = out
 
     def cdid_event_study():
diff --git a/benchmarks/speed_review/bench_reversible_dcdh.py b/benchmarks/speed_review/bench_reversible_dcdh.py
index b9c6499c..80bd63ca 100644
--- a/benchmarks/speed_review/bench_reversible_dcdh.py
+++ b/benchmarks/speed_review/bench_reversible_dcdh.py
@@ -77,22 +77,16 @@ def inspect_placebo():
     def honest_placebo():
         out = {}
         for M in (0.5, 1.0, 1.5):
-            try:
-                out[M] = compute_honest_did(
-                    results["dcdh"], method="relative_magnitude", M=M,
-                )
-            except Exception as e:
-                out[M] = f"{type(e).__name__}: {e}"
+            out[M] = compute_honest_did(
+                results["dcdh"], method="relative_magnitude", M=M,
+            )
         results["honest"] = out
 
     def heterogeneity_refit():
         est = ChaisemartinDHaultfoeuille(seed=123)
-        try:
-            results["het"] = est.fit(
-                **fit_kwargs, L_max=3, heterogeneity="group",
-            )
-        except (NotImplementedError, ValueError) as e:
-            results["het"] = f"{type(e).__name__}: {e}"
+        results["het"] = est.fit(
+            **fit_kwargs, L_max=3, heterogeneity="group",
+        )
 
     phases = [
         ("1_dcdh_fit_Lmax3_survey_TSL", dcdh_fit_lmax3),
diff --git a/benchmarks/speed_review/gen_findings_tables.py b/benchmarks/speed_review/gen_findings_tables.py
new file mode 100644
index 00000000..9cf5c6d6
--- /dev/null
+++ b/benchmarks/speed_review/gen_findings_tables.py
@@ -0,0 +1,245 @@
+#!/usr/bin/env python3
+"""
+Regenerate the numerical tables in ``docs/performance-plan.md`` from the
+committed JSON baselines under ``benchmarks/speed_review/baselines/``.
+
+Each auto-generated table is bounded by a pair of HTML-comment markers in
+the target markdown file:
+
+    <!-- TABLE:start <table-id> -->
+    ... (rendered table body lives here; overwritten on every run) ...
+    <!-- TABLE:end <table-id> -->
+
+Run this after any benchmark rerun; the doc tables then re-derive exactly
+from the JSON baselines, removing the possibility of hand-edit drift.
+
+Tables owned by this generator:
+  - scale_sweep_totals        end-to-end wall-clock per scenario + scale
+  - memory_by_scenario        peak RSS + growth per scenario + scale
+  - top_phases_by_scenario    largest-scale phase-level timing ranking
+
+Narrative prose in the doc is hand-written and not touched. If numerical
+claims in narrative drift from the regenerated tables, the reviewer must
+update the narrative manually - by design, to force a human read of the
+findings whenever numbers shift meaningfully.
+"""
+
+import json
+import re
+from pathlib import Path
+from textwrap import dedent
+
+HERE = Path(__file__).resolve().parent
+BASELINES = HERE / "baselines"
+PLAN_MD = HERE.parent.parent / "docs" / "performance-plan.md"
+
+SCALE_ORDER = ("small", "medium", "large")
+MULTI_SCALE = (
+    "campaign_staggered",
+    "brand_awareness_survey",
+    "brfss_panel",
+    "geo_few_markets",
+)
+SINGLE_SCALE = ("reversible_dcdh", "dose_response")
+
+SCENARIO_DISPLAY = {
+    "campaign_staggered":     "1. Staggered campaign",
+    "brand_awareness_survey": "2. Brand awareness survey",
+    "brfss_panel":            "3. BRFSS microdata -> CS panel",
+    "geo_few_markets":        "4. SDiD few markets",
+    "reversible_dcdh":        "5. Reversible dCDH",
+    "dose_response":          "6. Pricing dose-response",
+}
+
+
+def load(scenario, scale, backend):
+    if scale is None:
+        path = BASELINES / f"{scenario}_{backend}.json"
+    else:
+        path = BASELINES / f"{scenario}_{scale}_{backend}.json"
+    if not path.exists():
+        return None
+    return json.loads(path.read_text())
+
+
+def fmt_secs(x):
+    return f"{x:.2f}" if x is not None else "skip"
+
+
+def fmt_mb(x):
+    return f"{x:.0f}" if x is not None else "skip"
+
+
+def render_scale_sweep_totals():
+    rows = [
+        "| Scenario | Scale | Python (s) | Rust (s) | Py/Rust |",
+        "|---|---|---:|---:|---:|",
+    ]
+    for scen in MULTI_SCALE:
+        display = SCENARIO_DISPLAY[scen]
+        first = True
+        for scale in SCALE_ORDER:
+            py = load(scen, scale, "python")
+            rs = load(scen, scale, "rust")
+            py_t = py["total_seconds"] if py else None
+            rs_t = rs["total_seconds"] if rs else None
+            ratio = (
+                f"{py_t/rs_t:.1f}x"
+                if (py_t is not None and rs_t is not None and rs_t > 0)
+                else "-"
+            )
+            name_col = display if first else ""
+            first = False
+            rows.append(
+                f"| {name_col} | {scale} | "
+                f"{fmt_secs(py_t)} | {fmt_secs(rs_t)} | {ratio} |"
+            )
+    for scen in SINGLE_SCALE:
+        display = SCENARIO_DISPLAY[scen]
+        py = load(scen, None, "python")
+        rs = load(scen, None, "rust")
+        py_t = py["total_seconds"] if py else None
+        rs_t = rs["total_seconds"] if rs else None
+        ratio = (
+            f"{py_t/rs_t:.1f}x"
+            if (py_t is not None and rs_t is not None and rs_t > 0)
+            else "-"
+        )
+        rows.append(
+            f"| {display} | single | "
+            f"{fmt_secs(py_t)} | {fmt_secs(rs_t)} | {ratio} |"
+        )
+    return "\n".join(rows)
+
+
+def render_memory_by_scenario():
+    rows = [
+        "| Scenario | Scale | Py peak RSS (MB) | Py growth (MB) | "
+        "Rust peak RSS (MB) | Rust growth (MB) |",
+        "|---|---|---:|---:|---:|---:|",
+    ]
+    for scen in MULTI_SCALE:
+        display = SCENARIO_DISPLAY[scen]
+        first = True
+        for scale in SCALE_ORDER:
+            py = load(scen, scale, "python")
+            rs = load(scen, scale, "rust")
+            py_peak = py["memory"]["peak_mb"] if py else None
+            py_growth = py["memory"]["growth_mb"] if py else None
+            rs_peak = rs["memory"]["peak_mb"] if rs else None
+            rs_growth = rs["memory"]["growth_mb"] if rs else None
+            name_col = display if first else ""
+            first = False
+            rows.append(
+                f"| {name_col} | {scale} | "
+                f"{fmt_mb(py_peak)} | {fmt_mb(py_growth)} | "
+                f"{fmt_mb(rs_peak)} | {fmt_mb(rs_growth)} |"
+            )
+    for scen in SINGLE_SCALE:
+        display = SCENARIO_DISPLAY[scen]
+        py = load(scen, None, "python")
+        rs = load(scen, None, "rust")
+        py_peak = py["memory"]["peak_mb"] if py else None
+        py_growth = py["memory"]["growth_mb"] if py else None
+        rs_peak = rs["memory"]["peak_mb"] if rs else None
+        rs_growth = rs["memory"]["growth_mb"] if rs else None
+        rows.append(
+            f"| {display} | single | "
+            f"{fmt_mb(py_peak)} | {fmt_mb(py_growth)} | "
+            f"{fmt_mb(rs_peak)} | {fmt_mb(rs_growth)} |"
+        )
+    return "\n".join(rows)
+
+
+def render_top_phases_by_scenario():
+    """Top-3 phases by time at largest scale, for both backends."""
+    rows = [
+        "| Scenario | Scale | Backend | Top phase (%) "
+        "| 2nd phase (%) | 3rd phase (%) |",
+        "|---|---|---|---|---|---|",
+    ]
+
+    def phase_rank(record, n=3):
+        if record is None:
+            return []
+        total = record["total_seconds"]
+        phases = sorted(
+            record["phases"].items(),
+            key=lambda kv: -kv[1]["seconds"],
+        )
+        out = []
+        for label, info in phases[:n]:
+            pct = 100 * info["seconds"] / total if total > 0 else 0
+            out.append(f"`{label}` ({pct:.0f}%)")
+        while len(out) < n:
+            out.append("-")
+        return out
+
+    for scen in MULTI_SCALE:
+        display = SCENARIO_DISPLAY[scen]
+        scale = SCALE_ORDER[-1]  # largest
+        for backend in ("python", "rust"):
+            rec = load(scen, scale, backend)
+            top = phase_rank(rec)
+            if not top:
+                continue
+            rows.append(
+                f"| {display} | {scale} | {backend} | "
+                f"{top[0]} | {top[1]} | {top[2]} |"
+            )
+    for scen in SINGLE_SCALE:
+        display = SCENARIO_DISPLAY[scen]
+        for backend in ("python", "rust"):
+            rec = load(scen, None, backend)
+            top = phase_rank(rec)
+            if not top:
+                continue
+            rows.append(
+                f"| {display} | single | {backend} | "
+                f"{top[0]} | {top[1]} | {top[2]} |"
+            )
+    return "\n".join(rows)
+
+
+TABLES = {
+    "scale_sweep_totals": render_scale_sweep_totals,
+    "memory_by_scenario": render_memory_by_scenario,
+    "top_phases_by_scenario": render_top_phases_by_scenario,
+}
+
+
+def update_markdown(path):
+    text = path.read_text()
+    for table_id, renderer in TABLES.items():
+        body = renderer()
+        pattern = re.compile(
+            rf"(<!-- TABLE:start {re.escape(table_id)} -->)"
+            rf".*?"
+            rf"(<!-- TABLE:end {re.escape(table_id)} -->)",
+            re.DOTALL,
+        )
+        replacement = lambda m: f"{m.group(1)}\n{body}\n{m.group(2)}"
+        new_text, n = pattern.subn(replacement, text)
+        if n == 0:
+            raise RuntimeError(
+                f"No marker pair found for table '{table_id}' in {path}."
+                f" Add <!-- TABLE:start {table_id} --> ..."
+                f" <!-- TABLE:end {table_id} --> to the document first."
+            )
+        if n > 1:
+            raise RuntimeError(
+                f"Multiple marker pairs for '{table_id}' in {path}."
+            )
+        text = new_text
+    path.write_text(text)
+
+
+def main():
+    update_markdown(PLAN_MD)
+    print(f"regenerated tables in {PLAN_MD.relative_to(PLAN_MD.parents[1])}")
+    for k in TABLES:
+        print(f"  - {k}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/benchmarks/speed_review/mem_profile_brfss.py b/benchmarks/speed_review/mem_profile_brfss.py
index fd7127ab..36e2901b 100644
--- a/benchmarks/speed_review/mem_profile_brfss.py
+++ b/benchmarks/speed_review/mem_profile_brfss.py
@@ -104,8 +104,26 @@ def main():
         f"{'#':<4} {'size diff (MB)':>16} {'count diff':>12}  location",
         f"{'-'*80}",
     ]
+    # Scrub workstation-specific absolute paths before committing output
+    # (keeps the file reproducible and avoids leaking $HOME / system paths).
+    import site, sys as _sys
+    home = str(Path.home())
+    sys_paths = sorted(
+        {p for p in (site.getsitepackages() + [site.getusersitepackages()])
+         if p} | {_sys.prefix, _sys.base_prefix},
+        key=len, reverse=True,
+    )
+    repo_root = str(Path(__file__).resolve().parents[2])
+
+    def _scrub(s):
+        s = s.replace(repo_root, "<repo>")
+        for sp in sys_paths:
+            s = s.replace(sp, "<site-packages>")
+        s = s.replace(home, "$HOME")
+        return s
+
     for i, s in enumerate(stats[:args.top], 1):
-        loc = str(s.traceback).split("\n")[0]
+        loc = _scrub(str(s.traceback).split("\n")[0])
         lines.append(
             f"{i:<4} {s.size_diff/1024/1024:>16.2f} {s.count_diff:>12d}  {loc}"
         )
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index d5da7c9d..30bc12b2 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -20,6 +20,9 @@ writeups. The six scenarios are defined in
 Environment: macOS darwin 25.3 on Apple Silicon M4, Python 3.9,
 numpy 2.x, diff_diff 3.1.3. Each multi-scale scenario runs at three data
 scales under both `DIFF_DIFF_BACKEND=python` and `DIFF_DIFF_BACKEND=rust`.
+The numerical tables below are auto-generated from the committed JSON
+baselines by `benchmarks/speed_review/gen_findings_tables.py`; narrative
+prose is hand-written and must be re-read when numbers shift.
 
 ### Scale sweep - end-to-end wall-clock
 
@@ -29,104 +32,108 @@ practitioner workloads; large stretches toward the upper end of what an
 analyst might bring (1M-row BRFSS microdata, 1,500-unit county-level
 staggered panel, 1,000-unit multi-region brand survey, 500-unit zip-level
 geo-experiment). Dose-response and reversible-dCDH run at a single mid-range
-scale.
-
-| Scenario | Scale | Data shape | Python (s) | Rust (s) | Py/Rust |
-|---|---|---|---:|---:|---:|
-| **1. Staggered campaign** | small | 150 units × 26 periods | 0.48 | 0.49 | 1.0x |
-| (CS + 8-step chain, bootstrap 999) | medium | 500 units × 26 periods | 0.72 | 0.87 | 0.8x |
-|  | large | 1,500 units × 26 periods | 1.24 | 1.22 | 1.0x |
-| **2. Brand awareness survey** | small | 200 units × 12 periods, 40 JK1 reps | 0.21 | 0.20 | 1.0x |
-| (DiD + SurveyDesign + JK1 replicate weights) | medium | 500 units × 12 periods, 90 JK1 reps | 0.52 | 0.46 | 1.1x |
-|  | large | 1,000 units × 12 periods, 160 JK1 reps | 0.83 | 0.92 | 0.9x |
-| **3. BRFSS microdata → CS panel** | small | 50K rows → 500 cells | 1.59 | 1.61 | 1.0x |
-| (`aggregate_survey` + CS + HonestDiD) | medium | 250K rows → 500 cells | 6.11 | 6.20 | 1.0x |
-|  | large | **1M rows → 500 cells** | **23.96** | **24.37** | **1.0x** |
-| **4. Geo-experiment few markets (SDiD)** | small | 80 units, 5 treated | 3.05 | 0.04 | **76x** |
-| (jackknife + bootstrap + sensitivity chain) | medium | 200 units, 15 treated | 3.65 | 0.12 | **31x** |
-|  | large | 500 units, 30 treated | skip | 0.26 | - |
-| 5. Reversible dCDH (L_max=3 + TSL) | (single) | 120 groups × 10 periods | 0.55 | 0.53 | 1.0x |
-| 6. Pricing dose-response (CDiD spline) | (single) | 500 units × 6 periods | 0.58 | 0.59 | 1.0x |
+scale. Data-shape details are in `docs/performance-scenarios.md`.
+
+<!-- TABLE:start scale_sweep_totals -->
+| Scenario | Scale | Python (s) | Rust (s) | Py/Rust |
+|---|---|---:|---:|---:|
+| 1. Staggered campaign | small | 0.53 | 0.50 | 1.0x |
+|  | medium | 0.82 | 0.79 | 1.0x |
+|  | large | 1.36 | 1.57 | 0.9x |
+| 2. Brand awareness survey | small | 0.21 | 0.20 | 1.1x |
+|  | medium | 0.76 | 0.58 | 1.3x |
+|  | large | 1.03 | 1.04 | 1.0x |
+| 3. BRFSS microdata -> CS panel | small | 1.67 | 1.79 | 0.9x |
+|  | medium | 6.40 | 6.48 | 1.0x |
+|  | large | 24.75 | 25.41 | 1.0x |
+| 4. SDiD few markets | small | 3.93 | 0.04 | 94.9x |
+|  | medium | 4.09 | 0.12 | 33.9x |
+|  | large | skip | 0.27 | - |
+| 5. Reversible dCDH | single | 0.59 | 0.55 | 1.1x |
+| 6. Pricing dose-response | single | 0.60 | 0.61 | 1.0x |
+<!-- TABLE:end scale_sweep_totals -->
 
 ### Scaling findings
 
-**Three findings invert at large scale relative to the tutorial-scale pass:**
-
-1. **BRFSS `aggregate_survey` becomes the dominant practitioner pain point.**
-   Scales near-linearly with microdata row count - 50K → 1M rows (20x)
-   costs 15x runtime (1.5s → 24s). At 1M rows, 97% of runtime is inside
-   `_compute_stratified_psu_meat`, called once per output cell. This is a
-   concrete 20-second cost hit on any realistic pooled multi-year BRFSS
-   study, and Rust does not touch it (aggregate_survey is entirely Python).
-2. **Staggered CS chain remains cheap across scales.** 150 → 1,500 units
-   (10x) increases total by only 2.6x (0.48s → 1.24s). ImputationDiD stays
-   the dominant phase (46-62%) but scales well; absolute time at
-   practitioner scale is still under 1 second.
-3. **SDiD Rust gap is stable, not emergent.** Python SDiD at 80 units is
-   already 3 seconds; at 200 units it is 3.7 seconds. The cost is
-   dominated by fixed-overhead-per-jackknife-refit rather than data size;
-   Rust stays sub-second through 500 units. The 76x headline at small scale
-   is driven by Python having ~3s of baseline cost, not by bad scaling.
+**Three findings are load-bearing for the optimization priority list:**
+
+1. **BRFSS `aggregate_survey` is the dominant practitioner pain point at
+   realistic pooled-multi-year scale.** Scales near-linearly with microdata
+   row count. At 1M rows (roughly what a 10-year pooled BRFSS analysis
+   looks like) the full chain takes ~25 seconds and essentially all of it
+   is inside `_compute_stratified_psu_meat`. Rust does not touch it
+   (`aggregate_survey` is entirely Python).
+2. **Staggered CS chain stays cheap across scales.** A 10x unit increase
+   (150 -> 1,500) is a low single-digit multiplier on total time.
+   ImputationDiD is consistently the dominant phase but scales well.
+3. **SDiD Rust gap is stable across scales, not emergent.** Python SDiD
+   has a fixed per-jackknife-refit overhead that dominates even at small
+   n. Rust stays sub-second through 500 units.
 
 **Two findings hold across scales:**
 
-4. Brand-awareness survey chain scales roughly linearly in n_units
-   (0.21s → 0.83s for a 5x unit increase); the JK1 replicate-weight path
-   itself scales closer to n_units × n_replicates (40 → 160 replicates
-   across the sweep), becoming the dominant phase at large scale.
-5. Rust backend gives measurable uplift only for SDiD; for everything else
+4. Brand-awareness survey total scales roughly linearly in n_units, but
+   the JK1 replicate path inside it scales closer to
+   n_units x n_replicates - faster growth than the chain total, so it
+   increasingly dominates at large n.
+5. Rust backend gives measurable uplift only for SDiD; everywhere else
    backend choice is within noise because the bottlenecks are in Python
-   (`aggregate_survey`) or already well-vectorized (CS bootstrap, ImputationDiD,
-   Survey TSL/replicate).
-
-### Top hotspots ranked by total-time contribution (at largest measured scale)
-
-| # | Location | Scenario + scale | Time | Recommended action |
-|---|---|---|---:|---|
-| 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS @ 1M rows | **20.7s self + 22.6s inclusive** | **Algorithmic fix, raised priority.** Function called once per (state, year) cell (500 calls); per-call work scales with cell size and rebuilds stratum-PSU scaffolding every time. Precompute stratum indexes once at `aggregate_survey` top-level and reuse. Upside at practitioner scale is now 15-20 seconds, not 1.5 seconds. |
-| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | 0.66s (53%) | **Investigate if/when BRFSS fix lands.** Stayed the dominant phase across scales but total chain is ~1.2s at large - not P0. Still a candidate follow-up once the higher-value fix is in. |
-| 3 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | SDiD python @ any scale | 3s fixed overhead + scaling | **Already ported to Rust.** Python fallback acceptable as a teaching/safety path; document as non-production for n > 100. Python skipped at n=500 because jackknife 500 refits × ~500ms/refit would exceed 4 minutes. |
-| 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit + heterogeneity | Reversible (single scale) | 0.32s main + 0.22s heterogeneity | **Cache/precompute** - heterogeneity refit rebuilds TSL scaffolding the main fit already computed. Not P0 - total is ~550ms - but newer code path (v3.1) never optimization-reviewed. |
-| 5 | `diff_diff/continuous_did.py` CDiD spline bootstrap | Dose-response (single scale) | 0.14s per fit × 4 variants | **Leave alone** - linear in variant count, all well under perceptible threshold. |
-
-### Per-scenario phase rankings at each scale
-
-**Scenario 1 - Staggered campaign (CS + 8-step chain).**
-ImputationDiD robustness remains the single dominant phase at every scale
-(0.30s / 0.33s / 0.66s for small / medium / large). SunAbraham scales at
-similar rate. The CS fit with `n_bootstrap=999` at 1,500 units is 0.18s
-(15%) - well-vectorized. Action: investigate ImputationDiD only after
-higher-upside items land.
-
-**Scenario 2 - Brand awareness survey.**
-At small scale HonestDiD dominates (42%); at medium the multi-outcome loop
-and the JK1 replicate-weight path are within a factor of 2 (23-36%); at
-large the JK1 path becomes the single top phase (~45-50%, 0.37s Python /
-0.46s Rust). Replicate count grows with PSU count (40 / 90 / 160 at the
-three scales), so the path scales roughly as n_units × n_replicates - a
-near-quadratic curve in a design dimension that commonly grows. Note that
-Rust is marginally slower than Python here because the JK1 replicate-fit
-loop is not yet Rust-accelerated and the FFI crossings cost more than the
-per-fit work. Action: leave alone, but flag the JK1 path as a Rust-port
-candidate if practitioners regularly run n_replicates >= 160.
-
-**Scenario 3 - BRFSS microdata → CS panel.**
-`aggregate_survey` share of total grows with scale: 94% at 50K → 99% at 250K →
-100% at 1M. Everything downstream (CS fit, SunAbraham, HonestDiD) stays
-under 500 ms combined. Action: fix `aggregate_survey` per-cell loop. This
-is now the single most impactful optimization identified.
-
-**Scenario 4 - Geo-experiment few markets (SDiD).**
-`sensitivity_to_zeta_omega` and `in_time_placebo` are the dominant
-python-backend phases at every scale (together ~70%); Rust eliminates both.
-Action: no further optimization needed - Rust port ships the answer.
-
-**Scenario 5 - Reversible treatment (dCDH L_max=3 + TSL).**
-Unchanged from single-scale pass: main fit 58% + heterogeneity refit 40%,
-both rebuilding shared TSL scaffolding.
-
-**Scenario 6 - Pricing dose-response (ContinuousDiD).**
-Unchanged: four spline fits ~140ms each, ~99% of total.
+   (`aggregate_survey`, JK1 replicate fit) or already well-vectorized
+   (CS bootstrap, ImputationDiD, Survey TSL).
+
+### Top phases by scenario at largest measured scale
+
+<!-- TABLE:start top_phases_by_scenario -->
+| Scenario | Scale | Backend | Top phase (%) | 2nd phase (%) | 3rd phase (%) |
+|---|---|---|---|---|---|
+| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (52%) | `5_sun_abraham_robustness` (26%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
+| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (37%) | `5_sun_abraham_robustness` (31%) | `7_cs_without_covariates` (20%) |
+| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (56%) | `4_multi_outcome_loop_3_metrics` (21%) | `7_event_study_plus_honest_did` (13%) |
+| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (54%) | `4_multi_outcome_loop_3_metrics` (25%) | `7_event_study_plus_honest_did` (14%) |
+| 3. BRFSS microdata -> CS panel | large | python | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
+| 3. BRFSS microdata -> CS panel | large | rust | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
+| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (40%) | `3_in_time_placebo` (29%) | `1_sdid_jackknife_variance` (16%) |
+| 5. Reversible dCDH | single | python | `1_dcdh_fit_Lmax3_survey_TSL` (62%) | `4_heterogeneity_refit` (37%) | `3_honest_did_on_placebo` (1%) |
+| 5. Reversible dCDH | single | rust | `1_dcdh_fit_Lmax3_survey_TSL` (61%) | `4_heterogeneity_refit` (38%) | `3_honest_did_on_placebo` (1%) |
+| 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
+| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
+<!-- TABLE:end top_phases_by_scenario -->
+
+Per-scenario phase narrative (cross-check against the table above after
+any rerun):
+
+- **Staggered campaign.** ImputationDiD robustness is the dominant phase
+  under Python at every scale. Under Rust at large scale it is tied with
+  SunAbraham (both ~30-40%); the Rust backend shifts relative shares
+  more than totals. CS fit with `n_bootstrap=999` is well-vectorized and
+  sits well below both in the ranking.
+- **Brand awareness survey.** At small scale HonestDiD dominates; at
+  medium the multi-outcome loop and the JK1 replicate path are
+  comparable; at large the JK1 path is the single top phase under both
+  backends. Python and Rust totals on this chain are within noise; the
+  JK1 replicate-fit loop is not Rust-accelerated, so the FFI crossings
+  cost approximately what they save - a neutral outcome, not a
+  regression.
+- **BRFSS.** `aggregate_survey` share of total grows with scale and is
+  effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
+  SunAbraham, HonestDiD) are a fraction of a second combined.
+- **SDiD few markets.** `sensitivity_to_zeta_omega` and `in_time_placebo`
+  are the dominant Python-backend phases at every scale; Rust eliminates
+  both.
+- **Reversible dCDH.** Main fit and heterogeneity refit split the time
+  ~60/40, both rebuilding shared TSL scaffolding.
+- **Pricing dose-response.** Four spline fits account for essentially all
+  runtime; linear scaling in variant count.
+
+### Top hotspots ranked by total-time contribution
+
+| # | Location | Scenario + scale | Signal | Recommended action |
+|---|---|---|---|---|
+| 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS @ 1M rows | dominates BRFSS chain at all scales, ~100% at 1M rows | **Algorithmic fix, highest priority.** Function called once per (state, year) cell (500 calls); per-call work rebuilds stratum-PSU scaffolding every time. Precompute stratum indexes once at `aggregate_survey` top-level and reuse. |
+| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | dominant phase of the CS chain under Python at all scales; tied with SunAbraham under Rust at large | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. |
+| 3 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | SDiD python @ any scale | dominates Python SDiD at all scales | **Already ported to Rust.** Python fallback acceptable as a teaching/safety path; non-production for n > 100. Python skipped at n=500 (jackknife cost would exceed 4 minutes per run). |
+| 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit + heterogeneity | Reversible (single scale) | main fit + heterogeneity refit each rebuild TSL scaffolding | **Cache/precompute** - heterogeneity refit duplicates the main fit's TSL setup. Not P0; newer code path (v3.1) never optimization-reviewed. |
+| 5 | `diff_diff/continuous_did.py` CDiD spline bootstrap | Dose-response (single scale) | four spline fits ~equal, linear in variant count | **Leave alone** - well under perceptible threshold. |
 
 ### Memory analysis
 
@@ -134,80 +141,78 @@ End-to-end peak RSS and per-scenario growth are captured in each JSON
 baseline under the `memory` field, recorded via a psutil background
 sampler at 10 ms. A standalone `tracemalloc`-based allocator attribution
 pass for the BRFSS-1M scenario lives at
-`benchmarks/speed_review/mem_profile_brfss.py`; its output is in
-`benchmarks/speed_review/baselines/mem_profile_brfss_large_<backend>.txt`.
+`benchmarks/speed_review/mem_profile_brfss.py`; its scrubbed output is
+in `benchmarks/speed_review/baselines/mem_profile_brfss_large_<backend>.txt`.
 
-| Scenario | Scale | Peak RSS (Py) | Growth during run (Py) | Peak RSS (Rust) | Growth (Rust) |
+<!-- TABLE:start memory_by_scenario -->
+| Scenario | Scale | Py peak RSS (MB) | Py growth (MB) | Rust peak RSS (MB) | Rust growth (MB) |
 |---|---|---:|---:|---:|---:|
-| Staggered campaign | small | 141 MB | +26 | 147 MB | +33 |
-|  | medium | 226 MB | +81 | 263 MB | +109 |
-|  | **large** | **486 MB** | **+252** | **589 MB** | **+322** |
-| Brand awareness survey | small | 126 MB | +10 | 130 MB | +13 |
-|  | medium | 188 MB | +56 | 185 MB | +52 |
-|  | large | 336 MB | +146 | 315 MB | +127 |
-| BRFSS microdata -> CS panel | small (50K) | 131 MB | +10 | 134 MB | +11 |
-|  | medium (250K) | 206 MB | +19 | 214 MB | +24 |
-|  | **large (1M)** | **419 MB** | **+23** | **428 MB** | **+29** |
-| SDiD few markets | small | 124 MB | +10 | 115 MB | +1 |
-|  | medium | 152 MB | +8 | 117 MB | 0 |
-|  | large | skip | skip | 117 MB | 0 |
-| Reversible dCDH | single | 131 MB | +18 | 136 MB | +22 |
-| Dose-response | single | 120 MB | +6 | 122 MB | +9 |
+| 1. Staggered campaign | small | 141 | 27 | 148 | 34 |
+|  | medium | 233 | 84 | 264 | 108 |
+|  | large | 474 | 237 | 576 | 312 |
+| 2. Brand awareness survey | small | 131 | 16 | 131 | 16 |
+|  | medium | 194 | 58 | 190 | 54 |
+|  | large | 335 | 140 | 339 | 144 |
+| 3. BRFSS microdata -> CS panel | small | 133 | 12 | 136 | 14 |
+|  | medium | 209 | 20 | 223 | 30 |
+|  | large | 426 | 31 | 446 | 33 |
+| 4. SDiD few markets | small | 124 | 10 | 115 | 2 |
+|  | medium | 152 | 9 | 117 | 0 |
+|  | large | skip | skip | 118 | 0 |
+| 5. Reversible dCDH | single | 132 | 19 | 135 | 21 |
+| 6. Pricing dose-response | single | 121 | 7 | 123 | 9 |
+<!-- TABLE:end memory_by_scenario -->
 
 The ~115-130 MB floor is the Python + diff-diff + numpy import footprint;
-the "growth during run" column is the practitioner-meaningful number.
+the "growth" columns are the practitioner-meaningful numbers.
 
 ### Memory findings
 
-1. **BRFSS `aggregate_survey` is compute-bound, not memory-bound.** Across
-   a 20x data growth (50K → 1M rows), working-memory growth only goes
-   10 → 19 → 23 MB. The tracemalloc pass confirms this: net retained
-   allocation after `aggregate_survey` returns is 0.6 MB, Python traced
-   peak is 84 MB (vs 46 MB input microdata), and the top allocation site
-   is `tracemalloc`'s own `linecache.py` overhead - a smoking gun that
-   nothing else is allocating meaningfully. **The 24-second cost is pure
-   CPU; the function is already memory-efficient.** This strengthens the
-   case for the precompute-scaffolding fix: low-risk, pure CPU win, fits
-   in any deployment environment including 512 MB Lambda.
-
+1. **BRFSS `aggregate_survey` is compute-bound, not memory-bound.** At
+   20x data growth (50K -> 1M rows), working-memory growth stays in the
+   low tens of MB. The tracemalloc pass confirms: net retained allocation
+   after `aggregate_survey` returns is well under 1 MB; the top
+   allocation site is `tracemalloc`'s own linecache overhead (a smoking
+   gun that nothing else is allocating meaningfully). **The BRFSS cost
+   is pure CPU; the function is already memory-efficient.** This
+   strengthens the case for the precompute-scaffolding fix: low-risk,
+   pure CPU win, fits in any deployment environment including 512 MB
+   Lambda.
 2. **Staggered CS chain is memory-heavier than wall-clock suggested.** At
-   1,500 units the chain allocates +252 MB Python / +322 MB Rust during
-   the run, pushing peak RSS to ~486-589 MB. Fine for workstations,
-   tight for 512 MB Lambda tier. The Bootstrap-999 in CS and ImputationDiD's
-   saturated regression are the plausible drivers. Not a P0 today but
-   worth flagging for future edge / Lambda deployments. Interestingly,
-   Rust uses **more** memory here (70 MB more at large scale), likely
-   FFI-held temporary array copies; not worth optimizing.
-
-3. **JK1 replicate path is allocation-heavy at scale.** At 1,000 units ×
-   160 replicates, +127-146 MB growth. Each replicate refit plus the
+   1,500 units the chain's peak RSS sits in the high-400s to high-500s
+   MB depending on backend. Fine for workstations, tight for 512 MB
+   Lambda tier. Bootstrap-999 in CS and ImputationDiD's saturated
+   regression are plausible drivers. Rust uses slightly more memory here
+   (likely FFI-held temporary array copies); not worth optimizing.
+3. **JK1 replicate path is allocation-heavy at large replicate count.**
+   At 1,000 units × 160 replicates the chain's growth during run sits in
+   the mid-100s of MB (see memory table). Each replicate refit plus the
    n × n_replicates weight matrix drives this. A Rust port would save
-   both time (0.3-0.4s) and memory (~100 MB) - the dual benefit slightly
-   strengthens the case for the port.
-
-4. **SDiD Rust path is essentially memory-free** (+0-1 MB across scales).
-   Rust does the work in native memory without round-tripping through
-   the Python allocator. Confirms the existing Rust port is well-behaved
-   on both axes.
-
-5. **No scenario hits OOM territory at measured scales.** Maximum peak
-   RSS across the whole sweep is 589 MB (staggered CS large + Rust).
-   1 GB is a comfortable ceiling for every scenario measured.
+   memory even though time is within noise today - the dual benefit
+   strengthens the case for the port if replicate counts grow.
+4. **SDiD Rust path is essentially memory-free** (growth of at most
+   2 MB across scales). Rust does the work in native memory without
+   round-tripping through the Python allocator. Confirms the existing
+   Rust port is well-behaved on both axes.
+5. **No scenario hits OOM territory at measured scales.** Peak RSS across
+   the whole sweep stays under 600 MB. 1 GB is a comfortable ceiling for
+   every scenario measured.
 
 ### Priority of optimization opportunities
 
 | # | Opportunity | Time upside | Memory upside | Risk | Priority |
 |---|---|---|---|---|---|
-| 1 | `aggregate_survey` precompute stratum scaffolding | -15 to -20s at 1M rows | none (already memory-efficient) | Low | **High** |
-| 2 | Rust-port JK1 replicate fit loop | -0.3s at 160 replicates | -100 MB at 160 replicates | Medium | Medium |
-| 3 | dCDH: cache TSL scaffolding across main fit + heterogeneity refit | -0.2s per chain | -20 MB per chain | Low | Low |
-| 4 | ImputationDiD fit-loop vectorization audit | -0.1 to -0.3s at 1,500 units | unknown | Low | Low |
-| 5 | Staggered CS chain working-memory audit (Lambda-oriented) | none | -100+ MB at 1,500 units | Medium | Low |
-
-#1 is the single clearest practitioner win. Everything else is optional
-polish that should be prioritized by actual deployment-environment signal
-(e.g. "our practitioners keep hitting 512 MB Lambda limits on the
-staggered chain" → item 5 moves up).
+| 1 | `aggregate_survey` precompute stratum scaffolding | ~20s at 1M rows | none (already memory-efficient) | Low | **High** |
+| 2 | Staggered CS chain working-memory audit (Lambda-oriented) | none | ~200-300 MB at 1,500 units (peak RSS crosses 512 MB Lambda line under Rust) | Medium | Low (bump to Medium if Lambda deployment becomes a concrete ask) |
+| 3 | dCDH: cache TSL scaffolding across main fit + heterogeneity refit | ~0.2s per chain | ~20 MB per chain | Low | Low |
+| 4 | ImputationDiD fit-loop vectorization audit | ~0.1-0.3s at 1,500 units | unknown | Low | Low |
+| 5 | Rust-port JK1 replicate fit loop | ~0.5s at 160 replicates | ~140 MB at 160 replicates | Medium | Low (demoted: Rust is no longer slower than Python on this path after rerun, so the "fix-a-Rust-regression" leg of the original rationale is gone) |
+
+**Bottom line: one clear priority, four optional.** #1 is the single
+practitioner-perceptible win identified by this analysis and should be
+the next PR. #2-5 are optional polish that should be prioritized by
+concrete deployment-environment signal (Lambda OOMs, practitioner
+reports of slowness at specific shapes), not proactively.
 
 ### Correctness-adjacent observations (not P0, route separately)
 
diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md
index 282c1d65..487458db 100644
--- a/docs/performance-scenarios.md
+++ b/docs/performance-scenarios.md
@@ -196,10 +196,10 @@ serves a different purpose: R-parity accuracy). They complement it.
   ```
 - **Operation chain.** (1) `aggregate_survey()` - the microdata-to-panel
   collapse; (2) CS fit with staged second-stage SurveyDesign
-  (`weight_type="pweight"`) and bootstrap at PSU level; (3) event-study
-  pre-trend inspection; (4) HonestDiD sensitivity grid; (5) SunAbraham
-  robustness refit (also survey-aware via Full replicate-weight path);
-  (6) `practitioner_next_steps()`.
+  (`weight_type="pweight"`, analytical TSL via strata + PSU) and bootstrap
+  at PSU level; (3) event-study pre-trend inspection; (4) HonestDiD
+  sensitivity grid; (5) SunAbraham robustness refit using the same
+  second-stage pweight SurveyDesign; (6) `practitioner_next_steps()`.
 - **Source anchor.** `docs/practitioner_getting_started.rst` ("What If
   You Have Survey Data?" section), CDC BRFSS 2024 overview
   (cdc.gov/brfss/annual_data/2024), `diff_diff.prep.aggregate_survey`
@@ -264,9 +264,9 @@ serves a different purpose: R-parity accuracy). They complement it.
   l=1..3, dynamic placebos, sup-t bands, TWFE diagnostic); (2) inspect
   `placebo_effect` and dynamic placebos for pre-trend evidence;
   (3) `results.print_summary()`; (4) `compute_honest_did()` on the placebo
-  event study; (5) heterogeneity refit with `heterogeneity="cohort"` if
-  the code path supports it on this shape. The TSL path for `L_max >= 1`
-  is newer code (v3.1) and has not been profiled.
+  event study; (5) heterogeneity refit with `heterogeneity="group"`.
+  The TSL path for `L_max >= 1` is newer code (v3.1) and has not been
+  profiled.
 - **Source anchor.** `docs/practitioner_decision_tree.rst`
   ("Reversible Treatment (On/Off Cycles)"), de Chaisemartin & D'Haultfoeuille
   (2020), NBER WP 29873 (dynamic companion), R package

From bf9b9d7a0f9789fc12d27c180fa4ee33e13dcce5 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 13:37:14 -0400
Subject: [PATCH 05/15] Sync scenario-spec doc to script phase lists (P2
 cleanup)

CI review P2: performance-scenarios.md had four drift points where the
documented scenario did not match what the scripts actually run and
time. Each is now a faithful spec that a reviewer can cross-check
against the scripts:

- BRFSS small scale: "single year" -> "narrow analytic slice on a
  state-year grid" (all scales use n_years=10).
- Scenario 4 (SDiD): removed the seventh plot_synth_weights step the
  script never times; chain is now 6 steps, matching the script.
- Scenario 5 (dCDH): replaced "results.print_summary()" with the
  actual attribute snapshot the script performs (placebo_effect,
  overall_att, joiners_att, leavers_att); chain is now 4 steps.
- Scenario 6 (dose-response): event-study step is no longer described
  as to_dataframe(level="event_study") on a dose-only fit (that API
  path raises because aggregate="dose" does not populate event_study);
  it is now described as a second CDiD fit with aggregate="eventstudy",
  matching the separate phase the script times.

The within-estimator API-spelling inconsistency that surfaced during
this cleanup (ContinuousDiD uses "eventstudy" on fit(aggregate=...) but
"event_study" on to_dataframe(level=...)) is captured in the
correctness-adjacent observations in performance-plan.md.

No changes under diff_diff/, rust/, scripts, or baselines. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/performance-plan.md      | 11 +++++----
 docs/performance-scenarios.md | 44 +++++++++++++++++++----------------
 2 files changed, 31 insertions(+), 24 deletions(-)
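
Reviewer note (below the three-dash line, so not part of the commit
message): the "eventstudy" / "event_study" mismatch described above could
plausibly be absorbed by a small alias table during the API-consistency
cleanup. A minimal sketch, assuming a hypothetical helper
`normalize_spelling` that is not part of diff_diff, and assuming the
underscore form would be the canonical one:

```python
# Hypothetical helper, not part of diff_diff. The choice of "event_study"
# as the canonical spelling is an assumption made for illustration.
_ALIASES = {"eventstudy": "event_study"}


def normalize_spelling(value: str) -> str:
    """Map a known alias to its canonical spelling; pass others through."""
    key = value.strip().lower()
    return _ALIASES.get(key, key)


# Both spellings resolve to the same canonical value.
print(normalize_spelling("eventstudy"))
print(normalize_spelling("event_study"))
```

Whether the library should accept both spellings or deprecate one is a
design decision for the cleanup itself; the sketch only shows that the
mapping is small.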

diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index 30bc12b2..c4bc7fe2 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -220,10 +220,13 @@ These are developer-ergonomics / API-consistency smells surfaced during
 scenario development. None are silent-failures and none belong in this PR
 or in the silent-failures audit; logging here for awareness.
 
-1. **`aggregate` parameter naming.** CS accepts `aggregate="event_study"`;
-   ContinuousDiD requires `aggregate="eventstudy"` (no underscore). Both
-   estimators expose the same conceptual aggregation but different
-   spellings. Route: API-consistency cleanup, minor.
+1. **`aggregate` / `level` parameter naming is inconsistent.** CS accepts
+   `aggregate="event_study"`; ContinuousDiD requires
+   `aggregate="eventstudy"` on `fit()` **but** `level="event_study"` on
+   `to_dataframe()`. Two spellings within one estimator, and its `fit()`
+   spelling also disagrees with CS. Surfaced when the P1 exit-propagation
+   fix stopped silently swallowing the resulting `ValueError` in the
+   dose-response benchmark. Route: API-consistency cleanup, minor.
 2. **`generate_survey_did_data(panel=True)` `treated` column.** Row-level
    active-treatment indicator that is zero in pre-periods, which makes it
    quietly incompatible with `check_parallel_trends` (expects unit-level
diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md
index 487458db..429617e7 100644
--- a/docs/performance-scenarios.md
+++ b/docs/performance-scenarios.md
@@ -173,10 +173,10 @@ serves a different purpose: R-parity accuracy). They complement it.
   to a state-year panel, then a modern staggered estimator.
 - **Data shape (scale sweep).** 50 states × 10 years × N respondents per
   state-year cell, 5 adoption cohorts staggered over the window. Three scales:
-    - **small** - 50,000 rows (100/cell, 10 strata × 200 PSUs). Substate
-      analytic slice of a single year.
-    - **medium** - 250,000 rows (500/cell, 15 strata × 600 PSUs). Pooled
-      substate analytic slice across multiple years.
+    - **small** - 50,000 rows (100/cell, 10 strata × 200 PSUs). Narrow
+      analytic slice on a state-year grid.
+    - **medium** - 250,000 rows (500/cell, 15 strata × 600 PSUs).
+      Mid-range analytic slice on the same state-year grid.
     - **large** - 1,000,000 rows (2,000/cell, 20 strata × 1,000 PSUs).
       A realistic pooled 10-year multi-state analysis - comparable to the
       kind of panel built from BRFSS 2024's ~458K-record universe filtered
@@ -229,13 +229,12 @@ serves a different purpose: R-parity accuracy). They complement it.
   # then also variance_method="bootstrap", n_bootstrap=200 for comparison
   ```
 - **Operation chain.** (1) SDiD fit with `variance_method="jackknife"` -
-  exercises the leave-one-out refit loop (80 full refits); (2) SDiD fit
-  with `variance_method="bootstrap"`, `n_bootstrap=200` for SE comparison;
+  exercises the leave-one-out refit loop; (2) SDiD fit with
+  `variance_method="bootstrap"`, `n_bootstrap=200` for SE comparison;
   (3) `results.in_time_placebo()`; (4) `results.get_loo_effects_df()`;
   (5) `results.sensitivity_to_zeta_omega()`; (6)
-  `results.get_weight_concentration()`; (7) `plot_synth_weights()` equivalent
-  (data extraction via `results.get_unit_weights_df()`). The jackknife loop
-  is the primary time sink; `sensitivity_to_zeta_omega` also refits.
+  `results.get_weight_concentration()`. The jackknife loop is the primary
+  time sink; `sensitivity_to_zeta_omega` also refits.
 - **Source anchor.** `docs/tutorials/18_geo_experiments.ipynb`,
   Arkhangelsky et al. (2021), Mercado Libre geo-experiment writeup
   (medium.com/mercadolibre-tech), Meta GeoLift methodology docs
@@ -261,12 +260,12 @@ serves a different purpose: R-parity accuracy). They complement it.
   )
   ```
 - **Operation chain.** (1) dCDH fit with `L_max=3` (computes `DID_l` for
-  l=1..3, dynamic placebos, sup-t bands, TWFE diagnostic); (2) inspect
-  `placebo_effect` and dynamic placebos for pre-trend evidence;
-  (3) `results.print_summary()`; (4) `compute_honest_did()` on the placebo
-  event study; (5) heterogeneity refit with `heterogeneity="group"`.
-  The TSL path for `L_max >= 1` is newer code (v3.1) and has not been
-  profiled.
+  l=1..3, dynamic placebos, sup-t bands, TWFE diagnostic); (2) snapshot
+  `placebo_effect`, `overall_att`, `joiners_att`, `leavers_att` from the
+  result object for pre-trend evidence and joiner/leaver inspection;
+  (3) `compute_honest_did()` M-grid on the placebo event study;
+  (4) heterogeneity refit with `heterogeneity="group"`. The TSL path for
+  `L_max >= 1` is newer code (v3.1) and has not been profiled.
 - **Source anchor.** `docs/practitioner_decision_tree.rst`
   ("Reversible Treatment (On/Off Cycles)"), de Chaisemartin & D'Haultfoeuille
   (2020), NBER WP 29873 (dynamic companion), R package
@@ -290,12 +289,17 @@ serves a different purpose: R-parity accuracy). They complement it.
   )
   ```
 - **Operation chain.** (1) CDiD fit with `aggregate="dose"` - produces
-  overall ATT, overall ACRT, and the dose-response curves; (2)
-  `results.to_dataframe(level="dose_response")`; (3)
-  `results.to_dataframe(level="event_study")` for pre-trend diagnostics;
+  overall ATT, overall ACRT, and the dose-response curves; (2) extract
+  `results.to_dataframe(level="dose_response")` and
+  `level="group_time"` (event-study is not populated by a dose-only
+  fit, so it is extracted in a separate step); (3) a second CDiD fit
+  with `aggregate="eventstudy"` for pre-trend diagnostics (note the
+  spelling: `fit(aggregate="eventstudy")` with no underscore, but
+  `to_dataframe(level="event_study")` with underscore - see the
+  correctness-adjacent observations in `performance-plan.md`);
   (4) compare to a binarized DiD fit on the same data to quantify
-  the information loss from binarizing; (5) alternate `degree=1`
-  (linear) and `num_knots=2` refits for spline-sensitivity. The dose-curve
+  information loss from binarizing; (5) alternate `degree=1` (linear)
+  and (6) `num_knots=2` refits for spline-sensitivity. The dose-curve
   bootstrap loop (199 reps x spline refit) is the primary time sink.
 - **Source anchor.** `docs/tutorials/14_continuous_did.ipynb`,
   Callaway, Goodman-Bacon & Sant'Anna (2024), `docs/methodology/REGISTRY.md`

From 922317846c441ec736c5fcfa25f972dc65a7bf83 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 13:47:33 -0400
Subject: [PATCH 06/15] Finish spec-doc sync: multi-scale output paths and SDiD
 Python/large skip

CI re-review P2: remaining stale prose lines that didn't reflect the
three-scale sweep and the intentional SDiD-Python-large skip. All
straightforward text edits:

- performance-scenarios.md output-path description now uses
  <scenario>[_<scale>]_<backend> notation and explicitly calls out
  that single-scale scenarios omit the scale segment.
- performance-scenarios.md "Runs under both backends" line now
  acknowledges the SDiD large-scale Python skip by design.
- performance-plan.md environment paragraph now mentions the SDiD skip
  alongside the three-scale sweep.
- performance-plan.md "What this baseline does not answer" section no
  longer claims each scenario runs at a single data shape (untrue since
  the three-scale sweep) or that memory is unmeasured; both bullets are
  replaced with an OOM-behaviour bullet that reflects what actually is
  and isn't covered.
- Pointers block at the end of performance-scenarios.md updated to
  the multi-scale filename pattern.

No changes under diff_diff/, rust/, scripts, or baselines. Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/performance-plan.md      | 20 +++++++++++---------
 docs/performance-scenarios.md | 19 +++++++++++++------
 2 files changed, 24 insertions(+), 15 deletions(-)
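
Reviewer note (below the three-dash line, so not part of the commit
message): the `<scenario>[_<scale>]_<backend>` pattern documented in this
patch can be sketched as a small path builder. The function name
`baseline_json_path` is hypothetical, and the scenario slugs in the usage
lines are taken from the file names in the series diffstat; the scripts
themselves may assemble the name differently.

```python
from pathlib import Path
from typing import Optional

# Directory named in performance-scenarios.md.
BASELINE_DIR = Path("benchmarks/speed_review/baselines")


def baseline_json_path(
    scenario: str, backend: str, scale: Optional[str] = None
) -> Path:
    """Build <scenario>[_<scale>]_<backend>.json under BASELINE_DIR.

    Single-scale scenarios pass scale=None and omit the scale segment.
    """
    parts = [scenario]
    if scale is not None:
        parts.append(scale)
    parts.append(backend)
    return BASELINE_DIR / ("_".join(parts) + ".json")


# Multi-scale scenario: the scale segment is present.
print(baseline_json_path("brfss_panel", "python", "large"))
# Single-scale scenario: the scale segment is omitted.
print(baseline_json_path("dose_response", "rust"))
```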

diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index c4bc7fe2..ec166674 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -19,8 +19,12 @@ writeups. The six scenarios are defined in
 
 Environment: macOS darwin 25.3 on Apple Silicon M4, Python 3.9,
 numpy 2.x, diff_diff 3.1.3. Each multi-scale scenario runs at three data
-scales under both `DIFF_DIFF_BACKEND=python` and `DIFF_DIFF_BACKEND=rust`.
-The numerical tables below are auto-generated from the committed JSON
+scales under both `DIFF_DIFF_BACKEND=python` and `DIFF_DIFF_BACKEND=rust`,
+with one intentional exception: the SDiD few-markets scenario at its
+`large` scale runs Rust only, because the pure-numpy jackknife at n=500
+would exceed four minutes per run without changing the already-clear
+Python-vs-Rust conclusion established at `small` and `medium`. The
+numerical tables below are auto-generated from the committed JSON
 baselines by `benchmarks/speed_review/gen_findings_tables.py`; narrative
 prose is hand-written and must be re-read when numbers shift.
 
@@ -241,13 +245,11 @@ or in the silent-failures audit; logging here for awareness.
 
 ### What this baseline does not answer
 
-- Scaling: each scenario runs at a single data shape. We do not know how
-  end-to-end time scales with n, periods, or cohorts. If scaling becomes a
-  decision input, add a small per-scenario scale sweep (e.g., n_units in
-  {100, 500, 1000}) - the scripts are parameterised to support this.
-- Memory: no memory-ceiling measurement. If memory becomes a concern,
-  `pyinstrument --output-memory` or `memray` can be wrapped into
-  `bench_shared.run_scenario` without restructuring.
+- OOM behaviour at the edge: the sweep captures peak RSS up to ~600 MB
+  (staggered CS large under Rust). Behaviour under a hard memory ceiling
+  (512 MB Lambda, 1 GB container) is not exercised; if deployment signal
+  emerges that practitioners hit those ceilings, a ceiling-test pass
+  should be added.
 - Pure-Rust profiles: scenarios run the Rust backend as a black box.
   Optimizing inside `rust/` is a separate concern owned by the crate
   maintainers and is not in scope here.
diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md
index 429617e7..7ac93385 100644
--- a/docs/performance-scenarios.md
+++ b/docs/performance-scenarios.md
@@ -54,12 +54,18 @@ For each scenario, `benchmarks/speed_review/` hosts a script
 
 1. Generates (or loads) the data once.
 2. Runs the full operation chain under `pyinstrument` and writes a flame HTML
-   to `benchmarks/speed_review/baselines/profiles/<scenario>_<backend>.html`.
+   to `benchmarks/speed_review/baselines/profiles/<scenario>[_<scale>]_<backend>.html`.
 3. Writes a wall-clock JSON breakdown (per operation + total) to
-   `benchmarks/speed_review/baselines/<scenario>_<backend>.json`.
+   `benchmarks/speed_review/baselines/<scenario>[_<scale>]_<backend>.json`.
+   Multi-scale scenarios include the scale segment (`_small`, `_medium`,
+   `_large`); single-scale scenarios (dose-response, reversible-dCDH)
+   omit it.
 4. Runs under both `DIFF_DIFF_BACKEND=python` and `DIFF_DIFF_BACKEND=rust`
-   when Rust is available. The gap is the primary input to Rust-expansion
-   decisions.
+   when Rust is available. Scenario 4 (SDiD few markets) skips the
+   Python backend at the `large` scale by design because its
+   pure-numpy jackknife would exceed 4 minutes per run without adding
+   signal; every other (scenario, scale) runs under both backends. The
+   Python-vs-Rust gap is the primary input to Rust-expansion decisions.
 
 The scenario scripts are **not** meant to replace `run_benchmarks.py` (which
 serves a different purpose: R-parity accuracy). They complement it.
@@ -342,7 +348,8 @@ output. Scripts filter this warning so profiles stay clean.
 ## Pointers
 
 - Scripts: `benchmarks/speed_review/bench_<scenario>.py`
-- Raw results: `benchmarks/speed_review/baselines/<scenario>_<backend>.json`
-- Flame profiles: `benchmarks/speed_review/baselines/profiles/<scenario>_<backend>.html`
+- Raw results: `benchmarks/speed_review/baselines/<scenario>[_<scale>]_<backend>.json`
+- Flame profiles: `benchmarks/speed_review/baselines/profiles/<scenario>[_<scale>]_<backend>.html`
+  (gitignored; regenerated per run)
 - Findings doc: `docs/performance-plan.md` ("Practitioner Workflow Baseline"
   section - per-scenario top-5 hot phases + recommended action category)

From ce05d01539c697ee3dee4e8e65d7636bd96fc9e8 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 14:00:52 -0400
Subject: [PATCH 07/15] Generator picks largest-available scale per backend;
 fix brand medium narrative

Addresses the two remaining P2s from CI review:

- gen_findings_tables.py hard-coded scale="large" for the top-phases
  table, which silently dropped the geo_few_markets Python row (Python
  intentionally skips the large scale). The generator now iterates
  reversed(SCALE_ORDER) and picks the largest record actually present
  per (scenario, backend). The regenerated table now shows SDiD Python
  at medium and Rust at large side-by-side, which is the Python-vs-Rust
  comparison the table is supposed to surface.
- Brand-awareness medium-scale narrative said the multi-outcome loop
  and the JK1 replicate path are "comparable" at medium. The committed
  baselines contradict this: JK1 is 2-3x the multi-outcome loop on
  Python and still the top phase on Rust. Rewrote the bullet to say
  JK1 is the clear top phase from medium onwards and consolidates at
  large, matching the data.

Docs + generator only. No baseline regeneration needed (the top-phases
table regeneration is cosmetic - the JSONs didn't change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../speed_review/gen_findings_tables.py       | 19 ++++++++++++++++---
 docs/performance-plan.md                      | 16 +++++++++-------
 2 files changed, 25 insertions(+), 10 deletions(-)
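
Reviewer note (below the three-dash line, so not part of the commit
message): a self-contained check of the fallback behaviour this patch
introduces. The in-memory `BASELINES` dict and the `load` stand-in are
assumptions that replace the generator's real JSON loader, the records
are placeholders rather than measured data, and `SCALE_ORDER` is assumed
to run from smallest to largest.

```python
# Assumed ordering, smallest to largest.
SCALE_ORDER = ("small", "medium", "large")

# Placeholder records standing in for committed JSON baselines.
# geo_few_markets has no ("large", "python") entry, mirroring the
# intentional SDiD Python skip at the large scale.
BASELINES = {
    ("geo_few_markets", "small", "python"): {"placeholder": True},
    ("geo_few_markets", "medium", "python"): {"placeholder": True},
    ("geo_few_markets", "small", "rust"): {"placeholder": True},
    ("geo_few_markets", "medium", "rust"): {"placeholder": True},
    ("geo_few_markets", "large", "rust"): {"placeholder": True},
}


def load(scen, scale, backend):
    """Stand-in loader: return the record, or None if it was not measured."""
    return BASELINES.get((scen, scale, backend))


def largest_available(scen, backend):
    """Return (scale, record) for the largest scale this backend has."""
    for scale in reversed(SCALE_ORDER):
        rec = load(scen, scale, backend)
        if rec is not None:
            return scale, rec
    return None, None


python_scale, _ = largest_available("geo_few_markets", "python")
rust_scale, _ = largest_available("geo_few_markets", "rust")
print(python_scale, rust_scale)
```

With these stand-in records, the Python row falls back to `medium` while
the Rust row stays at `large`, which is the side-by-side comparison the
regenerated table is meant to show.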

diff --git a/benchmarks/speed_review/gen_findings_tables.py b/benchmarks/speed_review/gen_findings_tables.py
index 9cf5c6d6..e1258094 100644
--- a/benchmarks/speed_review/gen_findings_tables.py
+++ b/benchmarks/speed_review/gen_findings_tables.py
@@ -152,7 +152,13 @@ def render_memory_by_scenario():
 
 
 def render_top_phases_by_scenario():
-    """Top-3 phases by time at largest scale, for both backends."""
+    """Top-3 phases at the largest-available scale per (scenario, backend).
+
+    If a scenario/backend skips `large` (e.g., geo_few_markets Python),
+    this falls back to the largest measured scale for that backend so
+    the table still reports the Python-vs-Rust comparison rather than
+    dropping the row entirely.
+    """
     rows = [
         "| Scenario | Scale | Backend | Top phase (%) "
         "| 2nd phase (%) | 3rd phase (%) |",
@@ -175,11 +181,18 @@ def phase_rank(record, n=3):
             out.append("-")
         return out
 
+    def largest_available(scen, backend):
+        """Return (scale, record) for the largest scale this backend has."""
+        for scale in reversed(SCALE_ORDER):
+            rec = load(scen, scale, backend)
+            if rec is not None:
+                return scale, rec
+        return None, None
+
     for scen in MULTI_SCALE:
         display = SCENARIO_DISPLAY[scen]
-        scale = SCALE_ORDER[-1]  # largest
         for backend in ("python", "rust"):
-            rec = load(scen, scale, backend)
+            scale, rec = largest_available(scen, backend)
             top = phase_rank(rec)
             if not top:
                 continue
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index ec166674..41697a3c 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -96,6 +96,7 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 | 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (54%) | `4_multi_outcome_loop_3_metrics` (25%) | `7_event_study_plus_honest_did` (14%) |
 | 3. BRFSS microdata -> CS panel | large | python | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
 | 3. BRFSS microdata -> CS panel | large | rust | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
+| 4. SDiD few markets | medium | python | `5_sensitivity_to_zeta_omega` (43%) | `3_in_time_placebo` (39%) | `1_sdid_jackknife_variance` (9%) |
 | 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (40%) | `3_in_time_placebo` (29%) | `1_sdid_jackknife_variance` (16%) |
 | 5. Reversible dCDH | single | python | `1_dcdh_fit_Lmax3_survey_TSL` (62%) | `4_heterogeneity_refit` (37%) | `3_honest_did_on_placebo` (1%) |
 | 5. Reversible dCDH | single | rust | `1_dcdh_fit_Lmax3_survey_TSL` (61%) | `4_heterogeneity_refit` (38%) | `3_honest_did_on_placebo` (1%) |
@@ -111,13 +112,14 @@ any rerun):
   SunAbraham (both ~30-40%); the Rust backend shifts relative shares
   more than totals. CS fit with `n_bootstrap=999` is well-vectorized and
   sits well below both in the ranking.
-- **Brand awareness survey.** At small scale HonestDiD dominates; at
-  medium the multi-outcome loop and the JK1 replicate path are
-  comparable; at large the JK1 path is the single top phase under both
-  backends. Python and Rust totals on this chain are within noise; the
-  JK1 replicate-fit loop is not Rust-accelerated, so the FFI crossings
-  cost approximately what they save - a neutral outcome, not a
-  regression.
+- **Brand awareness survey.** At small scale HonestDiD dominates. From
+  medium onwards the JK1 replicate-weight path is the clear top phase
+  under both backends (2-3x the multi-outcome loop on Python at medium;
+  still the top phase on Rust, though by a smaller margin). At
+  large it consolidates as the single dominant phase. Python and Rust
+  totals on this chain are within noise; the JK1 replicate-fit loop is
+  not Rust-accelerated, so the FFI crossings cost approximately what
+  they save - a neutral outcome, not a regression.
 - **BRFSS.** `aggregate_survey` share of total grows with scale and is
   effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
   SunAbraham, HonestDiD) are a fraction of a second combined.

From 6b1715a92c780205efef19666a799c87ca2e9686 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 14:24:22 -0400
Subject: [PATCH 08/15] Fix two P1 methodology issues: clean
 with/without-covariates + survey-aware heterogeneity

CI re-review surfaced two P1 methodology defects where the benchmark
was not actually measuring what the scenario/findings claimed:

1. Staggered CS with/without-covariates comparison. Phase 7 ran with
   estimation_method="reg" and n_bootstrap=199, while phase 2 ran with
   estimation_method="dr" and n_bootstrap=999. That confounded two axes
   at once (estimation method and inference workload), so the
   Baker-mandated comparison was not clean. Phase 7 now matches phase 2
   exactly except for the covariates argument. CS with
   estimation_method="dr" and no covariates is a supported path
   (verified by spot-check). The resulting ~5x bootstrap workload
   increase in phase 7 (n_bootstrap 199 -> 999) raises the
   staggered-large chain total slightly, which is already reflected in
   the regenerated tables.
2. dCDH heterogeneity refit without survey_design. The scenario
   framing and the performance-plan TSL-sharing optimization
   recommendation both assume the refit runs under the same survey
   design as the main fit. The refit was passing no survey_design, so
   the measured timing did not support the documented conclusion. The
   refit now uses the same SurveyDesign(weights="pw",
   strata="stratum", psu="psu") as the main fit; this configuration is
   confirmed supported (not NotImplementedError-gated on this shape).
   The refit is now as expensive as the main fit (was ~40% of chain,
   now ~50%), which makes the TSL-sharing optimization recommendation
   strictly stronger.
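
The "clean comparison" requirement in item 1 can be expressed as a
small check that two phase configurations differ in exactly one
argument. This is only an illustrative sketch: the config dicts, the
covariate names, and the differing_keys helper are hypothetical and
are not part of the benchmark scripts.

```python
_MISSING = object()  # sentinel so an absent key differs from an explicit None


def differing_keys(a: dict, b: dict) -> set:
    """Return the keys whose values differ between two phase configs."""
    return {
        k
        for k in a.keys() | b.keys()
        if a.get(k, _MISSING) != b.get(k, _MISSING)
    }


# Hypothetical configs mirroring the settings described above.
# The covariate names are placeholders.
phase2 = {
    "estimation_method": "dr",
    "n_bootstrap": 999,
    "covariates": ["x1", "x2"],
}
phase7_before = {
    "estimation_method": "reg",
    "n_bootstrap": 199,
    "covariates": None,
}
phase7_after = {
    "estimation_method": "dr",
    "n_bootstrap": 999,
    "covariates": None,
}

# Before the fix, three axes moved at once; after it, only covariates.
confounded = differing_keys(phase2, phase7_before)
clean = differing_keys(phase2, phase7_after)
```

With these inputs, confounded contains all three keys while clean
contains only "covariates", which is the property the fixed phase 7 is
meant to have.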

Narrative updated against the freshly regenerated tables:

- Staggered campaign: removed the "Rust at large is tied with
  SunAbraham" claim - ImputationDiD still leads under both backends.
- Reversible dCDH: updated the ~60/40 split claim to the new
  ~50/50 split and called out the TSL-sharing opportunity more
  directly.
- Top-hotspots table row 4 strengthened to reflect the now-equal
  phase costs.

All other narrative claims were cross-checked against the new data and
still hold (BRFSS ~24s at 1M rows, staggered single-digit scale
multiplier, SDiD Rust gap stable, peak RSS under 600 MB, etc.).
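
For reference, "share of chain" figures like the ones quoted above can
be recomputed from any baseline JSON in this patch. The sketch below
assumes shares are taken over the sum of successful phase timings
(rather than total_seconds, which may include overhead outside the
phases); phase_shares is a hypothetical helper, not part of the
benchmark harness. The sample values are rounded from the regenerated
brand_awareness_survey_large_python.json.

```python
def phase_shares(baseline: dict) -> list:
    """Rank phases by their share of the summed successful phase time."""
    ok_phases = {
        name: info["seconds"]
        for name, info in baseline["phases"].items()
        if info["ok"]
    }
    total = sum(ok_phases.values())
    shares = [(name, secs / total) for name, secs in ok_phases.items()]
    # Largest share first, so shares[0] is the top phase of the chain.
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Same schema as the baseline files; timings rounded to 3 decimals.
sample = {
    "scenario": "brand_awareness_survey_large",
    "backend": "python",
    "phases": {
        "1_naive_fit_no_survey_design": {"seconds": 0.013, "ok": True, "error": None},
        "2_tsl_strata_psu_fpc": {"seconds": 0.036, "ok": True, "error": None},
        "3_replicate_weights_jk1": {"seconds": 0.452, "ok": True, "error": None},
        "4_multi_outcome_loop_3_metrics": {"seconds": 0.228, "ok": True, "error": None},
        "5_check_parallel_trends": {"seconds": 0.039, "ok": True, "error": None},
        "6_placebo_refit_pre_period": {"seconds": 0.011, "ok": True, "error": None},
        "7_event_study_plus_honest_did": {"seconds": 0.132, "ok": True, "error": None},
    },
}

ranked = phase_shares(sample)
top_name, top_share = ranked[0]
```

On this sample the JK1 replicate-weight phase ranks first, with the
multi-outcome loop second. To run the helper on a real file, load it
with json.load and pass the resulting dict.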

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../brand_awareness_survey_large_python.json  | 22 +++---
 .../brand_awareness_survey_large_rust.json    | 22 +++---
 .../brand_awareness_survey_medium_python.json | 22 +++---
 .../brand_awareness_survey_medium_rust.json   | 22 +++---
 .../brand_awareness_survey_small_python.json  | 22 +++---
 .../brand_awareness_survey_small_rust.json    | 22 +++---
 .../baselines/brfss_panel_large_python.json   | 20 +++---
 .../baselines/brfss_panel_large_rust.json     | 20 +++---
 .../baselines/brfss_panel_medium_python.json  | 20 +++---
 .../baselines/brfss_panel_medium_rust.json    | 20 +++---
 .../baselines/brfss_panel_small_python.json   | 20 +++---
 .../baselines/brfss_panel_small_rust.json     | 20 +++---
 .../campaign_staggered_large_python.json      | 24 +++----
 .../campaign_staggered_large_rust.json        | 24 +++----
 .../campaign_staggered_medium_python.json     | 24 +++----
 .../campaign_staggered_medium_rust.json       | 24 +++----
 .../campaign_staggered_small_python.json      | 24 +++----
 .../campaign_staggered_small_rust.json        | 24 +++----
 .../baselines/dose_response_python.json       | 20 +++---
 .../baselines/dose_response_rust.json         | 20 +++---
 .../baselines/geo_few_markets_large_rust.json | 20 +++---
 .../geo_few_markets_medium_python.json        | 20 +++---
 .../geo_few_markets_medium_rust.json          | 20 +++---
 .../geo_few_markets_small_python.json         | 20 +++---
 .../baselines/geo_few_markets_small_rust.json | 20 +++---
 .../baselines/reversible_dcdh_python.json     | 16 ++---
 .../baselines/reversible_dcdh_rust.json       | 16 ++---
 .../speed_review/bench_campaign_staggered.py  |  8 ++-
 .../speed_review/bench_reversible_dcdh.py     | 12 ++--
 docs/performance-plan.md                      | 72 +++++++++----------
 30 files changed, 334 insertions(+), 326 deletions(-)

diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
index 20d55086..5eb3d235 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.026685333,
+  "total_seconds": 0.9114786659999998,
   "memory": {
     "available": true,
-    "start_mb": 195.45,
-    "peak_mb": 335.25,
-    "growth_mb": 139.8,
+    "start_mb": 192.66,
+    "peak_mb": 335.36,
+    "growth_mb": 142.7,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.01289908400000006,
+      "seconds": 0.01250487499999986,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.030157082999999973,
+      "seconds": 0.03581945799999997,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.5706585830000002,
+      "seconds": 0.45216445799999994,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.21575479099999972,
+      "seconds": 0.22825654099999992,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.039158959000000326,
+      "seconds": 0.03940495799999999,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.02081145899999992,
+      "seconds": 0.01122979199999996,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.13721408299999993,
+      "seconds": 0.13208041700000006,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
index 83794f77..d2290030 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.03563075,
+  "total_seconds": 0.813743375,
   "memory": {
     "available": true,
-    "start_mb": 194.5,
-    "peak_mb": 338.94,
-    "growth_mb": 144.44,
+    "start_mb": 192.84,
+    "peak_mb": 336.91,
+    "growth_mb": 144.06,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.01294266700000013,
+      "seconds": 0.013944833000000045,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.031624374999999816,
+      "seconds": 0.026042083000000105,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.5619407919999999,
+      "seconds": 0.359846584,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.263706,
+      "seconds": 0.2136102500000001,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.009824249999999868,
+      "seconds": 0.033997333000000296,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.012082083000000132,
+      "seconds": 0.026375041999999738,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.1435007500000003,
+      "seconds": 0.13991266599999985,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
index 5fcd8624..178e80d2 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.763684958,
+  "total_seconds": 0.521520459,
   "memory": {
     "available": true,
-    "start_mb": 135.03,
-    "peak_mb": 193.52,
-    "growth_mb": 58.48,
+    "start_mb": 134.75,
+    "peak_mb": 182.52,
+    "growth_mb": 47.77,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.01204445799999998,
+      "seconds": 0.01229520900000003,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03723474999999998,
+      "seconds": 0.03135670800000001,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.377466875,
+      "seconds": 0.1716489579999999,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.14207241700000006,
+      "seconds": 0.09223670900000003,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.014263458000000062,
+      "seconds": 0.029447917000000157,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.06729445800000011,
+      "seconds": 0.05062770800000016,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.11328508300000006,
+      "seconds": 0.13389041700000015,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
index c3a54df7..6fec4e10 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.5770147080000001,
+  "total_seconds": 0.491930875,
   "memory": {
     "available": true,
-    "start_mb": 135.83,
-    "peak_mb": 189.91,
-    "growth_mb": 54.08,
+    "start_mb": 133.44,
+    "peak_mb": 186.23,
+    "growth_mb": 52.8,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.014149625000000055,
+      "seconds": 0.011377042000000004,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03726633299999993,
+      "seconds": 0.034086916999999994,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.20208641699999985,
+      "seconds": 0.13685675000000008,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.14449316599999995,
+      "seconds": 0.16081833299999992,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.022887250000000137,
+      "seconds": 0.027104291999999974,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.07035533400000005,
+      "seconds": 0.05256474999999994,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.085751833,
+      "seconds": 0.06910883300000004,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
index c9f8b1bd..ed44e410 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.21190179099999995,
+  "total_seconds": 0.21071716699999998,
   "memory": {
     "available": true,
-    "start_mb": 115.28,
-    "peak_mb": 131.0,
-    "growth_mb": 15.72,
+    "start_mb": 116.28,
+    "peak_mb": 128.09,
+    "growth_mb": 11.81,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0019152919999999574,
+      "seconds": 0.0017718749999999783,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.007577749999999939,
+      "seconds": 0.005618791999999928,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.02303670899999999,
+      "seconds": 0.017142625000000078,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.052659917000000056,
+      "seconds": 0.06763025,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.009975750000000061,
+      "seconds": 0.00958991599999992,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.027377666999999994,
+      "seconds": 0.02613770900000001,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08933650000000004,
+      "seconds": 0.08281808300000004,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
index 26fbe778..80ee57e8 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.19699320799999998,
+  "total_seconds": 0.193722167,
   "memory": {
     "available": true,
-    "start_mb": 115.36,
-    "peak_mb": 130.86,
-    "growth_mb": 15.5,
+    "start_mb": 115.44,
+    "peak_mb": 128.27,
+    "growth_mb": 12.83,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0021563339999999265,
+      "seconds": 0.0018566250000000561,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.006268332999999959,
+      "seconds": 0.005901209000000018,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.020846708999999963,
+      "seconds": 0.017941708999999917,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.027288292000000047,
+      "seconds": 0.05830100000000005,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.01103129199999997,
+      "seconds": 0.009343709000000033,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.02803533300000005,
+      "seconds": 0.02651229200000005,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.10135274999999999,
+      "seconds": 0.07384633299999999,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_python.json b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
index ce3c16d3..cf5d5f70 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 24.746183166,
+  "total_seconds": 24.483619208000004,
   "memory": {
     "available": true,
-    "start_mb": 395.08,
-    "peak_mb": 426.41,
-    "growth_mb": 31.33,
+    "start_mb": 396.47,
+    "peak_mb": 426.27,
+    "growth_mb": 29.8,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 24.637322375,
+      "seconds": 24.412461416,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012604250000002537,
+      "seconds": 0.012729583000002265,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.124999999215561e-06,
+      "seconds": 2.5410000006331757e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0017230830000016795,
+      "seconds": 0.0017426250000056598,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.09404904200000175,
+      "seconds": 0.05624137499999904,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00046616599999538266,
+      "seconds": 0.0004348329999999123,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
index 34adaf24..1fa3fdc1 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 25.405902042,
+  "total_seconds": 24.167863249999996,
   "memory": {
     "available": true,
-    "start_mb": 413.62,
-    "peak_mb": 446.14,
-    "growth_mb": 32.52,
+    "start_mb": 400.59,
+    "peak_mb": 430.67,
+    "growth_mb": 30.08,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 25.287192124999997,
+      "seconds": 24.099972291,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.014137458000000436,
+      "seconds": 0.012276292000002798,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.750000000162345e-06,
+      "seconds": 2.4170000045842244e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0018409999999988713,
+      "seconds": 0.0016439999999988686,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.10217112499999814,
+      "seconds": 0.05347887499999615,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0005494999999982042,
+      "seconds": 0.000472500000000764,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
index 42ccb4d1..35d48ac1 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 6.403434375,
+  "total_seconds": 6.2413335839999995,
   "memory": {
     "available": true,
-    "start_mb": 188.7,
-    "peak_mb": 209.17,
-    "growth_mb": 20.47,
+    "start_mb": 188.67,
+    "peak_mb": 210.42,
+    "growth_mb": 21.75,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.296972209000001,
+      "seconds": 6.143250959000001,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012381040999999371,
+      "seconds": 0.012518041999999951,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.3749999985843715e-06,
+      "seconds": 2.2499999996483666e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0018757499999999538,
+      "seconds": 0.0016282499999995537,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.09184245799999857,
+      "seconds": 0.08371516599999929,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0003286660000014763,
+      "seconds": 0.00021379199999849163,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
index 13ceba3d..653c78a6 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 6.479580708,
+  "total_seconds": 6.1959236660000006,
   "memory": {
     "available": true,
-    "start_mb": 193.17,
-    "peak_mb": 223.38,
-    "growth_mb": 30.2,
+    "start_mb": 197.81,
+    "peak_mb": 218.22,
+    "growth_mb": 20.41,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.371709583000001,
+      "seconds": 6.119725833,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.019261959000001383,
+      "seconds": 0.012326082999999599,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.4579999990947954e-06,
+      "seconds": 2.1249999999639613e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0017385829999998492,
+      "seconds": 0.0017214169999988371,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.08646266600000097,
+      "seconds": 0.061877875000000415,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00039937500000064574,
+      "seconds": 0.000265209000000155,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_python.json b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
index 9fe54ea4..8a70b38e 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.6725981669999999,
+  "total_seconds": 1.6215220409999997,
   "memory": {
     "available": true,
-    "start_mb": 121.8,
-    "peak_mb": 133.42,
-    "growth_mb": 11.62,
+    "start_mb": 122.02,
+    "peak_mb": 134.12,
+    "growth_mb": 12.11,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.594078084,
+      "seconds": 1.5260382080000001,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.014796582999999863,
+      "seconds": 0.015467790999999842,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.457999999982974e-06,
+      "seconds": 2.3330000002985685e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.004001917000000077,
+      "seconds": 0.003910083999999703,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05947075000000002,
+      "seconds": 0.07581529200000015,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00023887499999997175,
+      "seconds": 0.000284333000000192,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
index 143ae160..e5caadeb 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.7887954590000001,
+  "total_seconds": 1.633015125,
   "memory": {
     "available": true,
-    "start_mb": 121.34,
-    "peak_mb": 135.81,
-    "growth_mb": 14.47,
+    "start_mb": 121.28,
+    "peak_mb": 134.62,
+    "growth_mb": 13.34,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.660531708,
+      "seconds": 1.549758459,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.018207874999999873,
+      "seconds": 0.015149292000000258,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.5000000000451337e-06,
+      "seconds": 2.166999999886343e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.004105125000000154,
+      "seconds": 0.003892208000000341,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.10564616699999974,
+      "seconds": 0.06395783299999991,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0002969579999998473,
+      "seconds": 0.0002466249999999448,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
index 2cf3456d..4c8a1d4c 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.3639277499999998,
+  "total_seconds": 1.2753887090000002,
   "memory": {
     "available": true,
-    "start_mb": 236.25,
-    "peak_mb": 473.64,
-    "growth_mb": 237.39,
+    "start_mb": 231.02,
+    "peak_mb": 482.28,
+    "growth_mb": 251.27,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.022353332999999864,
+      "seconds": 0.017780917000000063,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.18328262499999992,
+      "seconds": 0.16856595800000007,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.5839999998898975e-06,
+      "seconds": 3.2920000001546157e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002633374999999827,
+      "seconds": 0.0024221659999996703,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.35018841700000003,
+      "seconds": 0.2988456250000002,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.7134287920000002,
+      "seconds": 0.6643596249999999,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.09199199999999985,
+      "seconds": 0.12336495800000025,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.900000000012227e-05,
+      "seconds": 3.454100000022109e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
index d7b62e2d..6b7ff815 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.566890625,
+  "total_seconds": 1.2563857909999998,
   "memory": {
     "available": true,
-    "start_mb": 264.23,
-    "peak_mb": 575.98,
-    "growth_mb": 311.75,
+    "start_mb": 264.81,
+    "peak_mb": 588.48,
+    "growth_mb": 323.67,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.019707125000000048,
+      "seconds": 0.01783779200000013,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.17541550000000017,
+      "seconds": 0.16583404200000018,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 4.4999999997408224e-06,
+      "seconds": 3.374999999916639e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002558749999999943,
+      "seconds": 0.0027218329999998403,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.48181645799999995,
+      "seconds": 0.4118338749999997,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.5781936660000002,
+      "seconds": 0.5377156250000001,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.309143916,
+      "seconds": 0.12039454100000002,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 4.4332999999951994e-05,
+      "seconds": 3.916700000017315e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
index ca721a1e..8cb2f00a 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.8150147909999998,
+  "total_seconds": 0.7292697910000001,
   "memory": {
     "available": true,
-    "start_mb": 148.19,
-    "peak_mb": 232.55,
-    "growth_mb": 84.36,
+    "start_mb": 145.41,
+    "peak_mb": 230.08,
+    "growth_mb": 84.67,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.014109125000000056,
+      "seconds": 0.012401541999999877,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.10445108399999992,
+      "seconds": 0.09787991699999998,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.208999999948503e-06,
+      "seconds": 3.250000000010189e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002572291999999976,
+      "seconds": 0.0023612090000000308,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.35729070800000007,
+      "seconds": 0.2553969999999999,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.2793956659999999,
+      "seconds": 0.29284170899999995,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.05713354200000009,
+      "seconds": 0.06833729200000005,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 4.387499999980449e-05,
+      "seconds": 3.6750000000029814e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
index 4c49138a..24917edf 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.789889083,
+  "total_seconds": 0.7339207910000001,
   "memory": {
     "available": true,
-    "start_mb": 155.88,
-    "peak_mb": 263.52,
-    "growth_mb": 107.64,
+    "start_mb": 155.03,
+    "peak_mb": 261.97,
+    "growth_mb": 106.94,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.013083167000000007,
+      "seconds": 0.0123550830000001,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.10005033299999999,
+      "seconds": 0.09760662500000006,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.5830000000292017e-06,
+      "seconds": 3.2920000001546157e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002644124999999997,
+      "seconds": 0.0024585420000000635,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.390922083,
+      "seconds": 0.28781725,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.22787454100000004,
+      "seconds": 0.26540537500000005,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.05525204100000014,
+      "seconds": 0.06822562499999996,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 5.15839999999379e-05,
+      "seconds": 3.799999999998249e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
index 5bdf30ec..a95a0582 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.52634175,
+  "total_seconds": 0.503552917,
   "memory": {
     "available": true,
-    "start_mb": 114.67,
-    "peak_mb": 141.31,
-    "growth_mb": 26.64,
+    "start_mb": 114.8,
+    "peak_mb": 140.5,
+    "growth_mb": 25.7,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.00877829200000002,
+      "seconds": 0.008558832999999932,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06917316700000009,
+      "seconds": 0.06354612500000001,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.8340000000071086e-06,
+      "seconds": 3.250000000010189e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.007260041999999967,
+      "seconds": 0.005010249999999994,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.19025404200000007,
+      "seconds": 0.16441000000000006,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.2173090419999999,
+      "seconds": 0.226042709,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.03352058300000005,
+      "seconds": 0.03592104199999979,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.829200000016186e-05,
+      "seconds": 5.5042000000060654e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
index e9fef367..93f664ef 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.502302,
+  "total_seconds": 0.49853625,
   "memory": {
     "available": true,
-    "start_mb": 114.81,
-    "peak_mb": 148.31,
-    "growth_mb": 33.5,
+    "start_mb": 113.97,
+    "peak_mb": 147.67,
+    "growth_mb": 33.7,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.00718679099999997,
+      "seconds": 0.007039999999999935,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06477549999999999,
+      "seconds": 0.06317745800000008,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.7079999999356517e-06,
+      "seconds": 3.124999999992717e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.004685166000000018,
+      "seconds": 0.00467095900000003,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.14403383400000003,
+      "seconds": 0.14945124999999992,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.2496609999999999,
+      "seconds": 0.23735937499999993,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.031909707999999926,
+      "seconds": 0.03678574999999995,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 4.058300000009396e-05,
+      "seconds": 4.245800000002298e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_python.json b/benchmarks/speed_review/baselines/dose_response_python.json
index 2e26b0c6..b5d588ba 100644
--- a/benchmarks/speed_review/baselines/dose_response_python.json
+++ b/benchmarks/speed_review/baselines/dose_response_python.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.6016350409999999,
+  "total_seconds": 0.5965390830000001,
   "memory": {
     "available": true,
-    "start_mb": 114.0,
-    "peak_mb": 120.59,
-    "growth_mb": 6.59,
+    "start_mb": 113.41,
+    "peak_mb": 120.05,
+    "growth_mb": 6.64,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.154083125,
+      "seconds": 0.15369475,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007949159999999234,
+      "seconds": 0.0007427500000000142,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.15003725,
+      "seconds": 0.1483430830000001,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.001479000000000008,
+      "seconds": 0.001571500000000059,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14721650000000008,
+      "seconds": 0.14602733299999993,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14802054200000003,
+      "seconds": 0.146155708,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_rust.json b/benchmarks/speed_review/baselines/dose_response_rust.json
index 143e8f46..f79982c6 100644
--- a/benchmarks/speed_review/baselines/dose_response_rust.json
+++ b/benchmarks/speed_review/baselines/dose_response_rust.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.6095052919999999,
+  "total_seconds": 0.592988541,
   "memory": {
     "available": true,
-    "start_mb": 113.77,
-    "peak_mb": 123.08,
-    "growth_mb": 9.31,
+    "start_mb": 113.7,
+    "peak_mb": 120.94,
+    "growth_mb": 7.23,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.15594399999999997,
+      "seconds": 0.151152917,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0008133339999999434,
+      "seconds": 0.0007949580000000678,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.15459916699999998,
+      "seconds": 0.14909695900000008,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0017282920000000201,
+      "seconds": 0.0015252080000000001,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14681695900000014,
+      "seconds": 0.14430462500000008,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14959908300000002,
+      "seconds": 0.14610904199999997,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
index d601f042..fb5e8ca1 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.26634891699999985,
+  "total_seconds": 0.26278650000000003,
   "memory": {
     "available": true,
-    "start_mb": 118.05,
-    "peak_mb": 118.41,
-    "growth_mb": 0.36,
+    "start_mb": 117.98,
+    "peak_mb": 118.33,
+    "growth_mb": 0.34,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.0427798330000001,
+      "seconds": 0.04081954100000007,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.038390874999999935,
+      "seconds": 0.03721654100000005,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.07815391699999985,
+      "seconds": 0.07710412500000008,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0006825000000001413,
+      "seconds": 0.0006546249999999088,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.10630537500000004,
+      "seconds": 0.10694591700000011,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 3.249999999987985e-05,
+      "seconds": 4.191700000011345e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
index 13047b5f..1d0101a1 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 4.089486916000001,
+  "total_seconds": 4.064775749999999,
   "memory": {
     "available": true,
-    "start_mb": 143.0,
-    "peak_mb": 151.78,
-    "growth_mb": 8.78,
+    "start_mb": 144.16,
+    "peak_mb": 152.61,
+    "growth_mb": 8.45,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.37269037500000035,
+      "seconds": 0.36629420799999934,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.36910237499999976,
+      "seconds": 0.3713431250000001,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.5911891659999995,
+      "seconds": 1.594468708,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.000975166999999999,
+      "seconds": 0.00075208300000007,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.7554996660000004,
+      "seconds": 1.731888875000001,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.533400000004349e-05,
+      "seconds": 2.4707999999762364e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
index 4fd4759d..2e35b56a 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.12066608300000004,
+  "total_seconds": 0.11511983300000006,
   "memory": {
     "available": true,
-    "start_mb": 116.27,
-    "peak_mb": 116.7,
-    "growth_mb": 0.44,
+    "start_mb": 116.62,
+    "peak_mb": 116.98,
+    "growth_mb": 0.36,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.02115270800000002,
+      "seconds": 0.019235208000000004,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.023816833000000037,
+      "seconds": 0.022819749999999916,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.025457375000000004,
+      "seconds": 0.024750457999999975,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.000645875000000018,
+      "seconds": 0.0006058339999999163,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.04956016600000002,
+      "seconds": 0.047683790999999975,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.7333999999989977e-05,
+      "seconds": 2.1332999999956748e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
index f0790f93..be2be067 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 3.931368708,
+  "total_seconds": 3.8045968329999997,
   "memory": {
     "available": true,
-    "start_mb": 114.03,
-    "peak_mb": 124.31,
-    "growth_mb": 10.28,
+    "start_mb": 113.8,
+    "peak_mb": 124.0,
+    "growth_mb": 10.2,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.568860417,
+      "seconds": 0.606876041,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.8022329579999998,
+      "seconds": 0.6082665840000001,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.204978708,
+      "seconds": 1.2295280419999999,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.001044000000000267,
+      "seconds": 0.0008653330000001347,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.3541902500000003,
+      "seconds": 1.35899475,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 5.620800000016857e-05,
+      "seconds": 6.066699999962566e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
index b5bef931..e3af0a7b 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.04143637499999997,
+  "total_seconds": 0.04004791699999999,
   "memory": {
     "available": true,
-    "start_mb": 113.59,
-    "peak_mb": 115.17,
-    "growth_mb": 1.58,
+    "start_mb": 113.78,
+    "peak_mb": 115.23,
+    "growth_mb": 1.45,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.007861707999999967,
+      "seconds": 0.007413500000000073,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.012826667000000014,
+      "seconds": 0.012767541999999965,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.008510917000000062,
+      "seconds": 0.008145540999999978,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0010455000000000325,
+      "seconds": 0.0008106660000000154,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.011159249999999954,
+      "seconds": 0.01088683300000004,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.7208000000000787e-05,
+      "seconds": 2.0916999999953667e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_python.json b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
index 7bbf41df..f653a8d6 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_python.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.591051,
+  "total_seconds": 0.6590325410000001,
   "memory": {
     "available": true,
-    "start_mb": 113.36,
-    "peak_mb": 132.27,
-    "growth_mb": 18.91,
+    "start_mb": 113.59,
+    "peak_mb": 132.88,
+    "growth_mb": 19.28,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.36771666699999994,
+      "seconds": 0.312676917,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.5420000000210266e-06,
+      "seconds": 1.4170000000035543e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.0036219160000000583,
+      "seconds": 0.004951042000000072,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.21970824999999983,
+      "seconds": 0.34140066700000005,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
index ded7cebe..1dabe608 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.5458365,
+  "total_seconds": 0.6441262499999999,
   "memory": {
     "available": true,
-    "start_mb": 113.38,
-    "peak_mb": 134.55,
-    "growth_mb": 21.17,
+    "start_mb": 113.36,
+    "peak_mb": 134.64,
+    "growth_mb": 21.28,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.33496675,
+      "seconds": 0.3220115,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.3750000000811724e-06,
+      "seconds": 1.5410000000493085e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.005485000000000073,
+      "seconds": 0.003777833000000008,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.20537233300000002,
+      "seconds": 0.3183330000000001,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/bench_campaign_staggered.py b/benchmarks/speed_review/bench_campaign_staggered.py
index 7a5e976f..1e213858 100644
--- a/benchmarks/speed_review/bench_campaign_staggered.py
+++ b/benchmarks/speed_review/bench_campaign_staggered.py
@@ -91,9 +91,13 @@ def imputation():
         results["bjs"] = bjs.fit(**fit_kwargs, aggregate="event_study")
 
     def cs_no_covariates():
+        # Match phase 2's estimator config exactly so that `covariates`
+        # is the only axis that varies. This is the Baker-mandated
+        # with/without comparison, which is only meaningful when the
+        # inference workload is held constant.
         cs = CallawaySantAnna(
-            control_group="never_treated", estimation_method="reg",
-            cluster="unit", n_bootstrap=199, seed=123,
+            control_group="never_treated", estimation_method="dr",
+            cluster="unit", n_bootstrap=999, seed=123,
         )
         results["cs_nocov"] = cs.fit(**fit_kwargs, aggregate="all")
 
diff --git a/benchmarks/speed_review/bench_reversible_dcdh.py b/benchmarks/speed_review/bench_reversible_dcdh.py
index 80bd63ca..5788b02c 100644
--- a/benchmarks/speed_review/bench_reversible_dcdh.py
+++ b/benchmarks/speed_review/bench_reversible_dcdh.py
@@ -55,12 +55,12 @@ def main():
         data=data, outcome="outcome", group="group", time="period",
         treatment="treatment",
     )
+    sd = SurveyDesign(
+        weights="pw", strata="stratum", psu="psu",
+    )
 
     def dcdh_fit_lmax3():
         est = ChaisemartinDHaultfoeuille(seed=123)
-        sd = SurveyDesign(
-            weights="pw", strata="stratum", psu="psu",
-        )
         results["dcdh"] = est.fit(
             **fit_kwargs, L_max=3, survey_design=sd,
         )
@@ -83,9 +83,13 @@ def honest_placebo():
         results["honest"] = out
 
     def heterogeneity_refit():
+        # Use the same SurveyDesign as the main fit; the scenario framing
+        # is the survey-TSL workflow, and the TSL-sharing optimization
+        # conclusion in performance-plan.md depends on both fits running
+        # under the same survey design.
         est = ChaisemartinDHaultfoeuille(seed=123)
         results["het"] = est.fit(
-            **fit_kwargs, L_max=3, heterogeneity="group",
+            **fit_kwargs, L_max=3, survey_design=sd, heterogeneity="group",
         )
 
     phases = [
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index 41697a3c..cea81132 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -41,20 +41,20 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 <!-- TABLE:start scale_sweep_totals -->
 | Scenario | Scale | Python (s) | Rust (s) | Py/Rust |
 |---|---|---:|---:|---:|
-| 1. Staggered campaign | small | 0.53 | 0.50 | 1.0x |
-|  | medium | 0.82 | 0.79 | 1.0x |
-|  | large | 1.36 | 1.57 | 0.9x |
-| 2. Brand awareness survey | small | 0.21 | 0.20 | 1.1x |
-|  | medium | 0.76 | 0.58 | 1.3x |
-|  | large | 1.03 | 1.04 | 1.0x |
-| 3. BRFSS microdata -> CS panel | small | 1.67 | 1.79 | 0.9x |
-|  | medium | 6.40 | 6.48 | 1.0x |
-|  | large | 24.75 | 25.41 | 1.0x |
-| 4. SDiD few markets | small | 3.93 | 0.04 | 94.9x |
-|  | medium | 4.09 | 0.12 | 33.9x |
-|  | large | skip | 0.27 | - |
-| 5. Reversible dCDH | single | 0.59 | 0.55 | 1.1x |
-| 6. Pricing dose-response | single | 0.60 | 0.61 | 1.0x |
+| 1. Staggered campaign | small | 0.50 | 0.50 | 1.0x |
+|  | medium | 0.73 | 0.73 | 1.0x |
+|  | large | 1.28 | 1.26 | 1.0x |
+| 2. Brand awareness survey | small | 0.21 | 0.19 | 1.1x |
+|  | medium | 0.52 | 0.49 | 1.1x |
+|  | large | 0.91 | 0.81 | 1.1x |
+| 3. BRFSS microdata -> CS panel | small | 1.62 | 1.63 | 1.0x |
+|  | medium | 6.24 | 6.20 | 1.0x |
+|  | large | 24.48 | 24.17 | 1.0x |
+| 4. SDiD few markets | small | 3.80 | 0.04 | 95.0x |
+|  | medium | 4.06 | 0.12 | 35.3x |
+|  | large | skip | 0.26 | - |
+| 5. Reversible dCDH | single | 0.66 | 0.64 | 1.0x |
+| 6. Pricing dose-response | single | 0.60 | 0.59 | 1.0x |
 <!-- TABLE:end scale_sweep_totals -->
 
 ### Scaling findings
@@ -90,18 +90,18 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 <!-- TABLE:start top_phases_by_scenario -->
 | Scenario | Scale | Backend | Top phase (%) | 2nd phase (%) | 3rd phase (%) |
 |---|---|---|---|---|---|
-| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (52%) | `5_sun_abraham_robustness` (26%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
-| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (37%) | `5_sun_abraham_robustness` (31%) | `7_cs_without_covariates` (20%) |
-| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (56%) | `4_multi_outcome_loop_3_metrics` (21%) | `7_event_study_plus_honest_did` (13%) |
-| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (54%) | `4_multi_outcome_loop_3_metrics` (25%) | `7_event_study_plus_honest_did` (14%) |
+| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (52%) | `5_sun_abraham_robustness` (23%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
+| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (43%) | `5_sun_abraham_robustness` (33%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
+| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (50%) | `4_multi_outcome_loop_3_metrics` (25%) | `7_event_study_plus_honest_did` (14%) |
+| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (44%) | `4_multi_outcome_loop_3_metrics` (26%) | `7_event_study_plus_honest_did` (17%) |
 | 3. BRFSS microdata -> CS panel | large | python | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
 | 3. BRFSS microdata -> CS panel | large | rust | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
-| 4. SDiD few markets | medium | python | `5_sensitivity_to_zeta_omega` (43%) | `3_in_time_placebo` (39%) | `1_sdid_jackknife_variance` (9%) |
-| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (40%) | `3_in_time_placebo` (29%) | `1_sdid_jackknife_variance` (16%) |
-| 5. Reversible dCDH | single | python | `1_dcdh_fit_Lmax3_survey_TSL` (62%) | `4_heterogeneity_refit` (37%) | `3_honest_did_on_placebo` (1%) |
-| 5. Reversible dCDH | single | rust | `1_dcdh_fit_Lmax3_survey_TSL` (61%) | `4_heterogeneity_refit` (38%) | `3_honest_did_on_placebo` (1%) |
+| 4. SDiD few markets | medium | python | `5_sensitivity_to_zeta_omega` (43%) | `3_in_time_placebo` (39%) | `2_sdid_bootstrap_variance_200` (9%) |
+| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (41%) | `3_in_time_placebo` (29%) | `1_sdid_jackknife_variance` (16%) |
+| 5. Reversible dCDH | single | python | `4_heterogeneity_refit` (52%) | `1_dcdh_fit_Lmax3_survey_TSL` (47%) | `3_honest_did_on_placebo` (1%) |
+| 5. Reversible dCDH | single | rust | `1_dcdh_fit_Lmax3_survey_TSL` (50%) | `4_heterogeneity_refit` (49%) | `3_honest_did_on_placebo` (1%) |
 | 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
-| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
+| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (25%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
 <!-- TABLE:end top_phases_by_scenario -->
 
 Per-scenario phase narrative (cross-check against the table above after
@@ -153,20 +153,20 @@ in `benchmarks/speed_review/baselines/mem_profile_brfss_large_<backend>.txt`.
 <!-- TABLE:start memory_by_scenario -->
 | Scenario | Scale | Py peak RSS (MB) | Py growth (MB) | Rust peak RSS (MB) | Rust growth (MB) |
 |---|---|---:|---:|---:|---:|
-| 1. Staggered campaign | small | 141 | 27 | 148 | 34 |
-|  | medium | 233 | 84 | 264 | 108 |
-|  | large | 474 | 237 | 576 | 312 |
-| 2. Brand awareness survey | small | 131 | 16 | 131 | 16 |
-|  | medium | 194 | 58 | 190 | 54 |
-|  | large | 335 | 140 | 339 | 144 |
-| 3. BRFSS microdata -> CS panel | small | 133 | 12 | 136 | 14 |
-|  | medium | 209 | 20 | 223 | 30 |
-|  | large | 426 | 31 | 446 | 33 |
-| 4. SDiD few markets | small | 124 | 10 | 115 | 2 |
-|  | medium | 152 | 9 | 117 | 0 |
+| 1. Staggered campaign | small | 140 | 26 | 148 | 34 |
+|  | medium | 230 | 85 | 262 | 107 |
+|  | large | 482 | 251 | 588 | 324 |
+| 2. Brand awareness survey | small | 128 | 12 | 128 | 13 |
+|  | medium | 183 | 48 | 186 | 53 |
+|  | large | 335 | 143 | 337 | 144 |
+| 3. BRFSS microdata -> CS panel | small | 134 | 12 | 135 | 13 |
+|  | medium | 210 | 22 | 218 | 20 |
+|  | large | 426 | 30 | 431 | 30 |
+| 4. SDiD few markets | small | 124 | 10 | 115 | 1 |
+|  | medium | 153 | 8 | 117 | 0 |
 |  | large | skip | skip | 118 | 0 |
-| 5. Reversible dCDH | single | 132 | 19 | 135 | 21 |
-| 6. Pricing dose-response | single | 121 | 7 | 123 | 9 |
+| 5. Reversible dCDH | single | 133 | 19 | 135 | 21 |
+| 6. Pricing dose-response | single | 120 | 7 | 121 | 7 |
 <!-- TABLE:end memory_by_scenario -->
 
 The ~115-130 MB floor is the Python + diff-diff + numpy import footprint;

From 09f24a80b1e0646b569faaf4464c975aef43fe89 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 14:25:01 -0400
Subject: [PATCH 09/15] Apply narrative updates that raced with linter on prior
 commit

Three narrative corrections that need to match the freshly regenerated
tables after the P1 methodology fixes:

- Staggered campaign narrative: under Rust at large scale ImputationDiD
  still leads SunAbraham (43% vs 33%), not "tied". Removed the tied
  language.
- Reversible dCDH narrative: main fit / heterogeneity refit split is
  now ~50/50 (was ~60/40 before the heterogeneity refit got a
  survey_design). Under Python the heterogeneity refit slightly edges
  out the main fit. Updated the narrative and strengthened the
  TSL-sharing opportunity wording.
- Top-hotspots table rows 2 and 4 updated to match.
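
The shares quoted above can be re-derived from a baseline file. A
minimal sketch, assuming the JSON layout used under
benchmarks/speed_review/baselines/ (`total_seconds` plus per-phase
`seconds`) and assuming shares are taken against `total_seconds`; the
`phase_shares` helper is hypothetical and not part of the repo, and the
figures are copied from the regenerated reversible_dcdh_python.json:

```python
def phase_shares(baseline):
    """Return (phase, percent of total_seconds) pairs, largest first."""
    total = baseline["total_seconds"]
    shares = {
        name: 100.0 * phase["seconds"] / total
        for name, phase in baseline["phases"].items()
    }
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)


# Values copied from the regenerated reversible_dcdh_python.json baseline.
baseline = {
    "total_seconds": 0.6590325410000001,
    "phases": {
        "1_dcdh_fit_Lmax3_survey_TSL": {"seconds": 0.312676917},
        "2_inspect_placebo_and_summary": {"seconds": 1.4170000000035543e-06},
        "3_honest_did_on_placebo": {"seconds": 0.004951042000000072},
        "4_heterogeneity_refit": {"seconds": 0.34140066700000005},
    },
}

ranked = phase_shares(baseline)
for name, pct in ranked:
    print(f"{name}: {pct:.0f}%")
```

For this file the heterogeneity refit ranks first at about 52%, ahead
of the main fit at about 47%, which is the Python row of the
top-phases table.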

Prior commit 6b1715a was meant to include these edits, but they raced
with a linter refresh and were silently dropped. This commit restores
them.

Docs only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/performance-plan.md | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index cea81132..c9598e46 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -107,11 +107,11 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 Per-scenario phase narrative (cross-check against the table above after
 any rerun):
 
-- **Staggered campaign.** ImputationDiD robustness is the dominant phase
-  under Python at every scale. Under Rust at large scale it is tied with
-  SunAbraham (both ~30-40%); the Rust backend shifts relative shares
-  more than totals. CS fit with `n_bootstrap=999` is well-vectorized and
-  sits well below both in the ranking.
+- **Staggered campaign.** ImputationDiD robustness is the dominant
+  phase under both backends at every scale. Under Rust at large scale
+  SunAbraham narrows the gap but ImputationDiD still leads. CS fit with
+  `n_bootstrap=999` (both with and without covariates) is well-
+  vectorized and sits well below both in the ranking.
 - **Brand awareness survey.** At small scale HonestDiD dominates. From
   medium onwards the JK1 replicate-weight path is the clear top phase
   under both backends (2-3x the multi-outcome loop on Python at medium;
@@ -127,7 +127,10 @@ any rerun):
   are the dominant Python-backend phases at every scale; Rust eliminates
   both.
 - **Reversible dCDH.** Main fit and heterogeneity refit split the time
-  ~60/40, both rebuilding shared TSL scaffolding.
+  roughly evenly (45-52% each; under Python the heterogeneity refit
+  edges out the main fit slightly). Both fits run under the same
+  `SurveyDesign` and rebuild shared TSL scaffolding - that is the
+  optimization opportunity.
 - **Pricing dose-response.** Four spline fits account for essentially all
   runtime; linear scaling in variant count.
 
@@ -136,9 +139,9 @@ any rerun):
 | # | Location | Scenario + scale | Signal | Recommended action |
 |---|---|---|---|---|
 | 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS @ 1M rows | dominates BRFSS chain at all scales, ~100% at 1M rows | **Algorithmic fix, highest priority.** Function called once per (state, year) cell (500 calls); per-call work rebuilds stratum-PSU scaffolding every time. Precompute stratum indexes once at `aggregate_survey` top-level and reuse. |
-| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | dominant phase of the CS chain under Python at all scales; tied with SunAbraham under Rust at large | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. |
+| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | dominant phase of the CS chain under both backends at all scales; SunAbraham narrows the gap under Rust at large but ImputationDiD still leads | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. |
 | 3 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | SDiD python @ any scale | dominates Python SDiD at all scales | **Already ported to Rust.** Python fallback acceptable as a teaching/safety path; non-production for n > 100. Python skipped at n=500 (jackknife cost would exceed 4 minutes per run). |
-| 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit + heterogeneity | Reversible (single scale) | main fit + heterogeneity refit each rebuild TSL scaffolding | **Cache/precompute** - heterogeneity refit duplicates the main fit's TSL setup. Not P0; newer code path (v3.1) never optimization-reviewed. |
+| 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit + heterogeneity | Reversible (single scale) | main fit and survey-aware heterogeneity refit each rebuild TSL scaffolding; heterogeneity phase is as expensive as the main fit | **Cache/precompute** - heterogeneity refit duplicates the main fit's TSL setup under the same `SurveyDesign`. Not P0; newer code path (v3.1) never optimization-reviewed. |
 | 5 | `diff_diff/continuous_did.py` CDiD spline bootstrap | Dose-response (single scale) | four spline fits ~equal, linear in variant count | **Leave alone** - well under perceptible threshold. |
 
 ### Memory analysis

From 3d8c5ebe72bf4749ba1d5a60db0b6b9c6b1c6b87 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 14:41:04 -0400
Subject: [PATCH 10/15] Fix brand-medium narrative (P2) and mem-profile label
 (P3)

P2 - brand-awareness medium-scale prose: the narrative said JK1 is
"2-3x the multi-outcome loop on Python at medium" and "still the top
phase on Rust though by a smaller margin there." Committed baselines
contradict both: on Python/medium JK1 is about 1.9x the multi-outcome
loop (not 2-3x) with HonestDiD close behind; on Rust/medium the
multi-outcome loop is actually the top phase, with JK1 second. Only at
large does JK1 become the clearly dominant phase under both backends.
Prose rewritten to match.

P3 - mem_profile_brfss.py headline: the output labeled
stats[0].size_diff (largest single allocation site) as "net allocated
(end - start)", which sounds like the total retained delta. Relabeled
to "top single-site size diff" and added a "total net size diff across
all sites" line alongside it. Regenerated the committed text artifact
with the corrected labels.
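
For reference, a minimal sketch of the distinction behind the relabel,
assuming `stats` comes from `tracemalloc`'s `Snapshot.compare_to`. The
helper name `size_diff_summary` and the toy allocations are
illustrative only and are not part of mem_profile_brfss.py:

```python
import tracemalloc

MB = 1024 * 1024


def size_diff_summary(stats):
    """Return (total_net_mb, top_site_mb) for a list of StatisticDiff objects.

    total_net_mb sums size_diff over every allocation site; top_site_mb is
    only the first entry, i.e. the site with the largest absolute size_diff.
    """
    total_net = sum(s.size_diff for s in stats) / MB
    top_site = stats[0].size_diff / MB if stats else 0.0
    return total_net, top_site


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Two separate allocation sites (hypothetical workload for illustration).
big = [bytearray(256 * 1024) for _ in range(4)]   # roughly 1 MB at this line
small = [bytearray(64 * 1024) for _ in range(4)]  # roughly 0.25 MB at this line

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# compare_to sorts by absolute size_diff, largest first.
stats = after.compare_to(before, "lineno")
total_net_mb, top_site_mb = size_diff_summary(stats)

print(f"total net size diff across all sites: {total_net_mb:.2f} MB")
print(f"top single-site size diff: {top_site_mb:.2f} MB")
```

The two numbers differ whenever more than one site retains memory,
which is why reporting the top site under a "net allocated" label was
misleading.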

Docs-and-script-only. No baseline timing regeneration needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../baselines/mem_profile_brfss_large_rust.txt  |  5 +++--
 benchmarks/speed_review/mem_profile_brfss.py    |  9 +++++++--
 docs/performance-plan.md                        | 17 +++++++++--------
 3 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt b/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt
index 8696a700..1bc56a4e 100644
--- a/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt
+++ b/benchmarks/speed_review/baselines/mem_profile_brfss_large_rust.txt
@@ -5,7 +5,8 @@
 # output panel cells: 500
 
 # tracemalloc totals during aggregate_survey
-# net allocated (end - start): 0.1 MB (top site)
+# total net size diff across all sites: 0.5 MB
+# top single-site size diff: 0.15 MB
 # python peak traced: 84.2 MB
 # python current retained: 0.6 MB
 
@@ -15,7 +16,7 @@
 1                0.15         1521  <site-packages>/lib/python3.9/linecache.py:148
 2                0.04            7  <repo>/.venv/lib/python3.9/site-packages/pandas/core/internals/blocks.py:822
 3                0.02          440  <repo>/.venv/lib/python3.9/site-packages/pandas/core/sorting.py:637
-4                0.01           63  <site-packages>/lib/python3.9/abc.py:123
+4                0.01           64  <site-packages>/lib/python3.9/abc.py:123
 5                0.00            2  <repo>/.venv/lib/python3.9/site-packages/pandas/core/frame.py:12710
 6                0.00            2  <repo>/.venv/lib/python3.9/site-packages/pandas/core/frame.py:698
 7                0.00           16  <repo>/.venv/lib/python3.9/site-packages/pandas/core/indexes/base.py:5372
diff --git a/benchmarks/speed_review/mem_profile_brfss.py b/benchmarks/speed_review/mem_profile_brfss.py
index 36e2901b..25495dce 100644
--- a/benchmarks/speed_review/mem_profile_brfss.py
+++ b/benchmarks/speed_review/mem_profile_brfss.py
@@ -86,6 +86,11 @@ def main():
     current, peak = tracemalloc.get_traced_memory()
     tracemalloc.stop()
 
+    total_net_diff = sum(s.size_diff for s in stats) / (1024 * 1024)
+    top_site_diff = (
+        stats[0].size_diff / (1024 * 1024) if stats else 0.0
+    )
+
     lines = [
         f"# BRFSS-1M aggregate_survey allocation attribution",
         f"# backend: {backend}",
@@ -95,8 +100,8 @@ def main():
         f"# output panel cells: {len(panel)}",
         f"",
         f"# tracemalloc totals during aggregate_survey",
-        f"# net allocated (end - start): "
-        f"{(stats[0].size_diff if stats else 0)/1024/1024:.1f} MB (top site)",
+        f"# total net size diff across all sites: {total_net_diff:.1f} MB",
+        f"# top single-site size diff: {top_site_diff:.2f} MB",
         f"# python peak traced: {peak/1024/1024:.1f} MB",
         f"# python current retained: {current/1024/1024:.1f} MB",
         f"",
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index c9598e46..c1362857 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -112,14 +112,15 @@ any rerun):
   SunAbraham narrows the gap but ImputationDiD still leads. CS fit with
   `n_bootstrap=999` (both with and without covariates) is well-
   vectorized and sits well below both in the ranking.
-- **Brand awareness survey.** At small scale HonestDiD dominates. From
-  medium onwards the JK1 replicate-weight path is the clear top phase
-  under both backends (2-3x the multi-outcome loop on Python at medium;
-  still the top phase on Rust though by a smaller margin there). At
-  large it consolidates as the single dominant phase. Python and Rust
-  totals on this chain are within noise; the JK1 replicate-fit loop is
-  not Rust-accelerated, so the FFI crossings cost approximately what
-  they save - a neutral outcome, not a regression.
+- **Brand awareness survey.** At small scale HonestDiD dominates. At
+  medium the top three phases are packed closely: on Python JK1 leads
+  (about 1.9x the multi-outcome loop, with HonestDiD close behind);
+  on Rust the multi-outcome loop is slightly ahead of JK1. Only at
+  large does JK1 emerge as the clearly dominant phase under both
+  backends. Python and Rust totals on this chain are within noise; the
+  JK1 replicate-fit loop is not Rust-accelerated, so the FFI crossings
+  cost approximately what they save - a neutral outcome, not a
+  regression.
 - **BRFSS.** `aggregate_survey` share of total grows with scale and is
   effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
   SunAbraham, HonestDiD) are a fraction of a second combined.

From 539c7d78bddcedf7c51fec4ea89aa3a15b618319 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 14:55:17 -0400
Subject: [PATCH 11/15] Reuse one TSL SurveyDesign (with FPC) across
 brand-awareness phases (P1)

CI re-review P1: `bench_brand_awareness_survey.py` declared the
analytical TSL path with `SurveyDesign(weights, strata, psu, fpc, nest)`
only in phase 2; phases 4 (multi-outcome), 6 (placebo), and 7 (event
study + HonestDiD) built their own SurveyDesigns without `fpc`. That
means a material share of the committed brand-awareness baselines
timed a different variance path than the scenario doc declares.

Fix:
- One analytical `sd_tsl` SurveyDesign (strata + PSU + FPC + nest=True)
  is now constructed once at the top of `make_phases` and reused across
  phases 2, 4, 6, and 7. Phase 3 (replicate weights, JK1) is a
  different variance surface and correctly keeps its own design.
- Regenerated baselines for both backends.
- Regenerated findings tables via gen_findings_tables.py.

Narrative refreshed against the new tables:

- Brand-aware medium: on Python JK1 now leads by ~2.2x (was 1.9x in
  the previous rerun); on Rust the multi-outcome loop and JK1 come in
  essentially tied. Medium is also where Python is slowest relative
  to Rust (~1.6x) - the full analytical TSL path with FPC exposes
  vectorization differences at that shape. Totals re-converge at
  large scale.
- Reversible dCDH: ~48-52% split under both backends (previously the
  Python heterogeneity refit edged out the main fit slightly).
- Scaling finding #5 retuned: Rust-only uplift is still the SDiD
  story; brand-aware medium now surfaces as a secondary, modest
  ~1.6x case rather than "within noise."
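
The ratios quoted above can be cross-checked against the regenerated
medium-scale baselines. A small sketch, with the timings copied by hand
from the brand_awareness_survey_medium_{python,rust}.json files in this
patch; the dictionary layout and variable names are illustrative and
not part of the benchmark harness:

```python
# Medium-scale timings in seconds, copied from the regenerated baselines.
medium = {
    "python": {
        "total": 0.790833584,
        "jk1": 0.40949133400000004,            # 3_replicate_weights_jk1
        "multi_outcome": 0.18262604100000002,  # 4_multi_outcome_loop_3_metrics
    },
    "rust": {
        "total": 0.49272766700000004,
        "jk1": 0.13057945900000012,
        "multi_outcome": 0.13259537499999996,
    },
}

# How far JK1 leads the multi-outcome loop on the Python backend.
jk1_vs_multi_python = medium["python"]["jk1"] / medium["python"]["multi_outcome"]

# On Rust the two phases should come out close to 1.0 ("essentially tied").
multi_vs_jk1_rust = medium["rust"]["multi_outcome"] / medium["rust"]["jk1"]

# Whole-chain slowdown of Python relative to Rust at medium scale.
python_vs_rust_total = medium["python"]["total"] / medium["rust"]["total"]

print(f"Python JK1 / multi-outcome: {jk1_vs_multi_python:.1f}x")
print(f"Rust multi-outcome / JK1:   {multi_vs_jk1_rust:.2f}x")
print(f"Python / Rust total:        {python_vs_rust_total:.1f}x")
```

These are single-run timings, so the ratios are indicative rather than
exact; rerunning the benchmarks will shift them slightly.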

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../brand_awareness_survey_large_python.json  |  22 ++--
 .../brand_awareness_survey_large_rust.json    |  22 ++--
 .../brand_awareness_survey_medium_python.json |  22 ++--
 .../brand_awareness_survey_medium_rust.json   |  22 ++--
 .../brand_awareness_survey_small_python.json  |  22 ++--
 .../brand_awareness_survey_small_rust.json    |  22 ++--
 .../baselines/brfss_panel_large_python.json   |  20 ++--
 .../baselines/brfss_panel_large_rust.json     |  20 ++--
 .../baselines/brfss_panel_medium_python.json  |  20 ++--
 .../baselines/brfss_panel_medium_rust.json    |  20 ++--
 .../baselines/brfss_panel_small_python.json   |  20 ++--
 .../baselines/brfss_panel_small_rust.json     |  20 ++--
 .../campaign_staggered_large_python.json      |  24 ++--
 .../campaign_staggered_large_rust.json        |  24 ++--
 .../campaign_staggered_medium_python.json     |  24 ++--
 .../campaign_staggered_medium_rust.json       |  24 ++--
 .../campaign_staggered_small_python.json      |  24 ++--
 .../campaign_staggered_small_rust.json        |  24 ++--
 .../baselines/dose_response_python.json       |  20 ++--
 .../baselines/dose_response_rust.json         |  20 ++--
 .../baselines/geo_few_markets_large_rust.json |  20 ++--
 .../geo_few_markets_medium_python.json        |  20 ++--
 .../geo_few_markets_medium_rust.json          |  20 ++--
 .../geo_few_markets_small_python.json         |  20 ++--
 .../baselines/geo_few_markets_small_rust.json |  20 ++--
 .../baselines/reversible_dcdh_python.json     |  16 +--
 .../baselines/reversible_dcdh_rust.json       |  16 +--
 .../bench_brand_awareness_survey.py           |  32 +++---
 docs/performance-plan.md                      | 106 +++++++++---------
 29 files changed, 353 insertions(+), 353 deletions(-)

diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
index 5eb3d235..24d46ed5 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.9114786659999998,
+  "total_seconds": 0.9466241249999998,
   "memory": {
     "available": true,
-    "start_mb": 192.66,
-    "peak_mb": 335.36,
-    "growth_mb": 142.7,
+    "start_mb": 187.52,
+    "peak_mb": 336.53,
+    "growth_mb": 149.02,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.01250487499999986,
+      "seconds": 0.014438584000000088,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03581945799999997,
+      "seconds": 0.02661404200000006,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.45216445799999994,
+      "seconds": 0.4855631249999999,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.22825654099999992,
+      "seconds": 0.24716379199999983,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.03940495799999999,
+      "seconds": 0.02261091699999973,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.01122979199999996,
+      "seconds": 0.01228116700000026,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.13208041700000006,
+      "seconds": 0.13794237500000017,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
index d2290030..b59e536e 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.813743375,
+  "total_seconds": 0.9458203330000001,
   "memory": {
     "available": true,
-    "start_mb": 192.84,
-    "peak_mb": 336.91,
-    "growth_mb": 144.06,
+    "start_mb": 193.34,
+    "peak_mb": 343.95,
+    "growth_mb": 150.61,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.013944833000000045,
+      "seconds": 0.012478959000000067,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.026042083000000105,
+      "seconds": 0.029093875000000047,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.359846584,
+      "seconds": 0.4777013749999999,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.2136102500000001,
+      "seconds": 0.24978320799999998,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.033997333000000296,
+      "seconds": 0.020603166999999978,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.026375041999999738,
+      "seconds": 0.0116797919999998,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.13991266599999985,
+      "seconds": 0.14445204200000017,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
index 178e80d2..df21097d 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.521520459,
+  "total_seconds": 0.790833584,
   "memory": {
     "available": true,
-    "start_mb": 134.75,
-    "peak_mb": 182.52,
-    "growth_mb": 47.77,
+    "start_mb": 133.86,
+    "peak_mb": 186.22,
+    "growth_mb": 52.36,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.01229520900000003,
+      "seconds": 0.014936249999999984,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03135670800000001,
+      "seconds": 0.03534024999999996,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.1716489579999999,
+      "seconds": 0.40949133400000004,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.09223670900000003,
+      "seconds": 0.18262604100000002,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.029447917000000157,
+      "seconds": 0.03471770800000007,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.05062770800000016,
+      "seconds": 0.0515672920000001,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.13389041700000015,
+      "seconds": 0.06213379099999994,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
index 6fec4e10..7ecf88eb 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.491930875,
+  "total_seconds": 0.49272766700000004,
   "memory": {
     "available": true,
-    "start_mb": 133.44,
-    "peak_mb": 186.23,
-    "growth_mb": 52.8,
+    "start_mb": 136.12,
+    "peak_mb": 187.19,
+    "growth_mb": 51.06,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.011377042000000004,
+      "seconds": 0.01159325,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.034086916999999994,
+      "seconds": 0.03190720899999999,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.13685675000000008,
+      "seconds": 0.13057945900000012,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.16081833299999992,
+      "seconds": 0.13259537499999996,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.027104291999999974,
+      "seconds": 0.034756790999999954,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.05256474999999994,
+      "seconds": 0.07622437500000001,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.06910883300000004,
+      "seconds": 0.07503525,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
index ed44e410..2f62510f 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.21071716699999998,
+  "total_seconds": 0.21868924999999995,
   "memory": {
     "available": true,
-    "start_mb": 116.28,
-    "peak_mb": 128.09,
-    "growth_mb": 11.81,
+    "start_mb": 115.31,
+    "peak_mb": 127.31,
+    "growth_mb": 12.0,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0017718749999999783,
+      "seconds": 0.002108082999999983,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.005618791999999928,
+      "seconds": 0.00682429200000001,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.017142625000000078,
+      "seconds": 0.024697250000000004,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.06763025,
+      "seconds": 0.05683933299999999,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.00958991599999992,
+      "seconds": 0.009950499999999973,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.02613770900000001,
+      "seconds": 0.028082541999999933,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08281808300000004,
+      "seconds": 0.09016795900000008,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
index 80ee57e8..3ee9298a 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.193722167,
+  "total_seconds": 0.22938095800000002,
   "memory": {
     "available": true,
-    "start_mb": 115.44,
-    "peak_mb": 128.27,
-    "growth_mb": 12.83,
+    "start_mb": 115.72,
+    "peak_mb": 129.34,
+    "growth_mb": 13.62,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0018566250000000561,
+      "seconds": 0.002027083000000096,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.005901209000000018,
+      "seconds": 0.006830167000000054,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.017941708999999917,
+      "seconds": 0.028238708000000057,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.05830100000000005,
+      "seconds": 0.06599037499999993,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.009343709000000033,
+      "seconds": 0.010059291999999997,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.02651229200000005,
+      "seconds": 0.028069124999999917,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.07384633299999999,
+      "seconds": 0.08814829099999999,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_python.json b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
index cf5d5f70..6aa94d7e 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 24.483619208000004,
+  "total_seconds": 24.864842916,
   "memory": {
     "available": true,
-    "start_mb": 396.47,
-    "peak_mb": 426.27,
-    "growth_mb": 29.8,
+    "start_mb": 415.48,
+    "peak_mb": 438.44,
+    "growth_mb": 22.95,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 24.412461416,
+      "seconds": 24.773668083000004,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012729583000002265,
+      "seconds": 0.013717958999997393,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.5410000006331757e-06,
+      "seconds": 1.958999995110844e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0017426250000056598,
+      "seconds": 0.002090874999993275,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05624137499999904,
+      "seconds": 0.07487362500000216,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0004348329999999123,
+      "seconds": 0.00048458300000220333,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
index 1fa3fdc1..24153969 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 24.167863249999996,
+  "total_seconds": 24.665951375000002,
   "memory": {
     "available": true,
-    "start_mb": 400.59,
-    "peak_mb": 430.67,
-    "growth_mb": 30.08,
+    "start_mb": 403.95,
+    "peak_mb": 434.52,
+    "growth_mb": 30.56,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 24.099972291,
+      "seconds": 24.555220917,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012276292000002798,
+      "seconds": 0.012497459000002209,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.4170000045842244e-06,
+      "seconds": 2.3340000012694873e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0016439999999988686,
+      "seconds": 0.001876959000000511,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05347887499999615,
+      "seconds": 0.0958454169999996,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.000472500000000764,
+      "seconds": 0.0004953749999998536,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
index 35d48ac1..308aa17a 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 6.2413335839999995,
+  "total_seconds": 6.4371455420000006,
   "memory": {
     "available": true,
-    "start_mb": 188.67,
-    "peak_mb": 210.42,
-    "growth_mb": 21.75,
+    "start_mb": 188.5,
+    "peak_mb": 209.22,
+    "growth_mb": 20.72,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.143250959000001,
+      "seconds": 6.315183209000001,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012518041999999951,
+      "seconds": 0.013586915999999505,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.2499999996483666e-06,
+      "seconds": 4.915999999965948e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0016282499999995537,
+      "seconds": 0.0019465419999988853,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.08371516599999929,
+      "seconds": 0.10614370900000125,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00021379199999849163,
+      "seconds": 0.0002680420000000794,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
index 653c78a6..7e42eb2a 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 6.1959236660000006,
+  "total_seconds": 6.399207375,
   "memory": {
     "available": true,
-    "start_mb": 197.81,
-    "peak_mb": 218.22,
-    "growth_mb": 20.41,
+    "start_mb": 193.33,
+    "peak_mb": 218.3,
+    "growth_mb": 24.97,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.119725833,
+      "seconds": 6.316759334,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012326082999999599,
+      "seconds": 0.01231787500000081,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.1249999999639613e-06,
+      "seconds": 2.4159999991724135e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0017214169999988371,
+      "seconds": 0.0016820419999987735,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.061877875000000415,
+      "seconds": 0.06814641599999938,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.000265209000000155,
+      "seconds": 0.00028558400000022743,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_python.json b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
index 8a70b38e..8cbe58ef 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.6215220409999997,
+  "total_seconds": 1.6372609579999997,
   "memory": {
     "available": true,
-    "start_mb": 122.02,
-    "peak_mb": 134.12,
-    "growth_mb": 12.11,
+    "start_mb": 121.05,
+    "peak_mb": 133.75,
+    "growth_mb": 12.7,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.5260382080000001,
+      "seconds": 1.563557833,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.015467790999999842,
+      "seconds": 0.015172458000000333,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.3330000002985685e-06,
+      "seconds": 2.1670000003304324e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.003910083999999703,
+      "seconds": 0.004001957999999917,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.07581529200000015,
+      "seconds": 0.05416733400000018,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.000284333000000192,
+      "seconds": 0.0003555829999997151,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
index e5caadeb..94b1e315 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.633015125,
+  "total_seconds": 1.736817041,
   "memory": {
     "available": true,
-    "start_mb": 121.28,
-    "peak_mb": 134.62,
-    "growth_mb": 13.34,
+    "start_mb": 123.34,
+    "peak_mb": 135.56,
+    "growth_mb": 12.22,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.549758459,
+      "seconds": 1.6519127919999999,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.015149292000000258,
+      "seconds": 0.016600416999999812,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.166999999886343e-06,
+      "seconds": 8.37500000017144e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.003892208000000341,
+      "seconds": 0.00402199999999997,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.06395783299999991,
+      "seconds": 0.06396112499999962,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0002466249999999448,
+      "seconds": 0.0003028329999996693,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
index 4c8a1d4c..4005fa85 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.2753887090000002,
+  "total_seconds": 1.6244864589999999,
   "memory": {
     "available": true,
-    "start_mb": 231.02,
-    "peak_mb": 482.28,
-    "growth_mb": 251.27,
+    "start_mb": 233.84,
+    "peak_mb": 466.44,
+    "growth_mb": 232.59,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.017780917000000063,
+      "seconds": 0.020043167000000306,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.16856595800000007,
+      "seconds": 0.18313625,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.2920000001546157e-06,
+      "seconds": 3.416999999839021e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0024221659999996703,
+      "seconds": 0.0025440839999997245,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.2988456250000002,
+      "seconds": 0.6173249580000002,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.6643596249999999,
+      "seconds": 0.6707151249999996,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.12336495800000025,
+      "seconds": 0.13066554200000002,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.454100000022109e-05,
+      "seconds": 4.025000000007495e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
index 6b7ff815..13eb0cb7 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.2563857909999998,
+  "total_seconds": 1.332545959,
   "memory": {
     "available": true,
-    "start_mb": 264.81,
-    "peak_mb": 588.48,
-    "growth_mb": 323.67,
+    "start_mb": 258.7,
+    "peak_mb": 572.69,
+    "growth_mb": 313.98,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.01783779200000013,
+      "seconds": 0.018339333000000124,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.16583404200000018,
+      "seconds": 0.17198512499999996,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.374999999916639e-06,
+      "seconds": 3.7079999999356517e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0027218329999998403,
+      "seconds": 0.002513374999999929,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.4118338749999997,
+      "seconds": 0.4572002500000001,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.5377156250000001,
+      "seconds": 0.5607007500000001,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.12039454100000002,
+      "seconds": 0.12175895900000011,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.916700000017315e-05,
+      "seconds": 3.874999999986528e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
index 8cb2f00a..0f651e72 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.7292697910000001,
+  "total_seconds": 0.7622110419999999,
   "memory": {
     "available": true,
-    "start_mb": 145.41,
-    "peak_mb": 230.08,
-    "growth_mb": 84.67,
+    "start_mb": 146.69,
+    "peak_mb": 232.44,
+    "growth_mb": 85.75,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.012401541999999877,
+      "seconds": 0.012650374999999991,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.09787991699999998,
+      "seconds": 0.0980392499999998,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.250000000010189e-06,
+      "seconds": 3.041000000036931e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0023612090000000308,
+      "seconds": 0.0023650829999999345,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.2553969999999999,
+      "seconds": 0.2652878329999999,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.29284170899999995,
+      "seconds": 0.31247362499999976,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.06833729200000005,
+      "seconds": 0.07134445900000008,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.6750000000029814e-05,
+      "seconds": 3.562500000020563e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
index 24917edf..e1f1a77c 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.7339207910000001,
+  "total_seconds": 0.7597358329999999,
   "memory": {
     "available": true,
-    "start_mb": 155.03,
-    "peak_mb": 261.97,
-    "growth_mb": 106.94,
+    "start_mb": 155.44,
+    "peak_mb": 255.56,
+    "growth_mb": 100.12,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.0123550830000001,
+      "seconds": 0.01323879200000011,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.09760662500000006,
+      "seconds": 0.10213366699999993,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.2920000001546157e-06,
+      "seconds": 3.6669999998739655e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0024585420000000635,
+      "seconds": 0.002531708999999882,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.28781725,
+      "seconds": 0.2909566670000001,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.26540537500000005,
+      "seconds": 0.28113575,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.06822562499999996,
+      "seconds": 0.06967316699999992,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.799999999998249e-05,
+      "seconds": 4.925000000000068e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
index a95a0582..7add273f 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.503552917,
+  "total_seconds": 0.5171237500000001,
   "memory": {
     "available": true,
-    "start_mb": 114.8,
-    "peak_mb": 140.5,
-    "growth_mb": 25.7,
+    "start_mb": 114.09,
+    "peak_mb": 142.61,
+    "growth_mb": 28.52,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.008558832999999932,
+      "seconds": 0.008561582999999984,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06354612500000001,
+      "seconds": 0.06724704199999998,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.250000000010189e-06,
+      "seconds": 3.0829999999593127e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.005010249999999994,
+      "seconds": 0.007055167000000084,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.16441000000000006,
+      "seconds": 0.17484183400000008,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.226042709,
+      "seconds": 0.2239424590000001,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.03592104199999979,
+      "seconds": 0.03541116700000013,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 5.5042000000060654e-05,
+      "seconds": 5.045800000003098e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
index 93f664ef..e43550c6 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.49853625,
+  "total_seconds": 0.520628125,
   "memory": {
     "available": true,
-    "start_mb": 113.97,
-    "peak_mb": 147.67,
-    "growth_mb": 33.7,
+    "start_mb": 114.66,
+    "peak_mb": 150.81,
+    "growth_mb": 36.16,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.007039999999999935,
+      "seconds": 0.007525666000000042,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06317745800000008,
+      "seconds": 0.06882575000000002,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.124999999992717e-06,
+      "seconds": 3.084000000042053e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.00467095900000003,
+      "seconds": 0.004558791999999978,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.14945124999999992,
+      "seconds": 0.16400174999999995,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.23735937499999993,
+      "seconds": 0.23674954199999998,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.03678574999999995,
+      "seconds": 0.03892466700000008,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 4.245800000002298e-05,
+      "seconds": 3.375000000005457e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_python.json b/benchmarks/speed_review/baselines/dose_response_python.json
index b5d588ba..0ccb03ca 100644
--- a/benchmarks/speed_review/baselines/dose_response_python.json
+++ b/benchmarks/speed_review/baselines/dose_response_python.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.5965390830000001,
+  "total_seconds": 0.6105326670000001,
   "memory": {
     "available": true,
-    "start_mb": 113.41,
-    "peak_mb": 120.05,
-    "growth_mb": 6.64,
+    "start_mb": 113.62,
+    "peak_mb": 122.28,
+    "growth_mb": 8.66,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.15369475,
+      "seconds": 0.15595683400000004,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007427500000000142,
+      "seconds": 0.0008087920000000581,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.1483430830000001,
+      "seconds": 0.15231054099999997,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.001571500000000059,
+      "seconds": 0.0016399580000000524,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14602733299999993,
+      "seconds": 0.1499262910000001,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.146155708,
+      "seconds": 0.14988529199999978,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_rust.json b/benchmarks/speed_review/baselines/dose_response_rust.json
index f79982c6..8722e27a 100644
--- a/benchmarks/speed_review/baselines/dose_response_rust.json
+++ b/benchmarks/speed_review/baselines/dose_response_rust.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.592988541,
+  "total_seconds": 0.5925046660000001,
   "memory": {
     "available": true,
-    "start_mb": 113.7,
-    "peak_mb": 120.94,
-    "growth_mb": 7.23,
+    "start_mb": 113.3,
+    "peak_mb": 122.61,
+    "growth_mb": 9.31,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.151152917,
+      "seconds": 0.15212875000000003,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007949580000000678,
+      "seconds": 0.0007324580000001024,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.14909695900000008,
+      "seconds": 0.149196208,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0015252080000000001,
+      "seconds": 0.0016326669999999766,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14430462500000008,
+      "seconds": 0.14295058400000005,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14610904199999997,
+      "seconds": 0.14585954199999995,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
index fb5e8ca1..0e7fe4eb 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.26278650000000003,
+  "total_seconds": 0.265517209,
   "memory": {
     "available": true,
-    "start_mb": 117.98,
-    "peak_mb": 118.33,
-    "growth_mb": 0.34,
+    "start_mb": 117.12,
+    "peak_mb": 117.55,
+    "growth_mb": 0.42,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.04081954100000007,
+      "seconds": 0.0417901249999999,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.03721654100000005,
+      "seconds": 0.03838029199999993,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.07710412500000008,
+      "seconds": 0.07852029100000002,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0006546249999999088,
+      "seconds": 0.0007375830000000416,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.10694591700000011,
+      "seconds": 0.1060552090000001,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 4.191700000011345e-05,
+      "seconds": 3.0042000000118918e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
index 1d0101a1..006e532d 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 4.064775749999999,
+  "total_seconds": 4.060229791,
   "memory": {
     "available": true,
-    "start_mb": 144.16,
-    "peak_mb": 152.61,
-    "growth_mb": 8.45,
+    "start_mb": 143.23,
+    "peak_mb": 151.86,
+    "growth_mb": 8.62,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.36629420799999934,
+      "seconds": 0.3655458749999996,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.3713431250000001,
+      "seconds": 0.3695962079999999,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.594468708,
+      "seconds": 1.5809305839999999,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.00075208300000007,
+      "seconds": 0.0007215829999998036,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.731888875000001,
+      "seconds": 1.7434101249999987,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.4707999999762364e-05,
+      "seconds": 2.0582999999518847e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
index 2e35b56a..0a4806ce 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.11511983300000006,
+  "total_seconds": 0.11840712500000006,
   "memory": {
     "available": true,
-    "start_mb": 116.62,
-    "peak_mb": 116.98,
-    "growth_mb": 0.36,
+    "start_mb": 116.44,
+    "peak_mb": 116.83,
+    "growth_mb": 0.39,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.019235208000000004,
+      "seconds": 0.019984041999999924,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.022819749999999916,
+      "seconds": 0.02313879200000002,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.024750457999999975,
+      "seconds": 0.025119957999999998,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0006058339999999163,
+      "seconds": 0.0006282499999999969,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.047683790999999975,
+      "seconds": 0.049501083,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.1332999999956748e-05,
+      "seconds": 3.0666000000012517e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
index be2be067..55674e59 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 3.8045968329999997,
+  "total_seconds": 3.7479821250000005,
   "memory": {
     "available": true,
-    "start_mb": 113.8,
-    "peak_mb": 124.0,
-    "growth_mb": 10.2,
+    "start_mb": 114.3,
+    "peak_mb": 124.17,
+    "growth_mb": 9.88,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.606876041,
+      "seconds": 0.5956376250000001,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.6082665840000001,
+      "seconds": 0.5948299160000001,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.2295280419999999,
+      "seconds": 1.212533834,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0008653330000001347,
+      "seconds": 0.0007912089999999594,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.35899475,
+      "seconds": 1.3440970830000003,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 6.066699999962566e-05,
+      "seconds": 8.712500000029877e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
index e3af0a7b..0efcc849 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.04004791699999999,
+  "total_seconds": 0.040345834000000025,
   "memory": {
     "available": true,
-    "start_mb": 113.78,
-    "peak_mb": 115.23,
-    "growth_mb": 1.45,
+    "start_mb": 113.75,
+    "peak_mb": 115.25,
+    "growth_mb": 1.5,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.007413500000000073,
+      "seconds": 0.007697416999999929,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.012767541999999965,
+      "seconds": 0.012845958999999962,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.008145540999999978,
+      "seconds": 0.008076207999999974,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0008106660000000154,
+      "seconds": 0.0008170000000000677,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.01088683300000004,
+      "seconds": 0.010886916999999996,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.0916999999953667e-05,
+      "seconds": 1.883400000002311e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_python.json b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
index f653a8d6..c71442b1 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_python.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.6590325410000001,
+  "total_seconds": 0.7502037079999999,
   "memory": {
     "available": true,
-    "start_mb": 113.59,
-    "peak_mb": 132.88,
-    "growth_mb": 19.28,
+    "start_mb": 113.61,
+    "peak_mb": 131.73,
+    "growth_mb": 18.12,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.312676917,
+      "seconds": 0.382398,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.4170000000035543e-06,
+      "seconds": 1.6250000000050946e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.004951042000000072,
+      "seconds": 0.005041332999999981,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.34140066700000005,
+      "seconds": 0.362760125,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
index 1dabe608..b1df4b40 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.6441262499999999,
+  "total_seconds": 0.658796375,
   "memory": {
     "available": true,
-    "start_mb": 113.36,
-    "peak_mb": 134.64,
-    "growth_mb": 21.28,
+    "start_mb": 113.86,
+    "peak_mb": 134.09,
+    "growth_mb": 20.23,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.3220115,
+      "seconds": 0.3397157919999999,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.5410000000493085e-06,
+      "seconds": 1.4579999999542181e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.003777833000000008,
+      "seconds": 0.004095290999999945,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.3183330000000001,
+      "seconds": 0.3149809160000001,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/bench_brand_awareness_survey.py b/benchmarks/speed_review/bench_brand_awareness_survey.py
index 919a6d89..1ce0be1d 100644
--- a/benchmarks/speed_review/bench_brand_awareness_survey.py
+++ b/benchmarks/speed_review/bench_brand_awareness_survey.py
@@ -54,6 +54,17 @@ def build_data(n_units, n_periods, n_strata, psu_per_stratum, seed=42):
 
 
 def make_phases(data, results, rw_cols):
+    # A single TSL SurveyDesign is shared by every analytical survey phase
+    # (TSL, multi-outcome, placebo, HonestDiD event-study). Holding
+    # strata/PSU/FPC/nest constant matches the scenario spec and
+    # Tutorial 17, and is what the finite-population variance expressions
+    # require. The replicate-weight path (phase 3) uses a different
+    # variance method (JK1) that does not take FPC.
+    sd_tsl = SurveyDesign(
+        weights="weight", strata="stratum", psu="psu",
+        fpc="fpc", nest=True,
+    )
+
     def naive_fit():
         did = DifferenceInDifferences(robust=True, cluster="psu")
         results["naive"] = did.fit(
@@ -61,14 +72,10 @@ def naive_fit():
         )
 
     def tsl_fit():
-        sd = SurveyDesign(
-            weights="weight", strata="stratum", psu="psu",
-            fpc="fpc", nest=True,
-        )
         did = DifferenceInDifferences(robust=True)
         results["tsl"] = did.fit(
             data, outcome="outcome", treatment="treat_unit", time="post",
-            survey_design=sd,
+            survey_design=sd_tsl,
         )
 
     def replicate_fit():
@@ -85,15 +92,12 @@ def replicate_fit():
         )
 
     def multi_outcome_loop():
-        sd = SurveyDesign(
-            weights="weight", strata="stratum", psu="psu", nest=True,
-        )
         out = {}
         for y in ("outcome", "consideration", "purchase_intent"):
             did = DifferenceInDifferences(robust=True)
             out[y] = did.fit(
                 data, outcome=y, treatment="treat_unit", time="post",
-                survey_design=sd,
+                survey_design=sd_tsl,
             )
         results["multi_outcome"] = out
 
@@ -107,24 +111,18 @@ def pretrends():
     def placebo_refit():
         pre = data[data["period"] < 7].copy()
         pre["placebo_post"] = (pre["period"] >= 4).astype(int)
-        sd = SurveyDesign(
-            weights="weight", strata="stratum", psu="psu", nest=True,
-        )
         did = DifferenceInDifferences(robust=True)
         results["placebo"] = did.fit(
             pre, outcome="outcome", treatment="treat_unit",
-            time="placebo_post", survey_design=sd,
+            time="placebo_post", survey_design=sd_tsl,
         )
 
     def honest_did_grid():
-        sd = SurveyDesign(
-            weights="weight", strata="stratum", psu="psu", nest=True,
-        )
         es = MultiPeriodDiD()
         es_result = es.fit(
             data, outcome="outcome", treatment="treat_unit",
             time="period", unit="unit", reference_period=6,
-            survey_design=sd,
+            survey_design=sd_tsl,
         )
         results["event_study"] = es_result
         out = {}
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index c1362857..af004423 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -41,20 +41,20 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 <!-- TABLE:start scale_sweep_totals -->
 | Scenario | Scale | Python (s) | Rust (s) | Py/Rust |
 |---|---|---:|---:|---:|
-| 1. Staggered campaign | small | 0.50 | 0.50 | 1.0x |
-|  | medium | 0.73 | 0.73 | 1.0x |
-|  | large | 1.28 | 1.26 | 1.0x |
-| 2. Brand awareness survey | small | 0.21 | 0.19 | 1.1x |
-|  | medium | 0.52 | 0.49 | 1.1x |
-|  | large | 0.91 | 0.81 | 1.1x |
-| 3. BRFSS microdata -> CS panel | small | 1.62 | 1.63 | 1.0x |
-|  | medium | 6.24 | 6.20 | 1.0x |
-|  | large | 24.48 | 24.17 | 1.0x |
-| 4. SDiD few markets | small | 3.80 | 0.04 | 95.0x |
-|  | medium | 4.06 | 0.12 | 35.3x |
-|  | large | skip | 0.26 | - |
-| 5. Reversible dCDH | single | 0.66 | 0.64 | 1.0x |
-| 6. Pricing dose-response | single | 0.60 | 0.59 | 1.0x |
+| 1. Staggered campaign | small | 0.52 | 0.52 | 1.0x |
+|  | medium | 0.76 | 0.76 | 1.0x |
+|  | large | 1.62 | 1.33 | 1.2x |
+| 2. Brand awareness survey | small | 0.22 | 0.23 | 1.0x |
+|  | medium | 0.79 | 0.49 | 1.6x |
+|  | large | 0.95 | 0.95 | 1.0x |
+| 3. BRFSS microdata -> CS panel | small | 1.64 | 1.74 | 0.9x |
+|  | medium | 6.44 | 6.40 | 1.0x |
+|  | large | 24.86 | 24.67 | 1.0x |
+| 4. SDiD few markets | small | 3.75 | 0.04 | 92.9x |
+|  | medium | 4.06 | 0.12 | 34.3x |
+|  | large | skip | 0.27 | - |
+| 5. Reversible dCDH | single | 0.75 | 0.66 | 1.1x |
+| 6. Pricing dose-response | single | 0.61 | 0.59 | 1.0x |
 <!-- TABLE:end scale_sweep_totals -->
 
 ### Scaling findings
@@ -80,28 +80,31 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
    the JK1 replicate path inside it scales closer to
    n_units x n_replicates - faster growth than the chain total, so it
    increasingly dominates at large n.
-5. Rust backend gives measurable uplift only for SDiD; everywhere else
-   backend choice is within noise because the bottlenecks are in Python
-   (`aggregate_survey`, JK1 replicate fit) or already well-vectorized
-   (CS bootstrap, ImputationDiD, Survey TSL).
+5. Rust backend gives large uplift only for SDiD (order-of-magnitude
+   and up). Elsewhere the gap is modest - under ~1.6x at worst on
+   brand-awareness medium, and within noise on the other scenarios
+   and scales. The primary bottlenecks live in Python code the Rust
+   backend does not touch (`aggregate_survey`, JK1 replicate fit), and
+   paths that Rust does touch (CS bootstrap, ImputationDiD, Survey
+   TSL) are already well-vectorized in Python.
 
 ### Top phases by scenario at largest measured scale
 
 <!-- TABLE:start top_phases_by_scenario -->
 | Scenario | Scale | Backend | Top phase (%) | 2nd phase (%) | 3rd phase (%) |
 |---|---|---|---|---|---|
-| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (52%) | `5_sun_abraham_robustness` (23%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
-| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (43%) | `5_sun_abraham_robustness` (33%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
-| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (50%) | `4_multi_outcome_loop_3_metrics` (25%) | `7_event_study_plus_honest_did` (14%) |
-| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (44%) | `4_multi_outcome_loop_3_metrics` (26%) | `7_event_study_plus_honest_did` (17%) |
+| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (41%) | `5_sun_abraham_robustness` (38%) | `2_cs_fit_with_covariates_bootstrap999` (11%) |
+| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (42%) | `5_sun_abraham_robustness` (34%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
+| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (51%) | `4_multi_outcome_loop_3_metrics` (26%) | `7_event_study_plus_honest_did` (15%) |
+| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (51%) | `4_multi_outcome_loop_3_metrics` (26%) | `7_event_study_plus_honest_did` (15%) |
 | 3. BRFSS microdata -> CS panel | large | python | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
 | 3. BRFSS microdata -> CS panel | large | rust | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
 | 4. SDiD few markets | medium | python | `5_sensitivity_to_zeta_omega` (43%) | `3_in_time_placebo` (39%) | `2_sdid_bootstrap_variance_200` (9%) |
-| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (41%) | `3_in_time_placebo` (29%) | `1_sdid_jackknife_variance` (16%) |
-| 5. Reversible dCDH | single | python | `4_heterogeneity_refit` (52%) | `1_dcdh_fit_Lmax3_survey_TSL` (47%) | `3_honest_did_on_placebo` (1%) |
-| 5. Reversible dCDH | single | rust | `1_dcdh_fit_Lmax3_survey_TSL` (50%) | `4_heterogeneity_refit` (49%) | `3_honest_did_on_placebo` (1%) |
-| 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
-| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (25%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
+| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (40%) | `3_in_time_placebo` (30%) | `1_sdid_jackknife_variance` (16%) |
+| 5. Reversible dCDH | single | python | `1_dcdh_fit_Lmax3_survey_TSL` (51%) | `4_heterogeneity_refit` (48%) | `3_honest_did_on_placebo` (1%) |
+| 5. Reversible dCDH | single | rust | `1_dcdh_fit_Lmax3_survey_TSL` (52%) | `4_heterogeneity_refit` (48%) | `3_honest_did_on_placebo` (1%) |
+| 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `5_spline_sensitivity_degree1` (25%) |
+| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
 <!-- TABLE:end top_phases_by_scenario -->
 
 Per-scenario phase narrative (cross-check against the table above after
@@ -113,14 +116,14 @@ any rerun):
   `n_bootstrap=999` (both with and without covariates) is well-
   vectorized and sits well below both in the ranking.
 - **Brand awareness survey.** At small scale HonestDiD dominates. At
-  medium the top three phases are packed closely: on Python JK1 leads
-  (about 1.9x the multi-outcome loop, with HonestDiD close behind);
-  on Rust the multi-outcome loop is slightly ahead of JK1. Only at
-  large does JK1 emerge as the clearly dominant phase under both
-  backends. Python and Rust totals on this chain are within noise; the
-  JK1 replicate-fit loop is not Rust-accelerated, so the FFI crossings
-  cost approximately what they save - a neutral outcome, not a
-  regression.
+  medium the backends diverge: on Python JK1 leads clearly (about
+  2.2x the multi-outcome loop), while on Rust the multi-outcome loop
+  and JK1 come in essentially tied. Medium is also the scale where
+  Python and Rust separate the most on total time (~1.6x under
+  Python at the time of writing); the analytical TSL path with FPC
+  appears to vectorize better under Rust at that shape. At large,
+  JK1 becomes the clearly dominant phase under both backends and
+  totals re-converge.
 - **BRFSS.** `aggregate_survey` share of total grows with scale and is
   effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
   SunAbraham, HonestDiD) are a fraction of a second combined.
@@ -128,10 +131,9 @@ any rerun):
   are the dominant Python-backend phases at every scale; Rust eliminates
   both.
 - **Reversible dCDH.** Main fit and heterogeneity refit split the time
-  roughly evenly (45-52% each; under Python the heterogeneity refit
-  edges out the main fit slightly). Both fits run under the same
-  `SurveyDesign` and rebuild shared TSL scaffolding - that is the
-  optimization opportunity.
+  roughly evenly (~48-52% each under both backends). Both fits run
+  under the same `SurveyDesign` and rebuild shared TSL scaffolding -
+  that is the optimization opportunity.
 - **Pricing dose-response.** Four spline fits account for essentially all
   runtime; linear scaling in variant count.
 
@@ -157,20 +159,20 @@ in `benchmarks/speed_review/baselines/mem_profile_brfss_large_<backend>.txt`.
 <!-- TABLE:start memory_by_scenario -->
 | Scenario | Scale | Py peak RSS (MB) | Py growth (MB) | Rust peak RSS (MB) | Rust growth (MB) |
 |---|---|---:|---:|---:|---:|
-| 1. Staggered campaign | small | 140 | 26 | 148 | 34 |
-|  | medium | 230 | 85 | 262 | 107 |
-|  | large | 482 | 251 | 588 | 324 |
-| 2. Brand awareness survey | small | 128 | 12 | 128 | 13 |
-|  | medium | 183 | 48 | 186 | 53 |
-|  | large | 335 | 143 | 337 | 144 |
-| 3. BRFSS microdata -> CS panel | small | 134 | 12 | 135 | 13 |
-|  | medium | 210 | 22 | 218 | 20 |
-|  | large | 426 | 30 | 431 | 30 |
-| 4. SDiD few markets | small | 124 | 10 | 115 | 1 |
-|  | medium | 153 | 8 | 117 | 0 |
+| 1. Staggered campaign | small | 143 | 29 | 151 | 36 |
+|  | medium | 232 | 86 | 256 | 100 |
+|  | large | 466 | 233 | 573 | 314 |
+| 2. Brand awareness survey | small | 127 | 12 | 129 | 14 |
+|  | medium | 186 | 52 | 187 | 51 |
+|  | large | 337 | 149 | 344 | 151 |
+| 3. BRFSS microdata -> CS panel | small | 134 | 13 | 136 | 12 |
+|  | medium | 209 | 21 | 218 | 25 |
+|  | large | 438 | 23 | 435 | 31 |
+| 4. SDiD few markets | small | 124 | 10 | 115 | 2 |
+|  | medium | 152 | 9 | 117 | 0 |
 |  | large | skip | skip | 118 | 0 |
-| 5. Reversible dCDH | single | 133 | 19 | 135 | 21 |
-| 6. Pricing dose-response | single | 120 | 7 | 121 | 7 |
+| 5. Reversible dCDH | single | 132 | 18 | 134 | 20 |
+| 6. Pricing dose-response | single | 122 | 9 | 123 | 9 |
 <!-- TABLE:end memory_by_scenario -->
 
 The ~115-130 MB floor is the Python + diff-diff + numpy import footprint;

From 4bf991c6568a53db5612f6a78f71d2a88ea93a4e Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 15:13:49 -0400
Subject: [PATCH 12/15] Drop cluster="psu" from naive phase; tighten two
 narrative claims (P3)

CI re-review P3:

- bench_brand_awareness_survey.py "naive" phase was using
  cluster="psu", which is already a partial sampling-design
  correction. The SE-inflation comparison is more faithful to
  Tutorial 17 when the first phase applies no design correction at
  all. Removed the cluster argument.
- Corrected two narrative overreaches in performance-plan.md:
  - Staggered campaign: at Rust medium SunAbraham is now the clearly
    leading phase (~1.7x ImputationDiD there), not "slightly edges
    out". Reworded to say ImputationDiD / SunAbraham are the top two
    at every scale but their order is not stable across backend and
    scale.
  - Reversible dCDH: split is not "~evenly under both backends" -
    Python is closer to 58/41 with the main fit leading, Rust is
    51/49 with the heterogeneity refit leading. Reworded to reflect
    the split per backend.

Regenerated the affected brand-awareness and campaign-staggered
baselines (the naive-fit change slightly reduces brand-awareness
chain totals and shifts phase-percentage shares). Tables in
performance-plan.md re-derived via gen_findings_tables.py.

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../brand_awareness_survey_large_python.json  |  22 ++--
 .../brand_awareness_survey_large_rust.json    |  22 ++--
 .../brand_awareness_survey_medium_python.json |  22 ++--
 .../brand_awareness_survey_medium_rust.json   |  22 ++--
 .../brand_awareness_survey_small_python.json  |  22 ++--
 .../brand_awareness_survey_small_rust.json    |  22 ++--
 .../baselines/brfss_panel_large_python.json   |  20 ++--
 .../baselines/brfss_panel_large_rust.json     |  20 ++--
 .../baselines/brfss_panel_medium_python.json  |  20 ++--
 .../baselines/brfss_panel_medium_rust.json    |  20 ++--
 .../baselines/brfss_panel_small_python.json   |  20 ++--
 .../baselines/brfss_panel_small_rust.json     |  20 ++--
 .../campaign_staggered_large_python.json      |  24 ++--
 .../campaign_staggered_large_rust.json        |  24 ++--
 .../campaign_staggered_medium_python.json     |  24 ++--
 .../campaign_staggered_medium_rust.json       |  24 ++--
 .../campaign_staggered_small_python.json      |  24 ++--
 .../campaign_staggered_small_rust.json        |  24 ++--
 .../baselines/dose_response_python.json       |  20 ++--
 .../baselines/dose_response_rust.json         |  20 ++--
 .../baselines/geo_few_markets_large_rust.json |  20 ++--
 .../geo_few_markets_medium_python.json        |  20 ++--
 .../geo_few_markets_medium_rust.json          |  20 ++--
 .../geo_few_markets_small_python.json         |  20 ++--
 .../baselines/geo_few_markets_small_rust.json |  18 +--
 .../baselines/reversible_dcdh_python.json     |  16 +--
 .../baselines/reversible_dcdh_rust.json       |  16 +--
 .../bench_brand_awareness_survey.py           |   7 +-
 docs/performance-plan.md                      | 104 ++++++++++--------
 29 files changed, 346 insertions(+), 331 deletions(-)

diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
index 24d46ed5..53420b48 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.9466241249999998,
+  "total_seconds": 0.963601333,
   "memory": {
     "available": true,
-    "start_mb": 187.52,
-    "peak_mb": 336.53,
-    "growth_mb": 149.02,
+    "start_mb": 197.14,
+    "peak_mb": 347.12,
+    "growth_mb": 149.98,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.014438584000000088,
+      "seconds": 0.0072292080000000425,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.02661404200000006,
+      "seconds": 0.017787999999999915,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.4855631249999999,
+      "seconds": 0.5239118750000002,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.24716379199999983,
+      "seconds": 0.2471308750000003,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.02261091699999973,
+      "seconds": 0.023947041000000002,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.01228116700000026,
+      "seconds": 0.010973958000000117,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.13794237500000017,
+      "seconds": 0.13260737500000008,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
index b59e536e..4d5629fc 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.9458203330000001,
+  "total_seconds": 0.880690958,
   "memory": {
     "available": true,
-    "start_mb": 193.34,
-    "peak_mb": 343.95,
-    "growth_mb": 150.61,
+    "start_mb": 189.05,
+    "peak_mb": 341.62,
+    "growth_mb": 152.58,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.012478959000000067,
+      "seconds": 0.013126208999999944,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.029093875000000047,
+      "seconds": 0.03180279199999991,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.4777013749999999,
+      "seconds": 0.4201593749999999,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.24978320799999998,
+      "seconds": 0.21300062499999983,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.020603166999999978,
+      "seconds": 0.036875957999999986,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.0116797919999998,
+      "seconds": 0.025331083999999837,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.14445204200000017,
+      "seconds": 0.14037483400000017,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
index df21097d..0fe9b936 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.790833584,
+  "total_seconds": 0.802721625,
   "memory": {
     "available": true,
-    "start_mb": 133.86,
-    "peak_mb": 186.22,
-    "growth_mb": 52.36,
+    "start_mb": 132.53,
+    "peak_mb": 188.75,
+    "growth_mb": 56.22,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.014936249999999984,
+      "seconds": 0.011162792000000032,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03534024999999996,
+      "seconds": 0.033565499999999915,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.40949133400000004,
+      "seconds": 0.4388773749999999,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.18262604100000002,
+      "seconds": 0.17477937499999996,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.03471770800000007,
+      "seconds": 0.02887754099999995,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.0515672920000001,
+      "seconds": 0.05516908300000001,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.06213379099999994,
+      "seconds": 0.060266375000000094,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
index 7ecf88eb..32ec4723 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.49272766700000004,
+  "total_seconds": 0.512456792,
   "memory": {
     "available": true,
-    "start_mb": 136.12,
-    "peak_mb": 187.19,
-    "growth_mb": 51.06,
+    "start_mb": 133.5,
+    "peak_mb": 183.52,
+    "growth_mb": 50.02,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.01159325,
+      "seconds": 0.012339082999999973,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03190720899999999,
+      "seconds": 0.0368013330000001,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.13057945900000012,
+      "seconds": 0.14184362500000003,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.13259537499999996,
+      "seconds": 0.1607655830000001,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.034756790999999954,
+      "seconds": 0.025544416000000014,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.07622437500000001,
+      "seconds": 0.060123250000000183,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.07503525,
+      "seconds": 0.07500912500000001,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
index 2f62510f..5a9d69eb 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.21868924999999995,
+  "total_seconds": 0.202955542,
   "memory": {
     "available": true,
-    "start_mb": 115.31,
-    "peak_mb": 127.31,
-    "growth_mb": 12.0,
+    "start_mb": 115.48,
+    "peak_mb": 126.53,
+    "growth_mb": 11.05,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.002108082999999983,
+      "seconds": 0.0013837089999999552,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.00682429200000001,
+      "seconds": 0.006605916000000045,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.024697250000000004,
+      "seconds": 0.020932875000000073,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.05683933299999999,
+      "seconds": 0.05001366600000001,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.009950499999999973,
+      "seconds": 0.00929579199999997,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.028082541999999933,
+      "seconds": 0.028791582999999954,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.09016795900000008,
+      "seconds": 0.08592387499999998,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
index 3ee9298a..1f816a42 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.22938095800000002,
+  "total_seconds": 0.209049209,
   "memory": {
     "available": true,
-    "start_mb": 115.72,
-    "peak_mb": 129.34,
-    "growth_mb": 13.62,
+    "start_mb": 115.23,
+    "peak_mb": 128.47,
+    "growth_mb": 13.23,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.002027083000000096,
+      "seconds": 0.0016902500000000042,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.006830167000000054,
+      "seconds": 0.005705791000000016,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.028238708000000057,
+      "seconds": 0.01469479100000004,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.06599037499999993,
+      "seconds": 0.05941337499999999,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.010059291999999997,
+      "seconds": 0.009663624999999954,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.028069124999999917,
+      "seconds": 0.02708766600000001,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08814829099999999,
+      "seconds": 0.090782625,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_python.json b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
index 6aa94d7e..db64809a 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 24.864842916,
+  "total_seconds": 23.767577457999998,
   "memory": {
     "available": true,
-    "start_mb": 415.48,
-    "peak_mb": 438.44,
-    "growth_mb": 22.95,
+    "start_mb": 398.88,
+    "peak_mb": 426.2,
+    "growth_mb": 27.33,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 24.773668083000004,
+      "seconds": 23.684428833,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.013717958999997393,
+      "seconds": 0.012686416999997618,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 1.958999995110844e-06,
+      "seconds": 2.4160000009487703e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.002090874999993275,
+      "seconds": 0.0016776669999991611,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.07487362500000216,
+      "seconds": 0.06832108300000428,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00048458300000220333,
+      "seconds": 0.0004543749999967872,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
index 24153969..60fad453 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 24.665951375000002,
+  "total_seconds": 26.383662291999997,
   "memory": {
     "available": true,
-    "start_mb": 403.95,
-    "peak_mb": 434.52,
-    "growth_mb": 30.56,
+    "start_mb": 397.69,
+    "peak_mb": 427.11,
+    "growth_mb": 29.42,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 24.555220917,
+      "seconds": 26.308744165999997,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012497459000002209,
+      "seconds": 0.017224583999997378,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.3340000012694873e-06,
+      "seconds": 2.207999997949628e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.001876959000000511,
+      "seconds": 0.0018082090000035578,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.0958454169999996,
+      "seconds": 0.05578533399999941,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0004953749999998536,
+      "seconds": 8.741699999603725e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
index 308aa17a..07ed4bab 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 6.4371455420000006,
+  "total_seconds": 5.935090292,
   "memory": {
     "available": true,
-    "start_mb": 188.5,
-    "peak_mb": 209.22,
-    "growth_mb": 20.72,
+    "start_mb": 191.91,
+    "peak_mb": 209.75,
+    "growth_mb": 17.84,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.315183209000001,
+      "seconds": 5.871870541000001,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.013586915999999505,
+      "seconds": 0.011603208000000365,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 4.915999999965948e-06,
+      "seconds": 2.166999999886343e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0019465419999988853,
+      "seconds": 0.0015359169999999978,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.10614370900000125,
+      "seconds": 0.05003187500000017,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0002680420000000794,
+      "seconds": 3.670900000152244e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
index 7e42eb2a..feb4e4fd 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 6.399207375,
+  "total_seconds": 6.301258208999999,
   "memory": {
     "available": true,
-    "start_mb": 193.33,
-    "peak_mb": 218.3,
-    "growth_mb": 24.97,
+    "start_mb": 195.36,
+    "peak_mb": 211.09,
+    "growth_mb": 15.73,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.316759334,
+      "seconds": 6.1868029159999995,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.01231787500000081,
+      "seconds": 0.012106875000000628,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.4159999991724135e-06,
+      "seconds": 2.2919999995707485e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0016820419999987735,
+      "seconds": 0.0017474579999987583,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.06814641599999938,
+      "seconds": 0.10031245899999952,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00028558400000022743,
+      "seconds": 0.00027812499999946283,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_python.json b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
index 8cbe58ef..3a25c6ba 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.6372609579999997,
+  "total_seconds": 1.5694494999999997,
   "memory": {
     "available": true,
-    "start_mb": 121.05,
-    "peak_mb": 133.75,
-    "growth_mb": 12.7,
+    "start_mb": 121.31,
+    "peak_mb": 133.78,
+    "growth_mb": 12.47,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.563557833,
+      "seconds": 1.472794875,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.015172458000000333,
+      "seconds": 0.014274458000000045,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.1670000003304324e-06,
+      "seconds": 2.0410000001191975e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.004001957999999917,
+      "seconds": 0.003753040999999957,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05416733400000018,
+      "seconds": 0.07832733300000028,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0003555829999997151,
+      "seconds": 0.00029404099999963407,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
index 94b1e315..de7c32a7 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.736817041,
+  "total_seconds": 1.6511473330000002,
   "memory": {
     "available": true,
-    "start_mb": 123.34,
-    "peak_mb": 135.56,
-    "growth_mb": 12.22,
+    "start_mb": 120.92,
+    "peak_mb": 134.12,
+    "growth_mb": 13.2,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.6519127919999999,
+      "seconds": 1.565322084,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.016600416999999812,
+      "seconds": 0.014890000000000292,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 8.37500000017144e-06,
+      "seconds": 2.208000000170074e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.00402199999999997,
+      "seconds": 0.0039766250000004,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.06396112499999962,
+      "seconds": 0.06669404099999987,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0003028329999996693,
+      "seconds": 0.0002490840000000105,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
index 4005fa85..cf7cc0da 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.6244864589999999,
+  "total_seconds": 1.274976542,
   "memory": {
     "available": true,
-    "start_mb": 233.84,
-    "peak_mb": 466.44,
-    "growth_mb": 232.59,
+    "start_mb": 222.83,
+    "peak_mb": 484.78,
+    "growth_mb": 261.95,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.020043167000000306,
+      "seconds": 0.018749040999999966,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.18313625,
+      "seconds": 0.1616636250000001,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.416999999839021e-06,
+      "seconds": 3.4170000002831102e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0025440839999997245,
+      "seconds": 0.0023987919999997054,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.6173249580000002,
+      "seconds": 0.3264373329999999,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.6707151249999996,
+      "seconds": 0.644759042,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.13066554200000002,
+      "seconds": 0.12092275000000008,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 4.025000000007495e-05,
+      "seconds": 3.683299999979184e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
index 13eb0cb7..4e4fa8b8 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.332545959,
+  "total_seconds": 1.2839995000000002,
   "memory": {
     "available": true,
-    "start_mb": 258.7,
-    "peak_mb": 572.69,
-    "growth_mb": 313.98,
+    "start_mb": 254.2,
+    "peak_mb": 585.02,
+    "growth_mb": 330.81,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.018339333000000124,
+      "seconds": 0.019514583000000307,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.17198512499999996,
+      "seconds": 0.17894224999999997,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.7079999999356517e-06,
+      "seconds": 3.082999999737268e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002513374999999929,
+      "seconds": 0.002544458000000027,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.4572002500000001,
+      "seconds": 0.41660504200000004,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.5607007500000001,
+      "seconds": 0.5459507920000002,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.12175895900000011,
+      "seconds": 0.12039437500000005,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.874999999986528e-05,
+      "seconds": 3.879200000023175e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
index 0f651e72..60a9879d 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.7622110419999999,
+  "total_seconds": 0.7186370420000001,
   "memory": {
     "available": true,
-    "start_mb": 146.69,
-    "peak_mb": 232.44,
-    "growth_mb": 85.75,
+    "start_mb": 144.94,
+    "peak_mb": 222.11,
+    "growth_mb": 77.17,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.012650374999999991,
+      "seconds": 0.011784207999999907,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.0980392499999998,
+      "seconds": 0.09282820899999988,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.041000000036931e-06,
+      "seconds": 2.9580000000528628e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0023650829999999345,
+      "seconds": 0.0021937499999999943,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.2652878329999999,
+      "seconds": 0.266259416,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.31247362499999976,
+      "seconds": 0.2791953329999999,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.07134445900000008,
+      "seconds": 0.0663273339999999,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.562500000020563e-05,
+      "seconds": 4.074999999970075e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
index e1f1a77c..84113174 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.7597358329999999,
+  "total_seconds": 0.7459894999999999,
   "memory": {
     "available": true,
-    "start_mb": 155.44,
-    "peak_mb": 255.56,
-    "growth_mb": 100.12,
+    "start_mb": 150.91,
+    "peak_mb": 251.52,
+    "growth_mb": 100.61,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.01323879200000011,
+      "seconds": 0.012153583000000134,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.10213366699999993,
+      "seconds": 0.095559167,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.6669999998739655e-06,
+      "seconds": 3.374999999916639e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002531708999999882,
+      "seconds": 0.0023907919999999194,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.2909566670000001,
+      "seconds": 0.35323991600000015,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.28113575,
+      "seconds": 0.21440725000000005,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.06967316699999992,
+      "seconds": 0.06818954200000005,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 4.925000000000068e-05,
+      "seconds": 3.945799999982569e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
index 7add273f..ca684788 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.5171237500000001,
+  "total_seconds": 0.49590554100000006,
   "memory": {
     "available": true,
-    "start_mb": 114.09,
-    "peak_mb": 142.61,
-    "growth_mb": 28.52,
+    "start_mb": 113.95,
+    "peak_mb": 140.97,
+    "growth_mb": 27.02,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.008561582999999984,
+      "seconds": 0.008187166999999995,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06724704199999998,
+      "seconds": 0.061664542000000044,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.0829999999593127e-06,
+      "seconds": 2.54199999993876e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.007055167000000084,
+      "seconds": 0.0075763749999999686,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.17484183400000008,
+      "seconds": 0.16791750000000005,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.2239424590000001,
+      "seconds": 0.21524137500000018,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.03541116700000013,
+      "seconds": 0.035257708000000054,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 5.045800000003098e-05,
+      "seconds": 5.2833000000029884e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
index e43550c6..c8a81e8d 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.520628125,
+  "total_seconds": 0.5003938330000001,
   "memory": {
     "available": true,
-    "start_mb": 114.66,
-    "peak_mb": 150.81,
-    "growth_mb": 36.16,
+    "start_mb": 113.81,
+    "peak_mb": 146.69,
+    "growth_mb": 32.88,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.007525666000000042,
+      "seconds": 0.007302374999999972,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06882575000000002,
+      "seconds": 0.06526799999999999,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.084000000042053e-06,
+      "seconds": 2.8340000000071086e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.004558791999999978,
+      "seconds": 0.004741417000000081,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.16400174999999995,
+      "seconds": 0.13823870900000002,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.23674954199999998,
+      "seconds": 0.24844470800000007,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.03892466700000008,
+      "seconds": 0.03635133300000004,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.375000000005457e-05,
+      "seconds": 3.945800000004773e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_python.json b/benchmarks/speed_review/baselines/dose_response_python.json
index 0ccb03ca..985928a6 100644
--- a/benchmarks/speed_review/baselines/dose_response_python.json
+++ b/benchmarks/speed_review/baselines/dose_response_python.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.6105326670000001,
+  "total_seconds": 0.5982971659999999,
   "memory": {
     "available": true,
-    "start_mb": 113.62,
-    "peak_mb": 122.28,
-    "growth_mb": 8.66,
+    "start_mb": 113.95,
+    "peak_mb": 121.86,
+    "growth_mb": 7.91,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.15595683400000004,
+      "seconds": 0.152698,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0008087920000000581,
+      "seconds": 0.0007580830000000205,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.15231054099999997,
+      "seconds": 0.14973008399999999,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0016399580000000524,
+      "seconds": 0.001448709000000048,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.1499262910000001,
+      "seconds": 0.14563666599999991,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14988529199999978,
+      "seconds": 0.14802162500000016,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_rust.json b/benchmarks/speed_review/baselines/dose_response_rust.json
index 8722e27a..75b80e57 100644
--- a/benchmarks/speed_review/baselines/dose_response_rust.json
+++ b/benchmarks/speed_review/baselines/dose_response_rust.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.5925046660000001,
+  "total_seconds": 0.628465,
   "memory": {
     "available": true,
-    "start_mb": 113.3,
-    "peak_mb": 122.61,
-    "growth_mb": 9.31,
+    "start_mb": 113.91,
+    "peak_mb": 124.05,
+    "growth_mb": 10.14,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.15212875000000003,
+      "seconds": 0.16225933400000003,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007324580000001024,
+      "seconds": 0.0007119580000000125,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.149196208,
+      "seconds": 0.15891516699999997,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0016326669999999766,
+      "seconds": 0.0023566669999999235,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14295058400000005,
+      "seconds": 0.15112041699999978,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14585954199999995,
+      "seconds": 0.15309695899999998,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
index 0e7fe4eb..85ea499d 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.265517209,
+  "total_seconds": 0.2668321659999999,
   "memory": {
     "available": true,
-    "start_mb": 117.12,
-    "peak_mb": 117.55,
-    "growth_mb": 0.42,
+    "start_mb": 117.61,
+    "peak_mb": 118.0,
+    "growth_mb": 0.39,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.0417901249999999,
+      "seconds": 0.041749666000000074,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.03838029199999993,
+      "seconds": 0.03940208300000003,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.07852029100000002,
+      "seconds": 0.07818345800000004,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0007375830000000416,
+      "seconds": 0.0007424160000000235,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.1060552090000001,
+      "seconds": 0.10670491699999984,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 3.0042000000118918e-05,
+      "seconds": 4.308399999986001e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
index 006e532d..6d693e0d 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 4.060229791,
+  "total_seconds": 4.007914167,
   "memory": {
     "available": true,
-    "start_mb": 143.23,
-    "peak_mb": 151.86,
-    "growth_mb": 8.62,
+    "start_mb": 140.27,
+    "peak_mb": 146.11,
+    "growth_mb": 5.84,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.3655458749999996,
+      "seconds": 0.36536020800000024,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.3695962079999999,
+      "seconds": 0.3658956660000001,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.5809305839999999,
+      "seconds": 1.5556308329999995,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0007215829999998036,
+      "seconds": 0.0006294580000005823,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.7434101249999987,
+      "seconds": 1.72036825,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.0582999999518847e-05,
+      "seconds": 2.5292000000121106e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
index 0a4806ce..8ed034ee 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.11840712500000006,
+  "total_seconds": 0.11954508399999997,
   "memory": {
     "available": true,
-    "start_mb": 116.44,
-    "peak_mb": 116.83,
-    "growth_mb": 0.39,
+    "start_mb": 116.28,
+    "peak_mb": 116.72,
+    "growth_mb": 0.44,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.019984041999999924,
+      "seconds": 0.02039737500000005,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.02313879200000002,
+      "seconds": 0.02283175000000004,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.025119957999999998,
+      "seconds": 0.02504779200000007,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0006282499999999969,
+      "seconds": 0.0006240410000000196,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.049501083,
+      "seconds": 0.05059725000000004,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 3.0666000000012517e-05,
+      "seconds": 3.979200000003846e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
index 55674e59..1246deae 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 3.7479821250000005,
+  "total_seconds": 3.6806861669999997,
   "memory": {
     "available": true,
-    "start_mb": 114.3,
-    "peak_mb": 124.17,
-    "growth_mb": 9.88,
+    "start_mb": 113.86,
+    "peak_mb": 123.45,
+    "growth_mb": 9.59,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.5956376250000001,
+      "seconds": 0.5916519580000001,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.5948299160000001,
+      "seconds": 0.5638875840000002,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.212533834,
+      "seconds": 1.1952183330000001,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0007912089999999594,
+      "seconds": 0.0009582500000000493,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.3440970830000003,
+      "seconds": 1.3288901659999999,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 8.712500000029877e-05,
+      "seconds": 7.466600000061163e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
index 0efcc849..48988dfe 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.040345834000000025,
+  "total_seconds": 0.04141854199999995,
   "memory": {
     "available": true,
-    "start_mb": 113.75,
-    "peak_mb": 115.25,
+    "start_mb": 113.5,
+    "peak_mb": 115.0,
     "growth_mb": 1.5,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.007697416999999929,
+      "seconds": 0.008168667000000074,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.012845958999999962,
+      "seconds": 0.013055833999999988,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.008076207999999974,
+      "seconds": 0.008290083000000004,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0008170000000000677,
+      "seconds": 0.0008599579999999385,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.010886916999999996,
+      "seconds": 0.011012665999999949,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 1.883400000002311e-05,
+      "seconds": 2.3749999999989058e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_python.json b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
index c71442b1..729720da 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_python.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.7502037079999999,
+  "total_seconds": 0.8433709580000001,
   "memory": {
     "available": true,
-    "start_mb": 113.61,
-    "peak_mb": 131.73,
-    "growth_mb": 18.12,
+    "start_mb": 113.41,
+    "peak_mb": 135.0,
+    "growth_mb": 21.59,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.382398,
+      "seconds": 0.490543708,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.6250000000050946e-06,
+      "seconds": 1.7920000001669933e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.005041332999999981,
+      "seconds": 0.004332082999999987,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.362760125,
+      "seconds": 0.348488208,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
index b1df4b40..6d2ae098 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.658796375,
+  "total_seconds": 0.7921678750000001,
   "memory": {
     "available": true,
-    "start_mb": 113.86,
-    "peak_mb": 134.09,
-    "growth_mb": 20.23,
+    "start_mb": 113.97,
+    "peak_mb": 134.59,
+    "growth_mb": 20.62,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.3397157919999999,
+      "seconds": 0.3866300829999999,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.4579999999542181e-06,
+      "seconds": 1.7920000001669933e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.004095290999999945,
+      "seconds": 0.004450167000000116,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.3149809160000001,
+      "seconds": 0.4010832499999999,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/bench_brand_awareness_survey.py b/benchmarks/speed_review/bench_brand_awareness_survey.py
index 1ce0be1d..7db53894 100644
--- a/benchmarks/speed_review/bench_brand_awareness_survey.py
+++ b/benchmarks/speed_review/bench_brand_awareness_survey.py
@@ -66,7 +66,12 @@ def make_phases(data, results, rw_cols):
     )
 
     def naive_fit():
-        did = DifferenceInDifferences(robust=True, cluster="psu")
+        # Truly naive comparison point: no survey design, no clustering -
+        # matches Tutorial 17's first pass where an analyst has not yet
+        # accounted for the sampling structure. The SE-inflation story
+        # only shows up if this step is as untreated-for-design as
+        # practitioners actually start.
+        did = DifferenceInDifferences(robust=True)
         results["naive"] = did.fit(
             data, outcome="outcome", treatment="treat_unit", time="post",
         )
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index af004423..fff93dc1 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -41,20 +41,20 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 <!-- TABLE:start scale_sweep_totals -->
 | Scenario | Scale | Python (s) | Rust (s) | Py/Rust |
 |---|---|---:|---:|---:|
-| 1. Staggered campaign | small | 0.52 | 0.52 | 1.0x |
-|  | medium | 0.76 | 0.76 | 1.0x |
-|  | large | 1.62 | 1.33 | 1.2x |
-| 2. Brand awareness survey | small | 0.22 | 0.23 | 1.0x |
-|  | medium | 0.79 | 0.49 | 1.6x |
-|  | large | 0.95 | 0.95 | 1.0x |
-| 3. BRFSS microdata -> CS panel | small | 1.64 | 1.74 | 0.9x |
-|  | medium | 6.44 | 6.40 | 1.0x |
-|  | large | 24.86 | 24.67 | 1.0x |
-| 4. SDiD few markets | small | 3.75 | 0.04 | 92.9x |
-|  | medium | 4.06 | 0.12 | 34.3x |
+| 1. Staggered campaign | small | 0.50 | 0.50 | 1.0x |
+|  | medium | 0.72 | 0.75 | 1.0x |
+|  | large | 1.27 | 1.28 | 1.0x |
+| 2. Brand awareness survey | small | 0.20 | 0.21 | 1.0x |
+|  | medium | 0.80 | 0.51 | 1.6x |
+|  | large | 0.96 | 0.88 | 1.1x |
+| 3. BRFSS microdata -> CS panel | small | 1.57 | 1.65 | 1.0x |
+|  | medium | 5.94 | 6.30 | 0.9x |
+|  | large | 23.77 | 26.38 | 0.9x |
+| 4. SDiD few markets | small | 3.68 | 0.04 | 88.9x |
+|  | medium | 4.01 | 0.12 | 33.5x |
 |  | large | skip | 0.27 | - |
-| 5. Reversible dCDH | single | 0.75 | 0.66 | 1.1x |
-| 6. Pricing dose-response | single | 0.61 | 0.59 | 1.0x |
+| 5. Reversible dCDH | single | 0.84 | 0.79 | 1.1x |
+| 6. Pricing dose-response | single | 0.60 | 0.63 | 1.0x |
 <!-- TABLE:end scale_sweep_totals -->
 
 ### Scaling findings
@@ -93,28 +93,32 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 <!-- TABLE:start top_phases_by_scenario -->
 | Scenario | Scale | Backend | Top phase (%) | 2nd phase (%) | 3rd phase (%) |
 |---|---|---|---|---|---|
-| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (41%) | `5_sun_abraham_robustness` (38%) | `2_cs_fit_with_covariates_bootstrap999` (11%) |
-| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (42%) | `5_sun_abraham_robustness` (34%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
-| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (51%) | `4_multi_outcome_loop_3_metrics` (26%) | `7_event_study_plus_honest_did` (15%) |
-| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (51%) | `4_multi_outcome_loop_3_metrics` (26%) | `7_event_study_plus_honest_did` (15%) |
+| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (51%) | `5_sun_abraham_robustness` (26%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
+| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (43%) | `5_sun_abraham_robustness` (32%) | `2_cs_fit_with_covariates_bootstrap999` (14%) |
+| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (54%) | `4_multi_outcome_loop_3_metrics` (26%) | `7_event_study_plus_honest_did` (14%) |
+| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (48%) | `4_multi_outcome_loop_3_metrics` (24%) | `7_event_study_plus_honest_did` (16%) |
 | 3. BRFSS microdata -> CS panel | large | python | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
 | 3. BRFSS microdata -> CS panel | large | rust | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
 | 4. SDiD few markets | medium | python | `5_sensitivity_to_zeta_omega` (43%) | `3_in_time_placebo` (39%) | `2_sdid_bootstrap_variance_200` (9%) |
-| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (40%) | `3_in_time_placebo` (30%) | `1_sdid_jackknife_variance` (16%) |
-| 5. Reversible dCDH | single | python | `1_dcdh_fit_Lmax3_survey_TSL` (51%) | `4_heterogeneity_refit` (48%) | `3_honest_did_on_placebo` (1%) |
-| 5. Reversible dCDH | single | rust | `1_dcdh_fit_Lmax3_survey_TSL` (52%) | `4_heterogeneity_refit` (48%) | `3_honest_did_on_placebo` (1%) |
-| 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `5_spline_sensitivity_degree1` (25%) |
-| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
+| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (40%) | `3_in_time_placebo` (29%) | `1_sdid_jackknife_variance` (16%) |
+| 5. Reversible dCDH | single | python | `1_dcdh_fit_Lmax3_survey_TSL` (58%) | `4_heterogeneity_refit` (41%) | `3_honest_did_on_placebo` (1%) |
+| 5. Reversible dCDH | single | rust | `4_heterogeneity_refit` (51%) | `1_dcdh_fit_Lmax3_survey_TSL` (49%) | `3_honest_did_on_placebo` (1%) |
+| 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
+| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (24%) |
 <!-- TABLE:end top_phases_by_scenario -->
 
 Per-scenario phase narrative (cross-check against the table above after
 any rerun):
 
-- **Staggered campaign.** ImputationDiD robustness is the dominant
-  phase under both backends at every scale. Under Rust at large scale
-  SunAbraham narrows the gap but ImputationDiD still leads. CS fit with
-  `n_bootstrap=999` (both with and without covariates) is well-
-  vectorized and sits well below both in the ranking.
+- **Staggered campaign.** ImputationDiD robustness and SunAbraham are
+  the two largest phases at every scale, together accounting for
+  ~70-80% of the chain. Their relative order is not stable across
+  backend and scale: ImputationDiD is the single largest phase under
+  Python at every scale and under Rust at small and large, but at
+  Rust medium SunAbraham clearly leads (roughly 1.7x the ImputationDiD
+  phase there). CS fit with `n_bootstrap=999` (both with and without
+  covariates) is well-vectorized and sits well below both in the
+  ranking.
 - **Brand awareness survey.** At small scale HonestDiD dominates. At
   medium the backends diverge: on Python JK1 leads clearly (about
   2.2x the multi-outcome loop), while on Rust the multi-outcome loop
@@ -127,13 +131,19 @@ any rerun):
 - **BRFSS.** `aggregate_survey` share of total grows with scale and is
   effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
   SunAbraham, HonestDiD) are a fraction of a second combined.
-- **SDiD few markets.** `sensitivity_to_zeta_omega` and `in_time_placebo`
-  are the dominant Python-backend phases at every scale; Rust eliminates
-  both.
-- **Reversible dCDH.** Main fit and heterogeneity refit split the time
-  roughly evenly (~48-52% each under both backends). Both fits run
-  under the same `SurveyDesign` and rebuild shared TSL scaffolding -
-  that is the optimization opportunity.
+- **SDiD few markets.** `sensitivity_to_zeta_omega` and
+  `in_time_placebo` are the two largest phases under both backends at
+  every scale - they together account for roughly ~70% of the chain.
+  The difference is absolute: under Python they drive a multi-second
+  chain, under Rust they stay the top phases but of a sub-second total
+  runtime. That is the Python-vs-Rust story for this scenario.
+- **Reversible dCDH.** Main fit and heterogeneity refit are the two
+  largest phases by design - together effectively the whole chain. The
+  split is not stable across backends: under Python the main fit is
+  the larger of the two (roughly 58/41), under Rust the heterogeneity
+  refit slightly leads (roughly 51/49). Both fits run under the same
+  `SurveyDesign` and rebuild shared TSL scaffolding - that is the
+  optimization opportunity.
 - **Pricing dose-response.** Four spline fits account for essentially all
   runtime; linear scaling in variant count.
 
@@ -159,20 +169,20 @@ in `benchmarks/speed_review/baselines/mem_profile_brfss_large_<backend>.txt`.
 <!-- TABLE:start memory_by_scenario -->
 | Scenario | Scale | Py peak RSS (MB) | Py growth (MB) | Rust peak RSS (MB) | Rust growth (MB) |
 |---|---|---:|---:|---:|---:|
-| 1. Staggered campaign | small | 143 | 29 | 151 | 36 |
-|  | medium | 232 | 86 | 256 | 100 |
-|  | large | 466 | 233 | 573 | 314 |
-| 2. Brand awareness survey | small | 127 | 12 | 129 | 14 |
-|  | medium | 186 | 52 | 187 | 51 |
-|  | large | 337 | 149 | 344 | 151 |
-| 3. BRFSS microdata -> CS panel | small | 134 | 13 | 136 | 12 |
-|  | medium | 209 | 21 | 218 | 25 |
-|  | large | 438 | 23 | 435 | 31 |
-| 4. SDiD few markets | small | 124 | 10 | 115 | 2 |
-|  | medium | 152 | 9 | 117 | 0 |
+| 1. Staggered campaign | small | 141 | 27 | 147 | 33 |
+|  | medium | 222 | 77 | 252 | 101 |
+|  | large | 485 | 262 | 585 | 331 |
+| 2. Brand awareness survey | small | 127 | 11 | 128 | 13 |
+|  | medium | 189 | 56 | 184 | 50 |
+|  | large | 347 | 150 | 342 | 153 |
+| 3. BRFSS microdata -> CS panel | small | 134 | 12 | 134 | 13 |
+|  | medium | 210 | 18 | 211 | 16 |
+|  | large | 426 | 27 | 427 | 29 |
+| 4. SDiD few markets | small | 123 | 10 | 115 | 2 |
+|  | medium | 146 | 6 | 117 | 0 |
 |  | large | skip | skip | 118 | 0 |
-| 5. Reversible dCDH | single | 132 | 18 | 134 | 20 |
-| 6. Pricing dose-response | single | 122 | 9 | 123 | 9 |
+| 5. Reversible dCDH | single | 135 | 22 | 135 | 21 |
+| 6. Pricing dose-response | single | 122 | 8 | 124 | 10 |
 <!-- TABLE:end memory_by_scenario -->
 
 The ~115-130 MB floor is the Python + diff-diff + numpy import footprint;

From a0aafc5f85d4280fc0fa30daf88f38acd21431d6 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 15:21:59 -0400
Subject: [PATCH 13/15] Narrow three P3 narrative overgeneralizations

CI re-review P3 items, all documentation-only:

- Scenario 3 operation chain: said "analytical TSL via strata + PSU",
  but aggregate_survey()'s returned second-stage design is pweight
  with geographic PSU clustering and no stage-2 strata. Reworded to
  match the actual second-stage design surface being benchmarked.
- ImputationDiD "consistently dominant" claim in scaling finding #2
  and hotspot table row #2: at Rust medium SunAbraham clearly leads
  (0.353s vs 0.214s). Both claims narrowed to "Python all scales +
  Rust small/large" with the Rust-medium SunAbraham exception called
  out explicitly; the "together ~70-80% of the chain" framing
  preserves the optimization recommendation.
- SDiD narrative said sensitivity_to_zeta_omega and in_time_placebo
  are the two largest at every scale/backend, but at Rust small
  bootstrap_variance slightly edges both (at sub-50ms totals, per-
  phase fixed overhead dominates ranking). Qualified to Python all
  scales + Rust medium/large.

Docs-only. No script or baseline changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/performance-plan.md      | 19 ++++++++++++-------
 docs/performance-scenarios.md | 11 ++++++-----
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index fff93dc1..b7c3a8e3 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -69,7 +69,9 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
    (`aggregate_survey` is entirely Python).
 2. **Staggered CS chain stays cheap across scales.** A 10x unit increase
    (150 -> 1,500) is a small-single-digit multiplier on total time.
-   ImputationDiD is consistently the dominant phase but scales well.
+   ImputationDiD is the dominant phase at most (scale, backend)
+   combinations; SunAbraham takes the top spot at Rust medium but the
+   two phases together consistently account for ~70-80% of the chain.
 3. **SDiD Rust gap is stable across scales, not emergent.** Python SDiD
    has a fixed per-jackknife-refit overhead that dominates even at small
    n. Rust stays sub-second through 500 units.
@@ -132,11 +134,14 @@ any rerun):
   effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
   SunAbraham, HonestDiD) are a fraction of a second combined.
 - **SDiD few markets.** `sensitivity_to_zeta_omega` and
-  `in_time_placebo` are the two largest phases under both backends at
-  every scale - they together account for roughly ~70% of the chain.
-  The difference is absolute: under Python they drive a multi-second
-  chain, under Rust they stay the top phases but of a sub-second total
-  runtime. That is the Python-vs-Rust story for this scenario.
+  `in_time_placebo` are the two largest phases under Python at every
+  scale and under Rust at medium/large (together ~70% of the chain).
+  At Rust small the absolute cost collapses so far that per-phase
+  fixed overhead dominates and `2_sdid_bootstrap_variance_200` slightly
+  edges the other two. The difference across backends is absolute:
+  under Python these phases drive a multi-second chain, under Rust
+  they stay in the top ranks but of a sub-second total runtime. That
+  is the Python-vs-Rust story for this scenario.
 - **Reversible dCDH.** Main fit and heterogeneity refit are the two
   largest phases by design - together effectively the whole chain. The
   split is not stable across backends: under Python the main fit is
@@ -152,7 +157,7 @@ any rerun):
 | # | Location | Scenario + scale | Signal | Recommended action |
 |---|---|---|---|---|
 | 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS @ 1M rows | dominates BRFSS chain at all scales, ~100% at 1M rows | **Algorithmic fix, highest priority.** Function called once per (state, year) cell (500 calls); per-call work rebuilds stratum-PSU scaffolding every time. Precompute stratum indexes once at `aggregate_survey` top-level and reuse. |
-| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | dominant phase of the CS chain under both backends at all scales; SunAbraham narrows the gap under Rust at large but ImputationDiD still leads | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. |
+| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | dominant phase under Python at every scale and under Rust at small/large; at Rust medium SunAbraham takes the top spot. Together ImputationDiD + SunAbraham are ~70-80% of the chain at every scale | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. |
 | 3 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | SDiD python @ any scale | dominates Python SDiD at all scales | **Already ported to Rust.** Python fallback acceptable as a teaching/safety path; non-production for n > 100. Python skipped at n=500 (jackknife cost would exceed 4 minutes per run). |
 | 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit + heterogeneity | Reversible (single scale) | main fit and survey-aware heterogeneity refit each rebuild TSL scaffolding; heterogeneity phase is as expensive as the main fit | **Cache/precompute** - heterogeneity refit duplicates the main fit's TSL setup under the same `SurveyDesign`. Not P0; newer code path (v3.1) never optimization-reviewed. |
 | 5 | `diff_diff/continuous_did.py` CDiD spline bootstrap | Dose-response (single scale) | four spline fits ~equal, linear in variant count | **Leave alone** - well under perceptible threshold. |
diff --git a/docs/performance-scenarios.md b/docs/performance-scenarios.md
index 7ac93385..3eeb89d5 100644
--- a/docs/performance-scenarios.md
+++ b/docs/performance-scenarios.md
@@ -201,11 +201,12 @@ serves a different purpose: R-parity accuracy). They complement it.
   compute_honest_did(results, method="relative_magnitude", M=[0.5, 1.0, 1.5])
   ```
 - **Operation chain.** (1) `aggregate_survey()` - the microdata-to-panel
-  collapse; (2) CS fit with staged second-stage SurveyDesign
-  (`weight_type="pweight"`, analytical TSL via strata + PSU) and bootstrap
-  at PSU level; (3) event-study pre-trend inspection; (4) HonestDiD
-  sensitivity grid; (5) SunAbraham robustness refit using the same
-  second-stage pweight SurveyDesign; (6) `practitioner_next_steps()`.
+  collapse; (2) CS fit with the second-stage SurveyDesign returned by
+  `aggregate_survey` (pweight + geographic PSU clustering; `aggregate_survey`
+  does not stratify the collapsed cell panel) and bootstrap at PSU level;
+  (3) event-study pre-trend inspection; (4) HonestDiD sensitivity grid;
+  (5) SunAbraham robustness refit using the same second-stage pweight
+  SurveyDesign; (6) `practitioner_next_steps()`.
 - **Source anchor.** `docs/practitioner_getting_started.rst` ("What If
   You Have Survey Data?" section), CDC BRFSS 2024 overview
   (cdc.gov/brfss/annual_data/2024), `diff_diff.prep.aggregate_survey`

From 030d5f573a57a49444a14e8e441b81575c0749c2 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 15:34:14 -0400
Subject: [PATCH 14/15] Pin dose-response cohort to period 3 (P1); harden
 narrative against close-race shifts

CI re-review P1: bench_dose_response.py inherited the CDiD generator's
default cohort [2], not the documented period 3. The fallback that
would have set first_treat=3 never ran (generator already populates
first_treat), so the committed baselines measured a different cohort
onset than the scenario doc. The binarized DiD phase also hardcoded
post >= 3, which further desynced it from the actual CDiD treatment
start under the default DGP.

Fix:
- Pin the generator to cohort_periods=[3] so the DGP matches the docs.
- Assert exactly one positive first_treat after generation; future
  DGP changes that break the single-cohort contract will fail loudly
  instead of drifting silently.
- Binarized phase now derives its post cutoff from the actual
  first_treat in the data, not a hardcoded period number, so it
  cannot desync from the CDiD fits above.
- Regenerated dose-response baselines for both backends.
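
A minimal sketch of the single-cohort guard and the derived post
cutoff described above. It is illustrative only, not the code in
bench_dose_response.py: it runs on plain row dicts rather than the
benchmark's data frame, and the helper name `single_cohort_onset`
and the `period` / `post` column names are assumptions.

```python
def single_cohort_onset(rows, cohort_col="first_treat"):
    """Return the only positive onset period; fail loudly otherwise."""
    # Never-treated units are assumed to carry first_treat == 0.
    onsets = sorted({r[cohort_col] for r in rows if r[cohort_col] > 0})
    assert len(onsets) == 1, f"expected one treated cohort, got {onsets}"
    return onsets[0]


# Toy panel: unit 1 is treated from period 3, unit 2 is never treated.
rows = [
    {"unit": u, "period": t, "first_treat": ft}
    for u, ft in [(1, 3), (2, 0)]
    for t in range(1, 5)
]

onset = single_cohort_onset(rows)

# Derive the binarized post indicator from the data, not a literal 3.
for r in rows:
    r["post"] = int(r["period"] >= onset)
```

If a DGP change ever produced a second treated cohort, the assertion
would fail before any timing ran.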

Structural narrative hardening:

Prior CI rounds have repeatedly re-flagged the same drift pattern:
the staggered campaign and reversible dCDH narratives make phase-
order claims at close-race cells (staggered Rust medium, dCDH at
this shape) that can flip on rerun because the two contenders are
within a few percentage points of each other. The underlying ranking
is not the right level of abstraction for narrative; the phase-share
table is. This commit rewrites both narratives to describe the
aggregate share pattern and defer per-cell ordering to the
generator-produced table. Scaling finding #2 and hotspot table row
#2 get the same treatment. Net effect: narrative claims are now
robust to rerun noise at close-race cells.
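
As a rough illustration of why per-cell ordering is fragile, the
sketch below computes phase shares from a baseline-shaped dict and
flags a close race. The helper names and the 5-point margin are
assumptions for illustration; the phase timings are the regenerated
reversible_dcdh Rust values from the previous patch in this series.

```python
def phase_shares(baseline):
    """Return (phase, share-of-chain) pairs, largest share first."""
    phases = baseline["phases"]
    total = sum(p["seconds"] for p in phases.values())
    pairs = [(name, p["seconds"] / total) for name, p in phases.items()]
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)


def is_close_race(shares, margin=0.05):
    """True when the top two phases are within `margin` of each other."""
    return len(shares) >= 2 and shares[0][1] - shares[1][1] < margin


baseline = {
    "phases": {
        "1_dcdh_fit_Lmax3_survey_TSL": {"seconds": 0.3866300829999999},
        "2_inspect_placebo_and_summary": {"seconds": 1.7920000001669933e-06},
        "3_honest_did_on_placebo": {"seconds": 0.004450167000000116},
        "4_heterogeneity_refit": {"seconds": 0.4010832499999999},
    }
}

shares = phase_shares(baseline)
close = is_close_race(shares)
```

Here the top two phases sit within about two percentage points of
each other, so a narrative claim about which one leads could flip on
the next rerun while the combined share stays put.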

Still measurement only. No changes under diff_diff/ or rust/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../brand_awareness_survey_large_python.json  |  22 ++--
 .../brand_awareness_survey_large_rust.json    |  22 ++--
 .../brand_awareness_survey_medium_python.json |  22 ++--
 .../brand_awareness_survey_medium_rust.json   |  22 ++--
 .../brand_awareness_survey_small_python.json  |  20 ++--
 .../brand_awareness_survey_small_rust.json    |  22 ++--
 .../baselines/brfss_panel_large_python.json   |  20 ++--
 .../baselines/brfss_panel_large_rust.json     |  20 ++--
 .../baselines/brfss_panel_medium_python.json  |  18 +--
 .../baselines/brfss_panel_medium_rust.json    |  20 ++--
 .../baselines/brfss_panel_small_python.json   |  20 ++--
 .../baselines/brfss_panel_small_rust.json     |  20 ++--
 .../campaign_staggered_large_python.json      |  24 ++--
 .../campaign_staggered_large_rust.json        |  24 ++--
 .../campaign_staggered_medium_python.json     |  24 ++--
 .../campaign_staggered_medium_rust.json       |  24 ++--
 .../campaign_staggered_small_python.json      |  24 ++--
 .../campaign_staggered_small_rust.json        |  24 ++--
 .../baselines/dose_response_python.json       |  20 ++--
 .../baselines/dose_response_rust.json         |  20 ++--
 .../baselines/geo_few_markets_large_rust.json |  20 ++--
 .../geo_few_markets_medium_python.json        |  20 ++--
 .../geo_few_markets_medium_rust.json          |  20 ++--
 .../geo_few_markets_small_python.json         |  20 ++--
 .../baselines/geo_few_markets_small_rust.json |  20 ++--
 .../baselines/reversible_dcdh_python.json     |  16 +--
 .../baselines/reversible_dcdh_rust.json       |  16 +--
 .../speed_review/bench_dose_response.py       |  27 +++--
 docs/performance-plan.md                      | 110 +++++++++---------
 29 files changed, 358 insertions(+), 343 deletions(-)

diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
index 53420b48..c8eb9108 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.963601333,
+  "total_seconds": 1.0910496250000001,
   "memory": {
     "available": true,
-    "start_mb": 197.14,
-    "peak_mb": 347.12,
-    "growth_mb": 149.98,
+    "start_mb": 188.45,
+    "peak_mb": 327.44,
+    "growth_mb": 138.98,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0072292080000000425,
+      "seconds": 0.009826500000000182,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.017787999999999915,
+      "seconds": 0.030280333999999964,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.5239118750000002,
+      "seconds": 0.6243122919999999,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.2471308750000003,
+      "seconds": 0.24174716599999968,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.023947041000000002,
+      "seconds": 0.025623749999999834,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.010973958000000117,
+      "seconds": 0.01191299999999984,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.13260737500000008,
+      "seconds": 0.147335875,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
index 4d5629fc..a3eb721c 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.880690958,
+  "total_seconds": 1.0000031249999999,
   "memory": {
     "available": true,
-    "start_mb": 189.05,
-    "peak_mb": 341.62,
-    "growth_mb": 152.58,
+    "start_mb": 194.03,
+    "peak_mb": 336.08,
+    "growth_mb": 142.05,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.013126208999999944,
+      "seconds": 0.013511041000000112,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.03180279199999991,
+      "seconds": 0.03037650000000003,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.4201593749999999,
+      "seconds": 0.5431151669999998,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.21300062499999983,
+      "seconds": 0.21752962499999962,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.036875957999999986,
+      "seconds": 0.04399687500000038,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.025331083999999837,
+      "seconds": 0.016433082999999904,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.14037483400000017,
+      "seconds": 0.13501837500000002,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
index 0fe9b936..869c5393 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.802721625,
+  "total_seconds": 0.563283334,
   "memory": {
     "available": true,
-    "start_mb": 132.53,
-    "peak_mb": 188.75,
-    "growth_mb": 56.22,
+    "start_mb": 133.69,
+    "peak_mb": 187.7,
+    "growth_mb": 54.02,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.011162792000000032,
+      "seconds": 0.010921792000000097,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.033565499999999915,
+      "seconds": 0.03732066599999995,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.4388773749999999,
+      "seconds": 0.20805304199999997,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.17477937499999996,
+      "seconds": 0.12622899999999992,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.02887754099999995,
+      "seconds": 0.01834783299999998,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.05516908300000001,
+      "seconds": 0.054030583000000076,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.060266375000000094,
+      "seconds": 0.10836029199999997,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
index 32ec4723..2ceed1ca 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.512456792,
+  "total_seconds": 0.5500554579999999,
   "memory": {
     "available": true,
-    "start_mb": 133.5,
-    "peak_mb": 183.52,
-    "growth_mb": 50.02,
+    "start_mb": 135.36,
+    "peak_mb": 184.86,
+    "growth_mb": 49.5,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.012339082999999973,
+      "seconds": 0.011186999999999947,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.0368013330000001,
+      "seconds": 0.03363270800000007,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.14184362500000003,
+      "seconds": 0.18678066699999996,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.1607655830000001,
+      "seconds": 0.16038787500000007,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.025544416000000014,
+      "seconds": 0.022171542000000155,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.060123250000000183,
+      "seconds": 0.0532650830000001,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.07500912500000001,
+      "seconds": 0.08262075000000002,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
index 5a9d69eb..699da724 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.202955542,
+  "total_seconds": 0.19338629200000002,
   "memory": {
     "available": true,
     "start_mb": 115.48,
-    "peak_mb": 126.53,
-    "growth_mb": 11.05,
+    "peak_mb": 127.31,
+    "growth_mb": 11.83,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0013837089999999552,
+      "seconds": 0.0014470410000000378,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.006605916000000045,
+      "seconds": 0.0072707499999999925,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.020932875000000073,
+      "seconds": 0.023173292000000068,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.05001366600000001,
+      "seconds": 0.03375529200000005,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.00929579199999997,
+      "seconds": 0.01041325000000004,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.028791582999999954,
+      "seconds": 0.027520249999999913,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.08592387499999998,
+      "seconds": 0.08979433299999995,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
index 1f816a42..006bc684 100644
--- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
+++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json
@@ -2,47 +2,47 @@
   "scenario": "brand_awareness_survey_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.209049209,
+  "total_seconds": 0.19669587500000008,
   "memory": {
     "available": true,
-    "start_mb": 115.23,
-    "peak_mb": 128.47,
-    "growth_mb": 13.23,
+    "start_mb": 114.78,
+    "peak_mb": 127.91,
+    "growth_mb": 13.12,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_naive_fit_no_survey_design": {
-      "seconds": 0.0016902500000000042,
+      "seconds": 0.0016678749999999853,
       "ok": true,
       "error": null
     },
     "2_tsl_strata_psu_fpc": {
-      "seconds": 0.005705791000000016,
+      "seconds": 0.005756874999999995,
       "ok": true,
       "error": null
     },
     "3_replicate_weights_jk1": {
-      "seconds": 0.01469479100000004,
+      "seconds": 0.012066042000000055,
       "ok": true,
       "error": null
     },
     "4_multi_outcome_loop_3_metrics": {
-      "seconds": 0.05941337499999999,
+      "seconds": 0.05887395800000006,
       "ok": true,
       "error": null
     },
     "5_check_parallel_trends": {
-      "seconds": 0.009663624999999954,
+      "seconds": 0.008938375000000054,
       "ok": true,
       "error": null
     },
     "6_placebo_refit_pre_period": {
-      "seconds": 0.02708766600000001,
+      "seconds": 0.0274049999999999,
       "ok": true,
       "error": null
     },
     "7_event_study_plus_honest_did": {
-      "seconds": 0.090782625,
+      "seconds": 0.08197737500000002,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_python.json b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
index db64809a..1772355b 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 23.767577457999998,
+  "total_seconds": 24.406984582999996,
   "memory": {
     "available": true,
-    "start_mb": 398.88,
-    "peak_mb": 426.2,
-    "growth_mb": 27.33,
+    "start_mb": 401.05,
+    "peak_mb": 418.12,
+    "growth_mb": 17.08,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 23.684428833,
+      "seconds": 24.295822291,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012686416999997618,
+      "seconds": 0.012265292000002148,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.4160000009487703e-06,
+      "seconds": 2.2919999977943917e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0016776669999991611,
+      "seconds": 0.0016812089999973523,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.06832108300000428,
+      "seconds": 0.09669395799999592,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0004543749999967872,
+      "seconds": 0.0005083750000025589,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
index 60fad453..886c63cc 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 26.383662291999997,
+  "total_seconds": 24.936181916,
   "memory": {
     "available": true,
-    "start_mb": 397.69,
-    "peak_mb": 427.11,
-    "growth_mb": 29.42,
+    "start_mb": 396.06,
+    "peak_mb": 429.31,
+    "growth_mb": 33.25,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 26.308744165999997,
+      "seconds": 24.820139083,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.017224583999997378,
+      "seconds": 0.012674374999996019,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.207999997949628e-06,
+      "seconds": 2.500000000793534e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0018082090000035578,
+      "seconds": 0.0015977500000019518,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05578533399999941,
+      "seconds": 0.10144270800000044,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 8.741699999603725e-05,
+      "seconds": 0.00030387500000017553,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
index 07ed4bab..91e5e648 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json
@@ -2,22 +2,22 @@
   "scenario": "brfss_panel_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 5.935090292,
+  "total_seconds": 6.096216417,
   "memory": {
     "available": true,
-    "start_mb": 191.91,
-    "peak_mb": 209.75,
-    "growth_mb": 17.84,
+    "start_mb": 193.25,
+    "peak_mb": 209.78,
+    "growth_mb": 16.53,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 5.871870541000001,
+      "seconds": 5.9895347910000005,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.011603208000000365,
+      "seconds": 0.012643416999999602,
       "ok": true,
       "error": null
     },
@@ -27,17 +27,17 @@
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0015359169999999978,
+      "seconds": 0.0015969160000004479,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.05003187500000017,
+      "seconds": 0.0921533340000007,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 3.670900000152244e-05,
+      "seconds": 0.0002710829999994502,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
index feb4e4fd..670b3135 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 6.301258208999999,
+  "total_seconds": 6.228102207999999,
   "memory": {
     "available": true,
-    "start_mb": 195.36,
-    "peak_mb": 211.09,
-    "growth_mb": 15.73,
+    "start_mb": 197.56,
+    "peak_mb": 212.22,
+    "growth_mb": 14.66,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 6.1868029159999995,
+      "seconds": 6.142273,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.012106875000000628,
+      "seconds": 0.012037416000000078,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.2919999995707485e-06,
+      "seconds": 2.1249999999639613e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0017474579999987583,
+      "seconds": 0.0016153329999983868,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.10031245899999952,
+      "seconds": 0.07184195800000026,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00027812499999946283,
+      "seconds": 0.0003229160000000064,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_python.json b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
index 3a25c6ba..093a7daf 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_python.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.5694494999999997,
+  "total_seconds": 1.608562042,
   "memory": {
     "available": true,
-    "start_mb": 121.31,
-    "peak_mb": 133.78,
-    "growth_mb": 12.47,
+    "start_mb": 121.97,
+    "peak_mb": 133.39,
+    "growth_mb": 11.42,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.472794875,
+      "seconds": 1.523675458,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.014274458000000045,
+      "seconds": 0.015124000000000137,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.0410000001191975e-06,
+      "seconds": 2.165999999803603e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.003753040999999957,
+      "seconds": 0.004194041999999953,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.07832733300000028,
+      "seconds": 0.0653021250000001,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.00029404099999963407,
+      "seconds": 0.00026012500000005545,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
index de7c32a7..a1f19a21 100644
--- a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
+++ b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "brfss_panel_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.6511473330000002,
+  "total_seconds": 1.6610665,
   "memory": {
     "available": true,
-    "start_mb": 120.92,
-    "peak_mb": 134.12,
-    "growth_mb": 13.2,
+    "start_mb": 121.16,
+    "peak_mb": 136.44,
+    "growth_mb": 15.28,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_aggregate_survey_microdata_to_panel": {
-      "seconds": 1.565322084,
+      "seconds": 1.5438897920000003,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_stage2_survey_design": {
-      "seconds": 0.014890000000000292,
+      "seconds": 0.01586162499999988,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.208000000170074e-06,
+      "seconds": 2.4999999999053557e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_grid": {
-      "seconds": 0.0039766250000004,
+      "seconds": 0.003953542000000088,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.06669404099999987,
+      "seconds": 0.09701791599999998,
       "ok": true,
       "error": null
     },
     "6_practitioner_next_steps": {
-      "seconds": 0.0002490840000000105,
+      "seconds": 0.00032904199999972406,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
index cf7cc0da..0c2dc359 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 1.274976542,
+  "total_seconds": 1.3326843750000001,
   "memory": {
     "available": true,
-    "start_mb": 222.83,
-    "peak_mb": 484.78,
-    "growth_mb": 261.95,
+    "start_mb": 227.28,
+    "peak_mb": 472.22,
+    "growth_mb": 244.94,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.018749040999999966,
+      "seconds": 0.019139459000000025,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.1616636250000001,
+      "seconds": 0.16680450000000002,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.4170000002831102e-06,
+      "seconds": 3.042000000341716e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0023987919999997054,
+      "seconds": 0.002607332999999823,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.3264373329999999,
+      "seconds": 0.3669262500000001,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.644759042,
+      "seconds": 0.649511,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.12092275000000008,
+      "seconds": 0.12763954200000027,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.683299999979184e-05,
+      "seconds": 4.033299999983697e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
index 4e4fa8b8..6766f7ac 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 1.2839995000000002,
+  "total_seconds": 1.3826507919999997,
   "memory": {
     "available": true,
-    "start_mb": 254.2,
-    "peak_mb": 585.02,
-    "growth_mb": 330.81,
+    "start_mb": 265.8,
+    "peak_mb": 587.92,
+    "growth_mb": 322.12,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.019514583000000307,
+      "seconds": 0.019430332999999855,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.17894224999999997,
+      "seconds": 0.17791104199999985,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.082999999737268e-06,
+      "seconds": 3.5419999999675156e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.002544458000000027,
+      "seconds": 0.0025778330000001404,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.41660504200000004,
+      "seconds": 0.5076542499999999,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.5459507920000002,
+      "seconds": 0.5523530000000001,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.12039437500000005,
+      "seconds": 0.12266958400000005,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.879200000023175e-05,
+      "seconds": 4.233299999967244e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
index 60a9879d..914a09aa 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.7186370420000001,
+  "total_seconds": 0.7537883749999998,
   "memory": {
     "available": true,
-    "start_mb": 144.94,
-    "peak_mb": 222.11,
-    "growth_mb": 77.17,
+    "start_mb": 147.67,
+    "peak_mb": 226.62,
+    "growth_mb": 78.95,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.011784207999999907,
+      "seconds": 0.012091666999999973,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.09282820899999988,
+      "seconds": 0.09575774999999997,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.9580000000528628e-06,
+      "seconds": 2.9589999999135586e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0021937499999999943,
+      "seconds": 0.002356958999999881,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.266259416,
+      "seconds": 0.276134208,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.2791953329999999,
+      "seconds": 0.2946765,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.0663273339999999,
+      "seconds": 0.07270195899999998,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 4.074999999970075e-05,
+      "seconds": 5.983399999998085e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
index 84113174..81c02255 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.7459894999999999,
+  "total_seconds": 0.756008333,
   "memory": {
     "available": true,
-    "start_mb": 150.91,
-    "peak_mb": 251.52,
-    "growth_mb": 100.61,
+    "start_mb": 154.94,
+    "peak_mb": 254.11,
+    "growth_mb": 99.17,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.012153583000000134,
+      "seconds": 0.012925999999999993,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.095559167,
+      "seconds": 0.09863954099999983,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 3.374999999916639e-06,
+      "seconds": 3.1659999999433808e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0023907919999999194,
+      "seconds": 0.0024457499999999133,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.35323991600000015,
+      "seconds": 0.281516125,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.21440725000000005,
+      "seconds": 0.29128733399999995,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.06818954200000005,
+      "seconds": 0.06915141700000005,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.945799999982569e-05,
+      "seconds": 3.383300000003864e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
index ca684788..44e82483 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.49590554100000006,
+  "total_seconds": 0.509287875,
   "memory": {
     "available": true,
-    "start_mb": 113.95,
-    "peak_mb": 140.97,
-    "growth_mb": 27.02,
+    "start_mb": 114.72,
+    "peak_mb": 143.08,
+    "growth_mb": 28.36,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.008187166999999995,
+      "seconds": 0.008488708000000011,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.061664542000000044,
+      "seconds": 0.06242541699999993,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.54199999993876e-06,
+      "seconds": 3.3329999999942572e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.0075763749999999686,
+      "seconds": 0.00873587500000006,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.16791750000000005,
+      "seconds": 0.18465104099999996,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.21524137500000018,
+      "seconds": 0.20897954100000016,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.035257708000000054,
+      "seconds": 0.03596216600000002,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 5.2833000000029884e-05,
+      "seconds": 3.28339999999816e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
index c8a81e8d..bfe53aed 100644
--- a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
+++ b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json
@@ -2,52 +2,52 @@
   "scenario": "campaign_staggered_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.5003938330000001,
+  "total_seconds": 0.501876834,
   "memory": {
     "available": true,
-    "start_mb": 113.81,
-    "peak_mb": 146.69,
-    "growth_mb": 32.88,
+    "start_mb": 114.78,
+    "peak_mb": 150.67,
+    "growth_mb": 35.89,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_bacon_decomposition": {
-      "seconds": 0.007302374999999972,
+      "seconds": 0.0068224170000000806,
       "ok": true,
       "error": null
     },
     "2_cs_fit_with_covariates_bootstrap999": {
-      "seconds": 0.06526799999999999,
+      "seconds": 0.06276566699999997,
       "ok": true,
       "error": null
     },
     "3_inspect_pretrends": {
-      "seconds": 2.8340000000071086e-06,
+      "seconds": 2.9160000000194586e-06,
       "ok": true,
       "error": null
     },
     "4_honest_did_M_grid": {
-      "seconds": 0.004741417000000081,
+      "seconds": 0.004543957999999959,
       "ok": true,
       "error": null
     },
     "5_sun_abraham_robustness": {
-      "seconds": 0.13823870900000002,
+      "seconds": 0.14964783299999995,
       "ok": true,
       "error": null
     },
     "6_imputation_did_robustness": {
-      "seconds": 0.24844470800000007,
+      "seconds": 0.241357292,
       "ok": true,
       "error": null
     },
     "7_cs_without_covariates": {
-      "seconds": 0.03635133300000004,
+      "seconds": 0.03669304200000001,
       "ok": true,
       "error": null
     },
     "8_practitioner_next_steps": {
-      "seconds": 3.945800000004773e-05,
+      "seconds": 3.850000000005238e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_python.json b/benchmarks/speed_review/baselines/dose_response_python.json
index 985928a6..0e576e88 100644
--- a/benchmarks/speed_review/baselines/dose_response_python.json
+++ b/benchmarks/speed_review/baselines/dose_response_python.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.5982971659999999,
+  "total_seconds": 0.5912168340000001,
   "memory": {
     "available": true,
-    "start_mb": 113.95,
-    "peak_mb": 121.86,
-    "growth_mb": 7.91,
+    "start_mb": 114.11,
+    "peak_mb": 123.11,
+    "growth_mb": 9.0,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.152698,
+      "seconds": 0.15039274999999996,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007580830000000205,
+      "seconds": 0.0007435829999999921,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.14973008399999999,
+      "seconds": 0.14597749999999998,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.001448709000000048,
+      "seconds": 0.0017279590000000011,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.14563666599999991,
+      "seconds": 0.14600595799999994,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.14802162500000016,
+      "seconds": 0.14636520799999997,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/dose_response_rust.json b/benchmarks/speed_review/baselines/dose_response_rust.json
index 75b80e57..51039f15 100644
--- a/benchmarks/speed_review/baselines/dose_response_rust.json
+++ b/benchmarks/speed_review/baselines/dose_response_rust.json
@@ -2,42 +2,42 @@
   "scenario": "dose_response",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.628465,
+  "total_seconds": 0.5952834579999999,
   "memory": {
     "available": true,
-    "start_mb": 113.91,
-    "peak_mb": 124.05,
-    "growth_mb": 10.14,
+    "start_mb": 113.73,
+    "peak_mb": 121.34,
+    "growth_mb": 7.61,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_cdid_cubic_spline_bootstrap199": {
-      "seconds": 0.16225933400000003,
+      "seconds": 0.15132816700000007,
       "ok": true,
       "error": null
     },
     "2_extract_dose_response_dataframes": {
-      "seconds": 0.0007119580000000125,
+      "seconds": 0.0007386659999999434,
       "ok": true,
       "error": null
     },
     "3_cdid_event_study_pretrend": {
-      "seconds": 0.15891516699999997,
+      "seconds": 0.147476167,
       "ok": true,
       "error": null
     },
     "4_binarized_did_comparison": {
-      "seconds": 0.0023566669999999235,
+      "seconds": 0.001677958000000035,
       "ok": true,
       "error": null
     },
     "5_spline_sensitivity_degree1": {
-      "seconds": 0.15112041699999978,
+      "seconds": 0.145152917,
       "ok": true,
       "error": null
     },
     "6_spline_sensitivity_num_knots2": {
-      "seconds": 0.15309695899999998,
+      "seconds": 0.14890500000000007,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
index 85ea499d..dce42749 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_large",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.2668321659999999,
+  "total_seconds": 0.26079429200000015,
   "memory": {
     "available": true,
-    "start_mb": 117.61,
-    "peak_mb": 118.0,
-    "growth_mb": 0.39,
+    "start_mb": 117.8,
+    "peak_mb": 118.22,
+    "growth_mb": 0.42,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.041749666000000074,
+      "seconds": 0.04102845799999999,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.03940208300000003,
+      "seconds": 0.03718729200000004,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.07818345800000004,
+      "seconds": 0.07744412499999997,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0007424160000000235,
+      "seconds": 0.0008073330000000212,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.10670491699999984,
+      "seconds": 0.10429091600000007,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 4.308399999986001e-05,
+      "seconds": 3.220799999992252e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
index 6d693e0d..868c0578 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 4.007914167,
+  "total_seconds": 3.9883142080000002,
   "memory": {
     "available": true,
-    "start_mb": 140.27,
-    "peak_mb": 146.11,
-    "growth_mb": 5.84,
+    "start_mb": 143.86,
+    "peak_mb": 151.53,
+    "growth_mb": 7.67,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.36536020800000024,
+      "seconds": 0.35804470799999955,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.3658956660000001,
+      "seconds": 0.36447529099999976,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.5556308329999995,
+      "seconds": 1.5563965419999999,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0006294580000005823,
+      "seconds": 0.0007229159999999624,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.72036825,
+      "seconds": 1.7086395420000002,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.5292000000121106e-05,
+      "seconds": 2.9666999999733434e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
index 8ed034ee..bd4471a6 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_medium",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.11954508399999997,
+  "total_seconds": 0.118741875,
   "memory": {
     "available": true,
-    "start_mb": 116.28,
-    "peak_mb": 116.72,
-    "growth_mb": 0.44,
+    "start_mb": 117.23,
+    "peak_mb": 117.64,
+    "growth_mb": 0.41,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.02039737500000005,
+      "seconds": 0.020535375000000022,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.02283175000000004,
+      "seconds": 0.023519291000000053,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.02504779200000007,
+      "seconds": 0.02495891699999997,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0006240410000000196,
+      "seconds": 0.0006400839999999297,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.05059725000000004,
+      "seconds": 0.049061250000000056,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 3.979200000003846e-05,
+      "seconds": 2.31669999999351e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
index 1246deae..e0bec083 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 3.6806861669999997,
+  "total_seconds": 3.697791375,
   "memory": {
     "available": true,
-    "start_mb": 113.86,
-    "peak_mb": 123.45,
-    "growth_mb": 9.59,
+    "start_mb": 114.09,
+    "peak_mb": 124.02,
+    "growth_mb": 9.92,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.5916519580000001,
+      "seconds": 0.593809709,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.5638875840000002,
+      "seconds": 0.584832209,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 1.1952183330000001,
+      "seconds": 1.194314458,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0009582500000000493,
+      "seconds": 0.0009036250000002966,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 1.3288901659999999,
+      "seconds": 1.3238487909999996,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 7.466600000061163e-05,
+      "seconds": 7.791699999959434e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
index 48988dfe..855eac85 100644
--- a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
+++ b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json
@@ -2,42 +2,42 @@
   "scenario": "geo_few_markets_small",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.04141854199999995,
+  "total_seconds": 0.04129770799999999,
   "memory": {
     "available": true,
-    "start_mb": 113.5,
-    "peak_mb": 115.0,
-    "growth_mb": 1.5,
+    "start_mb": 114.56,
+    "peak_mb": 116.05,
+    "growth_mb": 1.48,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_sdid_jackknife_variance": {
-      "seconds": 0.008168667000000074,
+      "seconds": 0.008074541000000046,
       "ok": true,
       "error": null
     },
     "2_sdid_bootstrap_variance_200": {
-      "seconds": 0.013055833999999988,
+      "seconds": 0.012903124999999904,
       "ok": true,
       "error": null
     },
     "3_in_time_placebo": {
-      "seconds": 0.008290083000000004,
+      "seconds": 0.008189833999999951,
       "ok": true,
       "error": null
     },
     "4_get_loo_effects_df": {
-      "seconds": 0.0008599579999999385,
+      "seconds": 0.0009220420000000118,
       "ok": true,
       "error": null
     },
     "5_sensitivity_to_zeta_omega": {
-      "seconds": 0.011012665999999949,
+      "seconds": 0.01117779200000002,
       "ok": true,
       "error": null
     },
     "6_weight_concentration": {
-      "seconds": 2.3749999999989058e-05,
+      "seconds": 2.6250000000005436e-05,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_python.json b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
index 729720da..1cbed394 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_python.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_python.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "python",
   "has_rust_backend": false,
-  "total_seconds": 0.8433709580000001,
+  "total_seconds": 0.718732833,
   "memory": {
     "available": true,
-    "start_mb": 113.41,
-    "peak_mb": 135.0,
-    "growth_mb": 21.59,
+    "start_mb": 113.5,
+    "peak_mb": 135.02,
+    "growth_mb": 21.52,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.490543708,
+      "seconds": 0.3450735829999999,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.7920000001669933e-06,
+      "seconds": 1.4160000000318362e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.004332082999999987,
+      "seconds": 0.004985583999999932,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.348488208,
+      "seconds": 0.36866958299999986,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
index 6d2ae098..2af530f5 100644
--- a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
+++ b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json
@@ -2,32 +2,32 @@
   "scenario": "reversible_dcdh",
   "backend": "rust",
   "has_rust_backend": true,
-  "total_seconds": 0.7921678750000001,
+  "total_seconds": 0.751090292,
   "memory": {
     "available": true,
-    "start_mb": 113.97,
-    "peak_mb": 134.59,
-    "growth_mb": 20.62,
+    "start_mb": 113.7,
+    "peak_mb": 134.89,
+    "growth_mb": 21.19,
     "sampler_interval_s": 0.01
   },
   "phases": {
     "1_dcdh_fit_Lmax3_survey_TSL": {
-      "seconds": 0.3866300829999999,
+      "seconds": 0.36838229199999994,
       "ok": true,
       "error": null
     },
     "2_inspect_placebo_and_summary": {
-      "seconds": 1.7920000001669933e-06,
+      "seconds": 1.3340000000194863e-06,
       "ok": true,
       "error": null
     },
     "3_honest_did_on_placebo": {
-      "seconds": 0.004450167000000116,
+      "seconds": 0.005142916999999914,
       "ok": true,
       "error": null
     },
     "4_heterogeneity_refit": {
-      "seconds": 0.4010832499999999,
+      "seconds": 0.3775615830000001,
       "ok": true,
       "error": null
     }
diff --git a/benchmarks/speed_review/bench_dose_response.py b/benchmarks/speed_review/bench_dose_response.py
index f0c86384..9dd38765 100644
--- a/benchmarks/speed_review/bench_dose_response.py
+++ b/benchmarks/speed_review/bench_dose_response.py
@@ -19,14 +19,21 @@
 
 
 def build_data(seed=42):
+    # cohort_periods=[3] pins the single treated cohort to period 3 to
+    # match the documented scenario shape. The generator default would
+    # be period 2, which would desync this scenario from the spec in
+    # docs/performance-scenarios.md and from the binarized DiD
+    # comparison phase below.
     df = generate_continuous_did_data(
-        n_units=500, n_periods=6, seed=seed,
+        n_units=500, n_periods=6, cohort_periods=[3], seed=seed,
+    )
+    positive_first_treat = sorted(
+        v for v in df["first_treat"].unique() if v > 0
+    )
+    assert len(positive_first_treat) == 1, (
+        "dose-response scenario expects exactly one treated cohort; "
+        f"got first_treat values {positive_first_treat}"
     )
-    # Set first_treat to period 3 for all treated; ContinuousDiD expects
-    # staggered/first_treat format.
-    if "first_treat" not in df.columns:
-        treated_mask = df.get("dose", pd.Series(0.0, index=df.index)) > 0
-        df["first_treat"] = np.where(treated_mask, 3, 0)
     return df
 
 
@@ -67,9 +74,15 @@ def cdid_event_study():
         )
 
     def binarized_comparison():
+        # Derive post from the actual first_treat cohort in the data so
+        # this phase is aligned with the CDiD fits above. A hardcoded
+        # period cutoff would silently desync if the DGP cohort moves.
+        treated_cohort = int(
+            sorted(v for v in data["first_treat"].unique() if v > 0)[0]
+        )
         data_bin = data.copy()
         data_bin["treated_any"] = (data_bin["dose"] > 0).astype(int)
-        data_bin["post"] = (data_bin["period"] >= 3).astype(int)
+        data_bin["post"] = (data_bin["period"] >= treated_cohort).astype(int)
         did = DifferenceInDifferences(robust=True)
         results["binarized"] = did.fit(
             data_bin, outcome="outcome", treatment="treated_any", time="post",
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index b7c3a8e3..b081eb68 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -41,20 +41,20 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 <!-- TABLE:start scale_sweep_totals -->
 | Scenario | Scale | Python (s) | Rust (s) | Py/Rust |
 |---|---|---:|---:|---:|
-| 1. Staggered campaign | small | 0.50 | 0.50 | 1.0x |
-|  | medium | 0.72 | 0.75 | 1.0x |
-|  | large | 1.27 | 1.28 | 1.0x |
-| 2. Brand awareness survey | small | 0.20 | 0.21 | 1.0x |
-|  | medium | 0.80 | 0.51 | 1.6x |
-|  | large | 0.96 | 0.88 | 1.1x |
-| 3. BRFSS microdata -> CS panel | small | 1.57 | 1.65 | 1.0x |
-|  | medium | 5.94 | 6.30 | 0.9x |
-|  | large | 23.77 | 26.38 | 0.9x |
-| 4. SDiD few markets | small | 3.68 | 0.04 | 88.9x |
-|  | medium | 4.01 | 0.12 | 33.5x |
-|  | large | skip | 0.27 | - |
-| 5. Reversible dCDH | single | 0.84 | 0.79 | 1.1x |
-| 6. Pricing dose-response | single | 0.60 | 0.63 | 1.0x |
+| 1. Staggered campaign | small | 0.51 | 0.50 | 1.0x |
+|  | medium | 0.75 | 0.76 | 1.0x |
+|  | large | 1.33 | 1.38 | 1.0x |
+| 2. Brand awareness survey | small | 0.19 | 0.20 | 1.0x |
+|  | medium | 0.56 | 0.55 | 1.0x |
+|  | large | 1.09 | 1.00 | 1.1x |
+| 3. BRFSS microdata -> CS panel | small | 1.61 | 1.66 | 1.0x |
+|  | medium | 6.10 | 6.23 | 1.0x |
+|  | large | 24.41 | 24.94 | 1.0x |
+| 4. SDiD few markets | small | 3.70 | 0.04 | 89.5x |
+|  | medium | 3.99 | 0.12 | 33.6x |
+|  | large | skip | 0.26 | - |
+| 5. Reversible dCDH | single | 0.72 | 0.75 | 1.0x |
+| 6. Pricing dose-response | single | 0.59 | 0.60 | 1.0x |
 <!-- TABLE:end scale_sweep_totals -->
 
 ### Scaling findings
@@ -69,9 +69,10 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
    (`aggregate_survey` is entirely Python).
 2. **Staggered CS chain stays cheap across scales.** A 10x unit increase
    (150 -> 1,500) is a small-single-digit multiplier on total time.
-   ImputationDiD is the dominant phase at most (scale, backend)
-   combinations; SunAbraham takes the top spot at Rust medium but the
-   two phases together consistently account for ~70-80% of the chain.
+   ImputationDiD and SunAbraham together consistently account for
+   ~70-80% of the chain; either can be the single top phase at a given
+   (scale, backend) cell, which is a per-cell ranking detail, not a
+   stable pattern to optimize against.
 3. **SDiD Rust gap is stable across scales, not emergent.** Python SDiD
    has a fixed per-jackknife-refit overhead that dominates even at small
    n. Rust stays sub-second through 500 units.
@@ -95,32 +96,32 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
 <!-- TABLE:start top_phases_by_scenario -->
 | Scenario | Scale | Backend | Top phase (%) | 2nd phase (%) | 3rd phase (%) |
 |---|---|---|---|---|---|
-| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (51%) | `5_sun_abraham_robustness` (26%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
-| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (43%) | `5_sun_abraham_robustness` (32%) | `2_cs_fit_with_covariates_bootstrap999` (14%) |
-| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (54%) | `4_multi_outcome_loop_3_metrics` (26%) | `7_event_study_plus_honest_did` (14%) |
-| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (48%) | `4_multi_outcome_loop_3_metrics` (24%) | `7_event_study_plus_honest_did` (16%) |
+| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (49%) | `5_sun_abraham_robustness` (28%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
+| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (40%) | `5_sun_abraham_robustness` (37%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
+| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (57%) | `4_multi_outcome_loop_3_metrics` (22%) | `7_event_study_plus_honest_did` (14%) |
+| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (54%) | `4_multi_outcome_loop_3_metrics` (22%) | `7_event_study_plus_honest_did` (14%) |
 | 3. BRFSS microdata -> CS panel | large | python | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
 | 3. BRFSS microdata -> CS panel | large | rust | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) |
 | 4. SDiD few markets | medium | python | `5_sensitivity_to_zeta_omega` (43%) | `3_in_time_placebo` (39%) | `2_sdid_bootstrap_variance_200` (9%) |
-| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (40%) | `3_in_time_placebo` (29%) | `1_sdid_jackknife_variance` (16%) |
-| 5. Reversible dCDH | single | python | `1_dcdh_fit_Lmax3_survey_TSL` (58%) | `4_heterogeneity_refit` (41%) | `3_honest_did_on_placebo` (1%) |
-| 5. Reversible dCDH | single | rust | `4_heterogeneity_refit` (51%) | `1_dcdh_fit_Lmax3_survey_TSL` (49%) | `3_honest_did_on_placebo` (1%) |
-| 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (25%) |
-| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (26%) | `3_cdid_event_study_pretrend` (25%) | `6_spline_sensitivity_num_knots2` (24%) |
+| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (40%) | `3_in_time_placebo` (30%) | `1_sdid_jackknife_variance` (16%) |
+| 5. Reversible dCDH | single | python | `4_heterogeneity_refit` (51%) | `1_dcdh_fit_Lmax3_survey_TSL` (48%) | `3_honest_did_on_placebo` (1%) |
+| 5. Reversible dCDH | single | rust | `4_heterogeneity_refit` (50%) | `1_dcdh_fit_Lmax3_survey_TSL` (49%) | `3_honest_did_on_placebo` (1%) |
+| 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (25%) | `6_spline_sensitivity_num_knots2` (25%) | `5_spline_sensitivity_degree1` (25%) |
+| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (25%) | `6_spline_sensitivity_num_knots2` (25%) | `3_cdid_event_study_pretrend` (25%) |
 <!-- TABLE:end top_phases_by_scenario -->
 
 Per-scenario phase narrative (cross-check against the table above after
 any rerun):
 
-- **Staggered campaign.** ImputationDiD robustness and SunAbraham are
-  the two largest phases at every scale, together accounting for
-  ~70-80% of the chain. Their relative order is not stable across
-  backend and scale: ImputationDiD is the single largest phase under
-  Python at every scale and under Rust at small and large, but at
-  Rust medium SunAbraham clearly leads (roughly 1.7x the ImputationDiD
-  phase there). CS fit with `n_bootstrap=999` (both with and without
-  covariates) is well-vectorized and sits well below both in the
-  ranking.
+- **Staggered campaign.** ImputationDiD robustness and SunAbraham
+  consistently account for ~70-80% of the chain at every scale. They
+  sit in a narrow phase-share band (each typically ~25-50%) and which
+  one leads varies by (scale, backend) and can flip across reruns at
+  medium scale where the two are close; see the table for the exact
+  ordering per cell. CS fit with `n_bootstrap=999` (both with and
+  without covariates) is well-vectorized and sits well below both in
+  the ranking. Either phase is a legitimate optimization target; the
+  aggregate share is what drives the "next hotspot" priority.
 - **Brand awareness survey.** At small scale HonestDiD dominates. At
   medium the backends diverge: on Python JK1 leads clearly (about
   2.2x the multi-outcome loop), while on Rust the multi-outcome loop
@@ -143,12 +144,13 @@ any rerun):
   they stay in the top ranks but of a sub-second total runtime. That
   is the Python-vs-Rust story for this scenario.
 - **Reversible dCDH.** Main fit and heterogeneity refit are the two
-  largest phases by design - together effectively the whole chain. The
-  split is not stable across backends: under Python the main fit is
-  the larger of the two (roughly 58/41), under Rust the heterogeneity
-  refit slightly leads (roughly 51/49). Both fits run under the same
-  `SurveyDesign` and rebuild shared TSL scaffolding - that is the
-  optimization opportunity.
+  largest phases by design - together effectively the whole chain,
+  with the remainder on HonestDiD at <2%. The two phases sit within a
+  few percentage points of each other at this shape and the leader
+  can flip across reruns under either backend. Both fits run under
+  the same `SurveyDesign` and rebuild shared TSL scaffolding - that
+  is the optimization opportunity, independent of which side is
+  slightly larger on a given measurement.
 - **Pricing dose-response.** Four spline fits account for essentially all
   runtime; linear scaling in variant count.
 
@@ -157,7 +159,7 @@ any rerun):
 | # | Location | Scenario + scale | Signal | Recommended action |
 |---|---|---|---|---|
 | 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS @ 1M rows | dominates BRFSS chain at all scales, ~100% at 1M rows | **Algorithmic fix, highest priority.** Function called once per (state, year) cell (500 calls); per-call work rebuilds stratum-PSU scaffolding every time. Precompute stratum indexes once at `aggregate_survey` top-level and reuse. |
-| 2 | `diff_diff/imputation.py` ImputationDiD fit | Staggered CS @ 1,500 units | dominant phase under Python at every scale and under Rust at small/large; at Rust medium SunAbraham takes the top spot. Together ImputationDiD + SunAbraham are ~70-80% of the chain at every scale | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. |
+| 2 | `diff_diff/imputation.py` ImputationDiD fit (+ `diff_diff/sun_abraham.py` SunAbraham fit) | Staggered CS @ 1,500 units | together consistently ~70-80% of the chain at every scale; either can be the top phase at a given (scale, backend) cell | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. Either phase is a legitimate target. |
 | 3 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | SDiD python @ any scale | dominates Python SDiD at all scales | **Already ported to Rust.** Python fallback acceptable as a teaching/safety path; non-production for n > 100. Python skipped at n=500 (jackknife cost would exceed 4 minutes per run). |
 | 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit + heterogeneity | Reversible (single scale) | main fit and survey-aware heterogeneity refit each rebuild TSL scaffolding; heterogeneity phase is as expensive as the main fit | **Cache/precompute** - heterogeneity refit duplicates the main fit's TSL setup under the same `SurveyDesign`. Not P0; newer code path (v3.1) never optimization-reviewed. |
 | 5 | `diff_diff/continuous_did.py` CDiD spline bootstrap | Dose-response (single scale) | four spline fits ~equal, linear in variant count | **Leave alone** - well under perceptible threshold. |
@@ -174,20 +176,20 @@ in `benchmarks/speed_review/baselines/mem_profile_brfss_large_<backend>.txt`.
 <!-- TABLE:start memory_by_scenario -->
 | Scenario | Scale | Py peak RSS (MB) | Py growth (MB) | Rust peak RSS (MB) | Rust growth (MB) |
 |---|---|---:|---:|---:|---:|
-| 1. Staggered campaign | small | 141 | 27 | 147 | 33 |
-|  | medium | 222 | 77 | 252 | 101 |
-|  | large | 485 | 262 | 585 | 331 |
-| 2. Brand awareness survey | small | 127 | 11 | 128 | 13 |
-|  | medium | 189 | 56 | 184 | 50 |
-|  | large | 347 | 150 | 342 | 153 |
-| 3. BRFSS microdata -> CS panel | small | 134 | 12 | 134 | 13 |
-|  | medium | 210 | 18 | 211 | 16 |
-|  | large | 426 | 27 | 427 | 29 |
-| 4. SDiD few markets | small | 123 | 10 | 115 | 2 |
-|  | medium | 146 | 6 | 117 | 0 |
+| 1. Staggered campaign | small | 143 | 28 | 151 | 36 |
+|  | medium | 227 | 79 | 254 | 99 |
+|  | large | 472 | 245 | 588 | 322 |
+| 2. Brand awareness survey | small | 127 | 12 | 128 | 13 |
+|  | medium | 188 | 54 | 185 | 50 |
+|  | large | 327 | 139 | 336 | 142 |
+| 3. BRFSS microdata -> CS panel | small | 133 | 11 | 136 | 15 |
+|  | medium | 210 | 17 | 212 | 15 |
+|  | large | 418 | 17 | 429 | 33 |
+| 4. SDiD few markets | small | 124 | 10 | 116 | 1 |
+|  | medium | 152 | 8 | 118 | 0 |
 |  | large | skip | skip | 118 | 0 |
 | 5. Reversible dCDH | single | 135 | 22 | 135 | 21 |
-| 6. Pricing dose-response | single | 122 | 8 | 124 | 10 |
+| 6. Pricing dose-response | single | 123 | 9 | 121 | 8 |
 <!-- TABLE:end memory_by_scenario -->
 
 The ~115-130 MB floor is the Python + diff-diff + numpy import footprint;

From 307868c9367132d616d4e0eea92b77196d15001c Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Sun, 19 Apr 2026 15:42:16 -0400
Subject: [PATCH 15/15] Harden brand-awareness narrative; rename JK1 label to
 "replicate-weight variance"

CI re-review P2 + P3, both docs/label only:

- docs/performance-plan.md still had two passages with
  specific-magnitude claims about brand-awareness medium: the
  per-scenario paragraph ("~1.6x under Python", "Python and Rust
  separate the most at medium") and scaling finding #5 ("~1.6x at
  worst on brand-awareness medium"). Those held on one earlier rerun,
  but the current committed baselines show medium at 0.56 / 0.55 s
  (essentially tied), and the widest non-SDiD gap is now ~1.1x at
  brand-large. Reworded both passages to describe the stable
  aggregate pattern and defer exact ratios to the scale-sweep table.
  Same treatment as the earlier staggered/dCDH pass: the narrative
  stops claiming magnitudes that can shift on rerun; the
  generator-owned table carries the specifics.
- bench_brand_awareness_survey.py module docstring labeled JK1 as
  "replicate-weight bootstrap". Per REGISTRY.md, JK1 is replicate-
  weight variance (jackknife-style), not bootstrap inference - they
  are distinct methodology surfaces. Renamed to "replicate-weight
  variance (JK1 delete-one-PSU)" with an inline note pointing to the
  registry.
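
For reference, the JK1-vs-bootstrap distinction can be sketched
generically. This is a textbook illustration, not diff-diff's
implementation: jk1_variance, bootstrap_variance, and the toy PSU
values are hypothetical names used only for this note. JK1 drops each
PSU exactly once (so the replicate count equals the PSU count) and
scales the squared deviations by (R - 1) / R; the bootstrap instead
resamples PSUs at random with replacement.

```python
import random
import statistics


def jk1_variance(psu_values):
    """JK1 (delete-one-PSU) variance of the mean of PSU-level values."""
    values = [float(v) for v in psu_values]
    r = len(values)  # one replicate per PSU
    total = sum(values)
    theta_hat = total / r
    # Replicate i re-estimates the mean with PSU i removed.
    replicates = [(total - v) / (r - 1) for v in values]
    return (r - 1) / r * sum((t - theta_hat) ** 2 for t in replicates)


def bootstrap_variance(psu_values, n_boot=2000, seed=0):
    """Bootstrap variance: resample PSUs with replacement at random."""
    values = [float(v) for v in psu_values]
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    ]
    return statistics.variance(means)
```

JK1 is deterministic and, for a simple mean, reproduces s^2 / R
exactly; the bootstrap figure depends on the seed and the draw count.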

Docs + docstring only. No script behaviour change; no baseline
regeneration needed.
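
The Py/Rust column can be re-derived from the committed baselines as a
spot check. A minimal sketch, assuming the <scenario>_<backend>.json
naming and the "total_seconds" key seen in the baseline diffs;
py_rust_ratio and ratio_from_baselines are hypothetical helpers, not
part of the table generator, and the one-decimal formatting is an
assumption about how the table rounds:

```python
import json
from pathlib import Path


def py_rust_ratio(python_seconds, rust_seconds):
    """Format a Python/Rust total-time ratio the way the table shows it."""
    return f"{python_seconds / rust_seconds:.1f}x"


def ratio_from_baselines(baseline_dir, scenario):
    """Read both backend baselines for one scenario and return the ratio."""
    base = Path(baseline_dir)
    totals = {}
    for backend in ("python", "rust"):
        with open(base / f"{scenario}_{backend}.json") as fh:
            totals[backend] = json.load(fh)["total_seconds"]
    return py_rust_ratio(totals["python"], totals["rust"])
```

With the totals committed in this series, py_rust_ratio(3.697791375,
0.04129770799999999) gives "89.5x" for SDiD small and
py_rust_ratio(0.718732833, 0.751090292) gives "1.0x" for dCDH.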

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../bench_brand_awareness_survey.py           |  6 ++--
 docs/performance-plan.md                      | 28 +++++++++----------
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/benchmarks/speed_review/bench_brand_awareness_survey.py b/benchmarks/speed_review/bench_brand_awareness_survey.py
index 7db53894..c41d66b9 100644
--- a/benchmarks/speed_review/bench_brand_awareness_survey.py
+++ b/benchmarks/speed_review/bench_brand_awareness_survey.py
@@ -3,8 +3,10 @@
 
 DifferenceInDifferences + SurveyDesign under two variance paths:
   (a) analytical Taylor-series linearization (strata + PSU + FPC)
-  (b) replicate-weight bootstrap (JK1 delete-one-PSU weights; count equals
-      the number of PSUs, so 40/90/160 at small/medium/large)
+  (b) replicate-weight variance (JK1 delete-one-PSU; count equals
+      the number of PSUs, so 40/90/160 at small/medium/large).
+      This is replicate-weight variance, not bootstrap resampling -
+      see REGISTRY.md for the distinction.
 
 Chains: naive fit (for SE-inflation comparison) -> TSL -> replicate -> multi-
 outcome refit loop -> check_parallel_trends -> placebo -> HonestDiD grid.
diff --git a/docs/performance-plan.md b/docs/performance-plan.md
index b081eb68..58f0f017 100644
--- a/docs/performance-plan.md
+++ b/docs/performance-plan.md
@@ -84,12 +84,12 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.
    n_units x n_replicates - faster growth than the chain total, so it
    increasingly dominates at large n.
 5. Rust backend gives large uplift only for SDiD (order-of-magnitude
-   and up). Elsewhere the gap is modest - under ~1.6x at worst on
-   brand-awareness medium, and within noise on the other scenarios
-   and scales. The primary bottlenecks live in Python code the Rust
-   backend does not touch (`aggregate_survey`, JK1 replicate fit), and
-   paths that Rust does touch (CS bootstrap, ImputationDiD, Survey
-   TSL) are already well-vectorized in Python.
+   and up). Elsewhere the gap is modest across all measured (scenario,
+   scale) cells - see the scale-sweep table for exact ratios. The
+   primary bottlenecks live in Python code the Rust backend does not
+   touch (`aggregate_survey`, JK1 replicate fit), and paths that Rust
+   does touch (CS bootstrap, ImputationDiD, Survey TSL) are already
+   well-vectorized in Python.
 
 ### Top phases by scenario at largest measured scale
 
@@ -122,15 +122,13 @@ any rerun):
   without covariates) is well-vectorized and sits well below both in
   the ranking. Either phase is a legitimate optimization target; the
   aggregate share is what drives the "next hotspot" priority.
-- **Brand awareness survey.** At small scale HonestDiD dominates. At
-  medium the backends diverge: on Python JK1 leads clearly (about
-  2.2x the multi-outcome loop), while on Rust the multi-outcome loop
-  and JK1 come in essentially tied. Medium is also the scale where
-  Python and Rust separate the most on total time (~1.6x under
-  Python at the time of writing); the analytical TSL path with FPC
-  appears to vectorize better under Rust at that shape. At large,
-  JK1 becomes the clearly dominant phase under both backends and
-  totals re-converge.
+- **Brand awareness survey.** At small scale HonestDiD dominates. From
+  medium onwards JK1 is the single largest phase under both backends;
+  see the table for the exact share per cell. Python and Rust totals
+  stay close across the sweep (within ~1.1x at every measured scale;
+  see the scale-sweep table). The JK1 replicate-fit loop is not
+  Rust-accelerated, so the choice of backend makes little difference
+  on this chain.
 - **BRFSS.** `aggregate_survey` share of total grows with scale and is
   effectively 100% of runtime at 1M rows. Downstream phases (CS fit,
   SunAbraham, HonestDiD) are a fraction of a second combined.