From 8b8b33b343df07685c6c6d2d1bac1302cd6cd4dc Mon Sep 17 00:00:00 2001
From: igerber
Date: Sun, 19 Apr 2026 17:22:04 -0400
Subject: [PATCH 1/2] Precompute stratum-PSU scaffolding in aggregate_survey
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The per-cell Taylor-series variance inside aggregate_survey previously
rebuilt stratum-PSU scaffolding (np.unique, per-stratum pandas groupby,
stratum FPC lookup) on every output cell. At BRFSS scale (50 states x
10 years = 500 cells, 20 strata, 1M microdata rows) this meant ~10K
pandas groupby constructions, each summing a mostly-zero psi vector and
paying full pandas setup cost; together they accounted for essentially
the entire chain's runtime.

This PR adds a frozen _PsuScaffolding dataclass plus private
_precompute_psu_scaffolding(resolved) and
_compute_if_variance_fast(psi, scf) helpers in diff_diff/survey.py.
aggregate_survey builds scaffolding once per design and threads it
through _cell_mean_variance via a new optional kwarg; the fast path
replaces the per-stratum groupby loop with two vectorized np.bincount
passes (psi → PSU sums, PSU sums → per-stratum first and second
moments) plus a closed-form meat = sum_h adjustment_h * centered_ss_h.

Scope is deliberately localized: _compute_stratified_psu_meat and
compute_survey_if_variance are unchanged, so every other TSL caller
(DiD, TWFE, CS, SunAbraham, dCDH, etc.) is unaffected. Replicate-weight
designs continue to route through compute_replicate_if_variance
unchanged.

Measured impact (benchmarks/speed_review/run_all.py, 1M rows BRFSS):

- Large: 24.4s → 1.33s (Python), 24.9s → 1.32s (Rust) [18.4-19.0x]
- Medium: 6.1s → 0.49s [12.5-13.2x]
- Small: 1.6s → 0.17s [7.6-10x]

No regression in any other scenario (all within run-to-run noise).
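As a rough illustration of the two-bincount fast path, here is a
minimal, self-contained sketch; the helper name, signature, and
argument layout below are illustrative only, not the actual private
API in diff_diff/survey.py (which also handles lonely-PSU policies
this sketch omits):

```python
import numpy as np

def stratified_psu_meat_fast(psi, psu_codes, psu_stratum,
                             n_psus_per_stratum, fpc_ratio):
    """Two np.bincount passes replacing a per-stratum pandas groupby.

    psi                : (n_rows,) influence values for one output cell
    psu_codes          : (n_rows,) int codes, PSUs globally unique across strata
    psu_stratum        : (n_psus,) stratum code of each PSU
    n_psus_per_stratum : (n_strata,) PSU counts n_h (sketch assumes n_h >= 2)
    fpc_ratio          : (n_strata,) FPC ratios, all 1.0 when no FPC
    """
    n_psus = len(psu_stratum)
    n_strata = len(n_psus_per_stratum)
    # Pass 1: rows -> PSU totals of psi
    psu_totals = np.bincount(psu_codes, weights=psi, minlength=n_psus)
    # Pass 2: PSU totals -> per-stratum first and second moments
    sum_h = np.bincount(psu_stratum, weights=psu_totals, minlength=n_strata)
    sumsq_h = np.bincount(psu_stratum, weights=psu_totals**2, minlength=n_strata)
    # Closed-form centered sum of squares:
    # sum_j (t_j - tbar_h)^2 = sumsq_h - sum_h^2 / n_h
    centered_ss = sumsq_h - sum_h**2 / n_psus_per_stratum
    # meat = sum_h adjustment_h * centered_ss_h,
    # with adjustment_h = fpc_h * n_h / (n_h - 1)
    adjustment = fpc_ratio * n_psus_per_stratum / (n_psus_per_stratum - 1)
    return float(np.sum(adjustment * centered_ss))
```

The point of the closed form is that no per-stratum Python loop or
pandas object construction happens per cell; both passes are O(n_rows)
and O(n_psus) array operations.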
Numerical equivalence: new TestAggregateSurveyScaffolding asserts
assert_allclose(atol=1e-14, rtol=1e-14) between fast and legacy paths
across seven design cases (stratified+PSU+FPC, stratified no FPC,
PSU-only, weights-only, and all three lonely_psu modes: remove /
certainty / adjust), plus structural tests on the scaffolding itself.
On the actual BRFSS-large 1M-row panel, y_mean is bit-identical and
y_se / y_precision drift at ~1 ULP (max relative diff 4.6e-16).

Existing coverage unchanged: all 43 TestAggregateSurvey tests green on
the fast path (the new default); all 129 test_survey.py tests green.

Documentation:

- docs/performance-plan.md: finding #1 rewritten ("practitioner-fast
  at every scale"), BRFSS bullet updated, hotspots row #1 marked
  LANDED, memory finding updated, priority table item #1 marked
  LANDED, new "Optimization landed" subsection, bottom line updated
  ("no practitioner-perceptible bottleneck remains"). Auto-tables
  regenerated via gen_findings_tables.py.
- CHANGELOG.md: new Performance entry under [Unreleased].

No user-facing API change. Methodology docs (REGISTRY.md,
survey-theory.md) are deliberately not touched: this is a pure
internal performance optimization with numerics preserved to sub-ULP
tolerance.
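The fast-vs-legacy equivalence being asserted can be reproduced in
miniature. The sketch below is a toy with invented variable names and
a simplified meat (no FPC lookup per row, no lonely-PSU handling); it
is not the actual TestAggregateSurveyScaffolding code, only an
illustration of why the two paths agree to near machine precision:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_strata, psus_per_stratum, rows_per_psu = 5, 4, 30
n_psus = n_strata * psus_per_stratum
psu = np.repeat(np.arange(n_psus), rows_per_psu)     # globally unique PSU codes
stratum = psu // psus_per_stratum                    # stratum of each row
psu_stratum = np.arange(n_psus) // psus_per_stratum  # stratum of each PSU
psi = rng.normal(size=psu.size)
fpc = np.full(n_strata, 0.9)                         # toy FPC ratios
n_h = np.full(n_strata, float(psus_per_stratum))

# Legacy-style path: per-stratum pandas groupby over PSU totals
legacy = 0.0
df = pd.DataFrame({"psi": psi, "stratum": stratum, "psu": psu})
for h, grp in df.groupby("stratum"):
    t = grp.groupby("psu")["psi"].sum().to_numpy()
    legacy += fpc[h] * len(t) / (len(t) - 1) * np.sum((t - t.mean()) ** 2)

# Fast path: two bincount passes plus closed-form centered sum of squares
t_psu = np.bincount(psu, weights=psi, minlength=n_psus)
s1 = np.bincount(psu_stratum, weights=t_psu, minlength=n_strata)
s2 = np.bincount(psu_stratum, weights=t_psu**2, minlength=n_strata)
fast = np.sum(fpc * n_h / (n_h - 1) * (s2 - s1**2 / n_h))

np.testing.assert_allclose(fast, legacy, atol=1e-14, rtol=1e-14)
```

The two paths differ only in accumulation order, so any drift is a
handful of ULPs, which is what the real suite observes on the
BRFSS-large panel.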
Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 3 + .../brand_awareness_survey_large_python.json | 22 +- .../brand_awareness_survey_large_rust.json | 22 +- .../brand_awareness_survey_medium_python.json | 22 +- .../brand_awareness_survey_medium_rust.json | 22 +- .../brand_awareness_survey_small_python.json | 22 +- .../brand_awareness_survey_small_rust.json | 22 +- .../baselines/brfss_panel_large_python.json | 20 +- .../baselines/brfss_panel_large_rust.json | 20 +- .../baselines/brfss_panel_medium_python.json | 20 +- .../baselines/brfss_panel_medium_rust.json | 20 +- .../baselines/brfss_panel_small_python.json | 20 +- .../baselines/brfss_panel_small_rust.json | 20 +- .../campaign_staggered_large_python.json | 24 +- .../campaign_staggered_large_rust.json | 24 +- .../campaign_staggered_medium_python.json | 24 +- .../campaign_staggered_medium_rust.json | 24 +- .../campaign_staggered_small_python.json | 24 +- .../campaign_staggered_small_rust.json | 24 +- .../baselines/dose_response_python.json | 20 +- .../baselines/dose_response_rust.json | 20 +- .../baselines/geo_few_markets_large_rust.json | 20 +- .../geo_few_markets_medium_python.json | 20 +- .../geo_few_markets_medium_rust.json | 20 +- .../geo_few_markets_small_python.json | 20 +- .../baselines/geo_few_markets_small_rust.json | 20 +- .../baselines/reversible_dcdh_python.json | 16 +- .../baselines/reversible_dcdh_rust.json | 16 +- diff_diff/prep.py | 23 +- diff_diff/survey.py | 320 ++++++++++++++++++ docs/performance-plan.md | 145 ++++---- tests/test_prep.py | 194 +++++++++++ 32 files changed, 906 insertions(+), 347 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 17ec76c4..7f62f00a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -10,6 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added - **`BusinessReport` and `DiagnosticReport` (experimental preview)** - practitioner-ready output layer. 
`BusinessReport(results, ...)` produces plain-English narrative summaries (`.summary()`, `.full_report()`, `.export_markdown()`, `.to_dict()`) from any of the 16 fitted result types. `DiagnosticReport(results, ...)` orchestrates the existing diagnostic battery (parallel trends, pre-trends power, HonestDiD sensitivity, Goodman-Bacon, heterogeneity, design-effect, EPV) plus estimator-native diagnostics for SyntheticDiD (`pre_treatment_fit`, weight concentration, in-time placebo, zeta sensitivity) and TROP (factor-model fit metrics). Both classes expose an AI-legible `to_dict()` schema (single source of truth; prose renders from the dict). BR auto-constructs DR by default so summaries mention pre-trends, robustness, and design-effect findings in one call. See `docs/methodology/REPORTING.md` for methodology deviations including the no-traffic-light-gates decision, pre-trends verdict thresholds (0.05 / 0.30), and power-aware phrasing driven by `compute_pretrends_power`. **Both schemas are marked experimental in this release** - wording, verdict thresholds, and schema shape will change; do not anchor downstream tooling on them yet. +### Performance +- **`aggregate_survey` stratum-PSU scaffolding precompute** — the per-cell Taylor-series variance inside `aggregate_survey` no longer rebuilds stratum-PSU scaffolding on every cell. A frozen `_PsuScaffolding` (strata codes, global PSU codes unique across strata, per-stratum counts and FPC ratios, singleton mask, static legitimate-zero counts and variance-computable flag) is precomputed once per design at the top of `aggregate_survey` and threaded through `_cell_mean_variance` to a new `_compute_if_variance_fast` path that replaces the per-stratum pandas groupby with two vectorized `np.bincount` passes. BRFSS-shaped 50-state × 10-year × 1M-row microdata → state-year panel drops from ~24s to sub-2s under both backends (the path is pure Python, so Python and Rust track each other). 
Numerical output is preserved to sub-ULP tolerance; seven-case equivalence tests (`TestAggregateSurveyScaffolding`) assert `assert_allclose(atol=1e-14, rtol=1e-14)` between fast and legacy paths across stratified+PSU+FPC, stratified no FPC, PSU-only, weights-only, and all three `lonely_psu` modes (remove / certainty / adjust). Replicate-weight designs continue to route through `compute_replicate_if_variance` unchanged. `_compute_stratified_psu_meat` is untouched — all other TSL callers (DiD / TWFE / CS / etc.) are unaffected. + ### Changed - Add Zenodo DOI badge to README; upgrade the BibTeX citation block with the concept DOI (`10.5281/zenodo.19646175`) and list author as Isaac Gerber (matching `CITATION.cff`). Add `doi:` and `identifiers:` entries (concept + versioned) to `CITATION.cff`. DOI was minted by Zenodo when v3.1.3 was released. - **`ChaisemartinDHaultfoeuille` heterogeneity + within-group-varying PSU/strata now supported under Binder TSL** - `fit(heterogeneity=..., survey_design=...)` no longer raises `NotImplementedError` when the resolved design's PSU or strata vary across the cells of a group. On the **Binder TSL** branch (`compute_survey_if_variance`), the heterogeneity WLS coefficient IF is expanded to observation level via the cell-period allocator `ψ_i = ψ_g * (w_i / W_{g, out_idx})` on the post-period cell — the DID_l post-period single-cell convention shipped in v3.1.x. Under PSU=group the PSU-level Binder TSL variance is byte-identical to the previous release (PSU-level aggregate telescopes to `ψ_g`); under within-group-varying PSU, mass lands in the post-period PSU of the transition. 
The **Rao-Wu replicate-weight** branch (`compute_replicate_if_variance`) retains the legacy group-level allocator `ψ_i = ψ_g * (w_i / W_g)`: replicate variance computes `θ_r = sum_i ratio_ir * ψ_i` at observation level and is therefore not PSU-telescoping, so the cell-period allocator would silently change the replicate SE whenever a replicate column's ratios vary within group (e.g., per-row replicate matrices). Replicate + heterogeneity fits therefore produce byte-identical SE to the previous release, and the newly-unblocked `heterogeneity=` + within-group-varying PSU combination is unreachable under replicate designs by construction (`SurveyDesign` rejects `replicate_weights` combined with explicit `strata/psu/fpc`). diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json index c8eb9108..22ed15d1 100644 --- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json +++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_python.json @@ -2,47 +2,47 @@ "scenario": "brand_awareness_survey_large", "backend": "python", "has_rust_backend": false, - "total_seconds": 1.0910496250000001, + "total_seconds": 0.8670909579999999, "memory": { "available": true, - "start_mb": 188.45, - "peak_mb": 327.44, - "growth_mb": 138.98, + "start_mb": 200.7, + "peak_mb": 340.16, + "growth_mb": 139.45, "sampler_interval_s": 0.01 }, "phases": { "1_naive_fit_no_survey_design": { - "seconds": 0.009826500000000182, + "seconds": 0.01288558399999995, "ok": true, "error": null }, "2_tsl_strata_psu_fpc": { - "seconds": 0.030280333999999964, + "seconds": 0.03156662499999996, "ok": true, "error": null }, "3_replicate_weights_jk1": { - "seconds": 0.6243122919999999, + "seconds": 0.39469687499999995, "ok": true, "error": null }, "4_multi_outcome_loop_3_metrics": { - "seconds": 0.24174716599999968, + "seconds": 0.22814783400000005, "ok": true, "error": null }, 
"5_check_parallel_trends": { - "seconds": 0.025623749999999834, + "seconds": 0.04083812500000006, "ok": true, "error": null }, "6_placebo_refit_pre_period": { - "seconds": 0.01191299999999984, + "seconds": 0.014936375000000002, "ok": true, "error": null }, "7_event_study_plus_honest_did": { - "seconds": 0.147335875, + "seconds": 0.14401216700000008, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json index a3eb721c..ffcc5060 100644 --- a/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json +++ b/benchmarks/speed_review/baselines/brand_awareness_survey_large_rust.json @@ -2,47 +2,47 @@ "scenario": "brand_awareness_survey_large", "backend": "rust", "has_rust_backend": true, - "total_seconds": 1.0000031249999999, + "total_seconds": 0.9299781670000002, "memory": { "available": true, - "start_mb": 194.03, - "peak_mb": 336.08, - "growth_mb": 142.05, + "start_mb": 190.2, + "peak_mb": 347.92, + "growth_mb": 157.72, "sampler_interval_s": 0.01 }, "phases": { "1_naive_fit_no_survey_design": { - "seconds": 0.013511041000000112, + "seconds": 0.01335629100000002, "ok": true, "error": null }, "2_tsl_strata_psu_fpc": { - "seconds": 0.03037650000000003, + "seconds": 0.0316900830000002, "ok": true, "error": null }, "3_replicate_weights_jk1": { - "seconds": 0.5431151669999998, + "seconds": 0.46433058400000005, "ok": true, "error": null }, "4_multi_outcome_loop_3_metrics": { - "seconds": 0.21752962499999962, + "seconds": 0.23703795799999994, "ok": true, "error": null }, "5_check_parallel_trends": { - "seconds": 0.04399687500000038, + "seconds": 0.030673249999999985, "ok": true, "error": null }, "6_placebo_refit_pre_period": { - "seconds": 0.016433082999999904, + "seconds": 0.011707583000000188, "ok": true, "error": null }, "7_event_study_plus_honest_did": { - "seconds": 0.13501837500000002, + "seconds": 0.14117254200000007, "ok": 
true, "error": null } diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json index 869c5393..a59f68b4 100644 --- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json +++ b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_python.json @@ -2,47 +2,47 @@ "scenario": "brand_awareness_survey_medium", "backend": "python", "has_rust_backend": false, - "total_seconds": 0.563283334, + "total_seconds": 0.529578166, "memory": { "available": true, - "start_mb": 133.69, - "peak_mb": 187.7, - "growth_mb": 54.02, + "start_mb": 137.67, + "peak_mb": 182.88, + "growth_mb": 45.2, "sampler_interval_s": 0.01 }, "phases": { "1_naive_fit_no_survey_design": { - "seconds": 0.010921792000000097, + "seconds": 0.01053379199999993, "ok": true, "error": null }, "2_tsl_strata_psu_fpc": { - "seconds": 0.03732066599999995, + "seconds": 0.032504792000000005, "ok": true, "error": null }, "3_replicate_weights_jk1": { - "seconds": 0.20805304199999997, + "seconds": 0.16178545899999996, "ok": true, "error": null }, "4_multi_outcome_loop_3_metrics": { - "seconds": 0.12622899999999992, + "seconds": 0.1744099589999999, "ok": true, "error": null }, "5_check_parallel_trends": { - "seconds": 0.01834783299999998, + "seconds": 0.02328412499999999, "ok": true, "error": null }, "6_placebo_refit_pre_period": { - "seconds": 0.054030583000000076, + "seconds": 0.06313762499999998, "ok": true, "error": null }, "7_event_study_plus_honest_did": { - "seconds": 0.10836029199999997, + "seconds": 0.06389345899999999, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json index 2ceed1ca..42535c3a 100644 --- a/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json +++ 
b/benchmarks/speed_review/baselines/brand_awareness_survey_medium_rust.json @@ -2,47 +2,47 @@ "scenario": "brand_awareness_survey_medium", "backend": "rust", "has_rust_backend": true, - "total_seconds": 0.5500554579999999, + "total_seconds": 0.50248775, "memory": { "available": true, - "start_mb": 135.36, - "peak_mb": 184.86, - "growth_mb": 49.5, + "start_mb": 133.94, + "peak_mb": 189.34, + "growth_mb": 55.41, "sampler_interval_s": 0.01 }, "phases": { "1_naive_fit_no_survey_design": { - "seconds": 0.011186999999999947, + "seconds": 0.010962209, "ok": true, "error": null }, "2_tsl_strata_psu_fpc": { - "seconds": 0.03363270800000007, + "seconds": 0.03478112499999997, "ok": true, "error": null }, "3_replicate_weights_jk1": { - "seconds": 0.18678066699999996, + "seconds": 0.13834324999999992, "ok": true, "error": null }, "4_multi_outcome_loop_3_metrics": { - "seconds": 0.16038787500000007, + "seconds": 0.1290292500000001, "ok": true, "error": null }, "5_check_parallel_trends": { - "seconds": 0.022171542000000155, + "seconds": 0.02951112499999997, "ok": true, "error": null }, "6_placebo_refit_pre_period": { - "seconds": 0.0532650830000001, + "seconds": 0.06002304200000008, "ok": true, "error": null }, "7_event_study_plus_honest_did": { - "seconds": 0.08262075000000002, + "seconds": 0.09981400000000007, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json index 699da724..51e34058 100644 --- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json +++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_python.json @@ -2,47 +2,47 @@ "scenario": "brand_awareness_survey_small", "backend": "python", "has_rust_backend": false, - "total_seconds": 0.19338629200000002, + "total_seconds": 0.22668149999999998, "memory": { "available": true, - "start_mb": 115.48, - "peak_mb": 127.31, - "growth_mb": 11.83, + 
"start_mb": 115.44, + "peak_mb": 130.16, + "growth_mb": 14.72, "sampler_interval_s": 0.01 }, "phases": { "1_naive_fit_no_survey_design": { - "seconds": 0.0014470410000000378, + "seconds": 0.00165958300000002, "ok": true, "error": null }, "2_tsl_strata_psu_fpc": { - "seconds": 0.0072707499999999925, + "seconds": 0.006191999999999975, "ok": true, "error": null }, "3_replicate_weights_jk1": { - "seconds": 0.023173292000000068, + "seconds": 0.02364570900000007, "ok": true, "error": null }, "4_multi_outcome_loop_3_metrics": { - "seconds": 0.03375529200000005, + "seconds": 0.07623400000000002, "ok": true, "error": null }, "5_check_parallel_trends": { - "seconds": 0.01041325000000004, + "seconds": 0.009393082999999969, "ok": true, "error": null }, "6_placebo_refit_pre_period": { - "seconds": 0.027520249999999913, + "seconds": 0.02586829199999996, "ok": true, "error": null }, "7_event_study_plus_honest_did": { - "seconds": 0.08979433299999995, + "seconds": 0.08367512499999996, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json index 006bc684..00cd03e8 100644 --- a/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json +++ b/benchmarks/speed_review/baselines/brand_awareness_survey_small_rust.json @@ -2,47 +2,47 @@ "scenario": "brand_awareness_survey_small", "backend": "rust", "has_rust_backend": true, - "total_seconds": 0.19669587500000008, + "total_seconds": 0.198891041, "memory": { "available": true, - "start_mb": 114.78, - "peak_mb": 127.91, - "growth_mb": 13.12, + "start_mb": 115.05, + "peak_mb": 127.78, + "growth_mb": 12.73, "sampler_interval_s": 0.01 }, "phases": { "1_naive_fit_no_survey_design": { - "seconds": 0.0016678749999999853, + "seconds": 0.0019442080000000583, "ok": true, "error": null }, "2_tsl_strata_psu_fpc": { - "seconds": 0.005756874999999995, + "seconds": 0.006045499999999926, "ok": true, "error": 
null }, "3_replicate_weights_jk1": { - "seconds": 0.012066042000000055, + "seconds": 0.02063908400000003, "ok": true, "error": null }, "4_multi_outcome_loop_3_metrics": { - "seconds": 0.05887395800000006, + "seconds": 0.05060483399999993, "ok": true, "error": null }, "5_check_parallel_trends": { - "seconds": 0.008938375000000054, + "seconds": 0.009498208000000008, "ok": true, "error": null }, "6_placebo_refit_pre_period": { - "seconds": 0.0274049999999999, + "seconds": 0.025947834000000003, "ok": true, "error": null }, "7_event_study_plus_honest_did": { - "seconds": 0.08197737500000002, + "seconds": 0.08419849999999995, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_python.json b/benchmarks/speed_review/baselines/brfss_panel_large_python.json index 1772355b..9437734c 100644 --- a/benchmarks/speed_review/baselines/brfss_panel_large_python.json +++ b/benchmarks/speed_review/baselines/brfss_panel_large_python.json @@ -2,42 +2,42 @@ "scenario": "brfss_panel_large", "backend": "python", "has_rust_backend": false, - "total_seconds": 24.406984582999996, + "total_seconds": 1.328024584, "memory": { "available": true, - "start_mb": 401.05, - "peak_mb": 418.12, - "growth_mb": 17.08, + "start_mb": 387.59, + "peak_mb": 412.75, + "growth_mb": 25.16, "sampler_interval_s": 0.01 }, "phases": { "1_aggregate_survey_microdata_to_panel": { - "seconds": 24.295822291, + "seconds": 1.2118086249999998, "ok": true, "error": null }, "2_cs_fit_with_stage2_survey_design": { - "seconds": 0.012265292000002148, + "seconds": 0.012898916999999788, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 2.2919999977943917e-06, + "seconds": 2.5409999997449972e-06, "ok": true, "error": null }, "4_honest_did_grid": { - "seconds": 0.0016812089999973523, + "seconds": 0.0018360419999998712, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.09669395799999592, + "seconds": 0.10123833299999996, "ok": true, "error": null }, 
"6_practitioner_next_steps": { - "seconds": 0.0005083750000025589, + "seconds": 0.00022966599999962867, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json index 886c63cc..338bfe61 100644 --- a/benchmarks/speed_review/baselines/brfss_panel_large_rust.json +++ b/benchmarks/speed_review/baselines/brfss_panel_large_rust.json @@ -2,42 +2,42 @@ "scenario": "brfss_panel_large", "backend": "rust", "has_rust_backend": true, - "total_seconds": 24.936181916, + "total_seconds": 1.31504775, "memory": { "available": true, - "start_mb": 396.06, - "peak_mb": 429.31, - "growth_mb": 33.25, + "start_mb": 384.2, + "peak_mb": 409.28, + "growth_mb": 25.08, "sampler_interval_s": 0.01 }, "phases": { "1_aggregate_survey_microdata_to_panel": { - "seconds": 24.820139083, + "seconds": 1.2451636250000002, "ok": true, "error": null }, "2_cs_fit_with_stage2_survey_design": { - "seconds": 0.012674374999996019, + "seconds": 0.013531541999999952, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 2.500000000793534e-06, + "seconds": 2.916000000130481e-06, "ok": true, "error": null }, "4_honest_did_grid": { - "seconds": 0.0015977500000019518, + "seconds": 0.001939415999999916, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.10144270800000044, + "seconds": 0.054231499999999766, "ok": true, "error": null }, "6_practitioner_next_steps": { - "seconds": 0.00030387500000017553, + "seconds": 0.0001666249999998648, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json index 91e5e648..ea65bf9d 100644 --- a/benchmarks/speed_review/baselines/brfss_panel_medium_python.json +++ b/benchmarks/speed_review/baselines/brfss_panel_medium_python.json @@ -2,42 +2,42 @@ "scenario": "brfss_panel_medium", "backend": "python", "has_rust_backend": false, 
- "total_seconds": 6.096216417, + "total_seconds": 0.48709708400000007, "memory": { "available": true, - "start_mb": 193.25, - "peak_mb": 209.78, - "growth_mb": 16.53, + "start_mb": 185.42, + "peak_mb": 202.75, + "growth_mb": 17.33, "sampler_interval_s": 0.01 }, "phases": { "1_aggregate_survey_microdata_to_panel": { - "seconds": 5.9895347910000005, + "seconds": 0.372203458, "ok": true, "error": null }, "2_cs_fit_with_stage2_survey_design": { - "seconds": 0.012643416999999602, + "seconds": 0.01215470800000018, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 2.166999999886343e-06, + "seconds": 2.5000000001274003e-06, "ok": true, "error": null }, "4_honest_did_grid": { - "seconds": 0.0015969160000004479, + "seconds": 0.0016202499999999898, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.0921533340000007, + "seconds": 0.10084249999999995, "ok": true, "error": null }, "6_practitioner_next_steps": { - "seconds": 0.0002710829999994502, + "seconds": 0.000269875000000086, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json index 670b3135..7876dd32 100644 --- a/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json +++ b/benchmarks/speed_review/baselines/brfss_panel_medium_rust.json @@ -2,42 +2,42 @@ "scenario": "brfss_panel_medium", "backend": "rust", "has_rust_backend": true, - "total_seconds": 6.228102207999999, + "total_seconds": 0.472971041, "memory": { "available": true, - "start_mb": 197.56, - "peak_mb": 212.22, - "growth_mb": 14.66, + "start_mb": 178.69, + "peak_mb": 199.55, + "growth_mb": 20.86, "sampler_interval_s": 0.01 }, "phases": { "1_aggregate_survey_microdata_to_panel": { - "seconds": 6.142273, + "seconds": 0.4003294999999999, "ok": true, "error": null }, "2_cs_fit_with_stage2_survey_design": { - "seconds": 0.012037416000000078, + "seconds": 0.0133387920000001, "ok": true, "error": null }, 
"3_inspect_pretrends": { - "seconds": 2.1249999999639613e-06, + "seconds": 2.4999999999053557e-06, "ok": true, "error": null }, "4_honest_did_grid": { - "seconds": 0.0016153329999983868, + "seconds": 0.0020148749999999715, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.07184195800000026, + "seconds": 0.057244916000000146, "ok": true, "error": null }, "6_practitioner_next_steps": { - "seconds": 0.0003229160000000064, + "seconds": 3.6416000000150106e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/brfss_panel_small_python.json b/benchmarks/speed_review/baselines/brfss_panel_small_python.json index 093a7daf..127748c2 100644 --- a/benchmarks/speed_review/baselines/brfss_panel_small_python.json +++ b/benchmarks/speed_review/baselines/brfss_panel_small_python.json @@ -2,42 +2,42 @@ "scenario": "brfss_panel_small", "backend": "python", "has_rust_backend": false, - "total_seconds": 1.608562042, + "total_seconds": 0.21261929199999996, "memory": { "available": true, - "start_mb": 121.97, - "peak_mb": 133.39, - "growth_mb": 11.42, + "start_mb": 121.34, + "peak_mb": 132.62, + "growth_mb": 11.28, "sampler_interval_s": 0.01 }, "phases": { "1_aggregate_survey_microdata_to_panel": { - "seconds": 1.523675458, + "seconds": 0.08785816700000004, "ok": true, "error": null }, "2_cs_fit_with_stage2_survey_design": { - "seconds": 0.015124000000000137, + "seconds": 0.016040416999999918, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 2.165999999803603e-06, + "seconds": 2.583000000000446e-06, "ok": true, "error": null }, "4_honest_did_grid": { - "seconds": 0.004194041999999953, + "seconds": 0.004216333999999988, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.0653021250000001, + "seconds": 0.10422679200000007, "ok": true, "error": null }, "6_practitioner_next_steps": { - "seconds": 0.00026012500000005545, + "seconds": 0.00026649999999994733, "ok": true, "error": null } diff --git 
a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json index a1f19a21..a22692ca 100644 --- a/benchmarks/speed_review/baselines/brfss_panel_small_rust.json +++ b/benchmarks/speed_review/baselines/brfss_panel_small_rust.json @@ -2,42 +2,42 @@ "scenario": "brfss_panel_small", "backend": "rust", "has_rust_backend": true, - "total_seconds": 1.6610665, + "total_seconds": 0.16585016600000002, "memory": { "available": true, - "start_mb": 121.16, - "peak_mb": 136.44, - "growth_mb": 15.28, + "start_mb": 121.91, + "peak_mb": 130.25, + "growth_mb": 8.34, "sampler_interval_s": 0.01 }, "phases": { "1_aggregate_survey_microdata_to_panel": { - "seconds": 1.5438897920000003, + "seconds": 0.084868791, "ok": true, "error": null }, "2_cs_fit_with_stage2_survey_design": { - "seconds": 0.01586162499999988, + "seconds": 0.016418874999999944, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 2.4999999999053557e-06, + "seconds": 3.124999999992717e-06, "ok": true, "error": null }, "4_honest_did_grid": { - "seconds": 0.003953542000000088, + "seconds": 0.004238000000000075, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.09701791599999998, + "seconds": 0.060278041000000004, "ok": true, "error": null }, "6_practitioner_next_steps": { - "seconds": 0.00032904199999972406, + "seconds": 3.820799999998403e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json index 0c2dc359..19bf1a59 100644 --- a/benchmarks/speed_review/baselines/campaign_staggered_large_python.json +++ b/benchmarks/speed_review/baselines/campaign_staggered_large_python.json @@ -2,52 +2,52 @@ "scenario": "campaign_staggered_large", "backend": "python", "has_rust_backend": false, - "total_seconds": 1.3326843750000001, + "total_seconds": 1.321951625, "memory": { "available": true, - 
"start_mb": 227.28, - "peak_mb": 472.22, - "growth_mb": 244.94, + "start_mb": 235.58, + "peak_mb": 486.17, + "growth_mb": 250.59, "sampler_interval_s": 0.01 }, "phases": { "1_bacon_decomposition": { - "seconds": 0.019139459000000025, + "seconds": 0.019820957999999944, "ok": true, "error": null }, "2_cs_fit_with_covariates_bootstrap999": { - "seconds": 0.16680450000000002, + "seconds": 0.17604354199999994, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 3.042000000341716e-06, + "seconds": 3.4580000001227518e-06, "ok": true, "error": null }, "4_honest_did_M_grid": { - "seconds": 0.002607332999999823, + "seconds": 0.002394666999999906, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.3669262500000001, + "seconds": 0.279372666, "ok": true, "error": null }, "6_imputation_did_robustness": { - "seconds": 0.649511, + "seconds": 0.716293292, "ok": true, "error": null }, "7_cs_without_covariates": { - "seconds": 0.12763954200000027, + "seconds": 0.12797208299999996, "ok": true, "error": null }, "8_practitioner_next_steps": { - "seconds": 4.033299999983697e-05, + "seconds": 3.8041999999904874e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json index 6766f7ac..87200f59 100644 --- a/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json +++ b/benchmarks/speed_review/baselines/campaign_staggered_large_rust.json @@ -2,52 +2,52 @@ "scenario": "campaign_staggered_large", "backend": "rust", "has_rust_backend": true, - "total_seconds": 1.3826507919999997, + "total_seconds": 1.310933833, "memory": { "available": true, - "start_mb": 265.8, - "peak_mb": 587.92, - "growth_mb": 322.12, + "start_mb": 254.7, + "peak_mb": 581.67, + "growth_mb": 326.97, "sampler_interval_s": 0.01 }, "phases": { "1_bacon_decomposition": { - "seconds": 0.019430332999999855, + "seconds": 0.01872620799999991, "ok": true, 
"error": null }, "2_cs_fit_with_covariates_bootstrap999": { - "seconds": 0.17791104199999985, + "seconds": 0.1628326659999999, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 3.5419999999675156e-06, + "seconds": 3.459000000205492e-06, "ok": true, "error": null }, "4_honest_did_M_grid": { - "seconds": 0.0025778330000001404, + "seconds": 0.00247950000000019, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.5076542499999999, + "seconds": 0.4679546669999999, "ok": true, "error": null }, "6_imputation_did_robustness": { - "seconds": 0.5523530000000001, + "seconds": 0.539718041, "ok": true, "error": null }, "7_cs_without_covariates": { - "seconds": 0.12266958400000005, + "seconds": 0.1191795830000002, "ok": true, "error": null }, "8_practitioner_next_steps": { - "seconds": 4.233299999967244e-05, + "seconds": 3.449999999993736e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json index 914a09aa..234f2918 100644 --- a/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json +++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_python.json @@ -2,52 +2,52 @@ "scenario": "campaign_staggered_medium", "backend": "python", "has_rust_backend": false, - "total_seconds": 0.7537883749999998, + "total_seconds": 0.81063825, "memory": { "available": true, - "start_mb": 147.67, - "peak_mb": 226.62, - "growth_mb": 78.95, + "start_mb": 150.39, + "peak_mb": 235.06, + "growth_mb": 84.67, "sampler_interval_s": 0.01 }, "phases": { "1_bacon_decomposition": { - "seconds": 0.012091666999999973, + "seconds": 0.013887540999999892, "ok": true, "error": null }, "2_cs_fit_with_covariates_bootstrap999": { - "seconds": 0.09575774999999997, + "seconds": 0.10513504099999982, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 2.9589999999135586e-06, + "seconds": 3.750000000080078e-06, 
"ok": true, "error": null }, "4_honest_did_M_grid": { - "seconds": 0.002356958999999881, + "seconds": 0.0026329160000000407, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.276134208, + "seconds": 0.2873527090000001, "ok": true, "error": null }, "6_imputation_did_robustness": { - "seconds": 0.2946765, + "seconds": 0.3267266660000001, "ok": true, "error": null }, "7_cs_without_covariates": { - "seconds": 0.07270195899999998, + "seconds": 0.07484287499999986, "ok": true, "error": null }, "8_practitioner_next_steps": { - "seconds": 5.983399999998085e-05, + "seconds": 5.050000000039745e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json index 81c02255..55107bbb 100644 --- a/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json +++ b/benchmarks/speed_review/baselines/campaign_staggered_medium_rust.json @@ -2,52 +2,52 @@ "scenario": "campaign_staggered_medium", "backend": "rust", "has_rust_backend": true, - "total_seconds": 0.756008333, + "total_seconds": 0.814152875, "memory": { "available": true, - "start_mb": 154.94, - "peak_mb": 254.11, - "growth_mb": 99.17, + "start_mb": 152.19, + "peak_mb": 252.59, + "growth_mb": 100.41, "sampler_interval_s": 0.01 }, "phases": { "1_bacon_decomposition": { - "seconds": 0.012925999999999993, + "seconds": 0.012288542000000069, "ok": true, "error": null }, "2_cs_fit_with_covariates_bootstrap999": { - "seconds": 0.09863954099999983, + "seconds": 0.09617150000000008, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 3.1659999999433808e-06, + "seconds": 3.084000000042053e-06, "ok": true, "error": null }, "4_honest_did_M_grid": { - "seconds": 0.0024457499999999133, + "seconds": 0.002409292000000063, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.281516125, + "seconds": 0.4186234579999999, "ok": true, "error": null }, 
"6_imputation_did_robustness": { - "seconds": 0.29128733399999995, + "seconds": 0.217003375, "ok": true, "error": null }, "7_cs_without_covariates": { - "seconds": 0.06915141700000005, + "seconds": 0.06760054199999987, "ok": true, "error": null }, "8_practitioner_next_steps": { - "seconds": 3.383300000003864e-05, + "seconds": 4.71669999999591e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json index 44e82483..7fe1a2ac 100644 --- a/benchmarks/speed_review/baselines/campaign_staggered_small_python.json +++ b/benchmarks/speed_review/baselines/campaign_staggered_small_python.json @@ -2,52 +2,52 @@ "scenario": "campaign_staggered_small", "backend": "python", "has_rust_backend": false, - "total_seconds": 0.509287875, + "total_seconds": 0.5199064999999999, "memory": { "available": true, - "start_mb": 114.72, - "peak_mb": 143.08, - "growth_mb": 28.36, + "start_mb": 114.66, + "peak_mb": 145.62, + "growth_mb": 30.97, "sampler_interval_s": 0.01 }, "phases": { "1_bacon_decomposition": { - "seconds": 0.008488708000000011, + "seconds": 0.006750833000000012, "ok": true, "error": null }, "2_cs_fit_with_covariates_bootstrap999": { - "seconds": 0.06242541699999993, + "seconds": 0.06804841700000008, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 3.3329999999942572e-06, + "seconds": 4.1669999999438545e-06, "ok": true, "error": null }, "4_honest_did_M_grid": { - "seconds": 0.00873587500000006, + "seconds": 0.005387375000000083, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.18465104099999996, + "seconds": 0.17906933400000002, "ok": true, "error": null }, "6_imputation_did_robustness": { - "seconds": 0.20897954100000016, + "seconds": 0.22210808299999996, "ok": true, "error": null }, "7_cs_without_covariates": { - "seconds": 0.03596216600000002, + "seconds": 0.038495792000000195, "ok": true, "error": null 
}, "8_practitioner_next_steps": { - "seconds": 3.28339999999816e-05, + "seconds": 3.6332999999943993e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json index bfe53aed..edeb195e 100644 --- a/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json +++ b/benchmarks/speed_review/baselines/campaign_staggered_small_rust.json @@ -2,52 +2,52 @@ "scenario": "campaign_staggered_small", "backend": "rust", "has_rust_backend": true, - "total_seconds": 0.501876834, + "total_seconds": 0.5057707079999999, "memory": { "available": true, - "start_mb": 114.78, - "peak_mb": 150.67, - "growth_mb": 35.89, + "start_mb": 114.27, + "peak_mb": 148.09, + "growth_mb": 33.83, "sampler_interval_s": 0.01 }, "phases": { "1_bacon_decomposition": { - "seconds": 0.0068224170000000806, + "seconds": 0.007045167000000019, "ok": true, "error": null }, "2_cs_fit_with_covariates_bootstrap999": { - "seconds": 0.06276566699999997, + "seconds": 0.06206424999999993, "ok": true, "error": null }, "3_inspect_pretrends": { - "seconds": 2.9160000000194586e-06, + "seconds": 2.6250000000338503e-06, "ok": true, "error": null }, "4_honest_did_M_grid": { - "seconds": 0.004543957999999959, + "seconds": 0.004464875000000035, "ok": true, "error": null }, "5_sun_abraham_robustness": { - "seconds": 0.14964783299999995, + "seconds": 0.19407279099999997, "ok": true, "error": null }, "6_imputation_did_robustness": { - "seconds": 0.241357292, + "seconds": 0.2018087919999999, "ok": true, "error": null }, "7_cs_without_covariates": { - "seconds": 0.03669304200000001, + "seconds": 0.03626620899999988, "ok": true, "error": null }, "8_practitioner_next_steps": { - "seconds": 3.850000000005238e-05, + "seconds": 4.0457999999965466e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/dose_response_python.json 
b/benchmarks/speed_review/baselines/dose_response_python.json index 0e576e88..40399067 100644 --- a/benchmarks/speed_review/baselines/dose_response_python.json +++ b/benchmarks/speed_review/baselines/dose_response_python.json @@ -2,42 +2,42 @@ "scenario": "dose_response", "backend": "python", "has_rust_backend": false, - "total_seconds": 0.5912168340000001, + "total_seconds": 0.5858542499999999, "memory": { "available": true, - "start_mb": 114.11, - "peak_mb": 123.11, - "growth_mb": 9.0, + "start_mb": 114.7, + "peak_mb": 122.31, + "growth_mb": 7.61, "sampler_interval_s": 0.01 }, "phases": { "1_cdid_cubic_spline_bootstrap199": { - "seconds": 0.15039274999999996, + "seconds": 0.15196441700000007, "ok": true, "error": null }, "2_extract_dose_response_dataframes": { - "seconds": 0.0007435829999999921, + "seconds": 0.0008212909999999463, "ok": true, "error": null }, "3_cdid_event_study_pretrend": { - "seconds": 0.14597749999999998, + "seconds": 0.14416820900000005, "ok": true, "error": null }, "4_binarized_did_comparison": { - "seconds": 0.0017279590000000011, + "seconds": 0.0015125420000000611, "ok": true, "error": null }, "5_spline_sensitivity_degree1": { - "seconds": 0.14600595799999994, + "seconds": 0.1431360410000001, "ok": true, "error": null }, "6_spline_sensitivity_num_knots2": { - "seconds": 0.14636520799999997, + "seconds": 0.14424499999999996, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/dose_response_rust.json b/benchmarks/speed_review/baselines/dose_response_rust.json index 51039f15..2c26010b 100644 --- a/benchmarks/speed_review/baselines/dose_response_rust.json +++ b/benchmarks/speed_review/baselines/dose_response_rust.json @@ -2,42 +2,42 @@ "scenario": "dose_response", "backend": "rust", "has_rust_backend": true, - "total_seconds": 0.5952834579999999, + "total_seconds": 0.6261942910000001, "memory": { "available": true, - "start_mb": 113.73, - "peak_mb": 121.34, - "growth_mb": 7.61, + "start_mb": 113.95, + "peak_mb": 123.27, 
+ "growth_mb": 9.31, "sampler_interval_s": 0.01 }, "phases": { "1_cdid_cubic_spline_bootstrap199": { - "seconds": 0.15132816700000007, + "seconds": 0.1623119999999999, "ok": true, "error": null }, "2_extract_dose_response_dataframes": { - "seconds": 0.0007386659999999434, + "seconds": 0.0007812500000000666, "ok": true, "error": null }, "3_cdid_event_study_pretrend": { - "seconds": 0.147476167, + "seconds": 0.15469937500000008, "ok": true, "error": null }, "4_binarized_did_comparison": { - "seconds": 0.001677958000000035, + "seconds": 0.001991167000000016, "ok": true, "error": null }, "5_spline_sensitivity_degree1": { - "seconds": 0.145152917, + "seconds": 0.15138845899999998, "ok": true, "error": null }, "6_spline_sensitivity_num_knots2": { - "seconds": 0.14890500000000007, + "seconds": 0.15501741599999996, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json index dce42749..637c260d 100644 --- a/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json +++ b/benchmarks/speed_review/baselines/geo_few_markets_large_rust.json @@ -2,42 +2,42 @@ "scenario": "geo_few_markets_large", "backend": "rust", "has_rust_backend": true, - "total_seconds": 0.26079429200000015, + "total_seconds": 0.23366233300000006, "memory": { "available": true, - "start_mb": 117.8, - "peak_mb": 118.22, - "growth_mb": 0.42, + "start_mb": 117.77, + "peak_mb": 118.11, + "growth_mb": 0.34, "sampler_interval_s": 0.01 }, "phases": { "1_sdid_jackknife_variance": { - "seconds": 0.04102845799999999, + "seconds": 0.03807345899999992, "ok": true, "error": null }, "2_sdid_bootstrap_variance_200": { - "seconds": 0.03718729200000004, + "seconds": 0.03627791699999994, "ok": true, "error": null }, "3_in_time_placebo": { - "seconds": 0.07744412499999997, + "seconds": 0.06991887500000005, "ok": true, "error": null }, "4_get_loo_effects_df": { - "seconds": 0.0008073330000000212, + 
"seconds": 0.0007567080000000503, "ok": true, "error": null }, "5_sensitivity_to_zeta_omega": { - "seconds": 0.10429091600000007, + "seconds": 0.08854208299999988, "ok": true, "error": null }, "6_weight_concentration": { - "seconds": 3.220799999992252e-05, + "seconds": 8.5874999999902e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json index 868c0578..283552a2 100644 --- a/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json +++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_python.json @@ -2,42 +2,42 @@ "scenario": "geo_few_markets_medium", "backend": "python", "has_rust_backend": false, - "total_seconds": 3.9883142080000002, + "total_seconds": 3.998488124999999, "memory": { "available": true, - "start_mb": 143.86, - "peak_mb": 151.53, - "growth_mb": 7.67, + "start_mb": 140.11, + "peak_mb": 148.12, + "growth_mb": 8.02, "sampler_interval_s": 0.01 }, "phases": { "1_sdid_jackknife_variance": { - "seconds": 0.35804470799999955, + "seconds": 0.35502641700000037, "ok": true, "error": null }, "2_sdid_bootstrap_variance_200": { - "seconds": 0.36447529099999976, + "seconds": 0.36030566600000036, "ok": true, "error": null }, "3_in_time_placebo": { - "seconds": 1.5563965419999999, + "seconds": 1.5716015000000008, "ok": true, "error": null }, "4_get_loo_effects_df": { - "seconds": 0.0007229159999999624, + "seconds": 0.0007380409999999671, "ok": true, "error": null }, "5_sensitivity_to_zeta_omega": { - "seconds": 1.7086395420000002, + "seconds": 1.7107877500000006, "ok": true, "error": null }, "6_weight_concentration": { - "seconds": 2.9666999999733434e-05, + "seconds": 2.462500000000034e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json index bd4471a6..debdccf6 100644 --- 
a/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json +++ b/benchmarks/speed_review/baselines/geo_few_markets_medium_rust.json @@ -2,42 +2,42 @@ "scenario": "geo_few_markets_medium", "backend": "rust", "has_rust_backend": true, - "total_seconds": 0.118741875, + "total_seconds": 0.10621941700000004, "memory": { "available": true, - "start_mb": 117.23, - "peak_mb": 117.64, - "growth_mb": 0.41, + "start_mb": 117.05, + "peak_mb": 117.36, + "growth_mb": 0.31, "sampler_interval_s": 0.01 }, "phases": { "1_sdid_jackknife_variance": { - "seconds": 0.020535375000000022, + "seconds": 0.018085625000000105, "ok": true, "error": null }, "2_sdid_bootstrap_variance_200": { - "seconds": 0.023519291000000053, + "seconds": 0.020790666999999985, "ok": true, "error": null }, "3_in_time_placebo": { - "seconds": 0.02495891699999997, + "seconds": 0.025967375000000015, "ok": true, "error": null }, "4_get_loo_effects_df": { - "seconds": 0.0006400839999999297, + "seconds": 0.0006781249999999739, "ok": true, "error": null }, "5_sensitivity_to_zeta_omega": { - "seconds": 0.049061250000000056, + "seconds": 0.04067133299999992, "ok": true, "error": null }, "6_weight_concentration": { - "seconds": 2.31669999999351e-05, + "seconds": 2.2332999999985503e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json index e0bec083..ed7af335 100644 --- a/benchmarks/speed_review/baselines/geo_few_markets_small_python.json +++ b/benchmarks/speed_review/baselines/geo_few_markets_small_python.json @@ -2,42 +2,42 @@ "scenario": "geo_few_markets_small", "backend": "python", "has_rust_backend": false, - "total_seconds": 3.697791375, + "total_seconds": 3.7007011660000004, "memory": { "available": true, - "start_mb": 114.09, - "peak_mb": 124.02, - "growth_mb": 9.92, + "start_mb": 114.14, + "peak_mb": 124.05, + "growth_mb": 9.91, "sampler_interval_s": 0.01 }, "phases": { 
"1_sdid_jackknife_variance": { - "seconds": 0.593809709, + "seconds": 0.5908792500000001, "ok": true, "error": null }, "2_sdid_bootstrap_variance_200": { - "seconds": 0.584832209, + "seconds": 0.593548083, "ok": true, "error": null }, "3_in_time_placebo": { - "seconds": 1.194314458, + "seconds": 1.1894560410000001, "ok": true, "error": null }, "4_get_loo_effects_df": { - "seconds": 0.0009036250000002966, + "seconds": 0.001243833000000194, "ok": true, "error": null }, "5_sensitivity_to_zeta_omega": { - "seconds": 1.3238487909999996, + "seconds": 1.3254739579999995, "ok": true, "error": null }, "6_weight_concentration": { - "seconds": 7.791699999959434e-05, + "seconds": 9.341699999954045e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json index 855eac85..91f9888d 100644 --- a/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json +++ b/benchmarks/speed_review/baselines/geo_few_markets_small_rust.json @@ -2,42 +2,42 @@ "scenario": "geo_few_markets_small", "backend": "rust", "has_rust_backend": true, - "total_seconds": 0.04129770799999999, + "total_seconds": 0.04177825000000002, "memory": { "available": true, - "start_mb": 114.56, - "peak_mb": 116.05, - "growth_mb": 1.48, + "start_mb": 114.55, + "peak_mb": 115.84, + "growth_mb": 1.3, "sampler_interval_s": 0.01 }, "phases": { "1_sdid_jackknife_variance": { - "seconds": 0.008074541000000046, + "seconds": 0.008172167000000008, "ok": true, "error": null }, "2_sdid_bootstrap_variance_200": { - "seconds": 0.012903124999999904, + "seconds": 0.013141583000000012, "ok": true, "error": null }, "3_in_time_placebo": { - "seconds": 0.008189833999999951, + "seconds": 0.00833604099999996, "ok": true, "error": null }, "4_get_loo_effects_df": { - "seconds": 0.0009220420000000118, + "seconds": 0.0008852080000000262, "ok": true, "error": null }, "5_sensitivity_to_zeta_omega": { - "seconds": 
0.01117779200000002, + "seconds": 0.011213916999999962, "ok": true, "error": null }, "6_weight_concentration": { - "seconds": 2.6250000000005436e-05, + "seconds": 2.599999999997049e-05, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_python.json b/benchmarks/speed_review/baselines/reversible_dcdh_python.json index 1cbed394..fff45fd6 100644 --- a/benchmarks/speed_review/baselines/reversible_dcdh_python.json +++ b/benchmarks/speed_review/baselines/reversible_dcdh_python.json @@ -2,32 +2,32 @@ "scenario": "reversible_dcdh", "backend": "python", "has_rust_backend": false, - "total_seconds": 0.718732833, + "total_seconds": 0.788816875, "memory": { "available": true, - "start_mb": 113.5, - "peak_mb": 135.02, - "growth_mb": 21.52, + "start_mb": 113.75, + "peak_mb": 133.66, + "growth_mb": 19.91, "sampler_interval_s": 0.01 }, "phases": { "1_dcdh_fit_Lmax3_survey_TSL": { - "seconds": 0.3450735829999999, + "seconds": 0.384559958, "ok": true, "error": null }, "2_inspect_placebo_and_summary": { - "seconds": 1.4160000000318362e-06, + "seconds": 1.3329999999367459e-06, "ok": true, "error": null }, "3_honest_did_on_placebo": { - "seconds": 0.004985583999999932, + "seconds": 0.003932208000000048, "ok": true, "error": null }, "4_heterogeneity_refit": { - "seconds": 0.36866958299999986, + "seconds": 0.400320667, "ok": true, "error": null } diff --git a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json index 2af530f5..0c073cd6 100644 --- a/benchmarks/speed_review/baselines/reversible_dcdh_rust.json +++ b/benchmarks/speed_review/baselines/reversible_dcdh_rust.json @@ -2,32 +2,32 @@ "scenario": "reversible_dcdh", "backend": "rust", "has_rust_backend": true, - "total_seconds": 0.751090292, + "total_seconds": 0.7799259999999999, "memory": { "available": true, - "start_mb": 113.7, - "peak_mb": 134.89, - "growth_mb": 21.19, + "start_mb": 113.81, + "peak_mb": 134.28, + 
"growth_mb": 20.47, "sampler_interval_s": 0.01 }, "phases": { "1_dcdh_fit_Lmax3_survey_TSL": { - "seconds": 0.36838229199999994, + "seconds": 0.38806558299999994, "ok": true, "error": null }, "2_inspect_placebo_and_summary": { - "seconds": 1.3340000000194863e-06, + "seconds": 1.4580000000652404e-06, "ok": true, "error": null }, "3_honest_did_on_placebo": { - "seconds": 0.005142916999999914, + "seconds": 0.003724375000000002, "ok": true, "error": null }, "4_heterogeneity_refit": { - "seconds": 0.3775615830000001, + "seconds": 0.38813170900000005, "ok": true, "error": null } diff --git a/diff_diff/prep.py b/diff_diff/prep.py index 01d50653..70201144 100644 --- a/diff_diff/prep.py +++ b/diff_diff/prep.py @@ -30,6 +30,9 @@ from diff_diff.survey import ( ResolvedSurveyDesign, SurveyDesign, + _compute_if_variance_fast, + _precompute_psu_scaffolding, + _PsuScaffolding, compute_replicate_if_variance, compute_survey_if_variance, ) @@ -1318,6 +1321,7 @@ def _cell_mean_variance( full_resolved: ResolvedSurveyDesign, cell_mask: np.ndarray, min_n: int, + scaffolding: Optional[_PsuScaffolding] = None, ) -> Tuple[float, float, int, bool]: """Compute design-based mean and variance of the weighted mean for one cell. @@ -1396,9 +1400,14 @@ def _cell_mean_variance( valid_positions = cell_indices[valid] psi[valid_positions] = w_valid[valid] * (y_clean[valid] - y_bar) / sum_w - # Route to TSL or replicate variance using the full design + # Route to TSL or replicate variance using the full design. When a + # design-level scaffolding is provided (aggregate_survey's fast path), + # use it to skip the per-call pandas groupby / np.unique setup that + # otherwise dominates runtime at BRFSS scale. 
if full_resolved.uses_replicate_variance: variance, _ = compute_replicate_if_variance(psi, full_resolved) + elif scaffolding is not None: + variance = _compute_if_variance_fast(psi, scaffolding) else: variance = compute_survey_if_variance(psi, full_resolved) @@ -1580,6 +1589,17 @@ def aggregate_survey( ) full_resolved = effective_design.resolve(data) + # Precompute stratum/PSU scaffolding once per design. Amortizes + # per-cell pandas groupby + np.unique + stratum FPC lookup that + # otherwise dominate runtime at scale (see _compute_if_variance_fast). + # Replicate-weight designs use a different variance surface and stay + # on the legacy path. + _tsl_scaffolding: Optional[_PsuScaffolding] = ( + _precompute_psu_scaffolding(full_resolved) + if not full_resolved.uses_replicate_variance + else None + ) + # --- Precompute full-length outcome/covariate arrays --- n_total = len(data) all_vars = outcome_cols + cov_cols @@ -1635,6 +1655,7 @@ def aggregate_survey( full_resolved, cell_mask, min_n, + scaffolding=_tsl_scaffolding, ) se = float(np.sqrt(variance)) if not np.isnan(variance) else np.nan diff --git a/diff_diff/survey.py b/diff_diff/survey.py index 2ee8334e..3d951fb6 100644 --- a/diff_diff/survey.py +++ b/diff_diff/survey.py @@ -1304,6 +1304,326 @@ def _compute_stratified_psu_meat( return meat, _variance_computed, legitimate_zero_count +@dataclass(frozen=True) +class _PsuScaffolding: + """Precomputed stratum/PSU layout for amortized TSL variance. + + Internal helper used by :func:`diff_diff.prep.aggregate_survey` to reuse + design-dependent scaffolding across hundreds of per-cell variance calls. + Holds integer codes, per-stratum counts, FPC ratios, and static + variance-computability flags that depend only on the + :class:`ResolvedSurveyDesign` (not on the psi / outcome being collapsed). + + See :func:`_compute_if_variance_fast` for the fast variance path that + consumes this scaffolding. 
Numerically equivalent to + :func:`compute_survey_if_variance` up to sub-ULP reduction-order drift. + """ + + mode: str # "no_strata_no_psu" | "psu_only" | "stratified" + n: int + lonely_psu: str + variance_computable: bool + legitimate_zero_count: int + # stratified-mode fields (None in other modes): + psu_codes: Optional[np.ndarray] = None # (n,) int, global PSU id 0..P-1 + psu_stratum: Optional[np.ndarray] = None # (P,) int, stratum of each PSU + n_psu_per_stratum: Optional[np.ndarray] = None # (S,) int + singleton_strata: Optional[np.ndarray] = None # (S,) bool + adjustment_h: Optional[np.ndarray] = None # (S,) float, (1-f_h)*n_h/(n_h-1); 0 for singletons + # psu_only-mode fields (None in other modes): + psu_codes_only: Optional[np.ndarray] = None # (n,) int, PSU id 0..P-1 + n_psu_only: Optional[int] = None + adjustment_only: Optional[float] = None # (1-f)*n_psu/(n_psu-1) or 0 + # no_strata_no_psu-mode fields (None in other modes): + adjustment_direct: Optional[float] = None # (1-f)*n/(n-1) or 0 + + +def _precompute_psu_scaffolding(resolved: "ResolvedSurveyDesign") -> _PsuScaffolding: + """Precompute per-design PSU/stratum scaffolding for fast per-cell variance. + + Equivalent in effect to the per-call scaffolding work inside + :func:`_compute_stratified_psu_meat`, but done once per design instead of + once per output cell. For the typical BRFSS-scale + :func:`~diff_diff.prep.aggregate_survey` workload (~500 cells, ~20 strata), + this amortizes the pandas-groupby + ``np.unique`` setup that otherwise + dominates the chain runtime. + + Parameters + ---------- + resolved : ResolvedSurveyDesign + Resolved survey design. Must NOT use replicate variance + (``resolved.uses_replicate_variance`` False). + + Returns + ------- + _PsuScaffolding + Frozen dataclass with mode-appropriate precomputed fields. + + Raises + ------ + ValueError + Same FPC-vs-n guards as :func:`_compute_stratified_psu_meat` + (FPC must be >= effective PSU count in each stratum). 
+ """ + weights = resolved.weights + n = int(len(weights)) + strata = resolved.strata + psu = resolved.psu + fpc = resolved.fpc + lonely_psu = resolved.lonely_psu + + if strata is None and psu is None: + # Implicit per-observation PSUs + f = 0.0 + lz_count = 0 + if fpc is not None: + N = fpc[0] + if N < n: + raise ValueError( + f"FPC ({N}) is less than the number of observations " + f"({n}). FPC must be >= n_obs for implicit per-observation PSUs." + ) + f = n / N + if f >= 1.0: + lz_count = 1 + var_computable = n >= 2 + adjustment = (1.0 - f) * (n / (n - 1)) if n >= 2 else 0.0 + return _PsuScaffolding( + mode="no_strata_no_psu", + n=n, + lonely_psu=lonely_psu, + variance_computable=var_computable, + legitimate_zero_count=lz_count, + adjustment_direct=float(adjustment), + ) + + if strata is None and psu is not None: + # Single-stratum cluster-robust + psu_arr = np.asarray(psu) + codes, uniques = pd.factorize(psu_arr) + n_psu = int(len(uniques)) + f = 0.0 + lz_count = 0 + if n_psu >= 2: + if fpc is not None: + N = fpc[0] + if N < n_psu: + raise ValueError( + f"FPC ({N}) is less than the number of effective PSUs " + f"({n_psu}). FPC must be >= n_PSU." 
+ ) + f = n_psu / N + if f >= 1.0: + lz_count = 1 + adjustment = (1.0 - f) * (n_psu / (n_psu - 1)) + var_computable = True + else: + adjustment = 0.0 + var_computable = False + return _PsuScaffolding( + mode="psu_only", + n=n, + lonely_psu=lonely_psu, + variance_computable=var_computable, + legitimate_zero_count=lz_count, + psu_codes_only=codes.astype(np.int64), + n_psu_only=n_psu, + adjustment_only=float(adjustment), + ) + + # Stratified branch (with or without PSU) + strata_arr = np.asarray(strata) + strata_codes, strata_uniques = pd.factorize(strata_arr, sort=True) + strata_codes = strata_codes.astype(np.int64) + S = int(len(strata_uniques)) + + if psu is not None: + # Global PSU codes unique across (stratum, psu) pairs — matches the + # legacy per-stratum pandas groupby which never aggregated PSU labels + # across strata. + psu_arr = np.asarray(psu) + psu_local_codes, _ = pd.factorize(psu_arr) + psu_local_codes = psu_local_codes.astype(np.int64) + psu_local_max = int(psu_local_codes.max()) if len(psu_local_codes) > 0 else 0 + compound = strata_codes * (psu_local_max + 1) + psu_local_codes + psu_codes, _ = pd.factorize(compound) + psu_codes = psu_codes.astype(np.int64) + P = int(psu_codes.max() + 1) if len(psu_codes) > 0 else 0 + psu_stratum = np.zeros(P, dtype=np.int64) + # Safe scatter: by construction, all observations sharing a global + # PSU code share a stratum, so repeated writes to the same position + # store the same value. + if P > 0: + psu_stratum[psu_codes] = strata_codes + else: + # Each observation is its own PSU within its stratum (legacy + # behavior when strata is not None and psu is None). 
+ psu_codes = np.arange(n, dtype=np.int64) + P = n + psu_stratum = strata_codes.copy() + + n_psu_per_stratum = np.bincount(psu_stratum, minlength=S).astype(np.int64) + singleton_strata = n_psu_per_stratum == 1 + + # Per-stratum FPC ratio (stratum-level attribute; read from the first + # observation of each stratum, matching legacy ``resolved.fpc[mask_h][0]``). + f_h = np.zeros(S, dtype=np.float64) + if fpc is not None: + fpc_arr = np.asarray(fpc) + # First-in-stratum FPC lookup via an early-exit scan over the + # observations: the first row encountered for each stratum code is + # the reference row, and the scan terminates as soon as every + # stratum has been seen. + first_idx = np.full(S, -1, dtype=np.int64) + seen = np.zeros(S, dtype=bool) + for i in range(n): + h = strata_codes[i] + if not seen[h]: + seen[h] = True + first_idx[h] = i + if seen.all(): + break + for h in range(S): + if first_idx[h] < 0: + continue + N_h = fpc_arr[first_idx[h]] + n_h = n_psu_per_stratum[h] + if n_h > 0 and N_h < n_h: + raise ValueError( + f"FPC ({N_h}) is less than the number of effective PSUs " + f"({n_h}) in stratum. FPC must be >= n_PSU." + ) + if n_h > 0: + f_h[h] = n_h / N_h + + with np.errstate(divide="ignore", invalid="ignore"): + adjustment_h = np.where( + n_psu_per_stratum >= 2, + (1.0 - f_h) * n_psu_per_stratum / np.maximum(n_psu_per_stratum - 1, 1), + 0.0, + ) + + # Static legitimate_zero_count (design-dependent only): + # - Non-singleton strata with f_h >= 1.0 contribute (legacy counter). + # - Singleton strata under lonely_psu == "certainty" contribute. + fpc_saturated = (n_psu_per_stratum >= 2) & (f_h >= 1.0) + legitimate_zero_count = int(fpc_saturated.sum()) + if lonely_psu == "certainty": + legitimate_zero_count += int(singleton_strata.sum()) + + # Static variance_computable flag: + # - Any non-singleton stratum (regardless of FPC) → variance_computed=True + # path is exercised. + # - Under "adjust", any singleton stratum also counts (adds V_h even if 0).
+ has_non_singleton = bool(np.any(~singleton_strata)) + has_singleton = bool(np.any(singleton_strata)) + variance_computable = has_non_singleton or ( + lonely_psu == "adjust" and has_singleton + ) + + return _PsuScaffolding( + mode="stratified", + n=n, + lonely_psu=lonely_psu, + variance_computable=variance_computable, + legitimate_zero_count=legitimate_zero_count, + psu_codes=psu_codes, + psu_stratum=psu_stratum, + n_psu_per_stratum=n_psu_per_stratum, + singleton_strata=singleton_strata, + adjustment_h=adjustment_h, + ) + + +def _compute_if_variance_fast( + psi: np.ndarray, + scaffolding: _PsuScaffolding, +) -> float: + """Fast TSL variance for aggregate_survey using precomputed scaffolding. + + Numerically equivalent to :func:`compute_survey_if_variance` for any + TSL (non-replicate) design, up to sub-ULP reduction-order drift. The + speedup comes from replacing per-cell pandas groupbys and per-stratum + Python loops with two ``np.bincount`` passes plus a fully vectorized + per-stratum reduction. + + Parameters + ---------- + psi : np.ndarray + Per-unit influence function values, shape (n,). + scaffolding : _PsuScaffolding + Precomputed via :func:`_precompute_psu_scaffolding` for the same + resolved design. + + Returns + ------- + float + Design-based variance. Returns ``np.nan`` when variance is + unidentified (matches legacy behavior). 
+ """ + psi = np.asarray(psi, dtype=np.float64).ravel() + + def _finalize(meat_scalar: float) -> float: + if meat_scalar == 0.0: + if scaffolding.variance_computable or scaffolding.legitimate_zero_count > 0: + return 0.0 + return float("nan") + return meat_scalar + + if scaffolding.mode == "no_strata_no_psu": + if scaffolding.n < 2: + return float("nan") + psi_mean = psi.mean() + centered = psi - psi_mean + meat = scaffolding.adjustment_direct * float(centered @ centered) + return _finalize(meat) + + if scaffolding.mode == "psu_only": + if scaffolding.n_psu_only < 2: + if scaffolding.legitimate_zero_count > 0: + return 0.0 + return float("nan") + psu_sums = np.bincount( + scaffolding.psu_codes_only, weights=psi, minlength=scaffolding.n_psu_only + ) + psu_mean = psu_sums.mean() + centered = psu_sums - psu_mean + meat = scaffolding.adjustment_only * float(centered @ centered) + return _finalize(meat) + + # Stratified + S = len(scaffolding.n_psu_per_stratum) + P = len(scaffolding.psu_stratum) + + psu_sums = np.bincount(scaffolding.psu_codes, weights=psi, minlength=P) + sum_by_h = np.bincount(scaffolding.psu_stratum, weights=psu_sums, minlength=S) + sum2_by_h = np.bincount( + scaffolding.psu_stratum, weights=psu_sums * psu_sums, minlength=S + ) + + with np.errstate(divide="ignore", invalid="ignore"): + centered_ss = np.where( + scaffolding.n_psu_per_stratum >= 2, + sum2_by_h - (sum_by_h * sum_by_h) / np.maximum(scaffolding.n_psu_per_stratum, 1), + 0.0, + ) + meat_per_stratum = scaffolding.adjustment_h * centered_ss + + if np.any(scaffolding.singleton_strata) and scaffolding.lonely_psu == "adjust": + # Singleton strata under "adjust": V_h = (psu_sum - global_mean)^2. + # For a singleton stratum, the one PSU's sum equals sum_by_h[h]. + # No FPC, no (n-1) adjustment — matches legacy (survey.py:1276-1281). 
+ if P > 0: + global_mean = psu_sums.mean() + singleton_meat = (sum_by_h - global_mean) ** 2 + meat_per_stratum = np.where( + scaffolding.singleton_strata, singleton_meat, meat_per_stratum + ) + + meat = float(meat_per_stratum.sum()) + return _finalize(meat) + + def _compute_stratified_meat_from_psu_scores( psu_scores: np.ndarray, psu_strata: np.ndarray, diff --git a/docs/performance-plan.md b/docs/performance-plan.md index 58f0f017..438a8b56 100644 --- a/docs/performance-plan.md +++ b/docs/performance-plan.md @@ -41,32 +41,36 @@ scale. Data-shape details are in `docs/performance-scenarios.md`. | Scenario | Scale | Python (s) | Rust (s) | Py/Rust | |---|---|---:|---:|---:| -| 1. Staggered campaign | small | 0.51 | 0.50 | 1.0x | -| | medium | 0.75 | 0.76 | 1.0x | -| | large | 1.33 | 1.38 | 1.0x | -| 2. Brand awareness survey | small | 0.19 | 0.20 | 1.0x | -| | medium | 0.56 | 0.55 | 1.0x | -| | large | 1.09 | 1.00 | 1.1x | -| 3. BRFSS microdata -> CS panel | small | 1.61 | 1.66 | 1.0x | -| | medium | 6.10 | 6.23 | 1.0x | -| | large | 24.41 | 24.94 | 1.0x | -| 4. SDiD few markets | small | 3.70 | 0.04 | 89.5x | -| | medium | 3.99 | 0.12 | 33.6x | -| | large | skip | 0.26 | - | -| 5. Reversible dCDH | single | 0.72 | 0.75 | 1.0x | -| 6. Pricing dose-response | single | 0.59 | 0.60 | 1.0x | +| 1. Staggered campaign | small | 0.52 | 0.51 | 1.0x | +| | medium | 0.81 | 0.81 | 1.0x | +| | large | 1.32 | 1.31 | 1.0x | +| 2. Brand awareness survey | small | 0.23 | 0.20 | 1.1x | +| | medium | 0.53 | 0.50 | 1.1x | +| | large | 0.87 | 0.93 | 0.9x | +| 3. BRFSS microdata -> CS panel | small | 0.21 | 0.17 | 1.3x | +| | medium | 0.49 | 0.47 | 1.0x | +| | large | 1.33 | 1.32 | 1.0x | +| 4. SDiD few markets | small | 3.70 | 0.04 | 88.6x | +| | medium | 4.00 | 0.11 | 37.6x | +| | large | skip | 0.23 | - | +| 5. Reversible dCDH | single | 0.79 | 0.78 | 1.0x | +| 6. 
Pricing dose-response | single | 0.59 | 0.63 | 0.9x |

### Scaling findings

**Three findings are load-bearing for the optimization priority list:**

-1. **BRFSS `aggregate_survey` is the dominant practitioner pain point at
-   realistic pooled-multi-year scale.** Scales near-linearly with microdata
-   row count. At 1M rows (roughly what a 10-year pooled BRFSS analysis
-   looks like) the full chain takes ~24 seconds and essentially all of it
-   is inside `_compute_stratified_psu_meat`. Rust does not touch it
-   (`aggregate_survey` is entirely Python).
+1. **BRFSS `aggregate_survey` is now practitioner-fast at every measured
+   scale.** Prior to the precompute-scaffolding fix (see "Optimization
+   landed" below), the full chain at 1M rows took ~24 seconds and was
+   essentially all inside `_compute_stratified_psu_meat`. After the fix,
+   the chain is sub-2s at every measured scale; `aggregate_survey`
+   still accounts for most of its own (now-cheap) chain, but in
+   absolute time the entire workflow is well under a practitioner-
+   perceptible threshold at realistic pooled-multi-year BRFSS volume.
+   The path is entirely Python, so Python and Rust backends track each
+   other within noise.
 2. **Staggered CS chain stays cheap across scales.** A 10x unit increase
    (150 -> 1,500) is a small-single-digit multiplier on total time.
    ImputationDiD and SunAbraham together consistently account for
@@ -96,18 +100,18 @@ scale. Data-shape details are in `docs/performance-scenarios.md`.

| Scenario | Scale | Backend | Top phase (%) | 2nd phase (%) | 3rd phase (%) |
|---|---|---|---|---|
-| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (49%) | `5_sun_abraham_robustness` (28%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
-| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (40%) | `5_sun_abraham_robustness` (37%) | `2_cs_fit_with_covariates_bootstrap999` (13%) |
-| 2.
Brand awareness survey | large | python | `3_replicate_weights_jk1` (57%) | `4_multi_outcome_loop_3_metrics` (22%) | `7_event_study_plus_honest_did` (14%) | -| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (54%) | `4_multi_outcome_loop_3_metrics` (22%) | `7_event_study_plus_honest_did` (14%) | -| 3. BRFSS microdata -> CS panel | large | python | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) | -| 3. BRFSS microdata -> CS panel | large | rust | `1_aggregate_survey_microdata_to_panel` (100%) | `5_sun_abraham_robustness` (0%) | `2_cs_fit_with_stage2_survey_design` (0%) | +| 1. Staggered campaign | large | python | `6_imputation_did_robustness` (54%) | `5_sun_abraham_robustness` (21%) | `2_cs_fit_with_covariates_bootstrap999` (13%) | +| 1. Staggered campaign | large | rust | `6_imputation_did_robustness` (41%) | `5_sun_abraham_robustness` (36%) | `2_cs_fit_with_covariates_bootstrap999` (12%) | +| 2. Brand awareness survey | large | python | `3_replicate_weights_jk1` (46%) | `4_multi_outcome_loop_3_metrics` (26%) | `7_event_study_plus_honest_did` (17%) | +| 2. Brand awareness survey | large | rust | `3_replicate_weights_jk1` (50%) | `4_multi_outcome_loop_3_metrics` (25%) | `7_event_study_plus_honest_did` (15%) | +| 3. BRFSS microdata -> CS panel | large | python | `1_aggregate_survey_microdata_to_panel` (91%) | `5_sun_abraham_robustness` (8%) | `2_cs_fit_with_stage2_survey_design` (1%) | +| 3. BRFSS microdata -> CS panel | large | rust | `1_aggregate_survey_microdata_to_panel` (95%) | `5_sun_abraham_robustness` (4%) | `2_cs_fit_with_stage2_survey_design` (1%) | | 4. SDiD few markets | medium | python | `5_sensitivity_to_zeta_omega` (43%) | `3_in_time_placebo` (39%) | `2_sdid_bootstrap_variance_200` (9%) | -| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (40%) | `3_in_time_placebo` (30%) | `1_sdid_jackknife_variance` (16%) | -| 5. 
Reversible dCDH | single | python | `4_heterogeneity_refit` (51%) | `1_dcdh_fit_Lmax3_survey_TSL` (48%) | `3_honest_did_on_placebo` (1%) | -| 5. Reversible dCDH | single | rust | `4_heterogeneity_refit` (50%) | `1_dcdh_fit_Lmax3_survey_TSL` (49%) | `3_honest_did_on_placebo` (1%) | -| 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (25%) | `6_spline_sensitivity_num_knots2` (25%) | `5_spline_sensitivity_degree1` (25%) | -| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (25%) | `6_spline_sensitivity_num_knots2` (25%) | `3_cdid_event_study_pretrend` (25%) | +| 4. SDiD few markets | large | rust | `5_sensitivity_to_zeta_omega` (38%) | `3_in_time_placebo` (30%) | `1_sdid_jackknife_variance` (16%) | +| 5. Reversible dCDH | single | python | `4_heterogeneity_refit` (51%) | `1_dcdh_fit_Lmax3_survey_TSL` (49%) | `3_honest_did_on_placebo` (0%) | +| 5. Reversible dCDH | single | rust | `4_heterogeneity_refit` (50%) | `1_dcdh_fit_Lmax3_survey_TSL` (50%) | `3_honest_did_on_placebo` (0%) | +| 6. Pricing dose-response | single | python | `1_cdid_cubic_spline_bootstrap199` (26%) | `6_spline_sensitivity_num_knots2` (25%) | `3_cdid_event_study_pretrend` (25%) | +| 6. Pricing dose-response | single | rust | `1_cdid_cubic_spline_bootstrap199` (26%) | `6_spline_sensitivity_num_knots2` (25%) | `3_cdid_event_study_pretrend` (25%) | Per-scenario phase narrative (cross-check against the table above after @@ -129,9 +133,11 @@ any rerun): see scale-sweep table); the JK1 replicate-fit loop is not Rust-accelerated, so the backends neither help nor hurt each other meaningfully on this chain. -- **BRFSS.** `aggregate_survey` share of total grows with scale and is - effectively 100% of runtime at 1M rows. Downstream phases (CS fit, - SunAbraham, HonestDiD) are a fraction of a second combined. 
+- **BRFSS.** `aggregate_survey` remains the single largest chain share + under both backends at every scale, but the absolute chain total is + sub-2s at 1M rows after the precompute-scaffolding fix. Downstream + phases (CS fit, SunAbraham, HonestDiD) are a fraction of a second + combined - see the scale-sweep table for the current totals. - **SDiD few markets.** `sensitivity_to_zeta_omega` and `in_time_placebo` are the two largest phases under Python at every scale and under Rust at medium/large (together ~70% of the chain). @@ -156,7 +162,7 @@ any rerun): | # | Location | Scenario + scale | Signal | Recommended action | |---|---|---|---|---| -| 1 | `diff_diff/survey.py:1160` `_compute_stratified_psu_meat` | BRFSS @ 1M rows | dominates BRFSS chain at all scales, ~100% at 1M rows | **Algorithmic fix, highest priority.** Function called once per (state, year) cell (500 calls); per-call work rebuilds stratum-PSU scaffolding every time. Precompute stratum indexes once at `aggregate_survey` top-level and reuse. | +| 1 | `diff_diff/survey.py` `_compute_stratified_psu_meat` + `aggregate_survey` | BRFSS @ 1M rows | previously dominated BRFSS chain at all scales (~100% at 1M rows) | **LANDED** (this PR). Precompute stratum-PSU scaffolding once per design at `aggregate_survey` top level; replace per-cell pandas groupby with two vectorized `np.bincount` passes. BRFSS-large chain drops from ~24s to sub-2s across both backends. See "Optimization landed" below. | | 2 | `diff_diff/imputation.py` ImputationDiD fit (+ `diff_diff/sun_abraham.py` SunAbraham fit) | Staggered CS @ 1,500 units | together consistently ~70-80% of the chain at every scale; either can be the top phase at a given (scale, backend) cell | **Investigate only after BRFSS fix lands.** Total chain is well under practitioner-perceptible threshold; candidate follow-up. Either phase is a legitimate target. 
| | 3 | `diff_diff/utils.py:1434` `_sc_weight_fw_numpy` | SDiD python @ any scale | dominates Python SDiD at all scales | **Already ported to Rust.** Python fallback acceptable as a teaching/safety path; non-production for n > 100. Python skipped at n=500 (jackknife cost would exceed 4 minutes per run). | | 4 | `diff_diff/chaisemartin_dhaultfoeuille.py` dCDH fit + heterogeneity | Reversible (single scale) | main fit and survey-aware heterogeneity refit each rebuild TSL scaffolding; heterogeneity phase is as expensive as the main fit | **Cache/precompute** - heterogeneity refit duplicates the main fit's TSL setup under the same `SurveyDesign`. Not P0; newer code path (v3.1) never optimization-reviewed. | @@ -174,20 +180,20 @@ in `benchmarks/speed_review/baselines/mem_profile_brfss_large_.txt`. | Scenario | Scale | Py peak RSS (MB) | Py growth (MB) | Rust peak RSS (MB) | Rust growth (MB) | |---|---|---:|---:|---:|---:| -| 1. Staggered campaign | small | 143 | 28 | 151 | 36 | -| | medium | 227 | 79 | 254 | 99 | -| | large | 472 | 245 | 588 | 322 | -| 2. Brand awareness survey | small | 127 | 12 | 128 | 13 | -| | medium | 188 | 54 | 185 | 50 | -| | large | 327 | 139 | 336 | 142 | -| 3. BRFSS microdata -> CS panel | small | 133 | 11 | 136 | 15 | -| | medium | 210 | 17 | 212 | 15 | -| | large | 418 | 17 | 429 | 33 | +| 1. Staggered campaign | small | 146 | 31 | 148 | 34 | +| | medium | 235 | 85 | 253 | 100 | +| | large | 486 | 251 | 582 | 327 | +| 2. Brand awareness survey | small | 130 | 15 | 128 | 13 | +| | medium | 183 | 45 | 189 | 55 | +| | large | 340 | 139 | 348 | 158 | +| 3. BRFSS microdata -> CS panel | small | 133 | 11 | 130 | 8 | +| | medium | 203 | 17 | 200 | 21 | +| | large | 413 | 25 | 409 | 25 | | 4. SDiD few markets | small | 124 | 10 | 116 | 1 | -| | medium | 152 | 8 | 118 | 0 | +| | medium | 148 | 8 | 117 | 0 | | | large | skip | skip | 118 | 0 | -| 5. Reversible dCDH | single | 135 | 22 | 135 | 21 | -| 6. 
Pricing dose-response | single | 123 | 9 | 121 | 8 | +| 5. Reversible dCDH | single | 134 | 20 | 134 | 20 | +| 6. Pricing dose-response | single | 122 | 8 | 123 | 9 | The ~115-130 MB floor is the Python + diff-diff + numpy import footprint; @@ -195,16 +201,15 @@ the "growth" columns are the practitioner-meaningful numbers. ### Memory findings -1. **BRFSS `aggregate_survey` is compute-bound, not memory-bound.** At - 20x data growth (50K -> 1M rows), working-memory growth stays in the - low tens of MB. The tracemalloc pass confirms: net retained allocation - after `aggregate_survey` returns is well under 1 MB; the top - allocation site is `tracemalloc`'s own linecache overhead (a smoking - gun that nothing else is allocating meaningfully). **The BRFSS cost - is pure CPU; the function is already memory-efficient.** This - strengthens the case for the precompute-scaffolding fix: low-risk, - pure CPU win, fits in any deployment environment including 512 MB - Lambda. +1. **BRFSS `aggregate_survey` was compute-bound, not memory-bound - and + the compute side is now addressed.** Working-memory growth stayed in + the low tens of MB across the 20x data-growth sweep (50K -> 1M rows); + the pre-fix tracemalloc pass confirmed net retained allocation under + 1 MB and identified `tracemalloc`'s own linecache overhead as the + top allocation site (smoking gun that nothing else was allocating + meaningfully). The precompute-scaffolding fix in this PR is a pure + CPU win - no change to the function's memory profile, which was + already Lambda-friendly. 2. **Staggered CS chain is memory-heavier than wall-clock suggested.** At 1,500 units the chain's peak RSS sits in the high-400s to high-500s MB depending on backend. Fine for workstations, tight for 512 MB @@ -229,16 +234,32 @@ the "growth" columns are the practitioner-meaningful numbers. 
| # | Opportunity | Time upside | Memory upside | Risk | Priority | |---|---|---|---|---|---| -| 1 | `aggregate_survey` precompute stratum scaffolding | ~-20s at 1M rows | none (already memory-efficient) | Low | **High** | +| 1 | `aggregate_survey` precompute stratum scaffolding | ~-20s at 1M rows | none (already memory-efficient) | Low | **LANDED** (this PR) | | 2 | Staggered CS chain working-memory audit (Lambda-oriented) | none | ~200-300 MB at 1,500 units (peak RSS crosses 512 MB Lambda line under Rust) | Medium | Low (bump to Medium if Lambda deployment becomes a concrete ask) | | 3 | dCDH: cache TSL scaffolding across main fit + heterogeneity refit | ~0.2s per chain | ~20 MB per chain | Low | Low | | 4 | ImputationDiD fit-loop vectorization audit | ~0.1-0.3s at 1,500 units | unknown | Low | Low | | 5 | Rust-port JK1 replicate fit loop | ~0.5s at 160 replicates | ~140 MB at 160 replicates | Medium | Low (demoted: Rust is no longer slower than Python on this path after rerun, so the "fix-a-Rust-regression" leg of the original rationale is gone) | -**Bottom line: one clear priority, four optional.** #1 is the single -practitioner-perceptible win identified by this analysis and should be -the next PR. #2-5 are optional polish that should be prioritized by -concrete deployment-environment signal (Lambda OOMs, practitioner +### Optimization landed + +**#1 shipped in this PR.** `diff_diff/survey.py` now precomputes a +per-design `_PsuScaffolding` (strata codes, global PSU codes, per- +stratum counts and FPC ratios, singleton mask, lonely-PSU-aware +variance-computable flag). `aggregate_survey` builds it once per call +and threads it through `_cell_mean_variance` so each per-cell variance +reduction uses two vectorized `np.bincount` passes instead of a +per-stratum pandas groupby loop. 
Numerics are preserved to sub-ULP +tolerance; equivalence tests across seven design cases +(`TestAggregateSurveyScaffolding`) enforce `assert_allclose(atol=1e-14, +rtol=1e-14)` between fast and legacy paths. + +Replicate-weight designs (JK1 etc.) continue to use the legacy +`compute_replicate_if_variance` code path and are unaffected. + +**Bottom line: no practitioner-perceptible bottleneck remains in the +six measured workflows; four optional items stand by.** Items #2-5 +above should be prioritized by concrete deployment-environment signal +(Lambda OOMs, practitioner reports of slowness at specific shapes), not proactively. ### Correctness-adjacent observations (not P0, route separately) diff --git a/tests/test_prep.py b/tests/test_prep.py index 3c96626b..0a954395 100644 --- a/tests/test_prep.py +++ b/tests/test_prep.py @@ -3440,3 +3440,197 @@ def test_pweight_retains_zero_precision_geo(self): ) assert 0 not in panel_a["state"].values assert len(panel_a) == 6 # 3 states x 2 periods + + +class TestAggregateSurveyScaffolding: + """Tests for the amortized TSL variance fast path in aggregate_survey. + + Equivalence tests verify that ``_compute_if_variance_fast`` produces + numerically identical ``_mean`` / ``_se`` / ``_precision`` outputs + (assert_allclose atol=1e-14 rtol=1e-14) relative to the legacy + ``compute_survey_if_variance`` path across every supported design + mode and ``lonely_psu`` policy. Reduction-order drift is expected + to be sub-ULP because the formulas are identical and only the + order of summation changes (single np.bincount vs per-stratum + pandas groupby). 
+ """ + + def _build_microdata(self, mode, seed=42): + """Per-case microdata plus a SurveyDesign that exercises that mode.""" + rng = np.random.default_rng(seed) + n_per_cell = 80 + state = np.repeat(["A", "B", "C"], 2 * n_per_cell) + year = np.tile(np.repeat([2019, 2020], n_per_cell), 3) + n = len(state) + wt = rng.uniform(0.5, 2.5, n) + y = rng.normal(5.0, 1.5, n) + df_base = pd.DataFrame( + {"state": state, "year": year, "wt": wt, "y": y} + ) + + if mode == "stratified_fpc": + df = df_base.copy() + df["stratum"] = rng.integers(0, 4, n) + df["psu"] = df["stratum"] * 10 + rng.integers(0, 4, n) + df["fpc"] = 200.0 # comfortably above per-stratum n_psu + sd = SurveyDesign(weights="wt", strata="stratum", psu="psu", fpc="fpc") + return df, sd + + if mode == "stratified_no_fpc": + df = df_base.copy() + df["stratum"] = rng.integers(0, 4, n) + df["psu"] = df["stratum"] * 10 + rng.integers(0, 4, n) + sd = SurveyDesign(weights="wt", strata="stratum", psu="psu") + return df, sd + + if mode == "psu_only": + df = df_base.copy() + df["psu"] = rng.integers(0, 12, n) + sd = SurveyDesign(weights="wt", psu="psu") + return df, sd + + if mode == "weights_only": + return df_base.copy(), SurveyDesign(weights="wt") + + if mode.startswith("lonely_"): + # Singleton stratum: stratum 0 has exactly one PSU; strata 1..3 + # each have 4 PSUs. Forces every lonely_psu branch to engage. 
+ df = df_base.copy() + strata = rng.integers(1, 4, n) + psu = strata * 10 + rng.integers(0, 4, n) + sentinel = rng.choice(n, size=n // 8, replace=False) + strata[sentinel] = 0 + psu[sentinel] = 999 + df["stratum"] = strata + df["psu"] = psu + policy = mode.split("_", 1)[1] + sd = SurveyDesign( + weights="wt", strata="stratum", psu="psu", lonely_psu=policy, + ) + return df, sd + + raise ValueError(f"Unknown mode: {mode}") + + @staticmethod + def _assert_panels_equivalent(p_fast, p_legacy, outcome="y"): + assert len(p_fast) == len(p_legacy) + assert list(p_fast.columns) == list(p_legacy.columns) + for suffix in ("_mean", "_se", "_precision"): + col = f"{outcome}{suffix}" + a = p_fast[col].to_numpy(dtype=np.float64) + b = p_legacy[col].to_numpy(dtype=np.float64) + nan_a, nan_b = np.isnan(a), np.isnan(b) + assert np.array_equal(nan_a, nan_b), f"NaN pattern mismatch in {col}" + np.testing.assert_allclose( + a[~nan_a], b[~nan_b], + atol=1e-14, rtol=1e-14, + err_msg=f"{col} diverges between fast and legacy paths", + ) + + @pytest.mark.parametrize( + "mode", + [ + "stratified_fpc", + "stratified_no_fpc", + "psu_only", + "weights_only", + "lonely_remove", + "lonely_certainty", + "lonely_adjust", + ], + ) + def test_fast_path_equals_legacy(self, mode, monkeypatch): + """Fast and legacy paths produce numerically identical panels.""" + from diff_diff import prep + + data, sd = self._build_microdata(mode) + panel_fast, _ = aggregate_survey( + data, by=["state", "year"], outcomes="y", survey_design=sd, + ) + # Force the legacy code path by disabling the scaffolding precompute. + # _cell_mean_variance falls back to compute_survey_if_variance when + # scaffolding is None. 
+ monkeypatch.setattr( + prep, "_precompute_psu_scaffolding", lambda resolved: None, + ) + panel_legacy, _ = aggregate_survey( + data, by=["state", "year"], outcomes="y", survey_design=sd, + ) + self._assert_panels_equivalent(panel_fast, panel_legacy) + + def test_scaffolding_stratified_shape(self): + from diff_diff.survey import _precompute_psu_scaffolding + + data, sd = self._build_microdata("stratified_fpc") + resolved = sd.resolve(data) + scf = _precompute_psu_scaffolding(resolved) + assert scf.mode == "stratified" + assert scf.n == len(data) + assert scf.psu_codes.shape == (len(data),) + assert scf.psu_stratum.ndim == 1 + assert scf.n_psu_per_stratum.ndim == 1 + assert len(scf.psu_stratum) == int(scf.psu_codes.max() + 1) + # adjustment_h is zero for any singleton stratum by construction + if scf.singleton_strata.any(): + assert np.all(scf.adjustment_h[scf.singleton_strata] == 0.0) + + def test_scaffolding_weights_only_shape(self): + from diff_diff.survey import _precompute_psu_scaffolding + + data, sd = self._build_microdata("weights_only") + resolved = sd.resolve(data) + scf = _precompute_psu_scaffolding(resolved) + assert scf.mode == "no_strata_no_psu" + assert scf.adjustment_direct is not None + assert scf.psu_codes is None + assert scf.psu_codes_only is None + + def test_scaffolding_psu_only_shape(self): + from diff_diff.survey import _precompute_psu_scaffolding + + data, sd = self._build_microdata("psu_only") + resolved = sd.resolve(data) + scf = _precompute_psu_scaffolding(resolved) + assert scf.mode == "psu_only" + assert scf.psu_codes_only is not None + assert scf.n_psu_only is not None and scf.n_psu_only >= 2 + assert scf.adjustment_only is not None + assert scf.psu_codes is None + assert scf.adjustment_direct is None + + def test_lonely_psu_certainty_counts_singletons(self): + """Under lonely_psu='certainty', singletons contribute to legitimate_zero_count.""" + from diff_diff.survey import _precompute_psu_scaffolding + + data, sd = 
self._build_microdata("lonely_certainty") + resolved = sd.resolve(data) + scf = _precompute_psu_scaffolding(resolved) + n_singletons = int(scf.singleton_strata.sum()) + assert n_singletons >= 1 # sanity: fixture does plant a singleton + assert scf.legitimate_zero_count >= n_singletons + + def test_scaffolding_fpc_saturation_counts(self): + """f_h >= 1.0 increments legitimate_zero_count independent of singletons.""" + from diff_diff.survey import _precompute_psu_scaffolding + + rng = np.random.default_rng(7) + n = 200 + stratum = rng.integers(0, 2, n) + # Build exactly 4 unique PSUs per stratum so FPC = n_psu exactly. + psu = np.empty(n, dtype=np.int64) + for h in range(2): + idx = np.where(stratum == h)[0] + psu[idx] = np.arange(len(idx)) % 4 + h * 10 + df = pd.DataFrame( + { + "wt": rng.uniform(1, 2, n), + "stratum": stratum, + "psu": psu, + "y": rng.normal(size=n), + "fpc": 4.0, # f_h = 4/4 = 1.0 + } + ) + sd = SurveyDesign(weights="wt", strata="stratum", psu="psu", fpc="fpc") + resolved = sd.resolve(df) + scf = _precompute_psu_scaffolding(resolved) + assert scf.legitimate_zero_count >= 1 From 7039e74e4e9be890c9c4a6ad446a9b9dcc5ed592 Mon Sep 17 00:00:00 2001 From: igerber Date: Sun, 19 Apr 2026 17:45:31 -0400 Subject: [PATCH 2/2] Cover stratified-no-PSU branch in scaffolding equivalence tests MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Addresses the P2 test-coverage gap from the CI review on PR #338: _precompute_psu_scaffolding() has a dedicated strata-with-no-PSU branch (survey.py:1458-1466) where each observation is its own PSU within its stratum, but TestAggregateSurveyScaffolding did not exercise it despite the docstring claiming every supported design mode was covered. 
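For orientation, the per-stratum reduction these new cases pin down can be
sketched in isolation. This is a hypothetical standalone illustration, not
the library's `_compute_if_variance_fast`; the names (`psi`, `stratum`)
just mirror the PR description. In the strata-with-no-PSU branch each
observation is its own PSU, so the PSU-sum `np.bincount` pass is the
identity and only the per-stratum moment pass remains:

```python
import numpy as np

# Hypothetical sketch of the bincount reduction (assumed shapes, not the
# library API): psi holds per-observation influence values, stratum maps
# each observation to its stratum code.
rng = np.random.default_rng(0)
n, S = 1000, 4
psi = rng.normal(size=n)
stratum = rng.integers(0, S, size=n)

# With no PSU column, "PSU sums" are just psi itself
# (np.bincount(np.arange(n), weights=psi) == psi), so one pass remains.
psu_sums = psi
n_h = np.bincount(stratum, minlength=S).astype(float)
sum_h = np.bincount(stratum, weights=psu_sums, minlength=S)
sum2_h = np.bincount(stratum, weights=psu_sums**2, minlength=S)
# Closed-form centered sum of squares per stratum: sum_h (x - xbar_h)^2.
centered_ss = sum2_h - sum_h**2 / n_h

# Cross-check against an explicit per-stratum loop (the legacy shape).
ref = np.array([
    ((psi[stratum == h] - psi[stratum == h].mean()) ** 2).sum()
    for h in range(S)
])
assert np.allclose(centered_ss, ref)
```

The closed form `sum2_h - sum_h**2 / n_h` is the same centered sum of
squares the legacy per-stratum groupby computed, only reduced in one
vectorized pass; the reduction-order drift between the two shapes is what
the atol/rtol 1e-14 equivalence bound covers.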
Adds two parametrized cases to test_fast_path_equals_legacy: - stratified_no_psu — SurveyDesign(weights=, strata=) - stratified_no_psu_fpc — same plus stratum-level FPC lookup (the fpc-on-this-branch path goes through the same per-stratum first-obs FPC read as stratified+PSU, so both variants matter). Both pass assert_allclose(atol=1e-14, rtol=1e-14) equivalence with the legacy path across all cells. Co-Authored-By: Claude Opus 4.7 (1M context) --- tests/test_prep.py | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/tests/test_prep.py b/tests/test_prep.py index 0a954395..a9818e4a 100644 --- a/tests/test_prep.py +++ b/tests/test_prep.py @@ -3483,6 +3483,24 @@ def _build_microdata(self, mode, seed=42): sd = SurveyDesign(weights="wt", strata="stratum", psu="psu") return df, sd + if mode == "stratified_no_psu": + # strata present, psu absent — each observation is its own + # PSU within its stratum. This is a distinct scaffolding + # branch (survey.py:_precompute_psu_scaffolding, else clause + # of the `if psu is not None` block). + df = df_base.copy() + df["stratum"] = rng.integers(0, 4, n) + sd = SurveyDesign(weights="wt", strata="stratum") + return df, sd + + if mode == "stratified_no_psu_fpc": + # Same branch as above plus stratum-level FPC lookup. + df = df_base.copy() + df["stratum"] = rng.integers(0, 4, n) + df["fpc"] = 1000.0 # well above per-stratum obs count + sd = SurveyDesign(weights="wt", strata="stratum", fpc="fpc") + return df, sd + if mode == "psu_only": df = df_base.copy() df["psu"] = rng.integers(0, 12, n) @@ -3532,6 +3550,8 @@ def _assert_panels_equivalent(p_fast, p_legacy, outcome="y"): [ "stratified_fpc", "stratified_no_fpc", + "stratified_no_psu", + "stratified_no_psu_fpc", "psu_only", "weights_only", "lonely_remove",