Precompute stratum-PSU scaffolding in aggregate_survey (~19x BRFSS speedup) by igerber · Pull Request #338 · igerber/diff-diff

igerber · 2026-04-19T21:22:38Z

Summary

Drops aggregate_survey at BRFSS-realistic scale (50 states × 10 years, 1M microdata rows) from ~24s to ~1.3s — 18-19x under both backends. Addresses item #1 from the PR #333 practitioner-workflow performance baseline, the single pain point identified across the six measured scenarios.

Mechanism. The per-cell Taylor-series variance inside aggregate_survey previously rebuilt stratum-PSU scaffolding (np.unique + per-stratum pd.DataFrame.groupby(...).sum() + FPC lookup) on every output cell — ~10K pandas groupby constructions for the BRFSS scenario, each summing a mostly-zero psi vector. This PR adds a frozen _PsuScaffolding dataclass + private _precompute_psu_scaffolding / _compute_if_variance_fast helpers in diff_diff/survey.py. aggregate_survey builds scaffolding once per design and threads it through _cell_mean_variance via a new optional kwarg; the fast path replaces the per-stratum groupby loop with two vectorized np.bincount passes (psi → PSU sums, PSU sums → per-stratum first and second moments).

Scope is localized. _compute_stratified_psu_meat and compute_survey_if_variance are untouched, so every other TSL caller (DiD, TWFE, CS, SunAbraham, dCDH, etc.) is unaffected. Replicate-weight designs continue to route through compute_replicate_if_variance unchanged.

Measured impact (benchmarks/speed_review/run_all.py)

Scenario/scale	Before	After	Speedup
BRFSS large (Py/Rust)	24.4s / 24.9s	1.33s / 1.32s	18.4x / 19.0x
BRFSS medium	6.1s / 6.2s	0.49s / 0.47s	12.5x / 13.2x
BRFSS small	1.6s / 1.7s	0.21s / 0.17s	7.6x / 10x
All other scenarios	—	—	within noise

Memory profile unchanged (BRFSS was already memory-efficient; growth stays in low tens of MB).

Methodology references

Method: Binder (1983) / Lumley (2004) Taylor-series linearization for design-based survey variance (no change).
Formulas: identical to the existing _compute_stratified_psu_meat TSL branch. Only the order of summation changes (single np.bincount vs per-stratum pandas groupby).
Deviations from source: none. Numerics preserved to sub-ULP tolerance.
Reference doc: docs/methodology/REGISTRY.md "Survey" section unchanged — the methodology documented there is exactly what the fast path computes.

Note on deliberately untouched docs. diff_diff/survey.py has HIGH drift-risk doc deps (methodology REGISTRY, survey-theory, roadmap, tutorials). None were updated because this PR makes no methodology, API, numerical, or user-facing behavior change — it is a pure internal performance optimization. Methodology docs would be misleading if edited ("what changed?" — nothing in the math). Flagging proactively in case the AI reviewer wants to confirm.

Validation

Tests added

tests/test_prep.py::TestAggregateSurveyScaffolding (+194 lines):
- 7 parametrized equivalence cases: stratified+PSU+FPC, stratified no FPC, PSU-only, weights-only, and all three lonely_psu modes (remove / certainty / adjust). Each monkey-patches _precompute_psu_scaffolding to force the legacy path for comparison, then assert_allclose(atol=1e-14, rtol=1e-14) on _mean, _se, _precision.
- 5 structural tests on _PsuScaffolding shape/dtype per mode + static legitimate_zero_count semantics.

Tests unchanged, all green

tests/test_prep.py::TestAggregateSurvey — 43/43 passed (exercises fast path by default)
tests/test_prep.py (full file) — 218/218 passed
tests/test_survey.py — 129/129 passed

Numerical equivalence on real BRFSS-large data

Spot-checked against the actual 1M-row benchmark fixture (500 cells × 50 states × 10 years):

y_mean: bit-identical (|diff| = 0.0)
y_se: max relative drift 2.83e-16 (≈1 ULP)
y_precision: max relative drift 4.64e-16 (≈1 ULP)

All ~10⁸ × tighter than the 1e-14 test tolerance.

Performance sweep

python benchmarks/speed_review/run_all.py rerun; all 27 baselines regenerated and committed. No scenario regressed outside run-to-run noise (<±10% on sub-second totals). Auto-tables in docs/performance-plan.md regenerated via gen_findings_tables.py; narrative prose updated to match (finding #1 rewritten, hotspots row marked LANDED, new "Optimization landed" subsection, priority-table item #1 marked LANDED, bottom line updated).

Security / privacy

Confirm no secrets/PII in this PR: Yes.

🤖 Generated with Claude Code

github-actions · 2026-04-19T21:36:20Z

Overall Assessment
✅ Looks good. No unmitigated P0 or P1 findings.

Executive Summary

Affected methodology: aggregate_survey domain-estimation SEs under Lumley (2004 §3.4) with Binder-style TSL variance. The changed code is the new scaffolding-based fast path in diff_diff/survey.py:L1307-L1624, called from diff_diff/prep.py:L1403-L1412.
I did not find an undocumented methodology deviation or an incorrect variance/SE formula. This is a reduction-strategy change (groupby/np.unique to precomputed codes + np.bincount), which is a valid implementation choice under the existing registry contract in docs/methodology/REGISTRY.md:L2711-L2731 and docs/methodology/REGISTRY.md:L2790-L2828.
Replicate-weight designs still bypass the new scaffolding path, so this PR does not change replicate-variance behavior; see diff_diff/prep.py:L1407-L1412.
One non-blocking gap remains: the new fast-vs-legacy equivalence suite does not exercise the changed stratified-without-PSU branch, even though the test docstring claims coverage of “every supported design mode”; see diff_diff/survey.py:L1458-L1466 and tests/test_prep.py:L3448-L3455, tests/test_prep.py:L3530-L3559.

Methodology

Severity: P3-informational. Impact: The PR changes the implementation of the existing survey-aggregation variance calculation, not the method itself. The zero-padded domain-estimation IF contract in diff_diff/prep.py:L1326-L1425 still matches the registry’s aggregate_survey specification in docs/methodology/REGISTRY.md:L2790-L2828, and the new _finalize() logic in diff_diff/survey.py:L1566-L1571 mirrors the legacy 0.0 vs NaN contract in diff_diff/survey.py:L1834-L1846. Concrete fix: None.

Code Quality
No findings.

Performance
No findings.

Maintainability
No findings beyond the test-coverage item below.

Tech Debt
No findings.

Security
No findings.

Documentation/Tests

Severity: P2. Impact: _precompute_psu_scaffolding() now contains a dedicated strata-with-no-psu path in diff_diff/survey.py:L1458-L1466, but TestAggregateSurveyScaffolding.test_fast_path_equals_legacy() never covers that mode and therefore does not fully substantiate the docstring claim in tests/test_prep.py:L3448-L3455. A regression there would affect supported aggregate_survey(..., survey_design=SurveyDesign(weights=..., strata=...)) users without tripping the new PR-specific equivalence suite. Concrete fix: Add a stratified_no_psu case to _build_microdata() and include it in the parametrized fast-vs-legacy test; ideally cover both with and without FPC because that branch also owns stratum-level FPC lookup.

Review note: this was a static source review. I could not run local verification in this sandbox because the available Python environment is missing project dependencies (numpy).

Addresses the P2 test-coverage gap from the CI review on PR #338: _precompute_psu_scaffolding() has a dedicated strata-with-no-PSU branch (survey.py:1458-1466) where each observation is its own PSU within its stratum, but TestAggregateSurveyScaffolding did not exercise it despite the docstring claiming every supported design mode was covered. Adds two parametrized cases to test_fast_path_equals_legacy: - stratified_no_psu — SurveyDesign(weights=, strata=) - stratified_no_psu_fpc — same plus stratum-level FPC lookup (the fpc-on-this-branch path goes through the same per-stratum first-obs FPC read as stratified+PSU, so both variants matter). Both pass assert_allclose(atol=1e-14, rtol=1e-14) equivalence with the legacy path across all cells. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T21:45:36Z

/ai-review

github-actions · 2026-04-19T21:51:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: ff7134da84b3d7b48fa3b8ca5ec680839df43f63

Overall Assessment

✅ Looks good. This re-review resolves the prior coverage concern, and I did not identify any unmitigated P0/P1 issues in the changed code. One P3 documentation mismatch remains.

Executive Summary

The affected method is aggregate_survey domain-estimation variance under Lumley (2004 §3.4) / Binder (1983); the new fast path is confined to the TSL branch in diff_diff/survey.py:L1308-L1624 and diff_diff/prep.py:L1592-L1659.
I did not find an undocumented methodology deviation or an incorrect variance/SE computation relative to docs/methodology/REGISTRY.md:L2711-L2731 and docs/methodology/REGISTRY.md:L2790-L2828.
Replicate-weight designs still bypass the new scaffolding path, so the scope remains localized to non-replicate aggregate_survey TSL variance; see diff_diff/prep.py:L1407-L1412 and diff_diff/prep.py:L1597-L1600.
The prior re-review coverage gap is fixed: the equivalence suite now includes stratified_no_psu and stratified_no_psu_fpc in tests/test_prep.py:L3486-L3502 and tests/test_prep.py:L3548-L3562.
One residual P3 item remains: changed prose in CHANGELOG.md:L10-L11 and docs/performance-plan.md:L245-L254 still says “seven-case” even though the committed test matrix now covers nine modes.
Static review only; I could not run local verification in this sandbox because pytest and numpy are unavailable.

Methodology

Severity: P3-informational. Impact: The PR changes the reduction strategy, not the survey methodology. The registry still describes the same zero-padded domain IF for aggregate_survey and the same Binder/Lumley TSL variance contract, while the code simply swaps repeated per-cell scaffolding rebuilds for cached integer codes plus np.bincount; see docs/methodology/REGISTRY.md:L2796-L2828, docs/methodology/REGISTRY.md:L2713-L2731, diff_diff/prep.py:L1394-L1425, and diff_diff/survey.py:L1538-L1624. Concrete fix: None.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The code and tests now cover nine equivalence modes, but the updated prose still documents a “seven-case” / “seven design cases” suite. That understates the actual regression surface and obscures that the prior stratified_no_psu gap was closed; see CHANGELOG.md:L10-L11, docs/performance-plan.md:L245-L254, and tests/test_prep.py:L3548-L3562. Concrete fix: Update the changed prose to say “nine-case” / “nine design cases” and explicitly mention stratified_no_psu and stratified_no_psu_fpc.

The per-cell Taylor-series variance inside aggregate_survey previously rebuilt stratum-PSU scaffolding (np.unique, per-stratum pandas groupby, stratum FPC lookup) on every output cell. At BRFSS scale (50 states x 10 years = 500 cells, 20 strata, 1M microdata rows) this was ~10K pandas groupby constructions, each summing a mostly-zero psi vector and paying full pandas setup cost — the entire chain's runtime. This PR adds a frozen _PsuScaffolding dataclass plus private _precompute_psu_scaffolding(resolved) and _compute_if_variance_fast( psi, scf) helpers in diff_diff/survey.py. aggregate_survey builds scaffolding once per design and threads it through _cell_mean_variance via a new optional kwarg; the fast path replaces the per-stratum groupby loop with two vectorized np.bincount passes (psi → PSU sums, PSU sums → per-stratum first and second moments) plus a closed-form meat = sum_h adjustment_h * centered_ss_h. Scope is deliberately localized: _compute_stratified_psu_meat and compute_survey_if_variance are unchanged, so every other TSL caller (DiD, TWFE, CS, SunAbraham, dCDH, etc.) is unaffected. Replicate- weight designs continue to route through compute_replicate_if_variance unchanged. Measured impact (benchmarks/speed_review/run_all.py, 1M rows BRFSS): - Large: 24.4s → 1.33s (Python), 24.9s → 1.32s (Rust) [18.4-19.0x] - Medium: 6.1s → 0.49s [12.5-13.2x] - Small: 1.6s → 0.17s [7.6-10x] No regression in any other scenario (all within run-to-run noise). Numerical equivalence: new TestAggregateSurveyScaffolding asserts assert_allclose(atol=1e-14, rtol=1e-14) between fast and legacy paths across seven design cases — stratified+PSU+FPC, stratified no FPC, PSU-only, weights-only, and all three lonely_psu modes (remove / certainty / adjust) — plus structural tests on the scaffolding itself. On the actual BRFSS-large 1M-row panel, y_mean is bit-identical and y_se / y_precision drift at ~1 ULP (max relative diff 4.6e-16). Existing coverage unchanged: all 43 TestAggregateSurvey tests green on the fast path (new default); all 129 test_survey.py tests green. Documentation: - docs/performance-plan.md: finding #1 rewritten ("practitioner-fast at every scale"), BRFSS bullet updated, hotspots row #1 marked LANDED, memory finding updated, priority table item #1 marked LANDED, new "Optimization landed" subsection, bottom line updated ("no practitioner-perceptible bottleneck remains"). Auto-tables regenerated via gen_findings_tables.py. - CHANGELOG.md: new Performance entry under [Unreleased]. No user-facing API change. Methodology docs (REGISTRY.md, survey- theory.md) are deliberately not touched: this is a pure internal performance optimization with numerics preserved to sub-ULP tolerance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses the P2 test-coverage gap from the CI review on PR #338: _precompute_psu_scaffolding() has a dedicated strata-with-no-PSU branch (survey.py:1458-1466) where each observation is its own PSU within its stratum, but TestAggregateSurveyScaffolding did not exercise it despite the docstring claiming every supported design mode was covered. Adds two parametrized cases to test_fast_path_equals_legacy: - stratified_no_psu — SurveyDesign(weights=, strata=) - stratified_no_psu_fpc — same plus stratum-level FPC lookup (the fpc-on-this-branch path goes through the same per-stratum first-obs FPC read as stratified+PSU, so both variants matter). Both pass assert_allclose(atol=1e-14, rtol=1e-14) equivalence with the legacy path across all cells. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

igerber · 2026-04-19T23:33:03Z

/ai-review

github-actions · 2026-04-19T23:38:48Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7039e74e4e9be890c9c4a6ad446a9b9dcc5ed592

Overall Assessment

✅ Looks good

No unmitigated P0/P1 issues in the changed PR surface. The optimization stays within the documented aggregate_survey domain-estimation contract in docs/methodology/REGISTRY.md:L2796-L2828; remaining issues are P3 documentation/test follow-ups.

Executive Summary

The affected method is aggregate_survey survey-domain variance under Lumley (2004 §3.4) / Binder (1983). The new work is confined to the non-replicate TSL path in diff_diff/prep.py:L1403-L1412, diff_diff/prep.py:L1592-L1600, and diff_diff/survey.py:L1307-L1624.
I did not find an undocumented methodology deviation or incorrect variance/SE formula relative to docs/methodology/REGISTRY.md:L2796-L2828. compute_survey_if_variance() and _compute_stratified_psu_meat() remain unchanged for non-aggregate_survey callers in diff_diff/survey.py:L1160-L1304 and diff_diff/survey.py:L1802-L1846.
Replicate-weight designs still bypass the new scaffolding path, so the scope remains localized to non-replicate aggregate_survey; see diff_diff/prep.py:L1407-L1412 and diff_diff/prep.py:L1597-L1600.
The prior P3 doc mismatch remains unresolved: changed prose still says “seven-case” even though the committed equivalence matrix now parameterizes nine modes in tests/test_prep.py:L3548-L3560.
The new equivalence suite still does not directly exercise the unstratified FPC fast-path branches (weights+FPC, PSU-only+FPC) added in diff_diff/survey.py:L1375-L1432.
Static review only; I could not run local numerical verification in this sandbox because numpy is unavailable.

Methodology

No findings. The fast path preserves the same zero-padded domain-IF contract documented in docs/methodology/REGISTRY.md:L2796-L2828, and I did not see a source-material mismatch in the changed variance algebra.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: The previous re-review’s documentation mismatch is still present. CHANGELOG.md:L13-L14 and docs/performance-plan.md:L245-L254 still describe a “seven-case” equivalence suite, but tests/test_prep.py:L3548-L3560 now covers nine modes. Concrete fix: update the changed prose to “nine-case” / “nine design cases” and explicitly name stratified_no_psu and stratified_no_psu_fpc.
Severity: P3. Impact: tests/test_prep.py:L3448-L3455 says the new equivalence tests cover “every supported design mode,” but the parameterized matrix in tests/test_prep.py:L3548-L3560 omits the new unstratified FPC branches implemented in diff_diff/survey.py:L1375-L1432. Concrete fix: add fast-vs-legacy equivalence cases for SurveyDesign(weights="wt", fpc="fpc") and SurveyDesign(weights="wt", psu="psu", fpc="fpc"), or narrow the docstring/prose so it does not overstate coverage.

igerber added the ready-for-ci Triggers CI test workflows label Apr 19, 2026

igerber and others added 2 commits April 19, 2026 19:32

igerber force-pushed the perf/aggregate-survey-precompute branch from ff7134d to 7039e74 Compare April 19, 2026 23:32

igerber merged commit 73eef66 into main Apr 20, 2026
19 checks passed

igerber deleted the perf/aggregate-survey-precompute branch April 20, 2026 00:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Precompute stratum-PSU scaffolding in aggregate_survey (~19x BRFSS speedup)#338

Precompute stratum-PSU scaffolding in aggregate_survey (~19x BRFSS speedup)#338
igerber merged 2 commits intomainfrom
perf/aggregate-survey-precompute

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Apr 19, 2026

Summary

Measured impact (benchmarks/speed_review/run_all.py)

Methodology references

Validation

Tests added

Tests unchanged, all green

Numerical equivalence on real BRFSS-large data

Performance sweep

Security / privacy

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

igerber commented Apr 19, 2026

Uh oh!

github-actions Bot commented Apr 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant