Precompute stratum-PSU scaffolding in aggregate_survey (~19x BRFSS speedup) #338
Conversation
Overall Assessment

Review note: this was a static source review. I could not run local verification in this sandbox because the available Python environment is missing project dependencies.
Addresses the P2 test-coverage gap from the CI review on PR #338: _precompute_psu_scaffolding() has a dedicated strata-with-no-PSU branch (survey.py:1458-1466) where each observation is its own PSU within its stratum, but TestAggregateSurveyScaffolding did not exercise it despite the docstring claiming every supported design mode was covered.

Adds two parametrized cases to test_fast_path_equals_legacy:

- stratified_no_psu — SurveyDesign(weights=, strata=)
- stratified_no_psu_fpc — same plus stratum-level FPC lookup (the FPC-on-this-branch path goes through the same per-stratum first-obs FPC read as stratified+PSU, so both variants matter).

Both pass assert_allclose(atol=1e-14, rtol=1e-14) equivalence with the legacy path across all cells.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
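For context, a hedged sketch of what the strata-with-no-PSU branch amounts to — the helper name below is illustrative, not the actual code at survey.py:1458-1466:

```python
import numpy as np

def psu_codes_for_stratum_only_design(strata: np.ndarray) -> np.ndarray:
    """Illustrative sketch of the strata-with-no-PSU fallback: when a design
    supplies strata but no PSU column, each observation acts as its own PSU
    within its stratum, so the dense PSU codes are simply 0..n-1."""
    n = strata.shape[0]
    return np.arange(n)  # one singleton PSU per observation

strata = np.array([0, 0, 1, 1, 1])
print(psu_codes_for_stratum_only_design(strata))  # [0 1 2 3 4]
```

With singleton PSUs, the per-stratum variance then operates on observation-level psi directly, which is why this branch shares the per-stratum first-obs FPC read with the stratified+PSU path.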
/ai-review
🔁 AI review rerun (requested by @igerber). Head SHA:

Overall Assessment: ✅ Looks good. This re-review resolves the prior coverage concern, and I did not identify any unmitigated P0/P1 issues in the changed code. One P3 documentation mismatch remains.
The per-cell Taylor-series variance inside aggregate_survey previously rebuilt stratum-PSU scaffolding (np.unique, per-stratum pandas groupby, stratum FPC lookup) on every output cell. At BRFSS scale (50 states × 10 years = 500 cells, 20 strata, 1M microdata rows) this was ~10K pandas groupby constructions, each summing a mostly-zero psi vector and paying full pandas setup cost — the entire chain's runtime.

This PR adds a frozen _PsuScaffolding dataclass plus private _precompute_psu_scaffolding(resolved) and _compute_if_variance_fast(psi, scf) helpers in diff_diff/survey.py. aggregate_survey builds scaffolding once per design and threads it through _cell_mean_variance via a new optional kwarg; the fast path replaces the per-stratum groupby loop with two vectorized np.bincount passes (psi → PSU sums, PSU sums → per-stratum first and second moments) plus a closed-form meat = sum_h adjustment_h * centered_ss_h.

Scope is deliberately localized: _compute_stratified_psu_meat and compute_survey_if_variance are unchanged, so every other TSL caller (DiD, TWFE, CS, SunAbraham, dCDH, etc.) is unaffected. Replicate-weight designs continue to route through compute_replicate_if_variance unchanged.

Measured impact (benchmarks/speed_review/run_all.py, 1M rows BRFSS):

- Large: 24.4s → 1.33s (Python), 24.9s → 1.32s (Rust) [18.4-19.0x]
- Medium: 6.1s → 0.49s [12.5-13.2x]
- Small: 1.6s → 0.17s [7.6-10x]

No regression in any other scenario (all within run-to-run noise).

Numerical equivalence: new TestAggregateSurveyScaffolding asserts assert_allclose(atol=1e-14, rtol=1e-14) between fast and legacy paths across seven design cases — stratified+PSU+FPC, stratified no FPC, PSU-only, weights-only, and all three lonely_psu modes (remove / certainty / adjust) — plus structural tests on the scaffolding itself. On the actual BRFSS-large 1M-row panel, y_mean is bit-identical and y_se / y_precision drift at ~1 ULP (max relative diff 4.6e-16).

Existing coverage unchanged: all 43 TestAggregateSurvey tests green on the fast path (new default); all 129 test_survey.py tests green.

Documentation:

- docs/performance-plan.md: finding #1 rewritten ("practitioner-fast at every scale"), BRFSS bullet updated, hotspots row #1 marked LANDED, memory finding updated, priority table item #1 marked LANDED, new "Optimization landed" subsection, bottom line updated ("no practitioner-perceptible bottleneck remains"). Auto-tables regenerated via gen_findings_tables.py.
- CHANGELOG.md: new Performance entry under [Unreleased].

No user-facing API change. Methodology docs (REGISTRY.md, survey-theory.md) are deliberately not touched: this is a pure internal performance optimization with numerics preserved to sub-ULP tolerance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
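A minimal sketch of what "build scaffolding once per design" could look like — the names (PsuScaffolding, precompute_psu_scaffolding) and field layout are illustrative stand-ins for the PR's private helpers, not the actual code:

```python
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class PsuScaffolding:
    """Illustrative stand-in for _PsuScaffolding: the design-dependent arrays
    the per-cell variance needs, computed once instead of once per cell."""
    psu_codes: np.ndarray         # dense PSU id (0..n_psu-1) per observation
    psu_stratum: np.ndarray       # stratum code of each PSU
    n_psu_per_stratum: np.ndarray
    n_strata: int
    n_psu: int

def precompute_psu_scaffolding(strata: np.ndarray, psu: np.ndarray) -> PsuScaffolding:
    # Factorize strata and PSUs once; per-cell work then reduces to bincounts.
    stratum_codes = np.unique(strata, return_inverse=True)[1]
    psu_within = np.unique(psu, return_inverse=True)[1]
    # Joint (stratum, psu) key so identically-labeled PSUs in different
    # strata stay distinct (PSUs are nested within strata).
    key = stratum_codes * (psu_within.max() + 1) + psu_within
    psu_codes = np.unique(key, return_inverse=True)[1]
    n_psu = int(psu_codes.max()) + 1
    psu_stratum = np.zeros(n_psu, dtype=np.int64)
    psu_stratum[psu_codes] = stratum_codes
    return PsuScaffolding(psu_codes, psu_stratum, np.bincount(psu_stratum),
                          int(stratum_codes.max()) + 1, n_psu)

scf = precompute_psu_scaffolding(np.array(["a", "a", "a", "b", "b"]),
                                 np.array([10, 10, 20, 10, 30]))
print(scf.n_psu, scf.n_psu_per_stratum)  # 4 [2 2]
```

Freezing the dataclass matches the PR's design: the scaffolding is immutable shared state threaded through _cell_mean_variance, so no per-cell mutation is possible.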
Force-pushed from ff7134d to 7039e74 (Compare)
/ai-review
🔁 AI review rerun (requested by @igerber). Head SHA:

Overall Assessment: ✅ Looks good. No unmitigated P0/P1 issues in the changed PR surface. The optimization stays within the documented […]

- Methodology — No findings. The fast path preserves the same zero-padded domain-IF contract documented in docs/methodology/REGISTRY.md:L2796-L2828, and I did not see a source-material mismatch in the changed variance algebra.
- Code Quality — No findings.
- Performance — No findings.
- Maintainability — No findings.
- Tech Debt — No findings.
- Security — No findings.
- Documentation/Tests
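To make the "zero-padded domain-IF contract" concrete, here is a hedged sketch under stated assumptions — the helper name and signature are hypothetical, but the shape of the contract (a full-length psi that is zero outside the cell's domain, so design-level reductions run unchanged) follows the review's description:

```python
import numpy as np

def domain_if(y: np.ndarray, weights: np.ndarray, in_cell: np.ndarray):
    """Influence function of a weighted domain mean, zero-padded to length n.

    Illustrative only: observations outside the cell get zero weight, so psi
    is zero there, and PSU/stratum reductions over the full vector are valid.
    """
    w = weights * in_cell                 # zero weight outside the domain
    mean = np.sum(w * y) / np.sum(w)
    psi = w * (y - mean) / np.sum(w)      # zero outside the domain by construction
    return psi, mean

y = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.ones(4)
in_cell = np.array([1.0, 1.0, 0.0, 0.0])  # first two observations form the cell
psi, mean = domain_if(y, weights, in_cell)
print(mean, psi)  # 1.5 [-0.25  0.25  0.    0.  ]
```

This mostly-zero psi is exactly the vector the PR's fast path sums with np.bincount instead of a per-stratum pandas groupby.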
Summary
Drops `aggregate_survey` at BRFSS-realistic scale (50 states × 10 years, 1M microdata rows) from ~24s to ~1.3s — 18-19x under both backends. Addresses item #1 from the PR #333 practitioner-workflow performance baseline, the single pain point identified across the six measured scenarios.

Mechanism. The per-cell Taylor-series variance inside `aggregate_survey` previously rebuilt stratum-PSU scaffolding (`np.unique` + per-stratum `pd.DataFrame.groupby(...).sum()` + FPC lookup) on every output cell — ~10K pandas groupby constructions for the BRFSS scenario, each summing a mostly-zero psi vector. This PR adds a frozen `_PsuScaffolding` dataclass + private `_precompute_psu_scaffolding` / `_compute_if_variance_fast` helpers in `diff_diff/survey.py`. `aggregate_survey` builds scaffolding once per design and threads it through `_cell_mean_variance` via a new optional kwarg; the fast path replaces the per-stratum groupby loop with two vectorized `np.bincount` passes (psi → PSU sums, PSU sums → per-stratum first and second moments).

Scope is localized. `_compute_stratified_psu_meat` and `compute_survey_if_variance` are untouched, so every other TSL caller (DiD, TWFE, CS, SunAbraham, dCDH, etc.) is unaffected. Replicate-weight designs continue to route through `compute_replicate_if_variance` unchanged.

Measured impact (benchmarks/speed_review/run_all.py)
Memory profile unchanged (BRFSS was already memory-efficient; growth stays in low tens of MB).
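The two-pass mechanism described above can be sketched as follows — a hypothetical signature under the PR's stated algebra (meat = sum_h adjustment_h * centered_ss_h, with z_hj the PSU totals of psi and adjustment_h = fpc_h * n_h/(n_h - 1)); the real helper also implements the lonely-PSU modes, which are elided here:

```python
import numpy as np

def fast_tsl_meat(psi, psu_codes, psu_stratum, n_psu_per_stratum, fpc=None):
    """Illustrative fast path: the per-stratum pandas groupby replaced by
    two np.bincount reductions and a closed-form centered sum of squares."""
    n_psu = psu_stratum.shape[0]
    n_strata = n_psu_per_stratum.shape[0]
    # Pass 1: observation-level psi -> PSU totals z_hj.
    z = np.bincount(psu_codes, weights=psi, minlength=n_psu)
    # Pass 2: PSU totals -> per-stratum first and second moments.
    s1 = np.bincount(psu_stratum, weights=z, minlength=n_strata)
    s2 = np.bincount(psu_stratum, weights=z * z, minlength=n_strata)
    n_h = n_psu_per_stratum.astype(float)
    centered_ss = s2 - s1 * s1 / n_h                # sum_j (z_hj - zbar_h)^2
    adjustment = n_h / np.maximum(n_h - 1.0, 1.0)   # lonely-PSU handling elided
    if fpc is not None:
        adjustment = adjustment * fpc
    return float(np.sum(adjustment * centered_ss))

# Two strata: stratum 0 holds PSUs {0, 1}, stratum 1 holds PSU {2}.
psi = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
meat = fast_tsl_meat(psi,
                     psu_codes=np.array([0, 0, 1, 1, 2, 2]),
                     psu_stratum=np.array([0, 0, 1]),
                     n_psu_per_stratum=np.array([2, 1]))
print(meat)  # 16.0
```

Because both passes are plain `np.bincount` calls, cost per cell is O(n + n_psu) with no pandas setup overhead, which is where the ~19x comes from at 500 cells.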
Methodology references
- `_compute_stratified_psu_meat` TSL branch: only the order of summation changes (single `np.bincount` vs per-stratum pandas groupby).
- `docs/methodology/REGISTRY.md` "Survey" section unchanged — the methodology documented there is exactly what the fast path computes.

Note on deliberately untouched docs. `diff_diff/survey.py` has HIGH drift-risk doc deps (methodology REGISTRY, survey-theory, roadmap, tutorials). None were updated because this PR makes no methodology, API, numerical, or user-facing behavior change — it is a pure internal performance optimization. Methodology docs would be misleading if edited ("what changed?" — nothing in the math). Flagging proactively in case the AI reviewer wants to confirm.

Validation
Tests added
- `tests/test_prep.py::TestAggregateSurveyScaffolding` (+194 lines): parametrized equivalence cases across the supported design modes, including all three `lonely_psu` modes (remove / certainty / adjust). Each monkey-patches `_precompute_psu_scaffolding` to force the legacy path for comparison, then `assert_allclose(atol=1e-14, rtol=1e-14)` on `_mean`, `_se`, `_precision`.
- Structural tests on `_PsuScaffolding` shape/dtype per mode + static `legitimate_zero_count` semantics.

Tests unchanged, all green

- `tests/test_prep.py::TestAggregateSurvey` — 43/43 passed (exercises fast path by default)
- `tests/test_prep.py` (full file) — 218/218 passed
- `tests/test_survey.py` — 129/129 passed

Numerical equivalence on real BRFSS-large data
Spot-checked against the actual 1M-row benchmark fixture (500 cells: 50 states × 10 years):
- `y_mean`: bit-identical (|diff| = 0.0)
- `y_se`: max relative drift 2.83e-16 (≈1 ULP)
- `y_precision`: max relative drift 4.64e-16 (≈1 ULP)

All ~10⁸× tighter than the `1e-14` test tolerance.

Performance sweep
`python benchmarks/speed_review/run_all.py` rerun; all 27 baselines regenerated and committed. No scenario regressed outside run-to-run noise (<±10% on sub-second totals). Auto-tables in `docs/performance-plan.md` regenerated via `gen_findings_tables.py`; narrative prose updated to match (finding #1 rewritten, hotspots row marked LANDED, new "Optimization landed" subsection, priority-table item #1 marked LANDED, bottom line updated).
Confirm no secrets/PII in this PR: Yes.
🤖 Generated with Claude Code