perf: Hot path performance optimizations (P2, P5, P13, partial P1) #157
+77 −22
Summary
Performance optimizations for frequently executed code paths, targeting the "hot paths" identified in the code audit.
Changes
P2: parse_metadata pre-split (`parse_metadata`)
- Pre-split `meta3` and `meta2` by `dimension_id` before the loop using `split()` instead of `dplyr::filter()` per column

P5: Factor conversion gsub loop (`normalize_cansim_values`)
- Use `across()` for single-pass regex processing
- Use `vapply` for pre-checking which fields need processing

P13: lapply %>% unlist chain optimizations (multiple locations)
- Replace `lapply(length) %>% unlist` with vectorized `lengths()`
- Replace `lapply(...) %>% unlist` with `purrr::map_chr()` for string extraction
- Replace `lapply(gsub...) %>% unlist` with vectorized `gsub()` directly
- Replace `lapply(class) %>% unlist` with `vapply(..., character(1))`

P1 (partial): fold_in_metadata member ID extraction
- Replace `lapply(.data$...pos, ...) %>% unlist` with `purrr::map_chr()`

Files Modified
- `R/cansim.R`: `normalize_cansim_values`, `fold_in_metadata_for_columns`, `categories_for_level`
- `R/cansim_metadata.R`: `parse_metadata`, `read_notes`, `get_cansim_cube_metadata`

Benchmark Results
P2: parse_metadata pre-split - ✅ 84-86% improvement
Test methodology: Simulated metadata structure matching real StatCan table metadata, testing the filter-per-iteration vs pre-split approach.
Test data:
Benchmark code:
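The original benchmark script was collapsed in this export; a minimal sketch of the comparison, assuming the `bench` and `dplyr` packages and a simulated `meta3` tibble (the `member_name` column and row counts are illustrative, not the actual test fixture), might look like:

```r
# Sketch: filter-per-iteration vs. pre-split on a simulated metadata table.
library(dplyr)

n_dims <- 20
meta3 <- tibble(
  dimension_id = rep(seq_len(n_dims), each = 2000),  # 40,000 rows total
  member_name  = as.character(seq_len(n_dims * 2000))
)

# Old approach: one dplyr::filter() pass over the full table per dimension
filter_per_iteration <- function(meta) {
  lapply(seq_len(n_dims), function(d) filter(meta, dimension_id == d))
}

# New approach: a single split() up front, then cheap list lookups in the loop
pre_split <- function(meta) {
  split(meta, meta$dimension_id)
}

bench::mark(
  filter_per_iteration(meta3),
  pre_split(meta3),
  check = FALSE  # same rows, but unnamed vs. named list of data frames
)
```

The pre-split version does one grouping pass instead of `n_dims` full-table scans, which is where the reported 84-86% improvement comes from.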
Results (40,000 rows, 20 dimensions):
Results (5,000 rows, 10 dimensions):
P5: normalize_cansim_values - ⚠️ 0.8% (negligible)
Test methodology: Benchmarked `normalize_cansim_values()` on real StatCan data, comparing the master and optimized branches.
Test data: Table 17-10-0005 (Population by age and sex)
Benchmark code:
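The exact script was not preserved here; a plausible sketch, assuming the `bench` and `cansim` packages (run once under the master checkout and once under the optimized branch, since the two implementations cannot be loaded side by side):

```r
# Sketch: time normalize_cansim_values() on a real table; repeat under each
# branch and compare the median timings. Requires network access on first run.
library(cansim)

pop <- get_cansim("17-10-0005")  # Population by age and sex

bench::mark(
  normalize_cansim_values(pop),
  iterations = 10,
  check = FALSE
)
```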
Results:
Analysis: The improvement is within measurement noise. This is expected because the optimization only affects 2-3 classification columns per table - the `across()` change improves code clarity but doesn't measurably improve performance.
P13: lengths() optimization - ⚠️ -0.6% (negligible)
Test methodology: Benchmarked `categories_for_level()`, which uses the `lengths()` vs `lapply(length) %>% unlist` optimization.
Test data: Same 302,610-row table from the P5 test, with 3 hierarchy columns.
Benchmark code:
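The collapsed benchmark can be sketched as follows, assuming the `bench` and `magrittr` packages; the hierarchy codes are illustrative stand-ins for the 15 unique GEO values:

```r
# Sketch: lengths() vs. lapply(length) %>% unlist on the small unique-value
# input that categories_for_level() actually operates on.
library(magrittr)

hierarchy <- c("1", paste0("1.", 1:14))          # 15 unique hierarchy codes
parts <- strsplit(hierarchy, ".", fixed = TRUE)  # list of split code components

bench::mark(
  lengths(parts),
  lapply(parts, length) %>% unlist(),
  check = TRUE  # both yield the same integer vector of component counts
)
```

With only 15 elements, both forms finish in microseconds, which is consistent with the negligible difference reported below.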
Results:
Analysis: The `lengths()` optimization operates on unique hierarchy values only (15 unique values for the GEO column), not on all 302,610 rows. With such a small input, there is no measurable difference between `lengths()` and `lapply(length) %>% unlist`.
Summary
Test Plan
- `devtools::check()` passes with no errors or warnings
- `normalize_cansim_values()` produces identical output

🤖 Generated with Claude Code