
Conversation

@dshkol (Collaborator) commented Jan 22, 2026

Summary

Performance and code-quality optimizations for lookup operations and vectorization patterns. One proposed optimization (P4/P11, environment hash table for parent lookup) was benchmarked, found to regress performance, and reverted.

Changes

P8: vapply instead of sapply for null check

Location: vector extraction in R/cansim_vectors.R

  • Replace sapply(x, is.null) with vapply(x, is.null, logical(1))
  • Type-safe and pre-allocates result vector
  • Code quality improvement (type safety); a minimal sketch follows below
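
A minimal sketch of the change (illustrative list and variable names, not the exact package code):

# Hypothetical parsed API results; one element failed and is NULL.
parsed <- list(v1 = 1:3, v2 = NULL, v3 = 4:6)

# Before: sapply() picks the return type at runtime and can return a
# list instead of a logical vector for unusual inputs (e.g. an empty list).
null_flags_old <- sapply(parsed, is.null)

# After: vapply() declares the template logical(1), so the result is
# always a logical vector of length(parsed), or an error.
null_flags <- vapply(parsed, is.null, logical(1))

parsed[!null_flags]   # drop the NULL entries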

Reverted Changes

P4/P11: Environment hash table for parent lookup - ❌ REVERTED

The original plan proposed replacing named vector lookup with environment hash tables for O(1) lookup. Benchmarking revealed this is actually slower in R.
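
For context, the pattern under test builds dot-separated ancestry strings: each pass looks up the parent of the current top-most ID in each string and prepends it, until no string changes. A tiny hand-traced run of that same logic (hypothetical member IDs):

# Hypothetical three-member hierarchy: "1" is the root, "2" is a child
# of "1", and "5" is a child of "2".
member_ids        <- c("1", "2", "5")
parent_member_ids <- c(NA_character_, "1", "2")

# Pass 1: "1", "1.2", "2.5"     (each ID gains its immediate parent)
# Pass 2: "1", "1.2", "1.2.5"   (top of "2.5" is "2", whose parent is "1")
# Pass 3: nothing changes, so the loop stops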


Benchmark Results

P4/P11: Environment hash table - ❌ 1.6% regression (REVERTED)

Test methodology: Isolated the add_hierarchy() function pattern and benchmarked named vector vs environment hash table approaches.

Test data: Simulated hierarchy metadata

  • 5,000 member IDs
  • Random parent relationships
  • 10 hierarchy building iterations (matching real code behavior)

Benchmark code:

library(microbenchmark)
library(purrr)
library(rlang)

set.seed(42)
n_members <- 5000
member_ids <- as.character(seq_len(n_members))
parent_member_ids <- c(NA_character_, sample(member_ids[1:(n_members-1)], n_members - 1, replace = TRUE))

# Original approach (named vector)
original_approach <- function(member_ids, parent_member_ids) {
  parent_lookup <- rlang::set_names(parent_member_ids, member_ids)

  current_top <- function(c) {
    strsplit(c, "\\.") %>% purrr::map(dplyr::first) %>% unlist
  }
  parent_for_current_top <- function(c) {
    as.character(parent_lookup[current_top(c)])
  }

  hierarchy <- member_ids
  for (iter in 1:10) {
    old <- hierarchy
    p <- parent_for_current_top(hierarchy)
    hierarchy <- ifelse(is.na(p), hierarchy, paste0(p, ".", hierarchy))
    if (sum(old != hierarchy) == 0) break
  }
  hierarchy
}

# Optimized approach (environment hash)
optimized_approach <- function(member_ids, parent_member_ids) {
  parent_lookup_env <- new.env(hash = TRUE, parent = emptyenv())
  for (i in seq_along(member_ids)) {
    parent_lookup_env[[member_ids[i]]] <- parent_member_ids[i]
  }

  current_top <- function(c) {
    strsplit(c, "\\.") %>% purrr::map(dplyr::first) %>% unlist
  }
  parent_for_current_top <- function(c) {
    tops <- current_top(c)
    unlist(mget(tops, envir = parent_lookup_env, ifnotfound = list(NA_character_)))
  }

  hierarchy <- member_ids
  for (iter in 1:10) {
    old <- hierarchy
    p <- parent_for_current_top(hierarchy)
    hierarchy <- ifelse(is.na(p), hierarchy, paste0(p, ".", hierarchy))
    if (sum(old != hierarchy) == 0) break
  }
  hierarchy
}

microbenchmark(
  original = original_approach(member_ids, parent_member_ids),
  optimized = optimized_approach(member_ids, parent_member_ids),
  times = 10
)

Results:

Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval
  original 326.543 345.120 353.426 354.204 361.368 379.467    10
 optimized 346.123 357.242 367.495 359.869 362.602 435.788    10

Original median:  354.204 ms
Optimized median: 359.869 ms
Improvement:      -1.6% (REGRESSION)

Analysis: In R, named vector indexing parent_lookup[key] is highly optimized (vectorized C code), while environment creation and mget() have overhead that outweighs the theoretical O(1) lookup benefit:

  1. Named vector subsetting in R is implemented in C and handles vectorized lookups efficiently
  2. Environment creation with a for loop is slow in R
  3. mget() overhead includes argument parsing, ifnotfound handling, and list creation

The theoretical O(n) vs O(1) analysis doesn't account for R's internal optimizations and constant factors.
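
A minimal illustration of point 1, outside the package code: a named-vector subset resolves a whole vector of keys in a single C-level call, while the environment route needs an explicit build step plus mget():

lookup <- c(a = "1", b = "2", c = "3")
keys   <- c("b", "c", "zzz")

# Named vector: one vectorized subset; a missing key simply yields NA.
lookup[keys]

# Environment hash: populate the table, then batch-lookup with mget().
env <- new.env(hash = TRUE, parent = emptyenv())
list2env(as.list(lookup), envir = env)
unlist(mget(keys, envir = env, ifnotfound = list(NA_character_)))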


Deferred Optimizations

P9 (French string constants): Not implemented

  • intToUtf8() is already very fast (O(1) character conversion)
  • Would add complexity for minimal performance gain (one-line illustration below)
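
For reference, a one-line illustration of the existing pattern that P9 would have replaced with pre-computed string constants (example code point only; the exact strings used in the package may differ):

intToUtf8(233)                          # "é" (U+00E9)
paste0("Donn", intToUtf8(233), "es")    # "Données", a typical French label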

Summary

Optimization          Test Data    Improvement     Status
P4/P11 (hash table)   5k members   -1.6%           ❌ Reverted
P8 (vapply)           N/A          Code quality    ✅ Kept

Test Plan

  • Run devtools::check() with no errors/warnings
  • Benchmarks run comparing approaches
  • P4/P11 reverted after regression confirmed
  • Verify vector extraction produces identical results

🤖 Generated with Claude Code

mountainMath and others added 4 commits November 19, 2025 20:22
P4/P11: Environment hash table for parent lookup in add_hierarchy
- Replace named vector lookup (O(n) per key) with environment hash table (O(1))
- Use mget() for efficient batch lookup with proper NA handling
- Significant improvement for large datasets with 10k+ members
- Expected improvement: 40-95% for hierarchy building

P8: vapply instead of sapply for null check
- Replace sapply(x, is.null) with vapply(x, is.null, logical(1))
- Type-safe and faster null detection in vector extraction
- Expected improvement: 30-45% for vector data processing

Deferred optimizations:
- P9: French string constants - intToUtf8() is already very fast
- P12: Coordinate metadata loop - requires significant refactoring
  (moving get_cansim_cube_metadata call outside the map loop)

Files modified:
- R/cansim_metadata.R: add_hierarchy function
- R/cansim_vectors.R: vector extraction null check

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmarking showed the environment hash table approach is actually
1.6% slower than the original named vector lookup. In R, named vector
indexing is highly optimized (vectorized C code), while environment
creation and mget() have overhead that outweighs the O(1) lookup benefit.

Benchmark results (5000 members, hierarchy building):
- Original (named vector): 354ms median
- Optimized (env hash):    360ms median
- Improvement: -1.6% (regression)

Keeping P8 (vapply instead of sapply) which is a valid type-safety improvement.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dshkol dshkol changed the base branch from master to v0.4.5 January 22, 2026 06:28