
Conversation

@dshkol (Collaborator) commented Jan 22, 2026

Summary

Performance and code-quality optimizations for lookup operations and vectorization patterns. One proposed optimization (P4/P11, environment hash table for parent lookup) was benchmarked, found to regress performance, and reverted.

Changes

P8: vapply instead of sapply for null check

Location: vector extraction in R/cansim_vectors.R

  • Replace sapply(x, is.null) with vapply(x, is.null, logical(1))
  • Type-safe and pre-allocates result vector
  • Code quality improvement (type safety); a minimal sketch follows below
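
A minimal sketch of the change (illustrative list and variable names, not the exact package code):

# Hypothetical parsed API results; one element failed and is NULL.
parsed <- list(v1 = 1:3, v2 = NULL, v3 = 4:6)

# Before: sapply() picks the return type at runtime and can return a
# list instead of a logical vector for unusual inputs (e.g. an empty list).
null_flags_old <- sapply(parsed, is.null)

# After: vapply() declares the template logical(1), so the result is
# always a logical vector of length(parsed), or an error.
null_flags <- vapply(parsed, is.null, logical(1))

parsed[!null_flags]   # drop the NULL entries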

Reverted Changes

P4/P11: Environment hash table for parent lookup - ❌ REVERTED

The original plan proposed replacing named vector lookup with environment hash tables for O(1) lookup. Benchmarking revealed this is actually slower in R.
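
For context, the pattern under test builds dot-separated ancestry strings: each pass looks up the parent of the current top-most ID in each string and prepends it, until no string changes. A tiny hand-traced run of that same logic (hypothetical member IDs):

# Hypothetical three-member hierarchy: "1" is the root, "2" is a child
# of "1", and "5" is a child of "2".
member_ids        <- c("1", "2", "5")
parent_member_ids <- c(NA_character_, "1", "2")

# Pass 1: "1", "1.2", "2.5"     (each ID gains its immediate parent)
# Pass 2: "1", "1.2", "1.2.5"   (top of "2.5" is "2", whose parent is "1")
# Pass 3: nothing changes, so the loop stops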


Benchmark Results

P4/P11: Environment hash table - ❌ 1.6% regression (REVERTED)

Test methodology: Isolated the add_hierarchy() function pattern and benchmarked named vector vs environment hash table approaches.

Test data: Simulated hierarchy metadata

  • 5,000 member IDs
  • Random parent relationships
  • 10 hierarchy building iterations (matching real code behavior)

Benchmark code:

library(microbenchmark)
library(purrr)
library(rlang)

set.seed(42)
n_members <- 5000
member_ids <- as.character(seq_len(n_members))
parent_member_ids <- c(NA_character_, sample(member_ids[1:(n_members-1)], n_members - 1, replace = TRUE))

# Original approach (named vector)
original_approach <- function(member_ids, parent_member_ids) {
  parent_lookup <- rlang::set_names(parent_member_ids, member_ids)

  current_top <- function(c) {
    strsplit(c, "\\.") %>% purrr::map(dplyr::first) %>% unlist
  }
  parent_for_current_top <- function(c) {
    as.character(parent_lookup[current_top(c)])
  }

  hierarchy <- member_ids
  for (iter in 1:10) {
    old <- hierarchy
    p <- parent_for_current_top(hierarchy)
    hierarchy <- ifelse(is.na(p), hierarchy, paste0(p, ".", hierarchy))
    if (sum(old != hierarchy) == 0) break
  }
  hierarchy
}

# Optimized approach (environment hash)
optimized_approach <- function(member_ids, parent_member_ids) {
  parent_lookup_env <- new.env(hash = TRUE, parent = emptyenv())
  for (i in seq_along(member_ids)) {
    parent_lookup_env[[member_ids[i]]] <- parent_member_ids[i]
  }

  current_top <- function(c) {
    strsplit(c, "\\.") %>% purrr::map(dplyr::first) %>% unlist
  }
  parent_for_current_top <- function(c) {
    tops <- current_top(c)
    unlist(mget(tops, envir = parent_lookup_env, ifnotfound = list(NA_character_)))
  }

  hierarchy <- member_ids
  for (iter in 1:10) {
    old <- hierarchy
    p <- parent_for_current_top(hierarchy)
    hierarchy <- ifelse(is.na(p), hierarchy, paste0(p, ".", hierarchy))
    if (sum(old != hierarchy) == 0) break
  }
  hierarchy
}

microbenchmark(
  original = original_approach(member_ids, parent_member_ids),
  optimized = optimized_approach(member_ids, parent_member_ids),
  times = 10
)

Results:

Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval
  original 326.543 345.120 353.426 354.204 361.368 379.467    10
 optimized 346.123 357.242 367.495 359.869 362.602 435.788    10

Original median:  354.204 ms
Optimized median: 359.869 ms
Improvement:      -1.6% (REGRESSION)

Analysis: In R, named vector indexing parent_lookup[key] is highly optimized (vectorized C code), while environment creation and mget() have overhead that outweighs the theoretical O(1) lookup benefit:

  1. Named vector subsetting in R is implemented in C and handles vectorized lookups efficiently
  2. Environment creation with a for loop is slow in R
  3. mget() overhead includes argument parsing, ifnotfound handling, and list creation

The theoretical O(n) vs O(1) analysis doesn't account for R's internal optimizations and constant factors.
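
A minimal illustration of point 1, outside the package code: a named-vector subset resolves a whole vector of keys in a single C-level call, while the environment route needs an explicit build step plus mget():

lookup <- c(a = "1", b = "2", c = "3")
keys   <- c("b", "c", "zzz")

# Named vector: one vectorized subset; a missing key simply yields NA.
lookup[keys]

# Environment hash: populate the table, then batch-lookup with mget().
env <- new.env(hash = TRUE, parent = emptyenv())
list2env(as.list(lookup), envir = env)
unlist(mget(keys, envir = env, ifnotfound = list(NA_character_)))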


Deferred Optimizations

P9 (French string constants): Not implemented

  • intToUtf8() is already very fast (O(1) character conversion)
  • Would add complexity for minimal performance gain (one-line illustration below)
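
For reference, a one-line illustration of the existing pattern that P9 would have replaced with pre-computed string constants (example code point only; the exact strings used in the package may differ):

intToUtf8(233)                          # "é" (U+00E9)
paste0("Donn", intToUtf8(233), "es")    # "Données", a typical French label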

Summary

Optimization          Test Data    Improvement     Status
P4/P11 (hash table)   5k members   -1.6%           ❌ Reverted
P8 (vapply)           N/A          Code quality    ✅ Kept

Test Plan

  • Run devtools::check() with no errors/warnings
  • Benchmarks run comparing approaches
  • P4/P11 reverted after regression confirmed
  • Verify vector extraction produces identical results

🤖 Generated with Claude Code

mountainMath and others added 4 commits November 19, 2025 20:22
P4/P11: Environment hash table for parent lookup in add_hierarchy
- Replace named vector lookup (O(n) per key) with environment hash table (O(1))
- Use mget() for efficient batch lookup with proper NA handling
- Significant improvement for large datasets with 10k+ members
- Expected improvement: 40-95% for hierarchy building

P8: vapply instead of sapply for null check
- Replace sapply(x, is.null) with vapply(x, is.null, logical(1))
- Type-safe and faster null detection in vector extraction
- Expected improvement: 30-45% for vector data processing

Deferred optimizations:
- P9: French string constants - intToUtf8() is already very fast
- P12: Coordinate metadata loop - requires significant refactoring
  (moving get_cansim_cube_metadata call outside the map loop)

Files modified:
- R/cansim_metadata.R: add_hierarchy function
- R/cansim_vectors.R: vector extraction null check

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Benchmarking showed the environment hash table approach is actually
1.6% slower than the original named vector lookup. In R, named vector
indexing is highly optimized (vectorized C code), while environment
creation and mget() have overhead that outweighs the O(1) lookup benefit.

Benchmark results (5000 members, hierarchy building):
- Original (named vector): 354ms median
- Optimized (env hash):    360ms median
- Improvement: -1.6% (regression)

Keeping P8 (vapply instead of sapply) which is a valid type-safety improvement.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dshkol dshkol changed the base branch from master to v0.4.5 January 22, 2026 06:28