bench(wam-haskell): IntSet 3-way macro — algorithmic win didn't materialise at depth=10 #1698

Merged

s243a merged 2 commits into main from feat/wam-haskell-intset-macro-bench-and-memory on Apr 29, 2026
Conversation

s243a (Owner) commented Apr 29, 2026

Summary

Closes the IntSet visited arc with measured numbers: the design's predicted speedup did not materialise at the workload's actual recursion depth. That is a real finding worth reporting.

  • Extends tests/benchmarks/wam_effective_distance_macro_bench.pl to a 3-way comparison (unlowered / Phase G lowered / Phase H intset).
  • Adds WAM_EFF_DIST_BENCH_SCALE env var so the bench can run against 1k or 10k facts.
  • Updates WAM_PERF_OPTIMIZATION_LOG.md Phase H final entry with the measured numbers and an honest reading of why the algorithmic improvement didn't pay off.
  • Memory files (outside repo) refreshed with the IntSet arc completion notes including this finding.

Measured numbers

10k scale (462 tuples, max_depth=10, 6 trials with rotating order):

| variant | mean query_ms |
| --- | --- |
| unlowered (no directives) | 931.5 |
| lowered (Phase G mode only) | 861.0 |
| intset (Phase G + Phase H) | 957.5 |

| comparison | speedup |
| --- | --- |
| lowered vs unlowered | 1.082× (Phase G constant-factor win, as expected) |
| intset vs lowered | 0.899× (IntSet ~10% SLOWER than list) |
| intset vs unlowered | 0.973× (IntSet barely matches the slow path) |

tuple_count=462 matches across all three — correctness preserved across both the Phase G lowering and the Phase H IntSet path.
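The three ratios follow directly from the mean timings; a quick sanity check (values hard-coded from the table above):

```python
# Mean query times in ms, copied from the 10k-scale table above.
unlowered_ms = 931.5
lowered_ms = 861.0
intset_ms = 957.5

def speedup(baseline_ms, variant_ms):
    """Ratio > 1 means the variant is faster than the baseline."""
    return round(baseline_ms / variant_ms, 3)

print(speedup(unlowered_ms, lowered_ms))  # → 1.082 (lowered vs unlowered)
print(speedup(lowered_ms, intset_ms))     # → 0.899 (intset vs lowered)
print(speedup(unlowered_ms, intset_ms))   # → 0.973 (intset vs unlowered)
```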

Why the algorithmic O(log N) didn't pay off

The design predicted ~1.5–3× speedup from O(N) → O(log N) at max_depth=10. Reality: the Patricia-trie constant factors exceed the linear walk cost on a ~10-element list. Specifically:

  1. Allocation per insert: IS.insert allocates fresh trie nodes on every insert (the structure is purely functional), whereas extending the list with [X|V] allocates a single cons cell. For shallow visited sets, this allocation cost dominates.
  2. Cache locality: a small, freshly consed list packs into one or two cache lines; a 4-element IntSet trie scatters across multiple heap nodes.
  3. Node traversal overhead: even IS.member on a 10-element IntSet does ~3-4 tag-checks and comparisons in a Patricia trie, vs at most 10 simple == checks in the list. Constant factors are surprisingly close at this size.
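A back-of-the-envelope cost model makes point 3 concrete. The unit costs below are assumptions for illustration only, not measurements of the Haskell runtime:

```python
import math

# Assumed unit costs (illustrative, not measured):
#   list membership: ~1 unit per element walked (pointer follow + ==)
#   trie membership: ~3 units per branch node (tag check + prefix test + branch)
LIST_UNIT = 1
TRIE_UNIT = 3

def list_member_cost(n):
    return n * LIST_UNIT  # worst case: walk the whole visited list

def trie_member_cost(n):
    return max(1, math.ceil(math.log2(n))) * TRIE_UNIT  # ~log2(n) nodes

print(list_member_cost(10), trie_member_cost(10))  # → 10 12: trie loses at n=10

# First n where the trie model is strictly cheaper than the list model.
n = 1
while trie_member_cost(n) >= list_member_cost(n):
    n += 1
print(n)  # → 13 under these assumed constants
```

Under these toy constants the crossover sits in the low tens; the real crossover also pays the allocation and cache costs of points 1-2, which is why the crossover is hedged at much deeper visited sets.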

The IntSet's win likely materialises only at deeper visited sets (max_depth ≥ 50); the canonical effective-distance workload doesn't reach there. Phase G's not_member_list lowering (skipping the put_structure + builtin_call dispatch + heap term allocation) is the actual macro win on this workload.

Honest takeaways

  1. Algorithmic wins aren't free at small N. The design's "1.5-3× expected" was asymptotic reasoning that didn't account for IntSet's constant factors (allocation, cache locality, node traversal) at this workload's small visited sets.
  2. The infrastructure is reusable. VSet, the directive, and the codegen paths can host other set representations (sorted array, small bitmap) that might beat IntSet at small N. Filed as future exploration.
  3. Phase G is the real macro win. The constant-factor dispatch reduction is what speeds up the workload. Phase H's algorithmic pivot was the wrong move for max_depth=10. The implementation remains correct and opt-in: users at typical depth should leave the directive off.

Changes

tests/benchmarks/wam_effective_distance_macro_bench.pl (~90 LOC delta)

  • New intset variant in generate_project/2 that asserts both the mode declarations AND :- visited_set(category_ancestor/4, 4).
  • Reworked main/0 to run all three variants twice with rotating order, report 3-way comparison with three speedup ratios.
  • New WAM_EFF_DIST_BENCH_SCALE env var routes facts_path/1 to data/benchmark/<scale>/facts.pl. Defaults to 1k for fast smoke runs; set WAM_EFF_DIST_BENCH_SCALE=10k for the macro measurement.
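The rotating-order scheme above can be sketched as follows (the real bench is Prolog; `trial_order` is a hypothetical name used only for illustration):

```python
# The three bench variants, in the order listed above.
VARIANTS = ["unlowered", "lowered", "intset"]

def trial_order(n_rounds):
    """Rotate the starting variant each round so no single variant
    always runs first and soaks up warm-up / cache effects."""
    return [
        VARIANTS[r % len(VARIANTS):] + VARIANTS[:r % len(VARIANTS)]
        for r in range(n_rounds)
    ]

orders = trial_order(2)  # 2 rounds x 3 variants = the 6 trials above
print(orders)
# → [['unlowered', 'lowered', 'intset'], ['lowered', 'intset', 'unlowered']]
```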

docs/design/WAM_PERF_OPTIMIZATION_LOG.md

Phase H final entry's "Macro benchmark — TODO" replaced with the measured 10k results, the per-trial numbers, the constant-factor analysis, and the three honest takeaways.

Memory updates (outside repo)

Verification

  • All 5 lowering / runtime / state-analysis test suites stay green.
  • Cabal e2e test still passes.
  • The benchmark itself runs end-to-end at both 1k and 10k.

Test plan

  • Re-run the benchmark at multiple scales to confirm the directionality holds: WAM_EFF_DIST_BENCH_SCALE=10k swipl -t halt tests/benchmarks/wam_effective_distance_macro_bench.pl
  • (Future) Benchmark at max_depth=50 or higher to find the IntSet crossover point.
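The depth sweep in the second bullet can be prototyped with a toy model before touching the Haskell runtime. Everything here (graph shape, helper names) is illustrative, and only correctness is asserted, since timings are machine-dependent:

```python
def build_chain(n):
    # A parent -> child chain 0 -> 1 -> ... -> n: the worst case for a
    # linear visited list, which grows one entry per recursion level.
    return {i: [i + 1] for i in range(n)}

def count_reachable(graph, start, max_depth, make_visited, member, add):
    """DFS with a pluggable visited-set representation."""
    visited = make_visited()
    def go(node, depth):
        if depth > max_depth or member(visited, node):
            return 0
        add(visited, node)
        return 1 + sum(go(child, depth + 1) for child in graph.get(node, []))
    return go(start, 0)

graph = build_chain(100)
# list-backed visited (linear member) vs set-backed visited (hashed member)
n_list = count_reachable(graph, 0, 50, list, lambda v, x: x in v, list.append)
n_set = count_reachable(graph, 0, 50, set, lambda v, x: x in v, set.add)
print(n_list, n_set)  # → 51 51: both representations agree on the answer
```

Timing the two closures over a range of max_depth values would locate the crossover for this model; the real crossover still has to be measured on the WAM runtime itself.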

Refs

🤖 Generated with Claude Code

s243a and others added 2 commits April 28, 2026 20:46
bench(wam-haskell): IntSet 3-way macro — algorithmic win didn't materialise at depth=10

Extends wam_effective_distance_macro_bench.pl to a 3-way comparison
(unlowered / Phase G lowered / Phase H IntSet) and measures all
three on the effective-distance workload. Adds WAM_EFF_DIST_BENCH_SCALE
env var so the bench can run against 1k or 10k facts.

Measured at 10k scale (462 tuples, max_depth=10, 6 trials with
rotating order):

  unlowered (no directives)       mean 931.5 ms
  lowered (Phase G mode only)     mean 861.0 ms
  intset (Phase G + Phase H)      mean 957.5 ms

  lowered vs unlowered:  1.082x  (Phase G constant-factor win, as expected)
  intset vs lowered:     0.899x  (IntSet ~10% SLOWER than list)
  intset vs unlowered:   0.973x  (IntSet barely matches the slow path)

tuple_count=462 matches across all three (correctness preserved).

The design predicted 1.5-3x speedup from O(N) -> O(log N) at
max_depth=10. Reality: the Patricia-trie constant factors (per-insert
allocation, cache scattering, node traversal) exceed the linear walk
cost on a ~10-element list. The algorithmic improvement doesn't
amortise its constant factors at this size.

Phase H appendix in WAM_PERF_OPTIMIZATION_LOG.md captures the
measurement, the per-trial numbers, and an honest "the algorithmic
win didn't materialise here" reading. Three takeaways:

1. Algorithmic wins aren't free at small N — the design's "1.5-3x
   expected" was asymptotic reasoning that didn't account for IntSet
   constant factors.
2. The infrastructure is reusable — VSet, the directive, and the
   codegen paths can host other set representations (sorted array,
   bitmap) that might win at small N.
3. Phase G is the real macro win on this workload. Phase H's
   algorithmic pivot was the wrong move for max_depth=10.

The IntSet implementation remains correct and opt-in: users with
deep visited sets may benefit (untested, but plausible at
max_depth>=50); users at typical depth should leave the directive
off.

Closes task #194 (IntSet macro bench + memory cleanup). Memory
files updated outside the repo (project_wam_haskell_intset_visited.md
gets the honest finding; project_wam_haskell_mode_analysis.md
extended with the cross-arc summary; MEMORY.md and todo.md
refreshed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…et-macro-bench-and-memory

# Conflicts:
#	docs/design/WAM_PERF_OPTIMIZATION_LOG.md
s243a merged commit 0871172 into main on Apr 29, 2026. 4 checks passed.