Two-phase spanning tree transclosure: 37% faster, 33% less memory#135
Merged
Conversation
Replace the single-pass BFS+union-find with a two-phase approach:

Phase 1: Fast BFS discovery using a maximum-weight spanning tree of sequence pairs (N-1 edges instead of N*(N-1)/2). Only follows spanning-tree alignment edges, discovering positions without collecting overlap entries. Includes orphan recovery via repeated linear scans to find positions only reachable through non-tree edges.

Phase 2: Linear scan of all iitree intervals for union-find. Uses block-level sampling (check 3 positions per block) to skip blocks whose source and target are already in the same equivalence class. Iterates verification passes until convergence to catch sampling misses.

Also adds for_each_interval() to the IntervalTree trait for linear iteration over all stored intervals.

Benchmarked on the 466-sequence MHC/C4 region (HPRCv2):
- Wall clock: 9m00s -> 5m41s (37% faster)
- User time: 1969s -> 1293s (34% less CPU)
- Peak RSS: 12.4GB -> 8.3GB (33% less memory)
- Output identical (2727 nodes, 52047bp, 3686 links)
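The Phase 1 spanning-tree construction can be sketched as Kruskal's algorithm over alignment-weighted sequence pairs. This is an illustrative sketch, not the crate's actual API: `Dsu` and `max_weight_spanning_tree` are hypothetical names, and the real code may build the tree differently.

```rust
/// Minimal union-find used only to build the tree (hypothetical helper).
struct Dsu {
    parent: Vec<usize>,
}

impl Dsu {
    fn new(n: usize) -> Self {
        Dsu { parent: (0..n).collect() }
    }
    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            self.parent[x] = self.parent[self.parent[x]]; // path halving
            x = self.parent[x];
        }
        x
    }
    fn union(&mut self, a: usize, b: usize) -> bool {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        self.parent[ra.max(rb)] = ra.min(rb);
        true
    }
}

/// Build a maximum-weight spanning tree over sequence pairs.
/// `edges` holds (weight, seq_a, seq_b); returns the N-1 tree edges,
/// chosen heaviest-first so BFS follows the strongest alignments.
fn max_weight_spanning_tree(
    n_seqs: usize,
    mut edges: Vec<(u64, usize, usize)>,
) -> Vec<(usize, usize)> {
    edges.sort_unstable_by(|a, b| b.0.cmp(&a.0)); // heaviest first
    let mut dsu = Dsu::new(n_seqs);
    let mut tree = Vec::with_capacity(n_seqs.saturating_sub(1));
    for (_weight, a, b) in edges {
        if dsu.union(a, b) {
            tree.push((a, b));
            if tree.len() == n_seqs - 1 {
                break; // spanning tree complete
            }
        }
    }
    tree
}
```

With N sequences, BFS discovery then traverses only these N-1 edges instead of all N*(N-1)/2 pairwise alignments.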
… 1.3GB

Replace the linear scan of the pre-collected all_intervals Vec with per-sequence iitree overlap queries for both orphan recovery and phase 2 union-find. Eliminates the 58M-entry Vec allocation (~1.4GB) while producing identical output. For workloads with multiple smaller components, this also reduces scan cost proportionally to component size.

Benchmark (466-seq MHC, same workload):
- Wall clock: 5m41s -> 5m57s (comparable, within noise)
- Peak RSS: 8.3GB -> 7.0GB (1.3GB saved)
- Output: identical (2727 nodes, 52047bp, 3686 links)
- Remove the dead handle_range function and the OverlapAtomicQueue type alias (replaced by inline discovery logic in explore_overlaps_discovery)
- Remove unused Mutex import
- Remove duplicate HashMap import in compute_spanning_tree
- Fix unused-variable warning (_weight in the spanning tree loop)
- Extract find_component_sequences helper (was duplicated 2x)
- Extract unite_block and block_needs_unite helpers (was duplicated 3x)
- Remove redundant n_seqs binding in orphan recovery
Replace 3-point block sampling with binary subdivision sampling (check at offsets 0, L-1, L/2, L/4, 3L/4, ..., down to a step size of 32). This catches more boundary cases than 3-point sampling, reducing verification-round recoveries from ~10K to ~2.8K blocks. The verification pass remains exhaustive for correctness. Extract a block_can_skip helper containing the logarithmic sampling logic.
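The subdivision schedule above might look like the following sketch. The function name is hypothetical, and the exact stopping step (here, subdividing while the step exceeds 32) is an assumption about the commit's "down to step size 32" phrasing.

```rust
/// Generate sampling offsets for a block of length `len`:
/// endpoints 0 and len-1 first, then recursive midpoints
/// (len/2, then len/4 and 3len/4, ...) until the subdivision
/// step is no longer above 32.
fn sample_offsets(len: usize) -> Vec<usize> {
    let mut offsets = vec![0];
    if len > 1 {
        offsets.push(len - 1);
    }
    let mut step = len;
    while step > 32 {
        // midpoints of the current subdivision level
        let mut off = step / 2;
        while off < len {
            offsets.push(off);
            off += step;
        }
        step /= 2;
    }
    offsets
}
```

For L = 128 this checks offsets 0, 127, 64, 32, 96: five probes instead of an exhaustive 128, but with logarithmically spaced coverage that catches block-boundary mismatches a fixed 3-point scheme misses.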
Replace DisjointSetsAsm (128-bit CAS-based union-find) with simple label propagation for phase 2 equivalence-class computation.

Algorithm: hook (parent[max] = min) + pointer jump, iterated until convergence. Uses AtomicU32 with Relaxed ordering (free on x86) instead of CMPXCHG16B. Sequential access within blocks enables hardware prefetch.

Converges in 4 rounds on the MHC/C4 data:
- Round 1: 22.2B hooks (main work)
- Round 2: 930K hooks
- Round 3: 5.3K hooks
- Round 4: 0 hooks (converged)

Benchmark: 5m54s wall clock (same as previous), correct output. Simpler code: eliminates the sampling heuristics and exhaustive verification passes. Foundation for block-level optimizations.
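A minimal single-threaded sketch of the hook + pointer-jump scheme described above (function names mirror the commit message but the bodies are assumptions; the real code runs hooks in parallel, which Relaxed atomics tolerate because labels only ever decrease):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

/// Hook: unite two positions by pointing the larger label at the
/// smaller one (parent[max] = min). Returns true if labels differed,
/// i.e. this counted as a "hook".
fn hook(labels: &[AtomicU32], a: usize, b: usize) -> bool {
    let (la, lb) = (labels[a].load(Relaxed), labels[b].load(Relaxed));
    if la == lb {
        return false;
    }
    let (lo, hi) = (la.min(lb), la.max(lb));
    // fetch_min keeps the write monotone, so Relaxed ordering suffices
    labels[hi as usize].fetch_min(lo, Relaxed);
    true
}

/// Pointer jump: replace each label by its label's label, roughly
/// halving the depth of every label chain. Returns how many moved.
fn pointer_jump(labels: &[AtomicU32]) -> usize {
    let mut moved = 0;
    for i in 0..labels.len() {
        let l = labels[i].load(Relaxed) as usize;
        let ll = labels[l].load(Relaxed);
        if l as u32 != ll {
            labels[i].store(ll, Relaxed);
            moved += 1;
        }
    }
    moved
}
```

Iterating hook rounds and pointer-jump passes until no label moves leaves every position labeled with the minimum member of its equivalence class, with no CMPXCHG16B anywhere on the hot path.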
Add hook_range(), which operates on consecutive rank indices directly, avoiding per-position curr_bv and rank_table lookups. Since positions within each sequence get consecutive ranks, a block [a, a+L) -> [b, b+L) can be processed as hook_range(rank[a], rank[b], L): a tight loop of array reads and conditional writes with sequential access.

Round 1 drops from 207s to 41s. Total transclosure time drops from 5m54s to 3m04s. Output identical to baseline (2727 nodes, 52047bp, 3686 links).

- Wall clock: 9m00s baseline -> 3m04s (2.9x speedup)
- Peak RSS: 12.4GB -> 6.5GB (48% less memory)
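The tight loop might look like this sketch (the signature and body are assumptions based on the commit message; the real function presumably lives alongside the label-propagation code):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

/// Hook an aligned block [a, a+len) -> [b, b+len) of consecutive rank
/// indices. Pure array reads and conditional writes, no per-position
/// bitvector or rank-table lookups. Returns the number of hooks done.
fn hook_range(labels: &[AtomicU32], a: u32, b: u32, len: u32) -> u64 {
    let mut hooks = 0;
    for i in 0..len {
        let (x, y) = ((a + i) as usize, (b + i) as usize);
        let (lx, ly) = (labels[x].load(Relaxed), labels[y].load(Relaxed));
        if lx == ly {
            continue; // already in the same class
        }
        // hook: point the larger label at the smaller (parent[max] = min)
        if lx < ly {
            labels[y].fetch_min(lx, Relaxed);
        } else {
            labels[x].fetch_min(ly, Relaxed);
        }
        hooks += 1;
    }
    hooks
}
```

Both index streams advance by one each iteration, so the hardware prefetcher keeps the label array hot, which is where the 207s -> 41s round-1 improvement comes from.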
After rounds with few hooks (<10K), do up to 20 pointer-jump passes to fully flatten the label tree. This eliminates the 3-4 extra rounds that each previously cost ~10s of iitree traversal with near-zero hooks. Round count: 6 -> 3. Total: 3m04s -> 2m17s. Output identical (2727 nodes, 52047bp, 3686 links).
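The early-flattening heuristic is simple enough to sketch; the thresholds come from the commit message, while the function name and the `pointer_jump_pass` callback (assumed to return how many labels moved) are hypothetical:

```rust
/// Hooks below this count mark a round as "nearly converged" (from the
/// commit message: <10K hooks).
const FLATTEN_HOOK_THRESHOLD: u64 = 10_000;
/// Upper bound on flattening passes (from the commit message: up to 20).
const MAX_FLATTEN_PASSES: usize = 20;

/// After a cheap round, flatten the label tree with repeated pointer
/// jumps instead of paying for more iitree-traversal rounds.
fn maybe_flatten(hooks_this_round: u64, mut pointer_jump_pass: impl FnMut() -> usize) {
    if hooks_this_round < FLATTEN_HOOK_THRESHOLD {
        for _ in 0..MAX_FLATTEN_PASSES {
            if pointer_jump_pass() == 0 {
                break; // label tree fully flat
            }
        }
    }
}
```

Pointer-jump passes only touch the label array, so trading them for whole iitree-traversal rounds is what removes the ~10s-per-round tail.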
Force-pushed from 1c0427d to 982916e
Use a flat Vec<Vec<usize>> adjacency list instead of HashSet<(usize, usize)> for spanning-tree edge lookup. Tree nodes have degree ~2-3, so a linear scan of a 2-3 element vec is faster than hashing a (usize, usize) tuple. This eliminates one hash computation per iitree callback during BFS.
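The lookup structure can be sketched as below (the `SpanningTreeAdj` name and methods are illustrative, not the crate's actual types):

```rust
/// Flat adjacency list for spanning-tree edge membership tests.
struct SpanningTreeAdj {
    adj: Vec<Vec<usize>>, // adj[seq] = neighbors in the spanning tree
}

impl SpanningTreeAdj {
    fn new(n_seqs: usize, tree_edges: &[(usize, usize)]) -> Self {
        let mut adj = vec![Vec::new(); n_seqs];
        for &(a, b) in tree_edges {
            adj[a].push(b);
            adj[b].push(a);
        }
        SpanningTreeAdj { adj }
    }

    /// Replaces a HashSet<(usize, usize)>::contains call: no hashing,
    /// just a sequential scan of a ~2-3 element vec.
    fn is_tree_edge(&self, a: usize, b: usize) -> bool {
        self.adj[a].iter().any(|&x| x == b)
    }
}
```

With degree ~2-3 the scan is a couple of cache-resident comparisons, cheaper than hashing a 16-byte tuple on every iitree callback.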
Remove the unite_block, block_needs_unite, and block_can_skip helpers and the DisjointSetsAsm import; all are replaced by the LabelProp label-propagation approach. Also guard compute_spanning_tree against n_seqs <= 1.
Summary
- Adds for_each_interval() to the IntervalTree trait for linear iteration
- Benchmark (466-sequence MHC/C4 region, HPRCv2)
Key insight from profiling
The BFS made 175.9M interval callbacks, of which 99.97% were redundant: their targets had already been discovered. Each position had ~930 overlapping intervals (465 pairs x 2 directions), but only ~1 per position discovered new positions. The spanning tree reduces this to ~6 callbacks per position for discovery, with a separate linear scan for union-find that skips redundant blocks via sampling.
Test plan