
Two-phase spanning tree transclosure: 37% faster, 33% less memory #135

Merged

ekg merged 11 commits into rust-2 from transclosure-optimize on Mar 25, 2026

Conversation

@ekg ekg (Collaborator) commented Mar 23, 2026

Summary

  • Replace single-pass BFS+union-find with a two-phase approach using a maximum-weight spanning tree of sequence pairs
  • Phase 1: Fast BFS discovery following only spanning tree edges (N-1 pairs vs N*(N-1)/2), with flood-fill orphan recovery for positions only reachable via non-tree edges
  • Phase 2: Linear scan of all iitree intervals for union-find with block-level sampling (check 3 positions per block to skip redundant unions), iterating verification passes until convergence
  • Add for_each_interval() to IntervalTree trait for linear iteration
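The spanning tree in Phase 1 can be built with a standard Kruskal pass over the weighted sequence pairs. The sketch below is illustrative only, with assumed types (weights as `u64`, a toy recursive union-find), not the PR's actual code:

```rust
// Hypothetical sketch: maximum-weight spanning tree over sequence pairs
// via Kruskal's algorithm. Edge weights (e.g. aligned bases shared by a
// pair) and this tiny union-find are assumptions for illustration.

fn find(parent: &mut Vec<usize>, x: usize) -> usize {
    if parent[x] != x {
        let next = parent[x];
        let root = find(parent, next);
        parent[x] = root; // path compression
    }
    parent[x]
}

/// Returns the N-1 tree edges chosen from the full pair list.
fn max_weight_spanning_tree(
    n_seqs: usize,
    mut edges: Vec<(u64, usize, usize)>, // (weight, seq_a, seq_b)
) -> Vec<(usize, usize)> {
    // Sort by weight descending so the heaviest pairs are kept first.
    edges.sort_by(|a, b| b.0.cmp(&a.0));
    let mut parent: Vec<usize> = (0..n_seqs).collect();
    let mut tree = Vec::with_capacity(n_seqs.saturating_sub(1));
    for (_w, a, b) in edges {
        let (ra, rb) = (find(&mut parent, a), find(&mut parent, b));
        if ra != rb {
            parent[ra] = rb;
            tree.push((a, b));
            if tree.len() + 1 == n_seqs {
                break;
            }
        }
    }
    tree
}

fn main() {
    // 4 sequences, all 6 pairs weighted; the tree keeps the 3 heaviest
    // edges that connect new components: N-1 = 3 instead of N*(N-1)/2 = 6.
    let edges = vec![(90, 0, 1), (80, 1, 2), (70, 0, 2), (60, 2, 3), (10, 0, 3), (5, 1, 3)];
    let tree = max_weight_spanning_tree(4, edges);
    assert_eq!(tree, vec![(0, 1), (1, 2), (2, 3)]);
    println!("{:?}", tree);
}
```

Maximizing edge weight keeps the best-supported alignments on the discovery path, so BFS over tree edges alone reaches most positions.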

Benchmark (466-sequence MHC/C4 region, HPRCv2)

| Metric     | Baseline            | Two-phase | Change          |
|------------|---------------------|-----------|-----------------|
| Wall clock | 9m00s               | 5m41s     | 37% faster      |
| User time  | 1969s               | 1293s     | 34% less CPU    |
| Peak RSS   | 12.4 GB             | 8.3 GB    | 33% less memory |
| Output     | 2727 nodes, 52047bp | identical |                 |

Key insight from profiling

The BFS had 175.9M interval callbacks, of which 99.97% were redundant — targets already discovered. Each position had ~930 overlapping intervals (465 pairs x 2 directions), but only ~1 per position discovered new positions. The spanning tree reduces this to ~6 callbacks per position for discovery, with a separate linear scan for union-find that skips redundant blocks via sampling.

Test plan

  • Output identical to baseline (2727 nodes, 52047bp, 3686 links)
  • Orphan recovery converges (2 rounds for MHC data, finds 26.9M positions)
  • Phase 2 verification converges (16,338 blocks recovered)
  • Test on other genomic regions and species

ekg added 7 commits March 23, 2026 08:47
Replace single-pass BFS+union-find with a two-phase approach:

Phase 1: Fast BFS discovery using a maximum-weight spanning tree of
sequence pairs (N-1 edges instead of N*(N-1)/2). Only follows spanning
tree alignment edges, discovering positions without collecting overlap
entries. Includes orphan recovery via repeated linear scans to find
positions only reachable through non-tree edges.

Phase 2: Linear scan of all iitree intervals for union-find. Uses
block-level sampling (check 3 positions per block) to skip blocks where
source and target are already in the same equivalence class. Iterates
verification passes until convergence to catch sampling misses.

Also adds for_each_interval() to the IntervalTree trait for linear
iteration over all stored intervals.
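A minimal sketch of the kind of linear-iteration hook described above, with an assumed signature over a toy flat store; the real trait's interval and metadata types will differ:

```rust
// Sketch of adding a for_each_interval() method to an IntervalTree trait.
// The (start, end, payload) tuple shape here is an assumption.

trait IntervalTree {
    /// Visit every stored interval once, in storage order.
    fn for_each_interval<F: FnMut(u64, u64, u64)>(&self, f: F);
}

/// Toy backing store: (start, end, payload) triples.
struct FlatTree {
    intervals: Vec<(u64, u64, u64)>,
}

impl IntervalTree for FlatTree {
    fn for_each_interval<F: FnMut(u64, u64, u64)>(&self, mut f: F) {
        // Plain linear pass: no overlap query, no tree traversal.
        for &(s, e, p) in &self.intervals {
            f(s, e, p);
        }
    }
}

fn total_covered(t: &impl IntervalTree) -> u64 {
    let mut total = 0;
    t.for_each_interval(|s, e, _| total += e - s);
    total
}

fn main() {
    let t = FlatTree { intervals: vec![(0, 10, 1), (5, 8, 2)] };
    assert_eq!(total_covered(&t), 13);
}
```

Linear iteration is what lets Phase 2 scan all intervals exactly once instead of issuing overlap queries per position.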

Benchmarked on 466-sequence MHC/C4 region (HPRCv2):
- Wall clock: 9m00s -> 5m41s (37% faster)
- User time: 1969s -> 1293s (34% less CPU)
- Peak RSS: 12.4GB -> 8.3GB (33% less memory)
- Output identical (2727 nodes, 52047bp, 3686 links)

Replace linear scan of pre-collected all_intervals Vec with per-sequence
iitree overlap queries for both orphan recovery and phase 2 union-find.
Eliminates 58M-entry Vec allocation (~1.4GB) while maintaining identical
output. For workloads with multiple smaller components, this also reduces
scan cost proportionally to component size.

Benchmark (466-seq MHC, same workload):
- Wall clock: 5m41s -> 5m57s (comparable, within noise)
- Peak RSS: 8.3GB -> 7.0GB (1.3GB saved)
- Output: identical (2727 nodes, 52047bp, 3686 links)
- Remove dead handle_range function and OverlapAtomicQueue type alias
  (replaced by inline discovery logic in explore_overlaps_discovery)
- Remove unused Mutex import
- Remove duplicate HashMap import in compute_spanning_tree
- Fix unused variable warning (_weight in spanning tree loop)
- Extract find_component_sequences helper (was duplicated 2x)
- Extract unite_block and block_needs_unite helpers (was duplicated 3x)
- Remove redundant n_seqs binding in orphan recovery
Replace 3-point block sampling with binary subdivision sampling (check at
offsets 0, L-1, L/2, L/4, 3L/4, ... down to step size 32). Catches more
boundary cases than 3-point, reducing verification round recoveries from
~10K to ~2.8K blocks. Verification pass remains exhaustive for correctness.

Extract block_can_skip helper with the logarithmic sampling logic.
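The offset schedule this describes (endpoints first, then midpoints at halving step sizes down to 32) can be sketched as follows; the exact cutoff and ordering are assumptions based on the commit text:

```rust
// Hypothetical reconstruction of the binary-subdivision sample offsets:
// 0 and L-1 first, then the odd multiples of each halving step
// (L/2; L/4, 3L/4; L/8, 3L/8, ...) until the step drops below 32.

fn sample_offsets(len: usize) -> Vec<usize> {
    let mut offs = vec![0];
    if len > 1 {
        offs.push(len - 1);
    }
    let mut step = len / 2;
    while step >= 32 {
        // Odd multiples of `step` are the new midpoints at this level;
        // even multiples were already emitted at a coarser step.
        let mut o = step;
        while o < len {
            offs.push(o);
            o += 2 * step;
        }
        step /= 2;
    }
    offs
}

fn main() {
    // A 64bp block degenerates to the old 3-point check: {0, 63, 32}.
    assert_eq!(sample_offsets(64), vec![0, 63, 32]);
    // A 256bp block gets 9 probes at progressively finer spacing.
    println!("{:?}", sample_offsets(256));
}
```

Short blocks still get the cheap 3-point check, while long blocks gain the interior probes that catch the boundary cases 3-point sampling missed.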
Replace DisjointSetsAsm (128-bit CAS-based union-find) with simple
label propagation for phase 2 equivalence class computation.

Algorithm: hook (parent[max] = min) + pointer jump, iterated until
convergence. Uses AtomicU32 with Relaxed ordering (free on x86) instead
of CMPXCHG16B. Sequential access within blocks enables hardware prefetch.

Converges in 4 rounds for MHC/C4 data:
  Round 1: 22.2B hooks (main work)
  Round 2: 930K hooks
  Round 3: 5.3K hooks
  Round 4: 0 hooks (converged)

Benchmark: 5m54s wall clock (same as previous), correct output.
Simpler code — eliminates sampling heuristics and exhaustive
verification passes. Foundation for block-level optimizations.
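The hook + pointer-jump loop described above can be sketched like this; the real phase-2 code operates on rank-indexed position labels, while this toy just unites a few indices:

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

// Minimal sketch of label propagation: hook (parent[max] = min) plus
// pointer jumping, iterated until a round performs zero hooks.

/// Hook: point the larger label at the smaller one. Returns true if changed.
fn hook(parent: &[AtomicU32], a: u32, b: u32) -> bool {
    let (ra, rb) = (parent[a as usize].load(Relaxed), parent[b as usize].load(Relaxed));
    if ra == rb {
        return false;
    }
    let (lo, hi) = if ra < rb { (ra, rb) } else { (rb, ra) };
    // Relaxed is fine here: convergence comes from iterating rounds, so a
    // lost update under contention just gets redone in a later round.
    parent[hi as usize].store(lo, Relaxed);
    true
}

/// Pointer jump: replace each label with its label's label, halving depth.
fn jump(parent: &[AtomicU32]) {
    for i in 0..parent.len() {
        let p = parent[i].load(Relaxed) as usize;
        let gp = parent[p].load(Relaxed);
        parent[i].store(gp, Relaxed);
    }
}

fn main() {
    let parent: Vec<AtomicU32> = (0..6u32).map(AtomicU32::new).collect();
    let pairs = [(0u32, 3u32), (3, 5), (1, 2)];
    loop {
        let mut hooks = 0;
        for &(a, b) in &pairs {
            if hook(&parent, a, b) {
                hooks += 1;
            }
        }
        jump(&parent);
        if hooks == 0 {
            break; // converged: no round changed anything
        }
    }
    // {0,3,5} share label 0, {1,2} share label 1, 4 stays alone.
    let labels: Vec<u32> = parent.iter().map(|p| p.load(Relaxed)).collect();
    assert_eq!(labels, vec![0, 1, 1, 0, 4, 0]);
}
```

The appeal over CAS-based union-find is exactly what the commit states: plain Relaxed loads and stores instead of CMPXCHG16B, and sequential block access that the prefetcher can follow.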
Add hook_range() that operates on consecutive rank indices directly,
avoiding per-position curr_bv and rank_table lookups. Since positions
within each sequence get consecutive ranks, a block [a,a+L)->[b,b+L)
can be processed as hook_range(rank[a], rank[b], L) -- a tight loop
of array reads and conditional writes with sequential access.
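A sketch of that tight loop, simplified to a non-atomic label array (the real code uses atomics as in the label-propagation commit; the signature is assumed from the description):

```rust
// Hypothetical hook_range: because ranks within a sequence are
// consecutive, a block [a,a+len) -> [b,b+len) hooks rank pairs in one
// tight loop over the label array, with no per-position lookups.

fn hook_range(parent: &mut [u32], ra: u32, rb: u32, len: u32) -> u64 {
    let mut hooks = 0u64;
    for i in 0..len {
        let (x, y) = ((ra + i) as usize, (rb + i) as usize);
        // Two sequential array reads; the hardware prefetcher keeps up.
        let (la, lb) = (parent[x], parent[y]);
        if la != lb {
            // Hook the larger label onto the smaller.
            let (lo, hi) = if la < lb { (la, lb) } else { (lb, la) };
            parent[hi as usize] = lo;
            hooks += 1;
        }
    }
    hooks
}

fn main() {
    // Unite ranks [0,4) with ranks [4,8): four hooks, labels 4..8 -> 0..4.
    let mut parent: Vec<u32> = (0..8).collect();
    let hooks = hook_range(&mut parent, 0, 4, 4);
    assert_eq!(hooks, 4);
    assert_eq!(&parent[4..], &[0, 1, 2, 3]);
}
```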

Round 1 drops from 207s to 41s. Total transclosure from 5m54s to 3m04s.
Output identical to baseline (2727 nodes, 52047bp, 3686 links).

Wall clock: 9m00s baseline -> 3m04s (2.9x speedup)
Peak RSS: 12.4GB -> 6.5GB (48% less memory)

After rounds with few hooks (<10K), do up to 20 pointer jump passes to
fully flatten the label tree. This eliminates the 3-4 extra rounds that
previously each cost ~10s for iitree traversal with near-zero hooks.

Round count: 6 -> 3. Total: 3m04s -> 2m17s.
Output identical (2727 nodes, 52047bp, 3686 links).
@ekg ekg force-pushed the transclosure-optimize branch from 1c0427d to 982916e on March 24, 2026 14:24
ekg added 4 commits March 24, 2026 10:40
Use a flat Vec<Vec<usize>> adjacency list instead of HashSet<(usize, usize)>
for spanning tree lookup. Tree nodes have degree ~2-3, so linear scan of
a 2-3 element vec is faster than hashing a (usize, usize) tuple. Eliminates
hash computation per iitree callback during BFS.
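A sketch of the trade described above, with assumed helper names; with tree degree ~2-3, `Vec::contains` is a 2-3 element scan, cheaper than hashing a `(usize, usize)` tuple per callback:

```rust
// Hypothetical flat adjacency list for spanning-tree edge membership,
// replacing a HashSet<(usize, usize)> lookup.

fn build_adjacency(n_seqs: usize, tree_edges: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut adj = vec![Vec::new(); n_seqs];
    for &(a, b) in tree_edges {
        adj[a].push(b);
        adj[b].push(a);
    }
    adj
}

/// Is (a, b) a spanning-tree edge? O(degree) linear scan, no hashing.
fn is_tree_edge(adj: &[Vec<usize>], a: usize, b: usize) -> bool {
    adj[a].contains(&b)
}

fn main() {
    // Path tree 0-1-2-3: every node has degree <= 2.
    let adj = build_adjacency(4, &[(0, 1), (1, 2), (2, 3)]);
    assert!(is_tree_edge(&adj, 1, 2));
    assert!(!is_tree_edge(&adj, 0, 3));
}
```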
Remove unite_block, block_needs_unite, block_can_skip helpers and
DisjointSetsAsm import — all replaced by the LabelProp label
propagation approach. Also guard compute_spanning_tree for n_seqs <= 1.
@ekg ekg merged commit 71c2971 into rust-2 Mar 25, 2026
8 checks passed
