Two-phase spanning tree transclosure: 37% faster, 33% less memory#135
Merged
Conversation
Replace the single-pass BFS+union-find with a two-phase approach:

Phase 1: Fast BFS discovery using a maximum-weight spanning tree of sequence pairs (N-1 edges instead of N*(N-1)/2). Only follows spanning-tree alignment edges, discovering positions without collecting overlap entries. Includes orphan recovery via repeated linear scans to find positions only reachable through non-tree edges.

Phase 2: Linear scan of all iitree intervals for union-find. Uses block-level sampling (check 3 positions per block) to skip blocks whose source and target are already in the same equivalence class. Iterates verification passes until convergence to catch sampling misses.

Also adds for_each_interval() to the IntervalTree trait for linear iteration over all stored intervals.

Benchmarked on the 466-sequence MHC/C4 region (HPRCv2):
- Wall clock: 9m00s -> 5m41s (37% faster)
- User time: 1969s -> 1293s (34% less CPU)
- Peak RSS: 12.4GB -> 8.3GB (33% less memory)
- Output identical (2727 nodes, 52047bp, 3686 links)
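The Phase 1 spanning-tree construction can be sketched as Kruskal's algorithm over alignment-weighted sequence pairs. This is an illustrative sketch, not the crate's actual API: `Dsu` and `max_weight_spanning_tree` are hypothetical names, and the real code may build the tree differently.

```rust
/// Minimal union-find used only to build the tree (hypothetical helper).
struct Dsu {
    parent: Vec<usize>,
}

impl Dsu {
    fn new(n: usize) -> Self {
        Dsu { parent: (0..n).collect() }
    }
    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            self.parent[x] = self.parent[self.parent[x]]; // path halving
            x = self.parent[x];
        }
        x
    }
    fn union(&mut self, a: usize, b: usize) -> bool {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        self.parent[ra.max(rb)] = ra.min(rb);
        true
    }
}

/// Build a maximum-weight spanning tree over sequence pairs.
/// `edges` holds (weight, seq_a, seq_b); returns the N-1 tree edges,
/// chosen heaviest-first so BFS follows the strongest alignments.
fn max_weight_spanning_tree(
    n_seqs: usize,
    mut edges: Vec<(u64, usize, usize)>,
) -> Vec<(usize, usize)> {
    edges.sort_unstable_by(|a, b| b.0.cmp(&a.0)); // heaviest first
    let mut dsu = Dsu::new(n_seqs);
    let mut tree = Vec::with_capacity(n_seqs.saturating_sub(1));
    for (_weight, a, b) in edges {
        if dsu.union(a, b) {
            tree.push((a, b));
            if tree.len() == n_seqs - 1 {
                break; // spanning tree complete
            }
        }
    }
    tree
}
```

With N sequences, BFS discovery then traverses only these N-1 edges instead of all N*(N-1)/2 pairwise alignments.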
… 1.3GB

Replace the linear scan of the pre-collected all_intervals Vec with per-sequence iitree overlap queries for both orphan recovery and phase 2 union-find. Eliminates the 58M-entry Vec allocation (~1.4GB) while producing identical output. For workloads with multiple smaller components, this also reduces scan cost proportionally to component size.

Benchmark (466-seq MHC, same workload):
- Wall clock: 5m41s -> 5m57s (comparable, within noise)
- Peak RSS: 8.3GB -> 7.0GB (1.3GB saved)
- Output: identical (2727 nodes, 52047bp, 3686 links)
- Remove the dead handle_range function and the OverlapAtomicQueue type alias (replaced by inline discovery logic in explore_overlaps_discovery)
- Remove unused Mutex import
- Remove duplicate HashMap import in compute_spanning_tree
- Fix unused-variable warning (_weight in the spanning tree loop)
- Extract find_component_sequences helper (was duplicated 2x)
- Extract unite_block and block_needs_unite helpers (was duplicated 3x)
- Remove redundant n_seqs binding in orphan recovery
Replace 3-point block sampling with binary subdivision sampling (check at offsets 0, L-1, L/2, L/4, 3L/4, ..., down to a step size of 32). This catches more boundary cases than 3-point sampling, reducing verification-round recoveries from ~10K to ~2.8K blocks. The verification pass remains exhaustive for correctness. Extract a block_can_skip helper containing the logarithmic sampling logic.
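The subdivision schedule above might look like the following sketch. The function name is hypothetical, and the exact stopping step (here, subdividing while the step exceeds 32) is an assumption about the commit's "down to step size 32" phrasing.

```rust
/// Generate sampling offsets for a block of length `len`:
/// endpoints 0 and len-1 first, then recursive midpoints
/// (len/2, then len/4 and 3len/4, ...) until the subdivision
/// step is no longer above 32.
fn sample_offsets(len: usize) -> Vec<usize> {
    let mut offsets = vec![0];
    if len > 1 {
        offsets.push(len - 1);
    }
    let mut step = len;
    while step > 32 {
        // midpoints of the current subdivision level
        let mut off = step / 2;
        while off < len {
            offsets.push(off);
            off += step;
        }
        step /= 2;
    }
    offsets
}
```

For L = 128 this checks offsets 0, 127, 64, 32, 96: five probes instead of an exhaustive 128, but with logarithmically spaced coverage that catches block-boundary mismatches a fixed 3-point scheme misses.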
Replace DisjointSetsAsm (128-bit CAS-based union-find) with simple label propagation for phase 2 equivalence-class computation.

Algorithm: hook (parent[max] = min) + pointer jump, iterated until convergence. Uses AtomicU32 with Relaxed ordering (free on x86) instead of CMPXCHG16B. Sequential access within blocks enables hardware prefetch.

Converges in 4 rounds on the MHC/C4 data:
- Round 1: 22.2B hooks (main work)
- Round 2: 930K hooks
- Round 3: 5.3K hooks
- Round 4: 0 hooks (converged)

Benchmark: 5m54s wall clock (same as previous), correct output. Simpler code: eliminates the sampling heuristics and exhaustive verification passes. Foundation for block-level optimizations.
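A minimal single-threaded sketch of the hook + pointer-jump scheme described above (function names mirror the commit message but the bodies are assumptions; the real code runs hooks in parallel, which Relaxed atomics tolerate because labels only ever decrease):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

/// Hook: unite two positions by pointing the larger label at the
/// smaller one (parent[max] = min). Returns true if labels differed,
/// i.e. this counted as a "hook".
fn hook(labels: &[AtomicU32], a: usize, b: usize) -> bool {
    let (la, lb) = (labels[a].load(Relaxed), labels[b].load(Relaxed));
    if la == lb {
        return false;
    }
    let (lo, hi) = (la.min(lb), la.max(lb));
    // fetch_min keeps the write monotone, so Relaxed ordering suffices
    labels[hi as usize].fetch_min(lo, Relaxed);
    true
}

/// Pointer jump: replace each label by its label's label, roughly
/// halving the depth of every label chain. Returns how many moved.
fn pointer_jump(labels: &[AtomicU32]) -> usize {
    let mut moved = 0;
    for i in 0..labels.len() {
        let l = labels[i].load(Relaxed) as usize;
        let ll = labels[l].load(Relaxed);
        if l as u32 != ll {
            labels[i].store(ll, Relaxed);
            moved += 1;
        }
    }
    moved
}
```

Iterating hook rounds and pointer-jump passes until no label moves leaves every position labeled with the minimum member of its equivalence class, with no CMPXCHG16B anywhere on the hot path.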
Add hook_range(), which operates on consecutive rank indices directly, avoiding per-position curr_bv and rank_table lookups. Since positions within each sequence get consecutive ranks, a block [a, a+L) -> [b, b+L) can be processed as hook_range(rank[a], rank[b], L): a tight loop of array reads and conditional writes with sequential access.

Round 1 drops from 207s to 41s. Total transclosure time drops from 5m54s to 3m04s. Output identical to baseline (2727 nodes, 52047bp, 3686 links).

- Wall clock: 9m00s baseline -> 3m04s (2.9x speedup)
- Peak RSS: 12.4GB -> 6.5GB (48% less memory)
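The tight loop might look like this sketch (the signature and body are assumptions based on the commit message; the real function presumably lives alongside the label-propagation code):

```rust
use std::sync::atomic::{AtomicU32, Ordering::Relaxed};

/// Hook an aligned block [a, a+len) -> [b, b+len) of consecutive rank
/// indices. Pure array reads and conditional writes, no per-position
/// bitvector or rank-table lookups. Returns the number of hooks done.
fn hook_range(labels: &[AtomicU32], a: u32, b: u32, len: u32) -> u64 {
    let mut hooks = 0;
    for i in 0..len {
        let (x, y) = ((a + i) as usize, (b + i) as usize);
        let (lx, ly) = (labels[x].load(Relaxed), labels[y].load(Relaxed));
        if lx == ly {
            continue; // already in the same class
        }
        // hook: point the larger label at the smaller (parent[max] = min)
        if lx < ly {
            labels[y].fetch_min(lx, Relaxed);
        } else {
            labels[x].fetch_min(ly, Relaxed);
        }
        hooks += 1;
    }
    hooks
}
```

Both index streams advance by one each iteration, so the hardware prefetcher keeps the label array hot, which is where the 207s -> 41s round-1 improvement comes from.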
After rounds with few hooks (<10K), do up to 20 pointer-jump passes to fully flatten the label tree. This eliminates the 3-4 extra rounds that each previously cost ~10s of iitree traversal with near-zero hooks. Round count: 6 -> 3. Total: 3m04s -> 2m17s. Output identical (2727 nodes, 52047bp, 3686 links).
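The early-flattening heuristic is simple enough to sketch; the thresholds come from the commit message, while the function name and the `pointer_jump_pass` callback (assumed to return how many labels moved) are hypothetical:

```rust
/// Hooks below this count mark a round as "nearly converged" (from the
/// commit message: <10K hooks).
const FLATTEN_HOOK_THRESHOLD: u64 = 10_000;
/// Upper bound on flattening passes (from the commit message: up to 20).
const MAX_FLATTEN_PASSES: usize = 20;

/// After a cheap round, flatten the label tree with repeated pointer
/// jumps instead of paying for more iitree-traversal rounds.
fn maybe_flatten(hooks_this_round: u64, mut pointer_jump_pass: impl FnMut() -> usize) {
    if hooks_this_round < FLATTEN_HOOK_THRESHOLD {
        for _ in 0..MAX_FLATTEN_PASSES {
            if pointer_jump_pass() == 0 {
                break; // label tree fully flat
            }
        }
    }
}
```

Pointer-jump passes only touch the label array, so trading them for whole iitree-traversal rounds is what removes the ~10s-per-round tail.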
Force-pushed from 1c0427d to 982916e
Use a flat Vec<Vec<usize>> adjacency list instead of HashSet<(usize, usize)> for spanning-tree edge lookup. Tree nodes have degree ~2-3, so a linear scan of a 2-3 element vec is faster than hashing a (usize, usize) tuple. This eliminates one hash computation per iitree callback during BFS.
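The lookup structure can be sketched as below (the `SpanningTreeAdj` name and methods are illustrative, not the crate's actual types):

```rust
/// Flat adjacency list for spanning-tree edge membership tests.
struct SpanningTreeAdj {
    adj: Vec<Vec<usize>>, // adj[seq] = neighbors in the spanning tree
}

impl SpanningTreeAdj {
    fn new(n_seqs: usize, tree_edges: &[(usize, usize)]) -> Self {
        let mut adj = vec![Vec::new(); n_seqs];
        for &(a, b) in tree_edges {
            adj[a].push(b);
            adj[b].push(a);
        }
        SpanningTreeAdj { adj }
    }

    /// Replaces a HashSet<(usize, usize)>::contains call: no hashing,
    /// just a sequential scan of a ~2-3 element vec.
    fn is_tree_edge(&self, a: usize, b: usize) -> bool {
        self.adj[a].iter().any(|&x| x == b)
    }
}
```

With degree ~2-3 the scan is a couple of cache-resident comparisons, cheaper than hashing a 16-byte tuple on every iitree callback.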
Remove the unite_block, block_needs_unite, and block_can_skip helpers and the DisjointSetsAsm import; all are replaced by the LabelProp label-propagation approach. Also guard compute_spanning_tree against n_seqs <= 1.
Summary
- Adds for_each_interval() to the IntervalTree trait for linear iteration
- Benchmark (466-sequence MHC/C4 region, HPRCv2)
Key insight from profiling
The BFS made 175.9M interval callbacks, of which 99.97% were redundant: their targets had already been discovered. Each position had ~930 overlapping intervals (465 pairs x 2 directions), but only ~1 per position discovered new positions. The spanning tree reduces this to ~6 callbacks per position for discovery, with a separate linear scan for union-find that skips redundant blocks via sampling.
Test plan