
Rust implementation of seqwish#131

Open
ekg wants to merge 463 commits into master from rust-2

Conversation

Collaborator

@ekg ekg commented Dec 2, 2025

Summary

Complete rewrite of seqwish in Rust, providing a modern, memory-safe implementation with identical output to the C++ version.

Key changes:

  • Full Rust implementation of all core algorithms (sequence indexing, alignment processing, transitive closure, graph compaction, GFA output)
  • Preserves Big O complexity - uses equivalent data structures (FM-index, succinct bitvectors, interval trees) per project requirements
  • Byte-for-byte identical output with the C++ implementation (verified via checksums)
  • Published to crates.io as seqwish (versions 0.1.1-0.1.3)
  • Performance optimizations: 5.7x faster GFA path writing, work-stealing parallelism via Rayon
  • C++ code preserved in cpp/ directory for reference

Architecture:

  • Pure Rust binary with no C++ dependencies for the main implementation
  • Supports both disk-backed and in-memory operation modes
  • Cross-platform: builds on Linux x86_64, Linux aarch64, and macOS

Testing:

  • 104 unit tests + 3 integration tests
  • CI workflow for Rust builds and legacy C++ reference builds
  • Checksum verification against reference outputs

Test plan

  • All 107 tests pass (cargo test --release)
  • Integration tests verify byte-identical output with C++ version
  • CI builds for Linux x86_64, Linux aarch64, macOS
  • Manual testing on large pangenome datasets recommended before merge

ekg and others added 30 commits September 17, 2020 20:57
…ithoutcigars

Raise a warning when PAF files have no CIGAR strings #61
The last change affected output (removing *,*,*... overlap cigars), and
thus affected md5s.
Sanitize memory management and mmmulti
update memory mapping to use madvise as before
fix segmentation fault with empty PAF files
…n_the_fasta

Manage empty sequences in the FASTA and remove the mmap_allocator dependency
This treats X and = like M. We check the input sequences.

A bug in a particular mapper tends to cause this (wfmash), so this is
just for testing until that's resolved.
distrust cigar mismatch characters
ekg added 3 commits January 17, 2026 14:54
Build artifacts were accidentally committed. The .gitignore only had
seqwish-rs/target/ from the old directory structure before the Rust
project was moved to the repository root.
AndreaGuarracino and others added 26 commits February 27, 2026 21:41
…k lookup

Three bugs in the transitive closure computation caused incorrect graphs:

1. ovlp_q.push() used non-blocking ArrayQueue::push() with `let _ =`
   which silently dropped overlaps when the queue was full.
2. The manager task never drained ovlp_q during the parallel work phase,
   so the 100,000-capacity queue filled almost immediately and all
   subsequent overlaps were lost.
3. todo_in.push() had the same silent-drop pattern.

Fix: workers now spin-retry on both ovlp_q and todo_in pushes (matching
C++ blocking semantics). Manager drains ovlp_q continuously during the
work loop (matching C++ manager thread pattern). Queue sizes increased
from 100,000 to 131,072 (matching C++ 2 << 16).
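
The spin-retry fix can be sketched with std's bounded `sync_channel` standing in for crossbeam's `ArrayQueue` (`try_send` mirrors the non-blocking push; all names here are illustrative, not the PR's code):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

fn main() {
    // Bounded channel stands in for the ArrayQueue; tiny capacity to force
    // the "queue full" case that silently dropped overlaps.
    let (tx, rx) = sync_channel::<u64>(4);

    let producer = thread::spawn(move || {
        for i in 0..1000u64 {
            // Buggy pattern: `let _ = tx.try_send(i);` drops on a full queue.
            // Fix: spin-retry until the push succeeds (blocking semantics).
            let mut item = i;
            loop {
                match tx.try_send(item) {
                    Ok(()) => break,
                    Err(TrySendError::Full(v)) => {
                        item = v; // queue full: keep the value and retry
                        std::hint::spin_loop();
                    }
                    Err(TrySendError::Disconnected(_)) => return,
                }
            }
        }
    });

    // Consumer drains continuously, mirroring the manager-loop fix.
    let mut received = 0u64;
    for _ in rx.iter().take(1000) {
        received += 1;
    }
    producer.join().unwrap();
    assert_eq!(received, 1000); // nothing lost
    println!("received {received}");
}
```

With the original `let _ =` pattern and a 4-slot queue, most of the 1000 items would vanish; the spin-retry makes the push effectively blocking.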

Additionally, replace Rank9Sel::rank1() with a direct Vec<u32> rank
lookup table. This changes O(popcount) per call to O(1) array access,
which is critical at scale (38.6B rank lookups in our test case).
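
One plausible layout for such a rank table (an assumption; the PR's exact granularity is not shown here) keeps a cumulative popcount per 64-bit word, making `rank1` one array read plus one masked popcount:

```rust
/// Cumulative popcount per word: table[i] = ones in words[..i].
fn build_rank_table(words: &[u64]) -> Vec<u32> {
    let mut table = Vec::with_capacity(words.len() + 1);
    let mut acc = 0u32;
    for &w in words {
        table.push(acc);
        acc += w.count_ones();
    }
    table.push(acc);
    table
}

/// Number of set bits in positions [0, pos).
fn rank1(words: &[u64], table: &[u32], pos: usize) -> u32 {
    let word = pos / 64;
    let bit = pos % 64;
    let partial = if bit == 0 {
        0
    } else {
        (words[word] & ((1u64 << bit) - 1)).count_ones()
    };
    table[word] + partial
}

fn main() {
    let words = [0b1011u64, u64::MAX];
    let table = build_rank_table(&words);
    assert_eq!(rank1(&words, &table, 4), 3); // bits 0, 1, 3 set
    assert_eq!(rank1(&words, &table, 64), 3);
    assert_eq!(rank1(&words, &table, 128), 67);
    println!("ok");
}
```

At tens of billions of calls, trading a sampled rank structure's extra memory accesses for a flat table is exactly the kind of constant-factor win the commit describes.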

Before fix: 4,378 segments (incorrect, lost overlaps)
After fix:  628 segments (matches C++ seqwish output)
Performance: 46.7s Rust vs 52.4s C++ on chr6 29M test region

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…p-drop

fix: prevent silent overlap dropping in transclosure
The seq_id_cbv bitvector has size graph_length + 1, but the graph
sequence is only graph_length bytes. When computing the last node's
length, select(node_id + 1) returns None. The S-line code was
skipping the node entirely, and the validation was using
unwrap_or(node_start), giving a length of 0.

Fix both to use the actual graph sequence length (seq_v_slice.len())
as the end sentinel, giving the correct last node length. This fixes
"path step length mismatch" errors that occurred when a path visited
the last node in the graph.
fix: correct last node length in GFA output
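
A minimal sketch of the sentinel fix, with a hypothetical `select1` closure standing in for the bitvector's select support (names are illustrative, not the PR's code):

```rust
// Assumed semantics: select1(k) yields the start position of node k,
// or None past the last node; seq_len is the graph sequence length.
fn node_length(
    select1: impl Fn(usize) -> Option<usize>,
    seq_len: usize,
    node_id: usize,
) -> usize {
    let start = select1(node_id).expect("node start must exist");
    // Buggy: unwrap_or(start) gave the last node a length of 0.
    // Fix: fall back to the sequence length as the end sentinel.
    let end = select1(node_id + 1).unwrap_or(seq_len);
    end - start
}

fn main() {
    // Node starts at positions 0, 4, 7 in a 10 bp graph sequence.
    let starts = [0usize, 4, 7];
    let sel = |k: usize| starts.get(k).copied();
    assert_eq!(node_length(sel, 10, 0), 4);
    assert_eq!(node_length(sel, 10, 1), 3);
    assert_eq!(node_length(sel, 10, 2), 3); // last node: 10 - 7, not 0
    println!("ok");
}
```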
…oundary marking

The previous implementation used read_volatile/write_volatile which is NOT
atomic — concurrent threads setting different bits in the same u64 word
could silently lose updates (lost-update race condition). This caused
intermittent "path step length mismatch" errors in GFA output because
dropped node boundaries meant paths partially traversed nodes.

Replace with AtomicU64::fetch_or(mask, Ordering::Relaxed) which is a
single atomic read-modify-write operation, eliminating the race.
A regression test has 64 threads set different bits in the same u64
word simultaneously, repeated 100 times. This would fail with the old
read_volatile implementation due to lost-update races.
fix: use truly atomic fetch_or in AtomicBitVec
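
The described stress scenario maps directly onto `AtomicU64::fetch_or`; a self-contained version of the test idea (a sketch, not the PR's test code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn main() {
    let word = AtomicU64::new(0);
    // 64 threads each set a different bit of the same word, repeatedly.
    for _round in 0..100 {
        word.store(0, Ordering::Relaxed);
        thread::scope(|s| {
            for bit in 0..64u32 {
                let w = &word;
                s.spawn(move || {
                    // Single atomic read-modify-write: no lost updates,
                    // unlike a read_volatile / OR / write_volatile sequence.
                    w.fetch_or(1u64 << bit, Ordering::Relaxed);
                });
            }
        });
        // With the racy implementation some bits would be missing.
        assert_eq!(word.load(Ordering::Relaxed), u64::MAX);
    }
    println!("ok");
}
```

`Relaxed` suffices here because each `fetch_or` is independent; only the atomicity of the read-modify-write matters, not ordering against other memory.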
Replace single-pass BFS+union-find with a two-phase approach:

Phase 1: Fast BFS discovery using a maximum-weight spanning tree of
sequence pairs (N-1 edges instead of N*(N-1)/2). Only follows spanning
tree alignment edges, discovering positions without collecting overlap
entries. Includes orphan recovery via repeated linear scans to find
positions only reachable through non-tree edges.

Phase 2: Linear scan of all iitree intervals for union-find. Uses
block-level sampling (check 3 positions per block) to skip blocks where
source and target are already in the same equivalence class. Iterates
verification passes until convergence to catch sampling misses.

Also adds for_each_interval() to the IntervalTree trait for linear
iteration over all stored intervals.

Benchmarked on 466-sequence MHC/C4 region (HPRCv2):
- Wall clock: 9m00s -> 5m41s (37% faster)
- User time: 1969s -> 1293s (34% less CPU)
- Peak RSS: 12.4GB -> 8.3GB (33% less memory)
- Output identical (2727 nodes, 52047bp, 3686 links)
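
The phase-1 edge selection can be sketched Kruskal-style (an assumption; the PR may construct the maximum-weight spanning tree differently): sort pair edges by weight and keep the N-1 heaviest that connect new components.

```rust
// edges: (seq_a, seq_b, weight), e.g. weight = total aligned bases.
fn max_weight_spanning_tree(
    n_seqs: usize,
    mut edges: Vec<(usize, usize, u64)>,
) -> Vec<(usize, usize)> {
    let mut parent: Vec<usize> = (0..n_seqs).collect();
    fn find(parent: &mut Vec<usize>, mut x: usize) -> usize {
        while parent[x] != x {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        x
    }
    edges.sort_by(|a, b| b.2.cmp(&a.2)); // heaviest first
    let mut tree = Vec::with_capacity(n_seqs.saturating_sub(1));
    for (a, b, _w) in edges {
        let (ra, rb) = (find(&mut parent, a), find(&mut parent, b));
        if ra != rb {
            parent[ra] = rb; // merge components
            tree.push((a, b));
        }
    }
    tree
}

fn main() {
    let edges = vec![(0, 1, 100), (1, 2, 80), (0, 2, 90)];
    let tree = max_weight_spanning_tree(3, edges);
    assert_eq!(tree.len(), 2); // N-1 edges for N=3
    assert!(tree.contains(&(0, 1)) && tree.contains(&(0, 2)));
    println!("{tree:?}");
}
```

BFS discovery then follows only these N-1 edges instead of all N*(N-1)/2 pairs, with orphan recovery catching positions reachable only through non-tree edges.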
… 1.3GB

Replace linear scan of pre-collected all_intervals Vec with per-sequence
iitree overlap queries for both orphan recovery and phase 2 union-find.
Eliminates 58M-entry Vec allocation (~1.4GB) while maintaining identical
output. For workloads with multiple smaller components, this also reduces
scan cost proportionally to component size.

Benchmark (466-seq MHC, same workload):
- Wall clock: 5m41s -> 5m57s (comparable, within noise)
- Peak RSS: 8.3GB -> 7.0GB (1.3GB saved)
- Output: identical (2727 nodes, 52047bp, 3686 links)
- Remove dead handle_range function and OverlapAtomicQueue type alias
  (replaced by inline discovery logic in explore_overlaps_discovery)
- Remove unused Mutex import
- Remove duplicate HashMap import in compute_spanning_tree
- Fix unused variable warning (_weight in spanning tree loop)
- Extract find_component_sequences helper (was duplicated 2x)
- Extract unite_block and block_needs_unite helpers (was duplicated 3x)
- Remove redundant n_seqs binding in orphan recovery
Replace 3-point block sampling with binary subdivision sampling (check at
offsets 0, L-1, L/2, L/4, 3L/4, ... down to step size 32). Catches more
boundary cases than 3-point, reducing verification round recoveries from
~10K to ~2.8K blocks. Verification pass remains exhaustive for correctness.

Extract block_can_skip helper with the logarithmic sampling logic.
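
The subdivision order can be sketched as follows (the exact offset sequence is an assumption based on the description above): endpoints first, then midpoints at halving step sizes, stopping once the step drops below 32.

```rust
/// Offsets to sample within a block of `len` positions:
/// 0, len-1, then len/2, then len/4 and 3len/4, ... while step >= 32.
fn sample_offsets(len: usize) -> Vec<usize> {
    let mut offs = vec![0];
    if len > 1 {
        offs.push(len - 1);
    }
    let mut step = len / 2;
    while step >= 32 {
        let mut o = step;
        while o < len {
            offs.push(o); // odd multiples of step: new midpoints only
            o += 2 * step;
        }
        step /= 2;
    }
    offs
}

fn main() {
    assert_eq!(
        sample_offsets(256),
        vec![0, 255, 128, 64, 192, 32, 96, 160, 224]
    );
    // Small blocks degrade to endpoint checks only.
    assert_eq!(sample_offsets(16), vec![0, 15]);
    println!("ok");
}
```

Logarithmically many probes per block, but with coverage at every scale down to 32 positions, which is why it catches boundary cases that a fixed 3-point probe misses.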
Replace DisjointSetsAsm (128-bit CAS-based union-find) with simple
label propagation for phase 2 equivalence class computation.

Algorithm: hook (parent[max] = min) + pointer jump, iterated until
convergence. Uses AtomicU32 with Relaxed ordering (free on x86) instead
of CMPXCHG16B. Sequential access within blocks enables hardware prefetch.

Converges in 4 rounds for MHC/C4 data:
  Round 1: 22.2B hooks (main work)
  Round 2: 930K hooks
  Round 3: 5.3K hooks
  Round 4: 0 hooks (converged)

Benchmark: 5m54s wall clock (same as previous), correct output.
Simpler code — eliminates sampling heuristics and exhaustive
verification passes. Foundation for block-level optimizations.
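
A sequential sketch of the hook + pointer-jump semantics (the PR's version runs these concurrently over `AtomicU32`; names here are illustrative):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hook: link the larger label under the smaller. Returns true if it did work.
fn hook(labels: &[AtomicU32], a: usize, b: usize) -> bool {
    let la = labels[a].load(Ordering::Relaxed);
    let lb = labels[b].load(Ordering::Relaxed);
    if la == lb {
        return false;
    }
    let (min, max) = (la.min(lb), la.max(lb));
    labels[max as usize].store(min, Ordering::Relaxed);
    true
}

/// Pointer jump: flatten each chain until every entry points at a root.
fn pointer_jump(labels: &[AtomicU32]) {
    for l in labels {
        loop {
            let p = l.load(Ordering::Relaxed);
            let gp = labels[p as usize].load(Ordering::Relaxed);
            if p == gp {
                break;
            }
            l.store(gp, Ordering::Relaxed);
        }
    }
}

fn main() {
    let labels: Vec<AtomicU32> = (0..6u32).map(AtomicU32::new).collect();
    // Equivalences from two alignment blocks: {0,1,2} and {3,4,5}.
    for &(a, b) in &[(0, 1), (1, 2), (3, 4), (4, 5)] {
        while hook(&labels, a, b) {
            pointer_jump(&labels);
        }
    }
    let resolved: Vec<u32> =
        labels.iter().map(|l| l.load(Ordering::Relaxed)).collect();
    assert_eq!(resolved, vec![0, 0, 0, 3, 3, 3]);
    println!("{resolved:?}");
}
```

Run sequentially this is correct; the PR later found that the concurrent version can hook through stale labels, which is what forced the revert described further down.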
Add hook_range() that operates on consecutive rank indices directly,
avoiding per-position curr_bv and rank_table lookups. Since positions
within each sequence get consecutive ranks, a block [a,a+L)->[b,b+L)
can be processed as hook_range(rank[a], rank[b], L) -- a tight loop
of array reads and conditional writes with sequential access.

Round 1 drops from 207s to 41s. Total transclosure from 5m54s to 3m04s.
Output identical to baseline (2727 nodes, 52047bp, 3686 links).

Wall clock: 9m00s baseline -> 3m04s (2.9x speedup)
Peak RSS: 12.4GB -> 6.5GB (48% less memory)
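
A hypothetical `hook_range` over consecutive ranks, showing why the tight loop beats per-position bitvector and rank-table lookups (single-threaded sketch; the real labels are atomic):

```rust
/// Hook labels[a..a+len) against labels[b..b+len) pairwise.
/// Consecutive ranks mean plain sequential array access: no curr_bv
/// or rank_table lookup inside the loop. Returns the hook count.
fn hook_range(labels: &mut [u32], a: usize, b: usize, len: usize) -> u64 {
    let mut hooks = 0;
    for i in 0..len {
        let (la, lb) = (labels[a + i], labels[b + i]);
        if la != lb {
            let (min, max) = (la.min(lb), la.max(lb));
            labels[max as usize] = min;
            hooks += 1;
        }
    }
    hooks
}

fn main() {
    let mut labels: Vec<u32> = (0..8u32).collect();
    // Block [0,4) aligns to [4,8): one sequential pass hooks all pairs.
    let hooks = hook_range(&mut labels, 0, 4, 4);
    assert_eq!(hooks, 4);
    assert_eq!(labels, vec![0, 1, 2, 3, 0, 1, 2, 3]);
    println!("{labels:?}");
}
```

Sequential reads and conditional writes like this are what hardware prefetchers reward, consistent with the 207s -> 41s round-1 drop reported above.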
…peedup)

After rounds with few hooks (<10K), do up to 20 pointer jump passes to
fully flatten the label tree. This eliminates the 3-4 extra rounds that
previously each cost ~10s for iitree traversal with near-zero hooks.

Round count: 6 -> 3. Total: 3m04s -> 2m17s.
Output identical (2727 nodes, 52047bp, 3686 links).
Use a flat Vec<Vec<usize>> adjacency list instead of HashSet<(usize, usize)>
for spanning tree lookup. Tree nodes have degree ~2-3, so linear scan of
a 2-3 element vec is faster than hashing a (usize, usize) tuple. Eliminates
hash computation per iitree callback during BFS.
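
The lookup-structure swap, sketched (illustrative names, not the PR's code):

```rust
/// Flat adjacency list for a spanning tree over sequences.
/// Degree is ~2-3, so a linear scan of a tiny Vec beats hashing
/// a (usize, usize) tuple on every iitree callback.
struct TreeAdj {
    adj: Vec<Vec<usize>>,
}

impl TreeAdj {
    fn new(n: usize, edges: &[(usize, usize)]) -> Self {
        let mut adj = vec![Vec::new(); n];
        for &(a, b) in edges {
            adj[a].push(b);
            adj[b].push(a);
        }
        TreeAdj { adj }
    }

    fn is_tree_edge(&self, a: usize, b: usize) -> bool {
        // 2-3 element scan: no hash computation, cache-friendly.
        self.adj[a].contains(&b)
    }
}

fn main() {
    let t = TreeAdj::new(4, &[(0, 1), (1, 2), (1, 3)]);
    assert!(t.is_tree_edge(1, 0));
    assert!(!t.is_tree_edge(0, 2));
    println!("ok");
}
```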
Remove unite_block, block_needs_unite, block_can_skip helpers and
DisjointSetsAsm import — all replaced by the LabelProp label
propagation approach. Also guard compute_spanning_tree for n_seqs <= 1.
Two-phase spanning tree transclosure: 37% faster, 33% less memory
The Shiloach-Vishkin label propagation had a race condition where
parent[max(la,lb)] = min(la,lb) could link unrelated positions when
la/lb were stale. This caused base mismatches in write_graph_chunk
on larger inputs (e.g., USP14 region with 527 sequences, 212kb).

Revert to DisjointSetsAsm (lock-free CAS-based union-find) which
is proven correct. Keep all other optimizations: spanning tree BFS,
orphan recovery, per-sequence iitree queries, adjacency list.

MHC (466 seqs): 4m12s, correct (2727 nodes, 52047bp)
USP14 (527 seqs): 20m00s, correct (10199 nodes, 236011bp) — was crashing
Revert label propagation: restore DisjointSetsAsm for correctness
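
A simplified parent-only sketch of a CAS-based union-find (the PR's DisjointSetsAsm packs more state into a 128-bit CAS; this is not its implementation). The key property restored by the revert: a parent is installed only if the slot is still a root, so stale reads retry instead of corrupting the forest, which is the failure mode of the raced label propagation.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct DisjointSets {
    parent: Vec<AtomicU32>,
}

impl DisjointSets {
    fn new(n: u32) -> Self {
        DisjointSets {
            parent: (0..n).map(AtomicU32::new).collect(),
        }
    }

    fn find(&self, mut x: u32) -> u32 {
        loop {
            let p = self.parent[x as usize].load(Ordering::Relaxed);
            if p == x {
                return x;
            }
            let gp = self.parent[p as usize].load(Ordering::Relaxed);
            // Opportunistic path halving; harmless if another thread won.
            let _ = self.parent[x as usize].compare_exchange(
                p, gp, Ordering::Relaxed, Ordering::Relaxed,
            );
            x = p;
        }
    }

    fn unite(&self, a: u32, b: u32) {
        loop {
            let (ra, rb) = (self.find(a), self.find(b));
            if ra == rb {
                return;
            }
            let (lo, hi) = (ra.min(rb), ra.max(rb));
            // Only succeeds if `hi` is still its own root; otherwise a
            // concurrent unite changed it, so re-find and retry.
            if self.parent[hi as usize]
                .compare_exchange(hi, lo, Ordering::Relaxed, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }
}

fn main() {
    let ds = DisjointSets::new(5);
    ds.unite(0, 1);
    ds.unite(3, 4);
    ds.unite(1, 4);
    assert_eq!(ds.find(4), 0);
    assert_ne!(ds.find(2), ds.find(0));
    println!("ok");
}
```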