
Rust implementation of seqwish#131

Open
ekg wants to merge 463 commits into master from rust-2

Conversation

Collaborator

@ekg ekg commented Dec 2, 2025

Summary

Complete rewrite of seqwish in Rust, providing a modern, memory-safe implementation with identical output to the C++ version.

Key changes:

  • Full Rust implementation of all core algorithms (sequence indexing, alignment processing, transitive closure, graph compaction, GFA output)
  • Preserves Big O complexity - uses equivalent data structures (FM-index, succinct bitvectors, interval trees) per project requirements
  • Byte-for-byte identical output with the C++ implementation (verified via checksums)
  • Published to crates.io as seqwish (versions 0.1.1-0.1.3)
  • Performance optimizations: 5.7x faster GFA path writing, work-stealing parallelism via Rayon
  • C++ code preserved in cpp/ directory for reference

Architecture:

  • Pure Rust binary with no C++ dependencies for the main implementation
  • Supports both disk-backed and in-memory operation modes
  • Cross-platform: builds on Linux x86_64, Linux aarch64, and macOS

Testing:

  • 104 unit tests + 3 integration tests
  • CI workflow for Rust builds and legacy C++ reference builds
  • Checksum verification against reference outputs

Test plan

  • All 107 tests pass (cargo test --release)
  • Integration tests verify byte-identical output with C++ version
  • CI builds for Linux x86_64, Linux aarch64, macOS
  • Manual testing on large pangenome datasets recommended before merge

ekg and others added 30 commits September 17, 2020 20:57
…ithoutcigars

Raise a warning when PAF files have no CIGAR strings #61
The last change affected output (removing *,*,*... overlap cigars), and
thus affected md5s.
Sanitize memory management and mmmulti
update memory mapping to use madvise as before
fix segmentation fault with empty PAF files
…n_the_fasta

Manage empty sequences in the FASTA and remove the mmap_allocator dependency
This treats X and = like M. We check the input sequences.

A bug in a particular mapper tends to cause this (wfmash), so this is
just for testing until that's resolved.
distrust cigar mismatch characters
ekg added 3 commits January 17, 2026 14:54
Build artifacts were accidentally committed. The .gitignore only had
seqwish-rs/target/ from the old directory structure before the Rust
project was moved to the repository root.
AndreaGuarracino and others added 26 commits February 27, 2026 21:41
…k lookup

Three bugs in the transitive closure computation caused incorrect graphs:

1. ovlp_q.push() used non-blocking ArrayQueue::push() with `let _ =`
   which silently dropped overlaps when the queue was full.
2. The manager task never drained ovlp_q during the parallel work phase,
   so the 100,000-capacity queue filled almost immediately and all
   subsequent overlaps were lost.
3. todo_in.push() had the same silent-drop pattern.

Fix: workers now spin-retry on both ovlp_q and todo_in pushes (matching
C++ blocking semantics). Manager drains ovlp_q continuously during the
work loop (matching C++ manager thread pattern). Queue sizes increased
from 100,000 to 131,072 (matching C++ 2 << 16).
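
The spin-retry fix can be sketched with std's bounded `sync_channel` standing in for crossbeam's `ArrayQueue` (`try_send` mirrors the non-blocking push; all names here are illustrative, not the PR's code):

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

fn main() {
    // Bounded channel stands in for the ArrayQueue; tiny capacity to force
    // the "queue full" case that silently dropped overlaps.
    let (tx, rx) = sync_channel::<u64>(4);

    let producer = thread::spawn(move || {
        for i in 0..1000u64 {
            // Buggy pattern: `let _ = tx.try_send(i);` drops on a full queue.
            // Fix: spin-retry until the push succeeds (blocking semantics).
            let mut item = i;
            loop {
                match tx.try_send(item) {
                    Ok(()) => break,
                    Err(TrySendError::Full(v)) => {
                        item = v; // queue full: keep the value and retry
                        std::hint::spin_loop();
                    }
                    Err(TrySendError::Disconnected(_)) => return,
                }
            }
        }
    });

    // Consumer drains continuously, mirroring the manager-loop fix.
    let mut received = 0u64;
    for _ in rx.iter().take(1000) {
        received += 1;
    }
    producer.join().unwrap();
    assert_eq!(received, 1000); // nothing lost
    println!("received {received}");
}
```

With the original `let _ =` pattern and a 4-slot queue, most of the 1000 items would vanish; the spin-retry makes the push effectively blocking.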

Additionally, replace Rank9Sel::rank1() with a direct Vec<u32> rank
lookup table. This changes O(popcount) per call to O(1) array access,
which is critical at scale (38.6B rank lookups in our test case).
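
One plausible layout for such a rank table (an assumption; the PR's exact granularity is not shown here) keeps a cumulative popcount per 64-bit word, making `rank1` one array read plus one masked popcount:

```rust
/// Cumulative popcount per word: table[i] = ones in words[..i].
fn build_rank_table(words: &[u64]) -> Vec<u32> {
    let mut table = Vec::with_capacity(words.len() + 1);
    let mut acc = 0u32;
    for &w in words {
        table.push(acc);
        acc += w.count_ones();
    }
    table.push(acc);
    table
}

/// Number of set bits in positions [0, pos).
fn rank1(words: &[u64], table: &[u32], pos: usize) -> u32 {
    let word = pos / 64;
    let bit = pos % 64;
    let partial = if bit == 0 {
        0
    } else {
        (words[word] & ((1u64 << bit) - 1)).count_ones()
    };
    table[word] + partial
}

fn main() {
    let words = [0b1011u64, u64::MAX];
    let table = build_rank_table(&words);
    assert_eq!(rank1(&words, &table, 4), 3); // bits 0, 1, 3 set
    assert_eq!(rank1(&words, &table, 64), 3);
    assert_eq!(rank1(&words, &table, 128), 67);
    println!("ok");
}
```

At tens of billions of calls, trading a sampled rank structure's extra memory accesses for a flat table is exactly the kind of constant-factor win the commit describes.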

Before fix: 4,378 segments (incorrect, lost overlaps)
After fix:  628 segments (matches C++ seqwish output)
Performance: 46.7s Rust vs 52.4s C++ on chr6 29M test region

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…p-drop

fix: prevent silent overlap dropping in transclosure
The seq_id_cbv bitvector has size graph_length + 1, but the graph
sequence is only graph_length bytes. When computing the last node's
length, select(node_id + 1) returns None. The S-line code was
skipping the node entirely, and the validation was using
unwrap_or(node_start), giving a length of 0.

Fix both to use the actual graph sequence length (seq_v_slice.len())
as the end sentinel, giving the correct last node length. This fixes
"path step length mismatch" errors that occurred when a path visited
the last node in the graph.
fix: correct last node length in GFA output
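
A minimal sketch of the sentinel fix, with a hypothetical `select1` closure standing in for the bitvector's select support (names are illustrative, not the PR's code):

```rust
// Assumed semantics: select1(k) yields the start position of node k,
// or None past the last node; seq_len is the graph sequence length.
fn node_length(
    select1: impl Fn(usize) -> Option<usize>,
    seq_len: usize,
    node_id: usize,
) -> usize {
    let start = select1(node_id).expect("node start must exist");
    // Buggy: unwrap_or(start) gave the last node a length of 0.
    // Fix: fall back to the sequence length as the end sentinel.
    let end = select1(node_id + 1).unwrap_or(seq_len);
    end - start
}

fn main() {
    // Node starts at positions 0, 4, 7 in a 10 bp graph sequence.
    let starts = [0usize, 4, 7];
    let sel = |k: usize| starts.get(k).copied();
    assert_eq!(node_length(sel, 10, 0), 4);
    assert_eq!(node_length(sel, 10, 1), 3);
    assert_eq!(node_length(sel, 10, 2), 3); // last node: 10 - 7, not 0
    println!("ok");
}
```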
…oundary marking

The previous implementation used read_volatile/write_volatile which is NOT
atomic — concurrent threads setting different bits in the same u64 word
could silently lose updates (lost-update race condition). This caused
intermittent "path step length mismatch" errors in GFA output because
dropped node boundaries meant paths partially traversed nodes.

Replace with AtomicU64::fetch_or(mask, Ordering::Relaxed) which is a
single atomic read-modify-write operation, eliminating the race.
A regression test has 64 threads set different bits in the same u64
word simultaneously, repeated 100 times. This would fail with the old
read_volatile implementation due to lost-update races.
fix: use truly atomic fetch_or in AtomicBitVec
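
The described stress scenario maps directly onto `AtomicU64::fetch_or`; a self-contained version of the test idea (a sketch, not the PR's test code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn main() {
    let word = AtomicU64::new(0);
    // 64 threads each set a different bit of the same word, repeatedly.
    for _round in 0..100 {
        word.store(0, Ordering::Relaxed);
        thread::scope(|s| {
            for bit in 0..64u32 {
                let w = &word;
                s.spawn(move || {
                    // Single atomic read-modify-write: no lost updates,
                    // unlike a read_volatile / OR / write_volatile sequence.
                    w.fetch_or(1u64 << bit, Ordering::Relaxed);
                });
            }
        });
        // With the racy implementation some bits would be missing.
        assert_eq!(word.load(Ordering::Relaxed), u64::MAX);
    }
    println!("ok");
}
```

`Relaxed` suffices here because each `fetch_or` is independent; only the atomicity of the read-modify-write matters, not ordering against other memory.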
Replace single-pass BFS+union-find with a two-phase approach:

Phase 1: Fast BFS discovery using a maximum-weight spanning tree of
sequence pairs (N-1 edges instead of N*(N-1)/2). Only follows spanning
tree alignment edges, discovering positions without collecting overlap
entries. Includes orphan recovery via repeated linear scans to find
positions only reachable through non-tree edges.

Phase 2: Linear scan of all iitree intervals for union-find. Uses
block-level sampling (check 3 positions per block) to skip blocks where
source and target are already in the same equivalence class. Iterates
verification passes until convergence to catch sampling misses.

Also adds for_each_interval() to the IntervalTree trait for linear
iteration over all stored intervals.

Benchmarked on 466-sequence MHC/C4 region (HPRCv2):
- Wall clock: 9m00s -> 5m41s (37% faster)
- User time: 1969s -> 1293s (34% less CPU)
- Peak RSS: 12.4GB -> 8.3GB (33% less memory)
- Output identical (2727 nodes, 52047bp, 3686 links)
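
The phase-1 edge selection can be sketched Kruskal-style (an assumption; the PR may construct the maximum-weight spanning tree differently): sort pair edges by weight and keep the N-1 heaviest that connect new components.

```rust
// edges: (seq_a, seq_b, weight), e.g. weight = total aligned bases.
fn max_weight_spanning_tree(
    n_seqs: usize,
    mut edges: Vec<(usize, usize, u64)>,
) -> Vec<(usize, usize)> {
    let mut parent: Vec<usize> = (0..n_seqs).collect();
    fn find(parent: &mut Vec<usize>, mut x: usize) -> usize {
        while parent[x] != x {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        x
    }
    edges.sort_by(|a, b| b.2.cmp(&a.2)); // heaviest first
    let mut tree = Vec::with_capacity(n_seqs.saturating_sub(1));
    for (a, b, _w) in edges {
        let (ra, rb) = (find(&mut parent, a), find(&mut parent, b));
        if ra != rb {
            parent[ra] = rb; // merge components
            tree.push((a, b));
        }
    }
    tree
}

fn main() {
    let edges = vec![(0, 1, 100), (1, 2, 80), (0, 2, 90)];
    let tree = max_weight_spanning_tree(3, edges);
    assert_eq!(tree.len(), 2); // N-1 edges for N=3
    assert!(tree.contains(&(0, 1)) && tree.contains(&(0, 2)));
    println!("{tree:?}");
}
```

BFS discovery then follows only these N-1 edges instead of all N*(N-1)/2 pairs, with orphan recovery catching positions reachable only through non-tree edges.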
… 1.3GB

Replace linear scan of pre-collected all_intervals Vec with per-sequence
iitree overlap queries for both orphan recovery and phase 2 union-find.
Eliminates 58M-entry Vec allocation (~1.4GB) while maintaining identical
output. For workloads with multiple smaller components, this also reduces
scan cost proportionally to component size.

Benchmark (466-seq MHC, same workload):
- Wall clock: 5m41s -> 5m57s (comparable, within noise)
- Peak RSS: 8.3GB -> 7.0GB (1.3GB saved)
- Output: identical (2727 nodes, 52047bp, 3686 links)
- Remove dead handle_range function and OverlapAtomicQueue type alias
  (replaced by inline discovery logic in explore_overlaps_discovery)
- Remove unused Mutex import
- Remove duplicate HashMap import in compute_spanning_tree
- Fix unused variable warning (_weight in spanning tree loop)
- Extract find_component_sequences helper (was duplicated 2x)
- Extract unite_block and block_needs_unite helpers (was duplicated 3x)
- Remove redundant n_seqs binding in orphan recovery
Replace 3-point block sampling with binary subdivision sampling (check at
offsets 0, L-1, L/2, L/4, 3L/4, ... down to step size 32). Catches more
boundary cases than 3-point, reducing verification round recoveries from
~10K to ~2.8K blocks. Verification pass remains exhaustive for correctness.

Extract block_can_skip helper with the logarithmic sampling logic.
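
The subdivision order can be sketched as follows (the exact offset sequence is an assumption based on the description above): endpoints first, then midpoints at halving step sizes, stopping once the step drops below 32.

```rust
/// Offsets to sample within a block of `len` positions:
/// 0, len-1, then len/2, then len/4 and 3len/4, ... while step >= 32.
fn sample_offsets(len: usize) -> Vec<usize> {
    let mut offs = vec![0];
    if len > 1 {
        offs.push(len - 1);
    }
    let mut step = len / 2;
    while step >= 32 {
        let mut o = step;
        while o < len {
            offs.push(o); // odd multiples of step: new midpoints only
            o += 2 * step;
        }
        step /= 2;
    }
    offs
}

fn main() {
    assert_eq!(
        sample_offsets(256),
        vec![0, 255, 128, 64, 192, 32, 96, 160, 224]
    );
    // Small blocks degrade to endpoint checks only.
    assert_eq!(sample_offsets(16), vec![0, 15]);
    println!("ok");
}
```

Logarithmically many probes per block, but with coverage at every scale down to 32 positions, which is why it catches boundary cases that a fixed 3-point probe misses.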
Replace DisjointSetsAsm (128-bit CAS-based union-find) with simple
label propagation for phase 2 equivalence class computation.

Algorithm: hook (parent[max] = min) + pointer jump, iterated until
convergence. Uses AtomicU32 with Relaxed ordering (free on x86) instead
of CMPXCHG16B. Sequential access within blocks enables hardware prefetch.

Converges in 4 rounds for MHC/C4 data:
  Round 1: 22.2B hooks (main work)
  Round 2: 930K hooks
  Round 3: 5.3K hooks
  Round 4: 0 hooks (converged)

Benchmark: 5m54s wall clock (same as previous), correct output.
Simpler code — eliminates sampling heuristics and exhaustive
verification passes. Foundation for block-level optimizations.
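
A sequential sketch of the hook + pointer-jump semantics (the PR's version runs these concurrently over `AtomicU32`; names here are illustrative):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hook: link the larger label under the smaller. Returns true if it did work.
fn hook(labels: &[AtomicU32], a: usize, b: usize) -> bool {
    let la = labels[a].load(Ordering::Relaxed);
    let lb = labels[b].load(Ordering::Relaxed);
    if la == lb {
        return false;
    }
    let (min, max) = (la.min(lb), la.max(lb));
    labels[max as usize].store(min, Ordering::Relaxed);
    true
}

/// Pointer jump: flatten each chain until every entry points at a root.
fn pointer_jump(labels: &[AtomicU32]) {
    for l in labels {
        loop {
            let p = l.load(Ordering::Relaxed);
            let gp = labels[p as usize].load(Ordering::Relaxed);
            if p == gp {
                break;
            }
            l.store(gp, Ordering::Relaxed);
        }
    }
}

fn main() {
    let labels: Vec<AtomicU32> = (0..6u32).map(AtomicU32::new).collect();
    // Equivalences from two alignment blocks: {0,1,2} and {3,4,5}.
    for &(a, b) in &[(0, 1), (1, 2), (3, 4), (4, 5)] {
        while hook(&labels, a, b) {
            pointer_jump(&labels);
        }
    }
    let resolved: Vec<u32> =
        labels.iter().map(|l| l.load(Ordering::Relaxed)).collect();
    assert_eq!(resolved, vec![0, 0, 0, 3, 3, 3]);
    println!("{resolved:?}");
}
```

Run sequentially this is correct; the PR later found that the concurrent version can hook through stale labels, which is what forced the revert described further down.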
Add hook_range() that operates on consecutive rank indices directly,
avoiding per-position curr_bv and rank_table lookups. Since positions
within each sequence get consecutive ranks, a block [a,a+L)->[b,b+L)
can be processed as hook_range(rank[a], rank[b], L) -- a tight loop
of array reads and conditional writes with sequential access.

Round 1 drops from 207s to 41s. Total transclosure from 5m54s to 3m04s.
Output identical to baseline (2727 nodes, 52047bp, 3686 links).

Wall clock: 9m00s baseline -> 3m04s (2.9x speedup)
Peak RSS: 12.4GB -> 6.5GB (48% less memory)
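
A hypothetical `hook_range` over consecutive ranks, showing why the tight loop beats per-position bitvector and rank-table lookups (single-threaded sketch; the real labels are atomic):

```rust
/// Hook labels[a..a+len) against labels[b..b+len) pairwise.
/// Consecutive ranks mean plain sequential array access: no curr_bv
/// or rank_table lookup inside the loop. Returns the hook count.
fn hook_range(labels: &mut [u32], a: usize, b: usize, len: usize) -> u64 {
    let mut hooks = 0;
    for i in 0..len {
        let (la, lb) = (labels[a + i], labels[b + i]);
        if la != lb {
            let (min, max) = (la.min(lb), la.max(lb));
            labels[max as usize] = min;
            hooks += 1;
        }
    }
    hooks
}

fn main() {
    let mut labels: Vec<u32> = (0..8u32).collect();
    // Block [0,4) aligns to [4,8): one sequential pass hooks all pairs.
    let hooks = hook_range(&mut labels, 0, 4, 4);
    assert_eq!(hooks, 4);
    assert_eq!(labels, vec![0, 1, 2, 3, 0, 1, 2, 3]);
    println!("{labels:?}");
}
```

Sequential reads and conditional writes like this are what hardware prefetchers reward, consistent with the 207s -> 41s round-1 drop reported above.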
…peedup)

After rounds with few hooks (<10K), do up to 20 pointer jump passes to
fully flatten the label tree. This eliminates the 3-4 extra rounds that
previously each cost ~10s for iitree traversal with near-zero hooks.

Round count: 6 -> 3. Total: 3m04s -> 2m17s.
Output identical (2727 nodes, 52047bp, 3686 links).
Use a flat Vec<Vec<usize>> adjacency list instead of HashSet<(usize, usize)>
for spanning tree lookup. Tree nodes have degree ~2-3, so linear scan of
a 2-3 element vec is faster than hashing a (usize, usize) tuple. Eliminates
hash computation per iitree callback during BFS.
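
The lookup-structure swap, sketched (illustrative names, not the PR's code):

```rust
/// Flat adjacency list for a spanning tree over sequences.
/// Degree is ~2-3, so a linear scan of a tiny Vec beats hashing
/// a (usize, usize) tuple on every iitree callback.
struct TreeAdj {
    adj: Vec<Vec<usize>>,
}

impl TreeAdj {
    fn new(n: usize, edges: &[(usize, usize)]) -> Self {
        let mut adj = vec![Vec::new(); n];
        for &(a, b) in edges {
            adj[a].push(b);
            adj[b].push(a);
        }
        TreeAdj { adj }
    }

    fn is_tree_edge(&self, a: usize, b: usize) -> bool {
        // 2-3 element scan: no hash computation, cache-friendly.
        self.adj[a].contains(&b)
    }
}

fn main() {
    let t = TreeAdj::new(4, &[(0, 1), (1, 2), (1, 3)]);
    assert!(t.is_tree_edge(1, 0));
    assert!(!t.is_tree_edge(0, 2));
    println!("ok");
}
```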
Remove unite_block, block_needs_unite, block_can_skip helpers and
DisjointSetsAsm import — all replaced by the LabelProp label
propagation approach. Also guard compute_spanning_tree for n_seqs <= 1.
Two-phase spanning tree transclosure: 37% faster, 33% less memory
The Shiloach-Vishkin label propagation had a race condition where
parent[max(la,lb)] = min(la,lb) could link unrelated positions when
la/lb were stale. This caused base mismatches in write_graph_chunk
on larger inputs (e.g., USP14 region with 527 sequences, 212kb).

Revert to DisjointSetsAsm (lock-free CAS-based union-find) which
is proven correct. Keep all other optimizations: spanning tree BFS,
orphan recovery, per-sequence iitree queries, adjacency list.

MHC (466 seqs): 4m12s, correct (2727 nodes, 52047bp)
USP14 (527 seqs): 20m00s, correct (10199 nodes, 236011bp) — was crashing
Revert label propagation: restore DisjointSetsAsm for correctness
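
A simplified parent-only sketch of a CAS-based union-find (the PR's DisjointSetsAsm packs more state into a 128-bit CAS; this is not its implementation). The key property restored by the revert: a parent is installed only if the slot is still a root, so stale reads retry instead of corrupting the forest, which is the failure mode of the raced label propagation.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

struct DisjointSets {
    parent: Vec<AtomicU32>,
}

impl DisjointSets {
    fn new(n: u32) -> Self {
        DisjointSets {
            parent: (0..n).map(AtomicU32::new).collect(),
        }
    }

    fn find(&self, mut x: u32) -> u32 {
        loop {
            let p = self.parent[x as usize].load(Ordering::Relaxed);
            if p == x {
                return x;
            }
            let gp = self.parent[p as usize].load(Ordering::Relaxed);
            // Opportunistic path halving; harmless if another thread won.
            let _ = self.parent[x as usize].compare_exchange(
                p, gp, Ordering::Relaxed, Ordering::Relaxed,
            );
            x = p;
        }
    }

    fn unite(&self, a: u32, b: u32) {
        loop {
            let (ra, rb) = (self.find(a), self.find(b));
            if ra == rb {
                return;
            }
            let (lo, hi) = (ra.min(rb), ra.max(rb));
            // Only succeeds if `hi` is still its own root; otherwise a
            // concurrent unite changed it, so re-find and retry.
            if self.parent[hi as usize]
                .compare_exchange(hi, lo, Ordering::Relaxed, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
        }
    }
}

fn main() {
    let ds = DisjointSets::new(5);
    ds.unite(0, 1);
    ds.unite(3, 4);
    ds.unite(1, 4);
    assert_eq!(ds.find(4), 0);
    assert_ne!(ds.find(2), ds.find(0));
    println!("ok");
}
```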