Conversation
…ithoutcigars raised a warning when PAF files have no CIGAR strings #61
The last change affected output (removing *,*,*... overlap cigars), and thus affected md5s.
Sanitize memory management and mmmulti
update memory mapping to use madvise as before
added a very safe test
fix segmentation fault with empty PAF files
…n_the_fasta Manage empty sequences in the FASTA and remove the mmmap_allocator dependency
This treats X and = like M, and checks the input sequences instead. A bug in a particular mapper (wfmash) tends to cause this, so this is just for testing until that's resolved.
distrust cigar mismatch characters
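A minimal sketch of the normalization described above, using a hypothetical `alignment_ops` helper (not the project's actual parser): `=` (sequence match) and `X` (mismatch) are consumed exactly like `M` while walking a CIGAR string.

```rust
// Hypothetical sketch: treat '=' and 'X' like 'M' when parsing a CIGAR
// string, since mismatch characters from the mapper are not trusted.
fn alignment_ops(cigar: &str) -> Vec<(u64, char)> {
    let mut ops = Vec::new();
    let mut len = 0u64;
    for c in cigar.chars() {
        if let Some(d) = c.to_digit(10) {
            len = len * 10 + d as u64;
        } else {
            // '=' and 'X' consume both query and target, exactly like 'M'
            let op = match c { '=' | 'X' => 'M', other => other };
            ops.push((len, op));
            len = 0;
        }
    }
    ops
}

fn main() {
    assert_eq!(alignment_ops("5=1X4M2I"),
               vec![(5, 'M'), (1, 'M'), (4, 'M'), (2, 'I')]);
}
```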
use jemalloc
Build artifacts were accidentally committed. The .gitignore only had seqwish-rs/target/ from the old directory structure before the Rust project was moved to the repository root.
…k lookup Three bugs in the transitive closure computation caused incorrect graphs:
1. ovlp_q.push() used the non-blocking ArrayQueue::push() with `let _ =`, which silently dropped overlaps when the queue was full.
2. The manager task never drained ovlp_q during the parallel work phase, so the 100,000-capacity queue filled almost immediately and all subsequent overlaps were lost.
3. todo_in.push() had the same silent-drop pattern.
Fix: workers now spin-retry on both ovlp_q and todo_in pushes (matching the C++ blocking semantics). The manager drains ovlp_q continuously during the work loop (matching the C++ manager-thread pattern). Queue sizes increased from 100,000 to 131,072 (matching the C++ 2 << 16).
Additionally, replace Rank9Sel::rank1() with a direct Vec<u32> rank lookup table. This turns each O(popcount) call into an O(1) array access, which is critical at scale (38.6B rank lookups in our test case).
Before fix: 4,378 segments (incorrect, lost overlaps). After fix: 628 segments (matches C++ seqwish output).
Performance: 46.7s Rust vs 52.4s C++ on the chr6 29M test region.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…p-drop fix: prevent silent overlap dropping in transclosure
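The blocking-push fix can be sketched as follows. The `Bounded` queue here is a hypothetical stand-in (the real code uses crossbeam's `ArrayQueue`), but it mirrors the try-push-returns-the-item semantics that made `let _ = q.push(..)` a silent drop.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

// Stand-in bounded queue with ArrayQueue-like semantics (sketch only).
struct Bounded<T> { cap: usize, q: Mutex<VecDeque<T>> }

impl<T> Bounded<T> {
    fn new(cap: usize) -> Self { Self { cap, q: Mutex::new(VecDeque::new()) } }
    // Non-blocking: hands the item back when full.
    // `let _ = try_push(..)` is the silent-drop bug described above.
    fn try_push(&self, item: T) -> Result<(), T> {
        let mut q = self.q.lock().unwrap();
        if q.len() == self.cap { Err(item) } else { q.push_back(item); Ok(()) }
    }
    // The fix: spin-retry until the push succeeds (blocking semantics),
    // while another task drains the queue.
    fn push_blocking(&self, mut item: T) {
        loop {
            match self.try_push(item) {
                Ok(()) => return,
                Err(back) => { item = back; std::hint::spin_loop(); }
            }
        }
    }
    fn pop(&self) -> Option<T> { self.q.lock().unwrap().pop_front() }
}

fn main() {
    let q = Bounded::new(2);
    q.push_blocking(1);
    q.push_blocking(2);
    // Queue full: try_push hands the item back instead of losing it.
    assert_eq!(q.try_push(3), Err(3));
    q.pop(); // a drain (the manager's job) makes room again
    q.push_blocking(3);
    assert_eq!(q.pop(), Some(2));
    assert_eq!(q.pop(), Some(3));
}
```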
The seq_id_cbv bitvector has size graph_length + 1, but the graph sequence is only graph_length bytes. When computing the last node's length, select(node_id + 1) returns None. The S-line code was skipping the node entirely, and the validation was using unwrap_or(node_start) giving length 0. Fix both to use the actual graph sequence length (seq_v_slice.len()) as the end sentinel, giving the correct last node length. This fixes "path step length mismatch" errors that occurred when a path visited the last node in the graph.
fix: correct last node length in GFA output
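The sentinel fix can be sketched like this, with a plain slice of node start offsets standing in for the `select()`-capable bitvector (helper names are hypothetical):

```rust
// Sketch: when the next-boundary lookup returns None — the last node —
// fall back to the graph sequence length instead of skipping the node
// or defaulting to the node's own start (length 0).
fn node_len(boundaries: &[usize], node_id: usize, seq_len: usize) -> usize {
    let start = boundaries[node_id];
    // boundaries.get(node_id + 1) plays the role of select(node_id + 1)
    let end = boundaries.get(node_id + 1).copied().unwrap_or(seq_len);
    end - start
}

fn main() {
    let boundaries = [0, 4, 9]; // three nodes starting at 0, 4, 9
    let seq_len = 12;           // graph sequence is 12 bytes long
    assert_eq!(node_len(&boundaries, 0, seq_len), 4);
    assert_eq!(node_len(&boundaries, 2, seq_len), 3); // last node: 12 - 9
}
```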
…oundary marking The previous implementation used read_volatile/write_volatile which is NOT atomic — concurrent threads setting different bits in the same u64 word could silently lose updates (lost-update race condition). This caused intermittent "path step length mismatch" errors in GFA output because dropped node boundaries meant paths partially traversed nodes. Replace with AtomicU64::fetch_or(mask, Ordering::Relaxed) which is a single atomic read-modify-write operation, eliminating the race.
64 threads set different bits in the same u64 word simultaneously, repeated 100 times. This would fail with the old read_volatile implementation due to lost-update races.
fix: use truly atomic fetch_or in AtomicBitVec
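A sketch of the pattern, with a `set_bit` helper standing in for the `AtomicBitVec` method; the `main` mirrors the 64-thread test described above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// fetch_or is a single atomic read-modify-write, so concurrent threads
// setting different bits of the same word cannot lose each other's
// updates (unlike a read_volatile/write_volatile pair).
fn set_bit(words: &[AtomicU64], i: usize) {
    words[i / 64].fetch_or(1u64 << (i % 64), Ordering::Relaxed);
}

fn main() {
    // 64 threads each set a distinct bit of the same u64 word.
    let words = [AtomicU64::new(0)];
    let words_ref = &words[..];
    std::thread::scope(|s| {
        for bit in 0..64 {
            s.spawn(move || set_bit(words_ref, bit));
        }
    });
    // Every bit survives; the volatile version could lose some.
    assert_eq!(words[0].load(Ordering::Relaxed), u64::MAX);
}
```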
Replace the single-pass BFS + union-find with a two-phase approach:
Phase 1: Fast BFS discovery using a maximum-weight spanning tree of sequence pairs (N-1 edges instead of N*(N-1)/2). Only follows spanning-tree alignment edges, discovering positions without collecting overlap entries. Includes orphan recovery via repeated linear scans to find positions only reachable through non-tree edges.
Phase 2: Linear scan of all iitree intervals for union-find. Uses block-level sampling (check 3 positions per block) to skip blocks where source and target are already in the same equivalence class. Iterates verification passes until convergence to catch sampling misses.
Also adds for_each_interval() to the IntervalTree trait for linear iteration over all stored intervals.
Benchmarked on the 466-sequence MHC/C4 region (HPRCv2):
- Wall clock: 9m00s -> 5m41s (37% faster)
- User time: 1969s -> 1293s (34% less CPU)
- Peak RSS: 12.4GB -> 8.3GB (33% less memory)
- Output identical (2727 nodes, 52047bp, 3686 links)
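Phase 1's tree construction can be sketched with a standard Kruskal-style maximum-weight spanning tree (an assumption; the commit does not spell out the exact algorithm used): sort pair edges by descending weight and keep an edge iff it joins two components, yielding N-1 tree edges.

```rust
// Union-find "find" with path halving (sketch).
fn find(parent: &mut Vec<usize>, mut x: usize) -> usize {
    while parent[x] != x {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    x
}

// Kruskal on (weight, seq_a, seq_b) edges, heaviest first: the kept
// edges form a maximum-weight spanning tree over the sequences.
fn max_spanning_tree(n_seqs: usize,
                     mut edges: Vec<(u64, usize, usize)>) -> Vec<(usize, usize)> {
    edges.sort_by(|a, b| b.0.cmp(&a.0));
    let mut parent: Vec<usize> = (0..n_seqs).collect();
    let mut tree = Vec::new();
    for (_w, a, b) in edges {
        let (ra, rb) = (find(&mut parent, a), find(&mut parent, b));
        if ra != rb {
            parent[ra] = rb;
            tree.push((a, b));
        }
    }
    tree
}

fn main() {
    // 3 sequences; pair (0,1) shares the most alignment, (0,2) the least.
    let edges = vec![(10, 0, 1), (7, 1, 2), (2, 0, 2)];
    let tree = max_spanning_tree(3, edges);
    assert_eq!(tree, vec![(0, 1), (1, 2)]); // N-1 = 2 edges kept
}
```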
… 1.3GB Replace the linear scan of a pre-collected all_intervals Vec with per-sequence iitree overlap queries for both orphan recovery and phase 2 union-find. Eliminates a 58M-entry Vec allocation (~1.4GB) while maintaining identical output. For workloads with multiple smaller components, this also reduces scan cost proportionally to component size.
Benchmark (466-seq MHC, same workload):
- Wall clock: 5m41s -> 5m57s (comparable, within noise)
- Peak RSS: 8.3GB -> 7.0GB (1.3GB saved)
- Output: identical (2727 nodes, 52047bp, 3686 links)
- Remove dead handle_range function and OverlapAtomicQueue type alias (replaced by inline discovery logic in explore_overlaps_discovery)
- Remove unused Mutex import
- Remove duplicate HashMap import in compute_spanning_tree
- Fix unused-variable warning (_weight in spanning tree loop)
- Extract find_component_sequences helper (was duplicated 2x)
- Extract unite_block and block_needs_unite helpers (was duplicated 3x)
- Remove redundant n_seqs binding in orphan recovery
Replace 3-point block sampling with binary subdivision sampling (check at offsets 0, L-1, L/2, L/4, 3L/4, ... down to step size 32). Catches more boundary cases than 3-point, reducing verification round recoveries from ~10K to ~2.8K blocks. Verification pass remains exhaustive for correctness. Extract block_can_skip helper with the logarithmic sampling logic.
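The subdivision order can be sketched as follows (hypothetical `sample_offsets` helper; the step-size cutoff of 32 is taken from the description above):

```rust
// Binary subdivision sampling: probe offsets 0 and L-1, then the odd
// multiples of each halving step (L/2; L/4, 3L/4; L/8, 3L/8, ...),
// stopping once the step size drops below 32.
fn sample_offsets(len: usize) -> Vec<usize> {
    let mut offs = vec![0];
    if len > 1 { offs.push(len - 1); }
    let mut step = len / 2;
    while step >= 32 {
        let mut o = step;
        while o < len {
            offs.push(o);      // odd multiples of `step` only: no repeats
            o += step * 2;
        }
        step /= 2;
    }
    offs
}

fn main() {
    // For a 256bp block: ends first, then midpoints at ever finer scale.
    assert_eq!(sample_offsets(256),
               vec![0, 255, 128, 64, 192, 32, 96, 160, 224]);
}
```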
Replace DisjointSetsAsm (128-bit CAS-based union-find) with simple label propagation for the phase 2 equivalence-class computation. Algorithm: hook (parent[max] = min) + pointer jump, iterated until convergence. Uses AtomicU32 with Relaxed ordering (free on x86) instead of CMPXCHG16B. Sequential access within blocks enables hardware prefetch.
Converges in 4 rounds on the MHC/C4 data:
- Round 1: 22.2B hooks (main work)
- Round 2: 930K hooks
- Round 3: 5.3K hooks
- Round 4: 0 hooks (converged)
Benchmark: 5m54s wall clock (same as previous), correct output. Simpler code: eliminates the sampling heuristics and exhaustive verification passes. Foundation for block-level optimizations.
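A single-threaded sketch of the hook + pointer-jump rounds (the real implementation uses `AtomicU32` with `Relaxed` ordering across threads; shown sequentially here for clarity):

```rust
// Hook: link the larger of two labels under the smaller one.
// Returns true if anything changed (used to detect convergence).
fn hook(labels: &mut [u32], a: usize, b: usize) -> bool {
    let (la, lb) = (labels[a], labels[b]);
    if la == lb { return false; }
    let (lo, hi) = (la.min(lb), la.max(lb));
    labels[hi as usize] = lo; // parent[max] = min
    true
}

// Pointer jump: flatten every label chain to its representative.
fn pointer_jump(labels: &mut [u32]) {
    for i in 0..labels.len() {
        let mut l = labels[i];
        while labels[l as usize] != l { l = labels[l as usize]; }
        labels[i] = l;
    }
}

fn main() {
    let mut labels: Vec<u32> = (0..6).collect();
    // Overlap edges: positions {0,2,4} and {1,3} are equivalent.
    let edges = [(0, 2), (2, 4), (1, 3)];
    loop {
        let mut hooked = false;
        for &(a, b) in &edges { hooked |= hook(&mut labels, a, b); }
        pointer_jump(&mut labels);
        if !hooked { break; } // a round with zero hooks = converged
    }
    assert_eq!(labels, vec![0, 1, 0, 1, 0, 5]);
}
```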
Add hook_range() that operates on consecutive rank indices directly, avoiding per-position curr_bv and rank_table lookups. Since positions within each sequence get consecutive ranks, a block [a,a+L)->[b,b+L) can be processed as hook_range(rank[a], rank[b], L): a tight loop of array reads and conditional writes with sequential access.
Round 1 drops from 207s to 41s; total transclosure time from 5m54s to 3m04s. Output identical to baseline (2727 nodes, 52047bp, 3686 links).
- Wall clock: 9m00s baseline -> 3m04s (2.9x speedup)
- Peak RSS: 12.4GB -> 6.5GB (48% less memory)
…peedup) After rounds with few hooks (<10K), do up to 20 pointer jump passes to fully flatten the label tree. This eliminates the 3-4 extra rounds that previously each cost ~10s for iitree traversal with near-zero hooks. Round count: 6 -> 3. Total: 3m04s -> 2m17s. Output identical (2727 nodes, 52047bp, 3686 links).
Use a flat Vec<Vec<usize>> adjacency list instead of HashSet<(usize, usize)> for spanning tree lookup. Tree nodes have degree ~2-3, so linear scan of a 2-3 element vec is faster than hashing a (usize, usize) tuple. Eliminates hash computation per iitree callback during BFS.
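The swap can be sketched as follows (hypothetical helper names): build the flat adjacency list once from the tree edges, then answer membership queries with a linear scan instead of a tuple hash.

```rust
// Flat adjacency list for the spanning tree: tree nodes have degree
// ~2-3, so scanning a tiny Vec beats hashing a (usize, usize) pair.
fn build_adjacency(n: usize, tree_edges: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut adj: Vec<Vec<usize>> = vec![Vec::new(); n];
    for &(a, b) in tree_edges {
        adj[a].push(b);
        adj[b].push(a);
    }
    adj
}

// Replaces HashSet<(usize, usize)>::contains in the BFS callback.
fn is_tree_edge(adj: &[Vec<usize>], a: usize, b: usize) -> bool {
    adj[a].contains(&b) // linear scan of a 2-3 element vec
}

fn main() {
    let adj = build_adjacency(4, &[(0, 1), (1, 2), (1, 3)]);
    assert!(is_tree_edge(&adj, 1, 3));
    assert!(is_tree_edge(&adj, 3, 1)); // stored in both directions
    assert!(!is_tree_edge(&adj, 0, 2));
}
```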
Remove unite_block, block_needs_unite, block_can_skip helpers and DisjointSetsAsm import — all replaced by the LabelProp label propagation approach. Also guard compute_spanning_tree for n_seqs <= 1.
Two-phase spanning tree transclosure: 37% faster, 33% less memory
The Shiloach-Vishkin label propagation had a race condition: parent[max(la,lb)] = min(la,lb) could link unrelated positions when la/lb were stale. This caused base mismatches in write_graph_chunk on larger inputs (e.g., the USP14 region with 527 sequences, 212kb). Revert to DisjointSetsAsm (lock-free CAS-based union-find), which is proven correct. Keep all other optimizations: spanning-tree BFS, orphan recovery, per-sequence iitree queries, adjacency list.
MHC (466 seqs): 4m12s, correct (2727 nodes, 52047bp)
USP14 (527 seqs): 19m60s, correct (10199 nodes, 236011bp); was crashing before
Revert label propagation: restore DisjointSetsAsm for correctness
…d configurability
Summary
Complete rewrite of seqwish in Rust, providing a modern, memory-safe implementation with identical output to the C++ version.
Key changes:
- seqwish (versions 0.1.1-0.1.3)
- cpp/ directory for reference
Architecture:
Testing:
Test plan
cargo test --release