feat: integrate syng index for syncmer-based homology queries #162
Open
added 9 commits
April 8, 2026 17:45
…r (sync-impl-c)
- Add syng as git submodule under vendor/syng
- Add cc build dependency and extend build.rs to compile 11 syng C files into libsyng.a
- Create src/syng_ffi.rs with raw extern "C" declarations for SyngBWT, KmerHash, Seqhash, Rskip, SyncmerSet, and ONElib functions
- Create src/syng.rs with safe SyngIndex wrapper, SyncmerParams, SyngNameMap, HomologousInterval types, and Drop implementation
- Add module declarations to src/lib.rs
- All existing tests pass, new unit tests verify create/drop cycle
… (sync-test-c)
- FFI smoke tests: Seqhash, KmerHash, SyngBWT create/destroy individually
- SyncmerParams: default values and to_c() conversion
- SyngIndex: create/drop, custom params, pointer accessors
- SyngNameMap: new() and Default trait
…d (sync-impl-index)
- SyngIndex::build() progressively constructs GBWT from sequence iterator (syncmer extraction, KmerHash population, forward + revcomp paths)
- SyngNameMap::save/load for .syng.names text sidecar
- SyngIndex::save/load for .1khash + .1gbwt + .syng.names roundtrip
- impg syng CLI: --agc or --fasta input, -o prefix output, syncmer params
- C helper wrappers for static-inline seqhash destroy functions
- Proper Seqhash cleanup via impg_seqhashDestroy instead of raw libc::free
…c-test-index)
- Round-trip tests: build → save → load → verify path count, names, lengths
- SyngNameMap serialization with PanSN format and special characters
- Syncmer parameter variations: different k/w/seed produce different khash
- Edge cases: empty sequence, shorter-than-syncmer, exactly syncmer length, mixed
- CLI integration: run `impg syng -f <fasta> -o <prefix>`, verify output files
- Load error case: missing files return Err
…-query)
Phase 3 of syng integration: GBWT querying pipeline.
- Add GbwtPathStart struct to SyngNameMap for per-path start info (start_node, start_count, num_syncmers), needed for query path walking
- Modify SyngIndex::build to capture GBWT path start info during forward path construction
- Implement SyngIndex::query_region(): walks query genome's GBWT path to find syncmer nodes in [start,end], then walks all other genome paths to find shared nodes, merges and pads intervals
- Add --syng <prefix> flag to query CLI (mutually exclusive with -a)
- Add --syng-padding flag (default 120bp = 2× syncmer length)
- Support bed, gfa, and fasta output formats for syng queries
- Add dispatch_gfa_engine_with_seq_index() for syng→graph engine wiring via SeqIndexWrapper (minimal ImpgIndex impl using SequenceIndex)
- Add impg_syng_suppress_debug FFI helper to silence C debug output
- Backward-compatible .syng.names format (6 columns, reads old 3-column)
- 5 new query_region tests (basic, unknown genome, padding, missing path info, save/load roundtrip)
…-test-query)
Adds 19 new tests covering:
- Query completeness with known ground truth (shared backbone detection)
- Interior coverage (no false negatives for shared regions)
- Boundary padding (0/60/120bp, monotonic growth, genome length clamping)
- Interval merging (overlap, adjacency, multi-genome, edge cases)
- Edge cases (isolated regions, full-sequence query, single-sequence index, out-of-range, zero-width, identical sequences, unknown genome)
- CLI integration (syng build + query --syng for BED/GFA output, --syng/-a mutual exclusivity)
…nc-impl-gbwt) Add SyngIndex::build_region_gbwt() method that builds a region-specific GBWT from fetched sequences — creates fresh KmerHash/SyngBWT, extracts syncmers, builds forward+revcomp paths, and writes syng-compatible .1khash + .1gbwt files. Add 'gbwt' to query command's output format enum. Works with both --syng and -i input modes. Requires -O prefix and --sequence-files.
…test-gbwt)
14 new tests covering Phase 4 validation:
- Region-specific GBWT output from query results
- Region GBWT loadable as SyngIndex
- Region nodes subset of full index
- Single genome and very small region edge cases
- Output prefix with nested directories and nonexistent dirs
- ONEcode magic byte verification for .1gbwt and .1khash
- Round-trip: query → GBWT → query consistency
- CLI integration: --syng + -o gbwt, PAF-based + -o gbwt
- CLI validation: -o gbwt requires -O prefix
…dd thread-safety and clippy cleanup (.verify-sync-verify-integration)
- Fix heap-buffer-overflow in seqhash: encode sequences as 0/1/2/3 (not ASCII a/c/g/t) since patternRC[4] uses raw byte values as array indices
- Use impg_seqhashCreateSafe for thread-safe seqhash creation
- Add mutex serialization for syng tests (C library has thread-unsafe globals)
- Fix clippy warnings: Error::other(), unused import, needless_range_loop
- Add crate-level allows for pre-existing too_many_arguments/type_complexity
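The seqhash fix above hinges on handing the C side numeric codes rather than ASCII letters, because a 4-entry table indexed by a raw byte such as b'a' (97) reads out of bounds. A minimal sketch of such an encoder follows. The function names, the a=0/c=1/g=2/t=3 ordering, and the fallback for non-ACGT bytes are assumptions for illustration, not the wrapper's actual code.

```rust
/// Map an ASCII nucleotide to a numeric code in 0..=3.
/// The a/c/g/t -> 0/1/2/3 ordering is an assumption of this sketch.
fn encode_base(b: u8) -> Option<u8> {
    match b.to_ascii_lowercase() {
        b'a' => Some(0),
        b'c' => Some(1),
        b'g' => Some(2),
        b't' => Some(3),
        _ => None, // e.g. 'N'; the real handling is not stated in the commit
    }
}

/// Encode a whole sequence. Non-ACGT bytes are replaced by `fallback`
/// (hypothetical policy) so every output byte is a valid index into a
/// 4-entry lookup table.
fn encode_sequence(seq: &[u8], fallback: u8) -> Vec<u8> {
    assert!(fallback < 4, "fallback must be a valid 2-bit code");
    seq.iter()
        .map(|&b| encode_base(b).unwrap_or(fallback))
        .collect()
}
```

Calling `encode_sequence(b"ACGT", 0)` yields the bytes 0, 1, 2, 3, each of which is safe to use as an index into a table of length 4.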
added 7 commits
April 13, 2026 16:50
# Conflicts: # src/lib.rs
Implement SyngImpgWrapper that adapts SyngIndex to the ImpgIndex trait, allowing partition_alignments() to work transparently with syng-based queries. No changes to partition logic itself.
`Decompressor::get_contig()` in ragc-core silently returns an empty Vec when called after `list_contigs_names_only()` because it checks contig count instead of `are_details_loaded()`, skipping the required detail reload. This produced a fully-populated `.syng.names` file but empty `.1gbwt` and `.1khash`, which then crashed at load time with "no Vertex objects in .1gbwt file".
Switch to `get_contig_length` + `get_contig_range(0, length)` (same pattern as src/agc_index.rs), which correctly reloads details.
Also add tests/test_syng_integration.rs covering:
- impg syng --agc produces a non-empty GBWT (regression test for the exact bug above)
- impg syng --agc round-trips: built index loads and queries correctly
- impg syng -f (FASTA) produces a non-empty GBWT
- impg partition --syng runs end-to-end and emits non-empty BED
The AGC build test asserts .1gbwt > 2000 bytes, which would have caught this bug immediately: the broken output was exactly 1362 bytes.
The vendored syng hash table had a bug where the REMOVED sentinel value collided with hashInt(1), so any startCount() call for GBWT node 1 (or -1) silently returned 0 on every invocation instead of incrementing. Two paths both starting at node 1 would both record start_count=0 and collide; on query, the C layer would die with:
FATAL ERROR: syngBWTpathStartOld startNode 1 count 0 >= startCount 0
This manifests on any pangenome with byte-identical contigs because the first k-mer added to the hash gets index 1 and becomes a GBWT start node. On yeast235 the chrIII sequences from samples AAA and SGDref are identical, which triggered the crash 100% of the time.
The fix patches the vendored syng hash.c to use a middle-of-range REMOVED sentinel (0x4000000000000000) that hashInt() never produces for small-magnitude integer keys. Bumping the vendor/syng submodule pointer.
Also adds regression tests:
- tests/test_syng_startcount.rs: isolates startCount increment behavior at the FFI level for node 1, node 5, short/long paths, with/without RC
- tests/test_syng_integration.rs::test_syng_identical_sequences_build_and_query: reproduces the yeast235 scenario (two byte-identical FASTA records) and verifies start_counts are distinct and query_region succeeds on both
Previously the vendor/syng submodule pointed at richarddurbin/syng but the pinned commits (23b6547, 1dbfd58, ce46949) were local-only and did not exist on that remote, so fresh clones with submodules failed CI with:
"fatal: remote error: upload-pack: not our ref ..."
Create https://github.com/pangenome/syng as a fork of richarddurbin/syng with an impg-integration branch carrying our patches. Point .gitmodules at it so CI and downstream clones can actually fetch the submodule.
Contents of pangenome/syng:impg-integration (all commits ahead of upstream b659846):
- 23b6547 impg FFI helper wrappers for static-inline functions
- 1dbfd58 impg_syng_suppress_debug helper for FFI
- ce46949 hash REMOVED sentinel collides with hashInt(1) (genuine bug fix)
- d423521 IMPG_PATCHES.md tracking all divergences from upstream
IMPG_PATCHES.md documents each patch and includes the sync workflow for merging future upstream changes. The hash.c fix should be upstreamed to richarddurbin/syng.
The pggb:X / seqwish:X engine spec carries a partition window size. Previously the syng+gfa path ran a single flat engine invocation over the entire query range, because X was only consulted as a boolean skip_normalize flag — so large regions produced enormous whole-chromosome context spans from query_region and overwhelmed the aligner. Now when partition_size is set, we split [range_start, range_end) into per-window partitions, call query_region per window (which returns tight, small-scale intervals), and run the partitioned GFA pipeline that does a fresh alignment + graph induction per partition plus a single final gfaffix normalization — structurally mirroring the alignment path's output_results_gfa_partitioned. SyngImpgWrapper gains seq_index() and syng_padding() accessors so the main.rs path can share it instead of threading the SequenceIndex and padding separately. Adds test_query_syng_gfa_subwindow_splitter integration test that drives a 20 kbp query through pggb:5000 and verifies the per-window log lines appear in stderr.
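The per-window split described above can be sketched as follows. This is only an illustration of cutting [range_start, range_end) into consecutive windows of at most X bp; `split_windows` is a hypothetical name, and how impg treats a short final window is an assumption here.

```rust
/// Split the half-open range [start, end) into consecutive windows of at
/// most `window` bp. The final window may be shorter than `window`
/// (an assumption of this sketch).
fn split_windows(start: u64, end: u64, window: u64) -> Vec<(u64, u64)> {
    assert!(window > 0, "window size must be positive");
    let mut out = Vec::new();
    let mut s = start;
    while s < end {
        // Clamp the window end to the range end; saturating_add avoids overflow.
        let e = s.saturating_add(window).min(end);
        out.push((s, e));
        s = e;
    }
    out
}
```

Each resulting pair would then be handed to a per-window query, so that no single engine invocation sees the whole range.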
The previous query_region walked every forward path end-to-end on
every call — O(total pangenome length) per query, which dominates
runtime on large inputs. FastLocate gives O(k log r) per query
(k = query nodes, r = total BWT runs) by building an r-index locate
structure over a classical GBWT.
Algorithm: port of jltsiren/gbwt's C++ FastLocate (fast_locate.cpp)
to Rust on top of the `gbz` crate (which despite the crate name is
jltsiren/gbwt-rs). The r-index formulation is from Gagie, Navarro,
Prezza (JACM 2020) as adapted to multi-string BWTs in Sirén,
Garrison, Novak, Paten, Durbin (Bioinformatics 2020).
Key structural difference from a naive port: runs whose successor is
ENDMARKER must be counted as one logical run per position (not per
physical RLE run), mirroring C++ LFLoop's convention. Without this,
`last.predecessor(prev)` fails to find a valid tail for head samples
of nodes near the end of short sequences.
Integration in SyngIndex:
- build_fast_locate walks each forward syng path, inserts the
encoded node sequence into gbz::GBWTBuilder(bidirectional=true),
records per-node bp positions into a flat BpOffsets sidecar, then
builds FastLocate over the resulting gbz GBWT.
- SyngIndex::build calls build_fast_locate eagerly (warn on failure).
- query_region uses FastLocate::decompress_da when available,
falling back to the old walk-every-path implementation otherwise.
Serialization: new {prefix}.syng.locate sidecar = gbz::GBWT
(simple-sds Serialize) + FastLocate (custom little-endian framing)
+ BpOffsets (custom little-endian framing). Loaded opportunistically
if the file exists.
New dependencies: gbz 0.6 and simple-sds 0.4 (both from crates.io).
Tests:
- fast_locate unit: tiny multi-share, 8-path larger, single path,
identical paths, save/load roundtrip — all compared against a
ground-truth walk via gbz's own sequence iterator.
- syng: test_query_region_fast_locate_parity (fast vs slow result
equivalence), test_syng_save_load_with_fast_locate (full disk
roundtrip).
- Integration: test_syng_integration.rs passes end-to-end with the
fast path (350s subwindow splitter).
Also reverts a prior broken attempt to add r-index locate on the
C side of syng (from vendor/syng and syng_ffi declarations) — the
classical r-index formula does not apply to syng's rskip BWT, so
this is done in Rust on top of a separate gbz GBWT instead.
f12e003 to
f7fe289
Compare
added 12 commits
April 15, 2026 16:19
test_query_syng_gfa_subwindow_splitter was running FastGA + seqwish + gfaffix per sub-window, which took ~6 min locally and didn't finish in reasonable time on GitHub's 2-vCPU runners (1+ hour in Test step). The regression it checks is engine-agnostic: it verifies that pggb:X / seqwish:X / poa:X is interpreted as a sub-window size (not a boolean flag), which is visible from the per-window log lines emitted BEFORE the engine runs.
Three changes:
- Switch from `seqwish:5000` to `poa:1000`. POA skips the FastGA alignment + seqwish transclosure chain entirely and runs a single partial-order alignment per sub-window. The sub-window loop and its log emission are identical, so the regression assertion still fires. 1000 bp is impg's minimum allowed partition size.
- Shrink the FASTA from 15 kbp + 2 kbp tails (34 kbp total) to 3 kbp + 500 bp tails (7 kbp total), and shrink the query range from 15 kbp to 3 kbp. That keeps the 3-sub-window assertion intact while dropping the per-run pipeline cost.
- Drop the secondary "GFA has > 100 bytes" assertion. The primary sub-window-count assertion is the regression test; downstream engine success is checked by the existing pipeline tests in test_pipeline_integration.rs.
Local timing: test_query_syng_gfa_subwindow_splitter now 0.49s (vs 346.96s previously, ~700x speedup). Full test_syng_integration suite now 0.80s (vs 350s). Full release serial test suite now ~2s.
The AGC syng build path formatted sequence names as `contig@sample`, but PanSN-style contig names already embed the sample (e.g. `S288C#0#chrIV`), so the result was `S288C#0#chrIV@S288C#0` — a redundant suffix that made `query -r S288C#0#chrIV:...` fail with "genome not found". Use the contig name directly instead.
Outlines the per-hop pipeline: syng one-hop seed → fetch sequences → local pairwise alignment (FastGA/sweepga) → build in-memory ImpgIndex from PAFs → re-query through implicit graph with cigar-based coordinate projection → precise boundaries for next hop. Key insight: no explicit graph (no seqwish/GFA) — the implicit graph IS the PAFs, which impg already knows how to query through. Realignment at each hop prevents compounding syncmer-resolution slop. See notes/SYNG_TRANSITIVE_DESIGN.md for full design, open questions, and integration plan. Work continues on PR #162.
Commit b7a9323 dropped the `@sample` suffix when naming AGC-imported contigs. That was correct for PanSN-encoded names (`sample#haplotype#contig`), which already embed the sample — but broke when contigs are raw names like `chr1` that appear across multiple samples: all three collapse to the same `name_to_path` entry and two of them are silently lost. Only append `@sample` when the contig name doesn't already contain `#` (i.e. isn't PanSN-encoded). Restores test_syng_agc_roundtrip_query to passing without regressing the PanSN path.
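The naming rule in this commit is small enough to sketch directly. The function name is hypothetical; the rule itself (append `@sample` only when the contig name contains no `#`) is taken from the message above.

```rust
/// Build the name under which an AGC-imported contig is stored.
/// PanSN-style names (containing '#') already embed the sample and are
/// used as-is; raw names such as "chr1" get an "@sample" suffix so that
/// the same contig name from different samples does not collide.
fn syng_sequence_name(contig: &str, sample: &str) -> String {
    if contig.contains('#') {
        contig.to_string()
    } else {
        format!("{contig}@{sample}")
    }
}
```

With this rule, `S288C#0#chrIV` stays queryable under its own name, while `chr1` from three samples yields three distinct entries.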
Syng's syncmerIterator reports the first syncmer at wherever the min k-mer lands in the initial window — anywhere in [0, w+k). walk_path was hardcoding the first syncmer's accumulated position to 0 and returning relative offsets. Callers (query_region, fast-locate bp_offsets, anchor fetch in boundary realignment) treated those as absolute sequence coordinates, introducing a systematic shift equal to the first-syncmer offset on every query. Store the first-syncmer absolute position in GbwtPathStart at build time (from syncmers[0].1), bump the .syng.names file format to 7 columns, and anchor walk_path's bp accumulator there. Older 6-column files load with first_syncmer_pos=0 and behave as before (still buggy for anchors; a fresh rebuild corrects).
Introduce Anchor + HomologousIntervalWithAnchors; add query_region_with_anchors that emits per-hit shared-syncmer positions and distinguishes forward vs RC homology. The fast-locate fast path previously collapsed every syncmer visit to the forward-orientation GBWT node via unsigned_abs(), silently dropping every reverse-orientation visit. Preserve the sign from walk_path by setting the low bit of the GBWT encoded node id so decompress_da(2*N + 1) returns actual reverse-orientation visits. query_region_with_anchors queries BOTH orientations per query node, tags anchors by (query_orient XOR target_orient), and groups results per (target_path, strand) so forward and RC intervals stay separate (they describe distinct homologies, not the same interval reported twice). query_region is now a thin wrapper that drops anchors. merge_intervals is replaced by merge_intervals_with_anchors which unions + de-dupes the anchor list across merged intervals. Unit tests updated accordingly.
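A rough sketch of the orientation handling described above: a signed node id is mapped to 2*N for forward and 2*N + 1 for reverse, and the anchor strand is the XOR of the query and target orientations. The function names and the choice of `i64` for the signed id are assumptions; only the 2*N + 1 encoding and the XOR rule come from the commit message.

```rust
/// Encode a signed node id: forward visits map to 2*N, reverse visits to
/// 2*N + 1, so the low bit carries the orientation instead of being
/// discarded by an absolute-value conversion.
fn encode_node(signed_id: i64) -> u64 {
    let n = signed_id.unsigned_abs();
    2 * n + u64::from(signed_id < 0)
}

/// Inverse of `encode_node`: returns (node id, is_reverse).
fn decode_node(encoded: u64) -> (u64, bool) {
    (encoded / 2, encoded & 1 == 1)
}

/// Strand of an anchor: '-' when exactly one of the two visits is in
/// reverse orientation (query_orient XOR target_orient), '+' otherwise.
fn anchor_strand(query_is_reverse: bool, target_is_reverse: bool) -> char {
    if query_is_reverse ^ target_is_reverse {
        '-'
    } else {
        '+'
    }
}
```

Two reverse visits therefore describe forward homology, which is why both orientations have to be queried per node before tagging.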
New impg::syng_transitive module implements the multihop syng-seeded transitive query described in notes/SYNG_TRANSITIVE_DESIGN.md.
For each syng homolog, resolve its two fuzzy edges to base-pair precision using BiWFA (lib_wfa2 MemoryMode::Ultralow) on small anchor-flanked windows: no full in-memory Impg, no subprocess aligner. Multihop iteration re-seeds each hop from the refined intervals of the previous one so slop doesn't compound across hops.
Strand-aware: forward homology uses the anchor pair directly; RC homology fetches the target window in reverse-order bounds, reverse-complements it, aligns, and projects offsets from the left anchor's forward anchor point (t_pos + syncmer_len) downward.
Flanks each syng query by ANCHOR_FLANK_BP=512 on both sides so each edge has a candidate anchor on either side; falls back to the syncmer-resolution bound when a flank is missing. Skips realignment for anchor gaps > MAX_REALIGN_WINDOW_BP=2048 (where the cost model of local realignment inverts).
Integration tests:
- test_syng_boundary_realign_tightens_edges: on an identical backbone, refined edges snap to exact query coords (where the previous syncmer-resolution output was padded by ~150bp).
- test_syng_rc_homolog_end_to_end: on an inversion fixture, verifies strand='-' is reported, refined intervals overlap the expected RC window by >=200bp, and RC'd target bytes share a >=30bp exact-match run with query bytes. Coordinate precision is approximate due to inherent RC-syncmer sparsity (~3-5 RC-shared anchors per kb vs ~35 per kb forward).
impg query --syng now runs the BiWFA boundary-realignment transitive pipeline by default for bed/fasta/gbwt output. Each homolog's syncmer-resolution edges are snapped to base-pair precision; under --transitive with --max-depth > 1 the refined intervals feed back into the next syng hop. Add --syng-raw (debug-only) to preserve the previous pass-through behaviour that emits raw syncmer-resolution intervals without realignment — useful for inspecting what syng's node graph reports before any refinement is applied. GFA output stays on raw intervals for now: its partitioning pipeline does its own per-partition alignment downstream, and the interaction with boundary realignment needs its own design pass. Boundary realignment needs a UnifiedSequenceIndex for edge-window fetches. Sequence files are now required for bed output unless --syng-raw is set — error message points users to either --sequence-files/--sequence-list or --syng-raw.
- notes/SYNG_TRANSITIVE_DESIGN.md: pivot from in-memory-Impg + FastGA/WFMASH full-region realignment to BiWFA boundary-only realignment. Document the per-hop pipeline, strand handling, anchor selection, resolved/outstanding open questions, and the known limitation on RC coordinate precision.
- examples/syng_probe.rs: diagnostic tool that compares syng's C iterator syncmer positions against walk_path output. Served as the verification step for the walk_path absolute-coord fix; keep it in the tree for future syng-coordinate debugging.
- .gitignore: add .tmp*, seqwish temp dirs, .workgraph.1/, .claude/ so syng/seqwish/FastGA scratch files from test runs don't clutter git status.
Two related bugs, both exposed by:
impg query --syng yeast235 -r S288C#0#chrIV:409000-409500 \
-o gfa --gfa-engine pggb -d 1000
1. `-o gfa` was exempt from boundary realignment — the previous
commit kept it on raw syncmer-resolution intervals with the
reasoning "the partitioning pipeline does its own alignment
downstream." That was wrong: fragmented short intervals feed
straight into the GFA partitioned / flat pipelines and produce a
fragmented graph. Wire both GFA sub-paths (partitioned and flat)
through impg::syng_transitive::query_transitive when --syng-raw
isn't set, just like bed/fasta/gbwt do.
2. `-d <merge_distance>` (min_distance_between_ranges) was silently
a no-op in the syng path. Add a merge_distance: u64 parameter to
syng_transitive::query_transitive and one_hop, plumb the value
through from query.transitive_opts.min_distance_between_ranges in
main.rs. Inside one_hop, run bedtools-style distance-merge on the
padded syncmer hits before boundary realignment via a new
distance_merge_anchored helper (anchors from merged intervals are
unioned and deduplicated). Distance 0 is the existing default =
overlap-only merge.
Adds test_syng_query_reconstructs_homology_with_diffs: three genomes
sharing a 3kb region with scattered SNPs and a 10bp indel. Query
genome_a[500..2500] returns exactly one interval per target with
refined edges within single-digit bp of biological truth (the indel
correctly shifts genome_c's right edge to 2490).
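The bedtools-style distance-merge with anchor union and de-duplication could look roughly like the sketch below. The `Interval` struct, the `(query_pos, target_pos)` anchor layout, and the function name are illustrative assumptions; the real distance_merge_anchored helper may differ.

```rust
/// A target interval with the shared-syncmer anchors that support it.
/// The field layout is hypothetical, chosen for illustration.
#[derive(Debug, Clone, PartialEq)]
struct Interval {
    start: u64,
    end: u64, // half-open
    anchors: Vec<(u64, u64)>, // (query_pos, target_pos)
}

/// Merge intervals whose gap is at most `d` bp. With d = 0 only
/// overlapping or book-ended intervals merge. Anchors of merged
/// intervals are unioned and de-duplicated.
fn distance_merge(mut ivs: Vec<Interval>, d: u64) -> Vec<Interval> {
    ivs.sort_by_key(|iv| (iv.start, iv.end));
    let mut out: Vec<Interval> = Vec::new();
    for iv in ivs {
        if let Some(prev) = out.last_mut() {
            if iv.start <= prev.end.saturating_add(d) {
                prev.end = prev.end.max(iv.end);
                prev.anchors.extend(iv.anchors);
                continue;
            }
        }
        out.push(iv);
    }
    for iv in &mut out {
        iv.anchors.sort_unstable();
        iv.anchors.dedup();
    }
    out
}
```

In this sketch the merge is meant to run once per (path, strand) group, so intervals from different targets never reach the same call.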
…ection
My earlier boundary-realignment implementation used BiWFA on
anchor-flanked windows per-edge per-homolog — which on real
pangenome data (e.g. yeast235 with ~300 haplotypes per region) was
spending most of its time on thousands of AGC sequence fetches and
small BiWFA runs. A 2kb query took >10 min wall time before being
killed.
For the intended use case — closely-related genomes — the shared
syncmer anchors already encode the homology structure. Linear
extrapolation from the innermost anchors to the user's query edges
gives coordinate precision within the local indel burden (single-digit
bp on <5%-divergent genomes), matching what PAF-based queries produce
from their stored CIGARs.
Replace refine_boundaries with project_query_to_target, a pure
coordinate arithmetic function. No AGC fetches. No BiWFA. Per-homolog
cost drops from O(fetches × aligner overhead) to O(anchor list size).
Wall time on the yeast235 example collapses from >10min (killed) to
~1s for the refinement step.
Drop resolve_edge_via_biwfa, ANCHOR_FLANK_BP, MAX_REALIGN_WINDOW_BP,
the thread-local BiWFA aligner, with_biwfa_edit_aligner,
project_query_offset_via_cigar, flanking_anchors, the query-range
widening in one_hop, and all associated tests. Net -400 lines.
Also fix a separate wiring bug: syng_merge_distance was reading from
transitive_opts.min_distance_between_ranges (long-only flag, default
10) instead of QueryOpts::merge_distance (short `-d`, default 0 —
the flag CLI users actually type). Switch to effective_merge_distance()
so `-d 10000` works as documented. Without this, my distance-merge was
effectively a no-op on CLI usage.
Verified on yeast235 --agc:
impg query --syng yeast235 --sequence-files yeast235.agc \
-r S288C#0#chrIV:408000-410000 -o bed -d 10000
Before: 746 bed rows, up to 7 fragments per haplotype on same strand.
After: 526 bed rows, max 2 rows per (sample,hap,chrom) — one `+`
and one `-` (real RC homology, not fragmentation).
Updates notes/SYNG_TRANSITIVE_DESIGN.md to match the simpler algorithm.
New unit tests in src/syng_transitive.rs cover linear projection on
both strands, clamping, and distance-merge semantics.
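The linear projection this commit describes amounts to carrying a query coordinate across one anchor pair and clamping it to the target. A minimal sketch under stated assumptions: the function name and signature are hypothetical (this is not the actual project_query_to_target), positions are 0-based, and a single innermost anchor is used per edge.

```rust
/// Project query coordinate `q` onto the target using one anchor pair
/// (q_anchor, t_anchor). On the forward strand the offset from the anchor
/// is added; on the reverse strand it is subtracted, because query and
/// target coordinates run in opposite directions. The result is clamped
/// to [0, target_len].
fn project(q: u64, q_anchor: u64, t_anchor: u64, reverse: bool, target_len: u64) -> u64 {
    // i128 keeps the intermediate value signed without overflow.
    let delta = q as i128 - q_anchor as i128;
    let t = if reverse {
        t_anchor as i128 - delta
    } else {
        t_anchor as i128 + delta
    };
    t.clamp(0, target_len as i128) as u64
}
```

Because this is pure arithmetic on the anchor list, no sequence fetch or alignment is involved. Any indel between the query edge and the anchor shifts the true position away from the projected one.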
…trand
Syng's query_region_with_anchors groups shared-syncmer hits by
(path, strand) and emits one padded interval per group. For short
syncmers (k=8) on large genomes like yeast, a query syncmer's
canonical hash coincidentally matches other positions on the same
target path in the opposite orientation — producing spurious `-`
anchors inside (or around) real `+` homologies and vice versa.
On yeast235 this caused:
* ~50% of `+` homologs to come paired with a nested `-` "homolog"
at the same target location, doubling the sequence set fed to
downstream consumers (GFA pipeline ended up building a doubled
graph where every region was aligned against its own RC).
* The -o gfa path to take 20+ minutes because seqwish/smoothxg
were processing twice the inputs.
Fix: after per-(path, strand) distance-merge, run a cross-strand
dedupe pass per path. If `+` and `-` intervals on the SAME path
overlap on target forward coordinates, keep only the strand with more
anchor support and drop the other. Non-overlapping `+`/`-` intervals
on the same path stay separate (real inversion on a separate region
of the same contig is legitimate biology).
Result on the yeast235 command (S288C#0#chrIV:408000-410000):
* Bed rows: 526 → 271
* `-` strand rows: 266 → 61 (noise gone, real RC retained)
* HN1#0#chrIV, AAA#0#chrIV, etc: one clean `+` interval each,
no nested `-` duplicates.
4 new unit tests in src/syng_transitive.rs cover:
- overlapping +/- → keep majority (noise filter)
- non-overlapping +/- on same path → keep both (real inversion)
- different paths never cross-dedupe
- deterministic tie-breaking
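The cross-strand dedupe pass can be sketched as below. The `Hit` struct and function names are hypothetical, and the tie-break (prefer `+` on equal anchor counts) is an assumption: the commit only states that tie-breaking is deterministic.

```rust
/// One merged homolog on a target path (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    path: String,
    strand: char, // '+' or '-'
    start: u64,
    end: u64, // half-open, target forward coordinates
    n_anchors: usize,
}

/// True when two hits lie on the same path and their intervals overlap.
fn overlaps(a: &Hit, b: &Hit) -> bool {
    a.path == b.path && a.start < b.end && b.start < a.end
}

/// Drop a hit when an overlapping hit on the same path but opposite
/// strand has more anchor support. Ties keep '+' (assumed tie-break).
fn cross_strand_dedupe(hits: Vec<Hit>) -> Vec<Hit> {
    let keep: Vec<bool> = hits
        .iter()
        .map(|h| {
            !hits.iter().any(|o| {
                o.strand != h.strand
                    && overlaps(h, o)
                    && (o.n_anchors > h.n_anchors
                        || (o.n_anchors == h.n_anchors && h.strand == '-'))
            })
        })
        .collect();
    hits.into_iter()
        .zip(keep)
        .filter(|(_, k)| *k)
        .map(|(h, _)| h)
        .collect()
}
```

Non-overlapping `+` and `-` hits on the same path pass through untouched, which preserves real inversions elsewhere on the contig.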
added 6 commits
April 16, 2026 17:07
Linear projection from innermost anchors alone cannot see indels that
sit in the inter-anchor gap (between the user's query edge and the
innermost shared syncmer anchor). Indels tend to DISRUPT syncmers
rather than fall between them, so when a query edge sits outside the
shared-syncmer grid the indels in that outer span are invisible to
linear extrapolation — producing refined intervals whose widths are
artificially locked to the query width.
Add refine_edge_via_realignment: per edge per merged homolog, fetch
the outer query slice [qs .. q_anchor + k] and the corresponding
target slice (sized q_window + buffer, oriented per strand) and run
EndsFree BiWFA anchored at the inner (anchor) end with text_begin_free
on the outer target end. The number of leading text-only ops ('I' in
WFA2 convention — opposite of PAF/SAM) gives the exact target offset
where the query edge lands.
Caps & guardrails:
- Outer span capped at EDGE_ALIGN_CAP_BP = 4096bp (typical span is
one syncmer gap ~30bp).
- Target buffer EDGE_ALIGN_TARGET_BUFFER_BP = 256bp, clamped to
|target slice| (WFA2 requires text_begin_free <= |T|).
- Falls back to linear projection on fetch failure, alignment failure,
or outer span = 0.
- Uses lib_wfa2 MemoryMode::High (windows are small; EndsFree traceback
needs the full matrix which BiWFA Ultralow doesn't always supply).
Verified on yeast235 S288C#0#chrIV:408000-410000 -d 10000:
* Wall time: 4.83s (bed output, 294 rows, 235 haplotypes).
* Width variation now reflects real indels: AAA 1999bp, AAB 2003bp
(previously both locked to exactly 2000). HN1 1805bp, BAL_1a
1800bp (previously both locked to 1805).
* Edge positions shift by 1-5bp from linear-projection baseline,
reflecting inter-anchor indel content that was hidden before.
No regressions on lib tests (150), syng_transitive unit (12), or
syng integration tests (9/9).
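Reading the edge position off the alignment, as described above, reduces to counting the leading text-only operations. The sketch below assumes the CIGAR is available as one byte per operation (for example `b"IIIMMXM"`), which is an assumption about how the aligner output is represented, and covers only the case where the free outer end sits at the start of the target window. Both function names are hypothetical.

```rust
/// Count the leading text-only operations in an unpacked CIGAR, one byte
/// per op. Per the commit message these are 'I' in WFA2's convention,
/// the opposite of PAF/SAM.
fn leading_text_only_ops(cigar: &[u8]) -> usize {
    cigar.iter().take_while(|&&op| op == b'I').count()
}

/// Target coordinate where the query edge lands, for a target window
/// starting at `target_window_start` whose outer (free) end is at the
/// window start.
fn edge_target_pos(target_window_start: u64, cigar: &[u8]) -> u64 {
    target_window_start + leading_text_only_ops(cigar) as u64
}
```

If the alignment begins with a match, the count is zero and the edge coincides with the window start.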
…roximity
Distance-merge (bedtools `-d`) was merging paralogs into one super-interval when they sat within `-d` bp on the same contig. Example on yeast235 CRE#3#block57_contig1 for S288C:408000-410000 with -d 10000:
* True ortholog at target [409163, 410109] with anchor delta=1047
* Paralog at target [416414, 417219] with anchor delta=7275
These got merged into one 8228bp "homolog" spanning the junk between them. That ragged sequence became GFA pipeline input and inflated the graph-build time.
Fix: distance-merge now requires both (a) target-axis gap <= -d AND (b) median anchor co-linearity signatures within tolerance (set to -d / 10, so for -d 10000 it's 1000bp). The signature is:
* '+' strand: target_pos - query_pos (stable under collinear homology)
* '-' strand: target_pos + query_pos (stable under RC homology)
Two clusters with signatures 1047 vs 7275 differ by 6228bp, far beyond tolerance, so they now stay as separate intervals (both real biology, just on the same contig). Ortholog: CRE#3#block57_contig1:409048-410292 (1244bp). Paralog: :416294-417276 (982bp). Previously: one merged 8228bp junk interval.
Result on the yeast query:
* Total output sequence: 472kb → 460kb (no more 8kb ragged CRE hits).
* Max per-interval width: 8228bp → 2146bp.
Two new unit tests in src/syng_transitive.rs cover:
- signatures diverge (paralogs) → don't merge
- signatures close (real fragmentation) → do merge
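The co-linearity check can be illustrated with a short sketch. The signature formulas and the `-d / 10` tolerance come from the commit message. The function names, the `(query_pos, target_pos)` anchor layout, and the use of the upper median for even-sized lists are assumptions.

```rust
/// Co-linearity signature of one anchor: target - query on '+',
/// target + query on '-'.
fn signature(query_pos: u64, target_pos: u64, strand: char) -> i64 {
    let (q, t) = (query_pos as i64, target_pos as i64);
    if strand == '-' { t + q } else { t - q }
}

/// Median signature over a cluster's anchors (upper median for even
/// counts); None when the cluster has no anchors.
fn median_signature(anchors: &[(u64, u64)], strand: char) -> Option<i64> {
    if anchors.is_empty() {
        return None;
    }
    let mut sigs: Vec<i64> = anchors
        .iter()
        .map(|&(q, t)| signature(q, t, strand))
        .collect();
    sigs.sort_unstable();
    Some(sigs[sigs.len() / 2])
}

/// Two clusters count as co-linear when their median signatures differ
/// by at most d / 10. Clusters without anchors never qualify here.
fn colinear(a: &[(u64, u64)], b: &[(u64, u64)], strand: char, d: u64) -> bool {
    match (median_signature(a, strand), median_signature(b, strand)) {
        (Some(x), Some(y)) => (x - y).unsigned_abs() <= d / 10,
        _ => false,
    }
}
```

For the example in the message, signatures 1047 and 7275 differ by 6228, which exceeds the 1000 bp tolerance at `-d 10000`, so the two clusters stay separate.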
Usage:
cargo run --release --example syng_anchor_probe -- \
<syng_prefix> <query_name> <qs> <qe> <target_path>
Dumps the full anchor list for one (query_region, target_path) pair,
showing per-anchor (query_pos, target_pos, delta, node_id). Used to
diagnose coordinate-frame issues (relative vs absolute) and
co-linearity signatures between anchors, which led to the
distance-merge co-linearity filter fix.
Compares two bed-format outputs produced by impg query (one from --syng, one from traditional -a PAF). Reports:
- Row counts and per-path coverage (syng-only, paf-only, both)
- Start/end boundary deltas on common paths
- Strand agreement
- Top outliers by |start delta|
Used to validate syng query semantics against PAF ground truth as described in notes/SYNG_NEXT_STEPS.md section 1.
Ran sweepga --pairs S288C#0 vs all 234 other haplotypes on yeast235.agc
to produce a ground-truth PAF, then diffed impg query outputs for
S288C#0#chrIV:408000-410000 -d 10000 on both the PAF-based and
syng-based indices.
Headline:
- 100% strand agreement on 209 common paths.
- 59%/74% within 5bp on start/end edges for common paths.
- Strand-dedupe and RC detection in syng are validated.
Known gap: RC ('-' strand) paths have a systematic ~1500bp undershoot
on the outer edge (the qe projection under RC), affecting ~10-15
paths. Likely cause: syng's innermost anchor on the RC side sits
inside the homology, and edge realignment can't extend far enough
outward. Details and fix plan in the note.
Summary
Integrates syng as a C library inside impg, providing syncmer-based homology detection via GBWT index queries that complements the existing alignment-based approach.
- … cc crate into libsyng.a, with safe Rust FFI bindings (syng_ffi.rs, syng.rs)
- impg syng --agc pangenome.agc -o prefix or impg syng -f genomes.fa -o prefix: progressive construction, one sequence at a time (low memory). Produces .1khash, .1gbwt, .syng.names files (interoperable with standalone syng)
- impg query --syng prefix -b region.bed: query homologous regions using the syng index, with BED/GFA/FASTA output
- -o gbwt output format to produce region sub-indexes
- --syncmer-k (default 8), --syncmer-w (default 55), --syncmer-seed (default 7)
Key implementation details
- … seqhash.c (ASCII vs numeric encoding for patternRC[4])
Test plan
- cargo test (126 tests pass)
- cargo clippy -- -D warnings clean