feat: integrate syng index for syncmer-based homology queries #162

Open
ekg wants to merge 34 commits into main from feature/syng-integration

ekg (Collaborator) commented Apr 9, 2026

Summary

Integrates syng as a C library inside impg, providing syncmer-based homology detection via GBWT index queries — complementing the existing alignment-based approach.

  • C library integration: syng compiled via cc crate into libsyng.a, with safe Rust FFI bindings (syng_ffi.rs, syng.rs)
  • Index building: impg syng --agc pangenome.agc -o prefix or impg syng -f genomes.fa -o prefix — progressive construction, one sequence at a time (low memory). Produces .1khash, .1gbwt, .syng.names files (interoperable with standalone syng)
  • Querying: impg query --syng prefix -b region.bed — query homologous regions using the syng index, with BED/GFA/FASTA output
  • GBWT export: -o gbwt output format to produce region sub-indexes
  • Configurable syncmer parameters: --syncmer-k (default 8), --syncmer-w (default 55), --syncmer-seed (default 7)

Key implementation details

  • Progressive build streams one sequence at a time — suitable for large pangenomes via AGC
  • Thread-safety: mutex serialization for C library globals
  • Fixed heap-buffer-overflow in seqhash.c (ASCII vs numeric encoding for patternRC[4])
  • 126 unit tests + integration tests, clippy clean

Test plan

  • FFI smoke tests and SyngIndex lifecycle tests
  • Index build/save/load round-trip tests
  • query_region correctness tests (padding, shared sequences)
  • Thread-safety under parallel test execution
  • GBWT output and interoperability tests
  • Full cargo test (126 tests pass)
  • cargo clippy -- -D warnings clean
  • End-to-end pipeline: build → query (bed/gfa/gbwt)

Test User added 9 commits April 8, 2026 17:45
…r (sync-impl-c)

- Add syng as git submodule under vendor/syng
- Add cc build dependency and extend build.rs to compile 11 syng C files into libsyng.a
- Create src/syng_ffi.rs with raw extern "C" declarations for SyngBWT, KmerHash,
  Seqhash, Rskip, SyncmerSet, and ONElib functions
- Create src/syng.rs with safe SyngIndex wrapper, SyncmerParams, SyngNameMap,
  HomologousInterval types, and Drop implementation
- Add module declarations to src/lib.rs
- All existing tests pass, new unit tests verify create/drop cycle
… (sync-test-c)

- FFI smoke tests: Seqhash, KmerHash, SyngBWT create/destroy individually
- SyncmerParams: default values and to_c() conversion
- SyngIndex: create/drop, custom params, pointer accessors
- SyngNameMap: new() and Default trait
…d (sync-impl-index)

- SyngIndex::build() progressively constructs GBWT from sequence iterator
  (syncmer extraction, KmerHash population, forward + revcomp paths)
- SyngNameMap::save/load for .syng.names text sidecar
- SyngIndex::save/load for .1khash + .1gbwt + .syng.names roundtrip
- impg syng CLI: --agc or --fasta input, -o prefix output, syncmer params
- C helper wrappers for static-inline seqhash destroy functions
- Proper Seqhash cleanup via impg_seqhashDestroy instead of raw libc::free
…c-test-index)

- Round-trip tests: build → save → load → verify path count, names, lengths
- SyngNameMap serialization with PanSN format and special characters
- Syncmer parameter variations: different k/w/seed produce different khash
- Edge cases: empty sequence, shorter-than-syncmer, exactly syncmer length, mixed
- CLI integration: run `impg syng -f <fasta> -o <prefix>`, verify output files
- Load error case: missing files return Err
…-query)

Phase 3 of syng integration: GBWT querying pipeline.

- Add GbwtPathStart struct to SyngNameMap for per-path start info
  (start_node, start_count, num_syncmers) — needed for query path walking
- Modify SyngIndex::build to capture GBWT path start info during
  forward path construction
- Implement SyngIndex::query_region(): walks query genome's GBWT path
  to find syncmer nodes in [start,end], then walks all other genome
  paths to find shared nodes, merges and pads intervals
- Add --syng <prefix> flag to query CLI (mutually exclusive with -a)
- Add --syng-padding flag (default 120bp = 2× syncmer length)
- Support bed, gfa, and fasta output formats for syng queries
- Add dispatch_gfa_engine_with_seq_index() for syng→graph engine wiring
  via SeqIndexWrapper (minimal ImpgIndex impl using SequenceIndex)
- Add impg_syng_suppress_debug FFI helper to silence C debug output
- Backward-compatible .syng.names format (6 columns, reads old 3-column)
- 5 new query_region tests (basic, unknown genome, padding, missing
  path info, save/load roundtrip)
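
The padding step mentioned above (expand each hit by `--syng-padding`, clamped to the sequence) could look roughly like this. This is a minimal sketch, not the code in src/syng.rs: the `Interval` type and `pad_interval` name are hypothetical, and half-open coordinates are an assumption.

```rust
/// Half-open interval [start, end) in base pairs (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Interval {
    pub start: u64,
    pub end: u64,
}

/// Grow an interval by `padding` bp on each side, never going below 0
/// or past `seq_len`. Saturating arithmetic avoids underflow near the
/// start of the sequence.
pub fn pad_interval(iv: Interval, padding: u64, seq_len: u64) -> Interval {
    Interval {
        start: iv.start.saturating_sub(padding),
        end: iv.end.saturating_add(padding).min(seq_len),
    }
}
```

With a padding of 0 the interval comes back unchanged, and larger paddings only ever widen it, which is the monotonic-growth property a padding test would check.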
…-test-query)

Adds 19 new tests covering:
- Query completeness with known ground truth (shared backbone detection)
- Interior coverage (no false negatives for shared regions)
- Boundary padding (0/60/120bp, monotonic growth, genome length clamping)
- Interval merging (overlap, adjacency, multi-genome, edge cases)
- Edge cases (isolated regions, full-sequence query, single-sequence index,
  out-of-range, zero-width, identical sequences, unknown genome)
- CLI integration (syng build + query --syng for BED/GFA output,
  --syng/-a mutual exclusivity)
…nc-impl-gbwt)

Add SyngIndex::build_region_gbwt() method that builds a region-specific
GBWT from fetched sequences — creates fresh KmerHash/SyngBWT, extracts
syncmers, builds forward+revcomp paths, and writes syng-compatible
.1khash + .1gbwt files.

Add 'gbwt' to query command's output format enum. Works with both
--syng and -i input modes. Requires -O prefix and --sequence-files.
…test-gbwt)

14 new tests covering Phase 4 validation:
- Region-specific GBWT output from query results
- Region GBWT loadable as SyngIndex
- Region nodes subset of full index
- Single genome and very small region edge cases
- Output prefix with nested directories and nonexistent dirs
- ONEcode magic byte verification for .1gbwt and .1khash
- Round-trip: query → GBWT → query consistency
- CLI integration: --syng + -o gbwt, PAF-based + -o gbwt
- CLI validation: -o gbwt requires -O prefix
…dd thread-safety and clippy cleanup (.verify-sync-verify-integration)

- Fix heap-buffer-overflow in seqhash: encode sequences as 0/1/2/3 (not ASCII a/c/g/t)
  since patternRC[4] uses raw byte values as array indices
- Use impg_seqhashCreateSafe for thread-safe seqhash creation
- Add mutex serialization for syng tests (C library has thread-unsafe globals)
- Fix clippy warnings: Error::other(), unused import, needless_range_loop
- Add crate-level allows for pre-existing too_many_arguments/type_complexity
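
To illustrate why the encoding matters: a 4-entry lookup table indexed by raw byte values is only safe if every byte is in 0..=3. The sketch below is an assumption-laden illustration, not the actual fix. The function names and the `COMPLEMENT` table are hypothetical stand-ins (the real table is `patternRC[4]` in seqhash.c), and mapping non-ACGT bytes to 0 is an assumption.

```rust
/// Map one nucleotide to a 2-bit code: a=0, c=1, g=2, t=3.
/// ASSUMPTION: any other byte (e.g. N) maps to 0.
pub fn encode_base(b: u8) -> u8 {
    match b.to_ascii_lowercase() {
        b'a' => 0,
        b'c' => 1,
        b'g' => 2,
        b't' => 3,
        _ => 0,
    }
}

/// Encode a whole sequence so each byte is a valid index into a
/// 4-entry table. Passing ASCII (e.g. b'a' == 97) would index far
/// out of bounds in C.
pub fn encode_sequence(seq: &[u8]) -> Vec<u8> {
    seq.iter().map(|&b| encode_base(b)).collect()
}

/// Hypothetical 4-entry complement table, indexed by 2-bit code.
const COMPLEMENT: [u8; 4] = [3, 2, 1, 0];

/// Reverse-complement a 2-bit encoded sequence.
pub fn reverse_complement_codes(codes: &[u8]) -> Vec<u8> {
    codes.iter().rev().map(|&c| COMPLEMENT[c as usize]).collect()
}
```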
Test User added 7 commits April 13, 2026 16:50
# Conflicts:
#	src/lib.rs
Implement SyngImpgWrapper that adapts SyngIndex to the ImpgIndex trait,
allowing partition_alignments() to work transparently with syng-based
queries. No changes to partition logic itself.
`Decompressor::get_contig()` in ragc-core silently returns an empty Vec
when called after `list_contigs_names_only()` because it checks contig
count instead of `are_details_loaded()`, skipping the required detail
reload. This produced a fully-populated `.syng.names` file but empty
`.1gbwt` and `.1khash`, which then crashed at load time with
"no Vertex objects in .1gbwt file".

Switch to `get_contig_length` + `get_contig_range(0, length)` (same
pattern as src/agc_index.rs), which correctly reloads details.

Also add tests/test_syng_integration.rs covering:
- impg syng --agc produces a non-empty GBWT (regression test for the
  exact bug above)
- impg syng --agc round-trips: built index loads and queries correctly
- impg syng -f (FASTA) produces a non-empty GBWT
- impg partition --syng runs end-to-end and emits non-empty BED

The AGC build test asserts .1gbwt > 2000 bytes, which would have caught
this bug immediately — the broken output was exactly 1362 bytes.
The vendored syng hash table had a bug where the REMOVED sentinel value
collided with hashInt(1), so any startCount() call for GBWT node 1 (or -1)
silently returned 0 on every invocation instead of incrementing. Two paths
both starting at node 1 would both record start_count=0 and collide; on
query, the C layer would die with:
  FATAL ERROR: syngBWTpathStartOld startNode 1 count 0 >= startCount 0

This manifests on any pangenome with byte-identical contigs because the
first k-mer added to the hash gets index 1 and becomes a GBWT start node.
On yeast235 the chrIII sequences from samples AAA and SGDref are identical,
which triggered the crash 100% of the time.

The fix patches the vendored syng hash.c to use a middle-of-range REMOVED
sentinel (0x4000000000000000) that hashInt() never produces for
small-magnitude integer keys. This bumps the vendor/syng submodule pointer.

Also adds regression tests:
- tests/test_syng_startcount.rs: isolates startCount increment behavior
  at the FFI level for node 1, node 5, short/long paths, with/without RC
- tests/test_syng_integration.rs::test_syng_identical_sequences_build_and_query:
  reproduces the yeast235 scenario (two byte-identical FASTA records) and
  verifies start_counts are distinct and query_region succeeds on both
Previously the vendor/syng submodule pointed at richarddurbin/syng but
the pinned commits (23b6547, 1dbfd58, ce46949) were local-only and did
not exist on that remote, so fresh clones with submodules failed CI
with: "fatal: remote error: upload-pack: not our ref ..."

Create https://github.com/pangenome/syng as a fork of
richarddurbin/syng with an impg-integration branch carrying our
patches. Point .gitmodules at it so CI and downstream clones can
actually fetch the submodule.

Contents of pangenome/syng:impg-integration (all commits ahead of
upstream b659846):
  - 23b6547 impg FFI helper wrappers for static-inline functions
  - 1dbfd58 impg_syng_suppress_debug helper for FFI
  - ce46949 hash REMOVED sentinel collides with hashInt(1)  (genuine bug fix)
  - d423521 IMPG_PATCHES.md tracking all divergences from upstream

IMPG_PATCHES.md documents each patch and includes the sync workflow
for merging future upstream changes. The hash.c fix should be
upstreamed to richarddurbin/syng.
The pggb:X / seqwish:X engine spec carries a partition window size.
Previously the syng+gfa path ran a single flat engine invocation over
the entire query range, because X was only consulted as a boolean
skip_normalize flag — so large regions produced enormous
whole-chromosome context spans from query_region and overwhelmed the
aligner.

Now when partition_size is set, we split [range_start, range_end)
into per-window partitions, call query_region per window (which
returns tight, small-scale intervals), and run the partitioned GFA
pipeline that does a fresh alignment + graph induction per partition
plus a single final gfaffix normalization — structurally mirroring
the alignment path's output_results_gfa_partitioned.

SyngImpgWrapper gains seq_index() and syng_padding() accessors so
the main.rs path can share it instead of threading the SequenceIndex
and padding separately.

Adds test_query_syng_gfa_subwindow_splitter integration test that
drives a 20 kbp query through pggb:5000 and verifies the per-window
log lines appear in stderr.
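
The per-window split of [range_start, range_end) described above can be sketched as follows. `split_windows` is a hypothetical name and the last window is assumed to be simply truncated at range_end; the real splitter may handle short tails differently.

```rust
/// Split the half-open range [range_start, range_end) into consecutive
/// windows of at most `window` bp. The final window may be shorter.
pub fn split_windows(range_start: u64, range_end: u64, window: u64) -> Vec<(u64, u64)> {
    assert!(window > 0, "window size must be positive");
    let mut out = Vec::new();
    let mut start = range_start;
    while start < range_end {
        let end = start.saturating_add(window).min(range_end);
        out.push((start, end));
        start = end;
    }
    out
}
```

Each window would then get its own query_region call, so the intervals handed to the aligner stay proportional to the window size rather than to the whole query range.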
The previous query_region walked every forward path end-to-end on
every call — O(total pangenome length) per query, which dominates
runtime on large inputs. FastLocate gives O(k log r) per query
(k = query nodes, r = total BWT runs) by building an r-index locate
structure over a classical GBWT.

Algorithm: port of jltsiren/gbwt's C++ FastLocate (fast_locate.cpp)
to Rust on top of the `gbz` crate (which despite the crate name is
jltsiren/gbwt-rs). The r-index formulation is from Gagie, Navarro,
Prezza (JACM 2020) as adapted to multi-string BWTs in Sirén,
Garrison, Novak, Paten, Durbin (Bioinformatics 2020).

Key structural difference from a naive port: runs whose successor is
ENDMARKER must be counted as one logical run per position (not per
physical RLE run), mirroring C++ LFLoop's convention. Without this,
`last.predecessor(prev)` fails to find a valid tail for head samples
of nodes near the end of short sequences.

Integration in SyngIndex:
- build_fast_locate walks each forward syng path, inserts the
  encoded node sequence into gbz::GBWTBuilder(bidirectional=true),
  records per-node bp positions into a flat BpOffsets sidecar, then
  builds FastLocate over the resulting gbz GBWT.
- SyngIndex::build calls build_fast_locate eagerly (warn on failure).
- query_region uses FastLocate::decompress_da when available,
  falling back to the old walk-every-path implementation otherwise.

Serialization: new {prefix}.syng.locate sidecar = gbz::GBWT
(simple-sds Serialize) + FastLocate (custom little-endian framing)
+ BpOffsets (custom little-endian framing). Loaded opportunistically
if the file exists.

New dependencies: gbz 0.6 and simple-sds 0.4 (both from crates.io).

Tests:
- fast_locate unit: tiny multi-share, 8-path larger, single path,
  identical paths, save/load roundtrip — all compared against a
  ground-truth walk via gbz's own sequence iterator.
- syng: test_query_region_fast_locate_parity (fast vs slow result
  equivalence), test_syng_save_load_with_fast_locate (full disk
  roundtrip).
- Integration: test_syng_integration.rs passes end-to-end with the
  fast path (350s subwindow splitter).

Also reverts a prior broken attempt to add r-index locate on the
C side of syng (from vendor/syng and syng_ffi declarations) — the
classical r-index formula does not apply to syng's rskip BWT, so
this is done in Rust on top of a separate gbz GBWT instead.
@ekg ekg force-pushed the feature/syng-integration branch from f12e003 to f7fe289 Compare April 15, 2026 19:38
Test User added 12 commits April 15, 2026 16:19
test_query_syng_gfa_subwindow_splitter was running FastGA + seqwish +
gfaffix per sub-window, which took ~6 min locally and didn't finish in
reasonable time on GitHub's 2-vCPU runners (1+ hour in Test step). The
regression it checks is engine-agnostic — it verifies that pggb:X /
seqwish:X / poa:X is interpreted as a sub-window size (not a boolean
flag), which is visible from the per-window log lines emitted BEFORE
the engine runs.

Three changes:

- Switch from `seqwish:5000` to `poa:1000`. POA skips the FastGA
  alignment + seqwish transclosure chain entirely and runs a single
  partial-order alignment per sub-window. The sub-window loop and its
  log emission are identical, so the regression assertion still fires.
  1000 bp is impg's minimum allowed partition size.

- Shrink the FASTA from 15 kbp + 2 kbp tails (34 kbp total) to
  3 kbp + 500 bp tails (7 kbp total), and shrink the query range from
  15 kbp to 3 kbp. That keeps the 3-sub-window assertion intact while
  dropping the per-run pipeline cost.

- Drop the secondary "GFA has > 100 bytes" assertion. The primary
  sub-window-count assertion is the regression test; downstream engine
  success is checked by the existing pipeline tests in
  test_pipeline_integration.rs.

Local timing: test_query_syng_gfa_subwindow_splitter now 0.49s
(vs 346.96s previously, ~700x speedup). Full test_syng_integration
suite now 0.80s (vs 350s). Full release serial test suite now ~2s.
The AGC syng build path formatted sequence names as `contig@sample`,
but PanSN-style contig names already embed the sample
(e.g. `S288C#0#chrIV`), so the result was
`S288C#0#chrIV@S288C#0` — a redundant suffix that made
`query -r S288C#0#chrIV:...` fail with "genome not found".

Use the contig name directly instead.
Outlines the per-hop pipeline: syng one-hop seed → fetch sequences →
local pairwise alignment (FastGA/sweepga) → build in-memory ImpgIndex
from PAFs → re-query through implicit graph with cigar-based coordinate
projection → precise boundaries for next hop.

Key insight: no explicit graph (no seqwish/GFA) — the implicit graph
IS the PAFs, which impg already knows how to query through. Realignment
at each hop prevents compounding syncmer-resolution slop.

See notes/SYNG_TRANSITIVE_DESIGN.md for full design, open questions,
and integration plan. Work continues on PR #162.
Commit b7a9323 dropped the `@sample` suffix when naming AGC-imported
contigs. That was correct for PanSN-encoded names
(`sample#haplotype#contig`), which already embed the sample — but
broke when contigs are raw names like `chr1` that appear across
multiple samples: all three collapse to the same `name_to_path` entry
and two of them are silently lost.

Only append `@sample` when the contig name doesn't already contain
`#` (i.e. isn't PanSN-encoded).

Restores test_syng_agc_roundtrip_query to passing without regressing
the PanSN path.
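
The naming rule is small enough to state as code. A minimal sketch, with `agc_sequence_name` as a hypothetical helper name:

```rust
/// Name for an AGC-imported contig: PanSN-style names (containing '#')
/// already embed the sample and are used as-is; raw names such as
/// `chr1` get an `@sample` suffix so they stay unique across samples.
pub fn agc_sequence_name(contig: &str, sample: &str) -> String {
    if contig.contains('#') {
        contig.to_string()
    } else {
        format!("{contig}@{sample}")
    }
}
```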
Syng's syncmerIterator reports the first syncmer at wherever the min
k-mer lands in the initial window — anywhere in [0, w+k). walk_path
was hardcoding the first syncmer's accumulated position to 0 and
returning relative offsets. Callers (query_region, fast-locate
bp_offsets, anchor fetch in boundary realignment) treated those as
absolute sequence coordinates, introducing a systematic shift equal
to the first-syncmer offset on every query.

Store the first-syncmer absolute position in GbwtPathStart at build
time (from syncmers[0].1), bump the .syng.names file format to 7
columns, and anchor walk_path's bp accumulator there. Older 6-column
files load with first_syncmer_pos=0 and behave as before (still
buggy for anchors; rebuilding the index fixes this).
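
The coordinate fix amounts to seeding the bp accumulator with the stored first-syncmer position instead of 0. A sketch under assumptions: `absolute_syncmer_positions` is a hypothetical name, and `step_offsets` is assumed to hold the distance in bp from each syncmer start to the next.

```rust
/// Turn per-step offsets along a path into absolute sequence
/// coordinates, anchored at the first syncmer's absolute position.
/// Seeding with 0 instead reproduces the old relative-offset behaviour.
pub fn absolute_syncmer_positions(first_syncmer_pos: u64, step_offsets: &[u64]) -> Vec<u64> {
    let mut positions = Vec::with_capacity(step_offsets.len() + 1);
    let mut pos = first_syncmer_pos;
    positions.push(pos);
    for &step in step_offsets {
        pos += step;
        positions.push(pos);
    }
    positions
}
```

Every position in the two outputs differs by exactly the first-syncmer offset, which is the systematic shift described above.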
Introduce Anchor + HomologousIntervalWithAnchors; add
query_region_with_anchors that emits per-hit shared-syncmer positions
and distinguishes forward vs RC homology.

The fast-locate fast path previously collapsed every syncmer visit
to the forward-orientation GBWT node via unsigned_abs(), silently
dropping every reverse-orientation visit. Preserve the sign from
walk_path by setting the low bit of the GBWT encoded node id so
decompress_da(2*N + 1) returns actual reverse-orientation visits.

query_region_with_anchors queries BOTH orientations per query node,
tags anchors by (query_orient XOR target_orient), and groups results
per (target_path, strand) so forward and RC intervals stay separate
(they describe distinct homologies, not the same interval reported
twice). query_region is now a thin wrapper that drops anchors.

merge_intervals is replaced by merge_intervals_with_anchors which
unions + de-dupes the anchor list across merged intervals. Unit
tests updated accordingly.
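
The orientation handling can be sketched as below. The function names are hypothetical, and it is an assumption that a negative signed node means reverse orientation; the 2*N + orientation encoding and the XOR strand rule come from the description above.

```rust
/// Encode a signed node as a GBWT-style id: 2*N for forward, 2*N + 1
/// for reverse. Calling unsigned_abs() alone would drop the orientation.
pub fn encode_oriented_node(signed_node: i64) -> u64 {
    let id = signed_node.unsigned_abs();
    2 * id + u64::from(signed_node < 0)
}

/// Inverse of `encode_oriented_node`.
pub fn decode_oriented_node(encoded: u64) -> i64 {
    let id = (encoded / 2) as i64;
    if encoded & 1 == 1 { -id } else { id }
}

/// Strand of a shared-syncmer anchor: same orientation on query and
/// target means forward homology, differing orientation means RC.
pub fn homology_strand(query_is_reverse: bool, target_is_reverse: bool) -> char {
    if query_is_reverse ^ target_is_reverse { '-' } else { '+' }
}
```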
New impg::syng_transitive module implements the multihop
syng-seeded transitive query described in
notes/SYNG_TRANSITIVE_DESIGN.md. For each syng homolog, resolve its
two fuzzy edges to base-pair precision using BiWFA (lib_wfa2
MemoryMode::Ultralow) on small anchor-flanked windows — no full
in-memory Impg, no subprocess aligner. Multihop iteration re-seeds
each hop from the refined intervals of the previous one so slop
doesn't compound across hops.

Strand-aware: forward homology uses the anchor pair directly; RC
homology fetches the target window with its bounds in reverse order,
reverse-complements it, aligns, and projects offsets downward from the
left anchor's forward anchor point (t_pos + syncmer_len).

Flanks each syng query by ANCHOR_FLANK_BP=512 on both sides so each
edge has a candidate anchor on either side; falls back to the
syncmer-resolution bound when a flank is missing. Skips realignment
for anchor gaps > MAX_REALIGN_WINDOW_BP=2048 (where the cost model
of local realignment inverts).

Integration tests:
- test_syng_boundary_realign_tightens_edges: on an identical
  backbone, refined edges snap to exact query coords (where the
  previous syncmer-resolution output was padded by ~150bp).
- test_syng_rc_homolog_end_to_end: on an inversion fixture,
  verifies strand='-' is reported, refined intervals overlap the
  expected RC window by >=200bp, and RC'd target bytes share a
  >=30bp exact-match run with query bytes. Coordinate precision is
  approximate due to inherent RC-syncmer sparsity (~3-5 RC-shared
  anchors per kb vs ~35 per kb forward).
impg query --syng now runs the BiWFA boundary-realignment transitive
pipeline by default for bed/fasta/gbwt output. Each homolog's
syncmer-resolution edges are snapped to base-pair precision; under
--transitive with --max-depth > 1 the refined intervals feed back
into the next syng hop.

Add --syng-raw (debug-only) to preserve the previous pass-through
behaviour that emits raw syncmer-resolution intervals without
realignment — useful for inspecting what syng's node graph reports
before any refinement is applied.

GFA output stays on raw intervals for now: its partitioning pipeline
does its own per-partition alignment downstream, and the interaction
with boundary realignment needs its own design pass.

Boundary realignment needs a UnifiedSequenceIndex for edge-window
fetches. Sequence files are now required for bed output unless
--syng-raw is set — error message points users to either
--sequence-files/--sequence-list or --syng-raw.
- notes/SYNG_TRANSITIVE_DESIGN.md: pivot from in-memory-Impg +
  FastGA/WFMASH full-region realignment to BiWFA boundary-only
  realignment. Document the per-hop pipeline, strand handling, anchor
  selection, resolved/outstanding open questions, and the known
  limitation on RC coordinate precision.
- examples/syng_probe.rs: diagnostic tool that compares syng's C
  iterator syncmer positions against walk_path output. Served as the
  verification step for the walk_path absolute-coord fix; keep it in
  the tree for future syng-coordinate debugging.
- .gitignore: add .tmp*, seqwish temp dirs, .workgraph.1/, .claude/
  so syng/seqwish/FastGA scratch files from test runs don't clutter
  git status.
Two related bugs, both exposed by:

    impg query --syng yeast235 -r S288C#0#chrIV:409000-409500 \
        -o gfa --gfa-engine pggb -d 1000

1. `-o gfa` was exempt from boundary realignment — the previous
   commit kept it on raw syncmer-resolution intervals with the
   reasoning "the partitioning pipeline does its own alignment
   downstream." That was wrong: fragmented short intervals feed
   straight into the GFA partitioned / flat pipelines and produce a
   fragmented graph. Wire both GFA sub-paths (partitioned and flat)
   through impg::syng_transitive::query_transitive when --syng-raw
   isn't set, just like bed/fasta/gbwt do.

2. `-d <merge_distance>` (min_distance_between_ranges) was silently
   a no-op in the syng path. Add a merge_distance: u64 parameter to
   syng_transitive::query_transitive and one_hop, plumb the value
   through from query.transitive_opts.min_distance_between_ranges in
   main.rs. Inside one_hop, run bedtools-style distance-merge on the
   padded syncmer hits before boundary realignment via a new
   distance_merge_anchored helper (anchors from merged intervals are
   unioned and deduplicated). Distance 0 is the existing default =
   overlap-only merge.

Adds test_syng_query_reconstructs_homology_with_diffs: three genomes
sharing a 3kb region with scattered SNPs and a 10bp indel. Query
genome_a[500..2500] returns exactly one interval per target with
refined edges within single-digit bp of biological truth (the indel
correctly shifts genome_c's right edge to 2490).
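
A bedtools-style distance merge that also unions anchors might look like this. It is a sketch only: `AnchoredInterval` and `distance_merge` are hypothetical names (the helper described above is distance_merge_anchored), anchors are simplified to (query_pos, target_pos) pairs, and all inputs are assumed to be on the same path and strand.

```rust
/// Half-open target interval plus its supporting anchors (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AnchoredInterval {
    pub start: u64,
    pub end: u64,
    /// (query_pos, target_pos) pairs.
    pub anchors: Vec<(u64, u64)>,
}

/// Merge intervals whose gap is at most `merge_distance` bp. With
/// distance 0 only overlapping or touching intervals merge. Anchors
/// of merged intervals are unioned and deduplicated.
pub fn distance_merge(mut ivs: Vec<AnchoredInterval>, merge_distance: u64) -> Vec<AnchoredInterval> {
    ivs.sort_by_key(|iv| (iv.start, iv.end));
    let mut merged: Vec<AnchoredInterval> = Vec::new();
    for iv in ivs {
        match merged.last_mut() {
            Some(last) if iv.start <= last.end.saturating_add(merge_distance) => {
                last.end = last.end.max(iv.end);
                last.anchors.extend(iv.anchors);
            }
            _ => merged.push(iv),
        }
    }
    for iv in &mut merged {
        iv.anchors.sort_unstable();
        iv.anchors.dedup();
    }
    merged
}
```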
…ection

My earlier boundary-realignment implementation used BiWFA on
anchor-flanked windows per-edge per-homolog — which on real
pangenome data (e.g. yeast235 with ~300 haplotypes per region) was
spending most of its time on thousands of AGC sequence fetches and
small BiWFA runs. A 2kb query took >10 min wall time before being
killed.

For the intended use case — closely-related genomes — the shared
syncmer anchors already encode the homology structure. Linear
extrapolation from the innermost anchors to the user's query edges
gives coordinate precision within the local indel burden (single-digit
bp on <5%-divergent genomes), matching what PAF-based queries produce
from their stored CIGARs.

Replace refine_boundaries with project_query_to_target, a pure
coordinate arithmetic function. No AGC fetches. No BiWFA. Per-homolog
cost drops from O(fetches × aligner overhead) to O(anchor list size).
Wall time on the yeast235 example collapses from >10min (killed) to
~1s for the refinement step.

Drop resolve_edge_via_biwfa, ANCHOR_FLANK_BP, MAX_REALIGN_WINDOW_BP,
the thread-local BiWFA aligner, with_biwfa_edit_aligner,
project_query_offset_via_cigar, flanking_anchors, the query-range
widening in one_hop, and all associated tests. Net -400 lines.
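
For illustration, linear projection from a single anchor reduces to a few lines of arithmetic. This is not project_query_to_target itself (its signature is not shown here); `project_linear`, the `Strand` enum, and the clamping to [0, target_len] are assumptions, and syncmer-length adjustments on the reverse strand are deliberately left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strand {
    Forward,
    Reverse,
}

/// Project query coordinate `q` onto the target through one shared
/// anchor (q_anchor, t_anchor), assuming no indels between the anchor
/// and `q`. On the reverse strand, moving right on the query moves
/// left on the target. The result is clamped to [0, target_len].
pub fn project_linear(q: u64, q_anchor: u64, t_anchor: u64, strand: Strand, target_len: u64) -> u64 {
    let delta = q as i64 - q_anchor as i64;
    let t = match strand {
        Strand::Forward => t_anchor as i64 + delta,
        Strand::Reverse => t_anchor as i64 - delta,
    };
    t.clamp(0, target_len as i64) as u64
}
```

The error of such a projection is bounded by the net indel length between the anchor and the query edge, which is why it stays small on closely related genomes.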

Also fix a separate wiring bug: syng_merge_distance was reading from
transitive_opts.min_distance_between_ranges (long-only flag, default
10) instead of QueryOpts::merge_distance (short `-d`, default 0 —
the flag CLI users actually type). Switch to effective_merge_distance()
so `-d 10000` works as documented. Without this, my distance-merge was
effectively a no-op on CLI usage.

Verified on yeast235 --agc:

    impg query --syng yeast235 --sequence-files yeast235.agc \
        -r S288C#0#chrIV:408000-410000 -o bed -d 10000

  Before: 746 bed rows, up to 7 fragments per haplotype on same strand.
  After:  526 bed rows, max 2 rows per (sample,hap,chrom) — one `+`
          and one `-` (real RC homology, not fragmentation).

Updates notes/SYNG_TRANSITIVE_DESIGN.md to match the simpler algorithm.
New unit tests in src/syng_transitive.rs cover linear projection on
both strands, clamping, and distance-merge semantics.
…trand

Syng's query_region_with_anchors groups shared-syncmer hits by
(path, strand) and emits one padded interval per group. For short
syncmers (k=8) on large genomes like yeast, a query syncmer's
canonical hash coincidentally matches other positions on the same
target path in the opposite orientation — producing spurious `-`
anchors inside (or around) real `+` homologies and vice versa.

On yeast235 this caused:
  * ~50% of `+` homologs to come paired with a nested `-` "homolog"
    at the same target location, doubling the sequence set fed to
    downstream consumers (GFA pipeline ended up building a doubled
    graph where every region was aligned against its own RC).
  * The -o gfa path to take 20+ minutes because seqwish/smoothxg
    were processing twice the inputs.

Fix: after per-(path, strand) distance-merge, run a cross-strand
dedupe pass per path. If `+` and `-` intervals on the SAME path
overlap on target forward coordinates, keep only the strand with more
anchor support and drop the other. Non-overlapping `+`/`-` intervals
on the same path stay separate (real inversion on a separate region
of the same contig is legitimate biology).

Result on the yeast235 command (S288C#0#chrIV:408000-410000):
  * Bed rows: 526 → 271
  * `-` strand rows: 266 → 61 (noise gone, real RC retained)
  * HN1#0#chrIV, AAA#0#chrIV, etc: one clean `+` interval each,
    no nested `-` duplicates.

4 new unit tests in src/syng_transitive.rs cover:
  - overlapping +/- → keep majority (noise filter)
  - non-overlapping +/- on same path → keep both (real inversion)
  - different paths never cross-dedupe
  - deterministic tie-breaking
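
The cross-strand dedupe rule can be sketched as follows. `Hit` and `cross_strand_dedupe` are hypothetical names, the quadratic pairwise scan is chosen for brevity, and breaking ties in favour of `+` is an assumption (the description above only says tie-breaking is deterministic).

```rust
/// One merged homolog on a target path (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Hit {
    pub path: String,
    pub strand: char,
    pub start: u64,
    pub end: u64,
    pub anchor_count: usize,
}

fn overlaps(a: &Hit, b: &Hit) -> bool {
    a.start < b.end && b.start < a.end
}

/// For overlapping `+`/`-` hits on the same path, keep the strand with
/// more anchor support. Non-overlapping hits and hits on different
/// paths are left alone. ASSUMPTION: ties keep the `+` hit.
pub fn cross_strand_dedupe(hits: Vec<Hit>) -> Vec<Hit> {
    let mut keep = vec![true; hits.len()];
    for i in 0..hits.len() {
        for j in (i + 1)..hits.len() {
            if !keep[i] || !keep[j] {
                continue;
            }
            let (a, b) = (&hits[i], &hits[j]);
            if a.path == b.path && a.strand != b.strand && overlaps(a, b) {
                let a_wins = a.anchor_count > b.anchor_count
                    || (a.anchor_count == b.anchor_count && a.strand == '+');
                if a_wins {
                    keep[j] = false;
                } else {
                    keep[i] = false;
                }
            }
        }
    }
    hits.into_iter()
        .zip(keep)
        .filter_map(|(h, k)| k.then_some(h))
        .collect()
}
```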
Test User added 6 commits April 16, 2026 17:07
Linear projection from innermost anchors alone cannot see indels that
sit in the inter-anchor gap (between the user's query edge and the
innermost shared syncmer anchor). Indels tend to DISRUPT syncmers
rather than fall between them, so when a query edge sits outside the
shared-syncmer grid the indels in that outer span are invisible to
linear extrapolation — producing refined intervals whose widths are
artificially locked to the query width.

Add refine_edge_via_realignment: per edge per merged homolog, fetch
the outer query slice [qs .. q_anchor + k] and the corresponding
target slice (sized q_window + buffer, oriented per strand) and run
EndsFree BiWFA anchored at the inner (anchor) end with text_begin_free
on the outer target end. The number of leading text-only ops ('I' in
WFA2 convention — opposite of PAF/SAM) gives the exact target offset
where the query edge lands.
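
Reading the target offset out of the alignment is a matter of counting the leading text-only ops. A sketch, assuming the CIGAR is available as an uncompressed string with one byte per operation and that `I` marks a text-only op as stated above; `leading_text_only_ops` is a hypothetical name.

```rust
/// Count the run of leading text-only ops (`I` in the WFA2 convention
/// described above). With the outer text end free, this run length is
/// the offset into the fetched target slice where the query edge lands.
/// ASSUMPTION: `cigar` holds one byte per operation (no run lengths).
pub fn leading_text_only_ops(cigar: &[u8]) -> usize {
    cigar.iter().take_while(|&&op| op == b'I').count()
}
```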

Caps & guardrails:
- Outer span capped at EDGE_ALIGN_CAP_BP = 4096bp (typical span is
  one syncmer gap ~30bp).
- Target buffer EDGE_ALIGN_TARGET_BUFFER_BP = 256bp, clamped to
  |target slice| (WFA2 requires text_begin_free <= |T|).
- Falls back to linear projection on fetch failure, alignment failure,
  or outer span = 0.
- Uses lib_wfa2 MemoryMode::High (windows are small; EndsFree traceback
  needs the full matrix which BiWFA Ultralow doesn't always supply).

Verified on yeast235 S288C#0#chrIV:408000-410000 -d 10000:
  * Wall time: 4.83s (bed output, 294 rows, 235 haplotypes).
  * Width variation now reflects real indels: AAA 1999bp, AAB 2003bp
    (previously both locked to exactly 2000). HN1 1805bp, BAL_1a
    1800bp (previously both locked to 1805).
  * Edge positions shift by 1-5bp from linear-projection baseline,
    reflecting inter-anchor indel content that was hidden before.

No regressions on lib tests (150), syng_transitive unit (12), or
syng integration tests (9/9).
…roximity

Distance-merge (bedtools `-d`) was merging paralogs into one super-
interval when they sat within `-d` bp on the same contig. Example on
yeast235 CRE#3#block57_contig1 for S288C:408000-410000 with -d 10000:
  * True ortholog at target [409163, 410109] with anchor delta=1047
  * Paralog at target [416414, 417219] with anchor delta=7275
These got merged into one 8228bp "homolog" spanning the junk between
them. That ragged sequence became GFA pipeline input and inflated the
graph-build time.

Fix: distance-merge now requires both (a) target-axis gap <= -d AND
(b) median anchor co-linearity signatures within tolerance (set to
-d / 10, so for -d 10000 it's 1000bp). The signature is:
  * '+' strand:  target_pos - query_pos  (stable under collinear homology)
  * '-' strand:  target_pos + query_pos  (stable under RC homology)

Two clusters with signatures 1047 vs 7275 differ by 6228bp — far beyond
tolerance — so they now stay as separate intervals (both real biology,
just on the same contig). Ortholog: CRE#3#block57_contig1:409048-410292
(1244bp). Paralog: :416294-417276 (982bp). Previously: one merged
8228bp junk interval.

Result on the yeast query:
  * Total output sequence: 472kb → 460kb (no more 8kb ragged CRE hits).
  * Max per-interval width: 8228bp → 2146bp.

Two new unit tests in src/syng_transitive.rs cover:
  - signatures diverge (paralogs) → don't merge
  - signatures close (real fragmentation) → do merge
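
The co-linearity check can be sketched in a few lines. The names are hypothetical, anchors are simplified to (query_pos, target_pos) pairs, and taking the upper median for even-length lists is an assumption; the signature formulas and the -d / 10 tolerance are as described above.

```rust
/// Co-linearity signature of one anchor: constant along a collinear
/// forward homology (t - q) or an RC homology (t + q).
pub fn signature(query_pos: i64, target_pos: i64, strand: char) -> i64 {
    if strand == '-' {
        target_pos + query_pos
    } else {
        target_pos - query_pos
    }
}

/// Median signature over a cluster's anchors (upper median for even
/// lengths); `None` when the cluster has no anchors.
pub fn median_signature(anchors: &[(i64, i64)], strand: char) -> Option<i64> {
    let mut sigs: Vec<i64> = anchors
        .iter()
        .map(|&(q, t)| signature(q, t, strand))
        .collect();
    if sigs.is_empty() {
        return None;
    }
    sigs.sort_unstable();
    Some(sigs[sigs.len() / 2])
}

/// Two clusters may merge only if their signatures agree within
/// merge_distance / 10.
pub fn colinear(sig_a: i64, sig_b: i64, merge_distance: i64) -> bool {
    (sig_a - sig_b).abs() <= merge_distance / 10
}
```

With the CRE example, signatures 1047 and 7275 differ by 6228, well above the 1000 bp tolerance for -d 10000, so the two clusters stay separate.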
Usage:
  cargo run --release --example syng_anchor_probe -- \
      <syng_prefix> <query_name> <qs> <qe> <target_path>

Dumps the full anchor list for one (query_region, target_path) pair,
showing per-anchor (query_pos, target_pos, delta, node_id). Used to
diagnose coordinate-frame issues (relative vs absolute) and
co-linearity signatures between anchors, which led to the
distance-merge co-linearity filter fix.
Compares two bed-format outputs produced by impg query (one from
--syng, one from traditional -a PAF). Reports:
- Row counts and per-path coverage (syng-only, paf-only, both)
- Start/end boundary deltas on common paths
- Strand agreement
- Top outliers by |start delta|

Used to validate syng query semantics against PAF ground truth as
described in notes/SYNG_NEXT_STEPS.md section 1.
Ran sweepga --pairs S288C#0 vs all 234 other haplotypes on yeast235.agc
to produce a ground-truth PAF, then diffed impg query outputs for
S288C#0#chrIV:408000-410000 -d 10000 on both the PAF-based and
syng-based indices.

Headline:
- 100% strand agreement on 209 common paths.
- 59%/74% within 5bp on start/end edges for common paths.
- Strand-dedupe and RC detection in syng are validated.

Known gap: RC ('-' strand) paths have a systematic ~1500bp undershoot
on the outer edge (the qe projection under RC), affecting ~10-15
paths. Likely cause: syng's innermost anchor on the RC side sits
inside the homology, and edge realignment can't extend far enough
outward. Details and fix plan in the note.