SweepGA

Fast all-vs-all genome alignment with plane-sweep filtering and scaffold-aware chaining. Wraps FastGA (default) and wfmash, then applies a plane sweep to drop near-duplicate mappings and (optionally) chains nearby alignments into syntenic scaffolds.

What it does

Align — one or more FASTA files → PAF or .1aln. Uses FastGA or wfmash; PanSN-aware; scales k-mer frequency to the cohort size automatically.
Filter — take any aligner's PAF/1aln and apply the same plane-sweep
- scaffold pipeline.
Scale — built-in batching (--batch-bytes, --max-disk) partitions large cohorts into memory/disk-bounded batches.
Sparsify — optionally pre-select which pairs to align (random, tree, giant-component, wfmash-density).

Ships two binaries: sweepga (the aligner/filter) and alnstats (a small PAF/1aln statistics tool).

Install

# From crates.io
cargo install sweepga

# From source
git clone https://github.com/pangenome/sweepga.git
cd sweepga
cargo install --force --path .

Requires Rust ≥ 1.70.

On mixed-glibc systems (e.g. Debian + Guix) a plain cargo build may fail with __pthread_barrier_wait@GLIBC_PRIVATE link errors. Use ./scripts/build-clean.sh --install — see docs/BUILD-NOTES.md.

Quick start

# Self-alignment of one FASTA (FastGA, default)
sweepga genome.fa.gz --output-file aln.paf

# Use wfmash instead
sweepga --wfmash genome.fa.gz --output-file aln.paf

# Pairwise alignment of two FASTAs
sweepga target.fa query.fa --output-file aln.paf

# Filter an existing PAF (any aligner)
sweepga alignments.paf --output-file filtered.paf

# Convert .1aln → PAF
sweepga alignments.1aln --paf > out.paf

# AGC archive input (extract samples, align, filter)
sweepga --agc-samples HG002,HG005 genomes.agc --output-file aln.paf

sweepga auto-detects input type (FASTA vs. PAF/1aln) and picks a reasonable pipeline. Run sweepga --help for the full flag list.

Defaults

The defaults are tuned for pangenome alignment (matches impg's historical settings):

Flag	Default	Meaning
`--aligner`	`fastga`	Aligner backend. Use `--wfmash` or `--aligner wfmash` to switch.
`--num-mappings`	`many:many`	Pre-scaffold plane-sweep: keep all mappings per query/target.
`--scaffold-jump`	`50k`	Scaffolding is enabled by default; chains mappings within a 50kb gap. Pass `--scaffold-jump 0` to disable.
`--scaffold-mass`	`10k`	Drop scaffold chains shorter than 10kb.
`--scaffold-filter`	`many:many`	Keep all non-overlapping scaffolds. Use `1:1` for the aggressive (best-per-chromosome-pair) filter.
`--overlap`	`0.95`	Drop mappings whose overlap with a better-scoring one exceeds 95%.
`--scoring`	`log-length-ani`	Plane-sweep scoring function.
`--threads`	`8`	Parallelism.

Adaptive behavior: when the input average sequence length is known (FASTA with .fai), --scaffold-mass and --scaffold-jump are clamped to the sequence-length scale so short inputs (e.g. pangenome-window excerpts) don't get filtered empty. Pass --no-adaptive-scaffolds to turn clamping off.

Scaffolding pipeline

When --scaffold-jump > 0 (the default), sweepga runs:

Pre-scaffold plane sweep per (query, target) chromosome pair using --num-mappings / --scoring / --overlap.
Scaffold creation — union-find merges mappings within --scaffold-jump on both axes; chains shorter than --scaffold-mass are dropped.
Scaffold plane sweep using --scaffold-filter (default many:many); 1:1 keeps the single best scaffold per chromosome-pair.
Rescue (--scaffold-dist > 0) — recover mappings within the Euclidean distance of a kept scaffold anchor. Off by default.

Pass --no-filter to skip everything and just emit the raw aligner output (or the raw input PAF).

Sparsification

--sparsify picks which pairs actually get aligned (for FASTA input) or how densely wfmash reports mappings:

Value	Behavior
`none` (default)	Align all pairs, report all mappings
`auto`	Heuristic pick based on cohort size
`<f>` or `random:<f>`	Random `<f>` fraction of pairs
`giant:<p>` / `connectivity:<p>`	Giant-component connectivity guarantee
`tree:<near>[:<far>[:<random>]]`	Tree-sampled neighbor pairs
`wfmash:auto` / `wfmash:<f>`	wfmash mapping density (`-x` flag); auto = `ln(n)/n*10`

--sparsify-pairs is a separate knob that only drives pre-alignment pair selection (same grammar, no wfmash variant).

PanSN

Sequence names like SAMPLE#HAPLOTYPE#CHR are treated pair-wise per-chromosome within each genome pair. Prefix delimiter is #.

Output

.paf — standard PAF text.
.1aln — compact binary ONE format (50–70% smaller, preserves edit distances). Pass --paf to convert to PAF on read.

Output path is chosen by --output-file (extension auto-detected) or the explicit --paf / --1aln flags.

Alnstats

alnstats alignments.paf              # summary
alnstats raw.paf filtered.paf        # before/after
alnstats alignments.paf -d           # per-genome-pair breakdown

License

MIT.

Name		Name	Last commit message	Last commit date
Latest commit History 445 Commits
.cargo		.cargo
.github/workflows		.github/workflows
benches		benches
data		data
doc		doc
docs		docs
examples		examples
scripts		scripts
src		src
tests		tests
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
build.rs		build.rs
install.sh		install.sh
test_aln_reader.rs		test_aln_reader.rs

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SweepGA

What it does

Install

Quick start

Defaults

Scaffolding pipeline

Sparsification

PanSN

Output

Alnstats

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SweepGA

What it does

Install

Quick start

Defaults

Scaffolding pipeline

Sparsification

PanSN

Output

Alnstats

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages