impg: implicit pangenome graph

Why impg?

Studying variation at specific loci across populations or species usually means either building an expensive whole-genome graph or falling back to a single reference. impg takes a third path: it treats all-vs-all pairwise alignments as an implicit pangenome graph and projects target ranges through the alignment network to extract only the homologous sequences you need. Query regions across hundreds of genomes in seconds, walk transitive alignments, partition a cohort into comparable loci, refine regions to maximize sample support — all without ever materializing a graph structure.

What does it do?

At its core, impg lifts ranges from a target sequence (the reference in a given alignment) into the queries aligned onto it. It outputs BED / BEDPE / PAF — ready to feed FASTA extraction, multiple sequence alignment, or a graph builder like pggb or minigraph-cactus — and can also emit GFA directly by chaining sweepga + seqwish + smoothxg-style smoothing.

How does it work?

impg uses coitrees (cache-oblivious interval trees) for fast range lookup, and stores CIGAR strings as compact deltas. The result is fast, memory-efficient projection of sequence ranges through alignment networks.
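
To make "projecting a range through an alignment" concrete, here is a minimal sketch of lifting a target range through a single alignment. It is not impg's implementation: the helper name `project_range` and its argument order are assumptions for illustration, it handles only forward-strand records, and it skips the interval-tree lookup that would select candidate alignments first. It walks a PAF-style CIGAR (=, X, M, I, D) and reports the query interval matched to the requested target range.

```shell
# Hypothetical helper, not part of impg: lift a target range through one
# forward-strand alignment described by a CIGAR string.
# Usage: project_range TARGET_START QUERY_START CIGAR RANGE_START RANGE_END
project_range() {
    awk -v t="$1" -v q="$2" -v cg="$3" -v rs="$4" -v re="$5" 'BEGIN {
        qa = -1; qb = -1
        # Consume one <length><op> token per iteration.
        while (match(cg, /^[0-9]+[MIDX=]/)) {
            n  = substr(cg, 1, RLENGTH - 1) + 0
            op = substr(cg, RLENGTH, 1)
            cg = substr(cg, RLENGTH + 1)
            if (op == "I") { q += n; continue }   # query-only bases
            if (op == "D") { t += n; continue }   # target-only bases
            # =, X and M advance both sequences: clip the block to the range.
            lo = (t > rs) ? t : rs
            hi = (t + n < re) ? t + n : re
            if (lo < hi) {
                if (qa < 0) qa = q + (lo - t)
                qb = q + (hi - t)
            }
            t += n; q += n
        }
        if (qa >= 0) print qa "\t" qb
    }'
}

# Alignment starts at target 100 / query 200; lift target range 105-120.
project_range 100 200 "10=2D5=3I10=" 105 120
```

The deletion shifts the target ahead of the query and the insertion does the opposite, which is why the lifted interval is not simply the input range plus a fixed offset.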

Install

# Bioconda
conda install -c bioconda impg

# Source
git clone --recursive https://github.com/pangenome/impg.git
cd impg
cargo install --force --path .

The source install places impg, gfaffix, and the companion aligner binaries (wfmash, FastGA) into ~/.cargo/bin/.

Docker

docker pull pangenome/impg
docker run pangenome/impg

Troubleshooting

On older glibc systems (e.g. Debian Buster) a plain cargo build can fail because wfmash needs modern CMake / GCC / glibc. Use Guix's toolchain:

source ./env.sh
cargo build --release

See .guix/ for the Guix build recipe (guix build -L .guix/modules --file=guix.scm). For libclang link errors, set LIBCLANG_PATH to your LLVM install (see env -i … LIBCLANG_PATH=…).

Quick start

impg query -a cerevisiae.pan.paf.gz -r S288C#1#chrI:50000-100000 -x

-a — alignment file (PAF / 1ALN / TPA). PAF must use =/X CIGAR ops (from wfmash or minimap2 --eqx).
-r — target range, seq:start-end.
-x — walk the transitive closure: find everything aligned to the initial result, recursively.

Example output (BED):

S288C#1#chrI         50000   100000
DBVPG6044#1#chrI     35335    85288
Y12#1#chrI           36263    86288
DBVPG6765#1#chrI     36166    86150
YPS128#1#chrI        47080    97062
UWOPS034614#1#chrI   36826    86817
SK1#1#chrI           52740   102721
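
A quick sanity check on output like this is to compare span lengths across haplotypes, since large differences can hint at indels or partial alignments. The helper below is a sketch: `bed_span_lengths` is a name made up for this example, and it assumes the first three whitespace-separated columns are name, start, and end.

```shell
# Hypothetical helper: print each BED record name with its span length (end - start).
bed_span_lengths() {
    awk 'NF >= 3 { printf "%s\t%d\n", $1, $3 - $2 }'
}

# Feed it a few of the records shown above.
bed_span_lengths <<'EOF'
S288C#1#chrI         50000   100000
DBVPG6044#1#chrI     35335    85288
Y12#1#chrI           36263    86288
EOF
```

In practice the input would come from piping the BED output of impg query into the function rather than from a here-document.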

Commands

All commands accept -a (alignment files, mixed PAF/1ALN/TPA) or --alignment-list (text file, one per line), -t / --threads, and -v 0|1|2 for verbosity. Every command has a --help with the exhaustive flag list; this section covers the flags you are most likely to use.

`query` — project a range through alignments

# A single range
impg query -a aln.paf -r chr1:1000-2000

# Transitive closure; -m raises the maximum depth from the default of 2 to 3
impg query -a aln.paf -r chr1:1000-2000 -x -m 3

# Many regions from a BED, mixed PAF + 1ALN
impg query -a f1.paf f2.1aln -b regions.bed

# Output formats: auto | bed | bedpe | paf | gfa | maf | fasta | fasta+paf | fasta-aln
impg query -a aln.paf -r chr1:1000-2000 -o bed
impg query -a aln.paf -r chr1:1000-2000 -o gfa --sequence-files genomes.fa
impg query -a aln.1aln -r chr1:1000-2000 -o fasta --sequence-files *.fa \
           --reverse-complement

# Filter / shape the result
impg query -a aln.paf -r chr1:1000-2000 --min-identity 0.9 -l 5000 -d 1000

# Restrict to a sequence whitelist (also filters transitive intermediates)
impg query -a aln.paf -r chr1:1000-2000 -x --subset-sequence-list seqs.txt

# Fast approximate mode (.1aln only; bed/bedpe output)
impg query -a aln.1aln -r chr1:1000-2000 --approximate

GFA / MAF / FASTA outputs need --sequence-files (FASTA or AGC archive) or --sequence-list. See GFA engines for engine selection and partitioned builds.

`graph` — build a pangenome graph from FASTA

Runs alignment + seqwish + (optional) smoothing directly from the input sequences; no pre-computed alignment is needed.

# Default pipeline: pggb (align → seqwish → smooth → gfaffix)
impg graph --sequence-files genomes.fa -g output.gfa -t 16

# Partitioned mode for large inputs (aligns once, then builds per-window)
impg graph --sequence-files genomes.fa -g output.gfa --gfa-engine pggb:10000

# Reuse an existing PAF instead of aligning
impg graph --sequence-files genomes.fa -g output.gfa --paf-file aln.paf

# Batch alignment to cap per-batch RAM (wfmash) or disk (FastGA)
impg graph --sequence-files genomes.fa -g output.gfa --batch-bytes 2G

query -o gfa and graph share the same engine code and flags — the only difference is where the sequences come from (IMPG index + sequence files for query; FASTAs directly for graph).

`partition` — split the cohort into windowed loci

# 1Mb windows, single BED output
impg partition -a aln.paf -w 1000000

# One FASTA per partition (for downstream pipelines)
impg partition -a aln.1aln -w 1000000 -o fasta --sequence-files *.fa \
               --separate-files --output-folder partitions/

# Selection strategies pick the next starting sequence
impg partition -a aln.paf -w 1000000 --selection-mode longest     # default
impg partition -a aln.paf -w 1000000 --selection-mode sample      # PanSN sample
impg partition -a aln.paf -w 1000000 --selection-mode haplotype   # PanSN haplotype

# Start from a fixed list of sequences
impg partition -a aln.paf -w 1000000 --starting-sequences-file seqs.txt

# GFA output per partition; engines: pggb | seqwish | poa
impg partition -a aln.paf -w 1000000 -o gfa --gfa-engine pggb \
               --sequence-files *.fa --separate-files --output-folder gfas/

# Fully partitioned pipeline: build → lace → one gfaffix pass
impg partition -a aln.paf -w 100000 -o gfa --gfa-engine pggb:10000 \
               --sequence-files *.fa --output-folder results/

`refine` — tighten a locus to maximize sample support

Explores asymmetric left/right expansions around each range, picking the smallest window that keeps the most sequences, samples, or haplotypes fully spanning it. Useful for anchoring loci outside structural variants.

impg refine -a aln.paf -r chr1:1000-2000
impg refine -a aln.paf -b loci.bed --span-bp 2000 -d 200000

# Maximize PanSN samples / haplotypes instead of raw sequence count
impg refine -a aln.paf -r chr1:1000-2000 --pansn-mode sample
impg refine -a aln.paf -r chr1:1000-2000 --pansn-mode haplotype

# Cap expansion distance
impg refine -a aln.paf -r chr1:1000-2000 --max-extension 0.90    # 90% of locus
impg refine -a aln.paf -r chr1:1000-2000 --max-extension 50000   # 50kb absolute

# Emit the spanning-entity list alongside the refined BED
impg refine -a aln.paf -r chr1:1000-2000 --support-output support.bed

`similarity` — pairwise similarity / distance within a region

impg similarity -a aln.paf -r chr1:1000-2000 --sequence-files *.fa
impg similarity -a aln.1aln -b regions.bed --sequence-files *.fa --distances

# Group by PanSN prefix
impg similarity -a aln.paf -r chr1:1000-2000 --sequence-files *.fa \
                --delim '#' --delim-pos 2     # sample#haplotype

# PCA / MDS on the distance matrix
impg similarity -a aln.paf -r chr1:1000-2000 --sequence-files *.fa \
                --pca --pca-components 3 --pca-measure cosine
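
To see what the --delim '#' --delim-pos 2 grouping key looks like, the sketch below cuts a PanSN name at the second '#'. It only illustrates the naming convention: `pansn_prefix` is a hypothetical helper, not an impg command, and it assumes the grouping key is everything before the N-th delimiter, as the sample#haplotype comment above suggests.

```shell
# Hypothetical helper: keep the first N '#'-separated fields of a PanSN name.
# Usage: pansn_prefix NAME N
pansn_prefix() {
    printf '%s\n' "$1" | cut -d'#' -f"1-$2"
}

pansn_prefix 'HG002#1#chr20' 2   # sample#haplotype
pansn_prefix 'HG002#1#chr20' 1   # sample only
```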

`lace` — combine many per-window graphs into one

# GFAs (auto-detected)
impg lace -f gfa1.gfa gfa2.gfa gfa3.gfa -o combined.gfa

# From a list file, fill inter-window gaps with sequence
impg lace -l gfa_list.txt -o combined.gfa --fill-gaps 1 --sequence-files ref.fa

# VCFs
impg lace -f *.vcf -o combined.vcf --reference ref.fa

Path names must follow NAME:START-END (e.g. HG002#1#chr20:1000-2000); the coordinates drive reassembly. NAME may contain : — the last : is the separator.
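
The "last : is the separator" rule can be checked before lacing with a few lines of shell. This is a sketch of the naming rule only; `split_path_name` is a hypothetical helper and does not reflect how impg parses names internally.

```shell
# Hypothetical helper: split NAME:START-END at the last ':'.
# Prints NAME, START and END separated by tabs.
split_path_name() {
    local path="$1"
    local name="${path%:*}"      # everything before the last ':'
    local range="${path##*:}"    # everything after the last ':'
    printf '%s\t%s\t%s\n' "$name" "${range%-*}" "${range#*-}"
}

split_path_name 'HG002#1#chr20:1000-2000'
split_path_name 'lab:HG002#1#chr20:1000-2000'   # NAME itself contains ':'
```

Using the shortest-suffix and longest-prefix expansions keeps any earlier ':' characters inside NAME.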

`index` — build a reusable IMPG index

# Single combined index
impg index -a aln.paf -i aln.impg

# Mixed PAF + 1ALN + TPA
impg index -a f1.paf f2.1aln f3.tpa -i all.impg

# Per-file index (faster incremental rebuilds for large cohorts)
impg index --alignment-list files.txt --index-mode per-file

--index-mode auto (default) picks per-file when ≥ 100 files are listed, single otherwise. impg warns when the index is older than its input alignments; -f/--force-reindex rebuilds.

Bgzipped PAFs work natively (impg reads .paf.gz). Optionally, run bgzip -r alignments.paf.gz to create a .gzi sidecar index that speeds up the first read.

`stats` — summarize alignments

impg stats -a aln.paf
impg stats -a f1.paf f2.1aln

GFA engines

The graph, query -o gfa, and partition -o gfa commands share one set of engine implementations, selected via --gfa-engine:

Engine	Pipeline	Use for
`pggb` (default)	sweepga + seqwish + smoothxg-style smoothing + gfaffix	smoothed variation graphs
`seqwish`	sweepga + seqwish + gfaffix	raw (unsmoothed) graphs
`poa`	single-pass SPOA	small regions, quick MSA-based output

Partitioned mode

Append :WINDOW to any engine to build per-window and lace:

impg query    -a aln.paf -r chr1:0-500000 -o gfa \
              --gfa-engine pggb:10000 --sequence-files *.fa -O out
impg graph    --sequence-files *.fa -g out.gfa --gfa-engine seqwish:10000
impg partition -a aln.paf -w 100000 -o gfa --gfa-engine pggb:10000 \
               --sequence-files *.fa --output-folder results/

Window size is in bp (≥ 1000). Partitioned mode is the recommended approach for large regions — it caps peak memory and runs one final gfaffix pass over the laced graph.
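
In a wrapper script it can be convenient to validate the ENGINE:WINDOW value before launching a long run. The function below is a sketch under the rules stated in this section (engines pggb, seqwish, poa; window at least 1000 bp); `parse_gfa_engine` is a hypothetical name and impg performs its own validation regardless.

```shell
# Hypothetical pre-flight check: split ENGINE[:WINDOW] and enforce WINDOW >= 1000.
# Prints "engine<TAB>window" (window is empty when no partitioning is requested).
parse_gfa_engine() {
    local spec="$1" engine window=""
    engine="${spec%%:*}"
    case "$spec" in
        *:*) window="${spec#*:}" ;;
    esac
    case "$engine" in
        pggb|seqwish|poa) ;;
        *) echo "unknown engine: $engine" >&2; return 1 ;;
    esac
    if [ -n "$window" ]; then
        case "$window" in
            *[!0-9]*) echo "window must be an integer: $window" >&2; return 1 ;;
        esac
        [ "$window" -ge 1000 ] || { echo "window must be >= 1000 bp" >&2; return 1; }
    fi
    printf '%s\t%s\n' "$engine" "$window"
}

parse_gfa_engine pggb:10000
parse_gfa_engine seqwish
parse_gfa_engine poa:500 || echo "rejected"
```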

Tuning

The flags below are available on all three GFA-producing commands. Defaults match pggb's conventions; only tune them if the default graph doesn't meet your needs.

# Seqwish induction
--min-match-len 23            # minimum transitive-match length
--transclose-batch 10000000   # batch size (reduce for lower memory)
--sparse-factor 0.0           # drop this fraction of input matches
--disk-backed                 # use disk-backed interval trees
--repeat-max / --min-repeat-dist

# Smoothxg-style smoothing (pggb only)
--target-poa-length 700,1100  # one pass per value
--max-node-length 100
--poa-padding-fraction 0.001

# Alignment filtering (sweepga, seqwish + pggb only)
--no-filter                   # skip post-alignment filtering
--num-mappings many:many      # plane-sweep cardinality
--scaffold-jump 50000         # scaffold chaining gap (0 = off)
--scaffold-mass 10000         # min scaffold chain length
--overlap 0.95
--min-aln-identity 0.9

# Aligner backend
--aligner wfmash              # default; alt: fastga
--sparsify auto               # wfmash-only; pair-selection heuristic
--map-pct-identity 90         # wfmash -p value
--fastga-frequency / --fastga-frequency-multiplier   # fastga-only

# Temp files (can be large)
--temp-dir /scratch/tmp       # explicit path
--temp-dir ramdisk            # → /dev/shm on Linux

Combining --aligner fastga with --sparsify or --aligner wfmash with --fastga-frequency is rejected at parse time.

Common options

-a / --alignment-files — one or more PAF/1ALN/TPA files (can be .gz).
--alignment-list — text file, one alignment path per line.
-i / --index — existing IMPG index.
-f / --force-reindex — rebuild even if the index is up-to-date.
-t / --threads — default 4.
-d / --merge-distance — merge nearby hits within this gap (bp).
--no-merge — disable merging.
--consider-strandness — keep strands separate during merge.
--subset-sequence-list — restrict results to listed sequences.
--unidirectional — disable bidirectional alignment interpretation.

Sequence-requiring outputs (GFA/MAF/FASTA, similarity, lace --fill-gaps) take --sequence-files (FASTA or AGC) or --sequence-list.

Tutorial: yeast pangenome graph

FASTA="cerevisiae.fa.gz"
PAF="cerevisiae.paf"
THREADS=16

# 1. Index
impg index -a "$PAF" -i yeast.impg -t "$THREADS"

# 2. Partition into 100kb windows, one FASTA per window
mkdir -p partitions gfas
impg partition -i yeast.impg -w 100000 \
    --sequence-files "$FASTA" -o fasta \
    --separate-files --output-folder partitions -t "$THREADS"

# 3. Build per-partition GFAs in parallel
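# (file names are passed as positional arguments so unusual characters stay quoted)
find partitions -name '*.fasta' -print0 | xargs -0 -P 4 -I {} bash -c '
    f="$1"; base=$(basename "$f" .fasta)
    impg graph --sequence-files "$f" -g "gfas/${base}.gfa" -t 4
' _ {}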

# 4. Lace, filling inter-window gaps with reference sequence
find gfas -name "*.gfa" -size +0 | sort -V > gfa_list.txt
impg lace --file-list gfa_list.txt --sequence-files "$FASTA" \
    -o yeast.gfa --fill-gaps 2 -t "$THREADS"

# 5. Post-process with odgi
odgi build  -g yeast.gfa          -o yeast.og       -t "$THREADS"
odgi sort   -i yeast.og           -o yeast.sort.og  -O -p Ygs -t "$THREADS"
odgi layout -i yeast.sort.og      -o yeast.lay      -t "$THREADS"
odgi viz    -i yeast.sort.og      -o yeast.viz.png  -x 4000 -y 1000 -s '#'
odgi draw   -i yeast.sort.og      -c yeast.lay      -p yeast.draw.png

For modern inputs, you can replace steps 1–4 with a single impg graph --sequence-files "$FASTA" -g yeast.gfa --gfa-engine pggb:100000.

Visualizing FASTA alignments

scripts/faln2html.py renders the fasta-aln output into an interactive HTML MSA using react-msa or ProSeqViewer.

impg query -a aln.paf -r chr1:1000-2000 -o fasta-aln --sequence-files *.fa \
  | python scripts/faln2html.py -i - -o alignment.html [--tool proseqviewer]

Authors

Andrea Guarracino aguarra1@uthsc.edu · Bryce Kille brycekille@gmail.com · Erik Garrison erik.garrison@gmail.com

License

MIT.
