impg: implicit pangenome graph

Why impg?

Studying variation at specific loci across populations or species usually means either building an expensive whole-genome graph or falling back to a single reference. impg takes a third path: it treats all-vs-all pairwise alignments as an implicit pangenome graph and projects target ranges through the alignment network to extract only the homologous sequences you need. Query regions across hundreds of genomes in seconds, walk transitive alignments, partition a cohort into comparable loci, refine regions to maximize sample support — all without ever materializing a graph structure.

What does it do?

At its core, impg lifts ranges from a target sequence (the reference in a given alignment) into the queries aligned onto it. It outputs BED / BEDPE / PAF — ready to feed FASTA extraction, multiple sequence alignment, or a graph builder like pggb or minigraph-cactus — and can also emit GFA directly by chaining sweepga + seqwish + smoothxg-style smoothing.

How does it work?

impg uses coitrees (cache-oblivious interval trees) for fast range lookup, and stores CIGAR strings as compact deltas. The result is fast, memory-efficient projection of sequence ranges through alignment networks.
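To make "projecting a range through an alignment" concrete, here is a minimal sketch of the coordinate walk for a single alignment. It is not impg's implementation (which uses interval trees and delta-encoded CIGARs); the function name project_range, its argument order, and the restriction to forward-strand alignments with 0-based half-open coordinates are all assumptions made for illustration.

```shell
# Hypothetical helper: project a target range through one =/X/I/D CIGAR.
# Args: target_start query_start cigar range_start range_end
# Prints the query start and end of the projected range (nothing if no overlap).
project_range() {
  awk -v t="$1" -v q="$2" -v cigar="$3" -v rs="$4" -v re="$5" 'BEGIN {
    qs = -1; qe = -1
    while (match(cigar, /^[0-9]+[=XMID]/)) {
      len = substr(cigar, 1, RLENGTH - 1) + 0
      op  = substr(cigar, RLENGTH, 1)
      cigar = substr(cigar, RLENGTH + 1)
      if (op == "I") { q += len; continue }      # insertion: query-only bases
      lo = (rs > t) ? rs : t                     # overlap of this op with the range
      hi = (re < t + len) ? re : t + len
      if (lo < hi) {
        if (op == "D") { a = q; b = q }          # deletion: no query bases
        else { a = q + (lo - t); b = q + (hi - t) }
        if (qs < 0) qs = a
        qe = b
      }
      t += len
      if (op != "D") q += len
    }
    if (qs >= 0) print qs, qe
  }'
}

# Alignment starts at target 100 / query 200; lift target range 105-120.
project_range 100 200 '10=2D5=3I10=' 105 120
```

The 15 target bases lose 2 to the deletion and gain 3 from the insertion, so the projected query interval is 16 bases long.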

Install

# Bioconda
conda install -c bioconda impg

# Source
git clone --recursive https://github.com/pangenome/impg.git
cd impg
cargo install --force --path .

The source install places impg, gfaffix, and the companion aligner binaries (wfmash, FastGA) into ~/.cargo/bin/.

Docker

docker pull pangenome/impg
docker run pangenome/impg

Troubleshooting

On older glibc systems (e.g. Debian Buster) a plain cargo build can fail because wfmash needs modern CMake / GCC / glibc. Use Guix's toolchain:

source ./env.sh
cargo build --release

See .guix/ for the Guix build recipe (guix build -L .guix/modules --file=guix.scm). For libclang link errors, set LIBCLANG_PATH to your LLVM install (see env -i … LIBCLANG_PATH=…).

Quick start

impg query -a cerevisiae.pan.paf.gz -r S288C#1#chrI:50000-100000 -x
  • -a — alignment file (PAF / 1ALN / TPA). PAF must use =/X CIGAR ops (from wfmash or minimap2 --eqx).
  • -r — target range, seq:start-end.
  • -x — walk the transitive closure: find everything aligned to the initial result, recursively.
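Since PAF input must use =/X CIGAR ops, a quick pre-flight check can save a failed run. The sketch below scans the optional cg:Z: tags of a PAF and fails if any contains an M op. The function name paf_uses_eqx is made up for this example and is not part of impg.

```shell
# Hypothetical helper: exit 0 if no cg:Z: CIGAR in the PAF on stdin uses M ops.
paf_uses_eqx() {
  awk -F'\t' '{
    # optional tags start at column 13 in PAF
    for (i = 13; i <= NF; i++)
      if ($i ~ /^cg:Z:/ && $i ~ /[0-9]M/) { bad = 1; exit }
  } END { exit bad + 0 }'
}

# Example with a one-record PAF that uses = ops only.
printf 'q1\t100\t0\t50\t+\tt1\t100\t0\t50\t50\t50\t60\tcg:Z:50=\n' | paf_uses_eqx \
  && echo "CIGARs use =/X ops"
```

For a compressed file, pipe it through gzip -dc first.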

Example output (BED):

S288C#1#chrI         50000   100000
DBVPG6044#1#chrI     35335    85288
Y12#1#chrI           36263    86288
DBVPG6765#1#chrI     36166    86150
YPS128#1#chrI        47080    97062
UWOPS034614#1#chrI   36826    86817
SK1#1#chrI           52740   102721
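A quick sanity check on output like this is to compare the span of each hit with the query length; large differences hint at indels or partial alignments. The sketch below is plain awk over the three BED columns shown above; the function name bed_spans is an assumption of this example.

```shell
# Hypothetical helper: print each sequence name with its span (end - start).
bed_spans() {
  awk '{ printf "%s\t%d\n", $1, $3 - $2 }'
}

bed_spans <<'EOF'
S288C#1#chrI         50000   100000
DBVPG6044#1#chrI     35335    85288
Y12#1#chrI           36263    86288
SK1#1#chrI           52740   102721
EOF
```

In practice the input would be the output of impg query piped straight into the function.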

Commands

All commands accept -a (alignment files, mixed PAF/1ALN/TPA) or --alignment-list (text file, one per line), -t / --threads, and -v 0|1|2 for verbosity. Every command has a --help with the exhaustive flag list; this section covers the flags you are most likely to use.

query — project a range through alignments

# A single range
impg query -a aln.paf -r chr1:1000-2000

# Transitive closure, here capped at depth 3 with -m (the default depth is 2)
impg query -a aln.paf -r chr1:1000-2000 -x -m 3

# Many regions from a BED, mixed PAF + 1ALN
impg query -a f1.paf f2.1aln -b regions.bed

# Output formats: auto | bed | bedpe | paf | gfa | maf | fasta | fasta+paf | fasta-aln
impg query -a aln.paf -r chr1:1000-2000 -o bed
impg query -a aln.paf -r chr1:1000-2000 -o gfa --sequence-files genomes.fa
impg query -a aln.1aln -r chr1:1000-2000 -o fasta --sequence-files *.fa \
           --reverse-complement

# Filter / shape the result
impg query -a aln.paf -r chr1:1000-2000 --min-identity 0.9 -l 5000 -d 1000

# Restrict to a sequence whitelist (also filters transitive intermediates)
impg query -a aln.paf -r chr1:1000-2000 -x --subset-sequence-list seqs.txt

# Fast approximate mode (.1aln only; bed/bedpe output)
impg query -a aln.1aln -r chr1:1000-2000 --approximate

GFA / MAF / FASTA outputs need --sequence-files (FASTA or AGC archive) or --sequence-list. See GFA engines for engine selection and partitioned builds.

graph — build a pangenome graph from FASTA

Runs alignment, seqwish, and (optionally) smoothing in one step; no pre-computed alignment is needed.

# Default pipeline: pggb (align → seqwish → smooth → gfaffix)
impg graph --sequence-files genomes.fa -g output.gfa -t 16

# Partitioned mode for large inputs (aligns once, then builds per-window)
impg graph --sequence-files genomes.fa -g output.gfa --gfa-engine pggb:10000

# Reuse an existing PAF instead of aligning
impg graph --sequence-files genomes.fa -g output.gfa --paf-file aln.paf

# Batch alignment to cap per-batch RAM (wfmash) or disk (FastGA)
impg graph --sequence-files genomes.fa -g output.gfa --batch-bytes 2G

query -o gfa and graph share the same engine code and flags — the only difference is where the sequences come from (IMPG index + sequence files for query; FASTAs directly for graph).

partition — split the cohort into windowed loci

# 1Mb windows, single BED output
impg partition -a aln.paf -w 1000000

# One FASTA per partition (for downstream pipelines)
impg partition -a aln.1aln -w 1000000 -o fasta --sequence-files *.fa \
               --separate-files --output-folder partitions/

# Selection strategies pick the next starting sequence
impg partition -a aln.paf -w 1000000 --selection-mode longest     # default
impg partition -a aln.paf -w 1000000 --selection-mode sample      # PanSN sample
impg partition -a aln.paf -w 1000000 --selection-mode haplotype   # PanSN haplotype

# Start from a fixed list of sequences
impg partition -a aln.paf -w 1000000 --starting-sequences-file seqs.txt

# GFA output per partition; engines: pggb | seqwish | poa
impg partition -a aln.paf -w 1000000 -o gfa --gfa-engine pggb \
               --sequence-files *.fa --separate-files --output-folder gfas/

# Fully partitioned pipeline: build → lace → one gfaffix pass
impg partition -a aln.paf -w 100000 -o gfa --gfa-engine pggb:10000 \
               --sequence-files *.fa --output-folder results/

refine — tighten a locus to maximize sample support

Explores asymmetric left/right expansions around each range, picking the smallest window that keeps the most sequences, samples, or haplotypes fully spanning it. Useful for anchoring loci outside structural variants.

impg refine -a aln.paf -r chr1:1000-2000
impg refine -a aln.paf -b loci.bed --span-bp 2000 -d 200000

# Maximize PanSN samples / haplotypes instead of raw sequence count
impg refine -a aln.paf -r chr1:1000-2000 --pansn-mode sample
impg refine -a aln.paf -r chr1:1000-2000 --pansn-mode haplotype

# Cap expansion distance
impg refine -a aln.paf -r chr1:1000-2000 --max-extension 0.90    # 90% of locus
impg refine -a aln.paf -r chr1:1000-2000 --max-extension 50000   # 50kb absolute

# Emit the spanning-entity list alongside the refined BED
impg refine -a aln.paf -r chr1:1000-2000 --support-output support.bed
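The two --max-extension examples above mix a fraction of the locus with an absolute distance. The sketch below shows one way such a value could be turned into base pairs. The rule it uses (values up to 1 are fractions, anything larger is a distance in bp) is an assumption inferred from the comments above and is not confirmed by the source; the function name is hypothetical as well.

```shell
# Hypothetical helper: resolve a --max-extension style value to base pairs.
# Args: locus_start locus_end value
# ASSUMPTION: value <= 1 means "fraction of locus length", otherwise absolute bp.
max_extension_bp() {
  awk -v s="$1" -v e="$2" -v v="$3" 'BEGIN {
    len = e - s
    if (v + 0 <= 1) printf "%d\n", int(len * v + 0.5)   # fraction of the locus
    else            printf "%d\n", int(v + 0)            # absolute distance
  }'
}

max_extension_bp 1000 2000 0.90    # 90% of a 1000 bp locus
max_extension_bp 1000 2000 50000   # 50 kb absolute
```

Check impg refine --help for the actual interpretation before relying on this.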

similarity — pairwise similarity / distance within a region

impg similarity -a aln.paf -r chr1:1000-2000 --sequence-files *.fa
impg similarity -a aln.1aln -b regions.bed --sequence-files *.fa --distances

# Group by PanSN prefix
impg similarity -a aln.paf -r chr1:1000-2000 --sequence-files *.fa \
                --delim '#' --delim-pos 2     # sample#haplotype

# PCA / MDS on the distance matrix
impg similarity -a aln.paf -r chr1:1000-2000 --sequence-files *.fa \
                --pca --pca-components 3 --pca-measure cosine
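To see what grouping by PanSN prefix means for your own sequence names, the sketch below derives the group key from a name by keeping everything before the Nth delimiter. This mirrors the sample#haplotype comment above, but how impg itself applies --delim and --delim-pos is an assumption here, and pansn_key is a made-up helper.

```shell
# Hypothetical helper: keep the first POS delimiter-separated fields of a name.
# Args: name delimiter position
pansn_key() {
  printf '%s\n' "$1" | cut -d"$2" -f1-"$3"
}

pansn_key 'S288C#1#chrI' '#' 2   # sample#haplotype
pansn_key 'S288C#1#chrI' '#' 1   # sample only

# Count sequences per sample#haplotype group in a list of names.
printf '%s\n' 'S288C#1#chrI' 'S288C#1#chrII' 'SK1#1#chrI' | cut -d'#' -f1-2 | sort | uniq -c
```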

lace — combine many per-window graphs into one

# GFAs (auto-detected)
impg lace -f gfa1.gfa gfa2.gfa gfa3.gfa -o combined.gfa

# From a list file, fill inter-window gaps with sequence
impg lace -l gfa_list.txt -o combined.gfa --fill-gaps 1 --sequence-files ref.fa

# VCFs
impg lace -f *.vcf -o combined.vcf --reference ref.fa

Path names must follow NAME:START-END (e.g. HG002#1#chr20:1000-2000); the coordinates drive reassembly. NAME may contain : — the last : is the separator.
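If you generate or validate path names yourself, the "last : is the separator" rule maps directly onto shell parameter expansion. The sketch below splits a path name into NAME, START, and END; the function name split_path_name is an assumption of this example, not an impg command.

```shell
# Hypothetical helper: split NAME:START-END at the last ':'.
split_path_name() {
  local path="$1"
  local name="${path%:*}"      # everything before the last ':'
  local range="${path##*:}"    # everything after the last ':'
  printf '%s\t%s\t%s\n' "$name" "${range%-*}" "${range#*-}"
}

split_path_name 'HG002#1#chr20:1000-2000'
split_path_name 'asm:v2#1#chr20:1000-2000'   # NAME itself contains ':'
```

The second call keeps asm:v2#1#chr20 intact as the name because only the final colon is treated as the separator.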

Recommended post-processing:

gfaffix combined.gfa -o combined.fix.gfa &> /dev/null
odgi unchop -i combined.fix.gfa -o - -t 16 | \
  odgi sort  -i -              -o - -p gYs -t 16 | \
  odgi view  -i -              -g > combined.final.gfa

index — build a reusable IMPG index

# Single combined index
impg index -a aln.paf -i aln.impg

# Mixed PAF + 1ALN + TPA
impg index -a f1.paf f2.1aln f3.tpa -i all.impg

# Per-file index (faster incremental rebuilds for large cohorts)
impg index --alignment-list files.txt --index-mode per-file

--index-mode auto (default) picks per-file when ≥ 100 files are listed, single otherwise. impg warns when the index is older than its input alignments; -f/--force-reindex rebuilds.
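To predict which mode auto will pick for a given list file, you can count its entries against the 100-file threshold stated above. This is only a sketch of that rule: the helper name is hypothetical, and counting non-empty lines is an assumption about how entries are counted.

```shell
# Hypothetical helper: guess the index mode that "auto" would choose.
# Arg: alignment list file, one path per line.
pick_index_mode() {
  local n
  n=$(grep -c . "$1")          # number of non-empty lines
  if [ "$n" -ge 100 ]; then echo per-file; else echo single; fi
}

# Example with a throwaway list of 120 placeholder paths.
list=$(mktemp)
seq 1 120 | sed 's/^/aln_/; s/$/.paf.gz/' > "$list"
pick_index_mode "$list"
rm -f "$list"
```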

Bgzipped PAFs work natively (impg reads .paf.gz). Optionally, running bgzip -r alignments.paf.gz creates a .gzi sidecar index that speeds up the first read.

stats — summarize alignments

impg stats -a aln.paf
impg stats -a f1.paf f2.1aln

GFA engines

The graph, query -o gfa, and partition -o gfa commands share one set of engine implementations, selected via --gfa-engine:

| Engine | Pipeline | Use for |
|---|---|---|
| pggb (default) | sweepga + seqwish + smoothxg-style smoothing + gfaffix | smoothed variation graphs |
| seqwish | sweepga + seqwish + gfaffix | raw (unsmoothed) graphs |
| poa | single-pass SPOA | small regions, quick MSA-based output |

Partitioned mode

Append :WINDOW to any engine to build per-window and lace:

impg query    -a aln.paf -r chr1:0-500000 -o gfa \
              --gfa-engine pggb:10000 --sequence-files *.fa -O out
impg graph    --sequence-files *.fa -g out.gfa --gfa-engine seqwish:10000
impg partition -a aln.paf -w 100000 -o gfa --gfa-engine pggb:10000 \
               --sequence-files *.fa --output-folder results/

Window size is in bp (≥ 1000). Partitioned mode is the recommended approach for large regions — it caps peak memory and runs one final gfaffix pass over the laced graph.
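When wrapping these commands in scripts, it can help to validate an ENGINE[:WINDOW] value before launching a long job. The sketch below checks the engine name against the three engines listed above and enforces the 1000 bp minimum; it is an illustration written for this README, not impg's own parser, and parse_gfa_engine is a made-up name.

```shell
# Hypothetical helper: validate an ENGINE[:WINDOW] spec and print its parts.
parse_gfa_engine() {
  local spec="$1" engine window
  engine="${spec%%:*}"
  case "$engine" in
    pggb|seqwish|poa) ;;
    *) echo "unknown engine: $engine" >&2; return 1 ;;
  esac
  case "$spec" in
    *:*)
      window="${spec#*:}"
      case "$window" in
        ''|*[!0-9]*) echo "window must be a positive integer" >&2; return 1 ;;
      esac
      [ "$window" -ge 1000 ] || { echo "window must be >= 1000 bp" >&2; return 1; }
      ;;
    *) window="none" ;;          # no window: build the region in one piece
  esac
  printf '%s\t%s\n' "$engine" "$window"
}

parse_gfa_engine pggb:10000
parse_gfa_engine poa
parse_gfa_engine seqwish:500 || echo "rejected"
```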

Tuning

The flags below are available on all three GFA-producing commands. Defaults match pggb's conventions; tune them only if the default graph doesn't meet your needs.

# Seqwish induction
--min-match-len 23            # minimum transitive-match length
--transclose-batch 10000000   # batch size (reduce for lower memory)
--sparse-factor 0.0           # drop this fraction of input matches
--disk-backed                 # use disk-backed interval trees
--repeat-max / --min-repeat-dist

# Smoothxg-style smoothing (pggb only)
--target-poa-length 700,1100  # one pass per value
--max-node-length 100
--poa-padding-fraction 0.001

# Alignment filtering (sweepga, seqwish + pggb only)
--no-filter                   # skip post-alignment filtering
--num-mappings many:many      # plane-sweep cardinality
--scaffold-jump 50000         # scaffold chaining gap (0 = off)
--scaffold-mass 10000         # min scaffold chain length
--overlap 0.95
--min-aln-identity 0.9

# Aligner backend
--aligner wfmash              # default; alt: fastga
--sparsify auto               # wfmash-only; pair-selection heuristic
--map-pct-identity 90         # wfmash -p value
--fastga-frequency / --fastga-frequency-multiplier   # fastga-only

# Temp files (can be large)
--temp-dir /scratch/tmp       # explicit path
--temp-dir ramdisk            # → /dev/shm on Linux

Combining --aligner fastga with --sparsify or --aligner wfmash with --fastga-frequency is rejected at parse time.

Common options

  • -a / --alignment-files — one or more PAF/1ALN/TPA files (can be .gz).
  • --alignment-list — text file, one alignment path per line.
  • -i / --index — existing IMPG index.
  • -f / --force-reindex — rebuild even if the index is up-to-date.
  • -t / --threads — default 4.
  • -d / --merge-distance — merge nearby hits within this gap (bp).
  • --no-merge — disable merging.
  • --consider-strandness — keep strands separate during merge.
  • --subset-sequence-list — restrict results to listed sequences.
  • --unidirectional — disable bidirectional alignment interpretation.

Sequence-requiring outputs (GFA/MAF/FASTA, similarity, lace --fill-gaps) take --sequence-files (FASTA or AGC) or --sequence-list.

Tutorial: yeast pangenome graph

FASTA="cerevisiae.fa.gz"
PAF="cerevisiae.paf"
THREADS=16

# 1. Index
impg index -a "$PAF" -i yeast.impg -t "$THREADS"

# 2. Partition into 100kb windows, one FASTA per window
mkdir -p partitions gfas
impg partition -i yeast.impg -w 100000 \
    --sequence-files "$FASTA" -o fasta \
    --separate-files --output-folder partitions -t "$THREADS"

# 3. Build per-partition GFAs in parallel
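# (4 jobs x 4 threads each; the filename is passed as $1 rather than pasted into the script)
find partitions -name "*.fasta" -print0 | xargs -0 -P 4 -I {} bash -c '
    f="$1"; base=$(basename "$f" .fasta)
    impg graph --sequence-files "$f" -g "gfas/${base}.gfa" -t 4
' _ {}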

# 4. Lace, filling inter-window gaps with reference sequence
find gfas -name "*.gfa" -size +0 | sort -V > gfa_list.txt
impg lace --file-list gfa_list.txt --sequence-files "$FASTA" \
    -o yeast.gfa --fill-gaps 2 -t "$THREADS"

# 5. Post-process with odgi
odgi build  -g yeast.gfa          -o yeast.og       -t "$THREADS"
odgi sort   -i yeast.og           -o yeast.sort.og  -O -p Ygs -t "$THREADS"
odgi layout -i yeast.sort.og      -o yeast.lay      -t "$THREADS"
odgi viz    -i yeast.sort.og      -o yeast.viz.png  -x 4000 -y 1000 -s '#'
odgi draw   -i yeast.sort.og      -c yeast.lay      -p yeast.draw.png

For modern inputs, you can replace steps 1–4 with a single impg graph --sequence-files "$FASTA" -g yeast.gfa --gfa-engine pggb:100000.

Visualizing FASTA alignments

scripts/faln2html.py renders the fasta-aln output into an interactive HTML MSA using react-msa or ProSeqViewer.

impg query -a aln.paf -r chr1:1000-2000 -o fasta-aln --sequence-files *.fa \
  | python scripts/faln2html.py -i - -o alignment.html [--tool proseqviewer]

Authors

Andrea Guarracino <aguarra1@uthsc.edu> · Bryce Kille <brycekille@gmail.com> · Erik Garrison <erik.garrison@gmail.com>

License

MIT.
