This manual describes how to build an allele-based pangenome database for both functional annotation and NGS genotyping. The pipeline consumes: (1) haplotype-resolved assemblies and reference genomes (2) a BED file of target genes (optionally with exon targets)
Major stages: input preparation, masking, target extraction, and running the Snakemake workflow.
- Linux environment with bash, Python 3, samtools, and Snakemake
- Compilers/tools for C++ utilities in
snakemake/(see that folder’s README) - PAT repository layout with subfolders: masking/ tools/ snakemake/ scripts/
We provide an automatic script to help set up the required binaries and Python dependencies for the PAT toolkit.
Requirements:
- Python 3.10
- A valid Conda environment (recommended)
- C++ compiler (g++ ≥ 8)
- Internet access to install dependencies via conda
How to Use:
-
Create and activate a Python 3.10 environment (if not already done): conda create -n patenv python=3.10 -y conda activate patenv
-
Install required dependencies: conda install -y htslib=1.21 eigen zlib
-
Run the install script from the root of the PAT repository: python install/install.py --prefix /path/to/PAT/scripts/
Replace /path/to/PAT/scripts/ with the absolute path to your ScriptFolder. This should match what you set in the Snakemake config JSON.
The script will:
- Compile all C++ binaries (kmercounter8, kmer_selector, etc.)
- Use the active conda environment's include/ and lib/ directories for htslib/sam.h and zlib.h
- Copy compiled binaries into the provided --prefix folder
-
You should see output like: [BUILD] kmercounter8 not found, attempting to build from /.../src/kmercounter [DONE] Copied kmercounter8 → /.../scripts/kmercounter8
Troubleshooting:
-
If you see errors like: fatal error: zlib.h: No such file or directory or fatal error: htslib/sam.h: No such file or directory
then your environment may be missing required headers. Try: conda install -y htslib=1.21 zlib
Make a two-column, no-header TSV listing all samples and references. Column 1: sampleName_h{1|2} Column 2: absolute or relative FASTA path
Example (query_pathes_withrefs.txt):
apr001_h1 Assemblies/apr001.1.polished.fa
apr001_h2 Assemblies/apr001.2.polished.fa
CHM13_h1 References/CHM13.v2.0.fa
HG38_h1 References/hg38.fa
Notes:
- Use concise sample names (≤10 letters recommended).
_h1and_h2denote haplotypes;_h1is typically paternal.- For references, only use
_h1(e.g.,CHM13_h1,HG38_h1). - If mixing CHM13 and hg38, distinguish contigs using FASTA header prefixes:
- CHM13: NC_0609... (e.g., NC_060925.1 for chr1)
- hg38 : chr... (e.g., chr1)
Make sure these prefixes match your FASTA headers. Example file can be seen in pipeline/query_pathes.txt_example.txt.
Use the scripts in masking/ (see masking/readme.txt for details).
This step generates the masked FASTAs used downstream. You may skip if you have already done so
Use the helper script to extract per-gene BED from a GFF3 (gene annotation file, for example, some gff3 files can be found: https://www.gencodegenes.org/):
python tools/gff_toGeneBed.py -i <genes.gff3> -o genes.bed
Optional arguments:
-g X Select specific genes (file or comma-list). Supports wildcards like SMN*.
-e Add merged exon intervals as `Target=...` in column 5.
-a N Add N bp flanking anchors to each gene region.
Help:
python tools/gff_toGeneBed.py -h
Accepted BED formats: 4-column (gene-level):
chrom start end gene_or_group
5-column (recommended):
chrom start end gene_or_group Target=s1-e1,s2-e2,...
Example:
chr5_GL339449v2_alt 456848 485731 SMN2 Target=456848-457609,457774-458352
Example output format can be seen in pipeline/SMNexample.bed.
See pipeline/ for workflow logic, C++ utilities, and instructions.
Compile all C++ tools as directed in that folder’s README.
There are two snakemake files included, Snakefile and Snakefile_light. Snakefile is used for a large number of genes, while Snakefile_light is a lighter version that only works for a small number of genes and requires fewer resources.
Run example:
snakemake -k --cluster "sbatch --account=mchaisso_100 --partition=qcb --time=500:00:00 {resources.slurm_extra}" --default-resources "mem_mb=3000" --jobs 500 --rerun-incomplete --notemp --latency-wait 100 --resources mem_gb=1000
It requires a JSON config, the Snakemake profile or wrappers will load it.
- JSON configuration (example + field descriptions)
Example JSON:
{
"slurm": " --account=<acct> --time 50:00:00 --partition=qcb ",
"QueryPath": "query_pathes_withrefs.txt",
"ScriptFolder": "/path/to/PAT/scripts/",
"TargetFolder": "groups",
"TempFolder": "snaketemp",
"genelist": "./genes.bed",
"ReferencePrefix": "NC_0609",
"NumPartitions": 1
}
Descriptions:
- slurm: SLURM job submission args
- QueryPath: TSV of assemblies/references, example file "query_pathes.txt_example.txt", included.
- ScriptFolder: Path to scripts folder
- TargetFolder: Output folder for results
- TempFolder: Temp file directory, default: ./snaketemp/
- genelist: BED file for genes (4- or 5-col), example file "SMNexample.bed", included.
- blocksize: the max sequence length used for pangenome alleles, larger will be splitted, default: 80,000,
- ReferencePrefix: Prefix used in reference contigs (e.g., "NC_0609" or "chr")
- NumPartitions: Number of partitions to divide genes (aim ~1000 genes/partition)
After Snakemake completes, run:
cat tempfolder/*_bfixpartitions{small,large}.list | sed 's/$/_kmatrix.txt/' | sort -u > allpartitions.list
Then each line will be each compiled matrix
Then concatenate all matrices:
xargs -a allpartitions.list -I {} bash -c '[[ -f "{}" ]] && cat "{}"' > All_matrix.txt
Index the matrix:
python scripts/matrixindex.py -i All_matrix.txt -g <genes.gff3> -r 1
This creates:
- All_matrix.txt → merged matrix
- All_matrix.txt.index → index file
Note: You might need a background kmer file, which can be obtained from Ctyper's GitHub site, and you may name it as All_matrix.txt.bgd for your convenient
- Tips & Gotchas
- Ensure
.faiindex files match the masked FASTAs. - Be consistent across:
- TSV sample IDs (e.g., apr001_h1)
- Reference IDs (e.g., CHM13_h1)
- FASTA header contig prefixes (e.g., chr1 vs NC_0609...)
- Prefer 5-col BED for exon-based annotation (
Target=). - Mixing CHM13 and hg38? Use distinct prefixes to avoid accidental matches.
- See
masking/andsnakemake/READMEs - For BED help: python tools/gff_toGeneBed.py -h
- If problems arise, double-check FASTA paths and index files.