Workflow to generate hifiasm and verkko assemblies.
Why: there are no well-documented, simple Snakemake workflows for this, and basic AWS and local path support is needed.
git clone https://github.com/logsdon-lab/Snakemake-Assembly.git --recursive
cd Snakemake-Assembly
snakemake -np --use-conda --configfile config.yaml --workflow-profile none

Each sample is contained within a block in samples.
samples:
  sample_name:
    threads: ... # Number of threads
    mem: ... # In GB. ex. "200GB"
    assembler: ... # Assembler. Either "verkko" or "hifiasm"
    data: ...

Either:
- verkko
- hifiasm
See workflow/envs/(verkko|hifiasm).yaml for version information.
To pass additional args to either assembler, use the added_args option.
samples:
  sample_name:
    threads: 32
    mem: 200GB
    assembler: verkko
    added_args: -k
    data: ...

The following data types are supported for {sm}.data.{dtype}:
- "ont"
- "hifi"
  - Required for verkko.
  - Required for hifiasm.
- "hic_mat"
- "hic_pat"
- "illumina_mat"
- "illumina_pat"
Note
This workflow creates large temporary files for ONT and HiFi data and cleans them up on workflow completion. By default, the temp directory is {output}/tmp.
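As a rough illustration of the sample block described above, the following Python sketch checks a parsed samples entry before running the workflow. It is not part of the workflow itself: the function name validate_sample is hypothetical, and treating threads, mem, assembler, and data as required keys, and requiring exactly one of path or uri per data type, are assumptions made for the example.

```python
from __future__ import annotations

# Values taken from the README: supported assemblers and data types.
SUPPORTED_ASSEMBLERS = {"verkko", "hifiasm"}
SUPPORTED_DTYPES = {
    "ont", "hifi", "hic_mat", "hic_pat", "illumina_mat", "illumina_pat",
}


def validate_sample(name: str, cfg: dict) -> list[str]:
    """Return a list of human-readable problems found in one sample block."""
    problems = []

    # Assumption: all four top-level keys are required.
    for key in ("threads", "mem", "assembler", "data"):
        if key not in cfg:
            problems.append(f"{name}: missing key '{key}'")

    assembler = cfg.get("assembler")
    if assembler is not None and assembler not in SUPPORTED_ASSEMBLERS:
        problems.append(f"{name}: unsupported assembler '{assembler}'")

    for dtype, source in (cfg.get("data") or {}).items():
        if dtype not in SUPPORTED_DTYPES:
            problems.append(f"{name}: unsupported data type '{dtype}'")
        # Assumption: each data type needs exactly one of 'path' or 'uri'.
        if ("path" in source) == ("uri" in source):
            problems.append(f"{name}.data.{dtype}: set exactly one of 'path' or 'uri'")

    return problems


# Hypothetical sample; the paths are placeholders.
sample = {
    "threads": 32,
    "mem": "200GB",
    "assembler": "verkko",
    "data": {
        "ont": {"path": "/data/ont"},
        "hifi": {"uri": "s3://bucket/hifi/"},
    },
}
print(validate_sample("sample_name", sample))  # []
```

A check like this can catch a misspelled assembler or data type before a long dry run.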
Data sources can be either local or on AWS:
path
{sm}.data.{dtype}.path will get data from a local directory.
uri
{sm}.data.{dtype}.uri will aws sync from the specified S3 URI.
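To make the include and exclude options concrete, here is a small Python sketch of how the two source types might be handled. It is an illustration, not the workflow's code: the helper names select_local_files and build_s3_sync_command and the destination directory are hypothetical, and it assumes the S3 case goes through the AWS CLI's `aws s3 sync`, where later --include/--exclude filters take precedence over earlier ones. Matching local files with shell-style globs is likewise an assumption.

```python
from __future__ import annotations

import fnmatch
import shlex


def select_local_files(filenames, include=("*",), exclude=()):
    """Keep names that match any include glob and no exclude glob."""
    return [
        f for f in filenames
        if any(fnmatch.fnmatchcase(f, pat) for pat in include)
        and not any(fnmatch.fnmatchcase(f, pat) for pat in exclude)
    ]


def build_s3_sync_command(uri, dest, include=(), exclude=()):
    """Build an `aws s3 sync` argument list with include/exclude filters."""
    cmd = ["aws", "s3", "sync", uri, dest]
    if include:
        # Exclude everything first so that only the include globs are synced.
        cmd += ["--exclude", "*"]
        for pat in include:
            cmd += ["--include", pat]
    # Excludes come last so they override the includes.
    for pat in exclude:
        cmd += ["--exclude", pat]
    return cmd


files = ["run1.fq.gz", "run1_fail.fq.gz", "run1.bam"]
print(select_local_files(files, ["*.fq.gz"], ["*fail.fq.gz"]))  # ['run1.fq.gz']

cmd = build_s3_sync_command(
    "s3://genomeark/species/Pan_troglodytes/mPanTro3/genomic_data/ont/",
    "results/tmp/ont",  # hypothetical destination directory
    include=["*.fq.gz"],
    exclude=["*.bam*"],
)
print(shlex.join(cmd))
```

Putting the excludes after the includes mirrors the comment in the S3 example: exclude is there for cases where include alone is not specific enough.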
samples:
  mPanTro3:
    threads: 40
    mem: 250GB
    assembler: hifiasm
    data:
      ont:
        path: /project/logsdon_shared/data/PrimateT2T/ont/mPanTro3
        # Include files to use.
        include: ["*.fq.gz"]
        # Exclude files.
        exclude: ["*fail.fq.gz"]
      hifi:
        path: /project/logsdon_shared/data/PrimateT2T/hifi_data/mPanTro3
        include: ["*.hifi_reads.fq.gz"]

samples:
  mPanTro3:
    threads: 32
    mem: 250GB
    assembler: hifiasm
    data:
      ont:
        uri: s3://genomeark/species/Pan_troglodytes/mPanTro3/genomic_data/ont/
        # Include files
        include: ["*.fq.gz"]
        # Exclude files to download if include not specific enough.
        exclude: ["*old-guppy-runs/*", "*.bam*", "*fast5/*"]
      hifi:
        uri: s3://genomeark/species/Pan_troglodytes/mPanTro3/genomic_data/pacbio_hifi/
        include: ["*.hifi_reads.fq.gz"]
        exclude: ["*previous-versions/*", "*.bam*", "*ccs*"]

Additional analyses can be added:
Align assembly to a reference genome.
asm_to_ref:
  ref:
    CHM13: /project/logsdon_shared/projects/twins_chrY_assembly/data/reference/T2T-CHM13v2.fasta
  mm2_opts: "-x asm20 --secondary=no -s 25000 -K 8G"
  threads: 32
  mem: 250GB
  mode: ["saffire", "ideogram"]

One or more modes can be specified.
Generate an ideogram of the assembly.
Note
The reference only works with CHM13.
Generate SafFire beds.
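For orientation, the asm_to_ref settings above map naturally onto a minimap2 call. The sketch below shows one way the mm2_opts string and threads value could be turned into a command line. It is only an assumption about how the options are used, not the workflow's actual rule; the function name build_minimap2_command and the file names are hypothetical.

```python
import shlex


def build_minimap2_command(ref_fa, asm_fa, mm2_opts, threads, out_paf):
    """Build a minimap2 argument list: options first, then reference, then query."""
    return [
        "minimap2",
        "-t", str(threads),        # number of threads
        *shlex.split(mm2_opts),    # user-supplied options, split like a shell would
        "-o", out_paf,             # write alignments to a file instead of stdout
        ref_fa,
        asm_fa,
    ]


cmd = build_minimap2_command(
    "T2T-CHM13v2.fasta",   # hypothetical local copy of the reference
    "assembly.fa",         # hypothetical assembly FASTA
    "-x asm20 --secondary=no -s 25000 -K 8G",
    32,
    "asm_to_ref.paf",      # hypothetical output name
)
print(shlex.join(cmd))
```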
Run NucFlag on the entire assembly with provided hifi data.
Note
Currently incompatible with S3 hifi input.
nucflag:
  # samples: []
  output_dir: "results/nucflag"
  output_coverage: false
  logs_dir: "logs/nucflag"
  benchmarks_dir: "benchmarks/nucflag"
  threads_aln: 8
  mem_aln: 30G
  processes_nucflag: 12
  mem_nucflag: 50G
  samtools_view_flag: 2308

If you need to align reads that were not included in the assembly, or to use a different NucFlag config file, you can specify them per assembly as shown below. Otherwise, the workflow uses (and expects) the hifi data from the assembly.
nucflag:
  samples:
    - name: sample
      config: "/path/to/nucflag.toml"
      read_dir: /path/to/reads/
      read_rgx: ".*\\.hifi_reads.fastq.gz$"

For more examples, see the examples/ directory.
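A note on the samtools_view_flag value of 2308 in the NucFlag configuration above: it is a SAM bit flag, and the short Python sketch below decodes it using the flag bits defined in the SAM specification. How the workflow passes the value to samtools is not stated here; the key name suggests it is used to filter alignments (for example with `samtools view -F`), but that is an assumption. The function name decode_sam_flag is hypothetical.

```python
# SAM flag bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_sam_flag(flag):
    """Return the names of all SAM flag bits set in an integer flag."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# 2308 = 4 + 256 + 2048
print(decode_sam_flag(2308))  # ['unmapped', 'secondary', 'supplementary']
```

Filtering on this value would therefore drop unmapped reads along with secondary and supplementary alignments, leaving primary alignments only.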


