Skip to content

logsdon-lab/Snakemake-Assembly

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

61 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Snakemake-Assembly

Workflow to generate hifiasm and verkko assemblies.

Why?

No well-documented, simple Snakemake workflows. Need basic AWS and local path support.

Usage

git clone https://github.com/logsdon-lab/Snakemake-Assembly.git --recursive
cd Snakemake-Assembly
snakemake -np --use-conda --configfile config.yaml --workflow-profile none

Config

Each sample is contained with a block in samples.

samples:
  sample_name:
    threads: ... # Number of threads
    mem: ... # In GB. ex. "200GB"
    assembler: ... # Assembler. Either "verkko" or "hifiasm"
    data: ...

Assembler

Either:

  • verkko
  • hifiasm

See workflow/envs/(verkko|hifiasm).yaml for version information.

To pass additional args to either assembler, use the added_args option.

samples:
  sample_name:
    threads: 32
    mem: 200GB
    assembler: verkko
    added_args: -k
    data: ...

Data

Types

The following data types are supported for {sm}.data.{dtype}.

  • "ont"
  • "hifi"
    • Required for verkko.
  • "hic_mat"
  • "hic_pat"
  • "illumina_mat"
  • "illumina_pat"

Note

This workflow makes large temporary files for ONT and HiFI data and cleans them up on workflow completion. By default, the temp directory is {output}/tmp.

Sources

Data sources can be either local or on AWS:

path

  • {sm}.data.{dtype}.path will get data from local directory.

uri

  • {sm}.data.{dtype}.uri will aws sync from the specified S3 uri.
Local
samples:
  mPanTro3:
    threads: 40
    mem: 250GB
    assembler: hifiasm
    data:
      ont:
        path: /project/logsdon_shared/data/PrimateT2T/ont/mPanTro3
        # Include files to use.
        include: ["*.fq.gz"]
        # Exclude files.
        exclude: ["*fail.fq.gz"]
      hifi:
        path: /project/logsdon_shared/data/PrimateT2T/hifi_data/mPanTro3
        include: ["*.hifi_reads.fq.gz"]
S3
samples:
  mPanTro3:
    threads: 32
    mem: 250GB
    assembler: hifiasm
    data:
      ont:
        uri: s3://genomeark/species/Pan_troglodytes/mPanTro3/genomic_data/ont/
        # Include files
        include: ["*.fq.gz"]
        # Exclude files to download if include not specific enough.
        exclude: ["*old-guppy-runs/*", "*.bam*", "*fast5/*"]
      hifi:
        uri: s3://genomeark/species/Pan_troglodytes/mPanTro3/genomic_data/pacbio_hifi/
        include: ["*.hifi_reads.fq.gz"]
        exclude: ["*previous-versions/*", "*.bam*", "*ccs*"]

Analyses

Additional analyses can be added:

asm_to_ref

Align assembly to a reference genome.

asm_to_ref:
  ref:
    CHM13: /project/logsdon_shared/projects/twins_chrY_assembly/data/reference/T2T-CHM13v2.fasta
  mm2_opts: "-x asm20 --secondary=no -s 25000 -K 8G"
  threads: 32
  mem: 250GB
  mode: ["saffire", "ideogram"]

One or more modes is possible.

Ideogram

Generate an ideogram of the assembly.

Note

Reference only works with CHM13

SafFire

Generate SafFire beds.

NucFlag

Run NucFlag on the entire assembly with provided hifi data.

Note

Currently incompatible with S3 hifi input.

nucflag:
  # samples: []
  output_dir: "results/nucflag"
  output_coverage: false
  logs_dir: "logs/nucflag"
  benchmarks_dir: "benchmarks/nucflag"
  threads_aln: 8
  mem_aln: 30G
  processes_nucflag: 12
  mem_nucflag: 50G
  samtools_view_flag: 2308

If you need to align to reads not included in the assembly or use a different nucflag configfile, you can specify it per assembly like so:

  • Otherwise, uses/expects hifi data from assembly.
nucflag:
  samples:
    - name: sample
      config: "/path/to/nucflag.toml"
      read_dir: /path/to/reads/
      read_rgx: ".*\\.hifi_reads.fastq.gz$"

Examples

For more examples, see the examples/ directory.

About

Workflow to generate hifiasm and verkko assemblies.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages