Improve run time for fastq.gz files suggestions #392

@SergeWielhouwer

Hi,

Thanks for developing NanoPlot.

I am currently running NanoPlot within a custom Snakemake ONT pipeline, but I've noticed that it tends to be one of the slower rules in my workflow when processing the raw input data.
[image: plot of per-rule run times]

Below is a rough example of the run times for three samples:
34.6 Gbp = 57 min
24.4 Gbp = 40 min
13.4 Gbp = 22 min
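
For reference, these figures work out to a near-constant throughput, which suggests the run time scales roughly linearly with input size. A minimal sketch of the arithmetic, using only the three numbers above (the helper name `throughput_gbp_per_min` is just for illustration):

```python
# Run times quoted above as (gigabases, minutes).
runs = [(34.6, 57), (24.4, 40), (13.4, 22)]


def throughput_gbp_per_min(gbp, minutes):
    """Gigabases processed per minute for one sample."""
    return gbp / minutes


rates = [throughput_gbp_per_min(gbp, minutes) for gbp, minutes in runs]

for (gbp, minutes), rate in zip(runs, rates):
    # Each of the three samples comes out at about 0.61 Gbp/min.
    print(f"{gbp} Gbp in {minutes} min -> {rate:.2f} Gbp/min")
```

That is roughly 10 Mbp per second regardless of sample size.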

I’m using NanoPlot 1.42.0 with 4 threads and running the following command on a single (merged) fastq.gz file stored on flash storage (default compression):

NanoPlot --fastq {input} -o nanoplot/raw/{wildcards.sample}/ -t {threads} 2>{log}

While I understand that providing multiple smaller fastq.gz files might help improve speed, I’m curious if NanoPlot benefits from utilising multiple threads on a single fastq.gz file, or if that’s more applicable to BAM files with multiple reference contigs.

As shown in the plot, your Rust tool chopper (with pigz decompression) processes the data in roughly a third of the time that NanoPlot requires.

Do you think NanoPlot could see performance improvements from a transition to Rust or from incorporating libraries such as Intel ISA-L (https://github.com/pycompression/python-isal)? I am wondering what might be limiting its performance, and whether SeqIO is what currently handles the gzip input.
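
To make the ISA-L question concrete, here is a minimal sketch of how the two decompressors could be compared on the same file. It assumes 4-line FASTQ records (no wrapped sequences) and that python-isal is installed for the `igzip` path. `fastq_stats` and `time_opener` are hypothetical helper names, not part of NanoPlot, and the sketch makes no claim about how NanoPlot reads files internally.

```python
import gzip
import time

try:
    # python-isal ships a gzip-compatible module; treated as optional here.
    from isal import igzip
except ImportError:
    igzip = None


def fastq_stats(path, opener=gzip.open):
    """Return (read_count, base_count) for a gzipped FASTQ file.

    Assumes strict 4-line records, so the sequence is line 2 of each record.
    """
    reads = 0
    bases = 0
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # sequence line
                reads += 1
                bases += len(line.rstrip("\n"))
    return reads, bases


def time_opener(path, opener):
    """Run fastq_stats with the given opener and return (stats, seconds)."""
    start = time.perf_counter()
    stats = fastq_stats(path, opener)
    return stats, time.perf_counter() - start
```

Running `time_opener(path, gzip.open)` and, where python-isal is available, `time_opener(path, igzip.open)` on the same file would give a rough idea of how much of the run time is spent on decompression rather than on parsing or plotting.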

I’d appreciate any insights or suggestions you may have on speeding up the process.

Best regards,

Serge
