Add PRIDE prefetch step using pridepy (separate download job, module in bigbio/nf-modules) #67

@ypriverol


Summary

For large PRIDE reanalysis campaigns (e.g. the ones run on the EBI Codon SLURM cluster), the current flow in quantmsdiann assumes files are either:

  1. already staged locally under --root_folder, or
  2. accessible over the URI in the SDRF (FTP / HTTPS / Aspera) and fetched lazily per process by Nextflow's remote-file mechanism.

Option (2) causes pain at scale because EBI FTP / Aspera throttle concurrent connections. When Nextflow launches many DIA-NN / TRFP tasks in parallel, each pulling its own .raw file over FTP, downloads fail, get retried, and the run becomes slow or flaky. Once the files are on the cluster the rest of the workflow runs fine — the download is the bottleneck.

nf-core/mhcquant just solved the same problem using pridepy — see nf-core/mhcquant#445. We should do the same in quantmsdiann, with one important design difference explained below.

Design difference from mhcquant#445

mhcquant downloads files during PIPELINE_INITIALISATION as part of the normal workflow DAG. That's fine for small datasets, but at PRIDE-reanalysis scale it runs into two problems:

  • All downloaded files are pinned in the Nextflow work dir for the duration of the run.
  • If the main run fails mid-way, the next rerun triggers re-downloads unless caching is set up exactly right.

Proposal: split the download into its own prefetch job (either a tiny sibling workflow or a first PREFETCH_DATA subworkflow gated by a flag), which:

  1. Reads the SDRF (or takes a --pride_accession).
  2. Resolves the file list via pridepy against the PRIDE REST API.
  3. Downloads everything to a user-supplied --download_dir (on shared storage, e.g. /hps/nobackup/… on Codon) serially or with a capped concurrency so we stay within EBI's connection limits.
  4. Emits a success marker / manifest.

The main quantmsdiann run is then launched with --root_folder=<download_dir> (reusing the existing local-input code path we just tightened in PR #64). This gives us:

  • Predictable download step: one job, resumable, observable, testable.
  • Reuse: the main workflow runs on local files — same code path as non-PRIDE users.
  • Rerun cost = 0 for downloads: if DIA-NN analysis fails, we don't re-fetch.
  • Scheduling: on SLURM, the prefetch can sit on a single long-running "io" job instead of fanning out N FTP sockets.

Two ways to expose it:

  • nextflow run … -entry PREFETCH_PRIDE --input <sdrf-or-accession> --download_dir /path/..., or
  • --download_first true flag on the main entry point that runs prefetch before anything else.

I'd start with the -entry variant because it keeps the DAGs separable and makes the step trivially scriptable from job arrays.
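For illustration, the two-step recipe with the -entry variant would look something like this (flag names as proposed in this issue; paths and the accession are placeholders):

```bash
# Step 1: prefetch — a single long-running job, capped concurrency,
# downloads everything to shared storage
nextflow run bigbio/quantmsdiann -entry PREFETCH_PRIDE \
    --pride_accession PXD009752 \
    --download_dir /hps/nobackup/project/pride_data

# Step 2: main analysis against the local copy — reruns never re-download
nextflow run bigbio/quantmsdiann \
    --input project.sdrf.tsv \
    --root_folder /hps/nobackup/project/pride_data
```

Because the two runs have separate DAGs, step 1 can be submitted as one SLURM "io" job and step 2 only once the prefetch manifest exists.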

Module placement: bigbio/nf-modules

Rather than writing a local PRIDEPY_DOWNLOAD module inside quantmsdiann, let's put it in bigbio/nf-modules so other workflows (quantms, quantmsrescore, quantmsdda, future pipelines) can reuse it. Proposed module shape:

modules/bigbio/pridepy/download/
├── main.nf           # PRIDEPY_DOWNLOAD process
├── meta.yml          # inputs/outputs
├── environment.yml   # bioconda::pridepy=<pin>
└── tests/
    ├── main.nf.test
    └── main.nf.test.snap

Process interface (draft):

process PRIDEPY_DOWNLOAD {
    tag "$meta.accession"
    label 'process_low'

    input:
    tuple val(meta), val(accession), path(sdrf, stageAs: 'sdrf_?')

    output:
    tuple val(meta), path("${accession}/*"), emit: files
    tuple val(meta), path("${accession}/manifest.tsv"), emit: manifest
    path "versions.yml", emit: versions

    script:
    def protocol = task.ext.protocol ?: 'ftp'        // ftp | aspera | globus | s3
    def threads  = task.ext.threads ?: 1             // concurrent connections cap
    def filter   = sdrf ? "--sdrf ${sdrf}" : '--raw-only'
    """
    pridepy download-all-public-raw-files \\
        -a ${accession} \\
        -o ${accession} \\
        -p ${protocol} \\
        ${filter} \\
        --threads ${threads}

    # generate manifest for downstream SDRF path resolution
    find ${accession} -type f \\( -name '*.raw' -o -name '*.mzML' -o -name '*.d*' -o -name '*.dia' \\) \\
        | sort > ${accession}/manifest.tsv

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        pridepy: \$(pridepy --version | awk '{print \$NF}')
    END_VERSIONS
    """
}
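The manifest step in the draft above is plain `find`; a quick local sanity check of that filter logic against a mock accession directory (file names are made up). One caveat worth noting: Bruker `.d` inputs are directories, so the `-type f` predicate would miss them and the pattern may need adjusting.

```shell
# Mock a downloaded accession directory (hypothetical layout)
mkdir -p PXD000000/sub
touch PXD000000/a.raw PXD000000/b.mzML PXD000000/sub/c.dia PXD000000/readme.txt

# Same find expression as the draft process: keep only MS data files
find PXD000000 -type f \( -name '*.raw' -o -name '*.mzML' -o -name '*.d*' -o -name '*.dia' \) \
    | sort > PXD000000/manifest.tsv

cat PXD000000/manifest.tsv
# readme.txt is excluded; the three data files are listed once each
```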

Key design points:

  • --threads/concurrency cap exposed via task.ext so cluster admins can dial it down to respect EBI's limits (default 1 to be safe; bump via config).
  • Protocol selectable — FTP default; Aspera / Globus / S3 enabled via task.ext.protocol where the cluster has the client installed.
  • Resume-friendly: pridepy skips already-downloaded files when re-invoked, and Nextflow's cache will skip the task entirely if inputs and outputs are unchanged.
  • Retry policy: wire errorStrategy = 'retry' + maxRetries in a shared config for the FTP-flakiness case.
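A sketch of the shared config wiring those knobs together (file name and exact values are illustrative; `maxForks` is the standard Nextflow directive for capping parallel task instances):

```groovy
// conf/pride_prefetch.config — illustrative site config
process {
    withName: 'PRIDEPY_DOWNLOAD' {
        ext.protocol  = 'ftp'   // or 'aspera' where the ascp client is installed
        ext.threads   = 1       // stay within EBI connection limits
        errorStrategy = 'retry' // FTP-flakiness case
        maxRetries    = 3
        maxForks      = 1       // at most one download task at a time
    }
}
```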

Consumption in quantmsdiann

On the quantmsdiann side:

  1. New subworkflows/local/prefetch_pride/main.nf that wraps PRIDEPY_DOWNLOAD, accepts either an accession (fetching the SDRF itself) or an SDRF path directly, and emits the download directory path.
  2. New entry: workflow PREFETCH { … } in main.nf (or a -entry alias) so users can call nextflow run bigbio/quantmsdiann -entry PREFETCH … independently.
  3. Docs:
    • new "Running against a PRIDE accession" section in docs/usage.md with the two-step recipe,
    • mention in docs/container-guide.md / Codon config that prefetch should be a separate SLURM job.
  4. CI: tiny accession (e.g. PXD009752, 2 files) in an extended_ci job — don't add it to the main matrix to avoid hammering PRIDE on every PR.

Params to add (on the quantmsdiann side)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `--pride_accession` | string | `null` | PXD accession; when set, prefetch resolves the SDRF + raw files. |
| `--download_dir` | string | `null` | Directory where PRIDE files will be staged. Becomes `--root_folder` for the main run. |
| `--download_protocol` | string | `ftp` | One of `ftp`, `aspera`, `globus`, `s3`. Enforce enum in schema. |
| `--download_threads` | integer | `1` | Concurrent connection cap for pridepy. Keep ≤ 4 for EBI FTP. |
| `--download_skip_existing` | boolean | `true` | Let pridepy skip files already present in `--download_dir`. |
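As a starting point, the corresponding defaults block (mirroring the table above; exact schema wiring to be done in nextflow_schema.json):

```groovy
// nextflow.config additions (defaults as proposed)
params {
    pride_accession        = null
    download_dir           = null
    download_protocol      = 'ftp'  // enum in schema: ftp | aspera | globus | s3
    download_threads       = 1
    download_skip_existing = true
}
```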

Open questions

  • Pin pridepy to a specific version (bioconda::pridepy=x.y.z) or float on >=? I'd pin once we validate.
  • Should the prefetch also download the fasta DB when the SDRF points at one, or leave that to the user?
  • Do we want a small "SDRF rewrite" post-step that replaces URIs with local paths, or rely purely on --root_folder + extension inference (the current path from PR #64, "Default local_input_type to raw and enforce supported local file formats")?
  • Aspera requires a .ascp client in the container — worth supporting in the bioconda container, or only in site-specific configs?

References

/cc @ypriverol
