For large PRIDE reanalysis campaigns (e.g. the ones run on the EBI Codon SLURM cluster), the current flow in quantmsdiann assumes files are either:

1. already staged locally under `--root_folder`, or
2. accessible over the URI in the SDRF (FTP / HTTPS / Aspera) and fetched lazily per process by Nextflow's remote-file mechanism.
Option (2) causes pain at scale because EBI FTP / Aspera throttle concurrent connections. When Nextflow launches many DIA-NN / TRFP tasks in parallel, each pulling its own .raw file over FTP, downloads fail, get retried, and the run becomes slow or flaky. Once the files are on the cluster the rest of the workflow runs fine — the download is the bottleneck.
nf-core/mhcquant just solved the same problem using pridepy — see nf-core/mhcquant#445. We should do the same in quantmsdiann, with one important design difference explained below.
## Design difference from mhcquant#445
mhcquant downloads files during `PIPELINE_INITIALISATION` as part of the normal workflow DAG. That's fine for small datasets but inherits two problems at PRIDE-reanalysis scale:

1. All downloaded files are pinned in the Nextflow work dir for the duration of the run.
2. If the main run fails mid-way, the next rerun triggers re-downloads unless caching is set up exactly right.
Proposal: split the download into its own prefetch job (either a tiny sibling workflow or a first `PREFETCH_DATA` subworkflow gated by a flag), which:

1. Reads the SDRF (or takes a `--pride_accession`).
2. Resolves the file list via pridepy against the PRIDE REST API.
3. Downloads everything to a user-supplied `--download_dir` (on shared storage, e.g. `/hps/nobackup/…` on Codon) serially or with a capped concurrency so we stay within EBI's connection limits.
4. Emits a success marker / manifest.
The main quantmsdiann run is then launched with `--root_folder=<download_dir>` (reusing the existing local-input code path we just tightened in PR #64). This gives us:

- Predictable download step: one job, resumable, observable, testable.
- Reuse: the main workflow runs on local files — same code path as non-PRIDE users.
- Rerun cost = 0 for downloads: if DIA-NN analysis fails, we don't re-fetch.
- Scheduling: on SLURM, the prefetch can sit on a single long-running "io" job instead of fanning out N FTP sockets.
Two ways to expose it:

- `nextflow run … -entry PREFETCH_PRIDE --input <sdrf-or-accession> --download_dir /path/...`, or
- a `--download_first true` flag on the main entry point that runs prefetch before anything else.
I'd start with the -entry variant because it keeps the DAGs separable and makes the step trivially scriptable from job arrays.
## Module placement: bigbio/nf-modules
Rather than writing a local PRIDEPY_DOWNLOAD module inside quantmsdiann, let's put it in bigbio/nf-modules so other workflows (quantms, quantmsrescore, quantmsdda, future pipelines) can reuse it. Proposed module shape:
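Process interface (draft) — a minimal sketch only: the module name, channel shapes, and the pridepy subcommand/flags shown here are assumptions to be validated against the pridepy CLI before the module lands in bigbio/nf-modules.

```groovy
process PRIDEPY_DOWNLOAD {
    tag "${accession}"
    label 'process_single'                  // one long-running "io" job, not a fan-out

    conda "bioconda::pridepy"               // pin an exact version once validated (see open questions)

    input:
    val accession                           // PXD accession resolved from the SDRF or --pride_accession

    output:
    path "downloads/*"  , emit: files
    path "manifest.tsv" , emit: manifest    // success marker / file list for the main run

    script:
    def protocol = task.ext.protocol ?: 'ftp'
    """
    mkdir -p downloads
    # Subcommand and flags follow the pridepy README at time of writing — verify before merging.
    pridepy download-all-public-raw-files \\
        -a ${accession} \\
        -o downloads \\
        -p ${protocol}
    ls downloads > manifest.tsv
    """
}
```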
Key design points:

- `--threads` / concurrency cap exposed via `task.ext` so cluster admins can dial it down to respect EBI's limits (default `1` to be safe; bump via config).
- Protocol selectable — FTP default; Aspera / Globus / S3 enabled via `task.ext.protocol` where the cluster has the client installed.
- Resume-friendly: pridepy skips already-downloaded files when re-invoked, and Nextflow's cache will skip the task entirely if inputs and outputs are unchanged.
- Retry policy: wire `errorStrategy = 'retry'` + `maxRetries` in a shared config for the FTP-flakiness case (see the config sketch after this list).
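One way those knobs could be wired in a shared/site config — a sketch assuming the module name above; the selector and values are illustrative, not final:

```groovy
// e.g. conf/codon.config — illustrative values only
process {
    withName: 'PRIDEPY_DOWNLOAD' {
        ext.protocol  = 'ftp'     // switch to 'aspera' where ascp is installed on the cluster
        ext.threads   = 1         // concurrency cap passed to pridepy once its flag is confirmed
        errorStrategy = 'retry'   // absorb transient FTP failures
        maxRetries    = 3
    }
}
```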
## Consumption in quantmsdiann
On the quantmsdiann side:

- New `subworkflows/local/prefetch_pride/main.nf` that wraps `PRIDEPY_DOWNLOAD`, accepts either an accession + fetches the SDRF, or an SDRF path directly, and emits the download directory path (sketched after this list).
- New entry: `workflow PREFETCH { … }` in `main.nf` (or a `-entry` alias) so users can call `nextflow run bigbio/quantmsdiann -entry PREFETCH …` independently.
- Docs:
  - new "Running against a PRIDE accession" section in `docs/usage.md` with the two-step recipe,
  - mention in `docs/container-guide.md` / Codon config that prefetch should be a separate SLURM job.
- CI: tiny accession (e.g. PXD009752, 2 files) in an `extended_ci` job — don't add it to the main matrix to avoid hammering PRIDE on every PR.
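Rough shape of the wrapper and the entry point — a sketch under the assumptions above; the include path is illustrative, and resolving an SDRF path to an accession is not shown:

```groovy
// subworkflows/local/prefetch_pride/main.nf — illustrative only
include { PRIDEPY_DOWNLOAD } from '../../../modules/bigbio/pridepy/download/main'

workflow PREFETCH_PRIDE {
    take:
    accession            // PXD accession; SDRF-path input handling omitted here

    main:
    PRIDEPY_DOWNLOAD(accession)

    emit:
    files    = PRIDEPY_DOWNLOAD.out.files
    manifest = PRIDEPY_DOWNLOAD.out.manifest
}

// main.nf — entry point so `nextflow run … -entry PREFETCH` works
// (include of PREFETCH_PRIDE omitted)
workflow PREFETCH {
    PREFETCH_PRIDE(Channel.value(params.pride_accession))
}
```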
## Params to add (on the quantmsdiann side)
| Param | Type | Default | Description |
|---|---|---|---|
| `--pride_accession` | string | `null` | PXD accession; when set, prefetch resolves the SDRF + raw files. |
| `--download_dir` | string | `null` | Directory where PRIDE files will be staged. Becomes `--root_folder` for the main run. |
| `--download_protocol` | string | `ftp` | One of `ftp`, `aspera`, `globus`, `s3`. Enforce enum in schema. |
| `--download_threads` | integer | `1` | Concurrent connection cap for pridepy. Keep ≤ 4 for EBI FTP. |
| `--download_skip_existing` | boolean | `true` | Let pridepy skip files already present in `--download_dir`. |
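In `nextflow.config` that would look roughly like this (defaults as in the table above; schema entries and validation not shown):

```groovy
// nextflow.config — proposed additions (sketch)
params {
    pride_accession        = null     // PXD accession; triggers prefetch when set
    download_dir           = null     // staging dir; becomes --root_folder for the main run
    download_protocol      = 'ftp'    // ftp | aspera | globus | s3 (enum enforced in schema)
    download_threads       = 1        // keep <= 4 for EBI FTP
    download_skip_existing = true     // let pridepy skip files already present
}
```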
## Open questions
- Pin pridepy to a specific version (`bioconda::pridepy=x.y.z`) or float on `>=`? I'd pin once we validate.
- Should the prefetch also download the fasta DB when the SDRF points at one, or leave that to the user?
- Aspera needs the `ascp` client in the container — worth supporting in the bioconda container, or only in site-specific configs?
/cc @ypriverol