Add PRIDE prefetch step using pridepy (separate download job, module in bigbio/nf-modules) #67

@ypriverol


Summary

For large PRIDE reanalysis campaigns (e.g. the ones run on the EBI Codon SLURM cluster), the current flow in quantmsdiann assumes files are either:

  1. already staged locally under --root_folder, or
  2. accessible over the URI in the SDRF (FTP / HTTPS / Aspera) and fetched lazily per process by Nextflow's remote-file mechanism.

Option (2) causes pain at scale because EBI FTP / Aspera throttle concurrent connections. When Nextflow launches many DIA-NN / TRFP tasks in parallel, each pulling its own .raw file over FTP, downloads fail, get retried, and the run becomes slow or flaky. Once the files are on the cluster the rest of the workflow runs fine — the download is the bottleneck.

nf-core/mhcquant just solved the same problem using pridepy — see nf-core/mhcquant#445. We should do the same in quantmsdiann, with one important design difference explained below.

Design difference from mhcquant#445

mhcquant downloads files during PIPELINE_INITIALISATION as part of the normal workflow DAG. That's fine for small datasets, but at PRIDE-reanalysis scale it runs into two problems:

  • All downloaded files are pinned in the Nextflow work dir for the duration of the run.
  • If the main run fails mid-way, the next rerun triggers re-downloads unless caching is set up exactly right.

Proposal: split the download into its own prefetch job (either a tiny sibling workflow or a first PREFETCH_DATA subworkflow gated by a flag), which:

  1. Reads the SDRF (or takes a --pride_accession).
  2. Resolves the file list via pridepy against the PRIDE REST API.
  3. Downloads everything to a user-supplied --download_dir (on shared storage, e.g. /hps/nobackup/… on Codon) serially or with a capped concurrency so we stay within EBI's connection limits.
  4. Emits a success marker / manifest.

The main quantmsdiann run is then launched with --root_folder=<download_dir> (reusing the existing local-input code path we just tightened in PR #64). This gives us:

  • Predictable download step: one job, resumable, observable, testable.
  • Reuse: the main workflow runs on local files — same code path as non-PRIDE users.
  • Rerun cost = 0 for downloads: if DIA-NN analysis fails, we don't re-fetch.
  • Scheduling: on SLURM, the prefetch can sit on a single long-running "io" job instead of fanning out N FTP sockets.

Two ways to expose it:

  • nextflow run … -entry PREFETCH_PRIDE --input <sdrf-or-accession> --download_dir /path/..., or
  • --download_first true flag on the main entry point that runs prefetch before anything else.

I'd start with the -entry variant because it keeps the DAGs separable and makes the step trivially scriptable from job arrays.
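For illustration, the two-step recipe with the -entry variant would look something like this (flag names as proposed in this issue; paths and the accession are placeholders):

```bash
# Step 1: prefetch — a single long-running job, capped concurrency,
# downloads everything to shared storage
nextflow run bigbio/quantmsdiann -entry PREFETCH_PRIDE \
    --pride_accession PXD009752 \
    --download_dir /hps/nobackup/project/pride_data

# Step 2: main analysis against the local copy — reruns never re-download
nextflow run bigbio/quantmsdiann \
    --input project.sdrf.tsv \
    --root_folder /hps/nobackup/project/pride_data
```

Because the two runs have separate DAGs, step 1 can be submitted as one SLURM "io" job and step 2 only once the prefetch manifest exists.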

Module placement: bigbio/nf-modules

Rather than writing a local PRIDEPY_DOWNLOAD module inside quantmsdiann, let's put it in bigbio/nf-modules so other workflows (quantms, quantmsrescore, quantmsdda, future pipelines) can reuse it. Proposed module shape:

modules/bigbio/pridepy/download/
├── main.nf           # PRIDEPY_DOWNLOAD process
├── meta.yml          # inputs/outputs
├── environment.yml   # bioconda::pridepy=<pin>
└── tests/
    ├── main.nf.test
    └── main.nf.test.snap

Process interface (draft):

process PRIDEPY_DOWNLOAD {
    tag "$meta.accession"
    label 'process_low'

    input:
    tuple val(meta), val(accession), path(sdrf, stageAs: 'sdrf_?')

    output:
    tuple val(meta), path("${accession}/*"), emit: files
    tuple val(meta), path("${accession}/manifest.tsv"), emit: manifest
    path "versions.yml", emit: versions

    script:
    def protocol = task.ext.protocol ?: 'ftp'        // ftp | aspera | globus | s3
    def threads  = task.ext.threads ?: 1             // concurrent connections cap
    def filter   = sdrf ? "--sdrf ${sdrf}" : '--raw-only'
    """
    pridepy download-all-public-raw-files \\
        -a ${accession} \\
        -o ${accession} \\
        -p ${protocol} \\
        ${filter} \\
        --threads ${threads}

    # generate manifest for downstream SDRF path resolution
    find ${accession} -type f \\( -name '*.raw' -o -name '*.mzML' -o -name '*.d*' -o -name '*.dia' \\) \\
        | sort > ${accession}/manifest.tsv

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        pridepy: \$(pridepy --version | awk '{print \$NF}')
    END_VERSIONS
    """
}
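The manifest step in the draft above is plain `find`; a quick local sanity check of that filter logic against a mock accession directory (file names are made up). One caveat worth noting: Bruker `.d` inputs are directories, so the `-type f` predicate would miss them and the pattern may need adjusting.

```shell
# Mock a downloaded accession directory (hypothetical layout)
mkdir -p PXD000000/sub
touch PXD000000/a.raw PXD000000/b.mzML PXD000000/sub/c.dia PXD000000/readme.txt

# Same find expression as the draft process: keep only MS data files
find PXD000000 -type f \( -name '*.raw' -o -name '*.mzML' -o -name '*.d*' -o -name '*.dia' \) \
    | sort > PXD000000/manifest.tsv

cat PXD000000/manifest.tsv
# readme.txt is excluded; the three data files are listed once each
```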

Key design points:

  • --threads/concurrency cap exposed via task.ext so cluster admins can dial it down to respect EBI's limits (default 1 to be safe; bump via config).
  • Protocol selectable — FTP default; Aspera / Globus / S3 enabled via task.ext.protocol where the cluster has the client installed.
  • Resume-friendly: pridepy skips already-downloaded files when re-invoked, and Nextflow's cache will skip the task entirely if inputs and outputs are unchanged.
  • Retry policy: wire errorStrategy = 'retry' + maxRetries in a shared config for the FTP-flakiness case.
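A sketch of the shared config wiring those knobs together (file name and exact values are illustrative; `maxForks` is the standard Nextflow directive for capping parallel task instances):

```groovy
// conf/pride_prefetch.config — illustrative site config
process {
    withName: 'PRIDEPY_DOWNLOAD' {
        ext.protocol  = 'ftp'   // or 'aspera' where the ascp client is installed
        ext.threads   = 1       // stay within EBI connection limits
        errorStrategy = 'retry' // FTP-flakiness case
        maxRetries    = 3
        maxForks      = 1       // at most one download task at a time
    }
}
```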

Consumption in quantmsdiann

On the quantmsdiann side:

  1. New subworkflows/local/prefetch_pride/main.nf that wraps PRIDEPY_DOWNLOAD, accepts either an accession (fetching the SDRF itself) or an SDRF path directly, and emits the download directory path.
  2. New entry: workflow PREFETCH { … } in main.nf (or a -entry alias) so users can call nextflow run bigbio/quantmsdiann -entry PREFETCH … independently.
  3. Docs:
    • new "Running against a PRIDE accession" section in docs/usage.md with the two-step recipe,
    • mention in docs/container-guide.md / Codon config that prefetch should be a separate SLURM job.
  4. CI: tiny accession (e.g. PXD009752, 2 files) in an extended_ci job — don't add it to the main matrix to avoid hammering PRIDE on every PR.

Params to add (on the quantmsdiann side)

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| `--pride_accession` | string | `null` | PXD accession; when set, prefetch resolves the SDRF + raw files. |
| `--download_dir` | string | `null` | Directory where PRIDE files will be staged. Becomes `--root_folder` for the main run. |
| `--download_protocol` | string | `ftp` | One of `ftp`, `aspera`, `globus`, `s3`. Enforce enum in schema. |
| `--download_threads` | integer | `1` | Concurrent connection cap for pridepy. Keep ≤ 4 for EBI FTP. |
| `--download_skip_existing` | boolean | `true` | Let pridepy skip files already present in `--download_dir`. |
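As a starting point, the corresponding defaults block (mirroring the table above; exact schema wiring to be done in nextflow_schema.json):

```groovy
// nextflow.config additions (defaults as proposed)
params {
    pride_accession        = null
    download_dir           = null
    download_protocol      = 'ftp'  // enum in schema: ftp | aspera | globus | s3
    download_threads       = 1
    download_skip_existing = true
}
```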

Open questions

  • Pin pridepy to a specific version (bioconda::pridepy=x.y.z) or float on >=? I'd pin once we validate.
  • Should the prefetch also download the fasta DB when the SDRF points at one, or leave that to the user?
  • Do we want a small "SDRF rewrite" post-step that replaces URIs with local paths, or rely purely on --root_folder + extension inference (the current path from PR #64, "Default local_input_type to raw and enforce supported local file formats")?
  • Aspera requires a .ascp client in the container — worth supporting in the bioconda container, or only in site-specific configs?

References

/cc @ypriverol
