v2.0.0: Refactor for sample-wise parameterisation #171

nschan wants to merge 172 commits into nf-core:dev from
Conversation
Warning: A newer version of the nf-core template is available. Your pipeline is using an old version of the nf-core template: 3.5.1. For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.
This comment is outdated, and this information is now also in the PR text. Since the initial start of this refactor, I noticed that when doing multiple assemblies from the same set of reads, it is kind of a waste to send those reads through preprocessing multiple times. I also figured that assemblies from the same set of reads are likely to be compared to each other. To reduce redundant work and make comparisons easier, there is now a
nvnieuwk
left a comment
Hi, I've done my best, but I found it pretty hard to read the code in this pipeline, so I can't approve it at this point. I've left a few comments. Here are some more tips to help with readability:

- Don't use `it` in closures; try to set a variable name for each item in the channel entry instead (e.g. instead of `.map { it -> ... }`, do `.map { meta, file1, file2 -> ... }`). This makes it easier for me to understand what is in the channel at that point, and will make it easier for future you (and others) to work on the pipeline later.
- Use clearer variable names.
- Try to put some more comments above big code blocks with a short explanation of what each piece of code is for (especially on harder-to-understand pieces of code).
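As a minimal illustration of the `it` tip, a sketch with a hypothetical `[meta, reads]` channel (names are not from the pipeline):

```nextflow
// Harder to follow: what is inside `it` at this point?
ch_input.map { it -> [it[0], it[1].baseName] }

// Clearer: name each element of the channel entry
ch_input.map { meta, reads -> [meta, reads.baseName] }
```

Both closures do the same thing; the second one documents the channel shape at that point in the workflow.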
But anyways, I'm still really impressed with what you've done here and this really will be a massive improvement to the pipeline!
* Template update for nf-core/tools version 3.2.1
* Template update for nf-core/tools version 3.3.1
* merge template 3.3.1 - fix linting
* update pre-commit
* merge template 3.3.1 - fix linting
* pre-commit config?
* pre-commit config?
* reinstall links
* try larger runner
* smaller run, disable bloom filter for hifiasm test
* updated test snapshot
* updated test snapshot
* update nftignore (x9)
* Update .github/actions/nf-test/action.yml (Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>)
* Update docs/output.md (Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>)
* remove .nf-test.log

Co-authored-by: Niklas Schandry <niklas@bio.lmu.de>
Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>
vagkaratzas
left a comment
The review will be coming in separate comments, ofc :P
- What is this `assets/report` folder? Is there not a more nf-core place to put/execute those?
- Need a workflow dark, and SVGs for both light and dark metro maps
```nextflow
// Read preparation
includeConfig 'modules/ont-prep.config'
includeConfig 'modules/hifi-prep.config'
includeConfig 'modules/trimgalore.config'
```
I can see the reason for splitting configs, but I still think I would prefer to have all the module configurations in one file. Will let you decide, and/or the second reviewer :P
I can see why you would prefer this, but in practice, having a massive module configuration file turned out to be very hard for me to manage: I spent a lot of time looking for specific configurations in there and kept introducing duplications etc. There are around 70 `withName` selectors across the config files.
ok ;)
`assets/report` contains the scripts for generating the report. If there is a better place to put them, I am happy to put them somewhere else, but idk where.
Good point, I will add those.
vagkaratzas
left a comment
Another round. Hopefully another last one soon!
```nextflow
//fastplong_jsons.view { it -> "UNQIE JSONS: $it"}
```

```nextflow
REPORT( report_files,
```
I am not using MultiQC for reporting.
vagkaratzas
left a comment
Final one. There is no way that everything will be working as intended with all these changes (and strategies..!)
I would suggest creating nf-tests for all local subworkflows and modules, if they don't already have them. But I wouldn't stop you from a release for that.
BUT! I think this is a great opportunity to have a track for the upcoming hackathon, for beginners and not only, to write nf-tests for everything. Just a suggestion ;)
```nextflow
def args = task.ext.args ?: ''

"""
dorado aligner \\
```
Probably not, I stole this from https://github.com/nf-core/methylong/tree/2.0.0/modules/local/dorado
```
@@ -14,7 +14,8 @@ process GENOMESCOPE {
tuple val(meta), path("*_plot.log.png") , emit: plot_log
```

https://nf-co.re/modules/genomescope2/ is on nf-core/modules; update and use that one, or maybe just patch it?
```
@@ -2,37 +2,27 @@ process GFA_2_FA {
tag "${meta.id}"
```

`gfatools_gfa2fa` is also on nf-core/modules.
```
@@ -10,7 +10,8 @@ process COUNT {
```

https://nf-co.re/modules/jellyfish_count/ also (I made that one >.<)
Ah, great, I wasn't aware that you made an nf-core module, happy to switch!
```
@@ -10,24 +10,16 @@ process DUMP {

output:
```

And dump: https://nf-co.re/modules/jellyfish_dump/ xD
```nextflow
ch_versions = ch_versions.mix(RUN_RAGTAG.out.versions)
}
channel.empty().set { links_busco }
```

I think the preferred nf-core way is

```diff
- channel.empty().set { links_busco }
+ ch_links_busco = channel.empty()
```

instead of `.set`. But I guess too late for that now, given the number of files in the PR xD
If it is preferred, but not mandatory, I would like to keep it with `.set`. I know some people prefer `channel =`, but it's ugly :/
```nextflow
emit:
scaffolds
ch_main = ch_main_scaffolded
```

`main` is not very self-explanatory as a name.
Fair point. I am not sure how to name this better, since the main channel does not necessarily contain all steps for all samples. I am using `main` to mean that this has been transformed correctly and that everything is ready to move back into the main workflow / next subworkflow.
```nextflow
only once, and then the original channel is restored.

Brief description how this works:
// Move group information into channel, if it exists
```
Too many comments and code there. I guess it should be deleted? The explanation should be enough, I guess.
This comment is mainly for reviewers and potential future contributors. If this is useless, I can remove it.
I would hope that everything works as intended, and I did some larger tests before asking for reviews. Still, it is likely that someone will try something that I did not think of.
I would ask the thematically closest persons / facility / nf-core Slack channels for potential track leaders that would want to oversee this, if they don't already have a project but feel confident enough with nf-tests.
Updated Feb 04 2026
As suggested here, this is a full refactor of `genomeassembler` to support sample-level parameterisation of everything.

## Why?
Often when doing genome assembly, we do not know what works best. With this change, this pipeline can be used to compare different settings for the same set of reads, to compare the assembly outcome. Samples that share the same value in `group` will be combined during reporting and preprocessing to facilitate comparisons of strategies on the same input(s).

## Per-sample parameterisation
Initially, I implemented this poorly, based largely around `join()`'ing various maps back to other maps. Nextflow does not have a `join` operator for maps, so this was a big mess and turned out to be hard to read, annoying to write, and constantly blocking. To summarise: bad idea, cannot recommend. This attempt contributes to the large number of commits in this branch.

While contemplating my failure to implement sample-level parameterisation, I realised that the solution to this problem is "`meta`-stuffing", also referred to as "meta-smuggling" by @prototaxites, who seems to have arrived at a similar conclusion at around the same time.
meta-map, which is in slot0of the channel traveling through the pipeline. I will usemetato refer to[0]of list-channels from here on.Note: A pure
meta-stuffing implementation would have required additional refactoring of some subworkflows, in particular ofQCwhich takes more than one input channel, something I did not want to do.How this works:
Everything goes into `meta`: `params` are turned into k/v pairs for each sample, unless the samplesheet contains a different value for that sample with the same key, resulting in a massive `meta`-map. Values are pulled from `meta` as required for channel inputs, and `meta` is recreated / updated from channel outputs. This largely enables flow control at the sample level, when combined with `branch`, `filter` and related operators. In some cases, `join`s cannot be avoided (or would need to be traded for concurrent processes), but I tried to minimise them.
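A sketch of how the `params`-into-`meta` merge could look (channel and key names are hypothetical, not necessarily the exact code in this PR). Groovy map addition keeps the right-hand value on key collisions, so per-sample samplesheet values override the global defaults:

```nextflow
ch_samplesheet
    .map { meta, reads ->
        // Global defaults from params, as k/v pairs
        def defaults = [assembler: params.assembler, polisher: params.polisher]
        // On shared keys, meta wins: samplesheet values override params
        [defaults + meta, reads]
    }
    .set { ch_with_meta }
```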
if { }statements, since flow control is done via channels. This also means that the pipeline DAG is always rendered in full, irrespective of whether the nodes will actually be visited by a sample. It might be possible to optimise DAG rendering by inspecting all created meta-maps and creating some global variables based on their content, to again conditionally include subworkflows, but this is beyond the scope of this refactor.I made an effort to provide flexibility in combining assemblers, polishers, or scaffolding tools, as I thought it was reasonable, but this does not offer full-factorial combinations, which become especially tricky if things should happen in order.
Generally, I am very happy with this approach; it offers great flexibility and is surprisingly nice to write once `meta` has been constructed. Given that global parameters can still be set, the samplesheet may or may not be kind of wide: for a single sample, everything could be done via `params`, and the only column in the samplesheet would be `sample`. In the most ridiculous case of setting all parameters differently for each sample, the samplesheet could grow to around 50 columns.
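For illustration, a small samplesheet overriding the assembler per sample while everything else comes from `params` might look like this (column names are hypothetical):

```csv
sample,group,ontreads,assembler
sample1,groupA,reads.fastq.gz,flye
sample2,groupA,reads.fastq.gz,hifiasm
```

Both samples share `group` and the same reads, so preprocessing runs once and the two assemblies are compared in the report.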
## Grouping

Grouping is implemented as an extension of "`meta`-smuggling", essentially smuggling multiple meta-maps through a channel (inside `meta`), replacing the value of `meta.id` with the group id.

Here is the relevant code for grouping and un-grouping while maintaining meta-maps:
```nextflow
some_channel
    // Move group information into channel, if it exists
    .filter { it -> it.meta.group }
    .map { it -> [it.meta, it.meta.group, it.meta.ontreads] }
    // Group by group
    .groupTuple(by: 1)
    // Collect all sample-meta into a group meta slot named metas
    // Use unique reads; user responsible to group correctly
    .map { it ->
        [
            [
                id: it[1],   // the group
                metas: it[0]
            ],
            it[2].unique()[0] // Ontreads
        ]
    }
```
After this input channel has been processed, the samples are recreated from `meta.metas`:
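The un-grouping step could be sketched roughly as follows (hypothetical channel names, not the pipeline's actual code): each grouped entry fans back out into one entry per original sample meta stored in `meta.metas`.

```nextflow
grouped_channel
    // One grouped entry -> one entry per original sample meta
    .flatMap { meta, result ->
        meta.metas.collect { sample_meta -> [sample_meta, result] }
    }
    .set { ch_per_sample }
```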
## More

Since switching to `meta`-stuffing made things much easier, I have also added an `HiC` scaffolding subworkflow, and support for `dorado polish` (experimental, as `dorado polish` does not work reliably) in the polishing subworkflow.