Hello, people of nf-core. While working on my project to rewrite MaxBin2, I encountered a bit of an odd issue.
I was benchmarking a gene caller on two test datasets: the B. fragilis data from the modules branch, and the "minigut" data from the mag branch. I got identical results on both and dug into why.
Turns out the reads are the same file: test_data/test_minigut_R1.fastq.gz on mag and data/genomics/prokaryotes/bacteroides_fragilis/illumina/fastq/test1_1.fastq.gz on modules have the same SHA-256 (sha256-qQ9NSKMVSIbnKlIhuN1rEVnqCQ5dviBjTpZbTugBFqs=). The contigs are different files but contain the same 272 sequences in a different order.
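For anyone who wants to repeat this kind of check, here is a minimal Python sketch of one way to do it. It is not necessarily the exact procedure behind the numbers above. The function names (sri_sha256, fasta_sequences, same_sequences) are made up for illustration, and the sketch assumes the hash is the base64-encoded SHA-256 of the raw file bytes, which is what the "sha256-...=" format suggests. The contig comparison ignores headers and record order, and it does not treat reverse complements as equal.

```python
import base64
import gzip
import hashlib
from collections import Counter


def sri_sha256(path):
    """Return the SHA-256 of a file's raw bytes as 'sha256-<base64>'."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large FASTQ files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256-" + base64.b64encode(digest.digest()).decode("ascii")


def fasta_sequences(path):
    """Count the sequences in a (optionally gzipped) FASTA file.

    Headers and record order are ignored; sequences are upper-cased.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    sequences = Counter()
    parts = []
    seen_header = False
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if seen_header:
                    sequences["".join(parts).upper()] += 1
                parts = []
                seen_header = True
            elif line and seen_header:
                parts.append(line)
    # Close the last record in the file.
    if seen_header:
        sequences["".join(parts).upper()] += 1
    return sequences


def same_sequences(fasta_a, fasta_b):
    """True if both FASTA files hold the same multiset of sequences."""
    return fasta_sequences(fasta_a) == fasta_sequences(fasta_b)
```

To use it, check out the two branches into separate local directories, call sri_sha256 on each reads file and compare the two strings, then call same_sequences on the two contig files. sum(fasta_sequences(path).values()) gives the number of records in a file, which is a quick sanity check before comparing.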
Looking at the git history, the reads were first committed to mag in 2018 (2c50b6d5, "test gut bact"), then the same file showed up on modules under bacteroides_fragilis/ in 2021 (b2197bc5). The contigs were re-assembled and added to mag in 2023 (95f022cc).
The naming is what tripped me up — "minigut" sounds like a multi-species gut community, not a single-organism B. fragilis dataset. There's nothing on either branch that connects the two. Maybe worth a note in the mag branch README?