
feat: add non-random UMI support #145

Merged: znorgaard merged 22 commits into dev from zn_nonrandom_umi_support on Mar 27, 2026
@znorgaard znorgaard commented Mar 23, 2026

Closes #137

PR Overview

This PR adds an option to use non-random (fixed) UMIs with fastquorum.
This is supported by a new optional column in the samplesheet, which points to a file of known UMI sequences.
Because the column is set per row, the fixed list can be used for some samples but not others, and different samples can use different fixed lists.
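
For readers unfamiliar with fixed UMIs, correction against a known list can be pictured as nearest-match lookup by Hamming distance. The sketch below is illustrative only and is not the pipeline's implementation: `correct_umi`, `max_mismatches`, and the tie-rejection rule are assumptions made for this example, and the three UMIs are taken from the test metrics in this PR.

```python
from typing import Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_umi(
    observed: str, known: Sequence[str], max_mismatches: int = 1
) -> Optional[str]:
    """Return the unique closest known UMI, or None if there is no clear match."""
    if not known:
        return None
    distances = sorted((hamming(observed, k), k) for k in known)
    best_dist, best_umi = distances[0]
    if best_dist > max_mismatches:
        return None
    # Assumption: a tie between two known UMIs is treated as ambiguous.
    if len(distances) > 1 and distances[1][0] == best_dist:
        return None
    return best_umi


known_umis = ["AAGCACTG", "ACCACGAT", "ACGACTTG"]
print(correct_umi("AAGCACTG", known_umis))  # exact match
print(correct_umi("AAGCACTT", known_umis))  # one mismatch from AAGCACTG
print(correct_umi("TTTTTTTT", known_umis))  # too far from every known UMI
```

After this kind of correction every retained read carries an exact known sequence, which is why exact-match grouping is the natural follow-up.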

Pulled from the updated Usage.md:

When using UMI correction, consider setting --groupreadsbyumi_strategy Identity (or Paired with --groupreadsbyumi_edits 0 for duplex sequencing), since UMIs have already been corrected to exact known sequences. The pipeline will emit a warning if a fuzzy-matching strategy is used with corrected UMIs.

The pipeline will also report an error if you supply different UMI lists for the same "sample" (note for #147, this should be based on "library_id").
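
The consistency rule above can be sketched as a small standalone check. This is not the pipeline's Nextflow code: the `umi_file` column name comes from the commit message in this PR, while the `sample` key, the file names, and the choice to treat an empty cell as its own value are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def find_conflicting_umi_files(samplesheet_text: str) -> dict[str, set[str]]:
    """Return samples that list more than one distinct umi_file value."""
    seen: dict[str, set[str]] = defaultdict(set)
    for row in csv.DictReader(io.StringIO(samplesheet_text)):
        # Assumption: an empty cell means "no fixed list" and counts as a value.
        seen[row["sample"]].add((row.get("umi_file") or "").strip())
    return {sample: files for sample, files in seen.items() if len(files) > 1}


# Hypothetical samplesheet: s1 has two runs pointing at different UMI lists.
example = """sample,fastq_1,fastq_2,read_structure,umi_file
s1,a_R1.fq.gz,a_R2.fq.gz,8M+T 8M+T,umis_a.txt
s1,b_R1.fq.gz,b_R2.fq.gz,8M+T 8M+T,umis_b.txt
s2,c_R1.fq.gz,c_R2.fq.gz,8M+T 8M+T,umis_a.txt
"""

conflicts = find_conflicting_umi_files(example)
for sample, files in conflicts.items():
    print(f"{sample}: conflicting umi_file values {sorted(files)}")
```

Keying the check on a different column (for example "library_id", as noted for #147) would only require changing the dictionary key.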

Test Data Overview

The test data for this was created by taking the existing randomer data for SRR6109255 and replacing the existing 10bp UMI and 1 constant base with an 8bp UMI from the xGen™ cfDNA & FFPE DNA Library Preparation Kit. Then a random 1bp substitution was introduced in ~10% of the UMI sequences.
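
A minimal sketch of the UMI assignment step described above, under stated assumptions: it only models picking a known UMI and mutating one base in about 10% of cases, not the FASTQ rewriting or the removal of the original 10bp UMI and constant base. The function names, the seed, and the four-UMI list are hypothetical and not the script used to build the test data.

```python
import random

BASES = "ACGT"


def substitute_one_base(umi: str, rng: random.Random) -> str:
    """Replace one randomly chosen position with a different base."""
    pos = rng.randrange(len(umi))
    new_base = rng.choice([b for b in BASES if b != umi[pos]])
    return umi[:pos] + new_base + umi[pos + 1:]


def assign_umi(
    known: list[str], error_rate: float, rng: random.Random
) -> tuple[str, str]:
    """Pick a known UMI and mutate one base with probability error_rate.

    Returns (true_umi, observed_umi).
    """
    true_umi = rng.choice(known)
    if rng.random() < error_rate:
        return true_umi, substitute_one_base(true_umi, rng)
    return true_umi, true_umi


rng = random.Random(0)  # fixed seed so the example is reproducible
known_umis = ["AAGCACTG", "ACCACGAT", "ACGACTTG", "ACGGAACA"]
pairs = [assign_umi(known_umis, 0.10, rng) for _ in range(10_000)]
mutated = sum(true != observed for true, observed in pairs)
print(f"{mutated / len(pairs):.3f} of UMIs carry a substitution")
```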

In the final output test data, we expect to have fewer consensus reads due to the reduced number of available UMIs. However, the drop shouldn't be too drastic.

Running the Test Data

nextflow run main.nf -profile test,docker --outdir results_random
nextflow run main.nf -profile test_nonrandom,docker --outdir results_nonrandom

Evaluating the UMI Correction (and Synthetic data creation)

Non-Random Correct UMI Metrics
% cat SRR6109255.correct-umis-metrics.txt
umi	total_matches	perfect_matches	one_mismatch_matches	two_mismatch_matches	other_matches	fraction_of_matches	representation
AAGCACTG	1162	1050	112	0	0	0.02934	0.938895
ACCACGAT	1258	1090	168	0	0	0.031764	1.016463
ACGACTTG	1274	1148	126	0	0	0.032168	1.029391
ACGGAACA	1302	1146	156	0	0	0.032875	1.052015
ACGTTCAG	1214	1062	152	0	0	0.030653	0.980911
ACTAGGAG	1240	1134	106	0	0	0.03131	1.001919
ACTGAGGT	1232	1118	114	0	0	0.031108	0.995455
AGCGTGTT	1226	1110	116	0	0	0.030956	0.990607
ATCCAGAG	1240	1130	110	0	0	0.03131	1.001919
CAATGTGG	1202	1086	116	0	0	0.03035	0.971215
CGCATGAT	1228	1100	128	0	0	0.031007	0.992223
CGGCTAAT	1132	1032	100	0	0	0.028583	0.914655
CTGTTGAC	1204	1092	112	0	0	0.030401	0.972831
CTTAGGAC	1248	1114	134	0	0	0.031512	1.008383
GAAGGAAG	1312	1174	138	0	0	0.033128	1.060095
GAGACGAT	1190	1088	102	0	0	0.030047	0.961519
GATCGAGT	1144	1034	110	0	0	0.028886	0.924351
GATGTGTG	1204	1094	110	0	0	0.030401	0.972831
GATTACCG	1222	1094	128	0	0	0.030855	0.987375
GCACAACT	1274	1108	166	0	0	0.032168	1.029391
GCGTCATT	1284	1166	118	0	0	0.032421	1.037471
GCTATCCT	1292	1154	138	0	0	0.032623	1.043935
GTCGAAGA	1310	1188	122	0	0	0.033077	1.058479
GTGCCATA	1182	1080	102	0	0	0.029845	0.955055
GTTACGCA	1248	1128	120	0	0	0.031512	1.008383
NNNNNNNN	0	0	0	0	0	0	0
TCGCTGTT	1290	1182	108	0	0	0.032572	1.042319
TGAAGACG	1236	1140	96	0	0	0.031209	0.998687
TGGACTCT	1222	1096	126	0	0	0.030855	0.987375
TTCCAAGG	1200	1048	152	0	0	0.0303	0.969599
TTCGTTGG	1244	1114	130	0	0	0.031411	1.005151
TTGCAGAC	1272	1120	152	0	0	0.032118	1.027775
TTGCGAAG	1316	1186	130	0	0	0.033229	1.063327

We observe all 32 expected UMIs, and for each of them roughly 10% of matches (about 8% to 13%) come from reads with 1 mismatch.
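
The per-UMI mismatch share can be checked directly from the metrics file. The sketch below is a hedged example rather than part of the pipeline: it embeds three rows copied from the table above (trimmed to the columns it needs), and `one_mismatch_fractions` is a name made up for this illustration.

```python
import csv
import io

# Three rows copied from the metrics above, tab-separated, fewer columns.
metrics_tsv = (
    "umi\ttotal_matches\tperfect_matches\tone_mismatch_matches\n"
    "AAGCACTG\t1162\t1050\t112\n"
    "ACCACGAT\t1258\t1090\t168\n"
    "TGAAGACG\t1236\t1140\t96\n"
)


def one_mismatch_fractions(tsv_text: str) -> dict[str, float]:
    """Fraction of each UMI's matches that needed one mismatch to be assigned."""
    fractions = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        total = int(row["total_matches"])
        if total == 0:
            continue  # skip rows with no matches, e.g. NNNNNNNN
        fractions[row["umi"]] = int(row["one_mismatch_matches"]) / total
    return fractions


fractions = one_mismatch_fractions(metrics_tsv)
for umi, fraction in fractions.items():
    print(f"{umi}\t{fraction:.3f}")
```

To run it on the real output, read `SRR6109255.correct-umis-metrics.txt` into a string and pass that instead of `metrics_tsv`.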

Comparing the Grouping

Random
% cat SRR6109255.grouped-family-sizes.txt
family_size	count	fraction	fraction_gt_or_eq_family_size
1	6219	0.870399	1
2	799	0.111826	0.129601
3	104	0.014556	0.017775
4	22	0.003079	0.003219
5	1	0.00014	0.00014
Non-Random
% cat SRR6109255.grouped-family-sizes.txt

family_size	count	fraction	fraction_gt_or_eq_family_size
1	4185	0.735113	1
2	932	0.16371	0.264887
3	315	0.055331	0.101177
4	149	0.026172	0.045846
5	69	0.01212	0.019673
6	24	0.004216	0.007553
7	9	0.001581	0.003337
8	8	0.001405	0.001757
9	1	0.000176	0.000351
11	1	0.000176	0.000176

As expected, we observe larger family sizes in the non-random data: with fewer distinct UMIs available, more reads share both coordinates and UMI.
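
One way to put a number on the shift is the mean family size. The sketch below uses the counts from the two tables above; `summarize` is a helper written for this example and not a pipeline output.

```python
# family_size -> count, copied from the two grouped-family-sizes tables above
random_sizes = {1: 6219, 2: 799, 3: 104, 4: 22, 5: 1}
nonrandom_sizes = {
    1: 4185, 2: 932, 3: 315, 4: 149, 5: 69,
    6: 24, 7: 9, 8: 8, 9: 1, 11: 1,
}


def summarize(sizes: dict[int, int]) -> dict[str, float]:
    """Family count, mean family size, and fraction of families with size >= 2."""
    families = sum(sizes.values())
    members = sum(size * count for size, count in sizes.items())
    return {
        "families": families,
        "mean_size": members / families,
        "fraction_ge_2": 1 - sizes.get(1, 0) / families,
    }


random_summary = summarize(random_sizes)
nonrandom_summary = summarize(nonrandom_sizes)
for label, summary in (("random", random_summary), ("non-random", nonrandom_summary)):
    print(label, summary["families"], f"{summary['mean_size']:.2f}")
```

With these counts the mean family size rises from about 1.15 to about 1.44, while the number of families drops from 7145 to 5693.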

Comparing the Consensus BAMs

Random
% samtools flagstat SRR6109255.cons.unmapped.bam
13350 + 0 in total (QC-passed reads + QC-failed reads)
13350 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
13350 + 0 paired in sequencing
6675 + 0 read1
6675 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Non-random
% samtools flagstat SRR6109255.cons.unmapped.bam

9474 + 0 in total (QC-passed reads + QC-failed reads)
9474 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
0 + 0 mapped (0.00% : N/A)
0 + 0 primary mapped (0.00% : N/A)
9474 + 0 paired in sequencing
4737 + 0 read1
4737 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

The non-random data yields ~30% fewer consensus reads (9474 vs. 13350), as expected given the larger family sizes.
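
The reduction can be recomputed from the "in total" lines of the two flagstat reports. This is a small sketch for checking the arithmetic; `total_reads` and `relative_reduction` are helper names invented here.

```python
import re


def total_reads(flagstat_text: str) -> int:
    """Sum QC-passed and QC-failed reads from the 'in total' line."""
    match = re.search(r"^(\d+) \+ (\d+) in total", flagstat_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'in total' line found")
    return int(match.group(1)) + int(match.group(2))


def relative_reduction(before: int, after: int) -> float:
    """Fractional drop from before to after."""
    return (before - after) / before


# First lines of the two flagstat outputs shown above.
random_flagstat = "13350 + 0 in total (QC-passed reads + QC-failed reads)\n"
nonrandom_flagstat = "9474 + 0 in total (QC-passed reads + QC-failed reads)\n"

reduction = relative_reduction(
    total_reads(random_flagstat), total_reads(nonrandom_flagstat)
)
print(f"{reduction:.1%}")  # 29.0%
```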

Comparing Filtered Aligned Consensus BAM

Random
% samtools flagstat SRR6109255.mapped.bam
13402 + 0 in total (QC-passed reads + QC-failed reads)
13350 + 0 primary
0 + 0 secondary
52 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
13293 + 0 mapped (99.19% : N/A)
13241 + 0 primary mapped (99.18% : N/A)
13350 + 0 paired in sequencing
6675 + 0 read1
6675 + 0 read2
13166 + 0 properly paired (98.62% : N/A)
13180 + 0 with itself and mate mapped
61 + 0 singletons (0.46% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Non-Random
% samtools flagstat SRR6109255.mapped.bam
9525 + 0 in total (QC-passed reads + QC-failed reads)
9474 + 0 primary
0 + 0 secondary
51 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
9423 + 0 mapped (98.93% : N/A)
9372 + 0 primary mapped (98.92% : N/A)
9474 + 0 paired in sequencing
4737 + 0 read1
4737 + 0 read2
9302 + 0 properly paired (98.18% : N/A)
9314 + 0 with itself and mate mapped
58 + 0 singletons (0.61% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

We observe a similar mapping rate. The slightly lower rate in the non-random data makes sense: reducing the number of available UMIs causes more UMI collisions, so some consensus reads are built across different molecules.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/fastquorum branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@znorgaard znorgaard force-pushed the zn_nonrandom_umi_support branch from 190f274 to 9548ab5 Compare March 23, 2026 22:10
@github-actions

github-actions Bot commented Mar 23, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 4060202

✅ 205 tests passed
❔ 10 tests were ignored
❗ 1 tests had warnings
Details

❗ Test warnings:

❔ Tests ignored:

  • files_exist - File is ignored: .github/workflows/awsfulltest.yml
  • files_exist - File is ignored: .github/workflows/awstest.yml
  • files_exist - File is ignored: conf/modules/modules.config
  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md
  • files_unchanged - File ignored due to lint config: .github/workflows/branch.yml
  • files_unchanged - File ignored due to lint config: .github/workflows/linting.yml
  • files_unchanged - File ignored due to lint config: assets/nf-core-fastquorum_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-fastquorum_logo_dark.png
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore
  • template_strings - template_strings

✅ Tests passed:

Run details

  • nf-core/tools version 3.5.2
  • Run at 2026-03-27 22:15:37

@znorgaard znorgaard changed the base branch from dev to nh13/switch-master-to-main March 23, 2026 22:12
Base automatically changed from nh13/switch-master-to-main to dev March 23, 2026 22:13
znorgaard and others added 16 commits March 23, 2026 15:26
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…pport

- Add nonrandom.nf.test: tests R&D and HT modes with fixed UMIs
- Add mixed_umis.nf.test: tests mixed fixed/random UMI samples
- Update test configs to use nf-core/test-datasets URLs
- Use Paired strategy with edits=0 for nonrandom test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous approach parsed the CSV with new File() which silently
fails for URL inputs. Now the warning fires from the workflow itself
by checking the first item in the correct branch channel, which works
regardless of whether the samplesheet is a local path or URL.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@znorgaard znorgaard force-pushed the zn_nonrandom_umi_support branch from 84097ec to 93fd81c Compare March 23, 2026 22:26
znorgaard and others added 4 commits March 23, 2026 15:38
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Verifies the pipeline fails with an informative error when multiple runs
of the same sample specify different umi_file values in the samplesheet.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@znorgaard znorgaard marked this pull request as ready for review March 27, 2026 16:41
@znorgaard znorgaard requested a review from nh13 March 27, 2026 16:41
Comment thread conf/test_mixed_umis.config Outdated
Comment thread subworkflows/local/utils_nfcore_fastquorum_pipeline/main.nf Outdated
@znorgaard znorgaard merged commit 4d3d4ab into dev Mar 27, 2026
20 checks passed
@znorgaard znorgaard deleted the zn_nonrandom_umi_support branch March 27, 2026 23:05