Fix off by one bug when converting BAM files to d4 files.#97
Fix off by one bug when converting BAM files to d4 files.#97
Conversation
Fix off by one bug when converting BAM files to d4 files. Fixes: 38#75
|
A previous pull request semi worked (although fixing The new implementation is tested against # Samtools depth parameters:
$ samtools depth
Usage: samtools depth [options] in.bam [in.bam ...]
Options:
-a Output all positions (including zero depth)
-a -a, -aa Output absolutely all positions, including unused ref seqs
-r REG Specify a region in chr or chr:from-to syntax
-b FILE Use bed FILE for list of regions
-f FILE Specify list of input BAM/SAM/CRAM filenames
-X Use custom index files (in -X *.bam *.bam.bai order)
-g INT Remove specified flags from default filter-out flag list
-G, --excl-flags FLAGS
Add specified flags to the default filter-out flag list
[UNMAP,SECONDARY,QCFAIL,DUP]
--incl-flags FLAGS
Only include records with at least one the FLAGs present [0]
--require-flags FLAGS
Only include records with all of the FLAGs present [0]
-H Print a file header line
-l INT Minimum read length [0]
-o FILE Write output to FILE [stdout]
-q, --min-BQ INT
Filter bases with base quality smaller than INT [0]
-Q, --min-MQ INT
Filter alignments with mapping quality smaller than INT [0]
-J Include reads with deletions in depth computation
-s Do not count overlapping reads within a template
--input-fmt-option OPT[=VAL]
Specify a single input file format option in the form
of OPTION or OPTION=VALUE
--reference FILE
Reference sequence FASTA FILE [null]
-@, --threads INT
Number of additional threads to use [0]
--verbosity INT
Set level of verbosity
# Run SAMtools depth:
# -a: Output all positions (0-depth)
# -Q 10: Filter reads with mapping quality smaller than 10.
# Reads with UNMAP,SECONDARY,QCFAIL,DUP are filtered by default too (see -G option)
# -J: Count "D" in CIGAR string as coverage
# (d4tools create should not count positions in reads that are "D" in the CIGAR string
# but currently does this anyway).
# Convert SAMtools depth output to bedGraph.
# Compact adjacent bedGraph entries if they have the same value with bedGraphPack of Kent tools
samtools depth -@ 4 -aa -Q 10 -J test.bam \
| awk -F '\t' '{print $1 "\t" $2 - 1 "\t" $2 "\t" $3 }' \
| bedGraphPack /dev/stdin test.samtools_depth_MQ10_count_deletions.bedGraph
# Run SAMtools flags to get integer flag value to use with d4tools to filter out the same reads:
$ samtools flags UNMAP,SECONDARY,QCFAIL,DUP
0x704 1796 UNMAP,SECONDARY,QCFAIL,DUP
# Run d4tools create:
# -q 10: Filter reads with mapping quality smaller than 10.
# -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags to make
# sure that d4tools uses the same reads as samtools depth
# -z: Use compression as uncompressed d4tools output sometimes writes a wrong depth value as last entry.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.d4
# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.d4 \
> test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.bedGraph
# No differences:
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.bedGraph |
|
Unrelated to this patch, but it seems to me that d4tools create can create wrong data for the last entry of a chromosome when writing non-compressed data: # Run d4tools create:
# -q 10: Filter reads with mapping quality smaller than 10.
# -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
# -z: Use compression as uncompressed d4tools output sometimes writes a wrong depth value as last entry of a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compresssed.d4
# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.d4 \
> test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph
# Run d4tools create:
# -q 10: Filter reads with mapping quality smaller than 10.
# -F 1796: Filter reads with UNMAP,SECONDARY,QCFAIL,DUP flags.
# without compression: sometimes writes a wrong depth value as last entry for a chromosome.
d4tools create -q 10 -F '~1796' -z test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4
# Convert d4 file to bedGraph:
d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.d4 \
> test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph# With compressed output samtools depth and d4tools give the same.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.compressed.bedGraph
# With uncompressed output samtools depth and d4tools give the same.
# Only for 3 chromosomes out of 1870 chromosomes this problem appears.
$ diff -u test.samtools_depth_MQ10_count_deletions.bedGraph test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph
--- test.samtools_depth_MQ10_count_deletions.bedGraph 2025-01-20 11:37:33.000000000 +0100
+++ test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.uncompressed.bedGraph 2025-01-20 17:47:31.000000000 +0100
@@ -15016909,7 +15016909,8 @@
211000022280772 13654 13701 1
211000022280772 13701 13713 0
211000022280772 13713 13762 1
-211000022280772 13762 15417 0
+211000022280772 13762 13763 0
+211000022280772 13763 15417 2
211000022280659 0 594 0
211000022280659 594 642 1
211000022280659 642 703 0
@@ -15017355,7 +15017356,8 @@
211000022279098 12261 12270 3
211000022279098 12270 12289 2
211000022279098 12289 12310 1
-211000022279098 12310 12368 0
+211000022279098 12310 12311 0
+211000022279098 12311 12368 2
211000022280761 0 75 0
211000022280761 75 123 1
211000022280761 123 1433 0
@@ -15030169,7 +15030171,8 @@
211000022280548 1426 1434 1
211000022280548 1434 1474 2
211000022280548 1474 1483 1
-211000022280548 1483 2618 0
+211000022280548 1483 1484 0
+211000022280548 1484 2618 2
211000022280553 0 2561 0
211000022280554 0 228 0
211000022280554 228 277 1
@@ -15043475,7 +15043478,8 @@
211000022279504 1843 1850 15
211000022279504 1850 1854 14
211000022279504 1854 1855 12
-211000022279504 1855 2317 0
+211000022279504 1855 1856 0
+211000022279504 1856 2317 2
211000022279505 0 44 0
211000022279505 44 92 1
211000022279505 92 114 0
When running d4tools create only for those chromosomes, it does not seem to happen. When adding a bunch of other chromosomes, sometimes it gives the correct output and sometimes the wrong output, so I assume it has something to do with a reuse of a buffer that does not get cleared properly (and due to threading, it does not always contain the same value): uncompressed_output_instability () {
/staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools create -q 10 -F '~1796' test.bam test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4 -f '^2R|^211000022[0-9]{6}' > /dev/null
/staging/leuven/stg_00002/lcb/ghuls/software/d4-format/target/release/d4tools view test.d4tools_create_MQ10_remove_UNMAP_SECONDARY_QCFAIL_DUP_reads.subset.d4 \
| rg -A 5 211000022280548 \
| tail
} |
|
@brentp Can you take a look? |
|
thank you! this looks good to me |
|
Nice @ghuls ! I'll give this a test run |
|
Works well in my hands for the example in the issue you provided. Setting up the test data - genome 30 bp and read 21 bp. Aligning and indexing. Checking samtools depth. Creating d4tools using the current version and the fixed version. Checking the depths Nice 👌 I agree that it would be great to add a unit test for this! |
|
Also before (or at least with #76), if your read $ bwa mem -k 2 -T 0 genome_mini.fa fake_reads.fa | samtools sort -o out.bam -
$ samtools index out.bam
$ d4tools create -q 0 out.bam out.d4tools_current.d4
$ d4tools create -q 0 out.bam out.d4tools_ghuls.d4
$ d4tools view out.d4tools_current.d4 out.d4tools_current.bdg
$ d4tools view out.d4tools_ghuls.d4 out.d4tools_ghuls.bdg
$ samtools depth -aa out.bam | awk '{print $1 "\t" $2 - 1 "\t" $2 "\t" $3}' | bedGraphPack /dev/stdin out.samtools_depth.bdg
$ head -n 50 out.d4tools_current.bdg out.d4tools_ghuls.bdg out.samtools_depth.bdg
==> out.d4tools_current.bdg <==
chr1 0 1 0
chr1 1 4 2
chr1 4 15 3
chr1 15 18 2
chr1 18 27 3
chr1 27 30 2
chr2 0 1 0
chr2 1 15 3
chr2 15 16 2
chr2 16 19 0
chr2 19 22 1
chr2 22 23 2
chr2 23 25 3
chr2 25 39 4
chr2 39 40 3
==> out.d4tools_ghuls.bdg <==
chr1 0 4 2
chr1 4 14 3
chr1 14 18 2
chr1 18 26 3
chr1 26 30 2
chr2 0 14 3
chr2 14 15 2
chr2 15 19 0
chr2 19 22 1
chr2 22 23 2
chr2 23 25 3
chr2 25 38 4
chr2 38 39 3
chr2 39 40 2
==> out.samtools_depth.bdg <==
chr1 0 4 2
chr1 4 14 3
chr1 14 18 2
chr1 18 26 3
chr1 26 30 2
chr2 0 14 3
chr2 14 15 2
chr2 15 19 0
chr2 19 22 1
chr2 22 23 2
chr2 23 25 3
chr2 25 38 4
chr2 38 39 3
chr2 39 40 2 |
OK, well spotted! And that looks like a nice test case. Would you be able to add this as a test to this PR @ghuls ? I think brentp would be happy to merge it after getting it tested. I.e. adding to https://github.com/38/d4-format/tree/master/d4tools/test/create Just let me know if you want help with this. This bug impacts us as well. Seems like the d4tools-devs are gone / missing, so it is kind of a community effort to keep this going. |
|
Help with adding the test would be great. |
OK, I'll look into it 👍 |
|
I put up a PR towards this PR with the two additional unit tests outlined above over at your repo: ghuls#1 The existing test bam naturally needed updating. There are still differences to the If we resolve / understand the reasons for them appearing, then I think this PR would be good to go. |
|
There was a discrepancy in filtering of the BAM reads between SAMtools and d4tools. The SAMtools default makes more sense to me. |
I see! I guess for this PR the current d4tools output is what we'll use in the tests. Good that you investigated it. I don't have write rights neither here nor to your PR. But if you think my PR looks OK over at your repo - merge it into this PR and then we'll forward it to brentp. Just let me know if you want further help. |
Fix off by one bam add tests
|
@Jakob37 Merged your tests. |
|
so this now matches samtools output? |
|
Yes, at least if you use the correct flagsfor both d4 and samtools depth as in : #97 (comment) |
Fix off by one bug when converting BAM files to d4 files.
Fixes: #75