-
Notifications
You must be signed in to change notification settings - Fork 6
Description
I'm seeing a lot of valid reads get filtered out in this step in long read .bams, and I was wondering if I could have some help understanding the logic of this section of the code:
Lines 536 to 584 in 04e2a11
| int count[5] = { 0, 0, 0, 0, 0 } ; | |
| int pos = 0 ; | |
| for ( i = 0 ; i < ccnt ; ++i ) | |
| { | |
| switch ( cigarSeg[i].type ) | |
| { | |
| case 'S': | |
| pos += cigarSeg[i].len ; | |
| case 'M': | |
| case 'I': | |
| { | |
| for ( j = 0 ; j < cigarSeg[i].len ; ++j ) | |
| ++count[ (unsigned char) nucToNum[ col[9][pos + j] - 'A' ] ] ; | |
| pos += j ; | |
| } break ; | |
| case 'N': | |
| { | |
| int max = 0 ; | |
| int sum = 0 ; | |
| for ( j = 0 ; j < 5 ; ++j ) | |
| { | |
| if ( count[j] > max ) | |
| max = count[j] ; | |
| sum += count[j] ; | |
| } | |
| if ( max > 0.8 * sum ) { | |
| validRead = false ; | |
| } | |
| count[0] = count[1] = count[2] = count[3] = count[4] = 0 ; | |
| } break ; | |
| case 'H': | |
| case 'P': | |
| case 'D': | |
| default: break ; | |
| } | |
| } | |
| int max = 0 ; | |
| int sum = 0 ; | |
| for ( j = 0 ; j < 5 ; ++j ) | |
| { | |
| if ( count[j] > max ) { | |
| max = count[j] ; | |
| } | |
| sum += count[j] ; | |
| } | |
| if ( max > 0.8 * sum ) | |
| validRead = false ; |
First off, there is an error - no break statement in case 'S' means there is a fallthrough for soft-clipped reads to case 'I', which offsets the pos variable. This later causes reads into incorrect data left over in the string from another alignment, or simply reads into garbage memory. Somehow nucToNum handles junk input silently, i.e. nucToNum[-65] == 0, so this does not cause a crash/error and simply churns out incorrect data. Below is an example, where the program just starts reading zeroes after a certain point. In combination with the reset of "count" when 'N' is reached in the cigar string, this read gets incorrectly filtered. Below is an example of the issue with the program loaded into a debugger.
The numbers are the numerical values of the characters being read from the sequence string, with a -'A' offset. (i.e. col[9][pos + j] - 'A' ). You can see that after a certain point, it starts reading zeroes (which appear as -65 due to -'A')
Apart from some soft clipped poly-T at the start of the read, the content seems decently complex, but because the program is reading junk from memory, the 126 matches register as the same nucleotide.
❯ samtools view lr_test.bam | grep 465S1446M534N59M1206N77M435N126M
SRR14298432.320776 16 Contig10 10022 0 465S1446M534N59M1206N77M435N126M * 00 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAGCTAAAATAAGTATTACTTAGAAAACATCTGAAGTTAACATATTTTGGAAGTCGACGAAGATCAAGATGTACATTAATTTTTATAGAAATAACTGCTTTCTTCTCCACACATCTTATATCTCACAATACTGCTGCCGCACAACACCTAAGGCCCTTCAGACGCTCAAAAACTTGTTTCATTTCTGGTCGTTTCTTTTGTTCATCATCCATACACTGTCTTGCAAGTTCTGTAAGTTTACGTCCATCCCGACGATCAAAGTAGGTATCATCCTCAAGACTTTCGTCGACATATGAACGCCTGTGCTTGTACTCATTCGCGGCCCACTTATCAATTTCACACCCTGTATATTTTTCCTTGTCAAGAGCTCTCTTGCTTAGAAGATTGTAGAGCGAAATACCGAATGCATAAACATCAGACTTGACATTGTATGCAGCTGTGATCACATCGTCCGAGGTCTCCGCATCTGCTCTAGCTGGAGCTGCATAAAAGTGGGCCTGGACTTCCCTACCCTGTGCATCGGCTCTGCCCCCTAATCTACCTCCTCGGGGACCACCCCGTATACCCTGGTGACCACCCCTACGGTCCTGTCCCTGGCCTCTGCCTCTAGCTGGAGGTGTTGGCGCATGCCTAACTGCCTGTGGAGCGGGTAAGGCTAACCTGCGATGGGGACACTCACGGGCCCAATGATCTAACTCCCCACACTCATAACACCCCTGGGACTGTCGACTGAGAGCTGGAGGTCGGCTACCCCGGCCAGAAGAACCCATCTGTGAACTATCTCGGCCTCGACTAGGACCGGCACTAAAACCAATATAGCGGTGCTGATCACCCTCTGATGCCTGGAGAGAAGCCTGGAAGGGGCTACTAGATTGACCTGACTGAAACCTCTGCTGAGATCTATCATAAGACTCTCGGGGTCTGAAACGGCGCCTATCATGGCTATCCTCCTGTCTAGCTCTCTTGCCGCTGCCCCCTTGGGCCTCGCGACGGAGCTGCTCGATAGTGCGGGCAAGGTCCACTACATCCAGAAACGAGCGGCCTGCTGAGACTAAAGACTAAGTCTCAATCCGCAACTGTAGTCTCAAACCCCGAACAAAACAACGAACCCTCTCCTCCTCGGTAGGTAATATCATGGCGGCATGCCTGGAGAGCTCGTGGAACCGGGCCTCATATTCTGATACTGTCATGGAGCCCTGCTCTAACCAAGCAAACTGATCTCTGAGCTGATCTCTGATGCTCCTGGGAATAAATCGGGCCAAGAAGGCCTCTGAAAACTCAGTCCATGTGACCGGTGGAGACCCAGCTGGCCTGGTCTCTAAGTATGTACGCCACCACTGCCTGGCGGGACCGTCTAACTGGTAAGCGGTGAAGTCTGCCCCTCTCGACTCCACTAAACCCAAAGTCTGTAGCCGCTCGCGGCAAGTAACTAGAAACTCGTGGGCATCCTCCCCCACTGCCCCAGAAAACCTCGGCGGTGATAGTCTAAGGAACACCCCGAGCATCTTCTGCTCCCGTAATGGCATGACGCCCTCCAAATCGTCAGGCTGAACAACCGGTGCCGCTAGCTGCTGAACAACCGGTACCGCTGGTGTAGGCTGGCCCTGCGGTACTGGAACGACTGGAGCTGGGGCCTGCTGCATACCCCCAATGGCAGCTAGTATCTGTACGAGCATCGCCTGTATATCTGGAGTTGCCACAGGCTCGGGCGGTGGCTGGGCTGGTCCCGGGCACATCGGCTGCTGCTGTGCTGGAGCTGCTGGTGGCTCTATCACCGGCTCCAGAGTTTGCTCTCGGTCCTGGGCTGGTGCTGTAACTCTGGCACGACCTCGAGGTCGGCCTCTACCCCTAGCTGGGGACCTACCACGCGTGCGAGCCATCCCCCCGTCTCCGGATAAGTCTACTCCAAGCTAGATCAACGCTGGCTACTAGTGCTCGGACCTGAAAGCCGATCACGCGTGCTCCAACTCACGGTACCGGATCTTTTCCTCTCCGAACGGTCTCCAAATGCTTCCCCACTCAAATGAAGCCTCCCCTCGCGTTTCCGAGATCACCGATAGATTCCCCTCGAAATTCTCTTGAAGTTTGCCTTTTGAATCACTTAATTTGGACTCAAAATGAAGGATTAATCTCAAGTTGAAGTT ,~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~w~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~u~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~w~~~~~~~~~~~~~~~~~~}~~~~~~~n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~L~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~I~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~N~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ NM:i:39 ms:i:1513 AS:i:1441 nn:i:0 ts:A:+ tp:A:P cm:i:374 s1:i:1488 s2:i:1487 de:f:0.0228 SA:Z:chr01,87709743,+,1705S439M2D29S,1,8; rl:i:73
I'll write a fix for this, but even after this is corrected, some valid reads are still being filtered - this is due to the count resetting when an 'N' is reached in the CIGAR string, so a read that is complex but is terminated by a low complexity region will often get filtered out in the case that there is an N near the end of the alignment. This is an example of a read that was incorrectly filtered:
CIGAR:1S93M1D184M979N143M354N185M236N91M143N334M151N58M71N402M13319N17M16S
READ:GATTTGCTGTTGTTAGAAGCATTTAGTATTATAAACTATTTCTCTTTCATTCCTTTCTCACTTTTCTTTGCTTTTGGGTTTCTGGGGTTTGTTGTTTTTTTTTCTAGTTTTGTTTGAGTTTGATCTGCTTCTTGGGGTTCTTTAAATTCTCTTGAAATCTTGAATTTTGGGTGTTTCTTGAAAATTCATAGAGATGGAGAGGGACTTTATGGGATTGAATATTAAAGATTCTTTAGTTGTAGTCAAGGAAGAACATGTTGAAAGCTCCAAAGACTCTGGGTTTCGCTGGCCAATGTCGAGTAAGGTTGGTGTACCTCATTTCATGTCCTTGAACTCTGCTCAAGATGAAAACACCTTCAAAGCTCTATCTGCCACAGATGGAGTCGATGCTGGTCTCAAACGTCAGTCTGGTGAACTCCAGAATGTGCATGCAATGCATCTTCCGTATGATGTTAAGATGCTTCCATTTAACATGAACAATCCCTCCTACAAGACTCATTTTGGTGGTTCTGGCCAGATGAAGCAAGTTCTTGGTGGAATTCCTGTTACAGCTCCTCATTCAATGCTTCCATCGTGTGGCTCTGTGGCTGGGACAACCGAACCTTGGTTTAATTCCAAAAGTTCTGCAGCACCTGTTCAACTGACCATCTTCTATGGTGGGACGGTCAATGTCTTCGAGGATATCTCCCCTGAGAAGGCACAGGCTATTATGTTTTTGGCTGGACATGGCTGTGCTCCACCTAATGTGGTGCAACCAAGGTTTCAACTTCAGGCATCTGCATCAAAACCTGCTGCTGCGGATGGTGTTTGTGTGAACCAAACCCCAAACATGCTGCCTGCCTCGGGTCTCTCTAGCCCTATGTCTGTTTCTTCCCATCCCATTGGTCAATCTGATGGCAGTTCTGGAAACAAAGATGACATGAAGATGTCTAAAACTGCAAACATTTCAGTGACTCCCCTTGTCAAATTGGACACTTCAAAGATTGTGACATCACTAGGACCTGTTGGGGCGACTACCATAATGACAGCAGCGGTTCCACAAGCTAGGAAAGCATCTCTGGCTCGTTTTCTGGAGAAGCGTAAGGAAAGGGTGATGAACTTAGCACCATATGGCCTCAGCAAGAAATCGCCTGAGTGCTTCACCCCCGAGTCTAATGGAGTTGGTTTCTCTGCGACTTCCAGTCCTCTGTTAGCCGGTAAGGAGACCTAGCTATGATCTCTGAAGACTGCTTAACGCTGTAAATCAGTGAAGCATAATCTATTTTATTCAGCAAGTTAAAAATCTGTGTAGCTATCAAGTTTTGTTTGTTGTAATTTGTAATGTCAGCTTTGGTTTAGCTCTAGCTAGACATCCTAGTATCCAATTGGTTTGGTCTTTTTTTTTCTGTAACCTGATGGAATTTATCCCCATCCAAATAACTTCTTCAATTTGTAACTATTTATTTTGTTACATTGTGTTTCACAAGTATATATATATATTAATGTTATTTAGTTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
The CIGAR ends with 13319N17M16S. 13319N stops the count, resetting it, 17M reads into the start of the poly A: GTTAAAAAAAAAAAAAA and finally the rest of the poly-A is softclipped. As a result, the whole read is filtered based on this tail section.
What's the best way of dealing with this? I guess the easiest option is just to not reset when N is reached, but then this introduces the assumption that the whole read has a similar complexity. Alternatively, you could argue that the aligner didn't clip accurately, I guess, and then this is not a PsiCLASS issue.
I guess the ideal solution is to take some combined measure of the complexities of each chunk, somehow? I'd love some input, but in any case I'll be headscratching on this one for a while 😅.
Here's the test data I'm using (minimap2 bam):
https://drive.google.com/file/d/1nT7th6WKVPs1Jb5mhbm9CRAaHENP9MRd/view?usp=drive_link
