-
Notifications
You must be signed in to change notification settings - Fork 1
Description
MARINE's behaviour when encountering reads without MD tags (e.g. a user never ran the equivalent of samtools calmd or there are unmapped reads in the bam) is not ideal.
If MARINE encounters a read that is missing an MD tag while iterating through a contig interval, then it throws a warning message and immediately skips every read left to be processed in that interval (e.g. if an interval has 10,000 reads but the 500th read has no MD tag, then the next 9,500 reads will be skipped). This warning message does not stop MARINE, and there is no indication to the user that any reads have been skipped.
To see an example of this behaviour, we can run marine on the following small example which includes unmapped reads:
/tscc/lustre/ddn/scratch/kflanagan/Tau_marine_run/one_circ_files/one_circ_unmapped.bam
using the commands:
module load marine
python /tscc/projects/ps-yeolab3/kflanagan/MARINE-1.0.9-alpha/marine.py --bam_filepath /tscc/lustre/ddn/scratch/kflanagan/Tau_marine_run/one_circ_files/one_circ_unmapped.bam --output_folder ./results --strandedness 0 --cores 16 --contigs "chr1:145631664|145706593" --bedgraphs "CT" --paired_end --num_intervals_per_contig 2 --keep_intermediate_files
This will throw a warning message (copied below), but will not prevent MARINE from completing its run.
It seems like there is an MD tag missing "tag 'MD' not present" Failed getting read info on LH00444:108:22C2FVLT4:1:2453:9303:25682 69 chr1:145631664|145706593 74931 0 * = 74931 0 GTGTTGGGACCAAAACTTCACATCAAATTGGATTCCCAGATAATTCACAGTGATCTCAATGCCTCTTGTGACAGGATCCGTTTTCCCAACTCGATCAAAAACAATATTGATGGTAGCCTTTTCTTTTTGCTTATGATCATATCCATGGTTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII YT:Z:UP, local variable 'md_tag' referenced before assignment
When we check the .tsv files within ./results/edit_info to look at the individual edits found in each interval, we will find that the first interval has no edits whatsoever as one of the very first reads in this interval was unmapped. Furthermore, the second interval has only 126 edits when in reality there are nearly 1000 edits in that interval.
I would recommend making it so that MARINE stops entirely when it encounters a read without an MD tag. This can be done quite easily by replacing the break in line 182 of src/core.py with a raise() of some kind. It may also be helpful to include suggestions for how to fix the problem in the error message, such as directing them to readme instruction on removing unmapped reads and adding the MD tag.
An alternative solution would be to replace the break in line 182 of src/core.py with a continue so that it ignores the read and keeps going, but given that this could potentially generate millions of warning messages, it would require time and care to implement.