-
Notifications
You must be signed in to change notification settings - Fork 1
Description
I ran MicrobeMod twice:
- Run 1:
--percent_cutoff_streme 0.9(default) - Run 2:
--percent_cutoff_streme 0.8
Both runs have --methylation_confidence_threshold 0.8. In both cases, STREME detected a motif with e-value < 0.1. However, only the motif from the 0.9 run is reported in LIBRARY_NAME_motifs.tsv.
I ran this on multiple samples, all containing the same Cyanobacteria species, and this issue occurs in about half of them. In the remaining samples, MicrobeMod reports the motif in both the default and 0.8 run.
A similar issue was reported in Unclear why motif is excluded. The potential solutions/reasons referenced in that thread are not present in my current dataset. The motif that was excluded in my dataset:
- Contains C
- Has high coverage
- Did not involve methylation type h
- The LIBRARY_NAME_a_pos.fasta file contains many instances of the motif
In the logs below, the line Counting genome sites for: ___ appears in the 0.9 run but is missing in the 0.8 run. Perhaps there is a potential issue in make_motif_table() function in microbemod.py. STREME successfully identified the motif, but is filtered out by MicrobeMod for some reason other than the e-value cutoff.
1. Data from default run (where motif was reported by MicrobeMod):
Log file:
25-02-03 09:45:14 INFO Running modkit: modkit pileup -t 20 SQK-NBD114-24_barcode01_1_low.bed -r SQK-NBD114-24_barcode01.1.fa --only-tabs --filter-threshold 0.8
25-02-03 09:48:58 INFO Reading Modkit table...
25-02-03 09:49:02 INFO Retrieving sequences
25-02-03 09:49:09 INFO Total number of sites with any possible calls (including low quality): 4184185 bases
25-02-03 09:49:09 INFO Median coverage (stranded): 1140.0x
25-02-03 09:49:09 INFO Total number of sites with greater than min coverage (10x): 4184185 bases
25-02-03 09:49:10 INFO Potential sites (>33%) for methylation type m: 0 bases
25-02-03 09:49:10 INFO High quality sites (>66%) for methylation type m: 0 bases
25-02-03 09:49:10 INFO Very high quality sites (>90%) for methylation type m: 0 bases
25-02-03 09:49:10 INFO Potential sites (>33%) for methylation type 21839: 20 bases
25-02-03 09:49:10 INFO High quality sites (>66%) for methylation type 21839: 0 bases
25-02-03 09:49:10 INFO Very high quality sites (>90%) for methylation type 21839: 0 bases
25-02-03 09:49:10 INFO Potential sites (>33%) for methylation type a: 2068 bases
25-02-03 09:49:10 INFO High quality sites (>66%) for methylation type a: 1779 bases
25-02-03 09:49:10 INFO Very high quality sites (>90%) for methylation type a: 196 bases
25-02-03 09:49:10 INFO Running for Methylation type: 6mA
25-02-03 09:49:11 INFO Modkit table size: 2006021
25-02-03 09:49:11 INFO 196 sites for STREME with cutoff 0.9 and min coverage 10
25-02-03 09:49:11 INFO Creating background control distribution: SQK-NBD114-24_barcode01_1_a_control.fasta
25-02-03 09:49:11 INFO Running STREME: streme . --n SQK-NBD114-24_barcode01_1_a_control.fasta -p SQK-NBD114-24_barcode01_1_a_pos.fasta -o SQK-NBD114-24_barcode01_1_a_pos_streme
25-02-03 09:51:43 INFO Found and trimmed motif: 1-YTGGAGGTDHWDYWN GGAGGNNNW 176
25-02-03 09:51:44 INFO Counting genome sites for: GGAGGNNNW
25-02-03 09:51:44 INFO Motif has 1376 methylated instances
25-02-03 09:51:44 INFO Methylated positions in motif GGAGGNNNW:
25-02-03 09:51:44 INFO 3 98.071429
6 1.714286
9 0.214286
Name: count, dtype: float64
25-02-03 09:51:44 INFO Running for Methylation type: 5mC
25-02-03 09:51:44 INFO Modkit table size: 1089082
25-02-03 09:51:44 INFO 0 sites for STREME with cutoff 0.9 and min coverage 10
25-02-03 09:51:44 INFO Note: Fewer than 10 methylated sites identified, no motif finding was run.
25-02-03 09:51:44 INFO Running for Methylation type: 4mC
25-02-03 09:51:44 INFO Modkit table size: 1089082
25-02-03 09:51:44 INFO 0 sites for STREME with cutoff 0.9 and min coverage 10
25-02-03 09:51:44 INFO Note: Fewer than 10 methylated sites identified, no motif finding was run.
25-02-03 09:51:44 INFO Complete!
25-02-03 09:51:44 INFO Saving methylated site table to: SQK-NBD114-24_barcode01_1_methylated_sites.tsv
25-02-03 09:51:44 INFO Saving motif output to: SQK-NBD114-24_barcode01_1_motifs.tsv
SQK-NBD114-24_barcode01_1_motifs.tsv:
index Motif Motif_raw Methylation_type Genome_sites Methylated_sites Methylation_coverage Average_Percent_Methylation_per_site Methylated_position_1 Methylated_position_1_percent Methylated_position_2 Methylated_position_2_percent
0 0 GGAGGNNNW 1-YTGGAGGTDHWDYWN 6mA 1523 1384 0.909 0.82 3 98.071 6 1.714
1 1 No Motif Assigned NA 6mA NA 403 NA 0.81 NA NA NA NA
streme.xml:
<motif id="1-YTGGAGGTDHWDYWN" alt="STREME-1" width="15" initial_width="13" seed="CTGGAGGTGCTGCCA" score_threshold="8.28948" npassing="196" train_pos_count="177" train_neg_count="808" train_log_pvalue="-354.458" train_pvalue="3.5e-355" train_dtc="-1.0" train_bernoulli="-1" test_pos_count="19" test_neg_count="101" test_log_pvalue="-37.1572" test_pvalue="7.0e-038" test_log_evalue="-36.5551" test_evalue="2.8e-037" test_dtc="-1.0" test_bernoulli="-1" is_palindromic="no" elapsed_time="22.4" total_sites="177" site_distr=" 0 48 42 0 0 0 0 45 42 0" max_sites="1" site_hist=" 0 177">
2. Data from 0.8 cutoff run (where motif was not reported by MicrobeMod):
Log file:
25-03-31 17:47:32 INFO Running modkit: modkit pileup -t 20 SQK-NBD114-24_barcode01_1_sorted.bam SQK-NBD114-24_barcode01_1_low.bed -r SQK-NBD114-24_barcode01.1.fa --only-tabs --filter-threshold 0.8
25-03-31 17:58:34 INFO Reading Modkit table...
25-03-31 17:58:42 INFO Retrieving sequences
25-03-31 17:58:53 INFO Total number of sites with any possible calls (including low quality): 4184185 bases
25-03-31 17:58:53 INFO Median coverage (stranded): 1140.0x
25-03-31 17:58:54 INFO Total number of sites with greater than min coverage (10x): 4184185 bases
25-03-31 17:58:55 INFO Potential sites (>33%) for methylation type m: 0 bases
25-03-31 17:58:55 INFO High quality sites (>66%) for methylation type m: 0 bases
25-03-31 17:58:55 INFO Very high quality sites (>90%) for methylation type m: 0 bases
25-03-31 17:58:55 INFO Potential sites (>33%) for methylation type 21839: 20 bases
25-03-31 17:58:55 INFO High quality sites (>66%) for methylation type 21839: 0 bases
25-03-31 17:58:55 INFO Very high quality sites (>90%) for methylation type 21839: 0 bases
25-03-31 17:58:56 INFO Potential sites (>33%) for methylation type a: 2068 bases
25-03-31 17:58:56 INFO High quality sites (>66%) for methylation type a: 1779 bases
25-03-31 17:58:56 INFO Very high quality sites (>90%) for methylation type a: 196 bases
25-03-31 17:58:56 INFO Running for Methylation type: 6mA
25-03-31 17:58:57 INFO Modkit table size: 2006021
25-03-31 17:58:57 INFO 1155 sites for STREME with cutoff 0.8 and min coverage 10
25-03-31 17:58:57 INFO Creating background control distribution: SQK-NBD114-24_barcode01_1_a_control.fasta
25-03-31 17:58:57 INFO Running STREME: streme . --n SQK-NBD114-24_barcode01_1_a_control.fasta -p SQK-NBD114-24_barcode01_1_a_pos.fasta -o SQK-NBD114-24_barcode01_1_a_pos_streme
25-03-31 18:00:17 INFO Found and trimmed motif: 1-WCCTCCWDHWDHWDH CCTCC 1040
25-03-31 18:00:19 INFO Running for Methylation type: 5mC
25-03-31 18:00:19 INFO Modkit table size: 1089082
25-03-31 18:00:19 INFO 0 sites for STREME with cutoff 0.8 and min coverage 10
25-03-31 18:00:19 INFO Note: Fewer than 10 methylated sites identified, no motif finding was run.
25-03-31 18:00:19 INFO Running for Methylation type: 4mC
25-03-31 18:00:19 INFO Modkit table size: 1089082
25-03-31 18:00:19 INFO 0 sites for STREME with cutoff 0.8 and min coverage 10
25-03-31 18:00:19 INFO Note: Fewer than 10 methylated sites identified, no motif finding was run.
25-03-31 18:00:20 INFO Complete!
25-03-31 18:00:20 INFO Saving methylated site table to: SQK-NBD114-24_barcode01_1_methylated_sites.tsv
25-03-31 18:00:20 INFO Saving motif output to: SQK-NBD114-24_barcode01_1_motifs.tsv
SQK-NBD114-24_barcode01_1_motifs.tsv:
index Motif Motif_raw Methylation_type Genome_sites Methylated_sites Methylation_coverage Average_Percent_Methylation_per_site Methylated_position_1 Methylated_position_1_percent Methylated_position_2 Methylated_position_2_percent
0 0 No Motif Assigned NA 6mA NA 1779 NA 0.82 NA NA NA NA
streme.xml:
<motif id="1-WCCTCCWDHWDHWDH" alt="STREME-1" width="15" initial_width="10" seed="ACCTCCAGTATCAAT" score_threshold="7.65842" npassing="1155" train_pos_count="1040" train_neg_count="1005" train_log_pvalue="-1853.34" train_pvalue="4.6e-1854" train_dtc="-1.0" train_bernoulli="-1" test_pos_count="115" test_neg_count="129" test_log_pvalue="-199.837" test_pvalue="1.5e-200" test_log_evalue="-199.235" test_evalue="5.8e-200" test_dtc="-1.0" test_bernoulli="-1" is_palindromic="no" elapsed_time="24.4" total_sites="1040" site_distr=" 251 257 0 0 0 0 1 0 271 260" max_sites="2" site_hist=" 0 1038 2">
Thank you!