Skip to content

STREME motif Missing from MicrobeMod Output #36

@nhinvo

Description

@nhinvo

I ran MicrobeMod twice:

  • Run 1: --percent_cutoff_streme 0.9 (default)
  • Run 2: --percent_cutoff_streme 0.8

Both runs have --methylation_confidence_threshold 0.8. In both cases, STREME detected a motif with e-value < 0.1. However, only the motif from the 0.9 run is reported in LIBRARY_NAME_motifs.tsv.

I ran this on multiple samples, all containing the same Cyanobacteria species, and this issue occurs in about half of them. In the remaining samples, MicrobeMod reports the motif in both the default and 0.8 run.

A similar issue was reported in Unclear why motif is excluded. The potential solutions/reasons referenced in that thread are not present in my current dataset. The motif that was excluded in my dataset:

  • Contains C
  • Has high coverage
  • Did not involve methylation type h
  • The LIBRARY_NAME_a_pos.fasta file contains many instances of the motif

In the logs below, the line Counting genome sites for: ___ appears in the 0.9 run but is missing in the 0.8 run. Perhaps there is a potential issue in make_motif_table() function in microbemod.py. STREME successfully identified the motif, but is filtered out by MicrobeMod for some reason other than the e-value cutoff.

1. Data from default run (where motif was reported by MicrobeMod):

Log file:

25-02-03 09:45:14 INFO     Running modkit: modkit pileup -t 20 SQK-NBD114-24_barcode01_1_low.bed -r SQK-NBD114-24_barcode01.1.fa --only-tabs --filter-threshold 0.8
25-02-03 09:48:58 INFO     Reading Modkit table...
25-02-03 09:49:02 INFO     Retrieving sequences
25-02-03 09:49:09 INFO     Total number of sites with any possible calls (including low quality): 4184185 bases
25-02-03 09:49:09 INFO     Median coverage (stranded): 1140.0x
25-02-03 09:49:09 INFO     Total number of sites with greater than min coverage (10x): 4184185 bases
25-02-03 09:49:10 INFO     Potential sites (>33%) for methylation type m: 0 bases
25-02-03 09:49:10 INFO     High quality sites (>66%) for methylation type m: 0 bases
25-02-03 09:49:10 INFO     Very high quality sites (>90%) for methylation type m: 0 bases
25-02-03 09:49:10 INFO     Potential sites (>33%) for methylation type 21839: 20 bases
25-02-03 09:49:10 INFO     High quality sites (>66%) for methylation type 21839: 0 bases
25-02-03 09:49:10 INFO     Very high quality sites (>90%) for methylation type 21839: 0 bases
25-02-03 09:49:10 INFO     Potential sites (>33%) for methylation type a: 2068 bases
25-02-03 09:49:10 INFO     High quality sites (>66%) for methylation type a: 1779 bases
25-02-03 09:49:10 INFO     Very high quality sites (>90%) for methylation type a: 196 bases
25-02-03 09:49:10 INFO     Running for Methylation type: 6mA
25-02-03 09:49:11 INFO     Modkit table size: 2006021
25-02-03 09:49:11 INFO     196 sites for STREME with cutoff 0.9 and min coverage 10
25-02-03 09:49:11 INFO     Creating background control distribution: SQK-NBD114-24_barcode01_1_a_control.fasta
25-02-03 09:49:11 INFO     Running STREME: streme . --n SQK-NBD114-24_barcode01_1_a_control.fasta -p SQK-NBD114-24_barcode01_1_a_pos.fasta -o SQK-NBD114-24_barcode01_1_a_pos_streme
25-02-03 09:51:43 INFO     Found and trimmed motif: 1-YTGGAGGTDHWDYWN	GGAGGNNNW	176
25-02-03 09:51:44 INFO     Counting genome sites for: GGAGGNNNW
25-02-03 09:51:44 INFO     Motif has 1376 methylated instances
25-02-03 09:51:44 INFO     Methylated positions in motif GGAGGNNNW:
25-02-03 09:51:44 INFO     3    98.071429
6     1.714286
9     0.214286
Name: count, dtype: float64
25-02-03 09:51:44 INFO     Running for Methylation type: 5mC
25-02-03 09:51:44 INFO     Modkit table size: 1089082
25-02-03 09:51:44 INFO     0 sites for STREME with cutoff 0.9 and min coverage 10
25-02-03 09:51:44 INFO     Note: Fewer than 10 methylated sites identified, no motif finding was run.
25-02-03 09:51:44 INFO     Running for Methylation type: 4mC
25-02-03 09:51:44 INFO     Modkit table size: 1089082
25-02-03 09:51:44 INFO     0 sites for STREME with cutoff 0.9 and min coverage 10
25-02-03 09:51:44 INFO     Note: Fewer than 10 methylated sites identified, no motif finding was run.
25-02-03 09:51:44 INFO     Complete!
25-02-03 09:51:44 INFO     Saving methylated site table to: SQK-NBD114-24_barcode01_1_methylated_sites.tsv
25-02-03 09:51:44 INFO     Saving motif output to: SQK-NBD114-24_barcode01_1_motifs.tsv

SQK-NBD114-24_barcode01_1_motifs.tsv:

index   Motif   Motif_raw       Methylation_type        Genome_sites    Methylated_sites        Methylation_coverage    Average_Percent_Methylation_per_site   Methylated_position_1   Methylated_position_1_percent   Methylated_position_2   Methylated_position_2_percent
0       0       GGAGGNNNW       1-YTGGAGGTDHWDYWN       6mA     1523    1384    0.909   0.82    3       98.071  6       1.714
1       1       No Motif Assigned       NA      6mA     NA      403     NA      0.81    NA      NA      NA      NA 

streme.xml:

<motif id="1-YTGGAGGTDHWDYWN" alt="STREME-1" width="15" initial_width="13" seed="CTGGAGGTGCTGCCA" score_threshold="8.28948" npassing="196" train_pos_count="177" train_neg_count="808" train_log_pvalue="-354.458" train_pvalue="3.5e-355" train_dtc="-1.0" train_bernoulli="-1" test_pos_count="19" test_neg_count="101" test_log_pvalue="-37.1572" test_pvalue="7.0e-038" test_log_evalue="-36.5551" test_evalue="2.8e-037" test_dtc="-1.0" test_bernoulli="-1" is_palindromic="no" elapsed_time="22.4" total_sites="177" site_distr=" 0 48 42 0 0 0 0 45 42 0" max_sites="1" site_hist=" 0 177">

2. Data from 0.8 cutoff run (where motif was not reported by MicrobeMod):

Log file:

25-03-31 17:47:32 INFO     Running modkit: modkit pileup -t 20 SQK-NBD114-24_barcode01_1_sorted.bam SQK-NBD114-24_barcode01_1_low.bed -r SQK-NBD114-24_barcode01.1.fa --only-tabs --filter-threshold 0.8
25-03-31 17:58:34 INFO     Reading Modkit table...
25-03-31 17:58:42 INFO     Retrieving sequences
25-03-31 17:58:53 INFO     Total number of sites with any possible calls (including low quality): 4184185 bases
25-03-31 17:58:53 INFO     Median coverage (stranded): 1140.0x
25-03-31 17:58:54 INFO     Total number of sites with greater than min coverage (10x): 4184185 bases
25-03-31 17:58:55 INFO     Potential sites (>33%) for methylation type m: 0 bases
25-03-31 17:58:55 INFO     High quality sites (>66%) for methylation type m: 0 bases
25-03-31 17:58:55 INFO     Very high quality sites (>90%) for methylation type m: 0 bases
25-03-31 17:58:55 INFO     Potential sites (>33%) for methylation type 21839: 20 bases
25-03-31 17:58:55 INFO     High quality sites (>66%) for methylation type 21839: 0 bases
25-03-31 17:58:55 INFO     Very high quality sites (>90%) for methylation type 21839: 0 bases
25-03-31 17:58:56 INFO     Potential sites (>33%) for methylation type a: 2068 bases
25-03-31 17:58:56 INFO     High quality sites (>66%) for methylation type a: 1779 bases
25-03-31 17:58:56 INFO     Very high quality sites (>90%) for methylation type a: 196 bases
25-03-31 17:58:56 INFO     Running for Methylation type: 6mA
25-03-31 17:58:57 INFO     Modkit table size: 2006021
25-03-31 17:58:57 INFO     1155 sites for STREME with cutoff 0.8 and min coverage 10
25-03-31 17:58:57 INFO     Creating background control distribution: SQK-NBD114-24_barcode01_1_a_control.fasta
25-03-31 17:58:57 INFO     Running STREME: streme . --n SQK-NBD114-24_barcode01_1_a_control.fasta -p SQK-NBD114-24_barcode01_1_a_pos.fasta -o SQK-NBD114-24_barcode01_1_a_pos_streme
25-03-31 18:00:17 INFO     Found and trimmed motif: 1-WCCTCCWDHWDHWDH	CCTCC	1040
25-03-31 18:00:19 INFO     Running for Methylation type: 5mC
25-03-31 18:00:19 INFO     Modkit table size: 1089082
25-03-31 18:00:19 INFO     0 sites for STREME with cutoff 0.8 and min coverage 10
25-03-31 18:00:19 INFO     Note: Fewer than 10 methylated sites identified, no motif finding was run.
25-03-31 18:00:19 INFO     Running for Methylation type: 4mC
25-03-31 18:00:19 INFO     Modkit table size: 1089082
25-03-31 18:00:19 INFO     0 sites for STREME with cutoff 0.8 and min coverage 10
25-03-31 18:00:19 INFO     Note: Fewer than 10 methylated sites identified, no motif finding was run.
25-03-31 18:00:20 INFO     Complete!
25-03-31 18:00:20 INFO     Saving methylated site table to: SQK-NBD114-24_barcode01_1_methylated_sites.tsv
25-03-31 18:00:20 INFO     Saving motif output to: SQK-NBD114-24_barcode01_1_motifs.tsv

SQK-NBD114-24_barcode01_1_motifs.tsv:

index   Motif   Motif_raw       Methylation_type        Genome_sites    Methylated_sites        Methylation_coverage    Average_Percent_Methylation_per_site   Methylated_position_1   Methylated_position_1_percent   Methylated_position_2   Methylated_position_2_percent
0       0       No Motif Assigned       NA      6mA     NA      1779    NA      0.82    NA      NA      NA      NA

streme.xml:

<motif id="1-WCCTCCWDHWDHWDH" alt="STREME-1" width="15" initial_width="10" seed="ACCTCCAGTATCAAT" score_threshold="7.65842" npassing="1155" train_pos_count="1040" train_neg_count="1005" train_log_pvalue="-1853.34" train_pvalue="4.6e-1854" train_dtc="-1.0" train_bernoulli="-1" test_pos_count="115" test_neg_count="129" test_log_pvalue="-199.837" test_pvalue="1.5e-200" test_log_evalue="-199.235" test_evalue="5.8e-200" test_dtc="-1.0" test_bernoulli="-1" is_palindromic="no" elapsed_time="24.4" total_sites="1040" site_distr=" 251 257 0 0 0 0 1 0 271 260" max_sites="2" site_hist=" 0 1038 2">

Thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions