diff --git a/README.md b/README.md index 1899313..2819416 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,9 @@ ![](images/logo.png) +--- +[![PyPI](https://img.shields.io/pypi/v/ModDotPlot?color=blue&label=PyPI)](https://pypi.org/project/ModDotPlot/) +[![CI](https://github.com/marbl/ModDotPlot/actions/workflows/black.yml/badge.svg)](https://github.com/marbl/ModDotPlot/actions/workflows/black.yml) +- [](#) - [Cite](#cite) - [About](#about) - [Installation](#installation) @@ -19,7 +23,7 @@ ## Cite -Alexander P Sweeten, Michael C Schatz, Adam M Phillippy, ModDotPlot—rapid and interactive visualization of tandem repeats, Bioinformatics, Volume 40, Issue 8, August 2024, btae493, [https://doi.org/10.1093/bioinformatics/btae493](https://pmc.ncbi.nlm.nih.gov/articles/PMC11321072/) +Alexander P Sweeten, Michael C Schatz, Adam M Phillippy, ModDotPlot—rapid and interactive visualization of tandem repeats, Bioinformatics, Volume 40, Issue 8, August 2024, btae493, [https://doi.org/10.1093/bioinformatics/btae493](https://doi.org/10.1093/bioinformatics/btae493) If you use ModDotPlot for your research, please cite our software! @@ -27,7 +31,7 @@ If you use ModDotPlot for your research, please cite our software! ## About -ModDotPlot is a dot plot visualization tool designed for large sequences and whole genomes. ModDotPlot outputs an identity heatmap similar to [StainedGlass](https://mrvollger.github.io/StainedGlass/) by rapidly approximating the Average Nucleotide Identity between pairwise combinations of genomic intervals. This significantly reduces the computational time required to produce these plots, enough to view multiple layers of resolution in real time! +_ModDotPlot_ is a dot plot visualization tool designed for large sequences and whole genomes. _ModDotPlot_ outputs an identity heatmap similar to [StainedGlass](https://mrvollger.github.io/StainedGlass/) by rapidly approximating the Average Nucleotide Identity between pairwise combinations of genomic intervals. This significantly reduces the computational time required to produce these plots, enough to view multiple layers of resolution in real time! ![](images/demo.gif) @@ -35,12 +39,14 @@ ModDotPlot is a dot plot visualization tool designed for large sequences and who ## Installation +_ModDotPlot_ can be installed by running `pip install moddotplot`. It requires Python 3.7+ to run. Alternatively, you can download the current release from GitHub by using: + ``` git clone https://github.com/marbl/ModDotPlot.git cd ModDotPlot ``` -Although optional, it's recommended to setup a virtual environment before using ModDotPlot: +Although optional, it's recommended to setup a virtual environment before using _ModDotPlot_: ``` python -m venv venv @@ -53,7 +59,7 @@ Once activated, you can install the required dependencies: python -m pip install . ``` -Finally, confirm that the installation was installed correctly by running `moddotplot -h`: +Finally, confirm that the installation was installed correctly and that your version is up to date by running `moddotplot -h`: ``` __ __ _ _____ _ _____ _ _ | \/ | | | | __ \ | | | __ \| | | | @@ -62,6 +68,8 @@ Finally, confirm that the installation was installed correctly by running `moddo | | | | (_) | (_| | | |__| | (_) | |_ | | | | (_) | |_ |_| |_|\___/ \__,_| |_____/ \___/ \__| |_| |_|\___/ \__| + v0.9.4 + usage: moddotplot [-h] {interactive,static} ... ModDotPlot: Visualization of Complex Repeat Structures @@ -78,7 +86,7 @@ options: ## Usage -ModDotPlot must be run either in `interactive` mode, or `static` mode: +_ModDotPlot_ must be run either in `interactive` mode, or `static` mode: ### Interactive Mode @@ -94,13 +102,15 @@ This will launch a [Dash application](https://plotly.com/dash/) on your machine' moddotplot static ``` -This skips running Dash and quickly creates plots under the specified output directory using [plotnine](https://plotnine.readthedocs.io/en/v0.12.4/). By default, running ModDotPlot in static mode this will produce the following files: +Running _ModDotPlot_ in static mode skips running Dash and quickly creates plots under the specified output directory `-o`. By default, running _ModDotPlot_ in static mode this will produce the following files: -- A paired-end bed file, containing intervals alongside their corresponding identity estimates. -- A self-identity dotplot for each sequence. +- A paired-end bed file `.bedpe`, containing intervals alongside their corresponding identity estimates. +- A self-identity dotplot for each sequence, as both an upper triangle matrix `_TRI` and full matrix `_FULL` representation. - A histogram of identity values for each sequence. - -See [static mode commands](#static-mode-commands) for further info. + +All plots and histograms are output in a vectorized (default: `.svg`) and rasterized `.png` image. [Plotnine](https://plotnine.readthedocs.io/en/v0.12.4/) is the Python plotting library used, with [CairoSVG](https://cairosvg.org) used for converting between image formats. + +_ModDotPlot_ supports highly customizable plotting features in static mode. See [static mode commands](#static-mode-commands) for a complete list of features. --- @@ -112,6 +122,10 @@ The following arguments are the same in both interactive and static mode: Fasta files to input. Multifasta files are accepted. Interactive mode will only support a maximum of two sequences at a time. +`-b / --bed <.bed file>` + +Input bedfile used for dotplot annotation (note: this is not the same as the paired-end bed file produced by ModDotPlot). If selected, this will produce an annotated bedtrack image `_ANNOTATION.svg`. Name in the bedfile must match the name of the fasta sequence header (or input, if `-l` is used instead) in order to produce a correct bed track. + `-k / --kmer ` K-mer size to use. This should be large enough to distinguish unique k-mers with enough specificity, but not too large that sensitivity is removed. Default: 21. @@ -184,15 +198,15 @@ Load previously saved matrices. Used instead of `-f/--fasta` Run `moddotplot static` with a config file, rather than (sample syntax). Recommended when creating a really customized plot. Used instead of `-f/--fasta`. -`-b / --bed <.bed file>` +`-l / --load <.bedpe file>` -Create a plot from a previously computed pairwise bed file. Skips Average Nucleotide Identity computation. Used instead of `-f/--fasta`. +Create a plot from a previously computed pairwise bed file. Skips Average Nucleotide Identity computation. Used instead of `-f/--fasta`. Will only accept paired-end bed files produced by ModDotPlot. `-w / --window ` Window size. Unlike interactive mode, only one matrix will be created, so this represents the *only* window size. Default is set to `n/1000` (eg. 3000bp for a 3Mbp sequence). -`--no-bed ` +`--no-bedpe ` Skip output of bed file. @@ -204,9 +218,17 @@ Skip output of histogram legend. Adjust width of self dot plots. Default is 9 inches. -`--dpi ` +`--dpi ` -Image resolution in dots per inch (not to be confused with dotplot resolution). Default is 600. +Image resolution in dots per inch (not to be confused with dotplot resolution). Default is `300`. + +`--vector ` + +Vectorized image format to output to. Must be one of ["svg", "pdf", "ps"]. Default: `svg` + +`--deraster ` + +By default, vectorized ouptuts rasterize the actual plot (not the axis). This is done to save space, as a high-resolution dotplot can be extremely space inefficient and prevent use of image manipulation software. This plot rasterization can be removed using this flag. `--palette ` @@ -226,11 +248,11 @@ Add custom identity threshold breakpoints. Note that the number of breakpoints m `-t / --axes-ticks ` -Custom tickmarks for x and y axis. Values outside of the axes-limits will not be shown. +Custom tickmarks for x and y axis. Values outside of the `--axes-limits` will not be shown. `-a / --axes-limits ` -Change axis limits for x and y axis. Useful for comparing multiple plots, allowing them to stay in scale. +Change axis limits for x and y axis. Useful when comparing multiple plots, allowing them to stay in scale. `--bin-freq ` @@ -308,7 +330,6 @@ $ cat config/config.json { "identity": 90, - "sparsity": 10, "palette": "OrRd_7", "breakpoints": [ 90, @@ -338,13 +359,13 @@ $ moddotplot static -c config/config.json Running ModDotPlot in static mode -Retrieving k-mers from Chr1:14M-18M.... +Retrieving k-mers from Chr1:14000000-18000000.... Progress: |████████████████████████████████████████| 100.0% Completed -Chr1:14M-18M k-mers retrieved! +Chr1:14000000-18000000 k-mers retrieved! -Computing self identity matrix for chr1:14M-18M... +Computing self identity matrix for chr1:14000000-18000000... Sequence length n: 4000000 @@ -357,13 +378,13 @@ Computing self identity matrix for chr1:14M-18M... Progress: |████████████████████████████████████████| 100.0% Completed -Saved bed file to Chr1_cen_plots/Chr1:14M-18M.bed +Saved bed file to Chr1_cen_plots/Chr1:14000000-18000000.bed -Plots created! Saving to Chr1_cen_plots/Chr1:14M-18M... +Plots created! Saving to Chr1_cen_plots/Chr1:14000000-18000000... Chr1_cen_plots/Chr1:14M-18M_TRI.png, Chr1_cen_plots/Chr1:14M-18M_TRI.pdf, Chr1_cen_plots/Chr1:14M-18M_FULL.png, Chr1_cen_plots/Chr1:14M-18M_FULL.png, Chr1_cen_plots/Chr1:14M-18M_HIST.png and Chr1_cen_plots/Chr1:14M-18M_HIST.pdf, saved sucessfully. ``` -![](images/Chr1:14M-18M_FULL.png) +![](images/Chr1:14000000-18000000_FULL.png) --- @@ -376,6 +397,8 @@ ModDotPlot can produce an a vs. b style dotplot for each pairwise combination of moddotplot interactive -f sequences/chr15_segment.fa sequences/chr21_segment.fa --compare-only ``` +![](images/chr14:2000000-5000000_chr21:2000000-5000000.png) + --- ## Questions @@ -386,10 +409,6 @@ For bug reports or general usage questions, please raise a GitHub issue, or emai ## Known Issues -- Plot width and xlim (limiting the x axis to a different amount) currently do not work. I plan to have those working in v0.9.0. - - Mac users might encounter the following unexpected command line output: `/bin/sh: lscpu: command not found`. This is a known issue with Plotnine, the Python plotting library used by ModDotPlot. This can be safely ignored. - If you encounter an error with the following traceback: `rv = reductor(4) TypeError: cannot pickle 'generator' object`, ths means that you have a newer version of Plotnine that is incompatible with ModDotPlot. Please uninstall plotnine and reinstall version 0.12.4 `pip install plotnine==0.12.4`. - -- In interactive mode, comparing sequences of two sizes will lead to errors in zooming for the larger sequence. I plan to fix this in v0.9.0. diff --git a/config/config.json b/config/config.json index dafb47e..9b351b1 100644 --- a/config/config.json +++ b/config/config.json @@ -1,6 +1,5 @@ { "identity": 90, - "sparsity": 10, "palette": "OrRd_7", "breakpoints": [ 90, diff --git a/pyproject.toml b/pyproject.toml index 02929dd..955193d 100755 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "ModDotPlot" -version = "0.9.3" +version = "0.9.4" requires-python = ">= 3.7" dependencies = [ "pysam", @@ -14,12 +14,12 @@ dependencies = [ "plotnine==0.12.4", "palettable", "mmh3", - "tk", "setproctitle", "numpy", "pillow", "patchworklib==0.6.3", - "cairosvg" + "cairosvg", + "pygenometracks", ] authors = [ {name = "Alex Sweeten", email = "alex.sweeten@nih.gov"}, diff --git a/src/moddotplot/const.py b/src/moddotplot/const.py index 5256205..ee2afbd 100755 --- a/src/moddotplot/const.py +++ b/src/moddotplot/const.py @@ -1,4 +1,4 @@ -VERSION = "0.9.3" +VERSION = "0.9.4" COLS = [ "#query_name", "query_start", diff --git a/src/moddotplot/estimate_identity.py b/src/moddotplot/estimate_identity.py index 293bd64..d9da261 100644 --- a/src/moddotplot/estimate_identity.py +++ b/src/moddotplot/estimate_identity.py @@ -128,7 +128,7 @@ def partitionOverlaps( try: kmer_list.append(lst[delta_start_index:delta_end_index]) except Exception as e: - print("test") + print("Error in appending list of kmers...\n") print(e) kmer_list.append(lst[delta_start_index:seq_len]) counter += win @@ -167,7 +167,7 @@ def convertToModimizers( def convertMatrixToBed( - matrix, window_size, id_threshold, x_name, y_name, self_identity + matrix, window_size, id_threshold, x_name, y_name, self_identity, x_offset, y_offset ): bed = [ ( @@ -187,10 +187,10 @@ def convertMatrixToBed( value = matrix[x, y] if (not self_identity) or (self_identity and x <= y): if value >= id_threshold / 100: - start_x = x * window_size + 1 - end_x = (x + 1) * window_size - start_y = y * window_size + 1 - end_y = (y + 1) * window_size + start_x = x * window_size + x_offset + end_x = start_x + window_size - 1 + start_y = y * window_size + y_offset + end_y = start_y + window_size - 1 bed.append( ( diff --git a/src/moddotplot/moddotplot.py b/src/moddotplot/moddotplot.py index 0412ab0..abff278 100755 --- a/src/moddotplot/moddotplot.py +++ b/src/moddotplot/moddotplot.py @@ -5,6 +5,7 @@ getInputHeaders, isValidFasta, extractFiles, + extractRegion, ) from moddotplot.estimate_identity import ( @@ -35,7 +36,7 @@ def get_parser(): """ parser = argparse.ArgumentParser( formatter_class=argparse.ArgumentDefaultsHelpFormatter, - description="ModDotPlot: Visualization of Complex Repeat Structures", + description="ModDotPlot: Visualization of Tandem Repeats", ) subparsers = parser.add_subparsers( dest="command", help="Choose mode: interactive or static" @@ -176,8 +177,8 @@ def get_parser(): ) static_input_group.add_argument( - "-b", - "--bed", + "-l", + "--load", default=argparse.SUPPRESS, help="Path to input paired-end bed file(s). Exclusively used in static mode.", nargs="+", @@ -197,6 +198,8 @@ def get_parser(): required=False ) + static_parser.add_argument("-b", "--bed", default=None, help="Bed file annotation.") + static_parser.add_argument( "-k", "--kmer", default=21, type=int, help="k-mer length." ) @@ -268,7 +271,7 @@ def get_parser(): ) static_parser.add_argument( - "--no-bed", action="store_true", help="Skip output of bed file." + "--no-bedpe", action="store_true", help="Skip output of paired-end bed file." ) static_parser.add_argument( @@ -280,7 +283,7 @@ def get_parser(): ) static_parser.add_argument( - "--width", default=18, type=float, help="Plot width (also height for _FULL)." + "--width", default=9, type=float, help="Plot width (also height for _FULL)." ) static_parser.add_argument("--dpi", default=300, type=int, help="Plot dpi.") @@ -430,6 +433,7 @@ def main(): config = json.load(f) # TODO: Remove args that are interactive only args.fasta = config.get("fasta") + args.load = config.get("load") args.bed = config.get("bed") # Distance matrix commands @@ -443,7 +447,7 @@ def main(): args.compare = config.get("compare", args.compare) args.compare_only = config.get("compare_only", args.compare_only) - args.no_bed = config.get("no_bed", args.no_bed) + args.no_bedpe = config.get("no_bedpe", args.no_bedpe) args.no_plot = config.get("no_plot", args.no_plot) args.no_hist = config.get("no_hist", args.no_hist) args.width = config.get("width", args.width) @@ -457,18 +461,13 @@ def main(): args.axes_ticks = config.get("axes_ticks", args.axes_ticks) args.breakpoints = config.get("breakpoints", args.breakpoints) args.bin_freq = config.get("bin_freq", args.bin_freq) - - # - """if args.grid or args.grid_only: - if not (args.compare or args.compare_only) and not args.bed: - print( - f"Option --grid was selected, but no comparative plots will be produced. Please rerun ModDotPlot with the `--compare` or `--compare-only` option.\n" - ) - sys.exit(10)""" + args.axes_limits = config.get("axes_limits", args.axes_limits) + args.axes_ticks = config.get("axes_ticks", args.axes_ticks) + args.vector = config.get("vector", args.vector) + args.deraster = config.get("deraster", args.deraster) # -----------INPUT COMMAND VALIDATION----------- # TODO: More tests! - # Validation for breakpoints command if args.breakpoints: # Check that start value for breakpoints = identity threshold value if float(args.breakpoints[0]) != float(args.identity): @@ -483,15 +482,15 @@ def main(): sys.exit(2) # -----------BEDFILE INPUT FOR STATIC MODE----------- - if hasattr(args, "bed") and args.bed: + if hasattr(args, "load") and args.load: if args.grid or args.grid_only: single_vals = [] double_vals = [] single_val_name = [] double_val_name = [] xlim_val_grid = 0 - for bed in args.bed: - # If args.bed is provided as input, run static mode directly from the bed file. Skip counting input k-mers. + for bed in args.load: + # If args.load is provided as input, run static mode directly from the paired-end bed file. Skip counting input k-mers. df = read_df_from_file(bed) unique_query_names = df["#query_name"].unique() @@ -529,6 +528,7 @@ def main(): axes_tick_number=args.axes_number, vector_format=args.vector, deraster=args.deraster, + annotation=args.bed, ) if args.grid or args.grid_only: single_vals.append(df) @@ -566,6 +566,7 @@ def main(): axes_tick_number=args.axes_number, vector_format=args.vector, deraster=args.deraster, + annotation=args.bed, ) if args.grid or args.grid_only: double_vals.append(df) @@ -929,6 +930,11 @@ def main(): for i in range(len(sequences)): seq_length = len(sequences[i][1]) seq_name = sequences[i][0] + seq_range = extractRegion(seq_name) + if not seq_range: + seq_start_pos = 1 + else: + seq_start_pos = int(seq_range[1]) win = args.window res = args.resolution if args.window: @@ -972,24 +978,33 @@ def main(): expectation, ) bed = convertMatrixToBed( - self_mat, win, args.identity, seq_name, seq_name, True + self_mat, + win, + args.identity, + seq_name, + seq_name, + True, + seq_start_pos, + seq_start_pos, ) if args.grid or args.grid_only: grid_val_singles.append(bed) grid_val_single_names.append(seq_name) - if not args.no_bed: + if not args.no_bedpe: # Log saving bed file if not args.output_dir: - bedfile_output = seq_name + ".bed" + bedfile_output = seq_name + ".bedpe" else: bedfile_output = os.path.join( - args.output_dir, seq_name + ".bed" + args.output_dir, seq_name + ".bedpe" ) with open(bedfile_output, "w") as bedfile: for row in bed: bedfile.write("\t".join(map(str, row)) + "\n") - print(f"Saved bed file to {bedfile_output}\n") + print( + f"Saved self-identity matrix as a paired-end bed file to {bedfile_output}\n" + ) if (not args.no_plot) and (not args.grid_only): create_plots( @@ -1012,6 +1027,7 @@ def main(): axes_tick_number=args.axes_number, vector_format=args.vector, deraster=args.deraster, + annotation=args.bed, ) # -----------COMPUTE COMPARATIVE PLOTS----------- @@ -1028,12 +1044,23 @@ def main(): for i in range(len(sequences)): for j in range(i + 1, len(sequences)): + # Larger = x, smaller = y. This is pre-sorted earlier. larger_seq = sequences[i][1] smaller_seq = sequences[j][1] larger_length = len(larger_seq) smaller_length = len(smaller_seq) larger_seq_name = sequences[i][0] smaller_seq_name = sequences[j][0] + larger_seq_range = extractRegion(larger_seq_name) + if not larger_seq_range: + larger_seq_start_pos = 1 + else: + larger_seq_start_pos = int(larger_seq_range[1]) + smaller_seq_range = extractRegion(smaller_seq_name) + if not larger_seq_range: + smaller_seq_start_pos = 1 + else: + smaller_seq_start_pos = int(smaller_seq_range[1]) win = args.window res = args.resolution if args.window: @@ -1090,6 +1117,8 @@ def main(): seq_list[i], seq_list[j], False, + larger_seq_start_pos, + smaller_seq_start_pos, ) if args.grid or args.grid_only: grid_val_doubles.append(bed) @@ -1097,14 +1126,14 @@ def main(): [larger_seq_name, smaller_seq_name] ) xlim_val_grid = max(larger_length, xlim_val_grid) - if not args.no_bed: + if not args.no_bedpe: # Log saving bed file if not args.output_dir: bedfile_output = ( larger_seq_name + "_" + smaller_seq_name - + "_COMPARE.bed" + + "_COMPARE.bedpe" ) else: bedfile_output = os.path.join( @@ -1112,12 +1141,14 @@ def main(): larger_seq_name + "_" + smaller_seq_name - + "_COMPARE.bed", + + "_COMPARE.bedpe", ) with open(bedfile_output, "w") as bedfile: for row in bed: bedfile.write("\t".join(map(str, row)) + "\n") - print(f"Saved bed file to {bedfile_output}\n") + print( + f"Saved comparative matrix as a paired-end bed file to {bedfile_output}\n" + ) if (not args.no_plot) and (not args.grid_only): create_plots( @@ -1140,6 +1171,7 @@ def main(): axes_tick_number=args.axes_number, vector_format=args.vector, deraster=args.deraster, + annotation=args.bed, ) if args.grid or args.grid_only: diff --git a/src/moddotplot/parse_fasta.py b/src/moddotplot/parse_fasta.py index ff97c01..12acb1e 100644 --- a/src/moddotplot/parse_fasta.py +++ b/src/moddotplot/parse_fasta.py @@ -12,6 +12,24 @@ tab_b = bytes.maketrans(b"ACTG", b"TGAC") +def extractRegion(seq_name): + """Check if seq_name contains a region and extract the bounds.""" + # Define the regular expression to match the format "chrY:50-3000" + region_pattern = r"([a-zA-Z0-9]+):(\d+)-(\d+)" + + match = re.match(region_pattern, seq_name) + if match: + # Extract chromosome, lower bound, and upper bound + chrom = match.group(1) + lower_bound = int(match.group(2)) + upper_bound = int(match.group(3)) + + return chrom, lower_bound, upper_bound + else: + # No region found + return None + + def generateKmersFromFasta(seq: Sequence[str], k: int, quiet: bool) -> Iterable[int]: n = len(seq) if not quiet: diff --git a/src/moddotplot/static_plots.py b/src/moddotplot/static_plots.py index 2afc809..bf6055f 100755 --- a/src/moddotplot/static_plots.py +++ b/src/moddotplot/static_plots.py @@ -27,6 +27,7 @@ theme_minimal, geom_raster, ) +import svgutils.transform as sg import cairosvg import pandas as pd import numpy as np @@ -38,7 +39,9 @@ import sys import re from moddotplot.parse_fasta import printProgressBar - +from lxml import etree +from pygenometracks.utilities import get_region +import matplotlib.pyplot as plt from moddotplot.const import ( DIVERGING_PALETTES, QUALITATIVE_PALETTES, @@ -46,6 +49,18 @@ ) from typing import List from palettable.colorbrewer import qualitative, sequential, diverging +import logging + +# Set log level BEFORE importing pygenometracks +for name in logging.root.manager.loggerDict: + if name.startswith("pygenometracks"): + logging.getLogger(name).setLevel(logging.CRITICAL) + logging.getLogger(name).propagate = False # Don't pass to root logger + +# Also make sure the root logger isn’t outputting debug messages +logging.basicConfig(level=logging.CRITICAL) + +from pygenometracks.tracksClass import PlotTracks def is_plot_empty(p): @@ -74,6 +89,93 @@ def check_pascal(single_val, double_val): sys.exit(8) +def generate_ini_file( + bedfile, ininame, chrom, color_value="bed_rgb", x_axis=True, display="collapsed" +): + try: + thing = chrom.split(":")[0] + sections = [ + "[spacer]", + "# height of space in cm (optional)", + "height = 0.5", + "", + f"[{thing}]", + f"file = {bedfile}", + f"Title=", + "height = 1", + f"display = {display}", + f"color = {color_value}", + "labels = false", + "fontsize = 10", + "file_type = bed", + ] + + if x_axis: + sections.insert(0, "[x-axis]") + + ini_content = "\n".join(sections) + with open(f"{ininame}.ini", "w") as file: + file.write(ini_content) + print(f"Successfully generated {ininame}.ini\n") + return f"{ininame}.ini" + except Exception as err: + print(f"Error producing ini file: {err}\n") + return None + + +def read_annotation_bed(filepath): + """Reads a BED file into a Pandas DataFrame and ensures correct formatting.""" + col_names = [ + "chrom", + "start", + "end", + "name", + "score", + "strand", + "thickStart", + "thickEnd", + "itemRgb", + ] # Include additional fields + + df = pd.read_csv(filepath, sep="\t", comment="#", header=None) + + # Ensure at least three required columns exist + if df.shape[1] < 3: + raise ValueError( + "Invalid BED file: must have at least 3 columns (chrom, start, end)." + ) + + # Rename only the expected columns + df.columns = col_names[: df.shape[1]] + + # Ensure start and end columns contain valid integers + if df["start"].isna().any() or df["end"].isna().any(): + raise ValueError( + "Invalid BED file: 'start' and 'end' columns must be integers and contain no missing values." + ) + + return df + + +def run_pygenometracks(inifile, region, output_file, width): + output_dir = os.path.dirname(output_file) + if output_dir and not os.path.exists(output_dir): + os.makedirs(output_dir) # Create directory if it doesn't exist + + trp = PlotTracks( + inifile, + width, + fig_height=None, + fontsize=10, + dpi=300, + track_label_width=0.05, + plot_regions=region, + plot_width=width, + ) + + return trp + + def reverse_pascal(double_vals): if len(double_vals) == 1: return 2 @@ -189,7 +291,7 @@ def get_colors(sdf, ncolors, is_freq, custom_breakpoints): else: breaks = [bot + i * interval for i in range(ncolors + 1)] if custom_breakpoints: - breaks = np.asfarray(custom_breakpoints) + np.asarray(custom_breakpoints, dtype=np.float64) labels = np.arange(len(breaks) - 1) # corner case of only one %id value if len(breaks) == 1: @@ -281,23 +383,27 @@ def read_df( return df -def generate_breaks(number, min_breaks=5, max_breaks=9): +def generate_breaks(min_number, max_number, min_breaks=5, max_breaks=9): # Determine the order of magnitude - magnitude = 10 ** int( - math.floor(math.log10(number)) - ) # Base power of 10 (e.g., 10M, 1M, 100K) - threshold = math.ceil(number / magnitude) + difference = max_number - min_number + + magnitude = 10 ** int(math.floor(math.log10(difference))) + threshold = math.ceil(difference / magnitude) while threshold > max_breaks: magnitude *= 2 - threshold = math.ceil(number / magnitude) + threshold = math.ceil(difference / magnitude) while threshold < min_breaks: magnitude /= 2 - threshold = math.ceil(number / magnitude) + threshold = math.ceil(difference / magnitude) + + # Round down min_number to the nearest multiple of magnitude + min_aligned = int(min_number // magnitude * magnitude) # Generate breakpoints - breaks = list(range(0, int((threshold * magnitude) + magnitude), int(magnitude))) + upper_bound = int(min_aligned + (threshold + 1) * magnitude) + breaks = list(range(min_aligned, upper_bound, int(magnitude))) return breaks @@ -348,11 +454,12 @@ def make_dot( if not xlim: xlim = 0 # Determine maximum genomic position for scaling + min_val = min(sdf["q_st"].min(), sdf["r_st"].min()) max_val = max(sdf["q_en"].max(), sdf["r_en"].max(), xlim) # If user provides breaks, convert to ints if not breaks: - breaks = generate_breaks(int(max_val)) + breaks = generate_breaks(int(min_val), int(max_val)) else: [int(x) for x in breaks] xlim = xlim or 0 @@ -398,8 +505,12 @@ def make_dot( + scale_color_discrete(guide=False) + scale_fill_manual(values=new_hexcodes, guide=False) + common_theme - + scale_x_continuous(labels=make_scale, limits=[0, max_val], breaks=breaks) - + scale_y_continuous(labels=make_scale, limits=[0, max_val], breaks=breaks) + + scale_x_continuous( + labels=make_scale, limits=[min_val, max_val], breaks=breaks + ) + + scale_y_continuous( + labels=make_scale, limits=[min_val, max_val], breaks=breaks + ) + coord_fixed(ratio=1) + facet_grid("r ~ q") + labs(x=x_label, y="", title=title_name) @@ -448,11 +559,12 @@ def make_dot_grid( if not xlim: xlim = 0 # Determine maximum genomic position for scaling + min_val = max(sdf["q_st"].min(), sdf["r_st"].min()) max_val = max(sdf["q_en"].max(), sdf["r_en"].max(), xlim) # If user provides breaks, convert to ints if not breaks: - breaks = generate_breaks(int(max_val)) + breaks = generate_breaks(int(min_val), int(max_val)) else: [int(x) for x in breaks] xlim = xlim or 0 @@ -543,11 +655,12 @@ def make_dot_final( if not xlim: xlim = 0 # Determine maximum genomic position for scaling + min_val = min(sdf["q_st"].min(), sdf["r_st"].min()) max_val = max(sdf["q_en"].max(), sdf["r_en"].max(), xlim) # If user provides breaks, convert to ints if not breaks: - breaks = generate_breaks(int(max_val)) + breaks = generate_breaks(int(min_val), int(max_val)) else: [int(x) for x in breaks] xlim = xlim or 0 @@ -657,11 +770,12 @@ def make_tri( if not xlim: xlim = 0 # Determine maximum genomic position for scaling + min_val = max(sdf["q_st"].min(), sdf["r_st"].min()) max_val = max(sdf["q_en"].max(), sdf["r_en"].max(), xlim) # If user provides breaks, convert to ints if not breaks: - breaks = generate_breaks(int(max_val)) + breaks = generate_breaks(int(min_val), int(max_val)) else: [int(x) for x in breaks] xlim = xlim or 0 @@ -688,8 +802,12 @@ def make_tri( ) # Ensure full opacity + scale_fill_manual(values=new_hexcodes, guide=False) + scale_color_discrete(guide=False) - + scale_x_continuous(labels=make_scale, limits=[0, max_val], breaks=breaks) - + scale_y_continuous(labels=make_scale, limits=[0, max_val], breaks=breaks) + + scale_x_continuous( + labels=make_scale, limits=[min_val, max_val], breaks=breaks + ) + + scale_y_continuous( + labels=make_scale, limits=[min_val, max_val], breaks=breaks + ) + coord_fixed(ratio=1) + labs(x=x_label, y="", title=title_name) + theme( @@ -749,8 +867,12 @@ def make_tri( ) # Ensure full opacity + scale_fill_manual(values=new_hexcodes, guide=False) + scale_color_discrete(guide=False) - + scale_x_continuous(labels=make_scale, limits=[0, max_val], breaks=breaks) - + scale_y_continuous(labels=make_scale, limits=[0, max_val], breaks=breaks) + + scale_x_continuous( + labels=make_scale, limits=[min_val, max_val], breaks=breaks + ) + + scale_y_continuous( + labels=make_scale, limits=[min_val, max_val], breaks=breaks + ) + coord_fixed(ratio=1) + labs(x=x_label, y="", title=title_name) + theme( @@ -778,8 +900,12 @@ def make_tri( ) + scale_color_discrete(guide=False) + scale_fill_manual(values=new_hexcodes, guide=False) - + scale_x_continuous(labels=make_scale, limits=[0, max_val], breaks=breaks) - + scale_y_continuous(labels=make_scale, limits=[0, max_val], breaks=breaks) + + scale_x_continuous( + labels=make_scale, limits=[min_val, max_val], breaks=breaks + ) + + scale_y_continuous( + labels=make_scale, limits=[min_val, max_val], breaks=breaks + ) + coord_fixed(ratio=1) + labs(x="", y="", title="") + theme( @@ -898,6 +1024,81 @@ def rotate_rasterized_tri(svg_path, shift_x, shift_y): tree.write(svg_path) +from lxml import etree + + +def append_svg(svg1_path, svg2_path, output_path): + """Appends SVG2 to the bottom of SVG1 without modifying its width.""" + # Load SVG1 and SVG2 + tree1 = etree.parse(svg1_path) + root1 = tree1.getroot() + + tree2 = etree.parse(svg2_path) + root2 = tree2.getroot() + + # Extract width and height of SVG1 + width1 = float(root1.get("width", "0").replace("pt", "")) + height1 = float(root1.get("height", "0").replace("pt", "")) + + # Extract width and height of SVG2 + width2 = float(root2.get("width", "0").replace("pt", "")) + height2 = float(root2.get("height", "0").replace("pt", "")) + + # Update SVG1 height to accommodate SVG2 + new_height = height1 + height2 + root1.set("height", f"{new_height}pt") + + # Create a translation group for SVG2 and shift it down + group = etree.Element("g", attrib={"transform": f"translate(0,{height1 + 400})"}) + for child in root2: + group.append(child) + + # Append translated SVG2 to SVG1 + root1.append(group) + + # Save the new merged SVG + tree1.write(output_path, pretty_print=True, xml_declaration=True, encoding="utf-8") + + +def get_svg_size(svg_path): + """Returns the width and height of an SVG file.""" + fig = sg.fromfile(svg_path) + return fig.get_size() + + +def merge_svgs(svg1_path, svg2_path, output_path): + """Merges two SVG files into a single SVG file. + + Args: + svg1_path: Path to the first SVG file. + svg2_path: Path to the second SVG file. + output_path: Path to save the merged SVG file. + """ + # Create figure and subplots + + w, l = get_svg_size(svg1_path) + print(w, l) + w2, l2 = get_svg_size(svg2_path) + print(w2, l2) + # fig = sg.SVGFigure("780pt", "432pt") + # fig = sg.SVGFigure(810, 450) + fig = sg.SVGFigure("110pt", "100pt") + # Load the two SVG files + svg1 = sg.fromfile(svg1_path).getroot() + svg2 = sg.fromfile(svg2_path).getroot() + + # Optionally, adjust the position and size of the loaded SVGs + svg1.moveto(0, 0) + svg2.moveto(0, 600) + + # Append the loaded SVGs to the figure + fig.append([svg1, svg2]) + + # Save the merged SVG to a new file + fig.save(output_path) + print(get_svg_size(output_path)) + + def make_tri_axis(sdf, title_name, palette, palette_orientation, colors, breaks, xlim): if not breaks: breaks = True @@ -1350,6 +1551,7 @@ def create_plots( axes_tick_number, vector_format, deraster, + annotation, ): df = read_df( sdf, @@ -1369,6 +1571,39 @@ def create_plots( sdf, palette, palette_orientation, custom_colors, custom_breakpoints ) + # Just doing triangle plots for now. + if annotation: + print(f"Generating ini file for annotation track:\n") + iniprefix = plot_filename + inifile = generate_ini_file( + bedfile=annotation, + ininame=iniprefix, + chrom=name_x, + ) + min_val = max(sdf["q_st"].min(), sdf["r_st"].min()) + max_val = max(sdf["q_en"].max(), sdf["r_en"].max()) + region = [(name_x.split(":")[0], min_val, max_val)] + if inifile: + bed_track = run_pygenometracks( + inifile=inifile, + region=region, + output_file=f"{iniprefix}_ANNOTATION.{vector_format}", + width=width * 2.05, + ) + name_x.split(":")[0] + bed_track.plot( + f"{iniprefix}_ANNOTATION.{vector_format}", + name_x.split(":")[0], + min_val, + max_val, + ) + bed_track.plot( + f"{iniprefix}_ANNOTATION.png", name_x.split(":")[0], min_val, max_val + ) + print( + f"\nAnnotation track saved to {iniprefix}_ANNOTATION.{vector_format} & {iniprefix}_ANNOTATION.png\n" + ) + if is_pairwise: heatmap = make_dot( sdf, @@ -1467,6 +1702,7 @@ def create_plots( width, False, ) + ggsave( full_plot, width=width, @@ -1502,6 +1738,10 @@ def create_plots( rotate_vectorized_tri( f"{tri_prefix}.svg", scaling_values[0], scaling_values[1] ) + if annotation: + """merge_svgs(f"{plot_filename}_TRI.{vector_format}","test.svg", "test2.svg") + append_svg(f"{plot_filename}_TRI.{vector_format}", "test.svg", "test2.svg") + """ try: cairosvg.svg2png( url=f"{tri_prefix}.svg", write_to=f"{tri_prefix}.png", dpi=dpi