diff --git a/README.md b/README.md index 6dda4b7..57948a8 100644 --- a/README.md +++ b/README.md @@ -51,27 +51,27 @@ This README document contains installation instructions and documentation for va **Platforms:** Mac and Linux operating systems, ARM (e.g. M1/M2 mac) and x86 (e.g. Intel mac) architectures. The package has been tested on HPC clusters, but there may be additional complexities depending on how these systems are set up. -*For Windows, we recommend using [Google Colab](https://colab.research.google.com/). We have not tested on [Windows Linux Subsystem](https://learn.microsoft.com/en-us/windows/wsl/install) but in principle that should work too. Windows support is possible in future, but blocked by [conda availability for modkit executables](https://anaconda.org/nanoporetech/modkit) and the [current implementation](dimelo/run_modkit.py) of live error/progress tracking during modkit execution, which relies on a unix-only library as of Python 3.11. The urgency of a Windows implementation will depend on user need, so please let us know if this is important for you.* +*For Windows, we recommend using [Google Colab](https://colab.research.google.com/). We have not tested on [Windows Linux Subsystem](https://learn.microsoft.com/en-us/windows/wsl/install) but in principle that should work too. Windows support is possible in future, but blocked by [conda availability for modkit executables](https://anaconda.org/nanoporetech/modkit) and the [current implementation](dimelo-toolkit/run_modkit.py) of live error/progress tracking during modkit execution, which relies on a unix-only library as of Python 3.11. The urgency of a Windows implementation will depend on user need, so please let us know if this is important for you.* **Conda and Python:** The default installation requires conda, or alternatives like mamba. See [here](https://www.anaconda.com/download) for conda installation. The installation instructions below will install Python 3.11 for you within a conda virtual environment, but depending on your system configuration you may need to ensure that you are not also loading a different version of Python on your path. If you encounter unexpected errors when importing `dimelo`, e.g. complaining about syntax, consider checking your Python version. -### Load source code from the modkit_parsing_beta branch +### Load source code for dimelo-toolkit -Open your terminal or command line and navigate to wherever you want to keep the `dimelo` source code (e.g. your Documents folder, `cd Documents`) and clone the repo +Open your terminal or command line and navigate to wherever you want to keep the `dimelo-toolkit` source code (e.g. your Documents folder, `cd Documents`) and clone the repo ``` -git clone https://github.com/streetslab/dimelo +git clone https://github.com/streetslab/dimelo-toolkit ``` ### Set up virtual environment -Navigate into the dimelo directory +Navigate into the dimelo-toolkit directory ``` -cd dimelo +cd dimelo-toolkit ``` -Create a conda environment using environment.yml. This will make a new conda environment with the name `dimelo`. +Create a conda environment using environment.yml. This will make a new conda environment with the name `dimelo-toolkit`. ``` conda env create -f environment.yml @@ -79,15 +79,15 @@ conda env create -f environment.yml *If you want to handle environment creation yourself, see [the alternative installation instructions](#alternative-installations).* -### Install pip dependencies and core dimelo package +### Install pip dependencies and core dimelo-toolkit package Activate your conda environment, which should now contain python 3.11 and a modkit executable on the path and executable on your system. ``` -conda activate dimelo +conda activate dimelo-toolkit ``` -Ensure that you are still in the top-level dimelo directory. Install the dimelo package and its dependencies from source. +Ensure that you are still in the top-level dimelo-toolkit directory. Install the dimelo-toolkit package and its dependencies from source. ``` pip install . @@ -95,7 +95,7 @@ pip install . ## Google Colab Installation -Run the following code in the first cell of your notebook to grab `modkit v0.2.4` from conda and install the `dimelo modkit_parsing_beta` branch. This will have to be run whenever you make a new Colab instance, unless you have a better way of managing this, in which case please reach out. The tutorial notebook runs equivalent code blocks to set up your environment, so if you are trying to run the tutorial you can skip to [Basic Use](#basic-use). +Run the following code in the first cell of your notebook to grab `modkit v0.2.4` from conda and install the `dimelo-toolkit main` branch. This will have to be run whenever you make a new Colab instance, unless you have a better way of managing this, in which case please reach out. The tutorial notebook runs equivalent code blocks to set up your environment, so if you are trying to run the tutorial you can skip to [Basic Use](#basic-use). ``` from google.colab import drive @@ -104,14 +104,14 @@ drive.mount('/content/drive') import condacolab condacolab.install() !conda install nanoporetech::modkit==0.2.4 -!git clone https://github.com/streetslab/dimelo -!cd dimelo && pip install ipywidgets==7.7.1 . +!git clone https://github.com/streetslab/dimelo-toolkit +!cd dimelo-toolkit && pip install ipywidgets==7.7.1 . import dimelo ``` ## Alternative Installations -Alternatively, you can install modkit into any conda environment you like. If you want to, you can install modkit some other way, and then add it to the path of your notebook or script. *NOTE: if you are creating the environment yourself, be sure to use python 3.10 or greater. Some dimelo package features require relatively new python releases.* +Alternatively, you can install modkit into any conda environment you like. If you want to, you can install modkit some other way, and then add it to the path of your notebook or script. *NOTE: if you are creating the environment yourself, be sure to use python 3.10 or greater. Some dimelo-toolkit features require relatively new python releases.* ``` conda install nanoporetech::modkit==0.2.4 @@ -125,7 +125,7 @@ sys.path.append('path_to_modkit_executable_directory') ``` ## Developer Installation -If you are planning on developing for the `dimelo` package, change the `pip install` command to install the package in "editable" mode, so that your code changes are reflected in your environment: +If you are planning on developing for the `dimelo-toolkit` package, change the `pip install` command to install the package in "editable" mode, so that your code changes are reflected in your environment: ``` pip install . -e ``` @@ -159,7 +159,7 @@ See the [tutorial](tutorial.ipynb) as a starting point. For local operation on Mac or Linux, you will already have cloned the repo to disk in the installation step. Activate your conda environment, make sure you have jupyter installed, and then launch a jupyter notebook server and navigate to `tutorial.ipynb`. You can also use other tools to open the jupyter notebook or you can simply reference it as an example. ``` -conda activate dimelo +conda activate dimelo-toolkit jupyter notebook ``` @@ -224,9 +224,8 @@ You should expect to see some text outputs and a series of progress bars. Progre There should not be such issues for command line operation. See below an example of command line progress outputs: you should expect relatively fast pre-processing, 10-90 seconds, and then contig processing times depending heavily on the size of your `.bam` file and the extent of your `regions`. ``` -(dimelo_modkit_parsing) oberondixon-luinenburg@Oberons-MacBook-Pro package_test_notebooks % python dimelo_cmd.py +(dimelo-toolkit) % python dimelo_cmd.py modkit found with expected version 0.2.4 -No output directory provided, using input directory /Users/oberondixon-luinenburg/Documents/Ioannidis-Streets/dimelo_test_data/20230702_jm_lmnb1_acessibility_redux No specified number of cores requested. 8 available on machine, allocating all. Modification threshold of 0.9 will be treated as coming from range 0-1. ████████████████████| Preprocessing complete for motifs ['A,0'] in chm13.draft_v1.1.fasta: 100% | 00:30 @@ -246,6 +245,7 @@ def plot_enrichment_profile( motifs: list[str], sample_names: list[str], window_size: int, + relative: bool = True, single_strand: bool = False, regions_5to3prime: bool = False, smooth_window: int | None = None, @@ -268,6 +268,7 @@ def plot_enrichment_profile( mod_names: list of modifications to extract; expected to match mods available in the relevant mod_files sample_names: list of names to use for labeling traces in the output; legend entries window_size: half-size of the desired window to plot; how far the window stretches on either side of the center point + relative: True means x-axis is centered around region centers, False means x-axis is absolute genome positions. Must be True when plotting more than one region. single_strand: True means we only grab counts from reads from the same strand as the region of interest, False means we always grab both strands within the regions regions_5to3prime: True means negative strand regions get flipped, False means no flipping @@ -440,6 +441,10 @@ def plot_read_browser( sort_by: str | list[str] = "shuffle", hover: bool = True, subset_parameters: dict | None = None, + span_full_window: bool = False, + width: int = 1, + size: int = 4, + colorscales: dict = utils.DEFAULT_COLORSCALES, **kwargs, ) -> plotly.graph_objs.Figure: """ @@ -469,6 +474,10 @@ def plot_read_browser( hover: if False, disables display of information on mouse hover subset_parameters: Parameters to pass to the utils.random_sample() method, to subset the reads to be returned. If not None, at least one of n or frac must be provided. + span_full_window: if True, only plot reads that fully span the window defined by region_start-region_end + width: width of the read lines in the browser + size: size of the modification event markers in the browser + colorscales: dictionary mapping motif names to plotly colorscale specifications Returns: plotly Figure object containing the plot @@ -747,6 +756,7 @@ def read_vectors_from_hdf5( quiet: bool = True, # currently unused; change to default False when pbars are implemented cores: int | None = None, # currently unused subset_parameters: dict | None = None, + span_full_window: bool = False, ) -> tuple[list[tuple], list[str], dict | None]: """ User-facing function. @@ -795,6 +805,7 @@ def read_vectors_from_hdf5( subset_parameters: Parameters to pass to the utils.random_sample() method, to subset the reads to be returned. If not None, at least one of n or frac must be provided. The array parameter should not be provided here. + span_full_window: If True, only load reads that fully span the window defined by region_start-region_end Returns: a list of tuples, each tuple containing all datasets corresponding to an individual read that diff --git a/dimelo/load_processed.py b/dimelo/load_processed.py index 62a4556..b14e4f0 100644 --- a/dimelo/load_processed.py +++ b/dimelo/load_processed.py @@ -629,6 +629,7 @@ def read_vectors_from_hdf5( quiet: bool = True, # currently unused; change to default False when pbars are implemented cores: int | None = None, # currently unused subset_parameters: dict | None = None, + span_full_window: bool = False, ) -> tuple[list[tuple], list[str], dict | None]: """ User-facing function. @@ -679,6 +680,7 @@ def read_vectors_from_hdf5( subset_parameters: Parameters to pass to the utils.random_sample() method, to subset the reads to be returned. If not None, at least one of n or frac must be provided. The array parameter should not be provided here. + span_full_window: If True, only load reads that fully span the window defined by region_start-region_end Returns: a list of tuples, each tuple containing all datasets corresponding to an individual read that @@ -731,6 +733,8 @@ def read_vectors_from_hdf5( relevant_read_indices = np.flatnonzero( (read_ends > region_start) & (read_starts < region_end) + & (read_starts <= region_start if span_full_window else True) + & (read_ends >= region_end if span_full_window else True) & np.isin(read_motifs, motifs) & (read_chromosomes == chrom) & ( @@ -792,7 +796,7 @@ def read_vectors_from_hdf5( ) # We add region information (start, end, and strand; chromosome is already present!) # so that it is possible to sort by and process based on these - readwise_datasets += ["region_start", "region_end", "region_strand"] + readwise_datasets += ["region_start", "region_end", "region_strand", "read_length"] # This is sanitizing the dataset entries and adjusting prob values if needed if binarized: @@ -809,6 +813,9 @@ def read_vectors_from_hdf5( for tup in read_tuples_raw ] + read_start_idx = readwise_datasets.index("read_start") + read_end_idx = readwise_datasets.index("read_end") + if calculate_mod_fractions: # Add the MOTIF_mod_fraction entries to the readwise_datasets list for future reference in sorting readwise_datasets += [f"{motif}_mod_fraction" for motif in motifs] @@ -828,8 +835,10 @@ def read_vectors_from_hdf5( read_tuples_all = [] for read_tuple in read_tuples_processed: + read_length = read_tuple[read_end_idx] - read_tuple[read_start_idx] read_tuples_all.append( tuple(val for val in read_tuple) + + (read_length,) + tuple( mod_frac for mod_frac in mod_fractions_by_read_name_by_motif[ @@ -838,33 +847,59 @@ def read_vectors_from_hdf5( ) ) else: - read_tuples_all = read_tuples_processed + read_tuples_all = [] + for read_tuple in read_tuples_processed: + read_length = read_tuple[read_end_idx] - read_tuple[read_start_idx] + read_tuples_all.append(tuple(val for val in read_tuple) + (read_length,)) ## Sort the reads + # Normalize sort_by to a list of tuples (field, order) + # Support formats: + # - "field" or ["field1", "field2"] -> [("field1", "asc"), ("field2", "asc")] + # - [("field1", "desc"), "field2"] -> [("field1", "desc"), ("field2", "asc")] + # Enforce that sort_by is a list if not isinstance(sort_by, list): sort_by = [sort_by] + # Parse into (field, order) tuples + sort_by_normalized = [] + for item in sort_by: + if isinstance(item, tuple): + field, order = item + if order not in ["asc", "desc"]: + raise ValueError( + f"Sort order must be 'asc' or 'desc', got '{order}' for field '{field}'" + ) + sort_by_normalized.append((field, order)) + else: + # Default to ascending order + sort_by_normalized.append((item, "asc")) + # If 'shuffle' appears anywhere in sort_by, we first shuffle the list - if "shuffle" in sort_by: + if any(field == "shuffle" for field, _ in sort_by_normalized): utils.rng.shuffle(read_tuples_all) + # Build sorting configuration try: - sort_by_indices = [ - readwise_datasets.index(sort_item) - for sort_item in sort_by - if sort_item != "shuffle" + sort_config = [ + (readwise_datasets.index(field), order == "desc") + for field, order in sort_by_normalized + if field != "shuffle" ] except ValueError as e: raise ValueError( f"Sorting error. {e}. Datasets include {readwise_datasets}. If you need mod fraction sorting make sure you are not setting calculate_read_fraction to False." ) from e - if len(sort_by_indices) > 0: - sorted_read_tuples = sorted( - read_tuples_all, key=lambda x: tuple(x[index] for index in sort_by_indices) - ) + if len(sort_config) > 0: + # Use stable sort from right to left (reverse order of sort keys) + sorted_read_tuples = read_tuples_all + for idx, is_reverse in reversed(sort_config): + sorted_read_tuples = sorted( + sorted_read_tuples, key=lambda x: x[idx], reverse=is_reverse + ) else: sorted_read_tuples = read_tuples_all diff --git a/dimelo/parse_bam.py b/dimelo/parse_bam.py index 88a5310..e86b7ee 100644 --- a/dimelo/parse_bam.py +++ b/dimelo/parse_bam.py @@ -1037,6 +1037,7 @@ def read_by_base_txt_to_hdf5( read_counter += 1 ## Set up for next read + # Metadata read_name = fields[0] read_chrom = fields[3] read_len = int(fields[9]) @@ -1046,13 +1047,21 @@ def read_by_base_txt_to_hdf5( pos_in_read_ref = int(fields[1]) elif ref_strand == "-": pos_in_read_ref = read_len - int(fields[1]) - 1 - # Calculate read info + # Calculate read start (leftmost position on ref genome) + # TODO: logic can be replaced when we switch to true read start/end from modkit read_start = pos_in_genome - pos_in_read_ref - read_end = read_start + read_len # Instantiate lists mod_values_list = [] valid_coordinates_list = [] + # Adjust the read_end (rightmost position on ref genome) each time there's a new mod + # This will lead to the most accurate end positions for gapped reads + # TODO: logic can be replaced when we switch to true read start/end from modkit + if ref_strand == "+": + pos_in_read_ref = int(fields[1]) + elif ref_strand == "-": + pos_in_read_ref = read_len - int(fields[1]) - 1 + read_end = pos_in_genome + (read_len - pos_in_read_ref) # Regardless of whether its a new read or not, # add modification to vector if motif type is correct # for the motif in question diff --git a/dimelo/plot_enrichment_profile.py b/dimelo/plot_enrichment_profile.py index 97ec461..20a625c 100644 --- a/dimelo/plot_enrichment_profile.py +++ b/dimelo/plot_enrichment_profile.py @@ -12,6 +12,7 @@ def plot_enrichment_profile( motifs: list[str], sample_names: list[str], window_size: int, + relative: bool = True, single_strand: bool = False, regions_5to3prime: bool = False, smooth_window: int | None = None, @@ -39,6 +40,7 @@ def plot_enrichment_profile( mod_names: list of modifications to extract; expected to match mods available in the relevant mod_files sample_names: list of names to use for labeling traces in the output; legend entries window_size: half-size of the desired window to plot; how far the window stretches on either side of the center point + relative: True means x-axis is centered around region centers, False means x-axis is absolute genome positions. Must be True when plotting more than one region. single_strand: True means we only grab counts from reads from the same strand as the region of interest, False means we always grab both strands within the regions regions_5to3prime: True means negative strand regions get flipped, False means no flipping @@ -65,8 +67,26 @@ def plot_enrichment_profile( cores=cores, ) + if relative: + offset_center = 0 + else: + regions_dict = utils.regions_dict_from_input( + regions_list[0], + window_size, + ) + if len(regions_dict) == 1 and len(list(regions_dict.values())[0]) == 1: + region_tuple = list(regions_dict.values())[0][0] + offset_center = (region_tuple[0] + region_tuple[1]) // 2 + else: + raise ValueError( + "relative=False must be used when plotting more than one region." + ) + axes = make_enrichment_profile_plot( - trace_vectors=trace_vectors, sample_names=sample_names, **kwargs + trace_vectors=trace_vectors, + sample_names=sample_names, + offset_center=offset_center, + **kwargs, ) return axes @@ -241,6 +261,7 @@ def get_enrichment_profiles( def make_enrichment_profile_plot( trace_vectors: list[np.ndarray], sample_names: list[str], + offset_center: int = 0, **kwargs, ) -> Axes: """ @@ -252,6 +273,7 @@ def make_enrichment_profile_plot( Args: trace_vectors: list of enrichment profile traces sample_names: list of names to use for labeling traces in the output; legend entries + offset_center: position offset to apply to x-axis (e.g., when plotting absolute genome positions) kwargs: other keyword parameters passed through to utils.line_plot Returns: @@ -261,8 +283,8 @@ def make_enrichment_profile_plot( raise ValueError("Unequal number of inputs") axes = utils.line_plot( indep_vector=np.arange( - -len(trace_vectors[0]) // 2, - len(trace_vectors[0]) // 2 + len(trace_vectors[0]) % 2, + offset_center - len(trace_vectors[0]) // 2, + offset_center + len(trace_vectors[0]) // 2 + len(trace_vectors[0]) % 2, ), indep_name="pos", dep_vectors=trace_vectors, diff --git a/dimelo/plot_read_browser.py b/dimelo/plot_read_browser.py index 4ead839..a49d496 100644 --- a/dimelo/plot_read_browser.py +++ b/dimelo/plot_read_browser.py @@ -16,6 +16,10 @@ def plot_read_browser( sort_by: str | list[str] = "shuffle", hover: bool = True, subset_parameters: dict | None = None, + span_full_window: bool = False, + width: int = 1, + size: int = 4, + colorscales: dict = utils.DEFAULT_COLORSCALES, **kwargs, ) -> plotly.graph_objs.Figure: """ @@ -45,6 +49,10 @@ def plot_read_browser( hover: if False, disables display of information on mouse hover subset_parameters: Parameters to pass to the utils.random_sample() method, to subset the reads to be returned. If not None, at least one of n or frac must be provided. + span_full_window: if True, only plot reads that fully span the window defined by region_start-region_end + width: width of the read lines in the browser + size: size of the modification event markers in the browser + colorscales: dictionary mapping motif names to plotly colorscale specifications Returns: plotly Figure object containing the plot @@ -67,8 +75,9 @@ def plot_read_browser( motifs=motifs, single_strand=single_strand, sort_by=sort_by, - calculate_mod_fractions=False, + calculate_mod_fractions=True, subset_parameters=subset_parameters, + span_full_window=span_full_window, ) mod_vector_index = entry_labels.index("mod_vector") @@ -107,6 +116,10 @@ def plot_read_browser( region_start=region_start, region_end=region_end, hover=hover, + thresh=thresh, + width=width, + size=size, + colorscales=colorscales, **kwargs, ) @@ -169,9 +182,17 @@ def format_browser_data( # differences are caused by the assumptions made in parse_bam.extract h5 conversion # and don't matter for much but do cause the full duplicate check to leave duplicates, # which then cause non-unique indices for mapping in collapse mode - read_extent_df = read_df[ - ["read_start", "read_end", "y_index", "read_name"] - ].drop_duplicates(subset=["read_name"]) + # The sorting below is just to make sure that when we drop duplicates, we keep + # the longest read extent for each read_name. It doesn't effect final figure sort order. + read_extent_df = ( + read_df[["read_start", "read_end", "y_index", "read_name"]] + .assign(read_length=lambda df: df.read_end - df.read_start) + .sort_values( + "read_length", ascending=False + ) # TODO: logic can be removed once we reference true read lengths from modkit + .drop_duplicates(subset=["read_name"]) + .drop(columns=["read_length"]) + ) # * represents the methylation events mod_event_df = ( read_df[["y_index", "read_name", "motif", "pos_vector", "prob_vector"]] @@ -294,6 +315,10 @@ def make_browser_figure( region_start: int, region_end: int, hover: bool = True, + thresh: int | float | None = None, + width: int = 1, + size: int = 4, + colorscales: dict = utils.DEFAULT_COLORSCALES, **kwargs, ) -> plotly.graph_objs.Figure: """ @@ -311,6 +336,10 @@ def make_browser_figure( region_start: start position of the region being browsed region_end: end position of the region being browsed hover: if False, disables display of information on mouse hover + thresh: pass down threshold for color scaling of modification probabilities + width: width of the read lines in the browser + size: size of the modification event markers in the browser + colorscales: dictionary mapping motif names to plotly colorscale specifications TODO: Think about how this interfaces with different types of initial sorting... TODO: Make it so that this method does NOT modify the input dataframe @@ -341,7 +370,7 @@ def make_browser_figure( x=[row.read_start, row.read_end], y=[row.y_index, row.y_index], mode="lines", - line=dict(width=1, color="lightgrey"), + line=dict(width=width, color="lightgrey"), showlegend=False, hoverinfo="text", hovertext=row.read_name, @@ -365,9 +394,11 @@ def make_browser_figure( ] ), marker=dict( - size=4, + size=size, color=motif_df["prob"], - colorscale=utils.DEFAULT_COLORSCALES[motif], + colorscale=colorscales[motif], + cmin=thresh, + cmax=1.0, colorbar=dict( title=dict( text=f"{motif} probability", diff --git a/dimelo/test/dimelo_test.py b/dimelo/test/dimelo_test.py index f7b84b2..170397a 100644 --- a/dimelo/test/dimelo_test.py +++ b/dimelo/test/dimelo_test.py @@ -51,8 +51,10 @@ def test_unit__pileup( pileup_target, regions_target = results["pileup"] if pileup_target is not None and regions_target is not None: - # This is necessary because the gzipped files are different on mac vs linux, but the contents should be identical (and are, so far) - # Not sure why the compression ratio is better on Linux when both are using pysam.tabix_compress with pysam 0.22.0 and zlib 1.2.13 but whatcha gonna do + # This is necessary because the gzipped files are different on mac vs linux, + # but the contents should be identical (and are, so far) + # Not sure why the compression ratio is better on Linux when both are using + # pysam.tabix_compress with pysam 0.22.0 and zlib 1.2.13 but whatcha gonna do with ( gzip.open(pileup_bed, "rt") as f1, gzip.open(pileup_target, "rt") as f2, diff --git a/dimelo/utils.py b/dimelo/utils.py index e4ad62b..91a2661 100644 --- a/dimelo/utils.py +++ b/dimelo/utils.py @@ -27,7 +27,9 @@ "CG,0": "orange", "CG,0,m": "yellow", "CG,0,h": "red", - "GCH,1": "purple", + "HCG,1": "orange", + "WCG,1": "red", + "GCH,1": "green", } ) # Default colorscales for plotly; based off of DEFAULT_COLORS @@ -73,7 +75,7 @@ def __init__(self, motif_string): if self.modified_pos >= len(self.motif_seq): raise ValueError(f"Motif {motif_string} has an out-of-range mod index.") self.modified_base = self.motif_seq[self.modified_pos] - self.mod_codes = set(parts[2]) + self.mod_codes = set([parts[2]]) else: # Motifs need both a sequence and an index, separated by a comma raise ValueError( diff --git a/environment.yml b/environment.yml index b0b6a75..d74efd5 100644 --- a/environment.yml +++ b/environment.yml @@ -1,4 +1,4 @@ -name: dimelo +name: dimelo-toolkit channels: - conda-forge - nanoporetech