Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
18 commits
Select commit Hold shift + click to select a range
0f55da5
Adjust environment name
OberonDixon Nov 11, 2025
f2bc169
Change default colors closer to those for the manuscript
OberonDixon Nov 11, 2025
cdf9def
Add relative=False option (absolute coordinates) for enrichment profiles
OberonDixon Nov 12, 2025
011da25
Add support for pseudouridine by supporting mod codes more than one c…
OberonDixon Nov 12, 2025
9513e7e
Improve plot_read_browser customizability
OberonDixon Nov 13, 2025
e0aad85
Read length management adjusted for gapped alignments e.g. RNA with s…
OberonDixon Nov 14, 2025
aaab7d1
Add span_full_window parameter to select only reads that both are wit…
OberonDixon Nov 14, 2025
9f0addc
Move metadata per-read parsing to below the read saving functionality…
OberonDixon Nov 14, 2025
8a9c7c1
Add sorting option for read_length
OberonDixon Nov 15, 2025
20c48a5
Add asc vs desc options to sort_by for read_vectors_from_h5. Passes e…
OberonDixon Nov 15, 2025
72caadf
ruff format
OberonDixon Nov 19, 2025
e06fca1
Documentation changes and top-level parameter exposure per PR comments
OberonDixon Nov 25, 2025
841faf9
Shorten long lines to see if that helps with ruff format
OberonDixon Nov 25, 2025
b0b78e0
ruff format after downgrading to ruff 0.6.8
OberonDixon Nov 25, 2025
1e8d676
Update package name and add new kwargs to function descriptions
OberonDixon Nov 25, 2025
473500b
Update kwargs handling to avoid bugs in the collapse sorting case
OberonDixon Nov 25, 2025
e70da7a
Fix mistake in removing kwargs in too many places
OberonDixon Nov 25, 2025
ba5b9c4
Fix typo and remove personal info
OberonDixon Nov 25, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
47 changes: 29 additions & 18 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -51,51 +51,51 @@ This README document contains installation instructions and documentation for va

**Platforms:** Mac and Linux operating systems, ARM (e.g. M1/M2 mac) and x86 (e.g. Intel mac) architectures. The package has been tested on HPC clusters, but there may be additional complexities depending on how these systems are set up.

*For Windows, we recommend using [Google Colab](https://colab.research.google.com/). We have not tested on [Windows Linux Subsystem](https://learn.microsoft.com/en-us/windows/wsl/install) but in principle that should work too. Windows support is possible in future, but blocked by [conda availability for modkit executables](https://anaconda.org/nanoporetech/modkit) and the [current implementation](dimelo/run_modkit.py) of live error/progress tracking during modkit execution, which relies on a unix-only library as of Python 3.11. The urgency of a Windows implementation will depend on user need, so please let us know if this is important for you.*
*For Windows, we recommend using [Google Colab](https://colab.research.google.com/). We have not tested on [Windows Linux Subsystem](https://learn.microsoft.com/en-us/windows/wsl/install) but in principle that should work too. Windows support is possible in future, but blocked by [conda availability for modkit executables](https://anaconda.org/nanoporetech/modkit) and the [current implementation](dimelo-toolkit/run_modkit.py) of live error/progress tracking during modkit execution, which relies on a unix-only library as of Python 3.11. The urgency of a Windows implementation will depend on user need, so please let us know if this is important for you.*

**Conda and Python:** The default installation requires conda, or alternatives like mamba. See [here](https://www.anaconda.com/download) for conda installation. The installation instructions below will install Python 3.11 for you within a conda virtual environment, but depending on your system configuration you may need to ensure that you are not also loading a different version of Python on your path. If you encounter unexpected errors when importing `dimelo`, e.g. complaining about syntax, consider checking your Python version.

### Load source code from the modkit_parsing_beta branch
### Load source code for dimelo-toolkit

Open your terminal or command line and navigate to wherever you want to keep the `dimelo` source code (e.g. your Documents folder, `cd Documents`) and clone the repo
Open your terminal or command line and navigate to wherever you want to keep the `dimelo-toolkit` source code (e.g. your Documents folder, `cd Documents`) and clone the repo

```
git clone https://github.com/streetslab/dimelo
git clone https://github.com/streetslab/dimelo-toolkit
```

### Set up virtual environment

Navigate into the dimelo directory
Navigate into the dimelo-toolkit directory

```
cd dimelo
cd dimelo-toolkit
```

Create a conda environment using environment.yml. This will make a new conda environment with the name `dimelo`.
Create a conda environment using environment.yml. This will make a new conda environment with the name `dimelo-toolkit`.

```
conda env create -f environment.yml
```

*If you want to handle environment creation yourself, see [the alternative installation instructions](#alternative-installations).*

### Install pip dependencies and core dimelo package
### Install pip dependencies and core dimelo-toolkit package

Activate your conda environment, which should now contain python 3.11 and a modkit executable on the path and executable on your system.

```
conda activate dimelo
conda activate dimelo-toolkit
```

Ensure that you are still in the top-level dimelo directory. Install the dimelo package and its dependencies from source.
Ensure that you are still in the top-level dimelo-toolkit directory. Install the dimelo-toolkit package and its dependencies from source.

```
pip install .
```

## Google Colab Installation

Run the following code in the first cell of your notebook to grab `modkit v0.2.4` from conda and install the `dimelo modkit_parsing_beta` branch. This will have to be run whenever you make a new Colab instance, unless you have a better way of managing this, in which case please reach out. The tutorial notebook runs equivalent code blocks to set up your environment, so if you are trying to run the tutorial you can skip to [Basic Use](#basic-use).
Run the following code in the first cell of your notebook to grab `modkit v0.2.4` from conda and install the `dimelo-toolkit main` branch. This will have to be run whenever you make a new Colab instance, unless you have a better way of managing this, in which case please reach out. The tutorial notebook runs equivalent code blocks to set up your environment, so if you are trying to run the tutorial you can skip to [Basic Use](#basic-use).

```
from google.colab import drive
Expand All @@ -104,14 +104,14 @@ drive.mount('/content/drive')
import condacolab
condacolab.install()
!conda install nanoporetech::modkit==0.2.4
!git clone https://github.com/streetslab/dimelo
!cd dimelo && pip install ipywidgets==7.7.1 .
!git clone https://github.com/streetslab/dimelo-toolkit
!cd dimelo-toolkit && pip install ipywidgets==7.7.1 .
import dimelo
```

## Alternative Installations

Alternatively, you can install modkit into any conda environment you like. If you want to, you can install modkit some other way, and then add it to the path of your notebook or script. *NOTE: if you are creating the environment yourself, be sure to use python 3.10 or greater. Some dimelo package features require relatively new python releases.*
Alternatively, you can install modkit into any conda environment you like. If you want to, you can install modkit some other way, and then add it to the path of your notebook or script. *NOTE: if you are creating the environment yourself, be sure to use python 3.10 or greater. Some dimelo-toolkit features require relatively new python releases.*

```
conda install nanoporetech::modkit==0.2.4
Expand All @@ -125,7 +125,7 @@ sys.path.append('path_to_modkit_executable_directory')
```

## Developer Installation
If you are planning on developing for the `dimelo` package, change the `pip install` command to install the package in "editable" mode, so that your code changes are reflected in your environment:
If you are planning on developing for the `dimelo-toolkit` package, change the `pip install` command to install the package in "editable" mode, so that your code changes are reflected in your environment:
```
pip install . -e
```
Expand Down Expand Up @@ -159,7 +159,7 @@ See the [tutorial](tutorial.ipynb) as a starting point.
For local operation on Mac or Linux, you will already have cloned the repo to disk in the installation step. Activate your conda environment, make sure you have jupyter installed, and then launch a jupyter notebook server and navigate to `tutorial.ipynb`. You can also use other tools to open the jupyter notebook or you can simply reference it as an example.

```
conda activate dimelo
conda activate dimelo-toolkit
jupyter notebook
```

Expand Down Expand Up @@ -224,9 +224,8 @@ You should expect to see some text outputs and a series of progress bars. Progre
There should not be such issues for command line operation. See below an example of command line progress outputs: you should expect relatively fast pre-processing, 10-90 seconds, and then contig processing times depending heavily on the size of your `.bam` file and the extent of your `regions`.

```
(dimelo_modkit_parsing) oberondixon-luinenburg@Oberons-MacBook-Pro package_test_notebooks % python dimelo_cmd.py
(dimelo-toolkit) % python dimelo_cmd.py
modkit found with expected version 0.2.4
No output directory provided, using input directory /Users/oberondixon-luinenburg/Documents/Ioannidis-Streets/dimelo_test_data/20230702_jm_lmnb1_acessibility_redux
No specified number of cores requested. 8 available on machine, allocating all.
Modification threshold of 0.9 will be treated as coming from range 0-1.
████████████████████| Preprocessing complete for motifs ['A,0'] in chm13.draft_v1.1.fasta: 100% | 00:30
Expand All @@ -246,6 +245,7 @@ def plot_enrichment_profile(
motifs: list[str],
sample_names: list[str],
window_size: int,
relative: bool = True,
single_strand: bool = False,
regions_5to3prime: bool = False,
smooth_window: int | None = None,
Expand All @@ -268,6 +268,7 @@ def plot_enrichment_profile(
mod_names: list of modifications to extract; expected to match mods available in the relevant mod_files
sample_names: list of names to use for labeling traces in the output; legend entries
window_size: half-size of the desired window to plot; how far the window stretches on either side of the center point
relative: True means x-axis is centered around region centers, False means x-axis is absolute genome positions. Must be True when plotting more than one region.
single_strand: True means we only grab counts from reads from the same strand as
the region of interest, False means we always grab both strands within the regions
regions_5to3prime: True means negative strand regions get flipped, False means no flipping
Expand Down Expand Up @@ -440,6 +441,10 @@ def plot_read_browser(
sort_by: str | list[str] = "shuffle",
hover: bool = True,
subset_parameters: dict | None = None,
span_full_window: bool = False,
width: int = 1,
size: int = 4,
colorscales: dict = utils.DEFAULT_COLORSCALES,
**kwargs,
) -> plotly.graph_objs.Figure:
"""
Expand Down Expand Up @@ -469,6 +474,10 @@ def plot_read_browser(
hover: if False, disables display of information on mouse hover
subset_parameters: Parameters to pass to the utils.random_sample() method, to subset the
reads to be returned. If not None, at least one of n or frac must be provided.
span_full_window: if True, only plot reads that fully span the window defined by region_start-region_end
width: width of the read lines in the browser
size: size of the modification event markers in the browser
colorscales: dictionary mapping motif names to plotly colorscale specifications

Returns:
plotly Figure object containing the plot
Expand Down Expand Up @@ -747,6 +756,7 @@ def read_vectors_from_hdf5(
quiet: bool = True, # currently unused; change to default False when pbars are implemented
cores: int | None = None, # currently unused
subset_parameters: dict | None = None,
span_full_window: bool = False,
) -> tuple[list[tuple], list[str], dict | None]:
"""
User-facing function.
Expand Down Expand Up @@ -795,6 +805,7 @@ def read_vectors_from_hdf5(
subset_parameters: Parameters to pass to the utils.random_sample() method, to subset the
reads to be returned. If not None, at least one of n or frac must be provided. The array
parameter should not be provided here.
span_full_window: If True, only load reads that fully span the window defined by region_start-region_end

Returns:
a list of tuples, each tuple containing all datasets corresponding to an individual read that
Expand Down
57 changes: 46 additions & 11 deletions dimelo/load_processed.py
Original file line number Diff line number Diff line change
Expand Up @@ -629,6 +629,7 @@ def read_vectors_from_hdf5(
quiet: bool = True, # currently unused; change to default False when pbars are implemented
cores: int | None = None, # currently unused
subset_parameters: dict | None = None,
span_full_window: bool = False,
) -> tuple[list[tuple], list[str], dict | None]:
"""
User-facing function.
Expand Down Expand Up @@ -679,6 +680,7 @@ def read_vectors_from_hdf5(
subset_parameters: Parameters to pass to the utils.random_sample() method, to subset the
reads to be returned. If not None, at least one of n or frac must be provided. The array
parameter should not be provided here.
span_full_window: If True, only load reads that fully span the window defined by region_start-region_end

Returns:
a list of tuples, each tuple containing all datasets corresponding to an individual read that
Expand Down Expand Up @@ -731,6 +733,8 @@ def read_vectors_from_hdf5(
relevant_read_indices = np.flatnonzero(
(read_ends > region_start)
& (read_starts < region_end)
& (read_starts <= region_start if span_full_window else True)
& (read_ends >= region_end if span_full_window else True)
& np.isin(read_motifs, motifs)
& (read_chromosomes == chrom)
& (
Expand Down Expand Up @@ -792,7 +796,7 @@ def read_vectors_from_hdf5(
)
# We add region information (start, end, and strand; chromosome is already present!)
# so that it is possible to sort by and process based on these
readwise_datasets += ["region_start", "region_end", "region_strand"]
readwise_datasets += ["region_start", "region_end", "region_strand", "read_length"]

# This is sanitizing the dataset entries and adjusting prob values if needed
if binarized:
Expand All @@ -809,6 +813,9 @@ def read_vectors_from_hdf5(
for tup in read_tuples_raw
]

read_start_idx = readwise_datasets.index("read_start")
read_end_idx = readwise_datasets.index("read_end")

if calculate_mod_fractions:
# Add the MOTIF_mod_fraction entries to the readwise_datasets list for future reference in sorting
readwise_datasets += [f"{motif}_mod_fraction" for motif in motifs]
Expand All @@ -828,8 +835,10 @@ def read_vectors_from_hdf5(

read_tuples_all = []
for read_tuple in read_tuples_processed:
read_length = read_tuple[read_end_idx] - read_tuple[read_start_idx]
read_tuples_all.append(
tuple(val for val in read_tuple)
+ (read_length,)
+ tuple(
mod_frac
for mod_frac in mod_fractions_by_read_name_by_motif[
Expand All @@ -838,33 +847,59 @@ def read_vectors_from_hdf5(
)
)
else:
read_tuples_all = read_tuples_processed
read_tuples_all = []
for read_tuple in read_tuples_processed:
read_length = read_tuple[read_end_idx] - read_tuple[read_start_idx]
read_tuples_all.append(tuple(val for val in read_tuple) + (read_length,))

## Sort the reads

# Normalize sort_by to a list of tuples (field, order)
# Support formats:
# - "field" or ["field1", "field2"] -> [("field1", "asc"), ("field2", "asc")]
# - [("field1", "desc"), "field2"] -> [("field1", "desc"), ("field2", "asc")]

# Enforce that sort_by is a list
if not isinstance(sort_by, list):
sort_by = [sort_by]

# Parse into (field, order) tuples
sort_by_normalized = []
for item in sort_by:
if isinstance(item, tuple):
field, order = item
if order not in ["asc", "desc"]:
raise ValueError(
f"Sort order must be 'asc' or 'desc', got '{order}' for field '{field}'"
)
sort_by_normalized.append((field, order))
else:
# Default to ascending order
sort_by_normalized.append((item, "asc"))

# If 'shuffle' appears anywhere in sort_by, we first shuffle the list
if "shuffle" in sort_by:
if any(field == "shuffle" for field, _ in sort_by_normalized):
utils.rng.shuffle(read_tuples_all)

# Build sorting configuration
try:
sort_by_indices = [
readwise_datasets.index(sort_item)
for sort_item in sort_by
if sort_item != "shuffle"
sort_config = [
(readwise_datasets.index(field), order == "desc")
for field, order in sort_by_normalized
if field != "shuffle"
]
except ValueError as e:
raise ValueError(
f"Sorting error. {e}. Datasets include {readwise_datasets}. If you need mod fraction sorting make sure you are not setting calculate_read_fraction to False."
) from e

if len(sort_by_indices) > 0:
sorted_read_tuples = sorted(
read_tuples_all, key=lambda x: tuple(x[index] for index in sort_by_indices)
)
if len(sort_config) > 0:
# Use stable sort from right to left (reverse order of sort keys)
sorted_read_tuples = read_tuples_all
for idx, is_reverse in reversed(sort_config):
sorted_read_tuples = sorted(
sorted_read_tuples, key=lambda x: x[idx], reverse=is_reverse
)
else:
sorted_read_tuples = read_tuples_all

Expand Down
13 changes: 11 additions & 2 deletions dimelo/parse_bam.py
Original file line number Diff line number Diff line change
Expand Up @@ -1037,6 +1037,7 @@ def read_by_base_txt_to_hdf5(
read_counter += 1

## Set up for next read
# Metadata
read_name = fields[0]
read_chrom = fields[3]
read_len = int(fields[9])
Expand All @@ -1046,13 +1047,21 @@ def read_by_base_txt_to_hdf5(
pos_in_read_ref = int(fields[1])
elif ref_strand == "-":
pos_in_read_ref = read_len - int(fields[1]) - 1
# Calculate read info
# Calculate read start (leftmost position on ref genome)
# TODO: logic can be replaced when we switch to true read start/end from modkit
read_start = pos_in_genome - pos_in_read_ref
read_end = read_start + read_len
# Instantiate lists
mod_values_list = []
valid_coordinates_list = []

# Adjust the read_end (rightmost position on ref genome) each time there's a new mod
# This will lead to the most accurate end positions for gapped reads
# TODO: logic can be replaced when we switch to true read start/end from modkit
if ref_strand == "+":
pos_in_read_ref = int(fields[1])
elif ref_strand == "-":
pos_in_read_ref = read_len - int(fields[1]) - 1
read_end = pos_in_genome + (read_len - pos_in_read_ref)
# Regardless of whether its a new read or not,
# add modification to vector if motif type is correct
# for the motif in question
Expand Down
Loading
Loading