Benefit-Optimising Short-term Strategies for ReadUntil Nanopore Sequencing
BOSS-RUNS allows for dynamic, adaptive sampling during nanopore sequencing. It periodically ingests all newly observed sequencing reads to generate updated decision strategies in order to maximise the information gain during the sequencing experiment. This method is implemented to run alongside readfish, which communicates the rejection signals to the sequencing machine and reloads BOSS-RUNS' decision strategies whenever they are updated.
The method is described in this preprint on bioRxiv
- Linux
- MinKNOW >=5.0 with guppy >=6.0
- working setup of readfish for BOSS-RUNS.
In parallel to BOSS-RUNS, a modified version of readfish needs to be set up, which is available at: readfish for BOSS-RUNS.
To install both readfish and BOSS-RUNS:
- create a virtual environment
python3 -m venv bossruns
. bossruns/bin/activate
- install dependencies and readfish
pip install --upgrade pip
pip install git+https://github.com/nanoporetech/read_until_api@v3.0.0
pip install git+https://github.com/LooseLab/readfish@BossRuns/V0.0.2
ont_pyguppy_client_lib needs a version that matches your guppy version. On OSes using apt you can find your guppy version using:
apt list --installed ont-guppy* | tail -n 1 | cut -f2 -d' ' | cut -f1 -d'-' >guppy_version
cat guppy_version
pip install ont_pyguppy_client_lib==$(cat guppy_version)
- install a few more dependencies for BOSS-RUNS and clone the repository
pip3 install natsort scipy numba toml
git clone https://github.com/goldman-gp-ebi/BOSS-RUNS.git
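The guppy version lookup in the steps above keeps the second space-separated field of the last `apt list` line and cuts it at the first hyphen. As a rough illustration of what that pipeline extracts, here is a minimal Python sketch. The function name and the sample line are hypothetical, and the exact `apt list` output on your system may differ.

```python
def guppy_version_from_apt(apt_output: str) -> str:
    """Mimic `tail -n 1 | cut -f2 -d' ' | cut -f1 -d'-'` on apt list output."""
    last_line = apt_output.strip().splitlines()[-1]   # tail -n 1
    version_field = last_line.split(" ")[1]           # cut -f2 -d' '
    return version_field.split("-")[0]                # cut -f1 -d'-'


# Hypothetical sample of `apt list --installed ont-guppy*` output
sample = "Listing...\nont-guppy/stable,now 6.0.1-1~focal amd64 [installed]\n"
print(guppy_version_from_apt(sample))  # 6.0.1
```

The returned string is what would be passed to `pip install ont_pyguppy_client_lib==...`.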
Alternatively, if you already have the correct version of readfish set up, BOSS-RUNS can be run in a separate conda environment.
The required dependencies can then be installed using the provided yaml file
conda env create -f environment.yml
To activate the environment:
conda activate bossruns
BOSS-RUNS can be used in different ways depending on the aim of the sequencing experiment and the initial targets.
Reference fasta sequences are required for all targets. They will be indexed upon launch, or an index can optionally be provided. For creation, you can use:
BOSS-RUNS/scripts/mappy_index_fasta.py references.fa
- By default, all whole genome(s) included in the input fasta file are considered of interest.
- You can also choose to reject all reads from specific sequences in your fasta file. For this, provide fasta headers of the reference file, e.g.:
--reject_refs 1,2,3,X,Y,MT
- Alternatively, a VCF that matches the input fasta file and contains regions/sites of interest (ROIs) can be provided. Only sites included in the VCF will be considered on-target.
readfish needs a single modification to utilize BOSS-RUNS' dynamically updated decision masks:
In the TOML configuration for readfish, add a single line pointing to the location where BOSS-RUNS will deposit new strategies, e.g. mask = bossruns_name/masks.
This path will follow the pattern bossruns_{condition_name}/masks, where condition_name is the name of the condition in the readfish TOML intended to use BOSS-RUNS.
A configuration for such a condition might look like this:
[conditions.0]
name = "select_c20"  # <- condition name
control = false
min_chunks = 0
max_chunks = 12
targets = ["20"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"
mask = "bossruns_select_c20/masks"  # <- path to dynamic strategies (bossruns_{condition_name}/masks)
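Because the mask path has to follow the pattern bossruns_{condition_name}/masks, a mismatch between `name` and `mask` is an easy mistake to make. The following is a minimal sketch of a consistency check, assuming the TOML has already been parsed into a dictionary (for example with `toml.load`). The helper names are hypothetical and not part of readfish or BOSS-RUNS.

```python
def expected_mask_path(condition_name: str) -> str:
    """Build the mask path from the condition name, following the pattern above."""
    return f"bossruns_{condition_name}/masks"


def conditions_with_wrong_mask(conditions: dict) -> list:
    """Return names of conditions whose 'mask' does not match their 'name'.

    Non-table entries (such as 'reference') and conditions without a
    'mask' key are skipped.
    """
    wrong = []
    for cond in conditions.values():
        if not isinstance(cond, dict) or "mask" not in cond:
            continue
        if cond["mask"] != expected_mask_path(cond["name"]):
            wrong.append(cond["name"])
    return wrong


# Dictionary as it would look after parsing the TOML shown above
config = {
    "conditions": {
        "reference": "/path/to/reference.mmi",
        "0": {"name": "select_c20", "mask": "bossruns_select_c20/masks"},
    }
}
print(conditions_with_wrong_mask(config["conditions"]))  # []
```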
readfish can then be started with the subcommand boss-runs (instead of targets). E.g. like this:
readfish boss-runs --device DEVICE \
--experiment-name NAME \
--toml TOML \
--log-file LOGFILE
After sequencing has started and readfish is operating, the minimal command to launch BOSS-RUNS is:
./bossruns.py --ref REF --device DEVICE --run_name CONDITION_NAME
where DEVICE needs to be the name of the 'position' on the sequencer (displayed in MinKNOW overview),
and CONDITION_NAME is the same as in the readfish TOML.
BOSS-RUNS will initialise and start to periodically generate new decision strategies from the sequencing reads deposited by the sequencer.
If readfish is configured properly, the strategies will be reloaded automatically.
This triggers a message in readfish's logfile similar to: Reloaded mask dict for FASTA_HEADERS.
After sequencing, BOSS-RUNS needs to be stopped by a keyboard interrupt (Ctrl+C).
Arguments can be specified on the command line, in a parameter file, or as a mixture of both (command-line arguments take priority).
python BOSS-RUNS/bossruns.py --help
usage: bossruns.py [-h] --ref REF [--ref_idx REF_IDX] [--run_name RUN_NAME] [--vcf VCF] [--reject_refs REJECT_REFS] [--ploidy PLOIDY] [--conditions] --device DEVICE [--host HOST] [--port PORT] [--wait WAIT]
[--ckp CKP]
optional arguments:
-h, --help show this help message and exit
--ref REF Path to reference
--ref_idx REF_IDX Optional minimap index of reference
--run_name RUN_NAME Experiment identifier. Must match name of [conditions.X] in readfish toml file
--vcf VCF Path to vcf file for ROIs
--reject_refs REJECT_REFS
reject all reads of some entries in reference file, i.e. chromosomes or species. Multiple headers can be comma separated
--ploidy PLOIDY 1 == haploid, 2 == diploid
--conditions Multiple conditions on a single flowcell, used to assign channels
--device DEVICE Name of device/sequencing position in MinKNOW
--host HOST hostname of sequencing device
--port PORT port of sequencing device
--wait WAIT Period between strategy updates (sec.)
A parameter file can be supplied using @ on the command line. E.g.: python bossruns.py @params.txt with the structure (one space-delimited argument per line):
--argumentX valueX
--argumentY valueY
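Python's argparse supports this kind of parameter file through `fromfile_prefix_chars`, and `convert_arg_line_to_args` can be overridden so that each line holds a space-delimited argument and value. The sketch below illustrates that mechanism only; it is not necessarily how bossruns.py implements it, and the parser class, the reduced argument set and the default of 90 are assumptions for illustration.

```python
import argparse
import os
import tempfile


class ParamFileParser(argparse.ArgumentParser):
    """Parser that reads '--argument value' pairs, one per line, from @files."""

    def convert_arg_line_to_args(self, arg_line):
        # Split each line of the parameter file on whitespace
        return arg_line.split()


parser = ParamFileParser(fromfile_prefix_chars="@")
parser.add_argument("--ref", required=True)
parser.add_argument("--device", required=True)
parser.add_argument("--wait", type=int, default=90)  # illustrative default

with tempfile.TemporaryDirectory() as tmp:
    params = os.path.join(tmp, "params.txt")
    with open(params, "w") as fh:
        fh.write("--ref references.fa\n--device DEVICE\n--wait 60\n")
    # An argument given after the @file overrides the value from the file
    args = parser.parse_args([f"@{params}", "--wait", "30"])

print(args.ref, args.device, args.wait)  # references.fa DEVICE 30
```

In this sketch the override works because argparse keeps the last value it sees for an option, so the command-line value has to come after the @file.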
This walkthrough is adapted from the readfish repository
To test your configuration of readfish and BOSS-RUNS, we recommend first running a playback experiment. Download an open access bulk FAST5 file (note that the file is about 21 GB).
To configure a run for playback, you need to edit a sequencing TOML file located at /opt/ont/minknow/conf/package/sequencing.
- Edit sequencing_MIN106_DNA.toml and under the entry [custom_settings] add:
simulation = "/full/path/to/your_bulk.FAST5"
- and change the parameter break_reads_after_seconds = 1.0 to break_reads_after_seconds = 0.4
- In the MinKNOW GUI select Reload Scripts (vertical dot menu at the Select positions screen when starting a sequencing run).
- Insert a configuration test flowcell into the sequencing device and start a sequencing run (selecting the flow cell type that corresponds to the edited script, i.e. FLO-MIN106).
- The run should start and immediately begin a mux scan. Let it run for a few minutes.
This requires access to a guppy basecall server and a TOML file. Here's an example TOML file for this walkthrough:
[caller_settings]
config_name = "dna_r9.4.1_450bps_hac"
host = "127.0.0.1"
port = 5555
[conditions]
reference = "/path/to/reference.mmi"
[conditions.0]
name = "select_c20"
control = false
min_chunks = 0
max_chunks = 12
targets = ["20"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"
mask = "bossruns_select_c20/masks"
This configures readfish to target all reads from chromosome 20 and to continuously read the dynamically updated decision strategies from BOSS-RUNS (mask = "bossruns_select_c20/masks").
You need to point the reference field to a minimap2 index of the human genome.
Modify the targets field to reflect the naming convention used in your index.
This is the sequence name only, up to but not including any whitespace. e.g. the fasta header >20 human chromosome 20 would become 20.
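The naming rule described above (sequence name only, up to the first whitespace) can be expressed as a small helper. This is a minimal sketch with a hypothetical function name.

```python
def sequence_name(fasta_header: str) -> str:
    """Return the sequence name: the header without '>' up to the first whitespace."""
    text = fasta_header[1:] if fasta_header.startswith(">") else fasta_header
    return text.split()[0]


print(sequence_name(">20 human chromosome 20"))  # 20
```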
readfish can then be launched with
readfish boss-runs --device DEVICE \
--experiment-name "EXPERIMENT_NAME" \
--toml example.toml \
--log-file readfish.log
After readfish is running, you can launch BOSS-RUNS using the same reference file and indicate which chromosomes are not of interest a priori.
./bossruns.py --run_name select_c20 \
--ref /data/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
--ref_idx /data/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.mmi \
--reject_refs 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,X,Y,MT \
--device DEVICE \
--testing
run_name needs to match the condition name in the readfish TOML file.
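Typing the long --reject_refs list by hand is error-prone. A minimal sketch of building it from the reference headers, assuming the sequence names used in this walkthrough (1 to 22, X, Y, MT); the function name is hypothetical.

```python
def reject_refs_argument(headers, keep):
    """Join all reference headers except those in `keep` into a comma-separated string."""
    keep = set(keep)
    return ",".join(h for h in headers if h not in keep)


# Sequence names assumed for the human reference used in this walkthrough
headers = [str(i) for i in range(1, 23)] + ["X", "Y", "MT"]
print(reject_refs_argument(headers, keep=["20"]))
```

The printed string can be passed directly as the value of --reject_refs.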
Let the playback sequencing run for a few minutes.
There are 2 things to verify that the setup works:
readfish is rejecting reads from all chromosomes except chromosome 20. For this, we look at the observed read lengths:
readfish summary example.toml /path/to/sequencing/output/fastq_pass/
Check that the mean read length for the enriched chromosome is larger than for the remaining chromosomes.
(In this example the read lengths of depleted chromosomes are still rather long due to slow base calling)
contig number sum min max std mean median N50
1 1151 1577570 224 197871 7002 1371 581 7096
10 1012 1147739 211 105585 5128 1134 578 1149
11 836 1207274 225 212740 8385 1444 590 7727
12 741 940284 192 81667 5419 1269 569 2143
13 477 412434 182 54642 2694 865 556 898
14 771 988124 176 114908 5616 1282 573 7301
15 540 882612 223 99327 7459 1634 576 8258
16 425 486124 209 107335 5579 1144 533 1341
17 732 1099546 195 219260 9307 1502 614 8725
18 180 310512 228 40407 4565 1725 638 8213
19 592 824374 222 121745 6374 1393 634 5811
2 1477 1572749 190 89358 3780 1065 561 1154
20 50 1163599 227 145149 37887 23272 2214 86773 <---
21 392 424448 199 118785 6053 1083 548 1213
22 178 286534 188 49058 4788 1610 655 9884
3 1198 1638753 201 172141 7473 1368 578 6548
4 1370 1807013 174 160366 6708 1319 582 6917
5 1408 2144345 162 212394 9884 1523 544 8397
6 656 1013424 231 118194 5819 1545 599 7268
7 1026 932717 185 66384 2972 909 563 914
8 906 1133194 210 162732 5930 1251 564 2210
9 1046 1533653 200 248867 9656 1466 552 8059
MT 19 143721 591 16467 6140 7564 6809 13257
X 1515 1490398 199 132776 4652 984 526 1014
Y 9 6531 427 1895 475 726 517 628
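The comparison of mean read lengths can also be scripted. The sketch below assumes the summary is available as whitespace-separated text with the column order shown in the table (contig, number, sum, min, max, std, mean, median, N50); the function names are hypothetical and the sample contains only a few rows copied from the table.

```python
def mean_read_lengths(summary_text: str) -> dict:
    """Map each contig to its mean read length (7th column of the summary)."""
    means = {}
    for line in summary_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 9 or fields[0] == "contig":
            continue  # skip the header and incomplete lines
        means[fields[0]] = float(fields[6])
    return means


def is_enriched(summary_text: str, target: str) -> bool:
    """True if the target contig has a larger mean read length than all others."""
    means = mean_read_lengths(summary_text)
    others = [m for contig, m in means.items() if contig != target]
    return means[target] > max(others)


sample = """\
contig number sum min max std mean median N50
1 1151 1577570 224 197871 7002 1371 581 7096
19 592 824374 222 121745 6374 1393 634 5811
20 50 1163599 227 145149 37887 23272 2214 86773
MT 19 143721 591 16467 6140 7564 6809 13257
"""
print(is_enriched(sample, "20"))  # True
```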
readfish is using dynamically updated decision strategies.
For this, we can simply grep the log file of readfish for all reloading events of updated strategies.
grep "Reloaded" readfish.log
This should produce an output similar to this, with updates every ~90 seconds (by default):
2022-08-19 22:58:12,327 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys(['2', '14', '21', '18', '8', '20', '16', '15', '6', '17', '19', '1', '22', '13', '5', '7', '9', '4', '12', 'Y', '10', '11', '3', 'X'])
2022-08-19 22:59:32,695 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys(['2', '14', '21', '18', '8', '20', '16', '15', '6', '17', '19', '1', '22', '13', '5', '7', '9', '4', '12', 'Y', '10', '11', '3', 'X'])
2022-08-19 23:01:02,290 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys(['2', '14', '21', '18', '8', '20', '16', '15', '6', '17', '19', '1', '22', '13', '5', '7', '9', '4', '12', 'Y', '10', '11', '3', 'X'])
2022-08-19 23:02:31,781 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys(['2', '14', '21', '18', '8', '20', '16', '15', '6', '17', '19', '1', '22', '13', '5', '7', '9', '4', '12', 'Y', '10', '11', '3', 'X'])
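To check the spacing of the updates rather than reading timestamps by eye, the intervals between reload events can be computed from the log. This is a minimal sketch, assuming each log line starts with a timestamp in the format shown above. The function name is hypothetical and the sample lines are abbreviated.

```python
from datetime import datetime


def reload_intervals(log_lines):
    """Return the seconds between consecutive 'Reloaded mask dict' log entries."""
    times = [
        datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
        for line in log_lines
        if "Reloaded mask dict" in line
    ]
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]


# Abbreviated versions of the log lines shown above
log = [
    "2022-08-19 22:58:12,327 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys([...])",
    "2022-08-19 22:59:32,695 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys([...])",
    "2022-08-19 23:01:02,290 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys([...])",
    "2022-08-19 23:02:31,781 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys([...])",
]
print([round(seconds, 3) for seconds in reload_intervals(log)])
```

For the four entries above, the intervals come out at roughly 80 to 90 seconds.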
After testing, remove the simulation = line from the sequencing_MIN106_DNA.toml file and reload scripts in MinKNOW GUI (as above).
Please use the issue tracker in this repository to get in touch!
@article {Weilguny_dynamic_adaptive_sampling,
author = {Weilguny, Lukas and De Maio, Nicola and Munro, Rory and Manser, Charlotte and Birney, Ewan and Loose, Matt and Goldman, Nick},
title = {Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design},
year = {2022},
doi = {10.1101/2020.02.07.938670},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2022/04/04/2020.02.07.938670},
journal = {bioRxiv}
}
Licensed under GPLv3

