Skip to content

Dynamic, adaptive sampling during nanopore sequencing

License

Notifications You must be signed in to change notification settings

kvandeynze/BOSS-RUNS

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

54 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Benefit-Optimising Short-term Strategies for ReadUntil Nanopore Sequencing

BOSS-RUNS allows for dynamic, adaptive sampling during nanopore sequencing. It periodically ingests all newly observed sequencing reads to generate updated decision strategies in order to maximise the information gain during the sequencing experiment. This method is implemented to run alongside readfish, which communicates the rejection signals to the sequencing machine and reloads BOSS-RUNS' decision strategies whenever they are updated.

The method is described in this preprint on bioRxiv

Requirements

Installation

In parallel to BOSS-RUNS, a modified version of readfish needs to be set up, which is available at: readfish for BOSS-RUNS.

To install both readfish and BOSS-RUNS:

  • create a virtual environment

python3 -m venv bossruns

. bossruns/bin/activate

  • install dependencies and readfish

pip install --upgrade pip

pip install git+https://github.com/nanoporetech/read_until_api@v3.0.0

pip install git+https://github.com/LooseLab/readfish@BossRuns/V0.0.2

  • ont_guppy_client_lib needs a version specific to your guppy version. On OSes using apt you can find your guppy version using:

apt list --installed ont-guppy* | tail -n 1 | cut -f2 -d' ' | cut -f1 -d'-' >guppy_version

cat guppy_version

pip install ont_pyguppy_client_lib==$(cat guppy_version)

  • install a few more dependencies for BR and clone the repository

pip3 install natsort scipy numba toml

git clone https://github.com/goldman-gp-ebi/BOSS-RUNS.git

Alternatively, if you already have the correct version of readfish set up, BOSS-RUNS can be run in a separate conda environment. The required dependencies can then be installed using the provided yaml file

conda env create -f environment.yml

To activate the environment:

conda activate bossruns

Usage

BOSS-RUNS can be used in different ways depending on the aim of the sequencing experiment and the initial targets.

Reference fasta sequences are required for all targets. They will be indexed upon launch, or an index can optionally be provided. For creation, you can use:

BOSS_RUNS/scripts/mappy_index_fasta.py references.fa

  • By default, all whole genome(s) included in the input fasta file are considered of interest.

  • You can also choose to reject all reads from specific sequences in your fasta file. For this, provide fasta headers of the reference file, e.g.: --reject_refs 1,2,3,X,Y,MT

  • Alternatively, a VCF that matches the input fasta file and contains regions/sites of interest (ROIs) can be provided. Only sites included in the VCF will be considered on-target.

Starting readfish

readfish needs a single modification to utilize BOSS-RUNS' dynamically updated decision masks: In the TOML configuration for readfish, add a single line pointing to the location where BOSS-RUNS will deposit new strategies, e.g. mask = bossruns_name/masks. This path will follow the pattern bossruns_{condition_name}/masks, where condition_name is the name of the condition in the readfish TOML intended to use BOSS-RUNS. A configuration for such a condition might look like this:

[conditions.0]
name = "select_c20"                     <- condition name 
control = false
min_chunks = 0
max_chunks = 12
targets = ["20"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"
mask = "bossruns_select_c20/masks"      <- path to dynamic strategies (bossruns_{condition_name}/masks)

readfish can then be started with the subcommand boss-runs (instead of targets). E.g. like this:

readfish boss-runs --device DEVICE \
                   --experiment-name NAME \
                   --toml TOML \
                   --log-file LOGFILE

Starting BOSS-RUNS

After sequencing has started and readfish is operating, the minimal command to launch BOSS-RUNS is:

./bossruns.py --ref REF --device DEVICE --run_name CONDITION_NAME

where DEVICE needs to be the name of the 'position' on the sequencer (displayed in MinKNOW overview), and CONDITION_NAME is the same as in the readfish TOML.

BOSS-RUNS will initialise and start to periodically generate new decision strategies from the sequencing reads deposited by the sequencer. If readfish is configured properly, the strategies will be reloaded automatically. This triggers a message in readfish's logfile similar to: Reloaded mask dict for FASTA_HEADERS.

After sequencing, BOSS-RUNS needs to be stopped by a keyboard interrupt (Ctrl+C).

Arguments

Arguments can either be specified on the command line, by providing a parameter file, or a mixture or both (command-line arguments take priority).

python BOSS-RUNS/bossruns.py --help

usage: bossruns.py [-h] --ref REF [--ref_idx REF_IDX] [--run_name RUN_NAME] [--vcf VCF] [--reject_refs REJECT_REFS] [--ploidy PLOIDY] [--conditions] --device DEVICE [--host HOST] [--port PORT] [--wait WAIT]
                   [--ckp CKP]

optional arguments:
  -h, --help            show this help message and exit
  --ref REF             Path to reference
  --ref_idx REF_IDX     Optional minimap index of reference
  --run_name RUN_NAME   Experiment identifier. Must match name of [conditions.X] in readfish toml file
  --vcf VCF             Path to vcf file for ROIs
  --reject_refs REJECT_REFS
                        reject all reads of some entries in reference file, i.e. chromosomes or species. Multiple headers can be comma separated
  --ploidy PLOIDY       1 == haploid, 2 == diploid
  --conditions          Multiple conditions on a single flowcell, used to assign channels
  --device DEVICE       Name of device/sequencing position in MinKNOW
  --host HOST           hostname of sequencing device
  --port PORT           port of sequencing device
  --wait WAIT           Period between strategy updates (sec.)

A parameter file can be supplied using @ on the command line. E.g.: python bossruns.py @params.txt with the structure (one space-delimited argument per line):

--argumentX valueX
--argumentY valueY

Walkthrough for testing

This walkthrough is adapted from the readfish repository

Setting up a playback run

To test your configuration of readfish and BOSS-RUNS we recommend first running a playback experiment. Download an open access bulk FAST5 file. (Attention 21Gb!)

To configure a run for playback, you need to edit a sequencing TOML file located at /opt/ont/minknow/conf/package/sequencing.

  • Edit sequencing_MIN106_DNA.toml and under the entry [custom_settings] add:

simulation = "/full/path/to/your_bulk.FAST5"

  • and set the parameter break_reads_after_seconds = 1.0 to break_reads_after_seconds = 0.4

  • In the MinKNOW GUI select Reload Scripts (vertical dot menu at Select positions screen when starting a sequencing run).

  • Insert a configuration test flowcell into the sequencing device and start a sequencing run (selecting the corresponding flow cell type to the edited script, i.e. FLO-MIN106).

  • The run should start and immediately begin a mux scan. Let it run for a few minutes.

Starting readfish

This requires access to a guppy basecall server and a TOML file. Here's an example TOML file for this walkthrough:

[caller_settings]
config_name = "dna_r9.4.1_450bps_hac"
host = "127.0.0.1"
port = 5555

[conditions]
reference = "/path/to/reference.mmi"

[conditions.0]
name = "select_c20"
control = false
min_chunks = 0
max_chunks = 12
targets = ["20"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"
mask = "bossruns_select_c20/masks"

This configures readfish to target all reads from chromosome 20 and to continuously read the dynamically updated decision strategies from BOSS-RUNS (mask = "bossruns_select_c20/masks").

You simply need to modify the reference field to a minimap2 index of the human genome.

Modify the targets field to reflect the naming convention used in your index. This is the sequence name only, up to but not including any whitespace. e.g. the fasta header >20 human chromosome 20 would become 20.

readfish can then be launched with

readfish boss-runs --device DEVICE \
                   --experiment-name "EXPERIMENT_NAME" \
                   --toml example.toml \
                   --log-file readfish.log

Starting BOSS-RUNS

After readfish is running, you can launch BOSS-RUNS using the same reference file and indicate which chromosomes are not of interest a priori.

./bossruns.py --run_name select_c20 \
              --ref /data/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
              --ref_idx /data/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.mmi \
              --reject_refs 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,21,22,X,Y,MT \
              --device DEVICE \
              --testing

run_name needs to match the condition name in the readfish TOML file.

Let the playback sequencing run for a few minutes.

Check that it's working

There are 2 things to verify that the setup works:

  1. readfish is rejecting reads from all chromosomes, except for #20. For this, we look at the observed read lengths:

readfish summary example.toml /path/to/sequencing/output/fastq_pass/

Check that the mean read length for the enriched chromosome is larger than for the remaining chromosomes.
(In this example the read lengths of depleted chromosomes are still rather long due to slow base calling)

contig  number      sum  min     max    std   mean  median    N50
     1    1151  1577570  224  197871   7002   1371     581   7096
    10    1012  1147739  211  105585   5128   1134     578   1149
    11     836  1207274  225  212740   8385   1444     590   7727
    12     741   940284  192   81667   5419   1269     569   2143
    13     477   412434  182   54642   2694    865     556    898
    14     771   988124  176  114908   5616   1282     573   7301
    15     540   882612  223   99327   7459   1634     576   8258
    16     425   486124  209  107335   5579   1144     533   1341
    17     732  1099546  195  219260   9307   1502     614   8725
    18     180   310512  228   40407   4565   1725     638   8213
    19     592   824374  222  121745   6374   1393     634   5811
     2    1477  1572749  190   89358   3780   1065     561   1154
    20      50  1163599  227  145149  37887  23272    2214  86773   <---
    21     392   424448  199  118785   6053   1083     548   1213
    22     178   286534  188   49058   4788   1610     655   9884
     3    1198  1638753  201  172141   7473   1368     578   6548
     4    1370  1807013  174  160366   6708   1319     582   6917
     5    1408  2144345  162  212394   9884   1523     544   8397
     6     656  1013424  231  118194   5819   1545     599   7268
     7    1026   932717  185   66384   2972    909     563    914
     8     906  1133194  210  162732   5930   1251     564   2210
     9    1046  1533653  200  248867   9656   1466     552   8059
    MT      19   143721  591   16467   6140   7564    6809  13257
     X    1515  1490398  199  132776   4652    984     526   1014
     Y       9     6531  427    1895    475    726     517    628

  1. readfish is using dynamically updated decision strategies

for this, we can simply grep the log-file of readfish for all reloading events of updated strategies.

grep "Reloaded" readfish.log

This should produce an output similar to this, with updates every ~90 seconds (by default):

2022-08-19 22:58:12,327 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys(['2', '14', '21', '18', '8', '20', '16', '15', '6', '17', '19', '1', '22', '13', '5', '7', '9', '4', '12', 'Y', '10', '11', '3', 'X'])
2022-08-19 22:59:32,695 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys(['2', '14', '21', '18', '8', '20', '16', '15', '6', '17', '19', '1', '22', '13', '5', '7', '9', '4', '12', 'Y', '10', '11', '3', 'X'])
2022-08-19 23:01:02,290 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys(['2', '14', '21', '18', '8', '20', '16', '15', '6', '17', '19', '1', '22', '13', '5', '7', '9', '4', '12', 'Y', '10', '11', '3', 'X'])
2022-08-19 23:02:31,781 ru.ru_gen_boss_runs Reloaded mask dict for dict_keys(['2', '14', '21', '18', '8', '20', '16', '15', '6', '17', '19', '1', '22', '13', '5', '7', '9', '4', '12', 'Y', '10', '11', '3', 'X'])

Deactivating playback behaviour

After testing, remove the simulation = line from the sequencing_MIN106_DNA.toml file and reload scripts in MinKNOW GUI (as above).

Issues, questions, suggestions ...

Please use the issue tracker in this repository to get in touch!

Citation

@article {Weilguny_dynamic_adaptive_sampling,
	author = {Weilguny, Lukas and De Maio, Nicola and Munro, Rory and Manser, Charlotte and Birney, Ewan and Loose, Matt and Goldman, Nick},
	title = {Dynamic, adaptive sampling during nanopore sequencing using Bayesian experimental design},
	year = {2022},
	doi = {10.1101/2020.02.07.938670},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2022/04/04/2020.02.07.938670},
	journal = {bioRxiv}
}

License

Licensed under GPLv3

About

Dynamic, adaptive sampling during nanopore sequencing

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%