The parameter file

ryoga-hmdlab edited this page Aug 23, 2020 · 14 revisions

RaptRanker reads its input parameters from a JSON file. A template file is available here.

The basic structure of the JSON file is shown below. Note that the braces { }, brackets [ ], commas ,, colons : and double quotes " " are all required by the JSON format.

{
  "key":value,
  "key":"text",
  "key_as_list":[
    {
     "key_1":value,
     "key_1":"text"
    },
    {
     "key_2":value,
     "key_2":"text"
    }
  ]
}
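One way to catch a missing comma or quote before running RaptRanker is to parse the file with a standard JSON parser first. The following is a minimal sketch using Python's built-in json module. The helper name check_json_text and the two sample strings are illustrative only and are not part of RaptRanker.

```python
import json

def check_json_text(text):
    """Return (True, parsed_object) if text is valid JSON, else (False, error message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        # err.lineno and err.colno point at the position where parsing failed
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"

good = '{"file_type": 2, "input_file_nums": 1}'
bad = '{"file_type": 2 "input_file_nums": 1}'  # the comma between the pairs is missing

ok, result = check_json_text(good)
print(ok, result)

ok_bad, message = check_json_text(bad)
print(ok_bad, message)
```

For a real parameter file, the text would be read from disk first (for example with open(path).read()) and passed to the same helper.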

Parameters for I/O

file_type

Sets the input format: FASTA (1) or FASTQ (2). For example, to input .fastq files, write "file_type":2,. Note that in both cases the input files must not contain any base other than A, C, G and T/U (for example, N is not allowed). The input FASTA/FASTQ files are expected to be quality filtered already.
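The base restriction above can be checked before running RaptRanker. The following is a minimal sketch for FASTA-formatted input; the function names are hypothetical, and the check assumes uppercase letters only, since the text does not say how lowercase bases are handled.

```python
import re

# Only A, C, G and T/U are accepted according to the text above.
# Assumption: uppercase only; case handling is not specified.
ALLOWED = re.compile(r"^[ACGTU]+$")

def fasta_sequences(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def invalid_records(lines):
    """Return the headers of records that contain anything other than A, C, G, T, U."""
    return [h for h, seq in fasta_sequences(lines) if not ALLOWED.match(seq)]

example = [">seq1", "ACGTACGT", ">seq2", "ACGNACGT", ">seq3", "ACGUACGU"]
print(invalid_records(example))  # → ['seq2']
```

Records reported by invalid_records would need to be removed from the input file before it is listed in the parameter file.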

input_file_nums

The number of input files.

input_file_list

The information about each input file. round_id, round_name and file_path are required for each input file.

  • round_id is the numeric ID of the input file. We recommend using the SELEX round number.
  • round_name is the text label of the input file. This text is used in the score names (e.g. "round_name":"foo1R" -> Frequency_foo1R, Enrichment_foo1R, ...).
  • file_path is the full path to the input file.

The following is an example with three FASTQ files as input (5R to 7R).

"file_type":2,
"input_file_nums":3,
"input_file_list":[
    {
      "round_id":5,
      "round_name":"Round5",
      "file_path":"/path/to/5R.fastq"
    },
    {
      "round_id":6,
      "round_name":"Round6",
      "file_path":"/path/to/6R.fastq"
    }, 
    {
      "round_id":7,
      "round_name":"Round7",
      "file_path":"/path/to/7R.fastq"
    }
  ],
...

Please note that a "," is required after "}" when another input file follows, and that no "," is allowed between "}" and "]" (that is, after the last input file).
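Generating this part of the file programmatically avoids comma mistakes and keeps input_file_nums in step with the list. The following is a minimal sketch that rebuilds the example above with Python's json module; the helper name build_input_section is hypothetical.

```python
import json

def build_input_section(rounds, file_type=2):
    """Build the I/O keys from a list of (round_id, round_name, file_path) tuples."""
    input_file_list = [
        {"round_id": round_id, "round_name": round_name, "file_path": file_path}
        for round_id, round_name, file_path in rounds
    ]
    return {
        "file_type": file_type,                    # 1 = FASTA, 2 = FASTQ
        "input_file_nums": len(input_file_list),   # always matches the list length
        "input_file_list": input_file_list,
    }

section = build_input_section([
    (5, "Round5", "/path/to/5R.fastq"),
    (6, "Round6", "/path/to/6R.fastq"),
    (7, "Round7", "/path/to/7R.fastq"),
])

# json.dumps places the commas and quotes correctly
print(json.dumps(section, indent=2))
```

The resulting dictionary holds only the I/O keys; the remaining parameters described on this page would have to be added before writing the full file.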

experiment_dbfile, analysis_output_path, analysis_dbfile

The experiment_dbfile is one of RaptRanker's outputs. It is a sqlite3 database file that records all unique sequences together with their secondary structures and their Frequency and Enrichment scores. These records stay identical unless the filtering parameters are changed, so the file can be thought of as the "output for one input SELEX experiment". Please set this parameter in the form /full/path/to/filename.sqlite3.

The analysis_dbfile is another of RaptRanker's outputs. It is a sqlite3 database file that records all subsequences, the clustering results, and so on. These records stay identical unless the clustering parameters are changed, so the file can be thought of as the "output for one RaptRanker clustering". Please set analysis_dbfile in the form /full/path/to/output/filename.sqlite3.

The analysis_output_path is the output directory for intermediate files and the score CSV. Please set analysis_output_path in the form /full/path/to/output/. This parameter must end with "/". We recommend using the same directory for analysis_output_path and for analysis_dbfile.

For example, once you have run an analysis in RaptRanker and want to see the results with different clustering parameters (window size, threshold, etc.), you should change only analysis_dbfile and analysis_output_path. In this case, RaptRanker reuses the previous secondary structure prediction results, which reduces the run time. Likewise, if you add a new round, you do not have to change experiment_dbfile: RaptRanker reads in the new sequences and analyzes them additionally. (Which sequences are new is determined from the input_file_list information.) Within the same SELEX analysis, you need to change experiment_dbfile only when you change the filtering parameters.
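The reuse pattern described above can be sketched as follows: one experiment_dbfile is shared, while each clustering run gets its own analysis_output_path and analysis_dbfile. The helper name output_settings and all directory and file names are hypothetical placeholders.

```python
def output_settings(experiment_dbfile, output_dir, analysis_name):
    """Return the three output-related keys for one clustering run."""
    # analysis_output_path must end with "/"
    if not output_dir.endswith("/"):
        output_dir += "/"
    return {
        "experiment_dbfile": experiment_dbfile,
        "analysis_output_path": output_dir,
        # Recommended: keep analysis_dbfile in the same directory as the other outputs
        "analysis_dbfile": output_dir + analysis_name + ".sqlite3",
    }

# First run
first = output_settings(
    "/full/path/to/experiment.sqlite3", "/full/path/to/output_a", "analysis_a"
)

# Second run with different clustering parameters:
# same experiment_dbfile, new analysis_output_path and analysis_dbfile
second = output_settings(
    first["experiment_dbfile"], "/full/path/to/output_b/", "analysis_b"
)

print(first)
print(second)
```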

Parameters for filtering

forward_primer,reverse_primer

The nucleotide sequences used for filtering. RaptRanker extracts only the sequences in which both fixed regions match these parameters. In some cases, the forward fixed sequence includes a T7 promoter, barcode sequences, and so on. If they are present in the reads, please include them in forward_primer. Please note that these parameters describe the primer-binding regions as they appear in the sequenced reads, not the primer sequences themselves.

add_forward_primer,add_reverse_primer

The nucleotide sequences used for secondary structure prediction. Please note that these, too, are the primer-binding regions as they appear in the sequenced reads, not the primer sequences themselves.

For example, if the template sequence is TAATACGACTCACTATA-GGGAGCAGGAGAGAGGTCAGATG-30N-CCTATGCGTGCTAGTGTGA (T7 promoter - forward primer binding region - random region - reverse primer binding region), the parameters should be set as follows.

"forward_primer":"TAATACGACTCACTATAGGGAGCAGGAGAGAGGTCAGATG",
"reverse_primer":"CCTATGCGTGCTAGTGTGA",
"add_forward_primer":"GGGAGCAGGAGAGAGGTCAGATG",
"add_reverse_primer":"CCTATGCGTGCTAGTGTGA",
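To illustrate how the four values relate to a read, the following is a rough sketch that is not RaptRanker's actual implementation. It assumes that filtering means an exact match of forward_primer at the start of the read and of reverse_primer at the end, and that the add_* sequences are placed around the random region for structure prediction. Both points are assumptions made for illustration, as are the function name and the test read.

```python
def extract_random_region(read, forward_primer, reverse_primer):
    """Return the region between the fixed regions, or None if either does not match.

    Assumption: exact prefix/suffix matching, used here for illustration only.
    """
    if len(read) < len(forward_primer) + len(reverse_primer):
        return None
    if not (read.startswith(forward_primer) and read.endswith(reverse_primer)):
        return None
    return read[len(forward_primer):len(read) - len(reverse_primer)]

forward_primer = "TAATACGACTCACTATAGGGAGCAGGAGAGAGGTCAGATG"
reverse_primer = "CCTATGCGTGCTAGTGTGA"
add_forward_primer = "GGGAGCAGGAGAGAGGTCAGATG"
add_reverse_primer = "CCTATGCGTGCTAGTGTGA"

# Made-up 30 nt random region, for demonstration only
random_region = "ACGT" * 7 + "AC"
read = forward_primer + random_region + reverse_primer

extracted = extract_random_region(read, forward_primer, reverse_primer)

# Assumption: the add_* sequences flank the random region during structure prediction
folding_input = add_forward_primer + extracted + add_reverse_primer
print(extracted)
print(folding_input)
```

In this example add_forward_primer is forward_primer without the T7 promoter, so the promoter takes part in filtering but not in the sequence built for structure prediction.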

sequence_maximum_length,sequence_minimum_length

The upper and lower limits on the length of the random region. RaptRanker extracts only the random regions whose length L satisfies sequence_minimum_length ≤ L ≤ sequence_maximum_length. We usually use ±5 nt around the designed length (e.g. 30N -> "sequence_maximum_length":35, and "sequence_minimum_length":25,).
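The ±5 nt rule of thumb and the inclusive length test can be written out as a small sketch. The helper names and the margin argument are hypothetical; only the two parameter names come from RaptRanker.

```python
def length_limits(design_length, margin=5):
    """Return both length parameters for a random region of the designed length."""
    return {
        "sequence_maximum_length": design_length + margin,
        "sequence_minimum_length": design_length - margin,
    }

def passes_length_filter(random_region, limits):
    """Inclusive check: minimum <= L <= maximum."""
    length = len(random_region)
    return limits["sequence_minimum_length"] <= length <= limits["sequence_maximum_length"]

limits = length_limits(30)
print(limits)  # → {'sequence_maximum_length': 35, 'sequence_minimum_length': 25}
print(passes_length_filter("A" * 25, limits))  # → True
print(passes_length_filter("A" * 36, limits))  # → False
```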

Parameters for clustering

These parameters affect clustering and the AME calculations. If you are not sure, we recommend leaving them at their default values.

wide_length

This is the length of subsequences. Default is 10.

nucleotide_weight

This parameter represents the weight of the sequence information in clustering. If this value is large, the sequence information is emphasized more than the secondary structure information, and therefore a stricter match between the nucleotide sequences of entries is required for them to be clustered together. Default is 0.5.

cosine_distance

This is a parameter for SketchSort; an upper limit of cosine distance between vectors. Default is 0.001.

missing_ratio

This is a parameter for SketchSort; an upper limit of the expectation value for false negatives. Default is 0.00001.
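The four defaults listed in this section can be collected in one place and overridden selectively. The following is a minimal sketch; the helper name clustering_settings is hypothetical, and whether RaptRanker requires every key to be present in the parameter file is not stated here, so the sketch simply writes all four.

```python
import json

# Default values as documented in this section
CLUSTERING_DEFAULTS = {
    "wide_length": 10,
    "nucleotide_weight": 0.5,
    "cosine_distance": 0.001,
    "missing_ratio": 0.00001,
}

def clustering_settings(**overrides):
    """Return the clustering keys, starting from the defaults."""
    unknown = set(overrides) - set(CLUSTERING_DEFAULTS)
    if unknown:
        # Guards against typos in key names
        raise KeyError(f"unknown clustering parameter(s): {sorted(unknown)}")
    return {**CLUSTERING_DEFAULTS, **overrides}

default_run = clustering_settings()
longer_window = clustering_settings(wide_length=12)  # example override, not a recommendation

print(json.dumps(default_run, indent=2))
print(json.dumps(longer_window, indent=2))
```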

Extra parameters

These parameters are not required. Use as needed.

exKmer

This parameter is obsolete. Please use the parameters listed below.

calcAKFandAKE, calcBMFandBME, calcBKFandBKE

Each of these parameters takes a bool value (true or false). If a parameter is true, RaptRanker calculates the corresponding scores. Default is false.

add_binding, binding_file_path

The add_binding parameter takes a bool value (true or false). If it is true, RaptRanker reads binding flags from binding_file_path. The binding file is a CSV file that contains a sequence (random region only) and a flag (TRUE: 1 or FALSE: 0) on each line, as follows:

ATCAGTCAGTTGCA,0
ACTGATCGCACACA,1
ACAGTCAAAACACC,1
...
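A file in this layout can be produced with Python's csv module. The following is a minimal sketch that reproduces the three example rows; the function name is hypothetical, and the choice of no header row and "\n" line endings is an assumption based on the example above.

```python
import csv
import io

def write_binding_csv(handle, flags):
    """Write one 'sequence,flag' row per entry; flag is 1 for binders and 0 otherwise."""
    # Assumption: no header row and "\n" line endings, as in the example above
    writer = csv.writer(handle, lineterminator="\n")
    for sequence, is_binder in flags.items():
        writer.writerow([sequence, 1 if is_binder else 0])

flags = {
    "ATCAGTCAGTTGCA": False,
    "ACTGATCGCACACA": True,
    "ACAGTCAAAACACC": True,
}

# An in-memory buffer is used for demonstration; a real file handle
# opened with open(path, "w", newline="") would work the same way.
buffer = io.StringIO()
write_binding_csv(buffer, flags)
print(buffer.getvalue())
```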