The code constructed from a regex to korotkov automata. A korotkov automat is an nfa with Q-gram/K-mer transitions. It is based on this lecture https://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf
A matrix is generated from the automata, which can be used for pattern matching.
The regex need to be written in reverse polish notation and supports the following operations:
- "|" - or
- "*" - kleene star
- "+" - min. 1
- "?" - 0 or 1
- "." - concatination
example:
a(b^+|c^+)d => ab+c+|.d.
The Project has the following directory layout:
Bachelor
|-picture
|-Software
|- source
|- build
|- seqan
For the setup got to the build directory and run:
$ cmake -DCMAKE_BUILD_TYPE=Release ../source
$ make
text in brackets is the explanatory
$./korotkov "ab+c+|.d."(regex) "3"(length of q-gram) "Matrix"(optianl .txt file) "Automat"(optional .dot file)
the output is on now a little bit experimentel
output:
acd 11 [1,0,0,0,0] -- q-Gram / hashnr. / bitvec
abd 7 [0,1,0,0,0]
bbd 23 [0,0,1,0,0]
bbb 21 [0,0,0,1,0]
abb 5 [0,0,0,0,1]
1 2 3 -- row of the matrix that is necessary (need to be fixed!)
a b c d -- alphabet
abb bbb bbd 0 0 1 1 1 --set of q-Grams / set as bitvector
abb bbd 0 0 1 0 1
abd 0 1 0 0 0
acd 1 0 0 0 0
the .dot file can be transformed int an png file with the following command.
$ dot -Tpng out.dot > out.png