Skip to content

vmusch/Bachelor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

53 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Bachelor

The code constructed from a regex to korotkov automata. A korotkov automat is an nfa with Q-gram/K-mer transitions. It is based on this lecture https://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf

A matrix is generated from the automata, which can be used for pattern matching. The regex need to be written in reverse polish notation and supports the following operations:

  1. "|" - or
  2. "*" - kleene star
  3. "+" - min. 1
  4. "?" - 0 or 1
  5. "." - concatination

example:
a(b^+|c^+)d => ab+c+|.d.

The Project has the following directory layout:
Bachelor
|-picture
|-Software

|- source
|- build
|- seqan

For the setup got to the build directory and run:
$ cmake -DCMAKE_BUILD_TYPE=Release ../source
$ make

text in brackets is the explanatory
$./korotkov "ab+c+|.d."(regex) "3"(length of q-gram) "Matrix"(optianl .txt file) "Automat"(optional .dot file)

the output is on now a little bit experimentel
output:
acd 11 [1,0,0,0,0] -- q-Gram / hashnr. / bitvec
abd 7 [0,1,0,0,0]
bbd 23 [0,0,1,0,0]
bbb 21 [0,0,0,1,0]
abb 5 [0,0,0,0,1]
1 2 3 -- row of the matrix that is necessary (need to be fixed!)
a b c d -- alphabet
abb bbb bbd 0 0 1 1 1 --set of q-Grams / set as bitvector
abb bbd 0 0 1 0 1
abd 0 1 0 0 0
acd 1 0 0 0 0

the .dot file can be transformed int an png file with the following command.
$ dot -Tpng out.dot > out.png

About

All stuff of my Bachelor Thesis

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published