ParserSQL Technical Explanation
This is a technical explanation of how the parser works. An explanation of the parser's usage can be seen here.
The parsing process has three major steps: scanning, parsing, and interpreting.
Most of the code extracts shown in this section are simplified for brevity. The complete code for the scanner can be found here.
The purpose of the scanner is to transform the entire input string into recognizable tokens. As such, we define the tokens via the Token class:
class Token:
class Type:
(
LPAR, RPAR, SELECT, FROM, WHERE, INSERT, INTO, VALUES, UPDATE, SET, DELETE,
CREATE, TABLE, DROP, AND, OR, NOT, AS, ORDER, BY, LIMIT, ID, STAR, BETWEEN,
EQ, NEQ, LT, GT, LE, GE, COMMA, DOT, SEMICOLON, NUMVAL, FLOATVAL, STRINGVAL,
BOOLVAL, PRIMARY, KEY, DATATYPE, INDEX, ON, USING, INDEXTYPE, ERR, END,
WITHIN, RECTANGLE, CIRCLE, KNN, ASC, DESC, IF, EXISTS
) = range(54)
token_names = [
"LPAR", "RPAR", "SELECT", "FROM", # etc.
]
def __init__(self, token_type, lexema=""):
self.type = token_type
self.lexema = lexema
The full description of each token can be seen here.
The Scanner class is created as follows:
class Scanner:
def __init__(self, input):
self.input = input + '\0'
self.first = 0
self.current = 0
self.line = 1
self.pos = 1
def start_lexema(self) -> None:
self.first = self.current
def get_lexema(self) -> str:
return self.input[self.first:self.current]
A lexema is the string that represents the current token's value. For example, an LPAR's lexema is '(', a SELECT's lexema is 'SELECT', and a NUMVAL's lexema might be '5'.
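As a minimal, self-contained sketch of this mechanism (MiniScanner is a stand-in for illustration, not the project's actual Scanner class), slicing the input between first and current yields the lexema:

```python
# Minimal sketch of the lexema mechanism; MiniScanner is a stand-in
# for illustration, not the project's actual Scanner.
class MiniScanner:
    def __init__(self, input):
        self.input = input + '\0'   # sentinel marks the end of input
        self.first = 0              # start of the current lexema
        self.current = 0            # position being scanned

    def start_lexema(self):
        self.first = self.current

    def get_lexema(self):
        return self.input[self.first:self.current]

s = MiniScanner("SELECT *")
s.start_lexema()
while s.input[s.current].isalpha():  # consume one keyword-like run
    s.current += 1
print(s.get_lexema())  # SELECT
```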
We then define a method to get the next token in the input, built around a DFA. The purpose of the DFA is to recognize which type of token the current lexema is. A token may end on a space, a new line, or any character that generally represents whitespace (as defined by Python's isspace() method). If the token is recognized as an ID, it ends on the first non-alphanumeric character. If it's recognized as a number, it ends on the first non-digit character (unless that character is a dot .). If it's recognized as a string, it ends on the first single quote (').
The method and some states are shown here:
def next_token(self) -> Token:
state = 0
self.start_lexema()
c = self.input[self.current]
while True:
if state == 0:
c = self.input[self.current]
if c.isspace():
if c == '\n':
self.line += 1
self.pos = 1
self.current += 1
self.pos += 1
self.start_lexema()
state = 0
elif c == '\0':
return Token(Token.Type.END)
elif c == '*':
self.current += 1
self.pos += 1
return Token(Token.Type.STAR)
elif c == ';':
self.current += 1
self.pos += 1
return Token(Token.Type.SEMICOLON)
# etc.
elif c.isdigit():
state = 1
elif c == "'":
state = 3
elif c.isalpha():
state = 4
else:
return Token(Token.Type.ERR)
For single characters like ; or *, we simply return the corresponding tokens. For other cases, we go to a new state: state 1 to tokenize a number, state 3 to tokenize a string, and state 4 to tokenize an ID or keyword.
States 1 and 2 are used to tokenize numbers, and can be seen here:
elif state == 1:
self.current += 1
self.pos += 1
c = self.input[self.current]
if c.isdigit():
pass
elif c == '.':
state = 2
else:
return Token(Token.Type.NUMVAL, self.get_lexema())
elif state == 2:
self.current += 1
self.pos += 1
c = self.input[self.current]
if not c.isdigit():
return Token(Token.Type.FLOATVAL, self.get_lexema())
In state 1, we advance the current position until we see a non-digit character, in which case we return a NUMVAL token. If at any point we find a dot (.), it means we are facing a float, so we go to state 2 and continue advancing until we see a non-digit character, in which case we return a FLOATVAL token.
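The two-state loop can be condensed into a standalone function. This is a hedged sketch (lex_number is a hypothetical helper written for this explanation, not part of the scanner):

```python
def lex_number(text, i):
    # Sketch of states 1 and 2: consume digits, and on a dot switch to
    # consuming the fractional digits. Hypothetical helper for illustration.
    start = i
    while i < len(text) and text[i].isdigit():
        i += 1
    if i < len(text) and text[i] == '.':          # state 1 -> state 2
        i += 1
        while i < len(text) and text[i].isdigit():
            i += 1
        return text[start:i], "FLOATVAL", i
    return text[start:i], "NUMVAL", i

print(lex_number("3.14;", 0))  # ('3.14', 'FLOATVAL', 4)
print(lex_number("42;", 0))    # ('42', 'NUMVAL', 2)
```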
State 3 is used to tokenize strings, and can be seen here:
elif state == 3:
self.current += 1
self.pos += 1
c = self.input[self.current]
if c == "'":
self.current += 1
self.pos += 1
return Token(Token.Type.STRINGVAL, self.get_lexema()[1:-1])
elif c == '\0':
return Token(Token.Type.ERR)
In state 3, we advance the current position until we find a closing single quote ('), in which case we return a STRINGVAL whose lexema doesn't include the single quotes on both ends.
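The same logic as a standalone sketch (lex_string is a hypothetical helper, not part of the scanner): scan forward to the closing quote, failing on end of input.

```python
def lex_string(text, i):
    # Sketch of state 3: i points at the opening quote; scan forward to
    # the closing quote. Hypothetical helper for illustration.
    j = i + 1
    while j < len(text) and text[j] != "'":
        j += 1
    if j == len(text):
        return None          # unterminated string -> ERR in the real scanner
    return text[i + 1:j], j + 1  # value without quotes, next position

print(lex_string("'hello' rest", 0))  # ('hello', 7)
print(lex_string("'oops", 0))         # None
```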
States 4 and 5 are used to tokenize IDs and keywords, and can be seen here:
elif state == 4:
self.current += 1
self.pos += 1
c = self.input[self.current]
if not (c.isalnum() or c in ["_"]):
state = 5
elif state == 5:
lexema = self.get_lexema().upper()
keywords = {
"SELECT": Token.Type.SELECT,
"FROM": Token.Type.FROM,
"WHERE": Token.Type.WHERE,
"INSERT": Token.Type.INSERT,
"INTO": Token.Type.INTO,
# etc.
}
if lexema in keywords:
return Token(keywords[lexema], lexema if keywords[lexema] in [Token.Type.BOOLVAL, Token.Type.INDEXTYPE, Token.Type.DATATYPE] else "")
else:
return Token(Token.Type.ID, self.get_lexema())
In state 4, we advance the current position until we find a character that's neither alphanumeric nor an underscore (_), in which case we go to state 5. In state 5, we differentiate between IDs and keywords. If the lexema matches one of the reserved keywords, we return the corresponding token (if it's a BOOLVAL, INDEXTYPE or DATATYPE, we also return the lexema); otherwise, we return it as an ID.
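The keyword-versus-ID decision in state 5 boils down to a case-insensitive lookup. A minimal sketch with an abridged keyword set (classify is a hypothetical helper for illustration):

```python
KEYWORDS = {"SELECT", "FROM", "WHERE", "INSERT", "INTO"}  # abridged

def classify(lexema):
    # Sketch of state 5: keywords match case-insensitively,
    # anything else is an identifier.
    upper = lexema.upper()
    return upper if upper in KEYWORDS else "ID"

print(classify("select"))  # SELECT
print(classify("users"))   # ID
```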
Most of the code extracts shown in this section are simplified for brevity. The complete code for the parser can be found here.
The purpose of the parser is to generate an abstract syntax tree (AST) that can then be traversed and executed instruction by instruction. To generate this tree, we use the recursive descent method.
The parsing step uses a context-free grammar (CFG) to accept or reject an input, and can be found here.
Our objective now is to follow the grammar and construct the related tree and objects.
We first create classes for each instruction and other necessary parts, that will contain every single attribute an instruction needs to be executed.
For example, the class related to a SELECT statement can be seen here:
class SelectStmt(Stmt):
def __init__(self, table_name : str = None, condition : Condition = None, all : bool = False, column_list : list[str] = None, order_by : str = None, asc : bool = True, limit : int = None):
super().__init__()
self.table_name = table_name
self.condition = condition
self.all = all
self.column_list = column_list if column_list else []
self.order_by = order_by
self.asc = asc
self.limit = limit
def add_column(self, column_name : str) -> None:
self.column_list.append(column_name)
So, when parsing a SELECT statement, what we have to do is follow the grammar, build the SelectStmt object and put it in the tree.
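To make the target of the parse concrete, here is what such an object would look like for an input like SELECT name, age FROM users LIMIT 10, using a trimmed copy of the class for illustration:

```python
class SelectStmt:  # trimmed copy of the class above, for illustration
    def __init__(self):
        self.table_name = None
        self.condition = None
        self.all = False
        self.column_list = []
        self.order_by = None
        self.asc = True
        self.limit = None

    def add_column(self, column_name):
        self.column_list.append(column_name)

# What parsing "SELECT name, age FROM users LIMIT 10" would produce:
stmt = SelectStmt()
stmt.add_column("name")
stmt.add_column("age")
stmt.table_name = "users"
stmt.limit = 10
print(stmt.column_list, stmt.table_name, stmt.limit)  # ['name', 'age'] users 10
```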
The root node in the tree is an object that will contain a list of statements, defined as follows:
class SQL:
def __init__(self, stmt_list : list[Stmt] = None):
self.stmt_list = stmt_list if stmt_list else []
def add_stmt(self, stmt : Stmt) -> None:
self.stmt_list.append(stmt)
Now we create our parser, which will read tokens and build the AST. The parser is first defined as follows:
class Parser:
def __init__(self, scanner : Scanner):
self.scanner = scanner
self.current : Token = None
self.previous : Token = None
def error(self, error : str):
raise ParseError(error, self.scanner.line, self.scanner.pos, self.current)
def match(self, type : Token.Type) -> bool:
if self.check(type):
self.advance()
return True
else:
return False
def check(self, type : Token.Type) -> bool:
if self.is_at_end():
return False
else:
return self.current.type == type
def advance(self) -> None:
if not self.is_at_end():
temp = self.current
self.current = self.scanner.next_token()
self.previous = temp
if self.check(Token.Type.ERR):
self.error(f"unrecognized character: {self.current.lexema}")
def is_at_end(self) -> bool:
return self.current.type == Token.Type.END
A parser receives a scanner, which will be used to get the next token as seen previously. The parser also provides some useful methods. match(type) receives a token type and checks whether the current token is of that type; if so, it advances to the next token and returns True. Advancing means calling the next_token() method of the scanner and storing the result.
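The helper trio can be exercised in isolation. In this sketch, MiniParser is a hypothetical stand-in that reads from a plain token list instead of a Scanner:

```python
class MiniParser:
    # Stand-in for illustration: tokens come from a list rather than a Scanner.
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0
        self.previous = None

    def check(self, kind):
        return self.tokens[self.i] == kind

    def match(self, kind):
        if self.check(kind):
            self.previous = self.tokens[self.i]  # advance, remembering the token
            self.i += 1
            return True
        return False

p = MiniParser(["SELECT", "STAR", "FROM", "ID", "END"])
print(p.match("FROM"))    # False: the current token is SELECT, nothing consumed
print(p.match("SELECT"))  # True: consumed, parser advances to STAR
print(p.previous)         # SELECT
```

A failed match consumes nothing, which is what lets the parser try several alternatives in sequence.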
Now we can begin parsing. To parse the input, we first need to parse the SQL object defined previously:
def parse(self) -> SQL:
try:
self.current = self.scanner.next_token()
return self.parse_sql()
except ParseError as e:
raise e
def parse_sql(self) -> SQL:
sql = SQL()
sql.add_stmt(self.parse_stmt())
while(self.match(Token.Type.SEMICOLON) and self.current.type != Token.Type.END):
sql.add_stmt(self.parse_stmt())
if self.current.type != Token.Type.END:
self.error("unexpected items after statement")
return sql
When parsing the SQL object, we first parse a statement; then, while we match semicolon tokens and haven't yet reached the end of the input, we parse another statement. This follows this part of the grammar:
<sql> ::= <statement_list>
<statement_list> ::= <statement> ";" { <statement> ";" }
<statement> ::= <select-stmt>
| <create-table-stmt>
| <drop-table-stmt>
| <insert-stmt>
| <delete-stmt>
| <create-index-stmt>
| <drop-index-stmt>
Parsing a statement depends on its type, which can be determined by matching the related keywords of each statement:
def parse_stmt(self) -> Stmt:
if self.match(Token.Type.SELECT):
return self.parse_select_stmt()
elif self.match(Token.Type.CREATE):
if self.match(Token.Type.TABLE):
return self.parse_create_table_stmt()
elif self.match(Token.Type.INDEX):
return self.parse_create_index_stmt()
else:
self.error("expected TABLE or INDEX keyword after CREATE keyword")
elif self.match(Token.Type.DROP):
if self.match(Token.Type.TABLE):
return self.parse_drop_table_stmt()
elif self.match(Token.Type.INDEX):
return self.parse_drop_index_stmt()
else:
self.error("expected TABLE or INDEX keyword after DROP keyword")
elif self.match(Token.Type.INSERT):
return self.parse_insert_stmt()
elif self.match(Token.Type.DELETE):
return self.parse_delete_stmt()
else:
self.error("unexpected start of an instruction")
For example, if we match a SELECT token, it means we are facing a SELECT statement, and as such we parse a SELECT statement.
The parsing in this case follows this part of the grammar:
<select-stmt> ::= "SELECT" <select-list> "FROM" <table-name> [ "WHERE" <condition> ] ["ORDER" "BY" <column-name> ["ASC" | "DESC"]] ["LIMIT" <number>]
<select-list> ::= "*" | <column-name> { "," <column-name> }
<table-name> ::= <identifier>
<column-name> ::= <identifier>
The parsing goes as follows:
def parse_select_stmt(self) -> SelectStmt:
select_stmt = SelectStmt()
if self.match(Token.Type.STAR):
select_stmt.all = True
elif self.match(Token.Type.ID):
select_stmt.add_column(self.previous.lexema)
while(self.match(Token.Type.COMMA)):
if self.match(Token.Type.ID):
select_stmt.add_column(self.previous.lexema)
else:
self.error("expected column name after comma")
else:
self.error("expected '*' or column name after SELECT keyword")
if not self.match(Token.Type.FROM):
self.error("expected FROM clause in SELECT statement")
if not self.match(Token.Type.ID):
self.error("expected table name after FROM keyword")
select_stmt.table_name = self.previous.lexema
if self.match(Token.Type.WHERE):
select_stmt.condition = self.parse_or_condition()
if self.match(Token.Type.ORDER):
if not self.match(Token.Type.BY):
self.error("expected BY keyword after ORDER keyword")
if not self.match(Token.Type.ID):
self.error("expected column name in ORDER BY clause")
select_stmt.order_by = self.previous.lexema
if self.match(Token.Type.ASC):
select_stmt.asc = True
elif self.match(Token.Type.DESC):
select_stmt.asc = False
if self.match(Token.Type.LIMIT):
if not self.match(Token.Type.NUMVAL):
self.error("expected valid int value after LIMIT keyword")
select_stmt.limit = self.str_into_type(self.previous.lexema, self.previous)
return select_stmt
First, a SelectStmt object is initialized; depending on what tokens we match and what the grammar says, we update the object.
If we match a STAR (*), we set the all attribute to True. If instead we match an ID, we are looking at column names, so we add that lexema to the object's list of columns, repeating while we find commas. If neither is found, we throw an error. A FROM clause is required next, so if we don't match a FROM keyword, we throw an error. We then match an ID that represents the table name and set the table_name attribute. After this, if we match a WHERE, a condition follows, so we parse it and set the condition attribute (parsing conditions is explained later). If we then find an ORDER BY, we match an ID representing the column to order by and set the order_by attribute; if we find ASC or DESC, we set the asc attribute accordingly. Finally, if we match a LIMIT, we match a NUMVAL and set the limit attribute. The SelectStmt object is now completely built, so we return it and it is added to the statement list.
Parsing conditions means following a precedence of operators to avoid ambiguity. After parsing, we end up with a condition tree that evaluates conditions in the desired order. The precedence, from highest to lowest, goes as follows: parentheses -> NOT -> AND -> OR.
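As a minimal illustration of this precedence, here is a recursive-descent sketch over plain tuples (not the project's Condition classes): each level parses the next-tighter level first, so deeper operators bind more strongly.

```python
# Precedence sketch: OR binds loosest, then AND, then NOT.
# Tuples stand in for the condition tree; atoms are bare strings.
def parse_or(toks):
    left = parse_and(toks)
    while toks and toks[0] == "OR":
        toks.pop(0)
        left = ("OR", left, parse_and(toks))
    return left

def parse_and(toks):
    left = parse_not(toks)
    while toks and toks[0] == "AND":
        toks.pop(0)
        left = ("AND", left, parse_not(toks))
    return left

def parse_not(toks):
    if toks and toks[0] == "NOT":
        toks.pop(0)
        return ("NOT", toks.pop(0))
    return toks.pop(0)

print(parse_or("a AND NOT b OR c".split()))
# ('OR', ('AND', 'a', ('NOT', 'b')), 'c')
```

AND groups before OR does, exactly as the precedence dictates, because parse_or only sees the trees parse_and already built.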
To create this condition tree, we need to first create classes that will define each type of condition and the leaf nodes.
class BinaryOp(Enum):
AND = auto()
OR = auto()
EQ = auto()
NEQ = auto()
LT = auto()
GT = auto()
LE = auto()
GE = auto()
WC = auto()
WR = auto()
KNN = auto()
class Condition:
def __init__(self):
pass
class BinaryCondition(Condition):
def __init__(self, left : Condition = None, op : BinaryOp = None, right : Condition = None):
super().__init__()
self.left = left
self.op = op
self.right = right
class BetweenCondition(Condition):
def __init__(self, left : Condition = None, mid : Condition = None, right : Condition = None):
super().__init__()
self.left = left
self.mid = mid
self.right = right
class NotCondition(Condition):
def __init__(self, condition : Condition = None):
super().__init__()
self.condition = condition
class BooleanColumn(Condition):
def __init__(self, column_name : str = None):
super().__init__()
self.column_name = column_name
class ConditionColumn(Condition):
def __init__(self, column_name : str = None):
super().__init__()
self.column_name = column_name
class ConditionValue(Condition):
def __init__(self, value = None):
super().__init__()
self.value = value
Using these classes, we can build the condition tree. The part of the grammar that we follow for this part is the following:
<condition> ::= <or-condition>
<or-condition> ::= <and-condition> { "OR" <and-condition> }
<and-condition> ::= <not-condition> { "AND" <not-condition> }
<not-condition> ::= [ "NOT" ] <predicate>
<predicate> ::= <simple-condition> | "(" <condition> ")"
<simple-condition> ::= <column-name> <operator> <value> |
<column-name> "BETWEEN" <value> "AND" <value> |
<column-name> "WITHIN" "RECTANGLE" "(" <float> "," <float> "," <float> "," <float> ")" |
<column-name> "WITHIN" "CIRCLE" "(" <float> "," <float> "," <float> ")" |
<column-name> "KNN" "(" <float> "," <float> "," <number> ")"
<operator> ::= "=" | "<>" | "<" | ">" | "<=" | ">="
<value> ::= <string> | <number> | "TRUE" | "FALSE"
<column-name> ::= <identifier>
Following this grammar, we parse a condition as follows:
# <or-condition> ::= <and-condition> { "OR" <and-condition> }
def parse_or_condition(self) -> Condition:
left = self.parse_and_condition()
while self.match(Token.Type.OR):
right = self.parse_and_condition()
left = BinaryCondition(left, BinaryOp.OR, right)
return left
# <and-condition> ::= <not-condition> { "AND" <not-condition> }
def parse_and_condition(self) -> Condition:
left = self.parse_not_condition()
while self.match(Token.Type.AND):
right = self.parse_not_condition()
left = BinaryCondition(left, BinaryOp.AND, right)
return left
# <not-condition> ::= [ "NOT" ] <predicate>
def parse_not_condition(self) -> Condition:
if(self.match(Token.Type.NOT)):
return NotCondition(self.parse_predicate())
return self.parse_predicate()
# <predicate> ::= <simple-condition> | "(" <condition> ")"
def parse_predicate(self) -> Condition:
if(self.match(Token.Type.LPAR)):
condition = self.parse_or_condition()
if not self.match(Token.Type.RPAR):
self.error("expected ')' to close a condition")
return condition
return self.parse_simple_condition()
Lastly, we need to parse a simple condition, which means parsing a column name, an operator, and the respective value on the right-hand side. The parsing goes as follows:
def parse_simple_condition(self) -> Condition:
if not self.match(Token.Type.ID):
self.error("expected column name in condition")
column_name = self.previous.lexema
if self.match(Token.Type.BETWEEN):
between_condition = BetweenCondition()
between_condition.left = ConditionColumn(column_name)
# etc.
between_condition.right = ConditionValue(self.str_into_type(self.previous.lexema, self.previous))
return between_condition
simple_condition = BinaryCondition()
simple_condition.left = ConditionColumn(column_name)
if not (self.match(Token.Type.LT) or self.match(Token.Type.GT) or self.match(Token.Type.LE) or self.match(Token.Type.GE) or self.match(Token.Type.EQ) or self.match(Token.Type.NEQ) or self.match(Token.Type.WITHIN) or self.match(Token.Type.KNN)):
return BooleanColumn(column_name)
match self.previous.type:
case Token.Type.LT:
simple_condition.op = BinaryOp.LT
case Token.Type.GT:
simple_condition.op = BinaryOp.GT
# etc.
case _:
self.error("unknown conditional operator")
if simple_condition.op == BinaryOp.WR:
if not self.match(Token.Type.LPAR):
self.error("expected '(' after WITHIN RECTANGLE operation")
if not self.match(Token.Type.FLOATVAL):
self.error("expected valid float value for min x coordinate")
x_min = self.str_into_type(self.previous.lexema, self.previous)
# etc.
simple_condition.right = ConditionValue((x_min, y_min, x_max, y_max))
elif simple_condition.op == BinaryOp.WC:
if not self.match(Token.Type.LPAR):
self.error("expected '(' after WITHIN CIRCLE operation")
if not self.match(Token.Type.FLOATVAL):
self.error("expected valid float value for x coordinate")
x = self.str_into_type(self.previous.lexema, self.previous)
# etc.
simple_condition.right = ConditionValue((x, y, radius))
elif simple_condition.op == BinaryOp.KNN:
if not self.match(Token.Type.LPAR):
self.error("expected '(' after KNN operation")
if not self.match(Token.Type.FLOATVAL):
self.error("expected valid float value for x coordinate")
x = self.str_into_type(self.previous.lexema, self.previous)
# etc.
simple_condition.right = ConditionValue((x, y, k))
else:
if self.match(Token.Type.LPAR):
if not self.match(Token.Type.FLOATVAL):
self.error("expected a valid float value for x coordinate on POINT declaration")
x = self.str_into_type(self.previous.lexema, self.previous)
# etc.
simple_condition.right = ConditionValue((x, y))
else:
if not self.match_values():
self.error("expected a value after conditional operator")
simple_condition.right = ConditionValue(self.str_into_type(self.previous.lexema, self.previous))
return simple_condition
After having parsed every instruction in the input, we return the SQL object, which will be used by the interpreter.
Most of the code extracts shown in this section are simplified for brevity. The complete code for the interpreter can be found here (it's the same file as the parser).
The purpose of the interpreter is to traverse the AST and execute each instruction.
The interpreter will use the DBManager to execute the instructions, and as such we define the class as follows:
class Interpreter:
def __init__(self):
self.dbmanager = DBManager()
def error(self, error : str):
raise RuntimeError(error)
To interpret the SQL object, we iterate over every instruction on the instruction list and execute the respective method:
def interpret(self, sql : SQL):
try:
return self.interpret_sql(sql)
except RuntimeError as e:
raise e
def interpret_sql(self, sql : SQL):
if not sql:
self.error("Invalid SQL")
for stmt in sql.stmt_list:
result = self.interpret_stmt(stmt)
return result
def interpret_stmt(self, stmt : Stmt):
stmt_type = type(stmt)
if stmt_type == SelectStmt:
return self.interpret_select_stmt(stmt), "Selection successful"
elif stmt_type == CreateTableStmt:
self.interpret_create_table_stmt(stmt)
return None, "Table created successfully"
elif stmt_type == DropTableStmt:
self.interpret_drop_table_stmt(stmt)
return None, "Table dropped successfully"
elif stmt_type == InsertStmt:
self.interpret_insert_stmt(stmt)
return None, "Insertion successful"
elif stmt_type == DeleteStmt:
self.interpret_delete_stmt(stmt)
return None, "Deletion successful"
elif stmt_type == CreateIndexStmt:
self.interpret_create_index_stmt(stmt)
return None, "Index created successfully"
elif stmt_type == DropIndexStmt:
self.interpret_drop_index_stmt(stmt)
return None, "Index dropped successfully"
else:
self.error("unknown statement type")
We also return a success message if no errors are thrown.
For example, when interpreting a SELECT statement, we do the following:
def interpret_select_stmt(self, stmt : SelectStmt):
select_schema = SelectSchema(stmt.table_name, ConditionSchema(stmt.condition), stmt.all, stmt.column_list, stmt.order_by, stmt.asc, stmt.limit)
return self.dbmanager.select(select_schema)
The select method of the DBManager requires a SelectSchema object, which we easily build using our SelectStmt object. We then pass the SelectSchema object and call the select method.
As such, to interpret every instruction, we only need to build the objects needed by the DBManager and call the respective method.
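The if/elif chain above could equally be written as a dictionary keyed by statement type. A sketch with stand-in classes (not the project's real Stmt hierarchy):

```python
class SelectStmt: pass     # stand-ins for illustration only
class DeleteStmt: pass
class DropTableStmt: pass

def interpret_stmt(stmt):
    # Dictionary-based alternative to the if/elif dispatch shown above.
    handlers = {
        SelectStmt: lambda s: "Selection successful",
        DeleteStmt: lambda s: "Deletion successful",
        DropTableStmt: lambda s: "Table dropped successfully",
    }
    handler = handlers.get(type(stmt))
    if handler is None:
        raise RuntimeError("unknown statement type")
    return handler(stmt)

print(interpret_stmt(SelectStmt()))  # Selection successful
```

Both forms behave the same; the dictionary version merely keeps the statement-to-handler mapping in one place as new statement types are added.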
These are the tokens used in the scanner. Note that not all tokens are used.
| Token | Description |
|---|---|
| LPAR | Left parenthesis ( |
| RPAR | Right parenthesis ) |
| SELECT | Keyword for selecting data |
| FROM | Keyword indicating the data source (table) |
| WHERE | Keyword for filtering rows |
| INSERT | Keyword for inserting data |
| INTO | Keyword indicating the target table in INSERT |
| VALUES | Keyword indicating the values to insert |
| UPDATE | Keyword for modifying existing data |
| SET | Keyword indicating columns to update and their new values |
| DELETE | Keyword for deleting data |
| CREATE | Keyword for creating tables or indexes |
| TABLE | Keyword that follows CREATE or DROP and indicates a table creation or deletion |
| DROP | Keyword for deleting tables or indexes |
| AND | Keyword for AND operator |
| OR | Keyword for OR operator |
| NOT | Keyword for NOT operator |
| AS | Keyword used for aliasing columns or tables |
| ORDER | Keyword for ordering selection results |
| BY | Keyword that follows ORDER |
| LIMIT | Keyword that limits the number of returned rows |
| ID | Identifier (e.g., column or table name) |
| STAR | Asterisk *, means "all columns" |
| BETWEEN | Keyword for range comparison operator |
| EQ | Equality operator = |
| NEQ | Not equal operator != or <> |
| LT | Less than operator < |
| GT | Greater than operator > |
| LE | Less than or equal <= |
| GE | Greater than or equal >= |
| COMMA | Comma , |
| DOT | Period . |
| SEMICOLON | Statement terminator ; |
| NUMVAL | Integer numeric literal |
| FLOATVAL | Float numeric literal |
| STRINGVAL | String literal |
| BOOLVAL | Boolean literal (TRUE or FALSE) |
| PRIMARY | Keyword for primary key definition |
| KEY | Keyword that follows PRIMARY |
| DATATYPE | Data type (e.g., INT, VARCHAR) |
| INDEX | Keyword to define an index |
| ON | Keyword indicating the table in an index creation or deletion |
| USING | Keyword indicating the index type |
| INDEXTYPE | Specifies the type of index (e.g., BTREE, HASH) |
| WITHIN | Keyword for containment operations |
| RECTANGLE | Keyword that follows WITHIN and indicates a rectangle definition |
| CIRCLE | Keyword that follows WITHIN and indicates a circle definition |
| KNN | Keyword for K-Nearest Neighbors operation |
| ASC | Keyword that indicates a sort in ascending order |
| DESC | Keyword that indicates a sort in descending order |
| IF | Keyword that specifies a condition in CREATE and DROP instructions |
| EXISTS | Keyword that follows IF and indicates if a table exists |
| ERR | Token representing an error |
| END | Token indicating the end of the input |