
ParserSQL Technical Explanation

Mikel-Bracamonte edited this page Jul 16, 2025 · 7 revisions


This is a technical explanation of how the parser works. An explanation of the parser's usage can be seen here.

The parsing process has three major steps: scanning, parsing, and interpreting.

Scanner

Most of the code extracts shown in this section are simplified for brevity. The complete code for the scanner can be found here.

Objective

The purpose of the scanner is to transform the entire input string into recognizable tokens. As such, we define the tokens via the Token class:

class Token:
    class Type:
        (
            LPAR, RPAR, SELECT, FROM, WHERE, INSERT, INTO, VALUES, UPDATE, SET, DELETE, 
            CREATE, TABLE, DROP, AND, OR, NOT, AS, ORDER, BY, LIMIT, ID, STAR, BETWEEN,
            EQ, NEQ, LT, GT, LE, GE, COMMA, DOT, SEMICOLON, NUMVAL, FLOATVAL, STRINGVAL,
            BOOLVAL, PRIMARY, KEY, DATATYPE, INDEX, ON, USING, INDEXTYPE, ERR, END, 
            WITHIN, RECTANGLE, CIRCLE, KNN, ASC, DESC, IF, EXISTS
        ) = range(54)

    token_names = [
        "LPAR", "RPAR", "SELECT", "FROM", # etc.
    ]

    def __init__(self, token_type, lexema=""):
        self.type = token_type
        self.lexema = lexema

The full description of each token can be seen here.

The Scanner class

The Scanner class is created as follows:

class Scanner:
    def __init__(self, input):
        self.input = input + '\0'
        self.first = 0
        self.current = 0
        self.line = 1
        self.pos = 1

    def start_lexema(self) -> None:
        self.first = self.current

    def get_lexema(self) -> str:
        return self.input[self.first:self.current]

A lexema (lexeme; the spelling used throughout the code) is the string that represents the current token's value. For example, an LPAR's lexema is '(', a SELECT's lexema is 'SELECT', and a NUMVAL's lexema might be '5'.

We then define a method to get the next token in the input, which is based around a DFA. The purpose of the DFA is to recognize which type of token the current lexema is. A token may end on a space, a new line, or any character that generally represents a space (defined using python's isspace() method). If the token is recognized as an ID, it ends on the first non-alphanumeric character. If it's recognized as a number, it ends on the first non-digit character (unless it's a dot .). If it's recognized as a string, it ends on the first instance of a single quote (').
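The DFA-driven loop described above can be sketched in a compact, self-contained form. This is an illustration only, not the project's scanner: it recognizes just a few token kinds and skips the line/position bookkeeping.

```python
# Minimal sketch of the DFA-style scanning loop: skip whitespace,
# return single-character tokens immediately, and loop to consume
# multi-character tokens (numbers, identifiers).
def tokenize(text: str) -> list[tuple[str, str]]:
    text += '\0'                      # sentinel marks the end of input
    tokens = []
    i = 0
    while text[i] != '\0':
        c = text[i]
        if c.isspace():               # whitespace separates tokens
            i += 1
        elif c == '*':
            tokens.append(("STAR", "*"))
            i += 1
        elif c == ';':
            tokens.append(("SEMICOLON", ";"))
            i += 1
        elif c.isdigit():             # like state 1: consume a number
            first = i
            while text[i].isdigit():
                i += 1
            tokens.append(("NUMVAL", text[first:i]))
        elif c.isalpha():             # like state 4: consume an identifier
            first = i
            while text[i].isalnum() or text[i] == '_':
                i += 1
            tokens.append(("ID", text[first:i]))
        else:
            tokens.append(("ERR", c))
            i += 1
    return tokens

print(tokenize("SELECT * FROM t1;"))
# [('ID', 'SELECT'), ('STAR', '*'), ('ID', 'FROM'), ('ID', 't1'), ('SEMICOLON', ';')]
```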

The method and some states are shown here:

def next_token(self) -> Token:
        state = 0
        self.start_lexema()
        c = self.input[self.current]
        while True:
            if state == 0:
                c = self.input[self.current]
                if c.isspace():
                    if c == '\n':
                        self.line += 1
                        self.pos = 1
                    self.current += 1
                    self.pos += 1
                    self.start_lexema()
                    state = 0
                elif c == '\0':
                    return Token(Token.Type.END)
                elif c == '*':
                    self.current += 1
                    self.pos += 1
                    return Token(Token.Type.STAR)
                elif c == ';':
                    self.current += 1
                    self.pos += 1
                    return Token(Token.Type.SEMICOLON)
                # etc.
                elif c.isdigit():
                    state = 1
                elif c == "'":
                    state = 3
                elif c.isalpha():
                    state = 4
                else:
                    return Token(Token.Type.ERR)

For single characters like ; or *, we simply return the corresponding tokens. For other cases, we go to a new state. In this case, we go to state 1 to tokenize a number, state 3 to tokenize a string and state 4 to tokenize an ID or keyword.

States 1 and 2 are used to tokenize numbers, and can be seen here:

elif state == 1:
                self.current += 1
                self.pos += 1
                c = self.input[self.current]
                if c.isdigit():
                    pass
                elif c == '.':
                    state = 2
                else:
                    return Token(Token.Type.NUMVAL, self.get_lexema())

            elif state == 2:
                self.current += 1
                self.pos += 1
                c = self.input[self.current]
                if not c.isdigit():
                    return Token(Token.Type.FLOATVAL, self.get_lexema())

In state 1, we advance the current position until we see a non-digit character, in which case we return a NUMVAL token. If at any point we find a dot (.), we are dealing with a float, so we go to state 2 and keep advancing until we see a non-digit character, in which case we return a FLOATVAL token.
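States 1 and 2 can be sketched as a standalone helper (illustrative names, not the project's actual code): given the input and a start position, it consumes digits and switches to the "float" state on the first dot.

```python
# Sketch of states 1 and 2: scan a numeric literal starting at i,
# returning the token type, its lexema, and the position after it.
def scan_number(text: str, i: int) -> tuple[str, str, int]:
    text += '\0'                      # sentinel so indexing is safe
    first = i
    while text[i].isdigit():          # state 1: integer part
        i += 1
    if text[i] != '.':
        return ("NUMVAL", text[first:i], i)
    i += 1                            # consume the dot -> state 2
    while text[i].isdigit():          # fractional part
        i += 1
    return ("FLOATVAL", text[first:i], i)

print(scan_number("42;", 0))     # ('NUMVAL', '42', 2)
print(scan_number("3.14)", 0))   # ('FLOATVAL', '3.14', 4)
```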

State 3 is used to tokenize strings, and can be seen here:

elif state == 3:
                self.current += 1
                self.pos += 1
                c = self.input[self.current]
                if c == "'":
                    self.current += 1
                    self.pos += 1
                    return Token(Token.Type.STRINGVAL, self.get_lexema()[1:-1])
                elif c == '\0':
                    return Token(Token.Type.ERR)

In state 3, we advance the current position until we find a single quote ('), in which case we return a STRINGVAL whose lexema doesn't include the single quotes on both ends.

States 4 and 5 are used to tokenize IDs and keywords, and can be seen here:

elif state == 4:
                self.current += 1
                self.pos += 1
                c = self.input[self.current]
                if not (c.isalnum() or c in ["_"]):
                    state = 5

            elif state == 5:
                lexema = self.get_lexema().upper()
                keywords = {
                    "SELECT": Token.Type.SELECT,
                    "FROM": Token.Type.FROM,
                    "WHERE": Token.Type.WHERE,
                    "INSERT": Token.Type.INSERT,
                    "INTO": Token.Type.INTO,
                    # etc.
                }
                if lexema in keywords:
                    return Token(keywords[lexema], lexema if keywords[lexema] in [Token.Type.BOOLVAL, Token.Type.INDEXTYPE, Token.Type.DATATYPE] else "")
                else:
                    return Token(Token.Type.ID, self.get_lexema())

In state 4, we advance the current position until we find a character that's neither alphanumeric nor an underscore (_), in which case we go to state 5. In state 5, we differentiate between IDs and keywords. If the lexema matches any of the reserved keywords, we return the corresponding token (if it's a BOOLVAL, INDEXTYPE, or DATATYPE, we also return the lexema); otherwise, we return it as an ID.
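State 5's keyword lookup can be sketched as follows (the keyword set here is a small illustrative subset, and the function names are hypothetical):

```python
# Sketch of state 5: classify a finished lexema as a keyword or an ID.
KEYWORDS = {"SELECT", "FROM", "WHERE", "INSERT", "INTO", "VALUES"}

def classify(lexema: str) -> tuple[str, str]:
    upper = lexema.upper()
    if upper in KEYWORDS:             # keywords are case-insensitive
        return (upper, "")            # most keywords carry no lexema
    return ("ID", lexema)             # IDs keep their original spelling

print(classify("select"))   # ('SELECT', '')
print(classify("users"))    # ('ID', 'users')
```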

Parser

Most of the code extracts shown in this section are simplified for brevity. The complete code for the parser can be found here.

Objective

The purpose of the parser is to generate an abstract syntax tree (AST) that can then be traversed and executed instruction by instruction. To generate this tree, we use the recursive descent method.

The parsing step uses a context-free grammar (CFG) to accept or reject an input, and can be found here.

Our objective now is to follow the grammar and construct the related tree and objects.

Statement and related classes

We first create classes for each instruction and other necessary parts, that will contain every single attribute an instruction needs to be executed.

For example, the class related to a SELECT statement can be seen here:

class SelectStmt(Stmt):
    def __init__(self, table_name : str = None, condition : Condition = None, all : bool = False, column_list : list[str] = None, order_by : str = None, asc : bool = True, limit : int = None):
        super().__init__()
        self.table_name = table_name
        self.condition = condition
        self.all = all
        self.column_list = column_list if column_list else []
        self.order_by = order_by
        self.asc = asc
        self.limit = limit

    def add_column(self, column_name : str) -> None:
        self.column_list.append(column_name)

So, when parsing a SELECT statement, what we have to do is follow the grammar, build the SelectStmt object and put it in the tree.

The root node in the tree is an object that will contain a list of statements, defined as follows:

class SQL:
    def __init__(self, stmt_list : list[Stmt] = None):
        self.stmt_list = stmt_list if stmt_list else []

    def add_stmt(self, stmt : Stmt) -> None:
        self.stmt_list.append(stmt)

The Parser class

Now we create our parser, which will read tokens and build the AST. The parser is first defined as follows:

class Parser:
    def __init__(self, scanner : Scanner):
        self.scanner = scanner
        self.current : Token = None
        self.previous : Token = None

    def error(self, error : str):
        raise ParseError(error, self.scanner.line, self.scanner.pos, self.current)

    def match(self, type : Token.Type) -> bool:
        if self.check(type):
            self.advance()
            return True
        else:
            return False

    def check(self, type : Token.Type) -> bool:
        if self.is_at_end():
            return False
        else:
            return self.current.type == type

    def advance(self) -> None:
        if not self.is_at_end():
            temp = self.current
            self.current = self.scanner.next_token()
            self.previous = temp
            if self.check(Token.Type.ERR):
                self.error(f"unrecognized character: {self.current.lexema}")

    def is_at_end(self) -> bool:
        return self.current.type == Token.Type.END

A parser receives a scanner, which is used to get the next token as seen previously. The parser also provides some useful methods. match(type) receives a token type and checks whether the current token is of that type, returning the result. If true, it also advances to the next token. Advancing means calling the scanner's next_token() method and storing the result.
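The helper methods can be illustrated with a minimal stand-in parser over plain string tokens (MiniParser is a hypothetical name for illustration, not part of the project):

```python
# Minimal stand-in showing how match/check/advance cooperate:
# match() consumes the current token only when its type matches.
class MiniParser:
    def __init__(self, tokens: list[str]):
        self.tokens = tokens + ["END"]   # END sentinel, like Token.Type.END
        self.pos = 0
        self.current = self.tokens[0]
        self.previous = None

    def check(self, type_: str) -> bool:
        return self.current != "END" and self.current == type_

    def advance(self) -> None:
        if self.current != "END":
            self.previous = self.current
            self.pos += 1
            self.current = self.tokens[self.pos]

    def match(self, type_: str) -> bool:
        if self.check(type_):            # consume only on a match
            self.advance()
            return True
        return False

p = MiniParser(["SELECT", "STAR", "FROM", "ID"])
print(p.match("SELECT"))  # True  (consumed; current is now STAR)
print(p.match("FROM"))    # False (current is STAR, nothing consumed)
```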

Parsing statements

Now we can begin parsing. To parse the input, we first need to parse the SQL object defined previously:

def parse(self) -> SQL:
        try:
            self.current = self.scanner.next_token()
            return self.parse_sql()
        except ParseError as e:
            raise e

def parse_sql(self) -> SQL:
        sql = SQL()
        sql.add_stmt(self.parse_stmt())
        while(self.match(Token.Type.SEMICOLON) and self.current.type != Token.Type.END):
            sql.add_stmt(self.parse_stmt())
        if self.current.type != Token.Type.END:
            self.error("unexpected items after statement")
        return sql

When parsing the SQL object, we first parse a statement; then, while we match semicolon tokens and have not yet reached the end of the input, we parse another statement. This follows this part of the grammar:

<sql> ::= <statement_list>

<statement_list> ::= <statement> ";" { <statement> ";" }

<statement> ::= <select-stmt>
              | <create-table-stmt>
              | <drop-table-stmt>
              | <insert-stmt>
              | <delete-stmt>
              | <create-index-stmt>
              | <drop-index-stmt>

Parsing a statement depends on the type of statement, which can be determined by matching each statement's leading keywords:

def parse_stmt(self) -> Stmt:
        if self.match(Token.Type.SELECT):
            return self.parse_select_stmt()
        elif self.match(Token.Type.CREATE):
            if self.match(Token.Type.TABLE):
                return self.parse_create_table_stmt()
            elif self.match(Token.Type.INDEX):
                return self.parse_create_index_stmt()
            else:
                self.error("expected TABLE or INDEX keyword after CREATE keyword")
        elif self.match(Token.Type.DROP):
            if self.match(Token.Type.TABLE):
                return self.parse_drop_table_stmt()
            elif self.match(Token.Type.INDEX):
                return self.parse_drop_index_stmt()
            else:
                self.error("expected TABLE or INDEX keyword after DROP keyword")
        elif self.match(Token.Type.INSERT):
            return self.parse_insert_stmt()
        elif self.match(Token.Type.DELETE):
            return self.parse_delete_stmt()
        else:
            self.error("unexpected start of an instruction")

For example, if we match a SELECT token, it means we are facing a SELECT statement, and as such we parse a SELECT statement.

Parsing a SELECT statement

The parsing in this case follows this part of the grammar:

<select-stmt> ::= "SELECT" <select-list> "FROM" <table-name> [ "WHERE" <condition> ] ["ORDER" "BY" <column-name> ["ASC" | "DESC"]] ["LIMIT" <number>]
<select-list> ::= "*" | <column-name> { "," <column-name> }
<table-name> ::= <identifier>
<column-name> ::= <identifier>

The parsing goes as follows:

def parse_select_stmt(self) -> SelectStmt:
        select_stmt = SelectStmt()
        if self.match(Token.Type.STAR):
            select_stmt.all = True
        elif self.match(Token.Type.ID):
            select_stmt.add_column(self.previous.lexema)
            while(self.match(Token.Type.COMMA)):
                if self.match(Token.Type.ID):
                    select_stmt.add_column(self.previous.lexema)
                else:
                    self.error("expected column name after comma")
        else:
            self.error("expected '*' or column name after SELECT keyword")
        if not self.match(Token.Type.FROM):
            self.error("expected FROM clause in SELECT statement")
        if not self.match(Token.Type.ID):
            self.error("expected table name after FROM keyword")
        select_stmt.table_name = self.previous.lexema
        if self.match(Token.Type.WHERE):
            select_stmt.condition = self.parse_or_condition()
        if self.match(Token.Type.ORDER):
            if not self.match(Token.Type.BY):
                self.error("expected BY keyword after ORDER keyword")
            if not self.match(Token.Type.ID):
                self.error("expected column name in ORDER BY clause")
            select_stmt.order_by = self.previous.lexema
            if self.match(Token.Type.ASC):
                select_stmt.asc = True
            elif self.match(Token.Type.DESC):
                select_stmt.asc = False
        if self.match(Token.Type.LIMIT):
            if not self.match(Token.Type.NUMVAL):
                self.error("expected valid int value after LIMIT keyword")
            select_stmt.limit = self.str_into_type(self.previous.lexema, self.previous)
        return select_stmt

First, a SelectStmt object is initialized, and depending on what tokens we match and what the grammar says, we update the object.

If we match a STAR (*), we set the all attribute to True. If instead we match an ID, we are looking at column names, so we add that lexema to the object's list of columns, and keep doing so while we find commas. If neither is found, we throw an error. After that, a FROM clause is required, so if we don't match a FROM keyword, we throw an error. We then match an ID that represents the table name and set the table_name attribute. After this, if we match a WHERE, a condition follows, so we parse it and set the condition attribute (parsing conditions is explained later). If we then find an ORDER BY, we match an ID that represents the column to order by and set the order_by attribute; if we find ASC or DESC, we set the asc attribute accordingly. Finally, if we match a LIMIT, we match a NUMVAL and set the limit attribute. The SelectStmt object is now completely built, so we return it, and it is then added to the statement list.
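As an illustration of the result, here is the set of fields the parser would fill for a sample query. SelectStmtSketch is a hypothetical dataclass stand-in for the project's SelectStmt, used so the snippet is self-contained:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in mirroring SelectStmt's attributes.
@dataclass
class SelectStmtSketch:
    table_name: str = None
    all: bool = False
    column_list: list = field(default_factory=list)
    order_by: str = None
    asc: bool = True
    limit: int = None

# For: SELECT name, age FROM users ORDER BY age DESC LIMIT 10;
stmt = SelectStmtSketch(table_name="users",
                        column_list=["name", "age"],
                        order_by="age", asc=False, limit=10)
print(stmt.column_list)  # ['name', 'age']
```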

Parsing conditions

Parsing conditions means following a precedence of operators to avoid ambiguity. After parsing, we end up with a condition tree that applies the conditions in the desired order. The precedence, from highest to lowest, is: parentheses -> NOT -> AND -> OR.
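As a self-contained illustration of this precedence, here is a tiny recursive-descent condition parser over plain string tokens, building tuples instead of the project's classes (all names here are illustrative):

```python
# Tiny precedence sketch: parentheses bind tightest, then NOT, AND, OR.
# Each grammar rule becomes one nested function, as in recursive descent.
def parse(tokens: list[str]):
    pos = [0]
    def peek():
        return tokens[pos[0]] if pos[0] < len(tokens) else None
    def eat(t):
        if peek() == t:
            pos[0] += 1
            return True
        return False
    def predicate():                  # <predicate>
        if eat("("):
            node = or_cond()
            assert eat(")"), "expected ')'"
            return node
        tok = peek(); pos[0] += 1     # a bare column/value leaf
        return tok
    def not_cond():                   # <not-condition>
        if eat("NOT"):
            return ("NOT", predicate())
        return predicate()
    def and_cond():                   # <and-condition>
        left = not_cond()
        while eat("AND"):
            left = ("AND", left, not_cond())
        return left
    def or_cond():                    # <or-condition>
        left = and_cond()
        while eat("OR"):
            left = ("OR", left, and_cond())
        return left
    return or_cond()

print(parse(["NOT", "a", "AND", "b", "OR", "c"]))
# ('OR', ('AND', ('NOT', 'a'), 'b'), 'c')
```

Note how NOT binds only to "a", AND groups before OR, and parentheses would override all of it.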

To create this condition tree, we need to first create classes that will define each type of condition and the leaf nodes.

class BinaryOp(Enum):
    AND = auto()
    OR = auto()
    EQ = auto()
    NEQ = auto()
    LT = auto()
    GT = auto()
    LE = auto()
    GE = auto()
    WC = auto()
    WR = auto()
    KNN = auto()

class Condition:
    def __init__(self):
        pass

class BinaryCondition(Condition):
    def __init__(self, left : Condition = None, op : BinaryOp = None, right : Condition = None):
        super().__init__()
        self.left = left
        self.op = op
        self.right = right

class BetweenCondition(Condition):
    def __init__(self, left : Condition = None, mid : Condition = None, right : Condition = None):
        super().__init__()
        self.left = left
        self.mid = mid
        self.right = right

class NotCondition(Condition):
    def __init__(self, condition : Condition = None):
        super().__init__()
        self.condition = condition

class BooleanColumn(Condition):
    def __init__(self, column_name : str = None):
        super().__init__()
        self.column_name = column_name

class ConditionColumn(Condition):
    def __init__(self, column_name : str = None):
        super().__init__()
        self.column_name = column_name

class ConditionValue(Condition):
    def __init__(self, value = None):
        super().__init__()
        self.value = value

Using these classes, we can build the condition tree. The part of the grammar that we follow for this part is the following:

<condition> ::= <or-condition>

<or-condition> ::= <and-condition> { "OR" <and-condition> }

<and-condition> ::= <not-condition> { "AND" <not-condition> }

<not-condition> ::= [ "NOT" ] <predicate>

<predicate> ::= <simple-condition> | "(" <condition> ")"

<simple-condition> ::= <column-name> <operator> <value> |
                       <column-name> "BETWEEN" <value> "AND" <value> |
                       <column-name> "WITHIN" "RECTANGLE" "(" <float> "," <float> "," <float> "," <float> ")" |
                       <column-name> "WITHIN" "CIRCLE" "(" <float> "," <float> "," <float> ")" |
                       <column-name> "KNN" "(" <float> "," <float> "," <number> ")"

<operator> ::= "=" | "<>" | "<" | ">" | "<=" | ">="

<value> ::= <string> | <number> | "TRUE" | "FALSE"

<column-name> ::= <identifier>

Following this grammar, we parse a condition as follows:

    # <or-condition> ::= <and-condition> { "OR" <and-condition> }
    def parse_or_condition(self) -> Condition:
        left = self.parse_and_condition()
        while self.match(Token.Type.OR):
            right = self.parse_and_condition()
            left = BinaryCondition(left, BinaryOp.OR, right)
        return left

    # <and-condition> ::= <not-condition> { "AND" <not-condition> }
    def parse_and_condition(self) -> Condition:
        left = self.parse_not_condition()
        while self.match(Token.Type.AND):
            right = self.parse_not_condition()
            left = BinaryCondition(left, BinaryOp.AND, right)
        return left

    # <not-condition> ::= [ "NOT" ] <predicate>
    def parse_not_condition(self) -> Condition:
        if(self.match(Token.Type.NOT)):
            return NotCondition(self.parse_predicate())
        return self.parse_predicate()

    # <predicate> ::= <simple-condition> | "(" <condition> ")"
    def parse_predicate(self) -> Condition:
        if(self.match(Token.Type.LPAR)):
            condition = self.parse_or_condition()
            if not self.match(Token.Type.RPAR):
                self.error("expected ')' to close a condition")
            return condition
        return self.parse_simple_condition()

Lastly, we need to parse a simple condition, which means parsing a column name, an operation, and the respective value on the right hand side. The parsing goes as follows:

def parse_simple_condition(self) -> Condition:
        if not self.match(Token.Type.ID):
            self.error("expected column name in condition")
        column_name = self.previous.lexema
        if self.match(Token.Type.BETWEEN):
            between_condition = BetweenCondition()
            between_condition.left = ConditionColumn(column_name)
            # etc.
            between_condition.right = ConditionValue(self.str_into_type(self.previous.lexema, self.previous))
            return between_condition
        simple_condition = BinaryCondition()
        simple_condition.left = ConditionColumn(column_name)
        if not (self.match(Token.Type.LT) or self.match(Token.Type.GT) or self.match(Token.Type.LE) or self.match(Token.Type.GE) or self.match(Token.Type.EQ) or self.match(Token.Type.NEQ) or self.match(Token.Type.WITHIN) or self.match(Token.Type.KNN)):
            return BooleanColumn(column_name)
        
        match self.previous.type:
            case Token.Type.LT:
                simple_condition.op = BinaryOp.LT
            case Token.Type.GT:
                simple_condition.op = BinaryOp.GT
            # etc.
            case _:
                self.error("unknown conditional operator")
        if simple_condition.op == BinaryOp.WR:
            if not self.match(Token.Type.LPAR):
                self.error("expected '(' after WITHIN RECTANGLE operation")
            if not self.match(Token.Type.FLOATVAL):
                self.error("expected valid float value for min x coordinate")
            x_min = self.str_into_type(self.previous.lexema, self.previous)
            # etc.
            simple_condition.right = ConditionValue((x_min, y_min, x_max, y_max))
        elif simple_condition.op == BinaryOp.WC:
            if not self.match(Token.Type.LPAR):
                self.error("expected '(' after WITHIN CIRCLE operation")
            if not self.match(Token.Type.FLOATVAL):
                self.error("expected valid float value for x coordinate")
            x = self.str_into_type(self.previous.lexema, self.previous)
            # etc.
            simple_condition.right = ConditionValue((x, y, radius))
        elif simple_condition.op == BinaryOp.KNN:
            if not self.match(Token.Type.LPAR):
                self.error("expected '(' after KNN operation")
            if not self.match(Token.Type.FLOATVAL):
                self.error("expected valid float value for x coordinate")
            x = self.str_into_type(self.previous.lexema, self.previous)
            # etc.
            simple_condition.right = ConditionValue((x, y, k))
        else:
            if self.match(Token.Type.LPAR):
                if not self.match(Token.Type.FLOATVAL):
                    self.error("expected a valid float value for x coordinate on POINT declaration")
                x = self.str_into_type(self.previous.lexema, self.previous)
                # etc.
                simple_condition.right = ConditionValue((x, y))
            else:
                if not self.match_values():
                    self.error("expected a value after conditional operator")
                simple_condition.right = ConditionValue(self.str_into_type(self.previous.lexema, self.previous))
        return simple_condition

Finalizing parsing

After having parsed every single instruction on the input, we return the SQL object, which will be used by the interpreter.

Interpreter

Most of the code extracts shown in this section are simplified for brevity. The complete code for the interpreter can be found here (it's the same file as the parser).

Objective

The purpose of the interpreter is to traverse the AST and execute each instruction.

The Interpreter class

The interpreter will use the DBManager to execute the instructions, and as such we define the class as follows:

class Interpreter:
    def __init__(self):
        self.dbmanager = DBManager()

    def error(self, error : str):
        raise RuntimeError(error)

Interpreting

To interpret the SQL object, we iterate over every instruction on the statement list and execute the respective method. Note that only the result of the last statement is returned.

def interpret(self, sql : SQL):
        try:
            return self.interpret_sql(sql)
        except RuntimeError as e:
            raise e

    def interpret_sql(self, sql : SQL):
        if not sql:
            self.error("Invalid SQL")
        result = None  # only the last statement's result is kept
        for stmt in sql.stmt_list:
            result = self.interpret_stmt(stmt)
        return result

    def interpret_stmt(self, stmt : Stmt):
        stmt_type = type(stmt)
        if stmt_type == SelectStmt:
            return self.interpret_select_stmt(stmt), "Selection successful"
        elif stmt_type == CreateTableStmt:
            self.interpret_create_table_stmt(stmt)
            return None, "Table created successfully"
        elif stmt_type == DropTableStmt:
            self.interpret_drop_table_stmt(stmt)
            return None, "Table dropped successfully"
        elif stmt_type == InsertStmt:
            self.interpret_insert_stmt(stmt)
            return None, "Insertion successful"
        elif stmt_type == DeleteStmt:
            self.interpret_delete_stmt(stmt)
            return None, "Deletion successful"
        elif stmt_type == CreateIndexStmt:
            self.interpret_create_index_stmt(stmt)
            return None, "Index created successfully"
        elif stmt_type == DropIndexStmt:
            self.interpret_drop_index_stmt(stmt)
            return None, "Index dropped successfully"
        else:
            self.error("unknown statement type")

We also return a success message if no errors are thrown.

For example, when interpreting a SELECT statement, we do the following:

def interpret_select_stmt(self, stmt : SelectStmt):
        select_schema = SelectSchema(stmt.table_name, ConditionSchema(stmt.condition), stmt.all, stmt.column_list, stmt.order_by, stmt.asc, stmt.limit)
        return self.dbmanager.select(select_schema)

The select function on the DBManager requires a SelectSchema object, which we easily build using our SelectStmt object. We then pass the SelectSchema object and call the select method.

As such, to interpret every instruction, we only need to build the objects needed by the DBManager and call the respective method.
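The same dispatch can also be sketched with a dict from statement class to handler, an alternative to the if/elif chain. The stub classes and handlers below are illustrative, not the project's code:

```python
# Dict-based dispatch: map each statement class to its handler.
class SelectStmt: pass
class InsertStmt: pass

def run_select(stmt):
    return ("rows...", "Selection successful")

def run_insert(stmt):
    return (None, "Insertion successful")

HANDLERS = {SelectStmt: run_select, InsertStmt: run_insert}

def interpret_stmt(stmt):
    handler = HANDLERS.get(type(stmt))
    if handler is None:               # unknown statement type
        raise RuntimeError("unknown statement type")
    return handler(stmt)

print(interpret_stmt(InsertStmt()))  # (None, 'Insertion successful')
```

Adding a new statement type then only requires registering one more entry in the dict.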

Tokens

These are the tokens defined in the scanner. Note that not all of them are currently used.

| Token | Description |
| ----- | ----------- |
| LPAR | Left parenthesis ( |
| RPAR | Right parenthesis ) |
| SELECT | Keyword for selecting data |
| FROM | Keyword indicating the data source (table) |
| WHERE | Keyword for filtering rows |
| INSERT | Keyword for inserting data |
| INTO | Keyword indicating the target table in INSERT |
| VALUES | Keyword indicating the values to insert |
| UPDATE | Keyword for modifying existing data |
| SET | Keyword indicating columns to update and their new values |
| DELETE | Keyword for deleting data |
| CREATE | Keyword for creating tables or indexes |
| TABLE | Keyword that follows CREATE or DROP and indicates a table creation or deletion |
| DROP | Keyword for deleting tables or indexes |
| AND | Keyword for AND operator |
| OR | Keyword for OR operator |
| NOT | Keyword for NOT operator |
| AS | Keyword used for aliasing columns or tables |
| ORDER | Keyword for ordering selection results |
| BY | Keyword that follows ORDER |
| LIMIT | Keyword that limits the number of returned rows |
| ID | Identifier (e.g., column or table name) |
| STAR | Asterisk *, means "all columns" |
| BETWEEN | Keyword for range comparison operator |
| EQ | Equality operator = |
| NEQ | Not equal operator != or <> |
| LT | Less than operator < |
| GT | Greater than operator > |
| LE | Less than or equal <= |
| GE | Greater than or equal >= |
| COMMA | Comma , |
| DOT | Period . |
| SEMICOLON | Statement terminator ; |
| NUMVAL | Integer numeric literal |
| FLOATVAL | Float numeric literal |
| STRINGVAL | String literal |
| BOOLVAL | Boolean literal (TRUE or FALSE) |
| PRIMARY | Keyword for primary key definition |
| KEY | Keyword that follows PRIMARY |
| DATATYPE | Data type (e.g., INT, VARCHAR) |
| INDEX | Keyword to define an index |
| ON | Keyword indicating the table in an index creation or deletion |
| USING | Keyword indicating the index type |
| INDEXTYPE | Specifies the type of index (e.g., BTREE, HASH) |
| WITHIN | Keyword for containment operations |
| RECTANGLE | Keyword that follows WITHIN and indicates a rectangle definition |
| CIRCLE | Keyword that follows WITHIN and indicates a circle definition |
| KNN | Keyword for K-Nearest Neighbors operation |
| ASC | Keyword that indicates a sort in ascending order |
| DESC | Keyword that indicates a sort in descending order |
| IF | Keyword that specifies a condition in CREATE and DROP instructions |
| EXISTS | Keyword that follows IF and indicates if a table exists |
| ERR | Token representing an error |
| END | Token indicating the end of the input |
