Skip to content

IDNI/parser

Repository files navigation

The Tau Parsing Library

The Tau parsing library is a high-performance, industrial-strength, continuously maintained parsing tool. It is a C++ library that is able to parse, and generate a parser, to any Context Free Grammar (CFG) and even all Boolean grammars. Compared to existing parsing solutions, the Tau Parsing Library stands out with its simplicity and robustness: just give it a grammar and an input string, and it will give you a representation of all possible parse trees, without any need for additional information, configuration, or callbacks that other CFG parsers commonly require you to provide.

Generate a parser in 3 easy steps:

First, define the grammar using the Tau Grammar Format (TGF), which is very similar to EBNF.

@use char classes space, digit.  # introduce space and digit native productions

start => _ expr _.

expr  =>  ( '(' _ expr _ ')'  ):group
	| ( expr _ '-' _ expr ):sub
	| ( expr _ '+' _ expr ):add
	| ( expr _ '/' _ expr ):div
	| ( expr _ '*' _ expr ):mul
	| ( '+' _ expr        ):pos
	| ( '-' _ expr        ):neg
	| ( digit+            ):number.

_     => space*.                 # optional white space of any length

Second, use the TGF command line utility to translate your TGF file into a parser header file using just one command.

tgf basic_arithmetic.tgf gen

Finally, include the parser header file in your C++ program and initialize your parser object. The parser object can then parse any valid input string in your grammar and return a parse forest or a syntax error.

#include "basic_arithmetic_parser.generated.h"

int main() {

	std::string input = "-12*5"; // input expression

	// run parser and get a parse result
	basic_arithmetic_parser::result r = basic_arithmetic_parser::instance()
					.parse(input.c_str(), input.size());

	// check if any successfully parsed tree was found
	if (!r.found) {
		std::cerr << r.parse_error << std::endl;
		return 1;
	}

	// get the parse tree from the result and print it to stdout
	r.get_tree()->to_print(std::cout);

	return 0;
}

Above program prints:

start(3)[0, 5]
	_(4)[0, 0]
	expr(5)[0, 5]
		mul(12)[0, 5]
			expr(5)[0, 3]
				neg(18)[0, 3]
					'-'[0, 1]
					_(4)[1, 1]
					expr(5)[1, 3]
						number(20)[1, 3]
							digit(2)[1, 2]
								'1'[1, 2]
							digit(2)[2, 3]
								'2'[2, 3]
			_(4)[3, 3]
			'*'[3, 4]
			_(4)[4, 4]
			expr(5)[4, 5]
				number(20)[4, 5]
					digit(2)[4, 5]
						'5'[4, 5]
	_(4)[5, 5]

Features

  • Boolean grammars.

  • C++ API (docs/programatic_grammar.md).

  • TGF - Tau Grammar Format for describing grammars in an EBNF-like format.

  • Earley based parser with garbage collection producing parse forest containing a compressed representation of all the parsed trees.

  • Forest traversal API with callbacks and extraction of trees and graphs to deal with transformations and with ambiguity.

  • tgf tool to test grammars in TGF file or to generate a C++ code representing a parser for a given TGF file.

  • Predefined or custom classes of characters defined by a function (<cctype>-like functions).

  • Optional recoding of an input string into another type which can be used for lexing, reencoding or accessing binary data.

  • own UTF-8 support.

How to use

To build the library, the tgf tool requires CMake and a compiler supporting C++23 with GNU extensions (-std=gnu++23). Then, for Windows with Mingw-w64 you can use w64-debug.sh or w64-release.sh script, and for Linux there are debug.sh and release.sh scripts for compiling all source files. After successful compilation you can find the artifacts in build-Debug and build-Release folders.

TGF tool

This library comes with a CLI executable tgf which features viewing, testing and debugging grammars written in TGF and it can also generate a C++ code from TGF grammar which is then usable in a C++ project.

More detailed information about this CLI can be found on page TGF tool

TGF is an EBNF-based format to describe grammars. Description of this format can be found on page Tau Grammar Format.

Tutorials

CSV parser

Quickly learn how to use this library in a beginner tutorial for a CSV parser.

Part 1 shows a very simple parser with all that is required. Each following part expands the previous one. Read comments since they are always related to a new or a changed code.

  • part 1 - minimal parser for positive integers
  • part 2 - using predefined character classes
  • part 3 - refactoring parser into its components
  • part 4 - also parse negative integers
  • part 5 - parse strings and nulls
  • part 6 - parsing comma separated values
  • part 7 - parsing new line separated rows
  • part 8 - replacing the programmatically created grammar with a grammar in TGF
  • part 9 - using EBNF syntax in TGF
  • part 10, csv.tgf
  • using tgf tool to generate a parser from a TGF file

JSON parser

An example of a complete JSON parser that reads a JSON element from a file, traverses the resulting tree and outputs a string representing the original JSON data.

API

Basic terminology

The following terminology is standard:

Literal is either a terminal or a non-terminal.

A terminal literal is a fixed single character (from the chosen alphabet).

A non-terminal literal represents a syntactic category or a placeholder that can be replaced by a sequence of terminals and/or non-terminals. Non-terminal literals are used to define the structure of the language.

A production rule consists of two parts: a left-hand side (LHS) and a right-hand side (RHS). The left-hand side contains a single non-terminal symbol, and the right-hand side contains a sequence of terminals and non-terminals. The production rule indicates that the non-terminal on the left-hand side can be rewritten or expanded into the sequence of symbols on the right-hand side.

Grammar is a formal system that defines a set of strings (a language). It consists of terminals, non-terminals, production rules and a starting symbol (which is a non-terminal).

Conjunctive Grammar is a grammar which allows usage of conjunctions of literal sequences in production rules. The conjunction acts as a logical "AND" operation, indicating that all the conjuncted literal sequences must be matched for the whole rule to be satisfied.

Boolean Grammar is a conjunctive grammar which also allows usage of negation of literal sequences. Negation acts as a logical "NOT" operation, indicating that if the negated literal sequence is matched the whole rule fails to be satisfied.

DNF or Disjunctive Normal form is a specific form of representing logical formulas, Boolean expressions or production rules in our case. In DNF, a logical formula is represented as a disjunction (OR) of one or more conjunctions (AND) of literals.

EBNF or Extended Backus–Naur form is a family of metasyntax notations, any of which can be used to express a context-free grammar.

TGF or Tau Grammar Format is an EBNF based format to describe grammars which are usable by this library. TGF understands EBNF-like syntax (+, *, [] and {}) so it is easier to define repetition or optionality.

Namespace

The library uses an idni namespace for its declarations.

For simplicity, all examples in library documentation expect declaration using namespace idni;.

Templates

Most of the structures provided by this library are templates in order to support arbitrary alphabets (terminals) as well as to generalize the concept of "string". Usually they are declared like this:

template <typename C = char, typename T = C> ... where

  • C is the input type of the input data (usually characters).
  • T is the type of terminals.

Fully supported and tested types are char and char32_t (including Unicode). Though it is possible to supply encoder and decoder functions to the parser to convert input elements of any type into terminals of any other type.

If T differs from C it is required to provide decoder and encoder functions in parser::options. parser uses these decoders to convert input data into terminals and back. See recoders for more information.

Classes and structs

Click on the type name in the following list to see more detailed documentation of its API:

  • nonterminals - id map of non-terminals to their names
  • lit, lits, conjs, disjs, prod and prods - basic building blocks for creating a grammar programmatically. prods represents a grammar containing production rules as disjunctions of conjunctions of sequences of literals.
  • char_class_fn, char_class_fns - Character class function is an alternative way to create a production rule. It is a function returning true if a terminal character is a member of a character class. It can be custom or there are predefined cctype-like functions (isalnum, isdigit...) provided. char_class_fns is a container for such functions.
  • shaping_options - options for shaping a parsed tree
  • grammar::options - options for grammar
  • grammar - represents a grammar created from production rules prods and character class functions char_class_fns.
  • parser::options - plain struct of options changing behavior of the parser
  • parser - uses a grammar for parsing an input and produces a result
  • parser::result - result of parsing containing the parsed forest
  • parser::error - detailed information about a parse error
  • parser::input - wrapper for an input data pointer, an input stream or a file. It provides an API to access it from a parser or a decoder
  • parser::pnode - parser's forest node
  • forest - a result of parsing containing all the parsed trees. Provides means of traversal or extraction of trees or graphs.
  • forest::graph - a graph extracted from a forest
  • forest::tree - a tree extracted from a graph
  • tgf - Tau Grammar Format parser - reads a grammar in TGF format from a string or a file. An alternative way to describe a grammar instead of creating it programmatically.
  • traverser - struct for traversing and accessing rewriter trees
  • rewriting - API for rewriting a resulting parse tree
  • measure - simple struct for measuring time

Classes and structs usable for building command line interfaces

  • cli - provides CLI arguments management and processing
  • repl - provides terminal (currently only Linux is supported) REPL UI
  • term_colors - provides terminal color escape codes

Functions

There are several areas covered by functions provided by this library

  • predefined character class functions - cctype-like predefined character class functions
  • recoders - decoder and encoder for UTF-8 in char <-> Unicode in char32_t
  • UTF-8 - UTF-8 and Unicode support
  • devhelpers - helper forest transformations to various formats (TML facts, TML rules, DOT) useful when developing a parser

About

IDNI's parser library

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 6