The Tau parsing library is a high-performance, industrial-strength, continuously maintained parsing tool. It is a C++ library that is able to parse, and generate a parser, to any Context Free Grammar (CFG) and even all Boolean grammars. Compared to existing parsing solutions, the Tau Parsing Library stands out with its simplicity and robustness: just give it a grammar and an input string, and it will give you a representation of all possible parse trees, without any need for additional information, configuration, or callbacks that other CFG parsers commonly require you to provide.
Generate a parser in 3 easy steps:
First, define the grammar using the Tau Grammar Format (TGF), which is very similar to EBNF.
@use char classes space, digit. # introduce space and digit native productions
start => _ expr _.
expr => ( '(' _ expr _ ')' ):group
| ( expr _ '-' _ expr ):sub
| ( expr _ '+' _ expr ):add
| ( expr _ '/' _ expr ):div
| ( expr _ '*' _ expr ):mul
| ( '+' _ expr ):pos
| ( '-' _ expr ):neg
| ( digit+ ):number.
_ => space*. # optional white space of any length
Second, use the TGF command line utility to translate your TGF file into a parser header file using just one command.
tgf basic_arithmetic.tgf gen
Finally, include the parser header file in your C++ program and initialize your parser object. The parser object can then parse any valid input string in your grammar and return a parse forest or a syntax error.
#include "basic_arithmetic_parser.generated.h"
int main() {
std::string input = "-12*5"; // input expression
// run parser and get a parse result
basic_arithmetic_parser::result r = basic_arithmetic_parser::instance()
.parse(input.c_str(), input.size());
// check if any successfully parsed tree was found
if (!r.found) {
std::cerr << r.parse_error << std::endl;
return 1;
}
// get the parse tree from the result and print it to stdout
r.get_tree()->to_print(std::cout);
return 0;
}
Above program prints:
start(3)[0, 5]
_(4)[0, 0]
expr(5)[0, 5]
mul(12)[0, 5]
expr(5)[0, 3]
neg(18)[0, 3]
'-'[0, 1]
_(4)[1, 1]
expr(5)[1, 3]
number(20)[1, 3]
digit(2)[1, 2]
'1'[1, 2]
digit(2)[2, 3]
'2'[2, 3]
_(4)[3, 3]
'*'[3, 4]
_(4)[4, 4]
expr(5)[4, 5]
number(20)[4, 5]
digit(2)[4, 5]
'5'[4, 5]
_(4)[5, 5]
-
Boolean grammars.
-
C++ API (docs/programatic_grammar.md).
-
TGF - Tau Grammar Format for describing grammars in an EBNF-like format.
-
Earley based parser with garbage collection producing parse forest containing a compressed representation of all the parsed trees.
-
Forest traversal API with callbacks and extraction of trees and graphs to deal with transformations and with ambiguity.
-
tgftool to test grammars in TGF file or to generate a C++ code representing a parser for a given TGF file. -
Predefined or custom classes of characters defined by a function (
<cctype>-like functions). -
Optional recoding of an input string into another type which can be used for lexing, reencoding or accessing binary data.
To build the library, the tgf tool requires CMake and a compiler supporting C++23 with GNU extensions (-std=gnu++23).
Then, for Windows with Mingw-w64 you can use w64-debug.sh or w64-release.sh script, and for Linux there are debug.sh and release.sh scripts for compiling all source files. After successful compilation you can find the artifacts in build-Debug and build-Release folders.
This library comes with a CLI executable tgf which features viewing, testing and debugging grammars written in TGF and it can also generate a C++ code from TGF grammar which is then usable in a C++ project.
More detailed information about this CLI can be found on page TGF tool
TGF is an EBNF-based format to describe grammars. Description of this format can be found on page Tau Grammar Format.
Quickly learn how to use this library in a beginner tutorial for a CSV parser.
Part 1 shows a very simple parser with all that is required. Each following part expands the previous one. Read comments since they are always related to a new or a changed code.
- part 1 - minimal parser for positive integers
- part 2 - using predefined character classes
- part 3 - refactoring parser into its components
- part 4 - also parse negative integers
- part 5 - parse strings and nulls
- part 6 - parsing comma separated values
- part 7 - parsing new line separated rows
- part 8 - replacing the programmatically created grammar with a grammar in TGF
- part 9 - using EBNF syntax in TGF
- part 10, csv.tgf
- using tgf tool to generate a parser from a TGF file
An example of a complete JSON parser that reads a JSON element from a file, traverses the resulting tree and outputs a string representing the original JSON data.
The following terminology is standard:
Literal is either a terminal or a non-terminal.
A terminal literal is a fixed single character (from the chosen alphabet).
A non-terminal literal represents a syntactic category or a placeholder that can be replaced by a sequence of terminals and/or non-terminals. Non-terminal literals are used to define the structure of the language.
A production rule consists of two parts: a left-hand side (LHS) and a right-hand side (RHS). The left-hand side contains a single non-terminal symbol, and the right-hand side contains a sequence of terminals and non-terminals. The production rule indicates that the non-terminal on the left-hand side can be rewritten or expanded into the sequence of symbols on the right-hand side.
Grammar is a formal system that defines a set of strings (a language). It consists of terminals, non-terminals, production rules and a starting symbol (which is a non-terminal).
Conjunctive Grammar is a grammar which allows usage of conjunctions of literal sequences in production rules. The conjunction acts as a logical "AND" operation, indicating that all the conjuncted literal sequences must be matched for the whole rule to be satisfied.
Boolean Grammar is a conjunctive grammar which also allows usage of negation of literal sequences. Negation acts as a logical "NOT" operation, indicating that if the negated literal sequence is matched the whole rule fails to be satisfied.
DNF or Disjunctive Normal form is a specific form of representing logical formulas, Boolean expressions or production rules in our case. In DNF, a logical formula is represented as a disjunction (OR) of one or more conjunctions (AND) of literals.
EBNF or Extended Backus–Naur form is a family of metasyntax notations, any of which can be used to express a context-free grammar.
TGF or Tau Grammar Format is an EBNF based format to describe grammars which are usable by this library. TGF understands EBNF-like syntax (+, *, [] and {}) so it is easier to define repetition or optionality.
The library uses an idni namespace for its declarations.
For simplicity, all examples in library documentation expect declaration using namespace idni;.
Most of the structures provided by this library are templates in order to support arbitrary alphabets (terminals) as well as to generalize the concept of "string". Usually they are declared like this:
template <typename C = char, typename T = C> ... where
Cis the input type of the input data (usually characters).Tis the type of terminals.
Fully supported and tested types are char and char32_t (including Unicode). Though it is possible to supply encoder and decoder functions to the parser to convert input elements of any type into terminals of any other type.
If T differs from C it is required to provide decoder and encoder functions in parser::options. parser uses these decoders to convert input data into terminals and back. See recoders for more information.
Click on the type name in the following list to see more detailed documentation of its API:
nonterminals- id map of non-terminals to their nameslit,lits,conjs,disjs,prodandprods- basic building blocks for creating a grammar programmatically.prodsrepresents a grammar containing production rules as disjunctions of conjunctions of sequences of literals.char_class_fn,char_class_fns- Character class function is an alternative way to create a production rule. It is a function returning true if a terminal character is a member of a character class. It can be custom or there are predefined cctype-like functions (isalnum, isdigit...) provided.char_class_fnsis a container for such functions.shaping_options- options for shaping a parsed treegrammar::options- options for grammargrammar- represents a grammar created from production rulesprodsand character class functionschar_class_fns.parser::options- plain struct of options changing behavior of the parserparser- uses agrammarfor parsing an input and produces aresultparser::result- result of parsing containing the parsed forestparser::error- detailed information about a parse errorparser::input- wrapper for an input data pointer, an input stream or a file. It provides an API to access it from a parser or a decoderparser::pnode- parser's forest nodeforest- a result of parsing containing all the parsed trees. Provides means of traversal or extraction of trees or graphs.forest::graph- a graph extracted from a forestforest::tree- a tree extracted from a graphtgf- Tau Grammar Format parser - reads a grammar in TGF format from a string or a file. An alternative way to describe a grammar instead of creating it programmatically.traverser- struct for traversing and accessing rewriter treesrewriting- API for rewriting a resulting parse treemeasure- simple struct for measuring time
- cli - provides CLI arguments management and processing
- repl - provides terminal (currently only Linux is supported) REPL UI
- term_colors - provides terminal color escape codes
There are several areas covered by functions provided by this library
- predefined character class functions - cctype-like predefined character class functions
- recoders - decoder and encoder for UTF-8 in
char<-> Unicode inchar32_t - UTF-8 - UTF-8 and Unicode support
- devhelpers - helper forest transformations to various formats (TML facts, TML rules, DOT) useful when developing a parser