Commit 03319f9

Add parser_work document
1 parent 20464a9 commit 03319f9

parser_work.md

Lines changed: 263 additions & 0 deletions
@@ -0,0 +1,263 @@
1+
Working on Parser Logic
2+
===
3+
4+
This document contains advice related to doing grammar / parser work.
5+
6+
From Grammar to Ruby
7+
---
8+
The grammar is described in a `.ra` (racc) file. For the "future parser", this is in
9+
lib/puppet/pops/parser/egrammar.ra and it is combined with the parser_support.rb file in the
10+
same directory and processed by racc. The output is the resulting parser (in eparser.rb).
11+
12+
Never modify the `eparser.rb` by hand.
13+
14+
Merge conflicts
15+
---
16+
If a merge produces conflicts in the generated `eparser.rb`, do not resolve them by hand. Simply touch `egrammar.ra` (unless resolving merge conflicts already changed it), and
17+
then rebuild the parser by running make in the same directory.
18+
19+
The resulting `eparser.rb` should be checked in.
20+
21+
The eparser.rb and Racc runtime
22+
---
23+
If you look inside the `eparser.rb` file, you see several tables and a set of methods.
24+
The tables are used by the racc runtime (written in C and part of Ruby), and it calls back to
25+
the methods that implement the actions that were expressed in the grammar.
26+
27+
Note that the file contains source file/line references to the grammar file, thus ensuring
28+
that runtime exceptions appear to come from the grammar file (as they should).
29+
30+
Grammar Ambiguities
31+
---
32+
If you are working with grammar changes, you may run into ambiguity problems. There are two kinds of conflicts:
33+
34+
* shift/reduce
35+
* reduce/reduce
36+
37+
Both of these conflicts mean that racc cannot determine what to do when the sequence of source tokens has brought it to a particular state.
38+
39+
A "shift" can be read as "tell me more", and "reduce" as "got it". So a shift/reduce is an ambiguity
40+
where the grammar says that it is OK both to accept the state as complete and to continue and build something more elaborate. A reduce/reduce is trickier, since this means we reached a state
41+
where several rules could be completed and there appears to be no way to choose between them.
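
To make "shift" and "reduce" concrete, here is a deliberately naive sketch in Ruby. It is not how racc works internally (racc drives the parse from generated state tables); the grammar `sum : NUMBER | sum PLUS NUMBER` and the method name `parse_sum` are made up for this illustration.

```ruby
# Hypothetical grammar:  sum : NUMBER | sum PLUS NUMBER
# Returns a trace of the shift/reduce decisions taken for the given tokens.
def parse_sum(tokens)
  stack = []
  trace = []
  input = tokens.dup
  loop do
    if stack == [:NUMBER]
      # "got it": a complete right-hand side is on the stack
      stack = [:sum]
      trace << 'reduce sum : NUMBER'
    elsif stack == [:sum, :PLUS, :NUMBER]
      stack = [:sum]
      trace << 'reduce sum : sum PLUS NUMBER'
    elsif input.empty?
      break
    else
      # "tell me more": move the next token onto the stack
      stack << input.shift
      trace << "shift #{stack.last}"
    end
  end
  raise ArgumentError, "syntax error, stack: #{stack.inspect}" unless stack == [:sum]
  trace
end

puts parse_sum([:NUMBER, :PLUS, :NUMBER])
```

In this toy grammar there is never a point where both a shift and a reduce are valid, so there is no conflict. A conflict is exactly the situation where two branches of such a decision would apply at once.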
42+
43+
There are several reasons why shift/reduce and reduce/reduce conflicts occur:
44+
45+
* The language is truly ambiguous, i.e. there is no way to differentiate between two or more
46+
choices. This is a poorly designed language feature, and it can only be fixed completely by changing
47+
the language. It is however possible to make such a problem less of an issue by hardcoding
48+
a decision and thus blocking one interpretation of the input from occurring. When doing so, the
49+
trick is to make this happen in a very dark corner of the programming language; i.e. in
50+
a sequence that is of little practical use. In all cases, avoid having ambiguities in the
51+
language.
52+
53+
* Racc only performs one token of lookahead across rule boundaries; any lookahead beyond that must
54+
take place in one and the same rule!
55+
To resolve these, you can introduce additional states, you can roll up rules into a larger rule, or
56+
you can assign precedence to rules.
57+
* "Flattening" the grammar means that you spell out a sequence of tokens even if it would
58+
be less repetition to refer to a rule. Remember, breaking up sequences into rules is not
59+
just syntactic, it changes how the parser works.
60+
* Adding states means that you break up sequences into pieces that are unambiguous, thus making
61+
it possible for the grammar to reduce them. Unfortunately this makes the grammar quite abstract
62+
and hard to read.
63+
* Assigning precedence to rules can solve a shift/reduce. The decision with the highest precedence
64+
will win. When doing this for rules, great care must be taken as it may mean that certain rules
65+
can never be triggered.
66+
67+
* The language is an expression language and racc can not on its own determine the priority
68+
of operators - e.g. in 1 + 2 * 3, should the addition or the multiplication be performed first?
69+
Issues of this kind are easy to solve by giving operators a precedence.
70+
71+
As a rule of thumb, do not try to implement all semantics of the language in the grammar. It
72+
is far better to make the grammar parse nonsensical input and then validate the result than
73+
trying to capture all semantics via grammar rules. This makes the grammar simpler and there
74+
are far fewer grammar conflicts to deal with.
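
As a minimal sketch of "parse first, validate afterwards": assume the grammar accepts any expression on the left-hand side of an assignment and leaves it to a later pass to reject the nonsensical cases. The node types and the `validate` method below are hypothetical and are not taken from the Puppet code base.

```ruby
# Hypothetical AST node types, for illustration only.
Variable   = Struct.new(:name)
Literal    = Struct.new(:value)
Assignment = Struct.new(:lhs, :rhs)

# Walks the tree produced by a permissive parser and collects error
# messages, instead of relying on grammar rules to reject the input.
def validate(node, errors = [])
  if node.is_a?(Assignment)
    errors << "cannot assign to #{node.lhs.class.name}" unless node.lhs.is_a?(Variable)
    validate(node.lhs, errors)
    validate(node.rhs, errors)
  end
  errors
end

ok  = Assignment.new(Variable.new('a'), Literal.new(1))
bad = Assignment.new(Literal.new(1), Literal.new(2))

p validate(ok)
p validate(bad)
```

The grammar then needs only one assignment rule, and the error message can be far more specific than a generic syntax error.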
75+
76+
### How Racc signals Ambiguities
77+
78+
When the grammar (.ra file) is processed, racc outputs information about unused/useless rules
79+
and the number of shift/reduce, and reduce/reduce conflicts. It will still produce a parser, so
80+
you must pay attention to this output. If you see conflicts it means that certain parts of
81+
the grammar may be unreachable (racc has built in defaults that **may** be what you want, but
82+
it is most often by accident).
83+
84+
When an ambiguity is reported, you need to generate a more detailed report. You do that by running
85+
the makefile target `egrammar.output`. This produces a file with the same name. At the top of this
86+
file, you will find a more detailed report of which states/rules are involved in the ambiguity.
87+
88+
It may for instance say:
89+
90+
state 168 contains one shift/reduce conflict
91+
92+
To find what this means, search for "state 168". It is probably mentioned in several places
93+
in a "go to state 168" action; keep searching until you find the state itself. There you find a description
94+
of that exact state; how it got there, and what racc considers at that point.
95+
96+
Here is a simple example:
97+
98+
state 66
99+
100+
7) syntactic_statements : syntactic_statements syntactic_statement _
101+
9) syntactic_statement : syntactic_statement _ COMMA expression
102+
103+
COMMA shift, and go to state 68
104+
$default reduce using rule 7 (syntactic_statements)
105+
106+
The current position within each rule is shown with an `_`, thus we are looking at the state where the parser
107+
has seen a `syntactic_statement`. The decision table below the rules shows that if it sees a COMMA, it will shift to
108+
state 68 (to deal with the expression), and if not, it will reduce rule 7 (it will add one
109+
syntactic_statement to the list of syntactic_statements).
110+
111+
If there is a conflict on a token/rule, it will be listed multiple times in the decision table.
112+
Say if there was a conflict on the COMMA, it may be shown as:
113+
114+
COMMA shift, and go to state 68
115+
COMMA reduce using rule 666 (the_trouble_rule)
116+
$default reduce using rule 7 (syntactic_statements)
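
The two decision tables above can be modeled as plain Ruby hashes to show what "listed multiple times" means. This is an illustrative model only: the real tables in `eparser.rb` are encoded differently, and the names `CLEAN_STATE`, `CONFLICTING_STATE`, `conflicts` and `action_for` are made up.

```ruby
# Each lookahead token maps to the list of actions reported for it.
CLEAN_STATE = {
  'COMMA'    => [[:shift, 68]],
  '$default' => [[:reduce, 7]]
}.freeze

CONFLICTING_STATE = {
  'COMMA'    => [[:shift, 68], [:reduce, 666]],
  '$default' => [[:reduce, 7]]
}.freeze

# Returns the tokens that have more than one possible action.
def conflicts(table)
  table.select { |_token, actions| actions.size > 1 }.keys
end

# Picks the action for a lookahead token, falling back to $default.
# Taking the first listed action is an arbitrary choice made by this sketch.
def action_for(table, token)
  table.fetch(token) { table.fetch('$default') }.first
end

p conflicts(CLEAN_STATE)
p conflicts(CONFLICTING_STATE)
p action_for(CLEAN_STATE, 'RBRACE')
```

Any token other than COMMA falls through to `$default`, which is why a state with a conflict can still appear to work for most input.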
117+
118+
119+
### How to find where the problem is
120+
121+
Each conflicting token/rule-pair is displayed in the output (as shown above), thus if the
122+
same COMMA is involved in 3 conflicts, you will see 6 entries. The valuable piece of information is the name of the reduction rule in conflict. At this point, try to manually construct the sequence of input tokens that would lead up to the ambiguity.
123+
It may be that the problem in the grammar is "before" reaching the ambiguity
124+
on the COMMA. Once you understand the sequence, you need to apply reasoning to find the resolution
125+
of the problem.
126+
127+
If that proves to be hard, and your grammar produces a viable parser, you can build a
128+
debugging parser, and turn on debugging output in the runtime. This gives you a trace of
129+
what the parser decides when it parses the input. This is sometimes easier than manually
130+
following the state changes using only the .output file. Often, you need both, because the trace
131+
only tells you which of the alternatives it took, not what the alternatives were.
132+
133+
And yes, this is extremely tedious and time-consuming. You will most certainly want to run this
134+
on as little source input as possible to avoid having your head explode.
135+
136+
### Generating a Debugging Parser
137+
138+
To generate a debugging parser, run the make target `egrammar.debug`. This creates an
139+
eparser.rb (it overwrites the non-debugging variant). (**Do not check in this parser**, it is
140+
much slower than the non-debugging variant).
141+
142+
### Turning on Debug output
143+
144+
To turn on debug output, you need to set an instance variable. You do this in `parser_support.rb`
145+
in the `_parse()` method. Simply change the line that by default reads:
146+
147+
@yydebug = false
148+
149+
to
150+
151+
@yydebug = true
152+
153+
Again, **Do not check in this change**.
154+
155+
Note that `@yydebug = true` does nothing unless the parser is built for debugging, i.e. you
156+
do not have to change it when switching between the regular and the debugging version.
157+
158+
### Running with debugging on
159+
160+
When you run with debugging turned on, the trace will be printed to stdout, and each
161+
decision (reading a token, shifting to another rule/state, or reducing a rule)
162+
is printed out.
163+
164+
Armed with that output and the .output file, you can now manually step through the grammar.
165+
166+
### Limiting the scope
167+
168+
Sometimes it is just impossible to figure out what is going wrong in a complex grammar.
169+
You can try reducing the grammar by simply commenting out large sections of the grammar. Repeat this
170+
for as long as the problem still occurs. When a change makes the problem disappear, revert that change, then continue elsewhere until you have the smallest possible reproducer.
171+
172+
Fixing Problems
173+
---
174+
175+
### Precedence
176+
177+
Precedence is expressed in a table at the beginning of the grammar. It lists the precedence
178+
from high (at the top) to low (at the bottom), and for each token (real or pseudo token)
179+
the associativity (`left`, `right`, `nonassoc`) is expressed before a token (or list of tokens).
180+
181+
e.g.
182+
183+
prechigh
184+
left HIGH
185+
nonassoc UMINUS
186+
left TIMES DIV MODULO
187+
left MINUS PLUS
188+
right EQUALS
189+
left LOW
190+
preclow
191+
192+
The associativity tells racc how to group input with the same precedence; i.e. should 1 + 2 + 3 be treated as (1 + 2) + 3, or 1 + (2 + 3). A nonassoc means that racc does not allow this multiple
193+
times in a row, e.g. for a binary operator OP declared nonassoc, `a OP b OP c` is a syntax error.
194+
195+
The example above shows two pseudo tokens HIGH and LOW that can be used in the grammar to
196+
make a rule have a certain precedence.
197+
198+
We can now express an otherwise ambiguous grammar like this:
199+
200+
expr
201+
: expr PLUS expr
202+
| expr MINUS expr
203+
| expr TIMES expr
204+
| expr DIV expr
205+
| MINUS expr =UMINUS
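
To make the effect of such a table concrete, here is a sketch of a small hand-written evaluator in Ruby that is driven by a similar table (a technique known as precedence climbing). This is not what racc does internally; racc uses the table to resolve shift/reduce conflicts when it generates the parser. The names `PRECEDENCE`, `evaluate` and `parse_operand` are made up, and operators are written as strings rather than token names.

```ruby
# Binding strength and associativity per operator; a higher number binds tighter.
# Mirrors the example table: TIMES/DIV above PLUS/MINUS, all left-associative.
PRECEDENCE = {
  '*' => [2, :left], '/' => [2, :left],
  '+' => [1, :left], '-' => [1, :left]
}.freeze

# Evaluates tokens (an array of Integers and operator strings), consuming
# only operators that bind at least as tightly as min_prec.
def evaluate(tokens, min_prec = 1)
  left = parse_operand(tokens)
  while (op = tokens.first) && PRECEDENCE.key?(op) && PRECEDENCE[op][0] >= min_prec
    prec, assoc = PRECEDENCE[tokens.shift]
    # For a left-associative operator the right operand may only
    # contain operators that bind tighter than the operator itself.
    next_min = assoc == :left ? prec + 1 : prec
    right = evaluate(tokens, next_min)
    left = left.public_send(op, right)
  end
  left
end

# A unary minus binds tighter than any binary operator (like UMINUS in the table).
def parse_operand(tokens)
  token = tokens.shift
  token == '-' ? -parse_operand(tokens) : token
end

p evaluate([1, '+', 2, '*', 3])
p evaluate([1, '-', 2, '-', 3])
```

The first call groups as 1 + (2 * 3) because TIMES has the higher precedence; the second groups as (1 - 2) - 3 because MINUS is left-associative.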
206+
207+
### "Descent Precedence"
208+
209+
Alternatively, we can deal with precedence by grouping the expressions having the same precedence:
210+
211+
expr
212+
: mulexp # to higher precedence
213+
| expr PLUS mulexp
214+
| expr MINUS mulexp
215+
216+
mulexp
217+
: primary # to higher precedence
218+
| mulexp TIMES primary
219+
| mulexp DIV primary
220+
221+
primary
222+
: NUMBER
223+
224+
This has the same effect as setting the precedence and associativity in the precedence
225+
table.
226+
227+
I named this "Descent Precedence" since this mimics the behavior of a "recursive descent parser",
228+
the type of parser that is usually written by hand.
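
For comparison, here is a sketch of what such a hand-written recursive descent parser could look like in Ruby, with one method per rule of the grammar above. The class name `DescentParser` is made up, and operators are written as strings rather than token names. The left-recursive alternatives are written as loops, since a recursive descent parser cannot call itself on the leftmost symbol without recursing forever.

```ruby
# One method per grammar rule. Each consumes tokens from the front of the
# array and returns a nested array that shows how the input was grouped.
class DescentParser
  def initialize(tokens)
    @tokens = tokens.dup
  end

  # expr : mulexp | expr PLUS mulexp | expr MINUS mulexp
  def expr
    node = mulexp
    node = [@tokens.shift, node, mulexp] while ['+', '-'].include?(@tokens.first)
    node
  end

  # mulexp : primary | mulexp TIMES primary | mulexp DIV primary
  def mulexp
    node = primary
    node = [@tokens.shift, node, primary] while ['*', '/'].include?(@tokens.first)
    node
  end

  # primary : NUMBER
  def primary
    token = @tokens.shift
    raise ArgumentError, "expected a number, got #{token.inspect}" unless token.is_a?(Integer)
    token
  end
end

p DescentParser.new([1, '+', 2, '*', 3]).expr
p DescentParser.new([1, '-', 2, '-', 3]).expr
```

The nesting of the methods encodes the precedence (`mulexp` is reached only through `expr`), and the loops encode left associativity.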
229+
230+
### Assigning the precedence
231+
232+
The precedence of a rule can be assigned like this:
233+
234+
| MINUS expression =UMINUS
235+
236+
This means that the lexer delivers a MINUS token, and when that is followed by an
237+
expression, the result is a UMINUS operation. If we did not assign =UMINUS, the rule
238+
would be given the precedence of MINUS.
239+
240+
Other options for fixing problems
241+
---
242+
243+
### Creating look ahead / look behind in the lexer
244+
245+
Sometimes it is possible to solve an issue by doing a bit more work in the lexer. As an example,
246+
the puppet grammar has LBRACK and LISTSTART tokens that are issued for the input '['. The lexer
247+
can differentiate between the tokens: a LISTSTART is issued if the '[' is at the beginning of the input or after whitespace. This helps make input such as `$a[1]` and `$a [1]` unambiguous before it is fed to the grammar
248+
(where it is impossible to differentiate between them due to whitespace tokens not being part of
249+
the information sent to the parser). (This is actually an example of "look behind").
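
The look-behind described above can be sketched in a few lines of Ruby. This is a hypothetical mini lexer, not the Puppet lexer: the method name `tokenize`, the token set, and the regular expressions are assumptions made for the illustration. Only the LBRACK/LISTSTART distinction follows the text.

```ruby
require 'strscan'

# Emits :LISTSTART for '[' at the start of the input or after whitespace,
# and :LBRACK otherwise. Whitespace itself is never sent to the parser.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  after_whitespace = true # the start of the input counts as "after whitespace"
  until scanner.eos?
    if scanner.scan(/\s+/)
      after_whitespace = true
      next
    end
    if scanner.scan(/\[/)
      tokens << (after_whitespace ? :LISTSTART : :LBRACK)
    elsif scanner.scan(/\]/)
      tokens << :RBRACK
    elsif scanner.scan(/\$\w+/)
      tokens << :VARIABLE
    elsif scanner.scan(/\d+/)
      tokens << :NUMBER
    else
      raise ArgumentError, "unexpected character #{scanner.getch.inspect}"
    end
    after_whitespace = false
  end
  tokens
end

p tokenize('$a[1]')
p tokenize('$a [1]')
```

The look-behind costs a single boolean that is updated as the lexer moves forward, so no character is visited twice.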
250+
251+
Beware that any lookahead in the lexer is very expensive since it visits each and every
252+
character in the source file. Look-behinds are cheap in comparison.
253+
254+
255+
Literature
256+
===
257+
258+
There is almost no documentation for racc. Luckily, it is a Ruby port of Yacc, and almost everything that is described for Yacc also applies to Racc (with the major exception that Racc uses rules
259+
written in Ruby, and that the runtime methods are slightly different).
260+
261+
The best book on the topic is O'Reilly's "lex & yacc". If you want to learn more about parsers, see "Compilers: Principles, Techniques, and Tools" (Aho et al.), also known as "the dragon book".
262+
263+
