
Commit c81d9b3 ("Update docs")
1 parent 66806d7

20 files changed: +2675 -132 lines
blog-posts/how-puppet-works.md

Lines changed: 2 additions & 3 deletions
@@ -1,6 +1,6 @@
 Getting your Puppet Ducks in a Row
 ===
-A conversation that comes up frequently is if the Puppet Programming Language is declarative or not. This is usually the topic when someone has been fighting with how order of evaluation works and have
+A conversation that comes up frequently is whether the Puppet Programming Language is declarative or not. This is usually the topic when someone has been fighting with how order of evaluation works on the master side, and has
 been beaten by what sometimes may seem like random behavior. In this post I want to explain how
 Puppet works and try to straighten out some of the misconceptions.

@@ -10,8 +10,7 @@ First, lets get the terminology right (or this will remain confusing). It is com
 
 Parse Order is the order in which Puppet reads puppet manifests (`.pp`) from disk, turns them into tokens and checks their grammar. The result is something that can be evaluated (technically an Abstract Syntax Tree (AST)). The order in which this is done is actually of minor importance from a user perspective; you really do not need to think about how an expression such as `$a = 1 + 2` becomes an AST.
 
-OTOH, if you think about "parse order" as "the order the files are parsed", but this order is
-also of minor importance. Puppet starts with the `site.pp` file (or possibly the `code` setting in the configuration), then asking external services (such as the ENC) for additional things that are not included in the logic that is loaded from the `site.pp`. In versions from 3.5.0 the manifest setting can also refer to a directory of `.pp` files (preferred over using the now deprecated `import` statement).
+I have heard users talk about "parse order" with the meaning "the order the files are loaded and parsed", which to me is a bit of a stretch since that involves several additional concepts (as you will see later). The overall ordering of the execution is that Puppet starts with the `site.pp` file (or possibly the `code` setting in the configuration), then asks external services (such as the ENC) for additional things that are not included in the logic loaded from `site.pp`. In versions from 3.5.0 the manifest setting can also refer to a directory of `.pp` files (preferred over using the now deprecated `import` statement).
 
 At this point (after having started and after having parsed the initial manifest(s)), Puppet first matches the information about the node making a request for a catalog with the available node definitions, and selects the first matching node definition. At this point Puppet has the notion of:

blog-posts/internals/location/location.md

Lines changed: 79 additions & 46 deletions
@@ -1,47 +1,58 @@
-In post about Puppet Internals I am going to cover how source location is
-handled by the lexer and parser.
+In this post about Puppet Internals I am going to cover how source location information is
+handled by the future lexer and parser. This also shows a concrete case of using the
+**adapter pattern** introduced in [Puppet Internals - the Adapter Pattern][1].
+
+[1]: http://puppet-on-the-edge.blogspot.se/2014/02/puppet-internals-adapter-pattern.html
 
 ### Rationale for Detailed Position Information
 
 It is important to have detailed position information when errors occur,
 as the user programming in the Puppet Programming Language would otherwise have
 to guess about where in the source text the problem is.
 
+Position information is not only wanted for syntax errors and other errors immediately
+detected by the parser; this information is perhaps even more valuable when errors occur
+later and it is important to know where a particular value originated.
+
 To date, this output has consisted of file and line only. While this is enough in
-many situations there are many cases where there is no reasonable text to output (say
-using an operator like '+' the wrong way, and there are 5 '+' on the same line), just knowing
+most situations, there are also many cases where the line alone is not enough (say
+using an operator like '+' the wrong way, and there are five '+' on the same line); just knowing
 that there is something wrong with one of the '+' on line 3 is not that great. What we
-want is to also know the position on the line.
+want is to also know the character position on the line.
 
 ### Implementation Background
 
 In the 3x Puppet Language implementation, the positioning information was
 limited to only contain the name of the file and the line number. In many cases
-the information is wrong (or rather imprecise) as it relies on the positioned held in
-the lexer as opposed to the position of the individual tokens. The lexer may have advanced
-past the particular point where the problem is when the problem is reported.
+this information is wrong (or rather, quite imprecise) as it relies on the lexer's
+current lexing position (i.e. what it has consumed from the source text) as opposed
+to the position of the individual tokens. The lexer may very well have advanced
+past the particular point where the problem is by the time the problem is reported by the parser.
 
 The first implementation of the future parser (available in Puppet 3.2) contained
-detailed positioning. It was calculated at the same time as the lexer did its
-regular processing - i.e intermixed with producing the tokens that are fed to the
-parser.
-
-This proved to be both complicated (as the lexer needs to look ahead and thus either maintain
-a complex state where it is, what the location is etc.) and to be a performance hog. Part of
-the performance problem is the need to compute the length of an entire expression to enable
+detailed positioning (file, line and position on line).
+It was calculated at the same time as the lexer did its regular processing - i.e. intermixed with
+production of the tokens that it feeds to the parser.
+
+This proved to be complicated, as the lexer needs to look ahead and thus maintain
+a complex state: where it is, the locations of points it may potentially backtrack to, etc.
+It also proved to be a performance hog. Part of
+the performance problem was the need to compute the length of an entire expression to enable
 future output of the corresponding source text.
 
 The second implementation was much more performant. The idea was that detailed position
-information is really only needed when there is an error and it could be computed lazily
-if only recording the most basic information - i.e. the offset and length of the significant
-parts of the source text. The complex state and intermixed position calculations could also
-be made much more efficiently if done first, scanning the entire input for line breaks.
-
-This implementation was introduced in the future parser in 3.4. Unfortunately, the idea
-that positioning is really only needed when there is an error was wrong since the 3.4
+information is really only needed when there is an error (and only a fraction of the total
+number of tokens have positions that end up being recorded in the catalog). Thus, the
+position information could be computed lazily, recording only the most basic information: the
+offset and length of the significant parts of the source text.
+It was then possible to get rid of the complex state and intermixed position calculations
+and instead, more efficiently, scan the entire input once for line breaks and build an index.
+
+This implementation was introduced in the future parser in Puppet 3.4. Unfortunately, the idea
+that positioning is really only needed when there is an error was wrong in this case, since the 3.4
 implementation has to transform the entire parsed syntax tree into the 3.x AST equivalent tree, and
 there all position information is still computed up front and stored inside each AST node (even
-if never used).
+if never used). (Bummer.)
 
 Now, in Puppet 3.5, the new evaluator is responsible for evaluating most expressions
 and it will only trigger the position computations when it is required; when there is
@@ -52,7 +63,11 @@ While working on the performance of the lexer / parser it was also observed that
 would be beneficial to serialize the constructed syntax tree, as deserialization of a
 tree is much faster in general than again parsing the same source text. In order to be
 able to support this, it was obvious that the detailed information needed to be included
-in the model (and not only computed and kept in structures in memory).
+in the model (and not only computed and kept in structures in memory). Thus, in Puppet 3.5's
+future parser's AST model you will find a Program element that holds the entire source text,
+the line offset index, and the parse tree result. (This has other benefits as well - even a
+broken program can be returned; even if it cannot be evaluated, this makes it easier to report
+what is wrong.)
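The scheme described above - record only offsets, scan the source once for line breaks, and compute line/position lazily from an index - can be pictured in a few lines of Ruby. This is a simplified sketch with invented names, not Puppet's actual code (the real implementation lives in `Puppet::Pops::Parser::Locator`):

```ruby
# Toy version of a line-index based locator (names invented for
# illustration; not Puppet's actual implementation).
class ToyLocator
  attr_reader :string, :line_index

  def initialize(string)
    @string = string
    # Offsets at which each line starts; line 1 starts at offset 0.
    @line_index = [0]
    string.each_char.with_index { |c, i| @line_index << i + 1 if c == "\n" }
  end

  # 1-based line number for a character offset: found via the last
  # recorded line start that is <= the offset.
  def line_for_offset(offset)
    @line_index.rindex { |line_start| line_start <= offset } + 1
  end

  # 1-based position on that line.
  def pos_on_line(offset)
    offset - @line_index[line_for_offset(offset) - 1] + 1
  end
end

locator = ToyLocator.new("$a = 1\n$b = $a + 2\n")
# Offset 15 is the '+' on the second line.
puts "#{locator.line_for_offset(15)}:#{locator.pos_on_line(15)}"  # prints "2:9"
```

A syntax node then only needs to store `offset` and `length`; everything else can be recomputed on demand from the index.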

5772
### Requirements on Positioning Information
5873

@@ -69,20 +84,20 @@ in the model (and not only computed and kept in structures in memory).
 
 A mix of techniques were used to meet the requirements.
 
-The central concept is that of a Locator; an object that knows about the source text
-string,where all the lines start, and given an offset into the string can answer which
-line it is on, and its position on that line. This means that only the offset and length
+The central concept is that of a `Locator`: an object that knows about the source text
+string and where all the lines start, and that, given an offset into the string, can answer which
+line it is on, and the position on that line. This means that only the offset and length
 of tokens need to be recorded and stored in the syntax nodes.
 
-We could store a reference to the Locator in every node, but that requires one extra
+We could store a reference to the locator in every node, but that requires one extra
 slot per node, and would need to be handled in de-serialization (i.e. setting thousands
-of references when loading a single model). The offset and length are simply Integers and
-are fast to serialize/de-serialize.
+of references when loading a single model). The offset and length are, in contrast,
+regular Integers and are fast to serialize/de-serialize.
 
 The parser always produces an instance of `Program`, and it contains both the source text
 and the required line index. With these two, it can reconstruct the Locator (that was originally
 created by the lexer / parser when parsing the source). The Program is only a data container;
-it does not do any computation - that is always handled by an instance of Locator.
+it does not do any offset computation - that is always handled by an instance of `Locator`.
 
 Here is a diagram that shows the relationship between the `Program` and the `Locator`. It also
 shows how individual nodes (`Positioned`) and their corresponding computational / cache of
@@ -96,15 +111,15 @@ PROGRAM_LOCATOR_DIAGRAM.PNG
 All nodes in the constructed syntax tree model inherit from `Positioned` (except `Program`, which is
 always the entire source). Being `Positioned` means that there is an `offset` and a `length` (but nothing more).
 
-If we want to know the line number and the position on the line we need to find the Locator
-since it knows how to compute the information. We could have implemented that in the Positioned
+If we want to know the line number and the position on the line we need to find the `Locator`,
+since it knows how to compute the information. We could have implemented that in the `Positioned`
 object itself, but it would clutter its implementation and it would be difficult to change
-the strategy for computing. This is where the SourcePosAdapter comes in.
+the computation strategy. This is where the `SourcePosAdapter` comes in.
 
 ### The SourcePosAdapter
 
-Being an `Adapter` (there are others) means that it is bi-directionally associated with a particular object without the object knowing about it. The responsibility of the managing the relationship
-is entirely on the adapter side.
+Being an `Adapter` (see the earlier post [1] for details) means that it is bi-directionally associated with a particular object without the object knowing about it.
+The responsibility of managing the relationship is entirely on the adapter side.
 
 A `Positioned` object is adapted to a `SourcePosAdapter` by:
 
@@ -118,12 +133,12 @@ to ask if an object is adapted (and get the adapter) by:
 Once a `SourcePosAdapter` is obtained, it can answer all the questions about position. When it is
 created it performs a minimum of computation. When asked for something that requires a `Locator`,
 it searches for the closest object that has knowledge of it and then caches this information. When
-this takes place for the first time, the search always goes up to the `Program` (root) node. On subsequent searches a node with a `SourcePosAdapter` may be encountered and the search can stop
-there.
+this takes place for the first time, the search always goes up to the `Program` (root) node. On subsequent searches a parent node with a `SourcePosAdapter` may be encountered and the search
+can stop there.
 
 The resulting structure is what is depicted in the graph.
 
-It is worth noting that all model objects that are contained, knows about their container via
+It is worth noting that all model objects that are contained know about their container via
 the somewhat mysterious method `eContainer` (how that works in more detail and what
 the difference is between a *containment* and a *reference* is the topic for another blog post).
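To make the adapter idea concrete, here is a generic Ruby sketch of adapting an object without the object's class knowing about it. All names here are invented for illustration; Puppet's actual adapter API differs:

```ruby
# Generic sketch of the adapter pattern (invented names; not Puppet's API).
# The adapter attaches itself to the adapted object and manages the
# bi-directional association entirely from the adapter side.
class ToyAdapter
  # Return the adapter for obj, creating and associating one if needed.
  def self.adapt(obj)
    adapted(obj) || obj.instance_variable_set(:@__toy_adapter, new(obj))
  end

  # Return the adapter if obj is adapted, otherwise nil.
  def self.adapted(obj)
    obj.instance_variable_get(:@__toy_adapter)
  end

  attr_reader :adapted_object
  attr_accessor :cached_locator  # e.g. a lazily found, then cached, locator

  def initialize(obj)
    @adapted_object = obj
  end
end

node = Object.new
ToyAdapter.adapted(node)                # => nil (not adapted yet)
adapter = ToyAdapter.adapt(node)
ToyAdapter.adapt(node).equal?(adapter)  # => true (the adapter is reused)
```

The caching behaviour described above would map onto something like `cached_locator`: the first query walks up to the root, stores what it finds in the adapter, and later searches from child nodes can stop at the first adapted parent.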

@@ -162,19 +177,37 @@ While there is much to talk about how the grammar / parser works, this post focu
 of location, so the interesting parts here are the calls to the `loc` method. It is called
 with the resulting model node (e.g. an `ArithmeticExpression`) and then one or two other nodes,
 which may be model nodes or tokens produced by the lexer. All of the arithmetic expressions
-are located by their operator (`lot` is called with `val[1]` which is a
+are located by their operator (`loc` is called with `val[1]`, which is a
 reference to the operator, i.e. '+' or '*' in the example).
 
 Once the tree is built, since all of the nodes are `Positioned`, it is possible to adapt them with a `SourcePosAdapter` to get the detailed computed information.
 
 ### Output
 
-The output when there is position information is simply line:pos where pos starts from 1 (the first
+The output when there is position information is simply `line:pos`, where pos starts from 1 (the first
 character on the line).
 
-Output of source excerpt is not yet implemented as it has its own challenges - some expressions
-are quite long and span multiple lines, how much of that is relevant to show? How much is enough context? Also, while the data is in place, expressions like the arithmetic expressions are typically
-located by their operator, and the output source would be just the '+'. A bit more processing is
-needed to also include the left and right hand sides - but then again - how much of those.
+Output of a source excerpt (i.e. a snippet of source and a caret pointer to the position) is not yet implemented. Maybe someone wants to take a stab at making an implementation?
+
+### UTF-8
+
+And now, let's talk about UTF-8, or rather about lexing multibyte strings.
+
+The new implementation (in lexer2) handles multibyte source. It does this by recording
+byte offsets and using a locator that is specific to the single- or multibyte runtime. This proved
+to be much faster than using the new multibyte support introduced in Ruby 1.9.3. You can look
+at the implementation in `Puppet::Pops::Parser::Locator`, where there are two different implementations.
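Why byte offsets need runtime-specific handling is easy to demonstrate in plain Ruby. The snippet below is just an illustration of the byte-versus-character distinction, unrelated to Puppet's classes:

```ruby
# A multibyte-aware locator must not confuse character offsets with
# byte offsets: they diverge as soon as a multibyte character appears.
src = "$x = 'åäö' + 1\n"   # 'å', 'ä' and 'ö' are each two bytes in UTF-8

char_offset = src.index('+')                 # offset counted in characters
byte_offset = src[0...char_offset].bytesize  # same point counted in bytes

puts char_offset  # prints 11
puts byte_offset  # prints 14 (three two-byte characters before the '+')
```

Recording byte offsets and letting a runtime-specific locator translate them keeps the common (single-byte) case fast while still being correct for multibyte source.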
+
+### Summary
+
+This post shows both a concrete example of using the **adapter pattern** and some high-level
+examples of how the lexer gets its work done in a performant way.
+
+#### In the Next Post
 
-As always the solution is probably to just show the line and a marker to where on the line the problem occurred.
+Well, to be honest, I don't really know what the next post should be about. What would you
+like to learn more about next? I am considering an overview of modeling, as it is a fundamental
+concept, but there are many other topics to choose from - the implementation of the lexer,
+how the parser works, how error messages are generated and formatted, and then we have the
+evaluator to cover... - it seems like this will be a very long series.