
Commit c81d9b3 ("Update docs")
1 parent 66806d7

20 files changed: +2675 -132 lines
blog-posts/how-puppet-works.md

Lines changed: 2 additions & 3 deletions
@@ -1,6 +1,6 @@
 Getting your Puppet Ducks in a Row
 ===
-A conversation that comes up frequently is if the Puppet Programming Language is declarative or not. This is usually the topic when someone has been fighting with how order of evaluation works and have
+A conversation that comes up frequently is whether the Puppet Programming Language is declarative or not. This is usually the topic when someone has been fighting with how order of evaluation works on the master side, and has
 been beaten by what sometimes may seem like random behavior. In this post I want to explain how
 Puppet works and try to straighten out some of the misconceptions.

@@ -10,8 +10,7 @@ First, lets get the terminology right (or this will remain confusing). It is com
 
 Parse Order is the order in which Puppet reads puppet manifests (`.pp`) from disk, turns them into tokens and checks their grammar. The result is something that can be evaluated (technically an Abstract Syntax Tree (AST)). The order in which this is done is actually of minor importance from a user perspective; you really do not need to think about how an expression such as `$a = 1 + 2` becomes an AST.
 
-OTOH, if you think about "parse order" as "the order the files are parsed", but this order is
-also of minor importance. Puppet starts with the `site.pp` file (or possibly the `code` setting in the configuration), then asking external services (such as the ENC) for additional things that are not included in the logic that is loaded from the `site.pp`. In versions from 3.5.0 the manifest setting can also refer to a directory of `.pp` files (preferred over using the now deprecated `import` statement).
+I have heard users talk about "parse order" with the meaning "the order the files are loaded and parsed", which to me is a bit of a stretch since that involves several additional concepts (as you will see later). The overall ordering of the execution is that Puppet starts with the `site.pp` file (or possibly the `code` setting in the configuration), then asks external services (such as the ENC) for additional things that are not included in the logic loaded from `site.pp`. In versions from 3.5.0 the manifest setting can also refer to a directory of `.pp` files (preferred over using the now deprecated `import` statement).
 
 At this point (after having started and after having parsed the initial manifest(s)), Puppet first matches the information about the node making a request for a catalog with the available node definitions, and selects the first matching node definition. At this point Puppet has the notion of:

blog-posts/internals/location/location.md

Lines changed: 79 additions & 46 deletions
@@ -1,47 +1,58 @@
-In post about Puppet Internals I am going to cover how source location is
-handled by the lexer and parser.
+In this post about Puppet Internals I am going to cover how source location information is
+handled by the future lexer and parser. This also shows a concrete case of using the
+**adapter pattern** introduced in [Puppet Internals - the Adapter Pattern][1].
+
+[1]: http://puppet-on-the-edge.blogspot.se/2014/02/puppet-internals-adapter-pattern.html
 
 ### Rationale for Detailed Position Information
 
 It is important to have detailed position information when errors occur,
 as the user programming in the Puppet Programming Language would otherwise have
 to guess about where in the source text the problem is.
 
+Position information is not only wanted for syntax errors and other errors immediately
+detected by the parser; this information is perhaps even more valuable when errors occur
+later and it is important to know where a particular value originated.
+
 To date, this output has consisted of file and line only. While this is enough in
-many situations there are many cases where there is no reasonable text to output (say
-using an operator like '+' the wrong way, and there are 5 '+' on the same line), just knowing
+most situations, there are also many cases where the line alone is not enough (say
+using an operator like '+' the wrong way, and there are five '+' on the same line); just knowing
 that there is something wrong with one of the '+' on line 3 is not that great. What we
-want is to also know the position on the line.
+want is to also know the character position on the line.
 
 ### Implementation Background
 
 In the 3x Puppet Language implementation, the positioning information was
 limited to only contain the name of the file and the line number. In many cases
-the information is wrong (or rather imprecise) as it relies on the positioned held in
-the lexer as opposed to the position of the individual tokens. The lexer may have advanced
-past the particular point where the problem is when the problem is reported.
+this information is wrong (or rather, quite imprecise) as it relies on the lexer's
+current lexing position (i.e. what it has consumed from the source text) as opposed
+to the position of the individual tokens. The lexer may very well have advanced
+past the particular point where the problem is by the time the problem is reported by the parser.
 
 The first implementation of the future parser (available in Puppet 3.2) contained
-detailed positioning. It was calculated at the same time as the lexer did its
-regular processing - i.e intermixed with producing the tokens that are fed to the
-parser.
-
-This proved to be both complicated (as the lexer needs to look ahead and thus either maintain
-a complex state where it is, what the location is etc.) and to be a performance hog. Part of
-the performance problem is the need to compute the length of an entire expression to enable
+detailed positioning (file, line and position on line).
+It was calculated at the same time as the lexer did its regular processing - i.e. intermixed with
+production of the tokens that it feeds to the parser.
+
+This proved to be complicated, as the lexer needs to look ahead and thus maintain
+a complex state: where it is, the locations of points it may potentially backtrack to, etc.
+It also proved to be a performance hog. Part of
+the performance problem was the need to compute the length of an entire expression to enable
 future output of the corresponding source text.
 
 The second implementation was much more performant. The idea was that detailed position
-information is really only needed when there is an error and it could be computed lazily
-if only recording the most basic information - i.e. the offset and length of the significant
-parts of the source text. The complex state and intermixed position calculations could also
-be made much more efficiently if done first, scanning the entire input for line breaks.
-
-This implementation was introduced in the future parser in 3.4. Unfortunately, the idea
-that positioning is really only needed when there is an error was wrong since the 3.4
+information is really only needed when there is an error (and only a fraction of the total
+number of tokens have positions that end up being recorded in the catalog). Thus, the
+position information could be computed lazily, recording only the most basic information: the
+offset and length of the significant parts of the source text.
+It was then possible to get rid of the complex state and intermixed position calculations
+and instead, more efficiently, scan the entire input once for line breaks and build an index.
+
+This implementation was introduced in the future parser in Puppet 3.4. Unfortunately, the idea
+that positioning is really only needed when there is an error was wrong in this case, since the 3.4
 implementation has to transform the entire parsed syntax tree into the 3.x AST equivalent tree, and
 there all position information is still computed up front and stored inside each AST node (even
-if never used).
+if never used). (Bummer.)
 
 Now, in Puppet 3.5, the new evaluator is responsible for evaluating most expressions
 and it will only trigger the position computations when it is required; when there is
@@ -52,7 +63,11 @@ While working on the performance of the lexer / parser it was also observed that
 would be beneficial to serialize the constructed syntax tree, as deserialization of a
 tree is much faster in general than again parsing the same source text. In order to be
 able to support this, it was obvious that the detailed information needed to be included
-in the model (and not only computed and kept in structures in memory).
+in the model (and not only computed and kept in structures in memory). Thus, in Puppet 3.5's
+future parser's AST model you will find a Program element that holds the entire source text,
+the line offset index, and the parse tree result. (This has other benefits as well - even a
+broken program can be returned; even if it cannot be evaluated, this makes it easier to report
+what is wrong.)
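The scheme described above - record only offsets, scan the source once for line breaks, and compute line/position lazily from an index - can be pictured in a few lines of Ruby. This is a simplified sketch with invented names, not Puppet's actual code (the real implementation lives in `Puppet::Pops::Parser::Locator`):

```ruby
# Toy version of a line-index based locator (names invented for
# illustration; not Puppet's actual implementation).
class ToyLocator
  attr_reader :string, :line_index

  def initialize(string)
    @string = string
    # Offsets at which each line starts; line 1 starts at offset 0.
    @line_index = [0]
    string.each_char.with_index { |c, i| @line_index << i + 1 if c == "\n" }
  end

  # 1-based line number for a character offset: found via the last
  # recorded line start that is <= the offset.
  def line_for_offset(offset)
    @line_index.rindex { |line_start| line_start <= offset } + 1
  end

  # 1-based position on that line.
  def pos_on_line(offset)
    offset - @line_index[line_for_offset(offset) - 1] + 1
  end
end

locator = ToyLocator.new("$a = 1\n$b = $a + 2\n")
# Offset 15 is the '+' on the second line.
puts "#{locator.line_for_offset(15)}:#{locator.pos_on_line(15)}"  # prints "2:9"
```

A syntax node then only needs to store `offset` and `length`; everything else can be recomputed on demand from the index.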

5772
### Requirements on Positioning Information
5873

@@ -69,20 +84,20 @@ in the model (and not only computed and kept in structures in memory).
 
 A mix of techniques were used to meet the requirements.
 
-The central concept is that of a Locator; an object that knows about the source text
-string,where all the lines start, and given an offset into the string can answer which
-line it is on, and its position on that line. This means that only the offset and length
+The central concept is that of a `Locator`: an object that knows about the source text
+string and where all the lines start, and that, given an offset into the string, can answer which
+line it is on, and the position on that line. This means that only the offset and length
 of tokens need to be recorded and stored in the syntax nodes.
 
-We could store a reference to the Locator in every node, but that requires one extra
+We could store a reference to the locator in every node, but that requires one extra
 slot per node, and would need to be handled in de-serialization (i.e. setting thousands
-of references when loading a single model). The offset and length are simply Integers and
-are fast to serialize/de-serialize.
+of references when loading a single model). The offset and length are, in contrast,
+regular Integers and are fast to serialize/de-serialize.
 
 The parser always produces an instance of `Program`, and it contains both the source text
 and the required line index. With these two, it can reconstruct the Locator (that was originally
 created by the lexer / parser when parsing the source). The Program is only a data container;
-it does not do any computation - that is always handled by an instance of Locator.
+it does not do any offset computation - that is always handled by an instance of `Locator`.
 
 Here is a diagram that shows the relationship between the `Program` and the `Locator`. It also
 shows how individual nodes (`Positioned`) and their corresponding computational / cache of
@@ -96,15 +111,15 @@ PROGRAM_LOCATOR_DIAGRAM.PNG
 All nodes in the constructed syntax tree model inherit from `Positioned` (except `Program`, which is
 always the entire source). Being `Positioned` means that there is an `offset` and a `length` (but nothing more).
 
-If we want to know the line number and the position on the line we need to find the Locator
-since it knows how to compute the information. We could have implemented that in the Positioned
+If we want to know the line number and the position on the line we need to find the `Locator`,
+since it knows how to compute the information. We could have implemented that in the `Positioned`
 object itself, but it would clutter its implementation and it would be difficult to change
-the strategy for computing. This is where the SourcePosAdapter comes in.
+the computation strategy. This is where the `SourcePosAdapter` comes in.
 
 ### The SourcePosAdapter
 
-Being an `Adapter` (there are others) means that it is bi-directionally associated with a particular object without the object knowing about it. The responsibility of the managing the relationship
-is entirely on the adapter side.
+Being an `Adapter` (see the earlier post [1] for details) means that it is bi-directionally associated with a particular object without the object knowing about it.
+The responsibility of managing the relationship is entirely on the adapter side.
 
 A `Positioned` object is adapted to a `SourcePosAdapter` by:
 
@@ -118,12 +133,12 @@ to ask if an object is adapted (and get the adapter) by:
 Once a `SourcePosAdapter` is obtained, it can answer all the questions about position. When it is
 created it performs a minimum of computation. When asked for something that requires a `Locator`,
 it searches for the closest object that has knowledge of it and then caches this information. When
-this takes place for the first time, the search always goes up to the `Program` (root) node. On subsequent searches a node with a `SourcePosAdapter` may be encountered and the search can stop
-there.
+this takes place for the first time, the search always goes up to the `Program` (root) node. On subsequent searches a parent node with a `SourcePosAdapter` may be encountered and the search
+can stop there.
 
 The resulting structure is what is depicted in the graph.
 
-It is worth noting that all model objects that are contained, knows about their container via
+It is worth noting that all model objects that are contained know about their container via
 the somewhat mysterious method `eContainer` (how that works in more detail and what
 the difference is between a *containment* and a *reference* is the topic for another blog post).
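To make the adapter idea concrete, here is a generic Ruby sketch of adapting an object without the object's class knowing about it. All names here are invented for illustration; Puppet's actual adapter API differs:

```ruby
# Generic sketch of the adapter pattern (invented names; not Puppet's API).
# The adapter attaches itself to the adapted object and manages the
# bi-directional association entirely from the adapter side.
class ToyAdapter
  # Return the adapter for obj, creating and associating one if needed.
  def self.adapt(obj)
    adapted(obj) || obj.instance_variable_set(:@__toy_adapter, new(obj))
  end

  # Return the adapter if obj is adapted, otherwise nil.
  def self.adapted(obj)
    obj.instance_variable_get(:@__toy_adapter)
  end

  attr_reader :adapted_object
  attr_accessor :cached_locator  # e.g. a lazily found, then cached, locator

  def initialize(obj)
    @adapted_object = obj
  end
end

node = Object.new
ToyAdapter.adapted(node)                # => nil (not adapted yet)
adapter = ToyAdapter.adapt(node)
ToyAdapter.adapt(node).equal?(adapter)  # => true (the adapter is reused)
```

The caching behaviour described above would map onto something like `cached_locator`: the first query walks up to the root, stores what it finds in the adapter, and later searches from child nodes can stop at the first adapted parent.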

@@ -162,19 +177,37 @@ While there is much to talk about how the grammar / parser works, this post focu
 of location, so the interesting parts here are the calls to the `loc` method. It is called
 with the resulting model node (e.g. an `ArithmeticExpression`) and then one or two other nodes,
 which may be model nodes or tokens produced by the lexer. All of the arithmetic expressions
-are located by their operator (`lot` is called with `val[1]` which is a
+are located by their operator (`loc` is called with `val[1]`, which is a
 reference to the operator, i.e. '+' or '*' in the example).
 
 Once the tree is built, since all of the nodes are `Positioned`, it is possible to adapt them with a `SourcePosAdapter` to get the detailed computed information.
 
 ### Output
 
-The output when there is position information is simply line:pos where pos starts from 1 (the first
+The output when there is position information is simply `line:pos`, where pos starts from 1 (the first
 character on the line).
 
-Output of source excerpt is not yet implemented as it has its own challenges - some expressions
-are quite long and span multiple lines, how much of that is relevant to show? How much is enough context? Also, while the data is in place, expressions like the arithmetic expressions are typically
-located by their operator, and the output source would be just the '+'. A bit more processing is
-needed to also include the left and right hand sides - but then again - how much of those.
+Output of a source excerpt (i.e. a snippet of source and a caret pointer to the position) is not yet implemented. Maybe someone wants to take a stab at making an implementation?
+
+### UTF-8
+
+And now, let's talk about UTF-8, or rather about lexing multibyte strings.
+
+The new implementation (in lexer2) handles multibyte source. It does this by recording
+byte offsets and using a locator that is specific to the single- or multibyte runtime. This proved
+to be much faster than using the new multibyte support introduced in Ruby 1.9.3. You can look
+at the implementation in `Puppet::Pops::Parser::Locator`, where there are two different implementations.
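Why byte offsets need runtime-specific handling is easy to demonstrate in plain Ruby. The snippet below is just an illustration of the byte-versus-character distinction, unrelated to Puppet's classes:

```ruby
# A multibyte-aware locator must not confuse character offsets with
# byte offsets: they diverge as soon as a multibyte character appears.
src = "$x = 'åäö' + 1\n"   # 'å', 'ä' and 'ö' are each two bytes in UTF-8

char_offset = src.index('+')                 # offset counted in characters
byte_offset = src[0...char_offset].bytesize  # same point counted in bytes

puts char_offset  # prints 11
puts byte_offset  # prints 14 (three two-byte characters before the '+')
```

Recording byte offsets and letting a runtime-specific locator translate them keeps the common (single-byte) case fast while still being correct for multibyte source.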
+
+### Summary
+
+This post shows both a concrete example of using the **adapter pattern** and some high-level
+examples of how the lexer gets its work done in a performant way.
+
+#### In the Next Post
 
-As always the solution is probably to just show the line and a marker to where on the line the problem occurred.
+Well, to be honest, I don't really know what the next post should be about. What would you
+like to learn more about next? I am considering an overview of modeling, as it is a fundamental
+concept, but there are many other topics to choose from - the implementation of the lexer,
+how the parser works, how error messages are generated and formatted, and then we have the
+evaluator to cover... - it seems like this will be a very long series.