Description
This is going to be pretty long. I was fairly dismayed when I learned about this so please bear with me. I am going to talk about a test case, but I'm going to argue the XPath spec is ambiguous and unclear for this case, so perhaps this is more about improving the specification rather than interpreting it.
MapConstructor-025
Let's consider the test MapConstructor-025.
let $m := map{'a':1} return map:size(map{$m?a:true()})
The interesting bit here is:
map{$m?a:true()}
The test parses this as:
map { $m ? a : true() }
Now we turn to a piece of text in section A.2, which the description of the test references:
When tokenizing, the longest possible match that is consistent with the EBNF is used.
The test interprets this as follows: a:true is the longest token, but it's only grammatical when the token is a, so the longest possible token is a.
But that's not really independent tokenizing like a lexer does. Tokenizing is not aware of the grammar; tokenizing creates tokens that can be used by the grammar.
So another interpretation of this same text could be:
a:true is the longest possible token allowed by the grammar, so a:true is the token. It's true that this token makes the code above unparsable, but that's okay: plenty of sequences of valid tokens are unparsable.
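To make the two readings concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the NCName pattern is ASCII-only, the `TERMINALS` table and the `longest_match` helper are hypothetical names of mine, and the list of terminals allowed after `?` is my reading of the KeySpecifier production (NCName, IntegerLiteral, a parenthesized expression, or `*`).

```python
import re

# Simplified, ASCII-only stand-in for NCName (assumption: real NCNames allow far more).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# Hypothetical table of candidate terminals: kind -> compiled pattern.
TERMINALS = {
    "PrefixedName": re.compile(NCNAME + ":" + NCNAME),
    "NCName": re.compile(NCNAME),
    "Integer": re.compile(r"[0-9]+"),
    "(": re.compile(r"\("),
    "*": re.compile(r"\*"),
}

def longest_match(text, pos, allowed):
    """Return (kind, lexeme) for the longest match at pos among the allowed kinds."""
    best = None
    for kind in allowed:
        m = TERMINALS[kind].match(text, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best

expr = "$m?a:true()"
pos = expr.index("?") + 1  # position right after the lookup operator

# Reading 1: the lexer is independent of the grammar and tries every terminal.
context_free = longest_match(expr, pos, TERMINALS.keys())

# Reading 2: the lexer only tries terminals the grammar accepts after "?".
grammar_directed = longest_match(expr, pos, ["NCName", "Integer", "(", "*"])

print(context_free)      # ('PrefixedName', 'a:true')
print(grammar_directed)  # ('NCName', 'a')
```

Both calls apply "longest match" faithfully; they differ only in whether the set of candidate terminals depends on the parser state, which is exactly what the sentence in A.2 leaves open.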
What is a token?
But what is a token in the first place? We go back to A.2, which is titled "Lexical structure" and as we have seen, talks about tokenizing. It describes terminal symbols, at first sight a decent candidate for tokens.
What kind of terminal symbol is a:true? It would be a prefixed qname, as described by the QName production. QName is described as a terminal symbol.
QName refers to the grammar in the XML spec itself, where a QName is PrefixedName | UnprefixedName
Both are composed from NCName. An UnprefixedName is basically an NCName.
But wait. The XPath spec also describes an NCName by itself as a terminal symbol. So when we encounter an NCName in the grammar, is it to be interpreted as a QName or not? That cannot actually be determined without using the grammar.
So it appears that the terminal symbols are not really describing tokens at all.
It's unclear what is meant by "tokenizing".
What is a terminal symbol?
As a digression: A.2.1 says "Terminal symbols", but that evidently isn't the complete list of terminal symbols, as A.2.2 introduces many more delimiting and non-delimiting terminal symbols. I've interpreted those as tokens, but you can't quite do that. Almost, but not quite.
For instance, there is also an ambiguity with NCName for reserved function names such as if.
Is if an NCName or not? When it's used as a function name, if(), it is not; in other contexts it is a valid NCName; and in yet another context it's a keyword of the language.
My adventure
In my XPath engine I have a separate lexer (tokenizing) and parsing phase. This is pretty common in programming language construction. The lexer produces tokens and the grammar then defines how those tokens combine into an AST. Whitespace is handled in the lexing phase and then disregarded during parsing.
This works pretty well in most cases, but I noticed it wasn't working well for some cases where ws: explicit is in use. It also didn't work well for QNames: foo : bar, with whitespace, could incorrectly be interpreted as a valid QName.
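The QName problem can be reproduced with a small sketch of a whitespace-skipping lexer. This is not my actual engine, just a hypothetical Python illustration with a simplified ASCII-only NCName pattern and made-up token names (`NCName`, `Colon`).

```python
import re

# Simplified, ASCII-only stand-in for NCName (assumption).
NCNAME = r"[A-Za-z_][A-Za-z0-9_.\-]*"

# One token per match; surrounding whitespace is consumed and thrown away.
TOKEN = re.compile(rf"\s*(?:(?P<NCName>{NCNAME})|(?P<Colon>:))\s*")

def tokenize(text):
    """Split text into (kind, lexeme) pairs, discarding whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

tight = tokenize("foo:bar")
spaced = tokenize("foo : bar")

# Both inputs yield the same token stream, so a parser that assembles
# a QName from NCName, Colon, NCName cannot tell them apart.
print(tight == spaced)  # True
```

Once whitespace is dropped in the lexer, the information needed to reject `foo : bar` as a QName is gone before the parser ever sees it.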
So I thought, let's handle these cases in the lexer. The lexer now produces various new tokens, such as URIQualifiedName. Even wildcards (like foo:*) are lexed as tokens, since the "ws: explicit" rule applies to them too.
But what to do about QName? As we discussed before, without the grammar it is ambiguous whether a given NCName is a QName or not. But, I thought, it's not ambiguous whether something is a PrefixedName.
So I introduced a token for that. Tests that were previously failing are now passing. But to my dismay, MapConstructor-025 started failing.
I am unclear how to resolve this: I really don't want to make the whole grammar whitespace aware for the few exceptional cases; I want to keep handling this in the lexer. But handling PrefixedName as a token leads to new trouble.
MapConstructor-025 takes one interpretation of the specification, but I really think the specification is ambiguous, as it doesn't define what a token is. I think PrefixedName is a perfectly reasonable interpretation of "longest possible match according to the EBNF".
What is a terminal?
[Definition: A terminal is a symbol or string or pattern that can appear in the right-hand side of a rule, but never appears on the left-hand side in the main grammar, although it may appear on the left-hand side of a rule in the grammar for terminals.]
Does this really say much at all beyond "if we put the rule in the grammar for terminals, it's a terminal, but if we put it in the main grammar it's not"?
Concrete steps
I'd love for the code in MapConstructor-025 to be considered invalid grammatically, but I don't know whether it can be, and the specification is unclear anyway.
I think a concrete step is to make the specification clearer. What is a token? (Or avoid that word altogether.) Is a terminal symbol a token?
If the interpretation in MapConstructor-025 is correct, make explicit that QName and PrefixedName cannot be tokens (and are safely tucked away in the XML specification), and that NCName is ambiguous at the token level and can only be resolved grammatically.
I'm going to have to think hard for a while to find a doable way to make MapConstructor-025 work in my implementation.