Random ideas and thoughts around the topic of crafting IDE-ready parser with Parol #397

ryo33 · 2024-08-28T14:41:20Z

ryo33
Aug 28, 2024
Collaborator

I've been thinking about the following things and more subtle ones in the past few days. I haven't given them much thought, so please don't take them seriously.

1. Generating multiple types of parsers from a single grammar

If a PAR file is a valid grammar under both %grammar_type 'LL(k)' and %grammar_type 'LALR(1)', we could in theory choose which type to generate, with --grammar-type=lalr1 on the CLI or .grammar_type(GrammarType::Lalr1) in the builder.

Furthermore, there could be various use-case-specific parsers:

a rough parser for best-effort syntax highlighting of source code with broken syntax
super fast parser
wasm friendly parser (no big tables)
IDE-ready parser (constructs more trees in best-effort, tends to be bottom-up ones)

2. `fn on_newline_parsed` and `fn on_whitespace_parsed`

fn on_comment_parsed is exactly the feature we need: it keeps comments losslessly without putting Comment everywhere in the grammar, which would ruin the grammar's readability, maintainability, and beauty.

In the same spirit, we could have on_newline_parsed (or "scanned"?) and on_whitespace_parsed.
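A rough sketch of how such hooks might look. Everything here is an assumption for illustration: the trait name TriviaHooks, the &str parameters (the real on_comment_parsed hook in parol receives a token, not a string slice), and the toy scan function, which only stands in for a generated scanner.

```rust
/// Hypothetical callback trait. Only the name `on_comment_parsed` mirrors an
/// existing parol hook; the signatures are simplified for this sketch.
pub trait TriviaHooks {
    fn on_comment_parsed(&mut self, _text: &str) {}
    fn on_newline_parsed(&mut self, _text: &str) {}
    fn on_whitespace_parsed(&mut self, _text: &str) {}
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TriviaKind {
    Comment,
    Newline,
    Whitespace,
}

/// Records all trivia in order of appearance.
#[derive(Default)]
pub struct TriviaLog {
    pub entries: Vec<(TriviaKind, String)>,
}

impl TriviaHooks for TriviaLog {
    fn on_comment_parsed(&mut self, text: &str) {
        self.entries.push((TriviaKind::Comment, text.to_string()));
    }
    fn on_newline_parsed(&mut self, text: &str) {
        self.entries.push((TriviaKind::Newline, text.to_string()));
    }
    fn on_whitespace_parsed(&mut self, text: &str) {
        self.entries.push((TriviaKind::Whitespace, text.to_string()));
    }
}

/// Toy scanner (not parol's): reports `//` comments, newlines and blanks to
/// the hooks and returns the remaining tokens.
pub fn scan<'a, H: TriviaHooks>(input: &'a str, hooks: &mut H) -> Vec<&'a str> {
    let is_blank = |c: char| c == ' ' || c == '\t';
    let mut tokens = Vec::new();
    let mut rest = input;
    while !rest.is_empty() {
        if rest.starts_with("//") {
            let end = rest.find('\n').unwrap_or(rest.len());
            hooks.on_comment_parsed(&rest[..end]);
            rest = &rest[end..];
        } else if rest.starts_with('\n') {
            hooks.on_newline_parsed(&rest[..1]);
            rest = &rest[1..];
        } else if rest.starts_with(is_blank) {
            let end = rest.find(|c: char| !is_blank(c)).unwrap_or(rest.len());
            hooks.on_whitespace_parsed(&rest[..end]);
            rest = &rest[end..];
        } else {
            let end = rest
                .find(|c: char| is_blank(c) || c == '\n')
                .unwrap_or(rest.len());
            tokens.push(&rest[..end]);
            rest = &rest[end..];
        }
    }
    tokens
}
```

Because the trait methods have empty default bodies, a user who does not care about whitespace pays nothing beyond the call itself, which fits the idea of making this optional.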

3. Possibility of incremental parsing in parol

I don't believe incremental parsing solves any performance problem except in corner cases, though that may be due to my lack of knowledge in this area. Still, this stated goal of tree-sitter is attractive to me:

Fast enough to parse on every keystroke in a text editor

4. Untyped-tree and type-safe node extraction

Interface is something like:

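use std::collections::HashMap;

// Placeholder types for this sketch.
type Id = usize;
struct Node;

struct Tree {
    nodes: HashMap<Id, Node>,
}

enum Error {
    SyntaxError,
    TypeMismatch,
}

// Implemented by the helper type generated for each non-terminal.
trait FromTree<'a>: Sized {
    fn from_tree(tree: &'a Tree, node_id: Id) -> Result<Self, Error>;
}

impl Tree {
    // The extracted value borrows from the tree instead of owning its data.
    fn extract<'a, T: FromTree<'a>>(&'a self, node_id: Id) -> Result<T, Error> {
        T::from_tree(self, node_id)
    }
}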

In this mode, the generated type for each non-terminal is just a helper type for the extraction API, and would not be an owned value.

5. Does integration with rowan make any sense?

I think the answer is "yes", and it should be a separate crate, parol-rowan. Parol can provide the grammar information it collects to help other parser generators.

6. An easy wrapper API of `Tokenizer` to perform lexing only

It would only support scanner-based scanner switching.

7. Generation of robust parser

An IDE needs to give users helpful hints and completions even when the source has some syntax errors. Parol currently stops at and reports the first syntax error it encounters; users should be able to try recovery techniques such as ignoring invalid tokens, replacing them with a dummy correct token, or partial tree construction. In my opinion, this should be a separate package, like parol-rowan.

References

jsinger67 · 2024-08-28T16:15:14Z

jsinger67
Aug 28, 2024
Maintainer

Hello @ryo33 ,

thank you very much for sharing your thoughts.

As a first reply, I'd like to write some comments on each of your seven points. I'm sure that our exchange of ideas will incrementally add to each topic further on.

1. Multiple parsers from one grammar

I think this is a rather rare use case because you have to design your grammar explicitly this way. LL and LR grammars differ in how they typically express recursion: LR can handle both left and right recursion, whereas LL has a problem with left recursion.

But it should be no problem to provide a command line argument that overrides the %grammar_type if a user wants to do this.

2. on_newline_parsed, on_whitespace_parsed

I actually like this idea. I have to consider the additional performance and memory overhead and maybe this could be made an optional feature.

3. Incremental parsing

I have to think it over, since it would have influence on the way the parser is set up on the input text.

4. Tree generation

parol already creates an untyped parse tree (in the result of the parse function). I'm not sure if this is sufficient for your ideas.

5. rowan

This is a topic I'm currently not really familiar with, so I should do some reading before making any statements about it.

6. Tokenizer

This is a hot topic. I'm currently working on a new lexer library, scnr, that should replace regex-automata in the end. When redesigning this I will have to make sure that a proper TokenStream/Tokenizer interface is available and that all parts have a user-friendly API.

7. Robust parsers

This is also a big topic, although since parol-0.24.0/parol_runtime-0.19.0 (2023-09-18), the generated LL(k) parsers implement a recovery strategy. This means parol won't stop at the first error it encounters.

But I think what you mean is that parsers in language servers have to be able to deal with grammars that are actually extensions of the original grammars of their languages, in the sense that they cover many error cases and derive correct hints from them.

So if you have grammar A and want to write an LS for this grammar for some IDE, the LS must actually recognize A*, where A* is A + {lots of error branches}.

As far as I know there is no simple way to generate A* from A; at least this is not a topic I have much experience with.

Current development phase

I want to elaborate on the current development phase of parol.

In the last three months I tried to stabilize parol to finally reach version 1.0. This is an effort
to give all users of parol the confidence that this tool will provide a stable interface so that they are finally
able to use it in a more production-like environment.

I hesitated a bit, mostly because of the known performance problems, which do not come from the parsing phase (that phase is actually fast) but from the startup phase of the generated parsers. Depending on the complexity of the regular expression that is built from the terminals of the user grammar, building the regex instance can be very expensive. It often far outweighs the actual parsing time.

I know that regexes from the regex-automata crate provide an overwhelming amount of regex features and can often lead to smaller grammar descriptions, but these advantages come at a cost.

During my development of scnr it became clear to me, that you can gain speed when you are willing to
sacrifice some comfort.

All these insights showed me that if I change this now, many users will end up with broken grammars.

To make a long story short, I currently tend to release version 1 first. This can be kept compatible for a long time.
Having done this step, I could introduce breaking changes in a more relaxed fashion, even transitioning to version 2 or so.

Some of your ideas would rather fall in the category of being realized in a version after the release of version 1.
I explained this for you to understand, why I'm currently cautious when it comes to greater changes.

I hope that the situation will settle soon.

0 replies

ryo33 · 2024-08-29T13:42:36Z

ryo33
Aug 29, 2024
Collaborator Author

Thank you for your reply and comments. I am happy to exchange ideas.

First, scnr is a fascinating project! I'd be happy to participate in the feedback process at some point. Many users, including myself, will greatly welcome the release of version 1.0.0. One of the most common issues in the Rust ecosystem is the need for more stable community packages. I have not released any package with a major version.

1. Multiple parsers from one grammar

I think this is a rather rare use case because you have to design your grammar explicitly this way. LL and LR grammars differ in how they typically express recursion: LR can handle both left and right recursion, whereas LL has a problem with left recursion.

Yes. I may have confused grammar types with types of use-case-focused parsers. I initially thought, "It would be cool if one grammar definition could generate more parsers than just the normal one."

By the way, it's fun to check whether the grammar I'm writing works as both LL(k) and LR(k), like playing a game.

2. on_newline_parsed, on_whitespace_parsed

I actually like this idea. I have to consider the additional performance and memory overhead and maybe this could be made an optional feature.

I'm happy to hear that.

4. Tree generation

parol already creates an untyped parse tree (in the result of the parse function). I'm not sure if this is sufficient for your ideas.

I actually had not looked into the API, but it seems sufficient. Those extractor types I'd like to have could be derived from the generated AST types or from some information that parol uses to generate the parser.

7. Robust parsers

This is also a big topic, although since parol-0.24.0/parol_runtime-0.19.0 (2023-09-18), the generated LL(k) parsers implement a recovery strategy. This means parol won't stop at the first error it encounters.

I should explore what I can do with this feature.

Overall

I hope half or more of my ideas will not block the stabilization phase, and I think they should not. I could implement them mostly at the third-party level by using Parol's existing public APIs and the generated parser. It may need some changes in the core, parol, but with a bit of care, those changes can be made without breaking anything. There should also be cases where I have to wait for the roadmap of version 2, but it's fine. A stable major version is always the future!

0 replies

jsinger67 · 2024-08-29T15:24:36Z

jsinger67
Aug 29, 2024
Maintainer

Hello @ryo33,

(This is actually not a reply to your latest answer, but a follow up of my first answer.)

I read a lot the last two days.

Here are my conclusions, surely preliminary.

Topics 2, 4, 5 and maybe 7 are now somewhat clearer to me.
First, I can give you good news about the rowan subject.

parol right now uses a very good alternative to rowan, namely the
syntree crate from @udoprog.
Currently parol uses empty spans only, because there was no need for spans in the tree and because
of the lack of a strategy for filling in the gaps. But this is no longer that complicated from my present
perspective.
The ParseTreeType enum in parol_runtime currently consists of only two variants, T(Token<'t>) and
N(&'static str), for terminals and non-terminals, i.e. for leaves and subtrees. In my opinion this
type only has to be amended by some kind of placeholder for all spans the parser skips when building up
the parse tree, or perhaps better, by special token variants for EOI, NEW_LINE, WHITESPACE,
LINE_COMMENT and BLOCK_COMMENT, and the parser's implementation has to change so that it does not simply skip
these tokens but inserts them into the parse tree with the right node (or rather leaf) type.

The whole subject of ungrammar I currently rank as not very relevant for parol.
Here is why I think this way.
In my opinion ungrammar tries to solve a problem (e.g. for rust-analyzer) that simply does not
exist in parol. The CST (concrete syntax tree) type information need not be generated by
interpreting an ungrammar description of the parse tree; instead it is fully generated as
native Rust types that are directly deduced from the structure of the grammar. So if you need
lossless information, e.g. in some language server, you just forgo using the cut operator ^.

I'm quite convinced that with a lossless parse tree one can somehow intermingle,
or project, the data from the CST onto the nodes in the parse tree in a way that the latter complements
the former.

The whole topic of breaking up the structure of parol into more composable parts, like lexer, parser,
grammar transformer, grammar type generator, source generator is permanently present in my head, but
it needs some time to build up a reasonable construct of ideas.

Resilient LL parsing is an interesting topic, too. The provided examples and rules of thumb for
realizing something in this regard solely consider hand-written parsers. But I saw some interesting
starting points, one of them was a kind of meta parser that only creates events that are similar to
the API of rowan/syntree, i.e. open, close etc. I think this is not the most exciting part.
And again, this is all hand-written, and for a parser generator you have to provide a generically
applicable approach.
But I hope that over time a generic approach could be found.

0 replies

jsinger67 · 2024-08-29T16:32:51Z

jsinger67
Aug 29, 2024
Maintainer

Now the reply to your latest post 😉

I must say, I also appreciate exchanging thoughts, just like you. Obviously, there are not many people out there who want to contribute to the process of finding ideas and making an even better product. Although, the ones that use parol in their public projects here on GitHub have questions and sometimes feedback anyway. But when looking at the download counts on crates.io, this seems to be a very small share.

I have a branch where parol already uses scnr, so if you are interested (perhaps in the performance) you can have a look at the scnr branch.

I think that version 1 will come very soon, maybe already next weekend.
Once this is out of the way, we can make bigger changes.

0 replies

ryo33 · 2024-09-15T06:32:10Z

ryo33
Sep 15, 2024
Collaborator Author

Thank you for sharing your deep thoughts! I wish I could have responded earlier, but I apologize for the delay.

Syntree

I'm glad to hear about syntree. Now, what I'd like to try with it is to add an option to generate an expanded ParseTreeType dedicated to the specific grammar, like:

enum ParseTreeType<'t> {
    // non-terminals
    /// Root of the syntax
    Root,
    Expr,
    ...,

    // terminals
    Number(Token<'t>),
    Plus(Token<'t>),
    ...,

    // parol non-terminals and terminals
    Whitespace(Token<'t>),
    Newline(Token<'t>),
    LineComment,
    BlockComment,
    LineCommentStart(Token<'t>),
    BlockCommentBegin(Token<'t>),
    BlockCommentEnd(Token<'t>),
    LineCommentBody(Token<'t>),
    BlockCommentBody(Token<'t>),
}

CST

I understand your point about the topic. The solution of removing the cut operators from the grammar seems to work well in my project. However, it might degrade the experience of interacting with the tree (and the parser performance) in use cases that only need an AST. At some point, I may introduce a lossless option to the generator (both CLI and builder) that ignores the ^ operators in the generation phase. It would allow users to write a single grammar for both scenarios.

Incremental Parsing

For this topic, I've decided to restart a project called query-flow, a generic incremental computing library. Once its PoC works, I'd like to experiment with using it with parol in some way. I think incremental parsing matters when it is part of a whole incremental compiler or language server.

0 replies

jsinger67 · 2024-09-15T12:24:17Z

jsinger67
Sep 15, 2024
Maintainer

Hi Ryo,

thanks for your answer.

Adapting the parse tree type in the way you desire is unfortunately not feasible, because it
is used in the parser runtime and is agnostic of the user grammar. In particular, putting user
terminals and non-terminals into it is a problem. All the other, invariant aspects of parol are feasible,
and you have listed them correctly.

The user non-terminals are solely available in the

pub enum ASTType<'t>

which is generated in the ..._grammar_trait.rs

My conception of a feasible ParseTreeType<'t> in parol_runtime would look like this

pub enum ParseTreeType<'t> {
    /// A scanned user token
    T(Token<'t>),

    /// A non-terminal name.
    N(&'static str),

    // parol non-terminals and terminals
    Whitespace(Token<'t>),
    Newline(Token<'t>),
    LineComment,
    BlockComment,
    LineCommentStart(Token<'t>),
    BlockCommentBegin(Token<'t>),
    BlockCommentEnd(Token<'t>),
    LineCommentBody(Token<'t>),
    BlockCommentBody(Token<'t>),
}

Having such an extension we can fully integrate with the changes I'm currently working on and which
are necessary preconditions to have an extended parse tree type. The syntree will be the vehicle
that enables us to have lossless grammar information later on.

The handling of the cut operator can surely be improved. Such a lossless option that you suggest
can surely be useful.

I had a short look at your query-flow crate. I would highly
appreciate it if the whole concept of incremental parsing made progress. I'm curious whether your
PoC will work out as expected.

Here are the things I'm currently working on

I'm currently taking small steps towards the redesign of the interface between scanner and parser.

The first step was to enable syntree_layout to
generate layouts from trees based on the source code (&str) that has been used as input.

This is already available as of version 0.3.1 of syntree_layout.

The idea behind this is to eventually strip the text: Cow<'t, str> from the token, thus making it
more lightweight. The scanned text of a token can be easily calculated from its span information at
any time later during the parsing process, while it can be assumed that the scanned input text is
still around and valid. syntree encodes its token information with the help of spans and the
decision to strip extra text from the token type should fit more naturally into this concept. The
tree provides the spans over tokens and their belonging to grammar structures and in conjunction
with the input text the scanned text of each sub tree is calculable.
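To illustrate the idea, here is a minimal sketch of a token that carries only its span. SpanToken, its fields and subtree_text are hypothetical names for this example, not the actual token type of parol_runtime or syntree.

```rust
use std::ops::Range;

/// Hypothetical lightweight token: a type number plus a byte span,
/// with no copy of the scanned text.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SpanToken {
    pub token_type: u16,
    pub span: Range<usize>,
}

impl SpanToken {
    /// Recovers the scanned text on demand. This is only meaningful while
    /// `input` is the text the token was scanned from.
    pub fn text<'i>(&self, input: &'i str) -> &'i str {
        &input[self.span.clone()]
    }
}

/// The text of a subtree is the slice from the start of its first token
/// to the end of its last token.
pub fn subtree_text<'i>(tokens: &[SpanToken], input: &'i str) -> &'i str {
    match (tokens.first(), tokens.last()) {
        (Some(first), Some(last)) => &input[first.span.start..last.span.end],
        _ => "",
    }
}
```

Note that subtree_text also returns the whitespace between the tokens, since it slices the input rather than concatenating token texts.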

The next preparation will be to enable the scnr crate to provide line/column information
from offsets in the input string, this way:

offset -> (line, column)

I plan to track the offsets of each line start character automatically during the scan process and
store them subsequently in a vector. This will enable the scanner to easily find the line/column
information, for instance by a binary search in this vector and calculating the column as the difference

column = offset - start_of_line

Having this available you can strip further data from the token type such as location information
because this information is calculable on demand.
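The lookup described above could be sketched roughly like this. LineIndex is a hypothetical name, not the scnr API; for brevity the line starts are collected in one pass up front instead of during scanning, and lines and columns are reported 1-based with the column counted in bytes.

```rust
/// Byte offsets at which each line starts; the first line starts at 0.
pub struct LineIndex {
    line_starts: Vec<usize>,
}

impl LineIndex {
    pub fn new(input: &str) -> Self {
        let mut line_starts = vec![0];
        for (offset, byte) in input.bytes().enumerate() {
            if byte == b'\n' {
                // The next line starts right after the newline character.
                line_starts.push(offset + 1);
            }
        }
        Self { line_starts }
    }

    /// Maps a byte offset to a 1-based (line, column) pair.
    pub fn line_col(&self, offset: usize) -> (usize, usize) {
        // Binary search: number of line starts <= offset, minus one, is the
        // 0-based line. `line_starts[0] == 0` guarantees the result is >= 1.
        let line = self.line_starts.partition_point(|&start| start <= offset) - 1;
        let start_of_line = self.line_starts[line];
        (line + 1, offset - start_of_line + 1)
    }
}
```

With this in place, a token only needs its start offset; the location is computed when an error message or a tool actually asks for it.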

These are all changes and optimizations regarding the token data type.

Another field of design changes is the scanner instantiation, and here more precisely the data
that is necessary to fill any scanner sufficiently. Conceptual work has been done in the scnr
crate. Here I use a slice of scanner modes as initialization data. Each scanner mode has a name,
an (implicit) index, a non-empty list of tuples (pattern, token_type_number) and a possibly
empty list of transitions as tuples (token_type_number, new_scanner_mode_index).
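As a sketch, such initialization data might be laid out as follows. The struct below is written from the description above and is not copied from scnr; the field names, the next_mode helper and the two example modes are assumptions for illustration.

```rust
/// Hypothetical scanner mode description; its index is its position in the
/// slice of modes.
pub struct ScannerMode {
    pub name: &'static str,
    /// Non-empty list of (pattern, token_type_number).
    pub patterns: &'static [(&'static str, usize)],
    /// Possibly empty list of (token_type_number, new_scanner_mode_index).
    pub transitions: &'static [(usize, usize)],
}

impl ScannerMode {
    /// Returns the mode to switch to after a token of the given type was
    /// scanned, or `None` if the scanner stays in the current mode.
    pub fn next_mode(&self, token_type: usize) -> Option<usize> {
        self.transitions
            .iter()
            .find(|(t, _)| *t == token_type)
            .map(|&(_, mode)| mode)
    }
}

/// Example data: token type 2 (a double quote) toggles between the two modes.
pub const MODES: &[ScannerMode] = &[
    ScannerMode {
        name: "INITIAL",
        patterns: &[(r"[a-z]+", 1), (r#"""#, 2)],
        transitions: &[(2, 1)],
    },
    ScannerMode {
        name: "STRING",
        patterns: &[(r#"[^"]+"#, 3), (r#"""#, 2)],
        transitions: &[(2, 0)],
    },
];
```

Since the data is plain and static, any scanner implementation could consume the same slice, whichever regex engine it uses internally.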

I think this can be the basis of an abstract interface that could empower parol to instantiate any
concrete type of scanner implementation. It could enable us to provide two possible
concrete scanner implementations right now, one that uses the well-proven regex-automata approach and a
second one that uses the new and upcoming scnr-based approach. Other variants are conceivable.

The next field of redesign is concerning the runtime interface between scanner and parser. Beside
the initialization part this part is additionally necessary to become ready to plug in different
scanner implementations easily.

As I said, I'm currently in the phase of working this out and I'm pretty sure that this gives a
big boost to the overall usability. The short-term effect will hopefully be beneficial for lossless
parse trees.

Generic tree construction from extended ParseTreeType is another more difficult problem, which I
would for now postpone until we have a new token handling.

1 reply

ryo33 Sep 15, 2024
Collaborator Author

Thank you for sharing your detailed plans! They clarify how I should approach my own goals with the future of Parol.

The redesigns you mentioned, pointer-style tokens, unified scanner instantiation, and the pluggable runtime API, are impressive and elegant! I believe things I want to achieve will work well based on the designs.

Also, I'm glad to hear that you are interested in query-flow. (To be honest, I plan to restart by forgetting about the current implementation and removing it totally.)

jsinger67 · 2024-09-15T15:29:34Z

jsinger67
Sep 15, 2024
Maintainer

The concepts behind query-flow are admittedly not completely clear to me right now.
Please feel free to fill me in once you have managed to start it over.

I will detail my progress on the above mentioned topics here for you.

Keep up the good work 👍

0 replies

ryo33 · 2024-12-16T14:33:54Z

ryo33
Dec 16, 2024
Collaborator Author

Status update: I've started to work on query-flow, and I'm currently prototyping the dependency tracking primitive called https://github.com/ryo33/whale for implementing query-flow. It should be purpose-generic.
However, I still cannot understand how to do incremental parsing in real code as described in the previously referenced articles.
At least, query-flow (built on whale) can be used with a parol-generated AST, by treating a request to get a sub-AST from the AST as a query in whale's terms and tracking it within the whole incremental compiler state. I should add thorough tests to whale and finish whale and query-flow. Once that is done, I plan to test it in a real-world compiler in the https://github.com/hihaheho/swon project.

0 replies

jsinger67 · 2024-12-16T14:59:45Z

jsinger67
Dec 16, 2024
Maintainer

Thank you for this update.
I'm currently updating parol to use syntree (a rowan-like AST crate) in its newest version.
The goal is to have a lossless AST in the end.
I plan to insert tokens that are not covered by the grammar induced data types in order of appearance into the AST.
Most of the other updates in version 2 I already documented in the book.

I will have a look at whale soon.

0 replies

jsinger67 · 2024-12-21T16:20:55Z

jsinger67
Dec 21, 2024
Maintainer

Hi @ryo33,
I managed to have a lossless parse tree on main for both LL and LR parsers now.
Maybe you could do some verifications if you find some time.
Thanks in advance!

OT:
As far as I know you celebrate Christmas over there in Japan, too😉. Therefore I would like to wish you a Merry Christmas!

1 reply

ryo33 Jan 4, 2025
Collaborator Author

I've confirmed that I can reproduce the same string as an input by walking each leaf of the parse tree. I'm looking forward to building a fully functional application with this feature. Thank you for providing such a great capability.
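A check of this kind can be sketched as follows. The Node enum is a plain stand-in for the parse tree and does not use the syntree API; the names are assumptions for illustration.

```rust
/// Stand-in for a lossless parse tree: leaves hold the scanned text of user
/// tokens and of trivia (whitespace, newlines, comments) alike.
pub enum Node<'t> {
    Leaf(&'t str),
    Inner(&'static str, Vec<Node<'t>>),
}

/// Appends the text of all leaves below `node` to `out`, in document order.
pub fn leaves_text(node: &Node<'_>, out: &mut String) {
    match node {
        Node::Leaf(text) => out.push_str(text),
        Node::Inner(_, children) => {
            for child in children {
                leaves_text(child, out);
            }
        }
    }
}

/// A tree is lossless with respect to `input` if concatenating its leaves
/// yields exactly the input.
pub fn is_lossless(root: &Node<'_>, input: &str) -> bool {
    let mut out = String::with_capacity(input.len());
    leaves_text(root, &mut out);
    out == input
}
```

A tree that drops a whitespace or comment leaf fails this check, so it works as a simple regression test for the lossless mode.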
Merry Christmas & Happy New Year!

ryo33 · 2025-01-25T13:29:12Z

ryo33
Jan 25, 2025
Collaborator Author

One challenge I'm facing is that syntree does not currently support tree modification. I'm considering two options:

Contribute a tree modification feature to syntree
Generate a new parse function that supports any tree type implementing a new trait called TreeConstruct.

I'd prefer the second option (at least for now) because I see some value in making the runtime tree-library agnostic, and because my knowledge of this kind of data structure and of syntree's design is very limited, so implementing the first option would require significant time.
Also, I'd like to avoid introducing any further layers of yak-shaving for my downstream projects.
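A rough sketch of what the second option might look like. Only the trait name TreeConstruct comes from the idea above; its methods, the SimpleTree type and parse_sum (a stand-in for a generated parse function) are assumptions for illustration.

```rust
/// Hypothetical trait a generated parse function could be generic over.
pub trait TreeConstruct<'t> {
    type Tree;
    fn open_non_terminal(&mut self, name: &'static str);
    fn add_token(&mut self, text: &'t str);
    fn close_non_terminal(&mut self);
    fn build(self) -> Self::Tree;
}

#[derive(Debug, PartialEq)]
pub enum SimpleTree<'t> {
    Token(&'t str),
    NonTerminal(&'static str, Vec<SimpleTree<'t>>),
}

/// One possible implementation: an owned, freely modifiable tree.
#[derive(Default)]
pub struct SimpleBuilder<'t> {
    stack: Vec<(&'static str, Vec<SimpleTree<'t>>)>,
    roots: Vec<SimpleTree<'t>>,
}

impl<'t> SimpleBuilder<'t> {
    /// Attaches a finished node to the innermost open non-terminal.
    fn push(&mut self, node: SimpleTree<'t>) {
        match self.stack.last_mut() {
            Some((_, children)) => children.push(node),
            None => self.roots.push(node),
        }
    }
}

impl<'t> TreeConstruct<'t> for SimpleBuilder<'t> {
    type Tree = Vec<SimpleTree<'t>>;

    fn open_non_terminal(&mut self, name: &'static str) {
        self.stack.push((name, Vec::new()));
    }
    fn add_token(&mut self, text: &'t str) {
        self.push(SimpleTree::Token(text));
    }
    fn close_non_terminal(&mut self) {
        if let Some((name, children)) = self.stack.pop() {
            self.push(SimpleTree::NonTerminal(name, children));
        }
    }
    fn build(self) -> Self::Tree {
        self.roots
    }
}

/// Stand-in for a generated parse function: it drives any builder.
pub fn parse_sum<'t, B: TreeConstruct<'t>>(tokens: &[&'t str], mut builder: B) -> B::Tree {
    builder.open_non_terminal("Sum");
    for &token in tokens {
        builder.add_token(token);
    }
    builder.close_non_terminal();
    builder.build()
}
```

A syntree-backed implementation of the same trait could sit next to SimpleBuilder, so the choice of tree library would move out of the runtime and into the caller.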

0 replies
