Replies: 11 comments 2 replies
-
|
Hello @ryo33 , thank you very much for sharing your thoughts. As a first reply I like to write some comments to each of your seven points. I'm sure that our exchange of ideas will incrementally add to each topic further on. 1. Multiple parsers from one grammarI think this is a rather rare use case because you have to design your grammar explicitly this way. LL and LR grammars differ in the typical representation of recursion, although LR can also handle left and right recursion, but LL has a problem with left recursions. But it should be no problem to provide a command line argument that overrides the %grammar_type if a user wants to do this. 2. on_newline_parsed, on_whitespace_parsedI actually like this idea. I have to consider the additional performance and memory overhead and maybe this could be made an optional feature. 3. Incremental parsingI have to think it over, since it would have influence on the way the parser is set up on the input text. 4. Tree generationparol already creates an untyped parse tree (in the result of the parse function). I'm not sure if this is sufficient for your ideas. 5. rowanThis is a topic with what I'm currently not really familiar with. So, I think I should have some readings before giving any statements about it. 6. TokenizerThis is a hot topic. I currently work on a new lexer library, scnr, that should replace regex-automata in the end. When redesigning this I will have to take into consideration that a proper TokenStream/Tokenizer interface is available and that all parts have a user friendly API. 7. Robust parsersThis is also a big topic, although since parol-0.24.0/parol_runtime-0.19.0 (2023-09-18), the generated LL(k) parsers implement a recovery strategy. This means parol won't stop at the first error it encounters. But I think, what you mean is that parsers in language servers have be able to deal with grammars that are actually extensions of the original grammars of their languages in that sense that they cover many error cases and derive correct hints from them. So if you have grammar A and want to write a LS for this grammar for some IDE the LS must actually recognize A*, where A* is A + {lots of error branches}. As far as I know there is no simple way to generate A* from A, at least this is no topic I had much experience on. Current development phaseI want to elaborate on the current development phase of parol. In the last three months I tried to stabilize parol to finally reach version 1.0. This is an effort I hesitated a bit, mostly because of the known performance problems, that are not coming from the parsing phase, this phase is actually fast, but from the startup phase of the generated parsers. Depending of the complexity of the regular expression that is built up from the terminals of the user grammar, the build process of the regex instance is huge. It often much outweighs the actual parsing time. I know that regexes from the regex-automata crate provide an overwhelming amount of regex features and can often lead to smaller grammar descriptions, but these advantages come at a cost. During my development of scnr it became clear to me, that you can gain speed when you are willing to All these insights showed me, that when I change this now, many user will end up with broken grammars. To make a long story short, currently I tend to release version 1 first. This can be kept compatible for a long time. Some of your ideas would rather fall in the category of being realized in a version after the release of version 1. I hope that the situation will settle soon. |
Beta Was this translation helpful? Give feedback.
-
|
Thank you for your reply and comments. I am happy to exchange ideas. First, scnr is a fascinating project! I'd be happy to participate in the feedback process at some point. Many users, including myself, will greatly welcome the release of version 1.0.0. One of the most common issues in the Rust ecosystem is the need for more stable community packages. I have not released any package with a major version. 1. Multiple parsers from one grammar
Yes. I may have confused grammar types with types of use-case-focused parsers. I initially thought, "It's cool if one grammar definition can generate parsers more than the normal one." By the way, it's fun to check whether my writing grammar is capable of both LL(k) and LR(k), like playing a game. 2. on_newline_parsed, on_whitespace_parsed
I'm happy to hear that. 4. Tree generation
I actually had not looked into the API, but it seems sufficient. Those extractor types I'd like to have could be derived from the generated AST types or from some information that parol uses to generate the parser. 7. Robust parsers
I should explore what I can do with this feature. OverallI hope half or more of my ideas will not block the stabilization phase, and I think they should not. I could implement them mostly at the third-party level by using Parol's existing public APIs and the generated parser. It may need some changes in the core, parol, but with a bit of care, those changes can be made without breaking anything. There should also be cases where I have to wait for the roadmap of version 2, but it's fine. A stable major version is always the future! |
Beta Was this translation helpful? Give feedback.
-
|
Hello @ryo33, (This is actually not a reply to your latest answer, but a follow up of my first answer.) I read a lot the last two days. Here are my conclusions, surly preliminary. Topic 2, 4, 5 and maybe 7 are now somewhat clearer for me. parol right now uses a very good alternative to The whole subject of I'm quite convinced that when having a parse tree that is lossless one can some sort of intermingle, The whole topic of breaking up the structure of parol into more composable parts, like lexer, parser, Resilient LL parsing is an interesting topic, too. The provided examples and rules of thumb for |
Beta Was this translation helpful? Give feedback.
-
|
Now the reply to your latest post 😉 I must say, I also appreciate exchanging thoughts, just like you. Obviousely, there are not much people out there that want to contribute in the process of finding ideas and making an even better product. Although, the ones that use parol in their public project here on github have questions and sometimes feedback anayway. But when looking at the download counts on crates.io, this seems to be a very small share. I have a branch where parol already uses sncr, so if you are interested in (perhaps the performance) you can have a look on branch scnr I think that version 1 will come very soon, maybe already the next weekend. |
Beta Was this translation helpful? Give feedback.
-
|
Thank you for sharing your deep thoughts! I wish I could have responded earlier, but I apologize for the delay. SyntreeI'm glad to hear about enum ParseTreeType<'t> {
// non-terminals
/// Root of the syntax
Root,
Expr,
...,
// terminals
Number(Token<'t>),
Plus(Token<'t>),
...,
// parol non-terminals and terminals
Whitespace(Token<'t>),
Newline(Token<'t>),
LineComment,
BlockComment,
LineCommentStart(Token<'t>),
BlockCommentBegin(Token<'t>),
BlockCommentEnd(Token<'t>),
LineCommentBody(Token<'t>),
BlockCommentBody(Token<'t>),
}CSTI understand your point about the topic. The solution that cuts cut operators in grammar seems to work well in my project. However, it might degrade the experience of interacting with the tree (and the parser performance) in use cases that only need AST. At some point, I may introduce a Incremental ParsingFor this topic, I've decided to restart a project called query-flow, a generic incremental computing library. After its PoC works, I'd like to experiment to use it with parol in any way. I think it would matter if the incremental parsing is a part of the whole incremental compiler or language server. |
Beta Was this translation helpful? Give feedback.
-
|
Hi Ryo, thanks for your answer. Adopting the parse tree type in such a way as you desire, is not unfortunately feasible, because it The user non-terminals are solely available in the pub enum ASTType<'t>which is generated in the My conception of a feasible pub enum ParseTreeType<'t> {
/// A scanned user token
T(Token<'t>),
/// A non-terminal name.
N(&'static str),
// parol non-terminals and terminals
Whitespace(Token<'t>),
Newline(Token<'t>),
LineComment,
BlockComment,
LineCommentStart(Token<'t>),
BlockCommentBegin(Token<'t>),
BlockCommentEnd(Token<'t>),
LineCommentBody(Token<'t>),
BlockCommentBody(Token<'t>),
}Having such an extension we can fully integrate with the changes I'm currently working on and which The handling of the cut operator can surely be improved. Such a lossless option that you suggest I had a short look at your query-flow crate. I would highly Here are the things I'm currently working onI currently make small steps towards the redesign of the interface between scanner and parser. The first step was to enable the syntree_layout to be This is already available as of version 0.3.1 of The idea behind is to eventually strip the The next preparation will encompass to enable the I plan to track the offsets of each line start character automatically during the scan process and Having this available you can strip further data from the token type such as location information These are all changes and optimizations regarding the token data type. Another field of design changes is the scanner instantiation, and here more precisely the data I think this can be the basis of an abstract interface that could empower parol to instantiate any The next field of redesign is concerning the runtime interface between scanner and parser. Beside As I said I'm currently in the phase of working this out and I'm pretty sure that this gives a Generic tree construction from extended ParseTreeType is another more difficult problem, which I |
Beta Was this translation helpful? Give feedback.
-
|
The concepts behind query-flow are admittedly not completely clear to me right now. I will detail my progress on the above mentioned topics here for you. Keep up the good work 👍 |
Beta Was this translation helpful? Give feedback.
-
|
Status Update: I've started to work on the query-flow, and I'm currently prototyping the dependency tracking primitive called https://github.com/ryo33/whale for implementing query-flow. It should purpose-generic. |
Beta Was this translation helpful? Give feedback.
-
|
Thank you for this update. I will have a look at whale soon. |
Beta Was this translation helpful? Give feedback.
-
|
Hi @ryo33, OT: |
Beta Was this translation helpful? Give feedback.
-
|
One challenge I'm facing is that syntree does not currently support tree modification. I'm considering two options:
I'd prefer the second option (at least for now) because I see some value in making the runtime tree library agnostic, and my knowledge of this kind of data structure and syntree's design is very limited and would require significant time to implement it. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
I've been thinking about the following things and more subtle ones in the past few days. I haven't given them much thought, so please don't take them seriously.
1. Generating multiple types of parsers from a single grammar
If a PAR file is valid grammar in both
%grammar_type 'LL(k)'and%grammar_type 'LALR(1), theoretically we can choose which type to generate,--grammar-type=lalr1in CLI or.grammar_type(GrammarType::Lalr1)in the builder.Furthermore, there can be various use-case specific parsers
2.
fn on_newline_parsedandfn on_whitespace_parsedfn on_comment_parsedis a what-all-we-need-is feature because it's lossless about comments without puttingCommenteverywhere in the grammar, which ruins the grammar's readability, maintainability, and beautifulness.In the same idea, we could have
on_newline_parsed(or "scanned"?) andon_whitespace_parsed.3. Possibility of incremental parsing in parol
I don't believe incremental parsing solves any performance problem except in corner case scenario, but it may because of lack of knowledge in this area. But it's attractive to me the statement that tree-sitters aims:
4. Untyped-tree and type-safe node extraction
Interface is something like:
In this mode, a generated type for each non-terminal is just a helper type for the extraction API, and would not be owned value.
5. Does integration with rowan makes any sense?
I think this "yes", and it should be separate crate
parol-rowan. Parol can provide collected grammar information to help other parser generators.6. An easy wrapper API of
Tokenizerto perform only lexerIt would only supports scanner-based scanner switching.
7. Generation of robust parser
IDE needs to give users helpful hits and completions with have some syntax errors. Parol already stops on and report the first encountered syntax error, user should be able to try some recovery technique like ignoring invalid tokens, replacing it with dummy correct token, partial tree construction. This should be a separate package as
parol-rowanso in my opinion.References
Beta Was this translation helpful? Give feedback.
All reactions