@LeonPuchinger LeonPuchinger commented Mar 7, 2025

This PR adds stateful tokenization to the library. Depending on previously matched tokens, the lexer can be put into different states, each of which uses its own ruleset for matching the following tokens. This is useful, for instance, when tokenizing string literals that contain string interpolations (e.g. "foo${42}bar") or nested structures such as nested C-style block comments. The states are maintained on a stack held by the lexer. For a detailed explanation of the new behavior, I suggest taking a look at the Readme changes.
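To illustrate the general idea (not the API added by this PR), here is a minimal, self-contained sketch of a stack-based stateful lexer that handles nested block comments. All names in it (`Rule`, `Token`, `rules`, `tokenize`, the `push`/`pop` fields and the state names `code` and `comment`) are hypothetical and chosen only for this example; the actual interface is the one described in the Readme changes.

```typescript
// Illustrative sketch only: none of these names come from the library.
type Rule = {
  kind: string;
  pattern: RegExp; // sticky ("y") so a match must start exactly at the cursor
  push?: string;   // state to push after this rule matches
  pop?: boolean;   // pop the current state after this rule matches
};

type Token = { kind: string; text: string; depth: number };

// One ruleset per lexer state.
const rules: Record<string, Rule[]> = {
  code: [
    { kind: "CommentOpen", pattern: /\/\*/y, push: "comment" },
    { kind: "Space", pattern: /\s+/y },
    { kind: "Word", pattern: /\w+/y },
  ],
  comment: [
    // Opening another comment pushes the same state again, which is
    // what makes nesting work.
    { kind: "CommentOpen", pattern: /\/\*/y, push: "comment" },
    { kind: "CommentClose", pattern: /\*\//y, pop: true },
    // Anything that does not start a "/*" or "*/" delimiter.
    { kind: "CommentText", pattern: /(?:[^*/]|\*(?!\/)|\/(?!\*))+/y },
  ],
};

function tokenize(input: string, initial = "code"): Token[] {
  const stack: string[] = [initial];
  const tokens: Token[] = [];
  let pos = 0;

  while (pos < input.length) {
    const state = stack[stack.length - 1];
    let matched = false;

    // Only the rules of the state on top of the stack are tried.
    for (const rule of rules[state]) {
      rule.pattern.lastIndex = pos;
      const m = rule.pattern.exec(input);
      if (m === null) continue;

      // depth is the nesting level at the moment the token was matched.
      tokens.push({ kind: rule.kind, text: m[0], depth: stack.length - 1 });
      pos += m[0].length;
      if (rule.pop) stack.pop();
      if (rule.push !== undefined) stack.push(rule.push);
      matched = true;
      break;
    }

    if (!matched) {
      throw new Error(`Unexpected input at offset ${pos} in state "${state}"`);
    }
  }

  if (stack.length !== 1) {
    throw new Error("Unterminated block comment");
  }
  return tokens;
}

// Usage: the inner comment is lexed one level deeper than the outer one.
console.log(tokenize("a /* b /* c */ d */ e").map((t) => `${t.kind}@${t.depth}`));
```

A stateless lexer would treat the first `*/` as the end of the whole comment; with the stack, each `/*` pushes a state and each `*/` pops one, so the comment only ends when the stack is back at its initial depth. String interpolation can presumably be handled the same way, with `${` pushing the normal code state from inside a string state.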

Notes:

  • I have created a new test case that verifies stateful lexing, using nested block comments as an example.
  • I have documented my changes in the Tokenizer.md Readme.
  • This is not a breaking change; existing lexers should work just the same as before.

These changes only make sense if they don't conflict with any competing implementation (planned or pending), since the Readme states the following:

Context sensitive tokenizer is also comming.

@LeonPuchinger

@microsoft-github-policy-service agree

@LeonPuchinger LeonPuchinger marked this pull request as ready for review March 7, 2025 13:01