Implement stateful tokenization #59

LeonPuchinger · 2025-03-07T12:40:46Z

This PR adds stateful tokenization to the library. Based on previously matched tokens, the lexer can be put into different states, each of which uses its own ruleset for matching the next tokens. This is useful, for instance, when tokenizing or parsing string literals that contain string interpolations (e.g. "foo${42}bar") or nested structures such as nested C-style block comments. The states are maintained on a stack held by the lexer. To fully understand these changes, I suggest taking a look at the Readme changes, where the new behavior is explained in detail.
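To illustrate the general idea of a stack of lexer states, here is a minimal, self-contained sketch in TypeScript. It is not the API introduced by this PR: the names `Rule`, `Action`, `States`, and `tokenize`, the state names `code` and `comment`, and the choice to ignore a pop on the last remaining state are all assumptions made for this example. See the Readme changes for the actual interface.

```typescript
// Hypothetical types for illustration only; not the library's actual API.
type Action =
  | { kind: "push"; state: string } // enter another state
  | { kind: "pop" }                 // return to the previous state
  | { kind: "none" };               // stay in the current state

interface Rule {
  pattern: RegExp; // anchored with ^ and applied to the remaining input
  tokenKind: string;
  action: Action;
}

type States = Record<string, Rule[]>;

interface Token {
  kind: string;
  text: string;
}

function tokenize(input: string, states: States, initial: string): Token[] {
  const stack: string[] = [initial];
  const tokens: Token[] = [];
  let pos = 0;
  while (pos < input.length) {
    // The ruleset is chosen by the state on top of the stack.
    const current = stack[stack.length - 1] ?? initial;
    const rules = states[current] ?? [];
    const rest = input.slice(pos);
    let matched = false;
    for (const rule of rules) {
      const m = rule.pattern.exec(rest);
      if (m === null || m[0].length === 0) {
        continue;
      }
      tokens.push({ kind: rule.tokenKind, text: m[0] });
      pos += m[0].length;
      const action = rule.action;
      if (action.kind === "push") {
        stack.push(action.state);
      } else if (action.kind === "pop" && stack.length > 1) {
        // Assumption of this sketch: the last state is never popped.
        stack.pop();
      }
      matched = true;
      break;
    }
    if (!matched) {
      throw new Error(`No rule matches at position ${pos}`);
    }
  }
  return tokens;
}

// Nested C-style block comments: every "/*" pushes, every "*/" pops.
const states: States = {
  code: [
    { pattern: /^\/\*/, tokenKind: "CommentOpen", action: { kind: "push", state: "comment" } },
    { pattern: /^(?:[^/]+|\/)/, tokenKind: "Code", action: { kind: "none" } },
  ],
  comment: [
    { pattern: /^\/\*/, tokenKind: "CommentOpen", action: { kind: "push", state: "comment" } },
    { pattern: /^\*\//, tokenKind: "CommentClose", action: { kind: "pop" } },
    { pattern: /^(?:[^/*]+|[/*])/, tokenKind: "CommentText", action: { kind: "none" } },
  ],
};

const tokens = tokenize("a /* b /* c */ d */ e", states, "code");
console.log(tokens.map((t) => t.kind).join(" "));
```

Because the first `*/` only pops the inner comment state, the text ` d ` is still tokenized with the comment ruleset. A lexer with a single, fixed ruleset cannot express this without tracking the nesting depth somewhere else.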

Notes:

- I have created a new test case that verifies stateful lexing, using nested block comments as an example.
- I have documented my changes in the Tokenizer.md Readme.
- These changes are not breaking: existing lexers should work exactly as before.

These changes only make sense if they don't conflict with any competing implementations (planned or pending), since the Readme states the following:

Context sensitive tokenizer is also comming.

LeonPuchinger · 2025-03-07T12:45:57Z

@microsoft-github-policy-service agree

LeonPuchinger added 10 commits March 6, 2025 20:05

define types to allow building a stateful lexer

713ea05

analyze the rules in nested lexer states

218a1f7

Implement the stateful lexer

ccba6f6

Resolve naming conflict

5294107

apply tslint suggestions to follow project rules

c44c98e

Test stateful tokenization on c-style block comments

4c174ac

Don't differentiate between top-level rules and states on the type level

44255c2

Test the stateful lexer with nested block comments

429cd45

Support pushing the current state to the stack again

a519209

apply tslint suggestions

1b81b3e

LeonPuchinger marked this pull request as ready for review March 7, 2025 13:01

LeonPuchinger added 3 commits April 1, 2025 16:38

safely analyze recursively dependent lexer states

c679896

allow push and pop directives in the top-level state

23e8f78

update documentation

6a4c084
