
Faster tokenizer#11392

Open
bschnurr wants to merge 18 commits into microsoft:main from bschnurr:faster-tokenizer

Conversation

@bschnurr
Member

@bschnurr bschnurr commented Apr 16, 2026



Faster tokenizer

Summary

Continues the tokenizer optimization work with a series of hot-path improvements, plus review follow-ups on the ignore-directive scanner and identifier-intern behavior. Combined result: ~20–30% faster on large Python corpora.

Benchmark results vs main

Methodology: separate git worktrees of origin/main and faster-tokenizer, identical tokenizerBenchmark.test.ts harness (each corpus run in a fresh Node process, 3 warmup + 10 measured iterations). Numbers below are the median of per-run medians across 3–6 runs per side. Small corpora (< 2 ms, < 10 KB) are dominated by V8 JIT/GC jitter and noted as noise.
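The aggregation scheme described above (median of per-run medians) can be sketched as follows; the helper names are illustrative, not taken from the PR:

```typescript
// Median of a non-empty array of numbers.
function median(values: readonly number[]): number {
    const sorted = [...values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Each benchmark run yields per-iteration timings. Take the median within
// each run, then the median across runs, as the headline number for a corpus.
// This double-median is robust to both per-iteration and per-run outliers.
function medianOfRunMedians(runs: readonly (readonly number[])[]): number {
    return median(runs.map(median));
}
```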

| Corpus | Size | main | this PR | Δ |
| --- | --- | --- | --- | --- |
| large_stdlib_10x | 430 KB | 25.98 ms | 19.62 ms | −24% |
| large_class | 24.5 KB | 2.59 ms | 1.78 ms | −31% |
| large_stdlib | 43 KB | 3.54 ms | 2.80 ms | −21% |
| repetitive_identifiers | 6.7 KB | 2.25 ms | 1.96 ms | −13% |
| union_heavy | 13 KB | 1.72 ms | 1.64 ms | −5% |
| import_heavy | 7.9 KB | 1.10 ms | 1.20 ms | +9% (noise) |
| fstring_heavy | 8.4 KB | 1.39 ms | 1.64 ms | +18% (noise) |
| comment_heavy | 9.5 KB | 1.63 ms | 1.89 ms | +16% (noise) |

The large_stdlib_10x result is the most trustworthy signal (10× work, proportionally less noise) and shows a consistent −24%.

Enhancements

Tokenizer hot paths

  • Replaced regex-based keyword / operator / ignore-directive matching with table-driven scans and 128-entry boolean lookup tables (_canStartString, _asciiIdentifierStart, _asciiIdentifierContinue, _keywordFirstCharTable, _singleCharOperatorTypeTable, …).
  • Added an ASCII fast-path in _tryIdentifier that advances over ASCII identifier chars in a tight charCodeAt loop, falling back to the unicode/surrogate path only when a non-ASCII char is encountered; this is the single biggest win on real-world code.
  • Added a direct-mapped identifier intern cache (2048 slots, hashed by firstChar/lastChar/length, no chaining) that deduplicates repeated identifiers (self, cls, True, None, str, int, …) within a single tokenize pass. Addresses the memory concern raised during review about per-token string allocation, without the overhead of the previous Map-based intern table. The new repetitive_identifiers benchmark corpus locks in the tradeoff (−13% vs main on that corpus).
  • Inlined CharacterStream.skipWhitespace as a tight charCodeAt loop that updates _position / _currentChar / _isEndOfStream directly, avoiding per-iteration method calls.
  • Cached the "any non-trivial tokens seen yet?" flag in _handleComment so the O(n) _tokens.findIndex scan no longer runs on every type: ignore directive.
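A minimal sketch of the lookup-table plus ASCII fast-path approach (the table and function names here are illustrative; the PR's actual tables are private fields on the tokenizer):

```typescript
// 128-entry lookup tables for ASCII identifier characters, built once.
const asciiIdentifierStart = new Uint8Array(128);
const asciiIdentifierContinue = new Uint8Array(128);
for (let ch = 0; ch < 128; ch++) {
    const isLetter = (ch >= 0x41 && ch <= 0x5a) || (ch >= 0x61 && ch <= 0x7a);
    const isDigit = ch >= 0x30 && ch <= 0x39;
    asciiIdentifierStart[ch] = isLetter || ch === 0x5f /* '_' */ ? 1 : 0;
    asciiIdentifierContinue[ch] = isLetter || isDigit || ch === 0x5f ? 1 : 0;
}

// Tight charCodeAt loop: advance over ASCII identifier chars starting at
// `start`. Returns the end offset past the identifier, `start` if no
// identifier starts there, or -1 if a non-ASCII char requires the
// unicode/surrogate slow path.
function scanAsciiIdentifier(text: string, start: number): number {
    let pos = start;
    if (pos >= text.length) return start;
    let ch = text.charCodeAt(pos);
    if (ch >= 128) return -1;
    if (!asciiIdentifierStart[ch]) return start;
    pos++;
    while (pos < text.length) {
        ch = text.charCodeAt(pos);
        if (ch >= 128) return -1; // fall back to the unicode path
        if (!asciiIdentifierContinue[ch]) break;
        pos++;
    }
    return pos;
}
```

The key property is that the common case (pure-ASCII identifiers) never leaves the loop of table lookups and `charCodeAt` calls.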

Ignore-directive scanner (review follow-ups)

  • Preserves support for namespaced type: ignore[...] rules (e.g. ty:rule-name).
  • Rejects malformed type: ignore[ / pyright: ignore[ comments with an unclosed bracket rather than treating them as "ignore all diagnostics". New test TypeIgnoreLineMalformedBracketWithSpace locks in this behavior for the # type: ignore [broken case.
  • Extracted the duplicated bracket-content character-class validation into a single parseIgnoreBracketContent helper — both the "bracket-after-space" and "bracket-immediately-after-ignore" branches now share one implementation.
  • Added a fast pre-filter in _handleComment that uses indexOf('ignore', …) before invoking the directive scanner, so comments without the word ignore don't pay directive-parsing cost.
  • matchIgnoreDirective uses a bounded hand-rolled scan for the directive keyword. (An earlier iteration used String.prototype.indexOf, which has no end bound and scanned well past the current comment on comment-heavy files, producing O(n²) behavior; the worktree-vs-main comparison caught and fixed this.)
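The bounded scan can be illustrated like this (a simplified sketch, not the actual matchIgnoreDirective code):

```typescript
// Search for `keyword` within text[start, end) without ever looking past
// `end` — unlike String.prototype.indexOf, which has a start offset but no
// end bound and can scan far beyond the current comment.
function boundedIndexOf(text: string, keyword: string, start: number, end: number): number {
    const limit = end - keyword.length;
    for (let i = start; i <= limit; i++) {
        let j = 0;
        while (j < keyword.length && text.charCodeAt(i + j) === keyword.charCodeAt(j)) {
            j++;
        }
        if (j === keyword.length) return i;
    }
    return -1;
}
```

Bounding the scan to the current comment keeps the per-comment cost proportional to the comment's length, avoiding the O(n²) blowup on comment-heavy files.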

Parser / source-file touch-ups

  • Small follow-up adjustments in the parser and source-file layer needed by the tokenizer updates.

Benchmark infrastructure

  • Added tokenizer and parser benchmark suites (tokenizerBenchmark.test.ts, parserBenchmark.test.ts) with representative corpora: large_stdlib, large_stdlib_10x, fstring_heavy, comment_heavy, large_class, import_heavy, union_heavy, and repetitive_identifiers (the last specifically validates the identifier intern cache).
  • Each corpus runs in a fresh Node process (spawned via execFileSync) with 3 warmup + 10 measured iterations; results written as JSON under .generated/benchmark-results/ for side-by-side comparison.
  • Benchmark helpers (calculateStats, printResultTable) take ReadonlyArray parameters per review feedback.

Tokenizer regression tests

  • Added coverage for ignore-directive parsing: malformed unclosed brackets (with and without leading space), namespaced codes, and mixed plain + namespaced codes.


Behavior notes

  • # type: ignore[unclosed and # type: ignore [unclosed are now rejected entirely (no typeIgnoreLine recorded) instead of falling back to "ignore all". This matches the intent of the original regex for the [ branch and is now consistent between the space and no-space cases.
  • Namespaced codes (ty:rule-name) remain valid inside type: ignore[...] but not inside pyright: ignore[...].
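The shared bracket-validation behavior might look roughly like this (a hypothetical helper modeled on the PR's parseIgnoreBracketContent, not the actual code):

```typescript
// Parse the contents of `# type: ignore[...]` / `# pyright: ignore[...]`
// starting at the opening bracket. Returns the list of rule names, or
// undefined if the bracket is malformed (e.g. unclosed or containing a
// disallowed character) — in which case no ignore directive is recorded
// at all, rather than falling back to "ignore all diagnostics".
// `allowColon` is true only for `type:` directives, matching the original
// regex difference ([\s\w:,-]* for type: vs [\s\w-,]* for pyright:).
function parseIgnoreBracketContent(
    comment: string,
    openBracket: number,
    allowColon: boolean
): string[] | undefined {
    let content = '';
    for (let pos = openBracket + 1; pos < comment.length; pos++) {
        const ch = comment[pos];
        if (ch === ']') {
            return content
                .split(',')
                .map((s) => s.trim())
                .filter((s) => s.length > 0);
        }
        const isAllowed = /[\s\w,-]/.test(ch) || (allowColon && ch === ':');
        if (!isAllowed) return undefined;
        content += ch;
    }
    return undefined; // unclosed bracket: reject the whole directive
}
```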

Validation

  • npm test (full pyright-internal suite) — green.
  • Targeted: npx jest tokenizer parser.test --forceExit — 137/137 passing.
  • npm run test:benchmark — table above; raw JSON in .generated/benchmark-results/tokenizer/.

…for : in type: ignore[...] bracket codes (e.g. ty:unresolved-reference). The fix adds Char.Colon to the allowed bracket-content characters in both bracket-parsing branches of matchIgnoreDirective, but only when directive === 'type' — matching the original regex difference ([\s\w:,-]* for type: vs [\s\w-,]* for pyright:).
This PR improves parser/tokenizer performance on common hot paths and moves benchmark suites out of the normal Jest test matrix.

The main changes are:
- replace regex-based directive and continuation scanning with manual scans
- reduce tokenizer overhead on common identifier paths
- clean up a few parser/token access paths that benefit from the tokenizer work
- add dedicated benchmark coverage and keep benchmark runs opt-in

## Benchmark Results

I reran the tokenizer benchmarks against `main` using the isolated harness so each corpus runs in a fresh process.

Representative median results vs `main`:
- `large_stdlib`: `3.13ms` vs `3.98ms` (`21%` faster)
- `fstring_heavy`: `1.77ms` vs `1.94ms` (`9%` faster)
- `large_class`: `1.97ms` vs `2.12ms` (`7%` faster)
- `union_heavy`: `1.63ms` vs `2.18ms` (`25%` faster)
- `large_stdlib_10x`: `21.17ms` vs `24.42ms` (`13%` faster)

`comment_heavy` was effectively flat, and `import_heavy` remained too noisy to treat as a reliable headline result.

Overall, the larger and more representative tokenizer-heavy corpora improved relative to `main`.

## Testing

- tokenizer regression tests passed
- tokenizer test suite passed
- full `pyright-internal` test suite passed
- isolated tokenizer benchmark runs completed successfully
…cts type: ignore[ or pyright: ignore[ when the closing ] is missing, instead of silently treating them as bare ignore directives.

I also added a regression test in tokenizer.test.ts for the malformed-bracket case. Focused validation passed: the TypeIgnore|PyrightIgnore slice now reports 9 passing tests, including TypeIgnoreLineMalformedBracket.

Comment thread packages/pyright-internal/src/parser/tokenizer.ts
@rchiodo
Collaborator

rchiodo commented Apr 16, 2026

detachSubstring replaces the old cloneStr + _identifierInternedStrings Map for identifiers. The old intern map deduplicated repeated identifiers within a tokenization pass (e.g., 10,000 occurrences of self → 1 string object). The new code creates a fresh string per token. The benchmark corpora are 300–1700 lines and won't surface memory regressions on very large files with repetitive identifiers (e.g., generated code). Consider adding a "repetitive identifier" benchmark corpus to validate this tradeoff, or document that the intern map was intentionally removed with the expectation that per-token cost savings outweigh deduplication benefits.


@bschnurr
Member Author

Addressed in commit 462a76a: added a repetitive_identifiers benchmark corpus (234 lines, 2775 tokens) that exercises heavy reuse of identifiers like self, cls, str, int, T, K, V, repeated method signatures, and list/dict comprehensions with common names. Result vs main: branch ~1.96 ms, main ~2.25 ms (~13% faster), confirming the intern cache outperforms per-token allocation on this workload.
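A direct-mapped intern cache along the lines described (2048 slots, hashed from first char, last char, and length, no chaining) could be sketched like this; the names and hash constants are illustrative, not the PR's actual code:

```typescript
const INTERN_SLOTS = 2048; // power of two, so masking replaces modulo

// Direct-mapped cache: each slot holds at most one string; collisions
// simply overwrite the slot (no chaining, no eviction bookkeeping).
const internCache: (string | undefined)[] = new Array(INTERN_SLOTS);

// Intern text[start, end): on a slot hit with equal contents, reuse the
// cached string object; otherwise allocate a substring and cache it.
// Repeated identifiers (self, cls, str, int, ...) hit the cache and avoid
// a fresh per-token allocation.
function internSubstring(text: string, start: number, end: number): string {
    const length = end - start;
    const hash =
        (text.charCodeAt(start) * 31 + text.charCodeAt(end - 1) * 7 + length) &
        (INTERN_SLOTS - 1);
    const cached = internCache[hash];
    if (cached !== undefined && cached.length === length) {
        let i = 0;
        while (i < length && cached.charCodeAt(i) === text.charCodeAt(start + i)) i++;
        if (i === length) return cached; // hit: reuse the existing string
    }
    const fresh = text.substring(start, end);
    internCache[hash] = fresh;
    return fresh;
}
```

The tradeoff versus the old Map-based table is deliberate: a Map never loses entries but pays hashing and bookkeeping on every token, while a direct-mapped array is a couple of arithmetic ops plus one compare, at the cost of occasional collisions evicting a cached string.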


@github-actions
Contributor

Diff from mypy_primer, showing the effect of this PR on open source code:

sympy (https://github.com/sympy/sympy)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:315:25 - error: "Set" is not iterable
+     "__iter__" method not defined (reportGeneralTypeIssues)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:315:25 - error: "ConditionSet" is not iterable
+     "__iter__" method not defined (reportGeneralTypeIssues)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2024:20 - error: "__getitem__" method not defined on type "Basic" (reportIndexIssue)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2036:20 - error: "__getitem__" method not defined on type "Basic" (reportIndexIssue)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2049:20 - error: "__getitem__" method not defined on type "Basic" (reportIndexIssue)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2070:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+     Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+       "Set" is incompatible with protocol "Sized"
+         "__len__" is not present (reportArgumentType)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2071:20 - error: "__getitem__" method not defined on type "Basic" (reportIndexIssue)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2252:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+     Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+       "Set" is incompatible with protocol "Sized"
+         "__len__" is not present (reportArgumentType)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2333:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+     Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+       "Set" is incompatible with protocol "Sized"
+         "__len__" is not present (reportArgumentType)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2346:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+     Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+       "Set" is incompatible with protocol "Sized"
+         "__len__" is not present (reportArgumentType)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2352:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+     Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+       "Set" is incompatible with protocol "Sized"
+         "__len__" is not present (reportArgumentType)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2519:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+     Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+       "Set" is incompatible with protocol "Sized"
+         "__len__" is not present (reportArgumentType)
+   .../projects/sympy/sympy/solvers/tests/test_solveset.py:2522:20 - error: Argument of type "Unknown | Basic | Any" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+     Type "Unknown | Basic | Any" is not assignable to type "Sized"
+       "Basic" is incompatible with protocol "Sized"
+         "__len__" is not present (reportArgumentType)
- 38319 errors, 84 warnings, 0 informations
+ 38332 errors, 84 warnings, 0 informations
