Conversation
…for `:` in `type: ignore[...]` bracket codes (e.g. `ty:unresolved-reference`). The fix adds `Char.Colon` to the allowed bracket-content characters in both bracket-parsing branches of `matchIgnoreDirective`, but only when `directive === 'type'` — matching the original regex difference (`[\s\w:,-]*` for `type:` vs `[\s\w-,]*` for `pyright:`).
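A minimal sketch of the directive-dependent character check described above, assuming hypothetical helper names (`isWordChar`, `isBracketContentChar`) rather than pyright's actual internals; the point is that `:` is accepted only for the `type:` directive:

```typescript
const CH_SPACE = 0x20;
const CH_COMMA = 0x2c;
const CH_HYPHEN = 0x2d;
const CH_COLON = 0x3a;

// \w equivalent: digits, ASCII letters, underscore.
function isWordChar(code: number): boolean {
    return (
        (code >= 0x30 && code <= 0x39) || // 0-9
        (code >= 0x41 && code <= 0x5a) || // A-Z
        (code >= 0x61 && code <= 0x7a) || // a-z
        code === 0x5f // _
    );
}

// Mirrors the regex difference: [\s\w:,-]* for `type:` vs [\s\w-,]* for `pyright:`.
function isBracketContentChar(code: number, directive: 'type' | 'pyright'): boolean {
    if (isWordChar(code) || code === CH_SPACE || code === CH_COMMA || code === CH_HYPHEN) {
        return true;
    }
    return directive === 'type' && code === CH_COLON;
}
```

Keeping the check as a pure character predicate lets both bracket-parsing branches share it without duplicating the directive comparison.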
This PR improves parser/tokenizer performance on common hot paths and moves benchmark suites out of the normal Jest test matrix. The main changes are:

- replace regex-based directive and continuation scanning with manual scans
- reduce tokenizer overhead on common identifier paths
- clean up a few parser/token access paths that benefit from the tokenizer work
- add dedicated benchmark coverage and keep benchmark runs opt-in

## Benchmark Results

I reran the tokenizer benchmarks against `main` using the isolated harness so each corpus runs in a fresh process. Representative median results vs `main`:

- `large_stdlib`: `3.13ms` vs `3.98ms` (`21%` faster)
- `fstring_heavy`: `1.77ms` vs `1.94ms` (`9%` faster)
- `large_class`: `1.97ms` vs `2.12ms` (`7%` faster)
- `union_heavy`: `1.63ms` vs `2.18ms` (`25%` faster)
- `large_stdlib_10x`: `21.17ms` vs `24.42ms` (`13%` faster)

`comment_heavy` was effectively flat, and `import_heavy` remained too noisy to treat as a reliable headline result. Overall, the larger and more representative tokenizer-heavy corpora improved relative to `main`.

## Testing

- tokenizer regression tests passed
- tokenizer test suite passed
- full `pyright-internal` test suite passed
- isolated tokenizer benchmark runs completed successfully
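The "manual scans" change can be illustrated with a hypothetical standalone helper (not pyright's actual code): a line-continuation check like the regex `/^\\\r?\n/` rewritten as direct `charCodeAt` comparisons, which avoids regex-engine overhead on a per-character hot path:

```typescript
// Returns true if the text at `pos` is a backslash line continuation
// (backslash followed by \n or \r\n), without allocating or invoking a regex.
function isLineContinuation(text: string, pos: number): boolean {
    if (text.charCodeAt(pos) !== 0x5c /* '\' */) {
        return false;
    }
    let next = pos + 1;
    if (text.charCodeAt(next) === 0x0d /* '\r' */) {
        next++;
    }
    return text.charCodeAt(next) === 0x0a /* '\n' */;
}
```

`charCodeAt` past the end of the string returns `NaN`, which fails every comparison, so no explicit bounds checks are needed.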
…cts `type: ignore[` or `pyright: ignore[` comments when the closing `]` is missing, instead of silently treating them as bare ignore directives. I also added a regression test in `tokenizer.test.ts` for the malformed-bracket case. Focused validation passed: the `TypeIgnore|PyrightIgnore` slice now reports 9 passing tests, including `TypeIgnoreLineMalformedBracket`.
`detachSubstring` replaces the old `cloneStr` + `_identifierInternedStrings` `Map` for identifiers. The old intern map deduplicated repeated identifiers within a tokenization pass (e.g., 10,000 occurrences of `self` → 1 string object). The new code creates a fresh string per token. The benchmark corpora are 300–1700 lines and won't surface memory regressions on very large files with repetitive identifiers (e.g., generated code). Consider adding a "repetitive identifier" benchmark corpus to validate this tradeoff, or document that the intern map was intentionally removed with the expectation that per-token cost savings outweigh deduplication benefits.
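The intern pattern this comment describes can be sketched as a minimal standalone class (hypothetical names; pyright's removed implementation differed in detail). Repeated identifiers map to one canonical string per tokenization pass:

```typescript
// Per-pass identifier interner: repeated occurrences of the same identifier
// text resolve to a single shared string object instead of fresh substrings.
class IdentifierInterner {
    private readonly _interned = new Map<string, string>();

    get size(): number {
        return this._interned.size;
    }

    // Returns a canonical string for text[start, end). The substring is still
    // materialized to perform the lookup, but only the first occurrence is kept.
    intern(text: string, start: number, end: number): string {
        const value = text.substring(start, end);
        const existing = this._interned.get(value);
        if (existing !== undefined) {
            return existing;
        }
        this._interned.set(value, value);
        return value;
    }
}
```

The tradeoff being debated: the `Map` lookup and temporary substring cost something on every identifier token, while the payoff (fewer retained string objects) only shows up on large files with heavy identifier repetition.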
…e-bracket test, fast-path comment handler
…irective (fix O(n^2) on comment-heavy files)
Addressed in commit 462a76a: added a …
Diff from mypy_primer, showing the effect of this PR on open source code: sympy (https://github.com/sympy/sympy)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:315:25 - error: "Set" is not iterable
+ "__iter__" method not defined (reportGeneralTypeIssues)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:315:25 - error: "ConditionSet" is not iterable
+ "__iter__" method not defined (reportGeneralTypeIssues)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2024:20 - error: "__getitem__" method not defined on type "Basic" (reportIndexIssue)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2036:20 - error: "__getitem__" method not defined on type "Basic" (reportIndexIssue)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2049:20 - error: "__getitem__" method not defined on type "Basic" (reportIndexIssue)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2070:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+ Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+ "Set" is incompatible with protocol "Sized"
+ "__len__" is not present (reportArgumentType)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2071:20 - error: "__getitem__" method not defined on type "Basic" (reportIndexIssue)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2252:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+ Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+ "Set" is incompatible with protocol "Sized"
+ "__len__" is not present (reportArgumentType)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2333:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+ Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+ "Set" is incompatible with protocol "Sized"
+ "__len__" is not present (reportArgumentType)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2346:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+ Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+ "Set" is incompatible with protocol "Sized"
+ "__len__" is not present (reportArgumentType)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2352:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+ Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+ "Set" is incompatible with protocol "Sized"
+ "__len__" is not present (reportArgumentType)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2519:16 - error: Argument of type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+ Type "Unknown | FiniteSet | Set | Intersection | Union | Complement | Any | ConditionSet" is not assignable to type "Sized"
+ "Set" is incompatible with protocol "Sized"
+ "__len__" is not present (reportArgumentType)
+ .../projects/sympy/sympy/solvers/tests/test_solveset.py:2522:20 - error: Argument of type "Unknown | Basic | Any" cannot be assigned to parameter "obj" of type "Sized" in function "len"
+ Type "Unknown | Basic | Any" is not assignable to type "Sized"
+ "Basic" is incompatible with protocol "Sized"
+ "__len__" is not present (reportArgumentType)
- 38319 errors, 84 warnings, 0 informations
+ 38332 errors, 84 warnings, 0 informations
Already committed as 462a76a2f. Here's the updated PR description:

## Faster tokenizer

### Summary
Continues the tokenizer optimization work with a series of hot-path improvements, plus review follow-ups on the ignore-directive scanner and identifier-intern behavior. Combined result: ~20–30% faster on large Python corpora.
### Benchmark results vs `main`

Methodology: separate git worktrees of `origin/main` and `faster-tokenizer`, identical `tokenizerBenchmark.test.ts` harness (each corpus run in a fresh Node process, 3 warmup + 10 measured iterations). Numbers below are the median of per-run medians across 3–6 runs per side. Small corpora (< 2 ms, < 10 KB) are dominated by V8 JIT/GC jitter and noted as noise.

The `large_stdlib_10x` result is the most trustworthy signal (10× work, proportionally less noise) and shows a consistent −24%.

### Enhancements
#### Tokenizer hot paths

- Lookup tables for hot character classifications (`_canStartString`, `_asciiIdentifierStart`, `_asciiIdentifierContinue`, `_keywordFirstCharTable`, `_singleCharOperatorTypeTable`, …).
- A fast path `_tryIdentifier` that advances over ASCII identifier chars in a tight `charCodeAt` loop, falling back to the unicode/surrogate path only when a non-ASCII char is encountered. The single biggest win on real-world code.
- An identifier intern cache (keyed by `firstChar`/`lastChar`/`length`, no chaining) that deduplicates repeated identifiers (`self`, `cls`, `True`, `None`, `str`, `int`, …) within a single tokenize pass. Addresses the memory concern raised during review about per-token string allocation, without the overhead of the previous `Map`-based intern table. The new `repetitive_identifiers` benchmark corpus locks in the tradeoff (−13% vs main on that corpus).
- Rewrote `CharacterStream.skipWhitespace` as a tight `charCodeAt` loop that updates `_position`/`_currentChar`/`_isEndOfStream` directly, avoiding per-iteration method calls.
- Reworked `_handleComment` so the O(n) `_tokens.findIndex` scan no longer runs on every `type: ignore` directive.
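The ASCII identifier fast path above can be sketched as a hypothetical standalone function (the real `_tryIdentifier` integrates with pyright's `CharacterStream`; the lookup-table and fallback structure is the point):

```typescript
// Precomputed lookup tables for ASCII identifier start/continue characters.
const ASCII_ID_START = new Uint8Array(128);
const ASCII_ID_CONTINUE = new Uint8Array(128);
for (let c = 0; c < 128; c++) {
    const isAlpha = (c >= 0x41 && c <= 0x5a) || (c >= 0x61 && c <= 0x7a) || c === 0x5f;
    ASCII_ID_START[c] = isAlpha ? 1 : 0;
    ASCII_ID_CONTINUE[c] = isAlpha || (c >= 0x30 && c <= 0x39) ? 1 : 0;
}

// Returns the end offset of the identifier starting at `pos`, `pos` itself if
// no identifier starts there, or -1 to signal "non-ASCII char encountered,
// fall back to the full unicode/surrogate-aware path".
function scanAsciiIdentifier(text: string, pos: number): number {
    let code = text.charCodeAt(pos);
    if (code >= 128) {
        return -1;
    }
    if (!ASCII_ID_START[code]) {
        return pos;
    }
    let i = pos + 1;
    while (i < text.length) {
        code = text.charCodeAt(i);
        if (code >= 128) {
            return -1; // unicode identifiers take the slow path
        }
        if (!ASCII_ID_CONTINUE[code]) {
            break;
        }
        i++;
    }
    return i;
}
```

Because virtually all identifiers in real-world Python are pure ASCII, the tight loop handles the common case with one table load per character and no method calls.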
#### Ignore-directive scanner (review follow-ups)

- Allow `:` inside `type: ignore[...]` rules (e.g. `ty:rule-name`).
- Reject `type: ignore[` / `pyright: ignore[` comments with an unclosed bracket rather than treating them as "ignore all diagnostics". New test `TypeIgnoreLineMalformedBracketWithSpace` locks in this behavior for the `# type: ignore [broken` case.
- Extracted a shared `parseIgnoreBracketContent` helper — both the "bracket-after-space" and "bracket-immediately-after-ignore" branches now share one implementation.
- Added a cheap pre-check in `_handleComment` that uses `indexOf('ignore', …)` before invoking the directive scanner, so comments without the word `ignore` don't pay directive-parsing cost.
- `matchIgnoreDirective` uses a bounded hand-rolled scan for the directive keyword. (An earlier iteration used `String.prototype.indexOf`, which has no end bound and scanned well past the current comment on comment-heavy files, producing O(n²) behavior; the worktree-vs-main comparison caught and fixed this.)
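The bounded-scan point can be illustrated with a hypothetical helper (not pyright's actual code): unlike `String.prototype.indexOf`, the explicit `end` parameter guarantees the search never leaves the current comment's range, so total work stays proportional to the comment length:

```typescript
// Searches for `word` within text[start, end) and returns the match offset,
// or -1. The `end` bound is what String.prototype.indexOf lacks.
function findWordInRange(text: string, word: string, start: number, end: number): number {
    const last = end - word.length;
    for (let i = start; i <= last; i++) {
        let j = 0;
        while (j < word.length && text.charCodeAt(i + j) === word.charCodeAt(j)) {
            j++;
        }
        if (j === word.length) {
            return i;
        }
    }
    return -1;
}
```

With `indexOf`, a comment near the top of a file could scan all the way to a match thousands of lines later and then discard it; repeated over every comment, that is the O(n²) pattern the review caught.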
#### Parser / source-file touch-ups

#### Benchmark infrastructure
- Benchmark corpora: `large_stdlib`, `large_stdlib_10x`, `fstring_heavy`, `comment_heavy`, `large_class`, `import_heavy`, `union_heavy`, and `repetitive_identifiers` (the last specifically validates the identifier intern cache).
- Isolated harness (each corpus runs in a fresh Node process via `execFileSync`) with 3 warmup + 10 measured iterations; results written as JSON under `.generated/benchmark-results/` for side-by-side comparison.
- Shared stats helpers (`calculateStats`, `printResultTable`) take `ReadonlyArray` parameters per review feedback.

#### Tokenizer regression tests
### Behavior notes

- `# type: ignore[unclosed` and `# type: ignore [unclosed` are now rejected entirely (no `typeIgnoreLine` recorded) instead of falling back to "ignore all". This matches the intent of the original regex for the `[` branch and is now consistent between the space and no-space cases.
- Colon-containing rule names (e.g. `ty:rule-name`) remain valid inside `type: ignore[...]` but not inside `pyright: ignore[...]`.

### Validation
- `npm test` (full pyright-internal suite) — green.
- `npx jest tokenizer parser.test --forceExit` — 137/137 passing.
- `npm run test:benchmark` — table above; raw JSON in `.generated/benchmark-results/tokenizer/`.
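For reference, the "median of per-run medians" aggregation from the methodology section can be sketched as follows (hypothetical helper names; the actual harness helpers are `calculateStats`/`printResultTable`):

```typescript
// Median of a list of numbers; averages the two middle values for even lengths.
function median(values: ReadonlyArray<number>): number {
    const sorted = [...values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 !== 0 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Each inner array holds the measured iteration times (ms) from one
// fresh-process run; taking the median twice damps both per-iteration
// jitter and whole-run outliers (e.g. a run hit by GC pressure).
function medianOfRunMedians(runs: ReadonlyArray<ReadonlyArray<number>>): number {
    return median(runs.map(median));
}
```

Medians are preferred over means here because a single JIT or GC spike in one iteration (or one whole run) would otherwise skew the headline number.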