⚡️ Speed up method BasicTokenizer.tokenize by 16%
#877
📄 16% (0.16x) speedup for `BasicTokenizer.tokenize` in `src/transformers/models/prophetnet/tokenization_prophetnet.py`

⏱️ Runtime: 31.6 milliseconds → 27.2 milliseconds (best of 134 runs)

📝 Explanation and details
The optimized code achieves a 16% speedup through several key micro-optimizations that reduce per-call overhead in frequently called tokenization methods:
Core Optimizations:
- Eliminated redundant variable assignments: In `whitespace_tokenize()`, removed the intermediate `tokens` variable and directly returned `text.split()`, saving memory allocation and variable-assignment overhead.
- Cached method lookups as instance attributes: Added `self._unicodedata_normalize` and `self._unicodedata_category` in `__init__()` to avoid repeated module attribute lookups during hot-path execution. This is particularly effective since `unicodedata.normalize` and `unicodedata.category` are called frequently in text-processing loops.
- Localized method references in hot loops: Created local variables like `never_split_contains = never_split.__contains__`, `append = output.append`, and `_is_punctuation_local = _is_punctuation` to eliminate attribute lookups within tight loops. Python's LOAD_FAST opcode for local variables is significantly faster than LOAD_GLOBAL or LOAD_ATTR.
- Optimized punctuation splitting algorithm: Replaced the complex list-of-lists approach in `_run_split_on_punc()` with a simpler `current_word` buffer pattern, reducing memory allocations and list-comprehension overhead.
- Conditional string joining optimization: Added a check for single-token cases (`if len(split_tokens) == 1`) to avoid an unnecessary string join.

Performance Impact by Test Category:
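The localized-reference and `current_word` buffer patterns above can be sketched together in a standalone snippet. This is a hypothetical, simplified version of the tokenizer helpers for illustration — the real `_is_punctuation` in transformers also covers ASCII symbol ranges, and the actual method lives on the tokenizer class:

```python
import unicodedata

def _is_punctuation(ch):
    # Simplified stand-in: treat Unicode punctuation categories (P*) as punctuation.
    return unicodedata.category(ch).startswith("P")

def run_split_on_punc(text, never_split=None):
    """Split text on punctuation using a current_word buffer
    instead of a list-of-lists, with hot-path names localized."""
    if never_split is not None and text in never_split:
        return [text]
    output = []
    append = output.append              # localize bound method -> LOAD_FAST in the loop
    _is_punct_local = _is_punctuation   # localize the global lookup as well
    current_word = []
    for ch in text:
        if _is_punct_local(ch):
            if current_word:
                append("".join(current_word))
                current_word = []
            append(ch)
        else:
            current_word.append(ch)
    if current_word:
        append("".join(current_word))
    return output
```

Non-punctuation runs (including spaces, which are split off elsewhere in the pipeline) accumulate in `current_word` and are joined once per run, so no nested lists are built or flattened.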
The optimizations are most effective for high-throughput tokenization workloads where the same methods are called repeatedly, making the reduced per-call overhead compound into substantial performance gains. All semantic behavior and edge case handling remain identical to the original implementation.
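One way to see how reduced per-call overhead compounds under repeated calls is a small `timeit` harness comparing a module-attribute lookup against a cached reference. This is an illustrative sketch, not the benchmark that produced the numbers above:

```python
import timeit
import unicodedata

TEXT = "Hello, world! This is a test." * 50

def normalize_global(text):
    # Attribute lookup on the unicodedata module on every call.
    return unicodedata.normalize("NFD", text)

_normalize = unicodedata.normalize  # reference cached once, at definition time

def normalize_cached(text):
    # Single fast name load instead of a module attribute lookup per call.
    return _normalize("NFD", text)

if __name__ == "__main__":
    for fn in (normalize_global, normalize_cached):
        t = timeit.timeit(lambda: fn(TEXT), number=20_000)
        print(f"{fn.__name__}: {t:.3f}s")
```

Both variants return identical output; only the lookup cost differs, which is exactly the property the regression tests below are meant to confirm for the optimized tokenizer.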
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
To edit these changes, run `git checkout codeflash/optimize-BasicTokenizer.tokenize-misi0t5a` and push.