⚡️ Speed up method ProphetNetTokenizer._tokenize by 7%
#883
📄 **7% (0.07x) speedup** for `ProphetNetTokenizer._tokenize` in `src/transformers/models/prophetnet/tokenization_prophetnet.py`

⏱️ **Runtime:** 61.8 milliseconds → 57.6 milliseconds (best of 91 runs)

📝 **Explanation and details**
The optimized code achieves a 7% speedup through several targeted micro-optimizations that reduce Python overhead in tight loops and frequent operations:
Key Performance Optimizations:
1. **Early Chinese Character Detection**: Adds a `_may_have_chinese_char()` function that quickly scans text for CJK codepoints before the expensive `_tokenize_chinese_chars()` processing. This avoids unnecessary Unicode work when no Chinese characters are present, saving ~25% of the time in the Chinese tokenization path.
2. **Reduced Method Attribute Lookups**: Caches frequently accessed methods and attributes as local variables (`split_on_punc = self._run_split_on_punc`, `nsplit = never_split_set`) to avoid repeated attribute resolution in tight loops.
3. **Optimized Set Operations**: Avoids creating temporary sets when `never_split` is None by checking the condition first, eliminating unnecessary `set()` and `union()` calls.
4. **String Processing Improvements**: In `WordpieceTokenizer`, uses `chars = token` instead of `chars = list(token)` since strings are already indexable and sliceable, stores `len(chars)` in a local variable `L` to avoid repeated `len()` calls, and binds the `unicodedata.category` function to a local.
5. **Buffer Management**: Uses local references to `split_tokens.append` and `split_tokens.extend` to avoid method lookup overhead in loops processing thousands of tokens.

Minimal sketches of each pattern follow below.
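A minimal sketch of the early-exit idea in (1). Only the helper name `_may_have_chinese_char` comes from the summary; the exact codepoint ranges checked in the PR aren't shown, so the blocks below are the ones BERT-style tokenizers conventionally treat as Chinese characters:

```python
def _may_have_chinese_char(text):
    """Cheap pre-scan: True only if `text` contains a codepoint in a common CJK block."""
    for ch in text:
        cp = ord(ch)
        # CJK Unified Ideographs, Extension A, and Compatibility Ideographs.
        if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or 0xF900 <= cp <= 0xFAFF:
            return True
    return False

print(_may_have_chinese_char("hello world"))  # False -> skip _tokenize_chinese_chars()
print(_may_have_chinese_char("你好 world"))    # True  -> run the per-char rewrite
```

Pure-ASCII text fails the scan immediately, so the per-character rewrite in `_tokenize_chinese_chars()` never runs for it.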
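Optimizations (2) and (5) are the same Python pattern: bind hot attributes and bound methods to locals before the loop, turning repeated attribute resolution into fast local-variable loads. A toy sketch, where `split_on_punc` is a simplified stand-in for `_run_split_on_punc`, not the actual method:

```python
import string

def split_on_punc(token):
    """Simplified stand-in for BasicTokenizer._run_split_on_punc."""
    out, buf = [], []
    for ch in token:
        if ch in string.punctuation:
            if buf:
                out.append("".join(buf))
                buf = []
            out.append(ch)
        else:
            buf.append(ch)
    if buf:
        out.append("".join(buf))
    return out

def tokenize(tokens):
    split_tokens = []
    extend = split_tokens.extend  # bound method cached once, not looked up per token
    splitter = split_on_punc      # local binding: LOAD_FAST instead of LOAD_GLOBAL
    for token in tokens:
        extend(splitter(token))
    return split_tokens

print(tokenize(["hello,", "world!"]))  # ['hello', ',', 'world', '!']
```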
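A sketch of the `never_split` guard in (3), assuming `self.never_split` is already a set as in BERT-style tokenizers; the class and method names here are illustrative:

```python
class Demo:
    def __init__(self):
        self.never_split = {"[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"}

    def resolve_never_split(self, never_split=None):
        # Check first: reuse the existing set when there is nothing to merge,
        # instead of unconditionally building set(never_split or []) and union-ing.
        if never_split:
            return self.never_split.union(never_split)
        return self.never_split

d = Demo()
print(d.resolve_never_split() is d.never_split)  # True: no temporary set was built
print(len(d.resolve_never_split(["<sep>"])))     # 6: union paid for only when needed
```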
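For (4), a minimal greedy longest-match-first WordPiece loop showing the two changes: the token string is used directly (no `list(token)` copy) and its length is cached in `L`. The function name and vocabulary are illustrative, not the PR's code:

```python
def wordpiece(token, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece over a str, without a list() copy."""
    chars = token   # a str already supports indexing and slicing
    L = len(chars)  # length computed once, not re-evaluated on every loop test
    start, out = 0, []
    while start < L:
        end, cur = L, None
        while start < end:
            sub = chars[start:end]  # slicing works on str directly
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk_token]
        out.append(cur)
        start = end
    return out

print(wordpiece("unhappily", {"un", "##happ", "##ily"}))  # ['un', '##happ', '##ily']
```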
**Impact on Workloads:** The optimizations preserve all functionality while targeting Python's performance bottlenecks in string processing and method resolution, making tokenization more efficient for high-throughput NLP workloads.
✅ **Correctness verification report**
🌀 **Generated Regression Tests and Runtime**
To edit these changes, run `git checkout codeflash/optimize-ProphetNetTokenizer._tokenize-misjtaxg` and push.