codeflash-ai bot commented on Dec 5, 2025

📄 16% (0.16x) speedup for BasicTokenizer.tokenize in src/transformers/models/prophetnet/tokenization_prophetnet.py

⏱️ Runtime : 31.6 milliseconds → 27.2 milliseconds (best of 134 runs)

📝 Explanation and details

The optimized code achieves a 16% speedup through several micro-optimizations that reduce per-call overhead in the tokenizer's frequently called methods:

Core Optimizations (see the sketch after this list):

  1. Eliminated redundant variable assignments: In whitespace_tokenize(), removed the intermediate tokens variable and returned text.split() directly, dropping a redundant name binding on every call.

  2. Cached method lookups as instance attributes: Added self._unicodedata_normalize and self._unicodedata_category in __init__() to avoid repeated module attribute lookups during hot path execution. This is particularly effective since unicodedata.normalize and unicodedata.category are called frequently in text processing loops.

  3. Localized method references in hot loops: Created local variables like never_split_contains = never_split.__contains__, append = output.append, and _is_punctuation_local = _is_punctuation to eliminate attribute lookups within tight loops. Python's LOAD_FAST opcode for local variables is significantly faster than LOAD_GLOBAL or LOAD_ATTR.

  4. Optimized punctuation splitting algorithm: Replaced the complex list-of-lists approach in _run_split_on_punc() with a simpler current_word buffer pattern, reducing memory allocations and list comprehension overhead.

  5. Conditional string joining optimization: Added a check for single-token cases (if len(split_tokens) == 1) to avoid unnecessary string joining operations.
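
To make points 2-4 concrete, here is a minimal sketch of the patterns described above. It is an illustration only, not the literal diff in this PR: the class name _OptimizationSketch is hypothetical, and the simplified _is_punctuation stand-in covers only Unicode "P*" categories, whereas the real helper in the tokenization module also treats certain ASCII ranges as punctuation.

import unicodedata


def _is_punctuation(char):
    # Simplified stand-in for the module's helper (assumption: Unicode
    # category "P*" only; the real helper also covers ASCII symbol ranges).
    return unicodedata.category(char).startswith("P")


class _OptimizationSketch:
    def __init__(self):
        # Point 2: cache module-level callables as instance attributes so hot
        # paths skip repeated unicodedata.<name> lookups.
        self._unicodedata_normalize = unicodedata.normalize
        self._unicodedata_category = unicodedata.category

    def _run_strip_accents(self, text):
        text = self._unicodedata_normalize("NFD", text)
        # Point 3: bind loop-invariant lookups to locals (LOAD_FAST).
        category = self._unicodedata_category
        output = []
        append = output.append
        for char in text:
            if category(char) != "Mn":  # skip combining marks
                append(char)
        return "".join(output)

    def _run_split_on_punc(self, text):
        # Point 4: a single current_word buffer instead of a list of lists.
        is_punct = _is_punctuation
        output = []
        append = output.append
        current_word = []
        for char in text:
            if is_punct(char):
                if current_word:
                    append("".join(current_word))
                    current_word = []
                append(char)
            else:
                current_word.append(char)
        if current_word:
            append("".join(current_word))
        return output


# Example: _OptimizationSketch()._run_split_on_punc("hello!!!") == ["hello", "!", "!", "!"]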

Performance Impact by Test Category:

  • Basic text processing: 6-12% improvement across typical tokenization scenarios
  • Large-scale processing: 15-22% improvement on repetitive text (1000+ tokens), where loop overhead dominates
  • Chinese character processing: Significant gains (20%+) due to optimized character-by-character processing with cached method lookups (see the sketch below)
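
The Chinese-character path shows the same local-binding pattern in its hot loop. A hedged sketch follows; the single range below is a simplification, since the real method checks several additional CJK blocks (extensions and compatibility ideographs).

def _tokenize_chinese_chars_sketch(text):
    # Pad CJK ideographs with spaces so each becomes its own whitespace token.
    # append is bound once so the per-character loop stays on LOAD_FAST.
    output = []
    append = output.append
    for char in text:
        cp = ord(char)
        if 0x4E00 <= cp <= 0x9FFF:  # main CJK Unified Ideographs block only
            append(" ")
            append(char)
            append(" ")
        else:
            append(char)
    return "".join(output)


# _tokenize_chinese_chars_sketch("我爱NLP") == " 我  爱 NLP"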

The optimizations are most effective for high-throughput tokenization workloads where the same methods are called repeatedly, making the reduced per-call overhead compound into substantial performance gains. All semantic behavior and edge case handling remain identical to the original implementation.
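
As a rough way to gauge such a workload yourself, a minimal timing sketch (assumes transformers is installed; the import path is the one used by the generated tests below, the workload string is arbitrary, and absolute numbers will differ from the report above):

import timeit

from transformers.models.prophetnet.tokenization_prophetnet import BasicTokenizer

tokenizer = BasicTokenizer()
# Arbitrary repetitive workload, similar in spirit to the large-scale tests below.
text = ("The quick brown fox jumps over the lazy dog. " * 200).strip()

# Average seconds per tokenize() call; the micro-optimizations above shave
# per-call overhead, which compounds when this is invoked millions of times.
runs = 50
per_call = timeit.timeit(lambda: tokenizer.tokenize(text), number=runs) / runs
print(f"{per_call * 1000:.2f} ms per tokenize() call")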

Correctness verification report:

Test                          | Status
⚙️ Existing Unit Tests        | 🔘 None Found
🌀 Generated Regression Tests | 152 Passed
⏪ Replay Tests               | 🔘 None Found
🔎 Concolic Coverage Tests    | 🔘 None Found
📊 Tests Coverage             | 100.0%
🌀 Generated Regression Tests and Runtime
# imports
from transformers.models.prophetnet.tokenization_prophetnet import BasicTokenizer


# --- Unit tests for BasicTokenizer.tokenize ---

# 1. BASIC TEST CASES


def test_basic_simple_sentence():
    # Test a simple sentence with punctuation
    tokenizer = BasicTokenizer()
    text = "Hello, world!"
    codeflash_output = tokenizer.tokenize(text)  # 22.4μs -> 20.1μs (11.0% faster)


def test_basic_multiple_spaces():
    # Test multiple spaces between words
    tokenizer = BasicTokenizer()
    text = "Hello     world"
    codeflash_output = tokenizer.tokenize(text)  # 19.9μs -> 18.7μs (6.40% faster)


def test_basic_tabs_and_newlines():
    # Test tabs and newlines as whitespace
    tokenizer = BasicTokenizer()
    text = "Hello\tworld\nnew line"
    codeflash_output = tokenizer.tokenize(text)  # 27.7μs -> 24.7μs (12.0% faster)


def test_basic_mixed_case():
    # Test mixed case and lowercasing
    tokenizer = BasicTokenizer()
    text = "PyTest is AwEsOmE."
    codeflash_output = tokenizer.tokenize(text)  # 25.9μs -> 23.6μs (9.68% faster)


def test_basic_never_split():
    # Test never_split parameter
    tokenizer = BasicTokenizer(never_split=["[UNK]"])
    text = "This is [UNK] token."
    codeflash_output = tokenizer.tokenize(text)  # 25.8μs -> 23.5μs (9.86% faster)


def test_basic_never_split_runtime():
    # Test never_split passed at runtime
    tokenizer = BasicTokenizer()
    text = "This is [MASK] token."
    codeflash_output = tokenizer.tokenize(text, never_split=["[MASK]"])  # 26.9μs -> 24.4μs (10.2% faster)


def test_basic_strip_accents():
    # Test accent stripping
    tokenizer = BasicTokenizer(strip_accents=True)
    text = "Café naïve résumé"
    codeflash_output = tokenizer.tokenize(text)  # 29.9μs -> 27.3μs (9.52% faster)


def test_basic_no_lower_case():
    # Test disabling lowercasing
    tokenizer = BasicTokenizer(do_lower_case=False)
    text = "Hello World!"
    codeflash_output = tokenizer.tokenize(text)  # 17.8μs -> 16.4μs (8.61% faster)


def test_basic_no_split_on_punc():
    # Test disabling split on punctuation
    tokenizer = BasicTokenizer(do_split_on_punc=False)
    text = "Hello, world!"
    codeflash_output = tokenizer.tokenize(text)  # 14.2μs -> 14.3μs (0.830% slower)


def test_basic_strip_accents_false():
    # Test strip_accents=False disables accent stripping
    tokenizer = BasicTokenizer(strip_accents=False)
    text = "Café naïve résumé"
    codeflash_output = tokenizer.tokenize(text)  # 25.2μs -> 22.5μs (12.0% faster)


def test_basic_strip_accents_and_no_lowercase():
    # Test strip_accents and do_lower_case both False
    tokenizer = BasicTokenizer(strip_accents=False, do_lower_case=False)
    text = "Café"
    codeflash_output = tokenizer.tokenize(text)  # 10.5μs -> 9.97μs (5.01% faster)


# 2. EDGE TEST CASES


def test_edge_empty_string():
    # Test empty string
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("")  # 2.19μs -> 2.87μs (23.7% slower)


def test_edge_only_spaces():
    # Test string with only spaces
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("     ")  # 6.01μs -> 6.45μs (6.84% slower)


def test_edge_only_punctuation():
    # Test string with only punctuation
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("!!!")  # 10.7μs -> 9.72μs (10.1% faster)


def test_edge_only_control_characters():
    # Test string with only control characters (should be removed)
    tokenizer = BasicTokenizer()
    # chr(0), chr(1), chr(2) are control chars
    text = "\x00\x01\x02"
    codeflash_output = tokenizer.tokenize(text)  # 3.48μs -> 3.94μs (11.7% slower)


def test_edge_mixed_control_and_text():
    # Test string with control characters and text
    tokenizer = BasicTokenizer()
    text = "a\x00b\x01c"
    codeflash_output = tokenizer.tokenize(text)  # 11.4μs -> 10.7μs (5.92% faster)


def test_edge_unicode_whitespace():
    # Test unicode whitespace (e.g., non-breaking space)
    tokenizer = BasicTokenizer()
    text = "Hello\u00a0world"  # \u00A0 is non-breaking space
    codeflash_output = tokenizer.tokenize(text)  # 20.1μs -> 18.4μs (8.99% faster)


def test_edge_unicode_punctuation():
    # Test unicode punctuation (e.g., em dash)
    tokenizer = BasicTokenizer()
    text = "Hello—world"  # em dash
    codeflash_output = tokenizer.tokenize(text)  # 21.5μs -> 20.2μs (6.32% faster)


def test_edge_unicode_normalization():
    # Test that NFC normalization is applied
    tokenizer = BasicTokenizer()
    text = "e\u0301"  # 'e' + combining acute accent
    # Should be normalized to 'é', then lowercased and accent stripped to 'e'
    codeflash_output = tokenizer.tokenize(text)  # 11.4μs -> 11.3μs (1.30% faster)


def test_edge_chinese_characters():
    # Test that Chinese characters are separated
    tokenizer = BasicTokenizer()
    text = "我喜欢Python。"  # "I like Python."
    # Chinese chars and punctuation should be split
    codeflash_output = tokenizer.tokenize(text)  # 25.9μs -> 24.0μs (8.09% faster)


def test_edge_disable_chinese_tokenization():
    # Test disabling Chinese character tokenization
    tokenizer = BasicTokenizer(tokenize_chinese_chars=False)
    text = "我喜欢Python。"
    # Chinese chars are not split into separate tokens
    codeflash_output = tokenizer.tokenize(text)  # 17.6μs -> 16.3μs (8.26% faster)


def test_edge_never_split_chinese():
    # Test never_split with Chinese char
    tokenizer = BasicTokenizer(never_split=["我"])
    text = "我喜欢Python。"
    # "我" is not split or lowercased/stripped
    codeflash_output = tokenizer.tokenize(text)  # 23.5μs -> 21.7μs (8.32% faster)


def test_edge_mixed_never_split_and_case():
    # Test never_split disables lowercasing/stripping for that token only
    tokenizer = BasicTokenizer(never_split=["PyThOn"])
    text = "I like PyThOn!"
    codeflash_output = tokenizer.tokenize(text)  # 23.5μs -> 21.1μs (11.3% faster)


def test_edge_token_with_punct_inside():
    # Test token with punctuation inside (should split at punctuation)
    tokenizer = BasicTokenizer()
    text = "foo.bar"
    codeflash_output = tokenizer.tokenize(text)  # 14.5μs -> 13.4μs (8.33% faster)


def test_edge_token_with_multiple_punctuations():
    # Test token with multiple consecutive punctuation
    tokenizer = BasicTokenizer()
    text = "hello!!!"
    codeflash_output = tokenizer.tokenize(text)  # 15.3μs -> 13.7μs (11.8% faster)


def test_edge_token_with_punct_and_never_split():
    # Test never_split disables splitting for that token
    tokenizer = BasicTokenizer(never_split=["foo.bar"])
    text = "foo.bar"
    codeflash_output = tokenizer.tokenize(text)  # 8.65μs -> 9.09μs (4.81% slower)


def test_edge_strip_accents_with_nonlatin():
    # Test accent stripping on non-latin (should not strip e.g. Cyrillic)
    tokenizer = BasicTokenizer(strip_accents=True)
    text = "Добрый день"
    codeflash_output = tokenizer.tokenize(text)  # 24.2μs -> 22.2μs (8.95% faster)


def test_edge_nonbreaking_space_and_tab():
    # Test with non-breaking space and tab
    tokenizer = BasicTokenizer()
    text = "Hello\u00a0\tworld"
    codeflash_output = tokenizer.tokenize(text)  # 20.6μs -> 18.6μs (10.6% faster)


def test_edge_surrogate_pairs():
    # Test with characters outside BMP (e.g., emoji)
    tokenizer = BasicTokenizer()
    text = "I love 🍕!"
    codeflash_output = tokenizer.tokenize(text)  # 21.8μs -> 20.5μs (6.26% faster)


def test_edge_punctuation_only_never_split():
    # Test never_split with punctuation token
    tokenizer = BasicTokenizer(never_split=["!"])
    text = "Wow!"
    codeflash_output = tokenizer.tokenize(text)  # 11.5μs -> 10.9μs (6.03% faster)


def test_edge_empty_never_split():
    # Test never_split with empty list
    tokenizer = BasicTokenizer(never_split=[])
    text = "Hello world!"
    codeflash_output = tokenizer.tokenize(text)  # 19.7μs -> 18.3μs (7.26% faster)


# 3. LARGE SCALE TEST CASES


def test_large_long_sentence():
    # Test a long sentence (500 words)
    tokenizer = BasicTokenizer()
    text = "word " * 500
    expected = ["word"] * 500
    codeflash_output = tokenizer.tokenize(text)  # 2.09ms -> 1.80ms (16.3% faster)


def test_large_long_sentence_with_punctuation():
    # Test a long sentence with punctuation at intervals
    tokenizer = BasicTokenizer()
    text = ("foo, bar. " * 200).strip()
    expected = []
    for _ in range(200):
        expected.extend(["foo", ",", "bar", "."])
    codeflash_output = tokenizer.tokenize(text)  # 1.65ms -> 1.39ms (18.7% faster)


def test_large_all_unique_tokens():
    # Test with 1000 unique tokens
    tokenizer = BasicTokenizer()
    tokens = [f"token{i}" for i in range(1000)]
    text = " ".join(tokens)
    codeflash_output = tokenizer.tokenize(text)  # 7.39ms -> 6.32ms (17.0% faster)


def test_large_many_never_split():
    # Test with many never_split tokens
    never_split = [f"special{i}" for i in range(50)]
    tokenizer = BasicTokenizer(never_split=never_split)
    text = " ".join(never_split) + " hello world!"
    expected = never_split + ["hello", "world", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 237μs -> 223μs (6.60% faster)


def test_large_long_chinese_text():
    # Test with a long string of Chinese characters
    tokenizer = BasicTokenizer()
    chinese = "我" * 500
    text = chinese
    expected = ["我"] * 500
    codeflash_output = tokenizer.tokenize(text)  # 737μs -> 604μs (22.1% faster)


def test_large_mixed_unicode_and_ascii():
    # Test with a long string mixing ascii, emoji, and CJK
    tokenizer = BasicTokenizer()
    ascii_part = "hello world! " * 100
    emoji_part = "😀 " * 100
    cjk_part = "你好 " * 100
    text = ascii_part + emoji_part + cjk_part
    # "hello world!" -> ["hello", "world", "!"] * 100
    # "😀" -> ["😀"] * 100
    # "你", "好" -> ["你", "好"] * 100
    expected = []
    for _ in range(100):
        expected.extend(["hello", "world", "!"])
    for _ in range(100):
        expected.append("😀")
    for _ in range(100):
        expected.extend(["你", "好"])
    codeflash_output = tokenizer.tokenize(text)  # 1.60ms -> 1.38ms (15.4% faster)


def test_large_long_sentence_no_split_on_punc():
    # Test long sentence with do_split_on_punc=False
    tokenizer = BasicTokenizer(do_split_on_punc=False)
    text = ("foo,bar! " * 200).strip()
    expected = []
    for _ in range(200):
        expected.extend(["foo,bar!"])
    codeflash_output = tokenizer.tokenize(text)  # 973μs -> 882μs (10.3% faster)


def test_large_long_sentence_no_lower_case():
    # Test long sentence with do_lower_case=False
    tokenizer = BasicTokenizer(do_lower_case=False)
    text = ("Word " * 500).strip()
    expected = ["Word"] * 500
    codeflash_output = tokenizer.tokenize(text)  # 1.77ms -> 1.53ms (15.6% faster)


def test_large_long_sentence_strip_accents():
    # Test long sentence with accents and strip_accents=True
    tokenizer = BasicTokenizer(strip_accents=True)
    text = ("Café " * 500).strip()
    expected = ["cafe"] * 500
    codeflash_output = tokenizer.tokenize(text)  # 2.18ms -> 1.86ms (17.2% faster)


def test_large_long_sentence_with_never_split_runtime():
    # Test long sentence with many never_split tokens passed at runtime
    tokenizer = BasicTokenizer()
    never_split = [f"special{i}" for i in range(20)]
    text = " ".join(never_split) + " hello world!"
    expected = never_split + ["hello", "world", "!"]
    codeflash_output = tokenizer.tokenize(text, never_split=never_split)  # 111μs -> 108μs (2.97% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports (a second generated test suite begins here)
from transformers.models.prophetnet.tokenization_prophetnet import BasicTokenizer


# --- Unit tests ---

# BASIC TEST CASES


def test_basic_english_sentence():
    # Test tokenization of a simple English sentence with punctuation
    tokenizer = BasicTokenizer()
    text = "Hello, world! This is a test."
    expected = ["hello", ",", "world", "!", "this", "is", "a", "test", "."]
    codeflash_output = tokenizer.tokenize(text)  # 36.8μs -> 33.0μs (11.8% faster)


def test_multiple_spaces():
    # Test that multiple spaces are collapsed to single space and ignored in tokenization
    tokenizer = BasicTokenizer()
    text = "Hello    world"
    expected = ["hello", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 19.3μs -> 17.3μs (11.3% faster)


def test_leading_trailing_whitespace():
    # Test that leading and trailing whitespace is ignored
    tokenizer = BasicTokenizer()
    text = "   Hello world   "
    expected = ["hello", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 20.7μs -> 18.8μs (9.64% faster)


def test_punctuation_split():
    # Test that punctuation is split from words
    tokenizer = BasicTokenizer()
    text = "Wait... what?!"
    expected = ["wait", ".", ".", ".", "what", "?", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 21.5μs -> 19.5μs (10.1% faster)


def test_case_lowering():
    # Test that uppercase is converted to lowercase by default
    tokenizer = BasicTokenizer()
    text = "PyTest"
    expected = ["pytest"]
    codeflash_output = tokenizer.tokenize(text)  # 12.6μs -> 12.1μs (4.83% faster)


def test_no_lower_case():
    # Test with do_lower_case=False
    tokenizer = BasicTokenizer(do_lower_case=False)
    text = "PyTest"
    expected = ["PyTest"]
    codeflash_output = tokenizer.tokenize(text)  # 11.6μs -> 10.8μs (7.42% faster)


def test_strip_accents():
    # Test that accents are stripped by default when lowercasing
    tokenizer = BasicTokenizer()
    text = "Café naïve"
    expected = ["cafe", "naive"]
    codeflash_output = tokenizer.tokenize(text)  # 22.6μs -> 20.4μs (10.8% faster)


def test_no_strip_accents():
    # Test that accents are not stripped if strip_accents=False
    tokenizer = BasicTokenizer(strip_accents=False)
    text = "Café naïve"
    expected = ["café", "naïve"]
    codeflash_output = tokenizer.tokenize(text)  # 17.8μs -> 16.3μs (9.12% faster)


def test_never_split():
    # Test that tokens in never_split are not split or lowercased
    tokenizer = BasicTokenizer(never_split=["[MASK]"])
    text = "Hello [MASK] world"
    expected = ["hello", "[MASK]", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 23.3μs -> 21.2μs (9.64% faster)


def test_never_split_runtime():
    # Test that never_split passed at runtime is honored
    tokenizer = BasicTokenizer()
    text = "Hello [MASK] world"
    expected = ["hello", "[MASK]", "world"]
    codeflash_output = tokenizer.tokenize(text, never_split=["[MASK]"])  # 23.0μs -> 22.1μs (4.10% faster)


def test_do_split_on_punc_false():
    # Test that punctuation is not split if do_split_on_punc=False
    tokenizer = BasicTokenizer(do_split_on_punc=False)
    text = "Hello, world!"
    expected = ["hello,", "world!"]
    codeflash_output = tokenizer.tokenize(text)  # 14.2μs -> 14.5μs (1.51% slower)


# EDGE TEST CASES


def test_empty_string():
    # Test that empty string returns empty list
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("")  # 2.15μs -> 2.82μs (23.6% slower)


def test_only_whitespace():
    # Test that string with only whitespace returns empty list
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("   \t\n\r  ")  # 7.46μs -> 7.68μs (2.90% slower)


def test_control_characters():
    # Test that control characters are removed
    tokenizer = BasicTokenizer()
    text = "Hello\u0000World\ufffd!"
    expected = ["hello", "world", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 19.7μs -> 18.1μs (8.65% faster)


def test_unicode_whitespace():
    # Test that unicode whitespace is handled
    tokenizer = BasicTokenizer()
    text = "Hello\u2003World"  # \u2003 is an em space
    expected = ["hello", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 20.4μs -> 18.4μs (11.1% faster)


def test_unicode_punctuation():
    # Test that unicode punctuation is split
    tokenizer = BasicTokenizer()
    text = "Hello—world…"  # em dash and ellipsis
    expected = ["hello", "—", "world", "…"]
    codeflash_output = tokenizer.tokenize(text)  # 23.0μs -> 21.2μs (8.70% faster)


def test_chinese_characters():
    # Test that Chinese characters are tokenized separately
    tokenizer = BasicTokenizer()
    text = "我爱NLP"
    expected = ["我", "爱", "nlp"]
    codeflash_output = tokenizer.tokenize(text)  # 18.6μs -> 17.2μs (8.19% faster)


def test_disable_chinese_tokenization():
    # Test that Chinese tokenization can be disabled
    tokenizer = BasicTokenizer(tokenize_chinese_chars=False)
    text = "我爱NLP"
    # Chinese chars not split, so "我爱nlp" (lowercased)
    expected = ["我爱nlp"]
    codeflash_output = tokenizer.tokenize(text)  # 12.4μs -> 11.6μs (7.23% faster)


def test_japanese_kana():
    # Test that Japanese kana are not split as Chinese
    tokenizer = BasicTokenizer()
    text = "カタカナ"
    expected = ["カタカナ"]
    codeflash_output = tokenizer.tokenize(text)  # 13.0μs -> 12.2μs (6.98% faster)


def test_never_split_with_punctuation():
    # Test that never_split token containing punctuation is not split
    tokenizer = BasicTokenizer(never_split=["[MASK]!"])
    text = "Hello [MASK]! world"
    expected = ["hello", "[MASK]!", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 24.3μs -> 22.5μs (8.06% faster)


def test_strip_accents_true_explicit():
    # Test that strip_accents=True works even if do_lower_case=False
    tokenizer = BasicTokenizer(do_lower_case=False, strip_accents=True)
    text = "Café"
    expected = ["Cafe"]
    codeflash_output = tokenizer.tokenize(text)  # 13.5μs -> 12.5μs (7.96% faster)


def test_strip_accents_false_explicit():
    # Test that strip_accents=False disables accent stripping even with do_lower_case=True
    tokenizer = BasicTokenizer(do_lower_case=True, strip_accents=False)
    text = "Café"
    expected = ["café"]
    codeflash_output = tokenizer.tokenize(text)  # 11.3μs -> 10.4μs (8.50% faster)


def test_mixed_never_split_and_runtime():
    # Test that both constructor and runtime never_split are unioned
    tokenizer = BasicTokenizer(never_split=["[MASK]"])
    text = "Hello [MASK] [CLS] world"
    expected = ["hello", "[MASK]", "[CLS]", "world"]
    codeflash_output = tokenizer.tokenize(text, never_split=["[CLS]"])  # 27.0μs -> 25.7μs (5.37% faster)


def test_multiple_never_split():
    # Test that multiple never_split tokens are not split
    tokenizer = BasicTokenizer(never_split=["[MASK]", "[CLS]"])
    text = "Hello [MASK] [CLS] world"
    expected = ["hello", "[MASK]", "[CLS]", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 25.9μs -> 23.4μs (10.6% faster)


def test_unicode_normalization():
    # Test that different unicode representations are normalized
    tokenizer = BasicTokenizer()
    text = "e\u0301"  # 'e' + combining acute accent
    expected = ["e"]
    codeflash_output = tokenizer.tokenize(text)  # 11.4μs -> 11.5μs (0.236% slower)


def test_token_with_only_punctuation():
    # Test that a string of only punctuation is split into individual tokens
    tokenizer = BasicTokenizer()
    text = "!!!"
    expected = ["!", "!", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 10.2μs -> 9.28μs (10.1% faster)


def test_token_with_numbers_and_punctuation():
    # Test that numbers and punctuation are split correctly
    tokenizer = BasicTokenizer()
    text = "3.14, 2:30"
    expected = ["3", ".", "14", ",", "2", ":", "30"]
    codeflash_output = tokenizer.tokenize(text)  # 19.5μs -> 17.2μs (13.7% faster)


def test_token_with_emoji():
    # Test that emoji are treated as non-punctuation and not split
    tokenizer = BasicTokenizer()
    text = "Hello 😊!"
    expected = ["hello", "😊", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 20.8μs -> 18.8μs (10.8% faster)


def test_token_with_mixed_script():
    # Test that mixed scripts are handled (Latin + Cyrillic + Greek)
    tokenizer = BasicTokenizer()
    text = "abc абв αβγ"
    expected = ["abc", "абв", "αβγ"]
    codeflash_output = tokenizer.tokenize(text)  # 24.1μs -> 22.2μs (8.60% faster)


def test_token_with_long_never_split():
    # Test never_split with a long token
    tokenizer = BasicTokenizer(never_split=["ThisIsAVeryLongTokenThatShouldNotBeSplit"])
    text = "Hello ThisIsAVeryLongTokenThatShouldNotBeSplit world"
    expected = ["hello", "ThisIsAVeryLongTokenThatShouldNotBeSplit", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 39.1μs -> 36.7μs (6.50% faster)


# LARGE SCALE TEST CASES


def test_large_text_repeats():
    # Test tokenization of a long repeated sentence
    tokenizer = BasicTokenizer()
    text = ("The quick brown fox jumps over the lazy dog. " * 50).strip()
    expected = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."] * 50
    codeflash_output = tokenizer.tokenize(text)  # 1.90ms -> 1.63ms (17.0% faster)


def test_large_text_no_punctuation():
    # Test tokenization of a large text block without punctuation
    tokenizer = BasicTokenizer()
    text = "word " * 1000
    expected = ["word"] * 1000
    codeflash_output = tokenizer.tokenize(text.strip())  # 4.18ms -> 3.56ms (17.3% faster)


def test_large_text_with_punctuation_and_never_split():
    # Test tokenization of a large text block with punctuation and never_split tokens
    never_split_token = "[SPECIAL_TOKEN]"
    tokenizer = BasicTokenizer(never_split=[never_split_token])
    text = ("Hello, world! " + never_split_token + " ") * 100
    expected = ["hello", ",", "world", "!", never_split_token] * 100
    codeflash_output = tokenizer.tokenize(text.strip())  # 1.87ms -> 1.65ms (13.4% faster)


def test_large_text_with_chinese():
    # Test tokenization of a large text block with Chinese characters
    tokenizer = BasicTokenizer()
    chinese = "我爱NLP"
    text = (chinese + " ") * 200
    expected = ["我", "爱", "nlp"] * 200
    codeflash_output = tokenizer.tokenize(text.strip())  # 1.31ms -> 1.08ms (20.8% faster)


def test_large_text_with_varied_whitespace():
    # Test tokenization of a large text block with varied whitespace and tabs/newlines
    tokenizer = BasicTokenizer()
    text = "word1\tword2  word3\nword4\rword5 " * 100
    expected = ["word1", "word2", "word3", "word4", "word5"] * 100
    codeflash_output = tokenizer.tokenize(text.strip())  # 2.52ms -> 2.15ms (17.2% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-BasicTokenizer.tokenize-misi0t5a and push.

codeflash-ai bot requested a review from mashraf-222 on Dec 5, 2025 at 06:43
codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Dec 5, 2025