codeflash-ai bot commented on Dec 5, 2025

📄 16% (0.16x) speedup for BasicTokenizer.tokenize in src/transformers/models/prophetnet/tokenization_prophetnet.py

⏱️ Runtime : 31.6 milliseconds → 27.2 milliseconds (best of 134 runs)

📝 Explanation and details

The optimized code achieves a 16% speedup through several micro-optimizations that reduce per-call overhead in the tokenizer's frequently called methods:

Core Optimizations (see the sketch after this list):

  1. Eliminated redundant variable assignments: In whitespace_tokenize(), removed the intermediate tokens variable and returned text.split() directly, dropping a redundant name binding on every call.

  2. Cached method lookups as instance attributes: Added self._unicodedata_normalize and self._unicodedata_category in __init__() to avoid repeated module attribute lookups during hot path execution. This is particularly effective since unicodedata.normalize and unicodedata.category are called frequently in text processing loops.

  3. Localized method references in hot loops: Created local variables like never_split_contains = never_split.__contains__, append = output.append, and _is_punctuation_local = _is_punctuation to eliminate attribute lookups within tight loops. Python's LOAD_FAST opcode for local variables is significantly faster than LOAD_GLOBAL or LOAD_ATTR.

  4. Optimized punctuation splitting algorithm: Replaced the complex list-of-lists approach in _run_split_on_punc() with a simpler current_word buffer pattern, reducing memory allocations and list comprehension overhead.

  5. Conditional string joining optimization: Added a check for single-token cases (if len(split_tokens) == 1) to avoid unnecessary string joining operations.
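
To make points 2-4 concrete, here is a minimal sketch of the patterns described above. It is an illustration only, not the literal diff in this PR: the class name _OptimizationSketch is hypothetical, and the simplified _is_punctuation stand-in covers only Unicode "P*" categories, whereas the real helper in the tokenization module also treats certain ASCII ranges as punctuation.

import unicodedata


def _is_punctuation(char):
    # Simplified stand-in for the module's helper (assumption: Unicode
    # category "P*" only; the real helper also covers ASCII symbol ranges).
    return unicodedata.category(char).startswith("P")


class _OptimizationSketch:
    def __init__(self):
        # Point 2: cache module-level callables as instance attributes so hot
        # paths skip repeated unicodedata.<name> lookups.
        self._unicodedata_normalize = unicodedata.normalize
        self._unicodedata_category = unicodedata.category

    def _run_strip_accents(self, text):
        text = self._unicodedata_normalize("NFD", text)
        # Point 3: bind loop-invariant lookups to locals (LOAD_FAST).
        category = self._unicodedata_category
        output = []
        append = output.append
        for char in text:
            if category(char) != "Mn":  # skip combining marks
                append(char)
        return "".join(output)

    def _run_split_on_punc(self, text):
        # Point 4: a single current_word buffer instead of a list of lists.
        is_punct = _is_punctuation
        output = []
        append = output.append
        current_word = []
        for char in text:
            if is_punct(char):
                if current_word:
                    append("".join(current_word))
                    current_word = []
                append(char)
            else:
                current_word.append(char)
        if current_word:
            append("".join(current_word))
        return output


# Example: _OptimizationSketch()._run_split_on_punc("hello!!!") == ["hello", "!", "!", "!"]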

Performance Impact by Test Category:

  • Basic text processing: 6-12% improvement across typical tokenization scenarios
  • Large-scale processing: 15-22% improvement on repetitive text (1000+ tokens), where loop overhead dominates
  • Chinese character processing: Significant gains (20%+) due to optimized character-by-character processing with cached method lookups (see the sketch below)
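
The Chinese-character path shows the same local-binding pattern in its hot loop. A hedged sketch follows; the single range below is a simplification, since the real method checks several additional CJK blocks (extensions and compatibility ideographs).

def _tokenize_chinese_chars_sketch(text):
    # Pad CJK ideographs with spaces so each becomes its own whitespace token.
    # append is bound once so the per-character loop stays on LOAD_FAST.
    output = []
    append = output.append
    for char in text:
        cp = ord(char)
        if 0x4E00 <= cp <= 0x9FFF:  # main CJK Unified Ideographs block only
            append(" ")
            append(char)
            append(" ")
        else:
            append(char)
    return "".join(output)


# _tokenize_chinese_chars_sketch("我爱NLP") == " 我  爱 NLP"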

The optimizations are most effective for high-throughput tokenization workloads where the same methods are called repeatedly, making the reduced per-call overhead compound into substantial performance gains. All semantic behavior and edge case handling remain identical to the original implementation.
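
As a rough way to gauge such a workload yourself, a minimal timing sketch (assumes transformers is installed; the import path is the one used by the generated tests below, the workload string is arbitrary, and absolute numbers will differ from the report above):

import timeit

from transformers.models.prophetnet.tokenization_prophetnet import BasicTokenizer

tokenizer = BasicTokenizer()
# Arbitrary repetitive workload, similar in spirit to the large-scale tests below.
text = ("The quick brown fox jumps over the lazy dog. " * 200).strip()

# Average seconds per tokenize() call; the micro-optimizations above shave
# per-call overhead, which compounds when this is invoked millions of times.
runs = 50
per_call = timeit.timeit(lambda: tokenizer.tokenize(text), number=runs) / runs
print(f"{per_call * 1000:.2f} ms per tokenize() call")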

Correctness verification report:

Test                          | Status
⚙️ Existing Unit Tests        | 🔘 None Found
🌀 Generated Regression Tests | 152 Passed
⏪ Replay Tests               | 🔘 None Found
🔎 Concolic Coverage Tests    | 🔘 None Found
📊 Tests Coverage             | 100.0%
🌀 Generated Regression Tests and Runtime
# imports
from transformers.models.prophetnet.tokenization_prophetnet import BasicTokenizer


# --- Unit tests for BasicTokenizer.tokenize ---

# 1. BASIC TEST CASES


def test_basic_simple_sentence():
    # Test a simple sentence with punctuation
    tokenizer = BasicTokenizer()
    text = "Hello, world!"
    codeflash_output = tokenizer.tokenize(text)  # 22.4μs -> 20.1μs (11.0% faster)


def test_basic_multiple_spaces():
    # Test multiple spaces between words
    tokenizer = BasicTokenizer()
    text = "Hello     world"
    codeflash_output = tokenizer.tokenize(text)  # 19.9μs -> 18.7μs (6.40% faster)


def test_basic_tabs_and_newlines():
    # Test tabs and newlines as whitespace
    tokenizer = BasicTokenizer()
    text = "Hello\tworld\nnew line"
    codeflash_output = tokenizer.tokenize(text)  # 27.7μs -> 24.7μs (12.0% faster)


def test_basic_mixed_case():
    # Test mixed case and lowercasing
    tokenizer = BasicTokenizer()
    text = "PyTest is AwEsOmE."
    codeflash_output = tokenizer.tokenize(text)  # 25.9μs -> 23.6μs (9.68% faster)


def test_basic_never_split():
    # Test never_split parameter
    tokenizer = BasicTokenizer(never_split=["[UNK]"])
    text = "This is [UNK] token."
    codeflash_output = tokenizer.tokenize(text)  # 25.8μs -> 23.5μs (9.86% faster)


def test_basic_never_split_runtime():
    # Test never_split passed at runtime
    tokenizer = BasicTokenizer()
    text = "This is [MASK] token."
    codeflash_output = tokenizer.tokenize(text, never_split=["[MASK]"])  # 26.9μs -> 24.4μs (10.2% faster)


def test_basic_strip_accents():
    # Test accent stripping
    tokenizer = BasicTokenizer(strip_accents=True)
    text = "Café naïve résumé"
    codeflash_output = tokenizer.tokenize(text)  # 29.9μs -> 27.3μs (9.52% faster)


def test_basic_no_lower_case():
    # Test disabling lowercasing
    tokenizer = BasicTokenizer(do_lower_case=False)
    text = "Hello World!"
    codeflash_output = tokenizer.tokenize(text)  # 17.8μs -> 16.4μs (8.61% faster)


def test_basic_no_split_on_punc():
    # Test disabling split on punctuation
    tokenizer = BasicTokenizer(do_split_on_punc=False)
    text = "Hello, world!"
    codeflash_output = tokenizer.tokenize(text)  # 14.2μs -> 14.3μs (0.830% slower)


def test_basic_strip_accents_false():
    # Test strip_accents=False disables accent stripping
    tokenizer = BasicTokenizer(strip_accents=False)
    text = "Café naïve résumé"
    codeflash_output = tokenizer.tokenize(text)  # 25.2μs -> 22.5μs (12.0% faster)


def test_basic_strip_accents_and_no_lowercase():
    # Test strip_accents and do_lower_case both False
    tokenizer = BasicTokenizer(strip_accents=False, do_lower_case=False)
    text = "Café"
    codeflash_output = tokenizer.tokenize(text)  # 10.5μs -> 9.97μs (5.01% faster)


# 2. EDGE TEST CASES


def test_edge_empty_string():
    # Test empty string
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("")  # 2.19μs -> 2.87μs (23.7% slower)


def test_edge_only_spaces():
    # Test string with only spaces
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("     ")  # 6.01μs -> 6.45μs (6.84% slower)


def test_edge_only_punctuation():
    # Test string with only punctuation
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("!!!")  # 10.7μs -> 9.72μs (10.1% faster)


def test_edge_only_control_characters():
    # Test string with only control characters (should be removed)
    tokenizer = BasicTokenizer()
    # chr(0), chr(1), chr(2) are control chars
    text = "\x00\x01\x02"
    codeflash_output = tokenizer.tokenize(text)  # 3.48μs -> 3.94μs (11.7% slower)


def test_edge_mixed_control_and_text():
    # Test string with control characters and text
    tokenizer = BasicTokenizer()
    text = "a\x00b\x01c"
    codeflash_output = tokenizer.tokenize(text)  # 11.4μs -> 10.7μs (5.92% faster)


def test_edge_unicode_whitespace():
    # Test unicode whitespace (e.g., non-breaking space)
    tokenizer = BasicTokenizer()
    text = "Hello\u00a0world"  # \u00A0 is non-breaking space
    codeflash_output = tokenizer.tokenize(text)  # 20.1μs -> 18.4μs (8.99% faster)


def test_edge_unicode_punctuation():
    # Test unicode punctuation (e.g., em dash)
    tokenizer = BasicTokenizer()
    text = "Hello—world"  # em dash
    codeflash_output = tokenizer.tokenize(text)  # 21.5μs -> 20.2μs (6.32% faster)


def test_edge_unicode_normalization():
    # Test that NFC normalization is applied
    tokenizer = BasicTokenizer()
    text = "e\u0301"  # 'e' + combining acute accent
    # Should be normalized to 'é', then lowercased and accent stripped to 'e'
    codeflash_output = tokenizer.tokenize(text)  # 11.4μs -> 11.3μs (1.30% faster)


def test_edge_chinese_characters():
    # Test that Chinese characters are separated
    tokenizer = BasicTokenizer()
    text = "我喜欢Python。"  # "I like Python."
    # Chinese chars and punctuation should be split
    codeflash_output = tokenizer.tokenize(text)  # 25.9μs -> 24.0μs (8.09% faster)


def test_edge_disable_chinese_tokenization():
    # Test disabling Chinese character tokenization
    tokenizer = BasicTokenizer(tokenize_chinese_chars=False)
    text = "我喜欢Python。"
    # Chinese chars are not split into separate tokens
    codeflash_output = tokenizer.tokenize(text)  # 17.6μs -> 16.3μs (8.26% faster)


def test_edge_never_split_chinese():
    # Test never_split with Chinese char
    tokenizer = BasicTokenizer(never_split=["我"])
    text = "我喜欢Python。"
    # "我" is not split or lowercased/stripped
    codeflash_output = tokenizer.tokenize(text)  # 23.5μs -> 21.7μs (8.32% faster)


def test_edge_mixed_never_split_and_case():
    # Test never_split disables lowercasing/stripping for that token only
    tokenizer = BasicTokenizer(never_split=["PyThOn"])
    text = "I like PyThOn!"
    codeflash_output = tokenizer.tokenize(text)  # 23.5μs -> 21.1μs (11.3% faster)


def test_edge_token_with_punct_inside():
    # Test token with punctuation inside (should split at punctuation)
    tokenizer = BasicTokenizer()
    text = "foo.bar"
    codeflash_output = tokenizer.tokenize(text)  # 14.5μs -> 13.4μs (8.33% faster)


def test_edge_token_with_multiple_punctuations():
    # Test token with multiple consecutive punctuation
    tokenizer = BasicTokenizer()
    text = "hello!!!"
    codeflash_output = tokenizer.tokenize(text)  # 15.3μs -> 13.7μs (11.8% faster)


def test_edge_token_with_punct_and_never_split():
    # Test never_split disables splitting for that token
    tokenizer = BasicTokenizer(never_split=["foo.bar"])
    text = "foo.bar"
    codeflash_output = tokenizer.tokenize(text)  # 8.65μs -> 9.09μs (4.81% slower)


def test_edge_strip_accents_with_nonlatin():
    # Test accent stripping on non-latin (should not strip e.g. Cyrillic)
    tokenizer = BasicTokenizer(strip_accents=True)
    text = "Добрый день"
    codeflash_output = tokenizer.tokenize(text)  # 24.2μs -> 22.2μs (8.95% faster)


def test_edge_nonbreaking_space_and_tab():
    # Test with non-breaking space and tab
    tokenizer = BasicTokenizer()
    text = "Hello\u00a0\tworld"
    codeflash_output = tokenizer.tokenize(text)  # 20.6μs -> 18.6μs (10.6% faster)


def test_edge_surrogate_pairs():
    # Test with characters outside BMP (e.g., emoji)
    tokenizer = BasicTokenizer()
    text = "I love 🍕!"
    codeflash_output = tokenizer.tokenize(text)  # 21.8μs -> 20.5μs (6.26% faster)


def test_edge_punctuation_only_never_split():
    # Test never_split with punctuation token
    tokenizer = BasicTokenizer(never_split=["!"])
    text = "Wow!"
    codeflash_output = tokenizer.tokenize(text)  # 11.5μs -> 10.9μs (6.03% faster)


def test_edge_empty_never_split():
    # Test never_split with empty list
    tokenizer = BasicTokenizer(never_split=[])
    text = "Hello world!"
    codeflash_output = tokenizer.tokenize(text)  # 19.7μs -> 18.3μs (7.26% faster)


# 3. LARGE SCALE TEST CASES


def test_large_long_sentence():
    # Test a long sentence (500 words)
    tokenizer = BasicTokenizer()
    text = "word " * 500
    expected = ["word"] * 500
    codeflash_output = tokenizer.tokenize(text)  # 2.09ms -> 1.80ms (16.3% faster)


def test_large_long_sentence_with_punctuation():
    # Test a long sentence with punctuation at intervals
    tokenizer = BasicTokenizer()
    text = ("foo, bar. " * 200).strip()
    expected = []
    for _ in range(200):
        expected.extend(["foo", ",", "bar", "."])
    codeflash_output = tokenizer.tokenize(text)  # 1.65ms -> 1.39ms (18.7% faster)


def test_large_all_unique_tokens():
    # Test with 1000 unique tokens
    tokenizer = BasicTokenizer()
    tokens = [f"token{i}" for i in range(1000)]
    text = " ".join(tokens)
    codeflash_output = tokenizer.tokenize(text)  # 7.39ms -> 6.32ms (17.0% faster)


def test_large_many_never_split():
    # Test with many never_split tokens
    never_split = [f"special{i}" for i in range(50)]
    tokenizer = BasicTokenizer(never_split=never_split)
    text = " ".join(never_split) + " hello world!"
    expected = never_split + ["hello", "world", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 237μs -> 223μs (6.60% faster)


def test_large_long_chinese_text():
    # Test with a long string of Chinese characters
    tokenizer = BasicTokenizer()
    chinese = "我" * 500
    text = chinese
    expected = ["我"] * 500
    codeflash_output = tokenizer.tokenize(text)  # 737μs -> 604μs (22.1% faster)


def test_large_mixed_unicode_and_ascii():
    # Test with a long string mixing ascii, emoji, and CJK
    tokenizer = BasicTokenizer()
    ascii_part = "hello world! " * 100
    emoji_part = "😀 " * 100
    cjk_part = "你好 " * 100
    text = ascii_part + emoji_part + cjk_part
    # "hello world!" -> ["hello", "world", "!"] * 100
    # "😀" -> ["😀"] * 100
    # "你", "好" -> ["你", "好"] * 100
    expected = []
    for _ in range(100):
        expected.extend(["hello", "world", "!"])
    for _ in range(100):
        expected.append("😀")
    for _ in range(100):
        expected.extend(["你", "好"])
    codeflash_output = tokenizer.tokenize(text)  # 1.60ms -> 1.38ms (15.4% faster)


def test_large_long_sentence_no_split_on_punc():
    # Test long sentence with do_split_on_punc=False
    tokenizer = BasicTokenizer(do_split_on_punc=False)
    text = ("foo,bar! " * 200).strip()
    expected = []
    for _ in range(200):
        expected.extend(["foo,bar!"])
    codeflash_output = tokenizer.tokenize(text)  # 973μs -> 882μs (10.3% faster)


def test_large_long_sentence_no_lower_case():
    # Test long sentence with do_lower_case=False
    tokenizer = BasicTokenizer(do_lower_case=False)
    text = ("Word " * 500).strip()
    expected = ["Word"] * 500
    codeflash_output = tokenizer.tokenize(text)  # 1.77ms -> 1.53ms (15.6% faster)


def test_large_long_sentence_strip_accents():
    # Test long sentence with accents and strip_accents=True
    tokenizer = BasicTokenizer(strip_accents=True)
    text = ("Café " * 500).strip()
    expected = ["cafe"] * 500
    codeflash_output = tokenizer.tokenize(text)  # 2.18ms -> 1.86ms (17.2% faster)


def test_large_long_sentence_with_never_split_runtime():
    # Test long sentence with many never_split tokens passed at runtime
    tokenizer = BasicTokenizer()
    never_split = [f"special{i}" for i in range(20)]
    text = " ".join(never_split) + " hello world!"
    expected = never_split + ["hello", "world", "!"]
    codeflash_output = tokenizer.tokenize(text, never_split=never_split)  # 111μs -> 108μs (2.97% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
# imports (a second generated test suite begins here)
from transformers.models.prophetnet.tokenization_prophetnet import BasicTokenizer


# --- Unit tests ---

# BASIC TEST CASES


def test_basic_english_sentence():
    # Test tokenization of a simple English sentence with punctuation
    tokenizer = BasicTokenizer()
    text = "Hello, world! This is a test."
    expected = ["hello", ",", "world", "!", "this", "is", "a", "test", "."]
    codeflash_output = tokenizer.tokenize(text)  # 36.8μs -> 33.0μs (11.8% faster)


def test_multiple_spaces():
    # Test that multiple spaces are collapsed to single space and ignored in tokenization
    tokenizer = BasicTokenizer()
    text = "Hello    world"
    expected = ["hello", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 19.3μs -> 17.3μs (11.3% faster)


def test_leading_trailing_whitespace():
    # Test that leading and trailing whitespace is ignored
    tokenizer = BasicTokenizer()
    text = "   Hello world   "
    expected = ["hello", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 20.7μs -> 18.8μs (9.64% faster)


def test_punctuation_split():
    # Test that punctuation is split from words
    tokenizer = BasicTokenizer()
    text = "Wait... what?!"
    expected = ["wait", ".", ".", ".", "what", "?", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 21.5μs -> 19.5μs (10.1% faster)


def test_case_lowering():
    # Test that uppercase is converted to lowercase by default
    tokenizer = BasicTokenizer()
    text = "PyTest"
    expected = ["pytest"]
    codeflash_output = tokenizer.tokenize(text)  # 12.6μs -> 12.1μs (4.83% faster)


def test_no_lower_case():
    # Test with do_lower_case=False
    tokenizer = BasicTokenizer(do_lower_case=False)
    text = "PyTest"
    expected = ["PyTest"]
    codeflash_output = tokenizer.tokenize(text)  # 11.6μs -> 10.8μs (7.42% faster)


def test_strip_accents():
    # Test that accents are stripped by default when lowercasing
    tokenizer = BasicTokenizer()
    text = "Café naïve"
    expected = ["cafe", "naive"]
    codeflash_output = tokenizer.tokenize(text)  # 22.6μs -> 20.4μs (10.8% faster)


def test_no_strip_accents():
    # Test that accents are not stripped if strip_accents=False
    tokenizer = BasicTokenizer(strip_accents=False)
    text = "Café naïve"
    expected = ["café", "naïve"]
    codeflash_output = tokenizer.tokenize(text)  # 17.8μs -> 16.3μs (9.12% faster)


def test_never_split():
    # Test that tokens in never_split are not split or lowercased
    tokenizer = BasicTokenizer(never_split=["[MASK]"])
    text = "Hello [MASK] world"
    expected = ["hello", "[MASK]", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 23.3μs -> 21.2μs (9.64% faster)


def test_never_split_runtime():
    # Test that never_split passed at runtime is honored
    tokenizer = BasicTokenizer()
    text = "Hello [MASK] world"
    expected = ["hello", "[MASK]", "world"]
    codeflash_output = tokenizer.tokenize(text, never_split=["[MASK]"])  # 23.0μs -> 22.1μs (4.10% faster)


def test_do_split_on_punc_false():
    # Test that punctuation is not split if do_split_on_punc=False
    tokenizer = BasicTokenizer(do_split_on_punc=False)
    text = "Hello, world!"
    expected = ["hello,", "world!"]
    codeflash_output = tokenizer.tokenize(text)  # 14.2μs -> 14.5μs (1.51% slower)


# EDGE TEST CASES


def test_empty_string():
    # Test that empty string returns empty list
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("")  # 2.15μs -> 2.82μs (23.6% slower)


def test_only_whitespace():
    # Test that string with only whitespace returns empty list
    tokenizer = BasicTokenizer()
    codeflash_output = tokenizer.tokenize("   \t\n\r  ")  # 7.46μs -> 7.68μs (2.90% slower)


def test_control_characters():
    # Test that control characters are removed
    tokenizer = BasicTokenizer()
    text = "Hello\u0000World\ufffd!"
    expected = ["hello", "world", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 19.7μs -> 18.1μs (8.65% faster)


def test_unicode_whitespace():
    # Test that unicode whitespace is handled
    tokenizer = BasicTokenizer()
    text = "Hello\u2003World"  # \u2003 is an em space
    expected = ["hello", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 20.4μs -> 18.4μs (11.1% faster)


def test_unicode_punctuation():
    # Test that unicode punctuation is split
    tokenizer = BasicTokenizer()
    text = "Hello—world…"  # em dash and ellipsis
    expected = ["hello", "—", "world", "…"]
    codeflash_output = tokenizer.tokenize(text)  # 23.0μs -> 21.2μs (8.70% faster)


def test_chinese_characters():
    # Test that Chinese characters are tokenized separately
    tokenizer = BasicTokenizer()
    text = "我爱NLP"
    expected = ["我", "爱", "nlp"]
    codeflash_output = tokenizer.tokenize(text)  # 18.6μs -> 17.2μs (8.19% faster)


def test_disable_chinese_tokenization():
    # Test that Chinese tokenization can be disabled
    tokenizer = BasicTokenizer(tokenize_chinese_chars=False)
    text = "我爱NLP"
    # Chinese chars not split, so "我爱nlp" (lowercased)
    expected = ["我爱nlp"]
    codeflash_output = tokenizer.tokenize(text)  # 12.4μs -> 11.6μs (7.23% faster)


def test_japanese_kana():
    # Test that Japanese kana are not split as Chinese
    tokenizer = BasicTokenizer()
    text = "カタカナ"
    expected = ["カタカナ"]
    codeflash_output = tokenizer.tokenize(text)  # 13.0μs -> 12.2μs (6.98% faster)


def test_never_split_with_punctuation():
    # Test that never_split token containing punctuation is not split
    tokenizer = BasicTokenizer(never_split=["[MASK]!"])
    text = "Hello [MASK]! world"
    expected = ["hello", "[MASK]!", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 24.3μs -> 22.5μs (8.06% faster)


def test_strip_accents_true_explicit():
    # Test that strip_accents=True works even if do_lower_case=False
    tokenizer = BasicTokenizer(do_lower_case=False, strip_accents=True)
    text = "Café"
    expected = ["Cafe"]
    codeflash_output = tokenizer.tokenize(text)  # 13.5μs -> 12.5μs (7.96% faster)


def test_strip_accents_false_explicit():
    # Test that strip_accents=False disables accent stripping even with do_lower_case=True
    tokenizer = BasicTokenizer(do_lower_case=True, strip_accents=False)
    text = "Café"
    expected = ["café"]
    codeflash_output = tokenizer.tokenize(text)  # 11.3μs -> 10.4μs (8.50% faster)


def test_mixed_never_split_and_runtime():
    # Test that both constructor and runtime never_split are unioned
    tokenizer = BasicTokenizer(never_split=["[MASK]"])
    text = "Hello [MASK] [CLS] world"
    expected = ["hello", "[MASK]", "[CLS]", "world"]
    codeflash_output = tokenizer.tokenize(text, never_split=["[CLS]"])  # 27.0μs -> 25.7μs (5.37% faster)


def test_multiple_never_split():
    # Test that multiple never_split tokens are not split
    tokenizer = BasicTokenizer(never_split=["[MASK]", "[CLS]"])
    text = "Hello [MASK] [CLS] world"
    expected = ["hello", "[MASK]", "[CLS]", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 25.9μs -> 23.4μs (10.6% faster)


def test_unicode_normalization():
    # Test that different unicode representations are normalized
    tokenizer = BasicTokenizer()
    text = "e\u0301"  # 'e' + combining acute accent
    expected = ["e"]
    codeflash_output = tokenizer.tokenize(text)  # 11.4μs -> 11.5μs (0.236% slower)


def test_token_with_only_punctuation():
    # Test that a string of only punctuation is split into individual tokens
    tokenizer = BasicTokenizer()
    text = "!!!"
    expected = ["!", "!", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 10.2μs -> 9.28μs (10.1% faster)


def test_token_with_numbers_and_punctuation():
    # Test that numbers and punctuation are split correctly
    tokenizer = BasicTokenizer()
    text = "3.14, 2:30"
    expected = ["3", ".", "14", ",", "2", ":", "30"]
    codeflash_output = tokenizer.tokenize(text)  # 19.5μs -> 17.2μs (13.7% faster)


def test_token_with_emoji():
    # Test that emoji are treated as non-punctuation and not split
    tokenizer = BasicTokenizer()
    text = "Hello 😊!"
    expected = ["hello", "😊", "!"]
    codeflash_output = tokenizer.tokenize(text)  # 20.8μs -> 18.8μs (10.8% faster)


def test_token_with_mixed_script():
    # Test that mixed scripts are handled (Latin + Cyrillic + Greek)
    tokenizer = BasicTokenizer()
    text = "abc абв αβγ"
    expected = ["abc", "абв", "αβγ"]
    codeflash_output = tokenizer.tokenize(text)  # 24.1μs -> 22.2μs (8.60% faster)


def test_token_with_long_never_split():
    # Test never_split with a long token
    tokenizer = BasicTokenizer(never_split=["ThisIsAVeryLongTokenThatShouldNotBeSplit"])
    text = "Hello ThisIsAVeryLongTokenThatShouldNotBeSplit world"
    expected = ["hello", "ThisIsAVeryLongTokenThatShouldNotBeSplit", "world"]
    codeflash_output = tokenizer.tokenize(text)  # 39.1μs -> 36.7μs (6.50% faster)


# LARGE SCALE TEST CASES


def test_large_text_repeats():
    # Test tokenization of a long repeated sentence
    tokenizer = BasicTokenizer()
    text = ("The quick brown fox jumps over the lazy dog. " * 50).strip()
    expected = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."] * 50
    codeflash_output = tokenizer.tokenize(text)  # 1.90ms -> 1.63ms (17.0% faster)


def test_large_text_no_punctuation():
    # Test tokenization of a large text block without punctuation
    tokenizer = BasicTokenizer()
    text = "word " * 1000
    expected = ["word"] * 1000
    codeflash_output = tokenizer.tokenize(text.strip())  # 4.18ms -> 3.56ms (17.3% faster)


def test_large_text_with_punctuation_and_never_split():
    # Test tokenization of a large text block with punctuation and never_split tokens
    never_split_token = "[SPECIAL_TOKEN]"
    tokenizer = BasicTokenizer(never_split=[never_split_token])
    text = ("Hello, world! " + never_split_token + " ") * 100
    expected = ["hello", ",", "world", "!", never_split_token] * 100
    codeflash_output = tokenizer.tokenize(text.strip())  # 1.87ms -> 1.65ms (13.4% faster)


def test_large_text_with_chinese():
    # Test tokenization of a large text block with Chinese characters
    tokenizer = BasicTokenizer()
    chinese = "我爱NLP"
    text = (chinese + " ") * 200
    expected = ["我", "爱", "nlp"] * 200
    codeflash_output = tokenizer.tokenize(text.strip())  # 1.31ms -> 1.08ms (20.8% faster)


def test_large_text_with_varied_whitespace():
    # Test tokenization of a large text block with varied whitespace and tabs/newlines
    tokenizer = BasicTokenizer()
    text = "word1\tword2  word3\nword4\rword5 " * 100
    expected = ["word1", "word2", "word3", "word4", "word5"] * 100
    codeflash_output = tokenizer.tokenize(text.strip())  # 2.52ms -> 2.15ms (17.2% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run git checkout codeflash/optimize-BasicTokenizer.tokenize-misi0t5a and push.

codeflash-ai bot requested a review from mashraf-222 on Dec 5, 2025 at 06:43
codeflash-ai bot added the labels ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) on Dec 5, 2025