Skip to content

Conversation

@codeflash-ai
Copy link

@codeflash-ai codeflash-ai bot commented Dec 5, 2025

📄 233% (2.33x) speedup for ProphetNetTokenizer._convert_id_to_token in src/transformers/models/prophetnet/tokenization_prophetnet.py

⏱️ Runtime : 2.25 milliseconds 676 microseconds (best of 200 runs)

📝 Explanation and details

The optimization replaces a dictionary lookup with array indexing for token ID to token conversion, achieving a 232% speedup.

Key optimization: Added a pre-computed list _ids_to_tokens_list during initialization that maps token IDs directly to tokens using array indices, enabling O(1) lookups instead of dictionary operations.

What changed:

  • Initialization: Creates a sparse array where _ids_to_tokens_list[id] = token for fast direct access
  • Lookup logic: Added a fast path that checks if the index is a valid integer within array bounds and retrieves the token directly
  • Fallback preservation: Maintains the original dictionary-based lookup for edge cases (non-integers, out-of-bounds, missing tokens)

Why it's faster:

  • Array indexing (list[index]) is significantly faster than dictionary lookups (dict.get(key)) in Python
  • The bounds check (0 <= index < len(list)) and type check (isinstance(index, int)) are very fast operations
  • Most tokenization scenarios involve valid integer indices within vocabulary bounds, hitting the fast path

Performance characteristics:

  • Valid indices: 146-373% faster (common case in tokenization workflows)
  • Invalid indices: 3-56% slower due to additional checks, but this is rare in practice
  • Large vocabularies: Shows even better improvements (up to 373% faster) due to reduced dictionary overhead

The optimization is particularly effective for transformer tokenizers where _convert_id_to_token is frequently called during text generation and processing, making the fast path for valid indices highly valuable.

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 2235 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Generated Regression Tests and Runtime
import os
import tempfile

# imports
import pytest

from transformers.models.prophetnet.tokenization_prophetnet import ProphetNetTokenizer


# Helper function to create a temporary vocab file
def create_vocab_file(tokens):
    tmp = tempfile.NamedTemporaryFile(mode="w+", delete=False, encoding="utf-8")
    for t in tokens:
        tmp.write(t + "\n")
    tmp.flush()
    tmp.close()
    return tmp.name


# ----------- Unit Tests ------------

# 1. Basic Test Cases


def test_basic_id_to_token():
    # Basic vocab
    vocab = ["hello", "world", "[PAD]", "[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # Test valid indices
    codeflash_output = tokenizer._convert_id_to_token(0)  # 2.40μs -> 978ns (146% faster)
    codeflash_output = tokenizer._convert_id_to_token(1)  # 1.23μs -> 457ns (169% faster)
    codeflash_output = tokenizer._convert_id_to_token(2)  # 1.02μs -> 235ns (336% faster)
    codeflash_output = tokenizer._convert_id_to_token(3)  # 989ns -> 221ns (348% faster)

    os.remove(vocab_file)


def test_basic_out_of_vocab_returns_unk():
    vocab = ["foo", "bar", "[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # Out-of-range indices
    codeflash_output = tokenizer._convert_id_to_token(10)  # 2.29μs -> 2.78μs (17.6% slower)
    codeflash_output = tokenizer._convert_id_to_token(-1)  # 1.30μs -> 1.56μs (16.8% slower)

    os.remove(vocab_file)


def test_basic_custom_unk_token():
    vocab = ["foo", "bar", "<unk>"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="<unk>")

    codeflash_output = tokenizer._convert_id_to_token(100)  # 2.29μs -> 2.71μs (15.8% slower)
    codeflash_output = tokenizer._convert_id_to_token(-100)  # 1.39μs -> 1.61μs (13.8% slower)

    os.remove(vocab_file)


# 2. Edge Test Cases


def test_empty_vocab_file():
    # Empty vocab file
    vocab_file = create_vocab_file([])
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")
    # Any index should return unk_token
    codeflash_output = tokenizer._convert_id_to_token(0)  # 2.24μs -> 2.62μs (14.5% slower)
    codeflash_output = tokenizer._convert_id_to_token(-1)  # 1.29μs -> 1.46μs (11.8% slower)
    codeflash_output = tokenizer._convert_id_to_token(999)  # 1.06μs -> 1.28μs (17.0% slower)
    os.remove(vocab_file)


def test_vocab_with_duplicate_tokens():
    # Duplicates in vocab (should map to last occurrence)
    vocab = ["a", "b", "a", "[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # Index 0: "a", Index 1: "b", Index 2: "a", Index 3: "[UNK]"
    # ids_to_tokens: {0: "a", 1: "b", 2: "a", 3: "[UNK]"}
    codeflash_output = tokenizer._convert_id_to_token(0)  # 2.14μs -> 2.78μs (23.0% slower)
    codeflash_output = tokenizer._convert_id_to_token(2)  # 1.33μs -> 516ns (157% faster)
    codeflash_output = tokenizer._convert_id_to_token(3)  # 1.09μs -> 230ns (373% faster)
    codeflash_output = tokenizer._convert_id_to_token(4)  # 1.02μs -> 1.51μs (32.5% slower)

    os.remove(vocab_file)


def test_vocab_with_special_characters():
    vocab = ["[PAD]", "[UNK]", "[MASK]", "[SEP]", "!", "你好", "foo-bar", "a b", "a\tb", "a\nb"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # Test all indices
    for idx, token in enumerate(vocab):
        codeflash_output = tokenizer._convert_id_to_token(idx)  # 11.8μs -> 3.17μs (272% faster)

    # Out-of-range index
    codeflash_output = tokenizer._convert_id_to_token(100)  # 1.07μs -> 2.36μs (54.7% slower)

    os.remove(vocab_file)


def test_non_integer_index():
    vocab = ["foo", "bar", "[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # Should raise TypeError for non-integer index
    with pytest.raises(TypeError):
        tokenizer._convert_id_to_token("1")  # string

    with pytest.raises(TypeError):
        tokenizer._convert_id_to_token(1.5)  # float

    os.remove(vocab_file)


def test_negative_indices():
    vocab = ["foo", "bar", "[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # Negative indices not present in vocab, should return unk_token
    codeflash_output = tokenizer._convert_id_to_token(-1)  # 2.56μs -> 2.75μs (6.93% slower)
    codeflash_output = tokenizer._convert_id_to_token(-100)  # 1.30μs -> 1.35μs (3.49% slower)

    os.remove(vocab_file)


def test_large_index():
    vocab = ["foo", "bar", "[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # Very large index
    codeflash_output = tokenizer._convert_id_to_token(10**6)  # 2.26μs -> 2.76μs (18.0% slower)

    os.remove(vocab_file)


def test_vocab_with_whitespace_tokens():
    vocab = [" ", "  ", "\t", "\n", "[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    for idx, token in enumerate(vocab):
        codeflash_output = tokenizer._convert_id_to_token(idx)  # 6.86μs -> 4.21μs (62.8% faster)

    os.remove(vocab_file)


# 3. Large Scale Test Cases


def test_large_vocab_size():
    # Create a large vocab (1000 tokens)
    vocab = [f"token_{i}" for i in range(1000)] + ["[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # Test first, middle, last indices
    codeflash_output = tokenizer._convert_id_to_token(0)  # 2.57μs -> 1.12μs (129% faster)
    codeflash_output = tokenizer._convert_id_to_token(499)  # 1.37μs -> 464ns (196% faster)
    codeflash_output = tokenizer._convert_id_to_token(999)  # 1.05μs -> 239ns (338% faster)
    codeflash_output = tokenizer._convert_id_to_token(1000)  # 1.03μs -> 232ns (345% faster)
    codeflash_output = tokenizer._convert_id_to_token(1001)  # 1.06μs -> 2.42μs (56.1% slower)

    # Test all indices (spot check)
    for i in range(0, 1000, 100):
        codeflash_output = tokenizer._convert_id_to_token(i)  # 10.3μs -> 2.50μs (310% faster)

    os.remove(vocab_file)


def test_performance_large_vocab_many_calls():
    # Large vocab, test many calls
    vocab = [f"tok{i}" for i in range(1000)] + ["[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # Test 1000 valid calls
    for i in range(1000):
        codeflash_output = tokenizer._convert_id_to_token(i)  # 1.01ms -> 233μs (331% faster)

    # Test 100 out-of-range calls
    for i in range(1000, 1100):
        codeflash_output = tokenizer._convert_id_to_token(i)  # 98.2μs -> 109μs (10.6% slower)

    os.remove(vocab_file)


def test_large_vocab_edge_indices():
    # Large vocab, test edge indices
    vocab = [f"t{i}" for i in range(999)] + ["[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # First index
    codeflash_output = tokenizer._convert_id_to_token(0)  # 2.56μs -> 1.08μs (137% faster)
    # Last valid index
    codeflash_output = tokenizer._convert_id_to_token(998)  # 1.45μs -> 432ns (236% faster)
    # Out-of-range index
    codeflash_output = tokenizer._convert_id_to_token(999)  # 1.08μs -> 239ns (351% faster)

    os.remove(vocab_file)


# Additional: mutation testing traps


def test_mutation_trap_wrong_default():
    # If someone changes the default to return None instead of unk_token, this fails
    vocab = ["foo", "bar", "[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    codeflash_output = tokenizer._convert_id_to_token(100)
    result = codeflash_output  # 2.24μs -> 2.84μs (21.2% slower)

    os.remove(vocab_file)


def test_mutation_trap_wrong_mapping():
    # If ids_to_tokens is built incorrectly, this fails
    vocab = ["a", "b", "c", "[UNK]"]
    vocab_file = create_vocab_file(vocab)
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token="[UNK]")

    # ids_to_tokens should map index to token
    for idx, token in enumerate(vocab):
        codeflash_output = tokenizer._convert_id_to_token(idx)  # 5.88μs -> 1.84μs (219% faster)

    os.remove(vocab_file)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import collections
import os
import tempfile

# imports
from transformers.models.prophetnet.tokenization_prophetnet import ProphetNetTokenizer


# Minimal stub for PreTrainedTokenizer (since we don't test its internals here)
class PreTrainedTokenizer:
    def __init__(self, **kwargs):
        self.unk_token = kwargs.get("unk_token", "[UNK]")


# Function to load vocab from file
def load_vocab(vocab_file):
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        token = token.rstrip("\n")
        vocab[token] = index
    return vocab


# Helper function to create a temporary vocab file and return its path and vocab list
def create_temp_vocab_file(tokens):
    tf = tempfile.NamedTemporaryFile(delete=False, mode="w", encoding="utf-8")
    for tok in tokens:
        tf.write(tok + "\n")
    tf.close()
    return tf.name


# ---------------------- BASIC TEST CASES ----------------------


def test_basic_known_token():
    # Test that known token ids return the correct token
    tokens = ["hello", "world", "[PAD]", "[UNK]"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        for idx, tok in enumerate(tokens):
            # Each id should map to its token
            codeflash_output = tokenizer._convert_id_to_token(idx)
    finally:
        os.unlink(vocab_file)


def test_basic_unknown_token():
    # Test that unknown id returns unk_token
    tokens = ["a", "b", "[UNK]"]
    vocab_file = create_temp_vocab_file(tokens)
    unk_token = "[UNK]"
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token=unk_token)
    try:
        # id not in vocab
        unknown_id = 10
        codeflash_output = tokenizer._convert_id_to_token(unknown_id)
    finally:
        os.unlink(vocab_file)


def test_basic_custom_unk_token():
    # Test that custom unk_token is returned for unknown ids
    tokens = ["foo", "bar", "<unk>"]
    vocab_file = create_temp_vocab_file(tokens)
    unk_token = "<unk>"
    tokenizer = ProphetNetTokenizer(vocab_file, unk_token=unk_token)
    try:
        codeflash_output = tokenizer._convert_id_to_token(999)
    finally:
        os.unlink(vocab_file)


def test_basic_first_and_last_token():
    # Test that first and last token ids map correctly
    tokens = ["t0", "t1", "t2", "t3"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        codeflash_output = tokenizer._convert_id_to_token(0)
        codeflash_output = tokenizer._convert_id_to_token(3)
    finally:
        os.unlink(vocab_file)


# ---------------------- EDGE TEST CASES ----------------------


def test_negative_index_returns_unk():
    # Negative indices should return unk_token
    tokens = ["a", "b", "c", "[UNK]"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        codeflash_output = tokenizer._convert_id_to_token(-1)
        codeflash_output = tokenizer._convert_id_to_token(-100)
    finally:
        os.unlink(vocab_file)


def test_large_index_returns_unk():
    # Very large index should return unk_token
    tokens = ["x", "y", "z", "[UNK]"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        codeflash_output = tokenizer._convert_id_to_token(1_000_000)
    finally:
        os.unlink(vocab_file)


def test_non_integer_index():
    # Non-integer indices should not raise but return unk_token
    tokens = ["a", "b", "[UNK]"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        # float index
        codeflash_output = tokenizer._convert_id_to_token(1.0)
        # string index
        codeflash_output = tokenizer._convert_id_to_token("1")
        # None index
        codeflash_output = tokenizer._convert_id_to_token(None)
    finally:
        os.unlink(vocab_file)


def test_empty_vocab_file():
    # Empty vocab file: all ids should return unk_token
    tokens = []
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        codeflash_output = tokenizer._convert_id_to_token(0)
        codeflash_output = tokenizer._convert_id_to_token(10)
    finally:
        os.unlink(vocab_file)


def test_vocab_with_special_characters():
    # Vocab with special/unicode characters
    tokens = ["[PAD]", "你好", "😊", "token-with-dash", "tökén"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        for idx, tok in enumerate(tokens):
            codeflash_output = tokenizer._convert_id_to_token(idx)
    finally:
        os.unlink(vocab_file)


def test_vocab_with_duplicate_tokens():
    # Only the first occurrence should be mapped
    tokens = ["dup", "dup", "unique", "[UNK]"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        # The first "dup" is at id 0, the second at id 1, but only the first is mapped
        codeflash_output = tokenizer._convert_id_to_token(0)
        # id 1 maps to the second "dup", but since OrderedDict will overwrite, only the last survives
        # So, id 1 is not mapped, should return unk_token
        codeflash_output = tokenizer._convert_id_to_token(1)
    finally:
        os.unlink(vocab_file)


def test_vocab_with_blank_lines():
    # Blank lines in vocab should be treated as tokens with empty string
    tokens = ["tok1", "", "tok2"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        codeflash_output = tokenizer._convert_id_to_token(0)
        codeflash_output = tokenizer._convert_id_to_token(1)
        codeflash_output = tokenizer._convert_id_to_token(2)
    finally:
        os.unlink(vocab_file)


def test_vocab_with_whitespace_tokens():
    # Tokens that are whitespace should be handled as valid tokens
    tokens = [" ", "\t", "\n", "tok"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        codeflash_output = tokenizer._convert_id_to_token(0)
        codeflash_output = tokenizer._convert_id_to_token(1)
        # The '\n' will be stripped by rstrip("\n") in load_vocab, so it will be an empty string
        codeflash_output = tokenizer._convert_id_to_token(2)
        codeflash_output = tokenizer._convert_id_to_token(3)
    finally:
        os.unlink(vocab_file)


def test_vocab_with_only_unk_token():
    # Vocab file with only unk_token
    tokens = ["[UNK]"]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        codeflash_output = tokenizer._convert_id_to_token(0)
        codeflash_output = tokenizer._convert_id_to_token(1)
    finally:
        os.unlink(vocab_file)


# ---------------------- LARGE SCALE TEST CASES ----------------------


def test_large_vocab_correctness_and_speed():
    # Large vocab: ensure all ids map correctly, and out-of-range returns unk_token
    tokens = [f"tok{i}" for i in range(1000)]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        # Check a few random ids in range
        codeflash_output = tokenizer._convert_id_to_token(0)
        codeflash_output = tokenizer._convert_id_to_token(999)
        codeflash_output = tokenizer._convert_id_to_token(500)
        # Out of range
        codeflash_output = tokenizer._convert_id_to_token(1000)
        codeflash_output = tokenizer._convert_id_to_token(10_000)
    finally:
        os.unlink(vocab_file)


def test_large_vocab_all_ids():
    # Test all ids in a large vocab map correctly
    tokens = [f"t{i}" for i in range(1000)]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        for idx in range(1000):
            codeflash_output = tokenizer._convert_id_to_token(idx)
    finally:
        os.unlink(vocab_file)


def test_large_vocab_non_integer_inputs():
    # Non-integer indices in large vocab should always return unk_token
    tokens = [f"tok{i}" for i in range(1000)]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        for val in [None, "string", 1.5, -2.7, object()]:
            codeflash_output = tokenizer._convert_id_to_token(val)
    finally:
        os.unlink(vocab_file)


def test_large_vocab_negative_indices():
    # Negative indices in large vocab return unk_token
    tokens = [f"tok{i}" for i in range(1000)]
    vocab_file = create_temp_vocab_file(tokens)
    tokenizer = ProphetNetTokenizer(vocab_file)
    try:
        for idx in [-1, -100, -999]:
            codeflash_output = tokenizer._convert_id_to_token(idx)
    finally:
        os.unlink(vocab_file)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-ProphetNetTokenizer._convert_id_to_token-miskb7ul and push.

Codeflash Static Badge

The optimization replaces a dictionary lookup with array indexing for token ID to token conversion, achieving a **232% speedup**.

**Key optimization**: Added a pre-computed list `_ids_to_tokens_list` during initialization that maps token IDs directly to tokens using array indices, enabling O(1) lookups instead of dictionary operations.

**What changed**:
- **Initialization**: Creates a sparse array where `_ids_to_tokens_list[id] = token` for fast direct access
- **Lookup logic**: Added a fast path that checks if the index is a valid integer within array bounds and retrieves the token directly
- **Fallback preservation**: Maintains the original dictionary-based lookup for edge cases (non-integers, out-of-bounds, missing tokens)

**Why it's faster**:
- Array indexing (`list[index]`) is significantly faster than dictionary lookups (`dict.get(key)`) in Python
- The bounds check (`0 <= index < len(list)`) and type check (`isinstance(index, int)`) are very fast operations
- Most tokenization scenarios involve valid integer indices within vocabulary bounds, hitting the fast path

**Performance characteristics**:
- **Valid indices**: 146-373% faster (common case in tokenization workflows)
- **Invalid indices**: 3-56% slower due to additional checks, but this is rare in practice
- **Large vocabularies**: Shows even better improvements (up to 373% faster) due to reduced dictionary overhead

The optimization is particularly effective for transformer tokenizers where `_convert_id_to_token` is frequently called during text generation and processing, making the fast path for valid indices highly valuable.
@codeflash-ai codeflash-ai bot requested a review from mashraf-222 December 5, 2025 07:47
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Dec 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant