
Tokenizer test cases and reformatting of tokenizer training file#41

Merged
chandanms merged 14 commits into simple-stories:dev from chandanms:feature/tokenizer_test_cases
Aug 15, 2025

Conversation

@chandanms
Collaborator

Description

Added test cases for the tokenizer to verify its functionality both after training and after pruning. Reformatted the tests to call the create_tokenizer function directly, aligning them with the creation script.

Removed COMMON_PREFIXES and COMMON_SUFFIXES from initial_alphabet: empirical testing (training with and without them) showed that the WordPiece trainer naturally discovers these frequent morphemes during training, making explicit seeding redundant and an unnecessary source of complexity. We should have validated this assumption earlier rather than adding potentially redundant configuration.
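For intuition on why the seeding was redundant, here is a minimal, self-contained sketch of subword discovery. It uses BPE-style frequency merges as a stand-in for WordPiece's likelihood-based scoring, and the toy corpus and all names are illustrative, not from the repository:

```python
from collections import Counter

# Toy corpus chosen so suffix morphemes repeat; the real training data
# is the SimpleStories corpus.
corpus = ["walking", "talking", "walked", "talked", "walks", "talks"]

# WordPiece-style symbolization: non-initial characters carry a "##"
# continuation prefix.
words = [[w[0]] + ["##" + c for c in w[1:]] for w in corpus]

def merge_step(words):
    """Greedily merge the most frequent adjacent symbol pair.

    BPE-style frequency merging, standing in for WordPiece's
    likelihood scoring; the point is only that frequent suffixes
    coalesce without any explicit alphabet seeding.
    """
    pairs = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    if not pairs:
        return words
    (a, b), _ = pairs.most_common(1)[0]
    new_sym = a + b.removeprefix("##")  # drop b's continuation marker
    out = []
    for w in words:
        merged, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                merged.append(new_sym)
                i += 2
            else:
                merged.append(w[i])
                i += 1
        out.append(merged)
    return out

for _ in range(8):
    words = merge_step(words)

vocab = {s for w in words for s in w}
print(sorted(vocab))  # suffix units like "##ing" and "##ed" emerge unseeded
```

After a handful of merges the frequent suffixes appear as continuation symbols on their own, which is the behaviour the empirical test relied on.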

Related Issue

Sanity check on using special tokens correctly

Motivation and Context

Training and pruning the tokenizer could introduce subtle bugs that would be deeply problematic if left undiscovered; hence the thorough checking of edge cases and of special-token usage.

How Has This Been Tested?

Added tests to check the behaviour, and reformatted the existing tests so that functionality is verified both after training and after pruning.

Does this PR introduce a breaking change?

No

@chandanms chandanms changed the title Tokenizer test cases and reformatting of tokenizer file Tokenizer test cases and reformatting of tokenizer training file Aug 14, 2025
@chandanms
Collaborator Author

Please check only the last commit. I had already started coding and didn't see the previous merge for #40.

@chandanms chandanms requested a review from danbraunai August 15, 2025 08:03
Collaborator

@danbraunai danbraunai left a comment


Nice. Minor comments below. I love the removal of COMMON_PREFIXES and COMMON_SUFFIXES. Best also to note this change in the README, because it differs from the paper (bottom of page 5, https://arxiv.org/pdf/2504.09184).

Can merge after addressing those.

"""Create a fresh tokenizer for testing."""
train_data = ["hello world", "hello there", "world peace", "simple stories"]
train_iter = create_test_data_iterator(train_data)
return train_tokenizer(train_iter, vocab_size=200)
Collaborator


You should be able to just do return train_tokenizer(iter(train_data), vocab_size=200) and remove the create_test_data_iterator function.
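The suggestion boils down to the fact that the built-in iter() already turns a plain list into an iterator, so no helper is needed. A stand-in sketch (train_tokenizer_stub below is hypothetical and only mimics how the repository's train_tokenizer would consume the iterator):

```python
def train_tokenizer_stub(text_iter, vocab_size):
    # Hypothetical stand-in for the repository's train_tokenizer:
    # it only demonstrates consuming the iterator argument.
    return {"seen": list(text_iter), "vocab_size": vocab_size}

train_data = ["hello world", "hello there", "world peace", "simple stories"]

# No create_test_data_iterator helper needed: iter() on the list suffices.
result = train_tokenizer_stub(iter(train_data), vocab_size=200)
print(result["seen"])
```

Any function that iterates its input (e.g. via a for loop or train_from_iterator) accepts this equally well.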

pruned = prune_tokenizer(data_iter, tokenizer)
vocab_pruned = pruned.get_vocab()
assert "[UNK]" in vocab_pruned and "[EOS]" in vocab_pruned
assert vocab_pruned["[UNK]"] in [0, 1] and vocab_pruned["[EOS]"] in [0, 1]
Collaborator


Maybe also assert that they don't share an id (i.e., that they're not equal).
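The strengthened check could look like this (sketch with a stand-in vocab dict; in the actual test, vocab_pruned comes from prune_tokenizer(...).get_vocab()):

```python
# Stand-in for the pruned tokenizer's vocab; illustrative values only.
vocab_pruned = {"[UNK]": 0, "[EOS]": 1, "hello": 2}

assert "[UNK]" in vocab_pruned and "[EOS]" in vocab_pruned
assert vocab_pruned["[UNK]"] in (0, 1) and vocab_pruned["[EOS]"] in (0, 1)
# Strengthened: the two special tokens must occupy distinct ids,
# ruling out the case where both map to 0 or both map to 1.
assert vocab_pruned["[UNK]"] != vocab_pruned["[EOS]"]
```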

@chandanms chandanms merged commit da7997a into simple-stories:dev Aug 15, 2025
1 check passed
