Tokenizer test cases and reformatting of tokenizer training file by chandanms · Pull Request #41 · simple-stories/simple_stories_train

chandanms · 2025-08-14T21:41:04Z

Description

Added test cases for the tokenizer to verify its functionality both after training and after pruning. Reformatted the tests to use the create_tokenizer function directly so that they align more closely with the creation script.

Removed COMMON_PREFIXES and COMMON_SUFFIXES from initial_alphabet: empirical testing (training with and without them) showed that the WordPiece trainer naturally discovers these frequent morphemes during training, so explicit seeding was redundant and added unnecessary complexity. We should have validated this assumption earlier rather than adding potentially redundant configuration.

Related Issue

Sanity check on using special tokens correctly

Motivation and Context

Training and pruning the tokenizer could introduce subtle errors that would be deeply problematic if left undiscovered. Hence the thorough checking of edge cases and of special-token usage.

How Has This Been Tested?

Added tests for the tokenizer's behaviour and restructured them to exercise it both after training and after pruning.

Does this PR introduce a breaking change?

No

chandanms · 2025-08-14T21:51:01Z

Check only the last commit please. I had already started coding and didn't see the previous merge for #40.

danbraunai

Nice. Minor comments below. I love the removal of COMMON_PREFIXES and COMMON_SUFFIXES. Best to also note this change in the README, because it differs from the paper (bottom of page 5 https://arxiv.org/pdf/2504.09184).

Can merge after addressing those.

danbraunai · 2025-08-15T10:37:09Z

tests/test_tokenizer.py

+    """Create a fresh tokenizer for testing."""
+    train_data = ["hello world", "hello there", "world peace", "simple stories"]
+    train_iter = create_test_data_iterator(train_data)
+    return train_tokenizer(train_iter, vocab_size=200)


You should be able to just do `return train_tokenizer(iter(train_data), vocab_size=200)` and remove the create_test_data_iterator function.
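
A minimal sketch of why the helper is probably redundant, assuming create_test_data_iterator is a thin generator wrapper over the list. Its body is not shown in this PR, so the version below is a hypothetical reconstruction. Under that assumption, the built-in `iter()` yields the same strings in the same order:

```python
from collections.abc import Iterator


def create_test_data_iterator(data: list[str]) -> Iterator[str]:
    """Hypothetical stand-in for the test helper; the real body is not shown in the PR."""
    yield from data


train_data = ["hello world", "hello there", "world peace", "simple stories"]

# Both calls produce a one-shot iterator over the same strings in the same order.
from_helper = list(create_test_data_iterator(train_data))
from_builtin = list(iter(train_data))
same_items = from_helper == from_builtin
```

If the real helper does anything beyond yielding items (batching, filtering), the two would no longer be interchangeable.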

danbraunai · 2025-08-15T10:38:48Z

tests/test_tokenizer.py

+    pruned = prune_tokenizer(data_iter, tokenizer)
+    vocab_pruned = pruned.get_vocab()
+    assert "[UNK]" in vocab_pruned and "[EOS]" in vocab_pruned
+    assert vocab_pruned["[UNK]"] in [0, 1] and vocab_pruned["[EOS]"] in [0, 1]


Maybe also assert that they're not both 0 or both 1 (i.e. that they're not equal).
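
One way to sketch the stricter check, with a plain dict standing in for the result of `get_vocab()`. The helper name `check_special_token_ids` is hypothetical and not part of the PR. Requiring the two IDs to form exactly the set {0, 1} covers both membership and inequality in a single assertion:

```python
def check_special_token_ids(vocab: dict[str, int]) -> None:
    """Assert that [UNK] and [EOS] occupy IDs 0 and 1, in either order."""
    assert "[UNK]" in vocab and "[EOS]" in vocab
    unk_id, eos_id = vocab["[UNK]"], vocab["[EOS]"]
    # Set equality fails if either ID is outside {0, 1} or if both IDs are equal.
    assert {unk_id, eos_id} == {0, 1}, f"unexpected special-token IDs: {unk_id}, {eos_id}"


# Passes: the IDs are distinct and both lie in {0, 1}.
check_special_token_ids({"[UNK]": 0, "[EOS]": 1, "hello": 2})
```

A vocab where both tokens map to 0 would pass the original `in [0, 1]` checks but fail this one.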

danbraunai-goodfire and others added 12 commits August 13, 2025 11:38

Reset train dataloader when depleted

29ebe0f

Fix pyright errors

1a5440a

Cast instead of isinstance

d0fa3fd

Update pinned torch version

7407ff3

Factor out gpt2 and make general train.py

5b70703

Prefix wandb run name with model_id

4042ab4

Merge branch 'dev' into refactor/modular

c295c0d

Create gpt2 hf converters

b1c6b50

Create push_to_hf

ed62db8

Upload tokenizer to hf too

77ecfe7

Refactor gpt conversions

334fa07

Fixed linter issues

49cd09a

chandanms changed the title ~~Tokenizer test cases and reformatting of tokenizer file~~ Tokenizer test cases and reformatting of tokenizer training file Aug 14, 2025

chandanms requested a review from danbraunai August 15, 2025 08:03

danbraunai approved these changes Aug 15, 2025

chandanms added 2 commits August 15, 2025 19:43

Made the tests more strict for verifying existence of EOS and UNK tokens; Made the data to tokenizer training iterable.

04b6558

Updated the readme file

6cb28de

chandanms merged commit da7997a into simple-stories:dev Aug 15, 2025
1 check passed

chandanms merged 14 commits into simple-stories:dev from chandanms:feature/tokenizer_test_cases
