
Add pruning functionality to tokenizer creation to remove zero frequency tokens #36

Merged
chandanms merged 6 commits into simple-stories:dev from chandanms:tokenizer_fix
Aug 8, 2025

Conversation

@chandanms (Collaborator) commented Jul 29, 2025

Description

This PR implements tokenizer pruning functionality to address high cosine similarity issues between token embeddings. The solution removes unused tokens from the trained tokenizer vocabulary and reassigns sequential token IDs to create a more compact and efficient tokenizer.

  • Added prune_tokenizer() function that identifies and removes tokens with zero frequency in the dataset
  • Preserves the special tokens ([UNK], [EOS]) and their functionality, the original tokenizer settings (normalizers, pre-tokenizers, post-processors, decoders), and the serialized IDs
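The pruning step described above can be sketched in pure Python. This is a minimal illustration, not the actual prune_tokenizer() implementation; the function name prune_vocab and the toy vocabulary are hypothetical, and the vocab dict is assumed to have the shape returned by tokenizer.get_vocab():

```python
from collections import Counter

def prune_vocab(vocab, corpus_token_ids, special_tokens=("[UNK]", "[EOS]")):
    """Drop tokens that never occur in the corpus and reassign sequential IDs.

    vocab: token string -> token ID (shape of tokenizer.get_vocab()).
    corpus_token_ids: all token IDs produced by encoding the dataset.
    Special tokens are always kept, even when their corpus frequency is zero.
    """
    freq = Counter(corpus_token_ids)
    id_to_token = {tid: tok for tok, tid in vocab.items()}
    kept = [
        id_to_token[tid]
        for tid in sorted(id_to_token)
        if freq[tid] > 0 or id_to_token[tid] in special_tokens
    ]
    # Compact, sequential IDs, preserving the original ID order.
    return {tok: new_id for new_id, tok in enumerate(kept)}

# Toy example: "##zz" never occurs in the encoded corpus, so it is pruned,
# while the unused [UNK] special token is kept.
vocab = {"[UNK]": 0, "[EOS]": 1, "the": 2, "##zz": 3, "cat": 4}
pruned = prune_vocab(vocab, [2, 4, 2, 1])
print(pruned)  # {'[UNK]': 0, '[EOS]': 1, 'the': 2, 'cat': 3}
```

In the real tokenizer the remapped vocab would then be written back into the serialized tokenizer JSON so that normalizers, pre-tokenizers, post-processors and decoders survive unchanged.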

Other minor updates in this PR:

  • changing the default names to more up-to-date ones (e.g. lennart-finke/SimpleStories to SimpleStories/SimpleStories)
  • streamlining the tokenizer creation steps, e.g. removing the dataset split (which adds no value for tokenizer training) and making the script easy to run with different datasets

Related Issue

Fixes the token-embedding cosine-similarity issue, where the ~2% of vocabulary tokens that WordPieceTrainer produced with zero frequency in the dataset collapsed into the same embedding direction as [UNK], causing cosine similarities >0.9 and nearly identical embeddings.

Motivation and Context

The original issue was that the trained model's embedding layer contained many token pairs with cosine similarities >0.9. Investigation revealed that WordPieceTrainer was generating tokens with zero frequency in the actual dataset; during model training, these collapsed toward the same embedding as the [UNK] token, producing the high cosine similarities between token embeddings.

The pruning approach removes these problematic tokens and creates a cleaner, more efficient tokenizer that only contains tokens actually present in the training data.

How Has This Been Tested?

A test has been added in test_tokenizer.py that demonstrates how the WordPiece trainer creates artifact tokens when vocab_size is set too high, and how pruning removes these zero-frequency tokens.
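The kind of check such a test can make is easy to sketch. The helper below is hypothetical (not the actual contents of test_tokenizer.py): it scans an encoded dataset and reports which vocabulary entries were never used:

```python
from collections import Counter

def zero_frequency_tokens(vocab, encoded_dataset):
    """Return the vocab tokens whose IDs never appear in the encoded dataset.

    vocab: token string -> token ID.
    encoded_dataset: one list of token IDs per encoded document.
    """
    freq = Counter(tid for ids in encoded_dataset for tid in ids)
    return [tok for tok, tid in vocab.items() if freq[tid] == 0]

# With a deliberately oversized vocab, some entries are never used;
# these are the artifacts that pruning removes.
vocab = {"[UNK]": 0, "a": 1, "b": 2, "##qx": 3}
encoded = [[1, 2, 0], [2, 1]]
print(zero_frequency_tokens(vocab, encoded))  # ['##qx']
```

A test can then assert that this list is non-empty before pruning and empty afterwards.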

Does this PR introduce a breaking change?

No. The pruning step can be optionally disabled by not calling prune_tokenizer() in the main workflow.

@danbraunai danbraunai changed the base branch from main to dev July 30, 2025 08:52
@danbraunai (Collaborator) left a comment:

Nice, feel free to merge after addressing my comments.

I commented on this once, but I think all default arguments that aren't actually used in this codebase should be removed. I spotted a few of them.

I changed the base of this PR to a dev branch. I don't think we should change the main branch, at least not right now, as people will expect that to match the paper.

seed: 0
column_name: story
model_name: 35M
model_name: 1.25M
Collaborator:

Not great having the file be called 35M but the model name 1.25M.

Collaborator Author:

I think it's better to change the filename to train_config.yaml instead.

Collaborator:

Yep that sounds fine. Do you want to do that and then merge?

@danbraunai (Collaborator):

@chandanms changes look great! Still one comment about the 35M_config.yaml naming. Feel free to merge after handling that.

@chandanms chandanms closed this Aug 6, 2025
@chandanms chandanms reopened this Aug 6, 2025
@chandanms (Collaborator Author) commented Aug 6, 2025

@danbraunai Made the changes to the file. Btw, not sure if I am missing something, but I don't think I have merge rights for the PR!

@chandanms chandanms merged commit bb68b31 into simple-stories:dev Aug 8, 2025
2 checks passed
@chandanms chandanms deleted the tokenizer_fix branch August 8, 2025 21:39