
Add pruning functionality to tokenizer creation to remove zero frequency tokens #36

Merged
chandanms merged 6 commits into simple-stories:dev from chandanms:tokenizer_fix
Aug 8, 2025

Conversation

@chandanms (Collaborator) commented Jul 29, 2025

Description

This PR implements tokenizer pruning functionality to address high cosine similarity issues between token embeddings. The solution removes unused tokens from the trained tokenizer vocabulary and reassigns sequential token IDs to create a more compact and efficient tokenizer.

  • Added prune_tokenizer() function that identifies and removes tokens with zero frequency in the dataset
  • Preserves the special tokens ([UNK], [EOS]) and their functionality, the original tokenizer settings (normalizers, pre-tokenizers, post-processors, decoders), and the serialized IDs
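The pruning step described above can be sketched in pure Python. This is a minimal illustration, not the actual prune_tokenizer() implementation; the function name prune_vocab and the toy vocabulary are hypothetical, and the vocab dict is assumed to have the shape returned by tokenizer.get_vocab():

```python
from collections import Counter

def prune_vocab(vocab, corpus_token_ids, special_tokens=("[UNK]", "[EOS]")):
    """Drop tokens that never occur in the corpus and reassign sequential IDs.

    vocab: token string -> token ID (shape of tokenizer.get_vocab()).
    corpus_token_ids: all token IDs produced by encoding the dataset.
    Special tokens are always kept, even when their corpus frequency is zero.
    """
    freq = Counter(corpus_token_ids)
    id_to_token = {tid: tok for tok, tid in vocab.items()}
    kept = [
        id_to_token[tid]
        for tid in sorted(id_to_token)
        if freq[tid] > 0 or id_to_token[tid] in special_tokens
    ]
    # Compact, sequential IDs, preserving the original ID order.
    return {tok: new_id for new_id, tok in enumerate(kept)}

# Toy example: "##zz" never occurs in the encoded corpus, so it is pruned,
# while the unused [UNK] special token is kept.
vocab = {"[UNK]": 0, "[EOS]": 1, "the": 2, "##zz": 3, "cat": 4}
pruned = prune_vocab(vocab, [2, 4, 2, 1])
print(pruned)  # {'[UNK]': 0, '[EOS]': 1, 'the': 2, 'cat': 3}
```

In the real tokenizer the remapped vocab would then be written back into the serialized tokenizer JSON so that normalizers, pre-tokenizers, post-processors and decoders survive unchanged.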

Other minor updates in this PR:

  • changing the default names to more up-to-date ones (e.g. lennart-finke/SimpleStories to SimpleStories/SimpleStories)
  • streamlining the tokenizer creation steps, e.g. removing the dataset split (which adds no value for tokenizer training) and making the script easy to run with different datasets

Related Issue

Fixes the token-embedding cosine-similarity issue, where the ~2% of vocabulary tokens that WordPieceTrainer produced with zero frequency in the dataset collapsed into the same embedding direction as [UNK], causing cosine similarities >0.9 and nearly identical embeddings.

Motivation and Context

The original issue was that the trained model's embedding layer contained many token pairs with cosine similarities >0.9. Investigation revealed that WordPieceTrainer was generating tokens with zero frequency in the actual dataset; during model training, these collapsed toward the same embedding as the [UNK] token, producing the high cosine similarities between token embeddings.

The pruning approach removes these problematic tokens and creates a cleaner, more efficient tokenizer that only contains tokens actually present in the training data.

How Has This Been Tested?

A test has been added in test_tokenizer.py that demonstrates how the WordPiece trainer creates artifact tokens when vocab_size is set too high, and how pruning removes these zero-frequency tokens.
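The kind of check such a test can make is easy to sketch. The helper below is hypothetical (not the actual contents of test_tokenizer.py): it scans an encoded dataset and reports which vocabulary entries were never used:

```python
from collections import Counter

def zero_frequency_tokens(vocab, encoded_dataset):
    """Return the vocab tokens whose IDs never appear in the encoded dataset.

    vocab: token string -> token ID.
    encoded_dataset: one list of token IDs per encoded document.
    """
    freq = Counter(tid for ids in encoded_dataset for tid in ids)
    return [tok for tok, tid in vocab.items() if freq[tid] == 0]

# With a deliberately oversized vocab, some entries are never used;
# these are the artifacts that pruning removes.
vocab = {"[UNK]": 0, "a": 1, "b": 2, "##qx": 3}
encoded = [[1, 2, 0], [2, 1]]
print(zero_frequency_tokens(vocab, encoded))  # ['##qx']
```

A test can then assert that this list is non-empty before pruning and empty afterwards.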

Does this PR introduce a breaking change?

No. The pruning step can be optionally disabled by not calling prune_tokenizer() in the main workflow.

@danbraunai danbraunai changed the base branch from main to dev July 30, 2025 08:52
@danbraunai (Collaborator) left a comment:

Nice, feel free to merge after addressing my comments.

I commented on this once, but I think all default arguments that aren't actually used in this codebase should be removed. I spotted a few of them.

I changed the base of this PR to a dev branch. I don't think we should change the main branch, at least not right now, as people will expect that to match the paper.

seed: 0
column_name: story
model_name: 35M
model_name: 1.25M
Collaborator:

Not great having the file be called 35M but the model name 1.25M.

Collaborator Author:

I think it's better to change the filename to train_config.yaml instead.

Collaborator:

Yep that sounds fine. Do you want to do that and then merge?

@danbraunai (Collaborator):

@chandanms changes look great! Still one comment about the 35M_config.yaml naming. Feel free to merge after handling that.

@chandanms chandanms closed this Aug 6, 2025
@chandanms chandanms reopened this Aug 6, 2025
@chandanms (Collaborator Author) commented Aug 6, 2025

@danbraunai Made the changes to the file. Btw, not sure if I am missing something, but I don't think I have merge rights for the PR!

@chandanms chandanms merged commit bb68b31 into simple-stories:dev Aug 8, 2025
2 checks passed
@chandanms chandanms deleted the tokenizer_fix branch August 8, 2025 21:39