Add pruning functionality to tokenizer creation to remove zero frequency tokens #36
Conversation
danbraunai
left a comment
Nice, feel free to merge after addressing my comments.
I commented on this once, but I think all default arguments that aren't actually used in this codebase should be removed. I spotted a few of them.
I changed the base of this PR to a dev branch. I don't think we should change the main branch, at least not right now, as people will expect that to match the paper.
```diff
  seed: 0
  column_name: story
- model_name: 35M
+ model_name: 1.25M
```
Not great having the file be called 35M but the model name 1.25M.
I think it's better to change the filename to train_config.yaml instead.
Yep that sounds fine. Do you want to do that and then merge?
…omments for train and prune tokenizer
@chandanms The changes look great! Still one comment about the 35M_config.yaml naming. Feel free to merge after handling that.
…in_config.yaml; deleted extra config.
@danbraunai Made the changes to the file. Btw, not sure if I am missing something, but I don't think I have rights to merge the PR!
Description
This PR implements tokenizer pruning functionality to address high cosine similarity issues between token embeddings. The solution removes unused tokens from the trained tokenizer vocabulary and reassigns sequential token IDs to create a more compact and efficient tokenizer.
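The two steps described above (drop unused tokens, then reassign sequential IDs) can be sketched roughly as follows. This is a minimal illustration in plain Python, not the PR's actual `prune_tokenizer()` implementation; the function and variable names are hypothetical:

```python
from collections import Counter

def prune_vocab(vocab, token_counts, special_tokens=("[UNK]", "[PAD]")):
    """Drop tokens that never occur in the tokenized corpus, then
    reassign contiguous IDs so the embedding table stays compact."""
    # Keep original ID order so retained tokens stay in a stable order.
    kept = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
            if token_counts[tok] > 0 or tok in special_tokens]
    return {tok: new_id for new_id, tok in enumerate(kept)}

# Hypothetical toy vocabulary: "zz" was produced by the trainer but
# never appears in the corpus, so it gets pruned.
vocab = {"[UNK]": 0, "the": 1, "cat": 2, "zz": 3}
counts = Counter({"the": 10, "cat": 4})
pruned = prune_vocab(vocab, counts)
print(pruned)  # {'[UNK]': 0, 'the': 1, 'cat': 2}
```

Special tokens are kept even at zero frequency, since the tokenizer still needs them at inference time.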
Other minor updates in the PR:
- Changed the dataset reference from `lennart-finke/SimpleStories` to `SimpleStories/SimpleStories`

Related Issue
Fixes the token embedding cosine similarity issue where ~2% of zero-frequency tokens from WordPieceTrainer were collapsing into the same embedding direction as [UNK], causing cosine similarities >0.9 and nearly identical embeddings.
Motivation and Context
The original issue was that the trained model's embedding layer showed cosine similarities >0.9 between token embeddings. Investigation revealed that WordPieceTrainer was generating tokens with zero frequency in the actual dataset; during model training these collapsed toward the same embedding direction as the [UNK] token, producing the high cosine similarities.
The pruning approach removes these problematic tokens and creates a cleaner, more efficient tokenizer that only contains tokens actually present in the training data.
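For intuition, the symptom being fixed can be illustrated with a toy cosine-similarity check. The embedding vectors here are hypothetical, not taken from the trained model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: a zero-frequency token's vector drifts toward
# the [UNK] direction during training, while a live token's does not.
unk = [0.5, 0.1, -0.3]
dead_token = [0.49, 0.11, -0.29]   # nearly identical to [UNK]
live_token = [-0.2, 0.7, 0.4]

print(round(cosine(unk, dead_token), 3))  # > 0.99, the reported symptom
print(round(cosine(unk, live_token), 3))  # well below that
```

Pruning removes the dead tokens from the vocabulary up front, so no embedding rows exist to collapse in the first place.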
How Has This Been Tested?
Test code has been added in `test_tokenizer.py`, which demonstrates the WordPiece tokenizer's artifact of creating zero-frequency tokens when `vocab_size` is too high, and shows that pruning removes them.

Does this PR introduce a breaking change?
No. The pruning step is optional and can be disabled by not calling prune_tokenizer() in the main workflow.