Description
Summary
hash_tokenizer_config() in openverifiablellm/tokenizer.py is hardcoded to look for BPE artifacts (vocab.json and merges.txt). Calling it after training a SentencePiece tokenizer immediately crashes with FileNotFoundError, because SentencePiece produces spm.model and spm.vocab instead.
The function claims to hash the tokenizer configuration, but it silently works for only one of the two supported tokenizer types.
Root Cause
In openverifiablellm/tokenizer.py, hash_tokenizer_config() is hardcoded for BPE artifacts only.
SentencePiece training produces completely different artifacts:
- spm.model ← binary model file
- spm.vocab ← human-readable vocab
Neither vocab.json nor merges.txt is ever produced by SentencePiece, so the function always crashes when called after SentencePiece training.
Proposed Fix
openverifiablellm/tokenizer.py
Detect tokenizer type from artifacts present on disk
tests/test_tokenizer.py
Add missing tests
Files to Change
- openverifiablellm/tokenizer.py — fix hash_tokenizer_config()
- tests/test_tokenizer.py — add missing test cases
Related
- PR #17 (feat: deterministic tokenizer training and config hashing), which introduced hash_tokenizer_config() with BPE-only support
- CodeRabbit review on PR #17 flagged the missing test_hash_changes_when_merges_change test
- Follows from: [FEATURE] Complete SentencePiece tokenizer implementation
- Follows from: [FEATURE] Complete BPETokenizer and BaseTokenizer contract
Impact
Critical - Application is unusable
Code of Conduct
- I have joined the Discord server and will post updates there
- I have searched existing issues to avoid duplicates