Summary
This issue proposes adding encode(), decode(), and load() to BaseTokenizer as abstract methods and implementing them fully in BPETokenizer. This would complete the BPE tokenizer contract and ensure that all future tokenizer implementations support these core operations.
Background
PR #17 introduced BaseTokenizer as an abstract interface and BPETokenizer as its first implementation.
The current BaseTokenizer defines:
- train() ✅ defined
- get_vocab_path() ✅ defined
- get_merges_path() ✅ defined
- encode() ❌ not defined
- decode() ❌ not defined
- load() ❌ not defined
Since encode(), decode(), and load() are not part of the abstract interface, subclasses are under no obligation to implement them. This breaks the tokenizer contract and leaves BPETokenizer only half-complete.
Problems in Current Implementation
1. BaseTokenizer does not enforce encode/decode/load
2. BPETokenizer cannot encode or decode
3. No way to reload BPETokenizer from disk
4. No input validation in BPETokenizer.train()
Proposed Changes
openverifiablellm/tokenizer/base.py
Add three abstract methods:
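A minimal sketch of what the extended contract could look like. The method signatures and docstrings here are assumptions for illustration, not the project's actual API:

```python
# Hypothetical sketch of the proposed BaseTokenizer contract.
# Signatures are illustrative assumptions, not the project's real API.
from abc import ABC, abstractmethod


class BaseTokenizer(ABC):
    @abstractmethod
    def train(self, texts):
        """Already part of the interface (introduced in PR #17)."""

    @abstractmethod
    def encode(self, text: str) -> list[int]:
        """Map a string to a list of token ids."""

    @abstractmethod
    def decode(self, ids: list[int]) -> str:
        """Map a list of token ids back to a string."""

    @abstractmethod
    def load(self, path: str) -> None:
        """Restore vocab and merges from disk."""
```

With all three declared abstract, any subclass that omits one fails at instantiation time rather than at call time.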
openverifiablellm/tokenizer/bpe_tokenizer.py
Implement encode():
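One possible shape for encode(), assuming the tokenizer holds a `vocab` dict (token → id) and an ordered `merges` list learned during training; both names are assumptions, not the project's actual attributes:

```python
# Illustrative greedy BPE encode; `vocab` and `merges` are assumed names.
def bpe_encode(text, vocab, merges):
    # Start from individual characters, then apply merges in learned order.
    tokens = list(text)
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # collapse the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return [vocab[t] for t in tokens]
```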
Implement decode():
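decode() is the inverse lookup; a sketch under the same assumed `vocab` shape:

```python
# Illustrative decode; inverts the assumed token -> id vocab and joins.
def bpe_decode(ids, vocab):
    id_to_token = {i: t for t, i in vocab.items()}
    return "".join(id_to_token[i] for i in ids)
```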
Implement load():
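load() would read back the artifacts that train() writes. The file formats here (JSON vocab, one space-separated merge pair per line) are assumptions; the real PR should match whatever get_vocab_path() and get_merges_path() actually produce:

```python
# Illustrative load; file formats are assumptions, not the project's real ones.
import json


def load_tokenizer(vocab_path, merges_path):
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)  # token -> id mapping
    merges = []
    with open(merges_path, encoding="utf-8") as f:
        for line in f:
            a, b = line.rstrip("\n").split(" ")
            merges.append((a, b))  # preserve training order
    return vocab, merges
```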
Harden train():
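For the input-validation hardening, one possible set of guards; the real train() signature may differ, so treat these parameter names as placeholders:

```python
# Illustrative guards for train(); parameter names are assumptions.
def validate_train_inputs(texts, vocab_size):
    if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
        raise TypeError("texts must be a list of strings")
    if not texts or all(t == "" for t in texts):
        raise ValueError("training corpus is empty")
    if not isinstance(vocab_size, int) or vocab_size <= 0:
        raise ValueError("vocab_size must be a positive integer")
```

Failing fast here gives a clear error instead of a confusing crash deep inside the merge loop.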
tests/test_bpe.py ← new file
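A sketch of the kind of round-trip test the new file could contain. The MiniBPE helper below is a self-contained stand-in so the example runs on its own; the real test would exercise BPETokenizer directly:

```python
# Sketch of a tests/test_bpe.py round-trip test. MiniBPE is an
# illustrative stand-in for the real BPETokenizer, not the project's class.
class MiniBPE:
    def __init__(self, vocab, merges):
        self.vocab, self.merges = vocab, merges
        self.inv = {i: t for t, i in vocab.items()}

    def encode(self, text):
        tokens = list(text)
        for a, b in self.merges:
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            tokens = out
        return [self.vocab[t] for t in tokens]

    def decode(self, ids):
        return "".join(self.inv[i] for i in ids)


def test_encode_decode_round_trip():
    vocab = {"a": 0, "b": 1, "ab": 2}
    tok = MiniBPE(vocab, [("a", "b")])
    assert tok.decode(tok.encode("abba")) == "abba"
```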
With this PR:
- BaseTokenizer enforces full workflow
- BPETokenizer is fully usable
- encode → decode works
- Pipeline can progress to model training
Scope
This is a change touching only:
- base.py
- bpe_tokenizer.py
- tests/test_bpe.py
No overlap with any existing open PRs.
No dependency on any other issue.
Related
- PR #17 — feat: deterministic tokenizer training and config hashing
Code of Conduct
- I have joined the Discord server and will post updates there
- I have searched existing issues to avoid duplicates