[FEATURE]: Complete BPETokenizer and BaseTokenizer: add encode, decode and load to BaseTokenizer and BPETokenizer #52

@Arpitsh7

Description

Summary

This issue proposes adding encode(), decode(), and load() to BaseTokenizer as abstract methods and implementing them in BPETokenizer. This completes the BPE tokenizer contract and requires every future tokenizer implementation to support these core operations.

Background

PR #17 introduced BaseTokenizer as an abstract interface and BPETokenizer as its first implementation.

The current BaseTokenizer defines:

train()           ✅ defined
get_vocab_path()  ✅ defined
get_merges_path() ✅ defined

encode()          ❌ not defined
decode()          ❌ not defined
load()            ❌ not defined

Because encode(), decode() and load() are not part of the abstract interface, subclasses are under no obligation to implement them. This breaks the tokenizer contract and leaves BPETokenizer only half complete.


Problems in Current Implementation

1. BaseTokenizer does not enforce encode/decode/load

2. BPETokenizer cannot encode or decode

3. No way to reload BPETokenizer from disk

4. No input validation in BPETokenizer.train()


Proposed Changes

openverifiablellm/tokenizer/base.py

Add three abstract methods:
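
A minimal sketch of what the extended interface could look like. The parameter names and return types here are assumptions for illustration, not the signatures from PR #17, and the existing get_vocab_path() and get_merges_path() methods are left out for brevity.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List


class BaseTokenizer(ABC):
    """Illustrative interface only; the real signatures may differ."""

    @abstractmethod
    def train(self, text_file: Path, save_path: Path) -> None:
        """Train on a text file and write artifacts under save_path."""

    @abstractmethod
    def encode(self, text: str) -> List[int]:
        """Convert text into a list of token ids."""

    @abstractmethod
    def decode(self, ids: List[int]) -> str:
        """Convert token ids back into text."""

    @abstractmethod
    def load(self, tokenizer_dir: Path) -> None:
        """Restore a trained tokenizer from files on disk."""
```

With the methods marked abstract, Python refuses to instantiate any subclass that leaves one of them unimplemented (it raises TypeError), which is the enforcement this issue is asking for.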

openverifiablellm/tokenizer/bpe_tokenizer.py

Implement encode():
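
The issue does not show the encode() body, so the following is only a stdlib toy that illustrates what BPE encoding has to do: split the text into characters, repeatedly apply the highest-priority learned merge, then map tokens to ids. The function name bpe_encode and the in-memory vocab and merges arguments are hypothetical; a real BPETokenizer may delegate this work to a library.

```python
from typing import Dict, List, Tuple


def bpe_encode(
    text: str,
    vocab: Dict[str, int],
    merges: List[Tuple[str, str]],
) -> List[int]:
    """Toy character-level BPE encoder (hypothetical helper)."""
    if not isinstance(text, str):
        raise TypeError("text must be a str")

    # Earlier merges have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(text)

    while len(tokens) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        candidates = [
            (ranks.get(pair, float("inf")), index)
            for index, pair in enumerate(zip(tokens, tokens[1:]))
        ]
        best_rank, index = min(candidates)
        if best_rank == float("inf"):
            break  # no learned merge applies any more
        tokens[index:index + 2] = [tokens[index] + tokens[index + 1]]

    # A token missing from vocab raises KeyError in this sketch.
    return [vocab[token] for token in tokens]
```

For example, with merges [("l", "o"), ("lo", "w")] the text "low" collapses to the single token "low", while "lol" stops at the tokens "lo" and "l".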

Implement decode():
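
As a rough illustration of the decode() side, assuming the same token-to-id vocab mapping as a plain dict: invert the mapping and concatenate the token strings. The name bpe_decode is hypothetical, and a byte-level tokenizer would need an extra byte-to-text step that this sketch omits.

```python
from typing import Dict, Iterable


def bpe_decode(ids: Iterable[int], vocab: Dict[str, int]) -> str:
    """Toy decoder: map ids back to tokens and join them (hypothetical helper)."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    pieces = []
    for token_id in ids:
        if token_id not in id_to_token:
            raise ValueError(f"unknown token id: {token_id!r}")
        pieces.append(id_to_token[token_id])
    return "".join(pieces)
```

Rejecting unknown ids with an explicit error, instead of silently skipping them, keeps a failed round trip visible to the caller.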

Implement load():
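
One possible shape for load(), sketched under the assumption that training writes a vocab.json (token to id) and a merges.txt (one space-separated pair per line, optionally with a "#version" header) into a single directory. Those file names and formats are guesses suggested by get_vocab_path() and get_merges_path(); the issue does not confirm them, and load_bpe_files is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


def load_bpe_files(tokenizer_dir) -> Tuple[Dict[str, int], List[Tuple[str, str]]]:
    """Read vocab and merges from disk (assumed file names and formats)."""
    tokenizer_dir = Path(tokenizer_dir)
    vocab_path = tokenizer_dir / "vocab.json"    # assumed name
    merges_path = tokenizer_dir / "merges.txt"   # assumed name

    # Fail early with a clear message if either artifact is missing.
    for path in (vocab_path, merges_path):
        if not path.is_file():
            raise FileNotFoundError(f"missing tokenizer file: {path}")

    vocab = json.loads(vocab_path.read_text(encoding="utf-8"))

    merges = []
    for line in merges_path.read_text(encoding="utf-8").splitlines():
        if not line or line.startswith("#version"):
            continue  # skip blank lines and the optional header
        left, right = line.split(" ")
        merges.append((left, right))
    return vocab, merges
```

Checking for both files before parsing means a half-written tokenizer directory is reported as missing files, not as a confusing parse error.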

Harden train():
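
The issue does not list which checks train() should gain, so this is only a sketch of the kind of validation that could run before training starts. The helper name validate_train_inputs and the vocab_size and min_frequency parameters are assumptions, not the actual train() signature.

```python
from pathlib import Path


def validate_train_inputs(text_file, vocab_size, min_frequency) -> Path:
    """Raise early on bad arguments; return the corpus path as a Path."""
    path = Path(text_file)
    if not path.is_file():
        raise FileNotFoundError(f"training file not found: {path}")
    if path.stat().st_size == 0:
        raise ValueError(f"training file is empty: {path}")

    for name, value in (("vocab_size", vocab_size), ("min_frequency", min_frequency)):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    return path
```

Validating up front turns a late failure deep inside training into an immediate, descriptive error.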

tests/test_bpe.py ← new file

With these changes:

  • BaseTokenizer enforces the full train, encode, decode and load workflow
  • BPETokenizer is fully usable
  • An encode → decode round trip works
  • The pipeline can progress to model training

Scope

This change touches only:

  • base.py
  • bpe_tokenizer.py
  • tests/test_bpe.py

No overlap with any existing open PRs.
No dependency on any other issue.


Code of Conduct

  • I have joined the Discord server and will post updates there
  • I have searched existing issues to avoid duplicates

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)
