
[BUG]: hash_tokenizer_config() crashes with FileNotFoundError when used with SentencePiece tokenizer #69

@Arpitsh7

Description

Summary

hash_tokenizer_config() in openverifiablellm/tokenizer.py is hardcoded to look for BPE artifacts (vocab.json and merges.txt). Calling it after training a SentencePiece tokenizer crashes immediately with FileNotFoundError, because SentencePiece produces spm.model and spm.vocab instead.

The function claims to hash the tokenizer configuration, but it silently supports only one of the two tokenizer types the project offers.

Root Cause

In openverifiablellm/tokenizer.py, hash_tokenizer_config() is hardcoded to read the BPE artifacts only.
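A paraphrase of the failing path (a sketch reconstructed from this report; the exact code in openverifiablellm/tokenizer.py may differ):

```python
import hashlib
from pathlib import Path

def hash_tokenizer_config(tokenizer_dir: str) -> str:
    # Hardcoded BPE artifact names: read_bytes() raises
    # FileNotFoundError for SentencePiece output (spm.model / spm.vocab).
    h = hashlib.sha256()
    for name in ("vocab.json", "merges.txt"):
        h.update((Path(tokenizer_dir) / name).read_bytes())
    return h.hexdigest()
```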

SentencePiece training produces completely different artifacts:

spm.model   ← binary model file
spm.vocab   ← human readable vocab

Neither vocab.json nor merges.txt is ever produced by SentencePiece, so the function always crashes when called after SentencePiece training.

Proposed Fix

openverifiablellm/tokenizer.py

Detect tokenizer type from artifacts present on disk
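One way to implement the detection, sketched here with the artifact names from this report (illustrative, not the project's actual code):

```python
import hashlib
from pathlib import Path

# Artifact sets for the two supported tokenizer types:
# BPE writes vocab.json/merges.txt, SentencePiece writes spm.model/spm.vocab.
BPE_FILES = ("vocab.json", "merges.txt")
SPM_FILES = ("spm.model", "spm.vocab")

def hash_tokenizer_config(tokenizer_dir: str) -> str:
    """Hash whichever artifact set is actually present on disk."""
    root = Path(tokenizer_dir)
    if all((root / f).exists() for f in BPE_FILES):
        files = BPE_FILES
    elif all((root / f).exists() for f in SPM_FILES):
        files = SPM_FILES
    else:
        raise FileNotFoundError(
            f"no BPE {BPE_FILES} or SentencePiece {SPM_FILES} "
            f"artifacts found in {tokenizer_dir}"
        )
    h = hashlib.sha256()
    for name in files:
        h.update(name.encode())  # bind the file name into the digest
        h.update((root / name).read_bytes())
    return h.hexdigest()
```

Raising an explicit error for an empty directory keeps the failure mode, but makes it a deliberate, descriptive one instead of a crash on a hardcoded path.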

tests/test_tokenizer.py

Add missing tests
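A sketch of the missing cases. To run standalone here it inlines a stand-in hash_tokenizer_config mirroring the proposed detection logic; the real test file would instead import it from openverifiablellm.tokenizer:

```python
# Sketch of the missing cases for tests/test_tokenizer.py.
# Stand-in implementation below; the real file would use:
#   from openverifiablellm.tokenizer import hash_tokenizer_config
import hashlib
import tempfile
from pathlib import Path

def hash_tokenizer_config(tokenizer_dir: str) -> str:  # stand-in
    root = Path(tokenizer_dir)
    for files in (("vocab.json", "merges.txt"), ("spm.model", "spm.vocab")):
        if all((root / f).exists() for f in files):
            h = hashlib.sha256()
            for name in files:
                h.update((root / name).read_bytes())
            return h.hexdigest()
    raise FileNotFoundError(f"no tokenizer artifacts in {tokenizer_dir}")

def test_hash_bpe_artifacts():
    with tempfile.TemporaryDirectory() as d:
        Path(d, "vocab.json").write_text("{}")
        Path(d, "merges.txt").write_text("#version: 0.2")
        assert len(hash_tokenizer_config(d)) == 64

def test_hash_sentencepiece_artifacts():
    with tempfile.TemporaryDirectory() as d:
        Path(d, "spm.model").write_bytes(b"\x00spm")
        Path(d, "spm.vocab").write_text("<unk>\t0")
        assert len(hash_tokenizer_config(d)) == 64

def test_hash_missing_artifacts():
    with tempfile.TemporaryDirectory() as d:
        try:
            hash_tokenizer_config(d)
            assert False, "expected FileNotFoundError"
        except FileNotFoundError:
            pass
```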

Files to Change

  • openverifiablellm/tokenizer.py — fix hash_tokenizer_config()
  • tests/test_tokenizer.py — add missing test cases

Impact

Critical - Application is unusable

Code of Conduct

  • I have joined the Discord server and will post updates there
  • I have searched existing issues to avoid duplicates

Metadata

Labels: bug (Something isn't working)