
[FEATURE]: Complete SentencePiece tokenizer — encode, decode, load and input validation #51

@Arpitsh7

Description


Summary

This issue proposes completing the SentencePieceTokenizer introduced in #17. The class currently supports training only; it is missing encode(), decode(), load(), input validation, special-token registration, and creation of the save_path directory.

Background

PR #17 introduced a modular tokenizer architecture with SentencePieceTokenizer as one of the two implementations.

However, the current SentencePieceTokenizer covers only the training step.

A tokenizer that cannot encode or decode text cannot be used in any downstream pipeline task.

Problems in Current Implementation

1. No encode() or decode(), so a trained model cannot tokenize or detokenize text

2. No load(), so a saved model cannot be reused in a later session

3. No special tokens are registered during training

4. No validation of input text

5. The save_path directory is never created
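The five gaps above could be closed along the following lines. This is a minimal sketch, not the repository's actual API: the class shape mirrors the issue, the `sentencepiece` calls (`SentencePieceTrainer.train`, `SentencePieceProcessor`, `encode`, `decode`) are that library's real interface, but names like `model_prefix` and `_validate_text`, and the specific special-token id assignments, are illustrative assumptions.

```python
from pathlib import Path

try:
    import sentencepiece as spm  # third-party dependency of the tokenizer
except ImportError:  # validation helpers below still work without it
    spm = None


class SentencePieceTokenizer:
    def __init__(self, model_prefix: str = "tokenizer/spm"):
        # model_prefix is illustrative; the real class may take a save_path
        self.model_prefix = model_prefix
        self._sp = None

    @staticmethod
    def _validate_text(text):
        """Reject inputs that would otherwise fail with opaque errors (problem 4)."""
        if not isinstance(text, str):
            raise TypeError(f"expected str, got {type(text).__name__}")
        if not text.strip():
            raise ValueError("cannot tokenize empty or whitespace-only text")
        return text

    def train(self, corpus_file: str, vocab_size: int = 8000):
        # Create the save directory before training (problem 5) and register
        # special tokens explicitly (problem 3); ids here are an assumption.
        Path(self.model_prefix).parent.mkdir(parents=True, exist_ok=True)
        spm.SentencePieceTrainer.train(
            input=corpus_file,
            model_prefix=self.model_prefix,
            vocab_size=vocab_size,
            unk_id=0, bos_id=1, eos_id=2, pad_id=3,
        )

    def load(self):
        # Problem 2: reload a previously trained model from disk.
        model_file = f"{self.model_prefix}.model"
        if not Path(model_file).is_file():
            raise FileNotFoundError(f"no trained model at {model_file}")
        self._sp = spm.SentencePieceProcessor(model_file=model_file)

    def encode(self, text: str) -> list:
        # Problem 1: text -> token ids, validating input first.
        self._validate_text(text)
        if self._sp is None:
            self.load()
        return self._sp.encode(text, out_type=int)

    def decode(self, ids) -> str:
        # Problem 1: token ids -> text, with basic type checks.
        if not isinstance(ids, (list, tuple)) or not all(
            isinstance(i, int) for i in ids
        ):
            raise TypeError("decode() expects a list of int token ids")
        if self._sp is None:
            self.load()
        return self._sp.decode(list(ids))
```

Validating before loading means bad inputs fail fast with a clear message even when no model file exists yet, which also keeps the validation logic testable in isolation.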

Why This Matters

SentencePiece is the subword tokenizer used by many major modern LLMs.

Without a complete SentencePieceTokenizer, the project cannot support cryptographic verification of pipelines
built on these models, which represent the majority of current open-source LLMs.

Scope

This change touches a single file:

  • sentencepiece_tokenizer.py

No overlap with any existing open PRs.


Code of Conduct

  • I have joined the Discord server and will post updates there
  • I have searched existing issues to avoid duplicates

Metadata

Assignees: no one assigned
Labels: enhancement (new feature or request)
