goSentencePiece


Pure Go implementation of the SentencePiece tokenizer (Unigram and BPE). Produces byte-identical output to the C++ / Python sentencepiece library -- no CGo, no Rust FFI, no external C libraries.

Built for running DeBERTa v3, Gemma, LLaMA, and other SentencePiece-based models in Go services.

Installation

go get github.com/tggo/goSentencePiece

Requires Go 1.23+.

Quick Start

package main

import (
    "fmt"
    "log"

    sp "github.com/tggo/goSentencePiece"
)

func main() {
    tok, err := sp.NewTokenizer("spm.model")
    if err != nil {
        log.Fatal(err)
    }

    // Encode text to token IDs
    ids, _ := tok.Encode("Hello world")
    fmt.Println("IDs:", ids)

    // Encode text to string pieces
    pieces, _ := tok.EncodeAsPieces("Hello world")
    fmt.Println("Pieces:", pieces)

    // Decode token IDs back to text
    text, _ := tok.Decode(ids)
    fmt.Println("Decoded:", text)

    // Wrap with special tokens (BOS/EOS)
    wrapped := tok.AddSpecialTokens(ids)
    fmt.Println("With special tokens:", wrapped)

    fmt.Println("Vocab size:", tok.VocabSize())
}

Features

  • Pure Go -- zero CGo, zero Rust FFI, zero external C libraries
  • Byte-identical to the reference C++ implementation (validated against 5155 golden test cases)
  • Dual format -- loads both SentencePiece .model (protobuf) and HuggingFace tokenizer.json
  • Unigram model with Viterbi decoding
  • BPE model with greedy best-first merging
  • Byte fallback (<0xHH> tokens) for characters not in vocabulary
  • Precompiled charsmap normalization via Darts double-array trie (NFKC + custom rules)
  • Metaspace pre-tokenization for HuggingFace tokenizer.json models
  • ML pipeline -- post-processing, padding, truncation, attention masks
  • Batch encoding -- encode multiple texts with automatic padding
  • ONNX-ready -- produces input_ids, attention_mask, token_type_ids tensors
  • go:embed support -- load models from embedded files with NewTokenizerFromReader
  • Typed errors -- sentinel errors for invalid or unsupported models
  • Fast -- see benchmarks below
  • Zero runtime dependencies beyond stdlib + google.golang.org/protobuf

API

Tokenizer

// Create from file path (auto-detects .model or tokenizer.json format)
func NewTokenizer(path string) (*Tokenizer, error)

// Create from io.Reader (auto-detects format)
func NewTokenizerFromReader(r io.Reader) (*Tokenizer, error)

// Create from HuggingFace tokenizer.json explicitly
func NewTokenizerFromJSON(path string) (*Tokenizer, error)
func NewTokenizerFromJSONReader(r io.Reader) (*Tokenizer, error)

// Encode text to token IDs
func (t *Tokenizer) Encode(text string) ([]int, error)

// Encode text to string pieces
func (t *Tokenizer) EncodeAsPieces(text string) ([]string, error)

// Decode token IDs back to text
func (t *Tokenizer) Decode(ids []int) (string, error)

// Encode multiple texts at once
func (t *Tokenizer) EncodeBatch(texts []string) ([][]int, error)

// Wrap with BOS/EOS tokens
func (t *Tokenizer) AddSpecialTokens(ids []int) []int

// Get vocabulary size
func (t *Tokenizer) VocabSize() int

// Access the underlying model
func (t *Tokenizer) Model() *Model

// Pipeline configuration (builder pattern)
func (t *Tokenizer) WithPostProcessor(pp PostProcessor) *Tokenizer
func (t *Tokenizer) WithTruncation(params *TruncationParams) *Tokenizer
func (t *Tokenizer) WithPadding(params *PaddingParams) *Tokenizer

// Full encoding with metadata (post-processing + truncation + padding)
func (t *Tokenizer) EncodeWithOptions(text string, addSpecialTokens bool) *Encoding
func (t *Tokenizer) EncodeBatchWithOptions(texts []string, addSpecialTokens bool) []*Encoding

Encoding

type Encoding struct {
    IDs               []int    // Token IDs
    Tokens            []string // String pieces
    AttentionMask     []int    // 1 for real tokens, 0 for padding
    TypeIDs           []int    // Segment IDs (0 for first, 1 for second)
    SpecialTokensMask []int    // 1 for special tokens, 0 for normal
}

Model

// Load model from file
func LoadModel(path string) (*Model, error)

// Load model from reader
func LoadModelFromReader(r io.Reader) (*Model, error)

// Vocabulary lookup
func (m *Model) VocabSize() int
func (m *Model) IdToPiece(id int) string
func (m *Model) PieceToId(piece string) int

// Special token IDs
func (m *Model) UnkID() int
func (m *Model) BosID() int
func (m *Model) EosID() int
func (m *Model) PadID() int

Supported Models

Any SentencePiece .model file (protobuf) or HuggingFace tokenizer.json that uses Unigram or BPE model type.

SentencePiece .model (protobuf):

  • microsoft/deberta-v3-small / base / large (Unigram)
  • google/gemma-3-1b-it (BPE, 256K vocab)
  • Other Unigram/BPE SentencePiece models (XLNet, ALBERT, T5, LLaMA, Mistral, etc.)

HuggingFace tokenizer.json:

  • onnx-community/mmBERT-small-ONNX (BPE, 256K vocab)
  • Other BPE/Unigram tokenizer.json models with Metaspace pre-tokenizer

Note: WORD and CHAR model types are not supported. ByteLevel pre-tokenizer (GPT-2/RoBERTa) is not yet supported.

Benchmarks

Measured on Apple M4 Max. Python sentencepiece is a C++ library with Python SWIG bindings.

Operation  Input               Go      Python (C++)  Speedup
Encode     short (11 chars)    301 ns  1.1 μs        3.7x
Encode     medium (120 chars)  3.3 μs  5.6 μs        1.7x
Encode     long (4500 chars)   94 μs   183 μs        1.9x
Decode     short (10 tokens)   343 ns  801 ns        2.3x

Go outpaces the Python-wrapped C++ reference on short inputs, where FFI overhead dominates. Even on long inputs, where that overhead amortizes, the pure Go Viterbi is still ~2x faster than C++ via Python bindings.

# Go benchmarks
make bench

# Python benchmarks (requires make venv)
.venv/bin/python _testdata/bench_python.py

Examples

The examples/ directory contains runnable programs:

Example        Description
basic          Encode, decode, batch encode, vocab metadata
embed          Load model from go:embed binary data
mmbert         mmBERT-small ONNX inference prep (tokenizer.json or .model)
similarity     Jaccard similarity between two texts at token level
streaming      Memory-efficient line-by-line tokenization from stdin
vocab-inspect  Inspect vocab: special tokens, byte tokens, piece search
benchmark      CLI throughput benchmark (tokens/sec)
compare        Side-by-side Unigram vs BPE tokenization comparison

go run ./examples/basic _testdata/spm.model
go run ./examples/mmbert _testdata/tokenizer.json   # HuggingFace format
go run ./examples/mmbert _testdata/bpe.model        # SentencePiece format
go run ./examples/compare _testdata/spm.model _testdata/bpe.model
go run ./examples/vocab-inspect _testdata/spm.model
cat file.txt | go run ./examples/benchmark _testdata/spm.model

Project Structure

sentencepiece.go    -- public Tokenizer type, pipeline, constructors
model.go            -- protobuf loading, vocab index, ByteTrie
tokenizer_json.go   -- HuggingFace tokenizer.json loading
normalizer.go       -- precompiled charsmap (Darts trie), NFKC, whitespace
unigram.go          -- Viterbi decoding (forward DP + backtrack)
bpe.go              -- BPE greedy merge with priority queue
encoder.go          -- Encode/Decode with byte-token handling
encoding.go         -- Encoding struct (IDs, masks, offsets)
postprocessor.go    -- PostProcessor interface, template processing
padding.go          -- Padding (right/left, fixed/batch-longest)
truncation.go       -- Truncation to max length
byte_fallback.go    -- <0xHH> token encoding/decoding
errors.go           -- sentinel error types
trie.go             -- ByteTrie (vocab), DartsDoubleArray (charsmap)
proto/              -- generated protobuf code
examples/           -- runnable examples (basic, embed, mmbert)
_testdata/          -- test models and golden test cases

How It Works

  1. Normalization: Input text is normalized using the model's precompiled character map (a Darts double-array trie that encodes NFKC and custom rules). Whitespace is deduplicated, a prefix space is added, and spaces are replaced with the metaspace character.

  2. Viterbi tokenization: The normalized text is segmented into pieces using dynamic programming. A byte-level trie is traversed to find all vocabulary pieces starting at each position. The algorithm finds the segmentation that maximizes total log-probability.

  3. Byte fallback: Characters not covered by any vocabulary piece are encoded as individual UTF-8 bytes using <0xHH> tokens.

  4. Decoding: Token IDs are mapped back to piece strings. Byte tokens are accumulated and flushed as UTF-8. The metaspace prefix is converted back to spaces.

Thread Safety

Tokenizer is safe for concurrent use by multiple goroutines after creation. The model and normalizer are read-only after initialization, so no locking is needed.

Running Tests

# Set up Python venv and download model + golden data
make venv
make golden

# Run tests
make test

# Run benchmarks
make bench

# Run fuzz tests (60s)
make fuzz

# Run linters
make lint

# Run tests with coverage
make cover

License

MIT -- see LICENSE.
