Description
I was doing some development work on an MCP server called devrag using Claude, and it decided there was an issue with the tokenizer's handling of spaces. I'm raising the issue because the report is probably correct and may be worth checking.
Below is Claude's report; I hope it is of some help.
The tokenizer panics with slice bounds out of range [957:956] when processing text containing many consecutive whitespace characters (50+). This commonly occurs when tokenizing code listings extracted from PDFs that have heavy indentation.
Environment
- tokenizer version: v0.3.0
- Go version: 1.23
- OS: Windows 11 (also reproducible on other platforms)
Steps to Reproduce
```go
package main

import (
	"strings"

	"github.com/sugarme/tokenizer/pretrained"
)

func main() {
	tk, _ := pretrained.FromFile("tokenizer.json") // XLM-RoBERTa tokenizer

	// Text with ~100 consecutive spaces (common in PDF code listings)
	text := "if (x = OpenWindow(NULL," + strings.Repeat(" ", 100) + "WA_Left, 20,"

	_, err := tk.EncodeSingle(text, true) // PANIC here
	if err != nil {
		panic(err)
	}
}
```
Stack Trace
```
panic: runtime error: slice bounds out of range [957:956]

goroutine 1 [running]:
github.com/sugarme/tokenizer/normalizer.(*NormalizedString).TransformRange(0xc002e8d860, 0x2581?, {0xc0069a8bb8, 0x1, 0x7ff7fd530220?}, 0x1)
	normalizer/normalized.go:768 +0x197a
github.com/sugarme/tokenizer/normalizer.(*NormalizedString).Replace(0xc002e8d860, {0x7ff7fd5c7a00?, 0xc0000741a0?}, {0xc000214584, 0x3})
	normalizer/normalized.go:1456 +0x217
github.com/sugarme/tokenizer/pretokenizer.(*Metaspace).PreTokenize.func1(0x0, 0xc002e8d860)
	pretokenizer/metaspace.go:81 +0xa9
github.com/sugarme/tokenizer.(*PreTokenizedString).Split(0xc0032f91a0, 0xc000033628)
	pretokenizer.go:81 +0x16f
github.com/sugarme/tokenizer/pretokenizer.(*Metaspace).PreTokenize(...)
	pretokenizer/metaspace.go:116 +0x2c
```
Root Cause
The TransformRange function in normalizer/normalized.go:768 has a bounds checking issue when the Metaspace pretokenizer processes long runs of whitespace. The slice indices become inverted ([957:956]), causing the panic.
This appears related to #77, which reports a similar TransformRange panic with mixed Unicode characters; both are likely manifestations of the same underlying bounds calculation bug.
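For reference, Go panics at runtime whenever a slice expression's low index exceeds its high index, which is exactly what `[957:956]` implies. A minimal standalone illustration (the indices here are hypothetical values mirroring the reported panic, not the library's actual internal state), including the kind of guard a fix inside `TransformRange` might involve:

```go
package main

import "fmt"

func main() {
	buf := make([]byte, 1000)
	lo, hi := 957, 956 // inverted indices, as in the reported panic

	// A defensive check like this avoids the panic, though the real fix
	// would be in the bounds calculation that produced the inversion:
	if lo > hi {
		fmt.Printf("inverted range [%d:%d]; slicing would panic\n", lo, hi)
		return
	}
	_ = buf[lo:hi]
}
```

Note that the indices must be variables here: Go rejects `buf[957:956]` with constant indices at compile time, so the inversion can only arise from computed values like those in `TransformRange`.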
Workaround
Normalize input text by collapsing multiple consecutive whitespace to a single space before tokenization:
```go
import (
	"regexp"
	"strings"
)

var multiSpaceRegex = regexp.MustCompile(`\s{2,}`)

func normalizeText(text string) string {
	return strings.TrimSpace(multiSpaceRegex.ReplaceAllString(text, " "))
}

// Use: tk.EncodeSingle(normalizeText(text), true)
```
Real-World Impact
This bug affects PDF text extraction pipelines where scanned programming books/manuals contain code listings with heavy indentation. In our testing with 887-page technical PDFs, approximately 1.7% of text chunks (127 out of 7,490) triggered this panic.