
Panic: slice bounds out of range in Metaspace pretokenizer with consecutive whitespace #78

@davideclark

Description


I was doing some development work on an MCP server called devrag using Claude, and it flagged an issue with the tokenizer's handling of spaces. I am raising it here because the report appears to be correct and may be worth checking.
The following is Claude's report; I hope it is of some help.

The tokenizer panics with slice bounds out of range [957:956] when processing text containing many consecutive whitespace characters (50+). This commonly occurs when tokenizing code listings extracted from PDFs that have heavy indentation.

Environment

  • tokenizer version: v0.3.0
  • Go version: 1.23
  • OS: Windows 11 (also reproducible on other platforms)

Steps to Reproduce

package main

import (
    "strings"
    "github.com/sugarme/tokenizer/pretrained"
)

func main() {
    tk, err := pretrained.FromFile("tokenizer.json") // XLM-RoBERTa tokenizer
    if err != nil {
        panic(err)
    }

    // Text with ~100 consecutive spaces (common in PDF code listings)
    text := "if (x = OpenWindow(NULL," + strings.Repeat(" ", 100) + "WA_Left, 20,"

    _, err = tk.EncodeSingle(text, true) // PANIC here
    if err != nil {
        panic(err)
    }
}

Stack Trace

panic: runtime error: slice bounds out of range [957:956]

goroutine 1 [running]:
github.com/sugarme/tokenizer/normalizer.(*NormalizedString).TransformRange(0xc002e8d860, 0x2581?, {0xc0069a8bb8, 0x1, 0x7ff7fd530220?}, 0x1)
    normalizer/normalized.go:768 +0x197a
github.com/sugarme/tokenizer/normalizer.(*NormalizedString).Replace(0xc002e8d860, {0x7ff7fd5c7a00?, 0xc0000741a0?}, {0xc000214584, 0x3})
    normalizer/normalized.go:1456 +0x217
github.com/sugarme/tokenizer/pretokenizer.(*Metaspace).PreTokenize.func1(0x0, 0xc002e8d860)
    pretokenizer/metaspace.go:81 +0xa9
github.com/sugarme/tokenizer.(*PreTokenizedString).Split(0xc0032f91a0, 0xc000033628)
    pretokenizer.go:81 +0x16f
github.com/sugarme/tokenizer/pretokenizer.(*Metaspace).PreTokenize(...)
    pretokenizer/metaspace.go:116 +0x2c

Root Cause

The TransformRange function in normalizer/normalized.go:768 has a bounds-checking issue that surfaces when the Metaspace pretokenizer processes long runs of whitespace. The computed slice indices become inverted ([957:956], i.e. start > end), which triggers the runtime panic.

This appears related to #77 which reports a similar TransformRange panic with mixed Unicode characters - both are likely manifestations of the same underlying bounds calculation bug.
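For readers less familiar with Go slice semantics: a slice expression buf[lo:hi] with lo > hi panics at runtime with exactly the message seen in the stack trace. The sketch below is purely illustrative (the sliceMsg helper is ours, not part of the tokenizer), using the same indices as the report:

package main

import "fmt"

// sliceMsg attempts buf[lo:hi] and returns the recovered panic
// message, or "" if no panic occurred. Illustrative helper only.
func sliceMsg(buf []byte, lo, hi int) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	_ = buf[lo:hi]
	return ""
}

func main() {
	buf := make([]byte, 1000)
	// Inverted bounds, mirroring the indices in the stack trace.
	fmt.Println(sliceMsg(buf, 957, 956))
}

This is why a fix in TransformRange needs to clamp or reorder the computed range (or skip the transform) before slicing, rather than relying on the runtime check.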

Workaround

Normalize input text by collapsing multiple consecutive whitespace to a single space before tokenization:

import (
    "regexp"
    "strings"
)

var multiSpaceRegex = regexp.MustCompile(`\s{2,}`)

func normalizeText(text string) string {
    return strings.TrimSpace(multiSpaceRegex.ReplaceAllString(text, " "))
}

// Use: tk.EncodeSingle(normalizeText(text), true)
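One caveat with the regex above: \s{2,} also collapses runs containing newlines and tabs, which flattens multi-line code listings into a single line. If preserving line structure matters for your pipeline, a variant restricted to spaces and tabs may be preferable (the names below are ours, not from the library):

package main

import (
	"regexp"
	"strings"
)

// spaceRunRegex matches runs of two or more spaces/tabs only,
// so newlines in extracted code listings survive.
var spaceRunRegex = regexp.MustCompile(`[ \t]{2,}`)

func normalizeKeepNewlines(text string) string {
	return strings.TrimSpace(spaceRunRegex.ReplaceAllString(text, " "))
}

Either variant avoids the panic, since the Metaspace pretokenizer never sees a long whitespace run.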

Real-World Impact

This bug affects PDF text extraction pipelines where scanned programming books/manuals contain code listings with heavy indentation. In our testing with 887-page technical PDFs, approximately 1.7% of text chunks (127 out of 7,490) triggered this panic.
