Description
I was doing some development work on an MCP server called devrag using Claude, and it decided there was an issue with the tokenizer's handling of spaces. I'm raising the issue because the report is probably correct and may be worth checking.
Below is Claude's report; I hope it is of some help.
The tokenizer panics with slice bounds out of range [957:956] when processing text containing many consecutive whitespace characters (50+). This commonly occurs when tokenizing code listings extracted from PDFs that have heavy indentation.
Environment
- tokenizer version: v0.3.0
- Go version: 1.23
- OS: Windows 11 (also reproducible on other platforms)
Steps to Reproduce
```go
package main

import (
	"strings"

	"github.com/sugarme/tokenizer/pretrained"
)

func main() {
	tk, _ := pretrained.FromFile("tokenizer.json") // XLM-RoBERTa tokenizer

	// Text with ~100 consecutive spaces (common in PDF code listings)
	text := "if (x = OpenWindow(NULL," + strings.Repeat(" ", 100) + "WA_Left, 20,"

	_, err := tk.EncodeSingle(text, true) // PANIC here
	if err != nil {
		panic(err)
	}
}
```
Stack Trace
```
panic: runtime error: slice bounds out of range [957:956]

goroutine 1 [running]:
github.com/sugarme/tokenizer/normalizer.(*NormalizedString).TransformRange(0xc002e8d860, 0x2581?, {0xc0069a8bb8, 0x1, 0x7ff7fd530220?}, 0x1)
	normalizer/normalized.go:768 +0x197a
github.com/sugarme/tokenizer/normalizer.(*NormalizedString).Replace(0xc002e8d860, {0x7ff7fd5c7a00?, 0xc0000741a0?}, {0xc000214584, 0x3})
	normalizer/normalized.go:1456 +0x217
github.com/sugarme/tokenizer/pretokenizer.(*Metaspace).PreTokenize.func1(0x0, 0xc002e8d860)
	pretokenizer/metaspace.go:81 +0xa9
github.com/sugarme/tokenizer.(*PreTokenizedString).Split(0xc0032f91a0, 0xc000033628)
	pretokenizer.go:81 +0x16f
github.com/sugarme/tokenizer/pretokenizer.(*Metaspace).PreTokenize(...)
	pretokenizer/metaspace.go:116 +0x2c
```
Root Cause
The TransformRange function in normalizer/normalized.go:768 has a bounds checking issue when the Metaspace pretokenizer processes long runs of whitespace. The slice indices become inverted ([957:956]), causing the panic.
This appears related to #77, which reports a similar TransformRange panic with mixed Unicode characters; both are likely manifestations of the same underlying bounds calculation bug.
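For reference, Go panics at runtime whenever a slice expression's low index exceeds its high index, which is exactly what `[957:956]` implies. A minimal standalone illustration (the indices here are hypothetical values mirroring the reported panic, not the library's actual internal state), including the kind of guard a fix inside `TransformRange` might involve:

```go
package main

import "fmt"

func main() {
	buf := make([]byte, 1000)
	lo, hi := 957, 956 // inverted indices, as in the reported panic

	// A defensive check like this avoids the panic, though the real fix
	// would be in the bounds calculation that produced the inversion:
	if lo > hi {
		fmt.Printf("inverted range [%d:%d]; slicing would panic\n", lo, hi)
		return
	}
	_ = buf[lo:hi]
}
```

Note that the indices must be variables here: Go rejects `buf[957:956]` with constant indices at compile time, so the inversion can only arise from computed values like those in `TransformRange`.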
Workaround
Normalize input text by collapsing multiple consecutive whitespace to a single space before tokenization:
```go
import (
	"regexp"
	"strings"
)

var multiSpaceRegex = regexp.MustCompile(`\s{2,}`)

func normalizeText(text string) string {
	return strings.TrimSpace(multiSpaceRegex.ReplaceAllString(text, " "))
}

// Use: tk.EncodeSingle(normalizeText(text), true)
```
Real-World Impact
This bug affects PDF text extraction pipelines where scanned programming books/manuals contain code listings with heavy indentation. In our testing with 887-page technical PDFs, approximately 1.7% of text chunks (127 out of 7,490) triggered this panic.