panic: byte index is not a char boundary in wagl ingest transcripts #201

@GoZumie

Description

Bug

wagl ingest transcripts panics when transcript content contains multi-byte UTF-8 characters (e.g. curly quotes “ ”).

Error

thread 'main' panicked at 'byte index 2000 is not a char boundary'

Root Cause

The ingest logic slices strings at a fixed byte offset (e.g. 2000) rather than at a character boundary. In Rust, slicing a str as &s[..n] or &s[n..] requires n to fall on a valid UTF-8 char boundary and panics otherwise. Curly quotes and em-dashes are 3 bytes each in UTF-8, and non-ASCII characters in general take 2 to 4 bytes, so a fixed byte offset can land mid-character.
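To make the boundary rule concrete, here is a small sketch that checks whether a byte offset is safe to slice at. The helper name can_slice_at is hypothetical and not part of wagl; it relies on str::get, which returns None instead of panicking when the range does not fall on char boundaries.

```rust
/// Hypothetical helper (not from wagl): reports whether `s` can be
/// sliced as `&s[..index]` without panicking.
fn can_slice_at(s: &str, index: usize) -> bool {
    // `str::get` returns None when `index` is out of range or falls
    // inside a multi-byte character, where `&s[..index]` would panic.
    s.get(..index).is_some()
}

/// Sample input: 'a' (1 byte), a left curly quote (3 bytes), 'b' (1 byte).
fn sample() -> &'static str {
    "a\u{201C}b"
}
```

For the sample string, offsets 0, 1, 4 and 5 are sliceable, while 2 and 3 fall inside the curly quote. The same situation arises at offset 2000 in a real transcript.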

Fix

Use char_indices() or a manual char-boundary walk to truncate safely:

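/// Returns the longest prefix of `s` that is at most `max_bytes` long
/// and ends on a UTF-8 char boundary.
fn safe_truncate(s: &str, max_bytes: usize) -> &str {
    // Nothing to cut if the whole string already fits.
    if s.len() <= max_bytes {
        return s;
    }
    // Walk backwards from the byte limit to the nearest char boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut boundary = max_bytes;
    while !s.is_char_boundary(boundary) {
        boundary -= 1;
    }
    &s[..boundary]
}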
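The function above is the manual boundary walk. For comparison, a char_indices() variant might look like the sketch below. The name truncate_with_char_indices is hypothetical, and the function is intended to return the same prefix as safe_truncate for the same inputs.

```rust
/// Hypothetical alternative (not from wagl): keeps whole characters
/// from the start of `s` while their combined size stays within
/// `max_bytes`.
fn truncate_with_char_indices(s: &str, max_bytes: usize) -> &str {
    let mut end = 0;
    for (start, ch) in s.char_indices() {
        // Byte offset just past this character.
        let next = start + ch.len_utf8();
        if next > max_bytes {
            break;
        }
        end = next;
    }
    // `end` is 0 or the offset just past a full character,
    // so it is always a valid char boundary.
    &s[..end]
}
```

The backward walk inspects at most three bytes, while this version scans forward from the start of the string, so the backward walk is likely the better choice for long transcripts. The forward scan is mainly useful if the limit should later be expressed in characters rather than bytes.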

Reproduction

Run wagl ingest transcripts on any transcript containing curly quotes, em-dashes, or other multi-byte UTF-8 characters near a chunk boundary.
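A deterministic input can help here, since whether a real transcript triggers the panic depends on where its non-ASCII characters happen to fall. The sketch below builds a string whose curly quote straddles a chosen byte offset. The helper name make_repro_input is hypothetical, and the offset 2000 is taken from the error message above; it is an assumption that this matches the limit wagl uses in every code path.

```rust
/// Hypothetical helper (not from wagl): builds text in which byte
/// index `offset` falls inside a 3-byte curly quote.
/// `offset` must be at least 1.
fn make_repro_input(offset: usize) -> String {
    assert!(offset >= 1, "offset must be at least 1");
    // ASCII padding so the quote starts one byte before `offset`.
    let mut s = "a".repeat(offset - 1);
    // The left curly quote occupies bytes offset-1 .. offset+2.
    s.push('\u{201C}');
    s.push_str(" trailing text");
    s
}
```

Writing make_repro_input(2000) to a transcript file and ingesting it should exercise the failing slice, provided the ingest path truncates that field at byte 2000.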

Metadata


Labels: bug
