Bug
wagl ingest transcripts panics when transcript content contains multi-byte UTF-8 characters (e.g. curly quotes “ ”).
Error
thread 'main' panicked at 'byte index 2000 is not a char boundary'
Root Cause
The ingest logic slices strings at a fixed byte offset (e.g. 2000) rather than at a character boundary. In Rust, slicing a &str by byte range (&s[..n] or &s[n..]) requires n to fall on a valid UTF-8 char boundary, and panics otherwise. Curly quotes and em-dashes are 3 bytes each in UTF-8, and non-ASCII characters in general take 2 to 4 bytes, so a fixed byte offset can land mid-character.
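To make the byte arithmetic concrete, here is a small sketch that is not taken from the wagl source. The helper names utf8_widths and checked_prefix are hypothetical; the standard-library calls they wrap (char::len_utf8 and str::get) are real.

```rust
/// Returns the UTF-8 byte width of each character in `s`, in order.
/// (Hypothetical helper, for illustration only.)
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

/// Non-panicking prefix slice: returns `None` when `n` is past the end
/// of `s` or falls inside a multi-byte character, where `&s[..n]`
/// would panic instead.
fn checked_prefix(s: &str, n: usize) -> Option<&str> {
    s.get(..n)
}
```

For the string "ab“cd", the curly quote occupies bytes 2 to 4, so checked_prefix returns None for n = 3 and n = 4, which are exactly the offsets where &s[..n] would panic.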
Fix
Use char_indices() or a manual char-boundary walk to truncate safely:
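    /// Truncates `s` to at most `max_bytes` bytes without splitting a character.
    fn safe_truncate(s: &str, max_bytes: usize) -> &str {
        if s.len() <= max_bytes {
            return s;
        }
        // Walk back from the byte limit to the nearest char boundary.
        // Index 0 is always a boundary, so the loop terminates.
        let mut boundary = max_bytes;
        while !s.is_char_boundary(boundary) {
            boundary -= 1;
        }
        &s[..boundary]
    }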
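The function above is the manual char-boundary walk. For comparison, a char_indices() version might look like the sketch below. The name truncate_with_char_indices is hypothetical, and this is one possible formulation rather than the fix that was actually applied in wagl.

```rust
/// Truncates `s` to at most `max_bytes` bytes, always ending on a
/// character boundary. (Hypothetical name, for illustration only.)
fn truncate_with_char_indices(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    // For each character, compute the byte offset just past it, and keep
    // the last such offset that still fits within the limit.
    let end = s
        .char_indices()
        .map(|(start, ch)| start + ch.len_utf8())
        .take_while(|&end| end <= max_bytes)
        .last()
        .unwrap_or(0); // no character fits: return the empty prefix
    &s[..end]
}
```

This version scans forward from the start of the string, so it does work proportional to max_bytes, while the backward walk inspects at most three bytes. The backward walk is likely the better choice for large chunks.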
Reproduction
Run wagl ingest transcripts on any transcript containing curly quotes, em-dashes, or other multi-byte UTF-8 characters near a chunk boundary.
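A deterministic input can be built by padding with ASCII so that a curly quote straddles the failing offset. The sketch below assumes the offset of 2000 from the panic message; the helper name straddling_input is hypothetical, and how the resulting text must be wrapped into a transcript file that wagl accepts is not specified here.

```rust
/// Builds a string in which byte index `offset` falls inside a 3-byte
/// curly quote. Requires `offset >= 1`. (Hypothetical helper.)
fn straddling_input(offset: usize) -> String {
    // ASCII padding places the quote's first byte at `offset - 1`,
    // so the quote occupies bytes `offset - 1 ..= offset + 1`.
    let mut s = "a".repeat(offset.saturating_sub(1));
    s.push('\u{201C}');
    s.push_str(" trailing text");
    s
}
```

With offset = 2000, slicing the result at byte 2000 with &s[..2000] hits the same "not a char boundary" condition as the reported panic.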