fix: prevent panic on multi-byte UTF-8 in log string slicing by hansmrtn · Pull Request #251 · yfedoseev/pdf_oxide

hansmrtn · 2026-03-14T21:12:14Z

Description

This fixes the 5 byte-offset string slices in log::debug!/log::trace! macros that panic with byte index N is not a char boundary on multi-byte UTF-8 text (CJK, emoji, dingbats, U+FFFD). This triggers in any app using tracing-subscriber. I found this while working with maintenance manuals, causing my logs to fill with panics.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Performance improvement
Code refactoring
Tests
CI/CD changes

Related Issues

N/A

Changes Made

Add safe_prefix/safe_suffix to pub(crate) utils (floor/ceil char boundary truncation)
Replace raw &s[..n]/&s[n..] at text.rs:2961,2963,5697 and xref.rs:391,421
Add 23 tests: catch_unwind proof of old panics, per-site regression, edge cases
Add [Unreleased] entry to CHANGELOG.md

Testing

I have added tests that prove my fix is effective or that my feature works
All new and existing tests pass locally
I have run cargo test --all-features
I have run cargo clippy -- -D warnings
I have run cargo fmt

Note, on my machine, --all-features fails to link due to missing ONNX Runtime (ml/ocr features). I did test with all linkable features: logging,parallel,debug-span-merging,rendering,signatures,barcodes.

Python Bindings (if applicable)

N/A — no Python binding changes.

Python bindings updated (if needed)
Python tests pass
Python code formatted with ruff format
Python code linted with ruff check

Documentation

I have updated the documentation (README, docs/, code comments)
I have added/updated examples (if applicable)
I have updated CHANGELOG.md

Checklist

My code follows the project's coding guidelines (see CONTRIBUTING.md)
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
My changes generate no new warnings
I have checked my code and corrected any misspellings
The PR title follows conventional commits format (e.g., feat:, fix:, docs:)

Screenshots (if applicable)

N/A

Additional Notes

The bug only manifests when log::max_level() >= Debug, which happens automatically in apps using tracing-subscriber with tracing-log. ASCII-only PDFs are unaffected since byte offsets always land on char boundaries.

Let me know if any changes or tests are needed or missed!

Replace raw byte-offset string slicing (&s[..n], &s[n..]) with safe_prefix/safe_suffix helpers that respect UTF-8 char boundaries. Fixes 4 sites in text.rs (lines 2961, 2963, 5697) and xref.rs (lines 391, 421) that panic with 'byte index N is not a char boundary' when processing PDFs with multi-byte UTF-8 text and log level >= Debug.

derekleverenz · 2026-03-19T23:05:11Z

Oh nice i ran into this too, i had a fix with .floor_char_boundary and .ceil_char_boundary but i like this

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: prevent panic on multi-byte UTF-8 in log string slicing#251

fix: prevent panic on multi-byte UTF-8 in log string slicing#251
hansmrtn wants to merge 1 commit intoyfedoseev:mainfrom
hansmrtn:fix/utf8-char-boundary-panic

hansmrtn commented Mar 14, 2026

Uh oh!

derekleverenz commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

hansmrtn commented Mar 14, 2026

Description

Type of Change

Related Issues

Changes Made

Testing

Python Bindings (if applicable)

Documentation

Checklist

Screenshots (if applicable)

Additional Notes

Uh oh!

derekleverenz commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants