Skip to content

fix: prevent panic on multi-byte UTF-8 in log string slicing#251

Open
hansmrtn wants to merge 1 commit intoyfedoseev:mainfrom
hansmrtn:fix/utf8-char-boundary-panic
Open

fix: prevent panic on multi-byte UTF-8 in log string slicing#251
hansmrtn wants to merge 1 commit intoyfedoseev:mainfrom
hansmrtn:fix/utf8-char-boundary-panic

Conversation

@hansmrtn
Copy link
Copy Markdown

Description

This fixes the 5 byte-offset string slices in log::debug!/log::trace! macros that panic with byte index N is not a char boundary on multi-byte UTF-8 text (CJK, emoji, dingbats, U+FFFD). This triggers in any app using tracing-subscriber. I found this while working with maintenance manuals, causing my logs to fill with panics.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Tests
  • CI/CD changes

Related Issues

N/A

Changes Made

  • Add safe_prefix/safe_suffix to pub(crate) utils (floor/ceil char boundary truncation)
  • Replace raw &s[..n]/&s[n..] at text.rs:2961,2963,5697 and xref.rs:391,421
  • Add 23 tests: catch_unwind proof of old panics, per-site regression, edge cases
  • Add [Unreleased] entry to CHANGELOG.md

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • All new and existing tests pass locally
  • I have run cargo test --all-features
  • I have run cargo clippy -- -D warnings
  • I have run cargo fmt

Note, on my machine, --all-features fails to link due to missing ONNX Runtime (ml/ocr features). I did test with all linkable features: logging,parallel,debug-span-merging,rendering,signatures,barcodes.

Python Bindings (if applicable)

N/A — no Python binding changes.

  • Python bindings updated (if needed)
  • Python tests pass
  • Python code formatted with ruff format
  • Python code linted with ruff check

Documentation

  • I have updated the documentation (README, docs/, code comments)
  • I have added/updated examples (if applicable)
  • I have updated CHANGELOG.md

Checklist

  • My code follows the project's coding guidelines (see CONTRIBUTING.md)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • I have checked my code and corrected any misspellings
  • The PR title follows conventional commits format (e.g., feat:, fix:, docs:)

Screenshots (if applicable)

N/A

Additional Notes

The bug only manifests when log::max_level() >= Debug, which happens automatically in apps using tracing-subscriber with tracing-log. ASCII-only PDFs are unaffected since byte offsets always land on char boundaries.

Let me know if any changes or tests are needed or missed!

Replace raw byte-offset string slicing (&s[..n], &s[n..]) with
safe_prefix/safe_suffix helpers that respect UTF-8 char boundaries.

Fixes 4 sites in text.rs (lines 2961, 2963, 5697) and xref.rs
(lines 391, 421) that panic with 'byte index N is not a char
boundary' when processing PDFs with multi-byte UTF-8 text and
log level >= Debug.
@derekleverenz
Copy link
Copy Markdown

Oh nice i ran into this too, i had a fix with .floor_char_boundary and .ceil_char_boundary but i like this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants