feat: add Root::parse_bytes for non-UTF8 input #179

Draft
philiptaron wants to merge 3 commits into master from fix-issue-173

Contributor

@philiptaron philiptaron commented Jan 29, 2026

Summary

  • Add Root::parse_bytes(&[u8]) method that handles non-UTF8 input
  • Invalid UTF-8 sequences are replaced with U+FFFD (replacement character)

Background

The C++ Nix parser uses %option 8bit in its flex lexer, which allows it to handle arbitrary byte values without UTF-8 validation. The raw bytes are preserved in string literals.

rnix uses Rowan for its syntax tree, which requires valid UTF-8, so rnix cannot preserve arbitrary bytes exactly. Instead, parse_bytes performs a lossy UTF-8 conversion: each invalid sequence becomes U+FFFD (�).

Behavior comparison:

| Input bytes | nix output | rnix parse_bytes output |
|---|---|---|
| { x = "\xff"; } | { x = "\xff"; } (raw 0xFF preserved) | { x = "�"; } (U+FFFD replacement) |

This is sufficient for most use cases (linting, formatting, analysis) where exact byte preservation isn't required.
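The lossy conversion described above can be sketched with the standard library alone. This is only an illustration of the technique: `lossy_text` is a hypothetical helper, and whether parse_bytes uses `String::from_utf8_lossy` internally is an assumption, not something this PR description confirms.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of rnix): convert raw bytes to text,
/// replacing each invalid UTF-8 sequence with U+FFFD.
pub fn lossy_text(bytes: &[u8]) -> Cow<'_, str> {
    // `from_utf8_lossy` borrows the input when it is already valid UTF-8
    // and allocates a new String only when a replacement is needed.
    String::from_utf8_lossy(bytes)
}
```

For the input from the table, `lossy_text(b"{ x = \"\xff\"; }")` yields `{ x = "\u{FFFD}"; }`, while input that is already valid UTF-8 is returned borrowed and unchanged.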

Example

use rnix::Root;

// { x = "\xff"; } with raw 0xFF byte (invalid UTF-8)
let bytes: &[u8] = &[0x7b, 0x20, 0x78, 0x20, 0x3d, 0x20, 0x22, 0xff, 0x22, 0x3b, 0x20, 0x7d];
let parse = Root::parse_bytes(bytes);
assert!(parse.errors().is_empty());
// Output text: { x = "�"; }
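Because the replacement is silent, a caller may want to know whether the input contained invalid UTF-8 before choosing the lossy path. A minimal sketch using only the standard library follows; `first_invalid_offset` is a hypothetical name for illustration and is not part of this PR or of rnix.

```rust
/// Hypothetical helper (not part of rnix): byte offset of the first
/// invalid UTF-8 sequence, or None if the input is valid UTF-8.
pub fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    match std::str::from_utf8(bytes) {
        Ok(_) => None,
        // `valid_up_to` is the length of the longest valid prefix,
        // i.e. the index where the first invalid sequence starts.
        Err(e) => Some(e.valid_up_to()),
    }
}
```

For the example bytes above, the 0xFF byte sits at index 7, so the helper would return `Some(7)`. A tool that requires exact byte preservation could report that offset instead of parsing lossily.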

Test plan

  • Added test case non_utf8_can_be_parsed_with_parse_bytes_issue173
  • All existing tests pass

Closes #173

nix (C++) can parse files with non-UTF8 bytes, but rnix currently
requires valid UTF-8 because Root::parse takes &str.
Add a new `parse_bytes(&[u8])` method that handles non-UTF8 input by
doing lossy UTF-8 conversion. Invalid byte sequences are replaced with
U+FFFD (replacement character). The C++ Nix parser instead preserves
the raw bytes, so the output differs from Nix for such input.

This allows parsing `.nix` files that contain non-UTF8 bytes, which was
previously impossible since `Root::parse` requires `&str`.

Closes #173
@philiptaron philiptaron marked this pull request as draft January 29, 2026 13:47
Development

Successfully merging this pull request may close these issues.

Support parsing non-strings

1 participant