feat(patch_set): PatchSet::parse_bytes for raw byte input#64

Open
weihanglo wants to merge 26 commits into bmwill:master from weihanglo:patchset-raw-bytes

Conversation

@weihanglo
Contributor

Blocked on #61. Please review from c255bbf

The core idea is `PatchSet::parse_bytes`, so that non-UTF-8 hunks that Git doesn't treat as binary patches can still be safely parsed and applied.

With this PR, our history replay test no longer skips any non-UTF-8 patches.
When replaying rust-lang/rust history, it reports:

History replay completed: 853589 patches applied, 2980 skipped

And the 2980 patches were all submodule updates.

Fixes #63

`.lines()` strips line endings, so callers tracking byte offsets
need to re-add the `\r\n` or `\n` length manually.
Extract the repeated inline pattern into a reusable helper.
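
A minimal sketch of the offset-tracking problem described above, not the helper from this commit. The name `lines_with_offsets` is hypothetical; it uses `str::split_inclusive` so the terminator length never has to be re-added by hand.

```rust
/// Hypothetical helper: returns each line (terminator stripped) together
/// with the byte offset at which that line starts in `input`.
fn lines_with_offsets(input: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut offset = 0;
    for raw in input.split_inclusive('\n') {
        // `raw` still carries its terminator, so `raw.len()` is the true
        // number of bytes consumed, whether the line ends in `\n` or `\r\n`.
        let line = match raw.strip_suffix('\n') {
            Some(l) => l.strip_suffix('\r').unwrap_or(l),
            None => raw, // final line without a terminator
        };
        out.push((offset, line));
        offset += raw.len();
    }
    out
}
```

With `.lines()` alone, the second offset below would come out as 1 instead of 3 unless the caller re-adds the `\r\n` length.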
* Parse `diff --git` extended headers
* Split multi-file git diffs at `diff --git` boundaries
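
The boundary splitting can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the parser in this PR: `split_git_diffs` is a hypothetical name, and it treats every line starting with `diff --git ` as a boundary, which ignores any preamble (for example an email header from `git format-patch`).

```rust
/// Hypothetical helper: splits a multi-file git diff into one chunk per
/// file, cutting at every line that starts with `diff --git `.
fn split_git_diffs(input: &str) -> Vec<&str> {
    // Collect the byte offset of every `diff --git ` header line.
    let mut starts = Vec::new();
    let mut offset = 0;
    for raw in input.split_inclusive('\n') {
        if raw.starts_with("diff --git ") {
            starts.push(offset);
        }
        offset += raw.len();
    }
    // Each chunk runs from one header to the next (or to end of input).
    let mut chunks = Vec::new();
    for (i, &start) in starts.iter().enumerate() {
        let end = starts.get(i + 1).copied().unwrap_or(input.len());
        chunks.push(&input[start..end]);
    }
    chunks
}
```

Hunk body lines start with a space, `+`, `-`, or `\`, so they cannot be mistaken for a boundary under this rule.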
Add a compat test against `git apply` as well.
Unlike unidiff,
gitdiff produces patches for empty file creations/deletions
(`0\t0` in numstat)
because they carry `diff --git` + extended headers even without hunks.

Binary files (`-\t-\t`) are skipped in gitdiff mode for now.
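
For reference, the two numstat shapes mentioned above can be told apart with something like the sketch below. The type and function names are hypothetical and are not part of the test harness in this PR; the only assumption about Git is the `added<TAB>deleted<TAB>path` layout of `--numstat` lines.

```rust
/// Hypothetical classification of one `git diff --numstat` line.
#[derive(Debug, PartialEq)]
enum NumstatKind {
    /// `-\t-\t<path>`: Git reports the file as binary.
    Binary,
    /// `0\t0\t<path>`: no line changes, e.g. empty file creation/deletion.
    NoLineChanges,
    /// Ordinary text change.
    Text { added: u64, deleted: u64 },
}

/// Returns the kind and the path, or `None` if the line is malformed.
fn classify_numstat(line: &str) -> Option<(NumstatKind, &str)> {
    let mut parts = line.splitn(3, '\t');
    let added = parts.next()?;
    let deleted = parts.next()?;
    let path = parts.next()?;
    let kind = match (added, deleted) {
        ("-", "-") => NumstatKind::Binary,
        (a, d) => {
            let added: u64 = a.parse().ok()?;
            let deleted: u64 = d.parse().ok()?;
            if added == 0 && deleted == 0 {
                NumstatKind::NoLineChanges
            } else {
                NumstatKind::Text { added, deleted }
            }
        }
    };
    Some((kind, path))
}
```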
* Added types representing both literal and delta Git binary patches
* Added a parser for the `GIT binary patch` format.

This doesn't include the patch application
(which will be added in later commits)

The implementation is based on

* Specification from <https://diffx.org/spec/binary-diffs.html>
* Behavior observation of Git CLI
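
As an illustration of the format, each payload after the `GIT binary patch` line opens with a `literal <size>` or `delta <size>` line. The sketch below parses just that line. `BinaryHunkKind` and `parse_binary_hunk_header` are hypothetical names, not the types added in this commit.

```rust
/// Hypothetical kind tag for one binary payload.
#[derive(Debug, PartialEq)]
enum BinaryHunkKind {
    Literal,
    Delta,
}

/// Parses a `literal <size>` or `delta <size>` line and returns the kind
/// plus the declared size. Returns `None` for anything else.
fn parse_binary_hunk_header(line: &str) -> Option<(BinaryHunkKind, usize)> {
    let (kind, size) = line.trim_end().split_once(' ')?;
    let kind = match kind {
        "literal" => BinaryHunkKind::Literal,
        "delta" => BinaryHunkKind::Delta,
        _ => return None,
    };
    Some((kind, size.parse().ok()?))
}
```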
- Add `Binary::Keep` variant (now the default) to `ParseOptions`
- Add `PatchKind::Binary` variant for binary patches
- Parse `GIT binary patch` payload via `parse_binary_patch`
- Handle `Binary files ... differ` as `BinaryPatch::Marker`
- Add `extract_file_op_binary` for file ops without ---/+++ headers
The API was stabilized in 1.73.
The lint was added in 1.93.

This is required for an MSRV bump to 1.75.
This is a preparation for binary diff application support.

* Git binary patches are zlib-compressed, hence `flate2`
* zlib-rs (which is the most performant zlib backend)
  requires MSRV 1.75.0+, hence the bump.
* Add base85 encoder/decoder and Git delta format decoder.
* Wire them into `BinaryPatch::apply()` and `apply_reverse()`
  for decoding zlib-compressed, base85-encoded binary payloads.

These are feature-gated behind the `binary` feature.
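
For readers unfamiliar with the Git delta format, here is a rough standalone sketch of a decoder, operating on a delta that has already been base85-decoded and inflated. It is not the code from this commit, and `read_varint` and `apply_delta` are hypothetical names. The layout assumed here is the usual one: two size varints, then copy instructions (high bit set) and insert instructions (1 to 127 literal bytes).

```rust
/// Reads a little-endian base-128 varint (high bit = "more bytes follow").
fn read_varint(data: &[u8], pos: &mut usize) -> Option<usize> {
    let mut value = 0usize;
    let mut shift = 0u32;
    loop {
        let byte = *data.get(*pos)?;
        *pos += 1;
        value |= ((byte & 0x7f) as usize).checked_shl(shift)?;
        shift += 7;
        if byte & 0x80 == 0 {
            return Some(value);
        }
    }
}

/// Applies `delta` to `source`; returns `None` on any malformed input.
fn apply_delta(source: &[u8], delta: &[u8]) -> Option<Vec<u8>> {
    let mut pos = 0;
    let source_len = read_varint(delta, &mut pos)?;
    let target_len = read_varint(delta, &mut pos)?;
    if source_len != source.len() {
        return None;
    }
    let mut out = Vec::new();
    while pos < delta.len() {
        let cmd = delta[pos];
        pos += 1;
        if cmd & 0x80 != 0 {
            // Copy: bits 0..=3 select offset bytes, bits 4..=6 size bytes.
            let (mut offset, mut size) = (0usize, 0usize);
            for i in 0..4 {
                if cmd & (1 << i) != 0 {
                    offset |= (*delta.get(pos)? as usize) << (8 * i);
                    pos += 1;
                }
            }
            for i in 0..3 {
                if cmd & (0x10 << i) != 0 {
                    size |= (*delta.get(pos)? as usize) << (8 * i);
                    pos += 1;
                }
            }
            if size == 0 {
                size = 0x10000; // a size of zero encodes 64 KiB
            }
            let end = offset.checked_add(size)?;
            out.extend_from_slice(source.get(offset..end)?);
        } else if cmd != 0 {
            // Insert: the next `cmd` bytes of the delta are literal data.
            let end = pos + cmd as usize;
            out.extend_from_slice(delta.get(pos..end)?);
            pos = end;
        } else {
            return None; // command byte 0 is reserved
        }
    }
    (out.len() == target_len).then_some(out)
}
```

The sketch checks both declared sizes so that a delta applied to the wrong base is rejected instead of producing garbage.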
Now both tests require `binary` Cargo feature.
Preparation for `PatchSet::parse_bytes(&[u8])` support.

No behavior change.
Assumption: Header lines are always ASCII
(with `core.quotePath=true`, git's default).
Move dispatch logic to free functions,
so we can have private trait bound on them.

* `next_gitdiff_patch`
* `next_unidiff_patch`
* `Iterator::next` -> `next_patch`
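
The shape of this refactor might look like the sketch below: a crate-private trait abstracts over `str` and `[u8]`, a private generic free function carries the bound, and the public entry points stay concrete. All names here (`Text`, `starts_gitdiff`, `is_gitdiff_str`, `is_gitdiff_bytes`) are hypothetical and chosen for illustration only.

```rust
/// Hypothetical private abstraction over `str` and `[u8]` input.
trait Text {
    fn bytes(&self) -> &[u8];
}

impl Text for str {
    fn bytes(&self) -> &[u8] {
        self.as_bytes()
    }
}

impl Text for [u8] {
    fn bytes(&self) -> &[u8] {
        self
    }
}

/// Private free function: the private trait bound lives here, so it never
/// appears in a public signature.
fn starts_gitdiff<T: Text + ?Sized>(input: &T) -> bool {
    input.bytes().starts_with(b"diff --git ")
}

/// Public, concrete entry points that delegate to the shared logic.
pub fn is_gitdiff_str(input: &str) -> bool {
    starts_gitdiff(input)
}

pub fn is_gitdiff_bytes(input: &[u8]) -> bool {
    starts_gitdiff(input)
}
```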
@weihanglo force-pushed the patchset-raw-bytes branch 2 times, most recently from 054f7b1 to 68431bf (April 14, 2026 05:21)
@weihanglo
Contributor Author

Oops, the assumption is a bit too aggressive to cover the non-binary patch portion.

@weihanglo weihanglo marked this pull request as draft April 14, 2026 05:25
Returns the longest valid UTF-8 prefix of the input.

This will be used to safely convert binary patch data to
`&str` without validating the entire remaining input.
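
One way to get this behavior from the standard library alone is `Utf8Error::valid_up_to`, sketched below. The function name `valid_utf8_prefix` is hypothetical and the commit's actual helper may differ, for instance by avoiding the second validation pass over the prefix.

```rust
/// Hypothetical helper: returns the longest prefix of `bytes` that is
/// valid UTF-8. Validation stops at the first invalid byte.
fn valid_utf8_prefix(bytes: &[u8]) -> &str {
    match std::str::from_utf8(bytes) {
        Ok(s) => s,
        Err(e) => {
            // `valid_up_to()` is the index of the first invalid byte, so
            // everything before it is guaranteed to be valid UTF-8.
            std::str::from_utf8(&bytes[..e.valid_up_to()])
                .expect("prefix up to valid_up_to() is valid UTF-8")
        }
    }
}
```

A truncated multi-byte sequence at the end of the input is also excluded from the prefix.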
We assume that the file path from the hunk header is UTF-8.
(This is actually not true, but we can leave it for a future fix.)

Only the `str` constructor and Iterator impl are exposed for now.
`parse_bytes` support comes in a follow-up commit.
Avoid lossy UTF-8 conversion of diff output
so non-UTF8 content round-trips correctly.
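
To illustrate why the lossy conversion breaks round-tripping: `String::from_utf8_lossy` replaces each invalid sequence with U+FFFD, so the original bytes cannot be recovered. The helper name `lossy_round_trips` below is hypothetical.

```rust
/// Hypothetical check: does a lossy UTF-8 conversion preserve `bytes`?
fn lossy_round_trips(bytes: &[u8]) -> bool {
    // Invalid sequences become U+FFFD (3 bytes in UTF-8), which changes
    // the byte content whenever the input is not valid UTF-8.
    String::from_utf8_lossy(bytes).as_bytes() == bytes
}
```

A Latin-1 encoded `café` (`b"caf\xe9"`) is a typical input that fails this check.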

No more skips in GitDiff mode (except submodule of course)
@weihanglo force-pushed the patchset-raw-bytes branch from 68431bf to 05cdc37 (April 14, 2026 05:38)
@weihanglo
Contributor Author

Oops, the assumption is a bit too aggressive to cover the non-binary patch portion.

Fixes with 33b4e38

@weihanglo weihanglo marked this pull request as ready for review April 14, 2026 12:04
Development

Successfully merging this pull request may close these issues.

PatchSet doesn't support raw bytes

1 participant