Rust stage 3a diff compressor by chopratejas · Pull Request #276 · chopratejas/headroom

chopratejas · 2026-04-26T04:47:01Z

Description

Brief description of changes and motivation.

Fixes #(issue number)

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Performance improvement
Code refactoring (no functional changes)

Changes Made

Change 1
Change 2
Change 3

Testing

Describe the tests you ran to verify your changes:

Unit tests pass (pytest)
Linting passes (ruff check .)
Type checking passes (mypy headroom)
New tests added for new functionality
Manual testing performed

Test Output

# Paste relevant test output here
pytest -v tests/test_your_feature.py

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have updated the CHANGELOG.md if applicable

Screenshots (if applicable)

Add screenshots to help explain your changes.

Additional Notes

Any additional information that reviewers should know.

Stage 3a: first real transform port. Faithful Rust port of `headroom.transforms.diff_compressor` with byte-equal parity against all 20 recorded fixtures. # Algorithm (matching Python) 1. Hand-rolled unified-diff parser (state machine over `diff --git`, `index`, `--- a/`, `+++ b/`, `@@`, mode/binary/rename markers, +/- / space lines, "other" lines like `\ No newline at end of file`). 2. File cap (`max_files=20`): when fired, sort by total changes (most first) and keep top N. 3. Per-file hunk cap (`max_hunks_per_file=10`): keep first + last + top relevance-scored middle, then resort by hunk-header start line to restore appearance order. 4. Relevance scoring: change-density base + user-query word overlap + priority patterns (ERROR / IMPORTANCE / SECURITY regexes — matches `error_detection.PRIORITY_PATTERNS_DIFF`). 5. Per-hunk context trim: keep `max_context_lines=2` lines either side of each `+`/`-` line. 6. CCR cache_key: `md5(original)[:24]` (matches `compression_store.CompressionStore.store`). Emitted only when compression saved >20% of lines. Parity result: `[diff_compressor ] total=20 matched=20 skipped=0 diffed=0`. # Information preservation hardening Three pass-through paths inherited from Python that we keep deliberate (would lose info if we changed them): - Below `min_lines_for_ccr` (50): return input unchanged. - No diff sections parsed: return input unchanged. - Below 20% compression savings: emit compressed output but no CCR marker (the original is the cheaper representation anyway). Plus a parity-bound subtlety: `compressed_line_count` is captured BEFORE the CCR retrieval marker is appended, both for the marker text (`compressed to N`) and the result field. The output string therefore ends up with one more line than the field reports — by design, matching Python exactly. An off-by-one bug from recounting after appending the CCR marker was caught and pinned by a synthetic 8-file diff test. # Observability — the Rust escape hatch Python's `DiffCompressionResult` has thin observability: input/output line counts, additions/deletions, hunks_kept/removed, files_affected, cache_key. The Rust port adds a sidecar `DiffCompressorStats` struct with metrics Python doesn't emit: - `files_dropped: Vec<String>` — names (old → new path) of files silently discarded by the `max_files` cap. Python loses these. - `hunks_dropped_per_file: BTreeMap<String, usize>` — per-file hunk drops, stable iteration via `BTreeMap`. - `context_lines_input` / `context_lines_kept` / `context_lines_trimmed` — directly proxies info loss from the context trim. - `largest_hunk_kept_lines` / `largest_hunk_dropped_lines` — outlier detection (a single huge dropped hunk is much worse than many small). - `parse_warnings: Vec<String>` — surfaces malformed input rather than dropping silently. - `processing_duration_us` — latency budget. - `cache_key_emitted` + `ccr_skipped_reason: Option<String>` — explicit signal for "we chose not to emit CCR and this is why". A `tracing::info!(target: "diff_compressor", ...)` event is emitted on every call, carrying these fields for OTel scraping in prod. The sidecar struct is returned alongside via `compress_with_stats`; the parity-only `compress` API discards it. # Module layout - `crates/headroom-core/src/transforms/mod.rs` — namespace, doc comment with the guiding principle ("information preservation > aggressive compression") so future ports inherit the philosophy. - `crates/headroom-core/src/transforms/diff_compressor.rs` — full port (parser, scorer, hunk selector, context trimmer, formatter, CCR layer, stats, tracing). # Dependencies added to headroom-core - `md-5 = "0.10"` — for the CCR cache_key (matches Python MD5[:24]). - `regex = "1"` — was a transitive dep via tokenizers; now a direct dependency for the hunk-header parser and priority patterns. # Tests 6 unit tests covering pass-through paths, MD5 hex truncation, the Python `split("\n")` line-count semantics, sidecar stats emission, and a synthetic 8-file diff that locks the byte-equal behavior found in the parity fixtures.

… stats Audit follow-up to the diff_compressor port. The parity-bound port had several issues that the byte-equal harness can't catch: hardcoded magic numbers, score-weight literals scattered through the scorer, and silent information-loss paths (file mode + binary detail) with no observability trail. Parity is preserved (still 20/20 byte-equal); these are quality and observability fixes only. # Changes 1. **Hardcoded `0.8` savings threshold → config knob.** New `DiffCompressorConfig::min_compression_ratio_for_ccr: f64` (default 0.8) replaces the literal. Rust-only — Python hardcodes 0.8 too; fixtures don't carry the field, parity comparator defaults to 0.8. 2. **Score-weight magic numbers → documented `pub const`.** The scorer's `0.03`, `0.3`, `0.2`, `0.3`, `1.0`, and `len > 2` filter are now `SCORE_CHANGE_DENSITY_WEIGHT`, `SCORE_CHANGE_DENSITY_CAP`, `SCORE_CONTEXT_WORD_WEIGHT`, `SCORE_CONTEXT_MIN_WORD_LEN`, `SCORE_PRIORITY_PATTERN_BOOST`, `SCORE_TOTAL_CAP`. Each carries a doc comment explaining what it biases toward. A pinning test fails if anyone changes the value without updating the test. 3. **File mode normalization is now visible.** Parity forces emit to hardcode `100644` regardless of input mode (`100755` executable, `120000` symlink, `160000` submodule). The parser now captures the ORIGINAL mode line; if it's not `100644`, the loss is recorded in `DiffCompressorStats::file_mode_normalizations: Vec<(label, original_mode_line)>`. Empty when no normalization happened. 4. **Binary file detail loss is now visible.** Same pattern: parity forces `Binary files differ` on emit, dropping the original `Binary files X and Y differ` filenames. Captured in `DiffCompressorStats::binary_files_simplified: Vec<String>` whenever the original line carried richer detail. 5. **`min_lines_for_ccr` doc clarified.** Despite the name, it gates the ENTIRE compression path, not just the CCR marker. A 49-line diff is returned unchanged regardless of how compressible it is. Doc comment now flags this misnomer prominently. 6. **6 new unit tests cover the lossy paths.** Previously the test suite only exercised pass-through and the in-cap synthetic. New tests: - `max_hunks_per_file_cap_drops_excess_and_records_stats` (15 hunks, cap 10, verify 5 dropped + per-file accounting) - `max_files_cap_drops_files_and_records_names_in_stats` (25 files, cap 20, verify 5 dropped names recorded) - `file_mode_normalization_is_recorded_for_executable_bit` (input `new file mode 100755`, verify stats capture original) - `binary_files_simplification_is_recorded` (input `Binary files a/x and b/x differ`, verify original captured) - `min_compression_ratio_for_ccr_is_configurable` (default 0.8 vs custom 0.5; same input, different CCR emit decision) - `score_constants_match_inline_values` (pins constants so future changes are intentional) Stats fields are also reflected in the existing `tracing::info!` event so OTel scrapers see the loss counts in production. # Why these matter — the user-stated principle "Information preservation > aggressive compression." When parity forces us to drop bytes (file mode, binary detail, fully-dropped files, trimmed context), the loss must be observable. Adding it to the compressed output bytes would break parity; surfacing it in `DiffCompressorStats` + tracing spans is the right escape hatch. # Verification - 20/20 parity fixtures still byte-equal - 12/12 unit tests pass (6 new + 6 prior) - 56 total tests in headroom-core, all green - `cargo fmt --check` clean - `cargo clippy --workspace --all-targets -- -D warnings` clean

…D Rust Audit caught four bugs that the byte-equal parity harness can't catch on its own — both Python and Rust were faithfully emitting the buggy output. Fixed in lockstep so parity is maintained while the underlying behavior is now correct on inputs the existing 20 fixtures didn't exercise. # The four bugs (each fixed in both Python and Rust) 1. Renames silently dropped from output. Parser captured `is_renamed=True` but the emitter never emitted ANY rename markers. Output of a rename looked exactly like a plain modification of the old path. Fix: capture `rename from` / `rename to` / `similarity index N%` / `dissimilarity index N%` / `copy from` / `copy to` lines in a new `rename_lines` field on `DiffFile`; emit them after `diff --git` in canonical git ordering. 2. Combined diff hunks (`@@@`) silently dropped. Hunk-header regex only matched `@@`, so 3-way merge hunks had `current_hunk` never set and ALL their content fell through to the no-op branch. Fix in Python: regex switched to `^(@@+) ... \1` (backreferences match any number of `@`s on each side). Fix in Rust: alternation over `@@`, `@@@`, `@@@@` since `regex` is RE2-based and rejects backreferences. n>3 octopus merges still fall through; rare in practice. 3. `\ No newline at end of file` markers can be context-trimmed away. Treated as ordinary "other" lines — if more than `max_context_lines` from a `+`/`-` change, dropped. Round-trip-breaking for patches; can change whether the trailing line has a newline. Fix: in `_reduce_context`, force-add any line starting with `\` to the keep set regardless of distance. 4. Pre-diff content silently dropped. Anything before the first `diff --git` — commit messages from `git log -p`, email headers from `git format-patch`, fork-and-rebase metadata — was discarded. Fix: `_parse_diff` now returns `(pre_diff_lines, files)`; `format_output` prepends pre-diff content verbatim when present. # Hidden parity bug found during the work `_compress_files` constructed a fresh `DiffFile` from the parsed one but only copied a subset of the fields by name. The new `rename_lines` and `original_*_line` fields were silently dropped here, so the parser populated them correctly but the emitter saw an empty `rename_lines` list. Caught by writing a real test instead of a smoke test — the smoke test passed because it hit the no-diff-found short-circuit, not the parser/emitter pipeline. Constructor now copies all fields explicitly. # Parity status - Existing 20 fixtures: still byte-equal between fixed Python and fixed Rust. None of them exercised the buggy paths. - 4 NEW fixtures recorded against fixed Python, exercising each bug-fix path: rename, 3-way combined diff, `\ No newline` marker far from changes, pre-diff commit headers. All 4 byte-equal between Python and Rust. - Parity harness: total=24 matched=24 skipped=0 diffed=0. # Observability Some normalizations remain parity-bound (file mode `100644` hardcode, `Binary files differ` simplification). Those are surfaced in `DiffCompressorStats::file_mode_normalizations` / `binary_files_simplified` (Rust) and via `logger.warning` (Python's new `_log_loss_signals` helper, called once per compress). # Tests - Python: 4 new test classes (11 tests) covering rename markers, combined diffs, no-newline preservation, pre-diff content. Edge case: no pre-diff content must NOT add a leading blank line. - Rust: 4 new `bugfix_*` unit tests with the same scenarios. - Existing Python tests calling `_parse_diff` directly were updated for the new `(pre_diff, files)` tuple return. # Verification - Python: 37/37 tests pass (was 26). - Rust: 16/16 transforms tests; 60/60 workspace unit tests; 5/5 proptests; 1/1 doctest. - Parity: 24/24 byte-equal. - `cargo fmt --check` clean; `cargo clippy --workspace --all-targets -- -D warnings` clean.

…preambles User audit caught three gaps that prevented DiffCompressor from being invoked even when the input was a real diff. These complement the four emit-time bugs fixed in the previous commit — those fixes only kick in once DiffCompressor receives the input. Without these gap fixes, real merge-commit diffs and `git log -p` outputs with long commit messages were misrouted away from DiffCompressor entirely. # The three gaps (each fixed in Python; gap 3 also fixed in Rust) 1. Detector scan window was hardcoded to first 50 lines. `_try_detect_diff` in content_detector.py only inspected `content.split("\n")[:50]`. `git log -p` outputs commonly have commit messages longer than 50 lines (releases, squashed commits, bots), pushing the `diff --git` header out of the detection window. Result: input was returned with `content_type=PLAIN_TEXT` and routed to the text compressor, never reaching DiffCompressor. Fix: window widened to 500 lines. 2. Detector regex didn't recognize merge-commit headers. `_DIFF_HEADER_PATTERN` matched `diff --git`, `--- a/`, and the regular `@@ -A,B +C,D @@` hunk header. Merge-commit diffs from `git log -p` use `diff --combined <path>`, `diff --cc <path>`, and combined-diff hunk headers `@@@+`. The shared `--- a/` line still triggered the detector with low confidence, but only barely. Fix: extended the regex to recognize all four merge-shaped header forms. 3. DiffCompressor parser only matched `^diff --git`. Even after fixing detection, the parser's `_DIFF_GIT_PATTERN` wouldn't match `diff --combined` or `diff --cc`, so merge diffs reached DiffCompressor and were treated as one giant pre-diff blob — passed through unchanged after the previous PR's pre-diff preservation fix. Fix: added `_DIFF_COMBINED_PATTERN` and `_DIFF_CC_PATTERN`; `_parse_diff` starts a new file section on any of the three header forms. Mirrored in Rust as `is_diff_header` helper that checks all three regexes. # Why this matters end-to-end DiffCompressor's value comes from being routed to. Detection + parser-level coverage are upstream of the compressor — without them, the compressor never sees the input. The previous PR's four bug fixes (rename, combined-diff hunks, no-newline marker, pre-diff content) are correct and necessary, but for merge commits and long-preamble diffs, they were only firing on the rare cases where the detector misclicked into DiffCompressor anyway. With these three gaps closed, the ContentRouter→DiffCompressor pipeline actually engages on: - `git log -p` outputs of any commit-message length - Merge-commit diffs (`diff --combined`, `diff --cc`) - Combined-diff snippets (`@@@`+ hunk-only inputs) # New fixtures (3 added to the existing 24) - `066bc82…` — `diff --combined` merge diff (3-way) - `5d950a94…` — `diff --cc` merge diff (alternate form) - `66c86f64…` — long pre-diff content (60-line commit message) followed by a rename diff (exercises detector scan widening + pre-diff preservation in tandem) Parity: total=27 matched=27 skipped=0 diffed=0. # Tests - Python: 4 new tests across 2 new test classes — `TestRoutingGapMergeDiffs` (combined / cc parser) and `TestRoutingGapDetectorScanWindow` (long preamble detection + combined-diff regex recognition). - Rust: 2 new tests covering combined / cc parser sections. # Verification - 27/27 parity fixtures byte-equal. - Python: 41/41 tests pass (was 37). - Rust: 18/18 transforms tests; 62/62 workspace; 5/5 proptests. - `cargo fmt --check` clean; `cargo clippy --workspace --all-targets -- -D warnings` clean.

chopratejas · 2026-04-26T16:32:22Z

Superseded by #277 (3a+3b combined into one PR going to main).

chopratejas mentioned this pull request Apr 26, 2026

rust stage 3 (a+b): diff_compressor port + retire python via pyo3 + dotenv test hygiene #278

Closed

5 tasks

chopratejas added 4 commits April 26, 2026 09:13

chopratejas force-pushed the rust-stage-3a-diff-compressor branch from dba733d to 48c1324 Compare April 26, 2026 16:15

chopratejas mentioned this pull request Apr 26, 2026

rust stage 3 (a+b): diff_compressor port + retire python via pyo3 + dotenv test hygiene #277

Merged

5 tasks

chopratejas closed this Apr 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Rust stage 3a diff compressor#276

Rust stage 3a diff compressor#276
chopratejas wants to merge 4 commits intomainfrom
rust-stage-3a-diff-compressor

chopratejas commented Apr 26, 2026

Uh oh!

chopratejas commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

chopratejas commented Apr 26, 2026

Description

Type of Change

Changes Made

Testing

Test Output

Checklist

Screenshots (if applicable)

Additional Notes

Uh oh!

chopratejas commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant