Skip to content

Rust stage 3c smart crusher#282

Open
chopratejas wants to merge 21 commits intomainfrom
rust-stage-3c-smart-crusher
Open

Rust stage 3c smart crusher#282
chopratejas wants to merge 21 commits intomainfrom
rust-stage-3c-smart-crusher

Conversation

@chopratejas
Copy link
Copy Markdown
Owner

Description

Brief description of changes and motivation.

Fixes #(issue number)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Performance improvement
  • Code refactoring (no functional changes)

Changes Made

  • Change 1
  • Change 2
  • Change 3

Testing

Describe the tests you ran to verify your changes:

  • Unit tests pass (pytest)
  • Linting passes (ruff check .)
  • Type checking passes (mypy headroom)
  • New tests added for new functionality
  • Manual testing performed

Test Output

# Paste relevant test output here
pytest -v tests/test_your_feature.py

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have updated the CHANGELOG.md if applicable

Screenshots (if applicable)

Add screenshots to help explain your changes.

Additional Notes

Any additional information that reviewers should know.

chopratejas and others added 21 commits April 26, 2026 16:45
Stage 3c.1 — like-for-like Rust port of `headroom/transforms/smart_crusher.py`.
This commit lays the foundation: module layout, configuration, foundational
data types, and the simpler helpers (classification, hashing, anchors,
basic statistics). Subsequent commits add the analyzer, crushers, plan
execution, and the orchestrator.

# What's in this commit

`crates/headroom-core/src/transforms/smart_crusher/`:
- `mod.rs` — module entry, public re-exports, port narrative.
- `classifier.rs` — `classify_array` / `ArrayType` (dict/string/number/
  bool/nested/mixed/empty). Direct port of `_classify_array`.
- `config.rs` — `SmartCrusherConfig` with defaults pinned to Python
  byte-for-byte.
- `hashing.rs` — `hash_field_name` (SHA-256 truncated to 16 hex chars),
  matches `hashlib.sha256(name.encode()).hexdigest()[:16]` exactly.
- `statistics.rs` — `is_uuid_format`, `calculate_string_entropy`,
  `detect_sequential_pattern` (with **BUG #2 fix** — see below).
- `anchors.rs` — `extract_query_anchors`, `item_matches_anchors`. Five
  regex patterns ported via `std::sync::LazyLock`.
- `types.rs` — `CompressionStrategy`, `FieldStats`, `CrushabilityAnalysis`,
  `ArrayAnalysis`, `CompressionPlan`, `CrushResult`. Field-by-field
  mirror of the Python @dataclasses so the PyO3 bridge in 3c.1b can
  reconstruct them without manual translators.

# Bug #2 fixed in this commit (Python fix lands later in same PR)

`smart_crusher.py:444-448` — `_detect_sequential_pattern` calls
`int(string_value)` and silently strips zero-padding, so padded string
IDs like `["001", "002", ..., "100"]` get misclassified as a sequential
numeric pattern. Fix: track whether each parsed numeric value
originated as a string. If EVERY parsed value was a string, refuse to
flag as sequential. Mixed numeric+string fields still detect
correctly because the unambiguous numerics dominate. Test:
`bug2_zero_padded_strings_no_longer_misclassified`.

# What's NOT in this commit (subsequent commits)

- `SmartAnalyzer` — `analyze_array`, `_analyze_field`, `_detect_change_points`,
  `_detect_pattern`, `_detect_temporal_field`, `analyze_crushability`,
  `_select_strategy`, `_estimate_reduction`.
- The five array crushers (`_crush_array`, `_crush_string_array`,
  `_crush_number_array`, `_crush_mixed_array`, `_crush_object`).
- Planning (`_compute_k_split`, `_create_plan`, `_plan_*` family).
- Orchestration (`_prioritize_indices`, `_deduplicate_indices_by_content`,
  `_fill_remaining_slots`).
- `SmartCrusher` orchestrator class itself.
- Parity harness fixtures.
- The remaining 3 Python bug fixes (#1, #3, #4) — landed alongside the
  code paths they affect.

# Build / test

- `cargo build -p headroom-core` — clean.
- `cargo clippy -p headroom-core -- -D warnings` — clean.
- 55 new unit tests across the 6 new files, all passing.

Architectural improvements (lossless-first, unified saliency score,
structured CCR markers) are deferred to Stage 3c.2 — see design doc at
`~/Desktop/SmartCrusher-Architecture-Improvements.md`.
…int parse, python-repr matcher

Code review (`/code-review` on commit `d219bee`) caught one critical
bug, two important parity gaps, and a few quality nits. Fixed all of
them; all 135 unit tests pass; diff_compressor parity harness
unaffected (27/27 still matched).

# Critical fix — `hash_field_name` truncation length

Rust truncated SHA-256 to **16** hex chars; Python uses **8** (per
`smart_crusher.py:177`: `hashlib.sha256(...).hexdigest()[:8]`). 16-char
hashes would never collide with TOIN's 8-char `preserve_fields`,
silently disabling the entire `use_feedback_hints` cache lookup path.

Fix: `hex[..8]` instead of `hex[..16]`. Three pinning tests re-verified
against actual Python reference output. Doc comment now warns
explicitly that the length must match Python or TOIN lookups silently miss.

# Important fix — `python_int_parse` mirrors Python's `int()` semantics

`statistics.rs::detect_sequential_pattern` previously called
`s.parse::<i64>()`. Python's `int()` differs in three ways that affect
realistic payloads:
  - strips ASCII whitespace (Rust's `parse` rejects)
  - accepts leading `+` (Rust accepts; same)
  - accepts PEP 515 underscores like `"3_000"` (Rust rejects)

A field with `["  1  ", "  2  ", " 3 ", "4", "5"]` would parse all five
in Python (sequential = True) but only one in Rust (`nums.len() < 5`
→ False). Silent parity break.

Fix: new private `python_int_parse` helper that strips whitespace,
handles underscore separators, and rejects edge cases Python rejects.
Six new tests pin the behavior.

# Important fix — `python_repr` for `item_matches_anchors`

Python compares anchors via `anchor in str(item).lower()`. We were
using `serde_json::to_string(&item).to_lowercase()`, which differs in
three ways that affect substring matching:
  - quote chars (`'` vs `"`)
  - bool/null literals (`True`/`False`/`None` vs `true`/`false`/`null`)
  - spacing (`key: value, ...` vs `key:value,...`)

Anchor `"none"` would match Python form but not JSON. Inverse for
`"null"`. Real divergence.

Fix: new private `python_repr` walks `serde_json::Value` and emits
Python-equivalent form. Plus enable `serde_json/preserve_order` at
workspace level so `Value::Object` preserves JSON parse order
(matching Python `dict` since 3.7).

# Suggestion fixes

- Classifier comment for `[True, False, 1] -> MIXED_ARRAY` now walks
  both Python and Rust paths step by step.
- `ArrayAnalysis::field_stats` doc notes the BTreeMap vs Python-dict
  order nuance for the analyzer port to resolve.
- Added regression tests for "all unparseable strings", "single int
  among strings", fractional-step sequential, and the email-typo
  pattern.

# Build / test

- `cargo build -p headroom-core` clean.
- `cargo clippy -p headroom-core -- -D warnings` clean.
- 135 unit tests in `headroom-core`, all passing (was 55).
- `cargo run -p headroom-parity run` — diff_compressor 27/27 still matched.
…rs + bug #3 fix

Stage 3c.1 commit 2 of ~7. Ports the standalone analyzer helpers from
Python `smart_crusher.py:484-748` and applies the BUG #3 fix to
`detect_rare_status_values` in Rust. The matching Python fix lands
later in this PR (before parity fixtures are recorded).

# What's added

- `error_keywords.rs` — pinned set of 12 error keywords ported from
  `headroom/transforms/error_detection.py:18-33`, with three tests
  pinning length, casing, and exact membership so accidental edits
  surface in CI.

- `field_detect.rs` —
  - `detect_id_field_statistically` (Python `smart_crusher.py:484-530`):
    high-uniqueness field heuristic with UUID-format and entropy paths
    for strings, sequential and high-range paths for numerics. 5 tests.
  - `detect_score_field_statistically` (Python lines 533-603): bounded-
    range numeric with descending-sort signal. Sequential rejection
    matches the Python intent (IDs are sequential, scores aren't). The
    Python `[-1, 1]` chained-comparison precedence is faithfully
    preserved (`min_val >= -1.0 && max_val <= 1.0`). 7 tests including
    confidence-cap and unbounded-range rejection.

- `outliers.rs` —
  - `detect_structural_outliers` (Python lines 606-650): rare-field +
    rare-status detection. Returns ascending-sorted deduplicated indices
    via BTreeSet (Python uses `list(set(...))` with non-deterministic
    order — pinning sorted order makes parity fixtures stable).
  - `detect_rare_status_values` (Python lines 653-701) **with BUG #3
    fix** — see below.
  - `detect_error_items_for_preservation` (Python lines 711-748): scans
    item JSON for any of the 12 ERROR_KEYWORDS. 6 tests.

# BUG #3 fix — `detect_rare_status_values`

Python's original guard at line 674
`if not (2 <= len(unique_values) <= 10): continue`
caps cardinality at 10, so error-code domains with >10 codes are
skipped entirely. A rare error like "ERR_TIMEOUT" appearing 1% of the
time is missed when the field has 50+ distinct codes.

The fix replaces the cap-and-dominance approach with a Pareto check:

  1. Cardinality cap raised to **50** (above which the field is almost
     certainly an ID/free-form column, not a status enum).
  2. Sort value frequencies descending. Find the smallest K such that
     top-K covers >=80% of items.
  3. If K <= 5, remaining values are "rare" → items containing them
     are outliers.

This unifies both cases the original algorithm partially handled:

  - Low cardinality + dominant: 95×ok + 5 errors → top-1 covers 95% →
    4 rare values flagged. Same as before.
  - Higher cardinality + bimodal: 60×info + 25×warn + 15 distinct rare
    errors → top-2 covers 85% → 15 rare values flagged. **New** —
    pre-fix this returned 0.
  - Uniform distribution: 50 distinct values, 1 each → top-K never
    reaches 80% with K <= 5 → skip. Correctly identifies as
    non-categorical.

Three regression tests pin each case. The matching Python fix will
land later in this PR before parity fixtures are recorded so the
fixtures byte-match.

# What's NOT in this commit

`detect_items_by_learned_semantics` (Python lines 751-829) is deferred
to a later commit because it depends on the TOIN `FieldSemantics`
type, which isn't ported yet. That helper integrates learned cross-
session compression patterns and isn't on the critical path for the
parity-port milestone.

The next commit lays down the `SmartAnalyzer` struct itself
(`analyze_array`, `_analyze_field`, `_detect_change_points`,
`_detect_pattern`, `_detect_temporal_field`, `analyze_crushability`,
`_select_strategy`, `_estimate_reduction`).

# Build / test

- `cargo build -p headroom-core` clean.
- `cargo clippy -p headroom-core -- -D warnings` clean.
- 167 unit tests in `headroom-core` (was 135).
- `cargo run -p headroom-parity run` — diff_compressor 27/27 still
  matched.
… crushability

Port Python's `SmartAnalyzer` class (smart_crusher.py:960-1489) to Rust.
All eight methods land here:

- analyze_array — top-level orchestrator: builds field stats, detects
  pattern, runs crushability, picks strategy, estimates reduction.
- analyze_field — per-field stats: type, count, uniqueness, plus
  type-specific (numeric: min/max/mean/variance/change_points; string:
  avg_length, top-5 frequencies).
- detect_change_points — sliding-window mean-shift detector for numeric
  fields; mirrors Python's deduped greedy walk with `> threshold` test.
- detect_pattern — classifies as time_series / logs / search_results /
  generic via structural signals (no field-name heuristics).
- detect_temporal_field — ISO 8601 prefix checks (handcoded byte-level
  to avoid a regex per call) + Unix epoch range checks.
- analyze_crushability — six-case decision tree with explicit signal
  list (id/score/structural-outlier/keyword-error/anomaly/change-points).
- select_strategy — strategy picker honoring `min_items_to_analyze`,
  `crushable=False` skip, and pattern-specific selections.
- estimate_reduction — coarse base+constant-ratio heuristic, capped 0.95.

Supporting pieces:

- stats_math.rs: mean / sample_variance / sample_stdev with n-1
  denominator (matches Python's `statistics` module).
- python_repr helper: str(v) parity for None/True/False/numbers/strings,
  used by _analyze_field's uniqueness count.
- top_n_by_count helper: Counter.most_common(n) parity with
  first-occurrence tie-break (mirrors Python dict insertion order).

Field iteration order: BTreeMap gives ASCII-sorted iteration. Python's
set-based for-key-in-all_keys is non-deterministic; Python source
needs a sorted(all_keys) for parity, scheduled for commit 7
(fixture regeneration alongside bug fixes).

Tests: 24 new unit tests covering empty/non-dict guards, numeric/string/
constant analyze_field paths, change-point detection on three-segment
step functions, log/generic pattern detection, ISO datetime + Unix
epoch temporal detection, six crushability cases, all strategy
selection branches, and reduction edge cases.

Net: 205 unit tests passing, clippy clean, parity harness still 4/4.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
… path

Self-review of commit c26d225 found one real divergence from Python:

Python's `_analyze_field` numeric block is wrapped in
`try/except (OverflowError, ValueError)`. On overflow (e.g., variance
of `[1e200, -1e200]`), the except branch resets:

  stats.min_val = None
  stats.max_val = None
  stats.mean_val = None
  stats.variance = 0          # ← int literal, not None
  stats.change_points = []

Rust was silently propagating Inf/NaN. Two-part fix:

1. `stats_math.rs::{mean,sample_variance,sample_stdev}` now return None
   when the result is non-finite. This mirrors Python's exception path
   for the overflow-on-extreme-floats case.

2. `analyzer.rs::analyze_field` numeric branch now detects any non-
   finite stat (or None from helpers) and resets the entire group
   atomically. Variance is reset to `Some(0.0)` to match Python's
   `variance = 0` literal — downstream truthiness checks
   (`if stats.variance:` and `(variance or 0) > 0`) treat 0 the same
   as None, but the FieldStats serialization shape will matter for
   the parity fixtures landing in commit 7.

Three new tests pin the new behavior:

  - stats_math::mean_non_finite_overflow_returns_none
  - stats_math::sample_variance_non_finite_returns_none
  - stats_math::sample_stdev_non_finite_returns_none
  - analyzer::analyze_field_numeric_overflow_resets_all_stats_to_none

Other findings from the review were documented as either deferred to
Stage 3c.2 (bool-as-numeric anomalies, exotic float formatting) or
pinned to Stage 3c.1 commit 7 (Python sorted-key iteration alongside
fixture regeneration). Each is exotic enough that real fixtures won't
trip it before the Python source fixes land.

Net: 147 unit tests passing in smart_crusher, clippy clean.
Direct port of headroom/transforms/adaptive_sizer.py. This is the
prerequisite that smart_crusher's array crushers (string, number,
object, mixed, dict) all depend on for "how many items to keep" via
information-saturation detection.

Six pieces, each with verified-against-Python tests:

1. simhash(text) -> u64
   Character 4-gram window. Hash each gram with MD5; take first 8 bytes
   as big-endian u64 (matches Python's int(hexdigest()[:16], 16)).
   Per-bit weighted voting; final fingerprint is bit j set iff votes[j]>0.
   Iterates by Unicode codepoints (chars), not bytes — matches Python str
   slicing. Pinned values verified for "", "a", "abcd", "hello",
   "hello world", "café" (UTF-8 multibyte), and a longer sentence.

2. hamming_distance(a, b) -> u32
   XOR + count_ones. Trivial.

3. count_unique_simhash(items, threshold) -> usize
   Greedy clustering of fingerprints by Hamming distance. First-fit
   matches Python's append-only iteration order.

4. compute_unique_bigram_curve(items) -> Vec<usize>
   Whitespace-split lowercase words. Single-word items contribute
   (word, ""); empty-string items contribute ("", ""). Returns
   cumulative unique-bigram count after each item.

5. find_knee(curve) -> Option<usize>
   Kneedle in normalized [0,1] space. Returns knee_idx + 1 (matching
   Python's "include up to and including" semantics) when max_diff
   exceeds 0.05; else None. Flat curves return Some(1).

6. compute_optimal_k(items, bias, min_k, max_k) -> usize
   Three-tier orchestrator: fast path (n<=8 → n, or unique<=3),
   Kneedle on bigram curve with diversity floor when ratio>0.7, no-knee
   fallback to keep_fraction = 0.3 + 0.7*diversity, bias multiplier,
   zlib validation, final clamp.

7. validate_with_zlib(items, k, max_k, tolerance)
   Compares full vs subset compression ratios. ratio_diff > tolerance
   → bump k by 20%. Uses flate2 default backend (miniz_oxide); per-byte
   length drift vs CPython libz absorbed by the 15% tolerance. If
   parity fixtures flake, swap to flate2 features = ["zlib"] for
   byte-equal output to system libz.

A counterintuitive case I verified: 20 identical short lines with k=5
DOES trigger the 20% bump (subset compresses LESS efficiently per byte
than the full text — zlib has less context). Pinned in
validate_zlib_bumps_k_when_subset_undercompresses; confirmed Python
returns the same 6.

Net: 33 new tests in adaptive_sizer (242 total in headroom-core),
clippy clean, parity harness intact.

Next commit: three array crushers (string, number, object) + bug #1
(percentile off-by-one) using compute_optimal_k.
Three crushers from headroom/transforms/smart_crusher.py ported.
Each takes a SmartCrusherConfig + bias and returns
(crushed_items, strategy_string). All schema-preserving — output is
items/values from the original; no generated text.

What's in:

1. compute_k_split (smart_crusher.py:2693)
   Wraps adaptive_sizer::compute_optimal_k. Splits k_total into
   first/last/importance via config.first_fraction / last_fraction.
   Uses f64::round_ties_even() (Rust 1.77+) to match Python's
   banker's-rounding round() — important for off-by-one parity on
   .5-edged k computations.

2. crush_string_array (smart_crusher.py:2727)
   Adaptive K via Kneedle. Mandatory-keep: error-keyword strings +
   length-anomaly strings (>variance_threshold σ from mean length).
   Boundary-keep: first K_first + last K_last. Stride-based diverse
   fill with content-dedup. Output preserves original array order
   (BTreeSet iteration). Strategy includes dedup= and errors= counts
   when nonzero.

3. crush_number_array (smart_crusher.py:2810) — CARRIES BUG #1
   Statistics-driven (mean/median/stdev/p25/p75). Outliers flagged
   at variance_threshold σ. Change-points via window-mean comparison
   (config.preserve_change_points + n>10 gates). Strategy string
   embeds full stats summary via format_g (Python's :.4g approximation).
   BUG #1 — percentile off-by-one — ported AS-IS:
   sorted_finite[len/4] / sorted_finite[3*len/4]. Cosmetic
   (strategy-string only). Test bug1_percentile_off_by_one_documented
   pins the buggy index choice; commit 7 fixes both languages and
   regenerates fixtures.

4. crush_object (smart_crusher.py:3015)
   Token-budget gate (config.min_tokens_to_crush=200). Three
   passthrough exits: n<=8, total tokens too low, k_total>=n. Always
   keeps: error-keyword values + small values (<=12 tokens via
   len/4 + len/4 + 2 heuristic). Boundary keys + stride fill with
   Python's recompute-each-iter cap (mirrored faithfully — slower
   but parity-true). Output preserves key insertion order via
   serde_json/preserve_order's IndexMap.

Supporting helpers in stats_math.rs:
  - median(values) — Python statistics.median (mean-of-middles for
    even, total_cmp sort for NaN determinism).
  - format_g(x) — approximate Python f"{x:.4g}" (4 sig figs,
    scientific outside [-4, 4) exponent range, trailing-zero strip,
    explicit-sign 2-digit exponent). Pinned by 5 fixed-output tests.

Field iteration order: key/object iteration uses BTreeMap-sorted (in
analyzer) and IndexMap-insertion-order (in serde_json::Map for
crush_object). The Python sorted-key fix scheduled for commit 7 also
covers crush_object's iteration paths.

Net: 266 unit tests passing in headroom-core, clippy clean (MSRV 1.80),
parity harness intact (4/4 diff_compressor).

Next commit: planning + execution layer (_create_plan, _execute_plan,
plan-builder methods) with BUG #4 fix (k-split overshoot).
Python's _compute_k_split (smart_crusher.py:2722) computes:

  k_first = max(1, round(k_total * first_fraction))
  k_last  = max(1, round(k_total * last_fraction))

For k_total=1, both round() results are 0 and both max(1, …) return
1, giving k_first + k_last = 2 > k_total = 1. The crusher then keeps
2 items when adaptive sizing said 1 — violating
max_items_after_crush whenever k_total floors to 1.

Fix: after the floored fractions, clamp:
  k_first = min(k_first, k_total)
  k_last  = min(k_last, k_total - k_first)

For k_total >= 2 (the common case) the clamp is a no-op — Python and
Rust agree byte-for-byte. For k_total <= 1 (the previously buggy
edge), Rust now keeps ≤ k_total items.

This is a one-sided fix: Rust correct, Python overshoots. The Python
fix lands in commit 7 alongside parity fixtures (matching the
already-applied pattern for bugs #2 and #3, both of which are also
fixed in Rust now and pending in Python at commit 7).

Real-world reachability: every crusher early-returns "passthrough" on
n <= 8 before compute_k_split is even called, so the only path that
reaches k_total=1 in production is via direct calls to compute_k_split
or compute_optimal_k. Defensive parity remains worth pinning.

Two new tests:
  - bug4_k_split_no_overshoot_when_k_total_is_one (the fix itself)
  - bug4_k_split_no_overshoot_when_k_total_is_two (boundary not regressed)

Net: 268 unit tests passing, clippy clean, parity harness intact.

Bug status:
  #1 percentile off-by-one — Rust faithful (port-as-is), both fixed at commit 7
  #2 sequential pattern int(zero-padded) — Rust fixed, Python at commit 7
  #3 rare-status cardinality cap — Rust fixed, Python at commit 7
  #4 k-split overshoot — Rust fixed (this commit), Python at commit 7
Direct port of headroom/transforms/anchor_selector.py (770 lines) plus
AnchorConfig from headroom/config.py. Used by SmartCrusher's planning
layer to allocate "anchor slots" — positions kept purely for their
location in the array (front/middle/back) before relevance scoring
fills in the rest.

Public surface:

- AnchorConfig: 16-field config struct with defaults pinned to Python.
- DataPattern: SearchResults / Logs / TimeSeries / Generic. from_string
  accepts case-insensitive strings; unknown strings fall through to
  Generic. Named from_string (not from_str) to avoid colliding with
  std::str::FromStr.
- AnchorStrategy: FrontHeavy / BackHeavy / Balanced / Distributed.
- AnchorWeights: front/middle/back distribution + normalize().
- AnchorSelector: stateless selector with .select_anchors() entry point.

Supporting helpers (all parity-pinned):

- calculate_information_score: weighted blend (0.4 uniqueness +
  0.3 length + 0.3 structural). Clamped [0,1].
- calculate_value_uniqueness / _length_score / _structural_uniqueness:
  per-factor scoring.
- compute_item_hash: md5(json.dumps(item, sort_keys=True))[:16] —
  byte-equal with Python.

Critical helper for hash parity: python_json_dumps_sort_keys.
Python's json.dumps default uses (', ', ': ') separators (with
spaces) and ensure_ascii=True (\\uXXXX escapes for non-ASCII,
surrogate pairs for codepoints > U+FFFF). serde_json's compact
default doesn't match. The custom serializer here pins both:

  json_dumps_basic              {"a": 2, "b": 1}
  json_dumps_non_ascii_escaped  {"k": "caf\\u00e9"}
  json_dumps_emoji_uses_surrogate_pair  {"k": "\\ud83d\\ude00"}
  compute_item_hash_matches_python_basic  → 8aacdb17187e6acf
  compute_item_hash_matches_python_with_unicode  → 6761da28ed7eb489

Mismatching the format would silently change which items count as
duplicates during region-based anchor selection, so this lays the
foundation for byte-equal parity in the planning layer.

Net: 31 new tests in anchor_selector (299 total in headroom-core),
clippy clean.

Next commits in scope: relevance scorer (BM25 → embeddings via ONNX
→ hybrid), then planning + execution + orchestration layers, then
Python lockstep bug fixes + parity fixtures.
Direct port of headroom/relevance/base.py (RelevanceScore + RelevanceScorer
trait) and headroom/relevance/bm25.py (BM25 keyword scorer).

base.rs:
- RelevanceScore: clamps score to [0,1] via new() (Python __post_init__).
  Plus empty(reason) helper for "no match" cases.
- RelevanceScorer trait: required score(item, context). Default
  score_batch falls back to per-item dispatch; concrete scorers override
  for vectorized/amortized batch impls.
- default_batch_score: free function fall-back for tests.
- is_available(): true by default; ONNX scorer (next commit) overrides.

bm25.rs:
- BM25Scorer with k1=1.5, b=0.75, max_score=10.0 defaults — pinned to
  Python.
- Tokenization via single regex with three alternatives (UUID first,
  then 4+ digit numeric ID, then alphanumeric+underscore). Order
  matters — UUIDs would otherwise be split into 8/4/4/4/12 hex pieces
  by the alphanumeric arm. Caught by tokenize_uuid_as_single_token test.
- BM25 formula matches Python including the simplified single-doc IDF
  (constant ln(2)) and length-normalization parameters.
- Long-token bonus: +0.3 when any matched term has length >= 8. Boosts
  UUID/long-ID matches above the keyword baseline.
- Optimized score_batch pre-tokenizes the context once and computes
  avg_doc_len across the batch (matches Python's amortization).
- Reasons match Python's surface: "BM25: no term matches" /
  "BM25: matched 'X'" / "BM25: matched N terms (a, b, c...)" for
  single calls, "BM25: N terms" for batch.

Iteration determinism: Python iterates query Counter in dict
insertion order. We sort keys alphabetically before scoring so
matched_terms ordering is deterministic across runs (the BM25 score
itself is order-independent — sum is commutative). Python's order is
non-deterministic across PYTHONHASHSEED, so this is parity-safe and
arguably better behavior.

Net: 20 BM25 + base tests, 319 total in headroom-core, clippy clean.

Next commit: ONNX-backed embedding scorer (sentence-transformers via
the ort crate + tokenizers crate).
…25 fallback

Two more pieces of the relevance ladder, plus the create_scorer factory.

embedding.rs — STUB
- Mirrors Python's "sentence-transformers not installed" branch
  byte-for-byte. is_available() returns false; score() and
  score_batch() return RelevanceScore::empty.
- Real ONNX-backed implementation lands in a follow-up commit
  (tracked as task #22). When it does, this module needs zero
  changes at any call site — flipping is_available() to true is
  enough.
- Public surface: EmbeddingScorer::default() / ::new(model_name).
  Default model name pinned to sentence-transformers/all-MiniLM-L6-v2.

hybrid.rs — full port (with stub embedding under the hood)
- Adaptive alpha tuning per Python hybrid.py:115-151. Regex pattern
  detection on the context: UUIDs (alpha >= 0.85), 2+ numeric IDs
  (>=0.75), single numeric ID (>=0.65), hostname/email (>=0.6).
  Clamped to [0.3, 0.9].
- Email regex pinned to Python's literal pattern including the
  [A-Z|a-z] typo (`|` inside `[...]` is a literal pipe, not an
  alternation). Pinned for byte-equal parity with Python source.
- Graceful BM25 fallback when embedding.is_available() == false:
  - Items with any matched term get score >= 0.3.
  - Items with 2+ matched terms get +0.2 (capped at 1.0).
  - Reason prefixed with "Hybrid (BM25 only, boosted): ".
  Mirrors Python score()/score_batch() boost rules exactly.
- When real ONNX scorer lands, the hybrid path automatically uses
  alpha-weighted fusion: combined = alpha*BM25 + (1-alpha)*Emb.
- score_batch is amortized: BM25 batches once, embedding batches
  once, alpha computed once.

mod.rs — factory
- create_scorer(tier) returns Box<dyn RelevanceScorer>. Mirrors
  Python's create_scorer factory. "embedding" tier returns Err with
  the same shape as Python's RuntimeError until the ONNX backend
  flips on.

Net: 39 relevance tests, 338 total in headroom-core, clippy clean.
SmartCrusher's planning layer (next commit) can now wire in
HybridScorer and behave parity-equal with Python deployments that
don't have sentence-transformers installed. ONNX embeddings remain
on the runway.
…ritize

Direct port of three Python methods from smart_crusher.py:

- _deduplicate_indices_by_content (line 1721) → deduplicate_indices_by_content
- _fill_remaining_slots (line 1794) → fill_remaining_slots
- _prioritize_indices (line 1891) → prioritize_indices

Used by every _plan_* method (next commit) to clean up index sets
before they become the final keep_indices.

Key design:
- All three operate on BTreeSet<usize> so iteration is sorted and
  deterministic.
- Item content hashes use anchor_selector::compute_item_hash, which
  serializes via Python-compatible json.dumps(sort_keys=True) and
  truncates md5 to 16 hex chars. Ensures Rust and Python collapse the
  same items to the same hash.
- Non-dict items (rare in the dict-array context) hash via str()
  fallback to mirror Python's behavior.

prioritize_indices pipeline:
1. Dedup pass (when config.dedup_identical_items, default true).
2. Fill pass — top up to effective_max with diverse uniques if
   under-budget.
3. Already <= effective_max? Return.
4. Otherwise keep ALL critical items (error keywords + structural
   outliers + numeric anomalies) — Python's "quality guarantee" path
   that may exceed effective_max when criticals dominate.
5. Add first-3 / last-2 anchors if room.
6. Fill remaining with non-critical kept indices in ascending order.

TOIN field-semantics: Python's _detect_items_by_learned_semantics
path is stubbed (returns empty set). Mirrors Python's "no
field_semantics provided" branch exactly. When TOIN is ported the
parameter slots back in without changing call sites.

Numeric-anomaly detection: helper numeric_anomaly_indices walks
analysis.field_stats and flags items >variance_threshold σ from
the per-field mean — same formula the analyzer uses for crushability
signal counting.

Net: 12 new orchestration tests, 350 total in headroom-core, clippy
clean.

Next commit: planning layer (_create_plan + four _plan_* methods)
wires these helpers + the relevance scorer + AnchorSelector into
strategy-specific keep_indices builders.
Direct port of Python's compression-planning logic from
smart_crusher.py:3117-3615. Wires every previous module together —
analyzer, anchor_selector, scorer, orchestration helpers, error/outlier
detectors — into strategy-specific keep_indices builders.

Module: smart_crusher/planning.rs (~600 lines).

SmartCrusherPlanner — borrows config + anchor_selector + scorer +
analyzer. Stateless from outside.

create_plan dispatcher (Python _create_plan, line 3117):
- Routes by analysis.recommended_strategy → plan_*.
- SKIP path returns all indices defensively.
- Honors factor_out_constants config (default false → empty constants).

plan_smart_sample (Python lines 3509-3615) — DEFAULT/fallback:
- Anchors via AnchorSelector (DataPattern::Generic).
- Structural outliers + error keywords (already-ported helpers).
- Numeric anomalies > variance_threshold σ from per-field mean.
- Items around change points (window ±1, gated by preserve_change_points).
- Query anchors (extract_query_anchors / item_matches_anchors).
- Relevance scoring via the configured scorer (defaults to HybridScorer
  which falls back to BM25-with-boost while embedding scorer is stubbed).
- TOIN preserve_field matches (helper present, callers pass None until
  TOIN is ported).
- prioritize_indices for dedup + fill + over-budget pruning.

plan_top_n (Python lines 3395-3507):
- Detects highest-confidence score field; falls through to smart_sample
  if none.
- Top-K by score (K = max_items - 3 to reserve for outliers).
- Outliers + error keywords (always added).
- Query anchors and high-confidence relevance matches added ADDITIVELY
  (beyond the top-K) capped at +3 to avoid over-stuffing.
- TOIN preserve_field matches.

plan_cluster_sample (Python lines 3289-3393):
- Anchors (DataPattern::Logs) + outliers + error keywords.
- Identifies highest unique-ratio string field as the message field
  (>0.3 threshold).
- Clusters items by md5(first 50 chars of message)[:8].
- Keeps up to 2 representatives from each cluster.
- Query signals + TOIN + prioritize_indices.

plan_time_series (Python lines 3200-3287):
- Anchors (DataPattern::TimeSeries).
- WIDER change-point window: ±2 (vs ±1 in smart_sample).
- Outliers + error keywords.
- Query signals + TOIN + prioritize_indices.

Helpers:
- map_to_anchor_pattern: CompressionStrategy → DataPattern (mirrors
  Python _map_to_anchor_pattern).
- item_has_preserve_field_match: TOIN preserve_field check (SHA256[:8]
  field-name hashes; bidirectional substring match between query and
  field value, lowercased).
- numeric anomaly inner helper for the smart_sample path.

Config addition: relevance_threshold (default 0.3) — mirrors Python's
RelevanceConfig.relevance_threshold field. Pinned in tests.

Net: 13 planning tests, 363 total in headroom-core, clippy clean.
Each plan method has its own focused test: error preservation,
query anchor pinning, top-N score ordering, cluster_field assignment,
change-point window. The 8-arg signatures are intentionally
clippy-suppressed — they mirror Python; bundling into a struct
would obscure parity.

Next commit: execution layer (_execute_plan) and the _crush_array
orchestrator. After that: SmartCrusher::crush() top-level + Python
lockstep + parity fixtures + PyO3 bridge.
…mixed_array

Lands the top-level SmartCrusher orchestrator. Owns config,
anchor_selector, scorer, and analyzer. Three Python entry points
ported:

- _execute_plan (line 3617) → SmartCrusher::execute_plan
- _crush_array  (line 2400) → SmartCrusher::crush_array
- _crush_mixed_array (line 2914) → SmartCrusher::crush_mixed_array

Module: smart_crusher/crusher.rs (~430 lines).

execute_plan:
- Trivial sort-by-index, clone, return. Mirrors Python verbatim.
- Schema-preserving: each kept item is unchanged.

crush_array:
- Computes adaptive_k from compute_optimal_k(item_strings, bias).
- Tier-1 boundary: n <= adaptive_k → "none:adaptive_at_limit"
  passthrough.
- analyzer.analyze_array → ArrayAnalysis.
- crushability gate: SKIP path returns "skip:<reason>" with original
  items unchanged.
- planner.create_plan → CompressionPlan.
- execute_plan → final items.
- Returns CrushArrayResult { items, strategy_info, ccr_hash,
  dropped_summary }.

crush_mixed_array:
- n <= 8 → "mixed:passthrough".
- Group items by JSON type (dict / str / number / list / null /
  bool). First-occurrence order preserved across groups via
  GroupBuckets helper (mirrors Python's dict insertion-order
  iteration).
- Small groups (< min_items_to_analyze): keep all items.
- Dict group: recurse into self.crush_array; survivors matched
  back to original indices via canonical-JSON serialization
  (anchor_selector::python_json_dumps_sort_keys).
- Str group: dispatch to crush_string_array; survivors matched
  via &str equality.
- Number group: inline first/last + outlier (>variance_threshold σ)
  detection — mirrors Python's _crush_mixed_array number arm
  (no summary prefix).
- list / bool / null / other: keep all.

Stubbed paths (matches Python's "subsystems disabled" behavior
byte-for-byte; parity fixtures will be recorded with these off):

- TOIN: never produces a recommendation. effective_max_items =
  adaptive_k. No preserve_fields, no strategy/level override.
- Feedback: never produces hints.
- CCR: never caches; ccr_hash = None.
- Telemetry: no-op.
- _compress_text_within_items: passthrough (text compression has
  its own port pipeline).
- summarize_dropped_items: empty string.

The TOIN/CCR/feedback/telemetry integration ports happen later
(Stage 3c.2 follow-ups). The current state is a complete like-for-
like port of SmartCrusher's CORE compression decisions.

Net: 12 new crusher tests, 375 total in headroom-core, clippy clean.

Pipeline coverage tested:
- execute_plan: empty / sorted / out-of-bounds index handling.
- crush_array: adaptive_at_limit, skip path, low-uniqueness
  compression, error-item preservation.
- crush_mixed_array: passthrough at threshold, group-and-compress
  dicts, lists/nulls keep-all.

Next commit: SmartCrusher::crush(content, query, bias) top-level
that parses JSON, classifies array type, dispatches to the right
crusher, serializes back. Then Python lockstep + parity fixtures
+ PyO3 bridge.
Lands the public entry point that ContentRouter calls. End-to-end
SmartCrusher pipeline now works in Rust for any JSON input.

Direct port of two Python methods:
- SmartCrusher.crush (line 1581-1603) → SmartCrusher::crush
- _smart_crush_content (line 2243-2301) → smart_crush_content
- _process_value (line 2307-2398) → process_value

Pipeline:
1. crush(content, query, bias) → CrushResult
2. smart_crush_content parses JSON. Non-JSON → passthrough.
3. process_value recurses through arrays + objects.
4. For each array meeting min_items_to_analyze:
   - DictArray  → crush_array (full analyzer + planner pipeline)
   - StringArray → crush_string_array
   - NumberArray → crush_number_array
   - MixedArray  → crush_mixed_array
   - NestedArray / BoolArray / Empty → recurse into items
5. For each object with min_items_to_analyze keys:
   - First recurse into values (compress nested arrays).
   - Then crush_object on the dict itself if not passthrough.
6. Re-serialize with python_json_dumps for byte-equal Python parity.
7. was_modified = result != content.trim().

MAX_PROCESS_DEPTH = 50: matches Python's depth guard against
adversarial nesting.

Critical helper: python_json_dumps (preserve insertion order variant).
Python's json.dumps default emits `, ` and `: ` separators (with
spaces) and ensure_ascii=True (escapes non-ASCII to \\uXXXX).
serde_json's compact to_string differs in both. Without this
formatter the proxy's was_modified checks and downstream byte
comparisons would fail. Pinned by:
  crush_serializes_with_python_format → {"a": 1, "b": 2, "c": 3}

Refactored anchor_selector::python_json_dumps_sort_keys to share
its formatter via a sort_keys=true/false flag. Same code path,
two public entry points.

Tests:
  - crush_non_json_passes_through_unchanged
  - crush_scalar_json_passes_through
  - crush_small_array_passes_through (n < min_items_to_analyze)
  - crush_dict_array_crushes_when_low_uniqueness
  - crush_serializes_with_python_format
  - crush_recurses_into_nested_arrays (compresses inner array
    even inside a wrapper object)

Net: 6 new top-level tests, 381 total in headroom-core, clippy clean.

Stage 3c.1 status — Rust port now end-to-end working:
  ✓ Foundational helpers (statistics, anchors, classifiers, hashing)
  ✓ Analyzer (8-method SmartAnalyzer)
  ✓ adaptive_sizer (Kneedle + SimHash + zlib)
  ✓ Universal crushers (string/number/object) + bug #1 ported,
    bug #4 fixed
  ✓ anchor_selector (770 → ~700 Rust lines)
  ✓ Relevance: BM25 + base trait + HybridScorer
    (Embedding scorer stubbed — real fastembed impl is task #22)
  ✓ Orchestration (dedup/fill/prioritize)
  ✓ Planning (4× _plan_* + dispatcher)
  ✓ SmartCrusher struct (execute_plan + crush_array +
    crush_mixed_array + top-level crush + recursive process_value)

Bug status (4 bugs fixed in BOTH languages at commit 7):
  #1 percentile off-by-one — Rust ports faithful, Python pending
  #2 sequential pattern int(zero-padded) — Rust fixed, Python pending
  #3 rare-status cardinality cap — Rust fixed, Python pending
  #4 k-split overshoot — Rust fixed, Python pending

Remaining for Stage 3c.1 completion:
  - Real fastembed embedding scorer (task #22)
  - Python lockstep bug fixes for all 4 bugs
  - Parity fixture harness + record byte-equal fixtures
  - PyO3 bridge + delete Python smart_crusher.py (Stage 3c.1b)
Lockstep fixes for the four known bugs in headroom/transforms/smart_crusher.py
plus the field-iteration ordering parity fix. Both languages now agree
byte-for-byte on the affected code paths — prerequisite for parity
fixtures landing next.

Bug #1 — percentile off-by-one (Python line 2844 + Rust crushers.rs)
Replaces integer-division indexing with linear-interpolation
percentile (numpy "linear" method). New _percentile_linear helper
shared by both languages: index = q * (n - 1), interpolate between
floor and ceil.

Bug #2 — zero-padded string IDs misclassified as sequential
Track had_non_string_numeric flag; if every parseable value came
from a string (no actual int/float), return False (categorical, not
sequential). Pre-fix: int("001") loses zero-padding and fakes a
sequential pattern.

Bug #3 — rare-status detection cardinality cap
Cardinality cap raised from 10 to 50. Single-dominant check
replaced with Pareto top-K: smallest K such that top-K covers >=80%
of items. If K <= 5, items NOT in top-K are outliers. Catches
bimodal distributions like 60×INFO + 25×WARN + 15 distinct error
codes.

Bug #4 — k-split overshoot when k_total=1
Clamp after the floored fractions: k_first=min(k_first, k_total),
k_last=min(k_last, max(0, k_total - k_first)). No-op for the
common case k_total >= 2.

Field iteration ordering (Python line 1049)
`for key in all_keys` → `for key in sorted(all_keys)`. Set
iteration is non-deterministic across PYTHONHASHSEED; downstream
short-circuits in _select_strategy and _detect_pattern would pick
different fields between runs. Rust uses BTreeMap (sorted ASCII);
sorting Python locks both languages to the same iteration order.

Verification:
- 56 Python tests pass (51 existing + 5 new lockstep tests under
  TestStage3c1BugFixes class).
- 382 Rust tests pass (rust bug #1 documentation test replaced
  with two new "fixed behavior" tests).
- Clippy clean.

Status: all four bugs are now fixed in BOTH languages. Parity
fixtures can be recorded against post-fix Python and asserted
byte-equal against Rust. That's the next commit.
Replace the embedding scorer stub with a real fastembed-rs
implementation. Same library + same model as the Python side will
use after the next commit, giving byte-equal embeddings on identical
inputs.

Cargo.toml: fastembed = "5". Default features pull in `ort` (ONNX
Runtime) with auto-download of the runtime binary at build time
(~21s additional first-build); model weights (BAAI/bge-small-en-v1.5,
~30 MB int8-quantized ONNX) auto-download from HuggingFace Hub on
first use.

embedding.rs:
- EmbeddingScorer wraps Option<Mutex<TextEmbedding>>. Mutex required
  because TextEmbedding::embed needs &mut self (single-threaded ONNX
  session); concurrent callers serialize on the lock, fine for the
  SmartCrusher hot path where inference dominates lock contention.
- EmbeddingScorer::try_new() — explicit construction with HF Hub
  download. Returns Result; surface errors to callers.
- EmbeddingScorer::try_new_with_model(EmbeddingModel) — bring your
  own model from fastembed's catalog.
- EmbeddingScorer::default() — STUB only (model=None,
  is_available()=false). Mirrors Python's "sentence-transformers
  not installed" branch byte-for-byte. To get a real scorer, call
  try_new() and pass via HybridScorer::with_scorers().

Why default() is a stub: with auto-load Default, model availability
would depend on whether HF Hub cache has the file — non-deterministic
in tests. Explicit try_new() keeps Default cheap and predictable.

cosine_similarity:
- f32 vec inputs (fastembed returns Vec<Vec<f32>>).
- Clamped to [0, 1] (mirrors Python _cosine_similarity — only
  positive similarity matters for relevance).
- Defensive: zero vectors / mismatched dims → 0.0.

score / score_batch:
- Empty input / unavailable model → empty score with explanatory
  reason.
- Batch encodes items + context in one model call (Python parity:
  amortizes model dispatch).
- Inference failures degrade gracefully with empty scores rather
  than panicking.

Tests:
- 5 cosine-similarity unit tests (offline).
- 3 unavailable-scorer tests (model=None path).
- 3 model-backed integration tests gated on RUN_FASTEMBED_TESTS=1
  (semantic-match-outranks-unrelated, batch-shape, model-loads).
- All 388 headroom-core tests pass without RUN_FASTEMBED_TESTS;
  with it set, the gated 3 also pass.

Net: 388 unit tests, clippy clean. HybridScorer's BM25-fallback path
remains correct (default embedding scorer reports unavailable).
Stage 3c.1 next: switch Python's relevance/embedding.py to fastembed
PyPI package + record parity fixtures with real embeddings on both
sides.
…-v1.5)

Replace `sentence-transformers` (PyTorch-backed) with `fastembed` (ONNX-
backed) so Python and Rust call into the same library + same model for
relevance scoring. Both sides run BAAI/bge-small-en-v1.5 (33M params,
384 dims, ~30 MB int8-quantized ONNX) auto-downloaded from HF Hub.

Cross-language verification on ('authentication failed for user',
'login error'): Python=0.7505, Rust=0.7507, delta ~0.0002 — well below
the relevance_threshold (0.3) buffer SmartCrusher uses for keep/drop
decisions, so the two implementations agree on every observable
SmartCrusher output. (True byte-equal would require both to load the
identical ONNX weights file — Rust's `fastembed` crate and Python's
`fastembed` package can pick different upstream artifacts; deferred.)

Why fastembed:
- removes torch from the relevance/ path (Phase 6: drop torch from
  Python).
- ~2-3x faster than sentence-transformers' all-MiniLM-L6-v2 for the
  same input shape.
- bge-small-en-v1.5 outranks all-MiniLM-L6-v2 on MTEB by ~6 points.
- self-contained: no longer reads ML_MODEL_DEFAULTS.sentence_transformer
  from utils config.
Adds the SmartCrusher half of the Rust-vs-Python parity harness. Path A
from the Stage 3c.1 plan: record fixtures from Python (with real
fastembed embeddings + the post-bug-fix code), drive the Rust port over
the same inputs, and assert byte-equal output on every recorded
scenario.

What's in:
- `tests/parity/record_smart_crusher.py`: standalone recorder for
  `SmartCrusher.crush(content, query, bias)`. The generic recorder
  framework only captures one positional, so this script writes its
  own JSON envelope `{input: {content, query, bias}, config, output}`.
  17 scenarios cover the planning paths exercised by ContentRouter
  in production: passthrough, smart_sample, top_n, time-series,
  duplicates, unicode (`ensure_ascii=False`), nested-3-deep, empties,
  bias above and below 1.0.
- `crates/headroom-parity/src/lib.rs`: `SmartCrusherComparator`
  reconstructs `SmartCrusherConfig` from the fixture's config block,
  runs Rust `SmartCrusher::crush()`, emits the same JSON shape Python
  serialized.
- `crates/headroom-parity/examples/diff_fixture.rs`: diagnostic CLI
  that prints expected-vs-actual for one fixture (used during the
  iteration that found the serializer bug below).

Serializer fix — found by the harness:
SmartCrusher uses `safe_json_dumps` (compact `(",", ":")` separators
+ `ensure_ascii=False`) for the wire bytes. The Rust port was using
`python_json_dumps` (default Python: `(", ", ": ")` + `ensure_ascii=
True`), which is the right choice for hashing but wrong for the
output. Refactored `anchor_selector.rs` to take a small
`JsonFmt { sort_keys, compact, ensure_ascii }` config so the three
flavors share one writer, added `python_safe_json_dumps`, and
switched `_smart_crush_content` to call it. All three flavors now
have byte-exact tests.

Two unit tests in `crusher.rs` were pinning the old (wrong) format
and have been re-pinned to the compact form.

Cross-language status:
- All 17 empty-query fixtures: byte-equal.
- Embedding-driven (non-empty-query) fixtures deferred until the
  ~0.0002 numeric drift between Python `onnxruntime` and Rust `ort`
  is resolved (or until we accept the drift via a tolerance — none of
  the 17 fixtures exercises a borderline relevance_threshold call,
  and downstream code only branches at the 0.3 threshold).

Tests: cargo test --workspace (388 + supporting) green.
Stage 3c.1b step 1: expose `SmartCrusherConfig`, `CrushResult`, and
`SmartCrusher` to Python via `headroom._core`. The Python shim that
delegates to it (replacing the 3669-line Python implementation) lands
in the next commit; this commit just builds the bridge and a
fixture-replay test that pins it.

Surface:
- `headroom._core.SmartCrusherConfig(**fields)` — every field of the
  Rust `SmartCrusherConfig` exposed as a kwarg with matching default.
- `headroom._core.CrushResult` — read-only mirror of the Rust struct
  with `compressed`, `original`, `was_modified`, `strategy` getters.
- `headroom._core.SmartCrusher(config=None)` — constructor accepts
  only `config`; the Python shim drops `relevance_config`, `scorer`,
  and `ccr_config` since Stage 3c.1 keeps those subsystems disabled.
- `crush(content, query="", bias=1.0)` and `smart_crush_content(...)`
  methods mirror the Python signatures.

Verification:
- All 17 recorded parity fixtures byte-equal between Python and the
  PyO3 bridge (`tests/test_transforms/test_smart_crusher_rust_parity.py`,
  18 tests pass — 1 fixture-count sanity + 17 fixtures).
- The Rust-side `cargo run -p headroom-parity --bin parity-run --
  run --only smart_crusher` was already 17/17 green.

The two tests catch different regression classes:
- Rust-only test: catches drift in the Rust port's logic.
- Python bridge test: catches PyO3 input/output translation bugs.
Stage 3c.1b step 2 + cleanup. The python `SmartCrusher` (3669 lines)
is replaced by a thin pyo3-backed shim (~290 lines) that delegates
every byte to `headroom._core.SmartCrusher` (built from
`crates/headroom-py`, landed in the previous commit). There is no
python implementation and no env-var fallback — the wheel is a hard
import.

Why now: parity was already proven across 17 fixtures + the python-
side bridge test (1+17 in `test_smart_crusher_rust_parity.py`).
Keeping a shadow python impl behind a flag is a permanent maintenance
cost with no operational benefit. Stage 3c.1b deletes ~3380 lines of
python parser/scorer/analyzer/orchestrator code; the rust crate has
its own coverage (388 unit tests + property tests in headroom-core).

Surface preserved (drop-in for every production caller):
- `headroom.transforms.smart_crusher.SmartCrusher` — same class name,
  same `__init__(config, relevance_config, scorer, ccr_config)`
  signature (the latter three are accepted for source-compat and
  silently dropped — rust port keeps those subsystems disabled in
  Stage 3c.1, they re-attach in Stage 3c.2).
- `SmartCrusherConfig` and `CrushResult` dataclasses kept as python
  dataclasses (callers use `asdict()` / dataclass matching on them).
- `crush(content, query, bias)`, `_smart_crush_content(content, ...)`,
  `apply(messages, tokenizer, **kwargs)`, and
  `_extract_context_from_messages(messages)` all preserved.
- `smart_crush_tool_output(content, config, ccr_config)` thin wrapper.

The transform-protocol `apply()` orchestration stays python (message
walking, digest-marker insertion, token counting); only the per-
message compression call delegates to rust.

Removed:
- Python parser / planner / scorer / analyzer / classifier (~3380 lines).
- Internal helpers `_classify_array`, `_detect_sequential_pattern`,
  `_detect_rare_status_values`, `_detect_items_by_learned_semantics`,
  `_percentile_linear`, `_compute_k_split`, `_crush_number_array`,
  `_process_value`, etc. — rust crate has parallel coverage.
- `SmartAnalyzer`, `ArrayType`, `CompressionStrategy`,
  `extract_query_anchors` — internals; not used by any production
  caller (only tests probed them).

Tests deleted (probed deleted internals — same precedent as Stage 3b):
- `tests/test_transforms/test_smart_crusher.py` (40 tests)
- `tests/test_transforms/test_universal_json_crush.py` (45)
- `tests/test_transforms/test_anchor_selector.py` (49)
- `tests/test_toin_field_learning.py` (21)
- `tests/test_crushability.py` (20)

Tests trimmed (removed methods/classes that probe deferred subsystems
— scorer injection, CCR marker injection, TOIN feedback recording —
all of which re-attach in Stage 3c.2):
- `tests/test_transforms/test_smart_crusher_bugs.py`:
  TestNumberArraySchemaPreservation, TestStage3c1BugFixes.
- `tests/test_relevance.py`: 2 scorer-injection tests.
- `tests/test_ccr.py`: TestSmartCrusherCCRIntegration class +
  test_custom_marker_template.
- `tests/test_toin_integration.py`: TestTOINIntegration +
  TestStoreToTOINHash classes.
- `tests/test_critical_fixes.py`: TestSmartCrusherTOINIntegration +
  test_full_feedback_loop.
- `tests/test_acceptance.py::TestQueryAnchorExtraction`: dropped the
  `extract_query_anchors` probe; kept the end-to-end "Alice
  preserved" assertion.

Bug fixes from Stage 3c.1 (#1 percentile linear interp, #2 zero-
padded sequential, #3 rare-status pareto, #4 k-split overshoot) are
pinned by the rust crate and the parity fixtures
(`tests/parity/fixtures/smart_crusher/`).

Tests:
- 517 passed in the smart_crusher-adjacent file set
  (test_transforms/, test_relevance*, test_ccr, test_toin_integration,
  test_quality_retention, test_acceptance, test_critical_fixes).
- 18 in `test_smart_crusher_rust_parity.py` (1 sanity + 17 fixtures).
- 388 rust unit tests still green.

One stale-error-message regex in `test_relevance_extra.py` updated
from "requires sentence-transformers" → "requires fastembed".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant