Skip to content

feat(reporter): layered output sanitization + ADR 0003#57

Merged
Zious11 merged 16 commits intodevelopfrom
worktree-adr-0003-output-sanitization
Apr 9, 2026
Merged

feat(reporter): layered output sanitization + ADR 0003#57
Zious11 merged 16 commits intodevelopfrom
worktree-adr-0003-output-sanitization

Conversation

@Zious11
Copy link
Copy Markdown
Owner

@Zious11 Zious11 commented Apr 9, 2026

Summary

Establishes ADR 0003 (Reporting Pipeline Layering) as the architectural rule for how attacker-controlled bytes flow through the reporting pipeline. The data layer (analyzers + Finding struct) now stores raw post-from_utf8_lossy bytes; each reporter (terminal, JSON, future CSV/SQLite) escapes for its own output medium at render time. Reverses PR #49's construction-site escaping in TLS and centrally fixes an injection vulnerability in the HTTP analyzer's 7 findings.

Motivating vulnerability

During PR #49 follow-up, an audit of HTTP finding construction at src/analyzer/http.rs:183, 209, 227, 242, 257, 271, 285 found that path-traversal, web-shell, admin-panel, unusual-method, missing-Host, long-URI, and empty-UA findings all interpolated parsed.uri / parsed.method / parsed.host directly via format!("{}", ...). An attacker-supplied URI containing \x1b[31mRED\x1b[0m would reach the operator's terminal unescaped (CWE-117).

PR #49 patched the TLS path with {:?} (Debug formatter), which worked for terminal safety but (a) permanently destroyed forensic bytes in Finding.summary — Cyrillic SNIs became \u{43f}\u{440}... in JSON exports — and (b) was tribal knowledge that didn't propagate to HTTP.

Solution

Move escaping from construction-site (analyzer) to display-time (terminal reporter), using a custom helper escape_for_terminal (~15 lines) built on char::escape_default gated by char::is_ascii_control() || ('\u{80}'..='\u{9f}').contains(&c) || c == '\\'. The gate preserves valid non-ASCII Unicode (Cyrillic, CJK, emoji) while escaping C0, DEL, C1, and backslash. str::escape_default was considered and rejected after empirical verification showed it mangles all non-ASCII (see commit 45cf649 and ADR 0003 Validation section).

Behavioral changes visible to users

  • JSON output preserves Cyrillic / CJK / emoji as raw UTF-8. A TLS SNI of пример.рф now appears readably in JSON exports instead of \u{43f}\u{440}.... Downstream tooling gets the actual hostname.
  • HTTP findings gain terminal-safety. An attacker who embeds \x1b[31m in a URI can no longer recolor the operator's terminal output. Zero HTTP analyzer code changes — the fix is centralized in the terminal reporter.
  • Both C0 and C1 control codepoints are escaped at terminal render. C1 codepoints like U+009B (CSI) — encoded in valid UTF-8 as 0xC2 0x9B — are escaped to close the DEC S8C1T attack vector. Narrow but real threat.

Non-changes

  • Finding struct shape (same fields, same serde derives)
  • JSON reporter (already safe via serde_json's automatic RFC 8259 escaping)
  • HTTP analyzer source code (raw interpolation is now correct under ADR 0003)
  • DNS analyzer (emits no findings)
  • No new dependencies

Test plan

  • cargo test --all — 166 tests pass (up from 163 pre-PR)
  • cargo clippy --all-targets -- -D warnings — clean
  • cargo fmt -- --check — clean
  • Unit tests on escape_for_terminal helper (11 total): ESC, BEL, DEL, tab/newline/CR, backslash, printable ASCII, Cyrillic preservation, emoji preservation, mixed content, empty string, C1 NEL+CSI, C1 range boundaries
  • Reporter integration test: terminal reporter escapes ESC in summary + evidence, preserves Cyrillic
  • Reporter integration test: JSON reporter preserves Cyrillic as readable Unicode, round-trips via serde_json::from_str
  • End-to-end contract test covering all three layers (struct preserves raw, terminal escapes, JSON escapes)
  • TLS analyzer test: raw ESC bytes survive to Finding.summary on the non-UTF-8 path (inverted from the pre-PR contract)
  • TLS analyzer test: raw Cyrillic substring present in Finding.summary on the non-ASCII path (guards the {hostname:?}{hostname} rollback)

Multi-agent PR review

Three reviewer agents (code-reviewer, pr-test-analyzer, comment-analyzer) reviewed this branch and returned APPROVED_WITH_NITS / COVERAGE_ADEQUATE / APPROVED_WITH_NITS respectively. The code-reviewer's one Important finding (C1 codepoint bypass + ADR reasoning error) was applied in 7d2cd3c; the pr-test-analyzer's top three coverage suggestions (JSON Cyrillic, TLS non-ASCII rollback, terminal Cyrillic) were applied in b9cc820. One coverage suggestion was deferred to #56 (HTTP analyzer end-to-end through terminal reporter).

Commits

14 commits on top of develop. Notable:

  • f4c6098 + 45cf649 — ADR 0003 (original + correction)
  • 1143d0b — Implementation plan
  • 1f7e556 + 755837b — Escape helper + apply in render loop
  • 0e42aad — Roll back TLS Debug formatter
  • 7e4f1df — End-to-end layering contract test
  • 7d2cd3c + b9cc820 — Post-review C1 fix + coverage additions

Related

Zious11 added 14 commits April 9, 2026 09:59
Establishes the architectural principle that the data layer (analyzers
and the Finding struct) holds raw bytes from network captures, while
each reporter (terminal, JSON, future CSV/SQLite/HTML/SIEM) is
responsible for all formatting that depends on its own output medium —
escaping, styling, truncation, localization.

The motivating consequence is terminal-safe escaping: an audit during
PR #49 follow-up found 7 unprotected interpolations in the HTTP
analyzer where attacker-controlled bytes could reach the operator's
terminal and be interpreted as ANSI escape sequences (CWE-117). PR #49
patched TLS by escaping at the construction site, which solved the
immediate vulnerability but destroyed forensic data (Cyrillic SNIs
became hex blobs in JSON output) and required every new analyzer to
remember the rule — the HTTP audit shows the rule never propagated.

The decision: escape at display time in the reporter, using stdlib
str::escape_default in the terminal reporter. Analyzers store raw
bytes; JSON output stays safe via serde_json's automatic RFC 8259
escaping; future reporters apply their own format's rules.

Validated against OWASP output-encoding guidance (CWE-117) and Rust
stdlib behavior via Perplexity 2026-04-08. Other formatting concerns
(truncation, cipher-name fallback, verdict localization) follow the
same principle but are explicitly deferred per YAGNI.

Implementation follows in subsequent commits.
…::escape_default

An initial Perplexity answer (relied on during the first draft) claimed
str::escape_default preserves multi-byte UTF-8 while escaping only C0
control bytes. Empirical verification during plan self-review — a small
rustc-compiled program that fed ESC, BEL, DEL, Cyrillic, emoji, and
mixed content through str::escape_default — falsified this: the method
routes every character through char::escape_default, which escapes all
non-ASCII as \u{...}. A Cyrillic SNI like "пример.рф" would become
"\u{43f}\u{440}..." — the same UX problem as PR #49's Debug formatter.

The Perplexity answer had conflated str::escape_default (method on str,
operates on chars, escapes all non-ASCII) with std::ascii::escape_default
(free function, operates on bytes, uses \x{hh}). These are not
equivalent.

Corrections in this commit:

- Replaced "The mechanism: str::escape_default" with an 8-line custom
  helper that iterates chars() and gates char::escape_default() on
  is_ascii_control() || '\\'. Empirically verified against the same
  test inputs.
- Replaced the "char::escape_debug (per-char Debug)" alternative entry
  with "Stdlib str::escape_default", since they're functionally
  equivalent for our input space and the new entry correctly documents
  the rejection.
- Updated Rationale bullet to describe the custom helper.
- Updated Validation section to record the empirical check and the
  corrected Perplexity claim.
- Fixed an incorrect "\x1b" reference in the behavioral-changes
  section (char::escape_default uses \u{1b}, not \x1b).

No change to the layering principle, the decision to escape at
display time, the rollback of TLS PR #49, the HTTP analyzer impact,
or any of the file-level consequences. Only the escape primitive
description changed.
Seven bite-sized TDD tasks implementing the reporting pipeline layering
decision from ADR 0003 (commits f4c6098 + 45cf649):

  Task 1: Cross-reference ADR 0003 from ADR 0002
  Task 2: Finding::Display doc warning
  Task 3: escape_for_terminal helper + 9 unit tests
  Task 4: Apply helper in terminal reporter render loop
  Task 5: Roll back TLS construction-site {:?} escaping
  Task 6: End-to-end regression test (data layer + both reporters)
  Task 7: cargo test + clippy + fmt

Each task has complete code blocks, exact commands, and expected
outputs. Test assertions were pre-validated empirically with a small
rustc-compiled program (see commit 45cf649 for the primitive
correction). No placeholders, no speculative steps.

Total impact: ~100 lines of production-code change, ~120 lines of
test code, 6-7 commits beyond the ADR.
Point analyzer authors at the layering principle so they know output
sanitization is the reporter's responsibility, not theirs.
Doc comment on the impl Display block explaining that the raw text
representation is for logging/debugging only and may contain
attacker-controlled control bytes. Terminal rendering should go
through the terminal reporter, which escapes per ADR 0003.
Private helper in src/reporter/terminal.rs built on stdlib
char::escape_default, gated by char::is_ascii_control || backslash.
Escapes C0 control bytes, DEL, and backslash while preserving all
valid non-ASCII Unicode (Cyrillic, CJK, emoji). Unit tested for ESC,
BEL, DEL, tab/newline/CR short escapes, backslash, printable ASCII,
Cyrillic, emoji, mixed content, and empty string.

Not yet applied to the rendering loop — that's the next commit, so
the helper can land with dedicated tests before it's integrated.

See ADR 0003 for the layering rationale.
Apply escape_for_terminal to every Finding.summary and every evidence
entry in the terminal reporter's rendering loop. This moves output
sanitization from the analyzer construction site (where PR #49 placed
it) to the display layer, per ADR 0003.

JSON output remains unchanged and is already safe via serde_json's
automatic RFC 8259 escaping. HTTP findings (path traversal, web shell,
admin panel, unusual method, etc.) — previously vulnerable to terminal
injection via attacker-controlled URIs — are fixed by this change with
zero analyzer-side code modifications.
…ed_ev

Minor clarity improvement from Task 4 code quality review. "safe" is
ambiguous — the escaped string still contains the escape sequences in
textual form, which is what the terminal wants. "escaped" names the
action (control-byte escaping via char::escape_default) rather than a
misleading safety property. Zero behavioral change.
Remove {:?} Debug-format interpolation in both NonAsciiUtf8 and NonUtf8
SNI finding arms. The data layer now stores raw post-from_utf8_lossy
bytes; the terminal reporter is responsible for escaping at render
time per ADR 0003.

Inverts test_non_utf8_sni_escapes_control_bytes_in_summary — renamed
to test_non_utf8_sni_preserves_raw_bytes_in_summary — to enforce the
new contract: raw ESC bytes MUST survive to Finding.summary. A
separate regression test (next commit) verifies the end-to-end
pipeline: terminal reporter escapes, JSON reporter preserves via
serde, struct holds the raw byte.

Behavioral change visible to JSON consumers: Cyrillic SNIs now appear
as readable Unicode ("пример.рф") instead of hex-escaped
placeholders ("\u{43f}\u{440}...").
Single Finding flows through the data layer and both reporters. Asserts:

  1. Finding struct preserves raw ESC byte (forensic ground truth).
  2. Terminal reporter escapes ESC to \u{1b} on display.
  3. JSON reporter serializes ESC as \u001b per RFC 8259.
  4. JSON round-trips back to the original raw byte (reversibility).

Any regression that re-introduces construction-site escaping, removes
the terminal reporter's helper, or changes the JSON path will fail
this test.
The escape_for_terminal helper gated only on char::is_ascii_control(),
which matches C0 + DEL but NOT the C1 range (U+0080-U+009F). C1
codepoints like U+009B (CSI) and U+0085 (NEL) can appear in a valid
UTF-8 String — they're encoded as two-byte sequences (e.g., 0xC2 0x9B
for U+009B) that survive from_utf8_lossy.

A terminal in DEC S8C1T mode interprets U+009B as the 8-bit equivalent
of ESC[, so an attacker-supplied SNI or HTTP URI containing 0xC2 0x9B
could bypass the escape and inject ANSI control sequences. Modern
terminals in UTF-8 mode default to NOT interpreting C1 this way
(verified via Perplexity for xterm, iTerm2, gnome-terminal, Windows
Terminal, VS Code, tmux), but the DEC mode exists and some legacy
terminals honor it. Escape C1 as cheap insurance.

The ADR's Coverage of the threat model section contained a subtly
wrong claim: 'a standalone byte in the 0x80-0x9f range cannot appear
in a valid UTF-8 string'. That's correct for a single raw byte but
irrelevant — the codepoint U+009B is a valid multi-byte UTF-8
sequence and can absolutely appear in a String. ADR updated to
reflect the actual helper behavior and the corrected reasoning.

Adds two new unit tests (escapes_c1_nel_and_csi,
escapes_c1_range_boundaries) covering U+0080, U+0085, U+009B, U+009F,
plus the U+00A0 (NBSP) boundary — which is just past C1 and must
pass through unchanged.
Three additions from the multi-agent PR review triage:

1. test_json_reporter_preserves_cyrillic_as_readable_unicode — pins
   the ADR 0003 user-visible claim that JSON consumers now see raw
   Cyrillic hostnames instead of the PR #49 hex-mangled form. Includes
   a round-trip serde_json::from_str check proving the Cyrillic
   survives the full trip.

2. test_terminal_reporter_escapes_esc_bytes_in_summary augmented with
   Cyrillic content in the summary payload. Previously only covered
   ASCII + ESC — now proves the render loop preserves Cyrillic while
   still escaping control bytes (catches accidental double-escape
   bugs in the for-findings loop).

3. test_cyrillic_sni_emits_non_ascii_finding augmented with raw-
   Cyrillic-substring assertions on Finding.summary. This directly
   guards the {hostname:?} → {hostname} rollback on the NonAsciiUtf8
   match arm in src/analyzer/tls.rs, which was otherwise only
   structurally similar (and untested) compared to the non-UTF-8
   branch that has the inverted dedicated test.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements ADR 0003’s “layered reporting pipeline” rule by preserving attacker-controlled bytes in the data layer (Finding) while performing output-medium-specific escaping at render time (terminal), restoring forensic fidelity (Unicode preserved) and centralizing terminal-safety.

Changes:

  • Add terminal-layer escaping (escape_for_terminal) and apply it when rendering finding summaries/evidence.
  • Roll back TLS analyzer construction-site Debug escaping so Finding.summary preserves raw bytes again.
  • Add ADR 0003 + tests that pin the layering contract across data/terminal/JSON outputs.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/reporter/terminal.rs Adds escape_for_terminal and applies it to Finding.summary/evidence during terminal rendering.
src/analyzer/tls.rs Removes {:?}-based construction-site escaping for SNI summaries to preserve raw bytes.
src/findings.rs Documents that Finding’s Display is raw and not terminal-safe.
tests/reporter_tests.rs Adds integration/contract tests for terminal escaping + JSON Unicode preservation/round-trip.
tests/tls_analyzer_tests.rs Inverts TLS test contract to assert raw bytes are preserved in Finding.summary; adds Cyrillic regression guard.
docs/adr/0003-reporting-pipeline-layering.md Introduces ADR 0003 describing the layering rule and escape approach.
docs/adr/0002-modular-protocol-analyzers.md Adds a guideline pointing analyzer authors to ADR 0003 (no construction-site escaping).
docs/superpowers/plans/2026-04-09-reporting-pipeline-layering.md Adds an implementation plan document for the ADR 0003 work.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/adr/0003-reporting-pipeline-layering.md Outdated
Comment thread docs/superpowers/plans/2026-04-09-reporting-pipeline-layering.md Outdated
Comment thread docs/superpowers/plans/2026-04-09-reporting-pipeline-layering.md Outdated
Comment thread docs/adr/0003-reporting-pipeline-layering.md Outdated
Comment thread docs/adr/0003-reporting-pipeline-layering.md Outdated
The C1 control codepoint fix in commit 7d2cd3c updated the helper's
predicate to include the U+0080..=U+009F range, but only patched one
section of the ADR ('Coverage of the threat model') and left four
other places describing the pre-C1 predicate. Copilot's review of
#57 caught all five mismatches (four in ADR 0003, one in the plan).

Fixes:

- ADR 0003 'Immediate scope' section: code sample updated to include
  the C1 range check, and the 'helper escapes' bullet list gains an
  explicit C1 bullet pointing to the later threat-model paragraph.
- ADR 0003 'Rationale' bullet: 'Escapes exactly the bytes that
  constitute the threat (C0 + DEL + backslash)' -> 'C0 + C1 + DEL +
  backslash'. Also bumps the line count estimate (~8 -> ~15) to
  reflect the helper's actual size after the C1 addition.
- ADR 0003 'Consequences' table row for src/reporter/terminal.rs:
  description now mentions C1 controls as part of the required
  predicate.
- Plan Architecture paragraph: clarifies that the helper gates on
  ASCII control + C1 control + backslash. Adds a forward reference
  to commit 7d2cd3c so future plan-followers know the C1 range was
  added post-plan and Task 3's code samples reflect the pre-C1
  predicate for historical accuracy.
- Plan 'File Structure' clarification: 'No files are created' was
  technically inaccurate since this plan doc and ADR 0003 are both
  new files; reworded to 'No new code or runtime files are created'
  with an explicit note that the two docs are prerequisites
  committed before the implementation tasks begin.

Per Copilot on #57.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/reporter/terminal.rs
Comment thread docs/superpowers/plans/2026-04-09-reporting-pipeline-layering.md Outdated
TerminalReporter::render's ANALYZER SUMMARIES block was printing
detail values (serde_json::Value) via {val} direct interpolation,
relying on serde_json's Display impl to handle escaping. That works
for C0 + DEL (serde_json escapes them per RFC 8259 §7) but NOT for
C1 codepoints (U+0080-U+009F) — serde_json preserves those as raw
UTF-8 two-byte sequences per RFC 8259, which is valid but leaves a
C1 terminal injection vector open.

Empirically verified: a serde_json::Value::Array containing
"before\u{9b}31mafter" renders via Display to a JSON string that
contains the raw 0xC2 0x9B bytes (U+009B CSI). The helper fix in
commit 7d2cd3c closed this gap in the findings block but didn't
reach the analyzer summaries block.

Detail values include attacker-controlled strings from HTTP
top_hosts / top_user_agents / recent_uris and TLS top_snis — these
all come through String::from_utf8_lossy in the analyzers, so
embedded C0 + C1 bytes can appear in the serde_json::Value leaves
exposed to the terminal.

Fix: apply escape_for_terminal to val.to_string() before writing to
the output buffer, matching the findings-block pattern. The JSON
rendering path now has a defense-in-depth double layer:
serde_json's RFC 8259 escape for C0 (which produces \u001b that
escape_for_terminal passes through as printable ASCII), plus
escape_for_terminal's C1 gate catching any raw U+0080-U+009F bytes
in the JSON string.

Adds test_terminal_reporter_escapes_control_bytes_in_analyzer_summaries
covering the C0 and C1 cases through the full ANALYZER SUMMARIES
render path, plus a Cyrillic preservation assertion.

Also fixes the plan's Task 2 doc-comment snippet which still
mentioned str::escape_default — replaces with the correct
escape_for_terminal reference (Copilot round 2 comment).

Per Copilot round 2 review on #57.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@Zious11 Zious11 merged commit e1d0a6f into develop Apr 9, 2026
8 checks passed
@Zious11 Zious11 deleted the worktree-adr-0003-output-sanitization branch April 9, 2026 16:31
Zious11 added a commit that referenced this pull request Apr 10, 2026
## Summary

- Adds 3 end-to-end tests that drive `HttpAnalyzer::on_data` with
crafted HTTP requests containing C1 CSI (U+009B) injection, then render
through both `TerminalReporter` and `JsonReporter`, verifying ADR 0003's
output sanitization contract.
- Key discovery during issue validation: `httparse` rejects C0 control
bytes (including ESC 0x1b) in URIs and header values, but **accepts** C1
codepoints (U+009B CSI = `0xC2 0x9B`) because their UTF-8 encoding uses
high bytes (≥ 0x80). C1 CSI is the real viable injection vector through
the HTTP analyzer.
- Closes the coverage gap identified by `pr-test-analyzer` during the
ADR 0003 PR review (#57).

### Tests added
| Test | Vector | Path |
|------|--------|------|
| `test_http_finding_c1_csi_escaped_by_terminal_reporter` | C1 CSI in
path-traversal URI | Finding → TerminalReporter |
| `test_http_finding_c1_csi_in_json_reporter` | C1 CSI in path-traversal
URI | Finding → JsonReporter round-trip |
| `test_http_analyzer_summary_c1_csi_escaped_by_terminal_reporter` | C1
CSI in Host header | AnalyzerSummary → TerminalReporter |

Closes #56

## Test plan
- [x] All 12 reporter tests pass (`cargo test --test reporter_tests`)
- [x] Full suite passes (170+ tests, 0 failures)
- [x] `cargo clippy --all-targets` clean
- [x] `cargo fmt --check` clean
- [x] Multi-agent PR review (test-analyzer, code-reviewer,
comment-analyzer) — 0 critical issues
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants