feat: flag valid-UTF-8 but non-ASCII SNI as RFC 6066 violation #55

Merged
Zious11 merged 2 commits into develop from worktree-tls-non-ascii-sni on Apr 8, 2026
@Zious11 Zious11 commented Apr 8, 2026

Summary

Real-world false-positive risk

Validated with Perplexity that all major TLS clients auto-Punycode internationalized hostnames before sending the SNI:

| Client | Behavior |
| --- | --- |
| rustls | Rejects raw UTF-8 at the ServerName API level, requires Punycode upfront |
| Chrome / BoringSSL | Always Punycode (per RFC 3492 + 6066) |
| Firefox / NSS | Always Punycode (URL bar may display Unicode for user-friendliness, but the wire is always Punycode) |
| curl / libcurl | Auto-converts via libidn2 before calling OpenSSL |
| OpenSSL raw API | Passes verbatim, but expects ASCII from the caller; the only legitimate path to a raw U-label SNI is a custom application using OpenSSL directly without IDNA prep |

So the false-positive surface is small: a buggy custom client, an OpenSSL-direct app that forgot to Punycode, or an attacker tool. Severity is Anomaly / Inconclusive / Low — same rationale as the non-UTF-8 finding from #49.

Implementation

enum SniValue {
    Ascii(String),                                  // RFC compliant
    NonAsciiUtf8 { hostname: String, hex: String }, // valid UTF-8, non-ASCII (NEW)
    NonUtf8 { lossy: String, hex: String },         // invalid UTF-8 (existing)
}

extract_sni uses s.is_ascii() to split the UTF-8 case. handle_client_hello uses an exhaustive match on the variants, so a future variant addition (e.g. for #54's ASCII-control-code case) will fail to compile until handled. Map-keying for NonAsciiUtf8 is the bare hostname (no <non-ascii:HEX> tagging needed — valid UTF-8 strings have no collision risk because from_utf8 is a bijection on well-formed bytes).
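The split described above can be sketched as follows. This is a minimal illustration rather than the PR's actual code: `classify_sni`, `to_hex`, and `describe` are hypothetical names, and only the `SniValue` variants and the use of `is_ascii()` come from the description.

```rust
/// Variant names follow the PR description; the derives are an assumption.
#[derive(Debug, PartialEq)]
enum SniValue {
    Ascii(String),
    NonAsciiUtf8 { hostname: String, hex: String },
    NonUtf8 { lossy: String, hex: String },
}

/// Hypothetical helper: lowercase hex of the raw SNI bytes.
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Hypothetical stand-in for the classification step inside `extract_sni`.
fn classify_sni(raw: &[u8]) -> SniValue {
    match std::str::from_utf8(raw) {
        // Valid UTF-8 and pure ASCII: the RFC-compliant case.
        Ok(s) if s.is_ascii() => SniValue::Ascii(s.to_owned()),
        // Valid UTF-8 with at least one byte >= 0x80: a raw U-label.
        Ok(s) => SniValue::NonAsciiUtf8 {
            hostname: s.to_owned(),
            hex: to_hex(raw),
        },
        // Not valid UTF-8 at all: keep a lossy rendering plus the hex.
        Err(_) => SniValue::NonUtf8 {
            lossy: String::from_utf8_lossy(raw).into_owned(),
            hex: to_hex(raw),
        },
    }
}

/// No wildcard arm, so adding a fourth variant stops this from compiling.
fn describe(value: &SniValue) -> &'static str {
    match value {
        SniValue::Ascii(_) => "compliant",
        SniValue::NonAsciiUtf8 { .. } => "non-ascii",
        SniValue::NonUtf8 { .. } => "non-utf8",
    }
}
```

Because `describe` has no `_ =>` arm, the compiler enforces the "fail to compile until handled" property mentioned for a future #54 variant.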

Findings shape

The new finding mirrors the existing NonUtf8 finding pattern for consistency:

  • category: Anomaly, verdict: Inconclusive, confidence: Low
  • Summary: "TLS SNI contains non-ASCII characters (RFC 6066 requires A-labels per RFC 5890): {hostname:?}", using the {:?} Debug formatter to escape any control codepoints that might survive UTF-8 decoding (e.g. U+0085 NEL)
  • Evidence: vec![format!("hex: {hex}")] for forensic byte-level review
  • mitre_technique: None (matches existing TLS findings; no clean ATT&CK technique for protocol-format violations)
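To make the shape concrete, here is a rough sketch of how such a finding could be built. The `Finding` struct, its field types, and the `non_ascii_sni_finding` function are assumptions for illustration; only the field values and the summary text are taken from the list above.

```rust
// Hypothetical types: the real project presumably has richer enums.
#[derive(Debug, PartialEq)]
enum Category { Anomaly }
#[derive(Debug, PartialEq)]
enum Verdict { Inconclusive }
#[derive(Debug, PartialEq)]
enum Confidence { Low }

#[derive(Debug)]
struct Finding {
    category: Category,
    verdict: Verdict,
    confidence: Confidence,
    summary: String,
    evidence: Vec<String>,
    mitre_technique: Option<String>,
}

/// Hypothetical constructor for the non-ASCII SNI finding.
fn non_ascii_sni_finding(hostname: &str, hex: &str) -> Finding {
    Finding {
        category: Category::Anomaly,
        verdict: Verdict::Inconclusive,
        confidence: Confidence::Low,
        // `{hostname:?}` applies the Debug formatter, which quotes the
        // string and escapes control characters instead of emitting them raw.
        summary: format!(
            "TLS SNI contains non-ASCII characters \
             (RFC 6066 requires A-labels per RFC 5890): {hostname:?}"
        ),
        // The hex string is the lossless form kept for byte-level review.
        evidence: vec![format!("hex: {hex}")],
        mitre_technique: None,
    }
}
```

With this formatting, a hostname that contains U+0085 appears in the summary in escaped form, so the raw control character never reaches a terminal or log viewer.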

Test plan

  • cargo test --test tls_analyzer_tests — 21/21 pass (~650ms; was 18 before)
  • cargo test — full suite green
  • cargo clippy --all-targets -- -D warnings — clean
  • cargo fmt --check — clean
  • Code review pass (code-reviewer): zero critical, zero important; one defer-only suggestion (S1, ASCII control codes) filed as #54 (TLS: flag SNI hostnames containing ASCII control codes (BEL/ESC/DEL/C0))
  • Real-world false-positive risk Perplexity-validated against rustls, OpenSSL, Chrome/BoringSSL, Firefox/NSS, curl/libcurl
  • RFC 6066 §3 ASCII requirement and RFC 5890 A-label requirement reconfirmed (already validated earlier this session via direct RFC fetch)

New tests

  1. test_valid_utf8_non_ascii_sni_emits_finding (flipped from the pin-test in #50, "test: expand TLS SNI edge-case coverage"): "café.example"; asserts category/verdict/confidence/RFC mention/A-label mention/hex evidence (636166c3a92e6578616d706c65).
  2. test_cyrillic_sni_emits_non_ascii_finding: "пример.example" (Cyrillic 2-byte UTF-8 sequences).
  3. test_emoji_sni_emits_non_ascii_finding: "🦀.example" (4-byte UTF-8 codepoint).
  4. test_punycode_a_label_does_not_emit_non_ascii_finding: "xn--caf-dma.example" regression; the RFC-compliant Punycode form is pure ASCII and must NOT be flagged.
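The byte-level properties these four inputs rely on can be checked independently of the analyzer. The helpers below are hypothetical and exist only for this illustration; they are not part of the test suite.

```rust
/// Hypothetical helper: lowercase hex of a hostname's UTF-8 bytes.
fn hex_of(host: &str) -> String {
    host.bytes().map(|b| format!("{b:02x}")).collect()
}

/// Hypothetical helper: the widest UTF-8 encoding, in bytes, of any
/// character in the hostname (1 means the hostname is pure ASCII).
fn max_char_width(host: &str) -> usize {
    host.chars().map(char::len_utf8).max().unwrap_or(0)
}
```

For example, `hex_of("café.example")` yields the hex evidence string asserted in test 1, and `max_char_width` returns 2, 4, and 1 for the Cyrillic, emoji, and Punycode inputs respectively.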

Closes #51. PR #49 surfaced non-UTF-8 SNI bytes as a finding but
deliberately left a related RFC 6066 §3 violation unflagged: SNI
hostnames that are valid UTF-8 but contain non-ASCII codepoints,
e.g. raw U-labels like "café.example" or "пример.example". The spec
requires HostName to be ASCII (with internationalized names sent as
A-labels per RFC 5890 Punycode `xn--…` form), so a non-ASCII byte on
the wire is a real protocol violation.

Real-world false-positive risk
------------------------------
Validated via Perplexity that all major TLS clients auto-Punycode
internationalized hostnames before sending the SNI:

- rustls — rejects raw UTF-8 at the ServerName API level, requires
  Punycode upfront
- Chrome / BoringSSL — always Punycode (per RFC 3492 + 6066)
- Firefox / NSS — always Punycode (URL bar may display Unicode for
  user-friendliness, but the wire is always Punycode)
- curl / libcurl — auto-converts via libidn2 before calling OpenSSL
- OpenSSL raw — passes verbatim, but expects ASCII from the caller;
  applications using OpenSSL directly without IDNA prep are the only
  legitimate path to a raw U-label SNI on the wire

So the false-positive surface is small: a buggy custom client, an
attacker tool, or an OpenSSL-direct app that forgot to Punycode.
Severity is `Anomaly` / `Inconclusive` / `Low` — same rationale as
the non-UTF-8 finding.

SniValue enum
-------------
Renamed `Utf8(String)` → `Ascii(String)` (the old name was misleading
since ASCII is also valid UTF-8) and added a new variant:

    enum SniValue {
        Ascii(String),                              // RFC compliant
        NonAsciiUtf8 { hostname: String, hex: String }, // valid UTF-8, non-ASCII
        NonUtf8 { lossy: String, hex: String },     // invalid UTF-8 (existing)
    }

`extract_sni` now uses `s.is_ascii()` to split the UTF-8 case into
`Ascii` and `NonAsciiUtf8`. The hostname for `NonAsciiUtf8` is the
decoded String (always valid UTF-8 by definition); `hex` is the
lossless representation for forensic evidence.

`handle_client_hello` keys `sni_counts` on the hostname directly for
both Ascii and NonAsciiUtf8 (no collision risk for valid UTF-8) and
keeps the existing `<non-utf8:HEX>` tagged form for NonUtf8. The
finding summary uses `{hostname:?}` Debug formatter to escape any
control codepoints that might survive UTF-8 decoding (e.g. U+0085 NEL).
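
The keying rule can be illustrated with a small sketch. `count_key` and `tally` are hypothetical names, and the real `sni_counts` map may hold a different value type; only the `<non-utf8:HEX>` tag format and the bare-hostname keying come from the description above.

```rust
use std::collections::HashMap;

enum SniValue {
    Ascii(String),
    NonAsciiUtf8 { hostname: String, hex: String },
    NonUtf8 { lossy: String, hex: String },
}

/// Hypothetical helper: the map key used when counting SNI values.
fn count_key(value: &SniValue) -> String {
    match value {
        // Valid UTF-8 round-trips to its bytes, so the string itself is a safe key.
        SniValue::Ascii(host) => host.clone(),
        SniValue::NonAsciiUtf8 { hostname, .. } => hostname.clone(),
        // Lossy decoding can map different byte strings to the same text,
        // so the key is built from the lossless hex form instead.
        SniValue::NonUtf8 { hex, .. } => format!("<non-utf8:{hex}>"),
    }
}

/// Hypothetical stand-in for the counting done in `handle_client_hello`.
fn tally(values: &[SniValue]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for value in values {
        *counts.entry(count_key(value)).or_insert(0) += 1;
    }
    counts
}
```

Two distinct invalid byte strings that both decode lossily to U+FFFD still get separate keys here, which is the collision the tagged form avoids.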

Tests
-----
- Flipped `test_valid_utf8_non_ascii_sni_currently_not_flagged` to
  `test_valid_utf8_non_ascii_sni_emits_finding` — asserts the new
  finding fires for "café.example" with the right severity, RFC
  reference in summary, and hex evidence (`636166c3a92e6578616d706c65`).
- Added `test_cyrillic_sni_emits_non_ascii_finding` for "пример.example".
- Added `test_emoji_sni_emits_non_ascii_finding` for "🦀.example"
  (4-byte UTF-8 codepoint, all bytes ≥ 0x80).
- Added `test_punycode_a_label_does_not_emit_non_ascii_finding` for
  "xn--caf-dma.example" — pins that the RFC-compliant Punycode form
  is NOT flagged.
Copilot AI left a comment

Pull request overview

This PR tightens TLS SNI analysis to flag RFC 6066 violations where the SNI hostname bytes are valid UTF-8 but include non-ASCII characters (raw U-labels), while preserving existing handling for non-UTF-8 SNI.

Changes:

  • Refines SNI decoding by splitting valid SNI into Ascii vs NonAsciiUtf8 (with hex evidence), and emits an Anomaly / Inconclusive / Low finding for the non-ASCII UTF-8 case.
  • Updates SNI counting to key Ascii and NonAsciiUtf8 by the decoded hostname string, while keeping the existing hex-tag keying for non-UTF-8.
  • Expands and flips tests to assert the new non-ASCII SNI finding, including Cyrillic/emoji positives and a Punycode A-label regression case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/tls_analyzer_tests.rs | Flips the prior pin-test and adds new cases to validate non-ASCII UTF-8 SNI finding behavior and Punycode non-flagging. |
| src/analyzer/tls.rs | Implements NonAsciiUtf8 detection and finding emission; renames the ASCII-happy-path variant and updates SNI keying logic accordingly. |

Comment thread tests/tls_analyzer_tests.rs Outdated
Addresses Copilot review on PR #55: "a RFC 6066" → "an RFC 6066".
RFC is pronounced "arr-eff-see", which starts with a vowel sound, so
the indefinite article is "an" — matches IETF style throughout the
RFC corpus itself.
@Zious11 Zious11 merged commit 51490bd into develop Apr 8, 2026
4 checks passed
@Zious11 Zious11 deleted the worktree-tls-non-ascii-sni branch April 8, 2026 23:50