Structural check: detect invisible Unicode characters and directional overrides #66

@dacharyc

Summary

The validator should detect invisible Unicode characters and directional text overrides in skill files. These characters can hide or disguise content from both human reviewers and pattern-matching scanners, while remaining fully visible to agents that consume the raw file.

Background

Skill files are plain-text Markdown consumed verbatim by LLM agents. A skill file that contains zero-width characters or directional overrides is either corrupted or deliberately trying to hide content. There is no legitimate use case for these characters in skill files, which should contain only plain, visible text.

Threat model

  • Zero-width characters can insert invisible content between visible characters, making a file appear clean to reviewers while containing hidden instructions for the agent.
  • Directional overrides (RTL/LTR) can make text render in a different order than it's stored, so a reviewer reads one thing while the agent reads another.
  • Both can also be used to bypass keyword-based scanners. For example, inserting a zero-width space (U+200B) between the "su" and "do" of sudo defeats a regex match for sudo, while the agent may still interpret the text as the same instruction.
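
The scanner bypass described above can be reproduced in a few lines. This is a minimal sketch, not part of the validator: `sudoPattern` and `stripZeroWidth` are hypothetical names chosen for illustration, and the pattern stands in for whatever keyword rule a real scanner might use.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sudoPattern stands in for a keyword-based scanner rule (hypothetical).
var sudoPattern = regexp.MustCompile(`\bsudo\b`)

// stripZeroWidth removes U+200B, U+200C, and U+200D so the comparison
// in main can show what the text looks like without them.
func stripZeroWidth(s string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '\u200B', '\u200C', '\u200D':
			return -1 // a negative return value drops the rune
		}
		return r
	}, s)
}

func main() {
	clean := "run sudo apt-get install foo"
	hidden := "run su\u200Bdo apt-get install foo" // zero-width space inside "sudo"

	fmt.Println(sudoPattern.MatchString(clean))                  // the scanner sees the keyword
	fmt.Println(sudoPattern.MatchString(hidden))                 // the scanner misses it
	fmt.Println(sudoPattern.MatchString(stripZeroWidth(hidden))) // found again once stripped
}
```

Both strings look the same when rendered, which is why the check proposed here flags the character itself rather than trying to normalize the text before scanning.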

What to detect

Zero-width characters

| Codepoint | Name |
|-----------|------|
| U+200B | Zero-width space |
| U+200C | Zero-width non-joiner |
| U+200D | Zero-width joiner |
| U+FEFF | BOM (when not at byte position 0) |

A BOM at byte position 0 is a legitimate encoding marker and should be excluded. A BOM appearing mid-file is suspicious and should be flagged.
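
One way the BOM rule might be expressed, assuming the files are UTF-8 encoded (so U+FEFF appears as the bytes EF BB BF). The function name `midFileBOMOffsets` is hypothetical and only illustrates the position-0 exclusion.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// midFileBOMOffsets returns the byte offset of every U+FEFF that is not
// at byte position 0. A BOM at position 0 is skipped as a legitimate
// encoding marker.
func midFileBOMOffsets(data []byte) []int {
	var offsets []int
	start := 0
	if bytes.HasPrefix(data, utf8BOM) {
		start = len(utf8BOM) // skip the leading BOM
	}
	for {
		i := bytes.Index(data[start:], utf8BOM)
		if i < 0 {
			return offsets
		}
		offsets = append(offsets, start+i)
		start += i + len(utf8BOM)
	}
}

func main() {
	// Leading BOM (allowed) plus one BOM in the middle of a line (flagged).
	data := []byte("\uFEFF# Skill\nstep\uFEFF one\n")
	fmt.Println(midFileBOMOffsets(data))
}
```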

Directional override characters

| Range | Name |
|-------|------|
| U+202A–U+202E | LRE, RLE, PDF, LRO, RLO |
| U+2066–U+2069 | LRI, RLI, FSI, PDI |

These embed or override text direction. They're used in internationalized text but have no place in English-language Markdown skill files.
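
As a rough illustration, the two ranges can be expanded into a lookup table keyed by codepoint, which also gives each finding a reportable name. `bidiControlNames`, `bidiHit`, and `findBidiControls` are hypothetical names for this sketch and are not taken from the validator's code.

```go
package main

import "fmt"

// bidiControlNames maps each flagged directional formatting character
// to its Unicode abbreviation (U+202A–U+202E and U+2066–U+2069).
var bidiControlNames = map[rune]string{
	'\u202A': "LRE", '\u202B': "RLE", '\u202C': "PDF",
	'\u202D': "LRO", '\u202E': "RLO",
	'\u2066': "LRI", '\u2067': "RLI", '\u2068': "FSI", '\u2069': "PDI",
}

// bidiHit records where a directional control was found.
type bidiHit struct {
	Offset int    // byte offset within the scanned string
	Name   string // abbreviation from bidiControlNames
}

// findBidiControls returns one hit per directional control in s.
func findBidiControls(s string) []bidiHit {
	var hits []bidiHit
	for i, r := range s { // i is the byte offset of rune r
		if name, ok := bidiControlNames[r]; ok {
			hits = append(hits, bidiHit{Offset: i, Name: name})
		}
	}
	return hits
}

func main() {
	// Escapes are used so the source file itself stays free of overrides.
	line := "allow \u202Eresu\u202C access"
	for _, h := range findBidiControls(line) {
		fmt.Printf("byte %d: %s\n", h.Offset, h.Name)
	}
}
```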

Proposed severity levels

| Pattern | Level |
|---------|-------|
| Zero-width characters (U+200B, U+200C, U+200D) | ERROR |
| BOM mid-file (U+FEFF not at position 0) | ERROR |
| Directional overrides (U+202A–U+202E, U+2066–U+2069) | ERROR |

All of these should be errors rather than warnings — there is no legitimate use case, and their presence is a strong corruption or tampering signal.

Implementation notes

  • Scan all files in the skill package: SKILL.md, all reference files, and any other text files
  • Report the file path, byte offset, and character name for each finding so authors can locate and remove them
  • The check can operate on raw bytes/runes — no Markdown parsing needed
  • Consider a new check function like CheckInvisibleCharacters(dir string, files []string) []types.Result in the structure package
  • The scan should iterate over runes in each file and check against a set of flagged codepoints
  • False positive rate is near zero — skill files should never contain these characters
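
Taken together, the notes above could look roughly like the following. This is a sketch under stated assumptions, not the project's implementation: the `Result` struct is a hypothetical stand-in for `types.Result` (whose real fields are not shown in this issue), the `"error"` level string is an assumption, and `scanInvisible` is a helper name invented here.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Result is a hypothetical stand-in for the project's types.Result.
type Result struct {
	Level   string
	File    string
	Offset  int // byte offset of the finding within the file
	Message string
}

// flagged is the set of codepoints listed under "What to detect".
var flagged = map[rune]string{
	'\u200B': "zero-width space",
	'\u200C': "zero-width non-joiner",
	'\u200D': "zero-width joiner",
	'\uFEFF': "BOM (mid-file)",
	'\u202A': "LRE", '\u202B': "RLE", '\u202C': "PDF",
	'\u202D': "LRO", '\u202E': "RLO",
	'\u2066': "LRI", '\u2067': "RLI", '\u2068': "FSI", '\u2069': "PDI",
}

// scanInvisible iterates over the runes of one file and reports every
// flagged codepoint, skipping a BOM at byte position 0.
func scanInvisible(file string, data []byte) []Result {
	var results []Result
	for i, r := range string(data) { // i is the byte offset of rune r
		name, ok := flagged[r]
		if !ok || (r == '\uFEFF' && i == 0) {
			continue
		}
		results = append(results, Result{
			Level:   "error",
			File:    file,
			Offset:  i,
			Message: fmt.Sprintf("U+%04X %s at byte offset %d", r, name, i),
		})
	}
	return results
}

// CheckInvisibleCharacters reads each file under dir and collects findings.
// Read failures are reported as results rather than aborting the scan.
func CheckInvisibleCharacters(dir string, files []string) []Result {
	var results []Result
	for _, f := range files {
		data, err := os.ReadFile(filepath.Join(dir, f))
		if err != nil {
			results = append(results, Result{Level: "error", File: f, Message: err.Error()})
			continue
		}
		results = append(results, scanInvisible(f, data)...)
	}
	return results
}

func main() {
	for _, r := range scanInvisible("SKILL.md", []byte("run su\u200Bdo\n")) {
		fmt.Printf("%s: %s: %s\n", r.Level, r.File, r.Message)
	}
}
```

Ranging over the string yields byte offsets directly, so no separate position tracking is needed. Invalid UTF-8 decodes to U+FFFD, which this sketch does not flag.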

Out of scope (for now)

Homoglyph detection (e.g., Cyrillic о vs Latin o) is related but significantly harder to implement reliably without internationalization false positives. That can be considered separately if there's demand.

Metadata

Labels: enhancement
