Summary
The validator should detect invisible Unicode characters and directional text overrides in skill files. These characters can hide or disguise content from both human reviewers and pattern-matching scanners, while remaining fully visible to agents that consume the raw file.
Background
Skill files are plain-text Markdown consumed verbatim by LLM agents. A skill file that contains zero-width characters or directional overrides is either corrupted or deliberately trying to hide content. These characters are valid UTF-8, so an encoding check alone will not catch them, but they have no legitimate use in skill files, which should contain only visible plain text.
Threat model
- Zero-width characters can insert invisible content between visible characters, making a file appear clean to reviewers while containing hidden instructions for the agent.
- Directional overrides (RTL/LTR) can make text render in a different order than it's stored, so a reviewer reads one thing while the agent reads another.
- Both can also be used to bypass keyword-based scanners. For example, inserting a zero-width space (U+200B) between the letters of `sudo` defeats a regex match for `sudo` while the agent may still interpret it as the same instruction.
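The scanner-bypass case can be illustrated with a short Go sketch. The pattern `\bsudo\b` and the helper `matchesSudo` are hypothetical stand-ins for a keyword rule, not part of any existing scanner, and the position of the zero-width space inside the keyword is chosen only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// sudoPattern stands in for a hypothetical keyword-based scanner rule.
var sudoPattern = regexp.MustCompile(`\bsudo\b`)

// matchesSudo reports whether the scanner rule fires on the given text.
func matchesSudo(text string) bool {
	return sudoPattern.MatchString(text)
}

func main() {
	clean := "run sudo apt update"
	// Same text with U+200B inserted inside the keyword (illustrative placement).
	hidden := "run su\u200bdo apt update"

	fmt.Println("clean matches: ", matchesSudo(clean))
	fmt.Println("hidden matches:", matchesSudo(hidden))
}
```

The rule fires on the clean string but not on the one containing U+200B, because the literal `sudo` is no longer a contiguous byte sequence.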
What to detect
Zero-width characters
| Character | Codepoint | Name |
| --- | --- | --- |
|  | U+200B | Zero-width space |
|  | U+200C | Zero-width non-joiner |
|  | U+200D | Zero-width joiner |
|  | U+FEFF | BOM (when not at byte position 0) |
A BOM at byte position 0 is a legitimate encoding marker and should be excluded. A BOM appearing mid-file is suspicious and should be flagged.
Directional override characters
| Range | Name |
| --- | --- |
| U+202A–U+202E | LRE, RLE, PDF, LRO, RLO |
| U+2066–U+2069 | LRI, RLI, FSI, PDI |
These embed or override text direction. They're used in internationalized text but have no place in English-language Markdown skill files.
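The codepoints listed in the two tables above could be held in a single lookup that also supplies the character name for reporting. This is a minimal sketch: `flaggedNames` and `flaggedName` are hypothetical names, the name strings are illustrative, and the BOM position rule is deliberately left to the caller.

```go
package main

import "fmt"

// flaggedNames maps each codepoint from the "What to detect" tables to a
// display name. The map and its strings are illustrative, not an existing API.
var flaggedNames = map[rune]string{
	0x200B: "zero-width space",
	0x200C: "zero-width non-joiner",
	0x200D: "zero-width joiner",
	0xFEFF: "byte order mark (BOM)", // only suspicious when not at byte position 0
	0x202A: "left-to-right embedding (LRE)",
	0x202B: "right-to-left embedding (RLE)",
	0x202C: "pop directional formatting (PDF)",
	0x202D: "left-to-right override (LRO)",
	0x202E: "right-to-left override (RLO)",
	0x2066: "left-to-right isolate (LRI)",
	0x2067: "right-to-left isolate (RLI)",
	0x2068: "first strong isolate (FSI)",
	0x2069: "pop directional isolate (PDI)",
}

// flaggedName returns the display name for r and whether r is in the set.
func flaggedName(r rune) (string, bool) {
	name, ok := flaggedNames[r]
	return name, ok
}

func main() {
	for _, r := range []rune{'a', 0x200B, 0x202E} {
		if name, ok := flaggedName(r); ok {
			fmt.Printf("U+%04X is flagged: %s\n", r, name)
		} else {
			fmt.Printf("U+%04X is allowed\n", r)
		}
	}
}
```

A map keeps the list easy to audit against the tables; range comparisons would work equally well for the two contiguous directional blocks.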
Proposed severity levels
| Pattern | Level |
| --- | --- |
| Zero-width characters (U+200B, U+200C, U+200D) | ERROR |
| BOM mid-file (U+FEFF not at position 0) | ERROR |
| Directional overrides (U+202A–U+202E, U+2066–U+2069) | ERROR |
All of these should be errors rather than warnings — there is no legitimate use case, and their presence is a strong corruption or tampering signal.
Implementation notes
- Scan all files in the skill package: SKILL.md, all reference files, and any other text files
- Report the file path, byte offset, and character name for each finding so authors can locate and remove them
- The check can operate on raw bytes/runes — no Markdown parsing needed
- Consider a new check function like `CheckInvisibleCharacters(dir string, files []string) []types.Result` in the `structure` package
- The scan should iterate over runes in each file and check against a set of flagged codepoints
- False positive rate is near zero — skill files should never contain these characters
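The notes above might translate into something like the following sketch of the per-file scan. It is not the validator's actual code: `Finding` is a hypothetical stand-in for `types.Result` (whose fields are not shown in this proposal), and `classify` and `scanBytes` are assumed helper names. A real `CheckInvisibleCharacters` would read each file under `dir` and pass its bytes to a function like `scanBytes`.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Finding is a hypothetical stand-in for the validator's types.Result.
type Finding struct {
	Path    string
	Offset  int    // byte offset of the flagged rune within the file
	Rune    rune   // the flagged codepoint
	Pattern string // row of the severity table that matched
}

// classify returns the severity-table pattern for r at the given byte offset,
// or "" if the rune is allowed.
func classify(r rune, offset int) string {
	switch {
	case r >= 0x200B && r <= 0x200D:
		return "zero-width character"
	case r == 0xFEFF && offset > 0: // a BOM at byte position 0 is excluded
		return "BOM mid-file"
	case (r >= 0x202A && r <= 0x202E) || (r >= 0x2066 && r <= 0x2069):
		return "directional override"
	}
	return ""
}

// scanBytes walks data rune by rune and records every flagged codepoint.
// Invalid UTF-8 decodes to utf8.RuneError (size 1) and is not flagged here.
func scanBytes(path string, data []byte) []Finding {
	var findings []Finding
	for offset := 0; offset < len(data); {
		r, size := utf8.DecodeRune(data[offset:])
		if pattern := classify(r, offset); pattern != "" {
			findings = append(findings, Finding{
				Path: path, Offset: offset, Rune: r, Pattern: pattern,
			})
		}
		offset += size
	}
	return findings
}

func main() {
	// Leading BOM (allowed), then a zero-width space and an RLO later on.
	data := []byte("\uFEFFrun su\u200bdo \u202Enow")
	for _, f := range scanBytes("SKILL.md", data) {
		fmt.Printf("%s: byte offset %d: U+%04X (%s)\n", f.Path, f.Offset, f.Rune, f.Pattern)
	}
}
```

Decoding with `utf8.DecodeRune` keeps the byte offset explicit, which is what the report needs, and no Markdown parsing is involved.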
Out of scope (for now)
Homoglyph detection (e.g., Cyrillic о vs Latin o) is related but significantly harder to implement reliably without internationalization false positives. That can be considered separately if there's demand.