Imagine a writing layer designed for how modern AI reads. It looks different. It means the same. AI still understands. That’s CrossSpeak.
CrossSpeak is a Unicode homoglyph transformation system developed by Syhunt that re-expresses text using parallel characters from diverse scripts, preserving semantic meaning while altering its orthographic representation. Modern multilingual LLMs with robust Unicode generalization can still interpret the text reliably.
With CrossSpeak encoding, you get semantic invariance: the meaning of the text remains intact even as its character layer undergoes a structural transformation. Characters from Greek, Cyrillic, Armenian, phonetic notation, mathematics, and historical alphabets reshape the surface of the text without altering intent. The result is text that:
- Looks unfamiliar to ASCII-bound validators.
- Remains readable to humans.
- Stays semantically coherent for robust AI models.
CrossSpeak can be used by security researchers to evaluate how AI-integrated web applications handle orthographic variation. In particular, it helps analyze:
- The interaction between Unicode normalization and LLM-driven transformations.
- How AI-mediated rewriting or canonicalization may affect input validation.
- The consistency of XSS defenses when model output is reintroduced into web contexts.
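The first point above can be sketched in a few lines of Python. This is a minimal illustration, not CrossSpeak's own code: the `ascii_keyword_filter` helper and the sample strings are assumptions made for the example. It shows that NFKC normalization folds fullwidth letters back to ASCII but leaves Cyrillic lookalikes untouched, so a validator that relies on normalization alone still sees cross-script text as unfamiliar.

```python
import unicodedata

def ascii_keyword_filter(text: str, blocked=("script",)) -> bool:
    """Hypothetical ASCII-bound check: True if a blocked keyword is found."""
    lowered = text.lower()
    return any(word in lowered for word in blocked)

plain = "script"
# Cyrillic lookalikes for s, c, i, p (U+0455, U+0441, U+0456, U+0440); r and t stay Latin.
lookalike = "\u0455\u0441r\u0456\u0440t"
# Fullwidth Latin letters spelling "script".
fullwidth = "\uff53\uff43\uff52\uff49\uff50\uff54"

for label, sample in (("plain", plain), ("lookalike", lookalike), ("fullwidth", fullwidth)):
    normalized = unicodedata.normalize("NFKC", sample)
    print(label,
          "flagged raw:", ascii_keyword_filter(sample),
          "flagged after NFKC:", ascii_keyword_filter(normalized))
```

Only the plain string is flagged before normalization. After NFKC the fullwidth string is flagged as well, while the Cyrillic variant is not, because those letters are distinct codepoints with no compatibility mapping to Latin.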
These evaluation scenarios are examined in detail in our paper on Cross-Model Scripting (XMS), which formalizes AI-mediated injection conditions arising from post-model transformations and inconsistent validation boundaries.
By revealing the gap between byte-level validation and semantic interpretation, CrossSpeak serves as a diagnostic instrument for studying the XMS class of vulnerabilities. It is not an exploitation framework, but a structured method for understanding how AI-driven processing can reshape security assumptions — and for strengthening defenses accordingly.
Why the Name CrossSpeak: CrossSpeak is named for what it enables: text that crosses orthographic and system boundaries without semantic loss.
- It crosses scripts - moving between Latin, Greek, Cyrillic, Armenian, and beyond.
- It crosses assumptions - from ASCII-bound validation to Unicode diversity.
- It crosses systems - remaining interpretable by modern AI models even when rigid validators falter.
LLM-compatible ≠ universally compatible: Models with limited Unicode robustness or narrow tokenization assumptions may degrade, fragment, or respond unpredictably to CrossSpeak text.
Not a new language: CrossSpeak does not create a new language. It demonstrates that language can travel across representational layers while preserving semantic intent.
Not encryption: CrossSpeak does not conceal information. It re-encodes text at the character layer while preserving semantic content.
As you can see, the resulting writing is:
- recognizable to human readers
- semantically intact for multilingual AI models
- structurally foreign to systems that expect strict ASCII
CrossSpeak is a Unicode transformation system that preserves semantic meaning while altering the orthographic encoding of text.
Text processed through CrossSpeak remains readable and interpretable by large language models with robust Unicode handling. Its defining property is semantic invariance: the meaning of the text does not change, even though its character representation does.
CrossSpeak re-expresses ordinary language using parallel characters drawn from real-world scripts. It modifies only how text is еոϲοɗеɗ in Unicode - not what the text says.
It does not encrypt, hide, or obfuscate information. It preserves meaning while геѕτгυϲτυгɩոɡ геρгеѕеոτɑτɩοո.
Unicode was designed to represent the full diversity of human writing - Greek, Cyrillic, Armenian, phonetic notation, mathematics, and historical alphabets. Many of these characters resemble Latin letters through shared ancestry, yet remain distinct codepoints so that each language can exist digitally without corruption.
This structural richness creates a parallel glyph space: visually similar characters that read as equivalent in context but are computationally distinct at the encoding level.
CrossSpeak leverages this property.
Instead of altering words, grammar, or meaning, it substitutes characters with carefully selected Unicode counterparts drawn from real-world scripts. The resulting text remains readable to humans and interpretable by robust language models, because the semantic structure of the sentence is unchanged.
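A minimal sketch of this kind of character-level substitution is shown below. The mapping is a deliberately tiny, hypothetical table (five pairs, chosen to mirror the sample word "encoded" used earlier in this document); it is not the actual CrossSpeak table, and the function names are assumptions.

```python
# Hypothetical mini-table; the real CrossSpeak mappings are not reproduced here.
HOMOGLYPHS = {
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "n": "\u0578",  # ARMENIAN SMALL LETTER VO
    "c": "\u03f2",  # GREEK LUNATE SIGMA SYMBOL
    "o": "\u03bf",  # GREEK SMALL LETTER OMICRON
    "d": "\u0257",  # LATIN SMALL LETTER D WITH HOOK
}

FORWARD = str.maketrans(HOMOGLYPHS)
REVERSE = str.maketrans({v: k for k, v in HOMOGLYPHS.items()})

def to_lookalike(text: str) -> str:
    """Swap mapped characters for their counterparts; leave everything else as-is."""
    return text.translate(FORWARD)

def from_lookalike(text: str) -> str:
    """Invert the substitution for characters covered by the table."""
    return text.translate(REVERSE)

original = "encoded"
transformed = to_lookalike(original)

print(transformed)
print([f"U+{ord(ch):04X}" for ch in transformed])
print(from_lookalike(transformed) == original)
```

Words, word order, and length are unchanged; only the codepoints differ. Because the table is one-to-one, the substitution is trivially reversible, which is consistent with the point that this is a re-encoding and not encryption.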
Modern LLMs are trained on multilingual corpora that already include a wide range of scripts and orthographic variation. As a result, many models learn to generalize across visually and structurally similar characters. When they encounter CrossSpeak text, they can often resolve these variations back into coherent linguistic meaning.
In short:
Unicode permits orthographic diversity. Language models learn semantic generalization. CrossSpeak sits at the intersection of the two.
It changes representation - not meaning.
- AI-Native Communication: Keep documents and prompts intelligible to modern language models in environments where rigid ASCII filters distort otherwise valid expression.
- Security & Robustness Research: Examine how tokenizers, LLMs, and moderation systems respond to realistic cross-script input and design normalization-aware defenses.
- Reduction of False Positives: Avoid over-blocking caused by keyword lists that ignore the multilingual nature of Unicode.
- Dataset Evaluation: Test watermarking, detection methods, and training pipelines against script-diverse perturbations.
- Creative Literacy: Enable artistic and narrative forms that live between alphabets without altering intent.
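For the research and dataset-evaluation uses above, one simple building block is a check for words that mix scripts. The sketch below is an assumption-laden approximation: it uses the first word of each character's Unicode name as a rough script label, since Python's standard library does not expose the Unicode Script property. The helper names are hypothetical.

```python
import unicodedata
from collections import Counter

def script_profile(text: str) -> Counter:
    """Count letters per rough script label (first word of the Unicode character name)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts

def is_mixed_script(word: str) -> bool:
    """True if the letters of a single word carry more than one script label."""
    return len(script_profile(word)) > 1

plain = "encoded"
# The same word written with Cyrillic, Armenian, Greek, and extended Latin letters.
mixed = "\u0435\u0578\u03f2\u03bf\u0257\u0435\u0257"

print(script_profile(plain), is_mixed_script(plain))
print(script_profile(mixed), is_mixed_script(mixed))
```

A flag like this is a signal for review, not a verdict: legitimate multilingual text also mixes scripts, which is why keyword lists and detectors that treat any non-ASCII character as suspicious tend to over-block.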
The name CrossSpeak was chosen deliberately. At its core, CrossSpeak enables text to cross representational boundaries without losing meaning. It moves across scripts - Latin, Greek, Cyrillic, Armenian, mathematical and historical alphabets - while preserving semantic intent. It also crosses system expectations, remaining interpretable to modern AI models even when ASCII-centric validation routines do not recognize the transformed character patterns.
In a cybersecurity context, the name carries a second, intentional resonance: cross-site scripting (XSS).
CrossSpeak helps security researchers study scenarios where:
- Unicode-transformed input bypasses ASCII-bound pattern checks
- An LLM is asked to rewrite, translate, or normalize that input
- The model outputs canonical ASCII representations
- That output is inserted into an HTML context without proper escaping
In these pipelines, the “cross” is not just cross-site - it is cross-boundary:
- crossing from filtered input to model transformation,
- crossing from Unicode variation back to executable form.
CrossSpeak does not exist to facilitate attacks. It exists to make these boundary transitions visible, testable, and understandable in AI-integrated applications.
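The boundary transitions described above can be made testable with a small harness. The sketch below is illustrative only: `stub_model_rewrite` is a stand-in for an LLM that "normalizes" its input, not a real model call, and every name in it is hypothetical. It demonstrates why escaping must happen where model output enters the HTML context, regardless of what the input filter saw.

```python
import html

# Hypothetical lookalike table used only by the stub "model" below.
LOOKALIKE_TO_ASCII = {"\u0455": "s", "\u0441": "c", "\u0456": "i", "\u0440": "p"}

def input_filter_allows(text: str) -> bool:
    """ASCII-bound pattern check applied before the model sees the text."""
    return "<script" not in text.lower()

def stub_model_rewrite(text: str) -> str:
    """Stand-in for an LLM that rewrites lookalike characters into canonical ASCII."""
    return "".join(LOOKALIKE_TO_ASCII.get(ch, ch) for ch in text)

def render_unescaped(fragment: str) -> str:
    """Inserts model output into HTML as-is (the condition under study)."""
    return f"<p>{fragment}</p>"

def render_escaped(fragment: str) -> str:
    """Escapes model output at the output boundary."""
    return f"<p>{html.escape(fragment)}</p>"

tag = "\u0455\u0441r\u0456\u0440t"  # looks like "script", but mostly Cyrillic
user_input = f"<{tag}>alert(1)</{tag}>"

assert input_filter_allows(user_input)        # the filter sees nothing it recognizes
model_output = stub_model_rewrite(user_input)  # the stub returns canonical ASCII

print(render_unescaped(model_output))
print(render_escaped(model_output))
```

The design point is that the escaped renderer stays safe whether or not the input filter, the normalizer, or the model behaved as expected, because it treats model output as untrusted data.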
CrossSpeak lets you choose how < and > are handled.
- Untouched (default): keeps < > as ASCII.
- Option A (Mathematical): < > → ⟨ ⟩
- Option B (Guillemets): < > → ‹ ›
- Option C (Fullwidth): < > → ＜ ＞
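As a rough sketch, the four angle-bracket modes can be expressed as small translation tables. The mode names and the `map_angles` function are assumptions made for illustration and do not reflect CrossSpeak's actual interface. The loop at the end also shows how each choice interacts with NFKC normalization.

```python
import unicodedata

# Hypothetical mode names; codepoints follow the options listed above.
ANGLE_MODES = {
    "untouched": {},
    "mathematical": {"<": "\u27e8", ">": "\u27e9"},  # mathematical angle brackets
    "guillemets": {"<": "\u2039", ">": "\u203a"},    # single angle quotation marks
    "fullwidth": {"<": "\uff1c", ">": "\uff1e"},     # fullwidth less-than / greater-than
}

def map_angles(text: str, mode: str = "untouched") -> str:
    """Replace ASCII angle brackets according to the selected mode."""
    return text.translate(str.maketrans(ANGLE_MODES[mode]))

sample = "<b>"
for mode in ANGLE_MODES:
    mapped = map_angles(sample, mode)
    folded = unicodedata.normalize("NFKC", mapped)
    print(mode, [f"U+{ord(ch):04X}" for ch in mapped], "NFKC restores ASCII:", folded == sample)
```

Under these assumptions, the fullwidth option is the only substitution that NFKC folds back to ASCII `<` and `>`; the mathematical brackets and guillemets survive normalization unchanged. That difference matters when evaluating pipelines that normalize before or after validation.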
- Introducing Cross-Model Scripting (XMS) vulnerabilities https://www.syhunt.com/en/?n=Articles.2026-CrossModelScripting
- Evading AI-Generated Content Detectors using Homoglyphs https://arxiv.org/abs/2406.11239v1
- Defending LLM Applications Against Unicode Character Smuggling https://www.cloudthat.com/resources/blog/defending-llm-applications-against-unicode-character-smuggling
Copyright (c) 2026 Syhunt & DaragonTech
Released under a 3-clause BSD license for research and experimental use - see the LICENSE file for details.

