Utility for post-processing AI-generated text. It normalises output by
removing invisible characters, deterministic raw-text watermark artifacts,
and visible model/source annotation artifacts; folding exotic whitespace;
converting "pretty" punctuation to ASCII; and stripping inline citation
placeholders such as (oaicite:12){index=12}.
ai-text-sanitizer is a tiny zero-dependency ES module for cleaning and normalising raw text generated by large language models before you render, store, or diff it.
The library removes invisible Unicode watermark characters, Unicode tag payloads, supplementary variation selectors, visible model/source transport artifacts, other default-ignorable controls, exotic whitespace, and ASCII control codes. It also converts fancy punctuation to plain ASCII, strips inline citation placeholders, and optionally collapses redundant spaces, all while returning per-rule change statistics so you can audit the process.
- Removes Unicode format and other zero-width characters that can act as invisible watermarks.
- Strips Unicode tag payloads, supplementary variation selectors, soft hyphen, combining grapheme joiner, Arabic letter mark, Hangul/Khmer/Mongolian filler marks, shorthand format controls, and musical format controls.
- Converts fancy punctuation (curly quotes, en/em dashes, ellipsis, bullets) to plain ASCII equivalents.
- Folds a wide range of Unicode space characters to a standard space.
- Collapses runs of multiple spaces and normalises line endings to LF.
- Eliminates citation placeholders emitted by some language models.
- Removes bounded visible model annotation envelopes copied from chat surfaces while preserving malformed fragments and unrelated private-use glyphs.
- Optionally preserves or removes emoji glue characters (ZWJ / variation selectors).
- Returns granular change statistics so you can audit the cleaning process, including tag, variation-selector, default-ignorable, and visible model artifact counters.
- Offers heuristic, SynthID-compatible, and soft-watermark detectors so you can record findings without mutating text.
- Provides opt-in rewrite strategies (light/aggressive) to locally paraphrase flagged passages.
- Ships with a CLI (npx ai-text-sanitizer) for batch reports and CI gates.
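
To make a few of the rules above concrete, here is a minimal sketch in plain JavaScript. It is not the library's implementation: sketchSanitize, the character sets, and the punctuation mapping are assumptions chosen for illustration, and only four of the counters are modelled.

```js
// Illustrative sketch only; NOT the library's implementation.
// The character sets and mappings are a small, assumed subset.
const ZERO_WIDTH = /[\u200B\u2060\uFEFF]/g;                      // a few zero-width marks
const EXOTIC_SPACE = /[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g; // folded to U+0020
const CITATION = /[ \t]*\(oaicite:\d+\)\{index=\d+\}/g;          // e.g. (oaicite:12){index=12}
const PRETTY = new Map([
  ['\u201C', '"'], ['\u201D', '"'], // curly double quotes
  ['\u2018', "'"], ['\u2019', "'"], // curly single quotes
  ['\u2013', '-'], ['\u2014', '-'], // en/em dash (assumed mapping)
  ['\u2026', '...'],                // ellipsis
  ['\u2022', '*'],                  // bullet (assumed mapping)
]);
const PRETTY_RE = new RegExp(`[${[...PRETTY.keys()].join('')}]`, 'g');

function sketchSanitize(text) {
  const changes = { removedInvisible: 0, removedCitations: 0, prettified: 0, collapsedSpaces: 0 };
  let out = text.replace(/\r\n?/g, '\n'); // normalise line endings to LF
  out = out.replace(ZERO_WIDTH, () => { changes.removedInvisible++; return ''; });
  // Also drops spaces or tabs directly before the placeholder (a choice made for this sketch).
  out = out.replace(CITATION, () => { changes.removedCitations++; return ''; });
  out = out.replace(EXOTIC_SPACE, ' ');
  out = out.replace(PRETTY_RE, (ch) => { changes.prettified++; return PRETTY.get(ch); });
  // Each collapsed run counts once here; the library may count differently.
  out = out.replace(/ {2,}/g, () => { changes.collapsedSpaces++; return ' '; });
  return { cleaned: out, changes };
}
```

The order matters in this sketch: exotic spaces are folded before runs of spaces are collapsed, so a pair of no-break spaces ends up as a single ASCII space.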
    pnpm add ai-text-sanitizer

This project is published as an ES module and requires Node ≥ 18.
    import { sanitizeAiText } from 'ai-text-sanitizer';

    const input = `“Hello\u200B world…” (oaicite:5){index=5}`;
    const { cleaned, changes } = sanitizeAiText(input);
    console.log(cleaned); // "Hello world..."
    console.log(changes); /* {
      removedInvisible: 1,
      removedTags: 0,
      removedVariationSelectors: 0,
      removedDefaultIgnorables: 1,
      removedCtrl: 0,
      removedCitations: 1,
      removedModelArtifacts: 0,
      prettified: 3,
      collapsedSpaces: 0,
      total: 5
    } */

ai-text-sanitizer ships with built-in .d.ts declarations. Nothing extra to install; just import and enjoy full IntelliSense:
    import { sanitizeAiText, type SanitizeResult } from 'ai-text-sanitizer';

    const result: SanitizeResult = sanitizeAiText('مرحبا\u200Fالعالم');
    console.log(result.cleaned);

sanitizeAiText(text, options?)
| Option | Default | Description |
|---|---|---|
| keepEmoji | true | Preserve ZWJ / variation selectors when they are part of a valid emoji grapheme. |
| keepBidi | false | Allow bidi control marks to survive (useful for mixed RTL/LTR documents). |
| keepTabs | true | Preserve horizontal tabs. |
| keepNewlines | true | Preserve \n / \r. |
| collapseSpaces | true | Collapse repeated ASCII spaces after folding exotic ones. |
| nfkc | false | Apply aggressive NFKC folding (defaults to NFC). |
| detectors | [] | Array of detector names ('heuristic', 'synthid', 'soft-watermark'). |
| detectorConfigs | {} | Config objects for detectors (see below). |
| rewriteStrategy | 'none' | 'none', 'light', or 'aggressive' deterministic rewrites. |
sanitizeInvisible(text, opts) is also exported for low-level workflows where
you only want to strip hidden characters while keeping the rest of the pipeline.
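
The keepEmoji behaviour in the table can be pictured with a small context check. The sketch below is a simplification and an assumption, not the library's logic: stripStrayZwj is a hypothetical helper, it handles only the zero-width joiner (U+200D), and it treats a joiner as part of an emoji when it sits between two pictographic characters.

```js
// Simplified sketch (assumption): a ZWJ survives only when the character before it
// is pictographic (optionally followed by VS16 or a skin-tone modifier) AND the
// character after it is pictographic. Requires Node >= 18 for \p{...} and lookbehind.
const STRAY_ZWJ =
  /(?<!\p{Extended_Pictographic}[\uFE0F\p{Emoji_Modifier}]?)\u200D|\u200D(?!\p{Extended_Pictographic})/gu;

function stripStrayZwj(text, { keepEmoji = true } = {}) {
  // With keepEmoji disabled, every ZWJ is removed regardless of context.
  return keepEmoji ? text.replace(STRAY_ZWJ, '') : text.replace(/\u200D/g, '');
}

const technologist = '\u{1F469}\u200D\u{1F4BB}'; // woman + ZWJ + laptop
console.log(stripStrayZwj('a\u200Db'));                       // joiner between letters is dropped
console.log(stripStrayZwj(technologist) === technologist);    // joined emoji is left intact
```

A real implementation would also need to cover variation selectors and the full emoji ZWJ sequence rules; this check only shows why context is needed before deleting a joiner.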
sanitizeAiText(text, options?) → { cleaned, changes }
| Parameter | Type | Default | Description |
|---|---|---|---|
| text | string | – | Input text to sanitise. |
| options | object (optional) | – | Behaviour flags (below). |
| keepEmoji | boolean | true | Keep ZWJ / variation selectors used by emoji. |
| collapseSpaces | boolean | true | Collapse contiguous ASCII spaces. |
| keepBidi | boolean | false | Preserve bidi control marks. |
| detectors | Array | [] | Run optional watermark detectors (see below). |
The returned changes object reports how many code points were altered for
each rule plus a total sum. removedTags, removedVariationSelectors, and
removedDefaultIgnorables are audit counters for subcategories of hidden text
channels; removedInvisible remains the removal total used for these hidden
characters, while bidi controls remain counted under removedBidi, so totals
are not double-counted.
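
One way to read the counting rule is to recompute the total from the quick-start output while skipping the subcategory counters. The helper below is hypothetical; the counter names come from the example above, and which counters feed total is an assumption based on that single example.

```js
// Subcategory counters are already included in removedInvisible, so they are skipped.
const SUBCATEGORY_COUNTERS = new Set([
  'removedTags',
  'removedVariationSelectors',
  'removedDefaultIgnorables',
]);

// Hypothetical helper: sums every top-level counter except `total` itself.
function sumTopLevelChanges(changes) {
  return Object.entries(changes)
    .filter(([name]) => name !== 'total' && !SUBCATEGORY_COUNTERS.has(name))
    .reduce((sum, [, count]) => sum + count, 0);
}

// The `changes` object from the quick-start example.
const quickStartChanges = {
  removedInvisible: 1,
  removedTags: 0,
  removedVariationSelectors: 0,
  removedDefaultIgnorables: 1,
  removedCtrl: 0,
  removedCitations: 1,
  removedModelArtifacts: 0,
  prettified: 3,
  collapsedSpaces: 0,
  total: 5,
};
console.log(sumTopLevelChanges(quickStartChanges) === quickStartChanges.total); // true
```

Summing all ten counters naively would give 6, because the zero-width space would be counted under both removedInvisible and removedDefaultIgnorables.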
Three detectors ship with the library:
- heuristic – counts visible model annotation artifacts, zero-width marks, exotic spaces, Unicode tag payloads, variation selectors, and other default-ignorable controls in the original text. Findings include optional category, count, and codePoints metadata.
- synthid – delegates to your SynthID scoring routine. Provide detectorConfigs.synthid = { score(tokens) { ... } } to bridge Google's open implementation and pass your key/PRF.
- soft-watermark – ports the statistical test from A Watermark for Large Language Models. Supply tokens plus a greenlist predicate to receive p-values.
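
For orientation, the green-list test behind the soft-watermark detector can be sketched as a one-sided z-test on the number of green tokens. This is a standalone illustration, not the detector's actual code or configuration shape: greenListTest, isGreen(prev, token), and gamma (the expected green fraction) are hypothetical names the caller would supply.

```js
// Upper-tail probability P(Z > z) of the standard normal, using the
// Abramowitz-Stegun 7.1.26 approximation of erfc (absolute error about 1.5e-7).
function upperTail(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
    + t * (-1.453152027 + t * 1.061405429))));
  const tail = 0.5 * poly * Math.exp(-x * x); // P(Z > |z|)
  return z >= 0 ? tail : 1 - tail;
}

// z = (green - gamma * T) / sqrt(T * gamma * (1 - gamma)), with T scored tokens.
function greenListTest(tokens, isGreen, gamma = 0.5) {
  const scored = tokens.length - 1; // the first token has no predecessor to seed its list
  if (scored < 1) return { scored: 0, green: 0, z: 0, pValue: 1 };
  let green = 0;
  for (let i = 1; i < tokens.length; i++) {
    if (isGreen(tokens[i - 1], tokens[i])) green++;
  }
  const z = (green - gamma * scored) / Math.sqrt(scored * gamma * (1 - gamma));
  return { scored, green, z, pValue: upperTail(z) };
}
```

A small p-value means the text contains more green tokens than chance would explain. The result is only meaningful when isGreen reproduces the same keyed partition the generator used, which is why the detector relies on caller-provided hooks.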
Every detector reports structured findings that you can stash or serialize.
Use unifiedDiff(original, cleaned) to create a CI-friendly diff for manual
audits, or call the CLI with --report to export JSON.
SynthID, green-list watermarks, and modern AI text classifiers are token- or statistics-based systems. This package only removes deterministic string-level identifiers and transport artifacts by default; token-statistical watermark checks require caller-provided detector hooks and do not mutate text.
Calling sanitizeAiText with rewriteStrategy: 'light' | 'aggressive' applies a
deterministic, local rewrite after sanitization. Rewrite strategies are opt-in
and are not part of the default prevention path. The light strategy flips
quotation style and re-chunks sentences; the aggressive strategy further
substitutes a small synonym map. Rewrites are counted in
changes.rewrittenSegments so you can detect when phrasing was adjusted.
Run the CLI without installing globally:
    npx ai-text-sanitizer --in draft.txt --out cleaned.txt --report findings.json --strict

--strict exits with code 2 whenever destructive sanitization occurred or
detectors raised a flag. Pair it with CI to block merges containing disallowed
Unicode marks.
    pnpm install
    pnpm test

Tests live in __tests__/ and exercise typical real-world scenarios including
HTML fragments, code snippets, emoji sequences, and BOM handling.
- The function operates on raw strings; it does not parse or sanitise HTML structure. HTML tags remain untouched but are treated as plain text.
- The sanitizer removes deterministic string-level identifiers and transport artifacts. It does not promise that text will evade SynthID, green-list watermark detectors, neural AI detectors, or third-party "AI text" scoring tools.
- C2PA and similar content credentials live in file or asset metadata, not in plain string content, so they are outside this API's scope.
- The mapping of fancy punctuation is intentionally conservative. If you need broader transliteration, customise the PRETTIES table in aiTextSanitizer.js.
Contributions, bug reports, and feature requests are very welcome — feel free to open an issue or submit a pull request. Please ensure the test suite passes (pnpm test) and follow conventional commit messages for ease of release automation.
This repository contains only the core library and test suite to keep the footprint minimal.