ai-text-sanitizer

Utility for post-processing AI-generated text. It normalises output by removing invisible characters, deterministic raw-text watermark artifacts, and visible model/source annotation artifacts; folding exotic whitespace; converting "pretty" punctuation to ASCII; and stripping inline citation placeholders such as (oaicite:12){index=12}.

About

ai-text-sanitizer is a tiny zero-dependency ES module for cleaning and normalising raw text generated by large language models before you render, store, or diff it.

Description

The library removes invisible Unicode watermark characters, Unicode tag payloads, supplementary variation selectors, visible model/source transport artifacts, other default-ignorable controls, exotic whitespace, and ASCII control codes. It also converts fancy punctuation to plain ASCII, strips inline citation placeholders, and optionally collapses redundant spaces, all while returning per-rule change statistics so you can audit the process.
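The passes described above can be approximated in a few lines. What follows is a standalone sketch, not the library's source: the character ranges are a small subset chosen for illustration, the dash and bullet mappings are guesses, and `sketchSanitize` and the `foldedSpaces` counter are hypothetical names.

```javascript
// Illustrative approximation only; the real rule tables are broader.
const ZERO_WIDTH = /[\u200B\u200C\u200D\u2060\uFEFF]/g;            // subset of invisible marks
const EXOTIC_SPACE = /[\u00A0\u2000-\u200A\u202F\u205F\u3000]/g;   // Unicode spaces folded to U+0020
const CITATION = /\s*\(oaicite:\d+\)\{index=\d+\}/g;               // placeholder shape shown in this README
const PRETTY = new Map([
  ['\u201C', '"'], ['\u201D', '"'], ['\u2018', "'"], ['\u2019', "'"],
  ['\u2013', '-'], ['\u2014', '-'], ['\u2026', '...'], ['\u2022', '*'],
]);
const PRETTY_RE = new RegExp('[' + [...PRETTY.keys()].join('') + ']', 'g');

function sketchSanitize(text) {
  const changes = { removedInvisible: 0, foldedSpaces: 0, removedCitations: 0, prettified: 0 };
  const cleaned = text
    .replace(/\r\n?/g, '\n')                                                  // normalise line endings to LF
    .replace(ZERO_WIDTH, () => { changes.removedInvisible++; return ''; })
    .replace(EXOTIC_SPACE, () => { changes.foldedSpaces++; return ' '; })
    .replace(CITATION, () => { changes.removedCitations++; return ''; })
    .replace(PRETTY_RE, (ch) => { changes.prettified++; return PRETTY.get(ch); })
    .replace(/ {2,}/g, ' ');                                                  // collapse runs of ASCII spaces
  return { cleaned, changes };
}

const demo = sketchSanitize('\u201CHello\u200B world\u2026\u201D (oaicite:5){index=5}');
console.log(demo.cleaned); // "Hello world..."
```

Counting inside the replacer callbacks is one simple way to produce per-rule statistics without a second scan of the text.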

Features

Removes Unicode format and other zero-width characters that can act as invisible watermarks.
Strips Unicode tag payloads, supplementary variation selectors, soft hyphen, combining grapheme joiner, Arabic letter mark, Hangul/Khmer/Mongolian filler marks, shorthand format controls, and musical format controls.
Converts fancy punctuation (curly quotes, en/em dashes, ellipsis, bullets) to plain ASCII equivalents.
Folds a wide range of Unicode space characters to a standard space.
Collapses runs of multiple spaces and normalises line endings to LF.
Eliminates citation placeholders emitted by some language models.
Removes bounded visible model annotation envelopes copied from chat surfaces while preserving malformed fragments and unrelated private-use glyphs.
Optionally preserves or removes emoji glue characters (ZWJ / variation selectors).
Returns granular change statistics so you can audit the cleaning process, including tag, variation-selector, default-ignorable, and visible model artifact counters.
Offers heuristic, SynthID-compatible, and soft-watermark detectors so you can record findings without mutating text.
Provides opt-in rewrite strategies (light / aggressive) to locally paraphrase flagged passages.
Ships with a CLI (npx ai-text-sanitizer) for batch reports and CI gates.
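To illustrate the emoji-glue feature, here is a simplified sketch of how a zero-width joiner (U+200D) might be kept only when it joins two pictographic characters. It is an assumption about the general approach, not the library's actual rule: `stripZwj` and `STRAY_ZWJ` are hypothetical names, and skin-tone modifiers and other edge cases are ignored.

```javascript
// Matches a ZWJ that is NOT sitting between two pictographic characters.
// An optional VS16 (U+FE0F) is allowed between the left emoji and the joiner.
const STRAY_ZWJ = /(?<!\p{Extended_Pictographic}\uFE0F?)\u200D|\u200D(?!\p{Extended_Pictographic})/gu;

function stripZwj(text, { keepEmoji = true } = {}) {
  // With keepEmoji disabled, every joiner is removed regardless of context.
  return text.replace(keepEmoji ? STRAY_ZWJ : /\u200D/gu, '');
}

const technologist = '\u{1F469}\u200D\u{1F4BB}';                 // woman + ZWJ + laptop
console.log(stripZwj(technologist) === technologist);            // true: joiner preserved
console.log(stripZwj('zero\u200Dwidth'));                        // zerowidth
console.log(stripZwj(technologist, { keepEmoji: false }).length); // 4: two surrogate pairs remain
```

The Unicode property escape `\p{Extended_Pictographic}` and lookbehind assertions are both available in the Node versions this package targets.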

Installation

pnpm add ai-text-sanitizer

This project is published as an ES module and requires Node ≥ 18.

Usage

import { sanitizeAiText } from 'ai-text-sanitizer';

const input = `“Hello\u200B world…” (oaicite:5){index=5}`;

const { cleaned, changes } = sanitizeAiText(input);

console.log(cleaned);  // "Hello world..."
console.log(changes);  /* {
                          removedInvisible: 1,
                          removedTags: 0,
                          removedVariationSelectors: 0,
                          removedDefaultIgnorables: 1,
                          removedCtrl: 0,
                          removedCitations: 1,
                          removedModelArtifacts: 0,
                          prettified: 3,
                          collapsedSpaces: 0,
                          total: 5
                        } */

TypeScript

ai-text-sanitizer ships with built-in .d.ts declarations. Nothing extra to install — just import and enjoy full IntelliSense:

import { sanitizeAiText, type SanitizeResult } from 'ai-text-sanitizer';

const result: SanitizeResult = sanitizeAiText('مرحبا\u200Fالعالم');
console.log(result.cleaned);

Options

sanitizeAiText(text, options?)

| Option | Default | Description |
| --- | --- | --- |
| `keepEmoji` | `true` | Preserve ZWJ / variation selectors when they are part of a valid emoji grapheme. |
| `keepBidi` | `false` | Allow bidi control marks to survive (useful for mixed RTL/LTR documents). |
| `keepTabs` | `true` | Preserve horizontal tabs. |
| `keepNewlines` | `true` | Preserve `\n` / `\r`. |
| `collapseSpaces` | `true` | Collapse repeated ASCII spaces after folding exotic ones. |
| `nfkc` | `false` | Apply aggressive NFKC folding (defaults to NFC). |
| `detectors` | `[]` | Array of detector names (`'heuristic'`, `'synthid'`, `'soft-watermark'`). |
| `detectorConfigs` | `{}` | Config objects for detectors (see below). |
| `rewriteStrategy` | `'none'` | `'none'`, `'light'`, or `'aggressive'` deterministic rewrites. |

sanitizeInvisible(text, opts) is also exported for low-level workflows where you only want to strip hidden characters while keeping the rest of the pipeline.
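The difference between the default NFC behaviour and the `nfkc` option can be seen with JavaScript's built-in `String.prototype.normalize`. Whether the library calls this method internally is an assumption; the snippet only demonstrates what the two normalization forms do to the same input.

```javascript
// "ﬁle ① é" followed by "e" + combining acute accent
const sample = '\uFB01le \u2460 \u00E9 e\u0301';

const nfc = sample.normalize('NFC');   // canonical composition only
const nfkc = sample.normalize('NFKC'); // also folds compatibility characters

console.log(nfc === '\uFB01le \u2460 \u00E9 \u00E9'); // true: ligature and circled digit survive
console.log(nfkc === 'file 1 \u00E9 \u00E9');         // true: ligature -> "fi", circled digit -> "1"
```

NFKC is lossy for typographic distinctions such as ligatures and circled digits, which is presumably why it is off by default.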

API

sanitizeAiText(text, options?) → { cleaned, changes }

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `text` | `string` | – | Input text to sanitise. |
| `options` | `object` (optional) | – | Behaviour flags (below). |
| `keepEmoji` | `boolean` | `true` | Keep ZWJ / variation selectors used by emoji. |
| `collapseSpaces` | `boolean` | `true` | Collapse contiguous ASCII spaces. |
| `keepBidi` | `boolean` | `false` | Preserve bidi control marks. |
| `detectors` | `Array` | `[]` | Run optional watermark detectors (see below). |

The returned changes object reports how many code points each rule altered, plus a total. removedTags, removedVariationSelectors, and removedDefaultIgnorables are audit counters for subcategories of hidden-text channels. Their removals are already included in removedInvisible, which remains the removal total for these hidden characters, and bidi controls are counted separately under removedBidi, so total does not double-count anything.
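One way to check the counters in an audit script is to recompute the total from the top-level counters only. The list of contributing keys below is an assumption inferred from the Usage example and the paragraph above, not a documented contract, and `sumTopLevel` is a hypothetical helper.

```javascript
// Counters assumed to contribute to `total`. The three subcategory counters
// (removedTags, removedVariationSelectors, removedDefaultIgnorables) are left
// out on purpose because removedInvisible already includes them.
const TOP_LEVEL = [
  'removedInvisible', 'removedBidi', 'removedCtrl', 'removedCitations',
  'removedModelArtifacts', 'prettified', 'collapsedSpaces',
];

function sumTopLevel(changes) {
  // Missing counters are treated as zero.
  return TOP_LEVEL.reduce((sum, key) => sum + (changes[key] ?? 0), 0);
}

// The changes object from the Usage section.
const usageChanges = {
  removedInvisible: 1, removedTags: 0, removedVariationSelectors: 0,
  removedDefaultIgnorables: 1, removedCtrl: 0, removedCitations: 1,
  removedModelArtifacts: 0, prettified: 3, collapsedSpaces: 0, total: 5,
};

console.log(sumTopLevel(usageChanges) === usageChanges.total); // true
```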

Watermark detectors

Three detectors ship with the library:

heuristic – counts visible model annotation artifacts, zero-width marks, exotic spaces, Unicode tag payloads, variation selectors, and other default-ignorable controls in the original text. Findings include optional category, count, and codePoints metadata.
synthid – delegates to your SynthID scoring routine. Provide detectorConfigs.synthid = { score(tokens) { ... } } to bridge Google's open implementation and pass your key/PRF.
soft-watermark – ports the statistical test from A Watermark for Large Language Models. Supply tokens plus a greenlist predicate to receive p-values.
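As a rough picture of what a heuristic finding might look like, the sketch below scans text for a few hidden-character categories and reports `category`, `count`, and `codePoints` for each. The category names, the character ranges, and the `scanHidden` function are illustrative assumptions, not the detector's real output format.

```javascript
// A few illustrative categories; the real detector covers more ranges.
const CATEGORIES = [
  { category: 'zero-width', pattern: /[\u200B-\u200D\u2060\uFEFF]/gu },
  { category: 'exotic-space', pattern: /[\u00A0\u2000-\u200A\u202F\u205F\u3000]/gu },
  { category: 'tag', pattern: /[\u{E0000}-\u{E007F}]/gu },
  { category: 'variation-selector', pattern: /[\uFE00-\uFE0F\u{E0100}-\u{E01EF}]/gu },
];

function scanHidden(text) {
  const findings = [];
  for (const { category, pattern } of CATEGORIES) {
    const hits = text.match(pattern) ?? [];
    if (hits.length === 0) continue;
    // Report each distinct code point once, formatted as U+XXXX.
    const codePoints = [...new Set(hits.map(
      (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
    ))];
    findings.push({ category, count: hits.length, codePoints });
  }
  return findings; // the input text is never modified
}

console.log(scanHidden('a\u200Bb\u00A0c\u{E0041}'));
```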

Every detector reports structured findings that you can stash or serialize. Use unifiedDiff(original, cleaned) to create a CI-friendly diff for manual audits, or call the CLI with --report to export JSON.

SynthID, green-list watermarks, and modern AI text classifiers are token- or statistics-based systems. This package only removes deterministic string-level identifiers and transport artifacts by default; token-statistical watermark checks require caller-provided detector hooks and do not mutate text.
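For readers unfamiliar with the green-list test, the following is a minimal sketch of the one-proportion z-test from A Watermark for Large Language Models: with T scored tokens, a green fraction γ, and g green tokens, z = (g − γT) / sqrt(Tγ(1 − γ)). The names `greenListTest`, `normalUpperTail`, and the `isGreen(prev, token)` predicate signature are assumptions for illustration and may not match the hook the library expects.

```javascript
// Upper-tail probability of a standard normal, using the Abramowitz-Stegun
// 7.1.26 approximation to erfc (absolute error around 1.5e-7).
function normalUpperTail(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
    + t * (-1.453152027 + t * 1.061405429))));
  const erfc = poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * erfc : 1 - 0.5 * erfc;
}

// tokens: array of token ids; isGreen(prev, token): caller-supplied predicate.
function greenListTest(tokens, isGreen, gamma = 0.5) {
  const scored = tokens.length - 1; // the first token has no predecessor
  if (scored < 1) throw new RangeError('need at least two tokens');
  let green = 0;
  for (let i = 1; i < tokens.length; i++) {
    if (isGreen(tokens[i - 1], tokens[i])) green++;
  }
  const z = (green - gamma * scored) / Math.sqrt(scored * gamma * (1 - gamma));
  return { green, scored, z, pValue: normalUpperTail(z) };
}

// Toy data: 100 scored tokens, 75 of which land on the "green" side.
const toyTokens = Array.from({ length: 101 }, (_, i) => i);
const toyResult = greenListTest(toyTokens, (_prev, tok) => tok % 4 !== 0);
console.log(toyResult.z); // 5
```

A small p-value only says the green-token count is unlikely under the no-watermark hypothesis for that particular key and green-list rule; it says nothing about other watermarking schemes.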

Rewrite strategies

Calling sanitizeAiText with rewriteStrategy: 'light' | 'aggressive' applies a deterministic, local rewrite after sanitization. Rewrite strategies are opt-in and are not part of the default prevention path. The light strategy flips quotation style and re-chunks sentences; the aggressive strategy further substitutes a small synonym map. Rewrites are counted in changes.rewrittenSegments so you can detect when phrasing was adjusted.

CLI

Run the CLI without installing globally:

npx ai-text-sanitizer --in draft.txt --out cleaned.txt --report findings.json --strict

--strict exits with code 2 whenever destructive sanitization occurred or detectors raised a flag. Pair it with CI to block merges containing disallowed Unicode marks.

Running the test suite

pnpm install
pnpm test

Tests live in __tests__/ and exercise typical real-world scenarios including HTML fragments, code snippets, emoji sequences, and BOM handling.

Limitations

The function operates on raw strings; it does not parse or sanitise HTML structure. HTML tags remain untouched but are treated as plain text.
The sanitizer removes deterministic string-level identifiers and transport artifacts. It does not promise that text will evade SynthID, green-list watermark detectors, neural AI detectors, or third-party "AI text" scoring tools.
C2PA and similar content credentials live in file or asset metadata, not in plain string content, so they are outside this API's scope.
The mapping of fancy punctuation is intentionally conservative. If you need broader transliteration, customise the PRETTIES table in aiTextSanitizer.js.

Contributing

Contributions, bug reports, and feature requests are very welcome — feel free to open an issue or submit a pull request. Please ensure the test suite passes (pnpm test) and follow conventional commit messages for ease of release automation.

This repository contains only the core library and test suite to keep the footprint minimal.
