BeMoreDifferent/ai-text-sanitizer

ai-text-sanitizer

Utility for post-processing AI-generated text. It normalises output by removing invisible characters, deterministic raw-text watermark artifacts, and visible model/source annotation artifacts; folding exotic whitespace; converting "pretty" punctuation to ASCII; and stripping inline citation placeholders such as (oaicite:12){index=12}.

About

ai-text-sanitizer is a tiny zero-dependency ES module for cleaning and normalising raw text generated by large language models before you render, store, or diff it.

Description

The library removes invisible Unicode watermark characters, Unicode tag payloads, supplementary variation selectors, visible model/source transport artifacts, other default-ignorable controls, exotic whitespace, and ASCII control codes. It also converts fancy punctuation to plain ASCII, strips inline citation placeholders, and optionally collapses redundant spaces, all while returning per-rule change statistics so you can audit the process.

Features

  • Removes Unicode format and other zero-width characters that can act as invisible watermarks.
  • Strips Unicode tag payloads, supplementary variation selectors, soft hyphen, combining grapheme joiner, Arabic letter mark, Hangul/Khmer/Mongolian filler marks, shorthand format controls, and musical format controls.
  • Converts fancy punctuation (curly quotes, en/em dashes, ellipsis, bullets) to plain ASCII equivalents.
  • Folds a wide range of Unicode space characters to a standard space.
  • Collapses runs of multiple spaces and normalises line endings to LF.
  • Eliminates citation placeholders emitted by some language models.
  • Removes bounded visible model annotation envelopes copied from chat surfaces while preserving malformed fragments and unrelated private-use glyphs.
  • Optionally preserves or removes emoji glue characters (ZWJ / variation selectors).
  • Returns granular change statistics so you can audit the cleaning process, including tag, variation-selector, default-ignorable, and visible model artifact counters.
  • Offers heuristic, SynthID-compatible, and soft-watermark detectors so you can record findings without mutating text.
  • Provides opt-in rewrite strategies (light / aggressive) to locally paraphrase flagged passages.
  • Ships with a CLI (npx ai-text-sanitizer) for batch reports and CI gates.
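
To make the first few features concrete, here is a minimal, self-contained sketch of three of the passes (zero-width removal, citation stripping, punctuation folding) with per-rule counters. It is not the library's implementation: the character tables, the citation pattern, the dash mapping, and the name sanitizeSketch are all assumptions chosen for illustration.

```javascript
// Illustrative sketch only. The tables below are a small assumed subset,
// not the rules that ai-text-sanitizer actually ships.
const ZERO_WIDTH = /[\u200B\u200C\u2060\uFEFF]/g; // ZWSP, ZWNJ, word joiner, BOM
const PRETTY = new Map([
  ['\u201C', '"'], ['\u201D', '"'], // curly double quotes
  ['\u2018', "'"], ['\u2019', "'"], // curly single quotes
  ['\u2013', '-'], ['\u2014', '-'], // en dash, em dash (assumed mapping)
  ['\u2026', '...'],                // horizontal ellipsis
]);
const PRETTY_RE = /[\u201C\u201D\u2018\u2019\u2013\u2014\u2026]/g;
// Assumed shape of a citation placeholder, with any whitespace before it.
const CITATION = /\s*\(oaicite:\d+\)\{index=\d+\}/g;

// Hypothetical helper: runs the three passes and counts each replacement.
function sanitizeSketch(text) {
  const changes = { removedInvisible: 0, removedCitations: 0, prettified: 0 };
  let cleaned = text.replace(ZERO_WIDTH, () => {
    changes.removedInvisible += 1;
    return '';
  });
  cleaned = cleaned.replace(CITATION, () => {
    changes.removedCitations += 1;
    return '';
  });
  cleaned = cleaned.replace(PRETTY_RE, (ch) => {
    changes.prettified += 1;
    return PRETTY.get(ch);
  });
  return { cleaned, changes };
}
```

U+200D (ZWJ) is deliberately left out of the zero-width table here, because removing it unconditionally would break emoji sequences.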

Installation

pnpm add ai-text-sanitizer

This project is published as an ES module and requires Node ≥ 18.

Usage

import { sanitizeAiText } from 'ai-text-sanitizer';

const input = `“Hello\u200B world…” (oaicite:5){index=5}`;

const { cleaned, changes } = sanitizeAiText(input);

console.log(cleaned);  // "Hello world..."
console.log(changes);  /* {
                          removedInvisible: 1,
                          removedTags: 0,
                          removedVariationSelectors: 0,
                          removedDefaultIgnorables: 1,
                          removedCtrl: 0,
                          removedCitations: 1,
                          removedModelArtifacts: 0,
                          prettified: 3,
                          collapsedSpaces: 0,
                          total: 5
                        } */

TypeScript

ai-text-sanitizer ships with built-in .d.ts declarations, so there is nothing extra to install. Just import and enjoy full IntelliSense:

import { sanitizeAiText, type SanitizeResult } from 'ai-text-sanitizer';

const result: SanitizeResult = sanitizeAiText('مرحبا\u200Fالعالم');
console.log(result.cleaned);

Options

sanitizeAiText(text, options?)

| Option | Default | Description |
| --- | --- | --- |
| keepEmoji | true | Preserve ZWJ / variation selectors when they are part of a valid emoji grapheme. |
| keepBidi | false | Allow bidi control marks to survive (useful for mixed RTL/LTR documents). |
| keepTabs | true | Preserve horizontal tabs. |
| keepNewlines | true | Preserve \n / \r. |
| collapseSpaces | true | Collapse repeated ASCII spaces after folding exotic ones. |
| nfkc | false | Apply aggressive NFKC folding (defaults to NFC). |
| detectors | [] | Array of detector names ('heuristic', 'synthid', 'soft-watermark'). |
| detectorConfigs | {} | Config objects for detectors (see below). |
| rewriteStrategy | 'none' | 'none', 'light', or 'aggressive' deterministic rewrites. |
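
As a rough picture of what the whitespace-related options control, the sketch below folds a handful of Unicode spaces to U+0020, normalises line endings to LF, and optionally collapses runs of spaces. The set of space characters and the function name foldWhitespaceSketch are assumptions; the library's actual table may be broader.

```javascript
// Assumed subset of Unicode space characters (NBSP, Ogham space, the
// U+2000..U+200A block, narrow NBSP, medium math space, ideographic space).
const EXOTIC_SPACES = /[\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]/g;

// Hypothetical helper mirroring the idea behind the collapseSpaces option.
function foldWhitespaceSketch(text, { collapseSpaces = true } = {}) {
  let out = text.replace(/\r\n?/g, '\n'); // CRLF and lone CR become LF
  out = out.replace(EXOTIC_SPACES, ' ');  // fold exotic spaces to ASCII space
  if (collapseSpaces) {
    out = out.replace(/ {2,}/g, ' ');     // collapse runs of ASCII spaces
  }
  return out;
}
```

Folding before collapsing matters: a run made of mixed exotic spaces only becomes a collapsible run once every member has been turned into a plain space.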

sanitizeInvisible(text, opts) is also exported for low-level workflows where you only want to strip hidden characters and leave everything else in the text as it is.
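
One way to picture the keepEmoji behaviour when stripping hidden characters is the sketch below: a zero-width joiner is kept only when it sits between two emoji code points, and dropped otherwise. This is a simplified heuristic under stated assumptions, not the library's grapheme logic, and stripZwjOutsideEmoji is a hypothetical name.

```javascript
const ZWJ = '\u200D';
const VS16 = '\uFE0F';

// True for code points that can take part in an emoji ZWJ sequence.
// Skin-tone modifiers are included explicitly alongside pictographs.
function isEmojiPart(ch) {
  return ch !== undefined && /\p{Extended_Pictographic}|\p{Emoji_Modifier}/u.test(ch);
}

// Hypothetical helper: removes ZWJ unless it joins two emoji code points.
function stripZwjOutsideEmoji(text) {
  const cps = Array.from(text); // iterate by code point, not UTF-16 unit
  return cps
    .filter((ch, i) => {
      if (ch !== ZWJ) return true;
      // Look past an emoji variation selector directly before the joiner.
      const prev = cps[i - 1] === VS16 ? cps[i - 2] : cps[i - 1];
      return isEmojiPart(prev) && isEmojiPart(cps[i + 1]);
    })
    .join('');
}
```

With this sketch a family emoji built from three pictographs and two joiners passes through unchanged, while a joiner hidden between two Latin letters is removed.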

API

sanitizeAiText(text, options?) → { cleaned, changes }

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| text | string | | Input text to sanitise. |
| options | object | (optional) | Behaviour flags (below). |
| keepEmoji | boolean | true | Keep ZWJ / variation selectors used by emoji. |
| collapseSpaces | boolean | true | Collapse contiguous ASCII spaces. |
| keepBidi | boolean | false | Preserve bidi control marks. |
| detectors | Array | [] | Run optional watermark detectors (see below). |

The returned changes object reports how many code points were altered for each rule, plus a total. removedTags, removedVariationSelectors, and removedDefaultIgnorables are audit counters for subcategories of hidden-text channels. removedInvisible remains the overall removal count for these hidden characters, and bidi controls are counted separately under removedBidi, so totals are not double-counted.
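
The no-double-counting rule can be checked by hand. The sketch below assumes that total is the sum of the top-level counters only, with the three audit subcategory counters left out because removedInvisible already includes them. The helper name sumTopLevel is hypothetical.

```javascript
// Assumption: these keys are audit-only breakdowns (or the total itself)
// and must not be added again when recomputing the total.
const NOT_SUMMED = new Set([
  'removedTags',
  'removedVariationSelectors',
  'removedDefaultIgnorables',
  'total',
]);

// Hypothetical helper: recomputes the total from a changes object.
function sumTopLevel(changes) {
  return Object.entries(changes)
    .filter(([key]) => !NOT_SUMMED.has(key))
    .reduce((sum, [, count]) => sum + count, 0);
}

// The changes object from the Usage example above.
const exampleChanges = {
  removedInvisible: 1,
  removedTags: 0,
  removedVariationSelectors: 0,
  removedDefaultIgnorables: 1,
  removedCtrl: 0,
  removedCitations: 1,
  removedModelArtifacts: 0,
  prettified: 3,
  collapsedSpaces: 0,
  total: 5,
};

console.log(sumTopLevel(exampleChanges) === exampleChanges.total); // true
```

Under that assumption the Usage example adds up: 1 invisible character, 1 citation, and 3 punctuation folds give 5, even though removedDefaultIgnorables also reports 1.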

Watermark detectors

Three detectors ship with the library:

  • heuristic – counts visible model annotation artifacts, zero-width marks, exotic spaces, Unicode tag payloads, variation selectors, and other default-ignorable controls in the original text. Findings include optional category, count, and codePoints metadata.
  • synthid – delegates to your SynthID scoring routine. Provide detectorConfigs.synthid = { score(tokens) { ... } } to bridge Google's open implementation and pass your key/PRF.
  • soft-watermark – ports the statistical test from A Watermark for Large Language Models. Supply tokens plus a greenlist predicate to receive p-values.

Every detector reports structured findings that you can stash or serialize. Use unifiedDiff(original, cleaned) to create a CI-friendly diff for manual audits, or call the CLI with --report to export JSON.

SynthID, green-list watermarks, and modern AI text classifiers are token- or statistics-based systems. This package only removes deterministic string-level identifiers and transport artifacts by default; token-statistical watermark checks require caller-provided detector hooks and do not mutate text.
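
For orientation, the statistical test behind a green-list watermark can be sketched in a few lines. The code below is a generic one-sided z-test on the count of green tokens, in the spirit of A Watermark for Large Language Models. It is not the library's soft-watermark detector, and the predicate signature isGreen(token, previousToken), the gamma parameter, and the function names are assumptions.

```javascript
// Abramowitz and Stegun 7.1.26 approximation of the error function.
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Hypothetical helper. `isGreen(token, previousToken)` is caller-supplied
// and `gamma` is the expected green fraction in unwatermarked text.
function greenListTest(tokens, isGreen, gamma = 0.5) {
  const scored = tokens.length - 1; // the first token has no predecessor
  if (scored < 1) throw new RangeError('need at least two tokens');
  let green = 0;
  for (let i = 1; i < tokens.length; i += 1) {
    if (isGreen(tokens[i], tokens[i - 1])) green += 1;
  }
  // z = (observed - expected) / standard deviation under the null hypothesis
  const z = (green - gamma * scored) / Math.sqrt(scored * gamma * (1 - gamma));
  // One-sided p-value: probability of a z-score at least this large by chance.
  const pValue = 0.5 * (1 - erf(z / Math.SQRT2));
  return { green, scored, z, pValue };
}
```

A small p-value means the text contains more green tokens than chance would explain. The test only reports a score; nothing in it changes the text.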

Rewrite strategies

Calling sanitizeAiText with rewriteStrategy: 'light' | 'aggressive' applies a deterministic, local rewrite after sanitization. Rewrite strategies are opt-in and are not part of the default prevention path. The light strategy flips quotation style and re-chunks sentences; the aggressive strategy further substitutes a small synonym map. Rewrites are counted in changes.rewrittenSegments so you can detect when phrasing was adjusted.

CLI

Run the CLI without installing globally:

npx ai-text-sanitizer --in draft.txt --out cleaned.txt --report findings.json --strict

--strict exits with code 2 whenever destructive sanitization occurred or detectors raised a flag. Pair it with CI to block merges containing disallowed Unicode marks.
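
The --strict rule can be read as a small decision function. The sketch below is one possible reading, assuming that "destructive sanitization" means a non-zero changes.total and that any detector finding counts as a flag. The helper strictExitCode is hypothetical and is not part of the CLI.

```javascript
// Hypothetical helper: maps a sanitisation result to a process exit code.
// Assumption: any counted change, or any detector finding, is a failure.
function strictExitCode(changes, findings = []) {
  const mutated = (changes.total ?? 0) > 0;
  const flagged = findings.length > 0;
  return mutated || flagged ? 2 : 0;
}

// A CI wrapper could then end with: process.exitCode = strictExitCode(changes, findings);
```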

Running the test suite

pnpm install
pnpm test

Tests live in __tests__/ and exercise typical real-world scenarios including HTML fragments, code snippets, emoji sequences, and BOM handling.

Limitations

  • The function operates on raw strings; it does not parse or sanitise HTML structure. HTML tags remain untouched but are treated as plain text.
  • The sanitizer removes deterministic string-level identifiers and transport artifacts. It does not promise that text will evade SynthID, green-list watermark detectors, neural AI detectors, or third-party "AI text" scoring tools.
  • C2PA and similar content credentials live in file or asset metadata, not in plain string content, so they are outside this API's scope.
  • The mapping of fancy punctuation is intentionally conservative. If you need broader transliteration, customise the PRETTIES table in aiTextSanitizer.js.

Contributing

Contributions, bug reports, and feature requests are very welcome. Feel free to open an issue or submit a pull request. Please ensure the test suite passes (pnpm test) and follow conventional commit messages for ease of release automation.


This repository contains only the core library and test suite to keep the footprint minimal.

About

ai-text-sanitizer is a tiny (<6 kB) zero-dependency ES module for cleaning and normalising raw text generated by large language models before you render, store, or diff it.
