This directory contains tools for analyzing and removing SynthID watermarks from AI-generated text. The watermark was developed by Google DeepMind and published in Nature (2024).
- **N-gram Context**: for each token position, SynthID considers the previous `ngram_len - 1` tokens (4 by default) as context.
- **Hash Computation**: it computes a hash of:
  - the context tokens
  - the candidate next token
  - a set of secret watermarking keys (30 by default)
- **G-value Assignment**: the hash is used to assign a binary g-value (0 or 1) to each possible next token, for each key layer.
- **Probability Modification**: the token probabilities are modified to favor tokens with g-value 1:
  `new_prob[i] = prob[i] * (1 + g[i] - mean(g * prob))`
- **Sampling**: the model samples from the modified distribution.
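The probability update above can be sketched in a few lines. This is an illustrative NumPy version, not the official SynthID implementation; it interprets `mean(g * prob)` as the probability-weighted expectation of g, which keeps the modified distribution normalized:

```python
import numpy as np

def apply_g_value_bias(probs, g):
    """Reweight token probabilities to favor tokens with g-value 1.

    new_prob[i] = prob[i] * (1 + g[i] - E[g]), where E[g] is the
    probability-weighted mean g-value of the current distribution.
    """
    probs = np.asarray(probs, dtype=float)
    g = np.asarray(g, dtype=float)
    expected_g = np.sum(g * probs)  # E[g] under the current distribution
    new_probs = probs * (1.0 + g - expected_g)
    return new_probs / new_probs.sum()  # renormalize against rounding error
```

Since `1 + g[i] - E[g]` is at least 1 when `g[i] = 1` and at most 1 when `g[i] = 0`, the update always shifts probability mass toward g = 1 tokens, which is exactly the bias the detector later measures.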
Detection reverses the process:
- Tokenize the text
- For each n-gram, compute g-values using the same secret keys
- If the mean g-value is significantly above 0.5, the text is watermarked
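These steps can be sketched with a toy stand-in for the keyed hash (plain SHA-256 here; the real scheme uses secret keys in a dedicated pseudorandom function, and `toy_g_value`/`toy_detect` are hypothetical names):

```python
import hashlib
import statistics

def toy_g_value(context, token, key):
    # Any pseudorandom function of (key, context, candidate token)
    # works for illustration; SynthID uses a keyed hash.
    h = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return h[0] & 1  # binary g-value

def toy_detect(tokens, keys, ngram_len=5):
    """Mean g-value over all n-gram positions and key layers.

    Unwatermarked text scores ~0.5; watermarked text scores
    significantly higher.
    """
    scores = [
        toy_g_value(tuple(tokens[i - ngram_len + 1:i]), tokens[i], key)
        for i in range(ngram_len - 1, len(tokens))
        for key in keys
    ]
    return statistics.mean(scores) if scores else 0.5
```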
**Paraphrasing Attack** (effectiveness: 90-100%)

The watermark is embedded in the specific token sequence, not the meaning. Paraphrasing with a non-watermarked model completely regenerates the token sequence:

```python
from reverse_synthid import ParaphrasingAttack

attack = ParaphrasingAttack(model_name="gpt2")
clean_text = attack.paraphrase(watermarked_text)
```

**Synonym Substitution** (effectiveness: 50-70%)

Replacing tokens with synonyms breaks the n-gram patterns:
```python
from reverse_synthid import TokenPerturbationAttack

attack = TokenPerturbationAttack()
clean_text = attack.substitute_synonyms(text, rate=0.3)
```

**Homoglyph Attack** (effectiveness: 95-100%)
Replacing characters with visually identical Unicode characters breaks the hash:
```python
from reverse_synthid import HomoglyphAttack

attack = HomoglyphAttack()
clean_text = attack.apply_homoglyphs(text, rate=0.1)
```

**Filler Insertion** (effectiveness: 30-50%)
Inserting or removing words shifts all subsequent n-gram boundaries:
```python
from reverse_synthid import TokenPerturbationAttack

attack = TokenPerturbationAttack()
clean_text = attack.insert_fillers(text, rate=0.1)
```

Files:

- `reverse_synthid.py` - Main attack toolkit with multiple methods
- `analyze_watermark.py` - Watermark detection and analysis tool
- `test_removal.py` - Test script demonstrating watermark removal
```shell
# Analyze a text for the watermark signal
python analyze_watermark.py --input text.txt --verbose

# Using paraphrasing (most effective, requires model)
python analysis/reverse_synthid.py --input text.txt --output clean.txt --method paraphrase

# Using perturbation (no model needed)
python analysis/reverse_synthid.py --input text.txt --output clean.txt --method perturb

# Using combined attack
python analysis/reverse_synthid.py --input text.txt --output clean.txt --method combined

# Run the demonstration tests
python analysis/test_removal.py
```

The watermark relies on:
- Exact token sequences - Any change to tokens affects n-grams
- Context windows - Inserting tokens shifts all contexts
- Hash function - Different bytes produce different hashes
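The third point is easy to demonstrate: changing a single character, even to a visually identical homoglyph, yields an unrelated digest. (Illustration only, using SHA-256 rather than SynthID's keyed hash.)

```python
import hashlib

latin = "the quick brown fox".encode("utf-8")
# Same string, but with the 'o' in "fox" replaced by Cyrillic 'о' (U+043E):
# visually identical, different bytes.
homoglyph = "the quick brown f\u043ex".encode("utf-8")

h1 = hashlib.sha256(latin).hexdigest()
h2 = hashlib.sha256(homoglyph).hexdigest()
# h1 and h2 share no structure despite the one-character difference.
```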
When you paraphrase text:
- The semantic meaning is preserved
- But the exact tokens are different
- So the n-gram hashes no longer match
- And the g-value pattern is destroyed
For a text of length N with watermark strength W:
- A substitution rate r reduces the signal by approximately `1 - (1-r)^ngram_len`, since an n-gram survives only if none of its tokens change
- An insertion rate r shifts about `N*r` n-gram boundaries
- Paraphrasing essentially sets W → 0 (complete regeneration)
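Plugging in the defaults makes the first estimate concrete. A sketch, assuming `ngram_len = 5` (the 4-token context described above plus the candidate token):

```python
def signal_reduction(r, ngram_len=5):
    """Approximate fraction of watermark signal destroyed when each token
    is independently substituted with probability r: an n-gram contributes
    to the score only if all ngram_len of its tokens survive."""
    return 1 - (1 - r) ** ngram_len

# signal_reduction(0.3) -> 1 - 0.7**5 = 0.83193: a 30% substitution
# rate removes roughly 83% of the signal.
```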
Limitations:
- Paraphrasing may change nuances in meaning
- Simple attacks may be detectable (homoglyphs, for example, can be undone by confusable-character normalization)
- Future watermarks may be more robust
This code is for educational and research purposes only. It demonstrates vulnerabilities in watermarking schemes to help develop more robust methods.
- Dathathri et al. "Scalable watermarking for identifying large language model outputs." Nature 634, 818-823 (2024).
- https://github.com/google-deepmind/synthid-text

