aloshdenny/reverse-SynthID-text

SynthID Watermark Reverse Engineering


Overview

This directory contains tools for analyzing and removing SynthID watermarks from AI-generated text. The watermark was developed by Google DeepMind and published in Nature (2024).

How SynthID Works

Watermarking Process

  1. N-gram Context: For each token position, SynthID considers the previous ngram_len - 1 tokens (default: 4 tokens) as context.

  2. Hash Computation: It computes a hash of:

    • The context tokens
    • The candidate next token
    • A set of secret watermarking keys (30 by default)
  3. G-value Assignment: The hash is used to assign a binary g-value (0 or 1) to each possible next token for each key layer.

  4. Probability Modification: The token probabilities are modified to favor tokens with g-value = 1:

    new_prob[i] = prob[i] * (1 + g[i] - mean(g * prob))
    
  5. Sampling: The model samples from the modified distribution.
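Step 4's reweighting rule can be sketched in a few lines (a minimal illustration of the formula only; the function name and toy numbers are not from the SynthID codebase):

```python
def reweight(probs, g):
    """Apply new_prob[i] = prob[i] * (1 + g[i] - mean(g * prob)).

    mean(g * prob) is the probability-weighted mean of the g-values, so
    sum(new_prob) = 1 + mean_g - mean_g = 1: the rule preserves
    normalization while shifting mass toward tokens with g = 1.
    """
    mean_g = sum(gi * pi for gi, pi in zip(g, probs))
    return [pi * (1 + gi - mean_g) for pi, gi in zip(probs, g)]

# Three candidate tokens; the two with g = 1 gain probability mass.
print(reweight([0.5, 0.3, 0.2], [1, 0, 1]))  # ~[0.65, 0.09, 0.26]
```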

Detection Process

  1. Tokenize the text
  2. For each n-gram, compute g-values using the same keys
  3. If the mean g-value is significantly greater than 0.5 (unwatermarked text averages 0.5), the text is flagged as watermarked
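The threshold test in step 3 is, in effect, a one-sided test on the mean g-value. A minimal sketch (the function name, threshold, and toy data are illustrative; the actual SynthID detector uses more sophisticated weighted scoring):

```python
import math

def mean_g_test(g_values, z_threshold=2.0):
    """Flag text as watermarked when the mean g-value is significantly
    above 0.5. Under the null (unwatermarked text) each g-value is a
    fair coin flip, so the sample mean has expectation 0.5 and standard
    deviation 0.5 / sqrt(n)."""
    n = len(g_values)
    mean_g = sum(g_values) / n
    z = (mean_g - 0.5) / (0.5 / math.sqrt(n))
    return z, z > z_threshold

# 70 ones out of 100 n-grams: mean 0.7, four null standard deviations up.
z, flagged = mean_g_test([1] * 70 + [0] * 30)
```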

Vulnerabilities

1. Paraphrasing Attack (Most Effective)

Effectiveness: 90-100%

The watermark is embedded in the specific token sequence, not the meaning. Paraphrasing with a non-watermarked model completely regenerates the token sequence.

from reverse_synthid import ParaphrasingAttack
attack = ParaphrasingAttack(model_name="gpt2")
clean_text = attack.paraphrase(watermarked_text)

2. Token Substitution Attack

Effectiveness: 50-70%

Replacing tokens with synonyms breaks the n-gram patterns:

from reverse_synthid import TokenPerturbationAttack
attack = TokenPerturbationAttack()
clean_text = attack.substitute_synonyms(text, rate=0.3)

3. Homoglyph Attack

Effectiveness: 95-100%

Replacing characters with visually identical Unicode characters breaks the hash:

from reverse_synthid import HomoglyphAttack
attack = HomoglyphAttack()
clean_text = attack.apply_homoglyphs(text, rate=0.1)

4. N-gram Boundary Shift

Effectiveness: 30-50%

Inserting or removing words corrupts the n-gram windows that contain each edit (detection hashes local token windows, so each edit breaks roughly ngram_len windows around it):

from reverse_synthid import TokenPerturbationAttack
attack = TokenPerturbationAttack()
clean_text = attack.insert_fillers(text, rate=0.1)

Files

  • reverse_synthid.py - Main attack toolkit with multiple methods
  • analyze_watermark.py - Watermark detection and analysis tool
  • test_removal.py - Test script demonstrating watermark removal

Usage

Analyze Watermark

python analyze_watermark.py --input text.txt --verbose

Remove Watermark

# Using paraphrasing (most effective, requires model)
python analysis/reverse_synthid.py --input text.txt --output clean.txt --method paraphrase

# Using perturbation (no model needed)
python analysis/reverse_synthid.py --input text.txt --output clean.txt --method perturb

# Using combined attack
python analysis/reverse_synthid.py --input text.txt --output clean.txt --method combined

Run Test

python analysis/test_removal.py

Comparing SynthID Watermarked vs De-Watermarked Text

[Comparison figure]

Technical Details

Key Insight: The Watermark is Fragile

The watermark relies on:

  1. Exact token sequences - Any change to tokens affects n-grams
  2. Context windows - Inserting tokens shifts all contexts
  3. Hash function - Different bytes produce different hashes
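The hash sensitivity in point 3 is easy to demonstrate. The function below is a hypothetical stand-in (SynthID's real hash function and keys are not public): changing a single context token yields a completely unrelated digest.

```python
import hashlib

def ngram_hash(context_tokens, candidate, key):
    # Digest over the context token ids, the candidate token id, and a
    # secret integer key (all illustrative choices).
    payload = b"|".join(str(t).encode() for t in context_tokens)
    payload += b"|%d|" % candidate + key.to_bytes(8, "big")
    return hashlib.sha256(payload).hexdigest()

h1 = ngram_hash([101, 2057, 318, 257], candidate=1332, key=42)
h2 = ngram_hash([101, 2057, 318, 258], candidate=1332, key=42)  # one token changed
print(h1 != h2)  # True
```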

Why Paraphrasing Works

When you paraphrase text:

  1. The semantic meaning is preserved
  2. But the exact tokens are completely different
  3. The n-gram hashes are completely different
  4. The g-value pattern is destroyed

Theoretical Analysis

For a text of length N with watermark strength W:

  • A substitution rate r leaves each n-gram window intact with probability (1-r)^(ngram_len), reducing the signal by approximately 1 - (1-r)^(ngram_len)
  • An insertion rate r corrupts the windows containing inserted tokens, roughly N*r*ngram_len windows in total
  • Paraphrasing essentially sets W → 0 (complete regeneration)
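The substitution estimate follows from windows surviving independently; a quick sketch (assuming ngram_len = 5, i.e. 4 context tokens plus the candidate):

```python
def signal_reduction(r, ngram_len=5):
    """An n-gram window survives substitution at rate r only if none of
    its ngram_len tokens changed, probability (1 - r)**ngram_len; the
    signal reduction is the complement."""
    return 1 - (1 - r) ** ngram_len

for r in (0.1, 0.3, 0.5):
    print(f"r={r}: ~{signal_reduction(r):.0%} of n-gram windows corrupted")
```

Even a modest 30% substitution rate corrupts roughly 83% of the n-gram windows, which is consistent with the 50-70% effectiveness reported for the substitution attack above.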

Limitations

  1. Paraphrasing may change nuances in meaning
  2. Simple attacks may be detectable
  3. Future watermarks may be more robust

Disclaimer

This code is for educational and research purposes only. It demonstrates vulnerabilities in watermarking schemes to help develop more robust methods.

References

  1. Dathathri et al. "Scalable watermarking for identifying large language model outputs." Nature 634, 818-823 (2024).
  2. https://github.com/google-deepmind/synthid-text
