aloshdenny/reverse-SynthID-text

SynthID Watermark Reverse Engineering


Overview

This directory contains tools for analyzing and removing SynthID watermarks from AI-generated text. The watermark was developed by Google DeepMind and published in Nature (2024).

How SynthID Works

Watermarking Process

  1. N-gram Context: For each token position, SynthID considers the previous ngram_len - 1 tokens (default: 4 tokens) as context.

  2. Hash Computation: It computes a hash of:

    • The context tokens
    • The candidate next token
    • A set of secret watermarking keys (30 by default)
  3. G-value Assignment: The hash is used to assign a binary g-value (0 or 1) to each possible next token for each key layer.

  4. Probability Modification: The token probabilities are modified to favor tokens with g-value = 1:

    new_prob[i] = prob[i] * (1 + g[i] - mean(g * prob))
    
  5. Sampling: The model samples from the modified distribution.
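Step 4's reweighting rule can be sketched in a few lines (a minimal illustration of the formula only; the function name and toy numbers are not from the SynthID codebase):

```python
def reweight(probs, g):
    """Apply new_prob[i] = prob[i] * (1 + g[i] - mean(g * prob)).

    mean(g * prob) is the probability-weighted mean of the g-values, so
    sum(new_prob) = 1 + mean_g - mean_g = 1: the rule preserves
    normalization while shifting mass toward tokens with g = 1.
    """
    mean_g = sum(gi * pi for gi, pi in zip(g, probs))
    return [pi * (1 + gi - mean_g) for pi, gi in zip(probs, g)]

# Three candidate tokens; the two with g = 1 gain probability mass.
print(reweight([0.5, 0.3, 0.2], [1, 0, 1]))  # ~[0.65, 0.09, 0.26]
```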

Detection Process

  1. Tokenize the text
  2. For each n-gram, compute g-values using the same keys
  3. If the mean g-value is significantly greater than 0.5 (unwatermarked text averages 0.5), the text is flagged as watermarked
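The threshold test in step 3 is, in effect, a one-sided test on the mean g-value. A minimal sketch (the function name, threshold, and toy data are illustrative; the actual SynthID detector uses more sophisticated weighted scoring):

```python
import math

def mean_g_test(g_values, z_threshold=2.0):
    """Flag text as watermarked when the mean g-value is significantly
    above 0.5. Under the null (unwatermarked text) each g-value is a
    fair coin flip, so the sample mean has expectation 0.5 and standard
    deviation 0.5 / sqrt(n)."""
    n = len(g_values)
    mean_g = sum(g_values) / n
    z = (mean_g - 0.5) / (0.5 / math.sqrt(n))
    return z, z > z_threshold

# 70 ones out of 100 n-grams: mean 0.7, four null standard deviations up.
z, flagged = mean_g_test([1] * 70 + [0] * 30)
```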

Vulnerabilities

1. Paraphrasing Attack (Most Effective)

Effectiveness: 90-100%

The watermark is embedded in the specific token sequence, not the meaning. Paraphrasing with a non-watermarked model completely regenerates the token sequence.

from reverse_synthid import ParaphrasingAttack
attack = ParaphrasingAttack(model_name="gpt2")
clean_text = attack.paraphrase(watermarked_text)

2. Token Substitution Attack

Effectiveness: 50-70%

Replacing tokens with synonyms breaks the n-gram patterns:

from reverse_synthid import TokenPerturbationAttack
attack = TokenPerturbationAttack()
clean_text = attack.substitute_synonyms(text, rate=0.3)

3. Homoglyph Attack

Effectiveness: 95-100%

Replacing characters with visually identical Unicode characters breaks the hash:

from reverse_synthid import HomoglyphAttack
attack = HomoglyphAttack()
clean_text = attack.apply_homoglyphs(text, rate=0.1)

4. N-gram Boundary Shift

Effectiveness: 30-50%

Inserting or removing words corrupts the n-gram windows that contain each edit (detection hashes local token windows, so each edit breaks roughly ngram_len windows around it):

from reverse_synthid import TokenPerturbationAttack
attack = TokenPerturbationAttack()
clean_text = attack.insert_fillers(text, rate=0.1)

Files

  • reverse_synthid.py - Main attack toolkit with multiple methods
  • analyze_watermark.py - Watermark detection and analysis tool
  • test_removal.py - Test script demonstrating watermark removal

Usage

Analyze Watermark

python analyze_watermark.py --input text.txt --verbose

Remove Watermark

# Using paraphrasing (most effective, requires model)
python analysis/reverse_synthid.py --input text.txt --output clean.txt --method paraphrase

# Using perturbation (no model needed)
python analysis/reverse_synthid.py --input text.txt --output clean.txt --method perturb

# Using combined attack
python analysis/reverse_synthid.py --input text.txt --output clean.txt --method combined

Run Test

python analysis/test_removal.py

Comparing SynthID Watermarked vs De-Watermarked Text

[Comparison figure]

Technical Details

Key Insight: The Watermark is Fragile

The watermark relies on:

  1. Exact token sequences - Any change to tokens affects n-grams
  2. Context windows - Inserting tokens shifts all contexts
  3. Hash function - Different bytes produce different hashes
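The hash sensitivity in point 3 is easy to demonstrate. The function below is a hypothetical stand-in (SynthID's real hash function and keys are not public): changing a single context token yields a completely unrelated digest.

```python
import hashlib

def ngram_hash(context_tokens, candidate, key):
    # Digest over the context token ids, the candidate token id, and a
    # secret integer key (all illustrative choices).
    payload = b"|".join(str(t).encode() for t in context_tokens)
    payload += b"|%d|" % candidate + key.to_bytes(8, "big")
    return hashlib.sha256(payload).hexdigest()

h1 = ngram_hash([101, 2057, 318, 257], candidate=1332, key=42)
h2 = ngram_hash([101, 2057, 318, 258], candidate=1332, key=42)  # one token changed
print(h1 != h2)  # True
```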

Why Paraphrasing Works

When you paraphrase text:

  1. The semantic meaning is preserved
  2. But the exact tokens are completely different
  3. The n-gram hashes are completely different
  4. The g-value pattern is destroyed

Theoretical Analysis

For a text of length N with watermark strength W:

  • A substitution rate r leaves each n-gram window intact with probability (1-r)^(ngram_len), reducing the signal by approximately 1 - (1-r)^(ngram_len)
  • An insertion rate r corrupts the windows containing inserted tokens, roughly N*r*ngram_len windows in total
  • Paraphrasing essentially sets W → 0 (complete regeneration)
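The substitution estimate follows from windows surviving independently; a quick sketch (assuming ngram_len = 5, i.e. 4 context tokens plus the candidate):

```python
def signal_reduction(r, ngram_len=5):
    """An n-gram window survives substitution at rate r only if none of
    its ngram_len tokens changed, probability (1 - r)**ngram_len; the
    signal reduction is the complement."""
    return 1 - (1 - r) ** ngram_len

for r in (0.1, 0.3, 0.5):
    print(f"r={r}: ~{signal_reduction(r):.0%} of n-gram windows corrupted")
```

Even a modest 30% substitution rate corrupts roughly 83% of the n-gram windows, which is consistent with the 50-70% effectiveness reported for the substitution attack above.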

Limitations

  1. Paraphrasing may change nuances in meaning
  2. Simple attacks may be detectable
  3. Future watermarks may be more robust

Disclaimer

This code is for educational and research purposes only. It demonstrates vulnerabilities in watermarking schemes to help develop more robust methods.

References

  1. Dathathri et al. "Scalable watermarking for identifying large language model outputs." Nature 634, 818-823 (2024).
  2. https://github.com/google-deepmind/synthid-text
