Simplify anonymizer and prepare for Windows support #14

Open
woct0rdho wants to merge 4 commits into peteromallet:main from woct0rdho:anonymizer
woct0rdho (Contributor) commented Feb 27, 2026

This PR makes some changes to the anonymizer that I think are reasonable:

  1. Make anonymize_path an alias of anonymize_text. We don't need to treat paths differently from general text.
  2. Even if a special folder like Documents, Downloads, or Desktop is in the path, we don't need to remove the parent folders before it. We can keep the absolute path and redact only the username.
  3. When len(username) >= 4, we only need to globally replace the username with the hash. This is a superset of all other replacements in anonymize_text. Only when len(username) < 4 do we replace with more specific patterns.
  4. I've unified the handling of /Users/username (macOS), \Users\username and \\Users\\username (Windows paths that may be escaped in various forms), and -Users-username (hyphen-encoded paths, as used by Claude Code and the HuggingFace cache). I've also unified the handling of /Users/username and /home/username.
  5. We can supply a custom home dir to anonymize_text. Only when it is not one of the conventional home dirs above do we use a specific pattern to handle it.

The remaining changes optimize speed using re.compile and precomputed regex patterns. The resulting anonymize_text function looks pretty clean. See the updated tests for the effect of the anonymizer.
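
As a rough illustration of the precomputation idea (again a sketch, not the PR's code), the username-dependent pattern can be compiled once and cached so repeated calls skip the regex build. The function name `compiled_home_pattern` and the use of `functools.lru_cache` are assumptions.

```python
import re
from functools import lru_cache

# Built once at import time: "/", an escaped "\\", a single "\", or "-".
_SEP = r"(?:/|\\\\|\\|-)"


@lru_cache(maxsize=None)
def compiled_home_pattern(username: str) -> "re.Pattern[str]":
    # Compiled on the first call for each username, then served from the cache.
    return re.compile(
        _SEP + r"(?:Users|home)" + _SEP + re.escape(username) + r"(?!\w)"
    )
```

Because the anonymizer typically runs over many strings with the same username, the compile cost is paid once per username in this sketch.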

I wrote this by hand; it is not vibe-coded. I've prepared the remaining changes for Windows support (based on this PR) at https://github.com/woct0rdho/dataclaw/tree/windows , and an example dataset at https://huggingface.co/datasets/woctordho/dataclaw-windows . Before moving on to those, we should carefully review this PR because it touches the anonymizer.

By default, the anonymizer's only scope is to redact the operating system's username. Other kinds of PII are inherently subjective to define, and the user is responsible for defining the strings to be redacted.

Currently all CI tests run on Ubuntu. We may need to let some of them run on macOS and Windows, for example Python 3.10 + Ubuntu, Python 3.11 + macOS, Python 3.12 + Windows.
