Simplify anonymizer and prepare for Windows support #14
Open
woct0rdho wants to merge 4 commits into peteromallet:main from
There are some changes to the anonymizer that I think are reasonable:
- Make `anonymize_path` an alias of `anonymize_text`. We don't need to treat paths differently from general text.
- When `len(username) >= 4`, we only need to globally replace the username with the hash. This is a superset of all other replacements in `anonymize_text`. Only when `len(username) < 4` do we replace with more specific patterns.
- The specific patterns cover `/Users/username` (macOS), `\Users\username` and `\\Users\\username` (Windows paths that may be escaped in various forms), and `-Users-username` (hyphen-encoded paths, as in Claude Code and the HuggingFace cache). I've also unified the handling of `/Users/username` and `/home/username`.
- The home directory is handled in `anonymize_text` as well. Only when it is not one of the conventional home dirs above do we use a specific pattern to handle it.

The rest is to optimize the speed using `re.compile` and precomputed regex patterns. The resulting `anonymize_text` function looks pretty clean. You can see the updated tests for the effect of the anonymizer.

This is manually written by me, not vibe-coded. I've prepared the remaining changes for Windows support (based on this PR) at https://github.com/woct0rdho/dataclaw/tree/windows , and an example dataset at https://huggingface.co/datasets/woctordho/dataclaw-windows . But before that, we should carefully review this PR because it touches the anonymizer.
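For reviewers, the idea above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `build_anonymizer`, `hash_username`, the `user_` hash format, and the trailing word-boundary check are all assumptions made for the example.

```python
import hashlib
import re
from typing import Callable

MIN_GLOBAL_LEN = 4  # threshold mentioned in the description above


def hash_username(username: str) -> str:
    # Hypothetical hash format; the project's real format may differ.
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
    return "user_" + digest[:8]


def build_anonymizer(username: str) -> Callable[[str], str]:
    """Precompile one regex for the username and return a text -> text function."""
    if not username:
        return lambda text: text

    replacement = hash_username(username)
    name = re.escape(username)

    if len(username) >= MIN_GLOBAL_LEN:
        # Long enough to be distinctive: replace every occurrence.
        pattern = re.compile(name)
        return lambda text: pattern.sub(lambda m: replacement, text)

    # Short username: only replace it inside a conventional home-dir path.
    # The separator is "/", one or more backslashes (escaped Windows paths),
    # or "-" (hyphen-encoded paths).
    sep = r"(?:/|\\+|-)"
    pattern = re.compile(
        "(" + sep + "(?:Users|home)" + sep + ")" + name + r"(?!\w)"
    )
    # Keep the matched prefix (group 1) and swap only the username.
    return lambda text: pattern.sub(lambda m: m.group(1) + replacement, text)
```

With a short username such as `bob`, this sketch rewrites `/Users/bob/a`, `C:\Users\bob\a` and `-Users-bob-proj`, but leaves `bobsled` and `/home/bobby` alone.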
By default, the only scope of the anonymizer is to redact the operating system's username. Other kinds of PII are inherently subjective to define, and the user is responsible for defining the strings to be redacted.
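As an illustration of user-defined redaction, a sketch could look like the following. The `build_redactor` helper and the `[REDACTED]` placeholder are hypothetical names for this example and are not taken from the project.

```python
import re
from typing import Callable, Iterable


def build_redactor(
    strings: Iterable[str], placeholder: str = "[REDACTED]"
) -> Callable[[str], str]:
    """Return a function that replaces every user-supplied string with a placeholder."""
    # Drop empty entries and duplicates; try longer strings first so that
    # "Acme Corp" wins over "Acme" when both are listed.
    unique = sorted({s for s in strings if s}, key=len, reverse=True)
    if not unique:
        return lambda text: text

    # One precompiled alternation of the escaped literals.
    pattern = re.compile("|".join(re.escape(s) for s in unique))
    return lambda text: pattern.sub(lambda m: placeholder, text)
```

For example, `build_redactor(["Acme Corp", "Acme"])` would redact both the full company name and the bare word wherever they appear.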
Currently all CI tests run on Ubuntu. We may need to let some of them run on macOS and Windows, for example Python 3.10 + Ubuntu, Python 3.11 + macOS, Python 3.12 + Windows.