Implement Ignore Pattern Support #3
Open
Description
Goal
Allow users to exclude files and/or content regions (e.g., license headers, generated code) from duplicate scanning.
Scope
- CLI flags: --ignore-glob (repeatable), --ignore-file PATTERN_FILE.
- Pattern file format: one glob per line, # comments, blank lines ignored.
- Content region exclusion: configurable regex patterns applied before tokenizing (e.g., --ignore-region 're:^# Generated.*?\n').
- Update DuplicateFinder to apply filters pre-tokenization.
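
A minimal Python sketch of the filtering described in the scope list, offered as an illustration rather than the project's implementation. The function names (parse_pattern_file, is_ignored, compile_region, strip_regions) are hypothetical, and two details are assumptions the issue does not settle: that the 're:' prefix simply marks the rest of the value as a regex, and that region patterns are compiled in multiline mode so '^' matches at the start of every line.

```python
import fnmatch
import re
from typing import Iterable, List


def parse_pattern_file(text: str) -> List[str]:
    """Return globs from pattern-file text: one per line, skipping blanks and # comments."""
    patterns = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        patterns.append(stripped)
    return patterns


def is_ignored(path: str, globs: Iterable[str]) -> bool:
    """True if the path matches any ignore glob.

    fnmatchcase is case-sensitive on every platform; note that '*' also
    matches '/' here, which may differ from the semantics the tool adopts.
    """
    return any(fnmatch.fnmatchcase(path, g) for g in globs)


def compile_region(spec: str) -> re.Pattern:
    """Compile an --ignore-region value (assumed form: optional 're:' prefix + regex)."""
    body = spec[3:] if spec.startswith("re:") else spec
    # Assumption: MULTILINE, so '^' anchors at each line start, not only at file start.
    return re.compile(body, re.MULTILINE)


def strip_regions(text: str, regions: Iterable[re.Pattern]) -> str:
    """Remove every match of every region pattern before the text is tokenized."""
    for region in regions:
        text = region.sub("", text)
    return text
```

In this sketch, a file would first be checked with is_ignored and skipped entirely on a match; otherwise its contents would pass through strip_regions before being handed to the tokenizer.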
Acceptance Criteria
- Files matching ignore globs are skipped (not counted in scanned set).
- Region regex patterns remove matched text before tokenization.
- Tests cover file exclusion, region removal reducing the shingle count, and no false negatives for the remaining content.
- README section documenting patterns + examples.
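
One way the region-removal test could look, sketched in Python under stated assumptions: the shingles helper below is a hypothetical stand-in (whitespace tokens, fixed window of 3) for whatever tokenizer and shingling DuplicateFinder actually uses, and the header pattern is the example regex from the scope compiled in multiline mode.

```python
import re
from typing import Set, Tuple


def shingles(text: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Hypothetical shingling: every run of k consecutive whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


# Example region pattern from the issue; MULTILINE is an assumption.
HEADER = re.compile(r"^# Generated.*?\n", re.MULTILINE)


def test_region_removal_reduces_shingles() -> None:
    body = "alpha beta gamma delta\n"
    source = "# Generated by tool v1\n" + body

    before = shingles(source)
    after = shingles(HEADER.sub("", source))

    # Removing the header leaves fewer shingles to compare.
    assert len(after) < len(before)
    # No false negatives: the remaining content yields exactly the shingles
    # it would produce on its own, and all of them were present before.
    assert after == shingles(body)
    assert after <= before
```

The last two assertions are the part that guards against false negatives: stripping a region should never drop or alter shingles that come purely from the content left behind.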
Non-Goals
- Language-aware comment stripping (future tokenizer feature).
Future
- Central config file (duplicate-finder.toml) for ignore patterns.