Implement Ignore Pattern Support #3

@tim-dickey

Goal

Allow users to exclude files and/or content regions (e.g., license headers, generated code) from duplicate scanning.

Scope

  • CLI flags: --ignore-glob (repeatable), --ignore-file PATTERN_FILE.
  • Pattern file format: one glob per line, # comments, blank lines ignored.
  • Content region exclusion: configurable regex patterns applied before tokenizing (e.g., --ignore-region 're:^# Generated.*?\n').
  • Update DuplicateFinder to apply filters pre-tokenization.
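
A rough sketch of how the flags and pattern file described above might fit together. The helper names (`build_parser`, `parse_pattern_file`, `is_ignored`) are hypothetical and not part of the existing code; making `--ignore-region` repeatable, matching globs with `fnmatch` against both the full path and the basename, and using forward-slash paths are all assumptions.

```python
import argparse
from fnmatch import fnmatchcase


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI wiring for the flags proposed in this issue."""
    parser = argparse.ArgumentParser(prog="duplicate-finder")
    # action="append" makes the flag repeatable; default=[] avoids None checks.
    parser.add_argument("--ignore-glob", action="append", default=[], metavar="GLOB")
    parser.add_argument("--ignore-file", metavar="PATTERN_FILE")
    # Assumption: region patterns are repeatable too (the issue does not say).
    parser.add_argument("--ignore-region", action="append", default=[], metavar="REGEX")
    return parser


def parse_pattern_file(text: str) -> list[str]:
    """One glob per line; '#' comment lines and blank lines are skipped."""
    patterns = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns


def is_ignored(path: str, patterns: list[str]) -> bool:
    """True if the path or its basename matches any glob (case-sensitive)."""
    name = path.rsplit("/", 1)[-1]
    return any(fnmatchcase(path, p) or fnmatchcase(name, p) for p in patterns)


args = build_parser().parse_args(["--ignore-glob", "*.min.js", "--ignore-glob", "vendor/*"])
file_patterns = parse_pattern_file("# third-party code\n\nbuild/*\n  *.lock  \n")
all_patterns = args.ignore_glob + file_patterns
```

Note that `fnmatch`'s `*` also matches `/`, so `vendor/*` excludes everything beneath `vendor/`. If gitignore-style semantics are wanted instead, a different matcher would be needed.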

Acceptance Criteria

  • Files matching ignore globs are skipped (not counted in scanned set).
  • Region regex patterns remove matched text before tokenization.
  • Tests: file exclusion, region removal reduces shingles, no false negatives for remaining content.
  • README section documenting patterns + examples.
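
The region-removal criteria could be exercised along these lines. This is only a sketch: `compile_region`, `strip_regions`, and `shingles` are hypothetical names, the whitespace tokenizer and 3-token shingles stand in for whatever DuplicateFinder actually uses, and treating `re:` as a prefix tag and compiling with `re.MULTILINE` (so `^` anchors at each line start) are assumptions.

```python
import re


def compile_region(spec: str) -> re.Pattern:
    """Compile an --ignore-region value; assumes an optional 're:' prefix tag."""
    body = spec[3:] if spec.startswith("re:") else spec
    # MULTILINE is an assumption: it lets '^' match at the start of every line.
    return re.compile(body, re.MULTILINE)


def strip_regions(text: str, regions: list) -> str:
    """Remove every match of every region pattern before tokenization."""
    for rx in regions:
        text = rx.sub("", text)
    return text


def shingles(text: str, k: int = 3) -> set:
    """Stand-in tokenizer: whitespace tokens grouped into k-token shingles."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


source = "# Generated by tool v1\nx = 1\ny = 2\nz = x + y\n"
regions = [compile_region(r"re:^# Generated.*?\n")]
stripped = strip_regions(source, regions)

before = shingles(source)
after = shingles(stripped)
```

Here the header line is dropped, the shingle set shrinks, and every shingle of the remaining content is still present, which is roughly what the "no false negatives" test would check.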

Non-Goals

  • Language-aware comment stripping (future tokenizer feature).

Future

  • Central config file (duplicate-finder.toml) for ignore patterns.