Skip to content

Introduce MinHash + LSH Prefilter #7

@tim-dickey

Description

@tim-dickey

Goal

Reduce O(N^2) pairwise comparisons by introducing MinHash signatures and locality-sensitive hashing (LSH) banding to prune candidate pairs before full Jaccard.

Scope

  • Add MinHash signature generation (configurable number of permutations, e.g., 64).
  • Implement banding (e.g., 16 bands of 4 hashes) to bucket potential duplicates.
  • CLI flags: --minhash-perms, --lsh-bands, --prefilter (enable/disable).
  • Short-circuit to full Jaccard only on candidate pairs from shared buckets.
  • Benchmark comparison: with vs without prefilter (files/sec, candidate reduction rate).

Acceptance Criteria

  • Prefilter reduces pairwise comparison count on synthetic large dataset (>2k files).
  • Results identical (no false negatives) at chosen permutation/band config (validated on test corpus).
  • Tests: candidate generation includes known duplicate pairs.

Non-Goals

  • Adaptive parameter tuning (manual for now).

Future

  • Dynamically adjust bands based on average signature sparsity.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions