-
Notifications
You must be signed in to change notification settings - Fork 0
Introduce MinHash + LSH Prefilter #7
Copy link
Copy link
Open
Description
Goal
Reduce O(N^2) pairwise comparisons by introducing MinHash signatures and locality-sensitive hashing (LSH) banding to prune candidate pairs before full Jaccard.
Scope
- Add MinHash signature generation (configurable number of permutations, e.g., 64).
- Implement banding (e.g., 16 bands of 4 hashes) to bucket potential duplicates.
- CLI flags:
--minhash-perms,--lsh-bands,--prefilter(enable/disable). - Short-circuit to full Jaccard only on candidate pairs from shared buckets.
- Benchmark comparison: with vs without prefilter (files/sec, candidate reduction rate).
Acceptance Criteria
- Prefilter reduces pairwise comparison count on synthetic large dataset (>2k files).
- Results identical (no false negatives) at chosen permutation/band config (validated on test corpus).
- Tests: candidate generation includes known duplicate pairs.
Non-Goals
- Adaptive parameter tuning (manual for now).
Future
- Dynamically adjust bands based on average signature sparsity.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels