Description
Problem
After adding UltraLogLog support to Apache Pinot, I've been looking at adding some of the MinHash variants. To do this I need a reliable way to merge their signatures, both when running SQL queries and when merging rows.
Solution
I'd like the SimilarityHasher interface to also have a merge method that takes two byte[] signatures and returns a byte[] that represents the merged state.
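To make the request concrete, here is a rough sketch of the behaviour I have in mind, not a proposal for the actual implementation. It assumes a plain MinHash signature stored as big-endian 64-bit components, where merging is the component-wise minimum; the class name `MinHashMergeSketch`, the `signature` helper, and the byte layout are all hypothetical and not taken from the library.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class MinHashMergeSketch {

    // Hypothetical merge: both inputs are signatures of the same length,
    // each made of big-endian 64-bit components (an assumed layout).
    public static byte[] merge(byte[] a, byte[] b) {
        if (a.length != b.length || a.length % Long.BYTES != 0) {
            throw new IllegalArgumentException(
                "signatures must have equal length and be a multiple of 8 bytes");
        }
        ByteBuffer left = ByteBuffer.wrap(a);
        ByteBuffer right = ByteBuffer.wrap(b);
        ByteBuffer out = ByteBuffer.allocate(a.length);
        while (left.hasRemaining()) {
            long x = left.getLong();
            long y = right.getLong();
            // Keep the smaller component. Unsigned ordering is an assumption;
            // it has to match the ordering the hasher used to build the signature.
            out.putLong(Long.compareUnsigned(x, y) <= 0 ? x : y);
        }
        return out.array();
    }

    // Helper for the example only: packs components into the assumed layout.
    public static byte[] signature(long... components) {
        ByteBuffer buf = ByteBuffer.allocate(components.length * Long.BYTES);
        for (long c : components) {
            buf.putLong(c);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] left = signature(5L, 40L, 7L);
        byte[] right = signature(9L, 3L, 7L);
        byte[] merged = merge(left, right);
        // Component-wise minimum of (5, 40, 7) and (9, 3, 7) is (5, 3, 7).
        System.out.println(Arrays.equals(merged, signature(5L, 3L, 7L))); // prints true
    }
}
```

With a method like this on the interface, an aggregation function in Pinot could fold signatures pairwise without knowing anything about the internal layout.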
Alternatives
- I've tried implementing the merge functions myself, and ran into problems such as "MinHash output is not mergeable with hash sizes under 64" (#169)
- I did consider a halfway solution of just streaming hashes into the hasher, but that's also not available in the current interface