Initial implementation for Hybrid Hash Functions #91
oluiscabral wants to merge 1 commit into ARK-Builders:main from
Conversation
Hi @oluiscabral, I'm really glad to see this PR!
In general, I've recently discovered the class of software we actually target with our framework: DAM, which could be used for categorizing various assets like photos, videos, 3D-models. So ideally would be great to cover all file sizes. That doesn't mean that the framework will be used only for DAM, but that's pretty good reference because it requires meticulous work with every single files and its metadata like tags, scores, attributes etc. |
```rust
const THRESHOLD: u64 = 1024 * 1024 * 1024;
```
A wild idea: is it difficult to make this constant a type parameter? Then we could instantiate the same class with different thresholds. It would also be really great to have benchmarks of the optimized "skip-chunks" hash function for different sizes. The goal of such benchmarks is not only to see the speed improvement, but also to see the collision ratio.
Nope, it is not difficult. I just haven't done it yet, because I wanted to keep the implementation as similar as possible to the other implementations (Blake3 and CRC32) in this PoC.
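For what it's worth, here is a minimal sketch of the const-generic version. It uses the blake3 and hex crates plus a stand-in fnv_hash_bytes helper, not the exact code in this PR, and from_bytes is an assumed constructor name:

```rust
/// Hypothetical hybrid ID parameterized by a byte-size threshold.
pub struct Hybrid<const THRESHOLD: u64>(pub String);

impl<const THRESHOLD: u64> Hybrid<THRESHOLD> {
    pub fn from_bytes(bytes: &[u8]) -> Self {
        let size = bytes.len() as u64;
        if size < THRESHOLD {
            // Small inputs: cryptographic BLAKE3 hash, hex-encoded.
            Self(hex::encode(blake3::hash(bytes).as_bytes()))
        } else {
            // Large inputs: cheap FNV hash, prefixed with the input size.
            Self(format!("{}_{}", size, fnv_hash_bytes(bytes)))
        }
    }
}

/// Minimal FNV-1a, standing in for the fnv_hash_bytes helper in this PR.
fn fnv_hash_bytes(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

// Different thresholds then become different instantiations of one type:
// type GibHybrid = Hybrid<{ 1024 * 1024 * 1024 }>;
// type MibHybrid = Hybrid<{ 1024 * 1024 }>;
```

This keeps the threshold a compile-time constant, so the compiler can still fold the comparison, while letting benchmarks instantiate several thresholds side by side.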
```rust
if size < THRESHOLD {
    // Use Blake3 for small files
    log::debug!("Computing BLAKE3 hash for bytes");

    let mut hasher = Blake3Hasher::new();
    hasher.update(bytes);
    let hash = hasher.finalize();
    Ok(Hybrid(encode(hash.as_bytes())))
} else {
    // Use fnv hashing for large files
    log::debug!("Computing simple hash for bytes");

    let hash = fnv_hash_bytes(bytes);
    Ok(Hybrid(format!("{}_{}", size, hash)))
```
- The original idea is the opposite: use Blake3 for small and medium files, and a faster function for large files, where the amount of content is large enough to keep the collision ratio low.
- FNV hashing can be added separately as a dedicated hash function, same as the "skip-chunk" hash function.
- A wild idea: can we parameterize this hybrid hash function with other hash functions? Then we could compose two "dedicated" hash functions into a threshold-based hash function.
- Yes, any file whose size is below the THRESHOLD is already being hashed by Blake3.
- 100%.
- Yes, totally. I'm not sure whether there are higher-priority things to do first, but we could even create a fully parameterized implementation that allows an indefinite number of pairs, each composed of a hash function and its related threshold. I've done something similar in JavaScript once.
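A rough illustration of that composition idea, using hypothetical trait and type names rather than the repository's actual ResourceId API:

```rust
use std::marker::PhantomData;

/// Hypothetical minimal interface for a "dedicated" hash function.
pub trait ByteHasher {
    fn hash_bytes(bytes: &[u8]) -> String;
}

/// Threshold-based composition of two dedicated hash functions:
/// `Small` handles inputs below THRESHOLD, `Large` handles the rest.
pub struct Composed<Small, Large, const THRESHOLD: u64>(PhantomData<(Small, Large)>);

impl<Small, Large, const THRESHOLD: u64> Composed<Small, Large, THRESHOLD>
where
    Small: ByteHasher,
    Large: ByteHasher,
{
    pub fn hash_bytes(bytes: &[u8]) -> String {
        if (bytes.len() as u64) < THRESHOLD {
            Small::hash_bytes(bytes)
        } else {
            Large::hash_bytes(bytes)
        }
    }
}

// Usage sketch, assuming Blake3Id and FnvId both implement ByteHasher:
// type HybridId = Composed<Blake3Id, FnvId, { 1024 * 1024 * 1024 }>;
// let id = HybridId::hash_bytes(&data);
```

Chaining several such compositions (or a small list of threshold/hasher pairs) would give the "indefinite pairs" variant mentioned above.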
Hello!
This pull request introduces a new implementation of the ResourceId trait using two different hash approaches: BLAKE3 for files below a size threshold, and FNV hashing for files at or above it. The dev-hash/benches/hybrid.rs file is a modification of dev-hash/benches/blake3.rs, with the new Hybrid struct being used instead of the Blake3 struct. This allows us to compare and analyze the performance differences between the two approaches.
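For readers who want a feel for such a benchmark, here is a self-contained criterion sketch that compares the two underlying primitives directly. It is not the actual dev-hash/benches/hybrid.rs file, and it uses the fnv crate as a stand-in for the PR's FNV helper:

```rust
use std::hash::Hasher;

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use fnv::FnvHasher;

fn bench_hash_primitives(c: &mut Criterion) {
    // 64 MiB of zeroed data; a real benchmark would use representative files.
    let data = vec![0u8; 64 * 1024 * 1024];

    c.bench_function("blake3_64mib", |b| {
        b.iter(|| blake3::hash(black_box(&data)))
    });

    c.bench_function("fnv_64mib", |b| {
        b.iter(|| {
            let mut hasher = FnvHasher::default();
            hasher.write(black_box(&data));
            hasher.finish()
        })
    });
}

criterion_group!(benches, bench_hash_primitives);
criterion_main!(benches);
```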
Note: I have not implemented tests for files larger than the threshold yet. This will be added in a future update. Please let me know if you have any suggestions or concerns regarding this approach.
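For reference, a future test for the above-threshold branch might look roughly like this. Hybrid::from_bytes is an assumed constructor name, and the size_hash output format comes from the diff above:

```rust
#[test]
fn large_input_uses_size_prefixed_fnv_format() {
    const THRESHOLD: u64 = 1024 * 1024 * 1024;

    // Allocating just over 1 GiB in CI is heavy; ideally the threshold would
    // be a parameter so a small buffer could exercise the large-file branch.
    let data = vec![0u8; (THRESHOLD + 1) as usize];

    // Hypothetical constructor; the diff only shows that the function
    // returns Ok(Hybrid(...)), so the exact signature may differ.
    let id = Hybrid::from_bytes(&data).expect("hashing should succeed");

    // The large-file branch encodes "<size>_<fnv hash>".
    let (size, hash) = id.0.split_once('_').expect("expected size_hash format");
    assert_eq!(size, (THRESHOLD + 1).to_string());
    assert!(!hash.is_empty());
}
```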