Summary: Tc is currently computed via pairwise FingerprintSimilarity loops and can be accelerated using BulkTanimotoSimilarity.
While profiling Tc-based nearest-neighbor computations, I noticed that FingerprintSimilarity(fp1, fp2) is used inside nested Python loops.
CLM/src/clm/commands/write_nn_Tc.py
Line 63 in 2cf5e22
| if max_tc < (fps := FingerprintSimilarity(target_fps, ref_fp)): |
| FingerprintSimilarity(row["fp"], target_fp) |
CLM/src/clm/commands/create_training_sets.py
Line 136 in 2cf5e22
| FingerprintSimilarity(input_fp, target_fp) for input_fp in input_fps |
Line 408 in 2cf5e22
| tcs.append(FingerprintSimilarity(fp1, fp2)) |
RDKit provides a bulk API (BulkTanimotoSimilarity) that computes the same Tanimoto scores but is significantly faster for this use case. Here is a simple benchmark comparing the following approaches. Using Morgan bit vectors, all methods produced identical outputs, but performance differed substantially:
- pairwise FingerprintSimilarity: ~32 sec
- pairwise TanimotoSimilarity: ~24 sec
- BulkTanimotoSimilarity: ~1.4 sec
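
For illustration, here is a minimal sketch of what the replacement could look like. It is not the actual CLM code: the helper name `nearest_neighbor_tc`, the example SMILES, and the Morgan settings (radius 2, 2048 bits) are assumptions made for the example. It computes the Tanimoto scores both ways so the two results can be compared directly.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator


def nearest_neighbor_tc(target_fp, ref_fps):
    """Return the highest Tanimoto coefficient between target_fp and ref_fps.

    Hypothetical helper: one bulk call replaces a Python loop of
    FingerprintSimilarity calls. ref_fps must be a non-empty list.
    """
    return max(DataStructs.BulkTanimotoSimilarity(target_fp, ref_fps))


# Assumed fingerprint settings; the real code may use different ones.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

ref_smiles = ["CCO", "CCN", "c1ccccc1O"]  # made-up reference molecules
ref_fps = [gen.GetFingerprint(Chem.MolFromSmiles(s)) for s in ref_smiles]
target_fp = gen.GetFingerprint(Chem.MolFromSmiles("CCO"))

# Current pattern: one Python-level call per pair.
pairwise = [DataStructs.FingerprintSimilarity(fp, target_fp) for fp in ref_fps]

# Proposed pattern: a single call scores the target against every reference.
bulk = DataStructs.BulkTanimotoSimilarity(target_fp, ref_fps)

max_tc = nearest_neighbor_tc(target_fp, ref_fps)
```

Because `BulkTanimotoSimilarity` returns one score per reference fingerprint in input order, it should slot into list comprehensions like the one in `create_training_sets.py` without changing downstream logic. The running-maximum loop in `write_nn_Tc.py` would reduce to a `max(...)` over the returned list, provided the reference list is not empty.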