use BulkTanimotoSimilarity instead of pairwise FingerprintSimilarity loops #286

@seungchan-an

Summary: The Tanimoto coefficient (Tc) is currently computed with pairwise FingerprintSimilarity calls inside Python loops; BulkTanimotoSimilarity can compute the same values much faster.

While profiling the Tc-based nearest-neighbor computations, I noticed that FingerprintSimilarity(fp1, fp2) is called inside nested Python loops in several places:

if max_tc < (fps := FingerprintSimilarity(target_fps, ref_fp)):

FingerprintSimilarity(row["fp"], target_fp)

FingerprintSimilarity(input_fp, target_fp) for input_fp in input_fps

tcs.append(FingerprintSimilarity(fp1, fp2))
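The snippets above share one pattern: a single query fingerprint is compared against many reference fingerprints, one call at a time. A minimal sketch of how that pattern could collapse into one bulk call is shown below. The SMILES strings, the Morgan settings (radius 2, 2048 bits), and the helper names `to_fp`, `max_tc_pairwise`, and `max_tc_bulk` are illustrative assumptions, not code from this repository.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

# Hypothetical example molecules; not taken from the project.
target_smiles = ["CCO", "c1ccccc1O"]
ref_smiles = ["CCN", "CCO", "c1ccccc1", "CC(=O)O"]

# Assumed fingerprint settings: Morgan bit vectors, radius 2, 2048 bits.
gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def to_fp(smiles):
    """Parse a SMILES string and return its Morgan bit vector."""
    return gen.GetFingerprint(Chem.MolFromSmiles(smiles))


target_fps = [to_fp(s) for s in target_smiles]
ref_fps = [to_fp(s) for s in ref_smiles]


def max_tc_pairwise(target_fp, ref_fps):
    """Nearest-neighbor Tc via one FingerprintSimilarity call per reference."""
    max_tc = 0.0
    for ref_fp in ref_fps:
        if max_tc < (tc := DataStructs.FingerprintSimilarity(target_fp, ref_fp)):
            max_tc = tc
    return max_tc


def max_tc_bulk(target_fp, ref_fps):
    """Same result with a single bulk call; ref_fps must be non-empty."""
    return max(DataStructs.BulkTanimotoSimilarity(target_fp, ref_fps))


pairwise = [max_tc_pairwise(fp, ref_fps) for fp in target_fps]
bulk = [max_tc_bulk(fp, ref_fps) for fp in target_fps]
```

The bulk version keeps the outer loop over targets in Python but moves the inner loop over references into RDKit, which is where the pairwise variant spends its per-call overhead.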

RDKit provides a bulk API, BulkTanimotoSimilarity, that computes the same Tanimoto scores for one fingerprint against a whole list of fingerprints in a single call and is significantly faster for this use case. In a simple benchmark using Morgan bit vectors, all three approaches below produced identical outputs, but their run times differed substantially:

  • pairwise FingerprintSimilarity: ~32 sec
  • pairwise TanimotoSimilarity: ~24 sec
  • BulkTanimotoSimilarity: ~1.4 sec
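For anyone who wants to check the equivalence locally, here is a rough sketch of such a comparison. It is not the benchmark behind the numbers above: it uses synthetic random bit vectors instead of Morgan fingerprints, and the sizes (`N_BITS`, `N_FPS`, `BITS_ON`) and function names are assumptions chosen to keep the run short, so absolute timings will differ.

```python
import random
import time

from rdkit import DataStructs

random.seed(0)

# Assumed sizes for a quick run; the original benchmark settings are not known.
N_BITS, N_FPS, BITS_ON = 2048, 300, 40


def random_fp():
    """Build a synthetic bit vector with BITS_ON randomly chosen bits set."""
    bv = DataStructs.ExplicitBitVect(N_BITS)
    for idx in random.sample(range(N_BITS), BITS_ON):
        bv.SetBit(idx)
    return bv


fps = [random_fp() for _ in range(N_FPS)]


def all_pairs_fingerprint_similarity(fps):
    # One FingerprintSimilarity call per pair (default metric is Tanimoto).
    return [[DataStructs.FingerprintSimilarity(a, b) for b in fps] for a in fps]


def all_pairs_tanimoto(fps):
    # One TanimotoSimilarity call per pair.
    return [[DataStructs.TanimotoSimilarity(a, b) for b in fps] for a in fps]


def all_pairs_bulk(fps):
    # One BulkTanimotoSimilarity call per row.
    return [list(DataStructs.BulkTanimotoSimilarity(a, fps)) for a in fps]


results = {}
for name, fn in [
    ("pairwise FingerprintSimilarity", all_pairs_fingerprint_similarity),
    ("pairwise TanimotoSimilarity", all_pairs_tanimoto),
    ("BulkTanimotoSimilarity", all_pairs_bulk),
]:
    start = time.perf_counter()
    results[name] = fn(fps)
    print(f"{name}: {time.perf_counter() - start:.3f} s")
```

All three functions return an N_FPS by N_FPS matrix of Tc values, so the outputs can be compared element by element before looking at the timings.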