@yger (Collaborator) commented Dec 11, 2025

This is in line with #4257, but one option would be to get rid of the hdbscan dependency, since there is now a full port in scikit-learn. That was not the case when we started using hdbscan, but dropping it now could be a good way to ease maintenance and installation.
This PR removes all usages of hdbscan and relies only on scikit-learn. Some advanced options of hdbscan have not been ported to the scikit-learn version, but since they are not used anywhere in the code at the moment, this should not be a problem.

@yger self-assigned this Dec 11, 2025
@chrishalcrow (Member) commented

Nice! How similar do you think the implementations are?

@yger requested a review from samuelgarcia December 11, 2025 13:07
@yger added the enhancement and dependencies labels Dec 11, 2025
@yger marked this pull request as ready for review December 11, 2025 13:08
@yger (Collaborator, Author) commented Dec 11, 2025

Same paper, same reference, but I have not performed an exhaustive comparison. Based on preliminary results from the sorters (the only part that really uses hdbscan), performance is similar. Maybe we should compare in more depth.

@yger (Collaborator, Author) commented Dec 12, 2025

OK, after a more exhaustive study, it looks like hdbscan is indeed a better implementation than the scikit-learn one.

[image: benchmark plot comparing hdbscan and scikit-learn]

Here, for example, I generated 10 Gaussian blobs and computed average run times and homogeneity scores for the two implementations. It looks like hdbscan is both faster and more accurate.

@samuelgarcia (Member) commented

So we close!!
