Skip to content

Add method sort(::KmeansResult) #288

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

mcabbott
Copy link

@mcabbott mcabbott commented Apr 15, 2025

I wanted to sort clusters, and wondered whether someone else might want to too?

@codecov-commenter
Copy link

codecov-commenter commented Apr 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.41%. Comparing base (b4df21a) to head (5eec95c).
Report is 20 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #288   +/-   ##
=======================================
  Coverage   95.40%   95.41%           
=======================================
  Files          20       19    -1     
  Lines        1503     1505    +2     
=======================================
+ Hits         1434     1436    +2     
  Misses         69       69           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mcabbott
Copy link
Author

Last commit attempts to make Julia 1.0 work, since CI tests it. But locally I can't install things:

(v1.0) pkg> dev Clustering
  Updating git-repo `https://github.com/JuliaStats/Clustering.jl.git`
[ Info: Path `/Users/me/.julia/dev/Clustering` exists and looks like the correct package, using existing path instead of cloning
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Distances [b4f34e82]:
 Distances [b4f34e82] log:
 ├─possible versions are: [0.7.0-0.7.4, 0.8.0-0.8.2, 0.9.0-0.9.2, 0.10.0-0.10.7] or uninstalled
 └─restricted to versions 0.10.9-0.10 by Clustering [aaaa29a8] — no versions left
   └─Clustering [aaaa29a8] log:
     ├─possible versions are: 0.15.8 or uninstalled
     └─Clustering [aaaa29a8] is fixed to version 0.15.8

Similar errors on Julia 1.6 locally. If I change the bounds to allow Distances v0.10.7, then this package fails:

julia> data = cbrt.(rand(rng, 2, 300));

julia> kmeans(data, 5)
ERROR: MethodError: no method matching colwise!(::Distances.SqEuclidean, ::Vector{Float64}, ::Matrix{Float64}, ::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
Closest candidates are:
  colwise!(::AbstractArray, ::Distances.SqMahalanobis, ::AbstractMatrix{T} where T, ::AbstractVector{T} where T) at /Users/me/.julia/packages/Distances/6E33b/src/mahalanobis.jl:111
  colwise!(::AbstractArray, ::Distances.Mahalanobis, ::AbstractMatrix{T} where T, ::AbstractVector{T} where T) at /Users/me/.julia/packages/Distances/6E33b/src/mahalanobis.jl:121
  colwise!(::AbstractArray, ::Distances.PreMetric, ::AbstractMatrix{T} where T, ::AbstractVector{T} where T) at /Users/me/.julia/packages/Distances/6E33b/src/generic.jl:78
  ...
Stacktrace:
 [1] initseeds!(iseeds::Vector{Int64}, alg::KmppAlg, X::Matrix{Float64}, metric::Distances.SqEuclidean; rng::Random._GLOBAL_RNG)
   @ Clustering ~/.julia/dev/Clustering/src/seeding.jl:183
 [2] #initseeds#3
   @ ~/.julia/dev/Clustering/src/seeding.jl:43 [inlined]
 [3] initseeds(algname::Symbol, X::Matrix{Float64}, k::Int64; kwargs::Base.Iterators.Pairs{Symbol, Random._GLOBAL_RNG, Tuple{Symbol}, NamedTuple{(:rng,), Tuple{Random._GLOBAL_RNG}}})
   @ Clustering ~/.julia/dev/Clustering/src/seeding.jl:75
 [4] kmeans(X::Matrix{Float64}, k::Int64; weights::Nothing, init::Symbol, maxiter::Int64, tol::Float64, display::Symbol, distance::Distances.SqEuclidean, rng::Random._GLOBAL_RNG)
...

So I don't know. CI did pass on 1.0 on the first commit (before adding any tests).

@mcabbott mcabbott marked this pull request as ready for review April 16, 2025 14:13
@alyst
Copy link
Member

alyst commented May 18, 2025

Thank you for the PR.
Sorting clusters is a nice feature in principle.
Does your PR implements it specifically for KMeans and specifically for sorting by the cluster centers?

@mcabbott
Copy link
Author

mcabbott commented May 19, 2025

Yes, this PR only for k-means. Note the title contains sort(::KmeansResult)!

And yes, it is sorting the cluster means, via eachcol of the matrix. It should accept all the usual sorting keywords, such as by = norm if you want a sort order other than lexicographical. (As clearly described in the associated docstring?)

@alyst
Copy link
Member

alyst commented May 19, 2025

@mcabbott Thank you for clarification.
I think the use case that the PR currently covers is too specific.
I understand that asking to support other clustering results and other ways of sorting (e.g. by cluster sizes) might be too much for this PR.
But minimally it should implement a fallback sort implementation for the generic ClusteringResult and provide a mechanism to support different sorting criteria
At the moment I can only think of overriding by keyword with :coordinate, :size, :cost values or with (clustering, cluster_index) -> value function.
It also looks like it's easy to provide sortperm(::ClusteringResult; kwargs...) and use it from sort(::ClusteringResult; kwargs...) to broaden the potential usage scenarios.

@mcabbott
Copy link
Author

mcabbott commented May 20, 2025

implementation for the generic ClusteringResult

This is an abstract type. It's not obvious to me that such an implementation is possible, as it seems to require knowing what the different fields mean, and how they should each be permuted.

I totally agree that being able to sort all the subtypes would be desirable. Seems like the way you get there is by implementing it for one, and then the next, etc. This PR just wants to get the ball rolling.

keyword with :coordinate, :size, :cost values

I agree someone could want to sort clusters by their costs, etc. But I'm not sure what interface you have in mind, from just words.

Edit, maybe I should add -- my reason to sort by cluster means is for coherent plots, in which I numbered the clusters 1-10, and had more transitions 1-2 and 2-3 than between far-away points. They are essentially along one dimension. Sorting means that the plots don't change much with different random numbers. I wonder what purpose sorting by costs or counts would have?

One way to avoid ambiguity would be to make this a method of sortslices(::KmeansResult; dims=2), since the cluster means are columns, while all other possibilities are not.

easy to provide sortperm

I mean, it's trivial to call sortperm on any field you like -- there's little need for this package to provide some complex API to work through, easier to do it yourself.

What this PR is mostly doing is the next step, of applying the same permutation coherently to all the different fields of the k-means result struct. As far as I know the initial ordering of the clusters is arbitrary; this picks a different ordering which is the algorithm could equally have found. As you can see in the code, some fields need invperm (and as you can see in the commit history, I got it wrong on the first attempt!). That could be wrapped up as a method of Base.permute!(:: KmeansResult, perm) instead. Although that's a clunkier interface...

@alyst
Copy link
Member

alyst commented May 20, 2025

implementation for the generic ClusteringResult

This is an abstract type. It's not obvious to me that such an implementation is possible, as it seems to require knowing what the different fields mean, and how they should each be permuted.

I was speaking of a fallback implementation that just throws "sort() is not implemented for $(typeof(clustering))".
The idea is that for the user it is easy to switch from k-means to k-medoids, but then the sorting will break.
And while the standard MethodError will look like a bug, the error thrown by the fallback implementation will be more descriptive and would help to resolve some confusion.

We are limited in implementing the generic ClusteringResult sorting by its current interface (lack of it), there was an initiative for a better Clustering API, but that is a big effort and would be quite breaking.

Edit, maybe I should add -- my reason to sort by cluster means is for coherent plots, in which I numbered the clusters 1-10, and had more transitions 1-2 and 2-3 than between far-away points. They are essentially along one dimension. Sorting means that the plots don't change much with different random numbers. I wonder what purpose sorting by costs or counts would have?

The biggest clusters are likely to be the most relevant, and, for many datasets, could be quite stable.

One way to avoid ambiguity would be to make this a method of sortslices(::KmeansResult; dims=2), since the cluster means are columns, while all other possibilities are not.

I mean, it's trivial to call sortperm on any field you like -- there's little need for this package to provide some complex API to work through, easier to do it yourself.

sort() and sortperm() support the by keyarg, which is a function that receives a container element.
Unfortunately, ClusteringResult is not a container, so I was thinking about something along these lines:

Base.sortperm(clustering::ClusteringResult; by, kwargs...) = sortperm(1:nclusters(clustering); by = i -> by(clustering, i), kwargs...)

Where for sorting by centers the function is (clustering, i) -> view(clustering.centers, :, i), and to avoid exposing the user to these internals, we can have

function Base.sortperm(clustering::ClusteringResult; by::Symbol = :center, kwargs...)
    if by == :center
        accessor = (clustering, i) -> view(clustering.centers, :, i)
    elseif by == :size
        accessor = (clustering, i) -> clustering.counts[i]
    elseif ...
        ...
    end
    sortperm(clustering; by=accessor, kwargs...)
end

Since many clustering implementations share the same field names, there could be a shared sortperm() implementation for several result types.

What this PR is mostly doing is the next step, of applying the same permutation coherently to all the different fields of the k-means result struct.

Exactly -- permuting the fields (essentially, it is getindex(ClusteringResult, indices::AbstractVector)) is the reason why we would need a dedicated method for each ClusteringResult subtype.
So factoring out sortperm() and getindex() out of sort() may simplify implementing sorting for the other clustering results and looks like a first step towards #256.

@mcabbott
Copy link
Author

Ok, well I see what you mean now.

I'm not going to get around to all of this, so will live with my pirate version locally for now.

@mcabbott mcabbott closed this May 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants