Add method `sort(::KmeansResult)` #288

mcabbott · 2025-04-15T21:05:56Z

I wanted to sort clusters, and wondered whether someone else might want to too?

codecov-commenter · 2025-04-15T21:10:43Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.41%. Comparing base (b4df21a) to head (5eec95c).
Report is 20 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #288   +/-   ##
=======================================
  Coverage   95.40%   95.41%           
=======================================
  Files          20       19    -1     
  Lines        1503     1505    +2     
=======================================
+ Hits         1434     1436    +2     
  Misses         69       69

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

mcabbott · 2025-04-16T14:13:05Z

Last commit attempts to make Julia 1.0 work, since CI tests it. But locally I can't install things:

(v1.0) pkg> dev Clustering
  Updating git-repo `https://github.com/JuliaStats/Clustering.jl.git`
[ Info: Path `/Users/me/.julia/dev/Clustering` exists and looks like the correct package, using existing path instead of cloning
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Distances [b4f34e82]:
 Distances [b4f34e82] log:
 ├─possible versions are: [0.7.0-0.7.4, 0.8.0-0.8.2, 0.9.0-0.9.2, 0.10.0-0.10.7] or uninstalled
 └─restricted to versions 0.10.9-0.10 by Clustering [aaaa29a8] — no versions left
   └─Clustering [aaaa29a8] log:
     ├─possible versions are: 0.15.8 or uninstalled
     └─Clustering [aaaa29a8] is fixed to version 0.15.8

Similar errors on Julia 1.6 locally. If I change the bounds to allow Distances v0.10.7, then this package fails:

julia> data = cbrt.(rand(rng, 2, 300));

julia> kmeans(data, 5)
ERROR: MethodError: no method matching colwise!(::Distances.SqEuclidean, ::Vector{Float64}, ::Matrix{Float64}, ::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
Closest candidates are:
  colwise!(::AbstractArray, ::Distances.SqMahalanobis, ::AbstractMatrix{T} where T, ::AbstractVector{T} where T) at /Users/me/.julia/packages/Distances/6E33b/src/mahalanobis.jl:111
  colwise!(::AbstractArray, ::Distances.Mahalanobis, ::AbstractMatrix{T} where T, ::AbstractVector{T} where T) at /Users/me/.julia/packages/Distances/6E33b/src/mahalanobis.jl:121
  colwise!(::AbstractArray, ::Distances.PreMetric, ::AbstractMatrix{T} where T, ::AbstractVector{T} where T) at /Users/me/.julia/packages/Distances/6E33b/src/generic.jl:78
  ...
Stacktrace:
 [1] initseeds!(iseeds::Vector{Int64}, alg::KmppAlg, X::Matrix{Float64}, metric::Distances.SqEuclidean; rng::Random._GLOBAL_RNG)
   @ Clustering ~/.julia/dev/Clustering/src/seeding.jl:183
 [2] #initseeds#3
   @ ~/.julia/dev/Clustering/src/seeding.jl:43 [inlined]
 [3] initseeds(algname::Symbol, X::Matrix{Float64}, k::Int64; kwargs::Base.Iterators.Pairs{Symbol, Random._GLOBAL_RNG, Tuple{Symbol}, NamedTuple{(:rng,), Tuple{Random._GLOBAL_RNG}}})
   @ Clustering ~/.julia/dev/Clustering/src/seeding.jl:75
 [4] kmeans(X::Matrix{Float64}, k::Int64; weights::Nothing, init::Symbol, maxiter::Int64, tol::Float64, display::Symbol, distance::Distances.SqEuclidean, rng::Random._GLOBAL_RNG)
...

So I don't know. CI did pass on 1.0 on the first commit (before adding any tests).

alyst · 2025-05-18T07:25:47Z

Thank you for the PR.
Sorting clusters is a nice feature in principle.
Does your PR implements it specifically for KMeans and specifically for sorting by the cluster centers?

mcabbott · 2025-05-19T19:40:37Z

Yes, this PR only for k-means. Note the title contains sort(::KmeansResult)!

And yes, it is sorting the cluster means, via eachcol of the matrix. It should accept all the usual sorting keywords, such as by = norm if you want a sort order other than lexicographical. (As clearly described in the associated docstring?)

alyst · 2025-05-19T23:21:30Z

@mcabbott Thank you for clarification.
I think the use case that the PR currently covers is too specific.
I understand that asking to support other clustering results and other ways of sorting (e.g. by cluster sizes) might be too much for this PR.
But minimally it should implement a fallback sort implementation for the generic ClusteringResult and provide a mechanism to support different sorting criteria
At the moment I can only think of overriding by keyword with :coordinate, :size, :cost values or with (clustering, cluster_index) -> value function.
It also looks like it's easy to provide sortperm(::ClusteringResult; kwargs...) and use it from sort(::ClusteringResult; kwargs...) to broaden the potential usage scenarios.

mcabbott · 2025-05-20T15:28:34Z

implementation for the generic ClusteringResult

This is an abstract type. It's not obvious to me that such an implementation is possible, as it seems to require knowing what the different fields mean, and how they should each be permuted.

I totally agree that being able to sort all the subtypes would be desirable. Seems like the way you get there is by implementing it for one, and then the next, etc. This PR just wants to get the ball rolling.

keyword with :coordinate, :size, :cost values

I agree someone could want to sort clusters by their costs, etc. But I'm not sure what interface you have in mind, from just words.

Edit, maybe I should add -- my reason to sort by cluster means is for coherent plots, in which I numbered the clusters 1-10, and had more transitions 1-2 and 2-3 than between far-away points. They are essentially along one dimension. Sorting means that the plots don't change much with different random numbers. I wonder what purpose sorting by costs or counts would have?

One way to avoid ambiguity would be to make this a method of sortslices(::KmeansResult; dims=2), since the cluster means are columns, while all other possibilities are not.

easy to provide sortperm

I mean, it's trivial to call sortperm on any field you like -- there's little need for this package to provide some complex API to work through, easier to do it yourself.

What this PR is mostly doing is the next step, of applying the same permutation coherently to all the different fields of the k-means result struct. As far as I know the initial ordering of the clusters is arbitrary; this picks a different ordering which is the algorithm could equally have found. As you can see in the code, some fields need invperm (and as you can see in the commit history, I got it wrong on the first attempt!). That could be wrapped up as a method of Base.permute!(:: KmeansResult, perm) instead. Although that's a clunkier interface...

alyst · 2025-05-20T16:56:59Z

implementation for the generic ClusteringResult

This is an abstract type. It's not obvious to me that such an implementation is possible, as it seems to require knowing what the different fields mean, and how they should each be permuted.

I was speaking of a fallback implementation that just throws "sort() is not implemented for $(typeof(clustering))".
The idea is that for the user it is easy to switch from k-means to k-medoids, but then the sorting will break.
And while the standard MethodError will look like a bug, the error thrown by the fallback implementation will be more descriptive and would help to resolve some confusion.

We are limited in implementing the generic ClusteringResult sorting by its current interface (lack of it), there was an initiative for a better Clustering API, but that is a big effort and would be quite breaking.

Edit, maybe I should add -- my reason to sort by cluster means is for coherent plots, in which I numbered the clusters 1-10, and had more transitions 1-2 and 2-3 than between far-away points. They are essentially along one dimension. Sorting means that the plots don't change much with different random numbers. I wonder what purpose sorting by costs or counts would have?

The biggest clusters are likely to be the most relevant, and, for many datasets, could be quite stable.

One way to avoid ambiguity would be to make this a method of sortslices(::KmeansResult; dims=2), since the cluster means are columns, while all other possibilities are not.

I mean, it's trivial to call sortperm on any field you like -- there's little need for this package to provide some complex API to work through, easier to do it yourself.

sort() and sortperm() support the by keyarg, which is a function that receives a container element.
Unfortunately, ClusteringResult is not a container, so I was thinking about something along these lines:

Base.sortperm(clustering::ClusteringResult; by, kwargs...) = sortperm(1:nclusters(clustering); by = i -> by(clustering, i), kwargs...)

Where for sorting by centers the function is (clustering, i) -> view(clustering.centers, :, i), and to avoid exposing the user to these internals, we can have

function Base.sortperm(clustering::ClusteringResult; by::Symbol = :center, kwargs...)
    if by == :center
        accessor = (clustering, i) -> view(clustering.centers, :, i)
    elseif by == :size
        accessor = (clustering, i) -> clustering.counts[i]
    elseif ...
        ...
    end
    sortperm(clustering; by=accessor, kwargs...)
end

Since many clustering implementations share the same field names, there could be a shared sortperm() implementation for several result types.

What this PR is mostly doing is the next step, of applying the same permutation coherently to all the different fields of the k-means result struct.

Exactly -- permuting the fields (essentially, it is getindex(ClusteringResult, indices::AbstractVector)) is the reason why we would need a dedicated method for each ClusteringResult subtype.
So factoring out sortperm() and getindex() out of sort() may simplify implementing sorting for the other clustering results and looks like a first step towards #256.

mcabbott · 2025-05-20T18:43:28Z

Ok, well I see what you mean now.

I'm not going to get around to all of this, so will live with my pirate version locally for now.

add sort(::KmeansResult)

0f62a30

mcabbott added 2 commits April 15, 2025 18:02

fix invperm

004ebd3

add a test

7b22b9e

mcabbott marked this pull request as ready for review April 16, 2025 14:13

alyst added the enhancement label May 18, 2025

try to fix 1.0

5eec95c

mcabbott force-pushed the sort-kmeans branch from 8bc998b to 5eec95c Compare May 19, 2025 22:07

mcabbott closed this May 20, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add method `sort(::KmeansResult)` #288

Add method `sort(::KmeansResult)` #288

Uh oh!

mcabbott commented Apr 15, 2025 •

edited

Loading

Uh oh!

codecov-commenter commented Apr 15, 2025 •

edited

Loading

Uh oh!

mcabbott commented Apr 16, 2025

Uh oh!

alyst commented May 18, 2025

Uh oh!

mcabbott commented May 19, 2025 •

edited

Loading

Uh oh!

alyst commented May 19, 2025

Uh oh!

mcabbott commented May 20, 2025 •

edited

Loading

Uh oh!

alyst commented May 20, 2025

Uh oh!

mcabbott commented May 20, 2025

Uh oh!

Uh oh!

Add method sort(::KmeansResult) #288

Add method sort(::KmeansResult) #288

Uh oh!

Conversation

mcabbott commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Apr 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

mcabbott commented Apr 16, 2025

Uh oh!

alyst commented May 18, 2025

Uh oh!

mcabbott commented May 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alyst commented May 19, 2025

Uh oh!

mcabbott commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alyst commented May 20, 2025

Uh oh!

mcabbott commented May 20, 2025

Uh oh!

Uh oh!

Add method `sort(::KmeansResult)` #288

Add method `sort(::KmeansResult)` #288

mcabbott commented Apr 15, 2025 •

edited

Loading

codecov-commenter commented Apr 15, 2025 •

edited

Loading

mcabbott commented May 19, 2025 •

edited

Loading

mcabbott commented May 20, 2025 •

edited

Loading