Add method sort(::KmeansResult)
#288
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## master #288 +/- ##
=======================================
Coverage 95.40% 95.41%
=======================================
Files 20 19 -1
Lines 1503 1505 +2
=======================================
+ Hits 1434 1436 +2
Misses 69 69
Last commit attempts to make Julia 1.0 work, since CI tests it. But locally I can't install things:
(v1.0) pkg> dev Clustering
Updating git-repo `https://github.com/JuliaStats/Clustering.jl.git`
[ Info: Path `/Users/me/.julia/dev/Clustering` exists and looks like the correct package, using existing path instead of cloning
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Distances [b4f34e82]:
Distances [b4f34e82] log:
├─possible versions are: [0.7.0-0.7.4, 0.8.0-0.8.2, 0.9.0-0.9.2, 0.10.0-0.10.7] or uninstalled
└─restricted to versions 0.10.9-0.10 by Clustering [aaaa29a8] — no versions left
└─Clustering [aaaa29a8] log:
├─possible versions are: 0.15.8 or uninstalled
└─Clustering [aaaa29a8] is fixed to version 0.15.8
Similar errors on Julia 1.6 locally. If I change the bounds to allow Distances v0.10.7, then this package fails:
julia> data = cbrt.(rand(rng, 2, 300));
julia> kmeans(data, 5)
ERROR: MethodError: no method matching colwise!(::Distances.SqEuclidean, ::Vector{Float64}, ::Matrix{Float64}, ::SubArray{Float64, 1, Matrix{Float64}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
Closest candidates are:
colwise!(::AbstractArray, ::Distances.SqMahalanobis, ::AbstractMatrix{T} where T, ::AbstractVector{T} where T) at /Users/me/.julia/packages/Distances/6E33b/src/mahalanobis.jl:111
colwise!(::AbstractArray, ::Distances.Mahalanobis, ::AbstractMatrix{T} where T, ::AbstractVector{T} where T) at /Users/me/.julia/packages/Distances/6E33b/src/mahalanobis.jl:121
colwise!(::AbstractArray, ::Distances.PreMetric, ::AbstractMatrix{T} where T, ::AbstractVector{T} where T) at /Users/me/.julia/packages/Distances/6E33b/src/generic.jl:78
...
Stacktrace:
[1] initseeds!(iseeds::Vector{Int64}, alg::KmppAlg, X::Matrix{Float64}, metric::Distances.SqEuclidean; rng::Random._GLOBAL_RNG)
@ Clustering ~/.julia/dev/Clustering/src/seeding.jl:183
[2] #initseeds#3
@ ~/.julia/dev/Clustering/src/seeding.jl:43 [inlined]
[3] initseeds(algname::Symbol, X::Matrix{Float64}, k::Int64; kwargs::Base.Iterators.Pairs{Symbol, Random._GLOBAL_RNG, Tuple{Symbol}, NamedTuple{(:rng,), Tuple{Random._GLOBAL_RNG}}})
@ Clustering ~/.julia/dev/Clustering/src/seeding.jl:75
[4] kmeans(X::Matrix{Float64}, k::Int64; weights::Nothing, init::Symbol, maxiter::Int64, tol::Float64, display::Symbol, distance::Distances.SqEuclidean, rng::Random._GLOBAL_RNG)
...
So I don't know. CI did pass on 1.0 on the first commit (before adding any tests).
Thank you for the PR.
Yes, this PR is only for k-means. Note the title contains And yes, it is sorting the cluster means, via
@mcabbott Thank you for the clarification.
This is an abstract type. It's not obvious to me that such an implementation is possible, as it seems to require knowing what the different fields mean, and how they should each be permuted. I totally agree that being able to sort all the subtypes would be desirable. Seems like the way you get there is by implementing it for one, and then the next, etc. This PR just wants to get the ball rolling.
I agree someone could want to sort clusters by their costs, etc. But I'm not sure what interface you have in mind, from just words. Edit, maybe I should add -- my reason to sort by cluster means is for coherent plots, in which I numbered the clusters 1-10, and had more transitions 1-2 and 2-3 than between far-away points. They are essentially along one dimension. Sorting means that the plots don't change much with different random numbers. I wonder what purpose sorting by costs or counts would have? One way to avoid ambiguity would be to make this a method of
I mean, it's trivial to call
What this PR is mostly doing is the next step, of applying the same permutation coherently to all the different fields of the k-means result struct. As far as I know the initial ordering of the clusters is arbitrary; this picks a different ordering which the algorithm could equally have found. As you can see in the code, some fields need
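To make "applying the same permutation coherently" concrete, here is a minimal sketch on a toy struct. `ToyKmeans`, `permute_clusters`, and `sort_by_center` are hypothetical names invented for illustration; this is not the PR's code and not Clustering.jl's `KmeansResult`. The assumption is that per-cluster fields are indexed by the permutation itself, while per-point labels must go through its inverse.

```julia
# Toy stand-in for a k-means result (hypothetical; NOT Clustering.jl's KmeansResult).
struct ToyKmeans
    centers::Matrix{Float64}   # d × k, one column per cluster
    assignments::Vector{Int}   # cluster label of each point
    counts::Vector{Int}        # number of points in each cluster
end

# Relabel clusters so that new cluster j is old cluster p[j].
function permute_clusters(r::ToyKmeans, p::AbstractVector{<:Integer})
    q = invperm(p)                   # q[old_label] == new_label
    ToyKmeans(r.centers[:, p],       # per-cluster fields are indexed by p
              q[r.assignments],      # per-point labels go through the inverse
              r.counts[p])
end

# Order clusters by their centers; vectors compare lexicographically.
function sort_by_center(r::ToyKmeans)
    cols = [r.centers[:, j] for j in 1:size(r.centers, 2)]
    permute_clusters(r, sortperm(cols))
end

r = ToyKmeans([3.0 1.0 2.0; 0.0 0.0 0.0], [1, 1, 2, 3, 3, 3], [2, 1, 3])
s = sort_by_center(r)
```

After the call, the cluster that was labelled 2 becomes cluster 1, and every point that carried label 2 now carries label 1, so counts and assignments stay consistent with the reordered centers.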
I was speaking of a fallback implementation that just throws We are limited in implementing the generic
The biggest clusters are likely to be the most relevant, and, for many datasets, could be quite stable.
Base.sortperm(clustering::ClusteringResult; by, kwargs...) = sortperm(1:nclusters(clustering); by = i -> by(clustering, i), kwargs...)
Where for sorting by centers the function is
function Base.sortperm(clustering::ClusteringResult; by::Symbol = :center, kwargs...)
    if by == :center
        accessor = (clustering, i) -> view(clustering.centers, :, i)
    elseif by == :size
        accessor = (clustering, i) -> clustering.counts[i]
    elseif ...
        ...
    end
    sortperm(clustering; by=accessor, kwargs...)
end
Since many clustering implementations share the same field names, there could be a shared
Exactly -- permuting the fields (essentially, it is
Ok, well I see what you mean now. I'm not going to get around to all of this, so will live with my pirate version locally for now.
I wanted to sort clusters, and wondered whether someone else might want to too?