Skip to content

pasiweber/SHiP-framework

Repository files navigation

Similarity-Hierarchical-Partitioning (SHiP) Clustering Framework

PyPI version Tests Docs

This repository is the official implementation of the Similarity-Hierarchical-Partitioning (SHiP) clustering framework proposed in Ultrametric Cluster Hierarchies: I Want `em All! This framework provides a comprehensive approach to clustering by leveraging similarity trees, $(k,z)$-hierarchies, and various partitioning objective functions.

The whole project is implemented in C++ and Python bindings enable the usage within Python.

Overview

The SHiP framework operates in three main stages: SHiP framework overview

  1. Similarity Tree Construction: A similarity tree is built for the given dataset. This tree represents the relationships and proximities between data points. Note that the default constructed tree corresponds to the $k$-center hierarchy (Section 3 in the paper).
  2. $(k,z)$-Hierarchy Construction: Using the similarity tree, a $(k,z)$-hierarchy can be constructed. These hierarchies correlate to common center based clustering methods, as e.g., $k$-median or $k$-means (Section 4).
  3. Partitioning: Finally, the data is partitioned based on the constructed hierarchy and a user-selected partitioning objective function (Section 5).

Features

  • Similarity Trees: The package provides a set of similarity/ultrametric tree implementations:

    • DCTree [1]
    • HST [2]
    • CoverTree [3]
    • KDTree [3]
    • MeanSplitKDTree [3]
    • BallTree [3]
    • MeanSplitBallTree [3]
    • RPTree [3]
    • MaxRPTree [3]
    • UBTree [3]
    • RTree [3]
    • RStarTree [3]
    • XTree [3]
    • HilbertRTree [3]
    • RPlusTree [3]
    • RPlusPlusTree [3]
    • Or use LoadTree to load a precomputed tree
  • $(k,z)$-Hierarchies: It supports all possible $(k,z)$-hierarchies, allowing flexibility in choosing the most suitable hierarchy for a given dataset.

    • $z = 0$$k$-center (actually in theory: $z = ∞$, but in this implementation we use 0 for $∞$)
    • $z = 1$$k$-median
    • $z = 2$$k$-means
    • ...
  • Partitioning Functions: A wide range of partitioning functions are available, enabling users to select the most appropriate function based on their specific needs:

    • K
    • Elbow
    • Threshold
    • ThresholdElbow
    • QCoverage
    • QCoverageElbow
    • QStem
    • QStemElbow
    • LcaNoiseElbow
    • LcaNoiseElbowNoTriangle
    • MedianOfElbows
    • MeanOfElbows
    • Stability
    • NormalizedStability
  • Customization: Users can customize the framework by selecting from the available similarity trees, $(k,z)$- hierarchies, and partitioning functions.

    • E.g., DCTree with $k$-means ($z=2$)-hierarchy and the Elbow partitioning method.
      from SHiP import SHiP
      
      # Build the `DCTree`
      ship = SHiP(data=data_points, treeType="DCTree")
      # Extract the clustering from the $k$-median hierarchy and the `Elbow` partitioning method
      labels = ship.fit_predict(hierarchy=2, partitioningMethod="Elbow")

Installation

Stable Version

The current stable version can be installed by the following command:
pip install SHiP-framework (coming soon)

Note that a gcc compiler is required for installation. Therefore, in case of an installation error, make sure that:

  • Windows: Microsoft C++ Build Tools is installed
  • Linux/Mac: Python dev is installed (e.g., by running apt-get install python-dev - the exact command may differ depending on the linux distribution)

The error messages may look like this:

error: command 'gcc' failed: No such file or directory
Could not build wheels for SHiP-framework, which is required to install pyproject.toml-based projects
Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools

Development Version

The current development version can be installed directly from git by executing:
sudo pip install git+https://github.com/pasiweber/SHiP-framework.git

Alternatively, clone the repository, go to the root directory and execute:
pip install .

Code Example

from SHiP import SHiP

ship = SHiP(data=data, treeType="DCTree")

# or to load a saved tree
ship = SHiP(data=data, treeType="LoadTree", config={"json_tree_filepath": "<file_path>"}) 
# or additionally specify the tree_type of the loaded tree by adding {"tree_type": "DCTree"}

ship.hierarchy = 0
ship.partitioningMethod = "K"
labels = ship.fit_predict()

# or in one line
labels = ship.fit_predict(hierarchy = 1, partitioningMethod = "Elbow")

# optional: save the current computed tree
json = ship.get_tree().to_json()

Results

Our framework achieves the following performance:

Dataset DC-0-Stab. DC-1-MoE DC-2-Elb. CT-0-Stab. CT-1-MoE CT-2-Elb. $k$-means SCAR Ward AMD-DBSCAN DPC
Boxes 90.1 99.3 97.9 2.6 42.1 ± 4.7 24.2 ± 1.6 93.5 ± 4.3 0.1 ± 0.1 95.8 63.9 25.9
D31 79.7 42.7 82.9 46.5 ± 1.8 62.0 ± 5.4 67.7 ± 3.2 92.0 ± 2.7 41.7 ± 5.4 92.0 86.4 18.5
airway 38.0 65.9 58.8 0.8 18.2 ± 2.4 12.0 ± 1.4 39.9 ± 2.0 -0.9 ± 0.5 43.7 31.7 65.1
lactate 41.0 41.0 67.5 0.1 4.1 ± 0.6 1.7 ± 0.2 28.6 ± 1.1 1.5 ± 1.0 27.7 71.5 0.0
HAR 30.0 46.9 52.8 14.7 ± 8.8 14.2 ± 4.7 9.6 ± 2.2 46.0 ± 4.5 5.5 ± 3.2 49.1 0.0 33.2
letterrec. 12.1 16.6 17.9 5.8 ± 0.2 7.2 ± 0.6 6.2 ± 0.3 12.9 ± 0.6 0.4 ± 0.1 14.7 ± 0.9 7.9 0.0
PenDigits 66.4 73.1 75.4 8.0 ± 0.8 12.0 ± 0.6 8.9 ± 0.5 55.3 ± 3.2 0.9 ± 0.3 55.2 55.6 28.8 ± 1.1
COIL20 81.2 72.8 72.6 46.4 ± 4.4 46.6 ± 2.1 47.7 ± 2.0 58.2 ± 2.8 33.5 ± 2.0 68.6 39.2 35.9 ± 0.1
COIL100 80.1 66.8 70.0 44.6 ± 4.2 46.6 ± 1.5 50.1 ± 1.2 56.1 ± 1.4 16.7 ± 0.8 61.4 14.2 0.2
cmu_faces 60.2 56.6 66.5 8.6 ± 3.1 37.1 ± 4.1 34.2 ± 2.1 53.2 ± 4.7 38.5 ± 2.9 61.6 0.7 0.6
OptDigits 55.3 77.0 77.0 40.9 ± 3.5 20.9 ± 2.3 18.1 ± 2.4 61.3 ± 6.6 14.4 ± 4.1 74.6 ± 2.4 63.2 0.0
USPS 33.7 29.3 29.3 12.0 ± 1.7 8.7 ± 1.0 11.2 ± 1.5 52.3 ± 1.7 2.9 ± 0.9 63.9 0.0 21.0
MNIST 19.7 41.7 46.0 11.1 ± 1.7 5.4 ± 0.6 5.4 ± 0.6 36.9 ± 1.0 1.3 ± 0.4 52.7 0.0 -

License

The project is licensed under the BSD 3-Clause License (see LICENSE.txt).

References

[1] Connecting the Dots -- Density-Connectivity Distance unifies DBSCAN, k-Center and Spectral Clustering
[2] HST+: An Efficient Index for Embedding Arbitrary Metric Spaces (Github)
[3] mlpack 4: a fast, header-only C++ machine learning library (Github)

About

Center-based hierarchical clustering framework using 3 steps: compute a Similarity, build an Hierarchy, and lastly extract a clustering using a Partitioning method.

Resources

License

Stars

Watchers

Forks

Contributors