A K-medoid priority-tree approach for the Fast Library for Approximate Nearest Neighbors (FLANN)
Typical FLANN implementations use one of two index types:
- Kd-trees: require a dimensionally additive distance measure, i.e., one that decomposes into a sum of per-dimension contributions
- K-means trees: require constructing a cluster center of mass in some n-dimensional space
Both fail when more complex distance measures are needed, or when a "center of mass" is difficult to define (as in graph clustering).
Here we use the K-medoids algorithm (https://en.wikipedia.org/wiki/K-medoids) to construct a K-medoid priority tree akin to the K-means priority-tree version of FLANN.
- Templated to work on arbitrary data objects with user-defined distance functions
- Sparse distance matrix storage, minimizing repetitive distance computations
- Radius search and top-M nearest-neighbor (NN) search
The tree is constructed as follows:
1. Take N samples {x_i}, a distance function f(x_i, x_j), and a branching factor k
2. Construct the tree root as the full ensemble of data points
3. Select k samples at random from x to serve as medoids, y
4. Assign every sample to its nearest medoid: C_m = { x_i : f(x_i, y_m) < f(x_i, y_j) for all j != m }
5. In each cluster C_m, find the sample with minimum total internal distance: argmin_i sum_{j=1..|C_m|} f(C_{m,i}, C_{m,j})
6. Update the medoids to the samples found in step 5
7. Go to step 4 until converged or an iteration limit is reached
8. For each cluster C_i, construct a new node in the tree as a descendant of the root, containing only the data points in C_i
9. Apply steps 3-8 recursively to each new node; a node becomes a leaf when it contains fewer than k samples
Planned work:
- Refactor the monolithic header into individual files for easier reading
- Write simple example code
- Refactor to remove the Armadillo dependency
- Add OpenMP and/or MPI support for parallelized search through the tree
  - Parallelize both batch and individual NN searches
Built using the Armadillo C++ Linear Algebra Library (http://arma.sourceforge.net/)
[1] M. Muja and D. G. Lowe, "Scalable Nearest Neighbor Algorithms for High Dimensional Data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 11, pp. 2227-2240, Nov. 2014.
[2] M. Muja, "Scalable nearest neighbour methods for high dimensional data," Doctoral Thesis, University of British Columbia, 2013.