PyNear

Fast, exact and approximate K-nearest-neighbour search for Python

PyNear is a Python library with a C++ core for exact or approximate (fast) KNN search over metric spaces. It is built around Vantage Point Trees, a metric tree that scales well to higher dimensionalities where kd-trees degrade, and uses SIMD intrinsics (AVX2 on x86-64, portable fallbacks on arm64/Apple Silicon) to accelerate the hot distance computation paths. Already using scikit-learn's KNN? PyNear ships drop-in adapter classes that implement the same fit / predict / score / kneighbors API — migrate in one line.

Why PyNear?

	PyNear	Faiss	Annoy	scikit-learn
Exact results	✅ VPTree always	✅ flat index	❌ approximate	✅
Approximate (fast, tunable)	✅ IVFFlatL2Index	✅ IVF	✅	❌
Metric agnostic	✅ L2, L1, L∞, Hamming	L2 / inner product	L2 / cosine / Hamming	L2 / others
Low-dim sweet spot	✅	❌	❌	❌
High-dim (512-D – 1024-D)	✅ IVFFlatL2Index	✅	✅	❌
Binary / Hamming exact	✅ hardware popcount	✅	✅	❌
Binary / Hamming approx	✅ MIH + IVFFlat	⚠️ slow build	❌	❌
Threshold / range search	✅ BKTree	❌	❌	❌
Pickle serialization	✅	❌	✅	✅
No extra native deps	✅ NumPy only	❌ compiled lib + optional GPU	❌	❌
scikit-learn compatible API	✅ drop-in adapters	❌	❌	—

PyNear covers the full spectrum: use VPTree indices when you need guaranteed exact answers (2-D to ~256-D), IVFFlatL2Index for fast approximate float search on high-dimensional data (512-D to 1024-D), or MIHBinaryIndex / IVFFlatBinaryIndex for approximate Hamming search on binary descriptors — achieving up to 257× speedup over exact binary brute-force at N=1M, d=512 with 100% Recall@10.

For the Layman

K-Nearest Neighbours (KNN) is simply the idea of finding the k most similar items to a given query in a collection.

Think of it like asking: "given this song I like, what are the 5 most similar songs in my library?" The algorithm measures the "distance" between items (how different they are) and returns the closest ones.

The two key parameters are:

k — how many neighbours to return (e.g. the 5 most similar)
distance metric — how "similarity" is measured (e.g. Euclidean, Manhattan, Hamming)

Everything else — VP-Trees, SIMD, approximate search — is just engineering to make that search fast at scale.

Main applications of KNN search

Image retrieval — finding visually similar images by searching nearest neighbours in an embedding space (e.g. face recognition, reverse image search).
Recommendation systems — suggesting similar items (products, songs, articles) by finding the closest user or item embeddings.
Anomaly detection — flagging data points whose nearest neighbours are unusually distant as potential outliers or fraud cases.
Semantic search — retrieving documents or passages whose dense vector representations are closest to a query embedding (e.g. RAG pipelines).
Broad-phase collision detection — quickly finding candidate object pairs that might be colliding by looking up the nearest neighbours of each object's bounding volume, before running the expensive narrow-phase test.
Soft body / cloth simulation — finding the nearest mesh vertices or particles to resolve contact constraints and self-collision.
Particle systems (SPH, fluid sim) — each particle needs to know its neighbours within a radius to compute pressure and density forces.

Installation

pip install pynear

Requires Python 3.8+ and NumPy ≥ 1.21.2. Pre-built wheels are available for Linux, macOS (x86-64 and Apple Silicon), and Windows — no compiler needed.

Quick start

import numpy as np
import pynear

# Build index from 100 000 vectors of dimension 32
data = np.random.rand(100_000, 32).astype(np.float32)
index = pynear.VPTreeL2Index()
index.set(data)

# KNN search — returns (indices, distances) per query, sorted nearest-first
queries = np.random.rand(10, 32).astype(np.float32)
indices, distances = index.searchKNN(queries, k=5)

# 1-NN shortcut (slightly faster than searchKNN with k=1)
nn_indices, nn_distances = index.search1NN(queries)

For all index types and advanced usage see docs/README.md.

Approximate binary search (image descriptors)

For large-scale image retrieval with binary descriptors (ORB, BRIEF, AKAZE), PyNear provides two approximate Hamming-distance indices that are orders of magnitude faster than exact brute-force:

import numpy as np
import pynear

# 1M × 512-bit descriptors (64 bytes each)
db = np.random.randint(0, 256, size=(1_000_000, 64), dtype=np.uint8)

# ── Multi-Index Hashing ───────────────────────────────────────────────────────
# Best for d=512 (m=8 sub-tables of 64 bits).
# 257× faster than brute-force at N=1M; 100% Recall@10 for near-duplicates.
mih = pynear.MIHBinaryIndex(m=8)   # m=4 for d=128/256, m=8 for d=512
mih.set(db)

queries = np.random.randint(0, 256, size=(100, 64), dtype=np.uint8)
indices, distances = mih.searchKNN(queries, k=10, radius=8)
# radius: any true neighbour within Hamming distance ≤ radius is guaranteed
# to be found (pigeonhole principle). Increase for higher recall on noisier data.

# ── IVF Flat Binary ───────────────────────────────────────────────────────────
# Predictable cost: scans nprobe clusters per query.
# Good when the query radius is unknown or data is non-uniform.
ivf = pynear.IVFFlatBinaryIndex(nlist=512, nprobe=16)
ivf.set(db)

indices, distances = ivf.searchKNN(queries, k=10)
ivf.set_nprobe(32)  # increase nprobe at runtime to trade speed for recall

Choosing between MIH and IVFFlat:

	`MIHBinaryIndex`	`IVFFlatBinaryIndex`
Best for	Near-duplicate retrieval (small Hamming radius)	General approximate Hamming KNN
d=512, N=1M query time	0.037 ms	1.95 ms
Recall guarantee	Exact for distance ≤ radius (pigeonhole)	Probabilistic (depends on nprobe)
Recall control	`radius` parameter	`nprobe` parameter
Recommended `m`	d/8 bytes (e.g. m=8 for 512-bit)	—

Migrating from scikit-learn

PyNear provides adapter classes that implement the same interface as sklearn.neighbors.NearestNeighbors, KNeighborsClassifier, and KNeighborsRegressor. Changing the import is all that is required in most cases:

# Before
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

# After — identical API, backed by a VP-Tree
from pynear.sklearn_adapter import PyNearKNeighborsClassifier
clf = PyNearKNeighborsClassifier(n_neighbors=5, metric='euclidean')

All three adapters follow the standard scikit-learn workflow:

from pynear.sklearn_adapter import (
    PyNearNearestNeighbors,
    PyNearKNeighborsClassifier,
    PyNearKNeighborsRegressor,
)

# Unsupervised neighbour lookup
nn = PyNearNearestNeighbors(n_neighbors=5, metric='euclidean')
nn.fit(X_train)
distances, indices = nn.kneighbors(X_query)

# Classification
clf = PyNearKNeighborsClassifier(n_neighbors=5, weights='distance')
clf.fit(X_train, y_train)
clf.predict(X_test)          # class labels
clf.predict_proba(X_test)    # per-class probabilities
clf.score(X_test, y_test)    # accuracy

# Regression
reg = PyNearKNeighborsRegressor(n_neighbors=5, weights='uniform')
reg.fit(X_train, y_train)
reg.predict(X_test)          # predicted values
reg.score(X_test, y_test)    # R²

Supported metrics: euclidean / l2, manhattan / l1, chebyshev / linf, hamming

Supported weights: uniform, distance (inverse-distance-weighted)

Note: Input arrays are cast to float32 (or uint8 for Hamming) before indexing. scikit-learn uses float64 internally, so very small numerical differences may appear at the precision boundary, but nearest-neighbour results are identical for all practical datasets.

Features

Available indices

Exact indices — always return the true k nearest neighbours:

Index	Distance	Data type	Notes
`VPTreeL2Index`	L2 (Euclidean)	`float32`	SIMD-accelerated
`VPTreeL1Index`	L1 (Manhattan)	`float32`	SIMD-accelerated
`VPTreeChebyshevIndex`	L∞ (Chebyshev)	`float32`	SIMD-accelerated
`VPTreeBinaryIndex`	Hamming	`uint8`	Hardware popcount
`BKTreeBinaryIndex`	Hamming	`uint8`	Threshold / range search

Approximate indices — trade a small recall budget for large speed gains; tunable via n_probe / radius:

Index	Distance	Data type	Notes
`IVFFlatL2Index`	L2 (Euclidean)	`float32`	BLAS SGEMV inner scan; best for 512-D – 1024-D
`IVFFlatBinaryIndex`	Hamming	`uint8`	Binary K-Means IVF; faster build than Faiss binary IVF
`MIHBinaryIndex`	Hamming	`uint8`	Multi-Index Hashing; 257× faster than brute-force at N=1M, d=512

All VPTree and IVFFlat indices support searchKNN(queries, k). BKTreeBinaryIndex supports find_threshold(queries, threshold) for range queries. Set n_probe = n_clusters on IVFFlatL2Index to make it exact.

See docs/approximate.md for a full guide on measuring recall and tuning n_probe for your dataset.

Why approximate search? The curse of dimensionality

Tree-based exact search relies on pruning: a branch is discarded when its closest possible point is provably farther than the current best candidate. This pruning becomes ineffective as dimensionality grows — a phenomenon rooted in a fundamental geometric property of high-dimensional spaces.

Volume concentration near the boundary. Consider $N$ points drawn uniformly at random inside an $n$-dimensional ball of radius $R$. A point at distance $r$ from the origin is closer to the boundary than to the origin whenever $R - r < r$, i.e. $r > R/2$. The fraction of the ball's volume satisfying this condition is:

$$F(n) = \frac{V_n(R) - V_n\left(\tfrac{R}{2}\right)}{V_n(R)} = 1 - \left(\frac{1}{2}\right)^{n}$$

where $V_n(r) = \dfrac{\pi^{n/2}}{\Gamma\left(\tfrac{n}{2}+1\right)} r^n$ is the volume of an $n$-ball of radius $r$. Because $V_n$ scales as $r^n$, the ratio simplifies cleanly to $1 - 2^{-n}$, independent of $R$.

Median distance from the origin. The median distance $r_m$ is the radius such that exactly half the volume lies within it:

$$\frac{V_n(r_m)}{V_n(R)} = \frac{1}{2} ;\Longrightarrow; \left(\frac{r_m}{R}\right)^n = \frac{1}{2} ;\Longrightarrow; r_m = R \cdot 2^{-1/n}$$

As $n \to \infty$, $r_m \to R$: the typical point is arbitrarily close to the surface of the ball.

Numerical illustration:

Dimensionality $n$	Points closer to border than origin $F(n)$	Median distance $r_m / R$
1	50.0 %	0.500
2	75.0 %	0.707
5	96.9 %	0.871
10	99.9 %	0.933
100	≈ 100 %	0.993

Consequence for KNN trees. When $n$ is large, nearly all points are concentrated in a thin shell near the boundary, and the distances between any two points become almost equal. With no contrast in distances, a tree has nothing to prune — every branch must be explored — and search degrades to exhaustive linear scan, $O(N)$. This is the fundamental reason why exact tree search offers diminishing returns beyond $d \approx 256$, and why approximate methods such as IVFFlatL2Index (probing only a fraction of clusters) or Faiss IVF are necessary at high dimensionalities.

Pickle serialisation

All VPTree and IVFFlat indices are pickle-serialisable — save a built index to disk and reload it without rebuilding:

import pickle, numpy as np, pynear

data = np.random.rand(20_000, 32).astype(np.float32)
index = pynear.VPTreeL2Index()
index.set(data)

blob = pickle.dumps(index)
index2 = pickle.loads(blob)

Tree inspection

print(index.to_string())

####################
# [VPTree state]
Num Data Points: 100
Total Memory: 8000 bytes
####################
[+] Root Level:
 Depth: 0
 Height: 14
 Num Sub Nodes: 100
...

Note: to_string() traverses the whole tree — use it for debugging only.

Demos

Two interactive desktop demos ship in demo/ and run with a single command:

pip install PySide6
python demo/point_cloud.py    # KNN Explorer — hover over 1M points to find neighbours
python demo/voronoi.py    # Voronoi diagram — drag seed points, watch cells reshape live

KNN Explorer — scatter up to 1 million 2-D points and hover to see k nearest neighbours highlighted in real time. Supports zoom, pan, and configurable point size.
Voronoi Diagram — every canvas pixel is coloured by its nearest seed point. Add, drag, and remove seeds; the diagram redraws live using pynear's batch 1-NN.

See docs/demos.md for full details.

Benchmarks

Benchmark Report (PDF)

A formal evaluation of PyNear against Faiss, scikit-learn, and Annoy across Euclidean, Manhattan, and Hamming distance metrics, dimensionalities from 2-D to 1024-D, and both exact and approximate search modes. Includes TikZ-rendered latency charts, a recall–latency Pareto analysis of IVFFlatL2Index vs Faiss IndexIVFFlat, and approximate binary-descriptor benchmarks showing MIHBinaryIndex achieving 257× speedup over Faiss exact binary brute-force at N=1M, d=512 with 100% Recall@10.

To run a quick standalone benchmark:

python bench_run.py

Real-World Benchmark — SIFT1M Binary

Performance of pynear's approximate Hamming-distance indices on the INRIA TEXMEX SIFT1M dataset: 1,000,000 × 128-dim float SIFT descriptors sign-quantised to 128-bit binary (16 bytes/descriptor). Ground truth computed by exact brute-force Hamming k-NN over 500 queries, k=10. Machine: Intel(R) Core(TM) Ultra 9 285K.

Index	Configuration	Build (s)	ms / query	QPS	Recall@10
Brute-force (numpy)	N=1,000,000	—	49.6	20	1.000
IVFFlatBinaryIndex	nlist=500, nprobe=31	6.90	1.43	698	1.000
IVFFlatBinaryIndex	nlist=500, nprobe=62	6.90	2.77	361	1.000
IVFFlatBinaryIndex	nlist=500, nprobe=125	6.90	5.42	184	1.000
IVFFlatBinaryIndex	nlist=500, nprobe=250	6.90	10.54	95	1.000
IVFFlatBinaryIndex	nlist=500, nprobe=500	6.90	20.85	48	1.000
MIHBinaryIndex	m=8, radius=4	2.83	1.29	772	0.992
MIHBinaryIndex	m=8, radius=8	2.83	7.56	132	1.000
MIHBinaryIndex	m=8, radius=12	2.83	7.56	132	1.000
MIHBinaryIndex	m=8, radius=16	2.83	24.02	42	1.000
MIHBinaryIndex	m=8, radius=24	2.83	50.92	20	1.000
MIHBinaryIndex	m=8, radius=32	2.83	81.68	12	1.000
MIHBinaryIndex	m=8, radius=48	2.83	171.86	6	1.000

Key takeaways:

IVFFlatBinaryIndex (nprobe=31) achieves 100% Recall@10 at 698 QPS — 35× faster than brute-force.
MIHBinaryIndex (radius=4) is the fastest single configuration at 772 QPS with 0.992 recall.
MIH excels on wider descriptors (512-bit / 64 bytes) where sub-table sparsity is higher.

Reproduce: python demo_binary.py · add --small for a 10 K quick test · --n-gt-queries N to adjust evaluation size.

Development

Building and installing locally

pip install .

Running tests

make test

Debugging C++ code on Unix

CMake build files are provided for building and running C++ tests independently:

make cpp-test

Tests are built in Debug mode by default, so you can debug with GDB:

gdb ./build/tests/vptree-tests

Debugging C++ code on Windows

Install CMake (py -m pip install cmake) and pybind11 (py -m pip install pybind11), then:

mkdir build
cd build
cmake ..\pynear

You may need to pass extra arguments, for example:

cmake ..\pynear -G "Visual Studio 17 2022" -A x64 ^
  -DPYTHON_EXECUTABLE="C:\Program Files\Python312\python.exe" ^
  -Dpybind11_DIR="C:\Program Files\Python312\Lib\site-packages\pybind11\share\cmake\pybind11"

Build and run vptree-tests.exe from the generated solution.

Formatting code

make fmt

Name		Name	Last commit message	Last commit date
Latest commit History 324 Commits
.github/workflows		.github/workflows
demo		demo
docs		docs
pynear		pynear
.clang-format		.clang-format
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
bench_run.py		bench_run.py
demo_binary.py		demo_binary.py
pyproject.toml		pyproject.toml
setup.py		setup.py
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PyNear

Table of Contents

Why PyNear?

For the Layman

Main applications of KNN search

Installation

Quick start

Approximate binary search (image descriptors)

Migrating from scikit-learn

Features

Available indices

Why approximate search? The curse of dimensionality

Pickle serialisation

Tree inspection

Demos

Benchmarks

Real-World Benchmark — SIFT1M Binary

Development

Building and installing locally

Running tests

Debugging C++ code on Unix

Debugging C++ code on Windows

Formatting code

About

Uh oh!

Releases 5

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

PyNear

Table of Contents

Why PyNear?

For the Layman

Main applications of KNN search

Installation

Quick start

Approximate binary search (image descriptors)

Migrating from scikit-learn

Features

Available indices

Why approximate search? The curse of dimensionality

Pickle serialisation

Tree inspection

Demos

Benchmarks

Real-World Benchmark — SIFT1M Binary

Development

Building and installing locally

Running tests

Debugging C++ code on Unix

Debugging C++ code on Windows

Formatting code

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages