Merged
added 10 commits on March 6, 2026 17:12
- _cluster.py: get_random_batch sampled indices in [0, batch_size) instead of the full dataset; now uses randperm(n)[:batch_size]
- _spectralnet_model.py: orthonorm_weights was never initialised in __init__, causing an AttributeError on the first forward pass when should_update_orth_weights=False; initialise to None and auto-compute on the first call
- _utils.py: get_nearest_neighbors always queried X against itself, ignoring the Y parameter; fixed to use kneighbors(Y_np)
- _trainers: AETrainer and SiameseTrainer used hardcoded relative paths for weight files that broke when called from any directory other than the project root; use __file__-relative paths with os.makedirs
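The sampling fix can be sketched as follows (a minimal illustration of the randperm pattern; the function signature here is simplified, the real method also encodes the batch through the networks):

```python
import torch

def get_random_batch(X: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Return a uniformly sampled batch of rows from X.

    The buggy version drew indices only from [0, batch_size), so it could
    never see rows beyond the first batch. The fix permutes all n indices
    and takes the first batch_size of them.
    """
    n = X.shape[0]
    indices = torch.randperm(n)[:batch_size]  # distinct indices over the full dataset
    return X[indices]
```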
Add five test modules covering all major components:
- test_models.py (15): SpectralNetModel orthonormality, layer types, SiameseNet shared weights, AEModel encoder/decoder symmetry
- test_losses.py (10): SpectralNetLoss and ContrastiveLoss properties: scalar output, non-negativity, zero-affinity, perfect-cluster, margin
- test_utils.py (29): Laplacian (row sums, PSD, symmetry), nearest neighbours (shapes, Y param), scale, Gaussian/t-kernels, Grassmann distance, cost matrix
- test_metrics.py (7): ACC and NMI correctness
- test_spectralnet.py (13): end-to-end fit/predict/transform on blobs

Add conftest.py with a global seed fixture for deterministic results.
- ci.yml: run pytest on Python 3.11 and 3.12 in parallel on every push and PR to main; upload test results as artifacts.
- release.yml: on push to main, run python-semantic-release, which reads conventional-commit prefixes to decide the version bump (fix → patch, feat → minor, BREAKING CHANGE → major), creates a git tag and GitHub release, then builds and publishes to PyPI via the OIDC Trusted Publisher (no API token required).
- pixi.toml + pixi.lock: reproducible conda/PyPI environment via pixi; defines tasks 'test' and 'test-fast' for quick iteration
- pyproject.toml: remove heavy runtime deps from [build-system] requires (only setuptools + wheel are needed at build time); add [tool.semantic_release] config pointing the version bump at setup.cfg
- setup.cfg: add an [options.extras_require] dev group (pytest, pytest-cov) and a [tool:pytest] section so plain 'pytest' works from any directory
- _spectralnet_model.py: remove unused numpy import; replace np.sqrt(m) with plain m**0.5; fix docstring "Cholesky" → "QR decomposition"
- _spectralnet_trainer.py: replace the absolute wildcard 'from spectralnet._utils import *' with an explicit relative import of the four symbols actually used; remove the redundant nested 'with torch.no_grad()' inside validate() (already guarded by the outer context manager)
- _reduction.py: replace 'from ._utils import *' with an explicit import of the three symbols used (get_affinity_matrix, get_laplacian, plot_laplacian_eigenvectors)
- _utils.py: delete the dead create_weights_dir() function (trainers now use __file__-relative paths with os.makedirs)
Add a "From source (with pixi)" section under Installation explaining how to install pixi, clone the repo, install the environment with 'pixi install', and run the test suite with 'pixi run test'.
Three scalability fixes for large datasets:
1. get_random_batch() (_cluster.py): previously encoded the *entire* dataset through the AE and Siamese network just to return a small random batch, O(N) wasted work. Now selects indices first and encodes only the sampled batch, wrapped in torch.no_grad().
2. predict() (_cluster.py): previously moved the full X to the GPU and ran the AE + SpectralNet in a single forward pass, causing OOM on large inputs. Now iterates in chunks of spectral_batch_size, accumulating CPU tensors, then concatenates for k-means.
3. AETrainer.embed() (_ae_trainer.py): previously called ae_net.encode(X.to(device)) on the whole dataset at once. Now iterates in chunks of batch_size, keeps intermediate results on CPU, and returns a concatenated CPU tensor that DataLoaders in downstream trainers can page to the device batch by batch.
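The chunked-inference pattern shared by fixes 2 and 3 can be sketched like this (an illustration of the pattern only; the real code lives inside the trainer and cluster methods and uses the library's own batch-size settings):

```python
import torch
from torch import nn


@torch.no_grad()
def embed_in_chunks(model: nn.Module, X: torch.Tensor,
                    batch_size: int, device: str = "cpu") -> torch.Tensor:
    """Run `model` over X one chunk at a time.

    Only one chunk is ever resident on `device`; each result is moved back
    to CPU immediately, so the full output never has to fit in GPU memory.
    """
    chunks = []
    for start in range(0, X.shape[0], batch_size):
        batch = X[start:start + batch_size].to(device)
        chunks.append(model(batch).cpu())
    return torch.cat(chunks, dim=0)
```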
Replaces the Tensor-only fit/predict API with a Dataset-first design so
that arbitrarily large datasets stored on disk (e.g. 10 M images) can be
processed without ever loading them all into memory at once.
_cluster.py — new _FeatureDataset adapter class
A thin Dataset wrapper that normalises both in-memory torch.Tensors and
any user-defined disk-streaming Dataset into consistent (x_flat, y)
tuples. fit() and predict() accept Union[Tensor, Dataset/DataLoader];
Tensor inputs are wrapped automatically for full backward compatibility.
fit(X, y=None):
- X can now be torch.Tensor (small data) or Dataset (large/disk data)
- Stores self._dataset; all trainers receive the Dataset and create
their own DataLoaders with the right batch size and shuffling
predict(X):
- X can be Tensor or DataLoader; iterates in spectral_batch_size chunks
get_random_batch():
- Creates a temporary shuffled DataLoader over self._dataset and takes
the first batch — O(batch_size) memory, not O(N)
AETrainer:
- train(dataset): derives input_dim from dataset[0], uses random_split
directly on the Dataset; training loop extracts x from (x, y) tuples
- embed(dataset) → TensorDataset: encodes via DataLoader, returns a
new (encoded_x, y) TensorDataset on CPU for downstream trainers
- _get_data_loader(): now calls random_split(self._dataset, ...)
SiameseTrainer:
- train(dataset): iterates a DataLoader to materialise self.X for KNN;
comment notes that a representative subset should be passed for very
large datasets (exact KNN is inherently O(N²))
SpectralTrainer:
- train(dataset, siamese_net): derives input_dim from dataset[0];
_get_data_loader() calls random_split directly on the Dataset —
no TensorDataset construction needed
- Split the Usage section into "small datasets (in-memory tensor)" and "large datasets (streaming from disk)"
- Add a concrete ImageFolderDataset example showing how to define a custom Dataset for 10 M+ images on disk, pass it to fit(), and stream predictions through a DataLoader
- Note the Siamese-training memory caveat and the two workarounds (use_approx=True, or pass a subset)
- Tighten the metrics example (f-string, y_np variable)
- Clean up the "Running examples" section
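A disk-streaming Dataset of the kind the README example describes could look like this (a hypothetical sketch using per-sample .pt files rather than images; the class name and storage format are illustrative, not the library's):

```python
import os

import torch
from torch.utils.data import Dataset


class DiskTensorDataset(Dataset):
    """Illustrative disk-streaming Dataset: one .pt file per sample.

    Samples are loaded lazily in __getitem__, so fit()/predict() never
    hold more than one mini-batch in memory at a time.
    """

    def __init__(self, directory: str):
        self._paths = sorted(
            os.path.join(directory, f)
            for f in os.listdir(directory)
            if f.endswith(".pt")
        )

    def __len__(self):
        return len(self._paths)

    def __getitem__(self, idx):
        x = torch.load(self._paths[idx])       # load only this sample
        return x.flatten(), torch.tensor(-1)   # unlabeled in this sketch
```

Such a Dataset would be passed straight to fit(), and predictions streamed through a plain DataLoader.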
98bef98 to 05e0b31
This PR is a comprehensive overhaul of the SpectralNet codebase covering bug fixes, scalability, testing, tooling, and CI/CD.
Bug fixes — four correctness bugs were found and fixed: get_random_batch sampled indices only within [0, batch_size) instead of the full dataset; SpectralNetModel.orthonorm_weights was never initialised in __init__, causing an AttributeError on the first inference call; get_nearest_neighbors always queried X against itself, silently ignoring the Y parameter; and both AETrainer and SiameseTrainer used hardcoded relative paths for weight files that broke when called from any directory other than the project root.
Scalability — the most fundamental architectural change: fit() and predict() now accept a torch.utils.data.Dataset in addition to a torch.Tensor, enabling true out-of-core training on datasets (e.g. 10M images on disk) that cannot fit in RAM. A _FeatureDataset adapter normalises both inputs into a consistent (x, y) format; every trainer creates its own DataLoader internally so only one mini-batch is ever in memory at once. Separately, AETrainer.embed() and predict() were made chunk-based to avoid GPU OOM on large inputs, and get_random_batch() was fixed to encode only the sampled batch instead of the entire dataset.
Performance — the O(n) Python loop that built the nearest-neighbour mask in get_gaussian_kernel and get_t_kernel was replaced with a vectorised tensor scatter using repeat_interleave; an unused numpy import and a redundant nested torch.no_grad() block were also removed.
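The mask vectorisation can be sketched as a before/after pair (helper names here are illustrative; the real code builds the mask inside the kernel functions):

```python
import torch


def knn_mask_loop(indices: torch.Tensor, n: int) -> torch.Tensor:
    """Old style: one Python iteration per row to mark the k nearest neighbours."""
    mask = torch.zeros(n, n)
    for i in range(n):
        mask[i, indices[i]] = 1.0
    return mask


def knn_mask_vectorised(indices: torch.Tensor, n: int) -> torch.Tensor:
    """Vectorised equivalent: a single advanced-indexing scatter, no Python loop."""
    k = indices.shape[1]
    rows = torch.arange(n).repeat_interleave(k)  # [0,0,...,0, 1,1,...,1, ...]
    mask = torch.zeros(n, n)
    mask[rows, indices.reshape(-1)] = 1.0
    return mask
```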
Code quality — wildcard imports (from spectralnet._utils import *, from ._utils import *) were replaced with explicit relative imports; _spectralnet_trainer.py now uses a proper relative import; dead code (create_weights_dir) was deleted; and a wrong docstring ("Cholesky decomposition") was corrected to "QR decomposition".
Testing — 74 tests were added across five modules: test_models.py, test_losses.py, test_utils.py, test_metrics.py, and test_spectralnet.py (end-to-end), all passing with zero warnings.
CI/CD — a GitHub Actions CI workflow runs the test suite on Python 3.11 and 3.12 in parallel on every push and PR; a release workflow uses python-semantic-release to automatically bump the version in setup.cfg, create a GitHub release, and publish to PyPI via OIDC Trusted Publisher based on conventional commit prefixes (fix: → patch, feat: → minor, BREAKING CHANGE → major).
Developer tooling — a pixi.toml and lock file were added for a fully reproducible conda+PyPI environment (pixi install && pixi run test); setup.cfg gained a [dev] extras group and a [tool:pytest] section; pyproject.toml gained [tool.semantic_release] config and was stripped of heavy runtime deps from [build-system] requires.