Merged
added 10 commits on March 6, 2026 17:12
- _cluster.py: get_random_batch sampled indices in [0, batch_size) instead of the full dataset; now uses randperm(n)[:batch_size]
- _spectralnet_model.py: orthonorm_weights was never initialised in __init__, causing an AttributeError on the first forward pass when should_update_orth_weights=False; initialise to None and auto-compute on the first call
- _utils.py: get_nearest_neighbors always queried X against itself, ignoring the Y parameter; fixed to use kneighbors(Y_np)
- _trainers: AETrainer and SiameseTrainer used hardcoded relative paths for weight files that broke when called from any directory other than the project root; use __file__-relative paths with os.makedirs
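The sampling fix can be sketched as follows (a minimal illustration of the randperm pattern; the function signature here is simplified, the real method also encodes the batch through the networks):

```python
import torch

def get_random_batch(X: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Return a uniformly sampled batch of rows from X.

    The buggy version drew indices only from [0, batch_size), so it could
    never see rows beyond the first batch. The fix permutes all n indices
    and takes the first batch_size of them.
    """
    n = X.shape[0]
    indices = torch.randperm(n)[:batch_size]  # distinct indices over the full dataset
    return X[indices]
```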
Add five test modules covering all major components:
- test_models.py (15): SpectralNetModel orthonormality, layer types, SiameseNet shared weights, AEModel encoder/decoder symmetry
- test_losses.py (10): SpectralNetLoss and ContrastiveLoss properties: scalar output, non-negativity, zero-affinity, perfect-cluster, margin
- test_utils.py (29): Laplacian (row sums, PSD, symmetry), nearest neighbours (shapes, Y param), scale, Gaussian/t-kernels, Grassmann distance, cost matrix
- test_metrics.py (7): ACC and NMI correctness
- test_spectralnet.py (13): end-to-end fit/predict/transform on blobs

Add conftest.py with a global seed fixture for deterministic results.
- ci.yml: run pytest on Python 3.11 and 3.12 in parallel on every push and PR to main; upload test results as artifacts.
- release.yml: on push to main, run python-semantic-release, which reads conventional-commit prefixes to decide the version bump (fix → patch, feat → minor, BREAKING CHANGE → major), creates a git tag and GitHub release, then builds and publishes to PyPI via the OIDC Trusted Publisher (no API token required).
- pixi.toml + pixi.lock: reproducible conda/PyPI environment via pixi; defines tasks 'test' and 'test-fast' for quick iteration
- pyproject.toml: remove heavy runtime deps from [build-system] requires (only setuptools + wheel are needed at build time); add [tool.semantic_release] config pointing the version bump at setup.cfg
- setup.cfg: add an [options.extras_require] dev group (pytest, pytest-cov) and a [tool:pytest] section so plain 'pytest' works from any directory
- _spectralnet_model.py: remove unused numpy import; replace np.sqrt(m) with plain m**0.5; fix docstring "Cholesky" → "QR decomposition"
- _spectralnet_trainer.py: replace the absolute wildcard 'from spectralnet._utils import *' with an explicit relative import of the four symbols actually used; remove the redundant nested 'with torch.no_grad()' inside validate() (already guarded by the outer context manager)
- _reduction.py: replace 'from ._utils import *' with an explicit import of the three symbols used (get_affinity_matrix, get_laplacian, plot_laplacian_eigenvectors)
- _utils.py: delete the dead create_weights_dir() function (trainers now use __file__-relative paths with os.makedirs)
Add a "From source (with pixi)" section under Installation explaining how to install pixi, clone the repo, install the environment with 'pixi install', and run the test suite with 'pixi run test'.
Three scalability fixes for large datasets:
1. get_random_batch() (_cluster.py): previously encoded the *entire* dataset through the AE and Siamese network just to return a small random batch, O(N) wasted work. Now selects indices first and encodes only the sampled batch, wrapped in torch.no_grad().
2. predict() (_cluster.py): previously moved the full X to the GPU and ran the AE + SpectralNet in a single forward pass, causing OOM on large inputs. Now iterates in chunks of spectral_batch_size, accumulating CPU tensors, then concatenates for k-means.
3. AETrainer.embed() (_ae_trainer.py): previously called ae_net.encode(X.to(device)) on the whole dataset at once. Now iterates in chunks of batch_size, keeps intermediate results on CPU, and returns a concatenated CPU tensor that DataLoaders in downstream trainers can page to the device batch by batch.
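The chunked-inference pattern shared by fixes 2 and 3 can be sketched like this (an illustration of the pattern only; the real code lives inside the trainer and cluster methods and uses the library's own batch-size settings):

```python
import torch
from torch import nn


@torch.no_grad()
def embed_in_chunks(model: nn.Module, X: torch.Tensor,
                    batch_size: int, device: str = "cpu") -> torch.Tensor:
    """Run `model` over X one chunk at a time.

    Only one chunk is ever resident on `device`; each result is moved back
    to CPU immediately, so the full output never has to fit in GPU memory.
    """
    chunks = []
    for start in range(0, X.shape[0], batch_size):
        batch = X[start:start + batch_size].to(device)
        chunks.append(model(batch).cpu())
    return torch.cat(chunks, dim=0)
```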
Replaces the Tensor-only fit/predict API with a Dataset-first design so
that arbitrarily large datasets stored on disk (e.g. 10 M images) can be
processed without ever loading them all into memory at once.
_cluster.py — new _FeatureDataset adapter class
A thin Dataset wrapper that normalises both in-memory torch.Tensors and
any user-defined disk-streaming Dataset into consistent (x_flat, y)
tuples. fit() and predict() accept Union[Tensor, Dataset/DataLoader];
Tensor inputs are wrapped automatically for full backward compatibility.
fit(X, y=None):
- X can now be torch.Tensor (small data) or Dataset (large/disk data)
- Stores self._dataset; all trainers receive the Dataset and create
their own DataLoaders with the right batch size and shuffling
predict(X):
- X can be Tensor or DataLoader; iterates in spectral_batch_size chunks
get_random_batch():
- Creates a temporary shuffled DataLoader over self._dataset and takes
the first batch — O(batch_size) memory, not O(N)
AETrainer:
- train(dataset): derives input_dim from dataset[0], uses random_split
directly on the Dataset; training loop extracts x from (x, y) tuples
- embed(dataset) → TensorDataset: encodes via DataLoader, returns a
new (encoded_x, y) TensorDataset on CPU for downstream trainers
- _get_data_loader(): now calls random_split(self._dataset, ...)
SiameseTrainer:
- train(dataset): iterates a DataLoader to materialise self.X for KNN;
comment notes that a representative subset should be passed for very
large datasets (exact KNN is inherently O(N²))
SpectralTrainer:
- train(dataset, siamese_net): derives input_dim from dataset[0];
_get_data_loader() calls random_split directly on the Dataset —
no TensorDataset construction needed
- Split the Usage section into "small datasets (in-memory tensor)" and "large datasets (streaming from disk)"
- Add a concrete ImageFolderDataset example showing how to define a custom Dataset for 10 M+ images on disk, pass it to fit(), and stream predictions through a DataLoader
- Note the Siamese-training memory caveat and the two workarounds (use_approx=True, or pass a subset)
- Tighten the metrics example (f-string, y_np variable)
- Clean up the "Running examples" section
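A disk-streaming Dataset of the kind the README example describes could look like this (a hypothetical sketch using per-sample .pt files rather than images; the class name and storage format are illustrative, not the library's):

```python
import os

import torch
from torch.utils.data import Dataset


class DiskTensorDataset(Dataset):
    """Illustrative disk-streaming Dataset: one .pt file per sample.

    Samples are loaded lazily in __getitem__, so fit()/predict() never
    hold more than one mini-batch in memory at a time.
    """

    def __init__(self, directory: str):
        self._paths = sorted(
            os.path.join(directory, f)
            for f in os.listdir(directory)
            if f.endswith(".pt")
        )

    def __len__(self):
        return len(self._paths)

    def __getitem__(self, idx):
        x = torch.load(self._paths[idx])       # load only this sample
        return x.flatten(), torch.tensor(-1)   # unlabeled in this sketch
```

Such a Dataset would be passed straight to fit(), and predictions streamed through a plain DataLoader.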
98bef98 to 05e0b31
This PR is a comprehensive overhaul of the SpectralNet codebase covering bug fixes, scalability, testing, tooling, and CI/CD.
Bug fixes — four correctness bugs were found and fixed: get_random_batch sampled indices only within [0, batch_size) instead of the full dataset; SpectralNetModel.orthonorm_weights was never initialised in __init__, causing an AttributeError on the first inference call; get_nearest_neighbors always queried X against itself, silently ignoring the Y parameter; and both AETrainer and SiameseTrainer used hardcoded relative paths for weight files that broke when called from any directory other than the project root.
Scalability — the most fundamental architectural change: fit() and predict() now accept a torch.utils.data.Dataset in addition to a torch.Tensor, enabling true out-of-core training on datasets (e.g. 10M images on disk) that cannot fit in RAM. A _FeatureDataset adapter normalises both inputs into a consistent (x, y) format; every trainer creates its own DataLoader internally so only one mini-batch is ever in memory at once. Separately, AETrainer.embed() and predict() were made chunk-based to avoid GPU OOM on large inputs, and get_random_batch() was fixed to encode only the sampled batch instead of the entire dataset.
Performance — the O(n) Python loop that built the nearest-neighbour mask in get_gaussian_kernel and get_t_kernel was replaced with a vectorised tensor scatter using repeat_interleave; an unused numpy import and a redundant nested torch.no_grad() block were also removed.
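The mask vectorisation can be sketched as a before/after pair (helper names here are illustrative; the real code builds the mask inside the kernel functions):

```python
import torch


def knn_mask_loop(indices: torch.Tensor, n: int) -> torch.Tensor:
    """Old style: one Python iteration per row to mark the k nearest neighbours."""
    mask = torch.zeros(n, n)
    for i in range(n):
        mask[i, indices[i]] = 1.0
    return mask


def knn_mask_vectorised(indices: torch.Tensor, n: int) -> torch.Tensor:
    """Vectorised equivalent: a single advanced-indexing scatter, no Python loop."""
    k = indices.shape[1]
    rows = torch.arange(n).repeat_interleave(k)  # [0,0,...,0, 1,1,...,1, ...]
    mask = torch.zeros(n, n)
    mask[rows, indices.reshape(-1)] = 1.0
    return mask
```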
Code quality — wildcard imports (from spectralnet._utils import *, from ._utils import *) were replaced with explicit relative imports; _spectralnet_trainer.py now uses a proper relative import; dead code (create_weights_dir) was deleted; and a wrong docstring ("Cholesky decomposition") was corrected to "QR decomposition".
Testing — 74 tests were added across five modules: test_models.py, test_losses.py, test_utils.py, test_metrics.py, and test_spectralnet.py (end-to-end), all passing with zero warnings.
CI/CD — a GitHub Actions CI workflow runs the test suite on Python 3.11 and 3.12 in parallel on every push and PR; a release workflow uses python-semantic-release to automatically bump the version in setup.cfg, create a GitHub release, and publish to PyPI via OIDC Trusted Publisher based on conventional commit prefixes (fix: → patch, feat: → minor, BREAKING CHANGE → major).
Developer tooling — a pixi.toml and lock file were added for a fully reproducible conda+PyPI environment (pixi install && pixi run test); setup.cfg gained a [dev] extras group and a [tool:pytest] section; pyproject.toml gained [tool.semantic_release] config and was stripped of heavy runtime deps from [build-system] requires.