Feat/bugfixes tests cicd #9

Merged

AmitaiYacobi merged 10 commits into main from feat/bugfixes-tests-cicd on Mar 7, 2026

Conversation

@AmitaiYacobi (Collaborator)

This PR is a comprehensive overhaul of the SpectralNet codebase covering bug fixes, scalability, testing, tooling, and CI/CD.

Bug fixes — four correctness bugs were found and fixed: get_random_batch sampled indices only within [0, batch_size) instead of the full dataset; SpectralNetModel.orthonorm_weights was never initialised in __init__, causing an AttributeError on the first inference call; get_nearest_neighbors always queried X against itself, silently ignoring the Y parameter; and both AETrainer and SiameseTrainer used hardcoded relative paths for weight files that broke when called from any directory other than the project root.

Scalability — the most fundamental architectural change: fit() and predict() now accept a torch.utils.data.Dataset in addition to a torch.Tensor, enabling true out-of-core training on datasets (e.g. 10M images on disk) that cannot fit in RAM. A _FeatureDataset adapter normalises both inputs into a consistent (x, y) format; every trainer creates its own DataLoader internally so only one mini-batch is ever in memory at once. Separately, AETrainer.embed() and predict() were made chunk-based to avoid GPU OOM on large inputs, and get_random_batch() was fixed to encode only the sampled batch instead of the entire dataset.
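The adapter idea above can be sketched as follows. This is an illustrative stand-in for the PR's _FeatureDataset (the class name, flattening behaviour, and -1 placeholder label here are assumptions, not the actual implementation):

```python
import torch
from torch.utils.data import Dataset

class FeatureDataset(Dataset):
    """Sketch of an adapter that normalises both an in-memory tensor and a
    user-supplied Dataset into consistent (x_flat, y) tuples."""

    def __init__(self, X, y=None):
        self.X = X
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if isinstance(self.X, torch.Tensor):
            x = self.X[idx]
        else:
            item = self.X[idx]
            # user Datasets may already yield (x, y) pairs
            x = item[0] if isinstance(item, tuple) else item
        x = x.reshape(-1)  # flatten each sample to a feature vector
        label = self.y[idx] if self.y is not None else torch.tensor(-1)
        return x, label
```

With an adapter like this, each trainer can build its own DataLoader over the wrapped object, so at most one mini-batch is resident in memory at a time regardless of whether the input was a tensor or a disk-backed Dataset.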

Performance — the O(n) Python loop that built the nearest-neighbour mask in get_gaussian_kernel and get_t_kernel was replaced with a vectorised tensor scatter using repeat_interleave, and a redundant import numpy and nested with torch.no_grad() block were removed.

Code quality — wildcard imports (from spectralnet._utils import *, from ._utils import *) were replaced with explicit relative imports; _spectralnet_trainer.py now uses a proper relative import; dead code (create_weights_dir) was deleted; and a wrong docstring ("Cholesky decomposition") was corrected to "QR decomposition".

Testing — 74 tests were added across five modules: test_models.py, test_losses.py, test_utils.py, test_metrics.py, and test_spectralnet.py (end-to-end), all passing with zero warnings.

CI/CD — a GitHub Actions CI workflow runs the test suite on Python 3.11 and 3.12 in parallel on every push and PR; a release workflow uses python-semantic-release to automatically bump the version in setup.cfg, create a GitHub release, and publish to PyPI via OIDC Trusted Publisher based on conventional commit prefixes (fix: → patch, feat: → minor, BREAKING CHANGE → major).

Developer tooling — a pixi.toml and lock file were added for a fully reproducible conda+PyPI environment (pixi install && pixi run test); setup.cfg gained a [dev] extras group and a [tool:pytest] section; pyproject.toml gained [tool.semantic_release] config and was stripped of heavy runtime deps from [build-system] requires.

Amitai added 10 commits March 6, 2026 17:12
- _cluster.py: get_random_batch sampled indices in [0, batch_size)
  instead of the full dataset; now uses randperm(n)[:batch_size]
- _spectralnet_model.py: orthonorm_weights was never initialised in
  __init__, causing AttributeError on the first forward pass when
  should_update_orth_weights=False; initialise to None and auto-compute
  on first call
- _utils.py: get_nearest_neighbors always queried X against itself,
  ignoring the Y parameter; fix to use kneighbors(Y_np)
- _trainers: AETrainer and SiameseTrainer used hardcoded relative paths
  for weight files that broke when called from any directory other than
  the project root; use __file__-relative paths with os.makedirs
Add five test modules covering all major components:
- test_models.py (15): SpectralNetModel orthonormality, layer types,
  SiameseNet shared weights, AEModel encoder/decoder symmetry
- test_losses.py (10): SpectralNetLoss and ContrastiveLoss properties —
  scalar output, non-negativity, zero-affinity, perfect-cluster, margin
- test_utils.py (29): Laplacian (row sums, PSD, symmetry), nearest
  neighbours (shapes, Y param), scale, Gaussian/t-kernels, Grassmann
  distance, cost matrix
- test_metrics.py (7): ACC and NMI correctness
- test_spectralnet.py (13): end-to-end fit/predict/transform on blobs

Add conftest.py global seed fixture for deterministic results.
ci.yml: run pytest on Python 3.11 and 3.12 in parallel on every push
and PR to main; upload test results as artifacts.

release.yml: on push to main, run python-semantic-release which reads
conventional commit prefixes to decide the version bump (fix→patch,
feat→minor, BREAKING CHANGE→major), creates a git tag and GitHub
release, then builds and publishes to PyPI via OIDC Trusted Publisher
(no API token required).
- pixi.toml + pixi.lock: reproducible conda/PyPI environment via pixi;
  defines tasks 'test' and 'test-fast' for quick iteration
- pyproject.toml: remove heavy runtime deps from [build-system] requires
  (only setuptools + wheel needed at build time); add
  [tool.semantic_release] config pointing version bump at setup.cfg
- setup.cfg: add [options.extras_require] dev group (pytest, pytest-cov)
  and [tool:pytest] section so plain 'pytest' works from any directory
- _spectralnet_model.py: remove unused numpy import; replace np.sqrt(m)
  with plain m**0.5; fix docstring "Cholesky" → "QR decomposition"
- _spectralnet_trainer.py: replace absolute wildcard
  'from spectralnet._utils import *' with an explicit relative import of
  the four symbols actually used; remove redundant nested
  'with torch.no_grad()' inside validate() (already guarded by the
  outer context manager)
- _reduction.py: replace 'from ._utils import *' with explicit import
  of the three symbols used (get_affinity_matrix, get_laplacian,
  plot_laplacian_eigenvectors)
- _utils.py: delete dead create_weights_dir() function (trainers now
  use __file__-relative paths with os.makedirs)
Add a "From source (with pixi)" section under Installation explaining
how to install pixi, clone the repo, install the environment with
'pixi install', and run the test suite with 'pixi run test'.
Three scalability fixes for large datasets:

1. get_random_batch() (_cluster.py): previously encoded the *entire*
   dataset through the AE and Siamese network just to return a small
   random batch — O(N) wasted work. Now selects indices first and
   encodes only the sampled batch, wrapped in torch.no_grad().

2. predict() (_cluster.py): previously moved the full X to GPU and ran
   AE + SpectralNet in a single forward pass, causing OOM on large
   inputs. Now iterates in chunks of spectral_batch_size, accumulating
   CPU tensors, then concatenates for k-means.

3. AETrainer.embed() (_ae_trainer.py): previously called
   ae_net.encode(X.to(device)) on the whole dataset at once. Now
   iterates in chunks of batch_size, keeps intermediate results on CPU,
   and returns a concatenated CPU tensor that DataLoaders in downstream
   trainers can page to the device batch-by-batch.
Replaces the Tensor-only fit/predict API with a Dataset-first design so
that arbitrarily large datasets stored on disk (e.g. 10M images) can be
processed without ever loading them all into memory at once.

_cluster.py — new _FeatureDataset adapter class
  A thin Dataset wrapper that normalises both in-memory torch.Tensors and
  any user-defined disk-streaming Dataset into consistent (x_flat, y)
  tuples.  fit() and predict() accept Union[Tensor, Dataset/DataLoader];
  Tensor inputs are wrapped automatically for full backward compatibility.

fit(X, y=None):
  - X can now be torch.Tensor (small data) or Dataset (large/disk data)
  - Stores self._dataset; all trainers receive the Dataset and create
    their own DataLoaders with the right batch size and shuffling

predict(X):
  - X can be Tensor or DataLoader; iterates in spectral_batch_size chunks

get_random_batch():
  - Creates a temporary shuffled DataLoader over self._dataset and takes
    the first batch — O(batch_size) memory, not O(N)

AETrainer:
  - train(dataset): derives input_dim from dataset[0], uses random_split
    directly on the Dataset; training loop extracts x from (x, y) tuples
  - embed(dataset) → TensorDataset: encodes via DataLoader, returns a
    new (encoded_x, y) TensorDataset on CPU for downstream trainers
  - _get_data_loader(): now calls random_split(self._dataset, ...)

SiameseTrainer:
  - train(dataset): iterates a DataLoader to materialise self.X for KNN;
    comment notes that a representative subset should be passed for very
    large datasets (exact KNN is inherently O(N²))

SpectralTrainer:
  - train(dataset, siamese_net): derives input_dim from dataset[0];
    _get_data_loader() calls random_split directly on the Dataset —
    no TensorDataset construction needed
- Split Usage section into "small datasets (in-memory tensor)" and
  "large datasets (streaming from disk)"
- Add a concrete ImageFolderDataset example showing how to define a
  custom Dataset for 10M+ images on disk, pass it to fit(), and stream
  predictions through a DataLoader
- Note the Siamese-training memory caveat and the two workarounds
  (use_approx=True, or pass a subset)
- Tighten the metrics example (f-string, y_np variable)
- Clean up the Running examples section
@AmitaiYacobi force-pushed the feat/bugfixes-tests-cicd branch from 98bef98 to 05e0b31 on March 7, 2026 18:04
@AmitaiYacobi merged commit f110c82 into main on March 7, 2026
2 checks passed
@AmitaiYacobi deleted the feat/bugfixes-tests-cicd branch on March 7, 2026 18:12