[BUG] Fixes sbd distances for multivariate case #2742


Open · wants to merge 6 commits into main

Conversation

@tanishy7777 (Contributor) commented Apr 13, 2025

Reference Issues/PRs

Fixes #2674
Closes #2715
As mentioned in #2661 (comment) and #2661 (comment)

fixes sbd distance for multivariate case

What does this implement/fix? Explain your changes.

Currently, aeon's implementation of the SBD distance handles the multivariate case slightly differently from the official version (https://github.com/TheDatumOrg/kshape-python) and from tslearn's implementation.

Also, I set unequal_length=False (I need input from reviewers on this: the SBD distance (official & tslearn versions) supports unequal-length time series but does not support unequal numbers of channels, whereas aeon's tests in test_distances also check unequal channels; more details in #2661 (comment)).

Differences:
  • sbd_distance (aeon): computes the distance for each channel independently and then takes the average.
  • normalized_cc (from tslearn): computes the cross-correlations for each channel, sums them, takes the max of this sum, and normalizes it using the norm of the entire multivariate series (see the sketch below).
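For illustration, here is a minimal NumPy/SciPy sketch of the two behaviours (function and variable names are mine for illustration, not the actual aeon/tslearn code):

import numpy as np
from scipy.signal import correlate

def sbd_channel_averaged(x, y):
    # Current aeon behaviour: univariate SBD per channel, then the mean.
    dists = []
    for xc, yc in zip(x, y):  # x, y have shape (n_channels, n_timepoints)
        cc = correlate(xc, yc, method="fft")
        dists.append(1.0 - np.max(cc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))
    return float(np.mean(dists))

def sbd_multivariate(x, y):
    # tslearn-style behaviour: sum the per-channel cross-correlations, take the
    # max of the sum, and normalize by the norms of the whole multivariate series.
    cc_sum = np.zeros(x.shape[1] + y.shape[1] - 1)
    for xc, yc in zip(x, y):
        cc_sum += correlate(xc, yc, method="fft")
    return float(np.abs(1.0 - np.max(cc_sum) / (np.linalg.norm(x) * np.linalg.norm(y))))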

Does your contribution introduce a new dependency? If yes, which one?

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors. Alternatively, you can use the @all-contributors bot to do this for you after the PR has been merged.
  • The PR title starts with either [ENH], [MNT], [DOC], [BUG], [REF], [DEP] or [GOV] indicating whether the PR topic is related to enhancement, maintenance, documentation, bugs, refactoring, deprecation or governance.
For new estimators and functions
  • I've added the estimator/function to the online API documentation.
  • (OPTIONAL) I've added myself as a __maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access
  • (OPTIONAL) I've updated aeon's CODEOWNERS to receive notifications about future changes to these files.

@aeon-actions-bot added the bug (Something isn't working), distances (Distances package), and testing (Testing related issue or pull request) labels on Apr 13, 2025
@aeon-actions-bot (Contributor)

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ bug ].
I have added the following labels to this PR based on the changes made: [ distances, testing ]. Feel free to change these if they do not properly represent the PR.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Push an empty commit to re-run CI checks

@tanishy7777 (Contributor, Author) commented Apr 13, 2025

Equivalence between the original implementation, tslearn's implementation, and the fixed implementation for the multivariate and univariate cases:

[image: output of the equivalence check]

@tanishy7777 (Contributor, Author) commented Apr 13, 2025

Reply to #2661 (comment)

Distance measures that depend on the _dtw_distance function, like ddtw_distance and dtw_gi_distance, and also the adtw, erp, and edr distances, ignore the extra channels without any warning.

For example, in dtw_distance, for a given time step all the channels are passed to _univariate_squared_distance, which only iterates over the smaller number of channels and computes the distance between them.

[screenshots of the relevant aeon distance code]
where

@njit(cache=True, fastmath=True)
def _univariate_euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    return np.sqrt(_univariate_squared_distance(x, y))
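For reference, here is a simplified sketch of what such a univariate helper looks like (an approximation for illustration; the actual aeon code may differ slightly), which shows why extra channels are silently dropped when the channel counts differ:

import numpy as np
from numba import njit

@njit(cache=True, fastmath=True)
def _univariate_squared_distance_sketch(x: np.ndarray, y: np.ndarray) -> float:
    # Iterates only up to the shorter of the two inputs, so when x and y are the
    # per-timestep channel vectors of series with different channel counts, the
    # extra channels are silently ignored.
    min_length = min(x.shape[0], y.shape[0])
    dist = 0.0
    for i in range(min_length):
        diff = x[i] - y[i]
        dist += diff * diff
    return dist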

@tanishy7777 (Contributor, Author) commented Apr 13, 2025

Currently we are simply ignoring the extra channels in all the distance measures (this is the case for at least the ones mentioned above). Would it be better to add warnings for this?

@tanishy7777 (Contributor, Author) commented Apr 14, 2025

@SebastianSchmidl

I have added the code for the benchmarks here: https://github.com/tanishy7777/aeon/tree/main/benchmarks

Important note: for this comparison, I removed Numba acceleration from the aeon functions (i.e. sbd_distance_fix and sbd_distance), since the main idea of these benchmarks is to compare the algorithms, and the original and tslearn implementations don't use Numba.

The benchmarks test the various cases for the following functions:

  • sbd_distance (from aeon main branch)
  • sbd_distance_fix (from this PR)
  • sbd_original (from the original implementation)
  • sbd_tslearn (tslearn's implementation)

I tested the following cases:

  1. For univariate time series, varying the number of timesteps.
  2. For multivariate time series (2 channels), varying the number of timesteps.
  3. For a constant number of timesteps (1000), varying the number of channels.

The red line denotes the sbd_distance_fix function; it performs better than the original implementation in all cases (assuming I am not making an obvious mistake in the benchmark code itself that is causing this).

[benchmark plots: Figure_1_small, Figure_2_small, Figure_3_small]
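For reference, a minimal sketch of the kind of timing harness behind these numbers (illustrative only; the actual scripts live in the linked benchmarks directory):

import time
import numpy as np

def time_distance(dist_fn, n_channels, n_timepoints, repeats=5, seed=0):
    # Times one distance function on random series of the given shape and
    # returns the mean runtime over `repeats` calls.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_channels, n_timepoints))
    y = rng.normal(size=(n_channels, n_timepoints))
    dist_fn(x, y)  # warm-up call (matters once Numba JIT compilation is enabled)
    start = time.perf_counter()
    for _ in range(repeats):
        dist_fn(x, y)
    return (time.perf_counter() - start) / repeats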

@SebastianSchmidl (Member)

As the implementations will be used with the Numba JIT turned on in most cases, I would be more interested in the runtimes and scaling behavior of the aeon implementations with Numba. You do not need to include sbd_original and sbd_tslearn.

Could you also include the implementation idea in #2715 in the benchmark?

Especially the different usage of with objmode(cc="float64[:, :]"): with Numba could make a big difference in certain configurations.
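For context, a minimal sketch of what that objmode pattern looks like when wrapping SciPy's FFT-based correlate inside a Numba-compiled function (illustrative only; not necessarily the exact code in #2715):

import numpy as np
from numba import njit, objmode
from scipy.signal import correlate

@njit(cache=True)
def _cross_correlation(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # scipy.signal.correlate cannot be compiled by Numba, so it is executed in
    # object mode; the keyword annotation declares the dtype/shape of the result
    # (float64 inputs are assumed here).
    with objmode(cc="float64[:]"):
        cc = correlate(x, y, method="fft")
    return cc

The quoted cc="float64[:, :]" variant declares a 2-D result instead, e.g. when the correlations of all channels are computed in one call.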

@tanishy7777 (Contributor, Author) commented Apr 17, 2025

As the implementations will be used with the Numba JIT turned on in most cases, I would be more interested in the runtimes and scaling behavior of the aeon implementations with Numba. You do not need to include sbd_original and sbd_tslearn.

Could you also include the implementation idea in #2715 in the benchmark?

Especially the different usage of with objmode(cc="float64[:, :]"): with Numba could make a big difference in certain configurations.

Thanks for the feedback, I will enable Numba and compare with #2715. I have a couple of assignments and project presentations lined up this weekend at college, but will add it soon.

@tanishy7777 (Contributor, Author) commented Apr 24, 2025

@SebastianSchmidl
These are the new benchmarks; sorry for the delay, I have semester exams going on.
I have updated the code for them: https://github.com/tanishy7777/aeon/tree/main/benchmarks

Changes:

Observations:

  • For the univariate case: sbd_distance_fix is better than the original but worse than sbd_distance_pr2715 at higher time series lengths.
  • For the 2-channel case: sbd_distance_fix is the best among all the implementations for all time series lengths.
  • For a fixed time series length (1000 timepoints): sbd_distance_fix is equal to sbd_distance across different numbers of channels and is also better than sbd_distance_pr2715.

Please let me know if this looks good or if any changes are needed. Thank you!
[benchmark plots: Figure_1_final, Figure_2_final, Figure_3_final]

@SebastianSchmidl (Member) commented Apr 25, 2025

Sorry to ask again for more, but could you scale the experiment further (larger time series)? According to the plots, all runs finish in under 1s.

Interesting that #2715 is faster than your proposal for one channel, but scales worse with the number of channels 🤔

@tanishy7777 (Contributor, Author) commented Apr 25, 2025

Sorry to ask again for more, but could you scale the experiment further (larger time series)? According to the plots, all runs finish in under 1s.

Interesting that #2715 is faster than your proposal for two channels, but scales worse with the number of channels 🤔

Yeah, sure, I will add them. Actually, for 2 channels my fix works better than #2715.
Only for the univariate case is my fix worse than #2715, but it is still better than the current implementation in aeon.

Also, another thing: for the comparison across channels I am using a series length of 1000. That is why the aeon implementation is so close. But if you look at the 2nd plot, it diverges for the 2-channel case for longer time series. I am only guessing, but I think the same divergent behaviour will happen if I keep, let's say, a time series length of 15000. Will have to check this though.

@SebastianSchmidl (Member)

Yes, I meant the single channel experiment.

@tanishy7777 (Contributor, Author)

Yes, I meant the single channel experiment.

Btw, I edited my message (added a paragraph) in case you missed it.

@SebastianSchmidl (Member) commented Apr 25, 2025

It is not obvious how to best benchmark this, because we have two scaling dimensions (length of time series and number of channels), which are not completely independent. You could create a full matrix of tests (all combinations of lengths and channels in the search range), but this could take a long time and is harder to plot. Maybe we just need to increase the number of experiments, e.g. 5 length-scaling experiments with fixed #channels = [1, 5, 10, 100, 1000] and 5 channel-scaling experiments with fixed lengths = [1000, 5000, 10000, 50000, 100000], or similar
... still a lot to test
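A small sketch of how such an experiment grid could be generated (the fixed values are taken from the comment above; the sweep ranges are placeholders, not from the discussion):

# Length-scaling experiments: sweep the series length for each fixed channel count.
fixed_channels = [1, 5, 10, 100, 1000]
length_sweep = [1_000, 5_000, 10_000, 50_000, 100_000]  # placeholder sweep range

# Channel-scaling experiments: sweep the channel count for each fixed series length.
fixed_lengths = [1_000, 5_000, 10_000, 50_000, 100_000]
channel_sweep = [1, 2, 5, 10, 50, 100]  # placeholder sweep range

experiments = [("length-scaling", c, n) for c in fixed_channels for n in length_sweep]
experiments += [("channel-scaling", c, n) for n in fixed_lengths for c in channel_sweep]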

@tanishy7777 (Contributor, Author)

It is not obvious how to best benchmark this because we have two scaling dimensions (length of time series and number of channels), which are not completely independent. You could create a full matrix of tests (all combinations of lengths and channels in the search range), but this could take a long time and is harder to plot. Maybe, we just need to increase the number of experiments, e.g. 5 length-scaling experiments with fixed #channels = [1,5,10,100,1000] and 5 channel-scaling experiments with fixed lengths = [1000,5000,10000,50000,100000] or similar ... still a lot to test

Yep that makes sense, will do this

@MatthewMiddlehurst (Member) left a comment


Just catching up really. Is the main issue here scalability or? @TonyBagnall @chrisholder possibly some issues re: the unequal stuff and results comparison

Currently we are simply ignoring the extra channels in all the distance measures (this is the case for at least the ones mentioned above). Would it be better to add warnings for this?

I would open a separate issue on this.

Comment on lines +107 to +116
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of channels ")
    nchannels = x.shape[0]  # both x and y have the same number of channels
    norm = np.linalg.norm(x.astype(np.float64)) * np.linalg.norm(
        y.astype(np.float64)
    )
    distance = np.zeros((2 * x.shape[1] - 1,))
    for i in range(nchannels):
-        distance += _univariate_sbd_distance(x[i], y[i], standardize)
-    return distance / nchannels
+        distance += _helper_sbd(x[i], y[i], standardize)
+    return np.abs(1 - np.max(distance) / norm)
Member

Do other distances have this in a separate function like the univariate one?

Comment on lines +257 to +258
with objmode(a="float64[:]"):
a = correlate(x, y, method="fft")
Member
Not really a fan of this, but it is the current functionality. Maybe worth an issue.

@@ -115,7 +115,7 @@
     0.6617308353925114,
     0.6617308353925114,
     0.5750093257763462,
-    0.5263609881742105,
+    None,
Member
Why is this None? The comment in the file does not seem to line up with the values seen.

@SebastianSchmidl (Member)

Just catching up really. Is the main issue here scalability or? @TonyBagnall @chrisholder possibly some issues re: the unequal stuff and results comparison

The main issue was that the definition of the SBD distance included a multivariate version that was not implemented in aeon. aeon just computed the SBD for each channel independently and then combined the results (we do this for other distances as well).

However, multiple PRs (this one and #2715) were opened to address this issue. I suggested comparing their performance and integrating the better performing implementation into aeon.

@MatthewMiddlehurst (Member)

There seem to be no replies on the other PR. I suggest just closing it unless its contents are better.

@SebastianSchmidl (Member)

This version also seems to perform better than #2715, but let's wait for the new numbers from @tanishy7777. Once the issues you mentioned are addressed, I would also merge this one rather than #2715 (I already added a dependency to the issue description for this).

Development

Successfully merging this pull request may close these issues.

[BUG] Inconsistent Sbd distance with tslearn and other implementations
3 participants