[BUG] Fixes sbd distances for multivariate case #2742


Open · wants to merge 6 commits into main

Conversation

@tanishy7777 (Contributor) commented Apr 13, 2025

Reference Issues/PRs

Fixes #2674
Closes #2715
As mentioned in #2661 (comment) and #2661 (comment)

fixes sbd distance for multivariate case

What does this implement/fix? Explain your changes.

Currently, aeon's implementation of the SBD distance handles the multivariate case slightly differently from the official version (https://github.com/TheDatumOrg/kshape-python) and from tslearn's implementation.

Also, I set unequal_length=False (I need input from reviewers on this: the SBD distance (official & tslearn versions) supports unequal-length time series but does not support unequal numbers of channels, whereas aeon's tests in test_distances also check unequal channels; more details in #2661 (comment)).

Differences:
  • sbd_distance (aeon): computes the distance for each channel independently and then takes the average.
  • normalized_cc (from tslearn): computes the cross-correlations for each channel, sums them, takes the max of this sum, and normalizes it using the norm of the entire multivariate series (see the sketch below).
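For illustration, here is a minimal NumPy/SciPy sketch of the two behaviours (function and variable names are mine for illustration, not the actual aeon/tslearn code):

import numpy as np
from scipy.signal import correlate

def sbd_channel_averaged(x, y):
    # Current aeon behaviour: univariate SBD per channel, then the mean.
    dists = []
    for xc, yc in zip(x, y):  # x, y have shape (n_channels, n_timepoints)
        cc = correlate(xc, yc, method="fft")
        dists.append(1.0 - np.max(cc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))
    return float(np.mean(dists))

def sbd_multivariate(x, y):
    # tslearn-style behaviour: sum the per-channel cross-correlations, take the
    # max of the sum, and normalize by the norms of the whole multivariate series.
    cc_sum = np.zeros(x.shape[1] + y.shape[1] - 1)
    for xc, yc in zip(x, y):
        cc_sum += correlate(xc, yc, method="fft")
    return float(np.abs(1.0 - np.max(cc_sum) / (np.linalg.norm(x) * np.linalg.norm(y))))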

Does your contribution introduce a new dependency? If yes, which one?

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors. Alternatively, you can use the @all-contributors bot to do this for you after the PR has been merged.
  • The PR title starts with either [ENH], [MNT], [DOC], [BUG], [REF], [DEP] or [GOV] indicating whether the PR topic is related to enhancement, maintenance, documentation, bugs, refactoring, deprecation or governance.
For new estimators and functions
  • I've added the estimator/function to the online API documentation.
  • (OPTIONAL) I've added myself as a __maintainer__ at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access
  • (OPTIONAL) I've updated aeon's CODEOWNERS to receive notifications about future changes to these files.

@aeon-actions-bot added the bug (Something isn't working), distances (Distances package), and testing (Testing related issue or pull request) labels on Apr 13, 2025
@aeon-actions-bot (Contributor)

Thank you for contributing to aeon

I have added the following labels to this PR based on the title: [ bug ].
I have added the following labels to this PR based on the changes made: [ distances, testing ]. Feel free to change these if they do not properly represent the PR.

The Checks tab will show the status of our automated tests. You can click on individual test runs in the tab or "Details" in the panel below to see more information if there is a failure.

If our pre-commit code quality check fails, any trivial fixes will automatically be pushed to your PR unless it is a draft.

Don't hesitate to ask questions on the aeon Slack channel if you have any.

PR CI actions

These checkboxes will add labels to enable/disable CI functionality for this PR. This may not take effect immediately, and a new commit may be required to run the new configuration.

  • Run pre-commit checks for all files
  • Run mypy typecheck tests
  • Run all pytest tests and configurations
  • Run all notebook example tests
  • Run numba-disabled codecov tests
  • Stop automatic pre-commit fixes (always disabled for drafts)
  • Disable numba cache loading
  • Push an empty commit to re-run CI checks

@tanishy7777 (Contributor, Author) commented Apr 13, 2025

Equivalence between the original implementation, tslearn's implementation, and the fixed implementation for the multivariate and univariate cases:

[image: output of the equivalence check]

@tanishy7777 (Contributor, Author) commented Apr 13, 2025

Reply to #2661 (comment)

Distance measures that depend on the _dtw_distance function, like ddtw_distance and dtw_gi_distance, and also the adtw, erp, and edr distances, ignore the extra channels without any warning.

For example, in dtw_distance, for a given time step all the channels are passed to _univariate_squared_distance, which only iterates over the smaller number of channels and computes the distance between them.

[screenshots of the relevant aeon distance code]
where

@njit(cache=True, fastmath=True)
def _univariate_euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    return np.sqrt(_univariate_squared_distance(x, y))
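For reference, here is a simplified sketch of what such a univariate helper looks like (an approximation for illustration; the actual aeon code may differ slightly), which shows why extra channels are silently dropped when the channel counts differ:

import numpy as np
from numba import njit

@njit(cache=True, fastmath=True)
def _univariate_squared_distance_sketch(x: np.ndarray, y: np.ndarray) -> float:
    # Iterates only up to the shorter of the two inputs, so when x and y are the
    # per-timestep channel vectors of series with different channel counts, the
    # extra channels are silently ignored.
    min_length = min(x.shape[0], y.shape[0])
    dist = 0.0
    for i in range(min_length):
        diff = x[i] - y[i]
        dist += diff * diff
    return dist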

@tanishy7777 (Contributor, Author) commented Apr 13, 2025

Currently we are simply ignoring the extra channels in all the distance measures (this is the case for at least the ones mentioned above). Would it be better to add warnings for this?

@tanishy7777 (Contributor, Author) commented Apr 14, 2025

@SebastianSchmidl

I have added the code for the benchmarks here: https://github.com/tanishy7777/aeon/tree/main/benchmarks

Important note: for this comparison, I removed Numba acceleration from the aeon functions (i.e. sbd_distance_fix and sbd_distance), since the main idea of these benchmarks is to compare the algorithms, and the original and tslearn implementations don't use Numba.

The benchmarks test the various cases for the following functions:

  • sbd_distance (from aeon main branch)
  • sbd_distance_fix (from this PR)
  • sbd_original (from the original implementation)
  • sbd_tslearn (tslearn's implementation)

I tested the following cases:

  1. For univariate time series, varying the number of timesteps.
  2. For multivariate time series (2 channels), varying the number of timesteps.
  3. For a constant number of timesteps (1000), varying the number of channels.

The red line denotes the sbd_distance_fix function; it performs better than the original implementation in all cases (assuming I am not making an obvious mistake in the benchmark code itself that is causing this).

[benchmark plots: Figure_1_small, Figure_2_small, Figure_3_small]
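For reference, a minimal sketch of the kind of timing harness behind these numbers (illustrative only; the actual scripts live in the linked benchmarks directory):

import time
import numpy as np

def time_distance(dist_fn, n_channels, n_timepoints, repeats=5, seed=0):
    # Times one distance function on random series of the given shape and
    # returns the mean runtime over `repeats` calls.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_channels, n_timepoints))
    y = rng.normal(size=(n_channels, n_timepoints))
    dist_fn(x, y)  # warm-up call (matters once Numba JIT compilation is enabled)
    start = time.perf_counter()
    for _ in range(repeats):
        dist_fn(x, y)
    return (time.perf_counter() - start) / repeats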

@SebastianSchmidl (Member)

As the implementations will be used with the Numba JIT turned on in most cases, I would be more interested in the runtimes and scaling behavior of the aeon implementations with Numba. You do not need to include sbd_original and sbd_tslearn.

Could you also include the implementation idea in #2715 in the benchmark?

Especially the different usage of with objmode(cc="float64[:, :]"): with Numba could make a big difference in certain configurations.
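For context, a minimal sketch of what that objmode pattern looks like when wrapping SciPy's FFT-based correlate inside a Numba-compiled function (illustrative only; not necessarily the exact code in #2715):

import numpy as np
from numba import njit, objmode
from scipy.signal import correlate

@njit(cache=True)
def _cross_correlation(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # scipy.signal.correlate cannot be compiled by Numba, so it is executed in
    # object mode; the keyword annotation declares the dtype/shape of the result
    # (float64 inputs are assumed here).
    with objmode(cc="float64[:]"):
        cc = correlate(x, y, method="fft")
    return cc

The quoted cc="float64[:, :]" variant declares a 2-D result instead, e.g. when the correlations of all channels are computed in one call.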

@tanishy7777 (Contributor, Author) commented Apr 17, 2025

As the implementations will be used with the Numba JIT turned on in most cases, I would be more interested in the runtimes and scaling behavior of the aeon implementations with Numba. You do not need to include sbd_original and sbd_tslearn.

Could you also include the implementation idea in #2715 in the benchmark?

Especially the different usage of with objmode(cc="float64[:, :]"): with Numba could make a big difference in certain configurations.

Thanks for the feedback, I will enable Numba and compare with #2715. I have a couple of assignments and project presentations lined up this weekend at college, but will add it soon.

@tanishy7777 (Contributor, Author) commented Apr 24, 2025

@SebastianSchmidl
These are the new benchmarks; sorry for the delay, I have semester exams going on.
I have updated the code for them: https://github.com/tanishy7777/aeon/tree/main/benchmarks

Changes:

Observations:

  • For the univariate case: sbd_distance_fix is better than the original but worse than sbd_distance_pr2715 at higher time series lengths.
  • For the 2-channel case: sbd_distance_fix is the best among all the implementations for all time series lengths.
  • For a fixed time series length (1000 timepoints): sbd_distance_fix is equal to sbd_distance across different numbers of channels and is also better than sbd_distance_pr2715.

Please let me know if this looks good or if any changes are needed. Thank you!
[benchmark plots: Figure_1_final, Figure_2_final, Figure_3_final]

@SebastianSchmidl (Member) commented Apr 25, 2025

Sorry to ask again for more, but could you scale the experiment further (larger time series)? According to the plots, all runs finish in under 1s.

Interesting that #2715 is faster than your proposal for one channel, but scales worse with the number of channels 🤔

@tanishy7777 (Contributor, Author) commented Apr 25, 2025

Sorry to ask again for more, but could you scale the experiment further (larger time series)? According to the plots, all runs finish in under 1s.

Interesting that #2715 is faster than your proposal for two channels, but scales worse with the number of channels 🤔

Yeah, sure, I will add them. Actually, for 2 channels my fix works better than #2715.
Only for the univariate case is my fix worse than #2715, but it is still better than the current implementation in aeon.

Also, another thing: for the comparison across channels I am using a series length of 1000. That is why the aeon implementation is so close. But if you look at the 2nd plot, it diverges for the 2-channel case for longer time series. I am only guessing, but I think the same divergent behaviour will happen if I keep, let's say, a time series length of 15000. Will have to check this though.

@SebastianSchmidl (Member)

Yes, I meant the single channel experiment.

@tanishy7777 (Contributor, Author)

Yes, I meant the single channel experiment.

Btw, I edited my message (added a paragraph) in case you missed it.

@SebastianSchmidl (Member) commented Apr 25, 2025

It is not obvious how to best benchmark this, because we have two scaling dimensions (length of time series and number of channels), which are not completely independent. You could create a full matrix of tests (all combinations of lengths and channels in the search range), but this could take a long time and is harder to plot. Maybe we just need to increase the number of experiments, e.g. 5 length-scaling experiments with fixed #channels = [1, 5, 10, 100, 1000] and 5 channel-scaling experiments with fixed lengths = [1000, 5000, 10000, 50000, 100000], or similar
... still a lot to test
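A small sketch of how such an experiment grid could be generated (the fixed values are taken from the comment above; the sweep ranges are placeholders, not from the discussion):

# Length-scaling experiments: sweep the series length for each fixed channel count.
fixed_channels = [1, 5, 10, 100, 1000]
length_sweep = [1_000, 5_000, 10_000, 50_000, 100_000]  # placeholder sweep range

# Channel-scaling experiments: sweep the channel count for each fixed series length.
fixed_lengths = [1_000, 5_000, 10_000, 50_000, 100_000]
channel_sweep = [1, 2, 5, 10, 50, 100]  # placeholder sweep range

experiments = [("length-scaling", c, n) for c in fixed_channels for n in length_sweep]
experiments += [("channel-scaling", c, n) for n in fixed_lengths for c in channel_sweep]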

@tanishy7777 (Contributor, Author)

It is not obvious how to best benchmark this because we have two scaling dimensions (length of time series and number of channels), which are not completely independent. You could create a full matrix of tests (all combinations of lengths and channels in the search range), but this could take a long time and is harder to plot. Maybe, we just need to increase the number of experiments, e.g. 5 length-scaling experiments with fixed #channels = [1,5,10,100,1000] and 5 channel-scaling experiments with fixed lengths = [1000,5000,10000,50000,100000] or similar ... still a lot to test

Yep that makes sense, will do this

@MatthewMiddlehurst (Member) left a comment


Just catching up really. Is the main issue here scalability or? @TonyBagnall @chrisholder possibly some issues re: the unequal stuff and results comparison

Currently we are simply ignoring the extra channels in all the distance measures (this is the case for at least the ones mentioned above). Would it be better to add warnings for this?

I would open a separate issue on this.

Comment on lines +107 to +116
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of channels ")
    nchannels = x.shape[0]  # both x and y have the same number of channels
    norm = np.linalg.norm(x.astype(np.float64)) * np.linalg.norm(
        y.astype(np.float64)
    )
    distance = np.zeros((2 * x.shape[1] - 1,))
    for i in range(nchannels):
-        distance += _univariate_sbd_distance(x[i], y[i], standardize)
-    return distance / nchannels
+        distance += _helper_sbd(x[i], y[i], standardize)
+    return np.abs(1 - np.max(distance) / norm)
Member

Do other distances have this in a separate function like the univariate one?

Comment on lines +257 to +258
with objmode(a="float64[:]"):
a = correlate(x, y, method="fft")
Member
Not really a fan of this, but it is the current functionality. Maybe worth an issue.

@@ -115,7 +115,7 @@
     0.6617308353925114,
     0.6617308353925114,
     0.5750093257763462,
-    0.5263609881742105,
+    None,
Member
Why is this None? The comment in the file does not seem to line up with the values seen.

@SebastianSchmidl (Member)

Just catching up really. Is the main issue here scalability or? @TonyBagnall @chrisholder possibly some issues re: the unequal stuff and results comparison

The main issue was that the definition of the SBD distance included a multivariate version that was not implemented in aeon. aeon just computed the SBD for each channel independently and then combined the results (we do this for other distances as well).

However, multiple PRs (this one and #2715) were opened to address this issue. I suggested comparing their performance and integrating the better performing implementation into aeon.

@MatthewMiddlehurst (Member)

There seem to be no replies on the other PR. I suggest just closing it unless its contents are better.

@SebastianSchmidl (Member)

This version also seems to perform better than #2715, but let's wait for the new numbers from @tanishy7777. Once the issues you mentioned are addressed, I would also merge this one rather than #2715 (I already added a dependency to the issue description for this).

Development

Successfully merging this pull request may close these issues.

[BUG] Inconsistent Sbd distance with tslearn and other implementations
3 participants