[BUG] Fixes sbd distances for multivariate case #2742
base: main
Conversation
Thank you for contributing to aeon.
Equivalence between the original implementation, tslearn's implementation, and the fixed implementation for the multivariate and univariate case
Reply to #2661 (comment): Distance measures depend on the univariate helper functions. For example, the Euclidean distance:

```python
@njit(cache=True, fastmath=True)
def _univariate_euclidean_distance(x: np.ndarray, y: np.ndarray) -> float:
    return np.sqrt(_univariate_squared_distance(x, y))
```
Even though currently we are simply ignoring the other channels in all the distance measures (this is the case for at least the ones mentioned above), would it be better to add warnings for this?
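Purely as an illustration of what such a warning could look like (the helper name, condition, and message below are hypothetical, not existing aeon code):

```python
import warnings


def _warn_channels_ignored(n_channels_used: int, n_channels_given: int) -> None:
    # Hypothetical helper: warn when a distance uses fewer channels than it
    # was given, so silently ignored channels do not go unnoticed.
    if n_channels_used < n_channels_given:
        warnings.warn(
            f"Only {n_channels_used} of {n_channels_given} channels are used "
            "by this distance; the remaining channels are ignored.",
            UserWarning,
            stacklevel=2,
        )
```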
I have added the code for the benchmarks here: https://github.com/tanishy7777/aeon/tree/main/benchmarks. Important note for the comparison: I removed Numba acceleration for the aeon functions. The benchmarks test the various cases for the following functions:
I tested the following cases:
The red line in the plots denotes the …
As the implementations will be used with the Numba JIT turned on in most cases, I would be more interested in the runtimes and scaling behavior of the aeon implementations with Numba. You do not need to include … Could you also include the implementation idea in #2715 in the benchmark? Especially the different usage of …
Thanks for the feedback, will enable Numba and compare with #2715. I have a couple of assignments and project presentations lined up this weekend in college, but will add it soon.
@SebastianSchmidl Changes:
Observations:
Please let me know if this looks good or if any changes are needed. Thank you!
Sorry to ask again for more, but could you scale the experiment further (larger time series)? According to the plots, all runs finish in under 1s. Interesting that #2715 is faster than your proposal for one channel, but scales worse with the number of channels 🤔
Yeah, sure, will add them. Actually, for 2 channels my fix works better than #2715. Also, another thing: for the comparison across channels I am using a series length of 1000. That is why the aeon implementation is so close. But if you look at the 2nd plot, it diverges for the 2-channel case for longer time series. I am guessing here, but I think the same divergent behaviour will happen if I keep, let's say, time series length = 15000. Will have to check this though.
Yes, I meant the single channel experiment.
Btw, I edited my message (added a paragraph) in case you missed it.
It is not obvious how to best benchmark this because we have two scaling dimensions (length of time series and number of channels), which are not completely independent. You could create a full matrix of tests (all combinations of lengths and channels in the search range), but this could take a long time and is harder to plot. Maybe we just need to increase the number of experiments, e.g. 5 length-scaling experiments with fixed #channels = [1, 5, 10, 100, 1000] and 5 channel-scaling experiments with fixed lengths = [1000, 5000, 10000, 50000, 100000], or similar.
Yep that makes sense, will do this |
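A minimal sketch of how such an experiment grid could be set up, assuming `aeon.distances.sbd_distance` as the function under test and purely illustrative value grids:

```python
import timeit

import numpy as np
from aeon.distances import sbd_distance  # assumed function under test

rng = np.random.default_rng(42)


def time_sbd(n_channels: int, length: int, repeats: int = 5) -> float:
    """Average runtime for one (n_channels, length) setting."""
    x = rng.normal(size=(n_channels, length))
    y = rng.normal(size=(n_channels, length))
    sbd_distance(x, y)  # warm-up call so Numba compilation time is excluded
    return timeit.timeit(lambda: sbd_distance(x, y), number=repeats) / repeats


# Length-scaling experiments, one per fixed number of channels.
for n_channels in [1, 5, 10, 100, 1000]:
    for length in [1000, 5000, 10000, 50000]:
        print("channels", n_channels, "length", length, time_sbd(n_channels, length))

# Channel-scaling experiments, one per fixed series length.
for length in [1000, 5000, 10000, 50000, 100000]:
    for n_channels in [1, 10, 100, 1000]:
        print("length", length, "channels", n_channels, time_sbd(n_channels, length))
```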
Just catching up really. Is the main issue here scalability or? @TonyBagnall @chrisholder possibly some issues re: the unequal stuff and results comparison
> Even though currently we are simply ignoring the other channels in all the distance measures (this is the case for at least the ones mentioned above). Would it be better to add warnings for this?
I would open a separate issue on this.
```diff
+if x.shape[0] != y.shape[0]:
+    raise ValueError("x and y must have the same number of channels ")
+nchannels = x.shape[0]  # both x and y have the same number of channels
+norm = np.linalg.norm(x.astype(np.float64)) * np.linalg.norm(
+    y.astype(np.float64)
+)
+distance = np.zeros((2 * x.shape[1] - 1,))
 for i in range(nchannels):
-    distance += _univariate_sbd_distance(x[i], y[i], standardize)
-return distance / nchannels
+    distance += _helper_sbd(x[i], y[i], standardize)
+return np.abs(1 - np.max(distance) / norm)
```
Do other distances have this in a separate function like the univariate one?
```python
with objmode(a="float64[:]"):
    a = correlate(x, y, method="fft")
```
not really a fan of this but it is the current functionality. Maybe worth an issue.
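For context, a minimal sketch of what the object-mode escape to SciPy's FFT-based correlation can look like inside a jitted helper; the name `_helper_sbd` follows the diff above, but the body here is an illustrative assumption rather than the PR's exact code:

```python
import numpy as np
from numba import njit, objmode
from scipy.signal import correlate


@njit(fastmath=True)
def _helper_sbd(x: np.ndarray, y: np.ndarray, standardize: bool) -> np.ndarray:
    # Per-channel cross-correlation used by the multivariate SBD sum (sketch).
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    if standardize:
        x = (x - np.mean(x)) / np.std(x)
        y = (y - np.mean(y)) / np.std(y)
    with objmode(a="float64[:]"):
        # scipy.signal.correlate is not Numba-compilable, so drop to object
        # mode for the FFT-based full cross-correlation.
        a = correlate(x, y, method="fft")
    return a
```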
```diff
@@ -115,7 +115,7 @@
 0.6617308353925114,
 0.6617308353925114,
 0.5750093257763462,
-0.5263609881742105,
+None,
```
Why is this None? The comment in the file does not seem to line up with the values seen.
The main issue was that the definition of the SBD distance included a multivariate version that was not implemented in aeon. aeon just computed the SBD for each channel independently and then combined the results (we do this for other distances as well). However, multiple PRs (this one and #2715) were opened to address this issue. I suggested comparing their performance and integrating the better-performing implementation into aeon.
There seem to be no replies on the other PR. I suggest just closing it unless its contents are better.
This version also seems to perform better than #2715, but let's wait for the new numbers from @tanishy7777. Once the issues you mentioned are addressed, I would also merge this one in favor of #2715 (I already added a dependency to the issue description for this).
Reference Issues/PRs
Fixes #2674
Closes #2715
As mentioned in #2661 (comment) and #2661 (comment), this fixes the SBD distance for the multivariate case.
What does this implement/fix? Explain your changes.
Currently, the implementation of the SBD distance in aeon handles the multivariate case slightly differently from the official version (https://github.com/TheDatumOrg/kshape-python) and from tslearn's implementation.
Also, I set `unequal_length=False` (I need input from reviewers regarding this, as the SBD distance (official and tslearn versions) supports unequal-length time series but does not support unequal channels, while the tests in aeon in `test_distances` also check for unequal channels; more details in #2661 (comment)).
Differences (see the sketch below):
- sbd_distance: finds the distance for each channel independently and then takes its average.
- normalized_cc (from tslearn): finds the cross-correlations for each of the channels, sums them, takes the max of this sum, and then normalizes it using the norm of the entire multivariate series.
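For illustration, a minimal NumPy sketch of the two formulations; the function names `sbd_per_channel_average` and `sbd_joint_normalized` are made up for this example, z-normalisation is omitted for brevity, and the joint version follows the tslearn/k-Shape formulation described above:

```python
import numpy as np
from scipy.signal import correlate


def sbd_per_channel_average(x: np.ndarray, y: np.ndarray) -> float:
    """Previous aeon behaviour: average the univariate SBD over channels."""
    dists = []
    for xc, yc in zip(x, y):
        cc = correlate(xc, yc, method="fft")
        dists.append(1.0 - np.max(cc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))
    return float(np.mean(dists))


def sbd_joint_normalized(x: np.ndarray, y: np.ndarray) -> float:
    """tslearn/k-Shape style: sum the per-channel cross-correlations, take the
    max of the sum, and normalise once by the norms of the whole series."""
    cc = np.zeros(2 * x.shape[1] - 1)
    for xc, yc in zip(x, y):
        cc += correlate(xc, yc, method="fft")
    norm = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.abs(1.0 - np.max(cc) / norm))


x = np.random.default_rng(0).normal(size=(3, 100))  # 3 channels, length 100
y = np.random.default_rng(1).normal(size=(3, 100))
print(sbd_per_channel_average(x, y), sbd_joint_normalized(x, y))
```

For multivariate input the two formulations generally give different values.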
Does your contribution introduce a new dependency? If yes, which one?
Any other comments?
PR checklist
For all contributions
For new estimators and functions
I have added myself as a `__maintainer__` at the top of relevant files and want to be contacted regarding its maintenance. Unmaintained files may be removed. This is for the full file, and you should not add yourself if you are just making minor changes or do not want to help maintain its contents.
For developers with write access