Reimplemented Pearson's correlation to use two-pass Welford's model #62750
base: main
Conversation
Can you run the benchmarks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for running the benchmarks. I think the current implementation will lead to problems due to accumulation of roundoff errors on big datasets.
The problem stems from the co-moment calculations at large/small values and the asymmetric nature of Welford's algorithm.
For me, the problem is the catastrophic cancellation when computing 116.80000305175781 - 116.8000030517578.
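To make the cancellation concrete, here is a minimal illustration (not from the PR; the two literals are the values quoted above):

```python
import math

a = 116.80000305175781
b = 116.8000030517578   # differs from a by 1e-14 in decimal

print(a - b)        # a multiple of the ulp below, not the decimal 1e-14
print(math.ulp(a))  # ~1.42e-14: spacing of adjacent doubles near 116.8
# The subtraction cannot resolve a difference smaller than one ulp at
# this magnitude, so nearly all significant digits of the true
# difference are lost in the result.
```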
vx = mat[i, xi] - meanx
vy = mat[i, yi] - meany
One of the reasons Welford's algorithm is considered a "stable" algorithm is that it mitigates cancellation in updates like these.
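For reference, a minimal sketch of the standard Welford-style online updates (the form given on the Wikipedia page linked in the diff below; variable names mirror the quoted snippets, but this is illustrative, not the pandas Cython code, and has no NaN handling):

```python
def welford_online(xs, ys):
    """Online mean, sum-of-squared-deviations, and co-moment updates."""
    nobs = 0
    meanx = meany = 0.0
    ssqdmx = ssqdmy = covxy = 0.0
    for x, y in zip(xs, ys):
        nobs += 1
        dx = x - meanx              # deviation from the *old* mean
        dy = y - meany
        meanx += dx / nobs
        meany += dy / nobs
        ssqdmx += dx * (x - meanx)  # old deviation * new deviation
        ssqdmy += dy * (y - meany)
        covxy += dx * (y - meany)   # co-moment uses the *new* meany
    return meanx, meany, ssqdmx, ssqdmy, covxy
```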
The problem with the current implementation is the accumulation of round-off errors. Kahan summation can mitigate this issue.
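For reference, a minimal sketch of the Kahan compensated summation being suggested (illustrative only; in pandas this would live in the Cython layer):

```python
def kahan_sum(values):
    """Compensated summation: tracks and re-adds lost low-order bits."""
    total = 0.0
    c = 0.0                   # running compensation for lost precision
    for v in values:
        y = v - c             # subtract the previously lost error
        t = total + y         # low-order bits of y may be lost here
        c = (t - total) - y   # recover exactly what was lost
        total = t
    return total
```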
- :func:`read_spss` now supports kwargs to be passed to pyreadstat (:issue:`56356`)
- :func:`read_stata` now returns ``datetime64`` resolutions better matching those natively stored in the stata format (:issue:`55642`)
- :meth:`DataFrame.agg` called with ``axis=1`` and a ``func`` which relabels the result index now raises a ``NotImplementedError`` (:issue:`58807`).
- :meth:`DataFrame.corr` now uses two pass Welford's Method to improve numerical stability with precision for very large/small values (:issue:`59652`)
I don't think "two pass Welford's Method" exists. You are simply using the standard two-pass algorithm.
Must have conflated the two, will update
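For context, a minimal sketch of the plain two-pass Pearson correlation being referred to (illustrative only; the PR's actual Cython code works on a matrix and handles missing values):

```python
import math

def pearson_two_pass(xs, ys):
    """Two-pass Pearson correlation: means first, then centered moments.

    Assumes equal-length, non-constant inputs with no NaNs.
    """
    n = len(xs)
    meanx = sum(xs) / n          # first pass: compute the means
    meany = sum(ys) / n
    covxy = ssdx = ssdy = 0.0
    for x, y in zip(xs, ys):     # second pass: centered products
        dx = x - meanx
        dy = y - meany
        covxy += dx * dy
        ssdx += dx * dx
        ssdy += dy * dy
    return covxy / math.sqrt(ssdx * ssdy)
```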
# Welford's method for the variance-calculation
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
nobs = ssqdmx = ssqdmy = covxy = meanx = meany = 0
# Changed to Welford's two-pass for improved numeric stability
This comment is misleading.
I have thought of something that could help with performance: we can set a precision threshold (10^-14 or 10^14) and either normalize values or apply two-pass when the threshold is exceeded. That way we wouldn't lose performance unnecessarily. Will implement it later.
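A rough sketch of how that dispatch idea could look (entirely hypothetical: the function names, the scale check, and the 1e14 threshold are illustrative, not part of the PR; `pearson_two_pass` is the sketch shown earlier and `pearson_single_pass` stands in for the existing fast path):

```python
import numpy as np

def corr_with_fallback(x, y, threshold=1e14):
    """Use the fast single-pass path for well-scaled data and fall back
    to the stable two-pass path when magnitudes leave the safe range."""
    scale = max(np.abs(x).max(), np.abs(y).max())
    if scale > threshold or 0 < scale < 1 / threshold:
        return pearson_two_pass(x, y)    # stable but slower path
    return pearson_single_pass(x, y)     # hypothetical fast path
```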
This improves numerical stability for values that are really large or small, such as the example given in the original issue.