Conversation

@eicchen (Contributor) commented Oct 19, 2025

This improves numerical stability for very large or very small values, such as the example given in the original issue.

@Alvaro-Kothe (Member) left a comment

Can you run the benchmarks?

@eicchen (Contributor, Author) commented Oct 20, 2025

[screenshot: benchmark results]

I've already run the relevant benchmarks. Unsurprisingly, we are looking at a performance decrease in stat_ops.Correlation. I've looked into other ways of solving the issue while keeping the online Welford algorithm.

[screenshots: the three computed co-moment values for the test case]

The problem stems from the co-moment calculation at large/small values and the asymmetric nature of Welford's algorithm. The three values above are mathematically the same; however, when computed on the values provided in the test case, two are always correct and one is not.

We could pick the value from the pair that matches as a redundancy measure. In theory this can only reduce error compared to the current version, since the values should be equal, but without a larger test pool I wouldn't be confident enough to put it in a release. Two-pass provides the best numerical stability, at the cost of performance.
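
For concreteness, here is a minimal sketch of the three algebraically equivalent co-moment updates in a streaming (Welford-style) covariance. The names and data handling are illustrative, not the pandas implementation. In exact arithmetic all three accumulators agree; in floating point, at extreme magnitudes, they can drift apart:

```python
def comoment_variants(xs, ys):
    # Three algebraically identical updates for the running co-moment
    # C_n = sum((x_i - mean_x) * (y_i - mean_y)); they differ only in
    # which rounded intermediate (old vs. new mean) each one reuses.
    n = 0
    mean_x = mean_y = 0.0
    c_a = c_b = c_c = 0.0
    for x, y in zip(xs, ys):
        n += 1
        dx_old = x - mean_x            # deviation from the OLD x mean
        dy_old = y - mean_y            # deviation from the OLD y mean
        mean_x += dx_old / n
        mean_y += dy_old / n
        dx_new = x - mean_x            # deviation from the NEW x mean
        dy_new = y - mean_y            # deviation from the NEW y mean
        c_a += dx_old * dy_new         # asymmetric form 1
        c_b += dx_new * dy_old         # asymmetric form 2
        c_c += dx_old * dy_old * (n - 1) / n  # symmetric form
    return c_a, c_b, c_c
```

With well-scaled inputs the three results match to the last digit; with values around 1e14, as in the issue, they can disagree in the trailing digits.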

@eicchen requested a review from @Alvaro-Kothe on October 21, 2025
@Alvaro-Kothe (Member) left a comment

Thanks for running the benchmarks. I think the current implementation will lead to problems due to accumulation of roundoff errors on big datasets.

> The problem stems from the co-moment calculation at large/small values and the asymmetric nature of Welford's algorithm.

For me, the problem is the catastrophic cancellation when computing 116.80000305175781 - 116.8000030517578.
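
A standalone illustration of that cancellation, using the numbers quoted above:

```python
import math

a = 116.80000305175781
b = 116.8000030517578
# The true difference is ~1e-14, but the spacing between adjacent
# doubles near 116.8 is already ~1.4e-14, so essentially every
# significant digit cancels and the result is rounding noise.
print(a - b)
print(math.ulp(a))  # ~1.42e-14
```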

Comment on lines +380 to +381
vx = mat[i, xi] - meanx
vy = mat[i, yi] - meany
@Alvaro-Kothe (Member):

One of the reasons Welford's algorithm is considered a "stable" algorithm is that it mitigates cancellation in $x_i - \bar{x}$ by using a running average.
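
A minimal sketch of that mechanism (illustrative code, not the pandas implementation; assumes a non-empty input):

```python
def welford_var(values):
    # Each x is compared against the current running mean, so the
    # subtraction stays on the order of the data's spread rather than
    # its absolute magnitude, limiting catastrophic cancellation.
    n = 0
    mean = m2 = 0.0
    for x in values:
        n += 1
        delta = x - mean            # deviation from the running mean
        mean += delta / n
        m2 += delta * (x - mean)    # product of old- and new-mean deviations
    return m2 / n                   # population variance
```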

@Alvaro-Kothe (Member):

The problem with the current implementation is the accumulation of roundoff errors. Kahan summation can mitigate this issue.
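
For reference, a minimal sketch of Kahan (compensated) summation; illustrative only, not the code proposed in this PR:

```python
def kahan_sum(values):
    total = 0.0
    comp = 0.0                  # running compensation for lost low-order bits
    for x in values:
        y = x - comp            # re-apply the error carried from the last step
        t = total + y           # low-order bits of y may be lost here
        comp = (t - total) - y  # recover what was just lost
        total = t
    return total

vals = [1e16] + [1.0] * 4
print(sum(vals))        # 1e+16 -- each 1.0 is absorbed by rounding
print(kahan_sum(vals))  # 1.0000000000000004e+16 -- the 1.0s are recovered
```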

- :func:`read_spss` now supports kwargs to be passed to pyreadstat (:issue:`56356`)
- :func:`read_stata` now returns ``datetime64`` resolutions better matching those natively stored in the stata format (:issue:`55642`)
- :meth:`DataFrame.agg` called with ``axis=1`` and a ``func`` which relabels the result index now raises a ``NotImplementedError`` (:issue:`58807`).
- :meth:`DataFrame.corr` now uses two pass Welford's Method to improve numerical stability with precision for very large/small values (:issue:`59652`)
@Alvaro-Kothe (Member):

I don't think "two-pass Welford's Method" exists. You are simply using two-pass.

@eicchen (Contributor, Author):

Must have conflated the two, will update.

# Welford's method for the variance-calculation
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
nobs = ssqdmx = ssqdmy = covxy = meanx = meany = 0
# Changed to Welford's two-pass for improved numeric stability
@Alvaro-Kothe (Member):

This comment is misleading.

@eicchen (Contributor, Author) commented Oct 21, 2025

I have thought of something that could help with performance: we can set a precision threshold (10^-14 or 10^14) and either normalize the values or apply two-pass when the threshold is exceeded. That way we wouldn't lose performance unnecessarily. Will implement it later.
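
A rough sketch of what that dispatch could look like. Everything here is hypothetical: the function names, the threshold value, and the NumPy formulation are placeholders, not pandas code, and the normalization branch is omitted:

```python
import numpy as np

MAGNITUDE_THRESHOLD = 1e14  # hypothetical cutoff, not tuned

def corr_single_pass(x, y):
    # Naive single-pass formula: fast, but prone to cancellation
    # when inputs have extreme magnitudes.
    n = len(x)
    sx, sy = x.sum(), y.sum()
    cov = (x * y).sum() - sx * sy / n
    varx = (x * x).sum() - sx * sx / n
    vary = (y * y).sum() - sy * sy / n
    return cov / np.sqrt(varx * vary)

def corr_two_pass(x, y):
    # Two-pass: compute the means first, then centered co-moments.
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx * dx).sum() * (dy * dy).sum())

def corr_hybrid(x, y):
    # Dispatch: fall back to two-pass only when the data magnitude
    # makes the fast path risky.
    scale = max(np.abs(x).max(), np.abs(y).max())
    if scale > MAGNITUDE_THRESHOLD or 0 < scale < 1 / MAGNITUDE_THRESHOLD:
        return corr_two_pass(x, y)
    return corr_single_pass(x, y)
```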



Development

Successfully merging this pull request may close these issues.

BUG: Pearson correlation outside expected range -1 to 1
