Reimplemented Pearson's correlation to use two-pass Welford's model #62750
base: main
Conversation
Can you run the benchmarks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for running the benchmarks. I think the current implementation will lead to problems due to accumulation of roundoff errors on big datasets.
The problem stems from the co-moment calculations at large/small values and the asymmetric nature of Welford's algorithm.
For me, the problem is the catastrophic cancellation when computing 116.80000305175781 - 116.8000030517578.
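To make the cancellation concrete, here is a minimal illustration (not from the PR; the two literals are the values quoted above):

```python
import math

a = 116.80000305175781
b = 116.8000030517578   # differs from a by 1e-14 in decimal

print(a - b)        # a multiple of the ulp below, not the decimal 1e-14
print(math.ulp(a))  # ~1.42e-14: spacing of adjacent doubles near 116.8
# The subtraction cannot resolve a difference smaller than one ulp at
# this magnitude, so nearly all significant digits of the true
# difference are lost in the result.
```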
vx = mat[i, xi] - meanx
vy = mat[i, yi] - meany
One of the reasons Welford's algorithm is considered a "stable" algorithm is that it mitigates cancellation in updates like these.
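For reference, a minimal sketch of the standard Welford-style online updates (the form given on the Wikipedia page linked in the diff below; variable names mirror the quoted snippets, but this is illustrative, not the pandas Cython code, and has no NaN handling):

```python
def welford_online(xs, ys):
    """Online mean, sum-of-squared-deviations, and co-moment updates."""
    nobs = 0
    meanx = meany = 0.0
    ssqdmx = ssqdmy = covxy = 0.0
    for x, y in zip(xs, ys):
        nobs += 1
        dx = x - meanx              # deviation from the *old* mean
        dy = y - meany
        meanx += dx / nobs
        meany += dy / nobs
        ssqdmx += dx * (x - meanx)  # old deviation * new deviation
        ssqdmy += dy * (y - meany)
        covxy += dx * (y - meany)   # co-moment uses the *new* meany
    return meanx, meany, ssqdmx, ssqdmy, covxy
```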
The problem with the current implementation is the accumulation of round-off errors. Kahan summation can mitigate this issue.
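For reference, a minimal sketch of the Kahan compensated summation being suggested (illustrative only; in pandas this would live in the Cython layer):

```python
def kahan_sum(values):
    """Compensated summation: tracks and re-adds lost low-order bits."""
    total = 0.0
    c = 0.0                   # running compensation for lost precision
    for v in values:
        y = v - c             # subtract the previously lost error
        t = total + y         # low-order bits of y may be lost here
        c = (t - total) - y   # recover exactly what was lost
        total = t
    return total
```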
- :func:`read_spss` now supports kwargs to be passed to pyreadstat (:issue:`56356`)
- :func:`read_stata` now returns ``datetime64`` resolutions better matching those natively stored in the stata format (:issue:`55642`)
- :meth:`DataFrame.agg` called with ``axis=1`` and a ``func`` which relabels the result index now raises a ``NotImplementedError`` (:issue:`58807`).
- :meth:`DataFrame.corr` now uses two pass Welford's Method to improve numerical stability with precision for very large/small values (:issue:`59652`)
I don't think "two pass Welford's Method" exists. You are simply using the standard two-pass algorithm.
Must have conflated the two, will update
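For context, a minimal sketch of the plain two-pass Pearson correlation being referred to (illustrative only; the PR's actual Cython code works on a matrix and handles missing values):

```python
import math

def pearson_two_pass(xs, ys):
    """Two-pass Pearson correlation: means first, then centered moments.

    Assumes equal-length, non-constant inputs with no NaNs.
    """
    n = len(xs)
    meanx = sum(xs) / n          # first pass: compute the means
    meany = sum(ys) / n
    covxy = ssdx = ssdy = 0.0
    for x, y in zip(xs, ys):     # second pass: centered products
        dx = x - meanx
        dy = y - meany
        covxy += dx * dy
        ssdx += dx * dx
        ssdy += dy * dy
    return covxy / math.sqrt(ssdx * ssdy)
```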
# Welford's method for the variance-calculation
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
nobs = ssqdmx = ssqdmy = covxy = meanx = meany = 0
# Changed to Welford's two-pass for improved numeric stability
This comment is misleading.
I have thought of something that could help with performance: we can set a precision threshold (10^-14 or 10^14) and either normalize values or apply two-pass when the threshold is exceeded. That way we wouldn't lose performance unnecessarily. Will implement it later.
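A rough sketch of how that dispatch idea could look (entirely hypothetical: the function names, the scale check, and the 1e14 threshold are illustrative, not part of the PR; `pearson_two_pass` is the sketch shown earlier and `pearson_single_pass` stands in for the existing fast path):

```python
import numpy as np

def corr_with_fallback(x, y, threshold=1e14):
    """Use the fast single-pass path for well-scaled data and fall back
    to the stable two-pass path when magnitudes leave the safe range."""
    scale = max(np.abs(x).max(), np.abs(y).max())
    if scale > threshold or 0 < scale < 1 / threshold:
        return pearson_two_pass(x, y)    # stable but slower path
    return pearson_single_pass(x, y)     # hypothetical fast path
```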
This improves numerical stability for values that are really large or small, such as the example given in the original issue.