Overhaul Python PCA, from NaN handling to SKLearn replacement #2311
base: edge
Conversation
Note: next step will be to:
ad69594 to eda5d87
Don't mind the failure of the python/python regression tests: the golden record was computed before the fix and needs to be updated. However:

So there's something still amiss. A sign flip can't explain the 45° angle difference on PC1 ... or the 53.64° on PC2!
Investigation: Why PCA regression tests fail for large conversations (e.g., BG2050)

Summary

The Python PCA implementation matches Clojure for small conversations but diverges significantly for large ones. After investigation, we found this is expected behavior due to different algorithms being used, and we're now confident that dropping the Clojure stochastic approach is the right call.

Root Cause

The Clojure code has a branching condition (conversation.clj:766-796): conversations exceeding 5,000 comments or 10,000 participants switch from exact PCA to a stochastic mini-batch approximation.

The stochastic algorithm is clever: it uses growing-size mini-batches with a learning rate, which is exactly the pattern found in convergence proofs and optimal algorithms for stochastic Monte Carlo methods. However, I could not find any trace in the codebase of comparisons against the exact algorithm for accuracy validation.

The Scary-Looking 50° Angle Difference (and why it's not actually scary)

Initial tests showed a ~50° angle difference between Python and Clojure PCA components for BG2050, which looked alarming. But we also observed 0.99 correlation between the same components, which is seemingly contradictory.

Resolution: the stochastic PCA doesn't re-normalize its components after blending iterations. The mini-batch update does:

new_component = 0.99 * old_component + 0.01 * batch_component

A weighted average of unit vectors is not a unit vector. After 100 iterations, the Clojure vectors have norms of ~0.6-0.7 instead of 1.0. My angle computation assumed unit vectors (as exact PCA produces), which gave misleading results. Once properly normalized:

Decision

Drop the stochastic PCA codepath. The exact PCA in Python (currently custom, but soon backed by efficient NumPy) will handle what used to be computationally challenging for Clojure. If we ever need to optimize for truly massive conversations, we can revisit with:
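Stepping back to the norm-shrinkage point above: here is a small self-contained sketch (synthetic vectors, not the polismath code) showing that a weighted average of unit vectors is shorter than a unit vector, and that repeated 0.99/0.01 blends without re-normalization let the component's norm drift well below 1.

```python
import numpy as np

rng = np.random.default_rng(42)

def unit(v: np.ndarray) -> np.ndarray:
    """Rescale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

# A weighted average of two unit vectors is not a unit vector:
u = np.array([1.0, 0.0])
w = unit(np.array([1.0, 1.0]))            # 45 degrees away from u
print(np.linalg.norm(0.5 * u + 0.5 * w))  # ~0.924, not 1.0

# Mini-batch-style blending (new = 0.99*old + 0.01*batch) without
# re-normalization: the norm drifts whenever batch estimates disagree.
dim = 200
true_pc = unit(rng.normal(size=dim))
component = true_pc.copy()
for _ in range(100):
    batch_estimate = unit(true_pc + rng.normal(size=dim))  # noisy batch direction
    component = 0.99 * component + 0.01 * batch_estimate

print(np.linalg.norm(component))  # below 1.0; how far depends on the noise level
```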
Now that we have verified the PCA algorithm and trust it, next steps:
Parallelized test execution across datasets

Added:

Local benchmarks (Apple Silicon M1, 64GB RAM):
Full test suite with all local datasets: ~3 min (down from 10-12 min sequential). Notes:
The parallel execution infrastructure is now available if we want to enable it on CI runners in the future.
We have now fully replaced the Python port of the PCA with a simple call to SKLearn, which is much faster and much simpler 😄 🎉 The computations on all the datasets I've tried match (from small to very large convos, from public ones to private ones), up to some tiny numerical differences for small values, which barely register when looking at the projections as a whole. Next step for this PR:
commit 2024dde
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Update test_powerit_pca_with_nans to expect ValueError

The previous commit changed powerit_pca to raise ValueError when given NaN values, making NaN handling the caller's responsibility. This test now verifies that behavior instead of testing graceful NaN handling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

commit 9839770
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Fix PCA NaN handling to use column mean (matching Clojure)

The Python PCA was using 0 to fill missing votes, while Clojure uses column means. This caused significant differences in PCA components (16 deg and 52 deg angle differences on the VW dataset).

Changes:
- pca_project_dataframe: Replace NaN with column means instead of 0
- powerit_pca: Add ValueError if NaN values reach it (caller must preprocess)
- test_regression.py: Enable ignore_pca_sign_flip for golden comparisons

New tests:
- test_pca_unit.py: test_nan_handling_uses_column_mean verifies column mean is used; test_nan_handling_differs_from_zero_fill confirms we are not using 0
- test_legacy_clojure_regression.py: test_pca_components_match_clojure compares Python vs Clojure PCA components (correlation and angle)

Results after fix:
- VW: PC1/PC2 correlation 1.0/-1.0, angle 0 deg (perfect match)
- Biodiversity: PC1/PC2 correlation 0.99/-1.0, angle 7/4 deg (minor numerical differences from power iteration on 82% sparse data)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

commit 4b33297
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Test that sign flipping propagates

If the PCA components flip sign, the projections must also flip sign.

commit e307a52
Author: Julien Cornebise <julien@cornebise.com>
Date: Fri Nov 21 09:36:22 2025 +0000

Flag an unneeded exception

commit 91a1e9c
Author: Julien Cornebise <julien@cornebise.com>
Date: Wed Nov 19 15:49:34 2025 +0000

Report any scaling issues in PCA projections

commit 395e636
Author: Julien Cornebise <julien@cornebise.com>
Date: Wed Nov 19 15:25:46 2025 +0000

Turn off the Clojure sign-flip for PCA

Of course this breaks the comparison to the golden files, as those included the sign flip. So we're also giving the option in the regression test to ignore PCA sign-flips. That will be handy later when we work on improving the PCA implementation, as various PCA implementations have various sign conventions.

commit 0dc1aa7
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Clean up unused variables and imports

Address GitHub Copilot review comments:
- Log superseded votes count in conversation.py instead of leaving it unused
- Remove unused p1_idx/p2_idx index lookups in corr.py
- Remove unused all_passed variable in regression_comparer.py
- Remove unused imports (numpy, Path, List, datetime, stats, pca/cluster functions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Until now, one shared object was getting reused across tests for a given dataset, and the cache kept piling up.
Use pytest-xdist's loadgroup distribution to run tests for different datasets on separate workers while keeping all tests for the same dataset on one worker (preserving fixture cache efficiency).

Changes:
- Add xdist_group marker to each dataset parameter via pytest.param()
- Switch from --dist=loadscope to --dist=loadgroup in pytest config
- Clean up conftest.py marker handling

Performance: Tests now complete in ~2 minutes with 4 workers (-n4), down from ~9 minutes before deduplication and ~4 minutes sequential after deduplication. The xdist_group marker ensures each dataset's Conversation object is computed only once per worker.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add reusable helpers in conftest.py for creating pytest.param objects with xdist_group markers, enabling efficient parallel test execution across all dataset-parametrized tests.

Changes:
- conftest.py: Add make_dataset_params() and get_available_dataset_params() helper functions for xdist_group marker creation
- Update pytest_generate_tests to use xdist_group markers
- Update test_pca_smoke.py, test_repness_smoke.py, test_conversation_smoke.py, test_legacy_repness_comparison.py, test_golden_data.py, test_pipeline_integrity.py to use the new helpers

Pattern: Tests parametrized by dataset now use xdist_group markers via:
- make_dataset_params(["biodiversity", "vw"]) for hardcoded lists
- get_available_dataset_params() for dynamically discovered datasets

With --dist=loadgroup (default in pyproject.toml), pytest-xdist groups tests by dataset, ensuring fixtures are computed once per worker.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
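As a rough illustration of the pattern described in this commit, here is a minimal sketch of what such a helper can look like. The name make_dataset_params comes from the commit message, but the body below is an assumption, not the actual conftest.py code.

```python
# conftest.py (sketch, not the actual project code)
import pytest

def make_dataset_params(dataset_names):
    """Wrap each dataset name in pytest.param with an xdist_group marker so that,
    under --dist=loadgroup, all tests for one dataset run on the same worker and
    can share that worker's cached fixtures."""
    return [
        pytest.param(name, marks=pytest.mark.xdist_group(name=f"dataset_{name}"))
        for name in dataset_names
    ]

# Usage in a test module parametrized by dataset:
@pytest.mark.parametrize("dataset", make_dataset_params(["biodiversity", "vw"]))
def test_smoke(dataset):
    assert dataset in {"biodiversity", "vw"}
```

Run with `pytest -n 4` (and `--dist=loadgroup`, set in pyproject.toml per the commit) so pytest-xdist schedules each group on a single worker.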
New pytest option allows filtering tests to run on a subset of datasets:

pytest --datasets=biodiversity      # single dataset
pytest --datasets=biodiversity,vw   # multiple datasets

Implementation:
- pytest_generate_tests filters dynamic parametrization
- pytest_collection_modifyitems deselects tests with static parametrization
- Report header shows "Filtered to: ..." when active

Works with both:
- Dynamic parametrization (dataset fixture via pytest_generate_tests)
- Static parametrization (@pytest.mark.parametrize with get_available_dataset_params())

Also updated tests/README.md with documentation for all pytest options including --include-local, --datasets, and parallel execution with -n.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
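For orientation, a hedged sketch of how such a --datasets option can be wired up in conftest.py. The hook names are standard pytest; the helper bodies and the hardcoded dataset list are illustrative assumptions.

```python
# conftest.py (sketch, not the actual project code)
def pytest_addoption(parser):
    parser.addoption(
        "--datasets",
        default="",
        help="Comma-separated list of datasets to run tests against",
    )

def _get_requested_datasets(config):
    """Parse --datasets into a set of names; an empty set means 'no filter'."""
    raw = config.getoption("--datasets")
    return {name.strip() for name in raw.split(",") if name.strip()}

def pytest_generate_tests(metafunc):
    # Dynamic parametrization path: only emit parameters for requested datasets.
    if "dataset" in metafunc.fixturenames:
        requested = _get_requested_datasets(metafunc.config)
        available = ["biodiversity", "vw", "fli"]  # discovered dynamically in practice
        selected = [d for d in available if not requested or d in requested]
        metafunc.parametrize("dataset", selected)
```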
The test file has its own pytest_generate_tests hook that was not respecting the --datasets option. It now imports _get_requested_datasets and make_dataset_params from conftest to filter datasets properly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The original angle computation assumed unit-normalized vectors, which gave misleading results when comparing Python (exact PCA, unit vectors) with Clojure stochastic PCA for large conversations (non-unit vectors due to learning-rate blending without re-normalization). This discrepancy only affects large conversations (>5k comments or >10k participants), as Clojure uses exact PCA for smaller conversations.

Now outputs both the raw angle and the properly normalized angle, plus the norms of both vectors for diagnostics. The assertion now checks the normalized angle, which correctly identifies when vectors point in similar directions regardless of their magnitudes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
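A minimal sketch of the difference between the raw and the normalized angle (illustrative code, not the test's exact implementation):

```python
import numpy as np

def angle_degrees(u: np.ndarray, v: np.ndarray, normalize: bool = True) -> float:
    """Angle between two component vectors.

    With normalize=False this reproduces the misleading 'raw' angle: it treats
    the dot product as a cosine, which is only valid for unit vectors.
    With normalize=True both vectors are rescaled to unit norm first, so the
    result reflects direction only, not magnitude."""
    if normalize:
        u = u / np.linalg.norm(u)
        v = v / np.linalg.norm(v)
    cos_theta = np.clip(np.dot(u, v), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

# Same direction, but the second vector has norm 0.65, like the
# un-normalized stochastic components discussed above.
u = np.array([0.6, 0.8])
v = 0.65 * u
print(angle_degrees(u, v, normalize=False))  # ~49.5 deg, spurious
print(angle_degrees(u, v, normalize=True))   # 0.0 deg, correct
```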
As discussed above, the alignment with Clojure was an AI-hallucinated kludge to pass tests via customized scaling tailored to each dataset, hiding the problems with the Python PCA implementation (the handling of NaNs). Now that we have fixed those, we can remove that madness and update the golden records accordingly.
Note: we will update the golden records in a later commit.
I took care to keep the "sparsity-aware" scaling, computed in a vectorized way.
It's not needed anymore: the PCA is done by Sklearn, and the edge cases are already handled by wrapped_pca's caller, i.e. pca_project_dataframe. Less code, less complexity, less maintenance.
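For orientation only, here is a simplified sketch of what the sklearn-backed flow conceptually looks like: column-mean NaN fill, exact PCA, then the sparsity-aware rescaling kept above. The real pca_project_dataframe in pca.py may differ in structure and naming.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pca_project_dataframe_sketch(vote_matrix: pd.DataFrame, n_components: int = 2):
    """Simplified sketch (not the production code): NaN -> column means,
    exact SVD-based PCA via sklearn, sparsity-aware scaling of projections."""
    matrix_data = vote_matrix.to_numpy(dtype=float)

    # Missing votes are filled with the column (comment) mean, matching Clojure.
    col_means = np.nanmean(matrix_data, axis=0)
    filled = np.where(np.isnan(matrix_data), col_means, matrix_data)

    # Exact PCA; sklearn centers the data internally.
    pca = PCA(n_components=n_components)
    projections = pca.fit_transform(filled)

    # Sparsity-aware scaling: participants who saw few comments get scaled up,
    # mirroring Clojure's (Math/sqrt (/ n-cmnts (max n-votes 1))).
    n_cmnts = matrix_data.shape[1]
    n_seen = np.sum(~np.isnan(matrix_data), axis=1)
    proportions = np.sqrt(np.maximum(n_seen, 1) / n_cmnts)
    scaled_projections = projections / proportions[:, np.newaxis]

    return scaled_projections, pca.components_
```

Dividing by proportions here is equivalent to Clojure's multiplication by sqrt(n-cmnts / max(n-votes, 1)), discussed further in the review comments below.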
Cluster centers are derived from PCA projections, so they inherit the sign ambiguity. This fix ensures sign flips are detected and corrected for `.center` paths in addition to `.pca.comps` and `.proj.` paths. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, sign flips were detected per-projection-vector, which fails when only some components are flipped (e.g., PC1 unchanged, PC2 flipped). Now detects flips at the component level (.pca.comps[N]) and stores them, then applies per-dimension correction to projections and cluster centers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
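A minimal sketch of per-component sign detection and per-dimension correction (illustrative names; the comparer's actual code differs):

```python
import numpy as np

def detect_component_flips(golden_comps: np.ndarray, current_comps: np.ndarray) -> np.ndarray:
    """For each principal component, return -1 if the current component points
    opposite to the golden one (negative dot product), else +1."""
    dots = np.einsum("ij,ij->i", golden_comps, current_comps)
    return np.where(dots < 0, -1.0, 1.0)

def apply_flips(projections: np.ndarray, centers: np.ndarray, flips: np.ndarray):
    """Apply the per-dimension sign correction to projections and cluster
    centers so they can be compared against the golden snapshot."""
    return projections * flips, centers * flips

# Example: PC1 unchanged, PC2 flipped.
golden = np.array([[1.0, 0.0], [0.0, 1.0]])
current = np.array([[1.0, 0.0], [0.0, -1.0]])
flips = detect_component_flips(golden, current)   # [ 1., -1.]
proj = np.array([[0.3, 0.7], [-0.2, 0.1]])
centers = np.array([[0.1, 0.4]])
print(apply_flips(proj, centers, flips))
```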
… tests

TLDR: it works! Differences were just numerical errors, artificially looking huge on small values. Looking at the whole cloud of projections confirmed the post-sklearn results match the pre-sklearn results perfectly well.

Problem:
The regression comparer was failing on datasets like FLI and pakistan with thousands of "differences" showing relative errors up to 874%. Investigation revealed these were false positives: the high relative errors occurred on near-zero values (e.g., golden=4.54e-06 vs current=4.42e-05) where even tiny absolute differences (3e-04) produce huge relative errors.

Diagnosis:
Comparing sklearn SVD-based PCA against power iteration golden snapshots:
- The projection point clouds are visually identical (see scatter plots)
- Important values (Q70-Q100 percentile) match within 0.2%
- Only near-zero values (Q0-Q7 percentile, ~0.0x median) show large relative errors
- These near-zero values represent participants at the origin who don't affect visualization or clustering

The element-wise (abs_tol, rel_tol) approach fundamentally cannot handle this case: it either fails on small values or is too loose for large values.

Solution:
Added projection comparison metrics that measure what actually matters (a sketch of these metrics follows below):

| Metric | Threshold | What it measures |
|---|---|---|
| Max \|error\| / range | < 1% | Worst displacement as % of axis |
| Mean \|error\| / range | < 0.1% | Average displacement as % of axis |
| R² (all coordinates) | > 0.9999 | Variance explained (99.99%) |
| R² (per dimension) | > 0.999 | Per-PC fit quality (99.9%) |
| Procrustes disparity | < 1e-4 | Shape similarity after alignment |

Results for FLI dataset:
- Max \|error\| / range: 0.0617% (was flagging 28% rel error on Q1 values)
- R²: 0.9999992
- Procrustes: 8.2e-07

All 7 local datasets now pass regression tests.

Also added:
- Quantile context in error reports (computed from ALL values, not just failures)
- Explanation when element-wise diffs exist but metrics confirm match
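To make the metric table concrete, here is a hedged sketch of how such cloud-level metrics can be computed with NumPy and SciPy. The thresholds and exact definitions are simplified relative to the comparer.

```python
import numpy as np
from scipy.spatial import procrustes

def projection_metrics(golden: np.ndarray, current: np.ndarray) -> dict:
    """Cloud-level comparison metrics for (n_participants, 2) projection arrays,
    in the spirit of the table above (simplified, illustrative definitions)."""
    err = np.abs(golden - current)
    axis_range = golden.max(axis=0) - golden.min(axis=0)   # per-dimension range

    ss_res = np.sum((golden - current) ** 2)
    ss_tot = np.sum((golden - golden.mean(axis=0)) ** 2)
    r_squared = 1.0 - ss_res / ss_tot

    _, _, disparity = procrustes(golden, current)          # shape similarity after alignment

    return {
        "max_err_over_range": float((err / axis_range).max()),
        "mean_err_over_range": float((err / axis_range).mean()),
        "r_squared": float(r_squared),
        "procrustes_disparity": float(disparity),
    }

# Example: identical clouds up to tiny numerical noise pass easily.
rng = np.random.default_rng(0)
golden = rng.normal(size=(500, 2))
current = golden + 1e-6 * rng.normal(size=(500, 2))
print(projection_metrics(golden, current))
```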
Re-recorded golden snapshots using the new sklearn SVD-based PCA. These snapshots now serve as the baseline for regression tests.
Tests that only verify sorting/moderation now use recompute=False to skip unnecessary PCA computation. Manager tests that need PCA but use minimal data have targeted filterwarnings with explanations.

Changes:
- Add recompute=False to sorting tests (no PCA needed for ID ordering)
- Add recompute=False to test_moderation (only checks filtering)
- Add @pytest.mark.filterwarnings to TestConversationManager tests that use minimal data through the manager interface

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
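A schematic sketch of the two patterns in this commit. The fixtures and the exact recompute API are assumptions taken from the commit message, so this is not runnable against the real test suite as-is.

```python
import pytest

# Sorting-only test: skip the expensive PCA by building the conversation
# with recompute=False (fixture and constructor shown schematically).
def test_comment_id_ordering(minimal_conversation_factory):
    conv = minimal_conversation_factory(recompute=False)
    assert conv.comment_ids == sorted(conv.comment_ids)

# Manager test that does run PCA on tiny data: silence the expected warning
# and document why right on the marker.
@pytest.mark.filterwarnings(
    "ignore::UserWarning"  # tiny fixture data triggers expected PCA warnings
)
def test_manager_roundtrip(conversation_manager):
    conversation_manager.update()
```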
The sklearn PCA implementation has been validated to match Clojure. Regressions are now caught by test_regression.py golden snapshots. This test file will be removed once all other Clojure comparison tests are similarly validated and covered by regression tests. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ion.py

- Delete test_golden_data.py: redundant with test_legacy_clojure_regression.py
- Change @skip to @xfail for test_group_clustering and test_comment_priorities so they run and document expected failures rather than being silently skipped

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rebase on
Pull request overview
This PR overhauls the Python PCA implementation by replacing a custom power iteration algorithm with sklearn's PCA and fixing NaN handling to use column means instead of zeros, bringing Python closer to Clojure behavior.
Key Changes
- PCA Implementation: Replaced ~500 lines of custom power iteration code with sklearn's PCA (simpler, more maintainable, better tested)
- NaN Handling: Changed from filling missing votes with 0 to using column means, reducing bias in covariance estimates
- Testing Infrastructure: Added PCA sign flip tolerance to regression tests, added unit tests for NaN handling, and improved test parallelization with xdist groups
Reviewed changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| delphi/polismath/pca_kmeans_rep/pca.py | Core change: replaced custom PCA with sklearn, changed NaN fill from 0 to column mean, simplified projection logic |
| delphi/polismath/regression/comparer.py | Added PCA sign flip detection, outlier tolerance, projection metrics for regression testing |
| delphi/tests/test_pca_unit.py | Removed 200+ lines of custom PCA tests, added 2 focused tests for NaN handling with column means |
| delphi/tests/test_pca_edge_cases.py | Removed tests for custom power iteration functions that no longer exist |
| delphi/tests/test_regression.py | Added tests for PCA sign flip handling and scaling factor detection in comparison logic |
| delphi/tests/test_legacy_clojure_regression.py | Changed to class-based tests with better fixture scoping, added placeholder for deprecated PCA test |
| delphi/tests/conftest.py | Added xdist_group markers for parallel execution, dataset filtering via --datasets flag |
| delphi/tests/test_golden_data.py | Deleted obsolete test file that was marked as xfail |
| delphi/tests/test_conversation.py | Added recompute=False to tests that only verify sorting/moderation logic (performance optimization) |
| delphi/scripts/regression_recorder.py | Added --include-local flag support |
| delphi/scripts/regression_comparer.py | Added CLI options for PCA sign flip tolerance, outlier handling, and local datasets |
| delphi/pyproject.toml | Added pytest config for xdist loadgroup distribution |
| delphi/polismath/benchmarks/bench_pca.py | New benchmark script for profiling PCA performance |
| Various smoke/regression test files | Updated imports to use conftest helper functions for dataset parametrization |
```python
n_cmnts = matrix_data.shape[1]
n_seen = np.sum(~np.isnan(matrix_data), axis=1)  # Count non-NaN votes per participant
proportions = np.sqrt(n_seen / n_cmnts)
scaled_projections = projections / proportions[:, np.newaxis]
```
Copilot AI (Jan 6, 2026)
Potential division by zero: If a participant has voted on zero comments (all NaN values), n_seen will be 0, leading to proportions containing 0, which will cause division by zero when computing scaled_projections = projections / proportions[:, np.newaxis].
Consider adding a check to handle this edge case, for example:
```python
proportions = np.sqrt(n_seen / n_cmnts)
proportions = np.where(proportions == 0, 1, proportions)  # Avoid division by zero
```

Or filter out participants with no votes before PCA computation.
Suggested change:

```diff
-scaled_projections = projections / proportions[:, np.newaxis]
+# Avoid division by zero for participants with no seen comments
+proportions = np.where(proportions == 0, 1.0, proportions)
+scaled_projections = projections / proportions[:, np.newaxis]
```
Good catch! Fixed using the same approach as the Clojure implementation.

Clojure reference (math/src/polismath/math/pca.clj:156):

```clojure
(* (Math/sqrt (/ n-cmnts (max n-votes 1)))
   [p1 p2]))
```

Fix applied:

```python
n_seen_safe = np.maximum(n_seen, 1)
proportions = np.sqrt(n_seen_safe / n_cmnts)
```

Also added a unit test test_participant_with_no_votes to cover this edge case.
```python
    return (np.array(all_golden) if len(all_golden) > 0 else np.array([]),
            np.array(all_current) if len(all_current) > 0 else np.array([]))

def _log_projection_metrics(self, golden_stages: dict, current_stages: dict) -> None:
```
Copilot AI (Jan 6, 2026)
Return type mismatch: Function _log_projection_metrics is declared to return None but actually returns False at line 1500 when len(g_all) == 0, and returns all_pass (a boolean) at line 1572. The function signature should be changed to return bool | None or Optional[bool] to match the actual implementation.
Fixed: changed return type annotation from -> None to -> bool.
```python
# Column mean is imperfect (pulls participants toward center, assumes Gaussian data
# while votes are ternary) but is better than 0. Future work needed
# to have a more proper way to handle missing data in PCA.
col_means = np.nanmean(matrix_data, axis=0)
```
Copilot AI (Jan 6, 2026)
Edge case not handled: If an entire column is NaN, col_means will contain NaN for that column at line 47, and when filling NaN values at line 50, those NaN placeholders will remain NaN. This will cause sklearn's PCA to fail or produce invalid results. Consider adding a check after line 47 to replace any remaining NaN values in col_means with a default value (e.g., 0.0):

```python
col_means = np.where(np.isnan(col_means), 0.0, col_means)
```

Suggested change:

```diff
 col_means = np.nanmean(matrix_data, axis=0)
+# Handle edge case: columns that are entirely NaN will produce NaN means.
+# Replace any remaining NaNs in col_means with 0.0 to ensure finite fill values.
+col_means = np.where(np.isnan(col_means), 0.0, col_means)
```
Thanks for flagging this edge case! After investigation, this scenario is extremely unlikely in practice.

Why it can't happen:

Both Clojure and Python build the rating matrix only from votes that exist:
- Python uses pivot_table, which only creates columns for tids that appear in votes
- Clojure uses nm/update-nmat, which similarly only adds columns for voted comments

A comment with zero votes never gets a column in the matrix - it's simply not added during matrix construction. The columns (colnames/tids) are derived from actual vote records.

When could it theoretically occur?
- Manually constructed test DataFrames with all-NaN columns
- A future refactor that changes matrix construction logic

For now, leaving as-is since the defensive check would guard against a scenario that can't occur in the current architecture. If we ever change how the matrix is built, we should revisit this.
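A tiny pandas illustration of that point (hypothetical data, not the project's matrix construction code):

```python
import pandas as pd

# Votes only exist for tids 1 and 3; tid 2 has no votes at all.
votes = pd.DataFrame({"pid": [10, 10, 11], "tid": [1, 3, 1], "vote": [1, -1, 1]})

matrix = votes.pivot_table(index="pid", columns="tid", values="vote")
print(matrix.columns.tolist())  # [1, 3] -- no column is ever created for tid 2
# NaNs only appear where a participant skipped an existing comment,
# never as an entirely-NaN column.
```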
delphi/tests/test_regression.py (outdated)
```python
    )
    assert not result["match"], "Non-constant scaled data should not match"
    # Should not mention a specific scaling factor
    reason = result.get("reason", "")
```
Copilot AI (Jan 6, 2026)
Variable reason is not used.

Suggested change:

```diff
-reason = result.get("reason", "")
```
Fixed: removed the unused variable and clarified the comment.
```python
            if abs(g) > 1e-10:
                rel_diffs.append(abs(g - c) / abs(g))
        except (TypeError, ValueError):
            pass
```
Copilot AI (Jan 6, 2026)
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
-            pass
+            logger.debug(
+                "Skipping non-numeric difference for stats at path %s: "
+                "golden_value=%r, current_value=%r",
+                diff.get("path"),
+                diff.get("golden_value"),
+                diff.get("current_value"),
+            )
```
Added an explanatory comment. The silent pass is intentional:
all_differences can contain non-numeric entries (type mismatches, key mismatches, length mismatches, etc.) that are already recorded and will cause the comparison to fail. This loop only computes numeric error statistics, so non-numeric differences are correctly skipped.
Design rationale: For regression testing, we want to collect all differences (not fail on the first one) so users see the full picture. Structural mismatches are reported - we just can't compute "how far off" a dict is from a list.
- Fix division by zero when participant has no votes (pca.py); mirrors Clojure's (max n-votes 1) approach
- Fix return type annotation for _log_projection_metrics (comparer.py)
- Add explanatory comment for intentional except-pass (comparer.py)
- Remove unused variable in test (test_regression.py)
- Add unit test for no-votes edge case (test_pca_unit.py)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add "Parallel Test Execution" section to regression_testing.md explaining: - How to use -n auto for parallel execution - How xdist_group markers keep dataset fixtures together - When NOT to use parallel execution (database tests, debugging) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Now that I've answered the GitHub Copilot review, this is ready for human review :)
Summary
Background
Investigation revealed that Python and Clojure PCA implementations differed significantly:
Changes
Core fix (pca.py):
- pca_project_dataframe: Replace NaN with column means instead of 0
- powerit_pca: Add ValueError if NaN values reach it (enforces preprocessing)

New tests:
- test_pca_unit.py: Two tests verifying column-mean NaN handling
- test_legacy_clojure_regression.py: test_pca_components_match_clojure, comparing Python vs Clojure PCA

Results after fix
The small biodiversity discrepancy is within numerical tolerance for power iteration on high-sparsity (82%) data.
Test plan
- pytest tests/test_pca_unit.py -k nan_handling - NaN handling unit tests pass
- pytest tests/test_legacy_clojure_regression.py::test_pca_components_match_clojure - Clojure comparison passes

🤖 Generated with Claude Code