Overhaul Python PCA, from NaN handling to SKLearn replacement #2311
base: edge
Conversation
Note: next step will be to:
ad69594 to eda5d87
Don't mind the failure of the python/python regression tests: the golden record was computed before the fix and needs to be updated. However:

So there's something still amiss. A sign flip can't explain the 45° angle difference on PC1 ... or the 53.64° on PC2!
Investigation: Why PCA regression tests fail for large conversations (e.g., BG2050)

Summary

The Python PCA implementation matches Clojure for small conversations but diverges significantly for large ones. After investigation, we found this is expected behavior due to different algorithms being used, and we're now confident that dropping the Clojure stochastic approach is the right call.

Root Cause

The Clojure code has a branching condition (conversation.clj:766-796): conversations exceeding 5,000 comments or 10,000 participants switch from exact PCA to a stochastic mini-batch approximation.

The stochastic algorithm is clever: it uses growing-size mini-batches with a learning rate, which is exactly the pattern found in convergence proofs and optimal algorithms for stochastic Monte Carlo methods. However, I could not find any trace in the codebase of comparisons against the exact algorithm for accuracy validation.

The Scary-Looking 50° Angle Difference (and why it's not actually scary)

Initial tests showed a ~50° angle difference between Python and Clojure PCA components for BG2050, which looked alarming. But we also observed 0.99 correlation between the same components, which is seemingly contradictory.

Resolution: the stochastic PCA doesn't re-normalize its components after blending iterations. The mini-batch update does:

new_component = 0.99 * old_component + 0.01 * batch_component

A weighted average of unit vectors is not a unit vector. After 100 iterations, the Clojure vectors have norms of ~0.6-0.7 instead of 1.0. My angle computation assumed unit vectors (as exact PCA produces), which gave misleading results. Once properly normalized:

Decision

Drop the stochastic PCA codepath. The exact PCA in Python (currently custom, but soon backed by efficient NumPy) will handle what used to be computationally challenging for Clojure. If we ever need to optimize for truly massive conversations, we can revisit with:
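Stepping back to the norm-shrinkage point above: here is a small self-contained sketch (synthetic vectors, not the polismath code) showing that a weighted average of unit vectors is shorter than a unit vector, and that repeated 0.99/0.01 blends without re-normalization let the component's norm drift well below 1.

```python
import numpy as np

rng = np.random.default_rng(42)

def unit(v: np.ndarray) -> np.ndarray:
    """Rescale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

# A weighted average of two unit vectors is not a unit vector:
u = np.array([1.0, 0.0])
w = unit(np.array([1.0, 1.0]))            # 45 degrees away from u
print(np.linalg.norm(0.5 * u + 0.5 * w))  # ~0.924, not 1.0

# Mini-batch-style blending (new = 0.99*old + 0.01*batch) without
# re-normalization: the norm drifts whenever batch estimates disagree.
dim = 200
true_pc = unit(rng.normal(size=dim))
component = true_pc.copy()
for _ in range(100):
    batch_estimate = unit(true_pc + rng.normal(size=dim))  # noisy batch direction
    component = 0.99 * component + 0.01 * batch_estimate

print(np.linalg.norm(component))  # below 1.0; how far depends on the noise level
```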
Now that we have verified the PCA algorithm and trust it, next steps:
Parallelized test execution across datasets

Added:

Local benchmarks (Apple Silicon M1, 64GB RAM):
Full test suite with all local datasets: ~3 min (down from 10-12 min sequential). Notes:
The parallel execution infrastructure is now available if we want to enable it on CI runners in the future.
We have now fully replaced the Python port of the PCA with a simple call to SKLearn, which is much faster and much simpler 😄 🎉 The computations on all the datasets I've tried match (from small to very large convos, from public ones to private ones), up to some tiny numerical differences for small values, which barely register when looking at the projections as a whole. Next step for this PR:
commit 2024dde
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Update test_powerit_pca_with_nans to expect ValueError

The previous commit changed powerit_pca to raise ValueError when given NaN values, making NaN handling the caller's responsibility. This test now verifies that behavior instead of testing graceful NaN handling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

commit 9839770
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Fix PCA NaN handling to use column mean (matching Clojure)

The Python PCA was using 0 to fill missing votes, while Clojure uses column means. This caused significant differences in PCA components (16 deg and 52 deg angle differences on the VW dataset).

Changes:
- pca_project_dataframe: Replace NaN with column means instead of 0
- powerit_pca: Add ValueError if NaN values reach it (caller must preprocess)
- test_regression.py: Enable ignore_pca_sign_flip for golden comparisons

New tests:
- test_pca_unit.py: test_nan_handling_uses_column_mean verifies column mean is used; test_nan_handling_differs_from_zero_fill confirms we are not using 0
- test_legacy_clojure_regression.py: test_pca_components_match_clojure compares Python vs Clojure PCA components (correlation and angle)

Results after fix:
- VW: PC1/PC2 correlation 1.0/-1.0, angle 0 deg (perfect match)
- Biodiversity: PC1/PC2 correlation 0.99/-1.0, angle 7/4 deg (minor numerical differences from power iteration on 82% sparse data)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

commit 4b33297
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Test that sign flipping propagates

If the PCA components flip sign, the projections must also flip sign.

commit e307a52
Author: Julien Cornebise <julien@cornebise.com>
Date: Fri Nov 21 09:36:22 2025 +0000

Flag an unneeded exception

commit 91a1e9c
Author: Julien Cornebise <julien@cornebise.com>
Date: Wed Nov 19 15:49:34 2025 +0000

Report any scaling issues in PCA projections

commit 395e636
Author: Julien Cornebise <julien@cornebise.com>
Date: Wed Nov 19 15:25:46 2025 +0000

Turn off the Clojure sign-flip for PCA

Of course this breaks the comparison to the golden files, as those included the sign flip. So we're also giving the option in the regression test to ignore PCA sign-flips. That will be handy later when we work on improving the PCA implementation, as various PCA implementations have various sign conventions.

commit 0dc1aa7
Author: Julien Cornebise <julien@cornebise.com>
Date: Mon Nov 24 22:14:34 2025 +0000

Clean up unused variables and imports

Address GitHub Copilot review comments:
- Log superseded votes count in conversation.py instead of leaving it unused
- Remove unused p1_idx/p2_idx index lookups in corr.py
- Remove unused all_passed variable in regression_comparer.py
- Remove unused imports (numpy, Path, List, datetime, stats, pca/cluster functions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Until now, one shared object was getting reused across tests for a given dataset, and the cache kept piling up.
Use pytest-xdist's loadgroup distribution to run tests for different datasets on separate workers while keeping all tests for the same dataset on one worker (preserving fixture cache efficiency).

Changes:
- Add xdist_group marker to each dataset parameter via pytest.param()
- Switch from --dist=loadscope to --dist=loadgroup in pytest config
- Clean up conftest.py marker handling

Performance: Tests now complete in ~2 minutes with 4 workers (-n4), down from ~9 minutes before deduplication and ~4 minutes sequential after deduplication. The xdist_group marker ensures each dataset's Conversation object is computed only once per worker.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add reusable helpers in conftest.py for creating pytest.param objects with xdist_group markers, enabling efficient parallel test execution across all dataset-parametrized tests.

Changes:
- conftest.py: Add make_dataset_params() and get_available_dataset_params() helper functions for xdist_group marker creation
- Update pytest_generate_tests to use xdist_group markers
- Update test_pca_smoke.py, test_repness_smoke.py, test_conversation_smoke.py, test_legacy_repness_comparison.py, test_golden_data.py, test_pipeline_integrity.py to use the new helpers

Pattern: Tests parametrized by dataset now use xdist_group markers via:
- make_dataset_params(["biodiversity", "vw"]) for hardcoded lists
- get_available_dataset_params() for dynamically discovered datasets

With --dist=loadgroup (default in pyproject.toml), pytest-xdist groups tests by dataset, ensuring fixtures are computed once per worker.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
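As a rough illustration of the pattern described in this commit, here is a minimal sketch of what such a helper can look like. The name make_dataset_params comes from the commit message, but the body below is an assumption, not the actual conftest.py code.

```python
# conftest.py (sketch, not the actual project code)
import pytest

def make_dataset_params(dataset_names):
    """Wrap each dataset name in pytest.param with an xdist_group marker so that,
    under --dist=loadgroup, all tests for one dataset run on the same worker and
    can share that worker's cached fixtures."""
    return [
        pytest.param(name, marks=pytest.mark.xdist_group(name=f"dataset_{name}"))
        for name in dataset_names
    ]

# Usage in a test module parametrized by dataset:
@pytest.mark.parametrize("dataset", make_dataset_params(["biodiversity", "vw"]))
def test_smoke(dataset):
    assert dataset in {"biodiversity", "vw"}
```

Run with `pytest -n 4` (and `--dist=loadgroup`, set in pyproject.toml per the commit) so pytest-xdist schedules each group on a single worker.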
New pytest option allows filtering tests to run on a subset of datasets:

pytest --datasets=biodiversity      # single dataset
pytest --datasets=biodiversity,vw   # multiple datasets

Implementation:
- pytest_generate_tests filters dynamic parametrization
- pytest_collection_modifyitems deselects tests with static parametrization
- Report header shows "Filtered to: ..." when active

Works with both:
- Dynamic parametrization (dataset fixture via pytest_generate_tests)
- Static parametrization (@pytest.mark.parametrize with get_available_dataset_params())

Also updated tests/README.md with documentation for all pytest options including --include-local, --datasets, and parallel execution with -n.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
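For orientation, a hedged sketch of how such a --datasets option can be wired up in conftest.py. The hook names are standard pytest; the helper bodies and the hardcoded dataset list are illustrative assumptions.

```python
# conftest.py (sketch, not the actual project code)
def pytest_addoption(parser):
    parser.addoption(
        "--datasets",
        default="",
        help="Comma-separated list of datasets to run tests against",
    )

def _get_requested_datasets(config):
    """Parse --datasets into a set of names; an empty set means 'no filter'."""
    raw = config.getoption("--datasets")
    return {name.strip() for name in raw.split(",") if name.strip()}

def pytest_generate_tests(metafunc):
    # Dynamic parametrization path: only emit parameters for requested datasets.
    if "dataset" in metafunc.fixturenames:
        requested = _get_requested_datasets(metafunc.config)
        available = ["biodiversity", "vw", "fli"]  # discovered dynamically in practice
        selected = [d for d in available if not requested or d in requested]
        metafunc.parametrize("dataset", selected)
```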
The test file has its own pytest_generate_tests hook that was not respecting the --datasets option. It now imports _get_requested_datasets and make_dataset_params from conftest to filter datasets properly.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The original angle computation assumed unit-normalized vectors, which gave misleading results when comparing Python (exact PCA, unit vectors) with Clojure stochastic PCA for large conversations (non-unit vectors due to learning-rate blending without re-normalization). This discrepancy only affects large conversations (>5k comments or >10k participants), as Clojure uses exact PCA for smaller conversations.

Now outputs both the raw angle and the properly normalized angle, plus the norms of both vectors for diagnostics. The assertion now checks the normalized angle, which correctly identifies when vectors point in similar directions regardless of their magnitudes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
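A minimal sketch of the difference between the raw and the normalized angle (illustrative code, not the test's exact implementation):

```python
import numpy as np

def angle_degrees(u: np.ndarray, v: np.ndarray, normalize: bool = True) -> float:
    """Angle between two component vectors.

    With normalize=False this reproduces the misleading 'raw' angle: it treats
    the dot product as a cosine, which is only valid for unit vectors.
    With normalize=True both vectors are rescaled to unit norm first, so the
    result reflects direction only, not magnitude."""
    if normalize:
        u = u / np.linalg.norm(u)
        v = v / np.linalg.norm(v)
    cos_theta = np.clip(np.dot(u, v), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

# Same direction, but the second vector has norm 0.65, like the
# un-normalized stochastic components discussed above.
u = np.array([0.6, 0.8])
v = 0.65 * u
print(angle_degrees(u, v, normalize=False))  # ~49.5 deg, spurious
print(angle_degrees(u, v, normalize=True))   # 0.0 deg, correct
```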
As discussed above, the alignment with Clojure was an AI-hallucinated kludge to pass tests via customized scaling tailored to each dataset, hiding the problems with the Python PCA implementation (the handling of NaNs). Now that we have fixed those, we can remove that madness and update the golden records accordingly.
Note: we will update the golden records in a later commit.
I took care to keep the "sparsity-aware" scaling, computed in a vectorized way.
It's not needed anymore: the PCA is done by Sklearn, and the edge cases are already handled by wrapped_pca's caller, i.e. pca_project_dataframe. Less code, less complexity, less maintenance.
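For orientation only, here is a simplified sketch of what the sklearn-backed flow conceptually looks like: column-mean NaN fill, exact PCA, then the sparsity-aware rescaling kept above. The real pca_project_dataframe in pca.py may differ in structure and naming.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pca_project_dataframe_sketch(vote_matrix: pd.DataFrame, n_components: int = 2):
    """Simplified sketch (not the production code): NaN -> column means,
    exact SVD-based PCA via sklearn, sparsity-aware scaling of projections."""
    matrix_data = vote_matrix.to_numpy(dtype=float)

    # Missing votes are filled with the column (comment) mean, matching Clojure.
    col_means = np.nanmean(matrix_data, axis=0)
    filled = np.where(np.isnan(matrix_data), col_means, matrix_data)

    # Exact PCA; sklearn centers the data internally.
    pca = PCA(n_components=n_components)
    projections = pca.fit_transform(filled)

    # Sparsity-aware scaling: participants who saw few comments get scaled up,
    # mirroring Clojure's (Math/sqrt (/ n-cmnts (max n-votes 1))).
    n_cmnts = matrix_data.shape[1]
    n_seen = np.sum(~np.isnan(matrix_data), axis=1)
    proportions = np.sqrt(np.maximum(n_seen, 1) / n_cmnts)
    scaled_projections = projections / proportions[:, np.newaxis]

    return scaled_projections, pca.components_
```

Dividing by proportions here is equivalent to Clojure's multiplication by sqrt(n-cmnts / max(n-votes, 1)), discussed further in the review comments below.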
Cluster centers are derived from PCA projections, so they inherit the sign ambiguity. This fix ensures sign flips are detected and corrected for `.center` paths in addition to `.pca.comps` and `.proj.` paths. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, sign flips were detected per-projection-vector, which fails when only some components are flipped (e.g., PC1 unchanged, PC2 flipped). Now detects flips at the component level (.pca.comps[N]) and stores them, then applies per-dimension correction to projections and cluster centers. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
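A minimal sketch of per-component sign detection and per-dimension correction (illustrative names; the comparer's actual code differs):

```python
import numpy as np

def detect_component_flips(golden_comps: np.ndarray, current_comps: np.ndarray) -> np.ndarray:
    """For each principal component, return -1 if the current component points
    opposite to the golden one (negative dot product), else +1."""
    dots = np.einsum("ij,ij->i", golden_comps, current_comps)
    return np.where(dots < 0, -1.0, 1.0)

def apply_flips(projections: np.ndarray, centers: np.ndarray, flips: np.ndarray):
    """Apply the per-dimension sign correction to projections and cluster
    centers so they can be compared against the golden snapshot."""
    return projections * flips, centers * flips

# Example: PC1 unchanged, PC2 flipped.
golden = np.array([[1.0, 0.0], [0.0, 1.0]])
current = np.array([[1.0, 0.0], [0.0, -1.0]])
flips = detect_component_flips(golden, current)   # [ 1., -1.]
proj = np.array([[0.3, 0.7], [-0.2, 0.1]])
centers = np.array([[0.1, 0.4]])
print(apply_flips(proj, centers, flips))
```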
… tests

TLDR: it works! Differences were just numerical errors, artificially looking huge on small values. Looking at the whole cloud of projections confirmed the post-sklearn results match the pre-sklearn results perfectly well.

Problem:
The regression comparer was failing on datasets like FLI and pakistan with thousands of "differences" showing relative errors up to 874%. Investigation revealed these were false positives: the high relative errors occurred on near-zero values (e.g., golden=4.54e-06 vs current=4.42e-05) where even tiny absolute differences (3e-04) produce huge relative errors.

Diagnosis:
Comparing sklearn SVD-based PCA against power iteration golden snapshots:
- The projection point clouds are visually identical (see scatter plots)
- Important values (Q70-Q100 percentile) match within 0.2%
- Only near-zero values (Q0-Q7 percentile, ~0.0x median) show large relative errors
- These near-zero values represent participants at the origin who don't affect visualization or clustering

The element-wise (abs_tol, rel_tol) approach fundamentally cannot handle this case: it either fails on small values or is too loose for large values.

Solution:
Added projection comparison metrics that measure what actually matters (a sketch of these metrics follows below):

| Metric | Threshold | What it measures |
|---|---|---|
| Max \|error\| / range | < 1% | Worst displacement as % of axis |
| Mean \|error\| / range | < 0.1% | Average displacement as % of axis |
| R² (all coordinates) | > 0.9999 | Variance explained (99.99%) |
| R² (per dimension) | > 0.999 | Per-PC fit quality (99.9%) |
| Procrustes disparity | < 1e-4 | Shape similarity after alignment |

Results for FLI dataset:
- Max \|error\| / range: 0.0617% (was flagging 28% rel error on Q1 values)
- R²: 0.9999992
- Procrustes: 8.2e-07

All 7 local datasets now pass regression tests.

Also added:
- Quantile context in error reports (computed from ALL values, not just failures)
- Explanation when element-wise diffs exist but metrics confirm match
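To make the metric table concrete, here is a hedged sketch of how such cloud-level metrics can be computed with NumPy and SciPy. The thresholds and exact definitions are simplified relative to the comparer.

```python
import numpy as np
from scipy.spatial import procrustes

def projection_metrics(golden: np.ndarray, current: np.ndarray) -> dict:
    """Cloud-level comparison metrics for (n_participants, 2) projection arrays,
    in the spirit of the table above (simplified, illustrative definitions)."""
    err = np.abs(golden - current)
    axis_range = golden.max(axis=0) - golden.min(axis=0)   # per-dimension range

    ss_res = np.sum((golden - current) ** 2)
    ss_tot = np.sum((golden - golden.mean(axis=0)) ** 2)
    r_squared = 1.0 - ss_res / ss_tot

    _, _, disparity = procrustes(golden, current)          # shape similarity after alignment

    return {
        "max_err_over_range": float((err / axis_range).max()),
        "mean_err_over_range": float((err / axis_range).mean()),
        "r_squared": float(r_squared),
        "procrustes_disparity": float(disparity),
    }

# Example: identical clouds up to tiny numerical noise pass easily.
rng = np.random.default_rng(0)
golden = rng.normal(size=(500, 2))
current = golden + 1e-6 * rng.normal(size=(500, 2))
print(projection_metrics(golden, current))
```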
Re-recorded golden snapshots using the new sklearn SVD-based PCA. These snapshots now serve as the baseline for regression tests.
Tests that only verify sorting/moderation now use recompute=False to skip unnecessary PCA computation. Manager tests that need PCA but use minimal data have targeted filterwarnings with explanations.

Changes:
- Add recompute=False to sorting tests (no PCA needed for ID ordering)
- Add recompute=False to test_moderation (only checks filtering)
- Add @pytest.mark.filterwarnings to TestConversationManager tests that use minimal data through the manager interface

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
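A schematic sketch of the two patterns in this commit. The fixtures and the exact recompute API are assumptions taken from the commit message, so this is not runnable against the real test suite as-is.

```python
import pytest

# Sorting-only test: skip the expensive PCA by building the conversation
# with recompute=False (fixture and constructor shown schematically).
def test_comment_id_ordering(minimal_conversation_factory):
    conv = minimal_conversation_factory(recompute=False)
    assert conv.comment_ids == sorted(conv.comment_ids)

# Manager test that does run PCA on tiny data: silence the expected warning
# and document why right on the marker.
@pytest.mark.filterwarnings(
    "ignore::UserWarning"  # tiny fixture data triggers expected PCA warnings
)
def test_manager_roundtrip(conversation_manager):
    conversation_manager.update()
```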
The sklearn PCA implementation has been validated to match Clojure. Regressions are now caught by test_regression.py golden snapshots. This test file will be removed once all other Clojure comparison tests are similarly validated and covered by regression tests. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ion.py

- Delete test_golden_data.py: redundant with test_legacy_clojure_regression.py
- Change @skip to @xfail for test_group_clustering and test_comment_priorities so they run and document expected failures rather than being silently skipped

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Rebase on
Pull request overview
This PR overhauls the Python PCA implementation by replacing a custom power iteration algorithm with sklearn's PCA and fixing NaN handling to use column means instead of zeros, bringing Python closer to Clojure behavior.
Key Changes
- PCA Implementation: Replaced ~500 lines of custom power iteration code with sklearn's PCA (simpler, more maintainable, better tested)
- NaN Handling: Changed from filling missing votes with 0 to using column means, reducing bias in covariance estimates
- Testing Infrastructure: Added PCA sign flip tolerance to regression tests, added unit tests for NaN handling, and improved test parallelization with xdist groups
Reviewed changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| delphi/polismath/pca_kmeans_rep/pca.py | Core change: replaced custom PCA with sklearn, changed NaN fill from 0 to column mean, simplified projection logic |
| delphi/polismath/regression/comparer.py | Added PCA sign flip detection, outlier tolerance, projection metrics for regression testing |
| delphi/tests/test_pca_unit.py | Removed 200+ lines of custom PCA tests, added 2 focused tests for NaN handling with column means |
| delphi/tests/test_pca_edge_cases.py | Removed tests for custom power iteration functions that no longer exist |
| delphi/tests/test_regression.py | Added tests for PCA sign flip handling and scaling factor detection in comparison logic |
| delphi/tests/test_legacy_clojure_regression.py | Changed to class-based tests with better fixture scoping, added placeholder for deprecated PCA test |
| delphi/tests/conftest.py | Added xdist_group markers for parallel execution, dataset filtering via --datasets flag |
| delphi/tests/test_golden_data.py | Deleted obsolete test file that was marked as xfail |
| delphi/tests/test_conversation.py | Added recompute=False to tests that only verify sorting/moderation logic (performance optimization) |
| delphi/scripts/regression_recorder.py | Added --include-local flag support |
| delphi/scripts/regression_comparer.py | Added CLI options for PCA sign flip tolerance, outlier handling, and local datasets |
| delphi/pyproject.toml | Added pytest config for xdist loadgroup distribution |
| delphi/polismath/benchmarks/bench_pca.py | New benchmark script for profiling PCA performance |
| Various smoke/regression test files | Updated imports to use conftest helper functions for dataset parametrization |
```python
n_cmnts = matrix_data.shape[1]
n_seen = np.sum(~np.isnan(matrix_data), axis=1)  # Count non-NaN votes per participant
proportions = np.sqrt(n_seen / n_cmnts)
scaled_projections = projections / proportions[:, np.newaxis]
```
Copilot AI (Jan 6, 2026)
Potential division by zero: If a participant has voted on zero comments (all NaN values), n_seen will be 0, leading to proportions containing 0, which will cause division by zero when computing scaled_projections = projections / proportions[:, np.newaxis].
Consider adding a check to handle this edge case, for example:
```python
proportions = np.sqrt(n_seen / n_cmnts)
proportions = np.where(proportions == 0, 1, proportions)  # Avoid division by zero
```

Or filter out participants with no votes before PCA computation.
Suggested change:

```diff
-scaled_projections = projections / proportions[:, np.newaxis]
+# Avoid division by zero for participants with no seen comments
+proportions = np.where(proportions == 0, 1.0, proportions)
+scaled_projections = projections / proportions[:, np.newaxis]
```
Good catch! Fixed using the same approach as the Clojure implementation.

Clojure reference (math/src/polismath/math/pca.clj:156):

```clojure
(* (Math/sqrt (/ n-cmnts (max n-votes 1)))
   [p1 p2]))
```

Fix applied:

```python
n_seen_safe = np.maximum(n_seen, 1)
proportions = np.sqrt(n_seen_safe / n_cmnts)
```

Also added a unit test test_participant_with_no_votes to cover this edge case.
```python
    return (np.array(all_golden) if len(all_golden) > 0 else np.array([]),
            np.array(all_current) if len(all_current) > 0 else np.array([]))

def _log_projection_metrics(self, golden_stages: dict, current_stages: dict) -> None:
```
Copilot AI (Jan 6, 2026)
Return type mismatch: Function _log_projection_metrics is declared to return None but actually returns False at line 1500 when len(g_all) == 0, and returns all_pass (a boolean) at line 1572. The function signature should be changed to return bool | None or Optional[bool] to match the actual implementation.
Fixed: changed return type annotation from -> None to -> bool.
```python
# Column mean is imperfect (pulls participants toward center, assumes Gaussian data
# while votes are ternary) but is better than 0. Future work needed
# to have a more proper way to handle missing data in PCA.
col_means = np.nanmean(matrix_data, axis=0)
```
Copilot AI (Jan 6, 2026)
Edge case not handled: If an entire column is NaN, col_means will contain NaN for that column at line 47, and when filling NaN values at line 50, those NaN placeholders will remain NaN. This will cause sklearn's PCA to fail or produce invalid results. Consider adding a check after line 47 to replace any remaining NaN values in col_means with a default value (e.g., 0.0):

```python
col_means = np.where(np.isnan(col_means), 0.0, col_means)
```

Suggested change:

```diff
 col_means = np.nanmean(matrix_data, axis=0)
+# Handle edge case: columns that are entirely NaN will produce NaN means.
+# Replace any remaining NaNs in col_means with 0.0 to ensure finite fill values.
+col_means = np.where(np.isnan(col_means), 0.0, col_means)
```
Thanks for flagging this edge case! After investigation, this scenario is extremely unlikely in practice.

Why it can't happen:

Both Clojure and Python build the rating matrix only from votes that exist:
- Python uses pivot_table, which only creates columns for tids that appear in votes
- Clojure uses nm/update-nmat, which similarly only adds columns for voted comments

A comment with zero votes never gets a column in the matrix - it's simply not added during matrix construction. The columns (colnames/tids) are derived from actual vote records.

When could it theoretically occur?
- Manually constructed test DataFrames with all-NaN columns
- A future refactor that changes matrix construction logic

For now, leaving as-is since the defensive check would guard against a scenario that can't occur in the current architecture. If we ever change how the matrix is built, we should revisit this.
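A tiny pandas illustration of that point (hypothetical data, not the project's matrix construction code):

```python
import pandas as pd

# Votes only exist for tids 1 and 3; tid 2 has no votes at all.
votes = pd.DataFrame({"pid": [10, 10, 11], "tid": [1, 3, 1], "vote": [1, -1, 1]})

matrix = votes.pivot_table(index="pid", columns="tid", values="vote")
print(matrix.columns.tolist())  # [1, 3] -- no column is ever created for tid 2
# NaNs only appear where a participant skipped an existing comment,
# never as an entirely-NaN column.
```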
delphi/tests/test_regression.py (outdated)
```python
    )
    assert not result["match"], "Non-constant scaled data should not match"
    # Should not mention a specific scaling factor
    reason = result.get("reason", "")
```
Copilot AI (Jan 6, 2026)
Variable reason is not used.

Suggested change:

```diff
-reason = result.get("reason", "")
```
Fixed: removed the unused variable and clarified the comment.
```python
            if abs(g) > 1e-10:
                rel_diffs.append(abs(g - c) / abs(g))
        except (TypeError, ValueError):
            pass
```
Copilot AI (Jan 6, 2026)
'except' clause does nothing but pass and there is no explanatory comment.
Suggested change:

```diff
-            pass
+            logger.debug(
+                "Skipping non-numeric difference for stats at path %s: "
+                "golden_value=%r, current_value=%r",
+                diff.get("path"),
+                diff.get("golden_value"),
+                diff.get("current_value"),
+            )
```
Added an explanatory comment. The silent pass is intentional:
all_differences can contain non-numeric entries (type mismatches, key mismatches, length mismatches, etc.) that are already recorded and will cause the comparison to fail. This loop only computes numeric error statistics, so non-numeric differences are correctly skipped.
Design rationale: For regression testing, we want to collect all differences (not fail on the first one) so users see the full picture. Structural mismatches are reported - we just can't compute "how far off" a dict is from a list.
- Fix division by zero when participant has no votes (pca.py); mirrors Clojure's (max n-votes 1) approach
- Fix return type annotation for _log_projection_metrics (comparer.py)
- Add explanatory comment for intentional except-pass (comparer.py)
- Remove unused variable in test (test_regression.py)
- Add unit test for no-votes edge case (test_pca_unit.py)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add "Parallel Test Execution" section to regression_testing.md explaining: - How to use -n auto for parallel execution - How xdist_group markers keep dataset fixtures together - When NOT to use parallel execution (database tests, debugging) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Now that I've answered the GitHub Copilot review, this is ready for human review :)
Summary
Background
Investigation revealed that Python and Clojure PCA implementations differed significantly:
Changes
Core fix (pca.py):
- pca_project_dataframe: Replace NaN with column means instead of 0
- powerit_pca: Add ValueError if NaN values reach it (enforces preprocessing)

New tests:
- test_pca_unit.py: Two tests verifying column-mean NaN handling
- test_legacy_clojure_regression.py: test_pca_components_match_clojure, comparing Python vs Clojure PCA

Results after fix
The small biodiversity discrepancy is within numerical tolerance for power iteration on high-sparsity (82%) data.
Test plan
- pytest tests/test_pca_unit.py -k nan_handling - NaN handling unit tests pass
- pytest tests/test_legacy_clojure_regression.py::test_pca_components_match_clojure - Clojure comparison passes

🤖 Generated with Claude Code