
fix: use positional alignment in contrib functions to prevent cartesian-product OOM#317

Open
drussellmrichie wants to merge 7 commits into larsiusprime:master from drussellmrichie:fix/modeling-positional-align-contrib

Conversation

@drussellmrichie

Problem

Both _contrib_to_unit_values and _add_prediction_to_contribution join df_contrib back to df_base / df using pd.merge(..., on=the_key, how="left"). Under the following combination of conditions, the merge materializes a cartesian-product-sized intermediate and crashes with an OOM:

  1. df_base has a non-sequential index (e.g. after sort_values() — the index retains the pre-sort positions).
  2. The join key ("key" or "key_sale") is backed by PyArrow string dtype, which triggers pandas' slow/non-unique merge path even when the key values are logically unique.
  3. The dataset is large enough that the intermediate cartesian product exhausts memory (observed on a ~421k-row universe: 421k² rows attempted).

The crash surfaces as a MemoryError or silent process kill inside pd.merge before the result is materialized.
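For intuition, here is a toy illustration (hypothetical data, not from this repo) of the row-multiplication mechanism behind a non-unique merge. In this toy the keys really are duplicated; in the PR's case the keys are logically unique, but the Arrow-backed dtype pushes pandas onto the same non-unique code path, which is what makes the intermediate blow up at scale:

```python
import pandas as pd

# Toy example: when the merge engine treats a key as non-unique, each
# left row matches every right row with the same key value, so row
# counts multiply instead of staying 1:1.
df_base = pd.DataFrame({"key": ["a", "a", "b", "b"], "x": range(4)})
df_contrib = pd.DataFrame({"key": ["a", "a", "b", "b"], "c": range(4)})

merged = pd.merge(df_base, df_contrib, on="key", how="left")
print(len(df_base), len(merged))  # 4 rows in, 8 rows out
```

At 421k rows with a degenerate key path, the same mechanism is what attempts a 421k²-row intermediate.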

Why positional alignment is correct

df_contrib is constructed row-by-row from df_base in the same iteration, so the two DataFrames are always in the same row order. A reset_index + pd.concat(..., axis=1) is therefore semantically equivalent to the merge, without the non-unique-join hazard.

A size-mismatch guard falls back to the key-based merge with a UserWarning, preserving correctness in any unexpected edge case.

Fix

  • _contrib_to_unit_values: replace unconditional merge with positional reset_index + concat; fall back to merge with a warning if row counts differ.
  • _add_prediction_to_contribution: same pattern.
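A minimal sketch of the pattern both fixes follow (function and variable names here are illustrative, not the repo's actual code):

```python
import warnings
import pandas as pd

def align_contrib(df_base: pd.DataFrame, df_contrib: pd.DataFrame, key: str) -> pd.DataFrame:
    """Positional concat with a size-mismatch guard that falls back to
    the original key-based merge. Illustrative sketch only."""
    if len(df_base) == len(df_contrib):
        # Same row order by construction: drop both indexes and glue columns.
        left = df_base.reset_index(drop=True)
        right = df_contrib.reset_index(drop=True).drop(columns=[key], errors="ignore")
        return pd.concat([left, right], axis=1)
    warnings.warn(
        "df_base and df_contrib row counts differ; falling back to key-based merge",
        UserWarning,
    )
    return pd.merge(df_base, df_contrib, on=key, how="left")
```

The happy path never touches the merge engine, so the Arrow-string non-unique path can no longer be triggered; the guard preserves the old behavior whenever the positional assumption is violated.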

Reproducing

The crash is difficult to reproduce on a small synthetic dataset because small Arrow string columns typically fall through to pandas' fast unique-key path. It was observed on a real Philadelphia OPA dataset (~421k parcels). No minimal repro is included with this PR; if a test fixture is needed we can work on one together.

@github-actions

github-actions Bot commented Mar 31, 2026

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.
A maintainer will verify your signature and confirm it here by commenting with the following sentence:


I affirm that this contributor has signed the CLA


0 out of 3 committers have signed the CLA.
@claude
@drussellmrichie
@russell Richie
Russell Richie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

claude and others added 5 commits April 2, 2026 01:55
…rvice

Replace slow row-by-row apply() with vectorized geopandas operations:
build a GeoSeries of building geometries aligned with the joined dataframe,
then call .intersection() and .area in one pass instead of per-row.
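A rough sketch of the vectorized idea using shapely 2.x array primitives (the geometries below are made up, and the actual commit works through geopandas GeoSeries rather than raw shapely arrays):

```python
import numpy as np
from shapely import area, box, intersection

# Build positionally aligned geometry arrays, then compute all pairwise
# intersection areas in a single vectorized call instead of per-row apply().
parcels = np.array([box(0, 0, 2, 2), box(0, 0, 3, 3)])
buildings = np.array([box(1, 1, 4, 4), box(0, 0, 1, 1)])

areas = area(intersection(parcels, buildings))
print(areas.tolist())  # [1.0, 1.0]
```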

https://claude.ai/code/session_01Dmffw6eWUC7rrE6JD1SXXX
…-Q7c2a

Vectorize building-parcel intersection area calculation in OvertureService
Adds calc_lycd_land_values() to land.py, which values land by:
1. Computing a uniform local land rate per model group from the median
   market value and median lot size of improved properties
2. Applying that rate to every parcel (land_value = rate × lot_size)

This avoids the side-by-side disparity created by applying a flat
allocation % to each parcel's individual market value.

Supports a user-supplied land_alloc (float or per-group dict) or
automatic derivation from vacant vs improved per-unit value ratios.
Also fixes missing area_unit import in land.py.
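A sketch of the LYCD idea on toy data. The exact rate formula is an assumption on my part (allocation share times median improved market value, divided by median improved lot size, per model group); column names and numbers are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "model_group": ["res", "res", "res", "com"],
    "market_value": [200_000, 220_000, 180_000, 1_000_000],
    "lot_size": [1000, 1100, 900, 5000],
    "improved": [True, True, False, True],
})
land_alloc = 0.25  # assumed user-supplied allocation share

# Uniform local land rate per model group from medians of improved parcels.
improved = df[df["improved"]]
med = improved.groupby("model_group").agg(
    mv=("market_value", "median"), lot=("lot_size", "median")
)
rate = land_alloc * med["mv"] / med["lot"]  # value per unit of lot area

# Apply the group rate uniformly: land_value = rate * lot_size.
df["land_value"] = df["model_group"].map(rate) * df["lot_size"]
print(df["land_value"].tolist())
```

Because every parcel in a group shares one rate, two identical neighboring lots get identical land values regardless of how their improvements differ, which is the disparity the flat per-parcel allocation created.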

https://claude.ai/code/session_011PMWoWpQazUGuCXKT1nmVg
…ation-93UpZ

Implement LYCD (Least You Can Do) land valuation method
…diction_to_contribution to prevent cartesian-product OOM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
drussellmrichie force-pushed the fix/modeling-positional-align-contrib branch from dbc2a9a to 73e2f70 on April 2, 2026 14:29
Russell Richie and others added 2 commits April 3, 2026 15:37
calc_lycd_land_values now accepts a subarea_field parameter (e.g.
"neighborhood", "census_tract") to derive a separate local land rate per
(subarea, model_group) cell instead of one rate per model group across the
entire jurisdiction. Cells below min_improved_per_cell fall back to the
model-group rate. _derive_lycd_alloc_from_data updated to match with
group_field, subarea_field, and alloc_by_group_fallback parameters.
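The cell-level fallback logic can be sketched like this (toy data; the real implementation derives rates from parcel medians rather than a precomputed rate column):

```python
import pandas as pd

df = pd.DataFrame({
    "model_group": ["res"] * 5,
    "neighborhood": ["A", "A", "A", "B", "B"],
    "rate": [40.0, 50.0, 60.0, 80.0, 90.0],  # stand-in per-parcel local rates
})
min_improved_per_cell = 3

# Jurisdiction-wide rate per model group (the fallback).
group_rate = df.groupby("model_group")["rate"].median()

# Rate per (subarea, model_group) cell, kept only where enough
# observations support it; thin cells fall back to the group rate.
cell = df.groupby(["neighborhood", "model_group"])["rate"].agg(["median", "size"])
cell_rate = cell["median"].where(cell["size"] >= min_improved_per_cell)

resolved = cell_rate.reset_index().rename(columns={"median": "cell_rate"})
out = df.merge(resolved, on=["neighborhood", "model_group"], how="left")
out["final_rate"] = out["cell_rate"].fillna(out["model_group"].map(group_rate))
print(out["final_rate"].tolist())
```

Here neighborhood A has three observations and keeps its own rate, while B (two observations) inherits the model-group rate.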

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Split-checkpoint contributions strategy (benchmark, pipeline, modeling)
- Thread `do_contributions` flag through the full call chain: `run_models`,
  `_run_models`, `run_one_model`, `run_one_hedonic_model`, `_predict_one_model`,
  `_run_hedonic_models`, `_write_model_results`, `write_model_parameters`
- Add `_DS` and `_SMRContribContext` helper classes (module-level, picklable)
- `_write_model_results`: save `_smr_for_contribs.pkl` + `_model_features.json`
  when `do_contributions=False`; enables deferred SHAP pass checkpointing
- New `compute_model_contributions(settings)` in pipeline.py: loads saved
  `_smr_for_contribs.pkl` files per model group, runs contributions pass,
  deletes pkl; allows crash recovery without retraining
- `finalize_models`: expose `run_main/vacant/hedonic/ensemble`, `do_shaps`,
  `do_plots`, `do_contributions` as real parameters (were hardcoded True/False)
- `_run_models`: add per-group `ind_vars` override from
  `settings.modeling.models.<mvh>.group_overrides.<group>.ind_vars`
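The checkpoint/recovery shape described above can be sketched as follows. All names here are illustrative stand-ins (the real code saves `_smr_for_contribs.pkl` per model group and runs the SHAP/contributions pass on load):

```python
import pickle
from pathlib import Path

class ContribContext:
    """Stand-in for a module-level, picklable context saved after training."""
    def __init__(self, model_group: str, features: list):
        self.model_group = model_group
        self.features = features

def save_checkpoint(ctx: ContribContext, out_dir: Path) -> Path:
    # Persist the context so the contributions pass can run later,
    # even after a crash, without retraining the model.
    path = out_dir / f"{ctx.model_group}_smr_for_contribs.pkl"
    path.write_bytes(pickle.dumps(ctx))
    return path

def run_deferred_contributions(out_dir: Path) -> dict:
    results = {}
    for path in sorted(out_dir.glob("*_smr_for_contribs.pkl")):
        ctx = pickle.loads(path.read_bytes())
        results[ctx.model_group] = len(ctx.features)  # stand-in for the SHAP pass
        path.unlink()  # delete the checkpoint only once the pass succeeds
    return results
```

Deleting the pickle only after the pass completes is what makes the recovery idempotent: a crash mid-pass leaves the remaining checkpoints on disk for the next run.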

## Bug fixes
- `data.py`: `from scipy.spatial import cKDTree` (private submodule removed
  in scipy >= 1.14)
- `data.py`: move `astype(int)` cast to after `.loc` override in
  `_basic_geo_enrichment` — pandas 2.x rejects float→int64 via `.loc`
- `sales_scrutiny_study.py`: go through `.astype(object)` before `.astype(str)`
  to avoid ArrowNotImplementedError on null-dtype columns (pyarrow >= 22)
- `shap_analysis.py`: warn + filter instead of raising ValueError when
  `list_vars` contains features dropped by variable selection
- `tuning.py`: guard `_lightgbm_rolling_origin_cv` against n_samples < n_splits
- `utilities/cache.py`: replace `col.values == ...` with `col.eq(...)` and
  wrap `.sum()` in `int()` — ArrowExtensionArray has no `.values` sum
- `utilities/stats.py`: multiple `calc_correlations` guards for sparse/small
  groups — empty naive_corr early-exit, all-NA score break, `len(remaining)
  <= 1` early return with correct schema, `.columns.tolist()[0]` for
  Arrow-backed Index compat
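The `cache.py` fix above boils down to preferring Series methods over raw array access, which works for both NumPy- and Arrow-backed columns. A minimal illustration (plain object dtype here; the repo's columns may be Arrow-backed):

```python
import pandas as pd

# Series.eq() dispatches correctly regardless of the backing array,
# and int() coerces the count to a plain Python int.
col = pd.Series(["a", "b", "a"])
n_matches = int(col.eq("a").sum())
print(n_matches)  # 2
```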

## Misc
- `.gitignore`: add `.DS_Store` and `.pytest_cache/`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>