
fix: use positional alignment in contrib functions to prevent cartesian-product OOM#317

Open
drussellmrichie wants to merge 7 commits into larsiusprime:master from drussellmrichie:fix/modeling-positional-align-contrib

Conversation

@drussellmrichie

Problem

Both _contrib_to_unit_values and _add_prediction_to_contribution join df_contrib back to df_base / df using pd.merge(..., on=the_key, how="left"). Under the following combination of conditions, the merge materializes a cartesian-product-sized intermediate and crashes with an OOM:

  1. df_base has a non-sequential index (e.g. after sort_values() — the index retains the pre-sort positions).
  2. The join key ("key" or "key_sale") is backed by PyArrow string dtype, which triggers pandas' slow/non-unique merge path even when the key values are logically unique.
  3. The dataset is large enough that the intermediate cartesian product exhausts memory (observed on a ~421k-row universe: 421k² rows attempted).

The crash surfaces as a MemoryError or silent process kill inside pd.merge before the result is materialized.
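For intuition, here is a toy illustration (hypothetical data, not from this repo) of the row-multiplication mechanism behind a non-unique merge. In this toy the keys really are duplicated; in the PR's case the keys are logically unique, but the Arrow-backed dtype pushes pandas onto the same non-unique code path, which is what makes the intermediate blow up at scale:

```python
import pandas as pd

# Toy example: when the merge engine treats a key as non-unique, each
# left row matches every right row with the same key value, so row
# counts multiply instead of staying 1:1.
df_base = pd.DataFrame({"key": ["a", "a", "b", "b"], "x": range(4)})
df_contrib = pd.DataFrame({"key": ["a", "a", "b", "b"], "c": range(4)})

merged = pd.merge(df_base, df_contrib, on="key", how="left")
print(len(df_base), len(merged))  # 4 rows in, 8 rows out
```

At 421k rows with a degenerate key path, the same mechanism is what attempts a 421k²-row intermediate.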

Why positional alignment is correct

df_contrib is constructed row-by-row from df_base in the same iteration, so the two DataFrames are always in the same row order. A reset_index + pd.concat(..., axis=1) is therefore semantically equivalent to the merge, without the non-unique-join hazard.

A size-mismatch guard falls back to the key-based merge with a UserWarning, preserving correctness in any unexpected edge case.

Fix

  • _contrib_to_unit_values: replace unconditional merge with positional reset_index + concat; fall back to merge with a warning if row counts differ.
  • _add_prediction_to_contribution: same pattern.
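A minimal sketch of the pattern both fixes follow (function and variable names here are illustrative, not the repo's actual code):

```python
import warnings
import pandas as pd

def align_contrib(df_base: pd.DataFrame, df_contrib: pd.DataFrame, key: str) -> pd.DataFrame:
    """Positional concat with a size-mismatch guard that falls back to
    the original key-based merge. Illustrative sketch only."""
    if len(df_base) == len(df_contrib):
        # Same row order by construction: drop both indexes and glue columns.
        left = df_base.reset_index(drop=True)
        right = df_contrib.reset_index(drop=True).drop(columns=[key], errors="ignore")
        return pd.concat([left, right], axis=1)
    warnings.warn(
        "df_base and df_contrib row counts differ; falling back to key-based merge",
        UserWarning,
    )
    return pd.merge(df_base, df_contrib, on=key, how="left")
```

The happy path never touches the merge engine, so the Arrow-string non-unique path can no longer be triggered; the guard preserves the old behavior whenever the positional assumption is violated.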

Reproducing

The crash is difficult to reproduce on a small synthetic dataset because small Arrow string columns typically fall through to pandas' fast unique-key path. It was observed on a real Philadelphia OPA dataset (~421k parcels). No minimal repro is included with this PR; if a test fixture is needed we can work on one together.

@github-actions

github-actions Bot commented Mar 31, 2026

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.
A maintainer will verify your signature and confirm it here by commenting with the following sentence:


I affirm that this contributor has signed the CLA


0 out of 3 committers have signed the CLA.
@claude
@drussellmrichie
@russell Richie
Russell Richie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

claude and others added 5 commits April 2, 2026 01:55
…rvice

Replace slow row-by-row apply() with vectorized geopandas operations:
build a GeoSeries of building geometries aligned with the joined dataframe,
then call .intersection() and .area in one pass instead of per-row.
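A rough sketch of the vectorized idea using shapely 2.x array primitives (the geometries below are made up, and the actual commit works through geopandas GeoSeries rather than raw shapely arrays):

```python
import numpy as np
from shapely import area, box, intersection

# Build positionally aligned geometry arrays, then compute all pairwise
# intersection areas in a single vectorized call instead of per-row apply().
parcels = np.array([box(0, 0, 2, 2), box(0, 0, 3, 3)])
buildings = np.array([box(1, 1, 4, 4), box(0, 0, 1, 1)])

areas = area(intersection(parcels, buildings))
print(areas.tolist())  # [1.0, 1.0]
```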

https://claude.ai/code/session_01Dmffw6eWUC7rrE6JD1SXXX
…-Q7c2a

Vectorize building-parcel intersection area calculation in OvertureService
Adds calc_lycd_land_values() to land.py, which values land by:
1. Computing a uniform local land rate per model group from the median
   market value and median lot size of improved properties
2. Applying that rate to every parcel (land_value = rate × lot_size)

This avoids the side-by-side disparity created by applying a flat
allocation % to each parcel's individual market value.

Supports a user-supplied land_alloc (float or per-group dict) or
automatic derivation from vacant vs improved per-unit value ratios.
Also fixes missing area_unit import in land.py.
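A sketch of the LYCD idea on toy data. The exact rate formula is an assumption on my part (allocation share times median improved market value, divided by median improved lot size, per model group); column names and numbers are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "model_group": ["res", "res", "res", "com"],
    "market_value": [200_000, 220_000, 180_000, 1_000_000],
    "lot_size": [1000, 1100, 900, 5000],
    "improved": [True, True, False, True],
})
land_alloc = 0.25  # assumed user-supplied allocation share

# Uniform local land rate per model group from medians of improved parcels.
improved = df[df["improved"]]
med = improved.groupby("model_group").agg(
    mv=("market_value", "median"), lot=("lot_size", "median")
)
rate = land_alloc * med["mv"] / med["lot"]  # value per unit of lot area

# Apply the group rate uniformly: land_value = rate * lot_size.
df["land_value"] = df["model_group"].map(rate) * df["lot_size"]
print(df["land_value"].tolist())
```

Because every parcel in a group shares one rate, two identical neighboring lots get identical land values regardless of how their improvements differ, which is the disparity the flat per-parcel allocation created.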

https://claude.ai/code/session_011PMWoWpQazUGuCXKT1nmVg
…ation-93UpZ

Implement LYCD (Least You Can Do) land valuation method
…diction_to_contribution to prevent cartesian-product OOM

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
drussellmrichie force-pushed the fix/modeling-positional-align-contrib branch from dbc2a9a to 73e2f70 on April 2, 2026 14:29
Russell Richie and others added 2 commits April 3, 2026 15:37
calc_lycd_land_values now accepts a subarea_field parameter (e.g.
"neighborhood", "census_tract") to derive a separate local land rate per
(subarea, model_group) cell instead of one rate per model group across the
entire jurisdiction. Cells below min_improved_per_cell fall back to the
model-group rate. _derive_lycd_alloc_from_data updated to match with
group_field, subarea_field, and alloc_by_group_fallback parameters.
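The cell-level fallback logic can be sketched like this (toy data; the real implementation derives rates from parcel medians rather than a precomputed rate column):

```python
import pandas as pd

df = pd.DataFrame({
    "model_group": ["res"] * 5,
    "neighborhood": ["A", "A", "A", "B", "B"],
    "rate": [40.0, 50.0, 60.0, 80.0, 90.0],  # stand-in per-parcel local rates
})
min_improved_per_cell = 3

# Jurisdiction-wide rate per model group (the fallback).
group_rate = df.groupby("model_group")["rate"].median()

# Rate per (subarea, model_group) cell, kept only where enough
# observations support it; thin cells fall back to the group rate.
cell = df.groupby(["neighborhood", "model_group"])["rate"].agg(["median", "size"])
cell_rate = cell["median"].where(cell["size"] >= min_improved_per_cell)

resolved = cell_rate.reset_index().rename(columns={"median": "cell_rate"})
out = df.merge(resolved, on=["neighborhood", "model_group"], how="left")
out["final_rate"] = out["cell_rate"].fillna(out["model_group"].map(group_rate))
print(out["final_rate"].tolist())
```

Here neighborhood A has three observations and keeps its own rate, while B (two observations) inherits the model-group rate.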

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Split-checkpoint contributions strategy (benchmark, pipeline, modeling)
- Thread `do_contributions` flag through the full call chain: `run_models`,
  `_run_models`, `run_one_model`, `run_one_hedonic_model`, `_predict_one_model`,
  `_run_hedonic_models`, `_write_model_results`, `write_model_parameters`
- Add `_DS` and `_SMRContribContext` helper classes (module-level, picklable)
- `_write_model_results`: save `_smr_for_contribs.pkl` + `_model_features.json`
  when `do_contributions=False`; enables deferred SHAP pass checkpointing
- New `compute_model_contributions(settings)` in pipeline.py: loads saved
  `_smr_for_contribs.pkl` files per model group, runs contributions pass,
  deletes pkl; allows crash recovery without retraining
- `finalize_models`: expose `run_main/vacant/hedonic/ensemble`, `do_shaps`,
  `do_plots`, `do_contributions` as real parameters (were hardcoded True/False)
- `_run_models`: add per-group `ind_vars` override from
  `settings.modeling.models.<mvh>.group_overrides.<group>.ind_vars`
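The checkpoint/recovery shape described above can be sketched as follows. All names here are illustrative stand-ins (the real code saves `_smr_for_contribs.pkl` per model group and runs the SHAP/contributions pass on load):

```python
import pickle
from pathlib import Path

class ContribContext:
    """Stand-in for a module-level, picklable context saved after training."""
    def __init__(self, model_group: str, features: list):
        self.model_group = model_group
        self.features = features

def save_checkpoint(ctx: ContribContext, out_dir: Path) -> Path:
    # Persist the context so the contributions pass can run later,
    # even after a crash, without retraining the model.
    path = out_dir / f"{ctx.model_group}_smr_for_contribs.pkl"
    path.write_bytes(pickle.dumps(ctx))
    return path

def run_deferred_contributions(out_dir: Path) -> dict:
    results = {}
    for path in sorted(out_dir.glob("*_smr_for_contribs.pkl")):
        ctx = pickle.loads(path.read_bytes())
        results[ctx.model_group] = len(ctx.features)  # stand-in for the SHAP pass
        path.unlink()  # delete the checkpoint only once the pass succeeds
    return results
```

Deleting the pickle only after the pass completes is what makes the recovery idempotent: a crash mid-pass leaves the remaining checkpoints on disk for the next run.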

## Bug fixes
- `data.py`: `from scipy.spatial import cKDTree` (private submodule removed
  in scipy >= 1.14)
- `data.py`: move `astype(int)` cast to after `.loc` override in
  `_basic_geo_enrichment` — pandas 2.x rejects float→int64 via `.loc`
- `sales_scrutiny_study.py`: go through `.astype(object)` before `.astype(str)`
  to avoid ArrowNotImplementedError on null-dtype columns (pyarrow >= 22)
- `shap_analysis.py`: warn + filter instead of raising ValueError when
  `list_vars` contains features dropped by variable selection
- `tuning.py`: guard `_lightgbm_rolling_origin_cv` against n_samples < n_splits
- `utilities/cache.py`: replace `col.values == ...` with `col.eq(...)` and
  wrap `.sum()` in `int()` — ArrowExtensionArray has no `.values` sum
- `utilities/stats.py`: multiple `calc_correlations` guards for sparse/small
  groups — empty naive_corr early-exit, all-NA score break, `len(remaining)
  <= 1` early return with correct schema, `.columns.tolist()[0]` for
  Arrow-backed Index compat
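The `cache.py` fix above boils down to preferring Series methods over raw array access, which works for both NumPy- and Arrow-backed columns. A minimal illustration (plain object dtype here; the repo's columns may be Arrow-backed):

```python
import pandas as pd

# Series.eq() dispatches correctly regardless of the backing array,
# and int() coerces the count to a plain Python int.
col = pd.Series(["a", "b", "a"])
n_matches = int(col.eq("a").sum())
print(n_matches)  # 2
```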

## Misc
- `.gitignore`: add `.DS_Store` and `.pytest_cache/`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>