
Add split-and-retry path to GpuProjectExec#14724

Open
thirtiseven wants to merge 4 commits into NVIDIA:main from thirtiseven:project_split_retry

Conversation

@thirtiseven
Collaborator

Fixes #14191.

Description

GpuProjectExec currently retries OOMs only via withRetryNoSplit. If a projection runs cuDF kernels with internal scratch allocations that the pre-split estimator cannot see (regex, string replace, etc.), an OOM during the projection cannot be recovered and the task fails.

This PR adds a split-and-retry path: for purely deterministic projections, GpuProjectExec now drives the projection through RmmRapidsRetryIterator.withRetry(splitSpillableInHalfByRows). On GPU OOM the input batch is halved by rows and the projection is re-run on each half; the resulting sub-batches are concatenated back via ConcatAndConsumeAll.buildNonEmptyBatchFromTypes to preserve the single-batch contract of projectAndCloseWithRetrySingleBatch.
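The split-on-OOM control flow can be sketched as a small, self-contained model. SimulatedOOM and the Seq-of-rows "batch" below stand in for GpuSplitAndRetryOOM and SpillableColumnarBatch; none of these names are the plugin's real API, and this ignores spillability and GPU memory entirely.

```scala
// Hypothetical model of split-and-retry: halve the batch on OOM, recurse on
// each half, concatenate results in row order.
final case class SimulatedOOM(msg: String = "GPU OOM") extends RuntimeException(msg)

object SplitRetryModel {
  // Mirrors the role of splitSpillableInHalfByRows: halve the batch by rows.
  def splitInHalfByRows[A](batch: Seq[A]): Seq[Seq[A]] = {
    require(batch.size > 1, "cannot split a single-row batch further")
    val half = batch.size / 2
    Seq(batch.take(half), batch.drop(half))
  }

  // Run `project` over the batch; on OOM, split and retry each half, then
  // concatenate the sub-results (the role buildNonEmptyBatchFromTypes plays
  // in the PR). A single-row batch that still OOMs propagates the error.
  def runWithSplitRetry[A, B](batch: Seq[A])(project: Seq[A] => Seq[B]): Seq[B] =
    try project(batch)
    catch {
      case _: SimulatedOOM if batch.size > 1 =>
        splitInHalfByRows(batch).flatMap(half => runWithSplitRetry(half)(project))
    }
}
```

For example, a projection that "OOMs" on any batch larger than two rows forces an eight-row input to be split down to four two-row pieces, and the concatenated output still matches a single-batch projection.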

Mixed deterministic + non-deterministic projections fall through to the existing withRetryNoSplit path: the non-deterministic side is computed once on the full input batch and stitched row-by-row to the deterministic side, and row-splitting either side would break that alignment. Both GpuProjectExec.projectWithRetrySingleBatch and GpuTieredProject.projectWithRetrySingleBatchInternal dispatch on the same condition (forall(_.deterministic) / areAllDeterministic).

A new internal config spark.rapids.sql.projectExec.splitRetry.enabled (default true) gates the new path so it can be disabled to revert to the prior behavior if regressions surface.
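A hedged sketch of how such a gate is typically registered in RapidsConf, following the builder style used elsewhere in that file; the doc string and value name here are illustrative, not copied from the PR.

```scala
// Illustrative registration of the internal gate (not the PR's exact code).
val PROJECT_SPLIT_RETRY_ENABLED = conf("spark.rapids.sql.projectExec.splitRetry.enabled")
  .doc("Enables the split-and-retry OOM recovery path in GpuProjectExec.")
  .internal()
  .booleanConf
  .createWithDefault(true)
```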

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

The new config is internal() and the public behavior is unchanged in non-OOM cases.

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

tests/src/test/scala/com/nvidia/spark/rapids/ProjectSplitRetrySuite.scala adds five cases covering both dispatch sites:

  • split-retry on GpuProjectExec.projectAndCloseWithRetrySingleBatch with forceSplitAndRetryOOM produces output equal to a single-batch projection
  • with the conf disabled, the same injection propagates GpuSplitAndRetryOOM through the legacy path
  • the same coverage on GpuTieredProject.projectAndCloseWithRetrySingleBatch
  • a mixed deterministic + GpuRand projection routes through the legacy retry path (verified by comparing the rand column to a non-injected reference run)
  • a plain forceRetryOOM on the new path returns a single piece without splitting

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

TBD — perf testing to follow.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR adds a split-and-retry OOM recovery path to GpuProjectExec and GpuTieredProject for purely deterministic projections, enabling recovery from cuDF-internal scratch allocations (regex, string-replace, etc.) that the pre-split estimator cannot see. A new internal config spark.rapids.sql.projectExec.splitRetry.enabled (default true) gates the feature.

  • P1 – double-close of sb in GpuProjectExec.projectWithRetrySingleBatch: when the split-retry path is taken, runWithSplitRetry (via withRetry) takes ownership of sb and closes it when the iterator is drained, but the outer withResource(sb) in projectAndCloseWithRetrySingleBatch closes sb a second time. This drives SpillableColumnarBatchImpl.refCount below zero. The analogous GpuTieredProject path correctly calls sb.incRefCount() before surrendering ownership; the same fix is needed here.

Confidence Score: 3/5

Not safe to merge as-is: the GpuProjectExec split-retry path double-closes the input SpillableColumnarBatch, corrupting its reference count.

A P1 resource-lifecycle bug (double-close / ref-count corruption) is present in the GpuProjectExec dispatch path. The analogous GpuTieredProject path already applies the correct incRefCount fix, so the GpuProjectExec path missed the same treatment. The P1 pulls the score below the 4/5 ceiling.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala — specifically GpuProjectExec.projectWithRetrySingleBatch split-retry early-return and the projectAndCloseWithRetrySingleBatch wrapper around it.

Important Files Changed

  • sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala: Adds the split-retry path to GpuProjectExec and GpuTieredProject; contains a P1 double-close of sb in the GpuProjectExec path and a P2 per-batch RapidsConf allocation.
  • sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala: Adds the PROJECT_SPLIT_RETRY_ENABLED internal config with correct .internal(), boolean type, and default true; the accessor isProjectSplitRetryEnabled is placed in the correct region.
  • tests/src/test/scala/com/nvidia/spark/rapids/ProjectSplitRetrySuite.scala: New unit test suite with five cases covering both dispatch sites; resource cleanup in afterEach is correct, but it contains a misleading comment about a non-existent early-return in runWithSplitRetry.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant PAndC as projectAndCloseWithRetrySingleBatch
    participant PWR as projectWithRetrySingleBatch
    participant RSR as runWithSplitRetry
    participant WR as withRetry(sb)
    participant BNE as buildNonEmptyBatchFromTypes

    Caller->>PAndC: sb (refCount=1)
    PAndC->>PAndC: withResource(sb)
    PAndC->>PWR: sb
    alt All deterministic & splitRetry enabled
        PWR->>RSR: sb (takes ownership)
        RSR->>WR: sb
        loop For each sub-batch (split on OOM)
            WR->>RSR: spillable
            RSR->>RSR: runProject(cb) → pieces += result
        end
        Note over WR: closes sb (refCount→0) ⚠️
        RSR->>BNE: pieces.toArray
        BNE-->>RSR: concatenated ColumnarBatch
        RSR-->>PWR: ColumnarBatch
        PWR-->>PAndC: ColumnarBatch
        Note over PAndC: withResource closes sb again (refCount→-1) ⚠️ BUG
    else Mixed / non-deterministic
        PWR->>PWR: withRetryNoSplit
        PWR-->>PAndC: ColumnarBatch
        Note over PAndC: withResource closes sb correctly (refCount→0)
    end
    PAndC-->>Caller: ColumnarBatch

Reviews (1): Last reviewed commit: "simplify"

Comment on lines +175 to +178
if (new RapidsConf(SQLConf.get).isProjectSplitRetryEnabled &&
    boundExprs.forall(_.deterministic)) {
  val retryables = GpuExpressionsUtils.collectRetryables(boundExprs)
  return runWithSplitRetry(sb, retryables, project(_, boundExprs))

P1 Double-close of sb in the split-retry path

runWithSplitRetry passes sb to withRetry, which takes ownership and closes sb once the iterator is drained (confirmed by the docstring: "This function will close the elements of input as fn is successfully invoked"). However, the caller projectAndCloseWithRetrySingleBatch wraps the whole call in withResource(sb) { _ => … }, which calls sb.close() a second time after projectWithRetrySingleBatch returns. This decrements SpillableColumnarBatchImpl.refCount from 0 to -1. The double-free guard is currently commented out (see issue #10161), so no exception is thrown, but the ref-count state is corrupted.

Compare with the analogous fix already applied in GpuTieredProject.projectWithRetrySingleBatchInternal: when closeInputBatch=false, sb.incRefCount() is called before passing to runWithSplitRetry to compensate for the extra close. The same pattern is needed here — call runWithSplitRetry(sb.incRefCount(), retryables, ...) so that withRetry holds one reference and the outer withResource(sb) holds the other.
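The ownership balance that incRefCount restores can be modeled in a few lines; RefCounted stands in for SpillableColumnarBatch, and consumeLikeWithRetry models withRetry taking ownership of its input. These names are illustrative, not the plugin's API.

```scala
// Minimal model of the double-close fix: two owners (withRetry and the outer
// withResource) must each hold their own reference before each closes once.
final class RefCounted {
  var refCount: Int = 1
  def incRefCount(): this.type = { refCount += 1; this }
  def close(): Unit = {
    require(refCount > 0, s"double close: refCount=$refCount")
    refCount -= 1
  }
}

// Models withRetry: takes ownership of its input and closes it when drained.
def consumeLikeWithRetry(sb: RefCounted): Unit = sb.close()
```

Without the incRefCount, calling consumeLikeWithRetry(sb) and then sb.close() would drive the count from 1 to -1; with it, both closes balance to zero.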



Development

Successfully merging this pull request may close these issues.

[FEA] GpuProjectExec support for split-retry to handle OOMs.
