
Add split-and-retry path to GpuProjectExec#14724

Open
thirtiseven wants to merge 4 commits into NVIDIA:main from thirtiseven:project_split_retry

Conversation

@thirtiseven
Collaborator

Fixes #14191.

Description

GpuProjectExec currently retries OOMs only via withRetryNoSplit. If a projection runs cuDF kernels with internal scratch allocations that the pre-split estimator cannot see (regex, string replace, etc.), an OOM during the projection cannot be recovered and the task fails.

This PR adds a split-and-retry path: for purely deterministic projections, GpuProjectExec now drives the projection through RmmRapidsRetryIterator.withRetry(splitSpillableInHalfByRows). On GPU OOM the input batch is halved by rows and the projection is re-run on each half; the resulting sub-batches are concatenated back via ConcatAndConsumeAll.buildNonEmptyBatchFromTypes to preserve the single-batch contract of projectAndCloseWithRetrySingleBatch.
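The split-on-OOM control flow can be sketched as a small, self-contained model. SimulatedOOM and the Seq-of-rows "batch" below stand in for GpuSplitAndRetryOOM and SpillableColumnarBatch; none of these names are the plugin's real API, and this ignores spillability and GPU memory entirely.

```scala
// Hypothetical model of split-and-retry: halve the batch on OOM, recurse on
// each half, concatenate results in row order.
final case class SimulatedOOM(msg: String = "GPU OOM") extends RuntimeException(msg)

object SplitRetryModel {
  // Mirrors the role of splitSpillableInHalfByRows: halve the batch by rows.
  def splitInHalfByRows[A](batch: Seq[A]): Seq[Seq[A]] = {
    require(batch.size > 1, "cannot split a single-row batch further")
    val half = batch.size / 2
    Seq(batch.take(half), batch.drop(half))
  }

  // Run `project` over the batch; on OOM, split and retry each half, then
  // concatenate the sub-results (the role buildNonEmptyBatchFromTypes plays
  // in the PR). A single-row batch that still OOMs propagates the error.
  def runWithSplitRetry[A, B](batch: Seq[A])(project: Seq[A] => Seq[B]): Seq[B] =
    try project(batch)
    catch {
      case _: SimulatedOOM if batch.size > 1 =>
        splitInHalfByRows(batch).flatMap(half => runWithSplitRetry(half)(project))
    }
}
```

For example, a projection that "OOMs" on any batch larger than two rows forces an eight-row input to be split down to four two-row pieces, and the concatenated output still matches a single-batch projection.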

Mixed deterministic + non-deterministic projections fall through to the existing withRetryNoSplit path: the non-deterministic side is computed once on the full input batch and stitched row-by-row to the deterministic side, and row-splitting either side would break that alignment. Both GpuProjectExec.projectWithRetrySingleBatch and GpuTieredProject.projectWithRetrySingleBatchInternal dispatch on the same condition (forall(_.deterministic) / areAllDeterministic).

A new internal config spark.rapids.sql.projectExec.splitRetry.enabled (default true) gates the new path so it can be disabled to revert to the prior behavior if regressions surface.
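A hedged sketch of how such a gate is typically registered in RapidsConf, following the builder style used elsewhere in that file; the doc string and value name here are illustrative, not copied from the PR.

```scala
// Illustrative registration of the internal gate (not the PR's exact code).
val PROJECT_SPLIT_RETRY_ENABLED = conf("spark.rapids.sql.projectExec.splitRetry.enabled")
  .doc("Enables the split-and-retry OOM recovery path in GpuProjectExec.")
  .internal()
  .booleanConf
  .createWithDefault(true)
```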

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

The new config is internal() and the public behavior is unchanged in non-OOM cases.

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

tests/src/test/scala/com/nvidia/spark/rapids/ProjectSplitRetrySuite.scala adds five cases covering both dispatch sites:

  • split-retry on GpuProjectExec.projectAndCloseWithRetrySingleBatch with forceSplitAndRetryOOM produces output equal to a single-batch projection
  • with the conf disabled, the same injection propagates GpuSplitAndRetryOOM through the legacy path
  • the same coverage on GpuTieredProject.projectAndCloseWithRetrySingleBatch
  • a mixed deterministic + GpuRand projection routes through the legacy retry path (verified by comparing the rand column to a non-injected reference run)
  • a plain forceRetryOOM on the new path returns a single piece without splitting

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

TBD — perf testing to follow.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 5, 2026

Greptile Summary

This PR adds a split-and-retry OOM recovery path to GpuProjectExec and GpuTieredProject for purely deterministic projections, enabling recovery from cuDF-internal scratch allocations (regex, string-replace, etc.) that the pre-split estimator cannot see. A new internal config spark.rapids.sql.projectExec.splitRetry.enabled (default true) gates the feature.

  • P1 – double-close of sb in GpuProjectExec.projectWithRetrySingleBatch: when the split-retry path is taken, runWithSplitRetry (via withRetry) takes ownership of sb and closes it when the iterator is drained, but the outer withResource(sb) in projectAndCloseWithRetrySingleBatch closes sb a second time. This drives SpillableColumnarBatchImpl.refCount below zero. The analogous GpuTieredProject path correctly calls sb.incRefCount() before surrendering ownership; the same fix is needed here.

Confidence Score: 3/5

Not safe to merge as-is: the GpuProjectExec split-retry path double-closes the input SpillableColumnarBatch, corrupting its reference count.

A P1 resource-lifecycle bug (double-close / ref-count corruption) is present in the GpuProjectExec dispatch path. The analogous GpuTieredProject path already applies the correct incRefCount fix, so the GpuProjectExec path missed the same treatment. The P1 pulls the score below the 4/5 ceiling.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala — specifically GpuProjectExec.projectWithRetrySingleBatch split-retry early-return and the projectAndCloseWithRetrySingleBatch wrapper around it.

Important Files Changed

  • sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala: Adds the split-retry path to GpuProjectExec and GpuTieredProject; contains a P1 double-close of sb in the GpuProjectExec path and a P2 per-batch RapidsConf allocation.
  • sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala: Adds the PROJECT_SPLIT_RETRY_ENABLED internal config with correct .internal(), boolean type, and default true; the accessor isProjectSplitRetryEnabled is placed in the correct region.
  • tests/src/test/scala/com/nvidia/spark/rapids/ProjectSplitRetrySuite.scala: New unit test suite with five cases covering both dispatch sites; resource cleanup in afterEach is correct, but it contains a misleading comment about a non-existent early-return in runWithSplitRetry.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant PAndC as projectAndCloseWithRetrySingleBatch
    participant PWR as projectWithRetrySingleBatch
    participant RSR as runWithSplitRetry
    participant WR as withRetry(sb)
    participant BNE as buildNonEmptyBatchFromTypes

    Caller->>PAndC: sb (refCount=1)
    PAndC->>PAndC: withResource(sb)
    PAndC->>PWR: sb
    alt All deterministic & splitRetry enabled
        PWR->>RSR: sb (takes ownership)
        RSR->>WR: sb
        loop For each sub-batch (split on OOM)
            WR->>RSR: spillable
            RSR->>RSR: runProject(cb) → pieces += result
        end
        Note over WR: closes sb (refCount→0) ⚠️
        RSR->>BNE: pieces.toArray
        BNE-->>RSR: concatenated ColumnarBatch
        RSR-->>PWR: ColumnarBatch
        PWR-->>PAndC: ColumnarBatch
        Note over PAndC: withResource closes sb again (refCount→-1) ⚠️ BUG
    else Mixed / non-deterministic
        PWR->>PWR: withRetryNoSplit
        PWR-->>PAndC: ColumnarBatch
        Note over PAndC: withResource closes sb correctly (refCount→0)
    end
    PAndC-->>Caller: ColumnarBatch

Reviews (1): Last reviewed commit: "simplify"

Comment on lines +175 to +178
if (new RapidsConf(SQLConf.get).isProjectSplitRetryEnabled &&
    boundExprs.forall(_.deterministic)) {
  val retryables = GpuExpressionsUtils.collectRetryables(boundExprs)
  return runWithSplitRetry(sb, retryables, project(_, boundExprs))

P1 Double-close of sb in the split-retry path

runWithSplitRetry passes sb to withRetry, which takes ownership and closes sb once the iterator is drained (confirmed by the docstring: "This function will close the elements of input as fn is successfully invoked"). However, the caller projectAndCloseWithRetrySingleBatch wraps the whole call in withResource(sb) { _ => … }, which calls sb.close() a second time after projectWithRetrySingleBatch returns. This decrements SpillableColumnarBatchImpl.refCount from 0 to -1. The double-free guard is currently commented out (see issue #10161), so no exception is thrown, but the ref-count state is corrupted.

Compare with the analogous fix already applied in GpuTieredProject.projectWithRetrySingleBatchInternal: when closeInputBatch=false, sb.incRefCount() is called before passing to runWithSplitRetry to compensate for the extra close. The same pattern is needed here — call runWithSplitRetry(sb.incRefCount(), retryables, ...) so that withRetry holds one reference and the outer withResource(sb) holds the other.
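The ownership balance that incRefCount restores can be modeled in a few lines; RefCounted stands in for SpillableColumnarBatch, and consumeLikeWithRetry models withRetry taking ownership of its input. These names are illustrative, not the plugin's API.

```scala
// Minimal model of the double-close fix: two owners (withRetry and the outer
// withResource) must each hold their own reference before each closes once.
final class RefCounted {
  var refCount: Int = 1
  def incRefCount(): this.type = { refCount += 1; this }
  def close(): Unit = {
    require(refCount > 0, s"double close: refCount=$refCount")
    refCount -= 1
  }
}

// Models withRetry: takes ownership of its input and closes it when drained.
def consumeLikeWithRetry(sb: RefCounted): Unit = sb.close()
```

Without the incRefCount, calling consumeLikeWithRetry(sb) and then sb.close() would drive the count from 1 to -1; with it, both closes balance to zero.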



Development

Successfully merging this pull request may close these issues.

[FEA] GpuProjectExec support for split-retry to handle OOMs.
