[BUG] Fatal cudaErrorIllegalAddress error occurred in CI test job #14630

@yinqingh

Description

Describe the bug

Build: T***-Spark2a-Raplab/208

A fatal CUDA error (cudaErrorIllegalAddress, code 700) is thrown from KudoGpuTableOperator.concat during a GPU shuffle coalesce step, which causes the RapidsExecutorPlugin to stop the executor. The illegal memory access is triggered inside BaseDeviceMemoryBuffer.copyFromHostBuffer while copying a host buffer to device memory as part of Kudo table concatenation. A leaked HostColumnVector is also reported during shutdown, suggesting resource lifecycle issues on the failure path.

The same workload ran successfully on job run 202 (2026-03-07). The failure started appearing on a subsequent run without any user-side code or data changes, which points to a regression in the plugin or its dependencies between these two runs.
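
For reference, the faulting frame is cuDF's host-to-device buffer copy. A minimal standalone sketch of that call pattern (assuming only the public cuDF Java API; this is not the plugin's actual Kudo concat code) looks like:

```scala
import ai.rapids.cudf.{DeviceMemoryBuffer, HostMemoryBuffer}

object HostToDeviceCopySketch {
  def main(args: Array[String]): Unit = {
    val len = 1024L
    val host = HostMemoryBuffer.allocate(len)
    try {
      // Fill the host buffer with some bytes.
      var i = 0L
      while (i < len) {
        host.setByte(i, (i & 0xFFL).toByte)
        i += 1
      }
      val dev = DeviceMemoryBuffer.allocate(len)
      try {
        // This is the kind of call that faults in the trace below: a
        // host-to-device memcpy. A stale or undersized device pointer here
        // surfaces as cudaErrorIllegalAddress (700) rather than a
        // Java-level exception.
        dev.copyFromHostBuffer(host)
      } finally {
        dev.close()
      }
    } finally {
      host.close()
    }
  }
}
```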

Expected behavior

The shuffle coalesce step should copy host buffers to device memory and produce a concatenated GPU table without triggering an illegal device memory access. The executor should not be shut down.

Environment details (please complete the following information)

  • Environment location: Spark2a
  • GPU: NVIDIA A100-PCIE-40GB, Driver 535.183.01, CUDA 12.2
  • Relevant Spark / RAPIDS configuration (suspected related):
    • Kudo shuffle serializer enabled (default path via KudoGpuTableOperator)
    • GpuOutOfCoreSort active (out-of-core sort path engaged)
    • GPU memory was near capacity at crash time (≈39.6 GiB / 40 GiB used), so RMM spill / retry paths were likely active (see the sketch after this list)
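
Because the GPU was nearly full, one useful diagnostic is to log device memory pressure right before large copies. This sketch assumes only the public cuDF Java binding Cuda.memGetInfo(); it is an illustration, not something the plugin does at this point:

```scala
import ai.rapids.cudf.Cuda

object GpuMemPressureCheck {
  def main(args: Array[String]): Unit = {
    // Cuda.memGetInfo() wraps cudaMemGetInfo and reports free/total bytes
    // for the current device.
    val info = Cuda.memGetInfo()
    val usedGiB = (info.total - info.free).toDouble / (1L << 30)
    val totalGiB = info.total.toDouble / (1L << 30)
    println(f"GPU memory used: $usedGiB%.1f GiB / $totalGiB%.1f GiB")
  }
}
```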

Additional context

Key stack trace:

26/04/18 10:31:32 ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: ai.rapids.cudf.CudaFatalException: Fatal CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-31-cuda12/thirdparty/cudf/java/src/main/native/src/CudaJni.cpp:406: 700 cudaErrorIllegalAddress an illegal memory access was encountered
	at ai.rapids.cudf.Cuda.memcpyOnStream(Native Method)
	at ai.rapids.cudf.Cuda.memcpy(Cuda.java:532)
	at ai.rapids.cudf.Cuda.memcpy(Cuda.java:304)
	at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:32)
	at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:102)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.$anonfun$concat$15(GpuShuffleCoalesceExec.scala:452)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.$anonfun$concat$13(GpuShuffleCoalesceExec.scala:450)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.$anonfun$concat$11(GpuShuffleCoalesceExec.scala:437)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:57)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.concat(GpuShuffleCoalesceExec.scala:431)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.concat(GpuShuffleCoalesceExec.scala:413)
	at com.nvidia.spark.rapids.CoalesceIteratorBase.$anonfun$concatenateTables$2(GpuShuffleCoalesceExec.scala:570)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:581)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:744)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:629)
	at com.nvidia.spark.rapids.CoalesceIteratorBase.$anonfun$concatenateTables$1(GpuShuffleCoalesceExec.scala:574)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.CoalesceIteratorBase.concatenateTables(GpuShuffleCoalesceExec.scala:560)
	at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.concatenateTablesInGpu(GpuShuffleCoalesceExec.scala:780)
	at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.$anonfun$next$7(GpuShuffleCoalesceExec.scala:808)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.next(GpuShuffleCoalesceExec.scala:807)
	at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.next(GpuShuffleCoalesceExec.scala:763)
	at com.nvidia.spark.rapids.GpuMetricIteratorBase.$anonfun$next$8(GpuShuffleCoalesceExec.scala:912)
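
The repeated Arm$.withResource frames are the plugin's try/finally resource helper. A simplified stand-in (shape assumed from the trace; the real implementation lives in com.nvidia.spark.rapids.Arm) shows why a CudaFatalException thrown mid-copy still closes the enclosing buffers before propagating:

```scala
object ArmSketch {
  // Run the block against the resource, then close the resource even if the
  // block throws. A fatal CUDA error raised inside `block` therefore still
  // releases `r` on the way up the stack.
  def withResource[T <: AutoCloseable, V](r: T)(block: T => V): V =
    try block(r) finally r.close()
}
```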

Additional observations:

  • A leaked host column vector was reported just before shutdown (see the sketch after this list):
    ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 3041998)
  • GPU ECC counters are all zero and GPU temperature is normal (36 °C), so hardware faults are unlikely.
  • Last known-good run: job run 202 on 2026-03-07. No user-side changes have landed since then, so a regression is suspected within the RAPIDS plugin / cuDF / JNI stack.
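
The leak report is consistent with an exception path that skips close(). A hypothetical illustration (names and control flow are assumed for the sketch, not taken from the failing code):

```scala
import ai.rapids.cudf.HostColumnVector

object LeakSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical failure path: the vector is created, the copy throws a
    // fatal CUDA error, and nothing closes the vector. cuDF's cleanup then
    // reports "A HOST COLUMN VECTOR WAS LEAKED" at shutdown.
    val hcv = HostColumnVector.fromInts(1, 2, 3)
    try {
      throw new RuntimeException("simulated cudaErrorIllegalAddress on the copy path")
    } catch {
      case _: RuntimeException => () // error handled, but hcv.close() never runs
    }
    // Wrapping hcv in a withResource / try-finally would prevent the leak report.
  }
}
```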

Labels

bot_watch (Slack bot watched issue for LLM analyzer), bug (Something isn't working)
