Describe the bug
Build: T***-Spark2a-Raplab/208
A fatal CUDA error (cudaErrorIllegalAddress, code 700) is thrown from KudoGpuTableOperator.concat during a GPU shuffle coalesce step, which causes the RapidsExecutorPlugin to stop the executor. The illegal memory access is triggered inside BaseDeviceMemoryBuffer.copyFromHostBuffer while copying a host buffer to device memory as part of Kudo table concatenation. A leaked HostColumnVector is also reported during shutdown, suggesting resource lifecycle issues on the failure path.
The same workload ran successfully on job run 202 (2026-03-07). The failure started appearing on a subsequent run without any user-side code or data changes, which points to a regression in the plugin or its dependencies between these two runs.
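For context on why this failure mode is hard to catch in the JVM: `Cuda.memcpy` performs no bounds checking, so a host buffer larger than the destination device buffer surfaces only later as `cudaErrorIllegalAddress` rather than a catchable exception. The plain-Java analogue below (using `ByteBuffer` as a stand-in for the cuDF host/device buffer types, with a hypothetical `checkedCopy` guard not present in the real code path) illustrates the kind of size mismatch suspected here, caught safely:

```java
import java.nio.ByteBuffer;

public class CopySizeCheck {
    // Hypothetical guard: validate lengths before issuing an unchecked copy.
    // On the GPU path there is no such check, so the same mismatch becomes
    // an illegal memory access (CUDA error 700) inside the native memcpy.
    static void checkedCopy(ByteBuffer dst, ByteBuffer src) {
        if (src.remaining() > dst.remaining()) {
            throw new IllegalArgumentException(
                "copy of " + src.remaining() + " bytes exceeds destination capacity "
                + dst.remaining());
        }
        dst.put(src);
    }

    public static void main(String[] args) {
        ByteBuffer device = ByteBuffer.allocate(64);  // stands in for a DeviceMemoryBuffer
        ByteBuffer host = ByteBuffer.allocate(128);   // stands in for a HostMemoryBuffer
        try {
            checkedCopy(device, host);
        } catch (IllegalArgumentException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

This is illustrative only; it does not reproduce the crash, but it shows why a Kudo concat that mis-computes a partition's serialized length would fail exactly at `copyFromHostBuffer` with no earlier diagnostic.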
Expected behavior
The shuffle coalesce step should successfully copy host buffers to device memory and produce a concatenated GPU table without triggering a CUDA illegal address access. The executor should not be killed.
Environment details (please complete the following information)
- Environment location: Spark2a
- GPU: NVIDIA A100-PCIE-40GB, Driver 535.183.01, CUDA 12.2
- Relevant Spark / RAPIDS configuration (suspected to be related):
  - Kudo shuffle serializer enabled (default path via KudoGpuTableOperator)
  - GpuOutOfCoreSort active (out-of-core sort path engaged)
- GPU memory was near capacity at crash time (≈39.6 GiB / 40 GiB used), so RMM spill / retry paths were likely active
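As a triage step, it may help to force the shuffle coalesce off the Kudo path and re-run the workload; if the crash disappears, the regression is localized to the Kudo serializer. The config key below is an assumption based on recent spark-rapids releases and should be verified against the configuration docs for the deployed plugin version:

```shell
# Hypothetical triage run -- verify the exact key name against the
# spark-rapids configuration reference for this plugin version.
spark-submit \
  --conf spark.rapids.shuffle.kudo.serializer.enabled=false \
  <application jar and args>
```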
Additional context
Key stack trace:
26/04/18 10:31:32 ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA
error: ai.rapids.cudf.CudaFatalException: Fatal CUDA error encountered at:
/home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-31-
cuda12/thirdparty/cudf/java/src/main/native/src/CudaJni.cpp:406: 700 cudaErrorIllegalAddress an illegal memory
access was encountered
at ai.rapids.cudf.Cuda.memcpyOnStream(Native Method)
at ai.rapids.cudf.Cuda.memcpy(Cuda.java:532)
at ai.rapids.cudf.Cuda.memcpy(Cuda.java:304)
at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:32)
at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:102)
at com.nvidia.spark.rapids.KudoGpuTableOperator.$anonfun$concat$15(GpuShuffleCoalesceExec.scala:452)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.KudoGpuTableOperator.$anonfun$concat$13(GpuShuffleCoalesceExec.scala:450)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.KudoGpuTableOperator.$anonfun$concat$11(GpuShuffleCoalesceExec.scala:437)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:57)
at com.nvidia.spark.rapids.KudoGpuTableOperator.concat(GpuShuffleCoalesceExec.scala:431)
at com.nvidia.spark.rapids.KudoGpuTableOperator.concat(GpuShuffleCoalesceExec.scala:413)
at com.nvidia.spark.rapids.CoalesceIteratorBase.$anonfun$concatenateTables$2(GpuShuffleCoalesceExec.scala:570)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:581)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:744)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:629)
at com.nvidia.spark.rapids.CoalesceIteratorBase.$anonfun$concatenateTables$1(GpuShuffleCoalesceExec.scala:574)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.CoalesceIteratorBase.concatenateTables(GpuShuffleCoalesceExec.scala:560)
at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.concatenateTablesInGpu(GpuShuffleCoalesceExec.scala:780)
at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.$anonfun$next$7(GpuShuffleCoalesceExec.scala:808)
at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.next(GpuShuffleCoalesceExec.scala:807)
at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.next(GpuShuffleCoalesceExec.scala:763)
at com.nvidia.spark.rapids.GpuMetricIteratorBase.$anonfun$next$8(GpuShuffleCoalesceExec.scala:912)
Additional observations:
- A leaked host column vector was reported just before shutdown:
ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 3041998)
- GPU ECC counters are all zero and GPU temperature is normal (36 °C), so hardware faults are unlikely.
- Last known-good run: job run 202 on 2026-03-07. No user-side changes since then — regression is suspected within the RAPIDS plugin / cuDF / JNI stack.