[BUG] Fatal cudaErrorIllegalAddress error occurred in CI test job #14630

@yinqingh

Description

Describe the bug

Build: T***-Spark2a-Raplab/208

A fatal CUDA error (cudaErrorIllegalAddress, code 700) is thrown from KudoGpuTableOperator.concat during a GPU shuffle coalesce step, which causes the RapidsExecutorPlugin to stop the executor. The illegal memory access is triggered inside BaseDeviceMemoryBuffer.copyFromHostBuffer while copying a host buffer to device memory as part of Kudo table concatenation. A leaked HostColumnVector is also reported during shutdown, suggesting resource lifecycle issues on the failure path.

The same workload ran successfully on job run 202 (2026-03-07). The failure started appearing on a subsequent run without any user-side code or data changes, which points to a regression in the plugin or its dependencies between these two runs.
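
For reference, the faulting frame is cuDF's host-to-device buffer copy. A minimal standalone sketch of that call pattern (assuming only the public cuDF Java API; this is not the plugin's actual Kudo concat code) looks like:

```scala
import ai.rapids.cudf.{DeviceMemoryBuffer, HostMemoryBuffer}

object HostToDeviceCopySketch {
  def main(args: Array[String]): Unit = {
    val len = 1024L
    val host = HostMemoryBuffer.allocate(len)
    try {
      // Fill the host buffer with some bytes.
      var i = 0L
      while (i < len) {
        host.setByte(i, (i & 0xFFL).toByte)
        i += 1
      }
      val dev = DeviceMemoryBuffer.allocate(len)
      try {
        // This is the kind of call that faults in the trace below: a
        // host-to-device memcpy. A stale or undersized device pointer here
        // surfaces as cudaErrorIllegalAddress (700) rather than a
        // Java-level exception.
        dev.copyFromHostBuffer(host)
      } finally {
        dev.close()
      }
    } finally {
      host.close()
    }
  }
}
```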

Expected behavior

The shuffle coalesce step should copy host buffers to device memory and produce a concatenated GPU table without triggering an illegal device memory access. The executor should not be shut down.

Environment details (please complete the following information)

  • Environment location: Spark2a
  • GPU: NVIDIA A100-PCIE-40GB, Driver 535.183.01, CUDA 12.2
  • Relevant Spark / RAPIDS configuration (suspected related):
    • Kudo shuffle serializer enabled (default path via KudoGpuTableOperator)
    • GpuOutOfCoreSort active (out-of-core sort path engaged)
    • GPU memory was near capacity at crash time (≈39.6 GiB / 40 GiB used), so RMM spill / retry paths were likely active (see the sketch after this list)
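
Because the GPU was nearly full, one useful diagnostic is to log device memory pressure right before large copies. This sketch assumes only the public cuDF Java binding Cuda.memGetInfo(); it is an illustration, not something the plugin does at this point:

```scala
import ai.rapids.cudf.Cuda

object GpuMemPressureCheck {
  def main(args: Array[String]): Unit = {
    // Cuda.memGetInfo() wraps cudaMemGetInfo and reports free/total bytes
    // for the current device.
    val info = Cuda.memGetInfo()
    val usedGiB = (info.total - info.free).toDouble / (1L << 30)
    val totalGiB = info.total.toDouble / (1L << 30)
    println(f"GPU memory used: $usedGiB%.1f GiB / $totalGiB%.1f GiB")
  }
}
```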

Additional context

Key stack trace:

26/04/18 10:31:32 ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: ai.rapids.cudf.CudaFatalException: Fatal CUDA error encountered at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-31-cuda12/thirdparty/cudf/java/src/main/native/src/CudaJni.cpp:406: 700 cudaErrorIllegalAddress an illegal memory access was encountered
	at ai.rapids.cudf.Cuda.memcpyOnStream(Native Method)
	at ai.rapids.cudf.Cuda.memcpy(Cuda.java:532)
	at ai.rapids.cudf.Cuda.memcpy(Cuda.java:304)
	at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:32)
	at ai.rapids.cudf.BaseDeviceMemoryBuffer.copyFromHostBuffer(BaseDeviceMemoryBuffer.java:102)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.$anonfun$concat$15(GpuShuffleCoalesceExec.scala:452)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.$anonfun$concat$13(GpuShuffleCoalesceExec.scala:450)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.$anonfun$concat$11(GpuShuffleCoalesceExec.scala:437)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:57)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.concat(GpuShuffleCoalesceExec.scala:431)
	at com.nvidia.spark.rapids.KudoGpuTableOperator.concat(GpuShuffleCoalesceExec.scala:413)
	at com.nvidia.spark.rapids.CoalesceIteratorBase.$anonfun$concatenateTables$2(GpuShuffleCoalesceExec.scala:570)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.next(RmmRapidsRetryIterator.scala:581)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:744)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:629)
	at com.nvidia.spark.rapids.CoalesceIteratorBase.$anonfun$concatenateTables$1(GpuShuffleCoalesceExec.scala:574)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.CoalesceIteratorBase.concatenateTables(GpuShuffleCoalesceExec.scala:560)
	at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.concatenateTablesInGpu(GpuShuffleCoalesceExec.scala:780)
	at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.$anonfun$next$7(GpuShuffleCoalesceExec.scala:808)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.next(GpuShuffleCoalesceExec.scala:807)
	at com.nvidia.spark.rapids.GpuCoalesceIteratorBase.next(GpuShuffleCoalesceExec.scala:763)
	at com.nvidia.spark.rapids.GpuMetricIteratorBase.$anonfun$next$8(GpuShuffleCoalesceExec.scala:912)
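
The repeated Arm$.withResource frames are the plugin's try/finally resource helper. A simplified stand-in (shape assumed from the trace; the real implementation lives in com.nvidia.spark.rapids.Arm) shows why a CudaFatalException thrown mid-copy still closes the enclosing buffers before propagating:

```scala
object ArmSketch {
  // Run the block against the resource, then close the resource even if the
  // block throws. A fatal CUDA error raised inside `block` therefore still
  // releases `r` on the way up the stack.
  def withResource[T <: AutoCloseable, V](r: T)(block: T => V): V =
    try block(r) finally r.close()
}
```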

Additional observations:

  • A leaked host column vector was reported just before shutdown (see the sketch after this list):
    ERROR HostColumnVector: A HOST COLUMN VECTOR WAS LEAKED (ID: 3041998)
  • GPU ECC counters are all zero and GPU temperature is normal (36 °C), so hardware faults are unlikely.
  • Last known-good run: job run 202 on 2026-03-07. No user-side changes have landed since then, so a regression is suspected within the RAPIDS plugin / cuDF / JNI stack.
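
The leak report is consistent with an exception path that skips close(). A hypothetical illustration (names and control flow are assumed for the sketch, not taken from the failing code):

```scala
import ai.rapids.cudf.HostColumnVector

object LeakSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical failure path: the vector is created, the copy throws a
    // fatal CUDA error, and nothing closes the vector. cuDF's cleanup then
    // reports "A HOST COLUMN VECTOR WAS LEAKED" at shutdown.
    val hcv = HostColumnVector.fromInts(1, 2, 3)
    try {
      throw new RuntimeException("simulated cudaErrorIllegalAddress on the copy path")
    } catch {
      case _: RuntimeException => () // error handled, but hcv.close() never runs
    }
    // Wrapping hcv in a withResource / try-finally would prevent the leak report.
  }
}
```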

Labels

bot_watch (Slack bot watched issue for LLM analyzer), bug (Something isn't working)
