System Environment
- GEOSX Version: 1.1.0 (develop, sha1: 2acb3ef19)
- Operating System: Windows Subsystem for Linux (WSL2), Ubuntu 22.04/24.04
- GPU: NVIDIA GeForce RTX 5070 (Blackwell architecture, Compute Capability 10.0 suspected)
- CUDA Toolkit Version: 12.6
- C++ Compiler: GCC 13.3.0 / GCC 14.0
- MPI: Open MPI v4.1.6
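Since the compute capability above is only suspected, it may be worth confirming what the driver and toolkit actually report. The following is a minimal shell sketch, assuming a driver recent enough to support the compute_cap query field of nvidia-smi and a toolkit that provides nvcc --list-gpu-code. The helper name cc_to_arch is hypothetical and only illustrates how a reported capability maps to an sm_XY name.

```shell
#!/bin/sh
# Hypothetical helper: turn a compute capability such as "8.9" into the
# matching nvcc real-architecture name ("sm_89").
cc_to_arch() {
  echo "sm_$(printf '%s' "$1" | tr -d .)"
}

# Ask the driver for the GPU name and compute capability
# (skipped when nvidia-smi is not on PATH).
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader ||
    echo "nvidia-smi query failed" >&2
fi

# List the real architectures this nvcc can generate code for. If the
# device's sm_XY name is missing here, the toolkit cannot target it natively.
if command -v nvcc >/dev/null 2>&1; then
  nvcc --list-gpu-code || echo "nvcc query failed" >&2
fi
```

Comparing cc_to_arch applied to the reported capability against the nvcc list shows whether the installed toolkit can emit native code for the card at all.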
Description
I am encountering a cudaErrorNoKernelImageForDevice error when running any simulation or unit test that utilizes the HYPRE linear solver interface. This occurs even after manually setting CUDA_ARCH to sm_90 (Hopper) in the TPL and GEOSX build configurations to ensure compatibility.
The error specifically triggers during kernel dispatch, with the exception message: "after dispatching exclusive_scan kernel: cudaErrorNoKernelImageForDevice". This happens consistently before calls like HYPRE_IJMatrixAddToValues2 in HypreMatrix.cpp:191.
It appears that the compiled binaries for sm_90 are not being correctly recognized or JIT-compiled for the Blackwell (RTX 50-series) hardware within the WSL environment. Since Blackwell is very new, I suspect there might be a mismatch in how RAJA/Thrust kernels are being packaged or a lack of forward compatibility without explicit PTX (virtual architecture) flags.
Question: Does GEOSX currently require specific flags (like 90-virtual or a newer compute capability) to support Blackwell GPUs, or is there a known issue with the exclusive_scan kernel on this new architecture?
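The forward-compatibility concern raised above can be made concrete with a small sketch of the rule described in the CUDA documentation: a compiled binary (SASS) for sm_XY runs only on devices with the same major version and an equal or higher minor version, while embedded PTX for compute_XY can be JIT-compiled on any device of equal or higher capability. The function name can_run is hypothetical, the device value 10.0 is the suspected one from this report, and architecture-specific targets (those with a letter suffix) are deliberately not modeled.

```shell
#!/bin/sh
# Sketch of the CUDA compatibility rule for plain sm_XY / compute_XY targets.
# Usage: can_run <embedded-target> <device-compute-capability>
# Exit status 0 means the embedded code is usable on the device.
# The body runs in a subshell so its variables stay local.
can_run() (
  target=$1
  device=$2
  kind=${target%%_*}            # "sm" (SASS) or "compute" (PTX)
  digits=${target##*_}          # e.g. "90" or "120"
  t_major=${digits%?}           # everything except the last digit
  t_minor=${digits#"$t_major"}  # the last digit
  d_major=${device%%.*}
  d_minor=${device##*.}
  case $kind in
    sm)
      # SASS: same major version, device minor >= target minor
      [ "$d_major" -eq "$t_major" ] && [ "$d_minor" -ge "$t_minor" ] ;;
    compute)
      # PTX: device capability >= target capability
      [ "$d_major" -gt "$t_major" ] ||
        { [ "$d_major" -eq "$t_major" ] && [ "$d_minor" -ge "$t_minor" ]; } ;;
    *)
      exit 2 ;;                 # unknown target kind
  esac
)

# Example with the suspected device capability from this report.
for target in sm_90 compute_90; do
  if can_run "$target" 10.0; then
    echo "$target: usable"
  else
    echo "$target: no kernel image"
  fi
done
# prints:
#   sm_90: no kernel image
#   compute_90: usable
```

Under this rule, a library that ships only sm_90 SASS without PTX would fail on any newer major architecture, which would be consistent with the error reported here. Whether that is what actually happens in the TPL build is not confirmed.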
Steps to Reproduce
- Set CUDA_ARCH to sm_90 in thirdPartyLibs/CMakeLists.txt and the host-config file.
- Build Third Party Libraries (TPL) and GEOSX from scratch.
- Run the testMatrices unit test: ctest -R testMatrices -V.
- Observe the crash during the Hypre/MatrixTest suite.
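After the build step, one way to check which GPU targets actually ended up in a library is to list its embedded cubin and PTX images with cuobjdump. This is a sketch under assumptions: the library path and the HYPRE_LIB override variable are hypothetical and need adjusting to the real TPL install layout, and the helper embedded_archs assumes the listed image names contain the target in the form sm_XY.

```shell
#!/bin/sh
# Reduce cuobjdump listing output (read from stdin) to the unique
# sm_XY names it mentions.
embedded_archs() {
  grep -oE 'sm_[0-9]+[a-z]?' | sort -u
}

# Hypothetical path: adjust to wherever the TPL build installed HYPRE
# (it may also be a static .a archive rather than a shared object).
lib=${HYPRE_LIB:-thirdPartyLibs/install-wsl-ubuntu-release/hypre/lib/libHYPRE.so}

if command -v cuobjdump >/dev/null 2>&1 && [ -f "$lib" ]; then
  echo "SASS (cubin) targets:"
  cuobjdump --list-elf "$lib" | embedded_archs
  echo "PTX targets:"
  cuobjdump --list-ptx "$lib" | embedded_archs
fi
```

If the PTX list comes back empty, the library carries no image that a newer architecture could JIT-compile, and the same check can be repeated on the RAJA-based GEOSX libraries and the test executable.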
Host-Config (wsl-ubuntu.cmake)
#################################################################################
# Generated host-config - Final Fix for CUDA 12.6 and sm_90 Compatibility
#################################################################################
# 1. Basic Compiler Settings
set(CMAKE_C_COMPILER "/usr/bin/gcc" CACHE PATH "")
set(CMAKE_CXX_COMPILER "/usr/bin/g++" CACHE PATH "")
set(BLT_CXX_STD "c++17" CACHE STRING "")
set(CMAKE_C_FLAGS "-O2 -w -fpermissive" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS "-O3 -w -fpermissive" CACHE STRING "" FORCE)
# 2. MPI Settings
set(ENABLE_MPI ON CACHE BOOL "")
set(MPI_C_COMPILER "/usr/bin/mpicc" CACHE PATH "")
set(MPI_CXX_COMPILER "/usr/bin/mpicxx" CACHE PATH "")
set(MPIEXEC "/usr/bin/mpirun" CACHE PATH "")
# 3. Explicit CUDA Toolkit and Version Configuration
set(ENABLE_CUDA ON CACHE BOOL "" FORCE)
set(CUDA_TOOLKIT_ROOT_DIR "/usr/local/cuda-12.6" CACHE PATH "" FORCE)
set(CMAKE_CUDA_COMPILER "${CUDA_TOOLKIT_ROOT_DIR}/bin/nvcc" CACHE PATH "" FORCE)
set(CUDA_VERSION "12.6" CACHE STRING "" FORCE)
set(CUDA_VERSION_MAJOR "12" CACHE STRING "" FORCE)
set(CUDA_VERSION_MINOR "6" CACHE STRING "" FORCE)
set(HYPRE_CUDA_VERSION "12.6" CACHE STRING "" FORCE)
# GPU Architecture Settings (Using sm_90 for RTX 5070 to ensure TPL compatibility)
set(CMAKE_CUDA_ARCHITECTURES "90" CACHE STRING "" FORCE)
set(CMAKE_CUDA_FLAGS "-restrict --expt-extended-lambda -arch sm_90" CACHE STRING "" FORCE)
# 4. Solver Interfaces
set(GEOS_LA_INTERFACE "Hypre" CACHE STRING "" FORCE)
set(ENABLE_HYPRE_DEVICE "CUDA" CACHE STRING "" FORCE)
# 5. Other Required Libraries
set(BLAS_LIBRARIES "/usr/lib/x86_64-linux-gnu/libblas.so" CACHE STRING "" FORCE)
set(LAPACK_LIBRARIES "/usr/lib/x86_64-linux-gnu/liblapack.so" CACHE STRING "" FORCE)
set(ENABLE_OPENMP OFF CACHE BOOL "" FORCE)
# 6. Third-Party Library (TPL) Path Logic
set(CONFIG_NAME "wsl-ubuntu")
if(NOT ( EXISTS "${GEOS_TPL_DIR}" AND IS_DIRECTORY "${GEOS_TPL_DIR}" ) )
  set(GEOS_TPL_DIR "${CMAKE_SOURCE_DIR}/../../thirdPartyLibs/install-${CONFIG_NAME}-release" CACHE PATH "" FORCE )
endif()
include(${CMAKE_CURRENT_LIST_DIR}/tpls.cmake)
Test Command 1: Unit Test (ctest)
This command runs the standardized linear algebra interface test.
cd ~/rocksim/GEOS/build-wsl-ubuntu-release
ctest -R testMatrices -V --output-on-failure
Key Error Output:
188: [ RUN ] Hypre/MatrixTest/0.MatrixMatrixOperations
188: unknown file: Failure
188: C++ exception with description "after dispatching exclusive_scan kernel: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device" thrown in the test body.
188: [ FAILED ] Hypre/MatrixTest/0.MatrixMatrixOperations, where TypeParam = geos::HypreInterface
Test Command 2: Integrated Simulation (geosx)
This command attempts to run a basic 3D single-phase flow simulation using the HYPRE solver.
cd ~/rocksim/GEOS/build-wsl-ubuntu-release
./bin/geosx -i ../inputFiles/singlePhaseFlow/3D_10x10x10_compressible_smoke.xml
Key Error Output (Stack Trace):
***** ERROR
***** LOCATION: .../linearAlgebra/interfaces/hypre/HypreUtils.hpp:154
***** Error cause: err != cudaSuccess
***** Rank 0: Previous CUDA errors found: before HYPRE_IJMatrixAddToValues2 (no kernel image is available for execution on the device at .../HypreMatrix.cpp:191)
** StackTrace **
Frame 0: geos::HypreMatrix::create(...)
Frame 1: geos::PhysicsSolverBase::solveNonlinearSystem(...)
Frame 2: geos::PhysicsSolverBase::nonlinearImplicitStep(...)
...
Additional Context
- Manual Source Patch: I had to manually fix a type-casting error in HYPRE (par_amg_setup.c:3155) to compile with GCC 14: a value of type "void *" cannot be used to initialize an entity of type "HYPRE_Solver".
- Storage: The entire TPL and GEOSX project is located on a secondary physical drive (F: drive) mounted via WSL2.