
Use static linkage for CUDA runtime #725

Open: bdice wants to merge 3 commits into rapidsai:main from bdice:static-cudart


bdice (Contributor) commented Dec 9, 2025

Summary

  • Set CUDA_STATIC_RUNTIME=ON by default in cpp/CMakeLists.txt
  • Remove cuda-cudart from run requirements in conda recipes

With static linking of the CUDA runtime, the runtime is embedded in the binaries and the cuda-cudart package is not needed at runtime.

Part of rapidsai/build-planning#235
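One way to sanity-check the change on a built library is to confirm that `ldd` no longer reports a dependency on the shared CUDA runtime. This is a minimal sketch, assuming the usual glibc `ldd` output format (`soname => path (address)`); the helper `links_shared_cudart` and the sample outputs are hypothetical, and the check only looks at `libcudart`, not at the driver library `libcuda.so`.

```python
import re

# Shared CUDA runtime sonames, e.g. "libcudart.so" or "libcudart.so.12".
_CUDART_SONAME = re.compile(r"libcudart\.so(\.\d+)*")


def links_shared_cudart(ldd_output: str) -> bool:
    """Return True if `ldd` output lists the shared CUDA runtime."""
    for line in ldd_output.splitlines():
        fields = line.split()
        # The first field of each ldd line is the dependency's soname.
        if fields and _CUDART_SONAME.fullmatch(fields[0]):
            return True
    return False


# Hypothetical ldd output before and after CUDA_STATIC_RUNTIME=ON.
dynamic_output = """\
\tlinux-vdso.so.1 (0x00007ffc4a5f2000)
\tlibcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f3a1c000000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3a1be00000)
"""
static_output = """\
\tlinux-vdso.so.1 (0x00007ffc4a5f2000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3a1be00000)
"""

print(links_shared_cudart(dynamic_output))  # True
print(links_shared_cudart(static_output))   # False
```

In practice the input would come from running `ldd` on the built shared library or test binary.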

KyleFromNVIDIA added the improvement and non-breaking labels on Dec 9, 2025
robertmaynard left a comment

We can drop all the compute sanitizer suppressions in cpp/compute-sanitizer-suppressions.xml

wence- (Contributor) commented Dec 9, 2025

> We can drop all the compute sanitizer suppressions in cpp/compute-sanitizer-suppressions.xml

What? why?

bdice added the breaking label on Dec 10, 2025
robertmaynard commented Dec 10, 2025

> > We can drop all the compute sanitizer suppressions in cpp/compute-sanitizer-suppressions.xml
>
> What? why?

Sorry, to be clear: the suppressions that have libcudart.so in the stack will need to be reviewed. Since we are moving to libcudart_static.a, the stack traces will be different, and I expect they will need to be regenerated.

pentschev (Member) commented

> > > We can drop all the compute sanitizer suppressions in cpp/compute-sanitizer-suppressions.xml
> >
> > What? why?
>
> Sorry, to be clear the suppressions have libcudart.so in the stack will need to be reviewed. Since we move to libcudart_static.a the stack traces will be different and I expect will need to be regenerated.

But the symbols will still exist somewhere, so we can remove the module name but still want to keep the suppressions. This actually makes things more difficult, since before we could effectively say "suppress all CUDA symbols" and now we need to list them by name.

robertmaynard commented Dec 10, 2025

> But symbols will still exist somewhere

Yes. My first comment wasn't correct. We can't remove the suppressions; we just have to make the detection significantly looser when matching the call stack.

pentschev (Member) commented

> > But symbols will still exist somewhere
>
> Yes. My first comment wasn't correct. We can't remove the suppressions, we just have to make the detection significantly more loose when looking at the callstack

I understood the correction you made to the first comment. However, I think what you're saying means we have to make detection significantly tighter now: we can no longer just ignore libcudart.so as a whole. Instead, we have to explicitly list all the function names that come from libcudart.so. We could use a cuda* wildcard, for example, but that would also suppress any function from any other library whose name begins with cuda.
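A quick illustration of the over-matching concern, assuming the `<func>` field is a regular expression matched against the whole symbol name (compute-sanitizer's exact matching rules are not confirmed here). In regex syntax the glob `cuda*` is written `cuda.*`; a literal regex `cuda*` would mean "cud" followed by any number of "a"s. The two non-runtime names below are hypothetical.

```python
import re

# Function names as they might appear in host stack frames.
names = [
    "cudaMemcpyAsync",         # CUDA runtime API
    "cudaq_launch_helper",     # hypothetical symbol from another library
    "cudaMemcpyAsyncWrapper",  # hypothetical user-defined wrapper
]

broad = re.compile(r"cuda.*")            # glob "cuda*" expressed as a regex
narrow = re.compile(r"cudaMemcpyAsync")  # one explicit runtime function

broad_hits = [n for n in names if broad.fullmatch(n)]
narrow_hits = [n for n in names if narrow.fullmatch(n)]

print(broad_hits)   # all three names
print(narrow_hits)  # ['cudaMemcpyAsync']
```

The broad pattern suppresses every frame whose name starts with `cuda`, while naming the runtime function explicitly keeps unrelated symbols visible to the sanitizer.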

robertmaynard commented
If I am reading the suppressions correctly, they are all interested in:

      <frame>
        <func>cudaMemcpyAsync</func>
        <module>.*/libcudart.so.*</module>
      </frame>

Now I presume we can loosen them to be:

      <frame>
        <func>cudaMemcpyAsync</func>
        <module>.*</module>
      </frame>
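To see why the `<module>` pattern has to be loosened, here is a toy matcher. It assumes both fields are regular expressions that must match the whole string, which the `.*` patterns suggest but which is not confirmed here; `frame_matches` is a hypothetical helper and the library paths are illustrative only.

```python
import re


def frame_matches(func_pat: str, module_pat: str, func: str, module: str) -> bool:
    """Toy rule: a suppression frame matches when both patterns fully match."""
    return (re.fullmatch(func_pat, func) is not None
            and re.fullmatch(module_pat, module) is not None)


# Dynamic linking: cudaMemcpyAsync resolves inside the shared runtime.
dynamic_frame = ("cudaMemcpyAsync", "/usr/local/cuda/lib64/libcudart.so.12")
# Static linking: the same symbol now lives in whichever binary embeds
# libcudart_static.a (hypothetical path).
static_frame = ("cudaMemcpyAsync", "/opt/conda/lib/librapidsmpf.so")

old_pattern = ("cudaMemcpyAsync", r".*/libcudart.so.*")
new_pattern = ("cudaMemcpyAsync", r".*")

print(frame_matches(*old_pattern, *dynamic_frame))  # True
print(frame_matches(*old_pattern, *static_frame))   # False
print(frame_matches(*new_pattern, *static_frame))   # True
```

Under this model the old module pattern stops matching once the runtime is embedded, while the loosened `.*` pattern still matches because the function name alone identifies the frame.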

pentschev (Member) commented Dec 10, 2025

We have suppressions spanning entire modules too

> If I am reading the suppressions correctly they are all interested in:
>
>       <frame>
>         <func>cudaMemcpyAsync</func>
>         <module>.*/libcudart.so.*</module>
>       </frame>
>
> Now I presume we can loosen them to be:
>
>       <frame>
>         <func>cudaMemcpyAsync</func>
>         <module>.*</module>
>       </frame>

Ah, I think you're right. We're not suppressing all symbols from libcuda.so/libcudart.so; the XML file format lists the frames of the host call stack to match. Here's the full diff that probably works:

diff --git a/cpp/compute-sanitizer-suppressions.xml b/cpp/compute-sanitizer-suppressions.xml
index 309d4d02..30bbc138 100644
--- a/cpp/compute-sanitizer-suppressions.xml
+++ b/cpp/compute-sanitizer-suppressions.xml
@@ -8,18 +8,6 @@
     </what>
     <hostStack>
       <saveLocation>error</saveLocation>
-      <frame>
-        <module>.*/libcuda.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
       <frame>
         <func>cudf::detail::cuda_memcpy_async_impl</func>
         <module>.*/libcudf.so</module>
@@ -41,18 +29,9 @@
     </what>
     <hostStack>
       <saveLocation>error</saveLocation>
-      <frame>
-        <module>.*/libcuda.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
       <frame>
         <func>cudaMemcpyAsync</func>
-        <module>.*/libcudart.so.*</module>
+        <module>.*</module>
       </frame>
       <frame>
         <func>cudf::detail::cuda_memcpy_async_impl</func>
@@ -77,18 +56,9 @@
     </what>
     <hostStack>
       <saveLocation>error</saveLocation>
-      <frame>
-        <module>*./libcuda.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
       <frame>
         <func>cudaMemcpyAsync</func>
-        <module>.*/libcudart.so.*</module>
+        <module>.*</module>
       </frame>
       <frame>
         <func>rapidsmpf::buffer_copy const</func>
@@ -127,18 +97,6 @@
     </what>
     <hostStack>
       <saveLocation>error</saveLocation>
-      <frame>
-        <module>.*/libcuda.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
       <frame>
         <func>cudf::detail::cuda_memcpy_async_impl</func>
         <module>.*/libcudf.so</module>
@@ -169,18 +127,9 @@
     </what>
     <hostStack>
       <saveLocation>error</saveLocation>
-      <frame>
-        <module>.*/libcuda.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.12</module>
-      </frame>
       <frame>
         <func>cudaMemcpyAsync</func>
-        <module>.*/libcudart.so.12</module>
+        <module>.*</module>
       </frame>
       <frame>
         <func>rapidsmpf::buffer_copy const</func>
@@ -216,18 +165,9 @@
     </what>
     <hostStack>
       <saveLocation>error</saveLocation>
-      <frame>
-        <module>.*/libcuda.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
-      <frame>
-        <module>.*/libcudart.so.*</module>
-      </frame>
       <frame>
         <func>cudaMemcpyAsync</func>
-        <module>.*/libcudart.so.*</module>
+        <module>.*</module>
       </frame>
       <frame>
         <func>rapidsmpf::buffer_copy const</func>
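The edits in the diff are mechanical, so they could also be scripted. Below is a sketch using Python's `xml.etree.ElementTree`, with rules that mirror the diff: module-only frames pointing at `libcuda.so`/`libcudart.so` are dropped, and frames that name a function keep the name but accept any module. `loosen_suppressions` and `rewrite_file` are hypothetical helpers, not part of the repository, and the rules are an assumption drawn from the diff rather than a verified recipe.

```python
import xml.etree.ElementTree as ET

# Substrings that identify the shared CUDA libraries in <module> patterns.
CUDA_SHARED_LIBS = ("libcuda.so", "libcudart.so")


def _is_cuda_shared(module_text: str) -> bool:
    return any(lib in module_text for lib in CUDA_SHARED_LIBS)


def loosen_suppressions(root: ET.Element) -> int:
    """Rewrite <hostStack> frames in place; return the number of frames changed."""
    changed = 0
    for stack in root.iter("hostStack"):
        # Copy the child list so frames can be removed while iterating.
        for frame in list(stack.findall("frame")):
            module = frame.find("module")
            if module is None or not _is_cuda_shared(module.text or ""):
                continue
            if frame.find("func") is None:
                # Module-only frame inside libcuda/libcudart: drop it.
                stack.remove(frame)
            else:
                # Named runtime function: keep the name, accept any module.
                module.text = ".*"
            changed += 1
    return changed


def rewrite_file(path: str) -> int:
    # Keep XML comments (e.g. license headers); ElementTree drops them by default.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    tree = ET.parse(path, parser)
    changed = loosen_suppressions(tree.getroot())
    tree.write(path, encoding="utf-8")
    return changed
```

Calling `rewrite_file("cpp/compute-sanitizer-suppressions.xml")` would rewrite the file in place; whitespace may differ slightly from hand edits, so the result should still be reviewed as a normal diff.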

It's just not clear whether a suppression can start from an arbitrary frame or whether it must describe the complete stack. Unfortunately, the docs don't expand much on that, so there's a chance we may need to regenerate the suppressions completely.
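The open question can be made concrete with a toy model of two possible matching rules. Neither is confirmed to be what compute-sanitizer does: interpretation A requires the suppression frames to match the innermost frames of the stack in order, while interpretation B accepts them as a contiguous run anywhere in the stack. The stack and paths below are hypothetical, with a leading driver frame that the proposed diff no longer lists.

```python
import re
from typing import List, Tuple

Frame = Tuple[str, str]  # (function, module); "" when the function is unknown


def frame_ok(pattern: Frame, frame: Frame) -> bool:
    func_pat, module_pat = pattern
    return (re.fullmatch(func_pat, frame[0]) is not None
            and re.fullmatch(module_pat, frame[1]) is not None)


def matches_from_top(patterns: List[Frame], stack: List[Frame]) -> bool:
    """Interpretation A: patterns must match the innermost frames, in order."""
    return len(patterns) <= len(stack) and all(
        frame_ok(p, f) for p, f in zip(patterns, stack))


def matches_anywhere(patterns: List[Frame], stack: List[Frame]) -> bool:
    """Interpretation B: patterns may match any contiguous run of frames."""
    return any(matches_from_top(patterns, stack[i:])
               for i in range(len(stack) - len(patterns) + 1))


# Innermost frame first, following the order used in the suppressions file.
stack = [
    ("", "/usr/lib/x86_64-linux-gnu/libcuda.so.1"),
    ("cudaMemcpyAsync", "/opt/conda/lib/librapidsmpf.so"),
    ("cudf::detail::cuda_memcpy_async_impl", "/opt/conda/lib/libcudf.so"),
]
# Frames left after the proposed diff (module-only CUDA frames removed).
patterns = [
    ("cudaMemcpyAsync", r".*"),
    ("cudf::detail::cuda_memcpy_async_impl", r".*/libcudf.so"),
]

print(matches_from_top(patterns, stack))  # False
print(matches_anywhere(patterns, stack))  # True
```

If the real rule behaves like A, the trimmed suppressions would stop matching and would need to be regenerated; if it behaves like B, the diff should be sufficient.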

bdice (Contributor, Author) commented Dec 11, 2025

I know nothing about these suppressions or how to test them. CI is passing, so I assume this is a manual test. Can someone test @pentschev's proposed diff and commit it if it works?


Labels: breaking (Introduces a breaking change), improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)


5 participants