…mory

Introduce a dry-run execution framework that replaces the device and host memory resources with lightweight fake allocators to measure peak memory usage without holding real memory.

New files:
- dry_run_memory_resource.hpp: dry_run_allocator (a lock-free bump allocator), dry_run_device_memory_resource, dry_run_host_memory_resource, dry_run_resource_manager (RAII), and a dry_run_execute() helper.
- dry_run_flag.hpp: a boolean dry-run flag exposed as a raft resource, allowing algorithms to skip kernel execution during profiling.
- tests/util/dry_run_memory_resource.cpp: unit tests.

The dry_run_allocator probes the upstream once to obtain a base address, then atomically bumps a pointer for each allocation: no mutex, no map, and no real memory held after the initial probe.
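The probe-once, atomic-bump idea described above can be sketched in plain C++. This is a hypothetical, self-contained illustration of the technique, not the actual dry_run_allocator: the class name, the fixed base address, and the 256-byte default alignment are assumptions for the example. Each allocation advances an atomic offset so distinct, aligned fake pointers come back without any real backing memory, while a separate in-use counter tracks the peak.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Illustrative bump allocator in the spirit of dry_run_allocator.
// `base` stands in for the address obtained by the one-time upstream probe.
class bump_allocator {
 public:
  explicit bump_allocator(std::uintptr_t base) : base_{base} {}

  void* allocate(std::size_t bytes, std::size_t alignment = 256) {
    // Pad to the alignment so every returned fake pointer stays aligned.
    std::size_t padded = (bytes + alignment - 1) / alignment * alignment;
    // Lock-free bump: reserve a distinct offset for this allocation.
    std::uintptr_t offset = offset_.fetch_add(padded, std::memory_order_relaxed);
    // Track the high-water mark of concurrently live bytes.
    std::size_t now = in_use_.fetch_add(padded, std::memory_order_relaxed) + padded;
    std::size_t prev = peak_.load(std::memory_order_relaxed);
    while (prev < now &&
           !peak_.compare_exchange_weak(prev, now, std::memory_order_relaxed)) {}
    return reinterpret_cast<void*>(base_ + offset);
  }

  void deallocate(void*, std::size_t bytes, std::size_t alignment = 256) {
    std::size_t padded = (bytes + alignment - 1) / alignment * alignment;
    in_use_.fetch_sub(padded, std::memory_order_relaxed);
  }

  std::size_t peak() const { return peak_.load(std::memory_order_relaxed); }

 private:
  std::uintptr_t base_;
  std::atomic<std::size_t> offset_{0};
  std::atomic<std::size_t> in_use_{0};
  std::atomic<std::size_t> peak_{0};
};
```

Note that the offset only ever grows (so returned pointers are unique), while the peak is computed from live bytes, so freed allocations do not inflate the estimate.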
…pinned_memory_resource

Add pinned and managed memory resources to the raft::resources handle so that these resources can be customized or temporarily replaced.
…aking change due to transitive includes in downstream libraries
Merges

Remove deprecated headers (rapidsai#2939). Conflict resolutions:
- rsvd.cuh: use the new mdspan-based raft::matrix::sqrt and reciprocal APIs (they have internal dry-run guards); kept the cudaMemsetAsync guard.
- svd.cuh: use raft::matrix::weighted_sqrt (has an internal dry-run guard).
- matrix.cuh: accept deletion (deprecated, removed in main).

Co-authored-by: Cursor <cursoragent@cursor.com>
…urrent_device_resource()
Adapt the dry-run protocol to use the unified cuda::mr resource infrastructure from fea-unify-memory-resources.

Key changes:
- Replace dry_run_device_memory_resource (an rmm subclass) and dry_run_host_memory_resource (a std::pmr subclass) with a single dry_run_resource&lt;Upstream&gt; template using cuda::forward_property, modeled after raft::mr::statistics_adaptor.
- Replace dry_run_resource_manager (which modified the passed-in resources handle) with dry_run_resources, a standalone class that copies the resources object and provides an implicit conversion to const resources&, enabling composability with other resource wrappers.
- dry_run_allocator uses probe-once semantics: a single real allocation from the upstream is kept alive for the allocator's lifetime, and all subsequent allocations return the same valid pointer.
- Remove the obsolete pmr/pinned_memory_resource.hpp (superseded by cuda::mr::legacy_pinned_memory_resource in the unified branch).
- Adapt the tests to the unified resource APIs (host_resource_ref, host_device_resource_ref, get_default_host_resource, etc.).

Made-with: Cursor
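The copy-and-convert pattern behind dry_run_resources (copy the handle, swap in the fake allocator on the copy, convert implicitly back to const resources&) can be sketched generically. This is a minimal stand-in, not the real raft::resources: the `resources` struct and `which_mr` helper below are invented for illustration only.

```cpp
#include <string>

// Stand-in for a resources handle; the real raft::resources is far richer.
struct resources {
  std::string memory_resource = "device";
};

// Sketch of the dry_run_resources pattern: the caller's handle is copied,
// never mutated, and the wrapper is usable wherever const resources& is.
class dry_run_resources {
 public:
  explicit dry_run_resources(resources const& upstream) : copy_{upstream} {
    copy_.memory_resource = "dry_run";  // swap in the fake allocator on the copy
  }

  // Implicit conversion lets this wrapper compose with any API (or other
  // resource wrapper) that takes const resources&.
  operator resources const&() const { return copy_; }

 private:
  resources copy_;
};

// Any function written against const resources& accepts the wrapper as-is.
std::string which_mr(resources const& res) { return res.memory_resource; }
```

The design choice worth noting is that copying, rather than mutating, the passed-in handle means the caller's resources are untouched after the dry run, and several such wrappers can be layered without interfering with each other.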
…ep/restore the state of resources
The dry run protocol defines a mechanism to simulate the execution of algorithms to get a precise estimate of the memory requirements for a real execution with the same parameters.
This PR introduces the dry-run building blocks (raft::util::dry_run_execute, a tracking memory resource, resource::get_dry_run_flag) that let callers estimate the peak memory usage of any RAFT algorithm without executing GPU work.

Depends on (and includes all changes of) #2968.
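The overall flow (run the workload with fake allocators wired in and a dry-run flag set so kernels are skipped, then report the peak the allocator observed) can be sketched without any GPU dependency. Everything below is an illustrative stand-in; `dry_run_context` and this single-argument `dry_run_execute` are not the real RAFT signatures.

```cpp
#include <cstddef>
#include <utility>

// Stand-in for the dry-run state: a fake allocator plus the skip-kernels flag.
struct dry_run_context {
  std::size_t in_use = 0;
  std::size_t peak  = 0;
  bool dry_run_flag = true;  // algorithms check this and skip kernel launches

  void* allocate(std::size_t bytes) {
    in_use += bytes;
    if (in_use > peak) peak = in_use;
    return nullptr;  // no real memory is handed out in this sketch
  }
  void deallocate(std::size_t bytes) { in_use -= bytes; }
};

// Run the workload against the fake allocator and return the peak in bytes.
template <typename Fn>
std::size_t dry_run_execute(Fn&& workload) {
  dry_run_context ctx;
  std::forward<Fn>(workload)(ctx);  // allocations flow through ctx only
  return ctx.peak;                  // peak bytes a real run would need
}
```

A caller would pass its algorithm as the workload and size buffers (or choose a pool size) from the returned peak, paying only the cost of the bookkeeping, not of the memory itself.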
Note for reviewers

The PR contains many small, tedious changes needed to cover the whole RAFT library and its test components. Please start reading at docs/sourceupdates to learn more about the topic and the principles guiding these changes.