Stronger profiling for GPU #26

@TheJoshBrod

Description

Current implementation

Below is a snippet of code from src/optimizer/hardware/profiler:

start.record()
for _ in range(10):      # Python Loop
    module.launch(...)   # CPU -> GPU submission
end.record()
# ...
elapsed_ms = start.elapsed_time(end) / 10

In this snippet, the measured elapsed time may include idle GPU time, because each iteration of the Python loop pays CPU-side submission latency before its kernel starts. We should verify whether this actually affects the measurement, since it will determine how much parallelism we can impose on the system later.
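To make the suspected bias concrete, here is a pure-CPU analogy (no CUDA involved): time.sleep stands in for kernel execution and for CPU->GPU submission latency, and the two timing loops mirror the current per-launch scheme versus a batched, graph-like submission. The constants are hypothetical.

```python
import time

# Hypothetical stand-ins: sleep models GPU kernel time and CPU->GPU
# submission latency. No CUDA is involved in this analogy.
KERNEL_S = 0.002  # per-kernel execution time
SUBMIT_S = 0.001  # per-launch CPU-side submission latency

def launch_from_python():
    time.sleep(SUBMIT_S)  # CPU submits the kernel
    time.sleep(KERNEL_S)  # kernel executes

# Current scheme: submission latency is paid (and timed) on every iteration.
t0 = time.perf_counter()
for _ in range(10):
    launch_from_python()
naive_ms = (time.perf_counter() - t0) / 10 * 1000

# Graph-like scheme: submission cost is paid once for the whole chain.
t0 = time.perf_counter()
time.sleep(SUBMIT_S)  # one replay call submits everything
for _ in range(10):
    time.sleep(KERNEL_S)  # kernels run back-to-back
batched_ms = (time.perf_counter() - t0) / 10 * 1000

print(f"naive per-iter:   {naive_ms:.2f} ms")
print(f"batched per-iter: {batched_ms:.2f} ms")
```

The naive per-iteration figure comes out higher because the 10 submission latencies are folded into the measurement; in the batched version that cost is amortized across the chain. If the real profiler shows a similar gap, the current numbers overstate per-launch cost.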

Proposed Solution

Using CUDA Graphs may eliminate this, because they allow us to record the sequence of 10 launches once and submit them as a single unit. Once submitted, the GPU executes the entire chain without needing the CPU, making the profiling immune to CPU load.

# 0. Timing events (assumed not created elsewhere)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

# 1. Warmup (initializes the CUDA context/kernels for these specific args)
module.launch(*args, **kwargs)
torch.cuda.synchronize()

# 2. Capture the graph: the 10 launches are recorded, not executed
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    for _ in range(10):
        module.launch(*args, **kwargs)

# 3. Measure execution of the graph
start.record()
g.replay()  # single CPU call triggers all 10 launches on the GPU
end.record()
torch.cuda.synchronize()  # wait for `end` to be recorded before reading it
elapsed_ms = start.elapsed_time(end) / 10

Above is a proposed form that may be better suited to our implementation.
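To check the "immune to CPU load" claim empirically, we could run the profiler once on a quiet machine and once under artificial CPU contention, then compare. Below is a host-side harness skeleton for that comparison; time.sleep is a hypothetical stand-in for the profiled launch, and the thread count is arbitrary.

```python
import threading
import time

def busy_loop(stop: threading.Event) -> None:
    # Hypothetical background load to contend for the CPU.
    while not stop.is_set():
        sum(range(1000))

def timed_ms(fn, iters: int = 10) -> float:
    # Average wall time per call, in milliseconds.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1000

workload = lambda: time.sleep(0.001)  # stand-in for the profiled launch

# Measure under contention.
stop = threading.Event()
threads = [threading.Thread(target=busy_loop, args=(stop,)) for _ in range(4)]
for t in threads:
    t.start()
loaded_ms = timed_ms(workload)
stop.set()
for t in threads:
    t.join()

# Measure on a quiet host.
quiet_ms = timed_ms(workload)

print(f"loaded: {loaded_ms:.2f} ms, quiet: {quiet_ms:.2f} ms")
```

If the graph-based profiler is truly CPU-independent, the two figures should agree within noise when the real launch replaces the stand-in; a large gap would mean CPU load still leaks into the measurement.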

Metadata

Labels: bug (Something isn't working)