Stronger profiling for GPU #26
Open
Labels
bug (Something isn't working)
Description
Current implementation
Below is a snippet of code from src/optimizer/hardware/profiler
start.record()
for _ in range(10):  # Python loop
    module.launch(...)  # CPU -> GPU submission
end.record()
# ...
elapsed_ms = start.elapsed_time(end) / 10
In this snippet, the elapsed time may include idle GPU time, because each of the 10 launches is submitted from a Python loop on the CPU. Verify whether this actually has an effect; the answer will impact the degree of parallelism we can impose on the system later.
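To make the concern concrete, here is a pure-Python timing model (no GPU required). The numbers are illustrative assumptions, not measurements from our profiler: `kernel_ms` stands for the true GPU work per launch, and `cpu_submit_ms` for CPU-side submission overhead that can leave the GPU idle between kernels.

```python
kernel_ms = 2.0       # assumed true GPU time per kernel
cpu_submit_ms = 0.5   # assumed CPU-side gap per launch when the queue runs dry
n = 10

# Looped submission: if the CPU cannot keep the queue full, every
# iteration pays the submission gap, and the events bracket idle time.
looped_elapsed = n * (kernel_ms + cpu_submit_ms)
looped_per_launch = looped_elapsed / n   # 2.5 ms: overestimates by 25%

# Single-unit submission (what CUDA Graphs would give us): all kernels
# run back to back, so the events bracket only GPU work.
graph_elapsed = n * kernel_ms
graph_per_launch = graph_elapsed / n     # 2.0 ms: the true kernel time
```

Whether the real gap is this large depends on kernel duration and CPU load, which is exactly what the verification step above should establish.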
Proposed Solution
Using CUDA Graphs may eliminate this overhead, because it allows us to record the sequence of 10 launches once and submit them as a single unit. Once submitted, the GPU executes the entire chain without needing the CPU, making the profiling immune to CPU load.
# 1. Warmup (initializes the CUDA context for these specific args)
module.launch(*args, **kwargs)
torch.cuda.synchronize()

# 2. Capture graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    for _ in range(10):
        module.launch(*args, **kwargs)

# 3. Measure execution of the graph
start.record()
g.replay()  # Single CPU instruction triggers all 10 runs on GPU
end.record()
torch.cuda.synchronize()
elapsed_ms = start.elapsed_time(end) / 10
Above is a new proposed form that may be better suited for our implementation.
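If we go this way, the snippet could live behind a small helper in the profiler. Below is a hypothetical `profile_launch_ms` sketch (the name and signature are my own, not part of our codebase); it assumes PyTorch with an available CUDA device and takes the launch callable as an argument, importing torch lazily so the module stays importable on CPU-only machines:

```python
def profile_launch_ms(launch, args=(), kwargs=None, iters=10):
    """Hypothetical helper: average per-launch time via CUDA graph replay.

    `launch` is any callable (e.g. module.launch). Assumes PyTorch with
    a CUDA device; raises if none is available.
    """
    import torch  # lazy import: keeps this module importable without torch

    if not torch.cuda.is_available():
        raise RuntimeError("profile_launch_ms requires a CUDA device")
    kwargs = kwargs or {}

    # 1. Warmup outside the graph (initializes context/allocator state)
    launch(*args, **kwargs)
    torch.cuda.synchronize()

    # 2. Capture `iters` launches into a single graph
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(iters):
            launch(*args, **kwargs)

    # 3. Replay once and time the whole chain with CUDA events
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    g.replay()  # one CPU call triggers all `iters` runs on the GPU
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```

One caveat worth verifying before adoption: graph capture requires the launches to use static shapes and addresses, so kernels that allocate fresh tensors per call may need a capture-safe path.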