Profiling API

Felix Weiglhofer edited this page May 19, 2023 · 8 revisions

This page gives an overview of how to use the Profiling API.

Creating timers

To measure kernel times, profiling has to be enabled during the initial xpu::initialize call:

#include <xpu/host.h>

xpu::settings settings;
settings.profile = true;
xpu::initialize(settings);

Then create an xpu::scoped_timer in your code to collect timings:

xpu::buffer<float> buf_a, buf_b, buf_c, buf_tmp;
// ...

xpu::timings timings;
{
  xpu::scoped_timer t_{"Name", &timings};

  xpu::queue queue;
  queue.copy(buf_a, xpu::h2d);
  queue.copy(buf_b, xpu::h2d);
  queue.memset(buf_tmp, 0);
  queue.launch<KernelA>(...);
  queue.launch<KernelB>(...);
  queue.launch<KernelA>(...);
  queue.copy(buf_c, xpu::d2h);
  queue.wait();
}

After the timer t_ goes out of scope, the measured times can be accessed in timings. E.g. printing the elapsed time on the console could look something like this:

// Total elapsed (wall) time
cout << "Elapsed time: " << timings.wall() << " ms" << endl;

// Time spent in memory operations
cout << "memcpy (host -> device): " << timings.copy(xpu::h2d) << " ms" << endl;
cout << "memcpy (device -> host): " << timings.copy(xpu::d2h) << " ms" << endl;
cout << "memset:                  " << timings.memset()       << " ms" << endl;

// Time spent on kernel executions

// Access times for a specific kernel
xpu::kernel_timings kernel_a_t = timings.kernel<KernelA>();
cout << "Kernel " << kernel_a_t.name() << " took " << kernel_a_t.total() << " ms." << endl;

// For more details, the times of each invocation are also stored
std::vector<double> invocations = kernel_a_t.times();
cout << "Call 0 took " << invocations[0] << " ms." << endl;
cout << "Call 1 took " << invocations[1] << " ms." << endl;

// OR get times for all kernels executed while the timer was active
std::vector<xpu::kernel_timings> kernel_times = timings.kernels();

// Print times for all kernels
for (xpu::kernel_timings &kt : kernel_times)
  cout << "Kernel " << kt.name() << " took " << kt.total() << " ms." << endl;

Note: In some situations using the RAII-based scoped_timer might be inconvenient. In that case, use xpu::push_timer("Name") and xpu::pop_timer() instead to push a new timer onto the stack and destroy it again. pop_timer returns the collected xpu::timings.

Note: If settings.profile isn't set to true, timers will only collect the elapsed wall time. All other calls just return zero.

Nesting timers

In some cases it can be useful to break down processing times by substep. This is possible by nesting timers, as xpu internally keeps a stack of all active timers. For example:

xpu::timings timings;
{
  xpu::scoped_timer t_{"Parent", &timings};
  // ...
  {
    xpu::scoped_timer t_{"SubTimer1"};
    {
      xpu::scoped_timer t_{"SubSubTimer1"};
    }
  }
  {
    xpu::scoped_timer t_{"SubTimer2"};
  }
}

All timings collected by the subtimers are also collected by the parent timers. So in this example SubTimer1, SubTimer2 and SubSubTimer1 all also count towards Parent (and of course SubSubTimer1 counts towards SubTimer1).

Notice that only the timings of the Parent timer are explicitly returned. However, the sub-timings are also available in timings by calling timings.children(), which returns a vector of all sub-timings created while the parent was active. E.g. to print the wall times of the sub-timers:

cout << "Elapsed time of substeps of " << timings.name() << endl;
for (const xpu::timings &child : timings.children()) {
  cout << "  " << child.name() << ": " << child.wall() << " ms" << endl;
}

In this case this would print the elapsed times of SubTimer1 and SubTimer2 (but not SubSubTimer1!).

Measuring throughput

Timers and kernels can be annotated with the number of bytes they processed by calling xpu::t_add_bytes(size_t bytes) and xpu::k_add_bytes<Kernel>(size_t bytes):

//...
  xpu::scoped_timer t_{"Name", &timings};

  xpu::h_view h_a(buf_a);
  xpu::h_view h_b(buf_b);
  xpu::h_view h_c(buf_c);

  // Current timer / sections processes buffer a and b.
  xpu::t_add_bytes(h_a.size_bytes() + h_b.size_bytes());

  xpu::queue queue;
  queue.copy(buf_a, xpu::h2d);
  queue.copy(buf_b, xpu::h2d);
  queue.memset(buf_tmp, 0);

  queue.launch<KernelA>(...);
  // KernelA has identical input size to current timer.
  xpu::k_add_bytes<KernelA>(h_a.size_bytes() + h_b.size_bytes());
  
  queue.launch<KernelB>(...);
  // Input of KernelB is only buffer b.
  xpu::k_add_bytes<KernelB>(h_b.size_bytes());

  // KernelA was already annotated above. Could call k_add_bytes<KernelA> here again, if another buffer was processed.
  queue.launch<KernelA>(...);
  queue.copy(buf_c, xpu::d2h);
  queue.wait();
//...

Note: Processed bytes of timers aren't propagated to parent timers. I.e. each timer has to be annotated by hand.

The throughput can then be read out like this:

cout << "Total throughput:  " << timings.throughput() << " GB/s" << endl;
cout << "Kernel throughput: " << timings.throughput_kernel() << " GB/s" << endl;

// Device transfer + memset throughput doesn't require annotations...
cout << "  throughput host -> device: " << timings.throughput_copy(xpu::h2d) << " GB/s" << endl;
cout << "  throughput device -> host: " << timings.throughput_copy(xpu::d2h) << " GB/s" << endl;
cout << "  throughput memset:         " << timings.throughput_memset() << " GB/s" << endl;

// Kernel throughput
xpu::kernel_timings kernel_a_t = timings.kernel<KernelA>();
cout << " throughput Kernel " << kernel_a_t.name() << ": " << kernel_a_t.throughput() << " GB/s" << endl;

Note: Throughput calls may return zero if no timings were collected for the corresponding action, e.g. because settings.profile wasn't enabled.
