-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Use vectorcall for all-positional-argument calls #5896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
If a handle or object is called with only positional arguments, it is
straightforward to use PyObject_Vectorcall instead of
PyObject_CallObject.
Benchmarked by adding a trivial function to pybind11_benchmark:
```
m.def("call_func_with_int", [](py::object func) {
return func(py::cast(1));
});
```
and then running `python -m timeit --setup 'from pybind11_benchmark import call_func_with_int; f = lambda x: x + 1' 'call_func_with_int(f)'`.
Before on M4 mac: 57.6 nsec per loop
After on M4 mac: 48.4 nsec per loop
For comparison, the included collatz benchmark takes 33.1 nsec per
loop, just calling `f(1)` directly takes 17.8 nec per loop, and simply
running `pass` takes 4.19 nsec per loop.
include/pybind11/cast.h
Outdated
| // Disable warnings about useless comparisons when N == 0. | ||
| PYBIND11_WARNING_PUSH | ||
| PYBIND11_WARNING_DISABLE_GCC("-Wtype-limits") | ||
| PYBIND11_WARNING_DISABLE_INTEL(186) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure why suppressing the icc warning didn't work :(
|
FWIW, I attempted to extend this to use vectorcall for the unpacking/kwargs/etc. cases, but my naive/straightforward attempt ended up adding too much cost for kwarg handling to end up actually improving performance. (the tuple unpacking case in particular is bottlenecked on args_proxy having to use slow PyIter-based iteration instead of PyTuple_GET_ITEM as currently written, so we couldn't help there either even though it doesn't use kwargs.) |
If a handle or object is called with only positional arguments, it is straightforward to use PyObject_Vectorcall instead of PyObject_CallObject.
Benchmarked by adding a trivial function to pybind11_benchmark:
and then running
python -m timeit --setup 'from pybind11_benchmark import call_func_with_int; f = lambda x: x + 1' 'call_func_with_int(f)'.Before on M4 mac: 57.6 nsec per loop
After on M4 mac: 48.4 nsec per loop
For comparison, the included collatz benchmark takes 33.1 nsec per loop, just calling
f(1)directly takes 17.8 nec per loop, and simply runningpasstakes 4.19 nsec per loop.Suggested changelog entry: