-
Notifications
You must be signed in to change notification settings - Fork 807
[SYCL][UR] Do not call urDeviceRetain()
in device_impl
constructor at all
#20065
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
[SYCL][UR] Do not call urDeviceRetain()
in device_impl
constructor at all
#20065
Conversation
b07c1d4
to
51a4b37
Compare
51a4b37
to
ab4b9b9
Compare
ab4b9b9
to
7c033d0
Compare
7c033d0
to
054acf1
Compare
054acf1
to
7eb40c4
Compare
urDeviceRetain()
in case of subdevices
7eb40c4
to
1e745dc
Compare
@pbalcer please review |
1e745dc
to
1ab563c
Compare
@ldorau - please do not "force push". That messes up the continuity of the reviews on github. If you are having to "force push" it probably means you are doing something wrong on your end. |
1ab563c
to
1344429
Compare
OK, so what should I do? Add more commits titled:
? |
sycl/source/detail/platform_impl.cpp
Outdated
} | ||
|
||
device_impl & | ||
platform_impl::getOrMakeSubDeviceImpl(ur_device_handle_t UrSubDevice) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please implement a helper function that taken an IsSubDevice bool to eliminate duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
b6f2cce
to
119f6a2
Compare
@cperkinsintel OK, a new commit added with fixes |
@intel/llvm-gatekeepers please consider merging |
sycl/source/detail/device_impl.cpp
Outdated
if (!IsSubDevice) { | ||
// Interoperability Constructor already calls DeviceRetain in | ||
// urDeviceCreateWithNativeHandle. | ||
getAdapter().call<UrApiKind::urDeviceRetain>(MDevice); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The destructor device_impl::~device_impl
calls urDeviceRelease
, which is meant to balance the urDeviceRetain
in the constructor. However, after this change, urDeviceRetain
is no longer called for subdevices in the constructor, while urDeviceRelease
is still invoked for them in the destructor. This imbalance is somewhat concerning. It seems that the actual root cause of the memory leak may lie elsewhere. Could you please clarify what you believe the underlying issue is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason is that when a device is created RefCount
is equal 1 without any call to retain()
, but it is destroyed when RefCount
is equal 0, so there should be one more call to release()
than a number of calls to retain()
... or the implementation of class RefCount
is wrong ?
class RefCount {
public:
RefCount(uint32_t count = 1) : Count(count) {}
...
uint32_t retain() { return ++Count; }
bool release() { return --Count == 0; }
void reset(uint32_t value = 1) { Count = value; }
private:
std::atomic_uint32_t Count;
};
...
ur_result_t urDeviceRelease(ur_device_handle_t Device) {
// Root devices are destroyed during the piTearDown process.
if (Device->isSubDevice()) {
if (Device->RefCount.release()) {
delete Device;
}
}
return UR_RESULT_SUCCESS;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth pointing out that only subdevices are really ref counted:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"(only sub-devices being ref counted seems like implementation detail of the adapter)"
That's the reason. I pointed it out because it's possible that the existing sequence of creates/retains/releases was always broken, and we never noticed because the leak happens in the rare case of subdevices. So in a way, I'm agreeing with your earlier comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am testing not calling urDeviceRetain()
in device_impl
ctor at all (#20144) - one unittest (sycl/unittests/context_device/DeviceRefCounter.cpp
) will have to be changed at least ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@againull I have removed urDeviceRetain(MDevice)
from the device_impl
constructor completely.
Please re-review.
@intel/llvm-gatekeepers please consider merging |
1 similar comment
@intel/llvm-gatekeepers please consider merging |
urDeviceRetain()
in case of subdevicesurDeviceRetain()
in device_impl
constructor at all
119f6a2
to
6396c62
Compare
urDeviceRetain(MDevice) should not be called in device_impl ctor at all, because RefCounter is initialized with 1, when the device is created. It fixes URT-961. Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
I don't know enough about how the SYCL device object is managed to be able to say whether this is correct. But, generally, this makes sense to me - devices (and all other UR objects) start with refcount 1, so retaining them immediately after creation is unnecessary. |
// So for this test, we just do it. | ||
sycl::detail::GlobalHandler::instance().getPlatformCache().clear(); | ||
|
||
EXPECT_EQ(DevRefCounter, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems a bit off that ref count is zero here, because callback for urDeviceGet (redefinedDevicesGetAfter) increments the DevRefCounter. So, even if we don't call Retain anymore, ref count is still expected to be at least 1 because of urDevicesGet. It would be nice to either understand why it is normal to have zero ref count here or to update the test if necessary if there is something wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@againull I have debugged this unittest. It calls 3 times platform_impl::getDevicesImplHelper()
:
llvm/sycl/source/detail/platform_impl.cpp
Line 497 in 6b29cf1
void platform_impl::getDevicesImplHelper(ur_device_type_t UrDeviceType, |
urDeviceGet(NumDevices == 0)
what setsDevRefCounter == 0
(usingredefinedDevicesGetAfter()
):
llvm/sycl/source/detail/platform_impl.cpp
Lines 502 to 504 in 6b29cf1
MAdapter->call<UrApiKind::urDeviceGet>(MPlatform, UrDeviceType, 0u, // CP info::device_type::all nullptr, &NumDevices); urDeviceGet(NumDevices == 1)
what setsDevRefCounter == 1
(usingredefinedDevicesGetAfter()
):
llvm/sycl/source/detail/platform_impl.cpp
Lines 527 to 530 in 6b29cf1
MAdapter->call<UrApiKind::urDeviceGet>( MPlatform, UrDeviceType, // CP info::device_type::all NumDevices, UrDevices.data(), nullptr); urDeviceRelease()
on this device what decrementsDevRefCounter
to0
(usingredefinedDeviceReleaseAfter()
):
llvm/sycl/source/detail/platform_impl.cpp
Lines 560 to 563 in 6b29cf1
// The reference counter for handles, that we used to create sycl objects, is // incremented, so we need to call release here. for (ur_device_handle_t &UrDev : UrDevicesToCleanUp) MAdapter->call<UrApiKind::urDeviceRelease>(UrDev);
... what looks like:
- redefinedDevicesGetAfter(*params.pphDevices == 0) DevRefCounter == 0 from urDeviceGet(NumDevices == 0)
- redefinedDevicesGetAfter(*params.pNumEntries == 1) DevRefCounter == 1 from urDeviceGet(NumDevices == 1)
- redefinedDeviceReleaseAfter() DevRefCounter == 0 from urDeviceRelease()
- redefinedDevicesGetAfter(*params.pphDevices == 0) DevRefCounter == 0 from urDeviceGet(NumDevices == 0)
- redefinedDevicesGetAfter(*params.pNumEntries == 1) DevRefCounter == 1 from urDeviceGet(NumDevices == 1)
- redefinedDeviceReleaseAfter() DevRefCounter == 0 from urDeviceRelease()
- redefinedDevicesGetAfter(*params.pphDevices == 0) DevRefCounter == 0 from urDeviceGet(NumDevices == 0)
- redefinedDevicesGetAfter(*params.pNumEntries == 1) DevRefCounter == 1 from urDeviceGet(NumDevices == 1)
- redefinedDeviceReleaseAfter() DevRefCounter == 0 from urDeviceRelease()
@againull Does it make sense or is there anything wrong here?
urDeviceRetain(MDevice)
should not be called in thedevice_impl
constructor at all,because
RefCounter
is initialized with 1, when the device is created.It fixes URT-961.