[DML] Bind the dml global objects to the Model #1590

Merged
merged 4 commits into from
Jun 27, 2025

Conversation

baijumeswani
Collaborator

@baijumeswani baijumeswani commented Jun 27, 2025

Before the device interface was introduced in #1190, the DML objects were owned by the model. The device interface abstraction decoupled the device-specific objects from the `OgaModel`.

For DML, this meant that the DML objects moved to global scope (they were previously owned by the `OgaModel` and therefore had model scope). On instantiation, these DML objects create background threads that retain hardware resources and prevent the driver threads from terminating. Because the objects are now global, their background threads outlive the model. The driver threads may then be unable to terminate correctly, which surfaces as problems in the application layer.
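The lifetime problem described above can be reproduced in miniature. The sketch below is not the onnxruntime-genai code: `BackgroundWorker`, `GetGlobalWorker`, `Model`, and `g_live_workers` are hypothetical names standing in for the DML objects, their accessor, and the `OgaModel`. It only shows that a globally scoped object owning a thread keeps that thread alive after the model that triggered its creation is gone.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>

// Counts live workers so the lifetime can be observed from outside.
std::atomic<int> g_live_workers{0};

// Hypothetical stand-in for a DML-style object: it starts a background
// thread on construction and only stops it on destruction.
class BackgroundWorker {
 public:
  BackgroundWorker() : thread_([this] { Run(); }) { ++g_live_workers; }
  ~BackgroundWorker() {
    stop_ = true;
    thread_.join();
    --g_live_workers;
  }

 private:
  void Run() {
    while (!stop_) std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  std::atomic<bool> stop_{false};  // declared before thread_ so it is initialized first
  std::thread thread_;
};

// Global scope: created on first use and never released by the model.
std::unique_ptr<BackgroundWorker> g_worker;

BackgroundWorker& GetGlobalWorker() {
  if (!g_worker) g_worker = std::make_unique<BackgroundWorker>();
  return *g_worker;
}

// Hypothetical model that uses the global worker but does not own it.
class Model {
 public:
  Model() { GetGlobalWorker(); }
};
```

Destroying a `Model` here leaves `g_live_workers` at 1; the thread stops only when `g_worker` itself is reset or the process exits.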

Another pull request, #1378, made device allocators cached and tied to a global ort session. As a result, the device allocator is also linked to the DML objects, which makes the lifetime of those objects hard to control.

This pull request special-cases the DML device type: all linked globally scoped variables are destroyed when the model is destroyed and re-created when a new model is initialized. This way, the DML threads terminate when the model is destroyed and release the driver threads, which can then terminate on their own.
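A minimal sketch of that bind-to-model pattern follows. It is an illustration under assumptions, not the code merged in this pull request: `DeviceObjects`, `Model`, `g_device_objects`, and `g_live_device_objects` are hypothetical names, and the sketch assumes only one model exists at a time (the description does not say how overlapping models are handled).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <thread>

// Counts live device objects so the lifetime can be observed from outside.
std::atomic<int> g_live_device_objects{0};

// Hypothetical stand-in for the DML objects: owns a background thread
// for as long as the object exists.
class DeviceObjects {
 public:
  DeviceObjects() : thread_([this] { Run(); }) { ++g_live_device_objects; }
  ~DeviceObjects() {
    stop_ = true;
    thread_.join();  // the background thread terminates here
    --g_live_device_objects;
  }

 private:
  void Run() {
    while (!stop_) std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  std::atomic<bool> stop_{false};  // declared before thread_ so it is initialized first
  std::thread thread_;
};

// Still globally scoped, but its lifetime is now driven by the model.
std::unique_ptr<DeviceObjects> g_device_objects;

// Hypothetical model: re-creates the globals on construction and
// destroys them on destruction (assumes a single live model).
class Model {
 public:
  Model() { g_device_objects = std::make_unique<DeviceObjects>(); }
  ~Model() { g_device_objects.reset(); }
};
```

With this arrangement, `g_live_device_objects` drops back to 0 as soon as the `Model` goes out of scope, and a second `Model` gets a fresh `DeviceObjects` instance.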

Addresses #1591

aciddelgado
aciddelgado previously approved these changes Jun 27, 2025
@baijumeswani baijumeswani enabled auto-merge (squash) June 27, 2025 18:57
@baijumeswani baijumeswani merged commit 88bb45c into main Jun 27, 2025
14 checks passed
@baijumeswani baijumeswani deleted the baijumeswani/cleanup-dml-globals-at-model-destroy branch June 27, 2025 19:31
baijumeswani added a commit that referenced this pull request Jun 27, 2025
Development

Successfully merging this pull request may close these issues.

After calling OgaDestroyModel, there is a DML-NV hang issue in GenAI version 0.8.0.
2 participants