Readiness endpoint behavior confusing for KServe #2056


Open
ylambrus opened this issue Feb 20, 2025 · 3 comments

Comments

@ylambrus

When no models are loaded, the /ready endpoint answers True, because MLServer uses the following logic in its data plane:

return all([model.ready for model in models])

>>> all([])  # all() over an empty list is vacuously True
True

The problem is that whenever you have a faulty model, say an MLflow runtime model that fails while loading, MLServer will gracefully handle the failure and unload the model. At that point there is a short period of time during which MLServer keeps running with an empty list of models (assuming there is only one model here), after which it shuts down.

This small window can be confusing: in a situation where KServe's readiness probe checks the /v2/health/ready endpoint, it will consider the inference service READY.
This behaviour can be devastating, as KServe will then consider the new inference service ready and promote this revision as the new ready and healthy one. We end up with a crashloopbackoff InferenceService as the elected revision, while the previous healthy revision is terminated.

We faced this behavior in production. I know it might not be an issue as such, more a question of "did we correctly interface KServe and MLServer readiness", but is it worth considering that we should not answer True when no models are loaded? That would solve the problem.
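
As a rough sketch of what that could look like (names here are illustrative, not MLServer's actual internals), the check would treat an empty model list as not ready:

def server_ready(models) -> bool:
    # An empty model list no longer counts as ready; this avoids the
    # vacuous all([]) == True during the unload-before-shutdown window
    if len(models) == 0:
        return False
    return all([model.ready for model in models])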

To reproduce the behavior, run the following piece of code to poll the readiness of your MLServer instance:

import requests
import time

url = "http://127.0.0.1:8080/v2/health/ready"

while True:
    ready = False
    start_time = time.time()

    # Check the URL 10 times per second for one second
    while time.time() - start_time < 1:
        try:
            response = requests.get(url, timeout=0.5)
            if response.status_code == 200:
                ready = True
                # keep polling for the full window rather than breaking early
        except requests.RequestException:
            # Ignore exceptions (e.g., connection errors) and continue checking
            pass
        time.sleep(0.1)

    if ready:
        print("READY")
    else:
        print("NOT READY")

Then try to load a faulty model, such as the following dummy one:

import mlflow.pyfunc

class DummyModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        import NONEXISTINGMODULE  # fails at load time: this module does not exist

    def predict(self, context, model_input):
        ...
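
For completeness, one way to materialize this faulty model so the MLflow runtime will attempt (and fail) to load it; the output path is illustrative:

mlflow.pyfunc.save_model(path="dummy_model", python_model=DummyModel())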

You'll notice a small window of time during which your probe tells you the server is ready.

I can do the PR if you agree that we should not answer True if len(models) == 0 (or handle this situation in some other, cleverer way if that makes sense).
Let me know.

@sakoush
Contributor

sakoush commented Feb 20, 2025

Thanks for raising this potential issue, @ylambrus.

The implementation largely depends on the semantics of v2/health/ready. The spec says:

    The "server ready" health API indicates if all the models are ready for inferencing. The "server ready" health API can be used directly to implement the Kubernetes readinessProbe.

Therefore I think that if MLServer has no models loaded, then v2/health/ready should be true by definition, shouldn't it?

If MLServer automatically unloads a model that has failed to load, then this could be a bug, and we need to understand the expected behaviour in the example you shared.

We also need to understand whether there is a race condition between a model being in the loading state and the status of v2/health/ready.

We encourage contributions from the community; feel free to submit a PR if you manage to take a deeper look.

@ylambrus
Author

Maybe I should not have talked about a race condition. I'd say the edge case is that when a model fails to load, MLServer shuts down, not just unloads the model. I'm not sure in which cases it won't shut down when a model fails to load (I'll double-check that).
But for a few milliseconds before shutting down, it answers 200 on /ready because the model has been unloaded.

So the tricky scenario, for us, is that if MLServer decides to shut down because a model does not load (behaviour that leads to a crashloopbackoff of the pod running it), answering 200 on /ready just before shutting down is a bit odd.

@lc525
Member

lc525 commented Apr 16, 2025

A question on this @ylambrus: in the server settings, are you running with parallel_workers = 0 or parallel_workers > 0?
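
For reference, this setting lives in MLServer's settings.json; a minimal example (value illustrative):

{
    "parallel_workers": 0
}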
