Conversation
Codecov Report❌ Patch coverage is
Continue to review full report in Codecov by Sentry.
I tried running a query through the ARAX pathfinder and it's unclear what happened. I got these logs: but nothing else before it timed out after 5 minutes. I sent it Imatinib->Asthma.
Hi @maximusunc, I changed the parameters to make it faster for now. I will get back to Shepherd-pathfinder, probably next week, to figure out what the problem is.
maximusunc
left a comment
When testing with Imatinib->Asthma, your Pathfinder is returning 0 paths. Is this intended?
workers/arax_pathfinder/worker.py
Outdated
try:
    start = time.perf_counter()
    logger.info("Starting pathfinder.get_paths()")
    result, aux_graphs, knowledge_graph = pathfinder.get_paths(
I'm not sure if your pathfinder code is asynchronous or not, but this call is blocking and so your pathfinder implementation can only handle one query at a time. Is this intended?
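For reference, one common way to keep a worker responsive around a blocking call like this is to hand it off to an executor. This is only a sketch under the assumption that the worker runs an asyncio event loop; `get_paths_blocking` and `handle_query` are illustrative names, not the actual worker code.

```python
import asyncio
import time

def get_paths_blocking(query):
    # Stand-in for the blocking pathfinder.get_paths(...) call quoted above;
    # in the real worker this is CPU-bound and can take minutes.
    time.sleep(0.1)
    return {"query": query, "paths": []}

async def handle_query(query):
    loop = asyncio.get_running_loop()
    # run_in_executor hands the blocking work to a thread pool so the event
    # loop stays free to accept other queries. (Heavily CPU-bound work would
    # want a ProcessPoolExecutor instead, since threads share the GIL.)
    return await loop.run_in_executor(None, get_paths_blocking, query)

async def main():
    # Two queries now overlap instead of running back to back.
    return await asyncio.gather(handle_query("q1"), handle_query("q2"))
```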
Hi @maximusunc
Could you please provide me your json query that you sent and got 0 paths?
Here is my query, and I got a result for this one.
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["CHEBI:31690"]
        },
        "n1": {
          "ids": ["MONDO:0004979"]
        }
      },
      "paths": {
        "p0": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:related_to"],
          "constraints": []
        }
      }
    }
  }
}
It can now handle multiple queries.
So now we have the opposite effect: it looks like we're trying to handle every query that comes in, and running this locally, the RAM and CPU usage of this worker shot way up and hit my Docker limits. I think we need to tune this so that CPU and RAM stay reasonable. What do we think is reasonable @mohsenht @dkoslicki?
I'm also a little concerned with the query that you sent that does return results. That CHEBI curie is Imatinib, but it is not normalized. Are you doing some normalization in your pathfinder code somewhere?
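The normalization in question might look something like the following sketch: look up an input CURIE's preferred identifier in a Node Normalizer-style response. The response fragment and the `EXAMPLE:` identifier are purely illustrative, not live data; the real SRI Node Normalizer service returns entries shaped roughly like `{"id": {"identifier": ..., "label": ...}, ...}`.

```python
def preferred_curie(curie, normalizer_response):
    """Return the preferred identifier for `curie`, falling back to the input."""
    entry = normalizer_response.get(curie)
    if not entry:
        return curie
    return entry["id"]["identifier"]

# Trimmed, illustrative response fragment (not real normalizer output).
sample_response = {
    "CHEBI:31690": {
        "id": {"identifier": "EXAMPLE:0001", "label": "imatinib"},
    }
}
```

With this shape, `preferred_curie("CHEBI:31690", sample_response)` returns `"EXAMPLE:0001"`, while an unknown CURIE passes through unchanged.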
The worker pool is capped at 4, and each query takes under 1 GB of RAM.
Since requests already take 2 to 4 minutes, I'm worried that reducing the workers to save resources will make end users wait way too long.
I'm not sure I follow. What worker pool are you talking about? Also, the way Shepherd is set up, it is designed to scale horizontally, so if one arax_pathfinder worker can't handle all the requests coming in, Shepherd can just spin up another one. So when we build these individual workers, we should make sure we understand where the bottlenecks are, pick a reasonable threshold for what a single worker should be able to handle, and then duplicate them at the Kubernetes level to handle more load.
Ah, that makes sense regarding Shepherd's horizontal scaling. The local resource spike is coming from num_cores = min(multiprocessing.cpu_count(), 4) inside the pathfinder library.
I added that multiprocessing specifically to speed up path expansion and keep query times down to 2-4 minutes. If we restrict each worker to 1 core to reduce its footprint, individual queries will take significantly longer.
We have a trade-off: Are we okay with longer end-user wait times so Shepherd can scale smaller workers? Or should we set a higher baseline per worker (e.g., 4 cores) to keep queries fast, and let Shepherd scale those larger pods?
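One way to leave that trade-off open is to make the core cap configurable per deployment, so the same image can run as one fast 4-core worker or as several small 1-core replicas. A sketch; the `PATHFINDER_NUM_CORES` variable name is hypothetical, and only the `min(multiprocessing.cpu_count(), 4)` part comes from the pathfinder library:

```python
import multiprocessing
import os

def pool_size(default_cap=4):
    # PATHFINDER_NUM_CORES is a hypothetical env var: set it to 1 for small,
    # horizontally scaled replicas, or leave the default for a 4-core pod.
    cap = int(os.environ.get("PATHFINDER_NUM_CORES", default_cap))
    # Never exceed the machine's cores, and always keep at least one worker.
    return max(1, min(multiprocessing.cpu_count(), cap))
```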
I ran some tests last night and ran into some issues. I was able to run your query and get back results, but then I tried sending 5 concurrent queries, and while they all fired off, I got this error for all of them: And then I tried backing off and just sending one query and got this error: Now, this doesn't seem to be an issue with Shepherd but rather with these external services. So my follow-up questions are:
PloverDB Concurrency: The error came from PloverDB, which is actually designed to handle thousands of requests in parallel for Pathfinder and other services. The KG2 team is currently working hard on its stability.
Pathfinder Performance: Pathfinder is both CPU-bound and IO-bound. It already uses multiprocessing in its core code to calculate rankings and expand nodes while building paths and trees.
The subsequent failure for the single query shows that the database connection to arax-databases-mysql was also down. I will ping the KG2 team about this one.
Thanks Max
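While the external services stabilize, transient failures like these are often handled with retries and backoff on the client side. A minimal sketch; `fetch` stands in for a call to an external service such as PloverDB or the arax-databases MySQL host, and the retry policy is illustrative, not the actual worker behavior:

```python
import time

def call_with_backoff(fetch, retries=3, base_delay=0.5):
    """Retry `fetch` on transient connection errors, with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff gives a briefly overloaded service
            # time to recover before the next attempt.
            time.sleep(base_delay * (2 ** attempt))
```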
Hey @mohsenht is this ready for another review?
…hfinder
# Conflicts:
#	workers/arax/worker.py
Hi @maximusunc yes, ready to review
maximusunc
left a comment
With these changes, I was able to start a pathfinder query, but I still got an error when connecting to kg2cplover3.rtx.ai.
Hi @maximusunc, since these changes are mostly from your work in the main branch, could you please resolve these conflicts yourself?
Ok, I think the code looks good now; just the external Plover error is an issue.
I asked the KGX group, and it turned out the CI one is up and running again, so I updated the PloverDB URL to point to the CI one.
Forgot to mention you :)
I ran the branch and tested it for this query and it worked. POST:
Hi @maximusunc,
Please review this pull request.