Arax pathfinder#64

Open
mohsenht wants to merge 22 commits into main from arax-pathfinder
@mohsenht
Collaborator

Hi @maximusunc,

Please review this pull request.

@mohsenht mohsenht requested a review from maximusunc January 21, 2026 16:17
@codecov

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 2.82486% with 172 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.48%. Comparing base (928e4d8) to head (ce1a5e6).
⚠️ Report is 5 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| workers/arax_pathfinder/worker.py | 0.00% | 111 Missing ⚠️ |
| workers/arax/worker.py | 0.00% | 35 Missing ⚠️ |
| shepherd_utils/inject_shepherd_arax_provenance.py | 0.00% | 26 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
| --- | --- |
| shepherd_server/main.py | 0.00% <ø> (ø) |
| shepherd_utils/config.py | 100.00% <100.00%> (ø) |
| shepherd_utils/inject_shepherd_arax_provenance.py | 0.00% <0.00%> (ø) |
| workers/arax/worker.py | 0.00% <0.00%> (ø) |
| workers/arax_pathfinder/worker.py | 0.00% <0.00%> (ø) |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a561fa...ce1a5e6. Read the comment docs.


@maximusunc
Collaborator

I tried running a query through the ARAX pathfinder and it's unclear what happened. I got these logs:

arax_pathfinder  | 2026-02-04T16:02:05.361576: DEBUG: lookup map not here! /tmp/biolink/biolink_lookup_map_4.2.5_v5.pickle
arax_pathfinder  | 2026-02-04T16:02:07.267560: INFO: Building local Biolink 4.2.5 ancestor/descendant lookup map because one doesn't yet exist
arax_pathfinder  | [2026-02-04 16:02:07,299: INFO/shepherd.arax.pathfinder.e3355332.00527f4e]: Model release date: 12/01/2025
arax_pathfinder  | [2026-02-04 16:02:07,299: INFO/shepherd.arax.pathfinder.e3355332.00527f4e]: Finding paths process has started
arax_pathfinder  | [2026-02-04 16:02:07,300: INFO/shepherd.arax.pathfinder.e3355332.00527f4e]: Expanding CHEBI:45783
arax_pathfinder  | [2026-02-04 16:02:07,301: INFO/shepherd.arax.pathfinder.e3355332.00527f4e]: Expanding MONDO:0004979

but nothing else appeared before it timed out after 5 minutes. The query I sent was Imatinib->Asthma.

@mohsenht
Collaborator Author

mohsenht commented Feb 4, 2026

Hi @maximusunc,

I changed the parameters to make it faster for now. I will get back to Shepherd-pathfinder, probably next week, to figure out what the problem is.

Collaborator

@maximusunc maximusunc left a comment

When testing with Imatinib->Asthma, your Pathfinder is returning 0 paths. Is this intended?

try:
    start = time.perf_counter()
    logger.info("Starting pathfinder.get_paths()")
    result, aux_graphs, knowledge_graph = pathfinder.get_paths(

Collaborator

I'm not sure if your pathfinder code is asynchronous or not, but this call is blocking and so your pathfinder implementation can only handle one query at a time. Is this intended?
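
For illustration, here is a minimal sketch of one common way to keep a blocking call like this from stalling an async worker: hand it to a thread with `asyncio.to_thread`. The `blocking_get_paths` function is a hypothetical stand-in for `pathfinder.get_paths()`, whose real signature is not shown in this thread, so this is not the code in the PR.

```python
import asyncio
import time


def blocking_get_paths(source: str, target: str) -> list[str]:
    """Hypothetical stand-in for a blocking pathfinder call."""
    time.sleep(0.2)  # simulate slow, blocking work
    return [f"{source}->{target}"]


async def handle_query(source: str, target: str) -> list[str]:
    # Run the blocking call in a worker thread so the event loop stays free
    # to pick up other queries in the meantime.
    return await asyncio.to_thread(blocking_get_paths, source, target)


async def run_concurrently(pairs: list[tuple[str, str]]) -> list[list[str]]:
    # Every query is started at once; each one occupies a thread,
    # not the event loop.
    return await asyncio.gather(*(handle_query(s, t) for s, t in pairs))
```

Threads only help while the blocking call is waiting on I/O; CPU-heavy work would still contend for the interpreter and would need separate processes.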

Collaborator Author

Hi @maximusunc

Could you please share the JSON query you sent that returned 0 paths?

Collaborator Author

@mohsenht mohsenht Feb 18, 2026

Here is my query; it returned results.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": [
                        "CHEBI:31690"
                    ]
                },
                "n1": {
                    "ids": [
                        "MONDO:0004979"
                    ]
                }
            },
            "paths": {
                "p0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": [
                        "biolink:related_to"
                    ],
                    "constraints": []
                }
            }
        }
    }
}

Collaborator Author

@mohsenht mohsenht Feb 18, 2026

It can now handle multiple queries.

Collaborator

So now we have the opposite problem. It looks like the worker now tries to handle every query that comes in; running this locally, its RAM and CPU usage shot way up and hit my Docker limits. I think we need to tune this so that CPU and RAM stay reasonable. What do we think is reasonable, @mohsenht @dkoslicki?
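
As a rough illustration of what tuning could look like, the sketch below caps the number of queries in flight with an `asyncio.Semaphore`. The limit of 2 and the `fake_query` coroutine are assumptions made for the example, not values or code from this PR; the right limit depends on the CPU and RAM budget we settle on.

```python
import asyncio

MAX_CONCURRENT_QUERIES = 2  # assumed limit; tune to the worker's CPU/RAM budget


async def run_bounded(jobs, limit: int = MAX_CONCURRENT_QUERIES):
    """Run coroutine factories with at most `limit` of them in flight at once.

    Returns the results in order plus the peak concurrency that was observed.
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once `limit` jobs are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_query():
    # Hypothetical stand-in for one pathfinder query.
    await asyncio.sleep(0.05)
    return "ok"
```

Queries beyond the limit simply wait their turn, so memory use is bounded by the limit rather than by the incoming request rate.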

Collaborator

I'm also a little concerned about the query you sent that does return results. That CHEBI CURIE is Imatinib, but it is not normalized. Are you doing some normalization somewhere in your pathfinder code?

Collaborator Author

The worker pool is capped at 4, and each query takes under 1 GB of RAM.
Since requests already take 2 to 4 minutes, I'm worried that reducing the number of workers to save resources will make end users wait far too long.

Collaborator

I'm not sure I follow. What worker pool are you talking about? Shepherd is designed to scale horizontally, so if one arax_pathfinder worker can't handle all the incoming requests, Shepherd can just spin up another one. So when we build these individual workers, we should make sure we understand where the bottlenecks are and pick a reasonable threshold for what a single worker should be able to handle; then we can duplicate workers at the Kubernetes level to handle more load.

Collaborator Author

Ah, that makes sense regarding Shepherd's horizontal scaling. The local resource spike is coming from num_cores = min(multiprocessing.cpu_count(), 4) inside the pathfinder library.

I added that multiprocessing specifically to speed up path expansion and keep query times down to 2-4 minutes. If we restrict each worker to 1 core to reduce its footprint, individual queries will take significantly longer.

We have a trade-off: Are we okay with longer end-user wait times so Shepherd can scale smaller workers? Or should we set a higher baseline per worker (e.g., 4 cores) to keep queries fast, and let Shepherd scale those larger pods?
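
One way to keep both options open would be to make the cap configurable per deployment instead of hard-coding 4. The sketch below assumes a hypothetical environment variable named `PATHFINDER_MAX_CORES`; neither that name nor the `resolve_num_cores` helper exists in the pathfinder library today.

```python
import multiprocessing
import os


def resolve_num_cores(default_cap: int = 4,
                      env_var: str = "PATHFINDER_MAX_CORES") -> int:
    """Return min(available CPUs, configured cap), never less than 1.

    PATHFINDER_MAX_CORES is a hypothetical variable name used for illustration.
    """
    raw = os.environ.get(env_var)
    try:
        cap = int(raw) if raw is not None else default_cap
    except ValueError:
        cap = default_cap  # ignore malformed values rather than crash
    return max(1, min(multiprocessing.cpu_count(), cap))
```

With something like this, a small pod could set the variable to 1 while a larger pod keeps the default of 4, and the choice becomes a deployment setting rather than a code change.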

@maximusunc
Collaborator

I ran some tests last night and ran into some issues. I was able to run your query and get back results, but then I tried sending 5 concurrent queries and while they all fired off, I got this error for all of them:

arax_pathfinder  | requests.exceptions.ConnectionError: HTTPSConnectionPool(host='kg2cplover3.rtx.ai', port=9990): Max retries exceeded with url: /query (Caused by NewConnectionError("HTTPSConnection(host='kg2cplover3.rtx.ai', port=9990): Failed to establish a new connection: [Errno 111] Connection refused"))

And then I tried backing off and just sending one query and got this error:

arax_pathfinder  | [2026-02-26 02:18:32,536: ERROR/shepherd.arax.pathfinder.d18b72ad.f07114e6]: Path MONDO:0004979MONDO:0011786 raised an exception: MySQL connection failed: 2003 (HY000): Can't connect to MySQL server on 'arax-databases-mysql.rtx.ai:3306' (111)
arax_pathfinder  | [2026-02-26 02:18:34,166: ERROR/shepherd.arax.pathfinder.d18b72ad.f07114e6]: Path CHEBI:31690NCBIGene:1544 raised an exception: MySQL connection failed: 2003 (HY000): Can't connect to MySQL server on 'arax-databases-mysql.rtx.ai:3306' (111)
arax_pathfinder  | [2026-02-26 02:18:34,186: ERROR/shepherd.arax.pathfinder.d18b72ad.f07114e6]: PathFinder failed to find paths between on and sn. Error message is: MySQL connection failed: 2003 (HY000): Can't connect to MySQL server on 'arax-databases-mysql.rtx.ai:3306' (111)

Now this doesn't seem to be an issue with Shepherd; it looks like it is coming from these external services. So my follow-up questions are:
Can these external services handle concurrent queries? If they can't, then your Pathfinder shouldn't either. Is your Pathfinder heavily CPU-bound? If so, then we will probably want to set up some multiprocessing.
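
Independently of the concurrency question, transient "connection refused" failures like the ones above are often softened with a retry and exponential backoff. The helper below is a generic sketch with hypothetical names, not something that exists in the pathfinder code. It catches the built-in `ConnectionError` by default; `requests.exceptions.ConnectionError` is a different class, so it would have to be passed in explicitly through `exceptions`.

```python
import time


def call_with_backoff(func, retries: int = 3, base_delay: float = 0.5,
                      exceptions=(ConnectionError,), sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**attempt, then retry.

    Re-raises the last exception once all `retries` retries are used up.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except exceptions:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))
```

A retry only hides short outages; if the service is down for minutes, the query should still fail with a clear error.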

@mohsenht
Collaborator Author

PloverDB Concurrency: The error came from PloverDB, which is actually designed to handle thousands of requests in parallel for Pathfinder and other services. The KG2 team is currently working hard on its stability.

Pathfinder Performance: Pathfinder is both CPU-bound and IO-bound. It already uses multiprocessing in its core code to calculate rankings and expand nodes while building paths and trees.

The subsequent failure of the single query shows that the database connection to arax-databases-mysql was also down. I will ping the KG2 team about this one.

Thanks Max

@maximusunc
Collaborator

Hey @mohsenht is this ready for another review?

@mohsenht
Collaborator Author

Hi @maximusunc

Yes, it is ready for review.

Collaborator

@maximusunc maximusunc left a comment

With these changes, I was able to start a pathfinder query, but I still got an error when connecting to kg2cplover3.rtx.ai.

@mohsenht
Collaborator Author

Hi @maximusunc,

Since these changes mostly come from your work on the main branch, could you please resolve the conflicts yourself?

@maximusunc
Collaborator

OK, I think the code looks good now; the only remaining issue is the external Plover error.

@mohsenht
Collaborator Author

I asked the KGX group, and it turns out the CI instance is up and running again, so I updated the PloverDB URL to point to it.

@mohsenht
Collaborator Author

@maximusunc

Forgot to mention you :)

@mohsenht
Collaborator Author

@maximusunc

I ran the branch and tested it with this query, and it worked.

POST:
http://localhost:5439/arax/query

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": [
                        "CHEBI:31690"
                    ]
                },
                "n1": {
                    "ids": [
                        "MONDO:0004979"
                    ]
                }
            },
            "paths": {
                "p0": {
                    "subject": "n0",
                    "object": "n1",
                    "predicates": [
                        "biolink:related_to"
                    ],
                    "constraints": []
                }
            }
        }
    }
}
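
For anyone who wants to reproduce this from a script, here is a minimal sketch that builds the same payload and posts it using only the standard library. The function names and the 600-second timeout are illustrative assumptions; the URL is the local one shown above.

```python
import json
import urllib.request


def build_pathfinder_query(source_curie: str, target_curie: str,
                           predicate: str = "biolink:related_to") -> dict:
    """Build a pathfinder query with one path between two pinned nodes."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [source_curie]},
                    "n1": {"ids": [target_curie]},
                },
                "paths": {
                    "p0": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": [predicate],
                        "constraints": [],
                    }
                },
            }
        }
    }


def post_query(url: str, payload: dict, timeout: float = 600.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like `post_query("http://localhost:5439/arax/query", build_pathfinder_query("CHEBI:31690", "MONDO:0004979"))`, with the worker stack running locally.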


3 participants