Hard to troubleshoot, agent stays running for hours #316
Description
Question
What is the correct way to troubleshoot this system? I had a big job in progress when I submitted a couple more jobs. Everything then stopped working: tasks showed as running, but nothing was getting done. More visibility is needed. I checked whether LiteLLM was still working and found it frozen; restarting it didn't help. I then checked the SWE-AF agents, but there was no easy way to see whether they were doing anything. I restarted them, but the system didn't unblock.
I restarted agentfield, and they still showed up as running while nothing was actually getting done. I had to delete the request, but now the question is: how do I resume that work?
I think a queue system with a maximum number of concurrent executions is needed, along with some way to tell where the problem lies: is it the agent or the LLM? Perhaps there could be an agent health system and an LLM health system. Even the agent node does not consistently show up as up or down correctly.
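To make the suggestion concrete, here is a minimal sketch of what I mean by a queue with a concurrency cap, written with plain Python asyncio. It is not based on how agentfield or SWE-AF work internally; `run_jobs`, `max_concurrent`, and `timeout_s` are names I made up for illustration. The per-job timeout is my own assumption about how a stalled job could be reported instead of showing as running forever.

```python
import asyncio


async def run_jobs(jobs, max_concurrent=2, timeout_s=5.0):
    """Run named jobs with at most `max_concurrent` executing at once.

    `jobs` maps a job name to a zero-argument async callable.
    All names here are hypothetical, not part of any real agent API.
    """
    sem = asyncio.Semaphore(max_concurrent)
    active = 0  # jobs currently holding a slot
    peak = 0    # highest number of simultaneous jobs observed

    async def guarded(name, job):
        nonlocal active, peak
        async with sem:  # wait here until a slot is free
            active += 1
            peak = max(peak, active)
            try:
                # Give up on a job that exceeds the time budget so it
                # releases its slot instead of blocking the queue.
                result = await asyncio.wait_for(job(), timeout=timeout_s)
                return name, "done", result
            except asyncio.TimeoutError:
                return name, "timed_out", None
            finally:
                active -= 1

    outcomes = await asyncio.gather(*(guarded(n, j) for n, j in jobs.items()))
    return outcomes, peak
```

With something like this, submitting extra jobs while a big one is in progress would leave them waiting for a free slot, and a job that hangs would end up with an explicit `timed_out` status that a UI could display.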
It feels like the UI was put together with AI but not really tested or used in a real-world scenario. I do find this project very interesting and would even like to participate.