Hard to troubleshoot, agent stays running for hour #316

@hpe-ykoehler

Question

What is the correct way to troubleshoot this system? I had a big job in progress, and midway through I submitted a couple more jobs. Everything then stopped working: tasks showed as running, but nothing was getting done. More visibility is needed. I checked whether LiteLLM was still working and found it frozen; restarting it didn't help. I then checked the SWE-AF agents, but there was no easy way to see whether they were doing anything. I restarted them, but the system still didn't unblock.

I restarted agentfield, and the tasks still showed as running while nothing was actually getting done. I had to delete the request, but now the question is: how do I resume that work?

I think a queue system with a limit on concurrent executions is needed, along with some way to tell where the problem lies: is it the agent or the LLM? Maybe there could be an agent health system and an LLM health system. Even the agent node is not consistently shown as up or down.
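
As a rough illustration of the idea above, here is a minimal sketch of a queue that caps concurrent executions and records per-job heartbeats so stalled work can be spotted. It is not based on AgentField's actual code or API; `BoundedJobQueue`, `heartbeat`, `stalled`, and the `stale_after` threshold are all hypothetical names chosen for the example.

```python
import asyncio
import time


class BoundedJobQueue:
    """Hypothetical sketch: cap concurrent jobs and track heartbeats."""

    def __init__(self, max_concurrent, stale_after=30.0):
        # The semaphore limits how many jobs may run at the same time.
        self._sem = asyncio.Semaphore(max_concurrent)
        self._stale_after = stale_after
        self.last_heartbeat = {}  # job_id -> monotonic timestamp
        self.running = 0
        self.peak_running = 0

    def heartbeat(self, job_id):
        # A job calls this periodically to show it is still making progress.
        self.last_heartbeat[job_id] = time.monotonic()

    async def run(self, job_id, job):
        # Jobs beyond the limit wait here instead of piling onto the backend.
        async with self._sem:
            self.running += 1
            self.peak_running = max(self.peak_running, self.running)
            self.heartbeat(job_id)
            try:
                return await job(lambda: self.heartbeat(job_id))
            finally:
                self.running -= 1
                self.last_heartbeat.pop(job_id, None)

    def stalled(self, now=None):
        # Jobs that hold a slot but have not reported in recently.
        now = time.monotonic() if now is None else now
        return [job_id for job_id, seen in self.last_heartbeat.items()
                if now - seen > self._stale_after]


async def demo():
    queue = BoundedJobQueue(max_concurrent=2)

    async def fake_job(beat):
        # Stand-in for real agent work; reports a heartbeat after each step.
        for _ in range(3):
            await asyncio.sleep(0.01)
            beat()
        return "done"

    results = await asyncio.gather(
        *(queue.run(f"job-{i}", fake_job) for i in range(5))
    )
    return results, queue.peak_running


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With something like this, a UI could list the output of `stalled()` to show which jobs are holding a slot without making progress, which is the kind of visibility I was missing.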

It feels like the UI was put together with AI but not really tested or used in real-world scenarios. I do find this project very interesting and would even like to participate.
