fix: add MCP tool timeout + running state to prevent silent stall after hub_repo_details (#127) #190
Open
parthwhy wants to merge 1 commit into huggingface:main from
closed per maintainer request
Problem
When running ml-intern headless from the CLI, the agent silently stalls for 5+ minutes after a hub_repo_details tool call with zero progress output. The user has no way to know if the agent is working or frozen.
Fixes #127
Root Cause
Two overlapping bugs:
1. No timeout on MCP tool calls (agent/core/tools.py)
hub_repo_details is an MCP tool served by huggingface.co/mcp. The call await self.mcp_client.call_tool(tool_name, arguments) had no timeout: if the HF Hub API is slow or rate-limiting, this await hangs indefinitely, blocking the entire agent loop.
2. No progress event emitted during tool execution (agent/core/agent_loop.py)
After tool_call fires (showing ▸ hub_repo_details in the terminal), _exec_tool went silent until results arrived. No tool_state_change: running event was emitted for non-approval tools, so the CLI showed nothing for the entire duration of the stall.
Changes
agent/core/tools.py
- Add import asyncio
- Add MCP_TOOL_TIMEOUT = 30 constant (easy to tune)
- Wrap mcp_client.call_tool() with asyncio.wait_for(timeout=MCP_TOOL_TIMEOUT)
- Add except asyncio.TimeoutError handler that returns a clear error message to the agent so it can retry or continue instead of hanging
agent/core/agent_loop.py
- Emit a tool_state_change: running event inside _exec_tool immediately before call_tool() so the CLI/frontend shows the tool is actively executing
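For reviewers who want to see the timeout pattern in isolation, here is a minimal sketch of wrapping an awaited tool call in asyncio.wait_for. It is not the code from this PR: SlowClient and call_tool_with_timeout are hypothetical names, and the (output, success) return shape is an assumption made for illustration. Only MCP_TOOL_TIMEOUT = 30 and the use of asyncio.wait_for / asyncio.TimeoutError come from the description above.

```python
import asyncio

MCP_TOOL_TIMEOUT = 30  # seconds; value taken from the PR description


class SlowClient:
    """Hypothetical stand-in for the MCP client; sleeps to mimic a slow Hub API."""

    def __init__(self, delay: float) -> None:
        self.delay = delay

    async def call_tool(self, tool_name: str, arguments: dict) -> str:
        await asyncio.sleep(self.delay)
        return f"{tool_name} ok"


async def call_tool_with_timeout(client, tool_name, arguments, timeout=MCP_TOOL_TIMEOUT):
    """Return (output, success). The tuple shape is an assumption for this sketch.

    On timeout, wait_for cancels the pending call and we return an error
    string so the caller can retry or move on instead of hanging.
    """
    try:
        output = await asyncio.wait_for(
            client.call_tool(tool_name, arguments), timeout=timeout
        )
        return output, True
    except asyncio.TimeoutError:
        return f"Tool '{tool_name}' timed out after {timeout}s", False


if __name__ == "__main__":
    # Short delays so the demo finishes quickly; the real constant is 30 seconds.
    result = asyncio.run(
        call_tool_with_timeout(SlowClient(0.2), "hub_repo_details", {}, timeout=0.05)
    )
    print(result)
```

Returning an error string rather than re-raising keeps the agent loop alive, which matches the stated goal of letting the agent retry or continue.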
Before / After
Before:
▸ hub_repo_details {"repo_ids": ["openai/whisper-small"]}
[5 minutes of silence → user hits Ctrl+C]
After:
▸ hub_repo_details {"repo_ids": ["openai/whisper-small"]} [running]
[if slow]: Tool 'hub_repo_details' timed out after 30s — agent continues
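The [running] marker in the After trace depends on the event being emitted before the tool call is awaited. A rough sketch of that ordering follows. The emit helper, the events list, the payload fields, and the tool_output event name are all assumptions made for illustration; only the tool_state_change: running event and its placement before the call come from this PR.

```python
import asyncio

events = []  # hypothetical event sink standing in for the CLI/frontend


async def emit(event_type: str, data: dict) -> None:
    """Record an event; a real frontend would render it instead."""
    events.append((event_type, data))


async def exec_tool(call_tool, tool_name: str, arguments: dict):
    # Announce the running state before awaiting the tool, so the user
    # sees activity even if the call is slow.
    await emit("tool_state_change", {"tool": tool_name, "state": "running"})
    output, success = await call_tool(tool_name, arguments)
    # "tool_output" is an assumed event name for this sketch.
    await emit("tool_output", {"tool": tool_name, "output": output, "success": success})
    return output, success


async def fake_tool(tool_name: str, arguments: dict):
    """Hypothetical tool that reports how many events were emitted before it started."""
    return f"seen {len(events)} event(s) before start", True
```

Calling asyncio.run(exec_tool(fake_tool, "hub_repo_details", {"repo_ids": ["openai/whisper-small"]})) records the running event first and the result event second, which is the order the CLI needs to show progress during a slow call.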
Testing
Reproduced locally with the command from the issue:
ml-intern "Fine-tune a small Whisper model for Arabic speech recognition
on a GPU under 5 dollars and compare the fine-tuned model's metrics to
the original model."
Agent no longer stalls silently after hub_repo_details.