fix: add MCP tool timeout + running state to prevent silent stall after hub_repo_details (#127) #190

Open
parthwhy wants to merge 1 commit into huggingface:main from parthwhy:fix/hub-repo-details-stall-127
@parthwhy

Problem

When running ml-intern headless from the CLI, the agent silently stalls
for 5+ minutes after a hub_repo_details tool call, with no progress
output. The user cannot tell whether the agent is working or frozen.

Fixes #127

Root Cause

Two overlapping bugs:

1. No timeout on MCP tool calls (agent/core/tools.py)
hub_repo_details is an MCP tool served by huggingface.co/mcp. The
call await self.mcp_client.call_tool(tool_name, arguments) had no
timeout, so if the HF Hub API is slow or rate-limiting, the await hangs
indefinitely and blocks the entire agent loop.

2. No progress event emitted during tool execution (agent/core/agent_loop.py)
After tool_call fires (showing ▸ hub_repo_details in the terminal),
_exec_tool went silent until results arrived. No tool_state_change: running
event was emitted for non-approval tools, so the CLI showed nothing for
the entire duration of the stall.

Changes

agent/core/tools.py

  • Added import asyncio
  • Added MCP_TOOL_TIMEOUT = 30 constant (in seconds; easy to tune)
  • Wrapped mcp_client.call_tool() with asyncio.wait_for(timeout=MCP_TOOL_TIMEOUT)
  • Added except asyncio.TimeoutError handler that returns a clear error
    message to the agent so it can retry or continue instead of hanging
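
For reviewers, the timeout handling described above can be sketched roughly as follows. This is a minimal illustration, not the actual diff: FakeMCPClient, call_mcp_tool, and the (output, success) return shape are hypothetical names chosen for the example. Only asyncio.wait_for, asyncio.TimeoutError, and the MCP_TOOL_TIMEOUT constant come from the description.

```python
import asyncio

MCP_TOOL_TIMEOUT = 30  # seconds; value taken from the PR description


class FakeMCPClient:
    """Hypothetical stand-in for the real MCP client; sleeps to simulate a slow Hub API."""

    def __init__(self, delay: float) -> None:
        self.delay = delay

    async def call_tool(self, tool_name: str, arguments: dict) -> str:
        await asyncio.sleep(self.delay)
        return f"{tool_name} ok"


async def call_mcp_tool(client, tool_name, arguments, timeout=MCP_TOOL_TIMEOUT):
    """Return (output, success). A timeout becomes an error message instead of a hang.

    The return shape is an assumption made for this sketch.
    """
    try:
        # wait_for cancels the inner call and raises TimeoutError once `timeout` elapses.
        result = await asyncio.wait_for(
            client.call_tool(tool_name, arguments), timeout=timeout
        )
        return result, True
    except asyncio.TimeoutError:
        return f"Tool '{tool_name}' timed out after {timeout}s", False


if __name__ == "__main__":
    # A 0.2 s call against a 0.05 s budget takes the timeout branch.
    slow = FakeMCPClient(delay=0.2)
    print(asyncio.run(call_mcp_tool(slow, "hub_repo_details", {}, timeout=0.05)))
```

Returning the error as ordinary tool output, rather than re-raising, is what lets the agent loop decide whether to retry or move on.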

agent/core/agent_loop.py

  • Emit tool_state_change: running event inside _exec_tool immediately
    before call_tool() so the CLI/frontend shows the tool is actively
    executing
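
The ordering this change relies on (state event first, tool call second) might look like the sketch below. The event payload shape, the emit callback, and the exec_tool signature are assumptions made for illustration. The real _exec_tool in agent/core/agent_loop.py may differ.

```python
import asyncio


async def exec_tool(tool_name, arguments, call_tool, emit):
    """Emit a 'running' state event, then await the tool call.

    `emit` and `call_tool` are hypothetical async callables injected for this sketch.
    """
    # Emitting before the await means the UI updates even if the call is slow.
    await emit({"event": "tool_state_change", "tool": tool_name, "state": "running"})
    return await call_tool(tool_name, arguments)


async def demo():
    """Record the order in which the event and the tool call happen."""
    log = []

    async def emit(event):
        log.append(("event", event["state"]))

    async def call_tool(name, args):
        log.append(("call", name))
        return "done"

    result = await exec_tool("hub_repo_details", {"repo_ids": []}, call_tool, emit)
    return log, result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the event is emitted before the await, the CLI can render the running marker for the whole duration of a slow call.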

Before / After

Before:
▸ hub_repo_details {"repo_ids": ["openai/whisper-small"]}
[5 minutes of silence → user hits Ctrl+C]

After:
▸ hub_repo_details {"repo_ids": ["openai/whisper-small"]} [running]
[if slow]: Tool 'hub_repo_details' timed out after 30s — agent continues

Testing

Reproduced locally with the command from the issue:
ml-intern "Fine-tune a small Whisper model for Arabic speech recognition
on a GPU under 5 dollars and compare the fine-tuned model's metrics to
the original model."
Agent no longer stalls silently after hub_repo_details.

@fglogan

fglogan commented May 3, 2026

closed per maintainer request

Development

Successfully merging this pull request may close these issues.

Agent appears to stall for minutes after hub_repo_details with no progress logs
