Problem
Data Machine AI requests are still frequently hitting a 120s timeout on intelligence-chubes4, even after reducing prompt/data-packet/memory size.
Recent Flow 2 jobs all showed the same first-attempt failure shape:
```
Network error occurred while sending request to https://api.openai.com/v1/responses: cURL error 28: Connection timed out after 120000 milliseconds
```
Observed jobs:
- Job 639: first AI request timed out after 120s, retry succeeded.
- Job 640: first AI request timed out after 120s, retry reached a tool-path bug and skipped.
- Job 641: first AI request timed out after 120s, retry succeeded and wrote a wiki page.
- Job 642: first AI request timed out after 120s, retry succeeded and rejected the source.
Job 641 was only ~4.5k tokens, so this no longer looks primarily like a prompt/data-size problem; it looks like a transport-timeout-semantics problem.
Studio Context
Automattic/studio#3120 raises Studio's low-speed watchdog to 120s for AI TTFB. That behavior is already live on the local Studio site:
```
CURLOPT_CONNECTTIMEOUT = 30;
CURLOPT_LOW_SPEED_LIMIT = 1024;
CURLOPT_LOW_SPEED_TIME = 120;
```
However, Data Machine overrides AI request cURL options in RequestBuilder:
```php
curl_setopt( $handle, CURLOPT_CONNECTTIMEOUT, ceil( $connect_timeout ) );
curl_setopt( $handle, CURLOPT_LOW_SPEED_TIME, ceil( $request_timeout ) );
curl_setopt( $handle, CURLOPT_LOW_SPEED_LIMIT, 1 );
```
The local Data Machine setting is currently:
```
wp_ai_client_connect_timeout = 120
```
The exact 120000ms error strongly suggests Data Machine's connect timeout override is the timeout currently winning, not Studio's low-speed watchdog.
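The override order explains why Data Machine's value wins: later `curl_setopt()` calls on the same handle replace earlier ones, so if Data Machine's cURL hook runs after Studio's mu-plugin, its values take effect. A sketch of the resulting effective profile, under that ordering assumption (the 120 comes from the local `wp_ai_client_connect_timeout` setting shown above):

```php
// Sketch of the effective cURL options on the affected site, assuming
// Data Machine's hook fires after Studio's mu-plugin and overwrites it.
// Studio's mu-plugin first sets:
//   CURLOPT_CONNECTTIMEOUT = 30, CURLOPT_LOW_SPEED_LIMIT = 1024, CURLOPT_LOW_SPEED_TIME = 120
// Data Machine then overwrites on the same handle:
curl_setopt( $handle, CURLOPT_CONNECTTIMEOUT, 120 ); // ceil( wp_ai_client_connect_timeout ) — matches the 120000ms error
curl_setopt( $handle, CURLOPT_LOW_SPEED_TIME, 300 ); // ceil( $request_timeout )
curl_setopt( $handle, CURLOPT_LOW_SPEED_LIMIT, 1 );  // 1 byte/s floor — effectively a 300s stall watchdog
```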
Root Cause Hypothesis
Data Machine exposes/tunes the wrong timeout as an operator-visible setting.
Current effective model:
- request timeout: 300s, hardcoded but filterable
- connect timeout: operator-visible setting, currently 120s
- retry delay: 60s
A 120s connect timeout is too expensive for autonomous runs. If connection establishment or provider edge contact stalls, the attempt should fail fast and retry with a fresh connection. The model response itself should still have a long request timeout.
Desired Behavior
Separate timeout semantics clearly:
- connect timeout: short, e.g. 10-20s default
- request timeout: long, e.g. 300s default
- low-speed/TTFB watchdog: long enough for a non-streaming AI response
- retry delay: optionally shorter for transport/connect failures
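Until first-class settings land, the desired split can be sketched with Data Machine's existing timeout filters. The filter names are the ones this issue asks to preserve; the values shown are the proposed defaults, not the current ones, so treat this as illustrative operator tuning rather than shipped behavior:

```php
// Illustrative tuning via Data Machine's existing timeout filters.
// Values are the proposed defaults from this issue, not current defaults.
add_filter( 'datamachine_wp_ai_client_connect_timeout', function ( $seconds ) {
	return 15;  // short: a stalled connect should fail fast and retry on a fresh connection
} );
add_filter( 'datamachine_wp_ai_client_request_timeout', function ( $seconds ) {
	return 300; // long: preserve the full response budget for a connected model call
} );
```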
Acceptance Criteria
- Add first-class Data Machine settings for both:
  - `wp_ai_client_connect_timeout`
  - `wp_ai_client_request_timeout`
- Default connect timeout should be short enough for autonomous retries, likely 15s or 30s. Prefer 15s if tests/compatibility are okay.
- Request timeout should remain long, likely 300s.
- Preserve filters:
  - `datamachine_wp_ai_client_connect_timeout`
  - `datamachine_wp_ai_client_request_timeout`
- Log resolved AI transport profile before/around dispatch, including:
  - mode
  - provider
  - model
  - job_id / flow_step_id when available
  - resolved request timeout
  - resolved connect timeout
  - whether the RequestOptions class was used
  - whether the Data Machine cURL hook was installed
- On AI request failure logs, include the same resolved timeout profile so operators can tell which timeout likely fired.
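The resolved profile could be built once and attached to both the dispatch log and any failure log as a single structured array. The variable names and the plain `error_log()` sink below are placeholders, not existing Data Machine APIs:

```php
// Hypothetical structured payload for the resolved AI transport profile.
// Keys mirror the fields listed above; all variable names are illustrative.
$transport_profile = array(
	'mode'                 => $mode,
	'provider'             => $provider,
	'model'                => $model,
	'job_id'               => $job_id,        // null when not available
	'flow_step_id'         => $flow_step_id,  // null when not available
	'request_timeout'      => $resolved_request_timeout, // seconds
	'connect_timeout'      => $resolved_connect_timeout, // seconds
	'request_options_used' => $request_options_used,     // bool
	'curl_hook_installed'  => $curl_hook_installed,      // bool
);
// Logged before dispatch and re-attached on failure so operators can tell
// which timeout likely fired:
error_log( 'datamachine ai transport profile: ' . json_encode( $transport_profile ) );
```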
- Keep Studio's generic mu-plugin behavior independent; no Studio-specific workaround or endpoint discrimination.
- Add focused tests around timeout resolution/settings/log metadata where practical.
Notes
This should not reduce the model's thinking/response budget. The point is to fail fast only on connection establishment/transport stalls, while preserving a long total request timeout for connected non-streaming model calls.
This is part of making WordPress.com MGS wiki Flow 2 safe for set-and-forget operation. Processed-item semantics are now safer after #1815 (reject_source/defer_item), but the transport layer is still too slow/noisy to be boring.