Separate AI connect/request timeouts and log resolved transport profile #1816

@chubes4

Problem

Data Machine AI requests are still frequently hitting a 120s timeout on intelligence-chubes4, even after reducing prompt/data-packet/memory size.

Recent Flow 2 jobs all showed the same first-attempt failure shape:

Network error occurred while sending request to https://api.openai.com/v1/responses: cURL error 28: Connection timed out after 120000 milliseconds

Observed jobs:

  • Job 639: first AI request timed out after 120s, retry succeeded.
  • Job 640: first AI request timed out after 120s, retry reached a tool-path bug and skipped.
  • Job 641: first AI request timed out after 120s, retry succeeded and wrote a wiki page.
  • Job 642: first AI request timed out after 120s, retry succeeded and rejected the source.

Job 641 was only ~4.5k tokens, so prompt/data size no longer looks like the primary cause. The failures look like a problem with transport timeout semantics.

Studio Context

Automattic/studio#3120 raises Studio's low-speed watchdog to 120s for AI TTFB. That behavior is already live on the local Studio site:

CURLOPT_CONNECTTIMEOUT = 30;
CURLOPT_LOW_SPEED_LIMIT = 1024;
CURLOPT_LOW_SPEED_TIME = 120;

However, Data Machine overrides AI request cURL options in RequestBuilder:

curl_setopt( $handle, CURLOPT_CONNECTTIMEOUT, ceil( $connect_timeout ) );
curl_setopt( $handle, CURLOPT_LOW_SPEED_TIME, ceil( $request_timeout ) );
curl_setopt( $handle, CURLOPT_LOW_SPEED_LIMIT, 1 );

The local Data Machine setting is currently:

wp_ai_client_connect_timeout = 120

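The exact 120000 ms in the error, together with its "Connection timed out" wording, strongly suggests that Data Machine's connect timeout override is the limit that fires, not Studio's low-speed watchdog. Assuming the override is applied last, it replaces CURLOPT_LOW_SPEED_TIME with the request timeout, which leaves the connect timeout as the only 120s limit on the handle.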
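
To make that inference checkable from logs, a failure handler could classify the cURL error text. The sketch below is hypothetical: `datamachine_classify_curl_timeout` is not an existing Data Machine function, and the matched phrases are assumptions about how libcurl commonly words error 28. The wording can differ between libcurl versions, and an overall timeout that expires during connection setup may also read "Connection timed out", so the result is a hint rather than proof.

```php
<?php
/**
 * Guess which timeout produced a cURL error 28 message.
 *
 * Hypothetical helper; the matched phrases are assumptions about libcurl's
 * wording and may differ between libcurl versions.
 */
function datamachine_classify_curl_timeout( string $message ): string {
    if ( stripos( $message, 'cURL error 28' ) === false ) {
        return 'not_a_timeout';
    }
    // Connection setup exceeded its limit (typically CURLOPT_CONNECTTIMEOUT).
    if ( stripos( $message, 'Connection timed out' ) !== false ) {
        return 'connect_timeout';
    }
    // Low-speed watchdog (CURLOPT_LOW_SPEED_LIMIT / CURLOPT_LOW_SPEED_TIME).
    if ( stripos( $message, 'Operation too slow' ) !== false ) {
        return 'low_speed_watchdog';
    }
    // Overall transfer limit (CURLOPT_TIMEOUT) exceeded after connecting.
    if ( stripos( $message, 'Operation timed out' ) !== false ) {
        return 'request_timeout';
    }
    return 'unknown_timeout';
}

$error = 'Network error occurred while sending request to https://api.openai.com/v1/responses: '
    . 'cURL error 28: Connection timed out after 120000 milliseconds';
echo datamachine_classify_curl_timeout( $error ), PHP_EOL;
```

For the error quoted above, this sketch reports `connect_timeout`, which is consistent with the hypothesis that the 120s connect setting is the one that fired.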

Root Cause Hypothesis

Data Machine exposes the wrong timeout as its operator-visible, tunable setting: the connect timeout rather than the request timeout.

Current effective model:

request timeout: 300s hardcoded/filterable
connect timeout: visible setting, currently 120s
retry delay: 60s

A 120s connect timeout is too expensive for autonomous runs. If connection establishment or provider edge contact stalls, the attempt should fail fast and retry with a fresh connection. The model response itself should still have a long request timeout.
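
The cost is easy to put in numbers. As a simplification, assume a stalled attempt burns the full connect timeout and the next attempt starts once the retry delay has elapsed; the function name below is illustrative only.

```php
<?php
/**
 * Seconds lost before a retry can start when connection setup stalls:
 * the attempt uses up the whole connect timeout, then waits out the retry delay.
 */
function seconds_until_retry( int $connect_timeout, int $retry_delay ): int {
    return $connect_timeout + $retry_delay;
}

$retry_delay = 60; // Current retry delay from the model above.

printf( "120s connect timeout: %ds before retry\n", seconds_until_retry( 120, $retry_delay ) );
printf( "15s connect timeout:  %ds before retry\n", seconds_until_retry( 15, $retry_delay ) );
```

Under those assumptions, the current setting costs 180s per stalled attempt, while a 15s connect timeout would cost 75s with the same retry delay.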

Desired Behavior

Separate timeout semantics clearly:

connect timeout: short, e.g. 10-20s default
request timeout: long, e.g. 300s default
low-speed/TTFB watchdog: long enough for non-streaming AI response
retry delay: optionally shorter for transport/connect failures
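
One way to express this separation at the cURL layer is sketched below. It is an illustration, not Data Machine's RequestBuilder: both function names are hypothetical, the defaults (15s connect, 300s request) are assumptions taken from the ranges above, and it assumes the total request budget maps to CURLOPT_TIMEOUT with the low-speed watchdog covering the same window.

```php
<?php
/**
 * Resolve a transport profile that keeps the two timeouts separate.
 * Hypothetical helper; the defaults are assumptions drawn from the ranges above.
 */
function datamachine_timeout_profile( float $connect_timeout = 15.0, float $request_timeout = 300.0 ): array {
    $request = (int) ceil( $request_timeout );
    // The connect phase is part of the request, so never let it exceed the total.
    $connect = min( (int) ceil( $connect_timeout ), $request );

    return array(
        'connect_timeout' => $connect, // Fail fast on connection setup.
        'request_timeout' => $request, // Long budget for the full response.
        'low_speed_time'  => $request, // Watchdog tolerates a slow first byte.
    );
}

/** Apply the profile to a cURL handle (requires the cURL extension). */
function datamachine_apply_timeout_profile( $handle, array $profile ): void {
    curl_setopt( $handle, CURLOPT_CONNECTTIMEOUT, $profile['connect_timeout'] );
    curl_setopt( $handle, CURLOPT_TIMEOUT, $profile['request_timeout'] );
    curl_setopt( $handle, CURLOPT_LOW_SPEED_LIMIT, 1 );
    curl_setopt( $handle, CURLOPT_LOW_SPEED_TIME, $profile['low_speed_time'] );
}

$profile = datamachine_timeout_profile();

// Only touch cURL when the extension is available; no request is sent here.
if ( function_exists( 'curl_init' ) ) {
    $handle = curl_init( 'https://api.openai.com/v1/responses' );
    datamachine_apply_timeout_profile( $handle, $profile );
}
```

Keeping the resolved values in a plain array before touching the handle means the same profile can be reused for logging and for tests without a live cURL handle.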

Acceptance Criteria

  • Add first-class Data Machine settings for both:
    • wp_ai_client_connect_timeout
    • wp_ai_client_request_timeout
  • Default connect timeout should be short enough for autonomous retries, likely 15s or 30s. Prefer 15s if tests/compatibility are okay.
  • Request timeout should remain long, likely 300s.
  • Preserve filters:
    • datamachine_wp_ai_client_connect_timeout
    • datamachine_wp_ai_client_request_timeout
  • Log resolved AI transport profile before/around dispatch, including:
    • mode
    • provider
    • model
    • job_id / flow_step_id when available
    • resolved request timeout
    • resolved connect timeout
    • whether RequestOptions class was used
    • whether Data Machine cURL hook was installed
  • In AI request failure logs, include the same resolved timeout profile so operators can tell which timeout likely fired.
  • Keep Studio's generic mu-plugin behavior independent; no Studio-specific workaround or endpoint discrimination.
  • Add focused tests around timeout resolution/settings/log metadata where practical.
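
The resolution and logging criteria above could be sketched roughly as follows. Everything here is an assumption for illustration: the two function names, the array-based settings lookup, the `$filter` callable (which in WordPress would wrap `apply_filters()` with the filter names listed above), the log field names, and the 15s/300s defaults.

```php
<?php
/**
 * Resolve one timeout: stored setting, else default, then an optional filter.
 * Hypothetical helper; in WordPress the $filter callable would wrap apply_filters().
 */
function datamachine_resolve_timeout( array $settings, string $key, int $default, ?callable $filter = null ): int {
    $value = isset( $settings[ $key ] ) && is_numeric( $settings[ $key ] )
        ? (int) ceil( (float) $settings[ $key ] )
        : $default;

    if ( null !== $filter ) {
        $value = (int) $filter( $value );
    }

    // Zero or negative values fall back to the default.
    return $value > 0 ? $value : $default;
}

/** Build the log context from the acceptance criteria. Field names are illustrative. */
function datamachine_transport_log_context( array $request, array $settings ): array {
    return array(
        'mode'                 => $request['mode'] ?? null,
        'provider'             => $request['provider'] ?? null,
        'model'                => $request['model'] ?? null,
        'job_id'               => $request['job_id'] ?? null,
        'flow_step_id'         => $request['flow_step_id'] ?? null,
        'request_timeout'      => datamachine_resolve_timeout( $settings, 'wp_ai_client_request_timeout', 300 ),
        'connect_timeout'      => datamachine_resolve_timeout( $settings, 'wp_ai_client_connect_timeout', 15 ),
        'request_options_used' => (bool) ( $request['request_options_used'] ?? false ),
        'curl_hook_installed'  => (bool) ( $request['curl_hook_installed'] ?? false ),
    );
}

// Example values only: the provider label and job ID are placeholders.
$context = datamachine_transport_log_context(
    array( 'provider' => 'openai', 'job_id' => 641 ),
    array( 'wp_ai_client_connect_timeout' => 120 )
);
echo json_encode( $context ), PHP_EOL;
```

The same array could be attached both to the pre-dispatch log entry and to any failure log, so the two always describe the same resolved profile.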

Notes

This should not reduce the model's thinking/response budget. The point is to fail fast only on connection establishment/transport stalls, while preserving a long total request timeout for connected non-streaming model calls.

This is part of making WordPress.com MGS wiki Flow 2 safe for set-and-forget operation. Processed-item semantics are now safer after #1815 (reject_source/defer_item), but the transport layer is still too slow/noisy to be boring.
