Problem
After #1817, Data Machine can separate AI request and connect timeouts and logs the resolved transport profile. On intelligence-chubes4, Flow 2 now fails fast on transport stalls:
Network error occurred while sending request to https://api.openai.com/v1/responses: cURL error 28: Connection timed out after 15000 milliseconds
Transport profile confirms the intended settings are active:
{
"request_timeout": 300,
"connect_timeout": 15,
"request_options_class_available": true,
"request_options_used": true,
"curl_hook_installed": true,
"success": false
}
This is a major improvement over the prior 120s timeout, but retry scheduling still uses the generic retry delay:
{
"attempt": 1,
"max_attempts": 3,
"delay_seconds": 60,
"reason": "ai_processing_failed",
"provider": "openai",
"source_type": "mcp"
}
For cheap connection/transport failures, waiting 60s is unnecessarily slow. Now that connect timeouts fail in ~15s, the retry delay should also be short enough for set-and-forget pipeline throughput.
Desired Behavior
Make retry delay failure-type-aware:
connect timeout / cURL 28 / DNS / network transport blip
-> short retry delay, e.g. 10-15s
provider 429 / rate limit / overload
-> backoff / longer delay
full request timeout after long model execution
-> moderate delay
tool contract/runtime failure
-> existing retry/defer/fail semantics; do not mark processed unless reject_source
Runtime Evidence
Recent Flow 2 job 657 after #1817:
- Fetch succeeded.
- First AI call failed with
cURL error 28 after 15000ms.
- Retry was scheduled for 60s later.
- Retry succeeded and the model correctly called
reject_source(reason=too_thin).
So the transport fix works, but the retry delay is now the long pole.
Relevant Code
Likely area:
inc/Core/JobRetryPolicy.php
- AI failure context passed from
RequestBuilder / AIStep into retry policy
- Logs emitted by
JobRetryPolicy::maybeRetry() / retry history in job engine data
Acceptance Criteria
- Classify AI network/connect timeout failures separately from generic AI processing failures.
- Use a shorter retry delay for AI transport/connect failures, likely 10-15s.
- Preserve existing longer/backoff behavior for rate limits, overloads, and provider throttling.
- Include retry classification in job retry metadata/log context, e.g.
retry_class=transport_connect_timeout.
- Tests prove:
- cURL 28 /
Connection timed out after ... milliseconds gets short delay.
- rate limit / 429 keeps longer/backoff behavior.
- generic retryable AI failures keep existing default behavior.
- No source-specific logic for Intelligence, MGS, WordPress.com, or Flow 2.
- No
CHANGELOG.md edits or manual version bumps.
Why This Matters
We are close to making WordPress.com MGS wiki generation set-and-forget. The remaining issue is not correctness anymore; it is throughput and boring recovery. A transient connection failure should cost ~25-30s total, not ~75s+.
Problem
After #1817, Data Machine can separate AI request and connect timeouts and logs the resolved transport profile. On
intelligence-chubes4, Flow 2 now fails fast on transport stalls:Transport profile confirms the intended settings are active:
{ "request_timeout": 300, "connect_timeout": 15, "request_options_class_available": true, "request_options_used": true, "curl_hook_installed": true, "success": false }This is a major improvement over the prior 120s timeout, but retry scheduling still uses the generic retry delay:
{ "attempt": 1, "max_attempts": 3, "delay_seconds": 60, "reason": "ai_processing_failed", "provider": "openai", "source_type": "mcp" }For cheap connection/transport failures, waiting 60s is unnecessarily slow. Now that connect timeouts fail in ~15s, the retry delay should also be short enough for set-and-forget pipeline throughput.
Desired Behavior
Make retry delay failure-type-aware:
Runtime Evidence
Recent Flow 2 job 657 after #1817:
cURL error 28after15000ms.reject_source(reason=too_thin).So the transport fix works, but the retry delay is now the long pole.
Relevant Code
Likely area:
inc/Core/JobRetryPolicy.phpRequestBuilder/AIStepinto retry policyJobRetryPolicy::maybeRetry()/ retry history in job engine dataAcceptance Criteria
retry_class=transport_connect_timeout.Connection timed out after ... millisecondsgets short delay.CHANGELOG.mdedits or manual version bumps.Why This Matters
We are close to making WordPress.com MGS wiki generation set-and-forget. The remaining issue is not correctness anymore; it is throughput and boring recovery. A transient connection failure should cost ~25-30s total, not ~75s+.