
Use shorter retry delays for AI transport/connect timeouts #1818

@chubes4

Description

Problem

After #1817, Data Machine separates AI request and connect timeouts and logs the resolved transport profile. On intelligence-chubes4, Flow 2 now fails fast on transport stalls:

Network error occurred while sending request to https://api.openai.com/v1/responses: cURL error 28: Connection timed out after 15000 milliseconds

Transport profile confirms the intended settings are active:

{
  "request_timeout": 300,
  "connect_timeout": 15,
  "request_options_class_available": true,
  "request_options_used": true,
  "curl_hook_installed": true,
  "success": false
}

This is a major improvement over the prior 120s timeout, but retry scheduling still uses the generic retry delay:

{
  "attempt": 1,
  "max_attempts": 3,
  "delay_seconds": 60,
  "reason": "ai_processing_failed",
  "provider": "openai",
  "source_type": "mcp"
}

For cheap connection/transport failures, waiting 60s is unnecessarily slow. Now that connect timeouts fail in ~15s, the retry delay should be correspondingly short so set-and-forget pipelines keep moving.

Desired Behavior

Make retry delay failure-type-aware:

connect timeout / cURL 28 / DNS / network transport blip
  -> short retry delay, e.g. 10-15s

provider 429 / rate limit / overload
  -> backoff / longer delay

full request timeout after long model execution
  -> moderate delay

tool contract/runtime failure
  -> existing retry/defer/fail semantics; do not mark processed unless reject_source
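The mapping above could be sketched as a classifier plus a delay table. Everything here is hypothetical (function names, delay values, and the message patterns are assumptions, not existing Data Machine APIs); the message distinctions rely on typical cURL error text, where a connect-stage timeout reads "Connection timed out" and a full request timeout reads "Operation timed out":

```php
<?php
// Sketch only: these helpers and delay values are illustrative, not shipped code.

// Classify an AI failure into a retry class. cURL 28 covers both connect and
// request timeouts; the message text is used to tell them apart (assumption
// based on common cURL error strings).
function classify_ai_failure(string $error, int $http_status = 0): string {
    if ($http_status === 429 || stripos($error, 'rate limit') !== false) {
        return 'rate_limit';
    }
    if (preg_match('/Connection timed out|Could not resolve host/i', $error)) {
        return 'transport_connect_timeout';
    }
    if (preg_match('/Operation timed out/i', $error)) {
        return 'request_timeout';
    }
    return 'generic';
}

// Map retry class to a delay; generic keeps the existing 60s default.
function retry_delay_for(string $retry_class, int $attempt): int {
    switch ($retry_class) {
        case 'transport_connect_timeout':
            return 15;                                   // fail fast, retry fast
        case 'rate_limit':
            return min(60 * (2 ** ($attempt - 1)), 300); // backoff, capped at 5 min
        case 'request_timeout':
            return 30;                                   // moderate
        default:
            return 60;                                   // existing generic delay
    }
}
```

With this shape, the Flow 2 failure above would retry after 15s instead of 60s, while a 429 on attempt 2 would back off to 120s.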

Runtime Evidence

Recent Flow 2 job 657 after #1817:

  • Fetch succeeded.
  • First AI call failed with cURL error 28 after 15000ms.
  • Retry was scheduled for 60s later.
  • Retry succeeded and the model correctly called reject_source(reason=too_thin).

So the transport fix works, but the retry delay is now the long pole.

Relevant Code

Likely area:

  • inc/Core/JobRetryPolicy.php
  • AI failure context passed from RequestBuilder / AIStep into retry policy
  • Logs emitted by JobRetryPolicy::maybeRetry() / retry history in job engine data
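One possible shape for the enriched retry log context, extending the JSON shown earlier with a retry_class field. Only retry_class is new; every other key mirrors the existing log entry, and the values here are illustrative rather than taken from real code:

```php
<?php
// Illustrative only: a retry log context with the proposed retry_class field.
$retry_class   = 'transport_connect_timeout';
$delay_seconds = 15; // short delay for transport/connect failures

$context = [
    'attempt'       => 1,
    'max_attempts'  => 3,
    'delay_seconds' => $delay_seconds,
    'reason'        => 'ai_processing_failed',
    'retry_class'   => $retry_class,   // new: failure classification
    'provider'      => 'openai',
    'source_type'   => 'mcp',
];

echo json_encode($context, JSON_PRETTY_PRINT);
```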

Acceptance Criteria

  • Classify AI network/connect timeout failures separately from generic AI processing failures.
  • Use a shorter retry delay for AI transport/connect failures, likely 10-15s.
  • Preserve existing longer/backoff behavior for rate limits, overloads, and provider throttling.
  • Include retry classification in job retry metadata/log context, e.g. retry_class=transport_connect_timeout.
  • Tests prove:
    • cURL 28 / Connection timed out after ... milliseconds gets short delay.
    • rate limit / 429 keeps longer/backoff behavior.
    • generic retryable AI failures keep existing default behavior.
  • No source-specific logic for Intelligence, MGS, WordPress.com, or Flow 2.
  • No CHANGELOG.md edits or manual version bumps.
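The test bullets above could be expressed as plain assertions against a classifier. The classify_ai_failure helper below is a hypothetical stand-in restated for self-containment, not existing Data Machine code:

```php
<?php
// Hypothetical classifier mirroring the acceptance criteria; not shipped code.
function classify_ai_failure(string $error, int $http_status = 0): string {
    if ($http_status === 429 || stripos($error, 'rate limit') !== false) {
        return 'rate_limit';
    }
    if (preg_match('/cURL error 28: Connection timed out after \d+ milliseconds/i', $error)) {
        return 'transport_connect_timeout';
    }
    return 'generic';
}

// cURL 28 / "Connection timed out after ... milliseconds" -> short-delay class
assert(classify_ai_failure('cURL error 28: Connection timed out after 15000 milliseconds') === 'transport_connect_timeout');
// rate limit / 429 -> backoff class
assert(classify_ai_failure('Too Many Requests', 429) === 'rate_limit');
// generic retryable AI failures -> existing default behavior
assert(classify_ai_failure('AI processing failed') === 'generic');
```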

Why This Matters

We are close to making WordPress.com MGS wiki generation set-and-forget. The remaining issue is not correctness anymore; it is throughput and boring recovery. A transient connection failure should cost ~25-30s total, not ~75s+.
