bug(operator): workflow reconciliation fails on session creation and is never retried #1486

@quay-devel

Summary

Sessions created with activeWorkflow configuration never receive the workflow — the operator attempts to POST the workflow config to the runner's HTTP endpoint before the pod is created, the POST fails with a DNS resolution error, and observedGeneration is then set to the current generation, preventing any retry.

Observed Behavior

Every session created with activeWorkflow (via acp_create_session with workflow_git_url, workflow_branch, workflow_path) shows:

WorkflowReconciled: False
Reason: UpdateFailed
Message: "Failed to notify runner: Post http://session-{name}.{namespace}.svc.cluster.local:8001/workflow: 
         dial tcp: lookup session-{name}.{namespace}.svc.cluster.local: no such host"

The session starts and enters Running phase, but the workflow is never loaded. The runner receives the initialPrompt without any workflow context (no system prompt, no skills, no rubric).

This reproduced on every attempt so far: two consecutive session creations both failed identically.

Expected Behavior

The workflow should be applied to the runner after it starts. The WorkflowReconciled condition should eventually become True.

Root Cause

Four-part failure chain in components/operator/internal/handlers/sessions.go (steps 1-3) and controller/reconcile_phases.go (step 4):

1. Workflow POST attempted before pod exists (line 679)

During the Pending phase handler, reconcileActiveWorkflowWithPatch() is called before the pod is created. It tries to POST to http://session-{name}.{namespace}.svc.cluster.local:8001/workflow, but the pod and its Service don't exist yet — DNS resolution fails.

// Line 676-679 — called BEFORE pod creation at ~line 1520
spec, _, _ := unstructured.NestedMap(currentObj.Object, "spec")
_ = reconcileSpecReposWithPatch(sessionNamespace, name, spec, currentObj, statusPatch)
_ = reconcileActiveWorkflowWithPatch(sessionNamespace, name, spec, currentObj, statusPatch)  // ← fails here

Note: reconcileSpecReposWithPatch works fine because it only sets status conditions — actual repo cloning is done by init containers in the pod spec. But reconcileActiveWorkflowWithPatch requires the runner to be listening on :8001.
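The "no such host" failure can be told apart from other POST errors, which would let the operator treat it as "runner not ready yet" rather than a hard failure. The following is a minimal sketch using only the Go standard library. The helper names isRunnerNotReady and simulatedLookupError are hypothetical, the host name is a placeholder, and the simulated error only approximates the wrapping that an HTTP client typically produces on a failed lookup.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
)

// isRunnerNotReady (hypothetical helper) reports whether err wraps a DNS
// "not found" error, i.e. the Service name does not resolve yet.
func isRunnerNotReady(err error) bool {
	var dnsErr *net.DNSError
	return errors.As(err, &dnsErr) && dnsErr.IsNotFound
}

// simulatedLookupError builds an error shaped like the one in the issue:
// url.Error -> net.OpError -> net.DNSError. The host name is a placeholder.
func simulatedLookupError() error {
	return &url.Error{
		Op:  "Post",
		URL: "http://session-demo.demo-ns.svc.cluster.local:8001/workflow",
		Err: &net.OpError{
			Op:  "dial",
			Net: "tcp",
			Err: &net.DNSError{
				Err:        "no such host",
				Name:       "session-demo.demo-ns.svc.cluster.local",
				IsNotFound: true,
			},
		},
	}
}

func main() {
	fmt.Println(isRunnerNotReady(simulatedLookupError()))                 // true
	fmt.Println(isRunnerNotReady(errors.New("runner returned HTTP 500"))) // false
}
```

A classification like this only helps if something later retries the POST; on its own it does not fix the missing retry described in the following steps.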

2. Error silently discarded (line 679)

The error is assigned to _, so the flow continues to pod creation as if nothing went wrong. The WorkflowReconciled=False condition is batched into the statusPatch but doesn't block progress.

3. observedGeneration set despite failure (line 1543)

After pod creation succeeds, observedGeneration is set to currentObj.GetGeneration():

// Line 1542-1543
statusPatch.SetField("phase", "Creating")
statusPatch.SetField("observedGeneration", currentObj.GetGeneration())  // ← marks spec as "reconciled"

This tells the Running phase reconciler that the spec has been fully applied — even though the workflow POST failed.

4. Running phase never retries (reconcile_phases.go line 325)

When the session reaches Running phase, reconcileRunning() checks for generation drift:

if currentGen != observedGen && observedGen != 0 {
    // reconcile spec changes...
}

Since observedGen (1) equals currentGen (1), this is false — workflow reconciliation is never retried. No other mechanism checks for WorkflowReconciled=False.
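To make the dead end concrete, the drift predicate quoted above can be evaluated in isolation. This is a sketch only: needsSpecReconcile is a hypothetical wrapper around the quoted condition, not a function from the operator.

```go
package main

import "fmt"

// needsSpecReconcile (hypothetical name) mirrors the quoted drift check:
// currentGen != observedGen && observedGen != 0.
func needsSpecReconcile(currentGen, observedGen int64) bool {
	return currentGen != observedGen && observedGen != 0
}

func main() {
	// Fresh session: observedGeneration was already written as 1 during Pending.
	fmt.Println(needsSpecReconcile(1, 1)) // false: the failed workflow POST is not retried

	// A later spec edit bumps the generation to 2.
	fmt.Println(needsSpecReconcile(2, 1)) // true: the check fires only on a spec change

	// observedGeneration never recorded (reads as 0).
	fmt.Println(needsSpecReconcile(1, 0)) // false: the observedGen != 0 clause skips this case
}
```

The third case matters for any fix that relies on leaving observedGeneration unset: as quoted, the predicate would not fire for a value of 0 either.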

Suggested Fix

Several options (not mutually exclusive):

Option A: Move workflow reconciliation to after runner is ready
Don't call reconcileActiveWorkflowWithPatch() at line 679. Instead, add a check in reconcileRunning() (or reconcileCreating() after RunnerStarted=True) that detects WorkflowReconciled != True and retries.
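A rough sketch of the decision Option A implies, using only the standard library. It assumes status.conditions is a list of maps with "type" and "status" keys (the usual Kubernetes convention, and the shape an unstructured object yields). The names workflowAction, conditionStatus, and decideWorkflowAction are hypothetical and do not come from the operator.

```go
package main

import "fmt"

// workflowAction is a hypothetical result type for this sketch.
type workflowAction string

const (
	actionNone workflowAction = "none" // nothing to do
	actionWait workflowAction = "wait" // runner not started yet; check again later
	actionPost workflowAction = "post" // runner is up; send the workflow now
)

// conditionStatus returns the status of the named condition, or "" if absent.
func conditionStatus(conditions []interface{}, condType string) string {
	for _, c := range conditions {
		m, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if t, _ := m["type"].(string); t == condType {
			s, _ := m["status"].(string)
			return s
		}
	}
	return ""
}

// decideWorkflowAction posts only once RunnerStarted=True, and keeps asking
// until WorkflowReconciled=True, independent of generation drift.
func decideWorkflowAction(hasActiveWorkflow bool, conditions []interface{}) workflowAction {
	if !hasActiveWorkflow || conditionStatus(conditions, "WorkflowReconciled") == "True" {
		return actionNone
	}
	if conditionStatus(conditions, "RunnerStarted") != "True" {
		return actionWait
	}
	return actionPost
}

func main() {
	pending := []interface{}{
		map[string]interface{}{"type": "WorkflowReconciled", "status": "False", "reason": "UpdateFailed"},
	}
	running := append(pending, map[string]interface{}{"type": "RunnerStarted", "status": "True"})

	fmt.Println(decideWorkflowAction(true, pending)) // wait
	fmt.Println(decideWorkflowAction(true, running)) // post
}
```

Because the decision depends only on conditions, a session whose first POST failed would still be picked up on a later reconcile, provided something triggers that reconcile.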

Option B: Don't set observedGeneration if workflow reconciliation failed
Only set observedGeneration at line 1543 if both repo and workflow reconciliation succeeded. This would cause the Running phase to detect currentGen != observedGen and retry.
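One way Option B might look, as a sketch under stated assumptions: statusPatch here is a hypothetical stand-in that models only SetField, and finishPending and errRunnerUnreachable are illustrative names. The point is to keep the two reconcile errors instead of assigning them to _, and to record observedGeneration only when both are nil.

```go
package main

import (
	"errors"
	"fmt"
)

// statusPatch is a hypothetical stand-in for the operator's patch type.
type statusPatch struct{ fields map[string]interface{} }

func (p *statusPatch) SetField(key string, value interface{}) { p.fields[key] = value }

// errRunnerUnreachable is an illustrative error for the failed workflow POST.
var errRunnerUnreachable = errors.New("failed to notify runner: no such host")

// finishPending always advances the phase, but records observedGeneration
// only if both repo and workflow reconciliation succeeded.
func finishPending(p *statusPatch, generation int64, repoErr, workflowErr error) error {
	p.SetField("phase", "Creating")
	if err := errors.Join(repoErr, workflowErr); err != nil {
		return err // spec not fully applied; leave observedGeneration untouched
	}
	p.SetField("observedGeneration", generation)
	return nil
}

func main() {
	p := &statusPatch{fields: map[string]interface{}{}}
	err := finishPending(p, 1, nil, errRunnerUnreachable)
	_, recorded := p.fields["observedGeneration"]
	fmt.Println(err != nil, recorded) // true false
}
```

One caveat: the drift check quoted under item 4 also requires observedGen != 0. If an unset observedGeneration reads as 0 on a brand-new session (an assumption about the operator), that check would still skip the retry, so Option B would likely need the predicate adjusted as well.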

Option C: Check WorkflowReconciled condition in reconcileRunning
Add a condition check in reconcileRunning() independent of generation drift:

if workflowCondition == "False" && workflowReason == "UpdateFailed" {
    // Retry workflow reconciliation
}

Option A is the cleanest — the workflow POST can only succeed when the runner is listening, so it should never be attempted before the pod exists.

Reproduction Steps

  1. Create a session with workflow configuration:

     acp_create_session(
       session_name="test-workflow",
       workflow_git_url="https://github.com/org/repo.git",
       workflow_branch="main",
       workflow_path=".ambient/workflows/my-workflow"
     )

  2. Check session conditions: WorkflowReconciled will be False/UpdateFailed
  3. Wait for session to reach Running phase: workflow is never applied
  4. Runner receives initialPrompt without workflow context

Environment

  • Platform: Ambient Code (OpenShift)
  • Operator version: current main
  • Confirmed on two independent session creations in the quay project namespace

Key Files

File                                     Lines      What
handlers/sessions.go                     676-679    Workflow POST called before pod creation
handlers/sessions.go                     1542-1543  observedGeneration set despite failure
handlers/sessions.go                     1907-2003  reconcileActiveWorkflowWithPatch() implementation
controller/reconcile_phases.go           320-335    reconcileRunning() generation drift check
controller/agenticsession_controller.go  215-231    Watch predicates (no retry trigger for condition changes)

Labels: ambient-code:auto-fix. Amber agent: automated low-risk fixes (formatting, linting)
