runtime: reconnect resumed step/checkpoint trace edges in resume_run #8

@humblemat810

Problem

In the conversation graph, resumed runtime artifacts are written as orphan nodes:

- WF step N: client_sandbox_resume
- WF checkpoint N

They are missing the normal runtime trace edges:

- wf_next_step_exec
- persist_checkpoint

This makes resumed runs look broken in CDC viewers even though the workflow state is otherwise correct.
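
One way to confirm the symptom outside a CDC viewer is to scan for nodes that no edge touches. The sketch below is only an illustration: `find_orphans`, the `(source, target)` edge tuples, and the sample run ID `r1` are hypothetical and do not reflect Kogwistar's actual API or storage format.

```python
from typing import Iterable, List, Tuple


def find_orphans(node_ids: Iterable[str], edges: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the node IDs that are neither the source nor the target of any edge."""
    connected = set()
    for source, target in edges:
        connected.add(source)
        connected.add(target)
    return sorted(node for node in node_ids if node not in connected)


# Hypothetical trace: step 1 ran normally, step 2 was written by resume_run(...)
# with last_exec_node=None, so no edge points at it or at its checkpoint.
nodes = [
    "wf_run|r1",
    "wf_step|r1|1",
    "wf_ckpt|r1|1",
    "wf_step|r1|2",
    "wf_ckpt|r1|2",
]
edges = [
    ("wf_run|r1", "wf_step|r1|1"),     # wf_next_step_exec
    ("wf_step|r1|1", "wf_ckpt|r1|1"),  # persist_checkpoint
]

orphans = find_orphans(nodes, edges)
```

With this sample data, `orphans` contains exactly the resumed step and its checkpoint.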

Root Cause

In runtime.py, resume_run(...) currently persists the resumed step and checkpoint with last_exec_node=None.

That bypasses the normal edge-writing behavior already used in the main run(...) path.

The problematic calls are currently in runtime.py, around:

- _persist_step_exec(... last_exec_node=None)
- _persist_checkpoint(... last_exec_node=None)

There is even an inline comment saying:

    "not strongly linked to previous right now"

Expected Behavior

When resume_run(...) persists:

    WF step N: client_sandbox_resume

it should attach a wf_next_step_exec edge from the previous execution node, usually:

    wf_step|<run_id>|N-1

Then, when it persists:

    WF checkpoint N

it should attach a persist_checkpoint edge from the resumed step node:

    wf_step|<run_id>|N

This should match the runtime trace semantics of the normal non-resume execution path.

Proposed Delta

In runtime.py:

1. In resume_run(...), recover the previous execution node before calling _persist_step_exec(...).
   - Prefer: wf_step|<run_id>|step_seq_current-1
   - Fall back to: wf_run|<run_id>
2. Pass that node as last_exec_node into _persist_step_exec(...).
3. Capture the returned resumed exec node.
4. Pass that returned node as last_exec_node into _persist_checkpoint(...).

Conceptually:

    previous_exec_node = lookup wf_step|run_id|step_seq_current-1
    if not found:
        previous_exec_node = lookup wf_run|run_id

    resumed_exec_node = self._persist_step_exec(
        ...,
        last_exec_node=previous_exec_node,
    )

    self._persist_checkpoint(
        ...,
        last_exec_node=resumed_exec_node,
    )
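
The steps above can be exercised end to end with a small in-memory stand-in. This is a sketch under stated assumptions, not the real runtime.py: `ToyRuntime`, its `nodes`/`edges` containers, `_previous_exec_node`, and the simplified method signatures are all hypothetical. Only the node ID shapes and edge names come from this issue.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class ToyRuntime:
    """In-memory stand-in for the workflow runtime (hypothetical, for illustration)."""

    nodes: Set[str] = field(default_factory=set)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (relation, src, dst)

    def _persist_step_exec(self, run_id: str, seq: int, last_exec_node: Optional[str]) -> str:
        node = f"wf_step|{run_id}|{seq}"
        self.nodes.add(node)
        # An edge is only written when a previous node is supplied;
        # passing None is what leaves the resumed step orphaned.
        if last_exec_node is not None:
            self.edges.append(("wf_next_step_exec", last_exec_node, node))
        return node

    def _persist_checkpoint(self, run_id: str, seq: int, last_exec_node: Optional[str]) -> str:
        node = f"wf_ckpt|{run_id}|{seq}"
        self.nodes.add(node)
        if last_exec_node is not None:
            self.edges.append(("persist_checkpoint", last_exec_node, node))
        return node

    def _previous_exec_node(self, run_id: str, seq: int) -> Optional[str]:
        # Prefer the previous step node, fall back to the run node.
        previous_step = f"wf_step|{run_id}|{seq - 1}"
        if previous_step in self.nodes:
            return previous_step
        run_node = f"wf_run|{run_id}"
        return run_node if run_node in self.nodes else None

    def resume_run(self, run_id: str, seq: int) -> str:
        previous = self._previous_exec_node(run_id, seq)
        resumed = self._persist_step_exec(run_id, seq, last_exec_node=previous)
        self._persist_checkpoint(run_id, seq, last_exec_node=resumed)
        return resumed


runtime = ToyRuntime(nodes={"wf_run|r1", "wf_step|r1|1"})
runtime.resume_run("r1", 2)
```

After the call, `runtime.edges` holds a wf_next_step_exec edge from step 1 to step 2 and a persist_checkpoint edge from step 2 to its checkpoint. If step 1 were missing, the first edge would start at `wf_run|r1` instead.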

Why This Is Upstream

This is not bridge-specific governance logic. It is a generic workflow runtime resume trace issue in Kogwistar itself.

Any product using:

- resume_run(...)
- conversation runtime trace
- CDC / graph viewers

can hit the same orphaned resumed-step/checkpoint problem.

Regression Test To Add

In test_workflow_suspend_resume.py, extend the existing suspend/resume test to assert that after resume:

- the resumed step has:
      wf_next_step_exec|<run_id>|2|last::wf_step|<run_id>|1|to::wf_step|<run_id>|2
- the resumed checkpoint has:
      persist_checkpoint|<run_id>|2|last::wf_step|<run_id>|2|to::wf_ckpt|<run_id>|2

That pins the invariant:

    resumed runtime trace must be connected just like normal runtime trace
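
A possible shape for the assertion helper is sketched below. It builds the two expected edge IDs from the patterns quoted above and checks them against whatever edge IDs the test collects. The helper names are hypothetical, and how the existing test reads edge IDs from the conversation graph is deliberately left out, since that depends on the engine API.

```python
from typing import Iterable, List


def expected_resume_edge_ids(run_id: str, seq: int) -> List[str]:
    """Build the two trace edge IDs a resumed step at `seq` should produce."""
    previous_step = f"wf_step|{run_id}|{seq - 1}"
    step = f"wf_step|{run_id}|{seq}"
    checkpoint = f"wf_ckpt|{run_id}|{seq}"
    return [
        f"wf_next_step_exec|{run_id}|{seq}|last::{previous_step}|to::{step}",
        f"persist_checkpoint|{run_id}|{seq}|last::{step}|to::{checkpoint}",
    ]


def assert_resume_trace_connected(edge_ids: Iterable[str], run_id: str, seq: int) -> None:
    """Fail if either expected trace edge is absent from the collected edge IDs."""
    present = set(edge_ids)
    missing = [e for e in expected_resume_edge_ids(run_id, seq) if e not in present]
    assert not missing, f"resumed trace is disconnected, missing edges: {missing}"
```

In the extended test this would be called as `assert_resume_trace_connected(edge_ids, run_id, 2)` once the run has been resumed, with `edge_ids` gathered from the graph by whatever means the test already uses.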

User-visible Impact

Without the fix:

- CDC shows resumed workflow nodes as orphans
- the trace looks semantically broken

With the fix:

- resumed steps/checkpoints stay connected
- workflow trace in CDC is continuous and understandable

Labels: bug