-
Notifications
You must be signed in to change notification settings - Fork 222
[core] [world] Lazy run creation on start #1537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
38b6674
2f51139
0dfd424
b1c2bf3
9202723
c129183
5223192
5033640
d381cd3
10838f3
ccba633
770c419
ee29ee9
11631dc
ddd0bfd
3ed82c1
7b51efd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| --- | ||
| "@workflow/world": patch | ||
| "@workflow/core": patch | ||
| "@workflow/world-local": patch | ||
| "@workflow/world-postgres": patch | ||
| --- | ||
|
|
||
| Combine initial run fetch, event fetch, and run_started event creation |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| --- | ||
| "@workflow/world-postgres": patch | ||
| "@workflow/world-vercel": patch | ||
| "@workflow/world-local": patch | ||
| "@workflow/world": patch | ||
| "@workflow/core": minor | ||
| --- | ||
|
|
||
| Allow workflow invocation to create run if initial storage call in `start` did not succeed. Send run input through queue to enable this. Allow creating run_created and run_started events together in World, and skip first event list call by returning events directly. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,5 +1,5 @@ | ||
| { | ||
| "title": "Changelog", | ||
| "pages": ["index", "eager-processing"], | ||
| "pages": ["index", "eager-processing", "resilient-start"], | ||
| "defaultOpen": false | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,144 @@ | ||
| --- | ||
| title: Resilient run start | ||
| description: Overhaul run start logic to tolerate world storage unavailability, as long as the queue is healthy, and significantly speeds up run start. | ||
| --- | ||
|
|
||
| # Resilient `start()` | ||
|
|
||
| ## Motivation | ||
|
|
||
| When `world` storage is unavailable but the queue is up, | ||
| `start()` previously failed entirely because `world.events.create(run_created)` | ||
| is called before `world.queue()`. This change decouples run creation from queue | ||
| dispatch so that runs can still be accepted when storage is degraded. | ||
|
|
||
| ## Design | ||
|
|
||
| ### `start()` changes (packages/core) | ||
|
|
||
| - `world.events.create` (run_created) and `world.queue` are now called **in parallel**. | ||
| - If `events.create` errors with **429 or 5xx**, we log a warning saying that run | ||
| creation failed but the run was accepted — creation will be re-tried async by the | ||
| runtime when it processes the queue message. | ||
| - If `world.queue` fails, we still throw — the run truly failed and was not enqueued. | ||
| - The queue invocation now receives all the run inputs (`input`, `deploymentId`, | ||
| `workflowName`, `specVersion`, `executionContext`) so the runtime can create the | ||
| run later if needed. | ||
| - When the runtime re-enqueues itself, it does **not** pass these inputs — only the | ||
| first queue cycle carries them. | ||
|
|
||
| ### `workflowEntrypoint` changes (packages/core) | ||
|
|
||
| - We no longer call `world.runs.get` or check the run status before starting. | ||
| - We **always** call `world.events.create` with `run_started`, now also passing the | ||
| run input that was sent through the queue. The response will be: | ||
| - **200 with event (now running)**: use the returned `Run` entity as the run. The | ||
| response also includes an `events` array of all events up to that point (typically | ||
| `run_created` and `run_started`), with data resolved. These are used to skip the | ||
| very first `world.events.list` call, reducing TTFB for the first invocation. | ||
| - **200 without event (already running)**: the run entity is returned directly | ||
| without creating a duplicate event. The runtime proceeds normally. | ||
| - **410 (already finished)**: log and exit as usual. | ||
|
|
||
| ### World / workflow-server changes | ||
|
|
||
| - Posting `run_started` to a **non-existent** run is now allowed when the run input is | ||
| sent along with the payload. The server: | ||
| 1. Creates a `run_created` event first (so the event log is consistent). | ||
| 2. Strips the input from the `run_started` event data (it lives on `run_created`). | ||
| 3. Then creates the `run_started` event normally. | ||
| 4. Emits a log and a Datadog metric (`workflow_server.resilient_start.run_created_via_run_started`) | ||
| to track when this fallback path is hit. | ||
| - When `run_started` encounters an **already-running** run, all worlds return `{ run }` | ||
| with `event: undefined` instead of throwing. No duplicate event is created. | ||
| - When posting `run_started` and getting **200**, the response includes an `events` | ||
| property with all events up to that point (data always resolved). | ||
| - ULID timestamp validation now uses **asymmetric thresholds**: 24 hours in the past | ||
| (to support queue retry delays) and 5 minutes in the future (to prevent abuse while | ||
| tolerating clock skew). | ||
|
|
||
| ## Decisions | ||
|
|
||
| 1. **Parallel not sequential**: We chose `Promise.allSettled` over sequential calls to | ||
| minimize latency in the happy path. The trade-off is slightly more complex error | ||
| handling. | ||
|
|
||
| 2. **Already-running returns run without event**: When `run_started` encounters an | ||
| already-running run, all worlds return `{ run }` with `event: undefined` (no | ||
| `events` array) instead of throwing. The runtime detects this by checking for | ||
| `result.event === undefined`. This avoids the extra `world.runs.get` round-trip. | ||
|
|
||
| 3. **Events in 200 response**: We only return events on the 200 path (first caller). | ||
| On the already-running path, we fall back to the normal `events.list` call. This is | ||
| correct because only on 200 can we be certain we know the full event history. | ||
|
|
||
| 4. **Asymmetric ULID thresholds**: VQS supports delayed messages up to 24 hours. We | ||
| allow 24h in the past so a run_created retry can succeed at maximum queue delay, but | ||
| keep the future threshold at 5 minutes to prevent abuse from manipulated timestamps. | ||
|
|
||
| ## Implementation notes | ||
|
|
||
| ### Error type mapping for terminal runs | ||
|
|
||
| Previously, calling `run_started` on a terminal run threw `InvalidOperationStateError` | ||
| (HTTP 409) on workflow-server, or `EntityConflictError` on world-local/world-postgres. | ||
| This was changed to `EntityGoneError` (HTTP 410) / `RunExpiredError` so the runtime | ||
| correctly distinguishes "already running" from "already finished" (exit immediately). | ||
|
|
||
| ### run_started on already-running runs | ||
|
|
||
| All worlds (workflow-server, world-local, world-postgres) now return the existing run | ||
| entity directly — with `event: undefined` — when `run_started` is called on an | ||
| already-running run. This avoids both a duplicate event and the extra `world.runs.get` | ||
| call that the previous 409-based approach required. The `EventResultResolveWireSchema` | ||
| in world-vercel was updated to make `event` optional. | ||
|
|
||
| ### world-local and world-postgres support | ||
|
|
||
| Both world-local (filesystem) and world-postgres (Drizzle/SQL) now implement the full | ||
| resilient start behavior: | ||
|
|
||
| - Creating runs from `run_started` when the run doesn't exist and eventData is provided | ||
| - Returning `{ run }` without event on already-running | ||
| - Throwing `RunExpiredError` on terminal runs | ||
| - Stripping eventData from stored `run_started` events | ||
| - Returning the `events` array on successful start | ||
|
|
||
| ### Asymmetric ULID timestamp validation | ||
|
|
||
| Both `@workflow/world` (`validateUlidTimestamp`) and `workflow-server` | ||
| (`Ulid.isTimestampWithinThreshold`) now accept separate past and future thresholds: | ||
|
|
||
| - **Past**: 24 hours (`DEFAULT_TIMESTAMP_THRESHOLD_PAST_MS`) | ||
| - **Future**: 5 minutes (`DEFAULT_TIMESTAMP_THRESHOLD_FUTURE_MS`) | ||
|
|
||
| The old `DEFAULT_TIMESTAMP_THRESHOLD_MS` constant is deprecated but aliased to the | ||
| past threshold for backwards compatibility. | ||
|
|
||
| ### Datadog metric | ||
|
|
||
| The resilient start fallback path emits a Datadog distribution metric: | ||
| `workflow_server.resilient_start.run_created_via_run_started`, tagged with | ||
| `workflow_name`. Query with `sum:workflow_server.resilient_start.run_created_via_run_started{*}`. | ||
|
|
||
| ### Base64 encoding for queue transport | ||
|
|
||
| `Uint8Array` values (the serialized workflow input) don't survive JSON serialization | ||
| through the queue — they get corrupted to `{0: 72, 1: 101, ...}` objects. The `runInput` | ||
|
Comment on lines
+124
to
+127
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a hack, trying to remove this right now by devaluing binary, but might not be the right approach. WDYT? |
||
| payload in the queue message now base64-encodes binary input in `start()` and the | ||
| runtime decodes it back to `Uint8Array` before passing it to `world.events.create`. | ||
| This was caught by the `spawnWorkflowFromStepWorkflow` e2e test where the child | ||
| workflow's input was being corrupted. | ||
|
|
||
| ### RunStartedEventSchema eventData stripping | ||
|
|
||
| The run input is passed through to `run_started`'s `eventData` but stripped before | ||
| the event is persisted — the data belongs on the `run_created` event only. All worlds | ||
| strip eventData from stored `run_started` events. | ||
|
|
||
| ## Follow-up work | ||
|
|
||
| - [ ] Add e2e tests covering the degraded-storage start path against a live deployment. | ||
| - [ ] Monitor the Datadog metric in production to understand how often the fallback is hit. | ||
| - [ ] Consider whether the `events` optimization in the 200 response should also apply | ||
| to re-enqueue cycles (currently only first invocation). | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -4,18 +4,18 @@ import { | |
| RunExpiredError, | ||
| WorkflowRuntimeError, | ||
| } from '@workflow/errors'; | ||
| import { classifyRunError } from './classify-error.js'; | ||
| import { MAX_QUEUE_DELIVERIES } from './runtime/constants.js'; | ||
| import { parseWorkflowName } from '@workflow/utils/parse-name'; | ||
| import { | ||
| type Event, | ||
| SPEC_VERSION_CURRENT, | ||
| WorkflowInvokePayloadSchema, | ||
| type WorkflowRun, | ||
| } from '@workflow/world'; | ||
| import { classifyRunError } from './classify-error.js'; | ||
| import { importKey } from './encryption.js'; | ||
| import { WorkflowSuspension } from './global.js'; | ||
| import { runtimeLogger } from './logger.js'; | ||
| import { MAX_QUEUE_DELIVERIES } from './runtime/constants.js'; | ||
| import { | ||
| getAllWorkflowRunEvents, | ||
| getQueueOverhead, | ||
|
|
@@ -105,6 +105,7 @@ export function workflowEntrypoint( | |
| runId, | ||
| traceCarrier: traceContext, | ||
| requestedAt, | ||
| runInput, | ||
| } = WorkflowInvokePayloadSchema.parse(message_); | ||
| const { requestId } = metadata; | ||
| // Extract the workflow name from the topic name | ||
|
|
@@ -191,50 +192,74 @@ export function workflowEntrypoint( | |
| }); | ||
|
|
||
| let workflowStartedAt = -1; | ||
| let workflowRun = await world.runs.get(runId); | ||
| let workflowRun: WorkflowRun | undefined; | ||
| // Pre-loaded events from run_started response (first caller optimization) | ||
| let preloadedEvents: Event[] | undefined; | ||
|
|
||
| // --- Infrastructure: prepare the run state --- | ||
| // Always call run_started directly — this both transitions | ||
| // the run to 'running' AND returns the run entity, saving | ||
| // a separate runs.get round-trip. When runInput is present | ||
| // (resilient start), pass it so the server can create the | ||
| // run if run_created was missed. | ||
| // Network/server errors propagate to the queue handler for retry. | ||
| // WorkflowRuntimeError (data integrity issues) are fatal and | ||
| // produce run_failed since retrying won't fix them. | ||
| try { | ||
| if (workflowRun.status === 'pending') { | ||
| // Transition run to 'running' via event (event-sourced architecture) | ||
| const result = await world.events.create( | ||
| runId, | ||
| { | ||
| eventType: 'run_started', | ||
| specVersion: SPEC_VERSION_CURRENT, | ||
| }, | ||
| { requestId } | ||
| const result = await world.events.create( | ||
| runId, | ||
| { | ||
| eventType: 'run_started', | ||
| specVersion: SPEC_VERSION_CURRENT, | ||
| // Pass run input from queue so server can create | ||
| // the run if run_created was missed. | ||
| // Input is base64-encoded for queue transport since | ||
| // Uint8Array doesn't survive JSON serialization. | ||
| ...(runInput | ||
| ? { | ||
| eventData: { | ||
| input: | ||
| typeof runInput.input === 'string' | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Blocking: This decodes ANY string as base64 binary. If The encode side in See my comment on |
||
| ? Uint8Array.from(atob(runInput.input), (c) => | ||
| c.charCodeAt(0) | ||
| ) | ||
| : runInput.input, | ||
| deploymentId: runInput.deploymentId, | ||
| workflowName: runInput.workflowName, | ||
| executionContext: runInput.executionContext, | ||
| }, | ||
| } | ||
| : {}), | ||
| }, | ||
| { requestId } | ||
| ); | ||
| if (!result.run) { | ||
| throw new WorkflowRuntimeError( | ||
| `Event creation for 'run_started' did not return the run entity for run "${runId}"` | ||
| ); | ||
| // Use the run entity from the event response (no extra get call needed) | ||
| if (!result.run) { | ||
| throw new WorkflowRuntimeError( | ||
| `Event creation for 'run_started' did not return the run entity for run "${runId}"` | ||
| ); | ||
| } | ||
| workflowRun = result.run; | ||
| } | ||
| workflowRun = result.run; | ||
|
|
||
| // If the response includes events, use them to skip | ||
| // the initial events.list call and reduce TTFB. | ||
| if (result.events && result.events.length > 0) { | ||
| preloadedEvents = result.events; | ||
| } | ||
|
|
||
| // At this point, the workflow is "running" and `startedAt` should | ||
| // definitely be set. | ||
| if (!workflowRun.startedAt) { | ||
| throw new WorkflowRuntimeError( | ||
| `Workflow run "${runId}" has no "startedAt" timestamp` | ||
| ); | ||
| } | ||
| } catch (err) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Blocking (behavior change): The old code caught both Is this intentional? The design doc says already-running returns |
||
| // Run was concurrently completed/failed/cancelled | ||
| // between the GET and the run_started event creation | ||
| if (EntityConflictError.is(err) || RunExpiredError.is(err)) { | ||
| if (RunExpiredError.is(err)) { | ||
| // 410: already finished — log and exit | ||
| runtimeLogger.info( | ||
| 'Run already finished during setup, skipping', | ||
| { workflowRunId: runId, message: err.message } | ||
| ); | ||
| return; | ||
| } | ||
| if (err instanceof WorkflowRuntimeError) { | ||
| } else if (err instanceof WorkflowRuntimeError) { | ||
| runtimeLogger.error( | ||
| 'Fatal runtime error during workflow setup', | ||
| { workflowRunId: runId, error: err.message } | ||
|
|
@@ -265,8 +290,15 @@ export function workflowEntrypoint( | |
| throw failErr; | ||
| } | ||
| return; | ||
| } else { | ||
| throw err; | ||
| } | ||
| throw err; | ||
| } | ||
|
|
||
| if (!workflowRun.startedAt) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Non-blocking: This |
||
| throw new WorkflowRuntimeError( | ||
| `Workflow run "${runId}" has no "startedAt" timestamp` | ||
| ); | ||
| } | ||
| workflowStartedAt = +workflowRun.startedAt; | ||
|
|
||
|
|
@@ -294,8 +326,12 @@ export function workflowEntrypoint( | |
| return; | ||
| } | ||
|
|
||
| // Load all events into memory before running | ||
| const events = await getAllWorkflowRunEvents(workflowRun.runId); | ||
| // Load all events into memory before running. | ||
| // If we got events from the run_started response, | ||
| // skip the events.list round-trip to reduce TTFB. | ||
| const events = | ||
| preloadedEvents ?? | ||
| (await getAllWorkflowRunEvents(workflowRun.runId)); | ||
|
|
||
| // Check for any elapsed waits and create wait_completed events | ||
| const now = Date.now(); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blocking: AGENTS.md states: "All changes should be marked as 'patch'. Never use 'major' or 'minor' modes."
This should be: