Worker: when workflow-complete is rejected, try to send a workflow-error #1218

@josephjclark

Description

Sometimes the worker can fail to send workflow-complete events, resulting in a run being marked Lost. This is very rare but it does happen. Lately we've had a few database timeouts which trigger it (possibly while uploading large dataclips?)

But the point for Lost runs is: this one shouldn't be lost. The worker KNOWS that the complete event was rejected.

A good strategy in this case would be:

  1. Retry the event
  2. Retry the event without a dataclip
  3. Send an error back to lightning

In the event of an error, we should retry indefinitely.

Obviously there's a problem that if the system is under extreme load, events may not get processed. But dropping the dataclip and retrying should get it to go through.
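
A rough sketch of that strategy, to make the discussion concrete. Everything here is hypothetical and not taken from the worker code: `reportCompletion`, `withoutDataclip`, the `Send` type, the `dataclip` payload field and the fixed retry delay are all assumptions. The event names are just the ones used in this issue. It assumes `send` rejects when lightning rejects the event or times out.

```typescript
// Hypothetical sketch: none of these names come from the worker codebase.
type Payload = Record<string, unknown>;
type Send = (event: string, payload: Payload) => Promise<void>;
type Outcome = 'complete' | 'complete-without-dataclip' | 'error';

// Copy the payload and drop the (assumed) `dataclip` field
function withoutDataclip(payload: Payload): Payload {
  const copy = { ...payload };
  delete copy.dataclip;
  return copy;
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function reportCompletion(
  send: Send,
  payload: Payload,
  options: { delayMs?: number; maxErrorAttempts?: number } = {}
): Promise<Outcome> {
  // Infinity means "retry the error event indefinitely"
  const { delayMs = 1000, maxErrorAttempts = Infinity } = options;
  let lastError: unknown;

  // Step 1: retry the event as-is. Step 2: retry it without the dataclip.
  const attempts: Array<[Outcome, Payload]> = [
    ['complete', payload],
    ['complete-without-dataclip', withoutDataclip(payload)],
  ];
  for (const [outcome, body] of attempts) {
    try {
      await send('workflow-complete', body);
      return outcome;
    } catch (e) {
      lastError = e;
    }
  }

  // Step 3: send an error back, retrying until it is accepted
  for (let attempt = 1; attempt <= maxErrorAttempts; attempt++) {
    try {
      await send('workflow-error', { message: String(lastError) });
      return 'error';
    } catch {
      await sleep(delayMs);
    }
  }
  throw new Error('Could not report workflow completion or error');
}
```

A fixed delay is used only to keep the sketch short. Under heavy load some form of backoff would probably be kinder to lightning.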

(btw I forget how the error event works - do we actually send this to lightning or does an error just generate a complete with an exit reason?)

(also why do we send a final dataclip? Shouldn't we just send the id? Didn't we make a change here recently?)

Metadata

Assignees: No one assigned
Type: No type
Project status: DevX Backlog
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests