Worker: when `workflow-complete` is rejected, try to send a `workflow-error`

Sometimes the worker fails to send a `workflow-complete` event, which results in the run being marked Lost. This is very rare but it does happen. Lately we've had a few database timeouts which trigger it (possibly while uploading large dataclips?).

But the point is: these runs shouldn't be Lost. The worker knows that the complete event was rejected.

A good strategy in this case would be:
1. Retry the event
2. Retry the event without a dataclip
3. Send an error back to Lightning

In the event of an error, we should retry indefinitely.

Obviously there's a problem: if the system is under extreme load, events may not get processed at all. But dropping the dataclip and retrying should get the event to go through.
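
A rough sketch of the fallback chain described above, to make the proposal concrete. Everything here is hypothetical: `sendCompleteWithFallback`, the `SendEvent` signature, the `final_dataclip` payload key, and the retry options are illustrative names, not the worker's actual API. It also assumes `workflow-error` is a distinct event that can be sent to Lightning, which is an open question in this issue.

```ts
// Hypothetical types: none of these names come from the worker codebase.
type CompletePayload = { reason: string; final_dataclip?: string };

// Assumed to reject (throw) when Lightning rejects the event or times out.
type SendEvent = (event: string, payload: object) => Promise<void>;

type Outcome = 'complete' | 'complete-without-dataclip' | 'error';

async function sendCompleteWithFallback(
  send: SendEvent,
  payload: CompletePayload,
  // maxErrorAttempts defaults to Infinity ("retry indefinitely");
  // it is only configurable so the sketch can be exercised in a test.
  opts = { errorRetryDelayMs: 1000, maxErrorAttempts: Infinity }
): Promise<Outcome> {
  // Initial send, then step 1: retry the event unchanged.
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      await send('workflow-complete', payload);
      return 'complete';
    } catch {
      // rejected: fall through to the next attempt
    }
  }

  // Step 2: retry the event without the dataclip.
  const withoutDataclip: CompletePayload = { ...payload };
  delete withoutDataclip.final_dataclip;
  try {
    await send('workflow-complete', withoutDataclip);
    return 'complete-without-dataclip';
  } catch {
    // rejected again: fall through to the error event
  }

  // Step 3: send an error back to Lightning, retrying until it is accepted.
  for (let attempt = 1; ; attempt++) {
    try {
      await send('workflow-error', {
        message: 'workflow-complete was rejected',
      });
      return 'error';
    } catch (e) {
      if (attempt >= opts.maxErrorAttempts) throw e;
      await new Promise((resolve) => setTimeout(resolve, opts.errorRetryDelayMs));
    }
  }
}
```

Returning the outcome lets the caller log which step finally succeeded, which would help tell "completed without a final dataclip" apart from a genuine error when debugging these runs.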

(btw I forget how the error event works - do we actually send this to Lightning, or does an error just generate a complete with an exit reason?)

(also why do we send a final dataclip? Shouldn't we just send the id? Didn't we make a change here recently?)
