Description
Sometimes the worker fails to send workflow-complete events, which results in a run being marked Lost. This is very rare, but it does happen. Lately we've had a few database timeouts which trigger it (possibly while uploading large dataclips?).
But the point is that these runs shouldn't be Lost: the worker KNOWS that the complete event was rejected.
A good strategy in this case would be:
- Retry the event
- Retry the event without a dataclip
- Send an error back to lightning
In the event of an error, we should retry indefinitely.
Obviously there's a problem: if the system is under extreme load, events may not get processed at all. But dropping the dataclip and retrying should get the event through.
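To make the proposal concrete, here is a rough sketch of the three-step fallback in TypeScript. Everything in it is hypothetical: `SendFn`, `CompletePayload`, the `run-error` event name, and the retry options are placeholders, not the worker's real API. It also assumes that "retry indefinitely" applies to the final error event, and that the first two steps give up after a bounded number of attempts.

```typescript
// Hypothetical shapes for illustration only; not the worker's real types.
type CompletePayload = { runId: string; reason: string; finalDataclip?: unknown };
type SendFn = (event: string, payload: object) => Promise<void>;
type Outcome = "sent" | "sent-without-dataclip" | "error-reported";
type RetryOptions = { attemptsPerStep?: number; baseDelayMs?: number; maxDelayMs?: number };

// Placeholder event names (assumptions, not the real channel events).
const COMPLETE_EVENT = "workflow-complete";
const ERROR_EVENT = "run-error";

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Returns true if the send resolved, false if it was rejected or timed out.
async function trySend(send: SendFn, event: string, payload: object): Promise<boolean> {
  try {
    await send(event, payload);
    return true;
  } catch {
    return false;
  }
}

async function sendCompleteWithFallback(
  send: SendFn,
  payload: CompletePayload,
  options: RetryOptions = {}
): Promise<Outcome> {
  const { attemptsPerStep = 3, baseDelayMs = 500, maxDelayMs = 30_000 } = options;
  // Exponential backoff, capped so the delay never grows without bound.
  const backoff = (attempt: number) => sleep(Math.min(baseDelayMs * 2 ** attempt, maxDelayMs));

  // Step 1: retry the event as-is.
  for (let i = 0; i < attemptsPerStep; i++) {
    if (await trySend(send, COMPLETE_EVENT, payload)) return "sent";
    await backoff(i);
  }

  // Step 2: retry the event without the dataclip.
  const slim: CompletePayload = { ...payload };
  delete slim.finalDataclip;
  for (let i = 0; i < attemptsPerStep; i++) {
    if (await trySend(send, COMPLETE_EVENT, slim)) return "sent-without-dataclip";
    await backoff(i);
  }

  // Step 3: send an error back, retrying until it is accepted.
  const errorPayload = { runId: payload.runId, message: "complete event was rejected" };
  for (let i = 0; ; i++) {
    if (await trySend(send, ERROR_EVENT, errorPayload)) return "error-reported";
    await backoff(i);
  }
}
```

The returned outcome tells the caller which step succeeded, so it could log that the dataclip was dropped. Whether step 3 should be a separate event at all depends on the answer to the question below about how the error event works.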
(btw I forget how the error event works - do we actually send this to lightning, or does an error just generate a complete with an exit reason?)
(also why do we send a final dataclip? Shouldn't we just send the id? Didn't we make a change here recently?)