I recently saw a run end up Lost because the fetch:plan event timed out.
This is a rare event, but the worker must handle the case better. Should the plan fail to fetch, it should be quite happy to back off and try again.
These getter-style events (dataclip, plan, maybe credential) only read data, so we don't have to worry about idempotence. If one of them times out, it should just keep retrying until it either a) errors or b) succeeds.
I suppose the flipside of this is: if the event consistently times out, the worker should give up and return some kind of error, rather than just letting the run be Lost.
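
A rough sketch of what I have in mind, not the worker's actual code: retry a getter on timeout with exponential backoff, surface any other error straight away, and give up with an explicit error after a fixed number of timeouts. All the names here (`fetchWithRetry`, `backoffDelay`, `RetryOptions`, `isTimeout`) are made up for illustration, and how a timeout is actually detected is an assumption left to the caller.

```typescript
// Hypothetical options; none of these names come from the worker codebase.
interface RetryOptions {
  maxAttempts: number; // give up after this many timed-out attempts
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number; // upper bound on the backoff delay
  isTimeout: (err: unknown) => boolean; // assumption: caller knows what a timeout looks like
}

// Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.baseDelayMs * 2 ** (attempt - 1), opts.maxDelayMs);
}

const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Safe only for getter-style events, where repeating the request has no side effects.
async function fetchWithRetry<T>(
  fetchOnce: () => Promise<T>,
  opts: RetryOptions
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fetchOnce();
    } catch (err) {
      // Only timeouts are retried; a real error is reported immediately.
      if (!opts.isTimeout(err)) throw err;
      // Consistent timeouts become an explicit error instead of a Lost run.
      if (attempt >= opts.maxAttempts) {
        throw new Error(`fetch timed out ${attempt} times, giving up`);
      }
      await sleep(backoffDelay(attempt, opts));
    }
  }
}
```

The worker would wrap the fetch:plan request (and the dataclip and credential getters) in something like `fetchWithRetry`, and report the final error back against the run. The attempt limit and delays are placeholders to tune.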