Description
Right now the m_etcd Coordinator handles maintaining the task claim in etcd by heartbeating the claim node before the TTL expires, and calling Handler#Stop() if it's unable to maintain the claim.
Users only need to make sure their Stop() methods cause Handler#Run() to exit in a timely manner to preserve the execute-task-exactly-once guarantee.
However, this does nothing to guarantee the correctness of the user's Handler#Run() method. If the method deadlocks, the Coordinator keeps heartbeating and the claim persists. The work is then effectively running 0 times in the cluster, not exactly once, and nothing tells an operator that anything is amiss.
Solution: Allow Handlers to handle heartbeating
Coordinators should be able to provide a Task#Heartbeat() method, returning an error, that handlers call manually to keep the claim alive. Handlers would be expected to exit if an error is returned.
An adapter or similar helper would be provided to keep the original behavior, in which the Coordinator fully controls the heartbeat.