Upon joining a cluster, I noticed that the actual announcement is significantly delayed. In particular, the debug output shows that a node has joined a full 20 seconds before the OnNodeJoin or OnNewLeaves events are fired.
I tracked this down to line 646 of cluster.go. My understanding from reading through the code and the debug output is that the join process goes like this (please slap me if I've gone awry):
- send join message to cluster
- cluster accepts join message and node into cluster (debug output says "Node ... joined!")
- cluster sends state tables back to the new node
- the new node doesn't know it has successfully joined yet, waits for 2 * NETWORK_TIMEOUT and announces presence
- after announcement is sent, the node proclaims itself joined
- the OnNodeJoin and OnNewLeaves events are fired for all nodes in the cluster
I'm guessing that there is a reason why there is a delay set to 2 * NETWORK_TIMEOUT, but I'm not sure what it is. (Truthfully, my networking skills are pretty poor, so I dare not hazard a guess.)
I would be very happy to work on a fix for this problem; I'm just not sure what the fix would look like yet. Therefore, I am seeking guidance. :-)
My inclination is to try to announce the node's presence immediately and, if that fails, try again after a longer timeout. I just don't know what "if it fails" means in this context.
Thanks!