This repository was archived by the owner on Feb 22, 2022. It is now read-only.


@knagaitsev
Contributor

'''
try:
-    b_ep_id, reg_message = self.tasks_q.get(timeout=0) # timeout in ms # Update to 0ms
+    b_ep_id, reg_message = self.tasks_q.get(block=False, timeout=0) # timeout in ms # Update to 0ms
Contributor Author

The previous code was flawed because block=True is the default. I believe it did not block indefinitely only because RCVTIMEO=1 was set on tasks_q at initialization (1 ms of blocking before an exception is raised). We should pass block=False here instead, to make it clear that this get call does not block in the main forwarder loop.

Contributor

Right, I believe the intent was to block, but only up to RCVTIMEO, after which we get a zmq.Again exception. Switching to a non-blocking poll (0 ms) makes sense given how infrequent registration messages are.
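For reference, the difference between the two behaviors discussed above can be sketched directly against pyzmq. This is a minimal illustration, not the forwarder's actual TaskQueue code: the `try_recv` helper and the `inproc://tasks-sketch` address are hypothetical names, and the wrapper's real `get` signature is assumed only from the diff.

```python
import zmq


def try_recv(sock, block=False):
    """Return one multipart message, or None if nothing arrived.

    Hypothetical helper: with block=True the call waits up to the socket's
    RCVTIMEO; with block=False it returns immediately.
    """
    flags = 0 if block else zmq.NOBLOCK
    try:
        return sock.recv_multipart(flags=flags)
    except zmq.Again:
        # Raised both when RCVTIMEO expires and when NOBLOCK finds no message.
        return None


ctx = zmq.Context()
sock = ctx.socket(zmq.DEALER)
sock.setsockopt(zmq.LINGER, 0)
sock.setsockopt(zmq.RCVTIMEO, 1)  # a blocking recv gives up after 1 ms
sock.bind("inproc://tasks-sketch")  # assumed address, no peers connected

timed_out = try_recv(sock, block=True)   # waits about 1 ms, then None
polled = try_recv(sock, block=False)     # returns at once with None

sock.close()
ctx.term()
```

Both calls end in `zmq.Again` here because no peer ever sends anything; the explicit non-blocking flag just makes that intent visible at the call site instead of relying on a timeout configured elsewhere.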

logger.debug("Configuring server")
self.zmq_socket = self.context.socket(zmq.ROUTER)
self.zmq_socket.set(zmq.ROUTER_MANDATORY, 1)
self.zmq_socket.set(zmq.ROUTER_HANDOVER, 1)
Contributor Author

We need this because we want a new zmq socket to replace an old socket that has the same id (we use the endpoint id as the socket id). Without it, an old socket with a given id can prevent a new socket that is assigned the same id on a new registration from communicating.

Contributor Author

@knagaitsev Aug 5, 2021

I believe this was not an issue with the forwarder's tasks_q: because that channel is outgoing, the forwarder could realize the client was not reachable, so when it sends heartbeats over this channel it simply thinks the same client socket has reconnected. This is why heartbeats were still reaching newly connected sockets, since they go over this channel.

However, with the results_q, the forwarder has no way of realizing that the socket is unreachable, since it only receives messages over this channel and never sends any. So it still thinks the old socket is connected when a new socket tries to connect.
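The takeover behavior described in this thread can be reproduced in a small standalone sketch. It is an illustration under assumptions, not the forwarder's code: the `ep-1` id, the `make_dealer` helper, and the loopback TCP address are all hypothetical, and the stale socket is simply left open rather than being truly unreachable.

```python
import zmq

ENDPOINT_ID = b"ep-1"  # hypothetical endpoint id, used as the socket identity


def make_dealer(ctx, url, reconnect=True):
    """Hypothetical helper: a DEALER that connects using ENDPOINT_ID."""
    sock = ctx.socket(zmq.DEALER)
    sock.setsockopt(zmq.LINGER, 0)
    sock.setsockopt(zmq.RCVTIMEO, 2000)
    sock.setsockopt(zmq.IDENTITY, ENDPOINT_ID)  # must be set before connect
    if not reconnect:
        # Keep the stale socket from reconnecting and reclaiming the id.
        sock.setsockopt(zmq.RECONNECT_IVL, -1)
    sock.connect(url)
    return sock


ctx = zmq.Context()
router = ctx.socket(zmq.ROUTER)
router.setsockopt(zmq.LINGER, 0)
router.setsockopt(zmq.RCVTIMEO, 2000)
router.setsockopt(zmq.ROUTER_MANDATORY, 1)
router.setsockopt(zmq.ROUTER_HANDOVER, 1)  # options are set before bind
port = router.bind_to_random_port("tcp://127.0.0.1")
url = f"tcp://127.0.0.1:{port}"

old = make_dealer(ctx, url, reconnect=False)
old.send(b"old")
first = router.recv_multipart()   # message from the original socket

new = make_dealer(ctx, url)       # same identity, old socket still open
new.send(b"new")
second = router.recv_multipart()  # accepted because handover is enabled

router.send_multipart([ENDPOINT_ID, b"reply"])
reply = new.recv()                # the id now routes to the new socket

for s in (old, new, router):
    s.close()
ctx.term()
```

With ROUTER_HANDOVER left at its default of 0, the second connection using the same identity would not be accepted under that id, which matches the symptom described above for results_q.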

@knagaitsev changed the title from "Change zmq options for reliability" to "Fix zmq option setting bugs" Aug 20, 2021
@knagaitsev force-pushed the zmq_opts branch 2 times, most recently from b5587e2 to 31f87d0 August 20, 2021 02:46
@knagaitsev marked this pull request as ready for review August 20, 2021 03:07
Contributor

@yadudoc left a comment

@Loonride, the changes to the ordering and setting of socket options look correct. The change to the blocking behavior of the TaskQueue for registration messages seems small, but changes like this can often be troublesome in practice. So I would recommend splitting this change out into a separate PR.

@knagaitsev force-pushed the zmq_opts branch 3 times, most recently from 7e7d64c to 445267a August 23, 2021 17:25
@knagaitsev
Contributor Author

@yadudoc Split it into 2 PRs (the other one is #34).

Contributor

@yadudoc left a comment

Looks good to go.

@yadudoc merged commit 2e72c78 into main Aug 23, 2021