Fix zmq option setting bugs #24

knagaitsev · 2021-08-04T17:35:55Z

knagaitsev · 2021-08-05T07:49:31Z

funcx_forwarder/forwarder.py

        '''
        try:
-            b_ep_id, reg_message = self.tasks_q.get(timeout=0)  # timeout in ms # Update to 0ms
+            b_ep_id, reg_message = self.tasks_q.get(block=False, timeout=0)  # timeout in ms # Update to 0ms


The previous code was flawed because block=True is the default. I believe it did not block indefinitely only because RCVTIMEO=1 was set on tasks_q at initialization (1 ms of blocking before an exception is raised). We should pass block=False here instead, to make it clear that this get call does not block in the main forwarder loop.

Right, I believe the intent was to block, but only up to RCVTIMEO, after which we get a zmq.Again exception. Switching to a non-blocking poll of 0 ms makes sense given how infrequently registration messages arrive.
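
To illustrate the two behaviors discussed above, here is a minimal pyzmq sketch. It is not the forwarder's TaskQueue code: the helper get_nowait, the inproc address, and the identity b"endpoint-1" are assumptions made up for the example. It contrasts a blocking receive bounded by RCVTIMEO with an explicitly non-blocking receive.

```python
import zmq


def get_nowait(sock):
    """Hypothetical helper: return a queued multipart message, or None.

    zmq.NOBLOCK makes the receive return immediately; pyzmq raises
    zmq.Again when no message is waiting.
    """
    try:
        return sock.recv_multipart(flags=zmq.NOBLOCK)
    except zmq.Again:
        return None


ctx = zmq.Context()
router = ctx.socket(zmq.ROUTER)
router.set(zmq.RCVTIMEO, 1)  # a blocking recv gives up after about 1 ms
router.bind("inproc://registration-demo")  # address is illustrative only

# Blocking receive on an empty socket: waits up to RCVTIMEO, then raises.
try:
    router.recv_multipart()
    timed_out = False
except zmq.Again:
    timed_out = True

# Non-blocking receive on an empty socket: returns immediately.
empty_result = get_nowait(router)

# Send one registration-like message from a peer with a fixed identity.
dealer = ctx.socket(zmq.DEALER)
dealer.set(zmq.IDENTITY, b"endpoint-1")
dealer.connect("inproc://registration-demo")
dealer.send(b"register")

router.poll(timeout=1000)  # wait (up to 1 s) until the message is readable
message = get_nowait(router)  # [identity frame, payload frame]

dealer.close(linger=0)
router.close(linger=0)
ctx.term()
```

Both paths end in zmq.Again when nothing is queued; the difference is that the non-blocking form says so at the call site instead of depending on a socket option set elsewhere.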

knagaitsev · 2021-08-05T07:51:32Z

funcx_forwarder/taskqueue.py

            logger.debug("Configuring server")
            self.zmq_socket = self.context.socket(zmq.ROUTER)
            self.zmq_socket.set(zmq.ROUTER_MANDATORY, 1)
+            self.zmq_socket.set(zmq.ROUTER_HANDOVER, 1)


We need this because we want a new zmq socket id (we use the endpoint id) to replace an old socket with the same id. Otherwise, an old socket with a given id can prevent a new socket that is given the same id on a new registration from communicating.

I believe this was not an issue with the tasks_q on the forwarder: because that channel is outgoing, the forwarder could realize the client was not reachable, so when it sends heartbeats over this channel it simply thinks the same client socket has reconnected. This is why heartbeats were still reaching newly connected sockets, since they were going over this channel.

However, with the results_q, the forwarder has no way of realizing that the socket is unreachable since it is not sending messages over this channel, only receiving messages. So it still thinks the old socket is connected when a new socket tries to connect.
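
The handover behavior described above can be sketched with pyzmq as follows. This is an illustration under assumptions, not the forwarder's code: the inproc address, the identity b"endpoint-1", and the helper connect_endpoint are hypothetical. Two DEALER sockets connect with the same identity, and with ROUTER_HANDOVER set the ROUTER is expected to route to the newer one.

```python
import zmq

ENDPOINT_ID = b"endpoint-1"        # hypothetical endpoint id used as the zmq identity
ADDRESS = "inproc://results-demo"  # hypothetical in-process address

ctx = zmq.Context()

router = ctx.socket(zmq.ROUTER)
router.set(zmq.ROUTER_MANDATORY, 1)
router.set(zmq.ROUTER_HANDOVER, 1)  # let a new peer take over a duplicate identity
router.set(zmq.RCVTIMEO, 2000)      # keep the demo from hanging if nothing arrives
router.bind(ADDRESS)


def connect_endpoint(greeting):
    """Hypothetical helper: connect a DEALER that reuses ENDPOINT_ID and say hello."""
    sock = ctx.socket(zmq.DEALER)
    sock.set(zmq.IDENTITY, ENDPOINT_ID)  # identity must be set before connect
    sock.set(zmq.RCVTIMEO, 2000)
    sock.connect(ADDRESS)
    sock.send(greeting)
    return sock


# The "old" socket registers and is never closed, like a stale endpoint connection.
old_endpoint = connect_endpoint(b"hello-old")
first_hello = router.recv_multipart()

# A "new" socket registers with the same identity and takes the identity over.
new_endpoint = connect_endpoint(b"hello-new")
second_hello = router.recv_multipart()

# A message addressed to the shared identity now goes to the new socket.
router.send_multipart([ENDPOINT_ID, b"task"])
delivered = new_endpoint.recv()

for sock in (old_endpoint, new_endpoint, router):
    sock.close(linger=0)
ctx.term()
```

With ROUTER_HANDOVER left at its default of 0, libzmq documents that the ROUTER rejects a peer that connects with an identity already in use, which matches the symptom described in this comment.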

yadudoc

@Loonride, the changes to the ordering and setting of socket options look correct. The change to blocking behavior for the TaskQueue for registration messages seems small, but changes like this can often be troublesome in practice. So I would recommend splitting this change out into a separate PR.

knagaitsev · 2021-08-23T17:49:06Z

@yadudoc Split it into 2 PRs (Other one is #34)

yadudoc

Looks good to go.

knagaitsev commented Aug 5, 2021

knagaitsev changed the title ~~Change zmq options for reliability~~ Fix zmq option setting bugs Aug 20, 2021

knagaitsev force-pushed the zmq_opts branch 2 times, most recently from b5587e2 to 31f87d0 Compare August 20, 2021 02:46

knagaitsev marked this pull request as ready for review August 20, 2021 03:07

yadudoc reviewed Aug 20, 2021

fix zmq option setting bugs

445267a

knagaitsev force-pushed the zmq_opts branch 3 times, most recently from 7e7d64c to 445267a Compare August 23, 2021 17:25

yadudoc approved these changes Aug 23, 2021

yadudoc merged commit 2e72c78 into main Aug 23, 2021
