Previously, if the coordinator process was killed too quickly, before the
stream worker cleanup process was spawned, remote workers could be left around
waiting until the default 5-minute timeout expired.
In order to reliably clean up processes in that state, we need to start the
cleaner process, with all the job references, before we start submitting the
jobs for execution.
At first, it may seem impossible to monitor a process before it has been
spawned. That's true for regular processes; however, rexi operates on plain
references. For each process we spawn remotely, we create a reference on the
coordinator side, which we can then use to track that job. Those are just plain,
manually created references. Nothing stops us from creating them first, adding
them to a cleaner process, and only then submitting the jobs.
That's exactly what this commit accomplishes:
* Create a streams-specific `fabric_streams:submit_jobs/4` function, which
spawns the cleanup process early, generates worker references, and then
submits the jobs. This way, all the existing streaming submit_jobs calls can be
replaced easily in one line: fabric_util -> fabric_streams.
* The cleanup process operates as previously: it monitors the coordinator for
exits, and fires off a `kill_all` message to each node.
* Create `rexi:cast_ref(...)` variants of the `rexi:cast(...)` calls, where the
caller specifies the reference as a new argument. This is what allows us to
start the cleanup process before the jobs even get submitted. Older calls can
simply call into the `cast_ref` versions with references they create
themselves.
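
The ordering described above can be illustrated with a small, self-contained
sketch. It is not the actual `fabric_streams` or `rexi` code: the module name
`early_cleanup`, the `fake_node/1` process standing in for a remote rexi
server, and the message shapes `{cast, Ref, Fun}`, `{kill_all, Refs}` and
`{alive, From}` are all assumptions made for illustration only.

```erlang
-module(early_cleanup).
-export([demo/0]).

%% Hypothetical stand-in for a remote rexi server: it starts jobs keyed by a
%% caller-supplied reference and kills them by reference on request.
fake_node(Jobs) ->
    receive
        {cast, Ref, Fun} ->
            fake_node(Jobs#{Ref => spawn(Fun)});
        {kill_all, Refs} ->
            Kill = fun(Ref) ->
                case maps:find(Ref, Jobs) of
                    {ok, Pid} -> exit(Pid, kill);
                    error -> ok
                end
            end,
            lists:foreach(Kill, Refs),
            fake_node(Jobs);
        {alive, From} ->
            Alive = [R || {R, P} <- maps:to_list(Jobs), is_process_alive(P)],
            From ! {alive, Alive},
            fake_node(Jobs)
    end.

%% The cleaner monitors the coordinator and asks the node to kill every job
%% it was told about once the coordinator goes down.
cleaner(Coordinator, Node, Refs) ->
    MRef = erlang:monitor(process, Coordinator),
    receive
        {'DOWN', MRef, process, Coordinator, _Reason} ->
            Node ! {kill_all, Refs}
    end.

%% Sketch of the "references first" ordering (not the real submit_jobs/4).
submit_jobs(Node, Funs) ->
    Coordinator = self(),
    %% 1. Create plain references up front, one per job.
    Jobs = [{make_ref(), Fun} || Fun <- Funs],
    Refs = [Ref || {Ref, _Fun} <- Jobs],
    %% 2. Start the cleaner with every reference before anything is submitted.
    spawn(fun() -> cleaner(Coordinator, Node, Refs) end),
    %% 3. Only now submit the jobs, tagged with the pre-made references.
    lists:foreach(fun({Ref, Fun}) -> Node ! {cast, Ref, Fun} end, Jobs),
    Refs.

%% Ask the fake node which job references still have a live process.
alive(Node) ->
    Node ! {alive, self()},
    receive {alive, Refs} -> Refs end.

%% Runs a throwaway coordinator, kills it, and reports
%% {JobsSubmitted, AliveBeforeKill, AliveAfterKill}.
demo() ->
    Parent = self(),
    Node = spawn(fun() -> fake_node(#{}) end),
    Worker = fun() -> receive stop -> ok end end,
    Coordinator = spawn(fun() ->
        Submitted = submit_jobs(Node, [Worker, Worker, Worker]),
        Parent ! {submitted, Submitted},
        receive stop -> ok end
    end),
    Refs = receive {submitted, R} -> R end,
    Before = alive(Node),
    exit(Coordinator, kill),
    %% Give the cleaner a moment to see the 'DOWN' message and react.
    timer:sleep(100),
    After = alive(Node),
    exit(Node, kill),
    {length(Refs), length(Before), length(After)}.
```

Calling `early_cleanup:demo()` in a local shell should return `{3, 3, 0}`:
three jobs submitted, all three alive while the coordinator runs, and none
left after the coordinator is killed. Because the cleaner already holds every
reference before the first job is sent, there is no window in which a job
exists that the cleaner does not know about.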
Since we added the new `rexi:cast_ref(...)` variants, add more test coverage
for them, including the streaming logic as well. It's not 100% yet, but getting
there.
Also, the comments in `rexi.erl` were full of erldoc stanzas and we don't
actually build erldocs anywhere, so replace them with something more helpful.
The streaming protocol itself was never quite described anywhere, and it can
take some time to figure out (at least it took me a while), so I took the
chance to also add a very basic, high-level description of the message flow.
Related: #5127 (comment)