Xandra exits on connection shutdown during connection checkout #390
Summary
When a `Xandra.Connection` process shuts down while a caller is blocked in `:gen_statem.call(conn_pid, {:checkout_state_for_next_request, ref}, :infinity)`, the caller exits with `:shutdown`. This propagates to application processes, causing avoidable crashes. Because this happens during the state checkout (before a query is attempted), the retry strategy is never invoked, and the caller dies instead of receiving a retryable error.
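To illustrate the mechanism outside of Xandra, here is a self-contained sketch (the `SlowServer` module is hypothetical, not Xandra code) showing that a process blocked in `:gen_statem.call/3` dies when the server is stopped with reason `:shutdown`, instead of getting a return value:

```elixir
defmodule SlowServer do
  @behaviour :gen_statem

  # Started unlinked so stopping it below does not take the current process down.
  def start, do: :gen_statem.start(__MODULE__, :ok, [])

  @impl :gen_statem
  def callback_mode, do: :state_functions

  @impl :gen_statem
  def init(:ok), do: {:ok, :idle, %{}}

  # Never replies to :checkout, so callers stay blocked in :gen_statem.call/3.
  def idle({:call, _from}, :checkout, data), do: {:keep_state, data}
end

{:ok, server} = SlowServer.start()

caller =
  spawn(fn ->
    # Blocks indefinitely, like the checkout call in Xandra.Connection.
    :gen_statem.call(server, :checkout, :infinity)
  end)

ref = Process.monitor(caller)
Process.sleep(100)

# Stop the server the way a supervisor shutting down a pool would.
:gen_statem.stop(server, :shutdown, :infinity)

receive do
  {:DOWN, ^ref, :process, ^caller, reason} ->
    # Prints {:shutdown, {:gen_statem, :call, ...}}: the blocked caller
    # died instead of receiving an {:error, _} tuple.
    IO.inspect(reason)
end
```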
Example
We experienced this issue in production. A single Cassandra node experienced a "local pause" (this is what is reported in the Cassandra logs by the `FailureDetector`), presumably due to I/O exhaustion (this is what our OS-level monitoring suggests). Around that time, while trying to insert data into this node, one of our processes died because Xandra propagated an exit:

In our application logs we first saw the exit (at this point the Cassandra node had already been experiencing problems for ~40 seconds).

Roughly 25 seconds later we saw the connections to the node being re-established (only the first connection of the pool is shown here):
Where it happens
I think this is happening in `Xandra.Connection` when it uses `:gen_statem.call/3` during checkout. The current helper only traps `:noproc`:
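Roughly, the helper looks like this (sketched from the description above; the exact body and return value in `Xandra.Connection` may differ):

```elixir
defp gen_statem_call_trapping_noproc(pid, call) do
  :gen_statem.call(pid, call, :infinity)
catch
  # Only a connection process that is already dead is turned into an error
  # tuple; any other exit reason propagates to the caller.
  :exit, {:noproc, _} -> {:error, :noproc}
end
```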
If the connection exits with `:shutdown`, the caller exits and does not return `{:error, _}`.

Expected behavior
If the connection goes down during checkout, the caller should receive a normal error (e.g. `{:error, :connection_shutdown}` wrapped in `Xandra.ConnectionError`) so that the caller process does not crash, and retry strategies can handle the failure.
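For illustration, here is how caller code could then treat the failure as a value instead of crashing (a sketch; `:connection_shutdown` is the placeholder reason proposed above, and whether it would surface in the `reason` field of `Xandra.ConnectionError` is an assumption):

```elixir
case Xandra.execute(conn, "SELECT release_version FROM system.local") do
  {:ok, page} ->
    {:ok, page}

  # The process no longer exits; the shutdown is just another error value
  # that a retry strategy (or the caller itself) can act on.
  {:error, %Xandra.ConnectionError{reason: :connection_shutdown} = error} ->
    {:error, error}

  {:error, other_error} ->
    {:error, other_error}
end
```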
Proposed change
Catch `:shutdown` exit reasons in the `gen_statem_call_trapping_noproc/2` helper and return a normal error tuple:
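A minimal sketch of that change, building on the helper shape assumed above (the `:connection_shutdown` reason and the exact exit patterns are placeholders, not existing Xandra terms):

```elixir
defp gen_statem_call_trapping_noproc(pid, call) do
  :gen_statem.call(pid, call, :infinity)
catch
  :exit, {:noproc, _} ->
    {:error, :noproc}

  # New: the connection was shut down (e.g. by Xandra.Cluster.Pool stopping
  # the pool for a host) while we were blocked waiting for checkout.
  :exit, {:shutdown, _} ->
    {:error, :connection_shutdown}

  :exit, {{:shutdown, _reason}, _} ->
    {:error, :connection_shutdown}
end
```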
This keeps the error handling consistent with the existing `{:error, reason}` path and allows downstream retry strategies to kick in.
Additional context
As noted above, we experienced this in production in a multi-node cluster. Our
OS-level monitoring suggests I/O exhaustion starting at 2:18:45 UTC.
In the Cassandra logs we see:
This indicates that the `FailureDetector` was not scheduled for ~45 seconds (the warning is generated when it is not scheduled for > 5s). So the problem was ongoing for at least ~40 seconds. We think this is the effect of the I/O exhaustion.
Next we see lots of these:
This matches the timestamp of Xandra restarting connections to the node, so we think this is the effect of Xandra closing and restarting the connections in that pool.
What I think is happening:

1. `Xandra.Connection` receives an `is_closed_message` or `is_error_message` socket message and reports it to `Xandra.Cluster.Pool`
2. `Xandra.Cluster.Pool` reacts by stopping the entire pool for that host
3. this sends `:shutdown` to all connection processes in that pool
4. the connections being torn down produces the `Connection reset by peer` messages in Cassandra
5. if a caller is blocked in the `:gen_statem.call` checkout at that moment, it exits with `:shutdown` (as seen in our application logs)

Environment