
Conversation

@jvf (Contributor) commented Jan 12, 2026

Summary

When a Xandra.Connection process shuts down while a caller is blocked in
:gen_statem.call(conn_pid, {:checkout_state_for_next_request, ref}, :infinity),
the caller exits with :shutdown. This propagates to application processes, causing avoidable crashes.
Because this happens during the state checkout (before a query is attempted), the retry strategy is never
invoked, and the caller dies instead of receiving a retryable error.
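
Until this is handled inside Xandra, the only way for a caller to survive this is to trap the exit itself. A minimal
sketch of such a caller-side workaround (the MyApp.Cassandra module and safe_execute/3 are hypothetical, not part of
Xandra):

# Hypothetical caller-side workaround (not part of Xandra): trap the exit that
# :gen_statem.call/3 re-raises during checkout and turn it into an error tuple.
defmodule MyApp.Cassandra do
  def safe_execute(conn, statement, params \\ []) do
    Xandra.execute(conn, statement, params)
  catch
    :exit, reason -> {:error, {:connection_exit, reason}}
  end
end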

Example

We experienced this issue in production. A single Cassandra node experienced a "local pause" (this is what the
FailureDetector reports in the Cassandra logs), presumably due to I/O exhaustion (this is what our OS-level
monitoring suggests). Around that time, while trying to insert data into this node, our process was terminated
because Xandra propagated the exit:

2025-12-19 02:19:00.833 module=supervisor pid=<0.420178.0> [error]: Child <OurApp>.Worker of Supervisor #PID<0.420178.0> (<OurApp>.Supervisor.Server) terminated
** (exit) exited in: :gen_statem.call(#PID<0.2566.0>, {:checkout_state_for_next_request, #Reference<0.0.3451443.215878225.1680146433.165777>}, :infinity)
    ** (EXIT) shutdown
Pid: #PID<0.420180.0>

In our application logs we first saw the exit (at this point the Cassandra node had already been experiencing
problems for ~40 seconds).

Roughly 25 seconds later we saw the connections to the node being re-established (only the first connection of the pool is shown here):

2025-12-19 02:19:26.315 module=supervisor pid=<0.421299.0> [info]: Child {Xandra, 1} of Supervisor #PID<0.421299.0> (Xandra.Cluster.ConnectionPool) started
Pid: #PID<0.421300.0>
Start Call: Xandra.start_link([<obfuscated args>])

Where it happens

I think this is happening in Xandra.Connection when it uses :gen_statem.call/3 during checkout. The
current helper only traps :noproc:

# deps/xandra/lib/xandra/connection.ex
case gen_statem_call_trapping_noproc(conn_pid, {:checkout_state_for_next_request, req_alias}) do
  {:ok, state} -> ...
  {:error, error} -> {:error, ConnectionError.new("check out connection", error)}
end

defp gen_statem_call_trapping_noproc(pid, call) do
  :gen_statem.call(pid, call)
catch
  :exit, {:noproc, _} -> {:error, :no_connection_process}
end

If the connection exits with :shutdown, the caller exits and does not return {:error, _}.
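
To illustrate the shape of the exit, here is a minimal standalone sketch (plain OTP, not Xandra code; the module name
is made up). A :gen_statem server that never replies is stopped with :shutdown while a caller is blocked in
:gen_statem.call/3, and the caller has to catch {:shutdown, {:gen_statem, :call, args}}:

# Minimal repro sketch (standalone script, not Xandra code).
defmodule ShutdownRepro do
  @behaviour :gen_statem

  def callback_mode, do: :handle_event_function

  def init(:ok), do: {:ok, :idle, %{}}

  # Never reply, simulating a checkout that hangs until the process is stopped.
  def handle_event({:call, _from}, :checkout, :idle, data), do: {:keep_state, data}
end

{:ok, pid} = :gen_statem.start(ShutdownRepro, :ok, [])

spawn(fn ->
  Process.sleep(100)
  :gen_statem.stop(pid, :shutdown, :infinity)
end)

try do
  :gen_statem.call(pid, :checkout, :infinity)
catch
  # The exit reason is {:shutdown, {:gen_statem, :call, args}}, which the
  # existing {:noproc, _} clause does not match.
  :exit, {:shutdown, {:gen_statem, :call, _args}} -> IO.inspect(:caught_shutdown)
end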

Expected behavior

If the connection goes down during checkout, the caller should receive a normal error
(e.g. {:error, :connection_shutdown} wrapped in Xandra.ConnectionError) so that the caller process does
not crash, and retry strategies can handle the failure.

Proposed change

Catch :shutdown exit reasons in the gen_statem_call_trapping_noproc/2 helper and return a normal error
tuple:

defp gen_statem_call_trapping_noproc(pid, call) do
  :gen_statem.call(pid, call)
catch
  :exit, {:noproc, _} -> {:error, :no_connection_process}
  :exit, :shutdown -> {:error, :connection_shutdown}
  :exit, {:shutdown, _} -> {:error, :connection_shutdown}
  :exit, {:shutdown, _, _} -> {:error, :connection_shutdown}
end

This keeps the error handling consistent with the existing {:error, reason} path and
allows downstream retry strategies to kick in.
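
For illustration, this is how a caller could then react to the new error (sketch; module and function names are made
up, and it assumes the error surfaces from Xandra.execute/3 wrapped in Xandra.ConnectionError with the proposed
:connection_shutdown reason):

# Sketch of the caller-side effect of the proposed change.
defmodule MyApp.Writer do
  def insert(conn, statement, params) do
    case Xandra.execute(conn, statement, params) do
      {:ok, result} ->
        {:ok, result}

      {:error, %Xandra.ConnectionError{reason: :connection_shutdown} = error} ->
        # The connection was shutting down during checkout; the caller survives
        # and can back off and retry instead of crashing.
        {:retry, error}

      {:error, other} ->
        {:error, other}
    end
  end
end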

Additional context

As noted above, we experienced this in production in a multi-node cluster. Our
OS-level monitoring suggests I/O exhaustion starting at 02:18:45 UTC.

In the Cassandra logs we see:

WARN  [GossipTasks:1] 2025-12-19 02:19:26,107 FailureDetector.java:319 - Not marking nodes down due to local pause of 45616267328ns > 5000000000ns

This indicates that the FailureDetector was not scheduled for ~45 seconds (the warning is generated when it has not
been scheduled for > 5s). So the problem had been ongoing for at least ~40 seconds. We think this is an effect of
the I/O exhaustion.

Next we see lots of these:

WARN  [epollEventLoopGroup-5-5] 2025-12-19 02:19:26,137 PreV5Handlers.java:261 - Unknown exception in client networking
io.netty.channel.unix.Errors$NativeIoException: writevAddresses(..) failed: Connection reset by peer

This matches the timestamp at which Xandra re-established connections to the node, so we think this is the effect of
Xandra closing and restarting the connections in that pool.

What I think is happening:

  • the stalling Cassandra node causes some TCP issue on the client side
  • Xandra.Connection receives an is_closed_message or is_error_message socket message
  • the connection process reports the disconnect to the Xandra.Cluster.Pool
  • Xandra.Cluster.Pool reacts by stopping the entire pool for that host
  • that termination sends :shutdown to all connection processes in that pool
  • this causes the Connection reset by peer messages in Cassandra
  • if a worker is in :gen_statem.call checkout at that moment, it exits with :shutdown (as seen in our application logs)

Environment

  • Xandra v0.19.4
  • Erlang/OTP 27
  • Elixir v1.18.3

Trap shutdown exits from gen_statem checkout and surface them as
:connection_shutdown errors so callers don’t crash and retry logic can run.
@jvf force-pushed the fix-connection-checkout-exit branch from 4eb3916 to d5eabc9 on January 12, 2026 at 10:40
@whatyouhide (Owner) left a comment

Fantastic! 🎉 Amazing job on the report and root cause analysis as well, thanks @jvf.

@whatyouhide merged commit 133a8d0 into whatyouhide:main on Jan 14, 2026
5 checks passed