feat: Add address flag for Dqlite force removal#595
feat: Add address flag for Dqlite force removal#595roosterfish merged 5 commits intocanonical:v3from
Conversation
56ee460 to
f249f98
Compare
f249f98 to
b90cf9f
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds an optional --address flag for force removal of Dqlite cluster members. The feature addresses scenarios where a member exists only in Dqlite but cannot be looked up via the Truststore, enabling removal by explicitly specifying the member's network address.
Changes:
- Added an
addressparameter to theRemoveClusterMemberAPI and all related functions - Modified cluster member deletion logic to use the provided address when the remote is not present in the Truststore
- Added CLI flag
--addressto the cluster member remove command - Included test coverage for force removal with an inconsistent cluster state
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| microcluster/app.go | Updated RemoveClusterMember signature to accept address parameter |
| internal/rest/resources/control.go | Updated DeleteClusterMember call to pass empty address for join failure cleanup |
| internal/rest/resources/cluster.go | Modified clusterMemberDelete to handle address parameter and use it when remote is not present |
| internal/rest/client/cluster.go | Added address query parameter handling to DeleteClusterMember client function |
| example/test/main.sh | Added test for force removal with inconsistent state using address flag |
| example/cmd/microctl/cluster_members.go | Added --address CLI flag for cluster member removal |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
b90cf9f to
d78232b
Compare
roosterfish
left a comment
There was a problem hiding this comment.
Hi, thanks for following up on the removal parts.
Another thought that came to my mind is if we even have to provide the address:
- Let's say if we still have all members in the truststore, we can derive the address from the given name to cleanup the dqlite entry.
- If we still have the member in the DB (not truststore), we can derive the address from the
core_cluster_memberstable. - If it's in neither of those two, we already have the recovery option, see
microctl cluster recoverwhen building theexmaple/package.
So in this case aren't all cases already covered?
| return response.SmartError(fmt.Errorf("No remote exists with the given name %q", name)) | ||
| } | ||
|
|
||
| if remotePresent && addr == "" { |
There was a problem hiding this comment.
We should not allow overwriting the address in case it was returned by the truststore as this should be the ultimate source of truth.
Only if the truststore cannot find the member, use the provided address as a fallback.
In addition let's add a check that the provided address (if addr != "") isn't used by any of the other cluster members? There is a Truststore().RemoteAddresses func.
There was a problem hiding this comment.
I think it makes sense if both are specified to check that they point to the same address. If not I would propose to error.
internal/rest/resources/cluster.go
Outdated
| // If we can't find the node in dqlite, that means it failed to fully initialize. It still might have a record in our database so continue along anyway. | ||
| if index < 0 { | ||
| logger.Error(fmt.Sprintf("No dqlite record exists for %q, deleting from internal record instead", remote.Name)) | ||
| logger.Error(fmt.Sprintf("No dqlite record exists for %q.", name)) |
There was a problem hiding this comment.
Let's also add a log entry in case we used the provided address instead of the entry returned from the truststore by name similar to this log message.
There was a problem hiding this comment.
I think it would be best to fail a removal where the address does not match the looked up address of the name and let the administrator pass the correct. I am adding a warning if we didnt get a truststore hit but we proceed with the given address.
d78232b to
c2e25d8
Compare
On 1/2 yes if those are present we can proceed to remove the entries without a problem.
|
a7a1bf6 to
6ed8eae
Compare
roosterfish
left a comment
There was a problem hiding this comment.
The deletion bits look much more solid now, thanks for putting in the changes and the additional tests.
Please add some more comments so that it is easier to follow through and check the other commits about some minor nits.
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
6ed8eae to
665b1dd
Compare
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
665b1dd to
7c0ba93
Compare
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
7c0ba93 to
30b9e21
Compare
Add address flag for Dqlite force removal
Problem
During testing of the membership check, we observed an issue when the cluster enters an inconsistent state where a member exists only in Dqlite. In this scenario, the existing force removal flag fails to remove the lingering Dqlite membership.
The root cause is that we rely on the Truststore to look up a node’s address in order to determine which Dqlite member to remove. However, Dqlite has no concept of node names—it only tracks members by internal IDs and their network addresses.
As a result, we cannot remove a Dqlite member by name. In certain failure scenarios, we also cannot query the Truststore or core_cluster_members to determine the address or Dqlite ID associated with the stale member.
Solution
Introduce an optional --address flag that explicitly identifies the Dqlite member by its address. This allows us to reliably locate and forcefully remove the member from Dqlite, even if it has already been removed or “nuked” elsewhere in the system.
Side-note
I intentionally avoided introducing a --dqlite-id flag. Dqlite IDs are harder for users to discover (they must be extracted from the Dqlite cluster.yaml), whereas the node’s address is more intuitive and readily available.