feat: Add address flag for Dqlite force removal#595

Merged
roosterfish merged 5 commits into canonical:v3 from louiseschmidtgen:KU-5061/fix-force-rm
Feb 4, 2026

Conversation

Contributor

@louiseschmidtgen louiseschmidtgen commented Jan 22, 2026

Add address flag for Dqlite force removal

Problem

During testing of the membership check, we observed an issue when the cluster enters an inconsistent state where a member exists only in Dqlite. In this scenario, the existing force removal flag fails to remove the lingering Dqlite membership.

The root cause is that we rely on the Truststore to look up a node’s address in order to determine which Dqlite member to remove. However, Dqlite has no concept of node names—it only tracks members by internal IDs and their network addresses.

As a result, we cannot remove a Dqlite member by name. In certain failure scenarios, we also cannot query the Truststore or core_cluster_members to determine the address or Dqlite ID associated with the stale member.
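To illustrate why the address is the usable key here, the following minimal sketch looks up a member in a list of Dqlite-style records that carry only an ID and an address. The `member` type and the `indexByAddress` helper are hypothetical and are not taken from the PR; they only mirror the idea that a name cannot be matched against Dqlite's own view of the cluster.

```go
package main

import "fmt"

// member is a hypothetical stand-in for a Dqlite node record.
// Dqlite knows an ID and an address, but no node name.
type member struct {
	ID      uint64
	Address string
}

// indexByAddress returns the position of the member with the given address,
// or -1 if no member matches.
func indexByAddress(members []member, address string) int {
	for i, m := range members {
		if m.Address == address {
			return i
		}
	}
	return -1
}

func main() {
	members := []member{
		{ID: 1, Address: "10.0.0.1:9000"},
		{ID: 2, Address: "10.0.0.2:9000"},
	}

	// A known address resolves to an index; an unknown one yields -1.
	fmt.Println(indexByAddress(members, "10.0.0.2:9000"))
	fmt.Println(indexByAddress(members, "10.0.0.9:9000"))
}
```

A negative index corresponds to the "no Dqlite record exists" case that the removal logic has to handle separately.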

Solution

Introduce an optional --address flag that explicitly identifies the Dqlite member by its address. This allows us to reliably locate and forcefully remove the member from Dqlite, even if it has already been removed or “nuked” elsewhere in the system.
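As a rough sketch of how such an optional flag could be parsed, the example below uses Go's standard `flag` package rather than whatever CLI framework microctl actually uses. The `removeOptions` type, the `parseRemoveArgs` helper, and the `--force` flag name are assumptions made for illustration; only `--address` comes from this PR.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// removeOptions is a hypothetical container for the parsed arguments.
type removeOptions struct {
	Name    string
	Force   bool
	Address string
}

// parseRemoveArgs parses "[--force] [--address ADDR] NAME".
// With the standard flag package, flags must come before the member name.
func parseRemoveArgs(args []string) (removeOptions, error) {
	var opts removeOptions

	fs := flag.NewFlagSet("remove", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage output quiet in this sketch
	fs.BoolVar(&opts.Force, "force", false, "forcibly remove the member")
	fs.StringVar(&opts.Address, "address", "", "address of the Dqlite member (optional)")

	if err := fs.Parse(args); err != nil {
		return opts, err
	}
	if fs.NArg() != 1 {
		return opts, errors.New("expected exactly one member name")
	}

	opts.Name = fs.Arg(0)
	return opts, nil
}

func main() {
	opts, err := parseRemoveArgs([]string{"--force", "--address", "10.0.0.3:9000", "node3"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", opts)
}
```

Leaving `--address` empty by default keeps the existing name-based removal path unchanged for callers that do not need the flag.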

Side-note

I intentionally avoided introducing a --dqlite-id flag. Dqlite IDs are harder for users to discover (they must be extracted from the Dqlite cluster.yaml), whereas the node’s address is more intuitive and readily available.

@louiseschmidtgen louiseschmidtgen marked this pull request as ready for review January 23, 2026 05:57
Copilot AI review requested due to automatic review settings January 23, 2026 05:57
Copilot AI left a comment

Pull request overview

This PR adds an optional --address flag for force removal of Dqlite cluster members. The feature addresses scenarios where a member exists only in Dqlite but cannot be looked up via the Truststore, enabling removal by explicitly specifying the member's network address.

Changes:

  • Added an address parameter to the RemoveClusterMember API and all related functions
  • Modified cluster member deletion logic to use the provided address when the remote is not present in the Truststore
  • Added CLI flag --address to the cluster member remove command
  • Included test coverage for force removal with an inconsistent cluster state

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| microcluster/app.go | Updated RemoveClusterMember signature to accept address parameter |
| internal/rest/resources/control.go | Updated DeleteClusterMember call to pass empty address for join failure cleanup |
| internal/rest/resources/cluster.go | Modified clusterMemberDelete to handle address parameter and use it when remote is not present |
| internal/rest/client/cluster.go | Added address query parameter handling to DeleteClusterMember client function |
| example/test/main.sh | Added test for force removal with inconsistent state using address flag |
| example/cmd/microctl/cluster_members.go | Added --address CLI flag for cluster member removal |
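The client-side change described for internal/rest/client/cluster.go passes the address as a query parameter. A minimal sketch of that idea with `net/url` might look like the following; the `/cluster/` path, the `force` parameter name and value, and the `deleteMemberURL` helper are assumptions for illustration and may not match the actual client code.

```go
package main

import (
	"fmt"
	"net/url"
)

// deleteMemberURL builds a request path for removing a member.
// The path layout and parameter names are hypothetical.
func deleteMemberURL(name string, force bool, address string) string {
	query := url.Values{}
	if force {
		query.Set("force", "1")
	}
	// Only send the address when the caller supplied one, so that the
	// server can tell "not provided" apart from an explicit value.
	if address != "" {
		query.Set("address", address)
	}

	u := url.URL{Path: "/cluster/" + name, RawQuery: query.Encode()}
	return u.String()
}

func main() {
	fmt.Println(deleteMemberURL("node3", true, "10.0.0.3:9000"))
	fmt.Println(deleteMemberURL("node1", false, ""))
}
```

Using `url.Values` takes care of escaping the colon in `host:port` addresses instead of concatenating the query string by hand.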

Contributor

@roosterfish roosterfish left a comment

Hi, thanks for following up on the removal parts.

Another thought that came to mind: do we even have to provide the address?

  1. If we still have all members in the truststore, we can derive the address from the given name to clean up the Dqlite entry.
  2. If we still have the member in the DB (but not in the truststore), we can derive the address from the core_cluster_members table.
  3. If it's in neither of those two, we already have the recovery option; see microctl cluster recover when building the example/ package.

So aren't all cases already covered?

return response.SmartError(fmt.Errorf("No remote exists with the given name %q", name))
}

if remotePresent && addr == "" {
Contributor

We should not allow overriding the address if the truststore returned one, as the truststore should be the ultimate source of truth.
Use the provided address as a fallback only if the truststore cannot find the member.

In addition, let's add a check that the provided address (if addr != "") isn't used by any of the other cluster members. There is a Truststore().RemoteAddresses func.

Contributor Author

I think it makes sense, if both are specified, to check that they point to the same address. If they don't, I would propose returning an error.
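The rules discussed in this thread could be sketched roughly as follows. This is not the merged implementation: the `resolveAddress` helper is hypothetical, and a plain map from member name to address stands in for the truststore. It treats the truststore as the source of truth, rejects a mismatching address, and only falls back to the provided address when the name is unknown and the address is not used by another member.

```go
package main

import "fmt"

// resolveAddress picks the address to use for the Dqlite removal.
// trusted maps member names to addresses and stands in for the truststore.
func resolveAddress(trusted map[string]string, name, provided string) (string, error) {
	known, present := trusted[name]
	if present {
		// The truststore wins; a provided address must agree with it.
		if provided != "" && provided != known {
			return "", fmt.Errorf("address %q does not match address %q recorded for %q", provided, known, name)
		}
		return known, nil
	}

	// No truststore entry and no fallback: nothing to remove by.
	if provided == "" {
		return "", fmt.Errorf("no remote exists with the given name %q", name)
	}

	// Refuse a fallback address that belongs to another member.
	for other, addr := range trusted {
		if addr == provided {
			return "", fmt.Errorf("address %q is already used by member %q", provided, other)
		}
	}

	return provided, nil
}

func main() {
	trusted := map[string]string{
		"node1": "10.0.0.1:9000",
		"node2": "10.0.0.2:9000",
	}

	// node3 is missing from the truststore, so the provided address is used.
	addr, err := resolveAddress(trusted, "node3", "10.0.0.3:9000")
	fmt.Println(addr, err)

	// A provided address that contradicts the truststore is rejected.
	_, err = resolveAddress(trusted, "node1", "10.0.0.9:9000")
	fmt.Println(err)
}
```

Failing on a mismatch, rather than silently preferring one source, leaves it to the administrator to pass the correct address.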

  // If we can't find the node in dqlite, that means it failed to fully initialize. It still might have a record in our database so continue along anyway.
  if index < 0 {
-     logger.Error(fmt.Sprintf("No dqlite record exists for %q, deleting from internal record instead", remote.Name))
+     logger.Error(fmt.Sprintf("No dqlite record exists for %q.", name))
Contributor

Let's also add a log entry, similar to this log message, for the case where we used the provided address instead of the entry the truststore returned for the name.

Contributor Author

I think it would be best to fail a removal where the address does not match the address looked up for the name, and let the administrator pass the correct one. I am adding a warning for the case where we didn't get a truststore hit but proceed with the given address.

@louiseschmidtgen
Contributor Author

> Hi, thanks for following up on the removal parts.
>
> Another thought that came to mind: do we even have to provide the address?
>
>   1. If we still have all members in the truststore, we can derive the address from the given name to clean up the Dqlite entry.
>   2. If we still have the member in the DB (but not in the truststore), we can derive the address from the core_cluster_members table.
>   3. If it's in neither of those two, we already have the recovery option; see microctl cluster recover when building the example/ package.
>
> So aren't all cases already covered?

On 1 and 2: yes, if those are present we can proceed to remove the entries without a problem.

  3. The cluster recover command is not suitable for this case, as it is used for recovery from quorum loss. If we are nuking a member, we still have quorum. Running cluster recover would not make sense here, as it can result in unnecessary interruption and data loss.
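The quorum argument can be made concrete with a small majority check. The `hasQuorum` helper below is a hypothetical illustration of the usual strict-majority rule for a Raft-style cluster and is not code from microcluster.

```go
package main

import "fmt"

// hasQuorum reports whether the reachable voters form a strict majority
// of all voters.
func hasQuorum(voters, reachable int) bool {
	return reachable >= voters/2+1
}

func main() {
	// Three voters with one dead member: two reachable voters are still a
	// majority, so a forced removal is enough.
	fmt.Println(hasQuorum(3, 2))

	// Three voters with two dead members: the majority is gone, which is
	// the situation a recovery command is meant for.
	fmt.Println(hasQuorum(3, 1))
}
```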

@louiseschmidtgen louiseschmidtgen force-pushed the KU-5061/fix-force-rm branch 2 times, most recently from a7a1bf6 to 6ed8eae on January 23, 2026 14:23
Contributor

@roosterfish roosterfish left a comment

The deletion bits look much more solid now, thanks for putting in the changes and the additional tests.

Please add some more comments so that the code is easier to follow, and check the other commits for some minor nits.

Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
Contributor

@roosterfish roosterfish left a comment

Thanks, LGTM.

@roosterfish roosterfish merged commit f68bb2a into canonical:v3 Feb 4, 2026
5 checks passed
@louiseschmidtgen louiseschmidtgen deleted the KU-5061/fix-force-rm branch February 5, 2026 05:34