
fix: add consistency checks across core_cluster_members, truststore, and dqlite #515

Merged
roosterfish merged 5 commits into canonical:v3 from louiseschmidtgen:KU-4294/diverging-membership-awareness
Jan 16, 2026

Conversation

@louiseschmidtgen
Contributor

@louiseschmidtgen louiseschmidtgen commented Oct 17, 2025

Problem

Microcluster can enter inconsistent states where core_cluster_members (database), truststore, and dqlite cluster configuration become out of sync during partial failures. This leads to failed operations and difficult recovery scenarios.

Solution

Implements membership consistency validation before critical operations:

Validates before operations: checks that all three sources match before joins, removals, and token generation.
Clear error messages: shows the differences between the sources when an inconsistency is detected, making it possible for admins to recover their cluster before further damage occurs.

Changes Made

  • state.go - Core consistency checking logic with CheckMembershipConsistency() (sketched after this list)
  • cluster.go - Added checks before join/remove operations
  • tokens.go - Added checks before token generation
  • main.sh - Integration test simulating an inconsistent state and verifying that operations are blocked, plus a parallel join test showing that join operations started concurrently do not fail
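
For illustration, the kind of comparison such a check performs can be sketched as below. This is a simplified assumption of the approach, not the actual CheckMembershipConsistency() implementation in state.go; the function and parameter names are made up for the example.

```go
// Illustrative sketch: compare the member names reported by the three
// sources and describe any difference. Not the real microcluster code.
package example

import (
    "fmt"
    "sort"
)

func checkMembershipConsistency(dbMembers, truststoreMembers, dqliteMembers []string) error {
    db := toSet(dbMembers)
    ts := toSet(truststoreMembers)
    dq := toSet(dqliteMembers)

    if equal(db, ts) && equal(db, dq) {
        return nil
    }

    // Report all three views so an admin can see exactly where they diverge.
    return fmt.Errorf("cluster membership is inconsistent: core_cluster_members=%v, truststore=%v, dqlite=%v",
        keys(db), keys(ts), keys(dq))
}

func toSet(names []string) map[string]struct{} {
    set := make(map[string]struct{}, len(names))
    for _, n := range names {
        set[n] = struct{}{}
    }
    return set
}

func equal(a, b map[string]struct{}) bool {
    if len(a) != len(b) {
        return false
    }
    for name := range a {
        if _, ok := b[name]; !ok {
            return false
        }
    }
    return true
}

func keys(set map[string]struct{}) []string {
    out := make([]string, 0, len(set))
    for name := range set {
        out = append(out, name)
    }
    sort.Strings(out)
    return out
}
```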

Testing

./example/test/main.sh membership  # Test membership consistency 
./example/test/main.sh parallel-join  # Test parallel join operations

@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from b815452 to d3aa99f Compare October 17, 2025 13:37
@louiseschmidtgen louiseschmidtgen marked this pull request as draft October 17, 2025 14:14
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from d3aa99f to 91552ea Compare October 20, 2025 08:55
@louiseschmidtgen louiseschmidtgen marked this pull request as ready for review October 20, 2025 08:55
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from 91552ea to 07565a5 Compare October 20, 2025 09:17
Contributor

@roosterfish roosterfish left a comment


Thanks for raising awareness that we can ultimately end up in a state where those three data sources are out of sync. The corresponding test also looks great.

I wonder if you could please share a reproducer for how we might end up in such an inconsistent state in a production cluster?
From the test I understand that you can easily reproduce this by manually deleting DB entries (which we cannot protect against, but it's something you should never really do). The same applies to manually editing the trust store.

Instead of checking for an already broken state when trying to add/delete members, what about performing the check elsewhere so we don't end up in this state at all?
If it's easy to end up in such a state in a production deployment, I wonder whether Microcluster isn't properly cleaning up certain actions in case of failure?

@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch 2 times, most recently from 838b97f to e09f316 Compare November 11, 2025 16:10
@louiseschmidtgen
Contributor Author

share a reproducer

As discussed at the sprint, the reverter logic only works as long as the node does not crash while the revert is ongoing.
The revert does not get persisted and is not picked up when the node restarts, meaning that we can remain in a state where the memberships are out of sync.
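
For context, a reverter in this style typically just collects undo functions in process memory and runs them on failure. The following is a minimal sketch (not microcluster's actual implementation; the type and method names are assumptions) that shows why a crash mid-revert leaves nothing to resume from:

```go
// Minimal sketch of an in-memory reverter; microcluster's real implementation differs.
package example

// Reverter collects undo functions that exist only in process memory.
type Reverter struct {
    undo []func()
}

// Add registers a cleanup step for an action that has just been performed,
// e.g. removing a freshly inserted core_cluster_members row or truststore entry.
func (r *Reverter) Add(f func()) {
    r.undo = append(r.undo, f)
}

// Revert runs the cleanup steps in reverse order. If the process crashes
// partway through, the remaining steps are lost: nothing is persisted, so a
// restart cannot resume the revert and the membership sources stay out of sync.
func (r *Reverter) Revert() {
    for i := len(r.undo) - 1; i >= 0; i-- {
        r.undo[i]()
    }
}
```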

@roosterfish
Contributor

As discussed at the sprint, the reverter logic only works as long as the node does not crash while the revert is ongoing. The revert does not get persisted and is not picked up when the node restarts, meaning that we can remain in a state where the memberships are out of sync.

Hi @louiseschmidtgen, I appreciate the discussions at the sprint to better understand where you are heading with this.
As you mentioned, the reverter pattern can be a single point of failure in case the member which executes the revert dies somewhere in the middle.

With the changes proposed in this PR, it can still happen, but there would now be an error telling the user that DB/dqlite/truststore are out of sync when performing actions like creating a token, or adding or removing a cluster member.

In the presumably rare case that a member running a revert dies, it should rather be the admin who figures out that the action they are executing (e.g. adding a member) ran into a failure. When joining a member m2 via existing member m1 and m1 dies during the process (interrupting the revert), the user will see an error as the connection to m1 gets interrupted. The user then has to perform further checks/validations and ultimately retry the operation.
For this reason we already support a --force option during removal, to be able to clean up and recover in case things are left partially done. So if we see further issues like #512, those have to be addressed to ensure a robust cleanup.

But as mentioned at the sprint, we don't want to put in checks at certain places just to check if something is already broken. Rather, fix it when it's happening. And if something from the outside unexpectedly kills a Microcluster member, that is certainly not under our control and we should not come up with patterns in Microcluster to resolve it.

@louiseschmidtgen
Contributor Author

louiseschmidtgen commented Nov 18, 2025

@roosterfish I understand that we do not want to add any unnecessary checks and would rather iron out the faults in the repo. However, we are not going to fix the reverter logic any time soon.

From a UX perspective, expecting the user to perform further checks/validations and ultimately retry the operation is not great. Why not log the error instead of expecting the user to figure out where to look for the inconsistencies? We can't expect the user to dig into Microcluster's internal logic, though (I admit) bubbling up these internal details isn't great either. Without solid reverter logic, and with the possibility of getting your cluster into an unrecoverable state, I would still advocate for this safeguard.

Let's get another opinion from @bschimke95.

@roosterfish
Contributor

@roosterfish I understand that we do not want to add any unnecessary checks and would rather iron out the faults in the repo. However, we are not going to fix the reverter logic any time soon.

After #512 and #487 do we know what else is currently not working in case the reverter kicks in?

@louiseschmidtgen
Contributor Author

After #512 and #487 do we know what else is currently not working in case the reverter kicks in?

We will need to do more exploratory testing to find further cases, race conditions etc.

@bschimke95
Contributor

Hey folks,

I generally agree with both of you.

  • if the reverter operation fails because of a lost node, I don't have a good way as a user to detect that without knowing the internals of microcluster and checking manually in a couple of places - this is arguably not a great UX

  • As @louiseschmidtgen pointed out, it is also not great to bubble internal state up to the user and tell them about microcluster's internal inconsistency (especially in places where they would not expect it, e.g. when creating a token)

In both cases, we'd leak abstractions and users would need to know the internals of microcluster.

I'd propose two things:

a) If I understand the issue correctly, then the main problem (that we know of) right now is that the reverter logic might not run to completion if the node dies mid-way. What about just making this reverter logic persistent? Presumably, this should not be too much work and would solve the issue at hand.

b) Introduce a consistency API (or integrate it into an existing one) that exposes some information about the membership state, which consumers can then use in their own workflows.
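
As a rough illustration of option (b), such an API could return the per-source member lists plus a consistency flag that consumers check themselves. The shape below is purely hypothetical, not an existing microcluster type:

```go
// Hypothetical response shape for a membership consistency endpoint.
// Field and type names are illustrative assumptions.
package example

// MembershipReport lists the member names each source currently knows about,
// so a consumer can surface or act on any divergence itself.
type MembershipReport struct {
    CoreClusterMembers []string `json:"core_cluster_members"`
    Truststore         []string `json:"truststore"`
    Dqlite             []string `json:"dqlite"`
    Consistent         bool     `json:"consistent"`
}
```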

@louiseschmidtgen
Contributor Author

@bschimke95 AFAIK, @roosterfish and his team have experimented with an option to persist the change by storing it in Dqlite and making use of Raft's fault tolerance. However, this was plagued with performance issues IIRC.
Finding a good solution here will likely be a non-trivial task.

@bschimke95
Contributor

I'm not super deep in the Raft game but I'd be surprised if there are any significant performance issues.
You only require a reverter for tasks like joining/removing, etc. During those steps you may store some small data in dqlite, but you are also not (normally) too concerned with performance during those steps. After the clustering step succeeds, you can clean up those entries and there are no leftovers for the actual runtime.

That is just my understanding of the system/problem. I'm probably missing something.

@louiseschmidtgen
Contributor Author

louiseschmidtgen commented Nov 21, 2025

I'm not super deep in the Raft game but I'd be surprised if there are any significant performance issues. You only require a reverter for tasks like joining/removing, etc. During those steps you may store some small data in dqlite, but you are also not (normally) too concerned with performance during those steps. After the clustering step succeeds, you can clean up those entries and there are no leftovers for the actual runtime.

That is just my understanding of the system/problem. I'm probably missing something.

I'm afraid it's never trivial with distributed systems, especially when you start to think about all the things that can go wrong. Say we store the information in Dqlite: new node in core_cluster_members, new node in truststore, new node in dqlite.

One example: assume my "new node" fails somewhere in the join process and crashes halfway through the reverter, and the new node is not coming back to clean up its state any time soon. Now the other nodes need to "deal with the mess" before doing any further membership operations. Dealing with the mess is going to require some more thought-through logic, and all of a sudden this is not a quick fix.

@bschimke95
Contributor

I don't think that this is a quick fix; rather, I was surprised by the claim that this affects the performance of the system. I totally agree that things can get messy in distributed systems.

Anyway, @roosterfish and @louiseschmidtgen, how do you want to proceed with this PR? I think the direction we want to go with this is a bit unclear right now and I don't see this PR landing anytime soon.

My proposal would be to close this PR, regroup, and discuss next steps in a better setting (e.g. a spec) to agree on an aligned way forward.

@roosterfish
Contributor

Anyway, @roosterfish and @louiseschmidtgen how do you want to proceed with this PR?

Hey guys, sorry for the delay; this is not a priority for us at the moment, as the goal of this PR is to work around possible behavior and not to fix an actual bug. The current answer of Microcluster is to --force remove a member in case you end up in a situation where this might be needed.

I see the point that there can potentially be various different inconsistencies, for example a member entry only present in the truststore. As this is a hypothesis, I cannot present any reproducer to end up in this state, but in this case you can also delete the member using --force; Microcluster has logic for this already, see https://github.com/canonical/microcluster/blob/v3/internal/rest/resources/cluster.go#L504.
But it looks like we should also have tests to perform this from various different cluster sizes, as I see there are checks below which might hinder us from performing a cleanup.

@louiseschmidtgen in your previous message you mentioned user experience, so we should not simply notify the user but rather allow cleaning up in a "controlled way". What do you think about also extending the deletion endpoint to cope with the different scenarios? This would allow us to cover the entire workflow, from error to solution.

@roosterfish
Contributor

@louiseschmidtgen before we continue with this, can you please have a look at #544? There seems to be a regression I just found with the recent cluster join fixes.

@louiseschmidtgen
Contributor Author

Anyway, @roosterfish and @louiseschmidtgen how do you want to proceed with this PR?

Hey guys, sorry for the delay; this is not a priority for us at the moment, as the goal of this PR is to work around possible behavior and not to fix an actual bug. The current answer of Microcluster is to --force remove a member in case you end up in a situation where this might be needed.

I see the point that there can potentially be various different inconsistencies, for example a member entry only present in the truststore. As this is a hypothesis, I cannot present any reproducer to end up in this state, but in this case you can also delete the member using --force; Microcluster has logic for this already, see https://github.com/canonical/microcluster/blob/v3/internal/rest/resources/cluster.go#L504. But it looks like we should also have tests to perform this from various different cluster sizes, as I see there are checks below which might hinder us from performing a cleanup.

@louiseschmidtgen in your previous message you mentioned user experience, so we should not simply notify the user but rather allow cleaning up in a "controlled way". What do you think about also extending the deletion endpoint to cope with the different scenarios? This would allow us to cover the entire workflow, from error to solution.

@roosterfish I like your idea of doing automated clean-up. I will extend the PR to include this.

Copilot AI review requested due to automatic review settings December 15, 2025 12:25
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from e09f316 to 1fed52e Compare December 15, 2025 12:25

Copilot AI left a comment


Pull request overview

This pull request adds consistency validation across the three key components that track cluster membership: the core_cluster_members database table, the truststore, and the dqlite cluster configuration. The implementation prevents critical cluster operations (joins, removals, token generation) when these sources are out of sync, helping administrators detect and address inconsistencies before they cause more serious failures.

Key Changes

  • Implements CheckMembershipConsistency() method that validates all three membership sources match before critical operations
  • Adds pre-operation consistency checks to join, remove, and token generation endpoints (with force flag bypass for removals)
  • Includes comprehensive integration test that simulates inconsistent state and verifies operations are properly blocked

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Changed files:

  • internal/state/state.go: Adds the CheckMembershipConsistency interface method and implementation, with helper methods to gather and compare membership data from core_cluster_members, truststore, and dqlite
  • internal/rest/resources/tokens.go: Adds a consistency check before token generation to prevent creating join tokens when the cluster state is inconsistent
  • internal/rest/resources/cluster.go: Adds consistency checks before join and remove operations (the force flag bypasses the check for removals)
  • example/test/main.sh: Adds test_membership_consistency() to simulate an inconsistent state and verify operations are blocked; improves process cleanup safety in shutdown_systems()


@louiseschmidtgen louiseschmidtgen marked this pull request as draft December 15, 2025 12:42
@louiseschmidtgen
Contributor Author

@louiseschmidtgen in your previous message you mentioned user experience, so we should not simply notify the user but rather allow cleaning up in a "controlled way". What do you think about also extending the deletion endpoint to cope with the different scenarios? This would allow us to cover the entire workflow, from error to solution.

Giving this some more thought:

The auto-resolution strategy should be:

  • dqlite = source of truth
    • it's the actual cluster
    • choosing core_cluster_members as the source of truth could lead us to automatically remove a dqlite member and lose quorum
  • Add missing entries to core_cluster_members/truststore if they exist in dqlite
    • we can't add the members back without the full member information (certificate, schema versions, etc.); we can only remove orphaned entries
  • Remove extra entries from core_cluster_members/truststore if they don't exist in dqlite
    • potentially problematic if it runs in parallel with a node joining

@roosterfish / @bschimke95 What do you think? I don't see a good/quick way to do the clean-up without putting the cluster at risk.
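
For illustration, with dqlite treated as the source of truth, detecting (but deliberately not removing) orphaned entries might look roughly like the sketch below; the function name and signature are assumptions, not microcluster code:

```go
// Illustrative only: report entries present in core_cluster_members or the
// truststore but absent from dqlite, without removing anything automatically.
package example

func findOrphans(dqliteMembers, otherMembers []string) []string {
    inDqlite := make(map[string]struct{}, len(dqliteMembers))
    for _, name := range dqliteMembers {
        inDqlite[name] = struct{}{}
    }

    var orphans []string
    for _, name := range otherMembers {
        if _, ok := inDqlite[name]; !ok {
            // Candidate for cleanup, but removing it automatically could race
            // with an in-progress join, so it is only reported here.
            orphans = append(orphans, name)
        }
    }
    return orphans
}
```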

Contributor

@roosterfish roosterfish left a comment


Thanks, please have a look at my few additional comments.

As I wrote in the other thread, let's include this, as the checks themselves are not very expensive and can be helpful.
Once we get a reproducer for ending up in such a state, we should address it directly by modifying either the member join or remove logic.

Another aspect I just thought about is whether or not we can recover from all of the different inconsistencies. In this comment you mentioned that we might miss some data when restoring core_cluster_members, so I suspect this is a manual operation anyway.
Have you already tested this for core_cluster_members, truststore and dqlite?

@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from 1fed52e to d436358 Compare January 12, 2026 12:20
@louiseschmidtgen
Contributor Author

Thanks, please have a look at my few additional comments.

As I wrote in the other thread, let's include this, as the checks themselves are not very expensive and can be helpful. Once we get a reproducer for ending up in such a state, we should address it directly by modifying either the member join or remove logic.

Another aspect I just thought about is whether or not we can recover from all of the different inconsistencies. In this comment you mentioned that we might miss some data when restoring core_cluster_members, so I suspect this is a manual operation anyway. Have you already tested this for core_cluster_members, truststore and dqlite?

+1. When we get a reproducer we can fix it in the future. In the meantime, this PR will help us and users find and define those issues.
On the testing question: yes, I tried implementing the automated recovery, but based on my findings this won't be possible in some cases and is not safe to do in others.

Contributor

@roosterfish roosterfish left a comment


Almost ready, please check the suggestion from Copilot and one small nit.


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.



@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from df1ab64 to ccfeb8b Compare January 14, 2026 09:30
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from ccfeb8b to 647147b Compare January 14, 2026 10:13
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from 647147b to 5c32bfd Compare January 15, 2026 13:22
@roosterfish
Contributor

Please rebase with main because there is now a conflict after merging #582.

@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from 5c32bfd to b5620b5 Compare January 16, 2026 07:25
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from b5620b5 to ba73c53 Compare January 16, 2026 07:35
Contributor

@roosterfish roosterfish left a comment


Thanks! Only one potential leftover from rebasing.

Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from ba73c53 to e69343e Compare January 16, 2026 09:56
Contributor

@roosterfish roosterfish left a comment


LGTM!

@roosterfish roosterfish merged commit a757ad7 into canonical:v3 Jan 16, 2026
5 checks passed
@louiseschmidtgen louiseschmidtgen deleted the KU-4294/diverging-membership-awareness branch January 16, 2026 10:05
roosterfish added a commit that referenced this pull request Feb 3, 2026
…tstore, and dqlite (#515) (#602)

# Backport

This PR backports #515
