
fix: add consistency checks across core_cluster_members, truststore, and dqlite #515

Merged
roosterfish merged 5 commits into canonical:v3 from louiseschmidtgen:KU-4294/diverging-membership-awareness
Jan 16, 2026

Conversation

@louiseschmidtgen
Contributor

@louiseschmidtgen louiseschmidtgen commented Oct 17, 2025

Problem

Microcluster can enter inconsistent states where core_cluster_members (database), truststore, and dqlite cluster configuration become out of sync during partial failures. This leads to failed operations and difficult recovery scenarios.

Solution

Implements membership consistency validation before critical operations:

Validates before operations: checks that all three sources match before joins, removals, and token generation.
Clear error messages: shows the differences between the sources when an inconsistency is detected, making it possible for admins to recover their cluster before further damage occurs.

Changes Made

  • state.go - Core consistency checking logic with CheckMembershipConsistency() (sketched after this list)
  • cluster.go - Added checks before join/remove operations
  • tokens.go - Added checks before token generation
  • main.sh - Integration test simulating an inconsistent state and verifying that operations are blocked, plus a parallel join test showing that join operations started concurrently do not fail
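
For illustration, the kind of comparison such a check performs can be sketched as below. This is a simplified assumption of the approach, not the actual CheckMembershipConsistency() implementation in state.go; the function and parameter names are made up for the example.

```go
// Illustrative sketch: compare the member names reported by the three
// sources and describe any difference. Not the real microcluster code.
package example

import (
    "fmt"
    "sort"
)

func checkMembershipConsistency(dbMembers, truststoreMembers, dqliteMembers []string) error {
    db := toSet(dbMembers)
    ts := toSet(truststoreMembers)
    dq := toSet(dqliteMembers)

    if equal(db, ts) && equal(db, dq) {
        return nil
    }

    // Report all three views so an admin can see exactly where they diverge.
    return fmt.Errorf("cluster membership is inconsistent: core_cluster_members=%v, truststore=%v, dqlite=%v",
        keys(db), keys(ts), keys(dq))
}

func toSet(names []string) map[string]struct{} {
    set := make(map[string]struct{}, len(names))
    for _, n := range names {
        set[n] = struct{}{}
    }
    return set
}

func equal(a, b map[string]struct{}) bool {
    if len(a) != len(b) {
        return false
    }
    for name := range a {
        if _, ok := b[name]; !ok {
            return false
        }
    }
    return true
}

func keys(set map[string]struct{}) []string {
    out := make([]string, 0, len(set))
    for name := range set {
        out = append(out, name)
    }
    sort.Strings(out)
    return out
}
```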

Testing

./example/test/main.sh membership  # Test membership consistency 
./example/test/main.sh parallel-join  # Test parallel join operations

@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from b815452 to d3aa99f Compare October 17, 2025 13:37
@louiseschmidtgen louiseschmidtgen marked this pull request as draft October 17, 2025 14:14
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from d3aa99f to 91552ea Compare October 20, 2025 08:55
@louiseschmidtgen louiseschmidtgen marked this pull request as ready for review October 20, 2025 08:55
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from 91552ea to 07565a5 Compare October 20, 2025 09:17
Contributor

@roosterfish roosterfish left a comment


Thanks for raising awareness that we can ultimately end up in a state where those three data sources are out of sync. The corresponding test also looks great.

I wonder if you could please share a reproducer for how we might end up in such an inconsistent state in a production cluster?
From the test I understand that you can easily reproduce this by manually deleting DB entries (which we cannot protect against, but it's something you should never really do). The same applies to manually editing the trust store.

Instead of checking for an already broken state when trying to add/delete members, what about performing the check elsewhere so we don't end up in this state at all?
If it's easy to end up in such a state in a production deployment, I wonder whether Microcluster isn't properly cleaning up certain actions in case of failure?

@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch 2 times, most recently from 838b97f to e09f316 Compare November 11, 2025 16:10
@louiseschmidtgen
Contributor Author

share a reproducer

As discussed at the sprint, the reverter logic only works as long as the node does not crash while the revert is ongoing.
The revert does not get persisted and is not picked up when the node restarts, meaning that we can remain in a state where the memberships are out of sync.
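
For context, a reverter in this style typically just collects undo functions in process memory and runs them on failure. The following is a minimal sketch (not microcluster's actual implementation; the type and method names are assumptions) that shows why a crash mid-revert leaves nothing to resume from:

```go
// Minimal sketch of an in-memory reverter; microcluster's real implementation differs.
package example

// Reverter collects undo functions that exist only in process memory.
type Reverter struct {
    undo []func()
}

// Add registers a cleanup step for an action that has just been performed,
// e.g. removing a freshly inserted core_cluster_members row or truststore entry.
func (r *Reverter) Add(f func()) {
    r.undo = append(r.undo, f)
}

// Revert runs the cleanup steps in reverse order. If the process crashes
// partway through, the remaining steps are lost: nothing is persisted, so a
// restart cannot resume the revert and the membership sources stay out of sync.
func (r *Reverter) Revert() {
    for i := len(r.undo) - 1; i >= 0; i-- {
        r.undo[i]()
    }
}
```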

@roosterfish
Contributor

As discussed at the sprint, the reverter logic only works as long as the node does not crash while the revert is ongoing. The revert does not get persisted and is not picked up when the node restarts, meaning that we can remain in a state where the memberships are out of sync.

Hi @louiseschmidtgen, I appreciate the discussions at the sprint to better understand where you are heading with this.
As you mentioned, the reverter pattern can be a single point of failure in case the member which executes the revert dies somewhere in the middle.

With the changes proposed in this PR, it can still happen, but there would now be an error telling the user that DB/dqlite/truststore are out of sync when performing actions like creating a token, or adding or removing a cluster member.

In the presumably rare case that a member running a revert dies, it should rather be the admin who figures out that the action they are executing (e.g. adding a member) ran into a failure. When joining a member m2 via existing member m1 and m1 dies during the process (interrupting the revert), the user will see an error as the connection to m1 gets interrupted. The user then has to perform further checks/validations and ultimately retry the operation.
For this reason we already support a --force option during removal, to be able to clean up and recover in case things are left partially done. So if we see further issues like #512, those have to be addressed to ensure a robust cleanup.

But as mentioned at the sprint, we don't want to put in checks at certain places just to check if something is already broken. Rather, fix it when it's happening. And if something from the outside unexpectedly kills a Microcluster member, that is certainly not under our control and we should not come up with patterns in Microcluster to resolve it.

@louiseschmidtgen
Contributor Author

louiseschmidtgen commented Nov 18, 2025

@roosterfish I understand that we do not want to add any unnecessary checks and would rather iron out the faults in the repo. However, we are not going to fix the reverter logic any time soon.

From a UX perspective, expecting the user to perform further checks/validations and ultimately retry the operation is not great. Why not log the error instead of expecting the user to figure out where to look for the inconsistencies? We can't expect the user to dig into Microcluster's internal logic, though (I admit) bubbling up these internal details isn't great either. Without solid reverter logic, and with the possibility of getting your cluster into an unrecoverable state, I would still advocate for this safeguard.

Let's get another opinion from @bschimke95.

@roosterfish
Contributor

@roosterfish I understand that we do not want to add any unnecessary checks and would rather iron out the faults in the repo. However, we are not going to fix the reverter logic any time soon.

After #512 and #487 do we know what else is currently not working in case the reverter kicks in?

@louiseschmidtgen
Contributor Author

After #512 and #487 do we know what else is currently not working in case the reverter kicks in?

We will need to do more exploratory testing to find further cases, race conditions etc.

@bschimke95
Contributor

Hey folks,

I generally agree with both of you.

  • if the reverter operation fails because of a lost node, I don't have a good way as a user to detect that without knowing the internals of microcluster and checking manually in a couple of places - this is arguably not a great UX

  • As @louiseschmidtgen pointed out, it is also not great to bubble internal state up to the user and tell them about microcluster's internal inconsistency (especially in places where they would not expect it, e.g. when creating a token)

In both cases, we'd leak abstractions and users would need to know the internals of microcluster.

I'd propose two things:

a) If I understand the issue correctly, then the main problem (that we know of) right now is that the reverter logic might not run to completion if the node dies mid-way. What about just making this reverter logic persistent? Presumably, this should not be too much work and would solve the issue at hand.

b) Introduce a consistency API (or integrate it into an existing one) that exposes some information about the membership state, which consumers can then use in their own workflows.
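
As a rough illustration of option (b), such an API could return the per-source member lists plus a consistency flag that consumers check themselves. The shape below is purely hypothetical, not an existing microcluster type:

```go
// Hypothetical response shape for a membership consistency endpoint.
// Field and type names are illustrative assumptions.
package example

// MembershipReport lists the member names each source currently knows about,
// so a consumer can surface or act on any divergence itself.
type MembershipReport struct {
    CoreClusterMembers []string `json:"core_cluster_members"`
    Truststore         []string `json:"truststore"`
    Dqlite             []string `json:"dqlite"`
    Consistent         bool     `json:"consistent"`
}
```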

@louiseschmidtgen
Contributor Author

@bschimke95 AFAIK, @roosterfish and his team have experimented with an option to persist the change by storing it in Dqlite and making use of Raft's fault tolerance. However, this was plagued with performance issues IIRC.
Finding a good solution here will likely be a non-trivial task.

@bschimke95
Contributor

I'm not super deep in the Raft game but I'd be surprised if there are any significant performance issues.
You only require a reverter for tasks like joining/removing, etc. During those steps you may store some small data in dqlite, but you are also not (normally) too concerned with performance during those steps. After the clustering step succeeds, you can clean up those entries and there are no leftovers for the actual runtime.

That is just my understanding of the system/problem. I'm probably missing something.

@louiseschmidtgen
Contributor Author

louiseschmidtgen commented Nov 21, 2025

I'm not super deep in the Raft game but I'd be surprised if there are any significant performance issues. You only require a reverter for tasks like joining/removing, etc. During those steps you may store some small data in dqlite, but you are also not (normally) too concerned with performance during those steps. After the clustering step succeeds, you can clean up those entries and there are no leftovers for the actual runtime.

That is just my understanding of the system/problem. I'm probably missing something.

I'm afraid it's never trivial with distributed systems, especially when you start to think about all the things that can go wrong. Say we store the information in Dqlite: new node in core_cluster_members, new node in truststore, new node in dqlite.

One example: assume my "new node" fails somewhere in the join process and crashes halfway through the reverter, and the new node is not coming back to clean up its state any time soon. Now the other nodes need to "deal with the mess" before doing any further membership operations. Dealing with the mess is going to require some more thought-through logic, and all of a sudden this is not a quick fix.

@bschimke95
Contributor

I don't think that this is a quick fix; rather, I was surprised by the claim that this affects the performance of the system. I totally agree that things can get messy in distributed systems.

Anyway, @roosterfish and @louiseschmidtgen, how do you want to proceed with this PR? I think the direction we want to go with this is a bit unclear right now and I don't see this PR landing anytime soon.

My proposal would be to close this PR, regroup, and discuss next steps in a better setting (e.g. a spec) to agree on an aligned way forward.

@roosterfish
Contributor

Anyway, @roosterfish and @louiseschmidtgen how do you want to proceed with this PR?

Hey guys, sorry for the delay; this is not a priority for us at the moment, as the goal of this PR is to work around possible behavior and not to fix an actual bug. The current answer of Microcluster is to --force remove a member in case you end up in a situation where this might be needed.

I see the point that there can potentially be various different inconsistencies, for example a member entry only present in the truststore. As this is a hypothesis, I cannot present any reproducer to end up in this state, but in this case you can also delete the member using --force; Microcluster has logic for this already, see https://github.com/canonical/microcluster/blob/v3/internal/rest/resources/cluster.go#L504.
But it looks like we should also have tests to perform this from various different cluster sizes, as I see there are checks below which might hinder us from performing a cleanup.

@louiseschmidtgen in your previous message you mentioned user experience, so we should not simply notify the user but rather allow cleaning up in a "controlled way". What do you think about also extending the deletion endpoint to cope with the different scenarios? This would allow us to cover the entire workflow, from error to solution.

@roosterfish
Contributor

@louiseschmidtgen before we continue with this, can you please have a look at #544? There seems to be a regression I just found with the recent cluster join fixes.

@louiseschmidtgen
Contributor Author

Anyway, @roosterfish and @louiseschmidtgen how do you want to proceed with this PR?

Hey guys, sorry for the delay; this is not a priority for us at the moment, as the goal of this PR is to work around possible behavior and not to fix an actual bug. The current answer of Microcluster is to --force remove a member in case you end up in a situation where this might be needed.

I see the point that there can potentially be various different inconsistencies, for example a member entry only present in the truststore. As this is a hypothesis, I cannot present any reproducer to end up in this state, but in this case you can also delete the member using --force; Microcluster has logic for this already, see https://github.com/canonical/microcluster/blob/v3/internal/rest/resources/cluster.go#L504. But it looks like we should also have tests to perform this from various different cluster sizes, as I see there are checks below which might hinder us from performing a cleanup.

@louiseschmidtgen in your previous message you mentioned user experience, so we should not simply notify the user but rather allow cleaning up in a "controlled way". What do you think about also extending the deletion endpoint to cope with the different scenarios? This would allow us to cover the entire workflow, from error to solution.

@roosterfish I like your idea of doing automated clean-up. I will extend the PR to include this.

Copilot AI review requested due to automatic review settings December 15, 2025 12:25
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from e09f316 to 1fed52e Compare December 15, 2025 12:25

Copilot AI left a comment


Pull request overview

This pull request adds consistency validation across the three key components that track cluster membership: the core_cluster_members database table, the truststore, and the dqlite cluster configuration. The implementation prevents critical cluster operations (joins, removals, token generation) when these sources are out of sync, helping administrators detect and address inconsistencies before they cause more serious failures.

Key Changes

  • Implements CheckMembershipConsistency() method that validates all three membership sources match before critical operations
  • Adds pre-operation consistency checks to join, remove, and token generation endpoints (with force flag bypass for removals)
  • Includes comprehensive integration test that simulates inconsistent state and verifies operations are properly blocked

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Changed files:

  • internal/state/state.go: Adds the CheckMembershipConsistency interface method and implementation, with helper methods to gather and compare membership data from core_cluster_members, truststore, and dqlite
  • internal/rest/resources/tokens.go: Adds a consistency check before token generation to prevent creating join tokens when the cluster state is inconsistent
  • internal/rest/resources/cluster.go: Adds consistency checks before join and remove operations (the force flag bypasses the check for removals)
  • example/test/main.sh: Adds test_membership_consistency() to simulate an inconsistent state and verify operations are blocked; improves process cleanup safety in shutdown_systems()


@louiseschmidtgen louiseschmidtgen marked this pull request as draft December 15, 2025 12:42
@louiseschmidtgen
Contributor Author

@louiseschmidtgen in your previous message you mentioned user experience, so we should not simply notify the user but rather allow cleaning up in a "controlled way". What do you think about also extending the deletion endpoint to cope with the different scenarios? This would allow us to cover the entire workflow, from error to solution.

Giving this some more thought:

The auto-resolution strategy should be:

  • dqlite = source of truth
    • it's the actual cluster
    • choosing core_cluster_members as the source of truth could lead us to automatically remove a dqlite member and lose quorum
  • Add missing entries to core_cluster_members/truststore if they exist in dqlite
    • we can't add the members back without the full member information (certificate, schema versions, etc.); we can only remove orphaned entries
  • Remove extra entries from core_cluster_members/truststore if they don't exist in dqlite
    • potentially problematic if it runs in parallel with a node joining

@roosterfish / @bschimke95 What do you think? I don't see a good/quick way to do the clean-up without putting the cluster at risk.
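
For illustration, with dqlite treated as the source of truth, detecting (but deliberately not removing) orphaned entries might look roughly like the sketch below; the function name and signature are assumptions, not microcluster code:

```go
// Illustrative only: report entries present in core_cluster_members or the
// truststore but absent from dqlite, without removing anything automatically.
package example

func findOrphans(dqliteMembers, otherMembers []string) []string {
    inDqlite := make(map[string]struct{}, len(dqliteMembers))
    for _, name := range dqliteMembers {
        inDqlite[name] = struct{}{}
    }

    var orphans []string
    for _, name := range otherMembers {
        if _, ok := inDqlite[name]; !ok {
            // Candidate for cleanup, but removing it automatically could race
            // with an in-progress join, so it is only reported here.
            orphans = append(orphans, name)
        }
    }
    return orphans
}
```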

Contributor

@roosterfish roosterfish left a comment


Thanks, please have a look at my few additional comments.

As I wrote in the other thread, let's include this, as the checks themselves are not very expensive and can be helpful.
Once we get a reproducer for ending up in such a state, we should address it directly by modifying either the member join or remove logic.

Another aspect I just thought about is whether or not we can recover from all of the different inconsistencies. In this comment you mentioned that we might miss some data when restoring core_cluster_members, so I suspect this is a manual operation anyway.
Have you already tested this for core_cluster_members, truststore and dqlite?

@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from 1fed52e to d436358 Compare January 12, 2026 12:20
@louiseschmidtgen
Contributor Author

Thanks, please have a look at my few additional comments.

As I wrote in the other thread, let's include this, as the checks themselves are not very expensive and can be helpful. Once we get a reproducer for ending up in such a state, we should address it directly by modifying either the member join or remove logic.

Another aspect I just thought about is whether or not we can recover from all of the different inconsistencies. In this comment you mentioned that we might miss some data when restoring core_cluster_members, so I suspect this is a manual operation anyway. Have you already tested this for core_cluster_members, truststore and dqlite?

+1. When we get a reproducer we can fix it in the future. In the meantime, this PR will help us and users find and define those issues.
On the testing question: yes, I tried implementing the automated recovery, but based on my findings this won't be possible in some cases and is not safe to do in others.

Contributor

@roosterfish roosterfish left a comment


Almost ready, please check the suggestion from Copilot and one small nit.


Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.



@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from df1ab64 to ccfeb8b Compare January 14, 2026 09:30
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from ccfeb8b to 647147b Compare January 14, 2026 10:13
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from 647147b to 5c32bfd Compare January 15, 2026 13:22
@roosterfish
Contributor

Please rebase with main because there is now a conflict after merging #582.

@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from 5c32bfd to b5620b5 Compare January 16, 2026 07:25
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from b5620b5 to ba73c53 Compare January 16, 2026 07:35
Contributor

@roosterfish roosterfish left a comment


Thanks! Only one potential leftover from rebasing.

Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
Signed-off-by: Louise K. Schmidtgen <louise.schmidtgen@canonical.com>
@louiseschmidtgen louiseschmidtgen force-pushed the KU-4294/diverging-membership-awareness branch from ba73c53 to e69343e Compare January 16, 2026 09:56
Contributor

@roosterfish roosterfish left a comment


LGTM!

@roosterfish roosterfish merged commit a757ad7 into canonical:v3 Jan 16, 2026
5 checks passed
@louiseschmidtgen louiseschmidtgen deleted the KU-4294/diverging-membership-awareness branch January 16, 2026 10:05
roosterfish added a commit that referenced this pull request Feb 3, 2026
…tstore, and dqlite (#515) (#602)

# Backport

This PR backports #515
