fix: add consistency checks across core_cluster_members, truststore, and dqlite #515
Conversation
Force-pushed from b815452 to d3aa99f
Force-pushed from d3aa99f to 91552ea
Force-pushed from 91552ea to 07565a5
roosterfish left a comment
Thanks for raising awareness that we can ultimately end up in a state where those three data sources are out of sync. The corresponding test also looks great.
I wonder if you could please share a reproducer for how we might end up in such an inconsistent state in a production cluster?
From the test I understand that you can easily reproduce this by manually deleting DB entries (which we cannot protect against, but it's something you should never really do). The same applies to manually editing the trust store.
Instead of checking for an already broken state when trying to add/delete members, what about performing the check elsewhere so we don't end up in this state overall?
If it's easy to end up in such a state in a prod deployment, I wonder if Microcluster doesn't properly clean up certain actions in case of failure?
Force-pushed from 838b97f to e09f316
As discussed at the sprint, the reverter logic only works as long as the node does not crash while the revert is ongoing.
Hi @louiseschmidtgen, I appreciate the discussions at the sprint to better understand where you are heading with this. With the changes proposed in this PR, it can still happen, but there would now be an error telling the user that DB/dqlite/truststore are out of sync when performing actions like creating a token or adding and removing a cluster member. In the presumably rare case a member running a revert dies, it should rather be the admin who figures out that the current action they are executing (e.g. adding a member) ran into a failure.

When joining a member

But as mentioned at the sprint, we don't want to put in checks at certain places to check if something is already broken. Rather, fix it when it's happening. And if something from the outside is unexpectedly killing a Microcluster member, this is certainly not under our control and we should not come up with patterns in Microcluster to resolve this.
@roosterfish I understand that we do not want to add any unnecessary checks and would rather iron out the faults in the repo. However, we are not going to fix the reverter logic any time soon. From a UX perspective, expecting

Let's get another opinion from @bschimke95.
After #512 and #487, do we know what else is currently not working in case the reverter kicks in?
Hey folks, I generally agree with both of you.
In both cases, we'd leak abstractions and users would need to know the internals of Microcluster. I'd propose two things:

a) If I understand the issue correctly, then the main problem (that we know of) right now is that the reverter logic might not run to completion if the node dies mid-way. What about just making this reverter logic persistent? Presumably, this should not be too much work and would solve the issue at hand.

b) Introduce a
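To make option (a) a bit more concrete: a crash-safe reverter essentially has to record its pending undo steps somewhere durable before acting, so that a restarted daemon can pick them up again. The sketch below only illustrates that idea; the `revert.Journal` type, its fields, and the file-based storage are hypothetical and not part of Microcluster's existing reverter API.

```go
package revert

import (
	"encoding/json"
	"os"
)

// Step describes one pending revert action. The fields are placeholders; a
// real implementation would carry whatever data is needed to undo a change.
type Step struct {
	Kind   string `json:"kind"`   // e.g. "remove-truststore-entry"
	Target string `json:"target"` // e.g. member name
}

// Journal persists pending revert steps so a daemon restarting after a crash
// can finish (or re-run) an interrupted revert.
type Journal struct {
	path  string
	steps []Step
}

// Add records a step before the corresponding change is applied.
func (j *Journal) Add(s Step) error {
	j.steps = append(j.steps, s)
	data, err := json.Marshal(j.steps)
	if err != nil {
		return err
	}
	// Good enough for a sketch; a real implementation would write to a
	// temporary file and rename it into place.
	return os.WriteFile(j.path, data, 0o600)
}

// Clear removes the journal once the operation (or its revert) has completed.
func (j *Journal) Clear() error {
	j.steps = nil
	return os.RemoveAll(j.path)
}

// Load returns any steps left over from a previous run so they can be
// replayed on startup, before the daemon starts serving requests.
func Load(path string) ([]Step, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	var steps []Step
	err = json.Unmarshal(data, &steps)
	return steps, err
}
```

Storing the journal in Dqlite instead of a local file is the variant discussed in the following comments, with the trade-offs mentioned there.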
@bschimke95 AFAIK, @roosterfish and his team have experimented with an option to persist the change by storing it in Dqlite and making use of Raft's fault-tolerance. However, this was plagued with performance issues IIRC.
I'm not super deep in the Raft game, but I'd be surprised if there are any significant performance issues. That is just my understanding of the system/problem. I'm probably missing something.
I'm afraid it's never trivial with distributed systems, especially when you start to think about all the things that can go wrong. Say we store the information in Dqlite.

One example: assume my "new node" fails somewhere in the join process and crashes halfway through the reverter, and the new node is not coming back to clean up its state any time soon. Now the other nodes need to "deal with the mess" before doing any further membership operations. Dealing with the mess is going to require some more thought-through logic, and all of a sudden this is not a quick fix.
I don't think that this is a quick fix, but rather was surprised by the claim that this affects the performance of the system. I totally agree that things can get messy in distributed systems. Anyway, @roosterfish and @louiseschmidtgen, how do you want to proceed with this PR? I think the direction we want to take here is a bit unclear right now and I don't see this PR landing anytime soon. My proposal would be to close this PR, regroup, and discuss next steps in a better setting (e.g. a spec) to agree on an aligned way forward.
Hey guys, sorry for the delay. This is not a priority for us at the moment, as the goal of this PR is to circumvent possible behavior rather than fix an actual bug. The current answer of Microcluster is to

I see the point that there can potentially be various different inconsistencies, for example a member entry only present in the truststore. As this is a hypothesis, I cannot present any reproducer to end up in this state, but in this case you can also delete the member using

@louiseschmidtgen in your previous message you mentioned user experience, so we should not simply notify the user but rather allow cleaning up in a "controlled way". What do you think about also extending the deletion endpoint to cope with the different scenarios? This would allow us to cover the entire workflow, from error to solution.
@louiseschmidtgen before we continue with this, can you please have a look at #544? There seems to be a regression I just found with the recent cluster join fixes. |
@roosterfish I like your idea of doing automated clean-up. I will extend the PR to include this. |
Force-pushed from e09f316 to 1fed52e
Pull request overview
This pull request adds consistency validation across the three key components that track cluster membership: the core_cluster_members database table, the truststore, and the dqlite cluster configuration. The implementation prevents critical cluster operations (joins, removals, token generation) when these sources are out of sync, helping administrators detect and address inconsistencies before they cause more serious failures.
Key Changes
- Implements CheckMembershipConsistency() method that validates all three membership sources match before critical operations (see the sketch after this list)
- Adds pre-operation consistency checks to join, remove, and token generation endpoints (with force flag bypass for removals)
- Includes comprehensive integration test that simulates inconsistent state and verifies operations are properly blocked
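For reference, the core of such a check is a set comparison over the member names each source reports. The helper below is only a rough sketch of that idea; the real CheckMembershipConsistency() in internal/state/state.go gathers the data through the state, truststore, and dqlite APIs, and the signature and names here are assumptions.

```go
package state

import (
	"fmt"
	"sort"
	"strings"
)

// checkMembershipConsistency compares the member names reported by the three
// sources of truth and returns an error describing any mismatch. The three
// slices stand in for whatever the database, truststore, and dqlite layers
// actually return.
func checkMembershipConsistency(dbMembers, truststoreMembers, dqliteMembers []string) error {
	toSet := func(names []string) map[string]bool {
		set := make(map[string]bool, len(names))
		for _, name := range names {
			set[name] = true
		}
		return set
	}

	db := toSet(dbMembers)
	ts := toSet(truststoreMembers)
	dq := toSet(dqliteMembers)

	// Collect every name seen in any source so the error can say exactly
	// which source is missing it.
	all := map[string]bool{}
	for _, set := range []map[string]bool{db, ts, dq} {
		for name := range set {
			all[name] = true
		}
	}

	var problems []string
	for name := range all {
		if !db[name] || !ts[name] || !dq[name] {
			problems = append(problems, fmt.Sprintf(
				"%q (core_cluster_members=%t, truststore=%t, dqlite=%t)",
				name, db[name], ts[name], dq[name]))
		}
	}

	if len(problems) > 0 {
		sort.Strings(problems)
		return fmt.Errorf("cluster membership is inconsistent: %s", strings.Join(problems, "; "))
	}

	return nil
}
```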
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/state/state.go | Adds CheckMembershipConsistency interface method and implementation with helper methods to gather and compare membership data from core_cluster_members, truststore, and dqlite |
| internal/rest/resources/tokens.go | Adds consistency check before token generation to prevent creating join tokens when cluster state is inconsistent |
| internal/rest/resources/cluster.go | Adds consistency checks before join and remove operations (force flag bypasses check for removals) |
| example/test/main.sh | Adds test_membership_consistency() to simulate inconsistent state and verify operations are blocked; improves process cleanup safety in shutdown_systems() |
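As a rough illustration of the handler-side wiring summarized in the table above: the check runs before the destructive operation and is skipped when the caller sets the force flag. The actual code in internal/rest/resources/cluster.go differs in types and plumbing; the interface and function names below are assumptions for the sketch.

```go
package resources

import "fmt"

// membershipChecker is a minimal stand-in for the state interface method the
// PR adds; the real state interface carries far more than this.
type membershipChecker interface {
	CheckMembershipConsistency() error
}

// removeMember sketches a removal handler that refuses to act on an
// inconsistent cluster unless the caller explicitly forces the removal.
func removeMember(s membershipChecker, name string, force bool) error {
	if !force {
		// Without force, bail out early so the admin sees which sources
		// disagree instead of making the inconsistency worse.
		if err := s.CheckMembershipConsistency(); err != nil {
			return fmt.Errorf("refusing to remove %q: %w", name, err)
		}
	}

	// ... proceed with removing the member from dqlite, the truststore and
	// core_cluster_members.
	return nil
}
```

The force path exists so an admin can still remove a member from a cluster that is already inconsistent, which is exactly the recovery scenario discussed in the comments above.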
Giving this some more thought: The auto-resolution strategy should be:
@roosterfish / @bschimke95 What do you think? I don't see a good/quick way to do the clean-up without putting the cluster at risk.
roosterfish left a comment
Thanks, please have a look at my few additional comments.
As I wrote in the other thread, let's include this, as the checks themselves are not very expensive but can be helpful.
Once we get a reproducer to end up in such a state, we should address this directly by modifying either the member join or remove logic.
Another aspect I just thought about is whether or not we can recover from all of the different inconsistencies. In this comment you mentioned that we might miss some data when restoring core_cluster_members so I suspect this is a manual operation anyway.
Have you already tested this for core_cluster_members, truststore and dqlite?
Force-pushed from 1fed52e to d436358
+1 When we get a reproducer, we can fix it in the future. In the meantime, this PR will help us/users find and define those issues.
roosterfish left a comment
Almost ready, pls check the suggestion from Copilot and one small nit.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
Force-pushed from df1ab64 to ccfeb8b
Force-pushed from ccfeb8b to 647147b
Force-pushed from 647147b to 5c32bfd
Please rebase with
Force-pushed from 5c32bfd to b5620b5
Force-pushed from b5620b5 to ba73c53
roosterfish left a comment
Thanks! Only one potential leftover from rebasing.
Force-pushed from ba73c53 to e69343e
Problem
Microcluster can enter inconsistent states where core_cluster_members (database), truststore, and dqlite cluster configuration become out of sync during partial failures. This leads to failed operations and difficult recovery scenarios.
Solution
Implements membership consistency validation before critical operations:
Validates before operations: Checks all three sources match before joins, removals, and token generation
Clear error messages: Shows the differences between sources when inconsistencies are detected, making it possible for admins to recover their cluster before worse things happen.
Changes Made
Testing