-
Notifications
You must be signed in to change notification settings - Fork 64
Blueprint: Replace all_omicron_zones() with separate in-service / expunged methods
#9521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: john/split-up-execution-cleanup
Are you sure you want to change the base?
Blueprint: Replace all_omicron_zones() with separate in-service / expunged methods
#9521
Conversation
dev-tools/omdb/src/bin/omdb/db.rs
Outdated
| let mut all_zones = | ||
| blueprint.in_service_zones().chain(blueprint.expunged_zones( | ||
| ReadyForCleanup::Both, | ||
| BlueprintExpungedZoneAccessReason::Omdb, | ||
| )); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few places where a caller legitimately wants to look at all zones, and I've updated all of those to follow this pattern where we chain in-service into expunged. This does mean we loop over all zones twice; I thought about adding a combo method but was worried it would be confusing. Open to feedback there though; this could be something like
let mut all_zones =
blueprint.all_in_service_and_expunged_zones(
// ReadyForCleanup::Both, // pass or just assume it?
BlueprintExpungedZoneAccessReason::Omdb,
);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, like you did with expunged_nexus_zones_ready_for_cleanup - this is probably worth doing?
I agree that inferring ReadyForCleanup::Both seems right, I can't imagine a world where we want to access "all in service zones, but only a subset of expunged zones" (if a caller wanted all "maybe running" zones, what they're doing is kinda racy, if a caller wanted all guaranteed stopped zones, that would be really weird).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in 4203003
| /// This is a catch-all variant for updating the `external_ip` and | ||
| /// `network_interface` tables for all zone types that have external | ||
| /// networking (Nexus, boundary NTP, and external DNS). | ||
| /// | ||
| /// The planner must not prune any of these zone types if their external | ||
| /// networking bits are still referenced by the active CRDB tables. | ||
| DeallocateExternalNetworkingResources, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(No action needed here, kinda just wondering out-loud)
This variant is zone-agnostic, so it has me wondering:
This enum lists every possible reason one might want to access expunged zones, but there could plausibly be "more than one reason" why a caller wants to access a zone when getting an iterator of zones.
The interface you provided expects the caller to supply a single "reason", but the planner could be trying to access expunged zones with two reasons (e.g. "I want to see all nexus zones with external networking resources that might need to be deallocated, but I'm also going to do something nexus-specific too")
Are we thinking that the number of "access reasons" is bounded enough that this won't really be an issue in practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is a good question. In a very early draft of this PR I was considering some way to provide multiple reasons, either by taking a &[reason1, reason2, ...] or doing something fancy to allow or-ing together multiple reasons. There are only a few places outside the planner where we were doing that, and instead I broke those up (e.g., #9512).
I haven't really touched things within the planner. The blueprint builder still has an all_omicron_zones(filter) method, and I think that's fine? My concern here is that outside-of-reconfigurator changes that read the blueprint start acting on expunged zones for some reason, and for that specifically, yeah I'm hoping that the number of "access reasons" stays pretty well bounded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The blueprint builder still has an
all_omicron_zones(filter)method, and I think that's fine?
Thinking about this more, this very well might not be fine: if the planner making correct decisions depends on it accessing expunged zones, it needs to keep them around for itself. I'll take a pass over the planner and see what it's doing as a followup.
nexus/types/src/deployment.rs
Outdated
| self.in_service_zones().chain(self.expunged_zones( | ||
| ReadyForCleanup::Both, | ||
| BlueprintExpungedZoneAccessReason::NexusSelfGeneration, | ||
| )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks changed from the original implementation?
could_be_running used to mean InService OR (Expunged AND !ready_for_cleanup)
Now, it seems we're saying: InService OR (Expunged), and we don't care about ready for cleanup? I think ReadyForCleanup::No would align with the original?
(I think this is probably fine either way, because we're looking for our own running zone, but it would be probably wrong if Nexus looked up its own generation, saw that it was "fully expunged and halted, supposedly", and kept truckin' along)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch; I didn't mean to change this. I think this is a case for exactly the thing you said would be racy in your earlier comment:
I agree that inferring ReadyForCleanup::Both seems right, I can't imagine a world where we want to access "all in service zones, but only a subset of expunged zones" (if a caller wanted all "maybe running" zones, what they're doing is kinda racy, if a caller wanted all guaranteed stopped zones, that would be really weird).
The could_be_running filter is exactly all "maybe running" zones, I think? Given this, do you think I should add both all_in_service_and_expunged_zones(reason) (in-service + RunningStatus::Any) and all_maybe_running_zones(reason) (in-service + RunningStatus::MaybeRunning)? (I do agree with you that "in-service + RunningStatus::Shutdown" would be really weird, which is why I'm not inclined to make it all_in_service_and_expunged_zones(running_status, reason).)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went with all_maybe_running_zones() (naming suggestions welcome!) and fixed this in 940ed08
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realizing now that all_maybe_running_zones() doesn't need to take a reason - the planner won't prune zones that aren't ready for cleanup, so we don't need to track callers' uses of them (I believe?). I guess that means we also don't need a reason for the specific call expunged_zones(ZoneRunningStatus::MaybeRunning, ...). The only callers that use that status are tests, though, so it doesn't seem worth breaking apart.
I do think I'm going to remove the reason from all_maybe_running_zones() though (and add a comment about why we don't need it).
dev-tools/omdb/src/bin/omdb/db.rs
Outdated
| let mut all_zones = | ||
| blueprint.in_service_zones().chain(blueprint.expunged_zones( | ||
| ReadyForCleanup::Both, | ||
| BlueprintExpungedZoneAccessReason::Omdb, | ||
| )); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, like you did with expunged_nexus_zones_ready_for_cleanup - this is probably worth doing?
I agree that inferring ReadyForCleanup::Both seems right, I can't imagine a world where we want to access "all in service zones, but only a subset of expunged zones" (if a caller wanted all "maybe running" zones, what they're doing is kinda racy, if a caller wanted all guaranteed stopped zones, that would be really weird).
| /// 2. Update the planner to account for it, to prevent the planner from pruning | ||
| /// the zone before whatever your use of it is completes. | ||
| #[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)] | ||
| pub enum BlueprintExpungedZoneAccessReason { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume there is going to be a match on all of these in the planner when we work towards closing #5552 , so that "if you add a new access-reason, you'll also be forced to do the accounting in the planner", right?
Seems like it might be easy for this to get out-of-sync if this enum is an "unused arg from callers" that never actually has any compile-time connection to the planner work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, definitely. I don't know yet whether that match will be structurally meaningful or just have comments referring to where we're handling the variants, but it will exist to give us the compile-time failure when variants are added.
sunshowers
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite reasonable to me modulo Sean's comments!
The primary change in this PR is that
Blueprint::all_omicron_zones(filter)has been replaced by two new methods:Blueprint::in_service_zones()(no arguments)Blueprint::expunged_zones(ready_for_cleanup, reason)whereready_for_cleanupallows filtering on that field, andreasonis a not-used-at-runtime marker for "why are we looking for expunged zones"The
reasonis the driver for this change. One of my big fears for #5552 is that we do the work today to make the planner account for any cleanup we do on expunged zones, but then some future work adds a new kind of cleanup without realizing the planner now has to consider that too. This led to the newBlueprintExpungedZoneAccessReasonenum, which I'm hoping serves as both a static guard (because it has to be passed toexpunged_zone()) and as a point of documentation (describing all the different things the planner has to consider when expunging zones).File ordering puts it toward the end, but I'd recommend reviewing
nexus/types/src/deployment.rsfirst - the changes to the rest of the files are largely straightforward fallout from that. (One bit that's slightly surprising is that a bunch ofall_omicron_zones(BlueprintZoneDisposition::any)calls were replaced byin_service_zones(); this is mostly in tests that were using::anybut didn't actually need to access expunged zones.)This is staged on top of #9512.