Skip to content

Conversation

@jgallagher
Copy link
Contributor

@jgallagher jgallagher commented Dec 15, 2025

The primary change in this PR is that Blueprint::all_omicron_zones(filter) has been replaced by two new methods:

  • Blueprint::in_service_zones() (no arguments)
  • Blueprint::expunged_zones(ready_for_cleanup, reason) where ready_for_cleanup allows filtering on that field, and reason is a not-used-at-runtime marker for "why are we looking for expunged zones"

The reason is the driver for this change. One of my big fears for #5552 is that we do the work today to make the planner account for any cleanup we do on expunged zones, but then some future work adds a new kind of cleanup without realizing the planner now has to consider that too. This led to the new BlueprintExpungedZoneAccessReason enum, which I'm hoping serves as both a static guard (because it has to be passed to expunged_zone()) and as a point of documentation (describing all the different things the planner has to consider when expunging zones).

File ordering puts it toward the end, but I'd recommend reviewing nexus/types/src/deployment.rs first - the changes to the rest of the files are largely straightforward fallout from that. (One bit that's slightly surprising is that a bunch of all_omicron_zones(BlueprintZoneDisposition::any) calls were replaced by in_service_zones(); this is mostly in tests that were using ::any but didn't actually need to access expunged zones.)

This is staged on top of #9512.

Comment on lines 1629 to 1633
let mut all_zones =
blueprint.in_service_zones().chain(blueprint.expunged_zones(
ReadyForCleanup::Both,
BlueprintExpungedZoneAccessReason::Omdb,
));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few places where a caller legitimately wants to look at all zones, and I've updated all of those to follow this pattern where we chain in-service into expunged. This does mean we loop over all zones twice; I thought about adding a combo method but was worried it would be confusing. Open to feedback there though; this could be something like

let mut all_zones =
    blueprint.all_in_service_and_expunged_zones(
        // ReadyForCleanup::Both, // pass or just assume it?
        BlueprintExpungedZoneAccessReason::Omdb,
    );

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, like you did with expunged_nexus_zones_ready_for_cleanup - this is probably worth doing?

I agree that inferring ReadyForCleanup::Both seems right, I can't imagine a world where we want to access "all in service zones, but only a subset of expunged zones" (if a caller wanted all "maybe running" zones, what they're doing is kinda racy, if a caller wanted all guaranteed stopped zones, that would be really weird).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 4203003

Comment on lines +758 to +764
/// This is a catch-all variant for updating the `external_ip` and
/// `network_interface` tables for all zone types that have external
/// networking (Nexus, boundary NTP, and external DNS).
///
/// The planner must not prune any of these zone types if their external
/// networking bits are still referenced by the active CRDB tables.
DeallocateExternalNetworkingResources,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(No action needed here, kinda just wondering out-loud)

This variant is zone-agnostic, so it has me wondering:

This enum lists every possible reason one might want to access expunged zones, but there could plausibly be "more than one reason" why a caller wants to access a zone when getting an iterator of zones.

The interface you provided expects the caller to supply a single "reason", but the planner could be trying to access expunged zones with two reasons (e.g. "I want to see all nexus zones with external networking resources that might need to be deallocated, but I'm also going to do something nexus-specific too")

Are we thinking that the number of "access reasons" is bounded enough that this won't really be an issue in practice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is a good question. In a very early draft of this PR I was considering some way to provide multiple reasons, either by taking a &[reason1, reason2, ...] or doing something fancy to allow or-ing together multiple reasons. There are only a few places outside the planner where we were doing that, and instead I broke those up (e.g., #9512).

I haven't really touched things within the planner. The blueprint builder still has an all_omicron_zones(filter) method, and I think that's fine? My concern here is that outside-of-reconfigurator changes that read the blueprint start acting on expunged zones for some reason, and for that specifically, yeah I'm hoping that the number of "access reasons" stays pretty well bounded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blueprint builder still has an all_omicron_zones(filter) method, and I think that's fine?

Thinking about this more, this very well might not be fine: if the planner making correct decisions depends on it accessing expunged zones, it needs to keep them around for itself. I'll take a pass over the planner and see what it's doing as a followup.

Comment on lines 540 to 543
self.in_service_zones().chain(self.expunged_zones(
ReadyForCleanup::Both,
BlueprintExpungedZoneAccessReason::NexusSelfGeneration,
))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks changed from the original implementation?

could_be_running used to mean InService OR (Expunged AND !ready_for_cleanup)

Now, it seems we're saying: InService OR (Expunged), and we don't care about ready for cleanup? I think ReadyForCleanup::No would align with the original?

(I think this is probably fine either way, because we're looking for our own running zone, but it would be probably wrong if Nexus looked up its own generation, saw that it was "fully expunged and halted, supposedly", and kept truckin' along)

Copy link
Contributor Author

@jgallagher jgallagher Dec 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch; I didn't mean to change this. I think this is a case for exactly the thing you said would be racy in your earlier comment:

I agree that inferring ReadyForCleanup::Both seems right, I can't imagine a world where we want to access "all in service zones, but only a subset of expunged zones" (if a caller wanted all "maybe running" zones, what they're doing is kinda racy, if a caller wanted all guaranteed stopped zones, that would be really weird).

The could_be_running filter is exactly all "maybe running" zones, I think? Given this, do you think I should add both all_in_service_and_expunged_zones(reason) (in-service + RunningStatus::Any) and all_maybe_running_zones(reason) (in-service + RunningStatus::MaybeRunning)? (I do agree with you that "in-service + RunningStatus::Shutdown" would be really weird, which is why I'm not inclined to make it all_in_service_and_expunged_zones(running_status, reason).)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with all_maybe_running_zones() (naming suggestions welcome!) and fixed this in 940ed08

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Realizing now that all_maybe_running_zones() doesn't need to take a reason - the planner won't prune zones that aren't ready for cleanup, so we don't need to track callers' uses of them (I believe?). I guess that means we also don't need a reason for the specific call expunged_zones(ZoneRunningStatus::MaybeRunning, ...). The only callers that use that status are tests, though, so it doesn't seem worth breaking apart.

I do think I'm going to remove the reason from all_maybe_running_zones() though (and add a comment about why we don't need it).

Comment on lines 1629 to 1633
let mut all_zones =
blueprint.in_service_zones().chain(blueprint.expunged_zones(
ReadyForCleanup::Both,
BlueprintExpungedZoneAccessReason::Omdb,
));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, like you did with expunged_nexus_zones_ready_for_cleanup - this is probably worth doing?

I agree that inferring ReadyForCleanup::Both seems right, I can't imagine a world where we want to access "all in service zones, but only a subset of expunged zones" (if a caller wanted all "maybe running" zones, what they're doing is kinda racy, if a caller wanted all guaranteed stopped zones, that would be really weird).

/// 2. Update the planner to account for it, to prevent the planner from pruning
/// the zone before whatever your use of it is completes.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum BlueprintExpungedZoneAccessReason {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume there is going to be a match on all of these in the planner when we work towards closing #5552 , so that "if you add a new access-reason, you'll also be forced to do the accounting in the planner", right?

Seems like it might be easy for this to get out-of-sync if this enum is an "unused arg from callers" that never actually has any compile-time connection to the planner work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, definitely. I don't know yet whether that match will be structurally meaningful or just have comments referring to where we're handling the variants, but it will exist to give us the compile-time failure when variants are added.

Copy link
Contributor

@sunshowers sunshowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite reasonable to me modulo Sean's comments!

@jgallagher jgallagher requested a review from smklein December 16, 2025 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants