HYPERFLEET-895 - feat: Add force delete design #131
pnguyen44 wants to merge 5 commits into openshift-hyperfleet:main from HYPERFLEET-895/force-delete
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (none yet). Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment. The full list of commands accepted by this bot can be found here.
No actionable comments were generated in the recent review. 🎉
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hyperfleet/docs/force-deletion-design.md`:
- Around line 72-78: Add "500 Internal Server Error: delete failed due to unexpected server error" to the "Response codes:" list so that it matches the sequence diagram's error case; the list currently contains 204, 403, 404, and 409, so add the 500 entry and mirror the sequence diagram's "500 Internal Server Error" wording for consistency.
- Around line 83-86: The Database Impact section references the Hard Delete Design but doesn't explicitly state that force delete uses the identical bottom-up deletion ordering; update the prose under "Database Impact" (and/or the "Force delete" description) to confirm that force delete performs the same bottom-up ordering as a normal hard delete (e.g., nodepools before clusters, child resources before parents) despite bypassing the Reconciled=True gate, and add a short example or clause referencing the Hard Delete Design to make the guarantee explicit.
- Around line 68-79: Treat the missing authz middleware as a hard blocker for the force-delete feature and update the design to require that authz middleware is implemented and enabled before any force-delete route is added. Specifically, note in the document that routes.go's TODO for authz must be completed and that deployment must be gated (via a CI check, a feature flag, or a runtime startup validation) so the force-delete endpoint cannot be active until authz is registered, and ensure the handler that performs the force delete only accepts requests when the resource is in Finalizing (deleted_time set); see the sketch after this list.
📒 Files selected for processing (1): hyperfleet/docs/force-deletion-design.md
ciaranRoche left a comment:
Nice work, just a small piece of housekeeping: I think there is a reference in the existing delete docs to force deletion having a separate spike, so it might be worth updating that and linking to this doc 🙏
> `## Alternatives Considered`
I am wondering if you explored the idea of removing finalizers from created resources 🤔
With the current design we not only orphan resources in the user's cloud, we also orphan resources in our own infra. I don't think hypershift has a 'force' option, but we could always force it by removing finalizers. It would be worth at least exploring IMO.
What do people think? Is the juice worth the squeeze to clean up not just our DB but also resources in our infra, down to the management cluster?
Good point. Cleaning up K8s infra is a separate concern since stripping finalizers would need management cluster access, which today only adapters have, and the adapter being stuck is usually why we're force-deleting in the first place. For now the admin can strip finalizers manually via kubectl as part of the same incident. The stuck detection metric and force-delete audit logs will tell us how often this happens, and we can revisit if it becomes a frequent manual step.
I agree to a point; I would think having a 'best effort' force would probably require a dedicated adapter/controller to handle it.
My main concern about revisiting it later is API compatibility: if we revisit this after we are in production, we are limited in what changes we can make to the proposed API in this design, as the contract is locked.
I think it's best we capture this decision in an ADR: that we are accepting that force delete is hyperfleet DB only, and that we acknowledge that we are orphaning infra. We should document how we would extend to a 'best effort' cleanup later on, via a dedicated endpoint or a cleanup adapter/controller.
Definitely worth a note in the trade-offs and a link to the ADR, just so we don't have a missed gap, and if it becomes a problem we have a point of reference to start from.
> `## Audit Logging Approach`
>
> The API logs a structured log entry before hard-deleting records, following the [Logging Specification](../standards/logging-specification.md). The log entry includes the caller identity, resource ID, resource type, timestamp, subresources being removed, and adapter statuses at the time of force delete. If the delete fails, the API logs the failure with the error.
Is there anything else we can add here? I know we are following the logging spec, but is that enough? During an incident or a PMR, can we answer exactly when and why a resource was nuked?
I added a required reason field on the request body that gets included in the audit log. Between adapter statuses (why it was stuck) and the reason (why the admin intervened), we can reconstruct the full picture during a PMR.
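For illustration, a minimal sketch of what that structured audit entry could look like, assuming Go's log/slog; the field names and the `ForceDeleteAudit` type are placeholders, with the real schema following the Logging Specification:

```go
// Hypothetical sketch; field names are illustrative, not the final schema.
package audit

import (
	"log/slog"
	"time"
)

// ForceDeleteAudit captures who force-deleted what, when, and why.
type ForceDeleteAudit struct {
	Caller          string   // identity of the admin issuing the force delete
	ResourceID      string
	ResourceType    string
	Reason          string   // required reason supplied in the request body
	Subresources    []string // child records removed as part of the delete
	AdapterStatuses []string // adapter statuses captured at the time of force delete
}

// LogForceDelete emits the audit entry before the hard delete runs.
func LogForceDelete(logger *slog.Logger, a ForceDeleteAudit) {
	logger.Info("force delete requested",
		slog.String("caller", a.Caller),
		slog.String("resource_id", a.ResourceID),
		slog.String("resource_type", a.ResourceType),
		slog.String("reason", a.Reason),
		slog.Time("timestamp", time.Now().UTC()),
		slog.Any("subresources", a.Subresources),
		slog.Any("adapter_statuses", a.AdapterStatuses),
	)
}
```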
This seems reasonable. I think it is worth capturing the trade-offs, since we do not control the log retention period; this could become an issue if there is an incident and the logs are no longer around, because then we won't be able to answer the when/why.
So yeah, a trade-off note sounds good: we will rely on the audit log for now, and if it proves insufficient we can potentially extend to an audit table, which we can control.
WDYT?
Summary
Jira ticket
Design document for the force deletion mechanism. Force delete is a synchronous, admin-only API action that hard-deletes resources stuck in Finalizing, bypassing the Reconciled=True gate. Covers the force-delete API and response codes, database impact, the audit logging approach, and the stuck detection metric.
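For illustration, a hedged sketch of how an admin might invoke the force delete described above; the endpoint path and the `reason` field are assumptions taken from the review thread, not the finalized API contract:

```go
// Hypothetical admin-side call; path and payload shape are illustrative only.
package admin

import (
	"fmt"
	"net/http"
	"strings"
)

// ForceDelete issues the synchronous force-delete request with the required reason.
func ForceDelete(apiURL, clusterID, reason string) error {
	body := strings.NewReader(fmt.Sprintf(`{"reason": %q}`, reason))
	req, err := http.NewRequest(http.MethodDelete, apiURL+"/clusters/"+clusterID+"/force", body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// 204 No Content on success; 409 Conflict if the resource is not in Finalizing.
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("force delete failed: %s", resp.Status)
	}
	return nil
}
```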