
HYPERFLEET-895 - feat: Add force delete design#131

Open
pnguyen44 wants to merge 5 commits into openshift-hyperfleet:main from pnguyen44:HYPERFLEET-895/force-delete

Conversation


@pnguyen44 pnguyen44 commented Apr 28, 2026

Summary

Jira ticket

Design document for the force deletion mechanism. Force delete is a synchronous, admin-only API action that hard-deletes resources stuck in Finalizing, bypassing the Reconciled=True gate.

Covers:

  • API contract and response codes (a minimal request/response sketch follows this list)
  • Cascade semantics for resource and subresource deletion
  • Sentinel stuck detection via Prometheus metrics
  • Audit logging
  • Rejected alternatives with rationale
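
To make the API contract above concrete, here is a minimal sketch of the request body and response codes. It assumes a JSON body carrying the required reason field discussed later in this review; the package, type, field names, and per-code meanings are illustrative readings of the design, not the finalized API.

```go
// Hypothetical sketch only: names and code meanings below are assumptions, not the final API.
package forcedelete

// ForceDeleteRequest is the body an admin would send with a force delete.
// The reason is required and is copied into the audit log entry.
type ForceDeleteRequest struct {
	Reason string `json:"reason"` // why the admin is intervening (required)
}

// Response codes discussed in this PR for the synchronous, admin-only endpoint:
//   204 No Content            - resource and its subresources were hard-deleted
//   403 Forbidden             - caller is not an admin
//   404 Not Found             - resource does not exist
//   409 Conflict              - resource is not in Finalizing
//   500 Internal Server Error - delete failed due to an unexpected server error
```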

Summary by CodeRabbit

  • Documentation
    • Added a design draft for admin-only synchronous force-delete operations, including API behavior, required request fields, and expected response codes.
    • Describes cascade semantics (resource and subresource force-deletes), single-transaction hard-delete ordering, and adapter behavior after deletion.
    • Specifies sentinel metrics for stuck-detection and required structured audit and failure logging, and contrasts alternative approaches (a hypothetical metric sketch follows below).
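
Since stuck detection is metrics-driven, the sketch below shows what such a sentinel metric could look like with Prometheus client_golang; the metric name, label, and registration pattern are assumptions for illustration, not what the design document specifies.

```go
// Hypothetical sketch: metric name and labels are assumptions.
package metrics

import "github.com/prometheus/client_golang/prometheus"

// stuckFinalizing would track resources that have stayed in Finalizing beyond a
// threshold; alerting on it is what surfaces force-delete candidates to admins.
var stuckFinalizing = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "hyperfleet_resources_stuck_finalizing",
		Help: "Resources stuck in Finalizing beyond the stuck-detection threshold.",
	},
	[]string{"resource_type"},
)

func init() {
	prometheus.MustRegister(stuckFinalizing)
	// A reconciler or periodic job would update it, e.g.:
	// stuckFinalizing.WithLabelValues("cluster").Set(float64(count))
}
```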

@openshift-ci openshift-ci Bot requested review from aredenba-rh and tirthct April 28, 2026 17:25

openshift-ci Bot commented Apr 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rafabene for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai Bot commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 9e3bfcc and eb4885a.

📒 Files selected for processing (3)
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md
  • hyperfleet/components/api-service/hard-delete-design.md
  • hyperfleet/docs/force-deletion-design.md
✅ Files skipped from review due to trivial changes (3)
  • hyperfleet/docs/force-deletion-design.md
  • hyperfleet/components/api-service/hard-delete-design.md
  • hyperfleet/components/adapter/framework/adapter-deletion-flow-design.md

🚥 Pre-merge checks: ✅ 5 passed
  • Description Check: Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: Passed. The title clearly identifies the main change: introducing a force deletion design document with a specific feature tag and Jira reference.
  • Docstring Coverage: Passed. No functions were found in the changed files to evaluate docstring coverage; skipping the check.
  • Linked Issues Check: Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: Passed. Check skipped because no linked issues were found for this pull request.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@hyperfleet/docs/force-deletion-design.md`:
- Around line 72-78: Add "500 Internal Server Error: delete failed due to an
unexpected server error" to the "Response codes:" list so it matches the
sequence diagram's error case. The list currently contains only 204, 403, 404,
and 409; include the 500 entry and mirror the sequence diagram's "500 Internal
Server Error" wording for consistency.
- Around line 83-86: The Database Impact section references the Hard Delete
Design but does not explicitly state that force delete uses the identical
bottom-up deletion ordering. Update the prose under "Database Impact" (and/or
the "Force delete" description) to confirm that force delete performs the same
bottom-up ordering as a normal hard delete (e.g., nodepools before clusters,
child resources before parents) despite bypassing the Reconciled=True gate, and
add a short example or clause referencing the Hard Delete Design to make the
guarantee clear.
- Around line 68-79: Mark the missing authz middleware as a hard blocker for the
force-delete feature and require that authz middleware be implemented and
enabled before any force-delete route is added. Specifically, note in the
document that routes.go's TODO for authz must be completed and that deployment
must be gated (via a CI check, a feature flag, or a runtime startup validation)
so the force-delete endpoint cannot be active until authz is registered, and
ensure the handler that performs the force delete only accepts requests when the
resource is in Finalizing (deleted_time set); a minimal sketch of that guard and
the deletion ordering follows this list.
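
The second and third findings together describe a Finalizing guard plus the same bottom-up, single-transaction ordering as a normal hard delete. A minimal sketch under those assumptions is below; the table names, columns, and use of database/sql are illustrative, not the actual HyperFleet schema or data layer.

```go
// Hypothetical sketch: schema and helper names are assumptions, not HyperFleet code.
package forcedelete

import (
	"context"
	"database/sql"
	"errors"
)

// ErrNotFinalizing would map to the 409 Conflict response.
var ErrNotFinalizing = errors.New("resource is not in Finalizing")

// forceDeleteCluster hard-deletes a cluster and its subresources bottom-up
// (nodepools before the cluster) in one transaction, and refuses to act unless
// the resource is already Finalizing (deleted_time set).
func forceDeleteCluster(ctx context.Context, db *sql.DB, clusterID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	// Guard: only resources with deleted_time set qualify for force delete.
	var deletedTime sql.NullTime
	if err := tx.QueryRowContext(ctx,
		`SELECT deleted_time FROM clusters WHERE id = $1`, clusterID,
	).Scan(&deletedTime); err != nil {
		return err // sql.ErrNoRows would map to 404
	}
	if !deletedTime.Valid {
		return ErrNotFinalizing
	}

	// Bottom-up ordering: child resources before the parent, matching the
	// normal hard-delete path.
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM nodepools WHERE cluster_id = $1`, clusterID); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM clusters WHERE id = $1`, clusterID); err != nil {
		return err
	}
	return tx.Commit()
}
```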


📥 Commits

Reviewing files that changed from the base of the PR and between 3869e40 and 506cc90.

📒 Files selected for processing (1)
  • hyperfleet/docs/force-deletion-design.md

@pnguyen44 pnguyen44 changed the title from "HYPERFLEET-895 - feat: Add force delete design" to "[DRAFT] HYPERFLEET-895 - feat: Add force delete design" on Apr 28, 2026
@pnguyen44 pnguyen44 force-pushed the HYPERFLEET-895/force-delete branch from 3eebc62 to 9e3bfcc on April 29, 2026 at 16:48
@pnguyen44 pnguyen44 changed the title from "[DRAFT] HYPERFLEET-895 - feat: Add force delete design" to "HYPERFLEET-895 - feat: Add force delete design" on Apr 29, 2026

@ciaranRoche ciaranRoche left a comment


Nice work, just a small piece of housekeeping: I think there is a reference in the existing delete docs to force deletion having a separate spike, so it might be worth updating that and linking to this doc 🙏


---

## Alternatives Considered
Contributor


I am wondering if you explored the idea of removing finalizers from created resources 🤔

With the current design we not only orphan resources in the user's cloud, we also orphan resources in our own infra. I don't think HyperShift has a 'force' option, but we could always force it by removing finalizers. It would be worth at least exploring IMO.

What do people think? Is the juice worth the squeeze to clean up not just our DB but also resources in our infra, down to the management cluster?

Author

@pnguyen44 pnguyen44 Apr 30, 2026


Good point. Cleaning up K8s infra is a separate concern since stripping finalizers would need management cluster access, which today only adapters have, and the adapter being stuck is usually why we're force-deleting in the first place. For now the admin can strip finalizers manually via kubectl as part of the same incident. The stuck detection metric and force-delete audit logs will tell us how often this happens, and we can revisit if it becomes a frequent manual step.
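
For completeness, the manual workaround mentioned above (stripping finalizers on the management cluster) is roughly a merge patch that nulls metadata.finalizers. The sketch below shows the equivalent with a client-go dynamic client; the HostedCluster group/version/resource and the assumption of management-cluster credentials are illustrative only.

```go
// Hypothetical sketch: assumes management-cluster access the API service does not have today.
package cleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// stripFinalizers clears metadata.finalizers on a stuck HostedCluster so it can
// be garbage collected, mirroring the manual kubectl patch an admin would run.
func stripFinalizers(ctx context.Context, client dynamic.Interface, namespace, name string) error {
	gvr := schema.GroupVersionResource{
		Group:    "hypershift.openshift.io",
		Version:  "v1beta1",
		Resource: "hostedclusters",
	}
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	_, err := client.Resource(gvr).Namespace(namespace).Patch(
		ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```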

Contributor


I agree to a point; I would think a 'best effort' force would probably require a dedicated adapter/controller to handle it.

My main concern about revisiting it later is API compatibility: if we revisit this after we are in production, we are limited in what changes we can make to the proposed API in this design, as the contract is locked.

I think it is best we capture this decision in an ADR: that we are accepting force delete is HyperFleet DB only, and that we acknowledge we are orphaning infra. We should also document how we would extend to a 'best effort' cleanup later on, via a dedicated endpoint or a cleanup adapter/controller.
Definitely worth a note in the trade-offs with a link to the ADR, just so we don't have a missed gap, and if it becomes a problem we have a point of reference to start from.


## Audit Logging Approach

The API logs a structured log entry before hard-deleting records, following the [Logging Specification](../standards/logging-specification.md). The log entry includes the caller identity, resource ID, resource type, timestamp, subresources being removed, and adapter statuses at time of force delete. If the delete fails, the API logs the failure with the error.
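
As a rough illustration of that entry, here is a minimal sketch assuming Go's log/slog as the structured logger; the field names are placeholders that would need to match the Logging Specification, and the reason field reflects the addition discussed below in this thread.

```go
// Hypothetical sketch: field names are placeholders, not the Logging Specification schema.
package audit

import (
	"log/slog"
	"time"
)

// logForceDelete emits the audit record before the hard delete runs, so the
// who/what/when/why is captured even if the delete itself later fails.
func logForceDelete(log *slog.Logger, caller, resourceID, resourceType, reason string,
	subresources []string, adapterStatuses map[string]string) {
	log.Info("force delete requested",
		slog.String("caller", caller),
		slog.String("resource_id", resourceID),
		slog.String("resource_type", resourceType),
		slog.Time("timestamp", time.Now().UTC()),
		slog.Any("subresources", subresources),
		slog.Any("adapter_statuses", adapterStatuses),
		slog.String("reason", reason),
	)
}
```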
Contributor


Is there anything else we can add here? I know we are following the logging spec, but is that enough? During an incident or a PMR, can we answer exactly when and why a resource was nuked?

Author

@pnguyen44 pnguyen44 Apr 30, 2026


I added a required reason field on the request body that gets included in the audit log. Between adapter statuses (why it was stuck) and the reason (why the admin intervened), we can reconstruct the full picture during a PMR.

Contributor


This seems reasonable. I think it is worth capturing the trade-offs: since we do not control the log retention period, this could become an issue; if there is an incident and the logs are no longer around, we won't be able to answer the when/why.

So yeah, a trade-off note sounds good: we rely on the audit log for now, and if it proves insufficient we can potentially extend to an audit table, which we can control.

WDYT?
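
For reference, a hypothetical shape of such an audit record is sketched below; the type and field names are assumptions, and an actual table would follow whatever schema conventions the service already uses.

```go
// Hypothetical sketch: a DB-backed audit record so retention is under our control.
package audit

import "time"

// ForceDeleteAuditRecord mirrors the structured log entry but lives in a table
// the service owns, decoupling retention from the logging platform.
type ForceDeleteAuditRecord struct {
	ID              string    // primary key
	Caller          string    // admin identity that issued the force delete
	ResourceID      string
	ResourceType    string
	Reason          string    // required reason from the request body
	Subresources    []string  // subresources removed in the same transaction
	AdapterStatuses string    // serialized adapter statuses at time of delete
	CreatedAt       time.Time // when the force delete was requested
}
```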

