HYPERFLEET-895 - feat: Add force delete design #131
pnguyen44 wants to merge 5 commits into openshift-hyperfleet:main from HYPERFLEET-895/force-delete
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (none yet). Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment. The full list of commands accepted by this bot can be found here.
No actionable comments were generated in the recent review. 🎉
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@hyperfleet/docs/force-deletion-design.md`:
- Around line 72-78: Add "500 Internal Server Error: delete failed due to unexpected server error" to the "Response codes:" list so that it matches the sequence diagram's error case; the list currently contains 204, 403, 404, and 409, so add the 500 entry and mirror the sequence diagram's "500 Internal Server Error" wording for consistency.
- Around line 83-86: The Database Impact section references the Hard Delete Design but doesn't explicitly state that force delete uses the identical bottom-up deletion ordering; update the prose under "Database Impact" (and/or the "Force delete" description) to confirm that force delete performs the same bottom-up ordering as a normal hard delete (e.g., nodepools before clusters, child resources before parents) despite bypassing the Reconciled=True gate, and add a short example or clause referencing the Hard Delete Design to make the guarantee explicit.
- Around line 68-79: Treat the missing authz middleware as a hard blocker for the force-delete feature and update the design to require that authz middleware is implemented and enabled before any force-delete route is added. Specifically, note in the document that routes.go's TODO for authz must be completed and that deployment must be gated (via a CI check, a feature flag, or a runtime startup validation) so the force-delete endpoint cannot be active until authz is registered, and ensure the handler that performs the force delete only accepts requests when the resource is in Finalizing (deleted_time set); see the sketch after this list.
📒 Files selected for processing (1): hyperfleet/docs/force-deletion-design.md
ciaranRoche left a comment:
Nice work, just a small piece of housekeeping: I think there is a reference in the existing delete docs to force deletion having a separate spike, so it might be worth updating that and linking to this doc 🙏
> `## Alternatives Considered`
I am wondering if you explored the idea of removing finalizers from created resources 🤔
With the current design we not only orphan resources in the user's cloud, we also orphan resources in our own infra. I don't think hypershift has a 'force' option, but we could always force it by removing finalizers. It would be worth at least exploring IMO.
What do people think? Is the juice worth the squeeze to clean up not just our DB but also resources in our infra, down to the management cluster?
Good point. Cleaning up K8s infra is a separate concern since stripping finalizers would need management cluster access, which today only adapters have, and the adapter being stuck is usually why we're force-deleting in the first place. For now the admin can strip finalizers manually via kubectl as part of the same incident. The stuck detection metric and force-delete audit logs will tell us how often this happens, and we can revisit if it becomes a frequent manual step.
I agree to a point; I would think having a 'best effort' force would probably require a dedicated adapter/controller to handle it.
My main concern about revisiting it later is API compatibility: if we revisit this after we are in production, we are limited in what changes we can make to the proposed API in this design, as the contract is locked.
I think it's best we capture this decision in an ADR: that we are accepting that force delete is hyperfleet DB only, and that we acknowledge that we are orphaning infra. We should document how we would extend to a 'best effort' cleanup later on, via a dedicated endpoint or a cleanup adapter/controller.
Definitely worth a note in the trade-offs and a link to the ADR, just so we don't have a missed gap, and if it becomes a problem we have a point of reference to start from.
> `## Audit Logging Approach`
>
> The API logs a structured log entry before hard-deleting records, following the [Logging Specification](../standards/logging-specification.md). The log entry includes the caller identity, resource ID, resource type, timestamp, subresources being removed, and adapter statuses at the time of force delete. If the delete fails, the API logs the failure with the error.
Is there anything else we can add here? I know we are following the logging spec, but is that enough? During an incident or a PMR, can we answer exactly when and why a resource was nuked?
I added a required reason field on the request body that gets included in the audit log. Between adapter statuses (why it was stuck) and the reason (why the admin intervened), we can reconstruct the full picture during a PMR.
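For illustration, a minimal sketch of what that structured audit entry could look like, assuming Go's log/slog; the field names and the `ForceDeleteAudit` type are placeholders, with the real schema following the Logging Specification:

```go
// Hypothetical sketch; field names are illustrative, not the final schema.
package audit

import (
	"log/slog"
	"time"
)

// ForceDeleteAudit captures who force-deleted what, when, and why.
type ForceDeleteAudit struct {
	Caller          string   // identity of the admin issuing the force delete
	ResourceID      string
	ResourceType    string
	Reason          string   // required reason supplied in the request body
	Subresources    []string // child records removed as part of the delete
	AdapterStatuses []string // adapter statuses captured at the time of force delete
}

// LogForceDelete emits the audit entry before the hard delete runs.
func LogForceDelete(logger *slog.Logger, a ForceDeleteAudit) {
	logger.Info("force delete requested",
		slog.String("caller", a.Caller),
		slog.String("resource_id", a.ResourceID),
		slog.String("resource_type", a.ResourceType),
		slog.String("reason", a.Reason),
		slog.Time("timestamp", time.Now().UTC()),
		slog.Any("subresources", a.Subresources),
		slog.Any("adapter_statuses", a.AdapterStatuses),
	)
}
```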
This seems reasonable. I think it is worth capturing the trade-offs, since we do not control the log retention period; this could become an issue if there is an incident and the logs are no longer around, because then we won't be able to answer the when/why.
So yeah, a trade-off note sounds good: we will rely on the audit log for now, and if it proves insufficient we can potentially extend to an audit table, which we can control.
WDYT?
Summary
Jira ticket
Design document for the force deletion mechanism. Force delete is a synchronous, admin-only API action that hard-deletes resources stuck in Finalizing, bypassing the Reconciled=True gate. Covers the force-delete API and response codes, database impact, the audit logging approach, and the stuck detection metric.
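For illustration, a hedged sketch of how an admin might invoke the force delete described above; the endpoint path and the `reason` field are assumptions taken from the review thread, not the finalized API contract:

```go
// Hypothetical admin-side call; path and payload shape are illustrative only.
package admin

import (
	"fmt"
	"net/http"
	"strings"
)

// ForceDelete issues the synchronous force-delete request with the required reason.
func ForceDelete(apiURL, clusterID, reason string) error {
	body := strings.NewReader(fmt.Sprintf(`{"reason": %q}`, reason))
	req, err := http.NewRequest(http.MethodDelete, apiURL+"/clusters/"+clusterID+"/force", body)
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// 204 No Content on success; 409 Conflict if the resource is not in Finalizing.
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("force delete failed: %s", resp.Status)
	}
	return nil
}
```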