HYPERFLEET-854 - feat: implement hard deletion for clusters and nodepools #119
mliptak0 wants to merge 2 commits into openshift-hyperfleet:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Walkthrough

This pull request introduces hard-deletion semantics for Cluster and NodePool resources, enabling permanent removal after soft deletion once all required adapter statuses report finalization.

Sequence Diagram(s)

sequenceDiagram
participant Client
participant ClusterService
participant ClusterDAO
participant AdapterStatusDAO
participant DB
rect rgba(100, 150, 200, 0.5)
Note over Client,DB: Hard-Delete Flow: Cluster Soft-Deletion
end
Client->>ClusterService: ProcessAdapterStatus(clusterID, adapterStatus)
activate ClusterService
ClusterService->>ClusterDAO: GetForUpdate(clusterID)
activate ClusterDAO
ClusterDAO->>DB: SELECT * FROM clusters WHERE id=? FOR UPDATE
DB-->>ClusterDAO: cluster row (soft-deleted)
ClusterDAO-->>ClusterService: *Cluster
deactivate ClusterDAO
ClusterService->>AdapterStatusDAO: FindByResource(clusterID)
activate AdapterStatusDAO
AdapterStatusDAO->>DB: SELECT * FROM adapter_statuses WHERE resourceID=?
DB-->>AdapterStatusDAO: all adapter statuses
AdapterStatusDAO-->>ClusterService: []AdapterStatus
deactivate AdapterStatusDAO
ClusterService->>ClusterService: validateAndClassify(adapterStatus)
ClusterService->>AdapterStatusDAO: Upsert(ctx, adapterStatus, existing)
activate AdapterStatusDAO
AdapterStatusDAO->>DB: UPDATE/INSERT adapter_statuses
DB-->>AdapterStatusDAO: updated AdapterStatus
AdapterStatusDAO-->>ClusterService: *AdapterStatus
deactivate AdapterStatusDAO
alt All Required Adapters Finalized AND NO Child NodePools
ClusterService->>AdapterStatusDAO: DeleteByResource(clusterID)
activate AdapterStatusDAO
AdapterStatusDAO->>DB: DELETE FROM adapter_statuses WHERE resourceID=?
DB-->>AdapterStatusDAO: success
deactivate AdapterStatusDAO
ClusterService->>ClusterDAO: Delete(clusterID)
activate ClusterDAO
ClusterDAO->>DB: DELETE FROM clusters WHERE id=?
DB-->>ClusterDAO: success
deactivate ClusterDAO
ClusterService-->>Client: cluster hard-deleted
else Child NodePools Exist OR Adapters Not Finalized
ClusterService->>ClusterService: recomputeAndSaveClusterStatus(cluster, adapterStatuses)
ClusterService->>ClusterDAO: SaveStatusConditions(clusterID, conditions)
activate ClusterDAO
ClusterDAO->>DB: UPDATE clusters SET status_conditions=? WHERE id=?
DB-->>ClusterDAO: success
deactivate ClusterDAO
ClusterService-->>Client: cluster soft-deleted, waiting for finalization
end
deactivate ClusterService
sequenceDiagram
participant Client
participant AggregationService
participant ConditionAggregator
rect rgba(150, 100, 150, 0.5)
Note over Client,ConditionAggregator: Condition Aggregation: Deletion with Child Resources
end
Client->>AggregationService: ComputeReconciled(adapterStatuses, deletedTime, hasChildResources=true)
activate AggregationService
AggregationService->>AggregationService: Check: all adapters finalized?
alt deletedTime != nil AND allFinalized AND hasChildResources=true
AggregationService->>ConditionAggregator: Create condition(Type=Reconciled, Status=False, Reason=WaitingForChildResources)
ConditionAggregator-->>AggregationService: condition
AggregationService-->>Client: Reconciled=False, Reason=WaitingForChildResources
else deletedTime != nil AND allFinalized AND hasChildResources=false
AggregationService->>ConditionAggregator: Create condition(Type=Reconciled, Status=True, Reason=AllAdaptersReconciled)
ConditionAggregator-->>AggregationService: condition
AggregationService-->>Client: Reconciled=True, Reason=AllAdaptersReconciled
else All adapters not finalized
AggregationService->>ConditionAggregator: Create condition(Type=Reconciled, Status=False, Reason=MissingRequiredAdapters)
ConditionAggregator-->>AggregationService: condition
AggregationService-->>Client: Reconciled=False, Reason=MissingRequiredAdapters
end
deactivate AggregationService
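The three aggregation outcomes in the diagram above can be sketched as a small decision function. This is an illustrative stand-in, not the real AggregationService API; the Condition type and function signature are assumed.

```go
package main

import "fmt"

// Condition mirrors the outcomes shown in the aggregation diagram above.
// It is a hypothetical stand-in for the real condition type.
type Condition struct {
	Type   string
	Status bool
	Reason string
}

// computeReconciled sketches the branch ordering from the diagram: a missing
// finalization always wins, and remaining child resources gate the final
// Reconciled=True during deletion.
func computeReconciled(deleted, allFinalized, hasChildResources bool) Condition {
	switch {
	case deleted && allFinalized && hasChildResources:
		return Condition{"Reconciled", false, "WaitingForChildResources"}
	case deleted && allFinalized:
		return Condition{"Reconciled", true, "AllAdaptersReconciled"}
	default:
		return Condition{"Reconciled", false, "MissingRequiredAdapters"}
	}
}

func main() {
	fmt.Println(computeReconciled(true, true, true).Reason)  // WaitingForChildResources
	fmt.Println(computeReconciled(true, true, false).Reason) // AllAdaptersReconciled
	fmt.Println(computeReconciled(true, false, true).Reason) // MissingRequiredAdapters
}
```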
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CHANGELOG.md`:
- Around line 12-13: The changelog contains invalid PR links using
"/pull/HYPERFLEET-854" which must be replaced with the numeric PR path
"/pull/119"; update both occurrences in CHANGELOG.md (the two lines referencing
HYPERFLEET-854 and the additional occurrence noted at line 31) so each
markdown link points to
https://github.com/openshift-hyperfleet/hyperfleet-api/pull/119 instead of the
non-numeric form.
In `@pkg/dao/adapter_status.go`:
- Around line 98-100: When updateResult.RowsAffected == 0 in the adapter status
update flow, explicitly distinguish a stale-report (concurrent version mismatch)
from a row that was deleted between read and update: after detecting
RowsAffected == 0, re-query the DB for the row using the same key/ID from
existing (e.g., existing.ID) — if the re-read returns no row, return a not-found
error (or trigger recreate logic) instead of returning existing; if the re-read
returns a row whose version/fields differ from existing, treat that as a stale
update and return existing (or a conflict error) so callers know the update
didn’t persist. Ensure this logic sits where updateResult.RowsAffected is
checked (the function handling the update in adapter_status.go).
In `@pkg/dao/cluster.go`:
- Around line 97-105: In SaveStatusConditions
(sqlClusterDao.SaveStatusConditions) add a check after the Update to treat zero
affected rows as not-found: if result.RowsAffected == 0, call
db.MarkForRollback(ctx, gorm.ErrRecordNotFound) (or sql.ErrNoRows) and return an
appropriate not-found error instead of nil; keep the existing error handling for
result.Error unchanged so true DB errors still roll back.
In `@pkg/dao/node_pool.go`:
- Around line 60-68: In SaveStatusConditions, don't treat a zero-row update as
success: after calling g2.Model(&api.NodePool{}).Where(...).Update(...) inspect
result.RowsAffected and if it is 0 return a not-found error (and call
db.MarkForRollback(ctx, err) as appropriate) instead of returning nil; keep
existing behavior for result.Error non-nil. Use the SaveStatusConditions
function and result.RowsAffected to implement this check so concurrent
hard-delete/update races surface as not-found.
In `@pkg/services/adapter_status.go`:
- Around line 125-130: The lookup code around
s.adapterStatusDao.FindByResourceAndAdapter currently treats any error as a "not
found" by setting existing = nil, which can hide real DAO failures before
calling Upsert; update the logic in the function to only normalize to nil when
the DAO returns the explicit not-found sentinel (e.g., ErrNotFound or the dao's
not-found behavior) and to return the lookup error immediately for any other
error; locate the call to s.adapterStatusDao.FindByResourceAndAdapter(ctx,
adapterStatus.ResourceType, adapterStatus.ResourceID, adapterStatus.Adapter) and
replace the unconditional existing = nil on error with a conditional that checks
the error value/type and either sets existing = nil for not-found or returns the
error so Upsert sees only valid existing state.
In `@pkg/services/aggregation.go`:
- Around line 740-754: The helper allAdaptersFinalized currently only checks
adapterStatus.IsFinalized() and thus considers finalization from any generation;
change it to require Finalized==true at the current resource generation by
adding a generation parameter (e.g., currentGeneration int64) and only count
adapterStatus entries whose Generation equals that parameter and whose
IsFinalized() is true; update callers (cluster/nodepool hard-delete call sites)
to pass the current resource generation, mirroring the generation filtering
approach used by computeReconciled, so older finalization rows are ignored.
In `@pkg/services/node_pool.go`:
- Around line 272-279: After a successful hard delete in the node-pool deletion
path (when nodePool.DeletedTime != nil and tryHardDeleteNodePool returns true),
ensure we re-check/requeue the parent cluster using nodePool.OwnerID so the
cluster's hard-delete/status is recomputed now that the final child is gone;
update the same logic in the other occurrence (around lines 381-403) to also
trigger the parent cluster reconciliation instead of returning immediately (use
the existing cluster-reconciliation/enqueue helper on the service, e.g., the
method responsible for cluster status recomputation or enqueuing by ID).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: cdeffff4-37b5-409f-b5f2-8265a9d7ff7a
📒 Files selected for processing (17)
CHANGELOG.md, pkg/api/adapter_status_types.go, pkg/dao/adapter_status.go, pkg/dao/cluster.go, pkg/dao/mocks/cluster.go, pkg/dao/mocks/node_pool.go, pkg/dao/node_pool.go, pkg/services/adapter_status.go, pkg/services/aggregation.go, pkg/services/aggregation_test.go, pkg/services/cluster.go, pkg/services/cluster_test.go, pkg/services/node_pool.go, pkg/services/node_pool_test.go, pkg/services/status_helpers.go, test/integration/clusters_test.go, test/integration/node_pools_test.go
 if updateResult.RowsAffected == 0 {
-	return d.FindByResourceAndAdapter(ctx, adapterStatus.ResourceType, adapterStatus.ResourceID, adapterStatus.Adapter)
+	return existing, nil
 }
Differentiate stale report from concurrently deleted row.
At Line 98, RowsAffected == 0 unconditionally returns existing. That is correct for stale reports, but incorrect when the record was deleted between read and update. In that case this method reports success with a non-persisted object. Please branch explicitly: stale → return existing; missing row → re-read/recreate or return not-found.
The scenario CodeRabbit describes (record deleted between read and update) cannot happen in this code. The callers of Upsert in ProcessAdapterStatus hold a SELECT ... FOR UPDATE lock on the parent resource (cluster/nodepool) for the duration of the transaction. No concurrent transaction can delete the adapter status while this one is in progress.
@mliptak0, that's a valid point — the SELECT ... FOR UPDATE lock on the parent resource serializes all concurrent writes for the duration of the transaction, which makes the delete-between-read-and-update race impossible in this context. I've noted the locking strategy for future reviews.
📝 Learnings were identified and not saved (knowledge base disabled). Enable
func (d *sqlClusterDao) SaveStatusConditions(ctx context.Context, id string, statusConditions []byte) error {
	g2 := (*d.sessionFactory).New(ctx)
	result := g2.Model(&api.Cluster{}).Where("id = ?", id).Update("status_conditions", statusConditions)
	if result.Error != nil {
		db.MarkForRollback(ctx, result.Error)
		return result.Error
	}
	return nil
}
Handle zero-row status updates as not-found.
At Line 99, this returns success even when no row matches id (RowsAffected == 0). In hard-delete races, that silently drops a status write and hides state loss. Treat zero affected rows as not-found and roll back.
Suggested patch
func (d *sqlClusterDao) SaveStatusConditions(ctx context.Context, id string, statusConditions []byte) error {
g2 := (*d.sessionFactory).New(ctx)
result := g2.Model(&api.Cluster{}).Where("id = ?", id).Update("status_conditions", statusConditions)
if result.Error != nil {
db.MarkForRollback(ctx, result.Error)
return result.Error
}
+ if result.RowsAffected == 0 {
+ err := gorm.ErrRecordNotFound
+ db.MarkForRollback(ctx, err)
+ return err
+ }
return nil
 }

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
func (d *sqlClusterDao) SaveStatusConditions(ctx context.Context, id string, statusConditions []byte) error {
	g2 := (*d.sessionFactory).New(ctx)
	result := g2.Model(&api.Cluster{}).Where("id = ?", id).Update("status_conditions", statusConditions)
	if result.Error != nil {
		db.MarkForRollback(ctx, result.Error)
		return result.Error
	}
	if result.RowsAffected == 0 {
		err := gorm.ErrRecordNotFound
		db.MarkForRollback(ctx, err)
		return err
	}
	return nil
}
False positive for the same reason as the Upsert comment.
In the ProcessAdapterStatus flow (the code path added by this PR), the cluster is locked with SELECT ... FOR UPDATE at the start of the transaction, preventing concurrent hard-delete. For the existing UpdateClusterStatusFromAdapters path (which doesn't hold a lock), silently dropping a status-condition write for a concurrently hard-deleted cluster is harmless — the cluster no longer exists, so persisting its conditions serves no purpose.
@mliptak0, understood — that reasoning is sound. The SELECT ... FOR UPDATE lock in ProcessAdapterStatus eliminates the race in the new code path, and in UpdateClusterStatusFromAdapters a silent no-op on an already-hard-deleted cluster is semantically correct. I'll retract the comment.
📝 Learnings were identified and not saved (knowledge base disabled). Enable
existing, findErr := s.adapterStatusDao.FindByResourceAndAdapter(
	ctx, adapterStatus.ResourceType, adapterStatus.ResourceID, adapterStatus.Adapter,
)
if findErr != nil {
	existing = nil
}
Propagate unexpected lookup failures before upserting.
This treats every FindByResourceAndAdapter error as “not found”. If that read fails for any other reason, the code still proceeds into Upsert, which hides the DAO failure and can take the wrong branch for freshness handling. Only normalize an actual miss to nil; return other errors immediately.
func allAdaptersFinalized(requiredAdapters []string, adapterStatuses api.AdapterStatusList) bool {
	finalizedAdapters := make(map[string]struct{})

	for _, adapterStatus := range adapterStatuses {
		if adapterStatus.IsFinalized() {
			finalizedAdapters[adapterStatus.Adapter] = struct{}{}
		}
	}

	for _, requiredAdapter := range requiredAdapters {
		if _, exists := finalizedAdapters[requiredAdapter]; !exists {
			return false
		}
	}
	return true
}
Require Finalized=True at the current generation.
allAdaptersFinalized only checks IsFinalized(). After soft-delete bumps the resource generation, an older Finalized=True row still counts here, so cluster/nodepool hard-delete can run before every required adapter has finalized the delete generation. This helper should filter on the current generation, the same way computeReconciled does.
Possible fix
-func allAdaptersFinalized(requiredAdapters []string, adapterStatuses api.AdapterStatusList) bool {
+func allAdaptersFinalized(
+ requiredAdapters []string,
+ currentGeneration int32,
+ adapterStatuses api.AdapterStatusList,
+) bool {
finalizedAdapters := make(map[string]struct{})
for _, adapterStatus := range adapterStatuses {
- if adapterStatus.IsFinalized() {
+ if adapterStatus.ObservedGeneration == currentGeneration && adapterStatus.IsFinalized() {
finalizedAdapters[adapterStatus.Adapter] = struct{}{}
}
 	}

You'll also need to pass the resource generation from the cluster/nodepool hard-delete call sites.
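A runnable sketch of the generation-aware check, using a trimmed stand-in for api.AdapterStatus (the field names ObservedGeneration and Finalized are assumptions): stale Finalized rows from an older generation no longer satisfy the requirement.

```go
package main

import "fmt"

// AdapterStatus is a minimal stand-in for the real api.AdapterStatus;
// field names are assumed for illustration.
type AdapterStatus struct {
	Adapter            string
	ObservedGeneration int32
	Finalized          bool
}

func (a AdapterStatus) IsFinalized() bool { return a.Finalized }

// allAdaptersFinalized, generation-aware: only finalizations observed at the
// current generation count, so Finalized=True rows recorded before the
// soft-delete generation bump are ignored.
func allAdaptersFinalized(required []string, currentGeneration int32, statuses []AdapterStatus) bool {
	finalized := make(map[string]struct{})
	for _, s := range statuses {
		if s.ObservedGeneration == currentGeneration && s.IsFinalized() {
			finalized[s.Adapter] = struct{}{}
		}
	}
	for _, r := range required {
		if _, ok := finalized[r]; !ok {
			return false
		}
	}
	return true
}

func main() {
	statuses := []AdapterStatus{
		{Adapter: "infra", ObservedGeneration: 1, Finalized: true}, // stale generation
		{Adapter: "dns", ObservedGeneration: 2, Finalized: true},
	}
	fmt.Println(allAdaptersFinalized([]string{"infra", "dns"}, 2, statuses)) // false
	fmt.Println(allAdaptersFinalized([]string{"dns"}, 2, statuses))         // true
}
```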
if nodePool.DeletedTime != nil {
	hardDeleted, hdErr := s.tryHardDeleteNodePool(ctx, nodePoolID, conditions, updatedStatuses)
	if hdErr != nil {
		return nil, hdErr
	}
	if hardDeleted {
		return upsertedStatus, nil
	}
Re-check the owning cluster after hard-deleting the last node pool.
Once tryHardDeleteNodePool succeeds, this flow returns immediately. If the parent cluster was already soft-deleted and its adapters had finalized earlier, nothing reevaluates that cluster when the final child disappears, so it can stay stranded until another cluster status update happens. The new cluster integration test currently works around this by reposting a cluster adapter after nodepool deletion. Use nodePool.OwnerID to trigger the parent cluster’s hard-delete/status recomputation here.
Also applies to: 381-403
The gap is real: after the last nodepool is hard-deleted, nothing automatically triggers the cluster's hard-delete check. The integration test covers this, though, by manually re-posting a cluster adapter status.
However, fixing this in this PR is a heavy lift because NodePoolService doesn't have a reference to ClusterService — adding one creates a cross-service dependency and a potential circular dependency. The current design is event-driven: Sentinel periodically triggers adapter status updates, so the cluster will eventually get re-evaluated. This is probably a good case to cover with e2e tests.
@mliptak0, that makes sense — the circular-dependency concern is a real constraint, and the Sentinel-driven eventual consistency model is a reasonable trade-off here. The integration test's manual re-post approach is a fair workaround for now.
Would you like me to open a follow-up GitHub issue to track the eventual-consistency gap (cluster hard-delete not triggered automatically after last node pool is removed) so it's not lost? It could also serve as a reminder to add E2E coverage for that scenario.
func (d *sqlNodePoolDao) ExistsByOwner(ctx context.Context, ownerID string) (bool, error) {
	g2 := (*d.sessionFactory).New(ctx)
	var count int64
	if err := g2.Model(&api.NodePool{}).Where("owner_id = ?", ownerID).Limit(1).Count(&count).Error; err != nil {
It could be simplified using raw SQL: SELECT EXISTS(SELECT 1 FROM node_pools WHERE owner_id = ?)
// If already marked, it returns the cluster unchanged. Cascades the deletion timestamp to all child nodepools.
// Actual removal is handled by adapters detecting the new generation and triggering hard deletion asynchronously.
func (s *sqlClusterService) SoftDelete(ctx context.Context, id string) (*api.Cluster, *errors.ServiceError) {
	cluster, err := s.clusterDao.Get(ctx, id)
This isn't part of this PR, but since we have a Save() method which updates from the model, would it be better to use GetForUpdate here?
}

func (s *sqlNodePoolService) SoftDelete(ctx context.Context, id string) (*api.NodePool, *errors.ServiceError) {
	nodePool, err := s.nodePoolDao.Get(ctx, id)
Same here, maybe GetForUpdate would be better?
implement hard deletion for clusters and nodepools
Summary
Test Plan
make test-all passes
make lint passes
make test-helm (if applicable)

Summary by CodeRabbit
New Features
Changed