NERC Allocation Revocation Workflow #28
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,72 @@ | ||
| ### Summary | ||
|
|
||
| This proposal defines a standardized workflow for handling resource allocations in ColdFront that transition into a **Revoked** status. It establishes clear timelines for access termination, data retention grace periods, and administrative overrides, combining automated lifecycle management with manual oversight where necessary. | ||
|
|
||
| ### Motivation | ||
|
|
||
| Currently, the transition from an active allocation to total resource deletion requires a manual and granular approach. We need to: | ||
|
|
||
| - **Automate lifecycle management** to simplify administration. | ||
| - **Provide clear communication** to users regarding data deletion deadlines. | ||
| - **Support "Special Case" scenarios** where data must be preserved for legal or security reasons without charging the user or allowing access. | ||
|
|
||
| ### User Stories | ||
|
|
||
| - **As an Admin,** I want the system to automatically revoke access when an allocation reaches its end date (expires). | ||
| - **As a PI,** I want to receive multiple notifications before my allocation expires, access is revoked, and storage is deleted. | ||
| - **As an Admin,** I want to "suspend" an allocation (Special Case: No Deletion) so that I can investigate an incident without the risk of automated scripts deleting the evidence. | ||
| - **As an Admin,** I want to manually override revocation timings to accommodate valid user appeals or policy exceptions. | ||
|
|
||
| ### Proposal | ||
|
|
||
| #### 1. The Expiration / Revocation Lifecycle | ||
|
|
||
| Each allocation has an end date, at which point it automatically enters **"Active (Needs Renewal)"** status. | ||
|
|
||
| The transition to **Revoked** occurs automatically 30 days after an allocation enters **"Active (Needs Renewal)"**. It can also be triggered manually by an admin. | ||
|
|
||
| | **Phase** | **Duration** (Days) | **System Actions** | **User Impact** | | ||
| |--------------------------|---------------------|------------------------------------------------------------------|-------------------------------------------------------------------| | ||
| | **Renewal Grace Period** | 0-30 | Status changed to **Active (Needs Renewal)**. Notification sent. | No impact. | | ||
|
Once an allocation's status changes to Active (Needs Renewal), the user must follow up with an administrator. At this stage, an administrator must manually set the status back to Active; only then does ColdFront allow the user to submit change requests. Please ensure that when the admin updates the allocation status, the End Date is also extended by one year (or some other extension period?).
Author
Yes, that is correct. |
||
| | **Revocation Trigger** | 30 | Status changed to **Revoked**. Notification sent. | All cluster access disabled. Compute and networking stopped. | | ||
| | **Storage Grace Period** | 30-50 | Storage retained. | Access remains blocked. Possibility of data recovery if requested. | | ||
| | **Storage Deletion** | 50 | Storage deleted. | | | ||
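As a rough illustration of the table above, the phase of an allocation can be derived from the number of days since its end date. This is only a sketch: the function name `lifecycle_phase` and the constants are hypothetical, not part of ColdFront, and treating each boundary day (30 and 50) as the start of the next phase is an assumption the table does not spell out.

```python
from datetime import date

# Durations taken from the table; names are hypothetical.
RENEWAL_GRACE_DAYS = 30   # days 0-30: Active (Needs Renewal)
STORAGE_GRACE_DAYS = 20   # days 30-50: Revoked, storage retained


def lifecycle_phase(end_date: date, today: date) -> str:
    """Return the table phase for an allocation, counted from its end date."""
    days = (today - end_date).days
    if days < 0:
        return "Active"  # end date not reached yet
    if days < RENEWAL_GRACE_DAYS:
        return "Renewal Grace Period"
    if days < RENEWAL_GRACE_DAYS + STORAGE_GRACE_DAYS:
        # Assumption: day 30 itself (the revocation trigger) starts this phase.
        return "Storage Grace Period"
    return "Storage Deletion"
```

A manual override (section 4) would bypass this calculation, so a real implementation would presumably check for an explicit status change first.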
|
|
||
| For OpenStack, revocation is implemented by deleting all VMs and networking objects and switching the project status to disabled to prevent further access. Object storage and volumes are preserved. After the storage grace period, the remaining storage resources are deleted. | ||
|
|
||
| For OpenShift, revocation is implemented by deleting all Pods, Deployments, Jobs, CronJobs and other resources that pertain to compute or networking. Persistent Volumes, ConfigMaps and Secrets are preserved during the storage grace period. After the storage grace period, those resources are deleted too. | ||
|
Including the namespace?
Author
The namespace would be the very last thing that gets deleted, after the storage grace period. We can also preserve the namespace, if you prefer that. Are there any reasons in particular for preserving it?
Contributor
@knikolla, how does invoicing use namespaces? Will deleting it mid-month cause any issues with the invoice script?
Author
I don't see how there will be any issues, but just to triple check: @naved001 would deleting a namespace have any effect on collection of metrics up to its point of deletion?
Contributor
@knikolla I don't think it should matter. In my mind it's the same as how a pod's metrics remain (up to the retention period) even after the pod object is deleted. It should be the same for all the pods in a namespace that gets deleted. And we also ship off metrics every day to S3 anyway. That being said, I will do a quick test and update this comment - can't be too careful with billing stuff.
Contributor
Okay, I tested this and I can say it's safe to delete the namespace. 4 hours after deleting the namespace, I can query Prometheus and get the metrics for the pods in the deleted namespace.
Author
Thanks for checking! |
||
|
|
||
| #### 2. Notification Schedule | ||
|
|
||
| To ensure data integrity, users will receive automated reminders during the 20-day storage grace period: | ||
|
||
|
|
||
| - **Initial:** Upon entering Revoked status. | ||
| - **Reminder:** 7 days before deletion. | ||
| - **Final Warning:** 2 days before deletion. | ||
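To make the schedule above concrete, here is a minimal sketch that computes the notification dates from the day an allocation enters Revoked status. The function name `notification_dates` and the constants are hypothetical; the 20-day grace period and the 7-day and 2-day offsets come from the proposal text.

```python
from datetime import date, timedelta

STORAGE_GRACE_DAYS = 20  # storage grace period from the proposal
# Days before deletion at which each reminder is sent (from the list above).
REMINDER_OFFSETS = {"Reminder": 7, "Final Warning": 2}


def notification_dates(revoked_on: date) -> dict[str, date]:
    """Map each notification name to the date it would be sent."""
    deletion = revoked_on + timedelta(days=STORAGE_GRACE_DAYS)
    schedule = {"Initial": revoked_on}  # sent upon entering Revoked status
    for name, days_before in REMINDER_OFFSETS.items():
        schedule[name] = deletion - timedelta(days=days_before)
    schedule["Deletion"] = deletion  # not a notification, included for reference
    return schedule
```

For an allocation revoked on 1 March, this places the reminder on 14 March, the final warning on 19 March, and deletion on 21 March.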
|
||
|
|
||
| #### 3. Special Case: Administrative Hold (No Deletion) | ||
|
|
||
| For legal holds or security incidents, a new behavior is proposed. | ||
|
|
||
| - **Status:** Switches to a new "Suspended" status. | ||
| - **Access:** Only admins retain access to compute/storage; user access is removed. | ||
| - Admins will need to manually decide whether to stop VMs/pods/networking on a case-by-case basis. | ||
| - **Billing:** Billing is disabled. | ||
| - **Persistence:** Storage and compute are retained pending admin action or the suspension being lifted. | ||
|
|
||
| #### 4. Manual Overrides | ||
|
|
||
| Admins have the authority to: | ||
|
|
||
| - Extend end dates of projects manually. | ||
| - Revoke or reinstate projects ahead of time by switching their status. | ||
|
|
||
| ### Drawbacks | ||
|
|
||
| - **Storage Costs:** Maintaining a 20-day grace period for all revoked allocations increases storage overhead. | ||
| - **Complexity:** Selectively removing only specific resources from an allocation while preserving data is more complex than deleting the entire OpenStack project or OpenShift namespace. | ||
|
|
||
| ### Alternatives | ||
|
|
||
| - **Keep suspended (administrative hold) projects in Active state:** Risks confusion and makes these projects harder to distinguish in billing. | ||
|
||
| - **Continue with Manual Lifecycle Management:** Continuing to rely on admins to manually move states and clean up allocations. Not scalable for high-volume environments. | ||
The original version was, "At 30 days before End Date, the allocation changes to Active (Needs Renewal)". That would align with the date that the PI sets when they create the allocation. Is that what is still happening?
That way they will be able to renew themselves until the status is revoked.
@Milstein discussed with Kristi, Quan, Kim.
Proposal:
1. Starting 30 days before the end date, they see a button that says "Expires in: " (we will no longer show Active (Needs Renewal)).
2. At the end of those 30 days, the allocation goes to Expired: VMs and/or pods are turned off, and access for the PI and team is turned off. Admin action is required to turn it back on. They will likely lose state.
3. At the end of 30 days in Expired status, we switch to Revoked and delete storage and other resources.
This approach returns us to the normal ColdFront period. Kristi plans to rewrite parts of the proposal to capture this.
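For reference, the revised three-step flow in this comment could be sketched roughly as follows. This is an illustration under assumptions, not the agreed implementation: `revised_status` is a hypothetical name, the status strings are taken from the comment, and the exact handling of the boundary days is a guess.

```python
from datetime import date
from typing import Optional, Tuple

WARNING_DAYS = 30   # countdown shown starting 30 days before the end date
EXPIRED_DAYS = 30   # time spent in Expired before switching to Revoked


def revised_status(end_date: date, today: date) -> Tuple[str, Optional[int]]:
    """Return (status, days until expiry to display, or None)."""
    days_left = (end_date - today).days
    if days_left > WARNING_DAYS:
        return ("Active", None)
    if days_left > 0:
        # Still Active; the UI would show the "Expires in" countdown.
        return ("Active", days_left)
    if days_left > -EXPIRED_DAYS:
        # Compute stopped and access turned off; admin action needed to restore.
        return ("Expired", None)
    # Storage and other resources are deleted at this point.
    return ("Revoked", None)
```

Note that in this version the allocation expires on the end date itself, rather than 30 days after it as in the original table.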