72 changes: 72 additions & 0 deletions policies/0009-nerc-allocation-revocation.md
@@ -0,0 +1,72 @@
### Summary

This proposal defines a standardized workflow for handling resource allocations in ColdFront that transition into a **Revoked** status. It establishes clear timelines for access termination, data retention grace periods, and administrative overrides, ensuring automated lifecycle management and manual oversight if necessary.

### Motivation

Currently, the transition from an active allocation to total resource deletion requires a manual and granular approach. We need to:

- **Automate lifecycle management** to simplify administration.
- **Provide clear communication** to users regarding data deletion deadlines.
- **Support "Special Case" scenarios** where data must be preserved for legal or security reasons without charging the user or allowing access.

### User Stories

- **As an Admin,** I want the system to automatically revoke access when an allocation reaches its end date (expires).
- **As a PI,** I want to receive multiple notifications before my allocation is expired, access is revoked, and storage is deleted.
- **As an Admin,** I want to "suspend" an allocation (Special Case: No Deletion) so that I can investigate an incident without the risk of automated scripts deleting the evidence.
- **As an Admin,** I want to manually override revocation timings to accommodate valid user appeals or policy exceptions.

### Proposal

#### 1. The Expiration / Revocation Lifecycle

Each allocation has an end date, at which point it automatically enters **"Active (Needs Renewal)"** status.
Contributor

The original version was, "At 30 days before End Date, the allocation changes to Active (Needs Renewal)". That would align with the date that the PI sets when they create the allocation. Is that what is still happening?

Contributor

That way they will be able to renew themselves until the status is revoked.

Contributor

@Milstein discussed with Kristi, Quan, Kim.
Proposal:
1. 30 days before the end date, they see a button that says "Expires in: " (we will no longer show Active Needs Renewal).
2. At the end of 30 days it goes to expired, turned off VMs and/or pods, access to PI and teams turned off. Admin action to turn back on. They will likely lose state.
3. At the end of the 30 days in expired status we switch to revoked and delete storage and other resources.

  1. testing: 4 months of expiration happening without happening.
  2. manual method for approving revocation, at least at first.
  3. Will need active communication during testing and rollout plan warning folks about this change (folks have been ignoring them).

This approach returns us to the normal ColdFront period. Kristi plans to rewrite parts of the proposal to capture this.


The transition to **Revoked** occurs automatically 30 days after an allocation enters **"Active (Needs Renewal)"**. It can also be triggered manually by an admin.
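
As a rough illustration of the timeline in this section, the sketch below maps the number of days since an allocation entered Active (Needs Renewal) to a lifecycle phase. The day boundaries (30 and 50) come from the proposal; the `Phase` enum, the constant names, and `phase_for` are hypothetical and are not ColdFront APIs. It also assumes that a boundary day belongs to the later phase, which the proposal does not state explicitly.

```python
from enum import Enum

# Day boundaries taken from the proposal text; the names are hypothetical.
REVOCATION_DAY = 30
STORAGE_DELETION_DAY = 50


class Phase(Enum):
    RENEWAL_GRACE = "Renewal Grace Period"
    STORAGE_GRACE = "Storage Grace Period"
    STORAGE_DELETED = "Storage Deletion"


def phase_for(days_since_needs_renewal: int) -> Phase:
    """Map days since entering Active (Needs Renewal) to a lifecycle phase.

    Assumption: a boundary day (30 or 50) belongs to the later phase.
    """
    if days_since_needs_renewal < 0:
        raise ValueError("allocation has not reached its end date yet")
    if days_since_needs_renewal < REVOCATION_DAY:
        return Phase.RENEWAL_GRACE
    if days_since_needs_renewal < STORAGE_DELETION_DAY:
        # Status is Revoked; storage is still retained.
        return Phase.STORAGE_GRACE
    return Phase.STORAGE_DELETED
```

Manual overrides by an admin are not modeled here; this only covers the automatic schedule.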

| **Phase** | **Duration** (Days) | **System Actions** | **User Impact** |
|--------------------------|---------------------|------------------------------------------------------------------|-------------------------------------------------------------------|
| **Renewal Grace Period** | 0-30 | Status changed to **Active (Needs Renewal)**. Notification sent. | No impact. |

Once an allocation’s status changes to Active (Needs Renewal), the user must follow up with an administrator. At this stage, an administrator must manually update the status to Active; only then does ColdFront allow the user to submit change requests.

Please ensure that when the admin updates the allocation status, the End Date is extended by one year (or some extension period?) during this update.

Author

Yes, that is correct.

| **Revocation Trigger** | 30 | Status changed to **Revoked**. Notification sent. | All cluster access disabled. Compute and networking stopped. |
| **Storage Grace Period** | 30-50 | Storage retained. | Access remains blocked. Possibility of data recovery if requested. |
| **Storage Deletion** | 50 | Storage deleted. | |

For OpenStack, revocation is implemented by deleting all VMs and networking objects and switching the project status to disabled to prevent further access. Object storage and volumes are preserved. After the storage grace period, the remaining storage resources are deleted.

For OpenShift, revocation is implemented by deleting all Pods, Deployments, Jobs, CronJobs and other resources that pertain to compute or networking. Persistent Volumes, ConfigMaps and Secrets are preserved during the storage grace period. After the storage grace period, those resources are deleted too.
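
The split between what is removed at revocation and what is kept through the storage grace period can be sketched as a simple lookup. The resource kinds listed are the ones named in the two paragraphs above; the string labels, the table names, and `action_at_revocation` are hypothetical. Because the proposal's lists are not exhaustive ("other resources that pertain to compute or networking"), this sketch assumes that any unlisted kind is flagged for manual review rather than deleted.

```python
# Kinds the proposal says are deleted when an allocation is revoked.
# The string labels are illustrative, not real API identifiers.
REVOKE_DELETES = {
    "openstack": {"vm", "network"},
    "openshift": {"Pod", "Deployment", "Job", "CronJob"},
}

# Kinds the proposal says are preserved until the storage grace period ends.
STORAGE_GRACE_PRESERVES = {
    "openstack": {"object_storage", "volume"},
    "openshift": {"PersistentVolume", "ConfigMap", "Secret"},
}


def action_at_revocation(platform: str, kind: str) -> str:
    """Return "delete", "preserve", or "review" for a resource kind."""
    if kind in REVOKE_DELETES[platform]:
        return "delete"
    if kind in STORAGE_GRACE_PRESERVES[platform]:
        return "preserve"
    # Assumption: kinds the proposal does not name need a human decision.
    return "review"
```

For example, `action_at_revocation("openshift", "Secret")` returns `"preserve"`, while an unlisted kind such as `"Namespace"` returns `"review"`.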

Including the namespace?

Author

The namespace would be the very last thing that gets deleted, after the storage grace period. We can also preserve the namespace, if you prefer that. Are there any reasons in particular for preserving it?

Contributor

@knikolla, how does invoicing use namespaces? Will deleting it mid-month cause any issues with the invoice script?

Author

I don't see how there will be any issues, but just to triple check:

@naved001 would deleting a namespace have any effect on collection of metrics up to its point of deletion?

Contributor

@knikolla I don't think it should matter. In my mind it's the same as how a pod's metrics remain (up to the retention period) even after the pod object is deleted. It should be the same for all the pods in a namespace that gets deleted.

And we also ship off metrics every day to s3 anyway.

That being said I will do a quick test and update this comment - can't be too careful with billing stuff.

Contributor

Okay, I tested this and I can say it's safe to delete the namespace. 4 hours after deleting the namespace, I can still query Prometheus and get the metrics for the pods in the deleted namespace.

Author

Thanks for checking!


#### 2. Notification Schedule

To ensure data integrity, users will receive automated reminders during the 20-day storage grace period:

- **Initial:** Upon entering Revoked status.
- **Reminder:** 7 days before deletion.
- **Final Warning:** 2 days before deletion.
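
The schedule above reduces to date arithmetic relative to the revocation date. The following is a minimal sketch, assuming the 20-day storage grace period starts on the day the allocation enters Revoked status; `notification_dates` and its dictionary keys are hypothetical names, not part of ColdFront.

```python
from datetime import date, timedelta
from typing import Dict

# Length of the storage grace period, from the proposal.
STORAGE_GRACE_DAYS = 20


def notification_dates(revoked_on: date) -> Dict[str, date]:
    """Compute reminder dates and the deletion date from the revocation date."""
    deletion = revoked_on + timedelta(days=STORAGE_GRACE_DAYS)
    return {
        "initial": revoked_on,                           # upon entering Revoked
        "reminder": deletion - timedelta(days=7),        # 7 days before deletion
        "final_warning": deletion - timedelta(days=2),   # 2 days before deletion
        "deletion": deletion,
    }
```

For an allocation revoked on 2025-03-01, this gives a reminder on 2025-03-14, a final warning on 2025-03-19, and deletion on 2025-03-21.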

#### 3. Special Case: Administrative Hold (No Deletion)

For legal holds or security incidents, a new behavior is proposed.

- **Status:** Switches to a new **Suspended** status.
- **Access:** Only admins retain access to compute and storage; user access is removed.
- Admins will need to decide manually, on a case-by-case basis, whether to stop VMs, pods, and networking.
- **Billing:** Billing is disabled.
- **Persistence:** Storage and compute are retained pending admin action or the suspension being lifted.

#### 4. Manual Overrides

Admins have the authority to:

- Extend end dates of projects manually.
- Revoke or reinstate projects ahead of time by switching their status.

### Drawbacks

- **Storage Costs:** Maintaining a 20-day grace period for all revoked allocations increases storage overhead.
- **Complexity:** Selectively removing only specific resources from an allocation while preserving data is more complex than deleting the entire OpenStack project or OpenShift namespace.

### Alternatives

- **Keep suspended (administrative hold) projects in Active state:** This risks confusion and makes held projects harder to distinguish in billing.
- **Continue with Manual Lifecycle Management:** Relying on admins to manually change states and clean up allocations does not scale to high-volume environments.