docs: add Kessel/ReBAC OCP integration high-level design#5887
docs: add Kessel/ReBAC OCP integration high-level design#5887jordigilh wants to merge 4 commits intoproject-koku:mainfrom
Conversation
Summary of ChangesHello @jordigilh, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces the foundational high-level design for integrating Kessel, Red Hat's Relationship-Based Access Control (ReBAC) platform, into Koku. This integration specifically targets OpenShift (OCP) providers in on-premise deployments, aiming to enhance access control granularity beyond the existing RBAC system. The design outlines a strategic two-phase rollout, beginning with basic wildcard permissions and progressing to sophisticated resource-specific access. A key enabler for on-premise functionality is the introduction of a dedicated deployment flag that intelligently adapts Koku's behavior, such as disabling external feature flag services, to suit environments without SaaS connectivity. This change significantly improves Koku's adaptability for diverse deployment scenarios. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Add the initial HLD for Kessel/ReBAC integration with the OCP provider for on-premise deployments. This document covers system architecture, sequence diagrams, phased rollout plan, and operational considerations. Placed under docs/architecture/kessel-integration/ to establish a namespace for future kessel integration documentation. Co-authored-by: Cursor <cursoragent@cursor.com>
6b992b0 to
90d6adf
Compare
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive high-level design for Kessel/ReBAC integration and adds the necessary infrastructure for on-premise deployments by allowing the feature flag client to be disabled. The documentation is extensive and well-structured. However, I've found a critical issue in the implementation of the DisabledUnleashClient where the method signature is incorrect, which will lead to runtime errors. Additionally, there are several broken or soon-to-be-outdated links in the new documentation that should be addressed to ensure its long-term value and usability.
I am having trouble creating individual review comments. Click here to see my feedback.
koku/koku/feature_flags.py (66-79)
The signature and implementation of is_enabled in DisabledUnleashClient are inconsistent with the UnleashClient it's meant to mock, and also with the documentation in this PR.
- Incorrect Signature: The method is missing the
default=Falseparameter, which is present in the realUnleashClient. This will cause aTypeErrorfor any calls that usedefaultas a keyword argument, such as the example indocs/architecture/feature-flags.md. - Incorrect Logic: The implementation should return the
defaultvalue whenfallback_functionis not provided, but it currently hardcodesFalse.
This is a critical issue as it breaks the client's interface contract.
I suggest updating the method to have a compatible signature and logic. You will also need to update the tests in koku/koku/test_feature_flags.py to reflect this correction.
def is_enabled(self, feature_name, context=None, fallback_function=None, default=False):
"""Check if a feature is enabled.
In disabled mode, this will use the fallback function if provided,
otherwise it will return the default value. This mimics the behavior
of an uninitialized Unleash client.
Args:
feature_name (str): Name of the feature flag.
context (dict): Context dict.
fallback_function (callable): Function to determine feature state.
default (bool): Default value to return.
Returns:
bool: Result of fallback_function or the default value.
"""
if fallback_function:
return fallback_function(feature_name, context or {})
return default
docs/architecture/kessel-integration/kessel-ocp-integration.md (552)
This document contains several links to kessel-only-role-provisioning.md and kessel-ocp-implementation-guide.md, which are not included in this pull request. This will result in broken links in the documentation. Please either add these files to the PR or remove the links if they are intended for future work. This comment applies to this line and many other similar links throughout the document.
docs/architecture/feature-flags.md (92)
The source code reference link appears to be pointing to an incorrect line range. The DisabledUnleashClient is implemented starting at line 55 in koku/koku/feature_flags.py. Please update the link to point to the correct lines to avoid confusion for future readers.
docs/architecture/feature-flags.md (396)
This hardcoded commit hash will become outdated once this pull request is merged. It's generally better to reference the pull request itself or remove the commit reference to avoid the need for future maintenance.
|
Preview of the document with the diagrams rendered can be seen in the origin branch: |
| |---|-----------|------|----------------| | ||
| | 1 | **Authorization Adapter** | NEW | Permission checks via Kessel (replaces RBAC API in on-premise) | | ||
| | 2 | **Resource Reporter** | NEW | Syncs Koku resources (clusters, nodes, projects) to Kessel | | ||
| | 3 | **Sources API** (`provider_builder.py`) | Modified | Dual-write: Store in Postgres **AND** report to Kessel | |
There was a problem hiding this comment.
Pls note this is going to be externalized by insights-onprem#6 and be used as the on-prem API for managing sources/integration in koku.
There was a problem hiding this comment.
Acknowledged — the Kessel integration hooks into ProviderBuilder (specifically create_provider_from_source and destroy_provider), which is called from the Sources flow. As the Sources API is externalized via PR #6, the Kessel resource reporting calls in ProviderBuilder will need to be preserved in the new integration path. The HLD's Sources API Integration section documents these touchpoints.
| ``` | ||
| AbstractAuthorizationBackend | ||
| ├── RBACAuthorizationBackend (existing system) | ||
| └── KesselAuthorizationBackend (new Kessel integration) |
There was a problem hiding this comment.
IIUC, the purpose is to replace the existing system used in SaaS with Kessel integration.
Only then, we can use the kessel based-authorization in the on-prem.
There was a problem hiding this comment.
Correct — for on-prem deployments, Kessel replaces the SaaS RBAC service entirely. The authorization service abstraction in the HLD (Section 8.2) handles this: it returns a KesselAuthorizationBackend when ONPREM=True and the existing RBACAuthorizationBackend for SaaS. This lets us adopt Kessel for on-prem without affecting the SaaS path.
lcouzens
left a comment
There was a problem hiding this comment.
There is a lot to go over here, I probably need to read it again tbh. But I left some initial thoughts.
| - `rbac/role:cost-openshift-viewer` | ||
| - `rbac/role:cost-price-list-administrator` | ||
| - `rbac/role:cost-price-list-viewer` | ||
| - `rbac/role:cost-cloud-viewer` |
There was a problem hiding this comment.
This one is not needed for On prem initially. Cloud here would be for AWS, Azure etc but we are only supporting OCP.
There was a problem hiding this comment.
These roles are taken from the SaaS schema, I'd rather leave them as such and have a single source of truth for both onPrem and SaaS instead of creating a separate schema for each instance and drifting in the future. Thoughts?
|
|
||
| ### 3. Provider Creation: Resource Synchronization | ||
|
|
||
| **Scenario**: User creates new OCP provider, triggering resource sync to Kessel. |
There was a problem hiding this comment.
I guess this is a phase 2 question, but how are we planning to sync Node/Namespace resources?
There was a problem hiding this comment.
good point, it should register all 3 resources: cluster, node and namespace.
There was a problem hiding this comment.
Done — all 3 resource types are now registered at provider creation. See Provider Creation: Resource Synchronization and Resource Reporter.
|
|
||
| --- | ||
|
|
||
| ### 6. Provider Deletion: Resource Cleanup |
There was a problem hiding this comment.
Should we complete a cascade delete on the koku side first then send the delete request to kessel afterwards? That way we confirm the delete completes successfully in koku before deleting anything else.
There was a problem hiding this comment.
Makes sense. I'll update the doc to reflect that flow
There was a problem hiding this comment.
Done — deletion order is now Koku/Postgres first, then Kessel cleanup. See Provider Deletion: Resource Cleanup and Sources API Integration.
|
|
||
| **Q1: What happens when Kessel is unavailable?** | ||
| - **Decision Needed**: Fail open (allow all) or fail closed (deny all)? | ||
| - **Recommendation**: Fail closed with cached permissions (if available) |
|
|
||
| **Q2: Should we cache Kessel permission checks?** | ||
| - **Decision Needed**: Cache TTL and invalidation strategy | ||
| - **Recommendation**: 30-second TTL (matching current RBAC cache) |
| 2. Organization admin | ||
| 3. No default owner (explicit assignment required) | ||
| - **Recommendation**: User who creates provider | ||
| - **Rationale**: Matches intuitive ownership model |
There was a problem hiding this comment.
So the user creating it makes sense but the org admin would also have full access to it too.
There was a problem hiding this comment.
yes, I suspect that the subject relationship: user A belongs to group B, which is managed by admin group C, for which user D belongs to should apply
There was a problem hiding this comment.
Agreed — org admins should have full access by default. This is already covered by the existing 3-tier model: org admins are assigned the Cost Administrator role, which has cost_management_openshift_cluster_all permission at the tenant level. That grants access to all clusters in the organization, including any newly created ones.
So the flow is: the creating user gets explicit owner on the cluster, and org admins get access implicitly through their admin role binding on the tenant. No additional explicit relationship needed.
| 1. Lazy sync during first query | ||
| 2. Background Celery job | ||
| 3. Event-driven sync during data processing | ||
| - **Recommendation**: Start with lazy sync (Option 1) |
There was a problem hiding this comment.
This needs some thought, we probably need a combination tbh. Since nodes and projects could change between each upload cycle. We likely need dedicated tables on the cost side that are up to date during runtime (pipeline processing) then get synced to kessel with an event.
There was a problem hiding this comment.
Agreed — pipeline-driven sync is the right approach here. Nodes and namespaces are naturally discovered during data ingestion (the Trino summarization step), so that's the natural integration point rather than the query path or a separate periodic scan.
The plan would be:
- During the data processing pipeline, when new nodes/namespaces are discovered, record them in a dedicated tracking table (e.g.,
kessel_synced_resources) and sync to Kessel. - The tracking table gives us idempotency (don't re-sync what's already synced) and a built-in reconciliation mechanism (query for
synced = falseto find anything that failed and needs retry). - Kessel sync remains non-blocking — if Kessel is down during ingestion, mark as pending and retry on the next cycle.
I've updated the document to make pipeline-driven sync the recommended approach, with the tracking table as the enabler for reconciliation. This also addresses Q6 (bulk sync) since the tracking table naturally supports it.
| - **Rationale**: Simpler, no new background jobs | ||
|
|
||
| **Q6: Should we support bulk resource synchronization?** | ||
| - **Use Case**: Operator adds Kessel to existing Koku deployment with many clusters |
There was a problem hiding this comment.
I dont think this is necessary but could be a nice to have depending how we implement Q5
There was a problem hiding this comment.
My thinking is when we deploy a new instance of CM in a cluster that has already been provisioned with Kessel and we want to sync it. I'd guess that new installations with no previous ReBAC server available will not have such challenges.
We can deliver a version that only supports new installations and we will support the sync in a future release on as you mentioned.
There was a problem hiding this comment.
Agreed — this is effectively addressed by the pipeline-driven sync approach. The kessel_synced_resources tracking table we've added to the design gives us built-in reconciliation: any resource with kessel_synced = false is automatically retried on the next processing cycle.
For the "add Kessel to an existing Koku deployment" scenario, the existing kessel_sync_resources management command handles bulk resync. That said, Phase 1 targets new installations only — the migration/sync story for existing deployments will come in a future release.
|
|
||
| **Q7: How to handle Kessel schema version updates?** | ||
| - **Decision Needed**: Schema migration strategy | ||
| - **Recommendation**: Follow Red Hat platform schema versioning |
There was a problem hiding this comment.
I'm not sure what we actually need to be handling here.
There was a problem hiding this comment.
We need to investigate how does schema updates impact an existing ReBAC: if the schema adds new relationships/roles then the migration is straightforward. But if we have breaking changes, we need to define what to do with these. Hopefully these changes are incremental and additive but also have to consider when they are otherwise.
There was a problem hiding this comment.
Right that makes sense. In general all of the changes around this should be additive to avoid any breaking changes.
There was a problem hiding this comment.
Added a concrete Schema Upgrade Strategy section to the HLD (under Operational Considerations). The approach:
- Additive-only policy — new releases can add definitions/relations/permissions but never remove or rename, so existing SpiceDB tuples stay valid.
- Schema version tracking via
KESSEL_SCHEMA_VERSIONsetting. - Management command
kessel_update_schemathat checks the version, applies the newschema.zed, and seeds any new roles incrementally. - Helm integration — runs as a post-upgrade hook, similar to Django
migrate.
This keeps upgrades safe and idempotent. See the new section for the full workflow diagram.
| - Relations: `org` (tenant), `viewer` (principals/groups), `cluster` (parent cluster) | ||
| - Permission: `view` (inherits from parent cluster) | ||
|
|
||
| **Full ZED schema definitions**: See [kessel-ocp-implementation-guide.md](./kessel-ocp-implementation-guide.md#phase-2-zed-schema-definitions) |
There was a problem hiding this comment.
Should this link to some more details?
There was a problem hiding this comment.
Good catch — the supporting document exists but wasn't included in this PR. I'll add it in the next push.
There was a problem hiding this comment.
Done — the implementation guide is now included in this PR. See kessel-ocp-implementation-guide.md.
|
|
||
| **Solution**: Koku provides **role seeding tool** to simplify role instance creation for operators. | ||
|
|
||
| **See details**: [kessel-only-role-provisioning.md](./kessel-only-role-provisioning.md) |
There was a problem hiding this comment.
Should this link to some more details?
There was a problem hiding this comment.
Good catch — the supporting document exists but wasn't included in this PR. I'll add it in the next push.
There was a problem hiding this comment.
Done — the role provisioning strategy document is now included in this PR. See kessel-only-role-provisioning.md.
Add implementation guide and role provisioning strategy as companion documents to the Kessel OCP integration HLD. Update all three docs to align with PR feedback: rename ONPREM setting, fix method signatures, correct deletion order, add pipeline-driven sync references, and align code examples with Koku conventions (imports, logging, test patterns). Co-authored-by: Cursor <cursoragent@cursor.com>
Add Schema Upgrade Strategy section covering additive-only policy, version tracking, kessel_update_schema management command, and Helm post-upgrade hook integration. Also fix broken access-management-api cross-doc link (replaced with TBD note for Phase 1.5). Co-authored-by: Cursor <cursoragent@cursor.com>
…lifecycle) - Update kessel-ocp-integration.md to match FLPATH-3294 implementation: - Authorization: StreamedListObjects + Check (Inventory API v1beta2), not Check-only - Resource lifecycle: ReportResource/DeleteResource (Inventory gRPC) + t_workspace tuples (Relations REST) - Component paths: koku_rebac, KesselAccessProvider; middleware populates access map - Error handling: 424 Failed Dependency (not 503) - Phase 1.5 (Access Management API/UI) marked not implemented - Tracking model: KesselSyncedResource; config: KESSEL_INVENTORY_*, KESSEL_RELATIONS_URL - Fix broken links; point to detailed design and zed-schema-upstream-delta - Add kessel-hld-gaps-and-updates.md documenting all gaps and recommended HLD edits Made-with: Cursor
| Reporter->>Kessel: POST t_workspace tuple — Relations API REST | ||
| Kessel-->>Reporter: OK | ||
|
|
||
| Note over Reporter: Nodes/projects synced similarly;<br/>on-prem has no CDC, so Koku writes tuples directly |
There was a problem hiding this comment.
seems the ; prevents this sequence diagram from being rendered successfully
|
Closing this PR — all documents from this branch have been superseded by the implementation PR #5933 (
The |
Remove 4 ephemeral/working documents from kessel-integration/: - kessel-handover-session-2.md (handoff session notes) - FLPATH-3319-description.md (Jira ticket body copy) - kessel-ocp-test-plan-retired.md (dead retired test scenarios) - kessel-hld-gaps-and-updates.md (working notes, all applied to HLD) Fix broken links: - Redirect 6 kessel-ocp-implementation-guide.md links in HLD to kessel-ocp-detailed-design.md (which supersedes the impl guide) - Remove stale kessel-hld-gaps-and-updates.md references from HLD and detailed design - Replace kessel-ocp-test-plan-retired.md links in test plan with inline notes Closes PR project-koku#5887 which is fully superseded by PR project-koku#5933. Made-with: Cursor
Remove 4 ephemeral/working documents from kessel-integration/: - kessel-handover-session-2.md (handoff session notes) - FLPATH-3319-description.md (Jira ticket body copy) - kessel-ocp-test-plan-retired.md (dead retired test scenarios) - kessel-hld-gaps-and-updates.md (working notes, all applied to HLD) Fix broken links: - Redirect 6 kessel-ocp-implementation-guide.md links in HLD to kessel-ocp-detailed-design.md (which supersedes the impl guide) - Remove stale kessel-hld-gaps-and-updates.md references from HLD and detailed design - Replace kessel-ocp-test-plan-retired.md links in test plan with inline notes Closes PR project-koku#5887 which is fully superseded by PR project-koku#5933. Made-with: Cursor
Remove 4 ephemeral/working documents from kessel-integration/: - kessel-handover-session-2.md (handoff session notes) - FLPATH-3319-description.md (Jira ticket body copy) - kessel-ocp-test-plan-retired.md (dead retired test scenarios) - kessel-hld-gaps-and-updates.md (working notes, all applied to HLD) Fix broken links: - Redirect 6 kessel-ocp-implementation-guide.md links in HLD to kessel-ocp-detailed-design.md (which supersedes the impl guide) - Remove stale kessel-hld-gaps-and-updates.md references from HLD and detailed design - Replace kessel-ocp-test-plan-retired.md links in test plan with inline notes Closes PR project-koku#5887 which is fully superseded by PR project-koku#5933. Made-with: Cursor
Summary
docs/architecture/kessel-integration/to establish a dedicated namespace for kessel-related documentation as the design evolves.Document Overview
The HLD covers:
Motivation
This is the first step toward replacing the current RBAC-based authorization in on-premise Koku deployments with Kessel's Relationship-Based Access Control (ReBAC). The HLD needs review and approval before proceeding to detailed design and implementation.
Scope
KOKU_ONPREM_DEPLOYMENTflag)Review Request
Looking for feedback on:
Test Plan
Made with Cursor