fix: cve-2026-33815 and cve-2026-33816 in pgx (rhoai-3.4)#10
Closed
…g entry (opendatahub-io#720)

## Summary

- Fixes `model_base_url` conftest fixture which always used `items[0]` from the model catalog, ignoring the `model_id` parameter
- When multiple MaaS subscriptions exist, this caused a mismatch between the API key (scoped to one subscription) and the model URL (from a different subscription), resulting in 403 errors
- Now looks up the catalog entry matching `model_id` before falling back to constructing the URL from `gateway_url`

Closes: [RHOAIENG-57327](https://redhat.atlassian.net/browse/RHOAIENG-57327)

## Test plan

- [ ] Run e2e tests on a cluster with a single MaaS subscription (baseline: should pass as before)
- [ ] Run e2e tests on a cluster with multiple MaaS subscriptions (the failing scenario: should now pass)
- [ ] Verify `MODEL_NAME` env var override correctly selects the matching catalog entry's URL

🤖 Generated with [Claude Code](https://claude.com/claude-code)
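
The lookup order described in the summary can be sketched roughly as follows. This is a minimal illustration, not the actual conftest code: the function name `resolve_model_base_url`, the catalog entry keys `id` and `url`, and the shape of the fallback URL are all assumptions.

```python
def resolve_model_base_url(catalog_items: list[dict], model_id: str, gateway_url: str) -> str:
    """Pick the URL of the catalog entry that matches model_id.

    The "id" and "url" keys and the fallback path layout are hypothetical;
    the real catalog schema may differ.
    """
    for item in catalog_items:
        # Match on the requested model instead of blindly taking items[0],
        # which may belong to a different subscription than the API key.
        if item.get("id") == model_id and item.get("url"):
            return item["url"].rstrip("/")
    # No matching entry: fall back to building the URL from the gateway.
    return f"{gateway_url.rstrip('/')}/{model_id}"
```

With two catalog entries from different subscriptions, asking for the second model returns the second entry's URL rather than the first, which is the mismatch the fix targets.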
## Description

`deploy.sh` was waiting for the wrong webhook deployment and therefore got stuck even when the operator was ready.

## How Has This Been Tested?

`./scripts/deploy.sh --deployment-mode operator --operator-type rhoai` used to get stuck waiting for the webhook deployment, but works now.

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful messages.
- [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
  * Updated RHOAI deployment configuration to reference the correct webhook management component name. This change ensures the deployment process properly tracks component readiness and availability during installation. Health checks and status verification now correctly target the appropriate component in RHOAI environments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
opendatahub-io#659)

The validation script was checking for 'gateway-auth-policy' but the actual deployed AuthPolicy is named 'gateway-default-auth'. This caused false 'NotFound' warnings despite the AuthPolicy being correctly deployed and functional.

Changes:

- Update scripts/validate-deployment.sh line 383 to check for gateway-default-auth instead of gateway-auth-policy

Fixes opendatahub-io#658

## Description

## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful messages.
- [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Bug Fixes**
  * Corrected deployment validation to check the correct authentication policy resource.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
I created this Granite model, which runs on CPUs, as part of my demo and wanted to contribute it back.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Added sample deployment docs for Granite 3.1 8B delivered via Red Hat model car OCI and updated the available models list with this option.
* **New Features**
  * Added a ready-to-deploy Granite 3.1 8B sample including pre-configured model service, authentication policy, access controls, and a token rate limit (10,000 tokens/min).

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…tion and MaaSAuthPolicy (opendatahub-io#714)

## Description

This PR improves status reporting for `MaaSSubscription` and `MaaSAuthPolicy` resources to reflect real reconciliation and dependency health. Previously, these resources could show "Active" or empty status when underlying dependencies (MaaSModelRefs, TokenRateLimitPolicies, AuthPolicies) were missing, invalid, or unhealthy.

* https://redhat.atlassian.net/browse/RHOAIENG-57006
* https://redhat.atlassian.net/browse/RHOAIENG-57233

### Key Changes

**API Types:**
- Introduced `common_types.go` with shared types:
  - `Phase` type alias with typed constants (`Pending`, `Active`, `Degraded`, `Failed`)
  - `ConditionReason` type alias with semantic reason codes (`Reconciled`, `NotFound`, `Accepted`, `NotEnforced`, etc.)
  - `ResourceRefStatus` base struct for embedding in per-item statuses

**MaaSSubscription:**
- Added `ModelRefStatuses` - per-model validation status (name, namespace, ready, reason, message)
- Added `TokenRateLimitStatuses` - per-TRLP operand health status
- Phase now accurately reflects:
  - `Active` - all model refs valid, all TRLPs accepted
  - `Degraded` - some models valid/some invalid, or some TRLPs unhealthy
  - `Failed` - all model refs invalid

**MaaSAuthPolicy:**
- Added `AuthPolicies` status with per-AuthPolicy health (ready, reason, message)
- Phase derivation mirrors MaaSSubscription logic
- AuthPolicy readiness requires both `Accepted=True` AND `Enforced=True`

### Status Examples

#### MaaSSubscription - Active (all healthy)

```yaml
status:
  phase: Active
  conditions:
    - type: Ready
      status: "True"
      reason: Reconciled
      message: "successfully reconciled"
  modelRefStatuses:
    - name: llama-model
      namespace: llm
      ready: true
      reason: Valid
    - name: mistral-model
      namespace: llm
      ready: true
      reason: Valid
  tokenRateLimitStatuses:
    - name: maas-trlp-llama-model
      namespace: llm
      model: llama-model
      ready: true
      reason: Accepted
    - name: maas-trlp-mistral-model
      namespace: llm
      model: mistral-model
      ready: true
      reason: Accepted
```

#### MaaSSubscription - Degraded (partial failure)

```yaml
status:
  phase: Degraded
  conditions:
    - type: Ready
      status: "False"
      reason: PartialFailure
      message: "1 of 2 model references are invalid"
  modelRefStatuses:
    - name: llama-model
      namespace: llm
      ready: true
      reason: Valid
    - name: missing-model
      namespace: llm
      ready: false
      reason: NotFound
      message: "MaaSModelRef llm/missing-model not found"
  tokenRateLimitStatuses:
    - name: maas-trlp-llama-model
      namespace: llm
      model: llama-model
      ready: true
      reason: Accepted
```

#### MaaSSubscription - Failed (all invalid)

```yaml
status:
  phase: Failed
  conditions:
    - type: Ready
      status: "False"
      reason: ReconcileFailed
      message: "all 2 model references are invalid"
  modelRefStatuses:
    - name: missing-model-1
      namespace: llm
      ready: false
      reason: NotFound
      message: "MaaSModelRef llm/missing-model-1 not found"
    - name: missing-model-2
      namespace: llm
      ready: false
      reason: NotFound
      message: "MaaSModelRef llm/missing-model-2 not found"
  tokenRateLimitStatuses: []
```

#### MaaSAuthPolicy - Active (all healthy)

```yaml
status:
  phase: Active
  conditions:
    - type: Ready
      status: "True"
      reason: Reconciled
      message: "successfully reconciled"
  authPolicies:
    - name: maas-auth-llama-model
      namespace: llm
      model: llama-model
      modelNamespace: llm
      ready: true
      reason: AcceptedEnforced
```

#### MaaSAuthPolicy - Degraded (AuthPolicy not enforced)

```yaml
status:
  phase: Degraded
  conditions:
    - type: Ready
      status: "False"
      reason: PartialFailure
      message: "1 of 1 AuthPolicies not accepted/enforced"
  authPolicies:
    - name: maas-auth-llama-model
      namespace: llm
      model: llama-model
      modelNamespace: llm
      ready: false
      reason: NotEnforced
      message: "waiting for Limitador to be ready"
```

#### MaaSAuthPolicy - Failed (model not found)

```yaml
status:
  phase: Failed
  conditions:
    - type: Ready
      status: "False"
      reason: ReconcileFailed
      message: "all 1 model references are invalid or missing"
  authPolicies: []
```

## How Has This Been Tested?

* Unit Tests
* Manual Cluster Testing:
  - Tested on a live cluster with:
    - Single valid model → `Active` phase
    - Missing MaaSModelRef → `Failed` phase with `NotFound` reason
    - Mixed valid/invalid models → `Degraded` phase with per-model status
    - TokenRateLimitPolicy not accepted → `Degraded` with detailed reason
    - AuthPolicy not enforced → `Degraded` with `NotEnforced` reason
* Build Verification:

```bash
make build  # passes all checks (tidy, generate, manifests, lint, test)
```

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful messages.
- [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added a Degraded phase and richer status reporting: per-item ready/reason/message plus aggregated model and token-rate-limit statuses.
* **Documentation**
  * Troubleshooting expanded with phase semantics, commands to list non-Active resources, and guidance to inspect per-item status fields.
* **Tests**
  * New unit and end-to-end tests validating phase transitions and per-item status/reporting.
* **Other**
  * Tightened validation for name/namespace/model fields; build/deploy tooling behavior updated.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
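
The `MaaSSubscription` phase rules listed under Key Changes can be illustrated with a short sketch. This is a Python illustration under stated assumptions, not the controller's code: the `derive_phase` name and its boolean-list inputs are hypothetical, the `Pending` phase is left out, and treating an empty model-ref list as `Failed` is an assumption.

```python
from enum import Enum


class Phase(str, Enum):
    ACTIVE = "Active"
    DEGRADED = "Degraded"
    FAILED = "Failed"


def derive_phase(model_ref_ready: list[bool], trlp_ready: list[bool]) -> Phase:
    """Derive a phase from per-item readiness flags (hypothetical helper)."""
    # Failed: no model reference is valid. An empty list also lands here,
    # which is an assumption made for this sketch.
    if not any(model_ref_ready):
        return Phase.FAILED
    # Active: every model ref is valid and every TRLP is accepted.
    if all(model_ref_ready) and all(trlp_ready):
        return Phase.ACTIVE
    # Anything in between (some invalid refs or some unhealthy TRLPs).
    return Phase.DEGRADED
```

Applied to the three `MaaSSubscription` examples above, the inputs `([True, True], [True, True])`, `([True, False], [True])`, and `([False, False], [])` map to `Active`, `Degraded`, and `Failed` respectively.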
opendatahub-io#727)

## Description

Documents the **known limitation** when multiple **MaaSModelRef** resources resolve to the **same** **HTTPRoute**: multiple **TokenRateLimitPolicy** objects can target that route, but **only one** is fully effective in practice (others may show **Overridden**), so **per-subscription token limits may not all apply**.

The fix would be merged as a fast follow in 3.5

opendatahub-io#585

[RHOAIENG-57602](https://redhat.atlassian.net/browse/RHOAIENG-57602)

## How Has This Been Tested?

- Docs-only change

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful messages.
- [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

## Documentation

* Added a warning and guidance about token rate limit behavior when multiple model references share a single route, and recommended planning to use separate routes for independent subscription limits.
* Expanded the "Subscription limitations and known issues" section with detection commands and practical workarounds.
* Added a known-limitation note to release notes and updated troubleshooting steps and navigation links for easier discovery.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
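
One way to spot the situation described in this limitation is to group model references by the route they resolve to and flag any route with more than one. The sketch below is only an illustration of that idea and is not one of the documented detection commands: the `shared_routes` helper and the `name`, `routeNamespace`, and `routeName` keys are hypothetical, and the input is assumed to have been extracted from the cluster beforehand.

```python
from collections import defaultdict


def shared_routes(model_refs: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Return routes that are the target of more than one model reference.

    Each input dict uses hypothetical keys: "name", "routeNamespace", "routeName".
    """
    by_route: dict[tuple[str, str], list[str]] = defaultdict(list)
    for ref in model_refs:
        by_route[(ref["routeNamespace"], ref["routeName"])].append(ref["name"])
    # Only routes shared by several model refs are affected by the limitation.
    return {route: sorted(names) for route, names in by_route.items() if len(names) > 1}
```

An empty result would suggest every model reference has its own route, so each subscription's token limit can apply independently.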
…as-api (opendatahub-io#566) Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.75.1 to 1.79.3. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/grpc/grpc-go/releases">google.golang.org/grpc's releases</a>.</em></p> <blockquote> <h2>Release 1.79.3</h2> <h1>Security</h1> <ul> <li>server: fix an authorization bypass where malformed :path headers (missing the leading slash) could bypass path-based restricted "deny" rules in interceptors like <code>grpc/authz</code>. Any request with a non-canonical path is now immediately rejected with an <code>Unimplemented</code> error. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8981">#8981</a>)</li> </ul> <h2>Release 1.79.2</h2> <h1>Bug Fixes</h1> <ul> <li>stats: Prevent redundant error logging in health/ORCA producers by skipping stats/tracing processing when no stats handler is configured. (<a href="https://redirect.github.com/grpc/grpc-go/pull/8874">grpc/grpc-go#8874</a>)</li> </ul> <h2>Release 1.79.1</h2> <h1>Bug Fixes</h1> <ul> <li>grpc: Remove the <code>-dev</code> suffix from the User-Agent header. (<a href="https://redirect.github.com/grpc/grpc-go/pull/8902">grpc/grpc-go#8902</a>)</li> </ul> <h2>Release 1.79.0</h2> <h1>API Changes</h1> <ul> <li>mem: Add experimental API <code>SetDefaultBufferPool</code> to change the default buffer pool. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8806">#8806</a>) <ul> <li>Special Thanks: <a href="https://github.com/vanja-p"><code>@vanja-p</code></a></li> </ul> </li> <li>experimental/stats: Update <code>MetricsRecorder</code> to require embedding the new <code>UnimplementedMetricsRecorder</code> (a no-op struct) in all implementations for forward compatibility. 
(<a href="https://redirect.github.com/grpc/grpc-go/issues/8780">#8780</a>)</li> </ul> <h1>Behavior Changes</h1> <ul> <li>balancer/weightedtarget: Remove handling of <code>Addresses</code> and only handle <code>Endpoints</code> in resolver updates. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8841">#8841</a>)</li> </ul> <h1>New Features</h1> <ul> <li>experimental/stats: Add support for asynchronous gauge metrics through the new <code>AsyncMetricReporter</code> and <code>RegisterAsyncReporter</code> APIs. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8780">#8780</a>)</li> <li>pickfirst: Add support for weighted random shuffling of endpoints, as described in <a href="https://redirect.github.com/grpc/proposal/pull/535">gRFC A113</a>. <ul> <li>This is enabled by default, and can be turned off using the environment variable <code>GRPC_EXPERIMENTAL_PF_WEIGHTED_SHUFFLING</code>. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8864">#8864</a>)</li> </ul> </li> <li>xds: Implement <code>:authority</code> rewriting, as specified in <a href="https://github.com/grpc/proposal/blob/master/A81-xds-authority-rewriting.md">gRFC A81</a>. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8779">#8779</a>)</li> <li>balancer/randomsubsetting: Implement the <code>random_subsetting</code> LB policy, as specified in <a href="https://github.com/grpc/proposal/blob/master/A68-random-subsetting.md">gRFC A68</a>. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8650">#8650</a>) <ul> <li>Special Thanks: <a href="https://github.com/marek-szews"><code>@marek-szews</code></a></li> </ul> </li> </ul> <h1>Bug Fixes</h1> <ul> <li>credentials/tls: Fix a bug where the port was not stripped from the authority override before validation. 
(<a href="https://redirect.github.com/grpc/grpc-go/issues/8726">#8726</a>) <ul> <li>Special Thanks: <a href="https://github.com/Atul1710"><code>@Atul1710</code></a></li> </ul> </li> <li>xds/priority: Fix a bug causing delayed failover to lower-priority clusters when a higher-priority cluster is stuck in <code>CONNECTING</code> state. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8813">#8813</a>)</li> <li>health: Fix a bug where health checks failed for clients using legacy compression options (<code>WithDecompressor</code> or <code>RPCDecompressor</code>). (<a href="https://redirect.github.com/grpc/grpc-go/issues/8765">#8765</a>) <ul> <li>Special Thanks: <a href="https://github.com/sanki92"><code>@sanki92</code></a></li> </ul> </li> <li>transport: Fix an issue where the HTTP/2 server could skip header size checks when terminating a stream early. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8769">#8769</a>) <ul> <li>Special Thanks: <a href="https://github.com/joybestourous"><code>@joybestourous</code></a></li> </ul> </li> <li>server: Propagate status detail headers, if available, when terminating a stream during request header processing. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8754">#8754</a>) <ul> <li>Special Thanks: <a href="https://github.com/joybestourous"><code>@joybestourous</code></a></li> </ul> </li> </ul> <h1>Performance Improvements</h1> <ul> <li>credentials/alts: Optimize read buffer alignment to reduce copies. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8791">#8791</a>)</li> <li>mem: Optimize pooling and creation of <code>buffer</code> objects. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8784">#8784</a>)</li> <li>transport: Reduce slice re-allocations by reserving slice capacity. (<a href="https://redirect.github.com/grpc/grpc-go/issues/8797">#8797</a>)</li> </ul> <!-- raw HTML omitted --> </blockquote> <p>... 
(truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/grpc/grpc-go/commit/dda86dbd9cecb8b35b58c73d507d81d67761205f"><code>dda86db</code></a> Change version to 1.79.3 (<a href="https://redirect.github.com/grpc/grpc-go/issues/8983">#8983</a>)</li> <li><a href="https://github.com/grpc/grpc-go/commit/72186f163e75a065c39e6f7df9b6dea07fbdeff5"><code>72186f1</code></a> grpc: enforce strict path checking for incoming requests on the server (<a href="https://redirect.github.com/grpc/grpc-go/issues/8981">#8981</a>)</li> <li><a href="https://github.com/grpc/grpc-go/commit/97ca3522b239edf6813e2b1106924e9d55e89d43"><code>97ca352</code></a> Changing version to 1.79.3-dev (<a href="https://redirect.github.com/grpc/grpc-go/issues/8954">#8954</a>)</li> <li><a href="https://github.com/grpc/grpc-go/commit/8902ab6efea590f5b3861126559eaa26fa9783b2"><code>8902ab6</code></a> Change the version to release 1.79.2 (<a href="https://redirect.github.com/grpc/grpc-go/issues/8947">#8947</a>)</li> <li><a href="https://github.com/grpc/grpc-go/commit/a9286705aa689bee321ec674323b6896284f3e02"><code>a928670</code></a> Cherry-pick <a href="https://redirect.github.com/grpc/grpc-go/issues/8874">#8874</a> to v1.79.x (<a href="https://redirect.github.com/grpc/grpc-go/issues/8904">#8904</a>)</li> <li><a href="https://github.com/grpc/grpc-go/commit/06df3638c0bcee88197b1033b3ba83e1eb8bc010"><code>06df363</code></a> Change version to 1.79.2-dev (<a href="https://redirect.github.com/grpc/grpc-go/issues/8903">#8903</a>)</li> <li><a href="https://github.com/grpc/grpc-go/commit/782f2de44f597af18a120527e7682a6670d84289"><code>782f2de</code></a> Change version to 1.79.1 (<a href="https://redirect.github.com/grpc/grpc-go/issues/8902">#8902</a>)</li> <li><a href="https://github.com/grpc/grpc-go/commit/850eccbb2257bd2de6ac28ee88a7172ab6175629"><code>850eccb</code></a> Change version to 1.79.1-dev (<a 
href="https://redirect.github.com/grpc/grpc-go/issues/8851">#8851</a>)</li> <li><a href="https://github.com/grpc/grpc-go/commit/765ff056b6890f6c8341894df4e9668e9bfc18ef"><code>765ff05</code></a> Change version to 1.79.0 (<a href="https://redirect.github.com/grpc/grpc-go/issues/8850">#8850</a>)</li> <li><a href="https://github.com/grpc/grpc-go/commit/68804be0e78ed0365bb5a576dedc12e2168ed63e"><code>68804be</code></a> Cherry pick <a href="https://redirect.github.com/grpc/grpc-go/issues/8864">#8864</a> to v1.79.x (<a href="https://redirect.github.com/grpc/grpc-go/issues/8896">#8896</a>)</li> <li>Additional commits viewable in <a href="https://github.com/grpc/grpc-go/compare/v1.75.1...v1.79.3">compare view</a></li> </ul> </details> <br /> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Description

Switch smoke tests to use minted MaaS API keys instead of raw `oc whoami -t` cluster tokens.

[RHOAIENG-51553](https://redhat.atlassian.net/browse/RHOAIENG-51553)

## How Has This Been Tested?

* Manual testing against cluster with MaaS API deployed
* To test locally:

```
cd test/e2e
./smoke.sh
```

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful messages.
- [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Tests**
  * End-to-end tests now obtain short-lived API keys via the cluster bootstrap flow instead of using direct user tokens.
  * Test setup fails fast if minting the required test API key isn't possible; admin tests automatically mint admin credentials and are skipped when unavailable.
  * Logging reduced token exposure by recording only token/key lengths, not their contents.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

E2E tests for the ExternalModel feature, focused on MaaS capabilities:

- **Discovery**: ExternalModel reconciler creates MaaSModelRef, HTTPRoute, backend Service
- **Auth**: Invalid/missing API key returns 401/403
- **Egress**: Request with valid key passes auth and reaches external endpoint
- **Cleanup**: Deleting MaaSModelRef removes HTTPRoute via finalizer

Uses `httpbin.org` as the external endpoint (configurable via `E2E_EXTERNAL_ENDPOINT`). No BBR/plugin dependency: tests validate MaaS egress routing and auth, not payload transformation.

## Changes

- `test/e2e/tests/test_external_models.py`: 7 tests covering discovery, auth, egress connectivity, and cleanup
- `test/e2e/scripts/prow_run_smoke_test.sh`: External model tests section (commented out until CI includes ExternalModel reconciler)

## Test plan

- [x] All 7 tests passing against RHOAI cluster with httpbin.org
- [x] No BBR or simulator dependency

```release-note
NONE
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Tests**
  * Added E2E tests for external-model discovery, auth (invalid/missing API keys), egress/forwarding to external endpoints, and cleanup to ensure routes are removed.
  * New module-scoped setup provisions credentials, model/subscription resources, creates an API key, and tears down resources after tests.
* **Chores**
  * CI smoke runner now executes the external-model E2E suite (replacing the previous external test), producing separate test artifacts.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…-io#693)

Related to
- https://redhat.atlassian.net/browse/RHOAIENG-57158

## Summary

- Add Spectral-based OpenAPI specification validation to CI
- Add breaking change detection using oasdiff
- Add changelog verification for API changes
- Include comprehensive automation plan document

## Changes

1. **`.github/workflows/openapi-validation.yml`** - New CI workflow with three jobs:
   - `validate-spec`: Runs Spectral linting, generates validation reports
   - `breaking-changes`: Detects API breaking changes vs base branch
   - `changelog-check`: Verifies changelog updates when spec changes
2. **`.spectral.yml`** - OpenAPI linting configuration:
   - Extends `spectral:oas` ruleset
   - Custom rules for operation IDs, descriptions, security
   - MaaS-specific rule for subscription header documentation
3. **`docs/openapi-automation-plan.md`** - Phased automation plan:
   - Phase 1: Validation & linting (this PR)
   - Phase 2: Contract testing with Dredd/Prism
   - Phase 3: Client SDK generation
   - Phase 4: Code annotation-based generation

## Current Validation Results

Running Spectral on `maas-api/openapi3.yaml` found:

- **4 errors** (schema validation issues in examples)
- **8 warnings** (missing contact info, undefined tags, tag ordering)
- **6 hints** (custom MaaS subscription header rule)

These will be addressed in a follow-up PR.

## Test Plan

- [x] Spectral validation runs successfully on local spec
- [x] CI workflow validates on PR changes to OpenAPI spec
- [ ] Fix existing validation errors (follow-up PR)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
  * Added CI automation for OpenAPI validation, linting, and report generation.
  * Enabled breaking-change detection on pull requests and a check for required changelog updates.
  * Introduced stricter linting rules to enforce OpenAPI quality and documentation standards.
* **Documentation**
  * Added an OpenAPI automation roadmap outlining phased plans for validation, contract testing, SDKs, and rollout.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…cted support (opendatahub-io#706)

## Description

Replace `curlimages/curl` with `registry.redhat.io/ubi9/ubi-minimal:9.7` in the cleanup CronJob. Third-party images are not mirrored in disconnected/air-gapped RHOAI environments. UBI minimal includes curl and is available in the RHOAI mirror catalog.

## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful messages.
- [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
  * Updated the container base image used for the API cleanup process.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Wen Liang <liangwen12year@gmail.com>
…b-io#728)

## Summary

- Uncomments the `test_unconfigured_model_denied_by_gateway_auth` test in `test/e2e/tests/test_subscription.py`
- Verifies that models with no MaaSAuthPolicy or MaaSSubscription are denied (403) by the `gateway-default-auth` AuthPolicy
- The test fixture (`test/e2e/fixtures/unconfigured/`) already exists and deploys a MaaSModelRef with no auth policy or subscription

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Tests**
  * Re-enabled end-to-end coverage for gateway access control: confirms deny-by-default behavior and that models without required subscription/auth configuration are denied (403) when accessed with the default API key.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…tahub-io#549)

## Description

Add a bounded access-check timeout (15s), a `Cache-Control: no-store` header, and an `X-Access-Checked-At` freshness timestamp to prevent clients from caching stale authorization decisions from the eventually-consistent model access probes.

## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful messages.
- [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * API responses now include anti-cache headers and an access-check timestamp header (RFC3339) to show when authorization was verified.
  * Access validation checks are now bounded by a configurable timeout to ensure timely responses.
* **Chores**
  * Added a configuration option for the access-check timeout with validation.
* **Tests**
  * Tests updated to verify the new headers and that the timestamp parses as RFC3339.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Wen Liang <liangwen12year@gmail.com>
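
The two response headers described here can be sketched as follows. This is a Python illustration of the header contract only, not the service code: the helper names `access_check_headers` and `parse_checked_at` are hypothetical, and emitting the timestamp in UTC with a trailing `Z` is an assumption (RFC3339 also allows numeric offsets).

```python
from datetime import datetime, timezone


def access_check_headers(checked_at: datetime) -> dict[str, str]:
    """Build the anti-cache and freshness headers for a response (hypothetical helper)."""
    # Render the check time as an RFC3339 timestamp in UTC.
    ts = checked_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"Cache-Control": "no-store", "X-Access-Checked-At": ts}


def parse_checked_at(value: str) -> datetime:
    """Parse the header value back into an aware datetime, as a client or test might."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11,
    # so normalise it to an explicit offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))
```

A client can compare the parsed timestamp with its own clock to decide whether an authorization result is too old to trust, while `no-store` keeps intermediaries from reusing the response at all.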
…b-io#724)

https://redhat.atlassian.net/browse/RHOAIENG-57235

## Description

This PR provides broader automated coverage for unhappy paths and abuse scenarios (missing resources, forbidden access, header spoofing).

### Additional notes:

**Documentation:** updated `README.md` with additional notes (e.g. the "Negative & Security Tests" and "Namespace Scoping Tests" sections), including pytest commands, the test coverage list, and a link to the matrix. The CI integration list was updated too.

## How Has This Been Tested?

The code compiles, and the CI pipeline builds and completes the tests as intended.

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful messages.
- [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Tests**
  * Added security and negative scenario test coverage for E2E validation.
  * Refactored test utilities into a shared helper module for consistency across test suites.
* **Documentation**
  * Updated E2E test documentation to reflect new test coverage.
  * Extended CI smoke test script to include new test modules.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Yuriy Teodorovych <Yuriy@ibm.com>
## Summary

Related to
- https://redhat.atlassian.net/browse/RHOAIENG-57336

Implements Kubernetes RBAC aggregation to enable namespace admins and contributors to create and manage `MaaSModelRef` and `ExternalModel` resources without requiring cluster-admin intervention.

This addresses the user story requirement for namespace-scoped users to deploy models using standard Kubernetes/OpenShift roles (`admin`, `edit`, `view`) without needing custom ClusterRoleBindings or elevated permissions.

## Changes

### ClusterRole Aggregation

- ✅ Add `maas-user-admin-role` ClusterRole (aggregates to `admin` and `edit` roles)
- ✅ Add `maas-user-view-role` ClusterRole (aggregates to `view`, `admin`, and `edit` roles)
- ✅ Include namespace-scoped resources: `MaaSModelRef`, `ExternalModel`
- ✅ Exclude platform-managed resources: `MaaSSubscription`, `MaaSAuthPolicy`

### Documentation

- ✅ Comprehensive user guide: `docs/content/configuration-and-management/namespace-rbac.md`
  - How RBAC aggregation works (with Mermaid diagrams)
  - Permission matrix and usage examples
  - Troubleshooting guide and best practices
- ✅ Automated verification script: `scripts/verify-rbac-aggregation.sh`
- ✅ Updated MkDocs navigation

## Permission Matrix

| Role | Resources | Permissions | Use Case |
|------|-----------|-------------|----------|
| **admin** | `MaaSModelRef`, `ExternalModel` | `create`, `delete`, `get`, `list`, `patch`, `update`, `watch` | Full model lifecycle management |
| **edit** | `MaaSModelRef`, `ExternalModel` | `create`, `delete`, `get`, `list`, `patch`, `update`, `watch` | Full model lifecycle management |
| **view** | `MaaSModelRef`, `ExternalModel` | `get`, `list`, `watch` | Read-only access |

**Platform-managed resources remain protected:**
- ❌ `MaaSSubscription` - Namespace users cannot create (cluster-admin only)
- ❌ `MaaSAuthPolicy` - Namespace users cannot create (cluster-admin only)

## Testing

### ✅ Comprehensive Live Cluster Testing: 19/19 Tests Passed

All test cases were executed on a [**live OpenShift cluster**](https://console-openshift-console.apps.ci-ln-cdy7jft-76ef8.aws-4.ci.openshift.org/k8s/ns/opendatahub/core~v1~Pod) with the following results:

#### Phase 1: Infrastructure Verification ✅

- ✅ ClusterRoles exist with correct aggregation labels
- ✅ Built-in `admin` role includes `maas.opendatahub.io` permissions
- ✅ Built-in `edit` role includes `maas.opendatahub.io` permissions
- ✅ Built-in `view` role includes `maas.opendatahub.io` permissions (read-only)

#### Phase 2: User Permission Testing ✅

- ✅ Admin user can create, update, delete `MaaSModelRef`
- ✅ Admin user can create, update, delete `ExternalModel`
- ✅ Edit user can create, update, delete `MaaSModelRef`
- ✅ Edit user can create, update, delete `ExternalModel`
- ✅ View user can **only** read (get, list, watch) - correctly forbidden from create/delete

#### Phase 3: Security & Platform Protection ✅

- ✅ Namespace users **cannot** create `MaaSSubscription` (correctly forbidden)
- ✅ Namespace users **cannot** create `MaaSAuthPolicy` (correctly forbidden)
- ✅ Platform resources remain cluster-admin only

#### Phase 4: Controller Integration ✅

- ✅ maas-controller successfully reconciles user-created `MaaSModelRef` resources
- ✅ Status conditions updated correctly
- ✅ Controller watches user namespaces properly

#### Phase 5: Lifecycle Testing ✅

- ✅ Users can create resources in their namespace
- ✅ Users can update resources in their namespace
- ✅ Users can delete resources in their namespace
- ✅ View users correctly restricted to read-only

### Test Environment

- **Cluster Type:** OpenShift (https://console-openshift-console.apps.ci-ln-cdy7jft-76ef8.aws-4.ci.openshift.org/k8s/ns/opendatahub/core~v1~Pod)
- **MaaS Version:** Latest (main branch)
- **Test Namespace:** `rbac-test`
- **Test Users:** `testadmin@example.com`, `testeditor@example.com`, `testviewer@example.com`
- **Resources Created:** MaaSModelRef, ExternalModel (all successfully reconciled)

### Verification Script

Run the automated verification script to validate RBAC aggregation:

```bash
./scripts/verify-rbac-aggregation.sh
```

## Design Rationale

### Why RBAC Aggregation?

1. **Kubernetes best practice** - Standard pattern for extending built-in roles with CRD permissions
2. **Zero configuration** - Works automatically when users are granted standard roles
3. **Follows precedent** - Same pattern used by OpenShift operators, KServe, and other K8s projects
4. **Minimal permissions** - Only grants access to resources users actually deploy in their namespaces

### Security Considerations

- ✅ Only namespace-scoped resources included (`MaaSModelRef`, `ExternalModel`)
- ✅ Platform-level resources excluded (`MaaSSubscription`, `MaaSAuthPolicy`)
- ✅ Verbs limited to minimum necessary for each role
- ✅ View role is strictly read-only (no mutating verbs)
- ✅ Follows principle of least privilege

## Documentation

### User-Facing

- **Main Guide:** `docs/content/configuration-and-management/namespace-rbac.md`
  - How RBAC aggregation works (with Mermaid diagrams)
  - Permission matrix
  - Usage examples
  - Troubleshooting guide
  - Best practices
  - Design rationale and references

### Verification

- **Automated Script:** `scripts/verify-rbac-aggregation.sh`
  - Checks ClusterRole existence and labels
  - Verifies aggregation to built-in roles
  - Validates correct verbs for each role
  - Provides detailed pass/fail reporting

## Known Issues Discovered During Testing

While testing this implementation on a live cluster, we discovered two operator-related issues:

### 1. Missing `cluster-audience` in ConfigMap

**Issue:** The ODH/RHOAI operator doesn't set the `cluster-audience` parameter in the `maas-parameters` ConfigMap, causing maas-controller to fail on startup.

**Workaround:**

```bash
kubectl patch configmap maas-parameters -n opendatahub \
  --type merge \
  -p '{"data":{"cluster-audience":"https://kubernetes.default.svc"}}'
```

**Permanent Fix:** Update operator to include this parameter when creating the ConfigMap.
## Migration Guide For existing deployments, the changes are **additive and non-breaking**: 1. The new ClusterRoles are automatically created when the manifests are applied 2. Kubernetes automatically aggregates permissions into built-in roles within seconds 3. No migration of existing resources required 4. No impact on existing service account permissions 5. Users with existing custom ClusterRoleBindings can continue using them (can be cleaned up later) ### Rollout Steps 1. Deploy updated MaaS controller manifests (includes new ClusterRoles) 2. Verify aggregation: `kubectl get clusterrole admin -o yaml | grep maas.opendatahub.io` 3. Test in a dev namespace before production 4. Communicate new capability to namespace users 5. (Optional) Clean up redundant custom ClusterRoleBindings ## Acceptance Criteria All acceptance criteria from the user story have been met: - ✅ Users with `admin` role can create/update/delete `MaaSModelRef` in their namespace - ✅ Users with `edit` role can create/update/delete `MaaSModelRef` in their namespace - ✅ Users with `view` role can list/get but not create/update/delete `MaaSModelRef` - ✅ Aggregation uses standard Kubernetes labels (`rbac.authorization.k8s.io/aggregate-to-*`) - ✅ Only namespace-scoped resources included - ✅ Platform-level resources excluded - ✅ Minimal permissions granted (no broader than necessary) - ✅ Comprehensive documentation provided ## References - [Kubernetes RBAC Aggregation](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#aggregated-clusterroles) - [OpenShift RBAC](https://docs.openshift.com/container-platform/latest/authentication/using-rbac.html) - [Example: MCP Lifecycle Operator Aggregation](kubernetes-sigs/mcp-lifecycle-operator#73) ## Checklist - [x] Code follows project style guidelines - [x] All tests pass (19/19 on live cluster) - [x] Documentation updated (comprehensive user guide) - [x] Verification script added - [x] CI validation passes - [x] CodeRabbit AI review passes (no 
findings) - [x] Security considerations addressed - [x] Breaking changes: None - [x] Backwards compatible: Yes --- **Ready for review by security and platform teams.** <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added two namespace-level aggregated ClusterRoles to grant admin/edit users full management and view-only users read access for MaaS model resources. * **Documentation** * Added a "Namespace User Permissions (RBAC)" guide with a permission matrix, verification commands, and troubleshooting for Forbidden errors. * **Chores** * Added a verification script to validate RBAC aggregation and role coverage in clusters. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
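For reviewers less familiar with aggregation, here is a minimal Python model of the permission matrix in this PR. It is only a sketch: the `ClusterRole` dataclass and `effective_verbs()` are hypothetical names used for illustration, and the plural resource names are assumed to match the CRD plurals. The real behavior comes from the `rbac.authorization.k8s.io/aggregate-to-*` labels on the ClusterRoles.

```python
from dataclasses import dataclass

READ_ONLY = frozenset({"get", "list", "watch"})
MUTATING = frozenset({"create", "delete", "patch", "update"})
MODEL_RESOURCES = frozenset({"maasmodelrefs", "externalmodels"})


@dataclass(frozen=True)
class ClusterRole:
    """Hypothetical, simplified view of an aggregated ClusterRole."""
    name: str
    aggregate_to: frozenset  # built-in roles that receive these rules
    resources: frozenset
    verbs: frozenset


ROLES = [
    ClusterRole("maas-user-admin-role", frozenset({"admin", "edit"}),
                MODEL_RESOURCES, READ_ONLY | MUTATING),
    ClusterRole("maas-user-view-role", frozenset({"view", "admin", "edit"}),
                MODEL_RESOURCES, READ_ONLY),
]


def effective_verbs(builtin_role: str, resource: str) -> set:
    """Union the verbs of every ClusterRole that aggregates into builtin_role."""
    verbs = set()
    for role in ROLES:
        if builtin_role in role.aggregate_to and resource in role.resources:
            verbs |= role.verbs
    return verbs
```

Because `maassubscriptions` and `maasauthpolicies` appear in neither role, `effective_verbs()` returns an empty set for them, which mirrors the "platform-managed resources remain protected" rows above.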
…pendatahub-io#733)

## Description

My assumption is that `validateModelRefs()` needs to rely on `deletionTimestamp`, just like `findHTTPRouteForModel()` does. `MaaSModelRef` has a finalizer, so the failing scenario when you delete it must be the following:

1. Kubernetes sets `deletionTimestamp`, but the object continues to exist in the cache until the finalizer is removed
2. `validateModelRefs()` calls `r.Get()`, which succeeds (the object still exists). Hence, it sets `ready=true` and `reason=Valid`
3. `checkTokenRateLimitHealth()` calls `findHTTPRouteForModel()`, whose `r.Get()` also succeeds, but which then checks `deletionTimestamp` and returns `ErrModelNotFound`
4. `deriveFinalPhase()` correctly detects the inconsistency via TRLP health and sets `phase=Failed`
5. But `updateStatus()` persists `phase=Failed` together with `modelRefStatuses=[{ready: true, reason: Valid}]`, because `modelRefStatuses` was already set with `ready=true` and never corrected

The model's finalizer cleanup (deleting AuthPolicies, TRLPs, backend resources) can take time, so the model remains in the "deleting" state for that duration. During this window, every reconciliation produces the same stale **ready=true** in `modelRefStatuses`.

**After the change:**

1. The test deletes the model, so `deletionTimestamp` is set (the finalizer might still be completing its task, so the object remains alive)
2. The subscription reconciles:
   - `validateModelRefs()` sets `ready=true`
   - `checkTokenRateLimitHealth()` sets `BackendNotReady`
   - the fix changes `modelRefStatuses[0]` from `ready=true` to `ready=false` and `reason=NotFound`
   - `deriveFinalPhase()` sets `phase=Failed`
   - `updateStatus()` saves `phase=Failed` and `modelRefStatuses=[ready=false]`
3. The test's `_wait_for_subscription_phase("Failed")` succeeds
4. The test's poll for `modelRefStatuses[0].ready` being `False` succeeds immediately, because the status was corrected in the same reconciliation

## How Has This Been Tested?
Tests pass

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful messages.
- [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* Fixed an issue where models marked for deletion would incorrectly remain in the "ready" state. The system now properly identifies models undergoing deletion and corrects their status to "not found" to ensure accurate health reporting and phase information.
* **Tests**
* Added test coverage for model deletion scenarios to verify proper status correction during reconciliation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Yuriy Teodorovych <Yuriy@ibm.com>
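The status correction in this commit can be pictured with a small Python model. This is a sketch only: the controller itself is Go, and `ModelRefStatus` and `correct_stale_statuses()` are hypothetical names, not the controller's actual types or functions. It assumes the reconciler already knows which referenced models carry a `deletionTimestamp`.

```python
from dataclasses import dataclass


@dataclass
class ModelRefStatus:
    """Hypothetical mirror of one entry in `modelRefStatuses`."""
    name: str
    ready: bool
    reason: str


def correct_stale_statuses(statuses, deleting_models):
    """Flip entries for models that are being deleted to ready=False/NotFound.

    `deleting_models` is the set of model names whose object still exists
    but already has a deletionTimestamp (finalizer still running).
    """
    corrected = []
    for status in statuses:
        if status.ready and status.name in deleting_models:
            corrected.append(ModelRefStatus(status.name, False, "NotFound"))
        else:
            corrected.append(status)
    return corrected
```

Running the correction before the status is persisted is what keeps `phase=Failed` and `modelRefStatuses` consistent within a single reconciliation.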
…pendatahub-io#731) ## Summary Fix `MAAS_API_IMAGE` and `MAAS_CONTROLLER_IMAGE` env vars being silently ignored during kustomize-mode deployments, causing CI to always deploy `latest` instead of PR images. ## Description ### Problem When deploying MaaS via kustomize mode (used by Konflux CI and `prow_run_smoke_test.sh`), the `MAAS_API_IMAGE` environment variable was silently overridden. The deploy script logged the correct PR image, but the pod always ended up running `latest`: ``` # Logs showed correct image: Using custom MaaS API image: quay.io/opendatahub/maas-api:odh-pr-721 # But the pod had: image: quay.io/opendatahub/maas-api:latest ``` ### Root Cause The `shared-patches` kustomize component uses `replacements:` to set container images from `params.env`. But `set_maas_api_image()` and `set_maas_controller_image()` were only patching the base `images:` transformer in `kustomization.yaml`. Kustomize processes `images:` transformers **before** `replacements:`, so `params.env` (hardcoded to `latest`) always overwrote the custom image. This regression was introduced when the `shared-patches` component was added to centralize overlay configuration. The maas-controller was not visibly affected because `deploy.sh` had a post-apply `kubectl set image` workaround that corrected the image after kustomize had already applied it with the wrong tag. maas-api had no such workaround. ### Fix - Add `_patch_params_env` helper to patch `params.env` with custom image values before `kustomize build`, so replacements pick up the correct image. - Call `_patch_params_env` from both `set_maas_api_image` and `set_maas_controller_image` after the existing base kustomization patching. - Add `_cleanup_params_env` to restore `params.env` from backup after build. - Remove the post-apply `kubectl set image` workaround for maas-controller in `deploy.sh` since `params.env` now carries the correct image through the kustomize build pipeline. 
- Log deployed maas-api and maas-controller images at end of deployment for easier verification. ## How it was tested - Verified locally that `kustomize build` with the old approach (patching base `images:` transformer) still produces `latest` — confirming the bug. - Verified locally that `kustomize build` with the fix (patching `params.env`) produces the correct custom image. - Tested both `tls-backend` and `http-backend` overlays — both produce correct images. - Verified operator mode is unaffected (base `images:` transformer still works for direct base builds). - Verified default behavior (no env var set) still produces `latest`. - Verified `params.env` is restored from backup after deployment (no leftover `.backup` file). - Deployed on a live cluster with `MAAS_API_IMAGE` and `MAAS_CONTROLLER_IMAGE` set — both pods running correct PR images. Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Improvements** * Deployment now always logs the live container images for key services to help verification and troubleshooting, with safe fallbacks when data is missing. * Image update operations also persistently update deployment configuration so chosen images remain synchronized across tools and are cleanly restored during cleanup. * Final completion message changed to “Models-as-a-Service Deployment completed successfully!” to reflect branding. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Chaitanya Kulkarni <ckulkarn@redhat.com> Signed-off-by: Chaitanya Kulkarni <chkulkar@redhat.com>
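The patch-then-restore flow described under "Fix" can be sketched in Python as below. The real helpers (`_patch_params_env`, `_cleanup_params_env`) are shell functions in the deploy scripts; the function names here, the `.backup` suffix handling, and the example key `maas-api-image` are assumptions for illustration and may not match the actual `params.env` keys.

```python
import shutil
from pathlib import Path


def patch_params_env(path, key, value):
    """Set key=value in a params.env-style file, keeping a pristine backup."""
    path = Path(path)
    backup = path.with_name(path.name + ".backup")
    if not backup.exists():  # never overwrite the first (original) copy
        shutil.copy2(path, backup)

    lines = path.read_text().splitlines()
    found = False
    for i, line in enumerate(lines):
        if line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    path.write_text("\n".join(lines) + "\n")


def cleanup_params_env(path):
    """Restore params.env from its backup, leaving no .backup file behind."""
    path = Path(path)
    backup = path.with_name(path.name + ".backup")
    if backup.exists():
        backup.replace(path)  # atomic overwrite of the patched file
```

The point of patching the file rather than the `images:` transformer is that `replacements:` reads its values from `params.env`, so the custom image has to be there before `kustomize build` runs.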
## Description https://redhat.atlassian.net/browse/RHOAIENG-57627 The /v1/models endpoint exemption was lost during the migration from tier-based to subscription-based rate limiting. This caused model discovery endpoints to be blocked when users exhausted their token quota, even though these endpoints don't consume inference tokens. This restores the original behavior from commit 660f4db by adding `!request.path.endsWith("/v1/models")` to the per-route TokenRateLimitPolicy predicates. Changes: - Add path exemption to TRLP when clause in maassubscription_controller.go - Add E2E test for per-model /v1/models endpoints (test_subscription.py) - Add E2E test for central /v1/models endpoint aggregation (test_models_endpoint.py) ## How Has This Been Tested? Both tests validate that: 1. Inference requests are blocked (429) when quota exhausted 2. /v1/models endpoints remain accessible (200) when quota exhausted This is a regression from the tier system removal. The original issue was [RHOAIENG-46770](https://redhat.atlassian.net/browse/RHOAIENG-46770) (resolved in tier-based system). ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [x] The commits are squashed in a cohesive manner and have meaningful messages. - [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [x] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Token-based subscription rate limiting now exempts the /v1/models endpoint so model discovery remains accessible when a subscription's token quota is exhausted. 
* **Tests** * Added e2e tests confirming inference is blocked after quota exhaustion while GET /v1/models still returns 200 and valid listings; updated test expectations and cleanup to reflect the exemption. * **Chores** * Added a diagnostics helper for LLM inference service artifacts and integrated it into smoke-test timeout diagnostics. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
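As a rough illustration of the exemption, the predicate can be modeled in Python. This is not the TokenRateLimitPolicy itself: `is_rate_limited()` is a hypothetical function that only mirrors the `!request.path.endsWith("/v1/models")` condition quoted in the description, and the example paths are assumed.

```python
def is_rate_limited(path: str, quota_exhausted: bool) -> bool:
    """Return True when a request should receive 429.

    Model discovery paths ending in /v1/models are exempt, because they
    do not consume inference tokens.
    """
    exempt = path.endswith("/v1/models")
    return quota_exhausted and not exempt
```

With an exhausted quota, `is_rate_limited("/llm/my-model/v1/chat/completions", True)` is `True`, while both the per-model path `/llm/my-model/v1/models` and the central `/v1/models` path evaluate to `False`, which is the pair of behaviors the two new E2E tests check.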
…hub-io#732)

## Description

Because the MaaS controller is included starting with RHOAI 3.4, deploying it explicitly seems unnecessary and may even conflict with what the operator has already installed (local vs. operator manifests).

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful messages.
- [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Chores**
* Optimized controller deployment to skip redundant installation when a controller already exists in operator mode, improving deployment efficiency and reducing unnecessary operations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…pendatahub-io#740) ## Summary Enhance E2E artifact collection to dump full MaaS CR YAML definitions and collect pod logs from RHOAI-related namespaces, improving CI debuggability. ## Description - Add `collect_maas_crs()` function that dumps full YAML for all four MaaS CRD types (`maasmodelrefs`, `maasauthpolicies`, `maassubscriptions`, `externalmodels`) with dynamic namespace discovery, mirroring the CRD list from `red-hat-data-services/must-gather` `gather_models_as_a_service` script. - Expand pod log collection to cover 8 namespaces: `opendatahub`, `models-as-a-service`, `redhat-ods-operator`, `redhat-ods-applications`, `kuadrant-system`, `openshift-ingress`, `llm`, and `istio-system`, with graceful skip for non-existent namespaces. - Add RHOAI operator, applications, DSC/DSCI, and gateway namespace resource snapshots to `cluster-state.log`. - Persist auth debug report to `auth-debug-report.log` in the artifact directory. - Fix `set -e` trap on `[[ ]] && ...` patterns that caused early script exit when running without a cluster connection. - All collected CR YAML is token-redacted before writing to disk. ## How it was tested - Ran `ARTIFACTS_DIR=test/e2e/reports/maas-debug ./test/e2e/scripts/auth_utils.sh` locally without a cluster connection to verify the script runs to completion without failures. - Verified artifact directory structure is created correctly with expected files (`maas-crs/no-crs-found.log`, `cluster-state.log`, `auth-debug-report.log`, `pod-logs/` subdirectories). - Verified `bash -n` syntax check passes. Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Enhanced test diagnostics with expanded cluster state snapshots and multi-namespace pod log collection. * Improved artifact capture for MaaS custom resources and authorization debug reports. * **Chores** * Added configuration variables for RHOAI, gateway, LLM, and Istio namespaces. 
* Refined log collection and error handling in end-to-end testing utilities. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Chaitanya Kulkarni <chkulkar@redhat.com>
…opendatahub-io#709)

ExternalModel HTTPRoutes use PathPrefix: `/<modelName>`, while LLMInferenceService routes use PathPrefix: `/<namespace>/<modelName>`. This means two ExternalModel MaaSModelRefs with the same name in different namespaces would collide on the same path.

This change makes the ExternalModel reconciler and endpoint resolver include the namespace in the path, matching the LLMInferenceService pattern:

- resources.go: PathPrefix: `/<namespace>/<modelName>` (was `/<modelName>`)
- providers_external.go: endpoint URL `https://<host>/<namespace>/<modelName>` (was `https://<host>/<modelName>`)

Before: POST `/gpt-4o/v1/chat/completions`
After: POST `/llm/gpt-4o/v1/chat/completions`

The URLRewrite filter already strips the full prefix to `/`, so the external provider still receives POST `/v1/chat/completions`.

cc/ @jland-redhat @nirrozenbaum leaving in draft until I manually test. Fighting cluster availability via CB so it might be a couple of hours. Nice catch Nir 🥳!

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful messages.
- [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Refactor** * Simplified external model routing and endpoint naming conventions * Updated endpoint URL path structure to include namespace information * Migrated TLS/port configuration to ExternalModel resource annotations (`maas.opendatahub.io/port`, `maas.opendatahub.io/tls`) * Streamlined internal resource builder functions for Kubernetes and Istio components <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Brent Salisbury <bsalisbu@redhat.com>
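The collision and its fix are easy to see in a few lines of Python. This is a sketch, not the reconciler code (which is Go): `old_prefix()`, `new_prefix()`, and `rewrite_for_provider()` are hypothetical helpers, and the namespace `team-b` is made up for the example.

```python
def old_prefix(namespace: str, model_name: str) -> str:
    """Previous ExternalModel route prefix: the namespace is ignored."""
    return f"/{model_name}"


def new_prefix(namespace: str, model_name: str) -> str:
    """New prefix, matching the LLMInferenceService pattern."""
    return f"/{namespace}/{model_name}"


def rewrite_for_provider(request_path: str, prefix: str) -> str:
    """Strip the route prefix, as the URLRewrite filter is described as doing."""
    if not request_path.startswith(prefix):
        raise ValueError(f"{request_path!r} does not match prefix {prefix!r}")
    rest = request_path[len(prefix):]
    return rest if rest.startswith("/") else "/" + rest
```

Two models named `gpt-4o` in namespaces `llm` and `team-b` share `/gpt-4o` under the old scheme but get `/llm/gpt-4o` and `/team-b/gpt-4o` under the new one, and in both cases the provider-facing path after the rewrite stays `/v1/chat/completions`.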
## Summary
- Enable group testing so that e2e tests run with both `maas-api` and
`maas-controller` images built from the same PR commit
- Previously, per-component Konflux snapshots meant each integration
test had one image from the PR and the other from the main branch
## Changes
- **`.tekton/odh-maas-api-pull-request.yaml`** — add
`enable-group-testing: "true"`
- **`.tekton/odh-maas-controller-pull-request.yaml`** — add
`enable-group-testing: "true"`
- **`.tekton/maas-group-test.yaml`** (new) — group test PipelineRun
triggered by `/group-test` comment after all builds complete
## How it works
1. PR opened on `models-as-a-service` → both component builds triggered
2. Each build's `trigger-group-testing` finally task checks if all other
Konflux checks are completed
3. The last build to complete posts `/group-test` comment on the PR
4. PAC matches the comment and triggers the `maas-group-test`
PipelineRun
5. `generate-snapshot-for-group-testing` creates a composite snapshot
with both PR-built images (using `odh-pr-{PR_NUMBER}` tags)
6. e2e tests run with correct `MAAS_API_IMAGE` and
`MAAS_CONTROLLER_IMAGE` from the same commit
## Dependencies
- Requires opendatahub-io/odh-konflux-central#241 to be merged first
(registers `maas-group` Component and the group testing Pipeline)
## Test plan
- [ ] Merge opendatahub-io/odh-konflux-central#241 first
- [ ] Wait for gitops sync (maas-group Component lands in Konflux)
- [ ] Merge this PR
- [ ] Open a test PR on this repo
- [ ] Verify both builds complete and `/group-test` comment is posted
automatically
- [ ] Verify `maas-group-test` PipelineRun is created and e2e tests pass
with both correct images
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **Chores**
* Added new group testing pipeline for comprehensive validation of
models-as-a-service component groups
* Enabled group testing parameter in pull request workflows for API and
controller services
* Implemented dedicated integration testing infrastructure to support
component group validation scenarios
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Related to
- https://redhat.atlassian.net/browse/RHOAIENG-57622

Fixes intermittent failures in `test_rate_limit_exhaustion_gets_429` that occurred on PRs with no rate limit changes.

## Problem

The test was making an **incorrect assumption** that `max_tokens=N` would consume exactly N tokens per request. This caused flaky failures because:

- Models may return fewer tokens than `max_tokens` (it's a ceiling, not exact)
- Prompt tokens also count toward rate limits (not just completion tokens)
- Actual token usage varies per request

**Flaky Logic (Before):**

```python
token_limit = 15
max_tokens = 3
expected_success = token_limit // max_tokens  # Expected exactly 5 successful requests
assert abs(success_count - expected_success) <= 1  # Flaky assertion!
```

This assertion failed when responses used 2 or 4 tokens instead of exactly 3.

## Changes

### 1. Made `max_tokens` configurable in `_inference()` helper

```python
# Before:
def _inference(api_key, path=None, extra_headers=None, model_name=None):
    json={"max_tokens": 3}  # Hardcoded

# After:
def _inference(api_key, path=None, extra_headers=None, model_name=None, max_tokens=3):
    json={"max_tokens": max_tokens}  # Configurable, default: 3
```

✅ **Backward compatible** - all existing tests continue using default `max_tokens=3`

### 2. Updated test to use flexible logic

```python
# Before (flaky):
token_limit = 15
max_tokens = 3
total_requests = (15 / 3) + 2  # Expected exactly 5 successful, send 7

# After (robust):
token_limit = 10
total_requests = 15
r = _inference(api_key, path=model_path, max_tokens=1)  # Minimize variance
```

**Key improvements:**

- Uses `max_tokens=1` (minimize variance)
- 50% safety margin (10 token limit, 15 requests)
- **Just verifies 429 occurs** - doesn't assume when
- Removed strict token math assertions

### 3. Improved comments

Explains that token consumption is non-deterministic, so the test verifies rate limiting works without assuming exact timing.
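To show why the relaxed check is stable, here is a self-contained simulation. It is a sketch under stated assumptions: `make_fake_inference()` is a hypothetical stand-in for the real `_inference()` helper that returns only a status code, and the token accounting (1 to 3 prompt tokens plus up to `max_tokens` completion tokens per request) is invented to mimic variable usage, not taken from the gateway.

```python
import random


def make_fake_inference(token_limit, seed=0):
    """Build a fake inference call with non-deterministic token usage."""
    rng = random.Random(seed)
    used = 0

    def fake_inference(max_tokens=3):
        nonlocal used
        if used >= token_limit:
            return 429  # quota exhausted
        prompt_tokens = rng.randint(1, 3)               # prompt tokens count too
        completion_tokens = rng.randint(1, max_tokens)  # max_tokens is a ceiling
        used += prompt_tokens + completion_tokens
        return 200

    return fake_inference


def saw_429_after_success(inference, total_requests):
    """Robust check: at least one 200 occurs before the first 429."""
    statuses = [inference(max_tokens=1) for _ in range(total_requests)]
    if 200 not in statuses or 429 not in statuses:
        return False
    return statuses.index(200) < statuses.index(429)
```

In this model each request costs between 2 and 4 tokens, so a 10-token limit is always exhausted within 15 requests regardless of the seed, while the exact number of successful requests still varies. That is the property the old `abs(success_count - expected_success) <= 1` assertion could not tolerate.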
## Testing

**Validated on live cluster:**

- ✅ 89 passed
- ⏭️ 4 skipped
- ⚠️ 81 warnings
- ⏱️ Duration: 17 minutes 16 seconds
- `test_rate_limit_exhaustion_gets_429 PASSED [ 46%]`

**Consistency:** 100% pass rate across multiple runs (no flakiness detected)

## Impact

**Before:**

- ❌ Test fails ~20-30% of the time
- ❌ Blocks unrelated PRs
- ❌ Requires manual re-runs

**After:**

- ✅ Verifies rate limiting works (core behavior)
- ✅ No assumptions about exact token consumption
- ✅ Eliminates flakiness while maintaining test validity

## Related

This test was added to verify token-based rate limiting works end-to-end. The fix maintains the test's original purpose while removing unreliable timing assumptions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Tests**
* Test helper now accepts an optional max_tokens parameter (default 3) for inference requests.
* Rate-limit exhaustion test made more robust and assumption-free: request sizing and counts simplified, looped requests use explicit smaller token usage, 429 validation simplified, and the test requires at least one successful 200 before any observed 429 within a fixed request window.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…ahub-io#739) ## Summary Related to - https://redhat.atlassian.net/browse/RHOAIENG-57822 Fixes maintidx linter failure in `ListLLMs` function that blocks PRs modifying `maas-api/**` files. ## Problem The `ListLLMs` function in `maas-api/internal/handlers/models.go` fails the maintidx linter: - **Cyclomatic Complexity**: 25 (threshold: 20) - **Maintainability Index**: 19 (threshold: 20) ## Root Cause The function grew complex over time with multiple conditional branches, nested loops, and error handling paths. Recent changes pushed it slightly over the linter threshold. ## Changes Refactored `ListLLMs` by extracting helper methods: - `extractAndValidateAuth()` - handles authorization header validation - `getUserContextIfNeeded()` - retrieves user context from middleware - `aggregateModelsFromSubscriptions()` - filters and aggregates models across subscriptions **Metrics improvement:** - Cyclomatic Complexity: 25 → **<20** ✅ - Maintainability Index: 19 → **>20** ✅ ## Testing - ✅ Lint passes with 0 issues - ✅ All unit tests pass (80.3% coverage) - ✅ Function behavior unchanged (backward compatible) ## Impact - Unblocks PR opendatahub-io#694 and other PRs that modify `maas-api/**` - Improves code maintainability and testability - No user-facing changes (pure refactoring) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
<!--- Provide a general summary of your changes in the Title above --> ## Description <!--- Describe your changes in detail --> Add the Perses dashboard and datasource that were added in opendatahub-io#624 to the ODH overlay ## How Has This Been Tested? <!--- Please describe in detail how you tested your changes. --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> It was tested manually with a custom image of the ODH operator that checks the existence of Perses CRDs and sets owner references on the Perses resource ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [x] The commits are squashed in a cohesive manner and have meaningful messages. - [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [x] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Observability dashboards are now included in the deployment so dashboards are available by default. * **Chores** * Prometheus datasource renamed and all dashboard references updated for consistent datasource resolution. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Arik Hadas <ahadas@redhat.com>
…-io#694) Related to https://redhat.atlassian.net/browse/RHOAIENG-57159 ## Summary Fix all validation errors and warnings in `maas-api/openapi3.yaml` discovered by Spectral linting (introduced in opendatahub-io#693). ## Changes ### Errors Fixed (4 total) 1. **Line 177** - `paths./v1/models.get.responses[500]`: Add missing `type` field to ErrorResponse example 2. **Line 456** - `paths./v1/api-keys/bulk-revoke.post.responses[403]`: Add missing `type` field to ErrorResponse example 3. **Line 532** - `paths./v1/subscriptions.get`: Add missing `priority` and `model_refs` fields to SubscriptionListItem example 4. **Line 566** - `paths./v1/model/{model-id}/subscriptions.get`: Add missing `priority` and `model_refs` fields to SubscriptionListItem example ### Warnings Fixed (8 total) 1. **Line 2** - Add complete contact information (name, url, email) to `info` section 2. **Line 2** - Add Apache 2.0 license with URL to `info` section 3. **Lines 181, 284, 399, 460, 484** - Add missing `api-keys-v2` tag definition (used by 5 operations) 4. **Line 925** - Reorder tags alphabetically (api-keys, api-keys-v2, health, models, subscriptions) ## Validation Results **Before:** ``` ✖ 18 problems (4 errors, 8 warnings, 0 infos, 6 hints) ``` **After:** ``` ✖ 6 problems (0 errors, 0 warnings, 0 infos, 6 hints) ``` The remaining 6 hints are from the custom `maas-subscription-header` rule (informational only, not blocking). 
## Test Plan - [x] Run `spectral lint maas-api/openapi3.yaml` locally - passes with 0 errors, 0 warnings - [x] CI OpenAPI validation workflow passes (depends on opendatahub-io#693 merging first) ## Related - Depends on opendatahub-io#693 (OpenAPI validation infrastructure) - Fixes all issues identified by the new CI validation 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced API documentation with contact and license information. * Updated error response examples to include structured error types. * Extended subscription endpoint responses with additional fields for priority and model references. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…-io#738) Follow-up to [733](opendatahub-io#733) and [724](opendatahub-io#724) that continues refactoring, condensing, and consolidating the test suite for easier maintenance and a clearer flow for future work. ## Description 1. Centralize shared helpers in `test_helper.py` and add a comprehensive docstring documenting all env vars. 2. Rename and enhance wait helpers for clarity: - `_wait_for_authpolicy_phase()` → `_wait_for_maas_auth_policy_phase()` (added `require_enforced` param) - `_wait_for_subscription_phase()` → `_wait_for_maas_subscription_phase()` (added `require_model_statuses` param) - Remove the now-redundant `_wait_for_maas_auth_policy_ready()` and `_wait_for_maas_subscription_ready()` convenience wrappers 3. Remove local duplicates from consumer test files (`test_subscription.py`, `test_negative_security.py`, `test_external_models.py`, `test_models_endpoint.py`, `test_namespace_scoping.py`, `test_subscription_list_endpoints.py`, and `test_api_keys.py`) so they import shared functions/constants from `test_helper` instead of defining their own copies (follows from point 1). 4. Switch from `kubectl` to `oc` for consistency. ## How Has This Been Tested? Tests passing. ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [ ] The commits are squashed in a cohesive manner and have meaningful messages. - [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [ ] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Tests** * Consolidated end-to-end test utilities into a shared helper for consistency and reduced duplication. 
* Added helpers for service-account cleanup, resilient resource listing/snapshotting, related-resource lookups, and token rate-limit verification. * Replaced multiple "ready" waiters with generalized, phase-based wait helpers and unified defaults via shared constants. * Updated test docstrings to reference centralized environment/prerequisite documentation and removed file-specific env listings. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Yuriy Teodorovych <Yuriy@ibm.com>
…atahub-io#749) ## Summary - Detect operator type (RHOAI/ODH) and clean MaaS resources from the correct application namespace - Delete MaaS resources individually from `redhat-ods-applications` instead of relying on namespace deletion (the namespace is operator-managed and should not be deleted) - Delete AuthConfig CRs cluster-wide before policy engine namespace removal to prevent InstallPlan failures when switching engines (e.g. community Kuadrant to RHCL) - Delete GatewayClass `openshift-default` in gateway cleanup ## Context Found during deployment testing on RHOAI 3.3.1 clusters. After running `cleanup-odh.sh`, 19 MaaS resources remained in `redhat-ods-applications` because the script only deleted the `opendatahub` namespace. Old AuthConfig CRs also blocked RHCL installs due to CRD schema incompatibility. ## Test plan - [ ] Run cleanup on a RHOAI cluster with MaaS deployed, verify no MaaS resources remain in `redhat-ods-applications` - [ ] Run cleanup on an ODH cluster, verify existing behavior is preserved - [ ] Run `deploy.sh` after cleanup, verify deployment succeeds without manual intervention - [ ] Verify cleanup works when switching from community Kuadrant to RHCL Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved operator detection for OpenDataHub and Red Hat AI installations * Enhanced cleanup process to more thoroughly remove associated resources and prevent reinstallation issues * Better cleanup verification output to confirm removal of resources <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ob (opendatahub-io#751) ## Summary Add `maas-api-key-cleanup-image` to `params.env` and wire it via kustomize replacement into the cleanup CronJob, enabling the ODH operator to override the image at deploy time. ## Description The `maas-api-key-cleanup` CronJob currently uses a hardcoded `registry.redhat.io/ubi9/ubi-minimal:9.7` image for the curl-based API key cleanup. This means the ODH operator has no way to override it with a pinned SHA digest at deploy time. - Add `maas-api-key-cleanup-image` key to `deployment/overlays/odh/params.env` with the default ubi-minimal image. - Add a kustomize replacement in `deployment/components/shared-patches/kustomization.yaml` that wires `data.maas-api-key-cleanup-image` from the `maas-parameters` ConfigMap into the CronJob container image field. - This enables the operator's `ApplyParams()` to substitute the image via `RELATED_IMAGE_UBI_MINIMAL_IMAGE` (from the bundle CSV), ensuring pinned SHA digests in production and support for disconnected environments. **Companion changes required:** - [RHOAI-Build-Config PR #19203](red-hat-data-services/RHOAI-Build-Config#19203) — adds `RELATED_IMAGE_UBI_MINIMAL_IMAGE` to `additional-images-patch.yaml` - opendatahub-operator — adds `"maas-api-key-cleanup-image": "RELATED_IMAGE_UBI_MINIMAL_IMAGE"` to `imagesMap` in `modelsasservice_support.go` ## How It Was Tested - Verified kustomize build renders the CronJob with the image from `params.env`. - Without the operator change, the CronJob uses the default value from `params.env` (same image as today — no behavioral change). Made with [Cursor](https://cursor.com) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Added configuration for a new API key cleanup task in the deployment environment. Updated deployment settings to include a dedicated container image for cleanup operations. 
<!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Chaitanya Kulkarni <ckulkarn@redhat.com> Signed-off-by: Chaitanya Kulkarni <chkulkar@redhat.com>
## Problem The HTTPRoute header-based rule matches `X-Gateway-Model-Name` against the ExternalModel `metadata.name`. This breaks when `targetModel` differs from the name (e.g., Bedrock models: `name=my-bedrock`, `targetModel=openai.gpt-oss-20b`). The user sends `targetModel` in the request body. BBR's `body-field-to-header` plugin extracts it as `X-Gateway-Model-Name`. After ClearRouteCache, the header doesn't match → `route_not_found`. ## Fix Pass `targetModel` to `buildHTTPRoute` and use it in the header match value instead of `name`. ## Changes - `reconciler.go`: pass `extModel.Spec.TargetModel` to `buildHTTPRoute` - `resources.go`: accept `targetModel` param, use in header match - `resources_test.go`: update existing test, add test case where targetModel differs from name ## Tested On RHOAI cluster: - `llm/my-bedrock` (targetModel: `openai.gpt-oss-20b`) → Bedrock 200 ✓ - Header match correctly uses `openai.gpt-oss-20b` not `my-bedrock` Fixes opendatahub-io#745 ```release-note NONE ``` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **Improvements** * Enhanced HTTP routing logic for external models to separately use target model identifiers in request matching, enabling more precise routing when the model name differs from its target model designation. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
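The fix described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual `buildHTTPRoute` code: the `headerMatch` type and `modelHeaderMatch` function are hypothetical stand-ins for the Gateway API types, and the fallback to `name` when `targetModel` is empty is an assumption the PR does not state.

```go
package main

import "fmt"

// headerMatch is a hypothetical, simplified stand-in for an HTTPRoute header match.
type headerMatch struct {
	Name  string
	Value string
}

// Header that BBR's body-field-to-header plugin sets from the request body.
const modelHeader = "X-Gateway-Model-Name"

// modelHeaderMatch builds the header match for an ExternalModel route.
// It matches on targetModel, because that is what the client sends in the body.
// Falling back to name when targetModel is empty is an assumption of this sketch.
func modelHeaderMatch(name, targetModel string) headerMatch {
	value := targetModel
	if value == "" {
		value = name
	}
	return headerMatch{Name: modelHeader, Value: value}
}

func main() {
	m := modelHeaderMatch("my-bedrock", "openai.gpt-oss-20b")
	fmt.Printf("%s: %s\n", m.Name, m.Value)
}
```

With the Bedrock example from the PR, the match value becomes the `targetModel`, so the header extracted from the request body lines up with the route rule.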
…ndatahub-io#742) ## Description Related to - https://redhat.atlassian.net/browse/RHOAIENG-58233 Fixes the bug where maas-controller incorrectly reports "namespace already exists" when the `models-as-a-service` namespace is in `Terminating` phase during RHOAI reinstall/upgrade, leaving the controller running without its subscription namespace and MaaS non-functional. ## Root Cause The `ensureSubscriptionNamespaceExists` function discarded the namespace object and never checked `ns.Status.Phase`. When a namespace is `Terminating`, the GET succeeds (no error), so the function incorrectly assumed the namespace was ready. ```go // Before (buggy) _, err = clientset.CoreV1().Namespaces().Get(ctx, namespace, metav1.GetOptions{}) if err == nil { setupLog.Info("subscription namespace already exists", "namespace", namespace) return nil // Bug: namespace might be Terminating } ``` ## Solution Implemented a comprehensive fix with three components: ### 1. Enhanced Startup Logic (`ensureSubscriptionNamespaceWithClient`) - Captures namespace object and checks `Status.Phase` - If `Terminating`: waits up to 90s for deletion, then recreates - If `Active`: returns early (namespace is ready) - Handles operator recreation during wait (race condition) ### 2. Runtime Monitoring (`subscriptionNamespaceMonitor`) - Periodically re-checks namespace (30s interval, configurable) - Auto-recreates if namespace deleted while controller running - Respects leader election (only leader runs monitor) - Resilient error handling (logs errors, continues) ### 3. 
Readiness Reporting (`checkSubscriptionNamespaceReady`) - Integrated into `/readyz` endpoint - Returns not-ready if namespace missing or Terminating - Uncached check for accurate state reflection - Kubernetes won't route traffic when not-ready ## Edge Cases Handled - ✅ Namespace exists and is Active → return early - ✅ Namespace exists and is Terminating → wait for deletion, recreate - ✅ Namespace doesn't exist → create with retry - ✅ Forbidden on GET → assume operator-managed (existing behavior) - ✅ Forbidden during termination poll → assume external management - ✅ Timeout waiting for termination → fail with clear error - ✅ Namespace recreated during poll → detect Active, return - ✅ Unexpected errors during poll → fail fast with context - ✅ AlreadyExists on CREATE → treat as success - ✅ Forbidden on CREATE → permanent error with guidance ## Live Testing Results Tested on cluster: `api.ci-ln-5zrhd3b-76ef8.aws-4.ci.openshift.org:6443` ### ✅ Scenario 1: Startup with Terminating Namespace (Original Bug) **Test Steps:** 1. Created `models-as-a-service` namespace with finalizer 2. Deleted namespace (went to `Terminating` state) 3. Deployed controller while namespace was `Terminating` **Results:** ```json {"msg":"subscription namespace is terminating, waiting for deletion to complete","namespace":"models-as-a-service"} {"msg":"terminating namespace has been deleted","namespace":"models-as-a-service"} {"msg":"subscription namespace not found, attempting to create it","namespace":"models-as-a-service"} {"msg":"subscription namespace ready","namespace":"models-as-a-service"} {"msg":"starting manager"} ``` - ✅ Controller detected Terminating state - ✅ Waited 22 seconds for deletion - ✅ Recreated namespace successfully - ✅ Namespace has correct label: `opendatahub.io/generated-namespace: "true"` --- ### ✅ Scenario 2: Runtime Monitoring (Auto-Recovery) **Test Steps:** 1. Deployed controller with namespace `Active` 2. Deleted namespace while controller was running 3. 
Monitored automatic recreation

**Results:**

```
17:27:05 - Monitor check: namespace Active
17:27:29 - Namespace deleted (manual deletion)
17:27:35 - Monitor detected Terminating (30s cycle)
17:27:37 - Namespace recreated and ready
```

- ✅ Monitor detected deletion within 6 seconds (next 30s cycle)
- ✅ Auto-recreated namespace in ~8 seconds total
- ✅ No manual intervention needed
- ✅ Namespace has correct label

---

### ✅ Scenario 3: Readiness Reporting (Observability)

**Test Steps:**

1. Checked `/readyz` endpoint in different namespace states

**Results:**

| Namespace State | Readiness Endpoint | Pod Ready | Expected |
|----------------|-------------------|-----------|----------|
| **Active** | `ok` | `True` | ✅ |
| **Terminating** | `failed: reason withheld` | `False` | ✅ |
| **Recreated** | `ok` | `True` | ✅ |

- ✅ Pod correctly reported Not Ready during namespace Terminating
- ✅ Readiness endpoint accurately reflects namespace state
- ✅ Kubernetes won't route traffic when not-ready

---

## Configuration

New flag added:

```
--subscription-namespace-maintain-interval (default: 30s)
    How often to re-check the subscription namespace while running.
    Larger values reduce apiserver load; smaller values detect deletions sooner.
```

## Merge Criteria

- [x] The commits are squashed in a cohesive manner and have meaningful messages
- [x] Testing instructions have been added in the PR body
- [x] The developer has manually tested the changes and verified that the changes work on live cluster
- [x] All edge cases are handled with comprehensive error handling
- [x] Readiness probes accurately reflect system state

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * Added continuous namespace monitoring that automatically recreates the subscription namespace if deleted during manager operation. 
* Introduced new `--subscription-namespace-maintain-interval` CLI flag to configure monitoring frequency. * **Bug Fixes** * Improved namespace startup logic to safely wait for terminating namespaces (up to 90 seconds) before proceeding. * **Chores** * Refactored internal client initialization for better resource reuse across startup and monitoring components. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com> Co-authored-by: Mynhardt Burger <mynhardt@gmail.com>
…uadrant TokenRateLimitPolicy (opendatahub-io#750) https://redhat.atlassian.net/browse/RHOAIENG-58408 ## Description - Tightened kubebuilder/OpenAPI validation on `TokenRateLimit.Window`: the Go type pattern changed from `^(\d+)(s|m|h|d)$` to `^[1-9]\d{0,3}(s|m|h)$`. - Regenerated the CRD (`maas.opendatahub.io_maassubscriptions.yaml` now carries the new pattern and an expanded description). - Documented the allowed units and added a migration note in the CRD reference doc (`maas-subscription.md`) and the OpenAPI spec (`openapi3.yaml`). - Added tests. ## How Has This Been Tested? An additional test suite was introduced. ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [x] The commits are squashed in a cohesive manner and have meaningful messages. - [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [x] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated token rate limit "window" docs: only seconds (s), minutes (m), hours (h) allowed; numeric range 1–9999. Days (d) no longer supported; use hours instead (e.g., 24h). * **API / Schema** * CRD/OpenAPI schemas now enforce the new window pattern and string length constraints (2–5 characters). * **Tests** * Added unit and end-to-end tests covering the tightened window validation. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Yuriy Teodorovych <Yuriy@ibm.com>
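To see what the tightened pattern accepts and rejects, the regular expression from this change can be exercised directly. This is a small sketch using Go's standard `regexp` package; the `validWindow` helper is a hypothetical name, and the real validation happens in the CRD/OpenAPI schema rather than in code like this.

```go
package main

import (
	"fmt"
	"regexp"
)

// windowPattern is the new pattern from the PR: 1 to 4 digits with no
// leading zero, followed by s, m, or h.
var windowPattern = regexp.MustCompile(`^[1-9]\d{0,3}(s|m|h)$`)

// validWindow reports whether w is an acceptable rate limit window.
func validWindow(w string) bool {
	return windowPattern.MatchString(w)
}

func main() {
	for _, w := range []string{"1m", "24h", "9999s", "1d", "0s", "10000s", "01h"} {
		fmt.Printf("%-7s valid=%v\n", w, validWindow(w))
	}
}
```

Values such as `1d` that passed the old pattern are rejected by the new one, which is why the docs suggest `24h` as the replacement.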
Automated promotion of **32 commit(s)** from `main` to `stable`. ``` 6190476 fix: align MaaSSubscription token rate limit window validation with Kuadrant TokenRateLimitPolicy (opendatahub-io#750) 3d93d30 fix: handle Terminating namespace during RHOAI reinstall/upgrade (opendatahub-io#742) ffcb990 fix: use targetModel in HTTPRoute header match (opendatahub-io#753) 4681ffd feat(kustomize): add operator-managed image for api key cleanup cronjob (opendatahub-io#751) 9b4672a fix: cleanup script handles RHOAI namespace and AuthConfig CRs (opendatahub-io#749) 5afc49c refactor: refactor and consolidate test helper functions (opendatahub-io#738) 7b81a7e fix: resolve OpenAPI spec validation errors and warnings (opendatahub-io#694) 93bc750 feat: deploy admin-usage dashboard via ODH (opendatahub-io#686) fa03af7 refactor: reduce ListLLMs complexity to pass maintidx linter (opendatahub-io#739) d91dc39 test: fix flaky test_rate_limit_exhaustion_gets_429 (opendatahub-io#730) 05f08df feat: enable group testing for MaaS components (opendatahub-io#741) 0ed1fb3 fix: add ns prefix to ExternalModel HTTPRoute path for llmisvc parity (opendatahub-io#709) c93e879 feat(e2e): collect full MaaS CR definitions and RHOAI namespace logs (opendatahub-io#740) 5b9127e fix: avoid duplicate deployments of controller in deploy.sh (opendatahub-io#732) 6842e33 fix: restore /v1/models rate limiting exemption (opendatahub-io#729) beced7f fix: patch params.env for custom image injection in kustomize mode (opendatahub-io#731) b5b5afb test: fix `test_subscription_status_transitions_on_model_deletion()` (opendatahub-io#733) c5468b2 feat: add RBAC aggregation for namespace users (opendatahub-io#716) b77630b test: expand negative-path and security-focused E2E tests (opendatahub-io#724) e069c5f fix: mitigate authorization timing race in /v1/models listing (opendatahub-io#549) 6d31fd8 test(e2e): enable unconfigured model deny-by-default test (opendatahub-io#728) 99bcd1b fix: replace third-party curl image with 
UBI-based image for disconnected support (opendatahub-io#706) 08ff5b4 ci: add OpenAPI validation and automation infrastructure (opendatahub-io#693) b265e54 feat: add E2E tests for external models (egress) (opendatahub-io#632) bbfea0d chore: update smoke.sh to use API Keys (opendatahub-io#573) 8f22073 chore(deps): bump google.golang.org/grpc from 1.75.1 to 1.79.3 in /maas-api (opendatahub-io#566) 95f8645 chore(docs): document shared HTTPRoute TRLP limitation and cross-links (opendatahub-io#727) e3da035 feat(maas-controller): add granular status reporting for MaaSSubscription and MaaSAuthPolicy (opendatahub-io#714) 36dbeb4 feat: add Granite Model that can work on CPU (opendatahub-io#723) bbaa45a fix: correct AuthPolicy name in validation script (opendatahub-io#658) (opendatahub-io#659) ```
Automated promotion of **33 commit(s)** from `stable` to `rhoai`. ``` 6190476 fix: align MaaSSubscription token rate limit window validation with Kuadrant TokenRateLimitPolicy (opendatahub-io#750) 3d93d30 fix: handle Terminating namespace during RHOAI reinstall/upgrade (opendatahub-io#742) ffcb990 fix: use targetModel in HTTPRoute header match (opendatahub-io#753) 4681ffd feat(kustomize): add operator-managed image for api key cleanup cronjob (opendatahub-io#751) 9b4672a fix: cleanup script handles RHOAI namespace and AuthConfig CRs (opendatahub-io#749) 5afc49c refactor: refactor and consolidate test helper functions (opendatahub-io#738) 7b81a7e fix: resolve OpenAPI spec validation errors and warnings (opendatahub-io#694) 93bc750 feat: deploy admin-usage dashboard via ODH (opendatahub-io#686) fa03af7 refactor: reduce ListLLMs complexity to pass maintidx linter (opendatahub-io#739) d91dc39 test: fix flaky test_rate_limit_exhaustion_gets_429 (opendatahub-io#730) 05f08df feat: enable group testing for MaaS components (opendatahub-io#741) 0ed1fb3 fix: add ns prefix to ExternalModel HTTPRoute path for llmisvc parity (opendatahub-io#709) c93e879 feat(e2e): collect full MaaS CR definitions and RHOAI namespace logs (opendatahub-io#740) 5b9127e fix: avoid duplicate deployments of controller in deploy.sh (opendatahub-io#732) 6842e33 fix: restore /v1/models rate limiting exemption (opendatahub-io#729) beced7f fix: patch params.env for custom image injection in kustomize mode (opendatahub-io#731) b5b5afb test: fix `test_subscription_status_transitions_on_model_deletion()` (opendatahub-io#733) c5468b2 feat: add RBAC aggregation for namespace users (opendatahub-io#716) b77630b test: expand negative-path and security-focused E2E tests (opendatahub-io#724) e069c5f fix: mitigate authorization timing race in /v1/models listing (opendatahub-io#549) 6d31fd8 test(e2e): enable unconfigured model deny-by-default test (opendatahub-io#728) 99bcd1b fix: replace third-party curl image with 
UBI-based image for disconnected support (opendatahub-io#706) 08ff5b4 ci: add OpenAPI validation and automation infrastructure (opendatahub-io#693) b265e54 feat: add E2E tests for external models (egress) (opendatahub-io#632) bbfea0d chore: update smoke.sh to use API Keys (opendatahub-io#573) 8f22073 chore(deps): bump google.golang.org/grpc from 1.75.1 to 1.79.3 in /maas-api (opendatahub-io#566) 95f8645 chore(docs): document shared HTTPRoute TRLP limitation and cross-links (opendatahub-io#727) e3da035 feat(maas-controller): add granular status reporting for MaaSSubscription and MaaSAuthPolicy (opendatahub-io#714) 36dbeb4 feat: add Granite Model that can work on CPU (opendatahub-io#723) bbaa45a fix: correct AuthPolicy name in validation script (opendatahub-io#658) (opendatahub-io#659) ```
…-io#721) ## Description This PR implements subscription health enforcement at the authentication/authorization layer, ensuring traffic is denied when a subscription is not in an acceptable state. **Jira:** https://redhat.atlassian.net/browse/RHOAIENG-57234 ### Main Feature: Auth Layer Rejection **OPA Rule Update:** - Blocks subscriptions in `Failed` or `Pending` phases from making any requests - Returns 403 Forbidden with clear error message when subscription is unhealthy - Enforces subscription health consistently at the same layer as other auth decisions **Subscription Selector (maas-api):** - Consumes subscription phase and modelRefStatuses from controller (PR opendatahub-io#714) - Returns appropriate errors for Failed/Pending subscriptions before OPA evaluation - Validates subscription health during the selection process ### Enhancement: Active Filtering for Degraded Subscriptions Beyond the core requirement, this PR also implements granular filtering for Degraded subscriptions: - Degraded subscriptions can still access **healthy** models (ready: true in modelRefStatuses) - Requests to **unhealthy** models within Degraded subscriptions are blocked with clear error - This allows partial service when some models are unavailable rather than blocking everything **Rationale:** If a subscription has 3 healthy models and 1 broken model, users should still be able to access the 3 healthy models. Complete blocking would be unnecessarily restrictive. 
### Example Behavior

**Failed Subscription:**

```yaml
status:
  phase: Failed
  conditions:
    - type: Ready
      status: "False"
      reason: ReconcileFailed
```

- ❌ All requests rejected at auth layer with 403 Forbidden

**Degraded Subscription:**

```yaml
status:
  phase: Degraded
  modelRefStatuses:
    - name: llama-model
      ready: true
    - name: broken-model
      ready: false
      reason: NotFound
```

- ✅ Requests to llama-model succeed (healthy model)
- ❌ Requests to broken-model blocked with error: "model not available in subscription (reason: model not healthy)"

**Active Subscription:**

```yaml
status:
  phase: Active
  modelRefStatuses:
    - name: llama-model
      ready: true
```

- ✅ All requests allowed per existing policy rules

## How Has This Been Tested?

### Automated Tests (E2E - all passing ✅)

**Core Requirement Tests:**

- `test_failed_subscription_blocks_inference`
  - Verifies Failed subscriptions are rejected at auth layer
  - Tests recovery: subscription returns to Active → requests allowed
- `test_subscriptions_endpoint_shows_degraded_health`
  - Verifies /v1/subscriptions correctly reports subscription health

**Active Filtering Tests:**

- `test_degraded_healthy_model_allows_inference`
  - Degraded subscription with healthy model → inference succeeds
- `test_degraded_unhealthy_model_blocks_inference`
  - Degraded subscription with unhealthy model → request blocked
- `test_models_endpoint_with_degraded_subscription_api_key`
  - Verifies /v1/models endpoint with Degraded subscription (API key auth)
- `test_models_endpoint_with_degraded_subscription_kube_token`
  - Verifies /v1/models endpoint with Degraded subscription (Kube token auth)

### Manual Verification

Tested on live cluster:

1. Created subscription with all invalid models → Failed phase
   - Verified: All inference requests rejected with 403
2. Updated subscription to have 1 valid model → Degraded phase
   - Verified: Subscription enters Degraded state
   - Verified: Inference to valid model succeeds
   - Verified: Inference to invalid model blocked with clear error
3. Fixed all models → Active phase
   - Verified: All models accessible
4. Tested both API key and Kubernetes token authentication paths

Client-facing behavior:

- HTTP 403 for Failed/Pending subscriptions (consistent with other auth failures)
- Clear error messages that don't expose internal implementation details
- Error response format matches existing API error structure

### Unit Tests

- Updated selector tests to verify phase-based rejection
- Tests cover all phase/model health combinations
- Validates error messages and HTTP status codes

## Dependencies

This PR depends on PR opendatahub-io#714 which implements the phase and modelRefStatuses fields in MaaSSubscription status. The PR should be rebased on main after opendatahub-io#714 merges.

## Documentation

No documentation updates needed in this PR - the behavior is transparent to end users:

- Failed/Pending subscriptions are rejected (expected behavior for unhealthy resources)
- Error messages are self-explanatory
- Operator documentation for subscription health is covered in PR opendatahub-io#714

## Acceptance Criteria Met

- ✅ Given a MaaSSubscription in Failed/Pending state, When client presents valid credentials, Then request is rejected at auth layer
- ✅ Given subscription returns to Active/Degraded state, When client retries, Then requests are allowed per existing rules
- ✅ Given rejected request due to subscription state, When client inspects response, Then response does not expose internal details
- ✅ Automated E2E tests cover: unhealthy subscription → denied; recovery → allowed
- ✅ Manual verification steps documented above

## Merge criteria:

- The commits are squashed in a cohesive manner and have meaningful messages.
- Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added "Degraded" phase and richer status surfaces: per-model and per-policy status entries exposing ready/reason/message and deletionTimestamp. * **Improvements** * API selection and behavior now consider subscription/model health (fail-closed for unhealthy models); Create API key logs non-blocking info when subscription is non-active or deleting. * **Tests** * Expanded unit and e2e coverage for status reporting, degraded/failed phases, and selection/filtering logic. * **Documentation** * Updated troubleshooting and docs with phase semantics and kubectl examples. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Ishita Sequeira <ishiseq29@gmail.com> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
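The phase-based behavior described in this PR (reject `Failed`/`Pending`, filter per model for `Degraded`, allow `Active`) can be summarized as a single decision function. This is an illustrative sketch only: the real enforcement is split between an OPA rule and the maas-api subscription selector, and the `subscription`, `modelRefStatus`, and `allowRequest` names, as well as the denial of unknown phases and unlisted models, are assumptions of this example.

```go
package main

import "fmt"

// modelRefStatus mirrors one entry of status.modelRefStatuses (simplified).
type modelRefStatus struct {
	Name  string
	Ready bool
}

// subscription holds only the status fields this sketch needs.
type subscription struct {
	Phase            string
	ModelRefStatuses []modelRefStatus
}

// allowRequest reports whether a request for model passes the health check,
// with a short reason when it does not.
func allowRequest(sub subscription, model string) (bool, string) {
	switch sub.Phase {
	case "Failed", "Pending":
		return false, "subscription not healthy"
	case "Active":
		// Other policy rules still apply; this check alone does not block.
		return true, ""
	case "Degraded":
		for _, s := range sub.ModelRefStatuses {
			if s.Name == model && s.Ready {
				return true, ""
			}
		}
		// Fail closed: unhealthy or unlisted models are blocked (assumption for unlisted).
		return false, "model not healthy"
	default:
		// Assumption: unknown phases are denied.
		return false, "unknown subscription phase"
	}
}

func main() {
	sub := subscription{
		Phase: "Degraded",
		ModelRefStatuses: []modelRefStatus{
			{Name: "llama-model", Ready: true},
			{Name: "broken-model", Ready: false},
		},
	}
	for _, m := range []string{"llama-model", "broken-model"} {
		ok, reason := allowRequest(sub, m)
		fmt.Printf("%s allowed=%v reason=%q\n", m, ok, reason)
	}
}
```

The `Degraded` branch is what allows partial service: a subscription with one broken model still serves the healthy ones.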
<!--- Provide a general summary of your changes in the Title above --> ## Description Database Prerequisites moved to setup. Updated the links. https://redhat.atlassian.net/browse/RHOAIENG-55130 ## How Has This Been Tested? <!--- Please describe in detail how you tested your changes. --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> Manual verification on https://github.com/jrhyness/models-as-a-service/blob/jr_55130/maas-api/README.md ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [ ] The commits are squashed in a cohesive manner and have meaningful messages. - [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [ ] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated database-related documentation links in the API README to direct users to the current production deployment setup guide. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…ate (opendatahub-io#756) <!--- Provide a general summary of your changes in the Title above --> ## Description <!--- Describe your changes in detail --> The text said "Set modelsAsService to Unmanaged", but the YAML below it shows `managementState: Removed`. Updated the text to match the YAML; `Unmanaged` is not a supported state. https://redhat.atlassian.net/browse/RHOAIENG-55132 ## How Has This Been Tested? <!--- Please describe in detail how you tested your changes. --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [ ] The commits are squashed in a cohesive manner and have meaningful messages. - [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [ ] The developer has manually tested the changes and verified that the changes work
## Summary This PR syncs security scanning configuration files from the central [security-config](https://github.com/opendatahub-io/security-config) repository, managed by the [@opendatahub-io/odh-platform-security](https://github.com/orgs/opendatahub-io/teams/odh-platform-security) team. ## Files | File | Status | |------|--------| | `semgrep.yaml` | Updated | ## What does this mean for your team? - **No action required from reviewers** beyond merging this PR - These files are **protected by an org-level push ruleset** — they cannot be modified directly in this repo - Future updates will be synced automatically via PRs from the `security-config` repo - CodeRabbit and Semgrep will use these configs when reviewing PRs on this repo For questions or customization requests, open an issue on [opendatahub-io/security-config](https://github.com/opendatahub-io/security-config). Co-authored-by: security-config-sync[bot] <265242129+security-config-sync[bot]@users.noreply.github.com>
…endatahub-io#758) [UX conversation ask](https://redhat-internal.slack.com/archives/C069KSM8T9N/p1776362532709879?thread_ts=1776354678.333879&cid=C069KSM8T9N) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Style** * Updated dashboard display labels for improved clarity. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary Documentation refresh with clearer **request flows** (especially key minting and high-level architecture) and a new **personas** narrative backed by a resource-model diagram. ## Changes - **Architecture (`architecture.md`)** — Tightened overview (main components as bullets, authorization/rate-limiting framing), clarified Gateway / Kuadrant / Authorino / Limitador / maas-api, updated main-flow diagram (colors, `MaaSModelRef`, Tech Preview / external path). **Key minting** is a **single** flow + diagram: validation and minting combined; **forward + user context** from **AuthPolicy** to MaaS API; show-once key response described in prose (not shown on the diagram). Other sections updated only where they align with the same diagrams or wording. - **Personas (`concepts/personas.md`)** — Page structured around **cluster operators**, **ODH administrators**, **data scientists / model service owners**, and **API consumers**; embedded resource-model PNG under `docs/content/assets/diagrams/`; `mkdocs.yml` navigation updated. - **Misc** — Cross-links and terminology so diagrams and prose stay consistent. ## Notes for reviewers - Confirm **`docs/content/assets/diagrams/personas-resource-model.png`** is meant to be committed with the repo. - Optional later: **light/dark** diagram variants using Material image URLs with `#only-light` / `#only-dark` when assets exist. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Reorganized documentation structure with new "Concepts" section covering personas, model reference, and architecture. * Added comprehensive guides for external model setup, on-cluster model serving gateway configuration, and RBAC troubleshooting. * Updated API examples to use OpenAI-compatible chat completions endpoint. * Clarified API key expiration model with operator-managed maximum lifetime. * Added ModelsAsService CR configuration documentation. 
* Updated sample model manifests to simulator v0.8.2 with new runtime arguments. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…hub-io#761) Update the /v1/models example responses to match the actual API format: - Replace incorrect "subscription": "string" with "subscriptions": [array] - Add all actual response fields (object, created, owned_by, kind, ready, modelDetails) - Include two examples: API key (single subscription) and user token (multiple subscriptions) ## Description Changes: - API key example shows two models, both with single subscription in array - User token example shows same models with subscription aggregation: one model accessible via two subscriptions, one via a single subscription - Add tip explaining the difference between API key and user token responses - Use consistent model names (llama-2-7b-chat, mixtral-8x7b-instruct) across both examples The previous example used "subscription": "free" (singular string) but the actual API returns "subscriptions": [{name, displayName, description}, ...] (plural array of objects). This mismatch would cause client parsing errors. Resolves: [RHOAIENG-55145](https://redhat.atlassian.net/browse/RHOAIENG-55145) ## How Has This Been Tested? <!--- Please describe in detail how you tested your changes. --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [ ] The commits are squashed in a cohesive manner and have meaningful messages. - [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). 
- [ ] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced "List Available Models" guide with expanded examples for API key and user token authentication. * Updated response examples with additional fields for clearer model information understanding. * Added clarification on subscription behavior differences across authentication methods. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
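The singular-string versus plural-array difference described above can be sketched from the client side. This is a minimal illustration, not the project's client code: the field names `subscriptions`, `name`, `displayName`, `description`, and `ready` come from the commit message, while the `{"object": "list", "data": [...]}` envelope, the `id` key, the `premium` subscription, and all values are assumptions made for the example.

```python
import json

# Hypothetical response body. Field names follow the commit message above;
# the envelope, the "id" key, and every value are assumptions for this sketch.
SAMPLE_RESPONSE = json.loads("""
{
  "object": "list",
  "data": [
    {
      "id": "llama-2-7b-chat",
      "object": "model",
      "ready": true,
      "subscriptions": [
        {"name": "free", "displayName": "Free", "description": "Free tier"},
        {"name": "premium", "displayName": "Premium", "description": "Paid tier"}
      ]
    },
    {
      "id": "mixtral-8x7b-instruct",
      "object": "model",
      "ready": true,
      "subscriptions": [
        {"name": "free", "displayName": "Free", "description": "Free tier"}
      ]
    }
  ]
}
""")


def subscriptions_by_model(payload: dict) -> dict[str, list[str]]:
    """Map each model id to the names of the subscriptions that expose it."""
    result = {}
    for model in payload.get("data", []):
        subs = model.get("subscriptions", [])
        # A client written against the old singular-string example would
        # break here, so fail loudly instead of iterating over characters.
        if not isinstance(subs, list):
            raise TypeError("expected 'subscriptions' to be a list of objects")
        result[model["id"]] = [sub["name"] for sub in subs]
    return result
```

With a user token the same model may list several subscriptions, so a client should treat the field as a list even when it holds a single entry.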
…-io#626)

<!--- Provide a general summary of your changes in the Title above -->

https://redhat.atlassian.net/browse/RHOAIENG-52923

## Description

<!--- Describe your changes in detail -->

Patch Kuadrant CSV when deploying to change Kuadrant behavior to fail-close when Limitador service fails.

## How Has This Been Tested?

<!--- Please describe in detail how you tested your changes. -->
<!--- Include details of your testing environment, and the tests you ran to -->
<!--- see how your change affects other areas of the code, etc. -->

TRLP test script:

```
for i in {1..16}; do
  curl -sSk -o /dev/null -w "%{http_code}\n" "${HOST}/llm/facebook-opt-125m-simulated/v1/chat/completions" \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" -d '{"model":"facebook/opt-125m","messages":[{"role":"user","content":"Hi"}],"max_tokens":50}';
done
```

- Run TRLP test script, got `429` after a few `200`s.
- Scale Limitador pod down to 0, run TRLP test script, got all `200`s.
- Run revised `deploy.sh` to deploy MaaS, then run TRLP test script, got all `500`s.

## Merge criteria:

<!--- This PR will be merged by any repository approver when it meets all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes that apply. -->

- [x] The commits are squashed in a cohesive manner and have meaningful messages.
- [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Added platform-specific guidance for configuring rate-limiting failure behavior when Limitador is unavailable (Open Data Hub and Red Hat OpenShift AI).
* **Chores**
  * Centralized and automated operator CSV updates to ensure gateway-controller integration and enforce rate-limit failure modes; post-install now consistently applies patches, restarts/reconciles components as needed, and shows clearer progress messaging.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
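The three observations in the test notes above (a `429` after some `200`s, all `200`s, all `500`s) can be read mechanically. The helper below is a hypothetical sketch for interpreting the status codes printed by the test loop; the function name and the category labels are assumptions for illustration, not part of the repository.

```python
def classify_run(status_codes: list[int]) -> str:
    """Interpret the HTTP status codes printed by one run of the test loop.

    Labels are hypothetical and only mirror the three outcomes reported
    in the test notes above.
    """
    if not status_codes:
        return "inconclusive"
    if all(500 <= code <= 599 for code in status_codes):
        # Every request rejected with a server error: the limiter is
        # unreachable and the gateway refuses traffic (fail-closed).
        return "fail-closed"
    if 429 in status_codes:
        # The limit was reached, so the limiter is up and enforcing.
        return "rate-limited"
    if all(code == 200 for code in status_codes):
        # Nothing was limited: either the limit was never reached or the
        # gateway lets traffic through while the limiter is down (fail-open).
        return "not-limited"
    return "inconclusive"
```

Note that an all-`200` run only indicates fail-open behavior when the same request count is known to trigger a `429` while Limitador is running, as in the first test step.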
…atahub-io#757) Add comprehensive documentation for the --cluster-audience flag and all other CLI flags in the maas-controller README. This flag is critical for HyperShift/ROSA clusters that use custom OIDC provider URLs. ## Description Changes: - Add CLI Flags table with all available flags and their defaults - Add dedicated section for HyperShift/ROSA cluster configuration - Document how to find cluster's OIDC audience - Show two methods to configure cluster-audience (params.env and kubectl patch) - Update Other Configuration section to reference params.env consistently Resolves: [RHOAIENG-55116](https://redhat.atlassian.net/browse/RHOAIENG-55116) ## How Has This Been Tested? <!--- Please describe in detail how you tested your changes. --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [ ] The commits are squashed in a cohesive manner and have meaningful messages. - [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [ ] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added CLI Flags configuration subsection with documentation on command-line flags and their defaults, configured via kustomize. * Introduced dedicated guidance for HyperShift/ROSA Clusters configuration, including cluster audience override instructions and kubectl commands for OIDC audience extraction. * Updated configuration section with explicit parameter mappings for customizing subscription namespace, controller image, and gateway settings via configuration files. 
<!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove incorrect "Stub: not yet implemented" text and document actual ExternalModel behavior. ExternalModel has been fully implemented with ~230 lines of working code in providers_external.go since the initial implementation. ## Description Changes: - Replace stub description with accurate behavior documentation - Document that ExternalModel references an ExternalModel CR for provider configuration (OpenAI, Anthropic, etc.) - Explain HTTPRoute validation flow (created by ExternalModel controller, validated by MaaSModelRef) - Document readiness criteria (HTTPRoute accepted by gateway) - Remove outdated "Status for unimplemented kinds" paragraph that referenced ExternalModel as an example The ExternalModel provider has been fully functional and registered in providers.go since its introduction. Resolves: [RHOAIENG-55145](https://redhat.atlassian.net/browse/RHOAIENG-55145) ## How Has This Been Tested? <!--- Please describe in detail how you tested your changes. --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [ ] The commits are squashed in a cohesive manner and have meaningful messages. - [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [ ] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Updated ExternalModel provider documentation with complete implementation specifications for endpoint exposure and gateway integration, transitioning from unimplemented status to fully detailed behavior. 
<!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…#735) ## Description Adds the `Tenant` CR (`maas.opendatahub.io/v1alpha1`) and a platform reconciliation pipeline to `maas-controller` so it can render and apply MaaS platform workloads (maas-api, gateway config, auth policies, telemetry). `Tenant` replaces the previous `ModelsAsService` (`components.platform.opendatahub.io/v1alpha1`) as the persisted CR for the MaaS component. ODH still uses the `ModelsAsService` component name internally for enablement checks, labels, and DSC status aggregation, but the object on the cluster is now `Tenant`. This gives `maas-controller` full ownership of platform workload lifecycle while ODH retains control of the component lifecycle (install, enable/disable, cleanup). ### What's included - `Tenant` CRD with API key, external OIDC, gateway, and telemetry configuration - `TenantReconciler`: prerequisites → dependencies → kustomize render → post-render → SSA apply → deployment readiness - Post-render: gateway AuthPolicy/TokenRateLimitPolicy/DestinationRule targeting, external OIDC patching, TelemetryPolicy + IstioTelemetry injection, config-hash rollout annotation - Finalizer with cross-namespace cleanup via tracking labels - Management state support (Managed/Unmanaged/Removed) - Unit tests for reconcile, finalization, singleton enforcement, management states ### Design decisions (based on review feedback) - **Namespace-scoped**: lives in `models-as-a-service` alongside `MaaSSubscription`/`MaaSAuthPolicy`. 
First release with no deployed CRDs — avoids a CRD scope migration later (Kubernetes does not allow changing scope on an existing CRD) - **Self-bootstrap**: `maas-controller` creates the default Tenant on startup; ODH operator's `NewCRObject` is a no-op - **No DSCI dependency**: app namespace derived from `tenant.Namespace` — no cross-operator API calls or extra RBAC - **Cross-namespace ownership**: tracking labels for cluster-scoped/cross-namespace children; `ownerReferences` for same-namespace only - **Singleton via CEL**: `self.metadata.name == 'default-tenant'` — removing the rule later enables multi-tenancy without CRD migration - **Gateway policy alignment**: `gateway-default-auth` (AuthPolicy) and `gateway-default-deny` (TokenRateLimitPolicy) names match actual manifests _Related ODH PR:_ opendatahub-io/opendatahub-operator#3412 ## How Has This Been Tested? - Unit tests for the reconcile entry-point (`maastenant_reconcile_test.go`) - Manual End-to-end testing on ROSA cluster with custom ODH operator + maas-controller images: - Verified self-creation of `default-tenant` in `models-as-a-service` namespace - Platform workloads applied via SSA - Toggled MaaS off/on in DSC to verify cleanup and re-provisioning - CRD namespace scope and CEL singleton enforcement confirmed ## Merge criteria: - [x] The commits are squashed in a cohesive manner and have meaningful messages. - [x] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [x] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Adds a Tenant custom resource with validation, status/phase, and CLI printer columns. * Ships a controller that ensures a singleton default Tenant, reconciles rendered manifests, monitors readiness, and performs safe teardown. 
* **New Features (rendering)** * Rendering/post-processing injects OIDC, telemetry policies, gateway defaults, params, and a deterministic config-hash on deployments. * **Chores** * Expanded RBAC, dependency and linter updates, deployment script improvements, and added reconciliation tests. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: jland <jland@redhat.com>
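The config-hash rollout annotation mentioned above can be illustrated with a short sketch. The commit message does not say which hash function, serialization, or annotation key the controller uses, so SHA-256 over canonical JSON and the `example.com/config-hash` key below are assumptions chosen only to show the idea: identical configuration always yields the same annotation value, and any change produces a new value, which alters the pod template and triggers a rollout.

```python
import hashlib
import json

# Hypothetical annotation key; the real key is not given in the commit message.
HASH_ANNOTATION = "example.com/config-hash"


def config_hash(config: dict) -> str:
    """Return a hash that depends only on the config's content, not key order."""
    # sort_keys plus fixed separators gives one canonical serialization.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def with_config_hash(deployment: dict, config: dict) -> dict:
    """Stamp the hash onto the pod template annotations (mutates in place)."""
    annotations = (
        deployment.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("metadata", {})
        .setdefault("annotations", {})
    )
    annotations[HASH_ANNOTATION] = config_hash(config)
    return deployment
```

Placing the value on the pod template rather than on the Deployment's own metadata matters: only a pod template change causes Kubernetes to roll the pods.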
…hub-io#763) ## Description Add configuration reference for maas-api including: - Environment variables table with 15 configuration options - CLI flags table showing env var mappings - Database configuration note Configuration options documented: - Server config: DEBUG_MODE, NAMESPACE, SECURE, ADDRESS, PORT (deprecated) - Gateway config: GATEWAY_NAME, GATEWAY_NAMESPACE, INSTANCE_NAME - Subscription config: MAAS_SUBSCRIPTION_NAMESPACE - API key config: API_KEY_MAX_EXPIRATION_DAYS - Performance: ACCESS_CHECK_TIMEOUT_SECONDS - TLS config: TLS_CERT, TLS_KEY, TLS_SELF_SIGNED, TLS_MIN_VERSION ## How Has This Been Tested? <!--- Please describe in detail how you tested your changes. --> <!--- Include details of your testing environment, and the tests you ran to --> <!--- see how your change affects other areas of the code, etc. --> ## Merge criteria: <!--- This PR will be merged by any repository approver when it meets all the points in the checklist --> <!--- Go over all the following points, and put an `x` in all the boxes that apply. --> - [ ] The commits are squashed in a cohesive manner and have meaningful messages. - [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious). - [ ] The developer has manually tested the changes and verified that the changes work <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added comprehensive Configuration section documenting all available environment variables and CLI flags for server setup, including debug logging, namespace identification, network configuration, and TLS settings. * Clarified that CLI flags override environment variables and explained how database configuration is sourced from Kubernetes secrets. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
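The precedence rule noted above (CLI flags override environment variables) can be sketched as follows. maas-api itself is not written in Python, and the flag names `--address` and `--namespace` and the fallback defaults are hypothetical; only the environment variable names `ADDRESS` and `NAMESPACE` come from the commit message.

```python
import argparse


def resolve_settings(argv: list[str], env: dict[str, str]) -> dict[str, str]:
    """Resolve settings with precedence: CLI flag > environment variable > default.

    Flag names and built-in defaults are hypothetical placeholders.
    """
    parser = argparse.ArgumentParser()
    # The environment value becomes the flag's default, so an explicit
    # flag on the command line always wins over the environment.
    parser.add_argument("--address", default=env.get("ADDRESS", ":8080"))
    parser.add_argument("--namespace", default=env.get("NAMESPACE", "default"))
    return vars(parser.parse_args(argv))
```

In a real process the `env` argument would be `os.environ` and `argv` would be `sys.argv[1:]`; passing them in explicitly keeps the function easy to test.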
Automated promotion of **12 commit(s)** from `main` to `stable`.

```
8f38b4c docs: document maas-api environment variables and CLI flags (opendatahub-io#763)
dd5474e feat(maas-controller): maas`Tenant` CR and reconciler (opendatahub-io#735)
0e2b91e docs: correct ExternalModel implementation status (opendatahub-io#759)
94b7341 docs: document --cluster-audience CLI flag for maas-controller (opendatahub-io#757)
fa142c4 fix: enforce fail-close logic when limitador pod is down (opendatahub-io#626)
c098422 docs: fix API response format in self-service model listing (opendatahub-io#761)
3a6a9b8 docs: documentation updates (opendatahub-io#687)
ce75696 fix: rename dashboard title and panel name to 'Token Consumption' (opendatahub-io#758)
266a130 chore: sync security config files (opendatahub-io#736)
2ce457c docs: fix instructions to match code for modelsAsService managementState (opendatahub-io#756)
5d06621 docs: fix broken links (opendatahub-io#755)
c789577 feat: reject degraded/failed subscriptions at auth layer (opendatahub-io#721)
```
Automated promotion of **13 commit(s)** from `stable` to `rhoai`.

```
8f38b4c docs: document maas-api environment variables and CLI flags (opendatahub-io#763)
dd5474e feat(maas-controller): maas`Tenant` CR and reconciler (opendatahub-io#735)
0e2b91e docs: correct ExternalModel implementation status (opendatahub-io#759)
94b7341 docs: document --cluster-audience CLI flag for maas-controller (opendatahub-io#757)
fa142c4 fix: enforce fail-close logic when limitador pod is down (opendatahub-io#626)
c098422 docs: fix API response format in self-service model listing (opendatahub-io#761)
3a6a9b8 docs: documentation updates (opendatahub-io#687)
ce75696 fix: rename dashboard title and panel name to 'Token Consumption' (opendatahub-io#758)
266a130 chore: sync security config files (opendatahub-io#736)
2ce457c docs: fix instructions to match code for modelsAsService managementState (opendatahub-io#756)
5d06621 docs: fix broken links (opendatahub-io#755)
c789577 feat: reject degraded/failed subscriptions at auth layer (opendatahub-io#721)
```
…es/sync-main chore: sync 3.4 with main
…oller v3-4 the maas-controller Dockerfile.konflux references paths outside the maas-controller/ directory (maas-api/deploy, deployment/base/...), so the build context must be the repo root. update path-context from maas-controller to . and dockerfile from Dockerfile.konflux to maas-controller/Dockerfile.konflux to match the upstream pattern. Signed-off-by: Chaitanya Kulkarni <ckulkarn@redhat.com> Signed-off-by: Chaitanya Kulkarni <chkulkar@redhat.com> Made-with: Cursor
…ontroller-tekton-context fix(tekton): correct dockerfile path and build context for maas-controller v3-4
update github.com/jackc/pgx/v5 from v5.7.6 to v5.9.2 to resolve memory-safety vulnerabilities. cve details: - cve-2026-33815: memory-safety vulnerability in pgx - cve-2026-33816: memory-safety vulnerability in pgx resolves: rhoaieng-57067, rhoaieng-57063 co-authored-by: claude opus 4.6 <noreply@anthropic.com>
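A quick way to confirm that a bump like this one landed is to read the pinned version back out of `go.mod`. The sketch below is a hypothetical helper, not part of the repository; the `(5, 9, 2)` threshold is simply the target version named in this commit and is not taken from a security advisory.

```python
import re

# Matches a pgx v5 requirement line, inside or outside a require ( ... ) block.
PGX_LINE = re.compile(
    r"^\s*(?:require\s+)?github\.com/jackc/pgx/v5\s+v(\d+)\.(\d+)\.(\d+)",
    re.MULTILINE,
)


def pgx_version(go_mod_text: str):
    """Return the pinned pgx v5 version as an (int, int, int) tuple, or None."""
    match = PGX_LINE.search(go_mod_text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def pgx_at_least(go_mod_text: str, minimum=(5, 9, 2)) -> bool:
    """True when go.mod pins pgx v5 at or above the given version."""
    version = pgx_version(go_mod_text)
    return version is not None and version >= minimum
```

Comparing integer tuples avoids the string-comparison trap where `"5.10.0" < "5.9.2"`.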
## summary

update `github.com/jackc/pgx/v5` from `v5.7.6` to `v5.9.2` to resolve memory-safety vulnerabilities.

NOTE: this PR targets the fork. a cross-fork PR to `red-hat-data-services/models-as-a-service` could not be created due to token permissions. please recreate this PR on the upstream repo using:

## cve details

## changes

- `github.com/jackc/pgx/v5` from `v5.7.6` to `v5.9.2` in `maas-api/go.mod`
- `go mod tidy` to update `go.sum`

## test results

status: ✅ all tests passed

command: `go test ./...`

## jira

🤖 generated with claude code