TEST - will close immediately#6

Closed
vmrh21 wants to merge 55 commits into rhoai-3.4 from
fix/cve-2026-33815-pgx-rhds-rhoai-3.4-attempt-1

Conversation

vmrh21 (Owner) commented Apr 21, 2026

test

mynhardtburger and others added 30 commits April 9, 2026 20:51
…g entry (opendatahub-io#720)

## Summary

- Fixes the `model_base_url` conftest fixture, which always used
`items[0]` from the model catalog and ignored the `model_id` parameter
- When multiple MaaS subscriptions exist, this caused a mismatch between
the API key (scoped to one subscription) and the model URL (taken from a
different subscription), resulting in 403 errors
- The fixture now looks up the catalog entry matching `model_id` before
falling back to constructing the URL from `gateway_url`
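The corrected lookup can be sketched as follows. This is an illustrative Python sketch, not the actual conftest code; the function name and the assumed catalog shape (`{"items": [{"id": ..., "url": ...}]}`) and fallback URL pattern are assumptions.

```python
def resolve_model_base_url(catalog: dict, model_id: str, gateway_url: str) -> str:
    """Return the URL of the catalog entry matching model_id.

    Falls back to constructing the URL from gateway_url when no
    matching entry exists, instead of blindly taking items[0].
    """
    for entry in catalog.get("items", []):
        if entry.get("id") == model_id:
            return entry["url"]
    # Fallback: build the URL from the gateway, as before (illustrative path).
    return f"{gateway_url}/v1/models/{model_id}"
```

With multiple subscriptions in the catalog, the key point is that the returned URL now comes from the same entry the API key is scoped to.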

Closes:
[RHOAIENG-57327](https://redhat.atlassian.net/browse/RHOAIENG-57327)

## Test plan

- [ ] Run e2e tests on a cluster with a single MaaS subscription
(baseline — should pass as before)
- [ ] Run e2e tests on a cluster with multiple MaaS subscriptions (the
failing scenario — should now pass)
- [ ] Verify `MODEL_NAME` env var override correctly selects the
matching catalog entry's URL

🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!--- Provide a general summary of your changes in the Title above -->

## Description
<!--- Describe your changes in detail -->
`deploy.sh` was waiting for the wrong webhook deployment and hence got
stuck even though the operator was ready.

## How Has This Been Tested?
<!--- Please describe in detail how you tested your changes. -->
<!--- Include details of your testing environment, and the tests you ran
to -->
<!--- see how your change affects other areas of the code, etc. -->
`./scripts/deploy.sh --deployment-mode operator --operator-type rhoai`
was previously stuck waiting for the webhook deployment, but completes now.

## Merge criteria:
<!--- This PR will be merged by any repository approver when it meets
all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated RHOAI deployment configuration to reference the correct
webhook management component name. This change ensures the deployment
process properly tracks component readiness and availability during
installation. Health checks and status verification now correctly target
the appropriate component in RHOAI environments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
opendatahub-io#659)

The validation script was checking for `gateway-auth-policy`, but the
actual deployed AuthPolicy is named `gateway-default-auth`. This caused
false `NotFound` warnings despite the AuthPolicy being correctly
deployed and functional.

Changes:
- Update `scripts/validate-deployment.sh` line 383 to check for
`gateway-default-auth` instead of `gateway-auth-policy`

Fixes opendatahub-io#658


## Description

## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Corrected deployment validation to check the correct authentication
policy resource.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
I just created this Granite model, which works on CPUs, as part of my
demo and wanted to contribute it back.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Added sample deployment docs for Granite 3.1 8B delivered via Red Hat
model car OCI and updated the available models list with this option.

* **New Features**
* Added a ready-to-deploy Granite 3.1 8B sample including pre-configured
model service, authentication policy, access controls, and a token rate
limit (10,000 tokens/min).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…tion and MaaSAuthPolicy (opendatahub-io#714)

## Description

This PR improves status reporting for `MaaSSubscription` and
`MaaSAuthPolicy` resources to reflect real reconciliation and dependency
health. Previously, these resources could show "Active" or empty status
when underlying dependencies (MaaSModelRefs, TokenRateLimitPolicies,
AuthPolicies) were missing, invalid, or unhealthy.

* https://redhat.atlassian.net/browse/RHOAIENG-57006
* https://redhat.atlassian.net/browse/RHOAIENG-57233

### Key Changes

**API Types:**
- Introduced `common_types.go` with shared types:
- `Phase` type alias with typed constants (`Pending`, `Active`,
`Degraded`, `Failed`)
- `ConditionReason` type alias with semantic reason codes (`Reconciled`,
`NotFound`, `Accepted`, `NotEnforced`, etc.)
  - `ResourceRefStatus` base struct for embedding in per-item statuses

**MaaSSubscription:**
- Added `ModelRefStatuses` - per-model validation status (name,
namespace, ready, reason, message)
- Added `TokenRateLimitStatuses` - per-TRLP operand health status
- Phase now accurately reflects:
  - `Active` - all model refs valid, all TRLPs accepted
  - `Degraded` - some models valid/some invalid, or some TRLPs unhealthy
  - `Failed` - all model refs invalid

**MaaSAuthPolicy:**
- Added `AuthPolicies` status with per-AuthPolicy health (ready, reason,
message)
- Phase derivation mirrors MaaSSubscription logic
- AuthPolicy readiness requires both `Accepted=True` AND `Enforced=True`
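The phase derivation described above can be sketched as a small pure function. This is an illustrative Python sketch of the aggregation rule, not the actual Go controller code, and it collapses the per-model-ref and per-TRLP checks into a single readiness list for brevity.

```python
def derive_phase(ready_flags: list[bool]) -> str:
    """Aggregate per-item readiness into a resource phase.

    Active   - every tracked item is ready
    Degraded - a mix of ready and unready items
    Failed   - no item is ready (or there are no valid items at all)
    """
    if ready_flags and all(ready_flags):
        return "Active"
    if any(ready_flags):
        return "Degraded"  # some valid, some invalid/unhealthy
    return "Failed"        # nothing healthy, or no valid references
```

In the real controllers the inputs differ per resource (model ref validity plus TRLP acceptance for `MaaSSubscription`; `Accepted` AND `Enforced` per AuthPolicy for `MaaSAuthPolicy`), but the aggregation shape is the same.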

### Status Examples

#### MaaSSubscription - Active (all healthy)

```yaml
status:
  phase: Active
  conditions:
    - type: Ready
      status: "True"
      reason: Reconciled
      message: "successfully reconciled"
  modelRefStatuses:
    - name: llama-model
      namespace: llm
      ready: true
      reason: Valid
    - name: mistral-model
      namespace: llm
      ready: true
      reason: Valid
  tokenRateLimitStatuses:
    - name: maas-trlp-llama-model
      namespace: llm
      model: llama-model
      ready: true
      reason: Accepted
    - name: maas-trlp-mistral-model
      namespace: llm
      model: mistral-model
      ready: true
      reason: Accepted
```

#### MaaSSubscription - Degraded (partial failure)

```yaml
status:
  phase: Degraded
  conditions:
    - type: Ready
      status: "False"
      reason: PartialFailure
      message: "1 of 2 model references are invalid"
  modelRefStatuses:
    - name: llama-model
      namespace: llm
      ready: true
      reason: Valid
    - name: missing-model
      namespace: llm
      ready: false
      reason: NotFound
      message: "MaaSModelRef llm/missing-model not found"
  tokenRateLimitStatuses:
    - name: maas-trlp-llama-model
      namespace: llm
      model: llama-model
      ready: true
      reason: Accepted
```

#### MaaSSubscription - Failed (all invalid)

```yaml
status:
  phase: Failed
  conditions:
    - type: Ready
      status: "False"
      reason: ReconcileFailed
      message: "all 2 model references are invalid"
  modelRefStatuses:
    - name: missing-model-1
      namespace: llm
      ready: false
      reason: NotFound
      message: "MaaSModelRef llm/missing-model-1 not found"
    - name: missing-model-2
      namespace: llm
      ready: false
      reason: NotFound
      message: "MaaSModelRef llm/missing-model-2 not found"
  tokenRateLimitStatuses: []
```

#### MaaSAuthPolicy - Active (all healthy)

```yaml
status:
  phase: Active
  conditions:
    - type: Ready
      status: "True"
      reason: Reconciled
      message: "successfully reconciled"
  authPolicies:
    - name: maas-auth-llama-model
      namespace: llm
      model: llama-model
      modelNamespace: llm
      ready: true
      reason: AcceptedEnforced
```

#### MaaSAuthPolicy - Degraded (AuthPolicy not enforced)

```yaml
status:
  phase: Degraded
  conditions:
    - type: Ready
      status: "False"
      reason: PartialFailure
      message: "1 of 1 AuthPolicies not accepted/enforced"
  authPolicies:
    - name: maas-auth-llama-model
      namespace: llm
      model: llama-model
      modelNamespace: llm
      ready: false
      reason: NotEnforced
      message: "waiting for Limitador to be ready"
```

#### MaaSAuthPolicy - Failed (model not found)

```yaml
status:
  phase: Failed
  conditions:
    - type: Ready
      status: "False"
      reason: ReconcileFailed
      message: "all 1 model references are invalid or missing"
  authPolicies: []
```

## How Has This Been Tested?

* Unit tests
* Manual cluster testing on a live cluster:
  - Single valid model → `Active` phase
  - Missing MaaSModelRef → `Failed` phase with `NotFound` reason
  - Mixed valid/invalid models → `Degraded` phase with per-model status
  - TokenRateLimitPolicy not accepted → `Degraded` with detailed reason
  - AuthPolicy not enforced → `Degraded` with `NotEnforced` reason
* Build verification:
```bash
make build  # passes all checks (tidy, generate, manifests, lint, test)
```

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added a Degraded phase and richer status reporting: per-item
ready/reason/message plus aggregated model and token-rate-limit
statuses.

* **Documentation**
* Troubleshooting expanded with phase semantics, commands to list
non-Active resources, and guidance to inspect per-item status fields.

* **Tests**
* New unit and end-to-end tests validating phase transitions and
per-item status/reporting.

* **Other**
* Tightened validation for name/namespace/model fields; build/deploy
tooling behavior updated.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
opendatahub-io#727)

## Description
Documents the **known limitation** when multiple **MaaSModelRef**
resources resolve to the **same** **HTTPRoute**: multiple
**TokenRateLimitPolicy** objects can target that route, but **only one**
is fully effective in practice (others may show **Overridden**), so
**per-subscription token limits may not all apply**.

The fix will land as a fast follow in 3.5:
opendatahub-io#585
[RHOAIENG-57602](https://redhat.atlassian.net/browse/RHOAIENG-57602)

## How Has This Been Tested?
- Docs-only change

## Merge criteria:
- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Added a warning and guidance about token rate limit behavior when
multiple model references share a single route, and recommended planning
to use separate routes for independent subscription limits.
* Expanded the "Subscription limitations and known issues" section with
detection commands and practical workarounds.
* Added a known-limitation note to release notes and updated
troubleshooting steps and navigation links for easier discovery.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…as-api (opendatahub-io#566)

Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from
1.75.1 to 1.79.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a
href="https://github.com/grpc/grpc-go/releases">google.golang.org/grpc's
releases</a>.</em></p>
<blockquote>
<h2>Release 1.79.3</h2>
<h1>Security</h1>
<ul>
<li>server: fix an authorization bypass where malformed :path headers
(missing the leading slash) could bypass path-based restricted
&quot;deny&quot; rules in interceptors like <code>grpc/authz</code>. Any
request with a non-canonical path is now immediately rejected with an
<code>Unimplemented</code> error. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8981">#8981</a>)</li>
</ul>
<h2>Release 1.79.2</h2>
<h1>Bug Fixes</h1>
<ul>
<li>stats: Prevent redundant error logging in health/ORCA producers by
skipping stats/tracing processing when no stats handler is configured.
(<a
href="https://redirect.github.com/grpc/grpc-go/pull/8874">grpc/grpc-go#8874</a>)</li>
</ul>
<h2>Release 1.79.1</h2>
<h1>Bug Fixes</h1>
<ul>
<li>grpc: Remove the <code>-dev</code> suffix from the User-Agent
header. (<a
href="https://redirect.github.com/grpc/grpc-go/pull/8902">grpc/grpc-go#8902</a>)</li>
</ul>
<h2>Release 1.79.0</h2>
<h1>API Changes</h1>
<ul>
<li>mem: Add experimental API <code>SetDefaultBufferPool</code> to
change the default buffer pool. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8806">#8806</a>)
<ul>
<li>Special Thanks: <a
href="https://github.com/vanja-p"><code>@​vanja-p</code></a></li>
</ul>
</li>
<li>experimental/stats: Update <code>MetricsRecorder</code> to require
embedding the new <code>UnimplementedMetricsRecorder</code> (a no-op
struct) in all implementations for forward compatibility. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8780">#8780</a>)</li>
</ul>
<h1>Behavior Changes</h1>
<ul>
<li>balancer/weightedtarget: Remove handling of <code>Addresses</code>
and only handle <code>Endpoints</code> in resolver updates. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8841">#8841</a>)</li>
</ul>
<h1>New Features</h1>
<ul>
<li>experimental/stats: Add support for asynchronous gauge metrics
through the new <code>AsyncMetricReporter</code> and
<code>RegisterAsyncReporter</code> APIs. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8780">#8780</a>)</li>
<li>pickfirst: Add support for weighted random shuffling of endpoints,
as described in <a
href="https://redirect.github.com/grpc/proposal/pull/535">gRFC A113</a>.
<ul>
<li>This is enabled by default, and can be turned off using the
environment variable
<code>GRPC_EXPERIMENTAL_PF_WEIGHTED_SHUFFLING</code>. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8864">#8864</a>)</li>
</ul>
</li>
<li>xds: Implement <code>:authority</code> rewriting, as specified in <a
href="https://github.com/grpc/proposal/blob/master/A81-xds-authority-rewriting.md">gRFC
A81</a>. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8779">#8779</a>)</li>
<li>balancer/randomsubsetting: Implement the
<code>random_subsetting</code> LB policy, as specified in <a
href="https://github.com/grpc/proposal/blob/master/A68-random-subsetting.md">gRFC
A68</a>. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8650">#8650</a>)
<ul>
<li>Special Thanks: <a
href="https://github.com/marek-szews"><code>@​marek-szews</code></a></li>
</ul>
</li>
</ul>
<h1>Bug Fixes</h1>
<ul>
<li>credentials/tls: Fix a bug where the port was not stripped from the
authority override before validation. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8726">#8726</a>)
<ul>
<li>Special Thanks: <a
href="https://github.com/Atul1710"><code>@​Atul1710</code></a></li>
</ul>
</li>
<li>xds/priority: Fix a bug causing delayed failover to lower-priority
clusters when a higher-priority cluster is stuck in
<code>CONNECTING</code> state. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8813">#8813</a>)</li>
<li>health: Fix a bug where health checks failed for clients using
legacy compression options (<code>WithDecompressor</code> or
<code>RPCDecompressor</code>). (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8765">#8765</a>)
<ul>
<li>Special Thanks: <a
href="https://github.com/sanki92"><code>@​sanki92</code></a></li>
</ul>
</li>
<li>transport: Fix an issue where the HTTP/2 server could skip header
size checks when terminating a stream early. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8769">#8769</a>)
<ul>
<li>Special Thanks: <a
href="https://github.com/joybestourous"><code>@​joybestourous</code></a></li>
</ul>
</li>
<li>server: Propagate status detail headers, if available, when
terminating a stream during request header processing. (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8754">#8754</a>)
<ul>
<li>Special Thanks: <a
href="https://github.com/joybestourous"><code>@​joybestourous</code></a></li>
</ul>
</li>
</ul>
<h1>Performance Improvements</h1>
<ul>
<li>credentials/alts: Optimize read buffer alignment to reduce copies.
(<a
href="https://redirect.github.com/grpc/grpc-go/issues/8791">#8791</a>)</li>
<li>mem: Optimize pooling and creation of <code>buffer</code> objects.
(<a
href="https://redirect.github.com/grpc/grpc-go/issues/8784">#8784</a>)</li>
<li>transport: Reduce slice re-allocations by reserving slice capacity.
(<a
href="https://redirect.github.com/grpc/grpc-go/issues/8797">#8797</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/grpc/grpc-go/commit/dda86dbd9cecb8b35b58c73d507d81d67761205f"><code>dda86db</code></a>
Change version to 1.79.3 (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8983">#8983</a>)</li>
<li><a
href="https://github.com/grpc/grpc-go/commit/72186f163e75a065c39e6f7df9b6dea07fbdeff5"><code>72186f1</code></a>
grpc: enforce strict path checking for incoming requests on the server
(<a
href="https://redirect.github.com/grpc/grpc-go/issues/8981">#8981</a>)</li>
<li><a
href="https://github.com/grpc/grpc-go/commit/97ca3522b239edf6813e2b1106924e9d55e89d43"><code>97ca352</code></a>
Changing version to 1.79.3-dev (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8954">#8954</a>)</li>
<li><a
href="https://github.com/grpc/grpc-go/commit/8902ab6efea590f5b3861126559eaa26fa9783b2"><code>8902ab6</code></a>
Change the version to release 1.79.2 (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8947">#8947</a>)</li>
<li><a
href="https://github.com/grpc/grpc-go/commit/a9286705aa689bee321ec674323b6896284f3e02"><code>a928670</code></a>
Cherry-pick <a
href="https://redirect.github.com/grpc/grpc-go/issues/8874">#8874</a> to
v1.79.x (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8904">#8904</a>)</li>
<li><a
href="https://github.com/grpc/grpc-go/commit/06df3638c0bcee88197b1033b3ba83e1eb8bc010"><code>06df363</code></a>
Change version to 1.79.2-dev (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8903">#8903</a>)</li>
<li><a
href="https://github.com/grpc/grpc-go/commit/782f2de44f597af18a120527e7682a6670d84289"><code>782f2de</code></a>
Change version to 1.79.1 (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8902">#8902</a>)</li>
<li><a
href="https://github.com/grpc/grpc-go/commit/850eccbb2257bd2de6ac28ee88a7172ab6175629"><code>850eccb</code></a>
Change version to 1.79.1-dev (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8851">#8851</a>)</li>
<li><a
href="https://github.com/grpc/grpc-go/commit/765ff056b6890f6c8341894df4e9668e9bfc18ef"><code>765ff05</code></a>
Change version to 1.79.0 (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8850">#8850</a>)</li>
<li><a
href="https://github.com/grpc/grpc-go/commit/68804be0e78ed0365bb5a576dedc12e2168ed63e"><code>68804be</code></a>
Cherry pick <a
href="https://redirect.github.com/grpc/grpc-go/issues/8864">#8864</a> to
v1.79.x (<a
href="https://redirect.github.com/grpc/grpc-go/issues/8896">#8896</a>)</li>
<li>Additional commits viewable in <a
href="https://github.com/grpc/grpc-go/compare/v1.75.1...v1.79.3">compare
view</a></li>
</ul>
</details>
<br />

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

## Description
Switch smoke tests to use minted MaaS API keys instead of raw
`oc whoami -t` cluster tokens.

[RHOAIENG-51553](https://redhat.atlassian.net/browse/RHOAIENG-51553)

## How Has This Been Tested?
* Manual testing against cluster with MaaS API deployed
* To test locally:
```bash
cd test/e2e
./smoke.sh
```

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* End-to-end tests now obtain short-lived API keys via the cluster
bootstrap flow instead of using direct user tokens.
* Test setup fails fast if minting the required test API key isn't
possible; admin tests automatically mint admin credentials and are
skipped when unavailable.
* Logging reduced token exposure by recording only token/key lengths,
not their contents.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

E2E tests for the ExternalModel feature, focused on MaaS capabilities:

- **Discovery**: ExternalModel reconciler creates MaaSModelRef,
HTTPRoute, backend Service
- **Auth**: Invalid/missing API key returns 401/403
- **Egress**: Request with valid key passes auth and reaches external
endpoint
- **Cleanup**: Deleting MaaSModelRef removes HTTPRoute via finalizer

Uses `httpbin.org` as the external endpoint (configurable via
`E2E_EXTERNAL_ENDPOINT`). No BBR/plugin dependency — tests validate
MaaS egress routing and auth, not payload transformation.
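The auth expectations these tests encode can be sketched as a small table-driven check. This is an illustrative Python sketch, not the actual pytest parametrization from `test_external_models.py`; the scenario names are assumptions.

```python
# Expected HTTP status codes per auth scenario (illustrative names).
EXPECTED_STATUS = {
    "missing_api_key": (401, 403),  # gateway denies unauthenticated calls
    "invalid_api_key": (401, 403),  # gateway denies bad credentials
    "valid_api_key": (200,),        # request passes auth, reaches endpoint
}

def check_auth_outcome(scenario: str, status_code: int) -> bool:
    """Return True when the observed status matches the scenario's expectation."""
    return status_code in EXPECTED_STATUS[scenario]
```

The key property under test is that denial happens at the gateway, before any traffic egresses to the external endpoint.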

## Changes

- `test/e2e/tests/test_external_models.py`: 7 tests covering discovery,
  auth, egress connectivity, and cleanup
- `test/e2e/scripts/prow_run_smoke_test.sh`: External model tests
section
  (commented out until CI includes ExternalModel reconciler)

## Test plan

- [x] All 7 tests passing against RHOAI cluster with httpbin.org
- [x] No BBR or simulator dependency

```release-note
NONE
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added E2E tests for external-model discovery, auth (invalid/missing
API keys), egress/forwarding to external endpoints, and cleanup to
ensure routes are removed.
* New module-scoped setup provisions credentials, model/subscription
resources, creates an API key, and tears down resources after tests.

* **Chores**
* CI smoke runner now executes the external-model E2E suite (replacing
the previous external test), producing separate test artifacts.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…-io#693)

Related to - https://redhat.atlassian.net/browse/RHOAIENG-57158

## Summary
- Add Spectral-based OpenAPI specification validation to CI
- Add breaking change detection using oasdiff
- Add changelog verification for API changes
- Include comprehensive automation plan document

## Changes
1. **`.github/workflows/openapi-validation.yml`** - New CI workflow with
three jobs:
- `validate-spec`: Runs Spectral linting, generates validation reports
   - `breaking-changes`: Detects API breaking changes vs base branch
   - `changelog-check`: Verifies changelog updates when spec changes

2. **`.spectral.yml`** - OpenAPI linting configuration:
   - Extends `spectral:oas` ruleset
   - Custom rules for operation IDs, descriptions, security
   - MaaS-specific rule for subscription header documentation

3. **`docs/openapi-automation-plan.md`** - Phased automation plan:
   - Phase 1: Validation & linting (this PR)
   - Phase 2: Contract testing with Dredd/Prism
   - Phase 3: Client SDK generation
   - Phase 4: Code annotation-based generation
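The ruleset structure described above might look roughly like the following. This is an illustrative sketch, not the repo's actual `.spectral.yml`; the custom rule name, its `given` path, and the severity overrides are assumptions.

```yaml
extends: ["spectral:oas"]
rules:
  # Tighten a built-in rule from the spectral:oas ruleset.
  operation-operationId: error
  # Hypothetical custom rule for the MaaS subscription header check.
  maas-subscription-header-documented:
    description: Operations should document the MaaS subscription header.
    severity: hint
    given: "$.paths[*][*]"
    then:
      field: parameters
      function: truthy
```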

## Current Validation Results
Running Spectral on `maas-api/openapi3.yaml` found:
- **4 errors** (schema validation issues in examples)
- **8 warnings** (missing contact info, undefined tags, tag ordering)
- **6 hints** (custom MaaS subscription header rule)

These will be addressed in a follow-up PR.

## Test Plan
- [x] Spectral validation runs successfully on local spec
- [x] CI workflow validates on PR changes to OpenAPI spec
- [ ] Fix existing validation errors (follow-up PR)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Added CI automation for OpenAPI validation, linting, and report
generation.
* Enabled breaking-change detection on pull requests and a check for
required changelog updates.
* Introduced stricter linting rules to enforce OpenAPI quality and
documentation standards.

* **Documentation**
* Added an OpenAPI automation roadmap outlining phased plans for
validation, contract testing, SDKs, and rollout.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…cted support (opendatahub-io#706)


## Description
Replace `curlimages/curl` with `registry.redhat.io/ubi9/ubi-minimal:9.7`
in the cleanup CronJob. Third-party images are not mirrored in
disconnected/air-gapped RHOAI environments. UBI minimal includes curl
and is available in the RHOAI mirror catalog.


## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
  * Updated the container base image used for the API cleanup process.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Wen Liang <liangwen12year@gmail.com>
…b-io#728)

## Summary

- Uncomments the `test_unconfigured_model_denied_by_gateway_auth` test
in `test/e2e/tests/test_subscription.py`
- Verifies that models with no MaaSAuthPolicy or MaaSSubscription are
denied (403) by the `gateway-default-auth` AuthPolicy
- The test fixture (`test/e2e/fixtures/unconfigured/`) already exists
and deploys a MaaSModelRef with no auth policy or subscription

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Re-enabled end-to-end coverage for gateway access control: confirms
deny-by-default behavior and that models without required
subscription/auth configuration are denied (403) when accessed with the
default API key.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…tahub-io#549)


## Description
Add a bounded access-check timeout (15s), a `Cache-Control: no-store`
header, and an `X-Access-Checked-At` freshness timestamp to prevent
clients from caching stale authorization decisions from the eventually
consistent model access probes.
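The header semantics can be sketched as follows. The service itself is Go; this Python sketch only illustrates the behavior described above, and the function name is an assumption (the header names come from the PR).

```python
from datetime import datetime, timezone

def access_check_headers() -> dict[str, str]:
    """Headers that stop clients from caching an authorization decision
    and record when the access check was performed."""
    # RFC3339-compatible UTC timestamp, e.g. 2026-04-21T12:00:00+00:00
    checked_at = datetime.now(timezone.utc).isoformat()
    return {
        "Cache-Control": "no-store",       # never cache the auth decision
        "X-Access-Checked-At": checked_at, # freshness of the decision
    }
```

A client seeing `no-store` must re-request rather than reuse a stale allow/deny, and the timestamp lets it judge how fresh the last check was.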

## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* API responses now include anti-cache headers and an access-check
timestamp header (RFC3339) to show when authorization was verified.
* Access validation checks are now bounded by a configurable timeout to
ensure timely responses.

* **Chores**
* Added a configuration option for the access-check timeout with
validation.

* **Tests**
* Tests updated to verify the new headers and that the timestamp parses
as RFC3339.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Wen Liang <liangwen12year@gmail.com>
…b-io#724)

https://redhat.atlassian.net/browse/RHOAIENG-57235

## Description
This PR provides broader automated coverage for unhappy paths and abuse
scenarios (missing resources, forbidden access, header spoofing).

### Additional notes

Documentation updates to `README.md` (e.g. the "Negative & Security
Tests" and "Namespace Scoping Tests" sections), including pytest
commands, a test coverage list, and a link to the test matrix. The CI
integration list was updated as well.
## How Has This Been Tested?
The code compiles, and the CI pipeline builds and completes its tests as
intended.

## Merge criteria:
- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Added security and negative scenario test coverage for E2E validation.
* Refactored test utilities into a shared helper module for consistency
across test suites.

* **Documentation**
  * Updated E2E test documentation to reflect new test coverage.
  * Extended CI smoke test script to include new test modules.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Yuriy Teodorovych <Yuriy@ibm.com>
## Summary

Related to - https://redhat.atlassian.net/browse/RHOAIENG-57336

Implements Kubernetes RBAC aggregation to enable namespace admins and
contributors to create and manage `MaaSModelRef` and `ExternalModel`
resources without requiring cluster-admin intervention.

This addresses the user story requirement for namespace-scoped users to
deploy models using standard Kubernetes/OpenShift roles (`admin`,
`edit`, `view`) without needing custom ClusterRoleBindings or elevated
permissions.

## Changes

### ClusterRole Aggregation
- ✅ Add `maas-user-admin-role` ClusterRole (aggregates to `admin` and
`edit` roles)
- ✅ Add `maas-user-view-role` ClusterRole (aggregates to `view`,
`admin`, and `edit` roles)
- ✅ Include namespace-scoped resources: `MaaSModelRef`, `ExternalModel`
- ✅ Exclude platform-managed resources: `MaaSSubscription`,
`MaaSAuthPolicy`
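The aggregation described above boils down to a ClusterRole that carries the standard aggregation labels. A minimal sketch, using the role and resource names from this PR (exact rule contents are illustrative, not copied from the manifests):

```yaml
# Illustrative sketch of an aggregated ClusterRole. The labels tell the
# Kubernetes controller-manager to merge these rules into the built-in
# `admin` and `edit` roles automatically; no RoleBinding changes needed.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: maas-user-admin-role
  labels:
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
    rbac.authorization.k8s.io/aggregate-to-edit: "true"
rules:
  - apiGroups: ["maas.opendatahub.io"]
    resources: ["maasmodelrefs", "externalmodels"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
```

A companion `maas-user-view-role` would carry `aggregate-to-view` (plus admin/edit) with only the read verbs.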

### Documentation
- ✅ Comprehensive user guide:
`docs/content/configuration-and-management/namespace-rbac.md`
  - How RBAC aggregation works (with Mermaid diagrams)
  - Permission matrix and usage examples
  - Troubleshooting guide and best practices
- ✅ Automated verification script: `scripts/verify-rbac-aggregation.sh`
- ✅ Updated MkDocs navigation

## Permission Matrix

| Role | Resources | Permissions | Use Case |
|------|-----------|-------------|----------|
| **admin** | `MaaSModelRef`, `ExternalModel` | `create`, `delete`,
`get`, `list`, `patch`, `update`, `watch` | Full model lifecycle
management |
| **edit** | `MaaSModelRef`, `ExternalModel` | `create`, `delete`,
`get`, `list`, `patch`, `update`, `watch` | Full model lifecycle
management |
| **view** | `MaaSModelRef`, `ExternalModel` | `get`, `list`, `watch` |
Read-only access |

**Platform-managed resources remain protected:**
- ❌ `MaaSSubscription` - Namespace users cannot create (cluster-admin
only)
- ❌ `MaaSAuthPolicy` - Namespace users cannot create (cluster-admin
only)

## Testing

### ✅ Comprehensive Live Cluster Testing: 19/19 Tests Passed

All test cases were executed on a [**live OpenShift
cluster**](https://console-openshift-console.apps.ci-ln-cdy7jft-76ef8.aws-4.ci.openshift.org/k8s/ns/opendatahub/core~v1~Pod)
with the following results:

#### Phase 1: Infrastructure Verification ✅
- ✅ ClusterRoles exist with correct aggregation labels
- ✅ Built-in `admin` role includes `maas.opendatahub.io` permissions
- ✅ Built-in `edit` role includes `maas.opendatahub.io` permissions
- ✅ Built-in `view` role includes `maas.opendatahub.io` permissions
(read-only)

#### Phase 2: User Permission Testing ✅
- ✅ Admin user can create, update, delete `MaaSModelRef`
- ✅ Admin user can create, update, delete `ExternalModel`
- ✅ Edit user can create, update, delete `MaaSModelRef`
- ✅ Edit user can create, update, delete `ExternalModel`
- ✅ View user can **only** read (get, list, watch) - correctly forbidden
from create/delete

#### Phase 3: Security & Platform Protection ✅
- ✅ Namespace users **cannot** create `MaaSSubscription` (correctly
forbidden)
- ✅ Namespace users **cannot** create `MaaSAuthPolicy` (correctly
forbidden)
- ✅ Platform resources remain cluster-admin only

#### Phase 4: Controller Integration ✅
- ✅ maas-controller successfully reconciles user-created `MaaSModelRef`
resources
- ✅ Status conditions updated correctly
- ✅ Controller watches user namespaces properly

#### Phase 5: Lifecycle Testing ✅
- ✅ Users can create resources in their namespace
- ✅ Users can update resources in their namespace
- ✅ Users can delete resources in their namespace
- ✅ View users correctly restricted to read-only

### Test Environment
- **Cluster Type:** OpenShift
(https://console-openshift-console.apps.ci-ln-cdy7jft-76ef8.aws-4.ci.openshift.org/k8s/ns/opendatahub/core~v1~Pod)
- **MaaS Version:** Latest (main branch)
- **Test Namespace:** `rbac-test`
- **Test Users:** `testadmin@example.com`, `testeditor@example.com`,
`testviewer@example.com`
- **Resources Created:** MaaSModelRef, ExternalModel (all successfully
reconciled)

### Verification Script
Run the automated verification script to validate RBAC aggregation:

```bash
./scripts/verify-rbac-aggregation.sh
```

## Design Rationale

### Why RBAC Aggregation?
1. **Kubernetes best practice** - Standard pattern for extending
built-in roles with CRD permissions
2. **Zero configuration** - Works automatically when users are granted
standard roles
3. **Follows precedent** - Same pattern used by OpenShift operators,
KServe, and other K8s projects
4. **Minimal permissions** - Only grants access to resources users
actually deploy in their namespaces

### Security Considerations
- ✅ Only namespace-scoped resources included (`MaaSModelRef`,
`ExternalModel`)
- ✅ Platform-level resources excluded (`MaaSSubscription`,
`MaaSAuthPolicy`)
- ✅ Verbs limited to minimum necessary for each role
- ✅ View role is strictly read-only (no mutating verbs)
- ✅ Follows principle of least privilege

## Documentation

### User-Facing
- **Main Guide:**
`docs/content/configuration-and-management/namespace-rbac.md`
  - How RBAC aggregation works (with Mermaid diagrams)
  - Permission matrix
  - Usage examples
  - Troubleshooting guide
  - Best practices
  - Design rationale and references

### Verification
- **Automated Script:** `scripts/verify-rbac-aggregation.sh`
  - Checks ClusterRole existence and labels
  - Verifies aggregation to built-in roles
  - Validates correct verbs for each role
  - Provides detailed pass/fail reporting

## Known Issues Discovered During Testing

While testing this implementation on a live cluster, we discovered two
operator-related issues:

### 1. Missing `cluster-audience` in ConfigMap
**Issue:** The ODH/RHOAI operator doesn't set the `cluster-audience`
parameter in the `maas-parameters` ConfigMap, causing maas-controller to
fail on startup.

**Workaround:**
```bash
kubectl patch configmap maas-parameters -n opendatahub \
  --type merge \
  -p '{"data":{"cluster-audience":"https://kubernetes.default.svc"}}'
```

**Permanent Fix:** Update operator to include this parameter when
creating the ConfigMap.


## Migration Guide

For existing deployments, the changes are **additive and non-breaking**:

1. The new ClusterRoles are automatically created when the manifests are
applied
2. Kubernetes automatically aggregates permissions into built-in roles
within seconds
3. No migration of existing resources required
4. No impact on existing service account permissions
5. Users with existing custom ClusterRoleBindings can continue using
them (can be cleaned up later)

### Rollout Steps
1. Deploy updated MaaS controller manifests (includes new ClusterRoles)
2. Verify aggregation: `kubectl get clusterrole admin -o yaml | grep
maas.opendatahub.io`
3. Test in a dev namespace before production
4. Communicate new capability to namespace users
5. (Optional) Clean up redundant custom ClusterRoleBindings

## Acceptance Criteria

All acceptance criteria from the user story have been met:

- ✅ Users with `admin` role can create/update/delete `MaaSModelRef` in
their namespace
- ✅ Users with `edit` role can create/update/delete `MaaSModelRef` in
their namespace
- ✅ Users with `view` role can list/get but not create/update/delete
`MaaSModelRef`
- ✅ Aggregation uses standard Kubernetes labels
(`rbac.authorization.k8s.io/aggregate-to-*`)
- ✅ Only namespace-scoped resources included
- ✅ Platform-level resources excluded
- ✅ Minimal permissions granted (no broader than necessary)
- ✅ Comprehensive documentation provided

## References

- [Kubernetes RBAC
Aggregation](https://kubernetes.io/docs/reference/access-authn-authz/rbac/#aggregated-clusterroles)
- [OpenShift
RBAC](https://docs.openshift.com/container-platform/latest/authentication/using-rbac.html)
- [Example: MCP Lifecycle Operator
Aggregation](kubernetes-sigs/mcp-lifecycle-operator#73)

## Checklist

- [x] Code follows project style guidelines
- [x] All tests pass (19/19 on live cluster)
- [x] Documentation updated (comprehensive user guide)
- [x] Verification script added
- [x] CI validation passes
- [x] CodeRabbit AI review passes (no findings)
- [x] Security considerations addressed
- [x] Breaking changes: None
- [x] Backwards compatible: Yes

---

**Ready for review by security and platform teams.**


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added two namespace-level aggregated ClusterRoles to grant admin/edit
users full management and view-only users read access for MaaS model
resources.

* **Documentation**
* Added a "Namespace User Permissions (RBAC)" guide with a permission
matrix, verification commands, and troubleshooting for Forbidden errors.

* **Chores**
* Added a verification script to validate RBAC aggregation and role
coverage in clusters.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…pendatahub-io#733)

<!--- Provide a general summary of your changes in the Title above -->

## Description
My assumption is that `validateModelRefs()` needs to rely on
`deletionTimestamp` just like `findHTTPRouteForModel()` does.
So, the failing scenario must be the following:
`MaaSModelRef` has a finalizer, so when you delete it:
1. Kubernetes sets `deletionTimestamp` but the object continues to exist
in the cache until the finalizer is removed
2. `validateModelRefs()` calls `r.Get()`, which succeeds (the object
still exists), so it sets `ready=true` and `reason=Valid`
3. `checkTokenRateLimitHealth()` calls `findHTTPRouteForModel()`, whose
`r.Get()` succeeds, but it then checks `deletionTimestamp` and returns
`ErrModelNotFound`
4. `deriveFinalPhase()` correctly detects the inconsistency via TRLP
health and sets `phase=Failed`
5. But `updateStatus()` persists with `phase=Failed` while
`modelRefStatuses=[{ready: true, reason: Valid}]` since
`modelRefStatuses` was already set with `ready=true` and never corrected

The model's finalizer cleanup (deleting AuthPolicies, TRLPs, backend
resources) can take time, so the model remains in "deleting" state for
the duration. During this window, every reconciliation produces the same
stale **ready=true** in `modelRefStatuses`.

**After the change:**
1. The test deletes the model, so `deletionTimestamp` is set (the
finalizer might still be completing its task, hence the object remains
alive)
2. The subscription reconciles:
   - `validateModelRefs()` sets `ready=true`
   - `checkTokenRateLimitHealth()` sets `BackendNotReady`
   - fix: `modelRefStatuses[0]` is corrected from `ready=true` to
     `ready=false` with `reason=NotFound`
   - `deriveFinalPhase()` sets `phase=Failed`
   - `updateStatus()` saves `phase=Failed` and
     `modelRefStatuses=[ready=false]`
3. The test's `_wait_for_subscription_phase("Failed")` succeeds
4. The test's poll for `modelRefStatuses[0].ready == False` succeeds
immediately because the status was corrected in the same reconciliation
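The corrected flow can be condensed into a small Python sketch (names mirror the Go functions discussed above; this is an illustration of the logic, not the actual controller code):

```python
# Illustrative sketch of the reconciliation fix (not the real Go code):
# a model whose deletionTimestamp is set must be reported as not ready,
# even though Get() still returns it from the cache while finalizers run.

def reconcile(model, token_rate_limit_healthy):
    # validateModelRefs(): Get() succeeds, so the naive status is ready.
    status = {"ready": True, "reason": "Valid"}

    # The fix: treat a model marked for deletion as gone.
    if model.get("deletionTimestamp") is not None:
        status = {"ready": False, "reason": "NotFound"}

    # deriveFinalPhase(): any unhealthy ref fails the subscription.
    phase = "Ready" if status["ready"] and token_rate_limit_healthy else "Failed"
    return phase, status

# A deleting model now yields phase=Failed AND ready=false together,
# instead of phase=Failed with a stale ready=true.
phase, status = reconcile({"deletionTimestamp": "2026-04-09T20:51:00Z"}, False)
```

With this, `phase` and `modelRefStatuses` are consistent within a single reconciliation, which is what the test polls for.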


## How Has This Been Tested?
Tests pass

## Merge criteria:
<!--- This PR will be merged by any repository approver when it meets
all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
* Fixed an issue where models marked for deletion would incorrectly
remain in the "ready" state. The system now properly identifies models
undergoing deletion and corrects their status to "not found" to ensure
accurate health reporting and phase information.

* **Tests**
* Added test coverage for model deletion scenarios to verify proper
status correction during reconciliation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Yuriy Teodorovych <Yuriy@ibm.com>
…pendatahub-io#731)

## Summary

Fix `MAAS_API_IMAGE` and `MAAS_CONTROLLER_IMAGE` env vars being silently
ignored during kustomize-mode deployments, causing CI to always deploy
`latest` instead of PR images.

## Description

### Problem

When deploying MaaS via kustomize mode (used by Konflux CI and
`prow_run_smoke_test.sh`), the `MAAS_API_IMAGE` environment variable was
silently overridden. The deploy script logged the correct PR image, but
the pod always ended up running `latest`:

```
# Logs showed correct image:
Using custom MaaS API image: quay.io/opendatahub/maas-api:odh-pr-721

# But the pod had:
image: quay.io/opendatahub/maas-api:latest
```

### Root Cause

The `shared-patches` kustomize component uses `replacements:` to set
container images from `params.env`. But `set_maas_api_image()` and
`set_maas_controller_image()` were only patching the base `images:`
transformer in `kustomization.yaml`.

Kustomize processes `images:` transformers **before** `replacements:`,
so `params.env` (hardcoded to `latest`) always overwrote the custom
image. This regression was introduced when the `shared-patches`
component was added to centralize overlay configuration.

The maas-controller was not visibly affected because `deploy.sh` had a
post-apply `kubectl set image` workaround that corrected the image after
kustomize had already applied it with the wrong tag. maas-api had no
such workaround.
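The ordering problem can be sketched with a minimal kustomization fragment (values are illustrative, based on the description above):

```yaml
# Illustrative: kustomize runs the `images:` transformer before component
# `replacements:`. A replacement sourced from params.env therefore
# overwrites whatever the images transformer set.
images:
  - name: quay.io/opendatahub/maas-api
    newTag: odh-pr-721        # patched by set_maas_api_image()
components:
  - ../shared-patches         # its replacements: read params.env,
                              # which still says "latest", so it wins
```

Patching `params.env` itself, as this PR does, is the only way to feed a custom image through the `replacements:` stage.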

### Fix

- Add `_patch_params_env` helper to patch `params.env` with custom image
values before `kustomize build`, so replacements pick up the correct
image.
- Call `_patch_params_env` from both `set_maas_api_image` and
`set_maas_controller_image` after the existing base kustomization
patching.
- Add `_cleanup_params_env` to restore `params.env` from backup after
build.
- Remove the post-apply `kubectl set image` workaround for
maas-controller in `deploy.sh` since `params.env` now carries the
correct image through the kustomize build pipeline.
- Log deployed maas-api and maas-controller images at end of deployment
for easier verification.

## How it was tested

- Verified locally that `kustomize build` with the old approach
(patching base `images:` transformer) still produces `latest` —
confirming the bug.
- Verified locally that `kustomize build` with the fix (patching
`params.env`) produces the correct custom image.
- Tested both `tls-backend` and `http-backend` overlays — both produce
correct images.
- Verified operator mode is unaffected (base `images:` transformer still
works for direct base builds).
- Verified default behavior (no env var set) still produces `latest`.
- Verified `params.env` is restored from backup after deployment (no
leftover `.backup` file).
- Deployed on a live cluster with `MAAS_API_IMAGE` and
`MAAS_CONTROLLER_IMAGE` set — both pods running correct PR images.


Made with [Cursor](https://cursor.com)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Improvements**
* Deployment now always logs the live container images for key services
to help verification and troubleshooting, with safe fallbacks when data
is missing.
* Image update operations also persistently update deployment
configuration so chosen images remain synchronized across tools and are
cleanly restored during cleanup.
* Final completion message changed to “Models-as-a-Service Deployment
completed successfully!” to reflect branding.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chaitanya Kulkarni <ckulkarn@redhat.com>
Signed-off-by: Chaitanya Kulkarni <chkulkar@redhat.com>
## Description

https://redhat.atlassian.net/browse/RHOAIENG-57627

The /v1/models endpoint exemption was lost during the migration from
tier-based to subscription-based rate limiting. This caused model
discovery endpoints to be blocked when users exhausted their token
quota, even though these endpoints don't consume inference tokens.

This restores the original behavior from commit 660f4db by adding
`!request.path.endsWith("/v1/models")` to the per-route
TokenRateLimitPolicy predicates.

Changes:
- Add path exemption to TRLP when clause in
maassubscription_controller.go
- Add E2E test for per-model /v1/models endpoints (test_subscription.py)
- Add E2E test for central /v1/models endpoint aggregation
(test_models_endpoint.py)
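The shape of the exemption, heavily abridged and with field layout illustrative rather than an exact copy of the TokenRateLimitPolicy CRD schema, looks roughly like:

```yaml
# Abridged, illustrative sketch of the per-route predicate; only the
# CEL expression is taken from this PR's description.
spec:
  limits:
    per-subscription:
      when:
        - predicate: '!request.path.endsWith("/v1/models")'
```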

## How Has This Been Tested?
Both tests validate that:
1. Inference requests are blocked (429) when quota exhausted
2. /v1/models endpoints remain accessible (200) when quota exhausted

This is a regression from the tier system removal. The original issue
was [RHOAIENG-46770](https://redhat.atlassian.net/browse/RHOAIENG-46770)
(resolved in tier-based system).


## Merge criteria:
<!--- This PR will be merged by any repository approver when it meets
all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Token-based subscription rate limiting now exempts the /v1/models
endpoint so model discovery remains accessible when a subscription's
token quota is exhausted.

* **Tests**
* Added e2e tests confirming inference is blocked after quota exhaustion
while GET /v1/models still returns 200 and valid listings; updated test
expectations and cleanup to reflect the exemption.

* **Chores**
* Added a diagnostics helper for LLM inference service artifacts and
integrated it into smoke-test timeout diagnostics.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…hub-io#732)

<!--- Provide a general summary of your changes in the Title above -->

## Description
<!--- Describe your changes in detail -->
As MaaS controller will be included from RHOAI 3.4, the explicit
deployment seems unnecessary and may even conflict with what's already
installed by the operator (local vs. operator manifests).

## How Has This Been Tested?
<!--- Please describe in detail how you tested your changes. -->
<!--- Include details of your testing environment, and the tests you ran
to -->
<!--- see how your change affects other areas of the code, etc. -->

## Merge criteria:
<!--- This PR will be merged by any repository approver when it meets
all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Optimized controller deployment to skip redundant installation when a
controller already exists in operator mode, improving deployment
efficiency and reducing unnecessary operations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…pendatahub-io#740)

## Summary

Enhance E2E artifact collection to dump full MaaS CR YAML definitions
and collect pod logs from RHOAI-related namespaces, improving CI
debuggability.

## Description

- Add `collect_maas_crs()` function that dumps full YAML for all four
MaaS CRD types (`maasmodelrefs`, `maasauthpolicies`,
`maassubscriptions`, `externalmodels`) with dynamic namespace discovery,
mirroring the CRD list from `red-hat-data-services/must-gather`
`gather_models_as_a_service` script.
- Expand pod log collection to cover 8 namespaces: `opendatahub`,
`models-as-a-service`, `redhat-ods-operator`, `redhat-ods-applications`,
`kuadrant-system`, `openshift-ingress`, `llm`, and `istio-system`, with
graceful skip for non-existent namespaces.
- Add RHOAI operator, applications, DSC/DSCI, and gateway namespace
resource snapshots to `cluster-state.log`.
- Persist auth debug report to `auth-debug-report.log` in the artifact
directory.
- Fix early exits under `set -e` caused by `[[ ]] && ...` patterns when
running without a cluster connection.
- All collected CR YAML is token-redacted before writing to disk.

## How it was tested

- Ran `ARTIFACTS_DIR=test/e2e/reports/maas-debug
./test/e2e/scripts/auth_utils.sh` locally without a cluster connection
to verify the script runs to completion without failures.
- Verified artifact directory structure is created correctly with
expected files (`maas-crs/no-crs-found.log`, `cluster-state.log`,
`auth-debug-report.log`, `pod-logs/` subdirectories).
- Verified `bash -n` syntax check passes.

Made with [Cursor](https://cursor.com)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
* Enhanced test diagnostics with expanded cluster state snapshots and
multi-namespace pod log collection.
* Improved artifact capture for MaaS custom resources and authorization
debug reports.

* **Chores**
* Added configuration variables for RHOAI, gateway, LLM, and Istio
namespaces.
* Refined log collection and error handling in end-to-end testing
utilities.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chaitanya Kulkarni <chkulkar@redhat.com>
…opendatahub-io#709)

ExternalModel HTTPRoutes use PathPrefix: `/<modelName>` while
LLMInferenceService routes use PathPrefix: `/<namespace>/<modelName>`.
This means two ExternalModel MaaSModelRefs with the same name in
different namespaces would collide on the same path.

Changes the ExternalModel reconciler and endpoint resolver to include
the namespace in the path, matching the LLMInferenceService pattern:

- resources.go: PathPrefix: `/<namespace>/<modelName>` (was
`/<modelName>`)
- providers_external.go: endpoint URL
`https://<host>/<namespace>/<modelName>` (was
`https://<host>/<modelName>`)

Before: POST `/gpt-4o/v1/chat/completions`
After: POST `/llm/gpt-4o/v1/chat/completions`

The URLRewrite filter already strips the full prefix to `/`, so the
external provider still receives POST `/v1/chat/completions`.
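Sketched as a Gateway API HTTPRoute rule (values taken from the before/after example above, the rest illustrative), the new layout is:

```yaml
# Illustrative rule for an ExternalModel named gpt-4o in the llm
# namespace: the prefix now includes the namespace, and URLRewrite
# strips it so the provider still sees /v1/chat/completions.
rules:
  - matches:
      - path:
          type: PathPrefix
          value: /llm/gpt-4o        # was /gpt-4o
    filters:
      - type: URLRewrite
        urlRewrite:
          path:
            type: ReplacePrefixMatch
            replacePrefixMatch: /
```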

cc/ @jland-redhat @nirrozenbaum leaving in draft until I manually test.
Fighting cluster availability via CB so it might be a couple of hours.
Nice catch Nir 🥳!

<!--- Provide a general summary of your changes in the Title above -->

## Description
<!--- Describe your changes in detail -->

## How Has This Been Tested?
<!--- Please describe in detail how you tested your changes. -->
<!--- Include details of your testing environment, and the tests you ran
to -->
<!--- see how your change affects other areas of the code, etc. -->

## Merge criteria:
<!--- This PR will be merged by any repository approver when it meets
all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
  * Simplified external model routing and endpoint naming conventions
  * Updated endpoint URL path structure to include namespace information
* Migrated TLS/port configuration to ExternalModel resource annotations
(`maas.opendatahub.io/port`, `maas.opendatahub.io/tls`)
* Streamlined internal resource builder functions for Kubernetes and
Istio components
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Brent Salisbury <bsalisbu@redhat.com>
## Summary
- Enable group testing so that e2e tests run with both `maas-api` and
`maas-controller` images built from the same PR commit
- Previously, per-component Konflux snapshots meant each integration
test had one image from the PR and the other from the main branch

## Changes
- **`.tekton/odh-maas-api-pull-request.yaml`** — add
`enable-group-testing: "true"`
- **`.tekton/odh-maas-controller-pull-request.yaml`** — add
`enable-group-testing: "true"`
- **`.tekton/maas-group-test.yaml`** (new) — group test PipelineRun
triggered by `/group-test` comment after all builds complete

## How it works
1. PR opened on `models-as-a-service` → both component builds triggered
2. Each build's `trigger-group-testing` finally task checks if all other
Konflux checks are completed
3. The last build to complete posts `/group-test` comment on the PR
4. PAC matches the comment and triggers the `maas-group-test`
PipelineRun
5. `generate-snapshot-for-group-testing` creates a composite snapshot
with both PR-built images (using `odh-pr-{PR_NUMBER}` tags)
6. e2e tests run with correct `MAAS_API_IMAGE` and
`MAAS_CONTROLLER_IMAGE` from the same commit

## Dependencies
- Requires opendatahub-io/odh-konflux-central#241 to be merged first
(registers `maas-group` Component and the group testing Pipeline)

## Test plan
- [ ] Merge opendatahub-io/odh-konflux-central#241 first
- [ ] Wait for gitops sync (maas-group Component lands in Konflux)
- [ ] Merge this PR
- [ ] Open a test PR on this repo
- [ ] Verify both builds complete and `/group-test` comment is posted
automatically
- [ ] Verify `maas-group-test` PipelineRun is created and e2e tests pass
with both correct images

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Added new group testing pipeline for comprehensive validation of
models-as-a-service component groups
* Enabled group testing parameter in pull request workflows for API and
controller services
* Implemented dedicated integration testing infrastructure to support
component group validation scenarios

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary
Related to - https://redhat.atlassian.net/browse/RHOAIENG-57622

Fixes intermittent failures in `test_rate_limit_exhaustion_gets_429`
that occurred on PRs with no rate limit changes.

## Problem

The test was making an **incorrect assumption** that `max_tokens=N`
would consume exactly N tokens per request. This caused flaky failures
because:

- Models may return fewer tokens than `max_tokens` (it's a ceiling, not
exact)
- Prompt tokens also count toward rate limits (not just completion
tokens)
- Actual token usage varies per request

**Flaky Logic (Before):**
```python
token_limit = 15
max_tokens = 3
expected_success = token_limit // max_tokens  # Expected exactly 5 successful requests

assert abs(success_count - expected_success) <= 1  # Flaky assertion!
```

This assertion failed when responses used 2 or 4 tokens instead of
exactly 3.

## Changes

### 1. Made `max_tokens` configurable in `_inference()` helper
```python
# Before:
def _inference(api_key, path=None, extra_headers=None, model_name=None):
    json={"max_tokens": 3}  # Hardcoded

# After:
def _inference(api_key, path=None, extra_headers=None, model_name=None, max_tokens=3):
    json={"max_tokens": max_tokens}  # Configurable, default: 3
```

✅ **Backward compatible** - all existing tests continue using default
`max_tokens=3`

### 2. Updated test to use flexible logic
```python
# Before (flaky):
token_limit = 15
max_tokens = 3
total_requests = (15 / 3) + 2  # Expected exactly 5 successful, send 7

# After (robust):
token_limit = 10
total_requests = 15
r = _inference(api_key, path=model_path, max_tokens=1)  # Minimize variance
```

**Key improvements:**
- Uses `max_tokens=1` (minimize variance)
- 50% safety margin (10 token limit, 15 requests)
- **Just verifies 429 occurs** - doesn't assume when
- Removed strict token math assertions
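The robust assertion reduces to logic like the following standalone sketch over a window of observed status codes (an illustration of the test's intent, not the actual test file):

```python
# Standalone sketch of the new assertion logic: across a fixed window of
# requests, the test only requires that at least one request succeeded
# and that rate limiting kicked in at some point. It makes no assumption
# about WHEN the first 429 occurs, because token consumption per request
# is non-deterministic (prompt tokens count too; max_tokens is a ceiling).

def check_rate_limit_behavior(status_codes):
    assert 200 in status_codes, "expected at least one successful request"
    assert 429 in status_codes, "expected quota exhaustion eventually"

# Example: 15 requests with max_tokens=1; the limiter trips partway through.
check_rate_limit_behavior([200] * 6 + [429] * 9)
```

The 50% safety margin (10-token limit, 15 requests) makes the `429 in status_codes` condition reliable without any exact token math.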

### 3. Improved comments
Explains that token consumption is non-deterministic, so the test
verifies rate limiting works without assuming exact timing.

## Testing

**Validated on live cluster:**

🎉 E2E Tests Completed Successfully!

Final Test Results:
- ✅ 89 PASSED
- ⏭️ 4 SKIPPED
- ⚠️ 81 warnings
- ⏱️ Duration: 17 minutes 16 seconds
- 🎯 The fixed test: `test_rate_limit_exhaustion_gets_429 PASSED [ 46%]`

**Consistency:** 100% pass rate across multiple runs (no flakiness
detected)

## Impact

**Before:**
- ❌ Test fails ~20-30% of the time
- ❌ Blocks unrelated PRs
- ❌ Requires manual re-runs

**After:**
- ✅ Verifies rate limiting works (core behavior)
- ✅ No assumptions about exact token consumption
- ✅ Eliminates flakiness while maintaining test validity

## Related

This test was added to verify token-based rate limiting works
end-to-end. The fix maintains the test's original purpose while removing
unreliable timing assumptions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Test helper now accepts an optional max_tokens parameter (default 3)
for inference requests.
* Rate-limit exhaustion test made more robust and assumption-free:
request sizing and counts simplified, looped requests use explicit
smaller token usage, 429 validation simplified, and the test requires at
least one successful 200 before any observed 429 within a fixed request
window.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…ahub-io#739)

## Summary
Related to - https://redhat.atlassian.net/browse/RHOAIENG-57822

Fixes maintidx linter failure in `ListLLMs` function that blocks PRs
modifying `maas-api/**` files.

## Problem
The `ListLLMs` function in `maas-api/internal/handlers/models.go` fails
the maintidx linter:
- **Cyclomatic Complexity**: 25 (threshold: 20)
- **Maintainability Index**: 19 (threshold: 20)

## Root Cause
The function grew complex over time with multiple conditional branches,
nested loops, and error handling paths. Recent changes pushed it
slightly over the linter threshold.

## Changes
Refactored `ListLLMs` by extracting helper methods:
- `extractAndValidateAuth()` - handles authorization header validation
- `getUserContextIfNeeded()` - retrieves user context from middleware
- `aggregateModelsFromSubscriptions()` - filters and aggregates models
across subscriptions

**Metrics improvement:**
- Cyclomatic Complexity: 25 → **<20** ✅
- Maintainability Index: 19 → **>20** ✅

## Testing
- ✅ Lint passes with 0 issues
- ✅ All unit tests pass (80.3% coverage)
- ✅ Function behavior unchanged (backward compatible)

## Impact
- Unblocks PR opendatahub-io#694 and other PRs that modify `maas-api/**`
- Improves code maintainability and testability
- No user-facing changes (pure refactoring)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
<!--- Provide a general summary of your changes in the Title above -->

## Description
<!--- Describe your changes in detail -->
Add the Perses dashboard and datasource that were added in opendatahub-io#624 to the
ODH overlay

## How Has This Been Tested?
<!--- Please describe in detail how you tested your changes. -->
<!--- Include details of your testing environment, and the tests you ran
to -->
<!--- see how your change affects other areas of the code, etc. -->
It was tested manually with a custom image of the ODH operator that
checks the existence of Perses CRDs and sets owner references on the
Perses resource

## Merge criteria:
<!--- This PR will be merged by any repository approver when it meets
all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Observability dashboards are now included in the deployment so
dashboards are available by default.

* **Chores**
* Prometheus datasource renamed and all dashboard references updated for
consistent datasource resolution.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Arik Hadas <ahadas@redhat.com>
…-io#694)

Related to https://redhat.atlassian.net/browse/RHOAIENG-57159

## Summary
Fix all validation errors and warnings in `maas-api/openapi3.yaml`
discovered by Spectral linting (introduced in opendatahub-io#693).

## Changes

### Errors Fixed (4 total)
1. **Line 177** - `paths./v1/models.get.responses[500]`: Add missing
`type` field to ErrorResponse example
2. **Line 456** - `paths./v1/api-keys/bulk-revoke.post.responses[403]`:
Add missing `type` field to ErrorResponse example
3. **Line 532** - `paths./v1/subscriptions.get`: Add missing `priority`
and `model_refs` fields to SubscriptionListItem example
4. **Line 566** - `paths./v1/model/{model-id}/subscriptions.get`: Add
missing `priority` and `model_refs` fields to SubscriptionListItem
example

### Warnings Fixed (8 total)
1. **Line 2** - Add complete contact information (name, url, email) to
`info` section
2. **Line 2** - Add Apache 2.0 license with URL to `info` section
3. **Lines 181, 284, 399, 460, 484** - Add missing `api-keys-v2` tag
definition (used by 5 operations)
4. **Line 925** - Reorder tags alphabetically (api-keys, api-keys-v2,
health, models, subscriptions)
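
The contact/license warnings above would be resolved by an `info` block along these lines — a sketch with illustrative values, not the actual content of `openapi3.yaml`:

```yaml
info:
  title: MaaS API
  version: 1.0.0
  contact:
    name: Models-as-a-Service maintainers        # illustrative
    url: https://github.com/opendatahub-io/models-as-a-service
    email: maintainers@example.com               # illustrative
  license:
    name: Apache 2.0
    url: https://www.apache.org/licenses/LICENSE-2.0.html
```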

## Validation Results

**Before:**
```
✖ 18 problems (4 errors, 8 warnings, 0 infos, 6 hints)
```

**After:**
```
✖ 6 problems (0 errors, 0 warnings, 0 infos, 6 hints)
```

The remaining 6 hints are from the custom `maas-subscription-header`
rule (informational only, not blocking).

## Test Plan
- [x] Run `spectral lint maas-api/openapi3.yaml` locally - passes with 0
errors, 0 warnings
- [x] CI OpenAPI validation workflow passes (depends on opendatahub-io#693 merging
first)

## Related
- Depends on opendatahub-io#693 (OpenAPI validation infrastructure)
- Fixes all issues identified by the new CI validation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
  * Enhanced API documentation with contact and license information.
  * Updated error response examples to include structured error types.
* Extended subscription endpoint responses with additional fields for
priority and model references.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…-io#738)

Continued work in addition to
[733](opendatahub-io#733)
and
[724](opendatahub-io#724) to
refactor, condense and consolidate our test suite for easier code
management and clearer flow for future coding.

## Description
1. Centralize shared helpers into `test_helper.py`. Add comprehensive
docstring documenting all env vars
2. Rename and enhance wait helpers for clarity:
- `_wait_for_authpolicy_phase()` → `_wait_for_maas_auth_policy_phase()`
(added `require_enforced` param)
- `_wait_for_subscription_phase()` →
`_wait_for_maas_subscription_phase()` (added `require_model_statuses`
param)
- Remove now-redundant `_wait_for_maas_auth_policy_ready()` and
`_wait_for_maas_subscription_ready()` convenience wrappers
3. Remove local duplicates from consumer test files
(`test_subscription.py`, `test_negative_security.py`,
`test_external_models.py`, `test_models_endpoint.py`,
`test_namespace_scoping.py`, `test_subscription_list_endpoints.py`, and
`test_api_keys.py`) and import from `test_helper` instead of defining
their own copies of shared functions/constants (circling back to work
done in point 1)
4. Switch from `kubectl` to `oc` for consistency.

## How Has This Been Tested?
Tests passing. 

## Merge criteria:
<!--- This PR will be merged by any repository approver when it meets
all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Tests**
* Consolidated end-to-end test utilities into a shared helper for
consistency and reduced duplication.
* Added helpers for service-account cleanup, resilient resource
listing/snapshotting, related-resource lookups, and token rate-limit
verification.
* Replaced multiple "ready" waiters with generalized, phase-based wait
helpers and unified defaults via shared constants.
* Updated test docstrings to reference centralized
environment/prerequisite documentation and removed file-specific env
listings.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Yuriy Teodorovych <Yuriy@ibm.com>
…atahub-io#749)

## Summary

- Detect operator type (RHOAI/ODH) and clean MaaS resources from the
correct application namespace
- Delete MaaS resources individually from `redhat-ods-applications`
instead of relying on namespace deletion (the namespace is
operator-managed and should not be deleted)
- Delete AuthConfig CRs cluster-wide before policy engine namespace
removal to prevent InstallPlan failures when switching engines (e.g.
community Kuadrant to RHCL)
- Delete GatewayClass `openshift-default` in gateway cleanup

## Context

Found during deployment testing on RHOAI 3.3.1 clusters. After running
`cleanup-odh.sh`, 19 MaaS resources remained in
`redhat-ods-applications` because the script only deleted the
`opendatahub` namespace. Old AuthConfig CRs also blocked RHCL installs
due to CRD schema incompatibility.

## Test plan

- [ ] Run cleanup on a RHOAI cluster with MaaS deployed, verify no MaaS
resources remain in `redhat-ods-applications`
- [ ] Run cleanup on an ODH cluster, verify existing behavior is
preserved
- [ ] Run `deploy.sh` after cleanup, verify deployment succeeds without
manual intervention
- [ ] Verify cleanup works when switching from community Kuadrant to
RHCL

Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved operator detection for OpenDataHub and Red Hat AI
installations
* Enhanced cleanup process to more thoroughly remove associated
resources and prevent reinstallation issues
  * Better cleanup verification output to confirm removal of resources

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ob (opendatahub-io#751)

## Summary

Add `maas-api-key-cleanup-image` to `params.env` and wire it via
kustomize replacement into the cleanup CronJob, enabling the ODH
operator to override the image at deploy time.

## Description

The `maas-api-key-cleanup` CronJob currently uses a hardcoded
`registry.redhat.io/ubi9/ubi-minimal:9.7` image for the curl-based API
key cleanup. This means the ODH operator has no way to override it with
a pinned SHA digest at deploy time.

- Add `maas-api-key-cleanup-image` key to
`deployment/overlays/odh/params.env` with the default ubi-minimal image.
- Add a kustomize replacement in
`deployment/components/shared-patches/kustomization.yaml` that wires
`data.maas-api-key-cleanup-image` from the `maas-parameters` ConfigMap
into the CronJob container image field.
- This enables the operator's `ApplyParams()` to substitute the image
via `RELATED_IMAGE_UBI_MINIMAL_IMAGE` (from the bundle CSV), ensuring
pinned SHA digests in production and support for disconnected
environments.
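
The replacement wiring described above might look roughly like this — a sketch; the container selector and exact field path used in `shared-patches/kustomization.yaml` may differ:

```yaml
replacements:
  - source:
      kind: ConfigMap
      name: maas-parameters
      fieldPath: data.maas-api-key-cleanup-image
    targets:
      - select:
          kind: CronJob
          name: maas-api-key-cleanup
        fieldPaths:
          # assumes a single container; the real config may select by name
          - spec.jobTemplate.spec.template.spec.containers.0.image
```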

**Companion changes required:**
- [RHOAI-Build-Config PR
#19203](red-hat-data-services/RHOAI-Build-Config#19203)
— adds `RELATED_IMAGE_UBI_MINIMAL_IMAGE` to
`additional-images-patch.yaml`
- opendatahub-operator — adds `"maas-api-key-cleanup-image":
"RELATED_IMAGE_UBI_MINIMAL_IMAGE"` to `imagesMap` in
`modelsasservice_support.go`

## How It Was Tested

- Verified kustomize build renders the CronJob with the image from
`params.env`.
- Without the operator change, the CronJob uses the default value from
`params.env` (same image as today — no behavioral change).

Made with [Cursor](https://cursor.com)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Added configuration for a new API key cleanup task in the deployment
environment. Updated deployment settings to include a dedicated
container image for cleanup operations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chaitanya Kulkarni <ckulkarn@redhat.com>
Signed-off-by: Chaitanya Kulkarni <chkulkar@redhat.com>
## Problem

The HTTPRoute header-based rule matches `X-Gateway-Model-Name` against
the ExternalModel `metadata.name`. This breaks when `targetModel`
differs from the name (e.g., Bedrock models: `name=my-bedrock`,
`targetModel=openai.gpt-oss-20b`).

The user sends `targetModel` in the request body. BBR's
`body-field-to-header` plugin extracts it as `X-Gateway-Model-Name`.
After ClearRouteCache, the header doesn't match → `route_not_found`.

## Fix

Pass `targetModel` to `buildHTTPRoute` and use it in the header match
value instead of `name`.
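
A minimal sketch of the change, using local stand-in types rather than the real Gateway API structs (the fallback-to-`name` behavior when `targetModel` is empty is an assumption for the example, not confirmed by this PR):

```go
package main

import "fmt"

// HeaderMatch is a stand-in for the Gateway API header-match type
// used by the real buildHTTPRoute.
type HeaderMatch struct {
	Name  string
	Value string
}

// buildModelHeaderMatch sketches the fix: match X-Gateway-Model-Name
// against targetModel instead of the ExternalModel metadata.name,
// falling back to the name when no targetModel is set (assumption).
func buildModelHeaderMatch(name, targetModel string) HeaderMatch {
	value := targetModel
	if value == "" {
		value = name
	}
	return HeaderMatch{Name: "X-Gateway-Model-Name", Value: value}
}

func main() {
	// Bedrock example from the problem description:
	// name=my-bedrock, targetModel=openai.gpt-oss-20b
	m := buildModelHeaderMatch("my-bedrock", "openai.gpt-oss-20b")
	fmt.Println(m.Name, m.Value)
}
```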

## Changes

- `reconciler.go`: pass `extModel.Spec.TargetModel` to `buildHTTPRoute`
- `resources.go`: accept `targetModel` param, use in header match
- `resources_test.go`: update existing test, add test case where
  targetModel differs from name

## Tested

On RHOAI cluster:
- `llm/my-bedrock` (targetModel: `openai.gpt-oss-20b`) → Bedrock 200 ✓
- Header match correctly uses `openai.gpt-oss-20b` not `my-bedrock`

Fixes opendatahub-io#745

```release-note
NONE
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **Improvements**
* Enhanced HTTP routing logic for external models to separately use
target model identifiers in request matching, enabling more precise
routing when the model name differs from its target model designation.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
somya-bhatnagar and others added 25 commits April 16, 2026 00:15
…ndatahub-io#742)

## Description

Related to - https://redhat.atlassian.net/browse/RHOAIENG-58233

Fixes the bug where maas-controller incorrectly reports "namespace
already exists" when the `models-as-a-service` namespace is in
`Terminating` phase during RHOAI reinstall/upgrade, leaving the
controller running without its subscription namespace and MaaS
non-functional.

## Root Cause

The `ensureSubscriptionNamespaceExists` function discarded the namespace
object and never checked `ns.Status.Phase`. When a namespace is
`Terminating`, the GET succeeds (no error), so the function incorrectly
assumed the namespace was ready.

```go
// Before (buggy)
_, err = clientset.CoreV1().Namespaces().Get(ctx, namespace, metav1.GetOptions{})
if err == nil {
    setupLog.Info("subscription namespace already exists", "namespace", namespace)
    return nil  // Bug: namespace might be Terminating
}
```

## Solution

Implemented a comprehensive fix with three components:

### 1. Enhanced Startup Logic (`ensureSubscriptionNamespaceWithClient`)
- Captures namespace object and checks `Status.Phase`
- If `Terminating`: waits up to 90s for deletion, then recreates
- If `Active`: returns early (namespace is ready)
- Handles operator recreation during wait (race condition)

### 2. Runtime Monitoring (`subscriptionNamespaceMonitor`)
- Periodically re-checks namespace (30s interval, configurable)
- Auto-recreates if namespace deleted while controller running
- Respects leader election (only leader runs monitor)
- Resilient error handling (logs errors, continues)

### 3. Readiness Reporting (`checkSubscriptionNamespaceReady`)
- Integrated into `/readyz` endpoint
- Returns not-ready if namespace missing or Terminating
- Uncached check for accurate state reflection
- Kubernetes won't route traffic when not-ready
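
The startup logic in component 1 can be sketched with the client calls injected as functions (the real code uses client-go against the API server; `getPhase`/`create` here are hypothetical stand-ins so the control flow is visible):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type phase string

const (
	phaseActive      phase = "Active"
	phaseTerminating phase = "Terminating"
	phaseAbsent      phase = "Absent" // stand-in for a NotFound error
)

// ensureNamespace sketches the fixed startup logic: return early when
// Active, poll while Terminating, and (re)create once the namespace is
// gone. getPhase and create stand in for the real client-go calls.
func ensureNamespace(getPhase func() phase, create func() error,
	poll time.Duration, maxTries int) error {
	for i := 0; i < maxTries; i++ {
		switch getPhase() {
		case phaseActive:
			return nil // namespace is ready
		case phaseAbsent:
			return create()
		case phaseTerminating:
			time.Sleep(poll) // wait for deletion to complete
		}
	}
	return errors.New("timed out waiting for terminating namespace")
}

func main() {
	// Simulate: Terminating twice, then deleted, then recreated.
	calls := 0
	get := func() phase {
		calls++
		if calls < 3 {
			return phaseTerminating
		}
		return phaseAbsent
	}
	created := false
	err := ensureNamespace(get, func() error { created = true; return nil },
		time.Millisecond, 10)
	fmt.Println(err, created)
}
```

The key difference from the buggy version is that the GET result is inspected, so a successful GET of a `Terminating` namespace no longer counts as "ready".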

## Edge Cases Handled

- ✅ Namespace exists and is Active → return early
- ✅ Namespace exists and is Terminating → wait for deletion, recreate
- ✅ Namespace doesn't exist → create with retry
- ✅ Forbidden on GET → assume operator-managed (existing behavior)
- ✅ Forbidden during termination poll → assume external management
- ✅ Timeout waiting for termination → fail with clear error
- ✅ Namespace recreated during poll → detect Active, return
- ✅ Unexpected errors during poll → fail fast with context
- ✅ AlreadyExists on CREATE → treat as success
- ✅ Forbidden on CREATE → permanent error with guidance

## Live Testing Results

Tested on cluster: `api.ci-ln-5zrhd3b-76ef8.aws-4.ci.openshift.org:6443`

### ✅ Scenario 1: Startup with Terminating Namespace (Original Bug)

**Test Steps:**
1. Created `models-as-a-service` namespace with finalizer
2. Deleted namespace (went to `Terminating` state)
3. Deployed controller while namespace was `Terminating`

**Results:**
```json
{"msg":"subscription namespace is terminating, waiting for deletion to complete","namespace":"models-as-a-service"}
{"msg":"terminating namespace has been deleted","namespace":"models-as-a-service"}
{"msg":"subscription namespace not found, attempting to create it","namespace":"models-as-a-service"}
{"msg":"subscription namespace ready","namespace":"models-as-a-service"}
{"msg":"starting manager"}
```

- ✅ Controller detected Terminating state
- ✅ Waited 22 seconds for deletion
- ✅ Recreated namespace successfully
- ✅ Namespace has correct label: `opendatahub.io/generated-namespace:
"true"`

---

### ✅ Scenario 2: Runtime Monitoring (Auto-Recovery)

**Test Steps:**
1. Deployed controller with namespace `Active`
2. Deleted namespace while controller was running
3. Monitored automatic recreation

**Results:**
```
17:27:05 - Monitor check: namespace Active
17:27:29 - Namespace deleted (manual deletion)
17:27:35 - Monitor detected Terminating (30s cycle)
17:27:37 - Namespace recreated and ready
```

- ✅ Monitor detected deletion within 6 seconds (next 30s cycle)
- ✅ Auto-recreated namespace in ~8 seconds total
- ✅ No manual intervention needed
- ✅ Namespace has correct label

---

### ✅ Scenario 3: Readiness Reporting (Observability)

**Test Steps:**
1. Checked `/readyz` endpoint in different namespace states

**Results:**

| Namespace State | Readiness Endpoint | Pod Ready | Expected |
|----------------|-------------------|-----------|----------|
| **Active** | `ok` | `True` | ✅ |
| **Terminating** | `failed: reason withheld` | `False` | ✅ |
| **Recreated** | `ok` | `True` | ✅ |

- ✅ Pod correctly reported Not Ready during namespace Terminating
- ✅ Readiness endpoint accurately reflects namespace state
- ✅ Kubernetes won't route traffic when not-ready

---

## Configuration

New flag added:

```
--subscription-namespace-maintain-interval (default: 30s)
  How often to re-check the subscription namespace while running.
  Larger values reduce apiserver load; smaller values detect deletions sooner.
```

## Merge Criteria

- [x] The commits are squashed in a cohesive manner and have meaningful
messages
- [x] Testing instructions have been added in the PR body
- [x] The developer has manually tested the changes and verified that
the changes work on live cluster
- [x] All edge cases are handled with comprehensive error handling
- [x] Readiness probes accurately reflect system state

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added continuous namespace monitoring that automatically recreates the
subscription namespace if deleted during manager operation.
* Introduced new `--subscription-namespace-maintain-interval` CLI flag
to configure monitoring frequency.

* **Bug Fixes**
* Improved namespace startup logic to safely wait for terminating
namespaces (up to 90 seconds) before proceeding.

* **Chores**
* Refactored internal client initialization for better resource reuse
across startup and monitoring components.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: Mynhardt Burger <mynhardt@gmail.com>
…uadrant TokenRateLimitPolicy (opendatahub-io#750)

https://redhat.atlassian.net/browse/RHOAIENG-58408

## Description
Tightened kubebuilder/OpenAPI validation on `TokenRateLimit.Window`:
- Go type pattern changed from `^(\d+)(s|m|h|d)$` to `^[1-9]\d{0,3}(s|m|h)$`
- Regenerated the CRD (`maas.opendatahub.io_maassubscriptions.yaml` updated with the new pattern and an expanded description)
- Documented the allowed units and a migration note in:
  - the CRD reference doc (`maas-subscription.md`)
  - the OpenAPI spec (`openapi3.yaml`)
- Added tests
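
The new pattern can be checked directly — `1d` and zero/over-range values are rejected, while `s`/`m`/`h` windows from 1 to 9999 pass:

```go
package main

import (
	"fmt"
	"regexp"
)

// windowPattern mirrors the tightened validation: a 1–9999 numeric
// value followed by s, m, or h (days are no longer allowed).
var windowPattern = regexp.MustCompile(`^[1-9]\d{0,3}(s|m|h)$`)

func validWindow(w string) bool {
	return windowPattern.MatchString(w)
}

func main() {
	for _, w := range []string{"60s", "24h", "1d", "0s", "10000m"} {
		fmt.Printf("%-7s valid=%v\n", w, validWindow(w))
	}
}
```

Per the migration note, a former `1d` window should be expressed as `24h`.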

## How Has This Been Tested?
An additional test suite was introduced.

## Merge criteria:
<!--- This PR will be merged by any repository approver when it meets
all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Updated token rate limit "window" docs: only seconds (s), minutes (m),
hours (h) allowed; numeric range 1–9999. Days (d) no longer supported;
use hours instead (e.g., 24h).

* **API / Schema**
* CRD/OpenAPI schemas now enforce the new window pattern and string
length constraints (2–5 characters).

* **Tests**
* Added unit and end-to-end tests covering the tightened window
validation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Yuriy Teodorovych <Yuriy@ibm.com>
Automated promotion of **32 commit(s)** from `main` to `stable`.

```
6190476 fix: align MaaSSubscription token rate limit window validation with Kuadrant TokenRateLimitPolicy (opendatahub-io#750)
3d93d30 fix: handle Terminating namespace during RHOAI reinstall/upgrade (opendatahub-io#742)
ffcb990 fix: use targetModel in HTTPRoute header match (opendatahub-io#753)
4681ffd feat(kustomize): add operator-managed image for api key cleanup cronjob (opendatahub-io#751)
9b4672a fix: cleanup script handles RHOAI namespace and AuthConfig CRs (opendatahub-io#749)
5afc49c refactor: refactor and consolidate test helper functions (opendatahub-io#738)
7b81a7e fix: resolve OpenAPI spec validation errors and warnings (opendatahub-io#694)
93bc750 feat: deploy admin-usage dashboard via ODH (opendatahub-io#686)
fa03af7 refactor: reduce ListLLMs complexity to pass maintidx linter (opendatahub-io#739)
d91dc39 test: fix flaky test_rate_limit_exhaustion_gets_429 (opendatahub-io#730)
05f08df feat: enable group testing for MaaS components (opendatahub-io#741)
0ed1fb3 fix: add ns prefix to ExternalModel HTTPRoute path for llmisvc parity (opendatahub-io#709)
c93e879 feat(e2e): collect full MaaS CR definitions and RHOAI namespace logs (opendatahub-io#740)
5b9127e fix: avoid duplicate deployments of controller in deploy.sh (opendatahub-io#732)
6842e33 fix: restore /v1/models rate limiting exemption (opendatahub-io#729)
beced7f fix: patch params.env for custom image injection in kustomize mode (opendatahub-io#731)
b5b5afb test: fix `test_subscription_status_transitions_on_model_deletion()` (opendatahub-io#733)
c5468b2 feat: add RBAC aggregation for namespace users (opendatahub-io#716)
b77630b test: expand negative-path and security-focused E2E tests (opendatahub-io#724)
e069c5f fix: mitigate authorization timing race in /v1/models listing (opendatahub-io#549)
6d31fd8 test(e2e): enable unconfigured model deny-by-default test (opendatahub-io#728)
99bcd1b fix: replace third-party curl image with UBI-based image for disconnected support  (opendatahub-io#706)
08ff5b4 ci: add OpenAPI validation and automation infrastructure (opendatahub-io#693)
b265e54 feat: add E2E tests for external models (egress) (opendatahub-io#632)
bbfea0d chore: update smoke.sh to use API Keys (opendatahub-io#573)
8f22073 chore(deps): bump google.golang.org/grpc from 1.75.1 to 1.79.3 in /maas-api (opendatahub-io#566)
95f8645 chore(docs): document shared HTTPRoute TRLP limitation and cross-links (opendatahub-io#727)
e3da035 feat(maas-controller): add granular status reporting for MaaSSubscription and MaaSAuthPolicy (opendatahub-io#714)
36dbeb4 feat: add Granite Model that can work on CPU (opendatahub-io#723)
bbaa45a fix: correct AuthPolicy name in validation script (opendatahub-io#658) (opendatahub-io#659)
```
Automated promotion of **33 commit(s)** from `stable` to `rhoai`.

```
6190476 fix: align MaaSSubscription token rate limit window validation with Kuadrant TokenRateLimitPolicy (opendatahub-io#750)
3d93d30 fix: handle Terminating namespace during RHOAI reinstall/upgrade (opendatahub-io#742)
ffcb990 fix: use targetModel in HTTPRoute header match (opendatahub-io#753)
4681ffd feat(kustomize): add operator-managed image for api key cleanup cronjob (opendatahub-io#751)
9b4672a fix: cleanup script handles RHOAI namespace and AuthConfig CRs (opendatahub-io#749)
5afc49c refactor: refactor and consolidate test helper functions (opendatahub-io#738)
7b81a7e fix: resolve OpenAPI spec validation errors and warnings (opendatahub-io#694)
93bc750 feat: deploy admin-usage dashboard via ODH (opendatahub-io#686)
fa03af7 refactor: reduce ListLLMs complexity to pass maintidx linter (opendatahub-io#739)
d91dc39 test: fix flaky test_rate_limit_exhaustion_gets_429 (opendatahub-io#730)
05f08df feat: enable group testing for MaaS components (opendatahub-io#741)
0ed1fb3 fix: add ns prefix to ExternalModel HTTPRoute path for llmisvc parity (opendatahub-io#709)
c93e879 feat(e2e): collect full MaaS CR definitions and RHOAI namespace logs (opendatahub-io#740)
5b9127e fix: avoid duplicate deployments of controller in deploy.sh (opendatahub-io#732)
6842e33 fix: restore /v1/models rate limiting exemption (opendatahub-io#729)
beced7f fix: patch params.env for custom image injection in kustomize mode (opendatahub-io#731)
b5b5afb test: fix `test_subscription_status_transitions_on_model_deletion()` (opendatahub-io#733)
c5468b2 feat: add RBAC aggregation for namespace users (opendatahub-io#716)
b77630b test: expand negative-path and security-focused E2E tests (opendatahub-io#724)
e069c5f fix: mitigate authorization timing race in /v1/models listing (opendatahub-io#549)
6d31fd8 test(e2e): enable unconfigured model deny-by-default test (opendatahub-io#728)
99bcd1b fix: replace third-party curl image with UBI-based image for disconnected support  (opendatahub-io#706)
08ff5b4 ci: add OpenAPI validation and automation infrastructure (opendatahub-io#693)
b265e54 feat: add E2E tests for external models (egress) (opendatahub-io#632)
bbfea0d chore: update smoke.sh to use API Keys (opendatahub-io#573)
8f22073 chore(deps): bump google.golang.org/grpc from 1.75.1 to 1.79.3 in /maas-api (opendatahub-io#566)
95f8645 chore(docs): document shared HTTPRoute TRLP limitation and cross-links (opendatahub-io#727)
e3da035 feat(maas-controller): add granular status reporting for MaaSSubscription and MaaSAuthPolicy (opendatahub-io#714)
36dbeb4 feat: add Granite Model that can work on CPU (opendatahub-io#723)
bbaa45a fix: correct AuthPolicy name in validation script (opendatahub-io#658) (opendatahub-io#659)
```
…-io#721)

## Description
This PR implements subscription health enforcement at the
authentication/authorization layer, ensuring traffic is denied when a
subscription is not in an acceptable state.
   
**Jira:** https://redhat.atlassian.net/browse/RHOAIENG-57234
                  
### Main Feature: Auth Layer Rejection
                  
**OPA Rule Update:**
- Blocks subscriptions in `Failed` or `Pending` phases from making any
requests
- Returns 403 Forbidden with clear error message when subscription is
unhealthy
- Enforces subscription health consistently at the same layer as other
auth decisions

**Subscription Selector (maas-api):**
- Consumes subscription phase and modelRefStatuses from the controller (PR opendatahub-io#714)
- Returns appropriate errors for Failed/Pending subscriptions before OPA evaluation
- Validates subscription health during the selection process

### Enhancement: Active Filtering for Degraded Subscriptions

Beyond the core requirement, this PR also implements granular filtering for Degraded subscriptions:
- Degraded subscriptions can still access **healthy** models (`ready: true` in modelRefStatuses)
- Requests to **unhealthy** models within Degraded subscriptions are blocked with a clear error
- This allows partial service when some models are unavailable rather than blocking everything

**Rationale:** If a subscription has 3 healthy models and 1 broken model, users should still be able to access the 3 healthy models. Complete blocking would be unnecessarily restrictive.
### Example Behavior

**Failed Subscription:**
```yaml
status:
  phase: Failed
  conditions:
    - type: Ready
      status: "False"
      reason: ReconcileFailed
```
- ❌ All requests rejected at auth layer with 403 Forbidden

**Degraded Subscription:**
```yaml
status:
  phase: Degraded
  modelRefStatuses:
    - name: llama-model
      ready: true
    - name: broken-model
      ready: false
      reason: NotFound
```
- ✅ Requests to llama-model succeed (healthy model)
- ❌ Requests to broken-model blocked with error: "model not available in subscription (reason: model not healthy)"

**Active Subscription:**
```yaml
status:
  phase: Active
  modelRefStatuses:
    - name: llama-model
      ready: true
```
- ✅ All requests allowed per existing policy rules
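
The decision table above can be sketched as a small Go helper. This is a hypothetical illustration of the enforcement semantics, not the actual OPA Rego rule or the maas-api selector code:

```go
package main

import "fmt"

// ModelRefStatus mirrors the per-model health entries in
// MaaSSubscription status.
type ModelRefStatus struct {
	Name  string
	Ready bool
}

// allowRequest sketches the rules described above:
// Failed/Pending block everything; Degraded allows only healthy models;
// Active allows all models per existing policy rules.
func allowRequest(phase, model string, statuses []ModelRefStatus) bool {
	switch phase {
	case "Failed", "Pending":
		return false
	case "Degraded":
		for _, s := range statuses {
			if s.Name == model {
				return s.Ready
			}
		}
		return false // unknown model in a Degraded subscription: fail closed
	default: // Active
		return true
	}
}

func main() {
	statuses := []ModelRefStatus{
		{Name: "llama-model", Ready: true},
		{Name: "broken-model", Ready: false},
	}
	fmt.Println(allowRequest("Degraded", "llama-model", statuses))
	fmt.Println(allowRequest("Degraded", "broken-model", statuses))
	fmt.Println(allowRequest("Failed", "llama-model", statuses))
}
```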
                                                     
## How Has This Been Tested?

### Automated Tests (E2E - all passing ✅)

**Core Requirement Tests:**
- `test_failed_subscription_blocks_inference`
  - Verifies Failed subscriptions are rejected at the auth layer
  - Tests recovery: subscription returns to Active → requests allowed
- `test_subscriptions_endpoint_shows_degraded_health`
  - Verifies /v1/subscriptions correctly reports subscription health

**Active Filtering Tests:**
- `test_degraded_healthy_model_allows_inference`
  - Degraded subscription with healthy model → inference succeeds
- `test_degraded_unhealthy_model_blocks_inference`
  - Degraded subscription with unhealthy model → request blocked
- `test_models_endpoint_with_degraded_subscription_api_key`
  - Verifies the /v1/models endpoint with a Degraded subscription (API key auth)
- `test_models_endpoint_with_degraded_subscription_kube_token`
  - Verifies the /v1/models endpoint with a Degraded subscription (Kube token auth)

### Manual Verification

Tested on a live cluster:
1. Created a subscription with all invalid models → Failed phase
   - Verified: All inference requests rejected with 403
2. Updated the subscription to have 1 valid model → Degraded phase
   - Verified: Subscription enters the Degraded state
   - Verified: Inference to the valid model succeeds
   - Verified: Inference to the invalid model blocked with a clear error
3. Fixed all models → Active phase
   - Verified: All models accessible
4. Tested both API key and Kubernetes token authentication paths

**Client-facing behavior:**
- HTTP 403 for Failed/Pending subscriptions (consistent with other auth failures)
- Clear error messages that don't expose internal implementation details
- Error response format matches the existing API error structure

### Unit Tests
- Updated selector tests to verify phase-based rejection
- Tests cover all phase/model health combinations
- Validates error messages and HTTP status codes

## Dependencies

This PR depends on PR opendatahub-io#714, which implements the phase and modelRefStatuses fields in MaaSSubscription status. The PR should be rebased on main after opendatahub-io#714 merges.

## Documentation

No documentation updates are needed in this PR; the behavior is transparent to end users:
- Failed/Pending subscriptions are rejected (expected behavior for unhealthy resources)
- Error messages are self-explanatory
- Operator documentation for subscription health is covered in PR opendatahub-io#714

## Acceptance Criteria Met
- ✅ Given a MaaSSubscription in Failed/Pending state, When client presents valid credentials, Then request is rejected at auth layer
- ✅ Given subscription returns to Active/Degraded state, When client retries, Then requests are allowed per existing rules
- ✅ Given rejected request due to subscription state, When client inspects response, Then response does not expose internal details
- ✅ Automated E2E tests cover: unhealthy subscription → denied; recovery → allowed
- ✅ Manual verification steps documented above

## Merge criteria:
- The commits are squashed in a cohesive manner and have meaningful messages.
- Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- The developer has manually tested the changes and verified that the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Added "Degraded" phase and richer status surfaces: per-model and
per-policy status entries exposing ready/reason/message and
deletionTimestamp.

* **Improvements**
* API selection and behavior now consider subscription/model health
(fail-closed for unhealthy models); Create API key logs non-blocking
info when subscription is non-active or deleting.

* **Tests**
* Expanded unit and e2e coverage for status reporting, degraded/failed
phases, and selection/filtering logic.

* **Documentation**
* Updated troubleshooting and docs with phase semantics and kubectl
examples.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Ishita Sequeira <ishiseq29@gmail.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
<!--- Provide a general summary of your changes in the Title above -->

## Description
The database prerequisites were moved to the setup documentation, and the links were updated.
https://redhat.atlassian.net/browse/RHOAIENG-55130

## How Has This Been Tested?
<!--- Please describe in detail how you tested your changes. -->
<!--- Include details of your testing environment, and the tests you ran
to -->
<!--- see how your change affects other areas of the code, etc. -->
Manual verification on
https://github.com/jrhyness/models-as-a-service/blob/jr_55130/maas-api/README.md

## Merge criteria:
<!--- This PR will be merged by any repository approver when it meets
all the points in the checklist -->
<!--- Go over all the following points, and put an `x` in all the boxes
that apply. -->

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Updated database-related documentation links in the API README to
direct users to the current production deployment setup guide.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…ate (opendatahub-io#756)


## Description
Text says "Set modelsAsService to Unmanaged" but the YAML below shows
managementState: Removed. Changed the text. Unmanaged is not a supported
state.
https://redhat.atlassian.net/browse/RHOAIENG-55132
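
For reference, a minimal sketch of the corrected setting. The exact
`apiVersion` and component path are assumptions based on common ODH
`DataScienceCluster` conventions, not taken from this repo:

```yaml
# Sketch only: GVK and field path are assumed from ODH conventions.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  name: default-dsc
spec:
  components:
    modelsAsService:
      managementState: Removed  # "Unmanaged" is not a supported state here
```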

## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work
## Summary

This PR syncs security scanning configuration files from the central
[security-config](https://github.com/opendatahub-io/security-config)
repository, managed by the
[@opendatahub-io/odh-platform-security](https://github.com/orgs/opendatahub-io/teams/odh-platform-security)
team.

## Files

| File | Status |
|------|--------|
| `semgrep.yaml` | Updated |


## What does this mean for your team?

- **No action required from reviewers** beyond merging this PR
- These files are **protected by an org-level push ruleset** — they
cannot be modified directly in this repo
- Future updates will be synced automatically via PRs from the
`security-config` repo
- CodeRabbit and Semgrep will use these configs when reviewing PRs on
this repo

For questions or customization requests, open an issue on
[opendatahub-io/security-config](https://github.com/opendatahub-io/security-config).

Co-authored-by: security-config-sync[bot] <265242129+security-config-sync[bot]@users.noreply.github.com>
…endatahub-io#758)

[UX conversation
ask](https://redhat-internal.slack.com/archives/C069KSM8T9N/p1776362532709879?thread_ts=1776354678.333879&cid=C069KSM8T9N)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Style**
  * Updated dashboard display labels for improved clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
## Summary

Documentation refresh with clearer **request flows** (especially key
minting and high-level architecture) and a new **personas** narrative
backed by a resource-model diagram.

## Changes

- **Architecture (`architecture.md`)** — Tightened overview (main
components as bullets, authorization/rate-limiting framing), clarified
Gateway / Kuadrant / Authorino / Limitador / maas-api, updated main-flow
diagram (colors, `MaaSModelRef`, Tech Preview / external path). **Key
minting** is a **single** flow + diagram: validation and minting
combined; **forward + user context** from **AuthPolicy** to MaaS API;
show-once key response described in prose (not shown on the diagram).
Other sections updated only where they align with the same diagrams or
wording.
- **Personas (`concepts/personas.md`)** — Page structured around
**cluster operators**, **ODH administrators**, **data scientists / model
service owners**, and **API consumers**; embedded resource-model PNG
under `docs/content/assets/diagrams/`; `mkdocs.yml` navigation updated.
- **Misc** — Cross-links and terminology so diagrams and prose stay
consistent.

## Notes for reviewers

- Confirm **`docs/content/assets/diagrams/personas-resource-model.png`**
is meant to be committed with the repo.
- Optional later: **light/dark** diagram variants using Material image
URLs with `#only-light` / `#only-dark` when assets exist.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Reorganized documentation structure with new "Concepts" section
covering personas, model reference, and architecture.
* Added comprehensive guides for external model setup, on-cluster model
serving gateway configuration, and RBAC troubleshooting.
* Updated API examples to use OpenAI-compatible chat completions
endpoint.
* Clarified API key expiration model with operator-managed maximum
lifetime.
  * Added ModelsAsService CR configuration documentation.
* Updated sample model manifests to simulator v0.8.2 with new runtime
arguments.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…hub-io#761)

Update the /v1/models example responses to match the actual API format:
- Replace incorrect "subscription": "string" with "subscriptions":
[array]
- Add all actual response fields (object, created, owned_by, kind,
ready, modelDetails)
- Include two examples: API key (single subscription) and user token
(multiple subscriptions)

## Description

Changes:
- API key example shows two models, both with single subscription in
array
- User token example shows same models with subscription aggregation:
one model accessible via two subscriptions, one via a single
subscription
- Add tip explaining the difference between API key and user token
responses
- Use consistent model names (llama-2-7b-chat, mixtral-8x7b-instruct)
across both examples

The previous example used "subscription": "free" (singular string) but
the actual API returns "subscriptions": [{name, displayName,
description}, ...] (plural array of objects). This mismatch would cause
client parsing errors.
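
A minimal sketch of the corrected shape as a client would consume it. The
model names and subscription fields follow the description above; all
values and the surrounding list wrapper are illustrative, not taken from a
live response:

```python
# Illustrative only: "subscriptions" is an array of objects, not a string.
# Field values are made up; the shape follows the PR description above.
response = {
    "object": "list",
    "data": [
        {
            "id": "llama-2-7b-chat",
            "object": "model",
            "ready": True,
            # User-token case: one model reachable via two subscriptions.
            "subscriptions": [
                {"name": "free", "displayName": "Free Tier", "description": "..."},
                {"name": "premium", "displayName": "Premium", "description": "..."},
            ],
        },
        {
            "id": "mixtral-8x7b-instruct",
            "object": "model",
            "ready": True,
            # API-key case: a single subscription, still wrapped in an array.
            "subscriptions": [
                {"name": "free", "displayName": "Free Tier", "description": "..."},
            ],
        },
    ],
}

# A client must iterate the array; reading a singular "subscription" string
# (the old, incorrect example) would fail against this response.
names = {m["id"]: [s["name"] for s in m["subscriptions"]] for m in response["data"]}
print(names)
```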

Resolves:
[RHOAIENG-55145](https://redhat.atlassian.net/browse/RHOAIENG-55145)



## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Enhanced "List Available Models" guide with expanded examples for API
key and user token authentication.
* Updated response examples with additional fields for clearer model
information understanding.
* Added clarification on subscription behavior differences across
authentication methods.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…-io#626)

https://redhat.atlassian.net/browse/RHOAIENG-52923

## Description
Patch Kuadrant CSV when deploying to change Kuadrant behavior to
fail-close when Limitador service fails.

## How Has This Been Tested?
TRLP test script:
```
for i in {1..16}; do curl -sSk -o /dev/null -w "%{http_code}\n" "${HOST}/llm/facebook-opt-125m-simulated/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" -d '{"model":"facebook/opt-125m","messages":[{"role":"user","content":"Hi"}],"max_tokens":50}'; done
```
- Run TRLP test script, got `429` after a few `200`s.
- Scale Limitador pod down to 0, run TRLP test script, got all `200`s.
- Run revised `deploy.sh` to deploy MaaS, then run TRLP test script, got
all `500`s.

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Added platform-specific guidance for configuring rate-limiting failure
behavior when Limitador is unavailable (Open Data Hub and Red Hat
OpenShift AI).

* **Chores**
* Centralized and automated operator CSV updates to ensure
gateway-controller integration and enforce rate-limit failure modes;
post-install now consistently applies patches, restarts/reconciles
components as needed, and shows clearer progress messaging.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
…atahub-io#757)

Add comprehensive documentation for the --cluster-audience flag and all
other CLI flags in the maas-controller README. This flag is critical for
HyperShift/ROSA clusters that use custom OIDC provider URLs.

## Description
Changes:
- Add CLI Flags table with all available flags and their defaults
- Add dedicated section for HyperShift/ROSA cluster configuration
- Document how to find cluster's OIDC audience
- Show two methods to configure cluster-audience (params.env and kubectl
patch)
- Update Other Configuration section to reference params.env
consistently
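
As a local illustration of what "finding the cluster's OIDC audience" means,
the sketch below decodes the payload of a service-account JWT to read its
`aud` claim. The token here is fabricated; on a real cluster you would
inspect a projected service-account token instead, and the
`--cluster-audience` flag name comes from this PR:

```python
import base64
import json

# Build a fake JWT (header.payload.signature) so the example is self-contained.
sample_payload = base64.urlsafe_b64encode(
    json.dumps(
        {"aud": ["https://kubernetes.default.svc"], "sub": "system:serviceaccount:ns:sa"}
    ).encode()
).rstrip(b"=")
token = b"header." + sample_payload + b".sig"

# Decode the middle segment: restore base64 padding, then parse the JSON claims.
payload_b64 = token.split(b".")[1]
payload_b64 += b"=" * (-len(payload_b64) % 4)
aud = json.loads(base64.urlsafe_b64decode(payload_b64))["aud"]
print(aud)  # the audience value(s) to pass via --cluster-audience
```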

Resolves:
[RHOAIENG-55116](https://redhat.atlassian.net/browse/RHOAIENG-55116)



## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Added CLI Flags configuration subsection with documentation on
command-line flags and their defaults, configured via kustomize.
* Introduced dedicated guidance for HyperShift/ROSA Clusters
configuration, including cluster audience override instructions and
kubectl commands for OIDC audience extraction.
* Updated configuration section with explicit parameter mappings for
customizing subscription namespace, controller image, and gateway
settings via configuration files.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove incorrect "Stub: not yet implemented" text and document actual
ExternalModel behavior. ExternalModel has been fully implemented with
~230 lines of working code in providers_external.go since the initial
implementation.


## Description
Changes:
- Replace stub description with accurate behavior documentation
- Document that ExternalModel references an ExternalModel CR for
provider configuration (OpenAI, Anthropic, etc.)
- Explain HTTPRoute validation flow (created by ExternalModel
controller, validated by MaaSModelRef)
- Document readiness criteria (HTTPRoute accepted by gateway)
- Remove outdated "Status for unimplemented kinds" paragraph that
referenced ExternalModel as an example

The ExternalModel provider has been fully functional and registered in
providers.go since its introduction.

Resolves:
[RHOAIENG-55145](https://redhat.atlassian.net/browse/RHOAIENG-55145)


## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Updated ExternalModel provider documentation with complete
implementation specifications for endpoint exposure and gateway
integration, transitioning from unimplemented status to fully detailed
behavior.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
…#735)

## Description

Adds the `Tenant` CR (`maas.opendatahub.io/v1alpha1`) and a platform
reconciliation pipeline to `maas-controller` so it can render and apply
MaaS platform workloads (maas-api, gateway config, auth policies,
telemetry).
`Tenant` replaces the previous `ModelsAsService`
(`components.platform.opendatahub.io/v1alpha1`) as the persisted CR for
the MaaS component. ODH still uses the `ModelsAsService` component name
internally for enablement checks, labels, and DSC status aggregation,
but the object on the cluster is now `Tenant`. This gives
`maas-controller` full ownership of platform workload lifecycle while
ODH retains control of the component lifecycle (install, enable/disable,
cleanup).

### What's included
- `Tenant` CRD with API key, external OIDC, gateway, and telemetry
configuration
- `TenantReconciler`: prerequisites → dependencies → kustomize render →
post-render → SSA apply → deployment readiness
- Post-render: gateway AuthPolicy/TokenRateLimitPolicy/DestinationRule
targeting, external OIDC patching, TelemetryPolicy + IstioTelemetry
injection, config-hash rollout annotation
- Finalizer with cross-namespace cleanup via tracking labels
- Management state support (Managed/Unmanaged/Removed)
- Unit tests for reconcile, finalization, singleton enforcement,
management states

### Design decisions (based on review feedback)
- **Namespace-scoped**: lives in `models-as-a-service` alongside
`MaaSSubscription`/`MaaSAuthPolicy`. First release with no deployed CRDs
— avoids a CRD scope migration later (Kubernetes does not allow changing
scope on an existing CRD)
- **Self-bootstrap**: `maas-controller` creates the default Tenant on
startup; ODH operator's `NewCRObject` is a no-op
- **No DSCI dependency**: app namespace derived from `tenant.Namespace`
— no cross-operator API calls or extra RBAC
- **Cross-namespace ownership**: tracking labels for
cluster-scoped/cross-namespace children; `ownerReferences` for
same-namespace only
- **Singleton via CEL**: `self.metadata.name == 'default-tenant'` —
removing the rule later enables multi-tenancy without CRD migration
- **Gateway policy alignment**: `gateway-default-auth` (AuthPolicy) and
`gateway-default-deny` (TokenRateLimitPolicy) names match actual
manifests
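
A minimal sketch of the persisted CR, using the names given above. The
`spec` contents are illustrative, since only the configuration areas (API
key, OIDC, gateway, telemetry) are described here:

```yaml
# Sketch only: metadata comes from the description above; spec is illustrative.
apiVersion: maas.opendatahub.io/v1alpha1
kind: Tenant
metadata:
  name: default-tenant           # enforced by the CEL singleton rule
  namespace: models-as-a-service # namespace-scoped, alongside MaaSSubscription
spec:
  managementState: Managed       # Managed / Unmanaged / Removed
```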

_Related ODH PR:_
opendatahub-io/opendatahub-operator#3412

## How Has This Been Tested?

- Unit tests for the reconcile entry-point
(`maastenant_reconcile_test.go`)
- Manual End-to-end testing on ROSA cluster with custom ODH operator +
maas-controller images:
- Verified self-creation of `default-tenant` in `models-as-a-service`
namespace
  - Platform workloads applied via SSA
  - Toggled MaaS off/on in DSC to verify cleanup and re-provisioning
  - CRD namespace scope and CEL singleton enforcement confirmed

## Merge criteria:

- [x] The commits are squashed in a cohesive manner and have meaningful
messages.
- [x] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [x] The developer has manually tested the changes and verified that
the changes work

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Adds a Tenant custom resource with validation, status/phase, and CLI
printer columns.
* Ships a controller that ensures a singleton default Tenant, reconciles
rendered manifests, monitors readiness, and performs safe teardown.

* **New Features (rendering)**
* Rendering/post-processing injects OIDC, telemetry policies, gateway
defaults, params, and a deterministic config-hash on deployments.

* **Chores**
* Expanded RBAC, dependency and linter updates, deployment script
improvements, and added reconciliation tests.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: jland <jland@redhat.com>
…hub-io#763)

## Description
Add configuration reference for maas-api including:
- Environment variables table with 15 configuration options
- CLI flags table showing env var mappings
- Database configuration note


Configuration options documented:
- Server config: DEBUG_MODE, NAMESPACE, SECURE, ADDRESS, PORT
(deprecated)
- Gateway config: GATEWAY_NAME, GATEWAY_NAMESPACE, INSTANCE_NAME
- Subscription config: MAAS_SUBSCRIPTION_NAMESPACE
- API key config: API_KEY_MAX_EXPIRATION_DAYS
- Performance: ACCESS_CHECK_TIMEOUT_SECONDS
- TLS config: TLS_CERT, TLS_KEY, TLS_SELF_SIGNED, TLS_MIN_VERSION
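
As an illustration, a container `env` excerpt wiring a few of these
variables. The variable names come from the list above; all values are
made-up examples:

```yaml
# Illustrative maas-api container env excerpt; values are examples only.
env:
  - name: DEBUG_MODE
    value: "false"
  - name: MAAS_SUBSCRIPTION_NAMESPACE
    value: models-as-a-service   # assumed example value
  - name: API_KEY_MAX_EXPIRATION_DAYS
    value: "30"
  - name: TLS_MIN_VERSION
    value: "1.2"
```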



## How Has This Been Tested?

## Merge criteria:

- [ ] The commits are squashed in a cohesive manner and have meaningful
messages.
- [ ] Testing instructions have been added in the PR body (for PRs
involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that
the changes work


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Added comprehensive Configuration section documenting all available
environment variables and CLI flags for server setup, including debug
logging, namespace identification, network configuration, and TLS
settings.
* Clarified that CLI flags override environment variables and explained
how database configuration is sourced from Kubernetes secrets.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Automated promotion of **12 commit(s)** from `main` to `stable`.

```
8f38b4c docs: document maas-api environment variables and CLI flags (opendatahub-io#763)
dd5474e feat(maas-controller): maas`Tenant` CR and reconciler (opendatahub-io#735)
0e2b91e docs: correct ExternalModel implementation status (opendatahub-io#759)
94b7341 docs: document --cluster-audience CLI flag for maas-controller (opendatahub-io#757)
fa142c4 fix: enforce fail-close logic when limitador pod is down (opendatahub-io#626)
c098422 docs: fix API response format in self-service model listing (opendatahub-io#761)
3a6a9b8 docs: documentation updates (opendatahub-io#687)
ce75696 fix: rename dashboard title and panel name to 'Token Consumption' (opendatahub-io#758)
266a130 chore: sync security config files (opendatahub-io#736)
2ce457c docs: fix instructions to match code for modelsAsService managementState (opendatahub-io#756)
5d06621 docs: fix broken links (opendatahub-io#755)
c789577 feat: reject degraded/failed subscriptions at auth layer (opendatahub-io#721)
```
Automated promotion of **13 commit(s)** from `stable` to `rhoai`.

```
8f38b4c docs: document maas-api environment variables and CLI flags (opendatahub-io#763)
dd5474e feat(maas-controller): maas`Tenant` CR and reconciler (opendatahub-io#735)
0e2b91e docs: correct ExternalModel implementation status (opendatahub-io#759)
94b7341 docs: document --cluster-audience CLI flag for maas-controller (opendatahub-io#757)
fa142c4 fix: enforce fail-close logic when limitador pod is down (opendatahub-io#626)
c098422 docs: fix API response format in self-service model listing (opendatahub-io#761)
3a6a9b8 docs: documentation updates (opendatahub-io#687)
ce75696 fix: rename dashboard title and panel name to 'Token Consumption' (opendatahub-io#758)
266a130 chore: sync security config files (opendatahub-io#736)
2ce457c docs: fix instructions to match code for modelsAsService managementState (opendatahub-io#756)
5d06621 docs: fix broken links (opendatahub-io#755)
c789577 feat: reject degraded/failed subscriptions at auth layer (opendatahub-io#721)
```
…oller v3-4

The maas-controller Dockerfile.konflux references paths outside the
maas-controller/ directory (maas-api/deploy, deployment/base/...), so the
build context must be the repo root. Update path-context from
`maas-controller` to `.` and dockerfile from `Dockerfile.konflux` to
`maas-controller/Dockerfile.konflux` to match the upstream pattern.

Signed-off-by: Chaitanya Kulkarni <ckulkarn@redhat.com>
Signed-off-by: Chaitanya Kulkarni <chkulkar@redhat.com>
Made-with: Cursor
…ontroller-tekton-context

fix(tekton): correct dockerfile path and build context for maas-controller v3-4
Update github.com/jackc/pgx/v5 from v5.7.6 to v5.9.2 to resolve
memory-safety vulnerabilities.

CVE details:
- CVE-2026-33815: memory-safety vulnerability in pgx
- CVE-2026-33816: memory-safety vulnerability in pgx

Resolves: RHOAIENG-57067, RHOAIENG-57063
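
The change reduces to bumping the module requirement, roughly:

```
// go.mod (excerpt); only the pgx requirement changes
require github.com/jackc/pgx/v5 v5.9.2 // was v5.7.6
```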

co-authored-by: claude opus 4.6 <noreply@anthropic.com>
@vmrh21 vmrh21 closed this Apr 21, 2026
@vmrh21 vmrh21 deleted the fix/cve-2026-33815-pgx-rhds-rhoai-3.4-attempt-1 branch April 21, 2026 16:16