fix(runner,rbac,cli,routes): credential auth reliability and SSE improvements#1231
fix(runner,rbac,cli,routes): credential auth reliability and SSE improvements#1231markturansky merged 4 commits intoalphafrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughPath-aware RBAC action derivation added; CLI streaming output enriched with reasoning and tool-call annotations; bot-token refresh-and-retry flow plus Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant FetchCredential as _fetch_credential()
participant RefreshUtil as refresh_bot_token()
participant CPService as CP Token Service
Caller->>FetchCredential: Request credentials
FetchCredential->>CPService: Call backend with BOT_TOKEN
alt 200 OK
CPService-->>FetchCredential: 200 + credentials
FetchCredential-->>Caller: Return credentials
else 401/403
CPService-->>FetchCredential: 401/403
FetchCredential->>RefreshUtil: Refresh BOT_TOKEN
RefreshUtil->>CPService: Request new token (if CP env present)
CPService-->>RefreshUtil: New token or error
RefreshUtil-->>FetchCredential: Provide refreshed BOT_TOKEN
FetchCredential->>CPService: Retry call with refreshed BOT_TOKEN
alt 200 OK
CPService-->>FetchCredential: 200 + credentials
FetchCredential-->>Caller: Return credentials
else error
CPService-->>FetchCredential: Error
FetchCredential-->>Caller: Raise PermissionError (includes context)
end
end
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
…ovements
- runner: retry credential fetch with fresh CP token on 401 instead of
aborting the turn; adds refresh_bot_token() to utils.py and wires it
into _fetch_credential() for both the direct BOT_TOKEN and caller-token
fallback paths
- rbac: fix pathToAction() so GET /credentials/{id}/token resolves to
action "token" (matching credential:token-reader role) instead of "read"
- cli: improve acpctl session send -f output with reasoning accumulation
([thinking] blocks), tool call annotations ([ToolName] → result), and
proper newline handling for streamed text
- routes: add haproxy.router.openshift.io/timeout: 10m to all
ambient-api-server Routes (production, mpp-openshift, local-dev) to
prevent 504s on long-running SSE streams
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
fe9a7ca to
bf6be85
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/ambient-cli/cmd/acpctl/session/send.go`:
- Around line 127-137: The SSE reader uses bufio.NewScanner which defaults to a
64KiB token limit causing bufio.ErrTooLong when TOOL_CALL_RESULT events have
large evt.Content; locate the bufio.NewScanner(...) instance (variable name
scanner) used in the session send flow and call scanner.Buffer(make([]byte,
1024*1024), 10*1024*1024) (or another appropriate larger max like 5–10MB)
immediately after creating the scanner to raise the token limit, and keep
existing evt.Content handling (the preview code around evt.Content) intact;
optionally ensure any bufio.ErrTooLong is handled/logged if you still want to
detect oversized tokens.
In `@components/manifests/overlays/mpp-openshift/ambient-api-server-route.yaml`:
- Around line 10-11: The grpc Route resource ambient-api-server-grpc is missing
the haproxy.router.openshift.io/timeout: 10m annotation applied to the other
Route; update the annotations block on the ambient-api-server-grpc Route (the
resource with name ambient-api-server-grpc and its annotations mapping) to
include haproxy.router.openshift.io/timeout: 10m so both routes receive the same
10m router timeout.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c3c0f4b2-2b50-43b9-8d08-edf1c0dd07fa
📒 Files selected for processing (7)
components/ambient-api-server/pkg/rbac/middleware.gocomponents/ambient-cli/cmd/acpctl/session/send.gocomponents/manifests/overlays/local-dev/ambient-api-server-route.yamlcomponents/manifests/overlays/mpp-openshift/ambient-api-server-route.yamlcomponents/manifests/overlays/production/ambient-api-server-route.yamlcomponents/runners/ambient-runner/ambient_runner/platform/auth.pycomponents/runners/ambient-runner/ambient_runner/platform/utils.py
| if evt.Content != "" { | ||
| var content string | ||
| if err := json.Unmarshal([]byte(evt.Content), &content); err != nil { | ||
| content = evt.Content | ||
| } | ||
| lines := strings.SplitN(strings.TrimSpace(content), "\n", 4) | ||
| preview := strings.Join(lines, " | ") | ||
| if len(lines) >= 4 { | ||
| preview += " ..." | ||
| } | ||
| fmt.Fprintf(out, "→ %s\n", preview) |
There was a problem hiding this comment.
Raise the scanner limit before previewing tool results.
bufio.Scanner still caps each SSE line at 64 KiB. Now that TOOL_CALL_RESULT consumes full evt.Content, one large result can terminate acpctl session send -f with bufio.ErrTooLong.
Patch
out := cmd.OutOrStdout()
scanner := bufio.NewScanner(stream)
+ scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
var reasoningBuf strings.Builder
var inText boolExpected: this file shows no scanner.Buffer call today; if emitters stream full tool-result payloads, the current scanner limit is enough to abort the stream.
#!/bin/bash
set -euo pipefail
sed -n '78,86p' components/ambient-cli/cmd/acpctl/session/send.go
printf '\n'
rg -n -C2 'scanner\.Buffer|bufio\.NewScanner' components/ambient-cli/cmd/acpctl/session/send.go
printf '\n'
rg -n -C2 'TOOL_CALL_RESULT|toolCallName|content' .🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/ambient-cli/cmd/acpctl/session/send.go` around lines 127 - 137,
The SSE reader uses bufio.NewScanner which defaults to a 64KiB token limit
causing bufio.ErrTooLong when TOOL_CALL_RESULT events have large evt.Content;
locate the bufio.NewScanner(...) instance (variable name scanner) used in the
session send flow and call scanner.Buffer(make([]byte, 1024*1024), 10*1024*1024)
(or another appropriate larger max like 5–10MB) immediately after creating the
scanner to raise the token limit, and keep existing evt.Content handling (the
preview code around evt.Content) intact; optionally ensure any bufio.ErrTooLong
is handled/logged if you still want to detect oversized tokens.
…rror message The _retry_with_fresh_bot_token helper was hardcoding HTTP 401 in the PermissionError message regardless of the actual response code. This broke test_raises_permission_error_on_403_without_caller_token which expects "authentication failed with HTTP 403". Now the original status code is passed through and the retry error message reflects the actual HTTP code from both the initial and retry attempts. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/runners/ambient-runner/ambient_runner/platform/auth.py`:
- Around line 204-214: The _retry_with_fresh_bot_token path can trigger
concurrent refreshes when populate_runtime_credentials fans out multiple
credential fetches; fix this by coalescing refreshes for refresh_bot_token so
only one in-flight refresh runs at a time and other callers await its result
(e.g., add a module-level in-progress future or an asyncio.Lock + cached result
inside the refresh logic used by _retry_with_fresh_bot_token and the shared
bot-token cache update in ambient_runner/platform/utils.py), ensure refresh
errors propagate to waiting callers and update the cache only once with the
fresh token.
- Around line 215-221: Ensure the URL is explicitly an http/https host and not a
hostless scheme: after parsing the URL (the parsed variable) and before creating
retry_req or adding headers (where _urllib_request.Request and retry_req are
used and Authorization/X-Runner-Current-User are added), enforce that
parsed.scheme is either "http" or "https" and that parsed.hostname is present;
if not, reject the request (raise/return an error) rather than proceeding or
relying solely on the hostname allowlist check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f185722d-a2c9-4671-9a46-78f4d3f29fca
📒 Files selected for processing (1)
components/runners/ambient-runner/ambient_runner/platform/auth.py
| def _retry_with_fresh_bot_token(original_code: int): | ||
| logger.info( | ||
| f"{credential_type} got {original_code} with cached BOT_TOKEN — refreshing from CP endpoint and retrying" | ||
| ) | ||
| try: | ||
| fresh_bot = refresh_bot_token() | ||
| except Exception as refresh_err: | ||
| logger.warning(f"{credential_type} CP token refresh failed: {refresh_err}") | ||
| raise PermissionError( | ||
| f"{credential_type} authentication failed with HTTP {original_code}" | ||
| ) from refresh_err |
There was a problem hiding this comment.
Coalesce CP token refreshes across concurrent failures.
Line 209 can run once per failing provider fetch. populate_runtime_credentials() fans out four credential fetches concurrently (Lines 349-356), while ambient_runner/platform/utils.py:31-92 updates the shared bot-token cache without locking. One stale BOT_TOKEN can therefore trigger a refresh stampede and race the cache, so the intermittent auth failures this PR is fixing can still reproduce under load.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/runners/ambient-runner/ambient_runner/platform/auth.py` around
lines 204 - 214, The _retry_with_fresh_bot_token path can trigger concurrent
refreshes when populate_runtime_credentials fans out multiple credential
fetches; fix this by coalescing refreshes for refresh_bot_token so only one
in-flight refresh runs at a time and other callers await its result (e.g., add a
module-level in-progress future or an asyncio.Lock + cached result inside the
refresh logic used by _retry_with_fresh_bot_token and the shared bot-token cache
update in ambient_runner/platform/utils.py), ensure refresh errors propagate to
waiting callers and update the cache only once with the fresh token.
| retry_req = _urllib_request.Request(url, method="GET") | ||
| if fresh_bot: | ||
| retry_req.add_header("Authorization", f"Bearer {fresh_bot}") | ||
| if context.current_user_id: | ||
| retry_req.add_header("X-Runner-Current-User", context.current_user_id) | ||
| try: | ||
| with _urllib_request.urlopen(retry_req, timeout=10) as resp: |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf 'Guard and retry callsites:\n'
sed -n '121,131p;215,221p' components/runners/ambient-runner/ambient_runner/platform/auth.py
printf '\nurlparse() behavior for representative BACKEND_API_URL values:\n'
python - <<'PY'
from urllib.parse import urlparse
samples = [
"https://ambient-api-server.svc.cluster.local",
"http://localhost:8080",
"file:///tmp",
"data:text/plain,hello",
]
for value in samples:
p = urlparse(value)
allowed = bool(p.hostname) and (
p.hostname.endswith(".svc.cluster.local")
or p.hostname.endswith(".svc")
or p.hostname in {"localhost", "127.0.0.1"}
)
print(
f"{value!r} -> scheme={p.scheme!r}, hostname={p.hostname!r}, "
f"current_guard_rejects={bool(p.hostname and not allowed)}"
)
PYRepository: ambient-code/platform
Length of output: 1365
Guard hostless and non-HTTP(S) schemes in addition to hostname allowlist.
The current guard if parsed.hostname and not (allowlisted) fails for file:///tmp and data:text/... URIs—both return hostname=None and thus skip the rejection check. These bypass the cluster-only boundary and expose credentials via unintended schemes. Add checks for explicit scheme allowlist (http, https only) and reject any URL missing parsed.hostname.
🧰 Tools
🪛 Ruff (0.15.9)
[error] 215-215: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
[error] 221-221: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.
(S310)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/runners/ambient-runner/ambient_runner/platform/auth.py` around
lines 215 - 221, Ensure the URL is explicitly an http/https host and not a
hostless scheme: after parsing the URL (the parsed variable) and before creating
retry_req or adding headers (where _urllib_request.Request and retry_req are
used and Authorization/X-Runner-Current-User are added), enforce that
parsed.scheme is either "http" or "https" and that parsed.hostname is present;
if not, reject the request (raise/return an error) rather than proceeding or
relying solely on the hostname allowlist check.
…_turn import - auth.py: don't retry with fresh CP token when caller-token BOT_TOKEN fallback fails; preserve that path's original error message - observability.py: guard claude_agent_sdk import in end_turn with try/except so Langfuse generation.update() is still called when the SDK is not installed 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/runners/ambient-runner/ambient_runner/platform/auth.py`:
- Around line 185-187: The code currently retries with a fresh BOT_TOKEN only
when the initial request used BOT_TOKEN; if the initial request used
context.caller_token and the fallback BOT_TOKEN then returns 401/403 the code
raises immediately instead of calling the same retry path. Update the exception
handling in the block that falls back from context.caller_token to BOT_TOKEN so
that when the fallback response error code (e.code) is 401 or 403 it calls
_retry_with_fresh_bot_token(e.code) instead of re-raising; keep existing
behavior for other error codes. Ensure you reference the context.caller_token
flow, the BOT_TOKEN fallback branch, the exception object e (e.code), and the
_retry_with_fresh_bot_token helper when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 492e8eec-c18b-4f66-9c5e-ae8555f3a7eb
📒 Files selected for processing (2)
components/runners/ambient-runner/ambient_runner/observability.pycomponents/runners/ambient-runner/ambient_runner/platform/auth.py
| if e.code in (401, 403): | ||
| logger.warning( | ||
| f"{credential_type} credential fetch failed with HTTP {e.code}: {e}" | ||
| ) | ||
| raise PermissionError( | ||
| f"{credential_type} authentication failed with HTTP {e.code}" | ||
| ) from e | ||
| # BOT_TOKEN may have expired — refresh from CP endpoint and retry once. | ||
| return _retry_with_fresh_bot_token(e.code) |
There was a problem hiding this comment.
Handle 401/403 from the caller-token BOT_TOKEN fallback too.
This retry only runs when the first request was sent with BOT_TOKEN. If the first request used context.caller_token, Lines 177-184 raise as soon as the fallback BOT_TOKEN gets 401/403, so the fresh-token retry never happens and long-running sessions can still fail on an expired cached bot token.
Patch idea
try:
with _urllib_request.urlopen(fallback_req, timeout=10) as resp:
return resp.read().decode("utf-8", errors="replace")
+ except _urllib_request.HTTPError as fallback_err:
+ if fallback_err.code in (401, 403):
+ logger.warning(
+ f"{credential_type} BOT_TOKEN fallback got {fallback_err.code}; refreshing from CP endpoint and retrying"
+ )
+ return _retry_with_fresh_bot_token(fallback_err.code)
+ logger.warning(
+ f"{credential_type} BOT_TOKEN fallback also failed: {fallback_err}"
+ )
+ raise PermissionError(
+ f"{credential_type} authentication failed with HTTP {fallback_err.code}"
+ ) from fallback_err
except Exception as fallback_err:
logger.warning(
f"{credential_type} BOT_TOKEN fallback also failed: {fallback_err}"
)
raise PermissionError(🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@components/runners/ambient-runner/ambient_runner/platform/auth.py` around
lines 185 - 187, The code currently retries with a fresh BOT_TOKEN only when the
initial request used BOT_TOKEN; if the initial request used context.caller_token
and the fallback BOT_TOKEN then returns 401/403 the code raises immediately
instead of calling the same retry path. Update the exception handling in the
block that falls back from context.caller_token to BOT_TOKEN so that when the
fallback response error code (e.code) is 401 or 403 it calls
_retry_with_fresh_bot_token(e.code) instead of re-raising; keep existing
behavior for other error codes. Ensure you reference the context.caller_token
flow, the BOT_TOKEN fallback branch, the exception object e (e.code), and the
_retry_with_fresh_bot_token helper when making the change.
…route 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
Summary
pathToAction()soGET /credentials/{id}/tokenresolves to actiontoken(matchingcredential:token-readerrole) rather thanread— was causing systematic 401s for runner SAsacpctl session send -foutput with[thinking]reasoning blocks,[ToolName] → resulttool call annotations, and proper streaming newline handlinghaproxy.router.openshift.io/timeout: 10mto allambient-api-serverOpenShift Routes to prevent 504 Gateway Timeouts on long-running SSE streamsTest plan
acpctl session send -fand confirm[thinking]and tool call output appearsGET /credentials/{id}/tokenreturns 200 for runner SA (RBAC fix)🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
Configuration