Add network recovery and stream stall handling for chat#455
Add network recovery and stream stall handling for chat#455
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
Pull request overview
This PR strengthens Raven chat resilience by adding client-side network retry/backoff, server-side upstream stream stall watchdogs, and integrity-layer messaging for stalled/incomplete generations, while also exposing “Symbolic Moment” as a quick action across the UI.
Changes:
- Added network recovery (retry + online-signal waiting + fallback history) and client-side chunk read timeouts in
useOracleChat. - Added server-side stream stall detection timers and “Clouded Skies” labeling for stalled/incomplete generations, plus runtime limit knobs.
- Extended UI/UX to trigger Symbolic Moment reads from counterpart quick actions and the Profile Vault, with corresponding tests/telemetry updates.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| vessel/src/test/polyadic-chat-routing.test.ts | Updates routing/UI label assertions for new “Moment” quick action strings. |
| vessel/src/lib/ravenPersona.ts | Adds runtime limits for stream overall + idle watchdog timeouts via env-configurable values. |
| vessel/src/hooks/useOracleChat.ts | Implements network retry/backoff, online detection, client chunk timeouts, and improved error/ghost handling. |
| vessel/src/components/chat/CounterpartQuickActions.tsx | Adds onSymbolicMoment action and button in counterpart quick actions UI. |
| vessel/src/components/ProfileVault.tsx | Adds “Read Symbolic Moment” buttons wired via optional callback prop. |
| vessel/src/app/page.tsx | Wires Symbolic Moment quick actions into the main app flows and telemetry. |
| vessel/src/app/api/raven-chat/userBlockBuilder.ts | Adjusts creator-mirror offline telemetry prompt to acknowledge offline state then proceed with creator help. |
| vessel/src/app/api/raven-chat/route.ts | Adds upstream provider stream stall/idle watchdog timers and flags stall finish reasons. |
| vessel/src/app/api/raven-chat/generationIntegrity.ts | Adds “stream_stall” labeling (“Clouded Skies”) for incomplete-generation fallbacks. |
| vessel/src/app/api/raven-chat/tests/userBlockBuilder.test.ts | Adds coverage for creator-mirror offline telemetry behavior. |
| vessel/src/app/api/raven-chat/tests/generationIntegrity.test.ts | Adds coverage for “Clouded Skies” labeling when finish reason is stream_stall. |
| vessel/sherlog-velocity/data/self-model.json | Updates Sherlog velocity self-model artifact snapshot metadata/content. |
| vessel/sherlog-velocity/data/gap-history.jsonl | Appends new Sherlog velocity gap-history entry. |
Comments suppressed due to low confidence (1)
vessel/src/hooks/useOracleChat.ts:1738
- The 413 compact-history retry currently bypasses the new network recovery loop and does a single
performRequest(compactHistoryBase). If the retry hits a transient network failure, it won't get the exponential-backoff/online-signal recovery behavior introduced above. Consider routing this retry throughperformRequestWithNetworkRecovery()as well (or otherwise reusing the same recovery policy).
if (response.status === 413 && !didCompactRetry && historyBase.length > compactHistoryBase.length) {
didCompactRetry = true;
response = await performRequest(compactHistoryBase);
}
| const STREAM_STALL_FINISH_REASON = "stream_stall"; | ||
| const CLOUDED_SKIES_LABEL = | ||
| "Clouded Skies — the channel went quiet before a full reading locked. The geometry is intact; only the rendering paused."; | ||
|
|
||
| function isStreamStall(finishReason?: string | null): boolean { | ||
| return (finishReason || "").trim().toLowerCase() === STREAM_STALL_FINISH_REASON; | ||
| } | ||
|
|
||
| function buildIncompleteReply( | ||
| input: ReplyIntegrityInput, | ||
| reason: string, | ||
| ): string { | ||
| const stallPrefix = isStreamStall(input.providerFinishReason) | ||
| ? `${CLOUDED_SKIES_LABEL}\n\n` | ||
| : ""; | ||
|
|
There was a problem hiding this comment.
isStreamStall() is only used to prefix the incomplete fallback text. If a stream stalls after emitting some partial text, resolveReplyIntegrity() will currently treat the reply as generated (since trimmed is non-empty and it's not considered truncated), so users may see an incomplete/partial answer without the intended stall fallback. Consider treating providerFinishReason === 'stream_stall' (and/or providerCompleted === false) as an incomplete generation condition so the integrity layer reliably withholds partial stalled output.
| timer = setTimeout(() => { | ||
| reader.cancel().catch(() => {}); | ||
| reject(new Error('Raven stream stalled before any chunk arrived.')); | ||
| }, ms); | ||
| }), |
There was a problem hiding this comment.
The timeout rejection error message is always "Raven stream stalled before any chunk arrived." even when assistantMessageText is already non-empty (i.e., this is a mid-stream idle timeout). This makes the surfaced error misleading and harder to diagnose. Adjust the message (or error type) based on whether this was a first-chunk timeout vs. a subsequent idle timeout.
…s-for-symbolic-moment # Conflicts: # vessel/src/lib/raven/symbolicMomentFrontstage.ts Co-authored-by: DHCross <45954119+DHCross@users.noreply.github.com>
…lic-moment Introduce deterministic phrasing variants for symbolic moment replies
…ch fails Agent-Logs-Url: https://github.com/DHCross/Shipyard/sessions/1d7203b3-5835-47ff-abf7-b57cbfd9e007 Co-authored-by: DHCross <45954119+DHCross@users.noreply.github.com>
… errors When a Symbolic Moment read stalled mid-stream, three defects compounded: - The Gemini reader had no per-chunk watchdog, so a stalled stream hung until Vercel's maxDuration killed the route, bypassing the labeled fallbacks in resolveReplyIntegrity. - The client inserted an empty assistant bubble on the first DATA frame, leaving a ghost behind whenever no CHUNK followed. - The outer try/catch wrapped post-stream enrichment, so a vault-sync or blind-mirror throw appended a generic "channel issue" bubble even when the reply had already streamed in. Server: add idle (60s first-chunk / 30s subsequent) + overall (170s) watchdogs around the Gemini reader. On stall, cancel the reader and set providerFinishReason='stream_stall' so the existing integrity pipeline emits GENERATION_INCOMPLETE with a Clouded Skies fallback instead of a hung connection. Client: defer assistant-bubble insertion until the first non-empty CHUNK, add a 75s/45s defense-in-depth read timeout, and harden enrichment (applyVaultAction, materializeAssistantBlindMirror, extractCheckpointQuestion) in local try/catch blocks. Track in-flight render state in a ref so the outer catch can suppress the redundant error bubble when the reply already rendered. Defensively scrub any empty-text ghost on error. Adds RAVEN_STREAM_FIRST_CHUNK_MS / RAVEN_STREAM_CHUNK_IDLE_MS / RAVEN_STREAM_OVERALL_MS env knobs and three new integrity tests covering the Clouded Skies stall path. https://claude.ai/code/session_018fAnZmYcnz8i9bqS3Nsjn6
e1de137 to
da0c1a6
Compare
|


Summary
This PR enhances the chat system's resilience by implementing network recovery mechanisms and stream stall detection. It adds retry logic with exponential backoff for network failures, online signal detection, and server-side stream timeout handling with user-facing fallback messages.
Key Changes
Network Recovery in useOracleChat:
NETWORK_RETRY_DELAYS_MSwith exponential backoff delays (350ms, 900ms, 1800ms)waitForOnlineSignal()to detect when browser comes back onlineperformRequestWithNetworkRecovery()that retries with both primary and fallback (compact) historyinflightAssistantRefto distinguish pre-stream failures (show error bubble) from post-stream failures (log only)Stream Stall Detection in raven-chat route:
streamOverallMs) and idle timeouts (streamFirstChunkIdleMs,streamChunkIdleMs)streamStalledflag for integrity checkingGeneration Integrity Updates:
STREAM_STALL_FINISH_REASONconstant for "stream_stall" detectionresolveReplyIntegrity()to handle stream stall scenarios with appropriate user messagingUI Enhancements:
handleCounterpartSymbolicMoment()callback in main App componentProfileVaultwithonRunSymbolicMomentproponSymbolicMomenthandler toCounterpartQuickActionscomponentTelemetry & Testing:
Implementation Details
The network recovery uses a candidate-based retry strategy: it attempts the primary history first, and if network errors occur, falls back to compact history. The retry loop respects the exponential backoff delays and waits for online signals between attempts.
Stream stall detection uses a two-tier timeout approach: an overall timeout for the entire stream operation and idle timeouts that reset after each chunk is received. This prevents both hung connections and slow-streaming scenarios from blocking indefinitely.
https://claude.ai/code/session_018fAnZmYcnz8i9bqS3Nsjn6