Feature Proposal: “Hold Music” — Expectation-Setting Voice Prompts During High Latency #228

@rosscado

When a response takes a long time (e.g. 10+ seconds while GPT is “thinking”), users currently sit in silence until the first audio stream begins. That silence can feel awkward, or as if the system hasn’t heard them.

Instead of filler sounds (“um,” “ah”), we can use this gap to set expectations in a natural, human way. For example, play a short cached line in the selected voice, such as:

  • “Got it, let me think about that for a moment.”
  • “Hmm, that’s a big one. Give me a second.”
  • “Okay, I’m working on it.”

This reassures the user they’ve been heard and prepares them for a thoughtful response, similar to how in human conversation it’s acceptable to say, “Let me think about that.”


Proposed Approach

  1. Latency detection

    • Detect when a model’s expected response time exceeds ~10s (e.g. GPT-5 thinking mode).
    • Only trigger “Hold Music” prompts in these cases, not for instant responses.
  2. Prompt playback

    • Short audio snippets in the selected voice + language.
    • Can be cached phrases or generated on the fly (TTS).
    • Randomize from a small pool for variety.
  3. User experience goals

    • Communicate “you’ve been heard.”
    • Reduce awkward silence.
    • Maintain tone consistency (voice, language, pacing).
    • Feel natural and optional (can be toggled off).
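
The trigger and the randomized pick from steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: the function names, the `enabled` flag standing in for the user toggle, and the "avoid an immediate repeat" behaviour are all assumptions, and how the expected latency is obtained is left open.

```python
import random
from typing import Optional, Sequence

HOLD_THRESHOLD_S = 10.0  # assumed cutoff, taken from the ~10s figure above


def should_play_hold_prompt(expected_latency_s: float,
                            enabled: bool = True,
                            threshold_s: float = HOLD_THRESHOLD_S) -> bool:
    """True only for slow responses, and only if the user has not toggled prompts off."""
    return enabled and expected_latency_s > threshold_s


def pick_prompt(pool: Sequence[str],
                last_played: Optional[str] = None,
                rng: Optional[random.Random] = None) -> Optional[str]:
    """Pick a random prompt, avoiding an immediate repeat when the pool allows it."""
    if not pool:
        return None
    rng = rng or random.Random()
    # Fall back to the full pool if excluding the last prompt leaves nothing.
    candidates = [p for p in pool if p != last_played] or list(pool)
    return rng.choice(candidates)


PROMPTS = [
    "Got it, let me think about that for a moment.",
    "Hmm, that's a big one. Give me a second.",
    "Okay, I'm working on it.",
]

if should_play_hold_prompt(expected_latency_s=14.0):
    print(pick_prompt(PROMPTS, last_played=PROMPTS[0]))
```

Keeping the decision (`should_play_hold_prompt`) separate from the selection (`pick_prompt`) means the threshold and the prompt pool can be tuned independently.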

Technical Options

  • Cached audio prompts per supported voice → cheap, fast, reliable.
  • Dynamic generation with a lightweight LLM → flexible, more variety, but more expensive.

Fallback: If no audio asset is available in the current voice/language, skip playback.
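
A rough sketch of the cached option with the skip-playback fallback is below. The cache shape (a dict keyed by voice and language), the voice name `voice_a`, the file path, and the `play` callback are hypothetical placeholders chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical cache shape: (voice, language) -> pre-rendered clip paths.
PromptCache = Dict[Tuple[str, str], List[str]]


def cached_clips(cache: PromptCache, voice: str, language: str) -> List[str]:
    """Return clips for the exact voice/language pair, or an empty list."""
    return cache.get((voice, language), [])


def maybe_play_hold_prompt(cache: PromptCache, voice: str, language: str,
                           play: Callable[[str], None]) -> Optional[str]:
    """Play the first cached clip if one exists; otherwise stay silent."""
    clips = cached_clips(cache, voice, language)
    if not clips:
        return None  # fallback: no asset for this voice/language, skip playback
    clip = clips[0]
    play(clip)
    return clip


CACHE: PromptCache = {
    ("voice_a", "en"): ["prompts/voice_a/en/got_it.wav"],
}

played: List[str] = []  # stands in for a real audio player
maybe_play_hold_prompt(CACHE, "voice_a", "en", played.append)  # plays the cached clip
maybe_play_hold_prompt(CACHE, "voice_a", "fr", played.append)  # no asset: skipped
```

The lookup deliberately does not substitute another voice or language, so a missing asset results in silence and not a prompt in a mismatched voice.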


Open Questions

  • Threshold: is 10s the right cutoff? Should it be adaptive?
  • Should this live entirely client-side (pre-recorded prompts) or server-side (generated per request)?
  • Do we expose a user setting: “Play expectation-setting prompts when response time is high”?
  • How do we avoid confusion between “Hold Music” and actual AI content (make it clearly a system/acknowledgement message)?
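
On the adaptive question, one possible reading is to keep the cutoff fixed and adapt the latency estimate itself, for instance with an exponential moving average of observed response times per model. The sketch below assumes that reading; the class name, the smoothing factor `alpha`, and the model name are illustrative only.

```python
from typing import Dict


class LatencyEstimator:
    """Per-model exponential moving average of observed response times (seconds)."""

    def __init__(self, alpha: float = 0.2) -> None:
        self.alpha = alpha  # weight given to the newest observation (assumed value)
        self._ema: Dict[str, float] = {}

    def observe(self, model: str, latency_s: float) -> None:
        prev = self._ema.get(model)
        if prev is None:
            self._ema[model] = latency_s  # first sample seeds the average
        else:
            self._ema[model] = self.alpha * latency_s + (1 - self.alpha) * prev

    def expected(self, model: str, default: float = 0.0) -> float:
        """Expected latency for a model; `default` when nothing has been observed yet."""
        return self._ema.get(model, default)


estimator = LatencyEstimator()
estimator.observe("thinking-model", 12.0)
estimator.observe("thinking-model", 20.0)

# A prompt would be played when the estimate crosses the cutoff.
play_prompt = estimator.expected("thinking-model") > 10.0
```

With no observations the estimate falls back to `default`, so an unknown model never triggers a prompt until real latencies have been recorded.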

Why It Matters

  • Makes conversations feel smoother and more human.
  • Sets the right expectations instead of leaving users in silence.
  • Reinforces trust that the assistant is listening, even when it needs time to think.

Metadata

Labels

  • tts: Text to Speech (voice synthesis)
  • ux: user experience
