Skip to content

Add browser-side Speech-to-Text with Web Worker architecture#3

Open
Copilot wants to merge 9 commits intomainfrom
copilot/integrate-speech-to-text
Open

Add browser-side Speech-to-Text with Web Worker architecture#3
Copilot wants to merge 9 commits intomainfrom
copilot/integrate-speech-to-text

Conversation

Copy link
Contributor

Copilot AI commented Nov 28, 2025

  • 1. Create Web Worker (worker.js) for Whisper model loading and inference
  • 2. Add @xenova/transformers dependency for Whisper model
  • 3. Create Speech-to-Text hook (useSpeechToText.ts) with audio processing and worker communication
  • 4. Add microphone button to PromptInputBox component with UI states (initial, loading, recording, processing)
  • 5. Add progress indicator for model download
  • 6. Integrate transcription result with input callback
  • 7. Configure Vite for Web Worker support
  • 8. Redesign voice input UI based on user feedback - visualizer bars on left, timer in center, button on right
  • 9. Fix transcription bugs - race conditions, empty audio handling, error display
  • 10. Fix auto-restart recording bug after transcription completes
  • 11. Fix auto-restart bug for subsequent recordings (not just first time)
  • 12. Simplify auto-start logic to completely eliminate post-transcription restart bug
  • 13. Fix stale callback closure that prevented transcription text from appearing

Summary

This PR implements browser-side Speech-to-Text using the Whisper model (Xenova/whisper-tiny.en) running entirely on the client. Key features:

  • Zero server cost: All processing happens in the browser
  • 100% privacy: Audio never leaves the user's device
  • Web Worker architecture: Model loading and inference run in isolated worker thread
  • Auto-resampling: Audio automatically resampled to 16kHz for Whisper model

Bug Fixes (Latest)

  • Fixed race condition where mediaRecorderRef was cleared before onstop handler completed
  • Added validation for empty audio chunks and short recordings
  • Improved error handling with proper user feedback
  • UI now properly resets and shows errors
  • Fixed auto-restart recording bug
  • NEW: Fixed stale callback closure that prevented transcription text from appearing in the input box

New Voice Input UI Design

The voice recording UI now features:

  • Animated visualizer bars on the left side
  • Recording timer in the center
  • Stop/Mic button on the right side
  • Error message display for failed recordings
Original prompt

在现有的项目中,集成基于浏览器端侧计算的语音转文字(Speech-to-Text)功能。
目标是实现“零服务器成本”、“100% 隐私保护”的语音输入。

Requirements (必须满足)

  1. 核心架构:Web Worker 隔离

    • 必须将 AI 模型的加载和推理逻辑放入单独的 worker.js 文件中。
    • 禁止在主线程运行模型,以免造成 UI 冻结(Blocking UI)。
    • 主线程与 Worker 之间通过 postMessage 通信。
  2. 模型选择

    • 使用 Xenova/whisper-tiny.en (Quantized 版本)。
    • 理由:平衡下载体积 (~40MB) 和识别准确率。
  3. 音频处理 (关键)

    • 使用 Web Audio API (AudioContext) 捕获麦克风输入。
    • 必须将音频自动重采样(Resample)到 Whisper 模型要求的 16000Hz 采样率。
    • 音频数据需转换为模型可接受的 Float32Array 格式。
  4. UI/UX 交互状态

    • 初始态:显示“点击启用语音输入(首次需下载模型)”。
    • 加载态:显示模型下载进度百分比(Progress Bar)。
    • 录音态:显示“正在聆听...”,并提供停止按钮。
    • 处理态:显示“正在转译...”。
    • 完成态:将转译出的文字自动填充到输入框(callback function)。
    • 参考之前原本的UI/UX 设计。和动画效果。
  5. 配置调整
    直接修改代码库


✨ Let Copilot coding agent set things up for you — coding agent works faster and does higher quality work when set up for your repo.

@vercel
Copy link

vercel bot commented Nov 28, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
think-first Ready Ready Preview Comment Dec 1, 2025 11:36am

- Create whisper-worker.js for isolated model loading and inference
- Add useSpeechToText hook for audio capture and processing
- Create SpeechToTextButton component with multiple UI states
- Integrate STT button into PromptInputBox
- Update vite.config.ts for Web Worker support
- Add @xenova/transformers dependency for Whisper model

Co-authored-by: Ryan-Guo123 <169242962+Ryan-Guo123@users.noreply.github.com>
- Fix progress value conversion (handle both 0-1 and 0-100 formats)
- Remove hardcoded sample rate constraint for better device compatibility
- Fix race condition by setting onstop handler before starting MediaRecorder

Co-authored-by: Ryan-Guo123 <169242962+Ryan-Guo123@users.noreply.github.com>
Copilot AI changed the title [WIP] Add browser-based speech-to-text integration Add browser-side Speech-to-Text with Web Worker architecture Nov 28, 2025
Copilot AI requested a review from Ryan-Guo123 November 28, 2025 01:06
@Ryan-Guo123
Copy link
Owner

@copilot

可以参考以下样式来优化。按键放右边, 时间和声音条放在左边。

You are given a task to integrate an existing React component in the codebase

The codebase should support:

  • shadcn project structure
  • Tailwind CSS
  • Typescript

If it doesn't, provide instructions on how to setup project via shadcn CLI, install Tailwind or Typescript.

Determine the default path for components and styles.
If default path for components is not /components/ui, provide instructions on why it's important to create this folder
Copy-paste this component to /components/ui folder:

ai-voice-input.tsx
"use client";

import { Mic } from "lucide-react";
import { useState, useEffect } from "react";
import { cn } from "@/lib/utils";

interface AIVoiceInputProps {
  onStart?: () => void;
  onStop?: (duration: number) => void;
  visualizerBars?: number;
  demoMode?: boolean;
  demoInterval?: number;
  className?: string;
}

export function AIVoiceInput({
  onStart,
  onStop,
  visualizerBars = 48,
  demoMode = false,
  demoInterval = 3000,
  className
}: AIVoiceInputProps) {
  const [submitted, setSubmitted] = useState(false);
  const [time, setTime] = useState(0);
  const [isClient, setIsClient] = useState(false);
  const [isDemo, setIsDemo] = useState(demoMode);

  useEffect(() => {
    setIsClient(true);
  }, []);

  useEffect(() => {
    let intervalId: NodeJS.Timeout;

    if (submitted) {
      onStart?.();
      intervalId = setInterval(() => {
        setTime((t) => t + 1);
      }, 1000);
    } else {
      onStop?.(time);
      setTime(0);
    }

    return () => clearInterval(intervalId);
  }, [submitted, time, onStart, onStop]);

  useEffect(() => {
    if (!isDemo) return;

    let timeoutId: NodeJS.Timeout;
    const runAnimation = () => {
      setSubmitted(true);
      timeoutId = setTimeout(() => {
        setSubmitted(false);
        timeoutId = setTimeout(runAnimation, 1000);
      }, demoInterval);
    };

    const initialTimeout = setTimeout(runAnimation, 100);
    return () => {
      clearTimeout(timeoutId);
      clearTimeout(initialTimeout);
    };
  }, [isDemo, demoInterval]);

  const formatTime = (seconds: number) => {
    const mins = Math.floor(seconds / 60);
    const secs = seconds % 60;
    return `${mins.toString().padStart(2, "0")}:${secs.toString().padStart(2, "0")}`;
  };

  const handleClick = () => {
    if (isDemo) {
      setIsDemo(false);
      setSubmitted(false);
    } else {
      setSubmitted((prev) => !prev);
    }
  };

  return (
    <div className={cn("w-full py-4", className)}>
      <div className="relative max-w-xl w-full mx-auto flex items-center flex-col gap-2">
        <button
          className={cn(
            "group w-16 h-16 rounded-xl flex items-center justify-center transition-colors",
            submitted
              ? "bg-none"
              : "bg-none hover:bg-black/10 dark:hover:bg-white/10"
          )}
          type="button"
          onClick={handleClick}
        >
          {submitted ? (
            <div
              className="w-6 h-6 rounded-sm animate-spin bg-black dark:bg-white cursor-pointer pointer-events-auto"
              style={{ animationDuration: "3s" }}
            />
          ) : (
            <Mic className="w-6 h-6 text-black/70 dark:text-white/70" />
          )}
        </button>

        <span
          className={cn(
            "font-mono text-sm transition-opacity duration-300",
            submitted
              ? "text-black/70 dark:text-white/70"
              : "text-black/30 dark:text-white/30"
          )}
        >
          {formatTime(time)}
        </span>

        <div className="h-4 w-64 flex items-center justify-center gap-0.5">
          {[...Array(visualizerBars)].map((_, i) => (
            <div
              key={i}
              className={cn(
                "w-0.5 rounded-full transition-all duration-300",
                submitted
                  ? "bg-black/50 dark:bg-white/50 animate-pulse"
                  : "bg-black/10 dark:bg-white/10 h-1"
              )}
              style={
                submitted && isClient
                  ? {
                      height: `${20 + Math.random() * 80}%`,
                      animationDelay: `${i * 0.05}s`,
                    }
                  : undefined
              }
            />
          ))}
        </div>

        <p className="h-4 text-xs text-black/70 dark:text-white/70">
          {submitted ? "Listening..." : "Click to speak"}
        </p>
      </div>
    </div>
  );
}

demo.tsx
import { AIVoiceInput } from "@/components/ui/ai-voice-input";
import { useState } from "react";

export function AIVoiceInputDemo() {
  const [recordings, setRecordings] = useState<{ duration: number; timestamp: Date }[]>([]);

  const handleStop = (duration: number) => {
    setRecordings(prev => [...prev.slice(-4), { duration, timestamp: new Date() }]);
  };

  return (
    <div className="space-y-8">
        <div className="space-y-4">
          <AIVoiceInput 
            onStart={() => console.log('Recording started')}
            onStop={handleStop}
          />   
      </div>
    </div>
  );
}

Install NPM dependencies:

lucide-react

Implementation Guidelines

  1. Analyze the component structure and identify all required dependencies
  2. Review the component's argumens and state
  3. Identify any required context providers or hooks and install them
  4. Questions to Ask
  • What data/props will be passed to this component?
  • Are there any specific state management requirements?
  • Are there any required assets (images, icons, etc.)?
  • What is the expected responsive behavior?
  • What is the best place to use this component in the app?

Steps to integrate
0. Copy paste all the code above in the correct directories

  1. Install external dependencies
  2. Fill image assets with Unsplash stock images you know exist
  3. Use lucide-react icons for svgs or logos if component requires them

- Add inline voice recording bar with visualizer bars on left
- Show recording timer in center
- Place stop/mic button on right side
- Add animated visualizer bars during recording
- Extract magic numbers to named constants
- Remove unused SpeechToTextButton and AIVoiceInput components

Co-authored-by: Ryan-Guo123 <169242962+Ryan-Guo123@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Nov 28, 2025

@copilot

可以参考以下样式来优化。按键放右边, 时间和声音条放在左边。

...

Redesigned the voice input UI in commit e11a88a. The new layout shows:

  • Visualizer bars on the left
  • Recording timer in the center
  • Stop/Mic button on the right

Voice Input UI - Default

Voice Input UI - Recording

@Ryan-Guo123
Copy link
Owner

@copilot

按停止录音的时候, 有的时候不转录
有很多bug

… display

- Fix race condition in stopRecording by clearing mediaRecorderRef in onstop handler
- Add validation for empty audio chunks and recordings that are too short
- Add better error handling for audio decode failures and worker readiness
- Update UI to properly show error messages and reset state on errors
- Simplify worker transcription result logic

Co-authored-by: Ryan-Guo123 <169242962+Ryan-Guo123@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Nov 29, 2025

@copilot

按停止录音的时候, 有的时候不转录
...

Fixed in commit f06f24b. The issues were:

  1. Race condition: The mediaRecorderRef was being cleared before the onstop handler could process audio - now cleared inside the handler
  2. Empty audio handling: Added validation for recordings that are too short (<0.5s) or have no audio data
  3. Error display: UI now shows error messages and properly resets state

The changes include better error handling for audio decode failures, worker readiness checks, and user-friendly error messages.

@Ryan-Guo123
Copy link
Owner

@copilot
When you press the stop recording button, a "Transcribing..." message appears briefly, but after a few seconds it automatically switches back to recording mode without transcribing any text.

The issue was that after transcription, the status changes to 'ready',
which triggered the auto-start effect to restart recording before the
handleSpeechTranscript callback could set voiceInputActive to false.

Added justFinishedProcessingRef to track when processing just finished,
preventing the auto-start effect from re-triggering recording after
transcription or error states.

Co-authored-by: Ryan-Guo123 <169242962+Ryan-Guo123@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Dec 1, 2025

@copilot
When you press the stop recording button, a "Transcribing..." message appears briefly, but after a few seconds it automatically switches back to recording mode without transcribing any text.

Fixed in commit 98854f2. The issue was that the auto-start recording effect was triggering after transcription completed because the voiceStatus changed to 'ready' before the callback could set voiceInputActive to false. Added a ref to track when processing just finished and prevent auto-restart in that case.

@Ryan-Guo123
Copy link
Owner

@copilot
it only work on first time, then it has the Same problem!!

The issue was that justFinishedProcessingRef was being cleared
unconditionally at the start of the effect, even when the auto-start
conditions weren't met. This caused the flag to be cleared before it
could prevent auto-restart on subsequent recordings.

Fix:
- Only clear the flag when auto-start conditions are actually met
- Clear the flag when user manually clicks to start new recording

Co-authored-by: Ryan-Guo123 <169242962+Ryan-Guo123@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Dec 1, 2025

@copilot
it only work on first time, then it has the Same problem!!

Fixed in commit 24c3a5b. The issue was that justFinishedProcessingRef was being cleared unconditionally when the effect ran, even when the auto-start conditions weren't met. This caused the flag to be cleared before it could prevent auto-restart on subsequent recordings.

Changes:

  1. Only clear the flag when all auto-start conditions are met
  2. Also clear the flag when user manually clicks to start a new recording session

@Ryan-Guo123
Copy link
Owner

@copilot Same problem! When I press the stop recording button, a "Transcribing..." message appears briefly, but after a few seconds it automatically switches back to recording mode without transcribing any text.

Previous approach used justFinishedProcessingRef flag which was too
complex and prone to race conditions.

New approach:
- Use waitingForModelRef that is ONLY set when user clicks mic while
  model not loaded
- Auto-start ONLY triggers when this flag is set AND model becomes ready
- After transcription, the flag is already false so no restart happens

This removes the problematic voiceInputActive dependency for auto-start.

Co-authored-by: Ryan-Guo123 <169242962+Ryan-Guo123@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Dec 1, 2025

@copilot Same problem! When I press the stop recording button, a "Transcribing..." message appears briefly, but after a few seconds it automatically switches back to recording mode without transcribin...

Fixed in commit fb9c75d. Completely rewrote the auto-start logic with a simpler approach:

  1. The waitingForModelRef flag is now ONLY set when user clicks mic while model is not loaded
  2. Auto-start ONLY happens when this flag is true AND model becomes ready
  3. After transcription, the flag is already false so recording cannot restart

The previous approach with justFinishedProcessingRef was too complex and had race conditions. The new approach eliminates the problematic dependency on voiceInputActive for auto-start decisions.

@Ryan-Guo123
Copy link
Owner

@copilot When I press the stop recording button, a "Transcribing..." message appears briefly, but after a few seconds it didn't transcribing any text!

The onTranscript callback was captured in the initWorker useCallback
closure. When the component re-rendered, the worker's onmessage handler
still had the old callback reference that didn't update the UI.

Fix: Use a ref (onTranscriptRef) that is updated via useEffect whenever
onTranscript changes. The worker's onmessage handler now uses this ref
to always call the latest callback.

Co-authored-by: Ryan-Guo123 <169242962+Ryan-Guo123@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Dec 1, 2025

@copilot When I press the stop recording button, a "Transcribing..." message appears briefly, but after a few seconds it didn't transcribing any text!

Fixed in commit ca502c5. The issue was a stale callback closure - the onTranscript callback was captured in the worker's onmessage handler when the worker was initialized, but React re-renders could create a new callback reference that the handler didn't see.

Fix: Use a ref (onTranscriptRef) that is updated via useEffect whenever onTranscript changes. The worker's handler now uses this ref to always call the latest callback, ensuring the transcription text appears in the input box.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants