Skip to content

Conversation

yujonglee
Copy link
Contributor

@yujonglee yujonglee commented Aug 21, 2025

Summary by cubic

Auto-finalizes messages at end-of-speech and fixes duplicate final words in transcripts. Adds a control channel for finalize events and re-exports VAD types for plugin use.

  • New Features

    • Detects SpeechEnd via a new VAD session (mixes mic+speaker) and sends ControlMessage::Finalize.
    • Merges audio and control into a single stream as MixedMessage::{Audio, Control}.
    • Re-exports VadConfig, VadSession, VadTransition from the VAD crate.
    • Tunes VAD timing (70ms redemption and pre-speech pad).
  • Bug Fixes

    • Removes duplicate final words using start/end timestamp dedupe.
    • Keeps a sorted final_words list, separate from partial_words.
    • Trims partial_words only after the latest newly finalized word.

Copy link

coderabbitai bot commented Aug 21, 2025

📝 Walkthrough

Walkthrough

Exposes VAD types via public re-exports. Adds control message channeling and VAD-driven finalize signaling in the listener FSM, multiplexing control with audio. Updates transcript manager to persist and deduplicate final words across calls, emitting only newly finalized words per append while pruning partials accordingly.

Changes

Cohort / File(s) Summary
VAD public re-exports
crates/vad/src/continuous.rs
Changed private import to public re-export: pub use silero_rs::{VadConfig, VadSession, VadTransition}; without altering runtime logic.
Listener FSM: control channel + VAD finalize
plugins/listener/src/fsm.rs
Added control_tx/control_rx to AudioChannels; created bounded control channel (8). Integrated hypr_vad::VadSession (70ms redemption_time and pre_speech_pad). Computed mixed audio, processed VAD transitions, and on SpeechEnd sent ControlMessage::Finalize via control_tx. Merged control and audio streams via futures::select and added MixedMessage::Control. Adjusted task wiring.
Transcript finalization state/dedup
plugins/listener/src/manager.rs
Added TranscriptManager.final_words state. On is_final: deduplicate by time (~0.001), append/sort new finals, prune partials past last new final, return Diff with only newly finalized words. Non-final path keeps thresholding and merges partials; Diff.final_words empty. Function signatures unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Mic
  actor Speaker
  participant FSM as Listener FSM
  participant VAD as VadSession
  participant Ctrl as control_tx/control_rx
  participant Stream as Mixed stream (audio + control)
  participant Listener as Downstream Listener

  Mic->>FSM: mic chunk
  Speaker->>FSM: speaker chunk
  FSM->>FSM: mix(mic, speaker) -> mixed
  FSM->>VAD: process(mixed)
  alt SpeechEnd detected
    VAD-->>FSM: Transition::SpeechEnd
    FSM->>Ctrl: send ControlMessage::Finalize
  end
  Ctrl-->>FSM: ControlMessage
  FSM->>Stream: select(audio_stream, control_stream)
  Stream->>Listener: MixedMessage::{Audio|Control}
  note over Listener: Receives audio frames and finalize controls multiplexed
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45–60 minutes

Possibly related PRs

Tip

🔌 Remote MCP (Model Context Protocol) integration is now available!

Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats.

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch am-finalize-message

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 3 files

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/listener/src/fsm.rs (1)

492-496: Server-side Control messages aren’t handled

Although ListenClient::from_realtime_audio correctly serializes MixedMessage::Control (and the interface defines the Control variant) and your listener plugin merges audio + control streams, the server code never matches on or forwards incoming control frames. Both in

  • owhisper-server/src/server.rs (around line 426)
  • owhisper-server/src/commands/run/realtime.rs (around line 167)

you only build the input stream as audio chunks:

- let input = audio.map(|chunk| owhisper_interface::MixedMessage::Audio(chunk));
+ let control_stream = control_rx.into_stream().map(|ctrl| owhisper_interface::MixedMessage::Control(ctrl));
+ let combined_input = futures_util::stream::select(
+     audio.map(|chunk| owhisper_interface::MixedMessage::Audio(chunk)),
+     control_stream,
+ );
+ let input = combined_input;

Please update the server’s real-time input pipelines to handle MixedMessage::Control (e.g. Finalize, KeepAlive, CloseStream) so that control frames are respected end-to-end.

🧹 Nitpick comments (4)
plugins/listener/src/fsm.rs (3)

86-87: Avoid magic number for control channel capacity

Lift the 8 into a named constant to document intent and ease tuning.

   fn new() -> Self {
-        const CHUNK_BUFFER_SIZE: usize = 64;
+        const CHUNK_BUFFER_SIZE: usize = 64;
+        const CONTROL_BUFFER_SIZE: usize = 8;
@@
-        let (control_tx, control_rx) = flume::bounded::<owhisper_interface::ControlMessage>(8);
+        let (control_tx, control_rx) =
+            flume::bounded::<owhisper_interface::ControlMessage>(CONTROL_BUFFER_SIZE);

342-347: Mixed computation duplicated later — DRY and minor perf win

You recompute the same mixed vector again for recording. Reuse the earlier result.

-                    if record {
-                        let mixed: Vec<f32> = mic_chunk
-                            .iter()
-                            .zip(speaker_chunk.iter())
-                            .map(|(mic, speaker)| (mic + speaker).clamp(-1.0, 1.0))
-                            .collect();
-                        if save_mixed_tx.send_async(mixed).await.is_err() {
+                    if record {
+                        if save_mixed_tx.send_async(mixed.clone()).await.is_err() {
                             tracing::error!("save_mixed_tx_send_error");
                         }
                     }

348-356: Finalize on SpeechEnd — validate semantics and backpressure

  • Triggering Finalize on any SpeechEnd from the mixed stream means speaker-only silence may finalize mic hypotheses too. If the intent is “phrase boundary regardless of source,” this is fine; otherwise consider gating on mic-dominant segments.
  • send_async on a bounded(8) channel can block this hot loop under receiver backpressure. If Finalize events can ever burst, consider try_send with best-effort semantics or raising the bound.

Would you like a small debounce/in-speech guard to ensure a single Finalize per contiguous speech segment?

plugins/listener/src/manager.rs (1)

98-128: Finalization diff logic — solid; consider epsilon constant and stricter partial pruning

The start/end-based dedup is clear and effective. Two small improvements:

  • Lift 0.001 into a named epsilon to document units (seconds) and ease tuning.
  • When pruning partials after new finals, consider an epsilon-inclusive cutoff to avoid edge residue on exact boundaries.
+const TIME_EPSILON: f32 = 0.001; // seconds; ~1 ms tolerance for alignment
@@
-                let new_final_words: Vec<owhisper_interface::Word> = words
+                let new_final_words: Vec<owhisper_interface::Word> = words
                     .into_iter()
                     .filter(|new_word| {
                         !self.final_words.iter().any(|existing_word| {
-                            (existing_word.start - new_word.start).abs() < 0.001
-                                && (existing_word.end - new_word.end).abs() < 0.001
+                            (existing_word.start - new_word.start).abs() < TIME_EPSILON
+                                && (existing_word.end - new_word.end).abs() < TIME_EPSILON
                         })
                     })
                     .collect();
@@
-                if let Some(last_final_word) = new_final_words.last() {
+                if let Some(last_final_word) = new_final_words.last() {
                     self.partial_words = self
                         .partial_words
                         .iter()
-                        .filter(|w| w.start > last_final_word.end)
+                        .filter(|w| w.start >= last_final_word.end - TIME_EPSILON)
                         .cloned()
                         .collect::<Vec<_>>();
                 }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 93cd0c3 and 95361d7.

📒 Files selected for processing (3)
  • crates/vad/src/continuous.rs (1 hunks)
  • plugins/listener/src/fsm.rs (6 hunks)
  • plugins/listener/src/manager.rs (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{js,ts,tsx,rs}

⚙️ CodeRabbit configuration file

**/*.{js,ts,tsx,rs}: 1. No error handling.
2. No unused imports, variables, or functions.
3. For comments, keep it minimal. It should be about "Why", not "What".

Files:

  • crates/vad/src/continuous.rs
  • plugins/listener/src/fsm.rs
  • plugins/listener/src/manager.rs
🔇 Additional comments (11)
crates/vad/src/continuous.rs (1)

10-10: Public re-export of VAD types looks good

Re-exporting VadConfig, VadSession, and VadTransition from here simplifies downstream imports and aligns with their usage elsewhere in the PR. No runtime impact and no unused items introduced.

plugins/listener/src/fsm.rs (7)

71-73: Add control channel to AudioChannels — appropriate extension

Introducing a dedicated control channel for out-of-band signals (e.g., Finalize) is a clean separation from the audio paths. Fields are consumed later, so no unused members.


117-119: Initialize new control fields — consistent

Struct initialization updated correctly; no omissions.


302-303: Clone control_tx into the processing task — good scope capture

This is the right place to source control messages from the audio loop. No leaks and no unused clones.


474-479: Audio stream mapping into MixedMessage::Audio — good

Zipping mic/speaker and mapping into the transport type is clear and minimal.


480-484: Control stream mapping — good

Simple lift into MixedMessage::Control. No extraneous allocations.


485-485: Stream select to multiplex audio + control — appropriate

select provides fair interleaving so control messages don’t starve. Good choice for simplicity.


310-316: Drop the sample_rate pinning suggestion—VadConfig has no such field

I searched the entire workspace (including crates/vad) and found no struct VadConfig definition or sample_rate field available for configuration—so adding sample_rate would not compile. You can safely ignore the optional refactor to pin sample_rate.

Likely an incorrect or invalid review comment.

plugins/listener/src/manager.rs (3)

5-6: Persisted final_words state — enables cross-call dedup

Keeping accumulated finals here aligns with the PR goal (emit only deltas). No API surface changes.


13-14: Default impl updated for final_words — OK

Properly initialized to empty; no unintended allocations.


126-128: Return only newly observed finals — matches PR intent

Diff.final_words carries deltas; self.final_words retains the history. Good separation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant