
fix: harden audio pipeline, improve barge-in, rewrite docs #15

Merged: abdul-abdi merged 13 commits into main from fix/audio-stability on Mar 15, 2026

abdul-abdi (Owner) commented Mar 15, 2026

Summary

Stabilizes the real-time audio pipeline, eliminates false barge-in triggers, reduces end-to-end latency, and rewrites all user-facing documentation for hackathon submission.


3acee7b — fix: use security CLI for keychain (no ACL prompts), harden VAD for barge-in

KeychainHelper rewrite (AuraApp/Sources/KeychainHelper.swift):

  • Replaces the Security.framework API with the /usr/bin/security CLI for all Keychain operations (save, read, delete).
  • Fixes the per-app ACL prompts that appeared when both the SwiftUI app and the Rust daemon (two separate ad-hoc signed binaries) accessed the same Keychain item.
  • Uses the -A flag to allow any local application to read the item without triggering a password dialog.
  • Legacy save(account:data:) and read(account:) wrappers preserved for compatibility.
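The CLI-based approach can be sketched from the daemon's side in Rust. This is a minimal sketch, not the code in KeychainHelper.swift: the function names, the service and account strings, and the error handling are assumptions for illustration. The subcommands and flags used (`add-generic-password`, `find-generic-password`, `delete-generic-password`, `-s`, `-a`, `-w`, `-A`, `-U`) are those of the macOS `security` tool.

```rust
use std::process::Command;

/// Path to the macOS `security` tool named in the PR description.
const SECURITY_BIN: &str = "/usr/bin/security";

/// Arguments for saving (or updating) a generic password item.
/// `-A` lets any local application read the item without a prompt;
/// `-U` updates the item if it already exists.
fn save_args(service: &str, account: &str, secret: &str) -> Vec<String> {
    ["add-generic-password", "-s", service, "-a", account, "-A", "-U", "-w", secret]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Arguments for reading an item; `-w` prints only the password.
fn read_args(service: &str, account: &str) -> Vec<String> {
    ["find-generic-password", "-s", service, "-a", account, "-w"]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Arguments for deleting an item.
fn delete_args(service: &str, account: &str) -> Vec<String> {
    ["delete-generic-password", "-s", service, "-a", account]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

/// Runs the tool (macOS only) and returns trimmed stdout on success,
/// or the tool's stderr text on failure.
fn run_security(args: &[String]) -> Result<String, String> {
    let out = Command::new(SECURITY_BIN)
        .args(args)
        .output()
        .map_err(|e| e.to_string())?;
    if out.status.success() {
        Ok(String::from_utf8_lossy(&out.stdout).trim_end().to_string())
    } else {
        Err(String::from_utf8_lossy(&out.stderr).trim_end().to_string())
    }
}
```

A caller would pass `save_args(...)` or `read_args(...)` to `run_security`. Note that a secret passed with `-w` is briefly visible in the process argument list, which is a trade-off of any CLI-based approach.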

VAD mode change (aura-voice/src/vad.rs):

  • Switches WebRTC VAD from Quality mode to VeryAggressive mode, which rejects non-speech more strictly. This reduces the false speech detections from background noise that were triggering unnecessary audio forwarding during playback.

42e968e — fix: harden audio pipeline, rewrite docs for hackathon submission

Audio capture (aura-voice/src/audio.rs)

  • Halves resampler chunk size from 1024 to 512 frames (~10.7ms at 48kHz), reducing accumulation latency before audio reaches the WebSocket.
  • Reduces sinc resampler quality from studio-grade (sinc_len=128, oversampling=128) to voice-grade (sinc_len=32, oversampling=32) — ~4x faster with no audible difference for 16kHz speech.
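The chunk-size latency figure above follows from frames divided by sample rate. A small sketch of that arithmetic (the helper name is hypothetical, not taken from audio.rs):

```rust
/// Duration of one chunk in milliseconds at the given sample rate.
fn chunk_ms(frames: u32, sample_rate_hz: u32) -> f64 {
    frames as f64 * 1000.0 / sample_rate_hz as f64
}
```

At 48kHz, `chunk_ms(1024, 48_000)` is about 21.3ms and `chunk_ms(512, 48_000)` is about 10.7ms, so halving the chunk removes roughly 10.7ms of accumulation delay per chunk.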

Playback (aura-voice/src/playback.rs)

  • Cuts pre-buffer target from 80ms to 40ms. Gemini's native audio model delivers chunks at a steady rate, so 40ms is sufficient jitter absorption with lower first-byte latency.
  • Updates test assertions to match the new 40ms target (960 samples at 24kHz instead of 1920).
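To illustrate how a pre-buffer target translates into samples and gates playback, here is a minimal sketch. The `PreBuffer` type, its methods, and the `i16` sample type are assumptions for illustration and are not the types in playback.rs.

```rust
use std::collections::VecDeque;

/// Number of samples covering `ms` milliseconds at `sample_rate_hz`.
fn samples_for_ms(ms: u32, sample_rate_hz: u32) -> usize {
    (sample_rate_hz as u64 * ms as u64 / 1000) as usize
}

/// Holds incoming samples until the pre-buffer target is reached,
/// then releases audio for playback.
struct PreBuffer {
    target: usize,
    queue: VecDeque<i16>,
    started: bool,
}

impl PreBuffer {
    fn new(target_ms: u32, sample_rate_hz: u32) -> Self {
        PreBuffer {
            target: samples_for_ms(target_ms, sample_rate_hz),
            queue: VecDeque::new(),
            started: false,
        }
    }

    /// Queue a chunk; returns true once playback may start.
    fn push(&mut self, chunk: &[i16]) -> bool {
        self.queue.extend(chunk.iter().copied());
        if !self.started && self.queue.len() >= self.target {
            self.started = true;
        }
        self.started
    }

    /// Pop up to `n` samples, or nothing while still pre-buffering.
    fn pop(&mut self, n: usize) -> Vec<i16> {
        if !self.started {
            return Vec::new();
        }
        let take = n.min(self.queue.len());
        self.queue.drain(..take).collect()
    }
}
```

With a 40ms target at 24kHz, `samples_for_ms` yields 960 samples, matching the updated test assertion; the old 80ms target yields 1920.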

Barge-in & mic pipeline (aura-daemon/src/orchestrator.rs)

  • Removes client-side VAD entirely for non-playback audio. All mic audio now streams directly to Gemini, which runs its own server-side VAD. This eliminates ~330ms of client-side latency (30ms frame accumulation + 300ms hangover) and simplifies the pipeline.
  • Retains energy-based barge-in gating only during playback to filter speaker bleed-through.
  • Lowers PLAYBACK_THRESHOLD_MULTIPLIER from 1.8 to 1.3 — the old value was too aggressive and swallowed real user speech. New value filters speaker bleed (0.005-0.03 RMS) while passing direct speech (0.05-0.3 RMS).
  • Reduces BARGE_IN_CONSECUTIVE_FRAMES from 3 to 2 — with the smaller 512-sample chunks, 2 frames ≈ 20ms of sustained speech is sufficient confirmation.
  • Fixes the pre-buffer mute window comment to reflect actual wall-clock timing (100-500ms with network jitter, not a fixed 80ms).
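The playback-time gating described above can be sketched as an RMS energy check with a consecutive-frame counter. This is a minimal sketch under stated assumptions: the two constants carry the values from this commit (a later commit in this PR raises them to 1.5 and 3), while `BargeInGate`, `base_threshold`, and the idea that the multiplier scales a single baseline energy threshold are illustrative guesses, not the orchestrator's actual structure.

```rust
const PLAYBACK_THRESHOLD_MULTIPLIER: f32 = 1.3;
const BARGE_IN_CONSECUTIVE_FRAMES: u32 = 2;

/// Root-mean-square energy of a frame of samples in [-1.0, 1.0].
fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// Confirms barge-in only after enough consecutive loud frames.
struct BargeInGate {
    /// Hypothetical baseline energy threshold (e.g. ambient-calibrated).
    base_threshold: f32,
    loud_frames: u32,
}

impl BargeInGate {
    fn new(base_threshold: f32) -> Self {
        BargeInGate { base_threshold, loud_frames: 0 }
    }

    /// Feed one mic frame captured during playback.
    /// Returns true once barge-in is confirmed.
    fn feed(&mut self, frame: &[f32]) -> bool {
        let threshold = self.base_threshold * PLAYBACK_THRESHOLD_MULTIPLIER;
        if rms(frame) > threshold {
            self.loud_frames += 1;
        } else {
            // Any quiet frame resets the streak.
            self.loud_frames = 0;
        }
        self.loud_frames >= BARGE_IN_CONSECUTIVE_FRAMES
    }
}
```

With an assumed baseline of 0.03 RMS, the effective threshold is 0.039: speaker bleed in the 0.005-0.03 range stays below it, while direct speech at 0.05 and above passes and confirms after two frames.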

Session management (aura-gemini/src/session.rs)

  • Adds proactive session rotation every 10 minutes (SESSION_ROTATION_SECS = 600). Long-lived Gemini audio sessions exhibit gradual latency degradation — rotating the WebSocket with session resumption resets server-side state.
  • Introduces DisconnectReason enum (Shutdown, GoAway, Rotate) replacing the previous bool return for cleaner control flow.
  • GoAway clears the resumption handle (server is done with the session). Rotate keeps the handle for seamless resume. Shutdown is a clean exit.
  • Reorders the WebSocket select loop to prioritize audio over video: all pending audio chunks are flushed before sending a video frame. A 300KB screenshot can stall the TCP socket for 50-300ms — flushing audio first prevents mic data gaps.
  • Adds stale video frame draining: when frames queue up (e.g., WebSocket busy), only the most recent frame is sent. Old frames waste bandwidth and give Gemini outdated screen context.
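The disconnect handling and send ordering above can be sketched with standard-library channels. Only the `DisconnectReason` variant names come from the PR description; `after_disconnect`, `latest_frame`, `next_batch`, `Outgoing`, and the assumption that GoAway is followed by a fresh reconnect are hypothetical, and the real session loop is asynchronous rather than channel-polling.

```rust
use std::sync::mpsc::Receiver;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DisconnectReason {
    Shutdown,
    GoAway,
    Rotate,
}

/// Returns (resumption handle for the next connect, whether to reconnect).
fn after_disconnect(
    reason: DisconnectReason,
    handle: Option<String>,
) -> (Option<String>, bool) {
    match reason {
        // Clean exit: nothing to resume.
        DisconnectReason::Shutdown => (None, false),
        // Server is done with the session: drop the handle, start fresh.
        DisconnectReason::GoAway => (None, true),
        // Proactive rotation: keep the handle for a seamless resume.
        DisconnectReason::Rotate => (handle, true),
    }
}

/// Drain queued video frames, keeping only the newest.
/// Returns the frame to send and how many stale frames were skipped.
fn latest_frame<T>(rx: &Receiver<T>) -> (Option<T>, usize) {
    let mut latest = None;
    let mut skipped = 0;
    while let Ok(frame) = rx.try_recv() {
        if latest.replace(frame).is_some() {
            skipped += 1;
        }
    }
    (latest, skipped)
}

#[derive(Debug, PartialEq)]
enum Outgoing {
    Audio(Vec<u8>),
    Video(Vec<u8>),
}

/// Build the next batch to write: all pending audio first, then at
/// most one (the most recent) video frame.
fn next_batch(audio: &Receiver<Vec<u8>>, video: &Receiver<Vec<u8>>) -> Vec<Outgoing> {
    let mut batch: Vec<Outgoing> = audio.try_iter().map(Outgoing::Audio).collect();
    if let (Some(frame), _skipped) = latest_frame(video) {
        batch.push(Outgoing::Video(frame));
    }
    batch
}
```

Returning the skipped count from `latest_frame` makes it straightforward to log when stale frames are dropped.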

Screen capture (aura-daemon/src/screen_capture.rs)

  • Reduces idle threshold from 10 to 5 unchanged captures (2.5s instead of 5s before switching to slow polling) — faster responsiveness to screen changes.
  • Reduces slow polling interval from 2000ms to 1500ms (~0.67 FPS idle instead of 0.5 FPS).
  • Adds frame counter and makes censorship detection periodic (first 3 frames then every 20th) instead of every frame — the base64→JPEG→pixel decode was expensive and unnecessary on every capture.
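The polling and censorship-check schedule above reduces to two small predicates. In this sketch the constant and function names are hypothetical, the 500ms active interval is inferred from "5 unchanged captures = 2.5s", and the frame counter is assumed to be 1-based.

```rust
use std::time::Duration;

/// Unchanged captures before switching to slow polling.
const IDLE_THRESHOLD: u32 = 5;
/// Assumed active interval (2.5s / 5 captures).
const FAST_INTERVAL: Duration = Duration::from_millis(500);
/// Idle interval from the PR description (~0.67 FPS).
const SLOW_INTERVAL: Duration = Duration::from_millis(1500);

/// Pick the delay before the next capture.
fn poll_interval(unchanged_captures: u32) -> Duration {
    if unchanged_captures >= IDLE_THRESHOLD {
        SLOW_INTERVAL
    } else {
        FAST_INTERVAL
    }
}

/// Run the expensive censorship check on the first 3 frames,
/// then on every 20th frame (1-based frame numbers assumed).
fn should_check_censorship(frame_number: u64) -> bool {
    frame_number <= 3 || frame_number % 20 == 0
}
```

Under these assumptions, frames 1, 2, 3, 20, 40, and so on are checked, and everything in between skips the decode.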

System prompt (aura-gemini/src/config.rs)

  • Updates idle capture rate in system prompt from 0.5 FPS to 0.7 FPS to match new interval.
  • Adds browser-first tool strategy: run_javascript is now the preferred tool for all browser interactions (Safari/Chrome) — faster and more reliable than coordinate-based clicking. Falls back to click(x, y) only for browser chrome or when JS fails.
  • Adds detailed guidance for web forms, login prompts, and cross-app workflows.

Documentation (README.md, ARCHITECTURE.md)

  • README rewrite for hackathon submission:
    • Adds persona/voice description (Kore voice, conversational style, barge-in behavior).
    • Adds "Built with" table showing all 7 Google technologies used.
    • Adds Native tools (20) section — full categorized list of all tools Gemini can call.
    • Adds Core pipeline section — numbered step-by-step loop showing audio/vision/tool/barge-in flow with latency budget.
    • Adds Deploy to Google Cloud section — one-command IaC deployment, services table, CI/CD description.
    • Adds Grounding & accuracy section — three-layer anti-hallucination strategy (screen verification, accessibility cross-check, Google Search grounding).
    • Adds "Ground" capability to the features table.
  • Moves architecture deep-dive to root ARCHITECTURE.md (was docs/ARCHITECTURE.md).
  • Removes all stale docs/plans/ — 26 design docs and plan files from prior development iterations that are no longer relevant.

d869e61 — docs: add onboarding flow, keychain auth, and permission details to README

  • Rewrites the Get started section with a detailed first-launch onboarding flow:
    • Step 1: API key entry in the app UI (no manual config file editing), background device registration with Keychain token storage.
    • Step 2: macOS permissions walkthrough — explains that Microphone is a standard prompt, but Screen Recording and Accessibility require a manual toggle in System Settings > Privacy & Security. The app polls every 2s and auto-advances when permissions are granted.
    • Step 3: Daemon launches, connects to Gemini, green dot appears.
  • Adds Permissions persist across rebuilds section — explains ad-hoc code signing with stable designated requirements (identifier "com.aura.desktop") so TCC grants survive rebuilds without re-prompting.
  • Adds Keychain & device auth section — explains device token storage in macOS login Keychain via security CLI (not Security.framework) to avoid per-app ACL prompts between the SwiftUI app and Rust daemon, with background retry on registration failure.
  • Clarifies dev.sh vs bundle.sh in Build from source: dev.sh calls bundle.sh internally then installs and launches; bundle.sh builds without installing. Adds --dmg flag for distribution builds.

Test plan

  • Build and launch with bash scripts/dev.sh
  • Verify barge-in works in quiet and noisy environments — user speech interrupts playback, speaker echo does not
  • Verify session rotation occurs after ~10 minutes (check logs for "Session rotation timer fired")
  • Verify video frames are not queued (check logs for "Skipped stale video frames" under load)
  • Verify Keychain read/write works without ACL prompts on fresh install
  • Verify browser tool strategy: ask Aura to click a link in Safari, confirm it uses run_javascript first
  • Confirm README renders correctly on GitHub
  • First launch onboarding: verify API key entry, permission granting, and daemon auto-launch
  • Rebuild with dev.sh and confirm permissions are NOT re-prompted (stable DR)

Audio stability: ambient-calibrated VAD thresholds, barge-in reverb
guard, playback drain detection, session resumption poison counter.

Docs: rewrite README with tools list, core pipeline, GCP deployment
section, grounding strategy, and persona description. Consolidate
architecture into root ARCHITECTURE.md. Remove stale docs/plans.

The .titled style mask created an invisible title bar with safe area
insets that pushed content down, clipping the text input at the bottom.
Since canBecomeKey is overridden and isMovableByWindowBackground handles
dragging, .titled is unnecessary.

- Mouse: "precise" → "coordinate accuracy depends on Gemini's vision"
- Pipeline: fix chunk size (512 samples ~10ms, not 100ms), remove
  client-side VAD reference (removed in this branch), add actual
  barge-in params (1.3x threshold, 2 frames, 300ms reverb guard)
- Safety: rewrite entirely — do shell script blocked wholesale (not
  pattern-filtered), metacharacter list from actual code, timeouts
  are 60s for all scripts (not 120s for AppleScript), destructive
  action guardrail is prompt-level not code-enforced
- Grounding: screen feedback is continuous 2 FPS (not per-action),
  accessibility data provided to Gemini (not cross-referenced in code)
- Tools: add missing key_state to system category
- Terminal blocklist: list all 9 actual blocked apps from code

Without this, the proxy falls back to an in-memory device store
which loses all registered device tokens when the instance scales
to zero. With GCP_PROJECT_ID set, it uses Firestore-backed storage
and tokens persist across restarts.

Newer ADK versions require sessions to exist before run_async.
InMemoryRunner was throwing SessionNotFoundError because it passed
a random session_id that didn't exist. Switch to explicit Runner
with InMemorySessionService and pre-create sessions before each
agent invocation.

The daemon was exiting immediately after session.disconnect(),
killing the processor before it could run end-of-session
consolidation and memory agent ingest. Now waits up to 15s
for the processor to finish, ensuring facts are extracted
and synced to the cloud before the process exits.

Three fixes for end-of-session memory agent ingest never firing:

1. Add SIGTERM signal handler so the daemon catches terminate signals
   from the SwiftUI parent app (previously only caught SIGINT/Ctrl+C).

2. Call session.disconnect() BEFORE cancel.cancel() — the processor's
   select loop has a cancel.cancelled() branch that was firing first,
   causing it to skip the Disconnected event handler where consolidation
   and memory agent ingest run.

3. SwiftUI app waits up to 12s after SIGTERM for daemon to finish
   graceful shutdown before falling back to SIGINT.
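The ordering in fix 2 and the bounded wait can be sketched with standard-library threads and channels. This is an illustration only: the daemon's processor uses an async select loop with a cancellation token, which is not modeled here, and `Event`, `spawn_processor`, `wait_for_processor`, and `shutdown` are hypothetical names. The point shown is that the disconnect event is delivered first, and the caller then waits a bounded time for end-of-session work before exiting.

```rust
use std::sync::mpsc::{channel, Receiver, RecvTimeoutError};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Event {
    Disconnected,
}

/// Block until the processor signals completion or `timeout` elapses.
/// Returns false only if the wait timed out.
fn wait_for_processor(done: &Receiver<()>, timeout: Duration) -> bool {
    !matches!(done.recv_timeout(timeout), Err(RecvTimeoutError::Timeout))
}

/// Spawn a stand-in processor that runs end-of-session work when it
/// receives `Disconnected`, then signals completion.
fn spawn_processor(events: Receiver<Event>, work: Duration) -> Receiver<()> {
    let (done_tx, done_rx) = channel();
    thread::spawn(move || {
        if let Ok(Event::Disconnected) = events.recv() {
            // Placeholder for consolidation and memory agent ingest.
            thread::sleep(work);
            let _ = done_tx.send(());
        }
    });
    done_rx
}

/// Deliver the disconnect first, then wait up to `grace` for the
/// processor to finish. Returns true if it finished in time.
fn shutdown(work: Duration, grace: Duration) -> bool {
    let (event_tx, event_rx) = channel();
    let done = spawn_processor(event_rx, work);
    event_tx.send(Event::Disconnected).expect("processor is running");
    wait_for_processor(&done, grace)
}
```

In the daemon the grace period is 15s, which is longer than the 12s the SwiftUI app waits after SIGTERM before falling back to SIGINT.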

Also tunes barge-in sensitivity: 1.3x→1.5x threshold multiplier,
2→3 consecutive frames — fixes false interruptions on MacBook speakers.

The memory_agent_handled flag was skipping Firestore sync under
the assumption the memory agent wrote facts server-side. It doesn't —
it only returns them. Now facts always sync to Firestore when config
is present, regardless of whether the memory agent or local
consolidation extracted them. SQLite saves still happen first as
the local fallback.

@abdul-abdi abdul-abdi merged commit 51d5e14 into main Mar 15, 2026
14 checks passed
@abdul-abdi abdul-abdi deleted the fix/audio-stability branch March 15, 2026 12:44