fix: harden audio pipeline, improve barge-in, rewrite docs by abdul-abdi · Pull Request #15 · abdul-abdi/aura

abdul-abdi · 2026-03-15T10:44:40Z

Summary

Stabilizes the real-time audio pipeline, eliminates false barge-in triggers, reduces end-to-end latency, and rewrites all user-facing documentation for hackathon submission.

`3acee7b` — fix: use security CLI for keychain (no ACL prompts), harden VAD for barge-in

KeychainHelper rewrite (AuraApp/Sources/KeychainHelper.swift):

Replaces Security.framework API with /usr/bin/security CLI for all Keychain operations (save, read, delete).
Fixes per-app ACL prompts that appeared when both the SwiftUI app and Rust daemon (two separate ad-hoc signed binaries) accessed the same Keychain item.
Uses -A flag to allow any local application to read the item without triggering a password dialog.
Legacy save(account:data:) and read(account:) wrappers preserved for compatibility.

VAD mode change (aura-voice/src/vad.rs):

Switches WebRTC VAD from Quality mode to VeryAggressive mode — more tolerant of background noise, reduces false speech detection that was triggering unnecessary audio forwarding during playback.

`42e968e` — fix: harden audio pipeline, rewrite docs for hackathon submission

Audio capture (aura-voice/src/audio.rs)

Halves resampler chunk size from 1024 to 512 frames (~10.7ms at 48kHz), reducing accumulation latency before audio reaches the WebSocket.
Reduces sinc resampler quality from studio-grade (sinc_len=128, oversampling=128) to voice-grade (sinc_len=32, oversampling=32) — ~4x faster with no audible difference for 16kHz speech.

Playback (aura-voice/src/playback.rs)

Cuts pre-buffer target from 80ms to 40ms. Gemini's native audio model delivers chunks at a steady rate, so 40ms is sufficient jitter absorption with lower first-byte latency.
Updates test assertions to match the new 40ms target (960 samples at 24kHz instead of 1920).

Barge-in & mic pipeline (aura-daemon/src/orchestrator.rs)

Removes client-side VAD entirely for non-playback audio. All mic audio now streams directly to Gemini, which runs its own server-side VAD. This eliminates ~330ms of client-side latency (30ms frame accumulation + 300ms hangover) and simplifies the pipeline.
Retains energy-based barge-in gating only during playback to filter speaker bleed-through.
Lowers PLAYBACK_THRESHOLD_MULTIPLIER from 1.8 to 1.3 — the old value was too aggressive and swallowed real user speech. New value filters speaker bleed (0.005-0.03 RMS) while passing direct speech (0.05-0.3 RMS).
Reduces BARGE_IN_CONSECUTIVE_FRAMES from 3 to 2 — with the smaller 512-sample chunks, 2 frames ≈ 20ms of sustained speech is sufficient confirmation.
Fixes the pre-buffer mute window comment to reflect actual wall-clock timing (100-500ms with network jitter, not a fixed 80ms).

Session management (aura-gemini/src/session.rs)

Adds proactive session rotation every 10 minutes (SESSION_ROTATION_SECS = 600). Long-lived Gemini audio sessions exhibit gradual latency degradation — rotating the WebSocket with session resumption resets server-side state.
Introduces DisconnectReason enum (Shutdown, GoAway, Rotate) replacing the previous bool return for cleaner control flow.
GoAway clears the resumption handle (server is done with the session). Rotate keeps the handle for seamless resume. Shutdown is a clean exit.
Reorders the WebSocket select loop to prioritize audio over video: all pending audio chunks are flushed before sending a video frame. A 300KB screenshot can stall the TCP socket for 50-300ms — flushing audio first prevents mic data gaps.
Adds stale video frame draining: when frames queue up (e.g., WebSocket busy), only the most recent frame is sent. Old frames waste bandwidth and give Gemini outdated screen context.

Screen capture (aura-daemon/src/screen_capture.rs)

Reduces idle threshold from 10 to 5 unchanged captures (2.5s instead of 5s before switching to slow polling) — faster responsiveness to screen changes.
Reduces slow polling interval from 2000ms to 1500ms (~0.67 FPS idle instead of 0.5 FPS).
Adds frame counter and makes censorship detection periodic (first 3 frames then every 20th) instead of every frame — the base64→JPEG→pixel decode was expensive and unnecessary on every capture.

System prompt (aura-gemini/src/config.rs)

Updates idle capture rate in system prompt from 0.5 FPS to 0.7 FPS to match new interval.
Adds browser-first tool strategy: run_javascript is now the preferred tool for all browser interactions (Safari/Chrome) — faster and more reliable than coordinate-based clicking. Falls back to click(x, y) only for browser chrome or when JS fails.
Adds detailed guidance for web forms, login prompts, and cross-app workflows.

Documentation (README.md, ARCHITECTURE.md)

README rewrite for hackathon submission:
- Adds persona/voice description (Kore voice, conversational style, barge-in behavior).
- Adds "Built with" table showing all 7 Google technologies used.
- Adds Native tools (20) section — full categorized list of all tools Gemini can call.
- Adds Core pipeline section — numbered step-by-step loop showing audio/vision/tool/barge-in flow with latency budget.
- Adds Deploy to Google Cloud section — one-command IaC deployment, services table, CI/CD description.
- Adds Grounding & accuracy section — three-layer anti-hallucination strategy (screen verification, accessibility cross-check, Google Search grounding).
- Adds "Ground" capability to the features table.
Moves architecture deep-dive to root ARCHITECTURE.md (was docs/ARCHITECTURE.md).
Removes all stale docs/plans/ — 26 design docs and plan files from prior development iterations that are no longer relevant.

`d869e61` — docs: add onboarding flow, keychain auth, and permission details to README

Rewrites the Get started section with a detailed first-launch onboarding flow:
- Step 1: API key entry in the app UI (no manual config file editing), background device registration with Keychain token storage.
- Step 2: macOS permissions walkthrough — explains that Microphone is a standard prompt, but Screen Recording and Accessibility require manual toggle in System Settings > Privacy & Security. App polls every 2s and auto-advances when granted.
- Step 3: Daemon launches, connects to Gemini, green dot appears.
Adds Permissions persist across rebuilds section — explains ad-hoc code signing with stable designated requirements (identifier "com.aura.desktop") so TCC grants survive rebuilds without re-prompting.
Adds Keychain & device auth section — explains device token storage in macOS login Keychain via security CLI (not Security.framework) to avoid per-app ACL prompts between the SwiftUI app and Rust daemon, with background retry on registration failure.
Clarifies dev.sh vs bundle.sh in Build from source: dev.sh calls bundle.sh internally then installs and launches; bundle.sh builds without installing. Adds --dmg flag for distribution builds.

Test plan

Build and launch with bash scripts/dev.sh
Verify barge-in works in quiet and noisy environments — user speech interrupts playback, speaker echo does not
Verify session rotation occurs after ~10 minutes (check logs for "Session rotation timer fired")
Verify video frames are not queued (check logs for "Skipped stale video frames" under load)
Verify Keychain read/write works without ACL prompts on fresh install
Verify browser tool strategy: ask Aura to click a link in Safari, confirm it uses run_javascript first
Confirm README renders correctly on GitHub
First launch onboarding: verify API key entry, permission granting, and daemon auto-launch
Rebuild with dev.sh and confirm permissions are NOT re-prompted (stable DR)

…arge-in

Audio stability: ambient-calibrated VAD thresholds, barge-in reverb guard, playback drain detection, session resumption poison counter. Docs: rewrite README with tools list, core pipeline, GCP deployment section, grounding strategy, and persona description. Consolidate architecture into root ARCHITECTURE.md. Remove stale docs/plans.

…EADME

The .titled style mask created an invisible title bar with safe area insets that pushed content down, clipping the text input at the bottom. Since canBecomeKey is overridden and isMovableByWindowBackground handles dragging, .titled is unnecessary.

- Mouse: "precise" → "coordinate accuracy depends on Gemini's vision" - Pipeline: fix chunk size (512 samples ~10ms, not 100ms), remove client-side VAD reference (removed in this branch), add actual barge-in params (1.3x threshold, 2 frames, 300ms reverb guard) - Safety: rewrite entirely — do shell script blocked wholesale (not pattern-filtered), metacharacter list from actual code, timeouts are 60s for all scripts (not 120s for AppleScript), destructive action guardrail is prompt-level not code-enforced - Grounding: screen feedback is continuous 2 FPS (not per-action), accessibility data provided to Gemini (not cross-referenced in code) - Tools: add missing key_state to system category - Terminal blocklist: list all 9 actual blocked apps from code

Without this, the proxy falls back to an in-memory device store which loses all registered device tokens when the instance scales to zero. With GCP_PROJECT_ID set, it uses Firestore-backed storage and tokens persist across restarts.

Newer ADK versions require sessions to exist before run_async. InMemoryRunner was throwing SessionNotFoundError because it passed a random session_id that didn't exist. Switch to explicit Runner with InMemorySessionService and pre-create sessions before each agent invocation.

The daemon was exiting immediately after session.disconnect(), killing the processor before it could run end-of-session consolidation and memory agent ingest. Now waits up to 15s for the processor to finish, ensuring facts are extracted and synced to the cloud before the process exits.

Three fixes for end-of-session memory agent ingest never firing: 1. Add SIGTERM signal handler so the daemon catches terminate signals from the SwiftUI parent app (previously only caught SIGINT/Ctrl+C). 2. Call session.disconnect() BEFORE cancel.cancel() — the processor's select loop has a cancel.cancelled() branch that was firing first, causing it to skip the Disconnected event handler where consolidation and memory agent ingest run. 3. SwiftUI app waits up to 12s after SIGTERM for daemon to finish graceful shutdown before falling back to SIGINT. Also tunes barge-in sensitivity: 1.3x→1.5x threshold multiplier, 2→3 consecutive frames — fixes false interruptions on MacBook speakers.

The memory_agent_handled flag was skipping Firestore sync under the assumption the memory agent wrote facts server-side. It doesn't — it only returns them. Now facts always sync to Firestore when config is present, regardless of whether the memory agent or local consolidation extracted them. SQLite saves still happen first as the local fallback.

abdul-abdi added 13 commits March 15, 2026 10:32

fix: use security CLI for keychain (no ACL prompts), harden VAD for b…

3acee7b

…arge-in

docs: add onboarding flow, keychain auth, and permission details to R…

d869e61

…EADME

docs: explain cloud vs local-only mode and how to connect cloud services

74ca1d7

chore: fix formatting (cargo fmt + ruff)

ead2779

fix: remove unused memory_agent_handled variable (clippy)

9c763f9

abdul-abdi merged commit 51d5e14 into main Mar 15, 2026
14 checks passed

abdul-abdi deleted the fix/audio-stability branch March 15, 2026 12:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: harden audio pipeline, improve barge-in, rewrite docs#15

fix: harden audio pipeline, improve barge-in, rewrite docs#15
abdul-abdi merged 13 commits intomainfrom
fix/audio-stability

abdul-abdi commented Mar 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

abdul-abdi commented Mar 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

3acee7b — fix: use security CLI for keychain (no ACL prompts), harden VAD for barge-in

42e968e — fix: harden audio pipeline, rewrite docs for hackathon submission

Audio capture (aura-voice/src/audio.rs)

Playback (aura-voice/src/playback.rs)

Barge-in & mic pipeline (aura-daemon/src/orchestrator.rs)

Session management (aura-gemini/src/session.rs)

Screen capture (aura-daemon/src/screen_capture.rs)

System prompt (aura-gemini/src/config.rs)

Documentation (README.md, ARCHITECTURE.md)

d869e61 — docs: add onboarding flow, keychain auth, and permission details to README

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

abdul-abdi commented Mar 15, 2026 •

edited

Loading

`3acee7b` — fix: use security CLI for keychain (no ACL prompts), harden VAD for barge-in

`42e968e` — fix: harden audio pipeline, rewrite docs for hackathon submission

`d869e61` — docs: add onboarding flow, keychain auth, and permission details to README