Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -1,8 +1,13 @@
worker/node_modules/
worker/.dev.vars
worker/.wrangler/
.DS_Store
*.xcuserstate
*.xcuserdatad
xcuserdata/
build/
releases/
.claude/
.sisyphus/
.idea/
coding-plans/
43 changes: 20 additions & 23 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@

## Overview

macOS menu bar companion app. Lives entirely in the macOS status bar (no dock icon, no main window). Clicking the menu bar icon opens a custom floating panel with companion voice controls. Uses push-to-talk (ctrl+option) to capture voice input, transcribes it via AssemblyAI streaming, and sends the transcript + a screenshot of the user's screen to Claude. Claude responds with text (streamed via SSE) and voice (ElevenLabs TTS). A blue cursor overlay can fly to and point at UI elements Claude references on any connected monitor.
macOS menu bar companion app. Lives entirely in the macOS status bar (no dock icon, no main window). Clicking the menu bar icon opens a custom floating panel with companion voice and text controls. Uses push-to-talk (ctrl+option) to capture voice input, transcribes it via OpenAI audio transcription, and sends the transcript + a screenshot of the user's screen to OpenAI. Users can also open a typed prompt with Command+Shift+Return; typed messages use the same screenshot → OpenAI → TTS → pointing pipeline as voice. OpenAI responds with text (streamed via SSE) and voice (OpenAI TTS). A blue cursor overlay can fly to and point at UI elements the AI references on any connected monitor.

All API keys live on a Cloudflare Worker proxy — nothing sensitive ships in the app.

Expand All @@ -14,12 +14,13 @@ All API keys live on a Cloudflare Worker proxy — nothing sensitive ships in th
- **App Type**: Menu bar-only (`LSUIElement=true`), no dock icon or main window
- **Framework**: SwiftUI (macOS native) with AppKit bridging for menu bar panel and cursor overlay
- **Pattern**: MVVM with `@StateObject` / `@Published` state management
- **AI Chat**: Claude (Sonnet 4.6 default, Opus 4.6 optional) via Cloudflare Worker proxy with SSE streaming
- **Speech-to-Text**: AssemblyAI real-time streaming (`u3-rt-pro` model) via websocket, with OpenAI and Apple Speech as fallbacks
- **Text-to-Speech**: ElevenLabs (`eleven_flash_v2_5` model) via Cloudflare Worker proxy
- **AI Chat**: OpenAI (GPT-4o default, GPT-4o mini optional) via Cloudflare Worker proxy with SSE streaming
- **Speech-to-Text**: OpenAI audio transcription via Cloudflare Worker proxy, with Apple Speech as a local fallback
- **Text-to-Speech**: OpenAI speech API via Cloudflare Worker proxy
- **Screen Capture**: ScreenCaptureKit (macOS 14.2+), multi-monitor support
- **Voice Input**: Push-to-talk via `AVAudioEngine` + pluggable transcription-provider layer. System-wide keyboard shortcut via listen-only CGEvent tap.
- **Element Pointing**: Claude embeds `[POINT:x,y:label:screenN]` tags in responses. The overlay parses these, maps coordinates to the correct monitor, and animates the blue cursor along a bezier arc to the target.
- **Text Input**: Command+Shift+Return opens a floating typed prompt. Submitted text bypasses transcription and enters the same screenshot + OpenAI + TTS + pointing response pipeline as voice transcripts.
- **Element Pointing**: The AI embeds `[POINT:x,y:label:screenN]` tags in responses. The overlay parses these, maps coordinates to the correct monitor, and animates the blue cursor along a bezier arc to the target.
- **Concurrency**: `@MainActor` isolation, async/await throughout
- **Analytics**: PostHog via `ClickyAnalytics.swift`

Expand All @@ -29,12 +30,11 @@ The app never calls external APIs directly. All requests go through a Cloudflare

| Route | Upstream | Purpose |
|-------|----------|---------|
| `POST /chat` | `api.anthropic.com/v1/messages` | Claude vision + streaming chat |
| `POST /tts` | `api.elevenlabs.io/v1/text-to-speech/{voiceId}` | ElevenLabs TTS audio |
| `POST /transcribe-token` | `streaming.assemblyai.com/v3/token` | Fetches a short-lived (480s) AssemblyAI websocket token |
| `POST /chat` | `api.openai.com/v1/chat/completions` | OpenAI vision + streaming chat |
| `POST /tts` | `api.openai.com/v1/audio/speech` | OpenAI TTS audio |
| `POST /transcribe` | `api.openai.com/v1/audio/transcriptions` | OpenAI audio transcription |

Worker secrets: `ANTHROPIC_API_KEY`, `ASSEMBLYAI_API_KEY`, `ELEVENLABS_API_KEY`
Worker vars: `ELEVENLABS_VOICE_ID`
Worker secrets: `OPENAI_API_KEY`

### Key Architecture Decisions

Expand All @@ -44,7 +44,7 @@ Worker vars: `ELEVENLABS_VOICE_ID`

**Global Push-To-Talk Shortcut**: Background push-to-talk uses a listen-only `CGEvent` tap instead of an AppKit global monitor so modifier-based shortcuts like `ctrl + option` are detected more reliably while the app is running in the background.

**Shared URLSession for AssemblyAI**: A single long-lived `URLSession` is shared across all AssemblyAI streaming sessions (owned by the provider, not the session). Creating and invalidating a URLSession per session corrupts the OS connection pool and causes "Socket is not connected" errors after a few rapid reconnections.
**Global Text Prompt Shortcut**: Background typed input uses a separate listen-only `CGEvent` tap for Command+Shift+Return. This avoids sharing `ctrl + option`, which starts voice recording as soon as those modifiers are pressed.

**Transient Cursor Mode**: When "Show Clicky" is off, pressing the hotkey fades in the cursor overlay for the duration of the interaction (recording → response → TTS → optional pointing), then fades it out automatically after 1 second of inactivity.

Expand All @@ -53,28 +53,27 @@ Worker vars: `ELEVENLABS_VOICE_ID`
| File | Lines | Purpose |
|------|-------|---------|
| `leanring_buddyApp.swift` | ~89 | Menu bar app entry point. Uses `@NSApplicationDelegateAdaptor` with `CompanionAppDelegate` which creates `MenuBarPanelManager` and starts `CompanionManager`. No main window — the app lives entirely in the status bar. |
| `CompanionManager.swift` | ~1026 | Central state machine. Owns dictation, shortcut monitoring, screen capture, Claude API, ElevenLabs TTS, and overlay management. Tracks voice state (idle/listening/processing/responding), conversation history, model selection, and cursor visibility. Coordinates the full push-to-talk → screenshot → Claude → TTS → pointing pipeline. |
| `CompanionManager.swift` | ~1076 | Central state machine. Owns dictation, shortcut monitoring, text prompt routing, screen capture, OpenAI API, OpenAI TTS, and overlay management. Tracks voice state (idle/listening/processing/responding), conversation history, model selection, and cursor visibility. Coordinates the full voice/text → screenshot → OpenAI → TTS → pointing pipeline. |
| `MenuBarPanelManager.swift` | ~243 | NSStatusItem + custom NSPanel lifecycle. Creates the menu bar icon, manages the floating companion panel (show/hide/position), installs click-outside-to-dismiss monitor. |
| `CompanionPanelView.swift` | ~761 | SwiftUI panel content for the menu bar dropdown. Shows companion status, push-to-talk instructions, model picker (Sonnet/Opus), permissions UI, DM feedback button, and quit button. Dark aesthetic using `DS` design system. |
| `CompanionPanelView.swift` | ~811 | SwiftUI panel content for the menu bar dropdown. Shows companion status, push-to-talk and typed-input instructions, model picker, permissions UI, DM feedback button, and quit button. Dark aesthetic using `DS` design system. |
| `OverlayWindow.swift` | ~881 | Full-screen transparent overlay hosting the blue cursor, response text, waveform, and spinner. Handles cursor animation, element pointing with bezier arcs, multi-monitor coordinate mapping, and fade-out transitions. |
| `TextPromptWindowManager.swift` | ~233 | Floating text prompt `NSPanel` and SwiftUI input view. Lets users type messages and submit them into the shared companion response pipeline. |
| `CompanionResponseOverlay.swift` | ~217 | SwiftUI view for the response text bubble and waveform displayed next to the cursor in the overlay. |
| `CompanionScreenCaptureUtility.swift` | ~132 | Multi-monitor screenshot capture using ScreenCaptureKit. Returns labeled image data for each connected display. |
| `BuddyDictationManager.swift` | ~866 | Push-to-talk voice pipeline. Handles microphone capture via `AVAudioEngine`, provider-aware permission checks, keyboard/button dictation sessions, transcript finalization, shortcut parsing, contextual keyterms, and live audio-level reporting for waveform feedback. |
| `BuddyTranscriptionProvider.swift` | ~100 | Protocol surface and provider factory for voice transcription backends. Resolves provider based on `VoiceTranscriptionProvider` in Info.plist — AssemblyAI, OpenAI, or Apple Speech. |
| `AssemblyAIStreamingTranscriptionProvider.swift` | ~478 | Streaming transcription provider. Fetches temp tokens from the Cloudflare Worker, opens an AssemblyAI v3 websocket, streams PCM16 audio, tracks turn-based transcripts, and delivers finalized text on key-up. Shares a single URLSession across all sessions. |
| `BuddyTranscriptionProvider.swift` | ~73 | Protocol surface and provider factory for voice transcription backends. Resolves provider based on `VoiceTranscriptionProvider` in Info.plist — OpenAI or Apple Speech. |
| `OpenAIAudioTranscriptionProvider.swift` | ~317 | Upload-based transcription provider. Buffers push-to-talk audio locally, uploads as WAV on release, returns finalized transcript. |
| `AppleSpeechTranscriptionProvider.swift` | ~147 | Local fallback transcription provider backed by Apple's Speech framework. |
| `BuddyAudioConversionSupport.swift` | ~108 | Audio conversion helpers. Converts live mic buffers to PCM16 mono audio and builds WAV payloads for upload-based providers. |
| `GlobalPushToTalkShortcutMonitor.swift` | ~132 | System-wide push-to-talk monitor. Owns the listen-only `CGEvent` tap and publishes press/release transitions. |
| `ClaudeAPI.swift` | ~291 | Claude vision API client with streaming (SSE) and non-streaming modes. TLS warmup optimization, image MIME detection, conversation history support. |
| `OpenAIAPI.swift` | ~142 | OpenAI GPT vision API client. |
| `ElevenLabsTTSClient.swift` | ~81 | ElevenLabs TTS client. Sends text to the Worker proxy, plays back audio via `AVAudioPlayer`. Exposes `isPlaying` for transient cursor scheduling. |
| `ElementLocationDetector.swift` | ~335 | Detects UI element locations in screenshots for cursor pointing. |
| `GlobalTextPromptShortcutMonitor.swift` | ~132 | System-wide typed prompt monitor. Owns the listen-only `CGEvent` tap for Command+Shift+Return and publishes prompt-open events. |
| `OpenAIAPI.swift` | ~253 | OpenAI GPT vision API client with streaming (SSE) and non-streaming modes. |
| `OpenAITTSClient.swift` | ~66 | OpenAI TTS client. Sends text to the Worker proxy, plays back audio via `AVAudioPlayer`. Exposes `isPlaying` for transient cursor scheduling. |
| `DesignSystem.swift` | ~880 | Design system tokens — colors, corner radii, shared styles. All UI references `DS.Colors`, `DS.CornerRadius`, etc. |
| `ClickyAnalytics.swift` | ~121 | PostHog analytics integration for usage tracking. |
| `WindowPositionManager.swift` | ~262 | Window placement logic, Screen Recording permission flow, and accessibility permission helpers. |
| `AppBundleConfiguration.swift` | ~28 | Runtime configuration reader for keys stored in the app bundle Info.plist. |
| `worker/src/index.ts` | ~142 | Cloudflare Worker proxy. Three routes: `/chat` (Claude), `/tts` (ElevenLabs), `/transcribe-token` (AssemblyAI temp token). |
| `worker/src/index.ts` | ~146 | Cloudflare Worker proxy. Routes: `/chat`, `/tts`, `/transcribe`, and `/health`, all using OpenAI except `/health`. |

## Build & Run

Expand All @@ -97,9 +96,7 @@ cd worker
npm install

# Add secrets
npx wrangler secret put ANTHROPIC_API_KEY
npx wrangler secret put ASSEMBLYAI_API_KEY
npx wrangler secret put ELEVENLABS_API_KEY
npx wrangler secret put OPENAI_API_KEY

# Deploy
npx wrangler deploy
Expand Down
12 changes: 6 additions & 6 deletions leanring-buddy.xcodeproj/project.pbxproj
Original file line number Diff line number Diff line change
Expand Up @@ -411,7 +411,7 @@
CODE_SIGN_STYLE = Automatic;
COMBINE_HIDPI_IMAGES = YES;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 2UDAY4J48G;
DEVELOPMENT_TEAM = 222KQDN363;
ENABLE_APP_SANDBOX = NO;
ENABLE_HARDENED_RUNTIME = YES;
ENABLE_OUTGOING_NETWORK_CONNECTIONS = YES;
Expand Down Expand Up @@ -449,7 +449,7 @@
CODE_SIGN_STYLE = Automatic;
COMBINE_HIDPI_IMAGES = YES;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 2UDAY4J48G;
DEVELOPMENT_TEAM = 222KQDN363;
ENABLE_APP_SANDBOX = NO;
ENABLE_HARDENED_RUNTIME = YES;
ENABLE_OUTGOING_NETWORK_CONNECTIONS = YES;
Expand Down Expand Up @@ -484,7 +484,7 @@
BUNDLE_LOADER = "$(TEST_HOST)";
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = 6B89JTKXCY;
GENERATE_INFOPLIST_FILE = YES;
MACOSX_DEPLOYMENT_TARGET = 14.2;
MARKETING_VERSION = 1.0;
Expand All @@ -505,7 +505,7 @@
BUNDLE_LOADER = "$(TEST_HOST)";
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = 6B89JTKXCY;
GENERATE_INFOPLIST_FILE = YES;
MACOSX_DEPLOYMENT_TARGET = 14.2;
MARKETING_VERSION = 1.0;
Expand All @@ -525,7 +525,7 @@
buildSettings = {
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = 6B89JTKXCY;
GENERATE_INFOPLIST_FILE = YES;
MARKETING_VERSION = 1.0;
PRODUCT_BUNDLE_IDENTIFIER = "com.yourcompany.leanring-buddyUITests";
Expand All @@ -544,7 +544,7 @@
buildSettings = {
CODE_SIGN_STYLE = Automatic;
CURRENT_PROJECT_VERSION = 1;
DEVELOPMENT_TEAM = 6D7X9GGZAW;
DEVELOPMENT_TEAM = 6B89JTKXCY;
GENERATE_INFOPLIST_FILE = YES;
MARKETING_VERSION = 1.0;
PRODUCT_BUNDLE_IDENTIFIER = "com.yourcompany.leanring-buddyUITests";
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>SchemeUserState</key>
<dict>
<key>leanring-buddy.xcscheme_^#shared#^_</key>
<dict>
<key>orderHint</key>
<integer>0</integer>
</dict>
</dict>
</dict>
</plist>
Loading