45 changes: 20 additions & 25 deletions AGENTS.md
@@ -14,27 +14,24 @@ All API keys live on a Cloudflare Worker proxy — nothing sensitive ships in th
- **App Type**: Menu bar-only (`LSUIElement=true`), no dock icon or main window
- **Framework**: SwiftUI (macOS native) with AppKit bridging for menu bar panel and cursor overlay
- **Pattern**: MVVM with `@StateObject` / `@Published` state management
- **AI Chat**: Claude (Sonnet 4.6 default, Opus 4.6 optional) via Cloudflare Worker proxy with SSE streaming
- **Speech-to-Text**: AssemblyAI real-time streaming (`u3-rt-pro` model) via websocket, with OpenAI and Apple Speech as fallbacks
- **Text-to-Speech**: ElevenLabs (`eleven_flash_v2_5` model) via Cloudflare Worker proxy
- **AI Chat**: Claude (Sonnet 4.6 default) via local Node.js server using the Claude Agent SDK, which spawns the Claude Code CLI. Auth inherited from local `claude` session — no API key needed.
- **Speech-to-Text**: Apple Speech framework (on-device, free). AssemblyAI and OpenAI still available as opt-in alternatives via Info.plist config.
- **Text-to-Speech**: Apple `NSSpeechSynthesizer` using the system voice (on-device, free), with `AVSpeechSynthesizer` as a fallback. No network or API keys required.
- **Screen Capture**: ScreenCaptureKit (macOS 14.2+), multi-monitor support
- **Voice Input**: Push-to-talk via `AVAudioEngine` + pluggable transcription-provider layer. System-wide keyboard shortcut via listen-only CGEvent tap.
- **Element Pointing**: Claude embeds `[POINT:x,y:label:screenN]` tags in responses. The overlay parses these, maps coordinates to the correct monitor, and animates the blue cursor along a Bézier arc to the target (see the sketch after this list).
- **Concurrency**: `@MainActor` isolation, async/await throughout
- **Analytics**: PostHog via `ClickyAnalytics.swift`
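
A minimal sketch of the tag-parsing step. The regex and the `PointTarget` type are illustrative assumptions; the overlay's actual implementation may differ (for example, in how it handles non-integer coordinates):

```swift
import Foundation

/// Hypothetical parsed form of a [POINT:x,y:label:screenN] tag.
struct PointTarget {
    let x: Double
    let y: Double
    let label: String
    let screen: Int
}

/// Extracts [POINT:x,y:label:screenN] tags from a Claude response.
/// Assumes integer coordinates and labels without ':' or ']'.
func parsePointTags(in response: String) -> [PointTarget] {
    let pattern = #/\[POINT:(\d+),(\d+):([^:\]]+):screen(\d+)\]/#
    return response.matches(of: pattern).compactMap { match in
        guard let x = Double(match.output.1),
              let y = Double(match.output.2),
              let screen = Int(match.output.4) else { return nil }
        return PointTarget(x: x, y: y, label: String(match.output.3), screen: screen)
    }
}
```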

### API Proxy (Cloudflare Worker)
### Local Server (Agent SDK)

The app never calls external APIs directly. All requests go through a Cloudflare Worker (`worker/src/index.ts`) that holds the real API keys as secrets.
The app sends chat requests to a local Node.js server (`local-server/src/server.ts`) that uses the `@anthropic-ai/claude-agent-sdk` to proxy requests through the locally installed Claude Code CLI. No API keys are needed — auth is inherited from the user's `claude` session.

| Route | Upstream | Purpose |
|-------|----------|---------|
| `POST /chat` | `api.anthropic.com/v1/messages` | Claude vision + streaming chat |
| `POST /tts` | `api.elevenlabs.io/v1/text-to-speech/{voiceId}` | ElevenLabs TTS audio |
| `POST /transcribe-token` | `streaming.assemblyai.com/v3/token` | Fetches a short-lived (480s) AssemblyAI websocket token |
| Route | Backend | Purpose |
|-------|---------|---------|
| `POST /chat` | Claude Agent SDK → Claude Code CLI | Claude vision + streaming chat |

Worker secrets: `ANTHROPIC_API_KEY`, `ASSEMBLYAI_API_KEY`, `ELEVENLABS_API_KEY`
Worker vars: `ELEVENLABS_VOICE_ID`
TTS and STT are handled natively by the app (no server routes needed).
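
For reference, a rough Swift sketch of how a client might consume the `/chat` stream. The request body and per-chunk JSON shape here are assumptions for illustration, not the server's documented contract:

```swift
import Foundation

/// Minimal sketch: POST a prompt to the local server and print streamed text.
/// Assumes the server emits `data: <json>` SSE lines with a `text` field.
func streamChat(prompt: String) async throws {
    var request = URLRequest(url: URL(string: "http://localhost:3456/chat")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: ["prompt": prompt])

    let (bytes, _) = try await URLSession.shared.bytes(for: request)
    for try await line in bytes.lines {
        guard line.hasPrefix("data: ") else { continue }
        let payload = Data(line.dropFirst(6).utf8)
        if let chunk = try? JSONSerialization.jsonObject(with: payload) as? [String: Any],
           let text = chunk["text"] as? String {
            print(text, terminator: "")
        }
    }
}
```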

### Key Architecture Decisions

@@ -68,13 +65,14 @@ Worker vars: `ELEVENLABS_VOICE_ID`
| `GlobalPushToTalkShortcutMonitor.swift` | ~132 | System-wide push-to-talk monitor. Owns the listen-only `CGEvent` tap and publishes press/release transitions. |
| `ClaudeAPI.swift` | ~291 | Claude vision API client with streaming (SSE) and non-streaming modes. TLS warmup optimization and image MIME detection. |
| `OpenAIAPI.swift` | ~142 | OpenAI GPT vision API client. |
| `ElevenLabsTTSClient.swift` | ~81 | ElevenLabs TTS client. Sends text to the Worker proxy, plays back audio via `AVAudioPlayer`. Exposes `isPlaying` for transient cursor scheduling. |
| `AppleTTSClient.swift` | ~115 | Apple TTS client using `NSSpeechSynthesizer` (system voice) with an `AVSpeechSynthesizer` fallback. Free, local text-to-speech. Exposes `isPlaying` for transient cursor scheduling. |
| `ElevenLabsTTSClient.swift` | ~81 | Legacy ElevenLabs TTS client (unused — replaced by `AppleTTSClient`). |
| `ElementLocationDetector.swift` | ~335 | Detects UI element locations in screenshots for cursor pointing. |
| `DesignSystem.swift` | ~880 | Design system tokens — colors, corner radii, shared styles. All UI references `DS.Colors`, `DS.CornerRadius`, etc. |
| `ClickyAnalytics.swift` | ~121 | PostHog analytics integration for usage tracking. |
| `WindowPositionManager.swift` | ~262 | Window placement logic, Screen Recording permission flow, and accessibility permission helpers. |
| `AppBundleConfiguration.swift` | ~28 | Runtime configuration reader for keys stored in the app bundle Info.plist. |
| `worker/src/index.ts` | ~142 | Cloudflare Worker proxy. Three routes: `/chat` (Claude), `/tts` (ElevenLabs), `/transcribe-token` (AssemblyAI temp token). |
| `local-server/src/server.ts` | ~230 | Local Node.js server using Claude Agent SDK. Single route: `/chat` streams Claude responses as SSE. Auth inherited from local Claude Code session. |

## Build & Run

@@ -90,22 +88,19 @@ open leanring-buddy.xcodeproj

**Do NOT run `xcodebuild` from the terminal** — it invalidates TCC (Transparency, Consent, and Control) permissions and the app will need to re-request screen recording, accessibility, etc.

## Cloudflare Worker
## Local Server

The local server requires Claude Code to be installed and authenticated (`claude auth login`).

```bash
cd worker
cd local-server
npm install

# Add secrets
npx wrangler secret put ANTHROPIC_API_KEY
npx wrangler secret put ASSEMBLYAI_API_KEY
npx wrangler secret put ELEVENLABS_API_KEY

# Deploy
npx wrangler deploy
# Start the server (listens on http://localhost:3456)
npm start

# Local dev (create worker/.dev.vars with your keys)
npx wrangler dev
# Or with auto-reload for development
npm run dev
```

## Code Style & Conventions
4 changes: 3 additions & 1 deletion leanring-buddy/AppleSpeechTranscriptionProvider.swift
@@ -48,7 +48,9 @@ final class AppleSpeechTranscriptionProvider: BuddyTranscriptionProvider {
]

for preferredLocale in preferredLocales {
if let speechRecognizer = SFSpeechRecognizer(locale: preferredLocale) {
if let speechRecognizer = SFSpeechRecognizer(locale: preferredLocale),
speechRecognizer.isAvailable {
print("🎙️ Apple Speech: using locale \(preferredLocale.identifier)")
return speechRecognizer
}
}
106 changes: 106 additions & 0 deletions leanring-buddy/AppleTTSClient.swift
@@ -0,0 +1,106 @@
//
// AppleTTSClient.swift
// leanring-buddy
//
// Text-to-speech using macOS NSSpeechSynthesizer, which can use the
// system voice set in Accessibility settings — including Siri voices.
// Falls back to AVSpeechSynthesizer if NSSpeechSynthesizer fails.
//

import AppKit
import AVFoundation

@MainActor
final class AppleTTSClient: NSObject {
/// Primary synthesizer — uses the system default voice (including Siri
/// voices configured in System Settings → Accessibility → Spoken Content).
private var systemSynthesizer: NSSpeechSynthesizer?

/// Fallback synthesizer using AVSpeechSynthesizer, in case
/// NSSpeechSynthesizer fails for any reason.
private let fallbackSynthesizer = AVSpeechSynthesizer()

/// Tracks whether speech is currently playing.
private(set) var isSpeaking: Bool = false

/// Delegate bridge that forwards speech-finished callbacks from both
/// synthesizers back to this client. Stored separately to avoid @MainActor
/// isolation conflicts with the delegate protocols.
private var delegateBridge: SpeechDelegateBridge?

override init() {
super.init()

let synthesizer = NSSpeechSynthesizer()
self.systemSynthesizer = synthesizer

let bridge = SpeechDelegateBridge { [weak self] in
Task { @MainActor [weak self] in
self?.isSpeaking = false
}
}
self.delegateBridge = bridge
synthesizer.delegate = bridge
fallbackSynthesizer.delegate = bridge

let defaultVoice = NSSpeechSynthesizer.defaultVoice
let attributes = NSSpeechSynthesizer.attributes(forVoice: defaultVoice)
let voiceName = attributes[NSSpeechSynthesizer.VoiceAttributeKey.name] as? String ?? "unknown"
print("🔊 Apple TTS: using system voice \"\(voiceName)\" (NSSpeechSynthesizer)")
}

/// Speaks the given text aloud using the system voice. Returns
/// immediately after speech starts (NSSpeechSynthesizer.startSpeaking
/// is non-blocking). Falls back to AVSpeechSynthesizer on failure.
func speakText(_ text: String) async throws {
stopPlayback()
isSpeaking = true

if let systemSynthesizer, systemSynthesizer.startSpeaking(text) {
print("🔊 Apple TTS: speaking \(text.count) characters via system voice")
} else {
print("⚠️ NSSpeechSynthesizer failed, falling back to AVSpeechSynthesizer")
speakWithFallback(text)
}
}

/// Whether TTS audio is currently playing back.
var isPlaying: Bool {
isSpeaking
}

/// Stops any in-progress speech immediately.
func stopPlayback() {
systemSynthesizer?.stopSpeaking()
fallbackSynthesizer.stopSpeaking(at: .immediate)
isSpeaking = false
}

// MARK: - Fallback

private func speakWithFallback(_ text: String) {
let utterance = AVSpeechUtterance(string: text)
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
utterance.rate = AVSpeechUtteranceDefaultSpeechRate
fallbackSynthesizer.speak(utterance)
print("🔊 Fallback TTS: speaking \(text.count) characters")
}
}

// MARK: - Delegate Bridge

/// Separate class to handle the synthesizer delegate callbacks without
/// @MainActor isolation conflicts. Calls back via a closure when speech
/// finishes on either synthesizer.
private final class SpeechDelegateBridge: NSObject, NSSpeechSynthesizerDelegate, AVSpeechSynthesizerDelegate {
private let onFinished: () -> Void

init(onFinished: @escaping () -> Void) {
self.onFinished = onFinished
}

func speechSynthesizer(
_ sender: NSSpeechSynthesizer,
didFinishSpeaking finishedSpeaking: Bool
) {
onFinished()
}

func speechSynthesizer(
_ synthesizer: AVSpeechSynthesizer,
didFinish utterance: AVSpeechUtterance
) {
onFinished()
}
}
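
A typical call site might look like the following. This is an illustrative sketch only; the names (`SpeechExample`, `announce`) are hypothetical and the app's real wiring may differ:

```swift
import AVFoundation

// Keep one long-lived client; speech stops if it is deallocated mid-utterance.
@MainActor
final class SpeechExample {
    private let tts = AppleTTSClient()

    func announce(_ text: String) {
        Task {
            // speakText returns once speech has started, not when it ends.
            try await tts.speakText(text)
        }
    }

    func interrupt() {
        tts.stopPlayback()
    }
}
```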
9 changes: 1 addition & 8 deletions leanring-buddy/BuddyTranscriptionProvider.swift
@@ -87,14 +87,7 @@ enum BuddyTranscriptionProviderFactory {
return AppleSpeechTranscriptionProvider()
}

if assemblyAIProvider.isConfigured {
return assemblyAIProvider
}

if openAIProvider.isConfigured {
return openAIProvider
}

// Default to Apple Speech (free, local, no API keys needed)
return AppleSpeechTranscriptionProvider()
}
}
11 changes: 0 additions & 11 deletions leanring-buddy/ClaudeAPI.swift
@@ -101,7 +101,6 @@ class ClaudeAPI {
func analyzeImageStreaming(
images: [(data: Data, label: String)],
systemPrompt: String,
conversationHistory: [(userPlaceholder: String, assistantResponse: String)] = [],
userPrompt: String,
onTextChunk: @MainActor @Sendable (String) -> Void
) async throws -> (text: String, duration: TimeInterval) {
@@ -112,11 +111,6 @@
// Build messages array
var messages: [[String: Any]] = []

for (userPlaceholder, assistantResponse) in conversationHistory {
messages.append(["role": "user", "content": userPlaceholder])
messages.append(["role": "assistant", "content": assistantResponse])
}

// Build current message with all labeled images + prompt
var contentBlocks: [[String: Any]] = []
for image in images {
@@ -215,18 +209,13 @@
func analyzeImage(
images: [(data: Data, label: String)],
systemPrompt: String,
conversationHistory: [(userPlaceholder: String, assistantResponse: String)] = [],
userPrompt: String
) async throws -> (text: String, duration: TimeInterval) {
let startTime = Date()

var request = makeAPIRequest()

var messages: [[String: Any]] = []
for (userPlaceholder, assistantResponse) in conversationHistory {
messages.append(["role": "user", "content": userPlaceholder])
messages.append(["role": "assistant", "content": assistantResponse])
}

// Build current message with all labeled images + prompt
var contentBlocks: [[String: Any]] = []