45 changes: 20 additions & 25 deletions AGENTS.md
@@ -14,27 +14,24 @@ All API keys live on a Cloudflare Worker proxy — nothing sensitive ships in th
- **App Type**: Menu bar-only (`LSUIElement=true`), no dock icon or main window
- **Framework**: SwiftUI (macOS native) with AppKit bridging for menu bar panel and cursor overlay
- **Pattern**: MVVM with `@StateObject` / `@Published` state management
- **AI Chat**: Claude (Sonnet 4.6 default, Opus 4.6 optional) via Cloudflare Worker proxy with SSE streaming
- **Speech-to-Text**: AssemblyAI real-time streaming (`u3-rt-pro` model) via websocket, with OpenAI and Apple Speech as fallbacks
- **Text-to-Speech**: ElevenLabs (`eleven_flash_v2_5` model) via Cloudflare Worker proxy
- **AI Chat**: Claude (Sonnet 4.6 default) via local Node.js server using the Claude Agent SDK, which spawns the Claude Code CLI. Auth inherited from local `claude` session — no API key needed.
- **Speech-to-Text**: Apple Speech framework (on-device, free). AssemblyAI and OpenAI still available as opt-in alternatives via Info.plist config.
- **Text-to-Speech**: Apple `NSSpeechSynthesizer` using the system voice (on-device, free), with `AVSpeechSynthesizer` as a fallback. No network or API keys required.
- **Screen Capture**: ScreenCaptureKit (macOS 14.2+), multi-monitor support
- **Voice Input**: Push-to-talk via `AVAudioEngine` + pluggable transcription-provider layer. System-wide keyboard shortcut via listen-only CGEvent tap.
- **Element Pointing**: Claude embeds `[POINT:x,y:label:screenN]` tags in responses. The overlay parses these, maps coordinates to the correct monitor, and animates the blue cursor along a Bézier arc to the target (see the sketch after this list).
- **Concurrency**: `@MainActor` isolation, async/await throughout
- **Analytics**: PostHog via `ClickyAnalytics.swift`
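
A minimal sketch of the tag-parsing step. The regex and the `PointTarget` type are illustrative assumptions; the overlay's actual implementation may differ (for example, in how it handles non-integer coordinates):

```swift
import Foundation

/// Hypothetical parsed form of a [POINT:x,y:label:screenN] tag.
struct PointTarget {
    let x: Double
    let y: Double
    let label: String
    let screen: Int
}

/// Extracts [POINT:x,y:label:screenN] tags from a Claude response.
/// Assumes integer coordinates and labels without ':' or ']'.
func parsePointTags(in response: String) -> [PointTarget] {
    let pattern = #/\[POINT:(\d+),(\d+):([^:\]]+):screen(\d+)\]/#
    return response.matches(of: pattern).compactMap { match in
        guard let x = Double(match.output.1),
              let y = Double(match.output.2),
              let screen = Int(match.output.4) else { return nil }
        return PointTarget(x: x, y: y, label: String(match.output.3), screen: screen)
    }
}
```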

### API Proxy (Cloudflare Worker)
### Local Server (Agent SDK)

The app never calls external APIs directly. All requests go through a Cloudflare Worker (`worker/src/index.ts`) that holds the real API keys as secrets.
The app sends chat requests to a local Node.js server (`local-server/src/server.ts`) that uses the `@anthropic-ai/claude-agent-sdk` to proxy requests through the locally installed Claude Code CLI. No API keys are needed — auth is inherited from the user's `claude` session.

| Route | Upstream | Purpose |
|-------|----------|---------|
| `POST /chat` | `api.anthropic.com/v1/messages` | Claude vision + streaming chat |
| `POST /tts` | `api.elevenlabs.io/v1/text-to-speech/{voiceId}` | ElevenLabs TTS audio |
| `POST /transcribe-token` | `streaming.assemblyai.com/v3/token` | Fetches a short-lived (480s) AssemblyAI websocket token |
| Route | Backend | Purpose |
|-------|---------|---------|
| `POST /chat` | Claude Agent SDK → Claude Code CLI | Claude vision + streaming chat |

Worker secrets: `ANTHROPIC_API_KEY`, `ASSEMBLYAI_API_KEY`, `ELEVENLABS_API_KEY`
Worker vars: `ELEVENLABS_VOICE_ID`
TTS and STT are handled natively by the app (no server routes needed).
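
For reference, a rough Swift sketch of how a client might consume the `/chat` stream. The request body and per-chunk JSON shape here are assumptions for illustration, not the server's documented contract:

```swift
import Foundation

/// Minimal sketch: POST a prompt to the local server and print streamed text.
/// Assumes the server emits `data: <json>` SSE lines with a `text` field.
func streamChat(prompt: String) async throws {
    var request = URLRequest(url: URL(string: "http://localhost:3456/chat")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: ["prompt": prompt])

    let (bytes, _) = try await URLSession.shared.bytes(for: request)
    for try await line in bytes.lines {
        guard line.hasPrefix("data: ") else { continue }
        let payload = Data(line.dropFirst(6).utf8)
        if let chunk = try? JSONSerialization.jsonObject(with: payload) as? [String: Any],
           let text = chunk["text"] as? String {
            print(text, terminator: "")
        }
    }
}
```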

### Key Architecture Decisions

@@ -68,13 +65,14 @@ Worker vars: `ELEVENLABS_VOICE_ID`
| `GlobalPushToTalkShortcutMonitor.swift` | ~132 | System-wide push-to-talk monitor. Owns the listen-only `CGEvent` tap and publishes press/release transitions. |
| `ClaudeAPI.swift` | ~291 | Claude vision API client with streaming (SSE) and non-streaming modes. TLS warmup optimization and image MIME detection. |
| `OpenAIAPI.swift` | ~142 | OpenAI GPT vision API client. |
| `ElevenLabsTTSClient.swift` | ~81 | ElevenLabs TTS client. Sends text to the Worker proxy, plays back audio via `AVAudioPlayer`. Exposes `isPlaying` for transient cursor scheduling. |
| `AppleTTSClient.swift` | ~115 | Apple TTS client using `NSSpeechSynthesizer` (system voice) with an `AVSpeechSynthesizer` fallback. Free, local text-to-speech. Exposes `isPlaying` for transient cursor scheduling. |
| `ElevenLabsTTSClient.swift` | ~81 | Legacy ElevenLabs TTS client (unused — replaced by `AppleTTSClient`). |
| `ElementLocationDetector.swift` | ~335 | Detects UI element locations in screenshots for cursor pointing. |
| `DesignSystem.swift` | ~880 | Design system tokens — colors, corner radii, shared styles. All UI references `DS.Colors`, `DS.CornerRadius`, etc. |
| `ClickyAnalytics.swift` | ~121 | PostHog analytics integration for usage tracking. |
| `WindowPositionManager.swift` | ~262 | Window placement logic, Screen Recording permission flow, and accessibility permission helpers. |
| `AppBundleConfiguration.swift` | ~28 | Runtime configuration reader for keys stored in the app bundle Info.plist. |
| `worker/src/index.ts` | ~142 | Cloudflare Worker proxy. Three routes: `/chat` (Claude), `/tts` (ElevenLabs), `/transcribe-token` (AssemblyAI temp token). |
| `local-server/src/server.ts` | ~230 | Local Node.js server using Claude Agent SDK. Single route: `/chat` streams Claude responses as SSE. Auth inherited from local Claude Code session. |

## Build & Run

@@ -90,22 +88,19 @@ open leanring-buddy.xcodeproj

**Do NOT run `xcodebuild` from the terminal** — it invalidates TCC (Transparency, Consent, and Control) permissions and the app will need to re-request screen recording, accessibility, etc.

## Cloudflare Worker
## Local Server

The local server requires Claude Code to be installed and authenticated (`claude auth login`).

```bash
cd worker
cd local-server
npm install

# Add secrets
npx wrangler secret put ANTHROPIC_API_KEY
npx wrangler secret put ASSEMBLYAI_API_KEY
npx wrangler secret put ELEVENLABS_API_KEY

# Deploy
npx wrangler deploy
# Start the server (listens on http://localhost:3456)
npm start

# Local dev (create worker/.dev.vars with your keys)
npx wrangler dev
# Or with auto-reload for development
npm run dev
```

## Code Style & Conventions
4 changes: 3 additions & 1 deletion leanring-buddy/AppleSpeechTranscriptionProvider.swift
@@ -48,7 +48,9 @@ final class AppleSpeechTranscriptionProvider: BuddyTranscriptionProvider {
]

for preferredLocale in preferredLocales {
if let speechRecognizer = SFSpeechRecognizer(locale: preferredLocale) {
if let speechRecognizer = SFSpeechRecognizer(locale: preferredLocale),
speechRecognizer.isAvailable {
print("🎙️ Apple Speech: using locale \(preferredLocale.identifier)")
return speechRecognizer
}
}
106 changes: 106 additions & 0 deletions leanring-buddy/AppleTTSClient.swift
@@ -0,0 +1,106 @@
//
// AppleTTSClient.swift
// leanring-buddy
//
// Text-to-speech using macOS NSSpeechSynthesizer, which can use the
// system voice set in Accessibility settings — including Siri voices.
// Falls back to AVSpeechSynthesizer if NSSpeechSynthesizer fails.
//

import AppKit
import AVFoundation

@MainActor
final class AppleTTSClient: NSObject {
/// Primary synthesizer — uses the system default voice (including Siri
/// voices configured in System Settings → Accessibility → Spoken Content).
private var systemSynthesizer: NSSpeechSynthesizer?

/// Fallback synthesizer using AVSpeechSynthesizer, in case
/// NSSpeechSynthesizer fails for any reason.
private let fallbackSynthesizer = AVSpeechSynthesizer()

/// Tracks whether speech is currently playing.
private(set) var isSpeaking: Bool = false

/// Delegate bridge that forwards speech-finished callbacks from both
/// synthesizers back to this client. Stored separately to avoid @MainActor
/// isolation conflicts with the delegate protocols.
private var delegateBridge: SpeechDelegateBridge?

override init() {
super.init()

let synthesizer = NSSpeechSynthesizer()
self.systemSynthesizer = synthesizer

let bridge = SpeechDelegateBridge { [weak self] in
Task { @MainActor [weak self] in
self?.isSpeaking = false
}
}
self.delegateBridge = bridge
synthesizer.delegate = bridge
fallbackSynthesizer.delegate = bridge

let defaultVoice = NSSpeechSynthesizer.defaultVoice
let attributes = NSSpeechSynthesizer.attributes(forVoice: defaultVoice)
let voiceName = attributes[NSSpeechSynthesizer.VoiceAttributeKey.name] as? String ?? "unknown"
print("🔊 Apple TTS: using system voice \"\(voiceName)\" (NSSpeechSynthesizer)")
}

/// Speaks the given text aloud using the system voice. Returns
/// immediately after speech starts (NSSpeechSynthesizer.startSpeaking
/// is non-blocking). Falls back to AVSpeechSynthesizer on failure.
func speakText(_ text: String) async throws {
stopPlayback()
isSpeaking = true

if let systemSynthesizer, systemSynthesizer.startSpeaking(text) {
print("🔊 Apple TTS: speaking \(text.count) characters via system voice")
} else {
print("⚠️ NSSpeechSynthesizer failed, falling back to AVSpeechSynthesizer")
speakWithFallback(text)
}
}

/// Whether TTS audio is currently playing back.
var isPlaying: Bool {
isSpeaking
}

/// Stops any in-progress speech immediately.
func stopPlayback() {
systemSynthesizer?.stopSpeaking()
fallbackSynthesizer.stopSpeaking(at: .immediate)
isSpeaking = false
}

// MARK: - Fallback

private func speakWithFallback(_ text: String) {
let utterance = AVSpeechUtterance(string: text)
utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
utterance.rate = AVSpeechUtteranceDefaultSpeechRate
fallbackSynthesizer.speak(utterance)
print("🔊 Fallback TTS: speaking \(text.count) characters")
}
}

// MARK: - Delegate Bridge

/// Separate class to handle the synthesizer delegate callbacks without
/// @MainActor isolation conflicts. Calls back via a closure when speech
/// finishes on either synthesizer.
private final class SpeechDelegateBridge: NSObject, NSSpeechSynthesizerDelegate, AVSpeechSynthesizerDelegate {
private let onFinished: () -> Void

init(onFinished: @escaping () -> Void) {
self.onFinished = onFinished
}

func speechSynthesizer(
_ sender: NSSpeechSynthesizer,
didFinishSpeaking finishedSpeaking: Bool
) {
onFinished()
}

func speechSynthesizer(
_ synthesizer: AVSpeechSynthesizer,
didFinish utterance: AVSpeechUtterance
) {
onFinished()
}
}
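
A typical call site might look like the following. This is an illustrative sketch only; the names (`SpeechExample`, `announce`) are hypothetical and the app's real wiring may differ:

```swift
import AVFoundation

// Keep one long-lived client; speech stops if it is deallocated mid-utterance.
@MainActor
final class SpeechExample {
    private let tts = AppleTTSClient()

    func announce(_ text: String) {
        Task {
            // speakText returns once speech has started, not when it ends.
            try await tts.speakText(text)
        }
    }

    func interrupt() {
        tts.stopPlayback()
    }
}
```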
9 changes: 1 addition & 8 deletions leanring-buddy/BuddyTranscriptionProvider.swift
@@ -87,14 +87,7 @@ enum BuddyTranscriptionProviderFactory {
return AppleSpeechTranscriptionProvider()
}

if assemblyAIProvider.isConfigured {
return assemblyAIProvider
}

if openAIProvider.isConfigured {
return openAIProvider
}

// Default to Apple Speech (free, local, no API keys needed)
return AppleSpeechTranscriptionProvider()
}
}
11 changes: 0 additions & 11 deletions leanring-buddy/ClaudeAPI.swift
@@ -101,7 +101,6 @@ class ClaudeAPI {
func analyzeImageStreaming(
images: [(data: Data, label: String)],
systemPrompt: String,
conversationHistory: [(userPlaceholder: String, assistantResponse: String)] = [],
userPrompt: String,
onTextChunk: @MainActor @Sendable (String) -> Void
) async throws -> (text: String, duration: TimeInterval) {
@@ -112,11 +111,6 @@
// Build messages array
var messages: [[String: Any]] = []

for (userPlaceholder, assistantResponse) in conversationHistory {
messages.append(["role": "user", "content": userPlaceholder])
messages.append(["role": "assistant", "content": assistantResponse])
}

// Build current message with all labeled images + prompt
var contentBlocks: [[String: Any]] = []
for image in images {
@@ -215,18 +209,13 @@
func analyzeImage(
images: [(data: Data, label: String)],
systemPrompt: String,
conversationHistory: [(userPlaceholder: String, assistantResponse: String)] = [],
userPrompt: String
) async throws -> (text: String, duration: TimeInterval) {
let startTime = Date()

var request = makeAPIRequest()

var messages: [[String: Any]] = []
for (userPlaceholder, assistantResponse) in conversationHistory {
messages.append(["role": "user", "content": userPlaceholder])
messages.append(["role": "assistant", "content": assistantResponse])
}

// Build current message with all labeled images + prompt
var contentBlocks: [[String: Any]] = []