diff --git a/AGENTS.md b/AGENTS.md index 6946d441..a0d486ea 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -15,8 +15,8 @@ All API keys live on a Cloudflare Worker proxy — nothing sensitive ships in th - **Framework**: SwiftUI (macOS native) with AppKit bridging for menu bar panel and cursor overlay - **Pattern**: MVVM with `@StateObject` / `@Published` state management - **AI Chat**: Claude (Sonnet 4.6 default, Opus 4.6 optional) via Cloudflare Worker proxy with SSE streaming -- **Speech-to-Text**: AssemblyAI real-time streaming (`u3-rt-pro` model) via websocket, with OpenAI and Apple Speech as fallbacks -- **Text-to-Speech**: ElevenLabs (`eleven_flash_v2_5` model) via Cloudflare Worker proxy +- **Speech-to-Text**: AssemblyAI real-time streaming (`u3-rt-pro` model) via websocket, with OpenAI, Apple Speech, and Parakeet (on-device, via FluidAudio/CoreML) as options. Selectable at runtime via the panel UI. +- **Text-to-Speech**: ElevenLabs (`eleven_flash_v2_5` model) via Cloudflare Worker proxy, or Supertonic (on-device ONNX, 66M params, ~167× realtime on Apple Silicon). Selectable at runtime via the panel UI. - **Screen Capture**: ScreenCaptureKit (macOS 14.2+), multi-monitor support - **Voice Input**: Push-to-talk via `AVAudioEngine` + pluggable transcription-provider layer. System-wide keyboard shortcut via listen-only CGEvent tap. - **Element Pointing**: Claude embeds `[POINT:x,y:label:screenN]` tags in responses. The overlay parses these, maps coordinates to the correct monitor, and animates the blue cursor along a bezier arc to the target. @@ -53,9 +53,9 @@ Worker vars: `ELEVENLABS_VOICE_ID` | File | Lines | Purpose | |------|-------|---------| | `leanring_buddyApp.swift` | ~89 | Menu bar app entry point. Uses `@NSApplicationDelegateAdaptor` with `CompanionAppDelegate` which creates `MenuBarPanelManager` and starts `CompanionManager`. No main window — the app lives entirely in the status bar. | -| `CompanionManager.swift` | ~1026 | Central state machine. Owns dictation, shortcut monitoring, screen capture, Claude API, ElevenLabs TTS, and overlay management. Tracks voice state (idle/listening/processing/responding), conversation history, model selection, and cursor visibility. Coordinates the full push-to-talk → screenshot → Claude → TTS → pointing pipeline. | +| `CompanionManager.swift` | ~1100 | Central state machine. Owns dictation, shortcut monitoring, screen capture, Claude API, ElevenLabs TTS, Supertonic TTS, and overlay management. Tracks voice state, conversation history, model selection, TTS provider selection, and STT provider selection. Coordinates the full push-to-talk → screenshot → Claude → TTS → pointing pipeline. | | `MenuBarPanelManager.swift` | ~243 | NSStatusItem + custom NSPanel lifecycle. Creates the menu bar icon, manages the floating companion panel (show/hide/position), installs click-outside-to-dismiss monitor. | -| `CompanionPanelView.swift` | ~761 | SwiftUI panel content for the menu bar dropdown. Shows companion status, push-to-talk instructions, model picker (Sonnet/Opus), permissions UI, DM feedback button, and quit button. Dark aesthetic using `DS` design system. | +| `CompanionPanelView.swift` | ~870 | SwiftUI panel content for the menu bar dropdown. Shows companion status, push-to-talk instructions, model picker (Sonnet/Opus), voice picker (ElevenLabs/Supertonic), speech picker (AssemblyAI/Parakeet), permissions UI, DM feedback button, and quit button. Dark aesthetic using `DS` design system. 
| | `OverlayWindow.swift` | ~881 | Full-screen transparent overlay hosting the blue cursor, response text, waveform, and spinner. Handles cursor animation, element pointing with bezier arcs, multi-monitor coordinate mapping, and fade-out transitions. | | `CompanionResponseOverlay.swift` | ~217 | SwiftUI view for the response text bubble and waveform displayed next to the cursor in the overlay. | | `CompanionScreenCaptureUtility.swift` | ~132 | Multi-monitor screenshot capture using ScreenCaptureKit. Returns labeled image data for each connected display. | @@ -69,6 +69,9 @@ Worker vars: `ELEVENLABS_VOICE_ID` | `ClaudeAPI.swift` | ~291 | Claude vision API client with streaming (SSE) and non-streaming modes. TLS warmup optimization, image MIME detection, conversation history support. | | `OpenAIAPI.swift` | ~142 | OpenAI GPT vision API client. | | `ElevenLabsTTSClient.swift` | ~81 | ElevenLabs TTS client. Sends text to the Worker proxy, plays back audio via `AVAudioPlayer`. Exposes `isPlaying` for transient cursor scheduling. | +| `SupertonicTTSClient.swift` | ~160 | On-device TTS client backed by Supertonic ONNX (66M params, ~167× realtime). Auto-downloads models from HuggingFace on first use. Mirrors `ElevenLabsTTSClient` interface. | +| `SupertonicEngine.swift` | ~600 | ONNX inference engine for Supertonic. Vendored from supertone-inc/supertonic. Handles text preprocessing, chunking, duration prediction, latent diffusion denoising, and vocoder synthesis via ONNX Runtime. | +| `ParakeetTranscriptionProvider.swift` | ~160 | On-device ASR provider using NVIDIA Parakeet via FluidAudio (CoreML/ANE). Implements `BuddyTranscriptionProvider` with the same buffer-then-transcribe pattern as the OpenAI provider. No API key required. | | `ElementLocationDetector.swift` | ~335 | Detects UI element locations in screenshots for cursor pointing. | | `DesignSystem.swift` | ~880 | Design system tokens — colors, corner radii, shared styles. All UI references `DS.Colors`, `DS.CornerRadius`, etc. | | `ClickyAnalytics.swift` | ~121 | PostHog analytics integration for usage tracking. | @@ -88,6 +91,15 @@ open leanring-buddy.xcodeproj # deprecated onChange warning in OverlayWindow.swift. Do NOT attempt to fix these. ``` +### Required Swift Packages (add via Xcode → File → Add Package Dependencies) + +| Package | URL | Purpose | +|---------|-----|---------| +| onnxruntime-swift-package-manager | `https://github.com/microsoft/onnxruntime-swift-package-manager.git` | ONNX Runtime for Supertonic on-device TTS | +| FluidAudio | `https://github.com/FluidInference/FluidAudio.git` | Parakeet CoreML models for on-device ASR | + +After adding, link both products to the `leanring-buddy` target. Supertonic downloads ~200MB of ONNX model files from HuggingFace on first use. Parakeet downloads ~600MB of CoreML models on first use. Both are cached in `~/Library/Application Support/Clicky/models/`. + **Do NOT run `xcodebuild` from the terminal** — it invalidates TCC (Transparency, Consent, and Control) permissions and the app will need to re-request screen recording, accessibility, etc. 
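For orientation, the on-device ASR path boils down to a handful of FluidAudio calls. A condensed sketch of what `ParakeetTranscriptionProvider.swift` does (assuming a 16 kHz mono Float32 `samples` array; the exact FluidAudio surface may shift, since the package is pinned to `main`):

```swift
import FluidAudio

// First use downloads ~600MB of Parakeet v2 CoreML models, then reuses the cache.
let models = try await AsrModels.downloadAndLoad(version: .v2)
let manager = AsrManager(config: .default)
try await manager.loadModels(models)

// Transcribe raw audio (16 kHz, mono, Float32 in [-1, 1]).
var decoderState = TdtDecoderState.make()
let result = try await manager.transcribe(samples, decoderState: &decoderState)
print(result.text)
```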
## Cloudflare Worker diff --git a/leanring-buddy.xcodeproj/project.pbxproj b/leanring-buddy.xcodeproj/project.pbxproj index 75e57261..ebdef1d9 100644 --- a/leanring-buddy.xcodeproj/project.pbxproj +++ b/leanring-buddy.xcodeproj/project.pbxproj @@ -9,6 +9,10 @@ /* Begin PBXBuildFile section */ AA00BB032F6500030039DA55 /* Sparkle in Frameworks */ = {isa = PBXBuildFile; productRef = AA00BB022F6500020039DA55 /* Sparkle */; }; AA00BB062F6500060039DA55 /* PostHog in Frameworks */ = {isa = PBXBuildFile; productRef = AA00BB052F6500050039DA55 /* PostHog */; }; + C25D88DC2F8E1B20008ECA05 /* onnxruntime in Frameworks */ = {isa = PBXBuildFile; productRef = C25D88DB2F8E1B20008ECA05 /* onnxruntime */; }; + C25D88DE2F8E1B20008ECA05 /* onnxruntime_extensions in Frameworks */ = {isa = PBXBuildFile; productRef = C25D88DD2F8E1B20008ECA05 /* onnxruntime_extensions */; }; + C25D88E12F8E1B45008ECA05 /* FluidAudio in Frameworks */ = {isa = PBXBuildFile; productRef = C25D88E02F8E1B45008ECA05 /* FluidAudio */; }; + C25D88E32F8E1B45008ECA05 /* fluidaudiocli in Frameworks */ = {isa = PBXBuildFile; productRef = C25D88E22F8E1B45008ECA05 /* fluidaudiocli */; }; /* End PBXBuildFile section */ /* Begin PBXContainerItemProxy section */ @@ -57,6 +61,10 @@ isa = PBXFrameworksBuildPhase; buildActionMask = 2147483647; files = ( + C25D88DE2F8E1B20008ECA05 /* onnxruntime_extensions in Frameworks */, + C25D88E32F8E1B45008ECA05 /* fluidaudiocli in Frameworks */, + C25D88E12F8E1B45008ECA05 /* FluidAudio in Frameworks */, + C25D88DC2F8E1B20008ECA05 /* onnxruntime in Frameworks */, AA00BB032F6500030039DA55 /* Sparkle in Frameworks */, AA00BB062F6500060039DA55 /* PostHog in Frameworks */, ); @@ -121,6 +129,10 @@ packageProductDependencies = ( AA00BB022F6500020039DA55 /* Sparkle */, AA00BB052F6500050039DA55 /* PostHog */, + C25D88DB2F8E1B20008ECA05 /* onnxruntime */, + C25D88DD2F8E1B20008ECA05 /* onnxruntime_extensions */, + C25D88E02F8E1B45008ECA05 /* FluidAudio */, + C25D88E22F8E1B45008ECA05 /* fluidaudiocli */, ); productName = "leanring-buddy"; productReference = 28F22CBF2F56440300A0FC59 /* Clicky.app */; @@ -207,6 +219,8 @@ packageReferences = ( AA00BB012F6500010039DA55 /* XCRemoteSwiftPackageReference "Sparkle" */, AA00BB042F6500040039DA55 /* XCRemoteSwiftPackageReference "posthog-ios" */, + C25D88DA2F8E1B20008ECA05 /* XCRemoteSwiftPackageReference "onnxruntime-swift-package-manager" */, + C25D88DF2F8E1B45008ECA05 /* XCRemoteSwiftPackageReference "FluidAudio" */, ); preferredProjectObjectVersion = 77; productRefGroup = 28F22CC02F56440300A0FC59 /* Products */; @@ -616,6 +630,22 @@ minimumVersion = 3.0.0; }; }; + C25D88DA2F8E1B20008ECA05 /* XCRemoteSwiftPackageReference "onnxruntime-swift-package-manager" */ = { + isa = XCRemoteSwiftPackageReference; + repositoryURL = "https://github.com/microsoft/onnxruntime-swift-package-manager.git"; + requirement = { + kind = upToNextMajorVersion; + minimumVersion = 1.24.2; + }; + }; + C25D88DF2F8E1B45008ECA05 /* XCRemoteSwiftPackageReference "FluidAudio" */ = { + isa = XCRemoteSwiftPackageReference; + repositoryURL = "https://github.com/FluidInference/FluidAudio.git"; + requirement = { + branch = main; + kind = branch; + }; + }; /* End XCRemoteSwiftPackageReference section */ /* Begin XCSwiftPackageProductDependency section */ @@ -629,6 +659,26 @@ package = AA00BB042F6500040039DA55 /* XCRemoteSwiftPackageReference "posthog-ios" */; productName = PostHog; }; + C25D88DB2F8E1B20008ECA05 /* onnxruntime */ = { + isa = XCSwiftPackageProductDependency; + package = C25D88DA2F8E1B20008ECA05 /* 
XCRemoteSwiftPackageReference "onnxruntime-swift-package-manager" */; + productName = onnxruntime; + }; + C25D88DD2F8E1B20008ECA05 /* onnxruntime_extensions */ = { + isa = XCSwiftPackageProductDependency; + package = C25D88DA2F8E1B20008ECA05 /* XCRemoteSwiftPackageReference "onnxruntime-swift-package-manager" */; + productName = onnxruntime_extensions; + }; + C25D88E02F8E1B45008ECA05 /* FluidAudio */ = { + isa = XCSwiftPackageProductDependency; + package = C25D88DF2F8E1B45008ECA05 /* XCRemoteSwiftPackageReference "FluidAudio" */; + productName = FluidAudio; + }; + C25D88E22F8E1B45008ECA05 /* fluidaudiocli */ = { + isa = XCSwiftPackageProductDependency; + package = C25D88DF2F8E1B45008ECA05 /* XCRemoteSwiftPackageReference "FluidAudio" */; + productName = fluidaudiocli; + }; /* End XCSwiftPackageProductDependency section */ }; rootObject = 28F22CB72F56440300A0FC59 /* Project object */; diff --git a/leanring-buddy.xcodeproj/project.xcworkspace/xcshareddata/swiftpm/Package.resolved b/leanring-buddy.xcodeproj/project.xcworkspace/xcshareddata/swiftpm/Package.resolved index d88adb21..3b3576a4 100644 --- a/leanring-buddy.xcodeproj/project.xcworkspace/xcshareddata/swiftpm/Package.resolved +++ b/leanring-buddy.xcodeproj/project.xcworkspace/xcshareddata/swiftpm/Package.resolved @@ -1,6 +1,24 @@ { - "originHash" : "3c6fb67fefedcfcd00708e24ca8088151f21dccfc0ade32ea80c406646277e89", + "originHash" : "709531fd15543507c78c7fc8573126ccac523700790f55ff54fcb863042e60aa", "pins" : [ + { + "identity" : "fluidaudio", + "kind" : "remoteSourceControl", + "location" : "https://github.com/FluidInference/FluidAudio.git", + "state" : { + "branch" : "main", + "revision" : "4ef33f0b64837c2943e8cd0f66940d5861176d6a" + } + }, + { + "identity" : "onnxruntime-swift-package-manager", + "kind" : "remoteSourceControl", + "location" : "https://github.com/microsoft/onnxruntime-swift-package-manager.git", + "state" : { + "revision" : "b7fb7f7dea8a2469e6335d95a61b8f36d0dc83b2", + "version" : "1.24.2" + } + }, { "identity" : "plcrashreporter", "kind" : "remoteSourceControl", diff --git a/leanring-buddy/BuddyDictationManager.swift b/leanring-buddy/BuddyDictationManager.swift index 5bca2677..fbc5f037 100644 --- a/leanring-buddy/BuddyDictationManager.swift +++ b/leanring-buddy/BuddyDictationManager.swift @@ -262,7 +262,7 @@ return AVCaptureDevice.authorizationStatus(for: .audio) == .notDetermined } - private let transcriptionProvider: any BuddyTranscriptionProvider + private var transcriptionProvider: any BuddyTranscriptionProvider private let audioEngine = AVAudioEngine() private var activeTranscriptionSession: (any BuddyStreamingTranscriptionSession)? private var activeStartSource: BuddyDictationStartSource? @@ -287,6 +287,19 @@ super.init() } + + /// Swaps the active transcription provider between push-to-talk sessions. + /// If a session is in progress the switch is dropped (logged as deferred); + /// call it again once the current session has finished.
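+ /// + /// Example (this is what CompanionManager.setSelectedSTTProvider does): + /// let provider = BuddyTranscriptionProviderFactory.makeProvider(for: .parakeet) + /// buddyDictationManager.switchTranscriptionProvider(to: provider)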
+ func switchTranscriptionProvider(to provider: any BuddyTranscriptionProvider) { + guard !isDictationInProgress else { + print("⚠️ Transcription: provider switch deferred — session in progress") + return + } + transcriptionProvider = provider + transcriptionProviderDisplayName = provider.displayName + print("🎙️ Transcription: switched to \(provider.displayName)") + } + func updateContextualKeyterms(_ contextualKeyterms: [String]) { self.contextualKeyterms = contextualKeyterms } diff --git a/leanring-buddy/BuddyTranscriptionProvider.swift b/leanring-buddy/BuddyTranscriptionProvider.swift index 0a75715d..d3e94976 100644 --- a/leanring-buddy/BuddyTranscriptionProvider.swift +++ b/leanring-buddy/BuddyTranscriptionProvider.swift @@ -30,10 +30,11 @@ protocol BuddyTranscriptionProvider { } enum BuddyTranscriptionProviderFactory { - private enum PreferredProvider: String { + enum PreferredProvider: String { case assemblyAI = "assemblyai" case openAI = "openai" case appleSpeech = "apple" + case parakeet = "parakeet" } static func makeDefaultProvider() -> any BuddyTranscriptionProvider { @@ -42,15 +43,31 @@ enum BuddyTranscriptionProviderFactory { return provider } - private static func resolveProvider() -> any BuddyTranscriptionProvider { - let preferredProviderRawValue = AppBundleConfiguration - .stringValue(forKey: "VoiceTranscriptionProvider")? - .lowercased() - let preferredProvider = preferredProviderRawValue.flatMap(PreferredProvider.init(rawValue:)) + static func makeProvider(for preferredProvider: PreferredProvider) -> any BuddyTranscriptionProvider { + let provider = resolveProvider(preferred: preferredProvider) + print("🎙️ Transcription: switching to \(provider.displayName)") + return provider + } + + private static func resolveProvider(preferred: PreferredProvider? = nil) -> any BuddyTranscriptionProvider { + // Use the explicit preferred value if passed, otherwise read from Info.plist + let preferredProvider: PreferredProvider? + if let preferred { + preferredProvider = preferred + } else { + let rawValue = AppBundleConfiguration + .stringValue(forKey: "VoiceTranscriptionProvider")? + .lowercased() + preferredProvider = rawValue.flatMap(PreferredProvider.init(rawValue:)) + } let assemblyAIProvider = AssemblyAIStreamingTranscriptionProvider() let openAIProvider = OpenAIAudioTranscriptionProvider() + if preferredProvider == .parakeet { + return ParakeetTranscriptionProvider() + } + if preferredProvider == .appleSpeech { return AppleSpeechTranscriptionProvider() } diff --git a/leanring-buddy/ClaudeAPI.swift b/leanring-buddy/ClaudeAPI.swift index 0c7070b5..07084c87 100644 --- a/leanring-buddy/ClaudeAPI.swift +++ b/leanring-buddy/ClaudeAPI.swift @@ -6,17 +6,25 @@ import Foundation /// Claude API helper with streaming for progressive text display. +/// +/// Supports two modes: +/// - **Proxy mode** (default): Sends requests to a Cloudflare Worker that injects the API key. +/// - **Direct mode**: When an `apiKey` is provided, sends requests straight to +/// `api.anthropic.com` with the key in the `x-api-key` header. No Worker needed. class ClaudeAPI { private static let tlsWarmupLock = NSLock() private static var hasStartedTLSWarmup = false private let apiURL: URL var model: String + private let apiKey: String? private let session: URLSession + /// Creates a ClaudeAPI in proxy mode (requests go to a Cloudflare Worker). init(proxyURL: String, model: String = "claude-sonnet-4-6") { self.apiURL = URL(string: proxyURL)! 
self.model = model + self.apiKey = nil // Use .default instead of .ephemeral so TLS session tickets are cached. // Ephemeral sessions do a full TLS handshake on every request, which causes @@ -36,11 +44,34 @@ class ClaudeAPI { warmUpTLSConnectionIfNeeded() } + /// Creates a ClaudeAPI in direct mode (requests go straight to Anthropic). + init(apiKey: String, model: String = "claude-sonnet-4-6") { + self.apiURL = URL(string: "https://api.anthropic.com/v1/messages")! + self.model = model + self.apiKey = apiKey + + let config = URLSessionConfiguration.default + config.timeoutIntervalForRequest = 120 + config.timeoutIntervalForResource = 300 + config.waitsForConnectivity = true + config.urlCache = nil + config.httpCookieStorage = nil + self.session = URLSession(configuration: config) + + warmUpTLSConnectionIfNeeded() + } + private func makeAPIRequest() -> URLRequest { var request = URLRequest(url: apiURL) request.httpMethod = "POST" request.timeoutInterval = 120 request.setValue("application/json", forHTTPHeaderField: "Content-Type") + // In direct mode, add the Anthropic auth headers that the Worker would + // normally inject. In proxy mode these are omitted — the Worker adds them. + if let apiKey { + request.setValue(apiKey, forHTTPHeaderField: "x-api-key") + request.setValue("2023-06-01", forHTTPHeaderField: "anthropic-version") + } return request } diff --git a/leanring-buddy/CompanionManager.swift b/leanring-buddy/CompanionManager.swift index 0234cf19..db09b45e 100644 --- a/leanring-buddy/CompanionManager.swift +++ b/leanring-buddy/CompanionManager.swift @@ -68,18 +68,77 @@ final class CompanionManager: ObservableObject { // Response text is now displayed inline on the cursor overlay via // streamingResponseText, so no separate response overlay manager is needed. - /// Base URL for the Cloudflare Worker proxy. All API requests route - /// through this so keys never ship in the app binary. + /// Base URL for the Cloudflare Worker proxy. Used when no direct API key is set. private static let workerBaseURL = "https://your-worker-name.your-subdomain.workers.dev" + /// User-provided Anthropic API key for direct mode (no Worker needed). + /// When set, Claude requests go straight to api.anthropic.com. + /// Persisted to UserDefaults so it survives app restarts. + @Published var anthropicAPIKey: String = UserDefaults.standard.string(forKey: "anthropicAPIKey") ?? "" + + func setAnthropicAPIKey(_ key: String) { + let trimmedKey = key.trimmingCharacters(in: .whitespacesAndNewlines) + anthropicAPIKey = trimmedKey + UserDefaults.standard.set(trimmedKey, forKey: "anthropicAPIKey") + // Recreate the Claude client to use direct mode (or fall back to proxy) + claudeAPI = makeClaudeAPI() + } + + /// Whether the app is using a direct Anthropic API key instead of the Worker proxy. 
+ var isUsingDirectAPIKey: Bool { + !anthropicAPIKey.isEmpty + } + + private func makeClaudeAPI() -> ClaudeAPI { + if !anthropicAPIKey.isEmpty { + return ClaudeAPI(apiKey: anthropicAPIKey, model: selectedModel) + } else { + return ClaudeAPI(proxyURL: "\(Self.workerBaseURL)/chat", model: selectedModel) + } + } + private lazy var claudeAPI: ClaudeAPI = { - return ClaudeAPI(proxyURL: "\(Self.workerBaseURL)/chat", model: selectedModel) + return makeClaudeAPI() }() private lazy var elevenLabsTTSClient: ElevenLabsTTSClient = { return ElevenLabsTTSClient(proxyURL: "\(Self.workerBaseURL)/tts") }() + private lazy var supertonicTTSClient: SupertonicTTSClient = { + return SupertonicTTSClient() + }() + + /// Which TTS backend to use for voice responses. "elevenlabs" or "supertonic". + /// Persisted to UserDefaults so the choice survives app restarts. + @Published var selectedTTSProvider: String = UserDefaults.standard.string(forKey: "selectedTTSProvider") ?? "elevenlabs" + + func setSelectedTTSProvider(_ provider: String) { + stopActiveTTSPlayback() + selectedTTSProvider = provider + UserDefaults.standard.set(provider, forKey: "selectedTTSProvider") + } + + /// Which STT backend to use for voice transcription. "assemblyai" or "parakeet". + /// Persisted to UserDefaults so the choice survives app restarts. + @Published var selectedSTTProvider: String = UserDefaults.standard.string(forKey: "selectedSTTProvider") ?? "assemblyai" + + func setSelectedSTTProvider(_ provider: String) { + selectedSTTProvider = provider + UserDefaults.standard.set(provider, forKey: "selectedSTTProvider") + + let providerInstance: any BuddyTranscriptionProvider + switch provider { + case "parakeet": + providerInstance = BuddyTranscriptionProviderFactory.makeProvider( + for: .parakeet) + default: + providerInstance = BuddyTranscriptionProviderFactory.makeProvider( + for: .assemblyAI) + } + buddyDictationManager.switchTranscriptionProvider(to: providerInstance) + } + /// Conversation history so Claude remembers prior exchanges within a session. /// Each entry is the user's transcript and Claude's response. private var conversationHistory: [(userTranscript: String, assistantResponse: String)] = [] @@ -179,6 +238,14 @@ final class CompanionManager: ObservableObject { bindVoiceStateObservation() bindAudioPowerLevel() bindShortcutTransitions() + + // Restore the user's saved STT provider from UserDefaults so it + // survives app restarts (BuddyDictationManager defaults to Info.plist). + if selectedSTTProvider == "parakeet" { + let parakeetProvider = BuddyTranscriptionProviderFactory.makeProvider(for: .parakeet) + buddyDictationManager.switchTranscriptionProvider(to: parakeetProvider) + } + // Eagerly touch the Claude API so its TLS warmup handshake completes // well before the onboarding demo fires at ~40s into the video. 
_ = claudeAPI @@ -295,6 +362,8 @@ final class CompanionManager: ObservableObject { currentResponseTask?.cancel() currentResponseTask = nil + elevenLabsTTSClient.stopPlayback() + supertonicTTSClient.stopPlayback() shortcutTransitionCancellable?.cancel() voiceStateCancellable?.cancel() audioPowerCancellable?.cancel() @@ -493,7 +562,7 @@ final class CompanionManager: ObservableObject { // Cancel any in-progress response and TTS from a previous utterance currentResponseTask?.cancel() - elevenLabsTTSClient.stopPlayback() + stopActiveTTSPlayback() clearDetectedElementLocation() // Dismiss the onboarding prompt if it's showing @@ -701,12 +770,12 @@ final class CompanionManager: ObservableObject { // until the audio actually starts playing, then switch to responding. if !spokenText.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty { do { - try await elevenLabsTTSClient.speakText(spokenText) + try await speakTextWithActiveTTSProvider(spokenText) // speakText returns after player.play() — audio is now playing voiceState = .responding } catch { ClickyAnalytics.trackTTSError(error: error.localizedDescription) - print("⚠️ ElevenLabs TTS error: \(error)") + print("⚠️ TTS error (\(selectedTTSProvider)): \(error)") speakCreditsErrorFallback() } } @@ -735,7 +804,7 @@ final class CompanionManager: ObservableObject { transientHideTask?.cancel() transientHideTask = Task { // Wait for TTS audio to finish playing - while elevenLabsTTSClient.isPlaying { + while isActiveTTSPlaying { try? await Task.sleep(nanoseconds: 200_000_000) guard !Task.isCancelled else { return } } @@ -755,6 +824,130 @@ final class CompanionManager: ObservableObject { } } + // MARK: - TTS Provider Helpers + + /// Routes a speak call to whichever TTS backend is currently selected. + private func speakTextWithActiveTTSProvider(_ text: String) async throws { + if selectedTTSProvider == "supertonic" { + try await supertonicTTSClient.speakText(text) + } else { + try await elevenLabsTTSClient.speakText(text) + } + } + + /// Stops playback on whichever TTS backend is currently active. + private func stopActiveTTSPlayback() { + if selectedTTSProvider == "supertonic" { + supertonicTTSClient.stopPlayback() + } else { + elevenLabsTTSClient.stopPlayback() + } + } + + /// True if the currently selected TTS backend has audio playing. + private var isActiveTTSPlaying: Bool { + if selectedTTSProvider == "supertonic" { + return supertonicTTSClient.isPlaying + } else { + return elevenLabsTTSClient.isPlaying + } + } + + // MARK: - Test Helpers + + /// Speaks a sample phrase using the currently selected TTS provider. + /// Used by the panel's test button to verify TTS without push-to-talk. + @Published private(set) var ttsTestStatus: String = "" + + func testCurrentTTSProvider() { + ttsTestStatus = "Testing \(selectedTTSProvider)..." + Task { + do { + try await speakTextWithActiveTTSProvider("Hello! This is a test of the \(selectedTTSProvider) text to speech engine.") + ttsTestStatus = "✅ \(selectedTTSProvider) working" + } catch { + ttsTestStatus = "❌ \(error.localizedDescription)" + print("🔊 TTS test error: \(error)") + } + } + } + + /// Records 3 seconds of mic audio and transcribes with the current STT provider. + @Published private(set) var sttTestStatus: String = "" + + func testCurrentSTTProvider() { + sttTestStatus = "🎙️ Recording 3s..." 
+ Task { + do { + let transcribedText = try await runShortSTTTest() + if transcribedText.isEmpty { + sttTestStatus = "⚠️ No speech detected" + } else { + sttTestStatus = "✅ \"\(transcribedText)\"" + } + } catch { + sttTestStatus = "❌ \(error.localizedDescription)" + print("🎙️ STT test error: \(error)") + } + } + } + + /// Records ~3 seconds of mic audio and runs it through the active STT provider. + private func runShortSTTTest() async throws -> String { + let audioEngine = AVAudioEngine() + let inputNode = audioEngine.inputNode + let recordingFormat = inputNode.outputFormat(forBus: 0) + + var capturedBuffers: [AVAudioPCMBuffer] = [] + + inputNode.installTap(onBus: 0, bufferSize: 4096, format: recordingFormat) { buffer, _ in + capturedBuffers.append(buffer) + } + + audioEngine.prepare() + try audioEngine.start() + + sttTestStatus = "🎙️ Speak now (3s)..." + try await Task.sleep(nanoseconds: 3_000_000_000) + + audioEngine.stop() + inputNode.removeTap(onBus: 0) + + sttTestStatus = "⏳ Transcribing..." + + var finalText = "" + let providerPreference: BuddyTranscriptionProviderFactory.PreferredProvider = + selectedSTTProvider == "parakeet" ? .parakeet : .assemblyAI + let testProvider = BuddyTranscriptionProviderFactory.makeProvider(for: providerPreference) + + let session = try await testProvider.startStreamingSession( + keyterms: [], + onTranscriptUpdate: { text in + finalText = text + }, + onFinalTranscriptReady: { text in + finalText = text + }, + onError: { error in + print("🎙️ STT test session error: \(error)") + } + ) + + for buffer in capturedBuffers { + session.appendAudioBuffer(buffer) + } + + session.requestFinalTranscript() + + // Wait for finalization (up to 10s for model download on first use) + for _ in 0..<100 { + try await Task.sleep(nanoseconds: 100_000_000) + if !finalText.isEmpty { break } + } + + return finalText + } + /// Speaks a hardcoded error message using macOS system TTS when API /// credits run out. Uses NSSpeechSynthesizer so it works even when /// ElevenLabs is down. 
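Aside: the direct mode added above boils down to the two headers the Worker proxy would otherwise inject. A minimal standalone sketch, with the endpoint and headers as in `ClaudeAPI.makeAPIRequest`; the JSON body shape is the standard Anthropic Messages payload, assumed here rather than taken from this diff:

```swift
import Foundation

func makeDirectAnthropicRequest(apiKey: String, prompt: String) throws -> URLRequest {
    var request = URLRequest(url: URL(string: "https://api.anthropic.com/v1/messages")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    // The two headers the Cloudflare Worker injects in proxy mode:
    request.setValue(apiKey, forHTTPHeaderField: "x-api-key")
    request.setValue("2023-06-01", forHTTPHeaderField: "anthropic-version")
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "messages": [["role": "user", "content": prompt]],
    ] as [String: Any])
    return request
}
```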
diff --git a/leanring-buddy/CompanionPanelView.swift b/leanring-buddy/CompanionPanelView.swift index 76789b4c..e2157a6e 100644 --- a/leanring-buddy/CompanionPanelView.swift +++ b/leanring-buddy/CompanionPanelView.swift @@ -13,6 +13,7 @@ import SwiftUI struct CompanionPanelView: View { @ObservedObject var companionManager: CompanionManager @State private var emailInput: String = "" + @State private var apiKeyInput: String = "" var body: some View { VStack(alignment: .leading, spacing: 0) { @@ -25,12 +26,30 @@ struct CompanionPanelView: View { .padding(.top, 16) .padding(.horizontal, 16) - if companionManager.hasCompletedOnboarding && companionManager.allPermissionsGranted { + if companionManager.hasCompletedOnboarding { Spacer() .frame(height: 12) modelPickerRow .padding(.horizontal, 16) + + Spacer() + .frame(height: 8) + + apiKeyRow + .padding(.horizontal, 16) + + Spacer() + .frame(height: 8) + + ttsProviderPickerRow + .padding(.horizontal, 16) + + Spacer() + .frame(height: 8) + + sttProviderPickerRow + .padding(.horizontal, 16) } if !companionManager.allPermissionsGranted { @@ -641,6 +660,228 @@ struct CompanionPanelView: View { .pointerCursor() } + // MARK: - API Key + + private var apiKeyRow: some View { + VStack(spacing: 6) { + HStack { + Text("API Key") + .font(.system(size: 13, weight: .medium)) + .foregroundColor(DS.Colors.textSecondary) + + Spacer() + + if companionManager.isUsingDirectAPIKey { + HStack(spacing: 4) { + Circle() + .fill(DS.Colors.success) + .frame(width: 6, height: 6) + Text("Direct") + .font(.system(size: 10, weight: .medium)) + .foregroundColor(DS.Colors.success) + } + } else { + Text("Using proxy") + .font(.system(size: 10, weight: .medium)) + .foregroundColor(DS.Colors.textTertiary) + } + } + + HStack(spacing: 6) { + SecureField("sk-ant-...", text: $apiKeyInput) + .textFieldStyle(.plain) + .font(.system(size: 11, design: .monospaced)) + .foregroundColor(DS.Colors.textPrimary) + .padding(.horizontal, 8) + .padding(.vertical, 6) + .background( + RoundedRectangle(cornerRadius: 5, style: .continuous) + .fill(Color.white.opacity(0.06)) + ) + .overlay( + RoundedRectangle(cornerRadius: 5, style: .continuous) + .stroke(DS.Colors.borderSubtle, lineWidth: 0.5) + ) + .onSubmit { + companionManager.setAnthropicAPIKey(apiKeyInput) + } + + if companionManager.isUsingDirectAPIKey { + Button(action: { + apiKeyInput = "" + companionManager.setAnthropicAPIKey("") + }) { + Image(systemName: "xmark.circle.fill") + .font(.system(size: 14)) + .foregroundColor(DS.Colors.textTertiary) + } + .buttonStyle(.plain) + .pointerCursor() + } else { + Button(action: { + companionManager.setAnthropicAPIKey(apiKeyInput) + }) { + Text("Save") + .font(.system(size: 10, weight: .semibold)) + .foregroundColor(DS.Colors.textPrimary) + .padding(.horizontal, 8) + .padding(.vertical, 4) + .background( + RoundedRectangle(cornerRadius: 4, style: .continuous) + .fill(Color.white.opacity(0.1)) + ) + } + .buttonStyle(.plain) + .pointerCursor() + .disabled(apiKeyInput.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty) + } + } + } + .padding(.vertical, 4) + .onAppear { + // Show masked version if key is already saved, or empty for new input + if companionManager.isUsingDirectAPIKey { + apiKeyInput = companionManager.anthropicAPIKey + } + } + } + + // MARK: - TTS Provider Picker + + private var ttsProviderPickerRow: some View { + VStack(spacing: 6) { + HStack { + Text("Voice") + .font(.system(size: 13, weight: .medium)) + .foregroundColor(DS.Colors.textSecondary) + + Spacer() + + HStack(spacing: 0) { + 
ttsProviderOptionButton(label: "ElevenLabs", providerID: "elevenlabs") + ttsProviderOptionButton(label: "Supertonic", providerID: "supertonic") + } + .background( + RoundedRectangle(cornerRadius: 6, style: .continuous) + .fill(Color.white.opacity(0.06)) + ) + .overlay( + RoundedRectangle(cornerRadius: 6, style: .continuous) + .stroke(DS.Colors.borderSubtle, lineWidth: 0.5) + ) + + Button(action: { companionManager.testCurrentTTSProvider() }) { + Text("Test") + .font(.system(size: 10, weight: .semibold)) + .foregroundColor(DS.Colors.textPrimary) + .padding(.horizontal, 8) + .padding(.vertical, 4) + .background( + RoundedRectangle(cornerRadius: 4, style: .continuous) + .fill(Color.white.opacity(0.1)) + ) + } + .buttonStyle(.plain) + .pointerCursor() + } + + if !companionManager.ttsTestStatus.isEmpty { + Text(companionManager.ttsTestStatus) + .font(.system(size: 10)) + .foregroundColor(DS.Colors.textTertiary) + .frame(maxWidth: .infinity, alignment: .trailing) + } + } + .padding(.vertical, 4) + } + + private func ttsProviderOptionButton(label: String, providerID: String) -> some View { + let isSelected = companionManager.selectedTTSProvider == providerID + return Button(action: { + companionManager.setSelectedTTSProvider(providerID) + }) { + Text(label) + .font(.system(size: 11, weight: .medium)) + .foregroundColor(isSelected ? DS.Colors.textPrimary : DS.Colors.textTertiary) + .padding(.horizontal, 10) + .padding(.vertical, 5) + .background( + RoundedRectangle(cornerRadius: 5, style: .continuous) + .fill(isSelected ? Color.white.opacity(0.1) : Color.clear) + ) + } + .buttonStyle(.plain) + .pointerCursor() + } + + // MARK: - STT Provider Picker + + private var sttProviderPickerRow: some View { + VStack(spacing: 6) { + HStack { + Text("Speech") + .font(.system(size: 13, weight: .medium)) + .foregroundColor(DS.Colors.textSecondary) + + Spacer() + + HStack(spacing: 0) { + sttProviderOptionButton(label: "AssemblyAI", providerID: "assemblyai") + sttProviderOptionButton(label: "Parakeet", providerID: "parakeet") + } + .background( + RoundedRectangle(cornerRadius: 6, style: .continuous) + .fill(Color.white.opacity(0.06)) + ) + .overlay( + RoundedRectangle(cornerRadius: 6, style: .continuous) + .stroke(DS.Colors.borderSubtle, lineWidth: 0.5) + ) + + Button(action: { companionManager.testCurrentSTTProvider() }) { + Text("Test") + .font(.system(size: 10, weight: .semibold)) + .foregroundColor(DS.Colors.textPrimary) + .padding(.horizontal, 8) + .padding(.vertical, 4) + .background( + RoundedRectangle(cornerRadius: 4, style: .continuous) + .fill(Color.white.opacity(0.1)) + ) + } + .buttonStyle(.plain) + .pointerCursor() + } + + if !companionManager.sttTestStatus.isEmpty { + Text(companionManager.sttTestStatus) + .font(.system(size: 10)) + .foregroundColor(DS.Colors.textTertiary) + .frame(maxWidth: .infinity, alignment: .trailing) + } + } + .padding(.vertical, 4) + } + + private func sttProviderOptionButton(label: String, providerID: String) -> some View { + let isSelected = companionManager.selectedSTTProvider == providerID + return Button(action: { + companionManager.setSelectedSTTProvider(providerID) + }) { + Text(label) + .font(.system(size: 11, weight: .medium)) + .foregroundColor(isSelected ? DS.Colors.textPrimary : DS.Colors.textTertiary) + .padding(.horizontal, 10) + .padding(.vertical, 5) + .background( + RoundedRectangle(cornerRadius: 5, style: .continuous) + .fill(isSelected ? 
Color.white.opacity(0.1) : Color.clear) + ) + } + .buttonStyle(.plain) + .pointerCursor() + } + + // MARK: - DM Farza Button + private var dmFarzaButton: some View { diff --git a/leanring-buddy/ParakeetTranscriptionProvider.swift b/leanring-buddy/ParakeetTranscriptionProvider.swift new file mode 100644 index 00000000..20c7e0fc --- /dev/null +++ b/leanring-buddy/ParakeetTranscriptionProvider.swift @@ -0,0 +1,189 @@ +// +// ParakeetTranscriptionProvider.swift +// leanring-buddy +// +// On-device transcription using NVIDIA's Parakeet model via FluidAudio (CoreML). +// Models auto-download on first use. No API key or internet connection required +// after the initial download. Runs on the Apple Neural Engine. +// +// Requires: FluidAudio package (https://github.com/FluidInference/FluidAudio.git) +// + +import AVFoundation +import FluidAudio +import Foundation + +final class ParakeetTranscriptionProvider: BuddyTranscriptionProvider { + let displayName = "Parakeet" + + /// Parakeet requires no Speech Recognition permission — it uses raw PCM audio. + let requiresSpeechRecognitionPermission = false + + /// Always available since it's entirely on-device with no API key. + var isConfigured: Bool { true } + var unavailableExplanation: String? { nil } + + /// Shared AsrManager — model loading is expensive, so we keep it alive + /// in memory after the first transcription and reuse it for all subsequent ones. + fileprivate static var sharedAsrManager: AsrManager? + + func startStreamingSession( + keyterms: [String], + onTranscriptUpdate: @escaping (String) -> Void, + onFinalTranscriptReady: @escaping (String) -> Void, + onError: @escaping (Error) -> Void + ) async throws -> any BuddyStreamingTranscriptionSession { + return ParakeetTranscriptionSession( + onTranscriptUpdate: onTranscriptUpdate, + onFinalTranscriptReady: onFinalTranscriptReady, + onError: onError + ) + } +} + +// MARK: - Session + +/// Buffers push-to-talk audio as PCM16 at 16kHz, then runs Parakeet inference +/// on key-up via FluidAudio. Mirrors the OpenAI provider's buffer-then-transcribe pattern +/// but runs entirely on device with no network call after the initial model download. +private final class ParakeetTranscriptionSession: BuddyStreamingTranscriptionSession { + /// Allow extra time on first use for model download + load. + let finalTranscriptFallbackDelaySeconds: TimeInterval = 15.0 + + private let onTranscriptUpdate: (String) -> Void + private let onFinalTranscriptReady: (String) -> Void + private let onError: (Error) -> Void + + private let stateQueue = DispatchQueue(label: "com.clicky.parakeet.session") + private let audioPCM16Converter = BuddyPCM16AudioConverter(targetSampleRate: 16_000) + + private var bufferedPCM16Data = Data() + private var hasRequestedFinalTranscript = false + private var hasDeliveredFinalTranscript = false + private var isCancelled = false + private var inferenceTask: Task<Void, Never>?
+ + init( + onTranscriptUpdate: @escaping (String) -> Void, + onFinalTranscriptReady: @escaping (String) -> Void, + onError: @escaping (Error) -> Void + ) { + self.onTranscriptUpdate = onTranscriptUpdate + self.onFinalTranscriptReady = onFinalTranscriptReady + self.onError = onError + } + + // MARK: - BuddyStreamingTranscriptionSession + + func appendAudioBuffer(_ audioBuffer: AVAudioPCMBuffer) { + guard let pcm16Data = audioPCM16Converter.convertToPCM16Data(from: audioBuffer), + !pcm16Data.isEmpty else { return } + + stateQueue.async { + guard !self.hasRequestedFinalTranscript, !self.isCancelled else { return } + self.bufferedPCM16Data.append(pcm16Data) + } + } + + func requestFinalTranscript() { + stateQueue.async { + guard !self.hasRequestedFinalTranscript, !self.isCancelled else { return } + self.hasRequestedFinalTranscript = true + + let capturedAudioData = self.bufferedPCM16Data + self.inferenceTask = Task { [weak self] in + await self?.runInference(on: capturedAudioData) + } + } + } + + func cancel() { + stateQueue.async { [weak self] in + self?.isCancelled = true + self?.bufferedPCM16Data.removeAll(keepingCapacity: false) + } + inferenceTask?.cancel() + } + + // MARK: - Inference + + private func runInference(on pcm16Data: Data) async { + guard !Task.isCancelled else { return } + + let isEmpty = stateQueue.sync { isCancelled || pcm16Data.isEmpty } + if isEmpty { + deliverFinalTranscript("") + return + } + + do { + // Convert PCM16 little-endian mono 16kHz → Float32 in [-1, 1] for FluidAudio + let float32Samples = convertPCM16DataToFloat32Samples(pcm16Data) + + let asrManager = try await loadSharedAsrManagerIfNeeded() + guard !Task.isCancelled, !stateQueue.sync(execute: { isCancelled }) else { return } + + var decoderState = TdtDecoderState.make() + let transcriptionResult = try await asrManager.transcribe(float32Samples, decoderState: &decoderState) + guard !stateQueue.sync(execute: { isCancelled }) else { return } + + let transcriptText = transcriptionResult.text.trimmingCharacters(in: .whitespacesAndNewlines) + print("🎙️ Parakeet transcript: \"\(transcriptText)\"") + + if !transcriptText.isEmpty { + onTranscriptUpdate(transcriptText) + } + + deliverFinalTranscript(transcriptText) + } catch { + guard !stateQueue.sync(execute: { isCancelled }) else { return } + print("[Parakeet] ❌ Inference error: \(error.localizedDescription)") + onError(error) + } + } + + /// Loads (or returns the cached) shared AsrManager, downloading models from + /// HuggingFace on first use. Uses v2 (English-only, fastest, ~600MB). + private func loadSharedAsrManagerIfNeeded() async throws -> AsrManager { + if let existing = ParakeetTranscriptionProvider.sharedAsrManager { + return existing + } + + print("⬇️ Parakeet: downloading and loading models (first use)...") + let models = try await AsrModels.downloadAndLoad(version: .v2) + let manager = AsrManager(config: .default) + try await manager.loadModels(models) + ParakeetTranscriptionProvider.sharedAsrManager = manager + print("✅ Parakeet: models loaded, ready for transcription") + return manager + } + + /// Converts raw PCM16 little-endian mono bytes to Float32 samples in [-1.0, 1.0]. + /// FluidAudio expects 16kHz mono Float32 — the PCM16 converter already resamples to 16kHz. 
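+ /// (For example, Int16 values -32768...32767 divided by 32768.0 land in -1.0...0.99997.)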
+ private func convertPCM16DataToFloat32Samples(_ data: Data) -> [Float] { + let sampleCount = data.count / MemoryLayout<Int16>.size + var float32Samples = [Float](repeating: 0.0, count: sampleCount) + + data.withUnsafeBytes { (rawBytes: UnsafeRawBufferPointer) in + let int16Samples = rawBytes.bindMemory(to: Int16.self) + for i in 0..<sampleCount { + float32Samples[i] = Float(int16Samples[i]) / 32768.0 + } + } + + return float32Samples + } +} diff --git a/leanring-buddy/SupertonicEngine.swift b/leanring-buddy/SupertonicEngine.swift new file mode 100644 --- /dev/null +++ b/leanring-buddy/SupertonicEngine.swift +// +// SupertonicEngine.swift +// leanring-buddy +// +// ONNX inference engine for Supertonic. Vendored from supertone-inc/supertonic. +// Text preprocessing, chunking, duration prediction, latent denoising, and +// vocoder synthesis via ONNX Runtime. +// + +import Foundation +import OnnxRuntimeBindings + +let SUPERTONIC_LANGS = ["en", "ko"] +let SUPERTONIC_MAX_CHUNK_LENGTH = 300 + +func isValidSupertonicLang(_ lang: String) -> Bool { + SUPERTONIC_LANGS.contains(lang) +} + +// MARK: - Configuration Structures + +struct SupertonicConfig: Codable { + struct AEConfig: Codable { + let sample_rate: Int + let base_chunk_size: Int + } + + struct TTLConfig: Codable { + let chunk_compress_factor: Int + let latent_dim: Int + } + + let ae: AEConfig + let ttl: TTLConfig +} + +// MARK: - Voice Style Data Structure + +private struct SupertonicVoiceStyleData: Codable { + struct StyleComponent: Codable { + let data: [[[Float]]] + let dims: [Int] + let type: String + } + + let style_ttl: StyleComponent + let style_dp: StyleComponent +} + +// MARK: - Unicode Text Processor + +class SupertonicUnicodeProcessor { + let indexer: [Int64] + + init(unicodeIndexerPath: String) throws { + let data = try Data(contentsOf: URL(fileURLWithPath: unicodeIndexerPath)) + self.indexer = try JSONDecoder().decode([Int64].self, from: data) + } + + func call(_ textList: [String], _ langList: [String]) -> (textIds: [[Int64]], textMask: [[[Float]]]) { + var processedTexts = [String]() + for (i, text) in textList.enumerated() { + processedTexts.append(supertonicPreprocessText(text, lang: langList[i])) + } + + var textIdsLengths = [Int]() + for text in processedTexts { + textIdsLengths.append(text.unicodeScalars.count) + } + + let maxLen = textIdsLengths.max() ?? 0 + + var textIds = [[Int64]]() + for text in processedTexts { + var row = Array(repeating: Int64(0), count: maxLen) + let unicodeValues = Array(text.unicodeScalars.map { Int($0.value) }) + for (j, val) in unicodeValues.enumerated() { + if val < indexer.count { + row[j] = indexer[val] + } else { + row[j] = -1 + } + } + textIds.append(row) + } + + let textMask = supertonicGetTextMask(textIdsLengths) + return (textIds, textMask) + } +} + +// MARK: - Text Preprocessing + +private func supertonicPreprocessText(_ text: String, lang: String) -> String { + var text = text.decomposedStringWithCompatibilityMapping + + // Remove emojis + text = text.unicodeScalars.filter { scalar in + let value = scalar.value + return !((value >= 0x1F600 && value <= 0x1F64F) || + (value >= 0x1F300 && value <= 0x1F5FF) || + (value >= 0x1F680 && value <= 0x1F6FF) || + (value >= 0x1F700 && value <= 0x1F77F) || + (value >= 0x1F780 && value <= 0x1F7FF) || + (value >= 0x1F800 && value <= 0x1F8FF) || + (value >= 0x1F900 && value <= 0x1F9FF) || + (value >= 0x1FA00 && value <= 0x1FA6F) || + (value >= 0x1FA70 && value <= 0x1FAFF) || + (value >= 0x2600 && value <= 0x26FF) || + (value >= 0x2700 && value <= 0x27BF) || + (value >= 0x1F1E6 && value <= 0x1F1FF)) + }.map { String($0) }.joined() + + let replacements: [String: String] = [ + "\u{2013}": "-", "\u{2011}": "-", "\u{2014}": "-", "_": " ", + "\u{201C}": "\"", "\u{201D}": "\"", "\u{2018}": "'", "\u{2019}": "'", + "\u{00B4}": "'", "`": "'", + "[": " ", "]": " ", "|": " ", "/": " ", "#": " ", + "\u{2192}": " ", "\u{2190}": " ", + ] + + for (old, new) in replacements { + text = text.replacingOccurrences(of: old, with: new) + } + + let specialSymbols = ["♥", "☆", "♡", "©", "\\"] + for symbol in specialSymbols { + text = text.replacingOccurrences(of: symbol, with: "") + } + + let exprReplacements: [String: String] = [ + "@": " at ", "e.g.,": "for example, ",
"i.e.,": "that is, ", + ] + + for (old, new) in exprReplacements { + text = text.replacingOccurrences(of: old, with: new) + } + + text = text.replacingOccurrences(of: " ,", with: ",") + text = text.replacingOccurrences(of: " .", with: ".") + text = text.replacingOccurrences(of: " !", with: "!") + text = text.replacingOccurrences(of: " ?", with: "?") + text = text.replacingOccurrences(of: " ;", with: ";") + text = text.replacingOccurrences(of: " :", with: ":") + text = text.replacingOccurrences(of: " '", with: "'") + + while text.contains("\"\"") { text = text.replacingOccurrences(of: "\"\"", with: "\"") } + while text.contains("''") { text = text.replacingOccurrences(of: "''", with: "'") } + while text.contains("``") { text = text.replacingOccurrences(of: "``", with: "`") } + + let whitespacePattern = try! NSRegularExpression(pattern: "\\s+") + let wsRange = NSRange(text.startIndex..., in: text) + text = whitespacePattern.stringByReplacingMatches(in: text, range: wsRange, withTemplate: " ") + + text = text.trimmingCharacters(in: .whitespacesAndNewlines) + + if !text.isEmpty { + let punctPattern = try! NSRegularExpression( + pattern: "[.!?;:,'\"\u{201C}\u{201D}\u{2018}\u{2019})\\]\\}\u{2026}]$") + let punctRange = NSRange(text.startIndex..., in: text) + if punctPattern.firstMatch(in: text, range: punctRange) == nil { + text += "." + } + } + + guard isValidSupertonicLang(lang) else { + fatalError("Invalid language: \(lang). Available: \(SUPERTONIC_LANGS.joined(separator: ", "))") + } + + text = "<\(lang)>\(text)" + return text +} + +// MARK: - Mask Utilities + +private func supertonicLengthToMask(_ lengths: [Int], maxLen: Int? = nil) -> [[[Float]]] { + let actualMaxLen = maxLen ?? (lengths.max() ?? 0) + var mask = [[[Float]]]() + + for len in lengths { + var row = Array(repeating: Float(0.0), count: actualMaxLen) + for j in 0.. [[[Float]]] { + let maxLen = textIdsLengths.max() ?? 0 + return supertonicLengthToMask(textIdsLengths, maxLen: maxLen) +} + +private func supertonicSampleNoisyLatent( + duration: [Float], sampleRate: Int, baseChunkSize: Int, + chunkCompress: Int, latentDim: Int +) -> (noisyLatent: [[[Float]]], latentMask: [[[Float]]]) { + let bsz = duration.count + let maxDur = duration.max() ?? 0.0 + let wavLenMax = Int(maxDur * Float(sampleRate)) + + var wavLengths = [Int]() + for d in duration { wavLengths.append(Int(d * Float(sampleRate))) } + + let chunkSize = baseChunkSize * chunkCompress + let latentLen = (wavLenMax + chunkSize - 1) / chunkSize + let latentDimVal = latentDim * chunkCompress + + var noisyLatent = [[[Float]]]() + for _ in 0.. [String] { + let actualMaxLen = maxLen > 0 ? maxLen : SUPERTONIC_MAX_CHUNK_LENGTH + let trimmedText = text.trimmingCharacters(in: .whitespacesAndNewlines) + if trimmedText.isEmpty { return [""] } + + let paraPattern = try! NSRegularExpression(pattern: "\\n\\s*\\n") + let paraRange = NSRange(trimmedText.startIndex..., in: trimmedText) + + var paragraphs = [String]() + var lastEnd = trimmedText.startIndex + + paraPattern.enumerateMatches(in: trimmedText, range: paraRange) { match, _, _ in + if let match = match, let range = Range(match.range, in: trimmedText) { + paragraphs.append(String(trimmedText[lastEnd.. 
actualMaxLen { + if !current.isEmpty { + chunks.append(current.trimmingCharacters(in: .whitespacesAndNewlines)) + current = ""; currentLen = 0 + } + + let parts = trimmedSentence.components(separatedBy: ",") + for part in parts { + let trimmedPart = part.trimmingCharacters(in: .whitespacesAndNewlines) + if trimmedPart.isEmpty { continue } + + let partLen = trimmedPart.count + + if partLen > actualMaxLen { + let words = trimmedPart.components(separatedBy: .whitespaces).filter { !$0.isEmpty } + var wordChunk = ""; var wordChunkLen = 0 + + for word in words { + let wordLen = word.count + if wordChunkLen + wordLen + 1 > actualMaxLen && !wordChunk.isEmpty { + chunks.append(wordChunk.trimmingCharacters(in: .whitespacesAndNewlines)) + wordChunk = ""; wordChunkLen = 0 + } + if !wordChunk.isEmpty { wordChunk += " "; wordChunkLen += 1 } + wordChunk += word; wordChunkLen += wordLen + } + + if !wordChunk.isEmpty { + chunks.append(wordChunk.trimmingCharacters(in: .whitespacesAndNewlines)) + } + } else { + if currentLen + partLen + 1 > actualMaxLen && !current.isEmpty { + chunks.append(current.trimmingCharacters(in: .whitespacesAndNewlines)) + current = ""; currentLen = 0 + } + if !current.isEmpty { current += ", "; currentLen += 2 } + current += trimmedPart; currentLen += partLen + } + } + continue + } + + if currentLen + sentenceLen + 1 > actualMaxLen && !current.isEmpty { + chunks.append(current.trimmingCharacters(in: .whitespacesAndNewlines)) + current = ""; currentLen = 0 + } + + if !current.isEmpty { current += " "; currentLen += 1 } + current += trimmedSentence; currentLen += sentenceLen + } + + if !current.isEmpty { chunks.append(current.trimmingCharacters(in: .whitespacesAndNewlines)) } + } + + return chunks.isEmpty ? [""] : chunks +} + +private func supertonicSplitSentences(_ text: String) -> [String] { + let regex = try! NSRegularExpression(pattern: "([.!?])\\s+") + let range = NSRange(text.startIndex..., in: text) + let matches = regex.matches(in: text, range: range) + + if matches.isEmpty { return [text] } + + var sentences = [String]() + var lastEnd = text.startIndex + + for match in matches { + guard let matchRange = Range(match.range, in: text) else { continue } + + let puncRange = Range(NSRange(location: match.range.location, length: 1), in: text)! + let punc = String(text[puncRange]) + + let combined = String(text[lastEnd..<matchRange.lowerBound]) + punc + sentences.append(combined.trimmingCharacters(in: .whitespacesAndNewlines)) + lastEnd = matchRange.upperBound + } + + if lastEnd < text.endIndex { + sentences.append(String(text[lastEnd...]).trimmingCharacters(in: .whitespacesAndNewlines)) + } + + return sentences +} + +// MARK: - Voice Style + +struct SupertonicStyle { + let ttl: ORTValue + let dp: ORTValue +} + +// MARK: - TTS Engine + +class SupertonicTTS { + let env: ORTEnv + let cfgs: SupertonicConfig + let textProcessor: SupertonicUnicodeProcessor + let dpOrt: ORTSession + let textEncOrt: ORTSession + let vectorEstOrt: ORTSession + let vocoderOrt: ORTSession + + var sampleRate: Int { cfgs.ae.sample_rate } + + init(env: ORTEnv, cfgs: SupertonicConfig, textProcessor: SupertonicUnicodeProcessor, + dpOrt: ORTSession, textEncOrt: ORTSession, vectorEstOrt: ORTSession, vocoderOrt: ORTSession) { + self.env = env + self.cfgs = cfgs + self.textProcessor = textProcessor + self.dpOrt = dpOrt + self.textEncOrt = textEncOrt + self.vectorEstOrt = vectorEstOrt + self.vocoderOrt = vocoderOrt + } + + func synthesize(text: String, lang: String, voiceStylePath: String, totalStep: Int = 4, + speed: Float = 1.05) throws ->
(wav: [Float], duration: Float) { + let style = try supertonicLoadVoiceStyle([voiceStylePath]) + return try call(text, lang, style, totalStep, speed: speed) + } + + // MARK: Core inference + + private func _infer(_ textList: [String], _ langList: [String], _ style: SupertonicStyle, + _ totalStep: Int, speed: Float = 1.05) throws -> (wav: [Float], duration: [Float]) { + let bsz = textList.count + + let (textIds, textMask) = textProcessor.call(textList, langList) + + let textIdsFlat = textIds.flatMap { $0 } + let textIdsShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: textIds[0].count)] + + let textIdsValue = try ORTValue( + tensorData: NSMutableData(bytes: textIdsFlat, length: textIdsFlat.count * MemoryLayout<Int64>.size), + elementType: .int64, shape: textIdsShape) + + let textMaskFlat = textMask.flatMap { $0.flatMap { $0 } } + let textMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: textMask[0][0].count)] + + let textMaskValue = try ORTValue( + tensorData: NSMutableData(bytes: textMaskFlat, length: textMaskFlat.count * MemoryLayout<Float>.size), + elementType: .float, shape: textMaskShape) + + // Duration prediction + let dpOutputs = try dpOrt.run( + withInputs: ["text_ids": textIdsValue, "style_dp": style.dp, "text_mask": textMaskValue], + outputNames: ["duration"], runOptions: nil) + + let durationData = try dpOutputs["duration"]!.tensorData() as Data + var duration = durationData.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) } + for i in 0..<bsz { duration[i] = duration[i] / speed } + + // Text encoding + let textEncOutputs = try textEncOrt.run( + withInputs: ["text_ids": textIdsValue, "text_mask": textMaskValue], + outputNames: ["text_emb"], runOptions: nil) + let textEmbValue = textEncOutputs["text_emb"]! + + // Sample the initial noisy latent sized from the predicted duration + let (noisyLatent, latentMask) = supertonicSampleNoisyLatent( + duration: duration, sampleRate: cfgs.ae.sample_rate, baseChunkSize: cfgs.ae.base_chunk_size, + chunkCompress: cfgs.ttl.chunk_compress_factor, latentDim: cfgs.ttl.latent_dim) + var xt = noisyLatent + + let totalStepArray = [Float](repeating: Float(totalStep), count: bsz) + let totalStepValue = try ORTValue( + tensorData: NSMutableData(bytes: totalStepArray, length: totalStepArray.count * MemoryLayout<Float>.size), + elementType: .float, shape: [NSNumber(value: bsz)]) + + // Denoising loop + for step in 0..<totalStep { + let currentStepArray = [Float](repeating: Float(step), count: bsz) + let currentStepValue = try ORTValue( + tensorData: NSMutableData(bytes: currentStepArray, length: currentStepArray.count * MemoryLayout<Float>.size), + elementType: .float, shape: [NSNumber(value: bsz)]) + + let xtFlat = xt.flatMap { $0.flatMap { $0 } } + let xtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)] + + let xtValue = try ORTValue( + tensorData: NSMutableData(bytes: xtFlat, length: xtFlat.count * MemoryLayout<Float>.size), + elementType: .float, shape: xtShape) + + let latentMaskFlat = latentMask.flatMap { $0.flatMap { $0 } } + let latentMaskShape: [NSNumber] = [NSNumber(value: bsz), 1, NSNumber(value: latentMask[0][0].count)] + + let latentMaskValue = try ORTValue( + tensorData: NSMutableData(bytes: latentMaskFlat, length: latentMaskFlat.count * MemoryLayout<Float>.size), + elementType: .float, shape: latentMaskShape) + + let vectorEstOutputs = try vectorEstOrt.run(withInputs: [ + "noisy_latent": xtValue, "text_emb": textEmbValue, "style_ttl": style.ttl, + "latent_mask": latentMaskValue, "text_mask": textMaskValue, + "current_step": currentStepValue, "total_step": totalStepValue, + ], outputNames: ["denoised_latent"], runOptions: nil) + + let denoisedData = try vectorEstOutputs["denoised_latent"]!.tensorData() as Data + let denoisedFlat = denoisedData.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) } + + let latentDimVal = xt[0].count + let latentLen = xt[0][0].count + + xt = [] + + var idx = 0 + for _ in 0..<bsz { + var batch = [[Float]]() + for _ in 0..<latentDimVal { + var row = [Float](repeating: 0.0, count: latentLen) + for k in 0..<latentLen { row[k] = denoisedFlat[idx]; idx += 1 } + batch.append(row) + } + xt.append(batch) + } + } + + // Vocoder synthesis + let finalXtFlat = xt.flatMap { $0.flatMap { $0 } } + let finalXtShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: xt[0].count), NSNumber(value: xt[0][0].count)] + + let finalXtValue = try ORTValue( + tensorData: NSMutableData(bytes: finalXtFlat, length: finalXtFlat.count * MemoryLayout<Float>.size), + elementType: .float, shape: finalXtShape) + + let vocoderOutputs = try vocoderOrt.run( + withInputs: ["latent": finalXtValue], outputNames: ["wav_tts"], runOptions: nil) + + let wavData = try vocoderOutputs["wav_tts"]!.tensorData() as Data + let wav = wavData.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) } + + return (wav, duration) + } + + func call(_ text: String, _ lang: String, _ style: SupertonicStyle, _ totalStep: Int, + speed: Float = 1.05, silenceDuration: Float = 0.3) throws -> (wav: [Float], duration: Float) { + let maxLen = lang == "ko" ?
120 : 300 + let chunks = supertonicChunkText(text, maxLen: maxLen) + let langList = Array(repeating: lang, count: chunks.count) + + var wavCat = [Float]() + var durCat: Float = 0.0 + + for (i, chunk) in chunks.enumerated() { + let result = try _infer([chunk], [langList[i]], style, totalStep, speed: speed) + let dur = result.duration[0] + let wavLen = Int(Float(sampleRate) * dur) + let wavChunk = Array(result.wav.prefix(wavLen)) + + if i == 0 { + wavCat = wavChunk + durCat = dur + } else { + let silenceLen = Int(silenceDuration * Float(sampleRate)) + wavCat.append(contentsOf: [Float](repeating: 0.0, count: silenceLen)) + wavCat.append(contentsOf: wavChunk) + durCat += silenceDuration + dur + } + } + + return (wavCat, durCat) + } +} + +// MARK: - Component Loading Functions + +func supertonicLoadVoiceStyle(_ voiceStylePaths: [String]) throws -> SupertonicStyle { + let bsz = voiceStylePaths.count + + let firstData = try Data(contentsOf: URL(fileURLWithPath: voiceStylePaths[0])) + let firstStyle = try JSONDecoder().decode(SupertonicVoiceStyleData.self, from: firstData) + + let ttlDims = firstStyle.style_ttl.dims + let dpDims = firstStyle.style_dp.dims + + let ttlDim1 = ttlDims[1], ttlDim2 = ttlDims[2] + let dpDim1 = dpDims[1], dpDim2 = dpDims[2] + + var ttlFlat = [Float](repeating: 0.0, count: bsz * ttlDim1 * ttlDim2) + var dpFlat = [Float](repeating: 0.0, count: bsz * dpDim1 * dpDim2) + + for (i, path) in voiceStylePaths.enumerated() { + let data = try Data(contentsOf: URL(fileURLWithPath: path)) + let voiceStyle = try JSONDecoder().decode(SupertonicVoiceStyleData.self, from: data) + + var idx = 0 + let ttlOffset = i * ttlDim1 * ttlDim2 + for batch in voiceStyle.style_ttl.data { + for row in batch { for val in row { ttlFlat[ttlOffset + idx] = val; idx += 1 } } + } + + idx = 0 + let dpOffset = i * dpDim1 * dpDim2 + for batch in voiceStyle.style_dp.data { + for row in batch { for val in row { dpFlat[dpOffset + idx] = val; idx += 1 } } + } + } + + let ttlShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: ttlDim1), NSNumber(value: ttlDim2)] + let dpShape: [NSNumber] = [NSNumber(value: bsz), NSNumber(value: dpDim1), NSNumber(value: dpDim2)] + + let ttlValue = try ORTValue( + tensorData: NSMutableData(bytes: &ttlFlat, length: ttlFlat.count * MemoryLayout<Float>.size), + elementType: .float, shape: ttlShape) + + let dpValue = try ORTValue( + tensorData: NSMutableData(bytes: &dpFlat, length: dpFlat.count * MemoryLayout<Float>.size), + elementType: .float, shape: dpShape) + + return SupertonicStyle(ttl: ttlValue, dp: dpValue) +} + +func supertonicLoadEngine(onnxDir: String) throws -> SupertonicTTS { + let env = try ORTEnv(loggingLevel: .warning) + + let cfgPath = "\(onnxDir)/tts.json" + let data = try Data(contentsOf: URL(fileURLWithPath: cfgPath)) + let cfgs = try JSONDecoder().decode(SupertonicConfig.self, from: data) + + let sessionOptions = try ORTSessionOptions() + + let dpOrt = try ORTSession(env: env, modelPath: "\(onnxDir)/duration_predictor.onnx", sessionOptions: sessionOptions) + let textEncOrt = try ORTSession(env: env, modelPath: "\(onnxDir)/text_encoder.onnx", sessionOptions: sessionOptions) + let vectorEstOrt = try ORTSession(env: env, modelPath: "\(onnxDir)/vector_estimator.onnx", sessionOptions: sessionOptions) + let vocoderOrt = try ORTSession(env: env, modelPath: "\(onnxDir)/vocoder.onnx", sessionOptions: sessionOptions) + + let unicodeIndexerPath = "\(onnxDir)/unicode_indexer.json" + let textProcessor = try SupertonicUnicodeProcessor(unicodeIndexerPath: unicodeIndexerPath) + + return
+
+func supertonicLoadEngine(onnxDir: String) throws -> SupertonicTTS {
+    let env = try ORTEnv(loggingLevel: .warning)
+
+    let cfgPath = "\(onnxDir)/tts.json"
+    let data = try Data(contentsOf: URL(fileURLWithPath: cfgPath))
+    let cfgs = try JSONDecoder().decode(SupertonicConfig.self, from: data)
+
+    let sessionOptions = try ORTSessionOptions()
+
+    let dpOrt = try ORTSession(env: env, modelPath: "\(onnxDir)/duration_predictor.onnx", sessionOptions: sessionOptions)
+    let textEncOrt = try ORTSession(env: env, modelPath: "\(onnxDir)/text_encoder.onnx", sessionOptions: sessionOptions)
+    let vectorEstOrt = try ORTSession(env: env, modelPath: "\(onnxDir)/vector_estimator.onnx", sessionOptions: sessionOptions)
+    let vocoderOrt = try ORTSession(env: env, modelPath: "\(onnxDir)/vocoder.onnx", sessionOptions: sessionOptions)
+
+    let unicodeIndexerPath = "\(onnxDir)/unicode_indexer.json"
+    let textProcessor = try SupertonicUnicodeProcessor(unicodeIndexerPath: unicodeIndexerPath)
+
+    return SupertonicTTS(
+        env: env, cfgs: cfgs, textProcessor: textProcessor,
+        dpOrt: dpOrt, textEncOrt: textEncOrt,
+        vectorEstOrt: vectorEstOrt, vocoderOrt: vocoderOrt)
+}
diff --git a/leanring-buddy/SupertonicTTSClient.swift b/leanring-buddy/SupertonicTTSClient.swift
new file mode 100644
index 00000000..82e2b14d
--- /dev/null
+++ b/leanring-buddy/SupertonicTTSClient.swift
@@ -0,0 +1,214 @@
+//
+//  SupertonicTTSClient.swift
+//  leanring-buddy
+//
+//  On-device TTS using the Supertonic ONNX engine (66M params, ~167× realtime
+//  on Apple Silicon). Models auto-download from HuggingFace on first use (~200MB).
+//  Interface mirrors ElevenLabsTTSClient so CompanionManager can swap between them.
+//
+
+import AVFoundation
+import Foundation
+
+@MainActor
+final class SupertonicTTSClient {
+
+    /// Voice to use for synthesis. Matches upstream voice_styles/*.json filenames.
+    /// Available: M1–M5 (male), F1–F5 (female).
+    var selectedVoice: String {
+        didSet {
+            UserDefaults.standard.set(selectedVoice, forKey: "supertonicSelectedVoice")
+        }
+    }
+
+    private(set) var isPlaying: Bool = false
+
+    private var loadedEngine: SupertonicTTS?
+    private var audioEngine: AVAudioEngine?
+    private var playerNode: AVAudioPlayerNode?
+
+    /// UUID used to detect when a new speakText() call cancels the current one.
+    private var activePlaybackSession: UUID?
+
+    // MARK: - Model paths
+
+    private static let huggingFaceBaseURL = "https://huggingface.co/Supertone/supertonic-2/resolve/main"
+
+    private static var supertonicModelDir: URL {
+        let appSupport = FileManager.default.urls(for: .applicationSupportDirectory, in: .userDomainMask).first!
+        return appSupport.appendingPathComponent("Clicky/models/supertonic", isDirectory: true)
+    }
+
+    private static var onnxModelDir: URL {
+        supertonicModelDir.appendingPathComponent("onnx", isDirectory: true)
+    }
+
+    private static var voiceStylesDir: URL {
+        supertonicModelDir.appendingPathComponent("voice_styles", isDirectory: true)
+    }
+
+    // MARK: - Init
+
+    init() {
+        self.selectedVoice = UserDefaults.standard.string(forKey: "supertonicSelectedVoice") ?? "M1"
+    }
+
+    // MARK: - Public interface
+
+    /// Synthesizes `text` using the selected voice and plays back the audio.
+    /// Downloads ONNX models from HuggingFace on first call (~200MB one-time).
+    /// Throws on network, model-loading, or synthesis errors.
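+    ///
+    /// Illustrative use (the voice ID is just an example):
+    ///
+    ///     let voice = SupertonicTTSClient()
+    ///     voice.selectedVoice = "F2"
+    ///     try await voice.speakText("Models are cached after the first run.")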
+    func speakText(_ text: String) async throws {
+        stopPlayback()
+
+        let session = UUID()
+        activePlaybackSession = session
+
+        // Download models and the selected voice style file if not already on disk
+        try await ensureModelsAndVoiceDownloaded(voiceId: selectedVoice)
+        guard activePlaybackSession == session else { return }
+
+        // Load the ONNX engine (cached in memory after first load)
+        let tts = try loadEngineIfNeeded()
+        guard activePlaybackSession == session else { return }
+
+        // Synthesize — runs on CPU/ANE via ONNX Runtime, fast enough for main thread
+        let voiceStylePath = Self.voiceStylesDir.appendingPathComponent("\(selectedVoice).json").path
+        let result = try tts.synthesize(text: text, lang: "en", voiceStylePath: voiceStylePath, speed: 1.05)
+        guard activePlaybackSession == session else { return }
+
+        let samples = result.wav
+        guard !samples.isEmpty else {
+            throw NSError(domain: "SupertonicTTS", code: -1,
+                          userInfo: [NSLocalizedDescriptionKey: "Supertonic returned empty audio"])
+        }
+
+        let sampleRate = tts.sampleRate
+
+        // Build a Float32 mono AVAudioPCMBuffer from the raw samples
+        guard let format = AVAudioFormat(
+            commonFormat: .pcmFormatFloat32,
+            sampleRate: Double(sampleRate),
+            channels: 1,
+            interleaved: false
+        ) else {
+            throw NSError(domain: "SupertonicTTS", code: -2,
+                          userInfo: [NSLocalizedDescriptionKey: "Could not create audio format"])
+        }
+
+        let frameCount = AVAudioFrameCount(samples.count)
+        guard let buffer = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: frameCount),
+              let channelData = buffer.floatChannelData?[0] else {
+            throw NSError(domain: "SupertonicTTS", code: -3,
+                          userInfo: [NSLocalizedDescriptionKey: "Could not allocate audio buffer"])
+        }
+
+        buffer.frameLength = frameCount
+        for i in 0..<samples.count { channelData[i] = samples[i] }
+
+        // Route the buffer through a fresh engine + player node
+        let engine = AVAudioEngine()
+        let player = AVAudioPlayerNode()
+        engine.attach(player)
+        engine.connect(player, to: engine.mainMixerNode, format: format)
+        try engine.start()
+
+        audioEngine = engine
+        playerNode = player
+        isPlaying = true
+        player.play()
+
+        // Suspend until the scheduled buffer finishes (or the session is superseded)
+        await withCheckedContinuation { (continuation: CheckedContinuation<Void, Never>) in
+            player.scheduleBuffer(buffer) { [weak self] in
+                Task { @MainActor [weak self] in
+                    guard let self else { continuation.resume(); return }
+                    if self.activePlaybackSession == session {
+                        self.isPlaying = false
+                        self.tearDownAudioEngine()
+                    }
+                    continuation.resume()
+                }
+            }
+        }
+    }
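+
+    // Cancellation model: each speakText() call mints a session UUID and bails out
+    // at every await point if a newer call (or stopPlayback) has replaced it.
+    // Sketch, assuming two overlapping calls on the same client:
+    //
+    //     Task { try await voice.speakText("first") }   // session A starts
+    //     Task { try await voice.speakText("second") }  // session B replaces A;
+    //                                                   // A returns at its next guard
+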
+    /// Stops any in-progress synthesis or playback immediately.
+    func stopPlayback() {
+        activePlaybackSession = nil
+        tearDownAudioEngine()
+        isPlaying = false
+    }
+
+    // MARK: - Model download
+
+    private func ensureModelsAndVoiceDownloaded(voiceId: String) async throws {
+        let fm = FileManager.default
+        try fm.createDirectory(at: Self.onnxModelDir, withIntermediateDirectories: true)
+        try fm.createDirectory(at: Self.voiceStylesDir, withIntermediateDirectories: true)
+
+        let requiredOnnxFiles = [
+            "duration_predictor.onnx",
+            "text_encoder.onnx",
+            "vector_estimator.onnx",
+            "vocoder.onnx",
+            "tts.json",
+            "unicode_indexer.json",
+        ]
+
+        for filename in requiredOnnxFiles {
+            let destinationURL = Self.onnxModelDir.appendingPathComponent(filename)
+            if !fm.fileExists(atPath: destinationURL.path) {
+                print("⬇️ Supertonic: downloading \(filename)...")
+                try await downloadFileFromHuggingFace(remotePath: "onnx/\(filename)", to: destinationURL)
+            }
+        }
+
+        let voiceStyleDestinationURL = Self.voiceStylesDir.appendingPathComponent("\(voiceId).json")
+        if !fm.fileExists(atPath: voiceStyleDestinationURL.path) {
+            print("⬇️ Supertonic: downloading voice style \(voiceId)...")
+            try await downloadFileFromHuggingFace(remotePath: "voice_styles/\(voiceId).json",
+                                                  to: voiceStyleDestinationURL)
+        }
+    }
+
+    private func downloadFileFromHuggingFace(remotePath: String, to destinationURL: URL) async throws {
+        guard let downloadURL = URL(string: "\(Self.huggingFaceBaseURL)/\(remotePath)") else {
+            throw NSError(domain: "SupertonicTTS", code: -4,
+                          userInfo: [NSLocalizedDescriptionKey: "Invalid HuggingFace URL for \(remotePath)"])
+        }
+
+        let (tempURL, response) = try await URLSession.shared.download(from: downloadURL)
+
+        guard let httpResponse = response as? HTTPURLResponse,
+              (200...299).contains(httpResponse.statusCode) else {
+            throw NSError(domain: "SupertonicTTS", code: -5,
+                          userInfo: [NSLocalizedDescriptionKey: "Download failed for \(remotePath)"])
+        }
+
+        try FileManager.default.moveItem(at: tempURL, to: destinationURL)
+    }
+
+    // MARK: - Engine loading
+
+    private func loadEngineIfNeeded() throws -> SupertonicTTS {
+        if let loadedEngine { return loadedEngine }
+
+        print("🔧 Supertonic: loading ONNX engine...")
+        let engine = try supertonicLoadEngine(onnxDir: Self.onnxModelDir.path)
+        loadedEngine = engine
+        print("✅ Supertonic: engine ready")
+        return engine
+    }
+
+    // MARK: - Audio teardown
+
+    private func tearDownAudioEngine() {
+        playerNode?.stop()
+        audioEngine?.stop()
+        playerNode = nil
+        audioEngine = nil
+    }
+}
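+
+// Troubleshooting sketch (not wired into the app UI): force a re-download by
+// removing the cache directory this client writes into.
+//
+//     let cache = FileManager.default
+//         .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
+//         .appendingPathComponent("Clicky/models/supertonic")
+//     try? FileManager.default.removeItem(at: cache)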