Merged
101 changes: 101 additions & 0 deletions BACKLOG.md
@@ -0,0 +1,101 @@
# js-tts-wrapper Engine & Feature Backlog

Reference: [speech-sdk](https://github.com/Jellypod-Inc/speech-sdk) (`@speech-sdk/core`)

## Completed

- [x] Cartesia engine (`sonic-3`, `sonic-2`) with audio tag / emotion-to-SSML support
- [x] Deepgram engine (`aura-2`) with static voice list
- [x] ElevenLabs v3 audio tag passthrough (`[laugh]`, `[sigh]`, etc.)
- [x] Generic property pass-through via `properties` / `propertiesJson`
- [x] Hume engine (`octave-2`, `octave-1`) with streaming via separate `/tts/stream/file` endpoint
- [x] xAI engine (`grok-tts`) with native audio tag passthrough, language config
- [x] Fish Audio engine (`s2-pro`) with audio tag passthrough, model-as-header pattern
- [x] Mistral engine (`voxtral-mini-tts-2603`) with SSE streaming, base64 chunk parsing
- [x] Murf engine (`GEN2`, `FALCON`) with per-model endpoints: base64 responses for GEN2, binary for FALCON
- [x] Unreal Speech engine with a two-step URI flow for non-streaming and direct binary streaming
- [x] Resemble engine with base64 JSON for non-streaming and direct binary streaming

## New Engines to Add

### Lower Priority (Open-Source / Niche)

| Engine | Models | Key Features | Notes |
|--------|--------|-------------|-------|
| **fal** | `f5-tts`, `kokoro`, `dia-tts`, `orpheus-tts`, `index-tts-2` | Voice cloning, open-source | No streaming, many sub-models |
| **Google Gemini TTS** | `gemini-2.5-flash-preview-tts`, `gemini-2.5-pro-preview-tts` | Pseudo-streaming, 23 languages | Different from existing Google Cloud TTS |

## Cross-Cutting Features

### Audio Tags (Cross-Provider Abstraction)

Unified `[tag]` syntax mapped to provider-specific representations:
- **ElevenLabs v3** — native passthrough (done)
- **Cartesia sonic-3** — emotions to `<emotion value="..."/>` SSML (done)
- **OpenAI gpt-4o-mini-tts** — tags to natural language `instructions`
- **xAI grok-tts** — native passthrough
- **Fish Audio s2-pro** — native passthrough
- **All others** — strip tags with warnings
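For the "strip with warnings" fallback, a minimal sketch could remove unsupported `[tag]` markers and warn once per distinct tag. `stripAudioTags` is a hypothetical helper name, not part of the shipped API:

```typescript
// Sketch: strip unsupported [laugh]/[sigh]-style audio tags, warning once per tag.
const AUDIO_TAG_RE = /\[(\w+)\]/g;

function stripAudioTags(
  text: string,
  warn: (tag: string) => void = (tag) => console.warn(`Audio tag [${tag}] not supported; stripping`),
): string {
  const seen = new Set<string>();
  return text
    .replace(AUDIO_TAG_RE, (_match, tag: string) => {
      if (!seen.has(tag)) {
        seen.add(tag);
        warn(tag);
      }
      return "";
    })
    .replace(/\s{2,}/g, " ") // collapse whitespace left behind by removed tags
    .trim();
}
```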

### Model-Level Feature Declarations

Add per-model capability metadata (from speech-sdk pattern):
- `streaming` — supports real-time audio streaming
- `audio-tags` — supports `[tag]` syntax
- `inline-voice-cloning` — accepts reference audio inline
- `open-source` — model is open source

Enables runtime capability checks via `hasFeature()`.
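A sketch of what this metadata and check could look like; the model names and feature lists below are illustrative, not the shipped API:

```typescript
// Sketch of per-model capability metadata with a runtime feature check.
type Feature = "streaming" | "audio-tags" | "inline-voice-cloning" | "open-source";

// Illustrative entries only; the real table would cover every engine/model.
const MODEL_FEATURES: Record<string, Feature[]> = {
  "sonic-3": ["streaming", "audio-tags", "inline-voice-cloning"],
  "aura-2": ["streaming"],
  "f5-tts": ["inline-voice-cloning", "open-source"],
};

function hasFeature(model: string, feature: Feature): boolean {
  return MODEL_FEATURES[model]?.includes(feature) ?? false;
}
```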

### Unified Voice Type

**Current**: engine-specific voice IDs
**Proposed**: `string | { url: string } | { audio: string | Uint8Array }`
- `string` — standard voice ID
- `{ url }` — voice cloning from URL
- `{ audio }` — voice cloning from inline audio
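The proposed union can be discriminated at runtime roughly as follows (the helper name is hypothetical):

```typescript
// Sketch of the proposed voice input union and a discriminating helper.
type VoiceInput = string | { url: string } | { audio: string | Uint8Array };

function describeVoice(voice: VoiceInput): "id" | "clone-url" | "clone-audio" {
  if (typeof voice === "string") return "id"; // standard voice ID
  if ("url" in voice) return "clone-url"; // clone from a reference audio URL
  return "clone-audio"; // clone from inline audio data
}
```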

### Voice Cloning Support

Providers that support inline voice cloning:
- Cartesia sonic-3
- Hume octave-2
- Fish Audio s2-pro
- Resemble
- Mistral voxtral-mini-tts-2603
- fal (f5-tts, dia-tts, index-tts-2)

### Streaming Improvements

- [x] Cartesia: true streaming (already pipes response.body)
- [x] Deepgram: true streaming (already pipes response.body)
- [x] ElevenLabs: true streaming (fixed — pipes response.body when not using timestamps)
- [x] Polly: true streaming for MP3/OGG (already pipes AudioStream; WAV requires buffering for header)
- [x] Standardize `synthToBytestream` to return actual streaming responses where supported
- [ ] Google Cloud TTS: SDK returns all audio at once — would need StreamingSynthesize beta API
- [ ] Google Gemini TTS: pseudo-streaming via SSE base64 chunks (new engine, not yet implemented)
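The SSE base64-chunk handling mentioned for Mistral and Gemini can be sketched as below, assuming each `data:` event carries a JSON object with an `audio` field — that field name is an assumption, not a documented schema:

```typescript
// Sketch: extract base64 audio payloads from a buffered SSE response body.
function parseSseAudio(sse: string): Uint8Array[] {
  const chunks: Uint8Array[] = [];
  for (const line of sse.split("\n")) {
    if (!line.startsWith("data:")) continue;
    const payload = line.slice(5).trim();
    if (payload === "[DONE]") break; // common SSE end-of-stream sentinel
    const b64 = JSON.parse(payload).audio; // assumed field name
    if (typeof b64 === "string") {
      chunks.push(Uint8Array.from(atob(b64), (c) => c.charCodeAt(0)));
    }
  }
  return chunks;
}
```

A true streaming implementation would apply the same per-event parsing incrementally instead of buffering the whole body.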

### Tree-Shakeable Subpath Exports

From speech-sdk pattern — add per-provider subpath exports in package.json:
```json
{
"exports": {
".": "./dist/esm/index.js",
"./cartesia": "./dist/esm/engines/cartesia.js",
"./deepgram": "./dist/esm/engines/deepgram.js"
}
}
```
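Assuming the subpaths above, a consumer could then import a single engine without pulling the rest into the bundle:

```typescript
// Illustrative fragment: only the Cartesia engine is bundled.
import { CartesiaTTSClient } from "js-tts-wrapper/cartesia";
```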

### Unified Error Hierarchy

Standardize errors across engines with rich context (statusCode, model, responseBody).
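One possible shape for such an error type — illustrative only, not the shipped API:

```typescript
// Sketch of a shared error class carrying provider context for debugging.
class TTSProviderError extends Error {
  constructor(
    message: string,
    public readonly engine: string,
    public readonly statusCode?: number,
    public readonly model?: string,
    public readonly responseBody?: string,
  ) {
    super(message);
    this.name = "TTSProviderError";
  }
}
```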

## Existing Engine Updates Needed

| Engine | Update Needed |
|--------|--------------|
| **OpenAI** | Add `gpt-4o-mini-tts` model with instructions/audio tag support |
| **Google** | Add Gemini-based TTS alongside existing Cloud TTS |
| **ElevenLabs** | Close issue #24 (already fixed) |
144 changes: 143 additions & 1 deletion README.md
@@ -57,6 +57,15 @@ A JavaScript/TypeScript library that provides a unified API for working with mul
| `espeak-wasm` | `EspeakBrowserTTSClient` | Both | eSpeak NG | `mespeak` (Node.js) or meSpeak.js (browser) |
| `sapi` | `SAPITTSClient` | Node.js | Windows Speech API (SAPI) | None (uses PowerShell) |
| `witai` | `WitAITTSClient` | Both | Wit.ai | None (uses fetch API) |
| `cartesia` | `CartesiaTTSClient` | Both | Cartesia | None (uses fetch API) |
| `deepgram` | `DeepgramTTSClient` | Both | Deepgram | None (uses fetch API) |
| `hume` | `HumeTTSClient` | Both | Hume AI | None (uses fetch API) |
| `xai` | `XAITTSClient` | Both | xAI (Grok) | None (uses fetch API) |
| `fishaudio` | `FishAudioTTSClient` | Both | Fish Audio | None (uses fetch API) |
| `mistral` | `MistralTTSClient` | Both | Mistral AI | None (uses fetch API) |
| `murf` | `MurfTTSClient` | Both | Murf AI | None (uses fetch API) |
| `unrealspeech` | `UnrealSpeechTTSClient` | Both | Unreal Speech | None (uses fetch API) |
| `resemble` | `ResembleTTSClient` | Both | Resemble AI | None (uses fetch API) |

**Factory Name**: Use with `createTTSClient('factory-name', credentials)`
**Class Name**: Use with direct import `import { ClassName } from 'js-tts-wrapper'`
@@ -90,6 +99,15 @@ A JavaScript/TypeScript library that provides a unified API for working with mul
| **SherpaOnnx** | ✅ | Estimated | ❌ | Low |
| **SherpaOnnx-WASM** | ✅ | Estimated | ❌ | Low |
| **SAPI** | ✅ | Estimated | ❌ | Low |
| **Cartesia** | ✅ | Estimated | ❌ | Low |
| **Deepgram** | ✅ | Estimated | ❌ | Low |
| **Hume** | ✅ | Estimated | ❌ | Low |
| **xAI** | ✅ | Estimated | ❌ | Low |
| **Fish Audio** | ✅ | Estimated | ❌ | Low |
| **Mistral** | ✅ | Estimated | ❌ | Low |
| **Murf** | ✅ | Estimated | ❌ | Low |
| **Unreal Speech** | ✅ | Estimated | ❌ | Low |
| **Resemble** | ✅ | Estimated | ❌ | Low |

**Character-Level Timing**: Only ElevenLabs provides precise character-level timing data via the `/with-timestamps` endpoint, enabling the most accurate word highlighting and speech synchronization.

@@ -253,7 +271,7 @@ async function runExample() {
runExample().catch(console.error);
```

The factory supports all engines: `'azure'`, `'google'`, `'polly'`, `'elevenlabs'`, `'openai'`, `'modelslab'`, `'playht'`, `'watson'`, `'witai'`, `'sherpaonnx'`, `'sherpaonnx-wasm'`, `'espeak'`, `'espeak-wasm'`, `'sapi'`, etc.
The factory supports all engines: `'azure'`, `'google'`, `'polly'`, `'elevenlabs'`, `'openai'`, `'modelslab'`, `'playht'`, `'watson'`, `'witai'`, `'sherpaonnx'`, `'sherpaonnx-wasm'`, `'espeak'`, `'espeak-wasm'`, `'sapi'`, `'cartesia'`, `'deepgram'`, `'hume'`, `'xai'`, `'fishaudio'`, `'mistral'`, `'murf'`, `'unrealspeech'`, `'resemble'`, etc.

## Core Functionality

@@ -471,6 +489,15 @@ The following engines **automatically strip SSML tags** and convert to plain tex
- **PlayHT** - SSML tags are removed, plain text is synthesized
- **ModelsLab** - SSML tags are removed, plain text is synthesized
- **SherpaOnnx/SherpaOnnx-WASM** - SSML tags are removed, plain text is synthesized
- **Cartesia** - SSML tags removed; audio tags (`[laugh]`, `[sigh]`, etc.) mapped to `<emotion>` for sonic-3, stripped for others
- **Deepgram** - SSML tags are removed, plain text is synthesized
- **Hume** - SSML tags are removed, plain text is synthesized
- **xAI** - SSML tags are removed; audio tags passed natively for grok-tts
- **Fish Audio** - SSML tags removed; audio tags passed natively for s2-pro
- **Mistral** - SSML tags are removed, plain text is synthesized
- **Murf** - SSML tags are removed, plain text is synthesized
- **Unreal Speech** - SSML tags are removed, plain text is synthesized
- **Resemble** - SSML tags are removed, plain text is synthesized

### Usage Examples

@@ -667,6 +694,15 @@ When disabled, js-tts-wrapper falls back to the lightweight built-in converter (
| OpenAI | ✅ Converted | → SSML → Plain text |
| PlayHT | ✅ Converted | → SSML → Plain text |
| SherpaOnnx | ✅ Converted | → SSML → Plain text |
| Cartesia | ✅ Converted | → SSML → Plain text |
| Deepgram | ✅ Converted | → SSML → Plain text |
| Hume | ✅ Converted | → SSML → Plain text |
| xAI | ✅ Converted | → SSML → Plain text |
| Fish Audio | ✅ Converted | → SSML → Plain text |
| Mistral | ✅ Converted | → SSML → Plain text |
| Murf | ✅ Converted | → SSML → Plain text |
| Unreal Speech | ✅ Converted | → SSML → Plain text |
| Resemble | ✅ Converted | → SSML → Plain text |

### Speech Markdown vs Raw SSML: When to Use Each

@@ -1069,6 +1105,112 @@ await tts.speak('Hello from Windows SAPI!');

> **Note**: This engine is **Windows-only**

### Cartesia

```javascript
import { CartesiaTTSClient } from 'js-tts-wrapper';

const tts = new CartesiaTTSClient({ apiKey: 'your-api-key' });
await tts.setVoice('sonic-3'); // or 'sonic-2'
await tts.speak('Hello from Cartesia!');
```

> Audio tags like `[laugh]`, `[sigh]` are mapped to `<emotion>` SSML for sonic-3, stripped for other models.

### Deepgram

```javascript
import { DeepgramTTSClient } from 'js-tts-wrapper';

const tts = new DeepgramTTSClient({ apiKey: 'your-api-key' });
await tts.setVoice('aura-2-asteria-en');
await tts.speak('Hello from Deepgram!');
```

> Uses a static voice list. Model and voice are combined in the URL parameter.

### Hume AI

```javascript
import { HumeTTSClient } from 'js-tts-wrapper';

const tts = new HumeTTSClient({ apiKey: 'your-api-key' });
await tts.setVoice('ito'); // or any Hume voice name
await tts.speak('Hello from Hume!');
```

> Supports `octave-2` and `octave-1` models. Streaming uses a separate `/tts/stream/file` endpoint.

### xAI (Grok)

```javascript
import { XAITTSClient } from 'js-tts-wrapper';

const tts = new XAITTSClient({ apiKey: 'your-api-key' });
await tts.speak('Hello from xAI!');
```

> Native audio tag passthrough for grok-tts model. Language can be configured via properties.

### Fish Audio

```javascript
import { FishAudioTTSClient } from 'js-tts-wrapper';

const tts = new FishAudioTTSClient({ apiKey: 'your-api-key' });
await tts.setVoice('your-voice-reference-id');
await tts.speak('Hello from Fish Audio!');
```

> The model ID is passed as a header. Audio tags are passed through natively for the s2-pro model.

### Mistral

```javascript
import { MistralTTSClient } from 'js-tts-wrapper';

const tts = new MistralTTSClient({ apiKey: 'your-api-key' });
await tts.speak('Hello from Mistral!');
```

> Uses SSE streaming with base64 audio chunks. Non-streaming returns base64 JSON.

### Murf

```javascript
import { MurfTTSClient } from 'js-tts-wrapper';

const tts = new MurfTTSClient({ apiKey: 'your-api-key' });
await tts.setVoice('en-US-natalie');
await tts.speak('Hello from Murf!');
```

> Two models: GEN2 (base64 response) and FALCON (binary streaming). Uses static voice list.

### Unreal Speech

```javascript
import { UnrealSpeechTTSClient } from 'js-tts-wrapper';

const tts = new UnrealSpeechTTSClient({ apiKey: 'your-api-key' });
await tts.setVoice('Scarlett');
await tts.speak('Hello from Unreal Speech!');
```

> Non-streaming uses a two-step URI-based flow. Streaming returns audio directly.

### Resemble

```javascript
import { ResembleTTSClient } from 'js-tts-wrapper';

const tts = new ResembleTTSClient({ apiKey: 'your-api-key' });
await tts.setVoice('your-voice-id');
await tts.speak('Hello from Resemble!');
```

> Non-streaming returns base64 JSON. Streaming returns raw binary audio.

## API Reference

### Factory Function