A TypeScript CLI tool for generating audiobooks from story scripts using Text-to-Speech APIs with multi-speaker support.
- Multi-speaker support: Assign different voices to different characters
- Smart caching: Only regenerate segments when text or voice settings change
- Resume capability: Continue interrupted generations from where they left off
- Iterative workflow: Easily refine voice styles without regenerating unchanged segments
- Progress tracking: Visual progress bars and time estimates
- Manifest export: Get timestamps and metadata for each segment
- Configurable: Full control over voice settings, audio format, and API parameters
# Clone the repository
git clone <repo-url>
cd audiobook-gemini-multi
# Install dependencies
pnpm install
# Set up your API key
echo "GEMINI_API_KEY=your_api_key_here" > .env

# 1. Initialize a new project with config based on your story
pnpm run init examples/story.txt
# 2. Analyze the story to identify characters (optional)
pnpm run analyze story.txt
# 3. Edit config.json to customize voices (optional)
# 4. Generate the audiobook
pnpm run generate examples/story.txt
# 5. Find your audiobook in the output directory
ls output/

The generator supports two story formats:
[NARRATOR] Once upon a time in a land far away...
[CHARACTER1] Hello there! How are you today?
[CHARACTER2] I'm doing well, thank you for asking.
NARRATOR: Once upon a time in a land far away...
CHARACTER1: Hello there! How are you today?
CHARACTER2: I'm doing well, thank you for asking.
Lines starting with # or // are treated as comments and ignored.
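
The two formats and the comment rule above can be sketched as a small line parser. This is a minimal illustration, not the tool's actual parser: `ParsedLine`, `parseStoryLine`, and `parseStory` are hypothetical names, and the rule that colon-format speaker names are uppercase is an assumption based on the examples.

```typescript
// Hypothetical types and names; the tool's real parser may differ.
interface ParsedLine {
  speaker: string;
  text: string;
}

// Bracket format: [SPEAKER] text
const BRACKET = /^\[([^\]]+)\]\s*(.*)$/;
// Colon format: SPEAKER: text (assumes uppercase speaker names, as in the examples)
const COLON = /^([A-Z][A-Z0-9_ ]*):\s*(.*)$/;

function parseStoryLine(line: string): ParsedLine | null {
  const trimmed = line.trim();
  // Blank lines and comment lines (# or //) produce no segment.
  if (trimmed === "" || trimmed.startsWith("#") || trimmed.startsWith("//")) {
    return null;
  }
  const match = BRACKET.exec(trimmed) ?? COLON.exec(trimmed);
  if (!match) return null; // Untagged lines are skipped in this sketch.
  const [, speaker = "", text = ""] = match;
  return { speaker: speaker.trim(), text: text.trim() };
}

function parseStory(source: string): ParsedLine[] {
  return source
    .split(/\r?\n/)
    .map((line) => parseStoryLine(line))
    .filter((parsed): parsed is ParsedLine => parsed !== null);
}
```

Trying the bracket pattern first means a line such as `[NARRATOR] He said: hello` is not misread as colon format.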
Analyze a story text file to identify characters and their likely gender using AI.
pnpm run analyze story.txt
# Options:
# -m, --model <model> Model to use in provider:model format (default: gemini:gemini-3-preview-flash)
# --no-narrator Exclude NARRATOR from analysis
# --suggest-voices Suggest voices based on character genders
# --update-config [path] Update config file with suggested voices (default: ./config.json)
# --json Output results as JSON
# --prompt-only Show the AI prompt without calling the API
# -v, --verbose Show token usage and model details

This command uses AI to analyze your text and identify:
- Character names (in UPPERCASE, suitable for speaker tags)
- Gender classification: `female`, `male`, or `neutral`
- Confidence level: `high`, `medium`, or `low`
- Brief character descriptions when detectable
Supported Providers:
| Provider | Default Model | API Key Variable |
|---|---|---|
| `gemini` | `gemini-3-preview-flash` | `GEMINI_API_KEY` |
| `grok` | `grok-4` | `XAI_API_KEY` |
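
As a rough sketch of how the `provider:model` value passed to `-m` might be interpreted, the function below splits on the first colon and, when no provider prefix is given, guesses the provider from the start of the model name. `parseModelSpec` and `ModelSpec` are hypothetical names, and falling back to `gemini` for unrecognized names is an assumption, not documented behavior.

```typescript
type Provider = "gemini" | "grok";

interface ModelSpec {
  provider: Provider;
  model: string;
}

const PROVIDERS: readonly Provider[] = ["gemini", "grok"];
const DEFAULT_PROVIDER: Provider = "gemini"; // assumption: matches the default model's provider

function parseModelSpec(spec: string): ModelSpec {
  const colon = spec.indexOf(":");
  if (colon !== -1) {
    // Explicit form: provider:model
    const provider = spec.slice(0, colon);
    const model = spec.slice(colon + 1);
    const known = PROVIDERS.find((p) => p === provider);
    if (known === undefined) throw new Error(`Unknown provider: ${provider}`);
    if (model === "") throw new Error("Model name is missing");
    return { provider: known, model };
  }
  // Short form: a model name that starts with a provider name selects that provider.
  const detected = PROVIDERS.find((p) => spec.startsWith(p));
  return { provider: detected ?? DEFAULT_PROVIDER, model: spec };
}
```

With this sketch, `grok:grok-4` and `grok-4` both resolve to the `grok` provider with model `grok-4`.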
Examples:
# Use default (gemini:gemini-3-preview-flash)
pnpm run analyze story.txt
# Use Grok (xAI) provider
pnpm run analyze story.txt -m grok:grok-4
# Use specific Gemini model
pnpm run analyze story.txt -m gemini:gemini-3-pro
# Model names starting with provider name auto-detect provider
pnpm run analyze story.txt -m grok-4
# Output as JSON for scripting
pnpm run analyze story.txt --json
# Preview the prompt without calling the API
pnpm run analyze story.txt --prompt-only
# Suggest voices based on character genders
pnpm run analyze story.txt --suggest-voices
# Analyze and automatically update config.json with voice suggestions
pnpm run analyze story.txt --update-config
# Update a specific config file
pnpm run analyze story.txt --update-config ./my-config.json

The output includes a ready-to-use speaker list for the convert command:
Speaker list for convert command:
pnpm run convert story.txt -s "NARRATOR,ALICE,BOB"

Voice Suggestions:
When using --suggest-voices, the command will suggest appropriate Gemini voices based on each character's detected gender:
- Female characters → Female voices (Zephyr, Kore, Leda, Aoede, etc.)
- Male characters → Male voices (Puck, Charon, Fenrir, Orus, etc.)
- Neutral characters → Neutral voices (Pulcherrima, Achird, Vindemiatrix)
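
The gender-to-voice mapping above could be implemented along these lines. This is a sketch only: `suggestVoices` is a hypothetical name, the pools contain just the voices listed above, and cycling through each pool in order is an assumption about how distinct voices are assigned.

```typescript
type Gender = "female" | "male" | "neutral";

// Voice pools taken from the list above; the real tool may know more voices.
const VOICE_POOLS: Record<Gender, readonly string[]> = {
  female: ["Zephyr", "Kore", "Leda", "Aoede"],
  male: ["Puck", "Charon", "Fenrir", "Orus"],
  neutral: ["Pulcherrima", "Achird", "Vindemiatrix"],
};

function suggestVoices(
  characters: ReadonlyArray<{ name: string; gender: Gender }>,
): Map<string, string> {
  // Track the next unused voice per gender so characters get distinct voices
  // until a pool is exhausted, then wrap around (assumption).
  const nextIndex: Record<Gender, number> = { female: 0, male: 0, neutral: 0 };
  const suggestions = new Map<string, string>();
  for (const { name, gender } of characters) {
    const pool = VOICE_POOLS[gender];
    const voice = pool[nextIndex[gender] % pool.length];
    nextIndex[gender] += 1;
    if (voice === undefined) continue; // unreachable: pools are non-empty
    suggestions.set(name, voice);
  }
  return suggestions;
}
```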
Example output with --suggest-voices:
=== Voice Suggestions ===
Female Characters:
ALICE → Zephyr (Bright, Higher pitch)
MARY → Kore (Firm, Middle pitch)
Male Characters:
BOB → Puck (Upbeat, Middle pitch)
Neutral Characters:
NARRATOR → Pulcherrima (Forward, Middle pitch)

When using --update-config, the suggested voices are automatically written to your config file, ready for audiobook generation.
### `convert <inputFile>`
Convert plain text or prose to speaker-tagged story format using AI.
```bash
pnpm run convert novel.txt
# Options:
# -o, --output <path> Output file path (default: input_converted.txt)
# -f, --format <type> Output format: bracket or colon (default: bracket)
# -s, --speakers <names> Comma-separated list of speaker names to use
# --no-narrator Exclude NARRATOR tag (dialogue only)
# --prompt-only Show the conversion prompt without calling the API
# -v, --verbose Verbose output
```

This command uses the Gemini LLM to analyze your text and convert it into the speaker-tagged format required for audiobook generation. It will:
- Identify characters and dialogue
- Add NARRATOR tags for descriptive text
- Split content into appropriate segments
Example:
# Convert a novel chapter
pnpm run convert chapter1.txt -o chapter1_story.txt
# Convert with specific speakers
pnpm run convert dialogue.txt -s "JOHN,MARY,NARRATOR"
# Use colon format instead of brackets
pnpm run convert story.txt -f colon
# Preview the prompt without calling the API
pnpm run convert story.txt --prompt-only

Initialize a new project with default configuration.
pnpm run setup
# Options:
# -o, --output <path> Output directory (default: "./output")
# -f, --force Overwrite existing config

Create a configuration file based on speakers found in a story.
pnpm run init story.txt
# Options:
# -o, --output <path> Config output path (default: "./config.json")
# -f, --force Overwrite existing config

Generate a complete audiobook from a story file.
pnpm run generate story.txt
# Options:
# -c, --config <path> Path to config file (default: "./config.json")
# -o, --output <path> Output directory (default: "./output")
# -f, --force Force regeneration (ignore cache)
# -v, --verbose Verbose output
# -d, --dry-run Show what would be done without generating

Generate a preview with only the first N segments.
pnpm run preview story.txt -n 5
# Options:
# -c, --config <path> Path to config file
# -o, --output <path> Output directory
# -n, --segments <number> Number of segments to preview (default: 5)
# -s, --start <number> Start from segment index (default: 0)
# --speaker <name> Preview only specific speaker
# -f, --force Force regeneration
# -v, --verbose Verbose output

Regenerate segments that have changed style prompts.
pnpm run update-styles story.txt
# Options:
# -c, --config <path> Path to config file
# -o, --output <path> Output directory
# -s, --speakers <names> Specific speakers to update (comma-separated)
# -f, --force Force update even if unchanged
# -v, --verbose Verbose output

Clear cache and generated files.
pnpm run clean --force
# Options:
# -o, --output <path> Output directory
# --cache-only Only clear cache, keep generated audiobooks
# --output-only Only clear output files, keep cache
# -f, --force Don't ask for confirmation

Show project and cache information.
pnpm run info story.txt
# Options:
# -c, --config <path> Path to config file
# -o, --output <path> Output directory

List available TTS voices.
pnpm run voices

If you have a story in plain prose format (like a novel or script), use the convert command first:
# 1. Convert your plain text to speaker-tagged format
pnpm run convert my-novel.txt -o my-story.txt
# 2. Review and edit the converted file if needed
# The AI does a good job but you may want to adjust speaker assignments
# 3. Initialize config based on detected speakers
pnpm run init my-story.txt
# 4. Customize voices in config.json
# 5. Generate the audiobook
pnpm run generate my-story.txt

The config.json file controls all aspects of audiobook generation:
{
"version": "1.0.0",
"provider": {
"name": "gemini",
"apiKey": "${GEMINI_API_KEY}",
"model": "gemini-2.5-pro-preview-tts",
"rateLimit": 60,
"maxRetries": 3,
"timeout": 60000
},
"audio": {
"format": "wav",
"sampleRate": 24000,
"bitDepth": 16,
"silencePadding": 500,
"normalize": false
},
"voices": [
{
"name": "NARRATOR",
"voiceName": "Zephyr",
"stylePrompt": "Calm, measured storytelling voice",
"speed": 1.0
},
{
"name": "CHARACTER1",
"voiceName": "Kore",
"stylePrompt": "Young, energetic woman",
"speed": 1.0
}
],
"defaultVoice": {
"voiceName": "Zephyr",
"stylePrompt": "Natural speaking voice",
"speed": 1.0
},
"globalSeed": 12345
}

| Option | Type | Description |
|---|---|---|
| `name` | string | TTS provider name (`"gemini"`) |
| `apiKey` | string | API key (supports env vars like `${GEMINI_API_KEY}`) |
| `model` | string | Model to use for generation |
| `rateLimit` | number | Max requests per minute |
| `maxRetries` | number | Retry count for failed requests |
| `timeout` | number | Request timeout in milliseconds |
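
The `${GEMINI_API_KEY}` placeholder in `apiKey` implies some form of environment variable substitution. A minimal sketch of how that could work follows; `expandEnvVars` is a hypothetical helper, and treating an unset variable as an error is an assumption.

```typescript
// Replace every ${NAME} placeholder with the matching entry from `env`.
// Hypothetical helper; the tool's own substitution rules may differ.
function expandEnvVars(
  value: string,
  env: Record<string, string | undefined>,
): string {
  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const resolved = env[name];
    if (resolved === undefined) {
      // Assumption: failing loudly is preferable to sending an empty API key.
      throw new Error(`Environment variable ${name} is not set`);
    }
    return resolved;
  });
}
```

In a Node.js process this would typically be called as `expandEnvVars(config.provider.apiKey, process.env)`.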
| Option | Type | Description |
|---|---|---|
| `format` | string | Output format: `"wav"`, `"mp3"`, `"ogg"`, `"flac"` |
| `sampleRate` | number | Sample rate in Hz (default: 24000) |
| `bitDepth` | number | Bit depth: 16, 24, or 32 |
| `silencePadding` | number | Silence between segments in ms |
| `normalize` | boolean | Normalize audio levels |
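
To make `silencePadding`, `sampleRate`, and `bitDepth` concrete, the sketch below works out how many bytes of raw PCM silence a given padding corresponds to. The function names are hypothetical, and mono audio is an assumption, since the config does not mention a channel count.

```typescript
// Number of PCM bytes needed for `paddingMs` of silence.
// Assumes uncompressed PCM and mono audio unless `channels` says otherwise.
function silenceByteLength(
  paddingMs: number,
  sampleRate: number,
  bitDepth: number,
  channels = 1,
): number {
  const samples = Math.round((paddingMs / 1000) * sampleRate);
  return samples * (bitDepth / 8) * channels;
}

// Signed PCM (16, 24, or 32 bit) encodes silence as all-zero bytes,
// and a new Uint8Array is zero-filled.
function createSilence(
  paddingMs: number,
  sampleRate: number,
  bitDepth: number,
  channels = 1,
): Uint8Array {
  return new Uint8Array(silenceByteLength(paddingMs, sampleRate, bitDepth, channels));
}

// Default config: 500 ms at 24000 Hz, 16-bit mono
// -> 12000 samples -> 24000 bytes
const defaultPaddingBytes = silenceByteLength(500, 24000, 16);
```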
| Option | Type | Description |
|---|---|---|
| `name` | string | Speaker name (must match tags in story) |
| `voiceName` | string | Gemini voice name |
| `stylePrompt` | string | Description of voice style/emotion |
| `speed` | number | Speaking speed (0.5-2.0) |
| `pitch` | number | Pitch adjustment (-1.0 to 1.0) |
| `seed` | number | Voice seed for consistency |
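
The ranges in the voice options table can be checked before a long generation run starts. The following is an illustrative validator, not part of the tool: `VoiceConfig` and `validateVoice` are hypothetical names, and treating every field except `name` and `voiceName` as optional is an assumption.

```typescript
interface VoiceConfig {
  name: string;
  voiceName: string;
  stylePrompt?: string;
  speed?: number;
  pitch?: number;
  seed?: number;
}

// Returns a list of human-readable problems; an empty list means the entry looks valid.
function validateVoice(voice: VoiceConfig): string[] {
  const problems: string[] = [];
  if (voice.name.trim() === "") problems.push("name must not be empty");
  if (voice.voiceName.trim() === "") problems.push("voiceName must not be empty");
  if (voice.speed !== undefined && (voice.speed < 0.5 || voice.speed > 2.0)) {
    problems.push(`speed ${voice.speed} is outside 0.5-2.0`);
  }
  if (voice.pitch !== undefined && (voice.pitch < -1.0 || voice.pitch > 1.0)) {
    problems.push(`pitch ${voice.pitch} is outside -1.0 to 1.0`);
  }
  return problems;
}
```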
- Zephyr - Balanced, clear narrator voice
- Puck - Light, playful voice
- Charon - Deep, authoritative voice
- Kore - Warm, friendly female voice
- Fenrir - Strong, dramatic voice
- Leda - Soft, gentle voice
- Orus - Mature, wise voice
- Aoede - Musical, expressive voice
- Callirrhoe - Clear, articulate voice
- Autonoe - Energetic, youthful voice
- Enceladus - Powerful, resonant voice
- Iapetus - Deep, thoughtful voice
- Umbriel - Mysterious, atmospheric voice
- Algenib - Bright, engaging voice
# Create config from your story
pnpm run init my-story.txt
# Review and edit config.json to customize voices
# Then generate
pnpm run generate my-story.txt

# Edit config.json - change a character's stylePrompt
# Regenerate only changed segments
pnpm run update-styles my-story.txt
# Or preview changes first
pnpm run preview my-story.txt --speaker CHARACTER1

If generation is interrupted, simply run the same command again:
pnpm run generate my-story.txt
# Will pick up from where it left off

# Force full regeneration (ignore cache)
pnpm run generate my-story.txt --force

After generation, you'll find these files in the output directory:
output/
├── .audiobook-cache/ # Cache directory
│ └── a1b2c3d4/ # Story-specific cache (8-char hash of filename)
│ ├── manifest.json # Cache manifest for this story
│ ├── debug.log # Debug logs for TTS requests
│ └── segments/ # Individual segment audio files
│ ├── seg_0000_abc123.wav
│ ├── seg_0001_def456.wav
│ └── ...
├── my-story_20240115_103000_audiobook.wav # Final stitched audiobook (with timestamp)
└── my-story_20240115_103000_manifest.json # Manifest with timestamps
Each story file gets its own cache folder based on an 8-character hash of the filename. When running preview or generate, you'll see:
ℹ Using cache folder: a1b2c3d4

This means different stories won't share cache, and you can iterate on multiple stories independently.
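
For illustration, an 8-character folder name can be derived from a filename as shown below. The hash function the tool actually uses is not documented here, so this sketch substitutes a 32-bit FNV-1a-style hash over the filename's UTF-16 code units; `cacheFolderName` is a hypothetical name, and hashing only the base name (not the full path) is an assumption.

```typescript
// Derive a stable 8-hex-character folder name from a story file path.
// Stand-in algorithm: FNV-1a-style 32-bit hash (the tool's real hash may differ).
function cacheFolderName(storyPath: string): string {
  // Assumption: only the file name matters, not the directory it lives in.
  const fileName = storyPath.split(/[\\/]/).pop() ?? storyPath;
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < fileName.length; i++) {
    hash ^= fileName.charCodeAt(i);
    // Multiply by the FNV prime and keep the result as an unsigned 32-bit value.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  // 32 bits always fit in 8 hex digits; pad so short values keep the same width.
  return hash.toString(16).padStart(8, "0");
}
```

The same filename always maps to the same folder, which is what lets an interrupted run find its earlier segments again.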
The manifest file contains metadata and timestamps for each segment:
{
"version": "1.0.0",
"title": "my-story",
"sourceFile": "my-story.txt",
"outputFile": "my-story_audiobook.wav",
"totalDurationMs": 180000,
"format": "wav",
"sampleRate": 24000,
"speakers": ["NARRATOR", "CHARACTER1"],
"segments": [
{
"index": 0,
"speaker": "NARRATOR",
"text": "Once upon a time...",
"startMs": 0,
"endMs": 5000,
"durationMs": 5000,
"audioFile": "seg_0000_abc123.wav"
}
],
"generatedAt": "2024-01-15T10:30:00.000Z",
"provider": "gemini"
}

You can also use the library programmatically:
import {
parseFile,
loadConfig,
createTTSProvider,
generateSegmentAudio,
stitchAudioFiles,
} from 'audiobook-gemini-multi';
// Parse story
const story = await parseFile('story.txt');
// Load config
const config = await loadConfig('config.json');
// Initialize provider
const provider = createTTSProvider(config);
await provider.initialize();
// Generate segments, collecting the output paths for stitching
// (assumes generateSegmentAudio writes the audio to the given path)
const audioFiles: string[] = [];
for (const segment of story.segments) {
  const outputPath = `output/${segment.id}.wav`;
  await generateSegmentAudio(provider, segment, config, outputPath);
  audioFiles.push(outputPath);
}
// Stitch together
await stitchAudioFiles(audioFiles, 'output/audiobook.wav');

| Variable | Description |
|---|---|
| `GEMINI_API_KEY` | Your Google Gemini API key (used for TTS, text conversion, and analyze with Gemini provider) |
| `XAI_API_KEY` | Your xAI API key (used for analyze command with Grok provider) |
Make sure you have set the GEMINI_API_KEY environment variable or configured it in config.json.
The generator automatically handles rate limiting, but if you see persistent errors, try reducing the rateLimit setting in config.json.
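
To see what lowering `rateLimit` does, it helps to think of the setting as a minimum spacing between requests: 60 requests per minute means at most one request per 1000 ms. The sketch below illustrates that idea only; `RateLimiter` is a hypothetical class, and the generator's actual throttling strategy may differ.

```typescript
// Spaces requests evenly so that no more than `requestsPerMinute` are sent per minute.
class RateLimiter {
  private readonly intervalMs: number;
  private nextSlot = 0;

  constructor(requestsPerMinute: number) {
    if (requestsPerMinute <= 0) throw new Error("requestsPerMinute must be positive");
    this.intervalMs = 60000 / requestsPerMinute;
  }

  // Reserves the next free slot and returns how many milliseconds
  // the caller should wait, given the current time `now` (in ms).
  reserve(now: number): number {
    const start = Math.max(now, this.nextSlot);
    this.nextSlot = start + this.intervalMs;
    return start - now;
  }
}
```

A caller would wait for the returned delay (for example with `setTimeout`) before sending each request. Halving `rateLimit` doubles the spacing.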
Check the verbose output with -v flag for more details.

The generator automatically retries failed requests with an incremented seed value. This handles:
- Transient network errors (connection reset, timeout, etc.)
- API responses with `finishReason: OTHER`
- Internal server errors (500)

You'll see warning messages when retries happen:
⚠️ Generation failed with OTHER [seg_0026_8ef0dbfd], retrying with seed 101 (attempt 2/4)

Debug logs are written to the cache folder's debug.log file for troubleshooting.
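
The retry behavior described above can be sketched as a loop that bumps the seed by one on each failed attempt. This is an illustration under assumptions: `generateWithRetry` is a hypothetical helper, it retries on any thrown error rather than only the failure types listed, and the warning text only approximates the tool's message.

```typescript
// Calls `attempt` with baseSeed, baseSeed + 1, ... until it succeeds
// or maxRetries retries have been used (maxRetries + 1 attempts in total).
async function generateWithRetry<T>(
  attempt: (seed: number) => Promise<T>,
  baseSeed: number,
  maxRetries: number,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i <= maxRetries; i++) {
    const seed = baseSeed + i;
    try {
      return await attempt(seed);
    } catch (error) {
      lastError = error;
      if (i < maxRetries) {
        console.warn(
          `Generation failed, retrying with seed ${seed + 1} (attempt ${i + 2}/${maxRetries + 1})`,
        );
      }
    }
  }
  // Every attempt failed: surface the most recent error to the caller.
  throw lastError;
}
```

With a base seed of 100 and `maxRetries` set to 3, the first retry uses seed 101 and is reported as attempt 2/4, which matches the shape of the warning shown above.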
Individual segments are regenerated when:
- The story text for that segment changes
- The voice configuration for that speaker changes
Note: Changing config.json or the story file structure no longer invalidates the entire cache. Only segments with actual text or voice changes are regenerated. The system also automatically recovers cached audio files if the manifest is lost or corrupted.
Use --force only when you explicitly want to regenerate everything.
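
One way to picture the per-segment invalidation rule is a cache key built from exactly the inputs that affect the audio: the segment text and the speaker's voice settings. The sketch below is an assumption about the mechanism, not the tool's implementation; `segmentCacheKey` and `needsRegeneration` are hypothetical names, and the default values for omitted settings are illustrative.

```typescript
interface VoiceSettings {
  voiceName: string;
  stylePrompt?: string;
  speed?: number;
  pitch?: number;
  seed?: number;
}

// Builds a key from everything that influences the generated audio.
// Unrelated config changes (for example the output format) do not affect it.
function segmentCacheKey(text: string, voice: VoiceSettings): string {
  return JSON.stringify([
    text,
    voice.voiceName,
    voice.stylePrompt ?? "",
    voice.speed ?? 1,
    voice.pitch ?? 0,
    voice.seed ?? null,
  ]);
}

// A segment is regenerated only when its key differs from the cached one
// (or when nothing has been cached yet).
function needsRegeneration(
  cachedKey: string | undefined,
  text: string,
  voice: VoiceSettings,
): boolean {
  return cachedKey !== segmentCacheKey(text, voice);
}
```

Under this scheme, editing one character's `stylePrompt` changes the keys of only that character's segments, so every other segment is reused from the cache.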
# Type check
pnpm run type-check
# Build for production
pnpm run build
# Run with tsx (development)
pnpm run dev generate examples/story.txt

ISC