Audiobook Generator

A TypeScript CLI tool for generating audiobooks from story scripts using Text-to-Speech APIs with multi-speaker support.

Features

Multi-speaker support: Assign different voices to different characters
Smart caching: Only regenerate segments when text or voice settings change
Resume capability: Continue interrupted generations from where they left off
Iterative workflow: Easily refine voice styles without regenerating unchanged segments
Progress tracking: Visual progress bars and time estimates
Manifest export: Get timestamps and metadata for each segment
Configurable: Full control over voice settings, audio format, and API parameters

Installation

# Clone the repository
git clone <repo-url>
cd audiobook-gemini-multi

# Install dependencies
pnpm install

# Set up your API key
echo "GEMINI_API_KEY=your_api_key_here" > .env

Quick Start

# 1. Initialize a new project with config based on your story
pnpm run init examples/story.txt

# 2. Analyze the story to identify characters (optional)
pnpm run analyze examples/story.txt

# 3. Edit config.json to customize voices (optional)

# 4. Generate the audiobook
pnpm run generate examples/story.txt

# 5. Find your audiobook in the output directory
ls output/

Story Format

The generator supports two story formats:

Bracket Format (Recommended)

[NARRATOR] Once upon a time in a land far away...
[CHARACTER1] Hello there! How are you today?
[CHARACTER2] I'm doing well, thank you for asking.

Colon Format

NARRATOR: Once upon a time in a land far away...
CHARACTER1: Hello there! How are you today?
CHARACTER2: I'm doing well, thank you for asking.

Comments

Lines starting with # or // are treated as comments and ignored.
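
Both tag styles and the comment rule can be recognized with two regular expressions. The following is a minimal sketch, not the tool's actual parser: `parseStoryLine` and `ParsedLine` are hypothetical names, and it assumes speaker tags consist of uppercase letters, digits, and underscores.

```typescript
// Hypothetical result type for one story line; not part of the tool's API.
type ParsedLine =
  | { kind: "segment"; speaker: string; text: string }
  | { kind: "comment" }
  | { kind: "blank" }
  | { kind: "untagged"; text: string };

// Assumption: speaker tags use uppercase letters, digits, and underscores.
const BRACKET = /^\[([A-Z][A-Z0-9_]*)\]\s*(.*)$/;
const COLON = /^([A-Z][A-Z0-9_]*):\s*(.*)$/;

function parseStoryLine(line: string): ParsedLine {
  const trimmed = line.trim();
  if (trimmed === "") return { kind: "blank" };
  // Lines starting with # or // are comments and are ignored.
  if (trimmed.startsWith("#") || trimmed.startsWith("//")) {
    return { kind: "comment" };
  }
  const match = BRACKET.exec(trimmed) ?? COLON.exec(trimmed);
  if (match) {
    const [, speaker = "", text = ""] = match;
    return { kind: "segment", speaker, text };
  }
  // Anything else carries no speaker tag; how the tool treats it is not specified here.
  return { kind: "untagged", text: trimmed };
}
```

Trying the bracket pattern first means a line such as `[NARRATOR] He said: hello` is tagged by its bracket and the later colon is left in the text.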

CLI Commands

`analyze <inputFile>`

Analyze a story text file to identify characters and their likely gender using AI.

pnpm run analyze story.txt
# Options:
#   -m, --model <model>      Model to use in provider:model format (default: gemini:gemini-3-preview-flash)
#   --no-narrator            Exclude NARRATOR from analysis
#   --suggest-voices         Suggest voices based on character genders
#   --update-config [path]   Update config file with suggested voices (default: ./config.json)
#   --json                   Output results as JSON
#   --prompt-only            Show the AI prompt without calling the API
#   -v, --verbose            Show token usage and model details

This command uses AI to analyze your text and identify:

Character names (in UPPERCASE, suitable for speaker tags)
Gender classification: female, male, or neutral
Confidence level: high, medium, or low
Brief character descriptions when detectable

Supported Providers:

| Provider | Default Model | API Key Variable |
| --- | --- | --- |
| `gemini` | `gemini-3-preview-flash` | `GEMINI_API_KEY` |
| `grok` | `grok-4` | `XAI_API_KEY` |

Examples:

# Use default (gemini:gemini-3-preview-flash)
pnpm run analyze story.txt

# Use Grok (xAI) provider
pnpm run analyze story.txt -m grok:grok-4

# Use specific Gemini model
pnpm run analyze story.txt -m gemini:gemini-3-pro

# Model names starting with provider name auto-detect provider
pnpm run analyze story.txt -m grok-4

# Output as JSON for scripting
pnpm run analyze story.txt --json

# Preview the prompt without calling the API
pnpm run analyze story.txt --prompt-only

# Suggest voices based on character genders
pnpm run analyze story.txt --suggest-voices

# Analyze and automatically update config.json with voice suggestions
pnpm run analyze story.txt --update-config

# Update a specific config file
pnpm run analyze story.txt --update-config ./my-config.json
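
The `-m` examples above imply a small resolution rule: an explicit `provider:model` pair wins, otherwise the provider is inferred from the model name's prefix. A sketch of that rule follows. `parseModelSpec` is a hypothetical helper, and the fallback to Gemini for unrecognized names is an assumption rather than documented behavior.

```typescript
// Providers and default models as listed in the Supported Providers table.
const DEFAULT_MODELS = {
  gemini: "gemini-3-preview-flash",
  grok: "grok-4",
} as const;

type Provider = keyof typeof DEFAULT_MODELS;

function isProvider(name: string): name is Provider {
  return Object.prototype.hasOwnProperty.call(DEFAULT_MODELS, name);
}

// Hypothetical helper: resolve a -m value to a provider and model.
function parseModelSpec(spec?: string): { provider: Provider; model: string } {
  if (!spec) return { provider: "gemini", model: DEFAULT_MODELS.gemini };
  const colon = spec.indexOf(":");
  if (colon !== -1) {
    const provider = spec.slice(0, colon);
    const model = spec.slice(colon + 1);
    if (!isProvider(provider)) throw new Error(`Unknown provider: ${provider}`);
    // Assumption: an empty model part falls back to the provider default.
    return { provider, model: model || DEFAULT_MODELS[provider] };
  }
  // No explicit provider: infer it from the model name prefix.
  for (const provider of Object.keys(DEFAULT_MODELS) as Provider[]) {
    if (spec.startsWith(provider)) return { provider, model: spec };
  }
  // Assumption: unrecognized names are treated as Gemini models.
  return { provider: "gemini", model: spec };
}
```

Under these assumptions, `parseModelSpec("grok-4")` and `parseModelSpec("grok:grok-4")` resolve to the same provider and model.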

The output includes a ready-to-use speaker list for the convert command:

Speaker list for convert command:
  pnpm run convert story.txt -s "NARRATOR,ALICE,BOB"

Voice Suggestions:

When using --suggest-voices, the command will suggest appropriate Gemini voices based on each character's detected gender:

Female characters → Female voices (Zephyr, Kore, Leda, Aoede, etc.)
Male characters → Male voices (Puck, Charon, Fenrir, Orus, etc.)
Neutral characters → Neutral voices (Pulcherrima, Achird, Vindemiatrix)
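
How the tool picks a voice within each group is not documented here. One plausible approach is a round-robin over the per-gender pools, sketched below. `suggestVoices` is a hypothetical name, and the pools contain only the voices named in the list above.

```typescript
type Gender = "female" | "male" | "neutral";

// Voice pools copied from the list above (the "etc." entries are omitted).
const VOICE_POOLS: Record<Gender, readonly string[]> = {
  female: ["Zephyr", "Kore", "Leda", "Aoede"],
  male: ["Puck", "Charon", "Fenrir", "Orus"],
  neutral: ["Pulcherrima", "Achird", "Vindemiatrix"],
};

// Hypothetical helper: give each character the next voice in its gender
// pool, wrapping around to the start when a pool runs out.
function suggestVoices(
  characters: { name: string; gender: Gender }[],
): Map<string, string> {
  const nextIndex: Record<Gender, number> = { female: 0, male: 0, neutral: 0 };
  const suggestions = new Map<string, string>();
  for (const { name, gender } of characters) {
    const pool = VOICE_POOLS[gender];
    suggestions.set(name, pool[nextIndex[gender] % pool.length]);
    nextIndex[gender] += 1;
  }
  return suggestions;
}
```

Round-robin keeps characters of the same gender on distinct voices until the pool is exhausted, which matters more for listeners than any particular pairing.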

Example output with --suggest-voices:

=== Voice Suggestions ===

Female Characters:
  ALICE → Zephyr (Bright, Higher pitch)
  MARY → Kore (Firm, Middle pitch)

Male Characters:
  BOB → Puck (Upbeat, Middle pitch)

Neutral Characters:
  NARRATOR → Pulcherrima (Forward, Middle pitch)

When using --update-config, the suggested voices are automatically written to your config file, ready for audiobook generation.

`convert <inputFile>`

Convert plain text or prose to speaker-tagged story format using AI.

pnpm run convert novel.txt
# Options:
#   -o, --output <path>     Output file path (default: input_converted.txt)
#   -f, --format <type>     Output format: bracket or colon (default: bracket)
#   -s, --speakers <names>  Comma-separated list of speaker names to use
#   --no-narrator           Exclude NARRATOR tag (dialogue only)
#   --prompt-only           Show the conversion prompt without calling the API
#   -v, --verbose           Verbose output

This command uses the Gemini LLM to analyze your text and convert it into the speaker-tagged format required for audiobook generation. It will:

Identify characters and dialogue
Add NARRATOR tags for descriptive text
Split content into appropriate segments

Examples:

# Convert a novel chapter
pnpm run convert chapter1.txt -o chapter1_story.txt

# Convert with specific speakers
pnpm run convert dialogue.txt -s "JOHN,MARY,NARRATOR"

# Use colon format instead of brackets
pnpm run convert story.txt -f colon

# Preview the prompt without calling the API
pnpm run convert story.txt --prompt-only

`setup`

Initialize a new project with default configuration.

pnpm run setup
# Options:
#   -o, --output <path>  Output directory (default: "./output")
#   -f, --force          Overwrite existing config

`init <storyFile>`

Create a configuration file based on speakers found in a story.

pnpm run init story.txt
# Options:
#   -o, --output <path>  Config output path (default: "./config.json")
#   -f, --force          Overwrite existing config

`generate <storyFile>`

Generate a complete audiobook from a story file.

pnpm run generate story.txt
# Options:
#   -c, --config <path>  Path to config file (default: "./config.json")
#   -o, --output <path>  Output directory (default: "./output")
#   -f, --force          Force regeneration (ignore cache)
#   -v, --verbose        Verbose output
#   -d, --dry-run        Show what would be done without generating

`preview <storyFile>`

Generate a preview with only the first N segments.

pnpm run preview story.txt -n 5
# Options:
#   -c, --config <path>      Path to config file
#   -o, --output <path>      Output directory
#   -n, --segments <number>  Number of segments to preview (default: 5)
#   -s, --start <number>     Start from segment index (default: 0)
#   --speaker <name>         Preview only specific speaker
#   -f, --force              Force regeneration
#   -v, --verbose            Verbose output

`update-styles <storyFile>`

Regenerate segments that have changed style prompts.

pnpm run update-styles story.txt
# Options:
#   -c, --config <path>       Path to config file
#   -o, --output <path>       Output directory
#   -s, --speakers <names>    Specific speakers to update (comma-separated)
#   -f, --force               Force update even if unchanged
#   -v, --verbose             Verbose output

`clean`

Clear cache and generated files.

pnpm run clean --force
# Options:
#   -o, --output <path>  Output directory
#   --cache-only         Only clear cache, keep generated audiobooks
#   --output-only        Only clear output files, keep cache
#   -f, --force          Don't ask for confirmation

`info [storyFile]`

Show project and cache information.

pnpm run info story.txt
# Options:
#   -c, --config <path>  Path to config file
#   -o, --output <path>  Output directory

`voices`

List available TTS voices.

pnpm run voices

Workflow Example: Converting Plain Text

If you have a story in plain prose format (like a novel or script), use the convert command first:

# 1. Convert your plain text to speaker-tagged format
pnpm run convert my-novel.txt -o my-story.txt

# 2. Review and edit the converted file if needed
# AI-assigned speakers can be wrong, so check them before continuing

# 3. Initialize config based on detected speakers
pnpm run init my-story.txt

# 4. Customize voices in config.json

# 5. Generate the audiobook
pnpm run generate my-story.txt

Configuration

The config.json file controls all aspects of audiobook generation:

{
  "version": "1.0.0",
  "provider": {
    "name": "gemini",
    "apiKey": "${GEMINI_API_KEY}",
    "model": "gemini-2.5-pro-preview-tts",
    "rateLimit": 60,
    "maxRetries": 3,
    "timeout": 60000
  },
  "audio": {
    "format": "wav",
    "sampleRate": 24000,
    "bitDepth": 16,
    "silencePadding": 500,
    "normalize": false
  },
  "voices": [
    {
      "name": "NARRATOR",
      "voiceName": "Zephyr",
      "stylePrompt": "Calm, measured storytelling voice",
      "speed": 1.0
    },
    {
      "name": "CHARACTER1",
      "voiceName": "Kore",
      "stylePrompt": "Young, energetic woman",
      "speed": 1.0
    }
  ],
  "defaultVoice": {
    "voiceName": "Zephyr",
    "stylePrompt": "Natural speaking voice",
    "speed": 1.0
  },
  "globalSeed": 12345
}

Configuration Options

Provider Settings

| Option | Type | Description |
| --- | --- | --- |
| `name` | string | TTS provider name ("gemini") |
| `apiKey` | string | API key (supports env vars like `${GEMINI_API_KEY}`) |
| `model` | string | Model to use for generation |
| `rateLimit` | number | Max requests per minute |
| `maxRetries` | number | Retry count for failed requests |
| `timeout` | number | Request timeout in milliseconds |
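
The `${GEMINI_API_KEY}` placeholder in `apiKey` has to be replaced with the environment value before the provider is called. A minimal sketch of that substitution is shown below. `expandEnvVars` is a hypothetical helper, and treating a missing variable as an error is an assumption.

```typescript
// Hypothetical helper: replace ${NAME} placeholders with values from `env`.
function expandEnvVars(
  value: string,
  env: Record<string, string | undefined>,
): string {
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match, name: string) => {
      const resolved = env[name];
      // Assumption: a missing variable is an error, not an empty string.
      if (resolved === undefined) {
        throw new Error(`Environment variable ${name} is not set`);
      }
      return resolved;
    },
  );
}
```

In a Node.js program the second argument would typically be `process.env`. Values without a placeholder pass through unchanged, so a literal key in config.json still works.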

Audio Settings

| Option | Type | Description |
| --- | --- | --- |
| `format` | string | Output format: "wav", "mp3", "ogg", "flac" |
| `sampleRate` | number | Sample rate in Hz (default: 24000) |
| `bitDepth` | number | Bit depth: 16, 24, or 32 |
| `silencePadding` | number | Silence between segments in ms |
| `normalize` | boolean | Normalize audio levels |
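
For uncompressed output, `silencePadding`, `sampleRate`, and `bitDepth` together determine how many bytes of silence sit between segments. The arithmetic is sketched below. `silenceByteLength` is a hypothetical helper, and mono PCM is an assumption (the channel count is not stated in this document).

```typescript
// Hypothetical helper: number of PCM bytes needed for a stretch of silence.
// Assumption: uncompressed PCM; `channels` defaults to mono.
function silenceByteLength(
  paddingMs: number,
  sampleRate: number,
  bitDepth: number,
  channels = 1,
): number {
  // One frame holds one sample per channel.
  const frames = Math.round((paddingMs / 1000) * sampleRate);
  return frames * channels * (bitDepth / 8);
}

// With the example config: 500 ms at 24000 Hz, 16-bit mono
// is 12000 frames, which is 24000 bytes of zeros.
const paddingBytes = silenceByteLength(500, 24000, 16);
```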

Voice Settings

| Option | Type | Description |
| --- | --- | --- |
| `name` | string | Speaker name (must match tags in story) |
| `voiceName` | string | Gemini voice name |
| `stylePrompt` | string | Description of voice style/emotion |
| `speed` | number | Speaking speed (0.5-2.0) |
| `pitch` | number | Pitch adjustment (-1.0 to 1.0) |
| `seed` | number | Voice seed for consistency |
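
The ranges in the table can be checked before any API call is made. The sketch below models one `voices` entry and validates it. `VoiceConfig` and `validateVoice` are hypothetical names, and which fields are optional is an assumption based on the example config, where `pitch` and `seed` are absent.

```typescript
// Shape of one entry in "voices", following the table above.
// Assumption: everything except name and voiceName is optional.
interface VoiceConfig {
  name: string;
  voiceName: string;
  stylePrompt?: string;
  speed?: number; // 0.5-2.0
  pitch?: number; // -1.0 to 1.0
  seed?: number;
}

// Hypothetical validator: returns a list of problems, empty when valid.
function validateVoice(voice: VoiceConfig): string[] {
  const problems: string[] = [];
  if (voice.name.trim() === "") problems.push("name must not be empty");
  if (voice.speed !== undefined && (voice.speed < 0.5 || voice.speed > 2.0)) {
    problems.push(`speed ${voice.speed} is outside 0.5-2.0`);
  }
  if (voice.pitch !== undefined && (voice.pitch < -1.0 || voice.pitch > 1.0)) {
    problems.push(`pitch ${voice.pitch} is outside -1.0 to 1.0`);
  }
  return problems;
}
```

Returning every problem at once, rather than throwing on the first, lets a user fix a config file in a single pass.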

Available Gemini Voices

Zephyr - Balanced, clear narrator voice
Puck - Light, playful voice
Charon - Deep, authoritative voice
Kore - Warm, friendly female voice
Fenrir - Strong, dramatic voice
Leda - Soft, gentle voice
Orus - Mature, wise voice
Aoede - Musical, expressive voice
Callirrhoe - Clear, articulate voice
Autonoe - Energetic, youthful voice
Enceladus - Powerful, resonant voice
Iapetus - Deep, thoughtful voice
Umbriel - Mysterious, atmospheric voice
Algenib - Bright, engaging voice

Workflow Example

Initial Generation

# Create config from your story
pnpm run init my-story.txt

# Review and edit config.json to customize voices
# Then generate
pnpm run generate my-story.txt

Iterating on Voice Styles

# Edit config.json - change a character's stylePrompt

# Regenerate only changed segments
pnpm run update-styles my-story.txt

# Or preview changes first
pnpm run preview my-story.txt --speaker CHARACTER1

Resuming Interrupted Generation

If generation is interrupted, simply run the same command again:

pnpm run generate my-story.txt
# Will pick up from where it left off

Force Full Regeneration

pnpm run generate my-story.txt --force

Output Files

After generation, you'll find these files in the output directory:

output/
├── .audiobook-cache/                    # Cache directory
│   └── a1b2c3d4/                        # Story-specific cache (8-char hash of filename)
│       ├── manifest.json                # Cache manifest for this story
│       ├── debug.log                    # Debug logs for TTS requests
│       └── segments/                    # Individual segment audio files
│           ├── seg_0000_abc123.wav
│           ├── seg_0001_def456.wav
│           └── ...
├── my-story_20240115_103000_audiobook.wav    # Final stitched audiobook (with timestamp)
└── my-story_20240115_103000_manifest.json    # Manifest with timestamps

Each story file gets its own cache folder based on an 8-character hash of the filename. When running preview or generate, you'll see:

ℹ Using cache folder: a1b2c3d4

This means different stories won't share cache, and you can iterate on multiple stories independently.
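
The hash algorithm behind the folder name is not stated in this document. To illustrate how a filename can map to a stable 8-character folder name, the sketch below uses 32-bit FNV-1a over the string's UTF-16 code units as a stand-in. `shortHash` is a hypothetical name and will not reproduce the tool's actual folder names.

```typescript
// FNV-1a (32-bit) rendered as 8 hex characters.
// Assumption: a stand-in for whatever hash the tool actually uses.
function shortHash(input: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    // Multiply by the FNV prime, keeping the result an unsigned 32-bit value.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

const cacheFolder = shortHash("my-story.txt");
```

Because the folder name depends only on the filename, renaming a story file would start a fresh cache under this scheme.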

Manifest Format

The manifest file contains metadata and timestamps for each segment:

{
  "version": "1.0.0",
  "title": "my-story",
  "sourceFile": "my-story.txt",
  "outputFile": "my-story_audiobook.wav",
  "totalDurationMs": 180000,
  "format": "wav",
  "sampleRate": 24000,
  "speakers": ["NARRATOR", "CHARACTER1"],
  "segments": [
    {
      "index": 0,
      "speaker": "NARRATOR",
      "text": "Once upon a time...",
      "startMs": 0,
      "endMs": 5000,
      "durationMs": 5000,
      "audioFile": "seg_0000_abc123.wav"
    }
  ],
  "generatedAt": "2024-01-15T10:30:00.000Z",
  "provider": "gemini"
}
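
Because every segment carries `startMs` and `endMs`, the manifest can drive features such as synced text highlighting or jumping to a speaker's lines. The sketch below types a segment and looks up the one playing at a given position. `ManifestSegment` and `segmentAt` are hypothetical names, and treating `endMs` as exclusive is an assumption.

```typescript
// Segment fields as they appear in the manifest example above.
interface ManifestSegment {
  index: number;
  speaker: string;
  text: string;
  startMs: number;
  endMs: number;
  durationMs: number;
  audioFile: string;
}

// Hypothetical helper: find the segment playing at `positionMs`.
// Assumption: startMs is inclusive and endMs is exclusive.
function segmentAt(
  segments: ManifestSegment[],
  positionMs: number,
): ManifestSegment | undefined {
  return segments.find(
    (segment) => positionMs >= segment.startMs && positionMs < segment.endMs,
  );
}
```

A position that falls in the silence padding between two segments returns `undefined`, which a player can treat as "no active line".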

Programmatic Usage

You can also use the library programmatically:

import {
  parseFile,
  loadConfig,
  createTTSProvider,
  generateSegmentAudio,
  stitchAudioFiles,
} from 'audiobook-gemini-multi';

// Parse story
const story = await parseFile('story.txt');

// Load config
const config = await loadConfig('config.json');

// Initialize provider
const provider = createTTSProvider(config);
await provider.initialize();

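// Generate segments, collecting each output path for stitching
const audioFiles: string[] = [];
for (const segment of story.segments) {
  const outputPath = `output/${segment.id}.wav`;
  await generateSegmentAudio(provider, segment, config, outputPath);
  audioFiles.push(outputPath);
}

// Stitch together in story order
await stitchAudioFiles(audioFiles, 'output/audiobook.wav');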

Environment Variables

| Variable | Description |
| --- | --- |
| `GEMINI_API_KEY` | Your Google Gemini API key (used for TTS, text conversion, and analyze with Gemini provider) |
| `XAI_API_KEY` | Your xAI API key (used for analyze command with Grok provider) |

Troubleshooting

"API key not found"

Make sure you have set the GEMINI_API_KEY environment variable or configured it in config.json.

"Rate limit exceeded"

The generator automatically handles rate limiting, but if you see persistent errors, try reducing the rateLimit setting in config.json.

"Generation failed"

Run the command with the -v flag to see verbose output. The generator automatically retries failed requests with an incremented seed value. This handles:

Transient network errors (connection reset, timeout, etc.)
API responses with finishReason: OTHER
Internal server errors (500)

You'll see warning messages when retries happen:

⚠️  Generation failed with OTHER [seg_0026_8ef0dbfd], retrying with seed 101 (attempt 2/4)

Debug logs are written to the cache folder's debug.log file for troubleshooting.
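
The retry-with-incremented-seed behavior can be pictured as a loop that bumps the seed on each failed attempt. This is a sketch under stated assumptions, not the tool's implementation. `generateWithRetry` and its `generate` callback are hypothetical, and it assumes the seed increases by 1 per attempt, which matches the sample warning (seed 101 on attempt 2/4).

```typescript
// Hypothetical retry loop: call `generate` with baseSeed, baseSeed + 1, ...
// until it succeeds or the retries are used up.
async function generateWithRetry<T>(
  generate: (seed: number) => Promise<T>,
  baseSeed: number,
  maxRetries: number,
): Promise<T> {
  let lastError: unknown;
  // maxRetries retries means maxRetries + 1 attempts in total.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await generate(baseSeed + attempt);
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

With `maxRetries` set to 3 this makes at most four attempts, consistent with the `attempt 2/4` counter in the warning above. Changing the seed gives the model a different sampling path, which is presumably why a retry can succeed where an identical request failed.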

Cached files are being regenerated

Individual segments are regenerated when:

The story text for that segment changes
The voice configuration for that speaker changes

Note: Changing config.json or the story file structure no longer invalidates the entire cache. Only segments with actual text or voice changes are regenerated. The system also automatically recovers cached audio files if the manifest is lost or corrupted.

Use --force only when you explicitly want to regenerate everything.
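
The per-segment rule above amounts to comparing a fingerprint of the inputs that affect the audio. The sketch below shows one way to build that comparison. `segmentFingerprint` and `needsRegeneration` are hypothetical names, and the defaults for omitted `speed` and `pitch` are assumptions. The tool's real cache key format is not documented here.

```typescript
interface VoiceSettings {
  voiceName: string;
  stylePrompt?: string;
  speed?: number;
  pitch?: number;
  seed?: number;
}

// Hypothetical fingerprint: a stable string built only from the inputs
// that influence the generated audio for one segment.
function segmentFingerprint(text: string, voice: VoiceSettings): string {
  return JSON.stringify([
    text,
    voice.voiceName,
    voice.stylePrompt ?? "",
    voice.speed ?? 1, // assumption: omitted speed means 1.0
    voice.pitch ?? 0, // assumption: omitted pitch means 0
    voice.seed ?? null,
  ]);
}

// A segment is regenerated only when its fingerprint differs from the
// one stored with the cached audio (or when nothing is cached yet).
function needsRegeneration(
  cachedFingerprint: string | undefined,
  text: string,
  voice: VoiceSettings,
): boolean {
  return cachedFingerprint !== segmentFingerprint(text, voice);
}
```

Since the segment's position in the story is not part of the fingerprint, inserting or reordering lines leaves unchanged segments reusable, in line with the note above.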

Development

# Type check
pnpm run type-check

# Build for production
pnpm run build

# Run with tsx (development)
pnpm run dev generate examples/story.txt

License

ISC
