Multi-Speaker TTS Pipeline on Google Cloud

A production-ready Text-to-Speech (TTS) pipeline that converts conversational text with multiple speakers into natural-sounding audio using Google Cloud services and the Gemini API.

Features

  • Multi-speaker support: Automatically assigns different voices to multiple speakers
  • Real-time status updates: Server-Sent Events (SSE) for live job progress tracking
  • Scalable architecture: Built on Google Cloud Run and Cloud Functions
  • Flexible TTS models: Supports various Gemini models with customizable prompts
  • Asynchronous processing: Non-blocking job submission with background processing
  • GitHub integration: Includes Claude Code GitHub Action for automated assistance

Architecture Overview

  • Cloud Function (submit_audio_job): HTTP endpoint for job submission
  • Cloud Run Job (tts-worker): Background worker for TTS processing using Gemini API
  • Cloud Run Service (events-gateway): SSE gateway for real-time status updates
  • Cloud Storage: Input text and output audio file storage
  • Cloud Tasks: Job queue management (the queue is configured, but processing currently runs through Cloud Run Jobs)
  • Pub/Sub: Event-driven communication between components

Quick Start

Prerequisites

  1. Google Cloud Project with billing enabled
  2. gcloud CLI installed and configured
  3. Required environment variables:
export PROJECT_ID="your-project-id"
export REGION="your-region"  # e.g., us-central1, asia-northeast1

Setup Steps

  1. Update gcloud CLI

    gcloud components update
  2. Enable required Google Cloud APIs

    gcloud services enable \
      cloudfunctions.googleapis.com \
      run.googleapis.com \
      eventarc.googleapis.com \
      cloudtasks.googleapis.com \
      pubsub.googleapis.com \
      storage.googleapis.com \
      artifactregistry.googleapis.com \
      secretmanager.googleapis.com \
      cloudbuild.googleapis.com
  3. Set up IAM permissions

    # Grant Cloud Build permissions to your user account
    gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member="user:YOUR_EMAIL@example.com" \
      --role="roles/cloudbuild.builds.editor"
  4. Store Gemini API key in Secret Manager

    # Replace YOUR_GEMINI_API_KEY with your actual API key
    echo -n "YOUR_GEMINI_API_KEY" | gcloud secrets create gemini-api-key --data-file=-
    
    # Grant access to the service account
    PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')
    gcloud secrets add-iam-policy-binding gemini-api-key \
      --member="serviceAccount:$PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
      --role="roles/secretmanager.secretAccessor" \
      --project=$PROJECT_ID

Deployment

Automated Deployment with deploy.sh

The project includes a convenient deployment script that handles all components:

# Set required environment variables
export PROJECT_ID="your-project-id"
export REGION="asia-northeast1"

# Deploy all components
./deploy.sh all

# Or deploy individual components
./deploy.sh gateway  # Deploy SSE Gateway (Cloud Run Service)
./deploy.sh worker   # Deploy TTS Worker (Cloud Run Job)
./deploy.sh function # Deploy Submit Function (Cloud Function)

# Show help
./deploy.sh help

The script automatically:

  • Validates environment variables
  • Builds container images
  • Deploys services with proper configurations
  • Sets up environment variables and secrets

Manual Deployment (Advanced)

For detailed control over the deployment process:

  1. Create Cloud Storage buckets

    gsutil mb -l $REGION gs://$PROJECT_ID-tts-input
    gsutil mb -l $REGION gs://$PROJECT_ID-tts-output
    
    # Apply security settings to prevent bucket listing (recommended)
    # See Security Considerations section for details
  2. Create Artifact Registry repository

    gcloud artifacts repositories create tts \
      --repository-format=docker \
      --location=$REGION
  3. Create Cloud Tasks queue

    gcloud tasks queues create tts-queue --location=$REGION
  4. Deploy the TTS Worker (Cloud Run Job)

    # Build and deploy
    gcloud builds submit worker/tts_worker \
      --tag $REGION-docker.pkg.dev/$PROJECT_ID/tts/worker:latest
    
    gcloud run jobs create tts-worker \
      --image $REGION-docker.pkg.dev/$PROJECT_ID/tts/worker:latest \
      --region $REGION \
      --set-secrets GEMINI_API_KEY=gemini-api-key:latest
  5. Deploy the Events Gateway (Cloud Run Service)

    # Build and deploy
    gcloud builds submit events-gateway \
      --tag $REGION-docker.pkg.dev/$PROJECT_ID/tts/gateway:latest
    
    gcloud run deploy events-gateway \
      --image $REGION-docker.pkg.dev/$PROJECT_ID/tts/gateway:latest \
      --region $REGION \
      --allow-unauthenticated \
      --set-env-vars "PROJECT_ID=$PROJECT_ID"
  6. Set up Pub/Sub topics and subscriptions

    # Create topics
    gcloud pubsub topics create gcs-object-finalize-events
    gcloud pubsub topics create tts-finished
    
    # Set up GCS notifications
    gsutil notification create \
      -t projects/$PROJECT_ID/topics/gcs-object-finalize-events \
      -f json \
      gs://$PROJECT_ID-tts-output
  7. Deploy the Submit Function (Cloud Function)

    gcloud functions deploy submit_audio_job \
      --gen2 \
      --region $REGION \
      --runtime python311 \
      --entry-point main \
      --source functions/submit_audio_job \
      --trigger-http \
      --allow-unauthenticated

Configuration

Supported Parameters

| Parameter | Type   | Default                         | Description                           |
|-----------|--------|---------------------------------|---------------------------------------|
| script    | string | Required                        | Conversation text with speaker labels |
| speakers  | array  | Required                        | List of speaker names                 |
| model     | string | gemini-2.5-flash-preview-tts    | Gemini model to use                   |
| prompt    | string | TTS the following conversation: | System prompt for TTS generation      |
| job_id    | string | Auto-generated UUID             | Custom job identifier                 |
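
As a rough illustration of how the documented defaults combine with a request, the sketch below validates the two required fields and fills in the rest. `apply_defaults` is a hypothetical helper written for this example; the actual Cloud Function code is not shown in this README and may behave differently.

```python
import uuid

# Defaults copied from the parameter table above.
DEFAULT_MODEL = "gemini-2.5-flash-preview-tts"
DEFAULT_PROMPT = "TTS the following conversation:"


def apply_defaults(request: dict) -> dict:
    """Validate required fields and fill in the documented defaults."""
    script = request.get("script")
    speakers = request.get("speakers")
    if not isinstance(script, str) or not script.strip():
        raise ValueError("'script' is required and must be a non-empty string")
    if not isinstance(speakers, list) or not speakers:
        raise ValueError("'speakers' is required and must be a non-empty list")
    return {
        "script": script,
        "speakers": speakers,
        "model": request.get("model") or DEFAULT_MODEL,
        "prompt": request.get("prompt") or DEFAULT_PROMPT,
        # A random UUID stands in for the auto-generated job identifier.
        "job_id": request.get("job_id") or str(uuid.uuid4()),
    }
```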

Voice Assignment

Speakers are automatically assigned to available voices:

  • Voices rotate between "Kore" and "Puck"
  • Assignment is based on speaker order in the array
  • Consistent voice assignment for each speaker throughout the conversation
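
The rotation described above can be sketched as a simple round-robin over the two voice names. `assign_voices` is a hypothetical function for illustration only, not the worker's actual code.

```python
VOICES = ("Kore", "Puck")


def assign_voices(speakers: list[str]) -> dict[str, str]:
    """Map each distinct speaker to a voice, alternating by order of appearance."""
    assignment: dict[str, str] = {}
    for name in speakers:
        if name not in assignment:
            # The number of speakers seen so far selects the next voice.
            assignment[name] = VOICES[len(assignment) % len(VOICES)]
    return assignment


print(assign_voices(["Alice", "Bob", "Carol"]))
# {'Alice': 'Kore', 'Bob': 'Puck', 'Carol': 'Kore'}
```

With only two voices in rotation, a third speaker shares a voice with the first one under this scheme.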

API Usage

1. Submit a TTS Job

# Get the Cloud Function URL
FUNCTION_URL=$(gcloud functions describe submit_audio_job \
  --region $REGION --format 'value(serviceConfig.uri)')

# Submit a job
curl -X POST "$FUNCTION_URL" \
  -H "Content-Type: application/json" \
  -d '{
    "script": "Alice: Hello there!\nBob: Hi Alice, how are you?",
    "speakers": ["Alice", "Bob"],
    "prompt": "Read this conversation naturally",
    "model": "gemini-2.5-flash-preview-tts"
  }'

Response:

{
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "events_url": "https://events-gateway-xxx.run.app/events/550e8400-e29b-41d4-a716-446655440000"
}

2. Monitor Job Progress with SSE

# Connect to the SSE endpoint
curl -N "https://events-gateway-xxx.run.app/events/550e8400-e29b-41d4-a716-446655440000"

SSE Events:

data: {"job_id": "550e8400-e29b-41d4-a716-446655440000", "status": "waiting", "url": "https://storage.googleapis.com/..."}

data: {"job_id": "550e8400-e29b-41d4-a716-446655440000", "status": "completed", "url": "https://storage.googleapis.com/..."}
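
For clients that read the stream without an SSE library, the `data:` lines shown above can be decoded with a few lines of Python. This is a minimal sketch that handles only the `data` field and comment lines; `parse_sse` is a hypothetical helper, and the sample payloads in the usage example are made up.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded JSON payload per SSE event.

    An event ends at a blank line; its `data:` lines are joined with newlines.
    Lines that are not `data:` fields (for example `: comment`) are ignored.
    """
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            yield json.loads("\n".join(buffer))
            buffer = []


sample = [
    'data: {"job_id": "abc", "status": "waiting"}\n',
    "\n",
    'data: {"job_id": "abc", "status": "completed"}\n',
    "\n",
]
for event in parse_sse(sample):
    print(event["status"])
```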

3. Download Generated Audio

# Download the WAV file
curl -o output.wav "https://storage.googleapis.com/your-project-tts-output/550e8400-e29b-41d4-a716-446655440000.wav"

Client Examples

JavaScript/TypeScript

class TTSClient {
  constructor(functionUrl) {
    this.functionUrl = functionUrl;
  }

  async submitJob(script, speakers, options = {}) {
    const response = await fetch(this.functionUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        script,
        speakers,
        prompt: options.prompt,
        model: options.model,
        job_id: options.jobId
      })
    });
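    // Surface HTTP errors instead of trying to parse an error body as a job
    if (!response.ok) {
      throw new Error(`Job submission failed: HTTP ${response.status}`);
    }
    return response.json();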
  }

  monitorJob(eventsUrl, callbacks) {
    const eventSource = new EventSource(eventsUrl);
    
    eventSource.onmessage = (event) => {
      const data = JSON.parse(event.data);
      
      switch (data.status) {
        case 'waiting':
          callbacks.onWaiting?.(data);
          break;
        case 'completed':
          callbacks.onCompleted?.(data);
          eventSource.close();
          break;
        case 'error':
        case 'timeout':
          callbacks.onError?.(data);
          eventSource.close();
          break;
      }
    };
    
    eventSource.onerror = () => {
      callbacks.onError?.({ error: 'Connection failed' });
      eventSource.close();
    };
    
    return eventSource;
  }
}

// Usage
const client = new TTSClient('YOUR_FUNCTION_URL');

const job = await client.submitJob(
  "Alice: Hello!\nBob: Hi there!",
  ["Alice", "Bob"],
  { prompt: "Natural conversation" }
);

client.monitorJob(job.events_url, {
  onWaiting: (data) => console.log('Processing...'),
  onCompleted: (data) => {
    const audio = new Audio(data.url);
    audio.play();
  },
  onError: (error) => console.error('Failed:', error)
});

Python

# Requires: pip install requests sseclient-py
import json

import requests
import sseclient

class TTSClient:
    def __init__(self, function_url):
        self.function_url = function_url
    
    def submit_job(self, script, speakers, prompt=None, model=None):
        payload = {
            "script": script,
            "speakers": speakers
        }
        if prompt:
            payload["prompt"] = prompt
        if model:
            payload["model"] = model
        
        response = requests.post(self.function_url, json=payload)
        response.raise_for_status()  # fail early on HTTP errors
        return response.json()
    
    def monitor_job(self, events_url):
        response = requests.get(events_url, stream=True)
        client = sseclient.SSEClient(response)
        
        for event in client.events():
            data = json.loads(event.data)
            yield data
            
            if data.get("status") in ["completed", "error", "timeout"]:
                break

# Usage
client = TTSClient("YOUR_FUNCTION_URL")

job = client.submit_job(
    "Alice: Hello!\nBob: Hi there!",
    ["Alice", "Bob"]
)

for update in client.monitor_job(job["events_url"]):
    print(f"Status: {update['status']}")
    if update["status"] == "completed":
        print(f"Audio URL: {update['url']}")

Testing

Test SSE Connection

Use the included test script to verify SSE functionality:

./test_sse.sh

This script:

  1. Creates a test job ID
  2. Establishes an SSE connection
  3. Publishes a test message via Pub/Sub
  4. Verifies message delivery

Manual Testing

# Test job submission
curl -X POST "$FUNCTION_URL" \
  -H "Content-Type: application/json" \
  -d '{"script": "Test: Hello", "speakers": ["Test"]}'

# Test SSE connection
curl -N "$EVENTS_URL"

Monitoring and Logging

Viewing Logs in GCP Console (For Beginners)

The TTS Worker now includes enhanced structured logging to help track audio generation issues. Here's how to monitor your jobs:

1. Access Cloud Logging

  1. Go to Google Cloud Console
  2. Select your project from the dropdown at the top
  3. In the left menu, navigate to Logging → Logs Explorer

2. View TTS Worker Logs

Use these queries in the Logs Explorer search bar:

View all TTS worker logs:

logName="projects/YOUR_PROJECT_ID/logs/tts-worker"

View logs for a specific job:

logName="projects/YOUR_PROJECT_ID/logs/tts-worker"
jsonPayload.job_id="YOUR_JOB_ID"

View only errors:

logName="projects/YOUR_PROJECT_ID/logs/tts-worker"
severity="ERROR"

Track short audio issues:

logName="projects/YOUR_PROJECT_ID/logs/tts-worker"
jsonPayload.message="Audio duration too short"

3. Understanding Log Fields

Each log entry contains structured data:

  • job_id: Unique identifier for the job
  • duration_seconds: Length of generated audio
  • duration_minutes: Length in minutes
  • speakers: Array of speaker names
  • model: Gemini model used
  • retry_reason: Why a retry was needed (e.g., "duration_too_short")
  • total_processing_time_seconds: Total time to generate audio
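
To make the field list concrete, the sketch below assembles one such entry as a Python dictionary. `build_log_entry` is a hypothetical helper; the field names come from the list above, while the rounding and the way the worker actually emits the entry (presumably through the Cloud Logging client, given the `tts-worker` log name) are assumptions.

```python
import json


def build_log_entry(job_id, duration_seconds, speakers, model,
                    total_processing_time_seconds, retry_reason=None,
                    message="Audio generated", severity="INFO"):
    """Assemble a structured log payload using the field names listed above."""
    entry = {
        "severity": severity,
        "message": message,
        "job_id": job_id,
        "duration_seconds": round(duration_seconds, 2),
        "duration_minutes": round(duration_seconds / 60, 2),
        "speakers": list(speakers),
        "model": model,
        "total_processing_time_seconds": round(total_processing_time_seconds, 2),
    }
    # retry_reason is only present when a retry was needed.
    if retry_reason is not None:
        entry["retry_reason"] = retry_reason
    return entry


print(json.dumps(build_log_entry(
    "demo-job", 42.0, ["Alice", "Bob"], "gemini-2.5-flash-preview-tts", 12.3,
    retry_reason="duration_too_short",
    message="Audio duration too short", severity="WARNING"), indent=2))
```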

4. Setting Up Alerts (Optional)

  1. In Logs Explorer, create a query for short audio:
    logName="projects/YOUR_PROJECT_ID/logs/tts-worker"
    jsonPayload.duration_seconds < 60
    
  2. Click Create Alert above the results
  3. Configure notification channels (email, SMS, etc.)

Audio Duration Validation

The system now automatically validates audio duration:

  • Minimum Duration: 60 seconds (1 minute)
  • Automatic Retry: If audio is shorter than 60 seconds, the system will:
    1. Log a warning with the actual duration
    2. Retry generation with modified prompts (up to 2 additional attempts)
    3. Add instructions to speak slowly and clearly
  • Metadata Storage: Duration is stored in GCS metadata for each file
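
The validation and retry behavior described above could be sketched as follows. This is an illustration under stated assumptions, not the worker's code: `generate` is a hypothetical callable that returns WAV bytes, and the exact wording added to the prompt on retry is invented here.

```python
import io
import wave

MIN_DURATION_SECONDS = 60
MAX_ATTEMPTS = 3  # one initial attempt plus up to 2 retries


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the playback length of an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def generate_with_retry(generate, prompt: str):
    """Call `generate(prompt)` until the audio is long enough or attempts run out.

    Returns the last audio produced together with its duration in seconds.
    """
    base_prompt = prompt
    audio, duration = b"", 0.0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        audio = generate(prompt)
        duration = wav_duration_seconds(audio)
        if duration >= MIN_DURATION_SECONDS:
            break
        print(f"warning: attempt {attempt} produced {duration:.1f}s of audio")
        # Assumed wording; the real retry prompt is not shown in this README.
        prompt = "Speak slowly and clearly. " + base_prompt
    return audio, duration
```

Note that the sketch returns the short audio if every attempt falls below the minimum; whether the real worker fails the job in that case is not stated here.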

To check audio duration for existing files:

# View file metadata including duration
gsutil stat gs://$PROJECT_ID-tts-output/JOB_ID.wav

Monitoring Dashboard (Quick Setup)

Create a simple dashboard to monitor your TTS jobs:

  1. Go to Monitoring → Dashboards in GCP Console
  2. Click Create Dashboard
  3. Add these widgets:
    • Log-based metric: Audio generation success rate
    • Log-based metric: Average audio duration
    • Log panel: Recent errors

Example metric for average duration:

  1. Go to Logging → Logs-based Metrics
  2. Click Create Metric
  3. Name: tts_audio_duration
  4. Filter:
    logName="projects/YOUR_PROJECT_ID/logs/tts-worker"
    jsonPayload.duration_seconds > 0
    
  5. Field name: jsonPayload.duration_seconds
  6. Create and use in dashboards

Troubleshooting

Common Issues

  1. "PROJECT_ID not set" error

    • Ensure environment variables are exported: export PROJECT_ID=your-project-id
  2. "Permission denied" errors

    • Check IAM permissions for service accounts
    • Verify Secret Manager access for Gemini API key
  3. SSE connection timeouts

    • SSE connections have a 5-minute maximum duration
    • Implement reconnection logic in production clients
  4. Audio generation failures

    • Verify Gemini API key is valid
    • Check Cloud Run Job executions: gcloud run jobs executions list --job=tts-worker --region=$REGION
    • Ensure speaker names in script match the speakers array
  5. Short audio files (< 60 seconds)

    • The system now automatically retries generation for short audio
    • Check logs for "Audio duration too short" warnings
    • Common causes:
      • Very short input scripts
      • Fast speech generation by the model
      • Missing or truncated content
    • Manual workarounds:
      • Add more conversational content
      • Include pauses or stage directions
      • Use prompts that encourage slower speech
  6. Deployment failures

    • Ensure all APIs are enabled
    • Check Cloud Build logs for container build issues
    • Verify Artifact Registry repository exists
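
For the SSE connection timeouts listed above, reconnection logic might look roughly like the sketch below. `follow_job` is a hypothetical wrapper: `connect` is any callable that opens a fresh stream and yields decoded updates, and the retry limit and backoff values are arbitrary choices. It treats `completed`, `error`, and `timeout` as final, mirroring the client examples; whether a `timeout` status should instead trigger a reconnect depends on gateway behavior not described here.

```python
import time
from typing import Callable, Iterable, Iterator

TERMINAL_STATUSES = {"completed", "error", "timeout"}


def follow_job(connect: Callable[[], Iterable[dict]],
               max_reconnects: int = 5,
               base_delay: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> Iterator[dict]:
    """Yield job updates, reopening the stream if it drops before a final status."""
    attempts = 0
    while True:
        try:
            for update in connect():
                attempts = 0  # the connection is healthy again
                yield update
                if update.get("status") in TERMINAL_STATUSES:
                    return
        except OSError as exc:
            # Covers ConnectionError and requests' exceptions, which derive from OSError.
            print(f"stream interrupted: {exc}")
        attempts += 1
        if attempts > max_reconnects:
            raise TimeoutError("gave up reconnecting to the events stream")
        # Exponential backoff, capped at 30 seconds.
        sleep(min(base_delay * 2 ** (attempts - 1), 30.0))
```

With the Python client from the Client Examples section, this could be used as `follow_job(lambda: client.monitor_job(job["events_url"]))`.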

Debugging Commands

# Check Cloud Run Job executions
gcloud run jobs executions list --job=tts-worker --region=$REGION

# View Cloud Run Job logs
gcloud logging read "resource.type=cloud_run_job AND resource.labels.job_name=tts-worker" --limit=50

# Check Cloud Function logs
gcloud functions logs read submit_audio_job --region=$REGION

# View SSE Gateway logs
gcloud run services logs read events-gateway --region=$REGION

# Check audio durations for recent jobs
gcloud logging read 'logName="projects/'$PROJECT_ID'/logs/tts-worker" jsonPayload.duration_seconds>0' \
  --format="table(jsonPayload.job_id, jsonPayload.duration_seconds, jsonPayload.duration_minutes)" \
  --limit=10

# Find jobs with short audio (< 60 seconds)
gcloud logging read 'logName="projects/'$PROJECT_ID'/logs/tts-worker" jsonPayload.message="Audio duration too short"' \
  --format="table(jsonPayload.job_id, jsonPayload.duration_seconds, timestamp)" \
  --limit=20

Security Considerations

Cloud Storage Security

The project includes security configurations to prevent unauthorized bucket listing while maintaining file accessibility:

  • Bucket Listing Protection: Public access to list bucket contents is disabled
  • File Access: Individual files remain accessible via direct URLs (configurable)
  • Security Scripts: Use ./secure_bucket.sh to apply security settings

To secure your buckets:

# Quick security fix (prevents bucket listing)
./secure_bucket.sh

# Or manually apply settings
gsutil iam ch -d allUsers:objectViewer gs://$PROJECT_ID-tts-output
gsutil iam ch -d allUsers:legacyBucketReader gs://$PROJECT_ID-tts-output

For enhanced security, consider using the signed URL implementation in events-gateway/main_secure.py which provides:

  • Time-limited access to generated files
  • No permanent public URLs
  • Better access control and auditing
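
As a rough idea of what time-limited access involves, the sketch below requests a V4 signed URL for a job's output file. It is not the code in events-gateway/main_secure.py, which is not reproduced here; `signed_audio_url` and the 15-minute default are assumptions, and `bucket` is expected to behave like a `google.cloud.storage.Bucket`.

```python
from datetime import timedelta


def signed_audio_url(bucket, job_id: str, minutes: int = 15) -> str:
    """Return a time-limited GET URL for `<job_id>.wav` in the given bucket."""
    blob = bucket.blob(f"{job_id}.wav")
    return blob.generate_signed_url(
        version="v4",
        expiration=timedelta(minutes=minutes),
        method="GET",
    )
```

With the google-cloud-storage package installed, the bucket could be obtained via `storage.Client().bucket(f"{project_id}-tts-output")`. Signing requires credentials that are able to sign, such as a service account key.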

See BUCKET_SECURITY_GUIDE.md for detailed security options.

GitHub Actions Security

When using the Claude Code GitHub Action workflow, configure these secrets in your repository settings:

  1. ANTHROPIC_API_KEY: Your Anthropic API key for Claude
  2. GOOGLE_CLOUD_SERVICE_ACCOUNT_KEY: Service account JSON key with appropriate permissions
  3. GOOGLE_CLOUD_PROJECT_ID: Your GCP project ID

To add these secrets:

  1. Go to Settings → Secrets and variables → Actions
  2. Click "New repository secret"
  3. Add each secret with the appropriate value

Other Security Recommendations

  • The Cloud Function endpoint is publicly accessible but can be secured with authentication
  • Gemini API key is stored in Secret Manager
  • Consider implementing:
    • API key authentication for the Cloud Function
    • Rate limiting and quota management
    • VPC Service Controls for additional network security
  • Before making the repository public:
    • Ensure no API keys or credentials are hardcoded
    • Review all configuration files for sensitive data
    • Use environment variables for all project-specific settings

Contributing

This project uses GitHub Actions with Claude Code for automated assistance. To request help:

  1. Create an issue or pull request
  2. Mention @claude in your comment
  3. Claude will analyze and provide assistance

License

This project is provided as-is for demonstration purposes.
