Deploy the Stable Audio VAE encoder as a serverless GPU endpoint on RunPod for faster processing with L40 GPUs.
This deployment packages the Stable Audio VAE encoder as a RunPod serverless worker. When connected via GitHub, RunPod automatically:
- Builds your Docker image from this repository
- Stores it in RunPod's container registry
- Deploys it as an auto-scaling serverless endpoint
- Provides 5-20 second cold starts (model pre-loaded at container startup)
Your VAE uses the Oobleck encoder/decoder architecture from Stable Audio:
| Parameter | Value |
|---|---|
| Model Type | Autoencoder (VAE) |
| Sample Rate | 44,100 Hz |
| Audio Channels | 2 (stereo) |
| Latent Dimension | 64 |
| Downsampling Ratio | 2048x |
| Encoder Channels | 128 → 256 → 512 → 1024 → 2048 |
| Strides | 2, 4, 4, 8, 8 |
Key insight: every 2048 audio samples become one latent vector of dimension 64. A 3-minute song at 44.1kHz has ~7.9M samples, yielding ~3,876 latent vectors.
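The arithmetic above can be checked directly: the 2048x downsampling ratio is just the product of the encoder strides.

```python
# Downsampling ratio is the product of the encoder strides: 2*4*4*8*8
strides = [2, 4, 4, 8, 8]
ratio = 1
for s in strides:
    ratio *= s

samples = 3 * 60 * 44_100              # 3-minute song at 44.1 kHz
num_latents = (samples + ratio - 1) // ratio  # encoder pads the final frame, so round up

print(ratio, num_latents)  # 2048 3876
```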
- Model weights: ~500MB - 1GB VRAM
- Inference buffer: ~1-2GB for typical songs
- Recommended GPU: L40 (48GB) for headroom, also works on A40, RTX 4090, A100
```
runpod/
├── Dockerfile          # Container build instructions
├── rp_handler.py       # RunPod serverless handler
├── requirements.txt    # Python dependencies
├── test_input.json     # Test payload template
└── README.md           # This file

# You must also provide (not in repo for licensing):
/models/
├── stable_audio_2_0_vae.json          # VAE config
└── sao_vae_tune_100k_unwrapped.ckpt   # VAE weights (~500MB)
```
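For orientation, a minimal sketch of the shape of `rp_handler.py` (the real handler also loads the VAE at startup and runs encoding; the field names here mirror the request/response examples later in this README):

```python
import base64

def handler(job):
    # RunPod passes the request body's "input" object as job["input"]
    audio_bytes = base64.b64decode(job["input"]["audio_base64"])
    # ... run the VAE encoder on audio_bytes here (omitted in this sketch) ...
    return {"num_audio_bytes": len(audio_bytes)}

# In the real worker, the handler is registered with the RunPod SDK:
# import runpod
# runpod.serverless.start({"handler": handler})
```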
You need the Stable Audio VAE checkpoint. Options:
Option A: Official Stability AI weights (requires license)
- Contact Stability AI for access
Option B: Community fine-tunes
- Search HuggingFace for "stable audio vae"
- Your current weights: `sao_vae_tune_100k_unwrapped.ckpt`
Option C: Host on cloud storage
- Upload to S3/GCS/Cloudflare R2
- Modify Dockerfile to download at build time
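For Option C, the download can happen at image build time. A hedged Dockerfile fragment (the URL is a placeholder for your own bucket):

```dockerfile
# Hypothetical: fetch the checkpoint from your own storage at build time
RUN mkdir -p /models && \
    curl -fL -o /models/sao_vae_tune_100k_unwrapped.ckpt \
        "https://your-bucket.example.com/sao_vae_tune_100k_unwrapped.ckpt"
```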
Create a new GitHub repo or use a branch with:
```bash
# From your latent-musicvis directory
cd runpod

# Initialize git if needed
git init

# Add files
git add Dockerfile rp_handler.py requirements.txt test_input.json README.md

# Add your model files (see options below)
```

Model Hosting Options:
Option A: Bake into Docker image (recommended for fast cold starts)
```dockerfile
# Add to Dockerfile before CMD:
COPY stable_audio_2_0_vae.json /models/
COPY sao_vae_tune_100k_unwrapped.ckpt /models/
```

- Pros: Fastest cold start (5-10s), model ready immediately
- Cons: Larger image (~2-3GB), longer build time
Option B: Download at startup
```python
# Add to rp_handler.py after imports:
import os
import urllib.request

def download_model():
    if not os.path.exists(VAE_CKPT_PATH):
        print("[RunPod] Downloading model...")
        urllib.request.urlretrieve(
            "https://your-storage.com/sao_vae.ckpt",
            VAE_CKPT_PATH,
        )
```

- Pros: Smaller image, easier updates
- Cons: Slower cold start (adds 30-60s for download)
Option C: Network volume mount (advanced)
- Create a RunPod network volume with your models
- Mount it at `/models` in the endpoint config
- Go to RunPod Console Settings
- Under Connections, find GitHub and click Connect
- Authorize RunPod to access your repository
- Choose either all repos or specific repos
- Go to RunPod Serverless
- Click New Endpoint
- Under Custom Source, select GitHub Repository
- Select your repository and branch
- Configure:
| Setting | Recommended Value |
|---|---|
| GPU Type | L40 (48GB) or A40 (48GB) |
| Min Workers | 0 (scale to zero when idle) |
| Max Workers | 1-3 (based on expected load) |
| Idle Timeout | 30s (keeps container warm) |
| Execution Timeout | 300s (5 min for long files) |
- Add environment variables if needed:
  - `VAE_CONFIG_PATH`: `/models/stable_audio_2_0_vae.json`
  - `VAE_CKPT_PATH`: `/models/sao_vae_tune_100k_unwrapped.ckpt`
- Click Deploy
RunPod will:
- Clone your repository
- Build the Docker image (10-20 minutes first time)
- Push to RunPod's container registry
- Deploy your endpoint
Once deployed, you'll get:
- Endpoint ID: `abc123xyz`
- API URL: `https://api.runpod.ai/v2/abc123xyz/runsync`
```python
import base64

import numpy as np
import requests

# Load audio file
with open("song.mp3", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

# Call RunPod endpoint
response = requests.post(
    "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync",
    headers={
        "Authorization": f"Bearer {RUNPOD_API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "input": {
            "audio_base64": audio_base64,
            "compute_umap": True,
        }
    },
    timeout=300,
)
result = response.json()

# Decode results
latents = np.frombuffer(
    base64.b64decode(result["output"]["latents"]), dtype=np.float32
).reshape(result["output"]["latents_shape"])

projection = np.frombuffer(
    base64.b64decode(result["output"]["projection"]), dtype=np.float32
).reshape(result["output"]["projection_shape"])

print(f"Latents: {latents.shape}")        # (num_latents, 64)
print(f"Projection: {projection.shape}")  # (num_latents, 3)
```

For files >30 seconds, use async mode:
```python
import time

import requests

# Submit job
response = requests.post(
    "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/run",
    headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    json={"input": {"audio_base64": audio_base64}},
)
job_id = response.json()["id"]

# Poll for result
while True:
    status = requests.get(
        f"https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/status/{job_id}",
        headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
    ).json()
    if status["status"] == "COMPLETED":
        result = status["output"]
        break
    elif status["status"] == "FAILED":
        raise Exception(status.get("error"))
    time.sleep(1)
```

Modify your server.py to optionally offload encoding to RunPod:
```python
import base64
import os

import httpx

USE_RUNPOD = os.environ.get("USE_RUNPOD", "false").lower() == "true"
RUNPOD_ENDPOINT = os.environ.get("RUNPOD_ENDPOINT")
RUNPOD_API_KEY = os.environ.get("RUNPOD_API_KEY")

async def encode_with_runpod(audio_bytes: bytes):
    """Offload encoding to RunPod GPU"""
    async with httpx.AsyncClient(timeout=300) as client:
        response = await client.post(
            f"https://api.runpod.ai/v2/{RUNPOD_ENDPOINT}/runsync",
            headers={"Authorization": f"Bearer {RUNPOD_API_KEY}"},
            json={"input": {"audio_base64": base64.b64encode(audio_bytes).decode()}},
        )
        return response.json()["output"]
```

To achieve 5-20 second cold starts:
- Bake models into image - Avoids download delay
- Pre-load in handler - Model loads before first request
- Keep workers warm - Set idle timeout to 30-60s
- Min workers = 1 - Always have one worker ready (costs more)
RunPod serverless billing:
- L40 GPU: ~$0.76/hr (billed per second of execution)
- Cold start: ~$0.01-0.03 per start (5-20 seconds)
- Encoding: ~$0.005-0.02 per song (15-60 seconds)
For occasional use (a few songs/day), expect $1-5/month.
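A rough sanity check on that monthly figure, assuming the ~$0.76/hr L40 rate with per-second billing (rates are RunPod's published prices at the time of writing and may change; the songs-per-day and seconds-per-song numbers are illustrative):

```python
rate_per_second = 0.76 / 3600    # L40 at ~$0.76/hr, billed per second

songs_per_day = 5                # "a few songs/day"
seconds_per_song = 45            # encode time plus a share of cold starts
monthly_seconds = songs_per_day * 30 * seconds_per_song

monthly_cost = monthly_seconds * rate_per_second
print(f"~${monthly_cost:.2f}/month")
```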
Build fails:
- Check Dockerfile syntax
- Ensure all COPY files exist in the repo
- Build must complete in <160 minutes

Slow cold starts:
- Bake models into the image
- Check model download speed
- Increase the worker idle timeout

Out of memory:
- Use an L40 (48GB) instead of a smaller GPU
- Process shorter audio segments
- Enable gradient checkpointing in the model

Model not found:
- Verify model paths in environment variables
- Check models are in the `/models/` directory
- Ensure Dockerfile COPY commands are correct