GPU-based AI inference engine running on RunPod Serverless.
Uses Cloudflare R2 as a model cache — download once, load fast on every cold-start.
Request → RunPod Serverless → FastAPI → Model (R2 cache) → Inference → Result to R2
- On each request, the requested model is first looked up in the R2 bucket
- If found, it is downloaded from R2 and loaded from the HF cache `snapshots/` directory
- If not found, it is downloaded from HuggingFace, then backed up to R2
- Inference runs and the output is uploaded to the R2 output bucket
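The cache-lookup flow above can be sketched as a single function. This is a minimal sketch, not the project's actual code: the function name, the `models/{model_id}.tar` key layout, and the three helper callables are assumptions; `s3` stands in for any boto3-style client exposing `head_object`.

```python
def ensure_model_cached(s3, models_bucket, model_id, hf_download, r2_download, r2_upload):
    """Return a local path to the model, preferring the R2 cache.

    hf_download / r2_download / r2_upload are hypothetical stand-ins for
    the engine's real transfer helpers; injecting them keeps the cache
    decision logic testable without network access.
    """
    key = f"models/{model_id}.tar"  # assumed key layout in the models bucket
    try:
        s3.head_object(Bucket=models_bucket, Key=key)  # does the cache hold it?
        return r2_download(models_bucket, key)         # cache hit: load from R2
    except Exception:
        local_path = hf_download(model_id)             # cache miss: pull from HuggingFace
        r2_upload(models_bucket, key, local_path)      # back it up for the next cold start
        return local_path
```

Downloading from R2 on a cache hit is the point of the design: an R2 object fetch is typically much faster than re-resolving and re-downloading a model from the HuggingFace Hub on every cold start.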
| Type | Matching model IDs |
|---|---|
| image-generation | stable-diffusion, flux, sdxl, sd-turbo, wan |
| speech-recognition | whisper |
| text-generation | llama, mistral, gpt2, qwen, gemma |
| text-to-speech | bark, speecht5, mms-tts, tts |
| audio-generation | audioldm |
| music-generation | musicgen |
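The table above implies a substring match on the model ID. A minimal sketch of that dispatch: the keyword lists mirror the table, but the function name and the matching order are assumptions, not the engine's actual implementation.

```python
# Substring keywords per pipeline type, mirroring the table above.
MODEL_TYPE_KEYWORDS = {
    "image-generation": ["stable-diffusion", "flux", "sdxl", "sd-turbo", "wan"],
    "speech-recognition": ["whisper"],
    "text-generation": ["llama", "mistral", "gpt2", "qwen", "gemma"],
    "text-to-speech": ["bark", "speecht5", "mms-tts", "tts"],
    "audio-generation": ["audioldm"],
    "music-generation": ["musicgen"],
}

def detect_model_type(model_id: str) -> str:
    """Map a HuggingFace model ID to a pipeline type via substring match."""
    lowered = model_id.lower()
    for model_type, keywords in MODEL_TYPE_KEYWORDS.items():
        if any(kw in lowered for kw in keywords):
            return model_type
    raise ValueError(f"Unsupported model: {model_id}")
```

Note that with substring matching the iteration order matters: a short generic keyword like `tts` would shadow any later type whose IDs also contain it, so broader keywords should come last within and across lists.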
Endpoint: `POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run`

Image generation:

```json
{
  "input": {
    "model_id": "runwayml/stable-diffusion-v1-5",
    "prompt": "A futuristic city at night",
    "output_image_key": "outputs/result.png"
  }
}
```

Speech recognition:

```json
{
  "input": {
    "model_id": "openai/whisper-base",
    "prompt": "transcribe",
    "input_image_key": "inputs/audio.wav",
    "output_image_key": "outputs/transcript.txt"
  }
}
```

Pre-load a model:

```json
{
  "input": {
    "endpoint": "load-model",
    "body": { "model_id": "runwayml/stable-diffusion-v1-5" }
  }
}
```

Clear VRAM:

```json
{
  "input": { "endpoint": "clear-vram" }
}
```

Example response:

```json
{
  "status": "success",
  "model_type": "image-generation",
  "used_model": "runwayml/stable-diffusion-v1-5",
  "output_key": "outputs/result.png"
}
```

Set these on the RunPod endpoint (no .env file needed):
| Variable | Description |
|---|---|
| R2_ENDPOINT_URL | Cloudflare R2 endpoint URL |
| R2_ACCESS_KEY_ID | R2 access key |
| R2_SECRET_ACCESS_KEY | R2 secret key |
| R2_MODELS_BUCKET | Bucket for model cache |
| R2_INPUT_BUCKET | Bucket for input files |
| R2_OUTPUT_BUCKET | Bucket for output files |
| HF_HUB_TOKEN | HuggingFace token (for private models) |
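With the endpoint deployed, a job can be submitted with a plain HTTP POST. A minimal stdlib sketch: the Bearer-token header and `/run` path follow RunPod's serverless API, but `build_run_request` is a hypothetical helper, not part of this project.

```python
import json
import os
import urllib.request

def build_run_request(endpoint_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /run request for a RunPod endpoint."""
    return urllib.request.Request(
        url=f"https://api.runpod.ai/v2/{endpoint_id}/run",
        data=json.dumps({"input": payload}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",  # RunPod uses Bearer auth
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Example: an image-generation job (submitting it requires real credentials).
req = build_run_request(
    os.environ.get("RUNPOD_ENDPOINT_ID", "demo"),
    os.environ.get("RUNPOD_API_KEY", "demo"),
    {
        "model_id": "runwayml/stable-diffusion-v1-5",
        "prompt": "A futuristic city at night",
        "output_image_key": "outputs/result.png",
    },
)
# urllib.request.urlopen(req) would submit the job.
```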
GitHub Actions automatically builds a Docker image and pushes it to GHCR on every push to `main`:

`git push origin main` → GitHub Actions → `ghcr.io/visgate-ai/deploy-api-inference-engine:latest`

Create a RunPod serverless endpoint using this image URL and configure the environment variables above.
```bash
export RUNPOD_API_KEY=your_key
export RUNPOD_ENDPOINT_ID=your_endpoint_id
python3 inference_test.py
```
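RunPod's `/run` endpoint is asynchronous: it returns a job ID immediately, and the result must be fetched from `GET /v2/{ENDPOINT_ID}/status/{job_id}`. A hypothetical polling loop in that style (the `IN_QUEUE`/`IN_PROGRESS`/`COMPLETED` status values follow RunPod's serverless API; this helper is an illustration, not the contents of `inference_test.py`):

```python
import time

def wait_for_job(fetch_status, poll_interval=2.0, timeout=300.0):
    """Poll a job until it leaves RunPod's in-flight states.

    fetch_status is a callable returning the job's status dict, e.g. a
    wrapper around GET /v2/{ENDPOINT_ID}/status/{job_id}; injecting it
    keeps this sketch testable without network access.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("status") not in ("IN_QUEUE", "IN_PROGRESS"):
            return status  # terminal state: COMPLETED, FAILED, ...
        time.sleep(poll_interval)  # still running; back off before re-polling
    raise TimeoutError("job did not finish in time")
```

On a `COMPLETED` job the response's `output` field carries the handler result, including the `output_key` pointing at the file written to the R2 output bucket.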