deploy-api-inference-engine

GPU-based AI inference engine running on RunPod Serverless.
Uses Cloudflare R2 as a model cache — download once, load fast on every cold-start.


How It Works

Request → RunPod Serverless → FastAPI → Model (R2 cache) → Inference → Result to R2
  1. On each request, the engine first checks the R2 models bucket for the requested model
  2. If found, the model is downloaded from R2 and loaded from the HF cache snapshots/ directory
  3. If not found, it is downloaded from HuggingFace and then backed up to R2
  4. Inference runs and the output is uploaded to the R2 output bucket
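The cache-first flow in steps 1–3 can be sketched as a small function. This is illustrative, not the engine's actual internals: the four callables stand in for real R2 and HuggingFace client calls.

```python
def fetch_model(model_id, in_r2, download_from_r2, download_from_hf, backup_to_r2):
    """Cache-first model fetch (illustrative sketch of steps 1-3 above).

    in_r2, download_from_r2, download_from_hf, and backup_to_r2 are
    hypothetical stand-ins for real R2 / HuggingFace client calls.
    Returns (local_path, source) where source is "r2" or "huggingface".
    """
    if in_r2(model_id):
        # Cache hit: pull the snapshot straight from R2.
        return download_from_r2(model_id), "r2"
    # Cache miss: fetch from HuggingFace, then back the snapshot up to R2
    # so the next cold-start can skip the slow download.
    local_path = download_from_hf(model_id)
    backup_to_r2(model_id, local_path)
    return local_path, "huggingface"
```

Injecting the client calls this way also makes the flow easy to unit-test without touching the network.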

Supported Model Types

| Type | Matching model IDs |
| --- | --- |
| image-generation | stable-diffusion, flux, sdxl, sd-turbo, wan |
| speech-recognition | whisper |
| text-generation | llama, mistral, gpt2, qwen, gemma |
| text-to-speech | bark, speecht5, mms-tts, tts |
| audio-generation | audioldm |
| music-generation | musicgen |

API Usage

Endpoint: POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run

Image / Audio Generation

{
  "input": {
    "model_id": "runwayml/stable-diffusion-v1-5",
    "prompt": "A futuristic city at night",
    "output_image_key": "outputs/result.png"
  }
}
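A request like the one above can be submitted from Python with nothing beyond the standard library. The endpoint ID and API key come from your RunPod dashboard; the helper names below are illustrative.

```python
import json
import urllib.request

def image_generation_payload(model_id: str, prompt: str, output_key: str) -> dict:
    """Build the request body shown above."""
    return {
        "input": {
            "model_id": model_id,
            "prompt": prompt,
            "output_image_key": output_key,
        }
    }

def submit_job(endpoint_id: str, api_key: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST the payload to the async /run endpoint and return the queued job."""
    req = urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/run",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)  # includes the job "id" used for status polling
```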

Speech Recognition

{
  "input": {
    "model_id": "openai/whisper-base",
    "prompt": "transcribe",
    "input_image_key": "inputs/audio.wav",
    "output_image_key": "outputs/transcript.txt"
  }
}

Pre-warm Model (optional)

{
  "input": {
    "endpoint": "load-model",
    "body": { "model_id": "runwayml/stable-diffusion-v1-5" }
  }
}

Clear VRAM

{
  "input": { "endpoint": "clear-vram" }
}

Response Format

{
  "status": "success",
  "model_type": "image-generation",
  "used_model": "runwayml/stable-diffusion-v1-5",
  "output_key": "outputs/result.png"
}
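Because /run is asynchronous, the response above arrives by polling /status/{job_id} until the job reaches a terminal state. A minimal stdlib polling loop is sketched below; the status names follow RunPod's documented job states, so adjust if your endpoint reports others.

```python
import json
import time
import urllib.request

# Terminal job states per RunPod's serverless status API (assumption:
# your endpoint reports this standard set).
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}

def is_terminal(status: str) -> bool:
    """True once a job can no longer change state."""
    return status in TERMINAL_STATES

def poll_job(endpoint_id, api_key, job_id, interval=2.0, timeout=300.0):
    """Poll the status endpoint until the job finishes or the timeout expires."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if is_terminal(job.get("status", "")):
            return job  # job["output"] holds the response format shown above
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```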

Environment Variables

Set these on the RunPod endpoint (no .env file needed):

| Variable | Description |
| --- | --- |
| R2_ENDPOINT_URL | Cloudflare R2 endpoint URL |
| R2_ACCESS_KEY_ID | R2 access key |
| R2_SECRET_ACCESS_KEY | R2 secret key |
| R2_MODELS_BUCKET | Bucket for model cache |
| R2_INPUT_BUCKET | Bucket for input files |
| R2_OUTPUT_BUCKET | Bucket for output files |
| HF_HUB_TOKEN | HuggingFace token (for private models) |
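A handler can fail fast at startup when any required R2 variable is missing. The check below is a sketch under the assumption that all six R2 variables are mandatory (HF_HUB_TOKEN is optional per the table); the engine's actual validation may differ.

```python
import os

# The six R2 variables from the table above; HF_HUB_TOKEN is optional.
R2_REQUIRED_VARS = (
    "R2_ENDPOINT_URL",
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_MODELS_BUCKET",
    "R2_INPUT_BUCKET",
    "R2_OUTPUT_BUCKET",
)

def missing_r2_vars(env=None):
    """Return the names of required R2 variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in R2_REQUIRED_VARS if not env.get(name)]
```

Since R2 speaks the S3 API, these values are typically passed to an S3-compatible client (for example boto3's `client("s3", endpoint_url=..., aws_access_key_id=..., aws_secret_access_key=...)`).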

Deploy

On every push to main, GitHub Actions builds a Docker image and pushes it to GHCR:

git push origin main  →  GitHub Actions  →  ghcr.io/visgate-ai/deploy-api-inference-engine:latest

Create a RunPod serverless endpoint using this image URL and configure the environment variables above.


Local Test

export RUNPOD_API_KEY=your_key
export RUNPOD_ENDPOINT_ID=your_endpoint_id
python3 inference_test.py
