deploy-api-inference-engine

GPU-based AI inference engine running on RunPod Serverless.
Uses Cloudflare R2 as a model cache — download once, load fast on every cold-start.


How It Works

Request → RunPod Serverless → FastAPI → Model (R2 cache) → Inference → Result to R2
  1. On each request, the engine first checks the R2 models bucket for the requested model
  2. If found, the model is downloaded from R2 and loaded from the HF cache snapshots/ directory
  3. If not found, it is downloaded from HuggingFace and then backed up to R2
  4. Inference runs and the output is uploaded to the R2 output bucket
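The cache-first flow in steps 1–3 can be sketched as a small function. This is illustrative, not the engine's actual internals: the four callables stand in for real R2 and HuggingFace client calls.

```python
def fetch_model(model_id, in_r2, download_from_r2, download_from_hf, backup_to_r2):
    """Cache-first model fetch (illustrative sketch of steps 1-3 above).

    in_r2, download_from_r2, download_from_hf, and backup_to_r2 are
    hypothetical stand-ins for real R2 / HuggingFace client calls.
    Returns (local_path, source) where source is "r2" or "huggingface".
    """
    if in_r2(model_id):
        # Cache hit: pull the snapshot straight from R2.
        return download_from_r2(model_id), "r2"
    # Cache miss: fetch from HuggingFace, then back the snapshot up to R2
    # so the next cold-start can skip the slow download.
    local_path = download_from_hf(model_id)
    backup_to_r2(model_id, local_path)
    return local_path, "huggingface"
```

Injecting the client calls this way also makes the flow easy to unit-test without touching the network.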

Supported Model Types

| Type | Matching model IDs |
| --- | --- |
| image-generation | stable-diffusion, flux, sdxl, sd-turbo, wan |
| speech-recognition | whisper |
| text-generation | llama, mistral, gpt2, qwen, gemma |
| text-to-speech | bark, speecht5, mms-tts, tts |
| audio-generation | audioldm |
| music-generation | musicgen |

API Usage

Endpoint: POST https://api.runpod.ai/v2/{ENDPOINT_ID}/run

Image / Audio Generation

{
  "input": {
    "model_id": "runwayml/stable-diffusion-v1-5",
    "prompt": "A futuristic city at night",
    "output_image_key": "outputs/result.png"
  }
}
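A request like the one above can be submitted from Python with nothing beyond the standard library. The endpoint ID and API key come from your RunPod dashboard; the helper names below are illustrative.

```python
import json
import urllib.request

def image_generation_payload(model_id: str, prompt: str, output_key: str) -> dict:
    """Build the request body shown above."""
    return {
        "input": {
            "model_id": model_id,
            "prompt": prompt,
            "output_image_key": output_key,
        }
    }

def submit_job(endpoint_id: str, api_key: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST the payload to the async /run endpoint and return the queued job."""
    req = urllib.request.Request(
        f"https://api.runpod.ai/v2/{endpoint_id}/run",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)  # includes the job "id" used for status polling
```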

Speech Recognition

{
  "input": {
    "model_id": "openai/whisper-base",
    "prompt": "transcribe",
    "input_image_key": "inputs/audio.wav",
    "output_image_key": "outputs/transcript.txt"
  }
}

Pre-warm Model (optional)

{
  "input": {
    "endpoint": "load-model",
    "body": { "model_id": "runwayml/stable-diffusion-v1-5" }
  }
}

Clear VRAM

{
  "input": { "endpoint": "clear-vram" }
}

Response Format

{
  "status": "success",
  "model_type": "image-generation",
  "used_model": "runwayml/stable-diffusion-v1-5",
  "output_key": "outputs/result.png"
}
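Because /run is asynchronous, the response above arrives by polling /status/{job_id} until the job reaches a terminal state. A minimal stdlib polling loop is sketched below; the status names follow RunPod's documented job states, so adjust if your endpoint reports others.

```python
import json
import time
import urllib.request

# Terminal job states per RunPod's serverless status API (assumption:
# your endpoint reports this standard set).
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}

def is_terminal(status: str) -> bool:
    """True once a job can no longer change state."""
    return status in TERMINAL_STATES

def poll_job(endpoint_id, api_key, job_id, interval=2.0, timeout=300.0):
    """Poll the status endpoint until the job finishes or the timeout expires."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/status/{job_id}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if is_terminal(job.get("status", "")):
            return job  # job["output"] holds the response format shown above
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```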

Environment Variables

Set these on the RunPod endpoint (no .env file needed):

| Variable | Description |
| --- | --- |
| R2_ENDPOINT_URL | Cloudflare R2 endpoint URL |
| R2_ACCESS_KEY_ID | R2 access key |
| R2_SECRET_ACCESS_KEY | R2 secret key |
| R2_MODELS_BUCKET | Bucket for model cache |
| R2_INPUT_BUCKET | Bucket for input files |
| R2_OUTPUT_BUCKET | Bucket for output files |
| HF_HUB_TOKEN | HuggingFace token (for private models) |
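A handler can fail fast at startup when any required R2 variable is missing. The check below is a sketch under the assumption that all six R2 variables are mandatory (HF_HUB_TOKEN is optional per the table); the engine's actual validation may differ.

```python
import os

# The six R2 variables from the table above; HF_HUB_TOKEN is optional.
R2_REQUIRED_VARS = (
    "R2_ENDPOINT_URL",
    "R2_ACCESS_KEY_ID",
    "R2_SECRET_ACCESS_KEY",
    "R2_MODELS_BUCKET",
    "R2_INPUT_BUCKET",
    "R2_OUTPUT_BUCKET",
)

def missing_r2_vars(env=None):
    """Return the names of required R2 variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in R2_REQUIRED_VARS if not env.get(name)]
```

Since R2 speaks the S3 API, these values are typically passed to an S3-compatible client (for example boto3's `client("s3", endpoint_url=..., aws_access_key_id=..., aws_secret_access_key=...)`).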

Deploy

On every push to main, GitHub Actions builds a Docker image and pushes it to GHCR:

git push origin main  →  GitHub Actions  →  ghcr.io/visgate-ai/deploy-api-inference-engine:latest

Create a RunPod serverless endpoint using this image URL and configure the environment variables above.


Local Test

export RUNPOD_API_KEY=your_key
export RUNPOD_ENDPOINT_ID=your_endpoint_id
python3 inference_test.py
