infra: Cloud Run concurrency 80 unsafe with single ONNX GPU session #6

@kordless

Description

`deploy.sh` sets `--concurrency 80`, which allows 80 simultaneous requests per instance. A single ONNX Runtime GPU session does not run those requests in parallel: concurrent inference calls either serialize on a mutex or produce errors. With `--max-instances 1`, all 80 concurrent requests share one ONNX session, so real load causes latency spikes or failures.

Options:

  • Lower concurrency to match actual ONNX throughput (likely 1–4 for GPU inference) until benchmarked
  • Create a pool of ONNX sessions (one per tokio worker thread) to allow parallel inference
  • Benchmark with `wrk` or `hey` against the live Cloud Run instance to find the safe ceiling before deciding
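
For the session-pool option, here is a minimal sketch of the idea, not a drop-in implementation. `FakeSession`, `SessionPool`, and `with_session` are hypothetical names, and `FakeSession` is a stand-in that does not use the real ONNX Runtime bindings. It uses plain `std` threads rather than tokio so it runs without dependencies; an async version would need a non-blocking checkout instead of a `Condvar`.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical stand-in for an inference session (NOT a real ONNX Runtime type).
struct FakeSession {
    calls: u32,
}

impl FakeSession {
    /// Pretend inference: needs exclusive (&mut) access, like a non-shareable session.
    fn run(&mut self, input: u32) -> u32 {
        self.calls += 1;
        input * 2
    }
}

/// Fixed-size pool: at most `sessions.len()` inferences run at once;
/// further callers block until a session is returned.
struct SessionPool<S> {
    idle: Mutex<Vec<S>>,
    available: Condvar,
}

impl<S> SessionPool<S> {
    fn new(sessions: Vec<S>) -> Self {
        Self { idle: Mutex::new(sessions), available: Condvar::new() }
    }

    /// Check out a session, run `f` with exclusive access, then put it back.
    /// Note: if `f` panics, the session is not returned (acceptable for a sketch).
    fn with_session<R>(&self, f: impl FnOnce(&mut S) -> R) -> R {
        let mut session = {
            let mut idle = self.idle.lock().unwrap();
            loop {
                if let Some(s) = idle.pop() {
                    break s;
                }
                // Releases the lock while waiting, reacquires it on wake-up.
                idle = self.available.wait(idle).unwrap();
            }
        };
        let result = f(&mut session); // pool lock is NOT held during inference
        self.idle.lock().unwrap().push(session);
        self.available.notify_one();
        result
    }
}

fn main() {
    // 4 sessions serving 80 concurrent callers (numbers are illustrative only).
    let sessions: Vec<FakeSession> = (0..4).map(|_| FakeSession { calls: 0 }).collect();
    let pool = Arc::new(SessionPool::new(sessions));

    let handles: Vec<_> = (0..80u32)
        .map(|i| {
            let pool = Arc::clone(&pool);
            thread::spawn(move || pool.with_session(|s| s.run(i)))
        })
        .collect();

    let checksum: u32 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    let calls: u32 = pool.idle.lock().unwrap().iter().map(|s| s.calls).sum();
    println!("checksum = {checksum}, total calls = {calls}");
}
```

The pool size then becomes the real concurrency ceiling, so `--concurrency` could be set to match it once the benchmark shows how many sessions the GPU can actually serve in parallel.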
