`deploy.sh` sets `--concurrency 80` (80 simultaneous requests per instance). A single ONNX Runtime GPU session cannot run 80 inferences in parallel — concurrent calls on one session effectively serialize on the device, so with `--max-instances 1` all 80 in-flight requests queue behind one ONNX session, causing latency spikes or timeouts under real load.
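For the interim fix, the cap can be lowered at deploy time. A sketch of the relevant `gcloud run deploy` flags — the service name, image variable, and region below are placeholders, not values taken from `deploy.sh`:

```shell
# Placeholder service name/image/region; only the two flags at the end
# are the point. --concurrency 4 caps in-flight requests per instance
# at a level one GPU session can plausibly absorb until benchmarked.
gcloud run deploy inference-svc \
  --image "$IMAGE" \
  --region us-central1 \
  --max-instances 1 \
  --concurrency 4
```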
Options:
- Lower concurrency to match actual ONNX throughput (likely 1–4 for GPU inference) until benchmarked
- Create a pool of ONNX sessions (one per tokio worker thread) to allow parallel inference
- Benchmark with `wrk` or `hey` against the live Cloud Run instance to find the safe ceiling before deciding