`deploy.sh` sets `--concurrency 80` (80 simultaneous requests per instance). A single ONNX Runtime GPU session cannot run 80 inferences in parallel — concurrent calls on one session effectively serialize on the device, so with `--max-instances 1` all 80 in-flight requests queue behind one ONNX session, causing latency spikes or timeouts under real load.
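For the interim fix, the cap can be lowered at deploy time. A sketch of the relevant `gcloud run deploy` flags — the service name, image variable, and region below are placeholders, not values taken from `deploy.sh`:

```shell
# Placeholder service name/image/region; only the two flags at the end
# are the point. --concurrency 4 caps in-flight requests per instance
# at a level one GPU session can plausibly absorb until benchmarked.
gcloud run deploy inference-svc \
  --image "$IMAGE" \
  --region us-central1 \
  --max-instances 1 \
  --concurrency 4
```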
Options:
- Lower concurrency to match actual ONNX throughput (likely 1–4 for GPU inference) until benchmarked
- Create a pool of ONNX sessions (one per tokio worker thread) to allow parallel inference
- Benchmark with `wrk` or `hey` against the live Cloud Run instance to find the safe ceiling before deciding