fix: cap worker pool to cgroup memory budget #5

Merged
HenryGeorgist merged 1 commit into USACE-Cloud-Compute:main from nghiemv:fix/worker-sizing-oom on Apr 14, 2026
@nghiemv nghiemv commented Apr 14, 2026

The vendored stormhub library defaulted to os.cpu_count()-2 workers. Inside a container, os.cpu_count() reports the host's CPU count, so the worker pool can outgrow the cgroup memory ceiling. On hosts with many CPUs relative to the container's memory budget, this causes OOM-driven BrokenProcessPool failures.
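To make the mismatch concrete, here is a rough back-of-the-envelope sketch. The 64-CPU host is a hypothetical figure, the 3 GB limit is the one used by the reproducer in this PR, and treating 2 GB as the per-worker footprint is an assumption borrowed from the budget this PR uses (actual usage will vary). The helper names are illustrative and do not come from stormhub.

```python
GIB = 1024 ** 3  # assumption: "GB" is treated as binary gigabytes here


def old_default_workers(cpu_count):
    """Old default described above: CPU count minus two."""
    return cpu_count - 2


def estimated_demand_bytes(workers, per_worker_bytes=2 * GIB):
    """Estimated total memory if every worker uses its full budget."""
    return workers * per_worker_bytes


host_cpus = 64            # hypothetical host, as os.cpu_count() would report it
container_limit = 3 * GIB  # cgroup limit from the reproducer

workers = old_default_workers(host_cpus)
demand = estimated_demand_bytes(workers)

print(f"workers={workers}, demand={demand // GIB} GiB, "
      f"limit={container_limit // GIB} GiB, "
      f"oversubscribed={demand > container_limit}")
```

Under these assumptions the pool asks for far more memory than the container is allowed, which is the condition under which the kernel kills workers.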

Resolve num_workers in priority order: the payload attribute, then the CC_NUM_WORKERS environment variable, then the cgroup memory limit divided by 2 GB per worker, and finally 1 as a safe fallback. Plumb the resolved value into new_collection, and translate BrokenProcessPool into an actionable RuntimeError.
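The resolution order above could be sketched roughly as follows. This is not the code from the PR: the function names, the `read_limit` hook, the handling of invalid values, and the "unlimited" threshold are all assumptions made for illustration. The two file paths are the usual cgroup v2 and v1 memory limit locations.

```python
import os
from pathlib import Path

GIB = 1024 ** 3
PER_WORKER_BYTES = 2 * GIB  # 2 GB budget per worker, per the description
UNLIMITED_THRESHOLD = 2 ** 62  # assumption: cgroup v1 reports a huge value when unlimited

# Usual cgroup v2 and v1 locations for the memory limit.
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def read_cgroup_limit_bytes(paths=CGROUP_LIMIT_FILES):
    """Return the cgroup memory limit in bytes, or None if unknown or unlimited."""
    for path in paths:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue
        # cgroup v2 writes the literal "max" when no limit is set.
        if raw.isdigit() and int(raw) < UNLIMITED_THRESHOLD:
            return int(raw)
    return None


def resolve_num_workers(payload_value=None, env=None, read_limit=read_cgroup_limit_bytes):
    """Payload attribute > CC_NUM_WORKERS env > cgroup limit / 2 GB > 1."""
    env = os.environ if env is None else env
    for candidate in (payload_value, env.get("CC_NUM_WORKERS")):
        try:
            workers = int(candidate)
        except (TypeError, ValueError):
            continue  # missing or malformed: fall through to the next source
        if workers >= 1:
            return workers
    limit_bytes = read_limit()
    if limit_bytes:
        return max(1, limit_bytes // PER_WORKER_BYTES)
    return 1  # safe fallback when nothing else is known
```

With the reproducer's 3 GB cap and no explicit override, this sketch yields a single worker, since 3 GB divided by 2 GB rounds down to 1.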

Includes a local reproducer (test/examples/payload-repro.json plus docker-compose.mem-limit.yaml capping the container at 3 GB) that deterministically triggers the original failure without the fix.
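Without the fix, the reproducer surfaces the failure as a bare BrokenProcessPool. The translation into an actionable RuntimeError mentioned in the description might look something like the sketch below. The wrapper name, the `build_collection` callable, and the message wording are hypothetical; only `BrokenProcessPool` from `concurrent.futures.process` is a real standard-library name.

```python
from concurrent.futures.process import BrokenProcessPool


def run_with_actionable_error(build_collection, num_workers):
    """Call build_collection(num_workers) and re-raise pool breakage with a hint.

    build_collection stands in for whatever call ends up driving the
    process pool (for example, a wrapper around new_collection).
    """
    try:
        return build_collection(num_workers)
    except BrokenProcessPool as exc:
        # A worker exiting abruptly is the typical symptom of the kernel
        # OOM killer; chain the original exception for debugging.
        raise RuntimeError(
            f"Worker pool broke with num_workers={num_workers}. A worker was "
            "likely killed for exceeding the container memory limit. "
            "Lower CC_NUM_WORKERS or raise the container memory limit."
        ) from exc
```

Chaining with `from exc` keeps the original traceback available while the top-level message tells the operator which knob to turn.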

@HenryGeorgist HenryGeorgist merged commit 9db8563 into USACE-Cloud-Compute:main Apr 14, 2026
2 checks passed
@nghiemv nghiemv deleted the fix/worker-sizing-oom branch April 15, 2026 16:41