GitHub API failures: "dial tcp 140.82.112.6:443: i/o timeout" #215
Description
Recently we've seen a handful of CI failures that look like they come from the GitHub API.
They span many different GitHub API operations, and not all of them are in code we control.
failed to get run: Get "https://api.github.com/repos/rapidsai/ucxx/actions/runs/17744054101?exclude_pull_requests=true": dial tcp 140.82.113.6:443: i/o timeout
jq: parse error: Invalid numeric literal at line 2, column 2
(Sep 15, 2025 - rapidsai/ucxx - conda-cpp-build - "C++ build" stage)
failed to get run: Get "https://api.github.com/repos/rapidsai/ucxx/actions/runs/17744054101?exclude_pull_requests=true": dial tcp 140.82.114.6:443: i/o timeout
jq: parse error: Invalid numeric literal at line 2, column 2
failed to get run: Get "https://api.github.com/repos/rapidsai/ucxx/actions/runs/17744054101?exclude_pull_requests=true": dial tcp 140.82.113.5:443: i/o timeout
Error: Process completed with exit code 5.
(Sep 15, 2025 - rapidsai/ucxx - conda-cpp-build - "C++ build" stage)
Run if ! type gh >/dev/null; then
Get "https://api.github.com/rate_limit": dial tcp 140.82.113.6:443: i/o timeout
Error: Process completed with exit code 1.
(Sep 15, 2025 - rapidsai/ucxx - wheel-build-ucxx - "Check GitHub API Rate Limits" stage)
Download action repository 'aws-actions/configure-aws-credentials@7474bc4690e29a8392af63c5b98e7449536d5c3a' (SHA:7474bc4690e29a8392af63c5b98e7449536d5c3a)
Warning: Failed to download action 'https://api.github.com/repos/aws-actions/configure-aws-credentials/tarball/7474bc4690e29a8392af63c5b98e7449536d5c3a'. Error: The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
Warning: Back off 18.513 seconds before retry.
Warning: Failed to download action 'https://api.github.com/repos/aws-actions/configure-aws-credentials/tarball/7474bc4690e29a8392af63c5b98e7449536d5c3a'. Error: The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
Warning: Back off 16.196 seconds before retry.
Error: Action 'https://api.github.com/repos/aws-actions/configure-aws-credentials/tarball/7474bc4690e29a8392af63c5b98e7449536d5c3a' download has timed out. Error: The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing.
(Sep 15, 2025 - rapidsai/ucxx - wheel-build-distributed-ucxx - "Set up Job" stage)
failed to get run: Get "https://api.github.com/repos/rapidsai/rapidsmpf/actions/runs/18000623832?exclude_pull_requests=true": dial tcp 140.82.112.6:443: i/o timeout
jq: parse error: Invalid numeric literal at line 2, column 2
(Sep 25, 2025 - rapidsai/rapidsmpf - conda-python-build - "Build Python" stage)
Opening this to track possible remediations.
Reproducible Example
Hard to reproduce... these failures show up at random.
So far, a re-run a few hours later always seems to resolve them.
Notes
Cases like the "Set up Job" stage failing suggest that some of the failures are upstream of RAPIDS-controlled code... either networking issues with NVIDIA's self-hosted runners or something on GitHub's side (like synchronization issues between load balancers and back-end servers).
Opening this here because some possible workarounds might involve changes to gha-tools scripts (e.g. more / longer retrying, fewer overall network calls).
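To make the "more / longer retrying" idea concrete, here is a minimal sketch of a retry wrapper with exponential backoff. It is not existing gha-tools code: the function name `retry_with_backoff` and the `RETRY_MAX_ATTEMPTS` / `RETRY_BASE_DELAY` variables are hypothetical, and the default values are placeholders that would need tuning.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT part of gha-tools: re-run a command until it
# succeeds, doubling the wait between attempts.
#
# Assumed (made-up) knobs:
#   RETRY_MAX_ATTEMPTS  total attempts before giving up (default: 5)
#   RETRY_BASE_DELAY    whole seconds to wait after the first failure (default: 2)
retry_with_backoff() {
  local max_attempts="${RETRY_MAX_ATTEMPTS:-5}"
  local delay="${RETRY_BASE_DELAY:-2}"
  local attempt=1
  local status=0

  while true; do
    # Run the wrapped command; on failure, capture its exit code.
    if "$@"; then
      return 0
    else
      status=$?
    fi

    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "retry_with_backoff: giving up after ${attempt} attempts: $*" >&2
      return "${status}"
    fi

    echo "retry_with_backoff: attempt ${attempt} failed (exit ${status}), retrying in ${delay}s" >&2
    sleep "${delay}"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

A call such as `retry_with_backoff gh api rate_limit` would then tolerate a short run of `i/o timeout` errors. This would not help with the "Set up Job" failures, which happen before any script we control runs, and with the defaults above it only covers outages of under a minute, not ones lasting hours.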